Automated String Processing in Spreadsheets: Innovations and Applications

 
 
Automating String Processing in Spreadsheets
using Input-Output Examples
 
Sumit Gulwani
(sumitg@microsoft.com)
Microsoft Research, Redmond
1
Automated Program Synthesis
 
Deserves renewed interest today!
 
Natural goal given that computing has become accessible, but
most people are not expert programmers.
 
Enabling technology is now available
Better search techniques
AI style search techniques.
logical reasoning based techniques (SAT/SMT solvers).
Faster machines (good application for multi-cores)
 
 
State of the art: We can synthesize 10-20 lines of code.
 
Our techniques can synthesize a wide variety of
algorithms/programs from logic and examples.
 
Undergraduate book algorithms 
(e.g., sorting, dynamic prog)
[Srivastava/Gulwani/Foster, POPL 2010]
 
 
Program Inverses 
(e.g, deserializers from serializers)
[Srivastava/Gulwani/Chaudhuri/Foster, MSR-TR-2010-34]
 
Graph Algorithms 
(e.g., bi-partiteness check)
[Itzhaky/Gulwani/Immerman/Sagiv, OOPSLA 2010]
 
 
Bit-vector algorithms 
(e.g., turn-off rightmost one bit)
[Jha/Gulwani/Seshia/Tiwari, ICSE 2010]
 
2
 
Recent Success in Program Synthesis
End-Users
Algorithm
Designers
Software Developers
Most Useful
Target
Potential Consumers of Synthesis Technology
Pyramid of Technology Users
 
Demo
 
4
 
 
Language of String Programs
 
Synthesis Algorithm
 
Ranking Strategy
 
Limitations
 
5
 
Outline
 
Guarded Expression G  := 
Switch((b
1
,e
1
), …, (b
n
,e
n
))
 
String Expression e  := 
Concatenate(f
1
, …, 
f
n
)
 
Base Expression f  :=  
s 
  // Constant String
                               | SubStr(v
i
, p
1
, p
2
)
 
Index Expression p  := 
k
  // Constant Integer
                                 | Pos(r
1
, r
2
, k
) 
// k
th
 position in string
  
                                       whose left/right side
     
         matches with 
r
1
/
r
2
 
 
Notation: 
SubStr2(v
i
,r,k)
 
´
 
SubsStr(v
i
,Pos(
²
,r,k),Pos(r,
²
,k))
Denotes k
th
 occurrence of regular expression r in v
i
 
6
 
Language for Constructing Output Strings
7
Example
 
Switch((
b
1
, e
1
), (
b
2
, e
2
))
, where
b
1
 
´
 
Match(v
1
,NumTok,3),      
b
2 
´
 
:
Match(v
1
,NumTok,3),
e
1
 
´
 
Concatenate(SubStr2(v
1
,NumTok,1), ConstStr(“-”),
  
        SubStr2(v
1
,NumTok,2), ConstStr(“-”),
  
        SubStr2(v
1
,NumTok,3))
e
2
 
´
 Concatenate(ConstStr(“425-”),
SubStr2(v
1
,NumTok,1),
 
                   ConstStr(“-”),
SubStr2(v
1
,NumTok,2))
Format phone numbers
 
 
Language of String Programs
 
Synthesis Algorithm
 
Ranking Strategy
 
Limitations
 
8
 
Outline
 
Reduction requires computing 
all
 solutions for each of the
sub-problems:
This also allows to rank various solutions and select the
highest ranked solution at the top-level.
A challenge here is to efficiently represent, compute, and
manipulate huge number of such solutions.
 
I will show three applications of this idea in the talk.
Read the paper for more tricks!
 
9
 
Key Synthesis Idea: Divide and Conquer
10
Synthesizing Guarded Expression
 
Goal:
 Given input-output pairs: (
i
1
,
o
1
), (
i
2
,
o
2
), (
i
3
,
o
3
), (
i
4
,
o
4
), find
P such that 
P(i
1
)=
o
1
, 
P(i
2
)=
o
2
, 
P(i
3
)=
o
3
, 
P(i
4
)=
o
4
.
 
 
 
 
 
 
Algorithm:
1. 
Learn set 
S
1
 of string expressions s.t. 
8
e in
 
S
1
, [
[e]] i
1
 = 
o
1
.
Similarly compute 
S
2
, 
S
3
, 
S
4
. Let S = S
1 
Å
S
2 
Å
S
3 
Å
S
4
.
 
2(a) If S ≠ 
;
 then result is 
Switch((true,S))
.
11
Example: Various choices for a String Expression
Input
Output
 
Constant
 
Constant
 
Constant
 
Number of all possible string expressions (that can
construct a given output string 
o
1
 from a given input
string 
i
1
) is 
exponential
 in size of output string.
 
 
 
 
 
# of substrings is just 
quadratic
 in size of output string!
We use a DAG based data-structure, and it supports
efficient intersection operation!
 
12
 
Synthesizing String Expressions
 
Various ways to extract “706” from “425-706-7709”:
 
Chars after 1
st
 hyphen and before 2
nd
 hyphen.
     Substr(v
1
, Pos(HyphenTok,
²
,1), Pos(
²
,HyphenTok,2))
 
Chars from 2
nd
 number and up to 2
nd
 number.
    Substr(v
1
, Pos(
²
,NumTok,2), Pos(NumTok,
²
,2))
 
Chars from 2
nd
 number and before 2
nd
 hyphen.
    Substr(v
1
, Pos(
²
,NumTok,2), Pos(
²
,HyphenTok,2))
 
Chars from 1
st
 hyphen and up to 2
nd
 number.
    
Substr(v
1
, Pos(HyphenTok,
²
,1), Pos(
²
,HyphenTok,2))
 
 
13
Example: Various choices for a SubStr Expression
 
The number of 
SubStr(v,p
1
,
p
2
) expressions that can
extract a given substring w from a given string v can
be large!
 
 
 
 
This allows for representing and computing 
O(n
1
*n
2
)
choices for SubStr using size/time 
O(n
1
+n
2
).
 
14
 
Synthesizing SubStr Expressions
15
Back to Synthesizing Guarded Expression
 
Goal: 
Given input-output pairs: (
i
1
,
o
1
), (
i
2
,
o
2
), (
i
3
,
o
3
), (
i
4
,
o
4
), find
P such that 
P(i
1
)=
o
1
, 
P(i
2
)=
o
2
, 
P(i
3
)=
o
3
, 
P(i
4
)=
o
4
.
 
Algorithm:
1.
Learn set 
S
1
 of string expressions s.t. 
8
e in
 
S
1
, 
[[e]] i
1
 = 
o
1
.
Similarly compute 
S
2
, 
S
3
, 
S
4
. Let S = 
S
1 
Å
S
2 
Å
S
3 
Å
S
4
.
2(a). If S ≠ 
;
 then result is 
Switch((true,S))
.
2(b). Else find a smallest partition, say {
S
1
,
S
2
}, {
S
3
,
S
4
}, s.t.
 
S
1
 
Å
S
2
;
  and 
S
3
 
Å
S
4
;
.
 
3
. Learn boolean formulas 
b
1
, 
b
2
 s.t.
b
1
 maps 
i
1
, 
i
2
 to true and 
i
3
, 
i
4
 to false.
b
2
 maps 
i
3
, 
i
4
 to true and 
i
1
, 
i
2
 to false.
 
4. Result is: 
Switch((
b
1
,
S
1
 
Å
S
2
), (b
2
,S
3
 
Å
S
4
))
 
 
Language of String Programs
 
Synthesis Algorithm
 
Ranking Strategy
 
Limitations
 
16
 
Outline
 
Prefer shorter programs.
Fewer number of conditionals.
Shorter string expression, regular expressions.
 
Prefer programs with less number of constants.
 
 
17
 
Ranking Strategy
 
 
Language of String Programs
 
Synthesis Algorithm
 
Ranking Strategy
 
Limitations
 
18
 
Outline
 
This paper: 
Syntactic
 Manipulation of 
strings
 
Extension 1: 
Semantic
 Manipulation of strings
Joint work with intern Rishabh Singh (MIT)
 
Extension 2: Layout Manipulation of 
tables
Joint work with intern Bill Harris (UW-Madison)
 
19
 
Limitations and Follow-up Work
 
Demo
 
20
 
Problem
: End-user Programming
 
Solution
: Program Synthesis with 
inter-disciplinary
 inspirations
Programming Languages
Design of an 
expressive
 language that can 
succinctly
 
represent string computations and is 
amenable to learning
.
Machine Learning
Version space algebra for learning straight-line code.
Boolean classification technique for learning control flow.
HCI
Input-output based interaction model
Several usability features: Ranking scheme, Feedback to user,
   
                  Quick Convergence, Noise tolerance.
 
21
 
Conclusion
Slide Note
Embed
Share

Automating string processing in spreadsheets is gaining traction due to advancements in program synthesis technology. This field enables the generation of algorithms and programs from logic and examples, benefitting algorithm designers, software developers, and end-users alike. Synthesis techniques cover a wide range of applications, from sorting algorithms to graph and bit-vector algorithms, showcasing the potential for creating efficient and tailored solutions. The language of string programs, synthesis algorithm ranking strategies, and limitations are key aspects explored in this domain, highlighting the growing interest and relevance of automated program synthesis in the current technological landscape.

  • String processing
  • Spreadsheets
  • Automated technology
  • Program synthesis
  • Algorithm design

Uploaded on Sep 26, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Automating String Processing in Spreadsheets using Input-Output Examples Sumit Gulwani (sumitg@microsoft.com) Microsoft Research, Redmond

  2. Automated Program Synthesis Deserves renewed interest today! Natural goal given that computing has become accessible, but most people are not expert programmers. Enabling technology is now available Better search techniques AI style search techniques. logical reasoning based techniques (SAT/SMT solvers). Faster machines (good application for multi-cores) State of the art: We can synthesize 10-20 lines of code. 1

  3. Recent Success in Program Synthesis Our techniques can synthesize a wide variety of algorithms/programs from logic and examples. Undergraduate book algorithms (e.g., sorting, dynamic prog) [Srivastava/Gulwani/Foster, POPL 2010] Program Inverses (e.g, deserializers from serializers) [Srivastava/Gulwani/Chaudhuri/Foster, MSR-TR-2010-34] Graph Algorithms (e.g., bi-partiteness check) [Itzhaky/Gulwani/Immerman/Sagiv, OOPSLA 2010] Bit-vector algorithms (e.g., turn-off rightmost one bit) [Jha/Gulwani/Seshia/Tiwari, ICSE 2010] 2

  4. Potential Consumers of Synthesis Technology Algorithm Designers Software Developers Most Useful Target End-Users Pyramid of Technology Users

  5. Demo 4

  6. Outline Language of String Programs Synthesis Algorithm Ranking Strategy Limitations 5

  7. Language for Constructing Output Strings Guarded Expression G := Switch((b1,e1), , (bn,en)) String Expression e := Concatenate(f1, , fn) Base Expression f := s // Constant String | SubStr(vi, p1, p2) Index Expression p := k // Constant Integer | Pos(r1, r2, k) // kth position in string whose left/right side matches with r1/r2 Notation: SubStr2(vi,r,k) SubsStr(vi,Pos( ,r,k),Pos(r, ,k)) Denotes kth occurrence of regular expression r in vi 6

  8. Example Format phone numbers Input v1 (425)-706-7709 510.220.5586 235 7654 745-8139 Output 425-706-7709 510-220-5586 425-235-7654 425-745-8139 Switch((b1, e1), (b2, e2)), where b1 Match(v1,NumTok,3), b2 :Match(v1,NumTok,3), e1 Concatenate(SubStr2(v1,NumTok,1), ConstStr( - ), SubStr2(v1,NumTok,2), ConstStr( - ), SubStr2(v1,NumTok,3)) e2 Concatenate(ConstStr( 425- ),SubStr2(v1,NumTok,1), ConstStr( - ),SubStr2(v1,NumTok,2)) 7

  9. Outline Language of String Programs Synthesis Algorithm Ranking Strategy Limitations 8

  10. Key Synthesis Idea: Divide and Conquer Reduce the problem of synthesizing expressions into sub-problems of synthesizing sub-expressions. Reduction requires computing all solutions for each of the sub-problems: This also allows to rank various solutions and select the highest ranked solution at the top-level. A challenge here is to efficiently represent, compute, and manipulate huge number of such solutions. I will show three applications of this idea in the talk. Read the paper for more tricks! 9

  11. Synthesizing Guarded Expression Goal: Given input-output pairs: (i1,o1), (i2,o2), (i3,o3), (i4,o4), find P such that P(i1)=o1, P(i2)=o2, P(i3)=o3, P(i4)=o4. Application #1: We reduce the problem of learning guarded expression P to the problem of learning string expressions for each input-output pair. Algorithm: 1. Learn set S1 of string expressions s.t. 8e inS1, [[e]] i1 = o1. Similarly compute S2, S3, S4. Let S = S1 S2 S3 S4. 2(a) If S ; then result is Switch((true,S)). 10

  12. Example: Various choices for a String Expression Input Output Constant Constant Constant 11

  13. Synthesizing String Expressions Number of all possible string expressions (that can construct a given output string o1 from a given input string i1) is exponential in size of output string. Application #2: To represent/learn all string expressions, it suffices to represent/learn all base expressions for each substring of the output. # of substrings is just quadratic in size of output string! We use a DAG based data-structure, and it supports efficient intersection operation! 12

  14. Example: Various choices for a SubStr Expression Various ways to extract 706 from 425-706-7709 : Chars after 1st hyphen and before 2nd hyphen. Substr(v1, Pos(HyphenTok, ,1), Pos( ,HyphenTok,2)) Chars from 2nd number and up to 2nd number. Substr(v1, Pos( ,NumTok,2), Pos(NumTok, ,2)) Chars from 2nd number and before 2nd hyphen. Substr(v1, Pos( ,NumTok,2), Pos( ,HyphenTok,2)) Chars from 1st hyphen and up to 2nd number. Substr(v1, Pos(HyphenTok, ,1), Pos( ,HyphenTok,2)) 13

  15. Synthesizing SubStr Expressions The number of SubStr(v,p1,p2) expressions that can extract a given substring w from a given string v can be large! Application #3: To represent/learn all SubStr expressions, we can independently represent/learn all choices for each of the two index expressions. This allows for representing and computing O(n1*n2) choices for SubStr using size/time O(n1+n2). 14

  16. Back to Synthesizing Guarded Expression Goal: Given input-output pairs: (i1,o1), (i2,o2), (i3,o3), (i4,o4), find P such that P(i1)=o1, P(i2)=o2, P(i3)=o3, P(i4)=o4. Algorithm: 1. Learn set S1 of string expressions s.t. 8e inS1, [[e]] i1 = o1. Similarly compute S2, S3, S4. Let S = S1 S2 S3 S4. 2(a). If S ; then result is Switch((true,S)). 2(b). Else find a smallest partition, say {S1,S2}, {S3,S4}, s.t. S1 S2 ; and S3 S4 ;. 3. Learn boolean formulas b1, b2 s.t. b1 maps i1, i2 to true and i3, i4 to false. b2 maps i3, i4 to true and i1, i2 to false. 4. Result is: Switch((b1,S1 S2), (b2,S3 S4)) 15

  17. Outline Language of String Programs Synthesis Algorithm Ranking Strategy Limitations 16

  18. Ranking Strategy Prefer shorter programs. Fewer number of conditionals. Shorter string expression, regular expressions. Prefer programs with less number of constants. 17

  19. Outline Language of String Programs Synthesis Algorithm Ranking Strategy Limitations 18

  20. Limitations and Follow-up Work This paper: Syntactic Manipulation of strings Extension 1: Semantic Manipulation of strings Joint work with intern Rishabh Singh (MIT) Extension 2: Layout Manipulation of tables Joint work with intern Bill Harris (UW-Madison) 19

  21. Demo 20

  22. Conclusion Problem: End-user Programming Solution: Program Synthesis with inter-disciplinary inspirations Programming Languages Design of an expressive language that can succinctly represent string computations and is amenable to learning. Machine Learning Version space algebra for learning straight-line code. Boolean classification technique for learning control flow. HCI Input-output based interaction model Several usability features: Ranking scheme, Feedback to user, Quick Convergence, Noise tolerance. 21

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#