Wrangler: Interactive Visual Specification of Data Transformation

Wrangler: Interactive
Visual Specification of
Data Transformation
Scripts
Sean Kandel , Andreas Paepcke , Joseph
Hellerstein and Jeffrey Heer
Presented by 
You Jen Lin
Before analysis, data must be “wrangled” into a
usable form.
Data wrangling: restructure data, identifying
and correcting erroneous/missing values,
combining data sources.
Introduction
Data analysts still spend a lot of time
wrangling data in preparation for analysis.
Transforms for restructuring/validating data
can be complicated to specify in scripts.
e.g. Regular expressions to split strings, validate
data format.
Reuse and revision of transforms for data
updates and changing schemas not possible
with scripts or manual editing.
Data wrangling should also output a record of
transforms used.
Motivation
Underlying declarative transformation language
based on Potter’s Wheel.
Supports six classes of transforms.
Interactive user interface that simplifies the
way users can specify and apply transforms to
data.
Natural language descriptions of transforms.
Visual previews of transform effects on data.
Interactive history viewer
Inference engine suggests applicable
transforms based on user interaction with data.
Contributions
Transformation language
Extension of Potter’s Wheel.
Added operators for data cleaning:
Positional operators
Aggregation
Reshaping
Added conditional mapping operators.
e.g. Update country to “U.S.” where
state=“California”.
General structure
Eight classes of transforms.
Map
: map one input data row to zero or more
output rows.
Delete
 (1-to-0): Takes as predicate a row to remove.
Extract, cut, update
 (1-to-1): Process values within a
column.
Split
 (1-to-many): Split text on some delimiter to
yield multiple rows.
Classes of transforms
Lookups and joins
: lookup or join data from
external tables.
Wrangler provides lookup tables for common
transformations, e.g. mapping zip codes to states.
Equi-join
: join on some attribute of two tuples from
different tables if their values are the same.
Approximate join
: join if the values of a common
attribute of two tuples is within some string edit
distance.
Classes of transforms (cont.)
Reshape
: manipulate table structure and
schema.
Fold
: collapse multiple columns into fewer columns
containing key-value sets rather than individual
values.
Unfold
: creates new column headers from data
values that could be stored in multiple columns.
Classes of transforms (cont.)
Fold transform
Unfold transform
Positional
: modifies the values or their position
based on their location in the table.
Fill
: generate values based on neighboring values.
Lag
: shift column values up or down by a specified
number of rows.
Sorting
Aggregation
 (sum, min, max, mean, etc.)
Key generation
 (skolemization): retrieval of
keys for new records.
Schema transforms
:
Set column name.
Specify column data type.
Assign semantic role.
Classes of transforms (cont.)
Standard data types: primitives and their
corresponding parsing functions.
Integers, strings, numeric.
Higher-level 
semantic roles
: additional parsing
functions, transformation functions mapping
related roles, e.g.
Semantic role: zip code
Parsing function: check for 5 digits, valid
Mapping: return the state it is in
Data types and semantic roles
Interface design
Direct manipulation of and interaction with the
data.
Automatic suggestion of possible transforms.
Inference engine that bases suggestions of user
input and data type/semantic role.
Menu-based transform selection.
Manual editing of transform parameters.
Components
The Wrangler interface
Wrangler supports six basic interactions
between the user and the data table.
1.
Select row.
2.
Select column.
3.
Click bars in the data quality meter.
4.
Select text in a cell.
5.
Edit a value in the table.
6.
Assign column name, data type, or semantic role.
Basic interactions
Users can choose transforms from the menu to
apply.
Users can also edit transform descriptions and
their parameters to customize a transform.
Additional interactions
Wrangler generates a list of suggested
transforms while the user interacts with data.
Displays only a fraction of the entire suggestion
space if there are many possible transforms.
Users can prune and reorder the suggestion shown:
1.
Select more example data to disambiguate input.
2.
Select an operator from the transform menu.
3.
Edit a transform’s parameters.
Automated transform
suggestion
Wrangler does not immediately execute a
selected transform.
Creates a 
current working transform
.
Displays natural language description:
Transform type and parameters, both editable.
Non-critical defaults may be omitted.
Allows user to edit transform parameters and
updates the suggestions as edits are made.
Editing transforms
Shows users the effects of their transforms
within the source data itself, not as a separate
view (e.g. side-by-side before and after).
Five different preview classes.
Preview selected by Wrangler based on
transform type.
Visual previews
Selection preview
: Highlights in yellow the
regions of text that will be affected by the
transform.
Deletion preview
: Highlights in red the cells to
be deleted.
Update preview
: Overwrites values with their
new values and highlights them in yellow.
Visual previews (cont.)
Column preview
: displays newly-derived
columns and highlights them in yellow.
Table preview
: for fold and unfold, shows the
new table side-by-side with the old since these
transforms modify the table too much
Visual previews (cont.)
Example of selection preview
Example of deletion preview
Example of update preview
Example of column preview
Example of table preview
Transform history
Allows users to edit
individual transform
descriptions and
enable/disable
transforms in the
sequence.
Allows text annotations
for analysts to document
why they used a
particular transform.
Wrangler applies validation functions to a
sample of a column’s data to determine the
proportion of values that parse successfully.
Data quality meter:
Green: values that parse successfully.
Yellow: values match type, but not role.
e.g. Six-digit zip code in numeric, but not a zip code.
Red: values don’t match type.
Grey: values missing.
Data quality
Data quality bar
Inference engine
Inference engine generates a ranked list of
suggested transforms.
Input:
user interactions
current working transform
data descriptions
corpus of user’s transform usage history
Transform suggestions
Three phases:
1.
Infer transform parameters from interactions.
2.
Generate candidate transforms from inferred
parameters.
3.
Rank the resulting transform suggestions.
Suggestion process
Contains frequency counts of transform
descriptors and initiating interactions.
Original corpus built by authors from
wrangling their own data sets.
Updates as analysts use it.
Generate transform frequencies via a relaxed
matching routine.
Usage corpus
Transforms equivalent if:
1.
Identical transform type, e.g. extract, fold, etc.
2.
Equivalent parameters.
Four basic parameter types and equivalence:
1.
Row: both contain filtering conditions or match all
rows in a table.
2.
Column: same data type/semantic role
3.
Text selection: both index-based or contain regex.
4.
Enumerables: exact match
Usage corpus
Infer a parameter independent of others.
Row selection: row indices and predicate
matching.
Column selection: the columns that the user
has interacted with.
Text selection: index ranges or inferred regex.
Inferring parameter sets
Loop over each transform type.
Emit types that can accept all parameters in the
inferred parameter set.
Instantiate each transform with parameters
from the set.
Filter suggestions to exclude degenerate
transforms (no effect on data).
Suggested transforms
Five criteria:
Three by transform type:
Explicit interactions: user chooses from menu or
selects a current working transform.
Specification difficulty: row and text selection
predicates are harder to specify than others; sort types
by number of hard parameters each accepts.
Corpus frequency: also conditioned on their initiating
user interaction.
Ranking suggestions
Five criteria (cont.):
Two within transform type:
Frequency of equivalent transforms in the
corpus.
Ascending order by transform complexity:
Rank simpler transforms higher since
users can evaluate their descriptions
more quickly.
Transform complexity: sum of
parameter complexities.
Filter transforms so one type is <1/3 of the
suggestions.
Ranking suggestions (cont.)
Comparative user study with MS Excel.
Twelve participants, all of whom have never
used Wrangler and with varying degrees of
experience with Excel.
Three tasks: extract text, fill missing values,
reshape table structure.
Result: Median performance in Wrangler was
>2x as fast as in Excel for all experience levels.
Evaluation
Wrangler: interactive interface for data
transformation and cleaning.
Maps user interactions to suggested transforms.
Editable transform history.
In-place transform previews.
Conclusion
Slide Note
Embed
Share

Data wrangling is a crucial step before analysis, involving restructuring, correcting errors, and combining sources. Addressing the time-consuming nature of data preparation, Wrangler offers a user-friendly interface with a declarative transformation language, supporting various transforms and interactive features like natural language descriptions and visual previews. Its contributions include simplifying transform specification, enabling reuse, and providing a history viewer for transparency. The tool enhances efficiency by suggesting applicable transforms based on user interactions.

  • Data Wrangling
  • Interactive Visualization
  • Data Transformation
  • Declarative Language
  • User Interface

Uploaded on Oct 09, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Wrangler: Interactive Visual Specification of Data Transformation Wrangler: Interactive Visual Specification of Data Transformation Scripts Scripts Sean Kandel , Andreas Paepcke , Joseph Hellerstein and Jeffrey Heer Presented by You Jen Lin

  2. Introduction Introduction Before analysis, data must be wrangled into a usable form. Data wrangling: restructure data, identifying and correcting erroneous/missing values, combining data sources.

  3. Motivation Motivation Data analysts still spend a lot of time wrangling data in preparation for analysis. Transforms for restructuring/validating data can be complicated to specify in scripts. e.g. Regular expressions to split strings, validate data format. Reuse and revision of transforms for data updates and changing schemas not possible with scripts or manual editing. Data wrangling should also output a record of transforms used.

  4. Contributions Contributions Underlying declarative transformation language based on Potter s Wheel. Supports six classes of transforms. Interactive user interface that simplifies the way users can specify and apply transforms to data. Natural language descriptions of transforms. Visual previews of transform effects on data. Interactive history viewer Inference engine suggests applicable transforms based on user interaction with data.

  5. Transformation language Transformation language

  6. General structure General structure Extension of Potter s Wheel. Added operators for data cleaning: Positional operators Aggregation Reshaping Added conditional mapping operators. e.g. Update country to U.S. where state= California .

  7. Classes of transforms Classes of transforms Eight classes of transforms. Map output rows. Delete (1-to-0): Takes as predicate a row to remove. Extract, cut, update (1-to-1): Process values within a column. Split (1-to-many): Split text on some delimiter to yield multiple rows. Map: map one input data row to zero or more

  8. Classes of transforms (cont.) Classes of transforms (cont.) Lookups and joins external tables. Wrangler provides lookup tables for common transformations, e.g. mapping zip codes to states. Equi-join: join on some attribute of two tuples from different tables if their values are the same. Approximate join: join if the values of a common attribute of two tuples is within some string edit distance. Lookups and joins: lookup or join data from

  9. Classes of transforms (cont.) Classes of transforms (cont.) Reshape schema. Fold: collapse multiple columns into fewer columns containing key-value sets rather than individual values. Unfold: creates new column headers from data values that could be stored in multiple columns. Reshape: manipulate table structure and

  10. Fold transform Fold transform

  11. Unfold transform Unfold transform

  12. Classes of transforms (cont.) Classes of transforms (cont.) Positional based on their location in the table. Fill: generate values based on neighboring values. Lag: shift column values up or down by a specified number of rows. Sorting Aggregation Key generation keys for new records. Schema transforms Set column name. Specify column data type. Assign semantic role. Positional: modifies the values or their position Sorting Aggregation (sum, min, max, mean, etc.) Key generation (skolemization): retrieval of Schema transforms:

  13. Data types and semantic roles Data types and semantic roles Standard data types: primitives and their corresponding parsing functions. Integers, strings, numeric. Higher-level semantic roles: additional parsing functions, transformation functions mapping related roles, e.g. Semantic role: zip code Parsing function: check for 5 digits, valid Mapping: return the state it is in

  14. Interface design Interface design

  15. Components Components Direct manipulation of and interaction with the data. Automatic suggestion of possible transforms. Inference engine that bases suggestions of user input and data type/semantic role. Menu-based transform selection. Manual editing of transform parameters.

  16. The Wrangler interface The Wrangler interface

  17. Basic interactions Basic interactions Wrangler supports six basic interactions between the user and the data table. 1.Select row. 2.Select column. 3.Click bars in the data quality meter. 4.Select text in a cell. 5.Edit a value in the table. 6.Assign column name, data type, or semantic role.

  18. Additional interactions Additional interactions Users can choose transforms from the menu to apply. Users can also edit transform descriptions and their parameters to customize a transform.

  19. Automated transform suggestion Automated transform suggestion Wrangler generates a list of suggested transforms while the user interacts with data. Displays only a fraction of the entire suggestion space if there are many possible transforms. Users can prune and reorder the suggestion shown: 1.Select more example data to disambiguate input. 2.Select an operator from the transform menu. 3.Edit a transform s parameters.

  20. Editing transforms Editing transforms Wrangler does not immediately execute a selected transform. Creates a current working transform. Displays natural language description: Transform type and parameters, both editable. Non-critical defaults may be omitted. Allows user to edit transform parameters and updates the suggestions as edits are made.

  21. Visual previews Visual previews Shows users the effects of their transforms within the source data itself, not as a separate view (e.g. side-by-side before and after). Five different preview classes. Preview selected by Wrangler based on transform type.

  22. Visual previews (cont.) Visual previews (cont.) Selection preview regions of text that will be affected by the transform. Deletion preview be deleted. Update preview new values and highlights them in yellow. Selection preview: Highlights in yellow the Deletion preview: Highlights in red the cells to Update preview: Overwrites values with their

  23. Visual previews (cont.) Visual previews (cont.) Column preview columns and highlights them in yellow. Table preview new table side-by-side with the old since these transforms modify the table too much Column preview: displays newly-derived Table preview: for fold and unfold, shows the

  24. Example of selection preview Example of selection preview

  25. Example of deletion preview Example of deletion preview

  26. Example of update preview Example of update preview

  27. Example of column preview Example of column preview

  28. Example of table preview Example of table preview

  29. Transform history Transform history Allows users to edit individual transform descriptions and enable/disable transforms in the sequence. Allows text annotations for analysts to document why they used a particular transform.

  30. Data quality Data quality Wrangler applies validation functions to a sample of a column s data to determine the proportion of values that parse successfully. Data quality meter: Green: values that parse successfully. Yellow: values match type, but not role. e.g. Six-digit zip code in numeric, but not a zip code. Red: values don t match type. Grey: values missing.

  31. Data quality bar Data quality bar

  32. Inference engine Inference engine

  33. Transform suggestions Transform suggestions Inference engine generates a ranked list of suggested transforms. Input: user interactions current working transform data descriptions corpus of user s transform usage history

  34. Suggestion process Suggestion process Three phases: 1.Infer transform parameters from interactions. 2.Generate candidate transforms from inferred parameters. 3.Rank the resulting transform suggestions.

  35. Usage corpus Usage corpus Contains frequency counts of transform descriptors and initiating interactions. Original corpus built by authors from wrangling their own data sets. Updates as analysts use it. Generate transform frequencies via a relaxed matching routine.

  36. Usage corpus Usage corpus Transforms equivalent if: 1.Identical transform type, e.g. extract, fold, etc. 2.Equivalent parameters. Four basic parameter types and equivalence: 1.Row: both contain filtering conditions or match all rows in a table. 2.Column: same data type/semantic role 3.Text selection: both index-based or contain regex. 4.Enumerables: exact match

  37. Inferring parameter sets Inferring parameter sets Infer a parameter independent of others. Row selection: row indices and predicate matching. Column selection: the columns that the user has interacted with. Text selection: index ranges or inferred regex.

  38. Suggested transforms Suggested transforms Loop over each transform type. Emit types that can accept all parameters in the inferred parameter set. Instantiate each transform with parameters from the set. Filter suggestions to exclude degenerate transforms (no effect on data).

  39. Ranking suggestions Ranking suggestions Five criteria: Three by transform type: Explicit interactions: user chooses from menu or selects a current working transform. Specification difficulty: row and text selection predicates are harder to specify than others; sort types by number of hard parameters each accepts. Corpus frequency: also conditioned on their initiating user interaction.

  40. Ranking suggestions (cont.) Ranking suggestions (cont.) Five criteria (cont.): Two within transform type: Frequency of equivalent transforms in the corpus. Ascending order by transform complexity: Rank simpler transforms higher since users can evaluate their descriptions more quickly. Transform complexity: sum of parameter complexities. Filter transforms so one type is <1/3 of the suggestions.

  41. Evaluation Evaluation Comparative user study with MS Excel. Twelve participants, all of whom have never used Wrangler and with varying degrees of experience with Excel. Three tasks: extract text, fill missing values, reshape table structure. Result: Median performance in Wrangler was >2x as fast as in Excel for all experience levels.

  42. Conclusion Conclusion Wrangler: interactive interface for data transformation and cleaning. Maps user interactions to suggested transforms. Editable transform history. In-place transform previews.

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#