Challenges and Potential of Crowd-Powered Databases

CrowdDB
 
History Lesson
First crowd-powered database
At that time, the state of the art was TurKit
Programming library for the crowd
Two other crowd-powered databases at around
the same time
Deco (Stanford, UC Santa Cruz)
Qurk (MIT)
Necessarily incomplete, preliminary
Motivation of CrowdDB
Two reasons why present DB systems won’t
do:
Closed world assumption
Get human help for finding new data
Very literal in processing data
SELECT marketcap FROM company
WHERE name = "IBM"
Get the best of both worlds:
human power for processing and getting data
traditional systems for heavy lifting / data manipulation
Issues in building CrowdDB
Performance and variability:
Humans are slow, costly, variable, inaccurate
Task design and ambiguity:
Challenging to get people to do what you want
Affinity / Learning
Workers develop relationships with requesters
and build up skills
Open world
Possibly unbounded answers
At a High Level
Modifications to the query language: CrowdSQL
Automatic UI generation
Automatic interaction with marketplace
Storing data for future use
Modifications to SQL
Special keyword: CROWD
Used in two ways
First: crowdsourced columns
CREATE TABLE Department (
    university STRING,
    name STRING,
    url CROWD STRING,
    phone STRING,
    PRIMARY KEY (university, name));
CROWD attribute cannot be PK
Modifications to SQL
Crowdsourced Tables
CREATE CROWD TABLE Profs (
    name STRING PRIMARY KEY,
    email STRING UNIQUE,
    university STRING,
    department STRING,
    FOREIGN KEY (university, department)
        REF Department (university, name));
Still need a PK
How do we designate incomplete data?
Special keyword: CNULL
Constraint: we want CNULL to be filled in before
query results are returned
CREATE TABLE Department (
    university STRING,
    name STRING,
    url CROWD STRING,
    phone STRING,
    PRIMARY KEY (university, name));

SELECT url FROM Department WHERE name = "math"
Comparisons
CROWDEQUAL
SELECT name FROM Professor
WHERE department ~= "CS"

CROWDORDER
SELECT p FROM Picture
WHERE subject = "chair"
ORDER BY CROWDORDER(p,
    "Which picture visualizes better %chair")
Similar to Qurk FILTER, SORT predicates, but hides away even more
details from the user
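Below is a minimal Python sketch of how an ORDER BY CROWDORDER clause could be evaluated: a sort operator issues pairwise comparison tasks and sorts by the crowd's majority preference. The ask_crowd_compare helper is a hypothetical placeholder for posting a comparison HIT and tallying answers, not CrowdDB's actual API.

from functools import cmp_to_key

def ask_crowd_compare(a, b, question):
    """Hypothetical helper: post a pairwise comparison task
    (e.g., "Which picture visualizes better: chair?") and return
    -1 if the crowd's majority prefers a, +1 if it prefers b."""
    raise NotImplementedError  # would create a HIT on the marketplace

def crowd_order(items, question):
    """Sort items with crowd comparisons; a sort operator would use a
    crowd comparison for each binary comparison it needs."""
    return sorted(items, key=cmp_to_key(
        lambda a, b: ask_crowd_compare(a, b, question)))

# pictures = rows matching: subject = "chair"
# ranked = crowd_order(pictures, "Which picture visualizes better: chair?")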
UI Generation
Instantiated at run-time for every tuple
Can also be edited
Can you think of other UIs? (Remember
previous papers..)
Multi-relational UI Generation
Interesting trick to deal with dependencies
If referencing a non-crowdsourced table
Drop-down box + suggest function. Why?
If referencing a crowdsourced table
Normalized interface with a suggest function
Denormalized interface getting dependent data
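As an illustration only (fk_field_html, its arguments, and the HTML it emits are assumptions, not CrowdDB's generator), the foreign-key rule above might look roughly like this:

def fk_field_html(column, referenced_is_crowd, known_values):
    """Sketch of a UI fragment for a foreign-key column in a generated form."""
    if not referenced_is_crowd:
        # Referenced table is complete: a drop-down restricts workers
        # to existing keys, so no invalid references can be entered.
        options = "".join(f"<option>{v}</option>" for v in known_values)
        return f'<select name="{column}">{options}</select>'
    # Referenced table is crowdsourced: new values are allowed, but a
    # suggest (autocomplete) list nudges workers toward known keys.
    suggestions = "".join(f'<option value="{v}">' for v in known_values)
    return (f'<input name="{column}" list="{column}-suggest">'
            f'<datalist id="{column}-suggest">{suggestions}</datalist>')

# fk_field_html("university", referenced_is_crowd=False,
#               known_values=["UC Berkeley", "MIT", "Stanford"])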
Query Processing
Given these SQL extensions, there are a
handful of new operators
CROWDPROBE:
Collects missing information in CROWD columns or
tables
CROWDJOIN:
Inner table is probed in a crowdsourced fashion, using
each tuple of the outer table
CROWDCOMPARE:
Used for CROWDEQUAL and CROWDORDER
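The quality control behind these operators is a majority vote over the answers different workers give for the same tuple (the slides assume 5 answers). A minimal sketch, assuming answers have already been grouped by primary key and attribute; the helper names and example values are hypothetical:

from collections import Counter

def majority_value(values, quorum=3):
    """Value a majority of workers agree on, or None if no value reaches
    the quorum (more workers would then be asked)."""
    value, count = Counter(values).most_common(1)[0]
    return value if count >= quorum else None

def resolve_tuple(answers_by_attr, quorum=3):
    """answers_by_attr maps attribute -> list of worker answers for one
    primary key; unresolved attributes stay None (i.e., CNULL)."""
    return {attr: majority_value(vals, quorum)
            for attr, vals in answers_by_attr.items()}

# Example with 5 assignments for one Department tuple (made-up values):
answers = {"url": ["math.example.edu"] * 4 + ["example.edu/math"],
           "phone": ["555-0100", "555-0100", "555-0100", "555-0199", "555-0100"]}
print(resolve_tuple(answers))  # {'url': 'math.example.edu', 'phone': '555-0100'}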
Let’s dig deeper…
Quality management: Majority vote with 5
answers across all those whose PKs match
Issue?
CROWDPROBE on a CROWDTABLE?
Prof(name, email, univ, dept); primary key (name, univ)
Say I want to find 3 professors with univ = X, dept = Y;
how long would it take?
Best case:
Worst case:
Let’s dig deeper…
Quality management: Majority vote with 5
answers across all those whose PKs match
Issue?
CROWDPROBE on a CROWDTABLE?
Prof(name, email, univ, dept)
Say I want to find 3 professors with univ = X, dept = Y
Workers may each focus on different professors, so no
single answer accumulates a majority
Let’s dig deeper …
CROWDTABLE
Any other issues?
Let’s dig deeper …
CROWDTABLE
What if workers refer to the same professor in
slightly different ways:
Jiawei Han vs. J. Han
Spelling mistakes
Let’s dig deeper …
CROWDJOIN
Each outer tuple is used to probe the inner crowdsourced table,
asking the crowd for inner tuples that match the join predicate
E.g., join between Dept and Prof
Are there similar issues here? What if workers
can’t find the URL of a specific Prof?
Crowd Operators
Tasks
Create HITs and HIT groups
Collect results from AMT
Quality control via majority vote
Right now: simple rule-based optimizer
Batch size, # assignments, price fixed
Predicate push down, join ordering, delete
optimization
Future: cost-based optimizer
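A toy sketch of the fixed crowdsourcing parameters and HIT-group batching mentioned above; the dataclass, its default values, and make_hit_groups are illustrative assumptions, not CrowdDB's actual interface:

from dataclasses import dataclass
from typing import Dict, List

@dataclass
class CrowdParams:
    batch_size: int = 5       # tuples bundled into one HIT
    assignments: int = 5      # workers per HIT, feeding the majority vote
    reward_usd: float = 0.01  # price paid per assignment (assumed default)

def make_hit_groups(tuples: List[Dict], params: CrowdParams) -> List[List[Dict]]:
    """Split the tuples needing crowd input into HITs of batch_size each;
    similar HITs are grouped so they can be posted together."""
    return [tuples[i:i + params.batch_size]
            for i in range(0, len(tuples), params.batch_size)]

# Each resulting group would be posted with params.assignments assignments
# at params.reward_usd each, and completed answers collected from AMT.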
Query Processing Example
SELECT * FROM Professor p, Department d
WHERE  p.department = d.name AND
       p.university = d.university AND
       p.name = "Karp"
Plan: CROWDPROBE collects the professor's details (name = "Karp"),
then CROWDJOIN probes Department using the collected (university, department)
Results on benchmarks
HIT Size vs. responsiveness
Tradeoff between HITs completed per unit time and %
completion of HITs
Reward vs. Responsiveness
% HITs fully completed
% HITs with > 1 assignment done
Completion across workers
Skewed distribution
No variation in error rate between high-frequency
workers and others
Complex Queries
Entity resolution on company names
Matching one company name with 100 others for
four separate runs
Majority vote gives the correct result
Ordering photos in terms of relevance
Majority vote matches expert ranking
Complex Queries
Joining professors and departments
SELECT p.name, p.email, d.name, d.phone
FROM   Professor p, Department d
WHERE  p.department = d.name AND
       p.university = d.university AND
       p.name = "[name of a professor]"
Method 1: first prof details collected, then dep
details
Method 2: prof and dep details collected together
via a denormalized interface
Method 2 is cheaper, but Method 1 outperforms
Method 2 in accuracy:
Instructions for denormalized interface unclear
Other observations
Relationship with workers is long-term
Keep workers happy
Implement less stringent approval mechanisms
Good interface design and instructions matter
Simple choices like “none of the above” improve
quality dramatically
History Lesson
Even now (4 years later), there is no real complete,
fully-functional crowd-powered database
Why?
History Lesson
Even now, there is no real complete, fully-functional
crowd-powered database
Why?
No one understands the crowds (EVEN NOW)
We were all naïve in thinking that we could treat crowds as
just another data source.
People don’t seem to want to use crowds within
databases
Crowdsourcing is a one-off task
Crowds have very different characteristics than other
data
Still…
 
The ideas are very powerful and applicable
everywhere you want data to be extracted
Very common use-case of crowds
Semantics
 
Semantics = an understanding of what the
query does
Regular SQL has very understandable
semantics: starting from a given state,
you know exactly what state you will be in once
you execute a statement.
Does CrowdSQL have understandable
semantics? How would you improve it?
Semantics
Does CrowdSQL have understandable
semantics? How would you improve it?
Fill in CNULL; LIMIT clause
What if you had more than the limit # of tuples
already filled in?
Overall, very hard. But at the least:
A specification of budget?
A specification that cost/latency is minimized?
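One way to make the LIMIT suggestion above concrete, as a sketch under stated assumptions (None stands in for CNULL; crowdsource_value is a hypothetical stand-in for a CROWDPROBE call):

def crowdsource_value(row, column):
    """Hypothetical helper: ask the crowd for the missing value."""
    raise NotImplementedError

def select_with_limit(rows, column, limit):
    """Return up to `limit` values for `column`, preferring values that
    are already stored; only crowdsource the shortfall, bounding cost."""
    results = [r[column] for r in rows if r.get(column) is not None][:limit]
    for row in rows:
        if len(results) >= limit:
            break  # enough tuples already known: no crowd tasks needed
        if row.get(column) is None:
            row[column] = crowdsource_value(row, column)
            results.append(row[column])
    return results

Whether already-stored tuples should count against the LIMIT, or whether fresh crowd answers are always wanted, is exactly the semantic ambiguity the question above raises.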
Optimization Techniques
Beyond the ones presented in the paper, what
other “database style” optimization
techniques can you think of?
Optimization Techniques
Beyond the ones presented in the paper, what other
“database style” optimization techniques can you think
of?
Paper mentions predicate pushdown, e.g., if you only care
about tuples in CA, instantiate interfaces with CA filled in.
Not always good: evaluating crowd predicates may be costly
Reorder tables such that more “complete” tables are filled
first.
Reorder predicates such that more “complete” predicates
are checked first.
SELECT * FROM Professor WHERE dept = "math"
AND email LIKE "%berkeley%"
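A hedged sketch of the predicate-reordering heuristic above: evaluate the predicate over the most "complete" column first, so existing values filter out tuples before any crowd task is issued for the rest. The helpers and the None-as-CNULL convention are assumptions for illustration.

def column_completeness(rows, column):
    """Fraction of tuples whose value for `column` is already known
    (None stands in for CNULL here)."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if r.get(column) is not None) / len(rows)

def order_predicates(rows, predicate_columns):
    """Check predicates on more complete columns first: they prune tuples
    cheaply, leaving fewer CNULLs that must be crowdsourced."""
    return sorted(predicate_columns,
                  key=lambda c: column_completeness(rows, c),
                  reverse=True)

# For SELECT * FROM Professor WHERE dept = "math" AND email LIKE "%berkeley%":
# professors = [{"name": "A", "dept": "math", "email": None}, ...]
# order_predicates(professors, ["dept", "email"])  # -> evaluate "dept" first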
 
Recording Data
CrowdDB only records either CNULL or the
final outcome. Why might this be a bad idea?
Recording Data
CrowdDB only records either CNULL or the
final outcome. Why might this be a bad idea?
Needs and aggregation schemes change
An application that requires more accuracy
We find that people are more erroneous than we
expected
Data may get stale
Joins between crowdsourced relations
CrowdDB forbids joins between two
crowdsourced tables. Is there a case where we
may want that?
Joins between crowdsourced relations
CrowdDB forbids joins between two
crowdsourced tables. Is there a case where we
may want that?
Sure:
People in a department, courses taught in the
department
What interesting challenges emerge there?
Joins between crowdsourced relations
CrowdDB forbids joins between two
crowdsourced tables. Is there a case where we
may want that?
Sure:
People in a department, courses taught in the
department
What interesting challenges emerge there?
Get more tuples for one relation or the other.
Especially if it is not a key-foreign-key join
FDs?
CrowdDB assumes a primary key per table.
What if there are other Functional
Dependencies? Can we do better?
FDs?
CrowdDB assumes a primary key per table.
What if there are other Functional
Dependencies? Can we do better?
Example: Company → City, State (once the crowd fills in a company's
city and state, they can be reused for other tuples of that company)
Other things…
 
What other issues did you identify in the
paper?
Other things…
 
What other issues did you identify in the
paper?
CROWDTABLE:
What if workers refer to the same entity in slightly
different ways:
Jiawei Han vs. J. Han
Spelling mistakes
CROWDPROBE:
What if some information is just hard to crowdsource?
Bottlenecked?