Challenges and Potential of Crowd-Powered Databases
In the world of crowd-powered databases, challenges such as performance variability and task ambiguity exist alongside the potential for harnessing human power to enhance data processing. Although no fully functional crowd-powered database currently exists, the idea remains powerful and applicable in various data extraction tasks. Understanding the semantics of CrowdSQL and improving its comprehensibility are critical for advancing this field.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
History Lesson First crowd-powered database At that time, the state of the art was turkit Programming library for the crowd Two other crowd-powered databases at around the same time Deco (Stanford, UC Santa Cruz) Qurk (MIT) Necessarily incomplete, preliminary
Motivation of CrowdDB Two reasons why present DB systems won t do: Closed world assumption Get human help for finding new data Very literal in processing data SELECT marketcap FROM company WHERE name = IBM Get the best of both worlds: human power for processing and getting data traditional systems for heavy lifting/data manip
Issues in building CrowdDB Performance and variability: Humans are slow, costly, variable, inaccurate Task design and ambiguity: Challenging to get people to do what you want Affinity / Learning Workers develop relationships with requesters, skills Open world Possibly unbounded answers
History Lesson Even now (4 years later), there is no real complete, fully-functional crowd-powered database Why?
History Lesson Even now, there is no real complete, fully-functional crowd-powered database Why? No one understands the crowds (EVEN NOW) We were all na ve in thinking that we could treat crowds as just another data source. People don t seem to want to use crowds within databases Crowdsourcing is a one-off task Crowds have very different characteristics than other data
Still The ideas are very powerful and applicable everywhere you want data to be extracted Very common use-case of crowds
Semantics Semantics = an understanding of what the query does Regular SQL has very understandable semantics because starting from a given state, you know exactly what state you will be once you execute a statement. Does CrowdSQL have understandable/ semantics? How would you improve it?
Semantics Does CrowdSQL have understandable/ semantics? How would you improve it? Fill in CNULL; LIMIT clause What if you had more than the limit # of tuples already filled in? Overall, very hard. But at the least: A specification of budget? A specification that cost/latency is minimized?
Optimization Techniques Beyond the ones presented in the paper, what other database style optimization techniques can you think of?
Optimization Techniques Beyond the ones presented in the paper, what other database style optimization techniques can you think of? Paper mentions predicate pushdown, e.g., if you only care about tuples in CA, instantiate interfaces with CA filled in. Not always good evaluating crowd predicates may be costly Reorder tables such that more complete tables are filled first. Reorder predicates such that more complete predicates are checked first. SELECT * FROM PROEFESSOR WHERE Dept = math AND Email LIKE %berkeley%
Please fill out the missing department data University Name URL SELECT * FROM professor p, department d WHERE p.department = d.name AND p.university = d.university AND p.name = "Karp" name="Karp" p.dep=d.name CrowdJoin (Dep) p.dep = d.name Phone Submit Department Please fill out the professor data name= "Karp" p.dep=d.name Karp Name CrowdProbe (Professor) name=Karp Email Professor Department University Professor Department Submit Figure3: CrowdSQL Query PlanGeneration quality control. In thecurrent CrowdDB prototype, quality control iscarried out by amajority voteon theinput provided by different workers for the same HIT. The number of workers assigned to each HIT iscontrolled by an Assignmentsparameter (Section 2.1). Theinitial number of Assignmentsiscurrently astaticparameter of CrowdDB. Asmentioned in Section 4.3, thisparameter should be set by theCrowdDB optimizer based on budget constraintsset via CrowdSQL. Thecurrent version of CrowdDB hasthreeCrowdoperators: This logical plan is then optimized using traditional and crowd- specific optimizations. Figure3c showstheoptimized logical plan for this example. In this example, only predicate push-down was applied, a well-known traditional optimization technique. Some crowd-specific optimization heuristics used in CrowdDB are de- scribedinthenext subsection. Finally, thelogical planistranslated into aphysical plan which can beexecuted by theCrowdDB run- timesystem. Aspart of thisstep, Crowd operators and traditional operatorsof therelational algebraareinstantiated. In theexample of Figure3, thequery isexecuted by aCrowdProbeoperator in or- der to crowdsource missing information from the Professor table and a CrowdJoin operator in order to crowdsource missing infor- mation from theDepartment table. (In thisexample, it isassumed that theDepartment isaCROWD table; otherwise, theCrowdJoin operator would not beapplicable.) 6.3 Heuristics Thecurrent CrowdDB compiler isbased on asimplerule-based optimizer. Theoptimizer implementsseveral essential query rewrit- ing rules such as predicate push-down, stopafter push-down [7], join-ordering and determining if theplan isbounded [5]. Thelast optimization deals with the open-world assumption by ensuring that theamount of datarequestedfromthecrowdisbounded. Thus, theheuristic first annotatesthequery plan with thecardinality pre- dictions between the operators. Afterwards, the heuristic tries to re-order the operators to minimize the requests against the crowd andwarnstheuser at compile-timeif thenumber of requestscannot bebounded. Furthermore, wealsocreatedaset of crowd-sourcing rulesinor- der toset thebasiccrowdsourcingparameters(e.g., price,batching- size), select theuser interface(e.g., normalized vs. denormalized) and several other simple cost-saving techniques. For example, a delete on a crowd-sourced table does not try to receive all tuples satisfying the expression in the delete statement before deleting them. Instead the optimizer rewrites the query to only look into existing tuples. Nevertheless, in contrast to acost-based optimizer, arule-based optimizer is not able to exhaustively explore all parameters and thus, often produces asub-optimal result. A cost-based optimizer for CrowdDB, which must also consider the changing conditions onAMT, remainsfuturework. CrowdProbe: Thisoperator crowdsourcesmissing information of CROWD columns (i.e., CNULL values) and new tuples. It uses interfaces such as those shown in Figures 2a, 2d and 2e. The operator enforces quality control by selecting the majority answer for every attribute as the final value. That is, given the answersfor asingletuple(i.e., entity withthesamekey),thema- jority of turkershavetoenter thesamevaluetomakeit thefinal valueof thetuple. If nomajority exists, moreworkersareasked until the majority agrees or a pre-set maximum of answers are collected. In thelatter case, thefinal valueisrandomly selected from thevaluesmost workershad incommon. Inthecaseof newly createdtuplesit canhappenthat all workers enter tuples with different primary keys, making finding a ma- jority impossible. In thiscase, theoperator re-poststhetasksby leavingall non-confirmed attributesempty except theonescom- prising the primary key. This allows CrowdDB to obtain more answersfor every key inorder toform amajority quorum. CrowdJoin: Thisoperator implementsanindexnested-loopjoin over twotables, at least oneof whichiscrowdsourced. For each tupleof theouter relation,thisoperator createsoneor moreHITs in order to crowdsource new tuples from theinner relation that matchesthetupleof theouter relation. Correspondingly, thein- ner relation must be a CROWD table and the user interface to crowdsource new tuples from the inner relation is instantiated with thejoin column valuesof thetuplefrom theouter relation according to the join predicates. The quality control technique isthesameasfor CrowdProbe. CrowdCompare: Thisoperator implementstheCROWDEQUAL and CROWDORDER functions described in Section 4.2. It in- stantiatesuser interfacessuch asthoseshown in Figures2c and 2d. Note that CrowdCompare is typically used inside another traditional operator, such assorting or predicateevaluation. For example,anoperator thatimplementsquick-sort mightuseCrowd- Compare to perform the required binary comparisons. Quality control isbased onthesimplemajority vote. 6.2 Physical Plan Generation Figure3presentsanend-to-endexamplethatshowshowCrowdDB creates a query plan for a simple CrowdSQL query. A query is first parsed; the result is a logical plan, as shown in Figure 3b. 7. EXPERIMENTSAND RESULTS ThissectionpresentsresultsfromexperimentsrunwithCrowdDB andAMT. Weranover 25,000HITsonAMT duringOctober 2010, varyingparameterssuchasprice, jobsper HIT andtimeof day. We measured the response time and quality of the answers provided by the workers. Here, we report on Micro-benchmarks (Section 7.1), that use simple jobs involving finding new data or making subjectivecomparisons. Thegoalsof theseexperiments areto ex-
Recording Data CrowdDB only records either CNULL or the final outcome. Why might this be a bad idea?
Recording Data CrowdDB only records either CNULL or the final outcome. Why might this be a bad idea? Needs and aggregations schemes change An application that requires more accuracy We find that people are more erroneous than we expected Data may get stale
Joins between crowdsourced relations CrowdDB forbids joins between two crowdsourced tables. Is there a case where we may want that?
Joins between crowdsourced relations CrowdDB forbids joins between two crowdsourced tables. Is there a case where we may want that? Sure: People in a department, courses taught in the department What interesting challenges emerge there?
Joins between crowdsourced relations CrowdDB forbids joins between two crowdsourced tables. Is there a case where we may want that? Sure: People in a department, courses taught in the department What interesting challenges emerge there? Get more tuples for one relation or the other. Especially if not K-FK join
FDs? CrowdDB assumes a primary key per table. What if there are other Functional Dependencies? Can we do better?
FDs? CrowdDB assumes a primary key per table. What if there are other Functional Dependencies? Can we do better? Example: Company, City, State
Other things What other issues did you identify in the paper?
Other things What other issues did you identify in the paper? CROWDTABLE: What if workers refer to two entities in a slightly different manner: Jiawei Han vs. J. Han Spelling mistakes CROWDPROBE: What if some information is just hard to crowdsource? Bottlenecked?