CS240B Midterm Fall 2018 UCLA - Problems and Solutions
This document covers various problems related to luggage items at airports, including detecting stuck luggage items and aimless cycles. It also discusses defining a UDA to detect specific conditions and exploring operators' processing speeds in a system. The solutions provided involve SQL-TS queries and a memory usage analysis to optimize the processing order of operators.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
UCLA, Fall 2018. CS240B Midterm Your Name: and your ID: Problem Max score Score Problem A 40% Problem B 20% Problem C 20% Problem D 20% Total 100% Extra Credit: 5% Final score:
Problem A [20%+20%=40%]. Luggage items at airports are often delivered by conveyor belts, where sensors are used to monitor the presence/passage of each item through a checkpoint. The resulting data stream has the following form: sensed(Time, CPS, LID). This denotes that at a certain Time (in seconds) the sensor at checkpoint CPS had detected a luggage item with id LID (assume that we have a reading every second). The airport is interested in detecting the occurrences of the following two problems, along with the CPS of the checkpoint where they were detected: A1. A luggage item that seems to be stuck: i.e. has spent more than 5 minutes spent at the same CPS. Return the LID, the CPS and the current time A2. A luggage item has been on the conveyor belt for at least 6 minutes, during which the item has gone through the same CPS at least 2 times (an indication that the item is wondering in aimless cycles but the cycles do not need to be consecutive or identical). Write an SQL-TS query to detect situation A1 and an SQL-TS query for situation A2 above. In both cases return the luggage LID, the CPS for the problem checkpoint, and the current time at the expiration of the 6 minutes. (PS: you can write SQL-MR if you prefer) Select D.Time, D.CPS, D.LID From sensed Partitioned by LID ordered by Time AS (D+) Where D.CPS=previous(D.CPS) AND D.Time> first(D.Time)+300 Select D.Time, D.CPS, D.LID From sensed Partitioned by LID ordered by Time AS (S1 O1+ E1 X* S2 O2+ E2 W* Z) Where O1.CPS <>S1.CPS AND E1.CPS=S1.CPS and O2.CPS <>S2.CPS AND E2.CPS=S2.CPS and Z.Time > S.Time+360
Problem B [20 %]: Define a UDA that detects the occurrence of condition 1 from Problem A. Show the definition of the UDA using the notation used in the homeworks. The UDA will be called with a partition by LID, unlimited preceding clause. Window stuck(TimeIn timestamp, CPSin char) : (Stime timestamp, Sensor char) { Table previous(STime timestamp, SCPS char); initialize: {insert into previous value(Time, CPS)} Iterate: { update previous set STime= Timein, SCPS= CPSin) where CPSin<>SCPS; insert into return (STime, SCPS) select from previous where CPSin=SCPS and Timein = STime+301 % i.e., 5 minutes +1sec }
Problem C. 20% Layout: A A The processing speeds of these operators are: A: 100 tuple/sec B: 100 tuples/sec C: 100 tuples/sec U: 200 tuples/sec Source1 Source1 Sink Sink B Source2 Source2 ||| U Sink Sink ||| C C Source3 Source3 We have timestamped tuples. Also we assume that the buffers feeding A, B, and C contain respectively 1000 tuples, 400 tuples and 600 tuples. Also B and C deliver one output tuple for each input tuple, while A is a selection that eliminates about 50% of its input tuples. All tuples have the same size. C1: What is the time T required to process all tuples? In which order should the operators be scheduled to minimize the memory usage? (1) Illustrate your answer with a memory diagram and (2) estimate the resulting average number of tuples kept in memory during time T. You can also assume that the idle waiting caused by U is negligible. 2000 A 1000 B+C+U 10s 15s Time 25s C1: B and C have the same memory release rate and they must operate together because of the union. Thus B and C process their 600+400 =1000 tuples in 10 sec. along with U that takes 5 for a total of 15 s. So A+B+U releases 1000 tuple of memory in 15 secds. 100/15= 66.6 tples/sec. A releases 100 tuples/sec and should go first. It will take 10 secs. The total area of 1,000x10+ 1000x10/2 + 1000x15/2= 22,500. The average memory usage over the 25 seconds is 22,500/25= 900 tuples 4
I i Problem D. 20% The processing speeds of these operators are: A: 100 tuples/sec, B: 100 tuples/sec, C: 25 tuple/sec B B B1 B1 B2 B2 Sink Sink Layout II: A A Source Source C C Sink Sink Problem D: Consider Layout II and assume that the buffers feeding A contains 1000 tuples. A is a selection operator that discards 50% of its input tuples. However, for each tuple in its input, B delivers one tuple to its output, and so does C. D1. We wan to minimize the average latency measured in the number of missing query answers. Should we then partition this network and, if so, how? Illustrate your answer with a diagram and estimate the resulting average latency. C 1000 A+B 500 C 35s Time 15s 20s D1: to process 1000 tuples, A takes 10 sec. Then to output 500 tuples B takes 5 sec and C takes 20. So the total output a is is 1000 answers in 35 sec. - Unbroken it delivers at the rate of 1000/35 tuples per sec. - If we cut A+B delivers 500 answers in 10+5 sec. i.e. 1000/30 faster we should break it - Then to deliver 500 tuples, C takes 500/25= 20s. - Avg Latency, i.e. the time taken by a tuple on the average. Area/1000 Area= 500x15+500x15/2+ 500x20/2= 500x32.5 - Avg. Latency= 500X32.5/1000= 16.25 s 5
I i Extra Credit. 5% Rather than defining a new aggregate, express problem A1 using standard SQL OLAP functions %Idea: a window of size 300 seconds, in which there is only one value of CPS select, LID, Time from ( select LID, Time, count(DISTINCT CPS) over (partition by LID order by Time, range 300 preceding) as SNs ) where SNs=1. 6
Alternative serial Schedules: Memory Used B C 2000 A 1000 B+C+U Time 10s 15s 10s 25s