Introduction to Tabular Data Modeling and SQL Concepts

Slide Note
Embed
Share

Explore the fundamentals of tabular data modeling, relational algebra, SQL, and database relationships through practical examples. Learn about key concepts such as primary keys, tuples, and different types of relationships in database management. Get hands-on experience using Pandas and SQL for data manipulation and querying.


Uploaded on Sep 11, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. http://dsg.csail.mit.edu/6.S080 Piazza signup: piazza.com/mit/fall2019/6s080 6.S080 Lecture 2 Sam Madden Key ideas: Tabular data & relational model Relational algebra & SQL Next time: reversed lecture on SQL & Pandas; bring a laptop setup as in Lab 0 Let us know ASAP if you need to borrow one Make sure your laptop is charged!

  2. Tables Tables Tables Tables

  3. Tabular Representation bandfan.com Named, typed columns Members ID Primary key Name Birthday Address Email 1 Sam 1/1/2000 32 Vassar St srmadden 2 Tim 1/2/1990 46 Pumpkin St timk Unique records Ordered? Unordered? Schema: the names and types of the fields in a table Tuple: a single record Unique identifier for a row is a key A minimal unique non-null identifier is a primary key

  4. Tabular Representation bandfan.com Members ID Name Birthday Address Email Primary key 1 Sam 1/1/2000 32 Vassar St srmadden 2 Tim 1/2/1990 46 Pumpkin St timk Bands ID Name Genre How to capture relationship between bandfan members and the bands? Primary key 1 Nickelback Terrible 2 Creed Terrible 3 Limp Bizkit Terrible

  5. Types of Relationships One to one: each band has a genre One to many: bands play shows, one band per show * Many to many: members are fans of multiple bands * Of course, shows might only multiple bands this is a design decision

  6. Chad Kroeger of Nickelback Tim the Superfan

  7. Representing Fandom Relationship Try 1 Member-band-fans FanID Name Birthday Address Email BandID BandName Genre 2 Tim 1/2/1990 46 Pumpkin St timk 1 Nickelback Terrible 2 Tim 1/2/1990 46 Pumpkin St timk 2 Creed Terrible 2 Tim 1/2/1990 46 Pumpkin St timk 3 Limp Bizkit Terrible What s wrong with this representation?

  8. Representing Fandom Relationship Try 1 Member-band-fans FanID Name Birthday Address Email BandID BandName Genre 2 Tim 1/2/1990 46 Pumpkin St timk 1 Nickelback Terrible 2 Tim 1/2/1990 46 Pumpkin St timk 2 Creed Terrible 2 Tim 1/2/1990 46 Pumpkin St timk 3 Limp Bizkit Terrible 1 Sam 1/1/2000 32 Vassar St srmadden NULL NULL NULL Adding NULLs is messy because it again introduces the possibility of missing data

  9. Representing Fandom Relationship Try 2 Columns that reference keys in other tables are Foreign keys Member-band-fans FanID Name Birthday Address Email BandID 2 Tim 1/2/1990 46 Pumpkin St timk 1 2 Tim 1/2/1990 46 Pumpkin St timk 2 2 Tim 1/2/1990 46 Pumpkin St timk 3 Bands BandID Name Genre 1 Nickelback Terrible 2 Creed Terrible 3 Limp Bizkit Terrible

  10. Representing Fandom Relationship Try 3 Normalized Members Member-Band-Fans FanID Name Birthday Address Email FanID BandID 2 Tim 1/2/1990 46 Pumpkin St timk 2 1 1 Sam 1/1/2000 32 Vassar St srmadden 2 2 2 3 Bands BandID Name Genre Relationship table 1 Nickelback Terrible 2 Creed Terrible Some members can be a fan of no bands 3 Limp Bizkit Terrible

  11. One to Many Relationships Bands ID Name Genre 1 Nickelback Terrible 2 Creed Terrible 3 Limp Bizkit Terrible How to represent the fact that each show is played by one band? Shows ID Location Date 1 Gillette 4/5/2020 2 Fenway 5/1/2020 3 Agganis 6/1/2020

  12. One to Many Relationships Bands ID Name Genre 1 Nickelback Terrible 2 Creed Terrible Add a band columns to shows 3 Limp Bizkit Terrible Shows ID Location Date BandId Each band can play multiple shows 1 Gillette 4/5/2020 1 2 Fenway 5/1/2020 1 3 Agganis 6/1/2020 2 Some bands can play no shows

  13. Entity Relationship Diagrams Name Genre n n Bands Like Fans Names Emails Addresses 1 Play n Dates Venues Shows

  14. Study Break Patient database Want to represent patients at hospitals with doctors Patients have names, birthdates Doctors have names, specialties Hospitals have names, addresses One doctor can treat multiple patients, each patient has one doctor Each patient in one hospital, hospitals have many patients Write out schema that captures these relationships, including primary keys and foreign keys

  15. Soln Patients (pid, name, bday, did references doctors.did, hid references hospitals.hid) Doctors (did, name, specialty) Hospital (hid, name, addr)

  16. Relational Algebra Projection ( (T,c1, , cn)) -- select a subset of columns c1 .. cn Selection (sel(T, pred)) -- select a subset of rows that satisfy pred Cross Product (T1 x T2) -- combine two tables Join (T1, T2, pred) = sel(T1 x T2, pred) Plus set operations (Union, Difference, etc)

  17. Bandid showid Band Join as Cross Product 1 1 Nickelback 2 1 Creed Bands Shows 3 1 Limp Bizkit bandid name showid bandid 1 2 Nickelback 1 Nickelback 1 1 2 2 Creed 2 Creed 2 1 3 Limp Bizkit 3 2 3 2 Limp Bizkit 4 3 1 3 Nickelback 2 3 Creed Find shows by Creed 3 3 Limp Bizkit 1 4 Nickelback Sel ( join(bands, shows, name= Creed ) 2 4 Creed 3 4 Limp Bizkit bands.bandid=shows.bandid), Real implementations do not ever materialize the cross product

  18. Bandid showid Band 1 1 Nickelback Join as Cross Product Bands 2 1 Creed Shows 3 1 Limp Bizkit bandid name showid bandid 1 2 Nickelback 1 Nickelback 1 1 2 2 Creed 2 Creed 2 1 3 Limp Bizkit 3 2 3 2 Limp Bizkit 4 3 1 3 Nickelback 2 3 Creed Find shows by nickelback 1. bandid=showid 3 3 Limp Bizkit 1 4 Nickelback Sel ( join(bands, shows, name= Creed ) 2 4 Creed bands.bandid=shows.bandid), 3 4 Limp Bizkit

  19. Bandid showid Band 1 1 Nickelback Join as Cross Product Bands 2 1 Creed Shows 3 1 Limp Bizkit bandid name showid bandid 1 2 Nickelback 1 Nickelback 1 1 2 2 Creed 2 Creed 2 1 3 Limp Bizkit 3 2 3 2 Limp Bizkit 4 3 1 3 Nickelback 2 3 Creed Find shows by nickelback 1. bandid=showid 2. name = Creed 3 3 Limp Bizkit 1 4 Nickelback Sel ( join(bands, shows, bands.bandid=shows.bandid), name= Creed ) 2 4 Creed 3 4 Limp Bizkit

  20. Data Flow Graph Representation of Algebra Select Bands Name = Creed Join Project Date Shows.BandId = Bands.Id BandId Record Shows Check for match 1 imagine records flowing out of tables from left to right

  21. Many possible implementations Suppose we have an index on shows: e.g., we store it sorted by band id Select Bands Name = Creed Join Project Date Shows.BandId = Bands.Id BandId Record Shows Check for match Index on shows.bandid

  22. Equivalent Representation Bands Join Select Project Date Shows.BandId = Bands.Id Name = Creed Shows Cross Product All bands and shows

  23. Study Break Write relational algebra to Find the bands Tim likes , using projection, selection, and join Projection ( (T,c1, , cn)) -- select a subset of columns c1 .. cn Selection (sel(T, pred)) -- select a subset of rows that satisfy pred Cross Product (T1 x T2) -- combine two tables Join (T1, T2, pred) = sel(T1 x T2, pred)

  24. Find the bands Tim likes Select Fans Name = Tim Join mbf.fanid = fans.id Member- band-fans Join Project Bands.name mbf.bandid = bands.id Bands

  25. Band Schema CREATE TABLE bands (id int PRIMARY KEY, name varchar, genre varchar); CREATE TABLE fans (id int PRIMARY KEY, name varchar, address varchar); CREATE TABLE band_likes(fanid int REFERENCES fans(id), bandid int REFERENCES bands(id));

  26. SQL Find the genre of Justin Bieber SELECT genre FROM bands WHERE name = 'Justin Bieber'

  27. Find how many fans each band has SELECT bands.name, count(*) FROM bands JOIN band_likes bl ON bl.bandid = bands.id JOIN fans ON fans.id = bl.fanid GROUP BY bands.name;

  28. Find the fan of the most bands SELECT fans.name, count(*) FROM bands JOIN band_likes bl ON bl.bandid = bands.id JOIN fans ON fans.id = bl.fanid GROUP BY fans.name ORDER BY count(*) DESC LIMIT 1;

  29. SQL is Declarative Say what I want, not how to do it E.g., we don t actually know what operation the database executes to find a particular record What are some possible implementation choices?

  30. Tuning Example: Beliebers Find fans of Justin Bieber SELECT fans.name FROM bands JOIN band_likes bl ON bl.bandid = bands.id JOIN fans ON fans.id = bl.fanid WHERE bands.name = 'Justin Bieber' How might we make this query faster?

Related