Data Acquisition and Parsing Methods in Data Science

Lecture 3: Data II
How to get it, methods to parse it, and ways to explore it.
Harvard IACS, CS109A
Pavlos Protopapas, Kevin Rader, and Chris Tanner
ANNOUNCEMENTS

- Homework 0 isn't graded for accuracy. If your questions were surface-level / clarifying questions, you're in good shape.
- Homework 1 is graded for accuracy; it'll be released today (due in a week).
- Study Break this Thurs @ 8:30pm and Fri @ 10:15am.
- After lecture, please update your Zoom to the latest version.
Background

So far, we've learned:
- What is Data Science? (Lecture 1)
- The Data Science Process (Lectures 1 & 2)
- Data: types, formats, issues, etc. (Lecture 2)
- Regular Expressions, briefly (Lecture 2)
- How to get data and parse web data + PANDAS (this lecture)
- How to model data (future lectures)
Background

The Data Science Process:
1. Ask an interesting question
2. Get the Data
3. Explore the Data
4. Model the Data
5. Communicate/Visualize the Results
Background

The Data Science Process:
1. Ask an interesting question
2. Get the Data          <- this lecture
3. Explore the Data      <- this lecture
4. Model the Data
5. Communicate/Visualize the Results
Learning Objectives

- Understand different ways to obtain data
- Be able to extract any web content of interest
- Be able to use basic PANDAS commands to store and explore data
- Feel comfortable using online resources to help with these libraries (Requests, BeautifulSoup, and PANDAS)
Agenda

- How to get web data
- How to parse basic elements using BeautifulSoup
- Getting started with PANDAS
What are common sources for data?
(For Data Science and computation purposes.)
Obtaining Data

Data can come from:
- You curate it
- Someone else provides it, all pre-packaged for you (e.g., files)
- Someone else provides an API
- Someone else has available content, and you try to take it (web scraping)
Obtaining Data: Web scraping

Web scraping:
- Using programs to get data from online
- Often much faster than manually copying data!
- Transfers the data into a form that is compatible with your code
- Raises legal and moral issues (per Lecture 2)
Obtaining Data: Web scraping

Why scrape the web?
- Vast source of information; can combine with multiple datasets
- Companies have not provided APIs
- Automate tasks
- Keep up with sites / real-time data
- Fun!
Obtaining Data: Web scraping

Web scraping tips:
- Be careful and polite
- Give proper credit
- Care about media law / obey licenses / privacy
- Don't be evil (no spam, overloading sites, etc.)
Obtaining Data: Web scraping

Robots.txt
- Specified by the web site owner
- Gives instructions to web robots (e.g., your code)
- Located at the top-level directory of the web server
- E.g., http://google.com/robots.txt
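A robots.txt policy can also be checked programmatically before scraping. This is a minimal sketch using Python's standard-library urllib.robotparser; the policy lines and URLs below are invented for illustration.

```python
from urllib.robotparser import RobotFileParser

# A tiny, hypothetical robots.txt policy, parsed from a list of lines.
# (RobotFileParser can also fetch a live file via set_url() + read().)
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = RobotFileParser()
rp.parse(rules)

# Check whether a generic crawler may fetch two hypothetical paths.
print(rp.can_fetch("*", "http://example.com/index.html"))  # allowed
print(rp.can_fetch("*", "http://example.com/private/x"))   # disallowed
```

Checking can_fetch() before each request is a polite default for any scraper.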
Obtaining Data: Web scraping

Web Servers
- A server maintains a long-running process (also called a daemon), which listens on a pre-specified port
- It responds to requests, which are sent using a protocol called HTTP (HTTPS is the secure version)
- Our browser sends these requests, downloads the content, and displays it
- Status codes: 2xx means the request was successful; 4xx means a client error, often `page not found`; 5xx means a server error (often that your request was incorrectly formed)
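The status-code families above can be recovered with simple integer division on the leading digit. This helper is just an illustration, not part of any library.

```python
def status_family(code: int) -> str:
    """Classify an HTTP status code by its leading digit."""
    families = {
        2: "success",
        3: "redirection",
        4: "client error",
        5: "server error",
    }
    return families.get(code // 100, "other")

print(status_family(200))  # success
print(status_family(404))  # client error
print(status_family(500))  # server error
```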
Obtaining Data: Web scraping

HTML
- Tags are denoted by angled brackets
- Almost all tags come in pairs, e.g., <p>Hello</p>
- Some tags do not have a closing tag, e.g., <br/>
Obtaining Data: Web scraping

HTML
- <html>: indicates the start of an html page
- <body>: contains the items on the actual webpage (text, links, images, etc.)
- <p>: the paragraph tag; can contain text and links
- <a>: the link tag; contains a link url, and possibly a description of the link
- <input>: a form input tag, used for text boxes and other user input
- <form>: a form start tag, to indicate the start of a form
- <img>: an image tag containing the link to an image
Obtaining Data: Web scraping

How to Web scrape:
1. Get the webpage content
   Requests (Python library) gets a webpage for you
2. Parse the webpage content (e.g., find all the text or all the links on a page)
   BeautifulSoup (Python library) helps you parse the webpage.
   Documentation: http://crummy.com/software/BeautifulSoup
The Big Picture Recap

- Data Sources: Files, APIs, Webpages (via Requests)
- Data Parsing: Regular Expressions, BeautifulSoup
- Data Structures/Storage: traditional lists/dictionaries, PANDAS
- Models: Linear Regression, Logistic Regression, kNN, etc.

BeautifulSoup only concerns webpage data.
Obtaining Data: Web scraping

1. Get the webpage content
   Requests (Python library) gets a webpage for you:

   page = requests.get(url)
   page.status_code
   page.content
Obtaining Data: Web scraping

1. Get the webpage content
   page.status_code gets the status from the webpage request.
   200 means success; 404 means page not found.
Obtaining Data: Web scraping

1. Get the webpage content
   page.content returns the content of the response, in bytes.
Obtaining Data: Web scraping

2. Parse the webpage content
   BeautifulSoup (Python library) helps you parse a webpage:

   soup = BeautifulSoup(page.content, "html.parser")
   soup.title
   soup.title.text
Obtaining Data: Web scraping

2. Parse the webpage content
   soup.title returns the full element, including the title tag itself, e.g.,
   <title data-rh="true">The New York Times – Breaking News</title>
Obtaining Data: Web scraping

2. Parse the webpage content
   soup.title.text returns just the text part of the title tag, e.g.,
   The New York Times – Breaking News
Obtaining Data: Web scraping

BeautifulSoup
- Helps make messy HTML digestible
- Provides functions for quickly accessing certain sections of HTML content
Obtaining Data: Web scraping

HTML is a tree
- You don't have to access the HTML as a tree, though
- You can immediately search for tags/content of interest (a la the previous slide)
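Searching the parsed tree directly can be sketched as follows; the page content here is a made-up string, not a real site.

```python
from bs4 import BeautifulSoup

html = "<body><p>one</p><p>two</p><a href='/x'>x</a><a href='/y'>y</a></body>"
soup = BeautifulSoup(html, "html.parser")

# find_all() returns every matching tag, anywhere in the tree.
paragraphs = [p.text for p in soup.find_all("p")]
links = [a["href"] for a in soup.find_all("a")]
print(paragraphs)  # ['one', 'two']
print(links)       # ['/x', '/y']
```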
Exercise 1 time!
PANDAS
(Kung Fu Panda is property of DreamWorks and Paramount Pictures)
Store and Explore Data: PANDAS

What / Why?
- Pandas is an open-source Python library (anyone can contribute)
- Allows for high-performance, easy-to-use data structures and data analysis
- Unlike the NumPy library, which provides multi-dimensional arrays, Pandas provides a 2D table object called DataFrame (akin to a spreadsheet with column names and row labels)
- Used by a lot of people
Store and Explore Data: PANDAS

How
- import the pandas library (convenient to rename it, e.g., import pandas as pd)
- Use the read_csv() function
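The import-and-read pattern above can be sketched as follows. io.StringIO stands in for a file on disk so the example is self-contained; the CSV content and column names are invented.

```python
import io
import pandas as pd  # the conventional rename

# Hypothetical CSV data; in practice you would pass a filename,
# e.g., pd.read_csv("data.csv").
csv_text = "name,age\nAlice,34\nBob,29\n"
df = pd.read_csv(io.StringIO(csv_text))

print(df.shape)          # (2, 2): two rows, two columns
print(list(df.columns))  # ['name', 'age']
```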
Store and Explore Data: PANDAS

What it looks like
Visit https://pandas.pydata.org/pandas-docs/stable/getting_started/intro_tutorials/01_table_oriented.html for a more in-depth walkthrough.
Store and Explore Data: PANDAS

Example
Say we have the following, tiny DataFrame of just 3 rows and 3 columns:

- df2['a'] selects column a
- df2['a'] == 4 returns a Boolean list representing which rows of column a equal 4: [False, True, False]
- df2['a'].min() returns 1 because that's the minimum value in the a column
- df2[['a', 'c']] selects columns a and c
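The slide's DataFrame appears only as an image, so the values below are hypothetical, chosen to reproduce the outputs described above (column a's minimum is 1, and only its second row equals 4).

```python
import pandas as pd

# Hypothetical 3x3 DataFrame consistent with the slide's outputs.
df2 = pd.DataFrame({"a": [1, 4, 2], "b": [5, 6, 7], "c": [10, 20, 30]})

print(list(df2["a"] == 4))    # [False, True, False]
print(df2["a"].min())         # 1
print(df2[["a", "c"]].shape)  # (3, 2): just columns a and c
```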
Store and Explore Data: PANDAS

Example continued
- df2['a'].unique() returns all distinct values of the a column, each appearing once
- df2.loc[2] returns a Series representing the row with the label 2
- df2.loc[df2['a'] == 4] returns all rows for which the passed-in Boolean list [False, True, False] is True
Store and Explore Data: PANDAS

Example continued
- df2.iloc[2] returns a Series representing the row at index 2 (NOT the row labelled 2; though, they are often the same, as seen here)
- df2.sort_values(by=['c']) returns the DataFrame with rows reordered so that they are in ascending order according to column c. In this example, df2 would remain the same, as the values were already sorted.
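Continuing with the same hypothetical df2 from above, the label-based and position-based accessors behave as described:

```python
import pandas as pd

df2 = pd.DataFrame({"a": [1, 4, 2], "b": [5, 6, 7], "c": [10, 20, 30]})

print(sorted(df2["a"].unique()))     # [1, 2, 4]: distinct values of column a
print(df2.loc[2, "a"])               # 2: value of a in the row labelled 2
print(df2.loc[df2["a"] == 4].shape)  # (1, 3): only the matching row
print(df2.iloc[2, 0])                # 2: row at position 2, first column
print(df2.sort_values(by=["c"]).equals(df2))  # True: already sorted by c
```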
Store and Explore Data: PANDAS

Common PANDAS functions
High-level viewing:
- head() – first N observations
- tail() – last N observations
- describe() – statistics of the quantitative data
- dtypes – the data types of the columns
- columns – names of the columns
- shape – the # of (rows, columns)
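A quick tour of the viewing functions above, on a small invented DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [10.0, 20.0, 30.0, 40.0]})

print(df.head(2).shape)  # (2, 2): first 2 observations
print(df.tail(1).shape)  # (1, 2): last observation
print(df.shape)          # (4, 2)
print(list(df.columns))  # ['x', 'y']
print(df.describe().loc["mean", "x"])  # 2.5: mean of column x
```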
Store and Explore Data: PANDAS

Common PANDAS functions
Accessing/processing:
- df["column_name"]
- df.column_name
- .max(), .min(), .idxmax(), .idxmin()
- <dataframe> <conditional statement>
- .loc[] – label-based accessing
- .iloc[] – index-based accessing
- .sort_values()
- .isnull(), .notnull()
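The accessors above, sketched on toy data (the values, including the missing one, are invented):

```python
import pandas as pd

df = pd.DataFrame({"x": [3, 1, 2], "y": [None, 5.0, 6.0]})

print(df["x"].max())          # 3
print(df["x"].idxmax())       # 0: the label of the row holding the max
print(df[df["x"] > 1].shape)  # (2, 2): conditional statement keeps 2 rows
print(df.sort_values("x")["x"].tolist())  # [1, 2, 3]
print(df["y"].isnull().tolist())          # [True, False, False]
```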
Store and Explore Data: PANDAS

Common PANDAS functions
Grouping/Splitting/Aggregating:
- .groupby(), .get_group()
- .merge()
- .concat()
- .aggregate()
- .append() (deprecated in newer pandas; prefer concat)
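A sketch of grouping and combining; the data, column names, and group labels are all invented.

```python
import pandas as pd

sales = pd.DataFrame({"store": ["A", "A", "B"], "amount": [10, 20, 5]})

# groupby() splits rows by a key; an aggregation then summarizes each group.
totals = sales.groupby("store")["amount"].sum()
print(totals["A"], totals["B"])  # 30 5

# .get_group() pulls one group's rows back out.
print(sales.groupby("store").get_group("A").shape)  # (2, 2)

# merge joins on a key column; concat stacks frames vertically.
stores = pd.DataFrame({"store": ["A", "B"], "city": ["Boston", "NYC"]})
merged = sales.merge(stores, on="store")
print(merged.shape)                     # (3, 3): city column attached
print(pd.concat([sales, sales]).shape)  # (6, 2)
```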
Exploratory Data Analysis (EDA)

Why?
- EDA encompasses the "explore data" part of the data science process
- EDA is crucial but often overlooked:
  - If your data is bad, your results will be bad
  - Conversely, understanding your data well can help you create smart, appropriate models
Exploratory Data Analysis (EDA)

What?
1. Store data in data structure(s) that will be convenient for exploring/processing (memory is fast; storage is slow)
2. Clean/format the data so that:
   - Each row represents a single object/observation/entry
   - Each column represents an attribute/property/feature of that entry
   - Values are numeric whenever possible
   - Columns contain atomic properties that cannot be further decomposed*

* Unlike food waste, which can be composted. Please consider composting food scraps.
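Step 2's "values numeric whenever possible" can be sketched with pandas coercion; the column name and values are invented for illustration.

```python
import pandas as pd

raw = pd.DataFrame({"price": ["1.5", "2.0", "n/a"]})

# Coerce strings to numbers; unparseable entries become NaN,
# which can then be handled explicitly (dropped, imputed, etc.).
clean = pd.to_numeric(raw["price"], errors="coerce")
print(clean.tolist())  # [1.5, 2.0, nan]
```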
Exploratory Data Analysis (EDA)

What? (continued)
3. Explore global properties: use histograms, scatter plots, and aggregation functions to summarize the data
4. Explore group properties: group like-items together to compare subsets of the data (are the comparison results reasonable/expected?)

This process transforms your data into a format which is easier to work with, gives you a basic overview of the data's properties, and likely generates several questions for you to follow up on in subsequent analysis.
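Steps 3 and 4 can be sketched with aggregation rather than plots, so the example runs anywhere; the dataset and group labels are invented.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "value": [1.0, 3.0, 10.0, 12.0],
})

# Global properties: summary statistics over the whole column.
print(df["value"].mean())             # 6.5
print(df["value"].describe()["max"])  # 12.0

# Group properties: compare subsets against each other.
print(df.groupby("group")["value"].mean().to_dict())  # {'a': 2.0, 'b': 11.0}
```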
Up Next

We will address EDA more and dive into advanced PANDAS operations.
Exercise 2 time!