Big Data: Insights and Applications

 
Data Information Knowledge 3
 
 
Presentation originally from the University of Texas at Austin
Edits by Rick Mercer
 
 
 
Outline
 
2
 
Review scope of big data
Searching using Indexes
Analyzing Data
 
 
Review: Big data is huge
 
3
 
50 petabytes of data = 25 trillion pages of text!
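As a rough sanity check on the figure above, the arithmetic can be sketched in Python. The sketch assumes decimal units (1 petabyte = 10^15 bytes), which the slide does not specify, and the variable names are only illustrative.

```python
# Sanity check: how many bytes per page does the slide's claim imply?
# Assumption (not from the slide): decimal units, 1 petabyte = 10**15 bytes.
BYTES_PER_PETABYTE = 10**15

total_bytes = 50 * BYTES_PER_PETABYTE   # 50 petabytes of data
total_pages = 25 * 10**12               # 25 trillion pages of text

bytes_per_page = total_bytes // total_pages
print(bytes_per_page)  # 2000, i.e. about 2 kilobytes of text per page
```

Under that assumption the claim works out to roughly 2 kilobytes of plain text per page.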
 
Big data is increasing
 
4
 
Big data is mostly unstructured
 
5
 
Try to structure part of the web?
 
6
 
Top level domain names attempted, but
.edu  .com  .org  .biz  .ca   .co.uk   .nz   .mx
Open Directory Project
Instead of applying a formula to search strings, this
lists directories that you drill into
Compare searches here and on Google for
machine learning software
 
Google Search Formula
 
7
 
Google uses programs (spiders) to index and
explore the Web:
visit webpages,
gather all of the links on each page visited, and
add them to their list of pages to visit in the future
Google takes your words and examines its
index for pages that have your words
Applies 200 questions to determine result list
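The visit, gather, and queue loop described above can be sketched as a breadth-first crawl. This is a minimal illustration over a hypothetical in-memory "web" (the `TOY_WEB` dictionary and the `crawl` function are made up for this example), not a description of how Google's spiders are actually implemented.

```python
from collections import deque

# Hypothetical toy "web": each page maps to the links found on it.
TOY_WEB = {
    "a.example": ["b.example", "c.example"],
    "b.example": ["c.example", "d.example"],
    "c.example": ["a.example"],
    "d.example": [],
}

def crawl(start, web):
    """Visit pages breadth-first, queueing newly discovered links."""
    to_visit = deque([start])   # list of pages to visit in the future
    seen = {start}              # pages already queued or visited
    visited_order = []
    while to_visit:
        page = to_visit.popleft()
        visited_order.append(page)        # "visit" the page
        for link in web.get(page, []):    # gather the links on the page
            if link not in seen:          # skip pages we already know about
                seen.add(link)
                to_visit.append(link)
    return visited_order

print(crawl("a.example", TOY_WEB))
# ['a.example', 'b.example', 'c.example', 'd.example']
```

The `seen` set keeps the crawler from looping forever when pages link back to each other.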
 
 
 
What is an index?
 
8
 
An index organizes pairings of
conceptual topics and
their locations
Google creates an
index to look things
up,  much like the
index in a book
Why do searches use
indexes?  Efficiency!
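The book-index idea can be sketched as an inverted index: a mapping from each word to the pages that contain it. The pages, `build_index`, and `search` below are hypothetical and only illustrate the lookup; a real search index is far larger and more elaborate.

```python
from collections import defaultdict

# Hypothetical pages: page name -> text on that page.
PAGES = {
    "page1": "big data is huge",
    "page2": "big data is mostly unstructured",
    "page3": "an index makes search efficient",
}

def build_index(pages):
    """Map each word to the set of pages that contain it."""
    index = defaultdict(set)
    for name, text in pages.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result

index = build_index(PAGES)
print(sorted(search(index, "big data")))  # ['page1', 'page2']
```

The work of reading every page happens once, when the index is built; each search afterwards is a few dictionary lookups instead of a scan of all the text.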
 
 
 
 
Why index?
 
An index provides an easy way to find
pertinent information related to a topic
Building indexes is difficult, but indexes make it
possible to get results in 0.5 seconds
Google's index is 100,000,000 gigabytes (about 100 petabytes)
It took over one million computing hours to build
Play the first 2 minutes of this video
 
 
 
 
9
 
Analyzing Data
 
Making sense of our world with numbers
 
Old school – the Scientific Method
 
11
 
Graph shows scientific
method we are taught
Investigate phenomena to
acquire new knowledge
Procedures vary
Statistical Hypothesis
Testing will be shown
later, as an application of
the scientific method
 
 
 
Analyzing Statistics
 
Three uses of statistical analysis are common among
scientists, mathematicians, politicians, and
other professionals across the globe.
 
12
 
Descriptive Statistics
 
1. Descriptive analytics - provides information
about collected data via statistics such as
mean, median, mode, range
These tend to 'describe' circumstances, but do not
offer conjectures about unknowns
Example: the percentage of graduates employed
within 6 months of graduating
Application: Google’s indexing of the web
Consider another site dealing with describing data
(recorded search history on any topic):
http://www.google.com/trends/
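The descriptive statistics named above can be computed with Python's standard `statistics` module. The data set below is hypothetical (months until employment for ten graduates) and was made up only to illustrate the calculations.

```python
import statistics

# Hypothetical data: months until employment for ten graduates.
months_to_job = [1, 2, 2, 3, 4, 5, 6, 6, 6, 9]

mean = statistics.mean(months_to_job)
median = statistics.median(months_to_job)
mode = statistics.mode(months_to_job)
value_range = max(months_to_job) - min(months_to_job)

# Percentage employed within 6 months, as in the slide's example.
within_six = sum(1 for m in months_to_job if m <= 6)
percent_within_six = 100 * within_six / len(months_to_job)

print(mean, median, mode, value_range, percent_within_six)
```

Every value here describes the collected data only; none of them says anything about graduates who were not measured.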
 
13
 
Predictive Analytics
 
2. Predictive Analytics is the practice of
extracting information from existing data sets in
order to determine patterns and predict future
outcomes and trends
Does not know the future, so a prediction may be wrong
Example: Given that 90 of the 100 CS graduates were
employed within 6 months in 2011, it is __ % likely
that 108 of the 120 CS graduates in 2015 will be
employed within 6 months
Upcoming Application: ranking pages based on a
search query
Ron Burgundy 8-second clip
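The reasoning behind the graduate example can be sketched in two steps: estimate a rate from the 2011 data, then project it onto the 2015 class. The third step below is an assumption that the slide does not make: it treats each graduate as an independent trial (a binomial model), which is only one of many possible ways to attach a probability and is not necessarily the number the slide has in mind.

```python
from math import comb

# Figures taken from the slide's example.
employed_2011, graduates_2011 = 90, 100
graduates_2015 = 120

# Step 1: estimate the employment rate from the existing data.
rate = employed_2011 / graduates_2011

# Step 2: project that rate onto the new class.
expected_2015 = employed_2011 * graduates_2015 / graduates_2011

# Step 3 (assumption, not from the slide): model each graduate as an
# independent trial with success probability `rate`.
def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

prob_exactly_108 = binomial_pmf(108, graduates_2015, rate)
print(expected_2015, prob_exactly_108)
```

The projection explains where 108 comes from (90% of 120). Even the most likely single outcome has a modest probability under this model, which is one reason a prediction may be wrong.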
 
14
 
Prescriptive Analytics
 
3. Prescriptive Analytics is the practice of
using existing data sets, and the patterns and
predictions drawn from them, to recommend a
course of action
Does not predict the future
Example: how likely is it that I will find a high-paying
job if I choose to major in ‘computer science’
rather than ‘biology’
Upcoming Application: autocomplete makes
recommendations based on previous rankings
 
 
15
 
How Useful are these three?
 
The following grades each type of analysis on
its utility (how useful is it?) and confidence
(how likely is it to be true and/or valid?) in the
context of decision making
 
16
 
An example of hypothesis testing
 
17
 
Car Talk Puzzler – “The Case of the Finicky Volare”
 
Not statistical hypothesis testing
 
A man thinks he is having some peculiar car trouble:
 
“It doesn't like a certain kind of ice cream I buy.”
 
He goes on to explain that he only has three flavors
of ice cream he likes:
Vanilla
Chocolate
Three-bean tofu mint chipped-beef ice cream.
 
18
 
Instead, only establish an hypothesis
 
He says:
 
“When I go to buy chocolate I park in front of the
ice cream parlor: I buy the chocolate, and my car
starts right up. I buy the vanilla, and my car starts
right up. However, if I buy the three-bean tofu mint
chipped-beef, my car won't start.”
 
What could be the issue with this car?
 
19
 
Deriving an hypothesis
 
“When you go in to buy chocolate, you go into
the freezer case and there's chocolate in a
container -- you take it, you pay for it, you get
into your car and drive away. Same thing with
vanilla, but nobody buys mint chip beef bean
tofu, right? So, somebody must hand pack the ice
cream into a special container”
What could be the issue with this car?
 
20
 
Hypothesis Example
 
The car is old.
It takes longer to purchase hand packed ice
cream than pre-packed ice cream.
Ice cream is purchased more often in the
summer when it is hot
 
Hypothesis:  A proposed explanation for this phenomenon
The car overheats and ‘vapor locks’ in the extra time
it takes to purchase three-bean tofu mint chipped-beef
ice cream.
 
 
21
 
Statistical Hypothesis Testing
 
Statistical hypothesis testing involves more than car mechanics
For a more realistic idea, view 1.5 minutes of this
example from Khan Academy
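To give a flavor of what a statistical hypothesis test computes, here is a minimal sketch of a one-sample z-test, one common form of such a test. The function name and all of the numbers are hypothetical and chosen only for illustration; they are not taken from the slides or from the linked example.

```python
from math import erfc, sqrt

def z_test(sample_mean, population_mean, sample_sd, n):
    """Return the z statistic and two-sided p-value for a sample mean."""
    standard_error = sample_sd / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    # Two-sided tail area under the standard normal curve.
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value

# Hypothetical numbers: a sample of 100 measurements averaging 1.05,
# compared against a claimed population mean of 1.2.
z, p = z_test(sample_mean=1.05, population_mean=1.2, sample_sd=0.5, n=100)
print(z, p)
```

A small p-value means the observed sample would be unlikely if the claimed population mean were true, which counts as evidence against that claim. The z-test itself assumes a reasonably large sample.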
 
 
 
22
 
Exploratory data analysis
 
Exploratory data analysis is another method of
scientific inquiry
Utilize big data statistics to postulate
correlations (data sets that are linked
together) that have not yet been hypothesized
Attempts to discover patterns in order to
establish correlative links
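Looking for correlative links can be sketched with the Pearson correlation coefficient, one common measure of how closely two data sets move together. The data below are hypothetical, and `pearson` is a plain hand-written version for illustration rather than any particular library's implementation.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical weekly observations.
temperature = [60, 65, 70, 75, 80, 85]
ice_cream_sales = [20, 24, 31, 33, 41, 45]
tire_searches = [12, 9, 14, 10, 13, 11]

print(pearson(temperature, ice_cream_sales))  # close to 1: strong link
print(pearson(temperature, tire_searches))    # close to 0: little or no link
```

A value near 1 or -1 suggests a pattern worth investigating; it does not show that one data set causes the other.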
 
23
 
An example of exploratory data analysis
 
24
 
Data from millions of searches to predict what
you are looking for
Discovers patterns
Correlations not guaranteed
 
 
Analog autocomplete
 
25
 
“How to change a _______”
         
26
 
[Chart: x-axis lists the words that fill in the blank; y-axis shows the frequency of each]
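The frequency chart above can be sketched as a simple count over past searches. The search log and the `suggest` function are hypothetical; real autocomplete draws on millions of searches and many more signals than raw frequency.

```python
from collections import Counter

# Hypothetical search log.
SEARCH_LOG = [
    "how to change a tire",
    "how to change a lightbulb",
    "how to change a tire",
    "how to change a diaper",
    "how to change a tire",
    "how to change a lightbulb",
    "how to tie a tie",
]

def suggest(prefix, log, limit=3):
    """Rank completions of `prefix` by how often they appear in the log."""
    counts = Counter(
        query[len(prefix):].strip()
        for query in log
        if query.startswith(prefix) and len(query) > len(prefix)
    )
    return counts.most_common(limit)

print(suggest("how to change a", SEARCH_LOG))
# [('tire', 3), ('lightbulb', 2), ('diaper', 1)]
```

The counts are the y-axis of the chart and the completions are the x-axis; the suggestions are simply the tallest bars.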
 
Exploratory data analysis
 
27
 
Some big data sets and collections
 
Google Public Data Explorer (130 datasets from Bureau of Labor Statistics, U.S. Census Bureau, etc.)
data.gov - an online repository of datasets from U.S. Government
Many counties have searchable property databases, such as the Travis County Appraisal District
Many counties have searchable legal databases, such as the Travis County Clerk
Some data sets defy categorization, such as the Texas Death Row Executions data set
Google's Ngram Data - data on Google's catalog of millions of books, including raw data sets
Google Trends - detailed search history information, including CSV downloads
NOAA National Climatic Data Center
Knoema - "free to use public and open data platform for users with interests in statistics and data analysis, visual storytelling and making infographics"
Geocommons - "all about open data analysis and maps"
Stat Silk - "interactive maps of open data"
Better World Flux - "a beautiful interactive visualization of information on what really matters in life"
Gapminder - "unveiling the beauty of statistics for a better world view"
 
28

Explore the world of big data through images and descriptions covering topics such as data organization, the increase in big data, unstructured data, search algorithms, indexing, and the efficiency of using indexes in searches. Discover the significance of indexes in retrieving information quickly and efficiently in the vast realm of data.


Uploaded on Sep 16, 2024 | 1 Views


Presentation Transcript


  4. Big data is increasing [Chart: "Big Data Created" by year, 2005 to 2013]

  5. Big data is mostly unstructured [Chart: "Big Data Organization", unstructured data vs. structured data]


  16. How Useful are these three? The following grades each type of analysis on its utility (how useful is it?) and confidence (how likely is it to be true and/or valid?) in the context of decision making

      Analysis Type   Utility Level   Confidence Level   Example
      Descriptive     C               A+                 Obama got x% of the vote
      Predictive      B+              B-                 x% chance of winning
      Prescriptive    A               C                  How to run the campaign


  27. Exploratory data analysis

      Statistical Hypothesis Testing (example: injections into rats)
        Confidence: since the test is constrained to a specific issue (and the variables are known), confidence can be really high
        Scope: works over a typically circumscribed problem
        Sample size: power increases with sample size, sometimes able to determine a minimum sample size to guarantee a desired confidence

      Exploratory Data Analysis (example: autocomplete)
        Confidence: confidence is typically lower, because the data are messier and the connections among them are unknown
        Scope: usually applied to much larger datasets with more unknowns
        Sample size: power is dependent on sample size, because there is no established hypothesis and it is unknown how large a sample size must be to discover knowledge
