Understanding Big Data: Insights and Applications
Explore the world of big data through images and descriptions covering topics such as data organization, the increase in big data, unstructured data, search algorithms, indexing, and the efficiency of using indexes in searches. Discover the significance of indexes in retrieving information quickly and efficiently in the vast realm of data.
Presentation Transcript
Data, Information, Knowledge 3. Presentation originally from the University of Texas at Austin; edits by Rick Mercer (by-nc-nd license image).
Outline: Review scope of big data. Searching using indexes. Analyzing data.
Review: Big data is huge. 50 petabytes of data = 25 trillion pages of text!
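The two figures on this slide imply a page size, which is easy to check. The sketch below assumes decimal petabytes (10^15 bytes); the slide does not say which unit convention it uses.

```python
# Back-of-the-envelope check of the slide's figures.
# Assumption: decimal (SI) petabytes, i.e. 10**15 bytes each.
PETABYTE = 10**15

total_bytes = 50 * PETABYTE      # 50 petabytes of data
pages = 25 * 10**12              # 25 trillion pages of text

bytes_per_page = total_bytes // pages
print(bytes_per_page)            # 2000
```

Under that assumption the slide is treating a page of text as roughly 2,000 bytes, or about 2,000 plain characters.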
Big data is increasing. [Chart: Big Data Created, by year, 2005 to 2013]
Big data is mostly unstructured. [Chart: Big Data Organization, unstructured data vs. structured data]
Try to structure part of the web? Top-level domain names were attempted, but: .edu, .com, .org, .biz, .ca, .co.uk, .nz, .mx. Open Directory Project: instead of applying a formula to search strings, it lists directories that you drill into. Compare searches in the Open Directory Project and on Google for "machine learning software".
Google Search Formula: Google uses programs (spiders) to explore and index the Web. They visit webpages, gather all of the links on each page visited, and add them to a list of pages to visit in the future. Google takes your words and examines its index for pages that contain them. It then applies 200 questions to determine the result list.
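The visit, gather links, queue for later loop described on this slide can be sketched in a few lines. This is only an illustration over a made-up in-memory "web" (the `TOY_WEB` dictionary and the `crawl` function are hypothetical names); it is not how Google's spiders are actually implemented.

```python
from collections import deque

# Hypothetical miniature "web": page name -> links found on that page.
TOY_WEB = {
    "home": ["about", "news"],
    "about": ["home", "team"],
    "news": ["home", "archive"],
    "team": [],
    "archive": ["news"],
}

def crawl(start, web):
    """Visit pages breadth-first, queueing each newly seen link for a later visit."""
    to_visit = deque([start])   # pages waiting to be visited
    seen = {start}              # pages already queued, so none is visited twice
    visited_order = []
    while to_visit:
        page = to_visit.popleft()
        visited_order.append(page)
        for link in web.get(page, []):
            if link not in seen:
                seen.add(link)
                to_visit.append(link)
    return visited_order

order = crawl("home", TOY_WEB)
print(order)  # ['home', 'about', 'news', 'team', 'archive']
```

A real spider would fetch pages over the network and respect site rules; the queue-plus-seen-set structure is the part that corresponds to the slide.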
What is an index? An index organizes pairings of conceptual topics and their locations. Google creates an index to look things up, much like the index in a book. Why do searches use indexes? Efficiency!
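A book-style index of topic and location pairings can be sketched as a mapping from each word to the set of pages that contain it. The page contents and the names `PAGES`, `build_index`, and `search` below are invented for illustration; a production search index is far more elaborate.

```python
from collections import defaultdict

# Hypothetical pages: page id -> page text.
PAGES = {
    "page1": "big data is huge",
    "page2": "an index makes search fast",
    "page3": "big indexes take time to build",
}

def build_index(pages):
    """Map each word to the set of page ids where it appears."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word].add(page_id)
    return index

def search(index, query):
    """Return the pages that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return set()
    results = set(index.get(words[0], set()))
    for word in words[1:]:
        results &= index.get(word, set())
    return results

index = build_index(PAGES)
print(sorted(search(index, "big data")))  # ['page1']
```

The efficiency comes from the lookup step: answering a query touches only the entries for the query words instead of rereading every page.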
Why index? An index provides an easy way to find pertinent information related to a topic. Building indexes is difficult, but indexes make it possible to get results in 0.5 seconds. The Google index is 100,000,000 gigabytes, and it took over one million computing hours to build. Play the first 2 minutes of this video.
Analyzing Data: Making sense of our world with numbers
Old school: the Scientific Method. The graph shows the scientific method we are taught. Investigate phenomena to acquire new knowledge. Procedures vary. Statistical hypothesis testing will be shown later as an application of the scientific method.
Analyzing Statistics: Three kinds of statistical analysis are commonly used by scientists, mathematicians, politicians, and other professionals across the globe.
Descriptive Statistics: 1. Descriptive analytics provides information about collected data via statistics such as mean, median, mode, and range. These tend to 'describe' circumstances but do not offer conjectures about unknowns. Example: the percentage of graduates employed within 6 months of graduating. Application: Google's indexing of the web. Consider another site that deals with describing data (recorded search history on any topic): http://www.google.com/trends/
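The statistics named on this slide are all available in Python's standard library. The data set below is made up for illustration (months each of eight hypothetical graduates took to find a job); only the function calls are real.

```python
import statistics

# Hypothetical data: months each graduate took to find a job.
months_to_job = [1, 2, 2, 3, 4, 6, 8, 12]

mean = statistics.mean(months_to_job)                  # 4.75
median = statistics.median(months_to_job)              # 3.5
mode = statistics.mode(months_to_job)                  # 2
value_range = max(months_to_job) - min(months_to_job)  # 11

# Percentage of graduates employed within 6 months (the slide's example).
within_6 = sum(m <= 6 for m in months_to_job)
pct_within_6 = 100 * within_6 / len(months_to_job)     # 75.0

print(mean, median, mode, value_range, pct_within_6)
```

Each number summarizes the data that was collected; none of them says anything about graduates who are not in the list.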
Predictive Analytics: 2. Predictive analytics is the practice of extracting information from existing data sets in order to determine patterns and predict future outcomes and trends. It does not foretell the future, and its predictions may be wrong. Example: Given that 90 of the 100 CS graduates were employed within 6 months in 2011, it is __ % likely that 108 of the 120 CS graduates in 2015 will be employed within 6 months. Upcoming application: ranking pages based on a search query. Ron Burgundy 8-second clip.
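The slide leaves the likelihood blank. One possible way to reason about it, offered only as a sketch, is to assume that each 2015 graduate is employed independently with the same rate observed in 2011 (a binomial model). That assumption is not stated in the slides, and the helper `binomial_pmf` is a name introduced here.

```python
from math import comb

employed_2011, graduates_2011 = 90, 100
graduates_2015 = 120

# Observed 2011 rate, carried forward as the assumed 2015 probability.
rate = employed_2011 / graduates_2011

# Expected number employed in 2015 if the rate holds (integer arithmetic first).
expected = employed_2011 * graduates_2015 / graduates_2011   # 108.0

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

p_exactly_108 = binomial_pmf(108, graduates_2015, rate)
p_at_least_108 = sum(
    binomial_pmf(k, graduates_2015, rate)
    for k in range(108, graduates_2015 + 1)
)

print(f"expected employed: {expected:.0f}")
print(f"P(exactly 108): {p_exactly_108:.3f}")
print(f"P(at least 108): {p_at_least_108:.3f}")
```

The expected count matches the slide's 108, but the probabilities depend entirely on the modeling assumption, which is why a prediction of this kind may be wrong.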
Prescriptive Analytics: 3. Prescriptive analytics builds on the same practice of extracting information from existing data sets to determine patterns and predict outcomes and trends, and then uses those predictions to recommend a course of action. It does not foretell the future. Example: how likely is it that I will find a high-paying job if I choose to major in computer science rather than biology? Upcoming application: autocomplete makes recommendations based on previous rankings.
How useful are these three? The following grades each type of analysis on its utility (how useful is it?) and confidence (how likely is it to be true and/or valid?) in the context of decision making.
Descriptive: utility level C, confidence level A+. Example: Obama got x% of the vote.
Predictive: utility level B+, confidence level B-. Example: x% chance of winning.
Prescriptive: utility level A, confidence level C. Example: how to run the campaign.
An example of hypothesis testing: Car Talk Puzzler, The Case of the Finicky Volare
Not statistical hypothesis testing: A man thinks he is having some peculiar car trouble: "It doesn't like a certain kind of ice cream I buy." He goes on to explain that there are only three flavors of ice cream he likes: vanilla, chocolate, and three-bean tofu mint chipped-beef ice cream.
Instead, only establish a hypothesis: He says: "When I go to buy chocolate I park in front of the ice cream parlor: I buy the chocolate, and my car starts right up. I buy the vanilla, and my car starts right up. However, if I buy the three-bean tofu mint chipped-beef, my car won't start." What could be the issue with this car?
Deriving a hypothesis: When you go in to buy chocolate, you go to the freezer case and there's chocolate in a container: you take it, you pay for it, you get into your car and drive away. Same thing with vanilla. But nobody buys mint chip beef bean tofu, right? So somebody must hand-pack the ice cream into a special container. What could be the issue with this car?
Hypothesis Example: The car is old. It takes longer to purchase hand-packed ice cream than pre-packed ice cream. Ice cream is purchased more often in the summer, when it is hot. Hypothesis (a proposed explanation for this phenomenon): The car overheats and vapor locks in the extra time it takes to purchase three-bean tofu mint chipped-beef ice cream.
Statistical Hypothesis Testing: It's more than car mechanics. To get a more realistic idea, view 1.5 minutes of this example from Khan Academy.
Exploratory data analysis: Exploratory data analysis is another method of scientific inquiry. It uses big data statistics to postulate correlations (data sets that are linked together) that have not yet been hypothesized, and it attempts to discover patterns in order to establish correlative links.
An example of exploratory data analysis: Data from millions of searches is used to predict what you are looking for. It discovers patterns. Correlations are not guaranteed.
Analog autocomplete
How to change a _______ [Chart: y-axis is frequency; x-axis is the words in the blank]
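The frequency chart for the blank can be reproduced by counting how often each completion follows a prefix in a log of past searches. The search log and the `suggest` function below are invented for illustration; actual autocomplete systems draw on far more data and signals.

```python
from collections import Counter

# Hypothetical search log; a real system would draw on millions of queries.
SEARCH_LOG = [
    "how to change a tire",
    "how to change a tire",
    "how to change a lightbulb",
    "how to change a diaper",
    "how to change a tire",
    "how to change a lightbulb",
    "how to tie a tie",
]

def suggest(prefix, log, top_n=3):
    """Rank the words that most often fill the blank after the prefix."""
    completions = Counter(
        query[len(prefix):].strip()
        for query in log
        if query.startswith(prefix) and len(query) > len(prefix)
    )
    return completions.most_common(top_n)

suggestions = suggest("how to change a ", SEARCH_LOG)
print(suggestions)  # [('tire', 3), ('lightbulb', 2), ('diaper', 1)]
```

The counts are the y-axis of the chart and the completions are the x-axis; the pattern is discovered from the data rather than from a hypothesis stated in advance.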
Exploratory data analysis: comparison of statistical hypothesis testing (example: injections into rats) and exploratory data analysis (example: autocomplete).
Confidence. Statistical hypothesis testing: since the test is constrained to a specific issue (and the variables are known), confidence can be really high. Exploratory data analysis: confidence is typically lower, because the data are messier and the connections among them are unknown.
Scope. Statistical hypothesis testing: works over a typically circumscribed problem. Exploratory data analysis: usually applied to much larger datasets with more unknowns.
Sample size. Statistical hypothesis testing: power increases with sample size, and it is sometimes possible to determine a minimum sample size that guarantees a desired confidence. Exploratory data analysis: power is dependent on sample size, because there are no established hypotheses and it is unknown how large a sample must be to discover knowledge.
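As one simple instance of determining a minimum sample size for a desired confidence, the sketch below uses the standard margin-of-error formula for estimating a proportion, n = z^2 p(1 - p) / e^2. The slides do not name a specific formula, so this choice and the function name `min_sample_size` are assumptions made for illustration.

```python
from math import ceil
from statistics import NormalDist

def min_sample_size(confidence, margin_of_error, proportion=0.5):
    """Smallest n whose margin of error for a proportion is within the target.

    proportion=0.5 is the most conservative guess (it maximizes p * (1 - p)).
    """
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return ceil(z**2 * proportion * (1 - proportion) / margin_of_error**2)

n = min_sample_size(confidence=0.95, margin_of_error=0.05)
print(n)  # 385
```

Tightening the margin of error to 0.01 at the same confidence raises the requirement to 9604, which shows how quickly the needed sample grows. No comparable calculation is available for exploratory analysis, where there is no stated hypothesis to size the sample against.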
Some big data sets and collections:
Google Public Data Explorer (130 datasets from Bureau of Labor Statistics, U.S. Census Bureau, etc.)
data.gov - an online repository of datasets from the U.S. Government
Many counties have searchable property databases, such as the Travis County Appraisal District
Many counties have searchable legal databases, such as the Travis County Clerk
Some data sets defy categorization, such as the Texas Death Row Executions data set
Google's Ngram Data - data on Google's catalog of millions of books, including raw data sets
Google Trends - detailed search history information, including CSV downloads
NOAA National Climatic Data Center
Knoema - "free to use public and open data platform for users with interests in statistics and data analysis, visual storytelling and making infographics"
Geocommons - "all about open data analysis and maps"
Stat Silk - "interactive maps of open data"
Better World Flux - "a beautiful interactive visualization of information on what really matters in life"
Gapminder - "unveiling the beauty of statistics for a better world view"