Data Management Essentials
This content covers the fundamentals of data, information, knowledge, and the importance of database technology. It details the differences between data, information, and knowledge, the role of database management systems, characteristics of data warehouses, and the significance of data mining in organizations. Additionally, it addresses data quality issues such as redundancy, inconsistency, and integrity. The material also explains the concept of databases, including their organization, benefits, and types like flat and relational databases. Overall, it provides a foundational understanding of data and its management.
Download Presentation

Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.
E N D
Presentation Transcript
CHAPTER 6 DATA AND KNOWLEDGE MANAGEMENT
CHAPTER OBJECTIVES Describe the differences between data, information, and knowledge; Describe why database technology must be used for data resource management; Define the term database and identify the steps to creating one; Describe the role of a database management system; Describe the characteristics of a data warehouse; and Define data mining and describe its role in an organization
DATA, INFORMATION, AND KNOWLEDGE Data: Raw Facts, devoid of context Information: processed data that possess context, relevance, and purpose (example: monthly sales calculated from the collected daily sales data for the past year) Knowledge: consumption of information produces knowledge. Tacit Knowledge: insights and intuitions Explicit Knowledge: can be expressed into words and numbers Once data is put into context, aggregated and analyzed, it can be used to make decisions
DATA QUALITY Issues with data Redundant Inconsistent Inaccurate Corrupted
Redundancy: Student 4567 is Mary Brown, and her major is Finance is stored more than once. Inconsistency: when the data system is huge, making changes to all redundant data is difficult if not impossible Integrity: Alex Wilson received a grade B in MKT211; however, you can t find Alex Wilson in the student roster Lack of integrity leads to corrupted data
WHAT IS A DATABASE? It s an organized collection of data It s the most common way to store large amounts of data Users can store and retrieve the data easily Minimize redundancy Eliminate inconsistency and enforce data integrity
FLAT VS RELATIONAL DATABASE Flat database Single collection of data Data not related to any other collections of data Spreadsheets Relational database Multiple collections of data May be related to each other
HOW DATA IS ORGANIZED Bit Binary Digit -- 1 or 0 Byte -- 8 bits a single character field -- One single data item. record a set of related fields. Also called a row. Table a set of related records. Also called a relation. Database a set of related tables. Also called a relational database.
EXAMPLE: CUSTOMER TABLE Fields (columns) Customer number 12 LastName FirstName Archer Jeffrey Records 25 Clarke Martha (rows) 30 Owen Hayes 47 Parker Stephen
TERMINOLOGY A database management system (DBMS)is the software to manage a database. Access 2016/2019 is Microsoft s DBMS MS Access is for Pesonal Use Oracle, MS SQL Server, MySQL etc. for Industrial Use
WHAT DO YOU PUT INTO A DATABASE? Data is stored for entities, their attributes, and the relationships between those entities.
TERMINOLOGY: Entity a real or abstract object in the real world Examples -- customer, automobile, order Like a noun person, place or thing An entity becomes a table in the database.
TERMINOLOGY Attribute characteristic of an entity Example: height, weight, name, address, salary, age, (fields of a person) Properties that are of interest to the particular application An attribute becomes a column in the table.
TERMINOLOGY Relationship a connection or link between entities Example: A customer places orders
THIS DIAGRAM SHOWS THE ENTITIES AND THEIR RELATIONSHIP Customer Order places
ENTITY-RELATIONSHIP DIAGRAM (ERD) Customer Customer Number First Name Last Name Order Order Number Date places
Customer Order places Order number 1 2 3 4 Customer number 25 30 12 12 Date 7/12/08 7/12/08 7/14/08 7/15/08 Customer number 12 25 30 47 LastName FirstName Archer Clarke Owen Parker Jeffrey Martha Hayes Stephen
CUSTOMER NUMBER IS CALLED THE COMMON FIELD BECAUSE IT APPEARS IN BOTH TABLES. Customer Customer number 12 25 30 47 LastName FirstName Archer Clarke Owen Parker Jeffrey Martha Hayes Stephen Order Order number 1 2 3 4 Customer number 25 30 12 12 Date 7/12/14 7/12/14 7/14/14 7/15/14 The common field sets up the Martha Clarke placed Order 1
QUERY: WHICH ORDER(S) DID JEFFREY ARCHER PLACE? Customer Customer number 12 25 30 47 LastName FirstName Archer Clarke Owen Parker Jeffrey Martha Hayes Stephen Order Order number 1 2 3 4 Customer number 25 30 12 12 Date 7/12/14 7/12/14 7/14/14 7/15/14 He placed both Order number 3 and 4.
PRIMARY KEY AN IDENTIFIER, A FIELD THAT IS UNIQUE FOR EACH RECORD Primary key Customer number 12 25 30 47 LastName FirstName Archer Clarke Owen Parker Jeffrey Martha Hayes Stephen
FOREIGN KEY A FIELD IN ANOTHER TABLE THAT MATCHES ONE OF THE VALUES OF THE PRIMARY KEY Primary key Customer number 12 25 30 47 LastName FirstName Archer Clarke Owen Parker Jeffrey Martha Hayes Stephen Foreign key Order number 1 2 3 4 Customer number 25 30 12 12 Date 7/12/14 7/12/14 7/14/14 7/15/14
FOREIGN KEY A FIELD IN ANOTHER TABLE THAT MATCHES ONE OF THE VALUES OF THE PRIMARY KEY Customer number 12 25 30 47 LastName FirstName Archer Clarke Owen Parker Jeffrey Martha Hayes Stephen Can this be correct? Order number 1 2 3 4 5 Customer number 25 30 12 12 56 Date 7/12/14 7/12/14 7/14/14 7/15/14 7/16/14
NORMALIZATION To normalize a database means to design it in a way that: 1) reduces data redundancy 2) ensure data integrity.
BIG DATA, DATA WAREHOUSING, DATA MINING
DATA WAREHOUSE Organize and maintain the data: typically in a mD model Provide operators for navigation Often depends on human user to detect patterns The patterns are explicit but may be hidden Typically deals with summarized data
DATA WAREHOUSE Non-operational Data Uses a copy of data from the active databases that the company uses in its day-to-day operations The data warehouse must pull data from the existing databases on a regular, scheduled basis Time-variant Data Whenever data is loaded into the data warehouse, it receives a time stamp hence allows for comparisons between different time periods Standardized Data ETL: Extraction, Transformation, Loading
HISTORY OF DATA WAREHOUSING Data warehouses extend the transformation of data into information In the 1990 s executives became less concerned with the day-to-day business operations and more concerned with overall business functions The data warehouse provided the ability to support decision making without disrupting the day-to-day operations
DATA WAREHOUSE FUNDAMENTALS Data warehouse a logical collection of information gathered from many different operational databases that supports business analysis activities and decision-making tasks The primary purpose of a data warehouse is to aggregate information throughout an organization into a single repository for decision-making purposes
DATA WAREHOUSE FUNDAMENTALS Extraction, transformation, and loading (ETL) a process that extracts information from internal and external databases, transforms the information using a common set of enterprise definitions, and loads the information into a data warehouse Data mart contains a subset of data warehouse information
DATA WAREHOUSE FUNDAMENTALS
MULTIDIMENSIONAL ANALYSIS Databases contain information in a series of two- dimensional tables In a data warehouse and data mart, information is multidimensional, it contains layers of columns and rows Dimension a particular attribute of information
MULTIDIMENSIONAL ANALYSIS AND DATA CUBE Cube common term for the representation of multidimensional information
Data WarehouseTime Variant The time aspect for the data warehouse is significantly longer than that of operational systems. Operational database: current value data. Data warehouse data: provide information from a historical perspective (e.g., past 5-10 years) Every key structure in the data warehouse Contains an element of time, explicitly or implicitly But the key of operational data may or may not contain time element .
Data WarehouseNon-Volatile A physically separate store of data transformed from the operational environment. Operational update of data does not occur in the data warehouse environment. Does not require transaction processing, recovery, and concurrency control mechanisms Requires only two operations in data accessing: initial loading of data and access of data.
OLTP vs. OLAP users function DB design data OLTP clerk, IT professional day to day operations application-oriented current, up-to-date detailed, flat relational isolated repetitive read/write index/hash on prim. key short, simple transaction Tens or hundreds thousands 100MB-GB transaction throughput OLAP knowledge worker decision support subject-oriented historical, summarized, multidimensional integrated, consolidated ad-hoc lots of scans usage access unit of work # records accessed #users DB size metric complex query millions dozens 100GB-TB query throughput
Typical OLAP Operations Roll up (drill-up): summarize data by climbing up hierarchy or by dimension reduction Drill down (roll down): reverse of roll-up from higher level summary to lower level summary or detailed data, or introducing new dimensions Slice and dice: project and select Pivot (rotate): reorient the cube, visualization, 3D to series of 2D planes. Other operations drill across: involving (across) more than one fact table drill through: through the bottom level of the cube to its back-end relational tables (using SQL)
MULTIDIMENSIONAL ANALYSIS AND DATA MINING Data mining the process of analyzing data to extract information not offered by the raw data alone To perform data mining users need data-mining tools Data-mining tool uses a variety of techniques to find patterns and relationships in large volumes of information and infers rules that predict future behavior and guide decision making
A DATASET Data/points/instances/examples/samples/records: rows Features/attributes/dimensions/independent variables/covariates/predictors/regressors: columns Target/outcome/response/label/dependent variable: special column to be predicted
MAIN TYPES OF COLUMNS Continuous: a number, like an age or height Discrete: a symbol, like cat or dog
MAIN GOAL OF LEARNING: PREDICTION The setup: 1. You obtain some kind of model based on some examples, or training data, through a process called learning (also estimation). 2. Then you use that model to predict something about data you haven t seen before, but that comes from the same distribution as the training data, called test data.
LEARNING TASKS Supervised learning: We re predicting a target variable for which we get to see examples. Regression: predict a continuous target variable Classification: predict a discrete target variable Unsupervised learning: We re predicting a target variable for which we never get to see examples. Clustering: predict clusters Dimensionality reduction: predict new features or select important features while discarding others
CLASSIFICATION We re predicting a discrete target variable. Supervised.
EXAMPLE: CANCER PREDICTION Application: automatic disease detection Importance: this is modern/future medical diagnosis. Prediction goal: Based on past patients, predict whether new patients have the disease Data: Past patients with and without the disease Target: Cancer or no-cancer Features: Concentrations of various proteins in your blood
EXAMPLE: NETFLIX Application: automatic product recommendation Importance: this is the modern/future shopping. Prediction goal: Based on past preferences, predict which movies you might want to watch Data: Past movies you have watched Target: Like or don t-like Features: ?
EXAMPLE: CALL CENTERS Application: automatic call routing Importance: this is modern/future customer service. Prediction goal: Based on your speech recording, predict which words you said Data: Past recordings of various people Target: Which word was intended Features: ?
EXAMPLE: STOCK MARKET Application: automatic program trading Importance: this is modern/future finance. Prediction goal: Based on past patterns, predict whether the stock will go up Data: Past stock prices Target: Up or down Features: ?
HOW IT ALL FITS TOGETHER Pattern Evaluation Data mining: the core of knowledge discovery process. Data Mining Task-relevant Data Selection Data Warehouse Data Cleaning Data Integration Databases
GROWTH OF MACHINE LEARNING Last 15 years: Applications in industry - data mining, big data etc Applications in science - computational biology, etc Fast-growing presence in AI, statistics, applied math, etc Reasons: Data is everywhere. This phenomenon is growing. Many modeling problems are more easily cast as data (learning) problems. Both widely useful and intellectually rich (both mathematics and algorithms).