
Effective Practices for Data Acquisition and Modeling
Explore the world of data acquisition, modeling, and utilization in various real-world scenarios. Discover different types of data, data preprocessing steps, and the convenience of using vector data for modeling. Learn how to identify existing datasets, the importance of data acquisition in different projects, and the significance of vector data in modeling algorithms.
Data and modeling
Raw data -> cleaned data -> transformed data -> feature extraction -> modeling, y = f(x)
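The stages on this slide can be sketched end to end on a toy example. This is only an illustration under stated assumptions: the records, the `clean`, `transform`, and `fit_line` helpers, and the choice of a one-feature least-squares line for f are all hypothetical and not taken from the slides.

```python
from statistics import mean

# Hypothetical raw records: (size in square feet, price in $1000s), all as strings.
raw = [("1000", "200"), ("1500", "300"), ("", "250"), ("2000", "400"), ("2500", "n/a")]

def clean(rows):
    """Raw -> cleaned: keep only rows where both fields parse as numbers."""
    cleaned = []
    for size, price in rows:
        try:
            cleaned.append((float(size), float(price)))
        except ValueError:
            continue  # drop rows with missing or malformed values
    return cleaned

def transform(rows):
    """Cleaned -> transformed: rescale size to thousands of square feet."""
    return [(size / 1000.0, price) for size, price in rows]

def fit_line(rows):
    """Modeling: least-squares fit of y = w * x + b for a single feature x."""
    xs = [x for x, _ in rows]
    ys = [y for _, y in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    w = sum((x - x_bar) * (y - y_bar) for x, y in rows) / sum((x - x_bar) ** 2 for x in xs)
    b = y_bar - w * x_bar
    return w, b

# Feature extraction is trivial here: the single rescaled size is the feature.
rows = transform(clean(raw))
w, b = fit_line(rows)

def f(x):
    """The fitted model y = f(x)."""
    return w * x + b
```

Calling `f(3.0)` then predicts a price for a 3000 square foot house from the three rows that survived cleaning.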
Types of data you may see in practice
Raw data
- Tabular data (mostly from business applications; relational data)
- Image
- Audio
- Video
- Text
- Sequences (biomedical)
- Graphs (e.g., social networks)
Processed data (ready for modeling)
- Vector data
- Image/text data (deep learning)
- Graph data
Real-world scenarios
- Personal projects/course projects
  - Data is often given, and is sometimes already well cleaned
- Research projects
  - Often there is no data, so you find a novel way to generate it (however, identifying the research problem is more important)
- Company projects
  - Large datasets are often available (at low cost) from business workflows
  - If not, the company will buy datasets from other sources or develop a data acquisition pipeline
Discover what data is available
- Identify existing datasets (try public datasets first)
  - Benchmark datasets to evaluate a new idea
    - E.g., a diverse set of small to medium public datasets for a new hyperparameter tuning algorithm; large-scale datasets for a very big deep neural network
  - Popular places: Kaggle, UCI ML databases, etc.
- Extract from existing workflows
Can the raw data be used directly?
- Most of it cannot
  - It goes through various data preprocessing steps; the end product is often vector data
- Some of it needs less preprocessing, especially for deep learning
  - Images
  - Text data
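As a rough illustration of preprocessing that ends in vector data, the sketch below turns tabular records into fixed-length vectors by keeping numeric columns and one-hot encoding a categorical column. The records, column names, and the `build_vocab` and `to_vector` helpers are hypothetical.

```python
def build_vocab(records, column):
    """Sorted list of the distinct category values seen in a column."""
    return sorted({r[column] for r in records})

def to_vector(record, numeric_cols, categorical_vocabs):
    """Concatenate numeric values with one-hot encodings of categorical values."""
    vec = [float(record[c]) for c in numeric_cols]
    for col, vocab in categorical_vocabs.items():
        vec.extend(1.0 if record[col] == v else 0.0 for v in vocab)
    return vec

# Hypothetical tabular records (one dict per row).
records = [
    {"bedrooms": 3, "area": 120.0, "type": "house"},
    {"bedrooms": 1, "area": 45.5, "type": "apartment"},
    {"bedrooms": 2, "area": 80.0, "type": "condo"},
]

vocabs = {"type": build_vocab(records, "type")}
vectors = [to_vector(r, ["bedrooms", "area"], vocabs) for r in records]
```

Every record becomes a vector of the same length (two numeric entries plus one slot per category), which is the form most modeling algorithms expect.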
Why vector data is convenient
- Most modeling algorithms expect vector data as input
- Why?
  - Well-developed mathematical methods can be applied directly
    - Linear algebra
    - Multivariate statistics
  - Intuition (visualization) can be built around the vector space
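To make the point concrete, here is a minimal sketch of a few operations that become available once every example is a vector. The function names are illustrative; in practice a library such as NumPy would normally be used instead of plain lists.

```python
import math

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def euclidean_distance(u, v):
    """Straight-line distance between two points in the vector space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def mean_vector(vectors):
    """Component-wise mean, a basic multivariate statistic."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]
```

Distances and angles support geometric intuition (which examples are "close"), while the mean vector is the starting point for statistics such as covariance.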
Popular ML datasets
- Small/medium tabular data
  - UCI ML database (kind of old)
  - Kaggle: a diverse and large collection
- Used in deep learning
  - MNIST: handwritten digits from US Census Bureau employees and high school students
  - ImageNet: millions of images from image search engines
  - AudioSet: YouTube sound clips for sound classification
  - LibriSpeech: 1000 hours of English speech from audiobooks
  - Kinetics: YouTube video clips for human action classification
  - KITTI: traffic scenarios recorded by cameras and other sensors
  - Amazon Review: customer reviews from Amazon online shopping
  - SQuAD: question-answer pairs derived from Wikipedia
- More at https://en.wikipedia.org/wiki/List_of_datasets_for_machine-learning_research
Explore more datasets
- Papers with Code Datasets: academic datasets with leaderboards
- Kaggle Datasets: ML datasets uploaded by data scientists
- Google Dataset Search: search for datasets on the Web
- Datasets bundled with various toolkits: TensorFlow, Hugging Face
- Various conference/company ML competitions
- Open Data on AWS: 100+ large-scale raw datasets
- Data lakes in your own organization
Dataset comparison
- You often need to deal with raw data in industrial settings
- Data curation can be a big project involving multiple teams, processing pipelines, storage, legal issues, privacy, etc.
Data integration
- Combine data from multiple sources into a coherent dataset
- Product data is often stored in multiple tables
  - E.g., a table for house information, a table for sales, a table for listing agents
- Join tables by keys, which are often entity IDs
- Key issues: identifying IDs, missing rows, redundant columns, value conflicts
- More in-depth study: COSC4800/5800 Database Systems
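The house/sales/agents example can be sketched as a key-based join that also reports the issues listed on the slide. All table contents, field names, and the `integrate` function are hypothetical; a real project would more likely use SQL joins or a dataframe library.

```python
# Hypothetical tables. Houses and agents are keyed by their entity IDs.
houses = {
    "h1": {"address": "1 Main St", "agent_id": "a1", "sqft": 1000},
    "h2": {"address": "2 Oak Ave", "agent_id": "a2", "sqft": 1500},
    "h3": {"address": "3 Elm Rd", "agent_id": "a9", "sqft": 2000},
}
agents = {"a1": {"name": "Kim"}, "a2": {"name": "Lee"}}
# Sales refer to houses by ID and redundantly repeat the sqft column.
sales = [
    {"house_id": "h1", "sqft": 1000, "price": 300},
    {"house_id": "h2", "sqft": 1450, "price": 440},
    {"house_id": "h3", "sqft": 2000, "price": 500},
    {"house_id": "h4", "sqft": 1200, "price": 600},
]

def integrate(houses, sales, agents):
    """Join sales -> houses -> agents by ID and collect integration issues."""
    joined, issues = [], []
    for sale in sales:
        house = houses.get(sale["house_id"])
        if house is None:
            issues.append(("missing house", sale["house_id"]))  # missing row
            continue
        if house["sqft"] != sale["sqft"]:
            issues.append(("sqft conflict", sale["house_id"]))  # value conflict
        agent = agents.get(house["agent_id"])
        if agent is None:
            issues.append(("missing agent", house["agent_id"]))  # missing row
        joined.append({
            "house_id": sale["house_id"],
            "address": house["address"],
            "sqft": house["sqft"],  # keep one copy of the redundant column
            "price": sale["price"],
            "agent_name": agent["name"] if agent else None,
        })
    return joined, issues

joined, issues = integrate(houses, sales, agents)
```

Which table wins a value conflict (here the house table) is a design decision; the sketch only records the conflict so that it can be reviewed.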
What if public (free) data is not available?
- Generate data
- Or purchase commercial datasets (many legal sources), if you are clear about what you need
Generate data
Generate real data (the most critical)
- Study your workflow (business workflow or research workflow)
  - How and where real data is generated
- Computerize it
  - Design programs to collect data at specific points
  - Needs sufficient programming skills
- Maintain the whole workflow/pipeline
- An application-specific, complicated, expensive, and long-term project: a core infrastructure for most companies
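One possible way to "collect data at specific points" is to wrap the functions of a workflow so that each call is recorded. The sketch below is an assumption about how this could look, not a method from the slides: the `collect` decorator, the `quote_price` function, and the in-memory sink are all hypothetical.

```python
import io
import json
import time
from functools import wraps

def collect(sink, event_name):
    """Decorator that appends one JSON line per call to a writable text sink."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            result = func(*args, **kwargs)
            record = {
                "event": event_name,
                "ts": time.time(),
                "args": args,
                "kwargs": kwargs,
                "result": result,
            }
            # default=str keeps logging from failing on non-JSON values.
            sink.write(json.dumps(record, default=str) + "\n")
            return result
        return wrapper
    return decorator

log = io.StringIO()  # stand-in for a file, queue, or database in a real pipeline

@collect(log, "quote_price")
def quote_price(sqft, rate_per_sqft=150):
    """A hypothetical step in a business workflow."""
    return sqft * rate_per_sqft

quote_price(1000)
quote_price(800, rate_per_sqft=200)

# Each logged line is one collected data point.
events = [json.loads(line) for line in log.getvalue().splitlines()]
```

Keeping the collection logic in one decorator makes it easier to maintain as the workflow changes, which is a large part of the long-term cost the slide mentions.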
Another (popular, relatively new) way: generate synthetic data (somewhat limited)
- Use GANs (Generative Adversarial Networks)
- Simulations
- Data augmentation: https://github.com/aleju/imgaug
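Data augmentation can be illustrated without any library: the sketch below creates variants of a tiny grayscale image by random horizontal flips and small pixel noise. It does not use the imgaug API; the image, the helper names, and the noise scale are assumptions chosen for illustration.

```python
import random

def horizontal_flip(image):
    """Mirror each row of a 2D list of pixel values."""
    return [row[::-1] for row in image]

def add_noise(image, rng, scale=10):
    """Add uniform integer noise in [-scale, scale] and clip to [0, 255]."""
    return [
        [min(255, max(0, px + rng.randint(-scale, scale))) for px in row]
        for row in image
    ]

def augment(image, rng, n):
    """Produce n augmented copies: flip with probability 0.5, then add noise."""
    out = []
    for _ in range(n):
        img = horizontal_flip(image) if rng.random() < 0.5 else image
        out.append(add_noise(img, rng))
    return out

image = [[0, 50, 100], [150, 200, 255]]  # hypothetical 2x3 grayscale image
rng = random.Random(0)  # seeded so the augmentation is reproducible
augmented = augment(image, rng, n=4)
```

The augmented copies keep the original label, so a small labeled dataset can be stretched into a larger one, as long as the transformations do not change what the image depicts.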
Summary
- Getting data is challenging
- Raw data in industry vs. academic datasets
- Data integration combines data from different sources
- Data augmentation is a common practice
- Synthesizing data is getting popular