Data Cleaning

Data Cleaning is the process of fixing or removing incorrect, corrupted, incorrectly formatted, duplicate, or incomplete data within a dataset.
Data cleansing improves data quality and
helps provide more accurate, consistent and
reliable information for decision-making in an
organization.
Why Data Cleaning is Necessary
Data cleaning might seem uninteresting, but it’s
one of the most important tasks you would have to
do as a data science professional. Having wrong or
bad quality data can be detrimental to your
processes and analysis. Poor data can cause a
stellar algorithm to fail.
On the other hand, high-quality data can cause a
simple algorithm to give you outstanding results.
There are many data cleaning techniques, and you
should get familiar with them to improve your data
quality. Not all data is useful; irrelevant data is another major factor that degrades data quality. Poor-quality data can come from many sources.
Cont..
Usually, these errors are the result of human error, but they can also arise when a lot of data is combined from different sources. Multichannel data is
not only important, but it is also the norm. So
as a data scientist, you can expect errors from
this type of data. They can cause incorrect
insights in your project and sidetrack your
data analysis process. This is why data
cleaning methods in data analysis are so
important.
Reasons why data cleaning is
essential
Efficiency
Having clean data (free from wrong and inconsistent values) helps you perform your analysis a lot faster. You’d save a
considerable amount of time by doing this task
beforehand. When you clean your data before
using it, you’d be able to avoid multiple errors. If
you use data containing false values, your results
won’t be accurate. A data scientist has to spend
significantly more time cleaning and purifying
data than analyzing it.
Error Margin
When you don’t use accurate data for analysis,
you will surely make mistakes. Suppose you’ve put a lot of effort and time into analyzing a specific group of datasets. You are very eager to show the results to your superior, but in the meeting, your superior points out a few mistakes; the situation gets embarrassing and painful.
Wouldn’t you want to prevent such mistakes from happening? Not only do they cause embarrassment, but they also waste resources. Data cleansing helps you in that regard. It is a widespread practice, and you should learn the methods used to clean data.
Determining Data Quality
Is The Data Valid? (Validity)
The validity of your data is the degree to which it
follows the rules of your particular requirements.
For example, suppose you had to import the phone numbers of different customers, but in some places, you entered email addresses instead. Because your needs were explicitly for phone numbers, the email addresses would be invalid.
Validity errors take place when the input method
isn’t properly inspected. You might be using
spreadsheets for collecting your data. And you
might enter the wrong information in the cells of
the spreadsheet.
Range
 
Some types of numbers have to be in a
specific range. For example, the number of
products you can transport in a day must have
a minimum and maximum value. The data would have a particular range, with a starting point and an end point.
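To make this concrete, here is a minimal sketch of a range check in pandas; the column name and the limits are assumptions for illustration:

```python
import pandas as pd

# Hypothetical shipments table; units shipped must fall in a known range.
df = pd.DataFrame({"units_shipped": [120, 450, -3, 10000]})

MIN_UNITS, MAX_UNITS = 0, 5000  # assumed business limits

# between() is inclusive on both ends, matching a start and end point.
out_of_range = df[~df["units_shipped"].between(MIN_UNITS, MAX_UNITS)]
print(out_of_range)  # flags -3 and 10000 for review
```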
Data-Type
 
Some data cells might require a specific kind
of data, such as numeric, Boolean, etc. For
example, in a Boolean section, you wouldn’t
add a numerical value.
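A minimal sketch of such a type check, assuming a Boolean column that accidentally received a number:

```python
import pandas as pd

# dtype=object keeps the raw Python values so the mix-up stays visible.
df = pd.DataFrame({"is_active": [True, False, 1]}, dtype=object)

# Flag entries that are not genuine Booleans.
bad_type = df[~df["is_active"].map(lambda v: isinstance(v, bool))]
print(bad_type)  # the row holding 1 violates the Boolean type rule
```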
Compulsory constraints
In every scenario, there are some mandatory
constraints your data should follow. The
compulsory restrictions depend on your
specific needs. Surely, specific columns of your
data shouldn’t be empty. For example, in the
list of your clients’ names, the column of
‘name’ can’t be empty.
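A minimal sketch of checking a compulsory ‘name’ column for empty values (the records are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"name": ["Asha", None, "Ben"],
                   "city": ["Pune", "Oslo", "Lima"]})

# Rows violating the mandatory 'name' constraint.
missing_names = df[df["name"].isna()]
print(missing_names)
```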
Cross-field examination
There are certain conditions which affect
multiple fields of data in a particular
form. For example, the arrival time of a flight can’t be earlier than its departure time. In a balance sheet, the total debits and credits of the client must be equal; they can’t differ. These values are related to each other, which is why you might need to perform a cross-field examination.
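As a sketch, the flight rule above could be checked like this in pandas (the times are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "departure": pd.to_datetime(["2024-05-01 08:00", "2024-05-01 14:00"]),
    "arrival":   pd.to_datetime(["2024-05-01 10:30", "2024-05-01 13:00"]),
})

# Cross-field rule: arrival must not precede departure.
violations = df[df["arrival"] < df["departure"]]
print(violations)  # the second flight fails the check
```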
Unique Requirements
Particular types of data have unique restrictions. Two customers can’t have the same customer support ticket. Such data must be unique to a particular field and can’t be shared by multiple records.
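A minimal sketch of a uniqueness check on a hypothetical ticket ID column:

```python
import pandas as pd

df = pd.DataFrame({"ticket_id": ["T-100", "T-101", "T-100"],
                   "customer": ["Asha", "Ben", "Chen"]})

# keep=False marks every occurrence of a shared ticket, not just the repeat.
dupes = df[df["ticket_id"].duplicated(keep=False)]
print(dupes)  # both T-100 rows violate the uniqueness rule
```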
Set-Membership Restrictions
Some values are restricted to a particular set.
For example, gender can be Male, Female, or Unknown.
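A minimal sketch of a set-membership check; note that the comparison here is case-sensitive:

```python
import pandas as pd

df = pd.DataFrame({"gender": ["Male", "Female", "male", "Unknown"]})

ALLOWED = {"Male", "Female", "Unknown"}
invalid = df[~df["gender"].isin(ALLOWED)]
print(invalid)  # "male" fails the set-membership rule
```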
Regular Patterns
Some pieces of data follow a specific format. For
example, email addresses have the format
‘randomperson@randomemail.com’. Similarly, phone numbers in many countries have ten digits.
If the data isn’t in the required format, it would
also be invalid.
If a person omits the ‘@’ while entering an email
address, then the email address would be invalid,
wouldn’t it? Checking the validity of your data is
the first step to determine its quality. Most of the
time, the cause of entry of invalid information is
human error.
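A sketch of a pattern check for email addresses; the regular expression below is a deliberately simple sanity test, not full address validation:

```python
import pandas as pd

df = pd.DataFrame({"email": ["randomperson@randomemail.com",
                             "missing-at-sign.com"]})

# Something, then '@', then something, then a dot and a suffix.
pattern = r"^[^@\s]+@[^@\s]+\.[^@\s]+$"
invalid = df[~df["email"].str.match(pattern)]
print(invalid)  # the address without '@' is flagged as invalid
```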
Cont..
Getting rid of invalid data will help you streamline your process and avoid useless data values later on.
Consistency
You can measure consistency by comparing
two similar systems. Or, you can check the
data values within the same dataset to see if
they are consistent or not. Consistency can be
relational. For example, a customer’s age might be 15, which is a valid value and could be accurate, but the same system might also list that customer as a senior citizen.
Cont..
In such cases, you’ll need to cross-check the
data, similar to measuring accuracy, and see
which value is true. Is the client a 15-year-old or a senior citizen? Only one of these values can be true.
There are multiple ways to make your
data consistent
Check different systems
You can take a look at another similar system
to find whether the value you have is real or
not. If two of your systems are contradicting
each other, it might help to check the third
one.
In our previous example, suppose you check the third system and find that the customer’s age is 65. This suggests that the second system, which said the customer is a senior citizen, is the one that holds.
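A minimal sketch of comparing two systems, assuming both expose the customer’s age keyed by an ID:

```python
import pandas as pd

# Two hypothetical systems recording customer age.
crm   = pd.DataFrame({"customer_id": [1, 2], "age": [15, 40]})
sales = pd.DataFrame({"customer_id": [1, 2], "age": [65, 40]})

merged = crm.merge(sales, on="customer_id", suffixes=("_crm", "_sales"))
conflicts = merged[merged["age_crm"] != merged["age_sales"]]
print(conflicts)  # customer 1 needs a third system (or the source) to resolve
```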
Check the latest data
Another way to improve the consistency of
your data is to check the more recent value. It
can be more beneficial to you in specific
scenarios. You might have two different
contact numbers for a customer in your
record. The most recent one would probably
be more reliable because it’s possible that the
customer switched numbers.
Check the source
The most foolproof way to check the reliability of the data is simply to contact the source. In our example of the customer’s age,
you can opt to contact the customer directly
and ask them their age. However, it’s not
possible in every scenario and directly
contacting the source can be highly tricky.
Maybe the customer doesn’t respond, or their
contact information isn’t available.
Uniformity
You should ensure that all the values you’ve
entered in your dataset are in the same units. If
you’re entering SI units for measurements, you
can’t use the Imperial system in some places. Similarly, if you’ve entered time in seconds in one place, you should use that unit all across the dataset.
This may happen while formatting dates as well.
Make sure to use the same date format for all
your entries. If you are using the DD/MM/YYYY format, stick to it; do not change to MM/DD/YYYY for some of the entries, as this will contaminate the data and create problems.
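A minimal sketch of enforcing one date format, assuming DD/MM/YYYY is the agreed standard; strict parsing turns every nonconforming entry into NaT so it can be reviewed:

```python
import pandas as pd

raw = pd.Series(["21/03/2023", "05/11/2023", "2023-11-05"])

# errors="coerce" marks anything not in DD/MM/YYYY as NaT instead of failing.
parsed = pd.to_datetime(raw, format="%d/%m/%Y", errors="coerce")
print(parsed)  # the ISO-formatted entry is flagged for correction
```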
Cont..
Checking the uniformity of your records is
quite easy. A simple inspection can reveal
whether a particular value is in the required
unit or not. The units you use for entering
your data depend on your specific
requirements. Checking for uniformity across
datasets is one of the most important factors
of data cleaning in data analysis.
Heterogeneous data
Heterogeneous data are any data with high
variability of data types and formats. They are
possibly ambiguous and low quality due to
missing values, high data redundancy, and
untruthfulness. It is difficult to integrate
heterogeneous data to meet the business
information demands.
Example
Heterogeneous data structures are data
structures that contain diverse types of data,
such as integers, doubles, and floats. Linked
lists and ordered lists are good examples of these data structures.
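In plain Python terms, a single record can already be heterogeneous; this small sketch just prints the mixed types:

```python
# A heterogeneous record mixing integer, float, Boolean, and string values.
record = {"customer_id": 42, "balance": 1050.75, "active": True, "name": "Asha"}

for field, value in record.items():
    print(field, "->", type(value).__name__)
```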
Missing Data
Missing data, or missing values, occur when
you don't have data stored for certain
variables or participants. Data can go missing
due to incomplete data entry, equipment
malfunctions, lost files, and many other
reasons. In any dataset, there are usually
some missing data.
Example
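A minimal sketch, assuming a small pandas DataFrame with gaps: count the missing values, then either impute a statistic or drop the incomplete rows:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({"age": [25, np.nan, 31],
                   "city": ["Pune", "Oslo", None]})

print(df.isna().sum())  # missing values per column

# Two common remedies: impute a statistic, or drop incomplete rows.
df["age"] = df["age"].fillna(df["age"].median())
df = df.dropna(subset=["city"])
print(df)
```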
Data Transformation
Data transformation is the process of converting and structuring data into a usable format that can be analyzed to support decision-making processes and to propel the growth of an organization. Data transformation is used when data needs to be converted to match that of the destination system.
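As a sketch, suppose a destination system wants separate name fields and numeric revenue while the source stores one name string and formatted text; the column names and schema are invented for illustration:

```python
import pandas as pd

src = pd.DataFrame({"full_name": ["Asha Rao", "Ben Lee"],
                    "revenue": ["1,200", "950"]})

# Restructure to match the (assumed) destination schema.
parts = src["full_name"].str.split(" ", n=1, expand=True)
dst = pd.DataFrame({
    "first_name": parts[0],
    "last_name": parts[1],
    "revenue": src["revenue"].str.replace(",", "").astype(int),  # text -> number
})
print(dst)
```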
Data Segmentation
Data segmentation is the process of taking the data you hold, dividing it up, and grouping similar data together based on chosen parameters so that you can use it more efficiently within marketing and operations.
Example
A company might segment customers into groups based on age, gender, customer loyalty, geographic location, or the products and services customers use most.
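A minimal sketch of that kind of segmentation, using invented customers and age bands:

```python
import pandas as pd

customers = pd.DataFrame({"name": ["Asha", "Ben", "Chen"],
                          "age": [23, 41, 67],
                          "region": ["West", "East", "West"]})

# Segment by region and age band.
customers["age_band"] = pd.cut(customers["age"],
                               bins=[0, 30, 60, 120],
                               labels=["<30", "30-60", "60+"])
segments = customers.groupby(["region", "age_band"], observed=True)["name"].count()
print(segments)
```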