Module 3: Clean Your Data

INTRODUCTION – Clean Your Data

This final segment of the course covers three important components of exploratory data analysis (EDA): data cleaning, data joining, and data validation. These practices are part of how you build data-analysis skills, and they help you extract meaning from different datasets. The module covers both the theory behind these skills and how they are applied in practice.

Data cleaning is central to EDA because it addresses errors and inconsistencies within datasets. In this module, you will consider how data cleaning improves the soundness and accuracy of an analysis. You will learn to apply techniques in Python for identifying and handling errors, outliers, and missing values. A clean dataset is the foundation for accurate, meaningful analysis.

Learning objectives

  • Perform input validation on a dataset in Python.
  • Explain the importance of input validation.
  • Convert categorical data into numerical data in Python.
  • Distinguish between categorical and numerical data in a dataset, and explain why the difference matters.
  • Explain why identifying outliers in a dataset is essential.
  • Demonstrate methods for detecting outliers using Python.
  • Identify when to contact stakeholders or engineers about missing values.
  • Identify ethical issues concerning missing data.
  • Detect missing data in a dataset using Python.

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: THE CHALLENGE OF MISSING OR DUPLICATE DATA

1. Fill in the blank: Missing data has a value that is not stored for a _____ in a dataset.

  • visualization
  • column
  • variable (CORRECT)
  • row

Correct: Missing data is a value that is not stored for a variable in a dataset. It may be indicated by ‘N/A’ or ‘NaN’, or by an empty field.

2. A data professional requests additional information from a dataset’s original owner. Unfortunately, they are not able to provide the information. Therefore, the data professional creates a NaN category in the dataset. What concept does this scenario describe?

  • Solving the problem of missing data (CORRECT)
  • Mapping variables in a dataset
  • Managing big data
  • Ensuring two datasets are compatible

Correct: Four common methods for handling missing data are: request the missing values from the data owner; drop the affected rows, columns, or individual values; create a separate NaN category; or derive new representative values to fill the gaps.
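
As a rough illustration of the options that can be done in code, the sketch below counts missing values and then applies three of the strategies with pandas. The column names and values are invented for this example, and "Unknown" is just one possible label for a separate missing-value category.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one missing value in each column.
df = pd.DataFrame({
    "city": ["Austin", "Boston", None, "Denver"],
    "rainfall_mm": [12.0, np.nan, 7.5, 3.0],
})

# Count the missing values in each column.
missing_per_column = df.isnull().sum()

# Option: drop every row that contains a missing value.
dropped = df.dropna()

# Option: keep the rows and give the gap its own category.
labeled = df.assign(city=df["city"].fillna("Unknown"))

# Option: derive a representative value (here, the column mean).
mean_rainfall = df["rainfall_mm"].mean()
filled = df.assign(rainfall_mm=df["rainfall_mm"].fillna(mean_rainfall))

print(missing_per_column)
print(filled)
```

Which option is appropriate depends on the dataset; filling with an average changes the distribution, so it is worth noting wherever values have been derived.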

3. When merging data, a data professional uses the following code:

df_joined = df.merge(df_zip, how='left', 
on=['date','center_point_geom'])

What is the function of the parameters how and on in this code?

  • To tell Python how to find missing values in the rows and columns
  • To tell Python how to place the appropriate values on the top row of the dataset
  • To tell Python which way to join the data and which column to join from (CORRECT)
  • To tell Python which datasets should be merged

Correct: In the merge call, how sets the join type (inner, outer, left, or right), and on names the column or columns to join on.
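
To see how these two parameters behave, here is a small sketch that reuses the call from the question on two invented dataframes. The strikes and zip_code columns and all values are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical frames that share the two key columns from the quiz snippet.
df = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-01", "2018-08-02"],
    "center_point_geom": ["A", "B", "A"],
    "strikes": [4, 9, 2],
})
df_zip = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-02"],
    "center_point_geom": ["A", "A"],
    "zip_code": ["73301", "73301"],
})

# how='left' keeps every row of df; on=[...] lists the columns to match.
df_joined = df.merge(df_zip, how="left", on=["date", "center_point_geom"])

print(df_joined)
```

Because this is a left join, the row for point "B" is kept even though df_zip has no match for it, and its zip_code is left missing.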

4. Non-null count is the total number of blank data entries within a data column.

  • True
  • False (CORRECT)

Correct: A non-null count provides the number of entries in a data column that have valid non-blank values.
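
A minimal sketch of the difference between the two counts, using an invented temperature column:

```python
import numpy as np
import pandas as pd

# Hypothetical column with two blank entries.
df = pd.DataFrame({"temperature": [21.5, np.nan, 19.0, np.nan, 23.1]})

non_null_count = df["temperature"].count()      # entries that hold a value
null_count = df["temperature"].isnull().sum()   # blank entries

print(non_null_count, null_count)

# df.info() reports the same non-null count for every column.
df.info()
```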

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: THE INS AND OUTS OF DATA OUTLIERS

1. What type of outlier is a normal data point under certain conditions, but becomes an anomaly under most other conditions?

  • Global outlier
  • Collective outlier
  • Contextual outlier (CORRECT)
  • Constant outlier

Correct: A contextual outlier is a data point that is normal under certain conditions but anomalous when viewed in most other contexts.

2. What is the term for a line of text that follows a method or function, which is used to explain the purpose of that method or function to others using the same code?

  • Factor
  • Annotation
  • Argument
  • Docstring (CORRECT)

Correct: A docstring is placed immediately after the method or function definition, before the rest of the code block. It describes what the function or method does and helps other programmers understand its purpose and behavior.
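
For instance, a docstring might look like the following. The function upper_fence is a hypothetical helper written for this example, not part of any library.

```python
def upper_fence(q1, q3):
    """Return the upper outlier threshold, Q3 + 1.5 * IQR."""
    return q3 + 1.5 * (q3 - q1)


# Python stores the docstring on the function, so tools can read it.
print(upper_fence.__doc__)
```

The built-in help() function and documentation generators read the same attribute.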

3. A data professional is using a box plot to identify suspected high outliers in a dataset, according to the interquartile rule. To do that, they search for data points greater than the third quartile plus what multiple of the interquartile range?

  • 3 times
  • .5 times
  • 1.5 times (CORRECT)
  • 10 times

Correct: Under the interquartile rule, a box plot flags as suspected outliers the points that lie more than 1.5 times the interquartile range (IQR) beyond the quartiles: greater than the third quartile + 1.5 × IQR, or less than the first quartile - 1.5 × IQR.
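
The same rule can be applied without drawing the plot. The sketch below uses invented numbers and pandas' default quantile calculation; other quartile conventions can give slightly different thresholds.

```python
import pandas as pd

# Hypothetical measurements with one suspiciously large value.
values = pd.Series([10, 12, 12, 13, 14, 15, 15, 16, 18, 60])

q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the points that fall outside the two limits.
outliers = values[(values < lower_limit) | (values > upper_limit)]

print(lower_limit, upper_limit)
print(outliers.tolist())
```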

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: CHANGING CATEGORICAL DATA TO NUMERICAL DATA

1. Fill in the blank: Label encoding assigns each category a unique _____ instead of a qualitative value.

  • qualifier
  • character
  • string
  • number (CORRECT)

Correct: Label encoding assigns a unique number to each category in a dataset, replacing qualitative labels with numerical representations. This lets data professionals use categorical features with machine learning algorithms, which normally require numeric input for training.
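
One way to do this in pandas, shown here as a sketch on invented data, is to convert the column to the category type and read its integer codes. This is only one of several approaches, and the codes it produces start at 0 and follow the sorted order of the categories.

```python
import pandas as pd

# Hypothetical categorical column.
pets = pd.Series(["dog", "cat", "hamster", "dog", "cat"])

# Each distinct category receives its own integer code.
as_category = pets.astype("category")
codes = as_category.cat.codes

print(list(as_category.cat.categories))
print(codes.tolist())
```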

2. When working with dummy variables, data professionals may assign the variables an infinite number of values.

  • True
  • False (CORRECT)

Correct: Dummy variables are binary variables that take only the values 0 or 1. They indicate the absence (0) or presence (1) of a particular category or feature in a dataset, which allows categorical data to be used in statistical or machine learning models.

3. Which pandas function does a data professional use to convert categorical variables into dummy variables?

  • convert_categories()
  • get_categories()
  • get_dummies() (CORRECT)
  • convert_dummies()

Correct: The pandas function get_dummies() converts categorical variables into dummy variables, replacing each category with a binary column. This allows categorical data to be used in machine learning models that accept only numerical input.
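
A short sketch of get_dummies() on an invented fuel column follows. The dtype=int argument is used here so the new columns hold 0 and 1 rather than booleans.

```python
import pandas as pd

# Hypothetical categorical column.
cars = pd.DataFrame({"fuel": ["gas", "electric", "gas"]})

# One binary column is created per category.
dummies = pd.get_dummies(cars, columns=["fuel"], dtype=int)

print(dummies)
```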

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: INPUT VALIDATION

1. Data professionals use input validation to ensure data is complete, error-free, and of high-quality.

  • True (CORRECT)
  • False

Correct: Input validation is the process data professionals use to ensure that data is accurate, complete, and of high quality before they process or use it. It keeps errors, inconsistencies, and invalid entries from cascading into the analysis and the decisions that follow.
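
What counts as valid depends on the dataset, so the checks below are only an assumed example: that dates follow a YYYY-MM-DD format, that counts are not negative, and that nothing is missing. The column names are invented.

```python
import pandas as pd

# Hypothetical dataset to validate.
df = pd.DataFrame({
    "date": ["2018-08-01", "2018-08-02"],
    "strikes": [4, 2],
})

# errors="coerce" turns unparseable dates into NaT instead of raising.
parsed_dates = pd.to_datetime(df["date"], format="%Y-%m-%d", errors="coerce")

checks = {
    "dates_parse": bool(parsed_dates.notnull().all()),
    "strikes_non_negative": bool((df["strikes"] >= 0).all()),
    "no_missing_values": not df.isnull().values.any(),
}

print(checks)
```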

2. Fill in the blank: If a dataset lacks sufficient information to answer a business question, the process of _____ makes it possible to augment that data by adding values from other datasets.

  • Sampling
  • Joining (CORRECT)
  • summing
  • blending

Correct: When a dataset is not enough to answer a business question, joining allows values from other datasets to be added. This works best when the new data is validated so that its format, data entries, and data types match those of the original dataset, which keeps the supplemented data consistent and accurate.

3. In which phase of the PACE workflow would a data professional perform the majority of the data-validation process?

  • Execute
  • Construct
  • Plan
  • Analyze (CORRECT)

Correct: A data professional carries out most of the data-validation process during the Analyze phase of the PACE workflow. Even so, data validation matters in all four stages, because it keeps the data accurate, consistent, and valid at every step.

PRACTICE QUIZ: Test your knowledge on Analytical Thinking

1. Which of the following terms are used to describe missing data? Select all that apply.

  • Zero
  • Blank (CORRECT)
  • NaN (CORRECT)
  • N/A (CORRECT)

Correct!

2. Stakeholders at a film studio hire a data analytics firm to provide insights about the best locations for film shoots. However, the film studio’s datasets contain missing data. Which of the following strategies can help the data analytics firm solve this problem? Select all that apply.

  • Use their best judgment to add in values themselves.
  • Create a NaN category. (CORRECT)
  • Add in the missing values by taking the average values from the existing data. (CORRECT)
  • Ask the film studio to fill in the missing values. (CORRECT)

Correct!

3. A data professional writes the following code:

df.merge(df_zip, how='left', 
    on=['date','center_point_geom'])

Which section of the code refers to the dataframe to be merged with df?

  • df_zip (CORRECT)
  • how='left'
  • merge
  • center_point_geom

Correct!

4. What pandas function is used to pull all of the missing values from a data frame?

  • pd.getnull()
  • pd.ofnull()
  • pd.findnull()
  • pd.isnull() (CORRECT)

Correct!

5. What type of outliers are values that are completely different from the overall data group and have no association with any other outliers?

  • Collective outliers
  • Global outliers (CORRECT)
  • Contextual outliers
  • Dissimilar outliers

Correct!

6. A data professional works for a car insurance company. To gain insights about the popularity of electric vehicles, they study categorical data about cars. They add a 0 to their dataset to indicate if a car is gas-powered and a 1 if a car is electric. What does this scenario describe?

  • Applying a variable character
  • Changing a floating point
  • Using dummy variables (CORRECT)
  • Removing a data operator

Correct!

7. What type of data visualization shows the concentration of values between two data points by illustrating their magnitude with two colors?

  • Heat map (CORRECT)
  • Treemap
  • Scatter plot
  • Density map

Correct!

8. What does the pandas function pd.duplicated() return to indicate that a data value does not have a duplicate value within the same dataset?

  • True
  • Duplicate
  • Unique
  • False (CORRECT)

Correct!

9. Fill in the blank: The pandas function _____ enables data professionals to create a new dataframe with all duplicate rows removed.

  • drop_duplicates() (CORRECT)
  • deduplicate()
  • de_duplication()
  • deduplication()

Correct!
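
The two duplicate-handling functions from the questions above can be sketched as follows on an invented dataframe. In pandas they are called as methods on the dataframe itself, and by default the first occurrence of a repeated row is not flagged.

```python
import pandas as pd

# Hypothetical data in which the third row repeats the second.
df = pd.DataFrame({
    "id": [1, 2, 2, 3],
    "name": ["a", "b", "b", "c"],
})

# True marks a row that repeats an earlier row.
flags = df.duplicated()

# A new dataframe with the repeated rows removed; df itself is unchanged.
deduped = df.drop_duplicates()

print(flags.tolist())
print(deduped)
```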

10. Which of the following terms can be used to describe a value that is not stored for a variable in a set of data? Select all that apply.

  • Zero
  • N/A (CORRECT)
  • NaN (CORRECT)
  • Blank (CORRECT)

Correct!

11. A data professional writes the following code:

df.merge(df_zip, how='left', 
    on=['date','center_point_geom'])

Which of the following is a parameter for the merge?

  • df_joined
  • how='left' (CORRECT)
  • df.merge()
  • df.head()

Correct!

12. What tasks could the pandas function pd.isnull() be used for? Select all that apply.

  • To delete all of the values from a data frame
  • To change all values to nulls in a data frame
  • To identify when a value is missing from a data frame (CORRECT)
  • To pull all of the missing values from a data frame (CORRECT)

Correct!

13. Fill in the blank: Contextual outliers are normal data points under certain conditions but become _____ under most other conditions.

  • Insignificant
  • Samples
  • Anomalies (CORRECT)
  • Standard

Correct!

14. A data professional works for a veterinary office. To gain insights about the most common household pets, they study categorical data about pet adoptions over the past five years. They assign the number 1 to dogs, 2 to cats, 3 to hamsters, and so on. What does this scenario describe?

  • Data blending
  • Label encoding (CORRECT)
  • Data partitioning
  • Aliasing

Correct!

15. Fill in the blank: A _____ is a data visualization that displays the magnitude of a set of values using two colors to show the concentration of the values.

  • heat map (CORRECT)
  • bubble chart
  • bar graph
  • line chart

Correct!

16. Fill in the blank: A data professional should _____ a duplicate when its value is clearly a mistake or will misrepresent the remaining unique values within the dataset.

  • Eliminate (CORRECT)
  • keep
  • filter
  • replicate

Correct!

17. Fill in the blank: N/A and NaN are terms used to describe _____ data.

  • Missing (CORRECT)
  • nominal
  • qualitative
  • string

Correct!

18. What does the pandas function pd.duplicated() return to indicate that a data value is a duplicate of another value within the same dataset?

  • Duplicate
  • Unique
  • False
  • True (CORRECT)

Correct!

19. A data professional at a garden center researches data related to ideal growing climates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

  • Change the missing values to Boolean data that is either true or false.
  • Create a NaN category. (CORRECT)
  • Derive new representative values based on available data. (CORRECT)
  • Add in the missing values by taking the average values from the existing data. (CORRECT)

Correct!

20. What pandas function enables a data professional to determine if duplicate values are present in a dataset?

  • pd.deduplication()
  • pd.duplicated() (CORRECT)
  • pd.dupe()
  • pd.deduplicates()

Correct!

21. A data team for an investment banker works on a project related to interest rates. As they familiarize themselves with the datasets, they discover some data is missing. Which of the following strategies can help them solve this problem? Select all that apply.

  • Change the missing values to zeros.
  • Ask the owner of the data to fill in the missing values. (CORRECT)
  • Derive new representative values based on available data. (CORRECT)
  • Add in the missing values by taking the average values from the existing data. (CORRECT)

Correct!

22. A data team works for a stereo installation company. To gain insights into what products people are most likely to purchase in the coming year, they review categorical data about 20 of the most popular stereos. Rather than using brand names, they assign a different number to each stereo to make the data simpler to join. What does this scenario describe?

  • Data smoothing
  • Label encoding (CORRECT)
  • Aggregation
  • Normalization

Correct: Assigning a different number to each stereo in place of its brand name is label encoding: each category receives a unique number instead of a qualitative value.

23. A data professional writes the following code:

df.merge(df_zip, how='left', 
    on=['date','center_point_geom'])

Which of the following indicates that the first data frame should be merged with another data frame?

  • on=
  • how=
  • merge() (CORRECT)
  • zip

Correct!

24. What pandas function is used to identify when a value is missing from a data frame?

  • null.pd()
  • pd.null()
  • null().pd
  • pd.isnull() (CORRECT)

Correct!

25. Data encoded as N/A, NaN, or a blank is defined as zero.

  • True
  • False (CORRECT)

Correct: Codes such as N/A, NaN, or a blank field indicate that no value is stored for a variable in the dataset; the value is absent or missing. This differs from zero, which may be a genuine value (for instance, a count or measurement of zero) or, depending on the context, a stand-in for a missing value.

26. What is indicated by the term null?

  • The data is missing. (CORRECT)     
  • The data point is mandatory.   
  • The data has a value of zero.   
  • The data has a value that is not stored for a variable in the dataset.   

Correct: The term “null” describes a missing or undefined data entry. It refers to the absence of any value, unlike zero or an empty string, which can still be valid data points.

27. Fill in the blank: Outliers are observations that are an _____ distance from other values.

  • equal
  • optimal
  • adequate
  • abnormal (CORRECT)

Correct: Outliers are observations that differ markedly from the other values in a dataset. They lie at an abnormal distance from other points, or depart from the general pattern or distribution of the data.

28. Docstrings are useful within a line of Python code, but they cannot be exported to create library documentation.

  • True
  • False (CORRECT)

Correct: Docstrings are lines of text that are placed immediately after the definition of a method or function to describe its purpose and functionality to others. They clearly describe what that method or function does and can be easily extracted to create documentation for a library or API.

29. Categorical data can be grouped on its qualities, thus enabling data professionals to store and identify it based on its category.

  • True (CORRECT)
  • False

Correct: Categorical data can be grouped by its qualities, so data professionals can store, organize, and identify it by category or label. This grouping aids understanding and enables better analysis.

30. Fill in the blank: A heat map uses  _____ to depict the magnitude of an instance or set of values.

  • Colors (CORRECT)
  • markers
  • lines
  • plots

Correct: A heat map is a data visualization that uses colors to depict the magnitude of values, which makes patterns or trends in the data easy to see. Different colors indicate higher or lower magnitudes.
