Course 4 – Process Data from Dirty to Clean Quiz Answers

Week 1: The Importance of Integrity

The Importance of Integrity Introduction

Decisions are only as sound as the data behind them, so the quality of any decision depends on the integrity of that data. Data integrity means keeping data accurate and reliable across the entire data life cycle. This article also discusses one potential threat that analysts should keep in mind: data manipulation and its effect on the quality of an analysis. For insights derived from data to be trustworthy, analysts need to ensure that the data is correctly collected, stored, and processed. This module, part of the course offered on Coursera under the Google Data Analytics Professional Certificate program, is devoted to data integrity and its maintenance.

This course covers the techniques analysts use to decide which data to collect for analysis. It also introduces structured and unstructured data, different data format types, and more.

Learning Outcomes:

  • Understand the statistical measures related to data integrity, namely statistical power, hypothesis testing, and margin of error.
  • Explore ways of tackling inadequate data.
  • Describe the importance of sample size including problems associated with sample bias and random sampling.
  • Examine the interplay of data and business objectives.
  • Define data integrity, different types of it, and the risks related to it.
  • Understand the importance of pre-cleaning activities.

Test your knowledge on data integrity and analytics objectives

1. Which of the following principles are key elements of data integrity? Select all that apply.

  • Accuracy (Correct)
  • Consistency (Correct)
  • Trustworthiness (Correct)
  • Selectivity

Correct: Data integrity concerns the accuracy, completeness, consistency, and trustworthiness of data throughout its entire life cycle, from acquisition through storage and processing to analysis. It matters whenever decisions rely on accurate data and information.

2. Which process do data analysts use to make data more organized and easier to read?

  • Data uniformity
  • Data manipulation (Correct)
  • Data transfer
  • Data replication

Correct: Data manipulation is the process data analysts use to organize data and make it easier to read. It includes cleaning, transforming, and restructuring data to improve its quality and suitability for analysis and decision-making.

3. Before analysis, a company collects data from countries that use different date formats. Which of the following updates would improve the data integrity?

  • Change all of the dates to the same format (Correct)
  • Remove data in an unfamiliar date format
  • Organize the data by country
  • Leave the dates in their current formats

Correct: Changing all of the dates to the same format improves data integrity. A consistent format keeps the data accurate, makes it easier to analyze, and avoids errors or confusion when dates from different sources are compared or processed.
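As a rough illustration of putting dates into one format, the sketch below converts dates from several countries to ISO 8601 (YYYY-MM-DD). The country codes, the `COUNTRY_DATE_FORMATS` mapping, and the sample records are hypothetical and are not part of the quiz. The sketch assumes the source country of each record is known, because a value such as 03/04/2021 is ambiguous on its own.

```python
from datetime import datetime

# Hypothetical mapping from a record's source country to its date format.
COUNTRY_DATE_FORMATS = {
    "US": "%m/%d/%Y",  # month/day/year
    "UK": "%d/%m/%Y",  # day/month/year
    "JP": "%Y/%m/%d",  # year/month/day
}

def to_iso_date(raw: str, country: str) -> str:
    """Parse a date using the country's format and return it as YYYY-MM-DD."""
    fmt = COUNTRY_DATE_FORMATS[country]
    return datetime.strptime(raw, fmt).date().isoformat()

# Hypothetical records: the same text can mean different days in different countries.
records = [("US", "03/04/2021"), ("UK", "03/04/2021"), ("JP", "2021/04/03")]
normalized = [to_iso_date(raw, country) for country, raw in records]
print(normalized)
```

The first two records contain identical text but come out as different dates, which is the kind of confusion a single shared format prevents.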

4. Which of the following processes helps ensure a close alignment of data and business objectives?

  • Maintain data integrity (Correct)
  • Completing data replication
  • Having data update automatically during analysis
  • Transferring data multiple times

Correct: Maintaining data integrity goes a long way toward keeping data closely aligned with business objectives. When data is accurate, complete, consistent, and trustworthy, organizations can make informed decisions, measure their performance, and determine whether they are on track to meet their goals.

5. When gathering data through a survey, companies can save money by surveying 100% of a population.

  • False (Correct)
  • True

Correct: Surveying 100% of a population is ideal, but gathering data from an entire population can be very expensive.

Test your knowledge on insufficient data

1. What should an analyst do if they do not have the data needed to meet a business objective? Select all that apply.

  • Gather related data on a small scale and request additional time to find more complete data. (Correct)
  • Continue with the analysis using data from less reliable sources.
  • Create and use hypothetical data that aligns with analysis predictions.
  • Perform the analysis by finding and using proxy data from other datasets. (Correct)

Correct: If the available data is not enough to meet the business objective, an analyst can gather related data on a small scale and request additional time to find more complete data. This allows a preliminary analysis to proceed in the meantime. Alternatively, the analyst can use proxy data from other datasets to stand in for the missing data, so that decisions can still be made until the needed data is available.

2. Which of the following are limitations that might lead to insufficient data? Select all that apply.

  • Data that updates continually (Correct)
  • Duplicate data
  • Outdated data (Correct)
  • Data from a single source (Correct)

Correct: Limitations that often lead to insufficient data include data that updates continually, outdated data, and data obtained from only one source.

3. A data analyst wants to find out how many people in Utah have swimming pools. It’s unlikely that they can survey every Utah resident. Instead, they survey enough people to be representative of the population. This describes what data analytics concept?

  • Confidence level
  • Margin of error
  • Sample (Correct)
  • Statistical significance

Correct: This describes a sample, which is a part of a population that is representative of the whole and is used to draw inferences about it.
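The idea of estimating a population value from a sample can be sketched as follows. The population below is made up (10,000 residents, 30% of whom own a pool), and `estimate_share` is a hypothetical helper. A fixed seed keeps the run repeatable.

```python
import random

def estimate_share(population, sample_size, seed=0):
    """Estimate the share of 1s in the population from a simple random sample."""
    rng = random.Random(seed)  # seeded so the example is repeatable
    sample = rng.sample(population, sample_size)  # sampling without replacement
    return sum(sample) / sample_size

# Hypothetical population: 1 = has a pool, 0 = no pool (true share is 30%).
population = [1] * 3000 + [0] * 7000
estimate = estimate_share(population, sample_size=1000)
print(f"Estimated share with pools: {estimate:.1%}")
```

The estimate will usually land close to the true 30% without matching it exactly, and that gap is what the margin of error describes.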

Test your knowledge on testing your data

1. A research team runs an experiment to determine if a new security system is more effective than the previous version. What type of results are required for the experiment to be statistically significant?

  • Results that are real and not caused by random chance (Correct)
  • Results that are unlikely to occur again
  • Results that are inaccurate and should be ignored
  • Results that are hypothetical and in need of more testing

Correct: Statistically significant results are real and not caused by random chance. Such results signal that the observed findings would be unlikely to arise by chance alone, so they reflect a genuine effect or relationship among the variables under test.

2. In order to have a high confidence level in a customer survey, what should the sample size accurately reflect?

  • The most valuable members of the population
  • The trends from other customer surveys
  • The predictions of stakeholders
  • The entire population (Correct)

Correct: A high confidence level in a customer survey depends on a sample that is large enough and representative of the entire population. The results are then reliable and can be generalized to the broader group within reasonable bounds of confidence.

3. A data analyst determines an appropriate sample size for a survey. They can check their work by making sure the confidence level percentage plus the margin of error percentage add up to 100%.

  • True
  • False (Correct)

Correct: The confidence level percentage and the margin of error percentage do not need to add up to 100%. They are independent of each other: the confidence level expresses how likely it is that the true value falls within the margin of error, while the margin of error indicates the range in which the true population value is expected to lie.
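To show that the two values are chosen separately, here is a minimal sketch of a sample size calculation. It uses the common normal-approximation formula for a proportion, n = z² · p(1 − p) / e², and assumes a large population and the conservative p = 0.5. The formula and the `required_sample_size` helper are assumptions for illustration; the quiz does not prescribe a method.

```python
import math
from statistics import NormalDist

def required_sample_size(confidence_level, margin_of_error, proportion=0.5):
    """Sample size for estimating a proportion (large-population approximation)."""
    # Two-sided z-score for the chosen confidence level, e.g. about 1.96 for 95%.
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    n = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    return math.ceil(n)  # round up to a whole respondent

# Same 5% margin of error, two different confidence levels.
n_95 = required_sample_size(0.95, 0.05)
n_99 = required_sample_size(0.99, 0.05)
print(n_95, n_99)
```

Raising the confidence level from 95% to 99% at the same 5% margin of error raises the required sample size, and neither pair of percentages sums to 100%.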

Test your knowledge on margin of error

1. Fill in the blank: Margin of error is the _____ amount that the sample results are expected to differ from those of the actual population.

  • Maximum (Correct)
  • minimum
  • median
  • average

Correct: The margin of error indicates the maximum amount that sample results are expected to differ from those of the actual population. It gives the probable range within which the true population value will fall, which helps in judging the accuracy of the sample data.

2. In a survey about a new cleaning product, 75% of respondents report they would buy the product again. The margin of error for the survey is 5%. Based on the margin of error, what percentage range reflects the population’s true response?

  • Between 75% and 80%
  • Between 73% and 78%
  • Between 70% and 80% (Correct)
  • Between 70% and 75%

Correct: The survey shows that 75% of respondents would buy the product again, and with a 5% margin of error the true population response should fall between 70% and 80%. This range reflects how far sampling variability could move the survey result away from the true value for the population.
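The arithmetic is simply the survey result minus and plus the margin of error. A small sketch, with `response_range` as a hypothetical helper name:

```python
def response_range(sample_pct, margin_pct):
    """Return the (low, high) percentage range implied by a margin of error."""
    low = max(0.0, sample_pct - margin_pct)     # a percentage cannot go below 0
    high = min(100.0, sample_pct + margin_pct)  # or above 100
    return low, high

low, high = response_range(75.0, 5.0)
print(f"Between {low:.0f}% and {high:.0f}%")
```

For the survey above this gives 75 − 5 = 70 and 75 + 5 = 80.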

Process Data from Dirty to Clean Weekly Challenge 1

1. Which of the following conditions are necessary to ensure data integrity? Select all that apply.

  • Accuracy (Correct)
  • Statistical power
  • Completeness (Correct)
  • Privacy

Correct: Accuracy and completeness are fundamental to data integrity. Accurate data reflects the true values, while complete data contains all of the elements needed to make decisions consistently and reliably throughout the data life cycle.

2. What is one potential problem associated with data manipulation that analysts must be aware of?

  • Data manipulation can help organize a dataset.
  • Data manipulation can introduce errors. (Correct)
  • Data manipulation can make a dataset easier to read.
  • Data manipulation can separate a dataset among different locations.

Correct: Data manipulation means adjusting or reshaping data so that it is better organized and easier to analyze. Although it improves readability and structure, careless manipulation can introduce errors into the data, which makes the results unreliable.

3. A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Which of the following questions can the analyst use this dataset to address? Select all that apply.

  • What was the reason for the population increase in a certain country?
  • What was the effect of migration on the population of a certain country?
  • What was the difference in population between two specific countries in 2018? (Correct)
  • What was the average population of a certain country from 2015 through 2020? (Correct)

Correct: The dataset can answer both of these questions. The average population of a country from 2015 through 2020 comes from averaging its yearly totals, and the difference between two countries in 2018 comes from comparing their totals for that year. The reasons for an increase and the effect of migration require data that the dataset does not contain.
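Both answerable questions reduce to simple calculations on yearly totals. The sketch below uses made-up figures for two placeholder countries. The `populations` table and both helper functions are hypothetical.

```python
from statistics import mean

# Hypothetical yearly population totals (in thousands) for two placeholder countries.
populations = {
    "Country A": {2015: 10000, 2016: 10200, 2017: 10400, 2018: 10600, 2019: 10800, 2020: 11000},
    "Country B": {2015: 8000, 2016: 8100, 2017: 8200, 2018: 8300, 2019: 8400, 2020: 8500},
}

def average_population(country, start_year, end_year):
    """Average of the yearly totals from start_year through end_year, inclusive."""
    years = range(start_year, end_year + 1)
    return mean(populations[country][year] for year in years)

def population_difference(country_a, country_b, year):
    """Difference between two countries' totals in a single year."""
    return populations[country_a][year] - populations[country_b][year]

avg_a = average_population("Country A", 2015, 2020)
diff_2018 = population_difference("Country A", "Country B", 2018)
print(avg_a, diff_2018)
```

Nothing in a table like this records why a total changed, which is why the questions about causes and migration cannot be answered from it.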

4. A data analyst is given a dataset for analysis. To use the template for this dataset, click the link below and select “Use Template.”

The analyst notices a limitation with the data in rows 8 and 9. What is the limitation?

  • Row 8 and row 9 show the wrong currency.
  • Row 9 is a duplicate of row 8. (Correct)
  • Row 9 needs more data.
  • Row 8 is not in the correct format.

Correct: Row 9 is a duplicate of row 8. Duplicate data can skew an analysis, leading to misleading conclusions, misrepresented trends, and unreliable results, so finding and removing duplicates is an essential step before analysis.
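Duplicate rows can be found by counting how often each full row appears. This is a minimal sketch with invented invoice rows. The client names, dates, and amounts are placeholders and do not come from the quiz template.

```python
from collections import Counter

def find_duplicate_rows(rows):
    """Return each row that appears more than once, in first-seen order."""
    counts = Counter(rows)
    return [row for row, count in counts.items() if count > 1]

# Hypothetical invoice rows: (client, date, amount).
invoices = [
    ("Client A", "2014-01-01", 250.00),
    ("Client B", "2014-02-18", 410.00),
    ("Client B", "2014-02-18", 410.00),  # exact repeat of the previous row
    ("Client A", "2014-05-20", 125.00),
]
duplicates = find_duplicate_rows(invoices)
print(duplicates)
```

This only catches exact repeats. Rows that differ by spacing or capitalization would need to be standardized first.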

5. A data analyst is working on a project about the global supply chain. They have a dataset with lots of relevant data from Europe and Asia. However, they decide to generate new data that represents all continents. What type of insufficient data does this scenario describe?

  • Data that keeps updating
  • Data from only one source
  • Data that’s geographically limited (Correct)
  • Data that’s outdated

Correct: The analysis is based on insufficient data because it is geographically limited; if the analytics project is global, then the dataset should reflect various regions or countries so that it can represent the real worldwide situation and give meaningful insights.

6. In the data analysis process, how does a sample relate to a population?

  • A sample is a part of a population that is representative of the population. (Correct)
  • A sample is an ideal example taken from a population.
  • A sample is a duplicate selection of data that is taken from the population.
  • A sample is an average of all the data that represents the population.

Correct: In essence, a sample is a smaller-scale version of the population that provides enough information to generalize to the whole population.

7. A restaurant gathers data about a new dish by providing free samples to parties of six or more diners. What does this scenario describe?

  • Unbiased sampling
  • Random sampling
  • Geographically limited sampling
  • Sampling bias (Correct)

Correct: This scenario exemplifies sampling bias because parties of six or more diners are not representative of all diners. Sampling bias happens when a sample does not reflect the range of characteristics in the population as a whole, which distorts the results.
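A toy example shows how restricting the sample can distort a result. The feedback rows below are invented so that large parties happen to like the dish more than other diners do. `approval_rate` is a hypothetical helper.

```python
# Hypothetical feedback: (party_size, liked_dish).
feedback = [
    (2, False), (2, True), (3, False), (4, False),
    (6, True), (7, True), (8, True), (2, False),
]

def approval_rate(rows):
    """Share of rows in which the diner liked the dish."""
    return sum(liked for _, liked in rows) / len(rows)

all_diners_rate = approval_rate(feedback)
# Biased sample: only parties of six or more received the free sample.
large_parties = [row for row in feedback if row[0] >= 6]
large_parties_rate = approval_rate(large_parties)
print(all_diners_rate, large_parties_rate)
```

In this made-up data the biased sample reports 100% approval while the rate across all diners is 50%.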

8. Data and business objectives might not align for a number of reasons. Which of the following issues can prevent alignment? Select all that apply.

  • Sampling bias (Correct)
  • Data integrity
  • Insufficient data (Correct)
  • Data visualization

Correct: Insufficient data and sampling bias can prevent alignment.

9. Fill in the blank: Data _____ refers to the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle.

  • Integrity (CORRECT)
  • analysis
  • sampling
  • replication

Correct: Data integrity refers to the accuracy, completeness, consistency, and trustworthiness of data throughout its life cycle. It is critical for making trustworthy decisions and maintaining data quality from collection through analysis and reporting.

10. A healthcare company keeps copies of their data at several locations across the country. The data becomes compromised because each location creates a copy of the original at different times of day. Which of the following processes caused the compromise?

  • Data manipulation
  • Data transfer
  • Data gathering
  • Data replication (CORRECT)

Correct: The compromise was caused by data replication. Data replication stores data in multiple locations for availability and redundancy, but if it is not managed properly, for example when copies are made at different times, the copies can become inconsistent and data integrity suffers.

11. A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Based on the available data, an analyst would be able to determine the reasons behind a certain country’s population increase from 2016 to 2017.

  • True
  • False (CORRECT)

Correct: The analyst needs additional data to identify the causes of the population increase. This might include migration data, birth rates, or economic factors that explain how the population changes over time.

12. A data analyst at a nonprofit organization is working with a dataset about a summer fundraiser. Although they have a lot of useful data by the end of the month, they recognize that the data is insufficient. So, they decide to wait until the end of the season to begin working with the dataset. Which type of insufficient data does this example describe?

  • Data that keeps updating (CORRECT)
  • Outdated data
  • Geographically limited data
  • Data from only one source

Correct: This example describes data that keeps updating. The fundraiser is still under way at the end of the month, so the dataset is incomplete, and waiting until the end of the season gives the analyst the full data to work with.

13. Fill in the blank: Sampling bias in data collection happens when a sample isn’t representative of _____.

  • the population as a whole (CORRECT)
  • the population most affected by the data 
  • a dataset about the population
  • a subset of the population

Correct: When sampling bias occurs in data collection, the sample does not reflect the population as a whole. Certain groups or traits are overrepresented or underrepresented, which leads to skewed findings and unreliable conclusions.

14. Sometimes during analysis, an analyst discovers that it’s necessary to adjust the business objective. When this happens, the analyst should take the initiative to do so without involving others in order to be respectful of their time.

  • True
  • False (CORRECT)

Correct: If a data analyst believes the business objective should be adjusted, it’s important to first have a discussion with stakeholders.

15. A data analyst is given a dataset for analysis. It includes data about the total population of every country in the previous 20 years. Which of the following questions would the analyst need more data to address? 

  • What was the population of a certain country in 2020?
  • Which country had the greatest population in 2015?
  • Which country had the smallest population in 2017?
  • What was the reason for the population increase in a certain country? (CORRECT)

Correct: The analyst would need more data to determine the underlying causes of the population increase, such as increased migration, higher birth rates, changing economic conditions, or other demographic trends.

16. A restaurant wants to gather data about a new dish by giving out free samples and asking for feedback. Who should the restaurant give samples to?

  • All diners (CORRECT)
  • 80% of diners 
  • Diners who spend the most money on their meal 
  • Diners who are willing to pay for the samples

Correct: The restaurant should give samples to all diners.

17. A data analyst at a software company wants to learn more about industry competitors. Because the software industry has more mergers than any other field, the companies and their products are constantly evolving. The analyst has a dataset from three years ago, and they notice that many of the companies and products in the dataset have changed. What makes the analyst decide that the data is insufficient, so they should generate fresh data instead?

  • It is geographically limited data
  • It is data from only one source
  • It is outdated data (CORRECT)
  • It is data that keeps updating

Correct: This scenario describes outdated data, which is a type of insufficient data. Outdated data is old and unlikely to reflect the current situation, which makes it unsuitable for accurate analysis or decision-making.

18. Fill in the blank: If a data analyst is using data that has been _____, the data will lack integrity and the analysis will be faulty.

  • Compromised (CORRECT)
  • clean
  • wide
  • public

Correct: Using compromised data undermines data integrity and makes the analysis faulty, if not entirely misleading. Compromised data introduces errors, inconsistencies, or bias, which jeopardizes the reliability of the findings and of any decisions based on them.

19. A financial analyst imports a dataset to their computer from a storage device. As it’s being imported, the connection is interrupted, which compromises the data. Which of the following processes caused the compromise?

  • Data analysis
  • Data manipulation
  • Data transfer (CORRECT)
  • Data gathering

Correct: The compromise was caused by data transfer. When a transfer is interrupted, the resulting dataset may be incomplete or corrupted, which undermines the validity and reliability of any analysis based on it.
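One common way to detect an incomplete transfer, offered here as an illustration and not as something the course prescribes, is to compare a checksum of the source data with a checksum of what was received. The byte strings and helper names below are hypothetical.

```python
import hashlib

def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of the data as a hex string."""
    return hashlib.sha256(data).hexdigest()

def transfer_is_intact(source: bytes, received: bytes) -> bool:
    """True when the received bytes have the same checksum as the source."""
    return sha256_hex(source) == sha256_hex(received)

# Hypothetical CSV content and a copy cut short by an interrupted connection.
source = b"id,amount\n1,100\n2,250\n3,75\n"
interrupted = source[:20]

print(transfer_is_intact(source, source))       # complete copy
print(transfer_is_intact(source, interrupted))  # truncated copy
```

A mismatch signals that the file should be transferred again before any analysis begins.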

20. Fill in the blank: As a data analyst, you need to verify that your data is _____ to ensure your analysis and conclusions are accurate.

  • complete and valid (CORRECT)
  • manipulated and replicated
  • private and valid
  • manipulated and valid

21. A data analyst is given a dataset for analysis. To use the template for this dataset, click the link below and select “Use Template.”

Link to template: June 2014 Invoices

OR

If you don’t have a Google account, download the CSV file directly from the attachment below.

Which of the following has duplicate data?

  • Data for Symteco on 2/21/2014
  • Data for Symteco on 5/20/2014
  • Data for Valando on 1/1/2014
  • Data for Valando on 2/18/2014 (CORRECT)

22. A clothing manufacturer wants to learn more about why their consumers have purchased the brand’s products. How should this manufacturer conduct their survey?

  • Send the survey to their least frequent customers
  • Send the survey to a representative sample of their customers (CORRECT)
  • Send the survey to customers who have purchased more than one product
  • Send the survey to random people who buy clothes

23. A car dealership gathers data about their entire customer population. They decide to conduct a survey to understand why their customers chose their dealership. They send out an email to all customers who have purchased more than two vehicles in the past five years. What does this scenario describe?

  • Unbiased sampling
  • Random sampling
  • Sampling bias (CORRECT)
  • Geographically limited sampling

24. What can jeopardize data integrity throughout its lifecycle? Select all that apply.

  • Insufficient data
  • Malware (CORRECT)
  • System failures (CORRECT)
  • Human error (CORRECT)

25. A data analyst needs to migrate data from a server located at their company’s headquarters to a remote site. This can lead to what type of data integrity issue? Select all that apply.

  • Data cleaning
  • Data manipulation
  • Data replication (CORRECT)
  • Data transfer (CORRECT)

26. A data analyst, working for a publishing company, gathers a dataset which includes all books sold in the United Kingdom over the last three years. However, they decide to generate new data that represents global book sales. What type of insufficient data does this scenario describe?

  • Data from only one source
  • Data that is outdated
  • Data that is geographically limited (CORRECT)
  • Data that keeps updating

27. A car manufacturer wants to learn more about the brand preferences of electric car owners. There are millions of electric car owners in the world. Who should the company survey?

  • A sample of all electric car owners (CORRECT)
  • A sample of car owners who have owned more than one electric car
  • The entire population of electric car owners
  • A sample of car owners who most recently bought an electric car

28. A restaurant gathers data about a new dish by providing free samples to parties of six or more diners. What does this scenario describe?

  • Unbiased sampling
  • Random sampling
  • Sampling bias (CORRECT)
  • Geographically limited sampling

29. What best describes a sample size?

  • A subset of the population excluding outliers
  • A subset of the population between the 25th and 50th percentile
  • A random subset of the population
  • A subset that is representative of the population as a whole (CORRECT)

30. Fill in the blank: In order to have a strong and thorough analysis, a data analyst must verify _____.

  • data manipulation
  • data engineering
  • data integrity (CORRECT)
  • data replication

31. A financial analyst imports a dataset to their computer from a storage device. As it’s being imported, the connection is interrupted, which compromises the data. Which of the following processes caused the compromise?

  • Data gathering
  • Data analysis
  • Data transfer (CORRECT)
  • Data manipulation

32. You are working for a global technology company. You have a dataset with the company’s total cell phone sales by country from 2015 to present. Based on the data you have, what questions are you able to answer?

  • What was the effect on sales when a new phone model was launched?
  • What countries have the most cell phone sales in the past three years? (CORRECT)
  • What was the effect on sales when new phone features were introduced?
  • What are the mean cell phone sales for each country since 2010?

33. A data analyst is working on a project around a national supply chain. They have a dataset with lots of relevant data from about half of the country. However, they decide to generate new data that represents the entire nation. What type of insufficient data does this scenario describe?

  • Geographically limited data (CORRECT)
  • Data that keeps updating
  • Outdated data
  • Data from only one source

34. A company has multiple retail chain stores. Each store’s database is located onsite and used for various purposes. Which of the following processes could compromise data integrity?

  • Data transfer
  • Data gathering
  • Data replication (CORRECT)
  • Data cleaning

35. A data analyst is given a dataset for analysis. To use the template for this dataset, click the link below and select “Use Template.”

  • Identifying the best paying client between January and November of 2014
  • Identifying the most profitable clients between January and November of 2014 (CORRECT)
  • Identifying the worst paying client between March and December of 2014 (CORRECT)
  • Identifying the least profitable clients between January and November of 2014 (CORRECT)

36. A high school principal is estimating the total number of students that will attend an upcoming event. She assumes that the older students are unlikely to attend and decides to only survey the first-year students. What issue will the principal face when calculating her estimation? 

  • The sample is too small.
  • The sample should be the older students. 
  • The sample exhibits sampling randomness.
  • The sample exhibits sampling bias. (CORRECT)

37. Fill in the blank: _____ is the process of changing data to make it more organized and easier to read.

  • Data replication
  • Data transfer
  • Data gathering
  • Data manipulation (CORRECT)

38. A company is trying to learn more about their customer base. They would like to conduct a survey to understand why their customers chose their brand. How should the company survey its customers?

  • Conduct a survey with customers who have purchased more than five products
  • Conduct a survey with a representative sample of their customer population (CORRECT)
  • Conduct a survey of customers who purchased a different brand
  • Conduct a survey of customers that live in high-income areas

39. A candy manufacturer finds an even distribution of sales across all age ranges of customers who purchase their products. The manufacturer decides to conduct a survey to learn more about its customer base. Due to age requirements, they can only send the survey to customers who are 21 years or older. This scenario can be described as what?

  • Sampling bias (CORRECT)
  • Down sampling bias
  • Unbiased sampling
  • Upsampling bias

40. A data analyst retrieves a sample of their data that is roughly representative of the population as a whole. They realize that there will be some error in their sample results because they didn’t sample the entire population. What is this error called?

  • Margin of error
  • Sampling error (CORRECT)
  • Mean squared error
  • Population error

The Importance of Integrity Conclusion

Data integrity is central to effective decision-making: data that is not trustworthy, accurate, or relevant adds no value to a decision. This part of the course has underscored the benefits of maintaining data integrity and traced how data is produced. It has also covered the techniques analysts use to decide what to collect for analysis and the difference between structured and unstructured data.

You have also learned about different types and formats of data. Continue with the course on Coursera to build on these topics and deepen your knowledge of data management.
