Week 6: Course Challenge
Process Data from Dirty to Clean Course Challenge INTRODUCTION
The course challenge in the Google Data Analytics Professional Certificate on Coursera tests how well participants understand data cleaning and how it relates to sample size, data integrity, and business objectives in data analysis.
Participants demonstrate data-cleaning techniques in spreadsheets and SQL. The challenge covers processing data from dirty to clean, and participants are expected to document, report, and verify the results of their cleaning. The course's glossary of terms and definitions is a useful reference for interpreting the questions. Completing this module moves learners a step closer to the professional certificate from Google.
Learning objectives:
- Discuss and describe statistical measures associated with data integrity, including statistical power, hypothesis testing, and margin of error.
- Describe strategies for managing insufficient data.
- Discuss the importance of sample size with reference to sampling bias and random sampling.
- Describe the relationship between data and business objectives.
- Define data integrity with reference to its types and associated threats.
- Describe data-cleaning techniques for identifying errors, redundancy, and compatibility issues, and explain the need for continuous improvement.
- Demonstrate proficiency in using spreadsheets to clean data.
- Explain how SQL can be used to clean large datasets.
- Describe the benefits of documenting the data-cleaning process.
- Discuss the elements of a data-cleaning report and their importance.
- Explain how the results of data cleaning are verified.
Process Data From Dirty to Clean Course Challenge
1. Scenario 1, questions 1-5
You are a data analyst at a small analytics company. Your company is hosting a project kick-off meeting with a new client, Meer-Kitty Interior Design. The agenda includes reviewing their goals for the year, answering any questions, and discussing their available data.
Before the meeting you review the About Us tab on their website and their business plan, linked below:
Meer-Kitty Interior Design has two goals. They want to expand their online audience, which means getting their company and brand known by as many people as possible. They also want to launch a line of high-quality indoor paint to be sold in-store and online. You decide to consider the data about indoor paint first.
When you refer to the Meer-Kitty survey feedback tab, you are pleased to find that the available data is aligned to the business objective. However, you do some research about confidence level for this type of survey and learn that you need at least 120 unique responses for the survey results to be useful. Therefore, the dataset has two limitations: First, there are only 40 responses; second, a Meer-Kitty superfan, User 588, completed the survey 11 times.
As the survey has too few responses and numerous duplicates that are skewing results, what are your options? Select all that apply.
- Locate another dataset about indoor paint.
- Repeat the survey in order to create a new, improved dataset. (Correct)
- Talk with stakeholders and ask for more time. (Correct)
- Remove the duplicates from the data and proceed with analysis.
Correct: With so few responses and so many duplicates, the best option is to talk with stakeholders and ask for more time. That time can be used to repeat the survey and collect a new, improved dataset without duplicates, which makes the final analysis more reliable. Involving stakeholders also helps manage expectations and confirm that the revised data still serves the business objective.
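To see why simply removing the duplicates is not enough, the arithmetic from the scenario can be sketched in a few lines of Python. The respondent IDs below are hypothetical, and the sketch assumes that every row other than User 588's comes from a different person, which the scenario does not state.

```python
from collections import Counter

REQUIRED_UNIQUE = 120  # minimum unique responses, taken from the scenario

# Hypothetical respondent IDs: 40 rows, 11 of them from User 588.
# Assumption: the other 29 rows each come from a different respondent.
respondent_ids = [588] * 11 + list(range(1, 30))

counts = Counter(respondent_ids)
unique_respondents = len(counts)
repeat_respondents = {user: n for user, n in counts.items() if n > 1}
shortfall = REQUIRED_UNIQUE - unique_respondents

print(f"total rows: {len(respondent_ids)}")
print(f"unique respondents: {unique_respondents}")
print(f"repeat respondents: {repeat_respondents}")
print(f"unique responses still needed: {shortfall}")
```

Under that assumption, deduplicating leaves 30 unique responses, well short of the 120 required, so the survey has to be repeated.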
2. Scenario 1 continued
During the meeting, you also learn that Meer-Kitty videos are hosted on their website. For each product offered, there is an accompanying video for customers to learn more. So, more views for a video suggests greater consumer interest.
Your goal is to identify which videos are most popular, so Meer-Kitty knows what topics to explore in the future. Unfortunately, Meer-Kitty has just three months of data available because they only recently launched the videos on their site.
Without enough data to identify long-term trends about the video subjects that people prefer, what should you do?
- Watch the videos and use your gut instinct to identify which are most successful.
- Move ahead with the data you have to determine the top video subjects.
- Tell the client you’re sorry, but there is no way to meet their objective.
- Find an alternate data source that will still enable you to meet your objective. (Correct)
Correct: When there is not enough data to identify long-term trends, one option is to find an alternate data source that still serves the objective. For example, you could look for data from a similar company to get a sense of current consumer preferences and trends. That lets you reach relevant, actionable findings even though the original data is limited.
3. Scenario 1 continued
Now that you’ve identified some limitations with Meer-Kitty’s data, you want to communicate your concerns to stakeholders. In addition to insufficient video trend data, your main concern with the indoor paint survey is that the data isn’t representative of the population as a whole.
Clearly, one particular respondent, the superfan, is overrepresented. This is an example of margin of error.
- True
- False (Correct)
Correct: This is an example of sampling bias, not margin of error. Sampling bias occurs when a sample does not accurately represent the population as a whole. It skews results and leads to conclusions that may not generalize to the wider population. To avoid it, select a sample that reflects the characteristics of the population.
4. Scenario 1 continued
The stakeholders understand your concerns and agree to repeat the indoor paint survey. In a few weeks, you have a much better dataset with more than 150 responses and no duplicates.
If you are using the template, please refer to the New Meer-Kitty survey feedback tab. You notice that questions 4 and 5 are dependent on the respondent’s answer to question 3. So, you need to determine how many people answered Yes to question 3, then compare that to responses to questions 4 and 5. That way, you will know if questions 4 and 5 have any nulls.
You decide to use a spreadsheet tool that changes how cells appear when they contain the word Yes. Which tool do you use?
- CONCATENATE
- Filtering
- Data validation
- Conditional formatting (Correct)
Correct: To change how cells appear when they meet a certain value, use conditional formatting.
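Conditional formatting is a visual aid, but the check it supports (how many people answered Yes to question 3, and whether questions 4 and 5 are filled in for them) can also be sketched in code. The rows and answers below are made up for illustration; `None` stands for an empty cell.

```python
# Hypothetical survey rows: (question 3, question 4, question 5).
# None stands for an empty cell (a null).
rows = [
    ("Yes", "Matte", "Living room"),
    ("No", None, None),
    ("Yes", "Gloss", None),      # question 5 left blank
    ("Yes", None, "Kitchen"),    # question 4 left blank
]

# Count the Yes answers to question 3.
yes_count = sum(1 for q3, _, _ in rows if q3 == "Yes")

# Count how many of those respondents also answered questions 4 and 5.
q4_answered = sum(1 for q3, q4, _ in rows if q3 == "Yes" and q4 is not None)
q5_answered = sum(1 for q3, _, q5 in rows if q3 == "Yes" and q5 is not None)

# Any gap between the counts is the number of nulls.
q4_nulls = yes_count - q4_answered
q5_nulls = yes_count - q5_answered

print(f"Yes to Q3: {yes_count}, Q4 nulls: {q4_nulls}, Q5 nulls: {q5_nulls}")
```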
5. Scenario 1 continued
You have finished cleaning the data to ensure it is complete, correct, and relevant to the problem you’re trying to solve. Then, you complete the verification and reporting processes to share the details of your data-cleaning effort with your team.
You use a spreadsheet function to divide the text strings in Column G around the commas and put each fragment into a new, separate cell. In this example, what are the commas called?
- Delimiters (Correct)
- Substrings
- Partitions
- MIDs
Correct: The commas function as delimiters that indicate the beginning or the end of a data item.
6. Scenario 2, questions 6-10
You’ve completed this program and are interviewing for a junior data scientist position. The job is at B.Spoke Market Research, a company that analyzes market conditions using customer surveys and other research methods. The detailed job description can be found below:
So far, you’ve had a phone interview with a recruiter and you’ve secured a second interview with the B.Spoke team. The recruiter’s email can be found below:
There is a spreadsheet function that searches for a value in the first column of a given range and returns the value of a specified cell in the row in which it is found. It is called SEARCH.
- True
- False (Correct)
Correct: The function described is VLOOKUP, not SEARCH. VLOOKUP searches for a value in the first column of a given range and returns the value of a specified cell in the row in which it is found.
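The lookup behavior described here can be illustrated with a small Python function. This is only a sketch of the exact-match case: the helper name `vlookup`, the product table, and the choice to return `None` when nothing matches are all assumptions made for the example (a real spreadsheet reports an error instead).

```python
def vlookup(search_key, table, index):
    """Return the value in column `index` (1-based) of the first row whose
    first column equals `search_key`; return None if no row matches."""
    for row in table:
        if row[0] == search_key:
            return row[index - 1]
    return None


# Hypothetical range: the first column is the one that gets searched.
products = [
    ["P-100", "Indoor paint", 24.99],
    ["P-200", "Primer", 14.50],
]

price = vlookup("P-200", products, 3)
print(price)
```

The key is always matched against the first column of the range, and the third argument picks which column of the matching row to return.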
7. Scenario 2 continued
Next, your interviewer wants to know more about your understanding of tools that work in both spreadsheets and SQL queries. She explains that the data her team receives from customer surveys sometimes has many duplicate entries.
She says: Spreadsheets have a great tool for that called remove duplicates. But when writing a SQL query, what command should you include in your SELECT statement to remove duplicates?
- DISTINCT (Correct)
- DIVERSE
- DIFFERENT
- DISCRETE
Correct: To remove duplicates in a SQL query, include the DISTINCT keyword in the SELECT statement.
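A minimal sketch of DISTINCT in action, run through Python's built-in `sqlite3` module so it needs no database server. The `survey` table, its columns, and the rows are hypothetical.

```python
import sqlite3

# In-memory database with a hypothetical survey table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (user_id INTEGER, favorite_color TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [(588, "teal"), (588, "teal"), (588, "teal"), (101, "gray"), (102, "teal")],
)

# Without DISTINCT, every stored row comes back, duplicates included.
all_rows = conn.execute("SELECT user_id, favorite_color FROM survey").fetchall()

# With DISTINCT, identical rows are collapsed into one.
distinct_rows = conn.execute(
    "SELECT DISTINCT user_id, favorite_color FROM survey ORDER BY user_id"
).fetchall()

print(len(all_rows), distinct_rows)
```

Note that DISTINCT only changes what the query returns; the duplicate rows are still stored in the table.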
8. Scenario 2 continued
Now, your interviewer explains that the data team usually works with very large amounts of customer survey data. After receiving the data, they import it into a SQL table. But sometimes, the new dataset imports incorrectly and they need to change the format.
She asks: Is there a SQL function that can convert data types such as currency, dates, and times in a SQL table?
- Yes, data types including currency, dates, and times can be converted. (Correct)
- No, only currency can be converted.
Correct: The CAST function converts values in a SQL table from one data type to another, including currency, dates, and times.
9. Scenario 2 continued
Next, your interviewer explains that one of their clients is an online retailer that needs to create product numbers for a vast inventory. Her team does this by combining the text strings for product number, manufacturing date, and color.
She asks: If you encountered a situation where you wanted to add strings together to create new text strings, which SQL function would you use?
- COMBINE
- CREATE
- CONCAT (Correct)
- COALESCE
Correct: Use the CONCAT function to join individual strings into new text strings.
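The idea can be tried out with Python's built-in `sqlite3` module. One caveat: SQLite is used here only because it ships with Python, and it writes concatenation with the `||` operator; in dialects such as BigQuery or MySQL the same expression would be written with CONCAT. The `inventory` table, its values, and the hyphen separator are assumptions for illustration.

```python
import sqlite3

# In-memory database with a hypothetical inventory table.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (product_number TEXT, mfg_date TEXT, color TEXT)"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?)",
    [("1001", "20240115", "BLU"), ("1001", "20240115", "RED")],
)

# Join the three text columns into one new string per row.
# In BigQuery or MySQL this would be CONCAT(product_number, '-', mfg_date, '-', color).
product_keys = [
    row[0]
    for row in conn.execute(
        "SELECT product_number || '-' || mfg_date || '-' || color "
        "FROM inventory ORDER BY color"
    )
]

print(product_keys)
```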
10. Scenario 2 continued
For your final question, your interviewer explains that her team often comes across data with extra leading or trailing spaces.
She asks: Which function would enable you to eliminate those extra spaces? You respond: To eliminate extra spaces for consistency, use the TRIM function.
- True (Correct)
- False
Correct: The TRIM function removes extra leading and trailing spaces, which keeps the data consistent.
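A short sketch of why those spaces matter, using Python's built-in `sqlite3` module. The `responses` table and its values are hypothetical: the same answer is stored three times with different padding.

```python
import sqlite3

# In-memory database with a hypothetical free-text response column.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (color TEXT)")
conn.executemany(
    "INSERT INTO responses VALUES (?)",
    [("  teal",), ("teal  ",), ("teal",)],
)

# Untrimmed, the padded values count as three different answers.
raw_distinct = conn.execute(
    "SELECT COUNT(DISTINCT color) FROM responses"
).fetchone()[0]

# TRIM strips leading and trailing spaces, so they collapse into one.
trimmed_distinct = conn.execute(
    "SELECT COUNT(DISTINCT TRIM(color)) FROM responses"
).fetchone()[0]

print(raw_distinct, trimmed_distinct)
```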
11. Now that you’ve identified some limitations with Meer-Kitty’s data, you want to communicate your concerns to stakeholders. In addition to insufficient video trend data, your main concern with the indoor paint survey is that the data isn’t representative of the population as a whole.
Clearly, one particular respondent, the superfan, is overrepresented. What does this situation describe?
- Sampling bias (Correct)
- Margin of error
- Statistical significance
- Confidence level
Correct: It is sampling bias when the sample does not completely represent the total population.
12. The stakeholders understand your concerns and agree to repeat the indoor paint survey. In a few weeks, you have a much better dataset with more than 150 responses and no duplicates.
If you are using the template, please refer to the New Meer-Kitty survey feedback tab. You notice that questions 4 and 5 are dependent on the respondent’s answer to question 3. So, you need to determine how many people answered Yes to question 3, then compare that to responses to questions 4 and 5. That way, you will know if questions 4 and 5 have any nulls.
You decide to use a spreadsheet tool that changes how cells appear when they meet a certain value — in this case, the word Yes. You are using VLOOKUP.
- True
- False (Correct)
Correct: To change how cells appear when they meet a certain value, use conditional formatting.
14. Next, your interviewer wants to know more about your understanding of tools that work in both spreadsheets and SQL. She explains that the data her team receives from customer surveys sometimes has many duplicate entries.
She says: Spreadsheets have a great tool for that called remove duplicates. In SQL, you can include DISTINCT to do the same thing. In which part of the SQL statement do you include DISTINCT?
- True (Correct)
- False
Correct: The SPLIT function divides the text strings in Column G around the commas and puts each fragment into a new, separate cell. SPLIT is a spreadsheet function that divides text around a specified character and puts each fragment into its own cell.
15. Now, your interviewer explains that the data team usually works with very large amounts of customer survey data. After receiving the data, they import it into a SQL table. But sometimes, the new dataset imports incorrectly and they need to change the format.
She asks: Is there a command or function that converts data in a SQL table from one datatype to another? You respond: Yes, it’s the CAST function.
- True (Correct)
- False
Correct: The CAST function in SQL converts data in a table from one data type to another.
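As a hedged illustration, the sketch below uses Python's built-in `sqlite3` module and a hypothetical `purchases` table in which prices were imported as text. It covers only text-to-number conversion; the type names accepted by CAST for currency, dates, and times differ between SQL dialects.

```python
import sqlite3

# In-memory database; prices were (incorrectly) imported as text.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (item TEXT, price TEXT)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    [("brush", "9.50"), ("roller", "12.25"), ("paint", "100.00")],
)

# Sorting the text column compares character by character,
# so "100.00" sorts before "12.25" and "9.50".
as_text = [
    row[0] for row in conn.execute("SELECT item FROM purchases ORDER BY price")
]

# CAST converts the text to a number, so the sort is numeric.
as_number = [
    row[0]
    for row in conn.execute(
        "SELECT item FROM purchases ORDER BY CAST(price AS REAL)"
    )
]

print(as_text)
print(as_number)
```

The stored data is unchanged; CAST only converts the values for the purposes of the query.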
16. Next, your interviewer explains that one of their clients is an online retailer that has a vast inventory. She has a list of items by name, color, and size. Then, she has another list of the price of each item by size, as a larger item sometimes costs more. The client needs one list of all items by name, color, size, and price.
She then asks: If you were to use the CONCAT function to complete this task, what would it enable you to do?
- Clean the product identifier text strings
- Create a new product database table
- Search for and return missing products in inventory
- Create a unique key to tell products apart (Correct)
Correct: Concatenating the strings with the CONCAT function creates a unique key that tells products apart.
17. For your final question, your interviewer explains that her team often comes across data with extra leading or trailing spaces.
She asks: Which SQL function enables you to eliminate those extra spaces for consistency?
- LENGTH
- TRIM (Correct)
- LEN
- SUBSTR
Correct: To eliminate extra spaces for consistency, use the TRIM function.
18. She says: Spreadsheets have a great tool for that called remove duplicates. But when writing a SQL query, what command should you include in your SELECT statement to remove duplicates?
- DISCRETE
- DIVERSE
- DISTINCT (Correct)
- DIFFERENT
Correct: To remove duplicates from the results of a SQL query, include the DISTINCT keyword in the SELECT statement.
19. When you refer to the Meer-Kitty survey feedback tab, you are pleased to find that the available data is aligned to the business objective. However, you do some research about confidence level for this type of survey and learn that you need at least 120 unique responses for the survey results to be useful. Therefore, the dataset has two limitations: First, there are only 40 responses; second, a Meer-Kitty superfan, User 588, completed the survey 11 times.
As the survey has too few responses and numerous duplicates that are skewing results, you decide to repeat the survey in order to create a new, improved dataset. What is your first step?
- Write new, improved survey questions.
- Talk with stakeholders, explain the new timeline, and ask for approval. (Correct)
- Find a survey tool that only allows someone to complete the survey once.
- Delete all of the data from the current, skewed survey.
Correct: Before repeating the survey, talk with stakeholders, explain the new timeline, and ask for their approval.
20. During the meeting, you also learn that Meer-Kitty videos are hosted on their website. For each product offered, there is an accompanying video for customers to learn more. So, more views for a video suggests greater consumer interest.
Your goal is to identify which videos are most popular, so Meer-Kitty knows what topics to explore in the future. Unfortunately, Meer-Kitty has just three months of data available because they only recently launched the videos on their site.
Without enough data to identify long-term trends about the video subjects that people prefer, what are your available options? Select all that apply.
- Ask to wait for more data and provide Meer-Kitty with an updated timeline. (Correct)
- Move ahead with the data you have to determine the top video subjects.
- Watch the videos and use your gut instinct to identify which are most successful.
- Talk with Meer-Kitty stakeholders and ask to adjust the objective. (Correct)
Correct: When there’s not enough data to reveal any long-term trends, one option is to speak to stakeholders and recommend changes to the objective. The other option is to wait for more data and give them an updated timeline.
21. You continue cleaning the data. You use tools such as remove duplicates and COUNTIF to ensure the dataset is complete, correct, and relevant to the problem you’re trying to solve. Then, you complete the verification and reporting processes to share the details of your data-cleaning effort with your team.
While reviewing, your team notes one aspect of data cleaning that would improve the dataset even more. They point out that the new survey also has a new question in Column G: “What are your favorite indoor paint colors?” This was a free-response question, so respondents typed in their answers. Some people included multiple different colors of paint. In order to determine which colors are most popular, it will be necessary to put each color in its own cell.
What spreadsheet function enables you to put each of the colors in Column G into a new, separate cell?
- SPLIT (Correct)
- Delimit
- MID
- Divide
Correct: To put each color in Column G into a separate cell, use the SPLIT function. SPLIT is a spreadsheet function that divides a text string around a specified character (such as a comma, a space, or another delimiter) and puts each fragment into a different cell.
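The same split-on-a-delimiter idea can be sketched in Python, followed by the tally of colors that the scenario is ultimately after. The Column G answers below are invented, and the extra whitespace stripping and lowercasing are assumptions about what cleaning free-text responses would need.

```python
from collections import Counter

# Hypothetical free-text answers from Column G.
column_g = ["blue, white", "white", "Green,blue , white"]


def split_cell(cell, delimiter=","):
    """Split one cell around the delimiter and tidy each fragment."""
    return [part.strip().lower() for part in cell.split(delimiter) if part.strip()]


# One fragment per color, then count how often each color appears.
color_counts = Counter(color for cell in column_g for color in split_cell(cell))

print(color_counts.most_common())
```

Once each color sits in its own fragment, counting the most popular colors becomes straightforward.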
22. You arrive 15 minutes early for your interview. Soon, you are escorted into a conference room, where you meet Jodie Choi, the data science lead. After welcoming you, the behavioral interview begins.
For your first question, your interviewer wants to learn about your experience with spreadsheets. She says: Sometimes the team needs data that is stored in different spreadsheets. So, we use spreadsheet functions to help us find the information we need.
What function would you use to search for a certain value in a spreadsheet column to return the corresponding piece of information?
- VLOOKUP (Correct)
- RETURN
- SEARCH
- COUNTIF
Correct: VLOOKUP searches for a certain value in a spreadsheet column and returns the corresponding piece of information from another column in the same row.
23. Next, your interviewer wants to know more about your understanding of tools that work in both spreadsheets and SQL. She explains that the data her team receives from customer surveys sometimes has many duplicate entries.
She says: Spreadsheets have a great tool for that called remove duplicates. Does this mean the team has to remove the duplicate data in a spreadsheet before transferring data to our database?
- Yes
- No (Correct)
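Correct: Duplicates do not have to be removed in a spreadsheet first. They can also be removed in SQL by including DISTINCT in the SELECT statement.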
24. Now, your interviewer explains that the data team usually works with very large amounts of customer survey data. After receiving the data, they import it into a SQL table. But sometimes, the new dataset imports incorrectly and they need to change the format.
She asks: What function would you use to convert data in a SQL table from one datatype to another?
- CHANGE
- CONVERSE
- COALESCE
- CAST (Correct)
Correct: The CAST function is used to convert data in a SQL table from one datatype to another.
25. For your final question, your interviewer explains that her team often uses the TRIM function when writing SQL queries.
She asks: What is the TRIM function used for in SQL?
- To shorten the list of results
- To return the smallest numeric value from a list
- To eliminate null values
- To eliminate extra leading or trailing spaces (Correct)
Correct: The TRIM function is used to eliminate extra leading or trailing spaces.
26. Now that you’ve identified some limitations with Meer-Kitty’s data, you want to communicate your concerns to stakeholders. In addition to insufficient video trend data, your main concern with the indoor paint survey is that the data isn’t representative of the population as a whole.
Clearly, one particular respondent, the superfan, is overrepresented. This means the data doesn’t represent the population as a whole.
When surveying people for Meer-Kitty in the future, what are some best practices you can use to address some of the issues associated with sampling bias? Select all that apply.
- Use data that keeps updating
- Use data from only one source
- Use random sampling (Correct)
- Increase sample size (Correct)
Correct: Random sampling helps address sampling bias. With random sampling, every member of the population has an equal chance of being chosen for the sample. Increasing the sample size also raises the chance that the sample accurately represents the entire population.
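A minimal sketch of simple random sampling with Python's standard library. The population of customer IDs and the seed are hypothetical; the sample size of 150 echoes the repeated survey in the scenario.

```python
import random

# Hypothetical population: 1,000 customer IDs.
population = list(range(1, 1001))

# A seeded generator makes the draw reproducible for this example.
rng = random.Random(42)

# sample() draws without replacement, so every customer has the same
# chance of selection and nobody can be picked twice.
sample = rng.sample(population, k=150)

print(len(sample), len(set(sample)))
```

Because the draw is without replacement, a single enthusiastic respondent cannot be overrepresented the way User 588 was.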
1. Scenario 1, questions 1-5
You are a data analyst at a small analytics company. Your company is hosting a project kick-off meeting with a new client, Meer-Kitty Interior Design. The agenda includes reviewing their goals for the year, answering any questions, and discussing their available data.
Before the meeting you review the About Us tab on their website and their business plan, linked below:
Meer-Kitty Interior Design has two goals. They want to expand their online audience, which means getting their company and brand known by as many people as possible. They also want to launch a line of high-quality indoor paint to be sold in-store and online. You decide to consider the data about indoor paint first.
When you refer to the Meer-Kitty survey feedback tab, you are pleased to find that the available data is aligned to the business objective. However, you do some research about confidence level for this type of survey and learn that you need at least 120 unique responses for the survey results to be useful. Therefore, the dataset has two limitations: First, there are only 40 responses; second, a Meer-Kitty superfan, User 588, completed the survey 11 times.
As the survey has too few responses and numerous duplicates that are skewing results, what are your options? Select all that apply.
- Remove the duplicates from the data and proceed with analysis.
- Locate another dataset about indoor paint.
- Repeat the survey in order to create a new, improved dataset. (CORRECT)
- Talk with stakeholders and ask for more time. (CORRECT)
Correct: With so many duplicates, the best course of action is to consult with stakeholders and request more time. Then you can repeat the survey to build a more accurate, improved dataset.
2. Scenario 2, continued
Next, your interviewer explains that one of their clients is an online retailer that needs to create product numbers for a vast inventory. Her team does this by combining the text strings for product number, manufacturing date, and color.
She asks: If you encountered a situation where you wanted to add strings together to create new text strings, which SQL function would you use?
- CREATE
- COMBINE
- COALESCE
- CONCAT (CORRECT)
Correct: The CONCAT function is used for joining strings to create new text strings.
3. Scenario 2, continued
For your final question, your interviewer explains that her team often comes across data with extra leading or trailing spaces.
She asks: Which SQL function enables you to eliminate those extra spaces for consistency?
- TRIM (CORRECT)
- SUBSTR
- LEN
- LENGTH
Correct: To eliminate extra spaces for consistency, use the TRIM function.
4. Scenario 1 continued
During the meeting, you also learn that Meer-Kitty videos are hosted on their website. For each product offered, there is an accompanying video for customers to learn more. So, more views for a video suggests greater consumer interest.
Your goal is to identify which videos are most popular, so Meer-Kitty knows what topics to explore in the future. Unfortunately, Meer-Kitty has just three months of data available because they only recently launched the videos on their site.
Without enough data to identify long-term trends about the video subjects that people prefer, what should you do?
- Tell the client you’re sorry, but there is no way to meet their objective.
- Watch the videos and use your gut instinct to identify which are most successful.
- Find an alternate data source that will still enable you to meet your objective. (CORRECT)
- Move ahead with the data you have to determine the top video subjects.
5. Scenario 1, continued
You have finished cleaning the data to ensure it is complete, correct, and relevant to the problem you’re trying to solve. Then, you complete the verification and reporting processes to share the details of your data-cleaning effort with your team.
Your team notes one aspect of data cleaning that would help improve the dataset. They point out that the new survey also has a new question in Column G: “What are your favorite indoor paint colors?” This was a free-response question, so respondents typed in their answers. Some people included multiple different colors of paint. In order to determine which colors are most popular, it will be necessary to put each color in its own cell.
You use a spreadsheet function to divide the text strings in Column G around the commas and put each fragment into a new, separate cell. In this example, what are the commas called?
- Delimiters (CORRECT)
- Partitions
- MIDs
- Substrings
Correct: Commas are called delimiters as they are characters that indicate the beginning or the end of an item of information.
6. Scenario 2, questions 6-10
You’ve completed this program and are interviewing for a junior data scientist position. The job is at B.Spoke Market Research, a company that analyzes market conditions using customer surveys and other research methods. The detailed job description can be found below:
You arrive 15 minutes early for your interview. Soon, you are escorted into a conference room, where you meet Jodie Choi, the data science lead. After welcoming you, the behavioral interview begins.
For your first question, your interviewer wants to learn about your experience with spreadsheets. She says: Sometimes the team needs data that is stored in different spreadsheets. So, we use a spreadsheet function to find the information we need.
There is a spreadsheet function that allows a data analyst to search for a value in the first column of a given range and return the value of a specified cell in the row in which it is found. What function allows you to complete these tasks?
- RETURN
- SEARCH
- COUNTIF
- VLOOKUP (CORRECT)
Correct: VLOOKUP searches for a value in the first column of a given range and returns the value from the same row in the column specified by the index.
7. Scenario 2, continued
Next, your interviewer wants to know more about your understanding of tools that work in both spreadsheets and SQL queries. She explains that the data her team receives from customer surveys sometimes has many duplicate entries.
She says: Spreadsheets have a great tool for that called remove duplicates. But when writing a SQL query, what command should you include in your SELECT statement to remove duplicates?
- DIVERSE
- DISTINCT (CORRECT)
- DISCRETE
- DIFFERENT
Correct: To remove duplicates in a SQL query, include DISTINCT in your SELECT statement.
8. Scenario 2, continued
Now, your interviewer explains that the data team usually works with very large amounts of customer survey data. After receiving the data, they import it into a SQL table. But sometimes, the new dataset imports incorrectly and they need to change the format.
She asks: Is there a SQL function that can convert data types such as currency, dates, and times in a SQL table?
- Yes, data types including currency, dates, and times can be converted. (CORRECT)
- No, only currency can be converted.
Correct: The CAST function converts data from one data type to another, including currency, dates, and times.
9. Scenario 1 continued
The stakeholders understand your concerns and agree to repeat the indoor paint survey. In a few weeks, you have a much better dataset with more than 150 responses and no duplicates.
If you are using the template, please refer to the New Meer-Kitty survey feedback tab located at the bottom of the page. You notice that questions 4 and 5 are dependent on the respondent’s answer to question 3. So, you need to determine how many people answered Yes to question 3, then compare that to responses to questions 4 and 5. That way, you will know if questions 4 and 5 have any nulls.
You decide to use a spreadsheet tool that changes how cells appear when they contain the word Yes. Which tool do you use?
- Conditional formatting (CORRECT)
- Data validation
- Filtering
- CONCATENATE
Correct: To change how cells appear when they meet a certain value, use conditional formatting.
10. Scenario 2, questions 6-10
You’ve completed this program and are interviewing for a junior data scientist position. The job is at B.Spoke Market Research, a company that analyzes market conditions using customer surveys and other research methods. The detailed job description can be found below:
So far, you’ve had a phone interview with a recruiter and you’ve secured a second interview with the B.Spoke team. The recruiter’s email can be found below:
You arrive 15 minutes early for your interview. Soon, you are escorted into a conference room, where you meet Jodie Choi, the data science lead. After welcoming you, the behavioral interview begins.
For your first question, your interviewer wants to learn about your experience with spreadsheets. She says: Sometimes the team needs data that is stored in different spreadsheets. So, we use a spreadsheet function to find the information we need.
There is a spreadsheet function that searches for a value in the first column of a given range and returns the value of a specified cell in the row in which it is found. It is called SEARCH.
- True
- False (CORRECT)
11. Scenario 1 continued
Now that you’ve identified some limitations with Meer-Kitty’s data, you want to communicate your concerns to stakeholders. In addition to insufficient video trend data, your main concern with the indoor paint survey is that the data isn’t representative of the population as a whole.
Clearly, one particular respondent, the superfan, is overrepresented. This is an example of margin of error.
- True
- False (CORRECT)
Process Data from Dirty to Clean Course Challenge CONCLUSION
Before taking the course challenge, review the terms and definitions provided at various points during the course. The quiz asks you to demonstrate your knowledge of key concepts such as data cleaning, sample size, data integrity, and the alignment of data with business goals.
You will also apply your data-cleaning skills in a spreadsheet and in SQL, then document the cleaning process and report the results. These activities help prepare you for work as a data analyst. The course is available on Coursera.