Week 3: Cleaning Data with SQL
This section covers the SQL techniques for cleaning data taught in the Google Data Analytics course on Coursera. It covers the queries and functions that help you find and remove errors, inconsistencies, and duplicates in a dataset. This step is essential to sound data analysis because SQL lets a small team clean very large datasets quickly. Mastering these tools will help you prepare clean, analysis-ready datasets.
Learning Objectives:
- Explain the process of cleaning large datasets using SQL.
- Describe SQL functions for data cleaning compared to spreadsheet data-cleaning functions.
- Write basic SQL queries to work effectively with databases.
- Use SQL functions for cleaning and manipulating database string variables.
- Use SQL functions to transform data variables for better analysis.
Test Your Knowledge on SQL
1. What is the maximum value in the price column of the car_info table?
- 45,400 (Correct)
- 12,978
- 5,118
- 16,430
Correct: Using the MIN and MAX functions, you found the maximum price in the price column of the car_info table and checked whether all values fell within the expected range. The query returned a maximum price of 45,400. With this information, you cleaned the price column for analysis. Going forward, you will keep verifying numeric columns in BigQuery to confirm they are clean, so that you can quickly catch and fix issues that might cause errors in the analysis.
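The range check described above can be sketched as follows. This is a minimal illustration, assuming Python's built-in sqlite3 module as a stand-in for BigQuery; the rows in this small car_info table are made up for the example and are not the course data.

```python
import sqlite3

# Hypothetical in-memory stand-in for the car_info table; prices are invented.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE car_info (make TEXT, price REAL)")
conn.executemany(
    "INSERT INTO car_info VALUES (?, ?)",
    [("alpha", 7250.0), ("beta", 13495.0), ("gamma", 31600.0)],
)

# MIN and MAX return the smallest and largest value in the column,
# which shows at a glance whether every price falls in a plausible range.
min_price, max_price = conn.execute(
    "SELECT MIN(price), MAX(price) FROM car_info"
).fetchone()
print(min_price, max_price)
```

If either value falls outside the range you expect, that is a signal to inspect the column before analysis.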
2. Which of the following are benefits of using SQL? Select all that apply.
- SQL offers powerful tools for cleaning data. (Correct)
- SQL can handle huge amounts of data. (Correct)
- SQL can be used to program microprocessors on database servers.
- SQL can be adapted and used with multiple database programs. (Correct)
Correct: SQL can handle huge amounts of data, can be adapted to many database programs, and offers powerful tools for cleaning data.
3. Which of the following tasks can data analysts do using both spreadsheets and SQL? Select all that apply.
- Process huge amounts of data efficiently
- Use formulas (Correct)
- Join data (Correct)
- Perform arithmetic (Correct)
Correct: Analysts can use both SQL and spreadsheets to perform arithmetic, use formulas, and join data from different sources.
4. SQL is a language used to communicate with databases. Like most languages, SQL has dialects. What are the advantages of learning and using standard SQL? Select all that apply.
- Standard SQL works with a majority of databases. (Correct)
- Standard SQL is automatically translated by databases to other dialects.
- Standard SQL is much easier to learn than other dialects.
- Standard SQL requires a small number of syntax changes to adapt to other dialects. (Correct)
Correct: Standard SQL works with a majority of databases and requires only a small number of syntax changes to adapt to other SQL dialects.
5. In your last query, you processed 415.8 GB of data. How many rows were returned by the query?
- 198,768
- 225,038
- 305,710
- 214,710 (Correct)
Correct: Your last query returned 214,710 rows. You can see the total number of rows at the bottom of the data preview. Knowing the size of your data helps you judge how much you have to handle and choose the right analytical tools for each project.
Test Your Knowledge on SQL Queries
1. Which of the following SQL functions can data analysts use to clean string variables? Select all that apply.
- SUBSTR (Correct)
- LENGTH
- COUNTIF
- TRIM (Correct)
Correct: The SUBSTR and TRIM functions let data analysts clean and manipulate string variables.
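As a rough sketch of how these two functions clean strings, the example below uses Python's built-in sqlite3 module. The customer_raw table and its values are hypothetical, and other SQL dialects may differ in details.

```python
import sqlite3

# Hypothetical table with stray spaces around the same country name.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer_raw (country TEXT)")
conn.executemany(
    "INSERT INTO customer_raw VALUES (?)",
    [("  Canada",), ("Canada  ",), ("Canada",)],
)

# Before cleaning, the extra spaces make one country look like three values.
raw_distinct = conn.execute(
    "SELECT COUNT(DISTINCT country) FROM customer_raw"
).fetchone()[0]

# TRIM removes leading and trailing spaces, collapsing them into one value.
trimmed = [row[0] for row in conn.execute(
    "SELECT DISTINCT TRIM(country) FROM customer_raw"
)]

# SUBSTR(text, 1, 3) keeps three characters starting at position 1.
codes = [row[0] for row in conn.execute(
    "SELECT DISTINCT SUBSTR(TRIM(country), 1, 3) FROM customer_raw"
)]
print(raw_distinct, trimmed, codes)
```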
2. You are working with a database table that contains data about playlists for different types of digital media. The table includes columns for playlist_id and name. You want to remove duplicate entries for playlist names and sort the results by playlist ID.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the name column.
NOTE: The three dots (…) indicate where to add the clause.

What playlist name appears in row 6 of your query result?
- TV Shows
- Movies
- Audiobooks
- Music Videos (Correct)
Correct: Music Videos is the playlist name that appears in row 6 of the query result.
3. You are working with a database table that contains data about music albums. The table includes columns for album_id, title, and artist_id. You want to check for album titles that are less than 4 characters long.
You write the SQL query below. Add a LENGTH function that will return any album titles that are less than 4 characters long.

What album ID number appears in row 3 of your query result?
- 236
- 131
- 182 (Correct)
- 239
Correct: The completed query selects the rows from the album table whose title is less than 4 characters long. Album ID 182 appears in row 3 of the result.
4. You are working with a database table that contains customer data. The table includes columns about customer location such as city, state, and country. You want to retrieve the first 3 letters of each country name. You decide to use the SUBSTR function to retrieve the first 3 letters of each country name, and use the AS command to store the result in a new column called new_country.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 3 letters of each country name and store the result in a new column as new_country.
NOTE: The three dots (…) indicate where to add the statement.

What customer ID number appears in row 2 of your query result?
- 55 (Correct)
- 28
- 47
- 3
Correct: The query SELECT customer_id, SUBSTR(country, 1, 3) AS new_country FROM customer ORDER BY country; returns the customer_id and a substring of the country column. Here, SUBSTR extracts a substring that starts at the first character (1) and is three characters long (3), so it returns the first three letters of the country field for each row.
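To see the SUBSTR and AS pattern run end to end, here is a small sketch using Python's built-in sqlite3 module. The three customer rows are invented for illustration, so the row positions will not match the quiz dataset.

```python
import sqlite3

# Hypothetical customer table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "Brazil"), (2, "Canada"), (3, "Germany")],
)

# SUBSTR(country, 1, 3) starts at character 1 and returns 3 characters;
# AS new_country names the resulting column.
cursor = conn.execute(
    "SELECT customer_id, SUBSTR(country, 1, 3) AS new_country "
    "FROM customer ORDER BY country"
)
column_names = [col[0] for col in cursor.description]
rows = cursor.fetchall()
print(column_names, rows)
```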
Process Data from Dirty to Clean Weekly Challenge 3
1. Fill in the blank: Data analysts usually use _____ to deal with very large datasets.
- web browsers
- spreadsheets
- SQL (Correct)
- word processors
Correct: Data analysts usually use SQL to deal with very large datasets.
2. In which of the following situations would a data analyst use spreadsheets instead of SQL? Select all that apply.
- When using a language to interact with multiple database programs
- When working with a dataset with more than 1,000,000 rows
- When working with a small dataset (Correct)
- When visually inspecting data (Correct)
Correct: An analyst would use a spreadsheet, such as Microsoft Excel, when visually inspecting data or working with a small dataset.
3. A data analyst is managing a database of customer information for a retail store. What SQL command can the analyst use to add a new customer to the database?
- INSERT INTO (Correct)
- UPDATE
- DROP TABLE IF EXISTS
- CREATE TABLE IF NOT EXISTS
Correct: The analyst can add a new customer to the database with the INSERT INTO command.
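A minimal sketch of INSERT INTO, run through Python's built-in sqlite3 module, might look like this. The table layout and the customer values are assumptions made up for the example.

```python
import sqlite3

# Hypothetical customer table for a retail store.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customer (customer_id INTEGER PRIMARY KEY, name TEXT, city TEXT)"
)
conn.execute("INSERT INTO customer VALUES (1, 'Existing Customer', 'Austin')")

# INSERT INTO adds a new row; listing the columns makes the intent explicit.
conn.execute(
    "INSERT INTO customer (customer_id, name, city) VALUES (?, ?, ?)",
    (2, "New Customer", "Denver"),
)

customer_count = conn.execute("SELECT COUNT(*) FROM customer").fetchone()[0]
newest = conn.execute(
    "SELECT name, city FROM customer WHERE customer_id = 2"
).fetchone()
print(customer_count, newest)
```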
4. You are working with a database table that contains invoice data. The table includes columns for invoice_id and billing_state. You want to remove duplicate entries for billing state and sort the results by invoice ID.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the billing_state column.
NOTE: The three dots (…) indicate where to add the clause.

What billing state appears in row 17 of your query result?
- CA
- NV
- AZ (Correct)
- WI
Correct: Adding DISTINCT billing_state completes the query as SELECT DISTINCT billing_state FROM invoice ORDER BY invoice_id. DISTINCT removes duplicate entries from the billing_state column, so each billing state appears only once in the result. Row 17 of the result shows the billing state AZ.
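The effect of DISTINCT can be sketched with Python's built-in sqlite3 module. The invoice rows below are invented, and this sketch sorts by billing_state rather than invoice_id purely so that the output order is easy to predict.

```python
import sqlite3

# Hypothetical invoice table in which some states repeat.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoice (invoice_id INTEGER, billing_state TEXT)")
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?)",
    [(1, "CA"), (2, "NV"), (3, "CA"), (4, "AZ"), (5, "NV")],
)

# Without DISTINCT every row comes back, duplicates included.
all_states = [row[0] for row in conn.execute(
    "SELECT billing_state FROM invoice ORDER BY billing_state"
)]

# With DISTINCT each billing state appears only once.
unique_states = [row[0] for row in conn.execute(
    "SELECT DISTINCT billing_state FROM invoice ORDER BY billing_state"
)]
print(all_states, unique_states)
```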
5. You are working with a database table that contains customer data. The table includes columns about customer location such as city, state, country, and postal_code. The state names are abbreviated. You want to check for state names that are greater than 2 characters long.
You write the SQL query below. Add a LENGTH function that will return any state names that are greater than 2 characters long.

What country appears in row 1 of your query result?
- Chile
- Ireland (Correct)
- India
- France
Correct: The condition LENGTH(state) > 2 returns all state names that are longer than 2 characters. The complete query is SELECT * FROM customer WHERE LENGTH(state) > 2. LENGTH counts the number of characters in a string. Row 1 of the query result shows the country Ireland.
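The same LENGTH check can be tried on a toy table using Python's built-in sqlite3 module. The rows here are hypothetical and only illustrate how the filter flags values that break the two-letter convention.

```python
import sqlite3

# Hypothetical customer table; two rows break the two-letter state convention.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (customer_id INTEGER, state TEXT)")
conn.executemany(
    "INSERT INTO customer VALUES (?, ?)",
    [(1, "CA"), (2, "Dublin"), (3, "NY"), (4, "Ont")],
)

# LENGTH counts characters, so the WHERE clause keeps only the long values.
too_long = conn.execute(
    "SELECT customer_id, state FROM customer "
    "WHERE LENGTH(state) > 2 ORDER BY customer_id"
).fetchall()
print(too_long)
```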
6. In SQL databases, what data type refers to a number that contains a decimal?
- Integer
- Boolean
- String
- Float (Correct)
Correct: The FLOAT data type stores a number that contains a decimal point. In SQL databases it holds approximate numeric values with floating-point precision.
7. Fill in the blank: In SQL databases, the _____ function can be used to convert data from one datatype to another.
- SUBSTR
- TRIM
- LENGTH
- CAST (Correct)
Correct: In SQL, the CAST function converts data from one datatype to another.
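Here is a small sketch of CAST using Python's built-in sqlite3 module, where REAL is SQLite's floating-point type. The purchases table is hypothetical. Note one assumption that matters: SQLite truncates when casting a float to an integer, while other databases may round instead, so check your own dialect.

```python
import sqlite3

# Hypothetical table where prices were imported as text strings.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (price TEXT)")
conn.executemany("INSERT INTO purchases VALUES (?)", [("19.75",), ("5.50",)])

# Sorting the raw text compares characters, so '19.75' comes before '5.50'.
text_order = [row[0] for row in conn.execute(
    "SELECT price FROM purchases ORDER BY price"
)]

# CAST(price AS REAL) converts the text to a float, so the sort is numeric.
float_order = [row[0] for row in conn.execute(
    "SELECT CAST(price AS REAL) FROM purchases ORDER BY CAST(price AS REAL)"
)]

# Casting the float to INTEGER drops the decimals (SQLite truncates).
int_order = [row[0] for row in conn.execute(
    "SELECT CAST(CAST(price AS REAL) AS INTEGER) FROM purchases "
    "ORDER BY CAST(price AS REAL)"
)]
print(text_order, float_order, int_order)
```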
8. A data analyst is cleaning survey data. The results for an optional question contain many nulls. What function can the analyst use to eliminate the null values from the results?
- LENGTH
- CAST
- COALESCE (Correct)
- CONCAT
Correct: The analyst can use the COALESCE function to replace null values in the results with a specified value, so the nulls no longer appear in the output.
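A short sketch of COALESCE on survey-style data, using Python's built-in sqlite3 module, could look like this. The survey table and the 'no response' placeholder are assumptions chosen for the example.

```python
import sqlite3

# Hypothetical survey table; None is stored as SQL NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE survey (respondent_id INTEGER, comment TEXT)")
conn.executemany(
    "INSERT INTO survey VALUES (?, ?)",
    [(1, "great"), (2, None), (3, "ok")],
)

# COALESCE returns its first non-null argument, so NULL comments are
# replaced by the placeholder while real comments pass through unchanged.
comments = [row[0] for row in conn.execute(
    "SELECT COALESCE(comment, 'no response') FROM survey ORDER BY respondent_id"
)]
print(comments)
```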
9. You are working with a database table that contains invoice data. The table includes columns about billing location such as billing_city, billing_state, and billing_country. You want to retrieve the first 4 letters of each city name. You decide to use the SUBSTR function to retrieve the first 4 letters of each city name, and use the AS command to store the result in a new column called new_city.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 4 letters of each city name and store the result in a new column as new_city.
NOTE: The three dots (…) indicate where to add the statement.

What invoice ID number appears in row 7 of your query result?
- 97
- 390 (Correct)
- 206
- 23
Correct: The complete statement is SELECT invoice_id, SUBSTR(billing_city, 1, 4) AS new_city FROM invoice ORDER BY billing_city. The SUBSTR function extracts a substring from a string, here the first 4 characters of each billing_city. Row 7 of the result shows invoice ID 390.
11. Data analysts choose SQL for which of the following reasons? Select all that apply.
- SQL is a programming language that can also create web apps
- SQL is a powerful software program
- SQL is a well-known standard in the professional community (CORRECT)
- SQL can handle huge amounts of data (CORRECT)
Correct: Data analysts choose SQL because it is a well-known standard in the professional community and can handle huge amounts of data.
12. In which of the following situations would a data analyst use SQL instead of a spreadsheet? Select all that apply.
- When using the COUNTIF function to find a specific piece of information
- When working with a huge amount of data (CORRECT)
- When recording queries and changes throughout a project (CORRECT)
- When quickly pulling information from many different sources in a database (CORRECT)
Correct: A data analyst would use SQL instead of a spreadsheet when working with a huge amount of data. SQL also makes it fast to pull information from many different sources in a database, and it lets the analyst record queries and track changes throughout a project.
13. A data analyst creates many new tables in their company’s database. When the project is complete, the analyst wants to remove the tables so they don’t clutter the database. What SQL commands can they use to delete the tables?
- INSERT INTO
- DROP TABLE IF EXISTS (CORRECT)
- CREATE TABLE IF NOT EXISTS
- UPDATE
Correct: The analyst can use DROP TABLE IF EXISTS to delete the tables so they do not clutter the database.
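The cleanup step can be sketched with Python's built-in sqlite3 module. The temp_results table is hypothetical, and the sqlite_master lookup is specific to SQLite; other databases expose their table lists differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical working table created during a project.
conn.execute("CREATE TABLE IF NOT EXISTS temp_results (id INTEGER)")


def table_names(connection):
    """List the tables currently defined in this SQLite database."""
    rows = connection.execute("SELECT name FROM sqlite_master WHERE type = 'table'")
    return [row[0] for row in rows]


tables_before = table_names(conn)

# DROP TABLE IF EXISTS removes the table; running it a second time is safe
# because IF EXISTS suppresses the error for a table that is already gone.
conn.execute("DROP TABLE IF EXISTS temp_results")
conn.execute("DROP TABLE IF EXISTS temp_results")

tables_after = table_names(conn)
print(tables_before, tables_after)
```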
14. You are working with a database table that contains customer data. The table includes columns about customer location such as city, state, country, and postal_code. You want to check for city names that are greater than 9 characters long.
You write the SQL query below. Add a LENGTH function that will return any city names that are greater than 9 characters long.
SELECT
*
FROM
customer
WHERE
What is the first name of the customer that appears in row 7 of your query result?
- Roberto
- Diego
- Kara
- Julia (CORRECT)
15. A data analyst is cleaning transportation data for a ride-share company. The analyst converts the data on ride duration from text strings to floats. What does this scenario describe?
- Typecasting (CORRECT)
- Processing
- Calculating
- Visualizing
Correct: The analyst is typecasting, which means converting data from one type to another.
16. A data analyst is working with product sales data. They import new data into a database. The database recognizes the data for product price as text strings. What SQL function can the analyst use to convert text strings to floats?
- SUBSTR
- TRIM
- LENGTH
- CAST (CORRECT)
Correct: The analyst can use the CAST function to convert text strings to floats.
17. You are working with a database table that contains customer data. The table includes columns about customer location such as city, state, and country. The state names are abbreviated. You want to retrieve the first 2 letters of each state name. You decide to use the SUBSTR function to retrieve the first 2 letters of each state name, and use the AS command to store the result in a new column called new_state.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 2 letters of each state name and store the result in a new column as new_state.
NOTE: The three dots (…) indicate where to add the statement.
SELECT
customer_id,
…
FROM
customer
ORDER BY
state DESC
What customer ID number appears in row 9 of your query result?
- 3
- 55
- 47 (CORRECT)
- 10
Correct: The statement SUBSTR(state, 1, 2) AS new_state returns a substring that starts at the first character of the state field and is two characters long, and stores it in a new column called new_state.
18. What are some of the benefits of using SQL for analysis? Select all that apply.
- SQL has built-in functionalities.
- SQL tracks changes across a team. (CORRECT)
- SQL can pull information from different database sources. (CORRECT)
- SQL interacts with database programs. (CORRECT)
Correct: SQL offers several advantages for analysis: it tracks changes across a team, interacts with database programs, and can pull information from different database sources.
19. Fill in the blank: _____ refers to the process of converting data from one type to another.
- Formatting
- Cleaning
- Typecasting (CORRECT)
- Querying
Correct: Typecasting involves the conversion of the data into a different type.
20. The CAST function can be used to convert the DATE datatype to the DATETIME datatype.
- True (CORRECT)
- False
Correct: The CAST function converts the DATE datatype to the DATETIME datatype. More generally, CAST converts a value from one data type to another, as long as the database supports that conversion.
21. What SQL function lets you add strings together to create new text strings that can be used as unique keys?
- LENGTH
- CONCAT (CORRECT)
- CAST
- COALESCE
Correct: The CONCAT function adds strings together to create new text strings that can be used as unique keys.
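The idea of building a key by joining strings can be sketched with Python's built-in sqlite3 module. One substitution to be aware of: this sketch uses SQLite's || concatenation operator in place of CONCAT, because older SQLite versions bundled with Python may not provide a CONCAT function. The orders table and its values are hypothetical.

```python
import sqlite3

# Hypothetical orders table; neither column is unique on its own.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, product_code TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(7, "A12"), (7, "B34"), (9, "A12")],
)

# Joining the two columns with a separator yields a key that is unique per row.
# The || operator plays the role that CONCAT plays in other dialects.
order_keys = [row[0] for row in conn.execute(
    "SELECT customer_id || '-' || product_code AS order_key "
    "FROM orders ORDER BY order_key"
)]
print(order_keys)
```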
22. A data analyst is analyzing medical data for a health insurance company. The dataset contains billions of rows of data. Which of the following tools will handle the data most efficiently?
- A spreadsheet
- A word processor
- A presentation
- SQL (CORRECT)
Correct: SQL handles data efficiently and can manage huge amounts of it.
23. A data analyst is managing a database of customer information for a retail store. What SQL command can the analyst use to add a new customer to the database?
- CREATE TABLE IF NOT EXISTS
- INSERT INTO (CORRECT)
- UPDATE
- DROP TABLE IF EXISTS
Correct: The new customer record can be inserted into the database using the command INSERT INTO.
24. A data analyst runs a SQL query to extract some data from a database for further analysis. How can the analyst save the data? Select all that apply.
- Run a SQL query to automatically save the data.
- Use the UPDATE query to save the data.
- Download the data as a spreadsheet. (CORRECT)
- Create a new table for the data. (CORRECT)
Correct: The analyst can save the data either by downloading it as a spreadsheet or by creating a new table to hold it.
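The second option, saving query results in a new table, can be sketched with Python's built-in sqlite3 module. The invoice rows and the ca_invoices table name are assumptions for illustration; the CREATE TABLE ... AS SELECT form shown here is supported by SQLite and many other databases, though exact syntax can vary.

```python
import sqlite3

# Hypothetical invoice table with made-up rows.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE invoice (invoice_id INTEGER, billing_state TEXT, total REAL)"
)
conn.executemany(
    "INSERT INTO invoice VALUES (?, ?, ?)",
    [(1, "CA", 10.0), (2, "NV", 5.0), (3, "CA", 7.5)],
)

# CREATE TABLE ... AS SELECT stores the query result in a new table,
# so the extracted data stays available for later analysis.
conn.execute(
    "CREATE TABLE ca_invoices AS "
    "SELECT invoice_id, total FROM invoice WHERE billing_state = 'CA'"
)

saved = conn.execute(
    "SELECT invoice_id, total FROM ca_invoices ORDER BY invoice_id"
).fetchall()
print(saved)
```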
25. Fill in the blank: The _____ function can be used to return non-null values in a list.
- COALESCE (CORRECT)
- CAST
- TRIM
- CONCAT
Correct: The COALESCE function can be used to return non-null values in a list.
26. You are working with a database table named customer that contains customer data. The table includes columns about customer location such as city, state, country, and postal_code. You want to check for postal codes that are greater than 7 characters long.
You write the SQL query below. Add a LENGTH function that will return any postal_code that is greater than 7 characters long.
NOTE: The three dots (…) indicate where to add the clause.
SELECT
*
FROM
customer
WHERE …
What is the last name of the customer that is in row 10 of your query result?
NOTE: The query index starts at 1 not 0.
- Ramos
- Brooks
- Hughes (CORRECT)
- Rocha
Correct: The LENGTH function counts the number of characters in a string. In the query result, the customer in row 10 has the last name Hughes.
27. After a company merger, a data analyst receives a dataset with billions of rows of data. They need to leverage this data to identify insights for upper management. What tool would be most efficient for the analyst to use?
- SQL (CORRECT)
- Word processor
- CSV
- Spreadsheet
28. As a data analyst, you are working on a quick project containing a small amount of data. As the data was emailed to you, there is no need to query the data. What tool should you use to perform your analysis?
- Spreadsheet (CORRECT)
- CSV
- Word processor
- SQL
29. You are working with a database table named invoice that contains invoice data. The table includes columns for invoice_id and customer_id. You want to remove duplicate entries for customer_id and sort the results by invoice_id.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the customer_id column.
NOTE: The three dots (…) indicate where to add the clause.
SELECT …
FROM
invoice
ORDER BY
invoice_id
What customer ID number appears in row 12 of your query result?
NOTE: The query index starts at 1 not 0.
- 42
- 8
- 16 (CORRECT)
- 23
30. You’re working with a dataset that contains a float column with a significant amount of decimal places. This level of granularity is not needed for your current analysis. How can you convert the data in the float column to be integer data?
- LENGTH
- TRIM
- SUBSTR
- CAST (CORRECT)
31. You are working with a database table that contains invoice data. The table includes columns about billing location such as billing_city, billing_state, and billing_country. You use the SUBSTR function to retrieve the first 4 letters of each billing city name, and use the AS command to store the result in a new column called new_city.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 4 letters of each billing city name and store the result in a new column as new_city.
NOTE: The three dots (…) indicate where to add the statement.
NOTE: SUBSTR takes three arguments: the column, the starting index, and the number of characters to return.
SELECT
invoice_id,
…
FROM
invoice
ORDER BY
billing_city
What invoice ID number is in row 7 of your query result?
NOTE: The query index starts at 1 not 0.
- 390
- 97
- 23 (CORRECT)
- 206
32. A junior data analyst joins a new company. The analyst learns that SQL is heavily utilized within the organization. Why would the organization choose to invest in SQL? Select all that apply.
- SQL is a powerful software program.
- SQL is a programming language that can also create web apps.
- SQL can handle huge amounts of data. (CORRECT)
- SQL is a well-known standard in the professional community. (CORRECT)
33. Your manager tasks you with analyzing a dataset and visually inspecting the data. Upon initial inspection you realize that this is a small dataset. What tool should you use to analyze the data?
- Word processor
- CSV
- SQL
- Spreadsheet (CORRECT)
34. A data analyst creates a database to store information on the company’s customer data. When completing the initial import the analyst notices that they forgot to add a few customers into the table. What command can the analyst use to add these missed customers?
- INSERT INTO (CORRECT)
- DROP
- APPEND
- ADD
35. You are working with a database table named invoice that contains invoice data. The table includes a column for customer_id. You want to remove duplicate entries for customer_id and get a count of total customers in the database.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the customer_id column.
NOTE: The three dots (…) indicate where to add the clause.
SELECT COUNT(…)
FROM
invoice
What is the total number of customers in the database?
- 59 (CORRECT)
- 43
- 84
- 105
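Counting unique values rather than rows can be sketched with Python's built-in sqlite3 module. The orders table below is hypothetical and is not the quiz dataset; it only shows how DISTINCT changes what COUNT returns.

```python
import sqlite3

# Hypothetical orders table in which customers appear more than once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, customer_id INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 1), (2, 2), (3, 1), (4, 3), (5, 2)],
)

# COUNT(customer_id) counts every row, including repeat customers.
total_rows = conn.execute("SELECT COUNT(customer_id) FROM orders").fetchone()[0]

# COUNT(DISTINCT customer_id) counts each customer only once.
unique_customers = conn.execute(
    "SELECT COUNT(DISTINCT customer_id) FROM orders"
).fetchone()[0]
print(total_rows, unique_customers)
```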
36. You are working with a database table that contains employee data. The table includes columns about employee location such as city, state, country, and postal_code. You use the SUBSTR function to retrieve the first 3 characters of each last_name, and use the AS command to store the result in a new column called new_last_name.
You write the SQL query below. Add a statement to your SQL query that will retrieve the first 3 characters of each last_name and store the result in a new column as new_last_name.
NOTE: The three dots (…) indicate where to add the statement.
NOTE: SUBSTR takes three arguments: the column, the starting index, and the number of characters to return.
SELECT
employee_id,
…
FROM
employee
ORDER BY
postal_code
What employee ID number is in row 8 of your query result?
NOTE: The query index starts at 1 not 0.
- 3
- 8
- 7
- 1 (CORRECT)
37. A data analyst is tasked with identifying what orders are still in transit. The current list of orders contains trillions of rows. What is the best tool for the analyst to use?
- SQL (CORRECT)
- Word processor
- CSV
- Spreadsheets
38. You’ve been working on a large project for your organization that has spanned many months. Throughout the project you have created multiple tables to save your progress and store data you may need later on. Because the project is ending soon, you decide to do some housekeeping and clean up the tables you will no longer need. What command will you use to accomplish this task?
- DROP COLUMN IF EXISTS
- DROP ROW IF EXISTS
- DROP TABLE IF EXISTS (CORRECT)
- DROP IF EXISTS TABLE
39. You are working with a database table named invoice that contains invoice data. The table includes a column for invoice_date. You want to remove duplicate entries for invoice_date.
You write the SQL query below. Add a DISTINCT clause that will remove duplicate entries from the invoice_date column.
NOTE: The three dots (…) indicate where to add the clause.
SELECT …
FROM
invoice
What invoice_date is in row 17 of your query result?
NOTE: The query index starts at 1 not 0.
- 2009-04-06
- 2009-03-14 (CORRECT)
- 2009-01-03
- 2009-03-05
Correct: DISTINCT eliminates duplicates only in the column named in the SELECT clause, which here is invoice_date.
40. You’re working with a dataset that contains a float column with a significant amount of decimal places. This level of granularity is not needed for your current analysis. How can you convert the data in the float column to be integer data?
- TRIM
- LENGTH
- CAST (CORRECT)
- SUBSTR
41. Fill in the blank: The _____ function can be used to change the data type of a column.
- TRIM
- COALESCE
- CONCAT
- CAST (CORRECT)
42. Fill in the blank: The _____ function can be used to join strings to create a new column.
- COALESCE
- TRIM
- CAST
- CONCAT (CORRECT)
Cleaning Data with SQL conclusion
Knowing how to clean data with SQL can save an analyst a great deal of work. This week of the course covers the queries and functions that make data cleaning effective, a valuable skill for any data analyst. You can learn more about data cleaning with SQL on Coursera.