Week 3: Working with Data in R
WORKING WITH DATA IN R INTRODUCTION
This part of the module covers how to structure, organize, and clean data in R. The data frame is the R object that stores tabular data, and the module introduces a range of R functions and techniques for managing data.
Within this module, you will learn how to:
- Use R functions to check for possible bias in data (such as selection bias and confounding variables) and see how R can help identify it.
- Use R functions and tools for importing, cleaning, and organizing data, such as read_csv(), data(), and the datapasta package.
- Differentiate between tibbles and tribbles: a tibble is a tidyverse data frame, and tribble() builds one row by row.
- Compare the effectiveness of tools for data cleaning.
- Gain hands-on experience creating, manipulating, and working with data in R.
TEST YOUR KNOWLEDGE ON R DATA FRAMES
1. Which of the following are best practices for creating data frames? Select all that apply.
- All data stored should be the same type
- Rows should be named
- Columns should be named (Correct)
- Each column should contain the same number of data items (Correct)
Correct: When creating a data frame, every column should be named and every column should contain the same number of entries.
2. Why are tibbles a useful variation of data frames?
- Tibbles make printing easier (Correct)
- Tibbles make changing the names of variables easier.
- Tibbles can change the data type of inputs
- Tibbles can create row names
Correct: Tibbles make printing easier and avoid overloading the console when you work with large datasets. By default, a tibble prints only the first ten rows and as many columns as fit on the screen.
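The printing behaviour described above can be seen with a small sketch. It assumes the tibble package is installed; the data frame and its column names are hypothetical and exist only for illustration.

```r
library(tibble)

# Hypothetical data: 1,000 rows, used only to illustrate printing behaviour.
measurements_df <- data.frame(id = 1:1000, value = sqrt(1:1000))

# Convert the data frame to a tibble.
measurements_tbl <- as_tibble(measurements_df)

# Printing the tibble shows only the first 10 rows plus a note about the rest;
# printing measurements_df would scroll far more rows through the console.
print(measurements_tbl)

# Both objects hold the same data; only the printing differs.
nrow(measurements_tbl)
```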
3. Tidy data is a way of standardizing the organization of data within R.
- True (Correct)
- False
Correct: Tidy data follows a small set of principles that give a dataset a meaningful, readable structure. It standardizes how data is organized in R, which makes the data easier to understand and analyze.
4. Which R function can be used to make changes to a data frame?
- str()
- mutate() (Correct)
- head()
- colnames()
Correct: The mutate() function lets you change existing columns or create new ones in a data frame. It is typically used to apply a transformation or calculation to existing columns.
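As a minimal sketch of mutate() in use, assuming the dplyr package is installed; the orders data frame and its columns are hypothetical:

```r
library(dplyr)

# Hypothetical data frame for illustration.
orders <- data.frame(price = c(10, 20, 30), quantity = c(2, 1, 4))

# mutate() returns a new data frame; `orders` itself is unchanged
# unless the result is assigned back to it.
orders_with_total <- mutate(orders, total = price * quantity)

orders_with_total$total
```

Because the result is stored under a new name, the original two-column data frame is still available for comparison.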
TEST YOUR KNOWLEDGE ON CLEANING DATA
1. A data analyst is cleaning their data in R. They want to be sure that their column names are unique and consistent to avoid any errors in their analysis. What R function can they use to do this automatically?
- rename()
- clean_names() (Correct)
- select()
- rename_with()
Correct: The clean_names() function, from the janitor package, automatically standardizes column names so that they are unique, consistent, and correctly formatted (for example, spaces become underscores and all letters become lowercase).
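The following sketch shows the kind of renaming described above. It assumes the janitor package is installed; the customers data frame is hypothetical.

```r
library(janitor)

# Hypothetical data frame with inconsistent column names.
# check.names = FALSE keeps the spaces in the names exactly as typed.
customers <- data.frame(
  "First Name" = c("Ada", "Grace"),
  "Last Name"  = c("Lovelace", "Hopper"),
  check.names = FALSE
)

# clean_names() returns a copy with standardized column names.
cleaned <- clean_names(customers)
colnames(cleaned)
```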
2. You are working with the penguins dataset. You want to use the arrange() function to sort the data for the column bill_length_mm in ascending order. You write the following code:
penguins %>%
Add a code chunk to sort the column bill_length_mm in ascending order.
What is the shortest bill length in mm?
- 33.1
- 33.5
- 34.0
- 32.1 (Correct)
Correct: Adding the code chunk arrange(bill_length_mm) sorts the tibble by the bill_length_mm column, from the shortest bill length to the longest. The shortest bill length is 32.1 mm, so that value appears in the first row of the output.
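The sorting behaviour can be sketched without the penguins data. The example assumes dplyr is installed and uses a hypothetical birds data frame whose values are made up for illustration.

```r
library(dplyr)

# Hypothetical bill lengths, not the real penguins data.
birds <- data.frame(
  id = c("a", "b", "c", "d"),
  bill_length_mm = c(39.1, 32.1, 46.5, 34.0)
)

# arrange() sorts ascending by default, so the first row holds the minimum.
sorted <- birds %>% arrange(bill_length_mm)
sorted$bill_length_mm[1]

# desc() reverses the order when the largest value should come first.
birds %>% arrange(desc(bill_length_mm))
```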
3. A data analyst is working with customer information from their company’s sales data. The first and last names are in separate columns, but they want to create one column with both names instead. Which of the following functions can they use?
- arrange()
- unite() (Correct)
- select()
- separate()
Correct: The unite() function can be used to combine several columns of a data frame into a single column. You give the name of the new column, the columns to combine, and a separator string to place between the values.
TEST YOUR KNOWLEDGE ON R FUNCTIONS
1. Which of the following functions can a data analyst use to get a statistical summary of their dataset? Select all that apply.
- mean() (Correct)
- ggplot2()
- cor() (Correct)
- sd() (Correct)
Correct: The functions sd(), cor(), and mean() serve the purpose of providing statistical summaries for datasets through the computation of standard deviation, correlation, and mean, respectively.
2. A data analyst inputs the following command:
quartet %>% group_by(set) %>% summarize(mean(x), sd(x), mean(y), sd(y), cor(x, y))
Which of the functions in this command can help them determine how strongly related their variables are?
- sd(x)
- cor(x,y) (Correct)
- sd(y)
- mean(y)
Correct: The cor() function returns the correlation between two variables, which measures how strongly they are related.
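A small base R sketch of these summary functions follows. It uses the built-in anscombe dataset; treating it as a stand-in for the quartet data in the question is an assumption, since the two are laid out differently.

```r
# anscombe is a built-in dataset with columns x1..x4 and y1..y4.
# Using its first pair of columns here is an illustrative choice.
x <- anscombe$x1
y <- anscombe$y1

mean(x)    # centre of x
sd(x)      # spread of x
cor(x, y)  # strength and direction of the linear relationship, between -1 and 1
```

Only cor() involves both variables, which is why it is the one that describes how they relate to each other.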
3. Fill in the blank: The bias function compares the actual outcome of the data with the _____ outcome to determine whether or not the model is biased.
- desired
- probable
- final
- predicted (Correct)
Correct: The bias function compares actual outcomes with predicted outcomes to determine if the model is biased.
DATA ANALYSIS WITH R PROGRAMMING WEEKLY CHALLENGE 3
1. A data analyst creates a data frame with data that has more than 50,000 observations in it. When they print their data frame, it slows down their console. To avoid this, they decide to switch to a tibble. Why would a tibble be more useful in this situation?
- Tibbles only include a limited number of data items
- Tibbles will automatically create row names to make the data easier to read
- Tibbles will automatically change the names of variables to make them shorter and easier to read
- Tibbles won’t overload the console because they automatically only print the first 10 rows of data and as many variables as will fit on the screen (Correct)
Correct: Tibbles print differently from standard R data frames: by default they show only the first 10 rows and as many columns as fit on the screen, so the console is not flooded with output.
2. A data analyst is exploring their data to get more familiar with it. They want a preview of just the first six rows to get a better idea of how the data frame is laid out. What function should they use?
- print()
- colnames()
- preview()
- head() (Correct)
Correct: The head() function returns a preview of the first six rows of a data frame, which helps when exploring its structure.
3. You are working with the ToothGrowth dataset. You want to use the head() function to get a preview of the dataset. Write the code chunk that will give you this preview.

- head(ToothGrowth)
What are the names of the columns in the ToothGrowth dataset?
- len, supp, dose (Correct)
- VC, supp, dose
- len, supp, VC
- len, VC, dose
Correct: Running head(ToothGrowth) gives a preview of the ToothGrowth dataset. You put the name of the dataset inside the parentheses of head(), and it returns the column names along with the first few rows. The ToothGrowth dataset contains three columns: len, supp, and dose.
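Because ToothGrowth ships with base R, the preview can be reproduced directly. The variable name preview is only a convenience for this sketch.

```r
# ToothGrowth is a built-in dataset, so no packages are needed.
preview <- head(ToothGrowth)   # first six rows by default
preview

colnames(ToothGrowth)          # the column names on their own
head(ToothGrowth, n = 3)       # n changes how many rows are returned
```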
4. A data analyst is working with a data frame named cars. The analyst notices that all the column names in the data frame are capitalized. What code chunk lets the analyst change all the column names to lowercase?
- rename_with(cars, toupper)
- rename_with(tolower, cars)
- rename_with(toupper, cars)
- rename_with(cars, tolower) (Correct)
Correct: The code chunk rename_with(cars, tolower) converts the column names of the cars data frame to lowercase. The first argument is the data frame, and the tolower argument is the function applied to every column name.
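A sketch of the argument order, assuming dplyr is installed. The cars_df data frame is hypothetical; it is used instead of the built-in cars dataset, whose column names are already lowercase.

```r
library(dplyr)

# Hypothetical data frame with capitalized column names.
cars_df <- data.frame(MAKE = c("Audi", "Ford"), MPG = c(28, 24))

# The data frame comes first, then the function applied to every column name.
lower_cars <- rename_with(cars_df, tolower)
colnames(lower_cars)
```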
5. A data analyst is working with the penguins dataset in R. What code chunk will allow them to sort the penguins data by the variable bill_length_mm?
- arrange(=bill_length_mm)
- arrange(bill_length_mm, penguins)
- arrange(penguins, bill_length_mm) (Correct)
- arrange(penguins)
Correct: The code chunk is arrange(penguins, bill_length_mm). The arrange function allows the analyst to sort data in their dataset. The arguments for the function identify the dataset as the penguins data, and that the sort should be based on the bill_length_mm variable. The data is automatically sorted in ascending order.
6. You are working with the penguins dataset. You want to use the summarize() and max() functions to find the maximum value for the variable flipper_length_mm. You write the following code:
penguins %>%
drop_na() %>%
group_by(species) %>%
Add the code chunk that lets you find the maximum value for the variable flipper_length_mm.

What is the maximum flipper length in mm for the Gentoo species?
- 212
- 231 (Correct)
- 210
- 200
Correct: The summarize() function computes summary statistics. It can be combined with functions such as mean(), max(), and min() to produce specific statistics. Here, adding summarize(max(flipper_length_mm)) finds the maximum value of the flipper_length_mm variable for each species. For the Gentoo species, the maximum flipper length is 231 mm.
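The same pipeline shape can be tried on a small hypothetical data frame. The sketch assumes dplyr and tidyr are installed; the species labels and measurements are made up and are not the penguins values.

```r
library(dplyr)
library(tidyr)

# Hypothetical measurements; values are illustrative, not from penguins.
birds <- data.frame(
  species = c("A", "A", "B", "B", "B"),
  flipper_length_mm = c(181, 190, NA, 217, 230)
)

flipper_summary <- birds %>%
  drop_na() %>%                  # remove rows with missing values
  group_by(species) %>%          # one group per species
  summarize(max_flipper = max(flipper_length_mm))  # one row per group

flipper_summary
```

Naming the summary column (max_flipper here) is optional, but it makes the output easier to refer to later.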
7. A data analyst is working with a data frame called salary_data. They want to create a new column named total_wages that adds together data in the standard_wages and overtime_wages columns. What code chunk lets the analyst create the total_wages column?
- mutate(total_wages = standard_wages + overtime_wages)
- mutate(salary_data, total_wages = standard_wages + overtime_wages) (Correct)
- mutate(salary_data, standard_wages = total_wages + overtime_wages)
- mutate(salary_data, total_wages = standard_wages * overtime_wages)
Correct: The code chunk mutate(salary_data, total_wages = standard_wages + overtime_wages) uses the function mutate() to create a new column total_wages which will consist of standard wages added to overtime wages. The mutate() function helps the analyst create a column without modifying the rest of the columns in the data set.
8. A data analyst is working with a data frame named stores. It has separate columns for city (city) and state (state). The analyst wants to combine the two columns into a single column named location, with the city and state separated by a comma. What code chunk lets the analyst create the location column?
- unite(stores, "location", city, state, sep=",") (Correct)
- unite(stores, "location", city, state)
- unite(stores, "location", city, sep=",")
- unite(stores, city, state, sep=",")
Correct: The code chunk unite(stores, "location", city, state, sep=",") creates a new column called location that combines the city and state columns. Inside the parentheses of unite(), the first argument is the name of the data frame, followed by the name of the new column (in quotation marks). Next come the columns to combine, and finally the sep="," argument, which places a comma between the city and state values in the new location column.
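A runnable sketch of this call, assuming tidyr is installed. The contents of stores are hypothetical, and the separate() call at the end is included only to show the reverse operation.

```r
library(tidyr)

# Hypothetical store data for illustration.
stores <- data.frame(city = c("Austin", "Denver"), state = c("TX", "CO"))

# Combine city and state into one column, separated by a comma.
with_location <- unite(stores, "location", city, state, sep = ",")
with_location$location

# separate() reverses the operation, splitting location at the comma.
restored <- separate(with_location, location, into = c("city", "state"), sep = ",")
restored
```

By default unite() drops the columns it combines, so with_location contains only the location column.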
9. In R, which statistical measure demonstrates how strong the relationship is between two variables?
- Standard deviation
- Average
- Maximum
- Correlation (Correct)
Correct: The relationship strength between two variables is measured by correlation, calculated using the function cor() in R.
10. A data analyst is studying weather data. They write the following code chunk:
bias(actual_temp, predicted_temp)
What will this code chunk calculate?
- The maximum difference between the actual and predicted values
- The minimum difference between the actual and predicted values
- The average difference between the actual and predicted values (Correct)
- The total average of the values
Correct: The function bias() is used to measure the mean of the difference between predicted and actual outcomes. This helps in finding out whether the data model is biased or not.
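The calculation described above can be sketched in base R. Note that bias() is not part of base R and comes from an add-on package; the code below is a hand-written version of the idea (an average difference), not that package's implementation, and the temperatures are hypothetical.

```r
# Hypothetical temperatures for illustration.
actual_temp    <- c(68, 70, 72, 71)
predicted_temp <- c(67, 69, 73, 70)

# Average difference between actual and predicted values.
# A result near 0 suggests the predictions are not systematically off;
# a result far from 0 suggests they run consistently high or low.
mean_difference <- mean(actual_temp - predicted_temp)
mean_difference
```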
11. A data analyst is working with a data frame called salary_data. They want to create a new column named hourly_salary that includes data from the wages column divided by 40. What code chunk lets the analyst create the hourly_salary column?
- mutate(hourly_salary, salary_data = wages / 40)
- mutate(salary_data, hourly_salary = wages * 40)
- mutate(salary_data, hourly_salary = wages / 40) (Correct)
- mutate(hourly_salary = wages / 40)
Correct: The code chunk mutate(salary_data, hourly_salary = wages / 40) uses mutate() to create a new column, hourly_salary, equal to wages divided by 40. The mutate() function lets the analyst add the column without changing any of the existing columns.
12. A data analyst wants a high level summary of the structure of their data frame, including the column names, the number of rows and variables, and type of data within a given column. What function should they use?
- colnames()
- rename_with()
- str() (CORRECT)
- head()
13. You are working with the ToothGrowth dataset. You want to use the select() function to view all columns except the supp column. Write the code chunk that will give you this view.
How many columns does the resulting data frame contain?
- 1
- 3
- 2 (CORRECT)
- 4
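For question 13, one way to write the requested chunk is sketched below, assuming dplyr is installed. The variable name without_supp is hypothetical.

```r
library(dplyr)

# A minus sign in front of a column name tells select() to drop that column.
without_supp <- select(ToothGrowth, -supp)

colnames(without_supp)
ncol(without_supp)
```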
14. You are working with the penguins dataset. You want to use the summarize() and min() functions to find the minimum value for the variable bill_depth_mm. At this point, the following code has already been written into the script:
penguins %>%
drop_na() %>%
group_by(species) %>%
Add the code chunk that lets you find the minimum value for the variable bill_depth_mm.
(Note: do not type the above code into the code block editor, as it has already been inputted. Simply add a single line of code based on the prompt.)
What is the minimum bill depth in mm for the Chinstrap species?
- 12.4
- 13.1
- 15.5
- 16.4 (CORRECT)
Correct: The summarize() function computes summary statistics. Combined with functions such as mean(), max(), and min(), it produces specific statistics. Here, min() returns the minimum value of the bill_depth_mm variable. The minimum bill depth for the Chinstrap species is 16.4 mm.
15. A data analyst wants to find out how much the predicted outcome and the actual outcome of their data model differ. What function can they use to quickly measure this?
- mean()
- bias() (CORRECT)
- sd()
- cor()
16. You are working with the ToothGrowth dataset. You want to use the skim_without_charts() function to get a comprehensive view of the dataset. Write the code chunk that will give you this view.
What is the average value of the len column?
- 13.1
- 18.8 (CORRECT)
- 4.2
- 7.65
17. A data analyst is working with the penguins dataset and wants to sort the penguins by body_mass_g from least to greatest. When they run the following code the penguin body mass data is not displayed in the correct order.
penguins %>% arrange(body_mass_g)
head(penguins)
What can the data analyst do to fix their code?
- Use the print() function instead of the head() function
- Correct the capitalization of arrange() to Arrange()
- Save the results of arrange() to a variable that gets passed to head() (CORRECT)
- Add a minus sign in front of body_mass_g to reverse the order
18. You are working with the penguins dataset and want to understand the year of data collection for all combinations of species, island, and sex. At this point, the following code has already been written into your script:
penguins %>%
drop_na() %>%
group_by(species, island, sex) %>%
summarize(min = min(year), max = max(year))
When you run the code in the code box, how many separate observational rows are returned by this code chunk?
- 10 (CORRECT)
- 6
- 2
- 3
19. A data analyst is working with a data frame called athletes. The data frame contains a column named record that represents an athlete’s wins and losses separated by a hyphen (-). They want to turn this single column into individual columns for wins and losses. Which code chunk lets the analyst split the record column?
- separate(athletes, record, into=c("wins", "losses"), delim="-")
- separate(athletes, record, into=c("wins", "losses"), sep="-") (CORRECT)
- separate(record, athletes, into=c("wins", "losses"), sep="-")
- separate(record, athletes, into=c("wins", "losses"), delim="-")
20. A data analyst is working with a data frame named stores. It has separate columns for city (city) and state (state). The analyst wants to combine the two columns into a single column named location, with the city and state separated by a comma. What code chunk lets the analyst create the location column?
- unite(stores, "location", city, state, sep=",") (CORRECT)
- unite(stores, "location", city, state)
- unite(stores, "location", city, sep=",")
- unite(stores, city, state, sep=",")
21. A data analyst is working with the penguins dataset in R. What code chunk will allow them to sort the penguins data by the variable bill_length_mm?
- arrange(=bill_length_mm)
- arrange(penguins, bill_length_mm) (CORRECT)
- arrange(bill_length_mm, penguins)
- arrange(penguins)
22. A data analyst is working with a data frame called salary_data. They want to create a new column named total_wages that adds together data in the standard_wages and overtime_wages columns. What code chunk lets the analyst create the total_wages column?
- mutate(salary_data, total_wages = standard_wages + overtime_wages) (CORRECT)
- mutate(total_wages = standard_wages + overtime_wages)
- mutate(salary_data, standard_wages = total_wages + overtime_wages)
- mutate(salary_data, total_wages = standard_wages * overtime_wages)
23. What scenarios would prevent you from being able to use a tibble?
- You need to store numerical data
- You need to create column names
- You need to create row names (CORRECT)
- You need to change the data types of inputs (CORRECT)
24. You are working with the ToothGrowth dataset. You want to use the skim_without_charts() function to get a comprehensive view of the dataset. Write the code chunk that will give you this view.
How many rows does the ToothGrowth dataset contain?
- 50
- 40
- 60 (CORRECT)
- 25
Correct: The code chunk skim_without_charts(ToothGrowth) gives a full summary of the ToothGrowth dataset. Inside the parentheses, you provide the name of the dataset you want to summarize. The output includes the dataset name, the number of rows and columns, and the column types. The ToothGrowth dataset has 60 rows.
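The figures reported for ToothGrowth can be cross-checked with base R alone. This sketch does not call skim_without_charts(), which requires the skimr package; it only uses built-in functions to look at the same facts.

```r
# Number of rows, then number of columns.
dim(ToothGrowth)

# Structure overview: column names, types, and the first few values.
str(ToothGrowth)

# The class of each column, showing how many distinct data types are used.
sapply(ToothGrowth, class)

# Average of the len column.
mean(ToothGrowth$len)
```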
25. In R, which statistical measure demonstrates how strong the relationship is between two variables?
- Standard deviation
- Average
- Correlation (CORRECT)
- Maximum
26. A data analyst is studying weather data. They write the following code chunk:
bias(actual_temp, predicted_temp)
What will this code chunk calculate?
- The minimum difference between the actual and predicted values
- The average difference between the actual and predicted values (CORRECT)
- The maximum difference between the actual and predicted values
- The total average of the values
27. A data analyst wants to learn more about a specific data frame. Which function will allow them to review the data types of each column in the data frame?
- colnames()
- package()
- library()
- str() (CORRECT)
28. You are working with the ToothGrowth dataset. You want to use the glimpse() function to get a quick summary of the dataset. Write the code chunk that will give you this summary.
How many different data types are used for the column data types?
- 2 (CORRECT)
- 3
- 60
- 1
29. You are working with the penguins dataset. You want to use the summarize() and mean() functions to find the mean value for the variable body_mass_g. At this point, the following code has already been written into your script:
penguins %>%
drop_na() %>%
group_by(species) %>%
Add the code chunk that lets you find the mean value for the variable body_mass_g.
(Note: do not type the above code into the code block editor, as it has already been inputted. Simply add a single line of code based on the prompt.)
What is the mean body mass in g for the Adelie species?
- 3733.088
- 3706.164 (CORRECT)
- 5092.437
- 4207.433
Correct: The summarize() function is often used to display summary statistics. You can pair it with other functions such as mean(), max(), and min() to calculate specific statistics. In this case, mean() gives the mean value of body_mass_g.
30. A data analyst is working with a data frame called sales. In the data frame, a column named location represents data in the format "city, state". The analyst wants to split the city into an individual city column and state into a new country column. What code chunk lets the analyst split the location column?
- separate(sales, location, into=c("country", "city"), sep=", ")
- separate(sales, location, into=c("city", "country"), sep=", ") (CORRECT)
- separate(sales, location, into=c("country", "city"), sep=" ")
- untie(sales, location, into=c("city", "country"), sep=", ")
31. What is an advantage of using data frames instead of tibbles?
- Data frames make printing easier
- Data frames allow you to create row names (CORRECT)
- Data frames allow you to use column names
- Data frames never change variable names
32. A data analyst is checking a script for one of their peers. They want to learn more about a specific data frame. What function(s) will allow them to see a subset of data values in the data frame? Select all that apply.
- library()
- colnames()
- head() (CORRECT)
- str() (CORRECT)
WORKING WITH DATA IN R CONCLUSION
You have now been introduced to some fundamental R programming concepts that are vital for analyzing data. Using functions to structure, organize, and clean data is essential when dealing with larger datasets. Data frames are at the core of data analysis in R, so it is important to know how structured data is stored and manipulated. It is equally important to understand how bias can affect data, so that your results are accurate and trustworthy.
Coursera offers many more advanced data analysis courses that can deepen your understanding of R and its use in data science, and they make a good next step for your learning.