In exploratory data analysis (EDA), we organize and analyze raw data to draw insight from it. Python plays a central role here: it lets you discover patterns in large datasets quickly and shape the data for further analysis. In this module you learn to clean, manipulate, and visualize data with Python to uncover the stories hidden within it.
Learning Objectives
Recognize the ethical issues that can arise during the discovering stage of exploratory data analysis (EDA).
Use Python to merge or join datasets based on specified criteria.
Use Python for data sorting and filtering.
Use the relevant Python libraries to clean raw data.
Identify opportunities to formulate hypotheses from raw data.
Understand when and how to communicate the status of the data and raise questions with key stakeholders.
Inspect raw data with Python tools to understand its structure and format.
Apply the PACE (Plan, Analyze, Construct, Execute) workflow to determine whether the data is suitable and relevant for a data science project.
Compare the differences between common raw data formats (for example, JSON, tabular format) and their associated data types.
PRACTICE QUIZ: TEST YOUR KNOWLEDGE: DISCOVERING IS THE BEGINNING OF AN INVESTIGATION
1. Fill in the blank: Tabular, XML, CSV, and JSON files are all types of _____.
data formats (CORRECT)
data types
spreadsheets
Python functions
Correct: Tabular, XML, CSV, and JSON are all data formats.
2. It is a data professional’s responsibility to understand data sources because the data’s origin affects its reliability.
True (CORRECT)
False
Correct: Data professionals are responsible for understanding data sources because the data’s origin affects its reliability. This involves identifying and contacting the right people (those who generated the data or who are responsible for providing it) to support sound data discovery and acquisition.
3. Which Python method returns the total number of entries and the data types of individual data entries in a dataset?
return()
total()
number()
info() (CORRECT)
Correct: The info() method reports the total number of entries in the dataset and the data type of each column. It also shows the non-null count per column, which reveals missing values, along with memory usage, giving a quick overview of the dataset’s structure and content.
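As a minimal sketch of what info() reports, the example below builds a small DataFrame and captures the summary as text. The column names and values are hypothetical and chosen only for illustration.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["Chicago", "Salzburg", "Kuwait City"],
    "visitors": [120, 85, None],  # one missing value on purpose
})

# info() prints its summary instead of returning it,
# so write it to a buffer in order to keep it as a string.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

# The summary lists the entry count, each column's non-null count and dtype,
# and the memory usage.
print(summary)
```

In everyday use, calling df.info() on its own line is enough; the buffer is only needed when the text has to be stored or logged.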
PRACTICE QUIZ: TEST YOUR KNOWLEDGE: UNDERSTAND DATA FORMAT
1. Which of the following statements will convert the 'time' column into a datetime data type?
['time'] = pd.datetime(df['time'])
df['time'] = pd.to_datetime('time')
df['time'] = pd.to_time(df['datetime'])
df['time'] = pd.to_datetime(df['time']) (CORRECT)
Correct: The statement df['time'] = pd.to_datetime(df['time']) converts the 'time' column to a datetime data type. The column can then be used for datetime operations such as filtering by date and calculating time differences.
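The conversion can be sketched as follows. The timestamps are made-up sample values; only the 'time' column name comes from the question.

```python
import pandas as pd

# Hypothetical timestamps stored as plain strings.
df = pd.DataFrame({"time": ["2024-01-15 08:30:00", "2024-03-02 17:45:00"]})

# Convert the string column to a datetime column.
df["time"] = pd.to_datetime(df["time"])

# Datetime operations are now available, for example:
# the difference between two timestamps ...
elapsed = df["time"].iloc[1] - df["time"].iloc[0]

# ... and filtering rows by date.
after_february = df[df["time"] >= pd.Timestamp("2024-02-01")]

print(df.dtypes)
print(elapsed)
```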
2. What Python method formats data into a new string representing date and time using a date, time, or datetime object?
strftime() (CORRECT)
head()
fig.show()
div()
Correct: The strftime() method formats a date, time, or datetime object into a new string that represents that date and time in a specific format. The format string defines the appearance of the output string, for example, %Y for the four-digit year, %m for the month, and %d for the day of the month.
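A short sketch of these format codes with the standard library's datetime class; the date itself is an arbitrary example.

```python
from datetime import datetime

# An arbitrary example moment: 2 March 2024, 17:45.
moment = datetime(2024, 3, 2, 17, 45)

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day of the month.
date_only = moment.strftime("%Y-%m-%d")

# The same object can be rendered in a different layout, here with hours and minutes.
with_time = moment.strftime("%d/%m/%Y %H:%M")

print(date_only)   # 2024-03-02
print(with_time)   # 02/03/2024 17:45
```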
3. A data professional is creating a bar chart in Python. To label the y-axis Sales to Date, they could use the following statement: plt.ylabel('Sales to Date').
True (CORRECT)
False
Correct: In Matplotlib, plt.ylabel('Sales to Date') sets the label of the chart’s y-axis.
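A minimal sketch of a labeled bar chart, assuming Matplotlib is installed. The months and sales figures are invented for illustration, and the Agg backend is selected only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt

# Hypothetical data for the bars.
months = ["Jan", "Feb", "Mar"]
sales = [1200, 1850, 2400]

fig, ax = plt.subplots()
ax.bar(months, sales)

# plt.ylabel() labels the y-axis of the current axes (here, ax).
plt.ylabel("Sales to Date")
```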
PRACTICE QUIZ: TEST YOUR KNOWLEDGE: CREATE STRUCTURE FROM RAW DATA
1. Fill in the blank: Grouping is a structuring method that enables data professionals to _____ individual observations of a variable into different categories or classes.
Classify
Rank
Aggregate (CORRECT)
disperse
Correct: Grouping aggregates individual observations of a variable into separate categories or classes. Data professionals can then summarize each group and examine the patterns or trends that emerge within it.
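In pandas, grouping is commonly done with groupby(). The sketch below uses hypothetical neighborhood and sales values to show observations being aggregated per category.

```python
import pandas as pd

# Hypothetical observations: one row per sale.
df = pd.DataFrame({
    "neighborhood": ["Loop", "Loop", "Hyde Park", "Hyde Park", "Hyde Park"],
    "sales": [10, 20, 5, 15, 25],
})

# Group the rows by category, then aggregate each group.
totals = df.groupby("neighborhood")["sales"].sum()
counts = df.groupby("neighborhood")["sales"].count()

print(totals)
```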
2. Which of the following Python statements will create a list called grade_order that starts with Preschool?
Correct: The grade_order list starts with Preschool, followed by Kindergarten, Elementary School, Middle School, and High School. A list like this defines a custom order for categorical data, for example when sorting or plotting education levels.
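The answer options for this question are not reproduced above, so the list literal below is reconstructed from the explanation rather than copied from the quiz. The sketch also shows one possible use of such a list: giving a pandas categorical column a custom sort order. The sample DataFrame is hypothetical.

```python
import pandas as pd

# List reconstructed from the explanation; it starts with Preschool.
grade_order = [
    "Preschool",
    "Kindergarten",
    "Elementary School",
    "Middle School",
    "High School",
]

# Hypothetical data in arbitrary order.
df = pd.DataFrame({"grade": ["High School", "Preschool", "Middle School", "Kindergarten"]})

# An ordered categorical sorts by position in grade_order, not alphabetically.
df["grade"] = pd.Categorical(df["grade"], categories=grade_order, ordered=True)
sorted_grades = df.sort_values("grade")["grade"].tolist()

print(sorted_grades)
```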
3. A data professional can use the concat function to join two or more dataframes.
True (CORRECT)
False
Correct: A data professional can use the concat function to join two or more dataframes. In pandas, it’s written as pd.concat().
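A brief sketch of pd.concat() stacking two dataframes that share the same columns; the quarterly data is hypothetical.

```python
import pandas as pd

# Two hypothetical dataframes with identical columns.
q1 = pd.DataFrame({"month": ["Jan", "Feb"], "sales": [100, 120]})
q2 = pd.DataFrame({"month": ["Apr", "May"], "sales": [90, 140]})

# concat() takes a list of dataframes; ignore_index=True renumbers the rows 0..n-1.
combined = pd.concat([q1, q2], ignore_index=True)

print(combined)
```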
QUIZ: MODULE 2 CHALLENGE
1. What are some strategies data professionals use to understand the source of a dataset? Select all that apply.
Ensure data supports the data professional’s hypothesis.
Investigate whether the data originator has any financial stake in the dataset. (CORRECT)
Request relevant information from the data owners. (CORRECT)
Confirm the data owners have experience collecting data. (CORRECT)
Correct!
2. Fill in the blank: A data storage file saved in a JavaScript format, also known as a _____ file, may contain nested objects.
CSV
spreadsheet
JSON (CORRECT)
JPG
Correct!
3. What type of data is gathered outside of an organization, but directly from the original source?
First-party
Fourth-party
Third-party
Second-party (CORRECT)
Correct!
4. Which of the following statements correctly uses the head() function to return the first 25 rows of a dataset?
df.head(25) (CORRECT)
df.head(rows=25)
df.head(25.df)
head=25
Correct!
5. Which of the following statements will assign the name Chicago Neighborhoods to a bar graph in Python?
plt.title("Chicago Neighborhoods") (CORRECT)
plt.show("Chicago Neighborhoods")
plt.xlabel("Chicago Neighborhoods")
plt.name("Chicago Neighborhoods")
Correct!
6. A data professional types the following partial code:
Which element of the code is used to render a graphic of the plot?
fig.show() (CORRECT)
fig.update_layout
geo_scope=
title_text =
Correct!
7. Which structuring method aggregates individual observations of a variable into buckets?
Filtering
Slicing
Grouping (CORRECT)
Merging
Correct!
8. Fill in the blank: A box plot is a data visualization that depicts the locality, skew, and _____ of groups of values within quartiles.
temperature
spread (CORRECT)
height
area
Correct!
9. What are some of the benefits of JSON files for data professionals? Select all that apply.
Eliminate nested objects within the files
Readability in almost any programming language (CORRECT)
Easily distinguish between strings and numbers (CORRECT)
Small message size (CORRECT)
Correct!
10. Which of the following statements will assign the name Salzburg Restaurants to a bar graph in Python?
plt.title("Salzburg Restaurants") (CORRECT)
plt.name("Salzburg Restaurants")
plt.show("Salzburg Restaurants")
plt.xlabel("Salzburg Restaurants")
Correct!
11. Which Python function is used to render a graphic of a plot called graph?
show.pt(graph)
graph.display()
graph.show() (CORRECT)
plot.graph()
Correct!
12. What type of data is gathered outside of an organization and aggregated?
Fourth-party
First-party
Third-party (CORRECT)
Second-party
Correct!
13. Which of the following statements correctly uses the head() function to return the first 5 rows of a dataset?
df.head(5) (CORRECT)
df.head(5.df)
head=5
df.head(rows=5)
Correct!
14. Fill in the blank: The Python function fig.show() is used to render a _____ of a plot.
Template
Dashboard
mirror image
graphic (CORRECT)
Correct!
15. Which structuring method combines two different data frames along a specified starting column?
Filtering
Sorting
Merging (CORRECT)
Grouping
Correct!
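Merging can be sketched with DataFrame.merge(), which combines two dataframes on a shared key column. The store_id key and the sample rows are hypothetical.

```python
import pandas as pd

# Two hypothetical dataframes that share the key column "store_id".
stores = pd.DataFrame({
    "store_id": [1, 2, 3],
    "city": ["Chicago", "Salzburg", "Kuwait City"],
})
sales = pd.DataFrame({
    "store_id": [1, 2, 4],
    "revenue": [500, 300, 250],
})

# An inner merge keeps only the keys present in both dataframes.
merged = stores.merge(sales, on="store_id", how="inner")

print(merged)
```

Other values of how, such as "left" or "outer", keep unmatched rows from one or both sides.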
16. Fill in the blank: A box plot is a data visualization that depicts the spread, skew, and _____ of groups of values within quartiles.
speed
intensity
locality (CORRECT)
timing
Correct!
17. What is the data storage file format for JavaScript?
spreadsheet
XML
CSV
JSON (CORRECT)
Correct!
18. Which structuring method selects a smaller part of a dataset based on specified parameters, then uses it for analysis?
Organizing
Sorting
Grouping
Filtering (CORRECT)
Correct!
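One common way to filter in pandas is a Boolean mask, sketched below with hypothetical city ratings.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "city": ["Chicago", "Salzburg", "Kuwait City", "Chicago"],
    "rating": [4.5, 3.8, 4.9, 2.7],
})

# Keep only the rows that satisfy a condition.
high_rated = df[df["rating"] >= 4.0]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses.
chicago_high = df[(df["city"] == "Chicago") & (df["rating"] >= 4.0)]

print(high_rated)
```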
19. Fill in the blank: A _____ is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles.
Gantt chart
box plot (CORRECT)
density map
scatter plot
Correct!
20. What are some strategies data professionals use to understand the source of a dataset? Select all that apply.
Verify the data source to ensure it will align with stakeholder beliefs.
Give extra weight to duplicate records to highlight the multiple responses.
If questions arise during discovery, contact the data engineers for information. (CORRECT)
Confirm the database owners have experience storing data. (CORRECT)
Correct!
21. What type of data is gathered from inside a company’s own organization?
Third-party
First-party (CORRECT)
Second-party
Fourth-party
Correct!
22. Which of the following statements will assign the name Kuwait Museums to a bar graph in Python?
plt.xlabel("Kuwait Museums")
plt.name("Kuwait Museums")
plt.title("Kuwait Museums") (CORRECT)
plt.show("Kuwait Museums")
Correct!
23. What are some strategies data professionals use to understand the source of a dataset? Select all that apply.
Reduce outliers by ensuring data comes from a small sample.
Request relevant information from the team members who supplied the data. (CORRECT)
Determine where the data originally came from. (CORRECT)
Confirm the original data owner has no financial stake in the data’s output. (CORRECT)
Correct!
24. Which of the following statements correctly uses the head() function to return the first 10 rows of a dataset?
head=10
df.head(10.df)
df.head(10) (CORRECT)
df.head(rows=10)
Correct!
25. Fill in the blank: _____ is data gathered from inside your own organization.
First-party (CORRECT)
Third-party
Fourth-party
Second-party
Correct: First-party data is data gathered from inside your own organization.
26. Why does a data professional use the Python methods describe(), sample(), size, and shape?
To save a dataset
To share a dataset
To transfer a dataset
To learn about a dataset (CORRECT)
Correct: A data professional uses the Python methods describe(), sample(), size, and shape to learn about a dataset.
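A quick sketch of these tools on a small hypothetical DataFrame. Note that in pandas describe() and sample() are methods, while size and shape are attributes and take no parentheses.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "units": [1, 2, 3, 4],
    "price": [10.0, 20.0, 30.0, 40.0],
})

shape = df.shape                              # (rows, columns)
size = df.size                                # total number of cells
summary = df.describe()                       # count, mean, std, min, quartiles, max
random_rows = df.sample(n=2, random_state=0)  # two random rows, reproducible via the seed

print(shape, size)
print(summary)
```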
27. In the statement df['date'].dt.strftime('%Y-W%V'), which element states that the year should be included in the new column format?
Hyphen
Parentheses
%Y (CORRECT)
Square brackets
Correct: In the statement df['date'].dt.strftime('%Y-W%V'), the element %Y specifies that the four-digit year (for example, 2024) is included in the new column format. It is one of the format codes of the strftime() method, which controls how a date and time are displayed.
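The statement can be tried on a small example. The dates are hypothetical, and the 'week' column name is an assumption. %V is the ISO 8601 week number; the Python documentation notes that it may not be available on every platform when used with strftime(), so treat this as a sketch rather than a portable recipe.

```python
import pandas as pd

# Hypothetical dates, converted to a datetime column.
df = pd.DataFrame({"date": pd.to_datetime(["2024-03-04", "2024-07-15"])})

# %Y inserts the four-digit year, "-W" is literal text, %V is the ISO week number.
df["week"] = df["date"].dt.strftime("%Y-W%V")

print(df)
```

Near the turn of a year the ISO week can belong to the neighboring year, in which case %G (the ISO year) pairs more consistently with %V than %Y does.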
28. What structuring method enables data professionals to divide information into smaller parts in order to facilitate efficient examination and analysis from different viewpoints?
Slicing (CORRECT)
Grouping
Filtering
Extracting
Correct: Slicing enables data professionals to break down information into smaller parts.
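In pandas, one way to slice is with iloc (by position) and loc (by label), sketched here on hypothetical data.

```python
import pandas as pd

# Hypothetical data.
df = pd.DataFrame({
    "city": ["Chicago", "Salzburg", "Kuwait City", "Chicago"],
    "rating": [4.5, 3.8, 4.9, 2.7],
    "visits": [120, 85, 60, 40],
})

# Slice by position: the first two rows, all columns.
first_two_rows = df.iloc[0:2]

# Slice by label: all rows, only the named columns.
city_and_rating = df.loc[:, ["city", "rating"]]

print(first_two_rows)
print(city_and_rating)
```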
29. Fill in the blank: A box plot is a data visualization that depicts the locality, spread, and _____ of groups of values within quartiles.
variety
flow
skew (CORRECT)
meaning
Correct: A box plot shows how a dataset is distributed across its quartiles using a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. It also indicates skew and spread, and shows outliers as individual points. A box plot therefore conveys central tendency, dispersion, and possible asymmetry in a single view.
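The quantities behind a box plot can be computed directly, as in the sketch below. The values are made up, the 1.5 × IQR fence is the usual convention (and Matplotlib's default) rather than something stated in the quiz, and the Agg backend is used only so the example runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; omit this in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical values with one unusually large observation.
values = np.array([2, 4, 4, 5, 7, 9, 10, 12, 30])

# Quartiles: locality (median) and spread (interquartile range).
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are conventionally drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

# Draw the box plot itself.
fig, ax = plt.subplots()
ax.boxplot(values)
ax.set_title("Box plot of sample values")

print(q1, median, q3, outliers)
```

Here the upper whisker is longer than the lower one and the single outlier lies above the box, which is how right skew shows up in the plot.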