COURSE 3: GO BEYOND THE NUMBERS: TRANSLATE DATA INTO INSIGHTS

Module 2: Explore Raw Data

INTRODUCTION – Explore Raw Data

In exploratory data analysis (EDA), we organize and analyze raw data to draw insights from it. Python plays a central role here: it lets you quickly and easily discover patterns in large datasets and shape the data for further analysis. In this module, you use Python to clean, manipulate, and visualize data to uncover the stories hidden within it.

Learning Objectives

  • Identify the ethical issues that arise during the discovering stage of exploratory data analysis (EDA).
  • Use Python to merge or join datasets based on specified criteria.
  • Use Python to sort and filter data.
  • Use the relevant Python libraries to clean raw data.
  • Identify opportunities to formulate hypotheses from raw data.
  • Understand when and how to communicate the status of the data and raise questions with relevant stakeholders.
  • Inspect raw data with Python tools to understand its structure and format.
  • Apply the PACE (Plan, Analyze, Construct, Execute) workflow to determine whether the data is suitable and relevant for a data science project.
  • Compare the differences between common raw data formats (for example, JSON, tabular format) and their associated data types.

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: DISCOVERING IS THE BEGINNING OF AN INVESTIGATION

1. Fill in the blank: Tabular, XML, CSV, and JSON files are all types of _____.

  • data formats (CORRECT)
  • data types
  • spreadsheets
  • Python functions

Correct: Tabular, XML, CSV, and JSON are all data formats, that is, ways of structuring and storing data.

2. It is a data professional’s responsibility to understand data sources because the data’s origin affects its reliability.

  • True (CORRECT)
  • False

Correct: Understanding data sources is one of a data professional’s responsibilities because the origin of the data affects its reliability. This involves identifying and contacting the right people (those who generated the data or who are responsible for providing it) to support sound data discovery and acquisition.

3. Which Python method returns the total number of entries and the data types of individual data entries in a dataset?

  • return()
  • total()
  • number()
  • info() (CORRECT)

Correct: The info() method reports the total number of entries in the dataset and the data type of each column. It also shows non-null counts, which reveal missing values, along with memory usage, so it gives a quick sense of the structure and content of a dataset.
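As a minimal sketch of what info() reports, the example below builds a small DataFrame and captures the summary as text. The column names and values are invented for illustration and are not from the course.

```python
import io

import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "city": ["Lagos", "Lima", None],   # one missing value
    "sales": [120.5, 98.0, 143.2],
})

# info() prints to stdout by default; a buffer lets us keep the text.
buffer = io.StringIO()
df.info(buf=buffer)
summary = buffer.getvalue()

# The summary lists the entry count, per-column non-null counts,
# per-column data types, and memory usage.
print(summary)
```

The non-null count for the city column is lower than the total number of entries, which is how info() exposes the missing value.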

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: UNDERSTAND DATA FORMAT

1. Which of the following statements will convert the ‘time’ column into a datetime data type?

  • ['time'] = pd.datetime(df['time'])
  • df['time'] = pd.to_datetime('time')
  • df['time'] = pd.to_time(df['datetime'])
  • df['time'] = pd.to_datetime(df['time']) (CORRECT)

Correct: The statement df['time'] = pd.to_datetime(df['time']) converts the 'time' column to a datetime data type. The column can then be used for datetime operations such as filtering by date and calculating time differences.
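The conversion and the follow-on operations mentioned above can be sketched as follows. The timestamps are made up for illustration.

```python
import pandas as pd

# Hypothetical data: timestamps stored as plain strings.
df = pd.DataFrame({"time": ["2024-01-15 08:30:00", "2024-03-02 17:45:00"]})

# Convert the string column to a datetime data type.
df["time"] = pd.to_datetime(df["time"])

# Time difference between the two rows (a Timedelta).
elapsed = df["time"].iloc[1] - df["time"].iloc[0]

# Date filtering: keep rows on or after 1 February 2024.
after_february = df[df["time"] >= pd.Timestamp("2024-02-01")]

print(df.dtypes)
print(elapsed)
print(after_february)
```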

2. What Python method formats data into a new string representing date and time using a date, time, or datetime object?

  • strftime() (CORRECT)
  • head()
  • fig.show()
  • div()

Correct: The strftime() method formats a date, time, or datetime object into a new string that represents that date and time in a specific format. The format string defines the appearance of the output string, for example, %Y for the four-digit year, %m for the month, and %d for the day of the month.
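A short sketch of strftime() on a standard-library datetime object, using an arbitrary example date:

```python
from datetime import datetime

# An arbitrary example moment: 9 March 2024, 14:05.
moment = datetime(2024, 3, 9, 14, 5)

# %Y = four-digit year, %m = zero-padded month, %d = zero-padded day.
iso_date = moment.strftime("%Y-%m-%d")

# The same object rendered in a different layout, with hours and minutes.
readable = moment.strftime("%d/%m/%Y %H:%M")

print(iso_date)   # 2024-03-09
print(readable)   # 09/03/2024 14:05
```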

3. A data professional is creating a bar chart in Python. To label the y-axis Sales to Date, a data professional could use the following statement: plt.ylabel('Sales to Date').

  • True (CORRECT)
  • False

Correct: With Matplotlib imported as plt, plt.ylabel('Sales to Date') sets the label for the y-axis of the chart.
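The following sketch labels a bar chart with plt.ylabel() and plt.title(). The region names and sales figures are invented, and the Agg backend is chosen here only so the script runs without a display.

```python
import matplotlib

matplotlib.use("Agg")  # non-interactive backend; an assumption for headless runs
import matplotlib.pyplot as plt

# Hypothetical data for illustration.
regions = ["North", "South", "East"]
sales = [120, 95, 143]

plt.bar(regions, sales)
plt.ylabel("Sales to Date")    # y-axis label
plt.title("Sales by Region")   # chart title

# Grab the current axes to confirm what was set.
axes = plt.gca()
print(axes.get_ylabel(), "|", axes.get_title())
```

In an interactive session you would typically finish with plt.show() to render the chart.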

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: CREATE STRUCTURE FROM RAW DATA

1. Fill in the blank: Grouping is a structuring method that enables data professionals to _____ individual observations of a variable into different categories or classes.

  • Classify
  • Rank
  • Aggregate (CORRECT)
  • disperse

Correct: Grouping organizes data by aggregating individual observations of a variable into separate categories or classes. This lets data professionals summarize the data and examine the patterns or trends within each group.
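In pandas, grouping is commonly done with groupby(). The sketch below aggregates invented sales figures by region.

```python
import pandas as pd

# Hypothetical observations: one row per sale.
df = pd.DataFrame({
    "region": ["North", "South", "North", "South", "East"],
    "sales": [10, 20, 30, 40, 50],
})

# Aggregate individual observations into one total per region.
totals = df.groupby("region")["sales"].sum()

print(totals)
```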

2. Which of the following Python statements will create a list called grade_order that starts with Preschool?

  • order_grade = ['Preschool', 'Kindergarten', 'Elementary School', 'Middle School', 'High School']
  • order = ['Preschool_Grade', 'Kindergarten_Grade', 'Elementary School_Grade', 'Middle School_Grade', 'High School_Grade']
  • grade_order ('Preschool', 'Kindergarten', 'Elementary School', 'Middle School', 'High School')
  • grade_order = ['Preschool', 'Kindergarten', 'Elementary School', 'Middle School', 'High School'] (CORRECT)

Correct: The list grade_order starts with Preschool, followed by Kindergarten, Elementary School, Middle School, and High School. A list like this is useful for imposing a custom order on categorical data, for example when sorting or plotting educational levels.
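One way to apply such a list, sketched here with pandas, is to convert the column to an ordered categorical so that sorting follows grade_order rather than alphabetical order. The DataFrame contents are hypothetical.

```python
import pandas as pd

grade_order = ['Preschool', 'Kindergarten', 'Elementary School',
               'Middle School', 'High School']

# Hypothetical data in no particular order.
df = pd.DataFrame({
    "grade": ["High School", "Preschool", "Middle School", "Kindergarten"],
})

# An ordered categorical makes sort_values() respect grade_order.
df["grade"] = pd.Categorical(df["grade"], categories=grade_order, ordered=True)
sorted_df = df.sort_values("grade")

print(sorted_df)
```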

3. A data professional can use the concat function to join two or more dataframes.

  • True (CORRECT)
  • False

Correct: A data professional can use the concat function to join two or more dataframes. In pandas, it’s written as pd.concat().
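A minimal sketch of pd.concat() stacking two DataFrames with the same columns. The data is invented, and ignore_index=True is one optional choice that renumbers the rows.

```python
import pandas as pd

# Two hypothetical DataFrames with matching columns.
first = pd.DataFrame({"id": [1, 2], "score": [88, 92]})
second = pd.DataFrame({"id": [3, 4], "score": [75, 81]})

# Stack the rows of both frames; ignore_index=True rebuilds the index 0..n-1.
combined = pd.concat([first, second], ignore_index=True)

print(combined)
```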

QUIZ: MODULE 2 CHALLENGE

1. What are some strategies data professionals use to understand the source of a dataset? Select all that apply.

  • Ensure data supports the data professional’s hypothesis.
  • Investigate whether the data originator has any financial stake in the dataset. (CORRECT)
  • Request relevant information from the data owners. (CORRECT)
  • Confirm the data owners have experience collecting data. (CORRECT)

Correct!

2. Fill in the blank: A data storage file saved in a JavaScript format, also known as a _____ file, may contain nested objects.

  • CSV
  • spreadsheet
  • JSON (CORRECT)
  • JPG

Correct!
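To illustrate nested objects in JSON, the sketch below parses a small JSON string and flattens it with pd.json_normalize(). The records and field names are made up for the example.

```python
import json

import pandas as pd

# Hypothetical JSON text: each record holds a nested "address" object.
raw = """
[
  {"name": "Ana", "age": 31, "address": {"city": "Lima", "zip": "15001"}},
  {"name": "Ben", "age": 45, "address": {"city": "Oslo", "zip": "0150"}}
]
"""

# json.loads keeps the distinction between numbers (age) and strings (zip).
records = json.loads(raw)

# json_normalize flattens nested objects into dotted column names.
flat = pd.json_normalize(records)

print(flat)
```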

3. What type of data is gathered outside of an organization, but directly from the original source?

  • First-party
  • Fourth-party
  • Third-party
  • Second-party (CORRECT)

Correct!

4. Which of the following statements correctly uses the head() function to return the first 25 rows of a dataset?

  • df.head(25) (CORRECT)
  • df.head(rows=25)
  • df.head(25.df)
  • head=25

Correct!

5. Which of the following statements will assign the name Chicago Neighborhoods to a bar graph in Python?

  • plt.title("Chicago Neighborhoods") (CORRECT)
  • plt.show("Chicago Neighborhoods")
  • plt.xlabel("Chicago Neighborhoods")
  • plt.name("Chicago Neighborhoods")

Correct!

6. A data professional types the following partial code:

locationmode ='Indonesia'
fig.update_layout(title_text = 'Languages', 
geo_scope='Indonesia', )
fig.show()

Which element of the code is used to render a graphic of the plot?

  • fig.show() (CORRECT)
  • fig.update_layout
  • geo_scope=
  • title_text =

Correct!

7. Which structuring method aggregates individual observations of a variable into buckets?

  • Filtering
  • Slicing
  • Grouping (CORRECT)
  • Merging

Correct!

8. Fill in the blank: A box plot is a data visualization that depicts the locality, skew, and _____ of groups of values within quartiles.

  • temperature
  • spread (CORRECT)
  • height
  • area

Correct!

9. What are some of the benefits of JSON files for data professionals? Select all that apply.

  • Eliminate nested objects within the files
  • Readability in almost any programming language (CORRECT)
  • Easily distinguish between strings and numbers (CORRECT)
  • Small message size (CORRECT)

Correct!

10. Which of the following statements will assign the name Salzburg Restaurants to a bar graph in Python?

  • plt.title("Salzburg Restaurants") (CORRECT)
  • plt.name("Salzburg Restaurants")
  • plt.show("Salzburg Restaurants")
  • plt.xlabel("Salzburg Restaurants")

Correct!

11. Which Python function is used to render a graphic of a plot called graph?

  • show.pt(graph)
  • graph.display()
  • graph.show() (CORRECT)
  • plot.graph()

Correct!

12. What type of data is gathered outside of an organization and aggregated?

  • Fourth-party
  • First-party
  • Third-party (CORRECT)
  • Second-party

Correct!

13. Which of the following statements correctly uses the head() function to return the first 5 rows of a dataset?

  • df.head(5) (CORRECT)
  • df.head(5.df)
  • head=5
  • df.head(rows=5)

Correct!

14. Fill in the blank: The Python function fig.show() is used to render a _____ of a plot.

  • Template
  • Dashboard
  • mirror image
  • graphic (CORRECT)

Correct!

15. Which structuring method combines two different data frames along a specified starting column?

  • Filtering
  • Sorting
  • Merging (CORRECT)
  • Grouping

Correct!
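Merging on a shared column can be sketched with pd.merge(). The table contents and the customer_id key are hypothetical, and the inner join is just one of several join types.

```python
import pandas as pd

# Two hypothetical tables that share a customer_id column.
customers = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Ana", "Ben", "Chi"],
})
orders = pd.DataFrame({
    "customer_id": [1, 1, 3],
    "amount": [20.0, 35.5, 12.0],
})

# Inner join: keep only customers that appear in both tables.
merged = pd.merge(customers, orders, on="customer_id", how="inner")

print(merged)
```

Customer 2 has no orders, so an inner join drops that row; how="left" would keep it with a missing amount.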

16. Fill in the blank: A box plot is a data visualization that depicts the spread, skew, and _____ of groups of values within quartiles.

  • speed
  • intensity
  • locality (CORRECT)
  • timing

Correct!

17. What is the data storage file format for JavaScript?

  • spreadsheet
  • XML
  • CSV
  • JSON (CORRECT)

Correct!

18. Which structuring method selects a smaller part of a dataset based on specified parameters, then uses it for analysis?

  • Organizing
  • Sorting
  • Grouping
  • Filtering (CORRECT)

Correct!
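Filtering is often written in pandas as a boolean mask. The sketch below keeps only rows that meet a condition; the cities, temperatures, and threshold are invented for illustration.

```python
import pandas as pd

# Hypothetical temperature readings.
df = pd.DataFrame({
    "city": ["Cairo", "Oslo", "Lima", "Reykjavik"],
    "temp_c": [31, 18, 25, 12],
})

# Keep only the rows where the temperature exceeds 20 degrees.
warm = df[df["temp_c"] > 20]

print(warm)
```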

19. Fill in the blank: A _____ is a data visualization that depicts the locality, spread, and skew of groups of values within quartiles.

  • Gantt chart
  • box plot (CORRECT)
  • density map
  • scatter plot

Correct!

20. What are some strategies data professionals use to understand the source of a dataset? Select all that apply.

  • Verify the data source to ensure it will align with stakeholder beliefs.
  • Give extra weight to duplicate records to highlight the multiple responses.
  • If questions arise during discovery, contact the data engineers for information. (CORRECT)
  • Confirm the database owners have experience storing data. (CORRECT)

Correct!

21. What type of data is gathered from inside a company’s own organization?

  • Third-party
  • First-party (CORRECT)
  • Second-party
  • Fourth-party

Correct!

22. Which of the following statements will assign the name Kuwait Museums to a bar graph in Python?

  • plt.xlabel("Kuwait Museums")
  • plt.name("Kuwait Museums")
  • plt.title("Kuwait Museums") (CORRECT)
  • plt.show("Kuwait Museums")

Correct!

23. What are some strategies data professionals use to understand the source of a dataset? Select all that apply.

  • Reduce outliers by ensuring data comes from a small sample.
  • Request relevant information from the team members who supplied the data. (CORRECT)
  • Determine where the data originally came from. (CORRECT)
  • Confirm the original data owner has no financial stake in the data’s output. (CORRECT)

Correct!

24. Which of the following statements correctly uses the head() function to return the first 10 rows of a dataset?

  • head=10
  • df.head(10.df)
  • df.head(10) (CORRECT)
  • df.head(rows=10)

Correct!

25. Fill in the blank: _____ is data gathered from inside your own organization.

  • First-party (CORRECT)
  • Third-party
  • Fourth-party
  • Second-party

Correct: First-party data is data gathered from inside your own organization.

26. Why does a data professional use the Python methods describe(), sample(), size, and shape?

  • To save a dataset
  • To share a dataset
  • To transfer a dataset
  • To learn about a dataset (CORRECT)

Correct: A data professional uses the Python methods describe(), sample(), size, and shape to learn about a dataset.
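A quick sketch of these tools on a made-up DataFrame. Note that describe() and sample() are methods, while size and shape are attributes accessed without parentheses.

```python
import pandas as pd

# Hypothetical dataset: 100 rows, one numeric column.
df = pd.DataFrame({"value": range(100)})

stats = df.describe()                  # count, mean, std, min, quartiles, max
sample = df.sample(n=5, random_state=0)  # 5 random rows; seed fixed for repeatability

print(df.shape)   # (rows, columns) -> (100, 1)
print(df.size)    # total number of cells -> 100
print(stats)
print(sample)
```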

27. In the statement df['date'].dt.strftime('%Y-W%V'), which element states that the year should be included in the new column format?

  • Hyphen
  • Parentheses
  • %Y (CORRECT)
  • Square brackets

Correct: In the statement df['date'].dt.strftime('%Y-W%V'), the %Y directive specifies that the four-digit year (for example, 2024) is included in the new column format. It is part of the format string passed to strftime(), which lets you display dates and times in the format you need.
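The statement can be tried on a small, invented date column as sketched below. The %V directive (ISO 8601 week number) depends on the platform's C library, so treat its availability as an assumption to verify on your system.

```python
import pandas as pd

# Hypothetical dates for illustration.
df = pd.DataFrame({"date": pd.to_datetime(["2024-01-01", "2024-06-15"])})

# %Y = four-digit year, "-W" = literal text, %V = ISO 8601 week number.
df["year_week"] = df["date"].dt.strftime("%Y-W%V")

print(df)
```

Near a year boundary the ISO week can belong to a different year than %Y reports; %G gives the ISO year that matches %V.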

28. What structuring method enables data professionals to divide information into smaller parts in order to facilitate efficient examination and analysis from different viewpoints?

  • Slicing (CORRECT)
  • Grouping
  • Filtering
  • Extracting

Correct: Slicing enables data professionals to break down information into smaller parts.
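One way to slice a DataFrame in pandas is with iloc (by position) and loc (by label). The sketch below uses invented monthly figures.

```python
import pandas as pd

# Hypothetical monthly figures.
df = pd.DataFrame({
    "month": ["Jan", "Feb", "Mar", "Apr"],
    "sales": [10, 12, 9, 15],
    "returns": [1, 0, 2, 1],
})

# Row slice by position: the first three rows.
first_quarter = df.iloc[0:3]

# Column slice by label: every row, two chosen columns.
sales_only = df.loc[:, ["month", "sales"]]

print(first_quarter)
print(sales_only)
```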

29. Fill in the blank: A box plot is a data visualization that depicts the locality, spread, and _____ of groups of values within quartiles.

  • variety
  • flow
  • skew (CORRECT)
  • meaning

Correct: A box plot shows how a dataset is distributed across its quartiles using a five-number summary: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum. It also shows the skew and spread of the data and marks outliers. In a single figure, a box plot conveys central tendency, dispersion, and possible asymmetry.
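The numbers behind a box plot can be sketched with pandas quantiles. The values are made up, and the 1.5 × IQR rule for flagging outliers is a common convention assumed here, not something stated in the quiz.

```python
import pandas as pd

# Hypothetical values with one unusually large observation.
values = pd.Series([2, 4, 4, 5, 7, 9, 10, 12, 30])

# Quartiles that define the box and its center line.
q1, median, q3 = values.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1  # interquartile range, a measure of spread

# Common convention (assumed): points beyond 1.5 * IQR from the box are outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = values[(values < lower_fence) | (values > upper_fence)]

print(values.min(), q1, median, q3, values.max())
print(list(outliers))
```

Here the median sits closer to Q1 than to the maximum, and the single high outlier hints at a right skew.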
