Module 1: Explore Data And Create Models To Predict Numeric Values

INTRODUCTION – Explore Data And Create Models To Predict Numeric Values

Data exploration and analysis are at the core of data science. Data scientists need a general-purpose language such as Python to explore, visualize, and manipulate data. In this module, you will learn how to use Python for these tasks. You will also study regression, a technique for building machine learning models that predict numeric values, and use the scikit-learn framework in Python to train and evaluate a regression model.

Learning Objectives:

  • Use Python to explore, visualize and manipulate data.
  • Understand how regression can be applied to build a machine-learning model that predicts numeric values.
  • Use the scikit-learn framework in Python to train and evaluate regression models.

PRACTICE QUIZ: KNOWLEDGE CHECK 1

1. Which Python libraries are used in machine learning and deep learning?

Select all options that apply.

  • Scikit-learn
  • Matplotlib
  • TensorFlow (CORRECT)
  • PyTorch (CORRECT)

Correct: TensorFlow is an open-source machine learning platform that supports building, training, and deploying machine learning models all in one place.

Correct: PyTorch is an open-source machine learning framework that accelerates the path from research prototyping to production deployment.

2. What are the benefits of using the NumPy and Pandas libraries in your Python project?

  • Supplying machine learning and deep learning capabilities
  • Offering simple and predictive data analysis
  • Providing attractive data visualizations
  • Simplifying analyzing and manipulating data (CORRECT)

Correct: The NumPy and Pandas libraries simplify analyzing and manipulating data in Python.

3. Which Python library provides functionality similar to Excel?

  • Pandas (CORRECT)
  • NumPy
  • Matplotlib

Correct: Pandas can be thought of as the Excel of the Python world: an easy way to work with tables of data.

4. Which Python library provides functionality comparable to mathematical tools such as MATLAB and R?

  • Scikit-learn
  • NumPy (CORRECT)
  • TensorFlow

Correct: NumPy offers a programming interface and a comprehensive set of mathematical functions, comparable to tools such as MATLAB and R.

5. What should you use if you want to run basic scripts in a web browser?

  • NumPy
  • Pandas
  • Jupyter Notebook (CORRECT)

Correct: Jupyter Notebook is a useful tool for running basic scripts directly from your web browser. In its most common form, a notebook is a single web page divided into two kinds of sections, one for text and one for code. The code is executed on the server rather than on your own machine.

PRACTICE QUIZ: KNOWLEDGE CHECK 2

1. In a regression model, what is a feature and what is a label?

  • A feature is the variable to be predicted.
  • A label is the variable that represents characteristics.
  • A label is the variable to be predicted. (CORRECT)
  • A feature is the variable that represents characteristics. (CORRECT)

Correct: The variable we’re trying to predict is known as the label.

Correct: The variables that represent the characteristics of the data are known as features.

2. When you want to train a regression model based on historical data, which are the two subsets into which you split the data sample?

  • A confirmation dataset
  • A performance dataset
  • A validation dataset (CORRECT)
  • A training dataset (CORRECT)

Correct: A validation (or test) dataset is used to evaluate the model: the model generates predictions for the labels, and these are compared with the actual known label values.

Correct: A training dataset is the subset to which you apply an algorithm that determines a function capturing the relationship between the feature values and the known label values.

3. True or False? In machine learning, the difference between a predicted label value and the actual value is known as “the residuals”.

  • True (CORRECT)
  • False

Correct: In practice, the “actual” values come from sample observations, which may themselves be subject to random variance. A residual is the difference between a predicted value (ŷ) and the observed value (y).
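As a small illustration of residuals, here is a minimal sketch using NumPy. The arrays y_true and y_pred are made-up values, not data from the course, and the sign convention (observed minus predicted) is an assumption; some texts subtract in the other order.

```python
import numpy as np

# Hypothetical observed labels (y) and model predictions (y-hat).
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# One residual per observation: observed value minus predicted value.
residuals = y_true - y_pred

print(residuals)
```

Residuals close to zero mean the predictions are close to the observed values.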

4. To randomly split the data between training and validation subsets, you can use the train_test_split function. In which Python library can you find this function?

  • PyTorch
  • NumPy
  • Scikit-learn (CORRECT)
  • Matplotlib

Correct: The train_test_split function is included in scikit-learn, in the sklearn.model_selection module.
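A minimal sketch of such a split is shown here. The arrays X and y are placeholder data, and the 70/30 split and random_state value are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 10 samples with 2 features each, plus one label per sample.
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold back 30% of the rows for validation; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)
```

The training subset is used to fit the model, and the held-back subset is used only to evaluate it.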

5. This evaluation metric yields a relative metric in which the smaller the value, the better the fit of the model. Which evaluation metric is described above?

  • Mean Square Error (MSE) (CORRECT)
  • Coefficient of Determination (usually known as R-squared or R2)
  • Root Mean Square Error (RMSE)

Correct: This metric, the Mean Square Error, is the average of the squared differences between the predicted values and the actual values.

QUIZ: TEST PREP

1. How would a list and a NumPy array behave when they are multiplied by 3?

  • Multiplying a list by 3 performs an element-wise calculation on the list, which sees the list stay the same size, but each element has been multiplied by 3.
  • Multiplying a NumPy array by 3 creates a new array 3 times the length with the original sequence repeated 3 times.
  • Multiplying a list by 3 creates a new list 3 times the length with the original sequence repeated 3 times. (CORRECT)
  • Multiplying a NumPy array by 3 performs an element-wise calculation on the array, which sees the array stay the same size, but each element has been multiplied by 3. (CORRECT)

Correct: This is what happens when you multiply a list: the original sequence is repeated.

Correct: Multiplying a NumPy array performs an element-wise calculation: every element is multiplied by the scalar value, or by the corresponding element of another array if one is supplied.
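The difference is easy to check directly. This is a minimal sketch with made-up values; the variable names are for illustration only.

```python
import numpy as np

values = [1, 2, 3]
array = np.array(values)

# Multiplying a list repeats the sequence.
repeated = values * 3
# Multiplying a NumPy array scales each element.
scaled = array * 3

print(repeated)  # [1, 2, 3, 1, 2, 3, 1, 2, 3]
print(scaled)    # [3 6 9]
```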

2. If you have a NumPy array with the shape (2,35), what does this tell you about the elements in the array?

  • The array contains 2 elements with the values of 2 and 35.
  • The array contains 35 elements, all with the value 2.
  • The array is two dimensional, consisting of two arrays with 35 elements each. (CORRECT)

Correct: A shape of (2,35) means the array is two-dimensional, with 2 rows of 35 elements each.
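To see this in practice, the sketch below builds an array of that shape. The array is filled with zeros purely as placeholder content.

```python
import numpy as np

# A placeholder array with 2 rows and 35 columns.
grid = np.zeros((2, 35))

print(grid.shape)    # (2, 35)
print(grid.ndim)     # 2
print(len(grid[0]))  # 35
```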

3. Suppose you have a Pandas DataFrame named df_sales containing daily sales data. The DataFrame has the following columns: year, month, day_of_month, sales_total. If you want to find the average sales_total value, which code should you use?

  • mean(df_sales['sales_total'])
  • df_sales['sales_total'].mean() (CORRECT)
  • df_sales['sales_total'].avg()

Correct: This code computes the mean of the values in the sales_total column.
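A minimal sketch of this call follows. The column names come from the question, while the three rows of data are invented for illustration.

```python
import pandas as pd

# Invented sample rows using the column names from the question.
df_sales = pd.DataFrame({
    "year": [2023, 2023, 2023],
    "month": [1, 1, 1],
    "day_of_month": [1, 2, 3],
    "sales_total": [100.0, 150.0, 200.0],
})

# Select the column as a Series, then call its mean() method.
average_sales = df_sales["sales_total"].mean()

print(average_sales)  # 150.0
```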

4. You work on a DataFrame containing data about daily ice cream sales. You use the corr method to compare the avg_temp and units_sold columns, and get a result of 0.95. What does this result indicate?

  • Days with high avg_temp values tend to coincide with days that have high units_sold values. (CORRECT)
  • On the day with the maximum units_sold value, the avg_temp value was 0.95.
  • The units_sold value is, on average, 95% of the avg_temp value.

Correct: The corr method calculates the correlation between the two columns; a value close to 1 indicates a strong positive correlation.
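The following sketch shows one way to compute such a correlation. The column names match the question, but the five rows are invented, so the result will not be exactly 0.95.

```python
import pandas as pd

# Invented data: warmer days have higher sales.
df_ice_cream = pd.DataFrame({
    "avg_temp": [18, 21, 24, 27, 30],
    "units_sold": [110, 135, 160, 190, 205],
})

# Series.corr returns the Pearson correlation coefficient by default.
correlation = df_ice_cream["avg_temp"].corr(df_ice_cream["units_sold"])

print(round(correlation, 3))
```

A result near 1 means the two columns tend to rise together; it does not by itself show that one causes the other.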

5. This is a relative metric in which the higher the value, the better the fit of the model. Which evaluation metric is described?

  • Mean Square Error (MSE)
  • Coefficient of Determination (known as R-squared or R2) (CORRECT)
  • Root Mean Square Error (RMSE)

Correct: This is the evaluation metric described here. It measures the proportion of the variance in the actual label values that is explained by the predicted label values.

6. This evaluation metric yields an absolute metric in the same unit as the label.

Which metric is described?

  • Mean Square Error (MSE)
  • Root Mean Square Error (RMSE) (CORRECT)
  • Coefficient of Determination (known as R-squared or R2)

Correct: RMSE is expressed in the same unit as the label; the smaller the value, the better the model fits.
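The three metrics from these questions can be computed side by side. This is a minimal sketch using scikit-learn's metrics module; the label and prediction arrays are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Invented actual labels and predictions.
y_true = np.array([10.0, 20.0, 30.0, 40.0])
y_pred = np.array([12.0, 18.0, 33.0, 37.0])

# MSE: average squared difference (lower is better).
mse = mean_squared_error(y_true, y_pred)
# RMSE: square root of MSE, in the same unit as the label (lower is better).
rmse = np.sqrt(mse)
# R-squared: share of label variance explained (higher is better).
r2 = r2_score(y_true, y_pred)

print(mse, rmse, r2)
```

Taking the square root of MSE by hand, as here, is one simple way to obtain RMSE.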

7. You’ve just created a model object using the LinearRegression class from the scikit-learn library.

What should you do next to train the model?

  • Call the predict() method of the model object, specifying the training feature and label arrays.
  • Call the fit() method of the model object, specifying the training feature and label arrays. (CORRECT)
  • Call the score() method of the model object, specifying the training feature and test feature arrays.

Correct: To train the model, use the fit() method.
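As a minimal sketch of that step, the code below fits a LinearRegression model on invented training data that follows y = 2x + 1, then uses the trained model to predict a new value.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Invented training data: one feature column, labels follow y = 2x + 1.
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
y_train = np.array([3.0, 5.0, 7.0, 9.0])

# Create the model object, then train it with fit().
model = LinearRegression()
model.fit(X_train, y_train)

# After training, predict() generates labels for new feature values.
prediction = model.predict(np.array([[5.0]]))

print(model.coef_, model.intercept_, prediction)
```

Note that predict() and score() are only meaningful after fit() has been called.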

CONCLUSION – Explore Data And Create Models To Predict Numeric Values

In conclusion, data exploration and analysis with Python is an essential skill for any aspiring data scientist. This module equips you with the skills to explore, visualize, and manipulate data. You also gain hands-on experience building and evaluating regression models with the scikit-learn framework. These skills provide a solid foundation for a career in data science and machine learning.
