Module 2: Workflow for Building Complex Models

INTRODUCTION – Workflow for Building Complex Models

This module introduces the structured workflow that data professionals follow in machine learning projects. It explains the key steps in that workflow and why each stage matters when applying machine learning in the real world. Understanding the order of these steps is critical to applying machine learning effectively and efficiently.

The module is hands-on: participants apply machine learning models to specific business problems. By the end, they will understand both the theory behind a structured workflow and the practice of drawing insights from machine learning techniques to understand and solve business problems. This combined approach builds a conceptual understanding of machine learning workflows along with the skills needed to work in, and contribute to, this emerging field.

Learning Outcomes:

  • Identify and apply model validation techniques.
  • Explain how Naive Bayes models work and their use cases.
  • Build a Naive Bayes model.
  • Implement feature engineering techniques using Python.
  • Understand how the PACE framework informs each step in the end-to-end data science workflow for machine learning.

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: PACE IN MACHINE LEARNING: THE PLAN AND ANALYZE STAGES

1. Fill in the blank: Feature engineering enables data professionals to take _____ and extract features from it.

  • raw data (CORRECT)
  • delimited text
  • a dynamic dashboard
  • a code chunk

Correct: Feature engineering enables data professionals to take raw data and build useful features from it. They select, extract, or transform features and properties from the raw data according to their relevance and usefulness for the machine learning model.
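As a rough illustration of selecting, extracting, and transforming features, the sketch below works on two invented records. The field names (`customer_id`, `total_spend`, `num_orders`, `plan`) and the `engineer_features` helper are hypothetical and are not taken from the course materials.

```python
# Hypothetical raw records; every field name here is invented for illustration.
raw_records = [
    {"customer_id": 101, "total_spend": 120.0, "num_orders": 4, "plan": "basic"},
    {"customer_id": 102, "total_spend": 300.0, "num_orders": 3, "plan": "premium"},
]


def engineer_features(record):
    """Turn one raw record into model-ready features."""
    return {
        # Selection: keep num_orders and drop customer_id (an ID carries no signal).
        "num_orders": record["num_orders"],
        # Extraction: combine two raw fields into one new feature.
        "avg_order_value": record["total_spend"] / record["num_orders"],
        # Transformation: encode the categorical plan as 0 or 1.
        "is_premium": 1 if record["plan"] == "premium" else 0,
    }


features = [engineer_features(r) for r in raw_records]
print(features)
```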

2. What term describes the process of modifying existing features in a way that improves accuracy when training a model?

  • Feature transformation (CORRECT)
  • Feature improvement
  • Feature extraction
  • Feature selection

Correct: Feature transformation modifies existing features so that they are better suited to the model during training. Typical transformations include scaling and encoding, which make patterns in the dataset easier for the model to learn.

3. A class imbalance occurs when a dataset has a predictor variable that contains an equal number of instances of all possible outcomes.

  • True
  • False (CORRECT)

Correct: Class imbalance occurs when one possible outcome has significantly more instances than another, not when the outcomes are equally represented. This imbalance can hurt model performance, because predictions tend to be biased toward the more frequent class.
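One simple way to spot a class imbalance is to count how often each outcome appears. The sketch below uses made-up labels (0 for a customer who stayed, 1 for a customer who churned); the 90/10 split is an assumption chosen only for illustration.

```python
from collections import Counter

# Hypothetical labels: 0 = customer stayed, 1 = customer churned.
labels = [0] * 90 + [1] * 10

# Count each class, then convert the counts to proportions of the dataset.
counts = Counter(labels)
proportions = {cls: n / len(labels) for cls, n in counts.items()}
print(proportions)
```

A result far from an even split, as here, suggests the dataset may need rebalancing before modeling.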

PRACTICE QUIZ: TEST YOUR KNOWLEDGE: PACE IN MACHINE LEARNING: THE CONSTRUCT AND EXECUTE STAGES

1. Fill in the blank: Posterior probability is the probability of an event occurring after considering _____ information.

  • undefined
  • new (CORRECT)
  • historical
  • conditional

Correct: Posterior probability is the probability of an event after new information has been taken into account. It results from updating a prior belief in light of newly observed data.
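To make the idea of updating on new information concrete, here is a minimal sketch of Bayes's theorem for a two-outcome case. The `posterior` helper and all of the probabilities (a 10% prior churn rate, and complaint rates of 60% and 10%) are invented for illustration.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Return P(event | evidence) using Bayes's theorem.

    prior:             P(event) before seeing the evidence
    likelihood:        P(evidence | event)
    likelihood_if_not: P(evidence | not event)
    """
    # Total probability of seeing the evidence at all.
    evidence = likelihood * prior + likelihood_if_not * (1 - prior)
    return likelihood * prior / evidence


# Hypothetical numbers: 10% of customers churn; 60% of churners complained,
# while only 10% of non-churners complained.
p_churn_given_complaint = posterior(prior=0.10, likelihood=0.60, likelihood_if_not=0.10)
print(round(p_churn_given_complaint, 2))
```

Under these assumed numbers, learning that a customer complained raises the estimated churn probability from the 10% prior to about 40%.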

2. A data professional would use the function MinMaxScaler to normalize the columns in a model so that each value falls between zero and one.

  • True (CORRECT)
  • False

Correct: MinMaxScaler rescales each column to the range 0 to 1. Within a column, the minimum value becomes 0, the maximum becomes 1, and every other value falls linearly in between. This puts features on a comparable scale so that their original units do not dominate the results.
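The arithmetic behind min-max scaling can be sketched in a few lines of plain Python. This is a hand-written illustration of the formula, not the library's implementation; `min_max_scale` is a hypothetical helper, and it assumes the column holds at least two distinct values.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    # Assumes hi != lo; a constant column would cause a division by zero.
    return [(v - lo) / (hi - lo) for v in values]


scaled = min_max_scale([10, 20, 40])
print(scaled)
```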

3. A data professional has built a model, and now they are adjusting how features are engineered in order to improve performance. Which PACE stage does this scenario describe?

  • Construct
  • Analyze
  • Execute (CORRECT)
  • Plan

Correct: This scenario describes the Execute stage of the machine learning workflow. Model building is iterative, and performance is often improved by re-engineering features and rebuilding the model.

QUIZ: MODULE 2 CHALLENGE

1. Which of the following statements accurately describe feature engineering? Select all that apply.

  • Feature transformation involves selecting the features in the data that contribute the most to predicting the response variable.
  • Feature engineering involves selecting, transforming, or extracting elements from within raw data. (CORRECT)
  • In feature engineering, a data professional may use their practical, statistical, and data science knowledge. (CORRECT)
  • Feature extraction involves taking multiple features to create a new one that will improve the accuracy of the algorithm. (CORRECT)

2. A data professional resolves a class imbalance in a very large dataset. They alter the majority class by using fewer of the original data points in order to produce a split that is more even. What does this scenario describe?

  • Upsampling
  • Merging
  • Downsampling (CORRECT)
  • Smoothing
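A minimal sketch of downsampling, assuming a hypothetical dataset of 90 majority-class rows and 10 minority-class rows. The `downsample_majority` helper, the `churned` field, and the target size are all illustrative assumptions.

```python
import random


def downsample_majority(rows, label_key, majority_label, target_size, seed=0):
    """Keep a random subset of the majority class and all other rows."""
    rng = random.Random(seed)  # seeded so the example is reproducible
    majority = [r for r in rows if r[label_key] == majority_label]
    minority = [r for r in rows if r[label_key] != majority_label]
    # Sample without replacement from the majority class only.
    kept = rng.sample(majority, target_size)
    balanced = kept + minority
    rng.shuffle(balanced)
    return balanced


# Hypothetical data: 90 customers stayed (0) and 10 churned (1).
rows = [{"churned": 0} for _ in range(90)] + [{"churned": 1} for _ in range(10)]
balanced = downsample_majority(rows, "churned", majority_label=0, target_size=10)
print(len(balanced), sum(r["churned"] for r in balanced))
```

Because downsampling discards data, it tends to suit very large datasets, where the remaining rows are still plentiful.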

3. Fill in the blank: Customer churn is the business term that describes how many customers stop _____ and at what rate this occurs.

  • researching a company’s offerings
  • using a product or service (CORRECT)
  • sharing feedback with a company
  • reviewing items online

4. Naive Bayes is a supervised classification technique that assumes independence among predictors. What is the meaning of this concept?

  • The value of a predictor variable on a given class is dependent upon the values of other predictors.
  • The value of a predictor variable on a given class is measured by the values of other predictors.
  • The value of a predictor variable on a given class is equal to the values of other predictors.
  • The value of a predictor variable on a given class is not affected by the values of other predictors. (CORRECT)
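The independence assumption means that, within a class, the likelihoods of the individual predictors can simply be multiplied together. The sketch below shows that calculation by hand; the class names, feature names, and every probability are invented for illustration.

```python
from math import prod

# Hypothetical prior probability of each class.
priors = {"churn": 0.2, "stay": 0.8}

# Hypothetical P(feature is present | class) for two binary features.
likelihoods = {
    "churn": {"complained": 0.6, "low_usage": 0.7},
    "stay": {"complained": 0.1, "low_usage": 0.2},
}


def naive_posteriors(observed):
    """Posterior for each class, treating the observed features as independent."""
    # Naive step: multiply the per-feature likelihoods within each class.
    scores = {
        cls: priors[cls] * prod(likelihoods[cls][f] for f in observed)
        for cls in priors
    }
    total = sum(scores.values())
    # Normalize so the posteriors sum to 1.
    return {cls: score / total for cls, score in scores.items()}


result = naive_posteriors(["complained", "low_usage"])
print(result)
```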

5. Fill in the blank: When using a scaler to _____ the columns in a dataset using MinMaxScaler, a data professional must fit the scaler to the training data and transform both the training data and the test data using that same scaler.

  • customize
  • filter
  • sort
  • normalize (CORRECT)
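The fit-on-train, transform-both pattern might look like the following with scikit-learn's `MinMaxScaler`. The tiny arrays are made up, and the sketch assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-column training and test data.
X_train = np.array([[10.0], [20.0], [40.0]])
X_test = np.array([[25.0], [50.0]])

scaler = MinMaxScaler()
# Learn the minimum and maximum from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those same training statistics on the test data.
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.ravel())
print(X_test_scaled.ravel())
```

Because the scaler only knows the training range, a test value above the training maximum (50 here) scales to more than 1. Fitting only on the training data also avoids leaking information from the test set into the model.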

6. A data professional evaluates a model’s performance and considers how it can be improved. Which PACE stage does this scenario describe?

  • Analyze
  • Plan
  • Construct
  • Execute (CORRECT)

7. In the model-development process, which type of feature is useful by itself because it contains information that will be useful when forecasting the target?

  • Redundant
  • Irrelevant
  • Predictive (CORRECT)
  • Interactive

8. Fill in the blank: Log normalization is useful when working with a model that cannot manage continuous variables with _____ distributions.

  • binomial
  • probability
  • normal
  • skewed (CORRECT)
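As a small illustration of how taking the log reduces skew, the sketch below compares a skewness measure before and after the transformation. The data values are invented, and `skewness` is a hand-written helper using the moment-based definition (third central moment divided by the variance raised to the power 1.5).

```python
import math


def skewness(values):
    """Moment-based skewness: third central moment / variance ** 1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature with one very large value.
values = [1, 2, 2, 3, 3, 4, 5, 8, 20, 100]
# Natural log; this requires every value to be strictly positive.
logged = [math.log(v) for v in values]

print(skewness(values), skewness(logged))
```

The logged feature is still somewhat right-skewed in this example, but far less so than the raw values.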

9. A data professional discovers that the dataset they are working with contains a class imbalance. The majority class comprises 90% of the data and the minority class comprises 10% of the data. Which of the following statements best describe the impact of this class imbalance?

  • Major issues should not arise if the majority class makes up 10% or less of the dataset.
  • Major issues should not arise because the data has a 50-50 split of outcomes.
  • Major issues will arise if the data professional decides to rebalance the dataset.
  • Major issues will arise because the majority class makes up 90% or more of the dataset. (CORRECT)

10. Fill in the blank: Customer churn is a business term that describes how many customers stop _____ and at what rate this occurs.

  • writing positive reviews about a company
  • doing business with a company (CORRECT)
  • returning items to a company
  • contacting a company’s customer relations department

11. What does Bayes’s theorem enable data professionals to calculate?

  • Data accuracy
  • Posterior probability (CORRECT)
  • Causation
  • Margin of error

12. Fill in the blank: When normalizing the columns in a dataset using MinMaxScaler, the columns’ maximum value scales to one, and the minimum value scales to _____. Everything else falls somewhere in between.

  • .5
  • -1
  • 0.1
  • 0 (CORRECT)

13. In the model-development process, which type of feature is not useful by itself for predicting the target variable, but becomes predictive in conjunction with other features?

  • Predictive
  • Irrelevant
  • Redundant
  • Interactive (CORRECT)

14. Bayes’s theorem enables data professionals to calculate posterior probability for a data project. What does posterior probability describe?

  • The likelihood of an event occurring after taking into consideration all new, relevant observations and information (CORRECT)
  • The likelihood of an event occurring after taking into consideration only the most suitable observations and information
  • The likelihood of an event occurring based upon only observations and information that align with current hypotheses
  • The likelihood of an event occurring based upon the observations and information that were available at the start of the data project

15. A data professional assesses a business need in order to determine what type of model is best suited to a project. Which PACE stage does this scenario describe?

  • Analyze
  • Construct
  • Execute
  • Plan (CORRECT)

16. Fill in the blank: Log normalization involves taking the log of a _____ feature and making the data more effective for modeling.

  • skewed (CORRECT)
  • continuous
  • normal
  • probable

17. Fill in the blank: Log normalization involves reducing _____ in order to make the data more effective for modeling.

  • probability
  • skew (CORRECT)
  • continuity
  • normality

18. In the model-development process, which type of feature does not contain any useful information for predicting the target variable?

  • Predictive
  • Irrelevant (CORRECT)
  • Conducive
  • Relevant

19. Which of the following statements accurately describe feature engineering? Select all that apply.

  • Feature engineering does not involve using a data professional’s statistical knowledge.
  • Feature engineering may involve transforming the properties of raw data. (CORRECT)
  • In feature engineering, feature selection involves choosing the features in the data that contribute the most to predicting the response variable. (CORRECT)
  • In feature engineering, feature extraction involves taking multiple features to create a new one that will improve the accuracy of the algorithm. (CORRECT)

20. Which of the following statements accurately describe the general categories of feature engineering? Select all that apply.

  • Feature selection involves taking multiple features to create a new one that will improve the accuracy of the algorithm.
  • Feature extraction involves choosing the features in the data that contribute the most to predicting the response variable.
  • Feature transformation involves modifying existing features in a way that improves accuracy when training a model. (CORRECT)
  • The three general categories of feature engineering are selection, extraction, and transformation. (CORRECT)

21. A data professional works with a dataset for a project with their company’s human resources team. They discover that the dataset has a predictor variable that contains more instances of one outcome than another. What will occur as a result of this scenario?

  • Class imbalance (CORRECT)
  • Inconsistent data
  • Incompatibility
  • Redundancy

22. A data professional examines a dataset to reveal key details about the data that will help inform the plans for building a model. Which PACE stage does this scenario describe?

  • Execute
  • Plan
  • Construct
  • Analyze (CORRECT)

23. Fill in the blank: When normalizing the columns in a dataset using MinMaxScaler, the columns’ maximum value scales to _____, and the minimum value scales to zero. Everything else falls somewhere in between.

  • 10
  • .5
  • 100
  • 1 (CORRECT)

24. Fill in the blank: Customer _____ is the business term that describes how many customers stop using a product or service, or stop doing business with a company altogether, and at what rate this occurs.

  • churn (CORRECT)
  • exchange
  • retention
  • transfer

25. Fill in the blank: Naive Bayes is a supervised classification technique that is based on Bayes’ Theorem, with an assumption of _____ among predictors.

  • interdependence
  • even distribution
  • clear hierarchy
  • independence (CORRECT)

Correct: Naive Bayes is a supervised classification technique that is based on Bayes' theorem and assumes independence among predictors.
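A minimal sketch of building a Naive Bayes classifier with scikit-learn's `GaussianNB`, one of several Naive Bayes variants in that library. The two-feature dataset is invented, and the sketch assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical training data: two numeric features and a binary label.
X = np.array([
    [1.0, 5.0], [1.2, 5.5], [0.9, 4.8],  # class 0
    [3.0, 1.0], [3.2, 1.3], [2.9, 0.8],  # class 1
])
y = np.array([0, 0, 0, 1, 1, 1])

model = GaussianNB()
model.fit(X, y)

# Classify two new, unseen points and inspect the class probabilities.
X_new = np.array([[1.1, 5.1], [3.1, 1.1]])
predictions = model.predict(X_new)
probabilities = model.predict_proba(X_new)

print(predictions)
print(probabilities)
```

`GaussianNB` models each feature within a class as an independent normal distribution, which is one way of applying the independence assumption to continuous features.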

26. In classification techniques, what is the term for the proportion of actual positives that are identified correctly to all actual positives?

  • Accuracy
  • Precision
  • Recall (CORRECT)
  • F1 score

Correct: Recall measures how many of the actual positives the model identifies. It is calculated as true positives divided by the total number of actual positives.
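The calculation can be sketched directly from that definition. The labels and predictions below are invented, and `recall` is a hand-written helper rather than a library function.

```python
def recall(y_true, y_pred, positive=1):
    """True positives divided by all actual positives."""
    pairs = list(zip(y_true, y_pred))
    true_positives = sum(1 for t, p in pairs if t == positive and p == positive)
    false_negatives = sum(1 for t, p in pairs if t == positive and p != positive)
    return true_positives / (true_positives + false_negatives)


# Hypothetical labels: four actual positives, two of which the model found.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]

model_recall = recall(y_true, y_pred)
print(model_recall)
```

The false positive in the fifth position does not affect recall; it would lower precision instead.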

CONCLUSION – Workflow for Building Complex Models

This module lays a strong foundation for understanding and navigating machine learning workflows. Its structured approach equips learners to address real-world business challenges with confidence and efficiency.

Its emphasis on hands-on practice helps turn theory into action. As participants progress, they gain both a conceptual grasp of the workflow and practical experience using machine learning models in different business scenarios. This prepares them to contribute to the field of machine learning and to make a meaningful impact through data science.