INTRODUCTION – Train And Evaluate Classification And Clustering Models
Classification is the machine learning method used to assign items to a predefined set of categories. In this module, you are guided through the steps of building a model that predicts categories, using the scikit-learn library for Python to train and evaluate classification models. The module also introduces clustering, an unsupervised machine learning technique that groups data observations into meaningful clusters, and shows how to train and evaluate a clustering model with scikit-learn.
Learning Objectives:
- Know when to apply classification techniques.
- Train and evaluate classification models using scikit-learn.
- Be able to identify situations for using clustering.
- Train and evaluate clustering models within the scikit-learn environment.
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. True or False? Classification is an unsupervised machine learning technique.
- True
- False (CORRECT)
Correct: Classification is a type of supervised machine learning in which a model is trained to predict the label of a case from a set of input features.
2. Complete the statement:
Classification is a form of machine learning in which you train a model to predict […] of an item.
- The category (CORRECT)
- The feature
- The numeric value
Correct: Classification uses known features along with their associated labels to predict the category an item belongs to.
3. Which Python package contains the train_test_split function?
- Scikit-learn (CORRECT)
- Matplotlib
- Tensorflow
- Numpy
Correct: In Python, the scikit-learn (sklearn) library provides many functions for performing machine learning on raw data. It includes the train_test_split function, which performs a statistically random split, setting aside a portion of the data for testing while the rest serves as training data.
4. How would you split your data for training and testing to ensure the model performs well?
- 30% training and 70% testing
- 50% training and 50% testing
- 70% training and 30% testing (CORRECT)
Correct: This split gives the model the larger share of the data to learn from while still holding back enough unseen data to test its performance reliably. A 70/30 split is a common convention rather than a fixed rule.
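As an illustration of questions 3 and 4, the following is a minimal sketch of a 70/30 split with train_test_split. The toy arrays X and y are made up for the example and are not part of the course material.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 features, binary labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = rng.integers(0, 2, size=100)

# Hold back 30% of the rows for testing; the remaining 70% is used for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0
)

print(X_train.shape, X_test.shape)
```

Setting random_state only makes the random split repeatable; it does not change the proportions.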
5. For machine learning algorithms, what are the externally set parameters generally called?
- Staticparameters
- Hyperparameters (CORRECT)
- Superparameters
Correct: Hyperparameters are settings that are chosen externally, before training, and that influence how the algorithm learns. Parameters, by contrast, are values the model derives from the data itself during training.
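The distinction can be sketched with a scikit-learn estimator. In this example, which uses synthetic data and an arbitrarily chosen value of C, the regularization setting C is a hyperparameter passed in before training, while coef_ and intercept_ are parameters learned by fit().

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data (illustrative only).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# C (inverse regularization strength) is a hyperparameter: we set it before training.
model = LogisticRegression(C=0.5)
model.fit(X, y)

# coef_ and intercept_ are parameters: they are learned from the data during fit().
print("hyperparameter C:", model.get_params()["C"])
print("learned coefficients:", model.coef_)
print("learned intercept:", model.intercept_)
```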
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. True or False?
Clustering is an example of a supervised machine learning technique.
- True
- False (CORRECT)
Correct: Clustering is an ‘unsupervised’ method, where ‘training’ is done without labels. Instead, models identify examples that have a similar collection of features.
2. When working on a clustering model, if you want to measure how tightly the data points are grouped, what metric should you use?
- R-squared (R2)
- Receiver operating characteristic (ROC) curve
- F1 score
- Within cluster sum of squares (WCSS) (CORRECT)
Correct: Within-Cluster Sum of Squares (WCSS) is one of the most widely used measures of cluster tightness: the lower the value, the closer the data points lie to the centroid of their cluster.
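As a rough sketch of what WCSS measures, the example below computes it by hand for a fitted K-means model and compares it with the inertia_ attribute that scikit-learn reports. The blob data and the choice of three clusters are assumptions made for the illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three groups (illustrative only).
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# WCSS: squared distance from each point to the centroid of its own cluster, summed.
assigned_centroids = model.cluster_centers_[model.labels_]
wcss = float(((X - assigned_centroids) ** 2).sum())

# scikit-learn exposes the same quantity as inertia_.
print(wcss, model.inertia_)
```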
3. This clustering algorithm separates a dataset into clusters of equal variances, where the number of clusters is user-defined.
Which clustering algorithm is this?
- Logistic regression
- K-means (CORRECT)
- Hierarchical
Correct: K-means is one of the most popular clustering algorithms. It partitions the data into K clusters of roughly equal variance, where the number of clusters, K, is specified by the user.
4. Select the correct steps that a basic K-means clustering algorithm consists of:
- A set of K centroids are specifically chosen.
- Clusters are formed by assigning the data points to a random centroid.
- The mean of each cluster is computed and the centroid is moved to the mean. (CORRECT)
- When the clusters stop changing, the algorithm has converged. (CORRECT)
- Steps 2 and 3 (b and c here) are repeated until a stopping criterion is met. (CORRECT)
Correct: This is correct.
Correct: Once the clusters stop changing, their positions are final. However, because the initial centroid positions are chosen randomly, running the algorithm again can produce slightly different clusters. To address this, training usually runs the algorithm several times, reinitializing the centroids each time, and the model with the best (lowest) WCSS is kept.
Correct: Steps 2 and 3 are repeated until a stopping criterion is met. Typically, the algorithm terminates when the centroids move only a negligible distance between iterations, which indicates that the clusters have stabilized.
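The basic loop can be sketched in a few lines of NumPy. This is a simplified teaching version, not scikit-learn's implementation: the function name kmeans_sketch, the toy points, and the stopping rule (stop when assignments no longer change) are assumptions made for the example. Note that points are assigned to their nearest centroid, and the initial centroids are picked at random.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Toy K-means: returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = distances.argmin(axis=1)
        # Stop when the assignments no longer change (the algorithm has converged).
        if labels is not None and np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Step 3: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[j] = members.mean(axis=0)
    return centroids, labels

# Two well-separated groups of three points each (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centroids, labels = kmeans_sketch(X, k=2)
print(labels)
```

Which group receives label 0 depends on the random initialization, so only the grouping itself is meaningful.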
5. Which of the following algorithms are considered to be clustering-type algorithms?
- Decision Tree
- K-Means (CORRECT)
- Hierarchical (CORRECT)
Correct: K-Means is a clustering-type algorithm.
Correct: Hierarchical clustering is a form of clustering algorithm that constructs a hierarchy of clusters by merging smaller clusters into bigger ones (agglomerative), or breaking a large cluster into smaller ones (divisive).
QUIZ: TEST PREP
1. Which type of machine learning model can be trained using the Support Vector Machine algorithm?
- Classification (CORRECT)
- Clustering
- Regression
Correct: A Support Vector Machine is a supervised algorithm commonly used for classification; it looks for the boundary that separates the classes with the widest margin. Logistic regression is another of the most commonly used statistical techniques for classification. It relates the probability of a binary outcome, such as ‘yes’ or ‘no’ or ‘1’ and ‘0’, to a set of explanatory variables. Although it is called regression, it is a linear classification model: it uses the logistic (sigmoid) function to predict probabilities, which are then mapped to class labels.
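For the question above, here is a minimal sketch of training a Support Vector Machine classifier with scikit-learn's SVC class. The iris dataset, the 70/30 split, and the default RBF kernel are choices made for the illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Small built-in dataset with three flower classes.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=0, stratify=y
)

# Train a support vector classifier and measure accuracy on the held-back data.
model = SVC(kernel="rbf")
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)
print(f"Test accuracy: {accuracy:.2f}")
```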
2. When using the classification report from sklearn.metrics to evaluate the performance of your model, what does the F1-Score metric provide?
- An average metric that takes both precision and recall into account. (CORRECT)
- Out of all of the instances of this class in the test dataset, how many did the model identify.
- How many instances of this class are there in the test dataset.
- Of the predictions the model made for this class, what proportion were correct.
Correct: The F1 score summarizes the performance of a model by combining precision and recall. It is defined as the harmonic mean of the two, which makes it more informative than plain accuracy for imbalanced classes. It is most useful when both false positives and false negatives matter.
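The harmonic mean can be worked through by hand. The helper f1_from_counts and the counts below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def f1_from_counts(tp, fp, fn):
    """F1 as the harmonic mean of precision and recall."""
    precision = tp / (tp + fp)  # of the predictions for this class, how many were right
    recall = tp / (tp + fn)     # of the actual instances of this class, how many were found
    return 2 * precision * recall / (precision + recall)

# Hypothetical counts for one class: 8 true positives, 2 false positives, 4 false negatives.
f1 = f1_from_counts(tp=8, fp=2, fn=4)
print(round(f1, 3))  # 0.727
```

Here precision is 0.8 and recall is about 0.667, and the F1 score of about 0.727 sits between them but closer to the lower value. In practice, sklearn.metrics provides f1_score and classification_report, which compute these values from the true and predicted labels.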
3. The Precision and Recall metrics are derived from four possible prediction outcomes.
If the predicted label is 1, but the actual label is 0, what would the outcome be?
- False Negative
- True Negative
- False Positive (CORRECT)
- True Positive
Correct: A false positive occurs when the predicted label is 1 but the actual label is 0.
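The four outcomes can be written out as a small lookup. The function name outcome is hypothetical and exists only for this sketch; it assumes 1 is the positive class and 0 the negative class.

```python
def outcome(actual, predicted):
    """Name the outcome of a binary prediction (1 = positive, 0 = negative)."""
    if predicted == 1:
        return "True Positive" if actual == 1 else "False Positive"
    return "True Negative" if actual == 0 else "False Negative"

print(outcome(actual=0, predicted=1))  # False Positive
```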
4. In multiclass classification, what are the two ways in which you can approach a problem?
- Rest minus One
- One and Rest
- One vs One (CORRECT)
- One vs Rest (CORRECT)
Correct: In the One-vs-One (OVO) approach, a classifier is built for every pair of classes. Each classifier learns to discriminate between two classes, and the final prediction is the class that receives the majority of votes across all classifiers.
Correct: In the One-vs-Rest (OVR) approach, also called one-vs-all, one classifier is created for each class. Each classifier is trained to distinguish its class from all the others, treating the target class as positive and every other class as negative.
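The difference in the number of classifiers can be seen with scikit-learn's OneVsRestClassifier and OneVsOneClassifier wrappers. This is a sketch on synthetic four-class data; the dataset settings and the choice of logistic regression as the underlying binary classifier are assumptions for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier

# Synthetic data with four classes (illustrative only).
X, y = make_classification(
    n_samples=400, n_features=6, n_informative=4, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

# OvR trains one binary classifier per class; OvO trains one per pair of classes.
print(len(ovr.estimators_), len(ovo.estimators_))  # 4 6
```

With four classes, One-vs-Rest needs 4 classifiers and One-vs-One needs 4 * 3 / 2 = 6.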
5. Hierarchical clustering creates clusters using two methods.
Which are those two methods?
- Aggregational
- Distinctive
- Agglomerative (CORRECT)
- Divisive (CORRECT)
Correct: Agglomerative clustering is a type of bottom-up clustering technique: at first each data point represents an individual cluster, and then the algorithm iteratively merges the two closest clusters until there is only one cluster describing all data points or until reaching a certain number of clusters. The whole process is based on distance or similarity among data objects, and at each stage the most similar clusters are merged together.
Correct: The divisive method is a “top-down” approach that starts with the entire dataset as a single cluster and splits it step by step into smaller clusters.
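The bottom-up variant is available in scikit-learn as AgglomerativeClustering. The sketch below uses six made-up points and asks the merging to stop at two clusters; both choices are assumptions for the illustration.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Two well-separated groups of three points each (illustrative only).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Start with every point as its own cluster and merge until two clusters remain.
model = AgglomerativeClustering(n_clusters=2)
labels = model.fit_predict(X)
print(labels)
```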
6. With which kind of machine learning is the K-Means clustering algorithm associated?
- Reinforcement learning
- Unsupervised machine learning (CORRECT)
- Supervised machine learning
Correct: Clustering is a form of unsupervised machine learning, in which the training data does not include labels.
7. You are using the scikit-learn library to train a K-Means clustering model that groups observations into four clusters. How should you create the K-Means object?
- model = Kmeans(max_iter=4)
- model = Kmeans(n_init=4)
- model = KMeans(n_clusters=4) (CORRECT)
Correct: This defines n_clusters as the number of clusters to form.
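A minimal sketch of the full step follows, using synthetic blob data that is assumed for the example. In scikit-learn, max_iter caps the number of iterations of a single run and n_init sets how many times the algorithm is run with different starting centroids, so neither of them controls the number of clusters.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with four groups (illustrative only).
X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

# n_clusters sets how many clusters to form; n_init and random_state are optional here.
model = KMeans(n_clusters=4, n_init=10, random_state=0)
labels = model.fit_predict(X)

print(model.cluster_centers_.shape)  # (4, 2)
```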
CONCLUSION – Train And Evaluate Classification And Clustering Models
Understanding both classification and clustering techniques matters when applying machine learning in different scenarios. This module equips you with the knowledge to build and assess classification models that predict categories and clustering models that group data observations. By using scikit-learn in Python, you gain experience in supervised and unsupervised machine learning, laying the groundwork for further work in data science.