Module 5: Exam Preparation Course 4


INTRODUCTION – Exam Preparation Course 4

In this module, you are introduced to a detailed view of Course 4 of the Microsoft Azure Data Scientist Associate specialization. This section aims to reinforce and widen your understanding of the more practical, advanced topics covered in the course by revisiting the key concepts, methods, and tools taught in Course 4, improving your ability to apply these principles within the Azure ecosystem.

This full review will prepare you for real-world data science practice and its eventualities, so that you can be fully confident in taking advantage of the robust services and solutions that Microsoft Azure offers.

Learning Objectives:

  • Outline the key points in the Microsoft Azure Data Scientist Associate specialization.
  • Recap the major topics of Course 4: Perform Data Science with Azure Databricks.
  • Assess your own understanding of and skill in performing data science with Azure Databricks.

Quiz: Practice exam covering Course 4: Perform Data Science with Azure Databricks

1. You have an AirBnB housing dataframe which you preprocessed and filtered down to only the relevant columns.

The columns are: id, host_name, bedrooms, neighbourhood_cleansed, price.

You've written the function below, named firstInitialFunction, which returns the first initial from the host_name column:

def firstInitialFunction(name):
    return name[0]

firstInitialFunction("George")

You now want to create a UDF from this function using spark.udf.register so that the UDF is created in the SQL namespace.

How would you code that?

  • airbnbDF.replaceTempView("airbnbDF")
  • spark.udf.register("sql_udf", firstInitialFunction)
  • airbnbDF.createTempView("airbnbDF")
  • spark.udf.register(sql_udf = firstInitialFunction)
  • airbnbDF.createAndReplaceTempView("airbnbDF")
  • spark.udf.register(sql_udf.firstInitialFunction)
  • airbnbDF.createOrReplaceTempView("airbnbDF") (CORRECT)
  • spark.udf.register("sql_udf", firstInitialFunction)

Correct: createOrReplaceTempView registers the DataFrame as a temporary view that SQL queries can reference, and spark.udf.register("sql_udf", firstInitialFunction) makes the function callable from the SQL namespace under the name sql_udf.
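
For reference, here is a minimal sketch of the full flow, assuming a Databricks notebook where spark and the airbnbDF DataFrame already exist; the SELECT statement is an illustrative query, not part of the question:

def firstInitialFunction(name):
    return name[0]

# Make the function callable from SQL under the name "sql_udf"
spark.udf.register("sql_udf", firstInitialFunction)

# Register the DataFrame as a temporary view in the SQL namespace
airbnbDF.createOrReplaceTempView("airbnbDF")

# Call the UDF from SQL (illustrative query)
display(spark.sql("SELECT sql_udf(host_name) AS firstInitial FROM airbnbDF"))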

2. You have a Boston Housing dataset in which you find the median home value together with a number of variables such as the number of rooms, per capita crime, and the economic status of residents.

You want to use Linear Regression to predict the median home value based on the average number of rooms.

You’ve imported the dataset and created a column named features that has a single input variable named rm by using VectorAssembler.

You now want to fit the Linear Regression model.

How should you code that?

  • from pyspark.ml.regression import LinearRegression (CORRECT)
  • lr = LinearRegression(featuresCol="features", labelCol="medv")
  • lrModel = lr.fit(bostonFeaturizedDF)
  • from pyspark.ml.regression import LinearRegression
  • lr = LinearRegression(featuresCol="rm", labelCol="medv")
  • lrModel = lr_fit(bostonFeaturizedDF)
  • from pyspark.ml import LinearRegression
  • lr = LinearRegression(featuresCol="rm ", labelCol="medv")
  • lrModel = lr_fit(bostonFeaturizedDF)
  • from pyspark import LinearRegression
  • lr = LinearRegression(featuresCol="features", labelCol="medv")
  • lrModel = lr.fit(bostonFeaturizedDF)

Correct: LinearRegression must be imported from pyspark.ml.regression. featuresCol points at the assembled features column (not the raw rm column), labelCol at the medv target, and lr.fit() trains the model on the featurized DataFrame.
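
Putting the pieces together, a minimal sketch, assuming bostonFeaturizedDF already holds the assembled features column and the medv label column; the print is illustrative:

from pyspark.ml.regression import LinearRegression

# featuresCol points at the assembled vector column, labelCol at the target
lr = LinearRegression(featuresCol="features", labelCol="medv")
lrModel = lr.fit(bostonFeaturizedDF)

# Inspect the fitted coefficient and intercept
print(lrModel.coefficients, lrModel.intercept)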

3. You are using MLflow to track the runs of a Linear Regression model of an AirBnB dataset.

You want to use all the features in the dataset.

You’ve created the pipeline, logged the pipeline, and logged the parameters.

Now you need to create predictions and metrics.

How should you code that?

  • predDF = pipelineModel.estimate(testDF)
  • regressionEvaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")
  • rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
  • r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
  • predDF = pipelineModel.evaluate(testDF)
  • regressionEvaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")
  • rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
  • r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
  • predDF = pipelineModel.transform(testDF)
  • regression = RegressionEvaluator(labelCol="price", predictionCol="prediction")
  • rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
  • r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)
  • predDF = pipelineModel.transform(testDF) (CORRECT)
  • regressionEvaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")
  • rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
  • r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)

Correct: transform() applies the fitted pipeline to the test set and appends a prediction column; RegressionEvaluator then compares the price label with that column to compute the RMSE and R2 metrics.
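
As a sketch of where these metrics typically go next, assuming a fitted pipelineModel, a held-out testDF, and an active MLflow run as described in the question (the log_metric calls are an assumption, not part of the asked-for code):

from pyspark.ml.evaluation import RegressionEvaluator
import mlflow

# transform() appends a "prediction" column to the test set
predDF = pipelineModel.transform(testDF)

regressionEvaluator = RegressionEvaluator(labelCol="price", predictionCol="prediction")
rmse = regressionEvaluator.setMetricName("rmse").evaluate(predDF)
r2 = regressionEvaluator.setMetricName("r2").evaluate(predDF)

# Record the metrics on the active run (assumed to be open)
mlflow.log_metric("rmse", rmse)
mlflow.log_metric("r2", r2)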

4. You are running Python code interactively in a Conda environment. The environment includes all required Azure Machine Learning SDK and MLflow packages.

You must use MLflow to log metrics in an Azure Machine Learning experiment named mlflow-experiment.

To answer, replace the bolded comments in the code with the appropriate code options in the answer area.

How should you complete the code?

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

#1 Set the MLflow logging target

#2 Configure the experiment

with #3 Begin the experiment run:
    #4 Log my_metric with value 1.00 ('my_metric', 1.00)

print("Finished!")

  • #1 mlflow.tracking.client = ws, #2 mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()), #3 mlflow.active_run(), #4 mlflow.log_metric
  • #1 mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()), #2 mlflow.get_run('mlflow-experiment'), #3 mlflow.start_run(), #4 mlflow.log_metric
  • #1 mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()), #2 mlflow.set_experiment('mlflow-experiment'), #3 mlflow.start_run(), #4 mlflow.log_metric (CORRECT)
  • #1 mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri()), #2 mlflow.get_run('mlflow-experiment'), #3 mlflow.start_run(), #4 run.log()

Correct: #1 ws.get_mlflow_tracking_uri() returns the tracking URI unique to this Azure Machine Learning workspace (ws). The tracking URI is the location where the MLflow server stores logs and metadata for experiments.

#2 mlflow.set_experiment(experiment_name) sets the MLflow experiment name.

#3 mlflow.start_run() starts the training run.

#4 mlflow.log_metric() then uses the MLflow logging API to record your training run metrics.
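
Assembled, the completed script looks like this; it assumes a config.json file is available so Workspace.from_config() can locate the workspace:

import mlflow
from azureml.core import Workspace

ws = Workspace.from_config()

# 1: Point MLflow at this workspace's tracking server
mlflow.set_tracking_uri(ws.get_mlflow_tracking_uri())

# 2: Select (or create) the experiment
mlflow.set_experiment('mlflow-experiment')

# 3: Begin the run; the context manager ends it automatically
with mlflow.start_run():
    # 4: Log my_metric with value 1.00
    mlflow.log_metric('my_metric', 1.00)

print("Finished!")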

5. You are evaluating a Python NumPy array that contains six data points defined as follows: data = [10, 20, 30, 40, 50, 60]

You must generate the following output by using the k-fold algorithm implementation in the Python scikit-learn machine learning library:

train: [10 40 50 60], test: [20 30]
train: [20 30 40 60], test: [10 50]
train: [10 20 30 50], test: [40 60]

You need to implement cross-validation to generate the output.

To answer, replace the bolded comments in the code with the appropriate code options in the answer area.

How should you complete the code?

from numpy import array
from sklearn.model_selection import #1st option

data = array([10, 20, 30, 40, 50, 60])

kfold = KFold(n_splits=#2nd option, shuffle=True, random_state=1)

for train, test in kfold.split(#3rd option):
    print('train: %s, test: %s' % (data[train], data[test]))

  • CrossValidation, 3, data
  • KFold, 3, array
  • KFold, 3, data (CORRECT)
  • KMeans, 6, array

Correct: The K-Folds cross-validator tests the performance of a machine learning model by splitting the dataset into k successive folds, which helps check how well the model generalizes to unseen data.

The n_splits parameter (int) is the number of folds and must be at least 2.
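
The completed snippet, runnable as-is; with shuffle=True and random_state=1 it reproduces the three splits shown in the question:

from numpy import array
from sklearn.model_selection import KFold

data = array([10, 20, 30, 40, 50, 60])

# 3 folds of 2 test points each; random_state fixes the shuffle
kfold = KFold(n_splits=3, shuffle=True, random_state=1)
for train, test in kfold.split(data):
    print('train: %s, test: %s' % (data[train], data[test]))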

6. You use the following code to run a script as an experiment in Azure Machine Learning: 

from azureml.core import Workspace, Experiment, Run
from azureml.core import RunConfiguration, ScriptRunConfig

ws = Workspace.from_config()

run_config = RunConfiguration()
run_config.target = 'local'

script_config = ScriptRunConfig(source_directory='./script',
                                script='experiment.py',
                                run_config=run_config)

experiment = Experiment(workspace=ws, name='script experiment')
run = experiment.submit(config=script_config)
run.wait_for_completion()

You must identify the output files that are generated by the experiment run. You need to add code to retrieve the output file names.  Which code segment should you add to the script? 

  • files = run.get_file_names() (CORRECT)
  • files = run.get_properties()
  • files = run.get_metrics()
  • run.get_details_with_logs()

Correct: run.get_file_names() returns the list of files stored for the run, which includes the output files the experiment generated.
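
A short sketch of how the call is typically used after run.wait_for_completion(); the download loop is an illustrative assumption, not part of the question:

# List every file the run stored
files = run.get_file_names()
print(files)

# Illustrative: download anything the script wrote under outputs/
for f in files:
    if f.startswith('outputs/'):
        run.download_file(name=f, output_file_path=f.split('/')[-1])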

7. You have a Boston Housing dataset from which you want to train a model to predict the value of housing based on one or more input measures.

You are using the Spark ml framework to train the model on a single column that contains a vector of all the relevant features.

You must prepare the data by creating one column named features that holds the average number of rooms, the age, and the tax rate. You want to use VectorAssembler for this task.

How would you code this?

  • from pyspark.ml.feature import VectorAssembler (CORRECT)
  • featureCols = ["rm", "age", "tax"]
  • assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
  • bostonFeaturizedDF = assembler.transform(bostonDF)
  • display(bostonFeaturizedDF)
  • from pyspark.ml.feature import VectorAssembler
  • featureCols = ["rm", "age", "tax"]
  • assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
  • bostonFeaturizedDF = vector.assembler_transform(bostonDF)
  • display(bostonFeaturizedDF)
  • from pyspark.ml.feature import VectorAssembler
  • featureCols = ["rm", "age", "tax"]
  • assembler = Vector(inputCols=featureCols, outputCol="features")
  • bostonFeaturizedDF = assembler.transform(bostonDF)
  • display(bostonFeaturizedDF)
  • from pyspark.ml.feature import VectorAssembler
  • featureCols = ["rm", "age", "tax"]
  • assembler = VectorAssembler(Cols=featureCols, outputCol="features")
  • bostonFeaturizedDF = assembler.transform(bostonDF)
  • display(bostonFeaturizedDF)

Correct: VectorAssembler takes the list of input columns (inputCols) and combines them into a single vector column named by outputCol; transform() appends that features column to the DataFrame.
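
For reference, the correct option as one runnable cell, assuming bostonDF holds the numeric columns rm, age, and tax:

from pyspark.ml.feature import VectorAssembler

featureCols = ["rm", "age", "tax"]

# Combine the input columns into a single vector column named "features"
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
bostonFeaturizedDF = assembler.transform(bostonDF)

display(bostonFeaturizedDF)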

8. You are running a training experiment on remote compute in Azure Machine Learning.

The experiment is configured to use a Conda environment that includes the mlflow and azureml-contrib-run packages. You must use MLflow as the logging package for tracking metrics generated in the experiment. 

To answer, replace the bolded comments in the code with the appropriate code options in the answer area.

How should you complete the code?

import numpy as np

#1 Import library to log metrics

#2 Start logging for this run

reg_rate = 0.01

#3 Log the reg_rate metric

#4 Stop logging for this run

  • #1 import mlflow, #2 mlflow.start_run(), #3 logger.info(' ..'), #4 mlflow.end_run()
  • #1 from azureml.core import Run, #2 run = Run.get_context(), #3 logger.info(' ..'), #4 run.complete()
  • #1 import mlflow, #2 mlflow.start_run(), #3 mlflow.log_metric(' ..'), #4 mlflow.end_run() (CORRECT)
  • #1 import logging, #2 mlflow.start_run(), #3 mlflow.log_metric(' ..'), #4 run.complete()

Correct: Because MLflow must be the logging package, the script imports mlflow, brackets the run with mlflow.start_run() and mlflow.end_run(), and records the metric with mlflow.log_metric().
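
The completed training script under that answer, as a minimal sketch; the metric name 'reg_rate' is taken from the comments above:

import numpy as np
import mlflow                              # 1: library that provides the logging API

mlflow.start_run()                         # 2: start logging for this run

reg_rate = 0.01
mlflow.log_metric('reg_rate', reg_rate)    # 3: log the reg_rate metric

mlflow.end_run()                           # 4: stop logging for this run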

9. You have an AirBnB housing dataframe which you preprocessed and filtered down to only the relevant columns.

The columns are: id, host_name, bedrooms, neighbourhood_cleansed, price.

You've written the function below, named firstInitialFunction, which returns the first initial from the host_name column:

def firstInitialFunction(name):
    return name[0]

firstInitialFunction("George")

Because Python UDFs are much slower than Scala UDFs, you now want to create a Vectorized UDF in Python to speed up the computation.

How would you code that?

  • from pyspark.sql.functions import pandas_udf (CORRECT)
  • @pandas_udf("string")
  • def vectorizedUDF(name):
  • return name.str[0]
  • from pyspark.sql.functions import pandas_udf
  • # We have a string input/output
  • @pandas_udf("string")
  • create vectorizedUDF(name):
  • return name.str[0]
  • from pyspark.sql.functions import pandas_udf
  • @pandas_udf("string")
  • def vectorizedUDF(host_name):
  • get name.str[0]
  • from pyspark.sql.functions import pandas_udf
  • @pandas_udf("int")
  • def vectorizedUDF(name):
  • return name.str[0]

Correct: A pandas UDF receives a whole pandas Series rather than one value at a time, so name.str[0] extracts every first initial in a single vectorized operation, which is typically much faster than a row-at-a-time Python UDF.
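
The correct option in one piece, with an illustrative call applying it to the host_name column (the select is an assumption, not part of the question):

from pyspark.sql.functions import pandas_udf

# The function receives a whole pandas Series, so .str[0] is vectorized
@pandas_udf("string")
def vectorizedUDF(name):
    return name.str[0]

display(airbnbDF.select(vectorizedUDF("host_name")))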

10. You’re working with the Boston Housing dataset and you want to tune the Hyperparameters for the Linear Regression algorithm you’re using.

You've performed a train/test split on the Boston dataset and built a pipeline for linear regression. Now you want to use ParamGridBuilder() to test the maximum number of iterations, whether to fit an intercept with the y axis, and whether to standardize the features.

How should you code that?

  • from pyspark.ml.tuning import ParamGridBuilder
  • paramGrid = (ParamGridBuilder().addGrid(lr.maxIter, [1, 10, 100])
  • .addGrid(lr.fitIntercept, [True, False])
  • .addGrid(lr.standardization, [True, False])
  • .search())
  • from pyspark.ml.tuning import ParamGridBuilder
  • paramGrid = (ParamGridBuilder(lr)
  • .addGrid(lr.maxIter, [1, 10, 100])
  • .addGrid(lr.fitIntercept, [True, False])
  • .addGrid(lr.standardization, [True, False])
  • .run()
  • )
  • from pyspark.ml.tuning import ParamGridBuilder
  • paramGrid = (ParamGridBuilder(lr)
  • .addGrid(lr.maxIter, [1, 10, 100])
  • .addGrid(lr.fitIntercept, [True, False])
  • .addGrid(lr.standardization, [True, False])
  • .create()
  • )
  • from pyspark.ml.tuning import ParamGridBuilder (CORRECT)
  • paramGrid = (ParamGridBuilder()
  • .addGrid(lr.maxIter, [1, 10, 100])
  • .addGrid(lr.fitIntercept, [True, False])
  • .addGrid(lr.standardization, [True, False])
  • .build()
  • )

Correct: ParamGridBuilder() takes no constructor arguments; each addGrid() call adds one hyperparameter together with the list of values to try, and build() constructs the parameter grid.
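
As a sketch of how the grid is usually consumed, assuming the lr estimator, pipeline, evaluator, and training split from the earlier questions (CrossValidator and numFolds=3 are illustrative assumptions):

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

paramGrid = (ParamGridBuilder()
  .addGrid(lr.maxIter, [1, 10, 100])
  .addGrid(lr.fitIntercept, [True, False])
  .addGrid(lr.standardization, [True, False])
  .build())

# Search the grid with 3-fold cross-validation (assumed setup)
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=paramGrid,
                    evaluator=regressionEvaluator, numFolds=3)
cvModel = cv.fit(trainDF)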

CONCLUSION – Exam Preparation Course 4

In sum, this module gives an all-round review of Course 4 of the Microsoft Azure Data Scientist Associate specialization. Revisiting the advanced topics, methodologies, and tools presented in the course will deepen your understanding of these concepts and of how to apply them practically in Azure. This detailed review prepares you to meet real data science challenges using the full range of services and solutions that Microsoft Azure has to offer.
