INTRODUCTION – EXAM PREPARATION COURSE 3
This module revisits the content of Course 3 from the Microsoft Azure Data Scientist Associate specialization. It aims to revisit all significant aspects, techniques, and tools used by data science professionals in the Azure environment.
By the time you finish this material, you will be ready to apply your understanding of these advanced topics to everyday, real-world data scenarios across Microsoft's range of Azure services.
Learning Objectives:
- Summarize the salient points from the Microsoft Azure Data Scientist Associate specialization.
- Recap Course 3: Build and operate machine learning solutions with Azure Machine Learning.
- Assess your knowledge and skills on creating and managing machine learning solutions using Azure Machine Learning.
QUIZ: BUILD AND OPERATE MACHINE LEARNING SOLUTIONS WITH AZURE MACHINE LEARNING
1. You create an Azure Machine Learning workspace. You are preparing a local Python environment on a laptop computer.
You want to use the laptop to connect to the workspace and run experiments.
You create the following config.json file:
{ "workspace_name": "ml-workspace" }
You must use the Azure Machine Learning SDK to interact with data and experiments in the workspace. You need to configure the config.json file to connect to the workspace from the Python environment. Which two additional parameters must you add to the config.json file in order to connect to the workspace? Each correct answer presents part of the solution.
- Region
- Login
- Key
- Resource_group (CORRECT)
- Subscription_id (CORRECT)
Correct: The resource_group parameter must be specified so the SDK can locate the workspace.
Correct: The subscription_id parameter must be specified so the SDK can locate the workspace.
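For reference, a minimal complete config.json that Workspace.from_config() can load looks like the sketch below; the subscription ID and resource group values are placeholders.
{
    "workspace_name": "ml-workspace",
    "subscription_id": "<your-subscription-id>",
    "resource_group": "<your-resource-group>"
}
# Connecting from the local Python environment
from azureml.core import Workspace
ws = Workspace.from_config()  # reads config.json from the current directory or .azureml/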
2. You are developing a data science workspace that uses an Azure Machine Learning service. You need to select a compute target to deploy the workspace. What should you use?
- Apache Spark for HDInsight
- Azure Databricks
- Azure Data Lake Analytics
- Azure Container Instances (CORRECT)
Correct: Azure Container Instances (ACI) is designed for quickly and economically running containers in prototyping or development scenarios. The ideal use case is lightweight, CPU-based workloads that require less than 48 GB of RAM. You can run containers without managing the underlying infrastructure, which makes ACI particularly suitable for preliminary tests, development, and temporary batch processing runs.
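As an illustration, here is a minimal sketch of an ACI deployment with the v1 SDK; the model and inference_config objects are assumed to be defined elsewhere.
from azureml.core.webservice import AciWebservice
from azureml.core.model import Model

# ACI deployment configuration: 1 CPU core and 1 GB of RAM
deployment_config = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1)
# model and inference_config are assumed to be defined previously
service = Model.deploy(ws, 'dev-test-service', [model], inference_config, deployment_config)
service.wait_for_deployment(show_output=True)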
3. A coworker registers a datastore in a Machine Learning services workspace by using the following code:
Datastore.register_azure_blob_container(workspace=ws,
                                        datastore_name='demo_datastore',
                                        container_name='demo_datacontainer',
                                        account_name='demo_account',
                                        account_key='0A0A0A-0A00A0A-0A0A0A0A0A0',
                                        create_if_not_exists=True)
You need to write code to access the datastore from a notebook. How should you complete the code segment?
- import azureml.core
- from azureml.core import Workspace, Datastore
- ws = Workspace.from_config()
- datastore = <add answer here>.get(<add answer here>, '<add answer here>')
- Run, experiment, demo_datastore
- Experiment, run, demo_account
- Run, ws, demo_datastore
- Datastore, ws, demo_datastore (CORRECT)
Correct: The Datastore.get method returns the named datastore registered in the given workspace. For this question, the datastore name is demo_datastore:
datastore = Datastore.get(ws, datastore_name='demo_datastore')
4. A set of CSV files contains sales records. All the CSV files have the same data schema.
Each CSV file contains the sales record for a particular month and has the filename sales.csv. Each file is stored in a folder that indicates the month and year when the data was recorded. The folders are in an Azure blob container for which a datastore has been defined in an Azure Machine Learning workspace. The folders are organized in a parent folder named sales to create the following hierarchical structure:
/sales
/01-2019
/sales.csv
/02-2019
/sales.csv
/03-2019
/sales.csv
…
At the end of each month, a new folder with that month’s sales file is added to the sales folder. You plan to use the sales data to train a machine learning model based on the following requirements:
– You must define a dataset that loads all of the sales data to date into a structure that can be easily converted to a dataframe.
– You must be able to create experiments that use only data that was created before a specific previous month, ignoring any data that was added after that month.
– You must register the minimum number of datasets possible.
You need to register the sales data as a dataset in Azure Machine Learning service workspace. What should you do?
- Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset each month, replacing the existing dataset and specifying a tag named month indicating the month and year it was registered. Use this dataset for all experiments.
- Create a tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file. Register the dataset with the name sales_dataset each month as a new version and with a tag named month indicating the month and year it was registered. Use this dataset for all experiments, identifying the version to be used based on the month tag as necessary. (CORRECT)
- Create a new tabular dataset that references the datastore and explicitly specifies each 'sales/mm-yyyy/sales.csv' file every month. Register the dataset with the name sales_dataset_MM-YYYY each month with appropriate MM and YYYY values for the month and year. Use the appropriate month-specific dataset for experiments.
- Create a tabular dataset that references the datastore and specifies the path 'sales/*/sales.csv', register the dataset with the name sales_dataset and a tag named month indicating the month and year it was registered, and use this dataset for all experiments.
Correct: Registering each month's data as a new version of a single sales_dataset, tagged with the month, meets all three requirements: only one dataset is registered, the latest version loads all sales data to date, and experiments can pin an earlier version (identified by the month tag) to ignore later data.
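A minimal sketch of this pattern with the v1 SDK; the datastore variable and the monthly paths are assumed from the question.
from azureml.core import Dataset

# Build a tabular dataset from the monthly sales files available so far
paths = [(datastore, 'sales/01-2019/sales.csv'), (datastore, 'sales/02-2019/sales.csv')]
sales_ds = Dataset.Tabular.from_delimited_files(path=paths)

# Each month, register the expanded dataset as a new version of the same name
sales_ds = sales_ds.register(workspace=ws, name='sales_dataset',
                             tags={'month': '02-2019'}, create_new_version=True)

df = sales_ds.to_pandas_dataframe()  # easily converted to a dataframe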
5. You create a deep learning model for image recognition on Azure Machine Learning service using GPU-based training. You must deploy the model to a context that allows for real-time GPU-based inferencing.
You need to configure compute resources for model inferencing. Which compute type should you use?
- Azure Kubernetes Service (CORRECT)
- Azure Container Instance
- Field Programmable Gate Array
- Machine Learning Compute
Correct: Azure Machine Learning supports deploying GPU-accelerated models as a web service. Azure Kubernetes Service (AKS) is the efficient option here: it offers scalable, managed Kubernetes clusters that can be configured with GPU resources for real-time model inferencing.
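For illustration, a sketch of an AKS deployment configuration that requests a GPU; the aks_target variable is a placeholder for an attached AKS cluster that is assumed to have GPU-enabled nodes.
from azureml.core.webservice import AksWebservice
from azureml.core.model import Model

# Request one GPU core per replica for real-time inferencing
deployment_config = AksWebservice.deploy_configuration(cpu_cores=1, memory_gb=4, gpu_cores=1)
service = Model.deploy(ws, 'image-recognition-service', [model],
                       inference_config, deployment_config,
                       deployment_target=aks_target)
service.wait_for_deployment(show_output=True)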
6. You use Azure Machine Learning designer to create a real-time service endpoint. You have a single Azure Machine Learning service compute resource.
You train the model and prepare the real-time pipeline for deployment.
You need to publish the inference pipeline as a web service. Which compute type should you use?
- HDInsight
- Azure Kubernetes Services (CORRECT)
- A new Machine Learning Compute resource
- Azure Databricks
- The existing Machine Learning Compute resource
Correct: The designer requires an Azure Kubernetes Service (AKS) cluster to publish a real-time inference pipeline as a web service.
7. You deploy a model as an Azure Machine Learning real-time web service using the following code.
# ws, model, inference_config, and deployment_config defined previously
service = Model.deploy(ws, 'classification-service', [model], inference_config, deployment_config)
service.wait_for_deployment(True)
The deployment fails.
You need to troubleshoot the deployment failure by determining the actions that were performed during deployment and identifying the specific action that failed.
Which code segment should you run?
- service.update_deployment_state()
- service.serialize()
- service.get_logs() (CORRECT)
- service.state
Correct: The get_logs() method returns detailed Docker engine log messages from the service object, which you can use to determine which deployment action failed.
The log is available for ACI, AKS, and local deployments.
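A quick usage sketch:
# Print the deployment log to identify the failing action
print(service.get_logs())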
8. You register a model that you plan to use in a batch inference pipeline.
The batch inference pipeline must use a ParallelRunStep step to process files in a file dataset, and each call to the inferencing function must process six input files.
You need to configure the pipeline. Which configuration setting should you specify in the ParallelRunConfig object for the ParallelRunStep step?
- error_threshold="6"
- node_count="6"
- process_count_per_node="6"
- mini_batch_size="6" (CORRECT)
Correct: For a FileDataset input, this field indicates how many files can be processed by the user script in one run() call. For a TabularDataset input, this field indicates the estimated size of data that a user script can process in one run() call.
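A minimal sketch of the configuration; the source directory, entry script, environment, and compute target names are assumptions.
from azureml.pipeline.steps import ParallelRunConfig

parallel_run_config = ParallelRunConfig(
    source_directory='scripts',        # assumed folder holding the entry script
    entry_script='batch_scoring.py',   # assumed script name
    mini_batch_size='6',               # six files per run() call for a FileDataset
    error_threshold=10,
    output_action='append_row',
    environment=batch_env,             # assumed Environment object
    compute_target=compute_target,
    node_count=2)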
9. Yes or No?
You train a classification model by using a logistic regression algorithm. You must be able to explain the model’s predictions by calculating the importance of each feature, both as an overall global relative importance value and as a measure of local importance for a specific set of predictions.
You need to create an explainer that you can use to retrieve the required global and local feature importance values.
Solution: Create a TabularExplainer. Does the solution meet the goal?
- Yes (CORRECT)
- No
Correct: The TabularExplainer supports both global and local feature importance explanations.
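A brief sketch using the azureml-interpret package; model, X_train, and X_test are assumed to be defined elsewhere.
from interpret.ext.blackbox import TabularExplainer

# model and training data are assumed to be defined previously
explainer = TabularExplainer(model, X_train)

# Overall (global) relative feature importance
global_explanation = explainer.explain_global(X_test)
print(global_explanation.get_feature_importance_dict())

# Local importance for a specific set of predictions
local_explanation = explainer.explain_local(X_test[0:5])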
10. You deploy a real-time inference service for a trained model.
The deployed model supports a business-critical application, and it is important to be able to monitor the data submitted to the web service and the predictions the data generates.
You need to implement a monitoring solution for the deployed model using minimal administrative effort. What should you do?
- Enable Azure Application Insights for the service endpoint and view logged data in the Azure portal. (CORRECT)
- View the log files generated by the experiment used to train the model.
- Create an ML Flow tracking URI that references the endpoint, and view the data logged by ML Flow.
- View the explanations for the registered model in Azure ML studio.
Correct: Enabling Azure Application Insights for the endpoint logs the submitted data and the resulting predictions with minimal administrative effort. You can enable it from Azure Machine Learning studio or through the SDK.
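A one-line sketch of enabling it through the v1 SDK on an existing service object:
# Turn on Application Insights telemetry for the deployed web service
service.update(enable_app_insights=True)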
11. You are a lead data scientist for a project that tracks the health and migration of birds. You create a multi-class image classification deep learning model that uses a set of labeled bird photographs collected by experts.
You have 100,000 photographs of birds. All photographs use the JPG format and are stored in an Azure blob container in an Azure subscription. You need to access the bird photograph files in the Azure blob container from the Azure Machine Learning service workspace that will be used for deep learning model training.
You must minimize data movement. What should you do?
- Create an Azure Cosmos DB database and attach the Azure Blob storage containing the bird photographs to the database.
- Create an Azure Data Lake store and move the bird photographs to the store.
- Register the Azure blob storage containing the bird photographs as a datastore in Azure Machine Learning service. (CORRECT)
- Create and register a dataset by using TabularDataset class that references the Azure blob storage containing bird photographs.
- Copy the bird photographs to the blob datastore that was created with your Azure Machine Learning service workspace.
Correct: Registering the existing blob container as a datastore gives the workspace access to the photographs without moving any data. When a workspace is created, an Azure blob container and an Azure file share are automatically registered to it, and additional containers, such as this one, can be registered as needed.
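A sketch of referencing the photographs once the container is registered; the datastore name and the path pattern are hypothetical.
from azureml.core import Dataset, Datastore

# The registered datastore that points at the existing blob container
bird_datastore = Datastore.get(ws, 'bird_photos_datastore')  # hypothetical name

# A file dataset over the JPG files; no data is copied or moved
bird_ds = Dataset.File.from_files(path=(bird_datastore, 'photos/**/*.jpg'))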
12. An organization creates and deploys a multi-class image classification deep learning model that uses a set of labeled photographs.
The software engineering team reports there is a heavy inferencing load for the prediction web services during the summer. The production web service for the model fails to meet demand despite having a fully-utilized compute cluster where the web service is deployed.
You need to improve performance of the image classification web service with minimal downtime and minimal administrative effort. What should you advise the IT Operations team to do?
- Increase the VM size of nodes in the compute cluster where the web service is deployed.
- Increase the node count of the compute cluster where the web service is deployed.
- Increase the minimum node count of the compute cluster where the web service is deployed. (CORRECT)
- Create a new compute cluster by using larger VM sizes for the nodes, redeploy the web service to that cluster, and update the DNS registration for the service endpoint to point to the new cluster.
Correct: Increasing the minimum node count keeps more nodes provisioned and ready to serve requests, adding capacity to the existing cluster without redeploying the web service.
13. You use the Azure Machine Learning Python SDK to define a pipeline that consists of multiple steps.
When you run the pipeline, you observe that some steps do not run. The cached output from a previous run is used instead. You need to ensure that every step in the pipeline is run, even if the parameters and contents of the source directory have not changed since the previous run.
What are two possible ways to achieve this goal? Each correct answer presents a complete solution.
- Restart the compute cluster where the pipeline experiment is configured to run.
- Use a PipelineData object that references a datastore other than the default datastore.
- Set the allow_reuse property of each step in the pipeline to False. (CORRECT)
- Set the outputs property of each step in the pipeline to True.
- Set the regenerate_outputs property of the pipeline to True. (CORRECT)
Always remember when working with pipeline steps, input/output data, and step reuse: if data used in a step is stored in a datastore and allow_reuse is set to True, a change in that data will not be detected. If the data is instead uploaded as part of the snapshot (under the step's source_directory), which is not recommended, the hash changes and triggers a rerun.
If regenerate_outputs is set to True, a new submission always forces a fresh run, regenerating all step outputs and disallowing data reuse for any step of that run. However, later runs may reuse this run's outputs once it is complete.
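A brief sketch of both options, assuming a PythonScriptStep-based pipeline and the script and directory names shown here:
from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Option 1: disable output caching on each step
step = PythonScriptStep(name='train', script_name='train.py',   # assumed script
                        source_directory='scripts',             # assumed folder
                        compute_target=compute_target,
                        allow_reuse=False)

# Option 2: force regeneration of all step outputs for this submission
pipeline = Pipeline(workspace=ws, steps=[step])
run = Experiment(ws, 'pipeline-experiment').submit(pipeline, regenerate_outputs=True)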
14. You train and register a model in your Azure Machine Learning workspace.
You must publish a pipeline that enables client applications to use the model for batch inferencing.
You must use a pipeline with a single ParallelRunStep step that runs a Python inferencing script to get predictions from the input data.
You need to create the inferencing script for the ParallelRunStep pipeline step.
Which two functions should you include? Each correct answer presents part of the solution.
- batch()
- score(mini_batch)
- main()
- run(mini_batch) (CORRECT)
- init() (CORRECT)
Correct: run(mini_batch) is called for each mini-batch of input data to be processed.
Correct: init() is called once when each worker process starts, before any mini-batches are processed; it is typically used to load the model.
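A skeleton of such an entry script; the registered model name and the scoring logic are placeholders.
import os
import joblib
from azureml.core.model import Model

def init():
    # Called once per worker process; load the registered model here
    global model
    model_path = Model.get_model_path('my_model')  # hypothetical model name
    model = joblib.load(model_path)

def run(mini_batch):
    # Called per mini-batch; for a FileDataset, mini_batch is a list of file paths
    results = []
    for file_path in mini_batch:
        # Score the file and record a result row (placeholder logic)
        results.append(f"{os.path.basename(file_path)}: scored")
    return results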
15. An organization uses the Azure Machine Learning service and wants to expand its use of machine learning. You have the following compute environments. The organization does not want to create another compute environment.
Environment name | Compute type
--- | ---
nb_server | Compute instance
aks_cluster | Azure Kubernetes Service
mlc_cluster | Machine Learning compute
You need to determine which compute environment to use for the following scenarios:
1. Run an Azure Machine Learning Designer training pipeline.
2. Deploy a web service from the Azure Machine Learning Designer.
Which compute types should you use?
- 1 nb_server, 2 mlc_cluster
- 1 mlc_cluster, 2 nb_server
- 1 nb_server, 2 aks_cluster
- 1 mlc_cluster, 2 aks_cluster (CORRECT)
Correct: Designer training pipelines run on a Machine Learning compute cluster (mlc_cluster), and real-time web services deployed from the designer run on Azure Kubernetes Service (aks_cluster).
16. You train a classification model by using a logistic regression algorithm. You must be able to explain the model's predictions by calculating the importance of each feature, both as an overall global relative importance value and as a measure of local importance for a specific set of predictions.
You need to create an explainer that you can use to retrieve the required global and local feature importance values.
Solution: Create a PFIExplainer. Does the solution meet the goal?
- Yes
- No (CORRECT)
Correct: The PFIExplainer provides global feature importance but does not support local feature importance explanations.
17. You create an Azure Machine Learning compute resource to train models. The compute resource is configured as follows:
– Minimum nodes: 2
– Maximum nodes: 4
You must decrease the minimum number of nodes and increase the maximum number of nodes to the following values:
– Minimum nodes: 0
– Maximum nodes: 8
You need to reconfigure the compute resource. What are three possible ways to achieve this goal? Each correct answer presents a complete solution.
- Run the refresh_state() method of the BatchCompute class in the Python SDK.
- Use the Azure Machine Learning Designer.
- Use the Azure portal. (CORRECT)
- Run the update method of the AmlCompute class in the Python SDK. (CORRECT)
- Use the Azure Machine Learning Studio. (CORRECT)
Correct: You can use the UI for your cluster in the Azure portal to change the minimum and maximum node counts.
Correct: The update method of the AmlCompute class updates the ScaleSettings for the AmlCompute target. Its parameters are min_nodes, max_nodes, and idle_seconds_before_scaledown.
Correct: Azure Machine Learning studio lets you manage compute resources as well as other assets.
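A sketch of the SDK route; the cluster name 'train-cluster' is a placeholder.
from azureml.core.compute import AmlCompute

compute_target = AmlCompute(ws, 'train-cluster')  # hypothetical cluster name
compute_target.update(min_nodes=0, max_nodes=8)
compute_target.wait_for_completion(show_output=True)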
18. You create a new Azure subscription. No resources are provisioned in the subscription. You need to create an Azure Machine Learning workspace.
What are three possible ways to achieve this goal? Each correct answer presents a complete solution.
- Navigate to Azure Machine Learning studio and create a workspace.
- Run Python code that uses the Azure ML SDK library and calls the Workspace.get method with name, subscription_id, and resource_group parameters.
- Run Python code that uses the Azure ML SDK library and calls the Workspace.create method with name, subscription_id, resource_group, and location parameters. (CORRECT)
- Use the Azure Command Line Interface (CLI) with the Azure Machine Learning extension to call the az group create function with --name and --location parameters, and then the az ml workspace create function, specifying -w and -g parameters for the workspace name and resource group. (CORRECT)
- Use an Azure Resource Management template that includes a Microsoft.MachineLearningServices/ workspaces resource and its dependencies. (CORRECT)
Correct: The Workspace.create method provisions a new workspace in the specified subscription, resource group, and location.
Correct: The CLI with the Azure Machine Learning extension can create the resource group and then the workspace within it.
Correct: An Azure Resource Manager template that includes a Microsoft.MachineLearningServices/workspaces resource deploys a workspace along with its dependencies.
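A sketch of the SDK route; the subscription ID, resource group, and region values are placeholders.
from azureml.core import Workspace

ws = Workspace.create(name='ml-workspace',
                      subscription_id='<your-subscription-id>',  # placeholder
                      resource_group='ml-resources',             # placeholder
                      create_resource_group=True,
                      location='eastus')
ws.write_config()  # saves config.json for later Workspace.from_config() calls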
CONCLUSION – EXAM PREPARATION COURSE 3
This module wraps up with a thorough review of Course 3 in the Microsoft Azure Data Scientist Associate specialization. By revising the essential concepts, techniques, and tools, you refresh your practical knowledge and skills in data science within the Azure environment. This comprehensive review prepares you to address the challenges of real-world data science with the full range of Azure services and solutions that Microsoft offers.