INTRODUCTION – Work With Data And Compute In Azure Machine Learning
Data is the lifeblood of machine learning: the quality of a model's inputs largely determines its accuracy. In this module, you will learn how to work with datastores and datasets in Azure Machine Learning so that you can build scalable, cloud-based model training solutions.
You will also learn how to use cloud compute resources in Azure Machine Learning to run training experiments at scale. By the end of this module, you will have the skills to combine Azure Machine Learning data management with cloud compute to construct powerful, scalable machine learning solutions.
Learning Objectives:
- Creation and management of datastores.
- Creation and management of datasets.
- Creation and management of environments.
- Creation and management of compute targets.
PRACTICE QUIZ: KNOWLEDGE CHECK
1. When planning for datastores, which data file format should generally perform best?
- XLS
- XML
- CSV
- Parquet (CORRECT)
Correct: CSV is a common and perfectly workable format for data files, but Parquet is generally the best performer of the options listed. Parquet is a columnar storage format that provides efficient data access and processing, especially with larger datasets. Its advanced compression and encoding techniques reduce storage requirements and improve read/write speeds compared with row-based formats such as CSV.
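For illustration, here is a minimal sketch of loading Parquet files as a tabular dataset with the SDK's Dataset.Tabular.from_parquet_files method (the workspace object ws and the 'data/files/*.parquet' path are assumptions for the example):

    from azureml.core import Dataset, Workspace

    # Connect to the workspace and get its default datastore
    ws = Workspace.from_config()
    blob_ds = ws.get_default_datastore()

    # Create a tabular dataset from Parquet files in the datastore
    parquet_ds = Dataset.Tabular.from_parquet_files(path=(blob_ds, 'data/files/*.parquet'))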
2. True or False?
You cannot access datastores by name.
- True
- False (CORRECT)
Correct: You can access any datastore by its name. You may also find it useful to change the default from the built-in workspaceblobstore datastore to another datastore.
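For example, a short sketch of retrieving a datastore by name (the name 'blob_data' is a placeholder, and ws is assumed to be a Workspace object):

    from azureml.core import Datastore

    # Get a specific datastore by name, or the workspace default
    blob_store = Datastore.get(ws, datastore_name='blob_data')
    default_store = ws.get_default_datastore()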
3. If you want to change the default datastore, what method should you use?
- change_default_datastore()
- new_default_datastore()
- set_default_datastore() (CORRECT)
- modify_default_datastore()
Correct: set_default_datastore() is the method used to change the default datastore.
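A one-line sketch, assuming a workspace object ws and a registered datastore named 'blob_data':

    # Make 'blob_data' the workspace's default datastore
    ws.set_default_datastore('blob_data')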
4. What types of datasets can be created in Azure Machine Learning? Select all that apply.
- Notebook
- Media
- File (CORRECT)
- Tabular (CORRECT)
Correct: A file dataset contains a collection of file paths that can be accessed and read as though they were actual files in a filesystem.
Correct: A tabular dataset is read from the underlying data as a table.
5. To create a tabular dataset using the SDK, which method of the Dataset.Tabular class should you use?
- from_tabular_dataset
- from_tabular_files
- from_delimited_files (CORRECT)
- from_files_tabular
Correct: Use the from_delimited_files method of the Dataset.Tabular class to create a tabular dataset from delimited files such as CSVs.
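As a quick illustration, a sketch of creating a tabular dataset from CSV files and reading it into pandas (the path is a placeholder, and blob_ds is assumed to be a datastore object such as the workspace default):

    from azureml.core import Dataset

    # Create a tabular dataset from delimited (CSV) files in the datastore
    csv_ds = Dataset.Tabular.from_delimited_files(path=(blob_ds, 'data/files/*.csv'))

    # Read the tabular dataset into a pandas DataFrame
    df = csv_ds.to_pandas_dataframe()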
PRACTICE QUIZ: KNOWLEDGE CHECK
1. Which package managers are usually used in the installation of a Python virtual environment? Select all that apply.
- pandas
- numpy
- pip (CORRECT)
- conda (CORRECT)
Correct: pip is one of the most common package managers used to populate a Python virtual environment.
Correct: conda is also one of the most common package managers for Python virtual environments.
2. You saved a specification file named conda.yml and you want to use it to create an Azure ML environment.
Which SDK command does the job?
- from azureml.core import Environment
- env = Environment.from_conda_specification(name='training_environment',
- file_path='conda.yml')
- from azureml.core import Environment
- env = Environment.from_conda_specification(name='training_environment',
- file_path='/conda.yml')
- from azureml.core import Environment
- env = Environment.from_conda_specification(name='training_environment',
- file_path='*conda.yml')
- from azureml.core import Environment (CORRECT)
- env = Environment.from_conda_specification(name='training_environment',
- file_path='./conda.yml')
Correct: These are the commands the job calls for; the relative path ./conda.yml points to the specification file you saved.
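Once created from the specification file, the environment can be registered for reuse; a brief sketch assuming a workspace object ws and the env object from the command above:

    # Register the environment in the workspace so it can be reused
    env.register(workspace=ws)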
3. You want to create an Azure ML environment by specifying the packages you need.
Which SDK commands does the job?
- from azureml.core import Environment (CORRECT)
- from azureml.core.conda_dependencies import CondaDependencies
- env = Environment('training_environment')
- deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
- pip_packages=['azureml-defaults'])
- env.python.conda_dependencies = deps
- from azureml.core import Environment
- from azureml.core.conda_dependencies import CondaDependencies
- env = Environment('training_environment')
- deps = CondaDependencies.deploy(conda_packages=['scikit-learn','pandas','numpy'],
- pip_packages=['azureml-defaults'])
- env.python.conda_dependencies = deps
- from azureml.core import Environment
- env = Environment('training_environment')
- deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
- pip_packages=['azureml-defaults'])
- env.python.conda_dependencies = deps
- from azureml.core import Environment
- from azureml.core.conda_dependencies import Conda
- env = Environment('training_environment')
- deps = CondaDependencies.create(conda_packages=['scikit-learn','pandas','numpy'],
- pip_packages=['azureml-defaults'])
- env.python.conda_dependencies = deps
Correct: These commands do the job. You can create an environment by specifying the required Conda and pip packages in a CondaDependencies object built with CondaDependencies.create().
4. If you are running a notebook experiment on an Azure Machine Learning compute instance, what type of compute are you using?
- Attached compute
- Compute clusters
- Local compute (CORRECT)
Correct: Local compute runs the experiment on the same compute target on which the experiment code itself is running. That could be your physical workstation or a virtual machine, such as an Azure Machine Learning compute instance, on which you're running a notebook.
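As a sketch, omitting the compute_target argument from a ScriptRunConfig runs the experiment on local compute (the directory, script, and experiment names are placeholders; ws and env are assumed to be a workspace and an environment):

    from azureml.core import Experiment, ScriptRunConfig

    # No compute_target specified, so the script runs on local compute
    script_config = ScriptRunConfig(source_directory='my_dir',
                                    script='script.py',
                                    environment=env)
    run = Experiment(workspace=ws, name='local-experiment').submit(script_config)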
5. If you have an Azure Databricks cluster that you want to use for experiment running and model training, which type of compute target is this?
- Managed
- Unmanaged (CORRECT)
Correct: An unmanaged compute target is a resource that is configured and managed outside of the Azure Machine Learning workspace, such as an Azure virtual machine or an Azure Databricks cluster.
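A minimal sketch of attaching an existing Azure Databricks cluster as an unmanaged compute target (the resource group, workspace name, access token, and target name are placeholders):

    from azureml.core.compute import ComputeTarget, DatabricksCompute

    # Configure the attachment to an existing Azure Databricks workspace
    attach_config = DatabricksCompute.attach_configuration(resource_group='my-rg',
                                                           workspace_name='my-databricks-ws',
                                                           access_token='<access-token>')

    # Attach it to the Azure ML workspace as a compute target
    databricks_compute = ComputeTarget.attach(ws, 'db-cluster', attach_config)
    databricks_compute.wait_for_completion(show_output=True)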
QUIZ: START PREP
1. Which Python commands should you use to create and register a tabular dataset using the from_delimited_files method of the Dataset.Tabular class?
- from azureml.core import Dataset
- blob_ds = ws.get_default_datastore()
- csv_paths = [(blob_ds, 'data/files/current_data.csv'),
- (blob_ds, 'data/files/archive/*.csv')]
- tab_ds = Dataset.Tabular.from_delimited_files()
- tab_ds = tab_ds.register(workspace=ws, name='csv_table')
- from azureml.core import Dataset (CORRECT)
- blob_ds = ws.get_default_datastore()
- csv_paths = [(blob_ds, 'data/files/current_data.csv'),
- (blob_ds, 'data/files/archive/*.csv')]
- tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
- tab_ds = tab_ds.register(workspace=ws, name='csv_table')
- from azureml.core import Dataset
- blob_ds = ws.get_default_datastore()
- csv_paths = [(blob_ds, 'data/files/current_data.csv'),
- (blob_ds, 'data/files/archive/csv')]
- tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
- tab_ds = tab_ds.register(workspace=ws, name='csv_table')
- from azureml.core import Dataset
- blob_ds = ws.change_default_datastore()
- csv_paths = [(blob_ds, 'data/files/current_data.csv'),
- (blob_ds, 'data/files/archive/*.csv')]
- tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
- tab_ds = tab_ds.register(workspace=ws, name='csv_table')
Correct: These commands get the default datastore, define the file paths, create the tabular dataset with from_delimited_files, and register it as csv_table.
2. You’re creating a file dataset using the from_files method of the Dataset.File class.
You also want to register it in the workspace with the name img_files.
Which SDK commands can you use?
- from azureml.core import Dataset (CORRECT)
- blob_ds = ws.get_default_datastore()
- file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
- file_ds = file_ds.register(workspace=ws, name='img_files')
- from azureml.core import Dataset
- blob_ds = ws.get_default_datastore()
- file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
- from azureml.core import Dataset
- file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images/*.jpg'))
- file_ds = file_ds.register(workspace=ws, name='img_files')
- from azureml.core import Dataset
- blob_ds = ws.get_default_datastore()
- file_ds = Dataset.File.from_files(path=(blob_ds, 'data/files/images'))
- file_ds = file_ds.register(workspace=ws, name='img_files')
Correct: This is the right and complete set of commands for the scenario; it creates the file dataset from the image files and registers it as img_files.
3. What methods can you use from the Dataset class to retrieve a dataset after registering it? Select all that apply.
- find_by_id
- find_by_name
- get_by_name (CORRECT)
- get_by_id (CORRECT)
Correct: get_by_name retrieves the dataset by its name.
Correct: get_by_id retrieves the dataset by its ID.
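For example (the dataset name is taken from the earlier questions, and ws is an assumed workspace object):

    from azureml.core import Dataset

    # Retrieve a registered dataset by name, then again by its ID
    img_ds = Dataset.get_by_name(workspace=ws, name='img_files')
    same_ds = Dataset.get_by_id(workspace=ws, id=img_ds.id)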
4. To retrieve a specific version of a data set, which SDK commands should you use?
- img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version(2))
- img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version='2')
- img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version_2)
- img_ds = Dataset.get_by_name(workspace=ws, name='img_files', version=2) (CORRECT)
Correct: This command is the right one for the request.
5. Which SDK commands can you use to view the registered environments in your workspace?
- from azureml.core import Environment (CORRECT)
- env_names = Environment.list(workspace=ws)
- for env_name in env_names:
- print('Name:',env_name)
- from azureml.core import Environment
- env_names = Environment.list(workspace=ws)
- for each env_name in env_names:
- print('Name:',env_name)
- from azureml.core import Environment
- env_names = Environment_list(workspace=ws)
- for env_name in env_names:
- print('Name:',env_name)
- from azureml.core import Environment
- env_names = Environment.list(workspace=ws)
- for env_name of env_names:
- print('Name:',env_name)
Correct: These commands will show the registered environments in your workspace.
6. You are defining a compute configuration for a managed compute target using the SDK.
Which of the below commands is correct?
- compute_config = AmlCompute_provisioning_configuration(vm_size='STANDARD_DS11_V2',
- min_nodes=0, max_nodes=4,
- vm_priority='dedicated')
- compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2', (CORRECT)
- min_nodes=0, max_nodes=4,
- vm_priority='dedicated')
- compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2',
- min_nodes=0, max_nodes=0,
- vm_priority='dedicated')
- compute_config = AmlCompute.provisioning.configuration(vm_size='STANDARD_DS11_V2',
- min_nodes=0, max_nodes=4,
- vm_priority='dedicated')
Correct: These commands are correctly formed for the task.
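To follow through on that configuration, a sketch of creating the managed cluster (the cluster name is a placeholder, and ws is an assumed workspace object):

    from azureml.core.compute import ComputeTarget, AmlCompute

    compute_config = AmlCompute.provisioning_configuration(vm_size='STANDARD_DS11_V2',
                                                           min_nodes=0, max_nodes=4,
                                                           vm_priority='dedicated')

    # Create the managed compute cluster and wait for provisioning to finish
    aml_cluster = ComputeTarget.create(ws, 'aml-cluster', compute_config)
    aml_cluster.wait_for_completion(show_output=True)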
7. You created a compute target and now you want to use it for an experiment. You want to specify the compute target using a ComputeTarget object.
Which of the SDK commands below can you use?
- compute_name = "aml-cluster"
- training_env = Environment.get(workspace=ws, name='training_environment')
- script_config = ScriptRunConfig(source_directory='my_dir',
- script='script.py',
- environment=training_env,
- compute_target=training_cluster)
- compute_name = "aml-cluster"
- training_cluster = ComputeTarget(workspace=ws)
- training_env = Environment.get(workspace=ws, name='training_environment')
- script_config = ScriptRunConfig(source_directory='my_dir',
- script='script.py',
- environment=training_env,
- compute_target=training_cluster)
- compute_name = "aml-cluster" (CORRECT)
- training_cluster = ComputeTarget(workspace=ws,
- name=compute_name)
- training_env = Environment.get(workspace=ws, name='training_environment')
- script_config = ScriptRunConfig(source_directory='my_dir',
- script='script.py',
- environment=training_env,
- compute_target=training_cluster)
Correct: These commands retrieve the named compute target as a ComputeTarget object and pass it to the ScriptRunConfig.
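With the ScriptRunConfig in place, the experiment can be submitted; a short sketch with a placeholder experiment name, assuming the script_config object from the correct option above:

    from azureml.core import Experiment

    # Submit the configured script run to the chosen compute target
    run = Experiment(workspace=ws, name='training-experiment').submit(config=script_config)
    run.wait_for_completion(show_output=True)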
8. Azure Machine Learning supports the creation of datastores for multiple kinds of Azure data source. Which of the following are supported? Select all that apply.
- Azure Database for PostgreSQL
- Azure Databricks (CORRECT)
- Azure Data Lake stores (CORRECT)
Correct: Azure Databricks (via its DBFS file system) is a supported data source.
Correct: Azure Data Lake stores are a supported data source.
9. True or false?
To add a datastore to your workspace, you can only register it using the graphical interface in Azure Machine Learning Studio?
- True
- False (CORRECT)
Correct: You can view and manage datastores in Azure Machine Learning Studio, or you can register them using the Azure Machine Learning SDK.
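For example, a hedged sketch of registering an Azure blob container as a datastore with the SDK (the datastore, container, and account names and the key are all placeholders):

    from azureml.core import Datastore

    # Register an Azure blob container as a datastore in the workspace
    blob_ds = Datastore.register_azure_blob_container(workspace=ws,
                                                      datastore_name='blob_data',
                                                      container_name='data-container',
                                                      account_name='<storage-account-name>',
                                                      account_key='<storage-account-key>')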
10. Environments are commonly created in Docker containers that are in turn hosted in compute targets. Which of the following are examples of compute targets?
Select all that apply.
- Azure blob storage
- Cloud clusters (CORRECT)
- Virtual machines (CORRECT)
Correct: Cloud-based clusters are one example of a compute target on which an environment can be hosted.
Correct: Virtual machines are another type of compute target that can host an environment.
11. True or false?
Local compute is generally a great choice during development and testing with low to moderate volumes of data.
- True (CORRECT)
- False
Correct: Local compute is an excellent choice during development and testing when working with low to moderate volumes of data.
CONCLUSION – Work With Data And Compute In Azure Machine Learning
In conclusion, machine learning requires data: it is the essential input for every model you develop. Throughout this module you learned how to work with datastores and datasets in Azure Machine Learning to create scalable, cloud-based training solutions, and you explored the use of cloud compute resources to run training experiments at scale. Thus equipped, you can integrate data management and cloud computing in Azure Machine Learning to build robust, scalable machine learning models.