INTRODUCTION – Working With Data In Azure Databricks
Azure Databricks simplifies everyday data handling such as reading, writing, and querying data. In this module you will work with large datasets and raw formats from different sources. You will learn about the column-level transformations that can be performed with the DataFrame Column class in Azure Databricks, such as sorting, filtering, and aggregating, and you will explore advanced DataFrame functions for manipulating data, performing aggregations, and handling dates and times.
Learning Objectives:
Understand how Azure Databricks handles everyday data tasks such as reading, writing, and querying data.
Use the DataFrame Column Class in Azure Databricks for column-level transformations such as sorting, filtering, and aggregation.
Apply advanced DataFrame functions for manipulating data, performing aggregations, and handling dates and times within Azure Databricks.
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. How do you list files in DBFS within a notebook?
%fs dir /my-file-path
ls /my-file-path
%fs ls /my-file-path (CORRECT)
Correct: Prefixing ls with the %fs filesystem magic makes the command run against the Databricks File System (DBFS), so it lists the files at the given DBFS path.
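For example, a minimal sketch, assuming a Databricks notebook and using /my-file-path as the placeholder path from the question (the magic command and the Python call go in separate cells):

%fs ls /my-file-path

# Python equivalent via the Databricks utilities; dbutils is provided automatically in notebooks
display(dbutils.fs.ls("/my-file-path"))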
2. How do you infer the data types and column names when you read a JSON file?
Correct: Spark can automatically detect the structure of the file, inferring the column names and data types from the data itself.
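As a small sketch, assuming a JSON file at a hypothetical path, Spark detects the column names and data types while reading:

# Spark samples the JSON records and infers the column names and data types
eventsDF = spark.read.json("/mnt/data/events.json")   # hypothetical path
eventsDF.printSchema()                                # shows the inferred schema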
3. Which of the following SparkSession functions returns a DataFrameReader?
createDataFrame(..)
emptyDataFrame(..)
read(..) (CORRECT)
readStream(..)
Correct: A DataFrameReader is returned from the method SparkSession.read().
4. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple of thousand characters of a file?
%fs dir /mnt/training/wikipedia/pageviews/
%fs ls /mnt/training/wikipedia/pageviews/
%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
Correct: The %fs head command displays the first portion of a file, making it easy to view the first couple of thousand characters.
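A short sketch of both ways to peek at the start of a file, assuming the pageviews path from the question is available in your workspace; the magic command and the Python call go in separate cells:

%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv

# Python equivalent; dbutils.fs.head returns the first bytes of the file as a string
print(dbutils.fs.head("/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv", 1000))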
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. Which of the following SparkSession functions returns a DataFrameReader?
createDataFrame(..)
emptyDataFrame(..)
read(..) (CORRECT)
readStream(..)
Correct: The DataFrameReader is returned by the function SparkSession.read().
2. When using a notebook and a Spark session, we can read a CSV file.
Which of the following can be used to view the first couple of thousand characters of a file?
%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
%fs ls /mnt/training/wikipedia/pageviews/
%fs dir /mnt/training/wikipedia/pageviews/
Correct: The first portion of a file can be viewed using the %fs head command.
3. Which DataFrame method do you use to create a temporary view?
createOrReplaceTempView() (CORRECT)
createTempViewDF()
createTempView()
Correct: Temporary views are created from DataFrames using the createOrReplaceTempView() method.
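A minimal sketch, assuming a DataFrame named myDataFrameDF already exists and using a hypothetical view name:

# Register the DataFrame as a temporary view scoped to this SparkSession
myDataFrameDF.createOrReplaceTempView("my_temp_view")

# The view can now be queried with SQL
spark.sql("SELECT * FROM my_temp_view LIMIT 10").show()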
4. How do you define a DataFrame object?
Use the DF.create() syntax
Introduce a variable name and equate it to something like myDataFrameDF = (CORRECT)
Use the createDataFrame() function
Correct: A DataFrame object is defined by assigning it to a variable, for example myDataFrameDF = ... . The DataFrame itself typically comes from reading an outside source such as a file or database with SparkSession.read(), or from SparkSession.createDataFrame().
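As a hedged sketch of both approaches (the variable names and path are illustrative):

# Define a DataFrame object by assigning the result of a read to a variable
myDataFrameDF = spark.read.parquet("/mnt/data/some_table.parquet")

# Or build one directly from local data with SparkSession.createDataFrame()
peopleDF = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])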
5. How do you cache data into the memory of the local executor for instant access?
.inMemory().save()
.cache() (CORRECT)
.save().inMemory()
Correct: The cache() method is an alias for persist(StorageLevel.MEMORY_AND_DISK). Calling cache() moves data into the memory of the local executor, which speeds up access for later operations.
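A short sketch, assuming a DataFrame named pageviewsDF has already been defined:

# Mark the DataFrame for caching in executor memory (spilling to disk if needed)
pageviewsDF.cache()
pageviewsDF.count()       # an action is required to actually populate the cache

# Later operations on pageviewsDF now read from the cache; release it when done
pageviewsDF.unpersist()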
6. What is the Python syntax for defining a DataFrame in Spark from an existing Parquet file in DBFS?
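A minimal sketch of the typical syntax, using a hypothetical DBFS path:

# Read an existing Parquet file from DBFS into a DataFrame
parquetDF = spark.read.parquet("/mnt/training/sample_data.parquet")
display(parquetDF)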
1. How do you list files in DBFS within a notebook?
ls /my-file-path
%fs dir /my-file-path
%fs ls /my-file-path (CORRECT)
Correct: Putting the filesystem magic (%fs) before the ls command indicates that the command should run against the Databricks File System.
2. How do you infer the data types and column names when you read a JSON file?
Correct: Having Spark inspect the data as it reads the file is the correct way to infer the schema, that is, the column names and data types.
3. Which of the following SparkSession functions returns a DataFrameReader?
readStream(..)
createDataFrame(..)
emptyDataFrame(..)
read(..) (CORRECT)
Correct: SparkSession.read() returns a DataFrameReader, which is used to read data into a DataFrame from different sources such as CSV, JSON, and Parquet.
4. When using a notebook and a Spark session, we can read a CSV file. Which of the following can be used to view the first couple of thousand characters of a file?
%fs ls /mnt/training/wikipedia/pageviews/
%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv (CORRECT)
%fs dir /mnt/training/wikipedia/pageviews/
Correct: Using %fs head, we can see the beginning of a file (by default the first 65,536 bytes). It gives a quick and easy way to peek at the contents of a file.
5. You have created an Azure Databricks cluster, and you have access to a source file.
You need to determine the structure of the file. Which of the following commands will assist with determining what the column and data types are?
.option("inferSchema", "true") (CORRECT)
.option("header", "true")
.option("inferSchema", "false")
.option("header", "false")
Correct: With .option("inferSchema", "true"), Spark reads the file and infers the schema of the columns. This is normally used so that Spark detects data types from the contents of CSV and other formats that do not carry type information.
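For example, a hedged sketch of reading a CSV file with a header row and schema inference, using a hypothetical path:

csvDF = (spark.read
    .option("header", "true")        # use the first line as column names
    .option("inferSchema", "true")   # sample the data to determine column types
    .csv("/mnt/data/sales.csv"))     # hypothetical path
csvDF.printSchema()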
6. In an Azure Databricks workspace you run the following command:
%fs head /mnt/training/wikipedia/pageviews/pageviews_by_second.tsv
The partial output from this command is as follows:
[Truncated to first 65536 bytes]
"timestamp" "site" "requests"
"2015-03-16T00:09:55" "mobile" 1595
"2015-03-16T00:10:39" "mobile" 1544
"2015-03-16T00:19:39" "desktop" 2460
"2015-03-16T00:38:11" "desktop" 2237
"2015-03-16T00:42:40" "mobile" 1656
"2015-03-16T00:52:24" "desktop" 2452
Which of the following pieces of information can be inferred from the command and the output?
Select all that apply.
The file is a comma-separated or CSV file
All columns are strings
The file has no header
Two columns are strings, and one column is a number (CORRECT)
The columns are tab-separated (CORRECT)
The file has a header (CORRECT)
Correct: The string values are enclosed in double quotes while the numeric values are not, so two columns are strings and one column is a number.
Correct: The file is tab-separated: the columns are separated by tabs, with strings enclosed in double quotes and numbers left unquoted.
Correct: The first line of the output displays the column names.
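Putting those inferences together, a minimal sketch of how this tab-separated file could be read, with the option values following from the output above:

pageviewsDF = (spark.read
    .option("header", "true")        # the first line holds the column names
    .option("sep", "\t")             # the columns are tab-separated
    .option("inferSchema", "true")   # detects that "requests" is numeric
    .csv("/mnt/training/wikipedia/pageviews/pageviews_by_second.tsv"))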
7. In an Azure Databricks workspace, you wish to create a temporary view that will be accessible to multiple notebooks. Which of the following commands will provide this feature?
createOrReplaceTempView(set_scope "Global")
createOrReplaceTempView(..)
createOrReplaceGlobalTempView(..) (CORRECT)
Correct: createOrReplaceGlobalTempView(..) registers the view in the global temporary database, so it can be accessed from multiple notebooks attached to the same cluster.
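A minimal sketch, assuming a DataFrame named myDataFrameDF and a hypothetical view name; global temporary views are registered in the global_temp database:

# Register a global temporary view visible to other notebooks attached to the same cluster
myDataFrameDF.createOrReplaceGlobalTempView("shared_view")

# Other notebooks query it through the global_temp database
spark.sql("SELECT * FROM global_temp.shared_view").show()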
8. Which of the following is true in respect of Parquet Files?
Select all that apply.
Designed for performance on small data sets
Is a Row-Oriented data store
Is a splittable "file format" (CORRECT)
Efficient data compression (CORRECT)
Is a Column-Oriented data store (CORRECT)
Open Source (CORRECT)
Correct: Parquet files are splittable.
Correct: Parquet files provide efficient data compression.
Correct: Parquet files are Column-Oriented.
Correct: Parquet is a free, open-source file format.
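As an illustration, a hedged sketch of writing and reading Parquet; someDF, the path, and the column names are hypothetical. Because the format is column-oriented, selecting a subset of columns only scans those columns:

# Write an existing DataFrame to Parquet with snappy compression
someDF.write.mode("overwrite").option("compression", "snappy").parquet("/mnt/data/example_parquet")

# Read it back; column pruning means only the selected columns are read from disk
spark.read.parquet("/mnt/data/example_parquet").select("site", "requests").show(5)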
CONCLUSION – Working With Data In Azure Databricks
In sum, Azure Databricks simplifies reading, writing, and querying data at scale. It provides full support for working with different data sources and raw data types, as well as for performing column-level transformations using the DataFrame Column class. In addition, advanced DataFrame functions can be employed for a variety of data manipulation, aggregation, and date/time tasks. Understanding these features equips you to build and maintain data pipelines for sophisticated scenarios in Azure Databricks.