INTRODUCTION – Processing Data in Azure Databricks
Azure Databricks offers several built-in SQL functions, but sometimes you might need to create your own functions – that is, User-Defined Functions or UDFs. In this course, you will discover how to register UDFs and invoke them, thereby extending the functionality of your SQL queries.
You will use Delta Lake to create, append, and upsert into Apache Spark tables. Delta Lake brings built-in reliability and optimization features that can greatly enhance your data management efforts.
Learning Objectives:
- Write UDFs
- Perform ETL using User-Defined Functions
- Know the features and use cases of Delta Lake
- Create, append, and upsert Delta tables
- Apply optimizations in Delta Lake
- Use Time Travel to compare different versions of a Delta table
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. This functionality is referred to as?
- Time Travel (CORRECT)
- Schema Evolution
- Schema Enforcement
- ACID Transactions
Correct: Delta Lake's snapshots of data allow developers to access earlier versions of the data and roll back to a point in time before a change was made. This can be used for audits, to roll back unwanted changes, or to reproduce an experiment.
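As a hedged illustration (the table path below is hypothetical), Time Travel can be exercised from Python by reading an earlier snapshot of a Delta table:

    # Minimal PySpark sketch of Delta Lake Time Travel.
    # Assumes an active SparkSession named `spark` (as in a Databricks notebook);
    # the table path is hypothetical.
    current_df = spark.read.format("delta").load("/delta/events")

    # Read an earlier snapshot by version number ...
    v0_df = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")

    # ... or by timestamp
    old_df = (spark.read.format("delta")
              .option("timestampAsOf", "2024-01-01")
              .load("/delta/events"))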
2. One of the core features of Delta Lake is performing upserts. Which of the following statements is true in regard to Upsert?
- UpSert is literally TWO operations. Update / Insert (CORRECT)
- Upsert is a new DML statement for SQL syntax
- Upsert is supported in traditional data lakes
Correct: UPSERT is a mash-up of the "UPdate" and "inSERT" actions: it updates existing records and inserts new ones in a single operation. Traditional data lakes do not support it, but Delta Lake does, which significantly simplifies managing incremental data changes. A minimal MERGE sketch follows.
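A minimal sketch of such an upsert, assuming hypothetical customers and customers_updates Delta tables already exist:

    # MERGE INTO performs the UPSERT: update matching rows, insert new ones.
    # Table names are hypothetical; `spark` is the active SparkSession.
    spark.sql("""
        MERGE INTO customers AS target
        USING customers_updates AS source
        ON target.customer_id = source.customer_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)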
3. When discussing Delta Lake, there is often a reference to the concept of Bronze, Silver and Gold tables. These levels refer to the state of data refinement as data flows through a processing pipeline and are conceptual guidelines. Based on these table concepts the refinements in Silver tables generally relate to which of the following?
- Highly refined views of the data
- Data that is directly queryable and ready for insights (CORRECT)
- Raw data (or very little processing)
Correct: Silver tables typically contain data that has been cleaned, transformed, and enriched so that it is directly queryable and ready to be analyzed or turned into insights. They sit between raw data (Bronze tables) and highly refined, aggregated data (Gold tables) in a data lake architecture.
4. What is the Databricks Delta command to display metadata?
- DESCRIBE DETAIL tableName (CORRECT)
- MSCK DETAIL tablename
- SHOW SCHEMA tablename
Correct: The metadata of a Delta table can be displayed using the DESCRIBE DETAIL tableName command. It returns detailed information about the table, such as its format, location, partitioning columns, number of files, size in bytes, and table properties.
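For example (table name hypothetical), the command can be run from Python and its single-row result inspected:

    # DESCRIBE DETAIL returns one row of table metadata
    # (format, location, partitionColumns, numFiles, sizeInBytes, properties, ...).
    spark.sql("DESCRIBE DETAIL students").show(truncate=False)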
5. How do you perform UPSERT in a Delta dataset?
- Use UPSERT INTO my-table /MERGE
- Use UPSERT INTO my-table
- Use MERGE INTO my-table USING data-to-upsert (CORRECT)
Correct: To carry out an UPSERT on a Databricks Delta dataset, use the MERGE INTO statement, which updates existing records and inserts new ones. An equivalent sketch using the Python DeltaTable API follows.
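Besides the SQL MERGE INTO statement, the same upsert can be written with the Python DeltaTable API; the table name students and the DataFrame updates_df below are hypothetical:

    from delta.tables import DeltaTable

    # Upsert updates_df into the "students" Delta table on matching id values.
    target = DeltaTable.forName(spark, "students")
    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())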
6. What optimization does the following command perform: OPTIMIZE Students ZORDER BY Grade?
- Ensures that all data backing, for example, Grade=8 is colocated, then rewrites the sorted data into new Parquet files. (CORRECT)
- Ensures that all data backing, for example, Grade=8 is colocated, then updates a graph that routes requests to the appropriate files.
- Creates an order-based index on the Grade field to improve filters against that field.
Correct: Delta Lake employs Z-Ordering, a technique that optimizes query performance by co-locating related data in the same set of files. Sorting on one or more columns reduces the amount of data scanned during query execution and speeds up queries on large datasets, particularly range queries that filter on the Z-ordered columns.
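The command from the question can be issued directly, for example from a Python cell:

    # Compacts the Students table and Z-orders the rewritten Parquet files by Grade,
    # so rows with the same Grade end up co-located.
    spark.sql("OPTIMIZE Students ZORDER BY (Grade)")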
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. You have a dataframe which you preprocessed and filtered down to only the relevant columns.
- The columns are: id, host_name, bedrooms, neighbourhood_cleansed, price.
- You want to retrieve the first initial from the host_name field.
- How would you write that function in local Python/Scala?
- def firstInitialFunction(name):
      get name[0]
  firstInitialFunction("Steven")
- new firstInitialFunction(name):
      return name[1]
  firstInitialFunction("Steven")
- def firstInitialFunction(name): (CORRECT)
      return name[0]
  firstInitialFunction("Steven")
- new firstInitialFunction(name):
      extract name[]
  firstInitialFunction("Steven")
Correct: This defines a local Python function that returns the first initial from a name value such as those in the host_name column; for example, firstInitialFunction("Steven") returns "S". A cleaned-up version is shown below.
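    # Local Python function (not yet a Spark UDF): return the first character of a name.
    def firstInitialFunction(name):
        return name[0]

    firstInitialFunction("Steven")  # returns 'S'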
2. You’ve written a function that retrieves the first initial letter from the host_name column.
You now want to define it as a user-defined function (UDF) named firstInitialUDF.
How would you define that in Python/Scala?
- firstInitialUDF = udf(firstInitialFunction) (CORRECT)
- firstInitialUDF = firstInitialFunction()
- firstInitial = udf(firstInitial)
- firstInitial = udf(firstInitialFunction)
Correct: udf(firstInitialFunction) wraps the local function as a UDF that can be used to extract the first initial from the host_name column. A simple sketch of defining and applying such a UDF in Python is shown below:
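    from pyspark.sql.functions import udf

    def firstInitialFunction(name):
        return name[0]

    # Wrap the local function as a Spark UDF (string return type by default).
    firstInitialUDF = udf(firstInitialFunction)

    # Apply it to the host_name column; the DataFrame name airbnbDF is hypothetical.
    airbnbDF.select(firstInitialUDF("host_name").alias("first_initial")).show()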
3. If you want to create the UDF in the SQL namespace, which method do you need to use?
- spark.sql.create
- spark.udf.register (CORRECT)
- spark.sql.read
- spark.sql.register
Correct: spark.udf.register registers the function in the SQL namespace so it can be called from SQL queries, as in the sketch below.
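    # Register the function in the SQL namespace; the UDF name and the temp view
    # name are hypothetical, and airbnbDF is the hypothetical DataFrame from above.
    spark.udf.register("firstInitialUDF", firstInitialFunction)

    airbnbDF.createOrReplaceTempView("airbnb")
    spark.sql("SELECT firstInitialUDF(host_name) AS first_initial FROM airbnb").show()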
4. Which is another syntax that you can use to define a UDF in Python?
- Designer
- Decorator (CORRECT)
- Capsulator
- Wrapper
Correct: Alternatively, you can define a UDF using Python's decorator syntax, declaring the return data type of the function. Once defined this way, the local Python function can no longer be called directly; i.e., decoratorUDF("Jane") will not work. See the sketch below.
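    from pyspark.sql.functions import udf

    # Decorator syntax: declare the return type up front; the decorated name is now
    # a Spark UDF, so decoratorUDF("Jane") can no longer be called as a local function.
    @udf("string")
    def decoratorUDF(name):
        return name[0]

    # It must be applied to DataFrame columns instead (airbnbDF is hypothetical):
    airbnbDF.select(decoratorUDF("host_name")).show()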
5. True or false?
The Catalyst Optimizer can be used to optimize UDFs.
- True
- False (CORRECT)
Correct: The Catalyst Optimizer cannot optimize UDFs; their logic is opaque to Spark, so it cannot be rewritten or pushed down the way built-in functions can.
QUIZ: TEST PREP
1. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for DDL modifications. This functionality is referred to as?
- Schema Evolution (CORRECT)
- ACID Transactions
- Time Travel
- Schema Enforcement
Correct: With Schema Evolution, Delta Lake applies changes to a table schema automatically, without requiring subsequent DDL modifications. A minimal sketch using the mergeSchema option follows.
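A minimal sketch, assuming a hypothetical new_data_df whose schema adds a column to the existing students table:

    # Appending data with an extra column; mergeSchema lets Delta Lake evolve the
    # table schema automatically instead of failing or requiring ALTER TABLE.
    (new_data_df.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("students"))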
2. One of the core features of Delta Lake is performing upserts. Which of the following statements is true regarding Upsert?
- Upsert is supported in traditional data lakes
- Upsert is literally TWO operations. Update / Insert (CORRECT)
- Upsert is a new DML statement for SQL syntax
Correct: To UPSERT means to “UPdate” and “inSERT.” In other words, UPSERT combines two operations: updating existing records and inserting new ones. This functionality is not supported in traditional data lakes.
3. What is the Databricks Delta command to display metadata?
- SHOW SCHEMA tablename
- MSCK DETAIL tablename
- DESCRIBE DETAIL tableName (CORRECT)
Correct: The command DESCRIBE DETAIL tableName can be run to display a table's metadata.
4. What optimization does the following command perform: OPTIMIZE Customers ZORDER BY City?
- Ensures that all data backing, for example, City=”London” is colocated, then updates a graph that routes requests to the appropriate files.
- Ensures that all data backing, for example, City=’London’ is colocated, then rewrites the sorted data into new Parquet files. (CORRECT)
- Creates an order-based index on the City field to improve filters against that field
Correct: Z-Ordering co-locates similar data in the same set of files, optimizing the data layout so that range queries run faster.
5. You are planning on registering a user-defined function, g, as g_function in a SQL namespace. How would you achieve this programmatically?
- spark.register_udf(“g_function”, g)
- spark.udf.register(“g_function”, g) (CORRECT)
- spark.udf.register(g, “g_function”)
Correct: This registers the UDF under the name g_function in the SQL namespace so it can be called from SQL queries.
6. True or False?
User-defined Functions cannot operate on DataFrames.
- False (CORRECT)
- True
Correct: UDFs can operate on DataFrames: they let you apply a custom function to values in a DataFrame column, adding your own logic on top of Spark's built-in functions. A short example using withColumn is shown below.
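A short sketch, using the hypothetical employees_df and employee_name column from the next question:

    from pyspark.sql.functions import udf

    # Apply a UDF to every value in a DataFrame column, producing a new column.
    first_initial_udf = udf(lambda name: name[0], "string")
    employees_with_initial = employees_df.withColumn(
        "first_initial", first_initial_udf("employee_name"))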
7. Suppose you already have a dataframe which only contains relevant columns.
The columns are: id, employee_name, age, gender.
You want to retrieve the first initial from the employee_name field by creating a local function in Python/Scala. Which of the following code snippets can be used to get the first initial from the employee_name column?
- new firstInitialFunction(name):
      extract name[]
  firstInitialFunction("Steven")
- def firstInitialFunction(name):
      get name[0]
  firstInitialFunction("Steven")
- def firstInitialFunction(name): (CORRECT)
      return name[0]
  firstInitialFunction("Steven")
Correct: This code defines a local function that returns the first initial of a name; applied to the employee_name column (for example, as a UDF), it can populate a new first_initial column in the DataFrame.
8. In Delta Lake you need to merge, update and delete datasets while ensuring you comply with GDPR and CCPA. What supported APIs can you use?
Select all that apply.
- XML
- Scala (CORRECT)
- Java (CORRECT)
- Python (CORRECT)
- SQL (CORRECT)
Correct: Delta Lake provides merge, update, and delete operations through its Scala, Java, Python, and SQL APIs, which simplifies compliance with data privacy regulations such as GDPR and CCPA. These operations let you efficiently modify or remove records so that personal data is handled as those regulations require. A short SQL sketch follows.
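A short SQL sketch of the kind of targeted delete and update these APIs enable; the table name and predicates are hypothetical:

    # Remove or anonymize a specific user's records directly in the Delta table.
    spark.sql("DELETE FROM customers WHERE customer_id = 'user-123'")
    spark.sql("UPDATE customers SET email = NULL WHERE opted_out = true")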
CONCLUSION – Processing Data in Azure Databricks
In addition to its built-in SQL functions, Azure Databricks lets you create and use User-Defined Functions. This module showed you how to register and call UDFs, and how to use Delta Lake to create, append to, and upsert into Apache Spark tables. Together, these tools and techniques set you up to optimize data management, enabling reliable and efficient data processing in Azure Databricks.