INTRODUCTION – Processing Data in Azure Databricks
Azure Databricks offers several built-in SQL functions, but sometimes you might need to create your own functions – that is, User-Defined Functions or UDFs. In this course, you will discover how to register UDFs and invoke them, thereby extending the functionality of your SQL queries.
You will use Delta Lake to create, append, and upsert into Apache Spark tables. Delta Lake brings built-in reliability and optimization features that can greatly enhance your data management efforts.
Learning Objectives:
- Write UDFs
- Perform ETL using User-Defined Functions
- Know the features and use cases of Delta Lake
- Create, append, and upsert Delta tables
- Apply optimizations in Delta Lake
- Use Time Travel to compare different versions of a Delta table
PRACTICE QUIZ: KNOWLEDGE CHECK 1
1. Delta Lake provides snapshots of data enabling developers to access and revert to earlier versions of data for audits, rollbacks or to reproduce experiments. This functionality is referred to as?
- Time Travel (CORRECT)
- Schema Evolution
- Schema Enforcement
- ACID Transactions
Correct: Delta Lake's snapshots of data allow developers to access earlier versions of the data and roll back to a point in time before a change was made. This can be used for audits, to roll back unwanted changes, or to reproduce an experiment.
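As a hedged illustration (the table path below is hypothetical), Time Travel can be exercised from Python by reading an earlier snapshot of a Delta table:

    # Minimal PySpark sketch of Delta Lake Time Travel.
    # Assumes an active SparkSession named `spark` (as in a Databricks notebook);
    # the table path is hypothetical.
    current_df = spark.read.format("delta").load("/delta/events")

    # Read an earlier snapshot by version number ...
    v0_df = spark.read.format("delta").option("versionAsOf", 0).load("/delta/events")

    # ... or by timestamp
    old_df = (spark.read.format("delta")
              .option("timestampAsOf", "2024-01-01")
              .load("/delta/events"))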
2. One of the core features of Delta Lake is performing upserts. Which of the following statements is true in regard to Upsert?
- UpSert is literally TWO operations. Update / Insert (CORRECT)
- Upsert is a new DML statement for SQL syntax
- Upsert is supported in traditional data lakes
Correct: UPSERT is a mash-up of the "UPdate" and "inSERT" actions: it updates existing records and inserts new ones in a single operation. Traditional data lakes do not support it, but Delta Lake does, which significantly simplifies managing incremental data changes. A minimal MERGE sketch follows.
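A minimal sketch of such an upsert, assuming hypothetical customers and customers_updates Delta tables already exist:

    # MERGE INTO performs the UPSERT: update matching rows, insert new ones.
    # Table names are hypothetical; `spark` is the active SparkSession.
    spark.sql("""
        MERGE INTO customers AS target
        USING customers_updates AS source
        ON target.customer_id = source.customer_id
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)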
3. When discussing Delta Lake, there is often a reference to the concept of Bronze, Silver and Gold tables. These levels refer to the state of data refinement as data flows through a processing pipeline and are conceptual guidelines. Based on these table concepts the refinements in Silver tables generally relate to which of the following?
- Highly refined views of the data
- Data that is directly queryable and ready for insights (CORRECT)
- Raw data (or very little processing)
Correct: Silver tables typically contain data that has been cleaned, transformed, and enriched so that it is directly queryable and ready to be analyzed or turned into insights. They sit between raw data (Bronze tables) and highly refined, aggregated data (Gold tables) in a data lake architecture.
4. What is the Databricks Delta command to display metadata?
- DESCRIBE DETAIL tableName (CORRECT)
- MSCK DETAIL tablename
- SHOW SCHEMA tablename
Correct: The metadata of a Delta table can be displayed using the DESCRIBE DETAIL tableName command. It returns detailed information about the table, such as its format, location, partitioning columns, number of files, size in bytes, and table properties.
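For example (table name hypothetical), the command can be run from Python and its single-row result inspected:

    # DESCRIBE DETAIL returns one row of table metadata
    # (format, location, partitionColumns, numFiles, sizeInBytes, properties, ...).
    spark.sql("DESCRIBE DETAIL students").show(truncate=False)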
5. How do you perform UPSERT in a Delta dataset?
- Use UPSERT INTO my-table /MERGE
- Use UPSERT INTO my-table
- Use MERGE INTO my-table USING data-to-upsert (CORRECT)
Correct: To carry out an UPSERT on a Databricks Delta dataset, use the MERGE INTO statement, which updates existing records and inserts new ones. An equivalent sketch using the Python DeltaTable API follows.
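Besides the SQL MERGE INTO statement, the same upsert can be written with the Python DeltaTable API; the table name students and the DataFrame updates_df below are hypothetical:

    from delta.tables import DeltaTable

    # Upsert updates_df into the "students" Delta table on matching id values.
    target = DeltaTable.forName(spark, "students")
    (target.alias("t")
        .merge(updates_df.alias("s"), "t.id = s.id")
        .whenMatchedUpdateAll()
        .whenNotMatchedInsertAll()
        .execute())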
6. What optimization does the following command perform: OPTIMIZE Students ZORDER BY Grade?
- Ensures that all data backing, for example, Grade=8 is colocated, then rewrites the sorted data into new Parquet files. (CORRECT)
- Ensures that all data backing, for example, Grade=8 is colocated, then updates a graph that routes requests to the appropriate files.
- Creates an order-based index on the Grade field to improve filters against that field.
Correct: Delta Lake employs Z-Ordering, a technique that optimizes query performance by co-locating related data in the same set of files. Sorting on one or more columns reduces the amount of data scanned during query execution and speeds up queries on large datasets, particularly range queries that filter on the Z-ordered columns.
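The command from the question can be issued directly, for example from a Python cell:

    # Compacts the Students table and Z-orders the rewritten Parquet files by Grade,
    # so rows with the same Grade end up co-located.
    spark.sql("OPTIMIZE Students ZORDER BY (Grade)")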
PRACTICE QUIZ: KNOWLEDGE CHECK 2
1. You have a dataframe which you preprocessed and filtered down to only the relevant columns.
- The columns are: id, host_name, bedrooms, neighbourhood_cleansed, price.
- You want to retrieve the first initial from the host_name field.
- How would you write that function in local Python/Scala?
- def firstInitialFunction(name):
      get name[0]
  firstInitialFunction("Steven")
- new firstInitialFunction(name):
      return name[1]
  firstInitialFunction("Steven")
- def firstInitialFunction(name): (CORRECT)
      return name[0]
  firstInitialFunction("Steven")
- new firstInitialFunction(name):
      extract name[]
  firstInitialFunction("Steven")
Correct: This defines a local Python function that returns the first initial from a name value such as those in the host_name column; for example, firstInitialFunction("Steven") returns "S". A cleaned-up version is shown below.
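    # Local Python function (not yet a Spark UDF): return the first character of a name.
    def firstInitialFunction(name):
        return name[0]

    firstInitialFunction("Steven")  # returns 'S'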
2. You’ve written a function that retrieves the first initial letter from the host_name column.
You now want to define it as a user-defined function (UDF) named firstInitialUDF.
How would you define that in Python/Scala?
- firstInitialUDF = udf(firstInitialFunction) (CORRECT)
- firstInitialUDF = firstInitialFunction()
- firstInitial = udf(firstInitial)
- firstInitial = udf(firstInitialFunction)
Correct: udf(firstInitialFunction) wraps the local function as a UDF that can be used to extract the first initial from the host_name column. A simple sketch of defining and applying such a UDF in Python is shown below:
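    from pyspark.sql.functions import udf

    def firstInitialFunction(name):
        return name[0]

    # Wrap the local function as a Spark UDF (string return type by default).
    firstInitialUDF = udf(firstInitialFunction)

    # Apply it to the host_name column; the DataFrame name airbnbDF is hypothetical.
    airbnbDF.select(firstInitialUDF("host_name").alias("first_initial")).show()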
3. If you want to create the UDF in the SQL namespace, which method do you need to use?
- spark.sql.create
- spark.udf.register (CORRECT)
- spark.sql.read
- spark.sql.register
Correct: spark.udf.register registers the function in the SQL namespace so it can be called from SQL queries, as in the sketch below.
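    # Register the function in the SQL namespace; the UDF name and the temp view
    # name are hypothetical, and airbnbDF is the hypothetical DataFrame from above.
    spark.udf.register("firstInitialUDF", firstInitialFunction)

    airbnbDF.createOrReplaceTempView("airbnb")
    spark.sql("SELECT firstInitialUDF(host_name) AS first_initial FROM airbnb").show()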
4. Which is another syntax that you can use to define a UDF in Python?
- Designer
- Decorator (CORRECT)
- Capsulator
- Wrapper
Correct: Alternatively, you can define a UDF using Python's decorator syntax, declaring the return data type of the function. Once defined this way, the local Python function can no longer be called directly; i.e., decoratorUDF("Jane") will not work. See the sketch below.
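    from pyspark.sql.functions import udf

    # Decorator syntax: declare the return type up front; the decorated name is now
    # a Spark UDF, so decoratorUDF("Jane") can no longer be called as a local function.
    @udf("string")
    def decoratorUDF(name):
        return name[0]

    # It must be applied to DataFrame columns instead (airbnbDF is hypothetical):
    airbnbDF.select(decoratorUDF("host_name")).show()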
5. True or false?
The Catalyst Optimizer can be used to optimize UDFs.
- True
- False (CORRECT)
Correct: The Catalyst Optimizer cannot optimize UDFs; their logic is opaque to Spark, so it cannot be rewritten or pushed down the way built-in functions can.
QUIZ: TEST PREP
1. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for DDL modifications. This functionality is referred to as?
- Schema Evolution (CORRECT)
- ACID Transactions
- Time Travel
- Schema Enforcement
Correct: With Schema Evolution, Delta Lake applies changes to a table schema automatically, without requiring subsequent DDL modifications. A minimal sketch using the mergeSchema option follows.
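A minimal sketch, assuming a hypothetical new_data_df whose schema adds a column to the existing students table:

    # Appending data with an extra column; mergeSchema lets Delta Lake evolve the
    # table schema automatically instead of failing or requiring ALTER TABLE.
    (new_data_df.write
        .format("delta")
        .mode("append")
        .option("mergeSchema", "true")
        .saveAsTable("students"))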
2. One of the core features of Delta Lake is performing upserts. Which of the following statements is true regarding Upsert?
- Upsert is supported in traditional data lakes
- Upsert is literally TWO operations. Update / Insert (CORRECT)
- Upsert is a new DML statement for SQL syntax
Correct: To UPSERT means to “UPdate” and “inSERT.” In other words, UPSERT combines two operations: updating existing records and inserting new ones. This functionality is not supported in traditional data lakes.
3. What is the Databricks Delta command to display metadata?
- SHOW SCHEMA tablename
- MSCK DETAIL tablename
- DESCRIBE DETAIL tableName (CORRECT)
Correct: The command DESCRIBE DETAIL tableName can be run to display a table's metadata.
4. What optimization does the following command perform: OPTIMIZE Customers ZORDER BY City?
- Ensures that all data backing, for example, City=”London” is colocated, then updates a graph that routes requests to the appropriate files.
- Ensures that all data backing, for example, City=’London’ is colocated, then rewrites the sorted data into new Parquet files. (CORRECT)
- Creates an order-based index on the City field to improve filters against that field
Correct: Z-Ordering co-locates similar data in the same set of files, optimizing the data layout so that range queries run faster.
5. You are planning on registering a user-defined function, g, as g_function in a SQL namespace. How would you achieve this programmatically?
- spark.register_udf(“g_function”, g)
- spark.udf.register(“g_function”, g) (CORRECT)
- spark.udf.register(g, “g_function”)
Correct: This registers the UDF under the name g_function in the SQL namespace so it can be called from SQL queries.
6. True or False?
User-defined Functions cannot operate on DataFrames.
- False (CORRECT)
- True
Correct: UDFs can operate on DataFrames: they let you apply a custom function to values in a DataFrame column, adding your own logic on top of Spark's built-in functions. A short example using withColumn is shown below.
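A short sketch, using the hypothetical employees_df and employee_name column from the next question:

    from pyspark.sql.functions import udf

    # Apply a UDF to every value in a DataFrame column, producing a new column.
    first_initial_udf = udf(lambda name: name[0], "string")
    employees_with_initial = employees_df.withColumn(
        "first_initial", first_initial_udf("employee_name"))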
7. Suppose you already have a dataframe which only contains relevant columns.
The columns are: id, employee_name, age, gender.
You want to retrieve the first initial from the employee_name field by creating a local function in Python/Scala. Which of the following code snippets can be used to get the first initial from the employee_name column?
- new firstInitialFunction(name):
      extract name[]
  firstInitialFunction("Steven")
- def firstInitialFunction(name):
      get name[0]
  firstInitialFunction("Steven")
- def firstInitialFunction(name): (CORRECT)
      return name[0]
  firstInitialFunction("Steven")
Correct: This code defines a local function that returns the first initial of a name; applied to the employee_name column (for example, as a UDF), it can populate a new first_initial column in the DataFrame.
8. In Delta Lake you need to merge, update and delete datasets while ensuring you comply with GDPR and CCPA. What supported APIs can you use?
Select all that apply.
- XML
- Scala (CORRECT)
- Java (CORRECT)
- Python (CORRECT)
- SQL (CORRECT)
Correct: Delta Lake provides merge, update, and delete operations through its Scala, Java, Python, and SQL APIs, which simplifies compliance with data privacy regulations such as GDPR and CCPA. These operations let you efficiently modify or remove records so that personal data is handled as those regulations require. A short SQL sketch follows.
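A short SQL sketch of the kind of targeted delete and update these APIs enable; the table name and predicates are hypothetical:

    # Remove or anonymize a specific user's records directly in the Delta table.
    spark.sql("DELETE FROM customers WHERE customer_id = 'user-123'")
    spark.sql("UPDATE customers SET email = NULL WHERE opted_out = true")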
CONCLUSION – Processing Data in Azure Databricks
In addition to its built-in SQL functions, Azure Databricks lets you create and use User-Defined Functions. This module showed you how to register and call UDFs, and how to use Delta Lake to create, append to, and upsert into Apache Spark tables. Together, these tools and techniques set you up to optimize data management, enabling reliable and efficient data processing in Azure Databricks.