
Azure Stream Analytics

In today’s data-driven world, the need to react immediately to unfolding events has never been greater. Picture yourself on the trading floor, where milliseconds can decide millions. Or consider a bustling metropolis where urban sensors constantly monitor traffic, air quality, and energy consumption. Azure Stream Analytics is Microsoft’s answer to the challenges of real-time data ingestion, processing, and analytics.

Azure Stream Analytics is a real-time event data processing service that you can use to harness the power of fast-moving streams of data. But what does it really mean for you?

WHY AZURE STREAM ANALYTICS?

Azure Stream Analytics brings the following tools to your toolkit:

■■ Seamless integration: Azure Stream Analytics beautifully integrates with other Azure services. Whether you’re pulling data from IoT Hub, Event Hubs, or Blob Storage, Stream Analytics acts as your cohesive layer, processing and redirecting the data to databases, dashboards, or even other applications, as shown in Figure 4-17.

■■ SQL-based query language: You don’t need to be a programming wizard to harness Azure Stream Analytics. If you’re familiar with SQL, you’re already ahead of the curve. Stream Analytics employs a SQL-like language, allowing you to create transformation queries on your real-time data.

FIGURE 4-17  Azure Stream Analytics

■■ Scalability and reliability: One of the hallmarks of Azure Stream Analytics is its ability to scale. Whether you’re processing a few records or millions every second, Stream Analytics can handle it. Moreover, its built-in recovery capabilities help ensure that no data is lost in the event of a failure.


■■ Real-time dashboards: Azure Stream Analytics is not just about processing; it’s also about visualization. With its ability to integrate seamlessly with tools like Power BI, you can access real-time dashboards that update as events unfold.

■■ Time windowing: One of the standout features you’ll appreciate is the ease with which you can perform operations over specific time windows—be it tumbling, sliding, or hopping. For instance, you might want to calculate the average temperature from IoT sensors every five minutes; Stream Analytics has got you covered (a query sketch appears after this list).

A tumbling window in stream processing is a fixed-duration, nonoverlapping interval used to segment time-series data. Each piece of data falls into exactly one window, defined by a distinct start and end time, ensuring that data groups are mutually exclusive. For instance, with a 5-minute tumbling window, data arriving from 00:00 up to (but not including) 00:05 is aggregated in one window and data from 00:05 up to 00:10 in the next, facilitating structured, periodic analysis of streaming data.

A sliding window in stream processing has a fixed length, but rather than advancing in discrete steps, it conceptually slides continuously along the data stream, so consecutive windows overlap and a single event can contribute to many windows. For example, with a 10-minute sliding window, at any point in time the window contains exactly the events from the preceding 10 minutes. In Azure Stream Analytics, sliding windows produce output only when an event enters or exits the window, giving a continuous, overlapping view of the stream without emitting a result at every tick of the clock.

A hopping window in stream processing is a time-based window that moves forward in fixed increments, known as the hop size. Each window has a specified duration, and the start of the next window is determined by the hop size rather than by the end of the previous window. This approach allows windows to overlap, so data can be included in multiple consecutive windows if it falls within their time frames. For example, with a window duration of 10 minutes and a hop size of 5 minutes, a new window starts every 5 minutes, and each window overlaps the next by the difference between the window size and the hop size, in this case 5 minutes.

■■ Anomaly detection: Dive into the built-in machine learning capabilities to detect anomalies in your real-time data streams. Whether you’re monitoring web clickstreams or machinery in a factory, Azure Stream Analytics can alert you to significant deviations in patterns.
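
As a concrete illustration of the SQL-like query language and the five-minute average-temperature scenario mentioned above, the following query is a minimal sketch only; the input and output aliases (SensorInput, PowerBIOutput) and the field names (deviceId, temperature, eventTime) are hypothetical and would be defined on your own Stream Analytics job.

-- Minimal Stream Analytics query sketch: average temperature per device over
-- five-minute tumbling windows. Input/output aliases and field names are
-- hypothetical and would be configured on your own job.
SELECT
    deviceId,
    AVG(temperature) AS avgTemperature,
    System.Timestamp() AS windowEnd
INTO
    PowerBIOutput
FROM
    SensorInput TIMESTAMP BY eventTime
GROUP BY
    deviceId,
    TumblingWindow(minute, 5)

-- Swapping the window function changes the behavior described above:
--   HoppingWindow(minute, 10, 5)  -- 10-minute windows starting every 5 minutes (overlapping)
--   SlidingWindow(minute, 10)     -- 10-minute windows evaluated as events enter or leave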

As a practical example to truly appreciate the potential of Azure Stream Analytics, consider a smart city initiative. Urban sensors, spread across the city, send real-time data about traffic, energy consumption, and more. Through Azure Stream Analytics, this data is ingested in real time, processed to detect any irregularities such as traffic jams or power surges, and then passed on to Power BI dashboards that city officials monitor. The officials can then take immediate action, such as rerouting traffic or adjusting power distribution.


In summary, Azure Stream Analytics is a tool for those yearning to transform raw, real-time data streams into actionable, meaningful insights. And as you delve deeper into its features and integrations, you’ll realize that its possibilities are vast and ever-evolving.


Unstructured Data

Unstructured data describes everything that doesn’t fit the structured or semi-structured classifications. PDFs, images, videos, and emails are just a few examples of unstructured data. While it is true that unstructured data cannot be queried like structured or semi-structured data, deep learning and artificial intelligence (AI) applications can derive valuable insights from it. For example, applications using image classification can be trained to find specific details in images by comparing them to other images.

Storing unstructured data is easier today than it has ever been. As mentioned previously, Azure Blob Storage allows companies and individuals to store exabytes of data in any format. While this exam does not cover the many applications of unstructured data, it is important to note that unstructured data is becoming more and more vital for companies to gain a competitive edge in today’s world.

Data Velocity

The speed at which data is processed is commonly known as data velocity. Requirements for data processing are largely dependent on what business problem or problems we are trying to solve. Raw data such as football player statistics could be stored until every game for a given week is finished and then transformed into insightful information. This type of data processing, where data is processed in batches, is commonly referred to as batch processing. We can also process data in real time from sensors located on equipment that a player is wearing, so that we can monitor player performance as the game is happening. This type of data processing is called stream processing.

Batch Processing

Batch processing is the practice of transforming groups, or batches, of data at a time. This process is also known as processing data at rest. Traditional BI platforms relied on batch processing solutions to create meaningful insights out of their data. Concert venues would leverage technologies such as SQL Server to store batch data and SQL Server Integration Services (SSIS) to transform transactional data on a schedule into information that could be stored in their data warehouse for reporting. Many of the same concepts apply today for batch processing, but cloud computing gives us the scalability to process exponentially more data. Distributed computing paradigms such as Hadoop and Spark allow organizations to use compute from multiple commodity servers to process large amounts of data in batch.

Batch processing is typically carried out as a series of jobs automated by an orchestration service such as Azure Data Factory (ADF). These jobs can be run one by one, in parallel, or in a mix of both, depending on the requirements of the solution they are part of. Automated batch jobs can be run after a certain data threshold is reached in a data store but are more often triggered in one of two ways (a rough trigger-definition sketch follows the list below):

  • On a recurring schedule—an ADF pipeline running every night at midnight, or on a periodic time interval starting at a specified start time.
  • Event/trigger-based—an ADF pipeline running after a file is uploaded to a container in Azure Blob Storage.
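
As a rough illustration of the first trigger type, the JSON below sketches approximately what an ADF schedule trigger definition looks like when exported; the trigger and pipeline names are invented, and the exact schema may differ slightly depending on the version of the Data Factory tooling you use.

{
    "name": "NightlyTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2024-01-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            {
                "pipelineReference": {
                    "referenceName": "NightlyBatchLoad",
                    "type": "PipelineReference"
                }
            }
        ]
    }
}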

It is also critical that batch processing includes error handling logic that acts on a failed job. A common architecture pattern that handles batch processing in Azure is illustrated in Figure 1.7.

FIGURE 1.7 Common architecture for batch processing in Azure

There is quite a bit going on in the diagram in Figure 1.7, so let’s break it down step by step:

  • Data is loaded from disparate source systems into Azure. This could vary from raw files being uploaded to a central data repository such as Azure Data Lake Storage Gen2 (ADLS) to data being collected from business applications in an OLTP database such as Azure SQL Database.
  • Raw data is then transformed into a state that is analytics and report ready. Here, we can choose between code-first options such as Azure Databricks to have complete control over how data is transformed or GUI-based technologies such as Azure Data Factory Data Flows. Both options can be executed as activities in an ADF pipeline.
  • Aggregated data is loaded into an optimized data store ready for reporting. Depending on the workload and the size of data, an MPP data warehouse such as Azure Synapse Analytics dedicated SQL pool can be used to optimally store data that is used for reporting.
  • Data that is ready to be reported is then analyzed through client-native applications or a business intelligence tool such as Power BI.


Skill 4.3 Describe data visualization in Microsoft Power BI

Dive into the transformative world of data visualization with Microsoft Power BI, a tool that not only brings your data to life but also empowers you to extract insights with unparalleled ease and finesse. As you delve deeper into this segment, imagine the vast swathes of data currently sitting in spreadsheets or databases metamorphosing into vibrant charts, intricate graphs, and interactive dashboards. With Power BI, you can tailor every detail of your visualizations to your precise needs.

Picture a dashboard where sales metrics, customer demographics, and operational efficiencies merge seamlessly, with each visual element telling its part of the larger story, as shown in Figure 4-20. That’s the promise of Power BI, a canvas where data finds its voice. And while the visual elements captivate, remember that beneath them lie robust analytical capabilities. Want to drill down into a specific data point? Curious about trends over time? Power BI is more than up to the task, offering you both the broad view and the minute details.

In this section, you’ll encounter vivid examples that underscore the versatility and power of Power BI. From crafting simple bar charts to designing multidimensional maps, you’ll learn the art and science of making data dance to your tune.

And while our guidance here is comprehensive, Power BI’s expansive capabilities mean there’s always more to explore. Consider referring to Microsoft’s official resources for deeper dives, advanced tutorials, and community-driven insights. Let’s embark on this enlightening journey, ensuring that, by its end, you’re not just a data analyst but also a data storyteller.

FIGURE 4-20  Power BI interactive dashboard


This skill covers how to:

  • Identify capabilities of Power BI
  • Describe features of data models in Power BI


Identify capabilities of Power BI

When you dive into Power BI, you’re immersing yourself in a universe of functionalities, each tailored to elevate your data visualization and analytical skills. Here’s a guide to help you navigate and harness the essential capabilities of this remarkable tool.

  • Seamless data integration: At the heart of every great visualization lies the data that drives it. With Power BI, you can connect effortlessly to a diverse range of data sources, be it local databases, cloud-based solutions, Excel spreadsheets, or third-party platforms, as shown in Figure 4-21. The beauty of it is that once the data is connected, you can consolidate and transform that data, paving the way for rich, meaningful visualizations.

FIGURE 4-21 Power BI data ingestion process

  • Intuitive drag-and-drop features: You don’t need to be a coding wizard to craft compelling visuals in Power BI. With its user-friendly interface, designing everything from simple charts to complex dashboards becomes an intuitive, drag-and-drop affair. Picture yourself effortlessly juxtaposing a line graph next to a pie chart, bringing multiple data stories into a coherent narrative.
  • Advanced data modeling: Beyond its visualization prowess, Power BI arms you with robust data modeling tools. With features like Data Analysis Expressions (DAX), you can create custom calculations, derive new measures, and model your data in ways that resonate best with your analysis needs.


  • Interactive reports and dashboards: Static visuals tell only half the story. With Power BI, your visualizations come alive, offering interactive capabilities that encourage exploration. Imagine a sales dashboard where clicking a region dynamically updates all associated charts, revealing granular insights with a mere click.
  • Collaboration and sharing: Crafting the perfect visualization is one thing; sharing it is another. Power BI streamlines collaboration, meaning you can publish reports, share dashboards, and even embed visuals into apps or websites. Your insights, once confined to your device, can now reach a global audience or targeted stakeholders with ease.

As a practical example, consider you’re managing the sales division for a global enterprise. With Power BI, you can effortlessly integrate sales data from various regions, model it to account for currency differences, and craft a dynamic dashboard. Now, with a simple click, stakeholders can dive into regional sales, identify top-performing products, and even forecast future trends. As your proficiency with Power BI grows, there’s always more to discover. As you chart your data journey with Power BI, remember that every insight you unearth has the potential to inform, inspire, and innovate.
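
To make the currency example a bit more concrete, the DAX measure below is a minimal sketch that assumes a hypothetical Sales table (with an Amount column and a currency key) related many-to-one to an FxRates table holding a RateToUSD column; all names are illustrative.

// Hedged sketch: converts each Sales row to USD using the related exchange rate,
// then sums the results. Assumes a model relationship from Sales to FxRates.
Sales Amount USD =
SUMX (
    Sales,
    Sales[Amount] * RELATED ( FxRates[RateToUSD] )
)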


Describe features of data models in Power BI

When you work with Power BI, you’re not just interacting with visual representations of data; you’re engaging with a meticulously structured data model. The depth and breadth of this model dictate the stories you can extract from your data. This section is a more detailed guide to the intricate features of data models in Power BI and how they set the stage for data-driven narratives.

Relationships

At the heart of your data model are relationships. They let you connect different tables for richer, multidimensional analysis. Think of relationships as bridging islands of data so they can talk to each other. For instance, as shown in Figure 4-22, you can link a Sales table to a Products table to reveal insights about which products drive the most revenue.

FIGURE 4-22  Power BI table (entity) relationship

BASICS OF TABLE RELATIONSHIPS

In the world of data modeling, especially within tools like Power BI, understanding the basics of table relationships is akin to learning the grammar of a language. These relationships are fundamental to how you interpret and interact with your data. Central to these relationships are concepts like keys and their types, the nature of the connections between tables, and the impact of these connections on data filtering and analysis. Here’s a closer look at these foundational elements:

■■ Primary and foreign keys: At the heart of any table relationship is the concept of keys. A primary key is a unique identifier for a record in a table. In contrast, a foreign key in one table points to the primary key in another table, establishing a link between them. It’s this connection that facilitates data retrieval across multiple tables.

■■ One-to-many and many-to-one relationships: These are the most common types of relationships you’ll encounter. In a one-to-many relationship, a single record in the first table can relate to multiple records in the second table, but not vice versa. Conversely, in a many-to-one relationship, multiple records from the first table correspond to a single record in the second table (a small DAX sketch follows this list).

■■ Many-to-many relationships: Occasionally, you might find that multiple records in one table relate to multiple records in another table. This complex relationship type, known as many-to-many, was historically handled using bridge tables, but Power BI now offers native support, simplifying its implementation.

■■ Cross-filtering and direction: Relationships in Power BI have a direction, dictating how filters are applied across related tables. This directionality ensures that when you apply a filter to one table, related tables are automatically filtered, preserving data context and integrity.
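
As a small illustration of how these relationships are used in practice, the calculated column below is a sketch that assumes hypothetical Sales and Products tables related on a ProductID key; the table and column names are illustrative only.

// Hedged sketch: a calculated column on the Sales table that follows the
// many-to-one relationship from Sales[ProductID] (foreign key) to
// Products[ProductID] (primary key) and pulls in the related product name.
Product Name = RELATED ( Products[ProductName] )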


Data categorization

Data categorization in Power BI involves assigning a specific type or category to a data column, thereby providing hints to Power BI about the nature of the data. This categorization ensures that Power BI understands and appropriately represents the data, especially when used in visuals or calculations.

WHY DATA CATEGORIZATION MATTERS

Data categorization in Power BI is pivotal for extracting maximum value from your datasets, impacting everything from visualization choices to data integrity. It enables Power BI to provide tailored visual suggestions, enhances the effectiveness of natural language queries, and serves as a critical tool for data validation. Here’s why categorizing your data correctly matters:

  • Enhanced visualization interpretation: By understanding the context of your data, Power BI can auto-suggest relevant visuals. Geographical data, for instance, would prompt map-based visualizations, while date fields might suggest time-series charts.
  • Improved search and Q&A features: Power BI’s Q&A tool, which allows natural language queries, leans on data categorization. When you ask for “sales by city,” the tool knows to reference geographical data due to the categorization of the City column.
  • Data validation: Categorization can act as a form of data validation. By marking a column as a date, any nondate values become evident, highlighting potential data quality issues.


COMMON DATA TYPES IN POWER BI

In Power BI, the clarity and accuracy of your reports hinge on understanding the core data types at your disposal. Each data type serves a specific purpose, shaping how information is stored, analyzed, and presented. The following are common data types:

  • Text: Generic textual data, from product names to descriptions
  • Whole number: Numeric data without decimal points, like quantities or counts
  • Decimal number: Numeric data with decimal precision, suitable for price or rate data
  • Date/time: Fields that have timestamps, including date, time, or both


COMMON DATA CATEGORIES IN POWER BI

In Power BI, data categorization plays a crucial role in tailoring visualizations and enhancing report interactivity. Here is a list of common data categories found in Power BI:

  • Geographical: Includes various subcategories such as Address, City, Country, Latitude, Longitude, Postal Code, etc., facilitating map-based visualizations
  • Web URL: Web addresses used as hyperlinks within Power BI reports

A PRACTICAL ILLUSTRATION

Suppose you’re working with a dataset that captures details of art galleries worldwide. The dataset includes the gallery name, city, country, average visitor count, website, and date of establishment.

  • “Gallery Name” would be categorized as Text.
  • “City” and “Country” fall under the Geographical category.
  • “Average Visitor Count” is a Whole Number.
  • “Website” is categorized as a Web URL.
  • “Date of Establishment” is assigned the Date/Time category.

With these categorizations in place, Power BI can effortlessly visualize a map pinpointing gallery locations worldwide or create a time-series chart showcasing the growth of galleries over the years.

Understanding and effectively leveraging data categorization in Power BI transform your data from raw numbers and text into a coherent story, adding layers of context, meaning, and depth.


Quick Measures

In the vast and intricate world of data analysis, time is of the essence. Power BI recognizes this, and in its arsenal of features aimed at streamlining your analytical journey, you’ll find Quick Measures, a tool designed to expedite the process of creating complex calculations. It’s about making what was once convoluted accessible and swift. Dive into this section to discover the features of Quick Measures and how you can leverage them effectively.

Quick Measures is a compilation of prebuilt DAX formulas in Power BI that automate commonly used calculations. Instead of manually writing out a DAX expression for a particular metric, you can use Quick Measures to generate these formulas for you, based on your data model and your selected fields.
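
For context, the snippet below sketches the kind of DAX that a typical quick measure (here, a year-to-date total) generates on your behalf; the Sales and Date table names are illustrative assumptions, not a prescribed model.

// Roughly what a "year-to-date total" quick measure produces, assuming
// hypothetical Sales[Amount] and 'Date'[Date] columns exist in the model.
Sales YTD =
TOTALYTD ( SUM ( Sales[Amount] ), 'Date'[Date] )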


News and commentary about the exam objective updates

The updates to the DP-900 exam objectives effective February 1, 2024, reveal a few noteworthy changes and refinements compared to the previous version. The following is commentary on each of the updates:

Audience Profile

■■ Before & After Update: The target audience remains consistent. The exam is aimed at candidates new to working with data in the cloud, requiring familiarity with core data concepts and Microsoft Azure data services.


Describe Core Data Concepts (25–30%)

■■ Before & After Update: This section remains largely unchanged, focusing on representing data (structured, semi-structured, unstructured), data storage options, and common data workloads (transactional, analytical). The roles and responsibilities associated with these workloads are also consistently covered.

Identify Considerations for Relational Data on Azure (20–25%)

■■ Before & After Update: Both versions cover relational concepts, including features of relational data, normalization, SQL statements, and common database objects. A notable change is the explicit mention of the “Azure SQL family of products” in the updated objectives, offering a clearer focus on specific Azure services.

Describe Considerations for Working with Non-Relational Data on Azure (15–20%)

■■ Before & After Update: This section remains consistent in both versions, covering Azure storage capabilities (Blob, File, Table storage) and Azure Cosmos DB features. The emphasis on understanding Azure’s storage solutions and Cosmos DB’s use cases and APIs continues to be a crucial part of this section.

Describe an Analytics Workload on Azure (25–30%)

■■ Before Update: This section previously included details on Azure services for data warehousing, real-time data analytics technologies (Azure Stream Analytics, Azure Synapse Data Explorer, Spark Structured Streaming), and data visualization in Power BI.

■■ After Update: The updated objectives maintain the focus on large-scale analytics, data warehousing, and real-time data analytics but have removed specific mentions of technologies like Azure Stream Analytics, Azure Synapse Data Explorer, and Spark Structured Streaming. Instead, there’s a broader reference to “Microsoft cloud services for real-time analytics,” suggesting a more general approach. The section on Power BI remains similar, emphasizing its capabilities, data models, and visualization options.

General Observations:

■■ The updates indicate a shift toward a more generalized and possibly up-to-date overview of Azure services, especially in the analytics workload section.
■■ The explicit mention of the Azure SQL family of products under relational data shows an emphasis on Azure-specific services.
■■ Overall, the changes seem to align the exam more closely with current Azure offerings and trends in cloud data management without significantly altering the core content or focus areas of the exam.

These updates suggest a continued emphasis on ensuring that candidates have a well-rounded understanding of Azure’s data services, both relational and non-relational, along with a solid grasp of analytical workloads as they pertain to Azure’s environment.



Azure Data Explorer

As the digital age progresses, the influx of data has transformed from a steady stream into a roaring torrent. Capturing, analyzing, and acting upon this data in real time is not just a luxury but a necessity for businesses to remain competitive and relevant. Enter Azure Data Explorer, a service uniquely equipped to manage, analyze, and visualize this deluge of information. This section is your comprehensive guide to understanding and harnessing its immense potential.

WHAT IS AZURE DATA EXPLORER?

Azure Data Explorer (ADX) is a fast, fully managed data analytics service for real-time analysis on large volumes of streaming data. It brings together big data and analytics into a unified platform that provides solutions to some of the most complex data exploration challenges.

Here are its key features and benefits:

■■ Rapid ingestion and analysis: One of the hallmarks of Azure Data Explorer is its ability to ingest millions of records per second and simultaneously query across billions of records in mere seconds. Such speed ensures that you’re always working with the most recent data.

■■ Intuitive query language: Kusto Query Language (KQL) is the heart of Azure Data Explorer. If you’ve used SQL, transitioning to KQL will feel familiar. It allows you to write complex ad hoc queries, making data exploration and analysis a breeze.

■■ Scalability: ADX can scale out by distributing data and query load across multiple nodes. This horizontal scaling ensures that as your data grows, your ability to query it remains swift.

■■ Integration with other Azure services: ADX plays nicely with other Azure services, ensuring that you can integrate it seamlessly into your existing data infrastructure. Whether it’s ingesting data from Event Hubs, IoT Hub, or a myriad of other sources, ADX can handle it. Figure 4-18 shows the end-to-end flow for working in Azure Data Explorer and shows how it integrates with other services.

As a practical use case, imagine you’re overseeing the operations of a global e-commerce platform. Every click, purchase, and user interaction on your platform generates data. With Azure Data Explorer, you can ingest this data in real time. Using KQL, you can then run complex queries to gauge user behavior, analyze purchase patterns, identify potential website hiccups, and more, all in real time. By using this data-driven approach, you can make instantaneous decisions, be they related to marketing strategies or website optimization.
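
As a hedged sketch of that scenario, the KQL query below assumes a hypothetical ClickEvents table with Timestamp, PageName, and EventType columns; the names are illustrative rather than a real schema.

// Minimal KQL sketch: clicks and purchases per page over the last 15 minutes,
// bucketed into one-minute bins. Table and column names are hypothetical.
ClickEvents
| where Timestamp > ago(15m)
| summarize Clicks = count(), Purchases = countif(EventType == "purchase")
    by PageName, bin(Timestamp, 1m)
| order by Clicks desc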

Azure Data Explorer stands as a formidable tool in the data analytics space, empowering users to make the most of their data. Whether you’re a seasoned data analyst or just starting, ADX offers a blend of power and flexibility that can transform the way you view and utilize data.


FIGURE 4-18  Azure Data Explorer


Describe Types of Core Data Workloads

The volume of data that the world has generated has exploded in recent years. Zettabytes of data are created every year, and its variety is seemingly endless. Competing in a rapidly changing world requires companies to utilize massive amounts of data that they have only recently been exposed to. What’s more, with edge devices that allow Internet of Things (IoT) data to move seamlessly between the cloud and local devices, companies can make valuable data-driven decisions in real time.

It is imperative that organizations leverage data when making critical business decisions. But how do they turn raw data into usable information? How do they decide what is valuable and what is noise? With the power of cloud computing and storage costs falling every year, it’s easy for companies to store all the data at their disposal and build creative solutions that combine a multitude of different design patterns. For example, modern data storage and computing techniques allow sports franchises to create more sophisticated training programs by combining traditional statistical information with real-time data captured from sensors that measure features such as speed and agility. E-commerce companies leverage clickstream data to track a user’s activity while on their website, allowing them to build custom experiences for customers to reduce customer churn.

The exponential growth in data and the number of sources organizations can leverage to make decisions have put an increased focus on making the right solution design decisions. Deciding on the most optimal data store for the different types of data involved and the most optimal analytical pattern for processing data can make or break a project before it ever gets started. Ultimately, there are four key questions that need to be answered when making design decisions for a data-driven solution:

  • What value will the data powering the solution provide?
  • How large is the volume of data involved?
  • What is the variety of the data included in the solution?
  • What is the velocity of the data that will be ingested in the target platform?