Describe Types of Core Data Workloads

The volume of data that the world generates has exploded in recent years. Zettabytes' worth of data are created every year, and the variety of that data is seemingly endless. Competing in a rapidly changing world requires companies to utilize massive amounts of data that they have only recently been exposed to. What's more, edge devices that allow Internet of Things (IoT) data to move seamlessly between the cloud and local devices enable companies to make valuable data-driven decisions in real time.

It is imperative that organizations leverage data when making critical business decisions. But how do they turn raw data into usable information? How do they decide what is valuable and what is noise? With the power of cloud computing and storage costs falling every year, it is easy for companies to store all the data at their disposal and build creative solutions that combine a multitude of design patterns. For example, modern data storage and computing techniques allow sports franchises to create more sophisticated training programs by combining traditional statistical information with real-time data captured from sensors that measure features such as speed and agility. E-commerce companies leverage clickstream data to track a user's activity on their websites, allowing them to build custom experiences that reduce customer churn.

The exponential growth in data, and in the number of sources organizations can leverage to make decisions, has put an increased focus on making the right solution design decisions. Choosing the optimal data store for the types of data involved and the optimal analytical pattern for processing that data can make or break a project before it ever gets started. Ultimately, there are four key questions that need to be answered when making design decisions for a data-driven solution:

  • What value will the data powering the solution provide?
  • How large is the volume of data involved?
  • What is the variety of the data included in the solution?
  • What is the velocity of the data that will be ingested in the target platform?

Data Value

The first question that needs to be answered when designing a data-driven solution is, what value will be gained by processing, storing, and analyzing potential data sources? What questions is the business trying to answer? While it is true that having more data can provide new and more fine-grained insights, it can sometimes come at a cost. Organizations must give considerable thought to what data is valuable and what data is not, all the while trying to minimize the amount of time spent in the decision-making process.

Designing a data-driven solution requires everyone involved to focus on deriving value from every process in the solution. This means that data architects must know the business goal of the solution from the beginning. Is this going to be a transactional database that provides the backend for a business’s e-commerce site? Will it be a data warehouse aggregating data from multiple source systems to provide a holistic view of a business’s performance? Or will the data store need to be able to ingest bursts of IoT data for real-time analytics? To answer these questions, we first need to understand the different types of data stores and the scenarios for which each one is best suited.

Relational Databases

Relational databases organize data into tables that can be linked based on data common to each table. These relationships allow users to easily query multiple tables at once by joining their columns together. Database tables store data as rows and are organized into a set number of columns. Each column is defined with a specific data type, such as integer or string, so that new or modified rows are accepted only if their values conform to those types. For example, if a database table has a name column that only accepts string values, then trying to insert a number into that column will fail. Relational databases allow designers to go a step further and define constraints on columns so that data must meet predefined criteria. This predefined structure that data in relational databases must adhere to is called a schema and is fundamental to how users query relational data.
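To make this concrete, here is a minimal T-SQL sketch of a table whose data types and CHECK constraint enforce a schema; the table, columns, and values are hypothetical, not from the text:

-- Hypothetical product table: data types and a constraint define the schema
CREATE TABLE dbo.Product (
    ProductID   INT            NOT NULL PRIMARY KEY,
    ProductName NVARCHAR(50)   NOT NULL,                   -- strings only
    ListPrice   DECIMAL(10, 2) NOT NULL,
    CONSTRAINT CK_Product_ListPrice CHECK (ListPrice >= 0) -- predefined criteria
);

-- Succeeds: the row conforms to the schema
INSERT INTO dbo.Product VALUES (1, N'Mountain Bike', 1499.99);

-- Fails: 'free' cannot be converted to DECIMAL(10, 2)
INSERT INTO dbo.Product VALUES (2, N'Helmet', 'free');

-- Fails: a negative price violates the CHECK constraint
INSERT INTO dbo.Product VALUES (3, N'Gloves', -5.00);

The failed inserts leave no partial data behind; the schema rejects them outright.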

Users querying a relational database use a version of the Structured Query Language (SQL) to issue queries to the database. Most relational database management systems (RDBMSs) have their own vendor-specific variation of SQL that is based on the ANSI-standardized version of the language. For example, the Microsoft suite of RDBMSs (e.g., SQL Server, Azure SQL Database, Azure SQL Managed Instance) can be interacted with using Transact-SQL (T-SQL). T-SQL provides four flavors of commands for query development, combined in the sketch that follows this list:

  • Data Manipulation Language (DML) commands are used to manipulate data in database tables. DML commands include SELECT, INSERT, UPDATE, and DELETE.
  • Data Definition Language (DDL) commands are used to define RDBMS objects such as databases, tables, views, stored procedures, and triggers. DDL commands include CREATE, ALTER, and DROP.
  • Data Control Language (DCL) commands are used to manage permissions and access control for users in a database. DCL commands include GRANT, REVOKE, and DENY.
  • Transaction Control Language (TCL) commands are used to explicitly manage and control transaction execution, ensuring that a transaction completes successfully without violating database integrity. TCL commands include BEGIN TRANSACTION, COMMIT TRANSACTION, and ROLLBACK TRANSACTION.
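A minimal sketch showing one command of each flavor, reusing the hypothetical dbo.Product table from the previous sketch (the ReportingRole database role is also made up):

-- DDL: alter the table definition by adding a column
ALTER TABLE dbo.Product ADD Color NVARCHAR(20) NULL;

-- DML: read and modify rows
SELECT ProductID, ProductName FROM dbo.Product;
UPDATE dbo.Product SET ListPrice = 1399.99 WHERE ProductID = 1;

-- DCL: grant a database role read-only access to the table
GRANT SELECT ON dbo.Product TO ReportingRole;

-- TCL: group statements into one explicitly controlled transaction
BEGIN TRANSACTION;
    DELETE FROM dbo.Product WHERE ProductID = 1;
COMMIT TRANSACTION;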

Relational database design considerations largely depend on what the database will be supporting. A database that supports a business's e-commerce site and needs to log every transaction made by a customer has vastly different requirements than a database that supports a reporting application. While there are many different design patterns for data-driven solutions, most of them fall into one of two broad categories: transactional processing systems or analytical systems.

Transactional Processing Systems

Transactional processing systems, also known as online transaction processing (OLTP) systems, are used to capture the business transactions that support the day-to-day operations of an organization. Transactions can include retail purchases logged to point-of-sale (PoS) systems as purchases are made, orders placed through e-commerce platforms, or even ticket scans at a sports or concert venue. Transactions do not consist only of newly inserted data; they also include updates and deletes. While each transaction is a small and unique unit of work, OLTP systems need to be able to handle millions of transactions a day. This requires OLTP systems to be designed in a way that optimizes how fast transactions are applied to them. To support this requirement, OLTP data stored in relational databases is split into small chunks and stored in separate database tables. Splitting data into multiple tables allows the system to update only the tables that need to change, all the while maintaining relationships to data in associated tables that are not updated by that transaction. This is commonly referred to as normalizing data.
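To make normalization concrete, the following hypothetical T-SQL sketch splits order data into three related tables; inserting a new order touches only the order tables while the customer row is merely referenced:

CREATE TABLE dbo.Customer (
    CustomerID INT           NOT NULL PRIMARY KEY,
    FullName   NVARCHAR(100) NOT NULL
);

CREATE TABLE dbo.SalesOrder (
    OrderID    INT       NOT NULL PRIMARY KEY,
    CustomerID INT       NOT NULL REFERENCES dbo.Customer (CustomerID), -- link, not a copy
    OrderDate  DATETIME2 NOT NULL
);

CREATE TABLE dbo.SalesOrderLine (
    OrderID   INT NOT NULL REFERENCES dbo.SalesOrder (OrderID),
    LineNum   INT NOT NULL,
    ProductID INT NOT NULL,
    Quantity  INT NOT NULL,
    PRIMARY KEY (OrderID, LineNum)
);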

Transactional databases must adhere to ACID properties (atomicity, consistency, isolation, durability) to ensure that each transaction is reliable. These properties, illustrated in the sketch after this list, can be defined as follows:

  • Atomicity guarantees that each transaction is treated as a single unit of work that either succeeds completely or fails completely. If any part of an insert, delete, or update operation in a transaction fails, the entire transaction fails and the database is left unchanged.
  • Consistency ensures that data affected by a transaction is valid according to all predefined rules. Inserting or altering data will only be successful if it maintains the appropriate predefined data types and constraints of the affected columns.
  • Isolation ensures that concurrent transactions do not affect one another.
  • Durability guarantees that once a transaction has been committed, it will remain committed even if there is a system failure.
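As a sketch of atomicity in T-SQL (the inventory table and its values are hypothetical), the following transaction moves stock between two warehouses; either both updates commit or neither does:

BEGIN TRY
    BEGIN TRANSACTION;

    UPDATE dbo.Inventory SET Quantity = Quantity - 10
    WHERE WarehouseID = 1 AND ProductID = 42;

    UPDATE dbo.Inventory SET Quantity = Quantity + 10
    WHERE WarehouseID = 2 AND ProductID = 42;

    COMMIT TRANSACTION;  -- both updates become durable together
END TRY
BEGIN CATCH
    -- any error undoes both updates, leaving the database unchanged
    IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
END CATCH;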

Adhering to ACID properties is critical for OLTP systems that support many concurrent users reading and writing at the same time. They need to be able to process transactions in isolation, all the while ensuring that users querying data can retrieve a consistent view of it even as it is being altered. Many RDBMSs implement relational consistency and isolation by applying locks to data when it is updated. A lock prevents other processes from reading data until the lock is released, and it is only released when the transaction is committed or rolled back. Extensive locks caused by long-running queries can lead to poor query performance. To mitigate the issues caused by table locks, SQL Server and Azure SQL Database give database administrators (DBAs) the ability to specify the degree to which one transaction must be isolated from data modifications made by other transactions. Isolation levels determine whether queries are allowed to return data that has not yet been committed by an insert, update, or delete in exchange for faster return times. More on isolation levels can be found in Chapter 2, “Relational Databases in Azure.”
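A minimal sketch of adjusting the isolation level in T-SQL, reusing the hypothetical dbo.SalesOrder table from the earlier sketch; READ UNCOMMITTED trades consistency for fewer lock waits:

-- Allow dirty reads so this session does not wait on other sessions' locks
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
SELECT COUNT(*) FROM dbo.SalesOrder;  -- may count uncommitted rows

-- Return to the SQL Server default before queries that need committed data only
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;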

Analytical Systems

Analytical systems are designed to support business users who need to make informed business decisions from large amounts of data. For example, decisions made from analytical systems can drive the placement of an item in a retail store or on an e-commerce site based on the item's seasonal popularity. Most analytical systems ingest data from multiple sources, such as OLTP systems, and perform transformations that apply business rules to cleanse and aggregate the data so that it is useful for decision making. Decision makers usually don't need all the details of a specific transaction, so data architects design analytical systems to include only the relevant information from OLTP systems. Data in analytical systems is also denormalized so that users querying it are not burdened with developing complex queries that join multiple tables together. Analytical systems such as data warehouses are updated either by processing batches of data at a time or by aggregating data in real time from sources that can stream data. These different data-processing techniques are discussed further in the section “Data Velocity” later in this chapter.

The two types of analytical systems are data warehouses and online analytical processing (OLAP) systems. Data warehouses serve as the single source of truth for different functional areas within a business. Good data warehouses consolidate multiple disparate data sources and are optimized for reading data, making them perfect data sources for reporting applications. Data warehouses are typically relational data stores such as Azure Synapse Analytics dedicated SQL pool or Azure SQL Database. OLAP models are typically business intelligence (BI) models that apply business logic and pre-aggregations to data warehouse data to create a layer of abstraction between the data warehouse and a reporting platform. Azure Analysis Services and Power BI tabular models are examples of OLAP technologies that can create these types of data models.

Something important to note is that data warehouses and OLAP models are not dependent on one another. While you can build an OLAP model from a data warehouse, reports can be built directly from data warehouse data, and OLAP models can be built from data sources other than a data warehouse. More on data warehouses and OLAP models can be found in Chapter 5, “Modern Data Warehouses in Azure.”

Typical data warehouses and OLAP tabular models store data using a star schema. Star schemas make data easy to report against because of the way data is denormalized. Measurements and metrics are consolidated in fact tables, which are connected to tables that contain descriptive attributes for each measurement, known as dimension tables. For example, an Internet sales fact table can be associated with multiple dimension tables: a date dimension that provides granular information about when a purchase was made, a customer dimension that includes specific information about the customer who made the purchase, and a product dimension that describes the attributes of the product that was sold. The inherent simplicity of a star schema's design allows analysts to easily create aggregations on fact tables while joining the necessary dimension tables to answer different business questions about the data.
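A hedged example of the kind of query a star schema enables, using hypothetical AdventureWorks-style fact and dimension table names:

-- Aggregate the fact table by attributes from two joined dimensions
SELECT
    d.CalendarYear,
    p.ProductCategory,
    SUM(f.SalesAmount) AS TotalSales
FROM dbo.FactInternetSales AS f
JOIN dbo.DimDate    AS d ON f.OrderDateKey = d.DateKey
JOIN dbo.DimProduct AS p ON f.ProductKey   = p.ProductKey
GROUP BY d.CalendarYear, p.ProductCategory
ORDER BY d.CalendarYear, TotalSales DESC;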

Nonrelational Data Stores

There is a wide variety of data that doesn't fit in a relational model. Nonrelational data stores, also known as NoSQL (Not Only SQL) stores, are designed for this kind of data. Some solutions require more flexible data models than a relational database offers and can afford to trade ACID compliance for speed and horizontal scale. NoSQL databases can handle volatile data that is written and read quickly better than relational databases because they don't force the data to conform to a specific structure.

Binary objects such as images, videos, and PDFs are also considered nonrelational data. While relational databases such as SQL Server can store files such as the PDF copy of this book using features such as FILESTREAM, they are not the optimal solution for file storage. Object stores are optimized for binary file storage and can easily serve these files to applications. They can also be used to create highly scalable data lake ecosystems for big data processing solutions.

NoSQL Databases

NoSQL databases do not impose a schema on the data they store, allowing data to maintain its natural format as it is ingested. In fact, one of their primary benefits is that users designing a NoSQL database solution do not need to define the schema ahead of time. This flexibility makes NoSQL databases the ideal choice for solutions that require millisecond response times and need to be able to scale rapidly. Scenarios where NoSQL databases are potentially better options than relational databases include ingesting and analyzing bursts of data from IoT sensors, storing product catalog data for an e-commerce site's web search functionality, and storing user-generated content for web, mobile, and social media applications.

Instead of storing data as rows in a table as in a relational database, data is stored as entities in collections or containers. Unlike rows in a table, entities in the same collection can have a different set of fields. This flexibility allows for several different implementations of NoSQL databases depending on the solution requirements. Generally, these implementations fall into the following four categories:

  • Key-value stores are the simplest types of NoSQL database for inserting and querying data (see Figure 1.1). Each piece of data contains a key and a value. The key serves as a unique identifier for the piece of data, and the value contains the data. Values can be scalar values or complex structures such as a JSON array. When applications are querying data from key-value stores, they issue queries that specify the keys to retrieve the values. Figure 1.1 is an example of a phone directory that stores one or more phone numbers per person in a key-value store. Examples of key-value stores include Python dictionary objects, Azure Table storage, and the Azure Cosmos DB Table API.

FIGURE 1.1 Key-value store

  • Document databases are the most common type of NoSQL database (see Figure 1.2). Pieces of data are defined as documents and are typically stored in JSON, XML, YAML, or BSON format. Each document includes a document key that serves as a unique identifier for management and query lookups. Unlike a key-value store, which can only retrieve data by searching on the key, applications querying a document database can perform lookups on a document's key and/or one or more of its fields to retrieve specific sets of data (a sample query appears after this list). This makes document databases a better option for applications that need to be more selective. Figure 1.2 illustrates an example of customer orders stored as documents in a document database. Examples of document databases include MongoDB and the Azure Cosmos DB Core (SQL) API.

FIGURE 1.2 Document database

  • Columnar databases appear conceptually similar to relational databases (see Figure 1.3). They organize data into rows and columns but denormalize the data so that it is divided into multiple column families, each of which holds a set of columns that are logically related. Figure 1.3 is an example of a bicycle company's product information stored in a columnar format. An example of a columnar database is the Azure Cosmos DB Cassandra API.

FIGURE 1.3 Columnar database

  • Graph databases store data as entities and focus on the relationships those entities have with each other (see Figure 1.4). Entities are defined as nodes, while the relationships between them are defined as edges. Applications querying a graph database do so by traversing the network of nodes and edges, analyzing the relationships between entities. While relational databases can accomplish similar goals, large graph databases can traverse relationships very quickly, bypassing the need to perform multiple join operations on many tables. Figure 1.4 illustrates an example of a graph database that stores an organization's personnel chart. The entities represent different job titles and departments, while the edges represent how each entity is related. Examples of graph databases include Neo4j and the Azure Cosmos DB Gremlin API.

FIGURE 1.4 Graph database
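As promised above, here is a sketch of a document database lookup using the Azure Cosmos DB Core (SQL) API; the container alias c is conventional, while the field names are hypothetical stand-ins for the order documents in Figure 1.2:

-- Filter on fields inside each document, not just on the document key
SELECT c.id, c.customerName, c.orderTotal
FROM c
WHERE c.orderStatus = 'shipped' AND c.orderTotal > 100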

Object Storage

Object data stores such as Azure storage accounts store huge volumes of data in text and binary format. You can think of a storage account as being like a shared folder on an organization's local network. Unlike local file shares, storage accounts are highly scalable and give organizations the freedom to add whatever data they want without having to worry about adding hardware. Azure-based solutions that rely on data stored in files leverage Azure storage accounts in some form, as in the following scenarios:

  • Storing images or videos that are analyzed by deep learning models or that are served to a website
  • Storing files such as JSON, Avro, Parquet, CSV, or TSV that are used for distributed processing in big data solutions
  • Storing data for backup and restore, disaster recovery, and archiving
  • Storing telemetry information as log files that can be used for near real-time analysis

Storage accounts can service a wide variety of object store use cases. Depending on the scenario, you may decide to use one of the following storage account services to store binary objects:

  • Azure Blob Storage is the most common service for object storage in Azure. Images or videos awaiting analysis, backups, and files used in distributed processing solutions can all be stored in Blob Storage. It can store exabytes' worth of data and offers different access tiers to store data in the most cost-effective manner.
  • Azure Data Lake Storage Gen2, also known as ADLS, is a set of capabilities built on top of Blob Storage specifically for distributed analytics solutions. The key feature of ADLS that allows for quick and efficient data access is its hierarchical namespace, which organizes files into a hierarchy of directories. This makes it possible to keep raw, cleansed, and aggregated copies of data side by side without having to sacrifice one copy for the next (see the query sketch after this list).
  • Azure Files is a fully managed file share solution in Azure. File shares are accessible via the Server Message Block (SMB) protocol or the Network File System (NFS) protocol. They can be mounted concurrently by cloud or on-premises systems.
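As an example of how these services fit into analytics solutions, files landed in ADLS can be queried in place. The following sketch uses the OPENROWSET function of an Azure Synapse serverless SQL pool; the storage account name and path are hypothetical:

-- Read Parquet files directly from the data lake, no load step required
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/2023/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;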

Data Volume

Data volume refers to the amount of data that needs to be analyzed and processed. Access to larger datasets can provide just as many headaches as it does clarity. Large datasets stored in poorly designed databases, or read with poorly written queries, can cause applications to perform so badly that they come to a screeching halt. Traditional relational databases such as SQL Server or Azure SQL Database can be used for large data warehouses if they leverage a well-thought-out data model with appropriate indexes and partitions, and if applications read data with well-written queries. However, there is a limit to the amount of data that traditional database technologies and processing patterns can handle.

It is critical that the right data storage technologies and processing patterns are chosen in the design phase, especially if the datasets are going to be large in volume. Even the most properly tuned relational databases will begin to perform poorly after a certain size threshold. Symmetric multiprocessing, or SMP, systems such as SQL Server and Azure SQL Database are characterized by a single instance of an RDBMS that shares all the resources (CPU, memory, and disk). SMP systems can scale up to serve gigabytes (GB) and terabytes (TB) worth of data but hit a wall when resource limits are reached. Massively parallel processing, or MPP, systems such as Azure Synapse Analytics dedicated SQL pool are designed to process large datasets. MPP systems are distributed parallel processing solutions, meaning they can not only scale up by adding more compute resources but can also scale out by adding more nodes to the system.

TIP MPP databases can be less performant and more costly than an SMP database when the dataset size is small. Consider using an SMP database if the data warehouse is never going to be more than 1TB and queries perform more lookups than large-scale aggregations.

You can think of the data processing differences between SMP and MPP systems in terms of how a grocery store restocks its shelves. One employee can efficiently restock the shelves in a single aisle in a relatively short amount of time. However, restocking every aisle in a large store can take hours or even days if there is only one employee available for the task. In most cases, floor managers will assign aisles to different employees. This drastically reduces the amount of time it takes to restock the entire store since many employees are restocking shelves in parallel. This is how MPP systems such as Azure Synapse Analytics, Azure HDInsight, and Azure Databricks operate. The underlying architecture includes a driver/control node that divides large processing tasks into multiple operations and assigns them to different worker/compute nodes. Data is stored in a distributed file system that is split into chunks to be processed by the different worker nodes.

The ability to separate compute and storage allows MPP systems to scale very quickly. Adding nodes to an Azure Synapse Analytics dedicated SQL pool or an Azure Databricks cluster can happen without having to repartition data. Data is instead persisted in a distributed file system that shards it into partitions, or distributions, to optimize the performance of the system. Cloud-based object stores such as Azure Blob Storage and Azure Data Lake Storage Gen2 generally serve as the basis for distributed file systems. These technologies are highly scalable by design, making it easy to store the massive amounts of data used by MPP systems.
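A sketch of how a distribution is declared in an Azure Synapse Analytics dedicated SQL pool; rows are hash-distributed on ProductKey so that each compute node can aggregate its own shard in parallel (table and column names are hypothetical):

CREATE TABLE dbo.FactSales
(
    ProductKey   INT            NOT NULL,
    OrderDateKey INT            NOT NULL,
    SalesAmount  DECIMAL(18, 2) NOT NULL
)
WITH
(
    DISTRIBUTION = HASH(ProductKey),  -- shard rows across distributions by key
    CLUSTERED COLUMNSTORE INDEX       -- columnar storage suited to aggregation
);

Hash-distributing on a column that is frequently joined or aggregated keeps related rows on the same node and reduces data movement between nodes.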

While technologies such as Azure Synapse Analytics and Azure Databricks are ideal for modern data warehouse and data processing needs, they aren't designed to store highly transactional data. Distributed file systems are great for storing data that will be used to create aggregated analysis but are not optimized for transactional data that is inserted or updated one row at a time. In cases where large amounts of transactional data, such as many thousands of transactions per second, need to be stored and globally distributed, it can be beneficial to use a NoSQL database such as Azure Cosmos DB. Transactional systems that use NoSQL databases relax ACID properties in favor of schema flexibility and horizontal scale across multiple nodes. This provides benefits similar to MPP systems in that there are more compute resources available for processing and storage. The trade-off is that the responsibility for maintaining transaction consistency falls on application developers, since NoSQL databases do not strictly follow ACID properties.

Data Variety

Data variety refers to the types of data involved. While you may think of data as just being entries in a spreadsheet, it can come in many different forms. Transactions captured from PoS systems, events generated from sensors, and even pictures can generate valuable insights that businesses can use to make decisions. Ultimately, data falls into three categories: structured, semi-structured, and unstructured.

Structured Data

Structured data is tabular data made up of rows and columns. Data in an Excel spreadsheet or a CSV file is considered structured, as is data in a relational database such as SQL Server, Oracle, or MySQL. Structured data fits a well-defined schema, which means that every row in a table will have the same number of columns, even if one or more of those columns have no value for a given row. This property of every row in a structured dataset having the same number of columns is known as schema integrity, and it is what gives users the ability to create relationships between tables in a relational database. More on this later in this chapter and in Chapter 2.

While schema integrity allows relational data to be easily queried and analyzed, it forces data to follow a rigid structure, which in turn forces users to consider how volatile their data will be over time. Considering how your schema will evolve, and how the source data's schema differs from your target solution's, may force you to develop sophisticated data pipelines to ensure that this volatility does not negatively impact your solution.

Figure 1.5 illustrates an example of structured data. The data in the figure is product information from the publicly available AdventureWorks2019 database.

FIGURE 1.5 Structured data