Author name: Rohit

Data Science

Power BI Dataflow: Streamline Data Transformation

Power BI is one of the most widely used visualization tools across industries today. Its ability to present data through rich charts and visuals sets it apart. Data can be imported from a huge range of source systems, whether a database, a file, or SharePoint, in no time. We can also manipulate data after it is loaded into the Power BI Desktop file, but what if we want to reuse a table shaped in Power Query for a new report? Ordinarily we would have to repeat that transformation separately for each report, which quickly becomes tedious.

Dataflow is the solution here. A dataflow is a collection of tables created in a Power BI service workspace. With dataflows, we get data transformation functionality in the Power BI service itself: all the operations available in Power BI Desktop can be performed in the Power BI service, and all Power Query operations can run independently in the cloud. Normally, we load data into Power BI Desktop, build visualizations, and then publish the report to the Power BI service. When we publish a report, the service shows both the report and a dataset with the same name as the report; the dataset is the file associated with the Power BI report. A dataflow is completely independent of any report's dataset, and its refresh is also performed independently in the cloud. A Power BI dataset stores data in-memory, whereas a dataflow uses Azure Data Lake Storage.

Image reference: https://docs.microsoft.com/en-us/power-bi/transform-model/dataflows/dataflows-introduction-self-service

When to use Dataflow:

1. To avoid maintenance cost: As discussed earlier, if we want the same table in multiple reports, we must fetch it from the source every time; and if we perform transformation operations on a table for one report and want the same operations on that table in another report, we have to repeat them again and again.
Isn't that a headache? It is. Dataflow gives us a centralized location in our workspace. Suppose we create a dataflow with the desired tables, perform all the ETL operations in the Power BI service itself, and then pull that dataflow into Power BI Desktop from our workspace. Then, whenever a table used in multiple reports needs a data update, we only need to change the dataflow, so the maintenance work becomes easy.

2. Independent table refresh: A dataflow is refreshed independently of any report refresh, and we can refresh specific tables on their own schedules. With dataflows, we can refresh more transactional tables hourly (or as needed) and other tables daily, weekly, or monthly, which frees us from the single report refresh.

How to Create a Dataflow:

To create a dataflow, follow the steps below:

Step 1: Open the Power BI service.
Step 2: Go to the "Workspaces" pane on the left-hand side of the screen and click it.
Step 3: Click on "Create a workspace".
Step 4: Give the workspace a suitable name and hit "Save". This creates a new workspace in the Power BI service.
Step 5: In the upper-left corner of the workspace window, click the (+) New icon and select Dataflow from the list. A new window opens, as in the screenshot.
Step 6: There is an option named "Define new tables"; we can create a fresh dataflow with it. Click "Add new tables" under "Define new tables". We then get a list of data source options, just like Get Data in Power BI Desktop.
Step 7: Select a data source as needed. Say we select SQL Server database as the source: clicking it opens a new window with connection settings and connection credentials. If we have an on-premises SQL Server database, we also need to install a data gateway from the Power BI service.
Step 8: After filling in the credentials, choose the desired tables from the database and click "Transform data". We now get a window similar to Power Query; every operation we do here mirrors what we do in Power BI Desktop, so all the ETL operations can be performed. When all the operations are done, click "Save and close".
Step 9: Provide a suitable name for the dataflow and click "Save". This creates our dataflow.

Note: Once the dataflow is created, we need to refresh its data. If we forget to refresh the data and then use that dataflow in some other dataflow, it will not show any data.

Link tables from other dataflows:

This option is useful in scenarios where we need to reuse the tables of an existing dataflow, so we don't need to add the same tables again and again. To reduce duplication, we choose the option "Link tables from other dataflows", which lets us use tables from an existing dataflow alongside other desired tables. To work with it, follow the steps below: click on the workspace, click (+) New, and choose Dataflow. From the options, click "Link tables from other dataflows". A window with connection credentials opens; after filling in all the fields, click "Sign in" and then click Next. A window listing our workspaces that contain dataflows appears. After selecting the desired tables from the dataflow, click "Transform data". To load other tables, click "Get data" and pick the desired data source, e.g., Excel workbook, SQL Server database, Text/CSV, etc. Let's get a table from a SQL Server database: when we click SQL Server database, we get a credentials window. After filling in the credentials, we can access all the tables of
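Since a dataflow's refresh is independent and scheduled on its own, it can also be triggered programmatically instead of from the portal. Below is a minimal sketch using the Power BI REST API's dataflow refresh endpoint; the workspace ID, dataflow ID, and token are placeholders, and the endpoint shape and `notifyOption` body should be verified against the current API documentation:

```python
# Hypothetical sketch: triggering a dataflow refresh through the Power BI REST API.
# The workspace (group) ID, dataflow ID, and access token below are placeholders.
import json
from urllib import request

API_ROOT = "https://api.powerbi.com/v1.0/myorg"

def dataflow_refresh_request(group_id: str, dataflow_id: str, token: str) -> request.Request:
    """Build the POST request that asks the Power BI service to refresh a dataflow."""
    url = f"{API_ROOT}/groups/{group_id}/dataflows/{dataflow_id}/refreshes"
    body = json.dumps({"notifyOption": "MailOnFailure"}).encode()
    return request.Request(
        url,
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": "application/json"},
        method="POST",
    )

req = dataflow_refresh_request("my-workspace-id", "my-dataflow-id", "aad-access-token")
print(req.full_url)
# request.urlopen(req) would actually start the refresh (needs a valid AAD token)
```

Scheduling this call hourly for transactional tables is one way to realize the "independent table refresh" benefit described above.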

Data Science

Streamlining Data Deployment with Power BI Pipelines

In today's world, data means power, and to use that power efficiently we need to analyse it properly. Handling a large amount of data and presenting it prudently, so that the end user can easily grasp the details, is a great responsibility. That data can take the form of reports, paginated reports, dashboards, datasets, and dataflows.

At times it gets hard to keep this content in sync between Dev, Test, and Production environments. Power BI lets us do this easily using deployment pipelines. Pipelines in Power BI are an intuitive and responsive service that notifies us of any changes between environments.

Deployment pipelines have three stages:
1. Development
2. Test
3. Production

Working with Deployment Pipelines

Step 1: Create a deployment pipeline

To create a pipeline, we have two options:

From the Deployment Pipelines tab: In the navigation pane on the left side of the Power BI service, choose Deployment pipelines, then click Create pipeline. A window pops up containing a dialogue box for the name and description of the pipeline. Fill in a suitable name and description and click the Create button.

From a workspace: From the workspace, select Create a pipeline. We get the same name-and-description window; fill in the fields with a suitable name and description and hit the Create button.

Step 2: Assign a workspace

To work with deployment pipelines, we need a Premium Per User workspace. To change a workspace into a Premium Per User workspace, follow the steps shown in the screenshot below. The workspace will then show a diamond icon, as in the screenshot, signifying that it is now a Premium Per User workspace.

After hitting the Create button, we get a window showing all three stages of the pipeline.
We can assign only one workspace to one pipeline. Click the Assign a workspace button. We get a dropdown where we can select the workspace name and any stage (Development, Test, or Production); it is not necessary to start with a particular stage, so choose the stage you want to start with and click Assign. As soon as we click Assign, we get an initial view of that workspace, such as its reports, datasets, and dashboards.

Deploying Content

Once we have the initial view of the workspace, we can deploy it to another stage. Deployment creates a duplicate workspace containing all the items of the source stage. To do that, click the Deploy button, and it will deploy all the content of the source stage to the target stage.

We can also deploy selective content. Select the Show more option, pick the items you want in the target stage, and hit the Deploy button; only the selected content appears in the target stage.

There is another option, "Select related", available after selecting specific content: the items related to the selected items get selected automatically. For example, a report has a dataset, so when we select the report and choose Select related, the dataset is selected automatically.

Comparing the content of stages

When we deploy content from one stage to another, a green indicator appears between the source and target stages; it means everything is in sync between them. If we make changes in the source or target, the indicator colour changes to orange. When there is a change in any stage, the changed item(s) get labels:

If an item exists on the source side but does not exist on the target side, it gets a label represented by a + sign.
If an item was the same on both the source and target sides but has changed on one side since deployment, it is labelled "Different". On the next deployment, the source-stage item is copied to the target side, regardless of where the changes were made. If an item exists on the target side but is missing from the source side, it gets the "Missing from" (-) label.

Deployment Rules

In deployment pipelines we can change the configuration of different stages. For example, if our development dataset points to the development database and we have deployed it to Test, but we want it to point to the test database after deployment, we can do that with deployment rules. To create deployment rules, follow these steps:

Step 1: Select the deployment settings option for the stage that needs to be modified.
Step 2: Select the dataflow, dataset, or paginated report from the list.
Step 3: There are two types of rules: 1. Data source rules 2. Parameter rules. Select the type of rule you want to add, then click Add rule to add the rule for the dataset, report, etc. Data source rules let us replace the data source with a new one. In parameter rules, we choose a parameter from the dropdown list and then choose the value to apply to that parameter.

When all the settings are done, hit Save and close. The settings are now applied to the selected stage, and the rules remain in effect until changed.

Backwards Deployment:

Sometimes we need
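The stage-comparison labels described above boil down to a set difference plus a version check. The following sketch is purely illustrative (it is not the Power BI API; item names and version markers are made up) but shows how the three labels arise:

```python
# Illustrative sketch (not the Power BI API): deriving the stage-comparison
# labels from two stages' item lists.
def compare_stages(source: dict, target: dict) -> dict:
    """Map each item name to the label a deployment pipeline would show.

    `source` and `target` map item names to a content hash / version marker.
    """
    labels = {}
    for name in source:
        if name not in target:
            labels[name] = "New (+)"            # only in source
        elif source[name] != target[name]:
            labels[name] = "Different"          # changed on one side
    for name in target:
        if name not in source:
            labels[name] = "Missing from (-)"   # only in target
    return labels

source_stage = {"Sales report": "v2", "Sales dataset": "v1"}
target_stage = {"Sales report": "v1", "Old dashboard": "v1"}
print(compare_stages(source_stage, target_stage))
# {'Sales report': 'Different', 'Sales dataset': 'New (+)', 'Old dashboard': 'Missing from (-)'}
```

Items absent from both label sets are in sync, which is when the pipeline shows the green indicator.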

Data Science

XML Parsing using Azure Data Factory

Overview

There can be scenarios where one may need to process data coming from XML files, and those XMLs can get quite complex in structure: data can sit multiple XML tag levels deep, with some tags also occurring multiple times at different depths. In such a case, if we do not have access to an XML schema (XSD or DTD), we can have a hard time getting the data loaded; moreover, there can be processing issues related to the size of the XML provided. Recently we got stuck in a similar scenario; getting the data processed was a challenging task, and after trying multiple approaches and numerous integration tools, ADF is what we finally went ahead with.

Sample XML Code

Before starting with anything, let's have a look at the sample XML file:

<Items xmlns="http://xmlns.oracle.com/apps/some/productModel/common/publicationServicesB6G/">
<ns1:Item xmlns:ns17="http://xmlns.oracle.com/apps/some/productModel/items/flexi/item/" xmlns:ns1="http://xmlns.oracle.com/apps/some/productModel/common/publicationService/" xmlns:ns4="http://xmlns.oracle.com/apps/some/productCatalogueManagement/advancedItems/flexi/egoItem/items/category/" xmlns:ns0="http://xmlns.oracle.com/adfe/service/types/" xmlns:ns18="http://xmlns.oracle.com/apps/some/productModel/items/flexi/itemGdf/" xmlns:ns9="http://xmlns.oracle.com/apps/some/productModel/items/itemRelationshipService/" xmlns:ns5="http://xmlns.oracle.com/apps/some/productModels/item/itemServicesV8/" xmlns:ns12="http://xmlns.oracle.com/apps/some/productModel/item/structure/structureServiceV8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="ns1:Item">
<ns1:Item>1234567890</ns1:Item>
<ns1:Organization>1234567890</ns1:Organization>
<ns1:OrganizationCode>TST</ns1:OrganizationCode>
<ns1:ItemClass>Training</ns1:ItemClass>
<ns1:Template xsi:nil="true"/>
<ns1:ItemNumber>APP563772</ns1:ItemNumber>
<ns1:ItemDescription>Test for complex XML load</ns1:ItemDescription>
<ns1:ApprovalStatus>Approved</ns1:ApprovalStatus>
<ns1:ItemCategory xmlns:ns3="http://xmlns.oracle.com/apps/some/" xsi:type="ns4:Ivcformula">
<ns4:InventoryItemId>123456789</ns4:InventoryItemId>
<ns4:OrganizationId>123456789</ns4:OrganizationId>
<ns4:StyleItemId>123456789</ns4:StyleItemId>
<ns4:MasterOrganizationId>123456789</ns4:MasterOrganizationId>
<ns4:Planner__and__TestCase xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/">
<ns2:ContextCode>Planning for Test Case</ns2:ContextCode>
<ns2:CreatedBy>Alpha.User</ns2:CreatedBy>
<ns2:CreationDate>2050-02-14T12:35:21.956</ns2:CreationDate>
<ns2:EffLineId>123456789</ns2:EffLineId>
<ns2:InventoryItemId>123456789</ns2:InventoryItemId>
</ns4:Planner__and__TestCase>
<ns4:ItemNumberSequencing xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/">
<ns2:ContextCode>Generation</ns2:ContextCode>
<ns2:CreatedBy>Alpha.User</ns2:CreatedBy>
<ns2:CreationDate>2050-02-14T12:35:23.121Z</ns2:CreationDate>
<ns2:EffLineId>123456789</ns2:EffLineId>
<ns2:InventoryItemId>123456789</ns2:InventoryItemId>
<ns2:LastUpdateDate>2022-02-14T10:35:25.291Z</ns2:LastUpdateDate>
<ns3:suffix>Beta</ns3:suffix>
<ns3:RedLinesuffix xsi:nil="true"/>
</ns4:ItemNumberSequencing>
<ns4:Manufacturing xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/">
<ns2:ContextCode>Manufacturing</ns2:ContextCode>
<ns2:CreatedBy>Alpha.User</ns2:CreatedBy>
<ns2:CreationDate>2050-02-14T12:35:23.124Z</ns2:CreationDate>
<ns2:EffLineId>1234567889</ns2:EffLineId>
<ns2:InventoryItemId>123456789</ns2:InventoryItemId>
</ns4:Manufacturing>
<ns4:Header xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/">
<ns2:ContextCode>Header Info</ns2:ContextCode>
<ns2:CreatedBy>AlphaC.User</ns2:CreatedBy>
<ns2:CreationDate>2050-02-14T12:35:23.128Z</ns2:CreationDate>
<ns2:LineId>987654321</ns2:LineId>
<ns2:InventoryItemId>987654321</ns2:InventoryItemId>
<ns3:quantity>1000</ns3:quantity>
</ns4:Header>
<ns4:Header xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/">
<ns2:ContextCode>Header Info</ns2:ContextCode>
<ns2:CreatedBy>AlphaB.User</ns2:CreatedBy>
<ns2:CreationDate>2050-02-15T12:35:23.128Z</ns2:CreationDate>
<ns2:LineId>258963147</ns2:LineId>
<ns2:InventoryItemId>258369147</ns2:InventoryItemId>
<ns3:quantity>12</ns3:quantity>
</ns4:Header>
<ns4:Header xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/">
<ns2:ContextCode>Header Info</ns2:ContextCode>
<ns2:CreatedBy>AlphaA.User</ns2:CreatedBy>
<ns2:CreationDate>2050-02-17T18:35:23.128Z</ns2:CreationDate>
<ns2:LineId>9514785362</ns2:LineId>
<ns2:InventoryItemId>9514785362</ns2:InventoryItemId>
<ns3:quantity>125</ns3:quantity>
</ns4:Header>
</ns1:ItemCategory>
<ns1:QuantityRevision xmlns:ns6="http://xmlns.oracle.com/apps/some/productModel/items/flex/itemRevision/" xmlns:ns7="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/flexi/egoItemEff/itemRevision/categoriy/">
<ns5:RevisionId>4564564564564</ns5:RevisionId>
<ns5:ItemId>12345678987</ns5:ItemId>
<ns5:Organization>1234567890</ns5:Organization>
<ns5:RevisionCode>00</ns5:RevisionCode>
<ns5:ItemRevisionDescription xsi:nil="true"/>
<ns5:RevisionReasonValue xsi:nil="true"/>
<ns5:ItemRevision>
<ns6:RevisionId>30005940156</ns6:RevisionId>
<ns6:Context xsi:nil="true"/>
<ns6:Context_DValue xsi:nil="true"/>
<ns6:NumOfSegments>5</ns6:NumOfSegments>
</ns5:ItemRevision>
<ns5:ItemRevision>
<ns6:RevisionId>1231231231</ns6:RevisionId>
<ns6:Context xsi:nil="true"/>
<ns6:Context_DValue xsi:nil="true"/>
<ns6:NumOfSegments>5555</ns6:NumOfSegments>
</ns5:ItemRevision>
<ns5:RevisionCategory xsi:type="ns7:Ivcformula">
<ns7:RevisionId>30005940156</ns7:RevisionId>
<ns7:CategoryCode>TestFail</ns7:CategoryCode>
<ns7:VersionId>-1</ns7:VersionId>
</ns5:RevisionCategory>
</ns1:QuantityRevision>
</ns1:Item>
</Items>

In this file, we have multiple XML tags, and the depth of the tags also varies:

Items -> Item -> ItemCategory -> Planner__and__TestCase
Items -> Item -> ItemCategory -> ItemNumberSequencing
Items -> Item -> ItemCategory -> Manufacturing
Items -> Item -> ItemCategory -> Header (has multiple occurrences)

This is a good example of a nested loop, where Header tags repeat at a certain level. In an ADF pipeline, we can proceed with two different approaches for our requirement:

· Copy data to the sink table without transforming anything in the data.
· Copy data with a data-cleansing activity involved.

The first is relatively simple using the Copy data activity: data is loaded into the sink (target) system without any modification. If we want some cleansing or modification of our data, we can proceed with the Dataflow activity, which lets us use various other transformations as per our requirement. One big difference between the Copy and Dataflow tasks, though, is that the Copy activity automatically flattens the XML tags present in the input file, whereas in a dataflow we must do it manually. For this demo we will load data from an XML file in blob storage to a SQL Server database.

Note: For the copy activity, we will start directly with creating a pipeline.

Copy data Activity

1. Create a new pipeline under the Pipelines menu. We can create a pipeline directly, or, to keep things organized, first create a folder and then create the pipeline inside it.
2. Under Move & transform we can see the Copy data activity. Drag and drop it onto the development pane.
3. Once the activity is on the canvas, we can rename it under the General tab.
4. Under the Source tab, we can create a new dataset or use an existing one.
When creating a new dataset, click New. In the New integration dataset pane, search for Azure Blob Storage and click Continue. This gives us the option to select a linked service: either create a new one or select from the existing list. To create a new one, click +New and enter the source connection details. Once the linked service is created and the connection is tested, we can proceed to select the file from blob storage: click Browse and select the loaded file. Under File path type, select "File path in dataset".

Under the Sink settings, we can similarly create a sink (target) dataset and linked service for our SQL Server.

Moving on to the Mapping tab: this is where the mapping of source and target fields is done, so it is important. First, import the source schema by clicking Import schema. Second, select the collection reference from the data in our source file; this should be the lowest-level repeating XML tag present in the document. Finally, mark the checkbox to tell the ETL engine that this is a collection reference (an array). Then map the input columns to the output columns as shown in the screenshot above.

There are times when an XML document contains a large set of columns, and mapping columns one by one gets tiresome. To get around this, we can open the JSON code of the pipeline and edit it manually to create the mapping: in the available JSON, we can change the source and target column information. If the JSON seems too complex, we can use the advanced editor to select the source and target columns; there we can search for the required column on the source side, assign it to any column on the target side, and also assign a data type to it. Once this much is done, we can save the pipeline and trigger or debug it to see if there are any errors.
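To make the flattening and collection-reference ideas concrete, here is a small Python sketch on a simplified, namespace-free version of the sample file (the element names mirror the sample; the code is illustrative, not part of ADF). Choosing `Header` as the collection reference means one output row per `Header` occurrence, with parent values repeated:

```python
# A simplified, namespace-free version of the sample above, showing how the
# repeating <Header> tags can be flattened into rows (what ADF's Copy activity
# does automatically and a dataflow must do explicitly).
import xml.etree.ElementTree as ET

sample = """
<Items>
  <Item>
    <ItemNumber>APP563772</ItemNumber>
    <ItemCategory>
      <Header><LineId>987654321</LineId><quantity>1000</quantity></Header>
      <Header><LineId>258963147</LineId><quantity>12</quantity></Header>
      <Header><LineId>9514785362</LineId><quantity>125</quantity></Header>
    </ItemCategory>
  </Item>
</Items>
"""

rows = []
root = ET.fromstring(sample)
for item in root.iter("Item"):
    item_number = item.findtext("ItemNumber")
    # one output row per repeating Header occurrence (the collection reference)
    for header in item.iter("Header"):
        rows.append({
            "ItemNumber": item_number,
            "LineId": header.findtext("LineId"),
            "quantity": int(header.findtext("quantity")),
        })

print(rows)  # three rows, one per Header, each carrying the parent ItemNumber
```
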
That concludes the copy activity; now we can go ahead, in detail, with the Dataflow task, the second method discussed earlier.

Dataflow

Let's create a new pipeline and add a Data flow activity to it:

1. Give the data flow a name in the Properties window. Select a value for the staging linked service and the storage folder; this is the staging location for PolyBase.
2. Edit the data flow activity to create the data transformation pipeline. As in our copy activity, select the Inline dataset and linked service. Under Source options, set the file mode to File and browse for the XML file. Also, since XML files use a lot of namespaces, we can give aliases to some of them so that the code stays manageable. This will help us a lot in
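The namespace-alias idea can be illustrated outside ADF with Python's ElementTree, where a prefix map plays the same role. The aliases (`pub`, `ego`) and the trimmed sample below are our own choices, not part of the original file or of ADF:

```python
# Illustrative sketch of the namespace-alias idea: map each long XML namespace
# URI to a short prefix so element lookups stay readable.
import xml.etree.ElementTree as ET

# short aliases (our choice) -> namespace URIs taken from the sample file
NS = {
    "pub": "http://xmlns.oracle.com/apps/some/productModel/common/publicationService/",
    "ego": "http://xmlns.oracle.com/apps/some/productCatalogueManagement/advancedItems/flexi/egoItem/items/category/",
}

sample = (
    '<pub:Item xmlns:pub="http://xmlns.oracle.com/apps/some/productModel/common/publicationService/">'
    "<pub:ItemNumber>APP563772</pub:ItemNumber>"
    "</pub:Item>"
)

item = ET.fromstring(sample)
# the alias is resolved through NS, so queries read "pub:ItemNumber" instead of
# "{http://xmlns.oracle.com/...}ItemNumber"
print(item.findtext("pub:ItemNumber", namespaces=NS))
```
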

Data Science

Azure SQL VM Best Practices: Storage

STORAGE

Microsoft Azure offers several types of storage options: Block Blobs, Azure Data Lake Storage, Managed Disks, Files, Queues, and Page Blobs. In general, we can use them for the tasks shown in the screenshot below: Azure Cloud Storage Solutions and Services | Microsoft Azure.

Disks can be classified as Managed (newer and more feature-rich) and Unmanaged (older/legacy). For our Azure SQL VM scenario, we will focus on managed disks.

First things first: there are currently four types of managed storage disk options available in Azure:
1. Standard HDD (S4 – S80)
2. Standard SSD (E1 – E80)
3. Premium SSD (P1 – P80)
4. Ultra Disks

When evaluating disks for our workloads, the primary considerations should be latency, IOPS, and bandwidth. The following table compares these storage options:

Image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types

As shown in the screenshot above, Standard HDD and Standard SSD are better suited for dev/test environments, backups, or workloads less sensitive to performance, simply because they do not have enough IOPS or throughput; they are not ideal for performance-sensitive workloads, and performance may vary depending on other factors. To learn more about managed disks, check the Microsoft storage documentation (https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd).

So we will start with the premium options.

Premium SSD: Azure Premium SSDs provide high-performance, low-latency disk support for virtual machines (VMs) with input/output (IO)-intensive workloads, and are a fit for mission-critical production applications. Premium SSD disks are designed to deliver low single-digit-millisecond latency and the target IOPS and throughput specified in the table below 99.9% of the time. Premium SSDs can only be used with VM series that are premium-storage compatible.
Image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd

When you provision a premium storage disk, unlike standard storage, you are guaranteed the capacity, IOPS, and throughput of that disk. For example, if you create a P40 disk, Azure provisions 2,048 GB of storage capacity, 7,500 IOPS, and 250 MB/s of throughput for that disk; your application can use all or part of that capacity and performance. Premium disks also have a disk reservation option starting at P30.

What are Azure disk reservations? Disk reservation is the option to buy one year of disk storage in advance, decreasing your total cost. Azure disk reservation lets you purchase Premium SSDs in the specified SKUs, from P30 (1 TiB) up to P80 (32 TiB), for a one-year term. There is no minimum number of disks required to purchase a reservation, and you can choose to pay with a single upfront payment or monthly payments. To learn more about disk reservations, follow the Microsoft documentation.

Ultra Disks: Azure Ultra Disks deliver high throughput, high IOPS, and consistently low-latency disk storage for Azure VMs. Generally speaking, they are ideal for applications that handle very high data traffic, such as banking scenarios, retail stores, or a high-speed sport like Formula 1. Their IOPS can scale from 1,200 to 160,000, at 300 IOPS/GiB, so they are best suited for organizations and scenarios that demand top performance at all times. Another benefit of Ultra Disks is the ability to change the configuration on the fly without restarting the VM: at any point, the configuration can be dialed higher or lower on three parameters (capacity, IOPS, or bandwidth) as requirements change. Ultra Disks suit data-intensive workloads such as SAP HANA, top-tier databases, and transaction-heavy workloads.
Image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd

There are also some drawbacks associated with Ultra Disks:
– They are supported only in certain regions, and not every VM size is available with Ultra Disks in every supported region.
– Caching is not supported for reads and writes.
– They are available only as data disks and can only be created as empty disks.
– They do not currently support disk snapshots, disk export, changing disk type, VM images, availability sets, Azure Dedicated Hosts, or Azure disk encryption.
– They do not support integration with Azure Backup or Azure Site Recovery.
– They support a 4k physical sector size by default; a 512E sector size is available as a public offering (no sign-up required). Most applications are compatible with 4k sector sizes, but some require 512-byte sectors. One example is Oracle Database, which requires release 12.2 or later to support 4k-native disks; older Oracle DB versions need the 512-byte sector size.

To learn more about Ultra Disks, you can refer to the Microsoft documentation.

Now let us get back to the storage configuration of Azure SQL VM. Storage is crucial for Azure SQL VM performance, so create separate storage pools for data files, logs, and tempdb. Best practices include:
– Put tempdb on the local or ephemeral drive (cached and temp storage).
– Data files should be on premium storage (P30 – P40 recommended) with caching enabled.
– Log storage should be fast, with caching disabled.

With suitable disks chosen for our production workloads, we must also choose the right VM to make optimal use of the underlying storage and avoid bottlenecks, because VMs have their own storage I/O limits. If we get this selection wrong, we end up either VM I/O-capped or disk I/O-capped. Remember: whether we over- or under-provision resources, we lose on performance.
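When sizing the data-file pool, the tier comparison above can be reduced to a small lookup. The P40 figures below come from the provisioning example earlier; the P30 and P80 IOPS/throughput figures are assumptions for illustration and should be checked against the disk-types table:

```python
# A small lookup to illustrate what "guaranteed capacity, IOPS and throughput"
# means per Premium SSD tier. P40's numbers come from the text above; the
# P30/P80 IOPS and throughput figures are illustrative assumptions.
PREMIUM_SSD = {
    # tier: (capacity GiB, provisioned IOPS, throughput MB/s)
    "P30": (1024, 5000, 200),    # assumed figures
    "P40": (2048, 7500, 250),    # as stated above
    "P80": (32768, 20000, 900),  # assumed figures
}

def smallest_tier(iops_needed: int, mbps_needed: int) -> str:
    """Return the smallest tier whose provisioned IOPS and throughput cover the need."""
    for tier, (_, iops, mbps) in sorted(PREMIUM_SSD.items(), key=lambda kv: kv[1][0]):
        if iops >= iops_needed and mbps >= mbps_needed:
            return tier
    raise ValueError("need an Ultra Disk or a striped pool of disks")

# A workload baselined at 6,000 IOPS / 220 MB/s outgrows P30 and lands on P40.
print(smallest_tier(6000, 220))
```
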
To avoid this, before starting anything, get baseline metrics for the required VM: IOPS, throughput, latency, memory, and CPU can be measured using Azure Monitor or the Perfmon tool. It is also important to start with storage and then move to the VM, as resizing storage is much more complex than resizing a VM.

VM IO Capping:

Image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/dv3-dsv3-series

Here, the bandwidth that the application requires is restricted at the VM level. The solution is to enable host caching (bringing storage closer to the VM), upgrade the VM, or use VM-level bursting if the load comes in short periods.

Disk IO Capping:

Here, the bandwidth that the application requires
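The distinction between the two capping situations is just a question of which limit the workload hits first. A sketch with hypothetical limits (not taken from any specific Azure SKU):

```python
# Illustrative sketch of the capping idea: compare the application's required
# throughput against the VM's and the disk's limits to see which caps first.
# The example limits are hypothetical, not taken from any specific Azure SKU.
def io_bottleneck(app_mbps: float, vm_limit_mbps: float, disk_limit_mbps: float) -> str:
    """Report whether throughput is capped by the VM, the disk, or neither."""
    effective = min(vm_limit_mbps, disk_limit_mbps)
    if app_mbps <= effective:
        return "no cap"
    return "VM I/O capped" if vm_limit_mbps < disk_limit_mbps else "Disk I/O capped"

# A disk that can do 250 MB/s behind a VM limited to 192 MB/s: the VM caps first.
print(io_bottleneck(app_mbps=230, vm_limit_mbps=192, disk_limit_mbps=250))
```

Running the baseline numbers through a check like this, before provisioning, is exactly the "get the baseline metrics first" advice above.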

Azure, Data Infrastructure

Azure SQL VM Best Practices: VM Sizing

Overview

When we run SQL Server on Azure Virtual Machines, we can use the same database performance-tuning options that apply to SQL Server in an on-premises environment. For an Azure SQL Server VM, best practice starts with collecting the baseline requirements for the processing unit and the storage; for this, we have to know the IOPS, throughput, and latency of the source system. CPU and memory requirements should also be gathered during peak hours, covering workloads during business hours as well as other high-load processes such as end-of-day processing or weekend ETL workloads. Then use that performance analysis to select a VM size that can scale to your workload's performance requirements. Bear in mind that the performance of a relational (SQL) database in the cloud depends on many factors: the size of the virtual machine, the configuration of the data disks, SQL Server features, Azure features, high availability and disaster recovery (HADR) features, and security optimization.

Consider scaling up your resources for predictably heavy workloads, such as end-of-quarter processing, and scaling back down once the workload completes. There is typically a trade-off between reducing costs and optimizing for performance. This performance best-practices series focuses on getting the best performance for SQL Server on Azure Virtual Machines; it is about setting up, or opting for, the best infrastructure for SQL Server on the Azure cloud. If your workload is less demanding, you might not require every recommended optimization. Inspect your performance needs, workload patterns, and, most importantly, the costs involved as you evaluate these recommendations.
The following is a checklist for optimal performance of SQL Server on Azure Virtual Machines:
– VM Size
– Storage and Scalability
– Baseline Requirements
– Security Considerations
– SQL Server Features

Since this article is about Azure SQL VM, and no one knows Azure and SQL better than Microsoft, we will link to the relevant Microsoft documentation wherever needed, for better understanding and transparency.

VM Size

Here is a quick checklist of VM-size best practices for running SQL Server on an Azure VM:
· Use VM sizes with four or more vCPUs.
· Use memory-optimized virtual machine sizes for the best performance of SQL Server workloads.
· Consider a higher memory-to-vCore ratio for mission-critical and data warehouse workloads.
· Collect the target workload's performance characteristics (IOPS) and use them to determine the appropriate VM size for your business.

A vCPU (virtual central processing unit) represents a portion or share of an underlying physical CPU assigned to a particular virtual machine (VM). A physical CPU can back multiple vCPUs, and those vCPUs can be managed by hypervisor software such as Hyper-V. In the past, a common rule of thumb was eight vCPUs per core; today, the vCPU count is determined mainly by the manufacturer, by taking the number of processing threads that a chipset offers per core and multiplying by the occupied sockets. This is how it looks: (Threads x Cores) x Physical CPUs = number of vCPUs. If you have small VMs that barely use CPU time, you could easily get 20–30 VMs from an 8-core server; but with larger workloads, such as a database server, you will get far fewer VMs from that same 8-core server. Thus, it is all about resource utilization and allocation.

A few questions to consider before we get started:
· Do our apps run at 100% CPU utilization all the time?
· Do they have periods where utilization bursts?
· Do they have maintenance windows?
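The vCPU rule of thumb above can be written as a tiny function. The server specs in the example are hypothetical:

```python
# The vCPU rule of thumb from the text, as a small function.
def vcpu_count(threads_per_core: int, cores: int, sockets: int) -> int:
    """(Threads per core x Cores) x Physical CPUs = number of vCPUs."""
    return threads_per_core * cores * sockets

# e.g. a hypothetical hyper-threaded (2 threads/core) 8-core chip in 2 sockets:
print(vcpu_count(threads_per_core=2, cores=8, sockets=2))  # 32
```
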
When you know the requirements, you can make an informed decision about the underlying hardware. There is a large array of machine types in the Azure cloud to select from; still, we will emphasize only the D, E, and M family machines, since we are concerned only with SQL workloads, and SQL Server database engine performance is driven more by memory than by the number of cores available. Examples include Dsv2 11-15, Edsv4, Mv2, Msv2, M, Mdsv2, DCsv2, Dasv4, Dsv3, Ddsv4, and Lsv2. These can be broadly categorized into general-purpose, memory-optimized, and storage-optimized machines.

Some of the naming conventions for machines are as follows:
A – AMD-based processor
D – Disk (a local temporary disk is present)
M – The most memory in a particular size
S – Premium-storage capable
I – Isolated size
V – Version

D series: Has a consistent memory-to-core ratio of 7:1, medium-to-large caches, and support for in-memory analytics. These support premium storage and serve as entry-level SQL Server virtual machines.

E series: Has a memory-to-core ratio of 8:1. These are ideal for memory-intensive enterprise applications and also have sizable local SSD storage capacity and throughput. This family makes a solid all-purpose SQL Server virtual machine.

M series: Has a memory-to-core ratio of 30:1 without constrained cores and 122:1 with constrained cores. These offer the highest vCore counts and memory for the largest SQL Server workloads; they are the mission-critical and data warehouse virtual machines. Note, though, that their costs are enormous, so they need to be managed accordingly.

It is also essential to understand the concept of constrained cores. Some database workloads, like SQL Server or Oracle, require high memory, storage, and I/O bandwidth, but not a high core count; conversely, many database workloads are not CPU-intensive. Azure offers specific VM sizes that constrain the VM's vCPU count, reducing software licensing costs while conserving the same memory, storage, and I/O bandwidth.
For example, the vCPU count can be constrained to one-half or one-quarter of the original VM size. As a concrete case, Standard_M16ms comes with 16 cores and 437.5 GB of memory (RAM).

Screenshot taken from M-series – Azure Virtual Machines | Microsoft Docs
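The licensing arithmetic behind constrained cores can be sketched as follows. The per-core price and the constrained SKU name are hypothetical; only the 16-core/437.5 GB Standard_M16ms figures come from the text:

```python
# Illustrative arithmetic for constrained cores (hypothetical per-core license
# price): a constrained SKU keeps the memory but licenses fewer vCPUs.
def sql_license_cost(vcpus: int, price_per_core: float) -> float:
    """SQL Server is licensed per core, so fewer active vCPUs mean lower cost."""
    return vcpus * price_per_core

full = sql_license_cost(16, 275.0)        # e.g. Standard_M16ms, 16 vCPUs
constrained = sql_license_cost(4, 275.0)  # e.g. a hypothetical M16-4ms, 4 active vCPUs
print(full - constrained)  # licensing saved while memory stays at 437.5 GB
```
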
