Data Science

Download Attachment from Gmail to OneDrive

Purpose

Email is a vital part of business communication; we receive and send important documents as attachments every day. Sometimes we cannot access documents received over email, or we need to automatically download the attachments from unread emails. How do we get those documents onto our laptops or computers? This is where Power Automate comes into action: it lets us build an automated process that downloads the attachments from a Gmail account.

This blog explains how to build a Power Automate cloud flow that downloads the attachments from any configured Gmail account into OneDrive whenever a new email with attachments arrives.

How to Set Up a Power Automate Cloud Flow to Download Attachments from Gmail into OneDrive

1. Log in to https://flow.microsoft.com/ and sign in with your Microsoft account.
2. On the left pane, click + Create to start creating your Power Automate flow.
3. Choose the Automated cloud flow option.
4. Enter the name of the flow in the Flow Name text box, choose the Gmail trigger "When a new email arrives", and click Create.
5. Once you click Create, the trigger appears on the next screen.
6. Click the three dots ("...") on the right-hand side to add the Gmail account, then click + Add new connection.
7. Click the "Show advanced options" down arrow.
8. Select "Yes" in both the Has Attachments and Include Attachments drop-downs. After that, click the + New step button to add the action to perform. Remember, it is important to choose the action that will run each time the trigger fires.
9. Type OneDrive in the "Choose an operation" search box. Select either OneDrive for Business or OneDrive according to your account type; here, I have chosen OneDrive.
10. Clicking OneDrive shows the various actions available. Select the Create file option. Once you click Create file, it asks you to sign in; click Sign in.
11. A new sign-in window opens. Either pick an already signed-in account or use another account; I have used my already signed-in account here.
12. Once you are signed in, the Create file options appear.
13. Click the folder icon on the right to choose the OneDrive folder path that will be the download location for the attachments.
14. In the File Name text box, click Add dynamic content and choose "Attachments Name".
15. The action then shows the option "Select an output from previous steps" with "Attachments" as dynamic content. Click the Create file operation to expand it for more settings.
16. Choose "Attachments Content" as the dynamic content for the file content and click Save.

Now the flow is ready to run and download attachments. For testing, send an email with an attachment to the Gmail account configured in the flow. Currently, my OneDrive download location is empty. I attached one Word document to an email and sent it to the Gmail account configured in the Power Automate flow. The email arrived in the configured Gmail account, and the attachment was downloaded into the OneDrive download location.

Conclusion

Whether or not we open our emails, Microsoft Power Automate will download the attachments and make them available as soon as the email arrives.
We do not require any programming for this; we can simply follow the steps explained above to automate the process.
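For readers who do want a programmatic alternative, the same end result can be approximated in plain Python with the standard library. This is only an illustrative sketch and not part of the Power Automate flow described above: it assumes IMAP access is enabled on the Gmail account, an app password is used for login, and attachments are saved into a locally synced OneDrive folder. The account, password, and folder path are placeholders.

import email
import imaplib
import os

IMAP_HOST = "imap.gmail.com"           # Gmail IMAP endpoint
USER = "your.account@gmail.com"         # placeholder account
APP_PASSWORD = "your-app-password"      # placeholder app password
SAVE_DIR = r"C:\OneDrive\Attachments"   # placeholder: locally synced OneDrive folder

with imaplib.IMAP4_SSL(IMAP_HOST) as imap:
    imap.login(USER, APP_PASSWORD)
    imap.select("INBOX")
    # Look only at unread messages, mirroring the "when a new email arrives" trigger.
    _, data = imap.search(None, "UNSEEN")
    for num in data[0].split():
        _, msg_data = imap.fetch(num, "(RFC822)")
        msg = email.message_from_bytes(msg_data[0][1])
        for part in msg.walk():
            filename = part.get_filename()
            if filename:  # only attachment parts carry a filename
                path = os.path.join(SAVE_DIR, filename)
                with open(path, "wb") as f:
                    f.write(part.get_payload(decode=True))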

Data Science

Streamlining Data Deployment with Power BI Pipelines

In today's world, data means power, and to use that power efficiently we need to analyse it properly. It is a great responsibility to handle a large amount of data and present it in a prudent way so that the end user can easily grasp the details. This data can be reports, paginated reports, dashboards, datasets, and dataflows.

At times it gets hard to keep this content in sync between the Dev, Test, and Production environments. Power BI lets us do this in an easy way using Power BI deployment pipelines. Pipelines in Power BI are a very intuitive and responsive service and notify us of any changes between the different environments.

Deployment pipelines have three stages:
Development
Test
Production

Working with Deployment Pipelines

Step 1: Create a deployment pipeline

To create a pipeline we have two options:

From the Deployment Pipelines tab: In the navigation pane on the left side of the Power BI service, choose the Deployment pipelines option and then click Create pipeline. After clicking the Create pipeline button, a window pops up containing a dialogue box for the Name and Description of the pipeline. Fill in a suitable Name and Description and click the Create button.

From a workspace: In the workspace, select Create a pipeline. We get the same Name and Description window; fill in the fields with a suitable Name and Description and then hit the Create button.

Step 2: Assign a workspace

To work with deployment pipelines, we need a Premium Per User workspace. To change the workspace into a Premium Per User workspace, follow the steps shown in the screenshot. The workspace will then show a diamond icon, which signifies that the workspace is now a Premium Per User workspace.

After hitting the Create button we get a window that shows all three stages of a pipeline. We can assign only one workspace to one pipeline. Click the Assign a workspace button. We get a dropdown where we can select the workspace name and any stage (Development, Test, or Production). It is not necessary to start with the Production stage; we can start with any of the three. Choose the stage you want to start with and click Assign. As soon as we click Assign, we get an initial view of that workspace, with its reports, datasets, and dashboards.

Deploying content

Once we have the initial view of the workspace, we can deploy it to another stage. This creates a duplicate workspace that contains all the content of the source stage. To do that, click the Deploy button and it will deploy all the content of the source stage to the target stage.

We can also deploy selected content only. To do that, select the Show more option, select the content you want in your target stage, and then hit the Deploy button; only the selected content appears in the target stage.

There is another option, Select related, available after selecting specific content. The items that are related to the selected items are then selected automatically. For example, a report depends on its dataset, so when we select the report and choose Select related, the dataset is selected automatically.

Comparing the content of stages

When we have deployed content from one stage to another, a green indicator appears between the source and target stages.
It means everything is in sync in both the source and target stages. If we make changes in the source or the target, the indicator colour changes to orange.

When there is a change in any stage, the changed item(s) get labels:

If an item exists on the Source side but does not exist on the Target side, it gets a label represented by a + sign.

If an item was the same on both the Source and Target sides but was changed on one side after deployment, it is labelled as Different. On the next deployment, the source stage item is copied to the target side, regardless of where the changes were made.

If an item exists on the Target side but is missing from the Source side, it gets the Missing from (-) label.

Deployment rules

In deployment pipelines we can change the configuration of the different stages. For example, if our development dataset points to the development database and we have deployed it to Test, but we want it to point to the test database after the deployment, we can do that using deployment rules.

To create deployment rules, follow these steps:

Step 1: Select the Deployment settings option for the stage that needs to be modified.

Step 2: Select the dataflow, dataset, or paginated report from the list.

Step 3: We have two types of rules:
1. Data source rules
2. Parameter rules

Select the type of rule you want to add and then click Add rule to add the rule for the dataset, report, etc. With a data source rule, we can replace the data source with a new one. With a parameter rule, we choose the parameter from the dropdown list and then choose the value to apply to it.

When all the settings are done, hit Save and close. The settings are now applied to the selected stage, and the rules will remain in effect until they are changed again.

Backward deployment

Sometimes we need

Data Science

How to Convert PDF File to Text File Using Python

We do not require any additional software, a specific application, or a Google search to change a PDF into a text document. The process of converting a PDF file into a text document using Python is as follows.

Steps to Convert PDF to Text in Python

Step 1 - Install the library

We need only one external module or package for this, which is PyPDF2. PyPDF2 is a Python module used to perform various operations on PDF files, such as extracting document information, merging PDFs, splitting PDFs, overlaying and watermarking pages, and encrypting and decrypting PDF files. First, we install the library using pip by executing the following command in the command prompt.

pip install PyPDF2

Step 2 - Import the installed library

After installing the PyPDF2 module, we import it using the import keyword.

import PyPDF2

Step 3 - Open your PDF file for reading

Now we open the PACKSLIP.pdf file by calling the open() function in 'rb' (read binary) mode.

read_pdf = open(r"D:\Practice\PACKING SLIP.pdf", 'rb')

Step 4 - Create a PdfReader object

We create a pdfReader object using the PdfFileReader() class defined in the PyPDF2 module. The pdfReader object reads the file opened in the previous step.

pdfReader = PyPDF2.PdfFileReader(read_pdf)

Step 5 - Loop over all pages in the PDF

The numPages attribute gives the number of pages in the PDF file, and the getPage() method returns a page object for a given page index. We want the text from page 1 to page 5, so we use a for loop with the range() function to get all the pages in the PDF file.

pageObject = pdfReader.getPage(i)

Step 6 - Extract text from each page using extractText()

After getting the pageObject, we use the extractText() method to extract all the text from that page.

extract_text = pageObject.extractText()

The complete code for extracting text from a PDF file using the PyPDF2 module in Python is shown at the end of this section.

Output

This is the source PDF file location. The source PDF file we are using is PACKSLIP.pdf, and it will be converted into a text file. If we open it, we can see it is a 5-page document. The output terminal shows the total number of pages, which is 5, and prints the extracted data.

Extracted data file: this is the location where the extracted data is stored, in a text file named PACKINGSLIP.txt. When we executed the script, it generated the PACKSLIP text file. Opening this file, we can see that our text file carries all the content from the PACKSLIP.pdf file. This is the final step in transforming our PDF into a text file.

Conclusion

This blog explains how to convert a file from PDF format to text format. We used Python for this purpose because it has a wide variety of tools and libraries with built-in modules that make our work simpler and easier. Using Python is justified since the script is automated and completes the process in a single run.
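Putting the steps above together, here is a minimal end-to-end sketch. It assumes the legacy PyPDF2 1.x API used in this post (PdfFileReader, getPage, extractText) and reuses the example file paths from the walkthrough; adjust the locations to your own files.

# Minimal sketch combining Steps 1-6: read every page of the PDF and
# write the extracted text to a .txt file (paths are the examples used above).
import PyPDF2

source_pdf = r"D:\Practice\PACKING SLIP.pdf"   # source PDF from the walkthrough
target_txt = r"D:\Practice\PACKINGSLIP.txt"    # output text file

read_pdf = open(source_pdf, 'rb')
pdfReader = PyPDF2.PdfFileReader(read_pdf)

print("Total number of pages:", pdfReader.numPages)

with open(target_txt, 'w', encoding='utf-8') as out_file:
    for i in range(pdfReader.numPages):
        pageObject = pdfReader.getPage(i)        # page object for page i
        extract_text = pageObject.extractText()  # text of that page
        print(extract_text)
        out_file.write(extract_text)

read_pdf.close()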

Data Science

Mastering Report Creation with SAP Crystal Reports

Crystal Reports is a BI tool provided by SAP; it is an interface that enables you to quickly and easily create, format, and publish effective reports. It is a very robust tool: it can handle nearly anything in terms of presenting your forms, invoices, sales orders, and the like, and it can create executive reports to support better business decisions.

SAP Business One includes a Crystal Report viewer, which allows us to view various Crystal reports, and a Report and Layout Manager that allows us to import and export Crystal reports. Crystal Reports also lets you save forms in SAP Business One and preview them from within SAP Business One.

HOW TO USE CRYSTAL REPORTS

First, we need the Crystal Reports software installed on our system. Now, we are going to create a report.

Open the SAP Crystal Reports application to build Crystal reports. In the File menu, click New and then click Standard Report.

Once we start building a report, we select the data source. Inside the Field Explorer, find the database fields and right-click Database Expert. Inside the Database Expert, click the Create New Connection option. It shows several database connection types, such as Excel, XML, ODBC, etc. Next, select OLE DB. Inside OLE DB there are many providers; select the SQL Server Native Client 11.0 provider and click Next.

In the next window, enter the server name, user ID, password, and database name, and then click Finish. After connecting to the database, we will see the list of available tables.

Now we select the tables that we want to use for the Crystal report. We can add more than one table according to our requirements. The next page gives us the option to create joins between the selected tables using common columns. We also notice that some of these columns have little flags in front of them; these indicate an index or key column, so if we join tables based on these columns, the database performance will be better.

Once we are on the design screen, we set up the report; we start with a blank report. In the report layout, the report header appears at the top of the first page and the page header at the top of each page of the report. Then we have the details section, in which we fetch the data. The report footer is at the end of the last page, and the page footer appears at the bottom of each page. The Field Explorer gives us access to all the columns in the tables, the ability to build formulas for mathematical calculations such as running totals or summaries, and the ability to build parameters that are passed to the end user.

Here, we created an Amazon bill slip using Crystal Reports. We simply added the report header and page header using a text object from the menu bar. In the details section, we pull columns from different tables to fulfil the report's needs; we simply drag and drop fields from the Field Explorer to pull some basic information.

After finishing the design, we preview the report: go to View, select Print Preview, and it builds the report. Here is the final report preview.

CONCLUSION:

This blog expounds on the method to build a Crystal report using the SAP Crystal Reports tool; by following a few simple steps, we can fetch meaningful data and easily generate invoices.

Data Science

Power Automate as Bridge to Write-back on Synapse Using Power Apps

The write-back functionality enables end users to give input and update the data in the data warehouse directly from the BI report. In addition, this helps BI professionals deliver a rich user experience when deploying solutions for budgeting, forecasting, planning, and adding comments. This functionality can be achieved using Power Apps. However, there is no connector for Azure Synapse in Power Apps, so write-back to Synapse using Power Apps does not work directly.

This blog shows the process of writing back to Azure Synapse from Power Apps while using Power Automate as a bridge.

Power Automate has a connector that allows changes to the Synapse database. For this we need a Synapse SQL table in which the data will be stored (here, a WriteBackTable table was created) and a Power Apps form that has some input fields and a Submit button. We then create a flow in Power Automate, which is called from the Power Apps Submit button.

Here, WriteBackTable is used as the base table on which a fact table will be built to show the Daily and Monthly Goal for each department in each month. The Department, Monthly Goal, and Number of Working Days are entered by the users through Power Apps. The script for WriteBackTable creates the table with these columns.

The RowId is the primary key for a unique Department and MonthYear combination, which is used later to decide whether a record will be inserted or updated. Initially there are no records in the table, so the data will be inserted.

The complete flow is explained below in detail.

For creating the flow, first click Create and select Instant cloud flow. Then select PowerApps as the trigger of this flow, give the flow a name, and click Create.

After that, we initialize a variable that we will populate from Power Apps. We can name the variable as we want; when we click on its Value field, the Dynamic content pane pops up and we click Ask in PowerApps. This step is repeated once for each input field in the Power Apps form; in my case there are four, so four Initialize variable actions are used.

Once all the variables are initialized, we add the action Execute a SQL query (V2) under Azure SQL Data Warehouse to check whether there are any records in the Synapse table for the key columns entered in Power Apps. We are using the "Azure SQL Data Warehouse" connector, which allows us to connect to Azure Synapse and run SQL queries and stored procedures to perform CRUD operations (create, retrieve, update, and delete). It is available only in Power Automate and not in Power Apps, which is why Power Automate is used as the bridge here.

In this action, we need to create a connection by clicking the three dots in the top right corner and then clicking Add new connection. Then we enter the credentials for the Azure Synapse dedicated SQL pool using Basic authentication (username and password). Once the connection is established, the server name and database appear in the drop-down lists; select them and write your query.

Note: We need to define the parameters and the data types required by the query. Here we need Department and MonthYear as the key columns (on the basis of which a record will be inserted or updated), so two parameters are created.
Then assign each parameter its value by selecting the corresponding variable (coming from Power Apps) from the Dynamic content pane.

After this step we need to decide whether the record will be inserted or updated in the table, so we add a new Condition step. In this step, we write an expression to check whether the select query in the previous step returned any data. The expression is as follows:

empty(outputs('Execute_Select_from_Synapse')?['body/resultsets/Table1'])

Instead of copying the above line, you can also run the flow once, look at the output of the select step, note which result set contains the rows, and then use the empty() function on that set from the Dynamic content pane.

Now, if the condition is true, i.e. the result set is empty, we need to insert the record; otherwise an update is done. For this, we use the Execute a SQL query (V2) action again, once with the insert query and once with the update query (see the sketch below for the equivalent logic).

After the flow is completed, go to Power Apps, select the Submit button of the form from which we want to call the Power Automate flow, click Action, and then Power Automate. Then select your flow and pass the parameters.

This app can be embedded in Power BI; whenever the user enters values in the form and clicks the Submit button, the flow is triggered and the data is inserted or updated in the Synapse database.

Conclusion: Power Apps does not have a connector for Azure Synapse, so there is no direct way to write back to the Synapse database, but by using Power Automate as a bridge we can achieve this functionality without any trouble.
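To make the branching in the flow concrete, here is a minimal Python sketch of the same check-then-insert-or-update logic against a Synapse dedicated SQL pool, using pyodbc. It is only an illustration of what the flow's Condition and Execute a SQL query (V2) actions do; the column names (MonthlyGoal, WorkingDays) and the connection details are assumptions based on the WriteBackTable example above.

# Illustrative only: the Power Automate flow performs this logic with the
# "Execute a SQL query (V2)" actions; this sketch mirrors it in Python.
import pyodbc

# Placeholder connection details for the Synapse dedicated SQL pool.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<your-workspace>.sql.azuresynapse.net;"
    "DATABASE=<your-database>;UID=<user>;PWD=<password>"
)
cur = conn.cursor()

# Values that would come from the Power Apps form.
department, month_year, monthly_goal, working_days = "Sales", "Jan-2022", 50000, 22

# Step 1: does a record already exist for this Department / MonthYear key?
cur.execute(
    "SELECT 1 FROM dbo.WriteBackTable WHERE Department = ? AND MonthYear = ?",
    department, month_year,
)
exists = cur.fetchone() is not None

# Step 2: insert when the result set is empty, otherwise update
# (the same branch the Condition step takes in the flow).
if not exists:
    cur.execute(
        "INSERT INTO dbo.WriteBackTable (Department, MonthYear, MonthlyGoal, WorkingDays) "
        "VALUES (?, ?, ?, ?)",
        department, month_year, monthly_goal, working_days,
    )
else:
    cur.execute(
        "UPDATE dbo.WriteBackTable SET MonthlyGoal = ?, WorkingDays = ? "
        "WHERE Department = ? AND MonthYear = ?",
        monthly_goal, working_days, department, month_year,
    )

conn.commit()
conn.close()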

Data Science

XML Parsing using Azure Data Factory

Overview

There can be scenarios where one needs to process data coming from XML files, and those XMLs can get quite complex in structure: the data can sit multiple XML tag levels deep, with some tags also occurring multiple times at different depths. In such a case, if we do not have access to an XML schema (XSD or DTD), we can have a hard time loading the data; moreover, there can be processing issues related to the size of the XML provided. Recently we got stuck in a similar scenario; getting the data processed was a challenging task, and after trying multiple approaches and numerous integration tools, ADF was what we finally went ahead with.

Sample XML Code

Before starting with anything, let's have a look at the sample XML file.

<Items xmlns="http://xmlns.oracle.com/apps/some/productModel/common/publicationServicesB6G/">
  <ns1:Item xmlns:ns17="http://xmlns.oracle.com/apps/some/productModel/items/flexi/item/" xmlns:ns1="http://xmlns.oracle.com/apps/some/productModel/common/publicationService/" xmlns:ns4="http://xmlns.oracle.com/apps/some/productCatalogueManagement/advancedItems/flexi/egoItem/items/category/" xmlns:ns0="http://xmlns.oracle.com/adfe/service/types/" xmlns:ns18="http://xmlns.oracle.com/apps/some/productModel/items/flexi/itemGdf/" xmlns:ns9="http://xmlns.oracle.com/apps/some/productModel/items/itemRelationshipService/" xmlns:ns5="http://xmlns.oracle.com/apps/some/productModels/item/itemServicesV8/" xmlns:ns12="http://xmlns.oracle.com/apps/some/productModel/item/structure/structureServiceV8/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:type="ns1:Item">
    <ns1:Item>1234567890</ns1:Item>
    <ns1:Organization>1234567890</ns1:Organization>
    <ns1:OrganizationCode>TST</ns1:OrganizationCode>
    <ns1:ItemClass>Training</ns1:ItemClass>
    <ns1:Template xsi:nil="true"/>
    <ns1:ItemNumber>APP563772</ns1:ItemNumber>
    <ns1:ItemDescription>Test for complex XML load</ns1:ItemDescription>
    <ns1:ApprovalStatus>Approved</ns1:ApprovalStatus>
    <ns1:ItemCategory xmlns:ns3="http://xmlns.oracle.com/apps/some/" xsi:type="ns4:Ivcformula">
      <ns4:InventoryItemId>123456789</ns4:InventoryItemId>
      <ns4:OrganizationId>123456789</ns4:OrganizationId>
      <ns4:StyleItemId>123456789</ns4:StyleItemId>
      <ns4:MasterOrganizationId>123456789</ns4:MasterOrganizationId>
      <ns4:Planner__and__TestCase xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/"><ns2:ContextCode>Planning for Tes Case</ns2:ContextCode><ns2:CreatedBy>Alpha.User</ns2:CreatedBy><ns2:CreationDate>2050-02-14T12:35:21.956</ns2:CreationDate><ns2:EffLineId>123456789</ns2:EffLineId><ns2:InventoryItemId>123456789</ns2:InventoryItemId></ns4:Planner__and__TestCase>
      <ns4:ItemNumberSequencing xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/"><ns2:ContextCode>Generation</ns2:ContextCode><ns2:CreatedBy>Alpha.User</ns2:CreatedBy><ns2:CreationDate>2050-02-14T12:35:23.121Z</ns2:CreationDate><ns2:EffLineId>123456789</ns2:EffLineId><ns2:InventoryItemId>123456789</ns2:InventoryItemId><ns2:LastUpdateDate>2022-02-14T10:35:25.291Z</ns2:LastUpdateDate><ns3:suffix>Beta</ns3:suffix><ns3:RedLinesuffix xsi:nil="true"/></ns4:ItemNumberSequencing>
      <ns4:Manufacturing xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/"><ns2:ContextCode>Manufacturing</ns2:ContextCode><ns2:CreatedBy>Alpha.User</ns2:CreatedBy><ns2:CreationDate>2050-02-14T12:35:23.124Z</ns2:CreationDate><ns2:EffLineId>1234567889</ns2:EffLineId><ns2:InventoryItemId>123456789</ns2:InventoryItemId></ns4:Manufacturing>
      <ns4:Header xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/"><ns2:ContextCode>Header Info</ns2:ContextCode><ns2:CreatedBy>AlphaC.User</ns2:CreatedBy><ns2:CreationDate>2050-02-14T12:35:23.128Z</ns2:CreationDate><ns2:LineId>987654321</ns2:LineId><ns2:InventoryItemId>987654321</ns2:InventoryItemId><ns3:quantity>1000</ns3:quantity></ns4:Header>
      <ns4:Header xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/"><ns2:ContextCode>Header Info</ns2:ContextCode><ns2:CreatedBy>AlphaB.User</ns2:CreatedBy><ns2:CreationDate>2050-02-15T12:35:23.128Z</ns2:CreationDate><ns2:LineId>258963147</ns2:LineId><ns2:InventoryItemId>258369147</ns2:InventoryItemId><ns3:quantity>12</ns3:quantity></ns4:Header>
      <ns4:Header xmlns:ns2="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/"><ns2:ContextCode>Header Info</ns2:ContextCode><ns2:CreatedBy>AlphaA.User</ns2:CreatedBy><ns2:CreationDate>2050-02-17T18:35:23.128Z</ns2:CreationDate><ns2:LineId>9514785362</ns2:LineId><ns2:InventoryItemId>9514785362</ns2:InventoryItemId><ns3:quantity>125</ns3:quantity></ns4:Header>
    </ns1:ItemCategory>
    <ns1:QuantityRevision xmlns:ns6="http://xmlns.oracle.com/apps/some/productModel/items/flex/itemRevision/" xmlns:ns7="http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/flexi/egoItemEff/itemRevision/categoriy/"><ns5:RevisionId>4564564564564</ns5:RevisionId><ns5:ItemId>12345678987</ns5:ItemId><ns5:Organization>1234567890</ns5:Organization><ns5:RevisionCode>00</ns5:RevisionCode><ns5:ItemRevisionDescription xsi:nil="true"/><ns5:RevisionReasonValue xsi:nil="true"/><ns5:ItemRevision><ns6:RevisionId>30005940156</ns6:RevisionId><ns6:Context xsi:nil="true"/><ns6:Context_DValue xsi:nil="true"/><ns6:NumOfSegments>5</ns6:NumOfSegments></ns5:ItemRevision><ns5:ItemRevision><ns6:RevisionId>1231231231</ns6:RevisionId><ns6:Context xsi:nil="true"/><ns6:Context_DValue xsi:nil="true"/><ns6:NumOfSegments>5555</ns6:NumOfSegments></ns5:ItemRevision><ns5:RevisionCategory xsi:type="ns7:Ivcformula"><ns7:RevisionId>30005940156</ns7:RevisionId><ns7:CategoryCode>TestFail</ns7:CategoryCode><ns7:VersionId>-1</ns7:VersionId></ns5:RevisionCategory></ns1:QuantityRevision>
  </ns1:Item>
</Items>

In this file, we have multiple XML tags, and the depth of the XML tags also varies:

Items -> Item -> ItemCategory -> Planner__and__TestCase
Items -> Item -> ItemCategory -> ItemNumberSequencing
Items -> Item -> ItemCategory -> Manufacturing
Items -> Item -> ItemCategory -> Header (has multiple occurrences)

This is a good example of a nested structure where the Header tag repeats at a certain level. In an ADF pipeline, for our requirement, we can proceed with two different approaches:

· Copy data to the sink table without transforming anything in the data.
· Copy data with a data cleansing activity involved.

The first approach is relatively simple using the Copy data activity: data is loaded into the sink (target) system without any modification. But if we want some cleansing or modification of our data, we can proceed with the Data flow activity.
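Before moving on to the ADF activities, it helps to see concretely why the repeating Header tag matters. The sketch below is only an illustration of the structure discussed above, outside ADF: it parses the sample document with the Python standard library, walks down to the repeating Header elements, and emits one flat record per Header, which is what the Copy activity's collection reference does for us automatically. The local file name and the choice of output columns are assumptions for the demo.

# Illustration of why the repeating <Header> tag needs a collection reference:
# each Header becomes one flat output row, with parent-level values repeated.
import xml.etree.ElementTree as ET

# Namespace aliases for the sample document (shortened here for readability).
ns = {
    "pub": "http://xmlns.oracle.com/apps/some/productModel/common/publicationService/",
    "cat": "http://xmlns.oracle.com/apps/some/productCatalogueManagement/advancedItems/flexi/egoItem/items/category/",
    "prot": "http://xmlns.oracle.com/apps/some/productCatalogManagement/advancedItems/protectedModel/",
}

tree = ET.parse("sample_items.xml")  # assumed local copy of the sample XML above
rows = []
for item in tree.getroot().findall("pub:Item", ns):
    item_number = item.findtext("pub:ItemNumber", default="", namespaces=ns)
    category = item.find("pub:ItemCategory", ns)
    # Header repeats inside ItemCategory, so it is the collection-reference level.
    for header in category.findall("cat:Header", ns):
        rows.append({
            "ItemNumber": item_number,
            "LineId": header.findtext("prot:LineId", default="", namespaces=ns),
            "CreatedBy": header.findtext("prot:CreatedBy", default="", namespaces=ns),
        })

for row in rows:
    print(row)  # one flattened row per Header element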
The Data flow activity lets us use various other transformations as per our requirements. One big difference between the Copy and Data flow activities, though, is that the Copy activity automatically flattens the XML tags present in the input file, whereas in a data flow we must do it manually. For this demo we will consider loading data from an XML file in blob storage into a SQL Server database. Note: for the Copy activity, we will start directly with creating a pipeline.

Copy data activity

1. Create a new pipeline under the Pipelines menu. We can create the pipeline directly or, to keep things organized, first create a folder and then create the pipeline inside that folder.
2. Under Move & transform we can see the Copy data activity. Drag and drop it onto the development canvas.
3. Once the activity is on the canvas, we can rename it under the General tab.
4. Under the Source tab, we can create a new dataset or use an existing one. To create a new dataset, click New. In the New integration dataset pane, look for Azure Blob Storage and click Continue. This gives us the option to select a linked service: either create a new one or select from the existing list. To create a new one, click +New and enter the source connection details. Once the linked service is created and the connection has been tested, we can select the file from blob storage: click Browse and select the loaded file. Under File path type, select the file path in the dataset.
5. Under the Sink settings, we can similarly create a sink (target) dataset and linked service for our SQL Server.
6. Moving on to the Mapping tab: this is where the mapping of source and target fields is done, so it is important. First, import the source schema by clicking Import schema. Second, select the collection reference from the data in our source file; this should be the lowest-level repeating XML tag present in the document. Finally, mark the checkbox to tell the ETL engine that this is a collection reference (an array). Then map the input columns to the output columns.

There are times when an XML document contains a large set of columns and it gets tiresome to map the columns on a one-to-one basis. To get around this, we can open the JSON code of the pipeline and manually edit it to create the mapping; in the JSON we can change the source and target column information. If the JSON seems too complex, we can go to the advanced editor to select the source and target columns: search for the required column on the source side, assign it to a column on the target side, and assign it a data type. Once this is done, we can save the pipeline and trigger or debug it to see if there are any errors.

This concludes the Copy activity; now we can go ahead with the Data flow task, the second method discussed earlier, in detail.

Data flow

Let's create a new pipeline and add a Data flow activity to it.

1. Give the data flow a name in the properties window. Select a value for the staging linked service and the storage folder; this is the staging location for PolyBase.
2. Edit the Data flow activity to create the data transformation pipeline. Here, as with the Copy activity, select the inline dataset and linked service. Under the source options, select the file mode as File and browse for the XML file.
Also, since XML files use a lot of namespaces, we can give aliases to some of them so that the code looks more manageable. This will help us a lot in

Data Science

Azure SQL VM Best Practices: Storage

STORAGE: Microsoft Azure offers several types of storage options, including Block Blobs, Azure Data Lake Storage, Managed Disks, Files, Queues, and Page Blobs. In general, we can use them for tasks as shown in the screenshot from Azure Cloud Storage Solutions and Services | Microsoft Azure.

Disks can be classified as Managed (newer and more feature-rich) and Unmanaged (older/legacy) disks; for our Azure SQL VM scenario, we will focus on Managed Disks.

First things first: currently, there are four types of managed storage disk options available in Azure.
1. Standard HDD (S4 - S80)
2. Standard SSD (E1 - E80)
3. Premium SSD (P1 - P80)
4. Ultra Disks

When evaluating disks for our workloads, the primary considerations should be latency, IOPS, and bandwidth. The comparison table in the Microsoft documentation provides insight into these storage options (image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types).

Standard HDD and Standard SSD are better suited to Dev/Test environments, backups, or workloads that are less sensitive to performance, simply because they do not have enough IOPS or throughput; they are not ideal for performance-sensitive workloads, and their performance may vary depending on other factors. To learn more about managed disks, check out the Microsoft storage documentation (https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd).

So, we will start with the premium options.

Premium SSD: Azure Premium SSDs provide high-performance, low-latency disk support for virtual machines (VMs) with input/output (IO)-intensive workloads, and they are a fit for mission-critical production applications. Premium SSD disks are designed to deliver low single-digit-millisecond latencies and the target IOPS and throughput specified in the documentation for 99.9% of the time. Premium SSDs can only be used with VM series that are premium-storage compatible. (Image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd)

When you provision a premium storage disk, unlike standard storage, you are guaranteed the capacity, IOPS, and throughput of that disk. For example, if you create a P40 disk, Azure provisions 2,048 GB of storage capacity, 7,500 IOPS, and 250 MB/s of throughput for that disk. Your application can use all or part of that capacity and performance. Premium disks also have a disk reservation option starting at P30.

What are Azure disk reservations? Disk reservation is the option to buy one year of disk storage in advance, decreasing your total cost. Azure disk reservation provides the opportunity to purchase Premium SSDs in the specified SKUs, from P30 (1 TiB) up to P80 (32 TiB), for a one-year term. There is no minimum number of disks required to purchase a disk reservation, and you can choose to pay with a single upfront payment or monthly payments. To learn more about disk reservation, follow the Microsoft documentation.

Ultra Disks: Azure Ultra Disks deliver high throughput, high IOPS, and consistently low-latency disk storage for Azure VMs. Generally speaking, these are ideal for applications that have to deal with very high data traffic, such as banking scenarios, retail stores, or perhaps a high-speed sport like Formula 1. Their IOPS can scale from 1,200 up to 160,000, at 300 IOPS per GiB of provisioned capacity. They are best suited to organizations and scenarios that want the best possible performance at all times.
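As a quick illustration of that scaling rule (300 IOPS per provisioned GiB, capped at 160,000 IOPS), the small sketch below estimates the maximum IOPS an Ultra Disk of a given size could be dialled up to; the disk sizes used are just examples, and actual configurable limits should be confirmed against the current Azure documentation.

# Rough illustration of the Ultra Disk IOPS scaling mentioned above:
# up to 300 IOPS per GiB of provisioned capacity, capped at 160,000 IOPS.
IOPS_PER_GIB = 300
MAX_IOPS = 160_000

def max_ultra_disk_iops(size_gib: int) -> int:
    """Return the highest IOPS an Ultra Disk of this size could be configured for."""
    return min(size_gib * IOPS_PER_GIB, MAX_IOPS)

for size in (128, 512, 1024):  # example disk sizes in GiB
    print(f"{size} GiB -> up to {max_ultra_disk_iops(size):,} IOPS")
# 128 GiB -> up to 38,400 IOPS; 512 GiB -> 153,600 IOPS; 1,024 GiB hits the 160,000 cap.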
Another advantage of Ultra Disks is the ability to change their configuration on the fly without restarting the VM. At any point in time, the configuration can be dialled higher or lower depending on the requirement, and this can be done on three parameters (capacity, IOPS, or bandwidth). Ultra Disks are suited to data-intensive workloads such as SAP HANA, top-tier databases, and transaction-heavy workloads. (Image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/disks-types#premium-ssd)

There are also some drawbacks associated with Ultra Disks:
- They are only supported in certain regions, and not every VM size is available with Ultra Disks in every supported region.
- Caching is not supported for reads and writes.
- They are only available as data disks and can only be created as empty disks.
- They do not currently support disk snapshots, disk export, changing the disk type, VM images, availability sets, Azure Dedicated Hosts, or Azure disk encryption.
- They do not support integration with Azure Backup or Azure Site Recovery.
- They use a 4K physical sector size by default; a 512E sector size is available as a public offering (no sign-up required). Most applications are compatible with 4K sector sizes, but some require 512-byte sectors. One example is Oracle Database, which requires release 12.2 or later to support 4K native disks; older versions of Oracle DB need the 512-byte sector size.
To learn more about Ultra Disks, you can refer to the Microsoft documentation.

Now let us get back to the storage configuration of the Azure SQL VM. Storage is crucial for Azure SQL VM performance, so create separate storage pools for data files, logs, and tempdb. Best practices include:
- tempdb on the local or ephemeral drive (cached and temp storage).
- Data files on premium storage (P30 - P40 recommended), with caching enabled.
- Log storage should be fast, with caching disabled.

With suitable disks chosen for our production workloads, we must also choose the right VM to make optimal use of the underlying storage and avoid bottlenecks, because VMs have their own storage I/O limits. When we get this selection wrong, we end up either VM I/O capped or disk I/O capped. Remember, in any case of over- or under-provisioning of resources, we lose on performance. To avoid this, before starting anything, get baseline metrics for the required VM: IOPS, throughput, latency, memory, and CPU can be measured using Azure Monitor or the Perfmon tool. It is also important to start with storage and then move to the VM, as resizing storage is much more complex than resizing a VM.

VM IO capping (image reference: https://docs.microsoft.com/en-us/azure/virtual-machines/dv3-dsv3-series): in this case, the bandwidth that the application requires is restricted at the VM level. The solution here is to enable host caching (bringing storage closer to the VM), upgrade the VM, or use VM-level bursting if the load lasts only for short periods.

Disk IO capping: in this case, the bandwidth that the application requires

Data Science

Performance best practices for SQL Server on IaaS Azure VMs

There are a few best practices we can follow for an Azure SQL Server VM. First, collect the CPU, memory, IOPS, throughput, and latency of the source workload during high-traffic times, and work through the application performance checklist. Gather data during high-traffic hours, such as business-day workloads and other heavy processes such as end-of-day or month-end processing and weekend ETL workloads. We should also size our resources for heavy workloads, such as end-of-quarter processing, and then scale down once the workload completes.

We can use this performance analysis to select a VM size that fits our workload's performance requirements. The following topics also need to be considered:
- Security considerations for SQL Server on Azure Virtual Machines
- Cluster configuration best practices (SQL Server on Azure VMs)
- Application patterns and development strategies for SQL Server on Azure Virtual Machines

When running SQL Server on Azure Virtual Machines, continue using the same database performance tuning options that apply to SQL Server in on-premises server environments. Many factors affect the performance of a relational database in a public cloud; for example, we have to consider the size of the virtual machine and the configuration of its data disks.

There is a trade-off between optimizing for cost and optimizing for performance. The following best practices focus on getting the best performance for SQL Server on Azure Virtual Machines. If our workload is less demanding, we might not require every recommended optimization; we should weigh these recommendations against our performance needs, costs, and workload patterns.

Quick checklist for performance: the following topics form a quick checklist for optimal performance of SQL Server on Azure Virtual Machines and are discussed briefly: VM size, storage, collecting a baseline, security considerations, and SQL Server features.

Detailed discussion on VM size

The following is a checklist of VM size best practices for running SQL Server on an Azure VM:
- Use VM sizes with four or more vCPUs, such as the E4ds_v4, which is well suited to memory-intensive enterprise applications and applications that benefit from low latency and high-speed local storage; the Standard_M8-4ms, which offers a high memory-to-vCPU ratio and a large amount of memory; or the DS12_v2 or higher.
- Use memory-optimized virtual machine sizes that offer a high memory-to-CPU ratio; these are great for relational database servers, medium to large caches, and in-memory analytics, and give the best performance for SQL Server workloads. The DSv2 11-15 series, the Edsv4 series, and the M and Mv2 series offer the optimal memory-to-vCore ratio needed for OLTP workloads. The M-series VMs offer the best memory-to-vCore ratio, which is needed for mission-critical workloads and is ideal for data warehouse workloads.
- Favor a higher memory-to-vCore ratio for demanding jobs and data warehouse workloads.
- Use the Azure Virtual Machine marketplace images, as their SQL Server settings and storage choices are configured for optimal SQL Server performance.
- Gather the target workload's performance attributes and use them to determine the appropriate VM size for the workload.

Based on the documentation from Microsoft, it is advised to use a DS3 or larger VM for the Enterprise edition of SQL Server; a DS2 or larger is fine when we use the Standard edition of SQL Server.
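To make the memory-to-vCore comparison in the checklist above concrete, here is a small illustrative sketch. The specifications are approximate figures for a few of the sizes named above (confirm them against the current Azure documentation before relying on them); the point is simply how the ratio is computed and compared.

# Illustrative only: approximate vCPU / memory figures for a few VM sizes named
# above, used to compare the memory-to-vCore ratio (GiB of RAM per vCPU).
vm_sizes = {
    "Standard_DS12_v2": {"vcpus": 4, "memory_gib": 28},
    "Standard_E4ds_v4": {"vcpus": 4, "memory_gib": 32},
    "Standard_M8-4ms":  {"vcpus": 4, "memory_gib": 218.75},  # constrained-core size
}

for name, spec in vm_sizes.items():
    ratio = spec["memory_gib"] / spec["vcpus"]
    print(f"{name}: {ratio:.1f} GiB per vCore")
# A higher ratio is what the checklist favours for demanding OLTP and
# data warehouse workloads.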
When we choose a predefined image from the portal to deploy SQL Server, it recommends a VM size by default.

A vCPU (virtual central processing unit) is a portion or share of the underlying physical CPU that is assigned to a particular virtual machine. A physical CPU can back multiple virtual CPUs, and these virtual CPUs are managed by software such as a hypervisor.

Here is the math behind calculating vCPUs. A hyper-threaded core exposes two processing threads, and each thread can be presented as one vCPU, so at a 1:1 allocation:

threads per core x cores x sockets = number of vCPUs

For example, suppose we select the Intel Xeon E-2288G as the underlying CPU. Its key stats are 8 cores / 16 threads, a 3.7 GHz processor base frequency, a 5.0 GHz max turbo boost, and 16 MB of onboard cache. One socket therefore exposes 2 x 8 x 1 = 16 logical processors, i.e. 16 vCPUs at a 1:1 mapping. Hypervisors, however, allow vCPUs to be overcommitted; using the older rule of thumb of roughly eight vCPUs per physical thread, that single socket could back on the order of 16 x 8 = 128 vCPUs.

To determine our workload and utilization, we need to answer a few questions before we start:
- Do our apps run at 100% CPU utilization at all times?
- Are there periods where utilization bursts?
- Do our apps have maintenance windows?
Answering these tells us our specifications and lets us make an informed decision about the underlying hardware.

If we have small VMs that hardly use any CPU time, we could easily get 20-30 VMs from an 8-core server. But if we have bigger workloads, such as a database server, we will get far fewer VMs from that same 8-core server. It is all about resource utilization and allocation.

For example, with 128 vCPUs available, the allotment to each VM depends on its specific load:
- 4 vCPUs per VM: 128 vCPUs / 4 vCPUs per VM = 32 VMs
- 2 vCPUs per VM: 128 vCPUs / 2 vCPUs per VM = 64 VMs
- 1 vCPU per VM: 128 vCPUs / 1 vCPU per VM = 128 VMs

In the Azure cloud there is a large array of machine types to select from. Still, since we are referring only to SQL workloads, we will emphasize the D, E, and M family machines, as SQL Server database engine performance is driven more by memory than by the number of cores available, e.g. Dsv2 11-15, Edsv4, Mv2, Msv2, M, Mdsv2, DCsv2, Dasv4, Dsv3, Ddsv4, Lsv2. These can mostly be grouped into general-purpose, memory-optimized, and storage-optimized machines.

Some of

Data Science

Top Six Analytics & Business Intelligence Platform Comparison

Microsoft Power BI (MPBI) is yet again the Leader in Gartner's Magic Quadrant for Analytics & Business Intelligence Platforms. Based on Gartner's Magic Quadrant, we have put together above a complete picture of the top six Analytics & Business Intelligence platforms, along with a comparison report on pricing, modern UI, ease of delivery, and third-party integrations.

Power BI is here to stay in the world of data visualization. MPBI stands out because of its top-notch ability to execute and its completeness of vision. Its strength lies in its alignment with Office 365 and Teams, its price/power combination, and the scope of its product ambition.

AI/ML, Azure, Data Infrastructure, Data Management, Data Science, Data/Performance Analytics, Decision Making, Microsoft Power BI

Data Democratization: Stage 6: Lock them up – Secure your data

Data needs to be protected just as networks are protected from malicious actors (internal and external). Data security should not be an afterthought but a design decision from the start. Enterprises have spent considerable resources securing their data centers and monitoring access to sensitive data, and yet there are significant data breaches every year. It has been reported that between January and September 2019, over 7.9 billion data records were exposed, a 33% increase over the same period in 2018. The year 2020 turned out to be worse than 2019, and 2021 is on pace to be the worst year so far.

Modern data architectures in the cloud make it easy for enterprises of any size to adopt these security standards, but they need to be augmented with resources from the internal IT team. In addition, modern cloud infrastructures are monitored using the best processes, top-notch technology, and expert security analysts. Hence, everyone reaps the benefit of best practices deployed universally in the cloud and of continued fixes for any security flaws in the technology stack.

The innovation of modern cloud computing, such as Azure, has lowered the barrier to setting up world-class, secure decision support systems.
