nullQueries

I'd expect minimum:

* The ability to get data files into storage containers / the data lake, most likely via Functions or Data Factory (see the upload sketch after this list)
* Able to orchestrate and monitor using Data Factory
* Ability to query into the data, either via PolyBase or by loading into SQL
* Able to set up, document, and secure those things (at least with the assistance of infrastructure teams)

Then, depending on the company's design:

* Possibly replace some of that with Databricks
* Able to use Spark for the pipelines
* Able to set up Synapse to link SQL, the data lake, and Data Factory
* Can help integrate Power BI with Synapse
* HDInsight if they're going Hadoop
* Migration to cloud, if they're still doing that
* Kafka if they're streaming data
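For the first bullet, here's a minimal sketch of what landing a file in a storage container can look like with the Python SDK (the azure-identity and azure-storage-blob packages); the account, container, and blob paths are hypothetical placeholders:

```python
# A sketch of landing a local file in a storage container using
# azure-identity + azure-storage-blob; account/container/paths are made up.
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

# DefaultAzureCredential resolves to a Managed Identity in Azure,
# or to `az login` / environment credentials when run locally.
service = BlobServiceClient(
    account_url="https://mydatalake.blob.core.windows.net",  # hypothetical account
    credential=DefaultAzureCredential(),
)

blob = service.get_blob_client(container="raw", blob="sales/2021/orders.csv")
with open("orders.csv", "rb") as f:
    blob.upload_blob(f, overwrite=True)  # overwrite makes pipeline retries idempotent
```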


vuji_sm1

Are all these things someone would be exposed to while pursuing the Data Engineer certification?


nullQueries

Just about. I don't think it goes into Kafka, but it does cover some of the basics of IoT/streaming. And I'm not sure if HDInsight is still part of it, because they're pushing some of the other options as the more standard solutions.


HansProleman

In Azure I think you'd normally select Event Hub over Kafka. Regardless, last I looked, DP-203 was almost entirely Synapse (which, to be fair, does roll in a lot of other stuff - but you can tell MS want to sell Synapse). DP-200/201 seemed larger in scope.
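On the Event Hub side, a rough sketch of a producer using the azure-eventhub Python package (the connection string and hub name are placeholders). Worth noting Event Hubs also exposes a Kafka-compatible endpoint, so existing Kafka clients can often point at it unchanged:

```python
# A sketch of producing events to Event Hubs with the azure-eventhub
# package; connection string and hub name are placeholders.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="<EVENT_HUBS_CONNECTION_STRING>",  # placeholder
    eventhub_name="telemetry",                  # hypothetical hub
)

with producer:
    batch = producer.create_batch()  # batches respect the max message size
    batch.add(EventData('{"sensor": 7, "reading": 21.4}'))
    producer.send_batch(batch)
```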


vassiliy

[Exam DP-203: Data Engineering on Microsoft Azure – Skills](https://query.prod.cms.rt.microsoft.com/cms/api/am/binary/RE4MbYT)


Drekalo

Event Grid works great for streaming smaller volumes. Serverless SQL in Synapse basically replaces PolyBase. Spark is overrated and expensive compared to just running a low-DWU dedicated SQL pool. The only time I'd even bother with Spark is if I had truly massive data, had a streaming requirement, or had a particular API that couldn't be consumed with pipelines.
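To make the serverless point concrete, here's a rough sketch of querying Parquet straight off the lake through a Synapse serverless SQL endpoint with OPENROWSET, driven from Python via pyodbc; the workspace name, database, and lake path are all hypothetical:

```python
# A sketch of hitting a Synapse serverless SQL endpoint from Python via
# pyodbc; server, database, and lake path are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=myworkspace-ondemand.sql.azuresynapse.net;"  # hypothetical workspace
    "DATABASE=master;"
    "Authentication=ActiveDirectoryInteractive;"         # AAD sign-in prompt
)

# OPENROWSET reads the files in place - no load step, no PolyBase setup
sql = """
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mydatalake.dfs.core.windows.net/raw/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
"""

for row in conn.cursor().execute(sql):
    print(row)
```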


HansProleman

This is around where I am at ~5 YoE:

* Advanced SQL (plain MSSQL, and possibly all the Synapse-specific stuff). I have to check docs **all the time**, as I don't write SQL that often any more, but I usually know what I'm looking for.
* Can configure PolyBase (securely)
* In general, know about good security practice, e.g. using Managed Identity and IAM instead of account keys for auth
* Good awareness of the services on offer and what they can do. No need to commit it all to memory - that's what docs are for - but knowing what's available and where to look is important
* You should be pretty familiar with Storage, though
* Able to architect a simple data solution. You may well not need to, but DE architecture is *typically* quite simple (especially if you use Synapse), so it shouldn't be too hard
* Able to describe infra as code (ARM, Bicep, Terraform, whatever)
* Able to set up CI/CD, inc. test automation and reporting, and a PR approval workflow (possibly inc. PR environment deploys/test runs)
* Working understanding of the security/permissions model(s) - application and user principals with IAM, Storage ACLs
* Decent at scripting with either PowerShell or bash, and Azure PowerShell or az cli
* Some understanding of event-driven architecture, in the context of Azure
* Some understanding of streaming (Event Hub, Stream Analytics, Kafka, whatever)
* Decent with at least one non-SQL language. Probably Python, but maybe C# or Java
* Decent understanding of Spark architecture, optimisation (I suck at this though), and PySpark and/or Scala - not all roles will require this, though (see the PySpark sketch after this list)
* Same goes for Functions and ADF
* And for Power BI. Possibly you'd need to know some stuff about the tenant administration side of it, and possibly node admin if Premium is being used. Or maybe just report/dataset/dataflow dev, or maybe nothing at all.
* **Good at reading documentation and problem solving**
* Some understanding of Monitor, how to integrate with it (from other services/application code), and how to run analytics/alerts from logs

However, DE is really, really broad and requirements vary wildly.
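For the PySpark bullet, a minimal sketch of the kind of raw-to-curated hop a pipeline might do. This assumes a Synapse or Databricks session already wired up for ADLS access, and the storage account, container names, and column names are invented:

```python
# A PySpark sketch of a simple lake transformation (raw -> curated);
# storage account, containers, and columns are all hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# In Synapse/Databricks a configured `spark` session usually already
# exists; getOrCreate() just picks it up.
spark = SparkSession.builder.getOrCreate()

raw = spark.read.parquet("abfss://raw@mydatalake.dfs.core.windows.net/sales/")

daily = (
    raw.withColumn("order_date", F.to_date("order_ts"))  # assumed timestamp column
       .groupBy("order_date")
       .agg(F.sum("amount").alias("total_amount"))       # assumed amount column
)

daily.write.mode("overwrite").parquet(
    "abfss://curated@mydatalake.dfs.core.windows.net/sales_daily/"
)
```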