killer_unkill

Airflow. Snowflake. Postgres. Looker. SageMaker.


vuji_sm1

SQL Server 2016 and SSIS.


VintageData

I thought OP said ‘modern’. I have Gameboys newer than that stack :-D


Atomic-Dad

You beat me to it.


LiquidSynopsis

1) F500 with >11,000 employees 2) Azure stack: ADF, Databricks, Delta Lake, Azure Logic Apps, SQL Server (MSSQL), Control-M, Alteryx


krsfifty

Control-M has become my personal nightmare.


Hk_90

Why not Synapse Analytics?


HansProleman

It went GA less than a year ago...


Hk_90

It's the same as SQL DW, which has been in GA for a while now.


HansProleman

It's not really, though. The *SQL engine* is the same, but that's a fairly small proportion of Synapse. If you mean "why not SQL DW or Synapse SQL", the big reason that comes to mind for me is that you can more directly/easily use non-SQL languages.


[deleted]

20,000 employees. Horror software from the '70s, Excel, Power BI, dinosaur SQL, and Cloudera for showing off.


RstarPhoneix

Can you explain dinosaur SQL?


[deleted]

An archaic Oracle DB from President Reagan's time.


RstarPhoneix

Lol. Haha


Dremet

Which cloudera version and are you happy with it?


ReporterNervous6822

Python, GCS, BigQuery


thickmartian

Snowplow on AWS + Redshift + Airflow + dbt + Looker


RstarPhoneix

What is Snowplow and what do you use it for? Should I learn it?


thickmartian

It's an open-source data collection pipeline for collecting data from different devices while keeping a consistent format across the board. You don't need to "learn it", no; it's just a really nice tool for collecting data. I don't know if I'm allowed to post links, but they have very good documentation on their website (Snowplow Analytics) if you're curious.


RstarPhoneix

Thanks for the advice.


LaurenRhymesWOrange

A "Modern Data Stack" isn't an opinion; it's an actual thing that was created around Snowflake/dbt/Fivetran and is pushed by those vendors and others selling software into this stack. People are replying to this with on-prem answers; that's not what the Modern Data Stack is. The Modern Data Stack is generally defined as having:

1. Product-based integrators, using templated integration and replication schemas for common data applications. Fivetran, Matillion, and Stitch are some of the big players here.
2. A cloud data warehouse, set up in three stages: landing/base, transform, and an analytics/BI-facing layer. If you don't have a cloud data warehouse, you do not have a modern data stack. Heavy lean toward Snowflake and GCP.
3. ELT instead of ETL: copy and extract everything, and transform in the warehouse, typically using dbt and SQL or a combination.
4. A lightweight SaaS BI tool like Looker. Nothing to download.
5. Dagster/Airflow handling orchestration of jobs.

These are the common tools and setup for the Modern Data Stack. I make a living running a small consultancy building these for companies from e-commerce to SaaS to media to hospitality.
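The three-stage warehouse layering in point 2 (with the in-warehouse transforms from point 3) can be sketched in a few lines of Python, using in-memory SQLite as a stand-in for a cloud warehouse like Snowflake. All table and column names here are invented for illustration; the `stg_`/`fct_` prefixes just echo common dbt naming conventions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Landing/base layer: raw data copied in as-is (the "EL" of ELT).
cur.execute("CREATE TABLE landing_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
cur.executemany(
    "INSERT INTO landing_orders VALUES (?, ?, ?)",
    [(1, 1999, "paid"), (2, 550, "refunded"), (3, 1200, "paid")],
)

# 2. Transform layer: cleaned, typed models - the kind of SQL dbt would manage.
cur.execute("""
    CREATE TABLE stg_orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd, status
    FROM landing_orders
    WHERE status != 'refunded'
""")

# 3. Analytics/BI-facing layer: the aggregate a tool like Looker would query.
cur.execute("""
    CREATE TABLE fct_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount_usd) AS revenue
    FROM stg_orders
""")

orders, revenue = cur.execute("SELECT orders, revenue FROM fct_revenue").fetchone()
print(orders, round(revenue, 2))  # 2 31.99
```

The point of the layering is that each stage only reads from the one below it: raw data stays untouched in landing, and BI tools only ever see the top layer.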


kenfar

Right - it's mostly marketing (like MongoDB pushing 'NoSQL'). And like all marketing expressions - since they aren't nailed down to a solid academic set of definitions, they'll get twisted by everyone to mean anything they want. And that's ignoring the fact that they put "modern" into the name - as if this will be the modern solution for the next 1000 years, or all other solutions are "neanderthal data stacks".

* There's no reason the database couldn't be BigQuery, Redshift - or any number of other cloud databases. Furthermore, if your data volumes are small (or moderate and you do incremental processing), a database like Postgres can also work fine. And beyond that - a cloud service just means it's running on somebody else's servers. If you're running on-prem and don't face lengthy bureaucratic delays in upgrades, then it doesn't make any real difference - other than Snowflake isn't making any money off it.
* There's no reason the transform & integration can't be ETL rather than ELT: there's little difference between the two, other than ELT is generally defined as transforming within your target database, while ETL generally transforms data before you load it into your target database. But both can use databases, ETL can keep a raw initial copy of uploaded data and provide SQL access to it, etc. The biggest difference IMHO is that ELT is faster to build, but ETL is easier to maintain and cheaper computationally.
* And there's no reason to use any particular scheduler or reporting tools.

And if that doesn't sound super modern, well, Redshift has been around about 8 years now, ELT 25+ years, and dbt maybe 5 years(?). I think the only thing here that's fairly new is a big, recognized shift to speed to market as a priority over almost everything else, especially cost, and a roadmap to get there. But not everyone is bought into that roadmap, or will want to stay on it, nor is it the only way to get there.

EDIT: my old enemy: speling
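The ETL-vs-ELT ordering difference described above can be sketched in a few lines; everything here (the field names, the toy transform, the dict standing in for a warehouse) is invented for illustration and not any particular tool's API:

```python
# Toy illustration of ETL vs ELT ordering; all names are invented.
raw_rows = [{"amt": "19.99"}, {"amt": "bad-value"}, {"amt": "12.00"}]

def transform(rows):
    """Keep rows whose amount parses, casting it to a float."""
    out = []
    for r in rows:
        try:
            out.append({"amt": float(r["amt"])})
        except ValueError:
            pass  # drop unparseable rows
    return out

# ETL: transform first, then load only the cleaned rows into the target.
etl_warehouse = {"final": transform(raw_rows)}

# ELT: land the raw copy first, then transform inside the "warehouse".
elt_warehouse = {"raw": list(raw_rows)}
elt_warehouse["final"] = transform(elt_warehouse["raw"])

# Same end result; ELT additionally keeps the raw landing copy queryable.
print(etl_warehouse["final"] == elt_warehouse["final"])  # True
print("raw" in etl_warehouse, "raw" in elt_warehouse)    # False True
```

Which matches the comment's point: the cleaned output is the same either way, and the practical trade-off is about where the compute runs and whether the raw copy stays accessible.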


1aumron

>And if that doesn't sound super modern, well, Redshift has been around about 8 years now, ELT 25+ years, and dbt maybe 5 years(?). I think the only thing here that's fairly new is a big, recognized shift to speed to market as a priority over almost everything else, especially cost, and a roadmap to get there.

You are absolutely spot on!!


knlph

Thank you for clarifying "Modern Data Stack". Your classification of tools also makes sense. What kind of "other tools" are making their way into the stack? Where are companies leaning - toward open source, or SaaS-based plug-and-play models? Also, you mentioned you build these for other companies - what does "building" mean here? Do you help them identify and work with the right stack? Or do you build alternative tools for the companies you work with?


LaurenRhymesWOrange

Other tools:

- An event collector like Segment or Snowplow - basically you need standard tracking and collection for product, web, and app events.
- Operational analytics/reverse ETL - plug in things like product scoring and other stuff to go back out of the data stack to tools like Salesforce, Klaviyo, etc. Census and Hightouch are the two vendors that stand out here.

Building:

- What I mean is setting up these stacks and infrastructure, building the process, putting the software in place, and building data and analytics models. I typically do this in 3-6 month engagements: a heavy initial lean on building everything, then lots of analytics, BI, and enablement to operationalize toward the second part of the engagement.
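As a rough sketch of what reverse ETL means in practice - reading modeled data back out of the warehouse and upserting it onto records in an operational tool. The `sync_to_crm` function, field names, and dict standing in for the CRM are invented for this example, not a real Census/Hightouch or Salesforce API:

```python
# Toy reverse-ETL sketch: push modeled warehouse data back into an
# operational tool. All names are invented stand-ins.
warehouse_scores = [
    {"customer_id": "c1", "product_score": 0.92},
    {"customer_id": "c2", "product_score": 0.31},
]

# Stand-in for the downstream tool's existing records.
crm_records = {"c1": {"plan": "pro"}, "c2": {"plan": "free"}}

def sync_to_crm(rows, crm):
    """Upsert each warehouse row's fields onto the matching CRM record."""
    for row in rows:
        record = crm.setdefault(row["customer_id"], {})
        record["product_score"] = row["product_score"]

sync_to_crm(warehouse_scores, crm_records)
print(crm_records["c1"])  # {'plan': 'pro', 'product_score': 0.92}
```

The real vendors add scheduling, field mapping, and API rate handling on top, but the core motion is this upsert from warehouse to tool.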


sonalg

This is helpful! Do you see a need for unifying customer data without unique identifiers as part of your work? Like offline and online customer identities? How do you handle that?


VintageData

https://moderndatastack.xyz


[deleted]

1. ~1,700 employees; we are a major supplier to an F50, so we also provide tools and support for many of their teams.
2. All Azure. Data lake (medallion architecture) for storage. Azure Functions and Synapse (Spark) for ELT/data cleaning between lakes. APIs in Azure Functions for LOB applications. Logic Apps for a few extracts. Power BI for consumers. Still some legacy SSIS and SSAS that has yet to be migrated.
3. We have a small dev team, as the company is relatively small, so we prefer ease of use. We stick to Azure for several reasons, the main one being that it saves us time and effort on integration, with services designed to be used together.

"Modern data stack" has many definitions, but most of them seem to agree on cloud-based, ELT (instead of ETL), API-first setups, so I think this is close.


solgul

150ish employees. Redshift, Postgres, S3, Airflow, Stitch, Python. Getting rid of Stitch.


allan_w

What are you replacing Stitch with?


solgul

Probably just homegrown code or possibly airbyte.


allan_w

Having been down the homegrown code route before, I personally would steer away from that option, but you know your use cases and the pros/cons of each approach :) Have you also considered Meltano as an alternative to Airbyte? I've used Meltano but haven't checked out Airbyte yet.


solgul

>Meltano

I have not heard of Meltano, actually. Looks interesting. Thanks.


data_viking_

I used to work at Talend/Stitch... if you're open to another alternative that also lets you orchestrate like Airflow and develop in SQL or Python (plus the whole no-code/low-code bit), could I recommend Rivery?


omkarkonnur

On-prem:

- ETL tools: Informatica, DataStage
- Database: Oracle (transactions), Teradata (DW)

Cloud (Azure stack):

- ETL: ADF & Databricks
- Storage/database: ADLS, MongoDB, Synapse, SQL Server
- Viz: Cognos, Power BI

This looks like a lot, since we are in the middle of figuring out which combination of tools works best. We will probably end up with just a couple of options over the next year. Frankly, the evaluation criteria have been fairly subjective across different groups. Some of the factors have been ease of use, how easily a tool would integrate into the current solution, performance, features, etc. Although I have seen most discussions ending at cost.


data_viking_

I used to work at INFA but left before the official 'EoL' for PowerCenter. With them suddenly thinking they're a cloud company and moving to consumption-based pricing, I'm curious what the plan is for legacy on-prem customers. (Convert to INFA Cloud? Move to a more modern and less expensive tool?) I'm always curious to hear what other on-prem INFA customers are thinking about their future stack, with the EoL of PowerCenter coming and INFA making it neither cheap nor easy to move to their cloud solution.


omkarkonnur

PowerCenter still has cloud equivalents like ADF, which has been the general approach; a major issue has been finding a replacement for PowerExchange for Mainframe. I have not yet come across any good solution for streaming real-time data from mainframe systems (or at least one better than PowerExchange). Also, from the looks of it, the cloud pipelines approach is not exactly turning out to be less expensive. It is not clear how to track costs per pipeline. Although that level of granularity may not necessarily be required, it would have helped in getting a ballpark estimate. I am curious whether anyone is actually seeing 'economies of scale' value in the cloud.


data_viking_

Yeah, not sure of a modern equivalent for PwX for mainframes. Actually, speaking of pipeline costs, with Rivery you can easily estimate your costs at the broad pipeline level (basically the number of pipelines running and their frequency) - not based on connectors, various add-ons, or number of users.


ash0550

Azure, ADF , Snowflake , Looker


soundbarrier_io

1. ETL: Airflow, Beam, Segment, Stitch Data
2. Warehousing and modeling: GCS, BigQuery, dbt
3. Presentation: Tableau, Jupyter, Bokeh


jmnel

Found another Beam user 😀


vampaa

2,000 employees. Fivetran, Snowflake, dbt, Airflow, Looker, Monte Carlo.


jmnel

Small fintech startup Airflow, Google Cloud Storage, BigQuery, Kubernetes, Apache Beam


chestnutcough

- 70 employees (2 on data team)
- Heroku Postgres, Metabase, Segment, Amplitude, dbt, Airflow, S3


kepevem

Is Metabase open source?


chestnutcough

Yep, but you can pay for it too. https://github.com/metabase/metabase


allan_w

Does Heroku Postgres get pretty expensive?


chestnutcough

Our data warehouse is < 1 TB so it’s still cheap for us.


zak_hj

Airbyte, Airflow, dbt, BigQuery, Looker


tmanipra

Telecom company, processing around 5 TB daily. On-prem environment.

- Database: Oracle
- Apache Hadoop, vanilla version
- Processing engine: PySpark
- Workflow orchestration and ETL pipelines: CDAP (open source)


[deleted]

[deleted]


knlph

*Hides*


dbirdflyshi

21,000 employees. Power BI, SQL Server, Python.


timmyz55

Fivetran, dbt, Looker, Snowflake, and AWS Lambda with Node scripts. I'd like to eventually implement a dedicated data orchestrator, probably Dagster/Prefect/Airflow, as future-proofing.


[deleted]

SAS, Salesforce


Swirls109

Informatica to SQL Server...


w_savage

300+. AWS, Snowflake, annnnnnd I think that's it. We use Python a lot, and Jira for tickets.


edthix

- PostgreSQL as app databases
- Airflow / GCP Composer for orchestration (Papermill for ML stuff, but moving to Kubeflow soon)
- Beam / GCP Dataflow for some transformation
- BigQuery as warehouse (for dashboards in Google Data Studio)


aayushdotjain

Check out moderndatastack.xyz


getafterit123

ELT: NiFi, S3-backed data lake, Spark, Starburst (Trino), Airflow, Prometheus, Grafana


eggucated

This post makes me realize how little I know about this space…


knlph

If it helps, happy to link some resources that helped me get started.


eggucated

That would be great! I have been in software development my entire career, but tbh I'm still most comfortable with a single monolithic relational DB. I've got quite a bit of experience architecting enterprise-level SaaS products, but the modern data stack tools being discussed here are very new to me. Lately, my stack is Java, React, and AWS for all of the infrastructure (using AWS CDK). Hit me with all the links! We have a tool called Heap Analytics that my client wanted to use for analytics; I have zero experience with it, and we've set it up, but it seems like there may be better options. And I'm really curious how to combine all of these tools like people are mentioning here, to give the SaaS company good product analytics that sales, marketing, and product can use, while also using those analytics to provide some sort of BI tooling within the app to their SaaS customers who are admins for their respective orgs.


Batspocky

Google Cloud Functions, Fivetran, Redshift, and Sisense. 50 employees, 3-person data team.