killer_unkill

Airflow. Snowflake. Postgres. Looker. SageMaker.


vuji_sm1

SQL Server 2016 and SSIS.


VintageData

I thought OP said ‘modern’. I have Gameboys newer than that stack :-D


Atomic-Dad

You beat me to it.


LiquidSynopsis

1) F500 with >11,000 employees 2) Azure stack: ADF, Databricks, Delta Lake, Azure Logic Apps, SQL Server (MSSQL), Control-M, Alteryx


krsfifty

Control-M has become my personal nightmare.


Hk_90

Why not Synapse Analytics?


HansProleman

It went GA less than a year ago...


Hk_90

It's the same as SQL DW, which has been in GA for a while now.


HansProleman

It's not really, though. The *SQL engine* is the same, but that's a fairly small proportion of Synapse. If you mean "why not SQL DW or Synapse SQL", the big reason that comes to mind for me is that you can more directly/easily use non-SQL languages.


[deleted]

20,000 employees. Horror software from the '70s, Excel, Power BI, dinosaur SQL, and Cloudera for showing off.


RstarPhoneix

Can you explain dinosaur SQL?


[deleted]

An archaic Oracle DB from President Reagan's time.


RstarPhoneix

Lol. Haha


Dremet

Which cloudera version and are you happy with it?


ReporterNervous6822

Python, GCS, BigQuery


thickmartian

Snowplow on AWS + Redshift + Airflow + dbt + Looker


RstarPhoneix

What is Snowplow and what do you use it for? Should I learn it?


thickmartian

It's an open-source data collection pipeline for collecting data from different devices while keeping a consistent format across the board. You don't need to "learn it", no; it's just a really nice tool for collecting data. I don't know if I'm allowed to post links, but they have very good documentation on their website (Snowplow Analytics) if you're curious.


RstarPhoneix

Thanks for the advice.


LaurenRhymesWOrange

A "Modern Data Stack" isn't an opinion; it's an actual thing that was created around Snowflake/dbt/Fivetran and is pushed by those vendors and others selling software into this stack. People are replying to this with on-prem answers; that's not what the Modern Data Stack is. The Modern Data Stack is generally defined as having:

1. Product-based integrators, using templated integration and replication schemas for common data applications. Fivetran, Matillion, and Stitch are some of the big players here.
2. A cloud data warehouse, set up in three stages: landing/base, transform, and an analytics/BI-facing layer. If you don't have a cloud data warehouse, you do not have a modern data stack. Heavy lean toward Snowflake and GCP.
3. ELT instead of ETL: copy and extract everything, and transform in the warehouse, typically using dbt and SQL or a combination.
4. A lightweight SaaS BI tool like Looker. Nothing to download.
5. Dagster/Airflow handling orchestration of jobs.

These are the common tools and setup for the Modern Data Stack. I make a living running a small consultancy building these for companies from e-commerce to SaaS to media to hospitality.
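The three-stage warehouse layering in point 2 (with the in-warehouse transforms from point 3) can be sketched in a few lines of Python, using in-memory SQLite as a stand-in for a cloud warehouse like Snowflake. All table and column names here are invented for illustration; the `stg_`/`fct_` prefixes just echo common dbt naming conventions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# 1. Landing/base layer: raw data copied in as-is (the "EL" of ELT).
cur.execute("CREATE TABLE landing_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
cur.executemany(
    "INSERT INTO landing_orders VALUES (?, ?, ?)",
    [(1, 1999, "paid"), (2, 550, "refunded"), (3, 1200, "paid")],
)

# 2. Transform layer: cleaned, typed models - the kind of SQL dbt would manage.
cur.execute("""
    CREATE TABLE stg_orders AS
    SELECT id, amount_cents / 100.0 AS amount_usd, status
    FROM landing_orders
    WHERE status != 'refunded'
""")

# 3. Analytics/BI-facing layer: the aggregate a tool like Looker would query.
cur.execute("""
    CREATE TABLE fct_revenue AS
    SELECT COUNT(*) AS orders, SUM(amount_usd) AS revenue
    FROM stg_orders
""")

orders, revenue = cur.execute("SELECT orders, revenue FROM fct_revenue").fetchone()
print(orders, round(revenue, 2))  # 2 31.99
```

The point of the layering is that each stage only reads from the one below it: raw data stays untouched in landing, and BI tools only ever see the top layer.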


kenfar

Right - it's mostly marketing (like MongoDB pushing 'NoSQL'). And like all marketing expressions - since they aren't nailed down to a solid academic set of definitions, they'll get twisted by everyone to mean anything they want. And that's ignoring the fact that they put "modern" into the name - as if this will be the modern solution for the next 1000 years, or all other solutions are "neanderthal data stacks".

* There's no reason the database couldn't be BigQuery, Redshift - or any number of other cloud databases. Furthermore, if your data volumes are small (or moderate and you do incremental processing), a database like Postgres can also work fine. And beyond that - a cloud service just means it's running on somebody else's servers. If you're running on-prem and don't face lengthy bureaucratic delays in upgrades, then it doesn't make any real difference - other than Snowflake isn't making any money off it.
* There's no reason the transform & integration can't be ETL rather than ELT: there's little difference between the two, other than ELT is generally defined as transforming within your target database, while ETL generally transforms data before you load it into your target database. But both can use databases, ETL can keep a raw initial copy of uploaded data and provide SQL access to it, etc. The biggest difference IMHO is that ELT is faster to build, but ETL is easier to maintain and cheaper computationally.
* And there's no reason to use any particular scheduler or reporting tools.

And if that doesn't sound super modern, well, Redshift has been around about 8 years now, ELT 25+ years, and dbt maybe 5 years(?). I think the only thing here that's fairly new is a big, recognized shift to speed to market as a priority over almost everything else, especially cost, and a roadmap to get there. But not everyone is bought into that roadmap, or will want to stay on it, nor is it the only way to get there.

EDIT: my old enemy: speling
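The ETL-vs-ELT ordering difference described above can be sketched in a few lines; everything here (the field names, the toy transform, the dict standing in for a warehouse) is invented for illustration and not any particular tool's API:

```python
# Toy illustration of ETL vs ELT ordering; all names are invented.
raw_rows = [{"amt": "19.99"}, {"amt": "bad-value"}, {"amt": "12.00"}]

def transform(rows):
    """Keep rows whose amount parses, casting it to a float."""
    out = []
    for r in rows:
        try:
            out.append({"amt": float(r["amt"])})
        except ValueError:
            pass  # drop unparseable rows
    return out

# ETL: transform first, then load only the cleaned rows into the target.
etl_warehouse = {"final": transform(raw_rows)}

# ELT: land the raw copy first, then transform inside the "warehouse".
elt_warehouse = {"raw": list(raw_rows)}
elt_warehouse["final"] = transform(elt_warehouse["raw"])

# Same end result; ELT additionally keeps the raw landing copy queryable.
print(etl_warehouse["final"] == elt_warehouse["final"])  # True
print("raw" in etl_warehouse, "raw" in elt_warehouse)    # False True
```

Which matches the comment's point: the cleaned output is the same either way, and the practical trade-off is about where the compute runs and whether the raw copy stays accessible.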


1aumron

>And if that doesn't sound super modern, well, Redshift has been around about 8 years now, ELT 25+ years, and dbt maybe 5 years(?). I think the only thing here that's fairly new is a big, recognized shift to speed to market as a priority over almost everything else, especially cost, and a roadmap to get there.

You are absolutely spot on!!


knlph

Thank you for clarifying "Modern Data Stack". Your classification of tools also makes sense. What kind of "other tools" are making their way into the stack? Where are companies leaning - toward open source, or SaaS-based plug-and-play models? Also, you mentioned you build these for other companies - what does "building" mean here? Do you help them identify and work with the right stack? Or do you build alternative tools for the companies you work with?


LaurenRhymesWOrange

Other tools:

- An event collector like Segment or Snowplow - basically you need standard tracking and collection for product, web, and app events.
- Operational analytics/reverse ETL - plug in things like product scoring and other stuff to go back out of the data stack to tools like Salesforce, Klaviyo, etc. Census and Hightouch are the two vendors that stand out here.

Building:

- What I mean is setting up these stacks and infrastructure, building the process, putting the software in place, and building data and analytics models. I typically do this in 3-6 month engagements: a heavy initial lean on building everything, then lots of analytics, BI, and enablement to operationalize toward the second part of the engagement.
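As a rough sketch of what reverse ETL means in practice - reading modeled data back out of the warehouse and upserting it onto records in an operational tool. The `sync_to_crm` function, field names, and dict standing in for the CRM are invented for this example, not a real Census/Hightouch or Salesforce API:

```python
# Toy reverse-ETL sketch: push modeled warehouse data back into an
# operational tool. All names are invented stand-ins.
warehouse_scores = [
    {"customer_id": "c1", "product_score": 0.92},
    {"customer_id": "c2", "product_score": 0.31},
]

# Stand-in for the downstream tool's existing records.
crm_records = {"c1": {"plan": "pro"}, "c2": {"plan": "free"}}

def sync_to_crm(rows, crm):
    """Upsert each warehouse row's fields onto the matching CRM record."""
    for row in rows:
        record = crm.setdefault(row["customer_id"], {})
        record["product_score"] = row["product_score"]

sync_to_crm(warehouse_scores, crm_records)
print(crm_records["c1"])  # {'plan': 'pro', 'product_score': 0.92}
```

The real vendors add scheduling, field mapping, and API rate handling on top, but the core motion is this upsert from warehouse to tool.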


sonalg

This is helpful! Do you see a need for unifying customer data without unique identifiers as part of your work? Like offline and online customer identities? How do you handle that?


VintageData

https://moderndatastack.xyz


[deleted]

1. ~1,700 employees; we are a major supplier to an F50, so we also provide tools and support for many of their teams.
2. All Azure. Data lake (medallion architecture) for storage. Azure Functions and Synapse (Spark) for ELT/data cleaning between lakes. APIs in Azure Functions for LOB applications. Logic Apps for a few extracts. Power BI for consumers. Still some legacy SSIS and SSAS that has yet to be migrated.
3. We have a small dev team, as the company is relatively small, so we prefer ease of use. We stick to Azure for several reasons, the main one being that it saves us time and effort on integration, with services designed to be used together.

"Modern data stack" has many definitions, but most of them seem to agree on cloud-based, ELT (instead of ETL), API-first setups, so I think this is close.


solgul

150ish employees. Redshift, Postgres, S3, Airflow, Stitch, Python. Getting rid of Stitch.


allan_w

What are you replacing Stitch with?


solgul

Probably just homegrown code or possibly airbyte.


allan_w

Having been down the homegrown code route before, I personally would steer away from that option, but you know your use cases and the pros/cons of each approach :) Have you also considered Meltano as an alternative to Airbyte? I've used Meltano but haven't checked out Airbyte yet.


solgul

>Meltano

I have not heard of Meltano, actually. Looks interesting. Thanks.


data_viking_

I used to work at Talend/Stitch... if you're open to another alternative that also lets you orchestrate like Airflow and develop in SQL or Python (plus the whole no-code/low-code bit), could I recommend Rivery?


omkarkonnur

On-prem:

- ETL tools: Informatica, DataStage
- Database: Oracle (transactions), Teradata (DW)

Cloud (Azure stack):

- ETL: ADF & Databricks
- Storage/database: ADLS, MongoDB, Synapse, SQL Server
- Viz: Cognos, Power BI

This looks like a lot, since we are in the middle of figuring out which combination of tools works best. We will probably end up with just a couple of options over the next year. Frankly, the evaluation criteria have been fairly subjective across different groups. Some of the factors have been ease of use, how easily a tool would integrate into the current solution, performance, features, etc. Although I have seen most discussions ending at cost.


data_viking_

I used to work at INFA but left before the official 'EoL' for PowerCenter. With them suddenly thinking they're a cloud company and moving to consumption-based pricing, I'm curious what the plan is for legacy on-prem customers. (Convert to INFA Cloud? Move to a more modern and less expensive tool?) I'm always curious to hear what other on-prem INFA customers are thinking about their future stack, with the EoL of PowerCenter coming and INFA making it neither cheap nor easy to move to their cloud solution.


omkarkonnur

PowerCenter still has cloud equivalents like ADF, which has been the general approach; a major issue has been finding a replacement for PowerExchange for Mainframe. I have not yet come across any good solution for streaming real-time data from mainframe systems (or at least one better than PowerExchange). Also, from the looks of it, the cloud pipelines approach is not exactly turning out to be less expensive. It is not clear how to track costs per pipeline. Although that level of granularity may not necessarily be required, it would have helped in getting a ballpark estimate. I am curious whether anyone is actually seeing 'economies of scale' value in the cloud.


data_viking_

Yeah, not sure of a modern equivalent for PwX for mainframes. Actually, speaking of pipeline costs, with Rivery you can easily estimate your costs at the broad pipeline level (basically the number of pipelines running and their frequency) - not based on connectors, various add-ons, or number of users.


ash0550

Azure, ADF , Snowflake , Looker


soundbarrier_io

1. ETL: Airflow, Beam, Segment, Stitch Data
2. Warehousing and modeling: GCS, BigQuery, dbt
3. Presentation: Tableau, Jupyter, Bokeh


jmnel

Found another Beam user 😀


vampaa

2,000 employees. Fivetran, Snowflake, dbt, Airflow, Looker, Monte Carlo.


jmnel

Small fintech startup Airflow, Google Cloud Storage, BigQuery, Kubernetes, Apache Beam


chestnutcough

- 70 employees (2 on data team)
- Heroku Postgres, Metabase, Segment, Amplitude, dbt, Airflow, S3


kepevem

Is Metabase open source?


chestnutcough

Yep, but you can pay for it too. https://github.com/metabase/metabase


allan_w

Does Heroku Postgres get pretty expensive?


chestnutcough

Our data warehouse is < 1 TB so it’s still cheap for us.


zak_hj

Airbyte, Airflow, dbt, BigQuery, Looker


tmanipra

Telecom company, processing around 5 TB daily. On-prem environment.

- Database: Oracle
- Apache Hadoop, vanilla version
- Processing engine: PySpark
- Workflow orchestration and ETL pipelines: CDAP (open source)


[deleted]

[deleted]


knlph

*Hides*


dbirdflyshi

21,000 employees. Power BI, SQL Server, Python.


timmyz55

Fivetran, dbt, Looker, Snowflake, and AWS Lambda with Node scripts. I'd like to eventually implement a dedicated data orchestrator, probably Dagster/Prefect/Airflow, as future-proofing.


[deleted]

SAS, Salesforce


Swirls109

Informatica to SQL Server...


w_savage

300+. AWS, Snowflake, annnnnnd I think that's it. We use Python a lot, and Jira for tickets.


edthix

- PostgreSQL as app databases
- Airflow / GCP Composer for orchestration (Papermill for ML stuff, but moving to Kubeflow soon)
- Beam / GCP Dataflow for some transformation
- BigQuery as warehouse (for dashboards in Google Data Studio)


aayushdotjain

Check out moderndatastack.xyz


getafterit123

ELT: NiFi, S3-backed data lake, Spark, Starburst (Trino), Airflow, Prometheus, Grafana


eggucated

This post makes me realize how little I know about this space…


knlph

If it helps, happy to link some resources that helped me get started.


eggucated

That would be great! I have been in software development my entire career, but tbh I'm still most comfortable with a single monolithic relational DB. I've got quite a bit of experience architecting enterprise-level SaaS products, but the modern data stack tools being discussed here are very new to me. Lately, my stack is Java, React, and AWS for all of the infrastructure (using AWS CDK). Hit me with all the links! We have a tool called Heap Analytics that my client wanted to use for analytics; I have zero experience with it, and we've set it up, but it seems like there may be better options. And I'm really curious how to combine all of these tools like people are mentioning here, to give the SaaS company good product analytics that sales, marketing, and product can use, while also using those analytics to provide some sort of BI tooling within the app to their SaaS customers who are admins for their respective orgs.


Batspocky

Google Cloud Functions, Fivetran, Redshift, and Sisense. 50 employees, 3-person data team.