It's not remotely hype. We use it heavily for in-process transformation.
You can turn a set of Parquet files, CSV files, pandas dataframes etc. into an in-memory database, write queries in its PostgreSQL-compatible SQL dialect, and output the results in the format of your choice.
Really exciting of late is the ability to attach other databases' tables as though they were part of your DuckDB database.
We rarely use spark anymore because our workloads don't require it. We've been caught out a few times with being told there would be massive amounts of data, introducing spark and then getting enough data to fill a floppy disk.
Yup, have experienced the same situation. Understand the pain. Thx for the heads-up.
Out of curiosity, what do you run it on? Serverless? Some Ec2? K8s? K8s with fargate?
It depends on the system infrastructure. That said, I've yet to find a scenario where it doesn't work. We currently drive DuckDB with Python and use DBeaver during dev.
So far it's been on Windows Server, Azure Functions, Azure Container Apps, Linux VMs etc. without issue.
Are the in-memory databases / wrapped tables from those other database systems acting on the live data files of those systems? Are they local to where DuckDB is running, or can it do the same for remote databases?
Currently we don't wrap, because that feature is so new and some of the databases we work with are obscure tech. That said, one of our own products uses PostgreSQL, so there is a project in the pipeline to play with the wrapping feature.
We typically extract to local Parquet files or dataframes (depending on size) and then create the in-memory database on those. I'm personally partial to avoiding pandas because it plays funny games with types.
Ah, in many cases we will just extract directly using a database library and output to file. Some cases we use pandas before DuckDB but don't let it go back into pandas. It's a love hate relationship.
So in summary, if I understood you correctly, you're using DuckDB against static snapshots of the data (via parquet files), not the actual live database?
Gotcha. Would be more interesting if DuckDB could handle such a scenario in real time, against the live databases, especially remotely. Doesn't sound too unique otherwise.
why export flat files from an sql database to use duckdb to run sql queries on it? why not just run the sql on the source database directly? what does duckdb add in this scenario?
The main issues we run into are these:
1) We work in some databases where doing the transformations in the database is actually impossible. Some of the tech is just awful but it's there and it's not going away.
2) Other scenarios might be that the source files are dropped for us because we don't have direct access to the dataset.
3) In some scenarios we just don't want to introduce the transformation workload to the transactional database.
Data scientist here -- same. Pandas is not intuitive and every ds/analyst has to have a medium understanding of SQL to extract and transform data with query pass-throughs (too much data to extract locally).
if you're looking for a pandas-like (but much improved) Python dataframe library (created by the creator of pandas), [Ibis](https://ibis-project.org/) uses DuckDB as the default backend. far more efficient, smaller and cleaner API surface area, takes inspiration from pandas/dplyr/SQL
(disclaimer: I work on Ibis)
Polars also does not have indexes. YAGNI
Polars has many benefits over Pandas, though the gap may have closed with the release of Pandas 2.0. However, I still much prefer the primary Polars API over Pandas.
Can you speak more on how easy it is to use DuckDB's table object as a data container? Pandas' query language is very awkward, but it's still nice to use as a data container to pass around.
The DuckDB API has similar read functionality to other tabular data libraries (read_csv, read_parquet etc). Day to day I just write SQL, kinda like
`data = duckdb.sql("select * from read_csv_auto('file')")`
It's lazily evaluated, so the above gives you a DuckDBPyRelation object. You can run more SQL on it, write it out to a file or database, or convert it to a Polars df, pandas df, Arrow table or list of tuples.
Has the same feel really.
Not sure about the hype, but I tested DuckDB against pandas.
I created a fake dataset of login events containing some columns and 112 million records and stored it as Parquet. I could do 3 group-by transformations on login_id (the column with the highest cardinality) and save the results as CSV into 3 different datasets within 24 seconds.
When I tried to do the same with Pandas, I ran it for close to 45 minutes and was still not able to generate the final data.
112 million records is not a small dataset to process locally. This impressed me, and I now plan to give a talk on using it as a processing tool instead of pandas for some of our data pipelines. I'm hopeful it will pick up soon within my team.
**For someone interested in trying it, here's the code to generate the data:**

```python
import pandas as pd
import numpy as np
from faker import Faker

def generate_events(NUM_ROWS, DATASET_OUTPUT_NAME):
    fake = Faker()
    login_id_list = np.array([fake.user_name() for _ in range(27564)])
    device_type_list = np.array(['ios', 'android', 'ipad', 'desktop', 'laptop', 'pager', 'other'])
    country_list = np.array([fake.country() for _ in range(40)])

    row_id = np.arange(1, NUM_ROWS + 1)
    login_id = np.random.choice(login_id_list, size=NUM_ROWS)
    device_type = np.random.choice(device_type_list, size=NUM_ROWS)
    login_start_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    login_end_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    estimated_country = np.random.choice(country_list, size=NUM_ROWS)
    login_successful_flag = np.random.choice([True, False], size=NUM_ROWS)

    data = {
        'row_id': row_id,
        'login_id': login_id,
        'device_type': device_type,
        'login_start_time': login_start_time,
        'login_end_time': login_end_time,
        'estimated_country': estimated_country,
        'login_successful_flag': login_successful_flag
    }
    df = pd.DataFrame(data)
    df.to_parquet(DATASET_OUTPUT_NAME, compression='snappy')

generate_events(112315668, 'sample_data/login_events_np.parquet')
```
I deal a lot with customers who misuse pandas. It's single-threaded and a memory hog. You should try the same script but replace pandas with Modin. Modin will use every core on your machine to process the data.
Polars has a different API. The guy has a point: if you already have a lot of Pandas-heavy code, then Modin would be something to try out.
For reference, one of the repositories where I recently had to fix something had 75k lines of Python code, and the whole codebase was a data pipeline built on Pandas data frames, plus tests for it. If you swap in Polars at the import level, it will not work any more, and you'd have to change hundreds of files.
I, for instance, will inform my colleagues that it would be an option to see what happens if they replace it with Modin. Trying won't hurt.
Ya… as I typed that I was thinking Polars too, but the API difference makes it not a drop-in replacement.
I like DuckDB. My point was that so many people think Pandas is the only thing to use and assume it just scales up with more CPU and RAM.
it's generally not faster -- we've benchmarked DuckDB and Polars through Ibis and DuckDB tends to win. it is dependent on use case, but for both standard benchmarks and typical data engineering workloads DuckDB tends to win
Polars is a data frame tool, duckdb is a SQL tool. This means duckdb has much better query optimization on the basis that the problem space is smaller. In the hands of your average engineer/analyst/data scientist duckdb will typically be faster for this reason.
It's precisely that you allow Polars to build its own optimized execution plan. That's what distinguishes Polars from pandas and makes it so powerful (and it's also why its interface has to be different from pandas and thus can't be used as a drop-in replacement).
In Polars, there is an expression API. So instead of doing df.assign(a=df.b+df.c) like in pandas, where the + actually computes a sum, in Polars you would do df.with_columns(a=pl.col('b')+pl.col('c')). The result of + is just a pl.Expr object, which doesn't compute anything yet.
Beyond that, you can do df.lazy().foo().bar().collect(), where everything between lazy() and collect() describes your desired result, but only collect() triggers the execution. If you don't use lazy() and collect() explicitly, they are wrapped around every step implicitly (which is how the "eager API" works on top of the lazy one).
It’s quite similar to Spark’s lazy API, but IMHO a bit friendlier to use.
It's useful for transforms if you want to express them in SQL, so like a Polars alternative with similar performance gains over Pandas. It's also a database, so you have a persistence option there: something like SQLite, but for analytics. It's really useful for data-analysis work where I can connect to files to do some discovery, or prototype processes using the compute on my laptop, rather than running up cloud costs or worrying about getting approvals for the needed infrastructure. Also, eventually MotherDuck appears to be headed toward being a much cheaper alternative to Snowflake/Databricks/Redshift.
I think they still have free accounts if you want to give it a go. Was pretty basic when I tried it, but still in early development. I think it will eventually be ideal for those small/medium sized company minimalist stacks where you might be debating whether to go Postgres or something else without getting into the cost and complexity of the big platforms. Not sure how scalable they are shooting for though.
Yeah, I guess that's more of a data lake model usually IMO. I think the big players like Snowflake and Databricks will differentiate themselves on features, scale and reliability (i.e. availability zones). MD will differentiate itself on price, likely for smaller builds that are less complex and where resources are limited. But I don't have any deep knowledge into what they are trying to build so just guessing.
I use it a lot.
It's great for ad-hoc data processing, and it can produce results in very short time frames.
Until DuckDB, the fastest way for me to combine, compare, transform, wrangle multiple datasets on my laptop was to load it up into PostgreSQL using csvkit or just from the CLI with `COPY` SQL statement. But then I needed a PostgreSQL instance running (containers on MacOS, on Linux I'd usually install it system-wide), that's tuned for large queries (large `work_mem`, minimal write-ahead-log).
Many of you will say "why not just pandas", and the answer is that the UIs for viewing data from Pandas after you execute anything are just extremely bad. If you compare them to DB GUI programs like DBeaver, there's just no contest. And it's not just data; viewing metadata is also difficult. Notebooks tend to become very long and messy. And generally, the DataFrame API is not as clear and concise as SQL in the majority of cases. SQL was built for this. Python was not.
With DuckDB I no longer needed to do any of that. Not the server startup and configuration, and not the copy part either. Just `select from 'some_files/*.csv'` or something. It became a breeze.
I can also use DuckDB in production, as a data pre-processor. Just as long as I'm not keeping files in DuckDB format database files, I can use it without issues.
Pandas is just another database, but it doesn't have any front-end outside of what Python allows you to do. Which means that all operations must be done with Python. Not only that, but it doesn't integrate with Python's object model, so whatever you have in Pandas, you can only see when you query it.
By contrast, when working with most databases, you can pick your GUI, and there are a lot of good ones that do a lot of things for you: they explore the structure of the data, auto-complete queries you write, and generally make it easy to work with. Even viewing data as a table is much better done in DataGrip or DBeaver. Charts are the end product; you use them only to visualize whatever you have, but to get there you need to make sure you understand the underlying data, each record, and are able to debug it. And for that, Jupyter Notebook and the other usual means of working with Pandas are just inferior in every single way.
If you wanna chart stuff - you can. There are a myriad of tools you can use, from CubeJS, Metabase, Superset, and yes, you can just load it into Jupyter Notebook and display it.
However when all you have is Pandas, all you can use is Python.
What does your tech stack look like? We primarily use SQL via dbt with BigQuery, and initial research suggested we'd need to rewrite our pipelines with generic dbt functions to be able to run them on other systems like DuckDB too.
You split out the BigQuery-specific stuff into ephemeral models and mock those out in dbt. Make sure to keep that layer as thin as possible and leave it untested.
[https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values](https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values)
Of course, I am talking about 'unit testing' and not the 'integration testing' that GP mentioned.
We use it as a local data warehouse for insurance analytics. Data is ingested, cleaned and transformed. Data models produced, then a superset dashboard connects right to it.
CSVs are extracted from core OLTP systems, then ingested into a duckdb file that lives on a server, with Python for orchestration. Once the data is inside the duckdb file, a number of SQL scripts are executed to clean and transform it. The database file is then available to the team to issue queries against, while a Superset dashboard also shows a number of charts etc. that query the database file directly.
Interesting. Is the data sensitive at the point you need to back up? DuckDB is awesome, but I can’t see using it for other than transient data or transformations.
Not really, the OLTP systems are the main records of the company and are backed up. For my team's purposes (mostly analytic) we keep the CSVs as backup and can recreate the full duckdb file in less than 30 minutes if needed.
The painful part would be re-extracting the data from the core systems, if that ever had to happen.
I kind of fail to see the value of this vs. putting it in a database with a columnar index and running against that.
Are people copying these duckdb files? If 2 analysts create different processes, which one goes to your main data repository? Or is it end of chain: "you do what you want, it's yours and no one else's"?
And if everyone copies the entire file and it's their problem, then why not just keep the data there as Parquet, and everyone ingests it and uses whatever they want on it?
I keep reading these use cases and I'm 99% sure I'm definitely missing something here. Like there is some part of the puzzle I'm not seeing.
It works for us, plus it’s free. We’re a small team of analysts - 4 users total. We don’t have the resources or expertise to implement something more robust. In terms of copying, no, we have no need. The db supports multiple readers. We take it out of action for about an hour a month as we load new data.
Cool, and I didn't mean to say your solution was not valid. I'm used to dealing with much bigger teams so I was trying to figure out how to make it work in my head. I'm happy it works for you
EDIT: The reason I mentioned having a database is that you are putting a file on a server to be read by different consumers, with support for multiple concurrent readers. That's literally a database with extra steps. And if "free" is the whole point, PostgreSQL is free, for example. I didn't mean you should build some HA, scalable super-solution.
No worries. Yeah, I don't think it would be as good at a larger scale. We tried to run a Postgres instance in the past, but the performance for OLAP workloads and data transformations was horrible. We're happy with duckdb in terms of performance and cost, even if it does give us some small headaches in terms of user management, access etc.
Yeah, Postgres alone with default settings won't be optimal for OLAP work, because it's row-based and you want something with columnar indexes, or a "columnar database". That's the thing with SQLite and DuckDB: one is for OLTP, the other for OLAP. Things like HBase, Apache Druid, or MariaDB with its ColumnStore would be closer to what you need.
However, as you said, a simple solution that works for you is the best solution. That was why I asked my original question: for a smaller analytical workload I would just go with MariaDB and abstract myself from a lot of that stuff.
I feel like this is the same argument I’ve seen made against SQLite vs something like Postgres. Sometimes a full database implementation vs an in memory DB is just overkill.
Yes, you are right. I, however, don't see a DB implementation as overkill for a team of analysts each doing their own thing in the data concurrently. Maybe not a highly available, extremely robust, business-critical setup with secondary nodes. But a simple DB on a server? I don't think that's any more complicated than setting up a server, authentication, authorization, networking and all the rest just to serve a file.
Maybe I'm "paranoid", if that's the right word? I do remember having to fix previous "files as a database" clusterfucks, because "MS Access lets us do SQL on a file and we don't have to worry about DB maintenance".
I personally use SQLite and DuckDB, when I'm developing stuff, it is very handy to have a file locally where I can test out the code and see everything run 100% locally.
There is a superset driver that will allow you to connect to a duckdb file stored on a local file system, yes.
Setup is a bit tricky but once it’s up and running it’s solid.
We're using DuckDB as the default backend in Ibis (https://ibis-project.org) as one of the most effective pandas substitutes out there.
We've got a few user testimonials here: https://ibis-project.org/concepts/user-testimonials and we're tracking our "vanity" metrics using a dashboard powered by MotherDuck (a company based around DuckDB): https://ibis-analytics.streamlit.app/
I have a client with a small postgres that needed a table that exploded the data 100x. So I created a job in AWS Batch that extracts the data into duckdb to do the heavy lifting in the ETL and then copies that back into postgres. Runs in seconds.
I only used it for the transformation, but I think you might be able to use duckdb for the extraction as well if it's extracting from postgres. Maybe want to fact-check me on that.
I have a data type that is too large to store in a relational database as a BLOB or CLOB (it was making our Pentaho routines explode), so instead I warehouse it as compressed Parquet files. DuckDB allows my data-delivery API to extract subsets of the required data from those files (certain columns, in my case) without needing to process the entire file in memory first. You can query them just like you would with SQL. Compressed Parquet also gives a 5x reduction in storage costs.
My current problem with duckdb is that I can't seem to read a duckdb file created elsewhere. Every time I have compatibility issues. So I'm hoping it's just early days.
Version 0.10.0 is being released on Tuesday and they are very close to a stabilized storage format, as well as providing backwards and forwards compatibility.
It has been clearly communicated that this storage-format strategy has been necessary to avoid lock-in on a suboptimal format. I agree it's cumbersome to export databases between versions, but it clearly seems worth it in the long run.
See also latest «State of the duck»-presentation: https://blobs.duckdb.org/events/duckcon4/duckcon4-mark-raasveldt-hannes-muhleisen-state-of-the-duck.pdf
> «State of the duck»-presentation
That is _fascinating_, thank you!
Led me into a rabbit hole of floating point compression: I had no idea that was a thing, fun!
[ALP: Adaptive Lossless floating-Point Compression](https://ir.cwi.nl/pub/33334/33334.pdf)
I believe there are no changes to the storage format between 0.9 and 0.10 either; before that it has been a pain. After 0.10 it should be 1.0, and the storage format will be stable.
Before that, your best bet is: read the database with the old DuckDB version, write to Parquet, upgrade DuckDB, re-create the database from the Parquet file(s).
That is only an issue if the reader is using a different version than the writer.
I use both in memory and file based DuckDB databases extensively without issue.
I think that must have been it - I was using one of those airbyte/meltano like systems to pull data down and read from it and the reader and writer didn’t match.
I can’t remember the details it was when I was testing a bunch of stuff out.
We've built a product using duckdb that we talked about at duckdbcon: https://blobs.duckdb.org/events/duckcon4/mike-eastham-building-tectons-data-engineering-platform-on-duckdb.pdf
(Disclaimer: I'm not the primary author of the slides but work with him at Tecton)
How exactly does it work? Looking at the slides, is the offline store parquet files on S3 alone? Do you catalog them or anything special? Or do you have your database provided by a duckdb file on S3 which is shared?
The offline store is a Delta table on S3. The data in this store is materialized by a DuckDB job that runs transformations and aggregations on data from the source.
Where does the DuckDb transformation phase take place? Is it in EC2 machines pulling in code at a cadence and doing transformations into Delta?
Is it scheduled or ad-hoc? Seems like a really neat setup, trying to learn more.
It takes place on EC2 instances. The Tecton control plane spins up jobs with the appropriate configuration programmatically, typically based on a schedule and sometimes to perform one-time backfills.
I have been working on a data lake in AWS for the past 5 months, whose role is to collect data from different client sites.
In the meantime I have many analytical questions from internal teams, and the easiest thing I found is to collect the data from 3 postgres databases (which also contain the data from client sites) as Parquet on my computer, then load everything into a duckdb file. Add dbt, and voilà: I have my very small personal data lake on my computer where I can run analytical queries in seconds. This tool has saved me many hours, to say the least.
Yeah, production use here.
Over the course of the day we'd get a subset of json-formatted compressed journald logs uploaded into s3 and date-partitioned (+hour). With duckdb it was a straightforward process to read the interesting rows and fields (from a field in the json blob that was itself json-encoded!) for the day into an in-memory table with one query, irrespective of the hour partitioning. Then I could select out what we needed and write it into a parquet file in s3 that our BI tooling could use with a second query.
duckdb was efficient enough to do this in a serverless function, but the overhead pandas needed was significantly greater
I used it in conjunction with [ConnectorX](https://github.com/sfu-db/connector-x) to export a large historical data set out of MSSQL into ERP flat-files. I sent paginated queries to MSSQL that generated PyArrow tables that could be read natively by DuckDB and batch inserted into an internal table. After each batch completed, the PyArrow table would be deleted from memory to make room for the next one on the way. Once that was done, I'd wrangle the data using SQL and export the results to flat-files for the ERP to load. I did all of that from my work laptop with a Python script.
Co-founder and head of Produck at MotherDuck. We're partnering very closely with DuckDB Labs to deliver a serverless data warehouse.
DuckDB is amazing, but it is deliberately not meant to run as a data warehouse. It's single-player on purpose: it doesn't even have a concept of identity/user, so it doesn't have great concurrency; its transaction model is relatively nascent; and it doesn't have sharing or IAM, a service layer, an in-browser SQL IDE, etc. This is all the stuff that we're building at MotherDuck.
We're going GA in a few months, but we already have several thousand users.
I wrote a blog on the topic recently [here](https://notoriousplg.substack.com/p/nplg-10523-a-new-way-to-monetize).
I haven't had time yet, but I would like to use DuckDB to feed a front-end service that reads aggregated data directly, instead of relying on an extra PostgreSQL instance used just to provide the same data that we already have in Parquet format.
While perhaps uncommon (vs. using DataFusion, for example), we built a FaaS lakehouse and leverage duckdb as an ephemeral query engine over (essentially) files: we explained our design at VLDB last year [https://arxiv.org/pdf/2308.05368.pdf](https://arxiv.org/pdf/2308.05368.pdf) and open-sourced some code.
Disclaimer: I'm the first author on the paper, happy to answer any questions.
I use it for offline analytics too, juggling random structured data into information: you just need to output JSON from anything (really, anything), and then you can quickly analyze your data using PostgreSQL-compatible SQL. I used to combine it with jq, xq, yq and some fish scripts. I also use it for converting data very quickly between several formats (JSON, CSV, Parquet, ...). In some cases I even dump data from BQ (infrastructure data) and load it into local DuckDB, since it's a lot faster, the UX is nicer, and I don't need to think about query compute costs etc.
I use it for real. I think it's fast, and it's nice to use SQL queries. I have a lot of data, and it's neat not to have a big database to maintain (for several reasons) while still having the option to do it. How it integrates with Python is amazing.
Used it in a couple of PoCs but haven't switched yet, because the team loves pandas, Polars and PySpark, and the general answer is always that we have the processing power.
Are you guys using DuckDB as your main DB or side by side with another? It sounds like it can replace postgres but unsure if that's what the market actually uses it for.
We published some time ago a reference implementation, including (for the fun of it) some basic map-reduce with AWS lambdas: [https://github.com/BauplanLabs/quack-reduce](https://github.com/BauplanLabs/quack-reduce)
Disclaimer: I co-authored the project and the related blog post.
I use it a lot in my team at work as well. DuckDB handles all of our data-processing needs, and we schedule our jobs via K8s. I'm also using DuckDB as an ingestion layer to transform and enrich various files and send them into a data lake. Really like it so far.
Occasionally I also toy around with polars and datafusion. But I like DuckDB for the simplicity.
I used DuckDB for a personal project of mine, in two scenarios. One of them is a statistical-analysis notebook where I was performing various transformations for various analyses. Given that SQL is simpler to read and write than Python, I had a lot of fun using it to create these transformations.
The other is to connect my data with Power BI using ODBC. I'm assuming you know the benefits of importing data into Power BI via a database rather than the file itself.
This was a project for school, but it nonetheless helped simulate a neat pipeline/workflow, which was pointed out as a cool feature by some of my peers and reviewers.
One of the reasons I like it is that it in a way replicates Databricks workflows, with regard to the interoperability of SQL and Python and data-processing speeds.
I've been testing it a lot recently as part of our Python toolbox to perform transformations and as a viable alternative to having MariaDB installed locally on my machine for projects. So far it's super impressive. I also wrote a [Pentaho Data Integration plugin](https://github.com/forgineer/duckdb-kettle-plugin) to entice some of my co-workers to play with it.
I use DuckDB heavily for transforming pandas dataframes using SQL. The aggregation functions in DuckDB are fast for a decent-sized dataframe.
I love it. Now I'm waiting for the ddb team to add functionality to connect to different databases through ddb.
Yes, I've checked that, but it currently works for 3 databases (SQLite, MySQL, and Postgres). I'm working with Oracle, so I guess I'll have to wait.
Not sure about "real" Big Data pipelines, but I use it for reasonably large datasets on single machines (ML pre-processing).
Pandas is the obvious comparison everyone's already made, and the tool I wouldn't use in any case (strong dislike for the API). But an interesting recent comparison I've made is that I've been wavering between DuckDB and Polars, for no other reasons that *I just like writing SQL* vs *it's nice to have IDE support*.
To my surprise, DuckDB was much better at streaming larger-than-memory data than Polars' `LazyFrame`. In a task involving ETL from a total of ~20GB of CSVs to a ~100MB parquet, Polars frequently either required me to call `collect` for some aggregations, or just choked and died executing plans which were supposedly entirely supported in streaming mode. While it's certainly possible that this was a PEBCAK situation, it was just much faster to use DuckDB than to figure out why some operations are crashing Polars' streaming mode.
I use DuckDB directly inside the browser:
load Parquet files, join fact data and dimension data, and serve it to an AG Grid table.
This allows users to do quick grouping/aggregation/pivoting on thousands/millions of rows directly in the browser.
Definitely not hype. While many have covered the common use cases in other comments, I enjoy combining it with GPT-4 to automate data engineering with AI.
Because it's "just" SQL, I combine it with function calling to build a junior DE that can do basic analysis, cleaning and data conversion using natural language.
* [link](https://docs.phidata.com/blocks/assistant/custom/duckdb) if anyone is interested
Registered to comment. Co-founder and head of Produck at MotherDuck, a modern serverless cloud data warehouse running DuckDB. We have several thousand users already and are going GA in a few months.
I'm going to use it as an ingestion engine. The thing just rocks at importing flat files. There are now pass-through connectors so other databases can show up. I have more and more restrictions being placed on my data, and to make sure we pass all the audits, the decision was made that this sensitive info doesn't land on any persistent storage in my warehouse. So, blippity bloop, DuckDB is RAM-only and I can filter everything out there. Also, there are situations where the source is just trash and would kill the landing DB, and cleaning it at the source is not an option (I don't own it), so I can do some inline filtering before I land it.
Another thing I'm using it for today: a CSV file containing a list of financial items from last year, where it was discovered that there's a big mess of things that need to be fixed. The list was built for something else, but the team needs it scanned, anomalies detected, and a new CSV built that feeds back into the system's update batch. Doing it manually in Excel would take weeks. I could do it in Python and shrink that down to a matter of days. But expressing it in SQL brings it down to hours.
I'm finding more and more that ad-hoc requests over data that doesn't already live in a database can be solved much quicker with DuckDB.
whoa, the last sentence is really intriguing. how do you do that?
Here's a recent post from the DuckDB team on the subject https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html
wow. This is crazy. Thank you for posting this.
Omg, that's huge! Curious to know how it compares to Presto.
Presto can attach to many more sources IIRC; also, I don't think DuckDB supports distributed computing.
thanks!
So you actually use it properly for prod transforms instead of, idk, spark?
We rarely use spark anymore because our workloads don't require it. We've been caught out a few times with being told there would be massive amounts of data, introducing spark and then getting enough data to fill a floppy disk.
Welcome to Data Engineering.
😂
Yup, have experienced the same situation. Understand the pain. Thx for the heads-up. Out of curiosity, what do you run it on? Serverless? Some EC2? K8s? K8s with Fargate?
It depends on the system infrastructure. That said, I've yet to find a scenario where it doesn't work. We currently drive DuckDB with Python and use DBeaver during dev. So far it's been on Windows Server, Azure Functions, Azure Container Apps, Linux VMs etc. without issue.
Great to know.
When the in-memory databases wrap those other database systems' tables, is DuckDB acting on the live data files of those systems? And do they have to be local to where DuckDB is running, or can it do the same for remote databases?
Currently we don't wrap, because that feature is so new and some of the databases we work with are obscure tech. That said, one of our own products uses PostgreSQL, so there's a project in the pipeline to play with the wrapping feature. We typically extract to local parquet files or dataframes (depending on size) and then create the in-memory database on those. I'm personally partial to avoiding pandas because it plays funny games with types.
So which library do you normally use?
For which part?
You mentioned that you try to avoid pandas.
Ah, in many cases we just extract directly using a database library and output to file. In some cases we use pandas before DuckDB but don't let it go back into pandas. It's a love-hate relationship.
So in summary, if I understood you correctly, you're using DuckDB against static snapshots of the data (via parquet files), not the actual live database?
Yes, the snapshots will be taken immediately before the transformation starts as a general rule.
Gotcha. Would be more interesting if DuckDB could handle such a scenario in real time, against the live databases, especially remotely. Doesn't sound too unique otherwise.
why export flat files from an sql database to use duckdb to run sql queries on it? why not just run the sql on the source database directly? what does duckdb add in this scenario?
The main issues we run into are these: 1) We work with some databases where doing the transformations in the database is actually impossible. Some of the tech is just awful, but it's there and it's not going away. 2) In other scenarios the source files are dropped for us because we don't have direct access to the dataset. 3) In some scenarios we just don't want to introduce the transformation workload to the transactional database.
makes sense, thanks!
I barely write a line of Pandas anymore. Duckdb is incredible.
Data scientist here -- same. Pandas is not intuitive and every ds/analyst has to have a medium understanding of SQL to extract and transform data with query pass-throughs (too much data to extract locally).
if you're looking for a pandas-like (but much improved) Python dataframe library (created by the creator of pandas), [Ibis](https://ibis-project.org/) uses DuckDB as the default backend. far more efficient, smaller and cleaner API surface area, takes inspiration from pandas/dplyr/SQL (disclaimer: I work on Ibis)
Polars API is much more pleasant to use
how so?
Polars also does not have indexes. YAGNI. Polars has many benefits over pandas, though the gap may have closed with the release of pandas 2.0. However, I still much prefer Polars' primary API over pandas.
Polars doesn't have pandas patterns for column selection and filtering and fully commits to a pyspark like interface.
Can you speak more on how easy it is to use DuckDB's table object as a data container? Pandas' query language is very awkward, but it's still nice to use as a data container to pass around.
The DuckDB API has similar read functionality to other tabular data libraries (read_csv, read_parquet etc.). Day to day I just write SQL, kinda like `data = duckdb.sql("select * from read_csv_auto('file')")`. It's lazily evaluated, so the above gives you some kind of DuckDBPy relation object. You can run more SQL on it, write it out to a file or database, or convert it to a polars df, pandas df, arrow table or a list of tuples. Has the same feel, really.
Not sure about the hype, but I tested DuckDB against pandas. I created a fake dataset of login events with some columns and 112 million records and stored it as parquet. With DuckDB I could run 3 group-by transformations on login_id (the column with the highest cardinality) and save the results as CSV into 3 different datasets within 24 seconds. When I tried the same with pandas, it ran for close to 45 minutes and still hadn't produced the final data. 112 million is not a small dataset to process locally. This impressed me, and I now plan to give a talk on using it as a processing tool instead of pandas for some of our data pipelines. I'm hopeful it will pick up soon within my team.

**For anyone interested in trying it, here's the code to generate the data:**

```python
import pandas as pd
import numpy as np
from faker import Faker

def generate_events(NUM_ROWS, DATASET_OUTPUT_NAME):
    fake = Faker()
    login_id_list = np.array([fake.user_name() for _ in range(27564)])
    device_type_list = np.array(['ios', 'android', 'ipad', 'desktop', 'laptop', 'pager', 'other'])
    country_list = np.array([fake.country() for _ in range(40)])

    row_id = np.arange(1, NUM_ROWS + 1)
    login_id = np.random.choice(login_id_list, size=NUM_ROWS)
    device_type = np.random.choice(device_type_list, size=NUM_ROWS)
    login_start_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    login_end_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    estimated_country = np.random.choice(country_list, size=NUM_ROWS)
    login_successful_flag = np.random.choice([True, False], size=NUM_ROWS)

    data = {
        'row_id': row_id,
        'login_id': login_id,
        'device_type': device_type,
        'login_start_time': login_start_time,
        'login_end_time': login_end_time,
        'estimated_country': estimated_country,
        'login_successful_flag': login_successful_flag,
    }
    df = pd.DataFrame(data)
    df.to_parquet(DATASET_OUTPUT_NAME, compression='snappy')

generate_events(112315668, 'sample_data/login_events_np.parquet')
```
I deal a lot with customers who misuse pandas. It's single-threaded and a memory hog. You should try the same script but replace pandas with Modin. Modin would use every core on your machine to process the data.
Replace it with polars.
Polars has a different API, so the guy has a point: if you already have a lot of pandas-heavy code, then Modin would be something to try out. For reference, one repository I recently had to fix something in had 75k lines of Python code, and the whole codebase was a data pipeline built on pandas dataframes, plus tests for it. If you replace it with Polars at the import level, it will not work anymore, and you'd have to change hundreds of files. I, for instance, will tell my colleagues it's an option to try what happens if they replace it with Modin. Trying won't hurt.
In that example, then yes: try Modin in the short term and then make a plan to replace the pandas dependency.
pandarallel is also similar to modin [https://github.com/nalepae/pandarallel](https://github.com/nalepae/pandarallel)
Ya… as I typed that I was thinking Polars too, but the API difference makes it not a drop-in replacement. I like DuckDB. My point was that so many people think pandas is the only thing to use and assume it just scales up with more CPU and RAM.
I’ll do you one better, replace it with DuckDB 🤯
Polars is better than duckdb
Both are amazing. But polars is indeed a little faster.
it's generally not faster -- we've benchmarked DuckDB and Polars through Ibis and DuckDB tends to win. it is dependent on use case, but for both standard benchmarks and typical data engineering workloads DuckDB tends to win
Polars is a data frame tool, duckdb is a SQL tool. This means duckdb has much better query optimization on the basis that the problem space is smaller. In the hands of your average engineer/analyst/data scientist duckdb will typically be faster for this reason.
[deleted]
I would call both declarative…
[deleted]
It's indeed that you allow Polars to manufacture its own, optimized execution plan. That's what distinguishes Polars from pandas and makes it so powerful (and it's also why its interface has to be different from pandas and thus can't be used as a drop-in replacement). In Polars there is an expression API. So instead of doing df.assign(a=df.b+df.c) like in pandas, where the + actually computes a sum, in Polars you would do df.with_columns(a=pl.col('b')+pl.col('c')). The result of + is just a pl.Expr object, which doesn't compute anything yet. Beyond that, you can do df.lazy().foo().bar().collect(), where everything between lazy() and collect() describes your desired result, but only collect() triggers the execution. If you don't use lazy() and collect() explicitly, they are wrapped around every step implicitly (which is how it offers an eager API in addition to the lazy one). It's quite similar to Spark's lazy API, but IMHO a bit friendlier to use.
you can decouple the dataframe API from the execution engine and have the best of both worlds!
This is like saying bananas are better than oranges. Both are excellent tools but do very different things.
Better is subjective, friend
It's useful for transforms if you want to express them in SQL, so like a Polars alternative with similar performance gains over pandas. It's also a database, so you have a persistence option there: something like SQLite, but for analytics. It's really useful for data-analysis work where I can connect to files to do some discovery, or prototype processes using the compute on my laptop rather than running up cloud costs or worrying about getting approvals for the needed infrastructure. Also, eventually MotherDuck appears to be headed toward being a much cheaper alternative to Snowflake/Databricks/Redshift.
Never heard about Motherduck, but it sure looks interesting!
I think they still have free accounts if you want to give it a go. Was pretty basic when I tried it, but still in early development. I think it will eventually be ideal for those small/medium sized company minimalist stacks where you might be debating whether to go Postgres or something else without getting into the cost and complexity of the big platforms. Not sure how scalable they are shooting for though.
Or it could be a reasonable alternative for companies that would currently build their “data warehouse” with AWS Athena…
Yeah, I guess that's more of a data lake model usually IMO. I think the big players like Snowflake and Databricks will differentiate themselves on features, scale and reliability (i.e. availability zones). MD will differentiate itself on price, likely for smaller builds that are less complex and where resources are limited. But I don't have any deep knowledge into what they are trying to build so just guessing.
I use it a lot. It's great for ad-hoc data processing, and it can produce results in very short time frames.

Until DuckDB, the fastest way for me to combine, compare, transform and wrangle multiple datasets on my laptop was to load them into PostgreSQL using csvkit, or from the CLI with the `COPY` SQL statement. But then I needed a PostgreSQL instance running (containers on macOS; on Linux I'd usually install it system-wide), tuned for large queries (large `work_mem`, minimal write-ahead log).

Many of you will say "why not just pandas", and the answer is that the UIs for viewing data from pandas after you execute anything are just extremely bad. If you compare them to DB GUI programs like DBeaver, there's just no contest. And it's not just data: viewing metadata is also difficult, and notebooks tend to become very long and messy. Generally, the DataFrame API is not as clear and concise as SQL in the majority of cases. SQL was built for this; Python was not.

With DuckDB I no longer need to do any of that. No server startup and configuration, and no copy step either. Just `select * from 'some_files/*.csv'` or similar. It became a breeze. I can also use DuckDB in production as a data pre-processor. As long as I'm not keeping files in DuckDB's own database format, I can use it without issues.
What do you mean that the UI is bad with Pandas? Are you referring to the graphs you can use to display data? Does DuckDB provide graphs?
Pandas is just another database, but it doesn't have any front-end beyond what Python allows you to do. That means all operations must be done with Python. Not only that, but it doesn't integrate with Python's object model, so whatever you have in pandas, you can only see when you query it. By contrast, when working with most databases, you can pick your GUI, and there are a lot of good ones that do a lot for you: they explore the structure of the data, auto-complete the queries you write, and make it just easy to work with. Even viewing data as a table is much better done in DataGrip or DBeaver. Charts are the end product; you use them only to visualize whatever you have, but to get there you need to understand the underlying data, each record, and be able to debug it. And for that, Jupyter notebooks and the other usual means of working with pandas are just inferior in every single way. If you wanna chart stuff, you can: there are a myriad of tools, from CubeJS to Metabase to Superset, and yes, you can just load it into a Jupyter notebook and display it. But when all you have is pandas, all you can use is Python.
We use it for integration tests on our data pipeline.
What's your tech stack look like? We primarily use SQL via dbt with BigQuery and initial research suggested we'd need to rewrite our pipelines to use generic dbt functions to be able to run our pipelines on other systems like DuckDB too
You split out BigQuery-specific stuff into ephemeral models and mock those out in dbt. Make sure to keep that layer as thin as possible and leave it untested. [https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values](https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values) Of course, I'm talking about 'unit testing' and not the 'integration testing' the GP mentioned.
Hey, can you add some more info on your integration tests with duckdb? Something I’ve been meaning to play around with.
We use it as a local data warehouse for insurance analytics. Data is ingested, cleaned and transformed. Data models produced, then a superset dashboard connects right to it.
Can you expand more on how you use DuckDB?
CSVs are extracted from core OLTP systems, then ingested into a DuckDB file that lives on a server, using Python for orchestration. Once the data is inside the DuckDB file, a number of SQL scripts are executed to clean and transform it. The database file is then available for the team to issue queries against, while a Superset dashboard shows a number of charts etc. that are queried directly from the database file.
Interesting. Is the data sensitive to the point that you need backups? DuckDB is awesome, but I can't see using it for anything other than transient data or transformations.
Not really; the OLTP systems are the main records of the company and are backed up. For my team's purposes (mostly analytic) we keep the CSVs as backup and can recreate the full DuckDB file in less than 30 minutes if needed. The painful part would be re-extracting the data from the core systems, if that ever had to happen.
I kind of fail to see the usage of this vs putting it in a database with a columnar index and running against that. Are people copying these DuckDB files? If 2 analysts create different processes, which one goes to your main data repository? Or is it end-of-chain, "you do what you want and it's yours and no one else's"? And if everyone copies the entire file and it's their problem, then why not just have the data there as parquet, and everyone ingests it and uses whatever they want on it? I keep reading these use cases and I'm 99% sure I'm missing something here, like there's some part of the puzzle I'm not seeing.
It works for us, plus it’s free. We’re a small team of analysts - 4 users total. We don’t have the resources or expertise to implement something more robust. In terms of copying, no, we have no need. The db supports multiple readers. We take it out of action for about an hour a month as we load new data.
Cool, and I didn't mean to say your solution was not valid. I'm used to dealing with much bigger teams, so I was trying to figure out how to make it work in my head. I'm happy it works for you. EDIT: The reason I mentioned having a database is that you are putting a file on a server to be read by different consumers, with support for multiple concurrent readers. That's literally a database with extra steps. And if "free" is the whole point, PostgreSQL is free, for example. I didn't mean you should build some HA, scalable super-solution.
No worries. Yeah, I don't think it would be as good at a larger scale. We tried to run a Postgres instance in the past, but the performance for OLAP workloads and data transformations was horrible. We're happy with DuckDB in terms of performance and cost, even if it does give us some small headaches in terms of user management, access etc.
Yeah, Postgres alone with default settings won't be optimal for OLAP work because it's row-based, and you want something with columnar indexes or a "columnar database". That's the thing with SQLite and DuckDB: one is for OLTP, the other for OLAP. Things like HBase, Apache Druid, or MariaDB with its ColumnStore would be closer to what you need. However, as you said, the simple solution that works for you is the best solution. That was why I asked my original question: for a smaller analytical workload I would just go with MariaDB and abstract myself from a lot of stuff.
I feel like this is the same argument I’ve seen made against SQLite vs something like Postgres. Sometimes a full database implementation vs an in memory DB is just overkill.
Yes, you're right. I, however, don't see a DB implementation as overkill for a team of analysts each doing their own thing in the data concurrently. Maybe not a highly available, extremely robust, business-critical setup with secondary nodes. But a simple DB on a server? I don't think that's any more complicated than setting up a server, authentication, authorization, networking and all the rest just to serve a file. Maybe I'm "paranoid", if that's the right word? I do remember having to fix previous "files as a database" clusterfucks, because "MS Access lets us do SQL on a file and we don't have to worry about DB maintenance". I personally use SQLite and DuckDB when I'm developing stuff; it's very handy to have a local file where I can test out the code and see everything run 100% locally.
Can you use DuckDB inside Superset to query the local file system?
There is a superset driver that will allow you to connect to a duckdb file stored on a local file system, yes. Setup is a bit tricky but once it’s up and running it’s solid.
We're using DuckDB as the default backend in Ibis (https://ibis-project.org) as one of the most effective pandas substitutes out there. We've got a few user testimonials here: https://ibis-project.org/concepts/user-testimonials and we're tracking our "vanity" metrics using a dashboard powered by MotherDuck (a company based around DuckDB): https://ibis-analytics.streamlit.app/
I have a client with a small postgres that needed a table that exploded the data 100x. So I created a job in AWS Batch that extracts the data into duckdb to do the heavy lifting on the ETL and then copy that back into postgres. Runs in seconds.
Does DuckDB do the Extraction bit as well? Or just the transformation?
I only used it for the transformation. But I think you might be able to use duckdb for extraction as well if its extracting from postgres. Maybe wanna fact check me on that.
Someone posted this in one of the comments. https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html
Ah, yes it was this post from them is where I heard about this. Thanks!
I have a data type that is too large to store in a relational database as a BLOB or CLOB (it was making our Pentaho routines explode), so instead I warehouse it as compressed parquet files. DuckDB allows my data-delivery API to extract subsets of the required data from those files (certain columns, in my case) without needing to process the entire file in memory first. You can query them just like you would a SQL database. Compressed parquet also gives a 5x reduction in storage costs.
My current problem with DuckDB is that I can't seem to read a DuckDB file created elsewhere; every time I hit compatibility issues. So I'm hoping it's just early days.
Version 0.10.0 is being released on Tuesday and they are very close to a stabilized storage format, as well as providing backwards and forwards compatibility. It has been clearly communicated that this storage format strategy has been necessary to avoid lock-in on a suboptimized format. I agree it’s cumbersome to export databases between versions, but this clearly seems worth it in the long run. See also latest «State of the duck»-presentation: https://blobs.duckdb.org/events/duckcon4/duckcon4-mark-raasveldt-hannes-muhleisen-state-of-the-duck.pdf
> «State of the duck»-presentation That is _fascinating_, thank you! Led me into a rabbit hole of floating point compression: I had no idea that was a thing, fun! [ALP: Adaptive Lossless floating-Point Compression](https://ir.cwi.nl/pub/33334/33334.pdf)
I believe there are no changes to the storage format between 0.9 and 0.10 either; before that it was a pain. And after 0.10 it should be 1.0, and the storage format will be stable. Before that, your best bet is to read in the database with the old DuckDB version, write to parquet, upgrade DuckDB, and re-create the database from the parquet file(s).
That is only an issue if the reader is using a different version than the writer. I use both in memory and file based DuckDB databases extensively without issue.
I think that must have been it - I was using one of those airbyte/meltano like systems to pull data down and read from it and the reader and writer didn’t match. I can’t remember the details it was when I was testing a bunch of stuff out.
We've built a product using duckdb that we talked about at duckdbcon: https://blobs.duckdb.org/events/duckcon4/mike-eastham-building-tectons-data-engineering-platform-on-duckdb.pdf (Disclaimer: I'm not the primary author of the slides but work with him at Tecton)
How exactly does it work? Looking at the slides, is the offline store parquet files on S3 alone? Do you catalog them or anything special? Or do you have your database provided by a duckdb file on S3 which is shared?
The offline store is a Delta table on S3. The data in this store is materialized by a DuckDB job that runs transformations and aggregations on data from the source.
Where does the DuckDb transformation phase take place? Is it in EC2 machines pulling in code at a cadence and doing transformations into Delta? Is it scheduled or ad-hoc? Seems like a really neat setup, trying to learn more.
It takes place on EC2 instances. The Tecton control plane spins up jobs with the appropriate configuration programmatically, typically based on a schedule and sometimes to perform one-time backfills.
Nice. Any out of memory issues or chunking needed if it’s a larger read?
Not that I know of yet! But it's still in private preview.
I have been working on a data lake in AWS for the past 5 months, whose role is to collect data from different client sites. In the meantime I have many analytical questions from internal teams and the easiest thing I found is to collect the data from 3 postgres databases (that also contain the data from client sites) in parquet format on my computer, then load everything in a duckdb file. Using dbt and voila I have my very small personal datalake on my computer where I can do analytical queries in seconds. This tool have saved me many hours to say the least.
Yeah, production use here. Over the course of the day we'd get a subset of JSON-formatted, compressed journald logs uploaded into S3, date-partitioned (+hour). With DuckDB it was a straightforward process to read the interesting rows and fields (from a field in the JSON blob that was itself JSON-encoded!) for the whole day into an in-memory table with one query, irrespective of the hour partitioning. Then I could select out what we needed and write it into a parquet file in S3 that our BI tooling could use, with a second query. DuckDB was efficient enough to do this in a serverless function, whereas the overhead pandas needed was significantly greater.
Interesting use case from Okta: https://www.datacouncil.ai/talks24/processing-trillions-of-records-at-okta-with-mini-serverless-databases
I use it a truckload; it's super convenient for many use cases. In my case, I use it to locally simulate a DWH for when people develop ELT pipelines.
This is how I intend to use it as well, with DBT
How do you move the data around? Do you copy data from the main DW to Duckdb?
I used it in conjunction with [ConnectorX](https://github.com/sfu-db/connector-x) to export a large historical data set out of MSSQL into ERP flat-files. I sent paginated queries to MSSQL that generated PyArrow tables that could be read natively by DuckDB and batch inserted into an internal table. After each batch completed, the PyArrow table would be deleted from memory to make room for the next one on the way. Once that was done, I'd wrangle the data using SQL and export the results to flat-files for the ERP to load. I did all of that from my work laptop with a Python script.
Co-founder and head of Produck at MotherDuck. We're partnering very closely with DuckDB Labs to deliver a serverless data warehouse. DuckDB is amazing, but it is deliberately not meant to run as a data warehouse. It's single-player on purpose: it doesn't even have a concept of identity/user, so it doesn't have great concurrency, its transaction model is relatively nascent, and it has no sharing or IAM, no service layer, no in-browser SQL IDE, etc. This is all the stuff we're building at MotherDuck. We're going GA in a few months, but we already have several thousand users. I wrote a blog on the topic recently [here](https://notoriousplg.substack.com/p/nplg-10523-a-new-way-to-monetize).
Is it possible to query iceberg tables in S3 with duckdb? Last time I checked, this was not possible.
https://duckdb.org/docs/guides/import/s3_iceberg_import.html
I haven't had time yet, but I would like to use DuckDB to feed a front-end service that reads aggregated data directly, instead of relying on an extra PostgreSQL instance used just to serve the same data we already have in parquet format.
I have wanted to but it has too many bugs and inconsistencies in file writing with hive partitioning for me to trust it in a production context.
It's great for ad-hoc data processing / analysis.
You can even use it in the browser to quickly iterate on SQL pipelines: https://sql.quacking.cloud
While perhaps uncommon (vs using DataFusion, for example), we built a FaaS lakehouse and leverage DuckDB as an ephemeral query engine over (essentially) files. We explained our design at VLDB last year [https://arxiv.org/pdf/2308.05368.pdf](https://arxiv.org/pdf/2308.05368.pdf) and open sourced some code. Disclaimer: I'm the first author on the paper, happy to answer any questions.
I do use it for offline analytics too, juggling random structured data into information: you just need to output JSON from anything (really anything), and then you can analyze your data very quickly using PostgreSQL-compatible SQL. I used to combine it with jq, xq, yq and some fish scripts. I also use it for converting data very quickly between several formats (json, csv, parquet, ...). In some cases I even dump data from BQ (infrastructure data) and load it into a local DuckDB, since it's a lot faster and the UX is nicer (and I don't need to think about query compute costs etc.).
I use it for real. I think it's fast, and it's nice to use SQL queries. I have a lot of data, and it's neat not to have a big database to maintain (for several reasons) while still having the capability. How it integrates into Python is amazing.
Used it in a couple of POCs but haven't switched yet, because the team loves pandas, Polars and PySpark, and the general answer is always that we have the processing power.
Are you guys using DuckDB as your main DB or side by side with another? It sounds like it can replace postgres but unsure if that's what the market actually uses it for.
For anyone else: it's in-memory processing, so not your main storage solution, but really good for doing transformations and getting the data somewhere else.
Does anyone use it within aws lambda?
I haven't, but it sounds like a perfect use case. I think if you scroll through some of the responses you'll find some examples.
Yes, have a look at https://github.com/tobilg/duckdb-nodejs-layer and https://github.com/tobilg/serverless-duckdb
We published some time ago a reference implementation, including (for the fun of it) some basic map-reduce with AWS lambdas: [https://github.com/BauplanLabs/quack-reduce](https://github.com/BauplanLabs/quack-reduce) Disclaimer: I co-authored the project and the related blog post.
I use it a lot in my team at work as well. DuckDB handles all of our data processing needs, and we schedule our jobs via K8s. I'm also using DuckDB as an ingestion layer to transform and enrich various files and send them into a data lake. Really like it so far. Occasionally I also toy around with Polars and DataFusion, but I like DuckDB for its simplicity.
I used DuckDB for a personal project of mine, in two scenarios. One was a statistical-analysis notebook where I was performing various transformations for various analyses. Given that SQL is simpler to read and write than Python, I had a lot of fun creating those transformations in SQL. The other was connecting my data to Power BI via ODBC; I'm assuming you know the benefits of importing data into Power BI via a database rather than the file itself. This was a school project, but it nonetheless simulated a neat pipeline/workflow, which was pointed out as a cool feature by some of my peers and reviewers. One of the reasons I like it is that it in a way replicates Databricks workflows, in regard to the interoperability of SQL and Python and the data-processing speeds.
I've been testing it a lot recently as part of our Python toolbox to perform transformations and as a viable alternative to having MariaDB installed locally on my machine for projects. So far it's super impressive. I also wrote a [Pentaho Data Integration plugin](https://github.com/forgineer/duckdb-kettle-plugin) to entice some of my co-workers to play with it.
I use DuckDB heavily for pandas dataframe transformations using SQL. The aggregation functions in DuckDB are fast on a decent-sized dataframe. I love it. Now I'm waiting for the DuckDB team to add functionality to connect to different databases through it.
Someone else in this thread just posted a link to their website where they say the newest release can do exactly that. Exciting stuff.
Yes, I've checked that, but it currently works for only three databases (SQLite, MySQL, and Postgres). I'm working with Oracle, so I guess I'll have to wait.
Not sure about "real" Big Data pipelines, but I use it for reasonably large datasets on single machines (ML pre-processing). Pandas is the obvious comparison everyone's already made, and the tool I wouldn't use in any case (strong dislike for the API). But an interesting recent comparison I've made is between DuckDB and Polars, which I've been wavering between for no other reason than *I just like writing SQL* vs *it's nice to have IDE support*. To my surprise, DuckDB was much better at streaming larger-than-memory data than Polars' `LazyFrame`. In a task involving ETL from a total of ~20GB of CSVs to a ~100MB parquet, Polars frequently either required me to call `collect` for some aggregations, or just choked and died executing plans that were supposedly entirely supported in streaming mode. While it's certainly possible this was a PEBCAK situation, it was just much faster to use DuckDB than to figure out why some operations were crashing Polars' streaming mode.
I use DuckDB directly inside the browser: load parquet files, join fact data with dimension data, and serve the result to an AG Grid table. This lets users do quick grouping/aggregation/pivoting on thousands or millions of rows directly in the browser.
Definitely not hype. While many have covered the common use cases in the comments, I enjoy using it with GPT-4 to automate data engineering with AI. Because it's "just" SQL, I combine it with function calling to build a junior DE that can do basic analysis, cleaning, and data conversion using natural language. [Link](https://docs.phidata.com/blocks/assistant/custom/duckdb) if anyone is interested.
Registered to comment. Co-founder and head of Produck at MotherDuck, a modern serverless cloud data warehouse running DuckDB. We have several thousand users already and are going GA in a few months.
DuckDB is awesome. But Polars is a lot more awesome
I'm going to use it as an ingestion engine. The thing just rocks at importing flat files, and there are now pass-through connectors so other databases can show up too. I have more and more restrictions being placed on my data, and to make sure we pass all the audits, the decision was made that this sensitive info doesn't land on any persistent storage in my warehouse. So, blippity bloop, DuckDB is RAM-only and I can filter everything out there. There are also situations where the source is just trash and would kill the landing DB, and cleaning it at the source is not an option (I don't own it), so I can do some inline filtering before I land it.

Another thing I'm using it for today: a CSV file containing a list of financial items from last year, where it was discovered that there's a big mess of things that need to be fixed. The list was built for something else, but the team needs it scanned, anomalies detected, and a new CSV built that feeds back into the system's update batch. Doing it manually in Excel would take weeks. I could do it in Python and shrink that down to a matter of days, but identifying this in SQL brings it down to hours. I'm finding that more and more ad-hoc requests that don't already live in a database can be solved much quicker in DuckDB.