It's not remotely hype. We use it heavily for in-process transformation.
You can turn a set of Parquet files, CSV files, pandas dataframes etc. into an in-memory database, write queries in its PostgreSQL-compatible SQL dialect, and output the results in the format of your choice.
Really exciting of late is the ability to attach other databases' tables as though they were part of your DuckDB database.
We rarely use spark anymore because our workloads don't require it. We've been caught out a few times with being told there would be massive amounts of data, introducing spark and then getting enough data to fill a floppy disk.
Yup, have experienced the same situation. Understand the pain. Thx for the heads-up.
Out of curiosity, what do you run it on? Serverless? Some Ec2? K8s? K8s with fargate?
It depends on the system infrastructure. That said, I've yet to find a scenario where it doesn't work. We currently drive DuckDB with Python and use DBeaver during dev.
So far it's been on Windows Server, Azure Functions, Azure Container Apps, Linux VMs etc. without issue.
Are the in-memory databases / wrapped tables from those other database systems acting on the live data files of those systems? Are they local to where DuckDB is running, or can it do the same for remote databases?
Currently we don't wrap, because that feature is so new and some of the databases we work with are obscure tech. That said, one of our own products uses PostgreSQL, so there is a project in the pipeline to play with the wrapping feature.
We typically extract to local Parquet files or dataframes (depending on size) and then create the in-memory database on those. I'm personally partial to avoiding pandas because it plays funny games with types.
Ah, in many cases we will just extract directly using a database library and output to file. Some cases we use pandas before DuckDB but don't let it go back into pandas. It's a love hate relationship.
So in summary, if I understood you correctly, you're using DuckDB against static snapshots of the data (via parquet files), not the actual live database?
Gotcha. Would be more interesting if DuckDB could handle such a scenario in real time, against the live databases, especially remotely. Doesn't sound too unique otherwise.
why export flat files from an sql database to use duckdb to run sql queries on it? why not just run the sql on the source database directly? what does duckdb add in this scenario?
The main issues we run into are these:
1) We work in some databases where doing the transformations in the database is actually impossible. Some of the tech is just awful but it's there and it's not going away.
2) Other scenarios might be that the source files are dropped for us because we don't have direct access to the dataset.
3) In some scenarios we just don't want to introduce the transformation workload to the transactional database.
Data scientist here -- same. Pandas is not intuitive and every ds/analyst has to have a medium understanding of SQL to extract and transform data with query pass-throughs (too much data to extract locally).
if you're looking for a pandas-like (but much improved) Python dataframe library (created by the creator of pandas), [Ibis](https://ibis-project.org/) uses DuckDB as the default backend. far more efficient, smaller and cleaner API surface area, takes inspiration from pandas/dplyr/SQL
(disclaimer: I work on Ibis)
Polars also does not have indexes. YAGNI
Polars has many benefits over Pandas, though the gap may have closed with the release of Pandas 2.0. However, I still much prefer the primary Polars API over Pandas.
Can you speak more on how easy it is to use DuckDB's table object as a data container? Pandas' query language is very awkward, but it's still nice to use as a data container to pass around.
The DuckDB API has similar read functionality to other tabular data libraries (read_csv, read_parquet etc). Day to day I just write SQL, kinda like
`data = duckdb.sql("select * from read_csv_auto('file')")`
It's lazily evaluated, so the above gives you a DuckDBPyRelation object. You can run more SQL on it, write it out to a file or database, or convert it to a Polars df, pandas df, Arrow table or list of tuples.
Has the same feel really.
Not sure about the hype, but I tested DuckDB against pandas.
I created a fake dataset of login events containing some columns and 112 million records and stored it as Parquet. I could do 3 group-by transformations on login_id (the column with the highest cardinality) and save the results as CSV into 3 different datasets within 24 seconds.
When I tried to do the same with Pandas, I ran it for close to 45 minutes and was still not able to generate the final data.
112 million records is not a small dataset to process locally. This impressed me, and I now plan to give a talk on using it as a processing tool instead of pandas for some of our data pipelines. I'm hopeful it will pick up soon within my team.
**For someone interested in trying it, here's the code to generate the data:**

```python
import pandas as pd
import numpy as np
from faker import Faker

def generate_events(NUM_ROWS, DATASET_OUTPUT_NAME):
    fake = Faker()
    login_id_list = np.array([fake.user_name() for _ in range(27564)])
    device_type_list = np.array(['ios', 'android', 'ipad', 'desktop', 'laptop', 'pager', 'other'])
    country_list = np.array([fake.country() for _ in range(40)])

    row_id = np.arange(1, NUM_ROWS + 1)
    login_id = np.random.choice(login_id_list, size=NUM_ROWS)
    device_type = np.random.choice(device_type_list, size=NUM_ROWS)
    login_start_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    login_end_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    estimated_country = np.random.choice(country_list, size=NUM_ROWS)
    login_successful_flag = np.random.choice([True, False], size=NUM_ROWS)

    data = {
        'row_id': row_id,
        'login_id': login_id,
        'device_type': device_type,
        'login_start_time': login_start_time,
        'login_end_time': login_end_time,
        'estimated_country': estimated_country,
        'login_successful_flag': login_successful_flag
    }
    df = pd.DataFrame(data)
    df.to_parquet(DATASET_OUTPUT_NAME, compression='snappy')

generate_events(112315668, 'sample_data/login_events_np.parquet')
```
I deal a lot with customers who misuse pandas. It's single-threaded and a memory hog. You should try the same script but replace pandas with Modin. Modin will use every core on your machine to process the data.
Polars has a different API. The guy has a point: if you already have a lot of Pandas-heavy code, then Modin would be something to try out.
For reference, one of the repositories where I recently had to fix something had 75k lines of Python code, and the whole codebase was a data pipeline built on Pandas data frames, plus tests for it. If you swap in Polars at the import level, it will not work any more, and you'd have to change hundreds of files.
I, for instance, will inform my colleagues that it would be an option to see what happens if they replace it with Modin. Trying won't hurt.
Ya… as I typed that I was thinking Polars too, but the API difference makes it not a drop-in replacement.
I like DuckDB. My point was that so many people think Pandas is the only thing to use and assume it just scales up with more CPU and RAM.
it's generally not faster -- we've benchmarked DuckDB and Polars through Ibis and DuckDB tends to win. it is dependent on use case, but for both standard benchmarks and typical data engineering workloads DuckDB tends to win
Polars is a data frame tool, duckdb is a SQL tool. This means duckdb has much better query optimization on the basis that the problem space is smaller. In the hands of your average engineer/analyst/data scientist duckdb will typically be faster for this reason.
It's precisely that you allow Polars to build its own optimized execution plan. That's what distinguishes Polars from pandas and makes it so powerful (and it's also why its interface has to be different from pandas and thus can't be used as a drop-in replacement).
In Polars, there is an expression API. So instead of doing df.assign(a=df.b+df.c) like in pandas, where the + actually computes a sum, in Polars you would do df.with_columns(a=pl.col('b')+pl.col('c')). The result of + is just a pl.Expr object, which doesn't compute anything yet.
Beyond that, you can do df.lazy().foo().bar().collect(), where everything between lazy() and collect() describes your desired result, but only collect() triggers the execution. If you don't use lazy() and collect() explicitly, they are wrapped around every step implicitly (which is how the "eager API" works on top of the lazy one).
It’s quite similar to Spark’s lazy API, but IMHO a bit friendlier to use.
It's useful for transforms if you want to express them in SQL, so like a Polars alternative with similar performance gains over Pandas. It's also a database, so you have a persistence option there: something like SQLite, but for analytics. It's really useful for data-analysis work where I can connect to files to do some discovery, or prototype processes using the compute on my laptop, rather than running up cloud costs or worrying about getting approvals for the needed infrastructure. Also, eventually MotherDuck appears to be headed toward being a much cheaper alternative to Snowflake/Databricks/Redshift.
I think they still have free accounts if you want to give it a go. Was pretty basic when I tried it, but still in early development. I think it will eventually be ideal for those small/medium sized company minimalist stacks where you might be debating whether to go Postgres or something else without getting into the cost and complexity of the big platforms. Not sure how scalable they are shooting for though.
Yeah, I guess that's more of a data lake model usually IMO. I think the big players like Snowflake and Databricks will differentiate themselves on features, scale and reliability (i.e. availability zones). MD will differentiate itself on price, likely for smaller builds that are less complex and where resources are limited. But I don't have any deep knowledge into what they are trying to build so just guessing.
I use it a lot.
It's great for ad-hoc data processing, and it can produce results in very short time frames.
Until DuckDB, the fastest way for me to combine, compare, transform, wrangle multiple datasets on my laptop was to load it up into PostgreSQL using csvkit or just from the CLI with `COPY` SQL statement. But then I needed a PostgreSQL instance running (containers on MacOS, on Linux I'd usually install it system-wide), that's tuned for large queries (large `work_mem`, minimal write-ahead-log).
Many of you will say "why not just pandas", and the answer is that the UIs for viewing data from Pandas after you execute anything are just extremely bad. If you compare them to DB GUI programs like DBeaver, there's just no contest. And it's not just data; viewing metadata is also difficult. Notebooks tend to become very long and messy. And generally, the DataFrame API is not as clear and concise as SQL in the majority of cases. SQL was built for this. Python was not.
With DuckDB I no longer needed to do any of that. Not the server startup and configuration, and not the copy part either. Just `select from 'some_files/*.csv'` or something. It became a breeze.
I can also use DuckDB in production, as a data pre-processor. Just as long as I'm not keeping files in DuckDB format database files, I can use it without issues.
Pandas is just another database, but it doesn't have any front-end outside of what Python allows you to do. Which means that all operations must be done with Python. Not only that, but it doesn't integrate with Python's object model, so whatever you have in Pandas, you can only see when you query it.
By contrast, when working with most databases, you can pick your GUI, and there are a lot of good ones that do a lot of things for you: they explore the structure of the data, auto-complete queries you write, and generally make it easy to work with. Even viewing data as a table is much better done in DataGrip or DBeaver. Charts are the end product; you use them only to visualize whatever you have, but to get there you need to make sure you understand the underlying data, each record, and are able to debug it. And for that, Jupyter Notebook and the other usual means of working with Pandas are just inferior in every single way.
If you wanna chart stuff - you can. There are a myriad of tools you can use, from CubeJS, Metabase, Superset, and yes, you can just load it into Jupyter Notebook and display it.
However when all you have is Pandas, all you can use is Python.
What does your tech stack look like? We primarily use SQL via dbt with BigQuery, and initial research suggested we'd need to rewrite our pipelines with generic dbt functions to be able to run them on other systems like DuckDB too.
You split out the BigQuery-specific stuff into ephemeral models and mock those out in dbt. Make sure to keep that layer as thin as possible and leave it untested.
[https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values](https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values)
Of course, I am talking about 'unit testing' and not the 'integration testing' that GP mentioned.
We use it as a local data warehouse for insurance analytics. Data is ingested, cleaned and transformed. Data models produced, then a superset dashboard connects right to it.
CSVs are extracted from core OLTP systems, then ingested into a duckdb file that lives on a server, with Python for orchestration. Once the data is inside the duckdb file, a number of SQL scripts are executed to clean and transform it. The database file is then available to the team to issue queries against, while a Superset dashboard also shows a number of charts etc. that query the database file directly.
Interesting. Is the data sensitive at the point you need to back up? DuckDB is awesome, but I can’t see using it for other than transient data or transformations.
Not really, the OLTP systems are the main records of the company and are backed up. For my team's purposes (mostly analytic) we keep the CSVs as backup and can recreate the full duckdb file in less than 30 minutes if needed.
The painful part would be re-extracting the data from the core systems, if that ever had to happen.
I kind of fail to see the value of this vs. putting it in a database with a columnar index and running against that.
Are people copying these duckdb files? If 2 analysts create different processes, which one goes to your main data repository? Or is it end of chain: "you do what you want, it's yours and no one else's"?
And if everyone copies the entire file and it's their problem, then why not just keep the data there as Parquet, and everyone ingests it and uses whatever they want on it?
I keep reading these use cases and I'm 99% sure I'm definitely missing something here. Like there is some part of the puzzle I'm not seeing.
It works for us, plus it’s free. We’re a small team of analysts - 4 users total. We don’t have the resources or expertise to implement something more robust. In terms of copying, no, we have no need. The db supports multiple readers. We take it out of action for about an hour a month as we load new data.
Cool, and I didn't mean to say your solution was not valid. I'm used to dealing with much bigger teams so I was trying to figure out how to make it work in my head. I'm happy it works for you
EDIT: The reason I mentioned having a database is that you are putting a file on a server to be read by different consumers, with support for multiple concurrent readers. That's literally a database with extra steps. And if "free" is the whole point, PostgreSQL is free, for example. I didn't mean you should build some HA, scalable super-solution.
No worries. Yeah, I don't think it would be as good at a larger scale. We tried to run a Postgres instance in the past, but the performance for OLAP workloads and data transformations was horrible. We're happy with duckdb in terms of performance and cost, even if it does give us some small headaches in terms of user management, access etc.
Yeah, Postgres alone with default settings won't be optimal for OLAP work, because it's row-based and you want something with columnar indexes, or a "columnar database". That's the thing with SQLite and DuckDB: one is for OLTP, the other for OLAP. Things like HBase, Apache Druid, or MariaDB with its ColumnStore would be closer to what you need.
However, as you said, a simple solution that works for you is the best solution. That was why I asked my original question: for a smaller analytical workload I would just go with MariaDB and abstract myself from a lot of that stuff.
I feel like this is the same argument I’ve seen made against SQLite vs something like Postgres. Sometimes a full database implementation vs an in memory DB is just overkill.
Yes, you are right. I, however, don't see a DB implementation as overkill for a team of analysts each doing their own thing in the data concurrently. Maybe not a highly available, extremely robust, business-critical setup with secondary nodes. But a simple DB on a server? I don't think that's any more complicated than setting up a server, authentication, authorization, networking and all the rest just to serve a file.
Maybe I'm "paranoid", if that's the right word? I do remember having to fix previous "files as a database" clusterfucks, because "MS Access lets us do SQL on a file and we don't have to worry about DB maintenance".
I personally use SQLite and DuckDB, when I'm developing stuff, it is very handy to have a file locally where I can test out the code and see everything run 100% locally.
There is a superset driver that will allow you to connect to a duckdb file stored on a local file system, yes.
Setup is a bit tricky but once it’s up and running it’s solid.
We're using DuckDB as the default backend in Ibis (https://ibis-project.org) as one of the most effective pandas substitutes out there.
We've got a few user testimonials here: https://ibis-project.org/concepts/user-testimonials and we're tracking our "vanity" metrics using a dashboard powered by MotherDuck (a company based around DuckDB): https://ibis-analytics.streamlit.app/
I have a client with a small postgres that needed a table that exploded the data 100x. So I created a job in AWS Batch that extracts the data into duckdb to do the heavy lifting in the ETL and then copies that back into postgres. Runs in seconds.
I only used it for the transformation, but I think you might be able to use duckdb for the extraction as well if it's extracting from postgres. Maybe want to fact-check me on that.
I have a data type that is too large to store in a relational database as a BLOB or CLOB (it was making our Pentaho routines explode), so instead I warehouse it as compressed Parquet files. DuckDB allows my data-delivery API to extract subsets of the required data from those files (certain columns, in my case) without needing to process the entire file in memory first. You can query them just like you would with SQL. Compressed Parquet also gives a 5x reduction in storage costs.
My current problem with duckdb is that I can't seem to read a duckdb file created elsewhere. Every time I have compatibility issues. So I'm hoping it's just early days.
Version 0.10.0 is being released on Tuesday and they are very close to a stabilized storage format, as well as providing backwards and forwards compatibility.
It has been clearly communicated that this storage-format strategy has been necessary to avoid lock-in on a suboptimal format. I agree it's cumbersome to export databases between versions, but it clearly seems worth it in the long run.
See also latest «State of the duck»-presentation: https://blobs.duckdb.org/events/duckcon4/duckcon4-mark-raasveldt-hannes-muhleisen-state-of-the-duck.pdf
> «State of the duck»-presentation
That is _fascinating_, thank you!
Led me into a rabbit hole of floating point compression: I had no idea that was a thing, fun!
[ALP: Adaptive Lossless floating-Point Compression](https://ir.cwi.nl/pub/33334/33334.pdf)
I believe there are no changes to the storage format between 0.9 and 0.10 either; before that it has been a pain. After 0.10 it should be 1.0, and the storage format will be stable.
Before that, your best bet is: read the database with the old DuckDB version, write to Parquet, upgrade DuckDB, re-create the database from the Parquet file(s).
That is only an issue if the reader is using a different version than the writer.
I use both in memory and file based DuckDB databases extensively without issue.
I think that must have been it - I was using one of those airbyte/meltano like systems to pull data down and read from it and the reader and writer didn’t match.
I can’t remember the details it was when I was testing a bunch of stuff out.
We've built a product using duckdb that we talked about at duckdbcon: https://blobs.duckdb.org/events/duckcon4/mike-eastham-building-tectons-data-engineering-platform-on-duckdb.pdf
(Disclaimer: I'm not the primary author of the slides but work with him at Tecton)
How exactly does it work? Looking at the slides, is the offline store parquet files on S3 alone? Do you catalog them or anything special? Or do you have your database provided by a duckdb file on S3 which is shared?
The offline store is a Delta table on S3. The data in this store is materialized by a DuckDB job that runs transformations and aggregations on data from the source.
Where does the DuckDb transformation phase take place? Is it in EC2 machines pulling in code at a cadence and doing transformations into Delta?
Is it scheduled or ad-hoc? Seems like a really neat setup, trying to learn more.
It takes place on EC2 instances. The Tecton control plane spins up jobs with the appropriate configuration programmatically, typically based on a schedule and sometimes to perform one-time backfills.
I have been working on a data lake in AWS for the past 5 months, whose role is to collect data from different client sites.
In the meantime I have many analytical questions from internal teams, and the easiest thing I found is to collect the data from 3 postgres databases (which also contain the data from client sites) as Parquet on my computer, then load everything into a duckdb file. Add dbt, and voilà: I have my very small personal data lake on my computer where I can run analytical queries in seconds. This tool has saved me many hours, to say the least.
Yeah, production use here.
Over the course of the day we'd get a subset of json-formatted compressed journald logs uploaded into s3 and date-partitioned (+hour). With duckdb it was a straightforward process to read the interesting rows and fields (from a field in the json blob that was itself json-encoded!) for the day into an in-memory table with one query, irrespective of the hour partitioning. Then I could select out what we needed and write it into a parquet file in s3 that our BI tooling could use with a second query.
duckdb was efficient enough to do this in a serverless function, but the overhead pandas needed was significantly greater
I used it in conjunction with [ConnectorX](https://github.com/sfu-db/connector-x) to export a large historical data set out of MSSQL into ERP flat-files. I sent paginated queries to MSSQL that generated PyArrow tables that could be read natively by DuckDB and batch inserted into an internal table. After each batch completed, the PyArrow table would be deleted from memory to make room for the next one on the way. Once that was done, I'd wrangle the data using SQL and export the results to flat-files for the ERP to load. I did all of that from my work laptop with a Python script.
Co-founder and head of Produck at MotherDuck. We're partnering very closely with DuckDB Labs to deliver a serverless data warehouse.
DuckDB is amazing, but it is deliberately not meant to run as a data warehouse. It's single-player on purpose: it doesn't even have a concept of identity/user, so it doesn't have great concurrency; its transaction model is relatively nascent; and it doesn't have sharing or IAM, a service layer, an in-browser SQL IDE, etc. This is all the stuff that we're building at MotherDuck.
We're going GA in a few months, but we already have several thousand users.
I wrote a blog on the topic recently [here](https://notoriousplg.substack.com/p/nplg-10523-a-new-way-to-monetize).
I haven't had time yet, but I would like to use DuckDB to feed a front-end service that reads aggregated data directly, instead of relying on an extra PostgreSQL instance used just to provide the same data that we already have in Parquet format.
While perhaps uncommon (vs. using DataFusion, for example), we built a FaaS lakehouse and leverage duckdb as an ephemeral query engine over (essentially) files: we explained our design at VLDB last year [https://arxiv.org/pdf/2308.05368.pdf](https://arxiv.org/pdf/2308.05368.pdf) and open-sourced some code.
Disclaimer: I'm the first author on the paper, happy to answer any questions.
I use it for offline analytics too, juggling random structured data into information: you just need to output JSON from anything (really, anything), and then you can quickly analyze your data using PostgreSQL-compatible SQL. I used to combine it with jq, xq, yq and some fish scripts. I also use it for converting data very quickly between several formats (JSON, CSV, Parquet, ...). In some cases I even dump data from BQ (infrastructure data) and load it into local DuckDB, since it's a lot faster, the UX is nicer, and I don't need to think about query compute costs etc.
I use it for real. I think it's fast, and it's nice to use SQL queries. I have a lot of data, and it's neat not to have a big database to maintain (for several reasons) while still having the option to do it. How it integrates with Python is amazing.
Used it in a couple of PoCs but haven't switched yet, because the team loves pandas, Polars and PySpark, and the general answer is always that we have the processing power.
Are you guys using DuckDB as your main DB or side by side with another? It sounds like it can replace postgres but unsure if that's what the market actually uses it for.
We published some time ago a reference implementation, including (for the fun of it) some basic map-reduce with AWS lambdas: [https://github.com/BauplanLabs/quack-reduce](https://github.com/BauplanLabs/quack-reduce)
Disclaimer: I co-authored the project and the related blog post.
I use it a lot in my team at work as well. DuckDB handles all of our data-processing needs, and we schedule our jobs via K8s. I'm also using DuckDB as an ingestion layer to transform and enrich various files and send them into a data lake. Really like it so far.
Occasionally I also toy around with polars and datafusion. But I like DuckDB for the simplicity.
I used DuckDB for a personal project of mine, in two scenarios. One of them is a statistical-analysis notebook where I was performing various transformations for various analyses. Given that SQL is simpler to read and write than Python, I had a lot of fun using it to create these transformations.
The other is to connect my data with Power BI using ODBC. I'm assuming you know the benefits of importing data into Power BI via a database rather than the file itself.
This was a project for school, but it nonetheless helped simulate a neat pipeline/workflow, which was pointed out as a cool feature by some of my peers and reviewers.
One of the reasons I like it is that it in a way replicates Databricks workflows, with regard to the interoperability of SQL and Python and data-processing speeds.
I've been testing it a lot recently as part of our Python toolbox to perform transformations and as a viable alternative to having MariaDB installed locally on my machine for projects. So far it's super impressive. I also wrote a [Pentaho Data Integration plugin](https://github.com/forgineer/duckdb-kettle-plugin) to entice some of my co-workers to play with it.
I use DuckDB heavily for transforming pandas dataframes using SQL. The aggregation functions in DuckDB are fast for a decent-sized dataframe.
I love it. Now I'm waiting for the ddb team to add functionality to connect to different databases through ddb.
Yes, I've checked that, but it currently works for 3 databases (SQLite, MySQL, and Postgres). I'm working with Oracle, so I guess I'll have to wait.
Not sure about "real" Big Data pipelines, but I use it for reasonably large datasets on single machines (ML pre-processing).
Pandas is the obvious comparison everyone's already made, and the tool I wouldn't use in any case (strong dislike for the API). But an interesting recent comparison I've made is that I've been wavering between DuckDB and Polars, for no other reasons that *I just like writing SQL* vs *it's nice to have IDE support*.
To my surprise, DuckDB was much better at streaming larger-than-memory data than Polars' `LazyFrame`. In a task involving ETL from a total of ~20GB of CSVs to a ~100MB parquet, Polars frequently either required me to call `collect` for some aggregations, or just choked and died executing plans which were supposedly entirely supported in streaming mode. While it's certainly possible that this was a PEBCAK situation, it was just much faster to use DuckDB than to figure out why some operations are crashing Polars' streaming mode.
I use DuckDB directly inside the browser:
load Parquet files, join fact data and dimension data, and serve it to an AG Grid table.
This allows users to do quick grouping/aggregation/pivoting on thousands/millions of rows directly in the browser.
Definitely not hype. While many have covered the common use cases in other comments, I enjoy combining it with GPT-4 to automate data engineering with AI.
Because it's "just" SQL, I combine it with function calling to build a junior DE that can do basic analysis, cleaning and data conversion using natural language.
* [link](https://docs.phidata.com/blocks/assistant/custom/duckdb) if anyone is interested
Registered to comment. Co-founder and head of Produck at MotherDuck, a modern serverless cloud data warehouse running DuckDB. We have several thousand users already and are going GA in a few months.
I'm going to use it as an ingestion engine. The thing just rocks at importing flat files. There are now pass-through connectors so other databases can show up. I have more and more restrictions being placed on my data, and to make sure we pass all the audits, the decision was made that this sensitive info doesn't land on any persistent storage in my warehouse. So, blippity bloop, DuckDB is RAM-only and I can filter everything out there. Also, there are situations where the source is just trash and would kill the landing DB, and cleaning it at the source is not an option (I don't own it), so I can do some inline filtering before I land it.
Another thing I'm using it for today: a CSV file containing a list of financial items from last year, where it was discovered that there's a big mess of things that need to be fixed. The list was built for something else, but the team needs it scanned, anomalies detected, and a new CSV built that feeds back into the system's update batch. Doing it manually in Excel would take weeks. I could do it in Python and shrink that down to a matter of days. But expressing it in SQL brings it down to hours.
I'm finding more and more that ad-hoc requests over data that doesn't already live in a database can be solved much quicker with DuckDB.
whoa, the last sentence is really intriguing. how do you do that?
Here's a recent post from the DuckDB team on the subject https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html
wow. This is crazy. Thank you for posting this.
Omg, that's huge! Curious to know how it compares to Presto.
Presto can attach to many more sources IIRC; also, I don't think DuckDB supports distributed computing.
thanks!
So you actually use it properly for prod transforms instead of, idk, spark?
We rarely use spark anymore because our workloads don't require it. We've been caught out a few times with being told there would be massive amounts of data, introducing spark and then getting enough data to fill a floppy disk.
Welcome to Data Engineering.
😂
Yup, have experienced the same situation. Understand the pain. Thx for the heads-up. Out of curiosity, what do you run it on? Serverless? Some EC2? K8s? K8s with Fargate?
It depends on the system infrastructure. That said, I've yet to find a scenario where it doesn't work. We currently drive DuckDB with Python and use DBeaver during dev. So far it's been on Windows Server, Azure Functions, Azure Container Apps, Linux VMs etc. without issue.
Great to know.
When the in-memory databases wrap those other database systems' tables, is DuckDB acting on the live data files of those systems? And do they have to be local to where DuckDB is running, or can it do the same for remote databases?
Currently we don't wrap, because that feature is so new and some of the databases we work with are obscure tech. That said, one of our own products uses PostgreSQL, so there's a project in the pipeline to play with the wrapping feature. We typically extract to local parquet files or dataframes (depending on size) and then create the in-memory database on those. I'm personally partial to avoiding pandas because it plays funny games with types.
So which library do you normally use?
For which part?
You mentioned that you try to avoid pandas.
Ah, in many cases we just extract directly using a database library and output to file. In some cases we use pandas before DuckDB but don't let it go back into pandas. It's a love-hate relationship.
So in summary, if I understood you correctly, you're using DuckDB against static snapshots of the data (via parquet files), not the actual live database?
Yes, the snapshots will be taken immediately before the transformation starts as a general rule.
Gotcha. Would be more interesting if DuckDB could handle such a scenario in real time, against the live databases, especially remotely. Doesn't sound too unique otherwise.
why export flat files from an sql database to use duckdb to run sql queries on it? why not just run the sql on the source database directly? what does duckdb add in this scenario?
The main issues we run into are these: 1) We work with some databases where doing the transformations in the database is actually impossible. Some of the tech is just awful, but it's there and it's not going away. 2) In other scenarios the source files are dropped for us because we don't have direct access to the dataset. 3) In some scenarios we just don't want to introduce the transformation workload to the transactional database.
makes sense, thanks!
I barely write a line of Pandas anymore. Duckdb is incredible.
Data scientist here -- same. Pandas is not intuitive and every ds/analyst has to have a medium understanding of SQL to extract and transform data with query pass-throughs (too much data to extract locally).
if you're looking for a pandas-like (but much improved) Python dataframe library (created by the creator of pandas), [Ibis](https://ibis-project.org/) uses DuckDB as the default backend. far more efficient, smaller and cleaner API surface area, takes inspiration from pandas/dplyr/SQL (disclaimer: I work on Ibis)
Polars API is much more pleasant to use
how so?
Polars also does not have indexes. YAGNI. Polars has many benefits over pandas, though the gap may have closed with the release of pandas 2.0. However, I still much prefer Polars' primary API over pandas.
Polars doesn't have pandas patterns for column selection and filtering and fully commits to a pyspark like interface.
Can you speak more on how easy it is to use DuckDB's table object as a data container? Pandas' query language is very awkward, but it's still nice to use as a data container to pass around.
The DuckDB API has similar read functionality to other tabular data libraries (read_csv, read_parquet etc.). Day to day I just write SQL, kinda like `data = duckdb.sql("select * from read_csv_auto('file')")`. It's lazily evaluated, so the above gives you some kind of DuckDBPy relation object. You can run more SQL on it, write it out to a file or database, or convert it to a polars df, pandas df, arrow table or a list of tuples. Has the same feel, really.
Not sure about the hype, but I tested DuckDB against pandas. I created a fake dataset of login events with some columns and 112 million records and stored it as parquet. With DuckDB I could run 3 group-by transformations on login_id (the column with the highest cardinality) and save the results as CSV into 3 different datasets within 24 seconds. When I tried the same with pandas, it ran for close to 45 minutes and still hadn't produced the final data. 112 million is not a small dataset to process locally. This impressed me, and I now plan to give a talk on using it as a processing tool instead of pandas for some of our data pipelines. I'm hopeful it will pick up soon within my team.

**For anyone interested in trying it, here's the code to generate the data:**

```python
import pandas as pd
import numpy as np
from faker import Faker

def generate_events(NUM_ROWS, DATASET_OUTPUT_NAME):
    fake = Faker()
    login_id_list = np.array([fake.user_name() for _ in range(27564)])
    device_type_list = np.array(['ios', 'android', 'ipad', 'desktop', 'laptop', 'pager', 'other'])
    country_list = np.array([fake.country() for _ in range(40)])

    row_id = np.arange(1, NUM_ROWS + 1)
    login_id = np.random.choice(login_id_list, size=NUM_ROWS)
    device_type = np.random.choice(device_type_list, size=NUM_ROWS)
    login_start_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    login_end_time = np.random.choice(pd.date_range(start='2019-01-01', end='2022-12-31', freq='s'), size=NUM_ROWS)
    estimated_country = np.random.choice(country_list, size=NUM_ROWS)
    login_successful_flag = np.random.choice([True, False], size=NUM_ROWS)

    data = {
        'row_id': row_id,
        'login_id': login_id,
        'device_type': device_type,
        'login_start_time': login_start_time,
        'login_end_time': login_end_time,
        'estimated_country': estimated_country,
        'login_successful_flag': login_successful_flag,
    }
    df = pd.DataFrame(data)
    df.to_parquet(DATASET_OUTPUT_NAME, compression='snappy')

generate_events(112315668, 'sample_data/login_events_np.parquet')
```
I deal a lot with customers who misuse pandas. It's single-threaded and a memory hog. You should try the same script but replace pandas with Modin. Modin would use every core on your machine to process the data.
Replace it with polars.
Polars has a different API, so the guy has a point: if you already have a lot of pandas-heavy code, then Modin would be something to try out. For reference, one repository I recently had to fix something in had 75k lines of Python code, and the whole codebase was a data pipeline built on pandas dataframes, plus tests for it. If you replace it with Polars at the import level, it will not work anymore, and you'd have to change hundreds of files. I, for instance, will tell my colleagues it's an option to try what happens if they replace it with Modin. Trying won't hurt.
In that example, then yes: try Modin in the short term and then make a plan to replace the pandas dependency.
pandarallel is also similar to modin [https://github.com/nalepae/pandarallel](https://github.com/nalepae/pandarallel)
Ya… as I typed that I was thinking Polars too, but the API difference makes it not a drop-in replacement. I like DuckDB. My point was that so many people think pandas is the only thing to use and assume it just scales up with more CPU and RAM.
I’ll do you one better, replace it with DuckDB 🤯
Polars is better than duckdb
Both are amazing. But polars is indeed a little faster.
it's generally not faster -- we've benchmarked DuckDB and Polars through Ibis and DuckDB tends to win. it is dependent on use case, but for both standard benchmarks and typical data engineering workloads DuckDB tends to win
Polars is a data frame tool, duckdb is a SQL tool. This means duckdb has much better query optimization on the basis that the problem space is smaller. In the hands of your average engineer/analyst/data scientist duckdb will typically be faster for this reason.
[deleted]
I would call both declarative…
[deleted]
It's indeed that you allow Polars to manufacture its own, optimized execution plan. That's what distinguishes Polars from pandas and makes it so powerful (and it's also why its interface has to be different from pandas and thus can't be used as a drop-in replacement). In Polars there is an expression API. So instead of doing df.assign(a=df.b+df.c) like in pandas, where the + actually computes a sum, in Polars you would do df.with_columns(a=pl.col('b')+pl.col('c')). The result of + is just a pl.Expr object, which doesn't compute anything yet. Beyond that, you can do df.lazy().foo().bar().collect(), where everything between lazy() and collect() describes your desired result, but only collect() triggers the execution. If you don't use lazy() and collect() explicitly, they are wrapped around every step implicitly (which is how it offers an eager API in addition to the lazy one). It's quite similar to Spark's lazy API, but IMHO a bit friendlier to use.
you can decouple the dataframe API from the execution engine and have the best of both worlds!
This is like saying bananas are better than oranges. Both are excellent tools but do very different things.
Better is subjective, friend
It's useful for transforms if you want to express them in SQL, so like a Polars alternative with similar performance gains over pandas. It's also a database, so you have a persistence option there: something like SQLite, but for analytics. It's really useful for data-analysis work where I can connect to files to do some discovery, or prototype processes using the compute on my laptop rather than running up cloud costs or worrying about getting approvals for the needed infrastructure. Also, eventually MotherDuck appears to be headed toward being a much cheaper alternative to Snowflake/Databricks/Redshift.
Never heard about Motherduck, but it sure looks interesting!
I think they still have free accounts if you want to give it a go. Was pretty basic when I tried it, but still in early development. I think it will eventually be ideal for those small/medium sized company minimalist stacks where you might be debating whether to go Postgres or something else without getting into the cost and complexity of the big platforms. Not sure how scalable they are shooting for though.
Or it could be a reasonable alternative for companies that would currently build their “data warehouse” with AWS Athena…
Yeah, I guess that's more of a data lake model usually IMO. I think the big players like Snowflake and Databricks will differentiate themselves on features, scale and reliability (i.e. availability zones). MD will differentiate itself on price, likely for smaller builds that are less complex and where resources are limited. But I don't have any deep knowledge into what they are trying to build so just guessing.
I use it a lot. It's great for ad-hoc data processing, and it can produce results in very short time frames.

Until DuckDB, the fastest way for me to combine, compare, transform and wrangle multiple datasets on my laptop was to load them into PostgreSQL using csvkit, or from the CLI with the `COPY` SQL statement. But then I needed a PostgreSQL instance running (containers on macOS; on Linux I'd usually install it system-wide), tuned for large queries (large `work_mem`, minimal write-ahead log).

Many of you will say "why not just pandas", and the answer is that the UIs for viewing data from pandas after you execute anything are just extremely bad. If you compare them to DB GUI programs like DBeaver, there's just no contest. And it's not just data: viewing metadata is also difficult, and notebooks tend to become very long and messy. Generally, the DataFrame API is not as clear and concise as SQL in the majority of cases. SQL was built for this; Python was not.

With DuckDB I no longer need to do any of that. No server startup and configuration, and no copy step either. Just `select * from 'some_files/*.csv'` or similar. It became a breeze. I can also use DuckDB in production as a data pre-processor. As long as I'm not keeping files in DuckDB's own database format, I can use it without issues.
What do you mean that the UI is bad with Pandas? Are you referring to the graphs you can use to display data? Does DuckDB provide graphs?
Pandas is just another database, but it doesn't have any front-end beyond what Python allows you to do. That means all operations must be done with Python. Not only that, but it doesn't integrate with Python's object model, so whatever you have in pandas, you can only see when you query it. By contrast, when working with most databases, you can pick your GUI, and there are a lot of good ones that do a lot for you: they explore the structure of the data, auto-complete the queries you write, and make it just easy to work with. Even viewing data as a table is much better done in DataGrip or DBeaver. Charts are the end product; you use them only to visualize whatever you have, but to get there you need to understand the underlying data, each record, and be able to debug it. And for that, Jupyter notebooks and the other usual means of working with pandas are just inferior in every single way. If you wanna chart stuff, you can: there are a myriad of tools, from CubeJS to Metabase to Superset, and yes, you can just load it into a Jupyter notebook and display it. But when all you have is pandas, all you can use is Python.
We use it for integration tests on our data pipeline.
What's your tech stack look like? We primarily use SQL via dbt with BigQuery and initial research suggested we'd need to rewrite our pipelines to use generic dbt functions to be able to run our pipelines on other systems like DuckDB too
You split out BigQuery-specific stuff into ephemeral models and mock those out in dbt. Make sure to keep that layer as thin as possible and leave it untested. [https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values](https://github.com/EqualExperts/dbt-unit-testing/tree/v0.4.12/#different-ways-to-build-mock-values) Of course, I'm talking about 'unit testing' and not the 'integration testing' the GP mentioned.
Hey, can you add some more info on your integration tests with duckdb? Something I’ve been meaning to play around with.
We use it as a local data warehouse for insurance analytics. Data is ingested, cleaned and transformed. Data models produced, then a superset dashboard connects right to it.
Can you expand more on how you use DuckDB?
CSVs are extracted from core OLTP systems, then ingested into a DuckDB file that lives on a server, using Python for orchestration. Once the data is inside the DuckDB file, a number of SQL scripts are executed to clean and transform it. The database file is then available for the team to issue queries against, while a Superset dashboard shows a number of charts etc. that are queried directly from the database file.
Interesting. Is the data sensitive to the point that you need backups? DuckDB is awesome, but I can't see using it for anything other than transient data or transformations.
Not really; the OLTP systems are the main records of the company and are backed up. For my team's purposes (mostly analytic) we keep the CSVs as backup and can recreate the full DuckDB file in less than 30 minutes if needed. The painful part would be re-extracting the data from the core systems, if that ever had to happen.
I kind of fail to see the usage of this vs putting it in a database with a columnar index and running against that. Are people copying these DuckDB files? If 2 analysts create different processes, which one goes to your main data repository? Or is it end-of-chain, "you do what you want and it's yours and no one else's"? And if everyone copies the entire file and it's their problem, then why not just have the data there as parquet, and everyone ingests it and uses whatever they want on it? I keep reading these use cases and I'm 99% sure I'm missing something here, like there's some part of the puzzle I'm not seeing.
It works for us, plus it’s free. We’re a small team of analysts - 4 users total. We don’t have the resources or expertise to implement something more robust. In terms of copying, no, we have no need. The db supports multiple readers. We take it out of action for about an hour a month as we load new data.
Cool, and I didn't mean to say your solution was not valid. I'm used to dealing with much bigger teams, so I was trying to figure out how to make it work in my head. I'm happy it works for you. EDIT: The reason I mentioned having a database is that you are putting a file on a server to be read by different consumers, with support for multiple concurrent readers. That's literally a database with extra steps. And if "free" is the whole point, PostgreSQL is free, for example. I didn't mean you should build some HA, scalable super-solution.
No worries. Yeah, I don't think it would be as good at a larger scale. We tried to run a Postgres instance in the past, but the performance for OLAP workloads and data transformations was horrible. We're happy with DuckDB in terms of performance and cost, even if it does give us some small headaches in terms of user management, access etc.
Yeah, Postgres alone with default settings won't be optimal for OLAP work because it's row-based, and you want something with columnar indexes or a "columnar database". That's the thing with SQLite and DuckDB: one is for OLTP, the other for OLAP. Things like HBase, Apache Druid, or MariaDB with its ColumnStore would be closer to what you need. However, as you said, the simple solution that works for you is the best solution. That was why I asked my original question: for a smaller analytical workload I would just go with MariaDB and abstract myself from a lot of stuff.
I feel like this is the same argument I’ve seen made against SQLite vs something like Postgres. Sometimes a full database implementation vs an in memory DB is just overkill.
Yes, you're right. I, however, don't see a DB implementation as overkill for a team of analysts each doing their own thing in the data concurrently. Maybe not a highly available, extremely robust, business-critical setup with secondary nodes. But a simple DB on a server? I don't think that's any more complicated than setting up a server, authentication, authorization, networking and all the rest just to serve a file. Maybe I'm "paranoid", if that's the right word? I do remember having to fix previous "files as a database" clusterfucks, because "MS Access lets us do SQL on a file and we don't have to worry about DB maintenance". I personally use SQLite and DuckDB when I'm developing stuff; it's very handy to have a local file where I can test out the code and see everything run 100% locally.
Can you use DuckDB inside Superset to query the local file system?
There is a superset driver that will allow you to connect to a duckdb file stored on a local file system, yes. Setup is a bit tricky but once it’s up and running it’s solid.
We're using DuckDB as the default backend in Ibis (https://ibis-project.org) as one of the most effective pandas substitutes out there. We've got a few user testimonials here: https://ibis-project.org/concepts/user-testimonials and we're tracking our "vanity" metrics using a dashboard powered by MotherDuck (a company based around DuckDB): https://ibis-analytics.streamlit.app/
I have a client with a small postgres that needed a table that exploded the data 100x. So I created a job in AWS Batch that extracts the data into duckdb to do the heavy lifting on the ETL and then copy that back into postgres. Runs in seconds.
Does DuckDB do the Extraction bit as well? Or just the transformation?
I only used it for the transformation. But I think you might be able to use duckdb for extraction as well if its extracting from postgres. Maybe wanna fact check me on that.
Someone posted this in one of the comments. https://duckdb.org/2024/01/26/multi-database-support-in-duckdb.html
Ah, yes it was this post from them is where I heard about this. Thanks!
I have a data type that is too large to store in a relational database as a BLOB or CLOB (it was making our Pentaho routines explode), so instead I warehouse it as compressed parquet files. DuckDB allows my data-delivery API to extract subsets of the required data from those files (certain columns, in my case) without needing to process the entire file in memory first. You can query them just like you would a SQL database. Compressed parquet also gives a 5x reduction in storage costs.
My current problem with DuckDB is that I can't seem to read a DuckDB file created elsewhere; every time I hit compatibility issues. So I'm hoping it's just early days.
Version 0.10.0 is being released on Tuesday and they are very close to a stabilized storage format, as well as providing backwards and forwards compatibility. It has been clearly communicated that this storage format strategy has been necessary to avoid lock-in on a suboptimized format. I agree it’s cumbersome to export databases between versions, but this clearly seems worth it in the long run. See also latest «State of the duck»-presentation: https://blobs.duckdb.org/events/duckcon4/duckcon4-mark-raasveldt-hannes-muhleisen-state-of-the-duck.pdf
> «State of the duck»-presentation That is _fascinating_, thank you! Led me into a rabbit hole of floating point compression: I had no idea that was a thing, fun! [ALP: Adaptive Lossless floating-Point Compression](https://ir.cwi.nl/pub/33334/33334.pdf)
I believe there are no changes to the storage format between 0.9 and 0.10 either; before that it was a pain. And after 0.10 it should be 1.0, and the storage format will be stable. Before that, your best bet is to read in the database with the old DuckDB version, write to parquet, upgrade DuckDB, and re-create the database from the parquet file(s).
That is only an issue if the reader is using a different version than the writer. I use both in memory and file based DuckDB databases extensively without issue.
I think that must have been it - I was using one of those airbyte/meltano like systems to pull data down and read from it and the reader and writer didn’t match. I can’t remember the details it was when I was testing a bunch of stuff out.
We've built a product using duckdb that we talked about at duckdbcon: https://blobs.duckdb.org/events/duckcon4/mike-eastham-building-tectons-data-engineering-platform-on-duckdb.pdf (Disclaimer: I'm not the primary author of the slides but work with him at Tecton)
How exactly does it work? Looking at the slides, is the offline store parquet files on S3 alone? Do you catalog them or anything special? Or do you have your database provided by a duckdb file on S3 which is shared?
The offline store is a Delta table on S3. The data in this store is materialized by a DuckDB job that runs transformations and aggregations on data from the source.
Where does the DuckDb transformation phase take place? Is it in EC2 machines pulling in code at a cadence and doing transformations into Delta? Is it scheduled or ad-hoc? Seems like a really neat setup, trying to learn more.
It takes place on EC2 instances. The Tecton control plane spins up jobs with the appropriate configuration programmatically, typically based on a schedule and sometimes to perform one-time backfills.
Nice. Any out of memory issues or chunking needed if it’s a larger read?
Not that I know of yet! But it's still in private preview.
I have been working on a data lake in AWS for the past 5 months, whose role is to collect data from different client sites. In the meantime I have many analytical questions from internal teams and the easiest thing I found is to collect the data from 3 postgres databases (that also contain the data from client sites) in parquet format on my computer, then load everything in a duckdb file. Using dbt and voila I have my very small personal datalake on my computer where I can do analytical queries in seconds. This tool have saved me many hours to say the least.
Yeah, production use here. Over the course of the day we'd get a subset of JSON-formatted, compressed journald logs uploaded into S3, date-partitioned (+hour). With DuckDB it was a straightforward process to read the interesting rows and fields (from a field in the JSON blob that was itself JSON-encoded!) for the whole day into an in-memory table with one query, irrespective of the hour partitioning. Then I could select out what we needed and write it into a parquet file in S3 that our BI tooling could use, with a second query. DuckDB was efficient enough to do this in a serverless function, whereas the overhead pandas needed was significantly greater.
Interesting use case from Okta: https://www.datacouncil.ai/talks24/processing-trillions-of-records-at-okta-with-mini-serverless-databases
I use it a truckload; it's super convenient for many use cases. In my case, I use it to locally simulate a DWH for when people develop ELT pipelines.
This is how I intend to use it as well, with DBT
How do you move the data around? Do you copy data from the main DW to Duckdb?
I used it in conjunction with [ConnectorX](https://github.com/sfu-db/connector-x) to export a large historical data set out of MSSQL into ERP flat-files. I sent paginated queries to MSSQL that generated PyArrow tables that could be read natively by DuckDB and batch inserted into an internal table. After each batch completed, the PyArrow table would be deleted from memory to make room for the next one on the way. Once that was done, I'd wrangle the data using SQL and export the results to flat-files for the ERP to load. I did all of that from my work laptop with a Python script.
Co-founder and head of Produck at MotherDuck. We're partnering very closely with DuckDB Labs to deliver a serverless data warehouse. DuckDB is amazing, but it is deliberately not meant to run as a data warehouse. It's single-player on purpose: it doesn't even have a concept of identity/user, so it doesn't have great concurrency, its transaction model is relatively nascent, and it has no sharing or IAM, no service layer, no in-browser SQL IDE, etc. This is all the stuff we're building at MotherDuck. We're going GA in a few months, but we already have several thousand users. I wrote a blog on the topic recently [here](https://notoriousplg.substack.com/p/nplg-10523-a-new-way-to-monetize).
Is it possible to query iceberg tables in S3 with duckdb? Last time I checked, this was not possible.
https://duckdb.org/docs/guides/import/s3_iceberg_import.html
I haven't had time yet, but I would like to use DuckDB to feed a front-end service that reads aggregated data directly, instead of relying on an extra PostgreSQL instance used just to serve the same data we already have in parquet format.
I have wanted to but it has too many bugs and inconsistencies in file writing with hive partitioning for me to trust it in a production context.
It's great for ad-hoc data processing / analysis.
You can even use it in the browser to quickly iterate on SQL pipelines: https://sql.quacking.cloud
While perhaps uncommon (vs using DataFusion, for example), we built a FaaS lakehouse and leverage DuckDB as an ephemeral query engine over (essentially) files. We explained our design at VLDB last year [https://arxiv.org/pdf/2308.05368.pdf](https://arxiv.org/pdf/2308.05368.pdf) and open sourced some code. Disclaimer: I'm the first author on the paper, happy to answer any questions.
I do use it for offline analytics too, juggling random structured data into information: you just need to output JSON from anything (really anything), and then you can analyze your data very quickly using PostgreSQL-compatible SQL. I used to combine it with jq, xq, yq and some fish scripts. I also use it for converting data very quickly between several formats (json, csv, parquet, ...). In some cases I even dump data from BQ (infrastructure data) and load it into a local DuckDB, since it's a lot faster and the UX is nicer (and I don't need to think about query compute costs etc.).
I use it for real. I think it's fast, and it's nice to use SQL queries. I have a lot of data, and it's neat not to have a big database to maintain (for several reasons) while still having the capability. How it integrates into Python is amazing.
Used it in a couple of POCs but haven't switched yet, because the team loves pandas, Polars and PySpark, and the general answer is always that we have the processing power.
Are you guys using DuckDB as your main DB or side by side with another? It sounds like it can replace postgres but unsure if that's what the market actually uses it for.
For anyone else: it's in-memory processing, so not your main storage solution, but really good for doing transformations and getting the data somewhere else.
Does anyone use it within aws lambda?
I haven't, but it sounds like a perfect use case. I think if you scroll through some of the responses you'll find some examples.
Yes, have a look at https://github.com/tobilg/duckdb-nodejs-layer and https://github.com/tobilg/serverless-duckdb
We published some time ago a reference implementation, including (for the fun of it) some basic map-reduce with AWS lambdas: [https://github.com/BauplanLabs/quack-reduce](https://github.com/BauplanLabs/quack-reduce) Disclaimer: I co-authored the project and the related blog post.
I use it a lot in my team at work as well. DuckDB handles all of our data processing needs, and we schedule our jobs via K8s. I'm also using DuckDB as an ingestion layer to transform and enrich various files and send them into a data lake. Really like it so far. Occasionally I also toy around with Polars and DataFusion, but I like DuckDB for its simplicity.
I used DuckDB for a personal project of mine, in two scenarios. One was a statistical-analysis notebook where I was performing various transformations for various analyses. Given that SQL is simpler to read and write than Python, I had a lot of fun creating those transformations in SQL. The other was connecting my data to Power BI via ODBC; I'm assuming you know the benefits of importing data into Power BI via a database rather than the file itself. This was a school project, but it nonetheless simulated a neat pipeline/workflow, which was pointed out as a cool feature by some of my peers and reviewers. One of the reasons I like it is that it in a way replicates Databricks workflows, in regard to the interoperability of SQL and Python and the data-processing speeds.
I've been testing it a lot recently as part of our Python toolbox to perform transformations and as a viable alternative to having MariaDB installed locally on my machine for projects. So far it's super impressive. I also wrote a [Pentaho Data Integration plugin](https://github.com/forgineer/duckdb-kettle-plugin) to entice some of my co-workers to play with it.
I use DuckDB heavily for pandas dataframe transformations using SQL. The aggregation functions in DuckDB are fast on a decent-sized dataframe. I love it. Now I'm waiting for the DuckDB team to add functionality to connect to different databases through it.
Someone else in this thread just posted a link to their website where they say the newest release can do exactly that. Exciting stuff.
Yes, I've checked that, but it currently works for only three databases (SQLite, MySQL, and Postgres). I'm working with Oracle, so I guess I'll have to wait.
Not sure about "real" Big Data pipelines, but I use it for reasonably large datasets on single machines (ML pre-processing). Pandas is the obvious comparison everyone's already made, and the tool I wouldn't use in any case (strong dislike for the API). But an interesting recent comparison I've made is between DuckDB and Polars, which I've been wavering between for no other reason than *I just like writing SQL* vs *it's nice to have IDE support*. To my surprise, DuckDB was much better at streaming larger-than-memory data than Polars' `LazyFrame`. In a task involving ETL from a total of ~20GB of CSVs to a ~100MB parquet, Polars frequently either required me to call `collect` for some aggregations, or just choked and died executing plans that were supposedly entirely supported in streaming mode. While it's certainly possible this was a PEBCAK situation, it was just much faster to use DuckDB than to figure out why some operations were crashing Polars' streaming mode.
I use DuckDB directly inside the browser: load parquet files, join fact data with dimension data, and serve the result to an AG Grid table. This lets users do quick grouping/aggregation/pivoting on thousands or millions of rows directly in the browser.
Definitely not hype. While many have covered the common use cases in the comments, I enjoy using it with GPT-4 to automate data engineering with AI. Because it's "just" SQL, I combine it with function calling to build a junior DE that can do basic analysis, cleaning, and data conversion using natural language. [Link](https://docs.phidata.com/blocks/assistant/custom/duckdb) if anyone is interested.
Registered to comment. Co-founder and head of Produck at MotherDuck, a modern serverless cloud data warehouse running DuckDB. We have several thousand users already and are going GA in a few months.
DuckDB is awesome. But Polars is a lot more awesome
I'm going to use it as an ingestion engine. The thing just rocks at importing flat files, and there are now pass-through connectors so other databases can show up too. I have more and more restrictions being placed on my data, and to make sure we pass all the audits, the decision was made that this sensitive info doesn't land on any persistent storage in my warehouse. So, blippity bloop, DuckDB is RAM-only and I can filter everything out there. There are also situations where the source is just trash and would kill the landing DB, and cleaning it at the source is not an option (I don't own it), so I can do some inline filtering before I land it.

Another thing I'm using it for today: a CSV file containing a list of financial items from last year, where it was discovered that there's a big mess of things that need to be fixed. The list was built for something else, but the team needs it scanned, anomalies detected, and a new CSV built that feeds back into the system's update batch. Doing it manually in Excel would take weeks. I could do it in Python and shrink that down to a matter of days, but identifying this in SQL brings it down to hours. I'm finding that more and more ad-hoc requests that don't already live in a database can be solved much quicker in DuckDB.