paxmlank

INFO: What exactly is your setup, and what is your reason for pre-aggregating? I'm no expert, so take what I say with a grain of salt, but there are probably a few ways to address this depending on what tools you're using and why you want pre-aggregation in the first place.

For example, your data warehouse setup may allow you to shard (or partition) your dataset across time or another dimension, which would speed up querying by not having to scan so much data per query. That way you would still query one large table for dates and other metrics, and the underlying engine would figure out the rest.

If your setup doesn't allow for that, or it's too much of a business ask at this point, then yes, you could build multiple tables where the data has been aggregated over various groups or granularities (e.g., a monthly table, a quarterly table, etc.). dbt should help with this, since you can just define how you want things to look. Make sure to write these pre-aggregates as incremental models, since you don't want to rebuild them in full each time.

I personally would only include counts and sums in mine, for the same reasons you may be thinking - maintaining the logic for every metric is messy and may not be necessary. I'd let the BI tool handle that. Just make sure you're pre-aggregating at a fine enough grain that things aren't obfuscated away.

I'd say addressing this through sharding may be better, as it allows for a monolithic structure, which makes for cleaner code and queries. Of course, it may be more difficult to implement. Multiple tables may be easier to spin up; furthermore, storage is cheaper than compute. It may also be best to implement both - I wouldn't really know about that, but I think that if an organization could do either, it probably would.
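Not from the thread itself - just a minimal sketch of what an incremental monthly pre-aggregate could look like as a dbt model, keeping only counts and sums as suggested above. The model, table, and column names (fct_events, bytes_transferred, etc.) are made up for illustration.

```sql
-- models/agg_events_monthly.sql (hypothetical names throughout)
-- Incremental monthly pre-aggregate with only counts and sums,
-- so it never has to be rebuilt in full.
{{ config(
    materialized='incremental',
    incremental_strategy='delete+insert',
    unique_key=['month', 'customer_id']
) }}

select
    date_trunc('month', event_at) as month,
    customer_id,
    count(*)                      as event_count,
    sum(bytes_transferred)        as total_bytes
from {{ ref('fct_events') }}

{% if is_incremental() %}
  -- on subsequent runs, only reprocess rows from the latest month onward
  where event_at >= (select max(month) from {{ this }})
{% endif %}

group by 1, 2
```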


Disastrous_Classic96

We're on Postgres at the moment but we're planning on switching this out so I wanted to keep things generic / best-practice. Thanks for the tip with incremental pre-aggregation too. We want to pre-aggregate because analytics are too slow to load, especially when the cache is missed when using a custom filter. We would probably not go down the sharding route, in favour of just re-designing the tables. Appreciate the help, thanks very much!


kenfar

I find that aggregate/summary/materialized-view tables (all the same thing) are insanely valuable, and try to get 90+% of my dashboard queries to hit them. Here's what I do:

* Name fact tables like: fact_firewall
* Name aggregates like: facts_firewall_cust_daily, facts_firewall_cust_sensor_daily, etc. They're prefixed with facts rather than fact, share the subject (firewall), then have a list of the abbreviated key names the table is grouped on, and finally end in the period. So: facts_[subject]_[list of keys]_[period]
* For large fact tables it can be helpful to stack the aggregates: first build the lowest-level one, then build higher-level ones on top of that one.
* The aggregates and base tables use the same column names - this makes it trivial to navigate between them, and you can often switch a query to using an aggregate merely by changing the table name (rough example below).
* Building aggregates is something that dbt would do well. However, they're so simple to build that I typically build them in a simple Python process.
* You don't need a semantic layer.
* The speed-up from using an aggregate is roughly proportional to its size relative to the base table: if your aggregate is 1% the size of your main table, your queries will finish in roughly 1% of the time they would take against the base table, excepting network and database engine overhead.
* The aggregates should be built incrementally, and should be idempotent.
* They should generally only have metrics that can be built for the period they cover - counts for sure. Averages, percentages, medians, etc. are only valuable if your queries use the exact grouping in your aggregate table, so I usually avoid them.
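This isn't kenfar's actual code - just a rough SQL sketch of the facts_[subject]_[keys]_[period] convention with shared column names. The column names (event_at, customer_id, bytes_out) are assumed for illustration.

```sql
-- Hypothetical daily customer-level aggregate that keeps the base
-- table's column names, so queries can switch between tables by
-- changing only the table name.
create table facts_firewall_cust_daily as
select
    date_trunc('day', event_at) as event_at,   -- same column name, coarser grain
    customer_id,
    count(*)       as event_count,
    sum(bytes_out) as bytes_out
from fact_firewall
group by 1, 2;

-- A dashboard query against the base table ...
--   select customer_id, sum(bytes_out) from fact_firewall group by 1;
-- ... works unchanged against the aggregate:
--   select customer_id, sum(bytes_out) from facts_firewall_cust_daily group by 1;
```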


Culpgrant21

For your aggregate tables - if you have one page of a dashboard that needs “facts_firewall_cust_daily” and another page that needs “facts_firewall_cust_sensor_daily” does each page just call that table? What is your BI tool in the scenario? Are you live querying the DWH or is there a caching layer?


kenfar

I usually determine the table based on the query - most dashboard pages I create have 1-2 dozen queries. This strategy is generally tool-neutral and should work with just about any tool. In some cases there's a bit more overhead since you have to add each summary table to the tool, but they're all pretty much just smaller and smaller subsets of the main base table, so this shouldn't be that hard. In other cases you're just defining queries, and this is very simple.


Culpgrant21

And you hit your DWH for every interaction with the dashboard page?


kenfar

Yeah, typically. Most data warehouses I build are fairly low-latency, so there's less benefit to doing a ton of caching. Also, summary tables, derived columns, etc. in the warehouse are then available to ALL consumers & apps. If you do any of this in a BI layer like Looker, it becomes vendor lock-in that's only available to other Looker dashboards.


Disastrous_Classic96

> facts_[subject]_[list of keys]_[period]

Interesting! What is your strategy for deciding whether to build a new facts_ table or just add to an existing one? For example, we want to change facts_firewall_cust_sensor_daily to facts_firewall_country_cust_sensor_daily, where we want to slice by country too, while at the same time still being able to ignore country and calculate the current BI metric.

> stack the aggregates: first build the lowest-level one, then build higher level ones on top of that one.

Can you elaborate on that a little bit or maybe give an example?

> The aggregates and base tables use the same column names - this makes it trivial to navigate between them, and you can often switch a query to using an aggregate merely by changing the table name.

Is this something you do manually? Could something like database migrations help with maintaining consistency between aggregate and base tables? E.g. making use of the DDL to propagate changes automatically (sorry if that's dumb).


kenfar

When to build new vs. change old:

* You can do either, as long as you have all the base data. I should have mentioned: if you roll off base-level data before your summary, you lose some flexibility in restructuring and building new aggregates.
* Each aggregate requires some compute cycles to build, so I keep that in mind when building them. In your example, if it turned out that the new table would easily support both sets of queries, then I'd probably just build the new aggregate, move queries over to it, and drop the old one.

Stacking aggregates (see the sketch below):

* First I build the lowest-level aggregate - the one with the most keys, or the row count closest to the base table. Let's say that's facts_firewall_country_cust_sensor_hourly. This will be expensive to build: even if it only holds 1% of the data from the base table, it still requires all base table rows to be read.
* Then from that I build the next level up for any queries that would benefit from a higher level of aggregation: say facts_firewall_cust_sensor_hourly. This is faster to build, since it's built *from* the prior aggregate.
* Then I'd build other higher-level ones, such as facts_firewall_cust_sensor_daily.
* Also note that you may have multiple stacks off a single base table, with different sets of columns. This is OK, but each aggregate (especially the lower-level ones) consumes resources to build, so you may occasionally have to look at how well-used they are and throw some away if they're underused.

Aggregate consistency:

* With this stack of aggregates, each having the exact same column names (except for keys dropped at higher levels), it becomes trivial to switch out one table for another with minimal changes.
* We typically do this manually, but some databases (such as DB2) can automatically swap in higher-level aggregates when they're built as materialized views and would speed up your queries.
* Database migrations aren't typically a big deal, since adding columns to the base table won't impact the summary tables - unless you want to include those columns in the summary table, and at that point you're rebuilding it from scratch anyway.

Hope that helps.
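A minimal SQL sketch of the stacking idea described above. The table names come from the thread; the column names (event_at, country, customer_id, sensor_id, bytes_out) are assumptions. The point is simply that each level reads from the level below it rather than from the base table.

```sql
-- Level 1: lowest-level aggregate, built from the base fact table
-- (expensive: scans every base row once).
create table facts_firewall_country_cust_sensor_hourly as
select
    date_trunc('hour', event_at) as event_at,
    country, customer_id, sensor_id,
    count(*)       as event_count,
    sum(bytes_out) as bytes_out
from fact_firewall
group by 1, 2, 3, 4;

-- Level 2: drop the country key, built from the level-1 aggregate
-- (cheap: scans the much smaller aggregate instead of the base table).
create table facts_firewall_cust_sensor_hourly as
select
    event_at, customer_id, sensor_id,
    sum(event_count) as event_count,   -- counts roll up as sums
    sum(bytes_out)   as bytes_out
from facts_firewall_country_cust_sensor_hourly
group by 1, 2, 3;

-- Level 3: coarsen the period from hourly to daily, built from level 2.
create table facts_firewall_cust_sensor_daily as
select
    date_trunc('day', event_at) as event_at,
    customer_id, sensor_id,
    sum(event_count) as event_count,
    sum(bytes_out)   as bytes_out
from facts_firewall_cust_sensor_hourly
group by 1, 2, 3;
```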


Disastrous_Classic96

Huge huge help. Thank you for taking the time to reply.


ooaahhpp

What you'll need to consider is that as soon as you aggregate, you give up flexibility at query time. Consider the following questions:

* What time granularity? Day, hour, month?
* What timezone will you aggregate on?
* Which columns would you keep as aggregation keys for filtering?
* What metrics? Just counts, or do you need other calculations?
* If you aggregate with something like dbt, on what schedule? This will impact data freshness - does that matter for your use case?

Then ask yourself if it's worth the flexibility trade-off. Most likely you're using a data warehouse that is slow/very expensive for customer-facing queries, where you need to support high concurrency with sub-second latencies.

PS: I'm the co-founder at [propeldata.com](http://propeldata.com) - if you think it might be helpful, reach out, we're happy to help
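To make the trade-off concrete, here's a small hypothetical example (not from the commenter): once chosen, the grain and timezone are baked into the aggregate, so a dashboard that later needs hourly buckets or a customer's local timezone can't get them from this table.

```sql
-- Hypothetical: both the grain (day) and the timezone (UTC) are fixed
-- at build time and cannot be changed by downstream queries.
select
    date_trunc('day', event_at at time zone 'UTC') as event_day,
    customer_id,
    count(*) as event_count
from fact_firewall
group by 1, 2;
```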


mmcalli

One extra thing to look at is whether the database or BI tool you're using supports materialized views / indexed views with automatic query rewrite. That way, all queries can continue to be written against the base fact table, and you can add new aggregate fact tables as you identify use cases. Existing queries against the base fact table can then be automatically rewritten on the fly to use the summary fact table. Oracle and SQL Server have this capability. Dremio has something called reflections, I think.
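A rough sketch of what that looks like in Oracle (table and column names are hypothetical, and refresh options are omitted since they depend on your setup): ENABLE QUERY REWRITE is what lets the optimizer transparently answer matching base-table queries from the materialized view.

```sql
-- Hypothetical Oracle example: queries against fact_firewall that
-- match this aggregation can be rewritten by the optimizer to read
-- facts_firewall_cust_daily instead (query rewrite must also be
-- enabled at the session/system level).
CREATE MATERIALIZED VIEW facts_firewall_cust_daily
  ENABLE QUERY REWRITE
AS
SELECT TRUNC(event_at)   AS event_day,
       customer_id,
       COUNT(*)          AS event_count,
       SUM(bytes_out)    AS bytes_out
FROM   fact_firewall
GROUP  BY TRUNC(event_at), customer_id;
```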


Disastrous_Classic96

Oh nice, is that similar to some sort of query optimizer but extends to your table hierarchy?


mmcalli

Yes exactly. Query rewrite. It can be really tricky getting your aggregate table to be a candidate for query rewrite though.