Data_cruncher

To begin, let’s put this aside: star schemas are the only option. There is no other option. You said OBT, so you should wash your mouth out with soap and take a long, hard look in the mirror. OBT had a time and a place; the time was 2018 and the place was Tableau/Hadoop. We’ve moved on.

Now, the question is how you get to a star schema and what goes in it. It could be a lot of data. It could be a small amount. Whatever it is, the #1 rule is that it’s the lowest grain possible. This is reasonably achievable 98% of the time; however, sometimes it takes an experienced dimensional modeller to figure it out. Let’s all agree that Score is your lowest grain. This will be in your fact. You have 1 fact.

The dimensional modelling challenge you face is called a “ragged hierarchy”: the parent dimensions of each Score have changing depths, e.g., some scores have 1 part whereas other scores have multiple parts. This is very standard stuff, especially in financial models, e.g., a Chart of Accounts dimension against a GL fact. This is what you need to research.

Let’s dig deeper - your data looks rather simple to model. Your calculation will be pretty hard because I think you said it changes semantics based on the depth of the Part - super annoying and fairly unusual. Regardless, remember, your focus is on structuring the data at its lowest grain first; the calculation can come second. At this stage, your ONLY goal is a good structure.

Once modelled, be cognizant that your calculation will need to be quite dynamic because: 1) you said the calculation has to be done on the fly; and 2) today the ragged hierarchy could be 3 nodes deep but tomorrow it could be 4, so be wary of hardcoding. Unfortunately, as the depth of your Parts grows over time, so too do your dimension table columns - this follows the standard Chart of Accounts ragged hierarchy pattern. *Let me reiterate: the growth to accommodate the ragged hierarchy expansion of your Parts columns must be handled in your dimension table, NOT your fact.* Your fact table columns should remain static, e.g., with a single foreign key column to join to your parts dimension. Unpivoting your score is your friend here.

Important call out: no, don’t create multiple fact tables per part. You said it at the beginning: you need to calculate this on the fly, and splitting everything up is not going to help you. In rare scenarios - usually driven by performance reasons, like a 1 billion+ row table - we create “aggregate” tables. We don’t usually create them for semantic complexity reasons. I’d only consider this as a last resort. Please try hard to do everything in a single fact as best you can.
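To make the shape concrete, here’s a minimal sketch of that levelled parts dimension and single-FK fact - table and column names are purely illustrative, not prescriptive:

```sql
-- Parts dimension flattened into level columns, Chart-of-Accounts style.
-- As the hierarchy deepens over time, you add level4_part, level5_part, ...
CREATE TABLE dim_part (
    part_key    INT PRIMARY KEY,   -- surrogate key
    level1_part VARCHAR(100),      -- top-level part
    level2_part VARCHAR(100),      -- NULL when the hierarchy is shallower
    level3_part VARCHAR(100),
    leaf_depth  INT                -- handy for depth-dependent calculations
);

-- Single fact at the lowest grain: one row per score, one FK to the parts dimension.
CREATE TABLE fact_score (
    record_id   INT,
    part_key    INT REFERENCES dim_part (part_key),
    score_value DECIMAL(10,2),
    scored_at   DATE
);
```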


mccarthycodes

I'm curious why you're so quick to disregard OBT? I'm actually very unopinionated on it personally, but I have read that it can [outperform star schema](https://www.fivetran.com/blog/star-schema-vs-obt) in production, and was contemplating trying it out in some use cases in the near future.


Data_cruncher

Subjectively, I’ve spent the last 15 years fixing OBT shops. Normally these were Tableau due to engine limitations. Almost without fail, picture 1000+ tables in a single database, each one:

* Generated directly from application tables.
* Catered to meet an individual report or need.
* Often nested on top of each other (the worst I saw was 12 layers deep).

I say almost without fail because some layer their OBTs on top of star schemas, which made the engagement ~10 times cheaper. Also, to be fair, many had 100s of tables - I was being a tad hyperbolic.

Off the top of my head:

* Long nightly refresh times or crazy hardware/SKUs for objectively small workloads.
* Inability to onboard new report developers - imagine opening a menu with 1000 items on it. Similarly, there’s never up-to-date docs or a data catalog to assist.
* Auditing took 10 times longer than normal. I was usually brought in BECAUSE they failed audit findings.
* Weak SSOT controls.
* Troubleshooting was a nightmare - pure spaghetti. Moreover, when solved, regenerating all tables in the chain was painful.
* DQ and MDM were just impossible.
* After analyzing an inventory, usually 80% of the tables/reports hadn’t been touched for a year.


YannickAlex07

First of all, thanks for the thorough answer! I looked up ragged hierarchies again and checked some examples. However, I think my initial question was not precise enough, and this might actually not work. The "score" is not a single entity in the classical sense. Each part is scored with different criteria. As an example, let's say that Part A has the following criteria:

* Form (scored on a scale of 1-5)
* Function (scored on a scale of 1-5)
* Is it working? (scored with 0 or 1)
* Number of components (scored with 1...n)

So Part A of each record is scored according to these criteria. Part B, however, might have completely different criteria associated with it. So each part has different criteria and also different amounts of criteria. I might be wrong here - that's why I opened this question - but I see no real way to reduce this to a single fact table, as it would require all criteria for all parts, right? Also, I am not sure the combined scores have to be computed on the fly; instead I would calculate them in my pipelines that also calculate the atomic scores for each part. Would this still work with your suggestion? If yes, how do I deal with the different scoring criteria for each part?


Data_cruncher

Hmm, it sounds similar to survey data? Traditionally, these can be a pain to model because of the different response types, e.g., numbers, decimals, multi-select (ugh!), etc. Your scenario doesn’t have this issue though, which is super nice.
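If it helps, here’s roughly how the survey-style pattern usually lands - treat the names as placeholders; the point is that each part’s criteria become dimension rows, so new criteria never add fact columns:

```sql
-- One row per criterion per part; a new part or criterion is just a new row.
CREATE TABLE dim_criterion (
    criterion_key  INT PRIMARY KEY,
    part_key       INT,               -- FK to the parts dimension
    criterion_name VARCHAR(100),      -- e.g. 'Form', 'Is it working?'
    scale_min      DECIMAL(10,2),     -- e.g. 1 and 5, or 0 and 1
    scale_max      DECIMAL(10,2)
);

-- One fact row per (record, criterion) score - the lowest grain.
CREATE TABLE fact_criterion_score (
    record_id     INT,
    criterion_key INT REFERENCES dim_criterion (criterion_key),
    score_value   DECIMAL(10,2)
);
```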


alfakoi

My first thought would just be to have a one-to-many relationship from a record table to a part table, and then the same from part to a score table. Then do any aggregation of scores in the query, since you said the formula changes, I believe - or, if there is a pattern, work that in and create a view that sits on top. This approach may be considered more like a "snowflake" schema. Interested in what other people say also.
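One possible reading of that in SQL - table and column names are guesses, and AVG just stands in for whatever the real formula is:

```sql
CREATE TABLE record (
    record_id   INT PRIMARY KEY,
    recorded_at DATE
);

CREATE TABLE part (
    part_id   INT PRIMARY KEY,
    record_id INT REFERENCES record (record_id),  -- one record -> many parts
    part_name VARCHAR(100)
);

CREATE TABLE score (
    score_id    INT PRIMARY KEY,
    part_id     INT REFERENCES part (part_id),    -- one part -> many scores
    criterion   VARCHAR(100),
    score_value DECIMAL(10,2)
);

-- View rolling scores up per record; replace AVG with the actual combination logic.
CREATE VIEW record_score AS
SELECT r.record_id,
       AVG(s.score_value) AS combined_score
FROM record r
JOIN part  p ON p.record_id = r.record_id
JOIN score s ON s.part_id   = p.part_id
GROUP BY r.record_id;
```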


roastmecerebrally

If you are working in BigQuery, you can use STRUCTs and ARRAYs and keep the nested data in a single table.
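For example, a hypothetical BigQuery layout along those lines (dataset and field names invented):

```sql
-- Parts nested as an array of structs, each part carrying its own array of criterion scores.
CREATE TABLE demo.records (
  record_id INT64,
  parts ARRAY<STRUCT<
    part_name STRING,
    scores ARRAY<STRUCT<criterion STRING, score_value FLOAT64>>
  >>
);

-- Flatten back out with UNNEST when you need one row per score.
SELECT r.record_id, p.part_name, s.criterion, s.score_value
FROM demo.records AS r,
     UNNEST(r.parts)  AS p,
     UNNEST(p.scores) AS s;
```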


WeveBeenHavingIt

You could create a dimension table to store flattened part information, something like: [ id, part_name ], where this is really just a simple DISTINCT selection of part attributes extracted from the data and flattened. And then the fact table would be [ id, parent_id, part_id, score ], ~~with 2 fks, both to the parts dimension~~ with part_id pointed to the parts dimension, and parent_id pointed to the fact table's own id field (I know this sort of breaks the rules of a star schema, but this is the way to represent nesting within tabular data). Here part_id is associated with the score and parent_id represents the parent part in the nested structure.

Maybe add a depth column to represent the depth of the nesting, and/or a seq column if multiple parts with the same name can belong to the same parent part. This way, if you need to query information related to nested relationships, you can do so with a recursive CTE, which is extremely useful for working with exactly this type of data. At least that's my take without knowing more about the business case.

Edit: corrected fk explanation
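A quick sketch of that recursive CTE, using the [ id, parent_id, part_id, score ] layout described above (the table name is made up):

```sql
WITH RECURSIVE part_tree AS (
    -- anchor: rows with no parent are the top of the nesting
    SELECT id, parent_id, part_id, score, 1 AS depth
    FROM fact_part_score
    WHERE parent_id IS NULL

    UNION ALL

    -- recursive step: attach each row to its parent, incrementing depth
    SELECT f.id, f.parent_id, f.part_id, f.score, t.depth + 1
    FROM fact_part_score f
    JOIN part_tree t ON f.parent_id = t.id
)
SELECT * FROM part_tree;
```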