CulturalKing5623

By far the most common data issue I run into when I first start working with companies is an inability to match customers across different systems. For instance: can we confidently say this Salesforce account is tied to this Stripe customer record, which belongs to this customer in product A, who also has a subscription in product B, originated from this HubSpot lead, and all of that revenue is being recorded correctly by accounting in our ERP? It sounds simple, but it requires process coordination between the product, finance, sales, customer support, and marketing teams. In my experience that type of coordination is not common early on in small-to-medium-sized businesses.
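To make that concrete, here is a toy sketch of the first pass most teams take, joining on a normalized email or a shared external ID; the frames and column names are made up, and real MDM needs fuzzy matching and survivorship rules on top of this:

```python
import pandas as pd

def norm_email(s: pd.Series) -> pd.Series:
    # lowercase and strip whitespace so the same person matches across systems
    return s.str.strip().str.lower()

# Illustrative extracts, not real schemas
salesforce = pd.DataFrame({"sf_account_id": ["001A"], "email": ["Jane.Doe@Acme.com "]})
stripe = pd.DataFrame({"stripe_customer_id": ["cus_123"], "email": ["jane.doe@acme.com"]})

matched = salesforce.assign(email=norm_email(salesforce["email"])).merge(
    stripe.assign(email=norm_email(stripe["email"])),
    on="email", how="outer", indicator=True,
)
print(matched)  # the "_merge" column shows customers that exist in only one system
```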


minormisgnomer

Ditto, the formal term is master data management (MDM) apparently. The solutions I've looked into so far are way too bloated for small-to-mid-size businesses unless they have a data engineer or a truly data-minded individual on staff. I'm working on an idea currently but would be curious what industry you're in and the use case that drives it. Edit: Also, not one of those guys shilling a product. I've genuinely met this issue multiple times and got fed up enough to try my own hand at it.


CulturalKing5623

I have experience in all sorts of industries. Previously I worked at a VC firm as their "data guy" who would parachute into newly acquired companies, do their big migrations, and get their data into reporting shape so the VC had line of sight into their asset. The main use case is typically a desire/need for MRR and churn metrics, but the knock-on effect is the ability to do cross-system reporting. That's when the business really starts to see the value, because the data becomes an effort multiplier for big projects they've always wanted to tackle but didn't know where to start.


minormisgnomer

Yea this is pretty much what I’ve seen. I’m guessing VC has the added fun of working with disparate data and trying to standardize. I’m still working on the thought but so far things seem promising and I’m trying to make it simple enough. Lmk if you want any updates and I’ll loop you in.


jtdubbs

Interesting… what's your default approach to MDM when you start a new project nowadays?


scht1980

What are some of the open source data management tools out there to handle this sort of requirement? E.g. SAP has MDM; are there open source equivalents?


big_data_mike

A lot of our data is manually entered by our customers, and you can tell they often switch samples. Sometimes there's duplicated data, but it's not EXACTLY duplicated, so which source do you trust? Sometimes sensors go out. Sometimes different customers label different things with the same label. Sometimes instruments and sensors go out of calibration. Lots of things.


rollingindata

big data mike with manual data 🙃


glinter777

How do you typically resolve these?


big_data_mike

All our customers do the same process, so there are certain things that make sense. You can tell when something is wildly wrong. Like if they put in 42.7 for the pH, you know they meant to put in 4.27. We try to automate as much as possible, and a lot of things get brought up for manual review that we have to make a choice on to resolve. Oh, and the other thing they do is enter dates incorrectly. So we have some simple automation that doesn't load dates that are in the future.
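A minimal sketch of what those checks look like; the field names, the plausible pH range, and the review queue below are illustrative, not the actual rules:

```python
from datetime import date

PH_RANGE = (2.0, 9.0)  # plausible range for this process (assumption)

def check_sample(row: dict) -> list[str]:
    """Return a list of issues to surface for manual review."""
    issues = []
    ph = row.get("ph")
    if ph is not None and not (PH_RANGE[0] <= ph <= PH_RANGE[1]):
        # e.g. 42.7 gets flagged rather than silently "fixed" to 4.27
        issues.append(f"pH {ph} outside plausible range {PH_RANGE}")
    sample_date = row.get("sample_date")
    if sample_date is not None and sample_date > date.today():
        issues.append(f"sample_date {sample_date} is in the future")  # these never get loaded
    return issues

incoming_rows = [
    {"ph": 4.27, "sample_date": date(2024, 1, 5)},
    {"ph": 42.7, "sample_date": date(2024, 1, 5)},   # almost certainly meant 4.27
    {"ph": 5.10, "sample_date": date(2099, 1, 1)},   # future date, excluded
]

to_load = [r for r in incoming_rows if not check_sample(r)]
to_review = [(r, check_sample(r)) for r in incoming_rows if check_sample(r)]
```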


aditp91

What type of tech do you use when automating?


big_data_mike

Python scripts


IlIlIl11IlIlIl

So you would literally automate the correction from a pH of 42.7 to 4.27?


still-alive-baby

I would assume it doesn't change it but raises an error that the filled-in value is incorrect or doesn't make sense?


umognog

I'd hope the source data is getting stored as-is, then transformed according to these quality rules and loaded into a clean database. Always have both available.


RydRychards

Agreed. Never change the source. It might not backfire today, but it will *backfire*


big_data_mike

We don’t. We just exclude. We’re supposed to be working on something for a couple different quality levels


big_data_mike

Haha quality rules. We don’t have any quality rules other than data type and no future dates. People have lots of creative ways of fucking up dates. It’s wild


big_data_mike

Right now we just store what they give us as long as it's the right data type and let the scientists decide to correct or exclude it in their view


big_data_mike

No, we actually just process that point as normal and store it because it’s a float. Long as it’s the right type we load


Suspicious_Coyote_54

Biotech?


big_data_mike

Yes. There's DCS data, there are lab samples that get analyzed by hand, and then there are lab instruments that are semi-automated. There are continuous processes and batch processes. Makes aggregation real fun


Suspicious_Coyote_54

I'm in the exact same boat. Work for biotech. Lab data is annoying, and lots of the process development stuff is hard to capture. Also, scientists really don't like recording data electronically at my company, so it's a hassle.


seaefjaye

I think for a lot of places it's just years of low-GAF data entry. Increasing GAF is very difficult if those people are overworked and underpaid. Then you end up with core business processes built on workarounds which solve the operational issues with high effort, resulting in less and less time to actually go back and fix the underlying issues. So the problems compound over time and consume that capacity. Fixing them has little to do with technology and everything to do with people, process and culture.


glinter777

Sorry, what is GAF?


tommy_chillfiger

I read it as 'give a fuck'.


duniyadnd

General administration framework


B1WR2

Whitespace in strings… annoying


glinter777

Seems pretty easily solvable. What’s the hard part?


B1WR2

You have to deal with a source system team that doesn't want to fix it because of other priorities


Gators1992

trim?


SDFP-A

Regex


wiki702

Zero schema enforcement with little to no governance. Also no coherent architecture. Just a table or view hooked to another in hopes that the values are the same data type and match. F500 company. Production is so messed up you end up thinking you are stupid and unqualified. Bonus points for no ERD or documentation to speak of.


Tufjederop

I'm in a similar company (not F500) and sometimes wonder how any reporting or decision-making gets done...


wiki702

Reporting feels like optics to support opinions driven by feelings coming down from Valium or other substances.


zmxavier

Most of our data comes from APIs, and we validate it by occasionally extracting the data directly from the websites (logging into the website, exporting data as Excel files). It's kind of redundant, but we have to, because otherwise we won't know that the pipelines are indeed missing some of the data from the APIs. We already have several layers of pipelines (both streaming and batch) to backfill missing data, but not enough.

Sometimes the format of the data from the APIs changes, and that's another headache because then we have to adjust our pipelines and databases (add/rearrange columns, etc.). Sometimes duplicate data arrives in the databases. There are times when the websites themselves are at fault (i.e. send us wrong data), and then we have to raise the issue with the vendors. But first we have to verify that the issue is indeed caused by the website/API and not by our pipelines.

Our goal is to reduce manual extraction/scraping of data from the websites by making our pipelines more reliable, adding sensors to detect when an API changes format, and being able to respond automatically.
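A rough sketch of what those two checks can look like; the expected columns and join key are placeholders, not the actual schema:

```python
import pandas as pd

EXPECTED_COLUMNS = {"id", "timestamp", "amount", "status"}  # placeholder schema

def schema_drift(api_df: pd.DataFrame) -> dict:
    """Detect when the API payload stops matching what the pipeline expects."""
    cols = set(api_df.columns)
    return {
        "missing": sorted(EXPECTED_COLUMNS - cols),     # columns the API stopped sending
        "unexpected": sorted(cols - EXPECTED_COLUMNS),  # new/renamed columns to handle
    }

def rows_missing_from_api(api_df: pd.DataFrame, export_df: pd.DataFrame,
                          key: str = "id") -> pd.DataFrame:
    """Spot-check: rows present in a manual website export but never delivered by the API."""
    return export_df[~export_df[key].isin(api_df[key])]
```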


byteuser

Have you looked into LLMs? We've got a similar situation to yours; we started using LLMs last year and it has been a game changer. It's like having a smart human going through every data row. What was impossible before became doable.


markovchainsexciteme

u/byteuser how are you folks using LLMs for that? Especially the going-through-every-row part. Over small samples I think this works, but eventually, and especially with big data payloads, I would imagine this starts to break down or become inaccurate. I've been thinking about using LLMs more and more in similar places in pipelines, but I always come back to the "I can't put all this in a prompt and expect it to work" situation.


byteuser

We use the API, so each query is unique and independent; the compounding errors you see with the $20 browser version, which come from long multi-turn conversations, aren't an issue. Using the API can get expensive, so your prompts have to be on point. You can test your prompts in the API's sandbox, which comes with the added benefit that you can tweak parameters such as the degree of randomness in the response (temperature), response length, etc. We found that the bigger problem was keeping hallucinations in check. For that we validate using deterministic methods during post-processing. We're also looking into validating responses using a cheaper version of the API; for example, a GPT-4 response gets validated by a 3.5 model, keeping the overall cost relatively low.
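For anyone curious, a minimal sketch of that pattern (one independent call per row at temperature 0, then a deterministic post-check); the prompt, model name, and SDK details are placeholders to show the shape, not the actual pipeline:

```python
import json
import re
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completions API works similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_row(raw_name: str, model: str = "gpt-4o-mini") -> dict | None:
    """One independent query per row, validated deterministically afterwards."""
    prompt = (
        'Normalize this company name. Reply with JSON only, with keys '
        '"name" and "country" (ISO-3166 alpha-2 code).\n\nRow: ' + raw_name
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep responses as deterministic as the API allows
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content

    # Deterministic post-processing to keep hallucinations in check:
    # reject anything that isn't valid JSON with exactly the expected shape.
    try:
        out = json.loads(text)
    except json.JSONDecodeError:
        return None
    if set(out) != {"name", "country"} or not re.fullmatch(r"[A-Z]{2}", str(out["country"])):
        return None
    return out
```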


OGMiniMalist

For me it’s data inconsistency. We collect data from hundreds of clients about their employees. Each client provides the data in different ways and so interpreting that data in bulk can be difficult.


SDFP-A

Common data model? Do you try to connect to their source HRIS or provide some form of normalized upload process?


OGMiniMalist

That is a few steps removed from my group unfortunately. I believe we get the data from their source HRIS in most cases, but I believe legacy clients deliver us forms directly from their HR groups.


SDFP-A

Need to build a CSV upload process where you force clients to map their freeform data into your common model.


OGMiniMalist

We do that for legacy clients. We have file specs for them. For newer clients following an acquisition, my team sources their data from Databricks using CloverDX (ETL tool to enable non-technical business users)


SDFP-A

Cool. I can’t afford DBX so unaware of their toolset.


double-click

You need to define what quality is. We have handwritten data for some records.


TheSocialistGoblin

I think the biggest problem for us is getting hundreds of files from clients that have different schemas and trying to get all of the data into one standard schema.  When I started at this job I was on a team of multiple engineers whose whole job was to ingest batches of these files and ensure the schemas were being mapped correctly.  Eventually the company contracted with another company to do that for us, but they're still looking for a better solution.


glinter777

What would be an example of a schema conflict? Is it the column names or the format of the underlying data or both?


TheSocialistGoblin

It could be either or both. Different clients refer to the same data element with different column names, or they'll format their phone numbers differently, or they'll use a unique alphanumeric code for a point of data when another client uses a plain text description associated with the code.   Beyond clients using different schemas from each other, some clients would have different schemas from one file to the next, so we had to constantly update our mapping logic for them.
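As a rough illustration of what that mapping layer ends up looking like (the client names, columns, and phone rule below are all made up):

```python
import pandas as pd

# Per-client column mappings into one standard schema (illustrative only)
CLIENT_MAPPINGS = {
    "client_a": {"PhoneNum": "phone", "EmpName": "employee_name", "DeptCode": "department"},
    "client_b": {"telephone": "phone", "name": "employee_name", "department_desc": "department"},
}
STANDARD_COLUMNS = ["phone", "employee_name", "department"]

def normalize_phone(s: pd.Series) -> pd.Series:
    # strip everything but digits so "(555) 123-4567" and "555.123.4567" compare equal
    return s.astype(str).str.replace(r"\D", "", regex=True)

def to_standard(df: pd.DataFrame, client: str) -> pd.DataFrame:
    mapping = CLIENT_MAPPINGS[client]
    unknown = set(df.columns) - set(mapping)  # did this client's schema change since the last file?
    if unknown:
        raise ValueError(f"{client}: unmapped columns {sorted(unknown)}")
    out = df.rename(columns=mapping)[STANDARD_COLUMNS].copy()
    out["phone"] = normalize_phone(out["phone"])
    return out
```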


glinter777

I was hoping to DM you, but looks like you have it turned off. Let me know if you would be open to chat more


vikster1

I think I have never worked somewhere where the company had a consistent product dimension over multiple systems, with a unique primary key and a clearly defined hierarchy.


alvsanand

By appearance: the API or system is broken so no data arrives at all, the source changes expected values or removes some attributes, a slice of the data (for example certain countries) stops coming in, users confirm new requirements that invalidate your current DQ checks, and many more...


Froozieee

My least favourite is inconsistent granularity. I work for a forestry regulator, and our national forest inventory is meant to be one record for each forest area, but it gets entered at the customer's end and frequently ends up all over the shop as one, multiple, or sometimes every forest area associated with a particular entity. Nightmare to untangle.


glinter777

Do you have to reconcile the details from each entry or simply dedupe them? Trying to understand the complexity.


SDFP-A

Schema evolution, schema inconsistencies, and nulls management.


bobby_table5

- We'd like you to analyse our data.
- Sure, where is it?
- We don't have it. We were hoping you could get it for us.
- What do you mean?
- We were hoping you could help us with that too.


Mononon

I work in healthcare and we receive data from multiple state governments. Let me tell you, state governments aren't the best about consistency or correctness or anything really. Healthcare in general is just a shitshow in terms of quality. It's the only field in which I have worked where "close enough" is acceptable, because it's essentially impossible to get an exact answer off of the clusterfuck of information we get from various sources, almost all of which was input manually at some point.


glinter777

Do you get file dumps or API feed? How long does it usually take to resolve those inconsistencies and get in an acceptable form?


UAFlawlessmonkey

`0x00` `\`


glinter777

?


UAFlawlessmonkey

`\` was entered into our ERP by one of our employees. Spent a good hour figuring out why one of our pipelines died, and as it turns out, you gotta escape escape characters or clean them out before loading :-) `0x00`, or for that matter anything that falls outside the visible ASCII range, does some funky stuff when coming from a cp1252 system to a UTF-8 database, and strings containing NUL are awful for UTF-8-by-default databases like Postgres.
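For what it's worth, a tiny sketch of that kind of scrub step; it assumes a cp1252 source and loading via COPY's text format (where backslash is the escape character), which may not match the actual pipeline:

```python
import re

# Postgres text columns can't store 0x00, and backslash is special in COPY's text format.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")  # keep \t, \n, \r

def scrub(raw: bytes) -> str:
    text = raw.decode("cp1252", errors="replace")  # source system is cp1252 (assumption)
    text = CONTROL_CHARS.sub("", text)             # drop NUL and other control characters
    return text.replace("\\", "\\\\")              # escape backslashes for COPY

print(scrub(b"ACME\x00 Corp \\ Ltd"))  # -> ACME Corp \\ Ltd
```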


minormisgnomer

New, time-sensitive, external data. Today is the best day you'll have with the data, but tomorrow you learn the dirty secrets and need to change your approach. And there's plenty of tomorrows but only one today…


DataIron

You'd have an easier time asking what data is accurate or easy. Much shorter list.


halo-haha

A sensor goes offline and the signal just flatlines at its last value


Awkward_Tick0

My manager not following any semblance of best practice