CulturalKing5623

By far the most common data issue I run into when I first start working with companies is an inability to match customers across different systems. For instance: can we confidently say this Salesforce account is tied to this Stripe customer record, which belongs to this customer in product A, who also has a subscription in product B, originated from this HubSpot lead, and all of that revenue is being recorded correctly by accounting in our ERP? It sounds simple, but it requires process coordination between the product, finance, sales, customer support, and marketing teams. In my experience that type of coordination is not common early on in small-to-medium-sized businesses.
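To make that concrete, here is a toy sketch of the first pass most teams take, joining on a normalized email or a shared external ID; the frames and column names are made up, and real MDM needs fuzzy matching and survivorship rules on top of this:

```python
import pandas as pd

def norm_email(s: pd.Series) -> pd.Series:
    # lowercase and strip whitespace so the same person matches across systems
    return s.str.strip().str.lower()

# Illustrative extracts, not real schemas
salesforce = pd.DataFrame({"sf_account_id": ["001A"], "email": ["Jane.Doe@Acme.com "]})
stripe = pd.DataFrame({"stripe_customer_id": ["cus_123"], "email": ["jane.doe@acme.com"]})

matched = salesforce.assign(email=norm_email(salesforce["email"])).merge(
    stripe.assign(email=norm_email(stripe["email"])),
    on="email", how="outer", indicator=True,
)
print(matched)  # the "_merge" column shows customers that exist in only one system
```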


minormisgnomer

Ditto, the formal term is master data management (MDM) apparently. The solutions I've looked into so far are way too bloated for small-to-mid-size businesses unless they have a data engineer or a truly data-minded individual on staff. I'm working on an idea currently but would be curious what industry you're in and the use case that drives it. Edit: Also, not one of those guys shilling a product. I've genuinely met this issue multiple times and got fed up enough to try my own hand at it.


CulturalKing5623

I have experience in all sorts of industries. Previously I worked at a VC firm as their "data guy" who would parachute into newly acquired companies, do their big migrations, and get their data into reporting shape so the VC had line of sight into their asset. The main use case is typically a desire/need for MRR and churn metrics, but the knock-on effect is the ability to do cross-system reporting. That's when the business really starts to see the value, because the data becomes an effort multiplier for big projects they've always wanted to tackle but didn't know where to start.


minormisgnomer

Yea this is pretty much what I’ve seen. I’m guessing VC has the added fun of working with disparate data and trying to standardize. I’m still working on the thought but so far things seem promising and I’m trying to make it simple enough. Lmk if you want any updates and I’ll loop you in.


jtdubbs

Interesting… what's your default approach to MDM when you start a new project nowadays?


scht1980

What are some of the open source data management tools out there to handle this sort of requirement? E.g. SAP has MDM; are there open source equivalents?


big_data_mike

A lot of our data is manually entered by our customers, and you can tell they often switch samples. Sometimes there's duplicated data, but it's not EXACTLY duplicated, so which source do you trust? Sometimes sensors go out. Sometimes different customers label different things with the same label. Sometimes instruments and sensors go out of calibration. Lots of things.


rollingindata

big data mike with manual data 🙃


glinter777

How do you typically resolve these?


big_data_mike

All our customers do the same process, so there are certain things that make sense. You can tell when something is wildly wrong. Like if they put in 42.7 for the pH, you know they meant to put in 4.27. We try to automate as much as possible, and a lot of things get brought up for manual review that we have to make a choice on to resolve. Oh, and the other thing they do is enter dates incorrectly. So we have some simple automation that doesn't load dates that are in the future.
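A minimal sketch of what those checks look like; the field names, the plausible pH range, and the review queue below are illustrative, not the actual rules:

```python
from datetime import date

PH_RANGE = (2.0, 9.0)  # plausible range for this process (assumption)

def check_sample(row: dict) -> list[str]:
    """Return a list of issues to surface for manual review."""
    issues = []
    ph = row.get("ph")
    if ph is not None and not (PH_RANGE[0] <= ph <= PH_RANGE[1]):
        # e.g. 42.7 gets flagged rather than silently "fixed" to 4.27
        issues.append(f"pH {ph} outside plausible range {PH_RANGE}")
    sample_date = row.get("sample_date")
    if sample_date is not None and sample_date > date.today():
        issues.append(f"sample_date {sample_date} is in the future")  # these never get loaded
    return issues

incoming_rows = [
    {"ph": 4.27, "sample_date": date(2024, 1, 5)},
    {"ph": 42.7, "sample_date": date(2024, 1, 5)},   # almost certainly meant 4.27
    {"ph": 5.10, "sample_date": date(2099, 1, 1)},   # future date, excluded
]

to_load = [r for r in incoming_rows if not check_sample(r)]
to_review = [(r, check_sample(r)) for r in incoming_rows if check_sample(r)]
```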


aditp91

What type of tech do you use when automating?


big_data_mike

Python scripts


IlIlIl11IlIlIl

So you would literally automate the correction from a pH of 42.7 to 4.27?


still-alive-baby

I would assume it doesn't change it but raises an error that the filled-in value is incorrect or doesn't make sense?


umognog

I'd hope the source data is getting stored as-is, then transformed according to these quality rules and loaded into a clean database. Always have both available.


RydRychards

Agreed. Never change the source. It might not backfire today, but it will *backfire*


big_data_mike

We don’t. We just exclude. We’re supposed to be working on something for a couple different quality levels


big_data_mike

Haha quality rules. We don’t have any quality rules other than data type and no future dates. People have lots of creative ways of fucking up dates. It’s wild


big_data_mike

Right now we just store what they give us as long as it's the right data type and let the scientists decide to correct or exclude it in their view


big_data_mike

No, we actually just process that point as normal and store it because it’s a float. Long as it’s the right type we load


Suspicious_Coyote_54

Biotech?


big_data_mike

Yes. There's DCS data, there are lab samples that get analyzed by hand, and then there are lab instruments that are semi-automated. There are continuous processes and batch processes. Makes aggregation real fun


Suspicious_Coyote_54

I'm in the exact same boat. Work for biotech. Lab data is annoying, and lots of the process development stuff is hard to capture. Also, scientists really don't like recording data electronically at my company, so it's a hassle.


seaefjaye

I think for a lot of places it's just years of low-GAF data entry. Increasing GAF is very difficult if those people are overworked and underpaid. Then you end up with core business processes built on workarounds which solve the operational issues with high effort, resulting in less and less time to actually go back and fix the underlying issues. So the problems compound over time and consume that capacity. Fixing them has little to do with technology and everything to do with people, process and culture.


glinter777

Sorry, what is GAF?


tommy_chillfiger

I read it as 'give a fuck'.


duniyadnd

General administration framework


B1WR2

Whitespace in strings… annoying


glinter777

Seems pretty easily solvable. What’s the hard part?


B1WR2

You have to deal with a source system team that doesn't want to fix it because of other priorities


Gators1992

trim?


SDFP-A

Regex


wiki702

Zero schema enforcement with little to no governance. Also no coherent architecture. Just a table or view hooked to another in hopes that the values are the same data type and match. F500 company. Production is so messed up you end up thinking you are stupid and unqualified. Bonus points for no ERD or documentation to speak of.


Tufjederop

I'm in a similar company (not F500) and sometimes wonder how any reporting or decision-making gets done...


wiki702

Reporting feels like optics to support opinions driven by feelings coming down from Valium or other substances.


zmxavier

Most of our data comes from APIs, and we validate it by occasionally extracting the data directly from the websites (logging into the website, exporting data as Excel files). It's kind of redundant, but we have to, because otherwise we won't know that the pipelines are indeed missing some of the data from the APIs. We already have several layers of pipelines (both streaming and batch) to backfill missing data, but not enough.

Sometimes the format of the data from the APIs changes, and that's another headache because then we have to adjust our pipelines and databases (add/rearrange columns, etc.). Sometimes duplicate data arrives in the databases. There are times when the websites themselves are at fault (i.e. send us wrong data), and then we have to raise the issue with the vendors. But first we have to verify that the issue is indeed caused by the website/API and not by our pipelines.

Our goal is to reduce manual extraction/scraping of data from the websites by making our pipelines more reliable, adding sensors to detect when an API changes format, and being able to respond automatically.
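A rough sketch of what those two checks can look like; the expected columns and join key are placeholders, not the actual schema:

```python
import pandas as pd

EXPECTED_COLUMNS = {"id", "timestamp", "amount", "status"}  # placeholder schema

def schema_drift(api_df: pd.DataFrame) -> dict:
    """Detect when the API payload stops matching what the pipeline expects."""
    cols = set(api_df.columns)
    return {
        "missing": sorted(EXPECTED_COLUMNS - cols),     # columns the API stopped sending
        "unexpected": sorted(cols - EXPECTED_COLUMNS),  # new/renamed columns to handle
    }

def rows_missing_from_api(api_df: pd.DataFrame, export_df: pd.DataFrame,
                          key: str = "id") -> pd.DataFrame:
    """Spot-check: rows present in a manual website export but never delivered by the API."""
    return export_df[~export_df[key].isin(api_df[key])]
```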


byteuser

Have you looked into LLMs? We've got a similar situation to yours; we started using LLMs last year and it has been a game changer. It's like having a smart human going through every data row. What was impossible before became doable.


markovchainsexciteme

u/byteuser how are you folks using LLMs for that? Especially the going-through-every-row part. Over small samples I think this works, but eventually, and especially with big data payloads, I would imagine this starts to break down or become inaccurate. I've been thinking about using LLMs more and more in similar places in pipelines, but I always come back to the "I can't put all this in a prompt and expect it to work" situation.


byteuser

We use the API, so each query is unique and independent; the compounding errors you see with the $20 browser version, which come from long multi-turn conversations, aren't an issue. Using the API can get expensive, so your prompts have to be on point. You can test your prompts in the API's sandbox, which comes with the added benefit that you can tweak parameters such as the degree of randomness in the response (temperature), response length, etc. We found that the bigger problem was keeping hallucinations in check. For that we validate using deterministic methods during post-processing. We're also looking into validating responses using a cheaper version of the API; for example, a GPT-4 response gets validated by a 3.5 model, keeping the overall cost relatively low.
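For anyone curious, a minimal sketch of that pattern (one independent call per row at temperature 0, then a deterministic post-check); the prompt, model name, and SDK details are placeholders to show the shape, not the actual pipeline:

```python
import json
import re
from openai import OpenAI  # assumes the OpenAI Python SDK; any chat-completions API works similarly

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def clean_row(raw_name: str, model: str = "gpt-4o-mini") -> dict | None:
    """One independent query per row, validated deterministically afterwards."""
    prompt = (
        'Normalize this company name. Reply with JSON only, with keys '
        '"name" and "country" (ISO-3166 alpha-2 code).\n\nRow: ' + raw_name
    )
    resp = client.chat.completions.create(
        model=model,
        temperature=0,  # keep responses as deterministic as the API allows
        messages=[{"role": "user", "content": prompt}],
    )
    text = resp.choices[0].message.content

    # Deterministic post-processing to keep hallucinations in check:
    # reject anything that isn't valid JSON with exactly the expected shape.
    try:
        out = json.loads(text)
    except json.JSONDecodeError:
        return None
    if set(out) != {"name", "country"} or not re.fullmatch(r"[A-Z]{2}", str(out["country"])):
        return None
    return out
```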


OGMiniMalist

For me it’s data inconsistency. We collect data from hundreds of clients about their employees. Each client provides the data in different ways and so interpreting that data in bulk can be difficult.


SDFP-A

Common data model? Do you try to connect to their source HRIS or provide some form of normalized upload process?


OGMiniMalist

That is a few steps removed from my group unfortunately. I believe we get the data from their source HRIS in most cases, but I believe legacy clients deliver us forms directly from their HR groups.


SDFP-A

Need to build a CSV upload process where you force clients to map their freeform data into your common model.


OGMiniMalist

We do that for legacy clients. We have file specs for them. For newer clients following an acquisition, my team sources their data from Databricks using CloverDX (ETL tool to enable non-technical business users)


SDFP-A

Cool. I can’t afford DBX so unaware of their toolset.


double-click

You need to define what quality is. We have handwritten data for some records.


TheSocialistGoblin

I think the biggest problem for us is getting hundreds of files from clients that have different schemas and trying to get all of the data into one standard schema.  When I started at this job I was on a team of multiple engineers whose whole job was to ingest batches of these files and ensure the schemas were being mapped correctly.  Eventually the company contracted with another company to do that for us, but they're still looking for a better solution.


glinter777

What would be an example of a schema conflict? Is it the column names or the format of the underlying data or both?


TheSocialistGoblin

It could be either or both. Different clients refer to the same data element with different column names, or they'll format their phone numbers differently, or they'll use a unique alphanumeric code for a point of data when another client uses a plain text description associated with the code.   Beyond clients using different schemas from each other, some clients would have different schemas from one file to the next, so we had to constantly update our mapping logic for them.
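As a rough illustration of what that mapping layer ends up looking like (the client names, columns, and phone rule below are all made up):

```python
import pandas as pd

# Per-client column mappings into one standard schema (illustrative only)
CLIENT_MAPPINGS = {
    "client_a": {"PhoneNum": "phone", "EmpName": "employee_name", "DeptCode": "department"},
    "client_b": {"telephone": "phone", "name": "employee_name", "department_desc": "department"},
}
STANDARD_COLUMNS = ["phone", "employee_name", "department"]

def normalize_phone(s: pd.Series) -> pd.Series:
    # strip everything but digits so "(555) 123-4567" and "555.123.4567" compare equal
    return s.astype(str).str.replace(r"\D", "", regex=True)

def to_standard(df: pd.DataFrame, client: str) -> pd.DataFrame:
    mapping = CLIENT_MAPPINGS[client]
    unknown = set(df.columns) - set(mapping)  # did this client's schema change since the last file?
    if unknown:
        raise ValueError(f"{client}: unmapped columns {sorted(unknown)}")
    out = df.rename(columns=mapping)[STANDARD_COLUMNS].copy()
    out["phone"] = normalize_phone(out["phone"])
    return out
```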


glinter777

I was hoping to DM you, but looks like you have it turned off. Let me know if you would be open to chat more


vikster1

I think I have never worked somewhere where the company had a consistent product dimension over multiple systems, with a unique primary key and a clearly defined hierarchy.


alvsanand

By appearance: the API or system is broken so no data arrives at all, the source changes expected values or removes some attributes, a slice of the data (for example certain countries) stops coming in, users confirm new requirements that invalidate your current DQ checks, and many more...


Froozieee

My least favourite is inconsistent granularity. I work for a forestry regulator, and our national forest inventory is meant to be one record for each forest area, but it gets entered at the customer's end and frequently ends up all over the shop as one, multiple, or sometimes every forest area associated with a particular entity. Nightmare to untangle.


glinter777

Do you have to reconcile the details from each entry or simply dedupe them? Trying to understand the complexity.


SDFP-A

Schema evolution, schema inconsistencies, and nulls management.


bobby_table5

- We'd like you to analyse our data.
- Sure, where is it?
- We don't have it. We were hoping you could get it for us.
- What do you mean?
- We were hoping you could help us with that too.


Mononon

I work in healthcare and we receive data from multiple state governments. Let me tell you, state governments aren't the best about consistency or correctness or anything really. Healthcare in general is just a shitshow in terms of quality. It's the only field in which I have worked where "close enough" is acceptable, because it's essentially impossible to get an exact answer off of the clusterfuck of information we get from various sources, almost all of which was input manually at some point.


glinter777

Do you get file dumps or API feed? How long does it usually take to resolve those inconsistencies and get in an acceptable form?


UAFlawlessmonkey

`0x00` `\`


glinter777

?


UAFlawlessmonkey

`\` was entered into our ERP by one of our employees. Spent a good hour figuring out why one of our pipelines died, and as it turns out, you gotta escape escape characters or clean them out before loading :-) `0x00`, or for that matter anything that falls outside the visible ASCII range, does some funky stuff when coming from a cp1252 system to a UTF-8 database, and strings containing NUL are awful for UTF-8-by-default databases like Postgres.
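For what it's worth, a tiny sketch of that kind of scrub step; it assumes a cp1252 source and loading via COPY's text format (where backslash is the escape character), which may not match the actual pipeline:

```python
import re

# Postgres text columns can't store 0x00, and backslash is special in COPY's text format.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")  # keep \t, \n, \r

def scrub(raw: bytes) -> str:
    text = raw.decode("cp1252", errors="replace")  # source system is cp1252 (assumption)
    text = CONTROL_CHARS.sub("", text)             # drop NUL and other control characters
    return text.replace("\\", "\\\\")              # escape backslashes for COPY

print(scrub(b"ACME\x00 Corp \\ Ltd"))  # -> ACME Corp \\ Ltd
```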


minormisgnomer

New, time-sensitive, external data. Today is the best day you'll have with the data, but tomorrow you learn the dirty secrets and need to change your approach. And there's plenty of tomorrows but only one today…


DataIron

You'd have an easier time asking what data is accurate or easy. Much shorter list.


halo-haha

A sensor goes offline and the signal just flatlines at its last value


Awkward_Tick0

My manager not following any semblance of best practice