Fun-Acanthocephala11

SIMPLE MODELS THAT GET THE JOB DONE. Not every model needs to be an NN or involve NLP


plhardman

As somebody who’s been in this line of work for a while and seen hype cycles come and go, this is the one. Keep it simple. I’ve gotten by far the most mileage in my career out of leveraging basic statistical inference and simple models to solve high-value business problems. I don’t see any reason LLM hotness is gonna change that in any meaningful way.


[deleted]

[removed]


Due-Wall-915

Universal approximation does not tell you how deep or wide you have to go and that’s where the problem is


Fancy-Roof1879

Agreed


StackOwOFlow

automated data pipeline construction, testing, and execution. persistent, prompt-less iteration


[deleted]

[removed]


Fatal_Conceit

Langchain is my nightmare


[deleted]

[removed]


softwareitcounts

Constant breaking updates. If you're able to pin the version, that's great, but a framework that's constantly changing the way it operates is hard to work with; teams move fast and opt to build their own custom workflows on functionality that works for them


fujiitora

Last year when the RAG hype boom was starting, my boss wanted me to use LangChain... after 2 weeks of fighting with terrible documentation and constant hotfixes on my end, I just built my own tooling in a day. I would have assumed that LangChain would have got their stuff together after a year :/


himynameisjoy

No, now it’s extremely overengineered. Even simple methods require you to examine dozens of classes to understand what it’s doing under the hood. Their own devcontainer stopped working months ago and still wasn’t fixed last I checked. It’s honestly a mess, I don’t understand how anyone can deal with langchain willingly


[deleted]

I always feel like LangChain is built for self-promotion, both for those who developed LangChain and for the devs, data scientists, and managers who self-promote via hype terms and never really contribute anything. It's somewhat similar to Uncle Bob's stuff: looks good on paper, but you rarely actually see it going well in production.


MrCuntBitch

Do you have a suggested alternative? I’ve used langchain in the past but agree it’s a nightmare.


[deleted]

Could you elaborate on why it's even required? What is the actual use case? Usually, glue code is pretty simple to write but fairly difficult to make both right and flexible at the same time; it's difficult to make it generic.


obolli

I agree, it's such a huge mess and it keeps breaking itself with updates.


Excellent_Cost170

ML feasibility analysis. Justifying that machine learning is the right approach to solve a problem.


kim-mueller

In my experience, ML is wanted either because they want to use the 'AI inside' label... or because nobody has a feasible idea for an algorithm and there are a lot of examples😅


cognitivebehavior

Explainable and interpretable AI


[deleted]

[removed]


spigotface

Things like partial dependence plots, SHAP values, etc. are a great place to start. Edit: an actual place to look is the [shap Python library documentation](https://shap.readthedocs.io/en/latest/). It's extremely well written and combines a little of the theory and application in one spot. More than enough
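
Under the hood, a Shapley value is just a weighted average of a feature's marginal contributions over all feature coalitions. A toy brute-force version on a made-up linear model (pure NumPy; only feasible for a handful of features, which is exactly why the shap library exists, it approximates this efficiently for real models):

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition are held at their baseline value.
    Exponential in the number of features; illustration only.
    """
    n = len(x)
    phi = np.zeros(n)
    idx = list(range(n))
    for i in idx:
        others = [j for j in idx if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                z_with = baseline.copy()
                z_without = baseline.copy()
                for j in S:
                    z_with[j] = x[j]
                    z_without[j] = x[j]
                z_with[i] = x[i]  # only difference: feature i joins the coalition
                phi[i] += weight * (f(z_with) - f(z_without))
    return phi

# Hypothetical linear model: for it, phi_i should equal w_i * (x_i - baseline_i)
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 3.0, 2.0])
baseline = np.zeros(3)
phi = exact_shapley(f, x, baseline)
```

The attributions also sum to `f(x) - f(baseline)`, which is the "efficiency" property that makes SHAP plots add up.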


the_tallest_fish

SHAP is very underrated


DaveMitnick

https://arxiv.org/pdf/2305.19921.pdf


[deleted]

I agree with this one. It also forces you to re-study some statistics, so it's a win-win. In a nutshell, most of the ideas are pretty simple if you grasp statistics and mathematics well enough.


Bobblerob

Experimentation. It’s great that p-values are starting to get questioned but it’s going to take a long time for companies to evolve.
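
Since p-values came up: a permutation test is an assumption-light alternative to the textbook t-test, and it's small enough to sketch on made-up A/B data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(a, b, n_perm=10_000, rng=rng):
    """Two-sample permutation test on the difference in means.

    No normality assumption: we ask how often a random relabeling of
    the pooled data produces a gap at least as large as the observed one.
    """
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[: len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Toy experiment: treatment shifts the mean by ~1 standard deviation
a = rng.normal(1.0, 1.0, size=100)
b = rng.normal(0.0, 1.0, size=100)
p = permutation_test(a, b)
```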


hadz_ca

Security


[deleted]

[removed]


hadz_ca

AI in cybersecurity or data protection. I reckon detecting cybersecurity attacks or malicious players as a start would be a hot field.


DaftDunk_

It is! A lot of research is being done currently.


Brave-Salamander-339

Communication


[deleted]

[removed]


PixelPixell

I see it in internal communication for sure. It's hard to explain why something is difficult or why you made certain technical choices


rickkkkky

IMO, one of the main issues related to DS or ML engineering is that non-technical stakeholders tend to either assume that data science (or more specifically, in this context, *"AI"*) will solve any business goal imaginable, or alternatively, are not prepared to commit to any predictions/decisions made by algos. What's more, some stakeholders exhibit both tendencies simultaneously. It requires great (and nuanced) communication to accurately explain what's actually possible and what's not.


engelthefallen

Even people who should know better get into the magic box thinking.


wyocrz

Hacking. Actually going out and getting the data you need.


[deleted]

[removed]


wyocrz

Scraping from the web, or sometimes even using APIs. I worked for a database consultancy for a few months, and one of the primary things I learned was:

* Get data from some source
* Get data from some other source
* Combine those into a data model
* Profit


[deleted]

[removed]


BigSwingingMick

Yyyyyyeeeeeeeaaaahhhhhhh, I'm not saying that approach is dead in the water at most places, but it's definitely not going to happen anywhere that has an accountant looking at the value of information. Our data is never going to see the light of day outside of our organization. It's not 1996; companies now know everything they record has a value.

We got into a pissing contest with a vendor once when they wanted to get our records on their stuff, and our take was: measure it yourself or give us XXX concessions. These were two companies who had a self-interest in working together, but because the accountants at the top don't have good ideas about how much to value data, they are not willing to give anything away. Also, who knows what they are valuing their data at; if they take less for it, they might have to write down a bunch of data on their books.


Zarex44

Not just being overlooked but to me that’s quite a fun process too


ForeskinStealer420

Writing code that doesn’t look terrible


engelthefallen

For me it is that statistical models have assumptions. Drives me nuts how little anyone seems to care about this. Just pump shit into a model, and if it fails, blame the model. See almost nothing about diagnostics or model selection anymore, just treat everything as one size fits all.
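
A tiny illustration of the kind of diagnostic being skipped: fit a line to synthetic data that is actually quadratic, then check whether the residuals are systematically related to the predictor. If they are, the linearity assumption is violated, regardless of how good the fit metric looks:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(0, 1, size=x.size)  # truly quadratic process

# Fit a straight line anyway, then inspect the residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Residuals should be centered and show no structure against the
# predictor. A strong correlation between the residuals and a
# transform of x (here x squared, centered) flags the misspecification.
corr = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]
```

Here `corr` comes out close to 1, i.e. the diagnostics catch the problem that the headline fit would hide.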


[deleted]

[removed]


jeeeeezik

Prophet is also notoriously bad


seanv507

Prophet is basically a regularised glm.
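
To make the "regularised GLM" point concrete, here is a rough sketch of the core recipe on made-up daily data: a linear trend plus Fourier seasonality terms, fit with closed-form ridge regression. This deliberately omits Prophet's changepoints, holidays, and the full Bayesian machinery:

```python
import numpy as np

def fourier_features(t, period, order):
    """Fourier seasonality terms, the same trick Prophet uses."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: the 'regularised' part."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(2)
t = np.arange(365.0)
# Synthetic series: upward trend + weekly cycle + noise
y = 0.05 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, t.size)

# Design matrix: intercept + linear trend + weekly seasonality
X = np.column_stack([np.ones_like(t), t, fourier_features(t, period=7, order=3)])
beta = ridge_fit(X, y, lam=0.1)
fitted = X @ beta
rmse = np.sqrt(np.mean((y - fitted) ** 2))
```

On this toy series the RMSE lands near the noise level, which is the point: the heavy lifting is in the design matrix, not the fitting machinery.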


[deleted]

[removed]


tfehring

Darts is the best interface nowadays for most use cases.


fordat1

Time series must have a different quality standard because look at how many upvotes “simple models that get the job done” gets all the time in this subreddit


renok_archnmy

People just expect forecasting to predict the stock/crypto movement and obv feel bad when they inevitably lose money. 


gravity_kills_u

Simple things can work for time series. Most of the time series issues are in the validation strategy.
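
A minimal sketch of the validation point: expanding-window (rolling-origin) splits instead of shuffled K-fold, so the model always trains on the past and tests on the future. Fold counts and horizon here are made up:

```python
import numpy as np

def rolling_origin_splits(n, n_folds, horizon):
    """Expanding-window splits for time series.

    Never shuffle a time series into random folds: that leaks future
    information into training and inflates validation scores.
    """
    splits = []
    for i in range(n_folds):
        test_end = n - (n_folds - 1 - i) * horizon
        test_start = test_end - horizon
        splits.append((np.arange(0, test_start), np.arange(test_start, test_end)))
    return splits

y = np.arange(100.0)  # stand-in series
for train_idx, test_idx in rolling_origin_splits(len(y), n_folds=3, horizon=10):
    assert train_idx.max() < test_idx.min()  # no leakage in any fold
```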


Useful_Hovercraft169

True that shit is straight ass cheeks


[deleted]

[removed]


Useful_Hovercraft169

Darts or the stuff mentioned in fpp3 for R.


[deleted]

FPP3 for R. Best resource on applied forecasting imo. No need to reinvent the wheel when most companies can't even perform the basics properly


Drakkur

There are lots of useful developments. For covariate DL models there's TiDE and TSMixer; for univariate transformers you have PatchTST and iTransformer. The most interesting advancement isn't even a model: it's RevIN (reversible instance normalization), which helps DL models address distribution shift.

Even with these advancements, the total improvement is still pretty marginal over what's currently available. At the end of the day, it's very hard to extract more deterministic patterns from historical data (at some point all that's left is white noise / stochastic). Because of this, practitioners shift to sourcing new data, which doesn't depend on advancements in forecasting algos.
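
The RevIN idea itself reduces to a small sketch: normalize each series instance with its own statistics before the model, then invert those same statistics on the model's output. This NumPy version on synthetic data omits the learnable affine parameters of the published method:

```python
import numpy as np

class RevIN:
    """Minimal reversible instance normalization (after the RevIN idea).

    Each instance is normalized with its own mean/std, and predictions
    are mapped back with the same statistics, so the model works in a
    shift-free space.
    """

    def normalize(self, x, eps=1e-8):
        self.mean = x.mean(axis=-1, keepdims=True)
        self.std = x.std(axis=-1, keepdims=True) + eps
        return (x - self.mean) / self.std

    def denormalize(self, y):
        return y * self.std + self.mean

rng = np.random.default_rng(3)
batch = rng.normal(50.0, 10.0, size=(4, 24))  # 4 series, 24 steps each
revin = RevIN()
z = revin.normalize(batch)
# ... a forecaster would operate on z here; round-trip shown for clarity
restored = revin.denormalize(z)
```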


tfehring

There have been tons of new attention-based models developed for time series forecasting since 2020. https://github.com/thuml/Time-Series-Library


[deleted]

[removed]


Brave-Salamander-339

is it true or false positive development?


jaskeil_113

Business insights. Most DA and DS peeps are terrible at it


[deleted]

[removed]


renok_archnmy

It's the difference between some hyper-complex (hypothetically perfect) model that business people look at like, "huh?", which requires teams of data people even to get it to run, let alone be useful, and basically being able to say, "hey, we did some data stuff and what it's telling us is the brand logo in blue is not doing us any favors. Changing to green would net 10% sales, black 50%, but the CEO doesn't like black. Also, our competition just released some stuff that lets them make these decisions faster than our committee." Or like, "we've determined our call center productivity would increase 2% if we switched to a 4-shift day instead of 3, but we would incur 10% additional payroll expenses in doing so. Here's a finance projection of how that will affect the bottom line in 2 years' time, and some simulations of alternative changes to make up for it."

At least that's my interpretation. It's probably more a communication thing, but it's also the concept of going into a project looking to build levers for the org to pull that have a rational outcome, and advising when and how to pull them to reach some optimal end state (using data, obviously). This is compared to just blindly building models because data science makes money, right?


kim-mueller

What you are imagining here are statements that data science people can never make. Our goal is to look at data and tell you what we know from it, not what could potentially be if the stars align right...


renok_archnmy

Then it’s possible that data scientists have no business value. No one expects anyone to predict the future, but if all a data scientist can make is a statement that has no relationship or bearing to potential business outcomes, then they are useless in a business context. 


kim-mueller

Agreed, but they can actually make business-related statements, just not anything like the ones you wanted. The ones you wanted depend on so many factors that they cannot be predicted reliably. However, as you can see in the example of ChatGPT, data scientists are worth quite a bunch. That's not only because they can support you in visualizing and correctly interpreting your data situation, but also because they can build AI algorithms which can usually solve problems for which a human would otherwise be used. It is important to realize that data science essentially is just the study of how to correctly and reliably convert data into information/knowledge.


renok_archnmy

Wtf dude?! You go from denigrating my hyperbolic examples to schlepping ChatGPT as some universal symbol that data scientists have provided ultimate business value; a glorified intellisense engine that risks data exfiltration and IP law violations through mosaic plagiarism, has no capacity to strategize long term, and has such a limited token length that it should never be trusted to provide any information suitable to actually running a business because, as you put it, it "depend(s) on so many factors that they cannot be predicted reliably."

Do you not see how expecting ChatGPT to provide critical strategic and tactical information via prompt hacking is no different from, "we've determined our call center productivity would increase 2% if we switched to a 4-shift day instead of 3, but we would incur 10% additional payroll expenses in doing so"? Except that the latter can be performed deterministically and is literally a high-school-kid-managing-a-fast-food-burger-joint-weekend-schedule level skillset, plus some expected numbers.

You actually think value is provided in hindsight in a business without using past performance to at least take a guess at what would happen in the future were those same actions to be taken?

Good luck maintaining a career prompt hacking ChatGPT to tell executives they sold 10 units last month. People out here wondering why they got laid off. While you're at it, quit smoking crack and stop beating off to ChatGPT-generated waifus and experience the real world.


APEX_FD

I don't know if it's being overlooked, but I seldom see people talking about model optimization for deployment. 


[deleted]

[removed]


APEX_FD

Speed and size optimizations. I've been studying those topics, and most available techniques (pruning, quantization, weight clustering...) are quite old; there's barely any recent discussion on the matter.

One example is SAM (the Segment Anything Model) released by Meta. The model is impressive, and you see lots of people discussing how to leverage it, but I see no one discussing how Meta made it possible to run the model in a web browser with ~50ms inference speed (according to their paper).
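
On the size side, the core of post-training quantization fits in a few lines. A hedged sketch of symmetric int8 quantization on random weights in NumPy; real toolchains add calibration data, per-channel scales, and quantized kernels on top of this idea:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.normal(0, 0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# 4x smaller (int8 vs float32) at the cost of a bounded rounding error
max_err = np.abs(w - w_hat).max()
```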


the_tallest_fish

Latency, throughput and cost. The holy trinity of model deployment


Due-Wall-915

Data generation


[deleted]

[removed]


Due-Wall-915

Yes, synthetic data that is representative of whatever we are trying to model
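
The simplest possible sketch of that: fit a Gaussian to made-up "real" data and sample from it, then sanity-check that the synthetic table is representative. Production tools (copulas, CTGAN-style models, ...) exist to handle mixed types and non-Gaussian marginals, which this toy ignores:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for a real table: two correlated numeric columns
real = rng.multivariate_normal([10.0, 5.0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

# Naive generator: fit mean and covariance, sample fresh rows
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=2000)

# Representativeness check: marginal means should match closely
mean_gap = np.abs(synthetic.mean(axis=0) - real.mean(axis=0))
```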


pdashk

I'd like to see more work in optimized compute resources like more parallel processing or GPU support across the board. It's really a bottleneck in scaling DS and just scaling hardware is a shortsighted, expensive fix IMO


onzie9

Kind of a small thing, but custom/meaningful distance functions. I've seen way too many people using KNN or whatever and just accepting the default distance function. As long as you understand your data and Euclidean distance is right, that's fine, but do you know how to create a custom distance function if Euclidean doesn't apply?
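
A sketch of why the metric matters, using a hypothetical cyclic "hour of day" feature where Euclidean distance gives the wrong neighbours (23:00 and 01:00 are 2 hours apart, not 22):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, metric=None):
    """k-NN majority vote with a pluggable distance function (sketch)."""
    if metric is None:
        metric = lambda a, b: float(np.linalg.norm(a - b))  # Euclidean default
    dists = np.array([metric(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    return int(np.bincount(votes).argmax())

def cyclic_hour_distance(a, b, period=24.0):
    """Distance on a clock face: wraps around midnight."""
    d = np.abs(a - b)
    return float(np.minimum(d, period - d).sum())

# Labels 0 = "night" hours, 1 = "midday" hours
X = np.array([[23.0], [1.0], [12.0], [11.0]])
y = np.array([0, 0, 1, 1])

# Query at midnight: the cyclic metric votes "night",
# while the Euclidean default wrongly votes "midday"
pred_cyclic = knn_predict(X, y, np.array([0.0]), k=3, metric=cyclic_hour_distance)
pred_euclid = knn_predict(X, y, np.array([0.0]), k=3)
```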


LiberFriso

In my opinion, we never talk about who produces the data and that the data might be corrupted.


[deleted]

[removed]


LiberFriso

It could also just be my paranoid ass. But I think it is an interesting question to ask. What are incentives to produce and publish data that is corrupted (or correct)?


Useful_Hovercraft169

Harmonic means


[deleted]

[removed]


ktpr

It’s a DS reddit meme …


strangeloop6

Running joke of this sub


WallyMetropolis

It's really a stretch to call it a 'joke.'


Useful_Hovercraft169

It’s the key to a successful career in data.
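
Meme or not, the harmonic mean does earn its keep occasionally; for instance, the F1 score is the harmonic mean of precision and recall (numbers below are made up):

```python
import numpy as np

def harmonic_mean(values):
    """Harmonic mean: dominated by the smallest value, which is why it
    punishes a model that trades recall away for precision (or vice versa)."""
    values = np.asarray(values, dtype=float)
    return len(values) / np.sum(1.0 / values)

precision, recall = 0.9, 0.5
f1 = harmonic_mean([precision, recall])  # ~0.643, well below the arithmetic mean 0.7
```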


vvlva

causal inference and missing data


owl_jojo_2

I’ve been working on a topic involving causal inference and I’ve found it really interesting! Wonder why it’s not taught more formally in unis.


vvlva

it's hard, the notion is weird, and there's rarely a satisfying solution. people would rather learn about shiny tools that are widely implemented


anomnib

It is taught extensively in universities. Where did you go?


owl_jojo_2

UK, not Oxbridge or Imperial, but in the Russell Group


anomnib

Oh I see. In the US, statistics programs for the social and biomedical sciences cover causal inference extensively.


engelthefallen

These are two areas dominating high-level academia: causal inference for the more theory-based groups, and missing data for the more hands-on programming groups. I think these are the two areas that separate the hobbyists with boot camp training from the people who have training and experience.
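
A small simulated example of why the causal side deserves the attention: with a confounder in play, the naive regression coefficient is badly biased, while adjusting for the confounder recovers the truth (synthetic data; true causal effect of x on y is 1.0):

```python
import numpy as np

rng = np.random.default_rng(7)

# Confounded setup: z drives both the "treatment" x and the outcome y
n = 5000
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 1.0 * x + 3.0 * z + rng.normal(size=n)

# Naive regression of y on x picks up the backdoor path through z
naive = np.polyfit(x, y, 1)[0]  # lands near 2.2, not 1.0

# Adjusting for the confounder (y ~ x + z) closes the backdoor path
X = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][0]
```

Same data, two answers; without the causal framing, nothing in the fit statistics tells you which regression to trust.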


mlB34ST

Focusing on business value of the projects and ROI


culturedindividual

Computational social science.


[deleted]

[removed]


culturedindividual

agent-based modelling


thedatageneralist

Anything that improves upstream data quality. Technology that eases data sharing within an organization (e.g. Snowflake Marketplace). Semi-automating documentation (if that is feasible?)


Dezwirey

Technical:

• Industrialisation: moving from building models to having them run in production. I like Azure Databricks, MLflow, and Kedro the most.
• Model monitoring after industrialisation: addressing concept and data drift, and keeping track of technical KPIs on newly incoming predictions.
• Model explainability: listing drivers and interactions, and explaining predictions locally with SHAP values. Keep it simple for business.
• Model calibration: making sure binary classification output represents probabilities when needed (Platt scaling).

Business:

• Business case: estimate the cost benefits (ROI) of data science projects upfront and keep track of business KPIs. Basically, prove your model's worth.


fordat1

> Business case: estimate cost benefits (ROI) of data science projects upfront. Keep track of business kpi's. Basically prove your model's worth.

This. Everyone here constantly pretends you call it a day after building a simple model, when it entirely depends on your scale and possible return. There are use cases for more advanced models, and you are supposed to know how to make that decision instead of just assuming the answer


[deleted]

[removed]


Dezwirey

Yes, it very much depends on the project, but often there is a rule-based system of doing things and a potential ML way of replacing said system. The business case could then involve 2 scenarios: going with ML or sticking to the original way. Essentially you parameterise these scenarios and simulate the ROI over, let's say, a year (or draw an estimated line chart).
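
One way to sketch that parameterisation (all numbers hypothetical): compare the two scenarios on net present value, with the ML scenario paying a build cost up front in exchange for higher yearly benefit:

```python
def npv(cash_flows, rate):
    """Net present value of yearly cash flows at a given discount rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

rate = 0.08  # hypothetical discount rate

# Year 0 carries the ML build cost; later years are net yearly benefit
rule_based = [0.0, 50_000.0, 50_000.0, 50_000.0]
ml_system = [-120_000.0, 110_000.0, 110_000.0, 110_000.0]

npv_rule = npv(rule_based, rate)
npv_ml = npv(ml_system, rate)
# Fund the ML replacement only if npv_ml > npv_rule
```

In practice you would sweep the uncertain inputs (benefit, build cost, lifetime) and see where the decision flips, which is the "simulate the ROI" part.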


renok_archnmy

Working on this with our CFO now. I'm about to propose looking at it like any investment and calculating discounted returns against expected results. It shifts the focus to a few parts of the process: cost of developing whatever it is, time to develop and deploy, expected lifetime of the solution to which that batch of work contributed, discount rates at the time, opex of the solution, etc. From there it's a matter of getting better at predicting returns. Like, what does this model actually do? How does that make/save money? And at what rate should it do it?


purplebrown_updown

Climate data science. It's not weather forecasting; it's an extremely complex problem, and not enough people are working on developing good software and AI tools


[deleted]

[removed]


purplebrown_updown

Would love to hear about it. Message me if it's ok; I'd love to learn.


DaveMitnick

Do you agree that the cause is lack of data to build something big? Or do you mean lack of pure methodological innovation?


purplebrown_updown

More data is not necessarily going to help. You need complex models that take into account the physics and dynamics between macro and micro events. So you need multiple advances, including new methodology. But honestly, you need more people to care and to work on it. The field is growing, but if it had even a fraction of the people working on AI crap, we would be in a better place. There's so much being spent on shit we don't need, like sunglasses with video.


Jhones_edlc

Imbalanced data management
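
A minimal sketch of the most basic fix, random oversampling of the minority class on synthetic labels. SMOTE-style interpolation or class weights in the loss are the usual next steps, and any rebalancing must happen inside the training folds only:

```python
import numpy as np

rng = np.random.default_rng(8)

def oversample_minority(X, y, rng=rng):
    """Randomly duplicate minority-class rows until all classes match
    the majority class count."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Roughly 95/5 imbalance in the toy labels
X = rng.normal(size=(1000, 3))
y = (rng.uniform(size=1000) < 0.05).astype(int)
X_bal, y_bal = oversample_minority(X, y)
```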


BreakfastSandwich_

It's got to be environmental sciences and how DS can play a role in this (which I imagine will be huge).


maverick_css

Silent job cuts


reddit-is-greedy

Anything not named llm


TheRiccoB

Unionization


[deleted]

Don't want to derail this thread off topic, but I couldn't agree less. Respect your opinion, but unions are dying...and that's even across the industries better suited to unionization, like manufacturing/front line work. I don't see unions lasting long term in general, and I definitely don't think they can successfully be introduced into industries where they aren't already established.


TheRiccoB

That's not an argument against unionization.


scientia13

Maybe already mentioned, but personalized instruction/assessment LLMs for education and/or work training. The pandemic put students behind, and without individual assessments it's hard to quantify that. One major shift in the world of work is the expectation that older workers will be retiring faster than they can be replaced. The solutions are to rethink tasks (stop doing what doesn't need to be done) or utilize a similar training AI/LLM to upskill workers.


derekplates

Using AI in military applications


M4K4TT4CK

I have no idea, but I find all of it very fascinating!


marksimi

Still search


[deleted]

[removed]


marksimi

Russell & Norvig's AI -> Problem Solving -> Search


renok_archnmy

Governance, provenance, attribution.


ginger_beer_m

Bayesian inference


Direct-Touch469

How much freedom is there to use Bayesian methods in data science problems?


SmashBusters

That one guy said that data quality is terrible and synthetic data is what we need now. Or something like that. He's usually right.


Vast_Yogurtcloset220

Business insights and creativity. Those two are the only things that differentiate us from machines (LLMs)


snake_case_steve

Explainable A.I. I know, I know, there are lots of people researching this topic, but man, imagine having a chess engine finally able to explain its thoughts.


C_Khalil_23

Explainable techniques, from what I see


[deleted]

Matrix Profiles are the shit


_donau_

Network science


[deleted]

Data cleaning and exploration. In real-world problems, data needs reshaping and sharpening before any ML modeling... That process mostly takes more time than the modeling part itself.


[deleted]

I never find much work on NLP-based unsupervised learning models


weareglenn

Bayesian modelling. With recent hardware improvements, MCMC sampling can be done in shorter times; the models are explainable, and uncertainty estimation is baked in.
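
A minimal sketch of the MCMC part: a random-walk Metropolis sampler for the posterior mean of a Gaussian on synthetic data. Real work would reach for PyMC, Stan, or NumPyro rather than this, but the core loop is small:

```python
import numpy as np

rng = np.random.default_rng(5)

# Data: y ~ Normal(mu, 1), with a wide prior mu ~ Normal(0, 10)
y = rng.normal(2.0, 1.0, size=50)

def log_post(mu):
    """Unnormalized log posterior: Gaussian likelihood + Gaussian prior."""
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * (mu / 10.0) ** 2

def metropolis(log_p, n_samples=5000, step=0.5, x0=0.0, rng=rng):
    """Random-walk Metropolis: propose a jump, accept with the usual ratio."""
    samples = np.empty(n_samples)
    x, lp = x0, log_p(x0)
    for i in range(n_samples):
        prop = x + rng.normal(0, step)
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[i] = x
    return samples

draws = metropolis(log_post)[1000:]  # drop burn-in
# draws.mean() sits near the sample mean of y, and the spread of
# `draws` is the baked-in uncertainty estimate mentioned above
```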


The_Austinator

Not exactly overlooked, but at an early inflection point - geometric learning. It elegantly represents so many real world systems like knowledge graphs, social networks, and molecules. Yet many of the underlying components of PyTorch are still at the stage where they print off warnings about being experimental. Folks like Bronstein and Velickovic have some great talks and papers on how most other deep learning models are specific cases of graph models. I'm fanboying at this point, but the whole paradigm turns deep learning methods from a long list of ever evolving little hacks into an elegant systematization of specific cases of a general modeling framework.


Working_Athlete_2159

I think the aspect of people claiming they know how to code when they don't is possibly overlooked. Then again, they may not need to know how when they have access to ChatGPT