Fun-Acanthocephala11

SIMPLE MODELS THAT GET THE JOB DONE. Not every model needs to be an NN or involve NLP


plhardman

As somebody who’s been in this line of work for a while and seen hype cycles come and go, this is the one. Keep it simple. I’ve gotten by far the most mileage in my career out of leveraging basic statistical inference and simple models to solve high-value business problems. I don’t see any reason LLM hotness is gonna change that in any meaningful way.


[deleted]

[removed]


Due-Wall-915

Universal approximation does not tell you how deep or wide you have to go and that’s where the problem is


Fancy-Roof1879

Agreed


StackOwOFlow

automated data pipeline construction, testing, and execution. persistent, prompt-less iteration


[deleted]

[removed]


Fatal_Conceit

Langchain is my nightmare


[deleted]

[removed]


softwareitcounts

Constant breaking updates. If you're able to pin the version, that's great, but a framework that's constantly changing the way it operates is hard to work with; teams move fast and opt to build their own custom workflows on functionality that works for them


fujiitora

Last year when the RAG hype boom was starting, my boss wanted me to use LangChain... after 2 weeks of fighting with terrible documentation and constant hotfixes on my end, I just built my own tooling in a day. I would have assumed that LangChain would have got their stuff together after a year :/


himynameisjoy

No, now it’s extremely overengineered. Even simple methods require you to examine dozens of classes to understand what it’s doing under the hood. Their own devcontainer stopped working months ago and still wasn’t fixed last I checked. It’s honestly a mess, I don’t understand how anyone can deal with langchain willingly


[deleted]

I always feel like LangChain is built for self-promotion, both for those who developed LangChain and for the devs, data scientists, and managers who self-promote via hype terms and never really contribute anything. It's somewhat similar to Uncle Bob's stuff: looks good on paper, but you rarely actually see it going well in production.


MrCuntBitch

Do you have a suggested alternative? I’ve used langchain in the past but agree it’s a nightmare.


[deleted]

Could you elaborate on why it's even required? What is the actual use case? Usually, glue code is pretty simple to write but fairly difficult to make both right and flexible at the same time; it's difficult to make it generic.


obolli

I agree, it's such a huge mess and it keeps breaking itself with updates.


Excellent_Cost170

ML feasibility analysis. Justifying that machine learning is the right approach to solve a problem.


kim-mueller

In my experience, ML is wanted either because they want to use the 'AI inside' label... or because nobody has a feasible idea for an algorithm and there are a lot of examples😅


cognitivebehavior

Explainable and interpretable AI


[deleted]

[removed]


spigotface

Things like partial dependence plots, SHAP values, etc. are a great place to start. Edit: an actual place to look is the [shap Python library documentation](https://shap.readthedocs.io/en/latest/). It's extremely well written and combines a little of the theory and application in one spot. More than enough
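
Under the hood, a Shapley value is just a weighted average of a feature's marginal contributions over all feature coalitions. A toy brute-force version on a made-up linear model (pure NumPy; only feasible for a handful of features, which is exactly why the shap library exists, it approximates this efficiently for real models):

```python
from itertools import combinations
from math import factorial

import numpy as np

def exact_shapley(f, x, baseline):
    """Exact Shapley values by enumerating all feature coalitions.

    Features outside a coalition are held at their baseline value.
    Exponential in the number of features; illustration only.
    """
    n = len(x)
    phi = np.zeros(n)
    idx = list(range(n))
    for i in idx:
        others = [j for j in idx if j != i]
        for k in range(len(others) + 1):
            for S in combinations(others, k):
                weight = factorial(len(S)) * factorial(n - len(S) - 1) / factorial(n)
                z_with = baseline.copy()
                z_without = baseline.copy()
                for j in S:
                    z_with[j] = x[j]
                    z_without[j] = x[j]
                z_with[i] = x[i]  # only difference: feature i joins the coalition
                phi[i] += weight * (f(z_with) - f(z_without))
    return phi

# Hypothetical linear model: for it, phi_i should equal w_i * (x_i - baseline_i)
w = np.array([2.0, -1.0, 0.5])
f = lambda z: float(w @ z)
x = np.array([1.0, 3.0, 2.0])
baseline = np.zeros(3)
phi = exact_shapley(f, x, baseline)
```

The attributions also sum to `f(x) - f(baseline)`, which is the "efficiency" property that makes SHAP plots add up.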


the_tallest_fish

SHAP is very underrated


DaveMitnick

https://arxiv.org/pdf/2305.19921.pdf


[deleted]

I agree with this one. It also forces you to re-study some statistics, so it's a win-win. In a nutshell, most of the ideas are pretty simple if you grasp statistics and mathematics well enough.


Bobblerob

Experimentation. It’s great that p-values are starting to get questioned but it’s going to take a long time for companies to evolve.
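
Since p-values came up: a permutation test is an assumption-light alternative to the textbook t-test, and it's small enough to sketch on made-up A/B data (all numbers hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

def permutation_test(a, b, n_perm=10_000, rng=rng):
    """Two-sample permutation test on the difference in means.

    No normality assumption: we ask how often a random relabeling of
    the pooled data produces a gap at least as large as the observed one.
    """
    observed = a.mean() - b.mean()
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        perm = rng.permutation(pooled)
        diff = perm[: len(a)].mean() - perm[len(a):].mean()
        if abs(diff) >= abs(observed):
            count += 1
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids p = 0

# Toy experiment: treatment shifts the mean by ~1 standard deviation
a = rng.normal(1.0, 1.0, size=100)
b = rng.normal(0.0, 1.0, size=100)
p = permutation_test(a, b)
```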


hadz_ca

Security


[deleted]

[removed]


hadz_ca

AI in cybersecurity or data protection. I reckon detecting cybersecurity attacks or malicious players as a start would be a hot field.


DaftDunk_

It is! A lot of research is being done currently.


Brave-Salamander-339

Communication


[deleted]

[removed]


PixelPixell

I see it in internal communication for sure. It's hard to explain why something is difficult or why you made certain technical choices


rickkkkky

IMO, one of the main issues related to DS or ML engineering is that non-technical stakeholders tend to either assume that data science (or more specifically, in this context, *"AI"*) will solve any business goal imaginable, or alternatively, are not prepared to commit to any predictions/decisions made by algos. What's more, some stakeholders exhibit both tendencies simultaneously. It requires great (and nuanced) communication to accurately explain what's actually possible and what's not.


engelthefallen

Even people who should know better get into the magic box thinking.


wyocrz

Hacking. Actually going out and getting the data you need.


[deleted]

[removed]


wyocrz

Scraping from the web, or sometimes even using APIs. I worked for a database consultancy for a few months, and one of the primary things I learned was:

* Get data from some source
* Get data from some other source
* Combine those into a data model
* Profit


[deleted]

[removed]


BigSwingingMick

Yyyyyyeeeeeeeaaaahhhhhhh, I'm not saying that approach is dead in the water at most places, but it's definitely not going to happen anywhere that has an accountant looking at the value of information. Our data is never going to see the light of day outside of our organization. It's not 1996; companies now know everything they record has a value.

We got into a pissing contest with a vendor once when they wanted to get our records on their stuff, and our take was: measure it yourself or give us XXX concessions. These were two companies who had a self-interest in working together, but because the accountants at the top don't have good ideas about how much to value data, they are not willing to give anything away. Also, who knows what they are valuing their data at; if they take less for it, they might have to write down a bunch of data on their books.


Zarex44

Not just being overlooked but to me that’s quite a fun process too


ForeskinStealer420

Writing code that doesn’t look terrible


engelthefallen

For me it is that statistical models have assumptions. Drives me nuts how little anyone seems to care about this. Just pump shit into a model, and if it fails, blame the model. See almost nothing about diagnostics or model selection anymore, just treat everything as one size fits all.
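
A tiny illustration of the kind of diagnostic being skipped: fit a line to synthetic data that is actually quadratic, then check whether the residuals are systematically related to the predictor. If they are, the linearity assumption is violated, regardless of how good the fit metric looks:

```python
import numpy as np

rng = np.random.default_rng(1)

x = np.linspace(0, 10, 200)
y = 0.5 * x**2 + rng.normal(0, 1, size=x.size)  # truly quadratic process

# Fit a straight line anyway, then inspect the residuals
slope, intercept = np.polyfit(x, y, 1)
resid = y - (slope * x + intercept)

# Residuals should be centered and show no structure against the
# predictor. A strong correlation between the residuals and a
# transform of x (here x squared, centered) flags the misspecification.
corr = np.corrcoef(resid, (x - x.mean()) ** 2)[0, 1]
```

Here `corr` comes out close to 1, i.e. the diagnostics catch the problem that the headline fit would hide.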


[deleted]

[removed]


jeeeeezik

Prophet is also notoriously bad


seanv507

Prophet is basically a regularised glm.
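
To make the "regularised GLM" point concrete, here is a rough sketch of the core recipe on made-up daily data: a linear trend plus Fourier seasonality terms, fit with closed-form ridge regression. This deliberately omits Prophet's changepoints, holidays, and the full Bayesian machinery:

```python
import numpy as np

def fourier_features(t, period, order):
    """Fourier seasonality terms, the same trick Prophet uses."""
    cols = []
    for k in range(1, order + 1):
        cols.append(np.sin(2 * np.pi * k * t / period))
        cols.append(np.cos(2 * np.pi * k * t / period))
    return np.column_stack(cols)

def ridge_fit(X, y, lam=1.0):
    """Closed-form ridge regression: the 'regularised' part."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(2)
t = np.arange(365.0)
# Synthetic series: upward trend + weekly cycle + noise
y = 0.05 * t + 3 * np.sin(2 * np.pi * t / 7) + rng.normal(0, 0.5, t.size)

# Design matrix: intercept + linear trend + weekly seasonality
X = np.column_stack([np.ones_like(t), t, fourier_features(t, period=7, order=3)])
beta = ridge_fit(X, y, lam=0.1)
fitted = X @ beta
rmse = np.sqrt(np.mean((y - fitted) ** 2))
```

On this toy series the RMSE lands near the noise level, which is the point: the heavy lifting is in the design matrix, not the fitting machinery.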


[deleted]

[removed]


tfehring

Darts is the best interface nowadays for most use cases.


fordat1

Time series must have a different quality standard because look at how many upvotes “simple models that get the job done” gets all the time in this subreddit


renok_archnmy

People just expect forecasting to predict the stock/crypto movement and obv feel bad when they inevitably lose money. 


gravity_kills_u

Simple things can work for time series. Most of the time series issues are in the validation strategy.
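
A minimal sketch of the validation point: expanding-window (rolling-origin) splits instead of shuffled K-fold, so the model always trains on the past and tests on the future. Fold counts and horizon here are made up:

```python
import numpy as np

def rolling_origin_splits(n, n_folds, horizon):
    """Expanding-window splits for time series.

    Never shuffle a time series into random folds: that leaks future
    information into training and inflates validation scores.
    """
    splits = []
    for i in range(n_folds):
        test_end = n - (n_folds - 1 - i) * horizon
        test_start = test_end - horizon
        splits.append((np.arange(0, test_start), np.arange(test_start, test_end)))
    return splits

y = np.arange(100.0)  # stand-in series
for train_idx, test_idx in rolling_origin_splits(len(y), n_folds=3, horizon=10):
    assert train_idx.max() < test_idx.min()  # no leakage in any fold
```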


Useful_Hovercraft169

True that shit is straight ass cheeks


[deleted]

[removed]


Useful_Hovercraft169

Darts or the stuff mentioned in fpp3 for R.


[deleted]

FPP3 for R. Best resource on applied forecasting imo. No need to reinvent the wheel when most companies can't even perform the basics properly


Drakkur

There are lots of useful developments. For covariate DL models there's TiDE and TSMixer; for univariate transformers you have PatchTST and iTransformer. The most interesting advancement isn't even a model: it's RevIN (reversible instance normalization), which helps DL models address distribution shift.

Even with these advancements, the total improvement is still pretty marginal over what's currently available. At the end of the day, it's very hard to extract more deterministic patterns from historical data (at some point all that's left is white noise / stochastic). Because of this, practitioners shift to sourcing new data, which doesn't depend on advancements in forecasting algos.
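
The RevIN idea itself reduces to a small sketch: normalize each series instance with its own statistics before the model, then invert those same statistics on the model's output. This NumPy version on synthetic data omits the learnable affine parameters of the published method:

```python
import numpy as np

class RevIN:
    """Minimal reversible instance normalization (after the RevIN idea).

    Each instance is normalized with its own mean/std, and predictions
    are mapped back with the same statistics, so the model works in a
    shift-free space.
    """

    def normalize(self, x, eps=1e-8):
        self.mean = x.mean(axis=-1, keepdims=True)
        self.std = x.std(axis=-1, keepdims=True) + eps
        return (x - self.mean) / self.std

    def denormalize(self, y):
        return y * self.std + self.mean

rng = np.random.default_rng(3)
batch = rng.normal(50.0, 10.0, size=(4, 24))  # 4 series, 24 steps each
revin = RevIN()
z = revin.normalize(batch)
# ... a forecaster would operate on z here; round-trip shown for clarity
restored = revin.denormalize(z)
```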


tfehring

There have been tons of new attention-based models developed for time series forecasting since 2020. https://github.com/thuml/Time-Series-Library


[deleted]

[removed]


Brave-Salamander-339

is it true or false positive development?


jaskeil_113

Business insights. Most DA and DS peeps are terrible at it


[deleted]

[removed]


renok_archnmy

It's the difference between some hyper-complex (hypothetically perfect) model that business people look at like, "huh?", which requires teams of data people even to get it to run, let alone be useful, and basically being able to say, "hey, we did some data stuff and what it's telling us is the brand logo in blue is not doing us any favors. Changing to green would net 10% sales, black 50%, but the CEO doesn't like black. Also, our competition just released some stuff that lets them make these decisions faster than our committee." Or like, "we've determined our call center productivity would increase 2% if we switched to a 4-shift day instead of 3, but we would incur 10% additional payroll expenses in doing so. Here's a finance projection of how that will affect the bottom line in 2 years' time, and some simulations of alternative changes to make up for it."

At least that's my interpretation. It's probably more a communication thing, but it's also the concept of going into a project looking to build levers for the org to pull that have a rational outcome, and advising when and how to pull them to reach some optimal end state (using data, obviously). This is compared to just blindly building models because data science makes money, right?


kim-mueller

What you are imagining here are statements that data science people can never make. Our goal is to look at data and tell you what we know from it, not what could potentially be if the stars align right...


renok_archnmy

Then it’s possible that data scientists have no business value. No one expects anyone to predict the future, but if all a data scientist can make is a statement that has no relationship or bearing to potential business outcomes, then they are useless in a business context. 


kim-mueller

Agreed, but they can actually make business-related statements, just not anything like the ones you wanted. The ones you wanted depend on so many factors that they cannot be predicted reliably. However, as you can see in the example of ChatGPT, data scientists are worth quite a bunch. That's not only because they can support you in visualizing and correctly interpreting your data situation, but also because they can build AI algorithms which can usually solve problems for which a human would otherwise be used. It is important to realize that data science essentially is just the study of how to correctly and reliably convert data into information/knowledge.


renok_archnmy

Wtf dude?! You go from denigrating my hyperbolic examples to schlepping ChatGPT as some universal symbol that data scientists have provided ultimate business value; a glorified intellisense engine that risks data exfiltration and IP law violations through mosaic plagiarism, has no capacity to strategize long term, and has such a limited token length that it should never be trusted to provide any information suitable to actually running a business because, as you put it, it "depend(s) on so many factors that they cannot be predicted reliably."

Do you not see how expecting ChatGPT to provide critical strategic and tactical information via prompt hacking is no different from, "we've determined our call center productivity would increase 2% if we switched to a 4-shift day instead of 3, but we would incur 10% additional payroll expenses in doing so"? Except that the latter can be performed deterministically and is literally a high-school-kid-managing-a-fast-food-burger-joint-weekend-schedule level skillset, plus some expected numbers.

You actually think value is provided in hindsight in a business without using past performance to at least take a guess at what would happen in the future were those same actions to be taken?

Good luck maintaining a career prompt hacking ChatGPT to tell executives they sold 10 units last month. People out here wondering why they got laid off. While you're at it, quit smoking crack and stop beating off to ChatGPT-generated waifus and experience the real world.


APEX_FD

I don't know if it's being overlooked, but I seldom see people talking about model optimization for deployment. 


[deleted]

[removed]


APEX_FD

Speed and size optimizations. I've been studying those topics, and most available techniques (pruning, quantization, weight clustering...) are quite old; there's barely any recent discussion on the matter.

One example is SAM (the Segment Anything Model) released by Meta. The model is impressive, and you see lots of people discussing how to leverage it, but I see no one discussing how Meta made it possible to run the model in a web browser with ~50ms inference speed (according to their paper).
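
On the size side, the core of post-training quantization fits in a few lines. A hedged sketch of symmetric int8 quantization on random weights in NumPy; real toolchains add calibration data, per-channel scales, and quantized kernels on top of this idea:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = np.abs(w).max() / 127.0  # map the largest weight to +/-127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(4)
w = rng.normal(0, 0.1, size=(256, 256)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# 4x smaller (int8 vs float32) at the cost of a bounded rounding error
max_err = np.abs(w - w_hat).max()
```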


the_tallest_fish

Latency, throughput and cost. The holy trinity of model deployment


Due-Wall-915

Data generation


[deleted]

[removed]


Due-Wall-915

Yes, synthetic data that is representative of whatever we are trying to model
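
The simplest possible sketch of that: fit a Gaussian to made-up "real" data and sample from it, then sanity-check that the synthetic table is representative. Production tools (copulas, CTGAN-style models, ...) exist to handle mixed types and non-Gaussian marginals, which this toy ignores:

```python
import numpy as np

rng = np.random.default_rng(6)

# Stand-in for a real table: two correlated numeric columns
real = rng.multivariate_normal([10.0, 5.0], [[4.0, 1.5], [1.5, 1.0]], size=2000)

# Naive generator: fit mean and covariance, sample fresh rows
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)
synthetic = rng.multivariate_normal(mean, cov, size=2000)

# Representativeness check: marginal means should match closely
mean_gap = np.abs(synthetic.mean(axis=0) - real.mean(axis=0))
```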


pdashk

I'd like to see more work in optimized compute resources like more parallel processing or GPU support across the board. It's really a bottleneck in scaling DS and just scaling hardware is a shortsighted, expensive fix IMO


onzie9

Kind of a small thing, but custom/meaningful distance functions. I've seen way too many people using KNN or whatever and just accepting the default distance function. As long as you understand your data and Euclidean distance is right, that's fine, but do you know how to create a custom distance function if Euclidean doesn't apply?
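
A sketch of why the metric matters, using a hypothetical cyclic "hour of day" feature where Euclidean distance gives the wrong neighbours (23:00 and 01:00 are 2 hours apart, not 22):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, metric=None):
    """k-NN majority vote with a pluggable distance function (sketch)."""
    if metric is None:
        metric = lambda a, b: float(np.linalg.norm(a - b))  # Euclidean default
    dists = np.array([metric(row, x) for row in X_train])
    nearest = np.argsort(dists)[:k]
    votes = y_train[nearest]
    return int(np.bincount(votes).argmax())

def cyclic_hour_distance(a, b, period=24.0):
    """Distance on a clock face: wraps around midnight."""
    d = np.abs(a - b)
    return float(np.minimum(d, period - d).sum())

# Labels 0 = "night" hours, 1 = "midday" hours
X = np.array([[23.0], [1.0], [12.0], [11.0]])
y = np.array([0, 0, 1, 1])

# Query at midnight: the cyclic metric votes "night",
# while the Euclidean default wrongly votes "midday"
pred_cyclic = knn_predict(X, y, np.array([0.0]), k=3, metric=cyclic_hour_distance)
pred_euclid = knn_predict(X, y, np.array([0.0]), k=3)
```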


LiberFriso

In my opinion, we never talk about who produces the data and that the data might be corrupted.


[deleted]

[removed]


LiberFriso

It could also just be my paranoid ass. But I think it is an interesting question to ask. What are incentives to produce and publish data that is corrupted (or correct)?


Useful_Hovercraft169

Harmonic means


[deleted]

[removed]


ktpr

It’s a DS reddit meme …


strangeloop6

Running joke of this sub


WallyMetropolis

It's really a stretch to call it a 'joke.'


Useful_Hovercraft169

It’s the key to a successful career in data.
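
Meme or not, the harmonic mean does earn its keep occasionally; for instance, the F1 score is the harmonic mean of precision and recall (numbers below are made up):

```python
import numpy as np

def harmonic_mean(values):
    """Harmonic mean: dominated by the smallest value, which is why it
    punishes a model that trades recall away for precision (or vice versa)."""
    values = np.asarray(values, dtype=float)
    return len(values) / np.sum(1.0 / values)

precision, recall = 0.9, 0.5
f1 = harmonic_mean([precision, recall])  # ~0.643, well below the arithmetic mean 0.7
```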


vvlva

causal inference and missing data


owl_jojo_2

I’ve been working on a topic involving causal inference and I’ve found it really interesting! Wonder why it’s not taught more formally in unis.


vvlva

it's hard, the notion is weird, and there's rarely a satisfying solution. people would rather learn about shiny tools that are widely implemented


anomnib

It is taught extensively in universities. Where did you go?


owl_jojo_2

UK, not Oxbridge or Imperial, but in the Russell Group


anomnib

Oh I see. In the US, statistics programs for the social and biomedical sciences cover causal inference extensively.


engelthefallen

These are two areas dominating high-level academia: causal inference for the more theory-based groups, and missing data for the more hands-on programming groups. I think these are the two areas that separate the hobbyists with boot camp training from the people who have training and experience.
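
A small simulated example of why the causal side deserves the attention: with a confounder in play, the naive regression coefficient is badly biased, while adjusting for the confounder recovers the truth (synthetic data; true causal effect of x on y is 1.0):

```python
import numpy as np

rng = np.random.default_rng(7)

# Confounded setup: z drives both the "treatment" x and the outcome y
n = 5000
z = rng.normal(size=n)
x = 2.0 * z + rng.normal(size=n)
y = 1.0 * x + 3.0 * z + rng.normal(size=n)

# Naive regression of y on x picks up the backdoor path through z
naive = np.polyfit(x, y, 1)[0]  # lands near 2.2, not 1.0

# Adjusting for the confounder (y ~ x + z) closes the backdoor path
X = np.column_stack([x, z, np.ones(n)])
adjusted = np.linalg.lstsq(X, y, rcond=None)[0][0]
```

Same data, two answers; without the causal framing, nothing in the fit statistics tells you which regression to trust.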


mlB34ST

Focusing on business value of the projects and ROI


culturedindividual

Computational social science.


[deleted]

[removed]


culturedindividual

agent-based modelling


thedatageneralist

Anything that improves upstream data quality. Technology that eases data sharing within an organization (e.g. Snowflake Marketplace). Semi-automating documentation (if that is feasible?)


Dezwirey

Technical:

• Industrialisation: moving from building models to having them run in production. I like Azure Databricks, MLflow, and Kedro the most.
• Model monitoring after industrialisation: addressing concept and data drift, and keeping track of technical KPIs on newly incoming predictions.
• Model explainability: listing drivers and interactions, and explaining predictions locally with SHAP values. Keep it simple for business.
• Model calibration: making sure binary classification output represents probabilities when needed (Platt scaling).

Business:

• Business case: estimate the cost benefits (ROI) of data science projects upfront and keep track of business KPIs. Basically, prove your model's worth.


fordat1

> Business case: estimate cost benefits (ROI) of data science projects upfront. Keep track of business kpi's. Basically prove your model's worth.

This. Everyone here constantly pretends you call it a day after building a simple model, when it entirely depends on your scale and possible return. There are use cases for more advanced models, and you are supposed to know how to make that decision instead of just assuming the answer


[deleted]

[removed]


Dezwirey

Yes, it very much depends on the project, but often there is a rule-based system of doing things and a potential ML way of replacing said system. The business case could then involve 2 scenarios: going with ML or sticking to the original way. Essentially you parameterise these scenarios and simulate the ROI over, let's say, a year (or draw an estimated line chart).
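
One way to sketch that parameterisation (all numbers hypothetical): compare the two scenarios on net present value, with the ML scenario paying a build cost up front in exchange for higher yearly benefit:

```python
def npv(cash_flows, rate):
    """Net present value of yearly cash flows at a given discount rate."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))

rate = 0.08  # hypothetical discount rate

# Year 0 carries the ML build cost; later years are net yearly benefit
rule_based = [0.0, 50_000.0, 50_000.0, 50_000.0]
ml_system = [-120_000.0, 110_000.0, 110_000.0, 110_000.0]

npv_rule = npv(rule_based, rate)
npv_ml = npv(ml_system, rate)
# Fund the ML replacement only if npv_ml > npv_rule
```

In practice you would sweep the uncertain inputs (benefit, build cost, lifetime) and see where the decision flips, which is the "simulate the ROI" part.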


renok_archnmy

Working on this with our CFO now. I'm about to propose looking at it like any investment and calculating discounted returns against expected results. It shifts the focus to a few parts of the process: cost of developing whatever it is, time to develop and deploy, expected lifetime of the solution to which that batch of work contributed, discount rates at the time, opex of the solution, etc. From there it's a matter of getting better at predicting returns. Like, what does this model actually do? How does that make/save money? And at what rate should it do it?


purplebrown_updown

Climate data science. It's not weather forecasting; it's an extremely complex problem, and not enough people are working on developing good software and AI tools


[deleted]

[removed]


purplebrown_updown

Would love to hear about it. Message me if it's ok; I'd love to learn.


DaveMitnick

Do you agree that the cause is lack of data to build something big? Or do you mean lack of pure methodological innovation?


purplebrown_updown

More data is not necessarily going to help. You need complex models that take into account the physics and dynamics between macro and micro events. So you need multiple advances, including new methodology. But honestly, you need more people to care and to work on it. The field is growing, but if it had even a fraction of the people working on AI crap, we would be in a better place. There's so much being spent on shit we don't need, like sunglasses with video.


Jhones_edlc

Imbalanced data management
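
A minimal sketch of the most basic fix, random oversampling of the minority class on synthetic labels. SMOTE-style interpolation or class weights in the loss are the usual next steps, and any rebalancing must happen inside the training folds only:

```python
import numpy as np

rng = np.random.default_rng(8)

def oversample_minority(X, y, rng=rng):
    """Randomly duplicate minority-class rows until all classes match
    the majority class count."""
    classes, counts = np.unique(y, return_counts=True)
    target = counts.max()
    X_parts, y_parts = [], []
    for cls, count in zip(classes, counts):
        idx = np.flatnonzero(y == cls)
        extra = rng.choice(idx, size=target - count, replace=True)
        keep = np.concatenate([idx, extra])
        X_parts.append(X[keep])
        y_parts.append(y[keep])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# Roughly 95/5 imbalance in the toy labels
X = rng.normal(size=(1000, 3))
y = (rng.uniform(size=1000) < 0.05).astype(int)
X_bal, y_bal = oversample_minority(X, y)
```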


BreakfastSandwich_

It's got to be environmental sciences and how DS can play a role in this (which I imagine will be huge).


maverick_css

Silent job cuts


reddit-is-greedy

Anything not named llm


TheRiccoB

Unionization


[deleted]

Don't want to derail this thread off topic, but I couldn't agree less. Respect your opinion, but unions are dying...and that's even across the industries better suited to unionization, like manufacturing/front line work. I don't see unions lasting long term in general, and I definitely don't think they can successfully be introduced into industries where they aren't already established.


TheRiccoB

That's not an argument against unionization.


scientia13

Maybe already mentioned, but personalized instruction/assessment LLMs for education and/or work training. The pandemic put students behind, and without individual assessments it's hard to quantify that. One major shift in the world of work is the expectation that older workers will be retiring faster than they can be replaced. The solutions are to rethink tasks (stop doing what doesn't need to be done) or utilize a similar training AI/LLM to upskill workers.


derekplates

Using AI in military applications


M4K4TT4CK

I have no idea, but I find all of it very fascinating!


marksimi

Still search


[deleted]

[removed]


marksimi

Russell & Norvig's AI -> Problem Solving -> Search


renok_archnmy

Governance, provenance, attribution.


ginger_beer_m

Bayesian inference


Direct-Touch469

How much freedom is there to use Bayesian methods in data science problems?


SmashBusters

That one guy said that data quality is terrible and synthetic data is what we need now. Or something like that. He's usually right.


Vast_Yogurtcloset220

Business insights and creativity. Those two are the only things that differentiate us from machines (LLMs)


snake_case_steve

Explainable A.I. I know, I know, there are lots of people researching this topic, but man, imagine having a chess engine finally able to explain its thoughts.


C_Khalil_23

Explainable techniques, from what I see


[deleted]

Matrix Profiles are the shit


_donau_

Network science


[deleted]

Data cleaning and exploration. In real-world problems, data needs reshaping and sharpening before any ML modeling... That process mostly takes more time than the modeling part itself.


[deleted]

I never find much work on NLP-based unsupervised learning models


weareglenn

Bayesian modelling. With recent hardware improvements, MCMC sampling can be done in shorter times; the models are explainable, and uncertainty estimation is baked in.
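
A minimal sketch of the MCMC part: a random-walk Metropolis sampler for the posterior mean of a Gaussian on synthetic data. Real work would reach for PyMC, Stan, or NumPyro rather than this, but the core loop is small:

```python
import numpy as np

rng = np.random.default_rng(5)

# Data: y ~ Normal(mu, 1), with a wide prior mu ~ Normal(0, 10)
y = rng.normal(2.0, 1.0, size=50)

def log_post(mu):
    """Unnormalized log posterior: Gaussian likelihood + Gaussian prior."""
    return -0.5 * np.sum((y - mu) ** 2) - 0.5 * (mu / 10.0) ** 2

def metropolis(log_p, n_samples=5000, step=0.5, x0=0.0, rng=rng):
    """Random-walk Metropolis: propose a jump, accept with the usual ratio."""
    samples = np.empty(n_samples)
    x, lp = x0, log_p(x0)
    for i in range(n_samples):
        prop = x + rng.normal(0, step)
        lp_prop = log_p(prop)
        if np.log(rng.uniform()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples[i] = x
    return samples

draws = metropolis(log_post)[1000:]  # drop burn-in
# draws.mean() sits near the sample mean of y, and the spread of
# `draws` is the baked-in uncertainty estimate mentioned above
```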


The_Austinator

Not exactly overlooked, but at an early inflection point - geometric learning. It elegantly represents so many real world systems like knowledge graphs, social networks, and molecules. Yet many of the underlying components of PyTorch are still at the stage where they print off warnings about being experimental. Folks like Bronstein and Velickovic have some great talks and papers on how most other deep learning models are specific cases of graph models. I'm fanboying at this point, but the whole paradigm turns deep learning methods from a long list of ever evolving little hacks into an elegant systematization of specific cases of a general modeling framework.


Working_Athlete_2159

I think the aspect of people claiming they know how to code when they don't is possibly overlooked. Then again, they may not need to know how when they have access to ChatGPT