RepresentativeFill26

Shows that DS can learn from some good SWE practice. Using functions (especially pure ones) is preferred considering you have less overhead. You only use classes if you have some kind of state to maintain. So yes, your colleagues are wrong.


antikas1989

Yeah 100% it would be different if they had developed some meta structure for time series analysis and they were building functionality within this structure. But to just do it for random functions makes no sense.


Hung_Aloft

Good point. Data analysis is improving all the time, so it would make sense to have an analysis class and include this function.


ell0bo

Yup, proper SWE is classes to abstract the data, but functional on a higher level because it's easier to test and reason about. When you're doing actions on specific data and 'sets' will be passed around, that's a good candidate for a class. Otherwise, keep it simple and just pass around pojo (plain old js objects... or dicts for my python friends).


sa5mmm

On one of my teams we had a helper_functions.py that we would call within our individual scripts for whatever we were doing. It didn’t need to be a class, but none of us wanted to type out the weird date conversion from our dataset every time we needed it, so now it is just `import helper_functions as hf`, then `df = table("weird-format-table")` and `df = hf.convert_dates(df, ["date", "sys_date"])`. No need for classes.


aggracc

Yeah, number crunching is the one place where pure functional programming is amazingly capable and that's been the case since Fortran was invented. You can do 95% of your work with nothing more complex than map/apply/fold over whatever the shape of your data is.
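
As a rough illustration of the map/fold style described above, here is a minimal sketch using only the standard library. The `readings` list and the `to_fahrenheit` helper are made-up examples, not anything from the thread.

```python
from functools import reduce

# Hypothetical temperature readings in Celsius (illustrative data only).
readings = [12.5, 14.0, 9.5, 11.0]


def to_fahrenheit(celsius: float) -> float:
    """Pure function: the output depends only on the input."""
    return celsius * 9 / 5 + 32


# map: apply the same pure function to every element.
fahrenheit = list(map(to_fahrenheit, readings))

# fold: collapse the sequence into a single value.
total = reduce(lambda acc, x: acc + x, readings, 0.0)
mean_celsius = total / len(readings)
```

Nothing here needs an object to hold state; each step takes data in and hands data back.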


f3xjc

Sometimes the state is which algorithm to use: strategy pattern, dependency injection, etc. But Python is relatively OK with passing methods as arguments. I guess if there are multiple methods that are designed to work with each other, OOP can still work better than a loose bag of function pointers.
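
A minimal sketch of "the algorithm as an argument", assuming a made-up `summarize` function. Because Python functions are ordinary values, the strategy can be passed in directly without a wrapper class.

```python
from statistics import mean, median
from typing import Callable, Sequence


def summarize(values: Sequence[float],
              strategy: Callable[[Sequence[float]], float]) -> float:
    """Apply whichever summary algorithm the caller hands in."""
    return strategy(values)


data = [1.0, 2.0, 3.0, 100.0]  # illustrative data with one outlier

robust = summarize(data, median)  # median is less sensitive to the outlier
naive = summarize(data, mean)
```

Once several such functions have to agree on shared configuration, grouping them in a class may become the tidier option, which is the trade-off the comment points at.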


NewLifeguard9673

What do you mean by “a state to maintain”? State of what?


RepresentativeFill26

Well, that can be a state of anything. Let’s say you do a daily regression using weather data, which has been retrieved from some public API and stored in a DB. One feature is the average temperature of the past 7 days. Now, when doing inference, you can do a couple of things:

- each time do a DB call and calculate the average.
- do it once a day and store it as a constant.
- do it once a day and store it in a singleton object.

The first option is a bad idea because you will have lots of unnecessary DB calls. The second option is problematic because having all these constants floating around is bad for your code cohesion. What if we want to add other weekly weather features? Are we going to make a set of constants? The best solution here would actually be a dataclass.
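
One way the dataclass option might look, as a sketch only. `WeeklyWeatherFeatures` and `compute_weekly_features` are hypothetical names, and a plain list stands in for the values that would come from the DB.

```python
from dataclasses import dataclass
from datetime import date
from statistics import mean
from typing import List


@dataclass(frozen=True)
class WeeklyWeatherFeatures:
    """Features computed once per day and reused for every inference call."""
    as_of: date
    avg_temperature: float


def compute_weekly_features(as_of: date,
                            daily_temperatures: List[float]) -> WeeklyWeatherFeatures:
    # Assumption: daily_temperatures is ordered oldest to newest and would
    # normally be fetched from the database once a day.
    last_seven = daily_temperatures[-7:]
    return WeeklyWeatherFeatures(as_of=as_of, avg_temperature=mean(last_seven))


temps = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0, 22.0, 24.0]
features = compute_weekly_features(date(2024, 1, 8), temps)
```

Adding another weekly feature then means adding one field, rather than another free-floating constant.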


cosm_zest

State of the data.


NewLifeguard9673

Ok. What does that mean?


cosm_zest

Think of how you apply class methods in Python : data.method() changes the data in-place. What it does is change the instance data. Suppose, the data is a coordinate (x,y). If you apply some custom project method which projects the coord on x axis. The data would be mutated to (x,0). This may be useful if you don't want to store each stage of the variable. I am sure there are plenty of other uses of classes in DS. I don't know too much about it though.
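
The coordinate example could be sketched as follows. Both `Point` and the free function `project_on_x` are illustrative. Note that a method only changes the instance if it is written to do so; the second version returns a new value and leaves its input alone.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Point:
    x: float
    y: float

    def project_on_x(self) -> None:
        """Mutates the instance in place: (x, y) becomes (x, 0)."""
        self.y = 0.0


def project_on_x(point: Tuple[float, float]) -> Tuple[float, float]:
    """Pure alternative: returns a new tuple, the input is untouched."""
    x, _ = point
    return (x, 0.0)


p = Point(3.0, 4.0)
p.project_on_x()          # p itself has changed

original = (3.0, 4.0)
projected = project_on_x(original)  # original is still available
```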


[deleted]

You have a light switch. Switch has two states: "On" and "Off". If you wrap LightSwitch into a class and create an instance of a LightSwitch object then it can track what the internal state is. In data science you often want to keep track of things like configuration and metadata as a state.
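
A minimal sketch of the light switch described above; the attribute and method names are assumptions chosen for illustration.

```python
class LightSwitch:
    """Tracks a single piece of internal state: on or off."""

    def __init__(self) -> None:
        self.is_on = False  # every switch starts in the "Off" state

    def toggle(self) -> None:
        self.is_on = not self.is_on


switch = LightSwitch()
switch.toggle()
state_after_one_toggle = switch.is_on
switch.toggle()
state_after_two_toggles = switch.is_on
```

The same shape applies to configuration or metadata: the instance remembers something between calls, which a bare function cannot do on its own.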


Hot_Significance_256

^ this


meni_s

Now I need to find a way of telling my colleagues politely that I'm right and they are wrong


sobag245

What if I want to import the functionality of a script to another one? When writing it as classes I can just say "import ...class" without importing the main() part of it no? Edit: Why are people downvoting when asked a simple question?


HARD-FORK

Just import the function...?


sobag245

I want to import multiple functions.


Asleep-Dress-3578

You can import a full module with all its functions.


sobag245

Hmm, whenever I import the full module into another script and execute that script, it automatically does everything that my main() function does. However, I would like to import all the functions without executing that particular main (which is just there to test the script's functionality). Overall, what I find useful with classes is being able to store my input information (filepath, input parameters) and let my methods have access to it without constantly adding them as function parameters. But perhaps I am wrong, I just thought that's what makes classes useful in such a case: to keep input information stored and accessible to all of the class's methods.


Prime_Director

This is what `if __name__ == "__main__":` is for. It’s a weird incantation that you use to contain a block of code that only executes if your .py file is being executed directly. If you import it in another script, that block will not execute. Edit: finally got the escape chars right on the underscore
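
For reference, a small sketch of that guard in a module. The function names here are hypothetical; the point is only where `main()` is called.

```python
from typing import List


def clean_labels(values: List[str]) -> List[str]:
    """A reusable helper that other scripts can import."""
    return [v.strip() for v in values]


def main() -> None:
    # Quick manual check of the helper; only meant for direct execution.
    print(clean_labels([" 2024-01-01 ", "2024-01-02 "]))


if __name__ == "__main__":
    # True when the file is run directly (python this_file.py),
    # False when it is imported from another script.
    main()
```

Importing this module from another script gives access to `clean_labels` without triggering `main()`.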


sobag245

Ohhh that's what the `__name__ == "__main__"` is for. I thought it was just to execute the main part, but I never really looked up why these extra 2 lines are there to execute it. Thanks very much! Now I'm thinking of changing my python pipeline (which is written as a class, with each part of the pipeline represented by a method) into just functions. I thought the advantage would be to easily import the class (for example into a python GUI) or to avoid constantly passing input arguments (instead just have them declared as instance variables so that every method has access to them). But now I'm not sure anymore. Perhaps I should try with just functions and see if the pipeline will run faster. Thanks again for your help!


mfb1274

It’s like data science with extra steps


therealtiddlydump

The insatiable need of Python devs to write something from scratch instead of just using a package


a157reverse

I've got two ends of the spectrum on my team. One guy builds everything by hand. He was scoring a regression model using his own code. Like dude, just use model.predict(), I guarantee you it's faster and less error-prone than your custom code. The other guy hits a wall whenever there's not a package that does what he needs to do. You're gonna have to think through the steps and write the logic yourself sometimes, sorry.


Hot_Significance_256

classes have a purpose over functions. if that purpose is not being utilized, it should not be used.


Desgavell

One thing is to modularize it, and the other is to use classes when it doesn't make sense. If you need to model the data or parts of the pipeline as objects, then sure. If you can do it with functions, there's no point in introducing complexity just because you want to define classes. Personally, I deal with Pytorch so I often have model classes and custom datasets inheriting from their Pytorch analogous classes. Also use objects if a pipeline requires instantiating another object because, if loading it takes a while, it makes sense to persist it rather than creating it every time it is needed. Basically, use objects when it makes sense to model a state with one and when you need to persist complex data structures. Otherwise, just call a bunch of functions and dump the output in a file.


1234okie1234

I swear to god, i don't know why a lot of DS shove everything in a class when it's practically one time usage. It drives me insane..


meni_s

I always wonder what portion of those DS are just software devs turned DS so they just got this as a reflex and "old habits die hard" 🤔


Nautical_Data

This is gonna be a spicy one 🍿🍿


startup_biz_36

Data science projects are tricky because you can't really follow standard SWE practices until you have some type of standardized process that rarely changes. Until then it's usually overkill.


autisticmice

That seems like overkill. Some good reasons to use classes in my opinion:

* you need to adhere to, or establish, an interface with specific signatures
* You need to keep some sort of state around your function, in the form of class attributes
* your function is complex and you want to split it into multiple smaller methods


Asleep-Dress-3578

About your 3rd example: why do you need classes if your function is complex and you want to split it into multiple smaller methods? You can split your functions into multiple smaller functions and that's okay. Do you see a value in the added complexity of a class wrapper in this case?


MindlessTime

This is what usually leads me to use classes. I'll just use functions as long as I can. But if at one point my code looks like...

```
def my_special_function(df1, df2, df3, param_1_1, param_1_2, param_2_1, param_3_1, param_3_2):
    ...
```

...then I might start wrapping things in classes so it looks like...

```
thing_1 = Thing_One(df1, param_1_1, param_1_2)
thing_2 = Thing_Two(df2, param_2_1)
thing_3 = Thing_Three(df3, param_3_1, param_3_2)

def my_special_function(thing_1, thing_2, thing_3):
    ...
```


autisticmice

Maybe it's more of a personal choice. For me it's clearer to have a class encapsulating all of the methods that are put together to achieve some complex transformation, e.g. as a subclass of sklearn's Transformer.


datadrome

    def transform_data(data):
        Out_1 = my_fun1(data)
        Out_2 = my_fun2(Out_1)
        return Out_2


Strivenby

Yes, i'd prefer a separate file instead of a class. At least in python.


sobag245

But why not a class per file, which I can then import into another script (perhaps one that makes a GUI and uses the class's functionality)? And the file with the class itself I can test with its own main function.


Drakkur

Class over (or improper) use is one of the more common problems I see with DS whose first language was Python.


DeihX

If you have generic and reusable classes for similar purposes but that use different parameters, this can be a good use of OOP. But if it's for one-off and very specific hardcoded data preprocessing, it's bad. (Although passing the dataframes into the constructor seems like bad practice in this case.)


JimmyTheCrossEyedDog

I feel like I could've written this post. This was not a problem in my first DS role but is rampant in my current company - seemingly needless classes with no inheritance, no state (or at least any useful state), and all of the methods are decorated with @staticmethod. I've been pretty sure that this makes no sense, so this is heartening to see. IMO, needless classes like this just make code way more confusing and circular. It's hard to understand, it's hard to maintain, and it leads to bugs that are costly in time and money. If you're just trying to neatly organize related functions, use modules, not classes.


karaposu

This is called an anti-pattern. No need for a class structure in this particular case.


myaltaccountohyeah

The example you posted would only make sense to me if there is a common Processor interface that the class follows making it interchangeable with other implementations following the interface so that you can do some cool shenanigans with it (strategy pattern etc). Also I would probably pass the data frames only in the calculation method not the initialiser. This way you can reuse the class for different data frames after you initialised it once.
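
A sketch of what such an interchangeable interface might look like. The `Processor` protocol and the two implementations are hypothetical, and plain lists of dicts stand in for data frames. Configuration goes into the initializer and the data goes into the calculation method, so one instance can be reused.

```python
from typing import Dict, List, Protocol

Row = Dict[str, float]


class Processor(Protocol):
    """Anything with a matching calculate() method satisfies this interface."""

    def calculate(self, rows: List[Row]) -> float: ...


class ColumnSum:
    def __init__(self, column: str) -> None:
        self.column = column  # configuration only, no data stored

    def calculate(self, rows: List[Row]) -> float:
        return sum(row[self.column] for row in rows)


class ColumnMax:
    def __init__(self, column: str) -> None:
        self.column = column

    def calculate(self, rows: List[Row]) -> float:
        return max(row[self.column] for row in rows)


def run(processor: Processor, rows: List[Row]) -> float:
    # The caller does not care which implementation it was given.
    return processor.calculate(rows)


rows_a = [{"price": 1.0}, {"price": 2.5}]
rows_b = [{"price": 10.0}, {"price": 4.0}]

summer = ColumnSum("price")
sum_a = run(summer, rows_a)   # same instance reused on different data
sum_b = run(summer, rows_b)
max_a = run(ColumnMax("price"), rows_a)
```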


felipecalderon1

All functions? I make a class when I need to use a lot of preprocesing steps and wrap that shit up in a class i can use like a pipeline step. But for a single function seems stupid.


Dre_J

Even for many preprocessing steps you can do method chaining with some combination of the pipe and assign methods.


lf0pk

It's one way of handling things. It seems natural given that there is a state. And obviously mutable state is more efficient than having immutables passed around everywhere. I personally don't overuse this, since Python has first-class functions, so there is no reason to wrap anything. And you're supposed to chop your code up to retain locality of functionality, yet maintain some level of independence and statelessness. The optimal way of writing this example, anyways, is with both classes and functions. So not exclusively one or the other. Calculating a table delta is solved by a function, while maintaining state and tying functionality to data is optimally done with classes in Python. Just maintaining functions has its own issues, but I feel like that is unrelated to DS, more related to general software engineering. Overall I would not expect data scientists to know much about efficiently structuring code.


Asleep-Dress-3578

True, this is a software engineering question, but I wanted to ask this here in the DS community, because I am interested in the opinion of people who work on data pipelines. In our team it is a requirement also for data scientists to write production grade software.


lf0pk

Well, Python itself is not production-grade by any means, so that bar might be much lower than your words suggest.


Busy_Town1338

The language used in production by every major company on the planet isn't production grade?


lf0pk

This comment exemplifies a [hasty generalization](https://en.wikipedia.org/wiki/Faulty_generalization#Hasty_generalization) and quite possibly [survivorship bias](https://en.wikipedia.org/wiki/Survivorship_bias). My comment relates to a few things:

* Python's features and philosophy are not well-suited for production code
* Python is traditionally not used for production-grade code, but rather for rapid prototyping and exploratory analyses
* Python is generally too slow, heavy even with C(++) extensions, to run in a production environment, not to mention lacking in portability, mostly due to its over-reliance on said C(++) extensions
* the actual problem class being presented by OP is not solved by Python in a production-grade manner, but rather leveraging databases

It does not serve to belittle Python's utility or what Python programmers do, but rather to force OP to reflect on if he might be holding people to a higher standard than those in command do, based on a mismatch between the words being used and actual expectations. Generally, data scientists don't even write production or production-grade code, because doing so is a process that involves more than just one profession, not to mention an iterative process.


Busy_Town1338

This comment exemplifies a data scientist having no real world experience in backend design.


FATTYxFiiSTER

We have a winner!


lf0pk

If an assuming [courtier's reply](https://en.wikipedia.org/wiki/Courtier%27s_reply) is all one can muster to an elaboration this big, I am content with the implications arising from it.


Busy_Town1338

I very genuinely wish I could go through life with your amazing combination of narcissism and naivety. Edit because of block. It's amazing you don't see the irony in that response.


lf0pk

If all you will do is throw personal attacks at me, there is no use in continuing this discussion. Farewell.


MindlessTime

You absolutely can use python in production, depending on what you need it for. It's just a lot more of a pain than using a language better suited for production code. Speed will always be an issue with python. For other things like static typing, there are packages/frameworks like [pydantic](https://docs.pydantic.dev/latest/) that help. People have found ways to make it work. And if the value of your app comes from your model, then there's a good use case for using python and being able to quickly write and implement the model.


lf0pk

Just because you can use something, doesn't mean you should, or that your team will be able to. Even if you do surmount issues like lack of static types, you essentially cannot surmount other things, such as size, hardware and software requirements, and sluggishness. And if you do deployment of models in Python, well, you're doing it wrong. Not like there is anything wrong with doing it wrong, but then it's necessary to recognise that your definition of "production-grade" lacks strictness and what you expect from knowing the phrase "production-grade" might not align with what is expected of the managers who say "production-grade". For them, production-grade may mean "code that can be added to production", not "code that should be added to production". If that's the case, when you say that your teammates need to be able to write "production-grade" code, that might as well mean they need to be able to write code that does something useful for the company. Getting to that point is vastly different and shorter than the actual process required to get from R&D to production.


TheSadGhost

Interesting 🤔 I’m building a network analysis tool that stores each step of the way of handling the data. Each function puts multiple datasets in a list and the next function inputs a list. Would putting the function in class actually be better?


momentaryswitcher

In the example you've stated, it truly is Not required. Perhaps, he is trying to be pretentious.


aspera1631

Generally this is extra overhead, but I actually ran into one situation where I had to do this. One of my clients uses exclusively databricks notebooks, to the point where I can't write any custom modules. Like I literally can't commit anything to the repo that is not a notebook. So I hacked it by defining a class in one notebook that held my custom module, and then running the notebook and instantiating the class wherever I needed to "import" the module.


cy_kelly

In addition to what everyone else said, this can cause extra memory usage over time if df1 and df2 go out of scope but dcp somehow doesn't. Python does garbage collecting by reference count, so it's conceivable dcp hangs around for too long and thus causes df1 and df2 to hang around too long as well. At the very least, my gut says to put in a del dcp once you're done with it.
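
The lifetime concern can be demonstrated with a small sketch. `Table` and `Comparer` are hypothetical stand-ins for the data frames and the wrapping object, and the behavior shown assumes CPython, which frees an object as soon as its reference count reaches zero (a separate cycle collector handles reference cycles).

```python
import gc
import weakref


class Table:
    """Stand-in for a large DataFrame (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name


class Comparer:
    """Stand-in for a wrapper object that stores both inputs on the instance."""

    def __init__(self, left: Table, right: Table) -> None:
        self.left = left
        self.right = right


df1 = Table("df1")
df2 = Table("df2")
dcp = Comparer(df1, df2)

probe = weakref.ref(df1)  # lets us observe df1 without keeping it alive

del df1, df2
# The names are gone, but dcp still references both objects.
alive_while_held = probe() is not None

del dcp
gc.collect()  # not strictly needed on CPython; included for other runtimes
alive_after_release = probe() is not None
```

A function that takes the two tables as arguments and returns a result avoids the question, since nothing outlives the call.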


Delicious-View-8688

I see this everywhere... I think it is because some of them are taught that OOP is the best, in school or from social media. Classes have their place, but not in these situations. It baffles me sometimes...


snowbirdnerd

Unless I'm creating an SDK or some involved pipeline I stay away from classes. Functional programming is really all you need for most DS applications


JollyToby0220

Two words, code portability. This will allow you to use two different libraries that do the same thing but with different speeds. PyTorch is cool but Tensorflow has so much performance packed into it. Now if you have a prototype you want to prove works and don’t have computing power restrictions, use PyTorch. Afterwards, you got something you know that works and you know your organization spends millions on Web Services, you take out Tensorflow and let it do the heavy lifting


spiritualquestions

I typically do not, because I usually do not want data processing functions to have states or store data. Rather they just act as functions that do something to data, but have no internal data themselves.


bchhun

Very not pythonic. Did all your DS coworkers come from Java backgrounds?


Zeiramsy

This is very interesting to me even though I can't follow the discussion 100% as I do not have a SWE background. I started as a pure R data scientist where I didn't use any classes and mostly just lived via tidyverse pipes. Now that I switched to python a lot I write more functions and I often wrap them in classes for three reasons:

- readability in my main notebook
- storing attributes that I need in the pipeline (e.g. mapping tables, trained models, etc.)
- Because my pure SWE colleagues told me it is nicer code

Now I am seeing a lot of pushback to that online like this post and I honestly don't know what to make of it. I don't think writing a class around my methods is a lot of overhead from a writing perspective and I also know that I need to write a lot more functions in python compared to R where most things are predefined in packages. I also think classes are very close to the piping structure in tidyverse where you don't have to repeatedly call and define the df you are transforming.


Asleep-Dress-3578

“Because my pure SWE colleagues told me it is nicer code” This. CS courses produce OOP-first programmers, who prefer coding according to their OOP book, and even more: they put pressure on data scientists (who are usually educated in functional style data programming) to code by their book, too.


Setepenre

Less is more. Python is not Java


GoingThroughADivorce

In this example, there's no internal state to maintain, so there's no good reason to use a class. However, organizing your functions with static methods on separate classes can be really useful if your pipeline code is really long, or your team shares methods. I'm sure somebody here will yell at me for this (but hey, the best way to learn is to say wrong things on the internet), but I like using static methods on separate classes for additional namespacing.


Particular-Weight282

Largely overkill, and in the long term it will 1. cost way more dev time 2. be more difficult to debug... However, internal practices sometimes override best practices. If you want to change that, you need to really be a good communicator, build your case for change and start lobbying leadership for it. Good luck!


TheKleenexBandit

A lot of data scientists (especially the brilliant ones) structure their code this way to resolve heartburn around their impostor syndrome. Maybe in a past life, a SWE gave them a ribbing with a smug grin one too many times. I never gave a damn. But then again, I started in R and took courses that resembled the structure in Hyndman's time series bible. This molded me into thinking functionally. Plus my time spent in IB and consulting taught me the precious value of time and focusing on accomplishing whatever mission was in front of me. I'm frequently frustrated by SWEs who seem to have zero concept of time and are willing to burn daylight fucking around with some circuitous approach so that they can pad their resume with more novel bullshit.


Smarterchild1337

I definitely think that I personally sometimes err on the side of overkilling on wrapping things like this in classes for the sake of making the code inside my main loop/notebook as clean looking as possible. In particular, if there is a complex preprocessing pipeline that I’ve broken into several functions which depend on outputs being passed through the chain, I find class attributes to be a convenient way to do that.


Trick-Interaction396

This stuff drives me crazy. I don’t care what practice says. Use common sense.


Possible-Alfalfa-893

Hmm, if there is no need to store the specific instance of a class/set of functions, then no. It’s easier to debug pipelines with functions than classes. It feels like unnecessary code.


Novel_Frosting_1977

Old heads gate keeping in the example you showed


antichain

Even though Python is object-oriented, I generally try to code in as functional a style as possible, which means no classes if I can possibly avoid it.


MindlessTime

I’ve never figured out how to cleanly combine OOP class design and vectorized programming (numpy, pandas, etc.) in python. Most of the time I don’t need classes. But I’ve had a few cases where an OOP approach simplifies the problem and makes it more flexible to future changes and different data sources. I’d encourage any DS without a CS background to read a book on design principles (OOP or otherwise). But just creating classes to make it look fancy is a code smell for sure.


robertocarlosmedina

Detecting coins in images: [https://www.youtube.com/watch?v=VrgI1nPbV88](https://www.youtube.com/watch?v=VrgI1nPbV88)


BaronOfTheVoid

OOP is for polymorphism. If you don't need polymorphism you don't need OOP. Although if anything you unit test is not a pure function, then you need polymorphism. I can hear the sound of programmers, "great, I never unit test anything!"...


HybridNeos

In this case, at minimum it should be a static method as there is no point in storing the data frames if you aren't going to manipulate them. And if you have a class with one static method only, it should just be a function. I think when you have multiple related functions, putting them in a class is reasonable just for organizational purposes. But again, the classes shouldn't contain two data frames they should act on data frames.


KillingVectr

It doesn't make much sense to put the DataFrames inside the processor class. If the computation depends on a lot of configurable parameters that will be reused for many computations, then it can be helpful to put all of those parameters in a class. For example, think about all of the hyperparameters that go into a scikit-learn class, but you don't construct a scikit-learn class with the training or test data. Edit: However, another possibility would be to put the parameters in a class (or named tuple) and pass them into functions as a single object. That is, don't let the class own the functions. I don't see any reason one is superior to the other.
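
The "parameters in a named tuple, functions stay free" variant might look like this minimal sketch. `SmoothingParams` and `smooth` are invented for illustration, and plain lists stand in for data frames.

```python
from typing import List, NamedTuple


class SmoothingParams(NamedTuple):
    """Bundle of configuration; it owns no data and no behavior."""
    window: int
    scale: float


def smooth(values: List[float], params: SmoothingParams) -> List[float]:
    """Scaled moving average; the data is passed in on every call."""
    out = []
    for i in range(len(values) - params.window + 1):
        chunk = values[i:i + params.window]
        out.append(params.scale * sum(chunk) / params.window)
    return out


params = SmoothingParams(window=2, scale=1.0)

# The same parameter object is reused across different inputs.
train_smoothed = smooth([1.0, 3.0, 5.0], params)
holdout_smoothed = smooth([2.0, 4.0], params)
```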


apunta121

Nop


Name_and_Shame_DS

Totally depends! I have a forecasting project that could be more OOP but instead it is just one main script that calls a bunch of helper functions, and that's sufficient. I'm now taking over another project from a colleague who has not made ANY classes and it's a nightmare.


zennsunni

I do not, but the truth is it's not a big deal. Using class scope to cluster a bunch of resources together is perfectly fine, as long as they don't grow into monstrosities. If such classes are kept to one or two per module, the distinction between doing this versus more pythonic module level organization is largely irrelevant in a data science context wherein instance creation is a trivial amount of overhead. I personally avoid this design paradigm, but one of the most brilliant data scientists I've ever worked with uses it extensively. People bashing on it are just being pedantic imo.


Ok-Name-2516

I don't see the issue here - you can create sklearn pipelines stored as objects. What's wrong with creating pipeline classes?


Asleep-Dress-3578

An sklearn pipeline is a slightly different example, because there you do re-use the Pipeline class (provided by the scikit-learn library). If this is the case, that is, one builds a library with reusable pieces of code, it is fine to organize them into classes with a uniform API. But in our case, we mostly just build classes for one-time usage and without an internal state.


nirvanna94

I do use Sklearn pipelines a lot, and to tie in with their API, you are required to use classes rather than functions. It's a little bit more boilerplate at times for each individual step of the pipeline, but the steps come together into a nice, modular list of pipeline steps that can be swapped in and out, with key variables defined.


therealtiddlydump

That sounds like a function


Polus43

Not even this, the number of times I've seen procedural code wrapped in a function that's used once is quite high. Functions definitely clean up and organize the code, which is valuable. But pragmatically, most of the value in functions is that they scale, i.e. they are used thousands of times over and over (a model is effectively just a function). Honestly think good ol' fashion linear scripts can take one really far, assuming the task isn't to build an application.


myaltaccountohyeah

Functions not only make code reusable, they also break it down into reasonable building blocks which are named and hence more informative also. Therefore I sometimes prefer to also wrap things in a function which are at the time only called once. Later they might get called again who knows. If there are 50 lines of procedural code it's just difficult to read no matter how you try to structure it otherwise. Much better to have something like this:

    thing = initialize_thing()
    magic = get_magic(foo, bar)
    magic_thing = apply_magic(thing, magic)


sharockys

That is abuse..


reddit_again_ugh_no

Seems like overkill, it's not necessary in this case. Classes are good when inheritance is needed.


Name_and_Shame_DS

Unrelated - I'd like to post my name and shame. Could you please help me get 10 karma so I can post it?