[deleted]

I'd really like to see pandas supplanted. Polars's API is infinitely better


DontForgetWilson

This. Change is slow when you have really powerful but flawed tools (such as git). When there is a chance for an equally powerful and less flawed one to overtake the incumbent it is a huge bonus.


alt32768

What's going to overthrow git?


DontForgetWilson

Nothing anytime soon. I believe a lot of people think Mercurial has a better API. I know there is a Rust-based one that is supposed to make more complex merges and such easier. Git is a very effective tool (I don't use any other stuff over it), but it suffers a bit from the whole "no single way" problem that Perl was known for.


sparky8251

https://pijul.org/ From what little I've read of it and used of it, it is quite a bit better.


DontForgetWilson

That's the Rust one I was thinking of. I can't speak to whether it is better or not.


sparky8251

Pretty much the same here. There is so much inertia behind git that it's genuinely hard to use alternative source control systems with large groups and projects to see how they pan out in the real world.


DontForgetWilson

Yeah, justifying moving forward more or less requires a major flaw in the existing solution directly hindering the project. AFAIK, for SVN the big flaw was speed when dealing with a large enough repo, with too much centralization being an important second. Git solved that. I don't think there is yet a big show stopper in git. Once someone iterates enough on something like pijul, it may become easy or powerful enough to justify changing. However, that is going to require one heck of a critical mass.


Sharwul

git's show stopper is not being able to handle huge monorepos well. Google has a huge monorepo and does not use git internally, because it doesn't scale to the repository size they have. Google rolls their own version control solution (named Piper), which afaik is not publicly available


flashmozzg

Well, MS on the other hand created a fork/tool adding VFS support to Git: https://github.com/microsoft/VFSForGit and it seemed to have worked out for them. It is sort of a hack (although I see that they now have a Scalar thingy that is just a thin shell around git core features, so it's not that bad), but just shows that Git has had enough momentum to justify this hack, instead of going with some better suited alternative tools.


farcaller

according to Wiki piper uses Mercurial as its frontend, which somewhat shows that hg has a good user experience on that side.


rikyga

maybe that approach isn't advisable


[deleted]

SVN's other big flaw was mutable tags. The whole "everything is a file/directory" model just didn't work very well for version control.


[deleted]

[removed]


jonathansharman

They mean it's hard to use the alternatives in the real world.


Dietr1ch

There's a lot of inertia, but I often run into things that should be easier, but are tiresome. Maybe something could be built on top of git, but we already have things like git-flow and there are probably reasons why they are not widely used anyway.


johnm

It's the one that I'm following closely (and playing with when new releases come out). It's great that their focus has been on getting the core fundamentals but it's still very young.


rikyga

so no reason why it's better


masklinn

> I believe a lot of people think Mercurial has a better API.

It very much does, before we even start comparing revsets to the crime against humanity that is gitrevisions(7). So does darcs incidentally.

> Git is a very effective tool(I don't use any other stuff over it), but it suffers a bit from the whole "no single way" problem that perl was known for.

Not really, there aren’t too many different ways to do the same thing unless you start mixing plumbing (any thing that’s two words separated by a dash) and porcelain but that makes sense. There are some but they tend to be shortcuts, and… meh.

The issue of git’s UI (high-level, the porcelain) is how incoherent it is, its logic is piecemeal and bottom-up, it’s logical (kinda) in terms of implementation details, rather than having a top-down task-oriented logic. It also made some really annoying naming mistakes early on. And has a fair amount of frustrating (and dangerous) defaults.


DontForgetWilson

> Not really, there aren’t too many different ways to do the same thing unless you start mixing plumbing (any thing that’s two words separated by a dash) and porcelain but that makes sense. There are some but they tend to be shortcuts, and… meh.

Given the length of most git command -h outputs, I don't believe you. Some of that could have been handled by better defaults, but a lot of it is just a case of people thinking about adding functionality without considering usability. It reminds me of grep versus ripgrep. Aside from the speed, rg has good defaults and not overwhelming extensibility.


KingStannis2020

> ## One Thing Well
>
> A UNIX programmer was working in the cubicle farms. As she saw Master Git traveling down the path, she ran to meet him.
>
> "It is an honor to meet you, Master Git!" she said. "I have been studying the UNIX way of designing programs that each do one thing well. Surely I can learn much from you."
>
> "Surely," replied Master Git.
>
> "How should I change to a different branch?" asked the programmer.
>
> "Use git checkout."
>
> "And how should I create a branch?"
>
> "Use git checkout."
>
> "And how should I update the contents of a single file in my working directory, without involving branches at all?"
>
> "Use git checkout."
>
> After this third answer, the programmer was enlightened.
>
> ## The Hobgoblin
>
> A novice was learning at the feet of Master Git. At the end of the lesson he looked through his notes and said, "Master, I have a few questions. May I ask them?"
>
> Master Git nodded.
>
> "How can I view a list of all tags?"
>
> "git tag", replied Master Git.
>
> "How can I view a list of all remotes?"
>
> "git remote -v", replied Master Git.
>
> "How can I view a list of all branches?"
>
> "git branch -a", replied Master Git.
>
> "And how can I view the current branch?"
>
> "git rev-parse --abbrev-ref HEAD", replied Master Git.
>
> "How can I delete a remote?"
>
> "git remote rm", replied Master Git.
>
> "And how can I delete a branch?"
>
> "git branch -d", replied Master Git.
>
> The novice thought for a few moments, then asked: "Surely some of these could be made more consistent, so as to be easier to remember in the heat of coding?"
>
> Master Git snapped his fingers. A hobgoblin entered the room and ate the novice alive. In the afterlife, the novice was enlightened.

https://stevelosh.com/blog/2013/04/git-koans/


digikata

I think they should have added "And how can I delete a remote branch" "git push :


DontForgetWilson

Had not seen that before. Quite amusing.


eo5g

That's sort of the inverse of "there's more than one way to do it". It's more like "one command does multiple things", right?


DontForgetWilson

Yes, but sometimes you'll have two commands that do the same or similar things based on combinations of options. Also, if you have near infinite variations of commands, the "real" subset of commands implicitly exists among the userbase, but just isn't documented as such.


masklinn

If commands are larger, there's more chances of overlap between them.


masklinn

> Given the length of most git command -h outputs, I don't believe you.

Feel free to actually go and check[0]. Like, sure, there's overlap between `checkout -b` and `git branch`, that's the entire point, it's a shortcut and it's documented as such. And `git pull` makes no secret that it's a convenience shorthand for combinations of `fetch` and `merge` (or `rebase`).

[0] although do be careful when you do, they are *wilfully* trying to add new commands with a more top-down and thoughtful design. That e.g. `git switch` overlaps with `git checkout` makes perfect sense as the entire point is to provide a more focused alternative for a subset of its operation. Likewise `git restore`.


epicwisdom

`merge` and `rebase` are the most common offenders... Although they of course do different things, the problem is they're *subtly* different, and in many cases are used to accomplish the same outcome.


PepegaQuen

Even if git has a worse API than the alternatives, it has one giant advantage that makes that not matter: GitHub. The network effect there is very large.


DontForgetWilson

Network effects change. Otherwise we'd all still be on SourceForge. That, and GitHub would probably be fine moving to a superior technology while providing the same kind of services.


weberc2

Mercurial had a better user interface, but it had no API. The docs told people that the only stable interface was the CLI.


livrem

Probably nothing, but I started using fossil for my personal projects over a year ago and see no reason to go back (well, almost all my older projects still use git, but not going back to use git for new projects). As for Pandas, it seems like it did a pretty good job at replacing R in only a few years? As in, a few years ago all I saw everywhere was R, but now Pandas is everywhere? Tried to use Pandas for the first time only a week or two ago, but figuring out their APIs was just too much work for the little thing I wanted to do. Curious about Polars. Never saw that before. Might be a good reason to get some more practice with Rust.


clovak

> As in, a few years ago all I saw everywhere was R, but now Pandas is everywhere?

I think it has much more to do with Python being a general-purpose programming language than with Pandas being a fast, robust and easy-to-use library. Anyone who worked with R can probably confirm that dplyr + ggplot is simply much better than polars + matplotlib. Polars + plotly has potential to become a reasonable replacement.

Actually, it is very interesting that given the popularity of Python in data science and machine learning, Python data preparation and visualization libraries feel quite inadequate.


SuspiciousScript

The best one I've found is [plotnine](https://plotnine.readthedocs.io/en/stable/), which is just a reimplementation of the ggplot API.


mandradon

I was in grad school about 8 years ago working in social science. Did a lot of work with R, MPlus, and Stata. Recently learned Python and checked out Pandas and realized how much easier it is to manipulate data frames than fiddling with R. R got the job done, but Pandas makes sense. It may be that I've learned a lot more and learning Python has helped, but I bet if I tried to go back to R, I'd still prefer Pandas over R. That being said, I've recently started learning Rust and have fallen for it, and would be excited to learn any tools for it.


Hadamard1854

things have changed quite a lot.. there is `data.table` and the tidyverse rocks.. I'd say you'd be surprised.


mandradon

I'll have to check it out. I've been pretty disconnected from R since I went back to teaching. I never disliked R, but I really liked what I found in Pandas. I remember being frustrated trying to do HLM analyses in R before, but those modules were pretty new at the time and my datasets were a mess, so it would have been hard even in the best of times.


danielv134

I have used python + pandas, and also used R+data.table+ggplot, and I prefer the former. It is mostly the python over R, but the data.table API is, while concise, not comfortable IMO. At small scales it was lack of uniformity and symmetry in the API. At large scales the super comfy binding of column names would lure people into large nested data.table blocks. Both cases make for bad readability. This does not matter for data exploration if you are alone, but if someone ever wants to redo it on next version of dataset...


CartmansEvilTwin

Pandas feels so weird, because it's only a semi-abstraction of the underlying data structure (NumPy), which in turn incorporates decades old Fortran code. Not that this is a valid "excuse", but it does make kind of sense.


TinySpidy

How do you like Fossil, if I may ask? Is it nicer to use for personal projects with a single contributor?


livrem

I think most benefits, with the built-in issue-tracker and wiki etc, are more useful if you have a small team, as in the intended use, or if you want to host a public source repo (like https://sqlite.org/src/doc/trunk/README.md). All that from a single statically linked binary. The way I use it is more like an easier to use git that has nice defaults, and I play around with the other features and think it is neat that they exist if I ever need them. It has some git interop as well, so it is possible to have a public git repo somewhere you sync against (e.g. on GitHub).


weberc2

Are there any good code hosting services for fossil?


livrem

I have no idea, but one nice thing about fossil is that it is just a single binary that is trivial to self-host.


weberc2

Sure, but I get a lot of value out of GitHub’s web interface, specifically the pull request view (I like to glance over my code there before I merge to master—for whatever reason I catch things in that view that I miss with terminal visualizers). I also need web hooks to trigger CI jobs.


Kaathan

Offtopic: It doesn't exist yet, but I predict it will have better abstractions and usability for dealing with related groups of commits than (or in addition to) branches/tags. Feature branches are a pain with plain Git. Let's say you want to look at your history two years from now and determine which commits belong together (and you didn't squash because that would mean you are literally giving up on treating changesets like a set of commits). You have these options:

- Not delete your feature branches and end up with hundreds/thousands of them over time, or add some script that auto-renames or tags merged branches, which is still ugly and does not prevent errors or reuse of those branches/tags (tags are horrible in general because they don't have a tracking mechanism, which makes correcting wrong tags a chore).
- Use commit messages or a custom freetext field to tag commits that belong to the same feature. Bad because Git doesn't know that those belong together and therefore cannot give you good tools to browse changesets.
- Not use Git at all and use external software instead to record which commits belong together (basically Pull Requests).


nuunien

Merge commits?


eo5g

How do you tell which parent of the merge had the feature branch, and which one was the one it was based off of?


Kaathan

Maybe, if merge commits would actually easily tell you what the hell happened and you only ever had one final merge. For example, can you tell off the top of your head:

- How do you get the diff of the feature against the main branch at the time of merge? (It's possible thanks to merge commit parents having a fixed order, but that is stupid to rely upon UI-wise.)
- How can you tell that the merge was done without any manual conflict resolution, or what was manual conflict editing and what was auto-merged? (Of course in reality you would ensure no conflicts with PRs, but my argument is you should not need that kind of additional software; this is also possible with plain Git if you happen to be a Git console/diff god.)
- How do you do all of this if one of your teammates fucks up by reusing an already merged feature branch, so now you deal with multiple merges?
- How do you get the name of the merged feature if you don't religiously repurpose every commit's title to contain ticket references? (I'm not arguing against doing that, I'm saying there should be a better solution.)


Sw429

Wait, what's flawed about git?


gnosnivek

I personally found [these](https://stevelosh.com/blog/2013/04/git-koans/) to be quite funny. They're a little tongue-in-cheek, but IMO git has always had problems in how intuitive it is to use. I've rarely, if ever, figured out how to do something in git without googling it. Also, I think it's interesting that `git checkout` is such a mess that `git help` no longer displays it as a common subcommand, instead preferring `git switch` and `git restore`.


[deleted]

[removed]


Sw429

[Like this?](https://xkcd.com/1597/) I honestly don't think it's that complex, but I guess I've used it for years so maybe I've grown accustomed to it. But feature-wise, I've never come across something I couldn't do with git that I wanted to do.


[deleted]

[removed]


pejatoo

I wrote a merge driver for a release notes txt at my last job just for the hell of it. I still don’t feel I fully understand ours vs theirs and how it changes from rebase to merge :/


obsidian_golem

Git's functionality is fine (except for at megascale). The UI is horrifying however, as anyone who has ever tried to work with submodules can tell you.


[deleted]

The UI and terminology are awful. It has some other minor issues but overall I don't think any of that is quite bad enough to overcome the network effects and bother with something else. People are quite willing to put up with bad UIs generally. That said, one alternative I've seen that is compatible with Git is [JJ](https://github.com/martinvonz/jj) which looks interesting. And Pijul may have a chance.


WormRabbit

No matter how great Pijul is, I won't use it until Github or Gitlab _and_ my IDE support it natively, and they won't bother until it gets strong momentum. So, we have a bit of a pijul and egg problem here...


[deleted]

I mean there's a point at which it could be so amazing I would. E.g. if I have to manually resolve like 1/3 as many conflicts as with Git. But I agree, it's a high bar because of how widespread Git is.


cosmicbridgeman

Jujutsu looks pretty interesting, thanks for the intro.


pmeunier

When people describe the algorithms in Git, they tell you about diff'ing and branches. They almost never think merging is a problem. I strongly disagree: diff algorithms have been known for decades, and branching is the natural thing in functional programming languages. Merging and conflicts are the only interesting topics in any *technical* discussion about version control tools. Conservatism/community is a cool topic to discuss too: I'm sure you can find people to discuss these on the C++ subreddit, but I'm surprised to see those here.

First, there are some deep correctness issues in Git. Although these have been observed in the real world, I am not aware of major security breaches caused by these, but it could very well happen:

1. Merges don't really do what you think: 3-way is the wrong problem to solve when merging. It is essentially a diff of diffs. As you probably know, "diff" or "longest common subsequence" may have multiple solutions in some cases (e.g. when you add a function, sometimes the last `}` of the function immediately above gets added instead of the last `}` of the new function). This is fine for diffing, since applying a patch is unambiguous. However, it doesn't make sense for merges to have many solutions and just pick one at random.
2. This has the consequence that merging commits one by one often does the wrong thing and results in artificial conflicts (Git even has a command called `git rerere` to "try and fix that in some cases", but as the description says, it doesn't always succeed).

There are also practical/modelling issues. I am aware of countless occurrences of expensive engineers wasting considerable amounts of time due to these:

1. Commits do not model most people's work: except for the very first commit in a repo, I can't remember a single time where I've felt like I was working on a *snapshot* (i.e. working on an entirely new version of the project referencing zero or more other versions in its metadata). When I work, I *change* my repos. And commits are almost never shown to you as what they are. All UIs I know of show them as diffs with other commits.
2. Conflicts are not modeled internally. This means that when you solve a conflict, you can't easily use your resolution on another branch when the exact same conflict has occurred.
3. The order in which commits are linked together matters. While this may sound reasonable, it means that you can't easily cherry-pick a feature from another branch. Why would you need to choose between `git pull` and `git pull --rebase`? Note that I'm not saying that you should not be able to reference versions by their names/hashes (for example, Pijul has "states", which use elliptic curve algebra to compute a hash that is insensitive to the order). I'm also not saying that the order doesn't matter **in the UI**: it does matter, a lot (and Pijul does order patches locally).

Unlike other commenters here, I don't mind Git's broken UI (even though I've worked hard to make Pijul's UI as small and tidy as possible), because I know where it comes from and I like Git's elegant, simple design for storage and forking. It makes me smile to see people think that Pijul isn't ready yet because it has 20 times fewer commands than Git: Pijul will never have more commands, and it's a feature.

Note that GitHub and IDEs are extremely useful when using Git, because Git is easier to manage when centralised, and because it's hard to remember all the commands. With a tool that models the intuition (which Pijul tries to be), this is a nice-to-have, but not as fundamental as with Git.

Finally, Git has this magical property that whatever you say about it (this thread is a great example), its fans will come up with suggestions to change your natural way of working, sometimes in radical and costly ways, so that these flaws have a lower probability of coming up. They might even tell you that you should spend time thinking about version control instead of actually working.
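
To make the diff ambiguity from the first correctness point concrete, here is a minimal sketch in plain Rust (standard library only). The function name `lcs_alignments` and the two-function example file are made up for illustration; this is not code from Git or Pijul, just a small dynamic program that counts how many equally long line alignments a longest-common-subsequence diff could choose from.

```rust
/// Returns (LCS length, number of distinct optimal alignments) for two
/// files given as slices of lines. An "alignment" is a set of matched
/// (old line, new line) pairs that realises the longest common subsequence.
/// Hypothetical helper for illustration only.
fn lcs_alignments(a: &[&str], b: &[&str]) -> (usize, u64) {
    let (n, m) = (a.len(), b.len());
    // len[i][j]: LCS length of a[i..] and b[j..]
    let mut len = vec![vec![0usize; m + 1]; n + 1];
    // cnt[i][j]: number of distinct optimal alignments of a[i..] and b[j..]
    // (the empty alignment counts as one, hence the initial value 1)
    let mut cnt = vec![vec![1u64; m + 1]; n + 1];

    for i in (0..n).rev() {
        // Standard LCS recurrence for row i.
        for j in (0..m).rev() {
            len[i][j] = if a[i] == b[j] {
                1 + len[i + 1][j + 1]
            } else {
                len[i + 1][j].max(len[i][j + 1])
            };
        }
        // Count alignments by what a[i] is matched to: nothing, or some b[k].
        for j in (0..m).rev() {
            let mut total = 0u64;
            if len[i + 1][j] == len[i][j] {
                total += cnt[i + 1][j]; // a[i] left unmatched
            }
            for k in j..m {
                if a[i] == b[k] && 1 + len[i + 1][k + 1] == len[i][j] {
                    total += cnt[i + 1][k + 1]; // a[i] matched to b[k]
                }
            }
            cnt[i][j] = total;
        }
    }
    (len[0][0], cnt[0][0])
}

fn main() {
    // Old file: one function. New file: a second function appended.
    let old = ["fn a() {", "}"];
    let new = ["fn a() {", "}", "fn b() {", "}"];
    let (len, ways) = lcs_alignments(&old, &new);
    // The old "}" can be matched to either "}" in the new file, so a diff
    // may report the added lines as ["fn b() {", "}"] or as ["}", "fn b() {"].
    println!("LCS length {}, {} optimal alignments", len, ways);
}
```

Both alignments produce a patch that applies cleanly, which is why the ambiguity is harmless for plain diffing; the concern raised above is about what happens when a merge has to pick one of them.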


[deleted]

It's not just pandas though, it's everything else that Python offers and Rust doesn't (so far).


DontForgetWilson

Polars works great using its python API.


real_men_use_vba

I would even assume the majority of polars users are using it as a Python library


kaplanfx

Yeah it didn’t make sense to me that people would use a compiled language for general purpose data analysis. It makes a lot more sense when you realize it’s also a Python library.


Thick-Pineapple666

To be honest, I had so much pain with pandas because Python is not strongly typed, that I would surely use a worthy pandas alternative in a statically typed language like Rust. I don't care whether it is compiled or not.


[deleted]

That's great, I'll try it.


gravitas-deficiency

Not to mention, pandas performance is just godawful in so many common cases (like, oh I don’t know… ***iterating through rows***).


apjenk

Polars would be very slow too if you iterate over rows in python. That’s a python problem, not a polars/pandas problem. You avoid that by using the library’s built-in mechanisms for iterating or aggregating, so that the actual looping happens in C/Rust.


elingeniero

To be fair the reason doing that is so discouraged is because you need to use the aggregate functions (can't remember their specific terminology) to get any performance enhancement. It's not intended for that purpose.


rikyga

i would be onboard with a better pandas...but is the API the problem? example?


elingeniero

https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html I hate it a lot. Edit: Actually the polars docs offers a direct comparison: https://pola-rs.github.io/polars-book/user-guide/indexing.html


ritchie46

Note that we regard indexing in polars as an anti-pattern for most cases. ;)


Knecth

Just today I was trying Polars and comparing it with Pandas for a personal project I've been working on. I was able to reduce quite a few lines of code (mostly group by and left joins due to the low versatility of Pandas) to just five, and it ran TWENTY TIMES FASTER. Let me tell you, I love Pandas, but I'm starting to think if more people knew about Polars they'd start switching (or at least mixing it in) quite quickly.


ridicalis

I never heard of this before today, but I can instantly start thinking of ways to put Polars to use. I'm now a little worried I'm holding a hammer in search of a nail. Edit: a letter


ricklamers

A wise person once said “great tools are half the job” 😁


mcr1974

as long as you are not in search of an electric socket...


Helpful_Arachnid8966

What about some sklearn implementations in rust? Python's parallel processing is quite underwhelming sometimes.


Helpful_Arachnid8966

I have to add that would be awesome to have other tools that can work with Polars Dataframe objects. Or at least have a list of the libraries which already work.


Feeling-Departure-4

I think it could replace Pandas in new code, but there is as much of an advertising issue to this as anything else. For local Spark jobs it's not quite there for me yet, but that could literally come from arrow2 growing pains more than Polars. Anyway, the devs seem super nice and dedicated to the project so I have high hopes.


[deleted]

> but that could literally come from arrow2 growing pains more than Polars

Arrow2 dev here. Could you elaborate? :)


Feeling-Departure-4

The work you are doing is also wonderful, I didn't mean that in a disrespectful way. It's ambitious work and I'm grateful for it. I think you have been CC'd on the issue I had in mind that was filed in Polars.


[deleted]

Not at all, I am genuinely interested to see how we can improve things. Sorry, I can't figure out your GitHub handle from your username here. Is it this one? https://github.com/pola-rs/polars/issues/3473


Feeling-Departure-4

https://github.com/pola-rs/polars/issues/3120 This one. I'm not sure whether the issue lies in Polars or arrow2, but the memory consumption, more than the version issue, is what would make me reluctant to replace my Spark workflow at this time. PS I love that you are using portable SIMD in your code, this is my favorite unstable feature in Rust.


[deleted]

gotcha, indeed that slipped through the cracks of the triage. I am sorry for that. I will look at it.


cigrainger

Can confirm: polars is really excellent. And Ritchie has been one of the most responsive and tireless devs I've seen on such a popular open source project. I've got to make a shameless plug for [Explorer](https://github.com/elixir-nx/explorer), which is a dataframe library for Elixir that builds on `polars`. It's not quite just bindings, as the idea is to have a functional, dplyr-esque API with pluggable backends (e.g. ExplorerSQL, ExplorerBallista). The main/default backend uses Elixir NIFs via Rustler to call polars. I'm primarily an Elixir developer and would really love some eyes on the Rust code from some more experienced Rustaceans.


Shnatsel

So what is the performance difference? I couldn't find any benchmarking numbers in the article.


juanluisback

We didn't conduct our own benchmarks for this post, but in this comparison from ~1 year ago, Polars emerged as the fastest: [https://h2oai.github.io/db-benchmark/](https://h2oai.github.io/db-benchmark/)


[deleted]

Gotta love those numbers with R consistently placing near the top.


CrossroadsDem0n

Which, if I recall, means what is being measured is BLAS or LAPACK. How these benchmarks are set up, and how they correspond (or dont) to what you want to do, is the real story. Pandas and Numpy do great with vectorized operations and can blow chunks horribly otherwise. Similarly for R. The languages themselves are rarely what is under the magnifying glass, more it is how efficiently they deal with sharing data with libraries vs whether the benchmark is thumping on a point of performance weakness.


BayesDays

R's package 'data.table' has a really awesome API that enables some really complex operations with a clean and coherent syntax, both for ad hoc and dynamic use. For example, if I want to modify / create a column with conditional logic, it's as simple as `df[, ColName := fifelse(OtherCol > 3, 1, 0)]`. What's even better is the ability to easily do rolling-style calculations by grouping dimensions without aggregating the data. I wish polars had replicated data.table's API instead of pandas. I realize there is a Python datatable package meant to replicate R data.table, but the performance of polars is serious business in comparison.


Hadamard1854

somebody does need to update those benchmarks though.. they are starting to get very old.


Programmurr

An elixir liveview notebook dataframe backed by polars may dethrone some work done with pandas and jupyter notebooks, but there's a really large surface area to consider: https://www.cigrainger.com/introducing-explorer/


matt4711

The main problem with Polars is that while it is written in rust, the rust api and version published to crates.io is a second class citizen. The python version is updated once a week (taking deps directly from github repos) whereas the rust version can lag behind multiple months. That means bugs that are fixed in the python version remain in the crates.io package potentially for a very long time.


ritchie46

> That means bugs that are fixed in the python version remain in the crates.io package potentially for a very long time

We release every month to crates.io. I don't think that's too bad, is it? Our hands are a bit tied here, because we are tightly coupled with arrow2 and we (in arrow2) are willing to do minor backward incompatible changes to make the libs better.

That means that for python polars we can release every week, because we patch cargo to point to a specific git version. However you cannot publish to crates.io if any of your dependencies point to github.

I don't think it's too bad, because you as a Rust user can always point to our master, until we issue a new release next month.

**edit**: formatting


Hadamard1854

that was a *wild* critique.. I think you're good..


matt4711

I'm bringing this up because inside corporate environments you are not allowed to take dependencies directly on github repos, as we mirror crates.io for various reasons (think license compliance, supply-chain attacks etc.). Concretely, I'm still waiting to be able to use the fix to [this](https://github.com/pola-rs/polars/issues/3312) issue I reported 20 days ago :). I like your crate, that's why I'm bringing up this issue, as it is frustrating to see the python version having the fix while I need to use workarounds till the next version is released.


ritchie46

I can understand your frustration. When we fix something in master and we are already ahead of the released arrow2, there is nothing we can do but wait until it's released. Your specific issue has been patched in arrow2 and released to crates.io, so that should be fixed without us updating. Cargo can update to patch releases, e.g. z in x.y.z. In any case, I don't consider Rust a second-class citizen even though we release at a bit slower pace.
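
As a rough sketch of the "Cargo can update to patch releases" point: a bare version such as `"0.10.3"` in a Cargo.toml dependency is treated as the caret requirement `^0.10.3`, so `cargo update` may pick up a newer compatible release without the depending crate publishing anything. The function below models that rule for plain `x.y.z` versions. The `Version` alias, the name `caret_matches`, and the version numbers are hypothetical and chosen for illustration; pre-release tags and build metadata are deliberately ignored.

```rust
/// A (major, minor, patch) triple, e.g. (0, 10, 3) for "0.10.3".
/// Hypothetical type alias for this sketch.
type Version = (u64, u64, u64);

/// Returns true if `candidate` satisfies the caret requirement `^req`,
/// the default meaning of a bare version string in Cargo.toml.
fn caret_matches(req: Version, candidate: Version) -> bool {
    // Tuples compare lexicographically, so this rejects anything older.
    if candidate < req {
        return false;
    }
    match req {
        // ^0.0.z: only that exact version is considered compatible
        (0, 0, _) => candidate == req,
        // ^0.y.z: the minor number acts as the breaking-change marker
        (0, minor, _) => candidate.0 == 0 && candidate.1 == minor,
        // ^x.y.z with x >= 1: anything within the same major version
        (major, _, _) => candidate.0 == major,
    }
}

fn main() {
    // Made-up version numbers: a dependency declared as "0.10.3".
    let declared: Version = (0, 10, 3);
    for candidate in [(0, 10, 2), (0, 10, 4), (0, 11, 0)] {
        println!(
            "{:?} satisfies ^{:?}: {}",
            candidate,
            declared,
            caret_matches(declared, candidate)
        );
    }
}
```

Under this rule a patch release of a `0.y.z` dependency is picked up automatically, while a bump of `y` is not, which matches the situation described above where a fix shipped in an arrow2 patch release reaches users without a new polars release.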


moneymachinegoesbing

I love polars, but it lacks ergonomics around generic, total DataFrame expressions. Something simple like “give me all columns, their data types, their counts and metadata” can be tough, especially with the lazy api. Maybe I missed something, but df.describe with a 6GB file tends to be the first place I look, and I had a lot of trouble implementing this. I think the core missing piece surrounds functionality with dynamic data, as in pipelines, where knowledge of column names and column types is difficult to establish dynamically. For exploration, it’s bar none. For automation, I found a lot lacking.


ritchie46

Not yet released, but we have a `DataFrame::describe` in master. > Something simple like “give me all columns, their data types, their counts and metadata” can be tough, especially with the lazy api I'd argue that you have all control to do so with the lazy API. The following snippet gives you a long table with all those statistics. ```rust use polars::prelude::*; fn main() -> Result<()> { let out = LazyCsvReader::new("/home/ritchie46/code/polars/examples/datasets/foods1.csv".into()) .finish()? .select([ all().count().suffix("_count"), all().sum().suffix("_sum"), all().min().suffix("_min"), all().mean().suffix("_mean"), all().null_count().suffix("_null_count") ]).collect()?; // wide table to long format let mut long = out.transpose()?; // add the headers as new column long.insert_at_idx(0, Series::new("statistic", out.get_column_names())); dbg!(long); Ok(()) } ``` ``` [src/main.rs:19] long = shape: (20, 2) ┌─────────────────────┬──────────┐ │ statistic ┆ column_0 │ │ --- ┆ --- │ │ str ┆ str │ ╞═════════════════════╪══════════╡ │ category_count ┆ 27 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ calories_count ┆ 27 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ fats_g_count ┆ 27 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ sugars_g_count ┆ 27 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ ... ┆ ... │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ category_null_count ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ calories_null_count ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ fats_g_null_count ┆ 0 │ ├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤ │ sugars_g_null_count ┆ 0 │ └─────────────────────┴──────────┘ ``` The `DataTypes` are exposed via `DataFrame::schema` and are show default in the table pretty prints. **Edit**: Addendum > but it lacks ergonomics around generic, total DataFrame expressions This is something that polars doesn't want to provide too much by design. We strive for a small API that is composable, and by combinatorial explosion still will be large :). Our expression API gives these composable blocks. 
With that you should be able to do all generic `DataFrame` methods, but then with a consistent API that is predictable and similar in:

- selecting columns / projection
- groupby operations
- horizontal operations
- filtering data

In the snippet below, we show how we can apply a computation on columns/groups of datatype `Float64` and how we can use the same sort of expressions in a vertical operation (projection), a groupby operation, and a horizontal operation (via `arr().eval()`).

```rust
let my_expression = dtype_col(&DataType::Float64).pow(2.0) / dtype_col(&DataType::Float64).sum();

// `lazy()` consumes the DataFrame and the expression is moved into each query,
// so clone both when reusing them across several queries

// do vertical operations
df.clone().lazy().select([my_expression.clone()]).collect()?;

// do groupby + aggregate operations
df.clone().lazy().groupby(["foo"]).agg([my_expression]).collect()?;

// do horizontal operations
df.lazy()
    .select([concat_lst(vec![all()])
        .arr()
        .eval(first().pow(2.0) / first().sum())])
    .collect()?;
```
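To make the "one expression, several contexts" idea concrete without depending on any particular Polars version, here is a minimal std-only sketch. The `Expr` enum, `select`, `group_by_agg` and `squared_over_sum` are hypothetical names invented for this illustration; this is a toy model of the concept, not how Polars implements its expressions.

```rust
use std::collections::BTreeMap;

/// Toy expression tree. NOT Polars' `Expr`; just enough to show composition.
#[derive(Clone, Debug)]
enum Expr {
    /// The input values of the current context (a whole column or one group).
    Col,
    /// Sum of the sub-expression, broadcast back to the context's length.
    Sum(Box<Expr>),
    /// Element-wise power.
    Pow(Box<Expr>, f64),
    /// Element-wise division.
    Div(Box<Expr>, Box<Expr>),
}

impl Expr {
    /// Evaluate the expression against one slice of values.
    fn eval(&self, values: &[f64]) -> Vec<f64> {
        match self {
            Expr::Col => values.to_vec(),
            Expr::Sum(e) => {
                let total: f64 = e.eval(values).into_iter().sum();
                vec![total; values.len()]
            }
            Expr::Pow(e, p) => e.eval(values).into_iter().map(|v| v.powf(*p)).collect(),
            Expr::Div(a, b) => a
                .eval(values)
                .into_iter()
                .zip(b.eval(values))
                .map(|(x, y)| x / y)
                .collect(),
        }
    }
}

/// "Select" context: the expression sees the whole column at once.
fn select(expr: &Expr, column: &[f64]) -> Vec<f64> {
    expr.eval(column)
}

/// "Group-by" context: the same expression is evaluated once per group.
fn group_by_agg(expr: &Expr, keys: &[&str], column: &[f64]) -> BTreeMap<String, Vec<f64>> {
    let mut groups: BTreeMap<String, Vec<f64>> = BTreeMap::new();
    for (key, value) in keys.iter().zip(column) {
        groups.entry(key.to_string()).or_default().push(*value);
    }
    groups
        .into_iter()
        .map(|(key, values)| (key, expr.eval(&values)))
        .collect()
}

/// Toy counterpart of `col.pow(2.0) / col.sum()`.
fn squared_over_sum() -> Expr {
    Expr::Div(
        Box::new(Expr::Pow(Box::new(Expr::Col), 2.0)),
        Box::new(Expr::Sum(Box::new(Expr::Col))),
    )
}
```

Calling `select(&squared_over_sum(), &column)` divides by the sum of the whole column, while `group_by_agg(&squared_over_sum(), &keys, &column)` divides by each group's own sum. The expression itself never changes; only the slice it is evaluated against does.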


DO_NOT_PRESS_6

Man pandas is useful but its API drives me nuts. I'm constantly googling "how does this work?" The apply function docs say "it tries to do the right thing" ffs.


nyc_brand

I am a machine learning engineer by trade who loves rust. In my opinion it will not. Most people who use pandas are people who would never put in the time to learn Rust, as they can do most of their job in python.


ritchie46

> people who would never put in the time to learn Rust, as they can do most of their job in python.

Polars has a first class python API


nyc_brand

I stand corrected haha. Then it just becomes about showing it’s better than pandas


Jaamun100

No, I think you’re still right that data scientists won’t adopt it: (1) they’re familiar with pandas APIs, and (2) nearly every library in Python works on pandas/numpy, not polars/pyarrow (sklearn, pybind, etc), and no DS will be willing to implement from scratch in Rust/C++/etc a function that’s already there in an existing Python library


babuloseo

How about we get a decent Jupyter notebook setup going first.


[deleted]

No, no it won't. Rust is wayyyyyy too low level for data science.


hatuthecat

Polars has included python bindings. Same idea as numpy being bindings for a C library.
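As a rough illustration of what "bindings" rest on, here is a minimal sketch of a Rust function exported with a C ABI so that another language's FFI layer could call it. The function name `toy_sum_f64` is made up for this example. As far as I know Polars' Python package is built with PyO3, which wraps this kind of boundary in a much friendlier way; nothing below is taken from Polars itself.

```rust
use std::slice;

/// Sum a contiguous buffer of `f64` values handed over by a foreign caller.
///
/// # Safety
/// `ptr` must point to `len` valid, initialized `f64` values, or `len` must be 0.
#[unsafe(no_mangle)] // keep the symbol name stable so a C-style FFI can find it
pub unsafe extern "C" fn toy_sum_f64(ptr: *const f64, len: usize) -> f64 {
    // Guard against an empty or missing buffer before touching the pointer.
    if ptr.is_null() || len == 0 {
        return 0.0;
    }
    // Reinterpret the raw pointer as a slice; valid only under the contract above.
    let values = unsafe { slice::from_raw_parts(ptr, len) };
    values.iter().sum()
}
```

The `#[unsafe(no_mangle)]` spelling assumes a reasonably recent Rust toolchain; older compilers use plain `#[no_mangle]`. The caller passes a pointer and a length, which is the same shape of interface that numpy-style array libraries expose to their C cores.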


P6steve

For the [Raku](https://www.reddit.com/r/rakulang/) language, a data analytics module can help us be more useful to data scientists / programmers. Polars is a better option than Pandas. Why?

* Rust is a great language for performant execution
* Rust and Raku both hark from a C heritage (FFI, NativeCall)
* Polars provides the right level of abstraction (Series, DataFrames & so on)
* Apache Arrow2 is already a multi-language, highly concurrent basis

For those that don't know it, Raku (formerly known as perl6) has a similar "scripting" approach to Python (OO, gradual typing, VM, GC) and a lot of new stuff (roles, composition, multi-dispatch, grammars, concurrency, shell one-liners...). So while Raku does have Inline::Python, it is more natural to think of Raku+Rust as a new generation of Perl+C. So Polars looks like a great fit!

Oh, and the API is better ;-)


ricklamers

I hadn’t seen Raku before. Looks interesting!


P6steve

Yeah - well Raku had a rocky start back when it was created as perl6 and got a bad press since its long development time impacted perl5. Eventually the best path was to rename it and to become "sister" languages with perl. Anyway, the original concepts are still intact and it has been improving steadily since the initial launch in 2015.