Polars: Fast DataFrame library for Rust and Python (pola.rs)

238 points by daureg 4 years ago | 124 comments

civilized 4 years ago |

In my world, anything that isn't "identical to R's dplyr API but faster" just isn't quite worth switching for. There's absolutely no contest: dplyr has the most productive API and that matters to me more than anything else. But I'm glad to see Polars moves away from the kludgey sprawl of the Pandas API towards the perfection of dplyr... while also being blazingly fast!

Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!

*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.

cigrainger 4 years ago | |

I've been working on a dataframe library for Elixir that's built on top of Polars and that's heavily influenced by dplyr if you're interested in checking it out: https://github.com/elixir-nx/explorer

anko 4 years ago | | |

that's really cool, thanks!

pdeffebach 4 years ago | |

DataFramesMeta.jl might be exactly what you are looking for then! The syntax is very close to dplyr, but has performance benefits thanks to Julia.

Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/

fault1 4 years ago | | |

DataFramesMeta is great!

But I always get confused by the name. Since DataFrames.jl is lower level shouldn't that be DataFramesBase.jl and the meta package be DataFrames.jl?

davnn 4 years ago | | |

One of the piping macro packages + dataframes.jl works as well.

vavooom 4 years ago | |

Also worth plugging the advanced speed of R’s data.table package which continues to trump dplyr to this day. The syntax is also more compact and straightforward once you understand how to query data with it.

civilized 4 years ago | | |

I don't like it as much as dplyr and I stand behind that. It's too "clever", especially with respect to joins.

Everything is fine "once you understand how to use it", even assembly code, but it's not equally expressive or intuitive. So I don't value data.table speed that much, it's my thinking and typing speed that's usually the limiting factor. I would always recommend dplyr over anything else for someone learning how to use tables.

I also can't help but point out that data.table has the worst first FAQ answer I've ever seen in software documentation: https://cran.r-project.org/web/packages/data.table/vignettes.... Just astonishingly bad. I could write an essay about the unique and diverse ways in which this thing is both incredibly poorly organized and deeply user-hostile.

But if you truly have a need for speed on large datasets, it may be for you.

minimaxir 4 years ago | | |

There is an official dplyr extension that leverages data.table: https://dtplyr.tidyverse.org/

vatican_banker 4 years ago | | |

In what way does data.table trump dplyr? Genuinely interested in knowing.

While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.

dplyr also grew into a full-fledged set of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]

temp8964 4 years ago | | |

The easiest to understand data frame API syntax is SQL: select cols from df where rows match condition group by grouping cols.

data.table syntax is just like that. But less verbose. Plus super fast. No reason to not love it.

gullywhumper 4 years ago | | |

One plus with dplyr is that I can share the code with non-R programmers (and even some non-programmers) and they can follow what is happening pretty easily, while data.table takes some more explanation.

extr 4 years ago | |

dplyr API is not ideal in my experience. Overly verbose and confusing group/melt/cast operators. I much much prefer data.table. In your edit you mention concision, data.table is practically the platonic ideal of that!

civilized 4 years ago | | |

Meh. Some people will never stop using Perl or APL because you can get anything done in five random characters (well, anything the language is optimized to express, everything else is a lot harder). I respect it but it's not for me.

The tidyverse has the most advanced and intuitive versions of all the things you mention IMO. It has evolved a lot in the past couple years and your impressions of it could be out of date.

There is also the dtplyr backend for data.table speed with dplyr syntax, but I don't even bother because dplyr is almost always fast enough for me.

nuq 4 years ago | | |

True, data.table is much simpler and faster. That's one of the reasons I switched from dplyr to data.table.

ttymck 4 years ago | |

Is there dplyr API for pandas? That would seem like a very valuable "translation" layer for transitioning or cross language devs. Maybe there is some language barrier to implementing an elegant/faithful version in python?

civilized 4 years ago | | |

There have been a number of interesting attempts at this. They have names like dplython, and haven't really caught on widely. Python isn't really the best language to build a dplyr-like API in, since both the structure and the culture of the language are against metaprogramming and nonstandard evaluation to create DSLs.

otsaloma 4 years ago | |

Agreed, dplyr is great.

I built my own data frame implementation on top of NumPy, specifically trying to accomplish a better API, similar to dplyr. It doesn't have exactly the same naming or operations, but it should feel familiar and much simpler and more consistent than Pandas. And no indexes or axes.

Having done this, a couple of notes on what will unavoidably differ in Python:

* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip enclosing parentheses in Python though, method chains look a bit verbose.

* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains and group-wise aggregation etc. I'd be interested if someone has a better solution.

* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
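A minimal sketch of the first two points, using only the standard library. The `Frame` class and its `filter`, `mutate` and `aggregate` methods are hypothetical names made up for illustration; this is not dataiter's actual API. It shows method chaining in place of piping, and lambdas standing in for R's non-standard evaluation.

```python
from collections import defaultdict


class Frame:
    """Toy column-oriented table (hypothetical, for illustration only)."""

    def __init__(self, **columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        self.columns = {name: list(values) for name, values in columns.items()}

    def rows(self):
        """Return the table as a list of row dicts."""
        names = list(self.columns)
        return [dict(zip(names, values)) for values in zip(*self.columns.values())]

    @classmethod
    def from_rows(cls, rows, names):
        """Build a Frame from row dicts, keeping the given column order."""
        return cls(**{name: [row[name] for row in rows] for name in names})

    def filter(self, predicate):
        """Keep the rows for which predicate(row) is true."""
        kept = [row for row in self.rows() if predicate(row)]
        return Frame.from_rows(kept, list(self.columns))

    def mutate(self, **new_columns):
        """Add columns; each keyword value is a function of the row."""
        rows = self.rows()
        for row in rows:
            for name, func in new_columns.items():
                row[name] = func(row)
        added = [name for name in new_columns if name not in self.columns]
        return Frame.from_rows(rows, list(self.columns) + added)

    def aggregate(self, by, **aggregations):
        """Group rows by one column; each aggregation is a function of a group."""
        groups = defaultdict(list)
        for row in self.rows():
            groups[row[by]].append(row)
        out = [
            {by: key, **{name: func(group) for name, func in aggregations.items()}}
            for key, group in groups.items()
        ]
        return Frame.from_rows(out, [by] + list(aggregations))


# Where dplyr would let you write bare column names, Python needs a lambda.
result = (
    Frame(species=["a", "a", "b"], mass=[1.0, 3.0, 10.0])
    .filter(lambda row: row["mass"] > 0)
    .mutate(mass_g=lambda row: row["mass"] * 1000)
    .aggregate("species", mean_g=lambda rows: sum(r["mass_g"] for r in rows) / len(rows))
)
```

The row-by-row evaluation here is chosen for brevity, not speed; a real library would operate on whole columns.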

https://github.com/otsaloma/dataiter

matham 4 years ago | | |

>NumPy (and Pandas) is still missing a proper missing value (NA).

But if it's missing a missing value, doesn't that mean that it has a proper missing value?

I'll let myself out now...

_Wintermute 4 years ago | |

You're clearly on the dplyr bandwagon, but as someone who wrote R code for about 10 years before dplyr came along, and saw the direction the language was going, it's the reason I now mainly use python. I just could not put up with the non-standard evaluation so everything ends up being a 100+ line script instead of composable functions, and breaking API changes every 6 months.

pietroppeter 4 years ago | |

It's still very small, but Nim's dataframe library (datamancer) has a dplyr-like API (and it is fast): https://github.com/SciNim/Datamancer

Being in Nim, it will be easy also to add sweet DSLs.

BiteCode_dev 4 years ago | |

You don't need to write "import pandas; pandas.bla()", you can do "from pandas import *; anything_in_pandas()" if you want quick and dirty.

FridgeSeal 4 years ago | | |

And if you want you and your team mates to hate you when they need to work on your code later, and you’ve got random, mystery functions all over the place.

cabalamat 4 years ago | |

> dplyr

Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?

gpderetta 4 years ago |

From the python docs:

  > No Index
  > They are not needed. Not having them makes things easier. Convince me otherwise

Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column).

The library in general seems interesting. I'm not 100% sold on the syntax (as usual project is called select...), but it is not pandas, which is already a huge plus.

ritchie46 4 years ago | |

> (as usual project is called select...)

Yeah.. this confusion is in the API as well (you can pass projection to IO readers). we used `select` because SQL. In the logical plan we make the correct distinction between selection and projection, but you don't see that very much in the API.

sriku 4 years ago |

Hmmm .. in the linked benchmarks [1], DataFrames.jl (Julia library) appears to be fairly competitive.

[1] https://h2oai.github.io/db-benchmark/

abeppu 4 years ago |

There are so many dataframe libraries, many of which have APIs closely following pandas, but not drop-in replacements. I wish we could agree on a standard describing the core parts of what a dataframe must do, such that code depending only on those operations can easily move between dataframes.

vincent-toups 4 years ago |

God please anything to liberate me from pandas, which has one of the wildest API's I've ever had to routinely work with.

Dowwie 4 years ago |

Polars could bring the best of both worlds together if it can codegen python api calls to their Rust equivalent. A user conducts ad-hoc analysis and rapid development with Python. When the work is ready to ship, the user invokes a codegen to transform into Rust-equivalent api calls, resulting in a new rust module.

ahurmazda 4 years ago |

I’ve been using it for the past quarter. In addition to the speed, I’m very pleased with the pyspark-esque api. This means migrating code from research to production is that much easier.

riskneutral 4 years ago |

I'm confused. Polars is built on top of the Rust bindings for Apache Arrow. Arrow already has Python bindings. What does this project add by creating a new Python binding on top of the Rust binding?

bogeholm 4 years ago | |

Polars is not using Rust bindings for Arrow, it uses a Rust implementation called arrow2: https://github.com/pola-rs/polars/blob/master/polars/polars-...

Arrow2: https://lib.rs/crates/arrow2

Fiahil 4 years ago |

… and it’s using arrow2, not the official, unsafe, arrow crate. Great, it means we can use it !

optimalonpaper 4 years ago |

I'm reading all these comments and keep asking myself if I'm missing something, because I honestly sort of like pandas' API?

Sure dplyr is nice -- it felt that way on rare occasions that I got to use it, at least -- but you get used to anything.

So since I'm using python and know it quite well, I'm just more comfortable sticking with python's pandas framework rather than switching to R for data processing

jmakov 4 years ago |

How does it compare to Vaex?

rp1 4 years ago | |

This question was asked the last time the author posted this, a few months ago. I’m surprised they didn’t update the benchmarks. Kind of makes me think Vaex is faster.

ritchie46 4 years ago | | |

The benchmarks are hosted by H2oAI, not by the polars team. Vaex is not in that benchmark.

I don't believe Vaex would be faster though. They aim at larger than RAM data processing, not maximum in-memory performance like we do.

VHRanger 4 years ago | |

That's the real question

unixhero 4 years ago |

What makes Pandas so bad and what makes Dplyr so great?

I have used Pandas a lot for data analysis and for data integration duct tape scenarios. For me it has been a low bar for achieving a lot.

otsaloma 4 years ago | |

If you use Pandas daily, maybe you get used to it and can ignore the issues, but for anyone using Pandas occasionally, it's a huge pain every time trying to figure out how to use it. The API is not intuitive and the documentation is very verbose and unclear. And the top Stack Overflow answers often show the "old way" of doing something, when yet another way of doing the same thing has since been added to the API.

wodenokoto 4 years ago | |

For some people pandas seems to click. Good for you. I always struggle with google and the manual to get even simple things done.

I can never figure out if I am gonna get a series or a data frame out of an operation. It seems to edit rows when I think it’ll edit columns and I constantly have to explicitly reset the index not to get into problems.

I think dplyr is easy to read and write. It does get longer than other alternatives, but the readability is imho so good that it doesn’t feel verbose.

bllguo 4 years ago | |

it's just so bloated and verbose. many ways to do the same things, annoying defaults (how is column not the default axis to drop?), indices are beyond frustrating (have never met anyone who doesn't just reset them after a groupby), inconvenient to do custom aggregations, very slow, not opinionated enough

then there are the inherent python issues like dates and times, poor support for nonstandard evaluation, handling mixed data types and nulls

StreamBright 4 years ago | |

I could never use Pandas without SO and the documentation, and I have used it for almost 10 years.

Most of the time I have no idea what the developers' intention is.

unixhero 4 years ago | | |

Aha, so you're productive right?

the_biot 4 years ago |

I've never seen the term "dataframe" used as it is on this website, and the commenters here all seem to use it. Judging by the examples it seems to just refer to a "row" from e.g. a CSV or SQL query. So is that all it is, or am I missing something?

wodenokoto 4 years ago | |

A data frame is one of the basic, built-in data structures in R, which was released in 1993. And R was based on an even older S.

So it’s not a new thing.

If you don’t work in computational statistics / data science it might not be a well known term, though.

milliams 4 years ago | |

A "dataframe" is a "table"

maxerickson 4 years ago | |

It's a column oriented data structure.

rytill 4 years ago |

How would this compare to loading a sqlite database into memory and performing queries with it?
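For reference, the SQLite side of that comparison can be sketched with Python's built-in `sqlite3` module. The `trips` table and its columns are made-up sample data; this only shows the in-memory setup and a grouped query, and measures nothing.

```python
import sqlite3

# ":memory:" keeps the whole database in RAM; nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("Oslo", 2.5), ("Oslo", 4.0), ("Bergen", 1.5)],
)

# A typical analytical query: one output row per group.
rows = conn.execute(
    "SELECT city, COUNT(*), SUM(distance_km) "
    "FROM trips GROUP BY city ORDER BY city"
).fetchall()
conn.close()
```

SQLite stores tables row by row, whereas dataframe libraries are column-oriented, so an aggregation like this touches more data per row in SQLite than in a columnar engine.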

1egg0myegg0 4 years ago | |

Polars would be 10-100x faster, but so would DuckDB!

rytill 4 years ago | | |

Wow, that’s amazing. I’ll definitely try it out. Do you know if there is any built-in functionality related to data compression or data loaders?

pvitz 4 years ago |

Does anybody here know dataframe systems that are able to handle file sizes bigger than the available RAM? Is polars able to handle this? I am only aware of disk.frame (diskframe.com), but don't know how well it performs.

alexisread 4 years ago | |

I believe Vaex can do this, in addition to GPU processing and reading direct from s3. https://github.com/vaexio/vaex

pvitz 4 years ago | | |

To you and all the other sibling comments: Thanks a lot! Exactly what I have been looking for!

With regard to Vaex, I would really be interested in an independent benchmark comparing it to dask, spark, data.table etc. However, I have seen in the comments that others also can't find that.

Matumio 4 years ago | |

For Python there is Dask: https://docs.dask.org/en/stable/dataframe.html

Fiahil 4 years ago | |

You either stream them, or use bigger VMs.
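A rough sketch of the streaming option, assuming the data is a CSV and the aggregation can be computed one row at a time. The function name `streaming_group_sum` and the sample columns are hypothetical.

```python
import csv
import io
from collections import defaultdict


def streaming_group_sum(lines, key, value):
    """Sum `value` per `key` while reading one row at a time.

    Memory use grows with the number of distinct keys, not with file size.
    """
    totals = defaultdict(float)
    for row in csv.DictReader(lines):
        totals[row[key]] += float(row[value])
    return dict(totals)


# A small in-memory stand-in for a file too large to load at once.
sample = io.StringIO("city,distance_km\nOslo,2.5\nBergen,1.5\nOslo,4.0\n")
totals = streaming_group_sum(sample, key="city", value="distance_km")
```

With a real file you would pass an open file object instead of the `StringIO`. This only works for aggregations that can be updated incrementally; sorts and joins need more machinery.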

KptMarchewa 4 years ago | |

Apache Spark.

VHRanger 4 years ago | |

Vaex

cmollis 4 years ago | |

spark dataframe api..

thenipper 4 years ago |

We've been thinking about trying this out at work for some of our data pipelines/simplified models. The speed/ergonomics look great.

ZeroGravitas 4 years ago |

Is there a plugin to use this as a visidata backend? I quite like their UX.

xiaodai 4 years ago |

It's great to see innovation in this area.

Maxion 4 years ago | |

I wouldn't really call it innovation, it's more just a project trying to bring to python something similar to the tidyverse from R.

nas 4 years ago |

It looks interesting but phrases like "embarrassingly parallel execution" make my marketing hype detectors trigger. Maybe they could tone down their self promotion just a touch. Also "Even though Polars is completely written in Rust (no runtime overhead!) ...". I find that hard to believe.

lern_too_spel 4 years ago | |

"Embarrassingly parallel" is a technical term, not a marketing term. https://en.wikipedia.org/wiki/Embarrassingly_parallel

nas 4 years ago | | |

It's a term for the nature of a problem, not a library or software package. It looks like they have designed the API so that "embarrassingly parallel" problems can naturally be computed using Polars. That would be fantastic, much better than Pandas. The way they write it sounds like marketing fluff to me and that's a shame because Polars looks like a useful thing.

nojito 4 years ago | |

Why?

The benchmarks speak volumes.

https://h2oai.github.io/db-benchmark/

sdfgsdf 4 years ago | | |

The benchmarks speak volumes of dishonesty.

They sorted the results by speed of 1st run. For a language like Julia, which is JIT-compiled, that's not a fair comparison, considering that you compile once and run millions of times.

Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
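To make the first-run versus later-run distinction concrete, here is a small timing sketch. `WarmsUpOnce` is a made-up stand-in that pays a one-time cost on its first call, loosely imitating JIT compilation; it is not how Julia or the benchmark harness works.

```python
import time


def time_runs(func, runs=3):
    """Time each call separately so warm-up cost is visible, not averaged away."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return timings


class WarmsUpOnce:
    """Hypothetical workload with a one-time setup cost on the first call."""

    def __init__(self, warmup_seconds=0.05):
        self.warmup_seconds = warmup_seconds
        self.warmed_up = False

    def __call__(self):
        if not self.warmed_up:
            time.sleep(self.warmup_seconds)  # simulated compile step
            self.warmed_up = True
        return sum(range(1000))


timings = time_runs(WarmsUpOnce())
first_run, later_runs = timings[0], timings[1:]
```

Ranking by `first_run` penalizes the warm-up; ranking by `later_runs` ignores it. Which one matters depends on whether the code runs once or many times.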

EDIT: Also..... let's think critically about some of the entries there. Most of them are languages, but then you have things like Arrow, which is a data format; Spark, which is an engine; and ClickHouse and DuckDB, which are databases. The databases (and Spark) will have to read from disk. They have no chance of competing with anything that's reading from RAM, no matter how slow it is. They were built for different purposes. These are borderline meaningless comparisons.

ritchie46 4 years ago | |

The "embarrassingly parallel" claim is aimed at the expression API. It allows one to write multiple expressions, and all of them get executed in parallel. (Embarrassingly so, meaning they don't have to communicate or use locks.)
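The idea can be sketched in plain Python, assuming each expression is a function that only reads shared columns. This is a structural illustration with made-up names (`expressions`, `evaluate_all`), not Polars's implementation, and CPython threads will not actually speed up pure-Python work like this.

```python
from concurrent.futures import ThreadPoolExecutor

columns = {"a": [1, 2, 3, 4], "b": [10.0, 20.0, 30.0, 40.0]}

# Each expression reads the shared columns and writes nothing shared,
# so the tasks need no locks and no communication with each other.
expressions = {
    "a_sum": lambda cols: sum(cols["a"]),
    "b_mean": lambda cols: sum(cols["b"]) / len(cols["b"]),
    "a_max": lambda cols: max(cols["a"]),
}


def evaluate_all(expressions, columns):
    """Submit every expression as an independent task and collect the results."""
    with ThreadPoolExecutor() as pool:
        futures = {
            name: pool.submit(expr, columns) for name, expr in expressions.items()
        }
        return {name: future.result() for name, future in futures.items()}


results = evaluate_all(expressions, columns)
```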