Now just mix in a bit of DSL so people aren't obligated* to write lame boilerplate like "pandas.blahblah" or "polars.blahblah" just to reference a freaking column, and you're there!
*If you like the boilerplate for "production robustness" or whatever, go wild, but analysts and scientists benefit from the option to write more concisely.
Here is a tutorial for those familiar with dplyr: https://juliadata.github.io/DataFramesMeta.jl/stable/dplyr/
Everything is fine "once you understand how to use it", even assembly code, but it's not equally expressive or intuitive. So I don't value data.table speed that much; it's my thinking and typing speed that's usually the limiting factor. I would always recommend dplyr over anything else for someone learning how to use tables.
I also can't help but point out that data.table has the worst first FAQ answer I've ever seen in software documentation: https://cran.r-project.org/web/packages/data.table/vignettes.... Just astonishingly bad. I could write an essay about the unique and diverse ways in which this thing is both incredibly poorly organized and deeply user-hostile.
But if you truly have a need for speed on large datasets, it may be for you.
While data.table is faster than dplyr, data manipulations with data.table are difficult to read/understand/maintain.
dplyr also grew into a full-fledged set of libraries for data-related projects (the tidyverse). These libraries are _very_ well thought out and enable productivity with a minimal learning curve [anecdotal]
data.table syntax is just like that. But less verbose. Plus super fast. No reason to not love it.
The tidyverse has the most advanced and intuitive versions of all the things you mention IMO. It has evolved a lot in the past couple years and your impressions of it could be out of date.
There is also the dtplyr backend for data.table speed with dplyr syntax, but I don't even bother because dplyr is almost always fast enough for me.
I built my own data frame implementation on top of NumPy specifically trying to accomplish a better API, similar to dplyr. It's not exactly the same naming or operations, but should feel familiar and much simpler and more consistent than Pandas. And no indexes or axes.
Having done this, a couple notes on what will unavoidably differ in Python:
* It probably makes more sense in Python to use classes, so method chaining instead of function piping. I wish one could syntactically skip enclosing parentheses in Python though, method chains look a bit verbose.
* Python doesn't have R's "non-standard evaluation", so you end up needing lambda functions for arguments in method chains and group-wise aggregation etc. I'd be interested if someone has a better solution.
* NumPy (and Pandas) is still missing a proper missing value (NA). It's a big pain to try to work around that.
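To make the first two points concrete, here is a minimal sketch in plain Python. The `Frame` class and its `filter`, `mutate` and `group_agg` methods are hypothetical names made up for illustration (they are not the API of the library mentioned above), and `None` is used only as a stand-in for a missing value.

```python
from collections import defaultdict
from statistics import mean


class Frame:
    """Hypothetical toy data frame: a dict of equal-length column lists."""

    def __init__(self, columns):
        self.columns = {name: list(values) for name, values in columns.items()}

    def rows(self):
        # Materialize the columns as a list of row dicts.
        names = list(self.columns)
        return [dict(zip(names, vals)) for vals in zip(*self.columns.values())]

    @classmethod
    def from_rows(cls, rows, names):
        return cls({n: [r[n] for r in rows] for n in names})

    def filter(self, predicate):
        # Without non-standard evaluation, the row condition is a lambda.
        kept = [r for r in self.rows() if predicate(r)]
        return Frame.from_rows(kept, list(self.columns))

    def mutate(self, **new_columns):
        # Each keyword is a new column name mapped to a per-row function.
        out = dict(self.columns)
        for name, fn in new_columns.items():
            out[name] = [fn(r) for r in self.rows()]
        return Frame(out)

    def group_agg(self, key, **aggregations):
        # Each keyword is an output column mapped to a per-group function.
        groups = defaultdict(list)
        for r in self.rows():
            groups[r[key]].append(r)
        result = {key: list(groups)}
        for name, fn in aggregations.items():
            result[name] = [fn(rs) for rs in groups.values()]
        return Frame(result)


df = Frame({"a": ["x", "x", "y", "y"], "b": [1, 2, 3, None]})

# Method chaining replaces R's pipe; lambdas replace bare column names.
result = (
    df.filter(lambda r: r["b"] is not None)
      .mutate(b2=lambda r: r["b"] * 2)
      .group_agg("a", mean_b2=lambda rs: mean(r["b2"] for r in rs))
)
print(result.columns)
```

Every step needs a `lambda r: ...` where dplyr would accept a bare expression, which is the verbosity the second bullet is about.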
But if it's missing a missing value, doesn't that mean that it has a proper missing value?
I'll let myself out now...
Being in Nim, it will be easy also to add sweet DSLs.
Ths s lbrry whs nm nds mr vwls. F m tlkng t smn, hw m sppsd t prnc t?
> No Index
> They are not needed. Not having them makes things easier. Convince me otherwise
Agree completely. First-class indices in pandas just complicate everything by having a specially blessed column that can't be manipulated consistently. Secondary indices should be "just" an optimization, while primary indices are a constraint on the whole table (not a single column). The library in general seems interesting. I'm not 100% sold on the syntax (as usual, projection is called select...), but it is not pandas, which is already a huge plus.
Yeah.. this confusion is in the API as well (you can pass projection to IO readers). We used `select` because SQL. In the logical plan we make the correct distinction between selection and projection, but you don't see that very much in the API.
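For readers who don't have the relational algebra terms in their head: selection keeps rows, projection keeps columns, and SQL's `SELECT` clause does the latter. A minimal sketch in plain Python, where the `selection` and `projection` helpers and the sample rows are made up for illustration and are not any library's API:

```python
rows = [
    {"name": "ada", "team": "core", "commits": 12},
    {"name": "bob", "team": "docs", "commits": 3},
    {"name": "cy", "team": "core", "commits": 7},
]


def selection(table, predicate):
    """Relational selection: keep the rows that satisfy the predicate."""
    return [row for row in table if predicate(row)]


def projection(table, columns):
    """Relational projection: keep only the named columns."""
    return [{c: row[c] for c in columns} for row in table]


# Roughly: SELECT name, commits FROM rows WHERE team = 'core'
core = selection(rows, lambda r: r["team"] == "core")
names = projection(core, ["name", "commits"])
print(names)
```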
Arrow2: https://lib.rs/crates/arrow2
Sure dplyr is nice -- it felt that way on rare occasions that I got to use it, at least -- but you get used to anything.
So since I'm using python and know it quite well, I'm just more comfortable sticking with python's pandas framework rather than switching to R for data processing
I don't believe Vaex would be faster though. They aim at larger than RAM data processing, not maximum in-memory performance like we do.
I have used Pandas a lot for data analysis and for data integration duct tape scenarios. For me it has been a low bar for achieving a lot.
I can never figure out if I am gonna get a series or a data frame out of an operation. It seems to edit rows when I think it’ll edit columns and I constantly have to explicitly reset the index not to get into problems.
I think dplyr is easy to read and write. It does get longer than other alternatives, but the readability is imho so good that it doesn't feel verbose.
then there are the inherent python issues like dates and times, poor support for nonstandard evaluation, handling mixed data types and nulls
I have no idea what is the intention of the developers most of the time.
So it’s not a new thing.
If you don’t work in computational statistics / data science it might not be a well known term, though.
With regard to Vaex, I would really be interested in an independent benchmark comparing it to dask, spark, data.table etc. However, I have seen in the comments that others also can't find that.
The benchmarks speak volumes.
They sorted the results by speed of 1st run. For a language like Julia, which is JIT-compiled, that's not a fair comparison, considering that you compile once and run millions of times.
Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
EDIT: Also..... let's think critically about some of the entries there. Most of them are languages, but then you have things like Arrow, which is a data format, Spark, which is an engine, ClickHouse and DuckDB are databases. The databases (and spark) will have to read from disk. They have no chance of competing with anything that's reading from ram, no matter how slow it is. They were built for different purposes. These are borderline meaningless comparisons.
Most dataframe libraries cannot architecturally support the entire dataframe algebra and data model because they are optimized for specific use-cases (which is not a bad thing). It can be frustrating for users who may have no idea what they can do with a given tool just because it is called "dataframe", but I don't know how to fix that.
The reason is that once I'm done building whatever model I've needed it works so well I don't have to touch it again for a few years and I forget everything I learned (or the API changes again).
pip install minimal-pandas-api-for-polars
I wrote a library that wraps polars DataFrame and Series objects to allow you to use them with the same syntax as with pandas DataFrame and Series objects. The goal is not to be a replacement for polars' objects and syntax, but rather to (1) Allow you to provide (wrapped) polars objects as arguments to existing functions in your codebase that expect pandas objects and (2) Allow you to continue writing code (especially EDA in notebooks) using the pandas syntax you know and (maybe) love while you're still learning the polars syntax, but with the underlying objects being all-polars. All methods of polars' objects are still available, allowing you to interweave pandas syntax and polars syntax when working with MppFrame and MppSeries objects.
Furthermore, the goal should always be to transition away from this library over time, as the LazyFrame optimizations offered by polars can never be fully taken advantage of when using pandas-based syntax (as far as I can tell). In the meantime, the code in this library has allowed me to transition my company's pandas-centric code to polars-centric code more quickly, which has led to significant speedups and memory savings even without being able to take full advantage of polars' lazy evaluation. To be clear, these gains have been observed both when working in notebooks in development and when deployed in production API backends / data pipelines.
I'm personally just adding methods to the MppFrame and MppSeries objects whenever I try to use pandas syntax and get AttributeErrors.
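The general pattern described here (a thin wrapper that adds compatibility methods while leaving every native method reachable) can be sketched with `__getattr__` delegation. This is only an illustration of the idea, not the library's actual code: `BackendFrame` is a hypothetical stand-in for a native frame object (it is not the polars API), and `CompatFrame` is a made-up wrapper name.

```python
class BackendFrame:
    """Hypothetical stand-in for a native frame object (not the polars API)."""

    def __init__(self, data):
        self.data = data

    def height(self):
        # Number of rows, taken from the first column (0 if there are none).
        return len(next(iter(self.data.values()), []))

    def select(self, names):
        return BackendFrame({n: self.data[n] for n in names})


class CompatFrame:
    """Hypothetical wrapper: familiar syntax on top, native methods underneath."""

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Called only when normal lookup fails, so every native method
        # stays reachable through the wrapper.
        if name == "_inner":
            raise AttributeError(name)  # avoid recursion before __init__ runs
        return getattr(self._inner, name)

    def __getitem__(self, names):
        # Compatibility shim: frame[["a", "b"]] picks columns.
        return CompatFrame(self._inner.select(names))

    def __len__(self):
        return self._inner.height()


frame = CompatFrame(BackendFrame({"a": [1, 2, 3], "b": [4, 5, 6], "c": [7, 8, 9]}))
subset = frame[["a", "b"]]      # wrapper syntax
print(len(subset))              # wrapper method
print(subset.height())          # native method, reached through delegation
```

A method that neither layer defines still raises `AttributeError`, which is the signal to go add another shim.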
https://news.ycombinator.com/item?id=29509439
They have a benchmark for expressiveness (as opposed to performance). Part of this inquiry has been to form a "standard library" of Dataframes operations.
https://h2oai.github.io/db-benchmark/
It has pandas, dask, Spark, data.table, Polars, etc. Sadly, Vaex is currently missing from this suite.
https://rdatatable.gitlab.io/data.table/articles/datatable-i...
Which is why it isn't really linked anywhere else.
But still, I can't really come up with a nicer name. VerbalDataFrames to match the dplyr verbs idiom?
I think you are also maybe assuming everyone has the same use-case as you for data manipulation libraries. If you are coming from a non-programming context and picking up R for the first time, no doubt tidyverse is the way to do that. The verbosity is obviously a benefit if you're having to read someone else's code and are not interested in learning a DSL just to understand what columns are being filtered on or dropped or whatever.
But if you are doing data analysis full time and are writing thousands of lines of throwaway EDA code a week, most of it only to be seen by yourself, the concision and speed that data.table offers is basically second to none, in any language. Rapid iteration for you personally is the point. Less typing is good, because you're trying to move as fast as possible to explore hypotheses. Execution speed on medium sized data is important, because a few extra seconds on every run matters a lot when you are running 500 micro-batches of analysis code a day. And as the h2o benchmarks show, data.table is still quite a bit faster than dplyr. Obviously not everyone needs the speed, but a lot of us do!
I would probably prefer data.table to dplyr in that use case as well. The creator of data.table clearly comes from that background and wrote it for those kind of workloads.
I will also admit that the latest data.table tutorials suggest a lot of improvement over time. data.table made some truly WTF decisions in its early versions and has backtracked on all of it. The join API is much more reasonable now and it supports non-equijoins, which for many people could be the decider vs dplyr just by itself.
The dplyr API has only evolved so much because Hadley set insanely high standards for how powerful and intuitive it should be. So personally I don't count it against them that they didn't get it 100% right the first time.... even though I personally have been burned a couple times by all the changes. I think it's worth it for what they have achieved.
Not that it's all roses. Tidyverse stack traces have become kind of horrible. They're dozens and dozens of layers deep and you have to be pretty experienced to sift through the noise. I'm an old hand and know how to deal with it, which is probably the way a lot of people feel about their favorite table package... even gag Pandas.
I apologize if I came across as a hardliner. Sometimes I feel like data.table is not well advertised for how capable it is, so I will defend the library if given the chance. It's surprising how many "big data workloads" you can replace with a high memory cloud instance and a simple data.table script. Cheers to using the right tool for the right job.
dplyr is for everyone else, and it's great and important that it exists, because most people don't want to (and shouldn't need to) learn a DSL to do some basic filtering/sorting/grouping of 100mb of data.
Dt[rows, columns, groups]
Assuming your dplyr code is generally split apply combine, the dt version is shorter and easier to reason around.
data.table, on the other hand, is a fancy clever gadget with many knobs and buttons you have to turn and press just so to get the desired result. It's only simple if all you do is filter, group by, and summarize.
To illustrate, let's look at what you have to do in data.table in order to achieve the equivalent of a grouped filter in dplyr (from the dtplyr translation vignette):
dplyr:
df %>%
  group_by(a) %>%
  filter(b < mean(b))
data.table:
DT[DT[, .I[b < mean(b)], by = .(a)]$V1]
Compared to the simple, declarative feel of the dplyr version, there's a lot of weird stuff going on in the data.table version. You have to put DT inside itself? What is .I? Where did V1 come from? Janky stuff. (And yes I know precisely what is going on in the data.table version, I just think it's ugly and illustrates my point about composability and legibility extremely well.)
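For anyone who reads neither dialect, this is what a grouped filter computes, spelled out in plain Python: each row is compared against the mean of b within its own group of a. The helper name and sample table are made up for illustration. A filter against the overall mean is included for contrast, since it keeps a different set of rows.

```python
from collections import defaultdict
from statistics import mean

table = [
    {"a": "x", "b": 1},
    {"a": "x", "b": 5},
    {"a": "y", "b": 10},
    {"a": "y", "b": 20},
    {"a": "y", "b": 30},
]


def grouped_filter_below_mean(rows, key, value):
    """Keep rows whose `value` is below the mean of `value` within their `key` group."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row[value])
    group_means = {k: mean(vs) for k, vs in groups.items()}
    return [row for row in rows if row[value] < group_means[row[key]]]


# Group means are x: 3 and y: 20.
kept = grouped_filter_below_mean(table, "a", "b")

# For contrast: one overall mean (13.2) applied to every row.
overall_mean = mean(row["b"] for row in table)
kept_overall = [row for row in table if row["b"] < overall_mean]

print(kept)
print(kept_overall)
```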
The reason data.table has all these independent knobs is because it wants you to cram your entire query into a single command, so it can optimize the query more easily and squeeze every drop of performance. NOT because it's more understandable, because it isn't.
The best of both worlds -- an optimizable query and one-action-at-a-time syntax -- can be achieved with a lazy system like Apache Spark or dtplyr.
b_mean <- dt[, mean(b)]
dt[b < b_mean, by = .(a)]
Unlike the dplyr solution, the dt solution is robust and we can independently test to make sure the mean of b makes sense. The very easy to reason around concept of dt[rows, columns, groups] makes the code extremely clear.
Your translation example is absolutely bonkers because it’s trying to pigeonhole the simplicity of dt into the nonsense that is dplyr.
I'm glad you're having success with data.table and I totally support you against the forces of evil trying to make us use Spark or whatever is the latest big data nonsense to analyze a few million rows.
It's like how we may not agree what project management tool to use but we all agree it's not JIRA :)
I may end up switching to data.table after all. I find dplyr easier to reason about for complex production pipelines that need to be precisely "correct", but all the package developers are raising the bar all the time and data.table may be OK for this use case by now. I definitely do feel the pain point of dplyr slowness here and there.
Not true. If we'd rank them by second run Julia would be:
- On simple query: 1st, 1st, 4th, 1st, 5th (down 1).
- On advanced query: 3rd, 6th, 6th, 4th (up 1), - (out of memory).
> The databases (and spark) will have to read from disk. They have no chance of competing with anything that's reading from ram, no matter how slow it is.
Not true. On a quick peek at the benchmark code, ClickHouse and Spark use in-memory tables. I assume other engines do too.
Also in the second run, Julia is not the fastest. Julia would not be faster than Rust; it's got a garbage collector. This is what you see in the join benchmarks that really push the allocator.
Next to that, the databases run in in-memory mode, so there is no disk overhead. Spark is slower because JVM + row-wise data.
Here's my view: the author of that page has commented here on HN; if my claim were as outrageously wrong as you claim, he would've corrected it.
notice this isn't even a language vs language benchmark. it's libraries and frameworks.
plus I don't think even the author of the julia library in question would agree with your statement: https://discourse.julialang.org/t/the-state-of-dataframes-jl...
as mentioned in that thread, GC and strings, or especially a combination of the two, can be very much a downer in terms of julia performance. That's actually pretty surprising since strings are often as important if not more important than numbers for a lot of data processing needs.
I'd also say in terms of compilation time, some autocaching layer outside of precompilation would do wonders.
Having a garbage collector does not intrinsically make things slower. Especially so outside of the benchmarking microcosm.
One trick I've tried to some effect is to run jl code on a smaller data size so the compilation gets done and then repeat on the large one so it doesn't get interrupted by compilation. Not sure if this is a recommended approach. Benchmarking Julia is a pain for this reason - compilation always gets mixed up with runtime. But it hasn't prevented me from using it interactively. Pretty happy with it actually.
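The warm-up-then-measure idea can be written as a small timing harness. This is a sketch in Python with made-up helper names (`time_call`, `benchmark`, `group_sum`); CPython does not compile on first call the way Julia does, so the numbers it prints only show the shape of the harness, not a JIT effect.

```python
import time


def time_call(fn, *args):
    """Run fn once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def benchmark(fn, small_input, large_input, repeats=3):
    """Warm up on a small input, then time repeated runs on the large one."""
    # The warm-up call absorbs any one-time cost; it is reported separately.
    _, warmup_seconds = time_call(fn, small_input)
    timings = [time_call(fn, large_input)[1] for _ in range(repeats)]
    return {"warmup": warmup_seconds, "runs": timings, "best": min(timings)}


def group_sum(pairs):
    """Toy workload: sum values per key."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return totals


small = [(i % 3, i) for i in range(100)]
large = [(i % 3, i) for i in range(200_000)]
report = benchmark(group_sum, small, large)
print(report)
```

Reporting the warm-up separately from the repeated runs is also what makes a first-run versus second-run ranking possible in the first place.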
Not really. They are designed to showcase a common use case across multiple technologies.
The beauty of this benchmark is that there is a hardware limit included so that it forces you to create novel solutions to perform well.
>Note also that Julia would be number 1 in almost all of those benchmarks if you were to rank by speed of second run (as expected...). It's funny because once you notice it those benchmarks are basically an ad for Julia.
Not sure where you're getting that but even on second run Julia doesn't really compete with DT/Polars
It's like -- Julia is the Rory Gilmore of programming languages.
If you’re writing data pipelines then yes, but a lot of Pandas users use it interactively. As much as I’d rather use Julia, the last time I tried it I found myself waiting for computation far more often than with a Jupyter/Python workflow.