Go+: Go designed for data science (goplus.org)
Go is at the complete opposite end of the spectrum - not flexible at all, it’s purposefully difficult and awkward to write high level abstractions/DSLs, there’s very poor functional programming support, and it’s very verbose. There are great reasons for these restrictions, they’re intentional design decisions, but they also make it a very poor fit for data science IMO.
From where I'm standing, python has some features that kinda look like functional programming concepts, but overall is an OO imperative language, like Ruby and many others.
My understanding is that the DS community prefers it more for its library support in that domain.
As a side note, it's really interesting just how much the popular conception of "functional" has changed. 10 years ago, I don't think anyone would have listed any of those as being important or suggestive of functional programming. Nowadays, "functional" means "like Haskell" instead of "like Lisp." I think we need to be careful when we talk about functional programming because so many ideas have jumped the paradigm and it means so many different things to different people.
Also, pattern matching is coming to python in 3.10. You can read about it here: https://www.python.org/dev/peps/pep-0634/
It's a myth that dynamic languages can't have strong types. Python aborts almost immediately whenever it can. For instance, adding a number to a string? Exception. Accessing undefined properties? Exception.
Furthermore there's a language-standard static type checker, mypy.
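To make the two cases above concrete, here is a minimal sketch of what Python does at runtime. The `Point` class and the helper names are made up for illustration; only the built-in exception types are real.

```python
class Point:
    """Hypothetical class with a single attribute, for illustration only."""

    def __init__(self, x):
        self.x = x


def add_number_to_string():
    # Python refuses to coerce implicitly: int + str raises TypeError.
    try:
        return 1 + "1"
    except TypeError as exc:
        return type(exc).__name__


def read_undefined_attribute():
    # Reading an attribute that was never set raises AttributeError.
    try:
        return Point(1).y
    except AttributeError as exc:
        return type(exc).__name__


add_result = add_number_to_string()
attr_result = read_undefined_attribute()
```

Both failures happen when the offending line runs, not before; catching them ahead of time is what a static checker such as mypy is for.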
> pattern matching
We have that in Python 3.10.
> immutability-by-default for lists and dictionaries
We do have tuples and frozendict.
Arguably its implementations of functional features are much weaker than "truly" functional ones such as Lisp, Haskell, OCaml or F#.
But having the Jupyter notebooks allows for interactivity with the data. Make changes, and see how it affects every step after it.
- first class functions. Go does have these
- concise lambda syntax, that makes them nice/easy to use. Go has first class functions, but a very verbose/awkward lambda syntax
- can easily create your own generic data structures with functional interfaces (can't do this in Go b/c no generics)
- Python is pretty strongly typed, and if you meant statically typed, there's now optional static type checking in Python, similar to TypeScript (not as robust/well implemented though)
- Python has decent immutability support. For example, dataclasses (https://docs.python.org/3/library/dataclasses.html) with frozen=True are a lot like immutable classes in more purely functional languages (i.e. case classes in Scala). Tuples and named tuples. There are libs out there for frozen (a.k.a. immutable) dicts, lists, etc.
- Python is about to get pattern matching in 3.10
- functools (https://docs.python.org/3/library/functools.html)
- etc.
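As a rough illustration of the immutability and functools points in the list above, here is a small sketch using only the standard library. The `Sample` class and its fields are invented for the example.

```python
from dataclasses import FrozenInstanceError, dataclass, replace
from functools import reduce


@dataclass(frozen=True)
class Sample:
    """Hypothetical immutable record, similar in spirit to a Scala case class."""

    label: str
    value: float


s = Sample("a", 1.5)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    s.value = 2.0
    mutated = True
except FrozenInstanceError:
    mutated = False

# "Updating" means building a new instance; the original is untouched.
s2 = replace(s, value=2.0)

# functools.reduce with a lambda folds over an (immutable) tuple of records.
total = reduce(lambda acc, item: acc + item.value, (s, s2), 0.0)
```

Note that `frozen=True` only blocks attribute assignment; a frozen dataclass holding a mutable list can still have that list modified in place.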
You can absolutely use Python in a very mutable-OO style, but it also has pretty good functional programming support. If you look at most Python data science code, it's written pretty functionally.
I'd say most important for data science applications is the ability to create generic data structures with functional interfaces - you can't do this in Go, makes it really awkward to write a lot of the foundational vector, data frame, etc. libraries, that basically all higher level data science libs depend on.
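A minimal sketch of what "a generic data structure with a functional interface" can look like, written in Python with optional type hints. `Column` is a hypothetical stand-in for the kind of building block a vector or data frame library would provide, not an API from any real library.

```python
from typing import Callable, Generic, Iterable, Tuple, TypeVar

T = TypeVar("T")
U = TypeVar("U")


class Column(Generic[T]):
    """Hypothetical typed column; one definition works for any element type."""

    def __init__(self, values: Iterable[T]) -> None:
        self.values: Tuple[T, ...] = tuple(values)

    def map(self, fn: Callable[[T], U]) -> "Column[U]":
        # Returns a new column; the element type may change from T to U.
        return Column(fn(v) for v in self.values)

    def filter(self, pred: Callable[[T], bool]) -> "Column[T]":
        # Returns a new column containing only the elements that satisfy pred.
        return Column(v for v in self.values if pred(v))


prices = Column([1.0, 2.5, 4.0])
labels = prices.filter(lambda p: p > 2.0).map(lambda p: f"${p:.2f}")
```

The point of the comment above is that, without generics, a statically typed language needs either one copy of `Column` per element type or a fallback to an untyped container.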
Let's be honest programming languages are the punching bags of developers.
Those type B people probably want to use Go for building data analytics pipelines similar to Pachyderm[2]. If you want to go the way of a compiled language for data science and numerical analysis, the best bet now is probably Fortran. The fact that the Swift for TensorFlow project was started and terminated recently really showed that there is a need for a proper, modern compiled language for data science and numerical analysis.
There is, however, a dark horse in the programming language race for data science and numerical analysis that perhaps can satisfy both type A and type B data scientists: the D language. It supports functional and object-oriented programming, a borrow checker, inline assembler, a REPL, metaprogramming, CTFE, and open and multi-methods, just to name several modern features suitable for data science and numerical analysis, but admittedly the ecosystem is rather poor as of now (e.g. no library for Arrow). It is also very fast to compile and run even with GC (the GC is also configurable), and you can selectively opt out of GC inside the same code base if blazing speed is your thing.
But glimpses of what it is capable of are already there, albeit still in their infancy compared to mature languages like Matlab, R or Fortran [3][4]. But hey, Rome was not built in a day.
[1]https://www.quora.com/What-is-data-science/answer/Michael-Ho...
[3]https://tech.nextroll.com/blog/data/2014/11/17/d-is-for-data...
[4]http://blog.mir.dlang.io/glas/benchmark/openblas/2016/09/23/...
Or HPC languages like Chapel.
Not only are they compiled, they offer first class support for distributed HPC and GPGPU computing.
Go is nowhere close to offering such capabilities.
You must be kidding. Go is the most flexible (not merely one of the most flexible) of the popular static languages. It is even more flexible than many dynamic languages. It supports function types as first-class citizens, closures, value methods as functions, type methods as functions, type deduction, .... IMHO, the main sell point of Go is not simplicity, but overall balance and flexibility: https://github.com/go101/go101/wiki/The-main-sell-point-of-G...
> there’s very poor functional programming support,
This is true currently, but this is not caused by lack of flexibility, it is caused by lack of custom generics instead.
How is list comprehension a data science primitive? How did this get over 4,000 stars on GitHub with a glaring lack of basic data science functionality? Is this used by actual practitioners?
Interestingly Prose[2] A Go library for text processing yielded better results for named-entity extraction when compared to NLTK in my tests in terms of accuracy and obviously performance.
Perhaps Go is not being applied enough in the Data Science/ML and for fields where it's applied (Network) Math in the standard library seems to be sufficient.
- ndim arrays with broadcasting
- time series
- plotting
- linalg: blas/mkl
- storage - hdf5, zarr, arrow, parquet, netcdf
I don't see any of those either in go+.
This is Hacker News, so there definitely doesn't need to be anything beyond "I could, so I did." But if this actually solves some problem better than existing solutions, it would be cool to read about. Edit: Without a motivating example, it's hard to imagine that people will want to pick up a Go-like (but not exactly Go) language for data science.
Exactly. I use almost exclusively Python (including for data science, or ML really). I've been wanting an excuse to learn Go by doing a project with it. But learning some third Go-like language would be a tougher sell for me, unless there is really something it does better than Python, because it still doesn't give me the benefit of learning Go.
But like someone else said, "because you can" is usually a good enough reason to build or learn a new language, so I'm sure it's still worth it for many.
It's more like typescript for javascript than a completely separate language.
A basic DataFrame library would go a long way. Doesn't have to be as full featured as Pandas. Just something that's maintainable and portable.
I wrote a blog post a few months ago on the current Go DataFrame libraries (gota, qframe, dataframe-go): https://mungingdata.com/go/dataframes-gota-qframe/. None of the current offerings are integrated with Arrow.
An Arrow-backed Go DataFrame library that can read / write Parquet files could really jumpstart data science in Go (really data engineering in Go, which is where they should probably focus first).
</irrelevant unix nerd mumbling>
Can we please consider certain modest improvements?
(Also, I didn't think the shebang was specified by POSIX at all? Am I wrong?)
Julia offers a lot in the data world and not much in the engineering world.
However it is hard to get around the lack of operator overloading and (to a lesser extent, at least to me) generics. I love the simplicity of the language and understand their feeling that operator overloading is too often abused, but at the same time not being able to use algebraic operators for matrix and tensor libraries makes them really hard to use.
The compacting garbage collector can also make it hard to pass pointers to memory to non-Go libraries, which is key in data science.
If this project could address those things I think it could have real potential
However the task at its heart is a vast duplication of work, and while Go has a lot of things going for it, it doesn't seem enough to sway many data scientists into reinventing their wheels in Go.
I don't blame them. Rewrites being difficult to justify or motivate when you already have a compelling implementation is part of the reason why we have significant amounts of FORTRAN77 code still kicking around today. It is also why for many things we opt to just write wrappers around existing C libraries to call them from other languages.
It has many shortcomings, but overall I prefer the sharing of a library across languages, each with its own bindings that can attempt to make it more idiomatic to that specific language. The Go culture/community doesn't favor this approach; the Python community embraces it.
It wants me to do if/else guards a certain way, you have to capitalize first letters of "exported" functions, it won't let me import `fmt` unless I use it, etc. I'm not sure I like it.
In terms of being a general-purpose DS language, I can't imagine using anything that doesn't have a clear strategy to A) get a dataset into a DataFrame or similar, B) get my collaborators a plot in a way that is quick and easy, and C) to a lesser extent, provide some kind of notebook/reporting tool.
They do say there is a lot of development going on but it seems like a space with a lot of great incumbents and a rapidly maturing up-and-comer in Julia.
edit: typo
BUT, they have list comprehensions!! One of the main things I miss coming from Python.
Another problem is that tons of Go functions return (value, error), and it's not clear how such functions should interact with a "map" function. Return all the errors in a separate slice? Stop at the first error? What if you only want to stop when the error is io.EOF? etc.
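The two policies mentioned above ("stop at the first error" versus "return all the errors in a separate slice") can be sketched as follows. This is written in Python rather than Go, mimicking the `(value, error)` convention with tuples; all function names here are hypothetical.

```python
def parse_int(text):
    """Go-style helper: returns (value, error) instead of raising."""
    try:
        return int(text), None
    except ValueError as exc:
        return None, exc


def map_stop_at_first_error(fn, items):
    """Policy 1: return the results so far plus the first error seen."""
    results = []
    for item in items:
        value, err = fn(item)
        if err is not None:
            return results, err
        results.append(value)
    return results, None


def map_collect_errors(fn, items):
    """Policy 2: process everything; errors go in a parallel list."""
    values, errors = [], []
    for item in items:
        value, err = fn(item)
        values.append(value)
        errors.append(err)
    return values, errors


inputs = ["1", "x", "3"]
partial, first_err = map_stop_at_first_error(parse_int, inputs)
values, errors = map_collect_errors(parse_int, inputs)
```

Neither policy is obviously right, and a third one (stop only on a particular sentinel error such as io.EOF) would need yet another variant, which is roughly the design problem the comment describes.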
I think we'll only see map/filter/reduce if the language is changed to specifically accommodate them. I've experimented with doing this myself, which people tend to view as heresy: https://twitter.com/lukechampine/status/1367279449302007809?...
All of the features that make it great for writing high-concurrency web applications would make it painful for writing tabular data processing, array manipulation & linear algebra, and plotting.
Nim seems a lot more practical; it's easy to bind to existing data science libraries, and you can use the macro system to build more expressive DSLs. That said, since Julia already does pretty much anything I would need to do (and will hopefully one day have fast startup times and/or AOT compilation), I'm not sure why you would want to use Nim either. Maybe use it to write some kind of "mid-level" library code that binds to something like Torch, which you could then use from an even higher-level interactive language.
Apart from the incumbents -- Julia, Python (grandfathered in + you can use Hy/Hissp/Coconut), and R -- maybe you could have a good time doing data science in Common Lisp or Racket. Again: good CFFI story, macros for expressive DSLs, flexibility to run in interpreted and compiled modes, dynamic/gradual typing for easy iteration, etc.
Hell, I would sooner take Lua for data science over Go.
That said, I am an "Arrow maximalist", because the beauty of it is that you should be able to use data frames even in Go if you really want to, without reinventing the CSV parsing and memory layout wheels.
Similarly, Chibi or Gambit Scheme.
> I would sooner take Lua for data science
Which provides for a low level language like Terra or a Lisp via Fennel or Urn.
Does it? I'm not familiar with Go data science applications but the design of the language, tooling and runtime, eg low latency garbage collector, errors thrown for unused imports, do not, to me, seem to fit well with the needs of data science. I'm interested in hearing what advantages Go brings.
You're doing something wrong if it doesn't get cleaned up automatically.
For example, in golang you will get a compilation error if you have an unused variable, leading to significant extra work when exploring code-level alternatives.
Python isn't as great (Python Lambda Layers built on Macs don't always work). AWS Data Wrangler (https://github.com/awslabs/aws-data-wrangler) provides pre-built layers, which is a work around, but something that's as portable as Go would be the best solution.
I don't think that a language where you can't write generic map/fold/reduce and typed DataFrames (such as Spark's DataSet) has "a ton of potential".
Go is worse than nearly any dynamic or static language I know in that regard. Even Java has way more potential than Go.
While Go looks to be in the middle, Rust is at the opposite end from Python, and it must be a good choice for building data software that runs data scripts.
> The [Go] lack of operator overloading => https://doc.rust-lang.org/rust-by-example/trait/ops.html
> The [Go] lack of generics => https://doc.rust-lang.org/book/ch10-01-syntax.html
> not being able to use algebraic operators for matrix and tensor libraries https://tensorflow.github.io/rust/tensorflow/struct.Tensor.h...
I was for a time optimistic you could use it as your scripting language without much downside and get all the upside of compiled static types. Rust looks cool and I want to do a project in it at some point but at the moment I'm most optimistic about python with optional type annotations that are understood by compilers and alternative runtimes.
Clever and useful when done daily I guess, but damn it was hard to understand those 9 characters as someone not well-versed in this domain.
If the meaning of an operator can change wildly with the operands then that's just confusing - you can't assume that '==' means what you think it means and you have to go find out what it means.
In comparison, having an actual function name to clue me in on what something does is useful. Like, how is "X[y==1,0]" more readable in this case than something like "filterElements(arrayToFilter, arrayOfBools)"? (if I've understood what the original was trying to do, which I'm not sure I have).
People seem to confuse "less typing" with "simpler", and that's not true. One of the great strengths of Go is that it rejects this and embraces true simplicity.
What I’m really saying is that there’s quite a bit of precedent for that syntax, but it comes from a more specialised field so it is easy to have not come across it before.
[1] https://docs.julialang.org/en/v1/manual/functions/#man-vecto...
Since doing this, the idea and basic syntax has been adopted by GNU Octave, S, R, and now NumPy and Matplotlib, which did it to make it easier for statisticians, engineers, and scientists to adopt Python. Specifically targeting these groups with familiar syntax is exactly why Python is so popular for data science, because data scientists tend to be recruited from the hard engineering and science disciplines. It's a lot easier to teach basic programming to someone with a great background in applied math, experimental design, and research methods than it is to teach all those things to programmers.
This is an area in which languages with operator overloading shine, creating DSLs that mimic the syntax and semantics of other languages. You might have a lot to learn because you're used to == only being defined for scalar data types and arrays only being indexed by natural numbers, but the people the language is designed for are used to broadcasted operators and logical array indexing.
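To show what is going on behind an expression like `X[y==1,0]`, here is a toy sketch of a broadcasted `==` and logical (boolean-mask) indexing built with operator overloading. It is a deliberately tiny pure-Python imitation, not how NumPy is implemented; `Labels`, `Matrix` and `filter_column` are hypothetical names.

```python
class Labels:
    """Hypothetical 1-D label vector whose == is applied element-wise."""

    def __init__(self, values):
        self.values = list(values)

    def __eq__(self, scalar):
        # Broadcast the comparison: one boolean per element.
        return [v == scalar for v in self.values]


class Matrix:
    """Hypothetical 2-D table that accepts (row_mask, column) as an index."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __getitem__(self, key):
        mask, col = key
        # Keep the requested column from every row whose mask entry is True.
        return [row[col] for row, keep in zip(self.rows, mask) if keep]


def filter_column(matrix, mask, col):
    """The spelled-out equivalent, using a function name instead of operators."""
    return [row[col] for row, keep in zip(matrix.rows, mask) if keep]


X = Matrix([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
y = Labels([1, 0, 1])

concise = X[y == 1, 0]
verbose = filter_column(X, [label == 1 for label in y.values], 0)
```

Both forms select column 0 of the rows whose label equals 1; which one reads better depends on whether the reader already knows the array notation.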
That said, the overall ecosystem still makes python the most practical general data science language in my view.
Go just wasn’t designed for this kind of work. Which is unfortunate because it brings a lot of great things to the table.
Vlang is probably the closest spiritual successor that would work, or someone just needs to write a new language
Granted, this is probably a premature assessment of Julia.
Coincidentally, the topmost comments lament Google's missed opportunity with the Swift for TensorFlow project (mentioned in my original comment) and suggest that if it had been done in Julia, the project would have been a success ¯\_(ツ)_/¯
https://naveenkumarmuguda.medium.com/railway-oriented-progra...
There were a whole bunch of goodies in the surrounding ecosystem, as I recall.
Then Yann got acquired by FB, and it all got re-written in Python (hence pytorch, as opposed to torch which was in Lua).
Scheme didn't standardise those very features in the 70s and 80s – it still doesn't have them.
Some of those features are available in add-on libraries or as extensions in some specific Scheme implementations, but they are thus far absent from the standardised language.
Functions are first-class objects, and it supports higher-order functions and had closures (even if in any non-trivial case you needed a full nested def), which were less common features when Python was first introduced, and probably why Python was labeled as "functional." But now those are standard features in almost every modern language, so using them as a criterion for "functional" languages is not a very useful distinction.
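For readers who want the three features named above side by side, a short sketch follows. The function names are invented for the example.

```python
def make_running_total(start=0):
    """Closure: the nested def captures and updates `total`."""
    total = start

    def add(amount):
        nonlocal total  # rebinding a captured variable needs a nested def
        total += amount
        return total

    return add


def apply_twice(fn, value):
    """Higher-order function: takes another function as an argument."""
    return fn(fn(value))


add = make_running_total()       # functions are values that can be returned
first = add(5)
second = add(10)
quadrupled = apply_twice(lambda x: x * 2, 3)
```

The `nonlocal` case is the "non-trivial" one: a lambda can read a captured variable, but rebinding it requires a statement, hence the full nested `def`.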
I'm not sure where this myth comes from, but I see it a lot. Maybe some people think that "lines of code" == "statements", but these are not remotely the same thing, even if they happen to coincide in simple cases.
Python's lambdas are limited to one expression in the implied return statement, but not allowing multiple statements in lambdas is no real limitation when programming in the functional style, as the true functional languages have no statements to speak of, only expressions, and their lambdas work exactly the same way Python's do. A single expression is all that a functional programming language's lambda needs.
Multiline lambdas are considered poor style in Python ("Why not use a `def`?" they'd say.), so you may not see them much, but they do work. The Hissp compiler, for example, relies on this feature. (I am the author of Hissp BTW.)
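A small sketch of the distinction between "one expression" and "one line": the lambda below is a single expression that happens to span several lines. The names are made up for illustration.

```python
# One expression (a chained conditional), spread over multiple lines
# by wrapping it in parentheses.
classify = lambda n: (
    "negative" if n < 0
    else "zero" if n == 0
    else "positive"
)

# Expressions nest, so a lambda body can also call other lambdas.
describe = lambda numbers: [
    (n, classify(n))
    for n in numbers
]

described = describe([-2, 0, 7])
```

As the comment says, most style guides would prefer a `def` here; the sketch only shows that the language permits it.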
It seems like reference counting is probably the move here
Here's a playground showing cases where it works and cases where it requires a cast: https://play.golang.org/p/6Pbqrz8ZZ3t
doesn't sound very strong
> > pattern matching
> We have that in Python 3.10.
> > immutability-by-default for lists and dictionaries
> We do have tuples and frozendict.
3.10 like the version that is not released yet?
Tuples and frozendicts, so precisely non-default lists and dicts?
The only code I compile from scratch in C++ is the code I write myself, everything else is available as binary libraries, something that cargo doesn't do, and it is not part of the near future roadmap, if ever.
Then, after compiled, most of the stuff lands on the VC++ metadata files, so incremental compilation and linking cuts even more time from the usual edit-compile-debug workflow.
Going to have to strongly disagree. It forces you to make horrible codebases with endless boilerplate code and increased complexity introduced by workarounds for abstractions you can suddenly no longer make due to questionable language limitations. You will get improved performance, however.
Because, used properly, it does.
> If the meaning of an operator can change wildly with the operands then that's just confusing
Yes, irresponsible use of operator overloading makes things confusing.
Overloading enables preserving existing semantics with new types that have similar semantic roles, it also enables natural, concise, domain specific notation which may sometimes have different semantics than the standard use (while wild, unpredictable semantic swings hurt readability, humans are naturally quite good at incorporating context into interpretation of symbols/language, and avoiding context sensitivity for naive simplicity does not aid readability.)
Verbosity can be quite bad for the ability to quickly grasp the meaning of things.
> People seem to confuse "less typing" with "simpler"
Conciseness (not mere terseness, but clarity and terseness together) greatly aids readability. Verbosity is not zero-cost.
I've been coding for 40-ish years. I've never found this to be true. Simple expressions are (in my experience) more readable.
I understand it like this: to understand a complex expression you have to unpack it in your head to a simpler version in order to grok it. This is an operation you don't need to do if the expression is in the simpler, more verbose, version in the first place.
This is a known thing in writing, btw - complex sentences are harder to read. If you want your audience to understand you, write more, simpler, sentences.
Good for you, I've only been coding for 38 years.
> Simple expressions are (in my experience) more readable.
Simple is not the inverse of concise; there may be times when simpler expressions are more verbose, but that's not even approximately generally the case. “x²+1” and “x**2+1” and “add(pow(2,x),1)” and “x raised to the second power plus one” are equally simple (or, at least, the later ones are not more simple), but they are progressively less concise.
(It's true that expanding the space of concise expressions may require more complex notation, and when the notation is unfamiliar, that creates a learning curve for learning the notation, but there's a reason people familiar with domains develop notations that support more concise expressions.)
> I understand it like this: to understand a complex expression you have to unpack it in your head to a simpler version in order to grok it.
That's true of complexity of expressions, but again that's not the issue here. And concise notation expands the kind of expressions that can be grokked by pattern recognition rather than unpacking.
But I always think that maybe we should be using new operators for this, instead of overloading existing ones that have other, different, meanings in different contexts.
It really comes down to who you're writing the code for. For something like numpy, whose users will mostly be familiar with matrix notations, operator overloading enables a huge improvement.
OCaml doesn't overload even the arithmetic operators, so for integers you write
    1 + t * A
and for floating point
    1 +. t *. A
and for matrices you would make something like
    scal_mat_add(1, scal_mat_mul(t, A))
Do you really prefer these three over writing
    1 + t * A
for all cases?
Just the same way that a[i] *= b[j] is more readable than a.IndexElement(firstIndex).MultiplyByFloat(b.IndexElement(secondIndex))
Is that more readable or less?
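For comparison, here is a sketch of how a language with operator overloading lets the single spelling `1 + t * A` cover the matrix case as well. `Mat` is a hypothetical toy class, and treating `scalar + matrix` as element-wise addition is an assumption made for the example.

```python
class Mat:
    """Hypothetical toy matrix supporting scalar * Mat and scalar + Mat."""

    def __init__(self, rows):
        self.rows = [list(r) for r in rows]

    def __rmul__(self, scalar):
        # Called for `scalar * Mat` after the scalar's own __mul__ declines.
        return Mat([[scalar * x for x in row] for row in self.rows])

    def __radd__(self, scalar):
        # Called for `scalar + Mat`; adds the scalar to every element.
        return Mat([[scalar + x for x in row] for row in self.rows])


t = 2
A = Mat([[1, 2], [3, 4]])

# Same notation as for plain integers or floats.
result = 1 + t * A

# Scalar case for comparison, using the identical expression shape.
a = 5
scalar_result = 1 + t * a
```

Without overloading, the matrix line would have to be written as nested function calls, as in the `scal_mat_add(1, scal_mat_mul(t, A))` form quoted above.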
Instead, optimise for teaching/learning the skills better rather than capping everyone’s skills. The presence of a learning curve is not an inherently bad thing.
Edit: re-reading your previous comments, I think you and I are in furious agreement haha
Less terse language relies less on shared context, and thus is easier on newbies. There is less assumed knowledge, more things made explicit.
> And concise notation expands the kind of expressions that can be grokked by pattern recognition rather than unpacking.
I have this totally the other way. After years of coding in Go, I can parse "if err != nil" subconsciously and only ever deal with it if it's not that (e.g. if err == nil). It's not concise, but it is very, very easy to read.
Any worthwhile tool is going to be used for years, and you’re only going to be a newbie for a small fraction of the time. It’s better to invest time learning a good notation than to force all the expensive experts to slog through a bad notation forever.
Explicitly handling errors is one of those things that you get used to, for really, really, good reasons, when learning Go.
> Any worthwhile tool is going to be used for years, and you’re only going to be a newbie for a small fraction of the time. It’s better to invest time learning a good notation than to force all the expensive experts to slog through a bad notation forever.
No, because assuming the next developer knows as much as you is probably wrong. Because reading code you wrote 6 months ago is like reading an alien script. And because Go (for very, very good reasons) optimises readability over terseness.
(And whether “2” is integer, real, rational, complex, etc)