Basically, the idea of "batteries included" should also mean that if something looks like you can put a D-cell in there, you're unlikely to blow your arm off.
Similarly, Excel/etc. support these functions without a "semester course in statistics." Instead, you'll find that there are many web pages from semester courses in statistics which end up teaching how to use Excel. The same would no doubt happen with Python.
I don't see why a statistics standard library module needs to provide a "good explanation of when they're appropriate" to a higher standard than any other module. Python provides trigonometric and hyperbolic functions without teaching trigonometry. It provides complex numbers and cmath without teaching people about complex numbers. It provides several different random distribution functions without teaching anything about Pareto, Weibull, or von Mises distributions.
For that matter, data structures is a semester course as well, but the Python documentation doesn't teach those differences in its documentation of deque, stack, hash table, etc., nor describe algorithms like heapq and bisect.
"whether to do online or batch modes which can give different results". The PEP says it will prefer batch mode:
> Concentrate on data in sequences, allowing two-passes over the data, rather than potentially compromise on accuracy for the sake of a one-pass algorithm
Surely, it would be better to supply good implementations of algorithms rather than refrain from doing that, and letting programmers write and use bad ones instead?
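To make the one-pass versus two-pass trade-off concrete, here is a minimal sketch. The function names are made up for illustration and this is not the PEP's reference implementation; it only contrasts the textbook one-pass formula with a straightforward two-pass computation on data that has a large mean and a small spread.

```python
import math


def naive_one_pass_variance(data):
    """Textbook one-pass sample variance: (sum(x^2) - sum(x)^2 / n) / (n - 1).

    Subtracting two huge, nearly equal floats can cancel away almost all
    of the significant digits.
    """
    n = 0
    total = 0.0
    total_sq = 0.0
    for x in data:
        n += 1
        total += x
        total_sq += x * x
    return (total_sq - total * total / n) / (n - 1)


def two_pass_variance(data):
    """Two-pass sample variance: compute the mean first, then the squared deviations."""
    data = list(data)  # a second pass requires a sequence, not a one-shot iterator
    n = len(data)
    mean = math.fsum(data) / n
    return math.fsum((x - mean) ** 2 for x in data) / (n - 1)


# Illustrative data: small spread around a very large mean.
# The exact sample variance is 30.
samples = [1e9 + 4, 1e9 + 7, 1e9 + 13, 1e9 + 16]
print(two_pass_variance(samples))        # 30.0
print(naive_one_pass_variance(samples))  # a negative number, which is impossible for a variance
```

With IEEE 754 doubles the one-pass formula loses so much precision on this input that the result comes out negative, which is the kind of DIY mistake a stdlib module would spare people.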
IMO, the discussion should be about what could/should end up in the _standard_ library, and what is better put in a separate product/download.
Fair enough, but the same argument could be made about using an unskewed standard distribution on non-symmetrical datasets, a common error even among people who should know better.
I think binomial functions should be included, on the ground that they're very useful and their probability of misuse is only equal to the continuous statistical forms, not more so.
...how to represent missing or discrete variables...
Don't. Just say no. Just give me the simple easy stuff. Most of us will be fine, and everyone else will know they need something better and won't bother.
- It's very easy to find and install third party modules
- Once a library is added to stdlib, the API is essentially frozen. This means we can end up stuck with less than ideal APIs (shutil/os, urllib2/urllib, etc) or Guido & co are stuck in a time-consuming PEP/deprecate/delete loop for even minor API improvements.
- libraries outside of the stdlib are free to evolve. users of those libraries who don't want to stay on the bleeding edge are free to stay on old versions.
The PEP proposes adding simple, but correct support for statistics.
Apart from high-end libraries being overkill and DIY implementations being incorrect, the PEP also cites resistance to third party software in corporate environments. This problem is more social than technical though, and I'm not sure what weight must be attached to it.
What are those reasons for why a new library can be included, and why aren't those reasons appropriate justification for including this proposed statistics package?
Or overcome by bad judgement.
https://news.ycombinator.com/item?id=6190603
The URL was
http://www.python.org/dev/peps/pep-0450/
While this is http://www.python.org/dev/peps/pep-0450
That is, exactly the same except for a trailing slash. Doesn't the deduplication algorithm handle this case?
That's the reason why opening http://www.python.org/dev/peps/pep-0450 redirects to http://www.python.org/dev/peps/pep-0450/ . The HN engine should follow redirects to avoid situations like this.
Still, there are a number of common sense heuristics to normalize URLs, that HN applies to do de-duplication. I was wondering what is the rationale for not having trailing slash removal among them. I mean, is there any legitimate website that serves a different resource if you remove the trailing slash?
For one, it provides the welcome ability to bring topics up in Hacker News again, where they might get accepted better the second or third time (e.g because more people are online at the time of the second submission).
If the "deduplication algorithm" had "handled this case", then we would only be left with the first submission (a dead discussion), whereas as it is, HN users have now caught on to this PEP news and we have a discussion going on.
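For what it's worth, the kind of URL normalization being discussed can be sketched in a few lines. This is not HN's actual deduplication code (nobody here has shown that); normalize_url is a hypothetical helper, and stripping the trailing slash is exactly the debatable heuristic, since a server is technically free to serve different resources for the two forms.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Return a canonical form of url for duplicate detection (illustrative only)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    # Host names are case-insensitive; userinfo is ignored here for simplicity.
    netloc = parts.netloc.lower()
    # The aggressive step: treat "/path/" and "/path" as the same resource.
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment, which is never sent to the server.
    return urlunsplit((scheme, netloc, path, parts.query, ""))


a = normalize_url("http://www.python.org/dev/peps/pep-0450/")
b = normalize_url("http://www.python.org/dev/peps/pep-0450")
print(a == b)  # True
```

Following the redirect instead would avoid guessing, at the cost of a network request per submission.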
I do not regard this as a good justification for putting something in the standard library! If you don't have root access, use virtualenv (which you might want to do anyway) and install the package somewhere under your home directory.
To take this example, you _could_ count enumerations and permutations by passing a range(n) list to itertools and then counting how many actual results you get back, but that's silly when you could also just use the binomial theorem to get there directly. A compiler that could generally perform such transformations would be miraculous -- well beyond the territory of automated proof assistants like Mathematica or gcc -O3 that trundle along cultivated routes of expert system rules, into the realm of actually discovering deep linkages at the frontier of our knowledge.
Until then it seems like stdlibs will just fracture along lines of strain among the userbase. Presumably, most Python users don't need anything beyond what a financial calculator would provide, and anyone else should head to numpy.
Not exactly. Given argument lists, Itertools provides result lists (actually, iterators for that purpose) with the original elements permuted and combined, but doesn't provide numerical results for numerical arguments, as shown here: http://arachnoid.com/binomial_probability
I was referring to permutation and combination mathematical functions, not generator functions.
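The difference between the two can be sketched as follows. The helper names n_permutations and n_combinations are made up for illustration; they use the closed-form factorial expressions, while the itertools versions have to generate every result just to count it.

```python
from itertools import combinations, permutations
from math import factorial


def n_permutations(n, k):
    """Number of ordered selections of k items from n: n! / (n - k)!"""
    return factorial(n) // factorial(n - k)


def n_combinations(n, k):
    """Number of unordered selections of k items from n: n! / (k! (n - k)!)"""
    return factorial(n) // (factorial(k) * factorial(n - k))


# Counting by brute force: every tuple is actually built and then thrown away.
brute_perms = sum(1 for _ in permutations(range(5), 2))
brute_combs = sum(1 for _ in combinations(range(5), 2))

print(n_permutations(5, 2), brute_perms)  # 20 20
print(n_combinations(5, 2), brute_combs)  # 10 10
```

The closed forms stay cheap for something like n_combinations(50, 25), where enumerating the results is out of the question. Later Python versions (3.8 and up) provide math.perm and math.comb for exactly this.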
The reason why scipy (and Julia BTW) need blas/lapack is because that's the only way to have decent performance and reasonably accurate linear algebra. The alternative is writing your own implementation of something that has been used and debugged for 30 years, which does not seem like a good idea.
Google App Engine used to suffer because of this (more specifically, it still only restricts your runtime to pure Python, but now you can import numpy at least). I believe the PyPy folks have also had their own set of struggles with numpy compatibility, although I'm not sure what the state of that is at present.
In any case, I think these compatibility concerns alone make a strong argument for including simple Statistics tooling into the standard library.
I would actually prefer to have numpy included before those statistics functions.
NB. And I believe Perl6 (spec) includes PDL - http://perlcabal.org/syn/S09.html#PDL_support
> For many people, installing numpy may be difficult or impossible.
that's as true, and arguably more, for pandas.
For 99% of my work numpy and the associated compilation overhead is unneeded - fits my brain, fits my needs
If your work involves a lot of scientific computing then yes, you're going to need numpy but if you're just updating an existing script that's doing some performance monitoring, having an accurate version of mean and standard deviation available seems like a great idea.
Python attempts to be batteries included, this is part of its philosophy and one of the reasons for its popularity. I'm glad this is being extended into a new area.
Edit: Although, I do agree that NumPy being difficult to install is not, on its own, a good justification for the PEP.
The reason why you can't do pip install numpy is pip's fault, there is nothing that numpy can do to make that work. Note that easy_install numpy does work on windows (without the need for a C compiler).
First of all, as others mentioned, your personal machine runs on Windows so right off the bat there goes your instant virtualenv, pip install.
Even on the Unix app server, you're probably behind a firewall that's tight as a duck's ass so chances are you're downloading the tar ball and making the package yourself.
Third, wtf are you doing littering the app server with all these binaries? And what is Python? I'm sorry, no, rewrite this in Java, please.
If you do manage to convince your manager and the rest of your team that Python is not black magic, the first time numpy breaks or some small issue crops up or you have to migrate to a new server and reinstall numpy but now there's a new version and... GTFO of here with that black magic.
I agree with you, developing websites on your MacBook Pro there's no excuse not being able to install numpy. In the real world though, having basic necessities in stdlib does absolute wonders.
- fewer dependencies for my package
I've written the average() and standard_deviation() functions at least a couple of dozen times, because it doesn't make sense to require numpy in order to summarize, say, benchmark timing results.
- reduced import time
NumPy and SciPy were designed with math-heavy users in mind, who start Python once and either work in the REPL for hours or run non-trivial programs. They were not designed for light-weight use in command-line scripts.
"import scipy.stats" takes 0.25 second on my laptop. In part because it brings in 439 new modules to sys.modules. That's crazy-mad for someone who just wants to compute, say, a Student's t-test, when the implementation of that test is only a few dozen lines long. (Partially because it depends on a stddev() as well.)
Sure, 0.25 seconds isn't all that long, but that's also on a fast local disk. In one networked filesystem I worked with (Lustre), the stat calls were so slow that just starting python took over a second. We fixed that by switching to zip import of the Python standard library and deferring imports unless they were needed, but there's no simple solution like that for SciPy.
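A rough sketch of both ideas, measuring what an import costs and deferring an import until it is needed, might look like this. The helper names are hypothetical, and stdlib modules (colorsys, json) stand in for scipy.stats so the snippet runs without SciPy installed; the numbers it prints depend entirely on the machine and on what is already loaded.

```python
import importlib
import sys
import time


def import_cost(module_name):
    """Return (seconds, newly_loaded_module_count) for importing module_name.

    Only meaningful for the first import in a process: a module already
    in sys.modules costs almost nothing to import again.
    """
    before = set(sys.modules)
    start = time.perf_counter()
    importlib.import_module(module_name)
    elapsed = time.perf_counter() - start
    return elapsed, len(set(sys.modules) - before)


def report(timings, as_json=False):
    """Format benchmark timings; the json import is deferred until needed."""
    if as_json:
        import json  # only paid for on this code path
        return json.dumps(timings)
    return ", ".join("%.3f" % t for t in timings)


seconds, new_modules = import_cost("colorsys")  # stand-in for a heavy package
print("%.4f s, %d new modules" % (seconds, new_modules))
print(report([0.5, 0.25]))
```

Deferring imports like this works for your own code, but it does not help when the package you import eagerly pulls in hundreds of its own submodules.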
- less confusing docstring/help
Suppose you read in the documentation that SciPy implements the Student's t-test as scipy.stats.t.
>>> import scipy.stats
>>> scipy.stats.t
<scipy.stats.distributions.t_gen object at 0x108f87390>
It's a bit confusing to see scipy.stats.distributions.t_gen appear, but okay, it's some implementation thing.
Then you do help(scipy.stats.t) and see
Help on t_gen in module scipy.stats.distributions object:
class t_gen(rv_continuous)
| A Student's T continuous random variable.
|
| %(before_notes)s
|
...
|
| %(example)s
Huh?! What are %(before_notes)s and %(example)s?
The answer is, scipy.stats auto-generates many of the distribution functions, including things like docstrings. Only, help() gets confused about that because help() uses the class docstring while SciPy modifies the generator instance's docstring. Instead, to see the correct docstring you have to do it directly:
>>> print scipy.stats.t.__doc__
A Student's T continuous random variable.
Continuous random variables are defined from a standard form and may
require some shape parameters to complete its specification. Any
optional keyword parameters can be passed to the methods of the RV
object as given below:
1. Better math training in school.
2. More kinds of applied math problems being routinely evaluated by Python and other languages.
3. More available memory and storage capacity.
All of which argue for larger math libraries with more functions and classes of functions.
> Presumably, most Python users don't need anything beyond what a financial calculator would provide, and anyone else should head to numpy.
I would normally agree, but the argument has been made in this thread that numpy can't be installed in some environments -- environments that easily support Python, but that don't accommodate numpy without great difficulty.
GAE just avoids it altogether (except locally, where you have CPython and use it to stub out core services hosted on the cloud runtime). You simply can't import sqlite3 on GAE when running on cloud runtime, nor can you really use it as an external dependency.
I'm not really up on the details, but the PyPy website claims they've gotten around this by implementing a pure Python equivalent of the CPython stdlib library (http://pypy.org/compat.html).
I would put forward that SQLite3 is probably a pretty easy include in most C projects compared to whatever numpy would likely require. That said, I'm not qualified to assess this, being neither a numpy, Python core, or sqlite3 dev.
All of this aside, it's worth mentioning that the entire standard lib includes and depends on some other C-only libraries. So it's not unprecedented. In principle, you'd want the standard lib to have as much pure Python as possible (PyPy kind of takes this to the ultimate extreme from what I can gather), but this isn't always practical (great example of "practicality beats purity" if you ask me).
Speaking of which, if it's cool to have `sqlite3` in the standard lib as part of the included batteries, why not mean and variance and the like? :D
I have never used numpy on GAE, but I would suspect some of it is not enabled for those reasons.
I think having a basic stats module always handy would be very convenient.
I absolutely agree. My only point was that these tools are sometimes misapplied, not at all to argue that they shouldn't be readily available. They should be.
It would be great if there was a natural progression (and/or compat shims) for porting from this new stdlib library to NumPy[Py] (and/or from LibreOffice). (e.g. "Is it called 'cummean'")?
from stats import mean
...
from numpy import mean
But yeah HN should just use browser equivalence.
Following what the spec says for equivalence makes sense, at least. Anything drastic is technically treating distinct URLs as equivalent.
Frankly, building numpy is not very complicated (scipy is a bit complicated, and only if you are not on linux).
If you're happy with the system-wide Python, try apt-get install python-numpy next time.
Think of it as the C library of numerical computing.
Generally, numpy and scipy have much better docstrings than python stdlib itself.
In all honesty, I seldom use NumPy and rarely use SciPy, so I can't judge that deeply. I know that when I read their respective code bases I get a bit bewildered by the many "import *" and other oddities. It doesn't feel right to me. I know the reason for most of the choices - to reduce API hierarchy and simplify usability for their expected end-users - but their expectations don't match mine.
So I looked at more of the documentation. I started with scipy/integrate/quadpack.py. The docstring for quad() says, in essence, "this docstring isn't long enough, so call quad_explain() to get more documentation." I've never seen that technique used before. The Python documentation says "see this URL" for those cases.
Again, this is a difference in expectations. I argue that NumPy and Python have different end-users in mind. Which is entirely reasonable - they do! But it means that it's very difficult to simply say "add numpy to part of the standard library."
There's also a level of normalization that I would want should numpy be part of the standard library. For example, does out-of-range input raise ValueError or RuntimeError? scipy/ndimage/filters.py does both, and I don't understand the distinction between one or the other.
Now, in the larger sense, I know the history. RuntimeError was more common in Python, and used as a catch-all exception type. Its existence in numpy reflects its long heritage. It's hard to change that exception type because programs might depend on it.
But it means that integrating all of numpy into the standard library is not going to work: either it breaks existing numpy-based programs, or the merge inherits a large number of oddities that most Python programmers will not be comfortable with.
I don't see numpy being integrated in python anytime soon. I don't think it would bring much, and one would have to drop performance enhancements that rely on blas/lapack.
I think installing has improved a lot, and once pip + wheel matures, it should be easy to pip install numpy on windows.
For example, from http://mail.scipy.org/pipermail/numpy-discussion/2008-July/0... :
Robert Kern: Your use case isn't so typical and so suffers on the import time end of the balance
Stéfan van der Walt: I.e. most people don't start up NumPy all the time -- they import NumPy, and then do some calculations, which typically take longer than the import time. ... You need fast startup time, but most of our users need quick access to whichever functions they want (and often use from an interactive terminal).
I went back to the topic last year. Currently 25% of the import time is spent building some functions which are then exec'ed. At every single import. I contributed a patch, which has been hanging around for a year. I came back to it last week. I'll be working on an updated patch.
There's also about 7% of the startup time because numpy.testing imports unittest in order to get TestCase, so people can refer to numpy.testing.TestCase. Even though numpy does nothing to TestCase and some of numpy's own unit tests use unittest.TestCase instead. sigh. And there's nothing to be done to improve that case.
Regarding the age - yes, you're right. BTW, parts of PIL started in 1995, making it the oldest widely used package, I think. Do you know of anything older?