If you don't want to be banned, you're welcome to email hn@ycombinator.com and give us reason to believe that you'll follow the rules in the future. They're here: https://news.ycombinator.com/newsguidelines.html.
Rather, the performance issue only occurs when using `rep movsb` on AMD CPUs with certain page/data alignment.
Pymalloc just happens to be using page/data alignment that makes `rep movsb` happy while Rust's default allocator is using alignments that just happen to make `rep movsb` sad.
This has nothing to do with python or rust
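To make "page/data alignment" concrete, here is a minimal Python sketch (not taken from the article) that reports where a buffer's data starts within its memory page. The helper names `page_offset` and `buffer_address` are made up for illustration, and the page size is assumed to be whatever `mmap.PAGESIZE` reports.

```python
import ctypes
import mmap


def page_offset(address: int, page_size: int = mmap.PAGESIZE) -> int:
    """Return how far `address` sits past the start of its page."""
    return address % page_size


def buffer_address(buf: bytearray) -> int:
    """Return the address of the first data byte of a non-empty bytearray."""
    # from_buffer() shares memory with `buf`, so addressof() is the data address.
    view = (ctypes.c_char * len(buf)).from_buffer(buf)
    return ctypes.addressof(view)


if __name__ == "__main__":
    buf = bytearray(64 * 1024)
    addr = buffer_address(buf)
    print(f"data starts at {addr:#x}, page offset {page_offset(addr):#x}")
```

Different allocators will tend to hand back different page offsets for the same request size, which is the property the comment above is describing.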
>...
>Python features three memory domains, each representing different allocation strategies and optimized for various purposes.
>...
>Rust is slower than Python only on my machine.
If one library performs wildly better than the other in the same test, on the same hardware, how can that not be a software-related problem? It sounds like a contradiction.
Maybe it should be considered a coding issue and/or a missing feature? IMHO one would expect Rust's std library to perform well without making every user work around the issue manually.
The article is well investigated, so I assume the author just wants to show that the problem exists without creating controversy; otherwise I can't understand the conclusion.
But since python runtime is written in C, the issue can't be Python vs C.
Not too long ago I read in Intel's optimization guidelines that rep was now faster again and should be used.
Seems most of these things need to be benchmarked on the CPU, as they change "all the time". I've sped up plenty of code by just replacing hand-crafted assembly with functionally equivalent high-level code.
Of course so-slow-it's-bad is different, however a runtime-determined implementation choice would avoid that as well.
Whenever you're writing performance-critical software, you need to consider the relevant combinations of hardware + software + workload + configuration.
Sometimes a problem can be created or fixed by adjusting any one / some subset of those details.
Maybe using an alternative allocator only solves the problem by accident and there's another way to solve it intentionally; I don't yet fully understand the problem. My point is that using a different allocator by default was already tried.
I've honestly never worked in a domain where binary size ever really mattered beyond maybe invoking `strip` on a binary before deploying it, so I try to keep an open mind. That said, this has always been a topic of discussion around Rust[0], and while I obviously don't have anything against binary sizes being smaller, bugs like this do make me wonder about huge changes like switching the default allocator where we can't really test all of the potential side effects; next time, the unintended consequences might not be worth the tradeoff.
[0]: https://hn.algolia.com/?dateRange=all&page=0&prefix=false&qu...
> However, python by default has a small offset when reading memories while lower level language (rust and c)
Yet if the runtime is made with C, then that statement is incorrect.
The point is not that one language is faster than another. The point is that the default way to implement something in a language ended up being surprisingly faster when compared to other languages in this specific scenario due to a performance issue in the hardware.
In other words: on this specific hardware, the default way to do this in Python is faster than the default way to do this in C and Rust. That can be true, as Python does not use C in the default way, it adds an offset! You can change your implementation in any of those languages to make it faster, in this case by just adding an offset, so it doesn't mean that "Python is faster than C or Rust in general".
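The "just add an offset" idea can be sketched in Python roughly like this. It is an illustration only, not the article's benchmark code: `read_with_offset` is a hypothetical helper, and the default of 0x20 bytes is an assumed value chosen to show the shape of the workaround.

```python
import os


def read_with_offset(path: str, offset: int = 0x20) -> bytes:
    """Read a whole file into a buffer whose data starts `offset` bytes in."""
    size = os.path.getsize(path)
    # Over-allocate, then read into a window that starts `offset` bytes
    # past the beginning of the allocation.
    backing = bytearray(size + offset)
    window = memoryview(backing)[offset:]
    filled = 0
    with open(path, "rb", buffering=0) as f:
        while filled < size:
            n = f.readinto(window[filled:])
            if not n:  # 0 means end of file
                break
            filled += n
    return bytes(window[:filled])
```

Whether a given offset helps or hurts depends on the CPU and on where the allocator placed the backing buffer, so this is something to measure rather than assume.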
If we start adding in exceptions at the top of the software stack for individuals failures of specific CPUs/vendors, that seems like a strong regression from where we are today in terms of ergonomics of writing performance-critical software. We can't be writing individual code for each N x M x O x P combination of hardware + software + workload + configuration (even if you can narrow down the "relevant" ones).
That is kind of exactly what you would do when optimising for popular platforms.
If this error occurs on an AMD CPU used by half your users, is your response to them going to be "just buy a different CPU", or are you going to fix it in code and ship a "performance improvement on XYZ platform" update?
Given that the fix is within the memory allocator, there is already a relatively trivial fix for users who really need it (recompile with jemalloc as the global memory allocator).
For everyone else, it's probably better to wait until AMD reports back with an analysis from their side and either recommends an "official" mitigation or pushes out a microcode update.
I guess that in most big companies it suffices that there is a problem with their own software running on the laptop of a C* manager or of somebody close to there. When I was working for a mobile operator the antennas the network division cared about most were the ones close to the home of the CEO. If he could make his test calls with no problems they had the time to fix the problems of the rest of the network in all the country.
And yet here we are again. Shouldn't this be part of some timing testsuite of CPU vendors by now?
During dynamic linking, glibc picks a memcpy implementation which seems most appropriate for the current machine. We have about 13 different implementations just for x86-64. We could add another one for current(ish) AMD CPUs, select a different existing implementation for them, or change the default for a configurable cutover point in a parameterized implementation.
More broadly compatible routines will still work on newer CPUs, they just won't yield the best performance.
It still would be nice if such central routines could just be compiled to the REP-prefixed instructions and would deliver (near-)optimal performance so we could stop worrying about that particular part.
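As a rough picture of what "pick an implementation for the current machine" means, here is a small Python sketch. It is not how glibc does it (glibc uses IFUNC resolvers in C); the strategy names and the selection policy are hypothetical, and the only real-world detail assumed is that Linux lists CPU features such as `fsrm` and `avx2` on the `flags` line of `/proc/cpuinfo`.

```python
def parse_cpuinfo(text: str) -> tuple[str, set[str]]:
    """Extract the vendor_id and flag set of the first CPU in /proc/cpuinfo text."""
    vendor, flags = "", set()
    for line in text.splitlines():
        key, _, value = line.partition(":")
        key = key.strip()
        if key == "vendor_id" and not vendor:
            vendor = value.strip()
        elif key == "flags" and not flags:
            flags = set(value.split())
    return vendor, flags


def pick_copy_strategy(vendor: str, flags: set[str]) -> str:
    """Choose a copy routine name once, up front (hypothetical policy)."""
    # Assumed policy for illustration: trust FSRM unless the vendor is AMD.
    if "fsrm" in flags and vendor != "AuthenticAMD":
        return "rep_movsb"
    if "avx2" in flags:
        return "avx2_vector_loop"
    return "generic"
```

On a Linux box this could be fed with `parse_cpuinfo(open("/proc/cpuinfo").read())`; the point is only that the decision is made once at startup instead of being baked in at compile time.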
I'm not surprised the conclusion had something to do with the way that native code works. Admittedly I was surprised at the specific answer - still a very interesting article despite the confusing start.
Edit: The conclusion also took me a couple of attempts to parse. There's a heading "C is slower than Python with specified offset". To me, as a native English speaker, this reads as "C is slower (than Python) with specified offset" i.e. it sounds like they took the C code, specified the same offset as Python, and then it's still slower than Python. But it's the opposite: once the offset from Python was also specified in the C code, the C code was then faster. Still very interesting once I got what they were saying though.
However I am more interested/concerned about another part. How the issue is reported/recorded and how the communications are handled.
Reporting is done over Discord, a proprietary environment that is not indexed or searchable and will not be archived.
Communications and deliberations are done over Discord and Telegram, which is probably worse than Discord in this context.
This blog post and the GitHub repository are the only lingering remains of them. If Xuanwo had not blogged this, it would have been lost in the timeline.
Isn't this fascinating?
You can provide a public log of them not because they are non-proprietary, but because they have an API that allows logging. Telegram also has such an API, and FWIW our discussion group does have a searchable log that you can access here: https://luoxu-web.vercel.app/#g=1264662201 It is not publicly indexable, more out of privacy concerns; again, not because the platform is proprietary.
The only thing that makes this bug and the debugging process visible is this blog post.
Another point is I don't think IRC or any instant messaging app is the correct place for this kind of discussion. Unless important points are logged to some bug reporting tool, or perhaps a mailing list, or to a blog post like this one, they are useless for historic purposes.
That's why I don't accept the response "but there's Discord now" whenever I moan about USENET's demise. Back in the days before it, every post was nicely searchable by DejaNews (later Google).
We need to get back to open standards for important communications (e.g. all open source projects that are important to the Internet/WWW stack and core programming and libraries).
The accepted fix would not be trivial to anyone not already experienced with the kernel. But more importantly, it isn’t obvious what the right way to enable the workaround is. The best way is probably to measure at boot time; otherwise, how do you know which models and steppings are affected?
If the vendor won't patch it, then a workaround is the next best thing. There shouldn't be many - that's why all copying code is in just a handful of functions.
https://internals.rust-lang.org/t/jemalloc-was-just-removed-...
I am curious if this is something that everyone can do to get free performance or if there are caveats. Can C codebases benefit from this too? Is this performance that is simply left on table currently?
with open('myfile') as f:
    data = f.read()
I'm not much of a C programmer myself, but I at least reported part of the issue to Python: https://bugs.python.org/issue45944
This is the fastest way to read a file in Python that I've found, using only 3-4 syscalls (though os.fstat() doesn't work for some special kernel files like those in /proc/ and /dev/):
import os

def read_file(path: str, size=-1) -> bytes:
    fd = os.open(path, os.O_RDONLY)
    try:
        if size == -1:
            # Doesn't work for some special files (e.g. in /proc/ and /dev/)
            size = os.fstat(fd).st_size
        return os.read(fd, size)
    finally:
        os.close(fd)

Maybe I don’t need to query the file size at all?
Having a hook to get people to want to read the article is reasonable in my opinion; after all, if you could fit every detail in the size of a headline, you wouldn't need an article at all! Clickbait inverts this by _only_ having enough substance that you could get all the info in the headline, but instead it leaves out the one detail that's interesting and then pads it with fluff that you're forced to click and read through if you want the answer.
> In conclusion, the issue isn't software-related. Python outperforms C/Rust due to an AMD CPU bug.
Slack is allocating 1132 GB of virtual memory on my laptop right now. I don't know if they are using mmap but that's 1100 GB more than the physical memory.
Seems it's not without perils on Windows:
"In an ideal world, that would be all we have to say about the new solution. But for Windows users, there's a special quirk. On most operating systems, we can use a special flag to signal that we don't really care if the system has 32 GiB of real memory. Unfortunately, Windows has no convenient way to do this. Dolphin still works fine on Windows computers that have less than 32 GiB of RAM, but if Windows is set to automatically manage the size of the page file, which is the case by default, starting any game in Dolphin will cause the page file to balloon in size. Dolphin isn't actually writing to all this newly allocated space in the page file, so there are no concerns about performance or disk lifetime. Also, Windows won't try to grow the page file beyond the amount of available disk space, and the page file shrinks back to its previous size when you close Dolphin, so for the most part there are no real consequences... "
I'm impressed by your perseverance, how you follow through with your investigation to the lowest (hardware) level.
It's surprising that something as simple as reading a file is slower in the Rust standard library than in the Python standard library. Even knowing that a Python standard library call like this is written in C, you'd still expect the Rust standard library call to be of a similar speed; so you'd expect either that you're using it wrong, or that the Rust standard library has some weird behavior.
In this case, it turns out that neither was the case; there's just a weird hardware performance cliff based on the exact alignment of an allocation on particular hardware.
So, yeah, I'd expect a filesystem read to be pretty well optimized in Python, but I'd expect the same in Rust, so it's surprising that the latter was so much slower, and especially surprising that it turned out to be hardware and allocator dependent.
If I write Python and my code is fast, to me that sounds like Python is fast, I couldn't care less whether it's because the implementation is in another language or for some other reason.
When you see an interpreted language faster than a compiled one, it's worth looking at why, because most of the time it's because there's some hidden issue causing the other to be slow (which could just be a different and much worse implementation).
Put another way, you can do a lot to make a Honda Civic very fast, but when you hear one goes up against a Ferrari and wins your first thoughts should be about what the test was, how the Civic was modified, and if the Ferrari had problems or the test wasn't to its strengths at all. If you just think "yeah, I love Civics, that's awesome" then you're not thinking critically enough about it.
For me, coding is almost exclusively using Python libraries like numpy to call out to other languages like C or FORTRAN. To me, it feels silly to say I'm not coding in Python.
On the other hand, if you're writing those libraries, coding to you is mostly writing FORTRAN and c optimizations. It probably feels silly to say you're coding in Python just because that's where your code is called from.
It's completely fair to say that's not python because it isn't - any language out there can FFI to C and it has the same problems mentioned above.
Pretty much any language can wrap C/Rust code.
Why does it matter?
1. Having to split your code across 2 languages via FFI is a huge pain.
2. You are still writing some Python. There's plenty of code that is pure Python. That code is slow.
Also, when we talk about "faster" and "slower," it's not clear the order of magnitude.
Maybe an analysis of actual code execution would shed more light than a simplistic explanation that the Python interpreter is written in C. I don't think the BASIC interpreter in my first computer was written in BASIC.
What's there to understand? When it's fast it's not really Python, it's C. C is fast. Python can call out to C. You don't have to care that the implementation is in another language, but it is.
99% of my use cases are easily, maintainably solved with good, modern Python. The Python execution is almost never the bottleneck in my workflows. It’s disk or network I/O.
I’m not against building better languages and ecosystems, and compiled languages are clearly appropriate/required in many workflows, but the language parochialism gets old. I just want to build shit that works and get stuff done.
Now why would you expect that?
What happened to OP is a pure chance. CPython's C code doesn't even care about const-consistency. It's flush with dynamic memory allocations, bunch of helper / convenience calls... Even stuff like arithmetic does dynamic memory allocation...
Normally, you don't expect CPython to perform well, not if you have any experience working with it. Whenever you want to improve performance you want to sidestep all the functionality available there.
Also, while Python doesn't have a standard library, since it doesn't have a standard... the library that's distributed with it is mostly written in Python. Of course, some of it comes written in C, but there's also a sizable fraction of that C code that's essentially Python code translated mechanically into C (a good example of this is Python's binary search implementation which was originally written in Python, and later translated into C using Python's C API).
What one would expect is that functionality that is simple to map to operating system functionality has a relatively thin wrapper. I.e. reading files wouldn't require much in terms of binding code because, essentially, it goes straight into the system interface.
I have, several, and it's far from trivial.
The basics are seriously optimized for typical use cases, take a look at the source code for the dict type.
On the other hand… so what? It’s kind of fun.
* https://github.com/jemalloc/jemalloc/issues/387#issuecomment...
* https://gitlab.haskell.org/ghc/ghc/-/issues/17411
Apparently now `jemalloc` will call `MADV_DONTNEED` 10 seconds after `MADV_FREE`: https://github.com/JuliaLang/julia/issues/51086#issuecomment...
So while this "fixes" the issue, it'll introduce a confusing time delay between you freeing the memory and you observing that in `htop`.
But according to https://jemalloc.net/jemalloc.3.html you can set `opt.muzzy_decay_ms = 0` to remove the delay.
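For a program that already links jemalloc, one way to try that setting without rebuilding is the `MALLOC_CONF` environment variable described on the same man page. A minimal sketch, with some assumptions spelled out: `jemalloc_env` is a made-up helper, the binary name in the comment is a placeholder, and builds of jemalloc compiled with a symbol prefix read a differently named variable.

```python
import os


def jemalloc_env(muzzy_decay_ms: int = 0) -> dict[str, str]:
    """Copy the current environment and add a jemalloc MALLOC_CONF setting."""
    env = dict(os.environ)
    # jemalloc parses MALLOC_CONF as comma-separated "key:value" pairs.
    env["MALLOC_CONF"] = f"muzzy_decay_ms:{muzzy_decay_ms}"
    return env


# Hypothetical usage; "./my-server" stands in for a jemalloc-linked binary:
#   subprocess.run(["./my-server"], env=jemalloc_env())
```

This has no effect on a process that uses the system allocator, so it is worth confirming which allocator is actually in use before drawing conclusions from `htop`.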
Still, the musl author has some reservations against making `jemalloc` the default:
https://www.openwall.com/lists/musl/2018/04/23/2
> It's got serious bloat problems, problems with undermining ASLR, and is optimized pretty much only for being as fast as possible without caring how much memory you use.
With the above-mentioned tunables, this should be mitigated to some extent, but the general "theme" (focusing on e.g. performance vs memory usage) will likely still mean "it's a tradeoff" or "it's no tradeoff, but only if you set tunables to what you need".
Example of this: https://github.com/prestodb/presto/issues/8993
And this is not a one-off: https://hackernoon.com/reducing-rails-memory-use-on-amazon-l... https://engineering.linkedin.com/blog/2021/taming-memory-fra...
jemalloc also has extensive observability / debugging capabilities, which can provide a useful global view of the system, it's been used to debug memleaks in JNI-bridge code: https://www.evanjones.ca/java-native-leak-bug.html https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...
If you want to gauge whether your system is memory-limited look at the PSI metrics instead.
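For reference, a small sketch of reading those PSI numbers from Python. It assumes a Linux kernel with PSI enabled, where `/proc/pressure/memory` contains `some` and `full` lines of `key=value` pairs; `parse_psi` and `read_memory_pressure` are names invented here.

```python
def parse_psi(text: str) -> dict[str, dict[str, float]]:
    """Parse PSI text such as 'some avg10=0.00 avg60=0.00 avg300=0.00 total=0'."""
    result = {}
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        kind, fields = parts[0], parts[1:]
        result[kind] = {
            key: float(value)
            for key, value in (field.split("=", 1) for field in fields)
        }
    return result


def read_memory_pressure(path: str = "/proc/pressure/memory") -> dict[str, dict[str, float]]:
    """Read and parse the memory pressure file (Linux with PSI only)."""
    with open(path) as f:
        return parse_psi(f.read())
```

The `avg10`, `avg60` and `avg300` values are percentages of time that tasks were stalled on memory over the last 10, 60 and 300 seconds, which says more about memory pressure than a per-process RES column does.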
Rust used to use jemalloc by default but switched as people found this surprising as the default.
It turns out jemalloc isn't always best for every workload and use case. While the system allocator is often far from perfect, it at least has been widely tested as a general-purpose allocator.
does tend to use more ram tho
> With the new Zen3 CPUs, Fast Short REP MOV (FSRM) is finally added to AMD’s CPU functions analog to Intel’s X86_FEATURE_FSRM. Intel had already introduced this in 2017 with the Ice Lake Client microarchitecture. But now AMD is obviously using this feature to increase the performance of REP MOVSB for short and very short operations. This improvement applies to Intel for string lengths between 1 and 128 bytes and one can assume that AMD’s implementation will look the same for compatibility reasons.
https://www.igorslab.de/en/cracks-on-the-core-3-yet-the-5-gh...
Note that for rep store to be better it must overcome the cost of the initial latency and then catch up to the 32byte vector copies, which yes generally have not-as-good-perf vs DRAM speed, but they aren't that bad either. Thus for small copies.... just don't use string store.
All this is not even considering non-temporal loads/stores; many larger copies would see better perf by not trashing the L2 cache, since the destination or source is often not inspected right after. String stores don't have a non-temporal option, so this has to be done with vectors.
https://man7.org/linux/man-pages/man2/read.2.html
> On success, the number of bytes read is returned (zero indicates end of file), [...] It is not an error if this number is smaller than the number of bytes requested
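Because of that short-read behaviour, a single `os.read(fd, size)` is not guaranteed to return the whole file. A minimal sketch of the usual loop, where `read_exact` is a hypothetical helper name:

```python
import os


def read_exact(fd: int, size: int) -> bytes:
    """Keep calling os.read() until `size` bytes arrive or EOF is reached."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = os.read(fd, remaining)
        if not chunk:  # empty bytes means end of file
            break
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)
```

For regular files on local disks the first read usually returns everything, so the loop costs little; it mostly matters for pipes, sockets and some network or special filesystems.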
FSRM is fast on Intel, even with single byte strings. AMD claims to support FSRM with recent CPUs but performs poorly on small strings, so code which Just Works on Intel has a performance regression when running on AMD.
Now here you're saying `REP MOVSB` shouldn't be used on AMD with small strings. In that case, AMD CPUs shouldn't advertise FSRM. As long as they're advertising it, it shouldn't perform worse than the alternative.
https://bugs.launchpad.net/ubuntu/+source/glibc/+bug/2030515
https://sourceware.org/bugzilla/show_bug.cgi?id=30994
I'm not a CPU expert so perhaps I'm misinterpreting you and we're talking past each other. If so, please clarify.
Setting the env var MALLOC_MMAP_THRESHOLD_=65536 usually solves these problems instantaneously.
Most programmers seem to not bother to understand what is going on (thus arriving at the above solution) but follow "we switched to jemalloc and it fixes the issue".
(I have no opinion yet on whether jemalloc is better or worse than glibc malloc. Both have tunables, and will create problematic corner cases if the tunables are not set accordingly. The fact that jemalloc has /more/ tunables, and more observability / debugging features, seems like a pro point for those that read the documentation. For users that "just want low memory usage", both libraries' defaults look bad, and the musl attitude seems like the best default, since OOM will cause a crash vs just having the program be some percent slower.)
I don't fully agree. IMs are a great place to discuss issues in a semi-synchronous way. Telecon or face-to-face meetings are sometimes better in velocity, but IMs have an edge in bringing random people who happen to be online into the discussion. And they can also bring a different audience into the issue than bug reporting tools or mailing lists.
When this issue was brought into the group, it just took several hours for curious people there to collaboratively find the conclusion. This is something unlikely to happen in any other form of discussions based on my experience.
But I agree that group chat is not a great way to record it, and that's why the findings are recorded on the GitHub issue, and group members also encouraged the author to write this up. Then it got posted on HN and on /r/rust by two different group members as well. (The author's initial posting on HN was mysteriously taken down, so the op here helped posting it again.)
If these are already in place I don't have any reservations against using IMs or Discord. In fact this is particularly great sample how this can be done.
- Bug report is in place
- Blog/article for historic events, documentation and pointers to related info
- Fast communication for debug sessions
I hope you understand my original message was about pointing out situations where the only history left is the leftovers of these chat tools.
There are lots of communities right now just using discord or IM for support, bug reporting or development purposes.
You can see cache usage in htop; it has a different colour.
With MADV_FREE, it looks like the process is still using the memory.
That sucks: If you have some server that's slow, you want to SSH into a server and see how much memory each process takes. That's a basic, and good, observability workflow. Memory leaks exist, and tools should show them easily.
The point of RES is to show resident memory, not something else.
If you change htop to show the correct memory, that'd fix the issue of course.
It's a case of people using the subtly wrong metrics and then trying to optimize tools chasing that metric rather than improving their metrics. That's what I'm calling misguided.
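To see how much of a process's resident size is lazily freed memory, one option on Linux is to read the `LazyFree` field next to `Rss` in `/proc/<pid>/smaps` or `smaps_rollup`. A rough sketch, assuming those fields are present on the running kernel and that lazily freed pages are still counted in `Rss`; `parse_smaps_kb` is a name made up here.

```python
def parse_smaps_kb(text: str, fields: tuple[str, ...] = ("Rss", "LazyFree")) -> dict[str, int]:
    """Sum the selected 'Name:   123 kB' fields from smaps-style text."""
    totals = {name: 0 for name in fields}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if sep and key in totals:
            parts = rest.split()
            if len(parts) == 2 and parts[1] == "kB":
                totals[key] += int(parts[0])
    return totals


# Hypothetical usage on Linux:
#   with open("/proc/self/smaps_rollup") as f:
#       totals = parse_smaps_kb(f.read())
#   print(totals["Rss"] - totals["LazyFree"], "kB resident excluding lazily freed pages")
```

That difference is closer to what people usually mean by "memory this process is holding on to" than the raw RES column.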
The kernel has self-patching mechanisms for doing effectively the same thing (although I don't know if it has ever been applied to memcpy before).
It's actually the opposite, a Python programmer should know how to offload most, or use the libraries that do so, out of Python into C. He should not be oblivious to the fact that any decent Python performance is due to shrinking down the ratio of actual Python instructions vs native instructions.
I noticed that you're pretty hard in the "basic isn't fast, the thing it transpiles to is fast" camp, but still accidentally said "there is a version of BASIC [...] that is lightning fast" which I'm not sure you think? Highlights just how tricky it is to talk about where speed lives
There is clear distinction between original language design (an interpreter) and a project aiming to recreate a sub-standard of that language and support its legacy codebase via a transpiler.
% ps aux | sort -k5 -rh | head -1
xxxxxxxx 88273 1.2 0.9 1597482768 316064 ?? S 4:07PM 35:09.71 /Applications/Slack.app/Contents/Frameworks/Slack Helper (Renderer).app/...
Since ps displays vsz column in KiB, 1597482768 corresponds to 1TB+.
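The unit conversion is easy to check. A tiny sketch, taking the stated assumption that the column is in KiB; `kib_to_tib` is just an illustrative helper:

```python
def kib_to_tib(kib: int) -> float:
    """Convert kibibytes to tebibytes (1 TiB = 1024**3 KiB)."""
    return kib / 1024**3


vsz_kib = 1597482768  # VSZ value from the ps output above
print(f"{kib_to_tib(vsz_kib):.2f} TiB")  # prints "1.49 TiB"
```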
That's nonsensical. Rust uses the system allocators for reliability, compatibility, binary bloat, maintenance burden, ..., not because they're good (they were not when Rust switched away from jemalloc, and they aren't now).
If you want to use mimalloc in your rust programs, you can just set it as global allocator same as jemalloc, that takes all of three lines: https://github.com/purpleprotocol/mimalloc_rust#usage
If you want the rust compiler to link against mimalloc rather than jemalloc, feel free to test it out and open an issue, but maybe take a gander at the previous attempt: https://github.com/rust-lang/rust/pull/103944 which died for the exact same reason the one before that (https://github.com/rust-lang/rust/pull/92249) did: unacceptable regression of max-rss.
1. Reliability - how is an alternate allocator less reliable? Seems like a FUD-based argument. Unless by reliability you mean performance in which case yes - jemalloc isn’t reliably faster than standard allocators, but mimalloc is.
2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator? You don’t even have to do it on all systems if you want. Glibc is just unequivocally bad.
3. Binary bloat - This one is maybe an OK argument although I don’t know what size difference we’re talking about for mimalloc. Also, most people aren’t writing hello world applications so the default should probably be for a good allocator. I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.
4. Maintenance burden - I don’t really buy this argument. In both cases you’re relying on a 3rd party to maintain the code.
You can find them at the original motivation for removing jemalloc, 7 years ago: https://github.com/rust-lang/rust/issues/36963
Also it's not "glibc's allocator", it's the system allocator. If you're unhappy with glibc's, get that replaced.
> 1. Reliability - how is an alternate allocator less reliable?
Jemalloc had to be disabled on various platforms and architectures, there is no reason to think mimalloc or tcmalloc are any different.
The system allocator, while shit, is always there and functional, the project does not have to curate its availability across platforms.
> 2. Compatibility - again sounds like a FUD argument. How is compatibility reduced by swapping out the allocator?
It makes interactions with anything which does use the system allocator worse, and almost certainly fails to interact correctly with some of the more specialised system facilities (e.g. malloc.conf) or tooling (in rust, jemalloc as shipped did not work with valgrind).
> Also, most people aren’t writing hello world applications
Most people aren't writing applications bound on allocation throughput either
> so the default should probably be for a good allocator.
Probably not, no.
> I’d also note that having a dependency of the std runtime on glibc in the first place likely bloats your binary more than the specific allocator selected.
That makes no sense whatsoever. The libc is the system's and dynamically linked. And changing allocator does not magically unlink it.
> 4. Maintenance burden - I don’t really buy this argument.
It doesn't matter that you don't buy it. Having to ship, resync, debug, and curate (cf (1)) an allocator is a maintenance burden. With a system allocator, all the project does is ensure it calls the system allocators correctly, the rest is out of its purview.
Conversely, you can have pure C code just using PyObjects (this is effectively what Cython does), with the Python bytecode interpreter completely out of the picture. But the perf improvement is nowhere near what people naively expect from compiled code, usually.
The only thing that makes sense to compare when talking about pythons performance is how many instructions it needs to compute something, versus the instructions needed to compute the same thing in C. Those are probably a few orders of magnitude apart.
In the memcpy case, where the library call is probably in a dynamically linked library anyway, it's particularly trivial to bind to one of N implementations of memcpy at load time. That only patches code if library calls are usually implemented that way.
Patching .text does tend to mess up using the same shared pages across multiple executables though which is a shame, and somewhat argues for install time specialisation.
Python is well micro-optimized, but the broader architecture of the language and especially the CPython implementation did not put much concern into performance, even for a dynamically typed scripting language. For example, in CPython values of built-in types are still allocated as regular objects and passed by reference; this is atrocious for performance and no amount of micro optimization will suffice to completely bridge the performance gap for tasks which stress this aspect of CPython. By contrast, primitive types in Lua (including PUC Lua, the reference, non-JIT implementation) and JavaScript are passed around internally as scalar values, and the languages were designed with this in mind.
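The boxing cost is visible from inside the interpreter. A small sketch, assuming CPython (other implementations may report different numbers or not support `sys.getsizeof` the same way); both helper names are invented for illustration.

```python
import sys


def int_object_size(value: int) -> int:
    """Bytes used by the object that holds `value`, header included."""
    return sys.getsizeof(value)


def list_slot_bytes(n: int = 1000) -> float:
    """Approximate bytes per list slot; each slot is a pointer to an object."""
    return (sys.getsizeof([0] * n) - sys.getsizeof([])) / n


if __name__ == "__main__":
    print("small int object:", int_object_size(1), "bytes")
    print("list slot:", list_slot_bytes(), "bytes (a pointer, not the value)")
```

So a list of integers is an array of pointers to separately allocated objects, which is the layout the comment above contrasts with languages that pass primitives around as scalar values.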
Perl is similar to Python in this regard--the language constructs and type systems weren't designed for high primitive operation throughput. Rather, performance considerations were focused on higher level, functional tasks. For example, Perl string objects were designed to support fast concatenation and copy-on-write references, optimizations which pay huge dividends for the tasks for which Perl became popular. Perl can often seem ridiculously fast for naive string munging compared to even compiled languages, yet few people care to defend Perl as a performant language per se.
No, because "scripting language" is not a thing.
But, if we are talking about implementing languages, then I worked with many language implementations. The most comparable one that I know fairly well, inside-and-out would be the AVM, i.e. the ActionScript Virtual Machine. It's not well-written either unfortunately.
I've looked at implementations of Lua, Emacs Lisp and Erlang at different times and to various degree. I'm also somewhat familiar with SBCL and ECL, the implementation side. There are different things the authors looked for in these implementations. For example, SBCL emphasizes performance, where ECL emphasizes simplicity and interop with C.
If I had to grade language implementations I've seen, Erlang would absolutely take the cake. It's a very thoughtful and disciplined program where authors went to a great length to design and implement it. CPython is on the lower end of such programs. It's anarchic, very unevenly implemented, you run into comments testifying to the author not knowing what they are doing, what their predecessor did, nor what to do next. Sometimes the code is written from that perspective as well, as in if the author somehow manages to drive themselves in the corner they don't know what the reference count is anymore, they'll just hammer it until they hope all references are dead (well, maybe).
It's the code style that, unfortunately, I associate with proprietary projects where deadlines and cost dictate the quality, where concurrency problems are solved with sleeps, and if that doesn't work, then the sleep delay is doubled. It's not because I specifically hate code being proprietary, but because I meet that kind of code in my day job more than I meet it in hobby open-source projects.
> take a look at the source code for the dict type.
I wrote a Protobuf parser in C with the intention of exposing its bindings to Python. Dictionaries were a natural choice for the hash-map Protobuf elements. I benchmarked my implementation against C++ (Google's) implementation only to discover that std::map wins against Python's dictionary by a landslide.
Maybe Python's dict isn't as bad as most of the rest of the interpreter, but being the best of the worst still doesn't make it good.
SBCL is definitely a different beast.
I would expect Emacs Lisp & Lua to be more similar.
Erlang had plenty more funding and stricter requirements.
C++'s std::map has most likely gotten even more attention than Python's dict, but I'm not sure from your comment if you're including Python's VM dispatch in that comparison.
What are you trying to prove here?
There is no such thing as an interpreted language. A language implementation can be called an interpreter to emphasize its reliance on a rich existing library, but there's no real line here that can divide languages into two unambiguous categories. So... is C an "interpreted language"? -- well, in a certain light it is, since it calls into libc for a lot of functionality, therefore libc can be thought of as its interpreter. Similarly, machine code is often said to be interpreted by the CPU, when it translates it to microcode and so on.
> prioritizes convenience over performance
This has nothing to do with scripting. When the word "scripting" is used, it's about the ability to automate another program, and record this automation as a "script". Again, this is not an absolute metric that can divide all languages or their implementations into scripting and not-scripting. When the word "scripting" is used properly it is used to emphasize the fact that a particular program is amenable to automation by means of writing other programs, possibly in another language.
Here are some fun examples to consider. For example, MSBuild, a program written in C# AFAIK, can be scripted in C# to compile C# programs! qBittorrent, a program written in Python can be scripted using any language that has Selenium bindings because qBittorrent uses Qt for the GUI stuff and Qt can be automated using Selenium. Adobe Photoshop (used to be, not sure about now) can be scripted in JavaScript.
To give you some examples which make your claim ridiculously wrong: Forth used to be used in the Solaris bootloader to automate the kernel loading process, i.e. it was used as a scripting language for that purpose, however most mature Forth implementations are aiming for the same performance bracket as C. You'd also be hard-pressed to find a lot of people who think that Forth is a very convenient language... (I do believe it's fine, but there may be another five or so people who believe it too).
---
Basically, your ideas about programming language taxonomies are all wrong and broken... sorry. Not only did you misapply the labels, you don't even have any good labels to begin with.
Anyways,
> What are you trying to prove here?
Where's here? Do you mean the original comment or the one that mentions std::map?
If the former: I'm trying to prove that CPython is a dumpster fire of a program. That is based on many years of working with it and quite extensive knowledge of its internals, of which I already provided examples.
If it is the latter: parent claimed something about how optimized Python's dictionary is, I showed that it has a very long way to go to be in the category of good performers. I.e. optimizing something, no matter how much, doesn't mean that it works well.
I don't know what you mean by Python's VM dispatch in this context. I already explained that I used the Python C API for dictionaries, namely this: https://docs.python.org/3/c-api/dict.html . It's very easy to find equivalent functionality in std::map.
The evidence of how absurd your claim is is right in front of you: Google's implementation of Protobuf uses std::map for dictionaries, and these dictionaries are exposed to Python. But, following your argument this... shouldn't be possible?
To better understand the difference: Python dictionary stores references to Python objects, but it doesn't have to. It could, for example, take Python strings and use C character arrays for storage, and then upon querying the dictionary convert them back to Python str objects. Similarly with integers for example etc.
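To make the "convert at the boundary" idea concrete, here is a minimal sketch in pure Python. The class name `NativeStrDict` is made up for illustration, and the inner `dict` of `bytes` is only a stand-in for the C character arrays a real extension would hold, so this shows the shape of the interface and says nothing about speed.

```python
from collections.abc import MutableMapping


class NativeStrDict(MutableMapping):
    """Hypothetical mapping that stores str keys/values as UTF-8 bytes."""

    def __init__(self):
        # Stand-in for native storage; a C extension would keep char arrays here.
        self._store = {}

    def __setitem__(self, key, value):
        # Convert Python str objects to raw bytes on the way in.
        self._store[key.encode("utf-8")] = value.encode("utf-8")

    def __getitem__(self, key):
        # Convert back to a Python str only when the caller asks for the value.
        return self._store[key.encode("utf-8")].decode("utf-8")

    def __delitem__(self, key):
        del self._store[key.encode("utf-8")]

    def __iter__(self):
        return (k.decode("utf-8") for k in self._store)

    def __len__(self):
        return len(self._store)


d = NativeStrDict()
d["name"] = "protobuf"
print(d["name"], len(d), list(d))
```

Because it implements the abstract methods of `MutableMapping`, callers get `in`, `.get()`, `.items()` and friends for free, and never see that the stored form is not a Python `str`.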
Why is this not done -- I don't know. Knowing how many other things are done in Python, I'd suspect that this isn't done because nobody bothered to do it. It also feels too hard and too unrewarding to patch a single class of objects, even one as popular as dictionaries. If you go for this kind of optimization, you want it to be systematically and uniformly applied to all the code... and that's, I guess, how Cython came to be, for example.
Way to miss the mark. The point is precisely that Python is slow and one of the causes is that it is a scripting language. Stomping your foot and essentially saying "You couldn't do any better" helps no one and is counterproductive.
The C components initiate the system call and manage the file pointer, which loads the data from the disk into a pyobj string.
Therefore, it isn't so much Python itself that is being tested, but rather Python's underlying C runtime.
> When you see an interpreted language faster than a compiled one, it's worth looking at why, because most the time it's because there's some hidden issue causing the other to be slow (which could just be a different and much worse implementation).
On the contrary, the compiled languages tend to only be faster in trivial benchmarks. In real-world systems the Python-based systems tend to be faster because their developers haven't had to spend so long twiddling which integers they're using and debugging crashes and memory leaks, and got to spend more time on the problem.
So, like in most things, the details can sometimes matter quite a bit.
Code that has lots of attention is different, certainly, but it's also the exception rather than the rule; the last figure I saw was that 90% of code is internal business applications that are never even made publicly available in any form, much less subject to outside code review or contributions.
> As time spent on the project increases, I suspect that any gain an interpreted language has over an (efficient) compiled one not only gets smaller, but eventually reverses in most cases.
In terms of the limit of an efficient implementation (which certainly something like Python is nowhere near), I've seen it argued both ways; with something like K the argument is that a tiny interpreter that sits in L1 and takes its instructions in a very compact form ends up saving you more memory bandwidth (compared to what you'd have to compile those tiny interpreter instructions into if you wanted them to execute "directly") than it costs.
This is an interesting premise.
Python in particular gets an absolute kicking for being slow. Hence all the libraries written in C or C++ then wrapped in a python interface. Also why "python was faster than rust at anything" is headline worthy.
I note your claim is that python systems in general tend to be faster (outside of trivial benchmarks, whatever the scope of that is). Can you cite any single example where this is the case?
Plenty of line-of-business systems I've seen, but systems big enough to matter tend not to be public. Bitbucket's cloud and on-prem versions are the only case I can think of where you can directly compare something substantial between an implementation known to be written in Python and an implementation that's known to be written in C/C++ (and even then I'm not 100% sure that that's what they use).
But the Zen3/4 were developed far, far after the PyObject header...
Just like in this article. The author measured, wondered, investigated, experimented, and finally, after a lot of hard work, made the C/Rust programs faster. You wouldn't call that luck, would you? If there had been a similar performance regression in CPython, then a benchmark could have picked up on it, and the CPython developers would then have done the same.
> It makes interactions with anything which does use the system allocator worse
That’s a really niche argument. Most people are not doing any of that and malloc.conf is only for people who are tuning the glibc allocator which is a silly thing to do when mimalloc will outperform whatever tuning you do (yes - glibc really is that bad).
> or tooling (in rust, jemalloc as shipped did not work with valgrind)
That’s a fair argument, but it’s not an unsolvable one.
> Most people aren’t writing applications bound on allocation throughput either
You’d be surprised at how big an impact the allocator can make even when you don’t think you’re bound on allocations. There’s also all sorts of other things beyond allocation throughput & glibc sucks at all of them (e.g. freeing memory, behavior in multithreaded programs, fragmentation etc etc).
> The libc is the system’s and dynamically linked. And changing allocator does not magically unlink it
I meant that the dependency on libc at all in the standard library bloats the size of a statically linked executable.
Performance of rustc matters a lot! If the rust compiler runs faster when using mimalloc, please benchmark & submit a patch to the compiler.
Suggests that it should be usable for even shorter copies. And that's really my point. We should have One True memcpy instruction sequence that we use everywhere and stop worrying. And yet...
But, yeah it does seem that my 128 bytes of a quick search was wrong. (though, gcc & clang for '-march=alderlake' both never generate 'rep movsb' on '-O3'; on `-Os` gcc starts giving a rep movsb for ≥65B, clang still never does)
There's a paper on this you might like. https://www.researchgate.net/publication/2749121_When_are_By...
I think there's something to the idea of keeping the program in the instruction cache by deliberately executing parts of it via interpreted bytecode. There should be an optimum around zero instruction cache misses, either from keeping everything resident, or from deliberately paging instructions in and out as control flow in the program changes which parts are live.
There are complicated tradeoffs between code specialisation and size. Translating some back and forth between machine code and bytecode adds another dimension to that.
I fear it's either the domain of extremely specialised handwritten code - luajit's interpreter is the canonical example - or of the sufficiently smart compiler. In this case a very smart compiler.
> On certain platforms, it would break code signatures
macOS?
Personally I have plenty of RAM and I'd happily use more in exchange for a faster compile. It's much cheaper to buy more RAM than a faster CPU, but I certainly understand the choice.
With compilers I sometimes wonder if it wouldn't be better to just switch to an arena allocator for the whole compilation job. But it wouldn't surprise me if LLVM allocates way more memory than you'd expect.
Also: If you're going to prove that changes informed by performance measurements are absent from the commit logs, then you'll need to look in the logs for all the relevant places, which means also looking at I/O and bytes and allocator code.
And the reason why the object model is the way it is, is because it's an entrenched part of the Python ABI. Sure, if you break that, you can do things a lot faster - this isn't news, people have been doing this with projects like Jython and IronPython that can work a lot faster. But the existing ecosystem of packages is so centered on CPython that this approach has proven to be self-defeating - you end up with a Python implementation that very few people actually use.
So, no, it's not because people are "very confused" or "nobody bothered to do it". It's because compatibility matters.
No. You don't need the Python object model when implementing Python dictionary. You have evidence right in front of you: std::map bindings are successfully used in its place.
Why even keep arguing about this?
In fact, you can implement your own dictionary, and if you expose all the same mapping protocol, it will work the same as the built-in one. Do you have to use Python objects for this? -- absolutely no. You can convert at the interface boundary. Experience shows that this works noticeably better than using Python objects all the way. Why did the original CPython developers not do it? -- I don't know, can only guess. I already wrote what my guess is. And, in all sincerity, CPython has a lot more and a lot worse problems. Compared to the rest of the codebase, the dictionary object is fine. So, if anyone would seriously consider improving CPython's performance they wouldn't touch dictionaries, at least not at first.
And this part:
> if anyone would seriously consider improving CPython's performance they wouldn't touch dictionaries, at least not at first.
is just straight up nonsense, given how many times over Python's history dicts have been substantially rewritten. As it happens, I work on Python dev tooling, and the CPython team changing internal data structures for perf reasons has been a recurring headache for me, so I know full well what I'm talking about here.
ordered in a slightly weird way
Do you mean "insertion ordered"? That means the order of iteration is guaranteed to match insertion order. C++'s std::map is ordered by key (less-than comparison) and is typically implemented as a balanced binary search tree, so iteration will always be in key order. C++'s std::unordered_map has no ordering guarantees (that I know of). I don't think the C++ standard library has an equivalent of a modern Python dict or Java's LinkedHashMap. Does anyone know if that is incorrect?
In most cases std::unordered_map will be faster, but hashtables have nasty edge cases and are usually more expensive to create.
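On the ordering question, a small Python illustration of the two iteration orders being discussed. The keys are arbitrary sample data, and `sorted()` is used here only to mimic what walking a key-ordered container such as std::map would yield; it is not how std::map works internally.

```python
inserted = {}
for key in ["pear", "apple", "fig"]:
    inserted[key] = len(key)

# dict iterates in insertion order (guaranteed since Python 3.7).
insertion_order = list(inserted)

# A key-ordered container would instead yield the keys sorted by comparison.
key_order = sorted(inserted)

print(insertion_order)
print(key_order)
```

The first list follows the order the keys were added, the second follows the keys' own sort order, which is the distinction between a modern Python dict and std::map.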
I can pretty much guarantee it's been optimized to hell and back.