Coroutines for Go(research.swtch.com) |
Coroutines for Go(research.swtch.com) |
The use case motivating all the complexity is function iterators, where `range` can be used on functions of type `func() (T, bool)`. That has been discussed in the Go community for a long time, and the semantics would be intuitive/obvious to most Go programmers.
This post addresses the next thing: Assuming function iterators are added to the language, how do I write one of these iterators that I can use in a for loop?
It starts by noticing that it is often very easy to write push iterators, and builds up to a push-to-pull adapter. It also includes a general purpose mechanism for coroutines, which the adapter is built on.
If all of this goes in, I think it will be bad practice to use coroutines for things other than iteration, just like it's bad practice to use channels/goroutines in places where a mutex would do.
Google has internal kernel patches that implement cooperating multithreading this way (internally they're called fibers), and the patches exist for precisely this reason: better latency and more predictable scheduling behavior. Paul Turner gave a presentation about this at LPC ~10 years ago that explains more about the motivation for the patches and why they improve efficiency: https://www.youtube.com/watch?v=KXuZi9aeGTw
If programmers try to manually implement iterators, generators and interleaved state machines with their own goroutines and channels, it's not just performance that suffers - there is too much room for error.
[1] I'm using the qualifier "true" here, since many modern languages (such as Python, Kotlin) use the term "coroutines" for something that is more like Go's Goroutines than Lua's coroutines. Unlike Go, they are not preemptible, but they are (at least by default) implicitly resumed when necessary by some scheduler, and they may execute on different kernel threads and switch contexts.
for {
next := getNext()
...
}
What is the advantage of writing this as: for next := range getNext {
...
} getNext := iterableThing.Iterator()
for {
next, ok := getNext()
if !ok {
break
}
...
}
vs. for next := range iterableThing.Iterator() {
...
}
One advantage is that it's slightly shorter, which matters for very common patterns--people complain about `err != nil` after all. Another advantage is there isn't another variable for everyone to name differently. Another advantage is that everyone can't do the control flow slightly differently either.The harder case to deal with is a "push iterator", which are often much simpler to implement, but less flexible to use. See https://github.com/golang/go/discussions/56413
The OP is about converting a push iterator into a pull iterator efficiently by using coroutines. This provides the best of both worlds, simple iterator implementation and flexible use by its caller.
...
if getNext() == nil {
break
}
...
This isn't a huge boon, and is mostly a matter of style. But I prefer the latter because it's logic that gets handled with the range builtin, so my code can be more about application logic, rather than muddling in breaking out of loop logic.The point of coroutines is that they are little contexts that serve to be called
- many times
- sometimes with different parameters
- change state
- might be interrupted by callers
- might interrupt callers themselves
All of this can be done by other means. Just like any sorting can be done by copy pasting the same code, but generics make it less tedious. That's the same idea here. Some problems can be implemented as interleaving coroutines, and their definition is simple enough that you want to write it all in the body of some CreateCoroutine() function instead of taking out another struct with 5 almost empty functions. It will not solve all problems, but can more clearly separate business logic and orchestration.
for {
next := <-channel
...
}I still don’t see the need for coroutines.
Like, no careful thinking and good 80/20 solution this time. Just "huh, we'd need coroutines to do this `right` so let's just do that"
When they added generics, they really, really thought long and hard, and came up with a compromise that was brilliant and innovative in the balance it struck.
I would have hoped to see something like that here, like "we're adding this one feature to goroutines to have control in specific situations" feels like something that would be better than "we're going full Rust on this problem and just adding it."
https://github.com/golang/go/issues/43557
https://github.com/golang/go/discussions/56413
I'm not sure Russ's personal blog is any kind of official statement "this is what we're doing" yet?
The go keyword nicely prevents the annoying function coloring problem, which causes quite a bit of pain.
Sometimes in high performance contexts I'd like to be able to do something like e.g. per CPU core data sharding, but this proposal doesn't scratch those kinds of itches.
But those were too much.
So we got threads, which are processes that share an address space, file table, and some other things. The scheduler can switch from one to the other more easily than between processes, and data can be shared between threads without needing serialization.
But those were too much.
So we got user space threads, which are logical threads of execution that are driven by a runtime entirely in user space. The runtime adds scheduling hooks into all I/O functions in the standard library, or even uses a system API like Unix signals to preempt logical threads. No system-level context switching is needed. User space threads can be tiny.
But those were too much.
So we got coroutines, which allow a programmer to define logical "threads" of execution that cooperatively interact with each other. There is no assumption about the presence of a scheduler. The programmer either writes their own event loop or invokes one from a library in a "real" logical thread.
I wonder what comes next. As far as [communicating sequential processes][1] are concerned, maybe cooperative coroutines are a low as you can go.
I thought go's `insert resumes at call points and other specific places` design decision was a very nice compromise.
This is allowing access to more and more of the metal. At what point are we just recreating Zig here? What's next? An optional garbage collector?
x := co func(){
var z int
for {
z++
yield z
}
}
y := x()
for y := range x {
...
}
or something to that effect. It's cool that it can be done at all in pure go, and I can see the appeal of having a standard library package for it with an optimized runtime instead of complecting the language specification. After all, if it's possible to do in pure go, then other implementations can be quickly bootstrapped.My $0.02, as someone that uses go at $work daily: I'd be happy to have either, but I'd prefer it baked into the language. Go's concurrency primitives have always been a strength, just lean into it.
Further, it doesn't seem to me to allow you to do anything you can't currently do with blocking channels and/or state.
I’ve used iterators similar to what’s described in this article to avoid allocations in critical code paths, but this would make those much less awkward to use (particularly with the upcoming range iterator language change).
> func Pull[V any](push func(yield func(V) bool)) (pull func() (V, bool), stop func()) {
> copush := func(more bool, yield func(V) bool) V {
The main power of Go to me was always quickly being able to read and understand code. This type of coding has a lot of cognitive load to a reader, I feel.
Channels are slow for no good reason when they aren't wrapping blocking operations.
- Many people consider coroutines and green threads to be more or less the same thing, when they both have their pros and cons.
- The fact that the omission of iterators is even acceptable in the Go community saddens me. They seem to deliberately refuse any feature that might make the language even slightly more complex, in the name of simplicity. But hey, at least they retracted their opinion on generics.
I'm again reminded that Go is not my language.
On the other side, given my experience with .NET and C++ co-routines, and Active Objects (in Symbian C++ and Active Oberon) not sure if this is really something to add to Go.
Even the .NET team has acknowledged at this year's BUILD, that if they could go back in time having the runtime handle them Go-style would probably been a better decision, given how many developers keep having issues understanding async/await.
ch := make(chan int)
go func() {
for i := 0; i < 100; i += 1 {
ch <- i
}
close(ch)
}()
for i := range ch {
fmt.Println(i)
}
That is very simple.* Mutex lock+unlock: 10ns
* Chan send buffered: 21ns
* Try send (select with default): 3.5ns
Missing from here is context switches.
In either case, the overhead is proportional to how fast each iteration is. I have channels of byte slices of 64k and the channel ops don’t even make a dent compared to other ops, like IO.
You should absolutely use channels if it’s the right tool for the job.
Fwiw, I wouldn’t use channels for “generators” like in the article. I believe they are trying to proof-of-concept a language feature they want. I have no particular opinion about that.
At the same time, I am running the Erlang observer beside it to watch what happens to the CPU and memory consumption and how quickly it recovers/cleans up the garbage.
The biggest bottleneck here is the terminal's ability to keep up, but the observer seems to reflect what's happening accurately.
https://www.youtube.com/watch?v=yxyYKnashR0
The code I used: https://gist.github.com/pmarreck/4cc8f2f55a561ebce2012085a3a...
These features have been built into Erlang (and thus Elixir) since the 1980's. I'm sure many of you have heard of the Actor model and/or Erlang's "legendary" implementation of it, but I don't know how many have actually seen it in action with monitoring kit running.
I think it would be great for Go if it offered language-level support like this, but given the extremely resource-efficient implementation (both in spawning and runtime consumption) of threads on the BEAM VM, coupled with the ease of concurrency which comes directly from only permitting immutable values, I don't think it will ever be matched.
Just the question of whether one should use a goroutine or a coroutine adds complexity.
Goroutines were kind of the raison detre' for using go. But using them wasn't simple, and instead often goroutines brought their own issues. See here:
https://songlh.github.io/paper/go-study.pdf
Often a programming language takes a first guess at the problems they want to solve, and often get them wrong. C++ is probably the most notable language in this category here.
That said, I do appreciate an attempt to improve programming languages even if it undermines the primary feature of the language itself.
The natural solution to this is some sort of iterator, as in Python or other languages. (Contra frequent accusations to the contrary, the Go community is aware of other language's efforts.)
So this has opened the can of worms of trying to create an iteration standard for Go.
Go has something that has almost all the semantics we want right now. You can also "range" over a channel. This consumes one value at a time from the channel and provides it to the iteration, exactly as you'd expect, and the iteration terminates when the channel is closed. It just has one problem, which is that it involves a full goroutine and a synchronized channel send operation for each loop of the iteration. As I said in another comment, if what is being iterated on is something huge like a full web page fetch, this is actually fine, but no concurrency primitive can keep up with the efficiency of incrementing an integer, a single instruction which may literally take an amortized fraction of a cycle on a modern processor. With generics you can even relatively implement filter, map, etc. on this iterator... but adding a goroutine and synchronized commit for each such element of a pipeline is just crazy.
I believe the underlying question in this post is, can we use standard Go mechanisms to implement the coroutines without creating a new language construct, then use the compiler under the hood to convert it to an efficient execution? Basically, can this problem be solved with compiler optimizations rather than a new language construct? From this point of view, the payload of this article is really only that very last paragraph; the entire rest of the article is just orientation. If so, then Go can have coroutine efficiency with the standard language constructs that already exist. Perhaps some code that is using this pattern goroutine already might speed up too "for free".
The concerns people have about this complexifying Go, the entire point of this operation is to suck the entire problem into the compiler with 0 changes to the spec. Not complexifying Go with a formal iteration standard is the entire point of this operation. If one wishes to complain, the correct complaint is the exact opposite one, that Go is not "simply" "just" implementing iterators as a first class construct just like all the other languages.
Also, in the interests of not posting a full new post, note that in general I shy away from the term "coroutine" because a coroutine is what this article describes, exactly, and nothing less. To those posting "isn't a goroutine already a coroutine?", the answer is, no, and in fact almost nothing called a coroutine by programmers nowadays actually is. The term got simplified down to where it just means thread or generator as Python uses the term, depending on the programming community you're looking at, but in that context we don't need to use the term "coroutine" that way, because we already have the word "thread" or "generator". This is what "real" coroutines are, and while I won't grammatically proscribe to you what you can and can not say, I will reiterate that I personally tend to avoid the term because the conflation between the sloppy programmer use and the more precise academic/compiler use is just confusing in almost all cases.
Currently, you could use goroutines and channels to implement a nice way to provide general iteration support for user-defined data structures, but because of the overhead, people most often opt for clunkier solutions. This change would give us the best of both worlds.
Lua is an absolute work of art. Everything about the tiny language, how it works, and even all the little peculiarities, just makes sense.
Most languages use different data types but with some API overlap, e.g. maps and arrays are both "iterable". Lua goes too far, I think, and tries to make them the exact same, a data type they call "table".
One side-effect is that you have some operations that only really make sense for maps or lists, but since they work on all tables, they're defined awkwardly, e.g:
> The length of a table t is defined to be any integer index n such that t[n] is not nil and t[n+1] is nil; moreover, if t[1] is nil, n can be zero. For a regular array, with non-nil values from 1 to a given n, its length is exactly that n, the index of its last value. If the array has "holes" (that is, nil values between other non-nil values), then #t can be any of the indices that directly precedes a nil value (that is, it may consider any such nil value as the end of the array).
Afterwards, I realized there was a package called aiohttp that I could've used, but too late.
I'll be interested to see what other HN people have done.
I haven't thought much about iterators link to coroutines.
As a hobby, I am working to write about a dream programming language. I happen to be really interested in parallelism, asynchronous, coroutines, multithreading and concurrency.
I want:
* seamlessly switch between remote-thread coroutine, local thread coroutine.
* concurrency and parallelism and async to be easy to think about, reason about, read and program
* programs should be easy to parallelise and be async and concurrent
Go iterators seem to be local to a thread, but what if you want to distribute work across threads?
I've been thinking of scheduling recently.
Imagine you're a search engine company and you want to index links between URLs. How would you solve this with coroutines?
task download-url
for url in urls:
download(url)
task extract-links
parsed = parse(document)
return parsed
task fetch-links
for link in document.query("a")
return link
task save-data
db.save(url, link)
How would you do control flow and scheduling and parallelism and async efficiently with this code?* `db.save()`, `download()` are IO intensive whereas `document.query("a")` and `parse` is CPU intensive.
* I want to handle plurality or multiple items trivially such as multiple URLs and multiple links.
* I want to keep IO and CPU in flight at all times.
I think I want this schedule:
https://user-images.githubusercontent.com/1983701/254083968-...
I have a toy 1:M:L 1 scheduler thread:M kernel threads:N lightweight threads lightweight scheduler in C, Rust and Java
https://github.com/samsquire/preemptible-thread
This lets me switch between tasks and preempt them from user space without assistance at descheduling time.
I have a simplistic async/await state machine thread pool in Java. My scheduling algorithm is very simple.
I want things like backpressure, circuit breakers, rate limiting, load shedding, rate adjustment, queuing.
i think this would be favorable for (among other things) clu-like iterators and imgui libraries, where you often want to do something like
submenu("&Edit") {
command("&Cut") { clip_cut(getSelection()); }
...
}
this is especially useful in a context where you're heap-allocating sparingly or not at all, because the subroutine taking the block argument can stack-allocate some resource, pass it to the block, and deallocate it once the block returns; python context managers and win32 paint messages are two cases where people commonly do this sort of thing, but things like save-excursion, with-output-file, transactional memory, and gsave/grestore also provide motivationthe conventional way to do this is to package up the block into a closure, then use a full-fledged function invocation to invoke it, using a calling convention that supports closures. but i suspect a more relaxed and efficient approach is to use an asymmetric coroutine calling convention, in which the callee yields back control to its caller at the entry point to the block, and the block then resumes the callee when it finishes. so instead of merely dividing registers into callee-saved and call-clobbered, as subroutine calling conventions do, we would divide them into callee-saved upon return but upon yield containing callee values the block must have restored upon resumption; caller coroutine context registers, which are callee-saved upon return and also on yield; and call-clobbered. you also need in many cases a way for the block to safely force an early exit from the callee
this allows the caller's local variables to be in registers its blocks can use without further ado, or at least indexed off of such a register, while allowing the yield and resume operations to be, in many cases, just a single machine instruction. and it does not require heap allocation
as an example of taking this to the point of absurdity, here's an untested subroutine for iterating over a nul-terminated string passed in r0 with a block passed in r1, using a hypothetical coroutine convention which passes at least r4 through from its caller to its blocks
itersz: push {r6, r7, r8, lr}
mov r7, r0
mov r6, r1
1: ldrb r0, [r7], #1
cbz r0, 1f
blx r1
b 1b
1: pop {r6, r7, r8, pc}
and here is another untested subroutine which uses it to calculate a string hash hashsz: push {r4, r5, r9, lr}
movs r4, #53
adr r1, 1f
blx itersz
mov r0, r4
pop {r4, r5, r9, pc}
1: eor r4, r0, r4, ror #27
bx lr
even in this case where both the iteration and the visitor block are utterly trivial, the runtime overhead per item (compared to putting them in the same subroutine) is evidently extremely modest; my estimate is 7 cycles per byte rather than 4 cycles per byte on in-order hardware with simple branch prediction, so, on the order of 1 ns on the hardware russ used as his reference. for anything more complex the overhead should be insignificantit's less general than the mechanism russ proposes here (it doesn't solve the celebrated samefringe problem), but it's also an order of magnitude more efficient, because the yield and resume operations are less work than a subroutine call, though still more work than, say, decrementing a register and jumping if nonzero
Sieving primes by turning functions into coroutines, parsing text by yielding characters, all with unnatural functions and state management... that;s an improvement over what?
Java and C++ are largely inferior for my typical purposes, but at the end of the day they work fine and are stable in terms of direction, and don't tend to repeatedly bloat the language over pedantry. If you want top-notch performance, there's already C, C++, and Rust.
I am not a fan of the function coloring shit in Python and Javascript.
I don't want the kitchen sink!
Coroutines will fill a different nice more akin to Python's generators. There are a whole bunch of places where this could dramatically cut down on memory usage and code complexity where you have a composable pipeline of components you need to glue together. The design appears to me to be entirely synchronous.
The good thing is, nothing in the post even proposes anything that might fall in the function coloring problem.
FYI when you write a goroutine and have to use Context, that's literally just function coloring
You can easily bridge context-using and non-context-using code, you can have nested contexts, you can have multiple unrelated contexts, you can ignore the context, probably most critically you can't pass values back up the context, I don't see any way these look like function coloring.
That works just fine with throwing a context.Background in, which should be available anywhere, and not color the function.
Usually my goroutines related code feels like a mess because of them, and I don't know how to make them "more clean" (whatever that means). Any hints proven to be maintainable patterns would be highly appreciated.
For example, let's say I have []foo as input and I'm going to call a service on every element of the slice.
You could do it with channels, but oftentimes I'll reach for creating a slice of results and passing each goroutine an index into the slice. That way I can avoid any overhead that comes from having what is effectively a queue with a lock.
It depends on what I'm doing, how I want to treat failures of a single goroutine, etc. of course.
I default to never returning channels unless it's a requirement for the use case (e.g. context.Done() returns a channel so callers can listen for when the context is closed).
Callers can always create goroutines to make your code run concurrently. It's more annoying to erase channels when they aren't useful to the caller (e.g. if you made all functions that make HTTP calls return a channel because you assume callers want a new goroutine, but I already have my own goroutine per request)
Beyond that, some specific examples would be helpful so I can tailor my advice
Hasn't Microsoft done some research on such a language years ago already? There is also ParaSail made by someone from the Ada community. What happened to these projects? Nobody uses them?
I'm not sure if it's desirable to parallelize code automatically, as in many cases you do need at least _some_ parts of the code to run synchronously. But it's an interesting thought experiment to have parallelism be the default instead of opt-in.
- (toy problem) https://www.youtube.com/watch?v=ftcIcn8AmSY
- (parallelism) https://www.youtube.com/watch?v=dPK6t7echuA
- (languages) https://www.youtube.com/watch?v=dCuZkaaou0Q
The only odd thing here is the creation and use of a cancel callback. I've not done much Go programming, so I don't know whether this is just a performance improvement to more quickly collect dropped iterator state, or it's a requirement because Go does not garbage collect goroutines waiting on channels for which the waiting goroutine has the only reference. In Lua you wouldn't need such a thing because coroutines/threads are GC'd like any other object--once all references are gone they're GC'd even if the last operation was a yield and not a return from the entry function.
More info about GOGC: https://dave.cheney.net/tag/gogc
You might think "I'll just use static buffers everywhere", but allocations can occur in unexpected places. The compiler does some very basic lifetime analysis to eliminate some obvious cases (loops...), but it's really hard to avoid in general.
Probably a better argument for Go (and other languages like Java) is how "tweakable" the GC is or at least describe it as the GC can be turned off, but it's not designed or useful to do so.
Well, maybe a library solution could possibly have a guard page at the end of the stack. When that one is reached, then you could try to grow the stack in an error handler. (but that would probably not work, if you have taken a pointer to some stack variable)
This means that for a function to yield, it needs to have the so called `yield` param and you can only yield inside a yield function functions with the same yield signature or does not contain a signature at all.
So the example would be rewritten into something like:
x := func(:z int) {
for {
z++
:- z
}
}
for y := range x() {
...
}
The : adds the yield signature and :< yields the value.Functions that only yield a value `X` would just have the `: X` or `: name X`. To accept a value of type `Y` in the resume the signature would change to `:[Y] X` or `:[Y] name X`.
Accepting is weak, so if something expects a yielding function that resumes with Y and yield X, yielding functions that only yield X without a resume should also be accepted.
Consider the following code example:
func filter[T comparable](it func(:T), f func(T) bool, :T) {
for v := range it {
if f(v) { :- v }
}
}
func map[T](arr []T, :T) {
for _, v := range arr {
:- v
}
}
for v := range filter(map({1, 2, 3}), func(x) { return x < 3 }) {
print(v)
}
Special functions in a co package could give resume and New functionality to keep the go style.A basic counter could be written like this:
func counter(:[bool]c int) {
for {
if ! :- c { break }
c += 1
}
}
A func that counts twice could be as simple as: func countTwice(:[bool]c int) {
count()
count()
}
To interact with the resume value the range syntax could be expanded to something like: for c := range count() {
if c > 10 { -: false }
-: true
print(c)
}
Range would resume with a default zero value if the range does not pass a value to -:.I use them now somewhat, but only because they offer slightly better sugar at the call site. at the definition site, they are just as ugly as you expect, granted the Go syntax is about the best I have seen from other languages. so in the end I think we really lost some simplicity in order to silence the whining about no generics. not a good tradeoff.
No, we really didn't.
Generics were acceptable to the community only because they are fully backwards compatible with existing code, and can be safely ignored if you don't need them.
Which, not at all surprising, is what most golang code still does. Because as it turns out, outside of "collections" of one sort or another, practical use cases for generics are not that easy to find. Most code written never gets to see more than one type to begin with, and more than 2 is already a stretch.
If anything, the addition of generics showcased to the larger community, how little many of the vaunted features!!!!! that people keep demanding and labeling as "essential" are actually needed for a widely accepted and excellent language.
I don't remember where I came across this, but many years ago I saw python-style generators termed "semi-coroutines" which I'm a fan of. Python (frustratingly) doesn't implement them this way, but the beauty of generators is that by limiting the scope where execution can be suspended to the highest-level stack frame, you can statically determine how much memory is needed for the (semi-)coroutine, so it doesn't require a single allocation.
Zig takes that a step farther by considering recursive functions second-class, which lets the compiler keep track of the maximum stack depth of any function, and thereby allow you to allocate a stack frame for any non-recursive function inside of another stack frame, enabling zero-allocation full coroutines, as long as the function isn't recursive.
That would... probably be overkill for Go, since marginal allocations are very cheap, and you're already paying the runtime cost for dynamic stack size, so the initial allocation can be tiny.
I would love to see full proper coroutine support make it to Go, freeing users of the overhead of the multithreaded scheduler on the occasions where coroutine patterns work best. I remember back in 2012 or so, looking at examples of Go code that showed off goroutine's utility as control flow primitives even when parallelism wasn't desired and being disappointed that those patterns would likely be rare on account of runtime overhead, and sure enough I hardly ever see them.
@ hashsz s = {
@ h := 53u
@ itersz s, b => {
@ h = (h >> 27 | h << 5) ^ b
@ }
@ return h
@ }
.syntax unified
.thumb
itersz: push {r6, r7, r8, lr} @ r6 and r7 are preserved by blocks
mov r7, r0 @ so put the string pointer in r7
mov r6, r1 @ and the block code pointer in r6
1: ldrb r0, [r7], #1 @ loop over string bytes, post-indexing
cbz r0, 1f @ bail out on NUL, but otherwise
blx r1 @ invoke block argument with byte, then
b 1b @ repeat loop
1: pop {r6, r7, r8, pc} @ restore registers and return
hashsz: push {r4, lr} @ r4 is callee-preserved and passed to blocks
movs r4, #53 @ initial value of hash
adr r1, 1f+1 @ load Thumbified pointer for block using gas local label
bl itersz @ invoke iterator subroutine
mov r0, r4 @ load completed hash into return value
pop {r4, pc} @ restore and return
1: eor r4, r0, r4, ror #27 @ block argument rotates & xors the byte in r0
bx lr @ then resumes the iterator blx r6 @ invoke block argument with byte, then
those responsible for this error have been sackedyears ago, in fact
Fwiw if you look at the source of some other idiomatic standard library functions you may find their implementation is of similar complexity. That’s the nature of good abstractions, though, they should make tricky things easy to use.
Indexes (traditionally) start at 1, offsets (measures from base) obviously start from zero.
Anything involving a lexer and parser is a good example of this, and exactly one of the use cases given in the example. With coroutines, the lexer can take some input and emit a series of tokens which the parser can consume on the fly without any need to do any complex interleaving of the parser and lexer code. You also gain the ability to stop lexing early in case the parser encounters an issue.
I'm not sure what kinds of software you work on, but this is little more complicated or clever than Unix shell pipelines.
Exactly. From rsc's previous post[1]:
> On my laptop, a C thread switch takes a few microseconds. A channel operation and goroutine switch is an order of magnitude cheaper: a couple hundred nanoseconds. An optimized coroutine system can reduce the cost to tens of nanoseconds or less.
Channels otoh are very versatile: everything from spsc to mpmc with buffering and starvation protections, fast cancelation and notifications, etc etc. They’re not perfect, but it’s a helluva bang-for-the-buck for a single primitive. Literally all you have to do for performance is add buffering and coalesce “units of work”, and you’re good to go.
From TFA:
"Because scheduling is explicit (without any preemption) and done entirely without the operating system, a coroutine switch takes at most around ten nanoseconds, usually even less. Startup and teardown is also much cheaper than threads."
"For this taxonomy, Go's goroutines are cheap threads: a goroutine switch is closer to a few hundred nanoseconds"
Also check out https://news.ycombinator.com/item?id=29510751
and https://ewencp.org/blog/golang-iterators/index.html:
"Finally, the most natural channel based implementation is… slow. By a factor of nearly 50x. Buffering does help though, reducing that to a factor of 25x."
for {
next := <-channel
...
}
You have to set up the channel and the goroutine that feeds it, you need to safely close the channel when the iteration is done (but not before, unless you like panics), you need to deal with panics inside the goroutine and possibly support cancellation if the iteration breaks early (unless you love memory leaks).If you try to implement this pattern by hand, you are all too likely to make fatal mistake, and this is doubly true in the hands of an inexperienced programmer.
I appreciate the fact that Russ wrote this long post, gradually implementing `coro.New()` and improving its functionality and safety — and only in the very end, we get a short paragraph about performance. Good performance is important to make this feature attractive to use, but if the feature is clunky and error-prone, it wouldn't be worth much, even with great performance.
I've found it's generally a very strong indicator.
But to your point, this isn't a new issue, there's been a good amount of discussion around it over the years.
https://github.com/golang/go/discussions/56413#discussioncom...
EDIT: Sorry, I misunderstood part of your question. The memory grows unbounded unless you call runtime.GC() which triggers garbage collection. But this is a blocking call and essentially block your whole program.
Python has both true (ish) coroutines (or at least coroutines which are entirely user controllable), which it mostly uses for iteration, and the concurrency specialised “async”.
Initially the goal was to reuse “yield” for concurrency (a big reason why “yield from” was added), but the ergonomics of mixing multiple coroutines uses was found to be awful, at least for the langage python is, and trying to provide relevant sugar difficult.
I understand that the historical reasons for that ("generators" originally only supported "yield" and a separate async/await design with implicit scheduling seemed more ergonomic). But that doesn't remove the confusion.
But Python's generators are still fully "true" coroutines. Russ makes the distinction between coroutines and generators based on whether as "Generators provide less power than coroutines, because only the top-most frame in the coroutine is allowed to yield. ". I think this is a bad definition, as generators can clearly "pass" the yielding power to other generators with "yield from", and Russ talks about this feature in the same post. Generators require more ceremony, but I don't think they are "less powerful" in any meaningful way .
Roberto Ierusalimschy, the creator of Lua, and Ana Lucia de Moura proposed the distinction between stackful and stackless coroutines[1]. Their paper revived coroutines as a research subject and had a lasting impact on coroutine-related terminology. Ierusalimschy and Moura made the distinction between "Stackful" coroutines and "non-stackful coroutines" (the term "stackless coroutines" stuck later on, but they did not use it), where Stackful coroutines like Lua's coroutines can "suspend their execution from within nested functions". They also made the same allusion to power that Russ does: "powerful enough to constitute a general control abstraction; in particular, they can- not be used as a concurrent construct".
I think both Ierusalimschy and de Moura are onto something when they call generators "limited", but they are just incorrect when they say that "but are not powerful enough to constitute a general control abstraction; in particular, they cannot be used as a concurrent construct". Early Node.js libraries and frameworks such as "co" and "koa"[2] should count as a proof that you can very much use generators as a concurrency construct. Generators have all the power that is necessary for dealing with normal concurrency cases, although they don't have the best ergonomics.
But there is another distinction where "generators" are not the same as Lua coroutines, and clearly lack some expressive power (even though this power is NOT a necessary condition for implementing full-fledged concurrency). Lua allows coroutines to communicate bidrectionally: yield() can pass a value back the coroutine caller, which is returned from resume() - but when the coroutine caller calls resume() again they can pass an argument which will be returned from yield(). Generators are unidirectional: they only allow you to pass a value to yield(), but not to resume().
[1] http://www.inf.puc-rio.br/~roberto/docs/MCC15-04.pdf [2] https://news.ycombinator.com/item?id=6933358
I might be missing something as I’m not familiar with Lua’s coroutines but that sounds like send (https://docs.python.org/3/reference/expressions.html#generat...). The parameter to `send` (from outside the coroutine) is the return value of `yield` (inside the coroutine).
This yield is not an object, it is handling the execution context to another stack.
For discussion, can you please provide a short sample in Rust to demonstrate your point? I would like to hear more.
For sure - it's a little terse, just to demonstrate the various cases
async fn do_something_async() {
nonblocking_sync_call();
tokio::task::spawn_blocking(blocking_sync_call()).await;
async_std::task::spawn_blocking(blocking_sync_call()).await;
async_call().await;
}
fn do_something_sync() {
nonblocking_sync_call();
blocking_sync_call();
futures::executor::block_on(async_call());
}
There is _a bit_ of ceremony required to bridge the two, but it's pretty minimal.If you want to know more about any of that, happy to explain or provide links for further reading
..Next I added a direct coroutine switch to the runtime, avoiding channels entirely. That cuts the coroutine switch to three atomic compare-and-swaps (one in the coroutine data structure, one for the scheduler status of the blocking coroutine, and one for the scheduler status of the resuming coroutine), which I believe is optimal given the safety invariants that must be maintained. That implementation takes 20ns per switch, or 40ns per pulled value. This is about 10X faster than the original channel implementation.
... That means the definition of coroutines should be possible to implement and understand in terms of ordinary Go code. Later, I will argue for an optimized implementation provided directly by the runtime,..
At the end he implements an experimental runtime mechanism that permits a goroutine to explicitly switch execution to another goroutine rather than using the generic channel scheduling plumbing.
But when you have a construct that has their own call stacks, it's a relatively small step to implement lightweight threads with it.
Since doing concurrency happens much more often than any other smart use of coroutines, many people conflate the two.
I sometimes see this confusion in discussions about Kotlin's coroutines. An example of using coroutines that is not about concurrency: https://kotlinlang.org/api/latest/jvm/stdlib/kotlin.sequence...
The key concept with Kotlin's co-routines is not having to choose between reactive, green threads, or real threads but treating all of those in the same way with a robust set of abstractions. I use co-routines in the browser on top of existing javascript frameworks that return promises. I use them on the JVM with reactive frameworks like flux. And I also use them with some thread pools. Once Loom reaches LTS (next year, I think), I'll probably be configuring some Loom capable co-routine dispatchers as well.
The debate regarding co-routines vs. go-routines seems like it is similar. I think Go can learn a thing or two from Kotlin's co-routines. After all it is mostly built as a library on top of a single language feature: the suspend keyword. I get that not everybody likes colored functions. But then having a lot of leaky abstractions and related complexity on top of go routines is maybe also not ideal.
You misunderstood. The `Sequence` type does predate coroutines. But I meant the `sequence` builder function, which takes a block of suspending code to create a `Sequence`.
The in-order traversal in the article can be translated to Kotlin:
fun walk(t: Tree?): Sequence<Int> = sequence {
if (t != null) {
yieldAll(walk(t.left))
yield(t.value)
yieldAll(walk(t.right))
}
}
As I have noted above, this is a use of coroutines that has nothing to do with concurrency.Some frameworks (e.g. twisted) did use them for concurrency, and the core team originally planned something similar, however the ergonomics were not what they wanted (especially when mixing coroutines-for-concurrency and coroutines-for-iteration), so they went for a more specialised design.
> went for a more specialised design
My impression is that JS evolved similarly - (ab)using generator for concurrency, then specialized async-await as a language feature.
I've encountered a lot of people who read that on /r/golang and then ask "well why are send operations so expensive", and it's not that. It's that a lot of iteration operations are on the order of single-digit cycles and often very easy to pipeline or predict. No concurrency primitive can keep up with that. A given send operation is generally fairly cheap but there are enough other things that are still an order or two of magnitude cheaper than even the cheapest send operation that if you block all those super cheap operations on a send you're looking at multiple factor of magnitude slowdowns. Such as you would experience with your code.
You can use a function or a method, but you lose the nice continuation/generator ability to write a function that can do complicated yields without having to write the state machine yourself, plus you run a risk that the call won't inline in which case you're incurring non-trivial function call overhead.
The problem with iteration in Go isn't that you can't solve any given individual problem, the problem is that you can't solve all of them simultaneously the way you can in Rust or Python. (Though one of these days I want to get around to benchmarking Python's iteration versus Go channel-based iteration, I'm not actually sure which would win. What Go considers dangerously slow can still be baseline performance for other languages.) So you can get a defeat-in-detail sort of thing where a person cites a problem, and someone posts as solution to that, and then they cite another problem, and there's a solution for that, and then there's another problem, and a solution is posted for that, and all the solutions do indeed more-or-less solve the given problem... but you can't combine them into one.
https://go.dev/tour/concurrency/8
In A Tour of Go, concurrency section, "Equivalent Binary Trees" is an example of paying the price when you don't need it.
But the `sequence` builder function was not. In fact it depends on coroutines.
Could you read my reply before repeating?
Go has simplicity as a design goal. Part of that simplicity is not adding all the features we can think of to the language. Just because it would provide slightly leaner syntax for a relatively small group of use cases isn't a good reason to add new language features (imho, from my 10+ years using Go and watching it evolve).
If we can implement this using current language features, but it's complex and messy to implement, then it's a great case for a new library, possibly even inclusion in the standard library. If we can't implement this at all using current language features, then maybe it's a case for a new language feature to enable this. If we can implement it relatively cleanly using current language features, then we're good and don't need to do anything. This seems like a "can implement, but not very cleanly" case, which would be a great justification for a library, but not a language feature. Again, imho.
First sentence of OP:
> This post is about why we need a coroutine package for Go
Then in the girst section after the intro:
> Later, I will argue for an optimized implementation provided directly by the runtime, but that implementation should be indistinguishable from the pure Go definition.
The point I was trying to make is that some amount of people see the benefit in removing much smaller boilerplate. And so presumably at least as many people should see the benefit of removing a larger amount of boilerplate.
I'm just talking about the benefits here. There is also a cost to expanding the language. The costs might outweigh the benefits, but it doesn't mean that the benefits don't exist.
Agreed. I think this is where the discussion will focus. I'm glad that historically the Go team have weighted the costs heavily, and hope they continue to do so.
I guess I'm in the minority but boilerplate really doesn't bother me. Going the other way gets too much "magic" and we lose the ability to fine-tune behaviour.
Not to nitpick this specifically but as a generic reminder not all complaints are worthy of shifting the trajectory of a massively popular programming language.
Balancing "worthy" and "unworthy" changes is really hard both in the community and discussions like this one. I don't envy the teams that have to do it.
getNext := iterableThing.Iterator()
for next, ok := getNext(); ok; next, ok = getNext() {
...
}
Which, yeah, the range cleans it up a bit, but it's not doing quite as much work as you're implying. for i.HasNext() {
current := i.GetNext()
...
}
And just mutate the internal state of the struct. Probably throw it behind some sort of interface like: type Iterator[T any] interface { // feel free to strip out the generic
HasNext() bool // if you really only have one type
GetNext() T // you do this with
}
I think this still loses to ranging over a function, but I still think it should be compared to what you can do with the current language as opposed to what was originally posted. (DE SAMEFRINGE (X Y)
(OR (EQ X Y)
(AND (NOT (ATOM X))
(NOT (ATOM Y))
(SAME (GOPHER X) (GOPHER Y)))))
(DE SAME (X Y)
(AND (EQ (CAR X) (CAR Y))
(SAMEFRINGE (CDR X) (CDR Y))))
(DE GOPHER (U)
(COND ((ATOM (CAR U)) U)
(T (GOPHER (CONS (CAAR U)
(CONS (CDAR U) (CDR U)))))))
Coroutines are not necessary for solving that problem (though they do offer a neat solution to it).