The Hunt for Error -22 (tweedegolf.nl)
The author needed to use unsafe in order to pass his pointer to libmodem, but libmodem itself requires a pointer with static lifetime. Had the author supplied one, the issue would have been prevented in the first place.
I can see why you wouldn't want to use static, since it hinders testability, but that means you need to ensure that the pointer you supply to libmodem outlives libmodem. I would use RAII to do that in C++, and I am sure in Rust you could/would do the same.
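To make the RAII idea concrete, here is a minimal sketch in Rust. Everything in it is hypothetical: `InitParams`, `fake_init` and `fake_shutdown` are stand-ins for a C library that keeps the pointer it is given, not the real libmodem API. The guard owns the parameters on the heap so their address stays stable, and shuts the library down before the parameters are freed.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical stand-in for the C init parameters (not the real libmodem struct).
pub struct InitParams {
    pub shmem_size: u32,
}

// Stand-in for the C library's internal state: it remembers the pointer it was given.
static RETAINED: AtomicPtr<InitParams> = AtomicPtr::new(ptr::null_mut());

// Hypothetical "C" entry points, simulated in Rust for this sketch.
fn fake_init(params: *const InitParams) {
    RETAINED.store(params as *mut InitParams, Ordering::SeqCst);
}

fn fake_shutdown() {
    RETAINED.store(ptr::null_mut(), Ordering::SeqCst);
}

/// RAII guard: owns the parameters for as long as the library is initialised.
pub struct Modem {
    // Boxed so the address handed to the library does not change when `Modem` is moved.
    params: Box<InitParams>,
}

impl Modem {
    pub fn new(params: InitParams) -> Modem {
        let params = Box::new(params);
        fake_init(&*params as *const InitParams);
        Modem { params }
    }

    pub fn shmem_size(&self) -> u32 {
        self.params.shmem_size
    }
}

impl Drop for Modem {
    fn drop(&mut self) {
        // Shut the library down first; the `params` field is only freed
        // after this function returns, so the library never sees a dangling pointer.
        fake_shutdown();
    }
}

pub fn library_is_active() -> bool {
    !RETAINED.load(Ordering::SeqCst).is_null()
}
```

With this shape the caller cannot free the parameters while the library is still using them, and tests can still create and drop a `Modem` freely because nothing has to be `static`.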
I guess I am asking, is there anything here that a libmodem written in Rust would have magically solved? It feels like wishful thinking, but I am open to learning where I am mistaken.
In any case, kudos for finding this bug. Having worked with Zephyr/NRF connect SDK and this exact chip myself I can definitely relate to the pain they (can) bring.
But the custom Rust wrapper was composed as a game of telephone (ugh), with the author blindly mimicking "Jonathan" who seemed to have been blindly mimicking a sloppy (and later repaired) example from Nordic.
The argument is that if the library and its internals were originally written in Rust, which has richer semantics for object lifetimes, Rust would have been able to formally convey that the input data needed to outlive the individual function call, throwing an error at compile time.
The wrapper could have enforced this constraint itself, as it probably does now, but the handoff between Rust and C needs somebody to account for and understand the by-convention stuff in C so that it can be expressed formally in Rust, and that human process failed to happen here.
I'm not following your comment, but I think the point is simply "the lifetime of the config is in the function signature, rather than hopefully (sometimes) being in the documentation, and hopefully (sometimes) correct".
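As a rough illustration of "the lifetime is in the function signature", here is a sketch of a wrapper that demands `&'static`. The names (`InitParams`, `modem_init`, `fake_c_init`) are made up for the example and the C side is simulated in Rust; this is not the actual wrapper or libmodem API.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

// Hypothetical stand-in for the C init parameters.
#[repr(C)]
pub struct InitParams {
    pub shmem_base: u32,
    pub shmem_size: u32,
}

// Simulated library state: the pointer is kept after init returns.
static RETAINED: AtomicPtr<InitParams> = AtomicPtr::new(ptr::null_mut());

// Stand-in for the C function; returns 0 on success like many C init calls.
fn fake_c_init(params: *const InitParams) -> i32 {
    RETAINED.store(params as *mut InitParams, Ordering::SeqCst);
    0
}

/// The `'static` bound says, in the signature itself, that the parameters
/// must stay valid for the rest of the program.
pub fn modem_init(params: &'static InitParams) -> Result<(), i32> {
    match fake_c_init(params as *const InitParams) {
        0 => Ok(()),
        err => Err(err),
    }
}

// Parameters placed in static memory satisfy the bound.
pub static PARAMS: InitParams = InitParams {
    shmem_base: 0x2000_0000,
    shmem_size: 0x1000,
};

/// Reads back through the retained pointer, as the library would later on.
pub fn retained_size() -> Option<u32> {
    let p = RETAINED.load(Ordering::SeqCst);
    if p.is_null() {
        None
    } else {
        // Sound in this sketch: only `&'static` values are ever stored.
        Some(unsafe { (*p).shmem_size })
    }
}

// This version does not compile, because `local` is dropped when the function returns:
//
// fn broken() {
//     let local = InitParams { shmem_base: 0, shmem_size: 0 };
//     let _ = modem_init(&local); // error: `local` does not live long enough
// }
```

The point is only that the mistake from the article turns into a compile error at the call site instead of a convention somebody has to remember.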
The assumption that nobody ever makes mistakes is mistake number one.
I'm still not sure I understand why he couldn't just diff between versions. And the black box thing seems like a fool's errand. If changing the order of random things makes the issue go away, you can't change anything. The only thing you can do is use the binary you already have. Especially because even if you have 2 not working versions, fixing one doesn't necessarily fix the other. This debug effort felt very sloppy.
It's also weird looking at a lot of this code. The first assembly function pushes 4 values on the stack and only needed to push 2. I've had my fair share of bugs that make me go to the disassembly, but that also felt like a waste of time here. The author evidently did not have enough of a grasp of what to expect for it to help at all.
While it's true that nRF should've put something in a log, the author admits they don't support this development flow. It's like the old adage about APIs: any change, no matter how small, will break someone's usage.
Reading the article (nice troubleshooting story!), my summary, as a C programmer, is that the "C Interface" here "takes ownership". Given C cannot express this properly, a pointer is passed, and the called function "simply" makes the assumption that from then on, what was given to it will remain.
As "semantics" this (the need to pass an "owned" piece of data to a function) isn't unusual irrespective of the programming language. Just in case of Rust, this is explicit in the interface (if the func takes a non-ref arg, or a shared smart ref of sorts), while in C ... this can lead to errors of the observed kind. I haven't looked whether any of the sources or docs of libmodem say "this pointer must be either global/static or malloc'ed (and the caller shall not free it)".
A Rust wrapper for this could / should possibly "leak a reference" here: something that prevents the initialisation object from being dropped. Yes, accepted, that needs "nasty hacks" whether static lifetime, Pin, manual drops, explicit Arc leaks, ... but it is possible.
It'd be nice if libmodem were stricter about such ownership, agreed, and then a Rust wrapper could take advantage. Takes time to evolve; is there a bug report / enhancement request out there for this in libmodem?
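For what "leaking a reference" might look like, here is a small sketch using `Box::leak`, which turns a heap allocation into a `&'static` reference that is never freed. It assumes an allocator is available, and `InitParams` / `init_retaining` are hypothetical names rather than anything from libmodem; on a target without a heap, a `static` would play the same role.

```rust
// Hypothetical stand-in for the C init parameters.
#[derive(Debug)]
pub struct InitParams {
    pub shmem_size: u32,
}

// Stand-in for a C-style init that keeps the pointer it receives.
// Here it just hands the raw pointer back so the sketch stays self-contained.
fn init_retaining(params: &'static InitParams) -> *const InitParams {
    params as *const InitParams
}

/// Builds the parameters at runtime, then deliberately leaks them so they
/// live for the rest of the program and can never be dropped by accident.
pub fn init_from_runtime_config(shmem_size: u32) -> &'static InitParams {
    let params: &'static InitParams = Box::leak(Box::new(InitParams { shmem_size }));
    let _retained = init_retaining(params);
    params
}
```

The leak is intentional: the memory is traded for the guarantee that the pointer held by the library stays valid. That is acceptable for a one-time init, but less so if init can be called repeatedly.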
The tooling around this is much easier to deal with, of course, since it's all just Windows and there's a bunch of sane debug layers that you can use.
Massive respect to be able to debug the issue on an embedded system!
I guess not, as this seems to run on the microcontroller, but I remember getting at least some warning from Valgrind in similar situations.
For example, I've encountered hardware that would occasionally write unexpected-error details to a memory location that was completely undocumented. And if you expect more than a shrug from a vendor after pointing out such things, well...
The end of the post says
> This would have been so simple to put in the docs. I've opened a ticket on their DevZone forum. As of writing they've still not updated the docs of the init function.
And they've replied
> Thank you for reporting this, it will be fixed in the next `libmodem` release by the end of the month.
More likely (but not necessarily), Nordic's early example was either bugged or conditionally valid (benefiting from other implicit details of their implementation) and then was revised either because the mistake was identified or because something else about the example changed.
That's all pretty common in this domain. Inadvertently stumbling because you uncritically followed some vendor example is also pretty common and completely understandable. Better tools, like using a language with richer semantics, are indeed something that can help with that.
int something_init(const something_init_params* params);
the convention is that the params are temporary -- really just a way of passing a bunch of parameters to the init function. It would be a surprise that the params are expected to be static. E.g., the whole STM32 HAL is done this way, and it would be a disaster if you thought the init structs all had to be static!
BTW, you can see the assumptions of the non-embedded programmers talking about "taking ownership" being the default interpretation of a signature like that... if you don't have a heap, what does that even mean?
In any case, C is a mess, embedded is a mess, no argument there!
I am more on the hardware side these days, but Nordic's hardware docs are pretty crap. As in, they're pretty, and they're crap. (The prettiness lulls people, especially managers, into a false sense of confidence. Don't fall for the trap!) There are obvious poor choices in there, and if you call FAEs out on them, they say to just follow the docs. Experienced engineers should not follow the docs.
I can't see their software side being any better.