Rivian software update bricks infotainment system, fix not obvious (electrek.co)

277 points by carlivar 2 years ago | 379 comments

latchkey 2 years ago |

I built a whole remote software update mechanism for a control binary that ran on 25k+ servers across multiple data centers.

Rest assured that after the first time I messed it up (which required ssh into each box individually), I wrote a lot of unit and integration tests to make sure that it never failed to deploy again. One of the integration tests ensured that the app started up and could always go through the internal auto update process. This ran in CI and would fail the build if it didn't pass.

While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.

foobiekr 2 years ago | |

Rivian is an embedded use case, though, which is not at all like a fleet of servers.

Having worked for companies that produce network devices - including devices that are unreachable for example for 6 months of the year - and on software installation and upgrade, I am baffled how this bricking is possible. For one thing, you generally use some kind of confirmed boot mechanism - you upgrade a standby partition, set an ephemeral boot value that causes the device to boot the alternate image, and reboot. Only when the image is declared "up" does that get persisted (and then the alternate is upgraded, in order to prevent rollback in the event of a media error). You use watchdogs that are tied to actual forward progress (and not just some daemon that the kernel schedules and that bangs on the watchdog even if the rest of the system is hung) and if they fail, the WD reboots you. (This is one of the reasons that event driven programming is somewhat preferred - actually processing events from a single dispatch thread makes it easier to reason about the system.)

On top of that, you make sure that the core system is an immutable filesystem so that you can validate the _offline_ alternate image before rebooting (write-and-read-back-uncached) and periodically scrub the alternate image (same).

Like.. this is all embedded 101, stuff people have been widely doing since the mid 1990s and I think I can find examples going back to the 70s. Sometimes you get a little more sophisticated (allow sub-packages or overlays and use a manifest to check the ensemble instead of just a single image), but it's very standard.
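
A minimal sketch of the confirmed-boot flow described above, written as a plain Python simulation rather than real bootloader code. The `BootState` fields and the function names are hypothetical, chosen only for illustration; a real device would keep this state in bootloader environment variables or a dedicated flash region.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BootState:
    """Hypothetical bootloader-visible state for a two-slot device."""
    persistent_slot: str = "A"            # slot booted by default
    ephemeral_slot: Optional[str] = None  # one-shot override, cleared when read


def select_boot_slot(state: BootState) -> str:
    """Bootloader side: honor the one-shot override once, then forget it."""
    if state.ephemeral_slot is not None:
        slot = state.ephemeral_slot
        state.ephemeral_slot = None  # consumed: a crash now falls back
        return slot
    return state.persistent_slot


def stage_upgrade(state: BootState) -> str:
    """Updater side: after writing the standby slot, request a trial boot of it."""
    standby = "B" if state.persistent_slot == "A" else "A"
    state.ephemeral_slot = standby
    return standby


def confirm_boot(state: BootState, running_slot: str, healthy: bool) -> bool:
    """Persist the running slot only once the image has declared itself up."""
    if healthy:
        state.persistent_slot = running_slot
    return healthy


def simulate(healthy: bool) -> str:
    """Return the slot the device would use on the reboot after a trial boot."""
    state = BootState()
    stage_upgrade(state)
    trial = select_boot_slot(state)      # first boot tries the new image
    confirm_boot(state, trial, healthy)  # only a healthy image gets persisted
    return select_boot_slot(state)       # what the next reboot would pick
```

In this sketch `simulate(True)` ends up on the new slot and `simulate(False)` returns to the old one, because the override is consumed on first use and never persisted for an image that did not come up.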

dcow 2 years ago | | |

Assuming Rivian does know embedded 101, my guess is that the infotainment system is running Android and the watchdog reported all green once the system services all came online and that it doesn't actually check whether the application layer is really working because, as you know, that would require the watchdog to run a full regression suite before giving the okay, which isn’t practical. Since the update swapped the system to an internal dev cert, they can’t push an immediate update to change the boot args because the management plane daemon won’t connect to the C&C server, or it can but the blob they push wouldn’t pass signature validation, or the TEE won’t unlock the device keys because the roots changed. Whatever the case, someone has to go blow a fuse and re-flash the thing, or at least rewrite the boot args via serial. Just a guess.

If it is the most likely “management plane TLS certs” issue, I bet the watchdog won’t confirm the new boot args until the command dispatch daemon gets a pong from the C&C server moving forward (:

ikiris 2 years ago | | |

That sounds out of scope for the MVP. We can worry about redundancies later after we ship.

KingMachiavelli 2 years ago | | |

Did you just use standard Yocto or similar tools to build such images? Are there standard daemons for managing hardware watchdogs (besides systemd since that's too simple as you say)? I think there's a lot of niche knowledge in the embedded space and many programmers are used to cloud systems and at most target. The most embedded experience most programmers have is likely iOS/Android development where all of the actual embedded concerns are handled for you. Even Google (soft)bricked a bunch of phones with the latest Android 14 update [1].

IMO there's not a lot of regular OSS for building embedded systems that comes with A/B partitioning, watchdogs, secure and verified boot - it's all custom at every org and tailored for individual products.

[1] https://arstechnica.com/gadgets/2023/11/android-14-patches-r...

neuralRiot 2 years ago | | |

> including devices that are unreachable for example for 6 months of the year

That made me think: imagine NASA bricking Voyager with a SW update.

aaronbeekay 2 years ago | |

As somebody currently working at an automaker on software systems, the amazing thing to me is that a mess up of this level doesn’t happen weekly. It’s rough out here.

jacquesm 2 years ago | | |

Thank you. At least you're honest about it, the other day someone was trying real hard to convince me that software developers at automakers are made of magic fairy dust.

bozhark 2 years ago | | |

What's the priority then, telemetry data? Why is it rough out there?

foobiekr 2 years ago | | |

do you guys not have confirmed boot and swizzling to fallback images?

cjbprime 2 years ago | |

> This ran in CI and would fail the build if it didn't pass.

I don't mean to be pedantic, but since we're talking about what should happen instead, this is insufficient. It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.

People should do what you described in CI, but as well as that, you need phased rollout, where e.g. the build can only be rolled out to the next percentage point of randomly selected users in a specific segment (e.g. each hardware revision and country as independent segments) after meeting a ratio of successful check-ins, in the field, from the new build by production customers in that segment. That's the actual metric for proceeding with the rollout: actual customers are successfully checking in from the new version of the software.

Except, that's actually not sufficient either. What if the new build is good, but it contains an update to the updater which bricks the updater? Now you're getting successful check-ins from the new version in the field, but none of those customers will ever successfully auto-update again. So, test the new updater's ability to go forwards successfully, too.
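
The gating rule described above could be sketched roughly as follows. This is an illustration under stated assumptions, not anyone's production system: the `SegmentStats` fields, the 99% threshold, and the minimum sample size are all made up for the example, and hashing the device ID stands in for "randomly selected users".

```python
import hashlib
from dataclasses import dataclass


@dataclass
class SegmentStats:
    """Field results for one segment, e.g. (hardware revision, country)."""
    updated: int      # devices in the segment that received the new build
    checked_in: int   # of those, how many checked in from the new build
    rollout_pct: int  # percentage of the segment currently allowed to update


def in_rollout(device_id: str, build: str, rollout_pct: int) -> bool:
    """Stable pseudo-random selection: a device's bucket is fixed per build."""
    digest = hashlib.sha256(f"{build}:{device_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < rollout_pct


def next_rollout_pct(stats: SegmentStats, min_ratio: float = 0.99,
                     min_sample: int = 100) -> int:
    """Advance by one percentage point only on evidence from the field."""
    if stats.updated < min_sample:
        return stats.rollout_pct  # not enough check-ins to judge yet
    if stats.checked_in / stats.updated < min_ratio:
        return stats.rollout_pct  # hold the rollout for this segment
    return min(100, stats.rollout_pct + 1)
```

Each segment is evaluated independently, so a bad result on one hardware revision holds that segment without blocking the others. Testing that the new build's updater can itself move forward would need a separate check, for example requiring a successful no-op update from the new version before counting a check-in.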

quailfarmer 2 years ago | | |

A good way to handle the who-updates-the-updater issue is to use a triple partition updater. A updates B, and then B updates C, then C updates A. If anything about the new version prevents it from properly updating its neighbor, that neighbor won't be able to close the loop, and you'll fall back to A. This simplifies the FSBL, because it just boots the three partitions in a loop, no failure detection required. You don't need to triplicate the full application either, just the minimum system needed to perform an update, and then have the "application" in its own partition to be called by the updater.
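
A toy simulation of that ring, assuming each slot simply copies its own updater version to the next slot when it boots and its updater works. The slot names, the `working` set, and `boot_cycle` are hypothetical; a real FSBL and updater would be considerably more involved.

```python
RING = ["A", "B", "C"]


def boot_cycle(slots: dict[str, str], working: set[str],
               start: str = "A") -> dict[str, str]:
    """Boot each slot once in ring order.

    slots maps slot name to the updater version it holds; working is the
    set of versions whose updater can actually update a neighbor.
    """
    slots = dict(slots)  # do not mutate the caller's view
    first = RING.index(start)
    for step in range(len(RING)):
        current = RING[(first + step) % len(RING)]
        neighbor = RING[(first + step + 1) % len(RING)]
        if slots[current] in working:
            slots[neighbor] = slots[current]  # a working updater propagates itself
        # a broken updater does nothing; the loader just boots the next slot
    return slots
```

Starting from `{"A": "v1", "B": "v2", "C": "v1"}` (A has just written v2 into B) and booting from B, a working v2 spreads around the ring, while a broken v2 never reaches C, so C and A stay on v1 and A then overwrites B with v1 again.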

latchkey 2 years ago | | |

> It works until the day you realize you made some kind of manual change to your CI infra, or that CI has some non-standard configuration that makes it work for you but not some significant fraction of the fleet.

Nah, my CI process was solid. This was proven in the field over the course of years.

> I don't mean to be pedantic... you need phased rollout

You don't need to be pedantic, but better to ask the question rather than assume that was all that I did. =) You have to realize that what I built worked flawlessly. It wasn't easy either, took a lot of trial and error.

I did have a CIDR based rollout. I could specify down to the individual box that it would run a specific version. Or I could write "latest" to always keep certain boxes running on the latest build. This was another part of my testing, but ended up not being fully necessary because I had enough automated testing in CI that "latest" always worked.

> but it contains an update to the updater which bricks the updater?

This happened, so I wrote a lot of test code to make sure that would never happen again. My CI would catch that since I was E2E testing that it could actually run the upgrade process.

Once I implemented all of this, I never had a single failure and would routinely, several times a day, deploy to the entire cluster, over the course of a couple years.

It was all eventually consistent as I could also control the "check for update" frequency as well.
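
For illustration, a CIDR-based version lookup of the kind described above might look like this. It is only a sketch: the rule format, the "most specific prefix wins" policy, and the `resolve_version` name are assumptions, not details from the original system.

```python
import ipaddress


def resolve_version(ip: str, rules: list[tuple[str, str]],
                    latest: str, default: str) -> str:
    """Return the build a host should run; the most specific CIDR match wins.

    rules is a list of (cidr, version) pairs, where version may be the
    literal string "latest" to track the newest build.
    """
    addr = ipaddress.ip_address(ip)
    best = None  # (network, version) of the longest matching prefix so far
    for cidr, version in rules:
        net = ipaddress.ip_network(cidr)
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, version)
    if best is None:
        return default
    return latest if best[1] == "latest" else best[1]
```

A `/32` rule pins a single box, a wider block can track "latest", and anything unmatched gets the default build.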

jacquesm 2 years ago | | |

And you need to verify the vehicle is not in motion.

psychlops 2 years ago | |

Having worked on 25K machines, I can assure you that it never deployed to every single machine and failed to do so in interesting ways all the time.

latchkey 2 years ago | | |

It always deployed. It was eventually consistent. Any failure would automatically be resolved after a period of time.

postalrat 2 years ago | | |

As a frontend web developer I'm constantly deploying software to many thousands of machines. And you know what? It's pretty damn simple.

donmcronald 2 years ago | |

> While I fully understand that this is hard to get right 100% of the time, a mess up of this level by a car manufacturer is pretty amazing to me.

I feel like it's going to happen to someone that makes network devices eventually. I'm always scared to update my (several hundred) UniFi devices. Their update process isn't foolproof and they push auto-updates via the UI pretty hard.

Several years ago they caused some people's devices to disconnect from the management controller when they enabled 'https' communication. Prior to that, if you were pointing devices at 'https://example.com:8080...' they would ignore the 'https' part and do an 'http' request to port '8080'. Then they pushed their 'https' update which expected an 'https' connection and didn't fall back to the old behavior for anyone that was mistakenly using 'https' in their URL initially. Some people on their forums complained about having to manually SSH to every device to fix the issue.

It was caused by an end-user mistake, but they knew it was a potential issue. AFAIK, their attitude on it hasn't changed a lot, and at the time their response was that they knew it would break some people, but that it wouldn't be that many (lol).

IMO, the issue with those systems is that basic communication back to the update / config server is part of the total package which is too complex (ie: a full Debian install). I'd rather see something like Mender (mender.io) where the core communications / updates come from a hardened system with watchdog, recovery, rollback logic.

Think of how crazy it is to have something like pfSense doing package based updates rather than slice based updates. At least with boot environments they could add some watchdog and rollback type logic, but it'll still be part of the total system instead of something like a hardened slice based setup where the most critical logic is isolated from everything else and treated like a princess.

Do you have any insight on package vs slice based systems for updates? Did you isolate update logic from the rest of the system or am I out of touch with that opinion?

vGPU 2 years ago | | |

Reminds me of my (far less critical) update process for home assistant. Every time something breaks. Currently my hvac automations are going haywire.

akira2501 2 years ago | |

When possible, I used a fail back mechanism. If the update failed to fully come up, then the watchdog timer would catch it, the bootloader would notice the incomplete boot, and attempt to boot from the previous known working image in that case.
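
One common way to implement this kind of fallback is a boot counter that the bootloader increments and the application clears once it is fully up. The sketch below simulates that idea; the `BootEnv` fields and the limit of three attempts are assumptions for illustration, not the commenter's actual design.

```python
from dataclasses import dataclass


@dataclass
class BootEnv:
    """Hypothetical persistent environment shared by bootloader and application."""
    active: str = "new"    # image the bootloader tries first
    previous: str = "old"  # last known working image
    boot_count: int = 0    # boots attempted since the last confirmed boot
    boot_limit: int = 3    # unconfirmed attempts allowed before falling back


def bootloader_pick(env: BootEnv) -> str:
    """Count this attempt; after too many unconfirmed boots, swap images."""
    if env.boot_count >= env.boot_limit:
        env.active, env.previous = env.previous, env.active
        env.boot_count = 0
    env.boot_count += 1
    return env.active


def mark_boot_complete(env: BootEnv) -> None:
    """Application side: called only once the system has fully come up."""
    env.boot_count = 0
```

If the new image hangs and the watchdog keeps resetting the board, `mark_boot_complete` is never reached, so the counter climbs until the bootloader switches back to the previous image.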

code_runner 2 years ago | |

out of morbid curiosity.... how long did it take to ssh into and fix all of those servers? I imagine even automating a fix (if possible) would still take a good amount of time.

latchkey 2 years ago | | |

GNU parallel and sshpass are your friends.

The way I built my app was that I could install it cleanly via a curl | bash.

So, I just had a simple shell script that iterated through the list of IP addresses (from the DHCP leases), ran curl | bash and that cleaned up the mess pretty quickly.
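
The same fan-out can be sketched in Python. This is not the script described above: it assumes key-based, non-interactive `ssh` rather than `sshpass`, and the `repair_fleet` and `run_remote` names are hypothetical. The runner is injectable so the loop can be exercised without touching real hosts.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable


def run_remote(host: str, command: str) -> bool:
    """Run command on host over ssh; True if it exited with status 0."""
    result = subprocess.run(
        ["ssh", "-o", "ConnectTimeout=10", "-o", "BatchMode=yes", host, command],
        capture_output=True,
    )
    return result.returncode == 0


def repair_fleet(hosts: Iterable[str], command: str,
                 runner: Callable[[str, str], bool] = run_remote,
                 workers: int = 64) -> dict[str, bool]:
    """Run the same repair command on every host in parallel."""
    hosts = list(hosts)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda h: runner(h, command), hosts)
        return dict(zip(hosts, results))
```

The returned mapping makes it easy to collect the hosts that failed and retry only those, which matters at 25k machines where some fraction will always be unreachable on the first pass.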

jdechko 2 years ago | |

As a non-developer, the whole situation with a bad software update to the Voyager spacecraft really puts things into perspective as far as how bad remote updates can be.

It’s also a testament to the way that the system was designed that they were able to get it back online.

sixtram 2 years ago | |

you ssh-d into 25K servers one by one? I mean, manually?

latchkey 2 years ago | | |

https://news.ycombinator.com/item?id=38270986

ugh123 2 years ago | |

Please tell me you scripted that ssh into across your 25k servers!

latchkey 2 years ago | | |

https://news.ycombinator.com/item?id=38270986

One thing my little control process did on the box was to always set the password to be the same... user/1.

None of these boxes needed inbound connections, so it wasn't a big deal to do that.

gravitronic 2 years ago |

I used to work for a company that built satellite receivers that would be installed in all sorts of weird remote environments in order to pull radio or tv from satellite and rebroadcast locally.

If we pushed a broken update it might mean someone from the radio company would have to make a trip to go pull the device and send it to us physically.

Our upgrader did not run as root, but one time we had to move a file as root.. so I had to figure out a way to exploit our machine reliably from a local user, gain root, and move the file out of the way. We'd then deploy this over the satellite head end and N remote units would receive and run the upgrade autonomously. Fun stuff.

Turns out we had a separate process running that listened on a local socket and would run any command it received as root. Nobody remembered building or releasing it but it made my work quick.

singleshot_ 2 years ago | |

The person who built and released this might not have ever worked for your company, which might be why no one remembers building or releasing it.

gravitronic 2 years ago | | |

No no, I figured that out afterwards, in a past development iteration someone added it on purpose and then forgot all about it - "oh yeah we needed that to <solve some mundane problem>".

So... worse than subterfuge? That being said it only listened on the local socket, so it's slightly less bad, and I don't want to get into the myriad of correct ways that original problem could have been solved, but let's just say that company doesn't exist anymore.

cjbprime 2 years ago | | |

I admire your restraint in writing this comment. :)

qmarchi 2 years ago |

It's crazy to me that this is possible in the first place. Standard practice is to have a fleet of test vehicles that are effectively production except in an early release group.

Or, you know, having an A/B boot partition scheme with a watchdog. Things that have been around for decades at this point.

Disclaimer: Former Googler, Worked closely with Automotive.

cs702 2 years ago |

It's easy to underestimate how hard and expensive it is to build, deploy, and remotely upgrade software that runs reliably on a fleet of diverse cars (different models, different years, slightly different components from batch to batch, etc.). It makes updating a mobile phone OS look trivial in comparison.

So far, only Tesla seems to be able to update car software remotely, regularly and reliably. I'm certain it's neither easy nor cheap.

All things considered, physical buttons and dials are probably easier and cheaper, because they don't require software updates!

scardycat 2 years ago |

Bringing CI/CD mindset to cars is probably not a great idea. Software updates to commuter vehicles should have a high bar for operational standards, and a simple thing such as an expired certificate should have never been deployed. Having isolated networks in vehicles helps but doesn't prevent broken updates from, eventually, bricking the cars.

nomel 2 years ago | |

I think this shows more of a fundamental flaw in their update mechanism, than anything.

I don't think a botched update is a big deal. It happens, and should be expected, in a sane design. The fact that the customer noticed is a big deal.

There are many implementations that could be used for an "auto rollback" feature. They either failed to implement that in a sane way, or they were goobers, and assumed things would always be rosy.

babypuncher 2 years ago | | |

I would be pretty pissed if I went out to my garage to head to work one morning and found that a damn software update bricked my car overnight. This shouldn't even be a thing, why does a car need regular software updates to keep functioning?

gitfan86 2 years ago | | |

The Tesla update is slow probably for this reason. It is probably verifying that it can rollback at any point of failure.

1234letshaveatw 2 years ago | |

From a few days back- Its software has been a “key differentiator” https://electrek.co/2023/11/10/rivian-using-software-to-scal... kind of humorous in hindsight

wannacboatmovie 2 years ago |

Interesting to note that Ford's approach to updating software is far more conservative and car-like. It can be done fully offline via USB, but the instructions list uploading the log files written to the memory stick back to Ford as a necessary step once the update completes. Presumably so they can track and stop incidents like this before they happen fleet-wide.

Rivian seems more like a "ship it and we'll fix it in the next sprint!" company.

How do other manufacturers handle updates?

post_break 2 years ago | |

Ford's approach is flawed, however. You can still update Sync with a bad update and bork it over USB. Ask me how I know.

r00fus 2 years ago | | |

Pray tell, how painful was your discovery?

sturza 2 years ago | |

A/B partitions

barryrandall 2 years ago | | |

The last time I built something like that, it used partition 1 for the current version, 1 for the last version, 1 with the as-shipped version, and 1 that could restore A or B from the internet or USB.

reneberlin 2 years ago |

When will humans be crazy enough to update the firmware of artificial hearts OTA?

Updating cars with new features OTA, even "just" the infotainment, can cost lives, because the driver might get confused and take their eyes off the road.

It should be forbidden and every change should be made clear to the driver, shown in detail, and should need verification twice before being accepted. There must not be any kind of surprise in a car for the driver.

It should even be possible to skip an update or stop updating at all.

rekoil 2 years ago | |

Not updating cars OTA (yes, even "just" the infotainment) can potentially cost lives as well, as security holes would not get patched until the next service appointment.

qudat 2 years ago |

What a nightmare. This is where software engineering meets "real" engineering, where a "bug" has potentially life threatening consequences.

nomel 2 years ago | |

> where a "bug" has potentially life threatening consequences.

What are you referring to? That is not relevant to this story, and would require a deep understanding of the system to make such a claim of negligence.

“The issue impacts the infotainment system. In most cases, the rest of the vehicle systems are still operational ...”

Also, you can't do an update while driving.

jawns 2 years ago | | |

Based on the photo included in the article, what they're calling an infotainment system is actually two separate components, one of which appears to be taking the place of a traditional dashboard. If that's the case and there's no other way to monitor speed, fuel levels, engine temperature, warning lights, etc., I'd say that's quite a bit more worrisome than just not being able to play your favorite music while driving.

ct0 2 years ago | | |

You've never been to Death Valley without air conditioning, or Russia without heat. I think the infotainment system in this case has a broken climate control function. There are workarounds, but what if you don't have your phone?

qudat 2 years ago | | |

> What are you referring to?

Not the specifics of this article, but more generally about the gravity of the situation car makers (and their software engineers) operate under. The very idea that an OTA software update that causes a bug within more critical features of a car could be life threatening. So my point isn't about the specifics of this particular bug, rather the capacity for a bug that could kill.

nunez 2 years ago | |

critical safety systems/functions appear to be unaffected by this outage.

nicholasjarnold 2 years ago |

Is it possible, as a licensee of the Rivian vehicle system, to disable the automatic OTA updates without having expert-level knowledge or tooling?

Also, yes, I'm specifically avoiding using the word "owner" above for obvious reasons.

55873445216111 2 years ago | |

Rivian "licensee" here. So far all updates have required you to press a button (in the car or on the app) to launch the update installer. Not sure how many weeks you can ignore it for as I never tried.

bo1024 2 years ago | |

Confirming that updates are not automatic, and can be ignored indefinitely. For now.

martin8412 2 years ago |

Stuff like this is why I don't want OTA updates in my cars. Let the car dealership deal with it during regular maintenance. They'll be on the hook for fixing it before handing the car back to me.

gunapologist99 2 years ago |

This is why I don't really want my car to have any antenna (that receives/interprets code) or receive OTA updates, ever.

I'd like to please force any attackers to at least be within 50 feet of my TPMS, instead of being literally anywhere on the planet.

A car doesn't need data updates, and definitely not code updates[1]

1. source: every car built in previous century.

sbehere 2 years ago | |

> A car doesn't need data updates, and definitely not code updates

I don't think this is accurate. Many advanced driving assistance capabilities need access to updated map tiles, which is a data update. They may need code updates to fix errors or shortcomings that can be detected only after deployment on extensive fleets or in response to changes to the environment/infrastructure. This is just one example for why data and code updates are needed.

I think it is more accurate to say that a "dumb" car with mostly electro-mechanical systems doesn't need data updates and definitely not code updates. But that isn't true for vehicles built within the last few years and definitely untrue for vehicles that will be built in the coming years.

gunapologist99 2 years ago | | |

> Many advanced driving assistance capabilities need access to updated map tiles

Your phone (or GPS or even a paper map) can guide you; none of the following need access to map tiles:

* forward collision warning

* automatic emergency braking

* lane departure warning

* adaptive cruise control

* blind spot detection

* stability control

> code updates to fix errors or shortcomings

That's what recalls and TSBs have traditionally been for, and the driver can refuse them if desired. I mean, actual lives are at stake here. Would we (or should we) allow 737's to get OTA updates? Of course not. The target is too valuable and surface area too vast to adequately protect it.

pard68 2 years ago | |

My insistence on only driving cars made prior to 2005 keeps making more and more sense.

(2005 is just an arbitrary date I settled on, nothing significant about it)

eschneider 2 years ago |

This is a bit of a nightmare scenario and why when remote updating, you always test update to your own fleet first. Always.

toddmorey 2 years ago | |

It sounds like it was tested on their own fleet but they accidentally pushed the wrong bits when deploying the update more widely out to customers.

eschneider 2 years ago | | |

The usual "best practice" thing for IoT deploys is to deploy to "your" devices, wait for everyone to go green, then allow that build to deploy more widely. In a well-functioning system, it shouldn't be possible to swap bits between those stages.

But who knows what these guys were doing. :/
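
One way to make swapping bits between stages impossible is to approve a content hash rather than a build name, so only the exact bytes that went green internally can be released. A rough sketch, with a hypothetical `ReleasePipeline` class and an all-devices-green rule chosen for the example:

```python
import hashlib


class ReleasePipeline:
    """Toy two-stage pipeline that promotes artifacts by SHA-256 digest."""

    def __init__(self) -> None:
        self._approved: set[str] = set()

    @staticmethod
    def digest(artifact: bytes) -> str:
        return hashlib.sha256(artifact).hexdigest()

    def record_internal_result(self, artifact: bytes,
                               devices_green: int, devices_total: int) -> None:
        """Approve the exact bytes only if every internal device went green."""
        if devices_total > 0 and devices_green == devices_total:
            self._approved.add(self.digest(artifact))

    def release_to_customers(self, artifact: bytes) -> str:
        """Refuse any artifact whose digest was not approved internally."""
        d = self.digest(artifact)
        if d not in self._approved:
            raise PermissionError("artifact was not validated on the internal fleet")
        return d
```

With this shape, pushing a different build to customers fails at release time, regardless of what version label it carries.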

carlivar 2 years ago | | |

Maybe they should have an additional phase between test deploy and customers such as "employee personal vehicles".

kevin_nisbet 2 years ago | |

Yes. And also things like rolling out the update in batches, and then also things like golden images, where if there are two crashes or failures in the first 24 hours of the update, change to the last known good software version.
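
The crash-count rule above is simple enough to sketch directly. The function name and defaults are hypothetical; the two-crash threshold and 24-hour window come from the comment.

```python
from datetime import datetime, timedelta


def should_roll_back(installed_at: datetime, crash_times: list[datetime],
                     max_crashes: int = 2,
                     window: timedelta = timedelta(hours=24)) -> bool:
    """True if enough crashes happened within the window after the update."""
    deadline = installed_at + window
    recent = [t for t in crash_times if installed_at <= t < deadline]
    return len(recent) >= max_crashes
```

When this returns true, the device would switch back to the last known good version, which requires that version to still be present on the device.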

ralmidani 2 years ago |

Move fast and break things that move fast…

I don’t really like or trust most (if not all) of the established automakers, but there is something to be said for having several decades (over a century in some cases) of experience building potential killing machines vs. a company that’s not even 15 years old. The established players have put out cars which suffered freak malfunctions, but Rivian (and Tesla) seem to be struggling more with QA.

Non-rhetorical question: do companies have safeguards for critical components like braking systems, or are they also prone to catastrophic failure if a software engineer pushes a bad commit?

ezfe 2 years ago | |

The moving fast components were unaffected by this issue…

ralmidani 2 years ago | | |

I know, I just thought it was a decent pun.

baz00 2 years ago |

This is why I have a Dumbcar connected to a Smartphone via bluetooth.

toddmorey 2 years ago | |

Just a counterpoint: my dumb car has been undrivable way more often than my electric car.

They never deployed bad software updates but they sure have designed & deployed bad fuel pumps.

In some ways it’s all engineering and quality control.

akira2501 2 years ago | | |

Which is also why there is a huge non-OEM market for those types of parts. Can you even replace the Rivian "infotainment" system?

baz00 2 years ago | | |

I can go nearly anywhere and replace the fuel pump though.

barbazoo 2 years ago | |

That's why _I_ have a Dumbcar connected to a smartphone via FM for audio only :)

thatguy0900 2 years ago | | |

I've tried that before, but it sounded terrible. What dongle do you use?

Lightbody 2 years ago |

I have preorders in for the R1S, the Volvo EX90, and the Kia EV9. I passed once already on buying the R1S when they had one in town available for immediate purchase, simply because they refuse to adopt CarPlay.

This incident does NOT give me confidence that Rivian is likely to offer a better alternative to CarPlay, despite their statements otherwise.

I suspect the EX90 will be what I land on eventually.

p_j_w 2 years ago | |

> This incident does NOT give me confidence that Rivian is likely to offer a better alternative to CarPlay,

I have complete faith that, 5 and maybe even 10 years from now, no auto maker will have delivered anything that can compete with either CarPlay or Android Auto. The fact that an auto maker thinks they can do better is a sign of a really high level of either arrogance or outright greed. Complete deal breaker.

AndrewKemendo 2 years ago |

Whoever makes the first affordable, tight tolerance electric car that doesn’t spy on you and doesn’t need special care will win the market

samsquire 2 years ago |

This is actually a topic that I think about from time to time: how to make aggressive changes to software while it is running. In the Ruby world you have monkeypatching, and the Linux kernel has livepatching.

For example, if you have a distributed system and you want to upgrade a component that every caller uses: you have a large exercise on your hands where you might have to roll out a change over time and then clean up your incremental branches where you have to handle two control flow paths through the code. It reminds me of Google's protobuf required field discussions.

It reminds me of repository-per-microservice and a Java library that other microservices use and updating a dependency and having to deploy the change to every service.

It's like trying to change wheels on a car while the car is moving or refueling a jet in flight.

Unison lang is trying to solve this problem I think, by allowing multiple versions of a function to be available.

https://www.unison-lang.org/

Migrations in databases are painful too.

One solution I've thought of which is probably overengineered is that API call sites are an abstract object and their schema and arguments is centrally deployed, I called this "protocol manager".

The idea is you write all your code to use a "span" and have contextual data in a span, and you can include or exclude data in a span with a non-software rollout. Your communication schema of RPC and API calls is a runtime decided thing, not hardcoded.

If you have N deployed versions of code and you want to upgrade to X, you have to test 1..N to X versions. So nobody does that.

fabianlindfors 2 years ago | |

The database aspect of this problem is particularly interesting to me. I’ve previously built Reshape [0], a zero-downtime migration tool for Postgres, and am now working on ReshapeDB [1], a full database designed from the ground up to tackle this problem.

[0] https://github.com/fabianlindfors/reshape [1] https://reshapedb.com

jbott 2 years ago | |

You might be interested in learning about Erlang – it supports hot code reloads natively: https://oozou.com/blog/understanding-elixir-otp-applications...

Someone1234 2 years ago |

I wonder if the way Microsoft's XBox is designed may be something to look towards in terms of hardware reliability/fallback. Specifically they utilize a Hypervisor which rarely needs updates, running different operating environments which need frequent updates.

- Better isolation of different parts of the system (e.g. infotainment unit, instrument cluster, et al).

- Better isolation for updates (e.g. run a "beta" update, and a "stable" update side-by-side).

- Automatic error detection and rollback (e.g. if a VM keeps restarting after an update).

- Ease of offering features like rollbacks to end-users.

- Rare hypervisor updates can be held to a much higher standard relative to other VM updates.

The only downside of hypervisor-based systems is slightly higher hardware costs. But even that is largely mitigated by modern architectures that natively support virtualization.

PS - You can also look to any containerization. I specifically brought up the XBox because it is a hardware product, just like a vehicle.

kevinventullo 2 years ago |

My 2019 car is not connected to the internet. Instead, I use Apple CarPlay for everything.

Is there any reason not to do it this way?

antoniuschan99 2 years ago |

Wondering why there isn’t an option for a factory reset (eg. press and hold with a paperclip for 10 seconds)

1970-01-01 2 years ago |

Lexus did the very same thing about 8 years ago:

https://www.consumerreports.org/lexus/what-to-do-if-your-lex...

sarchertech 2 years ago |

Miku baby monitors deployed an automatic firmware update that bricked nearly every monitor in use, but not for nearly a month after the update.

It forced the company into bankruptcy because they had to replace all of them.

fsckboy 2 years ago |

I wish the economics of mass production didn't turn pennies into millions that need to be eliminated, because I've always thought the "don't disconnect from power" and "update bricks it" type problems could be solved by having extra EPROM to download into, the way linux keeps the previous kernel around after an update.

Or at least the ability to re-init/download from scratch, like a borked macbook disk. And hey, not the extra ability to do that, make it "the way it works" so you're always testing it.

wnevets 2 years ago |

This may be crazy, but if you're writing software for hardware that costs tens of thousands of dollars, it should be impossible to brick it with an update, especially if that update is OTA.

The future is going great.

thumbsup-_- 2 years ago |

This is the new world we will be living in, where you get into your car only to find that something is broken because of an OTA update. Some bugs from updates are OK on my phone, but I don't want any bugs in my car. What happens if it messes with safety systems? Or what happens if an OTA update breaks my car after it is out of warranty? Maybe I'm the only one who misses stable software in cars that, once vetted, just keeps running as-is as long as nothing around it changes (the ideal scenario for an offline car).

teeray 2 years ago |

An interesting thought experiment: what happens when these vehicles are out of warranty, and automakers accidentally send a vehicle-bricking OTA update? Isn’t that property damage?

jacquesm 2 years ago | |

This has happened to some Apple hardware, they fixed it for free in some cases but stiffed others:

https://discussions.apple.com/thread/253315438

With the mandatory mobile phone updates for a few years you're definitely going to see a lot more cases like that.

A thread about Tesla directly related to your question:

https://teslamotorsclub.com/tmc/threads/wholl-be-responsible...

karaterobot 2 years ago |

What kinds of changes are generally included in these over the air updates? I have this sudden urge to shake my fist at a cloud and tell the gods that cars shouldn't need updates in the first place, if the car was ever deemed ready for production and then sold to customers for money. But, maybe I'm wrong, and it makes perfect sense. All I can think of would be something like a periodic update to navigation data, is that it?

ezfe 2 years ago | |

It’s possible to deem software ready to sell but find improvements later.

Simple example: my Subaru was sold to me with an interesting design decision that caused the radio to come on whenever the car was started. This was not a bug. Every Subaru worked this way for years. A year into ownership I received an OTA update that added a “not playing” state on startup.

This was never a safety issue and was likely not a defect. It was, however, stupid and needed to be changed.

karaterobot 2 years ago | | |

I wish my Mazda had this option! But I would still say that I'd expect them to have included this option before selling the car, especially since radios and user preferences around radio UI are pretty well established.

bfrog 2 years ago |

It’s funny, I was just talking to someone the other day about A/B image slots and boots, and how they had written a test suite because there were so many potential places where a partial update could be interrupted.

Thousands of test points having to be verified was my understanding. That’s before even getting to the confirmed boot/watchdog aspect.

What a hassle, hope they like spending money on labor because it sounds like they are going to need to.
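
For a sense of what such a test suite checks, here is a heavily simplified sketch: cut power before each step of an A/B update and assert the device can still boot. The four step names and the dict-based device model are assumptions; a real suite would also cover partial writes inside each step, which is where the thousands of test points come from.

```python
STEPS = ["erase_spare", "write_spare", "verify_spare", "switch_active"]


def fresh_device():
    # Slot "b" starts with an older but still valid image.
    return {"a": "v1", "b": "v0", "active": "a"}


def run_update(dev, new_version, fail_before=None):
    """Apply the update, stopping (as if power was cut) before step `fail_before`."""
    spare = "b" if dev["active"] == "a" else "a"
    for step in STEPS:
        if step == fail_before:
            return
        if step == "erase_spare":
            dev[spare] = None
        elif step == "write_spare":
            dev[spare] = new_version
        elif step == "verify_spare":
            if dev[spare] != new_version:
                raise RuntimeError("spare slot failed verification")
        elif step == "switch_active":
            dev["active"] = spare


def bootable(dev):
    """The device boots if the active slot holds any complete image."""
    return dev[dev["active"]] is not None


def interruption_points():
    """Yield (step, device state) for a power cut before every step, plus no cut."""
    for step in STEPS + [None]:
        dev = fresh_device()
        run_update(dev, "v2", fail_before=step)
        yield step, dev


if __name__ == "__main__":
    for step, dev in interruption_points():
        print(step, dev["active"], bootable(dev))
```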

adolph 2 years ago |

> The vehicles are drivable but software and displays go black. It appears that the 2023.42 software update hangs at 90% on the vehicle screen or 50% on the app screen and then the vehicle screens black out. All systems appear to still work except for the displays.

This is what I do with my Prius to get a comfortably distraction-free driving environment. Sounds like a feature not a bug.

altairprime 2 years ago | |

Technically, the NHTSA could order an immediate recall for all Rivian vehicles due to this issue, as the disabled defroster controls are a critical safety issue in cold and/or humid environments. Tesla was forced to issue a recall notice over the controls being buried in a menu; Rivian’s “defroster unavailable during driving due to manufacturer error” is far worse, especially given the mass and torque of their vehicles, relative to unarmored road users.

sturza 2 years ago | |

Instrument cluster display going black is a functional safety/QM issue. No blinker, transmission direction, speed etc confirmations.

bri3d 2 years ago | | |

It looks like they correctly isolated the safety critical components on the instrument cluster and they are still functional without infotainment: https://twitter.com/RivianSoftware/status/172443804967573962...

eigenvalue 2 years ago |

Can’t imagine how much it would suck to be the engineer who fat fingered it and caused a huge crisis for the company, inconveniencing tons of customers and costing millions. Even if there should be processes in place to prevent it in the first place, you’d still know you were the “but for cause” of the problem.

nicolaslem 2 years ago |

This is the kind of thing that keeps me awake at night.

Does anyone here have some practical tips to turn an embedded Linux machine into an appliance? The kind of system that a botched update cannot brick but only momentarily disable until a non-technical user presses a factory reset button of some sort.

elitepleb 2 years ago | |

A/B updates as implemented in android, https://bootlin.com/pub/conferences/2022/elce/opdenacker-imp...
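
The core of that scheme, as I understand it, is per-slot metadata (a priority, a count of remaining boot attempts, and a "successful" flag) that the bootloader consults on every boot. Below is a toy model of that logic; the `Slot` fields and function names are simplified stand-ins, not the actual Android boot control interface.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    name: str
    priority: int          # higher wins
    tries_remaining: int   # boot attempts left before the slot is given up on
    successful: bool       # set by the OS once it has booted far enough


def pick_slot(slots):
    """Bootloader side: choose a slot and charge it one attempt if unconfirmed."""
    candidates = [s for s in slots if s.successful or s.tries_remaining > 0]
    if not candidates:
        return None  # nothing bootable; a real device would enter recovery here
    chosen = max(candidates, key=lambda s: s.priority)
    if not chosen.successful:
        chosen.tries_remaining -= 1
    return chosen


def mark_successful(slot):
    """OS side: call only after the new system is known to be healthy."""
    slot.successful = True


if __name__ == "__main__":
    slots = [Slot("a", 14, 0, True), Slot("b", 15, 3, False)]
    # The new slot "b" never confirms, so it runs out of tries and "a" takes over.
    print([pick_slot(slots).name for _ in range(4)])
```

A botched update then costs a few failed boots instead of a brick, with no user action needed.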

hospitalJail 2 years ago | |

>Does anyone here have some practical tips to turn an embedded Linux machine into an appliance?

Lol

I suppose this is the negative about having sensors that make sure water gets hot enough to be sanitizing, but not so hot that it wastes energy. And I'm sure you can imagine 100 other uses of having a microcontroller/CPU process data and do feedback. (I'm sure there are EE-only ways of doing it, but theoretically possible and useful are two different things.)

nunez 2 years ago |

/r/Rivian is a class act. I expected a wall of screaming, but instead entered a relatively calm room. People are upset, but there's no seething or flamewars, which is kind of surprising given the cost of these trucks ($80k+, Range Rover territory).

M3L0NM4N 2 years ago | |

I think the reason is because they're $80k trucks, not $400/month Tesla leases. Also, they're first generation and I think most of the buyers understand that.

Havoc 2 years ago |

> the vehicle is not bricked

What a time to be alive. Software updates (almost) turning cars into paper weights lol

ct0 2 years ago |

Will insurance carriers cover damages due to botched updates? Imagine 10 years from now the power/control that electric delivery companies would have over retailers like Amazon. One botched update away from a complete backup for delivery vans.

cryptoegorophy 2 years ago |

Tesla updates are sent in batches, and you can opt in to “advanced” updates, I guess to get them earlier. Normally, when I see on Reddit that there is an update, it takes at least 1-2 weeks to get to my car with the “advanced” updates on.
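
I don't know how Tesla actually batches, but a common way to do this kind of staged rollout is to hash each vehicle into a stable bucket and release to a growing percentage. A sketch under that assumption (function names, the 20-point head start for opted-in vehicles, and the ID format are all invented):

```python
import hashlib


def rollout_bucket(vehicle_id: str, update_id: str) -> int:
    """Map a vehicle to a stable bucket in 0..99 for a given update."""
    digest = hashlib.sha256(f"{update_id}:{vehicle_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_eligible(vehicle_id, update_id, percent_released, early_access=False, head_start=20):
    """Opted-in vehicles are treated as if the rollout were `head_start` points further along."""
    effective = min(100, percent_released + head_start) if early_access else percent_released
    return rollout_bucket(vehicle_id, update_id) < effective


if __name__ == "__main__":
    fleet = [f"car-{i}" for i in range(1000)]
    for pct in (1, 10, 50, 100):
        count = sum(is_eligible(v, "update-1", pct) for v in fleet)
        print(pct, count)
```

Mixing the update ID into the hash means a different set of cars goes first each time, so the same owners aren't always the guinea pigs.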

glonq 2 years ago |

As somebody who has spent many years doing embedded+iot related to remote fleet firmware updates, this is the kind of thing that lurks in my nightmares.

I'd love to be a fly on the wall at Rivian engineering/operations this week!

easylion 2 years ago |

Need an easy way to restore to the previous version offline. Take 100 bucks extra if that's what it takes to have a backup SSD. I don’t want to be camping and then realize I’m stuck because some junior dev wasn't competent enough.

seattle_spring 2 years ago | |

Why would you intentionally upgrade your vehicle software while camping? It’s not like this stuff installs automatically, you have to explicitly accept the installation. Waiting a few days or even a few weeks before hitting “install” is completely normal.

avereveard 2 years ago |

> In most cases, the rest of the vehicle systems are still operational

Like, what do you mean "in most cases"? I can understand a broken infotainment system needing a reset, but imagine if you had to tow your truck. I'd be furious.

MisterTea 2 years ago |

Can I please just buy a car with a motor and battery? Why does every god damn vehicle have to come littered with screens and chips all together like some tentacle monster?

All I need is a gauge cluster screen that can display the normal info like speed and heading while also letting me configure the car's performance and safety features. Then let me mount a double DIN radio that isn't dog shit. I've not seen a single new car with these dumb screens with a sound system that's not tinny muddy garbage with zero adjustment save for "bass" and "treble" settings. I mean all that technology and you can't be assed to put an eq in there. HVAC never needed more than two or three knobs anyway.

fhub 2 years ago |

I'm going to have a chuckle next time I pass the Databricks billboard on 101 in San Francisco "Rivian powered by Databricks" or something to that tune.

WirelessGigabit 2 years ago |

What's the impact on your insurance should you get into an accident?

The speedometer screen is gone, so does that not imply the vehicle is inherently unsafe to drive?

Am4TIfIsER0ppos 2 years ago |

Look at all these commenters saying "code signing was done wrong" when the wrong part is code signing at all.

j45 2 years ago |

As long as they are good for fixing it, this might be what being a pioneer or early adopter is about.

emmelaich 2 years ago |

Poor title; physical repair is not required. Physical presence is required.

Someone1234 2 years ago | |

The article doesn't really state what is required to repair the vehicle. I'd assume if it was as simple as loading a flash drive and plugging it in, then Rivian would have provided a way for customers to self-fix. The second a single body panel is removed to gain access to the headunit, it is a physical repair.

So without more info we cannot know if it is accurate or not.

emmelaich 2 years ago | | |

I don't think many people would consider removing a body panel to be a physical repair. I think the term is 'back to base' or similar.

Physical repair suggests e.g. a burnt out capacitor

immy 2 years ago |

That’s funny, I just saw a job posting for Rivian Infotainment team

b20000 2 years ago |

“we use leetcode to filter out hires because it works for us”

whoopsie 2 years ago |

Ah this is why CarPlay isn’t worth adding, right?

FireBeyond 2 years ago |

As annoying as this is, I find this laughable, too. Rivian updated users on the situation. Then, whines Electrek:

> That’s the last update we had over 10 hours after Rivian customer vehicles were fed the bad software update.

"Over 10 hours"!

I suppose it isn't Tesla, who yeets updates over the fence that break new things, yeets another update that fixes that problem but introduces another one, then reverts back to two versions prior, before the issue. The Tesla that gets firmware fixes from vendors that have a test harness that should take 36+ hours to run, but says YOLO and flashes it onto a random car they have lying around and emails the vendor back 3 hours later saying "LGTM, WFM, thanks!"

shoelessone 2 years ago |

Honestly this makes me feel good, just because it always worries me that I don't see this type of issue being resolved more often. Having to physically bring in a car seems like a near worst-case situation, but it's good to keep this in our minds as a possibility.

sitzkrieg 2 years ago |

I can't believe this sort of stuff is acceptable. What a clown industry.

thrill 2 years ago |

Inexcusable, really.

collsni 2 years ago |

OTA on a car. What could go wrong?

janitor61 2 years ago |

This is tangentially Rivian related, but does anyone else see the inherent danger of stylized tail lights that are just a single red bar across the back of the car? Travelling on the freeway at night I can't really gauge the distance to the car in front of me if it's far ahead and if there's no discernable left and right brake lights. I'd believe Rivians and other cars like that are more at risk of high speed rear-end collisions.

rurp 2 years ago | |

This reminds me of the terrible turn signals Mini used, which look like flashing arrows pointing in the opposite direction of the turn[0].

Getting cute with basic stuff like tail lights is forgettable or annoying at best, and absolutely can be dangerous.

[0]https://jalopnik.com/congratulations-mini-you-made-the-stupi...

xyst 2 years ago |

Looks like this car brand is circling the drain. Glad I never bought into the hype.

seattle_spring 2 years ago | |

It’s circling the drain because of one bad software update?

Sounds more like you’ve just bought into the doom and gloom that a few specific news outlets have been pushing.

xyst 2 years ago |

Tesla. Rivian. All cut from the same cloth. A car should be simple. Yet we are stuffing all of this tech junk into it and trying to repackage it as something else to pump the numbers.

Car companies suck at tech. Let’s be realistic. They should stay in their lane and focus on improving the car and its physical aspects (safety, reducing carbon output, longevity, ease of repairability, reducing supply chain issues).

bhauer 2 years ago | |

> Tesla. Rivian. All cut from the same cloth.

I'm not aware of any Tesla OTA updates bricking the infotainment system. At least since I've been paying attention. I don't see them quite as similar as you suggest.

margalabargala 2 years ago | | |

There have been plenty.

https://www.reddit.com/r/TeslaLounge/comments/112oqln/new_te...

https://teslamotorsclub.com/tmc/threads/failed-software-upda...

Two examples of many.

I'm not aware of any fleet-wide issues that accidentally bricked Teslas, but as one-offs they do happen; and unlike this Rivian update, a botched Tesla OTA generally leaves the car undriveable and needing to be towed. These Rivians will at least still drive, as long as you don't need fancy extraneous luxury features like a...speedometer.