Tesla Transport Protocol over Ethernet (TTPoE) (github.com)
* https://news.ycombinator.com/item?id=41374663
* https://chipsandcheese.com/2024/08/27/teslas-ttpoe-at-hot-ch...
If Tesla were really seeking to shake things up, they wouldn't have picked IPv4 to do it when the newest release has been around for nearly 30 years and has latency reduction baked in.
This smacks of a pandersome attempt from a company that sees the quite mandarin writing on the walls and has decided (in true Muskovite fashion) they too are just a misunderstood font of futurism.
TCP has the wrong abstraction for truly high performance.
I wouldn't necessarily standardize what Tesla does here, but most of the big companies have their own layer 3 transport protocol for things that need truly high speed and are operating within a datacenter.
Cray/HPE has their own Ethernet-based protocol (Slingshot was an earlier version of it - not sure what its name is now) which seems to be better than whatever Tesla has, but is not necessarily published.
- looks dead simple
- no IP layer (there's a ttpip folder in that repo though)
- distributed congestion control (TCP has a "window" field + a bunch of tentative RFCs, this has a purposeful "congestion")
- 100% implementable in hardware (TCP can, but it's complex)
Not a general TCP replacement, but the README properly highlights a "many endpoints local link" use case:
> the protocol executed entirely in hardware and deployed to a very large multi-ExaFlops (fp16) supercomputer with over 10s of thousands of concurrent endpoints. This protocol does not need a CPU or OS to be involved in any way to link and execute.
It's of no interest on the internet or any small-scale network.
Infiniband instead makes the sides bargain to avoid packet loss, while the medium is supposed to be reliable.
As mentioned in README, this was submitted to the larger Ultra Ethernet consortium for consideration:
> Deliver an Ethernet based open, interoperable, high performance, full-communications stack architecture to meet the growing network demands of AI & HPC at scale
This reeks of NIH.
You can always beat general purpose solutions like the TCP/IP/UDP stack if you try. For most it isn’t worth it.
- TTPoE is designed to be implemented at hardware level unlike UDP
- UDP cannot guarantee transmission whereas this does
- TTPoE is built for distributed resilience
I hope they're not hoping for mass adoption with an attitude like that. Not exactly inspiring confidence in the longevity and maintainability.
Every engineering company releases stuff like this. It’s not meant to change the world. It’s marketing to recruit other engineers who would find that problem interesting.
I'm not so sure about that. From the repo:
> Tesla also announced joining the Ultra Ethernet Consortium (UEC) to share this protocol and work to standardize a new high-speed/low-latency fabric (be that TTPoE or otherwise) for AI/ML/Datacenters
Also it's a protocol, personally I will only use a protocol that's fully spec'd. It's a pain sometimes to have consensus among all contributors but it's valuable.
> edit : I will only use a protocol that's fully spec'd IN PROD
This is currently the state of much modern documentation from huge tech companies.
..which also does not inspire confidence.
Why does it have to be perfectly documented in a public GitHub repo? Are all other car companies "properly" publicly documenting things on GitHub?
Does it inspire more confidence in VW's software stack if they don't share it? Is VW's confidential stack some big competitive advantage? I've used a VW ID electric vehicle. I did not come away that impressed.
CAN (or one of its more modern variants) is historically more common in automotive. However, with 2-wire Ethernet connections becoming more commonplace, I do think you're right that more and more cars will be moving to an Ethernet fieldbus.
EtherNet/IP is not as robust for many applications as its competitors (PROFINET, EtherCAT) since it is not fully deterministic. EtherCAT is my personal favorite.
Its only advantage is that it can coexist with other TCP traffic and run over standard switches, but that just results in unreliable fieldbus performance.
If you wouldn't mind reviewing https://news.ycombinator.com/newsguidelines.html and taking the intended spirit of the site more to heart, we'd be grateful.
Random guessing - EtherCAT seems more likely to take over from CAN because CoE (CANopen over EtherCAT) is so common.
It's very easy to turn CAN devices into EtherCAT ones.
Harder to turn them into PROFINET ones.
Seems like a more incremental path for car makers.
Otherwise the main advantage of PROFINET is that you can treat it like regular Ethernet (i.e. switches, etc.), but I'm not sure anyone cares in a car.
That's how things usually get done, right?
It's not widely known, but Tesla probably has one of the largest training clusters, because practically all the GPUs they buy go towards training, while most of the GPUs at e.g. OpenAI go towards inference. Tesla does inference in the car.
So most likely that. I agree that this seems to have very little to do with cars.
2 wire Ethernet is also a thing that they spearheaded.
This is the way it goes here in HN for anything related to Musk.
No, spec bad. Protocol unknown. Poof
edit: > This is the way it goes here in HN for anything related to Musk.
Nobody mentioned Musk ... Except you.
After a good amount of back and forth with the customer, and several test programs run on the system in question, I eventually came up with a hypothesis that there was an error in the write path of the SAN as small writes succeeded while larger writes failed. The customer ultimately found there was a dirty fibre on one of the links in their FC fabric. It was dirty enough to corrupt large packets, but not so dirty that smaller writes and control packets were unable to get through. Since multipathd only checks to see if a given target can be read from, it would never fail over to the other path (which was fine). So much for trying to build a high availability system using an expensive SAN!
Lesson of the story: what you think is a lossless network is not always lossless. Using the IP stack has a lot of beneficial diagnostic tools that you really start missing when something goes awry in a non-IP network.
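To make the failure mode in the story above concrete, here is a toy simulation (not the actual SAN, multipathd, or any real API). It assumes a hypothetical link that passes small frames untouched and corrupts anything above an invented size threshold; `CORRUPT_ABOVE`, `send_over_dirty_link`, and `path_is_healthy` are made-up names for illustration only.

```python
import os

# Hypothetical threshold: frames larger than this get damaged in transit.
CORRUPT_ABOVE = 2048


def send_over_dirty_link(payload: bytes) -> bytes:
    """Toy link model: small payloads survive, large ones get one bit flipped."""
    if len(payload) <= CORRUPT_ABOVE:
        return payload
    damaged = bytearray(payload)
    damaged[-1] ^= 0x01  # flip the lowest bit of the last byte
    return bytes(damaged)


def path_is_healthy(probe_size: int) -> bool:
    """Send a random probe of the given size and check it arrives intact."""
    payload = os.urandom(probe_size)
    return send_over_dirty_link(payload) == payload


# A small probe reports the path as fine; a large one exposes the fault.
small_probe_ok = path_is_healthy(512)
large_probe_ok = path_is_healthy(64 * 1024)
```

The point of the sketch is only that a health check exercising one packet size can pass on a path that fails for another, which is why a path checker that only does small reads would never trigger failover here.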
Moreover, the multipath should have stopped that! It should have detected a bad link and failed over to the other one (but the config for that is hard, so I can see why that might not have worked).
These and many other performance issues left me with a particular hatred of SANs.
Anyway, to your specific point, IP at all is basically overkill in a cluster architecture. Very few IP stacks function properly without having to get things like ARP involved; the more of this stack you can get rid of, the better performance you get and there's less to maintain. TTPoE reminds me the most of ATA over Ethernet, a previous effort to shed the complexity of a protocol designed for global networking. It worked great until you hit scaling issues, which competing tech leveraged the aforementioned complexity to address.
I have implemented ARP and UDP on FPGAs for some toy projects, and it's really not that difficult. One of the use-cases I played around with was getting debug data out of an FPGA at multigigabit rates -- things like PCIe TLPs and raw SERDES data from an EPON implementation to debug a burst mode CDR. The fact that the protocol was IPv4/UDP was no impediment to having it push data through at line rate. Once you've implemented parallel CRC32 for ethernet packets from scratch on a 256-512 bit wide data bus where packets can start and end on arbitrary 32 bit boundaries, the complexity of IPv4 and UDP checksums is dead simple in comparison.
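For readers wondering why the IPv4/UDP checksum counts as "dead simple" next to CRC32: it is just a one's-complement sum of 16-bit words (RFC 1071), with no polynomial division. Below is a minimal software sketch in Python, not the commenter's FPGA logic; `internet_checksum` is a name chosen here, and the sample header bytes are a commonly used textbook example rather than anything from the thread.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """RFC 1071 checksum: one's-complement of the one's-complement 16-bit sum."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    total = 0
    for (word,) in struct.iter_unpack("!H", data):
        total += word
    # Fold carries back into the low 16 bits (end-around carry).
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


# 20-byte IPv4 header with the checksum field (bytes 10-11) set to zero.
header = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")
checksum = internet_checksum(header)

# Re-inserting the checksum makes the whole header sum to zero on verification.
filled = header[:10] + struct.pack("!H", checksum) + header[12:]
verified = internet_checksum(filled) == 0
```

In hardware the same sum reduces to a few adders with end-around carry, which is consistent with the comment that the hard part is the wide parallel CRC32, not the IP layer.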
I understand and agree with throwing out TCP in TTPoE. I do not agree with throwing out IPv4 / IPv6. Heck, you don't even need ARP for v6, you could get away with link local addresses using the ethernet MAC address you already need to have anyways.
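A rough sketch of the link-local idea, assuming the modified EUI-64 scheme (flip the universal/local bit, insert `ff:fe` in the middle, prefix with `fe80::/64`). The function names are made up here, and real stacks may use other interface-identifier schemes, so treat this as an illustration of why the address-to-MAC mapping can be computed rather than looked up.

```python
import ipaddress


def link_local_from_mac(mac: str) -> ipaddress.IPv6Address:
    """Derive an fe80::/64 address from a MAC using modified EUI-64."""
    octets = bytes(int(part, 16) for part in mac.split(":"))
    if len(octets) != 6:
        raise ValueError("expected a 6-byte MAC address")
    # Flip the universal/local bit and insert ff:fe between the two halves.
    iid = bytes([octets[0] ^ 0x02]) + octets[1:3] + b"\xff\xfe" + octets[3:6]
    return ipaddress.IPv6Address(b"\xfe\x80" + b"\x00" * 6 + iid)


def mac_from_link_local(addr: ipaddress.IPv6Address) -> str:
    """Invert the mapping above; only valid for EUI-64 style addresses."""
    iid = addr.packed[8:]
    if iid[3:5] != b"\xff\xfe":
        raise ValueError("interface identifier is not EUI-64 derived")
    mac = bytes([iid[0] ^ 0x02]) + iid[1:3] + iid[5:8]
    return ":".join(f"{b:02x}" for b in mac)


addr = link_local_from_mac("00:11:22:33:44:55")
round_trip = mac_from_link_local(addr)
```

Because the mapping is reversible, a closed cluster that controls both ends could in principle skip neighbor resolution entirely for these addresses.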