The MilkV Jupiter 2/SpacemiT K3 (RISC-V vector compute) (taoofmac.com)
Is the LPDDR5 soldered or can you upgrade it? E: Looks like it's soldered, I wonder what the IMC(?) is actually good for
https://github.com/brucehoult/k3_ai
Or my longer top level comment.
Unlike a GPU or NPU, you can just run all your normal RISC-V Linux programs on the AI cores. Bash, gcc, emacs, nodejs ... whatever you want. It's an extra 40% of scalar processing power, for free.
Build Linux kernel 7503345ac5f5
14m25.56s SpacemiT K3 8 X100 cores plus 8x A100 cores distcc
18m3.871s SpacemiT K3 8 X100 cores @2.4 GHz
19m12.787s i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
39m23.187s SpacemiT K3 8 A100 cores @2.0 GHz
42m12.414s Milk-V Megrez 4 P550 cores @1.8 GHz
67m35.189s VisionFive 2, 4 U74 cores @1.5 GHz
70m57.001s LicheePi 3A, 8 X60 cores @1.6 GHz
88m04.185s LicheePi 4A, 4 C910 cores @1.85 GHz
The A100 cores, all by themselves, give more normal processing power e.g. `gcc` than any previous RISC-V SBC except the $2500 64 core Milk-V Pioneer.
NVMe reads were faster! (Some interesting potential wins there, assuming you can get data from NVMe onto the core without going through main memory, a feature available since Sandy Bridge-EP (2011), in the form of Data Direct IO aka DDIO). I crack jokes about "PCIe speed ahead", but that's seemingly real here (at huge cost to latency, which CXL promises to remedy).
There is a non-zero chance the main cores cannot saturate what the memory controller can do, that the AI cores have some reserved bandwidth to themselves. I doubt it's going to double the memory bna
One absolute ecosystem gem from this article that I didn't know before: the Orange Pi 6 uses CrosEC, the embedded controller for Chromebooks (RIP I guess?). I wonder if this is the newer Zephyr IoT version (awesome, also underlies Framework's new embedded controllers) or the older legacy version of CrosEC. It's not spoken of flatteringly in this implementation, but borrowing firmware from a place I didn't expect is super notable to me! There's good upstream kernel support, though, so it makes sense! https://chromium.googlesource.com/chromiumos/platform/ec/+/H...
One interesting architectural nit I need to dig into: the AI cores appear to share their AI units. This reminds me a lot of AMD Bulldozer (2011), which had semi-independent CPUs but a shared FPU. It was an interesting chip (I still haven't disposed of my old FX-8320 server), but not well loved.
Really appreciate the dive into the matrix cores. That's going to take more time for me to look at, but: thanks. I notice the architecture diagram says all cores have AI instructions, not just the A100's. Presumably it's the same instruction set/features?
The memory bandwidth situation here feels so off. We've lived in a world where it's a battle for cores, where how many cores one could ship made chip empires rise and fall. Today, the memory bandwidth wars are on, and supplies are scarce. This looks like a fascinating board with amazing capabilities, but wow, that lack of memory bandwidth here is most surprising.
I'm running Qwen3-Coder-30B-A3B-Instruct-Q5_K_M.gguf on mine ... picked kind of at random from a web page as I'm a complete n00b at running local LLMs, have never used anything other than ChatGPT or (mostly) Grok.
I get 6-7 tok/s, which is slow for throwing around program code, but fine for general knowledge queries. It's a reasonable speed to read along as it outputs. Simple queries start to give output in about 2 seconds.
e.g. I dunno ..
> What does auipc do?
`auipc` stands for "Add Upper Immediate to PC" and is a RISC-V instruction that:
## What it does:
- Takes a 20-bit immediate value and adds it to the program counter (PC)
- The immediate is sign-extended and shifted left by 12 bits
- The result is stored in a destination register
## Syntax:
```
auipc rd, imm
```
## Key characteristics:
- **PC-relative addressing**: Uses the current program counter as a base
- **12-bit offset**: The immediate is shifted left by 12 bits, so it's effectively a 32-bit value
- **No memory access**: Only performs arithmetic, no memory operations
- **Used for**: Creating 32-bit immediate values, setting up base addresses for other instructions
## Example:
```assembly
auipc x5, 0x1000 # x5 = PC + (0x1000 << 12)
```
## Common uses:
- Setting up base addresses for load/store instructions
- Creating 32-bit immediate values when the standard `li` instruction can't handle the full range
- Jump table setup
- Address calculation for large data structures
The instruction is particularly useful in RISC-V because it allows for efficient 32-bit address generation while maintaining the 16-bit instruction size constraint.
[ Prompt: 20.6 t/s | Generation: 6.5 t/s ]
Output started in about 2 seconds.
> what is the airspeed velocity of an unladen swallow?
The classic answer to this question comes from Monty Python's "Holy Grail" - when asked about the airspeed velocity of an unladen swallow, the character replies "What do you mean? An African or European swallow?"
But in reality, the airspeed velocity depends on the species:
**European Swallow (Hirundo rustica):**
- Typical flight speed: 15-20 mph (24-32 km/h)
- Maximum speed: around 25 mph (40 km/h)
**African Swallow (Hirundo sp.):**
- Similar speeds to European swallows, though some species can reach up to 30 mph (48 km/h)
The "unladen" part refers to the swallow not carrying anything, which would slow it down. Swallows are quite agile birds that can fly at impressive speeds while maneuvering through the air.
So while the Python reference is the more famous answer, the real-world airspeed of an unladen swallow is roughly 15-25 mph, depending on the specific species and conditions.
[ Prompt: 25.5 t/s | Generation: 6.6 t/s ]
Again, output starts in about two seconds.
This is offline, no internet, and uses 14W while running all 8 A100 "AI" cores at max.
Is this useful? I mean, for something, right?
I asked it to review https://github.com/brucehoult/trv which is a total of 320 lines of code (I used `/read` on a tar file containing the two code files). It thought for 22 minutes before output started and then spent 8 minutes outputting comments at just over 6.5 tok/s.
Nothing there to scare Claude, but 30 minutes total is still faster than asking a colleague for a code review, and probably more comprehensive too. And it did it on about 0.25 cents of electricity.
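As a rough sanity check of those figures, here is a minimal back-of-the-envelope sketch. It assumes the 14W draw quoted for the chat runs also held for the whole 30-minute review, which the comment does not state; the electricity price it derives is only what the "0.25 cents" figure would imply, not a quoted number.

```python
# Figures quoted in the comments above.
POWER_W = 14.0            # draw with all 8 A100 cores busy (assumed constant here)
THINK_MIN = 22.0          # minutes before output started
OUTPUT_MIN = 8.0          # minutes spent emitting the review
GEN_TOK_PER_S = 6.5       # "just over 6.5 tok/s"
QUOTED_COST_CENTS = 0.25  # "about 0.25 cents of electricity"

total_hours = (THINK_MIN + OUTPUT_MIN) / 60.0
energy_kwh = POWER_W * total_hours / 1000.0

# Lower bound on review length, since the rate was "just over" 6.5 tok/s.
output_tokens = OUTPUT_MIN * 60.0 * GEN_TOK_PER_S

# Electricity price that would make the quoted cost come out right.
implied_cents_per_kwh = QUOTED_COST_CENTS / energy_kwh

print(f"energy used:   {energy_kwh * 1000:.1f} Wh")
print(f"review length: at least {output_tokens:.0f} tokens")
print(f"implied price: {implied_cents_per_kwh:.1f} cents/kWh")
```

Under those assumptions the run uses about 7 Wh, so the quoted cost corresponds to a tariff in the mid-30s of cents per kWh.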
> Turns out getting a thread onto the A100 cores requires a two-step handshake:
>
> write the thread’s TID to /proc/set_ai_thread (a kernel interface that unlocks scheduling on cores 8–15 for that specific thread)
> then call sched_setaffinity to pin it.
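The two steps in the quote can be sketched in a few lines of Python. This is an illustration only, not the launcher's code: the function name `move_to_ai_cores` and its parameters are hypothetical, and it assumes `/proc/set_ai_thread` accepts the TID as plain decimal text, which the quote does not spell out.

```python
import os
import threading

AI_THREAD_PATH = "/proc/set_ai_thread"  # interface named in the quoted article
AI_CORES = range(8, 16)                 # cores 8-15, per the quoted article


def move_to_ai_cores(proc_path=AI_THREAD_PATH, cores=AI_CORES, set_affinity=None):
    """Opt the calling thread in to the A100 cores, then pin it there.

    `set_affinity` is injectable so the logic can be exercised off-target;
    by default the real os.sched_setaffinity (Linux only) is used.
    """
    if set_affinity is None:
        set_affinity = os.sched_setaffinity
    tid = threading.get_native_id()
    # Step 1: ask the kernel to allow this thread on the AI cores.
    # Assumption: the file takes the TID as decimal text.
    with open(proc_path, "w") as f:
        f.write(str(tid))
    # Step 2: restrict the caller (pid 0) to those cores.
    set_affinity(0, set(cores))
    return tid


if __name__ == "__main__":
    move_to_ai_cores()
    print("allowed CPUs:", sorted(os.sched_getaffinity(0)))
```

The order matters according to the quote: the affinity call only makes sense once the kernel has unlocked cores 8–15 for that thread.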
If you want to just run arbitrary Linux programs on the A100 cores, I wrote a small assembly language launcher which does the above PID writing and then EXECs the thing you really want.
# just run a single program on the A100 cores
ai as hello.s -o hello.o
# same thing but maybe 1ms faster
aix /usr/bin/as hello.s -o hello.o
# run a whole build. All processes started by `make` will run on the A100 cores.
ai make -j8 test
# start a shell on the A100 cores. All programs run from it will be run only on the A100 cores
ai bash
https://github.com/brucehoult/k3_ai
As normal CPUs the eight 2-wide in-order A100 cores (like an A53 or A55 or Pentium or PPC603) add about 40% normal scalar processing power to the eight X100 cores.
That's better than Hyperthreading and well worth using for some additional processing power. Just kick off a background build, or CI or something there while you do something else on the X100 cores. If you ignore the special "AI" matrix processing extension they are just perfectly normal RISC-V RVA23 cores as far as user code is concerned — and in fact significantly faster than the previous generation K1 chip.
A Linux kernel build on just the A100 "AI" cores is faster than on any previous RISC-V SBC under $1000, including the HiFive Premier P550 or Milk-V Megrez. It's several times faster than the VisionFive 2 or Milk-V Jupiter / BPI-F3.
Build Linux kernel 7503345ac5f5
14m25.56s SpacemiT K3 8 X100 cores plus 8x A100 cores distcc
18m3.871s SpacemiT K3 8 X100 cores @2.4 GHz
19m12.787s i9-13900HX, 24C/32T @5.4 GHz, riscv64/ubuntu docker
39m23.187s SpacemiT K3 8 A100 cores @2.0 GHz
42m12.414s Milk-V Megrez 4 P550 cores @1.8 GHz
67m35.189s VisionFive 2, 4 U74 cores @1.5 GHz
70m57.001s LicheePi 3A, 8 X60 cores @1.6 GHz
88m04.185s LicheePi 4A, 4 C910 cores @1.85 GHz
The K3 is also faster than using QEMU/Docker on my 24 core i9-13900 laptop, while using 25W instead of 200W.
Note that the fastest time uses one distccd on the X100 cores and another distccd on the A100 cores. This adds a lot of overhead in preprocessing and communication over the network (loopback, but still), yet it still gives a pretty nice boost. Running independent tasks on each set of cores is more efficient, though. Or teaching `gmake` or `ninja` to distribute to two pools of cores using my `ai` launcher would be even better ...
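To put numbers on that overhead, the three K3 rows of the table can be compared directly. This is a small sketch using only the timings listed above; the "ideal" figure assumes the two core clusters' build throughputs would simply add with zero coordination cost, which is an idealisation rather than anything measured.

```python
def seconds(minutes, secs):
    """Convert an 'XmY.Zs' timing from the table into seconds."""
    return minutes * 60 + secs


x100 = seconds(18, 3.871)   # 8 X100 cores alone
a100 = seconds(39, 23.187)  # 8 A100 cores alone
both = seconds(14, 25.56)   # X100 + A100 via distcc

# Build throughput of the A100 cluster relative to the X100 cluster.
a100_relative = x100 / a100

# Speedup actually achieved by adding the A100 cores through distcc.
distcc_speedup = x100 / both

# Hypothetical best case: throughputs add perfectly, no overhead.
ideal_both = 1.0 / (1.0 / x100 + 1.0 / a100)

print(f"A100 cluster vs X100 cluster: {a100_relative:.0%}")
print(f"distcc speedup over X100 only: {distcc_speedup:.2f}x")
print(f"ideal combined time: {int(ideal_both // 60)}m{ideal_both % 60:.0f}s")
```

On these timings the A100 cluster delivers a bit under half the X100 cluster's build throughput, the distcc run gains roughly 25% over the X100 cores alone, and the idealised combined time comes out around two minutes shorter than the measured distcc run. That gap is consistent with the preprocessing and loopback overhead described above.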
People have made the NPU on that thing do LLMs, and sounds like around the same level (max 3Bish params, 5-6 tok/s last time I tried).
In terms of raw CPU performance, sounds slower?
But maybe has more cores?
Ouch the memory bandwidth sounds really bad.