Show HN: OctoFlow – A GPU-native programming language

3 points by mr_octopus 133 days ago | 4 comments

I built a general-purpose programming language where the GPU is the primary execution target, not an afterthought.

  Most languages treat the GPU as "write a kernel, dispatch it, copy results back." OctoFlow flips it: data lives on
  the GPU by default. The CPU handles I/O and nothing else.

  let a = gpu_fill(1.0, 10000000)
  let b = gpu_scale(a, 2.0)
  let c = gpu_add(a, b)
  print("sum: {gpu_sum(c)}")

  10 million elements. Data never leaves VRAM between operations.
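As a sanity check on the arithmetic, here is a plain-Python CPU stand-in for the four calls above. The function names (`fill`, `scale`, `add`, `expected_sum`) are hypothetical and are not OctoFlow's implementation; the sketch only shows what value the pipeline should produce, assuming `gpu_scale` multiplies each element and `gpu_add` adds element-wise.

```python
# CPU reference for the fill -> scale -> add -> sum example.
# These helpers are illustrative stand-ins, not OctoFlow functions.

def fill(value, n):
    """Return n copies of value (stand-in for gpu_fill)."""
    return [value] * n

def scale(xs, k):
    """Multiply every element by k (stand-in for gpu_scale)."""
    return [x * k for x in xs]

def add(xs, ys):
    """Element-wise sum of two equal-length lists (stand-in for gpu_add)."""
    if len(xs) != len(ys):
        raise ValueError("length mismatch")
    return [x + y for x, y in zip(xs, ys)]

def expected_sum(n, value=1.0, k=2.0):
    """Closed form: every element of c equals value + value * k."""
    return n * (value + value * k)

# Run the pipeline on a small array so it stays cheap on the CPU.
n = 1000
a = fill(1.0, n)
b = scale(a, 2.0)
c = add(a, b)
small_sum = sum(c)
```

For the 10 million element case the closed form gives 3.0 per element, so 30,000,000 in total. If a runtime accumulates in 32-bit floats, the result may differ slightly depending on summation order.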

  It's early — there's a lot to improve — but it works today and I'd love feedback from people who try it.

  What you can do right now:

  - GPU compute with arrays of 10M+ elements
  - Statistical analysis, ML (regression, clustering, neural net primitives)
  - CSV/JSON data processing, HTTP client
  - Stream pipelines for image processing
  - Interactive REPL with GPU access
  - Import from 51 stdlib modules across 11 domains

  What you need: any GPU with a Vulkan driver and the 2.2 MB binary. That's it.

  I've been working on this solo and would genuinely appreciate people kicking the tires. What works, what breaks,
  what's missing — all useful.

  https://github.com/octoflow-lang/octoflow

qrios 133 days ago

Works on my computer: RTX 3090, CUDA 12.6

Interesting project! I haven't really worked with Vulkan myself yet. Hence my question: how is the code compiled and then loaded into the cores?

Or is the entire code always compiled in the REPL and then uploaded, with only the existing data addresses being updated?

mr_octopus 133 days ago

Thanks for trying it! :)

Each gpu_* call emits SPIR-V and dispatches via Vulkan compute. Data stays resident in VRAM between calls — no round-trips to CPU unless you need the result.

No thread_id exposed. The runtime handles thread indexing internally — gpu_add(a, b) means "one thread per element, each does a[i] + b[i]." Workgroup sizing and dispatch dimensions are automatic.
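To make "one thread per element" concrete, here is a rough CPU emulation in Python of what such a dispatch could look like. Everything in it is an assumption for illustration: `WORKGROUP_SIZE = 256`, `num_workgroups`, `add_kernel`, and `dispatch_add` are hypothetical names, and the actual runtime may size and launch workgroups differently.

```python
# Hypothetical CPU emulation of a one-thread-per-element dispatch.
# None of these names are OctoFlow or Vulkan identifiers.

WORKGROUP_SIZE = 256  # assumed local size; a real runtime picks its own

def num_workgroups(n, local_size=WORKGROUP_SIZE):
    """Ceiling division, so a partially filled last group is still launched."""
    return (n + local_size - 1) // local_size

def add_kernel(global_id, a, b, out):
    """Body each emulated thread runs; threads past the end do nothing."""
    if global_id < len(out):
        out[global_id] = a[global_id] + b[global_id]

def dispatch_add(a, b, local_size=WORKGROUP_SIZE):
    """Run add_kernel once per thread, group by group, sequentially."""
    if len(a) != len(b):
        raise ValueError("length mismatch")
    out = [0.0] * len(a)
    for group in range(num_workgroups(len(a), local_size)):
        for local_id in range(local_size):
            # Global thread index = group index * group size + local index.
            add_kernel(group * local_size + local_id, a, b, out)
    return out

result = dispatch_add([1.0] * 1000, [2.0] * 1000)
```

With 1000 elements and a group size of 256, four groups are launched and the last 24 threads fail the bounds check. That index computation and bounds check are the parts a call like gpu_add(a, b) keeps out of user code.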

The tradeoff: you can't write custom kernels with shared memory or warp-level ops. OctoFlow targets the 80% of GPU work that's embarrassingly parallel. For the other 20% you still want CUDA/Vulkan directly.

Cheers

billconan 133 days ago

I'm curious how a GPU language's syntax design can differ from a CUDA kernel.

Because I think there is no way to avoid concepts like thread_id.

I'm curious how GPU programming can be made (a lot) simpler than CUDA.

mr_octopus 133 days ago

Most GPU work boils down to a few patterns — map, reduce, scan. Each one has a known way to assign threads.

So instead of writing a kernel with thread_id, you write:

  let c = gpu_add(a, b)
  let total = gpu_sum(c)

The thread indexing is still there — just handled by the runtime, like how Python hides pointer math.
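For readers wondering what "a known way to assign threads" can mean for reduce and scan, here is a minimal Python sketch of two textbook schemes: a pairwise tree reduction and a Hillis-Steele inclusive scan. It is a CPU emulation under stated assumptions, not OctoFlow's actual kernels; `tree_reduce_sum` and `inclusive_scan` are hypothetical names, and each inner loop iteration stands in for one GPU thread.

```python
# Textbook thread-assignment patterns, emulated sequentially on the CPU.
# Illustrative only; not taken from the OctoFlow runtime.

def tree_reduce_sum(values):
    """Pairwise tree reduction: about log2(n) passes over the data."""
    data = list(values)
    n = len(data)
    if n == 0:
        return 0.0
    stride = 1
    while stride < n:
        # Each emulated thread owns slot t and folds in slot t + stride.
        for t in range(0, n, 2 * stride):
            if t + stride < n:
                data[t] += data[t + stride]
        stride *= 2
    return data[0]

def inclusive_scan(values):
    """Hillis-Steele scan: element i adds the value offset slots to its left."""
    data = list(values)
    n = len(data)
    offset = 1
    while offset < n:
        prev = data[:]  # double buffer, so every thread reads the previous pass
        for i in range(offset, n):
            data[i] = prev[i] + prev[i - offset]
        offset *= 2
    return data

total = tree_reduce_sum([1, 2, 3, 4, 5])
prefix = inclusive_scan([1, 2, 3, 4])
```

In both cases the index each thread touches follows from the pattern and the array length alone, which is why a runtime can generate the indexing itself.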