Bresenham's Circle Drawing Algorithm (2021)(funloop.org) |
Bresenham's Circle Drawing Algorithm (2021)(funloop.org) |
These days processors prefer to draw shapes by having a coarse grained pass that conservatively selects tiles of interest then brute-force evaluates each pixel in each tile independently of all others.
Instead of minimizing total work, the goal is to maximize pipelined parallelism while skipping over unnecessary work in large blocks.
1 instruction per cycle? What luxury bygone era did you grow up in?
Wikipedia tells me the algorithm is from 1962 on an IBM 1401 (https://en.wikipedia.org/wiki/Bresenham's_line_algorithm#His...
That definitely didn’t have many single cycle instructions. Skimming https://ibm-1401.info/A24-6447-0_1401_1460_Instruction_and_T..., I couldn’t find any.
Certainly, in the era of 8-bit CPUs like Z80 and 6502 programmers would have been lyric about “1 instruction per cycle”
Actually, did any CPU ever “execute 1 instruction per cycle without pipelining or prediction” (or, slightly looser “had fixed time instructions without pipelining or prediction”)?
RISC introduced fixed time instructions, but also pipelining.
but there are a lot of cpus out there—maybe the majority now—that are pipelined microarchitectures that get one instruction per cycle without much or any prediction. avrs, most of the cortex-m* (all?), most modern implementations of old slow isas like the z80 and 8051, etc. big processors like your laptop cpu and cellphone cpu are of course superscalar, but they are a tiny minority of all processors. even inside the cellphone case, they're outnumbered by in-order scalar microcontrollers
without prediction, of course, you have a pipeline bubble every time you have a branch, so you never quite hit 1 ipc with scalar execution. but it's usually pretty close, and even with prediction, sometimes you miss. and usually if you have branch prediction, you also have a cache, because ain't nobody got time to share one triflin memory bus between instructions and data
so pipelining gets you from, say, 3 clocks per instruction down to 1.2 or so. then prediction gets you from 1.2 down to, say, 1.02
So far as 1 instruction per cycle - Wikipedia says the 80286 (the top dog PC processor of the time) could execute 0.21 "typical" instructions per clock. Optimized code was up to 0.5 instructions per clock. And that agrees with my memories.
Today, I would try and use parallelism if I could. With lots of conditions though - is the process being applied to the image trivially parallelizable, will most/all of it fit in cache, etc. Trying to parallelize Bresenham's algorithms though would be futile - when drawing circles you can reflect it into the different quadrants (big savings) but the algorithm itself is going to be pretty serial because it has to keep up with the error coefficient as it draws.
We are talking about instructions which took 8-12 cycles to complete.
even with a register file, it isn't really inherent that you need to decode inputs, do alu operations, and write outputs in separate clock cycles; you can do all of that in combinational logic except writing the outputs, and you can even decode which register to write the output to. it just means your max clock rate is in the toilet
for that kind of thing a harvard architecture is pretty useful; it allows you to read an instruction in instruction memory at the same time you're reading or writing data in data memory, instead of in two separate cycles
This last is not quite true. The exact distance from the circle, call it G(x,y), is the corresponding difference of square roots, i.e.,
def G(x, y, r):
return math.sqrt(x * x + y * y) - math.sqrt(r * r)
and G(x,y) isn't just the square root of F(x,y), and indeed doesn't behave monotonically with respect to F(x,y).It's an interest property of Bresenham's algorithm, that I've never seen even stated let alone proved in the literature, that this doesn't matter, and the algorithm is indeed exact in the sense that it always chooses the next point based on which is truly closest to the circle... despite using an error function that is only an approximation.
For t in 0 to 2 pi in whatever small steps you want, draw_pixel(x=a+rcos t, y=b+rsin t) where a and b are the x and y coordinates you want for the centre of the circle and r is the radius.
The derivation of this form is pretty simple if you know basic trig. https://publish.obsidian.md/uncarved/3+Resources/Public/Unit...
consider this simple assembly-language subroutine i wrote in october (http://canonical.org/~kragen/sw/dev3/tetris.S, screencast at https://asciinema.org/a/622461):
@@ Set sleep for iwait to r0 milliseconds.
@@ (r0 must be under 1000)
.thumb_func
waitis: ldr r2, =wait @ struct timeval
movs r3, #0 @ 0 sec
str r3, [r2] @ .tv_sec = 0
ldr r1, =1000 @ multiplier for ms
mul r0, r1
str r0, [r2, #4] @ set .tv_usec
bx lr
.bss
wait: .fill 8 @ the struct timeval
.text
these are all perfectly normal register-machine instructions; you could translate them one-to-one to almost any register machine. on a few of them you could drop one, writing something like str #0, [wait]. the whole function is straight-line execution, a single basic block, and seven instructions long. this is almost a best case for a stack machine; it looks like this: lit(0) ; load immediate constant
lea(wait) ; load address of wait onto stack
! ; store 0 in wait
lit(1000) ; another constant
* ; multiply argument by constant
lea(wait) ; load address again
lit(4) ; offset of .tv_usec
+ ; calculate address of wait.tv_usec
! ; store product in tv_usec
ret ; return from subroutine
that's 10 instructions, about 50% longer than the 6 or 7 of the two-operand register machine. but the typical case is worse. and basically the reason is that the average number of stack manipulations or constant pushes that you need to do to get the right operands on top of the stack for your memory accesses and computational operations is roughly 1. sometimes you'll have a * or ! (store) or + that's not preceded by a stack manipulation or a load-immediate or a load-address operation, and that time the stack machine wins, but other times it'll be preceded by two or three of them, and that time the stack machine losesso it averages out to about two stack instructions per register-machine instruction. call and return is faster on the stack machine, but passing arguments and return values in registers on the register machine can take away most of that advantage too
the rtx-2000 did some tricks to sometimes do more than a single stack operation per cycle, but it didn't really escape from this
this doesn't necessarily mean that stack machines like the rtx-2000 are a bad design approach! the design rationale is that you get very short path lengths, so you can clock the design higher than you could clock a design that had register-file muxes in the critical path, and you also avoid the branching penalty that pipelines impose on you, and you use less silicon area. mainstream computation took a different path, but plausibly the two-stack designs like the rtx-2000, the novix nc4000, the shboom, the mup21, and the greenarrays f18a could have been competitive with the mainstream risc etc. approach. but you do need a higher clock rate because each instruction does less
i don't remember if dr. ting wrote a book about the rtx-2000, but he did write a very nice book about its predecessor the nc4000, going into some of these tricks: https://www.forth.org/OffeteStore/4001-footstepsFinal.pdf
As a sibling comment points out, DSPs can be an exception. I’m far from an expert on them, but CPUs that try to run all instructions in the same fixed amount of time used to accomplish that by avoiding complex addressing modes and by omitting the really slow instructions (division and instructions that push/pop multiple registers onto/from the stack being prime examples).
IIRC, some of them also accomplished it by running some of the simpler instructions slower than necessary (raw speed isn’t as important as being hard real time isn many applications)
easy solution, as you allude to, don't have a division instruction! arm doesn't, 8048 doesn't, sparcv7 doesn't, even the cdc 6600 and cray-1 didn't, and even risc-v finally got zmmul: https://wiki.riscv.org/display/HOME/Zmmul. it's not just dsps
the big issue with complex addressing modes is i think fault handling. if your nice orthogonal add instruction updates two registers as it reads operands and a third register with the sum of the operands, what do you do if you get a page fault on the second operand? if the os can service the page fault, how does it restart the instruction?
as you point out, real-time latency is also an issue. on older arm chips you have to be careful not to ldm or stm too many registers in a single instruction so as not to damage interrupt latency. newer arm chips can restart the ldm/stm
i'm no expert on the area either
Heh, that's an understatement. The 8048 doesn't even have a subtract instruction, much less divide.
Do you know if that hardware pipeline works only for these intrinsic variants?
I suspect on a modern processor the branches (ie "if"s) in Bresenham's algorithm are gonna be more expensive than the multiplications and divisions in the naive algorithm.
But if it supports larger floats it must be doing range reduction which is impressive for low cycle ops. It must be done in hardware.
It doesn't surprise me regarding denorms. They're really nice numerically but always disabled when looking for performance!
Yeah, maybe they don't by the sounds.
I don't do much on GPUs nowadays, but I still find this stuff interesting. I'm definitely going have to do a deeper dive.
Thanks heaps for the info!