Clocking a 6502 simulator to 15GHz(scarybeastsecurity.blogspot.com) |
I don't know if they applied any kind of external cooling, or what the benchmark was. Probably it was "keep cranking up the clock until pins stop wiggling or smoke comes out." Not very scientific, but quite entertaining.
Nowadays you can quite easily do a 6502 implementation in an FPGA running at 100 MHz, especially if you allow the design to use more cycles for some instructions.
Sadly the product never took off and the companies folded. I have some chips somewhere. Googling at least revealed a picture of the product:
https://www.google.com/imgres?imgurl=https%3A%2F%2Ffarm3.sta...
"We actually made a couple of really hot processors for a chess tournament for somebody. He literally water-cooled it, and he ran it at something like eight megahertz. It was just ridiculous how fast he ran it."
Earlier it was explained that some processors coming off the production line could run faster than others, and they could test for it to pick the best ones for such purposes. They didn't end up increasing the clock speed for released computers, as other components could not keep up.
That being said, this was implemented on a budget-line FPGA from 2006 (XC3S50A - a small Xilinx Spartan-3A). A modern performance-line FPGA would probably hit a couple hundred MHz easily.
But the main (basic) reason is that the internal logic blocks don't worry too much about processing and arrival times beyond the speed at which they need to operate. What's simultaneous at 1 MHz might not be so simultaneous at 10 MHz or 100 MHz.
Another (advanced) reason why overclocking it might be hard is EM interference inside and outside the chip.
As a signal driver is toggled at increasing frequencies ('cranking up the clock'), the signal amplitude (voltage difference between the 'high' and 'low' period) starts to drop. At a high enough frequency, the signal will be indistinguishable from noise and 'stops wiggling'.
https://www.cypress.com/blog/technical/more-pdl-examples-wig...
Better title: Clocking a 6502 Simulator to 15 GHz. There are multiple efforts to recreate the physical 6502 CPU on modern hardware; this is not one of them and should not be confused with them.
https://www.youtube.com/playlist?list=PLowKtXNTBypFbtuVMUVXN...
Planck time is like 10^-43 seconds, so there's lots of room to divvy up a second for more processing power given advanced technologies...
Obviously self modifying code would be hard to handle, but every other case ought to work, and the auto-vectorization ought to do amazing things to some loop-heavy code.
The Department of Defense funded an SBIR grant in the late 1990s to produce an InP-based microprocessor. Given the limits of the time, it would have been closer to a 6502 than a Pentium. There has not been word of such a thing since, which leads me to conclude that the topic is classified.
The worst limitation a 6502-era chip has is that it has no instruction cache so instruction reads are fighting with data for memory bandwidth. You might even consider a Harvard architecture where the instructions go on a different bus. Without an I-Cache there is no point in pipelining, but there is a lot of pressure to implement CISCy instructions such as the string copy operation from the 8086 line.
The other issue is that there is no DRAM replacement with exotic materials, and all the difficulties with interconnect latency get a lot worse than they already are. It's clearer how to make SRAM, so having somewhere between 64 KB and 1 MB of SRAM on die seems likely for an exotic-material CPU.
Of course, armchair CPU designers are more likely to make progress with transition triggered architectures and FPGAs in 2020.
My laptop is an ancient 5th gen i5 with 2 keys having fallen off, so games are down in the 2GHz - 3GHz range for me. (Perhaps the missing keys make all the difference.)
I think this article gets a pass.
I agree it'd be wonderful to see auto-vectorization! Obviously, 6502 code does things 1 byte at a time so even adding 32-bit integers is painful. Auto-upgrade of those loops to 32-bit variants would be amazing.
Before, stuff would reduce to nothing after use. Will this condition continue to change? Will it stay true? (Remove the check) Will it stay false? (remove that chunk of code) and then remove this.
Sure, it was fun. More important: People wrote things that were truly impressive. Writing something that worked was only the beginning.
Until compilers know which buttons are used most frequently they can't fully optimize. Who knows, maybe one day Windows will give me a start menu and allow me to reboot when the application freezes up? Maybe one day text input will have some priority? The bare basics, basically?
There also is the trick where “BIT” instructions are used to give a function multiple entry points, and that BIT instruction can also be a LDA# (https://retrocomputing.stackexchange.com/a/11132)
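To make the trick concrete, here is a minimal sketch in Python that decodes the same eight bytes from two different entry points. The byte sequence, the tiny opcode table, and the `decode` helper are all made up for illustration (they are not from the linked answer); the opcodes themselves ($A9 = LDA immediate, $2C = BIT absolute, $85 = STA zero page, $60 = RTS) are standard 6502.

```python
# Hypothetical routine: entry A loads 1, entry B loads 2, both fall into STA $10.
# The $2C byte is a BIT absolute opcode whose two operand bytes are "LDA #$02".
CODE = bytes([0xA9, 0x01,   # offset 0: LDA #$01   (entry A)
              0x2C,         # offset 2: BIT abs, swallows the next two bytes
              0xA9, 0x02,   # offset 3: LDA #$02   (entry B)
              0x85, 0x10,   # offset 5: STA $10
              0x60])        # offset 7: RTS

# Only the four opcodes used above; this is not a full 6502 disassembler.
OPCODES = {
    0xA9: ("LDA #", 2),
    0x2C: ("BIT abs", 3),
    0x85: ("STA zp", 2),
    0x60: ("RTS", 1),
}


def decode(code, start):
    """Decode instructions from `start` until RTS; return (offset, name, operand hex)."""
    out = []
    pc = start
    while pc < len(code):
        name, size = OPCODES[code[pc]]
        out.append((pc, name, code[pc + 1:pc + size].hex()))
        pc += size
        if name == "RTS":
            break
    return out


for entry in (0, 3):
    print(f"entry at offset {entry}:")
    for offset, name, operand in decode(CODE, entry):
        print(f"  {offset}: {name} {operand}")
```

Entering at offset 0, the decoder sees LDA, BIT, STA, RTS, so the second LDA never executes; entering at offset 3 it sees LDA, STA, RTS. A static translator has to cope with both readings of the same bytes.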
I’m not sure that can “simply” be converted to LLVM IR.
https://andrewkelley.me/post/jamulator.html
Although the programmer's target is an entire system, the conclusion may still apply:
> There is a constant struggle between correctness and optimized code. Nearly all optimizations must be tossed out the window in the interest of correctness.
Of course, petahertz (10^15 cycles per second) is already the speed at which an electron circles around a hydrogen atom, so we may not be able to use electricity any more...
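As a rough sanity check on the petahertz figure, here is a back-of-the-envelope sketch using the semi-classical Bohr model (an assumption on my part; a real hydrogen atom has no literal orbit). The ground-state electron speed is the fine-structure constant times the speed of light, and the orbit circumference is 2π times the Bohr radius. The function name is made up for illustration.

```python
import math

# CODATA values, rounded; the Bohr model itself is only an approximation.
ALPHA = 7.2973525693e-3        # fine-structure constant (dimensionless)
C = 299_792_458.0              # speed of light, m/s
BOHR_RADIUS = 5.29177210903e-11  # m


def bohr_orbital_frequency():
    """Orbits per second of the ground-state electron in the Bohr model."""
    speed = ALPHA * C                          # about 2.19e6 m/s
    circumference = 2 * math.pi * BOHR_RADIUS  # about 3.32e-10 m
    return speed / circumference


print(f"{bohr_orbital_frequency():.3e} Hz")
```

This comes out at a few petahertz, so the order of magnitude in the comment holds up.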
It's not that the signal will be indistinguishable from noise, but that the CPU will stop working correctly, so its outputs will stop toggling (or will toggle in unexpected ways).
Scopes lock at first rising edge after last horizontal scan, so display starts at H, then drops to L after how long CPU held that pin high. That creates  ̄ ̄l_ lines on the screen superimposed to one another, X position “wiggling”  ̄lll_ depending on how many consecutive H bits just happened to be sent.
When the CPU halts, the pin would flatline at H or L and you’ll know.
My guess is that they were probably cooling it with beer (or at least cold beer bottle bottoms) to get that last critical MHz, before drinking the beer.
You would probably want to add some tricks directly there, maybe register renaming (I don't know if LLVM does "variable renaming", let's put it this way)
I also understand that this is possible because the emulator is running on a superscalar processor. Not sure if multicore has anything to do with it (the post specifically mentions the high single-core performance of the processor used). Still, considering that processors back in the 6502 era had just one execution port, and superscalars these days have a lot (I think 8? I really lost track of what's usual these days), the figure makes sense all right, and without involving any kind of multithreading.
Kudos to the authors of the emulator for having a super-optimized system that can effectively and efficiently emulate its target!
What is particularly interesting to me is how thoroughly superscalar "wins". Because of complexities with 6502 -> x64 mapping, and handling self-modifying code in particular, some of the most common 6502 instructions explode to multiple x64 instructions. Despite that huge extra instruction load, the translation still manages to run at much greater speed than a 1:1 instruction ratio.
Modern processors do not run on electrons. They run on unicorn tears and magic.
It's also possible because the minimal architecture of the 6502 makes it inherently inefficient. With only three 8-bit registers -- which can't even be used interchangeably! -- and a non-addressable stack, a lot of CPU time on the 6502 is spent shuffling data around. Consider adding two 32-bit numbers, for example. On a 6502, this is a minimum of 38 cycles (clc + (lda, adc, sta) x4); an x86 can complete the same operation in one cycle, potentially in parallel with other operations.
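For anyone who wants to check the 38-cycle figure, here is a small Python sketch that mimics the byte-at-a-time add and tallies cycles. The cycle counts assume zero-page operands (2 for clc, 3 each for lda, adc, sta), which is my assumption; absolute addressing would cost one more cycle per memory access. The function name is made up for illustration.

```python
# Cycle costs for zero-page addressing (assumed; absolute would be 4 each).
CYCLES = {"clc": 2, "lda": 3, "adc": 3, "sta": 3}


def add32_6502(a, b):
    """Add two 32-bit values one byte at a time; return (sum mod 2**32, cycles)."""
    carry = 0                      # clc
    cycles = CYCLES["clc"]
    result = 0
    for i in range(4):             # least significant byte first
        byte_a = (a >> (8 * i)) & 0xFF        # lda a+i
        byte_b = (b >> (8 * i)) & 0xFF
        total = byte_a + byte_b + carry       # adc b+i
        carry = total >> 8                    # carry into the next byte
        result |= (total & 0xFF) << (8 * i)   # sta result+i
        cycles += CYCLES["lda"] + CYCLES["adc"] + CYCLES["sta"]
    return result, cycles


total, cycles = add32_6502(0x0001FFFF, 0x00000001)
print(hex(total), cycles)  # 0x20000 38
```

That is 2 + 4 x 9 = 38 cycles for what a 32-bit or 64-bit host does with a single add instruction.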
For the 6502 or Z80 (4x "faster" clock, but 3-6 clock cycles per machine cycle), performance was counted in clock cycles per instruction rather than instructions per cycle.
Even a measly 386/486 was much faster than that.
Enter the Pentium, with the ability to execute 2 instructions in parallel.
IPC improvements have been the big gainers recently as well.