The 6502 instruction set as a database

The 6502 instruction set as a database (gitlab.com)

127 points by orgonon 2 years ago | 30 comments

simonw 2 years ago |

Since this is SQLite SQL, I copied it into a Gist (to give it CORS headers) and now you can open it directly in Datasette Lite (Datasette in WebAssembly running entirely in the browser):

https://lite.datasette.io/?sql=https://gist.github.com/simon...

Here are the 65 hardcoded opcodes: https://lite.datasette.io/?sql=https://gist.github.com/simon...

And the 64 instructions: https://lite.datasette.io/?sql=https://gist.github.com/simon...

calpaterson 2 years ago | |

I love your tool. If you want a slightly more structured version of Gists that speaks parquet try my own tool, csvbase :). I pasted this data in as https://csvbase.com/calpaterson/opcodes-6502 and you can get parquet by adding .parquet to that (https://csvbase.com/calpaterson/opcodes-6502.parquet)

Here's datasette lite running off that file:

https://lite.datasette.io/?parquet=https%3A%2F%2Fcsvbase.com...

Really nice when webby stuff works together

xpe 2 years ago |

Inspiring. I would love for the author's plan (pasted below) to get applied widely across key information sources:

    Instead, I hatched a plan:

    1. Collect sources, and encode the raw data in a machine-readable form
    2. Study those sources, and encode my understanding as assertions, 
       sanity-checks and validations of that data
    3. Synthesise that data according to my understanding, and verify it
       against the sources available
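The three steps above might be sketched roughly as follows, using Python's built-in sqlite3 module. The table names, view name, and sample rows are hypothetical and are not taken from the author's actual schema; only the opcode values and byte lengths are real 6502 facts.

```python
import sqlite3

# Hypothetical schema for illustration; not the author's actual database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Step 1: raw data as transcribed from a source.
    CREATE TABLE raw_opcodes (
        opcode   INTEGER PRIMARY KEY,
        mnemonic TEXT NOT NULL,
        mode     TEXT NOT NULL,
        bytes    INTEGER NOT NULL
    );
    -- Step 2: my understanding, stated as data: each addressing mode
    -- implies a fixed instruction length.
    CREATE TABLE mode_lengths (
        mode  TEXT PRIMARY KEY,
        bytes INTEGER NOT NULL
    );
    -- Step 3: verify the transcription against that understanding.
    CREATE VIEW length_mismatches AS
        SELECT r.opcode, r.mnemonic, r.bytes AS transcribed, m.bytes AS expected
        FROM raw_opcodes AS r
        LEFT JOIN mode_lengths AS m ON m.mode = r.mode
        WHERE m.bytes IS NULL OR r.bytes <> m.bytes;
""")
conn.executemany(
    "INSERT INTO mode_lengths VALUES (?, ?)",
    [("implied", 1), ("immediate", 2), ("zero page", 2), ("absolute", 3)],
)
conn.executemany(
    "INSERT INTO raw_opcodes VALUES (?, ?, ?, ?)",
    [
        (0xA9, "LDA", "immediate", 2),
        (0xA5, "LDA", "zero page", 2),
        (0xAD, "LDA", "absolute", 3),
        (0xEA, "NOP", "implied", 1),
    ],
)


def mismatches():
    """Return every row whose transcribed length contradicts its mode."""
    return conn.execute("SELECT * FROM length_mismatches").fetchall()


clean = mismatches()

# Simulate a transcription typo: STA absolute is 3 bytes, not 2.
conn.execute(
    "INSERT INTO raw_opcodes VALUES (?, ?, ?, ?)", (0x8D, "STA", "absolute", 2)
)
flagged = mismatches()
```

With the correct rows the view is empty; after the deliberate typo it returns the offending row, which is the kind of sanity check step 2 describes.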

kreelman 2 years ago | |

Thanks for reposting this here. I think I work in a similar way to this, but I've not written it down as a "manifesto" before like Simon has.

I'm going to be thinking about this as the Discovery Manifesto.... or maybe the autodidactaliser.

I've found it useful to reexpress foreign knowledge in a familiar setting. The magic is possibly that the learning seems fun because the familiar tool is fairly easy to use and the new information is expressed in the familiar way.

It's also good because it's not cheap dopamine, like looking over YouTube videos... "I watched a video on the 6502... I know how it works now??" YouTube videos do have their place, but not at the expense of doing more in-depth thinking.

gardaani 2 years ago | |

The last step should be: 4. Get a real 6502 CPU and verify that the data is correct.

peterkelly 2 years ago |

Likely useful: https://github.com/mist64/perfect6502 and http://www.visual6502.org/. These are transistor-level simulations based on die shots. From these you can derive the cycle times of each instruction with 100% confidence.

Retr0id 2 years ago |

Up next: 6502 emulator in a single SQL query

kreelman 2 years ago | |

:-)

MenhirMike 2 years ago |

From the comments, it even seems to account for some instructions needing an extra cycle when crossing page boundaries, nice! This seems pretty comprehensive then.
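For anyone unfamiliar with the page-boundary penalty, here is a minimal sketch in Python. It models only LDA absolute,X (4 cycles, plus 1 when indexing crosses a 256-byte page); the function names are made up for this example and say nothing about how the database itself encodes the rule.

```python
def crosses_page(base: int, index: int) -> bool:
    """True if base + index lands in a different 256-byte page than base."""
    effective = (base + index) & 0xFFFF  # 16-bit address space wraps around
    return (base & 0xFF00) != (effective & 0xFF00)


def lda_absolute_x_cycles(base: int, x: int) -> int:
    """Cycle count for LDA absolute,X: 4, plus 1 on a page crossing."""
    return 4 + (1 if crosses_page(base, x) else 0)


# $12F0 + $0F = $12FF stays in page $12; $12F0 + $10 = $1300 crosses into $13.
same_page = lda_absolute_x_cycles(0x12F0, 0x0F)
next_page = lda_absolute_x_cycles(0x12F0, 0x10)
```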

jll29 2 years ago |

Great idea, this could be valuable for people who write their own assembler.

It's the first time I've seen an instruction set as a relational database, which I would imagine is a very portable way to describe a machine. Perhaps it would be worth collecting other machine specs in the same format and then creating a portable assembler that uses the specific DBs.

userbinator 2 years ago | |

Table-driven assemblers (and disassemblers) have been a thing for a long time, especially for more obscure/embedded architectures. Reverse-engineering/analysis tools likewise have traditionally done the same, but with additional semantic information for each instruction. A quick search for table-driven compilers reveals some mid-century papers.
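As a rough illustration of the table-driven idea, here is a tiny disassembler sketch in Python. The table holds only five real 6502 opcodes and the function name is invented for the example; a real tool would load the full table from a data source rather than hardcode it.

```python
# Opcode byte -> (mnemonic, operand format, total instruction length).
# Only a handful of real 6502 opcodes, for illustration.
OPCODES = {
    0xA9: ("LDA", "#${:02X}", 2),  # immediate
    0x8D: ("STA", "${:04X}", 3),   # absolute
    0x4C: ("JMP", "${:04X}", 3),   # absolute
    0xE8: ("INX", "", 1),          # implied
    0x60: ("RTS", "", 1),          # implied
}


def disassemble(code: bytes) -> list[str]:
    """Decode bytes using nothing but lookups in the OPCODES table."""
    lines = []
    pc = 0
    while pc < len(code):
        entry = OPCODES.get(code[pc])
        if entry is None:
            # Not in the table: emit the raw byte and move on.
            lines.append(f".byte ${code[pc]:02X}")
            pc += 1
            continue
        mnemonic, fmt, length = entry
        # Operands are stored little-endian after the opcode byte.
        operand = int.from_bytes(code[pc + 1 : pc + length], "little")
        lines.append(f"{mnemonic} {fmt.format(operand)}".rstrip())
        pc += length
    return lines


listing = disassemble(bytes([0xA9, 0x01, 0x8D, 0x00, 0x02, 0xE8, 0x60]))
```

All the CPU-specific knowledge lives in the table, so supporting another architecture would mostly mean swapping the data. This sketch does not handle an instruction truncated at the end of the input.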

Retr0id 2 years ago | |

Ghidra uses SLEIGH for this purpose https://fossies.org/linux/ghidra/GhidraDocs/languages/html/s...

> A Language for Rapid Processor Specification

From a SLEIGH description, the assembler, disassembler, and even decompiler can be synthesized.

It's a DSL not a database schema, but fundamentally it's the same idea.

Here's their definition of the 6502: https://github.com/NationalSecurityAgency/ghidra/blob/cae919...

hoten 2 years ago |

Some other instruction sets in some JSON: https://github.com/asmjit/asmjit/tree/master/db

snvzz 2 years ago |

Not sure what the advantage is, relative to the tables as CSV files.

thristian 2 years ago | |

As far as I know, most information about the 6502 instruction set comes in two forms:

- emulators/simulators/FPGA code

- books, data sheets, OCR'd PDFs of books and data sheets, text files copy/pasted from PDFs or retyped from books and data sheets

Code is likely to be heavily tested, but it makes extracting high-level information about the instruction set very difficult.

Data is easy to analyse and synthesise, but since it's described in prose there's no easy way to test or validate it - if somebody in 1984 made a typo that a particular instruction took 3 cycles instead of 2, and that error was copy/pasted and made its way into half the "6502 instruction set" websites online, how would you know? How would you detect it?

Using SQL to enforce constraints and validation gives me confidence that there aren't a bunch of typos and copy/paste errors in this data. In addition, being able to express special cases like "read-modify-write instructions applying to the accumulator do not pay the three cycle penalty" in code rather than in prose makes it more likely they will be applied correctly. Lastly, since the result is an SQL database, it can be pretty easily formatted to resemble any book or data sheet you like for simplified visual verification against book/data sheet sources.
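A minimal sketch of what constraint-based validation can look like, using Python's sqlite3 module. The schema here is hypothetical and much simpler than the real one, and the 2 to 7 cycle range is an assumption made for this example.

```python
import sqlite3

# Hypothetical, simplified schema; the real database is more elaborate.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE opcodes (
        opcode   INTEGER PRIMARY KEY CHECK (opcode BETWEEN 0 AND 255),
        mnemonic TEXT    NOT NULL    CHECK (length(mnemonic) = 3),
        mode     TEXT    NOT NULL,
        bytes    INTEGER NOT NULL    CHECK (bytes BETWEEN 1 AND 3),
        -- Assumed plausible range for base cycle counts.
        cycles   INTEGER NOT NULL    CHECK (cycles BETWEEN 2 AND 7),
        -- A mnemonic may use each addressing mode only once.
        UNIQUE (mnemonic, mode)
    )
""")


def try_insert(row) -> bool:
    """Insert one row; return False if any constraint rejects it."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO opcodes VALUES (?, ?, ?, ?, ?)", row)
        return True
    except sqlite3.IntegrityError:
        return False


ok = try_insert((0xA9, "LDA", "immediate", 2, 2))
typo = try_insert((0xAD, "LDA", "absolute", 3, 44))    # 44 cycles: rejected
dupe = try_insert((0xA5, "LDA", "immediate", 2, 3))    # duplicate mode: rejected
```

Range checks like these only catch gross errors. A plausible-looking typo (3 cycles instead of 2) would still need cross-checking against a second source or a derived rule.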

simonw 2 years ago | |

The ability to ship SQL views that join multiple tables as part of the schema is pretty cool, and something you can't come close to replicating with CSVs.

https://lite.datasette.io/?sql=https://gist.github.com/simon...
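To make the point about views concrete, here is a small sketch with Python's sqlite3 module. The two tables and the view name are invented for the example and are not the ones in the linked database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical tables, for illustration only.
    CREATE TABLE modes (
        mode          TEXT PRIMARY KEY,
        operand_bytes INTEGER NOT NULL
    );
    CREATE TABLE opcodes (
        opcode   INTEGER PRIMARY KEY,
        mnemonic TEXT NOT NULL,
        mode     TEXT NOT NULL REFERENCES modes(mode)
    );
    -- The view ships with the schema, so every consumer gets the join
    -- and the derived "bytes" column for free.
    CREATE VIEW opcode_reference AS
        SELECT o.opcode, o.mnemonic, o.mode, 1 + m.operand_bytes AS bytes
        FROM opcodes AS o
        JOIN modes AS m ON m.mode = o.mode;
""")
conn.executemany(
    "INSERT INTO modes VALUES (?, ?)",
    [("implied", 0), ("immediate", 1), ("absolute", 2)],
)
conn.executemany(
    "INSERT INTO opcodes VALUES (?, ?, ?)",
    [(0xEA, "NOP", "implied"), (0xA9, "LDA", "immediate"), (0x4C, "JMP", "absolute")],
)

rows = conn.execute(
    "SELECT * FROM opcode_reference ORDER BY opcode"
).fetchall()
```

With CSV files, the join and the derived column would have to be reimplemented by every program that reads them.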

maxcoder4 2 years ago | | |

I like how it looks, and I appreciate the effort that went into it, but other than using it to export a table in just the right format that you will then embed into your assembler directly, I'm not sure how to use it.

And I use opcode references [1] very often (sometimes daily, depending on the project). I even wrote my own disassemblers. But I mostly use opcode references for manual cross checking, so maybe I'm not a target of this project?

[1] My favorite one for x64 is https://ref.x86asm.net/coder64.html

transfire 2 years ago |

Something I realized years ago: the 6502 instruction set is small enough that it can be (almost) entirely implemented as memory lookups, with no actual logic computation needed.

kragen 2 years ago | |

if you allow multiple sequential memory lookups, this is true for any instruction set
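One way to read the lookup idea, sketched in Python for a single instruction: INX (increment X, setting the zero and negative flags) becomes a 256-entry table indexed by the current value of X. The table is computed once up front here for brevity; in hardware it would be a ROM. The names are invented for this example.

```python
# 256-entry table for INX: index by X, get (new X, zero flag, negative flag).
INX_TABLE = [
    ((x + 1) & 0xFF, ((x + 1) & 0xFF) == 0, bool((x + 1) & 0x80))
    for x in range(256)
]


def inx(x: int) -> tuple[int, bool, bool]:
    """Execute INX with a single table lookup and no arithmetic."""
    return INX_TABLE[x]


wrapped = inx(0xFF)   # X wraps to 0, so the zero flag is set
negative = inx(0x7F)  # X becomes $80, so the negative flag is set
```

Two-operand instructions such as ADC would need a much larger table (or a chain of smaller lookups), which is where the "multiple sequential lookups" remark comes in.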

hedgehog 2 years ago |

This is a very nice presentation, the .sql file contains a lot of notes about the sourcing for the data. I could imagine adding test vectors to the database as well.

begoon 2 years ago |

It might be related to some extent - intel8080.com.

gabrielsroka 2 years ago |

The 6502 only has 56 opcodes.

The db also includes modern variants of the 6502.

dspillett 2 years ago | |

> The 6502 only has 56 opcodes

In terms of bytes that the original CPU officially recognised as instructions, it was more like ~150 (working from old memories, I may be off by one or a few there). Some of the other ~106 did something unofficially, and a number were valid instructions on later versions of the design.

That ~150 were grouped into 56 instructions, many with multiple addressing modes (so "load A immediate", "load A direct", "load A indexed", etc, were different opcodes but considered the same instruction).

Because register use was far from orthogonal (one accumulator, two index registers, and a flags register), instructions for them were considered different (LDA, LDX, & LDY, for load for instance) where in other instruction sets (for chips with multiple general purpose registers) they might be considered the same instruction affecting a different register, though considering them the same instruction didn't reduce the opcode count just the instruction group count.
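The opcode versus instruction distinction is easy to see with a GROUP BY. This sketch uses Python's sqlite3 module and only the three load instructions; the table layout is hypothetical, while the opcode bytes and addressing modes are the documented ones.

```python
import sqlite3

# (opcode, mnemonic, addressing mode) for the three load instructions.
LOADS = [
    (0xA9, "LDA", "immediate"), (0xA5, "LDA", "zero page"),
    (0xB5, "LDA", "zero page,X"), (0xAD, "LDA", "absolute"),
    (0xBD, "LDA", "absolute,X"), (0xB9, "LDA", "absolute,Y"),
    (0xA1, "LDA", "(indirect,X)"), (0xB1, "LDA", "(indirect),Y"),
    (0xA2, "LDX", "immediate"), (0xA6, "LDX", "zero page"),
    (0xB6, "LDX", "zero page,Y"), (0xAE, "LDX", "absolute"),
    (0xBE, "LDX", "absolute,Y"),
    (0xA0, "LDY", "immediate"), (0xA4, "LDY", "zero page"),
    (0xB4, "LDY", "zero page,X"), (0xAC, "LDY", "absolute"),
    (0xBC, "LDY", "absolute,X"),
]

conn = sqlite3.connect(":memory:")
# Hypothetical single-table layout, for illustration only.
conn.execute(
    "CREATE TABLE opcodes (opcode INTEGER PRIMARY KEY, mnemonic TEXT, mode TEXT)"
)
conn.executemany("INSERT INTO opcodes VALUES (?, ?, ?)", LOADS)

# One row per instruction (mnemonic), counting its opcodes.
per_instruction = conn.execute(
    "SELECT mnemonic, COUNT(*) FROM opcodes GROUP BY mnemonic ORDER BY mnemonic"
).fetchall()
opcode_count = conn.execute("SELECT COUNT(*) FROM opcodes").fetchone()[0]
```

Here 18 opcodes collapse into 3 instructions. Treating LDA, LDX and LDY as one "load" instruction would shrink the instruction count further without changing the opcode count.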

(Apologies for failing to keep my inner pendant properly inner!)

thristian 2 years ago | | |

The original 6502 had exactly 151 opcodes, the same as the number of pokémon in the original Pokémon games.

gabrielsroka 2 years ago | | |

> pendant

I assume you mean pedant.

gabrielsroka 2 years ago | |

I meant instructions/mnemonics.

I was referencing Simonw's post:

> Here are the 65 hardcoded opcodes

ggm 2 years ago |

In a reductionist sense this is "SQL is Turing complete", which has long been known. A voyage of discovery aside, the joy is in the execution and efficiency. I'd be delighted if I'd done this.
