Ok, so this is probably the last post I’ll make about my Brainfuck-on-Arduino project, basically because it has reached a point where I’ve already tried all the things I wanted to try and I’ve decided that there’s no point in taking it off the breadboard and building a board for it. At least not for Brainfuck. And I’ll explain why.
The Performance Issue
I previously said that I was expecting the performance to “drop” a bit when reading directly from a SD card instead of the internal RAM, but I was hoping to mitigate that with a sector/block cache similar to the one I wrote for the SPI RAM.
And that’s completely reasonable and actually true. Where I made a mistake, however, was in also assuming that doubling the SPI clock would result in a noticeable performance boost. That’s definitely false. Reading a whole 512-byte sector currently takes between 1 and 2 milliseconds at 4 MHz, and RAM access is done at the same speed, so since the RAM pages are half the size of the SD sectors, fetching a whole RAM page probably takes about half that time.
Since we are caching so many bytes in advance, the number of page reads (both from RAM and SD) is not really that high, so even if we were to double the SPI bus speed we would only cut around 1 ms from each access. Most programs I’ve tested don’t normally cross the RAM page boundaries nor require more than one SD sector to be stored, so the speedup won’t even be noticeable in most cases. It will be barely 1 or 2 ms, so if we run into performance issues, they are somewhere else. They are NOT in the SPI bus speed.
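As a rough sanity check of those numbers, here is a minimal sketch that estimates the raw bit-shifting time for a transfer. The helper name spi_transfer_us is made up for illustration, and the estimate deliberately ignores command overhead, card latency and the per-byte software overhead, so the real figure will be somewhat higher:

```cpp
#include <cstdint>

// Hypothetical helper: lower bound, in microseconds, for clocking `bytes`
// over SPI at `clock_hz` (one bit per clock cycle).
// Ignores SD command overhead, card latency and software overhead.
constexpr uint64_t spi_transfer_us(uint64_t bytes, uint64_t clock_hz) {
    return (bytes * 8ULL * 1000000ULL) / clock_hz;
}
```

For a 512-byte sector this gives 1024 us at 4 MHz and 512 us at 8 MHz, which fits the measured 1 to 2 ms and shows that doubling the clock can only save a fraction of a millisecond to about a millisecond per sector read.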
The real slowdown
Let’s please remember that with the optimized brainfuck interpreter I was getting a quite-decent run time of 642ms for the Fibonacci generator when executing the code directly from RAM, which is (performance-wise) the “best case scenario”.
I was eager to see the execution time with the same optimized interpreter and my SD-card routines, so when I finished implementing the SD-sector cache I transferred the program to my MMC card and executed the code. Please note that what we are doing here is reading the brainfuck code from the SD card instead of a RAM array, loading it into memory one page at a time (a page is 512 bytes, which means our test program is loaded entirely in one access), and executing the program from there:
-- RUNNING CODE --
1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89
That took 1893 ms
That’s…. almost 2 seconds….
If you are trying to understand what the HELL happened, let me tell you that we were in the same boat.
1.89 secs!! almost three times as much as it took before! And in theory the only extra step we are doing is loading the code from the SD card before calling the parser!
I couldn’t help but wonder WHY? … Why did the performance drop so dramatically when I’m basically reading the whole page from the SD card into RAM (which only takes 1-2ms) and then ALL the code execution happens almost exactly like before!!
The difference in the execution time was so insane that I started changing things like crazy; from inlining functions to optimizing the boundary tests, rewriting “if” conditions and arithmetic operations in different ways to see how each change affected the performance…
For instance changing this:
pgm_block[pgm_codePtr & 0x1ff]
to this:
pgm_block[(word)pgm_codePtr & 0x1ff]
Reduced the execution time by around 60 ms.
And changing this:
if (pgm_codePtr>>9 != pgm_curpgm_block) ...;
(a test that checks if we have the correct SD sector loaded in memory)
to this:
if (pgm_codePtr < pgm_page_start || pgm_codePtr > pgm_page_end) ...;
(which does essentially the same thing, but uses page boundary checks instead of a page number comparison) further reduces the execution time to 1371 ms! That’s more than 400 ms less!!
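To show that the two tests really are interchangeable, here is a minimal sketch of both. The PageCache struct and the function names are assumptions made up for this example (the real interpreter keeps these values in separate variables), and it only demonstrates that both forms flag the same cache misses. Which one is faster depends on the compiler and the target; the timing difference above is what was measured on the ATMega328.

```cpp
#include <cstdint>

const uint32_t kPageSize = 512;  // one SD sector

// Hypothetical description of the sector currently held in the cache.
struct PageCache {
    uint32_t page_number;  // index of the loaded 512-byte sector
    uint32_t page_start;   // first address covered by the cache
    uint32_t page_end;     // last address covered by the cache (inclusive)
};

// Build the cache description for a given sector number.
inline PageCache make_cache(uint32_t page) {
    PageCache c = { page, page * kPageSize, page * kPageSize + kPageSize - 1 };
    return c;
}

// Form 1: shift the 32-bit pointer down to a page number and compare.
inline bool miss_by_number(const PageCache& c, uint32_t ptr) {
    return (ptr >> 9) != c.page_number;
}

// Form 2: compare the pointer against the precomputed page boundaries.
inline bool miss_by_bounds(const PageCache& c, uint32_t ptr) {
    return ptr < c.page_start || ptr > c.page_end;
}
```

The second form trades the 32-bit shift for two 32-bit comparisons against values that only change when a new sector is loaded.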
By the way, these operations are being executed in the parsing loop by the function that reads and controls the program memory (PGM) cache.
So why are these small changes having such a huge impact? I started realizing the problem: Even when we are not loading more than one page from the SD card, we still perform the page checks every iteration, and our code pointer is a dword now. Why is that important? Well, since the program counter is now 32 bits, everything is being promoted to dword when we manipulate its value or use it in arithmetic expressions, unless we do an explicit cast (like in the first “math optimization” described above).
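The effect of the cast can be sketched like this. The typedefs and function names below are stand-ins made up for the example (Arduino already provides a 16-bit word type on AVR). Both functions return the same offset, because the mask only looks at the low 9 bits, but in the second one the pointer is truncated to 16 bits first, so an 8-bit AVR can do the masking on 2 bytes instead of 4:

```cpp
#include <cstdint>

typedef uint16_t word;   // stand-in for Arduino's 16-bit `word`
typedef uint32_t dword;  // assumed 32-bit type for the program counter

// Offset inside the 512-byte block, computed on the full 32-bit value.
inline uint16_t offset_wide(dword ptr) {
    return ptr & 0x1ff;
}

// Same offset, but the pointer is narrowed to 16 bits before masking.
inline uint16_t offset_narrow(dword ptr) {
    return (word)ptr & 0x1ff;
}
```

The narrowing is safe here only because the mask (0x1ff) fits entirely in the low 16 bits; for anything that needs the upper half of the pointer, such as the page number, the cast would change the result.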
Obviously 32-bit math takes more instructions than int (16 bit) or byte (8 bit) math, at least on an 8-bit processor like Arduino’s ATMega328, but it isn’t the problem per se. I mean, there’s nothing wrong with the occasional dword-math, and you would normally not notice the performance hit in most applications, but parsing and executing Brainfuck code is a particular scenario where this makes a huge difference.
Brainfuck’s instruction set is so limited and primitive that even the most basic and common operations (like adding a constant to a memory cell) take A LOT of instructions.
Using a handy online brainfuck interpreter, I obtained the real number of instructions that were being executed by the programs I was testing. The code I’m running in these benchmarks is 511 instructions long, but it actually takes a total of 160,562 instructions just to accomplish its goal of computing the Fibonacci numbers below 100.
That means it iterates a DAMN lot.
Another program I’ve tested on my Arduino, that calculates the first 101 squares (0^2 to 100^2), is 191 instructions long, but it executes 1,367,738 instructions before it ends. Almost 1.4 million instructions!
That’s insane!
In the case of the Fibonacci generator, if an operation takes 2 us instead of 1 us, and it’s executed 150,000 times, it will cause a slowdown of 150 ms! That’s quite a heavy punch for a mere 1 us increment in a single step of the program. At 4 MIPS (the processing speed of this microcontroller @ 4 MHz) every 4 extra instructions we add to the parsing loop will add 1 us to the execution time, and the new dword checks are probably adding way more than 4 extra instructions.
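The arithmetic above can be written down as a tiny helper. This is only a back-of-the-envelope sketch: the function name added_us is made up, and the 4 MIPS figure is simply the one used in this post, passed in as a parameter:

```cpp
#include <cstdint>

// Hypothetical helper: extra run time in microseconds when each interpreted
// Brainfuck instruction costs `extra_cpu_instr` more machine instructions
// on a CPU that executes `mips` million instructions per second.
constexpr uint64_t added_us(uint64_t bf_instructions,
                            uint64_t extra_cpu_instr,
                            uint64_t mips) {
    return (bf_instructions * extra_cpu_instr) / mips;
}
```

With 150,000 iterations, 4 extra instructions and 4 MIPS this gives 150,000 us (150 ms), and with the full 160,562 instructions of the Fibonacci program about 160 ms. At 8 extra instructions per iteration the penalty doubles.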
So the biggest slowdown is NOT reading from the SD card or the RAM IC or the SPI bus frequency… the problem is manipulating large values frequently. They add a lot of instructions, and we are adding those instructions to a very intense loop that iterates an absurd number of times.
As a result of these observations I removed the calls to the “high level” block-checking function from the parsing loop, and decided to read directly from the SD cache array. I also took advantage of the fact that we only move the program counter 1 step at a time to further optimize the page-boundary check by only performing a cache refresh when (word)pgm_codePtr % 512 == 0, which should be faster to check. The only time a proper page check is performed is when jumping back to the beginning of a loop.
With these changes the code takes 945 ms to execute. Not quite as glorious as the 640 ms we had, but not that bad considering the numbers we were getting at the beginning of this.
It’s worth noting that we have come a REALLY long way from the first tests I made. The same thing we are doing now used to take almost 11 seconds with the old setup (Arduino’s SD library and the unoptimized interpreter). Now it takes less than 1.
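Here is a minimal sketch of that idea, not the actual interpreter code. Everything in it is an assumption made for illustration: the ProgramReader struct, the function names, and the plain byte array that stands in for the SD card so the logic can be tried on a PC. The 16-bit check works because 65536 is a multiple of 512, so the low 16 bits of the pointer hit a multiple of 512 exactly when the full 32-bit value does.

```cpp
#include <cstdint>
#include <cstring>

typedef uint16_t word;             // stand-in for Arduino's 16-bit `word`
const uint16_t kSectorSize = 512;  // SD sector size

// Hypothetical reader state; `storage` is an array standing in for the SD card.
struct ProgramReader {
    const uint8_t* storage;
    uint32_t storage_size;
    uint32_t codePtr;    // 32-bit program counter
    uint32_t loaded;     // number of the sector currently held in `block`
    uint32_t refills;    // how many sector loads have happened so far
    uint8_t block[512];  // the sector cache
};

// Copy the sector that contains codePtr into the cache.
// Does nothing if that sector lies past the end of the storage.
void load_sector(ProgramReader& r) {
    uint32_t start = r.codePtr & ~(uint32_t)(kSectorSize - 1);
    if (start + kSectorSize > r.storage_size) return;
    std::memcpy(r.block, r.storage + start, kSectorSize);
    r.loaded = r.codePtr >> 9;
    r.refills++;
}

void reader_init(ProgramReader& r, const uint8_t* storage, uint32_t size) {
    r.storage = storage;
    r.storage_size = size;
    r.codePtr = 0;
    r.loaded = 0;
    r.refills = 0;
    load_sector(r);
}

// Sequential fetch: read straight from the cache array, advance by one,
// and refresh only when the pointer lands on a sector boundary.
uint8_t next_byte(ProgramReader& r) {
    uint8_t value = r.block[(word)r.codePtr & 0x1ff];
    r.codePtr++;
    if ((word)r.codePtr % kSectorSize == 0) load_sector(r);
    return value;
}

// Jumps (e.g. back to the start of a loop) do the full 32-bit page check.
void jump_to(ProgramReader& r, uint32_t target) {
    r.codePtr = target;
    if ((target >> 9) != r.loaded) load_sector(r);
}
```

In this sketch the hot path (next_byte) only ever touches the low 16 bits of the pointer apart from the increment, and the expensive 32-bit comparison is confined to jump_to, which runs far less often.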
Just For Fun
Although it was obvious from the start that Brainfuck is not the most efficient language on Earth, I didn’t really imagine it was THAT bad. Almost 1.4M instructions just to compute 101 square numbers is just insane. It’s definitely a poor choice for an interpreted language running on an Arduino so I don’t think I’ll finish a hardware design for this. Even if I add some extensions to the language (like a stack) it would probably still be unsuitable for time-critical applications.
Having said that, I decided to have a little more fun with this and added 2 new commands to Brainfuck’s instruction set:
- : (colon), which outputs the contents of the current memory cell to the 8-bit hardware port
- ; (semicolon), which reads the 8-bit value from the digital hardware port into the current memory cell
They are obviously equivalent to . and , but instead of using the serial console for input/output they “talk” to a hardware digital port.
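In the interpreter’s dispatch, the two commands could look roughly like the sketch below. The HardwarePort struct and the function name are made up for this example; on the real board the two fields would be an actual output and input port register of the microcontroller, which are replaced here by plain variables so the logic can be run anywhere:

```cpp
#include <cstdint>

// Hypothetical stand-in for the 8-bit digital port.
struct HardwarePort {
    uint8_t out_latch;  // value last written to the output pins
    uint8_t in_pins;    // value currently present on the input pins
};

// Executes the two extension commands on the current memory cell.
// Returns false for any other opcode so the caller can handle it.
bool exec_port_command(char op, uint8_t& cell, HardwarePort& port) {
    switch (op) {
        case ':':  // like '.', but writes the cell to the hardware port
            port.out_latch = cell;
            return true;
        case ';':  // like ',', but reads the hardware port into the cell
            cell = port.in_pins;
            return true;
        default:
            return false;
    }
}
```

Returning false for unknown opcodes keeps the extension separate from the eight standard commands, so it can be bolted onto an existing interpreter loop without touching them.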
Here’s a demo of a program that outputs the numbers from 255 to 0 to the digital port (I had to add a delay to the inner loop so you could see the lights changing, though)
As you can see, it works like a charm.
This extension would allow BF code to control lights, motors, read switches, sensors, etc.
The Original Plan
I didn’t really make all of this just for the lulz (OK, in part I did). I’ve been playing with the idea of implementing a very basic, stack-based “Virtual Machine” on Arduino for a while.
It would ideally be a sort of “universal” VM that could easily be ported to other platforms (like PICs, MSP430s, etc). Eventually it would also be cool to make a simple high-level language that compiles to the instruction set of this VM and as a result, the code would run on every platform where the VM has been implemented.
Brainfuck was a fun first approach and a very good platform for “testing” concepts and ideas. Its reduced “instruction set” was extremely easy to parse and implement, and a lot of things that this project required (like interfacing with a RAM IC, reading and writing content to a persistent storage, etc) would also be required for a more serious VM project.
Now I still don’t know if this “universal” VM thing would be a good idea. There are already a few compilers for microcontrollers that can target different platforms, and a few global efforts to bring the Arduino “language” and set of libraries to other platforms, all of which are basically trying to achieve the same thing.
There are obvious problems that would need to be addressed too, like the hardware differences between platforms and how (if) the platform-specific features will be exposed to the user, so I don’t know if I’ll ever do this.
But for the time being I think I learnt a lot from this experiment, especially about SD cards and the potential pitfalls of designing and implementing a VM on Arduino. And I obviously had a lot of fun with it.