HOW TO MODIFY GAMES to work with
the SUPER CPU 64/128
by S. L. Judd
One of the feature articles in this issue deals with NTSC/PAL
fixing. But have you ever thought about SCPU fixing? You know how
it goes: you have that program that could really benefit from
the speed boost, but doesn't work, and usually because of some silly reason.
Well, it really bugs me to have programs not like my nice
hardware for dumb reasons, so I decided I would try my hand at fixing
up some programs. The one that really did it for me was the game
"Stunt Car Racer" -- I had never played it before, but after getting
ahold of it it was clear that here was a game that would be just great
with a SuperCPU. I had never done something like this before, but it
seemed a doable problem and so I jumped in head first, and this article
sums up my inexpert experience to date.
By the way, stuntcar-scpu is totally cool :).
To date I have fixed up just three games: Stunt Car Racer, Rescue
on Fractalus, and Stellar 7. My goal was really to "CMD-fix" them:
to make them run off of my FD-2000 as well as my SCPU. Although these are
all games, the techniques should apply equally well to application programs
with a bad attitude. Before discussing the fixes, it is probably worthwhile
to discuss a few generalities.
I also note that programmers who don't have a SuperCPU might find
some of this information helpful in designing their programs to work
with one.
Finally, my fixes are available in the Fridge.
Tools and Process
The tools I used were:
o Action Replay
o Paper for taking notes (backs of receipts/envelopes work)
I think this is all that is necessary, although a good sector editor
can come in handy for certain things.
After trying a number of different approaches to the problem, the process
I've settled on goes roughly like the following:
- Have an idea of what will need fixing
- Familiarize yourself with the program
- Track down the things that need fixing
- Figure out free areas of memory
- Apply patches, and test
Most programs work in more or less the same way: there are
some initialization routines, there's a main loop, and there's an
interrupt routine or series of routines. The interrupts are easy to
find, via the vectors at either $FFFA or at $0314 and friends. The
initialization routine can be tougher, but can be deduced from
the loader or decompressor; also, some programs point the NMI vector to
the reset code, so that RESTORE restarts the program. Finding the
things that need fixing usually involves freezing the program at the
appropriate time, and doing a little disassembly. Sometimes a hunt for
things like LDA $DC01 is helpful, too. Figuring out free areas of
memory is easy, by either getting a good feel for the program, or
filling some target memory with a fill byte and then checking it
later, to see if it was overwritten. Once the patch works on the 64,
all that remains is to test it on the SCPU, and it's all done!
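
When hunting from a monitor, it is the assembled bytes that get searched for. As a small reference (these are the standard 6502 absolute-mode encodings, with the low byte of the address first), reads of the two CIA ports look like this in memory:

```
AD 01 DC    LDA $DC01
AE 01 DC    LDX $DC01
AC 01 DC    LDY $DC01
AD 00 DC    LDA $DC00
```

A program can also reach the port through an indexed or indirect instruction, in which case a hunt for these patterns will come up empty and a freeze point is the better tool.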
It seems to me that, at the fundamental level, the SCPU is different
from a stock machine in three basic ways: it is a 65816, it runs at 20MHz,
and it has hardware registers/different configurations. There are also
some strange and mysterious problems that can arise.
All possible opcodes are defined on the 65816, which means that
"illegal" or quasi-opcodes will not work correctly. On the 65xx chips, the
quasi-opcodes aren't real opcodes -- they are like little holes in the
opcode table, and things going through those holes fall through different
parts of the normal opcode circuitry. Although used by very few programs, a
number of copy protection schemes make use of them, so sometimes the program
works with a SCPU but the copy protection makes it choke -- how very annoying
(example: EA's Lords of Conquest). Naturally, disk-based protection schemes
mean it won't work on an FD-2000, either.
Running at 20MHz causes all sorts of problems. Any kind of software
loop will run too fast -- delay loops, countdown loops, input busy-loops,
etc. The same goes for main program loops, so that the game runs unplayably
fast (most 3D games never had to worry about being too fast). It can also
lead to flickering screens, as we shall see later, and the "play" of
games is designed with 1MHz in mind -- velocities, accelerations, etc.
What looks smooth at the low frame rate might look poor at the high, as we
shall also see later. Finally, fastloaders invariably fail at 20MHz,
like any other code using software-based timing.
The SuperCPU also has a series of configuration registers located
at $D07x and $D0Bx, which determine things like software speed and VIC
optimization modes (which areas of memory are mirrored/copied to the
RAM). Note also that enabling hardware registers rearranges $E000 ROM
routines. Although it is possible for programs to accidentally reconfigure
the SCPU, it is awfully unlikely, since the enable register, which switches
the hardware registers in, is sandwiched between disable registers:
$D07D Hardware register disable
$D07E Enable hardware registers
$D07F Hardware register disable
Strangely enough, though, different hardware configurations can sometimes
cause problems. For example, newer (v2) SCPUs allow selective mirroring of
the stack and zero page, and by default have that mirroring turned OFF.
For some totally unknown reason, this caused major problems with an
attempt of mine to fix Stunt Car Racer -- I am told that the old version
would slow down to just double-speed, flicker terribly, and more. Turning
mirroring back on apparently fixes up the problem. (I have an older SCPU,
and hence did not have this problem.) So before going after a big fix, it
is worthwhile to invest a few minutes in trying different configurations.
Finally, there are other strange problems that can arise. For
example, I have two 128s: one is a flat 128, one a 128D. With my 128D,
if $D030 is set then the SCPU sometimes -- but not always -- freaks out
and locks up. The flat 128 does not have this problem. One reason this
is important is that many decompressors INC $D030 to enable 2MHz mode.
A simple BIT ($2C) fixes this problem up, but the point is that the SCPU has
to interact with the computer, so perhaps that interaction can lead to
problems in obscure cases.
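
One way to read the BIT fix, sketched here as an assumption about what the patch looks like in memory: INC and BIT in absolute mode are both three bytes long, so only the opcode byte has to change and nothing else in the decompressor moves.

```
; original decompressor code
EE 30 D0    INC $D030     ; switch to 2MHz mode
; patched: same length, only the first byte differs
2C 30 D0    BIT $D030     ; reads $D030 and changes only the flags
```

If the decompressor later undoes the change with a DEC $D030, that instruction would presumably get the same treatment.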
Now, if the goal is to CMD-fix the program, there may be a few
disk-related things that may need fixing. In addition to stripping out
(or possibly fixing up) any fastloaders, most programs annoyingly assume
drive #8 is the only drive in town. Also, if the program uses a track-based
loader (instead of a file-based loader), then that will need to be fixed
as well, and any disk-based copy protection will have to be removed.
There's one other thing to consider, before you fix: is the
program really busted? For example, if you've tried a chess program
with the SCPU, chances are that you saw no speed improvement. Why
not? It turns out that most chess programs use a timer-based search
algorithm -- changing the playing strength changes the amount of
time the program spends searching, and not the depth of the search.
(The reason is to make the gameplay flow a little better -- otherwise
you have very slow play at the beginning, when there are many more
moves to consider). So although it might look like it isn't working
right with the SCPU, it is actually working quite well.
And that pretty much covers the basic ideas. The first program
I fixed up was Stunt Car Racer.
Stunt Car Racer
Stunt Car Racer, in case you don't know, is a 3D driving game,
and quite fun. It is also too fast, unplayably fast, at 20MHz. Therefore,
it needs to be slowed down!
My first plan... well, suffice to say that most of my original
plans were doomed to failure, either from being a bad idea, or from
poor implementation. It is clear enough that some sort of delay is
needed, though, in the main loop, or perhaps by intercepting the joystick I/O.
The program has a main loop and an interrupt loop as well.
The interrupt handles the display and other things, and all of the
game calculations are done in the main loop, which flows like
Do some calculations
Draw buffer 1
Do some calculations
Draw buffer 2
One of my first thoughts was to intercept the joystick I/O, which is
easy to find by hunting for LDA $DC01 (or DC00, whichever joystick
is used). The patch failed, and possibly because I didn't check that
the memory was safe, and possibly because it was in the interrupt routine
(I simply don't remember).
Before patching, it is very important to make sure that the
patch will survive, and not interfere with the program, so it is
very important to find an area of memory that is not used by the
program. It took me a little while to figure this out! Finding
unused memory was pretty easy -- I just filled the suspect areas with
a fill byte, ran the program, and checked that memory. Mapping out the
memory areas also aids in saving the file, as un-needed areas don't
need to be saved, or can be cleared out to aid compression.
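
The fill-and-check idea can be done with a monitor's fill command, but as a rough sketch in code it might look like the following. FILL and AREA are made-up values for illustration, and only a single page is handled.

```
FILL    = $BD           ; hypothetical fill byte, anything distinctive
AREA    = $C000         ; hypothetical candidate page

; run once, before starting the program
fill    LDA #FILL
        LDX #$00
floop   STA AREA,X
        INX
        BNE floop       ; 256 bytes
        RTS

; run later: carry clear if the page survived, set if not
check   LDX #$00
cloop   LDA AREA,X
        CMP #FILL
        BNE used        ; something overwrote this byte
        INX
        BNE cloop
        CLC
        RTS
used    SEC
        RTS
```

A page that survives one run is only a candidate; as the $8000 experience below shows, it still has to survive every mode of the program.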
The first free area of memory I found was at $C000. It turns
out that this is a sprite, though, and so put some garbage on the
screen. The second I tried was $8000, which worked great in practice
mode but got overwritten in competition mode -- always test your
patches thoroughly! (I had only tested in practice mode). Finally,
I found a few little spots in low memory that survived, and placed the
patch there. The program does a whole lot of memory moving, and uses
nearly all memory. I also left some initialization code at $8000, since
it only needed to be run once, at the beginning (to turn on mirroring
in v2 SCPUs).
Recall that the main loop has two parts -- one for buffer 1, and
one for buffer 2. The trick is to find some code that is common to both
sections, like a subroutine call:
Draw buffer 1
Draw buffer 2
The patch routine I used was a simple delay loop, redirected from those
calls. Of course, this will also slow the program down at 1MHz; later on I
got smarter about my patches, but this one works pretty well.
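
A minimal sketch of such a patch, under assumptions: "draw" stands for whatever subroutine both halves of the main loop call, its address and the delay count are invented, and the routine is assumed not to expect anything in X or Y on entry.

```
draw    = $1234         ; hypothetical: routine called by both halves
COUNT   = $20           ; hypothetical outer count, tuned by trial

; both JSR draw instructions are changed to JSR patch
patch   LDX #COUNT
        LDY #$00
wait    DEY
        BNE wait        ; inner loop: 256 passes
        DEX
        BNE wait        ; outer loop: COUNT passes
        JMP draw        ; carry on into the original routine
```

Because the patch ends with a JMP rather than a JSR, the RTS inside the original routine returns straight to the main loop, just as before.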
To save the game and patches, I simply froze it from AR. Just
saving from the monitor generally failed; the initialization routine
doesn't initialize all I/O settings. Part of the freezing process
involves RLE compression, so if you freeze it is a good idea to
fill all unused portions of memory -- temporary areas, bitmaps, etc.
Another thing to do is to set a freeze point at the init routine,
and then JMP there from the monitor. By clearing the screen, you
won't have to look at all the usual freezer ugliness, and at this
point freezing isn't any different than saving from the ML monitor
and RLE-packing the file. Once saved, I tested a few times from the
64 side, to make sure things worked right.
Whether freezing or saving from the monitor, if the file size
is larger than 202 blocks or so, it can't be loaded on the SCPU without
a special loader -- unless you compress it first. I naturally recommend
using pu-crunch for that purpose, but if you want to do it on the 64
then I recommend using ABCrunch, which works well with the SCPU and
gives about as good compression as you can get without an REU.
The result was stuntcar-scpu, which is *awfully* fun when fixed.
Rescue on Fractalus
Next on my list was Rescue on Fractalus, an older (and quite cool)
Lucasfilm game that just didn't cut it in the 64 conversion, for a number
of reasons (that perhaps could have been avoided). There are at least two
versions of the game, one of which doesn't even work on a 128 (good
but I have the older version, which does work.
With a SuperCPU, though, there are a number of problems. The display
flickers terribly. The gameplay is smooth and not at all too fast --
it is too slow. Specifically, the velocities and turning rates and such do
not give a convincing illusion of speed or excitement. The game is
copy-protected and uses a track-based fastloader, loaded from disk via B-E,
and also saves the high scores to disk. Clearly, this one is a bigger job: the
display is too fast, the game constants need adjusting, and the highscore code
needs to be replaced by some kernal calls.
The structure of this code is a little different. The main loop
handles the (double-buffered) display -- it does all the calculations and
draws to the two buffers. The multi-part interrupt loop does the rest:
it swaps buffers, changes the display in different parts of the screen,
reads the joystick, and performs the game updates which change your
position and direction. It also handles enemies such as saucers, but
doesn't handle the bunkers which fire at you from the mountains (the main
loop takes care of those).
What does all this mean? First, that the game can be a good ten
steps ahead of the screen, which makes things like targeting very
difficult. Second, that the bunkers almost never fire at you at 1MHz
(they go crazy at 20). Third, that things like velocity and turning
rate are rather low, because advancing or turning too quickly would
get the game way out of sync (unfortunately, they are still too fast
for 1MHz, making targeting difficult and movement clunky). On the
other hand, having the movement in the interrupt is the reason that
the game does not become unplayably fast at 20MHz, and means that
something besides a delay loop is needed.
The interrupt swaps buffers, but the main loop draws them,
and because it draws so quickly it can start clearing and drawing to
the visible buffer. To make sure this was what I was seeing, I reversed
the buffer swap code in the interrupt, so that the drawing buffer was
always on-screen. Sure enough, that's what the 20MHz version looked like.
It turned out to be pretty easy to force the main loop to wait
on the interrupt. Although I messed around (unsuccessfully) with
intercepting the interrupt loop, the buffer swap code actually
modifies a zero-page variable upon swapping. So all the main loop
has to do is wait on that variable before charging ahead. I may have
made it wait for two frames, because it made the game play a little better.
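
Waiting on the swap variable could be sketched as below. The zero-page address is invented for illustration, and the sketch assumes interrupts are enabled whenever it is called (otherwise it spins forever).

```
swapflag = $FB          ; hypothetical zero-page location

waitswap LDA swapflag   ; remember the current value
wloop    CMP swapflag
         BEQ wloop      ; spin until the interrupt changes it
         RTS
```

Calling this once at the top of the main loop holds the drawing code until the next buffer swap; calling it twice in a row waits for two swaps.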
Now, how to find the velocity and turn code? Well it takes
a keypress to change the velocity, so by hunting for LDA $DC01, and
tracing back, the routine can be found; at the very least the
affected variables may be found, and hunted for. For example, if
the result is stored in $D0, then you can search for LDA $D0. The
point is to locate the keypress processing code. From there, a little
trial and error (setting freeze points and pressing the velocity key)
locates the piece of code which deals with changing the velocity, and
in particular which variable corresponds to velocity. Finally, from
there it just takes another hunt for LDA velocity, ADC velocity, etc.
to figure out where the code for updating position and direction is.
In this case, I was pretty sure I had found it, as it went roughly like
this: the velocity was loaded and scaled down with an LSR or so,
and this was added to the position. To check that this was the code,
I just changed the ADC, or removed an LSR, to see that the speed changed.
The code for turning left and right and moving up and down was similar,
and again after a little trial and error it was clear what code did
what. Again, it wasn't necessary to totally understand how these
routines worked exactly -- just the general idea of them, in this case
to see that a multiple of the velocity was used to change the position
and orientation of the player.
So, to fix it up, I just changed that multiple -- probably I
NOPed out an LSR above, to basically double the speed, and changed the
turning rates similarly. This took a little experimentation, as it
not only needed to be playably fast, but also couldn't overflow at
high speeds, etc.
But once that was working, all that remained was the highscore
table. Finding the table location was pretty easy -- I just got a high
score, and while entering my name froze the program, and figured out
what got stored where. From there it was pretty easy to figure out
what was saved to disk. From the original loader, I also knew where
the highscores needed to be loaded to initially (the highscore table
gets copied around a lot -- it doesn't just stay at that one location).
Figuring out the exact number of bytes to save took a little bit of
effort (saving either too many or too few bytes screws it up), but
from there it was clear what memory needed to be saved.
So all that remained was to add the usual SETLFS etc. kernal
calls, right? Wrong. The program uses all the usual kernal variables
(from $90-$C0) for its own purposes. Also recall that I wanted the
program to work with device 9, etc. To get around this, I did two
things. First, when the program first starts, I save some of the
relevant variables to an unused part of memory -- in particular, I
save the current drive number. Second, before saving the highscore
file, I actually copy all zero page variables from $90-$C2 or so
to a temporary location, and then copy them back after saving.
That way there are no worries about altering important locations.
Finding memory for the load/save patch was easy -- I just used
the area which was previously used for the fastload load/save code.
There was enough for the necessary code as well as temporary space
for saving the zero page variables.
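
Put together, the save patch might look something like this sketch. Only the three KERNAL entry points and the use of $BA as the current device number are standard; every other address and the file name are made up, and the KERNAL ROM is assumed to be switched in when the routine runs.

```
SETLFS  = $FFBA
SETNAM  = $FFBD
SAVE    = $FFD8
zpsave  = $CF00         ; hypothetical scratch area, $33 bytes
drive   = $CF40         ; hypothetical: copy of $BA made at startup
table   = $4000         ; hypothetical start of highscore data
tabend  = $4100         ; hypothetical end address + 1

savehi  LDX #$32        ; $90..$C2 is $33 bytes
out     LDA $90,X
        STA zpsave,X    ; stash the game's variables
        DEX
        BPL out
        LDA #$01        ; logical file number
        LDX drive       ; whatever drive the game was loaded from
        LDY #$00        ; secondary address
        JSR SETLFS
        LDA #$02        ; name length
        LDX #<name
        LDY #>name
        JSR SETNAM
        LDA #<table
        STA $C1
        LDA #>table
        STA $C2
        LDA #$C1        ; zero-page pointer to the start address
        LDX #<tabend
        LDY #>tabend
        JSR SAVE
        LDX #$32
back    LDA zpsave,X
        STA $90,X       ; put the game's variables back
        DEX
        BPL back
        RTS
name    .byte $48,$49   ; "HI", a made-up file name
```

The sketch does no error checking, and it does not deal with an existing file of the same name, which would have to be scratched or replaced first. A fuller version would probably also restore the kernal values stashed at startup before making the calls.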
Finally, I changed some text from Rescue on Fractalus to
Behind Jaggi Lines, to distinguish it from the original, and that
was that. Works great! And is now more playable and challenging;
in short, more the game it always should have been.
Stellar 7
Finally, I tried my hand at Stellar 7. Stellar 7 had several
problems. At the main screen, a delay loop tosses you to the mission
screen after a while, if no keys are pressed. This is a software loop,
and so passes very quickly. The game itself is too fast, so some sort
of delay is needed. The mission display is also too fast, and has
software delay loops, so that needs fixing. Finally, the game uses
kernal calls for loading and saving, but is attached to drive #8;
also, my version was split into a bunch of files, and I wanted to
cut the number of files down.
Well, by this time it was all pretty straightforward. From
the loader, it was easy to figure out which files went where. The
mission and main displays were loaded in when needed, and swapped
into unused parts of memory when not, so I loaded them in and
adjusted the swap variable accordingly -- this left just the highscore
and seven level files.
Finding the delay loops was easy -- I just went to the relevant
sections of code, froze, and took a look at the loops. There were your
usual software loops, along the lines of
         LDA $D4      ;Check for keypress
:key     LDX #$00
Luckily, all routines were pretty much the same as the above. The
interrupt routine is in the $0314 vector, and the same routine is
used during gameplay.
So the patch is very easy at this point. First, change the
IRQ code which does a JMP $EA7B to JMP $CE00 instead:
. CE00 $EE INC $CFFF
. CE03 $4C JMP $EA7B
To fix up the keypress routines, the idea is to change the LDA $D4
into a JSR patch. How to substitute 3 bytes for 2 bytes? The
trick is to place the LDX #$00 into the patch routine:
. CE06 $20 JSR $CE15 ;Wait for $CFFF
. CE09 $A5 LDA $D4
. CE0B $10 BPL $CE11
. CE0D $A2 LDX #$00 ;If key pressed, then LDX #$00
. CE0F $29 AND #$FF
. CE11 $60 RTS
The actual delay is accomplished by waiting on $CFFF:
. CE15 $AD LDA $CFFF
. CE18 $C9 CMP #$04
. CE1A $90 BCC $CE15
. CE1C $A9 LDA #$00
. CE1E $8D STA $CFFF
. CE21 $60 RTS
As you can see, I waited a (default) of 4 frames. The patch in the
game/mission rendering routine works similarly -- I just patched
the rendering code to basically JSR $CE15. I also decided to
try something new: let the user be able to change that CMP #$04
to make things faster or slower, to suit their tastes. The keyscan
values were pretty easy to figure out, so this just required a little
patch to check for the "+" and "-" keys, and change the value accordingly.
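
As a rough sketch of that adjustment: the operand of the CMP #$04 at $CE18 lives at $CE19, so the patch only has to INC or DEC that one byte. The key location, the scan values and the limits below are assumptions (the values shown are the kernal matrix codes for "+" and "-"; the game's own keyscan may well use different numbers), as is the choice that "+" means more delay.

```
delay   = $CE19         ; operand byte of the CMP #$04 at $CE18
keycode = $CB           ; assumption: holds the current key's scan value

adjust  LDA keycode
        CMP #$28        ; "+" (assumed value)
        BNE chkmin
        LDA delay
        CMP #$08        ; assumed upper limit
        BCS done        ; already 8 or more, leave it
        INC delay
        RTS
chkmin  CMP #$2B        ; "-" (assumed value)
        BNE done
        LDA delay
        CMP #$02
        BCC done        ; never go below 1 frame
        DEC delay
done    RTS
```

There is no debouncing here, so called once per frame it would step the value for as long as the key is held; the limits at least keep it from running away.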
Well, that about sums it up. Perhaps if you do some fixing,
you might send me a little email describing your own experiences?
. C=H #17