Making the Snake Eat its Tail: Bootstrapping
--------------------------------------------
Oct 20 1999, Chris Pressey, Cat's Eye Technologies.
What is bootstrapping?
----------------------
Bootstrapping is the act of implementing a compiler for a language in
that same language, or a subset of it.  It is a well-understood aspect
of compilation and the translation of compilers from one machine onto
a different machine.
Bootstrapping is a fairly esoteric discipline, however, partly because
there's little need to do it more than once for any given compiler and
any given machine, but also because at a basic level, bootstrapping is
somewhat difficult to understand.
How, for example, do you write a compiler that compiles itself, without
first having the compiler???  Sounds more than a little bit like the
"Which came first, the chicken or the egg" paradox.
The term 'bootstrap' itself comes from the whimsical idea that if you
were to bend over and tug at the straps on your own boots, you could
lift yourself off the ground.
So what's the trick to making a compiler levitate?
--------------------------------------------------
Well, first of all let me put your mind at rest - there's no paradox
or magic or anything else spooky involved, although it can feel that
way sometimes.  There are in fact two realistic options for
bootstrapping:
- Write (on paper) the compiler in the language which it compiles,
  then hand-translate (i.e. manually compile) it to assembly or
  machine language.  This approach has been the one taken for the
  first compilers for both Pascal and LISP.
- First write a compiler for the language in another, already-available
  language, such as assembly language, or C.  Then re-write that compiler
  in the language which it compiles.  This is the approach many
  bootstraps have taken.
How was Shelta bootstrapped?
----------------------------
I took the second approach.
First I wrote the Shelta compiler SHELTA86.COM in assembly language
(SHELTA86.ASM) using Turbo Assembler 3.1.
             +----------------+
SHELTA86.ASM | TASM ---> 8086 | SHELTA86.COM
             +----+      +----+
                  | 8086 |
                  +------+
                  TASM.EXE
This is a 'tee diagram' as is commonly used by people who have to
do these sorts of things... it's pretty simple to understand.
The filename on the left is the input into the 'tee', the filename on
the right is the output of the 'tee'.  The filename on the bottom is
the tool which is translating the input into the output.  The formats
listed inside the tee are the languages each of the files is written in.
Because the output of this tee is a compiler, however, it can exist as
a tee in it's own right:
                        +-----------------+
                        | Shelta --> 8086 |
             +----------+-----+      +----+
SHELTA86.ASM | TASM ---> 8086 | 8086 |
             +----+      +----+------+
                  | 8086 |  SHELTA86.COM
                  +------+
                  TASM.EXE
So I re-wrote SHELTA86.ASM in Shelta/GUPI, calling it SHELTAS.SHE.
                        +-----------------+
            SHELTAS.SHE | Shelta --> 8086 | SHELTAS.COM
             +----------+-----+      +----+
SHELTA86.ASM | TASM ---> 8086 | 8086 |
             +----+      +----+------+
                  | 8086 |  SHELTA86.COM
                  +------+
                  TASM.EXE
Lo and behold!  A Shelta compiler written in Shelta.  But that's not
the whole story - at this point the bootstraps have been pulled taut,
but there is one more tug that must be made to actually get levitating.
The compiler SHELTAS.COM must prove it's worth, meeting it's maker
so to speak:
                                    +-----------------+
                        SHELTAS.SHE | Shelta --> 8086 | SHELTAS2.COM
                        +-----------+-----+      +----+
            SHELTAS.SHE | Shelta --> 8086 | 8086 | 
             +----------+-----+      +----+------+
SHELTA86.ASM | TASM ---> 8086 | 8086 |   SHELTAS.COM
             +----+      +----+------+
                  | 8086 |  SHELTA86.COM
                  +------+
                  TASM.EXE
Now, because of some subtle differences in SHELTA86.ASM and SHELTAS.SHE
(the assembly language version does no optimization), the sizes and
contents of all three of these Shelta compilers differ slightly.  But
if the process was carried on one step further, the resultant compiler
would be the same as SHELTAS2.COM.  The following might help clarify why
this is:
                        SHELTAS.SHE +-----------------+
                         Optimizing | Shelta --> 8086 | SHELTAS2.COM
            SHELTAS.SHE +-----------+-----+      +----+  Optimizing
             Optimizing | Shelta --> 8086 | 8086 |       Optimized
             +----------+-----+      +----+------+
SHELTA86.ASM | TASM ---> 8086 | 8086 |   SHELTAS.COM
NonOptimizing+----+      +----+------+    Optimizing
Hand-Optimized    | 8086 |  SHELTA86.COM  Non-Optimized
                  +------+ Non-Optimizing
                           Hand-Optimized
OK, but why did you choose to do this, anyway?
----------------------------------------------
Well, there was certainly no reason to.  I was not moving Shelta from
one machine to another, nor was I treating SHELTA86.ASM as a quick
hack which would be discarded once an optimizing compiler could be
bootstrapped.
On the other hand, there was no reason *not* to, so...
I did it mainly to say that I could.  Not everyone can design a language,
write a compiler for it in the form of a 1/2-kbyte COM file, then
bootstrap it.  I'm not sure I can say it was the hardest thing I've
ever done, but it was difficult enough.
Plus, well, it's the kind of freaky self-referential thing I've always
been interested in.  A compiler written in the language which it
compiles, which in the end appears to have been compiled by itself.
In the preceding section I may have made what I did seem like a walk
in the park, but it wasn't.  A large portion of the time was spent on:
 - fixing bugs in SHELTA86.ASM
 - building GUPI so that Shelta could be powerful enough to bootstrap
   meaningfully (I could have just included SHELTA86.COM entirely as
   inline assembly, but that would kind of defeat the purpose)
 - fixing bugs in SHELTAS.SHE
 - testing SHELTAS2.COM - more concentration than time, actually.
   When you have five Shelta compilers, two in source form
   (Turbo Assembler and Shelta) and three in executable form, you're
   bound to get a little disoriented from having to keep track of
   their interdependencies.
I can also offer the following piece of advice to anyone who is going
to be trying something similar: if you've already squashed down your
first compiler's source code in order to (say) claim bragging rights on
having built an 512-byte compiler, DO NOT attempt to simply translate
the optimized assembly code into another language.  Rewrite it instead.
Especially for a program of this size.  Initially trying to do a
literal translation from SHELTA86.ASM to SHELTAS.SHE was easily the
biggest mistake I made.
Where can I find further information on bootstrapping?
------------------------------------------------------
Two books are of note: the notorious "Dragon" book by Aho, Sethi and
Ullman gives it a brief once-over; "Compilers and Compiler Generators"
by Terry gives it a more thorough and readable treatment.
Happy levitating!
Chris Pressey, Oct 20 1999
Cat's Eye Technologies, Winnipeg, Manitoba, Canada