6. Self-Documenting Programs That Work

The C language itself is beginning to complicate the creation of a self-documenting program through its use of escaped characters. This should not be taken as a flaw in the language; in any other application, the approach taken by C is suitable and even desirable. There is simply no way to distinguish a use of a double quote from a mention of one without additional information, and that information comes in the form of a preceding backslash. For self-documenting code, however, another conceptual leap is in order.

That conceptual leap is to find an alternate way to express these escaped characters. The program must be able to output double quotes, since they will, by necessity, appear in the source code. But there can be no double quotes in the actual instructions: if a double quote appears in the program proper, it must also appear in the description of the program, where it would have to be escaped. As we have seen, this is forbidden.

Fortunately, there are other ways to output this character. The easiest is the statement putchar(34); since 34 is the standard ASCII code for the double quote. The putchar function translates the code into the corresponding character and writes it to standard output. Similarly, putchar(10); will output a newline character.

This leaves one more step. We still need to output 'char*f=' and ';' on either side of the string. We could use a series of statements of the form:

    putchar('c');putchar('h');putchar('a');putchar('r');...

and so on (the resulting code, Self 0, can be found in Appendix A), but a slightly more elegant solution is to incorporate these extra characters into the string used to describe the program. We can, at runtime, selectively print sections of that string corresponding to different parts of the program, by printing from an offset into the string and temporarily placing an end-of-string marker at the end of the section we wish to print.

Combining these techniques generates the first working self-documenting program:

Self I

    char*f="char*f=;main(){f[7]=0;printf(f);putchar(34);f[7]=';';printf(f);&
        putchar(34);f[8]=0;printf(&f[7]);f[8]='m';putchar(10);printf(&f[8]);putchar(10);}";
    main(){f[7]=0;printf(f);putchar(34);f[7]=';';printf(f);putchar(34);&
        f[8]=0;printf(&f[7]);f[8]='m';putchar(10);printf(&f[8]);putchar(10);}

When Self I is compiled and run, it outputs an exact duplicate of itself, and is therefore a successful self-documenting program. Unfortunately, Self I is cryptic and unreadable, even by C standards! The lines are long, and spacing is minimal. Furthermore, there is no practical way to reformat this program while preserving its self-reference. We cannot break the definition of string f over several lines, because as we have seen, newlines must be escaped within a string. If we inserted a newline directly into f, the program would fail to compile.

That is not to say there is no such thing as an elegant self-documenting program. It is in fact possible to write a much shorter one. The trick lies in careful use of the printf function, whose format string may contain format specifiers that are replaced by other values before the text is printed. Furthermore, the format string passed to printf is just a normal C string like any other. In fact, it is possible to achieve self-reference by letting the string which describes the program also serve as the format string for its own output! This idea gives us the following program:

Self II

    char*f="char*f=%c%s%c;%cmain(){printf(f,34,f,34,10,10);}%c";
    main(){printf(f,34,f,34,10,10);}

Here, we have overcome the problem of outputting the characters which delimit the string by making the string represent not just the use of the program, but rather the program in its entirety. The string f represents everything that needs to be printed out, with a few critical sections left out - namely, the characters that need to be escaped, and the contents of the string. That is the crucial idea behind Self II. The string f can avoid the infinite recursion of including a copy of its own contents within itself by specifying that some unknown string will later be substituted for the '%s', and then substituting f itself! This is why the printf statement contains f twice. The first time is as the format string, and the second is as the replacement text for the '%s' in f. And Self II has the added advantage of being relatively easy to read and understand.

There are, of course, many more possibilities. There are numerous ways to rephrase these working programs, and there are even programs that attack the problem from a different angle (for an example, see Self IV in Appendix A). Unfortunately, there is no room here to discuss the strategies used by those programs. At the very least, the examples given above provide a good starting point for exploring those other ideas.

