Can we apply some of the same ideas to the statistical generation of web pages? I thought for a while about how I might go about doing this, but didn't have time to try anything out. Finally, when I was invited to deliver the keynote address at PoCSci'02, I knew I had the perfect opportunity to probe this fascinating problem. What follows are the results of various experiments.
Frighteningly, I can imagine this idea having some practical use for people researching the validation and display of web pages.
To establish a baseline, I tried the standard sort of Markov chain approach. It's pretty obvious that this is a terrible idea, but it was worth a shot. The binary chains don't work at all (duh). The symbolic chains (where each symbol is an ASCII character) aren't great, but they do approach the quality of the vast majority of web pages out in the world.
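For the curious, a character-level chain of this kind can be sketched in a few lines of Python. This is only an illustration, not the code used for the experiments: the function names `build_chain` and `dissociate`, the default order of 3, and the choice to start from the first few characters of the source are all assumptions made for the example.

```python
import random
from collections import defaultdict

def build_chain(text, order=3):
    """Map each run of `order` characters to the characters seen right after it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain

def dissociate(text, order=3, length=200, seed=None):
    """Emit up to `length` characters by walking the chain built from `text`."""
    rng = random.Random(seed)
    chain = build_chain(text, order)
    state = text[:order]          # assumption: begin where the source begins
    out = list(state)
    while len(out) < length:
        followers = chain.get(state)
        if not followers:         # dead end: this state only occurs at the end of the source
            break
        out.append(rng.choice(followers))
        state = "".join(out[-order:])
    return "".join(out)

if __name__ == "__main__":
    page = "<html><body><p>hello</p><p>help</p></body></html>"
    print(dissociate(page, order=3, length=80, seed=1))
```

Every window of `order + 1` characters in the output occurs somewhere in the source, which is why the result looks locally plausible. Nothing in the chain keeps track of which tags are open, though, so it has no way to close them properly.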
The reason the pure Markov model doesn't work very well is that HTML data is fundamentally tree-structured. A generator of new web pages must respect that structure.
My smarter approach is two-level: a vertical Markov chain develops the tree of tags, and a horizontal chain runs within each tag and in the content between tags. It's a little more involved than that, but you get the basic idea. Here are some dissociations.
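To make the two-level idea concrete, here is a rough Python sketch of one way it could be done. It is a guess at the general shape, not the actual generator, and the simplifications are mine: the vertical chain picks each child from the parent tag and the previous sibling, the horizontal chain is a character-level chain over text content only (attributes are ignored), and the input is assumed to be well nested with every tag explicitly closed. The names `TreeBuilder`, `train`, `gen_text`, `gen_node`, and `dissociate_html` are all hypothetical.

```python
import random
from collections import defaultdict
from html.parser import HTMLParser

START, END = "\x02", "\x03"   # sentinels assumed not to occur in page text

class TreeBuilder(HTMLParser):
    """Parse HTML into nested [tag, children] lists; text children are plain str."""
    def __init__(self):
        super().__init__()
        self.root = ["#root", []]
        self.stack = [self.root]
    def handle_starttag(self, tag, attrs):
        node = [tag, []]
        self.stack[-1][1].append(node)
        self.stack.append(node)
    def handle_endtag(self, tag):
        if len(self.stack) > 1 and self.stack[-1][0] == tag:
            self.stack.pop()
    def handle_data(self, data):
        if data.strip():          # skip whitespace-only runs between tags
            self.stack[-1][1].append(data)

def train(node, vertical, horizontal, order=2):
    """Fill the vertical (tag tree) and horizontal (text) transition tables."""
    tag, children = node
    prev = START
    for child in children:
        kind = "#text" if isinstance(child, str) else child[0]
        vertical[(tag, prev)].append(kind)
        prev = kind
        if isinstance(child, str):
            padded = START * order + child + END
            for i in range(len(padded) - order):
                horizontal[padded[i:i + order]].append(padded[i + order])
        else:
            train(child, vertical, horizontal, order)
    vertical[(tag, prev)].append(END)

def gen_text(horizontal, rng, order=2, limit=200):
    """Run the horizontal chain until it emits END or reaches `limit` characters."""
    state, out = START * order, []
    while len(out) < limit:
        ch = rng.choice(horizontal[state])
        if ch == END:
            break
        out.append(ch)
        state = (state + ch)[-order:]
    return "".join(out)

def gen_node(tag, vertical, horizontal, rng, budget, order=2):
    """Grow the children of `tag` with the vertical chain, then wrap them in the tag."""
    parts, prev = [], START
    while budget[0] > 0:          # shared budget bounds the total number of choices
        budget[0] -= 1
        kind = rng.choice(vertical[(tag, prev)])
        if kind == END:
            break
        if kind == "#text":
            parts.append(gen_text(horizontal, rng, order))
        else:
            parts.append(gen_node(kind, vertical, horizontal, rng, budget, order))
        prev = kind
    inner = "".join(parts)
    return inner if tag == "#root" else f"<{tag}>{inner}</{tag}>"

def dissociate_html(source, seed=None, order=2):
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    vertical, horizontal = defaultdict(list), defaultdict(list)
    train(builder.root, vertical, horizontal, order)
    return gen_node("#root", vertical, horizontal, random.Random(seed), [200], order)
```

Because every generated node is wrapped in its own opening and closing tag, the output is well nested by construction, whatever nonsense the chains happen to choose. Unclosed tags such as a bare `<br>` would confuse this simple tree builder, so treat it as a toy.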
Notice how Alex Chiu's web page is pretty much idempotent under dissociation: the dissociated version looks nearly identical to the original. I'm not sure whether this says something about Mr. Chiu.