CWE
Dissociated Web
 
I've always been a fan of those Markov-chain-based programs that can digest a bunch of input and create statistically related output. They've been used to good effect to generate prose, poetry, and music. Markov models also figure prominently in various research areas in computer science.

Can we apply some of the same ideas to the statistical generation of web pages? I thought for a while about how I might go about doing this, but didn't have time to try anything out. Finally, when I was invited to deliver the keynote address at PoCSci'02, I knew I had the perfect opportunity to probe this fascinating problem. What follows are the results of various experiments.

Frighteningly, I can imagine this idea having some practical use for people researching the validation and display of web pages.

Failed Attempts

To establish a baseline, I tried the standard sort of Markov chain approach. It's pretty obvious that this is a terrible idea, but it was worth a shot. The binary chains don't work at all (duh). The symbolic chains (where each symbol is an ASCII character) aren't great, but they do approach the quality of the vast majority of web pages out in the world.
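For readers who haven't met one, a character-level chain of the kind described above can be sketched in a few lines of Python. This is only an illustration, not the code behind these experiments: the function names `build_chain` and `generate`, the context length of 3, and the tiny sample page are all assumptions made for the example.

```python
import random
from collections import defaultdict

def build_chain(text, order=3):
    """Map each run of `order` characters to the characters seen right after it."""
    chain = defaultdict(list)
    for i in range(len(text) - order):
        chain[text[i:i + order]].append(text[i + order])
    return chain

def generate(chain, seed, length=200, rng=None):
    """Extend `seed` one character at a time by sampling observed successors."""
    rng = rng or random.Random()
    order = len(seed)
    out = seed
    while len(out) < length:
        successors = chain.get(out[-order:])
        if not successors:  # dead end: this context was only seen at the very end
            break
        out += rng.choice(successors)
    return out

# Hypothetical input; a real run would read a whole page from disk.
sample = "<html><body><p>Hello, <b>world</b>!</p><p>Hello again.</p></body></html>"
chain = build_chain(sample, order=3)
page = generate(chain, seed=sample[:3], length=120, rng=random.Random(0))
print(page)
```

Every short window of the output is locally plausible because it was copied from the input, but nothing forces an opened tag to be closed. That is exactly the weakness the next section addresses.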

Home Page
Original

Simple Dissociations

The reason the pure Markov model doesn't work very well is that HTML data is fundamentally tree-structured. A generator of new web pages must respect that structure.

My smarter approach is two-level: a vertical Markov chain develops the tree of tags, and a horizontal chain runs within each tag and in the content between tags. It's a little more involved than that, but you get the basic idea. Here are some dissociations.
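One way to read the two-level idea is sketched below in Python. It is a guess at the general shape, not the actual implementation, and it leaves out the "more involved" parts. The assumptions: the "vertical" table maps (parent tag, previous child) to the next child, the "horizontal" table maps (enclosing tag, previous word) to the next word, attributes are dropped, and the input is well nested with no void elements such as `<br>`. The names `Trainer`, `train`, `grow`, and `babble` are invented for the example.

```python
import random
from collections import defaultdict
from html.parser import HTMLParser

# Markers contain spaces, so they can never collide with a tag name or a split() token.
START, END, TEXT = "( start )", "( end )", "( text )"

class Trainer(HTMLParser):
    """Collects two tables of observed transitions from one HTML document."""
    def __init__(self):
        super().__init__()
        self.kids = defaultdict(list)    # (parent tag, previous child) -> next children
        self.words = defaultdict(list)   # (enclosing tag, previous word) -> next words
        self.open_elems = [["#root", START]]  # each entry: [tag, last child seen]

    def _child(self, kind):
        top = self.open_elems[-1]
        self.kids[(top[0], top[1])].append(kind)
        top[1] = kind

    def handle_starttag(self, tag, attrs):
        self._child(tag)
        self.open_elems.append([tag, START])

    def handle_endtag(self, tag):
        if len(self.open_elems) > 1 and self.open_elems[-1][0] == tag:
            self._child(END)
            self.open_elems.pop()

    def handle_data(self, data):
        tokens = data.split()
        if not tokens:
            return
        self._child(TEXT)
        tag = self.open_elems[-1][0]
        for prev, nxt in zip([START] + tokens, tokens + [END]):
            self.words[(tag, prev)].append(nxt)

def train(html):
    t = Trainer()
    t.feed(html)
    t.close()
    t._child(END)  # close the synthetic root element
    return t.kids, t.words

def babble(words, tag, rng, limit=30):
    """Horizontal chain: a run of words as seen inside `tag`."""
    out, prev = [], START
    for _ in range(limit):
        options = words.get((tag, prev))
        if not options:
            break
        prev = rng.choice(options)
        if prev == END:
            break
        out.append(prev)
    return " ".join(out)

def grow(kids, words, rng, tag="#root", depth=0, max_depth=8):
    """Vertical chain: pick children for `tag`, recursing into each child element."""
    parts, prev = [], START
    for _ in range(50):  # hard cap on children so generation always stops
        options = kids.get((tag, prev))
        if not options or depth >= max_depth:
            break
        prev = rng.choice(options)
        if prev == END:
            break
        if prev == TEXT:
            parts.append(babble(words, tag, rng))
        else:
            parts.append(grow(kids, words, rng, prev, depth + 1, max_depth))
    inner = "".join(parts)
    return inner if tag == "#root" else f"<{tag}>{inner}</{tag}>"

# Hypothetical input page.
sample = ("<html><body><h1>News</h1><p>Old <b>pages</b> never die.</p>"
          "<p>They just <b>dissociate</b> slowly.</p></body></html>")
kids, words = train(sample)
page = grow(kids, words, random.Random(0))
print(page)
```

Because every element is emitted by a single recursive call that writes both its opening and closing tag, the output is well nested by construction, whatever the chains decide to put inside.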

Notice how Alex Chiu's web page is pretty much a fixed point of dissociation: the dissociated version looks nearly identical to the original. I'm not sure whether this says something about Mr. Chiu.

Home Page
Dissociated / Original

Amazon
Dissociated / Original

Alex Chiu
Dissociated / Original

Curriculum Vitae
Dissociated / Original

Memepool
Dissociated / Original

Slashdot
Dissociated / Original

Halfbakery Challenge

The Original Article
Dissociated / Original

A Google Search
Dissociated / Original

The Final Combination
Dissociated / a combination of

Other

The News
Dissociated / A combination of

Last modified: Wednesday, 12-Apr-2006 14:58:10 EDT