Freedom of Choice

If you were able to decipher any of the XML from the previous post, you may have gotten an idea of how the data is structured after it’s split up.  Each “Term” is a run of words from the original document, and its Values are all the words that followed that run of words wherever it appeared.  The phrase “to XML” was in the document I parsed, and it was followed by “Overview”, “is”, “Developers”, etc.  In other words, the phrases “to XML Overview”, “to XML is”, and “to XML Developers” were all present in the original source document.

The algorithm uses a rolling window over the terms.  It starts with the phrase “to XML” and randomly chooses a word that followed that phrase in the original document.  That makes the output text something like “to XML Developers”.  Then, it drops the first word from the phrase and looks up a word that followed the remaining words in the phrase.  In this case, it looks for a word that followed “XML Developers”.  And so on.
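To make the rolling window concrete, here is a minimal Python sketch of the idea.  It is not the code from this project: the names build_table and generate, the tuple-keyed dictionary, and the tiny sample text are all assumptions made for illustration.

```python
import random
from collections import defaultdict

def build_table(words, window=2):
    """Map each run of `window` words (a Term) to the words that followed it (its Values)."""
    table = defaultdict(list)
    for i in range(len(words) - window):
        term = tuple(words[i:i + window])
        table[term].append(words[i + window])
    return table

def generate(table, start, length, rng=random):
    """Extend `start` one word at a time, always looking up the last `window` words."""
    window = len(start)
    output = list(start)
    for _ in range(length):
        followers = table.get(tuple(output[-window:]))
        if not followers:  # dead end: this phrase was never followed by anything
            break
        output.append(rng.choice(followers))
    return output

# Hypothetical sample text, just enough to reproduce the "to XML" example.
words = "to XML Overview to XML is to XML Developers guide".split()
table = build_table(words)
print(table[("to", "XML")])
print(" ".join(generate(table, ("to", "XML"), 5)))
```

Passing a seeded random.Random as rng makes a run repeatable, which helps when comparing tuning changes.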

This allows multiple opportunities for tuning the results.  For instance, you can change the size of the rolling window.  The default size is 2, which means that the algorithm uses the previous two words when looking up the next word to print.  That, in turn, means that any three consecutive words in the output appeared together, in that order, in the original text.  This gives the output a somewhat even and consistent feel, because it can only switch trains of thought where the same two words in a row appear more than once in the source.  This default might work well for this case, because it will likely pivot on things like names.  For instance, if you have the following lines:

“Nancy Pelosi thinks Barack Obama is a fine, upstanding man.”

and

“Glenn Beck said that Barack Obama is a racist with a deep-seated hatred of white people.”

then you might get the following line as output:

“Nancy Pelosi thinks Barack Obama is a racist with a deep-seated hatred of white people.”

which is exactly the kind of sensationalistic blazing irrationality that I hope to achieve with this project.
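The pivot in that example can be checked mechanically.  The sketch below is an illustration only: the helper names are made up, the table is built per sentence so that phrases never span two lines, and instead of choosing at random it lists every line reachable from “Nancy Pelosi” with a window of 2.

```python
from collections import defaultdict

def build_table(sentences, window=2):
    """Build the Term -> Values table without crossing sentence boundaries."""
    table = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - window):
            table[tuple(words[i:i + window])].append(words[i + window])
    return table

def reachable(table, start, max_words=30):
    """List every distinct line that could be generated from `start`."""
    window = len(start)
    results = []
    stack = [list(start)]
    while stack:
        output = stack.pop()
        followers = table.get(tuple(output[-window:]))
        if not followers or len(output) >= max_words:
            results.append(" ".join(output))
            continue
        for word in sorted(set(followers)):
            stack.append(output + [word])
    return results

sentences = [
    "Nancy Pelosi thinks Barack Obama is a fine, upstanding man.",
    "Glenn Beck said that Barack Obama is a racist with a deep-seated hatred of white people.",
]
table = build_table(sentences)
for line in reachable(table, ("Nancy", "Pelosi")):
    print(line)
```

The only place the two lines can cross is after “is a”, since that is the one two-word phrase with two different followers.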

If you make the window smaller, the output text can become even more unhinged, because it can jump around wildly at every word.  That kind of lunacy might be perfect for the incoherence of Twitter.  If you make the window larger, you risk having less and less randomness, because you require more matching words in a row, which becomes less and less likely.
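One rough way to see the effect of the window size is to count the places where the output is allowed to change direction.  This sketch assumes a hypothetical helper, branch_points, that lists every phrase followed by more than one distinct word; it reuses the two example lines from above and is not part of the actual project.

```python
from collections import defaultdict

def branch_points(sentences, window):
    """Return the phrases of `window` words that had more than one distinct follower."""
    followers = defaultdict(set)
    for sentence in sentences:
        words = sentence.split()
        for i in range(len(words) - window):
            followers[tuple(words[i:i + window])].add(words[i + window])
    return {term: sorted(nexts) for term, nexts in followers.items() if len(nexts) > 1}

sentences = [
    "Nancy Pelosi thinks Barack Obama is a fine, upstanding man.",
    "Glenn Beck said that Barack Obama is a racist with a deep-seated hatred of white people.",
]
for window in (1, 2, 5):
    print(window, branch_points(sentences, window))
```

With a window of 1, the word “a” alone offers three ways to continue.  With a window of 2, only “is a” branches, and with a window of 5 nothing does, so the output could only repeat one of the source lines.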

Another place to tune is the way the Values are handled.  Right now, it’s not removing duplicates from the list.  This results in an implicit probability calculation in the output.  You’re more likely to pick the words that followed a phrase more often, simply because they appear in the list more times.  It would be possible to give everything equal weight simply by removing duplicates from the list.  I’ll have to experiment with what that means for the results.
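The implicit weighting is easy to see with a small calculation.  The function name and the sample list of followers below are invented for illustration; the point is only that a uniform random pick from a list with duplicates is the same as a weighted pick from the list without them.

```python
from collections import Counter

def follower_probabilities(followers, dedupe=False):
    """Chance of picking each word when choosing uniformly from the Values list."""
    if dedupe:
        followers = list(dict.fromkeys(followers))  # drop repeats, keep first-seen order
    counts = Counter(followers)
    total = len(followers)
    return {word: n / total for word, n in counts.items()}

# Hypothetical Values list for one Term.
followers = ["is", "is", "is", "Overview"]
print(follower_probabilities(followers))
print(follower_probabilities(followers, dedupe=True))
```

With duplicates kept, “is” is picked three times out of four.  With duplicates removed, both words are equally likely.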

Punctuation and capitalization can also play a part in the output.  Right now, I’m not treating punctuation in any special way.  Instead, it’s becoming part of the terms.  It can be trimmed out or be given special treatment.  Again, experimentation will find the answers.
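As a starting point for that experimentation, the splitting step could take a couple of switches.  This is only a sketch under assumptions: the tokenize function and its strip_punctuation and lowercase options are hypothetical, and stripping is limited to ASCII punctuation at the edges of each word.

```python
import string

def tokenize(text, strip_punctuation=False, lowercase=False):
    """Split text on whitespace, optionally normalizing each word."""
    words = text.split()
    if strip_punctuation:
        # Trim punctuation from the ends only, so "deep-seated" stays whole.
        words = [w.strip(string.punctuation) for w in words]
        words = [w for w in words if w]  # drop tokens that were pure punctuation
    if lowercase:
        words = [w.lower() for w in words]
    return words

text = "Obama is a fine, upstanding man."
print(tokenize(text))
print(tokenize(text, strip_punctuation=True, lowercase=True))
```

Normalizing makes “fine,” and “fine” the same word, which creates more places for the output to pivot, at the cost of losing the sentence endings.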
