bLaTheR: Part 1

11.08.2011 : By George Flanagin

Breathes there a web designer with wallet so fed who never to himself hath said, “Cutting and pasting this Lorem Ipsum stuff is utter tedium. It doesn’t look like English, and the first word isn’t even Latin.” Technologies have come a long way since cut and paste, and the time has come to use technologies for placeholder text in web design.

This is the story of how bLaTheR, a Lorem Ipsum replacement, was born, and the computer science behind it. In part 2, we will cover the use of bLaTheR for populating web pages.

The web had barely been invented in 1992 when I first developed negative feelings about Lorem Ipsum. I was working at HP Labs on software for printers, scanners, and plotters, and we were constantly dealing with localization issues because we sold the same hardware around the world. Languages with long words were one bane, notably Germanandfiinniissh. I don’t recall our ever localizing to Latin, although we may have tried Pig Latin. Lorem Ipsum was a trendy distraction.

Skip ahead eleven years to my writing a text scrambler as a demonstration for an undergraduate computer science class I taught at VCU. As is the case with all pedagogical tools, it was as simple as it could be to make my point, and because, according to my wife, I never throw away anything, it went on the electronic shelf for the next eight years.

Just a month or so ago, Scott and I chatted about using its output as a Lorem Ipsum replacement, and the adhoc-itecture of design set in. Scott (a.k.a. jScottQuery) is a fan of JavaScript, and he came up with the idea of obtaining the text we now call bLaTheR via a jQuery plugin, and thereby populating web pages with it. Over the next weekend, I went to work.

At this point it is probably a good idea for the reader to play around with the bLaTheR demo (http://blather.georgeflanagin.com/). (Note: The demo is not a sleight of hand; it is truly running the authentic bLaTheR code. You would not want inauthentic bLaTheR code, would you?)

How does bLaTheR do its work?  The code for bLaTheR runs inside a daemon framework I wrote last year for our company’s product. The daemon framework handles the socket I/O, parses the incoming request, recognizes the request for bLaTheR, runs the bLaTheR sub-program for a few milliseconds, packs up the resulting garbled text in JSON, and returns it to the caller. ZZzzzz……

Of course, perhaps you were asking something else? The more interesting question is “How does bLaTheR create the pseudo-text?” Ah, this is interesting, and it is a good lesson in the occasional value of computer science in software development.

SQLite and Oracle are the two databases I know better than I know MySQL, and I prefer SQLite for most of the work that I do. I built an SQLite database with text “samples,” very much using the database the way the people who design MIDI rigs make use of a database of sampled sounds.

Remember the long words of Germanandfinniish? We would like to generate scrambled text with the same statistical properties of word length, sentence length, and overall look and feel as the “real thing.” If the source has a lot of long words, the output should have them too. If the source is filled with legal jargon and TLAs, the output should be similarly populated.

The necessary data structure for this verbal scrambling is a trie, the family of prefix trees that includes the radix tree and the PATRICIA tree. The text shreds go through a “compilation” phase that converts them from their original sequence of bytes into the trie. The trie is stored in the database, and can be re-assembled on demand without further parsing or compilation.
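For readers who like to see such things, here is a minimal sketch of what a compile-and-store step might look like. It is written in Python rather than the C++ that bLaTheR actually uses, and everything in it is my illustration, not the real code: the function names, the `samples` table, the choice of JSON as the stored form, and the use of characters where bLaTheR works on bytes are all assumptions.

```python
import json
import sqlite3


def compile_trie(text: str, n: int) -> dict:
    """Build a nested-dict trie from every n-character window of text.

    The first n-1 characters of a window select a path through the trie.
    The node at the end of that path maps each following character to the
    number of times it was seen after that path.
    """
    trie: dict = {}
    for i in range(len(text) - n + 1):
        window = text[i:i + n]
        node = trie
        for ch in window[:-1]:
            node = node.setdefault(ch, {})
        node[window[-1]] = node.get(window[-1], 0) + 1
    return trie


def store_trie(db: sqlite3.Connection, name: str, n: int, trie: dict) -> None:
    """Save a compiled trie as JSON text (hypothetical schema)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS samples "
        "(name TEXT PRIMARY KEY, n INTEGER, trie TEXT)"
    )
    db.execute(
        "INSERT OR REPLACE INTO samples VALUES (?, ?, ?)",
        (name, n, json.dumps(trie)),
    )
    db.commit()


def load_trie(db: sqlite3.Connection, name: str) -> tuple:
    """Re-assemble a stored trie without re-reading the source text."""
    row = db.execute(
        "SELECT n, trie FROM samples WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    return row[0], json.loads(row[1])


db = sqlite3.connect(":memory:")
trie = compile_trie("abracadabra", 3)
store_trie(db, "demo", 3, trie)
n, reloaded = load_trie(db, "demo")
```

In "abracadabra" the pair "ab" is followed by "r" both times it appears, so `trie["a"]["b"]` comes out as `{"r": 2}`, and the copy reloaded from SQLite is identical to the one that was compiled.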

During the compilation phase, bLaTheR cuts the text into sequences that are n bytes long, where n can take a value between 1 and the entire length of the text to be analyzed. In the case of n=1, the output is just a random stew of bytes with the same distribution as the original. In other words, if the input is German, and the letter “e” makes up 17% of German, then the output will be 17% e, with other letters appearing as often as they do in German. At the other extreme, where n is the length of the whole text, the only possible output is the same as the input and you are back to cut and paste. Neither of these extremes is interesting or useful.
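The n=1 case is small enough to sketch in a few lines. Again this is Python for illustration only, with a made-up function name and characters standing in for bytes; it is not bLaTheR's code.

```python
import random
from collections import Counter


def scramble_order_one(text: str, length: int, rng: random.Random) -> str:
    """Draw characters independently, weighted by their frequency in text."""
    counts = Counter(text)
    symbols = list(counts)
    weights = [counts[s] for s in symbols]
    return "".join(rng.choices(symbols, weights=weights, k=length))


sample = "eine kleine nachtmusik"
stew = scramble_order_one(sample, 40, random.Random(2011))
```

Over a long enough run, each letter turns up about as often as it does in the sample, but no letter knows anything about its neighbors, which is why the result is a stew and not text.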

Through experimentation, I have found that the most satisfactory output is to be had by setting n to a value between 5 and 8, with 6 and 7 working well for most source material. Higher values produce overly long runs of quoted material. Values less than 5 produce output that lacks the elusive characteristic of readability.

Using “6” as the example, bLaTheR looks at all the six-byte sequences, and builds a trie from the first five bytes of each six-byte sequence. The randomness of the text is synthesized by looking at the trie, and choosing the next character at random so that its probability of being in the output is the same as the frequency with which it occurs following those five characters in the source.
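The selection step can be sketched as follows. To keep the example short, a flat Python dictionary keyed by the (n-1)-character prefix stands in for the trie, so this is a swapped-in simplification and not how bLaTheR stores things. The function names are hypothetical, and the restart-on-dead-end rule is my assumption; the real code may handle that case differently.

```python
import random
from collections import Counter, defaultdict


def build_model(text: str, n: int) -> dict:
    """Map each (n-1)-character prefix to a count of the characters after it."""
    model = defaultdict(Counter)
    for i in range(len(text) - n + 1):
        model[text[i:i + n - 1]][text[i + n - 1]] += 1
    return model


def blather(text: str, n: int, length: int, rng: random.Random) -> str:
    """Generate `length` characters that imitate `text` at window size n."""
    if n < 2 or len(text) < n:
        raise ValueError("need n >= 2 and a source at least n characters long")
    model = build_model(text, n)
    prefixes = list(model)
    out = rng.choice(prefixes)  # start from a prefix that occurs in the source
    while len(out) < length:
        followers = model.get(out[-(n - 1):])
        if not followers:
            # Assumed behavior: this prefix appeared only at the very end of
            # the source, so jump to a fresh prefix and carry on.
            out += rng.choice(prefixes)
            continue
        symbols = list(followers)
        weights = [followers[s] for s in symbols]
        # Each follower is chosen with probability proportional to how often
        # it came after this prefix in the source.
        out += rng.choices(symbols, weights=weights, k=1)[0]
    return out[:length]


source = "the quick brown fox jumps over the lazy dog and the quiet brown cow"
text_out = blather(source, 6, 60, random.Random(8))
```

When a prefix has only one possible follower, as every prefix does in a strictly repeating source such as "abcabcabc", the output simply replays the source. The variety comes from prefixes that the source continues in more than one way.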

In part 2, we will put bLaTheR to work on web pages.

Credits

Like most structures in computer science, the trie has been around for a long time, if anything in computer science may be said to have been around for a long time. There is a substantial discussion of these structures in both Volume 1 and Volume 3 of Dr. Donald Knuth’s (http://www-cs-faculty.stanford.edu/~uno/) book The Art of Computer Programming (http://www.amazon.com/Art-Computer-Programming-Volumes-Boxed/dp/0201485419). No computer programmer should be without Knuth’s work, or at least knowledge of its contents.

Dr. Kasper Peeters (http://maths.dur.ac.uk/users/kasper.peeters/) has provided the C++ programming world with a truly excellent general tree class (http://tree.phi-sci.com/) from which it is fairly easy to craft or graft the type of tree you need. Peeters’ tree is an achievement in programming, and its existence has saved me many hours of work. In fact, I cannot recall the last C++ program I wrote that does not use it somewhere.
