Boggle(tm) and language analysis

The game Boggle(tm) is a very simple word game for 2+ players made by Hasbro(tm). The game is made up of sixteen six-sided dice, a board, and a timer. Each die has a letter on each side. A round consists of randomizing the board by shaking up the dice, after which players try to make words within the allotted time. At the end of the round, players tally their points by showing what words they found on the board. Longer words are worth more points, and the player with the most points wins the round.

Because of its simplicity, Boggle(tm) is often used as an intermediate programming project in which students create the game, a board solver, or an AI player. But more advanced topics can be gleaned from the game with simple tweaks to the gameplay or setup: generating dice from a given lexicon, changing the size of the board, generating dice that make the game simpler or harder, and so on.

The dice from the original game have a very important property that I find interesting: although there are two ‘D’s among the dice, both are on the same die. Through that simple setup, words containing more than one D are impossible to form: add, dedication, dude, etc. are legal English words, but cannot appear in any Boggle(tm) board setup. Given a lexicon, a programmer could generate dice with a letter distribution that increases or decreases the number of points available per round, targeting younger or older audiences. And given an algorithm that generates dice from a lexicon, a programmer could create dice for any language.

Before we go further into dice generation, I would like to talk about the lexicon, its generation, and its layout. There are three different ways you can think about a lexicon programmatically: as a list, a hash table, or a tree.

The first is to think of it as one long list of words, or an array of strings, with each word taking a single index in the array. This is the most efficient for lexicon creation speed: you don’t need to parse the string, just stick it at the end of the array and you’re done, an insertion time of O(1), assuming the array is large enough that you don’t have to copy and move everything to make room for the new index. But when you need to retrieve a word, you have a problem, namely a huge, possibly unsorted list. Assuming an unsorted array, you have a worst-case O(n) search time for each word, since you have to look for it one index at a time. Let’s assume the words are indeed unsorted in the source file. The best way to get a fast lookup is to (a) insert each word at its sorted location or (b) sort the words after inserting them all.

Solution A does the most work overall, since part of the array must be shifted on every insert, but it keeps the array sorted at every step. The following is the pseudo code of said endeavor:

For each word: (1) Parse word, (2) find index in array, (3) shift array past index up one, (4) insert into array

  1. Assume a parse time of K
  2. Binary search: log2(n)
  3. Shift data from index I to n: (n − I) copies
  4. Assume a copy time of K2

This makes the cost of one insert roughly log2(n) + (n − I) + K2 + K; summing that over all n words (I running from 1 to n) gives on the order of n·log2(n) + n²/2 plus the constant terms, which is O(n²), dominated by the shifting. On the other hand, shoving all the words into an array and then using quicksort, solution B, has an overall cost of O(n·log(n)), so solution B is actually the asymptotically faster way to build the sorted array. Either way, once the array is sorted, binary search yields a lookup of O(log(n)), which is hard to beat.
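
For concreteness, here is a minimal Python sketch of both solutions, assuming the words arrive as plain strings; the helper names are mine, not code from the library covered in the next article:

    import bisect

    def build_sorted_a(words):
        # Solution A: keep the array sorted by inserting each word in place.
        # bisect.insort does the binary search (step 2) and the shift (step 3).
        lexicon = []
        for word in words:
            bisect.insort(lexicon, word)
        return lexicon

    def build_sorted_b(words):
        # Solution B: append everything, then sort once, O(n log n) overall.
        return sorted(words)

    def contains(lexicon, word):
        # Binary search of the sorted array: O(log n) per lookup.
        i = bisect.bisect_left(lexicon, word)
        return i < len(lexicon) and lexicon[i] == word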

The second method is to create a hash table, which has a lookup time of O(1). The issue is the time required for insertion into the table, most of which goes to hashing the string itself. A generic hash function may be more complex than necessary because it is made for many purposes. There are many different functions for hashing strings (Google has 1.2 million results for ‘string hash function’), and reinventing the wheel is 99.9% of the time not advisable.
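
As an aside, one of those well-known string hash functions is djb2; here is a minimal Python sketch of it, purely for illustration, since in practice the built-in hash table (a set or dict) is exactly the wheel you should not reinvent:

    def djb2(s):
        # Classic djb2 string hash: h = h * 33 + c, starting from 5381.
        h = 5381
        for ch in s:
            h = (h * 33 + ord(ch)) & 0xFFFFFFFF  # keep the value in 32 bits
        return h

    # In practice, a built-in hash set already gives average O(1) membership tests.
    lexicon = {"add", "dude", "dedication"}
    print("dude" in lexicon)  # True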

The last method, the tree method, is by far the best, as I will explain later. If you think of the lexicon not as a collection of words but as a tree of letter nodes, the overall storage size, lookup time, and insertion time are all simplified and rather quick. Insertion and lookup each take time proportional to the length of the word, independent of how many words are in the lexicon, simply by choosing the right data structure for the job at hand. Each node has three properties: is_end, is_part, and children. If the letter ends a word, is_end is true. If the letter is part of some longer word, is_part is true and children holds that node’s child nodes. A node doesn’t even need to know which character it represents, because its parent stores it in the slot among its children that corresponds to that letter. This makes storage of the tree extremely small as well.
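
Here is a minimal Python sketch of that tree (usually called a trie), assuming lowercase words; I fold is_part into the children map, since a non-empty children dict plays the same role, and the names are illustrative:

    class TrieNode:
        def __init__(self):
            self.is_end = False   # True if some word ends at this node
            self.children = {}    # letter -> child node; the child never stores its own letter

    def insert(root, word):
        node = root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_end = True

    def contains(root, word):
        node = root
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_end

    # Both operations walk one node per letter, independent of how many words are stored.
    root = TrieNode()
    insert(root, "boggle")
    print(contains(root, "boggle"), contains(root, "bog"))  # True False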

There exist several means of generating the letter distribution. The first and most straightforward is to simply tally each letter within the lexicon and then distribute letters across the dice according to their frequency. The average case of this method often does not yield consistent scoring potential from board to board and can sometimes yield nearly unplayable boards. It can be improved by distributing the letters in a more intelligent manner: for instance, finding out which letters are most likely to show up next to each other and grouping the less likely combinations on the same die, which improves the overall ease of word generation. The same process can be done for three- or four-letter probabilities, but as the process becomes more complex, the computation time increases. My initial thought was that finding three-letter distribution patterns would be sufficiently fast and still sufficiently optimized for playability.
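
A rough Python sketch of the single-letter version, assuming a sixteen-die, six-faced set and a plain word list; the weighted-sampling step is just one reasonable way to turn the tallied frequencies into faces:

    import random
    from collections import Counter

    def letter_frequencies(words):
        # Tally every letter in the lexicon and normalize to probabilities.
        counts = Counter(ch for word in words for ch in word)
        total = sum(counts.values())
        return {ch: n / total for ch, n in counts.items()}

    def generate_dice(words, num_dice=16, faces=6, seed=None):
        # Draw each face at random, weighted by single-letter frequency.
        rng = random.Random(seed)
        freqs = letter_frequencies(words)
        letters = list(freqs)
        weights = [freqs[ch] for ch in letters]
        return [rng.choices(letters, weights=weights, k=faces)
                for _ in range(num_dice)]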

Why would we want to generate dice? There is a simple modification to the game that makes it clear why we would fuss over this problem: different board sizes. The dice from the original game are made for a sixteen-die board, but what if we wanted to change the size, layout, or shape of the board? Just copying the existing dice unbalances the game, point-wise, as the letter distribution changes. The best solution is to generate dice for different boards. To keep the game interesting, however, we also need to change the base rules as the board shrinks and grows. Rules such as the three-letter minimum make the game too easy on large boards and too hard on smaller ones. Larger boards afford longer words, so perhaps the dice should be based only on words longer than some minimum length. Adding gaps or making the board an odd shape changes the number of neighboring letters, so maybe the adjacent-only rule should be changed. It all depends on the difficulty and ‘pointyness’ of the generated board.
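
To make the neighbor point concrete, here is a small sketch that counts the neighbors of every cell for an arbitrary board shape, with the shape given as a 2D boolean mask (True where a die sits, False for a gap); the mask format is my own assumption:

    def neighbor_counts(mask):
        # For each occupied cell, count the adjacent (including diagonal) occupied cells.
        rows, cols = len(mask), len(mask[0])
        counts = {}
        for r in range(rows):
            for c in range(cols):
                if not mask[r][c]:
                    continue
                n = 0
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (dr or dc) and 0 <= rr < rows and 0 <= cc < cols and mask[rr][cc]:
                            n += 1
                counts[(r, c)] = n
        return counts

    # A 4x4 board with one gap: the cells around the gap each lose a neighbor.
    board = [[True] * 4 for _ in range(4)]
    board[1][2] = False
    print(neighbor_counts(board))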

In my next article, I’ll cover the actual code for the library and dice generation.