Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation - Oxford Scholarship Jump to ContentJump to Main Navigation
Image to InterpretationAn Intelligent System to Aid Historians in Reading the Vindolanda Texts$

Melissa Terras

Print publication date: 2006

Print ISBN-13: 9780199204557

Published to Oxford Scholarship Online: September 2007

DOI: 10.1093/acprof:oso/9780199204557.001.0001

Show Summary Details
Page of

PRINTED FROM OXFORD SCHOLARSHIP ONLINE ( (c) Copyright Oxford University Press, 2015. All Rights Reserved. Under the terms of the licence agreement, an individual user may print out a PDF of a single chapter of a monograph in OSO for personal use (for details see Subscriber: null; date: 28 September 2016

(p.171) Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

(p.171) Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Image to Interpretation
Oxford University Press

(p.171) Appendix A

Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Co‐authored with Dr Paul Robertson

Appendix A Using a Stochastic MDL Architecture

Dinner…opposite me, a very beautiful lady…Next to me, (it slowly dawns on me) a slightly crazy guy…He asks what I do, and I say I'm a writer. He tells me about good writers and bad writers of his experience. He laughs at bad writers. People who can't do cursive or who put a little extra loop on their fs or gs make him laugh. He knows some good and bad writers on television. Sam Donaldson, on TV, has good handwriting. ‘How do you know?’ asks the lady, who has not quite realised that the slightly crazy guy is a slightly crazy guy. He explains that you can just tell, when you see them on the television. The worst writer he ever saw was a wrestler on TV. You just knew that he would go around and around when he did his Os. Round and around and around.

(Neil Gaiman, Online Journal 2003)

This appendix explores and describes in technical detail the implementation of the artificial intelligence system which was developed to propagate readings of the Vindolanda texts. An agent‐based system, GRAVA, was adopted as its basis, and here we discuss the background and architecture of the original system, exploring the concepts of minimum description length (MDL) and Monte Carlo methods, which make GRAVA so novel and attractive for our purpose. The objects that constitute the GRAVA system are detailed, and the Monte Carlo algorithm applied to agent selection is expounded.

The original system (discussed in Chapter 4) had to be developed to be able to cope with the more complex data presented in the Vindolanda tablets, and the development of more sophisticated agents is explained. Character models of stroke data from the annotated Vindolanda corpus (p.172) are calculated, to provide data the system can utilize to ‘reason’ about unknown input data, and the final system, which reads in annotated images of unknown text, and outputs realistic interpretations of those images, is presented. This appendix details the underlying computational structures that comprise the system: results from the system are available in Chapter 5.

A.1. The GRAVA System

In his thesis (2001) Dr Paul Robertson utilizes an agent‐based system to read a handwritten phrase, implementing a multi‐level hierarchical agent‐based model. This is akin to the Interactive Activation and Competition Model for Word Perception as proposed by Rumelhart and McClelland (McClelland and Rumelhart 1981: 379; see also Rumelhart and McClelland 1982; McClelland and Rumelhart 1986b; and S. 2.8.2) where it is proposed that a word is read by identifying features, characters, and, finally, words in a recursive process where ambiguities are resolved. However, Robertson's GRAVA (Grounded Reflective Adaptive Vision Architecture) system does not use Parallel Distributed Processing (PDP)1 architecture (unlike the original implementation of Rumelhart and McClelland's model). PDP remains difficult to implement because the approach, which aims to develop neural network architectures that implement useful processes such as associative memory, throws out traditional models of computing completely, requiring a new computational model based on neuronal systems (Rumelhart et al. 1986; McClelland and Rumelhart 1986a). An additional problem with PDP is that the behaviour of such a network is highly nonlinearly related to the parameter values; a small change in those values may lead to very different behaviour, and therefore results (although this may also be the case with MDL). In order to overcome this intrinsic difficulty it is generally (p.173) considered that PDP requires a large training set of data to ‘fine tune’ the parameters used within the system; there are not enough data available to us regarding the letter forms of Vindolanda to be able to do this. Instead Robertson implements an architecture where the atomic elements are implemented as agents2 (in Yolambda, a dialect of Lisp),3 using familiar programming practices, which retains a more conventional programming model than the PDP approach.

The GRAVA system,4 developed by Robertson, manipulates agents, and builds programs from them. The primary purpose of an agent in the GRAVA system ‘is to fit a model to its input and produce a description element that captures the model and any parameterisation of the model’ (Robertson 2001: 59). Agent cooperation can span semantic levels, allowing hierarchical stacking in the same way that is described in PDP. This enables the building of systems that exhibit semantic interaction, a well‐understood hierarchical concept that allows the behaviour and performance of systems to be closely monitored and understood (using techniques such as convergence analysis).5 Such an architecture is well suited to interpretation problems, which can be solved by fitting an input to a series of models and computing the likelihood of these matching. Many interpretation problems have more than one possible solution, and by using such a system many solutions can be propagated; the best solution being the ultimate target.

Of most relevance to this monograph, Robertson shows how his architecture is an effective way of rendering hierarchical systems by demonstrating how his software can ‘read’ a handwritten phrase. The text in this case was a nursery rhyme (see S. 4.2) which required only a small training set to be correctly interpreted by the system. Given that the GRAVA system can be shown to be a robust and easily adaptable architecture which can work successfully with relatively small corpora, and given the limitations of the dataset available to train the system, it was chosen as the basis on which to develop a system to read the Vindolanda ink, and eventually, stylus tablets.

A.2. GRAVA System Architecture

Robertson's original demonstration of the GRAVA system, which aimed to read a handwritten phrase, comprises of three levels of agents. The first‐level agents detect low‐level features of written text, such as end points and (p.174) junctions of characters. The second‐level agent builds character models from the database of character forms, and compares these models with the evidence presented by the first‐level feature detectors to output a possible character identification. The highest level agent finds evidence of words in the input by looking at strings of data generated from the character finding agent and comparing these to a given corpus of words. The corpus used is the complete nursery rhyme, from which statistical information is collected and from which models are constructed regarding characters and words.

There were four separate agents that combined to make the lowest level feature detection section of the model (Figure A.1). Each agent reported on features discovered within a character position:

  • Top stroke end points (T, above). This agent reported on the number of stroke end points at the top of the character. For example, the letter ‘N’ has one stroke endpoint on the top and ‘W’ has two.

  • Bottom stroke end points (B). This agent reports on the number of stroke end points at the bottom of the character. For example the letter ‘A’ has two end points at the bottom of the character and letter ‘I’ has one.

Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.1. Hand Written Phrase Recognizer Program. From Robertson (2001: 65). The ‘P’ agent detects that a character is present, i.e. it is not a blank space. The ‘T’ agent detects the presence of top end points, ‘B’ detects bottom end points, and ‘J’ detects junctions in the character. The combination of these features is then passed up to the character agent, who compares them against the combination of features of known characters in the database. A character is chosen and passed up to the word agent, which compares strings of these characters against known words in the database. A description of how well this string of characters matches the known words in the database is then outputted.


  • Stroke Junctions (J). This agent reports on the number of line junctions formed from three or more lines. For example the letter ‘A’ has two such junctions. The letter ‘N’ has none.

  • Character present (P). This agent detects whether the character position contains anything at all. Everything but the space character contains something (Robertson 2001: 64).

The character boxes are segmented into ‘top’ and ‘bottom’ so that the top stroke and bottom stroke features can be determined from a simple end point filter. Such a simple encoding system allows features to be flagged in order to pass the information up to the next character identification level: however, they are insufficient to unambiguously identify a character themselves. For example, the letters ‘S’, ‘C’, ‘I’, ‘L’, and ‘N’ all have one endpoint at the top, one at the bottom, and no junctions. In spite of the impoverished character representations used in this example, with the aid of semantics resulting from the appearance of characters in words the system is able to correctly identify the hand written text.6

The character level and word level both contain single agents to calculate the probability that the data in the corpus and the data presented match. GRAVA utilizes description length (DL) as a means of comparison (see Robertson 2001: 50, and S. 1.6.3 for the background of description length and minimum description length). Description length provides a fair basis for cost computation as it captures the notion of likelihood (or probability, ‘P’) directly: DL = −log2 (P). The minimum, or smallest, description length (MDL) generated when matching an input to a set of models yields the ‘best fit’ of that data to an individual model, and so presents the most likely match (as it is the most probable match of the unknown input to the model). The global description length generated by matching a series of inputs to models can be simply calculated by adding the DL of each unit, or (p.176) output from each agent in the system. This can give a measure of the overall probability of the sequence, allowing easy comparison with other possible interpretations (the interpretation with the lowest global minimum description length being the most likely). MDL is used in GRAVA as a unifying method of comparing data from different levels of abstraction: thus providing a solution to the problem of how to compare the probabilities of both image and linguistic data matching the input.

Rather than present Robertson's results in this simple domain (which is discussed in S. 4.2) we will develop here our extension that supports interpretation of Vindolanda texts. First, however, we will describe in some detail description length and Monte Carlo methods, before discussing relevant parts of the GRAVA architecture.

A.2.1. Description length

The goal of the GRAVA architecture is to find an interpretation of its input data. In our case the input is stroke information (either hand generated, or automatically generated, see SS. 3.6 and 4.3.1). The interpretation is a structural description that includes subdescriptions for the words and character components. An interpretation, then, is a description that consists of many parts, each of which has an associated probability.

Consider the case of an image containing two blobs. If the probability of the first blob being a boy is given by P(blob1 = boy) = Pboy and the probability of the second blob being a dog is given by P(blob2 = dog) = Pdog, the probability that the image is one in which the first blob is a boy and the second blob is a dog is given by Pboy*Pdog. An image containing n blobs such that for any i < n the interpretation of blobI is given by Ii(blobi) and the probability of such an interpretation being correct is given by P(Ii(blobi)) the goal is to maximize the probability (assuming conditional independence):

arg max l 1 , n i = 1 n P ( I i ( b l o b i ) )

Communicating the description through a noiseless channel using a theoretically optimal code would require a description length of: (−log2 (Pboy)) + (−log2 (Pdog)). For an image containing n blobs as defined above the goal would be to choose interpretations I1 … In so as to minimize the description length:


arg min l 1 , n i = 1 n log 2 ( P ( I i ( b l o b i ) )

Finding the most probable interpretation is identical to finding the minimum description length (MDL).

This approach raises two significant issues.

  1. 1. How to estimate P.

  2. 2. How to find the global MDL.

Neither of these issues is straightforward. For the Vindolanda tablets we depend upon estimating P from frequencies obtained from a small corpus. For a general mechanism for finding the global MDL we employ a Monte Carlo sampling scheme discussed below.

A.2.2. Monte Carlo methods

The MDL agent architecture described in this appendix addresses the need to integrate knowledge at different semantic levels. To understand an image that consists of high‐level components such as words, intermediate level features such as characters, and of low‐level features such as strokes and end points, we need to integrate different levels of processing.

Single thread of control solutions, such as the blackboard7 and forward chaining8 approaches, depend upon taking a path towards a solution and backtracking past failures until a solution is found. The same is true of subsumption.9 These are essentially depth first searches for a solution—not (p.178) searches for the best solution. In a robot navigation problem, we may be happy if the robot negotiates the obstacles in the environment and finally ends up at the destination. In interpretation problems, just finding a solution is not good enough. A Latin sentence can have many plausible parses. Most of them do not make sense.

Markov Chain Monte Carlo Methods (MCMC)10 have become popular recently in computer vision (Geman and Geman 1984; Hammersley and Handscomb 1964; Lau and Ho 1997) but have been limited to modelling low‐level phenomenon such as textures (Karssemeijer 1992). In natural language understanding, use of Hidden Markov Models (HMM)11 has been successful as an optimization technique with certain restricted kinds of grammar. Problems that can be described as Hidden Markov Models (Baum 1972) can yield efficient algorithms. For example, in natural language understanding, some grammars permit efficient algorithms for finding the most probable parse. Stochastic Context Free Grammars (SCFGs)12 can be parsed so as to find the most probable parse in cubic time using the Viterbi Algorithm13 (Viterbi 1967). Only the simplest grammars and problems can be solved efficiently in this way, however, and for the more interesting grammars and for more complex problems in general, other techniques (p.179) must be used. Certainly something as loosely defined as an agent system incorporating semantics from multiple levels would rarely fit into the HMM straitjacket.

Even for natural language processing, finding the best solution can be prohibitively expensive when the Viterbi Algorithm cannot be employed. In visual processing, with images whose complexity greatly exceeds that of sentences, and which are three‐dimensional (as opposed to the linear arrangement of words in a sentence), finding the best solution is infeasible. Approximate methods are therefore necessary: Monte Carlo14 techniques are very attractive in these situations.

In an ambiguous situation, such as parsing a sentence, in which many (perhaps thousands) of legal parses exist; the problem is to find the parse that is the most probable. If the problem can be defined in such a way that parses are produced at random, and the probability of producing a given parse is proportional to the probability that the parse would result from a correct interpretation, the problem of finding the most probable parse can be solved by sampling. If P is a random variable for a parse, the probability distribution function (PDF) for P can be estimated by sampling many parses drawn at random. If sufficiently many samples are taken, the most probable parse emerges as the parse that occurs the most frequently. Monte Carlo techniques use sampling to estimate PDFs.

Monte Carlo methods are attractive for a number of reasons:

  1. 1. They provide an approximate solution to the search of combinatorially prohibitive search spaces.

  2. 2. By adjusting the number of samples, the solution can be computed to an arbitrary accuracy.

  3. 3. Whereas the best solution cannot be guaranteed by sampling methods, measuring standard error during the sampling phase allows the number of samples to be adjusted to yield a desired level of accuracy.

GRAVA employs a Monte Carlo method: a means of providing approximate solutions by performing statistical sampling, to randomly choose which data is passed upwards between levels. This is a ‘weighted random’ selection process which picks likely data much more frequently than less likely examples (see Robertson 2001: 48). If only the data with the lowest description length were passed up between levels, the correct answer may never be found: the data with the locally minimum description length may not be the correct selection on the global level. The stochastic nature of this method of sampling ensures the generation of different results, and also means that the (p.180) system rapidly generates possible solutions without relying on exhaustive search (cutting search time). The system generates possible solutions on each iteration; the more iterations, the better the chance that a match to the solution is generated. Convergence on an ideal solution is then asymptotic:15 the system finds approximate solutions, and the more iterations that occur, the better the approximation that is reached. In practice, the system tends to find the exact solution in a short number of iterations, meaning that performance times are acceptable (this is demonstrated in Chapter 5).

A.3. Objects in the GRAVA architecture

The GRAVA architecture is built from a small number of objects: reflective layers; descriptions; description elements; interpreters; models; and agents. All of these terms are commonly used in cognitive science and artificial intelligence literature to mean a wide range of things. In the GRAVA architecture they have very specific meanings. Below, we describe what these objects are, the purpose, protocol, and implementation of the GRAVA objects, and how they cooperate to solve an interpretation problem, before demonstrating how the Monte Carlo algorithm chooses one element from a candidate list.

A.3.1 Reflective layers

A reflective layer takes an input description Δin and produces an output description Δout as its result. The output description is an interpretation (I ∈ Q (Δin)) of the input where Q(x) is the set of all possible interpretations of x.

Δ o u t = I ( Δ i n )

The goal of a layer is to find the best interpretation Ibest that is defined as the interpretation that minimizes the global description length.

arg min I b e s t D L ( I b e s t ( Δ i n ) )

The interpretation function of the layer consists of an interpretation driver and a collection of connected agents. The interpretation driver deals with the formatting peculiarities of the input description (the input description may be an array of pixels or a symbolic description). The program is made from a collection of agents wired together. The program defines how the input will be interpreted. The job of the interpreter/program is to find the most (p.181)

Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.2. Objects in the GRAVA architecture, indicating the relationship to description (input), interpreter, circuit, and description (output) within the reflective layer.

probable interpretation of the input description and to produce an output description that represents that interpretation.

A.3.2. Descriptions

A description Δ consists of a set of description elements ε

Δ = &lt; ɛ 1 , ɛ 2 , ... , ɛ n >

Agents produce descriptions that consist of a number of descriptive elements. The descriptive elements provide access to the model, parameters, and the description length of the descriptive element. For example, a description element for a face might include a deformable face model and a list of parameters that deform the model face so that it fits the face in the image. A description element is a <model,parameters> pair.

(p.182) The description length is computed before an element is attached to a description because the agent must compete on the basis of description length to have the descriptive element included. It makes sense therefore to cache the description length in the descriptive element.

The description class implements the iterator:

(for Description|des fcn)

This applies the function ‘fcn’ to every element of the structural description, and this enables the architecture to compute the global description length:

D L ( Δ o u t ) = i = 1 n D L ( ɛ i )

The global description length of a particular set of descriptors, represented by the set ‘des’, is the sum of the description lengths computed for each descriptor ‘de’ in the set ‘des’. To get a better fix on notation, this is implemented in Lisp as:

(define (globalDescriptionLength Description|des)

(let ( (dl 0))

(loop for de in des

(set! dl (+ dl (descriptionLength de))))))

A.3.3. Description elements

Description elements are produced by agents that fit models to the input.

Description elements may be implemented in any way that is convenient or natural for the problem domain. However the following protocol must be implemented for the elements of the description:

(agent <Element>) Returns the agent that fitted the model to the input;

(model <Element>) Returns the model object that the element represents;

(parameters <Element>) Returns the parameter block that parameterizes the model; and

(descriptionLength <Element>) Returns the description length in bits of the description element.

Implementations of description elements must inherit the class DescriptionElement and implement the methods ‘agent’, ‘model’, ‘parameters’, and ‘descriptionLength’.

For readability we print description elements as a list:

(<model name> . <parameter list>)

(p.183) A.3.4. Interpreters

An interpreter is a program that applies agents in order to produce a structural description output from a structural description input.

A.3.5. Models

Fitting a model to the input can involve a direct match but usually involves a set of parameters.

Consider as input, the string:

“three blind mice”

We can interpret the string as words. In order to do so, the interpreter must apply word models to the input in order to produce the description. If we have word models for ‘three’, ‘blind’, and ‘mice’ the interpreter can use those models to produce the output description:

((three) (blind) (mice))

The models are parameterless in this example. Alternatively we could have had a model called ‘word’ that is parameterized by the word in question:

((word three) (word blind) (word mice))

In the first case there is one model for each word. In the case of ‘three’ there is an agent that contains code that looks for ‘t’, ‘h’, ‘r’, ‘e’, and ‘e’ and returns the description element ‘(three)’. In the second case there is one model for words that is parameterized by the actual word. The agent may have a database of words and try to match the input to words in its database.

Consider the two examples above. If the probability of finding a word is 0.9 and the probability of the word being ‘three’ is 0.001, the code length of ‘(word three)’ is given by:

DL(word three) = DL(word) + DL(three)

= −log2(p(word)) −log2(p(three))

= −log2(0.9)− log2(0.001)= 0.1520 + 9.9658 = 10.1178 bits

The second approach, in which a separate agent identifies individual words would produce a description like ‘(three)’. The model is ‘three’ and there are no parameters. The likelihood of ‘three’ occurring is 0.001 so the description length is given by:

DL(three) = −log2(p(three))= −log2(0.9*0.001) = 10.1178 bits

(p.184) That is, the choice of parameterized versus unparameterized does not affect the description length. Description lengths are governed by the probabilities of the problem domain. This allows description lengths produced by different agents to be compared as long as they make good estimates of description length.

A.3.6. Agents

The primary purpose of an agent is to fit a model to its input and produce a description element that captures the model and any parameterization of the model.

An agent is a computational unit that has the following properties:

  1. 1. It contains code, which is the implementation of an algorithm that fits its model to the input in order to produce its output description.

  2. 2. It contains one or more models (explicitly or implicitly) that it attempts to fit to the input.

  3. 3. It contains support for a variety of services required of agents such as the ability to estimate description lengths for the descriptions that it produces.

An agent is implemented in GRAVA as the class ‘Agent’. New agents are defined by subclassing ‘Agent’. Runtime agents are instances of the appropriate agent class. Generally agents are instantiated with one or more models. In our system all models are learnt from the corpus.

The protocol for agents includes the method ‘fit’ that invokes the agent's model fitting algorithm to attempt to fit one or more of its models to the current data.

(fit anAgent data)

The ‘fit’ method returns a (possibly null) list of description elements that the agent has managed to fit to the data. The interpreter may apply many agents to the same data. The list of possible model fits from all applicable agents is concatenated to produce the candidate list from which a Monte Carlo selection is performed.

A.3.6.1 Monte Carlo agent selection

A recurring issue in multi‐agent systems is the basis for cooperation among the agents. Some systems assume benevolent agents where an agent will always help if it can. Some systems implement selfish agents that only help if there is something in it for them. In some cases the selfish cooperation is quantified with a pseudo market system.

(p.185) Our approach to agent cooperation involves having agents compete to assert their interpretation. If one agent produces a description that allows another agent to further reduce the description length so that the global description length is minimized, the agents appear to have cooperated. Locally, the agents compete to reduce the description length of the image description. The algorithm used to resolve agent conflicts guarantees convergence towards a global MDL thus ensuring that agent cooperation ‘emerges’ from agent competition. The MDL approach guarantees convergence towards the most probable interpretation.

When all applicable agents have been applied to the input data the resulting lists of candidate description elements is concatenated to produce the candidate list.

The monteCarloSelect method chooses one description element at random from the candidate list. The random selection is weighted to the probability of the description element.

P e l e m = 2 D L ( e l e m )

So, for example, if among the candidates, one has a description length of one bit and one has a description length of two bits, the probabilities of those description lengths is 0.5 and 0.25 respectively. The monteCarloSelect method would select the one bit description twice as often as the two bit description.

The monteCarloSelect algorithm is given below:

(define (probability DescriptionElement|de)

(expt 2.0 (− (descriptionLength de))))

(define (monteCarloSelect choices)


(lambda (return)

(let* ((sum (apply + (map probability choices)))

(rsel (frandom sum)))

(dolist (choice choices)

(set! rsel (− rsel (probability choice)))

(if (<= rsel 0.0) (return choice)))))))

A.4. Original Performance and the Need for Development

In the original implementation of his system, Robertson demonstrates how the low‐level agents start out with a description of features based on the tops, bottoms, and junctions of characters. The system compares these with the models computed earlier from the corpus (on a character level), (p.186) computing for each symbol a description length. The character that is passed upwards to the next level is determined at random by the Monte Carlo Select algorithm. The system then compares the resulting ‘words’ with those in the corpus, generating description lengths for each. A global description length for the phrase is calculated by summing the description lengths of each of the symbols and words. In subsequent iterations different possibilities are generated, and the system searches for the best fit by looking for the configuration that gives the lowest global description length. This global minimum description length corresponds with a correct ‘reading’ of the text. A diagram demonstrating the system reading a handwritten text, MARY HAS A LITTLE LAMB, is shown in Figure 4.1, in Chapter 4.

In this way, Robertson shows that MDL formulation leads to the most probable interpretation, and also that global MDL mobilizes knowledge to address ambiguities. This demonstrates that, whilst the input data is inadequate for correctly identifying the characters unambiguously, the use of global MDL as a constraint allows correct identifications to be made. MDL also provides a ‘reasonable’ description in a relatively rapid time: exhaustive searches can fail to complete after several hours. The GRAVA system is therefore a robust, easily implemented, and adaptable architecture that produces convincing results in a short time frame, providing the ideal base for the development of a system to aid in the reading of the Vindolanda stylus tablets.

However, the original system required development because of the complexity of the data presented in the Vindolanda tablets, and the fact that eventually this system will dovetail with the image processing research done on the stylus tablets, reading in automatically annotated images. End point and junction data proved to be too unstable, given the noisy nature of the stylus tablets: small changes in image quality can create large differences in the number and location of the end points (this is demonstrated in S. 4.3.1). To counteract this, a more sophisticated character agent was developed which compares the stroke data of unknown characters to character models of known letters generated from the annotated corpus of images.

A.5. System Development and Architecture

The character database formed by annotating images of the ink and stylus tablets (as discussed in S. 3.6 and Terras and Robertson (2004)), is the main source of information regarding the character forms on the Vindolanda tablets. Models derived from this data can therefore be used to compare the unknown characters in the test set (which may be annotated by hand, or generated automatically, as described in S. 4.3.1). The difference between the (p.187) original system (SS. 4.2 and A.4) and the new system lies in the way that character models are developed, and how the test data is compared to this set of character models. Whereas the original system relied on end points, this final version relies on data regarding the strokes themselves. The end point agents in the feature level of the original system were replaced by a stroke detection agent. This results in models of characters that are less sensitive to the feature detection process (i.e. the generation of end points, which is problematic when dealing with noisy images such as those of the stylus tablets). It also means that the feature level agents depend on information that is much more easily propagated from automatic feature detection, allowing for easier amalgamation with the stroke detection system. Most importantly, stroke information is much more stable than end point data. Small changes in image quality cause only small changes in the stroke features detected. A schematic of the final system that was developed is shown in Figure A.3, incorporating all elements of the resulting process.

Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.3. Schematic of final system. Robertson's GRAVA architecture is highlighted to indicate the processes which are carried out as part of the final run of the system. The ‘Annotated image’ can be generated manually by an ‘Annotation tool’, or automatically by ‘Image feature extraction’.

(p.188) Before being used by the character agent, both the test set of data and the annotated corpus must be prepared in order to analyse their stroke data. This is done by extracting the strokes, drawing a bounding box around each of the characters to preserve groupings of strokes, and transforming these strokes onto a canonical sized grid to allow easy comparison.

Character models are built from the annotated set of images from the corpus by applying a Gaussian blur operator, and summing every instance of each individual character to build up a character model. The character agent then compares the unknown characters from the test data with those in the character models, also utilizing frequency information about each character to generate a description length for each model. One of these likely characters is then selected using the Monte Carlo sampling method and passed up to the word agent. This ensures that, over successive iterations, a fair, representative amount of each candidate character is selected and passed onto the next level.

The word level takes in the data from the character agent, combines them to form words, and compares them with the words from the corpus—the word ‘models’ in this case being the words found in the corpus. A description length for this comparison is noted. A selection is made from the possible words utilizing Monte Carlo sampling methods, and the final word output is generated.

The system then adds the description lengths for all the words in the phrase (or string of words) together, giving a global description length for that combination of characters and words.

The system repeats this process as often as the user dictates, and keeps track of the lowest global description length generated by each successive run. The smallest description length produced corresponds with the most likely answer: or the best fit answer available.

The preparation of the character models, and the way that both the character and word agents work, is discussed in detail below.

A.6. The Construction of Character Models

A character model is defined as a probability field that indicates the likely placing of one or more strokes of a two‐dimensional character, producing a general representation of a character type. Unknown characters can then be compared to a series of these models, and the probability that they are an instance of each one calculated, the highest probability indicating a match. Whilst the first implementation of the system relied on an end point agent, this was replaced by a stroke detection agent that builds up character models based on the actual strokes of the character.

(p.189) On a conceptual level, the (stroke‐based) character model is constructed by taking an image of an individual character, finding its bounding box (identifying the rightmost x co‐ordinate, and the leftmost x co‐ordinate, and the highest and lowest y co‐ordinates), and transforming this into a standardized (21 by 21 pixel) grid. The stroke data is convolved with a Gaussian Blur operator to reduce over‐fitting. Each standardized representation is accumulated onto a generalized matrix for each character type: resulting in a generalized representation of each type of character. These are subsequently used as the models to which unknown characters are compared.

A.6.1. Finding the Bounding Box

The minimum and maximum x and y points of the character are noted, drawing a ‘bounding box’ around the letter. It does not matter how large or small this is, or how wide or narrow, as the representation is standardized by transforming this bounding box into a regimented size.

A.6.2. Calculating the Transform

The matrix presented by the bounding box is transformed into a standardized size to allow easy comparison of representations. This is achieved by simply translating the area within the bounding box to a canonical 21 by 21 pixel region at the origin.

Xmin, Ymin and Xmax, Ymax of the bounding box are noted. Mwidth and Mheight are the model width and height respectively that the box is to be scaled to (in this case 21 by 21 pixels). A scaling matrix is then computed, giving the ratio to which the height and width will be scaled.

S 1 = M w i d t h ( X max X min ) S 2 = M h e i g h t ( Y max Y min )

A translation vector is computed so that all characters are based at the origin. The scaling matrix is then applied to the translated stroke pixels, Xt and Yt being the new scaled co‐ordinates:

( X t Y t ) = ( S 1 0 S 2 S 2 ) ( ( X Y ) ( X min Y min ) )

The choice of a 21 by 21 pixel grid was arrived at through a process of experimentation. If the grid is too large, then the data is too sparse and it is difficult to make any generalizations about the letter forms. The bigger the grid, the more examples are needed to make the generalized model applicable. The smaller the grid, the closer together the data, giving a trade‐off (p.190) regarding size, accuracy, and usability. At first, an 11 by 11 array was used, but this was gradually increased, which improved the accuracy of this part of the system. The 21 by 21 sized grid seemed to give the best results for the training sets that were used.

A.6.3 Applying Gaussian Blur

The stroke data was convolved with a three by three Gaussian Blur16 operator. This creates intermediary points of data that reduces over‐fitting of the model, to increase generalization from the examples given. Again, the three by three pixel field was chosen by experimentation.

A.6.4 Calculating the Final Model

The first character is drawn onto a ‘blank’ grid. A second instance of the character (if there is one available) is drawn over this, the values of this being added to the first instance. Additional instances of the character are laid over the grid, and the values summed as they go. This results in a composite model of all available character instances from the corpus, showing the path the strokes are most likely to make.

An example of how these steps combine to generate a character model is given in Figure A.4, where a small corpus which contains three ‘S’ characters is used to generate a character model of an S.

A.6.5. Learning Models from the Corpus

The corpus of annotated images (see S. 3.5) was randomly divided into a test set and two training sets: one ink tablet training set and one stylus tablet training set. Keeping the test and training data separate allows fair results to be generated. Two sets (ink and stylus) of character models were generated from the training set; these are shown in Figure A.5.

The character models generated from the ink data show that some of the letter forms, such as A, C, I, M, N, and T are fairly standardized throughout the test data. Other characters are more problematic. H, L, and V have a more confused appearance, indicating greater variability in the appearances of instances of the characters. D and Q are problematic, as the long strokes can either slope diagonally left to right, or right to left. The letter forms generated from this data can be compared to the letter forms generated from the stylus tablet data (see Appendix B and S. 3.11).

Not all of the letter forms are present in this selection, as the data set of the stylus tablets was much smaller than that of the ink tablets. However, some (p.191)

Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.4. The generation of a character model from a small corpus. Three letter ‘S’s are identified, and bounding boxes drawn around them (A). The stroke data is then transformed into a 21 by 21 pixel grid (B). A Gaussian Blur is applied (C). The composite images generate a character model (D). The darker the area, the higher the probability of the stroke passing through that pixel. In this way, the probabilities of the stroke data occurring are preserved implicitly in the character models.

Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.5. Character models generated from the training set of the ink text corpus. The darker the area, the higher the certainty that a stroke will pass through that box. Shapes of the characters can clearly be seen, for example, the letter M.

interesting comparisons can be made between the two. The letter E shown is clearly a different form than that of the ink tablets (as discussed in S. 3.1.2). The letter I slopes in the reverse direction to that of the ink tablets (although (p.193)
Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.6. Character models generated from the training set of the stylus text corpus. The darker the area, the higher the certainty that a stroke will pass through that box. These are more ‘clumsy’ compared to those generated from the ink texts: this is partly due to the small corpus used to generate these models. There is no letter F, H, U, V, or X in this training set of examples. The only example of the stylus letter U appears in the test data, rather than the training set (see 5.4 for how this problem is overcome within the system).

the forms of the characters are distorted due to the application of the transform, it still gives an indication of differences in the way characters are written). M, N, R, S, and T are much less standardized than in the ink tablets, perhaps because of the difficulties of writing in the different medium, and the small sample set available. Nevertheless, both sets of models provide adequate representations for comparison with unknown characters.

A.7. The Character Agent: Comparing Unknown Characters to the Character Models

The character agent's function is to compare unknown characters to the character models composed from the training set. This is achieved by extracting (p.194) the strokes from the test data, transforming them to the standard size, then calculating the description length for matching the unknown character to each model in the data set. The character agent also utilises statistical information about the likelihood of a character being present, derived from the letter frequency analysis of the corpus. After the description length has been calculated (from a combination of the MDL frequency and MDL comparison of stroke evidence), one of the likely characters identified is selected, using the Monte Carlo sampling methods, and passed up to the word level.

A.7.1. Sizing and transform

Unknown characters are transformed to the standard size in the same way as described in sections A.6.1 and A.6.2: the stroke data is extracted, a bounding box is drawn around these strokes, and this matrix is transformed into a 21 by 21 pixel grid. No Gaussian Blur is applied to the stroke data in this case. This grid is a representation of the unknown character which can be compared with the standardised character model, calculated above.

A.7.2. Calculating the description length

Given the probability field representation of a character, we can calculate the probability that a line will cross through any given point. To generate possible identifications of the unknown character, we calculate the probability of how the evidence (stroke data) relates to each of the character models that have previously been constructed. Identification can never be certain, but it is possible to assign probability values, to calculate how the unknown data matches each character model.

We want to find the probability that the character is a, given the stroke evidence:

P(char = a/strokes)

The stroke data is assumed to be a legal character, so the sum of the probabilities for all the models should be 1:

a mod e l s P ( c h a r = a | s t r o k e s ) = 1

However, P(char = a/strokes) is not directly available to us. We have to calculate it utilizing Bayes' Theorem:

P ( A | B ) = P ( A ) P ( B | A ) P ( B )

(p.195) Therefore:

P ( c h a r = a | s t r o k e s ) = P ( c h a r = a ) P ( s t r o k e s | c h a r = a ) P ( s t r o k e s )

P(char = a), the probability of a character occurring, is already available to us from the lexicostatistics. P(strokes|char) can be calculated from the model.

The model was made up by laying n different representations of a character over a standard grid, and adding up the occurrences of when the strokes passed through each box in the grid. To find the probability of one box in the grid being used as part of the character, simply take the value of the box in the composite model, and divide it by the number of characters which make the composite model.

P ( B o x x y | c h a r = a ) = M o d e l a ( x , y ) n c h a r s

The stroke data can be viewed as the collection of boxes that the strokes pass through. Therefore, the total probability of the stroke data for each character is the product of the probability of the conjunctions of all the boxes passed through. If the stroke data goes through ‘n’ boxes, the probability of the stroke evidence given that the character is a is then:

P ( s t r o k e | c h a r = a ) = P ( B o x 1 B o x 2 ... B o x n | c h a r = a )

If we assume conditional independence,17

P ( s t r o k e | c h a r = a ) = P ( B o x 1 | c h a r = a ) P ( B o x 2 | c h a r = a ) ... P ( B o x n | c h a r = a )

Which can be expressed as

P ( s t r o k e | c h a r = a ) = i = 1 n P ( B o x i | c h a r = a )

(p.196) Giving

P ( c h a r = a | s t r o k e s ) = P ( c h a r = a ) i = 1 n P ( B o x i | c h a r = a ) P ( s t r o k e s )

P(strokes) is a value such that

a mod e l s P ( c h a r = a | s t r o k e s ) = 1


P ( s t r o k e s ) = a mod e l s { P ( c h a r = a ) i = 1 n P ( B o x i | c h a r = a ) }

This gives the final equation:

P ( c h a r = a | s t r o k e s ) = P ( c h a r = a ) i = 1 n P ( B o x i | c h a r = a ) a mod e l s { P ( c h a r = a ) i = 1 n P ( B o x i | c h a r = a ) }

So the description length of the strokes using the model for ‘a’ is given by:

D L ( s t r o k e s , a ) = log 2 ( P ( c h a r = a ) i = 1 n P ( B o x i | c h a r = a ) a mod e l s { P ( c h a r = a ) i = 1 n P ( B o x i | c h a r = a ) } )

The description length is computed for a comparison between all available character models and the unknown character. This is added to the description length of the probability of that character occurring in the sequence (generated from the lexicostatistics). Any character with a probability of less than 0.01 is discarded. This cuts down on processing time, and discards any truly unlikely character matches. The cut off value of 0.01 was arbitrarily chosen as an outlier rejection method. The Monte Carlo selection algorithm then chooses which of the remaining characters will be passed upwards to the next level, the description length of this character is, of course, passed up too.

(p.197) A.8. The Word Agent: Comparing Unknown Words to the Word Corpus

The word agent's function is to compare strings of possible characters to the word models in the corpus, in much the same way as the character agent compares stroke data to models generated from the corpus. However, the word agent's task is considerably simpler than that of the character agent, as there is no need to represent stroke data. The word ‘models’ are the words in the corpus themselves (see S. 3.9), and the probability of them occurring is known from the word frequency data.

At the word level, word models consist of character strings (delimited by a space character at either end). The input passed up from the character level combines to make strings of characters, and these are compared to the word models on a letter by letter basis. If the letter matches that of the same position in the word model, it is flagged as a match. If the letter does not match, it is flagged as a non‐match, and the description length from the character level is inserted in its place. The total description length of comparing that string of characters to the individual model is calculated as the negative logarithm (to the base 2) of the probability of the word occurring, plus the sum of the description lengths of the individual characters within the string (the DLs only being present if they did not match the associated character in the model directly):

D L = - log 2 P ( W o r d ) + i = 1 n { - log 2 ( 2 - D L ( s t r o k e s , C H i ) P ( C H i - 1 | C H i ) P ( C H i - 1 ) ) } i f C H i doesnt match , 0 otherwise

if CHi doesn't match, 0 otherwise

This is best understood by considering a possible message representing the fit of a word model to a character sequence. Consider fitting the word ‘foo’ to the word model ‘for’. The message begins with a representation of the model for the word ‘for’, then for each character there is either a ‘1’ bit, indicating that the character in that position matched the model, or a ‘0’ bit indicating that the character didn't match, followed by a representation of the character that failed to match. For this example the message stream will be as shown in Figure A.7. The description length of the match is therefore the code length for the model used, one bit for each matching character, and one bit plus the code length for each non‐matching character.

The system compares the string of characters to each individual word model in the corpus. The comparison with the lowest DL generates the most likely word identification.


Appendix A Using a Stochastic MDL Architecture to Read the Vindolanda Texts: Image to Interpretation

Figure A.7. Result of fitting word models to the characters USSIBUS, most likely fit. The word USSIBUS is by far the most likely contender compared to the other suggestions. The final ‘u s s i b u s s’ indicates the total description length if the description length for each individual character was added together: the overall MDL for ‘ussibus’ is smaller than that for ‘u, s, s, i, b, u, s’ indicating that the higher level interpretation at the word level is preferable to the locally correct interpretation at the character level.

However, what if the string of letters represents a word that is not in the corpus? The system compares the string to all available models, and also generates the total description length from the sum of all of the characters' description lengths. If there is no match between the sequence and the existing models, this DL will be the lowest, and the solution is presented as a string of characters. However, due to the fact that bi‐graph information is included in the word agent, the string with the minimum DL will be statistically the most likely combination of characters, even though a word has not been matched to the input. This can be seen in the results Chapter, 5.

A.8.1. An example

The string of characters ‘USSIBUS’ was fitted to all 342 of the word models that had a string length of seven characters. Only the results with a description length less than thirty‐six bits are presented here (these being the lowest eight description lengths computed). It is clear that the string USSIBUS, present in the corpus, is the closest match, as this produces the lowest description length. Four other words (or word fragments present in the corpus) are identified as being as probable as the string of characters itself, SCRIBUS, SCRIBAS, VEHIMUS, UIDEDUN. This indicates how the system propagates best‐fit solutions, which are approximate to the correct solution.

(p.199) Again, the system chooses which word to select as a solution using a Monte Carlo selection algorithm. The system does not incorporate any other data regarding word sequencing or grammar, at this stage. The MDL for a string of words is calculated by simply summing the description length for each word. Subsequent iterations of the system produce different sequences of characters, and therefore words. The most probable solution is that with the lowest MDL after a number of successive runs.

A.9. Conclusion

This appendix has detailed the technical underpinnings of the GRAVA system, and how it was developed to deal with the complex and more sophisticated data of the Vindolanda tablets. Explanations of minimum description length, Monte Carlo methods, and agent architectures have been provided, and the purpose, protocol, and implementation of individual objects in the GRAVA architecture have been discussed. The development of a more sophisticated character agent, utilizing character models ascertained from the annotated image corpus, is detailed, and complete overview of the developed system has been provided, with an example of how an individual agent works.

The MDL based GRAVA system provides a robust, adaptable platform from which to develop a system which solves an interpretation problem. Its success in generating plausible interpretations of the Vindolanda tablets in a realistic time frame is demonstrated in Chapter 5: there is no reason why this architecture cannot be developed and expanded to propagate possible solutions to other kinds of interpretation problems, and this is covered in Chapter 6.


(1) Parallel distributed processing (PDP) is a robust computer architecture, thought analogous to the processing structures found in the human brain, which uses decentralized networks of simple processing units to construct neurally inspired information processing models. This decentralization spreads responsibility, and changes in the state of the system are accomplished by modifying the strength or the architecture of the connections between neural units. The approach became popular in the 1980s with the work of McClelland and Rumelhart, and the PDP Research Group (McClelland and Rumelhart 1981, 1986; Rumelhart and McClelland 1982, 1986). PDP is often used to model mental or behavioral phenomena, such as the Interactive Activation and Competiton Model for Word Perception, which is used to model how the human brain identifies words quicker than random letters or letters in non‐words (see S. 2.8.2). PDP is the prevailing form of computer architecture known as connectionism, and is also often implemented as a ‘neural network’ (see Ch. 1, n. 26).

(2) See Ch. 4 n. 1. 

(3) See Ch. 4 n. 2.

(4) For a full specification of the GRAVA system architecture, see Robertson (2001), Appendix B.

(5) See Ch. 6 n. 2.

(6) The idea of semantics affecting lower level processes is not new. In the late 1970s there was much interest within artificial intelligence concerning heterarchical‐programming approaches (Fahlman 1973; McDermott and Sussman 1973). These systems suffered from behaviour that defied understanding and, equally importantly, analysis. Marr noted that in the human visual system that there was no evidence that higher level processes had any control over lower level vision and that we could profitably study low‐level vision without addressing such issues. Computer vision has largely followed that path for the past 20 years. The issue has not gone away, however, and the notion of how differing levels of semantic interpretation are made to interact is at the very heart of the AI problem. The PDP approach is an interesting attempt to develop neural network architectures that implement useful processes such as associative memory. In this research, we aim for the higher level semantic information to override lower level noise in order to generate plausible interpretations.

(7) A blackboard system is an architecture for building problem‐solving systems, in which a series of separate processes (or knowledge sources that generate independent solutions, such as expert systems) communicate through a centralized global memory or database, known as the blackboard. Partial solutions are built up and stored on the blackboard, which effectively coordinates the problem‐solving process. (Illingworth 1996: 45). For example, the HEARSAY II blackboard system was developed for speech recognition (Erman, et al. 1980).

(8) Also known as data‐driven processing, forward chaining is a control procedure used in problem‐solving and rule‐based systems. ‘When a data item matches the antecedent part of a rule, then the conditions for that rule have been satisfied and the consequent part of the rule is executed. The process then repeats, usually through some form of conflict‐resolution mechanism to prevent the same rules firing continuously’ (Illingworth 1996: 199).

(9) Subsumption is a layered architecture predicated on the synergy between sensation and actuation, giving rise to all the power of intelligence, in lower animals such as insects. Brooks (1986) argues that instead of building world models and reasoning using explicit symbolic representational knowledge in simple worlds, we should follow the evolutionary path and start building simple agents in the real, complex, and unpredictable world. No explicit knowledge representation of the world is used: the system is purely reflexive to input(s). Complex actions ‘subsume’ simple behaviours.

(10) Markov Chain Monte Carlo (MCMC) is a computational technique long used in statistical physics, often used as a way to calculate Bayesian inference (an approach to statistics in which all forms of uncertainty are expressed in terms of probability). Probability is a well understood method of representing uncertain knowledge and reasoning to uncertain conclusions, so using MCMC provides a rich array of techniques that can be applied to solving problems in artificial intelligence which rely on computation regarding probability. See Neal (1993) for an introduction.

(11) A ‘Markov Model’ is a statistical model or simulation based on Markov chains, which are a sequence of discrete random values whose probabilities depend on the value of their predecessor. In a regular Markov Model, the state is directly visible to the observer, and therefore the state transition probabilities are the only parameters. In a Hidden Markov Model (HMM), the system being modelled is assumed to be a Markov process with unknown parameters, and the challenge is to determine the hidden parameters, from the observable parameters: hence the name Hidden Markov Model. See Rabiner (1989) for an introduction.

(12) A ‘context‐free grammar’ is a formal system that describes a language by specifying a collection of rules that can be used to define how units can build into larger and larger structures. Stochastic context‐free grammars are a version of context‐free grammars with probabilities on rules, giving probabilistic descriptions of primary and secondary structures.

(13) The Viterbi Algorithm (Viterbi 1967) is a dynamic programming algorithm used to find the most probable sequence of hidden states in a Hidden Markov Model, given a sequence of observed outputs.

(14) See Ch. 1 n. 36 for a definition of Monte Carlo techniques.

(15) See section 5.3 for an example of the asymptotic nature of the GRAVA system.

(16) A general purpose blur filter which removes fine image detail and noise leaving only larger scale changes.

(17) This assumption is not entirely valid because the boxes are not strictly independent: there is some relationship between boxes as the strokes pass through them due to the trajectory of the pen or stylus stroke. However, the assumption works well in practice.