Keying What You See In Early English Books
Two years ago at DRH 2000, at the very inception of the EEBO text conversion project, I outlined what at that stage seemed to be the peculiar challenges of the EEBO project, as well as the strategies that we proposed to cope with them as best we could.
The present paper will be a progress report of sorts: a reassessment, two years into the project and a year and a half into real production, of the project's working assumptions and the strategies that were adopted in response to this project's unusual character.
The Michigan way of producing electronic texts has become increasingly well defined in recent years, and increasingly oriented toward bulk production and bulk distribution under generic systems. It seems to us the approach best suited to a production unit based in a library, since it is oriented toward producing standardized digital libraries, not individual works, however superlative and seductive they may be.
It is this approach that we have applied to EEBO, and it may be summed up as involving: (1) the application of simple and strict data capture standards; (2) a determination to distinguish only those features that are essential to retrieval, navigation, or intelligible display (and that can be described by means of typographic and other predictable visual cues); and (3) division of labor: automate what can be; outsource what cannot be; and reserve interpretation and quality assurance for specialist in-house staff. In the case of EEBO I would add: (4) planning for mistakes, the "principle of graceful degradation": that is, trying to ensure that even when features fail to be recognized, they are nevertheless captured in a way that allows for useful retrieval.
EEBO has put all this to the test: it is a large project now (ca. 1200 books processed by conference time) and potentially a huge one; its material is as varied in format, language, type, and convention as the inventiveness of 16th- and 17th-century authors and printers could supply; and its underlying physical form (1-bit digital scans of microfilm copies of often poorly printed and preserved early books) makes legibility a constant problem. Finally, the project's electronic texts are intended, at least in part, for scholars; i.e., for an audience not necessarily forgiving of imprecision, inaccuracy, or inconsistency.
Some of the results have been predictable: for example, the scope, scale, and quality standards of the project are under constant pressure and remain subject to revision.
Other results, however, rather fly in the face of conventional wisdom about data capture and render some of our working assumptions dubious, particularly and most interestingly as regards the role of interpretation. Interpretation, it appears, whether at the character and word level or at the structural level, is not something that should or even can be eschewed, assigned, or rigidly controlled. Instead, it is something that has to be allowed for at every point where a human mind encounters the text, whether it be that of a keyer in India typing in ecclesiastical Latin or a specialist reviewer in Oxford correcting the tagging.