Andrew Davies and Anthony Mandal

[Portions of this project were developed in co-operation with Belser Wissenschaftlicher Dienst, who kindly supplied permission to reproduce materials published in the Corvey Microfiche Edition.]

Recent advances in Information Technology have afforded faster and more convenient ways of searching texts, catalogues and bibliographies than have been previously available.  For some time, Cardiff University’s Centre for Editorial and Intertextual Research (CEIR) has explored the feasibility of producing digitised texts that are searchable by computer.  A report was presented in September 1997 by Anthony Mandal which examined possible avenues for the digitisation of literary materials available at Cardiff.

In February 1998, a scheme was agreed upon by members of the Board of CEIR to begin a pilot project, in collaboration with Andrew Davies, to convert five novels with a Welsh context from the Romantic period into fully searchable electronic texts.  This would enable those working on the project to develop an appropriate methodology for the conversion process, as well as to explore the logistics of such a procedure.

The first phase of this scheme was completed in August 1998, resulting in the production of a CD-ROM sampler, which was demonstrated at the ‘Scenes of Writing, 1750-1850’ conference in July 1998.  The project generated praise and enthusiasm, with many members of both the academic and publishing communities expressing interest in either purchasing such material or collaborating on future works.  The final aim of this project would entail combining the texts with other relevant material (e.g. biographies, bibliographies, background, criticism, etc.), to form a multimedia resource on CD-ROM.

Developing an electronic resource in a similar fashion to the CD-ROM sampler would serve three important purposes:

This could be supported by additional apparatus, such as biographical and historical details, bibliographies of both primary and critical texts, secondary criticism, glossaries, notes, etc.  Additionally, the electronic text could serve as the basis of future reset editions.

The file format currently employed for this scheme is Adobe’s proprietary Acrobat Portable Document Format (PDF), which works along the same principles as the hypertext found on the World Wide Web.  Sections, sentences, even words, can be ‘hotlinked’ to other similar segments of the text/s through a simple mouse-click.  The advantages of the PDF format include its self-contained nature (all the typography, images, and searchable information are included in one file), its flexibility, and its consistent, user-friendly approach to document management.  Additionally, because the software required to browse PDF documents (the Acrobat Reader) is freely distributable, the immense cost of developing a proprietary interface for such a scheme is avoided.

PDF allows for a great potential in the field of academic research.  Amongst its many features, the electronic text enables a variety of searches, a few of which include:

PDF files, and the Acrobat Reader which accompanies these texts, therefore offer significant and highly useful potential for development, which would complement and expand the already recognised importance of the facsimile edition.

I. Preparation of Materials
i) Xeroxing: the book pages are photocopied, either as two-page or four-page openings per A4 sheet

ii) Scanning: the xeroxed sheets are scanned at optimum resolution into document management software to create PC images

iii) Optimisation: the images are then cropped, sharpened, cleaned to ensure the highest quality for conversion to text

II. Conversion of Image to Editable Text
i) Optical Character Recognition: the images are then passed through an OCR engine, in order to convert the graphical images into actual text, in Word 97 format.  The success ratio is approximately 75%, but can obviously vary depending on the quality of the source image: the software also allows ‘training’, so that idiosyncrasies of the text (e.g. the elongated ‘s’ of pre-nineteenth-century texts) can be anticipated.  Initial (i.e. very basic) proofing is done at this stage to correct glaring anomalies in the conversion process

ii) Formatting for Proofreading: regular styles are applied to the different textual elements (chapter headings, body text, verses, etc.).  Minimal standardisation is employed, based on established house rules: at the moment, policy is to adhere as closely as possible to the conventions of the source text, while ensuring visual consistency.  Hard-copies are then printed from this regularised text

III. Editing and Corrections
i) Collation: the hard-copy is carefully compared to the source text for errors and differences

ii) Silent correction: slipped type and accidentals in the copy-text are then marked on the hard-copy for alteration

iii) Editorial intervention: decisions are made on any cases calling for editorial intervention; e.g. whether to standardise, how to regularise names, rules of predominance, etc.

iv) Application of changes: corrections and alterations recorded on hard-copy are then transcribed onto the electronic text and the changes noted in appropriate forms

IV. Creation of Electronic Text
i) Optimise for online viewing: the text is formatted and structured for viewing on a monitor, and converted into a PDF file.  Initially, we used Word 97 for this process, but are now moving over to a full desktop publishing package, and plan to experiment with both Adobe PageMaker 6.5 and Corel Ventura 8.0, in order to find the most flexible and appropriate medium

ii) Optimise for printing: a second PDF file is prepared especially for printing, with a proportionately smaller font size, and a page layout of two ‘book pages’ per A4 sheet.  These could then be used as ‘work-texts’ for study, teaching, training, consultation, mark-up, etc. [At this stage of the pilot project, this option is not currently available]

V. Addition of Supplementary Materials
i) User guide: an online guide (also printed manual?) would be available to instruct the end-user how the Acrobat software works

ii) Introduction: an introduction to the series and the text [At this stage of the pilot project, this option is not currently available]

iii) Biographical information [At this stage of the pilot project, this option is not currently available]

iv) Bibliographical information: both primary and secondary [At this stage of the pilot project, this option is not currently available]

v) Critical apparatus, etc. [At this stage of the pilot project, this option is not currently available]

VI. Mastering the Final Package
i) Prepare user interface: a console to install and navigate through the package will be constructed using the latest version of the industry standard Macromedia Authorware

ii) Print Hard-Copy: a final hard-copy of both the online and printed versions of the electronic texts and apparatus would be made and once again proofed for errors

iii) Record Image onto CD-ROM: the data files for the electronic text as well as the Reader software would then be mastered onto disk, for both archiving and duplication.  High quality versions of any images from the texts are also supplied in a universal file format (uncompressed/compressed TIFF) for printing, etc.
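
The ‘success ratio’ quoted under II(i) is not defined in the text.  One plausible reading, offered here only as an illustrative sketch, is character-level agreement between the raw OCR output and the proofed text.  The function name `character_accuracy`, the sample strings, and the use of Python’s standard `difflib` module are all assumptions made for illustration, not a description of the software actually used in the project.

```python
from difflib import SequenceMatcher


def character_accuracy(ocr_output: str, proofed_text: str) -> float:
    """Return a similarity score between 0.0 and 1.0.

    Assumption: 'success' is measured as the share of characters that
    align between the OCR output and the corrected (proofed) text.
    The source does not say how its 75% figure was calculated.
    """
    matcher = SequenceMatcher(None, ocr_output, proofed_text, autojunk=False)
    return matcher.ratio()


if __name__ == "__main__":
    # Hypothetical sample: the elongated 's' misread as 'f',
    # a common confusion when scanning pre-nineteenth-century type.
    raw = "fo fweet a profpect"
    proofed = "so sweet a prospect"
    print(f"character accuracy: {character_accuracy(raw, proofed):.0%}")
```

A score of this kind could be recorded per page before and after ‘training’ the OCR engine, to judge whether the training is worth the time it takes.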


This screenshot demonstrates how various navigation structures, in addition to basic back/forward movements, can be used. Bookmarks (left) allow users to move through significant sections of texts (e.g. chapters, tales, events, etc.). Thumbnails (right) display a reduced graphical facsimile of pages, enabling the viewer to move through sections based on their visual appearance (e.g. illustrations, changes from prose to poetry, etc.).

Adobe’s Portable Document Format (PDF) is an established file system for the storage and presentation of electronic texts in a consistent and flexible way.  It operates according to principles similar to those of the HyperText Markup Language (HTML) used for web pages on the Internet, in that it allows hyperlinking, searching, and bookmarking.  PDF, however, also moves well beyond the limits of HTML, in that it preserves the richness of a printed document (fonts, graphics, and layout are duplicated in PDF exactly as they were prepared in the original application) and allows for sophisticated querying and navigation around the electronic text in a way that is currently impossible (or at least highly difficult to sustain) in HTML.  Moreover, any and all additional elements (fonts, images, colour, sounds, and movies) are embedded directly into a single file, allowing for complete and compact portability.

In order to browse a PDF file it is necessary to install the Acrobat Reader: this is a freely distributable application, able to run on a multitude of operating systems, including Windows (3.1, 9x, NT), MacOS, Unix, Solaris, and many others.  This again substantiates the claim of PDF to be as universal as HTML is via the World Wide Web: in fact, many web sites do employ PDF either as a supporting format to HTML or as a primary vehicle for the dissemination of information.  Indeed, PDF is being strongly championed by a number of people as the document format of choice for the new ‘e-books’ which will be appearing over the next few years.

The analogy to a printed document is apposite, as the process of actually creating a PDF document involves printing the document to a file in PostScript (an established and sophisticated programming language used to control high-quality printers). The PostScript file is then distilled through the Acrobat software into an online and fully searchable electronic text. Various options can be selected to set the quality of image reproduction, and typefaces can be fully or partially embedded in the document depending on the requirements of the author.


As well as allowing users to view illustrations up to 1600% magnification (with Acrobat 4.0) without losing detail, the Search Query enables sophisticated searches to be made across not only single texts but whole ranges of collections, from a single dialogue box.

Our policy in selecting the initial texts for our pilot project was to choose rare or unique texts, which displayed significant depictions of Wales and Welshness in the context of the Romantic era.  Each initial text would display some sort of variation in genre and style, and would also represent different authors from the period.  Bearing in mind that this was a tentative scheme to explore the potential and feasibility of such a project, and that there was a great deal of further material worthy of exploration, the five texts which met the above criteria were:

  Emily Clark, Ianthé, or the Flower of Caernarvon (London: For the Author, 1798): essentially a sentimental novel set against the rustic backdrop of West Wales, with the typical plot motif of the virtuous heroine kidnapped by the bigamous seducer to a remote Scottish castle [source based on a private copy held by CEIR]
  Anon., Welsh Legends: A Collection of Popular Oral Tales (London: J. Badcock, 1802): five illustrated stories, in both prose and verse, covering Welsh folk tales and including such diverse figures as demon lovers, patriotic bards, and betrayed Moorish soldiers [source held in the Salisbury Library, Cardiff University]
  Evan Jones, The Bard; or, the Towers of Morven. A Legendary Tale (London: R. Dutton, 1809): a pseudo-Gothic story set in the mystic past of North Wales, replete with revealed identities, kidnapped heroines, and assassins [permission to copy from the microfiche kindly supplied by the original publishers, Belser Wissenschaftlicher Dienst]
  Olivia More, The Welsh Cottage (Wellington, Salop: F. Houlston & Son, 1820): a domestic, moral tale, which features Wales as a rustic and unspoiled place of retreat and contemplation [source held in the Salisbury Library, Cardiff University]
  W. S. Wickenden, Bleddyn; a Welch National Tale, Being the First of a Series (London: For the Author, 1821): an example of the post-Scott regional-historical tale, set in a Wales torn between Royalists and Parliamentarians during the seventeenth century [permission to copy from the microfiche kindly supplied by the original publishers, Belser Wissenschaftlicher Dienst]

Each text (about 200–250 pages of original text) typically took 35 hours of labour to complete in a basic form, including microfiching, conversion, and standardisation.  Additional factors such as formatting, mastering onto CD-ROM, preparing the user guide, and so on, increased this figure.
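
As a rough worked example of what these figures imply, the short sketch below converts the stated 35 hours per text into minutes per page and a baseline total for the five pilot texts.  The function names are hypothetical, and the calculation assumes that every text took the typical 35 hours; the additional work on formatting, mastering, and the user guide is not quantified in the text and is therefore left out.

```python
HOURS_PER_TEXT = 35          # typical figure stated above
PAGE_RANGE = (200, 250)      # approximate pages of original text per novel
PILOT_TEXTS = 5              # number of novels in the pilot project


def minutes_per_page(hours: float, pages: int) -> float:
    """Average labour per page, in minutes."""
    return hours * 60 / pages


def baseline_hours(num_texts: int, hours_per_text: float = HOURS_PER_TEXT) -> float:
    """Total basic labour, assuming each text takes the typical time."""
    return num_texts * hours_per_text


if __name__ == "__main__":
    shortest, longest = PAGE_RANGE
    slow = minutes_per_page(HOURS_PER_TEXT, shortest)  # fewer pages, more minutes each
    fast = minutes_per_page(HOURS_PER_TEXT, longest)
    print(f"{fast:.1f} to {slow:.1f} minutes per page")
    print(f"{baseline_hours(PILOT_TEXTS):.0f} hours for {PILOT_TEXTS} texts")
```

On these assumptions each page represents somewhere between eight and eleven minutes of work, which gives a sense of scale when considering larger undertakings such as whole runs of periodicals.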

With the first phase complete, and the basic five texts digitised and only slightly standardised, the second phase of the project offers a number of opportunities for refining and improving upon the techniques already developed.  The project has shown the potential of converting a large quantity of texts and preparing them for sophisticated searches and analysis: as well as being strong analytical tools, the digitised texts form an easily accessible (and printable) corpus of literary works.  An obvious example of this would be the digitisation of whole runs of early periodicals, such as the Gentleman’s Magazine, Quarterly Review, Edinburgh Review, and so forth.  Users could then search for particular authors’ names, generic keywords (e.g. gothic, sentimental, etc.), reviewers’ phrases, or publishers’ concerns.  Of course, this would represent a massive undertaking requiring months, if not years, of committed labour; however, it is a programme we are seriously considering.  One more tangible result of this project has been the decision to establish a longer-term project within Cardiff University’s Centre for Editorial and Intertextual Research, which would involve creating a large digital corpus of literary (and non-literary) texts which have been selected from significant editions of quality, and edited rigorously by members of the Centre.

A second aspect of development would be the inclusion of secondary apparatus on equal terms with the original (perhaps edited) text.  Material such as biographical information, bibliographies, expansive annotations, articles, chronologies, and contextual information, would make a CD-ROM containing a selection of literary works not merely an occasional research tool, but a fully featured academic package.  Again, this kind of involvement requires a concentrated amount of labour, both physical and mental, and we hope to undertake such a scheme as resources allow.

One definite result of this programme has been the proof that the digitisation (for archival and/or analytical purposes) of rare and significant works is both feasible and attainable with minimal effort (in academic terms).  The potential that electronic texts offer is only just being realised, but that something can be done relatively easily does not mean that it will be done well.  Such schemes, as with anything else, require careful planning and preparation: a clear and consistent policy needs to be laid down at the outset, and it must be followed rigorously.  Without a thesis, a programme like this simply becomes an end in itself, and such an approach negates its usefulness in academic terms.  The terminal point must always remain in sight, and one needs to exercise caution about what can be achieved and about the remit of such projects.  Ultimately, easy and instantaneous access to information can never replace the difficult task faced by all scholars: the acquisition of knowledge itself.

Last modified 13 September 2001.