Is EEBO-TCP / LION Suitable for Attribution

In all humanities research, access to the original documents is vital. In the case of early modern drama, for a long time that meant being able to use one of the major libraries which had decent holdings of the material identified in A. W. Pollard and G. R. Redgrave's path-breaking Short-Title Catalogue of Books Printed in England, Scotland, & Ireland and of English Books Printed abroad 1475-1640 (1927). This indispensable reference work, recently revised,1 revealed an unexpected number of variant editions and issues and identified the libraries around the world where copies were located. In 1938 Eugene Power founded University Microfilms and began filming copies of books in the British Museum Library. The process gradually expanded, until his company could offer libraries substantial tranches of both STC1 and STC2, as the continuation by Donald Wing, covering the period 1641-1700, became known.2 Many scholars will still remember the excitement of loading a microfilm for the first time, but also the frustration of finding that some copies were of very variable quality. The photographers who produced the films were hampered by the limitations of the technology, the unevenness of early modern printing, and defects in the copy available for filming. Many words were illegible, especially those containing the long 's' or easily mistaken letters, such as 'a' and 'c'. Power's company also filmed American dissertations, becoming so successful that it was bought and sold on by a series of companies, culminating in its purchase by ProQuest. In 1998 ProQuest launched a 'Digital Vault Initiative', purported to include 5.5 billion images digitized from UMI microfilm, including major newspapers, and Early English books dating back to the 15th century. The following year they purchased Chadwyck-Healey, a one-time microfilm publishing company that was one of the first to produce full-text CD-ROM databases.3 Many scholars will remember with gratitude these pioneering collections. 
The logical next step was to make these collections available online, as Early English Books Online (EEBO), which contains STC1 and STC2, together with the Thomason Tracts and the Early English Books Tract Supplement, a total of more than 125,000 volumes. A smaller repository, Literature Online (LION), resembling the Chadwyck-Healey collections, contains over a third of a million full-text works of poetry, prose and drama in English, together with online criticism and a reference library.4

The transformation of these vast resources, from microfilm to CD-ROM and finally online, has opened them up to a world-wide public. However, the technology was not accompanied by the old-fashioned discipline of proofreading and checking against the original texts, with the result that many of the original defects survive. In 1999 a Text Creation Partnership was formed to remedy these failings. In partnership with ProQuest and with more than 150 libraries, its aim is to generate 'highly accurate, fully searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database'.5 Where other electronic databases have been produced by the inaccurate method of optical character recognition, the TCP texts provide keyboarded full-text transcriptions of EEBO images, linked to the individual page images. Currently, some 60,328 titles are available to libraries belonging to the partnership, with another 4,000 being processed.6 Although the accuracy level of the TCP texts is high, many of the titles that a researcher might need to consult have not yet been processed, so that users will still find illegible words, replaced in some cases by question marks. The great advantage of the online version compared to its previous existence as CD-ROMs is the provision of a search engine. It was not an easy task to devise a tool that could handle all the un-coordinated orthography of early modern English, and the early version involved several procedures; recent modifications offer helpful options for 'variant spellings/forms'. Users can make 'select from a list' searches, or 'proximity searches', in which they can search for a specific word string by using a wildcard (·). In other searches, the term FBY can specify a word 'followed by' another specified word. The procedure is logical, if cumbersome, but with experience one can learn appropriate refinements.

The potential applications of EEBO-TCP are massive, and some have been unexpected, for instance in textual criticism. While editing Measure for Measure for the recent Norton Shakespeare, Matthew Steggle confronted some long-standing textual problems, one of which he solved by using this database. In the Folio text Escalus describes ways in which human beings can go astray:

Some rise by sinne, and some by virtue fall:

Some run from brakes of Ice, and answer none

And some condemned for a fault alone (2.1.38-40).

In 1709 Nicholas Rowe proposed to read 'brakes of vice': as Steggle puts it, 'an easy aural error; "vice" would fit with "virtue" in the line before', and the antithesis would be clarified.7 No illustrative contemporary example had been found for this phrase, but in searching EEBO-TCP Steggle discovered that 'the "brakes" are a metaphor for vanity, self-indulgence or foolish entanglement', as in a 1629 devotional tract by Richard Brathwaite, which refers to 'the pricking brakes of sensuality' and 'the brakes of vanitie'.8 Three centuries later, an electronic database supports an emendation by the first Shakespeare editor.

Steggle extended the application of EEBO-TCP for literary studies into another field. Scholars have known for centuries the titles of many plays referred to in contemporary documents that have not survived.9 Indeed, current estimates suggest that their number (1,100) exceeds that of the known plays (543).10 In many instances sufficient information exists for us to establish a good sense of a play's subject-matter, its authors, actors, and theatre companies. In other cases, only a name survives, and it has been impossible to verify whether such a play existed. In recent years a revival of interest has led to the creation of a database of lost plays documenting what is known.11 Steggle systematically searched EEBO-TCP for verbal evidence and succeeded in identifying ten lost plays. One especially evasive play title was Richard the Confessor, a seemingly unlikely topic for the public theatre in Protestant England. Henslowe recorded two performances at the Rose by Sussex's Men in 1593-4, but some historians followed Malone in dismissing it as an error for Edward the Confessor. Steggle's thorough researches have identified the subject as Saint Richard of Chichester (d. 1252), a bishop-saint frequently referred to in the early modern period.12 Steggle had searched EEBO-TCP for 'Richard the Confessor', without success, but the same entry in Google Books led him to the extensive historical record. He subsequently returned to EEBO-TCP and entered 'Richard NEAR.3 confessor, where the NEAR.3 operator serves as an instruction to the database: "Find me all the places where the string Richard occurs within three word-breaks of the string confessor". The .3 can be replaced with any other number of the user's choice...'13 As well as being a research success story, this shows the importance of knowing correct procedures.
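The proximity search that Steggle describes can be illustrated with a minimal sketch in Python. It is not the EEBO-TCP search engine: the function names `tokenize` and `near`, the simplified tokenisation, and the assumption that the two words may occur in either order are all illustrative choices, and the sample sentence is invented for the purpose.

```python
import re

def tokenize(text):
    """Lower-case the text and split it into word tokens (simplified)."""
    return re.findall(r"[a-z']+", text.lower())

def near(text, first, second, distance):
    """Return (i, j) token positions where `first` and `second` occur
    within `distance` word-breaks of each other, in either order.
    This only imitates the idea behind NEAR.n; it is an assumption,
    not the database's actual implementation."""
    tokens = tokenize(text)
    first_positions = [i for i, t in enumerate(tokens) if t == first.lower()]
    second_positions = [j for j, t in enumerate(tokens) if t == second.lower()]
    return [(i, j) for i in first_positions for j in second_positions
            if 0 < abs(i - j) <= distance]

# Invented sample sentence, used only to exercise the function.
sample = "Saint Richard, bishop and confessor, was buried at Chichester."
print(near(sample, "Richard", "confessor", 3))  # the words are three breaks apart
print(near(sample, "Richard", "confessor", 2))  # too narrow a window: no hit
```

The point of the example is that an exact-phrase search for 'Richard the Confessor' finds nothing in such a sentence, whereas a proximity search with a window of three does.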

The value of these resources to authorship studies seems undeniable, and MacDonald Jackson, that frequent pioneer, recognised its potential in an essay published in 2001.14 It is an accepted fact that, due to the intense competition between Elizabethan theatre companies, dramatists regularly wrote under time pressure and were prone to repeat words and phrases that they had used before. One basic method in attribution work is to search for verbal parallels between a text of known authorship and one where the authorship is unknown, and the provision of these vast databases seems an ideal resource. When Steggle performed his searches, he knew exactly what he was looking for, a textual crux or a play title, and he could recognise that he had been successful. But when attribution scholars search for repeated phrases and collocations stretching over several lines of verse, how do they know that they have found all the relevant data? In this essay I wish to raise some doubts about its efficacy by reviewing two recent studies using LION/EEBO-TCP on the Shakespeare canon, by MacDonald Jackson and Anna Pruitt.

In his pioneering essay Jackson began by criticising recent erroneous authorship claims (by Eric Sams and Mark Dominik) based on 'the haphazard and biased accumulation of verbal parallels'. Jackson suggested that such mistakes 'can be avoided through systematic and comprehensive electronic searches' (193) and illustrated his method with examples from the co-authored play Titus Andronicus. Having chosen short passages from a scene universally ascribed to Peele (1.1.1-17), and one ascribed to Shakespeare (2.3.10-29), 'words, phrases, and collocations from the two passages were methodically keyed in [to LION], one at a time, to be searched' in eight plays and eight poems by Peele and seven plays and one narrative poem by Shakespeare (196). That search produced a listing of 'phrases and collocations that occur in one author's canon but not in the other's', consisting mostly of 'groups of two or more words that are either consecutive or closely associated'. As Jackson explained, 'the requirement that these collocations should be confined to the canon of only one of the two playwrights ensures that high-frequency examples (such as "of the") are ignored' (198). Analysing the results, Jackson noted that 'a straight count - scoring only one authorial hit, however many times a phrase or collocation occurs within the canon that includes it - yields five hits for Peele, six for Shakespeare' (199). Jackson added two riders to explain this counter-intuitive result, one quantitative ('Shakespeare's canon is more than eight times greater than Peele's'), the other qualitative:

the more impressive linkages are with Peele. The phrase 'that ware the' occurs nowhere else in English drama; nor is there in English drama another instance of 'to virtue consecrate' (where 'consecrate' = 'consecrated')... The phrase 'let desert' not only occurs in the mature history 2 Henry IV (1597-8), written much later than Titus Andronicus, but is a barely significant link. In several cases where Shakespeare provides the more exact linkage, Peele has almost equally good matches. (199)

That interesting discussion reveals that Jackson used two evaluative criteria to rank these matches, unique occurrence and quality ('significant', 'good').
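The counting rule that Jackson describes (a phrase scores for an author only if it occurs in that author's canon and not in the other's, and scores once however often it recurs) can be sketched as follows. This is an illustration only: the function names are hypothetical, the two 'canons' are invented toy sentences rather than texts by Peele or Shakespeare, and the sketch handles only consecutive words, not the 'closely associated' collocations that Jackson also admitted.

```python
def contains_phrase(work_tokens, phrase_tokens):
    """True if the phrase occurs as consecutive tokens in the work."""
    n = len(phrase_tokens)
    return any(work_tokens[i:i + n] == phrase_tokens
               for i in range(len(work_tokens) - n + 1))

def exclusive_hits(phrases, canon_a, canon_b):
    """Return (hits_a, hits_b): phrases found in one canon but not the other.
    Each phrase counts once, however many times it recurs in that canon."""
    hits_a, hits_b = [], []
    for phrase in phrases:
        words = phrase.lower().split()
        in_a = any(contains_phrase(work, words) for work in canon_a)
        in_b = any(contains_phrase(work, words) for work in canon_b)
        if in_a and not in_b:
            hits_a.append(phrase)
        elif in_b and not in_a:
            hits_b.append(phrase)
    return hits_a, hits_b

# Invented toy 'canons' (lists of tokenised works), not real quotations.
canon_a = ["of the green wood we sing".split(), "the green wood is dark".split()]
canon_b = ["of the silver sea we sing".split()]
phrases = ["of the", "green wood", "silver sea", "we sing"]

print(exclusive_hits(phrases, canon_a, canon_b))
```

In this toy run 'of the' and 'we sing' are discarded because both canons contain them, while 'green wood' counts once for the first canon although it occurs in two works. The sketch makes no attempt to correct for the difference in canon size that Jackson mentions.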

The second passage, when submitted to the same test, yielded clear-cut results: twenty 'connections with the Shakespeare canon (unmatched in Peele's)' but only two for the converse test (201). Moreover,

The two Peele linkages are among the very weakest. That at line 12 consists merely of the conjunction of 'melody' and 'birds' in 'here is melody. A charm of birds' in The Arraignment of Paris, and is accepted as a hit for Peele only because it brings together those two precise words. Some of the Shakespeare linkages, in contrast, are complex, based on characteristically Shakespearian associations: echoes of the baying of hounds and the coupling of these to a 'nurse's song' to a babe; the 'snake' that is 'rolled' in the sun or flowers and is mentioned close to the adjective 'chequered'. (201-2)

Here, too, Jackson used evaluative criteria to rank the phrasal matches, judging 'the Shakespeare linkages' to be 'complex' because they used more sustained verbal associations. Briefly reverting to the Peele test, Jackson conceded that 'several of the locutions that link Titus Andronicus, 1.1.1-17 to Peele are commonplace and prove nothing in themselves. But their triteness is immaterial', apparently because the passage was 'methodically searched for phrasal links with Peele's canon...' (202). Introducing this 'new technique', as he called it, Jackson suggested three evaluative criteria for phrasal matches: unique occurrence, quality, and 'commonplace ... triteness'. That would seem to cover all occasions.

In 2006 Jackson gave an extended demonstration of his new method, to support his claim that scene 8 in Arden of Faversham (the second quarrel between Alice Arden and her adulterous lover Mosby) was written by Shakespeare.15 Jackson briefly described the evidence for Kyd's authorship produced by 'early twentieth-century attribution scholars', but rejected it outright:

Although the basic assumption was correct - that playwrights have individual habits as phrasemakers and tend to echo themselves more often than they echo others - the value of the proffered parallels could not be reliably assessed, because the search for them had been haphazard and biased by the scholar's preconceptions. (256)

How does Jackson know that these searches were haphazard? The three main authors concerned - Charles Crawford (1903), Walter Miksch (1907), and Paul Rubow (1948) - between them amassed over a hundred close verbal matches between Arden of Faversham and the three plays then ascribed to Kyd.16 They worked systematically, drawing on a wide reading knowledge. And why should Jackson accuse them of bias? They knew enough about Elizabethan drama to recognise Kyd's hand, not Marlowe's, nor Peele's. In his pioneering essay Charles Crawford recorded that, after 'an exhaustive and painstaking examination of Kyd's work as a whole', he had concluded that 'the vocabulary, phrasing, and general style' of the play 'are those of Kyd, and that they cannot be mistaken for those of any other author of the time'.17

Another scholar might have thought that the value of these parallels should be directly established by inspecting them. Jackson, however, dismissed 'the old discredited methodology', proposing that its defects could be remedied by using Literature Online. He 'methodically explored' this database for links with the Quarrel scene, searching for 'phrases and collocations that occur five or fewer times in other plays first performed from 1580 to 1600... Parallels in imagery and ideas were recorded only if passages had at least one prominent word in common' (257). Although Jackson did not draw attention to it, this was a new approach in attribution studies. Previous scholars had looked for individual matches, each of which helped to build up the documentation of an author's self-repetition, accepting the principle laid down by Muriel St Clare Byrne that parallels should satisfy the criteria of both quantity and quality - that is, when a parallel of thought is accompanied by a parallel of language.18 However, these predecessors had never quantified their results. By allowing matches that had only one word in common, and by introducing multiple examples, Jackson included many matches that his predecessors would have rejected as not fulfilling the unitary criterion of verbal similarity coupled with a similarity of thought. Secondly, by extending the limit to five, Jackson could bring into his net authors with a large canon, above all Shakespeare. Of the 132 plays that Jackson searched, he found that '28 have four or more links to the quarrel scene', the titles and scores being set out in Table 1 (259). Sixteen of these were sole-authored plays by Shakespeare, three were co-authored (Titus Andronicus, 1 Henry VI, and Edward III).

Jackson's discussion of the date of Arden showed his awareness of the correct use of 'chronological limits' to date a play. As he argued, since Arden is 'influenced in places by copious marginalia printed for the first time in the 1587 edition' of Holinshed's Chronicles, and has two references to events of 1588, this establishes a terminus a quo. Jackson then observed that 'plays were seldom published until at least a year after they had begun their run on the stage', thus setting for him the later limit to 1591.19 Lukas Erne's recent study of Shakespeare as a literary dramatist had shown that

the Lord Chamberlain's Men did not try to have Shakespeare's plays printed immediately after they had been written. If we consider the likely dates of composition and the dates of entrance in the Stationers' Register, a consistent pattern presents itself: as a rule, roughly two years seem to have elapsed between the former and the latter.20

On this basis, Arden of Faversham would have been performed in 1590, and this is indeed the date that Martin Wiggins gives in his authoritative new Catalogue.21

Jackson used the 'probable date of first performance' throughout,22 but he followed the Shakespeare chronology given in Wells and Taylor's 1986 Oxford edition, which assigns earlier dates than those given by Wiggins.23 The result of Jackson's searches was that 'links to plays by Shakespeare are overwhelmingly predominant' (258). The scores for the first four titles in his list are as follows:

Scholars familiar with chronology studies will immediately raise the likelihood that all four plays post-date Arden. Indeed, Jackson himself pointed out that 'it is probable that no Shakespeare play listed' in his Table 1 (259) 'was written before Arden of Faversham, and it is virtually certain that several of those with many links were written after it' (261). If that is the case, then the evidence for Shakespeare must be dismissed, since the 'matches' cannot be separated from the categories of authorial imitation, or (more likely), the recollection of a play seen in performance. Shakespeare's extensive knowledge of Arden has seemed to many scholars to suggest that he must have acted in it, a possibility that Jackson vehemently denies (261, 270, 273). In his 2001 introduction to the use of LION Jackson had used three criteria: unique occurrence, quality, and 'commonplace' phrases. In this 2006 essay, as in his 2014 monograph, two of those criteria have been dropped, leaving only 'quality' in the sense of the semantic congruity of a match, together with a newly-devised quantitative scoring procedure. Regarding the former, Jackson explains,

links were not recorded when collocated words occurred in entirely different senses: thus 'loathsome weeds' in line 67 of the quarrel scene provides a link to A Knack to Know a Knave, where the 'loathsome weeds' are again plants, but not to Caesar and Pompey, where 'loathsome sable weeds' are mourning clothes (258).

That seems an unexceptionable principle, but several of the matches that Jackson claimed fail to meet his own criteria. For the collocation 'climbed the top bough of the tree' (AF 8.15),24 Jackson cited the closest two matches produced by the LION search function, the first being a phrase from Dekker, 'catched at the highest bough'. This match observes Jackson's new criterion of accepting matches that have only one word in common, but in this case 'bough' has quite different connotations (attempting to grasp, rather than successfully climbing), and it satisfies the criterion of neither quantity nor quality. The second match is the phrase 'tree tops', as found in Shakespeare, Romeo and Juliet (278), which is very different from 'highest bough'. Moreover, a check of the text shows that Romeo refers to 'yonder blessed moon... / That tips with silver all these fruit-tree tops' (2.2.107-8). Here the noun 'fruit' evidently modifies 'trees', whether the early editions hyphenated it or not, so this instance is even less of a match. For the collocation 'Each... airy gale doth shake my bed' (AF 8.17) the best match that the LION search function could offer is 'by whirlwind shaken', from Alarum for London, where 'whirlwind' implies a force that would do considerably more than shake his bed. Secondly, Jackson cited from The Taming of the Shrew (2.1.141) a phrase even more remote from Mosby's precarious position in a tree: 'as mountains are for winds, / That shake not, though they blow perpetually'. These phrases have only one word in common, 'shake', and they use differing synonyms for 'gale'. In some of his other claimed links between the quarrel scene and The Rape of Lucrece Jackson includes such slender parallels as these:

To make my harvest nothing but pure corn (AF 8.25)

And useless barns the harvest of his wits (Lucr. 859)

'Tis fearful sleeping in a serpent's bed (AF 8.42)

The adder hisses where the sweet birds sing (Lucr. 870)

Thou hast been sighted as the eagle is (AF 8.126)

eagles gaz'd upon with every eye (Lucr. 1015)

Judged by his own criteria, most of Jackson's 'links' do not satisfy the acceptability condition for matches. The 'harvest' in the first quotation is that of an aged miser who has hoarded his wealth; in the second it is metaphorical. A serpent and an adder are both snakes, but otherwise they cannot count as matching phrases or collocations. In Arden 'sighted' refers to the eagle's remarkable eyesight, whereas 'gaz'd upon' in Lucrece has the eagle as the object of scrutiny by others. The fact that his list includes relatively rare single words, such as 'sland'rous' and 'copesmate', found in both texts, may be explained by the fact that the poem was published four years after the play was performed and could have picked up these words.

For the 167 lines of the quarrel scene Jackson claimed to have found 135 links to a variety of dramatists (276-89), but many of these are multiple matches for the same line. For Mosby's complaint about the 'Continuall trouble of my moody braine' (AF 8.3) Jackson found three single-word links: 'troubled brain', The Misfortunes of Arthur (1584); 'moody thoughts', 3 Henry VI (1591); and 'moody discontented' in both 1 Henry VI (1592)25 and Richard III (1593) - three of those plays post-dating Arden. Jackson noted that 'there are no other collocations of "moody" and "discontent(ed)" within the space of sixty words' (276). If that refers to his own practice, that is an unusually large extent for collocations. The default setting for EEBO-TCP is ten words, and in Corpus Linguistics the standard interval is four words. The most remarkable feature of these 135 links is that only eight come from the plays of Thomas Kyd. In introducing his new method Jackson promised to replace discredited old methods by using 'systematic and comprehensive electronic searches.'26 Yet, having 'methodically explored' this database, Jackson missed over 60 close matches with Kyd.

Appendix 1 lists 74 verbal matches between the acknowledged plays of Kyd and the Quarrel scene. These have been found with the help of two resources. The first is the software recently developed by universities to deter students from plagiarizing published work. Here it is used not to detect plagiarism as such, but to identify a writer's self-repetition. When two electronic documents are compared the program can be set to highlight every instance where two or more consecutive words are common to both.27 The identification is entirely objective, lacking any element of subjectivity or bias, and is precise. Jackson had to cut up the text of scene 8 into segments that appeared to him to constitute a meaningful unity of utterance, which he then submitted to the LION search function manually to see if they would indicate a match. The procedure depends on the researcher's choice of words or phrases to be searched for, and the energy with which that search is executed, to check all possible verbal combinations. That introduces two subjective elements into the frame. In contrast, my method starts with a match already discovered by the software program. The precision with which the software identifies matching collocations removes all guess-work or bias. Secondly, in recent months I have benefited from the newly available marked up corpus of 527 early modern plays prepared by Pervez Rizvi, which allows users to search for n-grams and collocations in all the texts.28 I used old-spelling texts with the software program, which had no difficulty recognising words spelled differently; the Rizvi database, given its massive scope, necessarily uses modernised texts. By using both methods side by side I hope to have overcome any weaknesses.
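The principle of comparing two documents automatically, rather than keying in phrases chosen by the researcher, can be shown in a minimal Python sketch. It is not the anti-plagiarism program or the Rizvi database used for this essay, whose internals are not described here; the function names are hypothetical, and unlike the software mentioned above the sketch makes no attempt to reconcile variant spellings. The two passages compared are AF 8.57 and SP 2.1.114-15.

```python
import re

def tokens(text):
    """Lower-case word tokens, ignoring punctuation (simplified)."""
    return re.findall(r"[a-z]+", text.lower())

def ngrams(words, n):
    """The set of all runs of n consecutive words."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def shared_ngrams(text_a, text_b, n=2):
    """Runs of n consecutive words common to both texts."""
    return ngrams(tokens(text_a), n) & ngrams(tokens(text_b), n)

def shared_words(text_a, text_b):
    """Words common to both texts, regardless of position."""
    return set(tokens(text_a)) & set(tokens(text_b))

arden = "To forge distressful looks to wound a breast"
soliman = ("Ah, how thine eyes can forge alluring looks "
           "And feign deep oaths to wound poor silly maids")

print(shared_ngrams(arden, soliman, 2))
print(sorted(shared_words(arden, soliman)))
```

Run on these two passages, the consecutive-word comparison finds only 'to wound', while the word-level comparison recovers all four shared terms ('forge', 'looks', 'to', 'wound'). That difference is one reason for using an n-gram tool and a collocation tool side by side.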

Of the 74 matches I have identified, Jackson's findings agree in seven instances (nos. 11, 16, 18, 25, 26, 48, and 51). Other matches that I accept, however, he considered but dismissed. For no. 4, Mosby's complaint that insecurity 'nippes me, as the bitter Northeast wind, / Doeth check the tender blosoms in the spring' (AF 8.5-6), Jackson rejected the parallel with 'Deaths winter nipt the blossomes of my blisse' (Sp. T 1.1.13), although this three-term collocation is unique in drama up to and including 1590. Jackson rejected it because 'the verb "nip" ... relate[s] specifically to "winter" cold', whereas the Arden image presents 'the premature destruction of budding spring blossoms...' (277). To cite the difference between the seasons is a trivial objection. Both passages share the sense of growth or happiness being destroyed by some destructive, unwelcome influence, whether Mosby's anxiety, a bitter wind in spring, or death. Moreover, in Kyd's first publication, Verses of Prayse and Joye (1586), written after the foiling of the Babington plot to murder Queen Elizabeth, Kyd addressed Chidiock Tychborne, one of the conspirators who had been executed, with this verdict:

Time trieth trueth, and trueth hath treason tript;

thy faith bare fruit as thou hadst faithles beene:

Thy well spent youth thine after yeares hath nipt.29

Match no. 26 is a striking parallel between two lovers' quarrels. In the first, Mosby unjustly accuses Alice of exploiting her ability 'To forge distressful looks to wound a breast' (AF 8.57). In the second, Perseda accuses Erastus of the same skill:

Ah, how thine eyes can forge alluring looks

And feign deep oaths to wound poor silly maids (SP 2.1.114-15)

The two passages, using a four-term collocation, could hardly be closer. The earlier is the more expansive but sets up a syntactical structure that the later version exactly repeats. However, Jackson did his best to minimize the significance of this unique collocation match with Kyd, arguing that

The image, which has eyes feigning oaths, is characteristically confused, and whereas in Kyd 'forge' simply means 'simulate' in the Arden of Faversham passage it retains a hint of a blacksmith's weapon making, and so interacts with the verb 'wound' to vivify the metaphor. (282 note).

(The dismissive judgment, 'characteristically confused', is a prejudicial aesthetic evaluation of Kyd that has no place in modern attribution studies.) Jackson's reading is strangely literal, taking us into a blacksmith's workshop only to find that the smith has merely produced 'distressful looks'. If weapon-making had truly been hinted at, the result would be more than 'distressful'. In Perseda's accusation the parallel structure means that 'forge' and 'feign' are synonyms. Jackson reads this phrase literally, taking 'eyes' as the subject of both lines, whereas I understand the more general reference as being to her lover's unreliability. Despite Jackson's attempted disassociation, the two usages are identical, and there is no confusion.30 These attempts to minimise Kyd's possible authorship of Arden of Faversham could not disguise the fact that my search using anti-plagiarism software identified over 70 close verbal matches that were missed by Jackson's LION search. Does this discrepancy reflect the weakness of the search engine itself, or must it be put down to the subjective elements of the search process, with the researcher responsible for choosing words and word-combinations to be entered manually?

It is regrettable that Jackson's new technique has never been critically evaluated, with the result that recent writers on authorship attribution treat it as a kind of gold standard. William Weber, contesting the widely accepted attribution of Titus Andronicus 4.1 to Peele (along with 1.1, 2.1, and 2.2), followed Jackson unquestioningly. Accepting the claims that 'this method has been successfully applied to a number of complicated attribution problems', he praised Jackson's 'instructions' as a 'clear and comprehensive' guide to using EEBO-TCP.31 Weber describes the recommended process as

simple but painstaking: in the case of testing a passage with two potential authors, one advances through the text line by line, entering every word, phrase and collocation of nearby words into the database's search field, with results limited by author to 'Shakespeare OR Peele'. When a given phrase or collocation appears in one author's works but not the other's, it counts as a single 'hit', regardless of how many times it may appear in that one author's works.

Following Jackson, Weber used a reduced Shakespeare canon, consisting of eight works chosen to match Peele's output 'in terms of size, period, and genre' (80). Having made his search, Weber claimed, for this scene of 128 lines, 65 'Shakespeare hits' as against 22 for Peele (81).

Weber's results were called in question by Anna Pruitt in her contribution to the recent New Oxford Shakespeare Authorship Companion. Pruitt declared that

The testing method pioneered by Macdonald P. Jackson, which determines authorship by searching for verbal parallels in the Chadwyck-Healey Literature Online (LION) database provided by the company ProQuest, is bound to grow in popularity due to the wide range of applications of the test and the relative accessibility of the testing method (p. 92).

Pruitt posed the confident question, 'Why does the LION test work so well?', and answered that 'it is based on a solid principle, confirmed by cognitive science, that an individual writer's word choices form a unique pattern that can be distinguished from those of other writers' (p. 92). This principle is correct, but Pruitt credits the wrong discipline: the credit is primarily due to Corpus Linguistics.32 Pruitt added some cautions:

However, like any powerful testing technique, it is only as good as the strength and reliability of the database (and the search tools used to access the information in the database), the experiment's design, and the clear, reliable, and reproducible procedures for generating, collecting, sorting, and analyzing the raw data it provides. The LION test itself may seem relatively simple, but running a viable experiment using the test is not (p. 92).

Having made this important point, identifying the congeries of factors involved - the reliability of the database and search tools, the experiment's design, and the need for 'clear, reliable, and reproducible procedures' - Pruitt gave a commendably thorough description of the correct procedure to be used, while acknowledging how time-consuming it was.33 To illustrate its correct application, Pruitt returned to Titus Andronicus 4.1 and worked through her method, comparing it with Weber's use. She claimed that

the results produced by my test outnumbered those from Weber's test. Weber's combined exact-match-and-close-association search with the restricted canon found 65 hits for Shakespeare and 22 for Peele, while my exact-match-only search returned 154 hits in Shakespeare's restricted canon compared to 51 hits in Peele's canon. Even when excluding hits comprised of a pronoun and a verb (which Weber excluded), my exact-match-only test still produced 98 more hits than Weber's exact match-and-close-association test (p. 99).

Pruitt explained that her score did not include 'five valid exact matches' found by Weber, all given to Shakespeare, which would raise his score to 159, while Peele remains on 51. However, Pruitt's Peele score vastly under-estimates the evidence for his presence in this scene. I give my results in Appendix 2, once again using anti-plagiarism software, supplemented with data from the Pervez Rizvi database. With this double aid I have identified a further 29 matches missed by both Weber and Pruitt, and added supplementary examples.34 And whereas their evidence, following the model of Jackson's 2006 essay, includes many 'commonplace' or 'trite' phrases,35 my matches consist of more extended phrases and collocations that are individual and unique.

If Jackson missed over 60 matches with Kyd in Arden of Faversham scene 8 (167 lines), while Weber and Pruitt missed over 30 matches with Peele in Titus Andronicus 4.1 (128 lines), this would suggest that LION is not necessarily an appropriate tool for discovering verbal matches, and that attribution results based on it cannot be relied on. As for the cause of these failures, one might blame the search engine, if it were not for the evidence of its success elsewhere. Anti-plagiarism software is evidently superior in discovering verbal matches, since it has an automatic procedure, independent of the user's diligence or curiosity. It could be that all the matches that I have discovered by anti-plagiarism software might have been discovered if the users had persisted in their searches. But if results vary with the user's persistence, the method is not reproducible. As Matthew Steggle has suggested, 'if it were possible to devise an utterly mechanical written set of rules, like "every time you come to a word longer than five letters, look for collocations with words ten forward and ten back", then you might be able to get the LION method reproducible'.36 Technology in this area is developing so quickly that such an algorithm may soon be available.
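Steggle's hypothetical rule can be written down mechanically. The following is a minimal sketch in Python, offered only as an illustration of what 'an utterly mechanical written set of rules' might look like; it is not the LION search engine, nor the procedure of any anti-plagiarism program. The function names, the lower-casing, and the treatment of a collocation as an unordered pair of words are all assumptions introduced here.

```python
import re
from typing import List, Set, Tuple

MIN_LEN = 6   # "a word longer than five letters"
WINDOW = 10   # "ten forward and ten back"


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic words (an assumption)."""
    return re.findall(r"[a-z]+", text.lower())


def collocations(text: str, min_len: int = MIN_LEN,
                 window: int = WINDOW) -> Set[Tuple[str, str]]:
    """Pair every word of at least min_len letters with each word that
    stands within `window` positions before or after it."""
    words = tokenize(text)
    pairs: Set[Tuple[str, str]] = set()
    for i, word in enumerate(words):
        if len(word) < min_len:
            continue
        start = max(0, i - window)
        end = min(len(words), i + window + 1)
        for j in range(start, end):
            if j != i:
                # Store the pair in alphabetical order so that word order
                # in the text does not affect the comparison.
                pairs.add(tuple(sorted((word, words[j]))))
    return pairs


def shared_collocations(text_a: str, text_b: str) -> Set[Tuple[str, str]]:
    """Return the collocations that occur in both texts."""
    return collocations(text_a) & collocations(text_b)
```

Because every step is fixed in advance, two users running the rule on the same pair of texts obtain the same list of shared collocations, which is the sense of 'reproducible' at issue. As written, the rule also pairs long words with function words such as 'in' or 'the', so any practical use would still need a further, explicitly stated filter for commonplace collocations.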

One further point to be considered is the human factor. In the studies reviewed here the users have certainly displayed diligence, but they have also clearly favoured one authorship candidate and rejected others. MacDonald Jackson dismissed Kyd's possible authorship of Arden of Faversham back in 1963, and has never wavered in that belief, while increasingly favouring Shakespeare. Weber and Pruitt explicitly set out to disprove Peele's authorship of one scene in Titus Andronicus. Perhaps, then, to succeed in using EEBO-TCP or LION in attribution studies, in addition to systematic procedures one needs an open mind.

Footnote

1 See the Second Edition, Revised and Enlarged, begun by W. A. Jackson & F. S. Ferguson, completed by Katharine F. Pantzer 3 vols. (London: The Bibliographical Society, 1976-1991).

2 See Donald Wing, Short-Title Catalogue ... 1641-1700, 3 vols. (New York: Index Society, 1945-51), and the Second edition, newly revised and enlarged by J. J. Morrison, C. W. Nelson, and M. Seccombe, 4 vols. (New York: Modern Language Association, 1982-98).

3 See the entries in Wikipedia for 'University Microfilms' and 'ProQuest'.

4 See http://lion.chadwyck-healey.com

5 See http://www.textcreationpartnership.org/tcp-eebo/

6 This information was kindly supplied by Dr Paul Schaffner, Director of the project (email, 31.8.18).

7 Matthew Steggle, 'The cruces of Measure for Measure and EEBO-TCP', Review of English Studies 65 (2014), 438-55 (p. 443).

8 Ibid., 444. The meaning of 'brakes' as 'thickets' would be supported by Brathwaite's epithet 'pricking'. But the alternative spelling 'breaks' together with 'ice' could refer to 'broken places', 'openings' or 'faults' in a geological sense (OED, break, n.), hence places of danger from which people would run. It is difficult, however, to see a connection with 'and answer none'.

9 See, e.g., C. J. Sisson, Lost Plays of Shakespeare's Age (Cambridge: Cambridge University Press, 1936).

10 Matthew Steggle, Digital Humanities and the Lost Drama of Early Modern England. Ten Case Studies (Farnham: Ashgate, 2016), pp. 8-11, citing estimates by Martin Wiggins.

11 See the 'wiki-style' database maintained by Rosalyn Knutson, David McInnis and Matthew Steggle, https://www.lostplays.org/index.php?title=Main_Page

12 Steggle, Digital Humanities, pp. 43-60, who explains that, 'as a saint who had not actually been killed for his faith, Richard was technically a Confessor', as defined by OED: 'One who avows his religion in the face of danger, and adheres to it under persecution and torture' (p. 50).

13 Ibid., pp. 23-4, 51n.

14 MacDonald P. Jackson, 'Determining Authorship: A New Technique', Research Opportunities in Renaissance Drama, 41 (2001), 1-14; reprinted in Jackson, Defining Shakespeare: 'Pericles' as Test Case (Oxford: Oxford University Press, 2003), pp. 190-217 as 'A New Technique for Attribution Studies'. Quotations will be from this version.

15 Jackson, 'Shakespeare and the Quarrel Scene in Arden of Faversham', Shakespeare Quarterly 57 (2006), 249-93. Jackson has argued the case for Shakespeare's authorship many times. See M.P. Jackson, 'Material for an edition of Arden of Faversham' (B. Litt. thesis, Oxford University, 1963); 'Shakespearean features of the poetic style of Arden of Faversham', Archiv für das Studium der neueren Sprachen und Literaturen, 230 (1993), 273-304; 'Parallels and poetry: Shakespeare, Kyd, and Arden of Faversham', Medieval and Renaissance Drama in England 23 (2010), 17-33; 'Compound adjectives in Arden of Faversham', Notes and Queries, 53 (2006), 51-5; 'Reviewing authorship studies of Shakespeare and his contemporaries, and the case of Arden of Faversham', Memoria di Shakespeare Nuova serie 8 (2012), 149-67; 'Gentle Shakespeare and the authorship of Arden of Faversham', The Shakespearean International Yearbook 11 (2011), 25-40.

16 In his Oxford thesis Jackson brusquely dismissed the arguments for Kyd's authorship made by these authors: see 'Material', pp. 91-115. Jackson has never acknowledged Rubow's book, although it was often cited by M.L. Wine in his edition of Arden of Faversham, which Jackson frequently cites.

17 Crawford, 'The Authorship of Arden of Faversham', Jahrbuch der Deutschen Shakespeare Gesellschaft 39 (1903), 74-86, quoted from Crawford, Collectanea, First Series (Stratford-on-Avon, 1906), pp. 101-30 (113, 118). I have collected the matches with Kyd noted by Crawford, Miksch, and Rubow on my website: http://www.brianvickers.uk/?page_id=808

18 See Byrne, 'Bibliographical clues in Collaborate Plays', Library, 4th ser., 13 (1932), 21-48.

19 See Jackson, 'Shakespeare and the Quarrel Scene', 255, and 'Material for an Edition', pp. 65-78.

20 Lukas Erne, Shakespeare as Literary Dramatist (Cambridge: Cambridge University Press, 2003), p. 84.

21 Martin Wiggins, in association with Catherine Richardson, British Drama 1533-1642: A Catalogue. Volume III: 1590-1597 (Oxford: Oxford University Press, 2013), p. 9. For a thoughtful discussion of the process of assigning dates see ibid., Volume I: 1533-1566 (Oxford: Oxford University Press, 2012), pp. xxxix-xli.

22 Inconsistently, Jackson assigns to Kyd's Soliman and Perseda the date of its entry in the Stationers' Register, 1592, rather than its probable first performance, which Erne places 'in 1588 or 1589', and Wiggins in '1588'. See Erne, Beyond 'The Spanish Tragedy': A Study of the Works of Thomas Kyd (Manchester: Manchester University Press, 2001), p. 160; Wiggins, British Drama, Volume II: 1567-1589 (Oxford: Oxford University Press, 2012), #799 (p. 403).

23 2 Henry VI (1591), 3 Henry VI (1591), The Taming of the Shrew (1592), The Two Gentlemen of Verona (1594).

24 Quotations, by scene and line number, are from M.L. Wine (ed.), The Tragedy of Arden of Faversham (London: Methuen, 1973), abbreviated as 'AF'.

25 The phrase 'moody discontented fury' (3.1.123).

26 'New Technique', p. 193.

27 I have used WCopyfind 4.1.1, a free program developed at the University of Virginia by Dr Lou Bernard.

28 See http://www.shakespearestext.com/can/index.htm 'Collocations and n-grams'.

29 See F.S. Boas (ed.), The Works of Thomas Kyd (Oxford: Clarendon Press, 1901), p. 340. I have emended 'will spent youth' to 'well spent'.

30 Jackson has rejected this match before.

31 Weber, 'Shakespeare after all? The authorship of Titus Andronicus 4.1 reconsidered', Shakespeare Survey 67 (2014), 69-84 (p. 80, n. 43).

32 See, e.g., J. M. Sinclair, Corpus, Concordance, Collocation (Oxford: Oxford University Press, 1991); John Sinclair (ed. with Ronald Carter) Trust the Text: language, corpus and discourse (London and New York: Routledge, 2004); Alison Wray, Formulaic Language and the Lexicon (Cambridge: Cambridge University Press, 2002).

33 Pruitt recorded that 'The exact-match test is optimal for short passages, at least with the current limitations of our searches and data collection tools. Finding exact matches for this one scene comprising 128 lines and 1,118 words took over five weeks of full-time work, plus additional work from a two-person team to eliminate duplicate results... This manual labour presents an insurmountable barrier to large-scale application of the technique ...' (op. cit., p. 104). The use of anti-plagiarism software, which certainly meets Pruitt's criterion of 'clear, reliable, and reproducible procedures', is far less time-consuming. A careful search can be performed in days, rather than weeks.

34 For my additional matches see Appendix 2, nos. 3, 5, 6, 7, 9, 11, 12, 13, 20, 21, 25, 29, 31, 32, 33, 34, 36, 42, 43, 47, 48, 55, 56, 57, 61, 62, 68, 70. For supplementary matches see nos. 4, 15, 18, 23, 24, 41, 44, 45, 49, 51, 69.

35 In his 2001 essay Jackson ensured that 'high-frequency examples (such as "of the") are ignored' (198).

36 Steggle, email 16 July 2018.

37 Cf. AF 10.83: 'Why should he thrust his sickle in our corne?'

38 Kyd favoured this epithet as the penultimate word of a line. Cf. also 'distresfull trauellers' (Sp. T 2.2.46), 'distresfull words' (Sp. T 3.13.75), 'distresful wretch' (Corn. 5.1.338), and 'distresfull wife' (AF 3.13.51).

39 Play dates are from Martin Wiggins, British Drama 1533-1642: A Catalogue, 11 vols. (Oxford, 2012-).

40 Weber and Pruitt suggest 'Vertue, and Stedfastnesse Possesse hir hart' (DA 30), but the sense differs.

41 Pruitt also cites 'The times of truce sette downe by Marshall lawe' (Troy 292), but the sense differs.

42 Pruitt compares 'print thy sorrowes plaine, | That we may know' (75-6) with 'That sweet plaine that beares her pleasant weight' (DB 58), but the sense differs.

43 Pruitt misreads this word ('Jove') as 'Love'.

44 Cf. 'Sent by the heauens for Prince Saturnine' (1.1.335), another Peele scene.

(ProQuest: Appendix omitted.)

Word count: 6615

© 2019. This work is published under https://creativecommons.org/licenses/by/3.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

A smaller repository, Literature Online (LION), resembling the Chadwyck-Healey collections, contains over a third of a million full-text works of poetry, prose and drama in English, together with online criticism and a reference library.4 The transformation of these vast resources from microfilm to CD-ROM and finally online, has opened them up to a world-wide public. In partnership with ProQuest and with more than 150 libraries, their aim is to generate 'highly accurate, fully-searchable, SGML/XML-encoded texts corresponding to books from the Early English Books Online Database'.5 Where other electronic databases have been produced by the inaccurate method of optical character recognition, the TCP texts provide keyboarded full-text transcriptions of EEBO images, linked to the individual page images. In 1709 Nicholas Rowe proposed to read 'brakes of vice': as Steggle puts it, 'an easy aural error; "vice" would fit with "virtue" in the line before', and the antithesis would be clarified.7 No illustrative contemporary example had been found for this phrase, but in searching EEBO-TCP Steggle discovered that 'the "brakes" are a metaphor for vanity, self-indulgence or foolish entanglement', as in a 1629 devotional tract by Richard Brathwaite, which refers to 'the pricking brakes of sensuality' and 'the brakes of vanitie'.8 Three centuries later, an electronic database supports an emendation by the first Shakespeare editor. The value of these resources to authorship studies seems undeniable, and MacDonald Jackson, that frequent pioneer, recognised its potential in an essay published in 2001.14 It is an accepted fact that, due to the intense competition between Elizabethan theatre companies, dramatists regularly wrote under time pressure and were prone to repeat words and phrases that they had used before.

Details

Title: Is EEBO-TCP / LION Suitable for Attribution Studies?
Author: Vickers, Brian (Institute of English Studies, School of Advanced Study)
Pages: 1-34
Publication year: 2019
Publication date: 2019
Publisher: Matthew Steggle, Editor, EMLS
ISSN: 1201-2459
Source type: Scholarly Journal
Language of publication: English
ProQuest document ID: 2335159502
