Methodologies for classifying special collections and repositories for retrieval and discovery

by Robert T. Kasenchak, Jr., Taxonomist/Director of Business Development, Access Innovations, Inc.

Originally published in: Catalogue & Index Issue 192 September 2018

Abstract

Large, specialized repositories of content present unique problems for information discovery and retrieval: the sheer size of the repositories, the specialization (and correspondingly demanding expectations) of the researcher-users, and the specialized language (including many acronyms and other abbreviations) found in any sufficiently advanced field of research. No two collections cover identical subject matter; what is the best practice for classifying such collections to optimize retrieval?

What is a “Special Collection”?

By special collections or repositories I mean digital collections held by libraries, publishers, aggregators, and other organizations.

Some such collections cover a broad topic, for example, PLOS One or Science Magazine, both of which publish broadly on the topic of science, albeit with different emphases. Other collections have more specific topical foci, such as those held by society and association publishers; topics can be as broad as physics or chemistry, or as focused as optics or cancer research. Still others are a subset of a larger holding of content; many large university libraries have region-based collections, papers of local politicians, theses and dissertations, and other specialized collections.

The size of these collections ranges from the tens of thousands to the millions of content items. Below are some representative examples:

JSTOR: over 12,000,000 items, including about 7 million research articles
IEEE: over 4,000,000 items (research articles, proceedings, and other content)
PLOS One: over 300,000 research articles
American Institute of Physics: over 900,000 items
University of Florida Electronic Theses and Dissertations: approximately 30,000 items

Special Collections and Discovery

The basic tenets of information science have long prescribed subject-based classification—indexing the content using a knowledge organization system (KOS) such as a taxonomy, thesaurus, ontology, or other controlled vocabulary—as a pillar of aiding content discovery and retrieval; this is even more important for large digital collections.

Free-text searches of large repositories are ineffective on their own; the ambiguities of language and variations in usage are prohibitive barriers to effective retrieval in large collections. Two problems in particular stand out: ambiguous words (homonymy and polysemy), where one word has several meanings, and synonymy, where several words share one meaning.

Failings of Free-Text Searching

The goal of good information retrieval is to deliver to the user all of the relevant information on a topic (high recall) and only the relevant information (high precision). No researcher wants to look through tens of thousands of results for a search on the text string “mercury” to find the content about astronomy (and not about cars, chemical elements, or Roman gods).

A Google Scholar search for “mercury” returned some 2.8 million results, and the first six results included articles about both planetary science and elemental chemistry.

Conversely, forcing users to brainstorm every possible variant and synonym of a concept is not an avenue conducive to effective research (or delighted users, or repeat customers, or renewed subscribers).

The examples above illustrate the failings of free-text-only searching: this process considers neither synonymy nor polysemy; it simply searches for words in text. In some cases, not even regular language variants (such as simple English plural forms) are accounted for.
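The failings described above can be made concrete with a minimal sketch. It assumes a toy corpus of five invented documents, hypothetical subject tags, and a hypothetical table of entry terms; none of these come from a real collection or search system. It compares exact free-text matching with a lookup through a controlled vocabulary.

```python
import re

# Toy corpus; the documents and identifiers are invented for illustration.
DOCS = {
    "d1": "Transit observations of Mercury refine models of its orbit.",
    "d2": "Mercury exposure in freshwater fish exceeds safe limits.",
    "d3": "The innermost planet shows a weak magnetic field.",
    "d4": "Heart attacks are more common in winter months.",
    "d5": "Risk factors for myocardial infarction in older adults.",
}

# Hypothetical subject indexing: each document tagged with preferred terms.
INDEX = {
    "d1": {"Mercury (planet)"},
    "d2": {"Mercury (element)"},
    "d3": {"Mercury (planet)"},
    "d4": {"Myocardial infarction"},
    "d5": {"Myocardial infarction"},
}

# Hypothetical entry terms (synonyms and variants) mapped to preferred terms.
ENTRY_TERMS = {
    "heart attack": "Myocardial infarction",
    "heart attacks": "Myocardial infarction",
    "myocardial infarction": "Myocardial infarction",
}


def free_text_search(query):
    """Return ids of documents containing the query as a whole word or phrase."""
    pattern = re.compile(r"\b" + re.escape(query) + r"\b", re.IGNORECASE)
    return sorted(doc_id for doc_id, text in DOCS.items() if pattern.search(text))


def subject_search(term):
    """Resolve the term to its preferred form, then return documents tagged with it."""
    preferred = ENTRY_TERMS.get(term.lower(), term)
    return sorted(doc_id for doc_id, tags in INDEX.items() if preferred in tags)


# Polysemy: the string match mixes the planet with the element and misses d3.
print(free_text_search("mercury"))           # ['d1', 'd2']
print(subject_search("Mercury (planet)"))    # ['d1', 'd3']

# Synonymy and plurals: the singular phrase matches nothing at all.
print(free_text_search("heart attack"))      # []
print(subject_search("heart attack"))        # ['d4', 'd5']
```

In this toy case the free-text query for “mercury” has a precision of 1/2 and a recall of 1/2 for the astronomy topic, while the subject query retrieves both planetary documents and nothing else.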

Classification Systems for Information Retrieval in Special Collections

It is well established (Jackson, 1971; Coates, 1988; Jain and Wadekar, 2011) that classification aids retrieval. Is it more effective to use some standard and readily available classification scheme, such as the Dewey Decimal Classification (DDC) or the Library of Congress Subject Headings (LCSH), or a custom scheme designed for a specific collection? Or are author-supplied (or other uncontrolled) keywords just as effective?

Author-Supplied Keywords

Generally speaking, author-supplied keywords are not useful for retrieval in large repositories. As authors of research articles (with the exception of library and information scientists) are not trained in the basic tenets of information tagging and retrieval, the keywords they supply do not follow good indexing practices (Janda, 2014).

Often, they are too broad to be useful for retrieval (“biology”) or heedless of the ambiguity of language (“evolution”). Regardless, without reference to a standardized vocabulary including synonyms and other variants, such inexpertly applied keywords are unhelpful for retrieval.

Existing Vocabularies or Subject Headings

There are many vocabularies published under open-source (and other free-to-access) licenses (see for example www.bartoc.org for a useful repository). One or more of these can be a good starting point for custom vocabulary construction. However, existing vocabularies seldom match precisely the coverage of any given special collection. For example, using the Library of Congress Subject Headings (LCSH) to index a subject-specific collection will provide far more subjects than the collection requires, yet too little granularity in the subject-specific area to index the content thoroughly enough for specialized researchers and accurate information retrieval.

In addition, as special collections tend to be very large, it is increasingly common to require some kind of automatic indexing or classification for them; they are simply too large to index each content item by hand. Subject-heading style vocabularies (for example, again, LCSH) are particularly ill-suited to automatic classification without substantial pre-processing rules and/or training sets, as they are expressed in the logic of subject headings instead of natural language phraseology. Pre-coordinated subject-heading entries seldom match the language used in writing, making one-to-one matches of text to subject heading difficult.

For example, consider the LCSH Subject Heading:

American drama–Afro-American authors

It is difficult to imagine this phrase (complete with punctuation, and considering the archaic language in the subject heading) appearing as such in a modern research paper; therefore, mappings of natural language to subject heading concepts (or substantially curated training sets) are required for any kind of automatic categorization.
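One simple way to picture such a mapping is a rule that assigns the heading when certain natural-language phrases co-occur in the text. The sketch below assumes a hypothetical rule format and invented trigger phrases; it is not the rule syntax of any particular indexing product, and a production rule base would be far larger and curated by editors.

```python
import re

# Hypothetical rule set: a heading is suggested only when every phrase group
# has at least one of its phrases present in the text.
RULES = {
    "American drama–Afro-American authors": [
        ["african american", "black american"],
        ["playwright", "playwrights", "dramatist", "dramatists"],
    ],
}


def normalize(text):
    """Lowercase the text, replace punctuation with spaces, and pad with spaces."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return " " + " ".join(words) + " "


def suggest_headings(text):
    """Return the headings whose phrase groups are all satisfied by the text."""
    norm = normalize(text)
    hits = []
    for heading, groups in RULES.items():
        if all(any(f" {phrase} " in norm for phrase in group) for group in groups):
            hits.append(heading)
    return hits


abstract = "August Wilson is among the most celebrated African-American playwrights."

# The literal wording of the heading never occurs in the abstract ...
print("afro american authors" in normalize(abstract))   # False
# ... but the rule still maps the natural language to the heading.
print(suggest_headings(abstract))
print(suggest_headings("Arthur Miller was an American playwright."))   # []
```

The padding with spaces keeps matches on whole words, so “playwright” does not fire inside a longer word; the plural forms therefore have to be listed explicitly in this simplified version.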

Collection-Specific Vocabularies

The most useful practice for the indexing and cataloging of special collections for retrieval is to use collection-specific vocabularies (Hedden, 2010). As no two collections cover identical subject matter, the breadth and depth of a vocabulary required to index any given collection is unique. However, existing vocabularies can make excellent starting points for the construction of customized, collection-specific vocabularies. As noted above, many vocabularies are published under various open-source licenses; other vocabularies can be licensed or otherwise borrowed with permission.

To adapt an existing vocabulary to be suitable for indexing a special collection, it is necessary both to remove unnecessary terms/branches (describing subjects not included in the collection) and to augment the vocabulary to include subjects not included in (and, often, in more granular detail than described by) the starting vocabulary. This can be achieved by a number of means, including text and data mining (TDM) operations on the collection to be indexed and review by subject matter experts. Terms derived from search logs and other forms of tracking user behavior can also be useful additions.
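The two adaptation steps can be sketched in a few lines. The example assumes a hypothetical starting vocabulary stored as a term-to-broader-term mapping and three invented article titles; the two-word phrase count stands in for a real TDM pipeline, which would also handle stop words, longer phrases, and term weighting. The candidates it produces are only suggestions for subject matter experts to review.

```python
import re
from collections import Counter

# Hypothetical starting vocabulary: term -> broader term (None marks a top term).
BROADER = {
    "Science": None,
    "Physics": "Science",
    "Optics": "Physics",
    "Lasers": "Optics",
    "Zoology": "Science",
    "Ornithology": "Zoology",
}

# Invented titles standing in for the collection to be indexed.
TEXTS = [
    "Fiber lasers for nonlinear optics",
    "Nonlinear optics in fiber lasers",
    "Advances in nonlinear optics",
]


def prune(broader, unwanted_root):
    """Return a copy of the vocabulary without unwanted_root and its narrower terms."""
    def in_branch(term):
        # Walk up the hierarchy until the unwanted root or a top term is reached.
        while term is not None:
            if term == unwanted_root:
                return True
            term = broader[term]
        return False

    return {term: parent for term, parent in broader.items() if not in_branch(term)}


def candidate_terms(texts, vocabulary, min_docs=2):
    """Two-word phrases found in at least min_docs texts and absent from the vocabulary."""
    known = {term.lower() for term in vocabulary}
    doc_freq = Counter()
    for text in texts:
        words = re.findall(r"[a-z]+", text.lower())
        doc_freq.update({" ".join(pair) for pair in zip(words, words[1:])})
    return sorted(p for p, n in doc_freq.items() if n >= min_docs and p not in known)


# Step 1: remove a branch the collection does not cover.
vocabulary = prune(BROADER, "Zoology")
print(sorted(vocabulary))                    # ['Lasers', 'Optics', 'Physics', 'Science']

# Step 2: mine candidate additions for expert review.
print(candidate_terms(TEXTS, vocabulary))    # ['fiber lasers', 'nonlinear optics']

# After review, an approved candidate is placed under a broader term.
vocabulary["Nonlinear optics"] = "Optics"
```

Search-log terms could be passed through `candidate_terms` in the same way as titles, since the function only needs a list of strings.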

Using collection-specific controlled vocabularies to index and surface content in special collections can also provide useful features for discovery interfaces. Notably, providing type-ahead suggestions and surfacing the hierarchy of the controlled vocabulary are particularly efficient at directing searchers to the vocabulary term closest to their areas of interest.

Collection-Specific Vocabularies Enable Search and Browse Features in Interfaces

Providing type-ahead (sometimes called “predictive search”) suggestions based on the controlled vocabulary used to index/catalog a collection helps to eliminate the guesswork involved in inventing search terms (as illustrated above). As such, it is especially helpful to provide type-ahead suggestions for non-preferred terms (synonyms and other variants) and redirect the searcher to the results for the preferred version of the term.

Surfacing the hierarchy of the controlled vocabulary used to index a collection has several benefits. In addition to allowing the user to browse the vocabulary to find topics of interest, it can also provide a good overview of the breadth of topical areas covered in the collection. Hierarchy browse also allows the user to explore the depth (granularity) of the vocabulary which, if the vocabulary is well formed, will correspond to the granularity of the collection. Neither of these benefits can be achieved by using free-text search, author-supplied keywords, or some existing vocabulary not customized for a specific special collection.
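The type-ahead redirection from non-preferred to preferred terms can be sketched as follows. The vocabulary entries and function names are hypothetical, and the linear scan over all labels is chosen only for brevity; a real interface would more likely rely on a prefix index or the suggestion features of its search platform.

```python
# Hypothetical vocabulary: preferred term -> non-preferred terms (synonyms, variants).
VOCABULARY = {
    "Myocardial infarction": ["Heart attack", "MI"],
    "Myopia": ["Nearsightedness", "Short-sightedness"],
    "Hearing loss": ["Deafness"],
}


def build_lookup(vocabulary):
    """Map every lowercased label to a (label as displayed, preferred term) pair."""
    lookup = {}
    for preferred, variants in vocabulary.items():
        lookup[preferred.lower()] = (preferred, preferred)
        for variant in variants:
            lookup[variant.lower()] = (variant, preferred)
    return lookup


def suggest(prefix, lookup, limit=5):
    """Return up to limit (label, preferred term) pairs whose label starts with prefix."""
    prefix = prefix.lower()
    matches = sorted(key for key in lookup if key.startswith(prefix))
    return [lookup[key] for key in matches[:limit]]


lookup = build_lookup(VOCABULARY)

# Typing "hea" offers a preferred term and a synonym; choosing the synonym
# sends the searcher to the results indexed under "Myocardial infarction".
for label, preferred in suggest("hea", lookup):
    target = "" if label == preferred else f" (see: {preferred})"
    print(label + target)
```

Because each suggestion carries its preferred term, the interface can run the actual query against the indexed term no matter which variant the searcher picked.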

References

Coates, E. (1988). The Role of Classification in Information Retrieval: Action and Thought in the Contribution of Brian Vickery. Journal of Documentation, 44(3), pp.216-225.
Hedden, H. (2010). Taxonomies and controlled vocabularies best practices for metadata. Journal of Digital Asset Management, 6(5), pp.279-284.
ISKO UK Report, “The Great Debate” discussing the relevance (or irrelevance) of taxonomies in modern information retrieval: http://www.iskouk.org/content/great-debate#EventReport
Jackson, D. (1971). Classification, Relevance, and Information Retrieval. Advances in Computers, 11, pp.59-125. doi:10.1016/S0065-2458(08)60630-0
Jain, Y. and Wadekar, S. (2011). Classification-based Retrieval Methods to Enhance Information Discovery on the Web. International Journal of Managing Information Technology, 3(1), pp.33-44.
Janda, K. (2014). How Do You Say, ‘Read Me?’ Or Choosing Keywords to Retrieve Information – Social Science Space. [online] Social Science Space. Available at: https://www.socialsciencespace.com/2014/10/how-do-you-say-read-me-or-choosing-keywords-to-retrieveinformation/ [Accessed 20 Sep. 2018].
Jansen, B. J., & Rieh, S. Y. (2010). The seventeen theoretical constructs of information searching and information retrieval. Journal of the American Society for Information Science and Technology, 1517-1534. doi:10.1002/asi.21358
Mitchell, M. (2017). From Byzantine Art to Winnie-the-Pooh: Classifying a private library. Catalogue and Index, (188), 43-45. Retrieved September 1, 2018, from https://archive.cilip.org.uk/sites/default/files/media/document/2017-10/ci_188_mitchell_classifying_private_library.pdf.