Case Study: University of Alberta

Background:

After digitizing and OCRing its newspaper archive, the University of Alberta came to Access Innovations, Inc. for a proof of concept to identify the taxonomy that would best fit its news data. The data was in both French and English, which was a further consideration when choosing a taxonomy for indexing. Three taxonomies were tested against Alberta’s data: a news-centered taxonomy, a geography-centered taxonomy, and JSTOR’s taxonomy, which covers a wide range of topics. After extensive analysis of the indexing results, the news taxonomy was chosen as the best fit. Once the proof of concept was complete, a subset of Alberta’s data focused on the early 1900s influenza outbreak was indexed and analyzed to determine whether the indexing accurately covered this historical event.

Need:

The University of Alberta needed a way to classify its archive of news articles to support future analysis and search. The university wanted to run a test to determine whether pre-existing taxonomies would fit its corpus, and which types of thesauri it might wish to implement.

Solution:

After tests with all three taxonomies, the news taxonomy, with some minor revisions, provided the best results. The results were analyzed in terms of overall statistics as well as individual term frequency counts. After revisions to the rule base, a second indexing run was conducted and the indexing terms were added to the data. Along with the indexed files, Alberta also received metrics such as term frequencies, comparisons between taxonomies, and other analyses.
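As an illustration only, the kind of comparison described above (overall statistics plus per-term frequency counts for each candidate taxonomy) could be sketched as follows. This is not the tooling used in the project: the function names, the document IDs, and the sample terms are all hypothetical, and the sketch assumes each indexing run is available as a mapping from document ID to the list of terms assigned.

```python
from collections import Counter


def term_frequencies(indexed_docs):
    """Count how many documents each indexing term was assigned to."""
    counts = Counter()
    for terms in indexed_docs.values():
        # Use a set so a term repeated within one document counts once.
        counts.update(set(terms))
    return counts


def coverage(indexed_docs):
    """Return the fraction of documents that received at least one term."""
    if not indexed_docs:
        return 0.0
    tagged = sum(1 for terms in indexed_docs.values() if terms)
    return tagged / len(indexed_docs)


# Hypothetical indexing output: taxonomy name -> {document id -> assigned terms}.
runs = {
    "news": {
        "doc1": ["Public health", "Epidemics"],
        "doc2": ["Elections"],
        "doc3": ["Epidemics"],
    },
    "geography": {
        "doc1": ["Edmonton"],
        "doc2": [],
        "doc3": [],
    },
}

# One summary row per taxonomy: overall coverage and the most frequent term.
summary = {
    name: {
        "coverage": coverage(docs),
        "top_term": term_frequencies(docs).most_common(1),
    }
    for name, docs in runs.items()
}

for name, stats in summary.items():
    print(f"{name}: coverage={stats['coverage']:.2f}, top term={stats['top_term']}")
```

Counting each term once per document keeps a single long article from dominating the frequencies, which is one reasonable choice when comparing how well different taxonomies cover a corpus.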

Results:

Alongside this taxonomy work, a new Canadian geography taxonomy was created for Alberta to accurately tag specific location data in news articles. The final enriched files were tagged and delivered to the University of Alberta, where they were successfully loaded back into the university’s system and are now used by students, faculty, and researchers.