Access Innovations White Paper

Automatic Indexing Return on Investment (ROI): A Case Study Comparison of Rule-Based and Statistical Approaches

by Marjorie M.K. Hlava

In the search for a quick and easy solution to categorization challenges, the choices boil down to two strategies: those drawing on statistical similarities between documents and those drawing on rule bases. The statistical approach may seem to be the simple answer. Systems based on word co-occurrence claim a purely programmatic solution, which has strong appeal for many IT professionals. The alternative, a rule-based approach, may appear complex and costly, demanding significant up-front investment. But a careful comparison of the two approaches reveals important considerations with implications for both time and cost. This paper examines those considerations.

A case study using Data Harmony's M.A.I.™, a rule-based system, and an implementation using a statistics-based system (e.g. Autonomy, Nstein or Stratify) illustrates the differences. The study assumes two points:

  1. A thesaurus or controlled vocabulary is available for categorization. No other special vocabulary must be created.
  2. Hourly rates used for comparison are drawn from industry standards.

Rule Base Approach

M.A.I. uses rules to identify the topics of documents. Rules filter words in documents as they relate to controlled vocabulary terms. The system creates a simple rule base covering all terms in the controlled vocabulary and their established synonyms, maintained in Thesaurus Master™, M.A.I.'s partner software. This process is accomplished programmatically and takes about two hours. The system is fully functional at that point and ready to start indexing immediately. The simple, automatically generated rules produce about 60% accuracy, i.e., M.A.I. suggests six out of every ten terms a human editor would use to categorize a set of documents.
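The idea of a simple rule base generated from a vocabulary can be sketched in a few lines of Python. This is an illustration only, not M.A.I.'s implementation or rule syntax: the thesaurus entries, the function names, and the whole-phrase matching strategy are all assumptions made for the example.

```python
import re

# Hypothetical thesaurus: preferred term -> established synonyms.
THESAURUS = {
    "Automobiles": ["cars", "motor vehicles"],
    "Solar energy": ["solar power"],
}


def build_simple_rules(thesaurus):
    """Map each lowercase trigger phrase to the preferred term it suggests."""
    rules = {}
    for term, synonyms in thesaurus.items():
        for phrase in [term, *synonyms]:
            rules[phrase.lower()] = term
    return rules


def suggest_terms(text, rules):
    """Return preferred terms whose trigger phrase occurs as whole words."""
    lowered = text.lower()
    found = set()
    for phrase, term in rules.items():
        if re.search(r"\b" + re.escape(phrase) + r"\b", lowered):
            found.add(term)
    return sorted(found)


# One rule per preferred term and per synonym, generated with no editing.
RULES = build_simple_rules(THESAURUS)
print(suggest_terms("Solar power could charge electric cars.", RULES))
```

Because every rule comes straight from the vocabulary, a rule base of this kind can be produced without editorial effort. It cannot resolve ambiguous words, which is where editor-written rules come in.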

For some terms in the vocabulary, editors enhance the rules or write additional rules to clarify confusing meanings of words. In most cases, these editor-enhanced rules cover about 10 percent of the terms in the vocabulary. Experience shows that editors write these more complex rules at a rate of 4 to 6 per hour. A 6,000-term thesaurus would therefore typically require about 600 complex rules. At 6 rules/hour, this would take about 100 hours, or 2.5 weeks for a single editor. While editors can continue to enhance rules, most M.A.I. users rely almost entirely on the system for indexing within one month of installation. The complex rule base provides 85% accuracy or higher.
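The editorial estimate above is simple arithmetic and can be checked with a short sketch. The function name is hypothetical, and the 40-hour work week is an assumption (it is the figure that makes 100 hours equal 2.5 weeks).

```python
def complex_rule_effort(vocabulary_size, complex_share, rules_per_hour,
                        hours_per_week=40):
    """Return (rules, hours, weeks) needed to write the complex rules.

    hours_per_week=40 is an assumed full-time week for a single editor.
    """
    rules = round(vocabulary_size * complex_share)
    hours = rules / rules_per_hour
    weeks = hours / hours_per_week
    return rules, hours, weeks


# 6,000 terms, 10% needing complex rules, at the slow and fast writing rates.
for rate in (4, 6):
    rules, hours, weeks = complex_rule_effort(6000, 0.10, rate)
    print(f"{rate} rules/hour: {rules} rules, {hours:.0f} hours, {weeks:.2f} weeks")
```

At the slower rate of 4 rules per hour, the same 600 rules take 150 hours, or 3.75 weeks.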

Cost break-down:

Time to implementation:

Total cost: $250 programming + $7,920 editorial (152 + 24 hours) + $60,000 software = $64,840.
Total time frame: 1 month

Using M.A.I.'s rule-based approach to categorization, Cambridge Scientific Abstracts (CSA) achieved a four-fold increase in indexing productivity. Longer term results show M.A.I. contributed to a seven-fold increase in productivity for CSA. Lockheed Martin, working on a project for GAO, achieved 92% accuracy.


Statistical Approach

The statistical approach draws on the topic similarity between documents, based on the number and proximity of co-occurring words in the documents. The approach underlies several automatic categorization systems. Statistical tools require the collection of training sets of documents, usually 20 to 60 documents for each term. Documents must be identified that represent the concept of each taxonomy term. Terms that reflect more abstract concepts, ideas that may be expressed in many ways, require collection and review of a larger number of documents to identify the required number of documents for the term/concept.
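A minimal sketch of the statistical idea follows. It is not the algorithm of any named product: it scores a document against each term's training set using cosine similarity over word counts, a common stand-in that captures shared words but ignores word proximity. The training documents and function names are invented for the illustration, and the training sets are far smaller than the 20 to 60 documents per term described above.

```python
import math
import re
from collections import Counter

# Hypothetical training sets: taxonomy term -> example documents.
TRAINING = {
    "Energy": ["gas pipeline distribution network",
               "natural gas and electricity supply"],
    "Fraud": ["accounting fraud investigation",
              "executives charged with embezzlement"],
}


def word_counts(text):
    """Count lowercase alphabetic words in a text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine of the angle between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_term(document, training_sets):
    """Return the best-scoring term and the score for every term."""
    doc = word_counts(document)
    scores = {}
    for term, examples in training_sets.items():
        profile = Counter()
        for example in examples:
            profile.update(word_counts(example))
        scores[term] = cosine_similarity(doc, profile)
    return max(scores, key=scores.get), scores


term, scores = best_term("The pipeline carries natural gas", TRAINING)
print(term, scores)
```

The result depends entirely on which documents were collected for each term, which is why assembling and reviewing training sets dominates the effort in this approach.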

The benchmark for accuracy in automatic categorization systems is 60%. To surpass this level, whether to apply the system for content filtering by topic or to achieve real improvement in productivity, rules must be written to supplement the system. These sequel rules are essentially similar to the rules underlying M.A.I. and other rule-based systems (e.g., if a condition is met, use the term).
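The "if a condition is met, use the term" pattern can be sketched as a conditional rule. This is a hypothetical illustration, not the rule language of M.A.I. or of any statistical product: the function, its parameters, and the example term are all assumptions.

```python
import re


def apply_rule(text, trigger, term, require_any=(), exclude_any=()):
    """Suggest `term` when `trigger` appears and the context conditions hold.

    require_any: at least one of these words must also appear (if given).
    exclude_any: none of these words may appear.
    """
    words = set(re.findall(r"[a-z]+", text.lower()))
    if trigger not in words:
        return None
    if require_any and not words & set(require_any):
        return None
    if words & set(exclude_any):
        return None
    return term


# Hypothetical rule that disambiguates the word "mercury".
PLANET_RULE = dict(
    trigger="mercury",
    term="Mercury (planet)",
    require_any=("orbit", "planet", "solar"),
    exclude_any=("thermometer", "toxic"),
)

print(apply_rule("Mercury has the shortest orbit of any planet.", **PLANET_RULE))
print(apply_rule("Mercury in a broken thermometer is toxic.", **PLANET_RULE))
```

The first call suggests the term and the second returns None, because the surrounding words point to a different sense of the trigger word.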

The following analysis assumes the same 6,000-term controlled vocabulary.

Cost break-down:

An additional round of review of results, and collecting substitute documents for training sets is often required.

 Time to implementation:

Total cost: $7,500 programming + $424,575 editorial + $75,000 software = $507,075.
Total time frame: 33 weeks

Using a statistical approach to categorization, the American Psychological Association doubled its productivity. Accuracy above 72% has not been reported for this approach.


Comparison Summary

 
                                        Rules Based           Statistics Based
Time to implementation                  4 weeks               33 weeks
Staff hours                             104-154 hours         9,495 hours
                                        +24 hours training    +40 hours training
Total cost (software + labor)           $64,840               $507,075
ROI assuming 6 editors and              6 weeks               62.6 weeks
  4-fold productivity increase

The table lays out the comparative costs and times using the rule-based system and the statistical approach. Considering the total time and costs based on the assumptions outlined above, the return on investment using the rule-based system is ten times the ROI using the statistical approach.


Notes on the Software

The Data Harmony M.A.I.™ system is both efficient and cost-effective right out of the box. A simple rule base is generated automatically on the basis of your controlled vocabulary (thesaurus, taxonomy, authority file). Rules are generated for both preferred terms and specified synonym terms.

The accuracy of results from the simple rule base is enhanced by fine-tuning the rules to reflect editorial analysis, interpretation, and insight. For about 10% of the terms, complex rules are required to capture the meaning and conditions of use of the term. (This estimate varies with the wording of taxonomy terms and document writing style in a collection.)

How quickly can M.A.I. be implemented?
The software is delivered by CD-ROM or FTP immediately upon payment. Your data can be preloaded in the software for immediate use. Data in tab-delimited, comma-delimited, XML, or left-tagged ASCII format is ready to go; converting other formats requires a small amount of additional time. Our customers are typically up and running one month after the contract is signed.

What about automatic taxonomy generation?
This is partially possible. However, using training sets and full or unstructured text to create a categorization system produces many misleading associations. For example, a search on Enron in today's news co-occurs with fraud, embezzlement, and similar terms. The same search run four years ago would have co-occurred with energy, gas distribution, and the like. Using rules ensures the proper usage and application of the language over the life of the project. We recommend augmenting an existing vocabulary as a faster, more accurate, more reliable, and more consistent methodology for taxonomy creation. It is also less expensive.

Can I purchase or augment an existing thesaurus?
Yes. Data Harmony offers 40 Knowledge Domains, including ready-made thesauri with associated rule bases covering a variety of topics.
Drawing on our experience, we can also construct a thesaurus for your specific needs. Time required varies with the topic; the estimates below reflect an average 6,000-term thesaurus.
Time investment: 4 months
Cost investment: $32,000

Margie Hlava
Marjorie M.K. Hlava is chairman of Access Innovations, based in Albuquerque, New Mexico, the provider of the Data Harmony line of software used for indexing and data structuring. You can reach her at mhlava@accessinn.com