Saving Time and Money with a Rule Based Approach to Automatic Indexing

Getting a Higher Return on Your Investment
by Marjorie M.K. Hlava, President, Access Innovations, Inc.

Automatic categorization systems are known by many different names. However, the information science theory behind them boils down to two major schools of thought: rule based and statistics based.

Companies advocating the statistics system hold that editorially maintained rule bases take a lot of up-front investment and higher costs overall. They also claim that statistics based systems are more accurate. On the other hand, statistics based systems require a training set up front and are not designed to allow editorial refinement for greater accuracy.

A case study may be the best way to see what the true story is. We conducted such a study to answer the following questions:

What are the real up-front costs of rule based and training set based systems?
Which approach takes more up-front investment?
Which is faster to implement?
Which has a higher accuracy level?
What is the true cost of the system with the additional cost of creating the rule base or collecting the training set?

To answer these questions, we’ll look at how each system works and then the costs of actual implementation in a real project for side-by-side comparison.

First, a few assumptions and guidelines:

There is an existing thesaurus or controlled vocabulary of about 6000 terms. If not, then the cost of thesaurus creation needs to be added.
Hourly rates and units per hour are based on field experience and industry rules of thumb.
85% accuracy is the baseline needed for implementation to save personnel time.

The Rule Based Approach

A simple rule base (a set of rules matching vocabulary terms and their synonyms) is created automatically for each term in the controlled vocabulary (thesaurus, taxonomy, etc.). With an existing, well-formed thesaurus or authority file, this is a two-hour process. Rules for both synonyms and preferred terms are generated automatically. Complex rules are needed for an average of about 10% to 20% of the terms in the vocabulary and are created at a rate of 4 to 6 per hour. So for a 6000 term thesaurus, creating 600 complex rules (10% of the terms) at 6 per hour takes 100 hours, or 2.5 man weeks. Some people begin indexing with the software immediately to get some baseline statistics and then do the rule building. Accuracy (as compared with what indexing terms a skilled indexer would select) is usually 60% with just the simple rule base and 85% to 92% with the complex rule base.
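
The automatic generation of a simple rule base can be pictured with a short Python sketch. This is only an illustration, not the vendor's implementation: the two-entry thesaurus, the function names, and the use of case-insensitive phrase matching are all assumptions made for the example.

```python
import re

# Hypothetical mini-thesaurus: preferred term -> synonyms (illustrative data only).
THESAURUS = {
    "Automatic indexing": ["machine indexing", "automated indexing"],
    "Thesauri": ["thesaurus", "controlled vocabulary"],
}

def build_simple_rules(thesaurus):
    """Create one match rule for each preferred term and each synonym."""
    rules = []
    for preferred, synonyms in thesaurus.items():
        for phrase in [preferred, *synonyms]:
            # Whole-phrase, case-insensitive match (an assumed matching policy).
            pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
            rules.append((pattern, preferred))
    return rules

def index_text(text, rules):
    """Return the sorted preferred terms whose rules match the text."""
    return sorted({preferred for pattern, preferred in rules if pattern.search(text)})

rules = build_simple_rules(THESAURUS)
terms = index_text("Machine indexing relies on a controlled vocabulary.", rules)
print(terms)
```

Every synonym rule resolves to its preferred term, so the indexer only ever assigns terms from the controlled vocabulary. Complex rules, which would add conditions beyond a plain phrase match, are outside the scope of this sketch.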

The rule based approach places no limit on the number of users, the number of terms used in the taxonomy created, or the number of taxonomies held on a server.

The software is completed and shipped via CD-ROM the day the purchase is made. For the client’s convenience, if the data is available in one of three standard formats (tab- or comma-delimited, XML, or left-tagged ASCII), it can be preloaded into the system. Otherwise a short conversion script is applied.

On average, customers are up and running one month after the contract is complete.

The client for whom we prepared the rule base reported 92% accuracy in totally automated indexing and a four-fold increase in productivity.
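
Accuracy figures such as the one reported here compare automatic output with the terms a skilled indexer would select. The article does not give the exact formula, so the sketch below assumes one simple reading: the share of the indexer's terms that the system also assigned. The function name and sample terms are hypothetical.

```python
def agreement_with_indexer(auto_terms, human_terms):
    """Share of the human indexer's terms that the system also assigned.

    This is one possible definition (assumed here), not the measure
    actually used in the case study.
    """
    human = set(human_terms)
    if not human:
        return 0.0
    return len(human & set(auto_terms)) / len(human)

# Illustrative terms only: the system finds 3 of the indexer's 4 terms.
human_terms = ["Thesauri", "Automatic indexing", "Taxonomies", "Metadata"]
auto_terms = ["Thesauri", "Automatic indexing", "Taxonomies", "Search engines"]
score = agreement_with_indexer(auto_terms, human_terms)
print(f"{score:.0%}")
```

Under this reading, terms the system assigns that the indexer did not choose do not lower the score; a stricter measure would penalize them as well.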

The up-front time and dollar investment, based on the workflow for a full implementation, is as follows:

Time and cost for rule base implementation:
Software (Includes two-day training, free, if held at the corporate headquarters or $1500/day on site. Includes free telephone support for the first year or $1500/day on site.)	$ 60,000.
Conversion of thesaurus – 2 hours (programming time @ $125/hr)	$ 250.
Loading thesaurus and creating rule base – 2 hours (editorial time @ $65/hr)	$ 130.
Complex rule building – 100-150 hours (editorial time @ $65/hr; costed here at 100 hours)	$ 6,500.
Total time frame: 2 months elapsed time
Total man hours: 104-154 plus 24 hours editor training
Up-front cost:	$ 66,880.
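
The arithmetic behind this total can be checked with a few lines of Python. The rates and hours come from the list above; the variable names are illustrative, and the sketch uses the low end of the complex rule estimate (600 rules at 6 per hour), which is what the $6,500 line item reflects.

```python
# Rates and prices as quoted in the cost list.
SOFTWARE = 60_000
PROGRAMMER_RATE = 125   # $/hr
EDITOR_RATE = 65        # $/hr

thesaurus_terms = 6000
complex_percent = 10    # low end of the 10% to 20% range
rules_per_hour = 6      # high end of the 4 to 6 per hour range

complex_rules = thesaurus_terms * complex_percent // 100   # 600 rules
rule_hours = complex_rules // rules_per_hour                # 100 hours

# (description, hours, hourly rate)
labor = [
    ("Conversion of thesaurus", 2, PROGRAMMER_RATE),
    ("Loading thesaurus and creating rule base", 2, EDITOR_RATE),
    ("Complex rule building", rule_hours, EDITOR_RATE),
]

rule_based_hours = sum(hours for _, hours, _ in labor)
rule_based_cost = SOFTWARE + sum(hours * rate for _, hours, rate in labor)
print(rule_based_hours, rule_based_cost)
```

At the high end of the estimate (150 hours of rule building), the same calculation adds $3,250 to the total.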

The Statistical Approach – Training Set Solution

To analyze the statistical approach, which requires use of a training set up front, we used the same pre-existing 6000-term thesaurus.

Software prices usually start somewhere between $75,000 and $250,000. (Though costs can run much higher, we will use the lower figure of $75,000. Some systems limit the number of terms allowed in a taxonomy, requiring an extra license or secondary file building.) Training and support are an additional expense of about $2000 per day. Usually one week of training is required ($10,000). Travel expenses may be added.

The up-front time and dollar investment, based on the workflow for implementation for the statistical (Bayesian, co-occurrence, etc.) systems, is as follows:

Time and cost for statistics system implementation:
Software	$75,000.
Collection of the training set data. The vendors state that 20 to 60 items per node are required for a good, diverse training set; the more collected, the better the resulting accuracy. One hour per term yields 6000 hours of editorial time @ $65/hr	$390,000.
Note: If data collection is done programmatically, the data sets must then be reviewed by editors to remove misleading or false drop records. The review could be done at 15 minutes per term, which would bring the cost of training set collection down to $97,500. For this case study, we collected documents for the term training sets at one hour per term, as described in the preceding item.
Run the training sets. Programming time of 40 hours @ $125/hr	$5,000.
Review the results and identify sets for terms that do not return good data sets. Editorial time of 40 hours @ $65/hr	$2,600.
Collect additional training data for bad sets. Estimating 25% of the term list yields 1500 terms, multiplied by one hour per term of editorial time @ $65/hr	$97,500.
Rerun training set – 20 hours programming time @ $125/hr	$2,500.
Review of results – 20 hours editorial time @ $65/hr	$1,300.
Most systems require at least one more round of review, collection and revision, but we stopped here. The resulting accuracy rate was 60%. For reliable improvement in productivity, or to use the process effectively for filtering data, a level of 85% accuracy is required. To achieve 85% accuracy, we had to write rules in SQL (the starting point for the rule based systems). We were able to train editors to write these rules, avoiding the need to use programmers at more expensive rates.
Write SQL rules – 4 per hour for 1500 terms (25% of the terms) = 375 hours editorial time @ $65/hr	$24,375.
Now we were ready to begin implementation of the system.
Elapsed time (assuming all people are ready and standing by to move to the next step when needed):
Tailor software for installation and delivery – 2 weeks.
Collect training sets – 6000 hours (35 man months). If six people work on the project full time, it can be completed in 6 months.
Run training sets – one week.
Review results – one week.
Rerun training sets – 3 days.
Review data – 3 days.
Write SQL rules – 9 man weeks. If two people write the rules, it can be completed in a little over a month.
Statistics based training set implementation time:
Total time frame: 33 plus weeks.
Total man hours: 7,995 man hours (the sum of the itemized hours above) plus 40 hours editor training.
Up-front cost:	$ 598,275.
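
The same roll-up can be done for the statistics based system. The hours and rates below are taken from the itemized list; the variable names are illustrative. The itemized hours sum to 7,995, the figure used in the summary table.

```python
# Rates and prices as quoted in the cost list.
STAT_SOFTWARE = 75_000
PROGRAMMER_RATE = 125   # $/hr
EDITOR_RATE = 65        # $/hr

# (description, hours, hourly rate)
stat_labor = [
    ("Collect training sets (6000 terms, 1 hr each)", 6000, EDITOR_RATE),
    ("Run the training sets", 40, PROGRAMMER_RATE),
    ("Review the results", 40, EDITOR_RATE),
    ("Collect additional data (1500 terms, 1 hr each)", 1500, EDITOR_RATE),
    ("Rerun training set", 20, PROGRAMMER_RATE),
    ("Review of results", 20, EDITOR_RATE),
    ("Write SQL rules (1500 terms at 4 per hour)", 1500 // 4, EDITOR_RATE),
]

stat_hours = sum(hours for _, hours, _ in stat_labor)
stat_cost = STAT_SOFTWARE + sum(hours * rate for _, hours, rate in stat_labor)
print(stat_hours, stat_cost)
```

Training set collection alone accounts for $390,000 of the total, which is why the choice between one hour and 15 minutes per term has such a large effect on the result.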

A two-fold productivity increase was noted using the system. Accuracy has not gone above 72% at present.

Summary

The table that follows compares the return on investment for the rule based system and the statistics based system in terms of total cost and time to implementation.

	Rule Based	Statistics Based
Total time frame	2 months	8 months
Total man hours	104 hours + 24 hours training	7,995 hours + 40 hours training
Up-front cost	$ 66,880	$ 598,275
Cost advantage	Approximately 1/9 the cost of the statistics based approach
Time advantage	4 times as fast
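
The two advantage rows follow directly from the totals in the table. A quick check in Python, with illustrative variable names:

```python
# Totals from the summary table.
rule_cost, stat_cost_total = 66_880, 598_275   # up-front cost in dollars
rule_months, stat_months = 2, 8                # total time frame in months

cost_ratio = stat_cost_total / rule_cost       # how many times more the statistics system costs
time_ratio = stat_months / rule_months         # how many times longer it takes

print(f"Cost ratio: {cost_ratio:.1f}x, time ratio: {time_ratio:.0f}x")
```

A cost ratio of roughly 8.9 corresponds to the "approximately 1/9" in the table, and the time ratio of 4 to "4 times as fast".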

It is apparent that considerable savings in both time and money can be gained by using a rule based system instead of a statistics based system: roughly one-ninth the up-front cost and one-quarter the elapsed time, based on the assumptions outlined above.