The Search for Machine-Aided Indexing: Why a Rule-Based System is the Cost-Effective Choice

by Larry Compton

Executive summary
Data Harmony is one of several companies that offer machine-aided indexing software for indexing and categorizing unstructured data. Data Harmony’s Machine Aided Indexer™ is a rule-based system, whereas most of our competitors combine linguistic and statistical approaches to indexing data. Many of our competitors advertise their products as “totally automated,” “out of the box” solutions.

Customers have concerns that rule-based systems are more expensive, more difficult to implement, and more time-consuming. However, research contracted by Data Harmony found that some of our competitors’ customers have found those products to be more expensive and more difficult to implement than advertised, and not “out of the box” solutions. Rule-based solutions, such as Data Harmony’s, are faster and less expensive to implement, and they produce reliable, reproducible results. Indexing accuracy is lower with linguistic and statistical approaches, whereas it reaches 85 to 93 percent with Data Harmony.

Introduction
It has been estimated that the number of web pages is growing by 10 million per day. Vast amounts of unstructured data, whether internal documents on corporate intranets or pages of content on the Internet, need to be accessed by users. Machine-aided indexing (sometimes referred to as computer-aided indexing) is the use of computer software programs to categorize and index this unstructured data. It has been used for many years by libraries, data centers and government agencies to provide access to documents and information through online catalogs and fee-based information retrieval services. This technology is now being used to ease user access to the exploding volume of information available on the Internet.

There are a number of companies that offer machine-aided indexing software products. The major methods of machine-aided indexing of text are 1) linguistic, 2) statistical, and 3) rule-based.

Linguistic systems use programs that extract words or phrases from the text on the basis of linguistic (that is, part of speech or position within a sentence) criteria.

Statistical systems use programs that extract words or phrases from the text on the basis of statistical (frequency of the word or phrase within the text) criteria.

Rule-based systems use programs that extract words or phrases from the text on the basis of programmatic rules (algorithms of the “if … then” variety).
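
To make the difference between the statistical and rule-based approaches concrete, the following is a minimal Python sketch. It is an illustration only: the stopword list, the rule format, the sample terms and the function names are all hypothetical, and none of it represents Data Harmony's actual rule syntax or any competitor's implementation.

```python
import re
from collections import Counter

# Hypothetical stopword list for the statistical sketch.
STOPWORDS = {"a", "an", "and", "has", "in", "is", "of", "the", "to"}

# Hypothetical rule base: (trigger pattern, required context pattern or None,
# approved term). Illustrative only; not Data Harmony's rule syntax.
RULES = [
    (r"\bmercury\b", r"\b(planet|orbit|solar)\b", "Mercury (planet)"),
    (r"\bmercury\b", r"\b(toxic|contamination|metal)\b", "Mercury (metal)"),
    (r"\bwater treatment\b", None, "Water treatment"),
]


def frequent_words(text, top_n=3):
    """Statistical approach: return the most frequent non-stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(word for word in words if word not in STOPWORDS)
    return [word for word, _ in counts.most_common(top_n)]


def suggest_terms(text):
    """Rule-based approach: IF the trigger (and any required context)
    appears in the text, THEN suggest the approved term."""
    lowered = text.lower()
    suggestions = []
    for trigger, context, term in RULES:
        if not re.search(trigger, lowered):
            continue
        if context is not None and not re.search(context, lowered):
            continue
        if term not in suggestions:
            suggestions.append(term)
    return suggestions


sample = "Mercury levels rose. Mercury contamination complicates water treatment."
print(frequent_words(sample, top_n=1))  # ['mercury']
print(suggest_terms(sample))            # ['Mercury (metal)', 'Water treatment']
```

In this sketch the frequency count surfaces the raw word "mercury" without saying which sense is meant, while the rules use surrounding context to map the same word to one approved term from a controlled vocabulary.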

Data Harmony’s rule-based system is built over the linguistic layer. Many of the available indexing software products use a combination of two or more of these methods. For example, Active Navigation, Autonomy, Nstein and NetOwl all use software that combines both linguistic and statistical techniques. Inktomi’s Quiver software is linguistic and rule-based, and Smartlogic uses all three methods.

Many customers are looking for a totally automated solution for their document indexing needs, and some of our competitors advertise their products as totally automated solutions. For example, Active Navigation claims that their software “…automatically categorizes and organizes content, enabling quick and painless retrieval of information…” and that it “…eliminates the need to manually examine and link documents and ensures that content is intelligently classified and discovered in the correct context.” Nstein Technologies notes that since “…manual indexing is time-consuming, expensive and can produce inconsistent results,” its products are the right solution because these “…content management tools automate time-consuming and complex tasks such as content classification….”

Data Harmony

Data Harmony does not claim to totally automate the indexing and categorization process. Rather, our rule-based software is designed to help human indexers increase their indexing efficiency and consistency. Our Machine Aided Indexer (M.A.I.)™ facilitates selection of terms from controlled vocabularies, authority files, or full thesauri. It presents a list of approved terms to the editor for selection, which saves the time otherwise spent looking up terms manually and speeds processing.

Data Harmony’s customers have experienced a four-fold increase in efficiency using the M.A.I. while measurably improving consistency and coverage of individual records. M.A.I. runs on four components: the Rule Builder, the Rule Base, the Concept Extractor, and the Statistics Collector.

Data Harmony’s M.A.I. is 100 percent consistent in its application of rules. As for the appropriateness and accuracy of the automatically suggested terms, our M.A.I. produces only 7 to 15 percent misses or “noise,” yielding 85 to 93 percent relevance and precision of indexing in the resulting data sets.
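
The relationship between the noise rate and the precision figures quoted above is simple arithmetic, sketched below in Python. The function name and the counts of 100 suggested terms are hypothetical, and the sketch assumes that “noise” means suggested terms that an editor would reject.

```python
def precision_from_noise(suggested, noise):
    """Share of suggested terms that were appropriate.

    suggested: total number of terms suggested (hypothetical count)
    noise:     number of those suggestions that were inappropriate
    """
    if suggested <= 0:
        raise ValueError("suggested must be positive")
    if not 0 <= noise <= suggested:
        raise ValueError("noise must be between 0 and suggested")
    return (suggested - noise) / suggested


# 7 to 15 inappropriate terms per 100 suggestions
print(round(precision_from_noise(100, 7), 2))   # 0.93
print(round(precision_from_noise(100, 15), 2))  # 0.85
```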

Customer concerns

The objections that customers raise to a rule-based system such as Data Harmony’s include 1) the upfront cost, that is, Data Harmony’s price is considered too high when compared to the advertised prices of other products; and 2) the fact that it is rule-based. Customers perceive rule-based systems as requiring too much investment in staff time for programming. They desire an “off the shelf” solution: a product that can be implemented with minimal time and staff effort, and that will not require additional commitment or refining. When looked at over a two-year horizon, however, rule-based solutions such as Data Harmony’s M.A.I. are actually less expensive.

What customers should also be concerned about is quality. How accurate are the results obtained with the use of so-called “totally automated” solutions of our competitors?

Actual customer experiences
The questions to consider are, “What are the actual results of purchasing these indexing tools that are not rule-based? Have these customers been satisfied with their results, and do these software products actually deliver the accuracy and efficiency that they promise?” Data Harmony employed a research consultant to conduct a survey of our competitors’ customers. The following five questions were asked of each customer:

1. Are you satisfied with the performance of (the software product)?

2. How much did it cost you to implement in upfront costs (dollars)?

3. How much did it cost you to implement in staff time (man-hours)?

4. How long did it take to get (the software product) up and running?

5. Has it been easy to maintain, or have there been problems or shortcomings with this product?

This survey yielded some interesting feedback from some of our competitors’ customers.

Survey results

Nstein Technologies

A major psychological database company purchased the Nstein software to use in indexing documents for their well-known psychological abstract database. A manager of the company was recently interviewed by telephone. She reported that they were not looking to ever replace human indexers, but rather that their main objective was to reduce the time spent by personnel on indexing. This was accomplished by Nstein with an average reduction of 7 to 8 minutes per record.

The manager reported that the “comprehensiveness” of their records was not good at first with Nstein. They spent a lot of time – about three months – testing the system and records after the initial installation of Nstein. The Nstein technical support staff helped them to add rules to bring the “comprehensiveness” of the records up to 70%, the standard for their data records. The manager added that the Nstein staff worked with them on refining key words, as comprehensiveness was a problem. She noted that soon they will be refining the key words again when the latest edition of their thesaurus is ready.

In an interview with Information Today, in response to a question specifically asking how long it took for Nstein to be operational at this company, Nstein President Randy Marckino did not address that case specifically and answered the question by replying only that the “typical on-site installation” required “1 to 10 days of labor” from Nstein staff.

A Euromoney institutional investor company, serving emerging markets specialists and multinational corporations in the U.S. and Europe, also purchased the Nstein MAI product. A manager with this company was recently interviewed by telephone. He reported that so far Nstein has achieved an 80 to 85% accuracy rate in the categorization of documents.

The manager added that they still must change the API application. He also admitted that they have yet to actually test the “throughput” ability of Nstein, i.e., can Nstein process the number of documents per hour that it claims it can handle? He is not sure of Nstein’s “robustness” yet.

How much of a financial investment, both in actual costs and staff time, did the Nstein software require?

The database company manager could not give an exact figure for their final actual costs for purchasing Nstein; however, she did state that it was “not cheap.” She admitted that it was more expensive than all of the other MAI software products that they considered. (A press release from Nstein reported that the deal was worth approximately $CAN 450,000.)

When asked about staffing requirements, the manager estimated that it took the time of five full-time indexers and two indexing managers about a “month or so” at first. She added that there is a need for “constant” (she then rephrased that to “annual”) training.

The investment company manager preferred not to discuss the actual implementation costs of Nstein, as there was a good deal of negotiation with non-cash assets involved. (A press release from Nstein dated March 14, 2002 reported that the deal was a five-year deal valued at over $CAN 650,000.)

The overall commitment required in staffing was estimated by the investment company manager to be two full-time employees for three to four months per language (they use nine languages). He added that it was “not a simple process,” requiring the time of four people for three to four weeks in the setting up of the English “knowledge base.” The evaluation process then took the time of eight employees for three to four hours per day for one month. This process involves sending 50 to 100 documents per “taxonomy” to Nstein, which then returns a knowledge base. The question then for Nstein customers is, according to this manager, how much time do they want to spend refining the results to reach the desired accuracy?
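
To give a sense of scale, the per-language staffing estimate above can be extended across all nine languages. The Python sketch below is a rough illustration only: the function name is hypothetical, and it assumes that every language requires the same effort as the manager's per-language estimate, which the interview did not confirm.

```python
def person_months(staff, months_low, months_high, languages):
    """Return a (low, high) range of total person-months.

    Assumes, for illustration, that each language takes the same effort.
    """
    low = staff * months_low * languages
    high = staff * months_high * languages
    return low, high


# Two full-time employees, three to four months per language, nine languages
print(person_months(2, 3, 4, 9))  # (54, 72)
```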

Active Navigation

A well-known British publisher of international weapons and military information purchased Active Navigation’s Portal Maximizer to manage the content of their more than 500,000 document files. How well has this product worked for them?

An IT employee was interviewed recently by telephone. She agreed that overall, Active Navigation’s Portal Maximizer has been a good product, but admitted that it has required a “substantial amount” of “editorial massaging” (Active Navigation’s terminology), because it is category- or topic-based. The Active Navigation software will do standard linking; since they have about 130 products, they had to produce a link for each one.

It was approximately eight to ten months before they could “go live” with the Active Navigation software in use. It has required “lots of editorial hand-holding,” most of this work being the creation of taxonomies. For example, they found out “by accident” that they must account for semantic differences in terms, e.g. one must differentiate between place names and names of equipment. This has required “continual” updating, according to the employee.

This contrasts with an article in Network World that claimed that “within a month (Portal Maximizer) had thousands of documents analyzed, indexed and categorized without manual tagging or heavy coding.”

The employee reported that Active Navigation’s Portal Maximizer software requires a “substantial amount” of development, both up-front and ongoing. She agreed that Portal Maximizer is very useful if the customer has the staff to handle this investment. She added that it is not an “out of the box solution” and concluded that it was “not as dynamic a process” as it could be.

How much of a financial investment did the Active Navigation software product require?

The employee said that Active Navigation is “not cheap” and includes an “ongoing cost.” Specifically, the Active Navigation contract had a purchase price of £132,000, with an additional £10,000 for additional software licenses and consultancy. The company also pays a maintenance fee of £1,000 per month.

Regarding staff time, the employee estimated that it took about six staff members nearly six months to implement the system.
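
The fees quoted above can be added up over the two-year horizon mentioned earlier. The Python sketch below is a simple illustration: the function name is hypothetical, the figures are the ones reported by the employee, and staff time is deliberately left out because no cost was given for it.

```python
def total_cost(purchase, extras, monthly_maintenance, months):
    """Total vendor fees over a period, excluding internal staff time."""
    return purchase + extras + monthly_maintenance * months


# Purchase price, extra licenses and consultancy, and monthly
# maintenance over 24 months, all in pounds sterling
two_year_fees = total_cost(132_000, 10_000, 1_000, 24)
print(two_year_fees)  # 166000
```

On these figures the vendor fees alone come to £166,000 over two years, before counting the six staff members' time.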

Data Harmony customers

What do Data Harmony’s customers have to say about their experience with our software products and services?

Kurt Keeley is Database Manager in the Technical Support Group of the American Water Works Association (AWWA), an organization of over 50,000 water supply professionals. AWWA uses Data Harmony’s Machine Aided Indexer™ (M.A.I.) to speed publication of its WATERNET database. Mr. Keeley told us that “…our staffing was going down, the work load was increasing, and we needed more efficient ways to store and move data without major staff intervention. I estimate that M.A.I. has improved our productivity by 50 percent.” He adds: “We’re still experiencing improved productivity and flexibility to handle new projects unheard of just 2 years ago. Data Harmony is allowing us the opportunity to move relatively seamlessly in step with the organization’s new content management system.” When asked if M.A.I. was a good investment, he stated that it was an “…excellent investment, reliable, scalable, and ahead of its time for internal data storage and management.”

Cambridge Scientific Abstracts (CSA) has been a premier publisher of abstracts and indexes to scientific and technical research literature for over 30 years. Scott Ryan is the Editor/Development Specialist in the Engineering Specialties Division of CSA in Cleveland, Ohio. According to Mr. Ryan, “M.A.I. doesn’t cause any of the inefficiencies and other problems associated with ‘complete’ automation because it doesn’t try to replace human beings. Instead, it supplements human judgment with powerful, well-designed software.” He also reports that they “…have found it to be accurate and efficient, meeting and exceeding our expectations for its performance. The rule-builder is flexible, powerful, and intuitively clear … M.A.I. has increased our capacity for editorial throughput and freed editorial time for other non-indexing tasks, while also helping to improve our indexing quality by holding down editorial ‘drift.’ We are delighted with both the package itself and the technical support we have received from Data Harmony.” “We were up in two weeks,” adds Carole Houk, Managing Editor for Materials Information, “and we noticed improvement in productivity right away.” As a bonus, M.A.I. was easily hooked to the existing Cuadra Star system.

What about data quality using M.A.I.? Kelly Quinn at Lockheed GAO reports an accuracy rate of 92 percent.

Conclusions
Many of our competitors imply that their products are a “plug-in” solution, and that the categorization and indexing are a totally “automated” process. However, our survey found that customers of both Active Navigation and Nstein reported that they have had to spend a “substantial” amount of staff time updating and refining the systems to get the desired quality of indexing, and that there was a lot of “back and forth” between them and the respective vendors during the implementation. These packages were described by actual clients as not being “out-of-the-box” solutions. Customers reported delays in implementation and some initial deficiencies in quality of search results.

Regarding costs, we found that some of our competitors’ customers had to pay quite a bit more than the advertised list prices of the products, which range from $75,000 to $125,000. The refinements needed to correct poor quality add to the costs. Compare these costs to that of Data Harmony’s M.A.I.™, which is available for $60,000. Maintenance fees also need to be considered. The consensus of the customers we talked to seems to be that the quality of the machine-aided indexing from our competitors’ products depends on “how much time you want to spend further refining the data.”

What should customers be concerned about, in addition to cost? The consistency, relevance, accuracy and appropriateness of the indexing of their data. All indexing can be fully “automatic,” but how good is that indexing compared to what a human would select and understand? Data Harmony’s M.A.I. will help your staff index your data more proficiently and with better quality.