Q? What can I do with a taxonomy?
A. You can use a taxonomy in many ways:
- Provide a common language basis for all activities within an organization.
- Coordinate documents, people, and activities by using the same descriptive terminology throughout.
- Categorize content (documents) using taxonomy terms.
- Search the data in your database by taxonomy term.
- Use taxonomy terms as metadata for HTML records.
- Use the taxonomy as a browsable topic list in a portal or web site.
For more detail on the business utility of a taxonomy, see this Taxodiary article.
Q? What is indexing and why bother?
A. Indexing is a way to increase retrieval precision and accuracy by consistent application of subject terms in their preferred forms. Good indexing leads to the optimal combination of precise retrieval and comprehensive recall. The result is retrieval of documents appropriate for the search and blocking documents that may share a common word but are conceptually irrelevant.
Q? Why is an indexed search better than a full-text search?
A. The results of a search will always depend on the search query you enter. Computer systems cannot infer your intent.
In a full-text search, if your query words do not match words in the document text, you will not get back any documents. If they do match, the matching words may not reflect what the document is primarily about, so you get back many irrelevant documents. What you will not get are documents on your topic that express it in different words. An indexed search avoids both problems, because documents are retrieved by the subject terms assigned to them rather than by the words that happen to appear in the text.
Q? What is relevance?
A. Relevance is how closely the results of a search match the searcher’s needs. It is subjective, and it is often undermined when a search word can be interpreted differently from what the searcher had in mind. Using a thesaurus, taxonomy, or controlled vocabulary to index your content helps ensure that searchers find what they are looking for.
Q? What is the difference between a controlled vocabulary, a taxonomy, and a thesaurus?
A. A controlled vocabulary is a set of terms that are valid for indexing (categorizing, generally by topic or subject) a set of documents. The list is generally alphabetized, but no further internal organization of terms is implied.
A taxonomy is a controlled vocabulary presented in an outline view, also called a hierarchy. Terms are organized in categories or branches reflecting general concepts (Top Terms), major groups (Broader Terms), and more specific concepts (Narrower Terms).
A thesaurus is a controlled vocabulary, often displayed as a taxonomy, with additional information about the terms. Additional information can be synonyms (non-preferred terms or Use/Used for indicators), related terms (in separate parts of the hierarchy but closely associated), a scope note (explanation of the term within the scope of the vocabulary), an editorial note (clarification for editorial use of the term), and term history. More advanced thesauri may include additional information such as alternate languages, numerical codes, and status designations for terms.
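The three structures can be compared in a small sketch. This is only an illustration: the TermRecord class, its field names, and the sample terms are assumptions made for this example, not a Data Harmony or Thesaurus Master schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TermRecord:
    """One thesaurus entry (hypothetical layout for illustration only)."""
    preferred: str
    broader: Optional[str] = None                        # None marks a Top Term
    related: list[str] = field(default_factory=list)     # Related Terms
    used_for: list[str] = field(default_factory=list)    # non-preferred synonyms
    scope_note: str = ""


records = {r.preferred: r for r in [
    TermRecord("Arts"),
    TermRecord("Photography", broader="Arts", related=["Optics"]),
    TermRecord("Science"),
    TermRecord("Physics", broader="Science"),
    TermRecord("Optics", broader="Physics", related=["Photography"],
               used_for=["Light (physics)"],
               scope_note="Study of the behavior and properties of light."),
]}


def controlled_vocabulary(records):
    """Flat alphabetical list: terms only, no internal organization."""
    return sorted(records)


def outline(records, parent=None, depth=0):
    """Taxonomy view: the same terms as an indented hierarchy."""
    lines = []
    for rec in sorted(records.values(), key=lambda r: r.preferred):
        if rec.broader == parent:
            lines.append("  " * depth + rec.preferred)
            lines.extend(outline(records, rec.preferred, depth + 1))
    return lines


print(controlled_vocabulary(records))
print("\n".join(outline(records)))
print(records["Optics"])   # thesaurus view: the full term record
```

The flat list and the outline contain the same terms; only the thesaurus record carries the synonyms, related terms, and scope note.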
Q? What are the advantages of a taxonomy?
A. A taxonomy’s hierarchical organization makes it easy to locate the most accurate subject indexing term. The hierarchical view makes it easy to drill down through categories to find terms. You don’t need to know the first letter of the word or phrase to find the concept, as you do with an alphabetical list.
Q? Are there extra advantages to a thesaurus?
A. In addition to the hierarchical display of the taxonomy view, a thesaurus may be viewed as individual term records with the added value of conceptual associations, notes on use, translations to other languages, and other supplemental information. This information gives the user ideas for alternative terms that may be appropriate for indexing or for finding documents on conceptually related topics.
Q? What is Data Harmony?
A. Data Harmony is a line of software products created and offered by Access Innovations, Inc. These software tools work independently or in combination to provide a robust and effective system for categorization, indexing, classification, and filtering of data. The main components of the Data Harmony software suite are MAIstro, Thesaurus Master, M.A.I. (Machine Aided Indexer), and XIS (XML Intranet System). In concert, these applications offer superior content management that is efficient and scalable.
Q? Can Data Harmony products help me find what I’m looking for and improve search results?
A. Data Harmony promotes precise and consistent topic or subject labeling of documents and other content items, following rules of use designed for your specific needs. The result is pinpoint accuracy in document retrieval, allowing more effective knowledge management. Data Harmony enables information specialists to do their jobs better.
Q? How can Data Harmony benefit me?
A. Data Harmony enables your organization to use consistent terminology across all business avenues.
For example, if you are selling content (books, journals, periodicals, manuscripts, etc.) on your website, Data Harmony can help your users to find what they are looking for. This results in increased sales, meaning more generated income.
Q? Who uses Data Harmony?
A. Our customers are corporations, associations (non-profit and for-profit), secondary publishers, and keepers of portal and knowledge management systems who need to classify, store, and retrieve large quantities of information objects.
Q? What is automatic indexing?
A. Automatic indexing is the application of index terms from a thesaurus or controlled vocabulary using established algorithmic programs.
Q? What is machine aided indexing?
A. Machine aided indexing, also known as assisted indexing, allows a person to select from a list of the automatically suggested index terms drawn from a thesaurus or controlled vocabulary.
Q? What is the difference between “automatic” and “assisted” indexing?
A. Automatic indexing will work without human review or intervention and is great for filtering large amounts of data, such as large legacy files. In assisted or machine aided indexing, a human reviews the suggested terms and selects the most appropriate terms. Data Harmony supports both modes of operation.
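The difference between the two modes can be sketched in a few lines. This is a simplified illustration, not Data Harmony's implementation or API: the RULEBASE dictionary, the function names, and the plain substring matching are all assumptions made for the example.

```python
# Hypothetical rulebase: text trigger (lowercase) -> preferred indexing term.
RULEBASE = {
    "solar power": "Solar energy",
    "solar energy": "Solar energy",
    "photovoltaic": "Photovoltaic cells",
    "wind turbine": "Wind power",
}


def suggest_terms(text, rulebase=RULEBASE):
    """Return the indexing terms whose triggers appear in the text."""
    lowered = text.lower()
    return sorted({term for trigger, term in rulebase.items()
                   if trigger in lowered})


def index_automatic(text):
    """Automatic mode: every suggested term is applied without review."""
    return suggest_terms(text)


def index_assisted(text, review):
    """Assisted mode: a reviewer accepts or rejects each suggestion.

    `review` stands in for the human editor; it is any function that
    takes a suggested term and returns True to keep it.
    """
    return [term for term in suggest_terms(text) if review(term)]


abstract = "New photovoltaic panels make solar power cheaper."
print(index_automatic(abstract))
print(index_assisted(abstract, review=lambda term: term != "Photovoltaic cells"))
```

Both modes start from the same suggestions. The only difference is whether a person filters them before they are applied.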
Q? How does Data Harmony software connect to other systems?
A. We connect to other systems through an application programming interface (API). This allows different systems to communicate seamlessly.
Q? What gains can I see with Data Harmony?
A. There are two major gains when using the Data Harmony products: productivity and quality.
Productivity is measured by increased speed and increased quality. Data Harmony users have reported increases in productivity from fourfold to nearly sevenfold. If you have three indexers who each normally index 5 items an hour, each would be able to index between 20 and 35 items an hour using Data Harmony. The productivity increase would balance the software investment in just a few months.
Another important measure is the quality of indexing. Our users experience:
- a substantial increase in the quality and consistency of indexing
- indexing for all the important topics in an item, to the level of specificity that your taxonomy provides
- a marked decrease in the tendency for editors to use the same indexing terms for different concepts
- less editorial drift, which is the tendency for editors to use different indexing terms for the same concept over a period of time
Q? Is my business too small or simple for M.A.I. or Thesaurus Master?
A. Do you give different users access to your information? Do you have more than 5,000 items in a data collection or more than 14 fields of data to access through your system? Do you have a thesaurus or controlled vocabulary with more than 1,000 terms? If the answer to any of these is “yes,” then Data Harmony products can improve the organization and retrieval of your data, to help you get the information you need.
Q? Who will build the rulebase?
A. The basic rulebase of simple and synonym rules is constructed programmatically by M.A.I. For fine-tuning of the rules, we can train your editors to construct your rulebase. Alternatively, we can develop it for you, using your documents, and then pass it over to you for continued use and maintenance. Once trained, most of our customers prefer to build and maintain their rulebase independently.
Q? How long does it take to build a rulebase?
A. Data Harmony’s MAIstro automatically creates a rulebase when a controlled vocabulary/taxonomy/thesaurus is created.
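To give a sense of what generating a starter rulebase from a vocabulary can look like, here is a minimal sketch. It assumes the vocabulary is a mapping from each preferred term to its non-preferred synonyms. The build_rulebase function and the tuple layout are hypothetical and do not represent MAIstro's actual algorithm or rule format.

```python
def build_rulebase(thesaurus):
    """Derive one simple rule per term and one synonym rule per synonym.

    thesaurus: dict mapping a preferred term to a list of its synonyms.
    Returns a list of (kind, trigger, suggested_term) tuples.
    """
    rules = []
    for preferred, synonyms in thesaurus.items():
        # Simple rule: the term's own wording suggests the term.
        rules.append(("simple", preferred.lower(), preferred))
        for synonym in synonyms:
            # Synonym rule: a non-preferred wording suggests the preferred term.
            rules.append(("synonym", synonym.lower(), preferred))
    return rules


thesaurus = {
    "Solar energy": ["Solar power", "Sun power"],
    "Wind power": [],
}
rules = build_rulebase(thesaurus)
for kind, trigger, term in rules:
    print(f"{kind:8} IF text contains '{trigger}' THEN suggest '{term}'")
```

Because the rules are derived mechanically from the vocabulary, the starting rulebase exists as soon as the vocabulary does. Editorial fine-tuning then refines individual rules.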
Q? How long will it take to index my legacy collection?
A. This depends on how you decide to index your legacy collection. Legacy collections can be indexed in either automatic or assisted mode. We have found that automatic indexing takes less than half a minute per record when each record consists of a title and an abstract. An editor using the Data Harmony software in assisted mode can index between 6 and 10 records per hour.
Q? Do you use training sets?
A. Building a training set requires searching for a set of documents that are about a particular concept. Since this effort must be repeated for each concept, the time required can be significant.
We find that a rulebase approach is more efficient, more flexible, and easier and less costly to maintain. A rulebase can be easily managed by an editor, not requiring the more expensive services of a programmer. There is no limit to the number of controlled vocabulary terms or size of thesaurus it serves. Modification or addition of rules is easily accomplished. A rulebase does not require research to locate and prove a large corpus of documents that exemplify the concept represented by a single term.
Q? Aren’t rulebases difficult to manage?
A. Quite the contrary. Rules governing the use of indexing terms are accessible and transparent, not hidden in a virtual black box. The editor can review and fine-tune the requirements for term use at any time to produce more accurate term suggestions. M.A.I. automatically maintains statistics that point out any discrepancies between the editor’s use of terms and the M.A.I. suggested terms. Maintaining the rulebase to continually improve indexing term suggestions takes approximately one to two hours weekly, depending on the style of content.
Q? What does it take to maintain a rulebase?
A. After an initial period of rulebase preparation, the time required for maintenance drops off steadily. Maintaining the rulebase to continually improve indexing term suggestions takes approximately one to two hours weekly, depending on the style of content.
Q? Who will maintain a rulebase and taxonomy (thesaurus)?
A. We can easily train the editors who manage your document collection and indexing to maintain your rulebase and taxonomy. Alternatively, you can outsource the task to our experienced editorial staff.
Q? Do I have to come to you to add or change a term?
A. No, you don’t. For each term that you add, M.A.I. creates a simple rule for its use for indexing documents. You can modify the rule at any time to improve its function. If you change a term, existing rules are automatically changed to suggest the revised term. These processes are quick and simple, and performed by your staff.
Q? How does a search system work?
A. Although front ends and options vary, all search systems build an inverted index (a mapping from the words and numbers in the content to the documents that contain them) and then run queries against that index. This is true for all search systems, including those of MarkLogic, OpenText, MuseGlobal, PostgreSQL, Microsoft FAST Search Server, Oracle, MySQL, Sequel, and SAP.
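A toy version of an inverted index makes the idea concrete. The sketch below is a generic illustration, not the internals of any product named above. The sample documents, the simple tokenizer, and the function names are assumptions made for the example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase words and numbers."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every token in the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


docs = {
    1: "Jaguar speed on the open savanna",
    2: "The Jaguar XJ is a luxury car",
    3: "Big cats of South America",
}
index = build_inverted_index(docs)
print(search(index, "jaguar"))      # animal and car documents both match
print(search(index, "jaguar car"))
print(search(index, "leopard"))     # no matching word, so no results
```

The query "jaguar" returns both the animal and the car document, while document 3 is never found under that query even if it is about jaguars, because the word does not appear in its text.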
M.A.I. can work with a search system in two ways:
- M.A.I. enables construction of an inverted index of thesaurus terms that are associated with content. These indexing terms, arranged in the inverted index, provide consistent and accurate subject access to the data. This is the preferred method in well-formed databases with field formatting or metadata access.
- M.A.I. can transition from a searcher’s query word to the valid indexing term, and then access documents indexed with that term. This works well with natural language query systems.
M.A.I. enables precision indexing for highly accurate retrieval of documents or information objects. A search system provides the supplemental ability to spot important words not yet incorporated in the indexing thesaurus.
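Both approaches can be sketched together. The example below is a simplified assumption of how an inverted index of indexing terms and a query-to-term lookup might fit together. The sample data and function names are hypothetical and are not part of the M.A.I. API.

```python
from collections import defaultdict

# Indexing terms already assigned to each document (sample data).
DOC_TERMS = {
    "doc1": ["Myocardial infarction", "Aspirin"],
    "doc2": ["Myocardial infarction"],
    "doc3": ["Aspirin"],
}

# Non-preferred wordings mapped to the valid indexing term (sample data).
ENTRY_TERMS = {
    "heart attack": "Myocardial infarction",
    "acetylsalicylic acid": "Aspirin",
}


def build_term_index(doc_terms):
    """First approach: an inverted index keyed by thesaurus term."""
    index = defaultdict(set)
    for doc_id, terms in doc_terms.items():
        for term in terms:
            index[term].add(doc_id)
    return dict(index)


def resolve_query(query, term_index, entry_terms):
    """Second approach: move from the searcher's wording to a valid term."""
    wanted = query.strip().lower()
    for term in term_index:
        if term.lower() == wanted:
            return term
    return entry_terms.get(wanted)


def search(query, term_index, entry_terms):
    """Return the documents indexed with the term the query resolves to."""
    term = resolve_query(query, term_index, entry_terms)
    return set(term_index.get(term, set())) if term else set()


TERM_INDEX = build_term_index(DOC_TERMS)
print(search("heart attack", TERM_INDEX, ENTRY_TERMS))
print(search("Aspirin", TERM_INDEX, ENTRY_TERMS))
```

A query for "heart attack" reaches documents indexed with "Myocardial infarction" even though the searcher never typed that term.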
Q? What makes a good content management system (CMS)?
A. A good content management system includes the following parts:
- Input system for document creation and editing
- Search system to find the documents in the system
- The display of the documents to the user either by web site (portal), print, or a customized user interface
- Administration modules to create custom reports and document sets
Nice-to-have features include hooking to a publishing system or portal interface.
Q? Can Data Harmony work with a content management system (CMS)?
A. Yes, Data Harmony’s M.A.I. can work with any CMS via APIs. CMS examples are Microsoft SharePoint 2010 and 2007, MarkLogic, OpenText, MuseGlobal, Oracle, and SAP.
Data Harmony integrates with a CMS through documented application programming interfaces (APIs). Depending on the client’s needs, indexing terms may be presented interactively in real time document by document, or by a batch approach, filtering and suggesting terms for a number of documents at a time.
We work with other software vendors, as well as systems from MarkLogic and OpenText, to create custom integrations.
Q? What are Internet protocols and why are they important?
A. We believe in using the Internet for ease of data movement and communication. If you are sending data over the Internet, you have to comply with Internet protocols.
Data Harmony uses TCP/IP, which stands for Transmission Control Protocol and Internet Protocol. TCP ensures that the data arrives at its destination complete and in the correct order. IP addresses the data and routes it across the network in discrete packets; on Ethernet networks, a packet typically carries at most 1,500 bytes.
We use TCP/IP because it can run on any network: an organization’s internal LAN, a WAN, or a worldwide network. This network flexibility means that Data Harmony products can be used without being impeded by any network constraints, can be accessed by users with password access from anywhere, and are as scalable as the Internet in the size of community they can serve.
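The ordered, complete delivery that TCP provides can be seen in a small self-contained sketch. It uses Python's standard socket module on the local loopback address and has nothing to do with Data Harmony's own networking code. The function names and the echo behavior are assumptions made for the example.

```python
import socket
import threading


def echo_once(server_sock):
    """Accept one connection, read until the client stops sending, echo it back."""
    conn, _ = server_sock.accept()
    with conn:
        chunks = []
        while True:
            data = conn.recv(1024)
            if not data:          # client has finished sending
                break
            chunks.append(data)
        conn.sendall(b"".join(chunks))


def round_trip(payload):
    """Send payload over a loopback TCP connection and return the echo."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        server.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
        server.listen(1)
        port = server.getsockname()[1]
        worker = threading.Thread(target=echo_once, args=(server,))
        worker.start()
        with socket.create_connection(("127.0.0.1", port)) as client:
            client.sendall(payload)
            client.shutdown(socket.SHUT_WR)  # signal end of data to the server
            received = []
            while True:
                data = client.recv(1024)
                if not data:
                    break
                received.append(data)
        worker.join()
    return b"".join(received)


message = bytes(range(256)) * 40             # 10,240 bytes, read in pieces
echoed = round_trip(message)
print(len(echoed), echoed == message)
```

The receiver reads the stream in pieces of up to 1,024 bytes, yet the reassembled bytes match the original in content and order. That guarantee is what TCP adds on top of IP's packet delivery.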
Q? In what computer language are Data Harmony products written?
A. Data Harmony products are written in the Java programming language. Java is platform independent, operating on Windows, Macintosh, UNIX, Oracle Solaris, and Linux operating systems.
Q? What are the system requirements?
A. Data Harmony requires the Java Runtime Environment (JRE) version 1.6 or higher.
Q? How does Data Harmony software relate to a portal?
A. Data Harmony software enables you to build and maintain a taxonomy of indexing terms that describe documents. A portal uses the taxonomy in two ways:
- Add the taxonomy hierarchy as a browsable topic list to a portal or web site. HTML links from the topic terms provide access to the documents.
- Add the taxonomy terms as metadata to HTML records, filling in name and/or keyword fields for access by search systems and spiders.
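Both uses can be sketched with a few lines of code that generate HTML. This is an illustration under stated assumptions: the nested-dictionary taxonomy, the url_for helper, and the /topics/ URL scheme are hypothetical, and a real portal would use its own templates and link structure.

```python
from html import escape
from urllib.parse import quote


def url_for(term):
    """Hypothetical URL scheme for a topic page."""
    return "/topics/" + quote(term.lower().replace(" ", "-"))


def topic_list(tree):
    """Render a nested dict (term -> narrower terms) as a browsable list."""
    if not tree:
        return ""
    items = []
    for term, narrower in tree.items():
        link = f'<a href="{escape(url_for(term), quote=True)}">{escape(term)}</a>'
        items.append(f"<li>{link}{topic_list(narrower)}</li>")
    return "<ul>" + "".join(items) + "</ul>"


def meta_keywords(terms):
    """Render a document's indexing terms as an HTML keywords meta tag."""
    content = escape(", ".join(terms), quote=True)
    return f'<meta name="keywords" content="{content}">'


taxonomy = {"Energy": {"Solar energy": {}, "Wind power": {}}}
print(topic_list(taxonomy))
print(meta_keywords(["Solar energy", "R&D"]))
```

The first function produces the browsable hierarchy, and the second fills a keyword field that search systems and spiders can read.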
Q? Can I use Data Harmony remotely?
A. Yes. Because Data Harmony uses accepted Internet protocols, the data can be accessed over the Internet.
Q? How do I get my data to M.A.I.?
A. You can get your data into M.A.I. in either of the following ways:
- Your data input system can be modified to use the Data Harmony API for sending data interactively through the M.A.I. rulebase. This produces indexing term suggestions interactively, on a document-by-document basis.
- You can use the Data Harmony program to process groups or collections of documents as a batch.
Q? What is an API?
A. An application programming interface is a set of methods that enable one program to communicate with another program. Data Harmony has published APIs that allow other software to hook to MAIstro, Thesaurus Master, and M.A.I.
Q? Are there guidelines for taxonomies or thesauri?
A. There are several sets of guidelines. NISO (the National Information Standards Organization) has produced the Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies.
Q? Do you provide training?
A. Access Innovations, Inc. can provide training at your facilities or our office. The training is two days for M.A.I. and one day for Thesaurus Master and covers all aspects of using the software.
Q? Who are your trainers?
A. Our trainers are experienced taxonomists familiar with Data Harmony software, as well as with techniques and best practices for taxonomy creation and development, indexing rule set development, and abstracting and indexing. In addition, our trainers have trained many information specialists on thesaurus construction and indexing practices through conference workshops, seminars, webinars, and on site at our customers’ facilities.