Q. What is Data Harmony?
A. Data Harmony is a line of software products created and offered by Access Innovations, Inc.
Data Harmony software tools work independently or in combination, to provide a robust and effective system for categorization, indexing, classification, and filtering of data. The three main components of the Data Harmony software suite (MAIstro™) are Thesaurus Master®, M.A.I.™ (Machine-Aided Indexer), and XIS (XML Intranet System). In concert, these applications offer superior content management that is efficient and scalable.
Q. Can Data Harmony products help me find what I’m looking for and improve search results?
A. Data Harmony promotes precise and consistent topic- or subject-labeling of documents and other content items, following rules of use designed for your specific needs. The result is pinpoint accuracy in document retrieval, allowing more effective knowledge management. Data Harmony enables information specialists to do their jobs better.
Why Use Data Harmony?
Q. Who uses Data Harmony software?
A. Our customers are corporations, associations (non-profit and for-profit), secondary publishers, and keepers of portal and knowledge management systems that need to classify, store, and retrieve large quantities of information objects.
Q. How can Data Harmony benefit me?
A. Data Harmony enables your organization to use consistent terminology across all business avenues. For example, if you are selling content (books, journals, periodicals, manuscripts, etc.) on your website, Data Harmony can help your users find what they are looking for. This can result in increased sales and income.
Q. Is my business too small or simple for M.A.I. or Thesaurus Master?
A. Do you give different users access to your information? Do you have more than 5,000 items in a data collection, or more than 14 fields of data to access through your system? Do you have a thesaurus or controlled vocabulary with more than 1,000 terms? If the answer to any of these is “yes,” then Data Harmony products can improve the organization and retrieval of your data to help you get the information you need.
Q. What gains can I see with Data Harmony?
A. There are two major gains when using the Data Harmony products: productivity and quality.
Productivity is measured by increased speed without loss of quality. Data Harmony users have reported increases in productivity, from four-fold to nearly seven-fold. If you have three indexers who each normally index 5 items an hour, each would be able to index between 20 and 35 items an hour using Data Harmony. The productivity increase would balance the software investment in just a few months.
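As a rough check on the arithmetic above, the following sketch applies the reported four-fold to seven-fold range to the example team. The function name and its parameters are illustrative only and are not part of any Data Harmony tool.

```python
def projected_rates(baseline_per_hour, indexers, low_factor=4, high_factor=7):
    """Return (per-indexer range, team range) in items per hour.

    Assumes the baseline rate applies to each indexer individually.
    """
    per_indexer = (baseline_per_hour * low_factor, baseline_per_hour * high_factor)
    team = (per_indexer[0] * indexers, per_indexer[1] * indexers)
    return per_indexer, team


per_indexer, team = projected_rates(baseline_per_hour=5, indexers=3)
print(per_indexer)  # (20, 35) items per hour for each indexer
print(team)         # (60, 105) items per hour for the three-person team
```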
Another important measure is the quality of indexing.
Our users experience:
- A substantial increase in the quality and consistency of indexing
- Indexing for all the important topics in an item, to the level of specificity that your taxonomy provides
- A marked decrease in the tendency for editors to use the same indexing terms for different concepts
- Less editorial drift (the tendency for editors to use different indexing terms for the same concept over a period of time)
What’s Under the Hood?
Q. In what computer language are Data Harmony products written?
A. Data Harmony products are written in the Java programming language. Java is platform-independent and runs on Windows, Macintosh, UNIX, Oracle Solaris, and Linux operating systems.
Q. What are the system requirements?
A. Data Harmony requires the Java Runtime Environment (JRE) version 1.6 or higher.
Does It Employ Rule Bases?
Q. Do you use training sets?
A. No. Building a training set requires searching for a set of documents that are about a particular concept. Since this effort must be repeated for each concept, the time required can be significant.
We find that a rule base approach is more efficient, more flexible, and easier and less costly to maintain. A rule base can be easily managed by an editor, not requiring the more expensive services of a programmer. There is no limit on the number of controlled vocabulary terms or the size of the thesaurus it serves. Modification or addition of rules is easily accomplished. A rule base does not require research to locate and prove a large corpus of documents that exemplify the concept represented by a single term.
Q. Aren’t rule bases difficult to manage?
A. Quite the contrary: Rules governing the use of indexing terms are accessible and transparent, not hidden in a virtual “black box.” The editor can review and fine-tune the requirements for term use at any time to produce more accurate term suggestions. M.A.I. automatically maintains statistics that point out any discrepancies between the editor’s use of terms and the M.A.I.-suggested terms.
Maintaining the rule base to continually improve indexing term suggestions takes approximately one to two hours weekly, depending on the style of content.
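To make the idea of discrepancy statistics concrete, here is a minimal sketch that compares suggested terms with the terms an editor finally selected. The function and the labels "hits", "misses", and "noise" are assumptions chosen for illustration; the statistics M.A.I. actually maintains may be defined differently.

```python
def compare_indexing(suggested, selected):
    """Compare machine-suggested terms with the editor's final terms."""
    suggested, selected = set(suggested), set(selected)
    return {
        # suggested and accepted by the editor
        "hits": sorted(suggested & selected),
        # added by the editor but never suggested (a rule may be missing)
        "misses": sorted(selected - suggested),
        # suggested but rejected by the editor (a rule may be too broad)
        "noise": sorted(suggested - selected),
    }


report = compare_indexing(
    suggested=["Solar energy", "Wind power", "Batteries"],
    selected=["Solar energy", "Energy storage"],
)
print(report)
```

A term that keeps showing up under "misses" or "noise" is a natural candidate for rule review.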
How Are Rule Bases Handled In MAIstro?
Q. How long does it take to build a rule base?
A. Data Harmony’s MAIstro automatically creates a rule base when a controlled vocabulary/taxonomy/thesaurus is created.
Q. What does it take to maintain a rule base?
A. After an initial period of rule base preparation, the time required for maintenance drops off steadily. Maintaining the rule base to continually improve indexing term suggestions takes approximately one to two hours weekly, depending on the style of content.
Q. How long will it take to index my legacy collection?
A. This depends on how you decide to index your legacy collection. Legacy collections can be indexed in either automatic or assisted mode. We have found that automatic indexing takes less than a half-minute to process a legacy collection record, in which each record consists of a title and an abstract. An editor using the Data Harmony software in assisted mode can index between 6 and 10 records per hour.
Q. Who will build my rule base?
A. The basic rule base of simple and synonym rules is constructed programmatically by M.A.I. For fine-tuning of the rules, we at Access Innovations, Inc. can train your editors to construct your own rule base. Alternatively, we can develop your rule base for you, using your documents, then pass it over to you for continued use and maintenance. Once trained, most of our customers prefer to build and maintain their rule base independently.
Q. Who will maintain my rule base and taxonomy (thesaurus)?
A. We can easily train the editors who manage your document collection and indexing to maintain your rule base and taxonomy. Alternatively, you can outsource the task to our experienced editorial staff.
Q. Do I have to come to you to add or change a term?
A. No, you don’t. For each term that you add, M.A.I. creates a simple rule for its use for indexing documents. You can modify the rule at any time to improve its function. If you change a term, existing rules are automatically changed to suggest the revised term. These processes are quick and simple, and performed by your staff.
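The behavior described above can be sketched in a few lines. This is a simplified illustration under stated assumptions, not M.A.I.'s actual rule syntax or API: the RuleBase class, its method names, and the sample terms are all hypothetical.

```python
import re


class RuleBase:
    """Toy rule base: each rule maps text to match onto a preferred term."""

    def __init__(self):
        self.rules = {}  # lowercase match text -> preferred term

    def add_term(self, term, synonyms=()):
        self.rules[term.lower()] = term        # simple rule: the term matches itself
        for synonym in synonyms:
            self.rules[synonym.lower()] = term  # synonym rule: points to the preferred term

    def rename_term(self, old, new):
        # Existing rules are updated so that they suggest the revised term.
        for match_text, term in list(self.rules.items()):
            if term == old:
                self.rules[match_text] = new
        self.rules[new.lower()] = new           # simple rule for the new name

    def suggest(self, text):
        found = set()
        for match_text, term in self.rules.items():
            pattern = r"\b" + re.escape(match_text) + r"\b"
            if re.search(pattern, text, re.IGNORECASE):
                found.add(term)
        return sorted(found)


rule_base = RuleBase()
rule_base.add_term("Automobiles", synonyms=["cars", "motor cars"])
before = rule_base.suggest("Used cars for sale")

rule_base.rename_term("Automobiles", "Motor vehicles")
after = rule_base.suggest("Used cars for sale")
print(before, after)
```

In this sketch the old term name keeps working as match text after a rename, so documents that use the old wording still receive the revised term.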
What About Subject Indexing?
Q. What is indexing, and why bother?
A. Indexing is a way to increase retrieval precision and accuracy, done by consistent application of subject terms in their preferred forms. Good indexing leads to the optimal combination of precise retrieval and comprehensive recall. The result is the prompt retrieval of documents appropriate for the search, blocking documents that may share a common word but are conceptually irrelevant.
Q. Why is an indexed search better than a full-text search?
A. The results of a search will always depend on the search query you enter. Computer systems cannot infer your intent.
In a full-text search, if your query words do not match words in the document text, you will not get back any documents. If they do match words in a document, those words may not reflect what the document is primarily about, which increases the chance of retrieving irrelevant documents. You will also miss documents on your topic when the topic is expressed in words other than those in your query.
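Both failure modes can be shown with a toy example. The two documents, their indexing terms, and the search functions below are invented for illustration; a real search system is far more elaborate.

```python
documents = {
    1: {"text": "Outcomes after myocardial infarction in older patients",
        "terms": {"Myocardial infarction"}},
    2: {"text": "The goalkeeper nearly had a heart attack during penalties",
        "terms": {"Soccer"}},
}


def full_text_search(query):
    """Return ids of documents whose text contains the query phrase."""
    phrase = query.lower()
    return sorted(doc_id for doc_id, doc in documents.items()
                  if phrase in doc["text"].lower())


def indexed_search(term):
    """Return ids of documents indexed with the given subject term."""
    return sorted(doc_id for doc_id, doc in documents.items()
                  if term in doc["terms"])


full_text_hits = full_text_search("heart attack")
indexed_hits = indexed_search("Myocardial infarction")
print(full_text_hits)  # [2]: irrelevant match, and document 1 is missed
print(indexed_hits)    # [1]: the document that is actually about the topic
```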
Q. What is automatic indexing?
A. Automatic indexing is the application of index terms from a thesaurus or controlled vocabulary, using established algorithmic programs.
Q. What is machine-aided indexing?
A. Machine-aided indexing, also known as assisted indexing, allows a person to select from a list of the automatically suggested index terms drawn from a thesaurus or controlled vocabulary.
Q. What is the difference between automatic and assisted indexing?
A. Automatic indexing works without human review or intervention. It’s great for filtering large amounts of data, such as large legacy files. In assisted or machine-aided indexing, a human reviews the suggested terms, selecting those most appropriate. Data Harmony supports both modes of operation.
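The difference between the two modes comes down to whether a review step sits between suggestion and assignment. The sketch below assumes a deliberately naive suggester (plain substring matching) and hypothetical function names; it does not reflect how Data Harmony generates suggestions.

```python
def suggest_terms(text, vocabulary):
    """Naive stand-in for a suggester: a term is suggested if it appears in the text."""
    lowered = text.lower()
    return [term for term in vocabulary if term.lower() in lowered]


def index_automatic(text, vocabulary):
    """Automatic mode: every suggested term is applied without review."""
    return suggest_terms(text, vocabulary)


def index_assisted(text, vocabulary, review):
    """Assisted mode: a person (here, a callback) accepts or rejects each suggestion."""
    return [term for term in suggest_terms(text, vocabulary) if review(term)]


vocabulary = ["Solar energy", "Batteries", "Wind power"]
text = "Solar energy systems often ship with batteries."

automatic = index_automatic(text, vocabulary)
# The reviewer decides that batteries are only mentioned in passing.
assisted = index_assisted(text, vocabulary, review=lambda term: term != "Batteries")
print(automatic, assisted)
```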
Q. What is relevance?
A. Relevance is how closely the results from a search match the searcher’s needs. Relevance is subjective, and is often undermined when a search word can be interpreted differently from what the searcher had in mind. Using a thesaurus/taxonomy/controlled vocabulary to index content increases relevance in retrieval, making it more likely that searchers will find what they are looking for.
Q. What is the difference between a controlled vocabulary, a taxonomy, and a thesaurus?
A. A controlled vocabulary is a set of terms that are valid for indexing (categorizing, generally by topic or subject) a set of documents. The list is generally alphabetized, but no further internal organization of terms is implied.
A taxonomy is a controlled vocabulary presented in an outline view, also called a hierarchy. Terms are organized in categories or branches, reflecting general concepts (Top Terms), major groups (Broader Terms), and more specific concepts (Narrower Terms).
A thesaurus is a controlled vocabulary, often displayed as a taxonomy, with additional information about the terms. Additional information can be synonyms (non-preferred terms, or Use/Used for indicators), scope notes (explanations of the terms within the scope of the vocabulary), editorial notes (clarifications for editorial uses of the term), and term histories. More advanced thesauri may involve additional information, such as alternate languages, numerical codes, and status designations for terms.
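One way to picture the three structures is as progressively richer data. The sketch below uses made-up terms and a hypothetical TermRecord class; it is not the storage format used by Thesaurus Master.

```python
from dataclasses import dataclass, field

# Controlled vocabulary: a flat, alphabetized list of valid terms.
controlled_vocabulary = sorted(["Dogs", "Animals", "Cats", "Mammals"])

# Taxonomy: the same terms arranged as a hierarchy (nested dictionaries).
taxonomy = {"Animals": {"Mammals": {"Cats": {}, "Dogs": {}}}}


# Thesaurus: each term carries a record with additional information.
@dataclass
class TermRecord:
    term: str
    broader: list = field(default_factory=list)
    narrower: list = field(default_factory=list)
    used_for: list = field(default_factory=list)  # non-preferred synonyms
    scope_note: str = ""


cats = TermRecord(
    term="Cats",
    broader=["Mammals"],
    used_for=["Felines", "Kittens"],
    scope_note="Domestic cats only.",
)


def path_to(term, tree, trail=()):
    """Return the chain of terms from the top term down to the given term."""
    for name, children in tree.items():
        here = trail + (name,)
        if name == term:
            return here
        found = path_to(term, children, here)
        if found:
            return found
    return None


print(controlled_vocabulary)
print(path_to("Cats", taxonomy))
```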
Q. Are there guidelines for taxonomies or thesauri?
A. There are several sets of guidelines. NISO (the National Information Standards Organization), for example, has produced the Z39.19 Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies.
Q. What are the advantages of a taxonomy?
A. A taxonomy’s hierarchical organization makes it easy to locate the most accurate subject indexing term. The hierarchical view makes it easy to “drill down” through categories to find terms. You don’t need to know the first letter of the word or phrase to find the concept, as you do with an alphabetical list.
Q. Are there extra advantages to a thesaurus?
A. In addition to the hierarchical display of the taxonomy view, a thesaurus may be viewed as individual term records with the added value of conceptual associations, notes on use, translations to other languages, and other supplemental information. This information provides the user with ideas for alternative terms that may be appropriate for indexing, or finding documents on conceptually related topics.
Q. What can I do with a taxonomy?
A. You can use a taxonomy in many ways:
- Categorize content (documents) using taxonomy terms
- Provide a common language basis for all activities within an organization
- Coordinate documents, people, and activities by using the same descriptive terminology throughout
- Use taxonomy terms as metadata for HTML records
- Use the taxonomy as a browsable topic list in a portal or web site
- Search the data in your database by taxonomy term
For more detail on the business utility of a taxonomy, see this Taxodiary article.
Q. Do you provide training?
A. Access Innovations, Inc. can provide training either at your facilities or our office. The training takes two days for M.A.I. and one day for Thesaurus Master, covering all aspects of using the software.
Q. Who are your trainers?
A. Our trainers are experienced taxonomists familiar with the Data Harmony software as well as with techniques and best practices for taxonomy creation and development, indexing rule set development, and abstracting and indexing. In addition, our trainers have worked with many information specialists on thesaurus construction and indexing practices, through conference workshops, seminars, webinars, and on-site at our customers’ facilities.
What Are the System Integration Options?
Q. How does Data Harmony software connect to other systems?
A. We connect to other systems through an application programming interface (API). This allows different systems to communicate seamlessly.
Q. What is an API?
A. An application programming interface is a set of methods that enable one program to communicate with another program. Data Harmony has published APIs that allow other software to hook to MAIstro, Thesaurus Master, and M.A.I.
Q. How do I get my data to M.A.I.?
A. You can get your data into M.A.I. in either of the following ways:
- Your data input system can be modified to use the Data Harmony API for sending data interactively through the M.A.I. rule base. This produces indexing term suggestions interactively, on a document-by-document basis
- You can use the Data Harmony program to run groups or collections of documents
Q. How does Data Harmony software relate to a portal?
A. The Data Harmony software enables you to build and maintain a taxonomy of indexing terms that describe documents.
A portal uses the taxonomy in two ways:
- Adds the taxonomy hierarchy as a browsable topic list to a portal or web site. HTML links from the topic terms provide access to the documents
- Adds the taxonomy terms as metadata to HTML records, filling in name and/or keyword fields for access by search systems and spiders
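Both uses can be sketched with a little string building. The helper names, the URL scheme, and the sample terms below are assumptions for illustration; a real portal would typically generate this markup through its own templating system.

```python
from html import escape


def keywords_meta(terms):
    """Build a keywords meta tag from a document's indexing terms."""
    content = ", ".join(terms)
    return '<meta name="keywords" content="%s">' % escape(content, quote=True)


def topic_list(tree, url_for):
    """Render a taxonomy (nested dictionaries) as a nested list of links."""
    items = []
    for name, children in tree.items():
        link = '<a href="%s">%s</a>' % (escape(url_for(name), quote=True), escape(name))
        nested = topic_list(children, url_for) if children else ""
        items.append("<li>%s%s</li>" % (link, nested))
    return "<ul>%s</ul>" % "".join(items)


meta = keywords_meta(["Cats", "Dogs & puppies"])
menu = topic_list({"Animals": {"Cats": {}}}, lambda term: "/topics/" + term.lower())
print(meta)
print(menu)
```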
Q. Can I use Data Harmony remotely?
A. Because Data Harmony uses accepted Internet protocols, the data can be accessed over the Internet.
Q. What are Internet protocols and why are they important?
A. We believe in using the Internet for ease of data movement and communication. If you are sending data over the Internet, you have to comply with Internet protocols.
The Internet protocols used in Data Harmony are TCP/IP, which stands for Transmission Control Protocol and Internet Protocol. The first ensures that the data arrives at its destination complete and in the correct order. The second addresses and routes your data across the network in discrete packets, commonly of up to 1,500 bytes each on Ethernet networks.
We use TCP/IP because it can run on any network: an organization’s internal LAN, a WAN, or a worldwide network. This network flexibility means that Data Harmony products can be used without being impeded by any network constraints. Data Harmony can be accessed by users with password access from anywhere, and is as scalable as the Internet in the size of community it can serve.
Working With a Content Management System (CMS)
Q. Can Data Harmony work with a CMS?
A. Yes, Data Harmony’s M.A.I. can work with any content management system (CMS) via APIs. Some CMS examples are Microsoft SharePoint 2010 and 2007, MarkLogic, OpenText, MuseGlobal, Oracle, and SAP.
Data Harmony integrates with a CMS through documented application programming interfaces (APIs). Depending on the client’s needs, indexing terms may be presented interactively in real time (document-by-document) or by a batch approach, filtering and suggesting terms for a number of documents at a time.
We work with other software vendors, and with systems such as those from MarkLogic and OpenText, to create custom integrations.
Q. What makes a good CMS?
A. A good content management system includes the following parts:
- Input system, for document creation and editing
- Search system, to find the documents in the system
- Display of the documents to the user, either by website (portal), print, or a customized user interface
- Administration modules, to create custom reports and document sets
Among the particularly useful and convenient features of the Data Harmony software is the ability to hook into a publishing system or portal interface.
Working With Search Systems
Q. How does a search system work?
A. Although the front ends and options vary at the end of the process, all search systems build an inverted index (a mapping from each word or number in the content to the items that contain it), then run the queries against that index. This is true for all search systems, including those of MarkLogic, OpenText, MuseGlobal, PostgreSQL, Microsoft FAST Search Server, Oracle, MySQL, Sequel, and SAP.
M.A.I. can work with a search system in two ways:
- M.A.I. enables construction of an inverted index of thesaurus terms that are associated with content. These indexing terms, arranged in the inverted index, provide consistent and accurate subject access to the data. This is the preferred method in well-formed databases with field formatting or metadata access
- M.A.I. can transition from a searcher’s query word to the valid indexing term, then access documents indexed with that term. This works well with natural language query systems
M.A.I. enables precision indexing for highly accurate retrieval of documents or information objects. A search system provides the supplemental ability to spot important words not yet incorporated in the indexing thesaurus.
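The two approaches listed above, an inverted index of thesaurus terms and the translation of a query word into a valid indexing term, can be sketched together. The function names, the synonym table, and the sample data are hypothetical and are not taken from the Data Harmony API.

```python
from collections import defaultdict


def build_term_index(indexed_docs):
    """Invert {document id: [terms]} into {term: {document ids}}."""
    index = defaultdict(set)
    for doc_id, terms in indexed_docs.items():
        for term in terms:
            index[term].add(doc_id)
    return index


def search(query, synonyms, index):
    """Map the query word to its preferred term, then look it up in the index."""
    term = synonyms.get(query.lower(), query)
    return sorted(index.get(term, set()))


indexed_docs = {
    1: ["Automobiles"],
    2: ["Automobiles", "Insurance"],
    3: ["Bicycles"],
}
synonyms = {"cars": "Automobiles", "bikes": "Bicycles"}  # query word -> preferred term

term_index = build_term_index(indexed_docs)
print(search("cars", synonyms, term_index))       # [1, 2]
print(search("Insurance", synonyms, term_index))  # [2]
```

Because lookups go through the preferred term, a searcher who types "cars" reaches the same documents as one who types "Automobiles".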