by Jay Ven Eman, Ph.D., CEO of Access Innovations, Inc.
What do you make of “198”? You could assume it is a number. Computer applications make no such assumption, since it could be an integer, octal, decimal, etc. Neither you nor the computer could do anything useful with it. What if we added a period, so that “198” becomes “1.98”? Maybe it represents the value of something, such as its price. If we found it embedded in additional information, we would know more: “It cost 1.98.” The reader now knows that it is a price, but software applications are still unable to figure that out. There is also much the reader still doesn’t know. “It cost 1.98¥.” “It cost 1.98£.” “It cost $1.98.” There is even more information you would want. Wholesale? Retail? Discounted? Sale price?
Basic interpretation is something humans do very well, but software applications do not. Now imagine a software application trying to find the gasoline station nearest to your present location that has gas for $1.98 or less. Per gallon? Per liter? Diesel or regular? Using your location from your car’s GPS and a wireless Internet connection, such a request is theoretically possible, but it is beyond even the most sophisticated software applications using Web resources. They cannot do the reasoning based upon the current state of information on the Web.
Trying to search the Web based upon conceptual search statements adds more complications. Looking for information about “lead” using just that term returns a mountain of unwanted information about leadership, your water, and conditions at the Arctic Ocean. Refining the query to indicate you are interested in “lead based soldering compounds” helps. Even so, software applications still cannot reason or draw inferences from keywords found in context. At present, only humans are adept at interpreting context.
The “Semantic Web” is a series of initiatives to help make more of the vast resources accessible via the Web, available to software applications and agents, so that these programs can perform at least rudimentary analysis and processing to help you find that cheaper gasoline. The Web Ontology Language, OWL, is one such initiative and will be described here in relation to thesauri and taxonomies.
At the heart of the Semantic Web are words and phrases that represent concepts that can be used for describing Web resources. Basic organizing principles for concepts exist in the present thesaurus standards (ANSI/NISO Z39.19 found at http://www.niso.org/ and ISO 2788 and ISO 5964 found at http://www.iso.org/) and are being expanded in drafts of revisions to the standards.
The reader is directed to the standards’ Web sites referenced above and to http://www.accessinn.com/ and http://www.dataharmony.com/ for basic information on thesaurus and taxonomy concepts. It is assumed here that the reader will have a basic understanding of what a thesaurus is, what a taxonomy is, and related concepts. Also, a basic understanding of the Web Ontology Language, OWL, is required. OWL is a W3C recommendation and is maintained at the W3C Web site. For an initial investigation of OWL, the best place to start is www.w3.org/.
From the OWL Guide, “OWL is intended to provide a language that can be used to describe the classes and relations between them that are inherent in Web documents and applications.” OWL formalizes a domain by defining classes and properties of those classes; defining individuals and asserting properties about them; and reasoning about these classes and individuals. Ontology is borrowed from philosophy. Ontology is the science of describing the kinds of entities in the world and how they relate.
An OWL ontology may include classes, properties, and instances. Unlike ontology from philosophy, an OWL ontology includes instances, or members, of classes. Classes and members, or instances, can have properties and those properties have values. A class can also be a member of another class. OWL ontologies are meant to be distributed across the Web and to be related as needed. The normative OWL exchange syntax is RDF/XML.
A thesaurus is not an ontology. It does not describe kinds of entities and how they are related. One would learn very little about the domain of medicine by studying a medical thesaurus. You would discover important terms in the field, how terms are related, what terms have broader concepts and what terms encompass narrower concepts. An inference or reasoning engine would be unable to draw any inferences from a basic “broader term/narrower term” pairing like “nervous system/central nervous system.” Is it a whole/part, instance, parent/child, or other kind of relationship?
Using OWL, more information can be described about the classes represented by thesaurus terms and about the relationships between classes, subclasses, and members. In the Data Harmony Thesaurus Master™ software, the terms “nervous system” and “central nervous system” would have the labels BT and NT, respectively. A software agent would not be able to make use of these labels and the relationship they describe unless the agent is custom coded. The purpose of OWL is to provide descriptive information using RDF/XML syntax that would allow OWL parsers and inference engines, particularly those not within the control of the owners of the target thesaurus, to use the incredible intellectual value contained in a well-developed thesaurus.
OWL then is used to describe labels such as BT, NT, NPT, and RT (see note 1), and to describe additional properties about classes and members such as the type of BT/NT relationship between two terms. Additional power can be derived when two or more OWL ontologies are mapped. This would allow Web software agents to determine the meaning of subject terms (keywords) found in the metadata element of Web pages, to determine if other Web pages containing the same terms have the same meaning, and to make additional inferences about those Web resources.
The Data Harmony programmers have developed a ‘first order’ OWL output from Thesaurus Master™. The Thesaurus Master™ OWL output provides semantic meaning to the basic classes and properties of a thesaurus. Such an output becomes a true Web resource and can be used more effectively by automated processes. Another layer of OWL wrapped around subject terms from an OWL level thesaurus and the resources (such as Web pages) these subject terms are describing would be an order of magnitude more powerful, but also more complicated and difficult to implement. (We’re working on it.)
Let us now look at a sample Thesaurus Master OWL output.
2. <!DOCTYPE rdf:RDF [
3. <!ENTITY rdf "http://www.w3.org/1999/02/22-rdf-syntax-ns#" >
4. <!ENTITY owl "http://www.w3.org/2002/07/owl#" >
5. <!ENTITY xsd "http://www.w3.org/2001/XMLSchema#" > ]>
6. <rdf:RDF
7. xmlns ="http://localhost/owlfiles/DHProject#"
8. xmlns:DHProject ="http://localhost/owlfiles/DHProject#"
9. xmlns:base ="http://localhost/owlfiles/DHProject#"
10. xmlns:owl ="http://www.w3.org/2002/07/owl#"
11. xmlns:rdf ="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
12. xmlns:rdfs ="http://www.w3.org/2000/01/rdf-schema#"
13. xmlns:xsd ="http://www.w3.org/2001/XMLSchema#">
This block represents standard OWL declarations. It is an RDF/XML document instance and uses XML namespaces. Line 7 indicates that the default namespace for an element is DHProject. Line 8 identifies DHProject as the ontology and line 9 specifies the base for the DHProject ontology. Other resources used in this RDF/XML document instance are identified in the remaining lines.
Line 6 is the XML anchor or root tag. All well-formed XML document instances must have an opening root tag and must close with this same root tag.
14. <owl:Ontology rdf:about="">
15. <rdfs:comment>OWL export from MAIstro</rdfs:comment>
16. <rdfs:label>DHProject Ontology</rdfs:label>
17. </owl:Ontology>
Lines 14 to 17 follow the OWL convention of declaring this an OWL document. The rdf:about attribute is normally blank, “”, and as such refers to the base document, DHProject (Line 9). Line 16 provides a local name for this document.
18. <owl:Class rdf:ID="Term"/>
19. <owl:Class rdf:ID="PreferredTerm">
20. <rdfs:subClassOf rdf:resource="#Term"/>
21. </owl:Class>
22. <owl:Class rdf:ID="NonPreferredTerm">
23. <rdfs:subClassOf rdf:resource="#Term"/>
24. </owl:Class>
Lines 18 and 19 specify this OWL ontology’s first two classes, “Term” and “PreferredTerm”. Line 20 states that “PreferredTerm” is a subclass of “Term”. Lines 22 and 23 specify another class, “NonPreferredTerm”, and state that it is also a subclass of “Term”. Thus, “Term” has two subclasses. All instances of “PreferredTerm” and “NonPreferredTerm” are instances of “Term”.
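The subclass reasoning described above can be sketched in a few lines of Python. This is only an illustration, not part of the Thesaurus Master output: the dictionary `SUBCLASS_OF` and the helper names are assumptions introduced here to mirror the two `rdfs:subClassOf` statements.

```python
# Hypothetical in-memory copy of the rdfs:subClassOf statements in the listing.
SUBCLASS_OF = {
    "PreferredTerm": "Term",
    "NonPreferredTerm": "Term",
}


def classes_of(direct_class):
    """Return the direct class followed by every superclass reachable from it."""
    result = [direct_class]
    while result[-1] in SUBCLASS_OF:
        result.append(SUBCLASS_OF[result[-1]])
    return result


def is_instance_of(direct_class, target_class):
    """True if a member of direct_class is also a member of target_class."""
    return target_class in classes_of(direct_class)


print(classes_of("PreferredTerm"))
print(is_instance_of("NonPreferredTerm", "Term"))
```

An agent that knows a term only as a “NonPreferredTerm” can therefore still treat it as a “Term”, which is the inference the subclass statements make available.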
25. <owl:Class rdf:ID="StatusValue"/>
“StatusValue” is another class.
26. <owl:ObjectProperty rdf:ID="BroaderTerm">
27. <rdfs:domain rdf:resource="#PreferredTerm"/>
28. <rdfs:range rdf:resource="#PreferredTerm"/>
29. </owl:ObjectProperty>
30. <owl:ObjectProperty rdf:ID="NarrowerTerm">
31. <owl:inverseOf rdf:resource="#BroaderTerm"/>
32. </owl:ObjectProperty>
Lines 26 to 29 define a relation between instances of the class “PreferredTerm”. “BroaderTerm” is an object property that provides information about members of the class “PreferredTerm”, and its value must also come from the class “PreferredTerm”. Lines 30 to 32 define another object property, “NarrowerTerm”, and indicate that it is the inverse of “BroaderTerm”. This means that whenever one term has a second term as its “NarrowerTerm” value, the second term has the first as its “BroaderTerm” value. Because it is the inverse, “NarrowerTerm” likewise describes members of the class “PreferredTerm”, and its value must come from the class “PreferredTerm”. However, a term that appears as a “BroaderTerm” value does not itself have to have a “BroaderTerm” property value.
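To make the inverse relationship concrete, here is a minimal Python sketch of what an inference engine could derive from the `owl:inverseOf` statement. The triple representation and the function name `add_inverse_statements` are assumptions made for illustration; the term labels are taken from the sample thesaurus.

```python
def add_inverse_statements(triples, prop, inverse_prop):
    """Return the given triples plus the statements implied by owl:inverseOf."""
    inferred = set(triples)
    for subject, predicate, obj in triples:
        if predicate == prop:
            inferred.add((obj, inverse_prop, subject))
        elif predicate == inverse_prop:
            inferred.add((obj, prop, subject))
    return inferred


# Only the NarrowerTerm statement is asserted; its inverse is inferred.
asserted = {("Music performances", "NarrowerTerm", "Concerts")}
closed = add_inverse_statements(asserted, "BroaderTerm", "NarrowerTerm")

for triple in sorted(closed):
    print(triple)
```

Given only the asserted “NarrowerTerm” statement, the sketch also produces the statement that Concerts has Music performances as its “BroaderTerm”.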
33. <owl:ObjectProperty rdf:ID="Status">
34. <rdfs:domain rdf:resource="#PreferredTerm"/>
35. <rdfs:range rdf:resource="#StatusValue"/>
36. </owl:ObjectProperty>
37. <StatusValue rdf:ID="Candidate"/>
38. <StatusValue rdf:ID="Accepted"/>
39. <StatusValue rdf:ID="Deleted"/>
“Status” is an object property describing members of “PreferredTerm”, and it derives its value from instances of the class “StatusValue”. Lines 37 to 39 identify three members of the class “StatusValue”. Thus, “Status” can have one of these three as a value. If it had “Accepted” as a value, it would be describing a member of “PreferredTerm” as having the “Status” of “Accepted”.
40. <owl:ObjectProperty rdf:ID="Related_Term">
41. <rdf:type rdf:resource="&owl;SymmetricProperty"/>
42. <rdfs:domain rdf:resource="#PreferredTerm"/>
43. <rdfs:range rdf:resource="#PreferredTerm"/>
44. </owl:ObjectProperty>
In lines 40 to 44, “Related_Term” further describes members of the class “PreferredTerm”. It derives its value from the class “PreferredTerm”. “Related_Term” has an OWL property type of “SymmetricProperty”. Making “Related_Term” a “SymmetricProperty” specifies that if any member of the class “PreferredTerm” has a “Related_Term” property, then the reciprocal is also true. For example, if the “PreferredTerm” American Music has a “Related_Term” property with the value Jazz Music, then Jazz Music is also a member of the class “PreferredTerm” and has a “Related_Term” property with the value American Music.
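The effect of declaring a property symmetric can be sketched as follows. This is an illustrative Python fragment, not output from any Data Harmony product; the function name `symmetric_closure` and the pair representation are assumptions.

```python
def symmetric_closure(pairs):
    """Add the reverse of every (term, related term) pair."""
    closed = set(pairs)
    closed.update((b, a) for a, b in pairs)
    return closed


# Only one direction is asserted; the reciprocal follows from symmetry.
related = symmetric_closure({("American Music", "Jazz Music")})

for pair in sorted(related):
    print(pair)
```

With a single asserted pair, the closure contains both directions, which is the reciprocal relationship the “SymmetricProperty” type lets an agent infer.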
45. <owl:ObjectProperty rdf:ID="USE">
46. <rdf:type rdf:resource="&owl;FunctionalProperty"/>
47. <rdfs:domain rdf:resource="#NonPreferredTerm"/>
48. <rdfs:range rdf:resource="#PreferredTerm"/>
49. </owl:ObjectProperty>
50. <owl:ObjectProperty rdf:ID="Non-Preferred_Term">
51. <owl:inverseOf rdf:resource="#USE"/>
52. </owl:ObjectProperty>
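The next OWL object property is “USE”, which describes only members of the class “NonPreferredTerm” and derives its value from the class “PreferredTerm”. “USE” has the restricting OWL property type “FunctionalProperty”, which means that any member of the class “NonPreferredTerm” can have no more than one “USE” property value. Lines 50 to 52 define “Non-Preferred_Term” as the inverse of “USE”, so a preferred term points back to each non-preferred term that refers to it.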
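A thesaurus tool might check the “one USE value” rule with something like the sketch below. Two assumptions should be stated plainly: the names `functional_violations` and `use_statements` are invented for this example, and the check treats differently named terms as different things. A general OWL reasoner does not make that assumption and could instead conclude that two values of a functional property denote the same individual.

```python
from collections import defaultdict


def functional_violations(statements):
    """Return the subjects that have more than one distinct value."""
    values = defaultdict(set)
    for subject, value in statements:
        values[subject].add(value)
    return {s: v for s, v in values.items() if len(v) > 1}


# (non-preferred term, USE target) pairs taken from the sample thesaurus.
use_statements = [
    ("Agribusiness", "Agriculture"),
    ("Agronomy", "Agriculture"),
]

print(functional_violations(use_statements))
```

Several non-preferred terms may share one “USE” target, as Agribusiness and Agronomy do here; only a non-preferred term with two different targets would be reported.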
53. <owl:DatatypeProperty rdf:ID="Scope_Note">
54. <rdfs:domain rdf:resource="#PreferredTerm"/>
55. <rdfs:range rdf:resource="&xsd;string"/>
56. </owl:DatatypeProperty>
57. <owl:DatatypeProperty rdf:ID="Editorial_Note">
58. <rdfs:domain rdf:resource="#Term"/>
59. <rdfs:range rdf:resource="&xsd;string"/>
60. </owl:DatatypeProperty>
61. <owl:DatatypeProperty rdf:ID="Facet">
62. <rdfs:domain rdf:resource="#Term"/>
63. <rdfs:range rdf:resource="&xsd;string"/>
64. </owl:DatatypeProperty>
Lines 53 to 64 define properties that can contain variable length strings. Any member of class, “Term”, can have notes entered by an editor in “Editorial_Note”. Members of the class, “PreferredTerm”, can have a “Scope_Note” property value. These values can be variable length text strings such as sentences.
Line 64 ends the OWL descriptions of the parts that make up a Thesaurus Master™ application. A thesaurus is made up of terms, relationships between terms, and properties about terms. We have defined the classes “Term”, “PreferredTerm”, “NonPreferredTerm”, and “StatusValue”. Each class has properties and members. The members have properties. A thesaurus has preferred terms and non-preferred terms. Preferred terms can be, or have, broader terms, narrower terms, and related terms. Non-preferred terms always have just one USE term, which must be a preferred term. Preferred terms can have scope notes, editorial notes, and facets. Non-preferred terms can have editorial notes and facets.
65. <NonPreferredTerm rdf:ID="T1">
66. <rdfs:label xml:lang="en">Agribusiness</rdfs:label>
67. <USE rdf:resource="#T2" DHProject:alpha="Agriculture"/>
68. </NonPreferredTerm>
69. <PreferredTerm rdf:ID="T2">
70. <rdfs:label xml:lang="en">Agriculture</rdfs:label>
71. <Non-Preferred_Term rdf:resource="#T1" DHProject:alpha="Agribusiness"/>
72. <Non-Preferred_Term rdf:resource="#T3" DHProject:alpha="Agronomy"/>
73. <Non-Preferred_Term rdf:resource="#T38" DHProject:alpha="Farming"/>
74. </PreferredTerm>
75. <NonPreferredTerm rdf:ID="T3">
76. <rdfs:label xml:lang="en">Agronomy</rdfs:label>
77. <USE rdf:resource="#T2" DHProject:alpha="Agriculture"/>
78. </NonPreferredTerm>
Line 65 begins the Thesaurus Master OWL output of terms from a thesaurus. The member of a class is associated with that class within the OWL recommendation by stating it as an attribute of an element, the element being the class. Under the OWL recommendation, attributes have restricted formatting. Therefore, we have chosen to use a placeholder for the term. So, line 65 has the element NonPreferredTerm, which has been identified as a class. It has a member, “T1”. “T1” is a placeholder for the term “Agribusiness”, which is identified by using the rdfs:label element.
The Thesaurus Master OWL output processes all of the preferred and non-preferred terms in alphabetical order, corresponding to “T1” as the first term in an alphabetical sort and goes through “Tn”, where “Tn” is the last term in an alphabetical sort.
Since “T1”, Agribusiness, is a non-preferred term, it must have a “USE” term which is identified on line 67 as “T2”, Agriculture. The “USE” term must be from the class, “PreferredTerm”.
“T2” is Agriculture and it has three non-preferred terms “T1”, “T3”, and “T38”.
Lines 75 to 78 introduce term “T3”, Agronomy, also a non-preferred term.
The rest of the Thesaurus Master OWL output would proceed in a similar fashion until all of the terms have been identified, again, from T1 to Tn. Some additional output is described below.
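As a rough illustration of how a software agent might read this part of the output, the following Python sketch parses a trimmed, well-formed fragment with the standard library and resolves each non-preferred term to its “USE” target. The fragment in `SAMPLE` is an abbreviated stand-in for the real output (the DHProject:alpha attributes are omitted), and the function names `load_labels` and `resolve_use` are assumptions made for this example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
DH = "http://localhost/owlfiles/DHProject#"

# Abbreviated stand-in for the term section of the OWL output.
SAMPLE = f"""<rdf:RDF xmlns="{DH}" xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}">
  <NonPreferredTerm rdf:ID="T1">
    <rdfs:label xml:lang="en">Agribusiness</rdfs:label>
    <USE rdf:resource="#T2"/>
  </NonPreferredTerm>
  <PreferredTerm rdf:ID="T2">
    <rdfs:label xml:lang="en">Agriculture</rdfs:label>
    <Non-Preferred_Term rdf:resource="#T1"/>
  </PreferredTerm>
</rdf:RDF>"""


def load_labels(root):
    """Map each rdf:ID placeholder (T1, T2, ...) to its rdfs:label text."""
    labels = {}
    for element in root:
        term_id = element.get(f"{{{RDF}}}ID")
        label = element.find(f"{{{RDFS}}}label")
        if term_id is not None and label is not None:
            labels[term_id] = label.text
    return labels


def resolve_use(root):
    """Map each non-preferred term label to the label of its USE target."""
    labels = load_labels(root)
    resolved = {}
    for element in root.findall(f"{{{DH}}}NonPreferredTerm"):
        use = element.find(f"{{{DH}}}USE")
        if use is None:
            continue
        target_id = use.get(f"{{{RDF}}}resource", "").lstrip("#")
        resolved[labels[element.get(f"{{{RDF}}}ID")]] = labels.get(target_id)
    return resolved


root = ET.fromstring(SAMPLE)
use_map = resolve_use(root)
print(use_map)
```

The placeholders “T1” and “T2” serve only as identifiers; the human-readable terms come from the rdfs:label elements, which is why the sketch builds the label table first.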
79. <PreferredTerm rdf:ID="T6">
80. <rdfs:label xml:lang="en">Band music</rdfs:label>
81. <BroaderTerm rdf:resource="#T55" DHProject:alpha="Instrumental music"/>
82. <Related_Term rdf:resource="#T7" DHProject:alpha="Bands (Music)"/>
83. </PreferredTerm>
84. <PreferredTerm rdf:ID="T7">
85. <rdfs:label xml:lang="en">Bands (Music)</rdfs:label>
86. <BroaderTerm rdf:resource="#T81" DHProject:alpha="Musical groups"/>
87. <Related_Term rdf:resource="#T111" DHProject:alpha="Rock groups"/>
88. <Related_Term rdf:resource="#T6" DHProject:alpha="Band music"/>
89. </PreferredTerm>
This set introduces broader terms and related terms. Term “T6”, Band music, is a member of Instrumental music, its broader term. Band music is also related to Bands (Music). Term “T7”, Bands (Music), has a different broader term, Musical groups. Line 88 states the reciprocal relationship required of related terms, stating that Bands (Music) is related to Band music. Bands (Music) is also related to term T111, Rock groups.
90. <PreferredTerm rdf:ID="T111">
91. <rdfs:label xml:lang="en">Rock groups</rdfs:label>
92. <BroaderTerm rdf:resource="#T81" DHProject:alpha="Musical groups"/>
93. <Related_Term rdf:resource="#T7" DHProject:alpha="Bands (Music)"/>
94. <Related_Term rdf:resource="#T112" DHProject:alpha="Rock music"/>
95. </PreferredTerm>
Jumping down the OWL output to where term T111 is defined, the reciprocal is also shown in line 93.
96. <PreferredTerm rdf:ID="T72">
97. <rdfs:label xml:lang="en">Music performances</rdfs:label>
98. <BroaderTerm rdf:resource="#T70" DHProject:alpha="Music"/>
99. <NarrowerTerm rdf:resource="#T28" DHProject:alpha="Concerts"/>
100. <NarrowerTerm rdf:resource="#T77" DHProject:alpha="Musical conducting"/>
101. <NarrowerTerm rdf:resource="#T118" DHProject:alpha="Solo music performances"/>
102. </PreferredTerm>
Lines 96 to 102 illustrate a “term – broader term – narrower term” instance. Term “T72”, Music performances, is a member of class Music, and is itself a class with three members: Concerts, Musical conducting, and Solo music performances.
The Data Harmony Thesaurus Master OWL output continues until all members of class “PreferredTerm” and class “NonPreferredTerm” are elucidated. Your Thesaurus Master™ thesaurus is now a Web resource that can be used by software agents. Because OWL is designed to be distributed and referenced, a given base thesaurus ontology can grow as other thesaurus ontologies reference it.
Even a thesaurus wrapped in OWL falls short of the full potential of the Semantic Web. This ‘first order’ output allows other thesaurus applications to make inferences about members and classes of a thesaurus. By “reading” the OWL wrappings, any thesaurus OWL software agent can determine, for example, that if term “Agronomy” is a member of class “NonPreferredTerm”, then it will have a “USE” property. By using classes, subclasses and members and their properties, Web software agents would be able to reproduce the hierarchical structure of a thesaurus outside of the application used to construct it.
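Reproducing the hierarchy outside the original application could look something like the following Python sketch. It assumes the “BroaderTerm” statements have already been extracted into a simple mapping; the names `BROADER`, `build_children`, and `render` are invented here, and the single-parent mapping is a simplification, since a thesaurus term may have more than one broader term.

```python
from collections import defaultdict

# Hypothetical extract of BroaderTerm statements: term -> broader term.
BROADER = {
    "Music performances": "Music",
    "Concerts": "Music performances",
    "Musical conducting": "Music performances",
    "Solo music performances": "Music performances",
}


def build_children(broader):
    """Invert the term -> broader term mapping into parent -> children."""
    children = defaultdict(list)
    for term, parent in broader.items():
        children[parent].append(term)
    return children


def render(term, children, depth=0):
    """Return the subtree under a term as indented lines, children sorted."""
    lines = ["  " * depth + term]
    for child in sorted(children.get(term, [])):
        lines.extend(render(child, children, depth + 1))
    return lines


tree_lines = render("Music", build_children(BROADER))
print("\n".join(tree_lines))
```

Starting from Music, the sketch prints Music performances one level down and its three narrower terms one level below that, in alphabetical order.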
However, a lot is still missing. For example, knowing a term’s parent, its children, the other terms it is related to, and the terms it is used for does not tell you what the term means and what it might be trying to describe. Additional layers would be needed to provide a more precise definition of a term in the form of additional properties. This is a layer of semantic meaning that approaches dictionary definitions. How a term is supposed to be used and why this ‘term’ is preferred over that ‘term’ are still more properties that are needed.
The most difficult layer of semantic meaning is the relationship between a thesaurus term and the entity, or object, it describes. An assignable thesaurus term is a member of class “PreferredTerm”. When it is assigned to an object, for example a research report, that term becomes a property of that specific research report. The intelligence required to make that determination, “term x describes document y”, is beyond the current OWL output. It does reside in Data Harmony’s M.A.I.™ rule base. An OWL output of the rule base is beyond the scope of the current OWL implementation.
While an OWL output of the rule base is possible, it would be essential to relate it to its associated thesaurus OWL output. At this level, it is possible, but as the Semantic Web grows, I question if the Semantic Web will scale. In the ‘second order’ OWL output that would combine a Data Harmony Thesaurus Master OWL output and a Data Harmony M.A.I. OWL output, the level of complexity becomes almost unmanageable. Still, the ‘first order’ output described in detail here is an enormous and useful first step.
1 BT – broader term, NT – narrower term, NPT – non-preferred term, RT – related term