Taxonomy

Someone responsible for implementing search has limited control over two of its key ingredients: the technology and the content. This is where taxonomy plays a role: it can describe concepts that appear neither in the content nor in the metadata about the content. (Metadata is particularly useful for digitizing non-digital objects.) Taxonomy is not always necessary. If you can write custom content with a very precise vocabulary, using Search Engine Optimization (SEO) techniques, you might not need a taxonomy.

If any term is hard to explain with a simple sentence, it probably deserves a taxonomy.

Categorizing Content with Taxonomy

Taxonomy is the practice of classifying content. It allows us to connect, relate and classify the content of the Lawi Project properties. A faceted taxonomy allows a set of content to be filtered by multiple criteria.

The best practice is to create small taxonomies that are divided into homogeneous facets.
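As a minimal sketch of faceted filtering, the example below tags a few made-up content items with values from three small facets and keeps the items that satisfy every requested facet. The facet names, values, and the `filter_by_facets` helper are assumptions chosen for illustration, not part of any Lawi Project system.

```python
from typing import Dict, List, Set

# Hypothetical items; the facet names and values are illustrative only.
ITEMS: List[dict] = [
    {"title": "Contract basics",
     "facets": {"topic": {"contracts"}, "format": {"blog"}, "jurisdiction": {"US"}}},
    {"title": "Contract law lecture",
     "facets": {"topic": {"contracts"}, "format": {"video"}, "jurisdiction": {"UK"}}},
    {"title": "Tort overview",
     "facets": {"topic": {"torts"}, "format": {"blog"}, "jurisdiction": {"US"}}},
]


def filter_by_facets(items: List[dict], criteria: Dict[str, Set[str]]) -> List[dict]:
    """Keep items matching every requested facet.

    Within one facet any of the wanted values is enough (OR);
    across facets all criteria must hold (AND).
    """
    result = []
    for item in items:
        facets = item["facets"]
        if all(facets.get(name, set()) & wanted for name, wanted in criteria.items()):
            result.append(item)
    return result


matches = filter_by_facets(
    ITEMS,
    {"topic": {"contracts"}, "format": {"blog", "video"}, "jurisdiction": {"US"}},
)
print([m["title"] for m in matches])
```

Because each facet is small and homogeneous, adding a new criterion only narrows the result set and never requires restructuring the others.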

Universal Search with Taxonomies

A taxonomy can be implemented independently of the content, which means it can be used across content types (blogs, videos, other resources), creating a common set of concepts from which to generate user-centered search.

Ontologies and Taxonomies

An ontology differs from a taxonomy or a thesaurus: “taxonomists build hierarchies and ontologists determine classes or categories.” The key point is that “ontologies are neat and unambiguous and taxonomies are a bit messy.”

Simply put, ontologies allow hierarchical relations just like taxonomies, but they also offer flexibility in defining other links or connections between terms. Those links are expressed with predicates.
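The difference can be sketched in a few lines. In the hypothetical example below, the taxonomy knows only one kind of link (child to parent), while the ontology stores subject, predicate, object statements so that each link carries its own meaning. The terms and predicate names are invented for illustration.

```python
from typing import Dict, List, Tuple

# Taxonomy: a single, implicit relation (narrower term -> broader term).
TAXONOMY_PARENT: Dict[str, str] = {
    "Contract law": "Private law",
    "Tort law": "Private law",
    "Private law": "Law",
}


def ancestors(term: str) -> List[str]:
    """Walk up the hierarchy and return every broader term, nearest first."""
    chain = []
    while term in TAXONOMY_PARENT:
        term = TAXONOMY_PARENT[term]
        chain.append(term)
    return chain


# Ontology: every link is named by a predicate (hypothetical predicate names).
TRIPLES: List[Tuple[str, str, str]] = [
    ("Contract law", "is_a", "Private law"),
    ("Contract law", "governs", "Agreement"),
    ("Breach of contract", "regulated_by", "Contract law"),
    ("Breach of contract", "may_lead_to", "Damages"),
]


def objects(subject: str, predicate: str) -> List[str]:
    """Return everything linked to `subject` through `predicate`."""
    return [o for s, p, o in TRIPLES if s == subject and p == predicate]


print(ancestors("Contract law"))
print(objects("Breach of contract", "may_lead_to"))
```

The hierarchy is still available in the ontology (through `is_a`), but it is only one predicate among several.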

Social Taxonomies

Despite their low cognitive cost, their ability to match users’ real needs and language, and their great value in serendipitous browsing, Social Taxonomies suffer from a lack of good tools to enable users to navigate through the mass of social taxonomies. Social Taxonomies are inherently plagued by polysemy, homonymy, plurals, synonymy and basic level variation – linguistic issues which do not appear easy to solve (Golder, S.A., & Huberman, B.A. (2005). The structure of collaborative tagging systems. [E-print]. Retrieved April 21, 2007, from arXiv at http://arxiv.org/pdf/cs.DL/0508082).

Social Tag Clouds

Tag clouds (or taxonomy clouds) are not sufficient to provide a semantically rich, multidimensional browsing experience over complex taxonomy systems. There are several reasons for this:
•Taxonomy clouds do little to address the language variability issue.
•Choosing tags by frequency of use inevitably causes a high semantic density, with very few well-known and stable topics dominating the scene.
•Providing only an alphabetical criterion to sort tags heavily limits the ability to quickly navigate, scan and extract, and hence to build a coherent mental model of the tags.
•A flat taxonomy cloud cannot visually convey semantic relationships between tags, which limits the user experience.

Taxonomy and Searching

Information professionals and librarians rely on classification and controlled vocabularies to aid precision search; abstract and index (A&I) publishers invest in indexing and thesauri to add value to their resources.

NATURAL LANGUAGE

Keyword search works well on unstructured content. Analytics can be used to identify nuggets of knowledge buried in unstructured content. In the first step of an analytics process, a document is scanned using natural language processing (NLP) to identify meaningful terms and, if an ontology is being used, related concepts. The lack of suitable lexicons is one of the main factors limiting the expansion of this area.

Conversely, the better defined the vocabulary, the more specific the queries can be. The Lawi Project is using entity extraction technology to identify legal terms and then link them to sources of legal information stored by other properties of the Lawi Project. Increasingly, this relationship information is captured as a Resource Description Framework (RDF) triple.
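To make the idea concrete, here is a deliberately naive sketch: a dictionary lookup stands in for entity extraction, and each hit is recorded as a (subject, predicate, object) tuple in the spirit of an RDF triple. The vocabulary, the `example.org` identifiers, and the `mentions` predicate are all assumptions for illustration; the source does not describe the technology the Lawi Project actually uses.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical controlled vocabulary: legal term -> identifier of a resource
# held elsewhere. The example.org URLs are placeholders, not real endpoints.
VOCABULARY: Dict[str, str] = {
    "habeas corpus": "https://example.org/term/habeas-corpus",
    "due process": "https://example.org/term/due-process",
}


def extract_entities(text: str) -> List[str]:
    """Return the vocabulary terms that occur in `text` (case-insensitive)."""
    lowered = text.lower()
    return [
        term
        for term in VOCABULARY
        if re.search(r"\b" + re.escape(term) + r"\b", lowered)
    ]


def to_triples(doc_id: str, text: str, predicate: str = "mentions") -> List[Tuple[str, str, str]]:
    """Record each extracted term as a (subject, predicate, object) statement."""
    return [(doc_id, predicate, VOCABULARY[term]) for term in extract_entities(text)]


triples = to_triples("doc-1", "The court considered habeas corpus and Due Process claims.")
for triple in triples:
    print(triple)
```

A real pipeline would use NLP rather than string matching, but the output shape stays the same: the better defined the vocabulary, the more precise the links.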

Social media is necessary to help generate and clarify relevant, lively, topical content now that can be curated later (as opposed to the traditional model of curating first and sharing afterwards).

SEMANTIC TECHNOLOGY

Semantic technology goes beyond descriptive tagging and “whatness” to encoding meaning extracted from content to infer “aboutness.” This is done using a variety of tools including entity extraction, classification, and categorization. Each of these elements is enhanced with the availability of appropriate ontologies.

These concepts are captured in the accompanying ontology spectrum graphic. Integral to the concept of the semantic web is an ontology.

To play well with others, taxonomy tools need to support import and export of vocabularies into different standards including SKOS or XML.

Faceted taxonomy

With a faceted taxonomy, some challenges are confronted and resolved. Specifically, it helps to:
◾Clarify specific terms by situation or function
◾Ease long-term maintenance issues
◾Facilitate sharing and importing of taxonomies.

Schemas

Some of the schemas that may be used in a taxonomy are:

•RDF-XML – A (relatively) low-level schema used to publish facts (ontologies when used in whole)
•SKOS – A schema, in draft at W3C as I write this, for sharing knowledge organization systems. Based on RDF.
•Zthes – An XML schema for publishing thesauri.
•TopicMaps.org – An XML schema for representing topics and relationships among topics
•A good list of references is available in ANSI/NISO Z39.19 – Guidelines for the Construction, Format, and Management of Monolingual Controlled Vocabularies

Lawi Project Taxonomy

The types of parameters supported include:
•Identifying which classification is desired (default is to return all)
•Specifying the statuses of values to include (default will return all)
•Specifying the language to include (default returns English)
•Specifying the level of detail of interest (default returns the briefest format)
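The parameters and their defaults can be sketched as a small request object. The field names (`classification`, `statuses`, `lang`, `detail`) and the query-string form are assumptions made for this example; the source names the kinds of parameters but not their actual spelling.

```python
from dataclasses import dataclass
from typing import Optional, Tuple
from urllib.parse import urlencode


@dataclass(frozen=True)
class TaxonomyRequest:
    """Hypothetical request for the taxonomy interface; names are illustrative."""

    classification: Optional[str] = None  # None means: return all classifications
    statuses: Tuple[str, ...] = ()        # empty means: return values of every status
    lang: str = "en"                      # default returns English
    detail: str = "brief"                 # default is the briefest format

    def to_query(self) -> str:
        """Serialize to a query string, leaving out filters that are unset."""
        params = {"lang": self.lang, "detail": self.detail}
        if self.classification is not None:
            params["classification"] = self.classification
        if self.statuses:
            params["status"] = ",".join(self.statuses)
        return urlencode(params)


print(TaxonomyRequest().to_query())
print(TaxonomyRequest(classification="geography", statuses=("Active",), lang="de").to_query())
```

Modelling "return everything" as an absent filter keeps the default request short and makes each narrowing choice explicit.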

With regard to language, one of the business rules followed on our web sites is to provide content in the user’s selected language when available and to return English when it is not (English should always be available). This rule is pushed down into this interface at the level of each value. So a consuming application might request the set of German values for the taxonomy and get all of the classification details in German and, say, 99% of the values in German; any values that are not translated are returned in English. This approach keeps the taxonomy consistent with our general rules (though if taxonomy values are used directly in a user interface, it can produce a possibly confusing same-page mix of non-English and English).
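The per-value fallback rule can be sketched as follows. The data layout (a dictionary of labels keyed by language code) and the `localize` helper are assumptions for illustration; only the rule itself, requested language first and English otherwise, comes from the description above.

```python
from typing import Dict, Tuple

# Hypothetical values: value ID -> labels by language code.
VALUES: Dict[str, Dict[str, str]] = {
    "v1": {"en": "Contract", "de": "Vertrag"},
    "v2": {"en": "Tort"},  # not yet translated into German
}


def localize(values: Dict[str, Dict[str, str]], lang: str,
             fallback: str = "en") -> Dict[str, Tuple[str, str]]:
    """Pick one label per value: the requested language if present, else English.

    Each result also carries the language actually used, so a user interface
    can recognize (and perhaps mark) labels that fell back to English.
    """
    result = {}
    for value_id, labels in values.items():
        used = lang if lang in labels else fallback
        result[value_id] = (labels[used], used)
    return result


print(localize(VALUES, "de"))
```

Returning the language alongside the label is one way to soften the mixed-language page problem noted above, since the consumer can at least tell which labels are fallbacks.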

Document structure

The returned XML document looks like the following. I’m not using any formal XML schema syntax; instead I show the elements and how they relate to each other, with a brief description of the elements that I don’t think are self-explanatory.
• taxonomy
◦ classification – has an attribute id (the ID of the classification)
◾ name – has an attribute lang (the language code describing the language of the name element)
◾ description – has an attribute lang (the language code describing the language of the description element)
◾ status
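As a rough illustration of the element outline above, the sketch below parses a made-up document with Python's standard library. Only the element and attribute names (taxonomy, classification, id, name, description, lang, status) come from the outline; the sample content and the `read_classifications` helper are invented, and the real document may carry more elements than are listed here.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List

# Invented sample that follows the outlined structure.
SAMPLE = """\
<taxonomy>
  <classification id="geo">
    <name lang="en">Geography</name>
    <description lang="en">Geographic relevance of a piece of content.</description>
    <status>Active</status>
  </classification>
</taxonomy>
"""


def read_classifications(xml_text: str) -> List[Dict[str, str]]:
    """Return one small record per classification element."""
    root = ET.fromstring(xml_text)
    rows = []
    for classification in root.findall("classification"):
        name = classification.find("name")
        rows.append({
            "id": classification.get("id"),
            "name": name.text,
            "lang": name.get("lang"),
            "status": classification.findtext("status"),
        })
    return rows


print(read_classifications(SAMPLE))
```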

In regard to the level of detail parameter mentioned above: the “brief” level includes only the names, descriptions and statuses of the classifications, levels and values. The “detailed” level includes all details except the changeHistory elements. The “complete” level includes everything, changeHistory elements included. The “complete” format is probably not very useful for consumers, as most will not care about the life history of elements (though that is of interest and value within the taxonomy).

Taxonomy – The Structure in Detail

Classification

The primary construct in the taxonomy is called a “Classification”. I now know that a better term would be “Facet”, as that is what these are. The intent is that a Classification is a specific set of values (perhaps explicitly defined or perhaps defined by a set of guidelines or business rules) with which pieces of content can be associated (they can be tagged with values from the classification).

In our schema, a Classification itself has a number of elements:
•Name – The preferred name for the Classification. Typically used as the label for fields on, for example, data entry forms of various sorts.
•Definition – A concise definition of the Classification. Forcing the explicit definition of this helps reduce fuzzy thinking and gets people to clearly differentiate when a new Classification is needed versus using an existing one. This can be displayed in other systems that allow users to associate classification values with content as a kind of “mini-help”.
•Life History (create date, modification date, audit trail) – We maintain the create date (actually, date added to the taxonomy) and a modification date so we know what happened when to the Classification. More detail is provided below on the audit trail.
•Source System – Each classification might be sourced from another system. An example is a product listing – these are not maintained in the taxonomy but in their own systems and the taxonomy simply uses that list. Another example (where we do not have automation) is language (where we reference ISO standards as the master even though the values are still manually maintained in our taxonomy database).
•Comments – A text field to hold comments for use within the taxonomy. Notes about issues, etc. Not intended for end users as the Definition is.
•Data Type – The type of values expected for this Classification. Most commonly, just Strings, but we do define (for example) Creation Date and Expiration Date as classifications with data type of Date.
•Value Indicators – The taxonomy provides indicators to help other systems know what to do with the Classification – Should assignment be constrained to just the values provided by the taxonomy? Should other systems allow content pieces to be associated with multiple values of a classification?
•Synonyms – We provide for the Classification itself to have synonyms (these are synonyms for the Name of the classification). This can be used when (despite best attempts to the contrary) people want to continue to use different terms for the same classification. An example might be that one system (and its user group) might want to refer to a “Region” whereas another might use the term “Market” or “Area”.
•Status – We provide a status indicator on pretty much everything within the taxonomy (Classifications, individual values, etc). The usage is consistent and breaks down into:
◦“Active” – the value can be assigned to new/modified content; should be displayed in any type of search UI (say as a pick list) if appropriate; and should be displayed if a user views the taxonomic tagging of an item.
◦“Inactive” – the value should not be able to be assigned to new content or be newly assigned to existing content; it should be displayed in search UIs (if appropriate) and should be displayed if a user views the taxonomic tagging of an item. Basically, it was valid at one point and still has value on content already tagged with it but we do not use it any more.
◦“Deleted” – We don’t delete values physically, but mark them “Deleted”. The value cannot be assigned when creating or editing content, it should not be displayed in any search UI and it should not be displayed if a user views the taxonomic tagging of a piece of content. Basically, the value is no longer in the taxonomy (though some systems may still have the value associated with content in some ways).
◦“Proposed” – The first status for most items. The value would only be in the Taxonomy system itself and would not propagate to other systems. Indicates that it’s being considered for adding but has not yet been approved.

•A set of Classification Levels – Some classifications have an internal structure, described below in the “Level” section.
•Localizations of Classification – There may be non-English translations of the name and description of a classification in the taxonomy database (see below for more about multiple languages).
•A set of Classification Values – Most classifications have a set of explicit values that can be associated with a piece of content. The values might be a flat list or might be hierarchical. The taxonomy database supports both. Currently, we do not support any type of many-to-many relationship or relationships across Classifications – just a simple one-to-many within a Classification which is a value / sub-value relationship (some Classifications provide more explicit constraints on the intended meaning of the relationship). Also, we do not have a construct that allows for an explicit (in the taxonomy database) meaning for any given relationship (specifically, narrower-than, broader-than, etc.); it’s implicit in the structure of the values.
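The four statuses described in the list above amount to a small decision table, which a consuming system might encode roughly as below. The flag names and helper functions are hypothetical. The `propagates` flag for “Deleted” is an assumption: the description says only that “Proposed” items stay inside the taxonomy system.

```python
from enum import Enum
from typing import List, NamedTuple, Tuple


class Status(Enum):
    PROPOSED = "Proposed"
    ACTIVE = "Active"
    INACTIVE = "Inactive"
    DELETED = "Deleted"


class Rules(NamedTuple):
    assignable: bool     # may be newly assigned to content
    in_search_ui: bool   # shown in search interfaces such as pick lists
    shown_on_item: bool  # shown when a user views an item's tagging
    propagates: bool     # leaves the taxonomy system for consumers


STATUS_RULES = {
    Status.PROPOSED: Rules(False, False, False, False),
    Status.ACTIVE:   Rules(True,  True,  True,  True),
    Status.INACTIVE: Rules(False, True,  True,  True),
    # Assumption: deleted values still propagate so consumers can retire them.
    Status.DELETED:  Rules(False, False, False, True),
}


def can_assign(status: Status) -> bool:
    """True when a value with this status may be tagged onto content."""
    return STATUS_RULES[status].assignable


def pick_list(values: List[Tuple[str, Status]]) -> List[str]:
    """Labels a search UI should offer, in their original order."""
    return [text for text, status in values if STATUS_RULES[status].in_search_ui]


VALUES = [("France", Status.ACTIVE), ("Zaire", Status.INACTIVE),
          ("Atlantis", Status.DELETED), ("Mars", Status.PROPOSED)]
print(pick_list(VALUES))
```

Keeping the rules in one table means every consuming system applies the same interpretation of a status instead of re-deriving it.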

Value

A Value is a single (usually textual, though might be dates or numbers) term which can be associated with a piece of content. Values are grouped into Classifications. A value association to a piece of content is what connects that piece of content to the taxonomy.

Like a Classification, a Value has a structure, which is only used when the Classification provides explicit values:
•ID – the unique identifier within the taxonomy that identifies the value. Most systems using the taxonomy will store this ID as the association (and not the associated value). This allows the Value to have its textual representation changed without having to revisit any content (say a product name changes or a country’s name changes).
•Structure details – What classification this value is associated with and which value in this Classification (if any) is the parent of this value. Also, some values have a designated “Level” (see below for more on that).
•Value – the textual representation of this value. The string users will see and interpret as the “value”.
•Definition – the definition of this value. As with the classifications, forcing this to be clearly defined provides a good “buffer” against people requesting values to be added that are duplicative or not generally useful. I’m surprised by how often asking a requestor for a clear definition (and how it’s different from another value that seems similar) stops them in their tracks.
•Life History – same as the Classifications
•Source System ID – For Classifications whose values come from another system, we maintain the source system’s ID so we can associate it back to the source system for updates. This can also be used by systems that pull from the taxonomy and also might happen (for other business reasons) to pull data from the same source systems and allows those systems to cross between the two sets of values.
•Status – Same as for Classifications
•Synonyms – Same as for Classifications but applied to the individual values. Synonyms for values are much more common than synonyms for classifications. Systems using the synonyms can potentially do many different things with synonyms (displaying them while a content manager is associating values with content, supporting search on them, etc.)
•Localization of Value and Definition – Non-English translations of the value and definition. See below for more details.
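A minimal sketch of the Value structure shows why content stores the ID rather than the text. The field names, the sample data, and the helper functions are assumptions for illustration, and only a few of the fields listed above are modelled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Value:
    """Simplified Value record; covers only part of the structure described."""

    id: int
    classification: str
    text: str
    parent_id: Optional[int] = None          # parent value in the same classification
    synonyms: List[str] = field(default_factory=list)


VALUES: Dict[int, Value] = {
    1: Value(1, "Geography", "EMEA"),
    2: Value(2, "Geography", "France", parent_id=1, synonyms=["FR"]),
}

# Content keeps value IDs only, never the display text.
CONTENT_TAGS: Dict[str, List[int]] = {"article-42": [2]}


def labels_for(content_id: str) -> List[str]:
    """Resolve the stored IDs to their current display text."""
    return [VALUES[value_id].text for value_id in CONTENT_TAGS.get(content_id, [])]


def rename_value(value_id: int, new_text: str) -> None:
    """Change the display text in one place; tagged content is untouched."""
    VALUES[value_id].text = new_text


def find_value(query: str) -> Optional[Value]:
    """Look a value up by its text or by any of its synonyms."""
    wanted = query.casefold()
    for value in VALUES.values():
        if wanted == value.text.casefold() or wanted in (s.casefold() for s in value.synonyms):
            return value
    return None


print(labels_for("article-42"))
```

Calling `rename_value(2, "French Republic")` changes what `labels_for("article-42")` returns while `CONTENT_TAGS` stays exactly as it was, which is the point of storing the ID.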

Level

Within a single Classification, we have adopted a mechanism we refer to as a “Level” in order to have a structure within the Classification when it’s meaningful to have different Values grouped into semantically different sets. I think of this as the means by which we support a structure of Classifications.

A good example is Geography. We have a single classification for Geography which contains all necessary values for tagging content for geographic relevance (or irrelevance in some cases). However, each Value within that Classification might represent a different type of Geography. Some values are regions of the world (“North America” or “EMEA”); some values are countries (“France” or “Japan”); and some might be areas within a country (“Midwest United States”).

A Level is a hierarchy of terms within a Classification and any given Value can be assigned to a Level.

The value of this is that systems using the taxonomy can provide user interfaces that group similar values (a nested, tree-style interface, say) while we do not need to have multiple Classifications with relationships across the Classifications to support this.
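A consuming system could use the Level to build such a grouped interface along these lines. The level names (“Region”, “Country”, “Area”) and the `group_by_level` helper are assumptions for the sketch; the values are the Geography examples given above.

```python
from typing import Dict, List, Tuple

# One Geography classification; each value carries a (hypothetical) level name.
GEOGRAPHY: List[Tuple[str, str]] = [
    ("North America", "Region"),
    ("EMEA", "Region"),
    ("France", "Country"),
    ("Japan", "Country"),
    ("Midwest United States", "Area"),
]


def group_by_level(rows: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by level, keeping the order in which they were listed."""
    groups: Dict[str, List[str]] = {}
    for value, level in rows:
        groups.setdefault(level, []).append(value)
    return groups


for level, values in group_by_level(GEOGRAPHY).items():
    print(f"{level}: {', '.join(values)}")
```

All five values still belong to a single classification; the grouping is derived from the level alone, so no cross-classification relationship is needed.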

Multiple Languages

In order to support multiple languages on our web sites, we have provided a means to localize the entire taxonomy. Because localized content is a critical component of our customer-facing site, we provide a structure so that all text that can be used outside of the taxonomy (primarily things like the names and definitions of Classifications, the name and definition for Values, Level names, and even synonyms of each of these) can be localized.

Systems that pull from the taxonomy can then use the available localized terms in their displays (falling back to English if a particular term is not available in a specific language). This could be used in field labels on forms or navigation labels in a browsing interface, menu items, etc.