Archive

 

More about the original project

Further information about the original VMF project can be found in these documents.

Use cases

These use cases from the project proposal show the value of the extended Framework to the JISC community.

  1. Metadata crosswalk
  2. Mapping of local, bespoke metadata schemes
  3. Complex term mapping between domains
  4. Cross-searching metadata from institutional repositories
  5. Preservation metadata

The need for a standard framework of relators to describe relationships between resources identified in different metadata and identification schemes was demonstrated in Use cases for interoperability of ISO TC46/SC9 identifiers, produced in 2005 by Mark Bide for an ISO TC46/SC9 working group. It was this work that provided the initial impetus for an earlier proposal to JISC which developed into the Vocabulary Mapping Framework.

Source standards

VMF included a number of source standards in the development of the framework which are detailed here.

 

Background

Context

Trends in vocabularies

Metadata vocabularies form a major part of metadata schemas, and carry a substantial part of the meaning of metadata. Vocabularies are growing rapidly, both in number and in size. Vocabularies are becoming increasingly complex and granular. Vocabularies of relators (terms that describe relationships between entities) are growing in importance.

Trends in JISC community metadata

VMF made these three assumptions:

  1. Bibliographic and heritage metadata is becoming increasingly diverse and complex and will require increasing interoperability for re-use and discovery.

Current extensive vocabulary developments within RDA, MARC, FRBR/FRBRoo, CIDOC CRM and Dublin Core, and the interactions between them, are the primary evidence of this. Four of these five initiatives have come into being in the last decade (the exception being MARC, where there has also been a significant growth of variations). Driven mainly by the impact of RDA and FRBR, there are now joint working and liaison groups between various combinations of them. The slide below, from a recent RDA presentation (August 2008) by Gordon Dunsire at an IFLA conference, illustrates the point graphically; each pair represents significant integration activity:

Diagram of linked communities

Each of these relationships (except FRBR-to-FRBRoo) requires vocabulary mapping. These developments greatly increase the number of schema-to-schema mappings that are required. This slide contains only one publisher/producer standard (ONIX) and none from education or other sectors, so the potential for a "combinatory explosion" of mappings as others are added is evident.

  2. Members of the JISC community have increasingly diverse, complex and unpredictable metadata requirements.

At a basic level this trend is summarized by one of the reviewers of an earlier proposal to JISC, when discussing the take-up of the proposed RDA "MyRDA" tool:

"Different libraries have different requirements. A public library might only need the general rules. A music library would want to include all rules relating to music scores, music recordings, as well as books and serials. An academic library is more likely to need the rules for e-journals. It is difficult to predict how quickly this functionality will be offered by vendors and exactly which aspects will be taken up."

The rapid growth of the development of complex multimedia resources for education at all levels means that more automated acquisition and integration of metadata from more diverse sources is a growing requirement.

  3. Metadata from producers/providers/publishers will become increasingly important as a substantial component of metadata in the JISC community.

It is now quite common for metadata originated in ONIX, IEEE LOM and some proprietary formats such as CrossRef to populate library and VLE systems (transformations of this kind are routinely provided by agencies such as OCLC, Nielsen and the Library of Congress). This will extend across all media types in time, and any authoritative domain standards such as DDEX (music industry) or PRISM (magazines) can be expected in due course to provide metadata to the JISC community.

Again the rationale for this is put well by one of the reviewers of the earlier proposal, seen in this case from the perspective of RDA:

"Using complementary standards means less maintenance for RDA. RDA is designed to be applied by a range of communities, and recognises the usefulness of using existing term sets and vocabularies, rather than creating an RDA set; for example, role terms from MARC 21."

ONIX has led the way, but such standards are being introduced and beginning to gain use in many domains, and so the availability of good quality producer/provider/publisher metadata in other standards is both required and expected.

 

The RDA/ONIX Framework

The RDA/ONIX framework for resource categorization was developed in 2006 by a joint working group of experts from the library and publishing industries. Its goal was defined as follows:

"The objective is … a framework for categorizing resources in all media that will support the needs of both libraries and the publishing industry and will facilitate the transfer and use of resource description data across the two communities."

The Framework allows for resource categories to be created out of a matrix of pre-defined attributes (a "pizza menu" approach where different ingredients are selected and the resulting combination given a name as a new category). The Framework therefore enables categories in different schemes to be defined and mapped securely to one another irrespective of naming differences. This method is a common form of ontology known as "Formal Concept Analysis".
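
As a minimal sketch of the "pizza menu" idea, a category can be modelled as a named set of attribute values, so that categories from two schemes match when their attribute sets are equal, whatever they are called. The scheme names, category names and attribute values below are invented for illustration and are not taken from the RDA/ONIX Framework itself.

```python
# Hypothetical attribute/value pairs; the real Framework defines its own matrix.
SCHEME_A = {
    "AudioDisc": frozenset({("sense", "hearing"), ("carrier", "disc")}),
    "PrintedBook": frozenset({("sense", "sight"), ("carrier", "volume")}),
}
SCHEME_B = {
    "SoundRecordingOnDisc": frozenset({("sense", "hearing"), ("carrier", "disc")}),
    "Map": frozenset({("sense", "sight"), ("carrier", "sheet")}),
}


def equivalent_categories(source, target):
    """Pair categories whose attribute sets are identical, ignoring names."""
    by_attributes = {attrs: name for name, attrs in target.items()}
    return {
        name: by_attributes[attrs]
        for name, attrs in source.items()
        if attrs in by_attributes
    }


matches = equivalent_categories(SCHEME_A, SCHEME_B)
print(matches)  # {'AudioDisc': 'SoundRecordingOnDisc'}
```

Only the attribute combination is compared, which is why the differently named categories are still matched.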

The Framework was produced in a relatively short time by a small working group of experts in a number of major metadata standards. It was well received upon publication and has been used as a tool for defining the three proposed resource category lists in RDA (Media Type, Carrier Type and Content Type).

As yet none of the detailed lists from ONIX or other commercial standards have been incorporated, although that was and remains the intent. A proposal for resource categories based on the RDA/ONIX Framework was recently drawn up by Rightscom at the request of the International DOI Foundation.

VMF extends the structure of the RDA/ONIX Framework to include relators, extending the matrix to cover the scope of the selected standards, and then populating it with the selected vocabularies.

 

The extended RDA/ONIX Framework: the Vocabulary Mapping Framework

VMF extends the structure and content of the Framework and expresses it in a Semantic Web format which will make it accessible for computer processing, including inference.

Format change and web declaration

The Framework is currently expressed only in a human readable, tabular form. In VMF it is expressed in the Semantic Web description language RDF including the ontology language OWL. It is declared as a SKOS vocabulary, with a namespace and with URIs issued for each term. This combination enables developers to reference the Framework directly as part of their software, and additions to Framework contents can be incorporated dynamically in other systems.

Extensions to structure

The principal extensions to the existing Framework for each term are:

Table: Types of Terms

| Term type | Description |
| --- | --- |
| schema | A metadata schema in which a vocabulary is used (for example, MARC21, ONIX for Books). A schema is a specification of a set of data elements and their relationships. A schema may be represented as an XML or database schema, or as a set of abstract terms and relations set out in a Word document. A metadata schema is a specific representation of metadata, and so one standard may include multiple different metadata schemas (for example, the DDEX standard includes several distinct XML schemas representing different messages). A set of schemas may also be described as a schema (for example, ONIX is a set of schemas including ONIX for Books, ONIX for Serials etc, which may use the same vocabularies). |
| element | A data element in a schema which may have different values in specific documents or messages which conform to the schema (for example, EditionTypeCode in ONIX for Books). |
| vocabulary | A set of defined terms (for example, CodeList21 in ONIX for Books) each of which may be used as a value of an element (for example, CodeList21 is the vocabulary used for the element EditionTypeCode). |
| vocabulary term | A single defined term in a vocabulary, which may be used as a value of an element. |
| attribute | A vocabulary term representing a concept which may be an attribute of an entity. |
| category | A vocabulary term representing a type of entity according to the combination of its attributes. |
| relator | A vocabulary term representing a type of relationship between two entities. |
| verb | A vocabulary term representing an action or state. |

Each vocabulary term is defined within a hierarchical nesting of terms of four different types, for example:

Extensions to content

The existing content of the Framework is currently a set of Carrier and Content Vocabularies containing attributes and some exemplary categories. A number of categories have been defined in RDA vocabularies using the Framework but these have not been formally added to it as there is no mechanism to do so.

The extensions to the content will be as follows:

Criteria for selecting vocabularies to be added

Vocabularies added to the Framework during the original project were taken from the source standards, and were limited to those which describe or are required to support:

The criteria cover relationships which are either permanent (as with a translation of an original work) or dynamic (as with the relationship of web content to a web site).

The aim is to include all vocabularies from the source standards which meet these criteria, but priority will be given to those which are most immediately valuable to the JISC community (for example, RDA vocabularies will take precedence over DDEX).

The Framework will also not attempt to be exhaustive in relation to the source standards: highly specialized vocabularies which are referenced only in one standard will not necessarily be mapped, although they will be referenced in the Framework. For example, ONIX contains a number of Code Lists for a large number of detailed categorizations of Bibles which do not, it appears, map to any other source standard at this point.

Future extensibility

In principle there are no limitations on the addition of other vocabularies to the Framework in the future, from the source standards, other standards or proprietary schemas; nor is there a limitation on adding other types of attribute, category or relationships (for example, the relationships of resources with events, states, places and times).

 

Further information

This section contains more detailed background and rationale for aspects of the VMF.

Metadata vocabularies

Description of a vocabulary

A vocabulary is a set of defined terms. It may be known as a controlled vocabulary, code list, allowed value set, XML enumeration list or by other names. Ideally each term in a vocabulary is clearly defined, although quite often (especially in older standards) they simply rely on a user's understanding of a single word or phrase.

Metadata schemas may be said to have four main components: syntax, structure, element types and value types. The last three all contain some of the meaning of the metadata. Vocabularies are one of the two main value types (the other being literal strings). Vocabularies therefore provide a critical part of the meaning of metadata, but not all of it.

Some metadata standards (such as ONIX, MARC or RDA) have dozens of vocabularies covering hundreds or thousands of terms. Others (like METS or PREMIS) are concerned with structure and elements and have few explicit vocabularies.

Vocabularies are invaluable for accurate and consistent searching, querying and categorization.

There is a tendency across all metadata developments now to use vocabularies wherever possible rather than relying on uncontrolled literal strings whose meanings are inconsistent and normally require individual human interpretation.

Increasingly, metadata schemas enable users to use vocabularies from different schemas, often using XML namespaces and URIs to identify them. ISO Language, Currency and Territory Codes are used by many different schemas. The Library of Congress and Dewey classification systems are investigating the use of XML and RDF.

Growth of vocabularies

The number and size of standard vocabularies are growing steadily. New vocabularies appear regularly. For example, DDEX, the music industry message standards just entering implementation, contains 63 different vocabulary lists containing approximately 580 distinct terms. VMF identified over 40 percent of these as being of bibliographic interest. RDA expects to publish an extensive set of new vocabularies in 2011.

Existing vocabularies are also growing steadily. For example, the ONIX vocabularies ("Onix Code Lists") grew 10 percent between 2005 and 2007 and a further 10 percent in the last year (now 2457 terms in 99 vocabulary lists).

Not all metadata standards include many vocabularies. For example, standards such as METS and PREMIS are concerned principally with structure, and although they support vocabularies they do not define many but rely on vocabularies from other schemes. Other standards such as ISAD and to some extent MARC and Dublin Core have relied more on literal descriptive values which inhibit automated interoperability. The trend in metadata is towards using vocabularies wherever possible.

There are trends for both convergence and divergence in the development and use of vocabularies. The RDA and DC initiatives, for example, are tending to convergence, with the creation and consolidation of substantial vocabularies for widespread use in particular domains.

On the other hand, the ever-increasing growth and specialization of multimedia content, format and delivery methods means that the creation and growth of new, domain- and function-specific standards and vocabularies is inevitable. The number of standards (ISO and other) incorporating vocabularies has exploded in the last decade and there is no reason to expect anything different in the next. The growth of variety and complexity in metadata naturally mirrors the growth of variety and complexity of domains and technology.

Increasing complexity and granularity of vocabularies

The rapid expansion of digital multimedia has brought a corresponding increase in the complexity and granularity of vocabularies, for example:

and so on.

The growing importance of relator vocabularies

Relators (also known variously as properties, relations or sometimes predicates) describe the relationships between the things identified by referents. For example, the relator isLimitedEditionOf may describe the relationship of two books identified with ISBNs: ISBN XXXXXXXXXX isLimitedEditionOf ISBN YYYYYYYYYY.

Metadata standards have traditionally been based on classes or types attached to a single data element (for example LimitedEdition as a category of a book). The trend is now towards using entity relationships defined by relators wherever possible. In particular, Semantic Web metadata is relationship-based through RDF.

VMF understands that RDA and MARC are each producing significant relator vocabularies. CIDOC CRM (and therefore FRBRoo) is predominantly based on relators.

The value of ontology for vocabularies

Ontology techniques and tools, such as FCA (Formal Concept Analysis) used in the RDA/ONIX Framework or the Semantic Web standard OWL language, enable relationships between vocabulary terms to be formally expressed and computed on in ways that are valuable for efficiency and accuracy in metadata use.

For example, when searching for the works of a particular person, a user may wish to include all kinds of creations to which they have contributed, or to limit the search to specific types (say, books of all kinds but not music or audiovisual, or non-fiction but not fiction) or to particular roles played (for example, all writer roles but not producer or director roles). There may be dozens or even hundreds of possible categories or roles at different levels of detail. How does the user know which to include or exclude? If these are organized in a hierarchical matrix such as RDA/ONIX, then a system can enable a user easily to select groups of related hierarchical terms and achieve the most complete but refined searches possible. Without it, results are piecemeal, with omissions and with unwanted inclusions (like an "Amazon.com" search on "John Smith" as Author).
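
The kind of grouped selection described above can be sketched as follows. This is only an illustration: the role names and their hierarchy are hypothetical and are not drawn from RDA/ONIX or any other source standard. Selecting one term expands to that term plus everything beneath it, so a user can ask for "all writer roles" without listing them.

```python
# Hypothetical role hierarchy: child -> parent. Not taken from any real vocabulary.
PARENT = {
    "novelist": "writer",
    "screenwriter": "writer",
    "writer": "contributor",
    "director": "contributor",
    "producer": "contributor",
}


def expand(term):
    """Return the term together with all of its descendants in the hierarchy."""
    selected = {term}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in selected and child not in selected:
                selected.add(child)
                changed = True
    return selected


writer_roles = expand("writer")
print(sorted(writer_roles))  # ['novelist', 'screenwriter', 'writer']
```

A search limited to `writer_roles` would then exclude director and producer roles without the user having to know every specialized term.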

Such hierarchical techniques are common in search and query tools, but their effectiveness is entirely dependent on the availability of the underlying vocabulary structure.

An ontological approach is becoming the norm for robust, complex metadata schemas, for example:

Relators in the Framework

Relator hierarchies

The relator vocabulary will be hierarchical, supporting a relatively small number of high level general relationships (eg isPartOf) and their more specialized children (eg isChapterOf). The hierarchy will allow for multiple parentage to deal with complex relations (eg isCompressedAudioClipOf might be a child of both isAudioClipOf and isCompressionOf).
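
A hierarchy with multiple parentage can be sketched as a mapping from each relator to a list of parents. The relator names come from the examples above; placing isAudioClipOf under isPartOf and treating isCompressionOf as a top-level relator are assumptions made only for this illustration.

```python
# Relator -> list of parents. isChapterOf and isCompressedAudioClipOf follow the
# examples in the text; placing isAudioClipOf under isPartOf is an assumption.
PARENTS = {
    "isPartOf": [],
    "isChapterOf": ["isPartOf"],
    "isAudioClipOf": ["isPartOf"],
    "isCompressionOf": [],
    "isCompressedAudioClipOf": ["isAudioClipOf", "isCompressionOf"],
}


def ancestors(relator):
    """Collect every broader relator reachable through any parent."""
    found = set()
    stack = list(PARENTS[relator])
    while stack:
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(PARENTS[current])
    return found


print(sorted(ancestors("isCompressedAudioClipOf")))
# ['isAudioClipOf', 'isCompressionOf', 'isPartOf']
```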

Relator definition

Relators will be defined in relation to the categories of their domain (subject) and range (object) classes, and by the verbs which define their underlying action or state. For example, a relator isAdaptationOf may be defined as the relator between one text and another which it adapts.

All required domain and range classes will therefore be added to the category vocabulary. In addition, ISO standard identifier types (ISBN, ISRC, ISSN etc) will be added where possible as categories so that the Framework can support the definition of relationships between resources identified with these.

Relator names

Relators will be named for both directions (eg isAdaptationOf and hasAdaptation).
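
One possible way to hold a relator definition together with its two names is sketched below. The field names, the `Relator` class and the category label "Text" are hypothetical and are not Framework identifiers; only isAdaptationOf, hasAdaptation and the verb adapt come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Relator:
    name: str             # e.g. isAdaptationOf
    inverse_name: str     # e.g. hasAdaptation
    domain_category: str  # category of the subject
    range_category: str   # category of the object
    verb: str             # underlying action or state

    def inverse(self):
        """Return the same relationship read in the opposite direction."""
        return Relator(
            name=self.inverse_name,
            inverse_name=self.name,
            domain_category=self.range_category,
            range_category=self.domain_category,
            verb=self.verb,
        )


# The category and verb labels are illustrative, not Framework identifiers.
is_adaptation_of = Relator("isAdaptationOf", "hasAdaptation", "Text", "Text", "adapt")
has_adaptation = is_adaptation_of.inverse()
print(has_adaptation.name)  # hasAdaptation
```

Inverting twice returns the original definition, which is a convenient consistency check on the paired names.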

Relators between resources

The Framework will cover any kind of relationship between two types of resource referenced in the main source standards. VMF expects that there will be 20-30 high level relators and 100-200 more specialized.

As the definitions of relators are dependent upon the attributes of the things they link (see Relator definition above), additional resource categories will be defined in the Framework to support the roles that relators require. To say, for example, that a relator isRecordingOf links a work to a recorded performance requires that a work and a recorded performance are already defined.

Relators between resources and parties

Parties are defined as individuals, groups of individuals or organizations. As the vocabulary is resource-centric, relators covered in the vocabulary will be those between parties and resources, where a party plays a role as a creator, contributor, publisher, owner, collector or otherwise affects a resource.

As with resource relators, VMF expects there will be 20-30 important general relators (such as hasCreator, hasAuthor, hasContributor, hasTranslator, hasDirector, hasProducer, hasPerformer, hasPublisher, hasSupplier, hasDistributor, hasRightsController) and several hundred more specialized relators lower in the hierarchy for more refined use.

Verbs

The meaning of a resource relator (for example, isAdaptationOf) may be directly linked to the meaning of a relator between a party and a resource (for example, hasAdaptor) and both come (directly or indirectly) from the same underlying verb or verbs (adapt in this example). This principle is explicit in both the <indecs> and CIDOC reference models. The task of defining relators therefore includes the definition of the underlying verbs and their hierarchies. The MPEG21 RDD set of verbs will be taken as a starting point.

Schema-to-schema mapping with the RDA/ONIX Framework

There are numerous "crosswalks", mappings or transforms available for schema-to-schema mapping. Some of these are usable tools, while others are specifications or guides. The Framework as proposed will be a tool, but not to provide a complete transformation, only for the mapping of vocabularies between schemes. It can be incorporated as a part of a complete crosswalk or transform.

The value of "hub-and-spoke" mapping

Two common, practical problems with schema and vocabulary mapping are:

However, if each vocabulary is mapped to a central schema like the Vocabulary Mapping Framework which is specifically structured for mapping to accommodate more or less any concept, then these problems can be overcome as far as is possible. This is sometimes known as a "hub-and-spoke" mapping approach. The Framework will enable the computation of the "best fit" mapping for any term in a mapped schema with any other mapped schema, using the Framework's ontological relationships. This can solve the "semantic loss" problem as well as possible (no Framework can compensate for meaning which is not there in the original, or is not derivable from it), and will eliminate the issue of combinatory growth, as each schema requires only a single mapping to the Framework to enable mappings to any other mapped schema.
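
The difference in scale can be illustrated with simple arithmetic. The sketch assumes one mapping per pair of schemas in the direct case (separate mappings for each direction would double that figure) and one mapping per schema in the hub case.

```python
def pairwise_mappings(n):
    """Mappings needed if every schema is mapped directly to every other (one per pair)."""
    return n * (n - 1) // 2


def hub_mappings(n):
    """Mappings needed if every schema is mapped once to a central hub."""
    return n


for n in (5, 10, 20):
    print(n, pairwise_mappings(n), hub_mappings(n))
# 5 10 5
# 10 45 10
# 20 190 20
```

The direct approach grows quadratically with the number of schemas, while the hub approach grows linearly.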

Contextual mapping

It is not sufficient to map the terms of a vocabulary to the Framework without reference to the context in which they are used: specifically the element of the schema to which the vocabulary is being applied. The reason for this is that a single vocabulary may be applied to several elements in a schema, and it may have a different meaning, and therefore require a different mapping, in each case.

For example, in a particular scheme a vocabulary of contributor roles (author, editor etc) may be used in one element to show the role played by a party in relation to a particular resource, and in another element to categorise a party according to the roles with which they are commonly associated. For example, Aldous Huxley may be described as the author of the specific work Brave New World, and he may also be categorised as an author in general, using the same vocabulary. In the first case, the scheme term author is mapped to a Framework relator (such as isAuthorOf), and in the second case to an RDA/ONIX category (for example, a creator of words).

For this reason, each term in the Framework is defined by a nesting.
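
A minimal way to picture contextual mapping is a lookup keyed on schema, element and term together, rather than on the term alone. The schema name and element names below are hypothetical; the two target terms are the ones used in the example above.

```python
# Keys are (schema, element, term); all schema and element names are hypothetical.
CONTEXTUAL_MAP = {
    ("ExampleScheme", "ContributorRole", "author"): ("relator", "isAuthorOf"),
    ("ExampleScheme", "PartyType", "author"): ("category", "creator of words"),
}


def map_term(schema, element, term):
    """Look up the Framework term for a scheme term in a given element context."""
    return CONTEXTUAL_MAP[(schema, element, term)]


print(map_term("ExampleScheme", "ContributorRole", "author"))
# ('relator', 'isAuthorOf')
print(map_term("ExampleScheme", "PartyType", "author"))
# ('category', 'creator of words')
```

The same scheme term resolves to a relator in one element and to a category in the other, which a term-only lookup could not express.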

Deriving schema-to-schema mappings automatically

The Framework supports the generation of mappings between terms from different schemes using basic ontological inference on subclass and subrelator hierarchies. Where two terms from different schemas are mapped to the same Framework term, an isSameAs equivalence can be discovered simply. Where two terms are mapped to terms which are hierarchically related, a "best fit" mapping can be discovered (for example, AudioCD in one schema may have a best fit mapping to CD in another). On occasion there may be multiple possible "best fit" mappings.
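
The inference described here can be sketched in a few lines. This is a simplified illustration, not the Framework's actual algorithm: the hub hierarchy, the prefixed scheme terms and the label "bestFit" are assumptions, and the sketch only looks upward for a broader target term.

```python
# Hypothetical hub hierarchy (child -> parent) and scheme-to-hub mappings.
HUB_PARENT = {"AudioCD": "CD", "CD": "OpticalDisc"}
SCHEME_A_TO_HUB = {"a:AudioCD": "AudioCD", "a:CompactDisc": "CD"}
SCHEME_B_TO_HUB = {"b:CD": "CD"}


def derive_mapping(source_term, source_map, target_map):
    """Return (relation, target term) by walking up the hub hierarchy."""
    hub_to_target = {hub: term for term, hub in target_map.items()}
    hub_term = source_map[source_term]
    relation = "isSameAs"
    while hub_term is not None:
        if hub_term in hub_to_target:
            return relation, hub_to_target[hub_term]
        hub_term = HUB_PARENT.get(hub_term)
        relation = "bestFit"  # anything found above the exact term is broader
    return None


print(derive_mapping("a:CompactDisc", SCHEME_A_TO_HUB, SCHEME_B_TO_HUB))
# ('isSameAs', 'b:CD')
print(derive_mapping("a:AudioCD", SCHEME_A_TO_HUB, SCHEME_B_TO_HUB))
# ('bestFit', 'b:CD')
```

Neither scheme is mapped to the other directly; both results are derived from each scheme's single mapping to the hub.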

There are of course occasions where no mapping is possible. Where there is no equivalent or "best fit" map, the Framework relations can be used to determine the closest mappings where there is some but not complete commonality of attributes.
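
One plausible reading of "closest mapping" is to rank candidate terms by how many attributes they share with the source term. The sketch below uses a simple shared-attribute count; this measure, the attribute names and the candidate terms are all assumptions made for illustration, and the source does not specify how closeness is computed.

```python
def closest(source_attrs, candidates):
    """Return the candidate terms sharing the most attributes with the source."""
    scored = [
        (len(source_attrs & attrs), name) for name, attrs in candidates.items()
    ]
    best_score = max((score for score, _ in scored), default=0)
    if best_score == 0:
        return []  # no commonality at all, so no mapping is offered
    return sorted(name for score, name in scored if score == best_score)


# Hypothetical attribute sets for one source term and three target terms.
source = frozenset({"audio", "disc", "compressed"})
targets = {
    "b:Disc": {"disc"},
    "b:AudioFile": {"audio", "compressed", "file"},
    "b:Map": {"image", "sheet"},
}
print(closest(source, targets))  # ['b:AudioFile']
```

Ties are returned together, which corresponds to the case where several candidates are equally close.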

Where the target scheme for a mapping uses literal text rather than a vocabulary for a particular element, the names of the vocabulary terms from the source scheme can be used as literals. The second pair of lists in use case 1 provides an example of this.

Mapping local and proprietary vocabularies

Although the Framework initially includes only standard vocabularies, it is open to anyone to include proprietary or "de facto standard" vocabularies to support transformations. These can be registered publicly in the Framework for use by third parties, or simply created for local use in the users' own copy of the Framework. The structure of the Framework schema and its facility for allowing different authorization types would allow a local user to supplement the existing Framework while keeping the standard and local vocabularies and mappings distinct from one another.

 

News & Announcements

19 Apr 2011 The VMF website was moved from the University of Strathclyde where it resided during the project phase to the IDF website.
23 Dec 2009 Version 1.0 of the alpha release of the VMF matrix was made available. It removes an entry which had not been correctly encoded in UTF-8, as required by the Turtle specification. The entry will be reinstated in due course.
17 Dec 2009 The first alpha release of the VMF matrix, which included the VMF ontology and mappings from third-party vocabularies, was made available.
7 Dec 2009 Presentations and a report from the Vocabulary Mapping Framework Seminar, 9 Nov 2009 are available.
15 Jun 2009 Project announcement: Major content metadata vocabularies to be mapped. The announcement attracted a lot of positive comment in various blogs and newsletters.
 

Dissemination

These external sources refer to VMF:

Selected blog posts: