Discussions about automatic metadata generation seem to conflate a number of different processes. As I see it there are several separate but related areas of interest in ‘automatic’ metadata generation; these are:
1. The automatic population of record values based on collection-level profiles
2. The automatic population of record values based on user or object profiles
3. The automatic population of record values derived from other created metadata values
4. The automatic generation of metadata from existing structured information about the asset
5. The automatic generation of metadata from the use of the asset itself
6. The automatic generation of metadata from the content of the asset itself
The automatic population of record values based on collection-level profiles (bulk transforms)
By this I mean the addition to records of values which are fixed attributes of the collection they are part of. The collection could be the complete set of records in the repository or any meaningful or useful subset. This automatic population of values could be part of a bulk import process or part of a record export process (singly or in bulk). Such profiles are particularly important when considering the movement and interoperability of metadata – some metadata may be unnecessary in a local context but vital in a wider context.
For example, the university’s name could be added to the publisher details when a set of metadata about institutional minutes is exposed for harvest. Another example would be the automatic addition of educational level information to a set of learning objects designed by high school teachers for local use but which need to be tagged at that level as their metadata is available in aggregator services.
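As a minimal sketch of such a bulk transform, the snippet below applies a collection-level profile to a set of records on export; the record structure, field names, and profile values are assumptions for illustration, not a real repository schema.

```python
# A minimal sketch of a collection-level profile applied on export.
# The record structure and field names are illustrative assumptions.

COLLECTION_PROFILE = {
    "publisher": "Example University",
    "educational_level": "high school",
}

def apply_collection_profile(record: dict, profile: dict) -> dict:
    """Return a copy of the record with profile values added.

    Existing values are kept: the profile only fills in fields that
    are missing or empty, since these are collection-level defaults.
    """
    enriched = dict(record)
    for field, value in profile.items():
        if not enriched.get(field):
            enriched[field] = value
    return enriched

def export_for_harvest(records: list[dict], profile: dict) -> list[dict]:
    """Apply the profile to every record in the exported set."""
    return [apply_collection_profile(r, profile) for r in records]

if __name__ == "__main__":
    local_records = [
        {"title": "Senate minutes, May", "creator": "University Senate"},
        {"title": "Intro to fractions", "publisher": "Maths Dept"},
    ]
    for record in export_for_harvest(local_records, COLLECTION_PROFILE):
        print(record)
```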
The automatic population of record values based on user or object profiles
This process uses a user or object template to provide automatic or default (i.e. editable) values to a metadata record. It is most likely to be part of the record creation process.
Arguably this is very similar to process 1; I’ve separated them to try to draw out the distinction between what is a collection-based transformation and what is essentially a stored template.
For example, the system automatically adds the author’s details and institutional affiliation to a record when the user starts to create it; or the system applies a departmental profile for thesis deposit which provides standard information about the department and the university.
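A sketch of how such a stored template might pre-fill a new record at creation time, assuming an invented user profile and field names:

```python
# A sketch of pre-filling a new record from a stored user profile.
# The profile fields are illustrative assumptions; in practice they
# would come from the user's account or a departmental template.

USER_PROFILES = {
    "jsmith": {
        "creator": "Smith, Jane",
        "affiliation": "Example University, Dept. of History",
        "contact_email": "j.smith@example.ac.uk",
    },
}

def new_record(user_id: str, profiles: dict) -> dict:
    """Start a record with default (editable) values from the user's profile."""
    record = {"title": "", "date": "", "description": ""}
    record.update(profiles.get(user_id, {}))
    return record

if __name__ == "__main__":
    draft = new_record("jsmith", USER_PROFILES)
    # The user can now overwrite any of the defaults before saving.
    print(draft)
```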
The automatic population of record values derived from other created metadata values
This involves the direct derivation of metadata values from user-created values. The process may involve the use of external services. It is distinct from a profile in that a lookup or web service call is required, triggered by the data entered.
For example, creating a metadata entry for (or link to) representation information based on the supplied file type or including the author’s email address based on their name.
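A minimal sketch of this kind of derivation, with local lookup tables standing in for what would more likely be external registry or web service calls; the table contents are invented:

```python
# A sketch of deriving one metadata value from another via a lookup.
# The registries below are stand-ins for external services (e.g. a
# format registry queried over HTTP); their contents are invented.

FORMAT_REGISTRY = {
    "application/pdf": "Portable Document Format (PDF) specification",
    "text/html": "HTML specification",
    "image/tiff": "TIFF 6.0 specification",
}

AUTHOR_DIRECTORY = {
    "Smith, Jane": "j.smith@example.ac.uk",
}

def derive_values(record: dict) -> dict:
    """Fill in values that follow directly from values the user entered."""
    enriched = dict(record)
    mime = enriched.get("format")
    if mime in FORMAT_REGISTRY:
        enriched["representation_information"] = FORMAT_REGISTRY[mime]
    author = enriched.get("creator")
    if author in AUTHOR_DIRECTORY:
        enriched.setdefault("contact_email", AUTHOR_DIRECTORY[author])
    return enriched

if __name__ == "__main__":
    entered = {"creator": "Smith, Jane", "format": "application/pdf"}
    print(derive_values(entered))
```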
The automatic generation of metadata from existing structured information about the asset
This involves utilizing the supplied metadata, or other structured information supplied with the asset, to attempt to generate further metadata about the asset.
This is where I feel the process moves from population into generation (3 is still population, as it’s a direct inference; 4 involves a step beyond this).
For example, mapping a paper to a subject area based on what a system can infer from publicly available information about an author (e.g. where they work, the courses they are associated with, the research funding they’ve obtained, previous items they’ve submitted to the system). Another case would be the assignment of an audience or educational level to a learning object based on the same available information.
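To make the heuristic nature of this step concrete, the sketch below maps an author’s known department and prior submissions to candidate subject terms; the data and rules are invented for illustration.

```python
# A sketch of inferring a subject area from structured information
# held about an author. The author records and inference rules are
# invented; a real system might draw on staff directories, funder
# databases, or prior submissions.

AUTHOR_INFO = {
    "Smith, Jane": {
        "department": "Dept. of History",
        "grants": ["AHRC medieval manuscripts project"],
        "prior_subjects": ["Palaeography", "Medieval history"],
    },
}

DEPARTMENT_SUBJECTS = {
    "Dept. of History": "History",
    "Dept. of Physics": "Physics",
}

def infer_subjects(author: str) -> list[str]:
    """Suggest candidate subject terms from what we know about the author."""
    info = AUTHOR_INFO.get(author, {})
    candidates: list[str] = []
    dept = info.get("department")
    if dept in DEPARTMENT_SUBJECTS:
        candidates.append(DEPARTMENT_SUBJECTS[dept])
    # Subjects assigned to the author's previous submissions are strong hints.
    candidates.extend(info.get("prior_subjects", []))
    # De-duplicate while keeping order.
    return list(dict.fromkeys(candidates))

if __name__ == "__main__":
    print(infer_subjects("Smith, Jane"))
```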
The automatic generation of metadata from the use of the asset itself
This covers a spectrum of techniques based on the use of the asset. Common approaches range from inferring quality from usage statistics, citations, or links, to inferring metadata more directly from the analysis of user reviews, annotations, feedback, or tagging patterns.
For example, Google’s search ranking, Amazon’s recommender system, or citation metrics (though I’m not entirely sure I’d put citation metrics here).
This is arguably not formal metadata (though the LOM allows for user annotations to be formally included), but it increasingly plays a role in the management and discovery of assets and is likely to continue to do so.
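A toy sketch of how usage signals might be turned into metadata, with invented thresholds and field names (a real system would need far more care with the statistics):

```python
# A sketch of generating metadata from how an asset is used.
# The thresholds and derived fields are illustrative assumptions,
# not an established metric.

from collections import Counter

def usage_metadata(downloads: int, citations: int, tags: list[str]) -> dict:
    """Derive coarse usage-based metadata from raw signals."""
    derived = {}
    # A crude popularity indicator from usage statistics.
    if downloads > 1000 or citations > 50:
        derived["usage_level"] = "high"
    elif downloads > 100 or citations > 5:
        derived["usage_level"] = "medium"
    else:
        derived["usage_level"] = "low"
    # Promote tags applied by several users to candidate keywords.
    tag_counts = Counter(tags)
    derived["candidate_keywords"] = [t for t, n in tag_counts.items() if n >= 3]
    return derived

if __name__ == "__main__":
    tags = ["osmosis", "biology", "osmosis", "cells",
            "osmosis", "biology", "biology"]
    print(usage_metadata(downloads=1500, citations=12, tags=tags))
```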
The automatic generation of metadata from the content of the asset itself
The Holy Grail of automatic metadata generation is often regarded as the ability to remove the need for manual metadata creation. Whether or not this is always desirable, there is a greater certainty that it should be possible to derive more information from the asset itself – especially when the asset is text-based. This can involve trying to identify structured information contained within the asset, or, more basically, attempting to derive structured information from the entirety of the asset.
For example, tools which mine PDFs or scrape web pages in an attempt to identify the title, author, and references contained in a paper, or tools which count the words in a document to try to match it to a subject term.
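As a minimal sketch of the word-counting approach, the snippet below scores a document against small per-subject keyword lists; the vocabularies are toy assumptions, far simpler than a real classifier:

```python
# A toy sketch of deriving a subject term from the content of a text
# asset by keyword counting. The subject vocabularies are invented;
# real systems would use trained classifiers over richer features.

import re
from collections import Counter

SUBJECT_VOCABULARIES = {
    "Biology": {"cell", "osmosis", "membrane", "organism", "protein"},
    "History": {"medieval", "archive", "manuscript", "century", "empire"},
}

def guess_subject(text: str) -> str | None:
    """Return the subject whose vocabulary best matches the text, if any."""
    words = Counter(re.findall(r"[a-z]+", text.lower()))
    scores = {
        subject: sum(words[w] for w in vocab)
        for subject, vocab in SUBJECT_VOCABULARIES.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

if __name__ == "__main__":
    sample = ("Osmosis moves water across the cell membrane; "
              "the organism regulates this via the membrane.")
    print(guess_subject(sample))  # -> Biology
```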