Metadata is a hot topic in digital humanities. We have metadata repositories, metadata standards, metadata schemas, metadata registries, and even metadata games! But metadata is also scary. Sometimes, metadata is seen as something obscure and difficult to understand. And it is! Muddled definitions, ambiguous descriptions, and outright incorrect statements have done much to make metadata a complex topic. However, it doesn’t need to be so. Also, I must say that, after 30 years working in software engineering, I have heard the term “metadata” in this field only a few times. Metadata does not seem to be an especially important concept to the engineers building the information systems that sustain our planes and power our computers. Digital humanities, however, seems to have developed a suspicious keenness for this concept. In this post, I will clarify a few concepts, provide a simple and accessible definition of metadata, and make the bold but perhaps true claim that metadata does not really exist.
Informally, metadata is often described as “data about other data” [Oxford Dictionary, Wikipedia]. But what does this mean? In practice, I have observed that “metadata” is often employed to refer to two different things:
1. The structure of other data. For example, imagine that we create a database to store information about people. The specific fields that we decide to employ, such as FamilyName or DateOfBirth, would be metadata. The data types of these fields, such as text for FamilyName or datetime for DateOfBirth, would also be metadata.
2. A description of other data. For example, imagine that we want to document not only the name and date of birth of every person in our database; we also want to record who entered those pieces of information and when, so we add LastUpdatedTime and LastUpdatedBy fields. The data that we enter for these fields, such as 8 June 2016 by Alice, would be metadata.
Note that the two kinds of metadata are very different. Firstly, type 1 metadata is composed of type-level entities, whereas type 2 metadata is composed of instance-level entities. In other words, metadata of type 1 above is made of categories of things, such as the field names and types in a database schema. By contrast, metadata of type 2 is made of things, rather than categories, such as the specific time when something was recorded, or the name of the person who recorded it. Secondly, type 1 metadata is a prerequisite for data to exist; we cannot enter data in a database if we haven’t first decided what structure it will have. By contrast, type 2 metadata is an optional addition to data: we can document who recorded something and when only if we wish, and always after it has been recorded. In other words, we can strip data of its type 2 metadata, but not of its type 1 metadata.
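To make the distinction concrete, here is a minimal sketch in Python using an in-memory SQLite database. The field names come from the example above; the table name Person and the sample values "Smith" and "1980-02-29" are assumptions made up for illustration. The sketch reads the structure (type 1) from the schema itself, reads the description (type 2) from ordinary rows, and shows that the description can be left out while the structure cannot.

```python
import sqlite3

# Hypothetical table and sample values, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE Person (
        FamilyName      TEXT,
        DateOfBirth     DATETIME,
        LastUpdatedTime DATETIME,
        LastUpdatedBy   TEXT
    )
    """
)
conn.execute(
    "INSERT INTO Person VALUES (?, ?, ?, ?)",
    ("Smith", "1980-02-29", "2016-06-08", "Alice"),
)

# Type 1: field names and declared types. These live in the schema,
# not in any row, and must exist before a row can be inserted.
structure = {
    row[1]: row[2] for row in conn.execute("PRAGMA table_info(Person)")
}

# Type 2: values that describe how the row itself came to be recorded.
description = conn.execute(
    "SELECT LastUpdatedTime, LastUpdatedBy FROM Person"
).fetchone()

# Stripping type 2 is just a matter of not selecting those columns.
# The remaining values still depend on the structure above.
plain = conn.execute(
    "SELECT FamilyName, DateOfBirth FROM Person"
).fetchone()

print(structure)
print(description)
print(plain)
```

Note that nothing in the code treats the LastUpdated columns differently from the others: they are declared, filled in, and queried exactly like FamilyName.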
Finally, it is easy to see that type 1 metadata corresponds to the structure of data and, as a consequence, it specifies what data is considered and which is not. In this sense, type 1 metadata is not simply data about data; rather, it is data-defining data. A database schema, for example, is not merely “describing” what data is stored in it, but specifying what data may be stored and, therefore, what may not. For this reason, I think that type 1 metadata should not be called “metadata” at all, but data structure, data schema, data model, or something like that, depending on your preferences and your particular field of expertise.
Type 2 metadata is genuinely data about data. Let’s then forget about type 1 and focus on type 2 for the rest of this post. And let’s also forget about blatantly incorrect statements indicating that metadata is just data about something. This seems ludicrous, but many web sites and other sources claim it boldly. For example, here you can read that “metadata is information about something else”, and the title and author of a book are given as examples of metadata about the book. Well, if that’s metadata, where is the data? A book’s title and author are not metadata about the book, but data about the book. Similarly, some standards such as those from BIC state that basic metadata requirements for books include ISBN, title, publication date, etc. Again, this is nonsense. If we call these metadata, what is the data? Metadata is not data about some arbitrary thing; metadata is data about data.
But, what does it mean to say that metadata is data about data? Well, data represents individual properties of things in the world, as we said in a previous post. In this regard, data is about things. These things may be really anything, since we can, in principle, represent any perceivable or conceivable object as far as our senses permit it. For example, data can be about houses, people, songs, events, thoughts or dreams. And data can be about data too. Data is just one more kind of thing in the world. Look now at the following statements that we have made:
1. Data is about things.
2. Metadata is about data.
I argue that the second statement is a special case of the first one, as data is a specific kind of thing, and, more interestingly, metadata is a specific kind of data. This should not come as a surprise, since the prefix “meta-” in “metadata” is qualifying “data”, and most qualifiers do not substantially change the meaning of the noun being qualified. In consequence, we can safely state that metadata is the kind of data that represents not people or songs or events, but other data.
I say that this is interesting because of the subsumption principle often referred to as the Liskov substitution principle. This principle roughly says that anything that we say about elements of a category also applies to elements of any sub-categories. For example, if we agree that all fruit (a category) is juicy, then we should admit that oranges, pears, strawberries and plums (the sub-categories) are also juicy. We could say that we can replace “fruit” with any sub-category of fruit in any sentence, and things should still make sense. For example, since we agreed that “all fruit is juicy”, then we could also say that “all oranges are juicy”, or “all plums are juicy”. If we found a kind of fruit that is not juicy (bananas, perhaps), then we should conclude that we were wrong in asserting that all fruit is juicy.
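For readers who prefer code, the fruit example can be sketched in Python as follows. The class and function names are hypothetical, chosen only to mirror the example; this illustrates the idea and is not a formal statement of the principle.

```python
class Fruit:
    """The category: we assert that all fruit is juicy."""

    def is_juicy(self) -> bool:
        return True


class Orange(Fruit):
    """A sub-category that inherits everything we said about fruit."""


class Plum(Fruit):
    """Another sub-category that keeps the promise made by Fruit."""


class Banana(Fruit):
    """A sub-category that breaks the promise made by Fruit."""

    def is_juicy(self) -> bool:
        return False


def all_juicy(basket: list) -> bool:
    # Written against Fruit only; it should work for any sub-category.
    return all(fruit.is_juicy() for fruit in basket)


print(all_juicy([Orange(), Plum()]))
print(all_juicy([Orange(), Banana()]))
```

The first basket behaves as the claim about fruit promises. The second one does not, which tells us that either bananas are not fruit in this model or the claim that all fruit is juicy was wrong to begin with.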
How is this related to metadata? Well, if metadata is a kind of data, then everything that we say about data also applies to metadata. In other words, we can replace any occurrences of “data” in a sentence with “metadata”, and things should make sense. Doing this in statement 2 above, we have “metadata is about metadata”. Really?
Actually, yes. Think about it. If metadata is about data, and metadata is also data itself, then why can’t we have metadata about other metadata? We could call it meta-metadata if you wish. And we could have meta-meta-metadata too, and so on and so forth. For example:
- Our representation of individual people in the database constitutes plain data.
- Data about who entered plain data and when constitutes metadata.
- Data about who entered metadata, and when, constitutes meta-metadata.
- Data about who entered meta-metadata, and when, constitutes meta-meta-metadata.
- And so on and so forth.
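The layering above can be sketched with a single record type in Python. The names Record, about, meta_level and label are hypothetical, as are the sample values other than 8 June 2016 and Alice; the point is only that each layer is an ordinary record that happens to refer to another record.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    content: dict
    # The record that this one describes, or None if it describes
    # something that is not a record (a person, a place, an event).
    about: Optional["Record"] = None


def meta_level(record: Record) -> int:
    """Count how many records lie between this one and plain data."""
    level = 0
    while record.about is not None:
        level += 1
        record = record.about
    return level


def label(record: Record) -> str:
    level = meta_level(record)
    if level == 0:
        return "plain data"
    return "meta-" * (level - 1) + "metadata"


person = Record({"FamilyName": "Smith"})
entry = Record({"by": "Alice", "on": "2016-06-08"}, about=person)
entry_of_entry = Record({"by": "Bob", "on": "2016-06-09"}, about=entry)

for record in (person, entry, entry_of_entry):
    print(label(record))
```

There is one class, not three. The "meta" level is computed from what a record refers to, not stored as a property of the record.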
You may think that I am pushing things over the top, but I am not. Imagine the following situation. A database is created to document the details of key individuals in a community, the places where they live, and the events that occur in relation to them, in order to study the dynamics of the group. The data that we enter in this database describes people, events and places and, as such, constitutes plain data. Now imagine that we are also interested in recording who documents what, when, and from which sources. To do this, we may add tables or fields to the database so that we can record who documented each individual, place or event, when they did it, and what sources they used to obtain the information. This data, obviously, constitutes metadata, as it describes plain data.
Now imagine that a sociologist decides to carry out a project to study how researchers in anthropology work and, specifically, what relationships exist between specific types of sources and times of the day. To this sociologist, our records of who entered what information, when and from which sources would be very interesting. To them, the people, places and events in the database are not relevant; what they are interested in is the metadata that we recorded. In other words, the sociologist would be looking at our metadata as their primary object of study or, rather, as raw data that they need to describe, summarise, process and interpret. Put simply, what is metadata to us becomes plain data to them.
This happens because we put the focus on different places. To us, recording who entered what is an additional layer of information that complements our data. To them, these records are primary data. This is to say, the same data may be seen as plain or meta depending on who looks at it and for which purpose. Or, in other words, metadata does not correspond to a specific sub-category of data, but to a role that data may play in certain scenarios. In this regard, metadata is not made of a different substance, and does not possess a different essence, as compared to plain data. They are both made of the same stuff, so to speak. But some data, which happens to represent other data rather than other kinds of things in the world, may play a meta role in some circumstances.
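As a rough sketch of the sociologist's side of the story, the Python snippet below takes a small log of who entered what, when and from which kind of source, and analyses it with the same generic tools one would use on any other data. All field names and values are assumptions invented for illustration, and the morning/afternoon split is an arbitrary simplification.

```python
from collections import Counter
from datetime import datetime

# To the anthropologists, this log is metadata about their records.
# To the sociologist, it is simply the dataset under study.
entry_log = [
    {"subject": "person:1", "entered_by": "Alice",
     "entered_at": datetime(2016, 6, 8, 9, 30), "source_type": "interview"},
    {"subject": "place:4", "entered_by": "Alice",
     "entered_at": datetime(2016, 6, 8, 15, 10), "source_type": "archive"},
    {"subject": "event:7", "entered_by": "Bob",
     "entered_at": datetime(2016, 6, 9, 10, 5), "source_type": "interview"},
]


def time_of_day(timestamp: datetime) -> str:
    # Arbitrary two-way split, assumed for this sketch.
    return "morning" if timestamp.hour < 12 else "afternoon"


# The question of interest: which source types are used at which times?
counts = Counter(
    (entry["source_type"], time_of_day(entry["entered_at"]))
    for entry in entry_log
)

for (source_type, period), n in sorted(counts.items()):
    print(source_type, period, n)
```

Nothing in the analysis needs to know that the log once played a meta role. The subject column, which pointed at the anthropologists' plain data, is simply ignored.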
Now, if we agree that metadata does not constitute a separate kind of thing, and coming back to the Liskov substitution principle, then we must conclude that metadata should be treated like any other data. We can use the same tools, languages, approaches and techniques to deal with metadata as we do for plain data. We don’t need specifically designed repositories to hold metadata; good old databases should be enough. We don’t need metadata standards; data standards should work. We don’t need metadata schemas; data schemas should do. I admit I am exaggerating a bit here, because metadata, as a special kind of data, may benefit from specific treatment, very much like pineapples benefit from a specialised pineapple corer over a generic knife. The point is that we should be critical of over-emphasising metadata as a different category, and of accompanying it with myriads of specifically designed technologies and tools, as we often see in digital humanities. By default, we should ask ourselves “can I solve this metadata problem by using regular data techniques or tools?”. Only in very specific scenarios should we need very specific means.