Structuring content with semantic technologies

In a recent post, I discussed the principles of modularity and layering in software engineering, and how they affect information (or data) in the digital humanities. Today I am focussing on a related but different aspect: how to structure content in digital humanities, and how conventional semantic technologies lack the necessary features to do it.

But first let’s introduce some concepts. When I say “content”, I am referring to anything that is represented inside a computer system. Data and information are both content. Knowledge is not content since it resides in our heads rather than a computer system, but it can be represented in a software system in terms of information. Many definitions have been offered for data, information and knowledge (the DIKW pyramid is a classic), but I will use very simple ones:

  • Data consist of simple, “naked” quantities or qualities that represent individual properties of something. For example, Age = 38 is a piece of data, as is the more complex p: Person { Name = “Alice”; Age = 38 }.
  • Information is data in a message. In other words, information is data being conveyed from someone to someone else (possibly many people) for a certain purpose. For example, if you ask me “Who’s that person over there?” and I reply with “Oh, that’s Alice”, that’s information.
  • Knowledge is justified true belief, as expressed by the ancient Greeks. Knowledge resides in our heads, but can be represented as information to be communicated.

From these definitions it should be clear that computer systems store and manipulate data which, when transmitted, constitutes the core of information. This information, in turn, may represent existing knowledge, and may be assimilated by us to help us create new knowledge or revise existing knowledge.

The modularity principle that we discussed in a previous post applies to content in a computer, of course. Imagine a computer storing a detailed representation of a whole building, including its structure, dimensions, materials, location, uses and owners. This can constitute a large amount of data. In order to be manageable, data must be modular, that is, it must be organised in meaningful chunks having a strong internal cohesion and a loose external coupling. If you are not familiar with these terms, please check out my previous post mentioned above. For example, we could organise the building data by putting everything related to the building structure together, and then connecting it to a separate module containing materials, and then these two would be connected to another one about uses, and so on and so forth. Or, alternatively, we could put everything about the first floor together, including its structure, materials, dimensions, location, etc., and then connect it to a second module containing everything about the second floor, and so on. The modularity scheme that we adopt will depend on our purpose but, in any case, we need one. A system where all the data is lumped together in a single big blob, with no “texture”, is not acceptable, because:

  • It would make finding things very difficult. Imagine that you had all your clothes and gear in a single large chest, instead of using a set of (more or less) tidy shelves and closets. By putting your things in well-organised compartments, you make retrieving what you need much easier. Data follows the same principle.
  • It would make maintenance much harder, as dependencies get out of control. If everything is lumped together, changing, replacing or deleting one data item could affect many neighbouring data items. If, on the contrary, data is organised in separate “bins”, then we only need to look inside the bin that holds the data to be altered in order to check for dependencies.
  • It would hinder special treatment of data. Different kinds of data are better treated in different manners. For example, a relational database is amazingly good at storing and retrieving numbers and short strings; images or other large binary objects, although they can also be stored in a relational database, are more optimally stored in other kinds of systems. Likewise, your shoes are better put away in a shoe shelf, whereas your smart suits are better hung from a clothes hanger. If you had a single large chest, everything would be treated the same and specificities would be ignored.
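
Returning to the building example, the two modularity schemes described above (by aspect, or by floor) can be sketched as two different groupings of the same records. This is only an illustration: the record fields, the values and the `modularise` helper are all assumptions made up for the example, not part of any real system.

```python
from collections import defaultdict

# Illustrative records only; the fields and values are assumptions.
records = [
    {"floor": 1, "aspect": "structure", "value": "load-bearing walls"},
    {"floor": 1, "aspect": "materials", "value": "brick"},
    {"floor": 2, "aspect": "structure", "value": "timber frame"},
    {"floor": 2, "aspect": "materials", "value": "timber"},
]


def modularise(records, key):
    """Group record values into modules, one module per value of `key`."""
    modules = defaultdict(list)
    for record in records:
        modules[record[key]].append(record["value"])
    return dict(modules)


# Scheme 1: everything about structure together, everything about materials together.
by_aspect = modularise(records, "aspect")

# Scheme 2: everything about the first floor together, then the second floor.
by_floor = modularise(records, "floor")
```

Both groupings hold exactly the same data; which one is preferable depends on the purpose, as noted above.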

In addition, consider the following. When we represent the world, we are assuming that it is discrete, that is, that it can be “chopped up” into distinct entities, and we represent each of these entities as an element of some kind in our computer systems. For example, if you are constructing a database for archaeological sites, you will probably have separate tables for sites, features and finds; and you will probably represent each individual site, feature or find as an individual row in one of these tables. In other words, there is a correspondence between the structure that we use to organise the world, and the structure of our computer system. The structure of the world is very often complex: some entities are aggregates of other entities, some entities are related to others in a multitude of ways, entities have properties, sets of entities make up closely related cohesive groups that “work together”, there are different types of entities, some types subsume other types, etc. All this complexity must be reflected, somehow, in the structure of the computer systems that we employ to store and manage the associated data.
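
As a rough sketch of the archaeological database mentioned above, the following uses an in-memory SQLite database with one table per kind of entity and one row per individual entity. The table names, column names and sample rows are assumptions chosen for illustration; a real schema would be considerably richer.

```python
import sqlite3

# Hypothetical schema: one table per kind of entity in the world.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sites    (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE features (id INTEGER PRIMARY KEY, kind TEXT NOT NULL,
                           site_id INTEGER NOT NULL REFERENCES sites(id));
    CREATE TABLE finds    (id INTEGER PRIMARY KEY, description TEXT NOT NULL,
                           feature_id INTEGER NOT NULL REFERENCES features(id));
""")

# Each individual site, feature or find becomes one row (made-up sample data).
conn.execute("INSERT INTO sites (id, name) VALUES (1, 'Hilltop settlement')")
conn.execute("INSERT INTO features (id, kind, site_id) VALUES (1, 'hearth', 1)")
conn.execute(
    "INSERT INTO finds (id, description, feature_id) VALUES (1, 'pottery sherd', 1)"
)
conn.execute(
    "INSERT INTO finds (id, description, feature_id) VALUES (2, 'charcoal sample', 1)"
)


def finds_at_site(conn, site_id):
    """List the descriptions of all finds located in any feature of a site."""
    rows = conn.execute(
        "SELECT finds.description FROM finds "
        "JOIN features ON finds.feature_id = features.id "
        "WHERE features.site_id = ? ORDER BY finds.id",
        (site_id,),
    ).fetchall()
    return [row[0] for row in rows]
```

The foreign keys mirror the aggregation in the world (a site contains features, a feature contains finds), which is the correspondence between world structure and system structure described above.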

Typical semantic technologies, notably linked open data (LOD) and the associated technologies RDF, SKOS and OWL, provide very weak features to express this structure. In RDF, for example, data is stored as subject-predicate-object triples, where the subject and predicate refer to resources, and the object may either refer to a resource or contain a literal value. This allows for the construction of inter-linked data “atoms” into meshes of varying size. From a purely structural viewpoint, these meshes are a homogeneous mass of interconnected data points, with no “texture” or “lumps” to organise content. This is akin to a large chest containing all your clothes.
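
To make the triple model concrete, here is a minimal sketch that mimics it with plain Python tuples rather than a real RDF library. The `ex:` identifiers and predicate names are invented for this example and do not belong to any actual vocabulary.

```python
# Each fact is one (subject, predicate, object) triple.
# The "ex:" names are made-up identifiers, not a real vocabulary.
triples = [
    ("ex:alice", "ex:name", "Alice"),          # a property of Alice
    ("ex:alice", "ex:age", 38),                # another property of Alice
    ("ex:alice", "ex:livesIn", "ex:sydney"),   # a link between two entities
    ("ex:sydney", "ex:name", "Sydney"),        # a property of Sydney
    ("ex:pyrmont", "ex:partOf", "ex:sydney"),  # a whole/part relationship
]


def describe(subject):
    """Return every (predicate, object) pair stated about a subject."""
    return [(p, o) for s, p, o in triples if s == subject]


alice_facts = describe("ex:alice")
```

Note that a property, a link between entities and a whole/part relationship all take exactly the same three-slot form; nothing in the structure itself marks where one entity ends and the next begins.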

In comparison, an object-oriented (OO) object network is much richer, providing several levels of granularity and texture. For example, objects encapsulate atomic data values; whole/part relationships create larger aggregates; patterns such as State or Composite give rise to qualified elements with very specific roles; packages or clusters create even larger aggregates. When using an OO language or system, you can choose what level of aggregation to use to represent different data chunks. For example, you can create an object for Alice and put all her properties inside it, such as her name, age or address. Then you can create another object for the city of Sydney. And then you can link the two together to express the fact that Alice lives in Sydney. You can express the fact that Sydney is composed of a number of suburbs such as Pyrmont, Five Dock or Lavender Bay. You can bundle a person object together with the associated city objects to express that whenever a person is retrieved from the system, the associated city objects are to be retrieved too and presented as a comprehensive (but heterogeneous) unit to the user. And when a user looks at the data in the system, they will be able to see, retrieve and understand information in terms of these objects, links, aggregates, bundles and patterns.
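
The Alice and Sydney example can be sketched with Python dataclasses. This is one possible rendering under stated assumptions: the class names, the `lives_in` attribute, and the `PersonBundle` and `bundle_for` helpers are hypothetical names introduced here, and a full OO modelling language would offer more constructs than plain classes do.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Suburb:
    name: str


@dataclass
class City:
    name: str
    # Whole/part relationship: a city is composed of suburbs.
    suburbs: List[Suburb] = field(default_factory=list)


@dataclass
class Person:
    # The object encapsulates the person's own properties.
    name: str
    age: int
    # Link to a separate object, not a property of the person.
    lives_in: Optional[City] = None


@dataclass
class PersonBundle:
    """Hypothetical retrieval unit: a person plus the associated cities."""
    person: Person
    cities: List[City]


def bundle_for(person: Person) -> PersonBundle:
    """Gather a person and the city they live in (if any) into one unit."""
    cities = [person.lives_in] if person.lives_in is not None else []
    return PersonBundle(person, cities)


sydney = City("Sydney", [Suburb("Pyrmont"), Suburb("Five Dock"), Suburb("Lavender Bay")])
alice = Person("Alice", 38, lives_in=sydney)
bundle = bundle_for(alice)
```

Here the boundaries are explicit: the properties inside `Person`, the link from `Person` to `City`, the whole/part structure inside `City` and the bundle are each a different kind of construct.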

This richer texture provides a useful structure that relates to the world being represented, and helps computers and humans parse and process the represented content, very much like the sentences, paragraphs and chapters in a book provide structure that guides us in understanding the meaning of the text. By lacking strong structuring features and relying only on triples as the unique data expression device, LOD technologies behave very much like a book with no chapters or paragraph breaks.

It must be emphasised that the problem with LOD technologies is not that they represent information as homogeneous masses of interconnected data points. The problem is that homogeneous masses of interconnected data points are all that LOD can offer. Other modelling approaches, including OO, also employ interconnected data points at the lowest level of abstraction; the difference is that these other approaches, in addition to the fine-grained structure of the mesh, also provide mechanisms to aggregate parts of these meshes into larger chunks and thus give them large-grained structure. In turn, this supports modularity. RDF and most LOD technologies cannot do this. Expressed as LOD, Alice from our previous example would be represented by a sub-mesh within the larger mesh, perhaps by a few inter-connected triples. Sydney, likewise, would be represented by another few triples, and connected to Alice. But the connections “inside” Alice and Sydney are structurally identical to those between them, blurring the fact that they constitute separate entities in the world.

In other words, an OO network of objects is also a graph representation but, as opposed to LOD, it provides texture to the graph by using additional constructs (objects, whole/part structures, pattern instances, bundles, etc.) that yield more complex aggregates. LOD provides the base graph representation, and nothing else. Without rich content structuring features, modularity is very hard or impossible. And, as we saw in the previous post, non-modular systems are low-quality systems.
