In the previous post, I argued that most semantic technologies (as usually understood in the digital humanities community) are anything but semantic. In this post I turn to engineering concerns. Here I argue that most semantic technologies such as RDF, SKOS and OWL, especially as employed in digital humanities for linked open data, present serious flaws due to the lack of good engineering practices. Engineering is crucial in this regard, because thesauri, repositories and other similar artefacts that are routinely developed within the digital humanities field are information systems and, as such, exhibit behaviours that are well understood in engineering and are subject to engineering principles.
My main point today is that semantic technologies are usually oblivious to the well-known engineering concerns of modularity and layering. Modularity refers to the organisation of a system into parts, named modules, which exhibit high internal cohesion and low external coupling. Modules are therefore not just random chunks of a system. High internal cohesion means that the individual elements inside a module are tightly related to each other, leaving no elements disconnected or isolated. In other words, a module must have a single, clear purpose, and every element in it must be concerned with this purpose. Low external coupling means that the relationships between a module and other modules must be few and weak, rather than many and strong. The aim is to concentrate connections inside modules rather than across them, which benefits the overall quality of the system. A good introduction to why this is so, and how quality factors such as robustness and extensibility are improved through modularity, is given in the introductory chapters of [Meyer 1997].
A good example of a highly modular system is a car. If you look under the hood of your car, you will see hundreds or thousands of elements that are all necessary for it to work. These elements are not randomly connected, but are organised into modules. For example, the engine block is composed of tightly-integrated elements (such as the pistons, valves, crankshaft and spark plugs) and loosely connected to other modules such as the electrical system via well-defined weak links. By “weak” here we mean links that can be temporarily severed without much fuss for maintenance or enhancement purposes. You (or a knowledgeable mechanic) could easily disconnect the electrical system from the engine block in your car in order to replace either of them. This is feasible and easy because of the highly modular arrangement of elements in your car: you don’t need to disassemble the internals of the engine in order to remove it; rather, you treat it as a single whole (a module) for certain purposes. Most modern artefacts of engineering are highly modular.
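To make cohesion and coupling a little more tangible, here is a minimal Python sketch that counts links inside and across modules for a toy version of the car example. The part names, the assignment of parts to modules and the list of links are all invented for illustration, and `count_links` is a hypothetical helper, not a standard metric.

```python
from collections import Counter

# Hypothetical assignment of car parts to modules (illustrative only).
MODULE_OF = {
    "piston": "engine",
    "valve": "engine",
    "crankshaft": "engine",
    "spark plug": "engine",
    "battery": "electrical",
    "alternator": "electrical",
    "fuse box": "electrical",
}

# Hypothetical connections between parts (illustrative only).
LINKS = [
    ("piston", "crankshaft"),
    ("piston", "valve"),
    ("valve", "crankshaft"),
    ("spark plug", "piston"),
    ("battery", "alternator"),
    ("battery", "fuse box"),
    ("alternator", "fuse box"),
    ("fuse box", "spark plug"),  # the only link that crosses a module boundary
]


def count_links(module_of, links):
    """Return (links inside each module, number of links across modules)."""
    internal = Counter()
    external = 0
    for first, second in links:
        if module_of[first] == module_of[second]:
            internal[module_of[first]] += 1
        else:
            external += 1
    return dict(internal), external


internal_links, external_links = count_links(MODULE_OF, LINKS)
print(internal_links, external_links)
```

In this toy arrangement most links fall inside a module and only one crosses a boundary, which is the pattern that high cohesion and low coupling describe.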
Layering, in turn, is a particular application of modularity, by which a system is organised in modules that are stacked as horizontal “layers” on top of each other according to their level of abstraction or domain of discourse. This is very common in software systems, but not as common or visible in physical systems, because it is harder to distinguish levels of abstraction in the physical world. If you think of a typical software system such as the WordPress platform or Microsoft Word, it is likely to be organised in terms of a user interface layer (the screens that users operate) at the top, a “business logic” layer (the programmes that perform the necessary computations) in the middle, and a database or persistence layer (which stores the relevant data) at the bottom. Of course, “top”, “middle” and “bottom” are metaphorical terms that refer to the level of abstraction of each module: the database layer deals with raw data, the business logic processes these data into meaningful chunks of information (and thus “sits on top of” the database), and the user interface takes this information and presents it to the users (thus sitting on top of the business logic layer).
Layering in software systems is a well-understood and commonly accepted principle in software engineering. For example, a good-quality software system does not make SQL calls to the underlying database from the same code that manages the windows and on-screen forms. Rather, it invokes a lower layer that encapsulates and hides the intricacies of database access. In fact, encapsulation and information hiding are guiding principles in software engineering. Each layer deals with the concerns that correspond to its level of abstraction, and hides or encapsulates them from higher layers. In this manner, abstraction increases from bottom to top, and complexity can be (hopefully) tamed. For example, a database access layer needs to worry about issues such as how to connect to the database server, what dialect of SQL it speaks, and whether nested transactions are allowed or not. The business logic layer, when invoking the database layer, can ignore all these issues and simply request the necessary data. In this regard, we can say that the database layer encapsulates and hides details that are not relevant to the business logic, thus allowing the level of abstraction to rise. When layering fails (due to a bad design, for example) and a layer deals with issues that correspond to a lower level of abstraction, this is called “implementation noise”. Think, for example, of a user interface layer that composes SQL statements and invokes the database server directly.
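The three layers just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class and function names (`SiteRepository`, `SiteCatalogue`, `render`, `build_demo`), the table layout and the sample site names are all hypothetical, and an in-memory SQLite database stands in for a real database server.

```python
from __future__ import annotations

import sqlite3


class SiteRepository:
    """Persistence layer: the only place that knows about SQL and connections."""

    def __init__(self, connection: sqlite3.Connection) -> None:
        self._connection = connection
        self._connection.execute(
            "CREATE TABLE IF NOT EXISTS site (name TEXT PRIMARY KEY, period TEXT)"
        )

    def add(self, name: str, period: str) -> None:
        self._connection.execute(
            "INSERT INTO site (name, period) VALUES (?, ?)", (name, period)
        )

    def names_by_period(self, period: str) -> list[str]:
        rows = self._connection.execute(
            "SELECT name FROM site WHERE period = ? ORDER BY name", (period,)
        )
        return [row[0] for row in rows]


class SiteCatalogue:
    """Business logic layer: requests data and never sees any SQL."""

    def __init__(self, repository: SiteRepository) -> None:
        self._repository = repository

    def summary(self, period: str) -> str:
        names = self._repository.names_by_period(period)
        return f"{len(names)} site(s) from the {period}: {', '.join(names)}"


def render(catalogue: SiteCatalogue, period: str) -> str:
    """User interface layer: presents whatever the business logic returns."""
    return f"== {catalogue.summary(period)} =="


def build_demo() -> SiteCatalogue:
    """Wire the layers together with a few made-up records."""
    repository = SiteRepository(sqlite3.connect(":memory:"))
    repository.add("Site B", "Iron Age")
    repository.add("Site A", "Iron Age")
    repository.add("Site C", "Bronze Age")
    return SiteCatalogue(repository)


print(render(build_demo(), "Iron Age"))
```

If `render` composed the `SELECT` statement itself, that would be an instance of the implementation noise mentioned above: the user interface would then have to change whenever the table layout or SQL dialect did.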
A concrete manifestation of the modularity and layering principles can be found in the Model Driven Architecture (MDA) initiative from the Object Management Group (OMG), widely accepted today in the software engineering community. According to MDA, separating the conceptual concerns from the implementation ones in a computation-independent model greatly improves the layering of the resulting product, and thus its quality.
Unfortunately, semantic technologies such as RDF, SKOS or OWL, especially as used in the digital humanities through linked open data, pay little attention to modularity and layering. On the contrary, they often treat everything as “data” regardless of its abstraction level, and mix together application domain concerns with implementation details, thus overlooking good modularisation and layering principles. For example, in order to declare a concept in SKOS, something that should be highly conceptual and free of implementation noise, one needs to possess a fairly deep understanding of URIs and how RDF (a different technology at a lower level of abstraction) works under the hood. This forces domain specialists (such as archaeologists or historians) to try to learn the intricacies of lower-level computing technologies when they only need to formulate a conceptual representation of their domain of discourse. And, what is worse, it inextricably entangles application-domain concerns (for example, how to represent the concept of “archaeological site” in terms of space and material evidence) with implementation noise (for example, whether to create a new URI or reuse an existing one, or what namespace to use). The concepts of an archaeological site and a URI pertain to very different domains of discourse and abstraction levels. If you think about it, the linked open data premise that “everything must have an HTTP-based name” makes no sense at all. HTTP is a data transfer protocol on the Internet. How is that connected with naming things at a conceptual level? Unfortunately, you are forced to conflate conceptual concerns and implementation issues if you work with W3C semantic technology recommendations.
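A small Python sketch may help to show where the two kinds of concern meet. The `Concept` class holds what a domain specialist actually wants to state, whereas `to_skos_triples` shows the extra decisions (a namespace and a local name for the URI) needed to express it as SKOS statements in RDF. Both names are hypothetical, the `http://example.org/vocab/` namespace is a placeholder, and representing triples as tuples of plain strings is a simplification that ignores the distinction between URIs and literals.

```python
from __future__ import annotations

from dataclasses import dataclass

# Well-known W3C identifiers for rdf:type and the SKOS core vocabulary.
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
SKOS = "http://www.w3.org/2004/02/skos/core#"


@dataclass(frozen=True)
class Concept:
    """Conceptual level: a label and a definition, nothing else."""

    label: str
    definition: str


def to_skos_triples(
    concept: Concept, namespace: str, local_name: str
) -> list[tuple[str, str, str]]:
    """Implementation level: a URI must be chosen before anything can be said."""
    uri = namespace + local_name
    return [
        (uri, RDF_TYPE, SKOS + "Concept"),
        (uri, SKOS + "prefLabel", concept.label),
        (uri, SKOS + "definition", concept.definition),
    ]


site = Concept(
    label="archaeological site",
    definition="A place where material evidence of past activity is preserved.",
)

# The namespace and local name below are placeholders chosen for this sketch.
for triple in to_skos_triples(site, "http://example.org/vocab/", "archaeological-site"):
    print(triple)
```

Nothing in `Concept` depends on URIs or namespaces; those questions only arise in the mapping function, and that is where I would expect them to stay in a properly layered design.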
You may argue that the development of different vocabularies, thesauri or repositories for different projects, and then reconciling or inter-relating them, is a form of modularisation. However, a module is not any arbitrary chunk of content, as discussed above; it must possess high internal cohesion and low external coupling. From this perspective, a vocabulary, thesaurus or repository designed for a project is unlikely to constitute a module, because most often it will lack the necessary high internal cohesion (i.e. there are conceptual “lumps” inside it, corresponding to separate semantic fields) and exhibit extremely high coupling to other such artefacts. In fact, linked open data approaches encourage the creation of dependencies between elements in different realms through URI links, which creates “thick interfaces” and high coupling, and therefore thwarts modularity.
As an additional case of poor layering in the digital humanities, consider ISO 21127:2014, also known as CIDOC CRM, a well-known conceptual reference model in cultural heritage. In CIDOC CRM, primitive types such as E62 String or E60 Number are mixed together with application-domain concepts such as E53 Place or E31 Document. Primitive data types such as string or number have no particular meaning in cultural heritage that is different to other fields such as biology or shoe making. Rather, they belong to a lower abstraction level that is domain-agnostic. In fact, every formally-defined language (including programming languages such as C# or Python but also modelling languages such as UML or ConML) defines primitive data types as part of the infrastructure (i.e. one layer below) of the layer where users can express their own concepts. In other words, a user of UML, ConML or C# does not need to include strings or numbers when creating a data representation of, for example, a historical manuscript; they just exist, and you just use them to describe the necessary information. With CIDOC CRM, however, primitive data types and domain-specific concepts such as places or manuscripts live in the same (and only) layer, thus mixing levels of abstraction.
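The point about primitive types being infrastructure can be illustrated with a short Python sketch. The `Place` and `Manuscript` classes and the `primitive_types_used` helper are hypothetical and only serve the example; `str` and `int` are not declared anywhere in the model because the language provides them one layer below.

```python
from __future__ import annotations

from dataclasses import dataclass, fields


@dataclass
class Place:
    """Domain concept defined by the modeller."""

    name: str


@dataclass
class Manuscript:
    """Domain concept defined by the modeller."""

    title: str
    folio_count: int
    held_at: Place


def primitive_types_used(model_class: type) -> set[str]:
    """Names of attribute types that come from the language, not from the model."""
    names = set()
    for field in fields(model_class):
        # With postponed annotations, field.type is the annotation as a string.
        if field.type in ("str", "int", "float", "bool"):
            names.add(field.type)
    return names


print(sorted(primitive_types_used(Manuscript)))
```

The model itself contains only two concepts, `Place` and `Manuscript`; the strings and numbers used to describe them are simply available, which is the separation that I find missing in CIDOC CRM.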
Without proper separation of application and infrastructure concerns through adequate layering and modularisation, the quality of the resulting products will always be substandard. Semantic technologies must change to acknowledge this. In the meantime, we won’t be able to develop high-quality models and systems in the digital humanities community.
- Meyer, B. 1997. Object-Oriented Software Construction, 2nd edition. Prentice-Hall.