The SEPIO Framework
The Scientific Evidence and Provenance Information Ontology (SEPIO) was developed to support rich, computable representations of the evidence and provenance behind scientific assertions. The core ontology defines a generic model that can be applied in any domain and extended with domain-specific features. This ontological model is the foundation of a larger framework that provides mechanisms for creating custom, ontology-based schemas for specific applications, leveraging modern semantic web standards. The framework comprises four main components:
- SEPIO Core Ontology: defines the core, domain-agnostic model using the 'open world' OWL description logic language.
- SEPIO Information Model: provides a UML-like view of the ontology with the constraints of a 'closed world' data model, specifying how terms and design patterns defined in SEPIO may be used to structure data.
- SEPIO Profiles: application-specific data models that refine the maximal information model, and can extend it with domain-specific content to support a custom schema for a particular use case.
- SEPIO Value Sets: re-usable collections of terms that can be bound to attributes in a particular Profile to constrain data entry.
This document describes in detail the individual components of this framework, and the relationships between them.
The SEPIO Ontology provides core concepts and relationships to describe how information is interpreted as evidence in the act of making a scientific assertion, and how this evidence information was initially generated, retrieved, and curated. While the core model is domain-agnostic and generally applicable to representing evidence and provenance for any type of assertion, the framework enables extensions to support domain-specific concepts and use cases (LINK). Figure 1 shows a high-level view of the conceptual model implemented in the SEPIO Ontology. A more detailed description of these classes and relationships can be found here.
Figure 1: The core high-level concepts and relationships in SEPIO. Boxes are ontology classes, and edges are ontology properties. Three core informational entities comprise the main axis of the model (blue), which are decorated by entities that capture their provenance (grey). The "Evidence Item" term is shown as a UML 'stereotype' in guillemets (<< >>) within the Information Content Entity class, indicating that any information contributing to an Evidence Line at this position is inferred to be an instance of the logically defined Evidence Item type in SEPIO. The core SEPIO ontology file is located here.
As an 'open-world' model defined using the OWL description logic language, the SEPIO ontology is not sufficiently constrained to specify a schema for collecting data. Rather, it provides terms and design patterns on which an information model and formal schema can be based, as described below. The underlying ontological formalism that SEPIO provides can nonetheless be leveraged for semantic search, inference, and reasoning applications (LINK).
The SEPIO Information Model provides an informal description of how the terms and design patterns defined in the ontology can be applied as a model for structuring data. This is an abstract model that is independent of any schema language. At present it is encoded as a UML diagram that provides a 'maximal' specification of what is possible to express using SEPIO (Figure 2). Each type in the information model maps to a SEPIO ontology class, and the attributes of each type map to ontology properties. Cardinality constraints in this model are minimally restrictive and consistent with the logic defined in the ontology. More information can be found here (LINK).
In practice, the SEPIO Information Model is a useful guide that helps implementers understand the overall model and identify which elements and patterns they should include when defining a 'SEPIO Profile' for their particular application (see below). In this way it helps to bridge the gap between the conceptual model implemented in the ontology and the formal schema specifications that enable data representation and validation.
Figure 2: UML diagram of the maximal SEPIO Information Model, presenting the data modeling patterns that are possible using the SEPIO Framework. Boxes represent data model types and list the attributes that can be modeled for each type using SEPIO. Attributes in orange are 'shortcut' relations that can be used to directly link data that are connected by more than one relationship in a fully normalized model. Attributes with asterisks (*) are those that have sub-properties in the underlying SEPIO ontology, which can be used to refine the parent attribute in the diagram when a more precise relationship is desired. Edges in the diagram represent attributes connecting to other core data types in the model.
The SEPIO model is designed to support incremental expressivity, where different profiles can create simple or complex models depending on application and data requirements. The provision of 'shortcut' relations, which can be used to directly link objects that are connected by more than one relationship in a fully normalized model, is one of several mechanisms that support this flexibility. For more on this topic see here (LINK), and explore the GO Annotation Profiles documented here (LINK).
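The idea of a shortcut relation can be sketched as collapsing a multi-step path in normalized data into a single direct link. The snippet below is only an illustration under stated assumptions: the two-step path (`hasEvidenceItem`, then `isAbout`) and the helper name `collapse_path` are hypothetical and are not taken from the SEPIO model; only `isEvidenceWithSupportFrom` is a shortcut named in this document.

```python
from typing import Iterable, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


def collapse_path(triples: Iterable[Triple], path: List[str], shortcut: str) -> Set[Triple]:
    """Return (s, shortcut, o) for every s that reaches o by following `path`.

    `path` is a non-empty list of predicate names; both the path and the
    shortcut name are supplied by the caller (hypothetical in this sketch).
    """
    triples = list(triples)
    # Each frontier entry pairs a starting subject with the node reached so far.
    frontier = {(s, s) for s, p, _ in triples if p == path[0]}
    for predicate in path:
        frontier = {
            (start, o)
            for start, node in frontier
            for s, p, o in triples
            if s == node and p == predicate
        }
    return {(start, shortcut, end) for start, end in frontier}


# Hypothetical normalized data: an evidence line, its item, and what the item is about.
normalized = [
    ("line1", "hasEvidenceItem", "item1"),
    ("item1", "isAbout", "geneX"),
]
direct = collapse_path(normalized, ["hasEvidenceItem", "isAbout"], "isEvidenceWithSupportFrom")
```

A dataset that never records the intermediate object (here `item1`) can assert the direct link on its own, which is the situation a shortcut relation is meant to accommodate.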
SEPIO Profiles are customizations of the maximal information model that are defined to support a particular application or use case. Briefly, creating a Profile involves selecting the relevant elements of the maximal information model, extending the model with required domain-specific specializations of core types and attributes, using these to define an application-specific information model, defining required value sets, and ultimately implementing the model in a formal schema language. The process of creating a SEPIO Profile is described in more detail here, and exemplified by the ClinGen-ACMG Profile description here.
SEPIO Profiles are scoped and structured to specifically represent evidence and provenance in a particular domain. A simple SEPIO Profile that we generated for representing Gene Ontology (GO) Annotation data, which holds assertions about gene function, is shown in Figure 3. It uses only the subset of types and attributes required for representing the evidence and provenance information provided in this widely used dataset. It defines specific cardinalities on each attribute, and proposes value sets to constrain values for certain attributes. Note the use of the isEvidenceWithSupportFrom 'shortcut relation' (orange) to directly link evidence lines to any type of entity 'supporting' the evidence. This shortcut is used because the GO dataset does not describe the objects needed to populate the fully normalized model in Figure 2. A more detailed characterization of this exemplar SEPIO Profile, along with examples of GO data, can be found here.
Figure 3: A UML diagram for a SEPIO Profile to represent GO Annotation data. The "code" data type indicates use of the value set named between guillemets (<< >>) to standardize data entry.
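As a rough illustration of what "specific cardinalities" and bound value sets mean for data, the sketch below checks a record against a small, hypothetical profile type. It is not the GO Annotation Profile itself: the attribute name `evidenceType`, the `AttributeSpec` and `validate` names, and the two codes in the value set are assumptions made for the example; only `isEvidenceWithSupportFrom` comes from the profile described above.

```python
from dataclasses import dataclass
from typing import Any, Dict, FrozenSet, List, Optional


@dataclass(frozen=True)
class AttributeSpec:
    """One attribute of a profile type: cardinality plus an optional value set."""
    name: str
    min_count: int = 0
    max_count: Optional[int] = None             # None means unbounded ("*")
    value_set: Optional[FrozenSet[str]] = None  # allowed codes for 'code' attributes


def validate(record: Dict[str, Any], specs: List[AttributeSpec]) -> List[str]:
    """Return human-readable violations; an empty list means the record conforms."""
    errors = []
    known = {spec.name for spec in specs}
    # Closed-world check: attributes outside the profile are rejected.
    for key in record:
        if key not in known:
            errors.append(f"{key}: not part of this profile")
    for spec in specs:
        raw = record.get(spec.name, [])
        values = raw if isinstance(raw, list) else [raw]
        if len(values) < spec.min_count:
            errors.append(f"{spec.name}: expected at least {spec.min_count} value(s)")
        if spec.max_count is not None and len(values) > spec.max_count:
            errors.append(f"{spec.name}: expected at most {spec.max_count} value(s)")
        if spec.value_set is not None:
            for value in values:
                if value not in spec.value_set:
                    errors.append(f"{spec.name}: {value!r} is not in the bound value set")
    return errors


# Hypothetical Evidence Line type; the codes are illustrative placeholders.
EVIDENCE_LINE = [
    AttributeSpec("evidenceType", min_count=1, max_count=1,
                  value_set=frozenset({"IDA", "IEA"})),
    AttributeSpec("isEvidenceWithSupportFrom", min_count=0),
]

problems = validate({"evidenceType": "IDA", "isEvidenceWithSupportFrom": ["ex:gene1"]},
                    EVIDENCE_LINE)
```

Rejecting unknown attributes reflects the 'closed world' reading of a profile; a real implementation would normally express these constraints in a schema language rather than hand-written checks.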
TO DO: Add or reference an example profile that 'extends' the core SEPIO Model with domain-specific specializations (as opposed to the GO Profile example that just extracts a subset of it)
A given SEPIO Profile can take multiple concrete forms. To support data collection and validation in a working application, it must be implemented in a formal schema language such as JSON Schema or ShEx. The ClinGen SEPIO Profile for ACMG Variant Interpretations provides an example of a JSON-LD schema (LINK) that implements the Profile described informally here, and is being used in several ClinGen information and data exchange systems. Note that for profiles like this one that are formalized using JSON-LD, mappings between schema elements and SEPIO ontology terms can be captured in LD-context files (LINK), allowing SEPIO-compliant RDF representations to be automatically derived from JSON data.
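The role of a context file can be sketched in a few lines: it maps the short keys used in JSON data to ontology IRIs, so that each key/value pair can be read as an RDF triple. The example below is a deliberately simplified stand-in for JSON-LD processing, not the ClinGen context. The keys `hasEvidenceItem` and `evidenceStrength` and all `http://example.org/sepio/` IRIs are placeholders invented for illustration; only the `rdf:type` IRI is a real identifier.

```python
from typing import Any, Dict, Iterator, Tuple

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Hypothetical context: placeholder IRIs, not real SEPIO identifiers.
CONTEXT = {
    "EvidenceLine": "http://example.org/sepio/EvidenceLine",
    "hasEvidenceItem": "http://example.org/sepio/hasEvidenceItem",
    "evidenceStrength": "http://example.org/sepio/evidenceStrength",
}

Triple = Tuple[str, str, str]


def to_triples(doc: Dict[str, Any], context: Dict[str, str]) -> Iterator[Triple]:
    """Yield (subject, predicate, object) triples for one flat JSON object."""
    subject = doc["id"]
    if "type" in doc:
        yield (subject, RDF_TYPE, context[doc["type"]])
    for key, value in doc.items():
        if key in ("id", "type"):
            continue
        predicate = context[key]  # a KeyError signals an unmapped schema element
        values = value if isinstance(value, list) else [value]
        for item in values:
            yield (subject, predicate, str(item))


doc = {
    "id": "ex:line1",
    "type": "EvidenceLine",
    "hasEvidenceItem": ["ex:item1", "ex:item2"],
}
triples = list(to_triples(doc, CONTEXT))
```

This ignores nesting, datatypes, and the rest of the JSON-LD expansion algorithm; in practice a conformant JSON-LD processor would perform the conversion.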
SEPIO Profiles can be re-used by different communities or applications that wish to represent the same type of data. For example, ClinGen has defined this SEPIO profile for representing Variant Pathogenicity Interpretations generated using the ACMG Guidelines, which can be adopted by other applications that wish to represent this type of data in an interoperable way. Informal guidelines for defining a SEPIO Profile can be found here, and a more technical Standard Operating Procedure will be provided in the future.
TO DO: Expand on text describing SEPIO Value Sets
For attributes that take a 'code' as a data type in the Information Model, a Profile can define value sets it will use here. SEPIO provides a Value Set Model (LINK) that extends the SKOS framework (LINK) to support the implementation of value sets as part of the profile's ontology extension. Briefly, a particular value set is implemented in the ontology extension as an instance of the SEPIO 'Value Set' class. The individual terms comprising a value set are implemented as instances of the SKOS Concept class, and linked to their containing value set using the SKOS 'inScheme' property. Attributes of a particular value set that can be defined using the SEPIO model include the notion of 'extensibility' (whether the set is closed or can be extended), and links to one or more 'identifier systems' from which terms in the value set can be taken.
Figure 4A shows the basic schema for defining value sets in the SEPIO model. Figure 4B shows an example of an 'Allelic Phase Value Set' defined for the ClinGen-ACMG Profile. This is a 'fixed' value set (i.e., not extensible) that consists of just two values taken from the Genotype Ontology (GENO).
Figure 4: SEPIO Value Set Creation. (A) The schema for the SEPIO Value Set Model. (B) An example of the ClinGen 'Allelic Phase Value Set'. Note that the valueSetExtensibility attribute itself takes a code as its value, drawn from the 'Value Set Extensibility Value Set'. Boxes in grey are individual values from a value set.
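The value set features described above (member concepts, extensibility, and identifier systems) can be mirrored in a small data structure. This is a minimal sketch rather than the SEPIO Value Set Model: the class and field names are hypothetical, the two GENO codes are placeholders (the actual terms are not reproduced here), and the rule that an extensible set accepts any code from one of its identifier systems is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class ValueSet:
    """A named collection of codes, loosely following the description above."""
    label: str
    extensible: bool
    identifier_systems: Set[str] = field(default_factory=set)  # e.g. CURIE prefixes
    concepts: Set[str] = field(default_factory=set)            # member codes

    def accepts(self, code: str) -> bool:
        """Return True if `code` may be used where this value set is bound."""
        if code in self.concepts:
            return True
        if not self.extensible:
            return False
        # Assumption: an extensible set admits further codes from its systems.
        prefix = code.split(":", 1)[0]
        return prefix in self.identifier_systems


# A fixed (non-extensible) set with two placeholder GENO codes.
ALLELIC_PHASE = ValueSet(
    label="Allelic Phase Value Set",
    extensible=False,
    identifier_systems={"GENO"},
    concepts={"GENO:term_a", "GENO:term_b"},
)
```

With this definition, `ALLELIC_PHASE.accepts("GENO:term_a")` holds, while any other GENO code is rejected because the set is fixed.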
Implementing a data model using the SEPIO framework requires representing domain concepts in different formalisms at different levels of the modeling framework. Figure 5 below illustrates the differences in format and expressivity of representations of the 'Evidence Line' type at the level of the SEPIO Ontology, the SEPIO Information Model, a particular SEPIO Profile, and a formal schema implementation.
TO DO: Figure 5: Representations of ‘Evidence Line’ across the SEPIO Framework.