This project provides a command-line interface and Java library to validate and manipulate OAIS Information Packages of different formats: E-ARK (version 1 & 2 in alpha stage), BagIt, Hungarian type 4 SIP.
The E-ARK Information Packages are maintained by the Digital Information LifeCycle Interoperability Standards Board (DILCIS Board). DILCIS Board is an international group of experts committed to maintain and sustain maintain a set of interoperability specifications which allow for the transfer, long-term preservation, and reuse of digital information regardless of the origin or type of the information.
More specifically, the DILCIS Board maintains specifications initially developed within the E-ARK Project (02.2014 - 01.2017):
- Common Specification for Information Packages
- E-ARK Submission Information Package (SIP)
- E-ARK Archival Information Package (AIP)
- E-ARK Dissemination Information Package (DIP)
The DILCIS Board collaborates closely with the Swiss Federal Archives in regard to the maintenance of the SIARD (Software Independent Archiving of Relational Databases) specification.
For more information about the E-ARK Information Packages specifications, please visit http://www.dilcis.eu/
- Java (>= 1.8)
Download the latest release to use as a tool or check below how to use it as a Java Library.
You can use the commons-ip as a command-line tool or a Java library.
To use commons-ip validator, need to download the latest release of the commons-ip tool. Then you can execute in the command-line using the following options:
- validate, [REQUIRED] this option is for the CLI to know that is to perform the validation of a SIP.
- -i, [REQUIRED] Path(s) to the SIP(s) archive file or folder(s).
- -o, [OPTIONAL] Path to the Directory where you want to save the validation report.
Examples:
java -jar commons-ip-2.X.Y.jar validate -i sip1.zip
java -jar commons-ip-2.X.Y.jar validate -i sip1.zip sip2.zip -o output/
The report generated by the validator is in JSON format and has the following structure:
{
"header" : {
"title" : "Validation Report CSIP",
"specifications" :
[
{
"id" : "CSIP-2.0.4",
"url" : "https://github.com/DILCISBoard/E-ARK-CSIP/releases/tag/v2.0.4"
},
{
"id" : "SIP-2.0.4",
"url" : "https://github.com/DILCISBoard/E-ARK-SIP/releases/tag/v2.0.4"
}
],
"version_commons_ip" : "2.0.0-alpha3-SNAPSHOT",
"date" : "2021-09-24T13:30:26.203+01:00",
"path" : "${HOME}/Full-EARK-SIP.zip"
},
"validation" :
[
{
"specification" : "CSIP-2.0.4",
"id" : "CSIPSTR1",
"name" : "CSIP Information Package folder structure",
"location" : "",
"description" : "Any Information Package MUST be included within a single physical root folder (known as the “Information Package root folder”). For packages presented in an archive format, see CSIPSTR3, the archive MUST unpack to a single root folder.",
"cardinality" : "",
"level" : "MUST",
"testing" :
{
"outcome" : "PASSED",
"issues" : [ ],
"warnings" : [ ],
"notes" : [ ]
}
},
{
"specification" : "CSIP-2.0.4",
"id" : "CSIP32",
"name" : "Digital provenance metadata",
"location" : "mets/amdSec/digiprovMD",
"description" : "For recording information about preservation the standard PREMIS is used. It is mandatory to include one <digiprovMD> element for each piece of PREMIS metadata. The use if PREMIS in METS is following the recommendations in the 2017 version of PREMIS in METS Guidelines.",
"cardinality" : "0..n",
"level" : "SHOULD",
"testing" :
{
"outcome" : "FAILED",
"issues" : [ ],
"warnings" : [ "It is mandatory to include one <digiprovMD> element in Root METS.xml for each piece of PREMIS metadata" ],
"notes" : [ ]
}
}
],
"summary" :
{
"success" : 120,
"warnings" : 3,
"errors" : 5,
"skipped" : 33,
"notes" : 8,
"result" : "INVALID"
}
}
- Using Maven
- Add the following repository
<repository>
<id>KEEPS-Artifacts</id>
<name>KEEP Artifacts-releases</name>
<url>https://artifactory.keep.pt/artifactory/keep</url>
</repository>
- Add the following dependency
<dependency>
<groupId>org.roda-project</groupId>
<artifactId>commons-ip2</artifactId>
<version>2.0.0-alpha1</version>
</dependency>
If using Java 11 or greater, you might need the following extra dependencies:
<dependency>
<groupId>javax.xml.bind</groupId>
<artifactId>jaxb-api</artifactId>
<version>2.3.0-b170201.1204</version>
</dependency>
<dependency>
<groupId>javax.activation</groupId>
<artifactId>activation</artifactId>
<version>1.1</version>
</dependency>
<dependency>
<groupId>org.glassfish.jaxb</groupId>
<artifactId>jaxb-runtime</artifactId>
<version>2.3.0-b170127.1453</version>
</dependency>
- Not using maven, download Commons IP latest jar, each of Commons IP dependencies (see pom.xml to know which dependencies/versions) and add them to your project classpath.
- Create a full E-ARK SIP
// 1) instantiate E-ARK SIP object
SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED(), IPContentInformationType.getMIXED());
sip.addCreatorSoftwareAgent("RODA Commons IP", "2.0.0");
// 1.1) set optional human-readable description
sip.setDescription("A full E-ARK SIP");
// 1.2) add descriptive metadata (SIP level)
IPDescriptiveMetadata metadataDescriptiveDC = new IPDescriptiveMetadata(
new IPFile(Paths.get("src/test/resources/eark/metadata_descriptive_dc.xml")),
new MetadataType(MetadataTypeEnum.DC), null);
sip.addDescriptiveMetadata(metadataDescriptiveDC);
// 1.3) add preservation metadata (SIP level)
IPMetadata metadataPreservation = new IPMetadata(
new IPFile(Paths.get("src/test/resources/eark/metadata_preservation_premis.xml")));
sip.addPreservationMetadata(metadataPreservation);
// 1.4) add other metadata (SIP level)
IPFile metadataOtherFile = new IPFile(Paths.get("src/test/resources/eark/metadata_other.txt"));
// 1.4.1) optionally one may rename file final name
metadataOtherFile.setRenameTo("metadata_other_renamed.txt");
IPMetadata metadataOther = new IPMetadata(metadataOtherFile);
sip.addOtherMetadata(metadataOther);
// 1.5) add xml schema (SIP level)
sip.addSchema(new IPFile(Paths.get("src/test/resources/eark/schema.xsd")));
// 1.6) add documentation (SIP level)
sip.addDocumentation(new IPFile(Paths.get("src/test/resources/eark/documentation.pdf")));
// 1.7) set optional RODA related information about ancestors
sip.setAncestors(Arrays.asList("b6f24059-8973-4582-932d-eb0b2cb48f28"));
// 1.8) add an agent (SIP level)
IPAgent agent = new IPAgent("Agent Name", "OTHER", "OTHER ROLE", CreatorType.INDIVIDUAL, "OTHER TYPE", "",
IPAgentNoteTypeEnum.SOFTWARE_VERSION);
sip.addAgent(agent);
// 1.9) add a representation (status will be set to the default value, i.e.,
// ORIGINAL)
IPRepresentation representation1 = new IPRepresentation("representation 1");
sip.addRepresentation(representation1);
// 1.9.1) add a file to the representation
IPFile representationFile = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile.setRenameTo("data.pdf");
representation1.addFile(representationFile);
// 1.9.2) add a file to the representation and put it inside a folder
// called 'def' which is inside a folder called 'abc'
IPFile representationFile2 = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile2.setRelativeFolders(Arrays.asList("abc", "def"));
representation1.addFile(representationFile2);
// 1.10) add a representation & define its status
IPRepresentation representation2 = new IPRepresentation("representation 2");
representation2.setStatus(new RepresentationStatus(REPRESENTATION_STATUS_NORMALIZED));
sip.addRepresentation(representation2);
// 1.10.1) add a file to the representation
IPFile representationFile3 = new IPFile(Paths.get("src/test/resources/eark/documentation.pdf"));
representationFile3.setRenameTo("data3.pdf");
representation2.addFile(representationFile3);
// 2) build SIP, providing an output directory
Path zipSIP = sip.build(tempFolder);
Note: SIP implements the Observer Pattern. This way, if one wants to be notified of SIP build progress, one just needs to implement SIPObserver interface and register itself in the SIP. Something like (just presenting some of the events):
public class WhoWantsToBuildSIPAndBeNotified implements SIPObserver{
public void buildSIP(){
...
SIP sip = new EARKSIP("SIP_1", IPContentType.getMIXED());
sip.addObserver(this);
...
}
@Override
public void sipBuildPackagingStarted(int totalNumberOfFiles) {
...
}
@Override
public void sipBuildPackagingCurrentStatus(int numberOfFilesAlreadyProcessed) {
...
}
}
- Parse a full E-ARK SIP
// 1) invoke static method parse and that's it
SIP earkSIP = EARKSIP.parse(zipSIP);
In this sections are some relevant notes about Commons IP development.
XML Beans are used by Commons IP to manipulate METS files using Java code.
Some changes were made to XML Schemas in order to be able to compile XML Schemas into Java classes using XJC as well as to be able to validate a XML file against its XML Schema without Internet connections.
The changes are:
src/main/resources/schemas2/mets1_12.xsd
XLink Schema location made local (and respectively file available locally)src/main/resources/schemas2/mets1_12.xjb
Bindings file created to deal with attribute name conflict between METS and XLink Schemas
After Java classes are created, some changes were made to produce METS XML files well defined in terms of namespaces. Namely:
src/main/java/org/roda_project/commons_ip2/mets_v1_12/beans/package-info.java
Annotations for corretly generate namespaces were added. Following is the before & then the after:
@javax.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED)
and the after
@javax.xml.bind.annotation.XmlSchema(namespace = "http://www.loc.gov/METS/", elementFormDefault = javax.xml.bind.annotation.XmlNsForm.QUALIFIED, xmlns = {
@javax.xml.bind.annotation.XmlNs(prefix = "", namespaceURI = "http://www.loc.gov/METS/"),
@javax.xml.bind.annotation.XmlNs(prefix = "xsi", namespaceURI = "http://www.w3.org/2001/XMLSchema-instance"),
@javax.xml.bind.annotation.XmlNs(prefix = "csip", namespaceURI = "https://DILCIS.eu/XML/METS/CSIPExtensionMETS"),
@javax.xml.bind.annotation.XmlNs(prefix = "sip", namespaceURI = "https://DILCIS.eu/XML/METS/SIPExtensionMETS"),
@javax.xml.bind.annotation.XmlNs(prefix = "xlink", namespaceURI = "http://www.w3.org/1999/xlink")})
xjc -d src/main/java/ -p "org.roda_project.commons_ip2.mets_v1_12.beans" src/main/resources/schemas2/mets1_12.xsd -b src/main/resources/schemas2/mets1_12.xjb
The IANA Media Types list is required to perform SIP Validation. The list is located in the folder and named as follows:
/src/main/resources/controlledVocabularies/IANA_MEDIA_TYPES.txt
To update IANA media types list, in commons-ip root directory run the following command:
./scripts/ianaMediaTypes_parser.sh
The command executes a script that downloads all IANA Media Types (Application, Audio, Font, Image, Message, Model, Multipart, Text, Video) from https://www.iana.org/assignments/media-types/${iana_file}.csv . Note that this downloads different .csv files then creates a .txt file with all IANA media types appended to the new file.
The IANA media types list from https://www.iana.org/assignments/media-types/${iana_file}.csv was extended with the following mimetypes:
- image/jpeg
- image/gif
- text/plain
- video/mpeg
*To add/remove new values to the list add a new line with the IANA Media Type in scripts/ExtendedIANAMediaTypes.txt
*Run the script again
For more information or commercial support, contact KEEP SOLUTIONS.
- Bagit specification
- E-ARK SIP specification (all versions)
- E-ARK SIP specification (latest version)
- E-ARK Common specification (all versions)
- E-ARK Common specification (latest version)
- RODA-in source code
- RODA source code
- RODA Community Web site
- E-ARK Project Web site
- CEF eArchiving Building Block
- Fork it!
- Create your feature branch:
git checkout -b my-new-feature
- Commit your changes:
git commit -am 'Add some feature'
- Push to the branch:
git push origin my-new-feature
- Submit a pull request :D
- Hélder Silva (KEEP SOLUTIONS)
LGPLv3