diff --git a/NEWS.md b/NEWS.md
index 4d594e4..c7d33de 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,3 +1,6 @@
+# NPSdataverse 0.1.1 (development version)
+* Load EMLassemblyline v3.5.5 from github release rather than the development version.
+
 # NPSdataverse 0.1.0 2024-08-19
diff --git a/docs/news/index.html b/docs/news/index.html
index 25d4d54..9667eb1 100644
--- a/docs/news/index.html
+++ b/docs/news/index.html
@@ -50,6 +50,10 @@
NEWS.md
+ 2024-08-19 * Reduce github.com API rate limits by only hitting API in interactive session.
diff --git a/docs/paper.html b/docs/paper.html
index b531d08..a30d560 100644
--- a/docs/paper.html
+++ b/docs/paper.html
@@ -52,32 +52,87 @@
NPSdataverse is a suite of R packages modeled off of the tidyverse concept of several packages built with a common goal [@Wickham2019]. The overarching theme of the NPSdataverse packages is creating, publishing, and accessing Open, machine-readable data and metadata. NPSdataverse supports Ecological Metadata Language (EML) metadata and .csv data files. Some of the constituent packages (R/EML and R/EMLassemblyline) are general-use packages aimed at authoring EML documents. Additional packages (R/QCkit, R/EMLeditor, R/DPchecker and R/NPSutils) are designed and maintained by the National Park Service. Although many functions within the NPSdataverse packages are NPS-specific (particularly API calls), or have default parameters with NPS staff in mind, all of the functions are written so that they can also be used by the general public. Anyone interested applying for research permits or conducting research on National Park Units can reference and utilize the NPSdataverse packages. Additionally, the packages will be useful for data management plans in wide variety of grant proposals and for anyone that needs to create Open data and machine readable metadata to comply with the Open Data Act of 2018. Finally, the ability to author, edit, and check EML metadata will be useful for data publication at any number of repositories or data journals.
+The NPSdataverse is a suite of R packages developed to create, document, publish, and access data and metadata in open, machine-readable formats. NPSdataverse is modeled off of the tidyverse concept of several packages built with a common goal [@Wickham2019]. The NPSdataverse supports Ecological Metadata Language (EML) metadata and .csv data files. Some of the constituent R packages (EML and EMLassemblyline) are general-use and aimed at authoring EML documents. Other R packages (QCkit, EMLeditor, DPchecker and NPSutils) are designed and maintained by the National Park Service (NPS). Although many functions within the NPSdataverse packages are NPS-specific (particularly some API calls), whenever possible the functions are written so that they can also be used by the general public. Scientists conducting permitted research in NPS units can utilize the NPSdataverse to efficiently and consistently meet the data delivery requirements of their permits. Additionally, the packages will be useful for data management plans in a wide variety of grant proposals and for anyone who needs to create open data and machine-readable metadata. Finally, the ability to swiftly and easily author, edit, and check EML metadata in a reproducible fashion will be useful for data publication at any number of repositories or data journals.
Following a long-term movement for transparency and data accessibility, the U.S. implimented an Open Data Memorandum in 2013 (OMB M-13-13) and the federal Open Data Act of 2019 [@OpenData2019]. the Open Data Act mandated that federal agencies provide data in open formats with metadata. Subsequently, many funding agencies such as the National Science Foundation have required grant awardees to make data public, often includingmetadata ([@nsf2015]). Several academic publishers have followed suit. Multiple publishers have followed suit ([@Wiley2022], [@springer2023])), requiring data availability statements upon publication.
-One goal of open science, and requirement of the Open Government Data Act is to include metadata along with data. Ecological Metadata Language Metadata (EML) is one metadata standard that is particularly amenable to studies with rich taxonomy. It has been adopted by multiple research organizations including the Ecological Data Initiative (EDI), the National Ecological Observatory Network (NEON), the Global Biodiversity Information Facility (GBIF), Swedish Biodiversity Data Infrastructure (SBDI), the French Biodiversity Hub (“Pole National de Donnees de Biodiversite”), the U.S. National Park Service, and others.
-Nevertheless, actual availability of data varies ([@Federer2018, @Tedersoo2021], perhaps because there is a need for more infrastructure and tools to meet the goals of open data and open science ([@Huston2019]). Multiple solutions have been presented, including ezEML, a workflow for authoring metadata in Ecological Metadata Language and publishing data and metadata to a repository ([@Vanderbilt2022]). However, ezEML is has an intuitive graphical user interface with a relatively low learning curve, it does have some drawbacks. For instance, ezEML is not scriptable, which makes repeated deployments of the same or similar workflows challenging. And, ezEML requires the user upload their data to an external site for processing, which may not be suitable for sensitive data. Here we introduce the NPSdataverse, a series of R-based packages for authoring, editing, and checking EML metadata locally in a scriptable fashion. Packages within the NPSdataverse also include data munging and data access/download functions.
+Following a movement for transparency in scientific research and data accessibility, the U.S. implemented the federal OPEN Government Data Act [@OpenData2018]. The Open Data Act mandates that federal agencies provide data in open formats with metadata. Subsequently, many funding agencies such as the National Science Foundation have required grant awardees to make data public, often including metadata [@nsf2015]. Multiple publishers have followed suit [@Wiley2022; @Springer2023] and require data availability statements upon publication.
+One goal of open science, and a requirement of the recent “Nelson Memo” from the U.S. Office of Science and Technology Policy [@Nelson2022], is to make data FAIR: findable, accessible, interoperable, and reusable [@Wilkinson2016]. These goals are often achieved by including structured, machine-readable metadata that conforms to a defined schema along with the data. Ecological Metadata Language (EML) is one metadata standard that is particularly amenable to studies with rich taxonomy [@Jones2006; @EML2019]. It has been adopted by multiple research organizations including the Environmental Data Initiative (EDI), National Ecological Observatory Network (NEON), Global Biodiversity Information Facility (GBIF), Swedish Biodiversity Data Infrastructure (SBDI), French Biodiversity Hub (“Pole National de Donnees de Biodiversite”), U.S. National Park Service, and others.
+Nevertheless, actual availability of data and metadata varies [@Federer2018; @Tedersoo2021], perhaps because there is a need for more infrastructure and tools to meet the goals of open data and open science [@Huston2019]. Multiple solutions have been presented, including ezEML, a tool for authoring metadata in Ecological Metadata Language and publishing data and metadata to a repository [@Vanderbilt2022]. ezEML has an intuitive graphical user interface with a relatively low learning curve; however, it does have some drawbacks. For instance, ezEML is not scriptable, which makes repeated deployments of the same or similar workflows challenging and can limit reproducibility. ezEML also requires that the user upload their data to an external site for processing, which may not be suitable for sensitive data. Here we introduce the NPSdataverse, a series of R packages for authoring, editing, and checking EML metadata locally in a robust, repeatable, and scriptable fashion. R packages within the NPSdataverse leverage earlier work using R to create and manipulate XML-based EML files [@Boettiger2019]. Building upon that framework, we add user-friendly EML creation workflows; integration with taxonomic databases; fast, easy editing of existing metadata; congruence checks to test correspondence between data and metadata; and integration with public repositories such as the National Park Service’s DataStore. The EML metadata file in .xml format along with the .csv data files it describes comprise a “data package”. In addition, R packages within the NPSdataverse include data functions that expedite quality control, facilitate interoperability, provide the ability to download data directly from DataStore, and leverage the rich EML associated with the data regardless of repository of origin.
Brief description of the NPSdataverse package: When a user is on-line and loads the NPSdataverse into R, NPSdataverse will automatically check that the latest version of the main development branch on GitHub is being loaded. If not, the user will be alerted and given instructions on how to update the relevant packages.
+The NPSdataverse package is a meta-package that loads packages within the NPSdataverse into R [@Baker_NPSdataverse2024]. It provides a convenient way to download, install, and load many of the R packages needed to create and access data packages consisting of rich Ecological Metadata Language metadata and .csv data files:
+pak::pkg_install("nationalparkservice/NPSdataverse")
+library(NPSdataverse)
+NPSdataverse
will automatically check that the latest version of each R package is being loaded: either from the main development branch on GitHub.com or the latest version on CRAN. If updates are indicated, the user will be alerted and given instructions on how to update the relevant packages. To avoid exceeding API rate limits at GitHub (and to facilitate scripted workflows such as those at High Performance Computing facilities), NPSdataverse
only checks for updates from an interactive R session and will skip checks when the system is not on-line or GitHub.com is not responding.
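The gating logic described above can be sketched in a few lines of base R. This is a minimal illustration and not the package's actual implementation: the function name `should_check_for_updates()` and its three arguments are hypothetical stand-ins for whatever session and connectivity checks NPSdataverse performs internally.

```r
# Hypothetical helper (not part of NPSdataverse): decide whether an
# update check should run. The arguments stand in for the real checks.
should_check_for_updates <- function(is_interactive = interactive(),
                                     is_online = TRUE,
                                     github_responding = TRUE) {
  # All three conditions must hold; otherwise the check is skipped
  isTRUE(is_interactive) && isTRUE(is_online) && isTRUE(github_responding)
}

# A scripted (non-interactive) run skips the check even when on-line
should_check_for_updates(is_interactive = FALSE)
```

Passing the conditions in as arguments keeps the sketch free of network calls, so it behaves the same in a batch job as at the console.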
brief description of the various component packages
+QCkit (“Quality Control kit”) is primarily a data processing package designed to prepare data for metadata creation and publication [@Baker_QCkit2024]. This package serves two main functions: 1) providing a suite of data quality control functions that can be used across datasets regardless of the project, and 2) providing a suite of functions that apply data standards to promote interoperability among datasets. For instance, QCkit
includes functions that can help manage date-time formatting, can check data files for threatened or endangered species, and can help increase interoperability by suggesting appropriate Darwin Core standards for naming data. QCkit
also facilitates documenting data processing with functions that can generate a DataStore reference based on GitHub.com releases. The DataStore reference can hold processing scripts, code, or packages and have Digital Object Identifiers (DOIs) attached to them that are registered with DataCite once the DataStore reference is activated. QCkit
is designed as an expandable framework that can adapt to new quality control tests or as new data standards are adopted.
Citations to entries in paper.bib should be in rMarkdown format.
-If you want to cite a software repository URL (e.g. something on GitHub without a preferred citation) then you can do it with the example BibTeX entry below for @fidgit.
-For a quick reference, the following citation commands can be used: - @author:2001
-> “Author et al. (2001)” - [@author:2001]
-> “(Author et al., 2001)” - [@author1:2001; @author2:2001]
-> “(Author1 et al., 2001; Author2 et al., 2002)”
The EML (“Ecological Metadata Language”) package is a fundamental package that allows for importing .xml files, creating and validating EML within R, and writing R objects back out to .xml files [@Boettiger2024]. EML
allows for creating fully fledged Ecological Metadata Language metadata files using nested S3 lists within R while relying on the R/emld package [@Boettiger2019_emld].
Figures can be included like this: and referenced from text using .
Figure sizes can be customized by adding an optional second parameter:
The EMLassemblyline package builds upon EML
and adds substantial functionality [@Smith2022]. For instance, EMLassemblyline
allows the user to supply .csv files, which are used to generate template .txt files. Users can adjust the template files as needed and use the EMLassemblyline::make_eml()
function to generate an R-object that can be exported via EML
as an EML-formatted .xml file. EMLassemblyline
includes the ability to generate entire taxonomic backbones from lists of scientific names via API calls to ITIS, GBIF, or WoRMS. EMLassemblyline
will validate the R object against the EML schema and provide helpful hints on what might have gone wrong during the EMLassemblyline::make_eml()
process. EMLassemblyline
provides an efficient bridge between .csv data and EML metadata for users who are familiar with R but may not be experts on the EML schema or the detailed nested lists needed to create EML within R via the EML
package. Products from the EMLassemblyline
pipeline are suitable for publication at multiple repositories including the Environmental Data Initiative.
The EMLeditor package allows users to quickly and easily view components of metadata in R and make on-the-fly edits to metadata [@Baker_EMLeditor2024]. Edits made to EML objects using EMLeditor
do not require re-running the EMLassemblyline
functions to make EML. This is a significant improvement because running EMLassemblyline
functions can be time consuming, especially if there are many taxa that need to be resolved. EMLeditor
includes the ability to pick specific licenses (CC0, CC-BY, etc), add ORCIDs, include organizations as authors, and much more. EMLeditor
also adds specific content necessary to be compliant with NPS’s DataStore. With the proper permissions, EMLeditor
can be used to generate draft references and reserve DOIs on DataStore as well as upload data and metadata files to DataStore. Finally, EMLeditor
contains a .rmd template file that, after loading the package, is accessible in RStudio under File > New File > R Markdown
. The template provides an editable script that walks the user through using EMLassemblyline
, EMLeditor
, and DPchecker
to create and validate EML metadata in R.
EMLeditor
“set” class functions (which includes all functions that begin with “set_” such as “EMLeditor::set_abstract()
”) will add several NPS-specific items to the metadata using their default settings. For instance, these functions will set NPS as the publisher, Fort Collins as the publication location, and will add a “for or by NPS = TRUE” statement to the metadata. To invoke these functions without adding the NPS-specific metadata elements, set the parameter NPS = FALSE
when calling each “set_” class function. Non-NPS publisher information can be added using the EMLeditor::set_publisher()
function with the parameters for_or_by_NPS
and NPS
set to FALSE
:
#set the abstract without NPS-specific information:
+
+new_metadata1 <- set_abstract(eml_object = old_metadata,
+ abstract = "This is example abstract text",
+ NPS = FALSE)
+
+#add custom publisher information:
+
+new_metadata2 <- set_publisher(eml_object = new_metadata1,
+ org_name = "My Institution",
+ street_address = "1234 Sesame St.",
+ city = "Anytown",
+ state = "Delaware",
+ zip_code = "12345",
+ country = "USA",
+ URL = "https://www.myinstitution.us",
+ email = "publisher@myinstitution.us",
+ ror_id = "",
+ for_or_by_NPS = FALSE,
+ NPS = FALSE)
+By default, EMLeditor
functions provide verbose user feedback and may require user input to confirm some operations. These checks are intended to help guide users, prevent inadvertent mistakes, and limit unnecessary API calls. However, requiring user input can hamper highly scripted approaches and limits reproducibility. Therefore, all EMLeditor
functions can be set to circumvent these requirements using the parameter force = TRUE
.
#example setting the abstract while suppressing user feedback and input:
+
+new_metadata <- set_abstract(eml_object = old_metadata,
+ abstract = "This is example abstract text",
+ force = TRUE)
+The DPchecker (“Data Package checker”) package provides feedback on data-metadata congruence [@Baker_DPchecker2024]. Here, a “data package” consists of the EML metadata file with a filename that ends in *_metadata.xml and one or more data files in .csv format, all of which are in a single directory (and the directory contains no extraneous .csv or .xml files). DPchecker
is useful for both data package authors and reviewers. DPchecker
goes beyond validating EML objects in R against the EML schema. Using the DPchecker::run_congruence_checks
function, DPchecker
will conduct a series of 46 tests. These are divided into several categories to check whether:
For each test, the data package may fail with an error, fail with a warning, or pass. When possible, warnings and error messages indicate the appropriate EMLeditor
function to address the problem. DPchecker
will often throw a warning even if an EML element exists and is properly formatted but could be improved to increase the FAIR characteristics of the metadata. For instance, DPchecker
will throw a warning if an abstract is fewer than 20 words long, as it is unlikely the creator is able to meaningfully describe the data collection and processing in fewer than 20 words.
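As a rough illustration of a check that warns rather than fails, the abstract-length rule can be sketched in base R. The function `check_abstract_length()` and its return structure are hypothetical and are not DPchecker's API; only the 20-word threshold comes from the text above.

```r
# Hypothetical check (not DPchecker's implementation): classify an
# abstract as "pass" or "warning" based on its word count.
check_abstract_length <- function(abstract, min_words = 20) {
  # Split on runs of whitespace and count the non-empty pieces
  words <- strsplit(trimws(abstract), "\\s+")[[1]]
  n_words <- sum(nzchar(words))
  if (n_words < min_words) {
    return(list(status = "warning", n_words = n_words))
  }
  list(status = "pass", n_words = n_words)
}

# A four-word abstract is valid text but too short to be informative
short <- check_abstract_length("Bird counts from 2020.")
short$status
```

Returning a status string instead of calling `stop()` lets a caller collect results from many checks and report errors and warnings together.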
The [NPSutils](https://nationalparkservice.github.io/NPSutils/)
(“NPS utilities”) package serves primarily as a way to access data [@Baker_NPSutils2024]. NPSutils
provides avenues for directly downloading data from DataStore using R. NPSutils
can also import data downloaded from any repository into R and take advantage of rich EML metadata to call column types. NPSutils
provides some basic meta-analysis capability, assuming certain interoperability standards are met (such as consistently naming columns with Darwin Core parameters or other domain-accepted parameter names). NPSutils
can also be used to import data and metadata into common data visualization tools.
Example of how to download and access data:
+# download a data package from datastore:
+# the data package will be downloaded to ./data/2300498
+
+NPSutils::get_data_package(2300498)
+
+# load the data package into R:
+# returns a list of tibbles; each tibble corresponds to a single data file
+
+mojn <- NPSutils::load_data_package(2300498, assign_attributes = TRUE)
NPSdataverse_packages()
-#> [1] "cli" "crayon" "DPchecker" "EML"
-#> [5] "EMLassemblyline" "EMLeditor" "NPSutils" "QCkit"
-#> [9] "rstudioapi" "utils" "remotes" "lifecycle"
+#> [1] "cli"
+#> [2] "crayon"
+#> [3] "DPchecker"
+#> [4] "EML"
+#> [5] "EMLassemblyline"
+#> [6] "EMLeditor"
+#> [7] "NPSutils"
+#> [8] "QCkit"
+#> [9] "rstudioapi"
+#> [10] "utils"
+#> [11] "remotes"
+#> [12] "lifecycle"
R/updateR.R
- dot-print_cust_package_deps.Rd
formats print table for custom printing package dependencies in need of updating. Derived from remotes:::print.package_deps()
.
.print_cust_package_deps(x, show_ok = FALSE, ...)
printed text to console
-