Spatial Data (GIS)
This section describes the process of submitting spatial (e.g., GIS) data to an IOOS RA portal.
Species distribution datasets can come in the form of vector polygons or raster surfaces. They represent the geographic bounds (e.g. home range) and/or the probability of occurrence at a given location (e.g. utilization distribution) of an individual or population of animals over a period of time. Species distribution data are derived from a set of directly or remotely sensed location observations that have been post-processed into a summary or aggregate product.
Example: the Benthic Biomass Relative Biomass Index dataset consists of vector contours representing the relative biomass of benthic invertebrates across the Bering, Beaufort, and Chukchi Seas. This dataset was derived from survey locations that were post-processed into rasters using kernel density estimation and subsequently summarized as density contours.
If your data have not undergone any of these or similar post-processing analyses, they are likely more accurately described as survey locations or trajectories, for which we have separate data submission guidelines we suggest you follow.
Use community standards. Open data formats such as CSVs and Shapefiles are always preferable to closed or proprietary formats like Excel spreadsheets or ArcGIS geodatabases because they are more likely to be accessible to the broader community both now and in the future.
Be consistent. Strive to use the same file formats and variable names across files within a dataset (e.g. different distributions in a year) and, as often as possible, across datasets. When we plan to visualize your data, we use scripts to ingest them, so any inconsistencies will require manual intervention and lead to delays.
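As a rough illustration of the consistency point, the sketch below compares the column headers of several CSV files before submission. It assumes the dataset is delivered as CSV files with a header row. The function names `read_header` and `find_inconsistent_headers` are hypothetical helpers written for this example, not part of any ingestion tooling.

```python
import csv
from pathlib import Path


def read_header(path):
    """Return the first row of a CSV file as a list of column names."""
    with open(path, newline="", encoding="utf-8") as handle:
        return next(csv.reader(handle), [])


def find_inconsistent_headers(paths):
    """Compare every file's header to the header of the first file.

    Returns a dict mapping each mismatching path to its header.
    (Hypothetical helper for illustration only.)
    """
    paths = [Path(p) for p in paths]
    if not paths:
        return {}
    reference = read_header(paths[0])
    mismatches = {}
    for path in paths[1:]:
        header = read_header(path)
        if header != reference:
            mismatches[path] = header
    return mismatches
```

An empty result means every file uses the same column names in the same order. Any entry in the result points at a file worth renaming or reordering before it is shared.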
Data Submission Guidelines
When submitting species distribution data for visualization and/or archive, there are a number of descriptors that can aid in the ingestion and reuse of those data. Depending on the nature of your dataset, these descriptors might be incorporated into filenames, column headers, field values, metadata records, or left out entirely if not applicable. These possible descriptors include:
Spatial reference: Spatial reference information should identify both the datum and projection used, if any. Depending on the data format, this might be included in the PRJ file along with a Shapefile, or specified in the filename, fieldnames, or metadata record associated with tabular data.
Time period: Species distribution datasets necessarily represent observations taken over a period of time. Best practices are to include an indication of the season or time range included in the dataset.
Method of summarization: Species distribution datasets can be generated using a variety of different techniques and parameter specifications. Some indication of the methods used embedded in the filename and/or fieldnames will help make your datasets easier to interpret.
Species identifier: Identification of species in filenames, fieldnames, or values is important especially in the case of multiple species comprising a single dataset. In all cases use the species identification conventions specific to your field, where they exist.
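One way to keep these descriptors consistent is to build filenames from them programmatically. The sketch below is only an illustration: `build_filename` is a hypothetical helper, and the ordering (species, method, method parameter, season, start year, end year) is an assumption modeled on the example filenames given later on this page.

```python
def build_filename(species, method, parameter=None, season=None,
                   start_year=None, end_year=None):
    """Join the descriptors that apply into an underscore-separated name.

    Hypothetical helper; the descriptor order is an assumption.
    """
    parts = [species, method, parameter, season, start_year, end_year]
    # Skip descriptors that do not apply and normalize the rest to
    # lowercase text without spaces.
    cleaned = [str(part).strip().lower().replace(" ", "-")
               for part in parts if part is not None]
    return "_".join(cleaned)


print(build_filename("sealion", "MCP", 95, "summer", 2009, 2011))
# sealion_mcp_95_summer_2009_2011
print(build_filename("hopu", "KDE", "01", None, 2001, 2011))
# hopu_kde_01_2001_2011
```

Descriptors that do not fit in a filename, such as the spatial reference, can still be recorded in fieldnames or the metadata record.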
Common Mistakes We Encounter
A variety of things can go wrong when creating and sharing a dataset. To help you avoid setbacks during ingestion and visualization, here are some of the mistakes we encounter most often:
Data incorrectly formatted around the dateline (the 180° meridian).
Data with no location information or invalid geometries.
Polygons that aren’t closed.
Polygons that have overlapping bounds.
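Some of these problems can be caught with simple checks before submission. The following is a minimal sketch, assuming each polygon ring is a list of (longitude, latitude) pairs in decimal degrees. All function names are hypothetical, and the dateline test is only a heuristic: it flags consecutive vertices more than 180 degrees of longitude apart. Detecting overlapping polygons is not covered here and generally calls for a geometry library.

```python
def ring_is_closed(ring):
    """A polygon ring is closed when its first and last vertices match."""
    return len(ring) >= 4 and tuple(ring[0]) == tuple(ring[-1])


def close_ring(ring):
    """Return a copy of the ring with the first vertex appended if needed."""
    ring = [tuple(pt) for pt in ring]
    if ring and ring[0] != ring[-1]:
        ring.append(ring[0])
    return ring


def coordinates_in_range(ring):
    """Check that every (lon, lat) pair is a plausible geographic coordinate."""
    return all(-180 <= lon <= 180 and -90 <= lat <= 90 for lon, lat in ring)


def may_cross_dateline(ring):
    """Heuristic: flag rings whose consecutive vertices are more than
    180 degrees of longitude apart, which usually means the edge was
    meant to cross the 180 degree meridian rather than span the globe.
    """
    return any(abs(b[0] - a[0]) > 180 for a, b in zip(ring, ring[1:]))


ring = [(179.0, 60.0), (-179.0, 60.0), (-179.0, 62.0), (179.0, 62.0)]
ring = close_ring(ring)
print(ring_is_closed(ring), coordinates_in_range(ring), may_cross_dateline(ring))
# True True True
```

A ring flagged by `may_cross_dateline` is not necessarily wrong, but it is worth opening in GIS software to confirm that it draws as intended.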
There are a wide variety of data formats capable of representing vector or raster species distributions. Below is a list of the formats we prefer to work with, including links to more detailed documentation for each format. If your data are not in one of these formats we will still likely be able to work with them, but we ask that you attempt to convert to one of the following formats:
GeoPackage: GeoPackages are an open, standards-based format for transferring geospatial information in one file. They are openly described and can be read by any geospatial software package. They are preferred to shapefiles since there are fewer opportunities for shared data to be corrupted. They also allow for multiple geometry types to be included.
Shapefile: ESRI Shapefiles are standard across many scientific fields. They are also openly described and can be read by any geospatial software package. The drawback to shapefiles is that they are not a single file, which can lead to corruption of a dataset if not handled carefully. To be valid, shapefiles must include a SHP, SHX, and DBF file, and to be useful they also need to include a PRJ.
CSV/TSV: CSVs and TSVs are not specifically spatial data formats, but can include spatial fields, the simplest of which might be latitude and longitude for a dataset of points. Better is to use the Well-Known Text (WKT) format for spatial objects, which can accommodate points, lines, and polygons. Extended variants such as EWKT can also embed a spatial reference identifier (SRID), whereas plain WKT geometries need the spatial reference recorded separately.
GeoJSON: GeoJSON is an extension of the text-based JSON data format. GeoJSON can handle all vector types so long as they are represented by geographic coordinates in the WGS84 datum (EPSG code 4326).
NetCDF: NetCDF files are the preferred format and the backbone of archival data storage at both Axiom and NCEI. They are machine-independent and flexible, can store data and dimensions in any combination, and are self-describing: metadata are stored as global and/or variable attributes within the file itself.
GeoTiff: GeoTiff is a common format for regularly-gridded (i.e., raster) data. This format is simple to read and interpret and commonly includes metadata such as projection information, spatial resolution, pixel-size, etc., but other metadata attributes such as source, owner, publisher, history, parameter names, etc. are often not included.
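Because a Shapefile is spread across several files, it can help to confirm that all the parts are present before sharing it. The sketch below assumes the component files sit in the same directory and share a base name. `missing_shapefile_parts` is a hypothetical helper that only checks for the presence of the files, not their contents.

```python
from pathlib import Path

REQUIRED = (".shp", ".shx", ".dbf")  # needed for a valid Shapefile
RECOMMENDED = (".prj",)              # needed for the data to be useful


def missing_shapefile_parts(shp_path):
    """Return (missing_required, missing_recommended) file extensions.

    Hypothetical helper: looks for sibling files that share the base
    name of shp_path.
    """
    shp_path = Path(shp_path)
    # Compare extensions case-insensitively, since some tools write
    # uppercase extensions such as .SHP or .DBF.
    present = {p.suffix.lower()
               for p in shp_path.parent.iterdir()
               if p.stem == shp_path.stem}
    missing_required = [ext for ext in REQUIRED if ext not in present]
    missing_recommended = [ext for ext in RECOMMENDED if ext not in present]
    return missing_required, missing_recommended
```

Two empty lists mean the SHP, SHX, DBF, and PRJ files were all found next to each other.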
Sea lion MCP home range
As a vector example, take a dataset derived from Argos locations taken from multiple distinct populations of tagged sea lions across multiple seasons. This dataset was subsequently summarized using a minimum convex polygon analysis into a vector home range estimate. Ideally this dataset would be described and formatted in the following manner:
Formatted as a Shapefile with PRJ file containing spatial reference information.
Descriptive filename, for example sealion_mcp_95_summer_2009_2011, indicating the species, the method used (MCP), the percentage of locations included based on outlier status (95%), the season, and the time range.
A population field with values indicating the subpopulation of each home range polygon.
Horned puffin KDE relative abundance
As a raster example, take a dataset derived from direct survey observations collected over a ten-year period from ocean-going vessels. This dataset was subsequently summarized as a relative abundance surface using kernel density estimation methods. Ideally this dataset would be described and formatted in the following manner:
Formatted as a GeoTiff with internal spatial reference definition describing datum, projection, extent, and cell size.
Descriptive filename, for example hopu_kde_01_2001_2011, indicating the standard species code, the method of analysis (KDE) and a smoothing bandwidth of 0.1, and the date range.
Once your data are well-described and in an appropriate format, you may submit up to 1 TB of files using the following form:
Spatial Data Ingest Request Form <https://docs.google.com/forms/d/e/1FAIpQLScRuQVUuthjZd3AlKOIJTCNSudpV-bteYa9022XMq0amEPfzg/viewform?usp=pp_url>
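Before filling in the form, it may be worth checking the total size of the files you intend to submit. The sketch below is an illustration only. It assumes that 1 TB means 10^12 bytes, which is our reading rather than a stated definition, and the helper names are hypothetical.

```python
from pathlib import Path

LIMIT_BYTES = 10**12  # assumption: 1 TB interpreted as 10^12 bytes


def total_size_bytes(directory):
    """Sum the sizes of all regular files below a directory."""
    return sum(p.stat().st_size
               for p in Path(directory).rglob("*") if p.is_file())


def within_limit(directory, limit=LIMIT_BYTES):
    """Return True when the directory's files fit within the limit."""
    return total_size_bytes(directory) <= limit
```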