Data and File Formatting

Data Formats

In choosing a file format, data collectors should select a format that is useable, open, and that will likely be readable well into the future. Microsoft Excel, as an example, is a useful tool for data manipulations and data visualization, but versions of Excel files may become obsolete and may not be easily readable over the longer term. Likewise, database management systems (DBMS) like MS Access, Filemaker Pro, and others, can be a very effective way to store and query data, but the raw formats tend to change over time (even a few years). If your program or organization has used these or other proprietary DBMS tools, it is essential to plan for exporting your data in a stable, well-documented, and non-proprietary format.

Below is a summary of the suggested tabular, image, and GIS data file formats suitable for long-term archiving.

  • Containers: TAR, GZIP, ZIP

  • Databases: CSV, XML

  • Tabular data: CSV

  • Geospatial vector data: SHP, GeoJSON, KML, DBF, NetCDF

  • Geospatial raster data: GeoTIFF/TIFF, NetCDF, HDF-EOS

  • Moving images: MOV, MPEG, AVI, MXF

  • Sounds: WAVE, AIFF, MP3, MXF

  • Statistics: ASCII, DTA, POR, SAS, SAV

  • Still images: TIFF, JPEG 2000, PDF, PNG, GIF, BMP

  • Text: XML, PDF/A, HTML, ASCII, UTF-8

  • Web archive: WARC

For more complete guidance on best practices and appropriate formats for long-term preservation and future accessibility, see the Library of Congress’ Sustainability of Digital Formats website, the LOC’s page on Recommended Format Specifications for preservation, and Hook et al.’s recommendations in Best Practices for Preparing Environmental Data Sets to Share and Archive.

Tabular Data

Tabular data, or data in tables or spreadsheets, is by far the most common format for presenting, analyzing, and storing data. However, common spreadsheet file formats are not ideal for sharing, preserving, and reusing data; they’re not easily machine readable and may be difficult or impossible to read if the specific software tools used to create them are significantly upgraded or outmoded.

Using delimited text formats is a better way to ensure that tabular data are readable in the future. A delimited text file is an ASCII-encoded file in which each line holds one record and the fields within a line are separated by a special character, the delimiter. Common delimiters are the comma, tab, and colon. In a file with comma-separated values (a CSV file), the data values are separated using commas as the delimiter. Because delimited text is so common and simple to use, most database and spreadsheet programs can read or export data in a text-delimited format.

Comma Separated Values (CSV) is the most common delimited data format. If your data has many strings or sentences that may contain commas, be sure to wrap all string values in quotation marks (“”) or use another delimiter that is not present in your data. The semicolon (;) and the tab (“\t”) are two other very common delimiters. Text delimited files can be imported and exported by almost any software designed for storing or manipulating data, including relational database systems, spreadsheet software, and statistical analysis software.
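The quoting advice above can be sketched with Python's standard csv module. This is a minimal illustration, not a required workflow: the column names and the example row are made up, and QUOTE_NONNUMERIC is just one of several quoting modes that would work.

```python
import csv
import io

# Hypothetical rows; the "notes" value contains both a comma and a semicolon.
rows = [
    ["site", "notes"],
    ["Bear Cove", "calm, overcast; two observers"],
]


def write_csv(rows):
    """Return CSV text in which every non-numeric field is wrapped in quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv(text):
    """Parse CSV text back into a list of rows (all values as strings)."""
    return list(csv.reader(io.StringIO(text)))


text = write_csv(rows)
round_trip = read_csv(text)
```

Because the string fields are quoted, the comma inside the notes value is read back as part of the value rather than as a field separator.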

ASCII (American Standard Code for Information Interchange) is the most common text encoding and the one most likely to be readable by tools. Other text encodings, such as UTF-8, are possible and may be necessary for some non-English applications. Avoid obscure text encodings. Use ASCII if possible, with UTF-8 or UTF-16 as secondary options.
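One way to check whether a file's contents are plain ASCII before archiving is to try decoding the bytes. The helper name below is hypothetical and the check is only a sketch; it says nothing about which non-ASCII encoding a file actually uses.

```python
def is_ascii(data: bytes) -> bool:
    """Return True if every byte in data decodes as ASCII."""
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        return False
    return True


plain = b"site,temp_c\nBearCove,8.5\n"
accented = "site,temp_c\nKenai Fj\u00f6rd,8.5\n".encode("utf-8")
```

A file that fails this check may still be perfectly valid UTF-8; in that case, record the encoding in the metadata.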

Relational database management systems (RDBMS) (such as Microsoft Access) create file formats that need specialized software to open and view the contained information. Creating CSV files or ASCII text versions and PDF/A’s of the data provider’s original data ensures that the information contained within the file is openly accessible to data customers. Tabular data stored within relational databases should be broken out into CSV files or ASCII text versions with all table relationships described in enough detail for the database to be recreated.
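As a rough sketch of breaking a relational database out into CSV files, the example below uses SQLite (through Python's standard sqlite3 module) as a stand-in for whatever RDBMS holds the data. The two-table database, the function name dump_tables, and the choice to keep the CREATE TABLE statements as the description of table relationships are all assumptions made for illustration.

```python
import csv
import io
import sqlite3


def dump_tables(conn):
    """Return ({table_name: csv_text}, schema_sql) for every user table."""
    tables = conn.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' ORDER BY name"
    ).fetchall()
    csv_by_table = {}
    for name, _sql in tables:
        cursor = conn.execute(f'SELECT * FROM "{name}"')
        buffer = io.StringIO()
        writer = csv.writer(buffer, lineterminator="\n")
        # First row is the header, taken from the query's column names.
        writer.writerow([column[0] for column in cursor.description])
        writer.writerows(cursor)
        csv_by_table[name] = buffer.getvalue()
    # The CREATE TABLE statements document keys and relationships.
    schema_sql = ";\n".join(sql for _name, sql in tables) + ";\n"
    return csv_by_table, schema_sql


# Hypothetical two-table database used only for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE site (site_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE sample (
        sample_id INTEGER PRIMARY KEY,
        site_id INTEGER REFERENCES site(site_id),
        temp_c REAL
    );
    INSERT INTO site VALUES (1, 'Bear Cove');
    INSERT INTO sample VALUES (10, 1, 8.5);
    """
)
csv_by_table, schema_sql = dump_tables(conn)
```

Each CSV text would be saved as its own file next to the schema description, so that the tables and the relationships between them can be recreated later.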

Biological Data

Biological data is often stored in tabular formats such as spreadsheets and database tables. Biological datasets may include environmental data, such as temperature, salinity, or conductivity, but they focus on measuring variables related to one or more species of plant or animal. Examples of biological data include, but are not limited to, marine bird surveys, genetic analyses of salmon stocks, and toxicological analyses of lichen.

For biological data, the following best practices apply:

  1. Archive data in CSV (or another non-proprietary, text-based format) whenever possible.

  2. Define distinct events (such as location, time (with time zone), and/or depth) within a file with a unique identifier. The identifier is often presented as sample_id or collection_event_id.

  3. Include the common name, the scientific name, and the ITIS Taxonomic Serial Number (TSN) and/or WoRMS AphiaID code for each species.

Geospatial Vector Data

Below are suggested vector file formats. These include proprietary data formats; when the data cannot be exported to open formats, please be sure to document the software package, version, vendor, and native platform.

  • Shapefiles/SHP: we require the .shp, .shx, .prj, and .dbf files that together contain the basic components of a shapefile. Containers (e.g., zip) are a convenient way to ensure completeness.

  • GeoJSON

  • ENVI —.evf (ENVI vector file)

  • ESRI Arc/Info export file (.e00)

Also, make sure that the vectors are properly georeferenced and the geometry type (Point, Line, Polygon, Multipoint, etc.) is specified. The projection and coordinate reference requirements in the section on Coordinate Reference Information for Raster Files below also apply to vector data.

Geospatial Raster Data

Researchers should use non-proprietary file formats for storing their image data. Below are some suggested non-proprietary file formats:

  • GeoTIFF/TIF (.tiff, .tif)

  • ESRI ASCII Grid (.asc, .txt) with detailed metadata describing the storage structure such as the number of Columns, Rows, spatial resolution of the pixels, and projection information (.prj)

  • Binary Image files (BSQ/BIL/BIP) in (.dat, .bsq, .bil) with detailed metadata describing storage structure such as the number of Columns, Rows, Byte order (little endian or big endian), Data type (Float, Unsigned Integer, Double Precision, etc.), and Interleave defined in a companion header file (.hdr).

  • netCDF, ideally formatted and documented according to the CF and ACDD conventions (.nc)

  • HDF-EOS / HDF (.hdf)

If you have access to popular GIS packages, such as QGIS, ENVI, ESRI ArcGIS, ERDAS IMAGINE, or IDRISI, make sure the raster files can be opened readily using one of these software packages. Open source tools, such as uDig, GDAL, and GRASS, can also be used to make sure the raster files can be opened without proprietary software. Creating image files in a customized format that can only be used with your own FORTRAN or C program is strongly discouraged.

Coordinate Reference Information for Raster Files

All raster files should be supported with coordinate reference information to correctly geolocate the images. Image files should be georeferenced prior to sending to the archive. File formats such as GeoTIFF that facilitate embedding the geospatial information inside the image file should be used where possible. Below is a list of the necessary parameters:

  • Definition of Projection/Coordinate Reference System

  • Definition of the referenced Datum

  • EPSG code, if available

  • Spatial resolution of the data. If the resolution is different in X and Y direction, both resolutions need to be provided.

  • Bounding Box – X, Y coordinates of the top-left/bottom-right pixels. While stating the corner pixel coordinates, indicate if these coordinates lie within the center of the pixel or at one of the edges.
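The difference between pixel-center and pixel-edge coordinates is half a pixel in each direction. The sketch below shows that arithmetic for a north-up raster whose stated upper-left coordinate refers to the center of the top-left pixel; the function name and the example numbers are hypothetical.

```python
def center_to_edge_bbox(x_center_ul, y_center_ul, x_res, y_res, n_cols, n_rows):
    """Return the outer-edge bounding box (min_x, min_y, max_x, max_y).

    Assumes a north-up raster: x increases to the right, y decreases downward,
    and (x_center_ul, y_center_ul) is the CENTER of the top-left pixel.
    """
    min_x = x_center_ul - x_res / 2  # shift half a pixel left to the edge
    max_y = y_center_ul + y_res / 2  # shift half a pixel up to the edge
    max_x = min_x + n_cols * x_res
    min_y = max_y - n_rows * y_res
    return (min_x, min_y, max_x, max_y)


# 10 columns by 20 rows of 1 x 1 unit pixels.
square_pixels = center_to_edge_bbox(100.5, 199.5, 1.0, 1.0, 10, 20)
# 4 columns by 3 rows with different X and Y resolutions.
rect_pixels = center_to_edge_bbox(10.25, 49.0, 0.5, 2.0, 4, 3)
```

If the stated corner coordinates already lie on the pixel edges, the half-pixel shift must be skipped, which is why the documentation should say which convention is used.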

Note

There are multiple standards (e.g. OGC WKT (Open Geospatial Consortium Well-Known Text), ESRI WKT (Well-Known Text), TIF world files (.tfw), and ESRI projection files (.prj)) that can be used to define the projection/coordinate reference system and datum. A good reference site is http://spatialreference.org/.

If the following additional documentation is available, provide it along with the geospatial information:

  1. Rationale for choosing a particular projection

  2. Issues with reprojecting the data

  3. Suggested resampling techniques (Nearest neighbor/Cubic convolution…etc)

  4. Projection constraints

Additional considerations:

  • If available, provide a color lookup table for visualizing the raster, in .pal, .cpt, .pg, .sld, .svg, or the following format:

    Red    Green    Blue    Value
    100    123      124     23
    122    123      53      34

  • Include preview pictures of binary image files so that users can check that the binary images were read correctly. For example, include a .jpg, .png, .gif, .bmp, .tif, or .tiff picture of each geographic image.

  • Avoid using generic file extensions (e.g., .bin or .img). These extensions are used by many programs and could cause confusion about the origin of the files. If the data are available in a generic format, explicitly state the software used to create/read the files.

  • Provide information on what software package and version was used to create the data file(s). If the data files were designed for custom code, provide a software program to enable the user to read the files (e.g., FORTRAN, C code, etc).

Why follow these data format recommendations:

  1. Storing data in recommended formats with detailed documentation ensures data longevity. Using non-proprietary formats allows data to be easily read many years into the future.

  2. Storing the data using the recommendations listed above allows the data to be readily exposed through interoperability standards such as the OGC Web Map Service (WMS) and Web Coverage Service (WCS). This increases data usage and allows a single storage format to serve multiple distribution formats.

  3. Users can spend more time analyzing the data and spend less time in data preparation.

  4. Easy access improves the usability of the data, resulting in more researchers using and citing your data.

Separating Data into Files

An important consideration with data organization is the file contents and size (e.g. the number of records within a file). Consolidating your data will prevent users of your data from having to process many small files individually.

Follow these best practices when separating data into files:

  • Place each type of measurement in a separate data file, if you are collecting many observations of different parameters/types.

  • Alternatively, if relatively few observations are made for different parameters/types, place all in one file, provided that single file is described with thorough documentation.

  • For each data file, use similar data organization, parameter formats, and site names so that users of the data can understand the interrelationships between data files.

  • For files that are intended to be final versions or data products, avoid breaking data into many small files (e.g. by month or site, if you are working across multiple sites for many years). Instead, store data in one or more large files and add differentiating parameters, such as site, year, and month, to the data headers.
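The last practice above, replacing many small per-site or per-year files with one larger file that carries site and year as columns, might look like the following sketch. It works on CSV text held in memory to stay self-contained; the function name, site name, and column names are hypothetical, and every input is assumed to share the same header.

```python
import csv
import io


def combine_files(parts):
    """Merge CSV texts keyed by (site, year) into one CSV text.

    The site and year become the first two columns of every row. If a part
    has a column that the first part lacks, DictWriter raises ValueError.
    """
    out = io.StringIO()
    writer = None
    for (site, year), text in sorted(parts.items()):
        reader = csv.DictReader(io.StringIO(text))
        if writer is None:
            fieldnames = ["site", "year"] + list(reader.fieldnames)
            writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
            writer.writeheader()
        for row in reader:
            writer.writerow({"site": site, "year": year, **row})
    return out.getvalue()


# Hypothetical per-year files for one site.
parts = {
    ("BearCove", 2017): "date,count\n20170601,4\n",
    ("BearCove", 2016): "date,count\n20160601,7\n",
}
combined = combine_files(parts)
```

Users of the combined file can still filter by site or year, but they only have to open and parse one file.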

File Naming

How you name files will affect your and your collaborators’ ability to find those files and understand what they contain. Like project and dataset names, file names are an important piece of metadata that will be used to understand and evaluate file content.

File names are indexed by the Research Workspace for search and discovery, so it is important for names to be descriptive of the information contained within.

Follow these best practices for naming files:

  • Develop and use a file-naming convention that describes how file names will be constructed.

  • Apply that convention as broadly as possible (e.g., across several projects or within an entire research campaign).

  • Keep names as concise as possible (< 86 characters) while still being descriptive, particularly for related data types that may span multiple files.

    • Include the following information in the file names:

      • Program/Project/Cruise name or acronym

      • Type of data

      • Date or date range, formatted as YYYYMMDD

      • Location of data collected

      • Version number of file

      • Researcher name (optional)

  • File name elements should generally be ordered logically, from the most general to the most specific detail.

    • Example: 2017_TiglaxCruise_PreyCounts_V02.csv

  • Consider how the ordering of file name elements will change the way files are ordered alphanumerically.

  • If an alphanumeric file name is used (e.g., for application-specific files), include a README.txt file that explains your naming format with any abbreviations or codes used.

  • Special characters should be avoided (e.g., ~ ! @ # $ % ^ & * ( ) ` ; < > . ? , [ ] { } ' " and |).

  • For sequential numbering, use leading zeros to ensure files sort properly.

    • For example, use “0001, 0002…1001, etc” instead of “1, 2…1001, etc.”

File naming examples

Poor file names:

  • MyData

  • 2017_data

Better file name:

  • SeaMonkey_BearCove_2017_GrossProd_v03.csv

    • Notes:
      • Sea Monkey is the project name

      • Bear Cove is the location

      • 2017 is the calendar year

      • GrossProd represents the gross productivity data

      • v03 indicates that this is the third version of this dataset

      • CSV is the file type
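A file-naming convention is easier to apply consistently when it is written down as code. The sketch below builds names in the pattern of the better example above and rejects spaces and special characters. The function name, the element order, and the two-digit version padding are assumptions drawn from that example rather than a prescribed standard.

```python
import re

# Letters, digits, and hyphens only; underscores are reserved as separators.
SAFE_ELEMENT = re.compile(r"[A-Za-z0-9-]+")
MAX_LENGTH = 85  # the guidance above asks for fewer than 86 characters


def build_file_name(project, location, date, data_type, version, extension="csv"):
    """Join name elements with underscores, zero-padding the version number."""
    elements = [project, location, str(date), data_type, f"v{version:02d}"]
    for element in elements:
        if not SAFE_ELEMENT.fullmatch(element):
            raise ValueError(f"unsafe file name element: {element!r}")
    name = "_".join(elements) + "." + extension
    if len(name) > MAX_LENGTH:
        raise ValueError(f"file name is longer than {MAX_LENGTH} characters")
    return name


good_name = build_file_name("SeaMonkey", "BearCove", 2017, "GrossProd", 3)
```

Passing "Sea Monkey" (with a space) as the project raises ValueError, which catches the problem before the file is written.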

Data Headers

In order for others to use your data, they must fully understand the contents of the dataset. Most commonly, data values within a file are organized in rows and columns, with each event or observation as a row and each column representing a measurement or contextual information. Use a header row at the top of each file describing the values in each column.

Follow these best practices for data headers:

  • Use commonly accepted parameter names for header titles (e.g. site, date, treatment, units of measure, etc).

  • Use consistent capitalization of header names (e.g. temp and precip, or Temp and Precip).

  • Explicitly state units of reported parameters in the data file and the metadata.

  • If a coded value or abbreviation is used, be sure to provide a definition in the metadata.

  • Adopt a similar structure across data files. If the same parameters are used across multiple files, use a file template to maintain consistent column names across files. Avoid having different numbers of columns or rearranging columns across similar files.

  • Column names or headings should contain only numbers, letters, hyphens, and underscores, with no spaces or special characters. This also applies to column names and table names in databases. Special characters and spaces are error-prone when machine-read or edited.
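The character rule for column names can be checked automatically before a file is shared. The following is a minimal sketch; the function name and the sample headers are hypothetical.

```python
import re

# Numbers, letters, hyphens, and underscores only.
HEADER_PATTERN = re.compile(r"[A-Za-z0-9_-]+")


def check_headers(headers):
    """Return the headers that contain spaces or special characters."""
    return [header for header in headers if not HEADER_PATTERN.fullmatch(header)]


clean = check_headers(["site", "date", "water_temp_C"])
flagged = check_headers(["site", "water temp (C)", "sal%"])
```

An empty result means every header passes; anything returned should be renamed, with the units documented in the metadata instead of in punctuation.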

Data Values

To allow others to best understand your data, follow these best practices for values within a data file:

  • Standardize all coded and null values within a dataset.

  • Use an explicit value for missing or no data, rather than an empty field.

    • For numeric fields, represent missing data with a specified extreme value (e.g., -9999), the IEEE floating point NaN value (Not a Number), or the database NULL. Be advised that NULL and NaN can cause problems, particularly with some older programs. For character fields, use NULL, “not applicable”, “n/a” or “none”.

  • If there are multiple reasons that cells might not have values, include a separate code for each reason.

  • The null value(s) should be consistently applied within and among data files.

  • If data values are encoded, be sure to provide a definition in the metadata.

  • Don’t include rows with summary statistics. It is best to put summary statistics, figures, analyses, and other summary content into a separate companion data file.
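When reading a file that uses explicit missing-value codes, the codes need to be translated before any arithmetic is done, otherwise a value like -9999 will distort every average. The sketch below assumes two hypothetical codes, each with its own reason; the real codes and their meanings should come from the dataset's metadata.

```python
import math

# Hypothetical missing-value codes; define the real ones in the metadata.
MISSING_CODES = {
    "-9999": "not measured",
    "-8888": "below detection limit",
}


def parse_value(text):
    """Return (value, reason): value is NaN and reason is set for missing codes."""
    if text in MISSING_CODES:
        return (math.nan, MISSING_CODES[text])
    # float("") raises ValueError, so an empty field is reported, not guessed at.
    return (float(text), None)


measured = parse_value("12.5")
missing_value, missing_reason = parse_value("-9999")
```

Keeping the reason alongside the value preserves the distinction between, for example, a sensor that was not deployed and a reading below the detection limit.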