Advanced Operations

This document demonstrates some of CollectionBatchTool’s more advanced features. The following sections are included:

  • Exporting data from multiple tables
  • Importing data to multiple tables
  • Counting database records

It is assumed here that you are familiar with the basic operations described in the quickstart guide. Before you run any of the code in this document, you should set up your database connection as described in the section Setting up the database connection in the quickstart guide.

Exporting data from multiple tables

Exporting data from multiple tables is as easy as exporting data from a single table: just use the from_database() and to_csv() methods. Remember to always set update_sourceid=True in the to_csv() method, so that individual records can still be linked to each other after they have been exported. Here is an example of how to export data from the locality and geography tables.

>>> loc = LocalityDataset()
>>> loc.from_database()
>>> loc.to_csv(update_sourceid=True)
>>> geo = GeographyDataset()
>>> geo.from_database()
>>> geo.to_csv(update_sourceid=True)

To export all the data available to CollectionBatchTool, we can write a simple command-line script:

#!/usr/bin/env python

"""
export_all_data.py - script for exporting available data.
"""

import os
import argparse

from collectionbatchtool import *


def export_all_data(
        config_file, output_dir=None,
        drop_empty_columns=False, quiet=True):
    """
    Export data from Specify to CSV files.

    Parameters
    ----------
    config_file : str
        Path to a config-file with the user's settings.
    output_dir : str
        Path to the output directory.
    drop_empty_columns : bool
        If True, drop columns that contain no data.
    quiet : bool
        If True, do not print status messages.
    """
    apply_user_settings(config_file)  # read settings and set up the database connection
    output_dir = output_dir if output_dir else ''
    for tabledataset_subclass in TableDataset.__subclasses__():
        instance = tabledataset_subclass()
        if instance.database_query.count() > 0:  # skip tables without data
            instance.from_database(quiet=quiet)
            filename = instance.model.__name__.lower() + '.csv'
            filepath = os.path.join(output_dir, filename)
            instance.to_csv(
                filepath,
                update_sourceid=True,
                drop_empty_columns=drop_empty_columns,
                quiet=quiet)


if __name__ == '__main__':
    parser = argparse.ArgumentParser(
        description='A command-line script to export Specify data.')
    parser.add_argument(
        'config_file', type=argparse.FileType('r'),
        help='path to a config-file')
    parser.add_argument(
        'output_dir', default='.', nargs='?',
        help='path to output directory')
    parser.add_argument(
        '-d', '--drop-empty-columns', dest='drop_empty_columns',
        action='store_true', help='drop columns without data')
    parser.add_argument(
        '-v', '--verbose', action='store_true',
        help='print status messages')
    args = parser.parse_args()
 
    if not os.path.isdir(args.output_dir):
        parser.error("%s is not a valid directory" % args.output_dir)
    
    quiet = not args.verbose
 
    export_all_data(
        args.config_file.name,
        output_dir=args.output_dir,
        drop_empty_columns=args.drop_empty_columns,
        quiet=quiet)

export_all_data.py

Importing data to multiple tables

Importing data to multiple tables in Specify is a lot more complicated than doing single-table imports. The main reason is that foreign keys need to be updated before the batches of data can be loaded into the database. Another reason is that the presence of self-relations in some tables (e.g. agent and taxon) may require extra updates to the database after the records have been uploaded, as sketched below.
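To make the self-relation point concrete, here is a minimal sketch of how a self-related table such as taxon might be handled. The input file name and the default values are made up for illustration, and the update_database_records() call is an assumption about how the resolved keys are written back; check the library's API reference for the exact method names and required defaults.

# Minimal sketch, assuming user settings have already been applied.
# The parentid column refers to rows in the same table, so it can only
# be filled in after those rows exist in the database.
txn = TaxonDataset()
txn.from_csv('taxon_sample.csv', quiet=False)             # hypothetical input file
txn.to_database(defaults={'isaccepted': 1}, quiet=False)  # placeholder default value
txn.update_foreign_keys(txn, quiet=False)                 # resolve parentid within the dataset
txn.update_database_records(['parentid'], quiet=False)    # assumed second-pass update method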

Loading batches of data is faster than updating existing database records, so you can speed up imports by minimizing the number of records that need to be updated. This can be done by importing data into the tables in a suitable order. For example, it is usually good to start with the agent table, since many other tables refer to it.

Due to constraints in the database, you sometimes also need to create records in a certain order. For example, every record uploaded to the preparation table must refer to an existing record in the collectionobject table.

Tip

When building import scripts, it is often a good idea to start with a single table and then add other tables one at a time, testing as you go. You can save a dump of your database before you begin, and restore that dump before each new import trial.
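If your Specify installation runs on a MySQL server, the dump and restore steps can be wrapped in small helper functions, as in the sketch below. The function names, user name and file paths are placeholders, and mysqldump/mysql will prompt for a password because of the -p flag.

# Sketch of dump/restore helpers for a MySQL-backed Specify database.
import subprocess

def dump_database(dbname, user, dumpfile):
    """Save a dump of the database before an import trial."""
    with open(dumpfile, 'w') as outfile:
        subprocess.check_call(
            ['mysqldump', '-u', user, '-p', dbname], stdout=outfile)

def restore_database(dbname, user, dumpfile):
    """Restore the database from a previously saved dump."""
    with open(dumpfile) as infile:
        subprocess.check_call(
            ['mysql', '-u', user, '-p', dbname], stdin=infile)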

For demonstration purposes, data will be uploaded to the following tables:

  • agent
  • locality
  • collectingevent
  • collector
  • collectionobject
  • preptype
  • preparation

We will use some fabricated data that, in brief, looks like this (you may recognize the agent names from the import exercise in the quickstart guide):

CatalogNumber  Preptype         Collector             Locality
dummy-1        herbarium sheet  Thunberg, Carl Peter  Japan
dummy-2        herbarium sheet  Thunberg, Carl Peter  Japan
dummy-3        herbarium sheet  Forsskål, Peter       Egypt
dummy-4        herbarium sheet  Forsskål, Peter       Egypt
dummy-5        herbarium sheet  Solander, Daniel      Australia
dummy-6        herbarium sheet  Solander, Daniel      Australia
dummy-7        herbarium sheet  Osbeck, Pehr          China
dummy-8        herbarium sheet  Osbeck, Pehr          China
dummy-9        herbarium sheet  Kalm, Pehr            Canada
dummy-10       herbarium sheet  Kalm, Pehr            Canada

Download a zipped archive with the data: sample_data.zip

The following files are included in the archive:

Filename                     Specify table     # rows  # columns
agent_sample.csv             agent             5       3
locality_sample.csv          locality          5       2
collectingevent_sample.csv   collectingevent   5       2
collector_sample.csv         collector         5       4
collectionobject_sample.csv  collectionobject  10      3
preptype_sample.csv          preptype          1       2
preparation_sample.csv       preparation       10      3

The script for importing the sample data is shown below. Note that we add default values to fields that cannot be set to NULL.

#!/usr/bin/env python

"""import_sample_data.py - script for importing sample data"""

from collectionbatchtool import *

apply_user_settings('settings.cfg')  # change to your own config-file!

agt = AgentDataset()  # agents first, since several other tables refer to them
agt.from_csv('agent_sample.csv', quiet=False)
agt.to_database(defaults={'agenttype': 1}, quiet=False)

loc = LocalityDataset()
loc.from_csv('locality_sample.csv', quiet=False)
loc.to_database(defaults={'srclatlongunit': 3}, quiet=False)

cev = CollectingeventDataset()
cev.from_csv('collectingevent_sample.csv', quiet=False)
cev.update_foreign_keys([agt, loc], quiet=False)  # remap keys to the uploaded agents and localities
cev.to_database(quiet=False)

col = CollectorDataset()
col.from_csv('collector_sample.csv', quiet=False)
col.update_foreign_keys([agt, cev], quiet=False)
col.to_database(defaults={'isprimary': 1}, quiet=False)

cob = CollectionobjectDataset()
cob.from_csv('collectionobject_sample.csv', quiet=False)
cob.update_foreign_keys(cev, quiet=False)
cob.to_database(quiet=False)

pty = PreptypeDataset()
pty.from_csv('preptype_sample.csv', quiet=False)
pty.match_database_records('name')  # match existing preptypes by "name"
pty.to_database(defaults={'isloanable': 1}, quiet=False)

pre = PreparationDataset()
pre.from_csv('preparation_sample.csv', quiet=False)
pre.update_foreign_keys([pty, cob], quiet=False)
pre.to_database(quiet=False)

import_sample_data.py

Counting database records

You can easily count how many records in the database belong to your collection. The code below counts the records in the agent table without downloading them:

>>> agt = AgentDataset()
>>> agt.database_query.count()
1

To display the number of records for each available table, you can do something like this:

>>> for tabledataset_subclass in TableDataset.__subclasses__():
...     instance = tabledataset_subclass()
...     print('{0}: {1}'.format(
...          instance.model.__name__, instance.database_query.count()))