API Reference

This document describes the API of the collectionbatchtool module. It covers the module-level functions, the TableDataset class, the TreeDataset class, and the TableDataset subclasses.

Module-level functions

apply_specify_context(collection_name, specify_user, quiet=True)

Set up the Specify context.

Parameters:
  • collection_name (str) – Name of an existing Specify collection.
  • specify_user (str) – Username for an existing Specify user.
  • quiet (bool, default True) – If True, no output will be written to standard output.
apply_user_settings(filepath, quiet=True)

Read and apply user settings in a configuration file.

Parameters:
  • filepath (str) – Path to the configuration file.
  • quiet (bool, default True) – If True, no output will be written to standard output.
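
A minimal sketch of the configuration-file route; the file name settings.cfg is a placeholder for an existing configuration file containing your database credentials and Specify context:

    from collectionbatchtool import *

    # Placeholder path to an existing configuration file with the
    # database credentials and the Specify collection/user to use.
    apply_user_settings('settings.cfg', quiet=False)
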
initiate_database(database, host, user, passwd, quiet=True)

Initiate the connection to the MySQL database.

Parameters:
  • database (str) – Name of a MySQL database.
  • host (str) – Database host.
  • user (str) – MySQL user name.
  • passwd (str) – MySQL password.
  • quiet (bool, default True) – If True, no output will be written to standard output.
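
The same setup can be done step by step, first initiating the database connection and then applying the Specify context with apply_specify_context(). All values below are placeholders, not defaults provided by the module:

    from collectionbatchtool import *

    # Placeholder connection details -- replace with your own.
    initiate_database(
        database='specify',
        host='localhost',
        user='myusername',
        passwd='mypassword',
        quiet=False)

    # Placeholder names of an existing collection and an existing Specify user.
    apply_specify_context('My Collection', 'myspecifyuser', quiet=False)
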
query_to_dataframe(database, query)

Return the result of a peewee SelectQuery as a pandas.DataFrame.
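
A hedged sketch of how this might be used together with a dataset's database_query attribute (documented below); fetching the peewee database handle through the model's _meta attribute is a peewee convention, not something this module documents:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    db = agents.model._meta.database          # peewee convention; assumed, not documented here
    frame = query_to_dataframe(db, agents.database_query)
    print(frame.head())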

The TableDataset class

class TableDataset(model, key_columns, static_content, where_clause, frame)

Bases: object

Store a dataset corresponding to a database table.

model

peewee.BaseModel

A Specify data model corresponding to a table.

key_columns

dict

Key-fields and SourceID-columns for the model.

static_content

dict

Data to be inserted automatically for the model.

where_clause

peewee.Expression

Condition for getting relevant data from the database.

describe_columns()

Return a pandas.DataFrame describing the columns in the current model.
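
For example, to inspect the columns of the taxon model (assuming the connection has been set up as shown above):

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    taxa = TaxonDataset()
    print(taxa.describe_columns())            # one row per column in the taxon model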

from_csv(filepath, quiet=True, **kwargs)

Read dataset from a CSV file.

Parameters:
  • filepath (str) – File path or object.
  • quiet (bool, default True) – If True, no output will be written to standard output.
  • **kwargs – Arbitrary keyword arguments available in pandas.read_csv().
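
A sketch of reading a prepared CSV file into an AgentDataset; the file name is a placeholder and the extra keyword just illustrates that pandas.read_csv() options are passed through:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    # The 'sep' keyword is forwarded to pandas.read_csv().
    agents.from_csv('agents.csv', quiet=False, sep=';')
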
from_database(quiet=True)

Read table data from the database.

Parameters:
  • quiet (bool, default True) – If True, no output will be written to standard output.
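
Reading existing table data works the same way; for example, to load the current content of the agent table into a dataset:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_database(quiet=False)         # populates agents.frame from the agent table
    print(len(agents.frame))
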
get_match_count(target_column, match_columns)

Return counts for matches and possible matches.

Parameters:
  • target_column (str) – Column that should have a value if any value in match_columns is not null.
  • match_columns (str or List[str]) – Column or columns used for updating values in target_column.
Returns:
  matches, possible matches
Return type:
  tuple

get_mismatches(target_column, match_columns)

Return a pandas.Series or a pandas.DataFrame with non-matching values.

Parameters:
  • target_column (str) – Column that should have a value if any value in match_columns is not null.
  • match_columns (str or List[str]) – Column or columns used for updating values in target_column.
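
A sketch of using the two inspection methods together before deciding how to match file data against the database; the column names agentid and lastname are assumptions about the agent table, not values prescribed by the API:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_csv('agents.csv')             # placeholder input file

    # Counts of matches and possible matches for the agentid/lastname pair.
    matches, possible = agents.get_match_count('agentid', 'lastname')
    print(matches, possible)

    # lastname values for which agentid is still missing.
    print(agents.get_mismatches('agentid', 'lastname'))
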
match_database_records(match_columns, quiet=True)

Update primary key values for records that match the database.

Parameters:
  • match_columns (str or List[str]) – Columns to be matched against the database.
  • quiet (bool, default True) – If True, no output will be written to standard output.
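
A sketch of matching file data against records that already exist in the database; lastname is an assumed match column for the agent table:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_csv('agents.csv')             # placeholder input file

    # Fill in the primary key column for rows whose lastname
    # matches an existing database record.
    agents.match_database_records('lastname', quiet=False)
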
to_csv(filepath, update_sourceid=False, drop_empty_columns=False, quiet=True, encoding='utf-8', float_format='%g', index=False, **kwargs)

Write the dataset to a comma-separated values (CSV) file.

Parameters:
  • filepath (str) – File path or object.
  • update_sourceid (bool, default False) – If True, copy ID-columns to SourceID-columns before writing to the CSV file.
  • drop_empty_columns (bool, default False) – Drop columns that do not contain any data.
  • quiet (bool, default True) – If True, no output will be written to standard output.
  • encoding (str, default 'utf-8') – A string representing the encoding to use in the output file.
  • float_format (str or None, default '%g') – Format string for floating point numbers.
  • index (bool, default False) – Write row names (index).
  • **kwargs – Arbitrary keyword arguments available in pandas.DataFrame.to_csv().
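
Writing a dataset back to disk might look like this; the output path is a placeholder and the keyword arguments simply show some of the defaults that can be overridden:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_database()
    agents.to_csv(
        'agents_export.csv',                  # placeholder output path
        update_sourceid=True,                 # copy ID-columns to SourceID-columns first
        drop_empty_columns=True,              # leave out columns without any data
        quiet=False)
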
to_database(defaults=None, update_record_metadata=True, chunksize=10000, quiet=True)

Load a dataset into the corresponding table and update the dataset’s primary key column from the database.

Parameters:
  • defaults (dict) – Column names and values to insert instead of nulls.
  • update_record_metadata (bool, default True) – If True, record metadata will be generated during import, otherwise the metadata will be loaded from the dataset.
  • chunksize (int, default 10000) – Size of the chunks being uploaded.
  • quiet (bool, default True) – If True, no output will be written to standard output.
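
A sketch of loading a prepared dataset into its table; the defaults dictionary and the column name it uses are illustrative assumptions only:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_csv('agents.csv')             # placeholder input file

    # Upload the rows; the primary key column is filled in from the database.
    # The 'agenttype' default is an assumed example, not a required value.
    agents.to_database(defaults={'agenttype': 1}, quiet=False)
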
update_database_records(columns, update_record_metadata=True, chunksize=10000, quiet=True)

Update records in database with matching primary key values.

Parameters:
  • columns (str or List[str]) – Column or columns with new values.
  • update_record_metadata (bool, default True) – If True, record metadata will be generated during import, otherwise the metadata will be updated from the dataset.
  • chunksize (int, default 10000) – Size of the chunks being updated.
  • quiet (bool, default True) – If True, no output will be written to standard output.
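
Updating existing rows follows the same pattern once the primary key column has been populated, for example via match_database_records(); the column names below are assumptions:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_csv('agents.csv')              # placeholder input file
    agents.match_database_records('lastname')  # assumed match column

    # Overwrite the email column for the matched records.
    agents.update_database_records('email', quiet=False)
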
update_foreign_keys(from_datasets, quiet=False)

Update foreign key values from a related dataset based on SourceID values.

Parameters:
  • from_datasets (TableDataset or List[TableDataset]) – Dataset(s) from which foreign key values will be updated.
  • quiet (bool, default False) – If True, no output will be written to standard output.
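
A sketch of filling in foreign keys in one dataset from another after both have been loaded; it assumes the CSV files carry matching SourceID values:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_csv('agents.csv')             # placeholder input file
    agents.to_database()                      # agent primary keys are now known

    collectors = CollectorDataset()
    collectors.from_csv('collectors.csv')     # placeholder input file
    # Replace the collectors' agent SourceID references with the primary
    # key values obtained when the agents were uploaded.
    collectors.update_foreign_keys(agents, quiet=False)
    collectors.to_database()
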
update_sourceid(quiet=True)

Copy values from ID-columns to SourceID-columns.

Parameters:
  • quiet (bool, default True) – If True, no output will be written to standard output.
write_mapping_to_csv(filepath, quiet=True, float_format='%g', index=False, **kwargs)

Write the ID-column mapping to a comma-separated values (CSV) file.

Parameters:
  • filepath (str) – File path or object.
  • quiet (bool, default True) – If True, no output will be written to standard output.
  • float_format (str or None, default '%g') – Format string for floating point numbers.
  • index (bool, default False) – Write row names (index).
  • **kwargs – Arbitrary keyword arguments available in pandas.DataFrame.to_csv().
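
Saving the ID-column mapping after an import might look like this; the file names are placeholders:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    agents.from_csv('agents.csv')             # placeholder input file
    agents.to_database()
    agents.write_mapping_to_csv('agent_mapping.csv', quiet=False)
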
all_columns

List containing all columns in the dataset.

database_columns

List with available database columns.

database_query

Database query for reading the data from the database.

file_columns

List containing only the columns that can be written to or read from a file.

frame

A pandas.DataFrame to hold the data.

primary_key_column

Name of the primary key column.
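
The attributes can be inspected directly, for example:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    agents = AgentDataset()
    print(agents.primary_key_column)          # name of the primary key column
    print(agents.file_columns)                # columns read from or written to files
    print(agents.frame.head())                # the underlying pandas.DataFrame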

The TreeDataset class

class TreeDataset

Bases: object

A dataset corresponding to a tree table in Specify.

update_rankid_column(dataset, quiet=True)

Update RankID based on SourceID-column.

Parameters:
  • dataset (TableDataset) – A treedefitem-dataset from which RankID should be updated.
  • quiet (bool, default True) – If True, no output will be written to standard output.

Notes

This method exists in order to update the redundant RankID-columns in TreeDataset dataframes.
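
A sketch of refreshing RankID in a taxon dataset from its tree definition items; whether update_sourceid() is needed first depends on how the treedefitem dataset was obtained, so treat the preparation steps as assumptions:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')       # placeholder config path

    ranks = TaxontreedefitemDataset()
    ranks.from_database()                     # existing rank definitions
    ranks.update_sourceid()                   # expose SourceID values for matching (assumed step)

    taxa = TaxonDataset()
    taxa.from_csv('taxa.csv')                 # placeholder input file
    taxa.update_rankid_column(ranks, quiet=False)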

TableDataset subclasses

class AgentDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the agent-table.

class CollectingeventattributeDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the collectingeventattribute-table.

class CollectingeventDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the collectingevent-table.

class CollectionobjectattributeDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the collectionobjectattribute-table.

class CollectionobjectDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the collectionobject-table.

class CollectorDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the collector-table.

class DeterminationDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the determination-table.

class GeographyDataset

Bases: collectionbatchtool.TableDataset, collectionbatchtool.TreeDataset

Dataset corresponding to the geography-table.

class GeographytreedefitemDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the geographytreedefitem-table.

class LocalityDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the locality-table.

class StorageDataset

Bases: collectionbatchtool.TableDataset, collectionbatchtool.TreeDataset

Dataset corresponding to the storage-table.

class StoragetreedefitemDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the storagetreedefitem-table.

class PreparationDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the preparation-table.

class PreptypeDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the preptype-table.

class TaxonDataset

Bases: collectionbatchtool.TableDataset, collectionbatchtool.TreeDataset

Dataset corresponding to the taxon-table.

class TaxontreedefitemDataset

Bases: collectionbatchtool.TableDataset

Dataset corresponding to the taxontreedefitem-table.
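
Taken together, the subclasses support a batch-import workflow in which each table is loaded in dependency order. The sketch below shows the general shape of such a run; the file names and the match column are placeholders, and the order shown is one reasonable choice rather than a requirement of the module:

    from collectionbatchtool import *

    apply_user_settings('settings.cfg')               # placeholder config path

    # 1. Agents first, so that later tables can reference them.
    agents = AgentDataset()
    agents.from_csv('agents.csv')                     # placeholder input file
    agents.match_database_records('lastname')         # assumed match column
    agents.to_database()

    # 2. Collecting events, then collectors that reference agents and events.
    events = CollectingeventDataset()
    events.from_csv('collectingevents.csv')           # placeholder input file
    events.to_database()

    collectors = CollectorDataset()
    collectors.from_csv('collectors.csv')             # placeholder input file
    collectors.update_foreign_keys([agents, events])
    collectors.to_database()

    # 3. Collection objects that reference the collecting events.
    objects = CollectionobjectDataset()
    objects.from_csv('collectionobjects.csv')         # placeholder input file
    objects.update_foreign_keys(events)
    objects.to_database()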