Quickstart

This guide gives you a brief introduction to CollectionBatchTool’s most basic features. The following will be covered:

  • Setting up the database connection
  • Exporting data from a single table
  • Importing data to a single table
  • Updating existing database records

Before you begin

A few things need to be in place before you can use CollectionBatchTool:

  • You need root access to the Specify (MySQL) database you want to work with.
  • Your collection must be present in the database. If the collection is missing, you need to create it with the Specify Setup Wizard.
  • Your Specify user must have access to the collection in question.

Warning

Using CollectionBatchTool is equivalent to executing SQL statements against the database as the root user. You are responsible for any changes made to the data. To avoid concurrency problems, make sure that no one else is accessing the database while you are using the tool.

Setting up the database connection

CollectionBatchTool lets you work with one Specify collection at a time. Enter your personal settings in a configuration file similar to what is shown below. Then, save the file with a suitable name, for example settings.cfg.

[MySQL]
Database = my_database
Host = localhost
User = root
Password = password

[Specify]
CollectionName = My collection
User = user

To apply your settings, call the function apply_user_settings() with the path to your configuration file:

>>> from collectionbatchtool import *
>>> apply_user_settings('settings.cfg')

Note

It is good practice to always backup your data before you start working with a database.
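
For the MySQL database that Specify uses, a dump file is one simple way to do this. For example, from the command line (with the database name from the configuration file above):

$ mysqldump -u root -p my_database > my_database-backup.sql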

Exporting data from a single table

In this exercise we will export data from the agent-table in Specify to a comma-separated values (CSV) file. We assume that the database connection has been set up as described in the previous section. Let’s begin by creating an AgentDataset object:

>>> agt = AgentDataset()

This newly created object will eventually hold all the data we want to export. For the moment, it contains an empty dataframe, stored in the frame attribute. You can verify that the dataframe is empty by examining the AgentDataset object (see the last line of the output):

>>> agt
<class 'collectionbatchtool.AgentDataset'>
model: <class 'specifymodels.Agent'>
key_columns: {
    'agentid': 'agent_sourceid'
    'createdbyagentid': 'createdbyagent_sourceid'
    'modifiedbyagentid': 'modifiedbyagent_sourceid'
    'parentorganizationid': 'parentorganization_sourceid'}
static_content: {
    'divisionid': 2
    'specifyuserid': None}
where_clause: <class 'peewee.Expression'>
frame: <class 'pandas.core.frame.DataFrame'> [0 rows x 30 columns]

You can also take a look at the frame attribute directly:

>>> agt.frame
Empty DataFrame
Columns: [
    agent_sourceid, createdbyagent_sourceid, modifiedbyagent_sourceid,
    parentorganization_sourceid, abbreviation, agenttype, dateofbirth,
    dateofbirthprecision, dateofdeath, dateofdeathprecision, datetype, email,
    firstname, guid, initials, interests, jobtitle, lastname, middleinitial,
    remarks, suffix, timestampcreated, timestampmodified, title, url, version,
    agentid, createdbyagentid, modifiedbyagentid, parentorganizationid]
Index: []

[0 rows x 30 columns]

Next, we want to load data from the agent-table into our object’s frame attribute. We do so using the from_database() method. By setting quiet=False, we get some information on what’s going on:

>>> agt.from_database(quiet=False)
[AgentDataset] reading database records: 1/1

If there are any agent-records associated with your collection, these are now stored in your AgentDataset object. A new collection within its own division in Specify contains just a single agent-record (as in our example).

Writing the data to a CSV file is just as easy as retrieving it from the database. All you need is the to_csv() method and a file path:

>>> agt.to_csv('agent.csv', update_sourceid=True, quiet=False)
[AgentDataset] updating SourceID-columns...
    copying 'agentid' to 'agent_sourceid' [1 value]
    copying 'createdbyagentid' to 'createdbyagent_sourceid' [0 values]
    copying 'modifiedbyagentid' to 'modifiedbyagent_sourceid' [0 values]
    copying 'parentorganizationid' to 'parentorganization_sourceid' [0 values]
[AgentDataset] writing to CSV file...
    1 rows x 26 columns; agent.csv

In the example above, we use the update_sourceid parameter to ensure that every ID-column is copied to its corresponding SourceID-column before the data is written to the file.
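
If you want a quick check of the result, the exported file can be read back with pandas (which is already available, since the frame attribute is a pandas DataFrame):

>>> import pandas as pd
>>> pd.read_csv('agent.csv').shape  # (rows, columns)
(1, 26)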

Note

Want to export data from some other table? Take a look at supported Specify tables to see which tables are currently available.
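
Each supported table is handled through a dataset class of its own, analogous to AgentDataset. As a sketch only (the class name below merely follows the naming pattern; check the list of supported tables for the actual names), exporting another table looks just the same:

>>> loc = LocalityDataset()  # hypothetical class name, following the AgentDataset pattern
>>> loc.from_database()
>>> loc.to_csv('locality.csv', update_sourceid=True)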

Importing data to a single table

Data import is not so different from data export. One important difference, though, is that you first need to prepare your CSV files according to some specific rules. We won’t go into the details here; the format specifications are described in the document on how to prepare CSV files.

In this exercise we will import data to the agent-table. As in the export example above, we assume that the database connection has been set up properly. We will import a small sample dataset (three columns and ten rows) listing some of the Apostles of Linnaeus. The first column, agent_sourceid, is somewhat special: it may be used to connect records outside of the Specify database, and it can also be used to trace imported data, as we will see towards the end of this exercise.

The sample file apostles.csv contains the following records:

agent_sourceid  firstname    lastname
1               Pehr         Osbeck
2               Peter        Forsskål
3               Pehr         Löfling
4               Pehr         Kalm
5               Daniel       Rolander
6               Johan Peter  Falck
7               Daniel       Solander
8               Carl Peter   Thunberg
9               Anders       Sparrman
10              Peter Jonas  Bergius

apostles.csv
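
If you want to recreate the sample file yourself, a minimal pandas sketch like the following will produce it (assuming a plain comma-separated file is what you need; see the document on how to prepare CSV files for the full format rules):

>>> import pandas as pd
>>> rows = [
...     (1, 'Pehr', 'Osbeck'), (2, 'Peter', 'Forsskål'),
...     (3, 'Pehr', 'Löfling'), (4, 'Pehr', 'Kalm'),
...     (5, 'Daniel', 'Rolander'), (6, 'Johan Peter', 'Falck'),
...     (7, 'Daniel', 'Solander'), (8, 'Carl Peter', 'Thunberg'),
...     (9, 'Anders', 'Sparrman'), (10, 'Peter Jonas', 'Bergius')]
>>> pd.DataFrame(
...     rows, columns=['agent_sourceid', 'firstname', 'lastname']
... ).to_csv('apostles.csv', index=False)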

Like with exports, we start out by creating an AgentDataset object:

>>> agt = AgentDataset()

To read the data from the CSV file, we use the from_csv() method:

>>> agt.from_csv('apostles.csv', quiet=False)
[AgentDataset] reading CSV file...
    10 rows x 3 columns; apostles.csv

The AgentDataset object should now hold the ten records. We continue by uploading the data to the agent-table with the to_database() method. We use the method’s defaults parameter to insert default values instead of NULL; this parameter accepts a Python dict with column names and the values to insert:

>>> agt_defaults = {
...     'agenttype': 1,
...     'dateofbirthprecision': 1,
...     'dateofdeathprecision': 1,
...     'middleinitial': ''
... }
>>> agt.to_database(defaults=agt_defaults, quiet=False)
[AgentDataset] loading records to database: 10/10

After the import is completed, your AgentDataset object is automatically updated with the inserted records’ primary key values. If you look at the frame attribute, you should see a new value in the agentid-column for every record that was imported to the database.
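
A quick way to confirm this is to count the non-empty values in that column:

>>> agt.frame['agentid'].notnull().sum()  # every imported row now has a primary key
10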

Now, let’s try to upload the dataset again. This time, none of the records will be inserted into the database. The reason is the new values in the agentid-column: only records that lack a primary key value get uploaded to the database.
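
A second call would look something like this (using the same defaults dictionary as before; the exact progress output may differ):

>>> agt.to_database(defaults=agt_defaults, quiet=False)
[AgentDataset] loading records to database: 0/10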

Finally, we can use the write_mapping_to_csv() method to export the mapping between agent_sourceid and agentid. This mapping allows us to trace the source of every imported record.

>>> agt.write_mapping_to_csv('agent-mapping.csv')

Important

There is no data validation carried out by CollectionBatchTool prior to import. An incorrect datatype, or a violation of a database constraint, will result in an exception being raised.
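
CollectionBatchTool will not validate the data for you, but nothing stops you from running your own checks on the frame attribute before calling to_database(). A minimal sketch with pandas (the rules below are just examples, not the actual database constraints):

>>> frame = agt.frame
>>> assert not frame['agent_sourceid'].duplicated().any(), 'duplicate agent_sourceid'
>>> assert frame['lastname'].notnull().all(), 'lastname missing in some rows'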

Updating existing database records

In the last exercise of this quickstart guide, we will update some of the records that we imported previously. Suppose that we want to update the agent-table with new birth years for three of the imported apostles.

The new sample file apostles_birthyear.csv contains the following information:

firstname   lastname  dateofbirth  dateofbirthprecision
Peter       Forsskål  1732-01-01   3
Daniel      Solander  1733-01-01   3
Carl Peter  Thunberg  1743-01-01   3

apostles_birthyear.csv

We begin by creating a new AgentDataset object and reading the new sample data into that object:

>>> agt = AgentDataset()
>>> agt.from_csv('apostles_birthyear.csv', quiet=False)
[AgentDataset] reading CSV file...
    3 rows x 4 columns; apostles_birthyear.csv

Next, we get primary key values from the agent-table based on the content of the columns firstname and lastname (we assume that the combination of first and last names uniquely identifies individual agent-records). We use the match_database_records() method:

>>> agt.match_database_records(['firstname', 'lastname'], quiet=False)
[AgentDataset] updating primary key from database...
    target-column:   'agentid'
    match-column(s): ['firstname', 'lastname']
    matches:         3/3

You should now be able to see the updated values in the agentid-column if you check the frame attribute of your AgentDataset object. Once the primary key values are in place, it’s easy to update the database with information from the two new columns by using the update_database_records() method:

>>> agt.update_database_records(['dateofbirth', 'dateofbirthprecision'], quiet=False)
[AgentDataset] updating database records: 3/3
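
To convince yourself that the update went through, you can read the agent-table back into a fresh object and count the records that now have a birth date (assuming no other agent-records in your collection have one):

>>> check = AgentDataset()
>>> check.from_database()
>>> len(check.frame.dropna(subset=['dateofbirth']))  # the three updated apostles
3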

You have now reached the end of the quickstart guide. If you want to learn more, you can continue to the guide on CollectionBatchTool’s advanced features or read more about functions, classes and methods in the API reference.