Migrating data between CKAN instances
To transfer data between CKAN instances, dump the content of the old instance in JSONL format, convert it to the new schema, then load it into the new instance. This can be done through the CKAN API by installing ckanapi on your local machine.
For a local CKAN, you must append the path to the CKAN configuration (.ini) file to each ckanapi request:
ckanapi -c CKAN_INI ...
To administer a remote instance, you will need a sysadmin account, your username and an API key (obtained through the web interface). Append the remote URL and API key to each ckanapi request, in place of the -c option:
ckanapi -r URL -a API_KEY ...
Active data (i.e. everything that is publicly available) can be retrieved through the API. To dump data that is not available publicly, you must use an API key belonging to a sysadmin user.
ckanapi dump users --all -O USER_FILE.jsonl.gz -z -p 4
ckanapi dump organizations --all -O ORGANIZATION_FILE.jsonl.gz -z -p 4
ckanapi dump datasets --all -O DATASET_FILE.jsonl.gz -z -p 4
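Before converting a dump, it can be useful to confirm that the file is readable and holds the expected number of records. The following is a minimal sketch, assuming the dump is a gzipped JSONL file with one JSON object per line (as produced with the -z option above). The helper names iter_jsonl_gz and count_records are hypothetical and are not part of ckanapi.

```python
import gzip
import json


def iter_jsonl_gz(path):
    """Yield one decoded JSON object per non-empty line of a gzipped JSONL file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_records(path):
    """Return the number of records in a gzipped JSONL dump."""
    return sum(1 for _ in iter_jsonl_gz(path))
```

For example, count_records("DATASET_FILE.jsonl.gz") can be compared with the dataset count reported by the old instance.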
Retrieve the user data from Drupal (this is a nightly dumped file):
rsync -L --progress firstname.lastname@example.org:/var/lib/ckan/ckan/dumps_with_private_data/drupal_users_table.csv.gz DRUPAL_USER_DUMP.csv.gz
Data schema migration
The data schemas of the two CKAN instances may not match, so the dump must be converted before it can be imported into the new CKAN installation. Scripts for this are available in ckanext-datagovuk.
Migrating users and publishers (this is a single step as user-publisher assignment is updated at the same time):
python import/migrate_users.py USER_FILE.jsonl.gz DRUPAL_USER_DUMP.csv.gz USER_MIGRATED_FILE.jsonl.gz ORGANIZATION_FILE.jsonl.gz ORGANIZATION_MIGRATED_FILE.jsonl.gz
Migrating datasets:
python import/migrated_datasets.py -s DATASET_FILE.jsonl.gz -o DATASET_MIGRATED_FILE.jsonl.gz
It is also possible to trim the datasets file so that it includes only datasets created or modified since a specific time (given as an ISO 8601 timestamp, e.g. 2018-04-12T17:07:36.284461). This allows for faster incremental imports:
python import/incremental_update.py -s DATASET_FILE.jsonl.gz -o DATASET_MIGRATED_FILE.jsonl.gz -t TIMESTAMP
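The idea behind the incremental trim can be sketched as a simple timestamp filter. This is an illustration only, not the code of incremental_update.py: it assumes each dataset record carries CKAN's metadata_modified field in the same timezone-free ISO 8601 form as the example timestamp above, and the function name modified_since is hypothetical.

```python
from datetime import datetime


def modified_since(records, timestamp, field="metadata_modified"):
    """Yield records whose `field` is at or after `timestamp`.

    Both values are parsed as ISO 8601 strings. Records without the
    field are skipped (an assumption made for this sketch).
    """
    cutoff = datetime.fromisoformat(timestamp)
    for record in records:
        value = record.get(field)
        if value and datetime.fromisoformat(value) >= cutoff:
            yield record
```

Mixing timezone-aware and timezone-free timestamps would raise a TypeError in the comparison, so both sides should use the same form.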
CKAN supports the bulk import of data (users, publishers and datasets) from a JSONL file representing the data to be imported. If a record already exists (based on the UUID on the object), CKAN will perform a comparison and update the record with changes from the import file. Data from the import file will always overwrite data in the database in the event of a conflict.
ckanapi load users -I USER_MIGRATED_FILE.jsonl.gz -z -p 4
ckanapi load organizations -I ORGANIZATION_MIGRATED_FILE.jsonl.gz -z -p 4
ckanapi load datasets -I DATASET_MIGRATED_FILE.jsonl.gz -z -p 4
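The create-or-update behaviour described above can be pictured with a small in-memory sketch. This is not CKAN's or ckanapi's implementation: the store is a plain dict keyed by record id, the function name upsert is hypothetical, and keeping stored fields that are absent from the import record is an assumption made for illustration.

```python
def upsert(store, record):
    """Insert or update `record` in `store` (a dict keyed by record id).

    Returns "created", "updated" or "unchanged".
    """
    key = record["id"]
    existing = store.get(key)
    if existing is None:
        store[key] = dict(record)
        return "created"
    # Values from the import record win over stored values on conflict.
    merged = {**existing, **record}
    if merged == existing:
        return "unchanged"
    store[key] = merged
    return "updated"
```

Loading the same file twice should therefore leave the second run with nothing to change, which is what makes repeated or incremental imports safe.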
Importing harvesters (this must be run on the server on which the new CKAN is installed):
python import/migrated_harvest_sources.py --production