This is the second part of a Drupal case study on integrating the CKAN data repository with Drupal 6. Part 1 covered the following:
- What is CKAN?
- CKAN’s API
- The Foundation
- The Build
- Theming
- Homepage Chart
Caching
API calls are expensive, particularly when they return large amounts of data. To avoid exhausting the CKAN API with requests and to keep the site responsive, I leveraged Drupal’s caching mechanisms and cached pretty much everything I could, within reason. The chart, tag cloud, tag lists, Ministry lists, the list of all packages and every individual package are cached. The catch is that when a package is updated on the CKAN instance, the Drupal site needs to find out as soon as possible and clear the appropriate caches so that the most recent data can be retrieved.
For caching I created a table called ‘cache_ckan’ that stores everything I need. To create this table I reused the schema of Drupal’s existing cache table and put it in the .install file in my module directory.
/**
 * Implementation of hook_install().
 */
function ckan_install() {
  drupal_install_schema('ckan');
}

/**
 * Implementation of hook_uninstall().
 */
function ckan_uninstall() {
  drupal_uninstall_schema('ckan');
}

/**
 * Implementation of hook_schema().
 */
function ckan_schema() {
  $schema = array();
  $schema['cache_ckan'] = drupal_get_schema_unprocessed('system', 'cache');
  return $schema;
}
When the module is first installed, hook_install() runs this schema and the table is created.
What is stored in the cache_ckan table?
There are various items stored in the cache table.
- The Homepage chart data
- Tag lists
- Ministry lists
- List of all datasets
Let’s take the list of all packages as an example. I covered how I implemented the paging in my previous post. Because the list is paginated, every page needs to be cached to keep the site fast. With the paging mechanism already in place, it’s just a case of creating a cache entry (ckan:all{page-number}) for each page and checking for its existence when loading the page.
// If cached data exists for this page, use it
if (($cache = cache_get('ckan:all'. $page, 'cache_ckan')) && !empty($cache->data)) {
  $results = $cache->data;
}
else {
  $ckan = ckan_ckan();
  $start = 0;
  $items_per_page = variable_get('ckan_items_per_page', 4);
  if ($page) {
    // If we're in a page, we need to set where to start the list
    $start = $page * $items_per_page;
  }
  // Set the offset to the number of records to skip
  $offset = $start;
  // Limit to the number of items per page
  $limit = $items_per_page;
  try {
    $results = $ckan->advancedSearch(array('groups' => 'canadagov', 'all_fields' => '1', 'offset' => $offset, 'limit' => $limit));
  }
  catch (Exception $e) {
    return $e->getMessage();
  }
  // If the API call worked, log it and cache this page of results
  watchdog('ckan', 'Called CKAN API for list of all packages');
  cache_set('ckan:all'. $page, $results, 'cache_ckan');
}
This method is very simple and very effective. Cached pages load without touching the CKAN API at all, and on a cache miss only one page of data is retrieved.
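The offset arithmetic and the cache key naming can be pulled out into one small helper, which makes them easy to check on their own. This is only a sketch: ckan_page_window() is a hypothetical name that does not exist in the module, and it assumes zero-based page numbers. Unlike the snippet above, it also normalises a missing page to 0, so the first page would be cached as ckan:all0.

```php
<?php
/**
 * Hypothetical helper (not part of the module): work out the cache ID,
 * offset and limit for one page of the package list.
 * Assumes page numbers are zero-based; a missing page is treated as 0.
 */
function ckan_page_window($page, $items_per_page) {
  $page = max(0, (int) $page);
  return array(
    'cid' => 'ckan:all' . $page,
    'offset' => $page * $items_per_page,
    'limit' => $items_per_page,
  );
}

// Third page (index 2) with four items per page.
$window = ckan_page_window(2, 4);
// $window['cid'] is 'ckan:all2' and $window['offset'] is 8.
```

The 'cid' value would be passed to cache_get() and cache_set(), and 'offset' and 'limit' to the search call.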
How does the cache get cleared/updated?
Datasets/Packages change all the time on the CKAN instance, so how do you make sure that the Drupal site has the most current data? This module has two ways of managing that.
1. Using hook_form to redirect to CKAN
As the CKAN nodes on Drupal are created on the fly and hold very little information, there is really no need to access the edit form for these nodes. Whenever an admin user clicks the edit tab on the node, they are automatically redirected to the appropriate CKAN package editing screen. hook_form is called to retrieve the form that is displayed when one attempts to “create/edit” an item. For CKAN content types, the user is redirected to the CKAN instance instead.
/**
 * Implementation of hook_form().
 *
 * Redirect the user to ca.ckan.net package edit screen on edit
 */
function ckan_form(&$node, $form_state) {
  if ($node->type == 'ckan') {
    drupal_goto('http://ca.ckan.net/package/edit/'. $node->body);
  }
}
When the CKAN form is submitted, CKAN redirects back to the Drupal site and calls a specific URL that tells Drupal to call CKAN again to get the package information and populate the node. To clarify, the process is:
- Redirect http://www.datadotgc.ca/node/X/edit to http://ca.ckan.net/package/edit/{name of X}
- On save of CKAN Package, redirect to http://www.datadotgc.ca/{special_url}/{name_of_X}
- Load the node with {name_of_X}
- Call CKAN to get the (updated) data for Package {name_of_X}
- Save the node with updated data
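The last three steps can be sketched as a single function. This is a rough illustration rather than the module's actual callback: ckan_example_refresh_package() and its three callable arguments are hypothetical stand-ins for the node lookup, the CKAN API call and the node save. It also assumes that the package has a 'title' field to copy into the node title, and that the package name lives in the node body as in ckan_form() above.

```php
<?php
/**
 * Hypothetical sketch of the refresh step (not the module's real code).
 * The callables stand in for the node lookup, CKAN request and node save.
 */
function ckan_example_refresh_package($name, $load_node, $fetch_package, $save_node) {
  // Load the node that represents the package called $name.
  $node = call_user_func($load_node, $name);
  if (!$node) {
    return FALSE;
  }
  // Ask CKAN for the (updated) package data.
  $package = call_user_func($fetch_package, $name);
  // Copy the fresh data onto the node. The field mapping is an assumption.
  $node->title = $package['title'];
  $node->body = $package['name'];
  // Save the node with the updated data.
  call_user_func($save_node, $node);
  return $node;
}

// Usage with stubbed callables instead of Drupal and CKAN.
$saved = array();
$node = ckan_example_refresh_package(
  'example-package',
  function ($name) { return (object) array('title' => 'Old title', 'body' => $name); },
  function ($name) { return array('name' => $name, 'title' => 'New title'); },
  function ($node) use (&$saved) { $saved[] = $node; }
);
```

Passing the three operations in as callables keeps the flow testable without a running Drupal site or a CKAN instance.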
2. Using Cron and an Atom Feed
CKAN provides an Atom feed of recent updates to the Packages. Cron checks this feed every time it runs. If the feed has changed since the last cron run, then we know there have been updates and we clear all of the caches.
/**
 * Implementation of hook_cron().
 */
function ckan_cron() {
  // Get the md5sum of the current Atom feed
  $current_feed = md5_file('http://ca.ckan.net/revision/list?format=atom');
  if ($current_feed === FALSE) {
    // The feed could not be read; leave the caches alone until the next run
    watchdog('ckan', 'Could not read the CKAN Atom feed');
    return;
  }
  watchdog('ckan', 'Current feed md5: '. $current_feed);
  // Retrieve the previously stored md5sum
  $previous_feed = variable_get('ckan_atom_feed_md5', $current_feed);
  watchdog('ckan', 'Previous feed md5: '. $previous_feed);
  // If there have been changes
  if ($current_feed != $previous_feed) {
    watchdog('ckan', 'ATOM feed has updated, clearing caches and deleting nodes');
    // Flush all the caches
    cache_clear_all('*', 'cache_ckan', TRUE);
    // Set the previous feed md5
    variable_set('ckan_atom_feed_md5', $current_feed);
  }
}
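Hashing the whole feed treats any byte-level difference as an update. One possible alternative, which is not what the module does, is to compare only the feed-level updated element that every Atom feed must carry. The sketch below assumes PHP's SimpleXML extension is available; ckan_example_feed_updated() and the sample feed are hypothetical.

```php
<?php
/**
 * Hypothetical helper (not part of the module): return the feed-level
 * <updated> value of an Atom document, or NULL if the XML cannot be
 * parsed or has no such element.
 */
function ckan_example_feed_updated($atom_xml) {
  // Collect parse errors internally instead of emitting PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $feed = simplexml_load_string($atom_xml);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  if ($feed === FALSE || !isset($feed->updated)) {
    return NULL;
  }
  return trim((string) $feed->updated);
}

// A made-up feed, for illustration only.
$sample = '<feed xmlns="http://www.w3.org/2005/Atom">'
  . '<title>Sample feed</title>'
  . '<updated>2010-05-01T12:00:00Z</updated>'
  . '</feed>';
$stamp = ckan_example_feed_updated($sample);
```

The returned string could be stored and compared with variable_get() and variable_set() in the same way as the md5 value.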
Tag cloud creation
I borrowed some code from the Tagadelic module to build the tag cloud:
/**
 * Build a tag cloud based on the settings provided
 *
 * @return String A themed list of weighted tags
 */
function ckan_tag_cloud() {
  // If there is cached data
  if (($cache = cache_get('ckan:tags', 'cache_ckan')) && !empty($cache->data)) {
    $results = unserialize($cache->data);
  }
  else {
    $ckan = ckan_ckan();
    $results = $ckan->getTagCount();
    watchdog('ckan', 'Called CKAN API for tag cloud');
    cache_set('ckan:tags', serialize($results), 'cache_ckan');
  }
  // Let's sort them by weight first off; each row is (tag name, count)
  $weight = array();
  foreach ($results as $key => $row) {
    $weight[$key] = $row[1];
  }
  array_multisort($weight, SORT_DESC, $results);
  // Now let's get the top X number of tags
  $results = array_slice($results, 0, variable_get('ckan_tagcloud_total', 40));
  // Now build the tags
  $tags = ckan_tag_build_weighted($results);
  // Sort them
  $tags = ckan_tag_sort($tags);
  // Theme them
  $output = theme('ckan_weighted_tags', $tags);
  return $output;
}

/**
 * Theme function that renders the HTML for the tags
 * @ingroup themable
 */
function theme_ckan_weighted_tags($tags) {
  $output = '';
  foreach ($tags as $tag) {
    $output .= l($tag['name'], 'data/tag/'. $tag['name'], array('attributes' => array('class' => "tagcloud level". $tag['weight'], 'rel' => 'tag'))) ." \n";
  }
  return $output;
}
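The weighting step, ckan_tag_build_weighted(), is not shown above. As a rough idea of what a Tagadelic-style weighting does, the sketch below maps each tag count onto a small number of levels using a logarithmic scale, so that a few very popular tags do not flatten everything else. The function name ckan_example_weight_tags(), the default of six levels and the sample tags are all assumptions made for illustration; the module's own function may differ.

```php
<?php
/**
 * Hypothetical sketch of log-scaled tag weighting (not the module's code).
 * Each input row is array(tag name, count); each output item carries
 * 'name', 'count' and a 'weight' between 1 and $steps.
 */
function ckan_example_weight_tags(array $rows, $steps = 6) {
  if (empty($rows)) {
    return array();
  }
  // Work on the logarithm of each count; counts below 1 are treated as 1.
  $logs = array();
  foreach ($rows as $key => $row) {
    $logs[$key] = log(max(1, $row[1]));
  }
  $min = min($logs);
  // Pad the range slightly so the largest tag lands on level $steps,
  // and avoid a division by zero when every count is the same.
  $range = max(0.01, max($logs) - $min) * 1.0001;
  $tags = array();
  foreach ($rows as $key => $row) {
    $tags[] = array(
      'name' => $row[0],
      'count' => $row[1],
      'weight' => 1 + (int) floor($steps * ($logs[$key] - $min) / $range),
    );
  }
  return $tags;
}

// Made-up tags with counts spread across two orders of magnitude.
$tags = ckan_example_weight_tags(array(
  array('health', 100),
  array('census', 10),
  array('maps', 1),
));
// The weights come out as 6, 3 and 1 respectively.
```

The weight is what ends up in the CSS class (level1 to level6 here), so the stylesheet decides how large each level is drawn.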
Using the CKAN Search API for all lists
Ok, so what’s this all about? CKAN has some nice API calls, like /api/rest/package, that return a list of Packages. However, these return the name/id of each Package ONLY. For our listings we wanted other data as well, such as the tags attached to the Package and a brief description.
The only way to get this data was to do a search API call /api/search/package and pass some extra parameters, in this case all_fields=1 and department={name of Ministry}.
all_fields=1 tells the search to return all Package fields, not just the name/id, just as if you had called /api/rest/package/PACKAGE-REF for each one.
department={name of Ministry} tells the search to return all packages that have a department of {name of Ministry}. The lovely folks at CKAN added this functionality for us on request.
What does this look like? It’s pretty simple really. Call the advancedSearch() function, pass it a few parameters and it returns all the data you need. Here’s the function itself:
public function advancedSearch($parameters) {
  // Build the query string from the key/value pairs
  $querystring = '';
  foreach ($parameters as $key => $value) {
    $querystring .= $key .'='. urlencode($value) .'&';
  }
  $results = $this->transfer('api/search/package?'. $querystring);
  // A count of zero (or no count at all) is treated as an error
  if (!$results->count) {
    throw new CkanException("Search Error");
  }
  return $results;
}
And here is that function being called for the list of Ministry Packages. The offset and limit are for the paging mechanism:
// Call the function
$results = $ckan->advancedSearch(array(
  'department' => $ministry,
  'all_fields' => '1',
  'offset' => $offset,
  'limit' => $limit,
));
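The query-string loop in advancedSearch() could also be written with PHP's built-in http_build_query(), which encodes each value the same way urlencode() does and leaves no trailing ampersand. This is a sketch of that alternative only: ckan_example_search_path() is a hypothetical helper and 'Health Canada' is just an illustrative Ministry name.

```php
<?php
/**
 * Hypothetical helper (not part of the module): build the search path and
 * query string for a set of search parameters.
 */
function ckan_example_search_path(array $parameters) {
  // Passing '&' explicitly avoids depending on the arg_separator.output ini setting.
  return 'api/search/package?' . http_build_query($parameters, '', '&');
}

$path = ckan_example_search_path(array(
  'department' => 'Health Canada',
  'all_fields' => '1',
  'offset' => 8,
  'limit' => 4,
));
// $path is 'api/search/package?department=Health+Canada&all_fields=1&offset=8&limit=4'
```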
There’s a lot more functionality in this module, more than I can go through in a blog post, even 5 posts. If you’re trying to integrate Drupal with a CKAN instance and are not sure where to start then please leave a comment and I’ll get back in touch.