Datadotgc.ca - A Drupal case study

We recently launched http://www.datadotgc.ca, an open data collection portal for Canada, built to help poke the Canadian government in the right direction, towards something like the equivalent sites in the UK (data.gov.uk) and the US (data.gov). Read David Eaves' explanation of its purpose. For the benefit of the programming and Drupal community, I'm going to run through, with the aid of code samples, the development of the Drupal module that communicates with the CKAN API (CKAN is where the data is stored). I'll also walk through theming, integration with Google Charts, tag clouds and, most importantly, caching.

What is CKAN?

CKAN is a registry or catalogue system for datasets or other "knowledge" resources. CKAN aims to make it easy to find, share and reuse open content and data, especially in ways that are machine automatable.

CKAN is a nice big database that is built to accept user input of the type of data we're trying to collect for datadotgc.ca. It has a slick front and back end that allows administrative access to the collected data.

You can find out more on their website.

CKAN's API

In order to utilize the power of CKAN I needed to link it up to Drupal. CKAN has a powerful and flexible API that I used extensively in the module.

The Foundation

Early on in the project I got in touch with the wonderful team at CKAN, who put me in touch with Sean Burlington from the data.gov.uk development team. They had also built their site in Drupal, and Sean had lots of information on how they tweaked their CKAN site to work with Drupal. He worked hard to open source some of the work they had done, and released it just in time for us to get started. Sean's module provided the basic API connectivity we needed and was the foundation for our module.

The Build

How do you integrate Drupal with the CKAN API? Let's start with the basics:

CKAN stores the individual datasets that you see on Datadotgc.ca as 'Packages'. It became clear that these 'Packages' could be directly mapped to the standard node architecture in Drupal. To achieve this I created a content type in the module that stored all the data I needed.

/**
 * Define module-provided node types.
 */
function ckan_node_info() {
  return array(
    'ckan' => array(
      'name'           => t('CKAN Package'),
      'module'         => 'ckan',
      'description'    => t('A package of Open Data.'),
      'has_title'      => TRUE,
      'title_label'    => t('Title'),
      'has_body'       => TRUE,
      'body_label'     => t('Package Description'),
      'min_word_count' => 0,
      'locked'         => TRUE,
    ),
  );
}

function ckan_create_node($ckan_data) {
  $node = array(
    'title'   => $ckan_data->title,
    'uid'     => 1,
    'body'    => $ckan_data->name,
    'promote' => 1,
    'path'    => 'dataset/' . $ckan_data->name,
    'type'    => 'ckan',
    'comment' => 2,
  );
}

As you can see from the code, the only data elements to be set when a node is created are Title, Body and Path. The body of the node is set to be the name of the CKAN package, which is in fact a simple string: geogratisnat_hydrography_v100.

The more complex CKAN data was not mapped to any CCK fields as you might think, but instead it is pulled from CKAN when the node is loaded. This simplifies the Drupal side of things by ensuring that we don't have to keep track of any changes to the structure or contents of the dataset that may happen on the CKAN side.

Here is an example of some package data:

[maintainer] => Government of Canada, Natural Resources Canada, Centre for Topographic Information (Sherbrooke)
[name] => 1996_population_census_data_canada
[author] => Government of Canada, Natural Resources Canada, Canada Centre for Remote Sensing, GeoAccess Division, The Atlas of Canada
[url] => ftp://ftp.geogratis.gc.ca/atlas/Population_Ecumene_Census/1996/
[notes] => The parts of Canada making up the 1996 Settled Area, (or Population Ecumene), represents a selection of the 5984 Census Subdivisions (CSD) as defined by Statistics Canada for the 1996 Census. The selection process essentially removes those CSDs with very large areas and/or very low populations. Some of British Columbia's CSD boundaries have been further modified to better conform to the distinctive settlement patterns in the Cordilleran regions. The 1996 Settled Area is an attempt to balance the needs of national scale choropleth mapping with the spatial reality that the majority of Canada's land area contains very few people. The Settled Area represents more than 98% of the Canadian population captured in the 1996 Census of Canada.
[title] => 1996 Population (Ecumene) Census Data, Canada
[download_url] => ftp://ftp.geogratis.gc.ca/atlas/Population_Ecumene_Census/1996/1996.zip
...

When a node is loaded, the package data is pulled from CKAN and then cached for later use.

/**
 * Load node-type-specific information
 */
function ckan_load($node) {
  $ckan = ckan_ckan();
  if (($cache = cache_get('ckan:' . $node->body, 'cache_ckan')) && !empty($cache->data)) {
    // Get the cached data
    $node->ckan = $cache->data;
  } else {
    try {
      // Call the API to get the package data
      $node->ckan = $ckan->getPackage($node->body);
      // Cache this package data for later use. This happens inside the try
      // block so that a failed API call is never written to the cache.
      cache_set('ckan:' . $node->body, $node->ckan, 'cache_ckan');
      watchdog('ckan', 'Called CKAN API for ' . $node->body . ' package - ckan_load()');
    } catch (Exception $e) {
      drupal_set_message($e->getMessage(), 'error');
    }
  }
  return $node;
}
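The load function above follows what is often called a cache-aside pattern: look in the cache first, call the API only on a miss, and store the result for next time. Here is a minimal, Drupal-free sketch of that idea. The `ArrayCache` class and `cached_fetch()` function are hypothetical stand-ins for `cache_get()`/`cache_set()` and are not part of the module.

```php
<?php
// Hypothetical in-memory stand-in for Drupal's cache_get()/cache_set().
class ArrayCache {
  private $items = array();

  public function get($key) {
    return array_key_exists($key, $this->items) ? $this->items[$key] : NULL;
  }

  public function set($key, $value) {
    $this->items[$key] = $value;
  }
}

// Return the cached value for $key. On a miss, call $fetch, cache the
// result and return it. If $fetch throws, nothing is cached.
function cached_fetch(ArrayCache $cache, $key, $fetch) {
  $value = $cache->get($key);
  if ($value !== NULL) {
    return $value;
  }
  $value = call_user_func($fetch, $key);
  $cache->set($key, $value);
  return $value;
}

// Usage: the (fake) API is only called once for two lookups.
$calls = 0;
$fetch = function ($key) use (&$calls) {
  $calls++;
  return array('name' => $key);
};
$cache = new ArrayCache();
$first = cached_fetch($cache, 'ckan:example_package', $fetch);
$second = cached_fetch($cache, 'ckan:example_package', $fetch);
```

Because the cache is only written after the fetch succeeds, an API error never ends up stored as if it were package data.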

Once the CKAN data has been added to the node object it's relatively easy to output this data in a node template. I created a template file in my theme called node-ckan.tpl.php (the usual per-content-type template name) and here's an example of how I displayed some of the CKAN package data in there:

<?php if ($title): ?>
 <h1 id="page-title" class="title tk-museo-slab"><?php print $title; ?></h1>
<?php endif; ?>

<?php if ($ckan->name): ?>
  <div class="package-name">(<?php print $ckan->name; ?>)</div>
<?php endif; ?>

<?php if ($ckan->url): ?>
 <div class="package-link"><?php print l($ckan->url, $ckan->url, array('attributes' => array('class' => 'link'))); ?></div>
<?php endif; ?>

You can see from all of the above examples I'm using a class object called "ckan" to store our data. This came from Sean's module and is a simple class that provides the connectivity to the CKAN API. Here's a brief synopsis of how it works:

  1. First I need a way to connect to the API. That's relatively straightforward using the curl libraries in php.

    $ch = curl_init($this->url . $url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_HEADER, 0);
    curl_setopt($ch, CURLOPT_TIMEOUT, 30);
    $result = curl_exec($ch);
    $info = curl_getinfo($ch);
    curl_close($ch);

  2. Now create some functions that allow various parts of the API to be called. Below is an example of two class functions which call the API.

    // Get an individual package
    public function getPackage($package) {
      $package = $this->transfer('api/rest/package/' . urlencode($package));
      if (!$package->name) {
        throw new CkanException("Package Load Error");
      }
      return $package;
    }

    // Get a list of all packages
    public function getPackageList() {
      $list = $this->transfer('api/rest/package/');
      if (!is_array($list)) {
        throw new CkanException("Package List Error");
      }
      return $list;
    }

All of these API calls return JSON, which is then decoded into a PHP object using the built-in function:

json_decode($results);
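To make the decoding step concrete, here is a small sketch of what could sit between the cURL call and the class functions above. The helper name `decode_ckan_response()` and the sample JSON are my own assumptions for illustration, not code from the module or real CKAN output.

```php
<?php
// Hypothetical helper: turn a raw HTTP response into a decoded PHP value,
// failing loudly on a non-200 status or a body that is not valid JSON.
function decode_ckan_response($http_code, $body) {
  if ($http_code != 200) {
    throw new RuntimeException('CKAN API returned HTTP ' . $http_code);
  }
  $decoded = json_decode($body);
  if ($decoded === NULL && json_last_error() !== JSON_ERROR_NONE) {
    throw new RuntimeException('CKAN API returned invalid JSON');
  }
  return $decoded;
}

// Made-up sample response, shaped like the package data shown earlier.
$body = '{"name": "example_package", "title": "Example Package", "tags": ["census", "maps"]}';
$package = decode_ckan_response(200, $body);

// JSON objects become stdClass instances and JSON arrays become PHP arrays.
print $package->title . "\n";
print implode(', ', $package->tags) . "\n";
```

Checking `json_last_error()` matters because `json_decode()` returns NULL both for invalid input and for the valid JSON literal `null`.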

Theming

Once that data has been successfully retrieved and decoded from JSON into something that's easier to handle, it needs to be themed. For this I created two theming functions; one for creating an individual list item, the other to create a formatted list of these individual items. I call the theme function whenever I get results from the API, and pass it the results, along with the title I want to appear on the listings page.

// Theme the results retrieved from API call
theme('ckan_results', $results, 'All Packages');

/**
 * Theme search results
 */
function theme_ckan_results($results, $title = '') {
  // Two global variables needed by the pager.
  // Taken from pager_query() in pager.inc
  global $pager_page_array, $pager_total;

  $output = '';

  // Grab the 'page' query parameter.
  // Taken from pager_query() in pager.inc
  $page = isset($_GET['page']) ? $_GET['page'] : '';

  // Convert comma-separated $page to an array, used by other functions.
  // Taken from pager_query() in pager.inc
  $pager_page_array = explode(',', $page);

  // Generate the data for the requested page and add it to the output.
  $items_per_page = variable_get('ckan_items_per_page', 4);
  // If there are fewer results than the specified number of items per page,
  // reset the number of items per page. The > 0 check avoids a division by
  // zero further down when there are no results at all.
  if ($results->count > 0 && $results->count < $items_per_page) {
    $items_per_page = $results->count;
  }

  // Initialize pager
  $start = 0;
  // If it's not the first page
  if ($page) {
    // Set the data to start displaying on the correct page
    $start = $page * $items_per_page;
  }

  if ($title) {
    $output = '<h1 id="page-title">' . $title . '</h1>';
  }

  $output .= '<h3 class="resultscount">Your search returned ' . $results->count . ' records</h3>';

  // Theme the individual results. The last page may hold fewer items than
  // $items_per_page, so never read past the end of the returned list.
  $shown = min($items_per_page, count($results->results));
  for ($i = 0; $i < $shown; $i++) {
    $output .= theme('ckan_item', $results->results[$i]);
  }

  // Put some magic in the two global variables
  // Based on code in pager_query() in pager.inc
  $total_results = $results->count;
  $pager_total[0] = ceil($total_results / $items_per_page);
  $pager_page_array[0] =
    max(0, min(
      (int)$pager_page_array[0],
      ((int)$pager_total[0]) - 1)
    );

  // Add the pager to the output.
  $output .= theme('pager', NULL, $items_per_page, 0);

  return $output;
}

/**
 * Theme individual search items
 */
function theme_ckan_item($item) {
  $output = '';
  // Link the title to the dataset
  $output .= '<h2>' . l($item->title, 'dataset/' . urlencode(check_plain($item->name))) . '</h2>';
  // Truncate the notes field and escape it before output
  if ($item->notes) {
    $output .= '<p>' . check_plain(truncate_utf8($item->notes, 250, FALSE, TRUE)) . '</p>';
  }
  // Output any tags
  if (count($item->tags) > 0) {
    $items = array();
    foreach ($item->tags as $key => $value) {
      $items[] = l($value, 'data/tag/' . $value);
    }
    $separated = implode(', ', $items);
    $output .= '<p><strong>Tags:</strong> ' . $separated . '</p>';
  }
  return $output;
}

One thing you need to consider when displaying a list of results is a pager, so that you can break the list into bite-sized chunks. This took quite a while to figure out. The problem was that the API call returned a lot of data, which caused a significant delay loading the page: there was the round trip to the API, plus the time taken to render all of that data in a pager. CKAN, however, is very clever. When you call the API and ask for a list of packages, it returns the packages, but it also returns the count of the records your query matched. You can also use the parameters 'offset' and 'limit', just like in SQL. What's even more clever is that the count still reflects the full result set, while only the packages selected by 'offset' and 'limit' are returned.

So if an API call to list all packages for a certain tag would normally return 200 records, and you specify a limit of 10 and an offset of 10, the data returned will contain a count of the number of records normally generated by that call, 200, but will only return 10 packages in the data, as specified by the offset and limit. This came in extremely useful for the pager as I just passed an offset and limit each time a page was loaded and then cached the returned data.

$ckan = ckan_ckan();
$start = 0;
$items_per_page = variable_get('ckan_items_per_page', 4);
if ($page) {
  // If we're in a page, we need to set where to start the list
  $start = $page * $items_per_page;
}

// Set the offset
$offset = $start;
// Limit to the number of items per page
$limit = $items_per_page;

// Get one page of packages for this ministry, along with the total count
try {
  $results = $ckan->advancedSearch(array('department' => $ministry, 'all_fields' => '1', 'offset' => $offset, 'limit' => $limit));
} catch (Exception $e) {
  drupal_goto(variable_get('ckan_no_results_page', 'sorry'));
}
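The offset and page-count arithmetic used above is easy to check in isolation. This is only a sketch: `ckan_page_window()` is a hypothetical helper of my own, not a function in the module, and it assumes zero-based page numbers as Drupal's pager uses.

```php
<?php
// Hypothetical helper: work out the offset/limit to send to the API and
// the number of pages to show, for a zero-based page number.
function ckan_page_window($page, $items_per_page, $total) {
  // Guard against a zero or negative page size.
  $items_per_page = max(1, (int) $items_per_page);
  // Always show at least one page, even for an empty result set.
  $pages = max(1, (int) ceil($total / $items_per_page));
  // Clamp the requested page into the valid range.
  $page = max(0, min((int) $page, $pages - 1));
  return array(
    'offset' => $page * $items_per_page,
    'limit'  => $items_per_page,
    'pages'  => $pages,
    'page'   => $page,
  );
}

// The example from the text: 200 matching records, 10 per page, second page.
$window = ckan_page_window(1, 10, 200);
print $window['offset'] . ' ' . $window['limit'] . ' ' . $window['pages'] . "\n";
```

For the example in the text this gives an offset of 10, a limit of 10 and 20 pages. Clamping the page number also means a hand-edited URL such as ?page=999 falls back to the last page instead of requesting an offset past the end of the data.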

Homepage Chart

There was a requirement for a graph on the homepage that displayed the number of packages attributed to each Government Ministry. The quickest way to do this was using Google Chart Tools. It was relatively straightforward to get the data we needed. I did, however, have to do some funky sorting to get the data in the correct order. I also found a wonderful tutorial that really helped to clear up some of the label/legend issues I was having.

/**
 * Function to build a Google Chart
 *
 * @return	  string	HTML code with img tag
 *
 **/
function ckan_chart() {
  // If there is a cached version of the chart
  if (($cache = cache_get('ckan:chart', 'cache_ckan')) && !empty($cache->data)) {
    $image = $cache->data;
  } else {
    watchdog('ckan', 'Called Google API to build chart');
    // Get the list of ministries
    $ministries = explode("\r\n", filter_xss(variable_get('ckan_ministry_list', '')));
    // Set up our chart object and its data array
    $chart = new stdClass();
    $chart->data = array();
    $ckan = ckan_ckan();
    foreach ($ministries as $ministry) {
      // Get the number of packages for this ministry
      try {
        $results = $ckan->advancedSearch(array('department' => $ministry, 'all_fields' => '0', 'offset' => '0', 'limit' => '1'));
        $count = $results->count;
      } catch (Exception $e) {
        $count = 0;
      }
      // Cache the count to use on the Ministry list page '/ministry'
      cache_set('ckan:ministry_' . $ministry . '_count', $count, 'cache_ckan');
      $chart->data[$ministry . ' (' . $count . ')'] = $count;
    }
    // Sort the array in reverse order - most packages first and maintain index association
    arsort($chart->data);
    // Return all the keys of the data array - the names of the ministries
    $chart->legend = array_keys($chart->data);
    // Get the range of the chart - highest + a quarter
    $range = round(current($chart->data) * 1.25, -1);
    // Grid spacing  100/MaxRange*IntervalAmount
    $grid = 100 / $range * 50;
    // Chart size, width x height must not exceed 300,000 pixels
    $chart->size = array(
      '590',
      '380'
    );

    // Create query
    $chart->query =
      'cht=bhg&' .	// Type
      'chd=t:' . implode(',', $chart->data) . '&' .	// Data
      'chs=' . $chart->size[0] . 'x' . $chart->size[1] . '&' .	// Size
      'chco=cc0000&' .	// Color ( Remove # from string )
      'chxt=x,y&' .	// X,Y axis labels
      'chxr=0,0,' . $range . '&' . // Range
      'chxs=1,000000,13|0,000000,13&' . 	// Axis colors and font size
      'chg=' . $grid . ',0,5,5&' . // Grid verticalgridlines, horizontalgridlines, linesize, gapsize
      'chds=0,' . $range . '&' .	// Scale
      'chma=0,0,0,0&' . // left_margin, right_margin, top_margin, bottom_margin| legend_width, legend_height
      'chbh=13,0,2&' .	// bar_width_or_scale, space_between_bars, space_between_groups
      'chxl=1:|' . implode('|', array_reverse($chart->legend, TRUE)) . '&'; // Y axis labels, e.g. |Jan|Feb|Mar|Apr|May

    $api_path = 'http://chart.apis.google.com/chart?';
    $url = $chart->query;
    $image = sprintf('<img src="%s" alt="%s" style="width:%spx;height:%spx;" />', $api_path . $url, 'Who\'s Sharing', $chart->size[0], $chart->size[1]);
    cache_set('ckan:chart', $image, 'cache_ckan');
  }
  return $image;
}
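The range and grid calculation in the chart function is the least obvious part, so here it is pulled out as a worked example. The function name `chart_scale()`, the minimum range of 10 and the sample counts are assumptions of mine for illustration; the 1.25 headroom factor and the 50-unit grid interval come from the code above.

```php
<?php
// Hypothetical helper: given the package counts, work out the axis range
// and the grid step (a percentage of the chart width) for the chg parameter.
function chart_scale(array $counts, $interval = 50) {
  $max = $counts ? max($counts) : 0;
  // Highest value plus a quarter, rounded to the nearest 10. The floor of
  // 10 is an added guard so an all-zero chart cannot divide by zero.
  $range = max(10, round($max * 1.25, -1));
  // One grid line every $interval units, expressed as a percentage.
  $grid = 100 / $range * $interval;
  return array('range' => $range, 'grid' => $grid);
}

// Made-up counts: the largest ministry has 160 packages.
$scale = chart_scale(array('Ministry A' => 160, 'Ministry B' => 40));
print $scale['range'] . ' ' . $scale['grid'] . "\n";
```

With a maximum of 160 the axis runs from 0 to 200, and a grid line every 50 packages works out to one line every 25% of the chart width.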

So that's a (not so) brief overview of some of the fundamentals of how I integrated Drupal with CKAN and was able to create nodes and listings directly from API calls.

In my next post I'll cover some very important areas of the module development such as:

  • Caching
  • Tag cloud creation
  • Using the CKAN Search API for all lists