Using Goodreads Data

So far, we have created a basic module that uses hook_block() to add block content, and we have installed that module. As it stands, however, the module does no more than display a few lines of static text.

In this section, we are going to extend the module's functionality. We will add a few new functions that retrieve and format data from Goodreads.

Goodreads makes data available in an XML format based on RSS 2.0. The XML content is retrieved over HTTP (Hypertext Transfer Protocol), the protocol that web browsers use to retrieve web pages. To enable this module to get Goodreads content, we will have to write some code to retrieve data over HTTP and then parse the retrieved XML.

Our first change will be to make a few modifications to goodreads_block().

Modifying the Block Hook

We could cram all of our new code into the existing goodreads_block() hook; however, this would make the function cumbersome to read and difficult to maintain. Rather than adding significant code here, we will just call another function that will perform another part of the work.

/**
* Implementation of hook_block
*/
function goodreads_block($op='list', $delta=0, $edit=array()) {
switch ($op) {
case 'list':
$blocks[0]['info'] = t('Goodreads Bookshelf');
return $blocks;
case 'view':
$url = 'http://www.goodreads.com/review/list_rss/' .'398385' .'?shelf=' .'history-of-philosophy';
$blocks['subject'] = t('On the Bookshelf');
$blocks['content'] = _goodreads_fetch_bookshelf($url);
return $blocks;
}
}

The preceding code should look familiar. This is our hook implementation as seen earlier in the chapter. However, we have made a few modifications, all of them in the 'view' case.

First, we have added a variable, $url, whose value is the URL of the Goodreads XML feed we will be using (http://www.goodreads.com/review/list_rss/398385?shelf=history-of-philosophy). In a completely finished module, we would want this to be a configurable parameter, but for now we will leave it hard-coded.
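
As a first step toward making the feed configurable, the URL could be assembled from its parts by a small helper. The following is only a sketch: the function name _goodreads_demo_feed_url() and its parameters are hypothetical and are not part of the module built in this chapter.

```php
<?php
/**
 * Build a Goodreads bookshelf feed URL.
 *
 * Hypothetical helper, shown for illustration only.
 *
 * @param $user_id
 *   String containing the Goodreads user ID.
 * @param $shelf
 *   String containing the name of the shelf.
 * @return
 *   String containing the feed URL.
 */
function _goodreads_demo_feed_url($user_id, $shelf) {
  // rawurlencode() keeps unusual characters from breaking the URL.
  return 'http://www.goodreads.com/review/list_rss/'
    . rawurlencode($user_id)
    . '?shelf=' . rawurlencode($shelf);
}

print _goodreads_demo_feed_url('398385', 'history-of-philosophy') . "\n";
```

With the two values used in this chapter, the helper produces the same URL that is hard-coded in goodreads_block().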

The second change has to do with where the module is getting its content. Previously, the function was setting the content to t('Temporary content'). Now it is calling another function: _goodreads_fetch_bookshelf($url).

The leading underscore here indicates that this function is a private function of our module—it is a function not intended to be called by any piece of code outside of the module. Demarcating a function as private by using the initial underscore is another Drupal convention that you should employ in your own code.

Let's take a look at the _goodreads_fetch_bookshelf() function.

Retrieving XML Content over HTTP

The job of the _goodreads_fetch_bookshelf() function is to retrieve the XML content using an HTTP connection to the Goodreads site. Once it has done that, it will hand over the job of formatting to another function.

Here's a first look at the function in its entirety:

/**
* Retrieve information from the Goodreads bookshelf XML API.
*
* This makes an HTTP connection to the given URL, and
* retrieves XML data, which it then attempts to format
* for display.
*
* @param $url
* URL to the goodreads bookshelf.
* @param $num_items
* Number of items to include in results.
* @return
* String containing the bookshelf.
*/
function _goodreads_fetch_bookshelf($url, $num_items=3) {
$http_result = drupal_http_request($url);
if ($http_result->code == 200) {
$doc = simplexml_load_string($http_result->data);
if ($doc === false) {
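// simplexml_load_string() returns false on failure instead of throwing
// an exception, so there is no exception message to log here.
$msg = "Error parsing bookshelf XML for %url.";
$vars = array('%url' => $url);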
watchdog('goodreads', $msg, $vars, WATCHDOG_WARNING);
return t("Getting the bookshelf resulted in an error.");
}
return _goodreads_block_content($doc, $num_items);
// Otherwise we don't have any data
}
else {
$msg = 'No content from %url.';
$vars = array('%url' => $url);
watchdog('goodreads', $msg, $vars, WATCHDOG_WARNING);
return t("The bookshelf is not accessible.");
}
}

Let's take a closer look.

Following the Drupal coding conventions, the first thing in the above code is an API description:

/**
* Retrieve information from the Goodreads bookshelf XML API.
*
* This makes an HTTP connection to the given URL, and retrieves
* XML data, which it then attempts to format for display.
*
* @param $url
* URL to the goodreads bookshelf.
* @param $num_items
* Number of items to include in results.
* @return
* String containing the bookshelf.
*/

This represents the typical function documentation block. It begins with a one-sentence overview of the function. This first sentence is usually followed by a few more sentences clarifying what the function does.

Near the end of the docblock, special keywords (preceded by the @ sign) are used to document the parameters and possible return values for this function.

  • @param: The @param keyword is used to document a parameter. It uses the following format: @param <variable name> <description>. The description should indicate what data type is expected in this parameter.
  • @return: This keyword documents what type of return value one can expect from this function. It follows the format: @return <description>.

This sort of documentation should be used for any module function that is not an implementation of a hook.
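
To see the same layout applied to a different function, here is a small sketch. The function _goodreads_demo_shorten_title() is hypothetical and exists only to show the overview sentence, the longer description, and the @param and @return keywords in place.

```php
<?php
/**
 * Shorten a book title for display.
 *
 * Titles longer than the given length are cut off and an
 * ellipsis is appended. Shorter titles are returned unchanged.
 *
 * @param $title
 *   String containing the full book title.
 * @param $max_length
 *   Integer maximum number of characters to keep.
 * @return
 *   String containing the (possibly shortened) title.
 */
function _goodreads_demo_shorten_title($title, $max_length = 20) {
  if (strlen($title) <= $max_length) {
    return $title;
  }
  return substr($title, 0, $max_length) . '...';
}

print _goodreads_demo_shorten_title('A Treatise Concerning the Principles', 10) . "\n";
```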

Now we will look at the function itself, starting with the first few lines.

function _goodreads_fetch_bookshelf($url, $num_items=3) {
$http_result = drupal_http_request($url);

This function expects as many as two parameters. The required $url parameter should contain the URL of the remote site, and the optional $num_items parameter should indicate the maximum number of items to be returned from the feed.

Note

We don't pass the $num_items parameter when we call _goodreads_fetch_bookshelf(), so the default value is used. This would also be a good thing to add to the module's configurable parameters.

The first thing the function does is use the Drupal built-in drupal_http_request() function found in the includes/common.inc library. This function makes an HTTP connection to a remote site using the supplied URL and then performs an HTTP GET request.

The drupal_http_request() function returns an object that contains the response code (from the server or the socket library), the HTTP headers, and the data returned by the remote server.

Note

Drupal is occasionally criticized for not using the object-oriented features of PHP. In fact, it does—but less overtly than many other projects. Constructors are rarely used, but objects are employed throughout the framework. Here, for example, an object is returned by a core Drupal function.

When the drupal_http_request() function has executed, the $http_result object will contain the returned information. The first thing we need to find out is whether the HTTP request was successful—whether it connected and retrieved the data we expect it to get.

We can get this information from the response code, which will be set to a negative number if there was a networking error, and set to one of the HTTP response codes if the connection was successful.

We know that if the server responds with the 200 (OK) code, it means that we have received some data.

Note

In a more robust application, we might also check for redirect messages (301, 302, 303, and 307) and other similar conditions. With a little more code, we could configure the module to follow redirects.
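
The cases described above can be sketched as a small classifier. This is an illustration only: _goodreads_demo_classify_code() is a hypothetical helper, and the stdClass object below is a stand-in for the object returned by drupal_http_request(), so that the sketch runs outside Drupal.

```php
<?php
/**
 * Classify a response code.
 *
 * Hypothetical helper, shown for illustration only.
 *
 * @param $code
 *   Integer response code, as found in $http_result->code.
 * @return
 *   String: 'network error', 'ok', 'redirect', or 'error'.
 */
function _goodreads_demo_classify_code($code) {
  if ($code < 0) {
    // Negative values signal a networking problem.
    return 'network error';
  }
  if ($code == 200) {
    return 'ok';
  }
  if (in_array($code, array(301, 302, 303, 307))) {
    return 'redirect';
  }
  return 'error';
}

// Stand-in for the object returned by drupal_http_request().
$http_result = new stdClass();
$http_result->code = 302;
$http_result->data = '';

print _goodreads_demo_classify_code($http_result->code) . "\n";
```

Our module collapses the last three outcomes into a single error case, which keeps the code short at the cost of not following redirects.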

Our simple module will treat any other response code as indicating an error:

if ($http_result->code == 200) {
// ...Process response code goes here...
// Otherwise we don't have any data
} else {
$msg = 'No content from %url.';
$vars = array( '%url' => $url );
watchdog('goodreads', $msg, $vars, WATCHDOG_WARNING);
return t("The bookshelf is not accessible.");
}

First let's look at what happens if the response code is something other than 200:

} else {
$msg = 'No content from %url.';
$vars = array( '%url' => $url );
watchdog('goodreads', $msg, $vars, WATCHDOG_WARNING);
return t("The bookshelf is not accessible.");
}

We want to do two things when a request fails: we want to log an error, and then notify the user (in a friendly way) that we could not get the content. Let's take a glance at Drupal's logging mechanism.

The watchdog() Function

Another important core Drupal function is the watchdog() function. It provides a logging mechanism for Drupal.

Note

Customize your logging

Drupal provides a hook (hook_watchdog()) that can be implemented to customize what logging actions are taken when a message is logged using watchdog(). By default, Drupal logs to a designated database table. You can view this log in the administration section by going to Administer | Logs.

The watchdog() function gathers all the necessary logging information and fires off the appropriate logging event.

The first parameter of the watchdog() function is the logging category. Typically, modules should use the module name (goodreads in this case) as the logging category. In this way, finding module-specific errors will be easier.

The second and third watchdog parameters are the text of the message ($msg above) and an associative array of data ($vars) that should be substituted into the $msg. These substitutions are done following the same translation rules used by the t() function. Just like with the t() function's substitution array, placeholders should begin with !, @, or %, depending on the level of escaping you need.

So in the preceding example, the contents of the $url variable will be substituted into $msg in place of the %url marker.
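
To make the substitution step concrete, here is a simplified stand-in that runs outside Drupal. It is not Drupal's implementation. The function name is hypothetical, and the escaping rules are assumptions made for this sketch: ! inserts the value as it is, @ escapes it, and % escapes it and wraps it in emphasis tags.

```php
<?php
/**
 * Substitute placeholders into a message.
 *
 * Simplified, hypothetical stand-in for the substitution that
 * t() and watchdog() perform.
 *
 * @param $msg
 *   String containing placeholders.
 * @param $vars
 *   Associative array mapping placeholders to values.
 * @return
 *   String with the placeholders replaced.
 */
function _goodreads_demo_substitute($msg, $vars) {
  $replacements = array();
  foreach ($vars as $key => $value) {
    switch ($key[0]) {
      case '@':
        // ASSUMPTION: escaped, with no extra markup.
        $replacements[$key] = htmlspecialchars($value, ENT_QUOTES, 'UTF-8');
        break;
      case '%':
        // ASSUMPTION: escaped and emphasized.
        $replacements[$key] = '<em>'
          . htmlspecialchars($value, ENT_QUOTES, 'UTF-8') . '</em>';
        break;
      default:
        // '!' placeholders: inserted without escaping.
        $replacements[$key] = $value;
    }
  }
  return strtr($msg, $replacements);
}

$msg = 'No content from %url.';
$vars = array('%url' => 'http://example.com/?a=1&b=2');
print _goodreads_demo_substitute($msg, $vars) . "\n";
```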

Finally, the fourth parameter in the watchdog() function is a constant that indicates the log message's priority, that is, how important it is.

There are eight different constants that can be passed to this function:

  • WATCHDOG_EMERG: The system is now in an unusable state.
  • WATCHDOG_ALERT: Something must be done immediately.
  • WATCHDOG_CRITICAL: The application is in a critical state.
  • WATCHDOG_ERROR: An error occurred.
  • WATCHDOG_WARNING: Something unexpected (and negative) happened, but didn't cause any serious problems.
  • WATCHDOG_NOTICE: Something significant (but not bad) happened.
  • WATCHDOG_INFO: Information can be logged.
  • WATCHDOG_DEBUG: Debugging information can be logged.

Depending on the logging configuration, not all these messages will show up in the log.

The WATCHDOG_ERROR and WATCHDOG_WARNING levels are usually the most useful for module developers to record errors. Most modules do not contain code significant enough to cause general problems with Drupal, and the upper three log levels (alert, critical, and emergency) should probably not be used unless Drupal itself is in a bad state.
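
The idea that only some messages reach the log can be sketched as a threshold check. The constants are defined here only so that the sketch runs outside Drupal. Their numeric values are an assumption that mirrors syslog-style numbering, where 0 is the most severe, and _goodreads_demo_should_log() is a hypothetical helper, not part of Drupal.

```php
<?php
// ASSUMPTION: syslog-style numbering, with 0 as the most severe level.
define('WATCHDOG_EMERG', 0);
define('WATCHDOG_ALERT', 1);
define('WATCHDOG_CRITICAL', 2);
define('WATCHDOG_ERROR', 3);
define('WATCHDOG_WARNING', 4);
define('WATCHDOG_NOTICE', 5);
define('WATCHDOG_INFO', 6);
define('WATCHDOG_DEBUG', 7);

/**
 * Decide whether a message is severe enough to record.
 *
 * Hypothetical helper, shown for illustration only.
 *
 * @param $severity
 *   Integer severity of the message.
 * @param $threshold
 *   Integer least severe level that should still be recorded.
 * @return
 *   Boolean TRUE if the message should be recorded.
 */
function _goodreads_demo_should_log($severity, $threshold) {
  // Lower numbers are more severe.
  return $severity <= $threshold;
}

// With a threshold of WATCHDOG_WARNING, warnings pass and debug output does not.
var_export(_goodreads_demo_should_log(WATCHDOG_WARNING, WATCHDOG_WARNING));
print "\n";
var_export(_goodreads_demo_should_log(WATCHDOG_DEBUG, WATCHDOG_WARNING));
print "\n";
```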

Note

There is an optional fifth parameter to watchdog(), usually called $link, which allows you to pass in an associated URL. Logging back ends may use that to generate links embedded within logging messages.

The last thing we want to do in the case of an error is return an error message that can be displayed on the site. This is simply done by returning a (possibly translated) string:

return t("The bookshelf is not accessible.");

We've handled the case where retrieving the data failed. Now let's turn our attention to the case where the HTTP request was successful.

Processing the HTTP Results

When the result code of our request is 200, we know the web transaction was successful. The content may or may not be what we expect, but we have good reason to believe that no error occurred while retrieving the XML document.

So, in this case, we continue processing the information:

if ($http_result->code == 200) {
// ... Processing response here...
$doc = simplexml_load_string($http_result->data);
if ($doc === false) {
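// simplexml_load_string() returns false on failure instead of throwing
// an exception, so there is no exception message to log here.
$msg = "Error parsing bookshelf XML for %url.";
$vars = array('%url' => $url);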
watchdog('goodreads', $msg, $vars, WATCHDOG_WARNING);
return t("Getting the bookshelf resulted in an error.");
}
return _goodreads_block_content($doc, $num_items);
// Otherwise we don't have any data
} else { // ... Error handling that we just looked at.

In the above example, we use the PHP 5 SimpleXML library. SimpleXML provides a set of convenient and easy-to-use tools for handling XML content. This library is not present in the now-deprecated PHP 4 language version.

For compatibility with outdated versions of PHP, Drupal code often uses the Expat parser, a venerable old event-based XML parser supported since PHP 4 was introduced. Drupal even includes a wrapper function for creating an Expat parser instance. However, writing the event handlers is time consuming and repetitive. SimpleXML gives us an easier interface and requires much less coding.

For an example of using the Expat event-based method for handling XML documents, see the built-in Aggregator module. For detailed documentation on using Expat, see the official PHP documentation: http://php.net/manual/en/ref.xml.php.

We will parse the XML using simplexml_load_string(). If parsing is successful, the function returns a SimpleXML object. However, if parsing fails, it will return false.

In our code, we check for a false. If one is found, we log an error and return a friendly error message. But if the Goodreads XML document was parsed properly, this function will call another function in our module, _goodreads_block_content(). This function will build some content from the XML data.
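
If we wanted the parser's own description of what went wrong, PHP's libxml error functions could collect it. The following is a minimal sketch that runs outside Drupal. The helper name _goodreads_demo_parse() is hypothetical, and the module in this chapter does not use this technique.

```php
<?php
/**
 * Parse an XML string, collecting parser messages instead of
 * letting PHP emit warnings.
 *
 * Hypothetical helper, shown for illustration only.
 *
 * @param $xml
 *   String containing XML.
 * @return
 *   Array with two entries: the SimpleXML object (or FALSE on
 *   failure) and an array of error message strings.
 */
function _goodreads_demo_parse($xml) {
  // Ask libxml to store errors rather than raise PHP warnings.
  $previous = libxml_use_internal_errors(TRUE);
  $doc = simplexml_load_string($xml);
  $messages = array();
  foreach (libxml_get_errors() as $error) {
    $messages[] = trim($error->message);
  }
  // Restore the previous error handling state.
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  return array($doc, $messages);
}

// The <channel> element is never closed, so parsing fails.
list($doc, $messages) = _goodreads_demo_parse('<rss><channel></rss>');
print ($doc === FALSE) ? "Parse failed\n" : "Parsed\n";
```

The collected messages could then be passed to watchdog() as a substitution value, in the same way that $url is.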

Formatting the Block's Contents

Now we are going to look at one more function—a function that extracts data from the SimpleXML object we have created and formats it for display.

The function we will look at here is basic and doesn't take advantage of the Drupal theming engine. Usually, formatting data for display is handled using the theming engine. Themes are the topic of our next chapter.

Here is our _goodreads_block_content() function:

/**
* Generate the contents of a block from a SimpleXML object.
* Given a SimpleXML object and the maximum number of
* entries to be displayed, generate some content.
*
* @param $doc
* SimpleXML object containing Goodreads XML.
* @param $num_items
* Number of items to format for display.
* @return
* Formatted string.
*/
function _goodreads_block_content($doc, $num_items=3) {
$items = $doc->channel->item;
$count_items = count($items);
$len = ($count_items < $num_items) ? $count_items : $num_items;
$template = '<div class="goodreads-item">'
.'<img src="%s"/><br/>%s<br/>by %s</div>';
// Default image: 'no cover'
$default_img = 'http://www.goodreads.com/images/nocover-60x80.jpg';
$default_link = 'http://www.goodreads.com';
$out = '';
$shown = 0;
foreach ($items as $item) {
// Stop once $len items have been formatted.
if ($shown++ >= $len) break;
$author = check_plain($item->author_name);
$title = strip_tags($item->title);
$link = check_url(trim($item->link));
$img = check_url(trim($item->book_image_url));
if (empty($author)) $author = 'Unknown';
if (empty($title)) $title = 'Untitled';
if (empty($link)) $link = $default_link;
if (empty($img)) $img = $default_img;
$book_link = l($title, $link);
$out .= sprintf($template, $img, $book_link, $author);
}
$out .= '<br/><div class="goodreads-more">'
. l('Goodreads.com', 'http://www.goodreads.com')
.'</div>';
return $out;
}

As with the last function, this one does not implement a Drupal hook. In fact, as the leading underscore (_) character should indicate, this is a private function, intended to be called only by other functions within this module.

Again the function begins with a documentation block explaining its purpose, parameters, and return value. From there, we begin the function:

function _goodreads_block_content($doc, $num_items=3) {
$items = $doc->channel->item;

The first thing the function does is get a list of <item/> elements from the XML data. To understand what is going on here, let's look at the XML (abbreviated for our example) returned from Goodreads:

<?xml version="1.0"?>
<rss version="2.0">
<channel>
<title>Matthew's bookshelf: history-of-philosophy</title>
<copyright>
<![CDATA[
Copyright (C) 2006 Goodreads Inc. All rights reserved.]]>
</copyright>
<link>http://www.goodreads.com/review/list_rss/398385</link>
<item>
<title>
<![CDATA[Thought's Ego in Augustine and Descartes]]>
</title>
<link>http://www.goodreads.com/review/show/6895959?utm_source=rss&amp;utm_medium=api</link>
<book_image_url>
<![CDATA[
http://www.goodreads.com/images/books/96/285/964285-s-1179856470.jpg
]]>
</book_image_url>
<author_name><![CDATA[Gareth B. Matthews]]></author_name>
</item>
<item>
<title>
<![CDATA[Augustine: On the Trinity Books 8-15 (Cambridge Texts
 in the History of Philosophy)]]>
</title>
<link>http://www.goodreads.com/review/show/6895931?utm_source=rss&amp;utm_medium=api</link>
<book_image_url>
<![CDATA[
http://www.goodreads.com/images/books/35/855/352855-s-1174007852.jpg
]]>
</book_image_url>
<author_name><![CDATA[Gareth B. Matthews]]></author_name>
</item>
<item>
<title>
<![CDATA[A Treatise Concerning the Principles of Human
 Knowledge (Oxford Philosophical Texts)]]>
</title>
<link>http://www.goodreads.com/review/show/6894329?utm_source=rss&amp;utm_medium=api</link>
<book_image_url>
<![CDATA[
http://www.goodreads.com/images/books/10/138/1029138-s-1180349380.jpg
]]>
</book_image_url>
<author_name><![CDATA[George Berkeley]]></author_name>
</item>
</channel>
</rss>

The above XML follows the familiar structure of an RSS document. The <channel/> contains, first, a list of fields that describes the bookshelf we have retrieved, and then a handful of <item/> elements, each of which describes a book from the bookshelf.

We are interested in the contents of <item/> elements, so we start off by grabbing the list of items:

$items = $doc->channel->item;

The SimpleXML $doc object has properties that point to each of its child elements. The <rss/> element (which is represented as $doc) has only one child: <channel/>. In turn, <channel/> has several child elements: <title/>, <copyright/>, <link/>, and several <item/> elements. These are represented as $doc->channel->title, $doc->channel->copyright, and so on.

What happens when there are several elements with the same name like <item/>?

They can be treated as an array. So in our code above, the variable $items will point to a list of <item/> elements that we can count and loop over.
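
The navigation described above can be tried out on a much smaller document. The XML below is made up for this sketch and is not real Goodreads output.

```php
<?php
// A tiny feed with the same shape as the Goodreads XML.
$xml = '<rss version="2.0">'
  . '<channel>'
  . '<title>Example bookshelf</title>'
  . '<item><title>First Book</title></item>'
  . '<item><title>Second Book</title></item>'
  . '</channel>'
  . '</rss>';

$doc = simplexml_load_string($xml);

// Children of <channel/> are reached through $doc->channel.
$shelf_title = (string) $doc->channel->title;

// Repeated elements can be counted and looped over.
$items = $doc->channel->item;
$titles = array();
foreach ($items as $item) {
  $titles[] = (string) $item->title;
}

print $shelf_title . ' (' . count($items) . ' items): '
  . implode(', ', $titles) . "\n";
```

The (string) casts convert SimpleXML objects into plain strings, which is what functions such as strip_tags() and trim() do implicitly when we hand them an element.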

Next, we determine how many items will be displayed, specify a basic template we will later use to create the HTML for our block, and set a few default values:

$count_items = count($items);
$len = ($count_items < $num_items) ? $count_items : $num_items;
$template = '<div class="goodreads-item">'
.'<img src="%s"/><br/>%s<br/>by %s</div>';
// Default image: 'no cover'
$default_img = 'http://www.goodreads.com/images/nocover-60x80.jpg';
$default_link = 'http://www.goodreads.com';

In the first two lines, we calculate $len, the number of items to display, so that we never use more than $num_items. Next, we assign the $template variable an sprintf() style template. We will use this to format our entries in just a moment.

Finally, we set default values for a placeholder cover image ($default_img) and a link back to Goodreads ($default_link).
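
The effect of $len is easiest to see with a feed that holds more entries than we want to show. The following sketch uses made-up XML and an index-based loop, which is one of several ways to stop after $len entries.

```php
<?php
// Four items, of which we only want the first three.
$doc = simplexml_load_string(
  '<rss><channel>'
  . '<item><title>One</title></item>'
  . '<item><title>Two</title></item>'
  . '<item><title>Three</title></item>'
  . '<item><title>Four</title></item>'
  . '</channel></rss>'
);
$items = $doc->channel->item;
$num_items = 3;

// Never show more items than exist, and never more than $num_items.
$count_items = count($items);
$len = min($count_items, $num_items);

$shown = array();
for ($i = 0; $i < $len; ++$i) {
  $shown[] = (string) $items[$i]->title;
}

print implode(', ', $shown) . "\n";
```

Here min() does the same job as the ternary expression in our function.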

Once this is done, we are ready to loop through the array of $items and generate some HTML:

$out = '';
$shown = 0;
foreach ($items as $item) {
// Stop once $len items have been formatted.
if ($shown++ >= $len) break;
$author = check_plain($item->author_name);
$title = strip_tags($item->title);
$link = check_url(trim($item->link));
$img = check_url(trim($item->book_image_url));
if (empty($author)) $author = 'Unknown';
if (empty($title)) $title = 'Untitled';
if (empty($link)) $link = $default_link;
if (empty($img)) $img = $default_img;
$book_link = l($title, $link);
$out .= sprintf( $template, $img, $book_link, $author);
}

Using a foreach loop, we go through the $items list, stopping once $len items have been formatted. Each of these items should look something like the following:

<item>
<title>
<![CDATA[Book Title]]>
</title>
<link>http://www.goodreads.com/something/</link>
<book_image_url>
<![CDATA[
http://www.goodreads.com/images/something.jpg
]]>
</book_image_url>
<author_name><![CDATA[Author Name]]></author_name>
</item>

We want to extract the title, link, author name, and an image of the book. We get these from the $item object:

$author = check_plain($item->author_name);
$title = strip_tags($item->title);
$link = check_url(trim($item->link));
$img = check_url(trim($item->book_image_url));

While we trust Goodreads, we do want to sanitize the data it sends us as an added layer of security. Above, we check the values of $author and $title with the functions check_plain() and strip_tags().

The strip_tags() function is built into PHP. It simply reads a string and strips out anything that looks like an HTML or XML tag. This provides a basic layer of security, since it would remove the tags that might inject a script, applet, or ActiveX object into our page. But this check does still allow HTML entities like &amp; or &raquo;.

Drupal contains several string encoding functions that provide different services than strip_tags(). Above, we use check_plain() to perform some escaping on $item->author_name. Unlike strip_tags(), check_plain() does not remove anything. Instead, it encodes HTML tags into entities (like the @ modifier in t() function substitutions). So check_plain('<em>Example</em>') would return the string &lt;em&gt;Example&lt;/em&gt;.

Note

The check_plain() function plays a very important role in Drupal security. It provides one way of avoiding cross-site scripting attacks (XSS), as well as insertion of malicious HTML.

There is a disadvantage to using check_plain(), though. If check_plain() encounters an HTML entity, like &lt;, it will encode it again. Thus, &lt; would become &amp;lt;. The initial ampersand (&) is encoded into &amp;.
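
The difference between the two approaches, including the double encoding, can be seen in a short sketch that runs outside Drupal. The function _goodreads_demo_check_plain() is a hypothetical stand-in. It assumes that check_plain() behaves like PHP's htmlspecialchars() for well-formed input, which is good enough to show the effect.

```php
<?php
/**
 * Stand-in for Drupal's check_plain().
 *
 * ASSUMPTION: for valid input, check_plain() behaves like
 * htmlspecialchars() with ENT_QUOTES.
 */
function _goodreads_demo_check_plain($text) {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Input containing both tags and an existing entity.
$raw = '<em>Example</em> &lt; done';

// strip_tags() removes the tags but leaves the entity alone.
$stripped = strip_tags($raw);

// The stand-in keeps everything: tags become entities, and the
// existing entity is encoded a second time.
$escaped = _goodreads_demo_check_plain($raw);

print $stripped . "\n";
print $escaped . "\n";
```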

With the $item->link and $item->book_image_url values, though, we have to do two things. First, we must trim() the results to remove leading or trailing whitespace. This is important because Drupal's l() function, which we will see in just a moment, will not process URLs correctly if they start with whitespace.

We also use Drupal's check_url() function to verify that the URL is legitimate. check_url() does a series of checks intended to catch malicious URLs. For example, it will prevent the javascript: protocol from being used in a URL. This we do as a safety precaution.
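
To give a feel for the kind of check involved, here is a very rough stand-in that runs outside Drupal. It is a sketch only: _goodreads_demo_safe_url() is hypothetical, it is far less thorough than check_url(), and returning an empty string for a rejected URL is a choice made for this sketch, not a description of how check_url() responds.

```php
<?php
/**
 * Trim a URL and reject it if it uses an unexpected protocol.
 *
 * Hypothetical helper, shown for illustration only.
 *
 * @param $url
 *   String containing the URL to check.
 * @param $allowed
 *   Array of allowed protocol names.
 * @return
 *   The trimmed URL if it is relative or uses an allowed protocol,
 *   otherwise an empty string.
 */
function _goodreads_demo_safe_url($url, $allowed = array('http', 'https')) {
  $url = trim($url);
  // A protocol is the text before the first colon, provided that
  // no '/', '?' or '#' comes before that colon.
  if (preg_match('/^([^\/?#:]*):/', $url, $matches)) {
    if (!in_array(strtolower($matches[1]), $allowed)) {
      return '';
    }
  }
  return $url;
}

var_export(_goodreads_demo_safe_url("  http://www.goodreads.com/something/ \n"));
print "\n";
var_export(_goodreads_demo_safe_url('javascript:alert(1)'));
print "\n";
```

In this sketch a rejected URL comes back empty, so an empty() test on the result would be enough to substitute a default.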

Next, we check each of the newly assigned variables. We want to make sure that if a variable is null or empty, it gets a default value.

if (empty($author)) $author = 'Unknown';
if (empty($title)) $title = 'Untitled';
if (empty($link)) $link = $default_link;
if (empty($img)) $img = $default_img;
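
The same pattern can be wrapped in a small helper to see how empty() behaves. The function _goodreads_demo_default() is hypothetical and is not used by the module.

```php
<?php
/**
 * Return a fallback when a value is empty.
 *
 * Hypothetical helper, shown for illustration only.
 *
 * @param $value
 *   The value to test.
 * @param $default
 *   The value to return if $value is empty.
 * @return
 *   Either $value or $default.
 */
function _goodreads_demo_default($value, $default) {
  return empty($value) ? $default : $value;
}

print _goodreads_demo_default('', 'Unknown') . "\n";
print _goodreads_demo_default('George Berkeley', 'Unknown') . "\n";
// Caveat: empty() also treats the string '0' as empty.
print _goodreads_demo_default('0', 'Untitled') . "\n";
```

The last call shows a limitation worth knowing about: a book whose title is the single character 0 would be displayed as Untitled.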

The last thing we do in this foreach loop is format the entry as HTML for display:

$book_link = l($title, $link);
$out .= sprintf( $template, $img, $book_link, $author);

First, we create a link to the book review page at Goodreads. This is done with Drupal's l() function (that's a single lowercase L). l() is another important Drupal function. This function creates a hyperlink. In the above code, it takes the book title ($title), and a URL ($link), and creates an HTML tag that looks like this:

<a
href="http://www.goodreads.com/review/show/6894329?utm_source=rss&amp;amp;utm_medium=api">
A Treatise Concerning the Principles of Human Knowledge (Oxford
Philosophical Texts)
</a>

That string is stored in $book_link. We then do the HTML formatting using a call to the PHP sprintf() function:

$out .= sprintf( $template, $img, $book_link, $author);

The sprintf() function takes a template ($template) as its first argument. We defined $template outside of the foreach loop. It is a string that looks as follows:

<div class="goodreads-item"><img src="%s"/><br/>%s<br/>by %s</div>

sprintf() will read through this string. Each time it encounters a placeholder, like %s, it will substitute in the value of an argument.

There are three string placeholders (%s) in the string. sprintf() will sequentially replace them with the string values of the three other parameters passed into sprintf(): $img, $book_link, and $author.

So sprintf() would return a string that looked something like the following:

<div class="goodreads-item">
<img src="http://www.goodreads.com/something.jpg"/>
<br/><a href="http://www.goodreads.com/somepath">
Thought's Ego in Augustine and Descartes
</a><br/>by Gareth B. Matthews</div>

That string is then added to $out. By the time the foreach loop completes, $out should contain a fragment of HTML for each of the entries we have formatted.

Note

The PHP sprintf() and printf() functions are very powerful, and can make PHP code easier to write, maintain, and read. View the PHP documentation for more information: http://php.net/manual/en/function.sprintf.php.
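
The template can be tried on its own, outside Drupal. In the following sketch the three values are made-up samples standing in for one feed entry.

```php
<?php
$template = '<div class="goodreads-item">'
  . '<img src="%s"/><br/>%s<br/>by %s</div>';

// Sample values standing in for one feed entry.
$img = 'http://www.goodreads.com/something.jpg';
$book_link = '<a href="http://www.goodreads.com/somepath">Book Title</a>';
$author = 'Author Name';

// Each %s is replaced, in order, by the next argument.
$html = sprintf($template, $img, $book_link, $author);
print $html . "\n";
```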

Once we are done with the foreach loop, we only have a little left to do. We need to add a link back to Goodreads to our $out HTML, and then we can return the output:

$out .= '<br/><div class="goodreads-more">'
. l('Goodreads.com', 'http://www.goodreads.com')
.'</div>';
return $out;
}

The block hook (goodreads_block()) will take the formatted HTML returned by _goodreads_block_content() and store it in the contents of the block. Drupal will display the results in the right-hand column, as we configured in the previous section:

The first three items in our Goodreads history of philosophy bookshelf are now displayed in a block in Drupal.

There is much that could be done to improve this module. We could add caching support, so that each request did not result in a new retrieval of the Goodreads XML. We could create additional security measures to check the XML content. We could add an administration interface that would allow us to set the bookshelf URL instead of hard coding the value in. We could also use the theming system to create the HTML and style it instead of hard coding HTML tags into our code.

In fact, in the next chapter, we will take a closer look at the theming system and see how this particular improvement could be made.

However, to complete our module, we need a finishing touch. We need to add some help text.