================================
Caching RSS Feeds With feedcache
================================

.. note::

   This article was originally printed in Python Magazine Volume 1
   Issue 11, November, 2007.

The past several years have seen a steady increase in the use of RSS
and Atom feeds for data sharing. Blogs, podcasts, social networking
sites, search engines, and news services are just a few examples of
data sources delivered via such feeds. Working with internet services
requires care, because inefficiencies in one client implementation may
cause performance problems with the service that can be felt by all of
the consumers accessing the same server. In this article, I describe
the development of the **feedcache** package, and give examples of how
you can use it to optimize the use of data feeds in your application.

.. highlight:: python
   :linenothreshold: 10

I frequently find myself wanting to listen to one or two episodes from
a podcast, but not wanting to subscribe to the entire series. In order
to scratch this itch, I built a web based tool, hosted at
http://www.castsampler.com/, to let me pick and choose individual
episodes from a variety of podcast feeds, then construct a single feed
with the results. Now I subscribe to the single feed with my podcast
client, and easily populate it with new episodes when I encounter any
that sound interesting. The **feedcache** package was developed as
part of this tool to manage accessing and updating the feeds
efficiently, and has been released separately under the BSD license.

Example Feed Data
-----------------

The two most common publicly implemented formats for syndicating web
data are RSS (in one of a few versions) and Atom. Both formats have a
similar structure. Each feed begins with basic information about the
data source (title, link, description, etc.). The introductory
information is followed by a series of "items", each of which
represents a resource like a blog post, news article, or podcast. Each
item, in turn, has a title, description, and other information like
when it was written. It may also refer to one or more attachments, or
enclosures.

Listing 1 shows a sample RSS 2.0 feed and Listing 2 shows a sample
Atom feed. Each sample listing contains one item with a single podcast
enclosure. Both formats are XML, and contain essentially the same
data. They use slightly different tag names, though, and podcast
enclosures are handled differently between the two formats, which can
make working with different feed formats more work in some
environments. Fortunately, Python developers do not need to worry
about the differences in the feed formats, thanks to the Universal
Feed Parser.

Listing 1
~~~~~~~~~

.. literalinclude:: Listing1.xml
   :linenos:

Listing 2
~~~~~~~~~

.. literalinclude:: Listing2.xml
   :linenos:

Universal Feed Parser
---------------------

Mark Pilgrim's Universal Feed Parser is an open source module that
manages most aspects of downloading and parsing RSS and Atom feeds.
Once the feed has been downloaded and parsed, the parser returns an
object with all of the parsed data easily accessible through a single
API, regardless of the original feed format. Listing 3 shows a simple
example program for accessing feeds with **feedparser**. On line 9, a
URL from the command line arguments is passed to
``feedparser.parse()`` to be downloaded and parsed. The results are
returned as a FeedParserDict. The properties of the FeedParserDict can
be accessed via the dictionary API or using attribute names, as
illustrated on line 10.
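Listing 3 itself is pulled in from a separate file; for readers
without it, the heart of the pattern can be sketched in a few lines (a
minimal approximation of the listing, not the listing itself)::

    import sys

    import feedparser

    # Download and parse the feed named on the command line.
    parsed_feed = feedparser.parse(sys.argv[1])

    # FeedParserDict values can be read with attribute names
    # as well as through the dictionary API.
    for entry in parsed_feed.entries:
        print '%s: %s' % (parsed_feed.feed.title, entry.title)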
Listing 3
~~~~~~~~~

.. literalinclude:: Listing3.py
   :linenos:

When the sample program in Listing 3 is run with the URL for the feed
of **feedcache** project releases, it shows the titles for the
releases available right now:

::

    $ python Listing3.py http://feeds.feedburner.com/FeedcacheReleases
    feedcache Releases: feedcache 0.1
    feedcache Releases: feedcache 0.2
    feedcache Releases: feedcache 0.3
    feedcache Releases: feedcache 0.4
    feedcache Releases: feedcache 0.5

Every time the program runs, it fetches the entire feed, whether the
contents have changed or not. That inefficiency might not matter a
great deal for a client program that is not run frequently, but the
inefficiencies add up on the server when many clients access the same
feed, especially if they check the feed on a regular basis. This
inefficient behavior can become an especially bad problem for the
server if the feed contents are produced dynamically, since each
client incurs a certain amount of CPU, I/O, and bandwidth load needed
to produce the XML representation of the feed. Some sites are
understandably strict about how often a client can retrieve feeds, to
cut down on heavy bandwidth and CPU consumers. Slashdot, for example,
returns a special feed with a warning to any client that accesses
their RSS feed too frequently over a short span of time.

A Different Type of Podcast Aggregator
--------------------------------------

A typical aggregator design would include a monitor to regularly
download the feeds and store the fresh information about the feed and
its contents in a database. The requirements for CastSampler are a
little different, though. CastSampler remembers the feeds to which a
user has subscribed, but unlike other feed aggregators, it only
downloads the episode metadata while the user is choosing episodes to
add to their download feed.

Since the user does not automatically receive every episode of every
feed, the aggregator does not need to constantly monitor all of the
feeds. Instead, it shows a list of episodes for a selected feed, and
lets the user choose which episodes to download. Then it needs to
remember those selected episodes later so it can produce the combined
feed for the user's podcast client.

If every item from every feed was stored in the database, most of the
data in the database would be for items that were never selected for
download. There would need to be a way to remove old data from the
database when it expired or was no longer valid, adding to the
maintenance work for the site. Instead, CastSampler only uses the
database to store information about episodes selected by the user. The
rest of the data about the feed is stored outside of the database, in
a form that makes it easier to discard old data when the feed is
updated. This division eliminates a lot of the data management effort
behind running the site.

Feedcache Requirements
----------------------

An important goal for this project was to make CastSampler a polite
consumer of feeds, and ensure that it did not overload servers while a
user was selecting podcast episodes interactively. By caching the feed
data for a short period of time, CastSampler could avoid accessing
feeds every time it needed to show the feed data. A persistent cache,
written to disk, would let the data be reused even if the application
was restarted, such as might happen during development. Using a cache
would also improve responsiveness, since reading data from the local
disk would be faster than fetching the feed from the remote server.
To further reduce server load, **feedcache** is designed to take
advantage of the conditional GET features of HTTP, to avoid
downloading the full feed whenever possible.

Another goal was to have a small API for the cache. It should take
care of everything for the caller, so there would not need to be many
functions to interact with it. To retrieve the contents of a feed, the
caller should only have to provide the URL for that feed. All other
information needed to track the freshness of the data in the cache
would be managed internally.

It was also important for the cache to be able to store data in
multiple ways, to make it more flexible for other programmers who
might want to use it. Although CastSampler was going to store the
cache on disk, other applications with more computing resources or
tighter performance requirements might prefer to hold the cache in
memory. Using disk storage should not be hard coded into the cache
management logic.

These requirements led to a design which split the responsibility for
managing the cached data between two objects. The **Cache** object
tracks information about a feed so it can download the latest version
as efficiently as possible, only when needed. Persistent storage of
the data in the cache is handled by a separate back end storage
object. Dividing the responsibilities in this way maximizes the
flexibility of the **Cache**, since it can concentrate on tracking
whether the feed is up to date without worrying about storage
management. It also lets **Cache** users take advantage of multiple
storage implementations.

The Cache Class
---------------

Once the basic requirements and a skeleton design were worked out, the
next step was to start writing tests so the implementation of
**Cache** could begin. Working with a few simple tests would clarify
how a **Cache** user would want to access feeds. The first test was to
verify that the **Cache** would fetch feed data:

::

    import unittest, cache

    class CacheTest(unittest.TestCase):

        def testFetch(self):
            c = cache.Cache({})
            parsed_feed = c.fetch('http://feeds.feedburner.com/FeedcacheReleases')
            self.failUnless(parsed_feed.entries)

Since the design separated storage and feed management
responsibilities, it was natural to pass the storage handler to the
**Cache** when it is initialized. The dictionary API is used for the
storage because there are several storage options available that
support it. The **shelve** module in the Python standard library
stores data persistently using an object that conforms to the
dictionary API, as does the **shove** library from L. C. Rees. Either
library would work well for the final application. For initial
testing, using a simple dictionary to hold the data in memory was
convenient, since that meant the tests would not need any external
resources.

After constructing the **Cache**, the next step in the test is to
retrieve a feed. I considered using the ``__getitem__()`` hook, but
since **Cache** would not support any of the other dictionary methods,
I rejected it in favor of an explicit method, ``fetch()``. The caller
passes a feed URL to ``fetch()``, which returns a FeedParserDict
instance. Listing 4 shows the first version of the **Cache** class
that works for the test as it is written. No actual caching is being
done, yet. The **Cache** instance simply uses the **feedparser**
module to retrieve and parse the feed.
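Since Listing 4 is included from a separate file, here is a minimal
sketch of that first version as just described (an illustration of the
design, not the released code)::

    import feedparser

    class Cache:
        """Fetch feed data, eventually caching it locally."""

        def __init__(self, storage):
            # Any object supporting the dictionary API will do.
            self.storage = storage

        def fetch(self, url):
            # No caching yet: always download and parse the feed.
            return feedparser.parse(url)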
Listing 4
~~~~~~~~~

.. literalinclude:: Listing4.py
   :linenos:

Throttling Downloads
--------------------

Now that **Cache** could successfully download feed data, the first
optimization to make was to hold on to the data and track its age.
Then, for every call to ``fetch()``, **Cache** could first check to
see if fresh data was already available locally before going out to
the server to download the feed again. Listing 5 shows the version of
**Cache** with a download throttle, in the form of a
``timeToLiveSeconds`` parameter. Items already in the cache will be
reused until they are older than ``timeToLiveSeconds``. The default
value for ``timeToLiveSeconds`` means that any given feed will not be
checked more often than every five minutes.

Listing 5
~~~~~~~~~

.. literalinclude:: Listing5.py
   :linenos:

The new implementation of ``fetch()`` stores the current time along
with the feed data when the storage is updated. When ``fetch()`` is
called again with the same URL, the time in the cache is checked
against the current time to determine if the value in the cache is
fresh enough. The test on line 38 verifies this behavior by
pre-populating the **Cache**'s storage with data, and checking to see
that the existing cache contents are returned instead of the contents
of the feed.

Conditional HTTP GET
--------------------

Conditional HTTP GET allows a client to tell a server something about
the version of a feed the client already has. The server can decide if
the contents of the feed have changed and, if they have not, send a
short status code in the HTTP response instead of a complete copy of
the feed data. Conditional GET is primarily a way to conserve
bandwidth, but if the feed has not changed and the server's version
checking algorithm is efficient, then the server may use fewer CPU
resources to prepare the response as well.

When a server implements conditional GET, it uses extra headers with
each response to notify the client. There are two headers involved,
and the server can use either one or both together, in case the client
only supports one. **Cache** supports both headers.

Although timestamps are an imprecise way to detect change, since the
time on different servers in a pool might vary slightly, they are
simple to work with. The ``Last-Modified`` header contains a timestamp
value that indicates when the feed contents last changed. The client
sends the timestamp back to the server in the next request as
``If-Modified-Since``. The server then compares the dates to determine
if the feed has been modified since the last request from the client.

A more precise way to determine if the feed has changed is to use an
*Entity Tag* in the ``ETag`` header. An ``ETag`` is a hashed
representation of the feed state, or of a value the server can use to
quickly determine if the feed has been updated. The data and algorithm
for computing the hash is left up to the server, but it should be less
expensive than returning the feed contents, or there won't be any
performance gains. When the client sees an ``ETag`` header, it can
send the associated value back to the server with the next request in
the ``If-None-Match`` request header. When the server sees
``If-None-Match``, it computes the current hash and compares it to the
value sent by the client. If they match, the feed has not changed.

When using either ``ETag`` or modification timestamps, if the server
determines that the feed has not been updated since the previous
request, it returns a response code of ``304``, or "Not Modified", and
includes nothing in the body of the response. When it sees the ``304``
status in the response from the server, the client should reuse the
version of the feed it already has.
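**feedparser** exposes this exchange directly through the ``etag`` and
``modified`` arguments to ``parse()``. A short sketch, reusing the
release feed URL from earlier, shows the round trip (the second
request simply hands back whatever values the first response
supplied)::

    import feedparser

    url = 'http://feeds.feedburner.com/FeedcacheReleases'

    # The first response may carry ETag and Last-Modified values.
    first = feedparser.parse(url)

    # Returning those values lets the server answer with a bare 304.
    second = feedparser.parse(url,
                              etag=first.get('etag'),
                              modified=first.get('modified'),
                              )

    if second.get('status') == 304:
        # The feed has not changed; keep using the first copy.
        print 'cached copy is still good'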
Creating a Test Server
----------------------

In order to write correct tests to exercise conditional GET in
**feedcache**, more control over the server would be important. The
feedburner URL used in the earlier tests might be down, or return
different data if a feed was updated. It would be necessary for the
server to respond reliably with data the test code knew in advance,
and to be sure it would not stop responding if it was queried too
often by the tests. The tests also needed to control which of the
headers (``ETag`` or ``If-Modified-Since``) was used to determine if
the feed had changed, so both methods could be tested independently.
The solution was to write a small test HTTP server that could be
managed by the unit tests and configured as needed.

Creating the test server was easy, using a few standard library
modules. The test server code, along with a base class for unit tests
that use it, can be found in Listing 6. The ``TestHTTPServer`` (line
91) is derived from ``BaseHTTPServer.HTTPServer``. The
``serve_forever()`` method (line 112) has been overridden with an
implementation that checks a flag after each request to see if the
server should keep running. The test harness sets the flag to stop the
test server after each test. The ``serve_forever()`` loop also counts
the requests successfully processed, so the tests can determine how
many times the **Cache** fetches a feed.

Listing 6
~~~~~~~~~

.. literalinclude:: Listing6.py
   :linenos:

The test server processes incoming HTTP requests with
``TestHTTPHandler`` (line 24), derived from
``BaseHTTPServer.BaseHTTPRequestHandler``. ``TestHTTPHandler``
implements ``do_GET()`` (line 55) to respond to HTTP GET requests.
Feed data for the tests is hard coded in the ``FEED_DATA`` class
attribute (line 27). The URL path ``/shutdown`` is used to tell the
server to stop responding to requests. All other paths are treated as
requests for the feed data. The requests are processed by checking the
``If-None-Match`` and ``If-Modified-Since`` headers, and responding
either with a ``304`` status or with the static feed data.

``HTTPTestBase`` is a convenience base class to be used by other
tests. It manages a ``TestHTTPServer`` instance in a separate thread,
so the tests can all run in a single process. Listing 7 shows what the
existing tests look like, rewritten to use ``HTTPTestBase`` as a base
class. The only differences are the base class for the tests and the
use of ``self.TEST_URL``, which points to the local test server
instead of the **feedburner** URL from Listing 5.

Listing 7
~~~~~~~~~

.. literalinclude:: Listing7.py
   :linenos:

Implementing Conditional HTTP GET
---------------------------------

With these testing tools in place, the next step was to enhance the
**Cache** class to monitor and use the conditional HTTP GET
parameters. Listing 8 shows the final version of **Cache** with these
features. The ``fetch()`` method has been enhanced to send the
``ETag`` and modified time from the cached version of the feed to the
server, when they are available.

Listing 8
~~~~~~~~~

.. literalinclude:: Listing8.py
   :linenos:

The **FeedParserDict** object returned from ``feedparser.parse()``
conveniently includes the ``ETag`` and modified timestamp, if the
server sent them.
On lines 38-39, once the cached feed is determined to be out of date,
the ``ETag`` and modified values are retrieved so they can be passed
in to ``feedparser.parse()`` on line 42.

Since the updated client sends ``ETag`` and ``If-Modified-Since``
headers, the server may now respond with a status code indicating that
the cached copy of the data is still valid. It is no longer sufficient
to simply store the response from the server before returning it. The
status code must be checked, as on line 49, and if the status is
``304``, then the timestamp of the cached copy is updated. If the
timestamp was not updated, then as soon as the cached copy of the feed
exceeded the time-to-live, the **Cache** would request a new copy of
the feed from the server every time the feed was accessed. Updating
the timestamp ensures that the download throttling remains enforced.

Separate tests for each conditional GET header are implemented in
``CacheConditionalGETTest``. To verify that the **Cache** handles the
``304`` status code properly and does not try to update the contents
of the storage on a second fetch, these tests use a special storage
class. The **SingleWriteMemoryStorage** raises an ``AssertionError``
if a value is modified after it is set the first time. An
``AssertionError`` is used because that is how ``unittest.TestCase``
signals a test failure, and modifying the contents of the storage is a
failure for these tests.

Each test method of ``CacheConditionalGETTest`` verifies handling for
one of the conditional GET headers at a time. Since the test server
always sets both headers, each test clears one value from the cache
before making the second request. The remaining header value is sent
to the server as part of the second request, and the server responds
with the ``304`` status code.

Persistent Storage With shelve
------------------------------

All of the examples and tests so far have used in-memory storage
options. For CastSampler, though, the cache of feed data needed to be
stored on disk. As mentioned earlier, the **shelve** module in the
standard library provides a simple persistent storage mechanism. It
also conforms to the dictionary API used by the **Cache** class. Using
**shelve** by itself works in a simple single threaded case, but it
isn't clear from its documentation whether **shelve** supports write
access from multiple concurrent threads. To ensure the shelf is not
corrupted, a thread lock should be used. ``CacheStorageLock`` is a
simple wrapper around **shelve** that uses a lock to prevent more than
one thread from accessing the shelf simultaneously. Listing 9 contains
the code for the ``CacheStorageLock`` and a test that illustrates
using it to combine a **Cache** and **shelve**.

Listing 9
~~~~~~~~~

.. literalinclude:: Listing9.py
   :linenos:

The test ``setUp()`` method uses ``tempfile`` to create a temporary
filename for the cache. The temporary file has to be deleted in
``setUp()`` because if the file exists, but is empty, **shelve**
cannot determine which database module to use to open it. The
``test()`` method fetches the data from the server, then compares the
returned data with the data in the shelf to verify that they are the
same.

``CacheStorageLock`` uses a ``threading.Lock`` instance to control
access to the shelf. It only manages access for the methods known to
be used by **Cache**. The lock is acquired and released using the
``with`` statement, which was introduced in Python 2.5 and is enabled
by default starting with Python 2.6. Since this code was written with
Python 2.5, the module starts with a ``from __future__`` import
statement to enable the syntax for ``with``.
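Listing 9 lives in a separate file, but the heart of the wrapper, as
described, is small enough to sketch here (an illustration of the
idea, not the released code)::

    from __future__ import with_statement

    import threading

    class CacheStorageLock:
        """Serialize dictionary-style access to a shelf."""

        def __init__(self, shelf):
            self.lock = threading.Lock()
            self.shelf = shelf

        def __getitem__(self, key):
            with self.lock:
                return self.shelf[key]

        def __setitem__(self, key, value):
            with self.lock:
                self.shelf[key] = value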
Other Persistence Options
-------------------------

At any one time, **shelve** only allows one process to open a shelf
file for writing. In applications with multiple processes that need to
modify the cache, alternative storage options are desirable. **Cache**
treats its storage object as a dictionary, so any class that conforms
to the dictionary API can be used for back end storage. The **shove**
module, by L. C. Rees, uses the dictionary API and offers support for
a variety of back end storage options. The supported options include
relational databases, BSD-style databases, Amazon's S3 storage
service, and others.

The filesystem store option was particularly interesting for
CastSampler. With **shove**'s file store, each key is mapped to a
filename. The data associated with the key is pickled and stored in
the file. By using separate files, it is possible to have separate
threads and processes updating the cache simultaneously. Although the
**shove** file implementation doesn't handle file locking, for my
purposes it was unlikely that two threads would try to update the same
feed at the same time.

Listing 10 includes a test that illustrates using **shove** file
storage with **feedcache**. The primary difference between the APIs
for **shove** and **shelve** is the syntax for specifying the storage
destination. Shove uses a URL syntax to indicate which back end should
be used. The format for each back end is described in the docstrings.

Listing 10
~~~~~~~~~~

.. literalinclude:: Listing10.py
   :linenos:

Using feedcache With Multiple Threads
-------------------------------------

Up to this point, all of the examples have been running in a single
thread driven by the **unittest** framework. Now that integrating
**shove** and **feedcache** has been shown to work, it is possible to
take a closer look at using multiple threads to fetch feeds, and build
a more complex example application. Spreading the work of fetching
data across multiple processing threads is more complicated, but
yields better performance under most circumstances, because while one
thread is blocked waiting for data from the network, another thread
can take over and process a different URL.

Listing 11 shows a sample application which accepts URLs as arguments
on the command line and prints the titles of all of the entries in the
feeds. The results may be mixed together, depending on how the
processing control switches between active threads. This example
program is more like a traditional feed aggregator, since it processes
every entry of every feed.

Listing 11
~~~~~~~~~~

.. literalinclude:: Listing11.py
   :linenos:

The design uses queues to pass data between two different types of
threads working on the feeds. Multiple threads use **feedcache** to
fetch feed data. Each of these threads has its own **Cache**, but they
all share a common **shove** store. A single thread waits for feed
entries to be added to its queue, and then prints each feed title and
entry title.

The ``main()`` function sets up two different queues for passing data
in and out of the worker threads. The ``url_queue`` (lines 23-25)
contains the URLs for feeds, taken from the command line arguments.
The ``entry_queue`` (line 33) is used to pass feed content from the
threads that fetch the feeds to the thread that prints the results. A
**shove** filesystem store (line 36) is used to cache the feeds.
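Condensed into a few lines, that setup looks roughly like this. This
is a sketch based on the description above rather than the listing
itself, and it assumes **shove**'s ``file://`` URL form; the store
path matches the sample output shown below::

    import sys
    import Queue

    from shove import Shove

    # Feed URLs to be processed, taken from the command line.
    url_queue = Queue.Queue()
    for url in sys.argv[1:]:
        url_queue.put(url)

    # Carries feed entries from the fetching threads to the
    # single printing thread.
    entry_queue = Queue.Queue()

    # One shove store shared by all of the worker threads.
    storage = Shove('file:///tmp/feedcache_example')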
Once all of the worker threads are started (lines 40-51), the rest of
the main program simply waits for each stage of the work to be
completed by the threads. The last entries added to the ``url_queue``
are ``None`` values, which trigger the worker threads to exit. When
the ``url_queue`` has been drained (line 54), the worker threads can
be cleaned up. After the worker threads have finished, ``(None,
None)`` is added to the ``entry_queue`` to trigger the printing thread
to exit when all of the entries have been printed.

The ``fetch_urls()`` function (lines 70-85) runs in the worker
threads. It takes one feed URL at a time from the input queue,
retrieves the feed contents from a cache, then adds the feed entries
to the output queue. When the item taken out of the queue is ``None``
instead of a URL string, it is interpreted as a signal that the thread
should break out of its processing loop. Each thread running
``fetch_urls()`` creates a local **Cache** instance using a common
storage back end. Sharing the storage ensures that all of the feed
data is written to the same place, while creating a local **Cache**
instance ensures threads can fetch data in parallel.

The consumer of the queue of entries is ``print_entries()`` (lines
88-99). It takes one entry at a time from the queue and prints the
feed and entry titles. Only one thread runs ``print_entries()``, but a
separate thread is used so that output can be produced as soon as
possible, instead of waiting for all of the ``fetch_urls()`` threads
to complete before printing the feed contents.

Running the program produces output similar to the example in
Listing 3:

::

    $ python Listing11.py http://feeds.feedburner.com/FeedcacheReleases
    Saving feed data to /tmp/feedcache_example
    feedcache Releases: feedcache 0.1
    feedcache Releases: feedcache 0.2
    feedcache Releases: feedcache 0.3
    feedcache Releases: feedcache 0.4
    feedcache Releases: feedcache 0.5

The difference is that it takes much less time to run the program in
Listing 11 when multiple feeds are passed on the command line, and
when some of the data has already been cached.

Future Work
-----------

The current version of **feedcache** meets most of the requirements
for **CastSampler**, but there is still room to improve it as a
general purpose tool. It would be nice if it offered finer control
over the length of time data stays in the cache, for example. And,
although **shove** is a completely separate project, **feedcache**
would be more reliable if **shove**'s file storage used file locking,
to prevent corruption when two threads or processes write to the same
part of the cache at the same time.

Determining how long to hold the data in a cache can be a tricky
problem. With web content such as RSS and Atom feeds, the web server
may offer hints by including explicit expiration dates or caching
instructions. HTTP headers such as ``Expires`` and ``Cache-Control``
can include details beyond the ``Last-Modified`` and ``ETag`` values
already being handled by the **Cache**. If the server uses additional
cache headers, **feedparser** saves the associated values in the
**FeedParserDict**. To support the caching hints, **feedcache** would
need to be enhanced to understand the rules for the ``Cache-Control``
header, and to save the expiration time as well as the time-to-live
for each feed.

Supporting a separate time-to-live value for each feed would let
**feedcache** use a different refresh throttle for different sites.
Data from relatively infrequently updated feeds, such as Slashdot,
would stay in the cache longer than data from more frequently updated
feeds, such as a Twitter feed. Applications that use **feedcache** in
a more traditional way would be able to adjust the update throttle for
each feed separately to balance the freshness of the data in the cache
and the load placed on the server.
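To make the idea concrete, here is one purely hypothetical shape for
such an enhancement. It assumes the response headers are available
under a ``headers`` key in the **FeedParserDict**, and nothing in this
sketch is part of the current **feedcache** release::

    def time_to_live(parsed_feed, default_ttl=300):
        """Derive a per-feed TTL in seconds from server caching hints."""
        headers = parsed_feed.get('headers', {})
        cache_control = headers.get('cache-control', '')
        # Honor a Cache-Control max-age directive if one is present.
        for directive in cache_control.split(','):
            directive = directive.strip()
            if directive.startswith('max-age='):
                try:
                    return int(directive.split('=', 1)[1])
                except ValueError:
                    pass
        return default_ttl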
Conclusions
-----------

Original sources of RSS and Atom feeds are being created all the time,
as new and existing applications expose data for syndication. With the
development of mashup tools such as Yahoo! Pipes and Google's Mashup
Editor, these feeds can be combined, filtered, and expanded in new and
interesting ways, creating even more sources of data. I hope this
article illustrates how easy it is to build your own applications to
read and manipulate syndication feeds in Python with tools like
**feedparser** and **feedcache**, even while including features that
make your program cooperate with servers to manage load.

*I would like to offer a special thanks to Mrs. PyMOTW for her help
editing this article.*