source code bean

02 Mar, 2009

Parsing AJAX web pages using PyKHTML

Posted by: Peter In: Linux|Python

I needed to parse data from a series of web pages. Usually I would have used CURL to download each page and then regular expressions to extract the data I was interested in. But the page I was going to parse used AJAX to reload part of the page (when you clicked ‘next page’) and did not provide a unique URL for each page. That made my regular method pretty useless, since following the AJAX updates requires a JavaScript interpreter.
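
For comparison, the regular-expression half of that usual approach can be sketched roughly as below. This is only an illustration: the HTML snippet and the `extract_alt_texts` helper are made up for this example, and the download step (CURL) is left out so the snippet runs on its own.

```python
import re

# Hypothetical markup standing in for a page that was already downloaded.
SAMPLE_HTML = """
<ul class="user-list">
  <li><img src="/a.png" alt="alice"></li>
  <li><img src="/b.png" alt="bob"></li>
</ul>
"""

def extract_alt_texts(html):
    """Return the alt text of every <img> tag found in the HTML string."""
    return re.findall(r'<img\b[^>]*\balt="([^"]*)"', html)

print(extract_alt_texts(SAMPLE_HTML))
```

This only works when the markup is present in the downloaded HTML. Content that JavaScript inserts later never shows up in the string being searched.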

A solution that seemed a bit more challenging than CURL and regular expressions, but which could handle AJAX, was to program an existing web browser to visit the page and then fetch the data from the browser itself. I had programmed for KHTML before, so I looked into whether this was possible and found that Paul Giannaros had already solved most of my problems. He had created PyKHTML:

PyKHTML is a Python module for writing website scrapers/spiders. Whereas traditional methods focus on writing the code to parse HTML/forms themselves, PyKHTML uses the excellent KHTML engine to do all the trudge work. It therefore handles web pages very well (even the severely crufty ones) and is pretty darn fast (implemented in C++). As a bonus, the module handles JavaScript and cookies transparently. Hurrah!

As an example for this post I decided to parse a Digg article to find out who dugg it. To understand the code you should know that KHTML uses an event-driven programming model. This is my test program:

    import sys
    sys.path.append("..")  # also look for modules in the parent directory
    import pykhtml

    # Setting debugWithGUI to True will give us the KHTML browser in a window.
    pykhtml.debugWithGUI = False

    def processPage(browser, currentPage):
        # Check if the next button is loaded
        result = list(browser.document.getElementsByClass("nextprev", "a"))
        if len(result) < 1:
            print "Next button not loaded"
            # Try again in half a second
            pykhtml.timer(0.5, pykhtml.partial(processPage, browser, currentPage))
            return

        # Get next page (the second comma-separated field of the onclick handler)
        nextprev = result[0]
        nextPage = int(nextprev['onclick'].split(",")[1])

        # Wait for ajax page reload to complete
        if currentPage == nextPage:
            pykhtml.timer(0.5, pykhtml.partial(processPage, browser, currentPage))
            return

        # The link now points to an earlier page, so we are done
        elif nextPage < currentPage:
            pykhtml.stopEventLoop()
            return

        # Get users on current page
        userListClass = list(browser.document.getElementsByClass("user-list", "ul"))
        userList = list(userListClass[0].getElementsByTagName("li"))

        for user in userList:
            userName = list(user.getElementsByTagName("img"))[0].attributes['alt']
            print currentPage, userName

        # Go to next page
        nextprev.click()
        processPage(browser, nextPage)

    def main():
        url = "http://digg.com/linux_unix/Linux_tips_every_geek_should_know/who"
        browser = pykhtml.Browser()
        # The callback receives the browser once the page has loaded
        browser.load(url, lambda b: processPage(b, 1))
        pykhtml.startEventLoop()

    if __name__ == "__main__":
        main()

As you can see, the first thing we do is create a PyKHTML browser and load the URL. load() takes two arguments: the URL to be loaded and a callback function that is executed when KHTML has loaded the URL. To pass this callback an extra argument (the current page number), I wrap it in a lambda function.
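
The trick of handing an extra argument to a one-argument callback can be tried out without KHTML at all. The sketch below uses a made-up `fake_load()` in place of the real loader (it is not part of PyKHTML) and shows two common ways to bind the argument: a lambda, and `functools.partial` from the standard library.

```python
from functools import partial

def fake_load(url, callback):
    """Hypothetical stand-in for a loader that calls callback(browser) when done."""
    browser = {"url": url}  # placeholder for a real browser object
    callback(browser)

results = []

def record_page(browser, current_page):
    """Callback that needs one more argument than the loader supplies."""
    results.append((browser["url"], current_page))

# Option 1: a lambda that supplies the extra argument.
fake_load("http://example.com/", lambda b: record_page(b, 1))

# Option 2: functools.partial binds current_page up front.
fake_load("http://example.com/", partial(record_page, current_page=2))

print(results)
```

Both forms produce a callable that takes just the browser, which is all the loader knows how to pass.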

So when the page has loaded, processPage() is called. First we check whether the ‘next’ button is present, meaning the content we need has finished loading; if not, we set a timer and try again half a second later. Once it is there, it is time to access the KHTML DOM data. PyKHTML provides quite a few nifty functions for accessing the KHTML DOM, such as:

  • getElementsByClass()
  • getElementsByTagName()
  • getElementById()

Using these functions we easily get access to the data we are interested in. To go to the next page, we get the element with the ‘nextprev’ class and simply ‘click’ it by calling nextprev.click(). Then we make a recursive call to processPage() to process the next page. When the program terminates, it will have listed all the people who dugg the article.
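
The wait-and-retry step (check, set a timer, return, check again) is the part that tends to feel unfamiliar in event-driven code, so here is a rough stand-alone sketch of it. It uses the standard library `sched` module instead of the KHTML event loop, and the `onclick` strings and the `showPage` name are invented for illustration; the real attribute on the site may look different.

```python
import sched
import time

scheduler = sched.scheduler(time.monotonic, time.sleep)

# Hypothetical onclick value of the 'next' link; the real attribute may differ.
link = {"onclick": "showPage('who',2,'next')"}
log = []

def next_page_number():
    """Read the page number from the second comma-separated field."""
    return int(link["onclick"].split(",")[1])

def finish_reload():
    """Simulate the AJAX reload completing: the link now targets page 3."""
    link["onclick"] = "showPage('who',3,'next')"

def wait_for_page(current_page):
    if next_page_number() == current_page:
        # Reload still pending: check again in 10 ms and return at once.
        log.append("waiting")
        scheduler.enter(0.01, 1, wait_for_page, (current_page,))
        return
    log.append("page %d ready" % current_page)

scheduler.enter(0.03, 1, finish_reload)     # the 'click' takes effect later
scheduler.enter(0, 1, wait_for_page, (2,))  # start polling for page 2
scheduler.run()                             # returns when no events remain
print(log[0], log[-1])
```

The function never blocks while waiting. It schedules its own next check and returns, which leaves the event loop free to run the code that eventually changes the page.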



    2 Responses to "Parsing AJAX web pages using PyKHTML"

    1 | bob

    May 16th, 2009 at 3:33 am

    Amazing! THIS is the solution I’ve been seeking all this time!

    I’m curious how well it handles pages rendered with JavaScript. Sometimes the JavaScript hides the data, so without actually parsing JS (horrible), the data is unviewable.

    Would PyKHTML be powerful enough to accomplish that ?

    Great tutorial! Thanks a lot!!!

    2 | bob

    May 16th, 2009 at 3:45 am

    So does this mean the days of using CURL and regex are over? I think this pretty much kills the rest of the methodologies out there.
