Parsing AJAX web pages using PyKHTML
I needed to parse data from a series of web pages. Usually I would have used cURL to download the page and then used regular expressions to extract the data I was interested in. But the page I wanted to parse used AJAX to reload part of the page (when you clicked 'next page') and did not provide a unique URL for each page, which made my regular method pretty useless (handling the AJAX requires a JavaScript interpreter).
A solution that seemed a bit more challenging than cURL and regular expressions, but which would be able to handle AJAX, was to program an existing web browser to visit the page and fetch the data from the browser itself. I had programmed with KHTML before, so I looked into whether this was possible and found that Paul Giannaros had already solved most of my problems. He had created PyKHTML:
PyKHTML is a Python module for writing website scrapers/spiders. Whereas traditional methods focus on writing the code to parse HTML/forms themselves, PyKHTML uses the excellent KHTML engine to do all the trudge work. It therefore handles web pages very well (even the severely crufty ones) and is pretty darn fast (implemented in C++). As a bonus, the module handles JavaScript and cookies transparently. Hurrah!
As an example for this post I decided to parse a digg article to find out who dugg it. To understand the code you should know that KHTML uses an event-driven programming model. This is my test program:
import sys
sys.path.append("..")

import pykhtml

# Setting debugWithGUI to True will show the KHTML browser in a window.
pykhtml.debugWithGUI = False

def processPage(browser, currentPage):
    # Check if the next button is loaded
    result = list(browser.document.getElementsByClass("nextprev", "a"))
    if len(result) < 1:
        print "Next button not loaded"
        pykhtml.timer(0.5, pykhtml.partial(processPage, browser, currentPage))
        return

    # Get next page
    nextprev = result[0]
    nextPage = int(nextprev['onclick'].split(",")[1])

    # Wait for the AJAX page reload to complete
    if currentPage == nextPage:
        pykhtml.timer(0.5, pykhtml.partial(processPage, browser, currentPage))
        return
    elif nextPage < currentPage:
        # No next page left: we are done
        pykhtml.stopEventLoop()
        return

    # Get users on the current page
    userListClass = list(browser.document.getElementsByClass("user-list", "ul"))
    userList = list(userListClass[0].getElementsByTagName("li"))
    for user in userList:
        userName = list(user.getElementsByTagName("img"))[0].attributes['alt']
        print currentPage, userName

    # Go to next page
    nextprev.click()
    processPage(browser, nextPage)

def main():
    url = "http://digg.com/linux_unix/Linux_tips_every_geek_should_know/who"
    browser = pykhtml.Browser()
    browser.load(url, lambda b: processPage(b, 1))
    pykhtml.startEventLoop()

if __name__ == "__main__":
    main()
As you can see, the first thing we do is create a PyKHTML browser and load the URL. load()
takes two arguments: the URL to load and a callback function that is executed when KHTML has finished loading the page. To be able to pass this function an argument, I construct a lambda function.
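The same argument-binding can also be done with pykhtml.partial, which the script uses for its timer callbacks. As a rough stand-alone sketch of the idea using plain Python's functools.partial (the on_loaded name here is made up; in the real script the callback is processPage):

```python
from functools import partial

def on_loaded(browser, page):
    # In the real script this would be processPage(browser, page).
    return (browser, page)

# Two equivalent ways to bind the extra 'page' argument so that the
# loader can invoke the callback with a single argument (the browser):
callback_a = lambda b: on_loaded(b, 1)
callback_b = partial(on_loaded, page=1)

print(callback_a("fake-browser"))  # ('fake-browser', 1)
print(callback_b("fake-browser"))  # ('fake-browser', 1)
```

Either form works; partial just avoids writing out the lambda.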
So when the page has been loaded, processPage()
is called. First we check whether the page has finished loading; otherwise we wait a little and try again. When the page has finished loading, it is time to access KHTML's DOM data. PyKHTML provides quite a few nifty functions for accessing the DOM, such as:
getElementsByClass()
getElementsByTagName()
getElementById()
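These accessors follow the familiar W3C DOM style. As a rough stand-alone analogy (not PyKHTML itself), Python's standard xml.dom.minidom offers a getElementsByTagName with the same shape, shown here on a simplified, made-up snippet of the user-list markup:

```python
from xml.dom import minidom

# Simplified stand-in for the digg user-list markup.
html = """
<ul class="user-list">
  <li><img alt="alice"/></li>
  <li><img alt="bob"/></li>
</ul>
"""

doc = minidom.parseString(html)
for li in doc.getElementsByTagName("li"):
    # Each list item holds the user's avatar; the alt text is the username.
    img = li.getElementsByTagName("img")[0]
    print(img.getAttribute("alt"))
```

This prints alice and bob, mirroring how the script above walks from the ul down to each img's alt attribute.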
Using these functions we easily get access to the data we are interested in. To go to the next page, we get the element with the 'nextprev' class and simply 'click' it by calling nextprev.click(). Then we make a recursive call to processPage() and process the next page. When the program terminates it will have listed everyone who dugg the article.
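The page number itself is pulled out of the link's onclick attribute. The real attribute's contents aren't shown in this post, but for the int(onclick.split(",")[1]) call in the script to succeed, the page number must sit in the second comma-separated field; a made-up example:

```python
# Hypothetical onclick value -- the actual digg markup may differ, but
# the parsing only needs the page number as the second comma field:
onclick = "pagination.show('who', 2, 25)"
next_page = int(onclick.split(",")[1])  # int() tolerates the leading space
print(next_page)  # 2
```

This kind of attribute scraping is fragile, of course: if digg changes the onclick format, the split index breaks.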
Amazing! THIS is the solution i’ve been seeking all this time!
I’m curious how well it handles pages rendered with JavaScript? Sometimes the JavaScript hides the data, so without actually parsing JS (horrible), the data is unviewable.
Would PyKHTML be powerful enough to accomplish that ?
Great tutorial! Thanks a lot!!!
So does this mean the days of using cURL and regexes are over?? I think this pretty much kills the rest of the methodologies out there.