I needed to parse data from a series of web pages, usually i would have used CURL to download the page and then used regular expressions to extract the data i was interested in. But the page i was going to parse was using AJAX to reload part of the page (when you clicked the ‘next page’) and did not provide a unique url to that page, which made my regular method pretty useless (since to use AJAX we need a Java Script Interpreter).
This is something I consider to be a general problem with AJAX pages, JavaScript can not change the url shown in the browser and it is therefore not possible to provide unique urls for AJAX calls (i.e. the next page link). Some sites, such as facebook, has solved this quite cleverly by modifying the part of the url after the hash (#) sign, which is possible to modify from JavaScript, to give users unique links to pages. I have been planning to use the same technique on the site I am currently working on, but I have not had time to implement it yet.
A solution to the problem that seemed a bit more challenging than using CURL and regular expressions, and which would be able to handle AJAX, was to program an existing web browser to visit the page and fetch the data from the web browser itself. I had programmed for KHTML before so i looked into if this was possible and found that Paul Giannaros had solved most of my problems. He had created PyKHTML:
PyKHTML is a Python module for writing website scrapers/spiders. Whereas traditional methods focus on writing the code to parse HTML/forms themselves, PyKHTML uses the excellent KHTML engine to do all the trudge work. It therefore handles web pages very well (even the severely crufty ones) and is pretty darn fast (implemented in C++). As a bonus, the module handles JavaScript and cookies transparently. Hurrah!
As an example for this post I decided to parse a digg article to find out who digged it. To understand the could you should know that KHTML uses a event driven programming model. This is my test program:
-
-
import sys
-
sys.path.append("..")
-
import pykhtml
-
-
# Setting debugWithGUI to true will give us the KHTML browser in a window.
-
pykhtml.debugWithGUI = False
-
-
def processPage(browser, currentPage):
-
# Check if the next button is loaded
-
result = list(browser.document.getElementsByClass("nextprev", "a"))
-
if (len(result) < 1):
-
print "Next button not loaded"
-
pykhtml.timer(0.5, pykhtml.partial(processPage, browser, currentPage))
-
return
-
-
# Get next page
-
nextprev = result[0]
-
nextPage = int(nextprev[‘onclick’].split(",")[1])
-
-
# Wait for ajax page reload to complete
-
if currentPage == nextPage:
-
pykhtml.timer(0.5, pykhtml.partial(processPage, browser, currentPage))
-
return
-
-
elif nextPage < currentPage:
-
pykhtml.stopEventLoop()
-
return
-
-
# Get users on current page
-
userListClass = list(browser.document.getElementsByClass("user-list", "ul"))
-
userList = list(userListClass[0].getElementsByTagName("li"))
-
-
for user in userList:
-
userName = list(user.getElementsByTagName("img"))[0].attributes[‘alt’]
-
print currentPage, userName
-
-
# Go to next page
-
nextprev.click()
-
processPage(browser, nextPage)
-
-
def main():
-
url = "http://digg.com/linux_unix/Linux_tips_every_geek_should_know/who"
-
browser = pykhtml.Browser()
-
browser.load(url, lambda b: processPage(b, 1))
-
pykhtml.startEventLoop()
-
return
-
-
if __name__ == "__main__":
-
main()
-
As you can see, the first thing we do is to create a PyHTML browser and loads the url. load() takes two arguments: the url to be loaded and a function pointer to the function that should be executed when KHTML has loaded the url. To be able to provide this function a argument, I construct a lambda function.
So when the page has been loaded, processPage() is called. First we check if the page has completed loading, otherwise we wait some. When the page has completed loading, it is time to access the KHTML DOM data. PyHTML provides us quite a few nifty functions to access the DOM of KHTML, such as:
getElementsByClass()getElementsByTagName()getElementById()By accessing these functions we easily get access to the data we are interested in. To go to the next page, we get the the element with of the ‘nextprev’ class, and simply ‘clicks’ it by calling nextprev.click(). Then we do a recursive call to proccessPage() and processes the next page. When the program has terminated it will have listed all people who digged the article.