Recently, I decided to learn lxml for parsing HTML files. There was one big problem, however: lxml on Debian etch was at version 1.1.1, which did not include the lxml.html module. At first I tried to use lxml.etree, but this seemed rather unwieldy, so I decided to get the newest version, 2.2.4. Sadly, there was no backport of 2.2.4, so I had to resort to setuptools, better known by its easy_install command.
First, I had to actually install a recent version of setuptools, which was in backports:
sudo apt-get install -t etch-backports python-setuptools
Next, I had to get a bunch of required packages in order to build lxml:
sudo apt-get install python2.4-dev gcc g++ libz-dev libxml2 libxml2-dev libxslt1.1 libxslt1-dev
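A quick pre-flight check can confirm that the build tools from the command above are actually on the PATH before starting the build. The following is only a sketch: the helper names (`find_on_path`, `missing_tools`) and the tool list are my own assumptions, chosen because `xml2-config` and `xslt-config` ship with libxml2-dev and libxslt1-dev. It sticks to syntax that should also work on Python 2.4.

```python
import os

# Assumed list of tools needed to build lxml from source. The *-config
# scripts are installed by libxml2-dev and libxslt1-dev respectively.
REQUIRED_TOOLS = ["gcc", "g++", "xml2-config", "xslt-config"]


def find_on_path(name, path=None):
    """Return the full path of an executable called `name`, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory, name)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate
    return None


def missing_tools(tools=REQUIRED_TOOLS, path=None):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if find_on_path(tool, path) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing build tools: %s" % ", ".join(missing))
    else:
        print("All build tools found.")
```

This only checks for executables, not for header files, so it would not have caught the missing-headers problem described next.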
Although lxml's site did not seem to list g++ as a requirement, I found that with gcc alone some required headers, such as limits.h and stdint.h, were missing. Installing the g++ package made these files available, most likely because it pulls in libc6-dev as a dependency. With everything in place, I first tested the build in my local directory, then finally ran easy_install as root:
sudo easy_install lxml
Verifying it actually installed:
$ python -c "import lxml; print lxml.__file__"
/usr/lib/python2.4/site-packages/lxml-2.2.4-py2.4-linux-i686.egg/lxml/__init__.pyc
$ python
Python 2.4.4 (#2, Oct 22 2008, 19:52:44)
[GCC 4.1.2 20061115 (prerelease) (Debian 4.1.1-21)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from lxml import html
/usr/lib/python2.4/site-packages/lxml-2.2.4-py2.4-linux-i686.egg/lxml/html/__init__.py:12: UserWarning: This version of libxml2 has a known XPath bug. Use it at your own risk.
  from lxml import etree
>>>
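The same verification can be scripted. The sketch below uses a hypothetical helper, `lxml_versions` (my name, not part of lxml), to report the version tuples that lxml.etree exposes, including the libxml2 version the warning above complains about. It returns None when lxml cannot be imported, and it avoids newer syntax so that it should also run on Python 2.4.

```python
def lxml_versions():
    """Return version tuples for lxml and its C libraries, or None if
    lxml cannot be imported."""
    try:
        from lxml import etree
    except ImportError:
        return None
    return {
        "lxml": etree.LXML_VERSION,
        "libxml2 (runtime)": etree.LIBXML_VERSION,
        "libxml2 (compiled against)": etree.LIBXML_COMPILED_VERSION,
        "libxslt (runtime)": etree.LIBXSLT_VERSION,
    }


if __name__ == "__main__":
    versions = lxml_versions()
    if versions is None:
        print("lxml is not importable")
    else:
        for name in sorted(versions):
            dotted = ".".join([str(part) for part in versions[name]])
            print("%s: %s" % (name, dotted))
```

Comparing the runtime and compiled-against libxml2 versions is a cheap way to see whether the egg was built against the same library it loads.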
Oh well, apparently my version of libxml2 sucks. Hopefully it still works, since I do not see a newer version of libxml2 available for etch. For now, to suppress this warning, I invoke Python in the following manner (note that this hides every UserWarning, not just this one):
python -W ignore::UserWarning
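A narrower alternative to the command-line switch above is to filter only this one message from inside the script, before lxml is imported. This is a minimal sketch using the standard `warnings` module; the function name `silence_libxml2_xpath_warning` is made up for illustration, and the message text is copied from the warning shown earlier.

```python
import warnings

# Text copied from the warning above. filterwarnings() treats it as a
# regular expression matched against the start of the warning message.
LIBXML2_XPATH_WARNING = "This version of libxml2 has a known XPath bug"


def silence_libxml2_xpath_warning():
    """Ignore only the libxml2 XPath UserWarning; others still show."""
    warnings.filterwarnings(
        "ignore",
        message=LIBXML2_XPATH_WARNING,
        category=UserWarning,
    )


# The filter must be installed before the import that triggers the warning.
silence_libxml2_xpath_warning()

try:
    from lxml import html
except ImportError:
    html = None  # lxml is not installed, so there is nothing to silence
```

The trade-off is that the filter has to be repeated in every script that imports lxml, whereas the `-W` switch needs no code changes.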