Building an Apache XML/HTML Rewriting Stack
URL rewriting is not enough
Bucket Abuse
Apache has bucket brigades, which are essentially lists of output buffers, at the heart of its filtering architecture. The buckets are moved, brigade by brigade, through the output filters that manipulate them. The first idea is to put SAX events into those buckets. This is possible because a bucket morphs into plain text output when its read function is called. So there is a SAX filter that turns the outgoing bucket stream into a stream of SAX events. These can be rewritten by subsequent filters. Whatever happens, they morph into text before they finally reach the network.
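The idea can be modelled in a few lines of Python. This is only a sketch of the concept, not the Apache bucket API or the mod_xml2 code: the names SaxBucket, BrigadeBuilder, sax_filter, rename_filter and to_network are made up for illustration, and a plain Python list stands in for a brigade.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class SaxBucket:
    """Hypothetical stand-in for a SAX bucket: holds one event, reads as text."""

    def __init__(self, kind, name=None, attrs=None, text=None):
        self.kind = kind              # "start", "end" or "text"
        self.name = name
        self.attrs = dict(attrs or {})
        self.text = text

    def read(self):
        # Reading "morphs" the event into the text it stands for.
        if self.kind == "start":
            attrs = "".join(f" {k}={quoteattr(v)}" for k, v in self.attrs.items())
            return f"<{self.name}{attrs}>"
        if self.kind == "end":
            return f"</{self.name}>"
        return escape(self.text)


class BrigadeBuilder(xml.sax.ContentHandler):
    """Plays the role of the SAX filter: text in, list of event buckets out."""

    def __init__(self):
        super().__init__()
        self.brigade = []

    def startElement(self, name, attrs):
        self.brigade.append(SaxBucket("start", name, attrs=dict(attrs.items())))

    def endElement(self, name):
        self.brigade.append(SaxBucket("end", name))

    def characters(self, content):
        self.brigade.append(SaxBucket("text", text=content))


def sax_filter(document):
    builder = BrigadeBuilder()
    xml.sax.parseString(document.encode("utf-8"), builder)
    return builder.brigade


def rename_filter(brigade, old, new):
    """A later filter: rewrites events without parsing any text itself."""
    for bucket in brigade:
        if bucket.kind in ("start", "end") and bucket.name == old:
            bucket.name = new
    return brigade


def to_network(brigade):
    # The last stage only ever calls read() on each bucket.
    return "".join(bucket.read() for bucket in brigade)
```

Chaining the three stages, to_network(rename_filter(sax_filter(doc), "p", "div")), rewrites every p element to a div without any filter other than the first one ever seeing markup as text.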
First Try
This has been implemented in mod_xml2. The current problem is that modules that manipulate SAX buckets need to be written in C. The existing modules, mod_xi and mod_i18n, were too hard to write (and are currently not sufficiently maintained).
Plans for the Retry
My current plan (which is work in progress) is therefore to make SAX
buckets available to higher-level languages that have access to the
Apache API, namely Perl and Lua. Since this implies wrapping the SAX
events in an API that must then be made available to those languages,
I use libxml2 DOM nodes for this. These are already wrapped. Even more
important, they have a well-documented API for both languages.
The SAX buckets have been renamed to node buckets, since their
binary format is completely different and since they hold libxml2
nodes. The switch to node buckets also saves a lot of code in
mod_xml2: functionality already implemented in libxml2 does not need
to be reimplemented.
Parsing the outgoing XML runs the libxml2 tree builder with hooked
SAX handlers. Element nodes are removed from the tree in the
end-element handler; all other nodes are removed immediately. Node
buckets are shared buckets with reference counting. This is used to
have the start and end buckets of an element hold the same node. As a
result it is easy to rebuild the tree from the bucket stream, since
the start bucket already knows the end bucket.
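The sharing can be sketched in Python as follows. This is a model under assumptions, not the mod_xml2 implementation: NodeBucket, NodeBucketHandler, parse_to_buckets and rebuild_tree are hypothetical names, ElementTree elements stand in for libxml2 nodes, and Python's own object references stand in for the bucket reference counting.

```python
import xml.sax
import xml.etree.ElementTree as ET


class NodeBucket:
    """Hypothetical model of a node bucket: a role plus a shared node."""

    def __init__(self, role, node):
        self.role = role      # "start", "end" or "text"
        self.node = node      # start and end bucket of an element share this object


class NodeBucketHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.buckets = []
        self.open = []        # elements whose end tag has not been seen yet

    def startElement(self, name, attrs):
        node = ET.Element(name, dict(attrs.items()))
        self.open.append(node)
        self.buckets.append(NodeBucket("start", node))

    def endElement(self, name):
        # Reuse the very node object held by the matching start bucket.
        self.buckets.append(NodeBucket("end", self.open.pop()))

    def characters(self, content):
        self.buckets.append(NodeBucket("text", content))


def parse_to_buckets(document):
    handler = NodeBucketHandler()
    xml.sax.parseString(document.encode("utf-8"), handler)
    return handler.buckets


def rebuild_tree(buckets):
    """Rebuild the tree from a flat bucket stream."""
    root, stack = None, []
    for bucket in buckets:
        if bucket.role == "start":
            if stack:
                stack[-1].append(bucket.node)
            else:
                root = bucket.node
            stack.append(bucket.node)
        elif bucket.role == "end":
            # Identity check: the end bucket holds the node its start bucket opened.
            assert stack.pop() is bucket.node
        elif stack:
            parent = stack[-1]
            if len(parent):
                # Text after a child element goes into that child's tail.
                parent[-1].tail = (parent[-1].tail or "") + bucket.node
            else:
                parent.text = (parent.text or "") + bucket.node
    return root
```

Because both buckets point at one node, a filter that sees a start bucket can tell which later bucket closes it by comparing node identity, without counting nesting depth.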
Further Plans
libxml2 implements streaming XPath expressions, which allow matching
a very restricted subset of XPath while parsing. Using these, it
should be easy to implement filters that call a given callback,
passing the matched subtree as a parameter. The point is that only
these subtrees need to be built.
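A rough Python model of such a filter is shown below. It does not use libxml2's streaming matcher; the path matching is a deliberately tiny stand-in that only understands absolute child paths such as /feed/entry, and SubtreeMatcher and match_subtrees are hypothetical names. What it illustrates is the memory behaviour: a tree is built only while the parser is inside a match.

```python
import xml.sax
import xml.etree.ElementTree as ET


class SubtreeMatcher(xml.sax.ContentHandler):
    """Builds a tree only for elements whose path matches; the rest streams by."""

    def __init__(self, pattern, callback):
        super().__init__()
        # Assumption: only absolute child paths such as "/feed/entry".
        self.pattern = pattern.strip("/").split("/")
        self.callback = callback
        self.path = []
        self.builder = None   # active only while inside a matched subtree
        self.depth = 0        # nesting depth inside the matched subtree

    def startElement(self, name, attrs):
        self.path.append(name)
        if self.builder is None and self.path == self.pattern:
            self.builder = ET.TreeBuilder()
        if self.builder is not None:
            self.builder.start(name, dict(attrs.items()))
            self.depth += 1

    def characters(self, content):
        if self.builder is not None:
            self.builder.data(content)

    def endElement(self, name):
        self.path.pop()
        if self.builder is None:
            return
        self.builder.end(name)
        self.depth -= 1
        if self.depth == 0:
            # The matched element is complete: hand it over and drop the builder.
            subtree = self.builder.close()
            self.builder = None
            self.callback(subtree)


def match_subtrees(document, pattern):
    found = []
    handler = SubtreeMatcher(pattern, found.append)
    xml.sax.parseString(document.encode("utf-8"), handler)
    return found
```

With a document containing several entry elements under feed, match_subtrees(doc, "/feed/entry") returns one small element tree per entry, while the surrounding feed element and its other children are never materialised.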
Implementing KID-like template engines that execute <?perl and <?lua
processing instructions should also be doable.
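To give a feel for what such an engine does, here is a minimal Python sketch. It is an illustration only: it uses a hypothetical <?python target in place of <?perl and <?lua, treats the instruction body as a single expression, and the names PiTemplate and render are made up. Templates are assumed to be trusted, since their instructions are evaluated.

```python
import xml.sax
from xml.sax.saxutils import escape, quoteattr


class PiTemplate(xml.sax.ContentHandler):
    """Copies the template through and replaces <?python ...?> by its value."""

    def __init__(self, variables):
        super().__init__()
        self.variables = dict(variables)
        self.out = []

    def startElement(self, name, attrs):
        rendered = "".join(f" {k}={quoteattr(v)}" for k, v in attrs.items())
        self.out.append(f"<{name}{rendered}>")

    def endElement(self, name):
        self.out.append(f"</{name}>")

    def characters(self, content):
        self.out.append(escape(content))

    def processingInstruction(self, target, data):
        if target == "python":
            # Evaluate the instruction as one expression over the template
            # variables. This is not a sandbox: templates must be trusted.
            value = eval(data, {"__builtins__": {}}, self.variables)
            self.out.append(escape(str(value)))
        else:
            # Other instructions pass through unchanged.
            self.out.append(f"<?{target} {data}?>")


def render(template, **variables):
    handler = PiTemplate(variables)
    xml.sax.parseString(template.encode("utf-8"), handler)
    return "".join(handler.out)
```

For example, render('<p>Hello <?python name.upper()?>!</p>', name="world") fills in the expression result and escapes it, so values containing markup characters cannot break the output.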
Goals
My current project goals are to
- become usable,
- stay streaming and
- be libxml2ish.
The last one is because I like libxml2. It is highly useful for web
stuff because it can also parse HTML. It is also to justify the name.