CacheHTML


: Q: Why should the server generate HTML from WikiSyntax every time the page is requested? Why not keep two versions on the server: wiki syntax and pre-rendered HTML? When the page is saved, the generated version is sent to the client and also stored on the server.

It's certainly possible to create an HTML cache; however, the problem becomes one of bookkeeping. If a page contains links to WantedPages, it will be cached with the NoSuchPageSyntax. If one of those wanted pages gets created in the meantime, the HTML cache becomes invalid. Similar problems occur if a referenced page is deleted.

CategoryWikiTechnology

Discussion

Solutions to this problem are varied. One could store the list of links (separated into "existing" and "non-existing") in the cache database, and do a quick check each time the cached HTML is fetched to see if the sets have changed. This need not be inefficient, because it may be a lot faster to search the PageDatabase for multiple titles all at once than to issue a separate query for each one. Even if the cache becomes invalidated, the translator can use this information to determine more efficiently what type of link to generate.
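A minimal sketch of that check, in Python. Everything here is an assumption made for illustration: the `CachedPage` record, the `is_fresh` function, and the page database modelled as a plain set of titles. A real wiki would run the equivalent of the two set operations as one batched query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CachedPage:
    html: str
    existing: frozenset  # link targets that existed when the HTML was rendered
    missing: frozenset   # link targets that were wanted pages at render time


def is_fresh(entry, page_db):
    """True if every link still has the status it had when the page was cached."""
    # Two set operations replace one lookup per link.
    return entry.existing <= page_db and entry.missing.isdisjoint(page_db)


pages = {"FrontPage", "CacheHTML"}
entry = CachedPage("<p>...</p>", frozenset({"FrontPage"}), frozenset({"WantedPage"}))
```

If `is_fresh` returns false, the stored sets still tell the translator which links were fine and which need a second look.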

Another solution is to maintain a BackLinkDatabase, including backlinks to wanted pages. When a page's status changes from existing to non-existing (or vice versa), simply invalidate the caches of all its backlinks. This has the additional advantage of speeding up backlink searches (and also making them accurate), as well as allowing more interesting graph analysis. However, the number of race conditions now grows as O(N^2), not to mention the amount of computation required on the server for each and every update. The result would likely be a less stable system, and certainly a slower one.
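For illustration only, the backlink approach could look something like this in Python. The `BacklinkIndex` class and its method names are made up for this sketch, both stores are in-memory dicts, and locking is left out entirely, which is exactly the part the paragraph above worries about.

```python
from collections import defaultdict


class BacklinkIndex:
    def __init__(self):
        self.backlinks = defaultdict(set)  # target title -> pages that link to it
        self.html_cache = {}               # page title -> cached HTML

    def record_render(self, page, html, link_targets):
        """Store rendered HTML and remember which titles the page links to."""
        # Simplification: old backlinks are never removed, which can only
        # cause extra invalidations, never missed ones.
        self.html_cache[page] = html
        for target in link_targets:
            self.backlinks[target].add(page)

    def status_changed(self, title):
        """Call when `title` is created or deleted; returns the pages invalidated."""
        stale = sorted(p for p in self.backlinks.get(title, ()) if p in self.html_cache)
        for page in stale:
            del self.html_cache[page]
        return stale
```

Only pages that actually link to the changed title lose their cache entry; everything else stays cached.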

: Sorry, I don't understand the part about race conditions:

Given a proper locking strategy no RCs should occur at all.
There exist very efficient locking algorithms for mostly-read accesses (like the list of existing pages) which incur almost no overhead in the common case (reading).
Any reasonable caching strategy would expunge stale cache contents within seconds of the creation of a new page.
Looking at http://sunir.org/apps/wanted.pl , when creating a new page at most eight cache entries would become stale.
Adding to that, since reading happens several orders of magnitude more often than creating new pages, I cannot see how using a (properly implemented) cache could make "a less stable system, and certainly a slower one".

The best practice is to store the links (or any other potentially mutable portion of the page) as an unparsed section that needs to be reparsed every time. Links don't take a long time to look up, so that won't be too costly. Sections like RssInclusion may be slower, but by their very definition they will of course be slow. Caching RSS would be useful as well.

: In effect, that would be two pass parsing:

first pass done on page save: "static" elements such as paragraphs, lists, bold, etc
second pass, done on each page view: elements that depend on the rest of the page database, i.e. the links

Implementations

OddMuse uses two strategies: Caching of partial HTML fragments, and support for HTTP/1.1 caching. [1]

Oddmuse uses a cache of HTML and raw text fragments. Assume the raw text is "This is a WikiLink." The parser will split this into three fragments. "<p>This is a " is cached as HTML, "WikiLink" is cached as raw text (and will be reparsed whenever the cache is used), and finally "." is cached as HTML. For every text formatting rule, Oddmuse knows whether the output can change in the future without the page itself changing. This is true for all sorts of local links, for example.

That's semi-caching, basically -- render text formatting into HTML and leave the wiki links that depend on the state of the entire database till later.
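As a rough Python sketch of this semi-caching idea (not Oddmuse's actual code, which is Perl): `first_pass` runs at save time and `second_pass` at view time. The function names, the simplified WikiWord regex, the URL layout and the edit-link format are all assumptions, and paragraph markup is left out to keep it short.

```python
import re
from html import escape

# Simplified WikiWord pattern: two or more capitalized runs, e.g. "WikiLink".
WIKI_WORD = re.compile(r"\b(?:[A-Z][a-z]+){2,}\b")


def first_pass(raw):
    """At save time: split text into ('html', ...) and ('raw', ...) fragments."""
    fragments, pos = [], 0
    for m in WIKI_WORD.finditer(raw):
        if m.start() > pos:
            fragments.append(("html", escape(raw[pos:m.start()])))
        fragments.append(("raw", m.group()))  # left unparsed on purpose
        pos = m.end()
    if pos < len(raw):
        fragments.append(("html", escape(raw[pos:])))
    return fragments


def second_pass(fragments, page_db):
    """At view time: only the raw fragments are checked against the page database."""
    out = []
    for kind, text in fragments:
        if kind == "html":
            out.append(text)
        elif text in page_db:
            out.append(f'<a href="/wiki/{text}">{text}</a>')
        else:
            out.append(f'{text}<a href="/wiki/{text}?action=edit">?</a>')
    return "".join(out)
```

For the text "This is a WikiLink." the first pass yields three fragments, and only the middle one is looked at again when the page is viewed.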

In addition to that, you can store the time of the last change to the page database somewhere, and send that in every response using the Last-Modified header. When the client (browser or cache) then requests a page a second time using an If-Modified-Since header, we can just reply 304 Not Modified if no edits have happened since then. See RFC 2616 for details.
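The decision itself is small. Here is a sketch in Python, assuming a hypothetical `conditional_get` function that is handed the time of the last database change (as a timezone-aware UTC datetime) and the raw If-Modified-Since value, if any. It returns a status code and headers rather than talking to a real server.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def conditional_get(last_change, if_modified_since=None):
    """Return (status, headers) given the time of the last page database change."""
    # HTTP dates have one-second resolution, so drop microseconds before comparing.
    last_change = last_change.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_change, usegmt=True)}
    if if_modified_since:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # unparseable header: fall through to a full response
        if since is not None and since.tzinfo is not None and last_change <= since:
            return 304, headers
    return 200, headers


changed = datetime(2024, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
status, headers = conditional_get(changed)
```

Because the timestamp covers the whole database, any edit anywhere invalidates every client copy. That is coarse, but it sidesteps the link bookkeeping problem.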

It'd also be possible to use 404 handlers to do HTML caching. Apache lets you define scripts as 404 handlers; some other Web servers do, too.

Have a directory where pages go (say, "/wiki/").
Map the 404 handler for that directory to a script (say, "404wiki").
A request comes in for a page (say, "/wiki/SomePage" or "/wiki/SomePage.html").
The page is missing, so the 404 handler gets called.
The 404 handler looks up the page in the database (of some kind). Some name munging may be necessary.
1. If it's there, the 404 handler renders the page, writes it to the directory, and then serves the page (with a 200, not a 404 response code).
2. If it's not there, the 404 handler redirects to an edit form, or provides an edit form itself.
After the 404 handler writes out the file, the Web server will handle requests for that page from the file instead of the 404 handler.
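The steps above can be sketched in Python as a plain function, leaving the Web server wiring aside. The name `handle_missing`, the dict used as the page database, the `render` callback and the "/edit?page=" URL are all hypothetical. The name check stands in for the "name munging" step.

```python
from pathlib import Path


def handle_missing(name, database, cache_dir, render):
    """Called only when the Web server found no cached file for `name`."""
    # Name munging: accept only plain alphanumeric titles, so a request
    # can never write outside the cache directory.
    if not name.isalnum():
        return 404, "Bad page name"
    source = database.get(name)
    if source is None:
        # Unknown page: send the visitor to an edit form (hypothetical URL).
        return 302, f"/edit?page={name}"
    html = render(source)
    # Write to a temporary file first, then rename, so the server never
    # serves a half-written page.
    target = Path(cache_dir) / f"{name}.html"
    tmp = target.with_suffix(".tmp")
    tmp.write_text(html, encoding="utf-8")
    tmp.replace(target)
    return 200, html
```

The returned status for an existing page is 200, not 404, matching step 1 above.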

Dynamic features, such as showing a user's user name and so forth, can be done with JavaScript (poking around in a cookie). Or, with some Web servers (like Apache), you can use rewrite rules on the presence of some cookie (e.g. a username cookie) to redirect the page. So not-logged-in users would get the cached stuff, and logged-in users would get piping-hot dynamically generated pages, fresh from the db and customized to their picayune UI requirements.

When a page is saved, it's only necessary to delete the file from the cache directory, as well as the files for any related pages (to update broken links, for example). The next request will trigger the 404 handler (file isn't there, it got deleted) which will re-render the page.
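A sketch of that save-time step in Python, assuming a hypothetical `invalidate` function and a cache directory holding one ".html" file per page. How the list of related pages is found (a backlink list, say) is outside this sketch.

```python
from pathlib import Path


def invalidate(cache_dir, saved_page, related_pages=()):
    """Delete cached files for the saved page and any related pages."""
    removed = []
    for name in (saved_page, *related_pages):
        path = Path(cache_dir) / f"{name}.html"
        try:
            path.unlink()
            removed.append(name)
        except FileNotFoundError:
            pass  # never cached, or already invalidated
    return removed
```

Nothing is re-rendered here. The next request for a deleted file simply falls through to the 404 handler.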

Note also that you can use Apache's (or other servers') MultiViews feature to leave off the ".html" at the end of the file name. You could even play around with serving XML+XSL (or XML+CSS) and HTML transparently. Note, too, that you don't have to worry about generating different URLs for existing and non-existing pages. OK, well, you have to show the links differently, but the URL you use can have the same form ("/wiki/ExistingPage", "/wiki/NonExistingPage").

The whole thing is predicated on the assumption that the Web server serving a static file will be much, much faster and lighter on resources than firing off any dynamic page stuff (CGI, PHP, ASP, whatever). This is usually the case. So it's worth doing.

The cache may grow pretty big. You can either have a scheduled task go in and reap the LRU pages, or the 404 handler can do that before writing out the page it's working on. The first is fast but potentially risky (the cache may overflow before the reaper gets to it); the second is safer but requires some housework on the part of the 404 generator, which should probably be heavily optimized. A paranoid might just want to reap everything in the cache every N minutes or so.
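A reaper along those lines might look like this in Python. The `reap` function and its byte budget are assumptions for illustration, and it relies on file access times, which some filesystems update lazily or not at all (noatime mounts, for example), so treat the ordering as approximate.

```python
from pathlib import Path


def reap(cache_dir, max_bytes):
    """Delete least-recently-accessed cache files until the total fits in max_bytes."""
    entries = []
    for path in Path(cache_dir).glob("*.html"):
        st = path.stat()
        entries.append((st.st_atime, st.st_size, path))
    total = sum(size for _, size, _ in entries)
    removed = []
    # Oldest access time first.
    for _, size, path in sorted(entries, key=lambda e: e[0]):
        if total <= max_bytes:
            break
        path.unlink(missing_ok=True)  # another reaper may have beaten us to it
        total -= size
        removed.append(path.name)
    return removed
```

The same function works for either variant: run it from a scheduled task, or call it from the 404 handler before writing a new file.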

You could extend this strategy for some page-info pages, like a PageHistory feature. Just map a different directory ("/history") for page histories, and have a separate 404 generator there. This means some more careful cache invalidation at edit time, but, hey, if you need some speed, it could aid significantly. --EvanProdromou

I like it. Not just for the speed, but because I like programs to use the filesystem as their data representation as much as possible (because that allows more unforeseen interoperability with standard tools). My only complaint is that you aren't really following the semantics of the idea of a missing page handler. I wonder if there is any "less surprising" way to do this? -- BayleShanks

: Less surprising to whom? The Web server software with the ErrorDocument directive? I think the software will get over its initial shock and learn to cope.

: Seriously, that's the only place there's a disconnect. I think that the only issue would be maybe changing the name of the directive to "MissingFile" or something. It's a remarkably efficient technique, and programs such as Vignette use it. I thought it up for PigdogJournal (http://www.pigdog.org/), where we have some code to do it, but I was told StoryServer does it, too. For that matter, it's probably patented... sigh. --EvanProdromou

I don't like the idea of using ErrorHandler for handling the cache. The server's error log would become useless, because it would be flooded with non-error data. No serious webadmin would let that happen.

ErrorHandler is for errors, and a cache miss is not a server-level error; it's rather a condition. You should simply use tools like mod_rewrite to handle that condition and check whether the file exists (or an entry in the database, and maybe some other conditions; you get more options and more control here).

Clearly, maintaining a list of backlinks is the best solution. With the help of a database it would be quick and simple. You just have to mark cached pages stale on edit, which is a comparatively rare occasion.

Remember you don't need to cache the whole page with headers, etc. You may cache only the wiki-parsed part and add headers (including customization for logged-in users) later. And don't use JavaScript for this. You put too much logic on the client side, and if you go too far with this, non-JS browsers and spiders may get a crippled page.

And for best performance you could use a caching gateway server (squid proxy?) on top of the inner wiki cache. It will be useful if you get a lot of reads by anonymous users (like Wikipedia). Of course, for an HTTP cache to work efficiently, proper HTTP headers need to be generated (last modified, expires, no cookies).

-- KornelLesinski

If you ask me, the cost in resources of generating the HTML from data with WikiLinks is lower than the cost in resources to store and create an HTML cache, plus the cost in development to design and code an HTML cache, plus the extra debugging and support afterwards, especially with web servers that have enough CPU power and memory and still a relatively slow read-from-disk speed (even if only a little). -- StijnSanders

MoinMoin compiles pages into Python byte code. The code consists of request.write("HTML") and formatter.dynamic_item(params) calls. This has sped up rendering by a factor >> 10 if you don't have expensive macros in the page. The implementation uses the separation between parser and formatter by implementing a kind of "meta formatter". See MoinMoin:MoinMoinIdeas/WikiApplicationServerPage for details.
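To make the idea concrete, here is a toy version in Python. It is not MoinMoin's code: the `Request` and `Formatter` classes, `compile_page`, `run`, and the link markup are stand-ins invented for this sketch. Only the shape of the generated calls (request.write and formatter.dynamic_item) follows the description above.

```python
def compile_page(fragments):
    """Turn ('html', text) / ('dynamic', params) fragments into a code object."""
    lines = []
    for kind, value in fragments:
        if kind == "html":
            lines.append(f"request.write({value!r})")
        else:
            lines.append(f"formatter.dynamic_item({value!r})")
    # Compiled once at save time; the code object is what gets cached.
    return compile("\n".join(lines) or "pass", "<cached page>", "exec")


class Request:
    def __init__(self):
        self.out = []

    def write(self, text):
        self.out.append(text)


class Formatter:
    def __init__(self, request, page_db):
        self.request = request
        self.page_db = page_db

    def dynamic_item(self, name):
        # Resolved on every view, against the current page database.
        cls = "existing" if name in self.page_db else "wanted"
        self.request.write(f'<a class="{cls}" href="/wiki/{name}">{name}</a>')


def run(code, page_db):
    """Execute a cached code object and return the rendered HTML."""
    request = Request()
    exec(code, {"request": request, "formatter": Formatter(request, page_db)})
    return "".join(request.out)
```

The static HTML is baked into the code object as string literals, so a page view costs a handful of function calls plus whatever the dynamic items need.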