There are two primary aspects to wiki searching. The first is BackLink searching. The current implementation in UseModWiki is really the easiest solution: just search for the title. However, it isn't strictly correct, for a couple reasons.
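For concreteness, here is a minimal sketch of that naive approach, assuming UseModWiki-style flat-file page storage (the ./pages path is hypothetical). One reason it isn't strictly correct: a plain substring match like this also hits mentions of the title that aren't actual links.

  #!/usr/bin/perl
  # Naive backlink search: grep every stored page for the literal title.
  use strict;
  use warnings;

  my $dir   = './pages';   # hypothetical flat-file page store
  my $title = shift or die "usage: $0 PageTitle\n";

  opendir(my $dh, $dir) or die "can't open $dir: $!";
  for my $page (grep { !/^\./ } readdir $dh) {
      open(my $fh, '<', "$dir/$page") or next;
      local $/;                          # slurp mode: read the whole page
      my $text = <$fh>;
      # \Q...\E quotes metacharacters, so this is a literal match,
      # not a regular expression.
      print "$page\n" if $text =~ /\Q$title\E/;
  }
  closedir $dh;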
The second is the more general case of text searching. Once again, UseModWiki is less than optimal because it allows RegularExpressions. Instead, limiting the search to word matching would improve our ability to index it.
However, implementing optimal searching is difficult. So, something easy would be nice.
I'm not sure regular expressions are very useful for things like this. At least, I've never used them (and wouldn't like to guess what the local syntax is). More useful would be combinations of words (AND, OR, NEAR, etc.) as provided by web search engines, which presumably would permit word-based indexing.
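To illustrate, here is a sketch of AND/OR word combinations over a word-based index. The word => { page => 1 } structure below is an assumption for illustration, not any wiki's actual format.

  #!/usr/bin/perl
  use strict;
  use warnings;

  # Toy index: word => set of pages containing it (assumed shape).
  my %index = (
      wiki   => { HomePage => 1, SandBox => 1 },
      search => { HomePage => 1 },
  );

  # AND: pages containing every query word.
  sub search_and {
      my %seen;
      my @words = grep { !$seen{$_}++ } map { lc } @_;   # dedupe query words
      my %hits;
      $hits{$_}++ for map { keys %{ $index{$_} || {} } } @words;
      return grep { $hits{$_} == @words } keys %hits;
  }

  # OR: pages containing any query word.
  sub search_or {
      my %hits;
      $hits{$_} = 1 for map { keys %{ $index{lc $_} || {} } } @_;
      return keys %hits;
  }

  print "AND: ", join(', ', search_and('wiki', 'search')), "\n";
  print "OR:  ", join(', ', sort(search_or('wiki', 'search'))), "\n";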
WardsWiki gets some of its speed by permitting its searches to use out-of-date data. This can be a problem. In particular, I sometimes find a new page without much content or meaning, and in order to evaluate it I want to find its context, ie the page that links to it. But I can't because the linking edit hasn't made it into the cache yet. To put it another way, it's precisely the new stuff that I want to search for. (This is mainly for backlinks, so special casing them might be enough.) -- DaveHarris
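One way to special-case it, sketched under assumptions about the file layout: answer the query from the stale index as usual, but additionally grep just the pages modified since the index was last built, so brand-new links still turn up.

  use strict;
  use warnings;

  my $dir        = './pages';                    # hypothetical page store
  my $index_time = (stat './index.db')[9] // 0;  # mtime of the last index build

  # Backlinks the stale index can't know about yet; union these
  # with the ordinary index results.
  sub fresh_backlinks {
      my ($title) = @_;
      my @hits;
      opendir(my $dh, $dir) or die "can't open $dir: $!";
      for my $page (grep { !/^\./ } readdir $dh) {
          my $file = "$dir/$page";
          next if (stat $file)[9] <= $index_time;   # unchanged since the build
          open(my $fh, '<', $file) or next;
          local $/;                                 # slurp the whole page
          push @hits, $page if <$fh> =~ /\Q$title\E/;
      }
      closedir $dh;
      return @hits;
  }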
The indexing method itself seems pretty clear. A number of "inverted index" solutions are available, developed and optimized for indexing large numbers (millions) of www pages. But keeping the index always current may be a problem, because a single new page can create a few hundred new index entries; under load this could cause performance problems. There are surely ways to avoid such bottlenecks, but as a UNIX newbie I don't yet have the profiling environment needed to attack that kind of optimization. -- HelmutLeitner
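A minimal build of such an inverted index, again assuming flat-file pages and a naive \w+ tokenizer:

  #!/usr/bin/perl
  use strict;
  use warnings;

  my $dir = './pages';   # hypothetical page store
  my %index;             # word => { page => 1 }

  opendir(my $dh, $dir) or die "can't open $dir: $!";
  for my $page (grep { !/^\./ } readdir $dh) {
      open(my $fh, '<', "$dir/$page") or next;
      while (my $line = <$fh>) {
          # Each distinct word on the page adds one posting, which is
          # where those "few hundred new index entries" come from.
          $index{ lc $1 }{$page} = 1 while $line =~ /(\w+)/g;
      }
  }
  closedir $dh;

  printf "%d distinct words indexed\n", scalar keys %index;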
A few hundred new index entries should be nothing as far as load goes, unless the index uses something really inefficient like LinkedLists?. For example, I have a program that indexes 116,778 words in about 0.75 seconds in real user time (written in C, of course), and even granting that wikis won't be running on a SunSparc? or written in C, a similar indexing scheme should give acceptable speed... I used a simple HashingFunction? that weights each character lexicographically and ignores 'e's and 't's.
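Daniel doesn't show his code, so the following is a hypothetical reconstruction of a hash fitting that description: weight each character by its position and skip 'e' and 't', presumably because they are the two most common letters in English.

  use strict;
  use warnings;

  # Hypothetical position-weighted hash; not DanielThomas's actual code.
  sub word_hash {
      my ($word, $buckets) = @_;
      my ($hash, $pos) = (0, 1);
      for my $ch (split //, lc $word) {
          next if $ch eq 'e' or $ch eq 't';   # ignore 'e's and 't's
          $hash += ord($ch) * $pos++;         # weight by character position
      }
      return $hash % $buckets;
  }

  print word_hash('MeatballWiki', 1024), "\n";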
The slowest part would be generating the initial index. Once that's done, each changed or new page would take only a few milliseconds to index.
Incremental updates are fairly simple: since the old version of the page is kept until the edit is finished, just find all the words no longer in the page and delete them, then add all the new words to the index...
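As a sketch, reusing the word => { page => 1 } index shape assumed above:

  use strict;
  use warnings;

  my %index;   # word => { page => 1 }, as in the build sketch above

  sub words_of {
      my ($text) = @_;
      my %words;
      $words{ lc $1 } = 1 while $text =~ /(\w+)/g;
      return %words;
  }

  sub reindex_page {
      my ($page, $old_text, $new_text) = @_;
      my %old = words_of($old_text);
      my %new = words_of($new_text);
      delete $index{$_}{$page} for grep { !$new{$_} } keys %old;   # words gone
      $index{$_}{$page} = 1    for grep { !$old{$_} } keys %new;   # words added
  }

  reindex_page('SandBox', 'the old text', 'the new text');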
If you're using a balanced binary tree, then adding a key to the index, finding and deleting it, or listing the pages that contain it would cost about O(log N), where N is the size of the index (a hash brings the average cost down to roughly O(1)).
Losing the RegularExpressions? would be a really good idea if you want even half-decent performance in key comparisons. -- DanielThomas
Wikis (at least this one) aren't big enough to justify large-scale disk-based indexing. Doing an empty search on MeatBall returns 859 pages (usemod returns only 174 *blink* I always assumed it was huge); that's probably _less_ data than my toy program was handling. There are about one to two million words in the English [1] language; that's probably only barely enough to justify the sort of indexing you're looking at... Wikis, as far as I can tell, aren't anywhere near big enough to justify the heavyweight indexing that search engines use for millions of sites.
I'm not too familiar with how far people want wikis to scale, so this is either a severe reality check for everyone else or me talking out my rear end. But given the size of MeatBall, a simple RAM-based index is a sensible option, since the more heavyweight options would bring in more overhead than they're worth.
What kind of backend does usemod have, anyway? I know it's Perl... but what's on the other end, a relational database? If it's MySQL, there's your problem right there. --DanielThomas
For a brief moment, I'll be pragmatic: if you want to scale UseModWiki that large, it's likely that you have capital backing the site. Buy indexing software, hack the Perl, move on with your life. Back to surreality, it's still fun to think about how to search a wiki efficiently. OK, to be honest, TheProgrammersBooklist derailed on the searching (which was the only thing really working at the time). So, I have a vendetta. -- SunirShah
Regular expressions (in Perl syntax) were used for the initial UseModWiki searching because they were a quick-and-easy addition. Unfortunately, most of the usefulness of the local regular expressions is negated because I do a case-insensitive match. Searching is high on my list of planned improvements. One near-term possibility is to add checkboxes for title and text searches--I like the look of HelmutLeitner's [Support Wiki] search (scroll to the bottom). Another likely addition will be options on the Preferences page for regular expressions (default OFF) and case-insensitive searching (default ON). I might also add an "Advanced Search" link, which could lead to something like the [C2 Search Helper]. Finally, for those who rarely search, I might add a preference to replace the search form on every page with a link to a search page. --CliffordAdams