OptimalSearching


As the PageDatabase grows ever larger, search performance on the site drops considerably. Moreover, the load on the server is excessive, as it is a shared server. We're nearing the time when the searching will need to be optimized (or a new machine has to be purchased!). Either way, searching is a topic of interest to online communities in general.

There are two primary aspects to wiki searching. The first is BackLink searching. The current implementation in UseModWiki is really the easiest solution: just search for the title. However, it isn't strictly correct, for a couple of reasons (for instance, a page that merely mentions the title in plain text matches even though it doesn't actually link to it).

The second is the more general case of text searching. Once again, UseModWiki is less than optimal because it allows RegularExpressions. Instead, limiting the search to word matching would help improve our ability to index it.

However, implementing optimal searching is difficult. So, something easy would be nice.

How about something like http://www.atomz.com or the Google search-your-site function? Let their bots worry about how to handle the quicksand of wiki changes. -- JerryMuelver

I find the title search on WardsWiki very useful from a user's point of view. I rarely need full text searches on that site. They are a last resort because they pull up too many hits. (This is all independent of speed/efficiency issues.)

I'm not sure regular expressions are very useful for things like this. At least, I've never used them (and wouldn't like to guess what the local syntax is). More useful would be combinations of words (AND, OR, NEAR etc) as provided by web search engines, which presumably would permit word-based indexing.
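
(A minimal sketch of what word combinations over a word-based index might look like, in Perl since that's what the wiki runs on. The %index contents, page names, and function names are made up for illustration, not part of UseModWiki.)

 use strict;
 use warnings;

 # Hypothetical word index: lowercased word => { page name => 1 }
 my %index = (
     wiki    => { 'OptimalSearching' => 1, 'UseModWiki' => 1 },
     search  => { 'OptimalSearching' => 1 },
     indexes => { 'UseModWiki' => 1 },
 );

 # AND: pages containing every query word.
 sub search_and {
     my @words = map { lc } @_;
     my %hits = %{ $index{ shift @words } || {} };
     for my $word (@words) {
         my $pages = $index{$word} || {};
         delete $hits{$_} for grep { !$pages->{$_} } keys %hits;
     }
     return sort keys %hits;
 }

 # OR: pages containing any query word.
 sub search_or {
     my %hits;
     $hits{$_} = 1 for map { keys %{ $index{lc $_} || {} } } @_;
     return sort keys %hits;
 }

 print join(', ', search_and('wiki', 'search')), "\n";   # OptimalSearching
 print join(', ', search_or('search', 'indexes')), "\n"; # OptimalSearching, UseModWiki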

WardsWiki gets some of its speed by permitting its searches to use out-of-date data. This can be a problem. In particular, I sometimes find a new page without much content or meaning, and in order to evaluate it I want to find its context, ie the page that links to it. But I can't because the linking edit hasn't made it into the cache yet. To put it another way, it's precisely the new stuff that I want to search for. (This is mainly for backlinks, so special casing them might be enough.) -- DaveHarris

Some of the reasons for using out-of-date indexes were because until recently (mid-2000) the C2 wiki was using a relatively low-powered system with only 16 or 32 megabytes of memory. The database format used by C2 tended to spread the 20MB or so of real data over 50 to 90 MB of the database file. A full search of the actual database could take minutes to complete since the database could not completely fit in memory. Modern webserver systems are at least 10 times faster, and generally have hundreds of megabytes of memory. The content of the C2 wiki would easily fit into memory. The cost of actually searching the text is far outweighed by the cost of the system calls to open the files.

Still, for larger wikis it would be nice to have optional indexes. I'm hoping to find an indexing method that is easy to incrementally update (after each edit) so that it can always be current. This isn't a priority just yet, however. --CliffordAdams

The indexing method itself seems pretty clear. A number of "inverse index" solutions are available (developed and optimized) for indexing large numbers (millions) of www pages. But keeping an index always current may create a problem, because a single new page may create a few hundred new index entries. Under load this should result in performance problems. There are surely ways to avoid performance bottlenecks, but as a UNIX newbie I don't yet have the profiling environment needed to attack such an optimization. -- HelmutLeitner

A few hundred new index entries should be nothing as far as load goes, unless the index uses something really inefficient like LinkedLists?. For example, I have a program that indexes 116778 words in about 0.75 seconds in real user time (written in C, of course), and even given that wikis wouldn't be running on a SunSparc? or written in C, a similar indexing scheme should give acceptable speed... I used a simple HashingFunction? that weights each character lexicographically and ignores 'e's and 't's.
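
(A rough sketch of that kind of word index in Perl, using a plain Perl hash in place of the custom HashingFunction? described above; the page name and the word-splitting rule are illustrative assumptions.)

 use strict;
 use warnings;

 # Word index: lowercased word => { page name => 1 }.
 my %index;

 # Add every word of a page's text to the index.
 sub index_page {
     my ($page, $text) = @_;
     my %seen;
     # Split on non-word characters; case-folding keeps the index smaller.
     for my $word (grep { length } map { lc } split /\W+/, $text) {
         next if $seen{$word}++;
         $index{$word}{$page} = 1;
     }
 }

 index_page('OptimalSearching', 'As the PageDatabase grows, searching slows down.');
 print join(', ', sort keys %{ $index{'searching'} }), "\n";   # OptimalSearching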

The slowest part would be generating the initial index. Once that's done each changed or new page would only take a few milliseconds to index.

For incremental updates, what needs to be done is fairly simple: since you should still keep the old page around until the edit is finished, just find all words no longer in the page and delete them, then add all new words to the index...
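
(Under the same assumed index layout as the sketch above, not the actual UseModWiki data structures, that incremental update might look something like this.)

 # Continues the sketch above: %index maps word => { page name => 1 }.
 sub reindex_page {
     my ($page, $old_text, $new_text) = @_;
     my %old = map { $_ => 1 } grep { length } map { lc } split /\W+/, $old_text;
     my %new = map { $_ => 1 } grep { length } map { lc } split /\W+/, $new_text;

     # Words that disappeared from the page: drop this page from their entries.
     for my $word (grep { !$new{$_} } keys %old) {
         delete $index{$word}{$page};
         delete $index{$word} unless %{ $index{$word} };
     }
     # Words that are new to the page: add this page to their entries.
     $index{$_}{$page} = 1 for grep { !$old{$_} } keys %new;
 }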

If you're using a binary tree or a hash, then adding something to the index, finding and deleting it, or listing pages that contain the key would take about O(log N) time (for a balanced tree; a hash averages O(1)), where N is the size of the index.

Losing the RegularExpressions? would be a really good idea if you want even half decent performance in key comparisons. -- DanielThomas

Wikis (at least this one) aren't big enough to justify large-scale disk-based indexing. Doing an empty search on MeatBall returns 859 pages (usemod returns only 174 *blink* I always assumed it was huge); that's probably _less_ data than my toy program was handling. There are about one to two million words in the English [1] language, and that's probably only barely enough to justify the sort of indexing you're looking at... Wikis, as far as I can tell, aren't anywhere near big enough to justify the heavyweight indexing that search engines indexing millions of sites do.

I'm not too familiar with how far people want wikis to scale, so this is either a severe reality check for everyone else or me talking out my rear end. But given the size of MeatBall, having a simple RAM-based index is a sensible option, since the more heavyweight options will bring in more overhead than they're worth.

What kind of backend does usemod have, anyway? I know it's Perl... but what's on the other end, a relational database? If it's MySQL, there's your problem right there. --DanielThomas

The largest public wiki is probably the C2 wiki, at about 30 megabytes of text as of January 2001. The largest wiki using the UseModWiki software is the WikiPedia project, with 1133 pages on February 17, 2001. (It switched to new software, now called MediaWiki, in January 2002, with ~20,000 pages at the time. It now has about ten times that.)

UseModWiki currently uses ordinary files for storage, divided into directories by the first letter of the page. It is intended to scale to about 20,000 pages, and might be usable up to 50,000 pages. Obviously full-text searches will become slower as the database grows. Eventually some other solution will have to be implemented. --CliffordAdams
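
(For the curious, a brute-force full-text search over that kind of file layout amounts to something like the sketch below. The directory path and the slurp-and-match approach are simplifying assumptions, not the actual UseModWiki code; the cost grows with the number of page files that have to be opened, which is why it slows as the database grows.)

 use strict;
 use warnings;
 use File::Find;

 # Scan every page file under the database directory for a search string.
 # The path and the one-page-per-file layout are simplifying assumptions.
 sub grep_pages {
     my ($db_dir, $query) = @_;
     my @hits;
     find(sub {
         return unless -f $_;
         open my $fh, '<', $_ or return;
         local $/;                          # slurp the whole page at once
         my $text = <$fh>;
         push @hits, $File::Find::name if defined $text && $text =~ /\Q$query\E/i;
     }, $db_dir);
     return @hits;
 }

 print "$_\n" for grep_pages('/path/to/wikidb/page', 'OptimalSearching');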

For a brief moment, I'll be pragmatic: If you want to scale UseModWiki that large, it's likely that you have capital to back the site up. Buy indexing software, hack the Perl, move on with your life. Back to surreality, it's still fun to think about how to search a wiki efficiently. Ok, to be honest, TheProgrammersBooklist derailed on the searching (which was the only thing really working at the time). So, I have a vendetta. -- SunirShah


Regular expressions (in Perl syntax) were used for the initial UseModWiki searching because they were a quick-and-easy addition. Unfortunately, most of the usefulness of the local regular expressions is negated since I do a case-insensitive match. Searching is high on my list of planned improvements. One near-term possibility is to add checkboxes for title and text searches--I like the look of HelmutLeitner's [Support Wiki] search (scroll to the bottom). Another likely addition will be options on the Preferences page for regular expressions (default OFF), and case-insensitive searching (default ON). I might also add a link like "Advanced Search", which could be something like the [C2 Search Helper]. Finally, for those who rarely search, I might add a preference to use a link to a search page rather than adding the form to every page. --CliffordAdams


| CategoryWikiTechnology | CategoryMeatballWikiSuggestion |
