MeatballWiki | RecentChanges | Random Page | Indices | Categories

At WikiMania?2005 some developers of wiki engines sat together and agreed on starting a shared anti-spam list.

We are using the MeatballMailingList for co-ordination.

Data format

mimetype: text/plain;charset=utf-8 (the mimetype spec mandates CR/LF line endings)

BTW, ASCII is a subset of UTF-8. We usually won't use non-ASCII characters, but we want to be able to if we need them some day.

The format of the file is one spam-matching regex per line. A spam regex should use a basic regex syntax that is understood at least by Java, Perl, PHP and Python. If you restrict your regexes to the following elements, they should work:

( ) | . + * {0,99} [a-z] ^ $ \S \s \w \W \d

Please only add REAL spam patterns to those regexes. Adding something as generic as "sex" is a bad idea, as not every wiki page matching it is spam. Please also do NOT assume that only URLs get matched against these regexes.

Lines that are empty or only contain whitespace will be ignored.

Comment lines start with # at the beginning of the line (^#).

Rest-of-line comments are introduced by \s+#.

A comment can contain some initial colon-separated fields, like a date when the entry was added or the src where the pattern originated. E.g.: # YYYY-MM-DD:SRC:OTHERSTUFF free text follows

The regex part of the line will be stripped of leading and trailing whitespace characters.
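The parsing rules above (full-line comments, rest-of-line comments, blank lines, whitespace stripping) could be implemented along these lines; the function name and return value are illustrative, not part of the spec:

```python
import re

def parse_spam_list(text):
    """Parse a shared anti-spam pattern list: one regex per line.

    Illustrative sketch of the format rules described above; the
    name and return shape are assumptions, not part of the spec.
    """
    patterns = []
    for line in text.splitlines():
        # Full-line comments start with # at the beginning of the line (^#).
        if line.startswith('#'):
            continue
        # Rest-of-line comments are introduced by \s+# (whitespace then #).
        line = re.split(r'\s+#', line, maxsplit=1)[0]
        # Strip leading/trailing whitespace; skip empty lines.
        line = line.strip()
        if not line:
            continue
        patterns.append(line)
    return patterns
```

Note this assumes no pattern itself contains whitespace followed by `#`; a pattern needing that would have to be written differently (e.g. `[ ]+#`).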


There will be some SourceForge (SF) project keeping the current spam list.

Until it is up, you can get a list from http://arch.thinkmo.de/cgi-bin/spam-merge (this won't be permanent, for sure) and distribute it.

See also PeerToPeerBanList.



The SF file mirror needs some time to update (12 hours? 1 day?), so it is not very useful for spam patterns, as we need a relatively fast reaction time. SF CVS is sometimes broken, and while the CVS server for developers is real-time, the CVS server for users is delayed (by hours?), so it is not really useful for downloading stuff either.

But we can just serve that "documentation about spammers" from the SF web space as a part of our project homepage.

We won't need a web form because we will use some "LocalBadContent?" wiki pages of the participating wikis, see below.

Data flow

A local wiki needs a fast method to react to spam. You sit at your computer and notice on RC "huh, lots of spam is coming in right now" - then you don't have much time, you need to protect your wiki immediately to avoid more pages getting spammed.

So you really want at least a small local spam pattern list for emergencies: you just edit that list, put new spam patterns there and are immediately protected from further damage. This is the data source for updates of our global list! The local wiki administrator should make sure that no unwanted patterns are put onto that page (by ACLs, by protecting it with an admin password, or by using a static file instead of a wiki page).
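Checking an incoming edit against the local pattern list might look like this minimal sketch (names are assumptions; whether to match case-insensitively is a local policy choice, shown here as one option):

```python
import re

def is_spam(page_text, patterns):
    """Return True if any spam pattern matches the page content.

    Per the format notes above, patterns are applied to the whole
    page content, not only to URLs. Names here are illustrative;
    case-insensitive matching is a local policy choice.
    """
    for pat in patterns:
        if re.search(pat, page_text, re.IGNORECASE):
            return True
    return False
```

A wiki engine would call this on save and reject (or hold) the edit when it returns True.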

On SF, there will be a CGI spam pattern list merger running, just GETting data from a list of URLs (the URLs of the raw local spam pattern pages), merging the content and emitting the merged list.

The list of URLs can be a local file in the SF project (and maybe later can be fetched from some other wiki page, if we need that and can solve the security implications somehow).

So all you have to do is update your own local list, have the URL of that list put on our merge list, and call that CGI from time to time to get the latest merged list. And we don't need a cron script to merge, as the CGI will merge the stuff dynamically.
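The merge step the CGI performs could be sketched as follows, under the assumption that de-duplication preserves first-seen order and that the output uses the CR/LF line endings the format mandates; the error handling and caching a real CGI needs are omitted, and all names are illustrative:

```python
import urllib.request

def merge_lists(texts):
    """Merge several spam pattern list bodies into one de-duplicated
    list, preserving first-seen order (illustrative sketch)."""
    seen = set()
    merged = []
    for text in texts:
        for line in text.splitlines():
            line = line.strip()
            # Skip blank lines and full-line comments when merging.
            if not line or line.startswith('#'):
                continue
            if line not in seen:
                seen.add(line)
                merged.append(line)
    # The text/plain mimetype spec mandates CR/LF line endings.
    return '\r\n'.join(merged) + '\r\n'

def fetch_and_merge(urls):
    """GET each raw local spam pattern page and merge the results."""
    bodies = []
    for url in urls:
        with urllib.request.urlopen(url) as resp:
            bodies.append(resp.read().decode('utf-8'))
    return merge_lists(bodies)
```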

Publicity campaign

Discussion points

from SvenDowideit? :)

The source lists should be reliable; the wiki admin is responsible for that. If they are not, we must remove them from the list of sources. But that doesn't mean we must have a complicated method to contribute spam patterns - a complicated method would lead to rare updates, because it is so complicated...

Some brainstorming.

-- SunirShah

I argue for two things. First, some accountability (at least admins should keep a record of who created a regex and agree to make it public in some agreed cases). Second, a well-defined process for getting your address onto the whitelist. Otherwise, if some regexp happens to identify your address as a spamhaus, you'll be in a situation like in Kafka's stories. -- ZbigniewLukasiak

I implemented a first spam-merge CGI script (see URL above) that fetches some spam pattern lists (matching this standard), merges them and emits a merged list. It keeps a local copy of the merge, but updates it every dt (currently 10 mins). -- ThomasWaldmann
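The caching behaviour described (keep a local copy, refresh it every dt) could be sketched like this; the class, the injected refresh function and the parameter names are invented for illustration, not taken from the actual script:

```python
import time

class CachedMerge:
    """Serve a cached merged list, refreshing it every dt seconds.

    Illustrative sketch of the refresh logic only; the actual
    fetch/merge step is passed in as refresh_fn.
    """
    def __init__(self, refresh_fn, dt=600):
        self.refresh_fn = refresh_fn
        self.dt = dt          # refresh interval in seconds (10 mins)
        self.cached = None
        self.fetched_at = 0.0

    def get(self, now=None):
        # 'now' is overridable to make the logic testable.
        now = time.time() if now is None else now
        if self.cached is None or now - self.fetched_at >= self.dt:
            self.cached = self.refresh_fn()
            self.fetched_at = now
        return self.cached
```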

The arch.thinkmo.de link above is very variable: on first request it returns a list of spam patterns, but subsequent requests return an empty list. Weird. Is there a problem with the script? -- ChrisPurcell

There are problems with using SourceForge to host this project. Some pertinent quotes from their documentation:

Project web servers:
Outbound access to other hosts from the project web servers is strictly prohibited, and is blocked by firewall rules. This includes access to other SourceForge.net hosts; the only exception to this policy is access to our project database servers. This restriction has been caused by past abuse. Any content needed from other SourceForge.net hosts should be retrieved and stored as a file or database entry that may then be accessed from scripts on the project web servers.

Project shell server:
Access is not blocked to other SourceForge.net hosts from the project shell server. Access is blocked to hosts outside of SourceForge.net from the project shell server. This policy has been established as result of past service abuse. All content must be pushed to the shell server from an external host, rather than being pulled from the project shell servers.

So we'd either need to check in new regexen externally via CVS, POST them on via HTTP, or sweet-talk the SourceForge folk into making our project an exception, something that may be non-trivial for them to implement on a single-case basis. Plus, we render ourselves vulnerable to any knee-jerk reaction to abuse they may encounter in the future: they have shown themselves quite happy to retract services from all projects when abused by a few. -- ChrisPurcell

ChrisPurcell, what exactly would we need to do this on our own? Just a CVS repository, or would something like http://trac.edgewall.org/ work? Basically, I am willing to offer Debian Sarge/Apache 2 hosting for probably whatever is needed, and I am willing to give you specifically admin access to the server, in order to facilitate bringing SharedAntiSpam into existence. Please let me know what you'd need. -- SamRose

