At WikiMania 2005 some developers of wiki engines sat together and agreed to start a shared anti-spam list.
Data format
mimetype: text/plain;charset=utf-8 (the text/plain MIME type spec mandates CR/LF line ends)
BTW, ASCII is a subset of UTF-8. We usually won't use non-ASCII characters, but we want to be able to if we need them some day.
The format of the file is one spam matching regex per line.
A spam regex should use a basic regex syntax that is understood by Java, Perl, PHP and Python at least.
If you restrict your regexes to the following elements, they should work everywhere:
( ) | . + * {0,99} [a-z] ^ $ \S \s \w \W \d
Please only add REAL spam patterns to those regexes. Adding something as generic as "sex" is a bad idea, as not every wiki page matching it is spam. Please also do NOT assume that only URLs get matched against these regexes.
Lines that are empty or only contain whitespace will be ignored.
Comment lines start with # at the beginning of the line (^#).
Rest-of-line comments are introduced by \s+#.
A comment can contain some initial colon-separated fields, like the date when the entry was added or the source (src) the pattern originated from. E.g.: # YYYY-MM-DD:SRC:OTHERSTUFF free text follows
The regex part of the line is stripped of leading and trailing whitespace characters.
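A minimal sketch, in Python, of reading such a pattern list according to the rules above (the function name and the sample entries are made up for illustration, not part of the standard):

 import re

 def parse_pattern_list(text):
     # Skip empty/whitespace-only lines and full-line comments (^#),
     # cut rest-of-line comments introduced by whitespace + '#',
     # strip surrounding whitespace, and compile what remains.
     patterns = []
     for line in text.splitlines():
         line = line.strip()
         if not line or line.startswith("#"):
             continue
         line = re.split(r"\s+#", line, maxsplit=1)[0].strip()
         if line:
             patterns.append(re.compile(line))
     return patterns

 # hypothetical sample list (entries made up, CR/LF line ends)
 sample = ("# 2005-08-08:LocalBadContent:example entries\r\n"
           "cheap-pills-online\\.example  # rest-of-line comment\r\n"
           "buy\\w+now\\.example\\.com\r\n")
 for rx in parse_pattern_list(sample):
     print(rx.pattern)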
Distribution
There will be a SourceForge (SF) project keeping the current spam list.
Until it is up, you can get a list from http://arch.thinkmo.de/cgi-bin/spam-merge (this won't be permanent, for sure) and distribute it.
See also PeerToPeerBanList.
CategoryWikiStandard
Considerations
The SF file mirror needs some time to update (12 hours? a day?), so it is not too useful for spam patterns, where we need a relatively fast reaction time. SF CVS is sometimes broken; the CVS server for developers is real-time, but the CVS server for anonymous users is delayed (by hours?), so it is not really useful for downloading either.
But we can serve that "documentation about spammers" from the SF web space as part of our project homepage.
We won't need a web form, because we will use the "LocalBadContent?" wiki pages of the participating wikis, see below.
Data flow
A local wiki needs a fast method to react to spam. You sit at your computer and notice on RecentChanges "huh, lots of spam is coming in right now"; you don't have much time, you need to protect your wiki immediately to avoid more pages getting spammed.
So you really want at least a small local spam pattern list for emergencies: you just edit that list, add the new spam patterns and are immediately protected from further damage. This is the data source for updates of our global list! The local wiki administrator should make sure that no unwanted patterns end up on that page (by ACLs, by protecting it with an admin password, or by using a static file instead of a wiki page).
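A minimal sketch of how an engine might apply that local list when a page is saved (Python; the hook and function names are hypothetical, and parse_pattern_list refers to the sketch in the data format section above):

 def is_spam(page_text, patterns):
     # Return the first matching compiled regex, or None if the edit looks clean.
     for rx in patterns:
         if rx.search(page_text):
             return rx
     return None

 # hypothetical save hook: reject the edit if any local or merged pattern matches
 def on_save(page_name, new_text, patterns):
     hit = is_spam(new_text, patterns)
     if hit is not None:
         raise ValueError("edit to %s rejected, matches spam pattern: %s"
                          % (page_name, hit.pattern))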
On SF, there will be a CGI spam-pattern-list merger running, just GETting data from a list of URLs (the URLs of the raw local spam pattern pages), merging the content and spitting out the merged list.
The list of URLs can be a local file in the SF project (and maybe later it can be fetched from some other wiki page, if we need that and can solve the security implications somehow).
So all you have to do is update your own local list, have the URL of that list put on our merge list, and call that CGI from time to time to get the latest merged list. And we don't need a cron script to merge, as the CGI will merge dynamically (a sketch follows below).
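A minimal sketch of such a merge CGI (Python; the sources file name and the de-duplication step are assumptions for illustration, not part of the format above):

 #!/usr/bin/env python
 # Minimal merge-CGI sketch: GET each raw local pattern page, merge the
 # lines, drop exact duplicates, and emit the result as text/plain.
 import sys
 import urllib.request

 SOURCES_FILE = "merge-sources.txt"   # hypothetical: one source URL per line

 def merged_list():
     seen, merged = set(), []
     for url in open(SOURCES_FILE):
         url = url.strip()
         if not url or url.startswith("#"):
             continue
         data = urllib.request.urlopen(url).read().decode("utf-8", "replace")
         for line in data.splitlines():
             line = line.strip()
             if line and line not in seen:
                 seen.add(line)
                 merged.append(line)
     return merged

 if __name__ == "__main__":
     print("Content-Type: text/plain;charset=utf-8")
     print()
     sys.stdout.write("\r\n".join(merged_list()) + "\r\n")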
Publicity campaign
- Phase 1. Developers. Developers of all the major wiki engines should be contacted in order to build consensus and achieve technical adoption.
- the first iteration of this phase was done by a group of developers at WikiMania?2005 - our focus is to quickly implement a first version of the project (basically a single file that SourceForge can host for us), and then refine the project as time goes on. (Otherwise we keep getting bogged down in the details.) --SD
- Phase 2. Users. Explain to wiki users how to upgrade, get on board, etc.
- Phase 3. WikiSym. At WikiSpamWorkshop, continue building this anti-spam community effort. The blacklist standard, technology, and marginal implementation are not the complete solution.
Discussion points
from SvenDowideit? :)
- There is another way to serve the list from SourceForge, which I use to manage one of my projects' web spaces. I have a cronjob that does a 'cvs up' of the htdocs dir regularly, allowing me to change the web site by _only_ checking into CVS. This would mean that the anti-spam list would not be counted towards the download count for the project, but it would still be distributed through SourceForge's normal web content load-balancing mechanism.
- I worry that if the anti-spam list is updated automatically from participating wikis, it would be easy for a clever spammer to add subtly bad exclusions to the list, making the list unusable in the long run
The source lists should be reliable; the wiki admin is responsible for that. If they are not, we must remove them from the list of sources. But that doesn't mean we need a complicated method to contribute spam patterns - that would mean rare updates, because it is so complicated...
Some brainstorming.
- We might need a whitelist and/or cancel system to remove or resolve regexes.
- Since this can get complicated to centralize, I'd prefer to create a FreeMarket? of aggregators so that someone with enough resources can do the right thing, and people can have a choice of which communities to trust.
- I admit there are problems with this model, as individual wikis will reflect 'TheCollective''s banlist, i.e. my wiki will have the same banlist as every other wiki in the collective, making it meaningless to publish individual feeds.
- A way to resolve this is to trace the history of a regex.
- Direct check-in to CVS is not as desirable as simply publishing a feed (cf. PeerToPeerBanList).
- Centralization and PersonalIdentification will lead to attacks by spammers on the organizations, just like the operators of the earlier real-time blackhole lists, who (and whose families) were stalked and harassed by spammers trying to get them to stop. I don't know how true or false those stories are, but I see no good reason to pick a fight with people who have too much time on their hands.
-- SunirShah
I argue for two things. First, some accountability (at least admins should keep a record of who created each regex and agree to make it public in certain agreed cases). Second, a well-defined process for getting your address onto the whitelist.
Otherwise, if some regexp happens to identify your address as a spamhaus, you'll end up in a situation like one of Kafka's stories. -- ZbigniewLukasiak
- This is why it's a list in SourceForge CVS - that way we can track changes - while a distributed list system involves code, and overhead for tracking. -- SD
I implemented a first spam-merge CGI script (see URL above) that fetches some spam pattern lists (matching this standard), merges them and spits out a merged list. It keeps a local copy of the merge, but updates it every dt (currently 10 mins). -- ThomasWaldmann
The arch.thinkmo.de link above is very variable: on the first request it returns a list of spam patterns, but subsequent requests return an empty list. Weird. Is there a problem with the script? -- ChrisPurcell
There are problems with using SourceForge to host this project. Some pertinent quotes from their documentation:
- Project web servers:
- Outbound access to other hosts from the project web servers is strictly prohibited, and is blocked by firewall rules. This includes access to other SourceForge.net hosts; the only exception to this policy is access to our project database servers. This restriction has been caused by past abuse. Any content needed from other SourceForge.net hosts should be retrieved and stored as a file or database entry that may then be accessed from scripts on the project web servers.
- Project shell server:
- Access is not blocked to other SourceForge.net hosts from the project shell server. Access is blocked to hosts outside of SourceForge.net from the project shell server. This policy has been established as result of past service abuse. All content must be pushed to the shell server from an external host, rather than being pulled from the project shell servers.
So we'd either need to check in new regexen externally via CVS, POST them on via HTTP, or sweet-talk the SourceForge folk into making our project an exception, something that may be non-trivial for them to implement on a single-case basis. Plus, we render ourselves vulnerable to any knee-jerk reaction to abuse they may encounter in the future: they have shown themselves quite happy to retract services from all projects when abused by a few. -- ChrisPurcell
ChrisPurcell, what exactly would we need to do this on our own? Just a CVS repository, or would something like http://trac.edgewall.org/ work? Basically, I am willing to offer Debian Sarge/Apache 2 hosting for probably whatever is needed, and I am willing to give you specifically admin access to the server, in order to facilitate bringing SharedAntiSpam into existence. Please let me know what you'd need. -- SamRose