ShotgunSpam

Most WikiSpam is LinkSpam, which exists for the express purpose of raising the target pages' Google PageRank. Wikis are PageRank machines: they are highly internally linked and often have many inbound links from other sites, so their UniversalEditing? makes them a jackpot for PageRank spammers (a practice euphemistically known as 'search engine optimization').

As a result, the readability of the spam is not important, nor is it really important to camouflage the spam as real content. The goal is not to hit a particular wiki where TheAudience is still active, but rather to find GhostTowns where no one will bother to clean up the spam.

Thus, the basic strategy is to search Google for 'Edit text of this page' (cf. EditMask), paste in as many ExternalLinks as possible, and repeat this on as many pages as possible (like a ForestFire). One can easily write a robot to do this, although most spammers still do it manually.

However, this form of spam is very noticeable, akin to the loud burst of a shotgun and the wide, visible area of damage it leaves. As a problem type, this is very good news, since it should be easy to create automated defenses against this type of spam. Simply detect that a large number of ExternalLinks have been added. Even easier, spammers will hit multiple pages in a row with the same ExternalLink in a short duration, making this behaviour strongly detectable. Simply keep track of which links have been posted by whom and to what pages, and see if the same link has been pasted onto multiple pages. Alternatively, LinkThrottling can restrain someone from posting a great many links to the wiki without making it impossible for good-faith contributors.
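For illustration, here is a minimal sketch of that cross-page check, in the same Perl style as the UseModWiki filter below. The sub name, the one-triple-per-line log file, and the threshold are all assumptions for the sake of the example, not existing UseModWiki code:

# Hypothetical cross-page check: refuse an edit when the same IP has
# already dropped the same URL onto several different pages.
sub LooksLikeShotgunSpam {
  my ($logfile, $ip, $id, $newtext, $limit) = @_;
  $limit ||= 3;                          # distinct pages per URL before refusing
  my %pages_for_url;                     # url => { page => 1, ... }
  if( open my $log, '<', $logfile ) {
    while( my $line = <$log> ) {
      chomp $line;
      my ($logip, $url, $page) = split ' ', $line, 3;
      next unless defined $page && $logip eq $ip;
      $pages_for_url{$url}{$page} = 1;
    }
    close $log;
  }
  my @urls = $newtext =~ /(http:\S+)/g;
  for my $url (@urls) {
    $pages_for_url{$url}{$id} = 1;
    return 1 if keys %{ $pages_for_url{$url} } >= $limit;
  }
  if( open my $log, '>>', $logfile ) {   # remember this edit's links for later checks
    print $log "$ip $_ $id\n" for @urls;
    close $log;
  }
  return 0;
}

A caller in DoPost could pass $ENV{'REMOTE_ADDR'}, the page $id, and the submitted text, and bounce the save in the same way the filter below does.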

Defense against ShotgunSpam can also benefit from PeerReview through RecentLinks.

Contrast SemanticSpam, which is more insidious.

CategorySpam


The current filter on MeatballWiki to deal with ShotgunSpam is the following:

# Poor man's spam filter from MeatBall:ShotgunSpam
# Count each URL's occurrences in the old text, subtract its occurrences in
# the new text; any key left negative is a URL that appears more often after
# the edit than before.
my %http;
$http{$1}++ while( $old =~ /(http:\S+)/sg );
$http{$1}-- while( $string =~ /(http:\S+)/sg );

my $diff = 0;
for (keys %http) { $diff++ if $http{$_} < 0; }

# Too many net-new URLs: log the IP to the spam list and silently send the
# browser back to the page without saving.
if( $diff > 5 ) {
  &AppendStringToFile( "$DataDir/spamlist", "$ENV{'REMOTE_ADDR'}\n" );
  &ReleaseLock();
  &ReBrowsePage($id, "", 1);
  return;
}

It prevents three URLs from being added or removed at once. Originally it just did a simple count to see whether three more URLs had been added, but that allowed spammers to replace a page with spam while preventing anyone from reverting it. This technique still allows spammers to drop up to three links at a time on a page, and it prevents people from pasting in entire essays or papers that may have a lot of references. I don't like the way it masks the fact that it has rejected your save, either, but UseModWiki is getting harder to hack over time. It also doesn't deal with wide-area spam attacks (a more effective technique than blasting only one page). So far it blocks about two to three spam attacks a day. -- SunirShah

I recently added the same type of filter on my wiki, along with an AutoBan? for the IP of the "spammer". I must say it seems to be working well, although I'm thinking about adding a Bayesian filter in its place. If it works so well for my mailbox, I think it should work just as well fighting WikiSpam. -- TomScanlan?

I'd be wary - you don't want any false positives, and Bayesian filters tend to get several of them during training. -- ChrisPurcell
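For reference, a very rough sketch of the kind of Bayesian token score being discussed above. The %spam and %ham token-count hashes (and their corpus sizes) are assumed to have been gathered from past spam and legitimate edits; nothing here is existing wiki code, and the warning about training applies:

sub SpamProbability {
  my ($text, $spam, $ham, $nspam, $nham) = @_;    # token counts and corpus sizes
  my $log_odds = 0;
  for my $token ( $text =~ /(\w[\w.-]*)/g ) {
    my $ps = ( ($spam->{$token} || 0) + 1 ) / ( $nspam + 2 );   # Laplace smoothing
    my $ph = ( ($ham->{$token}  || 0) + 1 ) / ( $nham  + 2 );
    $log_odds += log($ps) - log($ph);
  }
  return 1 / ( 1 + exp(-$log_odds) );             # log-odds back to a probability
}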


Judging from the logs, the filter is working exceptionally well against SpamBot?s. It has blocked around 1800 spam edits by around 130 unique IPs on usemod.com since I installed it a couple of months ago. -- SunirShah


Are you aware of RichardP? and his invaluable WikiMinion?? (http://www.nooranch.com/synaesmedia/wiki/wiki.cgi?WikiMinion) -- PhilJones

Yes. My goal is to integrate such AntiSpamBot?s with the PeerToPeerBanList in order to make the cost of maintenance O(lg(N)) (?) rather than O(N). But ultimately, such solutions are like trying to move a pile of sand one grain at a time. What's needed is a solution that changes the force dynamic to make spam no longer economically valuable (or possible). Of course, as long as there are abandoned wikis that are being heavily spammed, this will not stop. -- SunirShah


At which line, or roughly where, is the code inserted? DanKoehl

In sub DoPost after this blob:

  # Consider extracting lock section into sub, and eval-wrap it?
  # (A few called routines can die, leaving locks.)
  &OpenPage($id);
  &OpenDefaultText();
  $old = $Text{'text'};
  $oldrev = $Section{'revision'};
  $pgtime = $Section{'ts'};

# Poor man's spam filter
...

Thanks. DanKoehl


I am starting to hate this patch. It's blocked Meatball users around 10 times, in return for around 50 other unique attempts. With a roughly 15% false positive rate, it's not so hot. On the other hand, it's prevented about 200 pages worth of spam. -- SunirShah

Yeah, I was confused when I had my edit rejected on the usemod wiki. An explanatory message would make this less of a problem. Also, a whitelist of good URLs/domains could be good.
Could this be modularised? I'm thinking the code for being able to reject an edit (preferably with a message) is one thing, and the code for identifying spam edits is another. If you wanted to add alternative or additional identifying routines, how easy would this be?
Did anyone make a basic banned content patch for usemod? I.e. something which takes a list of regular expressions like this [emacswiki BannedContent] -- Halz - 24th Jan 2005
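For what it's worth, a minimal sketch of that banned-content idea: refuse the save if the new text matches any regular expression in a one-pattern-per-line blocklist. The sub name and file format are assumptions, not an existing usemod patch:

sub MatchesBannedContent {
  my ($text, $listfile) = @_;
  open my $fh, '<', $listfile or return 0;   # no blocklist, nothing banned
  while( my $pattern = <$fh> ) {
    chomp $pattern;
    next if $pattern =~ /^\s*(#|$)/;         # skip comments and blank lines
    return $pattern if eval { $text =~ /$pattern/i };   # guard against bad regexps
  }
  close $fh;
  return 0;
}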


How come there are no such anti-spam patches made available on the usemod site? Some people have commented on shortcomings of this patch, but surely a copy of this should be placed [here] -- Halz - 20th Jan 2005

I haven't made the patch clean enough for UseModWiki. That is the eventual plan, but in the interim, it's here so we can discuss the problem across several wiki engines. -- SunirShah

But as I say, the usemod wiki has no spam-blocking patches available as far as I can see. I'm wondering what the recommended usemod anti-spam approach is. -- Halz 24th Jan 2005


If spammers search Google for 'Edit text of this page', that gives me the idea of changing 'Edit text of this page' to something personalized so the spammers don't find me. Good idea? --mutante

Is that why this wiki has a <strong> tag in the middle of the link text? An HTML comment would have the same effect, I guess.
Another related idea would be to rename the 'edit' action. Some spammers seem to be going to a list of preset 'edit' page URLs, rather than actually clicking the 'Edit text of this page' link every time. One could even implement a system whereby you have to follow the link, by adding some kind of numeric code parameter (perhaps a hash function of the current date/revision number). This would make automated spamming a little more awkward. -- Halz - 24th Jan 2005
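For what it's worth, a sketch of that numeric-code idea, hashing the page name, its current revision, and a server-side secret; $SiteSecret and the 'code' parameter are assumptions rather than existing UseModWiki features:

use Digest::MD5 qw(md5_hex);

sub EditToken {
  my ($id, $revision, $secret) = @_;
  return substr( md5_hex("$secret:$id:$revision"), 0, 10 );
}

# When rendering the edit link:
#   my $code = EditToken($id, $Section{'revision'}, $SiteSecret);
#   ... "action=edit&id=$id&code=$code" ...
# When saving, recompute the token for the revision being edited and refuse
# the post unless it matches the submitted 'code' parameter.

A spambot working from a precompiled list of edit URLs would then be posting against stale codes as soon as the page changed.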

This takes weeks to propagate to the search engines, is useless against precompiled lists, and (at least) Google ignores HTML tags for search purposes (been there, done that, if you look closely at this page's footer). -- DavidSchmitt


There appears to be some software making the rounds that looks for unused links (WikiNames that don't point to pages) and sticks the same text (often containing just one link) into them 10 or 20 times with successive edit-and-saves. I have now been hit from Poland, Russia, Romania, Mexico... If they are trying to overload the history, this doesn't make much sense, as there was no history to begin with. Also, I now realize that the Poor Man's Spam Filter above may not help, as there is often only one link per page. However, they are hitting the same pages a good few times in the space of a few minutes, so a rate test might help... Comments? --pm

I'm surprised it took only a few weeks before we experienced spammers willing to circumvent the ShotgunSpam filter. Clearly, if concentrations of links in one change per page are denied, the solution is a lot of small changes. I thought I had a few months. The best medium-term solution is something like CitizenArrest, but that feels awfully like giving every citizen a handgun, a solution that does not work effectively on a large scale. Another alternative is to count the number of URL changes a person makes over a period of time. If they exceed a limit, roll back all their changes. This would prevent people like myself from actually writing long essays on the wiki, though. And, of course, any form of authorization through authentication is vulnerable to IdentityTheft (LoginsAreEvil), so that won't help in the long term, but it may in the short term. -- SunirShah

The VersionHistory cannot be deleted by anyone, so they are filling it up, since the Google spider will look at all those pages as distinct but tightly linked, which is what generates the most PageRank. -- SunirShah

But that has no effect on Google; see http://www.usemod.com/robots.txt.


My Perl is pretty primitive - could some kind soul combine Sunir's suggestion (counting the number of URL changes a person makes over a period of time) with testing for the presence of URLs? My wiki got hit today with two lines of text containing 2 URLs affecting one unused WikiPage, saved 78 times in the space of about 2 minutes. Even Sunir at his most prolific wouldn't show that pattern! By the way, when I revert (until this patch becomes available), is it better to delete the offending page, or to change it to something innocuous with a summary saying something like "reverting spam"? --pm
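For what it's worth, a rough sketch along those lines: count recent URL-carrying saves per IP in a small log file and refuse further ones past a limit. The log file format, window, and limit are all assumptions, an untested illustration rather than a finished patch:

sub TooManyUrlEdits {
  my ($logfile, $ip, $newtext, $window, $limit) = @_;
  $window ||= 600;                          # seconds to look back
  $limit  ||= 5;                            # URL-bearing saves allowed per window
  return 0 unless $newtext =~ /http:\S+/;   # only meter edits that contain URLs
  my $now = time();
  my @recent;
  if( open my $fh, '<', $logfile ) {
    while( my $line = <$fh> ) {
      chomp $line;
      my ($logip, $when) = split ' ', $line;
      push @recent, "$logip $when" if defined $when && $now - $when <= $window;
    }
    close $fh;
  }
  my $count = grep { (split ' ')[0] eq $ip } @recent;
  if( open my $fh, '>', $logfile ) {        # rewrite the trimmed log plus this save
    print $fh "$_\n" for @recent, "$ip $now";
    close $fh;
  }
  return $count >= $limit;
}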


The vandalism that deleted EarleMartin's homepage pointed out a weakness with the filter: restoring a previous version of a page that happens to contain a lot of links is painfully slow with the filter in place. One obvious fix is to change the filter to compare the body text to a previous version, and not engage the filter if the text matches. One weakness to this fix is that it could allow spammers to simply restore their spam with impunity; on the other hand, if the shotgun filter doesn't allow the link-heavy spam in the first place, there's nothing to revert. --ChuckAdams

Even if it does, the spammers have already shown they will happily bypass the filter. If we want to make it harder for them, some StableCopy-based solution might help.

The fact that spammers can get around it in its current form is not a reason to not fix the filter -- in fact, if it's useless against spammers, then it's only useful for vandals. StableCopy is a big redesign of the dynamics of wiki (and it's patent hubris to call it The Answer to spam/vandalism), whereas I am making a suggestion for a tweak. --ChuckAdams

The original filter prevented plus or minus 3 http links. Now it is only +10 http links, making pages with a lot of links vulnerable to page blanking attacks. It's always wrong to use a HardConstant?. It's better to change the fundamental flow. However, the 'minor tweak' will be sufficient for now, if Scott, Chris, or Cliff would like to do it. Sadly, there is no such thing as a minor tweak when it comes to accessing the PageDatabase with UseModWiki. -- SunirShah

I have considered hacking usemod to backend it with a SQL db such as SQLite, but every time I go into the wiki codebase, I'm overcome by the urge to rewrite it. I went down that winding road for a couple months last time before I came to my senses. --ChuckAdams

I should flesh out what I meant by "StableCopy-based solution": allow a change if it's within +/- 3 links of the current revision or any stable copy (rather than "any previous revision"). Didn't spell that out before because I don't think it will give any real improvement over "any previous revision", except maybe against vandals, but that bare sentence came out looking like a drive for StableView or something. Merely calculating whether a page is a StableCopy or not is usually pretty easy.
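For illustration, a sketch of that relaxed check, comparing only total URL counts (a simplification of the per-URL diff in the filter above) against the current text and whatever stable copies the caller passes in; the sub names are assumptions:

sub CountUrls {
  my ($text) = @_;
  my $n = () = $text =~ /http:\S+/g;   # count matches
  return $n;
}

sub WithinLinkLimit {
  my ($newtext, $limit, @candidates) = @_;   # candidates: current text plus stable copies
  my $new = CountUrls($newtext);
  for my $old (@candidates) {
    return 1 if abs( $new - CountUrls($old) ) <= $limit;
  }
  return 0;
}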

Is using SQLite better than maintaining a plain-text directory-based system? -- ChrisPurcell

For SQLite, perhaps not, since it has fairly poor locking for the usage patterns of a wiki. With a real database like PostgreSQL?, hell yes. There is a lot of code dealing with locks and consistency that any decent DBMS will do for you. PageIndex? becomes a trivial operation, and with an index on last-edited, RecentChanges does too -- currently these scan through every single file on the wiki, which is just insane. This is a solved problem, though, and solving it again for an existing codebase just isn't all that challenging or interesting, as anyone wanting a DB backend has already chosen a wiki engine that provides one. --ChuckAdams
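As an illustration of the RecentChanges point, a sketch of the indexed query with a hypothetical DBI backend; the table and column names are made up, and SQLite is used here only to keep the example self-contained (the point about the index applies to any backend):

use DBI;

my $dbh = DBI->connect("dbi:SQLite:dbname=wiki.db", "", "", { RaiseError => 1 });
$dbh->do(q{
  CREATE TABLE IF NOT EXISTS pages (
    id          TEXT PRIMARY KEY,
    text        TEXT NOT NULL,
    last_edited INTEGER NOT NULL
  )
});
$dbh->do("CREATE INDEX IF NOT EXISTS idx_last_edited ON pages(last_edited)");

# RecentChanges becomes one indexed query instead of a scan of every page file.
my $since   = time() - 3 * 24 * 60 * 60;
my $changes = $dbh->selectall_arrayref(
  "SELECT id, last_edited FROM pages WHERE last_edited > ? ORDER BY last_edited DESC",
  undef, $since
);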

Another alternative is switching platforms, but that would require assessing what features are critical governing features for Meatball. -- SunirShah
