WikiSpamWorkshop


For WikiSymTwoThousandFive, http://www.wikisym.org

Abstract

Wiki spam has begun to cripple the use of public wikis. The original WikiWikiWeb has thrown its 'shields up' in part due to spammers. The problem of defending against wiki spam while maintaining the traditional wiki value of openness is a daunting one, yet it goes to the very heart of what the Internet and the world is. MeatballWiki has been leading the discussion of how to defend sensibly against wiki spam. This workshop is intended as both a report on progress and a facilitated time to come together as a Greater Wiki Community to devise better solutions for this problem.

Long description

One of the most important values in wiki culture is openness. However, as wikis become better known in society, they are catching the attention of spammers who wish to exploit this openness. To make this worse, the wiki information structure turns out to be optimal within Google's trademarked PageRank algorithm. Now that dead wikis number in the hundreds of thousands, spammers often don't pay attention to whether or not a wiki is active before hitting it. Active wikis are therefore frequently busy keeping spam at bay, which is a nuisance that puts many off wikis altogether and even challenges our value of openness.

MeatballWiki has been the centre of an ongoing discussion of how to deal with wiki spam [1]. MeatballWiki is also the home of the concept of soft security [2], which is the canonical articulation of how wikis defend themselves with social means. This workshop will begin with a very brief overview of the best theoretical description of wiki spam (i.e. an economic, ecological, crime of opportunity), an outline of known techniques, and a catalogue of implemented or theorized defenses. An open discussion will then be facilitated to brainstorm either new solutions or new ways to encourage adoption.

Ideal defensive strategies will preserve the openness of wikidom while greatly reducing the time and effort put into defense by members of the wiki. The best strategies will also defend against other nuisance problems such as vandalism, trolls, sociopaths, and edit wars.

: [1] http://www.usemod.com/cgi-bin/mb.pl?WikiSpam
: [2] http://www.usemod.com/cgi-bin/mb.pl?SoftSecurity

Bio

Sunir Shah is the Founder and Editor of MeatballWiki, the pre-eminent centre for community-wide wiki development on the Internet since April 2000. Shah is also the original articulator of the concept of soft security, and has made several important contributions to the defense of wiki culture in the face of new attacks. For the past two years, he has been facilitating the ongoing discussion around wiki spam on MeatballWiki as well as contributing important theoretical concepts and technical solutions, such as the peer-to-peer ban list and citizen arrest.

This text is more up-to-date on the page CategorySpam. It is preserved for historical archival reasons related to WikiSym 2005.

Summary of the State of the Art

: Shah, S. and Purcell, C. (2005). WikiSpam – State of the Art. Wiki Symposium, San Diego, CA, Oct 16-18, 2005.
: with primary acknowledgments to Meatball and Chongqed.org

Background

Basic definition. When we speak of spam, we usually refer to one of two types: SemanticSpam that encourages us to buy something; and LinkSpam that takes advantage of Google's PageRank algorithm. As we primarily fight LinkSpam on web-based SocialSoftware like wikis, we will primarily talk about LinkSpam here, although some techniques will apply against SemanticSpam as well. LinkSpam is the most common type of spam since it is more profitable to rise higher in the Google rankings and get thousands of potential readers than it is to get the dozens of readers on a wiki.

Methods. For the most part, LinkSpam is done manually, as labour is cheap in the spammier parts of the world. Some of the most sophisticated of spammers use robots, often custom tailored to their target, but these people are few since the cost/benefit ratio is high. Most spammers will use an OpenProxy or an AnonymousProxy to avoid HardBans of their IP addresses. Some have been known to use ZombieMachine?s, exploiting a security flaw in Windows Remote Desktop.

Dimensions of analysis. Spam and our responses must be analyzed in terms of MotivationEnergyAndCommunity. Spam is primarily motivated by economic factors, whereas community is primarily motivated by less tangible, soft, emotional factors. Solutions pit the energy spammers are willing to expend against the energy produced by community goodwill. Mitigating factors often boil down to TechnologySolutions. In the HardSecurity manner, we can build better shields, better weapons; or in the SoftSecurity manner, we develop better abilities to dodge, absorb, and deflect spam.

Communication. Because spam is not an attempt at communication, attempting to communicate with a spammer using words or ideas will fail. Spammers are merely interested in the act of posting links. Consequently, the only way information will transfer from us to a spammer is through actions. Think of it this way: at the point where a conflict has degraded to a fist fight, words are often useless. You must first create physical distance. The same goes for spammers, except they are not in a temporary foul mood, but in business, which means they will not go away unless the cost to them greatly increases or the benefit disappears.

Essential problem. More traditional methods of increasing their costs, like jail, are impractical short of banning most of the world from using the Internet (which is already happening). This mostly increases costs on their neighbours, who will hopefully take local action to control the problem. However, since China has a Great Firewall strategy, this is doubtful. Internet-centric ways include: downgrading their rankings in the SearchEngines; increasing their labour cost to the point they find cheaper ways of exploiting PageRank; and developing more efficient SearchEngineOptimization? methods that do not depend on harassing others on the Internet. (Search relevance is really Google's problem.)

Content filtering

Because you control the underlying WikiEngine, you can control what content is posted. A theoretically ideal ContentFilter blocks all 'bad' content whilst leaving 'good' content unfettered, but this is impossible as the range of possible content (good or bad) is both infinite and undecidable--you need people to make decisions. Therefore, content filtering becomes a game to identify new 'bad' content as quickly as possible, as well as finding simple patterns that can be scalably exploited to block a wide range of content. In terms of the energy arms race, this is one-for-one in effort with the attacker.

Scale through numbers. The decentralized PeerToPeerBanList (e.g. the recent SharedAntiSpam initiative) and the centralized-but-receptive http://www.chongqed.org RegexFilter? list pit the mostly good against the few bad apples. An AntiSpamBot? (e.g. ThoughtStorms:WikiMinion) can apply some mechanical muscle to small communities or GhostTowns.
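
As a rough illustration of how a shared RegexFilter list might be applied at save time, here is a minimal Python sketch. The list format (one regular expression per line, `#` for comments), the example patterns, and the function names are assumptions made for illustration; they are not the actual chongqed.org or SharedAntiSpam formats.

```python
import re

# Hypothetical shared ban list: one regular expression per line.
# The entries are made-up examples, not taken from any real list.
BAN_LIST_TEXT = r"""
# lines starting with '#' are comments
cheap-pills\.example
casino[0-9]*\.example
"""

# Deliberately simple pattern for external links in page text.
URL_RE = re.compile(r"https?://[^\s<>\[\]]+", re.IGNORECASE)


def load_ban_patterns(text):
    """Compile every non-empty, non-comment line into a case-insensitive regex."""
    patterns = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            patterns.append(re.compile(line, re.IGNORECASE))
    return patterns


def banned_links(page_text, patterns):
    """Return the external links in page_text that match any ban pattern."""
    links = URL_RE.findall(page_text)
    return [url for url in links if any(p.search(url) for p in patterns)]
```

A wiki engine could refuse the save, or ask for human confirmation, whenever `banned_links` returns a non-empty list. Every pattern added by hand is one unit of defender effort, which is the one-for-one arms race described above.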

Algorithmic. Blocking with a LanguageFilter (e.g. everything Chinese) or a trained BayesianFilter (unproven) can greatly increase your defensive power. False positives are a major problem (e.g. everyone Chinese).
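
To make the BayesianFilter idea concrete, the following is a toy naive Bayes scorer in Python. It is only a sketch under simple assumptions (word tokens, add-one smoothing, two classes named "spam" and "ham"); the class and method names are hypothetical, and nothing here shows that the approach works well on real wiki spam.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Toy two-class naive Bayes text scorer with add-one smoothing."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercased runs of letters and digits; a real filter would
        # probably want to treat URLs and domains specially.
        return re.findall(r"[a-z0-9]+", text.lower())

    def train(self, text, label):
        self.word_counts[label].update(self.tokenize(text))
        self.doc_counts[label] += 1

    def spam_log_odds(self, text):
        """Positive means 'looks more like spam', negative 'more like ham'."""
        vocab = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_docs = sum(self.doc_counts.values())
        tokens = self.tokenize(text)
        score = 0.0
        for label, sign in (("spam", 1), ("ham", -1)):
            log_p = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            # The extra +1 reserves probability mass for unseen words.
            denom = total_words + len(vocab) + 1
            for word in tokens:
                log_p += math.log((self.word_counts[label][word] + 1) / denom)
            score += sign * log_p
        return score
```

The false-positive problem shows up directly: any legitimate edit that shares vocabulary with the training spam gets a positive score, so the threshold for blocking is a judgment call.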

Network control

While with cities, you defend borders against neighbours, on the Internet, you defend ports against IP addresses. Defending your server against network attacks seems to be par for the course on the Internet. In terms of the energy arms race, aside from manually blocking IP addresses, this is an advantage in effort over the attacker.

HardBan. The traditional approach is to (manually) maintain a BanList of offending IPs. This usually quickly devolves into a RegionalBan against China, which eventually fails due to OpenProxy, AnonymousProxy, and ZombieMachine? attacks. False positives are expected as entire countries are casually banned.
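
A BanList check is simple to sketch with Python's standard `ipaddress` module. The `BanList` class below is hypothetical, and the addresses come from the ranges reserved for documentation; the point is to show how a range ban sweeps up innocent neighbours.

```python
import ipaddress


class BanList:
    """Manually maintained list of banned addresses and address ranges."""

    def __init__(self):
        self.networks = []

    def ban(self, cidr):
        # Accepts a single host ("192.0.2.17/32") or a whole range.
        self.networks.append(ipaddress.ip_network(cidr))

    def is_banned(self, host_ip):
        addr = ipaddress.ip_address(host_ip)
        return any(addr in net for net in self.networks)
```

Banning `198.51.100.0/24` because of one offender also blocks the other 255 addresses in that range, which is the false-positive cost in miniature. A spammer who switches to an OpenProxy outside the range is not affected at all.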

Automated defense. PortScan?ning editing hosts to see if they are an OpenProxy, and blocking each AnonymousProxy is a good strategy in the same vein as UseRealNames. Some proprietors with ethical reasons to support total anonymity will not like this.

Spider defense. SpiderTrap robots before they annihilate you. More extremely, put up a SearchEngineCloak.

SurgeProtector. You can directly control the amount of energy any part of the network can inflict on your site. EditThrottling, ViewThrottling, or directly LinkThrottling ShotgunSpam are all good options. Care must be taken not to jointly throttle spam reversion, lest one provide an easy way for spammers to defeat the community.
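
One way EditThrottling might look in code is a per-host sliding window that deliberately exempts reverts. This is a sketch only: the class name, the default limits, and the `is_revert` flag are assumptions, and a real engine would need its own way of recognizing a revert.

```python
from collections import defaultdict, deque


class EditThrottle:
    """Allow at most max_edits per host within window_seconds."""

    def __init__(self, max_edits=5, window_seconds=60.0):
        self.max_edits = max_edits
        self.window_seconds = window_seconds
        self._recent = defaultdict(deque)  # host -> timestamps of counted edits

    def allow(self, host, now, is_revert=False):
        # Reverts are never throttled, so cleanup can outpace an attack.
        if is_revert:
            return True
        stamps = self._recent[host]
        # Forget edits that have fallen out of the window.
        while stamps and now - stamps[0] >= self.window_seconds:
            stamps.popleft()
        if len(stamps) >= self.max_edits:
            return False
        stamps.append(now)
        return True
```

Passing `now` in explicitly (for example from `time.time()`) keeps the class easy to test. The revert exemption follows the caution above about not throttling spam reversion.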

Demotivation

The most SoftSecurity approach is to eliminate any intrinsic interest the spammer has in attacking. The best way is to stop SearchEngines finding or valuing their links.

NotIndexed. You can and should hide your VersionHistory from SearchEngines. You can also use an ExternalRedirect? or flag all outbound links with NoFollow (and destroy the Web while you're at it).

DelayedIndex?. You can delay the time it takes for a link to appear to SearchEngines, say by making them NoFollow until a LinkVeto period has passed, or only presenting a StableCopy to a SearchEngine.
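
A DelayedIndex built on NoFollow could be as small as the following Python sketch. The one-week LinkVeto period and the function name are assumptions; `rel="nofollow"` is the real attribute value that search engines recognize.

```python
import html

LINK_VETO_SECONDS = 7 * 24 * 3600  # assumed veto period of one week


def render_external_link(url, first_seen, now, veto_seconds=LINK_VETO_SECONDS):
    """Render an external link, marked nofollow until it survives the veto period.

    first_seen and now are timestamps in seconds.
    """
    safe_url = html.escape(url, quote=True)
    if now - first_seen < veto_seconds:
        return '<a href="%s" rel="nofollow">%s</a>' % (safe_url, safe_url)
    return '<a href="%s">%s</a>' % (safe_url, safe_url)
```

Links that the community leaves alone for the whole period start passing value to SearchEngines; links that are reverted in time never do, so the spammer's effort earns nothing.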

HiddenCommunity?. You can try putting up an EditMask or even more extremely SearchEngineCloak, or use the RobotsExclusionStandard to hide yourself from SearchEngines, and thus make yourself an unfindable and unvaluable target.
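
The RobotsExclusionStandard rules can be checked with Python's standard `urllib.robotparser` before deploying them. In this sketch the `/history/` and `/edit/` paths are hypothetical; substitute whatever URL layout your WikiEngine uses for its VersionHistory and edit forms.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides history and edit pages from all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /history/
Disallow: /edit/
"""


def crawler_may_fetch(robots_txt, path, user_agent="*"):
    """Return True if a well-behaved crawler would be allowed to fetch path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)
```

Only cooperative crawlers honour these rules, so this hides pages from SearchEngines rather than from spammers themselves.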

Cost of participation. You can increase the cost of posting by introducing a PricklyHedge, like HumanVerification (e.g. a CaptchaTest?). The OpenProxy PortScan? also increases the cost (~15 seconds per new host). A good method anywhere but the Internet is to use an AccessFee.

Better peer review

Often the best solution is to empower the good guys. A strong CommunitySolution is more resilient and adaptive and fair than any algorithm.

PreemptiveModeration. The traditional approach: a volunteer army to vet content.

RecentLinks. Provide a specialized RecentChanges for just external links. You can further specialize this by listing only new domains. Necessary with a LinkVeto.
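
The core of a RecentLinks page is a comparison of the links before and after each edit. A minimal sketch in Python, assuming the engine can supply the old and new page text and a set of domains already seen on the wiki (the function names are illustrative):

```python
import re
from urllib.parse import urlparse

URL_RE = re.compile(r"https?://[^\s<>\[\]]+", re.IGNORECASE)


def external_links(text):
    """Return the set of external links found in page text."""
    return set(URL_RE.findall(text))


def recent_links(old_text, new_text, known_domains):
    """Return (added_links, new_domains) introduced by an edit, both sorted."""
    added = sorted(external_links(new_text) - external_links(old_text))
    hosts = {urlparse(url).hostname for url in added}
    new_domains = sorted(hosts - set(known_domains) - {None})
    return added, new_domains
```

Listing only `new_domains` gives reviewers the shortest list to vet, since links to domains the community already trusts drop out.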

GlobalRevert. Decrease the cost of reverting a spam attack to tip the balance back into the good guys' hands. As this is akin to putting guns in everyone's hands, you can do it with more consequences via CitizenArrest.
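
A GlobalRevert can be sketched as a single pass over every page history. The data model here is an assumption made for illustration: `pages` maps a page name to a list of `(host, text)` revisions, oldest first, and the reverting account is called "GlobalRevert".

```python
def global_revert(pages, spammer_host, reverter="GlobalRevert"):
    """Undo a host's edits on every page where that host made the latest edit.

    Appends a new revision restoring the newest text not written by the
    spammer (or blank text if there is none). Returns the reverted page names.
    """
    reverted = []
    for name, history in pages.items():
        if not history or history[-1][0] != spammer_host:
            continue
        clean = [text for host, text in history if host != spammer_host]
        restored = clean[-1] if clean else ""
        history.append((reverter, restored))
        reverted.append(name)
    return sorted(reverted)
```

The revert is itself a new revision, so it stays visible in the history and can be undone if the wrong host was targeted. Note one limitation of this sketch: if someone edited on top of earlier spam, the restored text may still contain it.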

Offensive action

Some people, particularly the fine folks at http://www.chongqed.org, would like to take a more proactive stance towards spam. While this strategy may be mildly worrisome for those who remember how spammers stalked, harassed, and threatened the maintainers of the email RealTimeBlackholeList?s in the 1990s, there are things that we can do that do not require putting our necks on the line.

GoogleBomb?. http://www.chongqed.org has a strategy of GoogleBomb?ing spammer keywords to point to http://www.chongqed.org.

AntiSpamBot?. Use SearchEngines to find wikis the same way spammers do. Send a bot to automatically revert their spam. Like a virus scanner for the whole Internet. False positives are problematic.

GhostTown list. http://www.chongqed.org maintains a list of private GhostTowns that end up being WildHoneyPots. This list could eventually be used by SearchEngines to eliminate a large number of spammers from their listings, although this is very dangerous and potentially litigious.

RealtimeBlackholeList?. http://www.chongqed.org maintains a centralized RegexFilter? of spammers, as submitted by volunteers.

PeerToPeerBanList. A very decentralized and difficult-to-attack network of RegexFilter? lists. Benefits from the same easy scalability as the WebLog community without someone to harassingly phone at 3am. See SharedAntiSpam.
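
One possible shape for merging peer lists while keeping a record of who contributed each pattern is sketched below in Python. This is not the SharedAntiSpam protocol; the `BanEntry` structure, the explicit set of trusted peers, and the example peer names are all assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class BanEntry:
    pattern: str
    sources: set = field(default_factory=set)  # peers that published this pattern


def merge_peer_lists(peer_lists, trusted_peers):
    """Merge the patterns published by trusted peers, remembering their sources.

    peer_lists maps a peer name to the list of patterns it publishes.
    """
    merged = {}
    for peer, patterns in peer_lists.items():
        if peer not in trusted_peers:
            continue
        for pattern in patterns:
            merged.setdefault(pattern, BanEntry(pattern)).sources.add(peer)
    return merged
```

Because every entry records its sources, a mistaken or malicious pattern can be traced to the peer that published it, and dropping that peer from `trusted_peers` and merging again removes its contributions.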

Fight fire with fire. We can create websites that SearchEngines down-rate, like link farms, and put the spam links up on those sites to trigger any automatic spam detectors.

Report spam. Just report spam links directly to SearchEngines.

Erect barriers

You can also give up the basic wiki principle of open editing by all and concede that some jerks will spoil the fun for everybody. Strategies to adapt exist on a gradient, fortunately, so you can strike a happy medium.

ShieldsUp. During a spam attack, close the site to everyone but trusted editors. (a close relative of FishBowl)

Logins. The traditional approach is only as strong as the rate spammers can create new fake email addresses. Creating new logins does offer a SpeedBump?, however.
- Staged login. Everyone can edit, but only CommunityMembers can post external links. Definition of a CommunityMember may be as simple as those with UserNames with a CategoryHomePage. (e.g. a FunctionalAccessTrustMetric) This risks creating a culture of screening new members to determine which ones are spammers, a PricklyHedge to membership that may be highly detrimental to the growth of community.
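
The staged login rule reduces to one check at save time. In this Python sketch, `community_members` stands in for whatever definition of CommunityMember the wiki uses, and `author` is `None` for anonymous edits; both are assumptions for illustration.

```python
import re

URL_RE = re.compile(r"https?://[^\s<>\[\]]+", re.IGNORECASE)


def may_save(author, old_text, new_text, community_members):
    """Anyone may edit, but only community members may add external links."""
    added = set(URL_RE.findall(new_text)) - set(URL_RE.findall(old_text))
    return not added or author in community_members
```

Existing links are left alone, so an anonymous visitor can still fix a typo on a page that already contains links.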

InviteOnly?. Many strategies revolve around you reaching out to others (cf. UsAndThem).
- PrivateCommunity?. Like many places on the Internet, your wiki is only readable and writable to those invited.
- FishBowl. Read-only for the public; writable only to those invited.
- InvitationClique?. Only those already invited can invite more people. You can control the rate of growth through economic factors, like invitation tokens.
- Petition. Have contributors prove they belong to the social group. This works best in professional associations, like academic communities. You just prove you've written a paper in the field.
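
The invitation-token idea for an InvitationClique can be sketched as follows. The class name, the fixed token allowance, and the rule that each newcomer receives a fresh allowance are assumptions; real communities may refill or price tokens differently.

```python
class InvitationClique:
    """Members spend invitation tokens to admit newcomers."""

    def __init__(self, founders, tokens_per_member=2):
        self.tokens_per_member = tokens_per_member
        self.tokens = {name: tokens_per_member for name in founders}

    def is_member(self, name):
        return name in self.tokens

    def invite(self, inviter, newcomer):
        """Spend one of the inviter's tokens; return True if the invite succeeded."""
        if self.tokens.get(inviter, 0) <= 0 or newcomer in self.tokens:
            return False
        self.tokens[inviter] -= 1
        self.tokens[newcomer] = self.tokens_per_member
        return True
```

The token allowance caps how fast the community can grow: with one token each, membership grows as a chain; with more, it can branch.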

Economic. Charge a nominal AccessFee for participation. This defeats the spammer at the heart of their motivation, whilst giving you a solid identity (their credit card) to counter-attack. Caveat: payment often obliges you to a contract; caveat: you'll discourage good guys too who often have an even lower economic incentive than spammers for posting. Many people will rightly refuse to give out credit card details on the internet. You also make your server a tempting target for crackers eager to steal money, and risk subsequent litigation costs.

Call for BarnRaising

I need help cleaning up, triaging, and organizing the material (aka total mess) on WikiSpam and CategorySpam and http://chongqed.org for presentation in the workshop. My goal is not to talk for 3 hours, but I do want to be able to summarize what's known and not known, and have some sort of plan for how we are going to use the 3 hours. Please please lend a hand. -- SunirShah

: Have the changes I've made helped any? (Chonqed isn't working for me, btw.) -- ChrisPurcell

Immensely. Thank you so much! I'm sorry I haven't been more interactive. I have one more deliverable to do for OISE...er, and a wiki book review, and then I'm tearing into this! (I suck.) -- SunirShah

: It's chongqed.org with 'g'. We've built up a fair amount of information on the wiki there (http://wiki.chongqed.org), in fact I would be surprised if you know of any [anti-spam solutions] which are not described on there somewhere. One of the things I've tried to do is separate out those techniques which are controversial/invasive on the wiki user experience, and those which are very complicated to implement, and distill what remains into a set of simple [Anti Spam Recommendations]. Any feedback (and more community participation) would be most welcome there, although the wiki's been frozen for the past 3 days while Manni moves it to a new server -- [Halz]

I'm 100% aware of chongqed's value. However, half the content on Chongqed refers to what's been written here on MeatballWiki, and I also have to push forward the SharedAntiSpam initiative which is centred here. That's why I'm using both as my reference materials. There's a lot of give and take in the wiki community. This should be celebrated. -- SunirShah

Halz, whenever Manni decides to open the wiki, please inform him that Chongqed:TarPit is incorrect. Typically, the reason why spammers don't have IPs that have reverse lookups is because they are using an OpenProxy. If you ban open proxies, then you block these spammers. Also, their view that Chongqed:ContentBanning has no negatives is unfortunately false. History has shown repeatedly that ContentFilters are almost always abused to censor political enemies. The problem we are having with the SharedAntiSpam initiative is developing a protocol to decide what is trustworthy and what isn't, and to appeal mistakes or malfeasance. There is a reason I put an AuditTrail into the PeerToPeerBanList. -- SunirShah

Typically a workshop means many people contributing. This means at least that participants should talk about how they perceive spam, what measures they take, how well the measures are working and what they would like to see. With ProWiki systems, there is the option that users have to save their preferences (have a cookie, maybe a username) and this keeps off almost all spammers. While this may keep off some inexperienced wiki newcomers too, I don't see this as a problem and won't add further technological measures that potentially add negative effects (like false hits or mb's delays). But this is just my personal preference. If some clever scheme turns up as a standard, I would of course support and implement it. Now, knowing that I'll be at San Diego on the 15th, I'd like to be invited to the workshop. As far as I know one can only participate based on an invitation. -- HelmutLeitner

: Please come! The workshop is open to all participants. The 'invitation-only' thing ACM does is some academic nonsense. -- SunirShah

Hi SunirShah,

I am actively fighting spam and am quite interested in this topic. I wish I could attend to get some more ideas about Anti Spam Bot strategies and algorithms. Unfortunately, the $450 fee to attend is prohibitive. I can be found on several wikis, among them [Chonged.org], (See also [Anti Spam Bots on Chonged]), and [NV|U wiki]. If there is anything I can do to assist, drop a reply [here] as I don't get back here too often.

[MeanRoy]

probation
buddy system (kuro5hin, brainstorm)
(peer-to-peer) ban list
reputation system, web of trust
marketing
wrapper cgi (filter HTTP POST text)
graffiti wall
filtered external redirector

successes [# of examples]
nofollow [1]
captcha (even simple) [2]

http://www.chongqed.org

sessions

(Internet) reputation system -- Brandon, outside under green awning
statistical real-time analysis -- Sunir, if anyone is interested
SharedAntiSpam -- Eugene, where he is sitting
Non-techies -- Mark, left by the easel
tech support for wiki managers -- John Abbe -- right (may join Mark)
CGI::Wrapper -- Peter, front of room

http://sourceforge.net/projects/wiki-spam does not exist and also I cannot find any other project named "wiki spam" something on sourceforge. -- MaxVoelkel

: There's [SourceForge:Spam Proof Wiki]. Why are you asking, by the way? I can't find any mention of such a project above. -- ChrisPurcell

Mentioned f2f re: SharedAntiSpam

CategorySpam CategoryBarnRaising CategoryWikiSym