WikiSpam is a wiki-wide problem and will only be solved wiki-wide.
Wikis are characterized by their UniversalAccess and UniversalEditing?. Google measures its PageRank based on links from one site to another, plus the PageRank of the site linking to the other. Wikis are PageRank machines, being both massively linked and with hundreds or thousands of pages. These two factors - openness and PageRank - make wikis the ideal target for spam attacks.
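The link-counting idea described above can be sketched with the simplified, published form of the PageRank iteration. This is only an illustration: the function name, the damping value of 0.85, and the toy link graph are assumptions, and real search engine ranking involves far more than this.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Simplified PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    Illustrative sketch only, not Google's production algorithm.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the small "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical graph: three small wikis link to a spam site, nobody links to "honest".
toy_links = {
    "wiki1": ["spam"],
    "wiki2": ["spam"],
    "wiki3": ["spam"],
    "spam": [],
    "honest": [],
}
ranks = pagerank(toy_links)
```

In this toy graph the spammed links are the only thing separating "spam" from "honest", and they are enough to rank it higher.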
Spammers are looking for higher Google results, not for people to follow their links. The most useful wikis to spam are not even the most popular wikis, but rather the abandoned wikis that aren't being watched. While early spammers hit sites with incredibly high PageRank, it's possible now to use Google (ironically) to find, and a robot (automated script) to write links on, millions of smaller single-user wikis whose owners will neither notice nor revert the vandalism. Millions of links from smaller sites are better than a few links from larger sites.
WikiSpam often appears as explicit links (e.g. http://example.com); as [bracketed] links, which are sometimes difficult to detect if a spammer replaces the URL in one [bracketed] link with the spam link; and as links hidden behind hard-to-see periods.
Remember that TheCollective ultimately controls everything, right down to access, on an OnlineCommunity. The only problems are the negotiations. We could FishBowl the community or we can leave it wide open, or we can do something in between, but there is some solution to the problem.
(See MotivationEnergyAndCommunity for more about the classification scheme used below.)
Reverting spammed pages by hand is often the simplest and most effective solution. It's not necessarily a bad thing to let your users continue to revert spam, as it gives them some sense of collective defense of the OnlineCommunity, which will increase their sense of responsibility and attachment. Reverting spam is something everyone can do, even the most squeamish of editors.
This fails when the energy of the community is matched or surpassed by the energy of the spammers. In the face of 'bots, and with the realization that your site may later become a GhostTown, other approaches are needed.
Basic definition. When we speak of spam, we usually refer to one of two types: SemanticSpam that encourages us to buy something; and LinkSpam that takes advantage of Google's PageRank algorithm. As we primarily fight LinkSpam on web-based SocialSoftware like wikis, we will primarily talk about LinkSpam here, although some techniques will apply against SemanticSpam as well. LinkSpam is the most common type of spam since it is more profitable to rise higher in the Google rankings and get thousands of potential readers than it is to get the dozens of readers on a wiki.
Methods. For the most part, LinkSpam is done manually, as labour is cheap in the spammier parts of the world. Some of the most sophisticated of spammers use robots, often custom tailored to their target, but these people are few since the cost/benefit ratio is high. Most spammers will use an OpenProxy or an AnonymousProxy to avoid HardBans of their IP addresses. Some have been known to use ZombieMachine?s, exploiting a security flaw in Windows Remote Desktop.
Dimensions of analysis. Spam and our responses must be analyzed in terms of MotivationEnergyAndCommunity. Spam is primarily motivated by economic factors, whereas community is primarily motivated by less tangible, soft, emotional factors. Solutions pit the energy spammers are willing to expend against the energy produced by community goodwill. Mitigating factors often boil down to TechnologySolutions. In the HardSecurity manner, we can build better shields and better weapons; in the SoftSecurity manner, we develop better abilities to dodge, absorb, and deflect spam.
Communication. Because spam is not an attempt at communication, attempting to communicate with a spammer using words or ideas will fail. Spammers are merely interested in the act of posting links. Consequently, the only way information will transfer from us to a spammer is through actions. Think of it this way: at the point where a conflict has degraded to a fist fight, words are often useless. You must first create physical distance. The same goes for spammers, except they are not in a temporary foul mood, but in business, which means they will not go away unless the cost to them greatly increases or the benefit disappears.
Essential problem. More traditional methods of increasing their costs, like jail, are impractical short of banning most of the world from using the Internet (which is already happening). This mostly increases costs on their neighbours, who will hopefully take local action to control the problem. However, since China has a Great Firewall strategy, this is doubtful. Internet-centric ways include: downgrading their rankings in the SearchEngines; increasing their labour cost to the point they find cheaper ways of exploiting PageRank; and developing more efficient SearchEngineOptimization? methods that do not depend on harassing others on the Internet. (Search relevance is really Google's problem.)
Because you control the underlying WikiEngine, you can control what content is posted. A theoretically ideal ContentFilter blocks all 'bad' content whilst leaving 'good' content unfettered, but this is impossible as the range of possible content (good or bad) is both infinite and undecidable--you need people to make decisions. Therefore, content filtering becomes a game to identify new 'bad' content as quickly as possible, as well as finding simple patterns that can be scalably exploited to block a wide range of content. In terms of the energy arms race, this is one-for-one in effort with the attacker.
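A minimal sketch of the pattern-based filtering described above, assuming a small hand-maintained list of regular expressions. The patterns, the `.test` domain, and the function names are all hypothetical. The filter rejects an edit only when it introduces banned content, so a legitimate user can still edit a page that already carries an old match.

```python
import re

# Hypothetical blocklist; a real one is maintained by people as new spam appears.
BANNED_PATTERNS = [
    r"https?://(?:[^\s/]*\.)?example-pills\.test",  # one spamvertised domain
    r"\bcheap\s+(?:viagra|cialis)\b",               # one simple keyword pattern
]
_BANNED = [re.compile(p, re.IGNORECASE) for p in BANNED_PATTERNS]


def find_banned(text):
    """Return the banned patterns that match somewhere in the text."""
    return [rx.pattern for rx in _BANNED if rx.search(text)]


def accept_edit(old_text, new_text):
    """Accept the edit unless it adds a banned pattern the page did not already contain."""
    already_present = set(find_banned(old_text))
    introduced = [p for p in find_banned(new_text) if p not in already_present]
    return not introduced
```

Each new pattern costs the defender about as much effort as the spammer spends changing a URL, which is the one-for-one arms race mentioned above.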
While with cities, you defend borders against neighbours, on the Internet, you defend ports against IP addresses. Defending your server against network attacks seems to be par for the course on the Internet. In terms of the energy arms race, aside from manually blocking IP addresses, this is an advantage in effort over the attacker.
While a spammer manually editing pages from a browser is hard to detect, many spammers use automated scripts targeting thousands or millions of websites, hiding behind a RotatingProxy to avoid a simple IP ban. This kind of energy weapon, massively increasing the amount of energy the spammer has, is devastating to a CommunitySolution, quickly overwhelming the energy in the community. An anti-energy weapon prevents such a tactic by any user, thus remaining notionally in the realm of SoftSecurity. Three techniques have been found extremely efficacious on MeatballWiki:
The most SoftSecurity approach is to eliminate any intrinsic interest the spammer has in attacking. The best way is to stop SearchEngines finding or valuing their links.
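One widely used way to stop search engines valuing posted links is to mark external links with the `rel="nofollow"` attribute. Whether this is among the techniques the authors had in mind is an assumption. The sketch below is a hypothetical renderer, not any particular WikiEngine's code, and it handles only bare URLs.

```python
import html
import re

# Deliberately simple URL matcher for the sketch; real wiki markup is richer.
URL_RE = re.compile(r"https?://[^\s<>\"]+")


def render_external_links(text):
    """Escape wiki text as HTML and turn bare URLs into rel="nofollow" links."""
    parts = []
    pos = 0
    for match in URL_RE.finditer(text):
        # Escape the plain text between links.
        parts.append(html.escape(text[pos:match.start()]))
        url = html.escape(match.group(0), quote=True)
        parts.append(f'<a href="{url}" rel="nofollow">{url}</a>')
        pos = match.end()
    parts.append(html.escape(text[pos:]))
    return "".join(parts)
```

For example, `render_external_links("see http://example.com now")` wraps the URL in an anchor carrying `rel="nofollow"` and leaves the surrounding words as escaped text.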
Often the best solution is to empower the good guys. A strong CommunitySolution is more resilient and adaptive and fair than any algorithm.
Some people, particularly the fine folks at http://www.chongqed.org, would like to take a more proactive stance towards spam. While this strategy may be mildly worrisome for those who remember how spammers stalked, harassed, and threatened the maintainers of the email RealTimeBlackholeList?s in the 1990s, there are things that we can do that do not require putting our necks on the line.
You can also give up the basic wiki principle of open editing by all and concede that some jerks will spoil the fun for everybody. Strategies to adapt exist on a gradient, fortunately, so you can strike a happy medium.
An anti-community weapon attempts to cleave a community in two. A Chongqed:TarPit attempts to neatly cleave Them from Us, without telling Them we did so. The two problems here are (a) identifying Them not Us, and (b) not letting on to Them that they've been dropped in a pit. Anti-community weapons can of course be used for other ends: ContentFilters are almost always abused to censor political enemies. How to support an AuditTrail without defeating (b) is an open question.
You can measure the spam-resistance of a wiki by seeing if it meets this minimal standard:
The above text is PrimarilyPublicDomain. An alternate version that appeared at WikiSym 2005 is available on WikiSpamWorkshop.
Aye, edit masking is something strongly on the table. I liken it to how a virus (e.g. the SARS corona virus) changes its protein coat to prevent detection by the immune system.
I'm considering how this will impact legitimate bots, but I think we can have a white list of bots that are allowed to hit a clean API. (again, akin to the immune system). -- SunirShah
They aren't important. It's better to have good RSS feeds than use HTML changes, since the latter is kind of bogus for dynamic community sites. Think about sites with MOTD, fortune cookies, 'who is online' lists, RSS aggregation, etc. -- SunirShah
Could the number of newly created pages be added to the surge protection? -- MarkusLude
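A minimal sketch of what that suggestion might look like: a sliding-window surge protector that counts page creations separately from ordinary edits. The class name, limits, and window length are assumptions for illustration, not MeatballWiki's actual implementation.

```python
import time
from collections import defaultdict, deque


class SurgeProtector:
    """Limit edits and page creations per client within a sliding time window."""

    def __init__(self, max_edits=10, max_new_pages=2, window_seconds=60.0):
        self.max_edits = max_edits
        self.max_new_pages = max_new_pages
        self.window = window_seconds
        self._edits = defaultdict(deque)      # client -> timestamps of accepted edits
        self._creations = defaultdict(deque)  # client -> timestamps of accepted creations

    def allow(self, client, creates_page=False, now=None):
        """Return True and record the action if the client is under both limits."""
        now = time.monotonic() if now is None else now
        edits = self._edits[client]
        creations = self._creations[client]
        # Forget actions that have fallen out of the window.
        for queue in (edits, creations):
            while queue and now - queue[0] >= self.window:
                queue.popleft()
        if len(edits) >= self.max_edits:
            return False
        if creates_page and len(creations) >= self.max_new_pages:
            return False
        edits.append(now)
        if creates_page:
            creations.append(now)
        return True
```

A stricter limit on creations than on edits slows a script that mass-creates pages while leaving ordinary editing untouched.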
Anyone tried EugeneEricKim's new Eaton script? http://www.eekim.com/software/eaton/eaton.pl
Instead of fighting spammers by removing spam, forcing an arms race between our attempts to detect spam and their attempts to add it, we could use spam detection simply to ensure existing content remains unaffected. This could be considered more in line with the SoftSecurity approach to handling attacks. It is hopefully unlikely that blackhat "SEO" spammers will deliberately set out to destroy existing content.
PageRank: While living with spam will inherently decrease your rank (since outgoing links devalue internal ones), the hit will probably not be too severe. On the other hand, Google may notice the spam links and blacklist the wiki, in which case the spammers will be wasting their time (but you will disappear from search engines).
Readers: By keeping spam at the bottom of a page, readers will be mostly affected by the increased download times, not by missing content, as currently happens when a page is spammed. Hopefully, this size increase will stabilise as spammers start to overwrite old spam. However, the esteem a wiki is held in, by its readership and by potential new members, may drop significantly if it starts hosting spam.
Host: Becoming a haven for spam may increase storage and bandwidth costs significantly, especially if spammers take advantage of the new relaxed attitude and increase their rate of spam. Many spammers find pages to attack by searching spammy keywords or even their competition's URLs; spam will thus draw spammers like blood draws sharks, further exacerbating this cost. The host may also be legally responsible for the products they are inadvertently promoting.
Google: As the number of such sites grows, there is a reason to really change the page ranking rules. In other words, the problem is shifted to the search engines/spammers. As long as we fight for clean links, the makers of indexing engines have no reason to change the rules. However, given the number of non-wiki spam sites, it seems less likely that a few spammed wikis will fundamentally change the way major search engines work. The existing profusion of spam-filled GhostTowns backs this up.
Perhaps the most promising application for this approach is PersonalWikis, which do not worry about PageRank, cannot afford to spend much time removing spam, and do not attach much importance to the high esteem of potential contributors. However, unless the separation of spam from ham is perfect, real contributions will be lost on RecentChanges, and you might as well just lock the site.
Authors: RadomirDopieralski, ChrisPurcell, JoeChongq?
To assess the value of the strategy versus taking your wiki out of the search engines altogether or even making it private, it seems necessary to understand the objective of the wiki. Why does it need to make itself available to spammers? And if so, why does it need to be in the search engines? -- SunirShah
I've rewritten the idea above, moving the objections inline. Hopefully I've represented both pros and cons equally. This should make it easier to see the relative gains and costs in each aspect. -- ChrisPurcell
Apologies if I appeared hot. I really wasn't, merely concise. I get complaints about that sometimes. -- ChrisPurcell
Your site becomes blacklisted and all spam on it actually does harm to spammers. Being blacklisted just takes your PageRank down to zero, as I understand it: thus, spammers are only affected in that they've wasted some time. They don't get harmed. On the other hand, accepting spam will harm your PageRank, simply by the mechanics of the system, as I understand it.
I believe the "SEO" spammers do not want to destroy the wiki. Yes, they do. Otherwise they would add their spam to the bottom of our pages, not overwrite them. You have to take into account that those doing the spamming are often twenty-year-old geeks with nothing better to do with their time, and they take satisfaction in destroying people's work whilst getting paid. -- ChrisPurcell
Nice theory, but unfortunately false: we require a unique revision ID to accompany all POSTs to existing pages, to prevent EditConflicts. It cannot be guessed at, being essentially random. Empirical evidence strongly supports the theory that they are simply doing a GET, editing it by hand, then POSTing it, all via a web browser. (Your theory is still a very nice one. It explains why certain spammers were so big on creating new pages a little while ago: no revision ID needed. We could usefully close that gap.) -- ChrisPurcell
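The revision ID check described above might look roughly like the following sketch. The `PageStore` class and its method names are hypothetical. As in the comment, a brand-new page needs no ID, which is the gap that could usefully be closed.

```python
import secrets


class PageStore:
    """Toy page store that requires the current revision ID to overwrite a page."""

    def __init__(self):
        self._pages = {}  # title -> (text, revision_id)

    def get(self, title):
        """Return (text, revision_id) for the edit form; a missing page has no ID yet."""
        return self._pages.get(title, ("", None))

    def post(self, title, new_text, revision_id):
        """Save the edit only if revision_id matches the page's current revision."""
        _, current_id = self._pages.get(title, ("", None))
        if current_id is not None:
            supplied = str(revision_id or "").encode()
            # Constant-time comparison; the ID is random, so it cannot be guessed.
            if not secrets.compare_digest(supplied, current_id.encode()):
                return False  # stale or forged ID: edit conflict or blind POST
        # Each accepted save gets a fresh random ID.
        self._pages[title] = (new_text, secrets.token_hex(16))
        return True
```

A client therefore has to GET the page first to learn the current ID, and the same ID cannot be replayed after another edit has been saved.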
Certainly, but that doesn't contradict my point, which is that spammers do intentionally delete existing text. -- ChrisPurcell
Again, my point was simply that spammers do intentionally delete existing text, whether or not it angers users. -- ChrisPurcell
I have added my thoughts above. My other thoughts are on chongqed's [WikiForum]. A bit of summary from there: spam attracts spam, and anything that gets soft on spam is in effect promoting spamming. Webpage owners don't put up wikis, blogs, and guestbooks for the purpose of giving spammers a place to put links.
Readers: Spammers will rarely replace only each other's spam. If they replace anything it is going to be the entire page. Normally it is not recognizable from the legitimate article text. I have seen some using the divs of [CSSHiddenSpam] to replace each other's or their own earlier spam (which makes absolutely no sense), but it is not common.
Google: [GhostTown]s already prove this PageRank point. Because they are heavily spammed, Google does not rank them very well anymore. Most are found buried deep in search results. Google is not going to drastically change the way they rank pages. Overall, it is not a bad system. The problem is that it is vulnerable to abuse, but any ranking based on any popularity measure is open to abuse. Ranking sites by some measure of popularity is important to help users find what they need. They can't just throw it out. And anyway, we had plenty of guestbook spam before PageRank, wikis, and blogs were invented. -- JoeChongq?
As mentioned on our WikiForum, this Tolerate Spammers idea was suggested by Mattis. Here is [some discussion] in the same area with him from over a year ago. -- JoeChongq?
I had missed that suggestion, but I don't really see how it would work. Would you be (as this ignore-spammers discussion suggests) leaving the spam intact, as well as resurrecting the useful content? If you can identify these page-replacement spams, keeping the spam on the page makes no sense. Whether possible or not, it just doesn't make sense. Spam attracts spam, so if you leave the spam there, even at the bottom of the page out of the way, you are just inviting other spammers to find that page and spam it further. -- JoeChongq?
My point is that living with spam is stupid. If you can identify spam well enough to segregate it, you can remove it. Few wikis will go with this "support the spammers so they don't bother us" idea. Unless a large portion of wikis started using this, why would spammers even notice that you don't remove their edits? This will only benefit them by making it easier to spam your pages again. Many spammers find pages to attack by searching spammy keywords or even their competition's URLs. Leaving their spam on your pages just attracts more spam. Assuming this was implemented widely, why wouldn't spammers just adapt to make sure their edits are not segregated? Their links would be in the main body while their competition gets stuck at the bottom. With email, would you prefer having a Spam folder full of 1000 spams or 0? Going with the live-with-it theory, either way they don't end up in your inbox, so it is the same.
Size: Spam-inflated pages may cause problems editing due to browser technological limits, as well as being extremely slow for dialup users. MediaWiki warns on large pages: "some browsers may have problems editing pages approaching or longer than 32kb." -- JoeChongq?