CategorySpam


Click on the title for all pages that deal with WikiSpam (cf. WhatIsSpam).

WikiSpam is a wikiwide problem and won't be solved but wikiwide.

Wikis are characterized by their UniversalAccess and UniversalEditing?. Google computes a page's PageRank from the links pointing to it, weighted by the PageRank of the sites doing the linking. Wikis are PageRank machines, being both massively interlinked and containing hundreds or thousands of pages. These two factors - openness and PageRank - make wikis an ideal target for spam attacks.

CategoryWikiConventions CategoryWikiTechnology CategoryDifficultPerson CategoryCategory


1. Introduction
2. Possible Solutions
2.1. CommunitySolution
3. Background (theory & practice)
4. Solutions
4.1. Content filtering
4.2. Network control
4.3. Anti-energy
4.4. Demotivation
4.5. Better peer review
4.6. Offensive action
4.7. Erect barriers
4.8. Anti-community weapon
5. Wider issues
6. WikiEngine security standard
7. See also
8. Discussion


1. Introduction

Spammers are looking for higher Google results, not for people to follow their links. The most useful wikis to spam are not even the most popular wikis, but rather the abandoned wikis that aren't being watched. While early spammers hit sites with incredibly high PageRank, it's possible now to use Google (ironically) to find, and a robot (automated script) to write links on, millions of smaller single-user wikis whose owners will neither notice nor revert the vandalism. Millions of links from smaller sites are worth more than a few links from larger sites.

WikiSpam often appears as explicit links (e.g. http://example.com); as [bracketed] links, which can be difficult to detect when a spammer replaces the URL inside an existing [bracketed] link with the spam link; and as links anchored on hard-to-see characters, such as a single period.

2. Possible Solutions

Don't panic.

Remember that TheCollective ultimately controls everything, right down to access, on an OnlineCommunity. The only problems are the negotiations. We could FishBowl the community or we can leave it wide open, or we can do something in between, but there is some solution to the problem.

(See MotivationEnergyAndCommunity for more about the classification scheme used below.)

2.1. CommunitySolution

Reverting spammed pages by hand is often the simplest and most effective solution. It's not necessarily a bad thing to let your users continue to revert spam, as it gives them some sense of collective defense of the OnlineCommunity, which will increase their sense of responsibility and attachment. Reverting spam is something everyone can do, even the most squeamish of editors.

This fails when the energy of the community is matched or surpassed by the energy of the spammers. In the face of 'bots, and with the realization that your site may later become a GhostTown, other approaches are needed.


3. Background (theory & practice)

Basic definition. When we speak of spam, we usually refer to one of two types: SemanticSpam that encourages us to buy something; and LinkSpam that takes advantage of Google's PageRank algorithm. As we primarily fight LinkSpam on web-based SocialSoftware like wikis, we will primarily talk about LinkSpam here, although some techniques will apply against SemanticSpam as well. LinkSpam is the most common type of spam since it is more profitable to rise higher in the Google rankings and get thousands of potential readers than it is to get the dozens of readers on a wiki.

Methods. For the most part, LinkSpam is done manually, as labour is cheap in the spammier parts of the world. Some of the most sophisticated of spammers use robots, often custom tailored to their target, but these people are few since the cost/benefit ratio is high. Most spammers will use an OpenProxy or an AnonymousProxy to avoid HardBans of their IP addresses. Some have been known to use ZombieMachine?s, exploiting a security flaw in Windows Remote Desktop.

Dimensions of analysis. Spam and our responses must be analyzed in terms of MotivationEnergyAndCommunity. Spam is primarily motivated by economic factors, whereas community is primarily motivated by less tangible, soft, emotional factors. Solutions pit the energy spammers are willing to expend against the energy produced by community goodwill. Mitigating factors boil down often to TechnologySolutions. In the HardSecurity manner, we can build better shields, better weapons; or in the SoftSecurity manner, we develop better abilities to dodge, absorb, and deflect spam.

Communication. Because spam is not an attempt at communication, attempting to communicate with a spammer using words or ideas will fail. Spammers are merely interested in the act of posting links. Consequently, the only way information will transfer from us to a spammer is through actions. Think of it this way: at the point where a conflict has degraded to a fist fight, words are often useless. You must first create physical distance. The same goes for spammers, except they are not in a temporary foul mood, but in business, which means they will not go away unless the cost to them greatly increases or the benefit disappears.

Essential problem. More traditional methods of increasing spammers' costs, like jail, are impractical short of banning most of the world from using the Internet (which is already happening). Such bans mostly increase costs for the spammers' neighbours, who will hopefully take local action to control the problem; given that China has a Great Firewall strategy, however, this is doubtful. Internet-centric ways include: downgrading their rankings in the SearchEngines; increasing their labour cost to the point that they find cheaper ways of exploiting PageRank; and developing more efficient SearchEngineOptimization? methods that do not depend on harassing others on the Internet. (Search relevance is really Google's problem.)


4. Solutions

4.1. Content filtering

Because you control the underlying WikiEngine, you can control what content is posted. A theoretically ideal ContentFilter blocks all 'bad' content whilst leaving 'good' content unfettered, but this is impossible as the range of possible content (good or bad) is both infinite and undecidable--you need people to make decisions. Therefore, content filtering becomes a game to identify new 'bad' content as quickly as possible, as well as finding simple patterns that can be scalably exploited to block a wide range of content. In terms of the energy arms race, this is one-for-one in effort with the attacker.
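
As a minimal sketch of this approach in Python (the patterns and names here are illustrative, not any particular engine's actual filter), each submitted edit is checked against a blacklist of URL patterns and rejected before it is saved:

 import re

 # Illustrative blacklist; a real one is community-maintained and grows as new
 # spam domains appear (e.g. drawn from a shared list such as chongqed.org's).
 BLACKLIST_PATTERNS = [
     r"casino-[\w-]+\.example\.com",
     r"cheap-pills\.example\.net",
 ]

 def contains_spam(text):
     """Return True if any blacklisted pattern occurs in the submitted edit."""
     return any(re.search(p, text, re.IGNORECASE) for p in BLACKLIST_PATTERNS)

 # Reject the edit before it ever reaches the page database.
 if contains_spam("Great deals at http://cheap-pills.example.net/"):
     print("Edit rejected by content filter")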

4.2. Network control

While with cities, you defend borders against neighbours, on the Internet, you defend ports against IP addresses. Defending your server against network attacks seems to be par for the course on the Internet. In terms of the energy arms race, aside from manually blocking IP addresses, this is an advantage in effort over the attacker.
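
A minimal sketch of an IP-level block in Python, assuming a hand-maintained ban list of addresses and proxy ranges (the addresses below are documentation-only placeholders):

 import ipaddress

 # Placeholder ban list: a whole proxy range plus one individual address.
 BANNED_NETWORKS = [
     ipaddress.ip_network("192.0.2.0/24"),
     ipaddress.ip_network("203.0.113.17/32"),
 ]

 def is_banned(remote_addr):
     """Return True if the requesting address falls inside any banned network."""
     addr = ipaddress.ip_address(remote_addr)
     return any(addr in net for net in BANNED_NETWORKS)

 if is_banned("192.0.2.99"):
     print("403: edits from this address are refused")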

4.3. Anti-energy

While a spammer manually editing pages from a browser is hard to detect, many spammers use automated scripts targeting thousands or millions of websites, hiding behind a RotatingProxy to avoid a simple IP ban. This kind of energy weapon, massively increasing the amount of energy the spammer has, is devastating to a CommunitySolution, quickly overwhelming the energy in the community. An anti-energy weapon prevents such a tactic by any user, thus remaining notionally in the realm of SoftSecurity. Three techniques have been found extremely efficacious on MeatballWiki.
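
One measure of this kind, mentioned in the discussion below, is surge protection: throttling how many edits a single address can make within a time window, so a script's output is capped at roughly human speed. A minimal sketch in Python, with illustrative thresholds:

 import time
 from collections import defaultdict, deque

 EDIT_LIMIT = 10        # illustrative threshold: edits allowed per address...
 WINDOW_SECONDS = 600   # ...within this sliding window

 _recent_edits = defaultdict(deque)

 def surge_protected(remote_addr):
     """Return True if this address has exceeded the edit limit and must wait."""
     now = time.time()
     edits = _recent_edits[remote_addr]
     while edits and now - edits[0] > WINDOW_SECONDS:
         edits.popleft()       # forget edits that fell out of the window
     if len(edits) >= EDIT_LIMIT:
         return True
     edits.append(now)
     return False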

4.4. Demotivation

The most SoftSecurity approach is to eliminate any intrinsic interest the spammer has in attacking. The best way is to stop SearchEngines finding or valuing their links.
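
One widely used technique is to mark outbound links rel="nofollow" when rendering pages, so spam links earn no ranking credit even if they stay on the page. A rough sketch (the regex only handles simple double-quoted href attributes):

 import re

 def add_nofollow(html):
     """Tag every outbound anchor with rel="nofollow" so its target gains no PageRank."""
     return re.sub(
         r'<a\s+href="(https?://[^"]+)"',
         r'<a rel="nofollow" href="\1"',
         html,
     )

 print(add_nofollow('<a href="http://example.com/pills">buy now</a>'))
 # <a rel="nofollow" href="http://example.com/pills">buy now</a>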

4.5. Better peer review

Often the best solution is to empower the good guys. A strong CommunitySolution is more resilient, adaptive, and fair than any algorithm.

4.6. Offensive action

Some people, particularly the fine folks at http://www.chongqed.org, would like to take a more proactive stance towards spam. While this strategy may be mildly worrisome for those who remember how spammers stalked, harassed, and threatened the maintainers of the email RealTimeBlackholeList?s in the 1990s, there are things that we can do that do not require putting our necks on the line.

4.7. Erect barriers

You can also give up the basic wiki principle of open editing by all and concede that some jerks will spoil the fun for everybody. Strategies to adapt exist on a gradient, fortunately, so you can strike a happy medium.
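
As a small illustration of one point on that gradient, the edit form can ask a fixed human-verification question that regulars answer trivially but generic spam scripts cannot; the question and answer here are placeholders, not anything a real wiki necessarily uses:

 # Placeholder question-and-answer gate attached to the edit form.
 EDIT_QUESTION = "What is the name of this wiki?"
 EDIT_ANSWER = "meatball"

 def may_save(answer_from_form):
     """Allow the edit only if the verification answer matches (case-insensitive)."""
     return answer_from_form.strip().lower() == EDIT_ANSWER

 print(may_save("Meatball"))   # True
 print(may_save("viagra"))     # False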

4.8. Anti-community weapon

An anti-community weapon attempts to cleave a community in two. A Chongqed:TarPit attempts to neatly cleave Them from Us, without telling Them we did so. The two problems here are (a) identifying Them not Us, and (b) not letting on to Them that they've been dropped in a pit. Anti-community weapons can of course be used for other ends: ContentFilters are almost always abused to censor political enemies. How to support an AuditTrail without defeating (b) is an open question.
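
A minimal sketch of the TarPit idea, assuming a per-author "shadow" revision store (the data structures are illustrative): the tarpitted author is served their own edit as if it succeeded, while everyone else keeps seeing the clean page.

 from dataclasses import dataclass

 @dataclass
 class Revision:
     author: str
     text: str

 # Illustrative store: the clean revision everyone sees, plus shadow revisions
 # kept only for authors who have been dropped into the pit.
 clean_revision = Revision("GoodEditor", "Legitimate page content.")
 shadow_revisions = {
     "SpamBot42": Revision("SpamBot42", "Legitimate page content. http://pills.example/"),
 }

 def render_page(viewer):
     """Tarpitted authors see their edit 'stick'; the community never does."""
     if viewer in shadow_revisions:
         return shadow_revisions[viewer].text
     return clean_revision.text

 print(render_page("SpamBot42"))   # sees the spammed version
 print(render_page("GoodEditor"))  # sees the clean version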


5. Wider issues


6. WikiEngine security standard

You can measure the spam-resistance of a wiki by seeing if it meets this minimal standard:

CategoryWikiStandard


7. See also

The above text is PrimarilyPublicDomain. An alternate version, which appeared at WikiSym 2005, is available on WikiSpamWorkshop.


8. Discussion

Aye, edit masking is something strongly on the table. I liken it to how a virus (e.g. the SARS coronavirus) changes its protein coat to prevent detection by the immune system.

I'm considering how this will impact legitimate bots, but I think we can have a white list of bots that are allowed to hit a clean API. (again, akin to the immune system). -- SunirShah

The only problem would be for people who have their browsers set up to report when a page changes. -- ChrisPurcell

They aren't important. It's better to have good RSS feeds than use HTML changes, since the latter is kind of bogus for dynamic community sites. Think about sites with MOTD, fortune cookies, 'who is online' lists, RSS aggregation, etc. -- SunirShah


Could the number of newly created pages be added to the surge protection? -- MarkusLude

If we do that, we'll simply drive the spammer to target existing pages, making reversion and hiding the edits harder. -- ChrisPurcell


Anyone tried EugeneEricKim's new Eaton script? http://www.eekim.com/software/eaton/eaton.pl

-- PhilJones


Living with the Enemy?

Instead of fighting spammers by removing spam, forcing an arms race between our attempts to detect spam and their attempts to add it, we could use spam detection simply to ensure existing content remains unaffected. This could be considered more in line with the SoftSecurity approach to handling attacks. It is hopefully unlikely that blackhat "SEO" spammers will deliberately set out to destroy existing content.
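
A minimal sketch of that idea, assuming some existing spam classifier (represented here by a boolean passed in): an edit judged to be spam never replaces the page body; only its links are parked below a marker at the bottom, so existing content is untouched.

 import re

 SPAM_MARKER = "\n\n---- links kept out of the article ----\n"

 def apply_edit(current_text, submitted_text, is_spam):
     """Apply a normal edit directly; park a spam edit's links below the marker."""
     if not is_spam:
         return submitted_text
     links = re.findall(r"https?://\S+", submitted_text)
     if SPAM_MARKER not in current_text:
         current_text += SPAM_MARKER
     return current_text + "\n".join(links) + "\n"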

PageRank: While living with spam will inherently decrease your rank (since outgoing links devalue internal ones), the hit will probably not be too severe. On the other hand, Google may notice the spam links and blacklist the wiki, in which case the spammers will be wasting their time (but you will disappear from search engines).

Readers: By keeping spam at the bottom of a page, readers will be mostly affected by the increased download times, not by missing content, as currently happens when a page is spammed. Hopefully, this size increase will stabilise as spammers start to overwrite old spam. However, the esteem a wiki is held in, by its readership and by potential new members, may drop significantly if it starts hosting spam.

Host: Becoming a haven for spam may increase storage and bandwidth costs significantly, especially if spammers take advantage of the new relaxed attitude and increase their rate of spam. Many spammers find pages to attack by searching spammy keywords or even their competition's URLs; spam will thus draw spammers like blood draws sharks, further exacerbating this cost. The host may also be legally responsible for the products they are inadvertently promoting.

Google: As the number of such sites grows, there is a reason to really change the page ranking rules. In other words, the problem is shifted to the search engines/spammers. As long as we fight for clean links, the makers of indexing engines have no reason to change the rules. However, given the number of non-wiki spam sites, it seems less likely that a few spammed wikis will fundamentally change the way major search engines work. The existing profusion of spam-filled GhostTowns backs this up.

Perhaps the most promising application for this approach is PersonalWikis, which do not worry about PageRank, cannot afford to spend much time removing spam, and do not attach much importance to the high esteem of potential contributors. However, unless the separation of spam from ham is perfect, real contributions will be lost on RecentChanges, and you might as well just lock the site.

Authors: RadomirDopieralski, ChrisPurcell, JoeChongq?

Discussion

To assess the value of the strategy versus taking your wiki out of the search engines altogether or even making it private, it seems necessary to understand the objective of the wiki. Why does it need to make itself available to spammers? And if so, why does it need to be in the search engines? -- SunirShah

Can we estimate the costs and compare them to the costs of fighting the spam? I think it would be interesting. I didn't see any suggestion for a similar strategy here, so I assumed it wasn't discussed before. I don't know how to estimate the costs, so... -- RadomirDopieralski

I've rewritten the idea above, moving the objections inline. Hopefully I've represented both pros and cons equally. This should make it easier to see the relative gains and costs in each aspect. -- ChrisPurcell

Now, after a few discussions about the idea, and once the emotions settled down a little, I think you even made it look more promising than it really is. But it's well written, thank you. -- RadomirDopieralski

Apologies if I appeared hot. I really wasn't, merely concise. I get complaints about that sometimes. -- ChrisPurcell

"Your site becomes blacklisted and all spam on it actually does harm to spammers." Being blacklisted just takes your PageRank down to zero, as I understand it: thus, spammers are only affected in that they've wasted some time. They don't get harmed. On the other hand, accepting spam will harm your PageRank, simply by the mechanics of the system, as I understand it.

"I believe the 'SEO' spammers do not want to destroy the wiki." Yes, they do. Otherwise they would add their spam to the bottom of our pages, not overwrite them. You have to take into account that those doing the spamming are often twenty-year-old geeks with nothing better to do with their time, and they take satisfaction in destroying people's work whilst getting paid. -- ChrisPurcell

I am not so sure many of them truly want to destroy other people's content; they just don't care about anything except money. Many refuse to see what they are doing as spamming. And many young teens in third world countries probably are doing it to [support their entire family] (as they suggest in some of their spams). From the email spam world, Ryan Pitylak, a "reformed" spammer, says he just thought of it "as just a game of cat and mouse with corporate email administrators." I suspect that carries over to many of the web spammers; it is just a money-making game to them. See the Contact From Spammers section on our [DiscussSpammers] for more insight into their warped minds. -- JoeChongq?

It's much easier to only send POSTs with your text, instead of GETting the original text, appending your spam to it, and POSTing it all. -- RadomirDopieralski

Nice theory, but unfortunately false: we require a unique revision ID to accompany all POSTs to existing pages, to prevent EditConflicts. It cannot be guessed at, being essentially random. Empirical evidence strongly supports the theory that they are simply doing a GET, editing it by hand, then POSTing it, all via a web browser. (Your theory is still a very nice one. It explains why certain spammers were so big on creating new pages a little while ago: no revision ID needed. We could usefully close that gap.) -- ChrisPurcell
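
(An illustrative sketch of that kind of check, not MeatballWiki's actual code: each saved revision carries a random ID, the edit form echoes it back, and a POST with a missing or wrong ID is rejected.)

 import secrets

 # Illustrative store of the current revision ID for each page.
 page_revisions = {"SomePage": secrets.token_hex(16)}

 def edit_form(page):
     """The GET response embeds the ID of the revision being edited."""
     return '<input type="hidden" name="rev" value="%s">' % page_revisions[page]

 def accept_post(page, posted_rev):
     """A blind POST (wrong or missing ID) is rejected; a fresh ID is issued on success."""
     if posted_rev != page_revisions.get(page):
         return False
     page_revisions[page] = secrets.token_hex(16)
     return True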

I think Radomir was thinking in general wiki terms. I certainly agree that is likely what many are doing on wikis without the kind of protection you have here. -- JoeChongq?

Certainly, but that doesn't contradict my point, which is that spammers do intentionally delete existing text. -- ChrisPurcell

Purposely deleting content doesn't make sense from an active wiki standpoint: they are deleting legitimate content and angering users. But for [GhostTown]s, it is necessary since they are competing with other spammers. Wiping out other spammers' links helps them by reducing the links of the competition; plus, Google punishes sites with too many links to bad neighborhoods, so if you are the only spammer on the page it would be better. They are doing it on purpose, but I don't think they are doing it purposely to maliciously destroy the wikis. Leaving legitimate content on the wikis would be better for them, since Google would be less likely to identify the site as a GhostTown/bad site, but they can't take that chance since the existing content may belong to their competition. -- JoeChongq?

Again, my point was simply that spammers do intentionally delete existing text, whether or not it angers users. -- ChrisPurcell


I have added my thoughts above. My other thoughts are on chongqed's [WikiForum]. A bit of summary from there: spam attracts spam, and anything that gets soft on spam is in effect promoting spamming. Webpage owners don't put up wikis, blogs, and guestbooks for the purpose of giving spammers a place to put links.

Readers: Spammers will rarely replace only each other's spam. If they replace anything it is going to be the entire page. Normally it is not recognizable from the legitimate article text. I have seen some using the divs of [CSSHiddenSpam] to replace each other's or their own earlier spam (which makes absolutely no sense), but it is not common.

Google: [GhostTown]s already prove this PageRank point. Because they are heavily spammed, Google does not rank them very well anymore. Most are found buried deep in search results. Google is not going to drastically change the way they rank pages. Overall, it is not a bad system. The problem is it is vulnerable to abuse, but any ranking based on any popularity measure is open for abuse. Ranking sites by some measure of popularity is important to help users find what they need. They can't just throw it out. And anyway, we had plenty of guestbook spam before PageRank, wikis, and blogs were invented. -- JoeChongq?

As mentioned on our WikiForum, this Tolerate Spammers idea was suggested by Mattis. Here is [some discussion] in the same area with him from over a year ago. -- JoeChongq?

There was only one point I really wanted to come back on. You said "If they replace anything it is going to be the entire page." That's true. The point of the proposed system, though, is to detect such edits, and preserve the useful content of the page across them. I wasn't sure if you'd got that aspect of the proposal, and if not, how we could best change the text to put it across. -- ChrisPurcell

I had missed that suggestion, but I don't really see how it would work. Would you be (as this ignore-spammers discussion suggests) leaving the spam intact, as well as resurrecting the useful content? If you can identify these page-replacement spams, keeping the spam on the page makes no sense. Whether possible or not, it just doesn't make sense. Spam attracts spam, so if you leave the spam there, even at the bottom of the page out of the way, you are just inviting other spammers to find that page and spam it further. -- JoeChongq?

The point is to avoid an arms race. Spammers will not try to trick your spam-detection algorithm if their spam gets through anyway. This is all stated in the first sentence of the text above. -- ChrisPurcell

My point is that living with spam is stupid. If you can identify spam enough to segregate it, you can remove it. Few wikis will go with this "support the spammers so they don't bother us" idea. Unless a large portion of wikis started using this, why would spammers even notice that you don't remove their edits? This will only benefit them by making it easier to spam your pages again. Many spammers find pages to attack by searching spammy keywords or even their competition's URLs. Leaving their spam on your pages just attracts more spam. Assuming this was implemented widely, why wouldn't spammers just adapt to make sure their edits are not segregated? Their links would be in the main body while their competition gets stuck at the bottom. With email, would you prefer having a Spam folder full of 1000 spams or 0? Going with the live-with-it theory, either way they don't end up in your inbox, so it is the same.

Size: Spam-inflated pages may cause problems editing due to browser technological limits, as well as being extremely slow for dialup users. MediaWiki warns on large pages: "some browsers may have problems editing pages approaching or longer than 32kb." -- JoeChongq?

I personally think it's an awful idea. I'm just trying to ensure any arguments against it are actually against it, not a strawman. For instance, if the page can tell what's spam and what's not, why would it bother putting the spam on the edit form?

Alone, size is not much of an argument, but it is a fact and, depending on the implementation, may or may not be a strawman (not including the spam on the edit form severely hurts this point). Even if the edit form does not carry the weight of the extra text, the size of the normal view of the page will be increased. Many users around the world are still on dialup, have slow/unreliable connections, or pay for download bandwidth. But now I see you already have a bit of that part in the Readers section.

"If you can identify spam enough to segregate it, you can remove it." Once again, I say: arms race. Maybe you could check out MotivationEnergyAndCommunity for the longer explanation here. Why would spammers go to the effort of breaking around the anti-spam system when their spam is getting through? There's no incentive; indeed, there's a disincentive, as spamming the main page and leaving their competitions' spam at the bottom decreases the value of their spam, as you've said.

"Living with spam is stupid." The novelty of the idea is to live with spam. You can't knock it down merely by calling it stupid. If people don't like the premise, they won't use it. I won't use it. The point of this discussion is to see whether it's viable in the first place.

It is not viable; that is why I am attempting to find any way possible to shoot down the idea, including pointing out that it is stupid.

"Why would spammers even notice that you don't remove their edits?" They wouldn't. Is this not clear yet? The point is not to go to the effort of removing spam: just leave it on the page. The other way to avoid removing spam is not to let it on in the first place, but spammers do notice that, and it causes an arms race, leaving you right back where you started.

On an individual wiki basis, few spammers notice anything unless you are actively trying to annoy them (like chongqed is). They don't know if you remove their spam or if you leave their spam. It doesn't matter to them, they just keep spamming. You say "The point is not to go to the effort of removing spam." With enough protection (which sadly few wikis have by default), there is normally not a lot of spam that gets through. Any method of segregating spam from real content would have to be based on existing spam prevention methods. That leaves the only effort-saving advantage on reverting spam edits. Most wikis currently suck at that even with admin privileges, but rather than implementing the system necessary to give in to spammers, why not improve rollback/revert systems? As for the arms race, if wiki developers took antispam measures more seriously (built in rather than relying on third party plugins) the race could be ignored by the wiki users and admins. Regular updates (which even lazy admins should be installing for security purposes) would provide improved spam blocking as well as important security fixes.

I've copied a significant point of yours to the main text, by the way. Don't think I'm not appreciating your arguments. Merely hoping to help refine them. -- ChrisPurcell

I understand you want to discuss this, but as someone who fights spam so intensely, the whole idea is just insane. It is giving up, and it has consequences beyond the individual wiki. By not cleaning spam from your wiki, you promote the practice of spamming wikis. And all the irrelevant spam links on your site to shady businesses damage the effectiveness of search engine ranking systems, which is exactly what spammers want. The pollution of the internet is already horrible with splogs, GhostTowns, etc., so why add active wikis to the problem? Remember the quote at the top of this page: "WikiSpam is a wikiwide problem and won't be solved but wikiwide."

As for avoiding the arms race, the race is going to go on whether you participate or not. If you aren't in it, you are going to end up as collateral damage. To many spammers this is a game. Spammers may be slime, but the few actual spammers who write their own software are still hackers. Solving interesting puzzles such as breaking spam protection is not done only to make money; it is a challenge. There are plenty of targets for spammers on the net that are not protected. Why then are they attempting to break CAPTCHAs and disguise their edits on well-protected sites they should be able to presume are going to quickly remove the spam? Because they can. That is the same reason they will attempt to get around this proposed spam segregation.

The strongest argument against this is the fact that spam attracts spam. If you want a constant stream of spammers hitting your page, then go ahead. Some of those spams are going to get posted to the main page whether the spammer is trying to outdo your segregation rules or not (no spam identification method is perfect). And because you are not fighting anymore, users will be less likely to notice and revert spam (or move it to the segregation area) when it does make it through. -- JoeChongq?

Those are good arguments, much more what I was hoping for. I don't agree that spam that slips the filter will be ignored simply because of the filter — after all, RecentChanges will reflect only changes that are considered non-spam. They'll be ignored because the reputation of the wiki will be non-existent, and there'll be no editors. The rest of your points, I agree with. -- ChrisPurcell

For the same reason you say the reputation will be non-existent, I say people won't revert spam as carefully (if there are any people). Assuming there are editors who still care to be involved with the wiki (which I agree is unlikely), their motivation to clean spam won't be as great because normally the wiki is full of spam (even though it is segregated). If the spam doesn't disrupt the page (which it shouldn't, because any spam that sneaks through must be minor or it would have been segregated) it isn't worth the bother checking on each change and possibly cleaning it up. Not all spam is clearly identifiable as spam. Link substitution, topical keyword linking, or stolen text insertion would all be hard to detect automatically or manually as spam.

We have seen a spammer recycling existing text found elsewhere on our WikiForum. Likely this case was done manually, but it could be automated. By choosing older text and removing the signature of the original author, the new post looked relatively on topic and well written. The only reason it was discovered as spam (assuming the URL was not clearly spammy, I don't remember for sure) was that I realized the text seemed familiar. Manni had written it originally weeks before.

By not ruining the page (any non-segregated spam would not), the spam is less visible if it is not noticed right away in Recent Changes. Users will be less vigilant because it is not destructive. It also means Google continues to find legitimate content, and so the page rank of the victim wiki will not be damaged (which helps the spammer). -- JoeChongq?

