Since at a technical level, the network is only aware of the computers that are attached to it, we are limited on a protocol level from going beyond the computer to the person. However, we can often assume that only a few people have access to any particular computer (possibly one), so uniquely identifying that computer is a long step towards uniquely identifying a person. Thus, over time, we have found that the IP address originating the messages has been very powerful in establishing SerialIdentity.
However, if we consider how indirect the correlation between person and IP is, things are a lot murkier. The connection is
Each linkage is loose. Therefore, many potential ways of using IPs to identify unique individuals lead to false positives and false negatives, which is problematic if we want to AvoidIllusion. Here are the two major failure classes:
Multiple IPs per person. One person may post from multiple computers, multiple networks (e.g. with a laptop), have multiple ISPs, or use anonymizing proxies, and therefore have multiple IPs. The ISP may assign a dynamic IP to the person's computer, say with dial-up, thereby giving the person an IP from a certain range of IP addresses. The ISP may use proxies to connect to the web, so that large numbers of users who have nothing to do with each other may appear to come through the same proxy. Worse, ISPs like AOL use multiple proxies, which means that dial-up users may join any one of a number of proxies on different parts of the Internet depending on the session.
Multiple people per IP. More than one person may post from one computer (home computer) or static IP (apartment, university residence, office). An ISP may be used by multiple people, so that either a dynamically assigned IP may be reassigned to different people, or two static IPs may look very similar to each other.
Nonetheless, given the current usage of the 'Net, we can at least make the looser claims that a series of messages originating from a single ISP is likely from only a small number of people (hopefully one person), and that most people only have a small number of ISPs. Since most ISPs have a small and localized portion of the wider Internet, if we can cluster a series of IPs together, that means we can cluster a series of messages together and have at least a significant claim that those belong to one person.
Of course, building a convincing case usually requires repeated consistent use of a given network locale. You need to figure out whether the person is posting from a static or a dynamic IP, and whether more than one person is using that IP or that range of IPs. This obviously requires other identifying information to be brought to bear, such as what they say about themselves and their style. So, in actuality, this statistic can only be part of a wider case that has to be weighed against the possibility it may not be correct.
To calculate the cluster statistic, we can use the notion of NetworkDistance. The NetworkDistance is how far apart on the Internet two nodes are. Just based on the IP numbers, this is usually obvious. If the two IPs are numerically equivalent, they are nearly always the same machine. If the first three octets of the two IPs are equivalent, the machines are likely very close. However, that heuristic is not always reliable, as large ISPs such as AOL rotate the bottom two octets very frequently, and the second octet often, and even the whole Class A network occasionally.
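As a rough illustration of the octet heuristic just described, here is a minimal sketch in Python. The function name `shared_prefix_octets` is hypothetical, the addresses come from the reserved documentation ranges, and counting shared leading octets is only one possible way to turn "numerically close" into a number.

```python
import ipaddress


def shared_prefix_octets(ip_a: str, ip_b: str) -> int:
    """Count how many leading octets two IPv4 addresses share (0 to 4).

    4 means the addresses are identical, 3 means only the last octet
    differs, and so on. This is a crude stand-in for NetworkDistance.
    """
    a = ipaddress.IPv4Address(ip_a).packed  # four raw bytes
    b = ipaddress.IPv4Address(ip_b).packed
    count = 0
    for x, y in zip(a, b):
        if x != y:
            break
        count += 1
    return count


if __name__ == "__main__":
    print(shared_prefix_octets("192.0.2.10", "192.0.2.200"))   # 3
    print(shared_prefix_octets("192.0.2.10", "198.51.100.10"))  # 0
```

A higher count suggests closer machines, but as noted above, an ISP that rotates the lower octets will defeat this measure.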
Consider that IP addresses only reflect leaves on the network, and the leaves are not very stable (as described above). However, we can also use the routed path from the IP address to the server to determine distance. If you do a TraceRoute from various IPs to the server, where do they intersect? If the answer is at your server, they are quite distinct, but the further down the chain they intersect (that is, the closer to the leaves), the more likely they come from one ISP, and quite possibly from one city or even one neighbourhood served by that ISP.
Using the traceroute information is actually better than IPs, as IPspace does not correlate directly to the actual network topology of the Internet, whereas the traceroute data does. However, traceroutes take a long time to conduct, as one must wait for each router along the way to respond, although you really only need to do one per IP and cache the result. If you are really hurting for resources, you only need to do a traceroute for IP clusters (<256 distance in IPspace) that edit a lot, since you need many transactions to build a SerialIdentity, although this is more likely to result in a false negative. Then again, traceroutes are also not stable, so one may want to build a CompositeTraceRoute over time, or expire the cache after a timeout, such as 30 days.
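To make the intersection idea concrete, here is a small sketch that compares two already-collected routes. It assumes each route is stored as a list of hop names ordered from the server outward to the leaf; the hop names and the functions `shared_hops` and `hops_below_intersection` are made up for illustration, and no actual traceroute is run.

```python
from typing import List, Tuple


def shared_hops(path_a: List[str], path_b: List[str]) -> int:
    """Number of leading hops (starting at the server) common to both paths."""
    count = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        count += 1
    return count


def hops_below_intersection(path_a: List[str], path_b: List[str]) -> Tuple[int, int]:
    """Hops remaining from the point where the paths split to each leaf.

    Small numbers mean the routes intersect close to the leaves, which
    hints at a shared ISP; large numbers mean they split near the server.
    """
    n = shared_hops(path_a, path_b)
    return len(path_a) - n, len(path_b) - n


if __name__ == "__main__":
    # Hypothetical routes, server first, leaf last.
    to_a = ["server", "core-1", "core-2", "isp-x-gw", "leaf-a"]
    to_b = ["server", "core-1", "core-2", "isp-x-gw", "leaf-b"]
    to_c = ["server", "core-9", "leaf-c"]
    print(hops_below_intersection(to_a, to_b))  # (1, 1)
    print(hops_below_intersection(to_a, to_c))  # (4, 2)
```

In the first comparison the routes only part ways at the last hop, so the two leaves plausibly sit behind the same ISP gateway. In the second, the routes share nothing but the server itself.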
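The caching suggestion can be sketched as follows. This assumes you already have some function that performs a traceroute for an IP and returns its hops; it is passed in as `trace` and is not implemented here. The class name `TracerouteCache` and its parameters are hypothetical, and the 30-day default simply mirrors the figure mentioned above.

```python
import time
from typing import Callable, Dict, List, Tuple

THIRTY_DAYS = 30 * 24 * 60 * 60  # seconds


class TracerouteCache:
    """Remember one traceroute per IP and redo it after a timeout."""

    def __init__(
        self,
        trace: Callable[[str], List[str]],
        ttl_seconds: float = THIRTY_DAYS,
        clock: Callable[[], float] = time.time,
    ) -> None:
        self._trace = trace      # the slow lookup, supplied by the caller
        self._ttl = ttl_seconds
        self._clock = clock      # injectable so expiry can be tested
        self._entries: Dict[str, Tuple[float, List[str]]] = {}

    def get(self, ip: str) -> List[str]:
        now = self._clock()
        entry = self._entries.get(ip)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]      # still fresh, skip the slow traceroute
        path = self._trace(ip)
        self._entries[ip] = (now, path)
        return path
```

Whether to expire entries or merge them into a composite route is a design choice this sketch does not settle; it only shows the expiry variant.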
One can also group the IPs using various geographic IP databases to locate users at least in the same city or region of the world.
Finally, a very simple method to identify related users is to put a cookie on their computer with a (very large, very sparse, random) nonce. You may want to further seal this nonce by cryptographically hashing it together with some of the environment variables their browser sends to you plus a secret salt on the server. While it's easy to spoof this method (delete your cookie, change your environment variables), those who do not bother will be immediately identified. It may also be significant if a single IP or a subnet repeatedly asks for new nonces, since that might be the same person who automatically deletes cookies; then again, it may be their university network that does not save cookies between sessions. But you could publish that information as well.
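A minimal sketch of such a sealed nonce follows. It uses HMAC-SHA256 from the Python standard library as the "hash plus secret salt", which is one reasonable reading of the description rather than the only one. The cookie layout (`nonce.tag`), the choice of the User-Agent string as the only environment variable, and all function names are assumptions made for illustration.

```python
import hashlib
import hmac
import secrets
from typing import Optional

# Hypothetical server-side secret; a real server would persist it.
SERVER_SECRET = secrets.token_bytes(32)


def seal(nonce: str, user_agent: str, secret: bytes = SERVER_SECRET) -> str:
    """Bind the nonce to a browser-supplied value using a keyed hash."""
    message = f"{nonce}|{user_agent}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def make_cookie(user_agent: str, secret: bytes = SERVER_SECRET) -> str:
    """Create a fresh cookie value of the form '<nonce>.<tag>'."""
    nonce = secrets.token_hex(16)  # 128 random bits: large and sparse
    return f"{nonce}.{seal(nonce, user_agent, secret)}"


def check_cookie(
    cookie: str, user_agent: str, secret: bytes = SERVER_SECRET
) -> Optional[str]:
    """Return the nonce if the seal matches this browser, else None."""
    nonce, separator, tag = cookie.partition(".")
    if not separator:
        return None
    expected = seal(nonce, user_agent, secret)
    if hmac.compare_digest(tag.encode("utf-8"), expected.encode("utf-8")):
        return nonce
    return None
```

A cookie copied to a browser that sends a different User-Agent fails the check, as does a cookie with a hand-edited nonce. A visitor who simply deletes the cookie gets a new nonce, which is the spoofing weakness already noted.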
Publishing IPs on your AuditTrail is very powerful, and it is very simple to do. Publishing domain names is equally simple, and even better as they often carry identifying information such as the company name or the country, although the reverse domain lookup can be computationally expensive. It's unacceptable to publish IPs sometimes and domains at other times, as it disrupts the correlation, so if you see your server often failing to do a proper domain name lookup, just publish IPs. For sites whose patrons are not concerned about what they say being tracked so closely to them (say posting from an employer's machine), or where the probability of encountering someone willing to crack your computer remains low, this is your best option in terms of FeatureKarma.
However, the real benefit of thinking in terms of network distance rather than network location is that it is no longer necessary to publish a person's exact location on the network. If you think of it this way, we do not need to publish the external frame we use to correlate transactions, just how much those transactions are correlated on the network. For those systems that are concerned about the privacy of their patrons, this could mean a lot. While this means that a lot of information is lost, such as the probable geographic location of the individuals, and while the approach is hampered by the limitations of the algorithms, it could be the difference between having nothing (and thus having the community trust eviscerated) and having a way to correlate SockPuppets without sacrificing other ethical requirements. Such an algorithm also allows genuine users to disprove false (possibly malicious) claims of IdentityDeception.
If we assume that IPs are stored on the server, publishing the NetworkDistance between transactions can be done through a simple query interface, where detectives give two interactions and ask how far apart they are. Measures could be which octets differ and by how much (A.B.C.D - W.X.Y.Z = 0.0.0.16) as well as pictographically showing where the TraceRoutes intersect, e.g.
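The octet measure could be computed along these lines. This is only a sketch: the function name `octet_difference` is hypothetical, the example addresses are from the reserved documentation range, and taking the absolute difference per octet is an assumption, since the notation above does not say how a negative difference should be shown.

```python
import ipaddress


def octet_difference(ip_a: str, ip_b: str) -> str:
    """Per-octet absolute difference, in the A.B.C.D - W.X.Y.Z style."""
    a = ipaddress.IPv4Address(ip_a).packed
    b = ipaddress.IPv4Address(ip_b).packed
    return ".".join(str(abs(x - y)) for x, y in zip(a, b))


if __name__ == "__main__":
    print(octet_difference("192.0.2.40", "192.0.2.24"))  # 0.0.0.16
```

The query interface would return only this difference, never the two addresses themselves.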
Server . . . . A
X
Server . . . . B

Same carrier (no information)
Server . . . . . . A
X
Server . . . . . . B

Server . . . . A
X X X X
Server . . . . B
However, using this pointwise comparison doesn't make SockPuppets transparently visible (rather, one has to dig), and it is very difficult to do for the casual PeerReviewer.
Another simple idea is simply to sort all the transactions by the numeric value of the IPs that made them. This will lump together transactions by one IP as well as those made from the same ISP. The ranking may wish to visually distinguish (say by inserting spaces) at least where the second or third octets differ, to avoid clumping together two individuals' contributions from neighbouring locations in IPspace.
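A sketch of such a sorted listing follows, assuming transactions are available as (IP, description) pairs. The function name `sorted_with_gaps`, the blank line as separator, and the default of grouping on the first three octets are all illustrative choices.

```python
import ipaddress
from typing import Iterable, List, Tuple


def sorted_with_gaps(
    transactions: Iterable[Tuple[str, str]], group_octets: int = 3
) -> List[str]:
    """Sort (ip, description) pairs numerically by IP.

    A blank line is inserted wherever the leading `group_octets` octets
    change, so neighbouring networks are not visually lumped together.
    """
    ordered = sorted(
        transactions, key=lambda t: int(ipaddress.IPv4Address(t[0]))
    )
    lines: List[str] = []
    previous_group = None
    for ip, description in ordered:
        group = ipaddress.IPv4Address(ip).packed[:group_octets]
        if previous_group is not None and group != previous_group:
            lines.append("")  # visual break between groups
        lines.append(f"{ip}  {description}")
        previous_group = group
    return lines


if __name__ == "__main__":
    log = [
        ("198.51.100.7", "edit C"),
        ("192.0.2.20", "edit B"),
        ("192.0.2.3", "edit A"),
    ]
    print("\n".join(sorted_with_gaps(log)))
```

Sorting on the integer value rather than the text matters: as strings, "192.0.2.20" would sort before "192.0.2.3".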
Alternatively, we could simply publish a cryptographic hash of the IP that respects the critical data without giving away everything. One proposed scheme uses one hash that maps the first three octets to a number distributed evenly over the range, then another hash that maps all four bytes to a small portion of the range centred on zero (modulo the range) -- say, a Gaussian distribution. Then you add the two hashes together. Hence, the general region of the final hash tells you about the first three bytes, and since the second small-range hash is computed over the whole address rather than just the final byte, you can't attack it by building up a dictionary of the 256 combinations.
This leads to cryptographic questions. We prefer that people cannot create dictionaries to attack the hash. Although the main culprits to defend against are Google and the WebArchive, and it's unlikely anyone will be inclined to attack your hash, if someone does do this, they will have a major information weapon with which to extort the community. First, if you can build a dictionary of the first hash, then you've breached everyone's privacy anyway, because you can work out with a fair degree of certainty what ISP they have, which could be really bad if they are posting from work. To make this more difficult, we can salt the hash with a secret key known only to the server; this should be large, since an attacker will know at least his own IP and the response for that. Second, the hashes must not interfere with each other in ways that reduce the hashspace, as the second hash partially contains the same information as the first. Third, users with dynamic IPs or (even better) IPs in different parts of the network will be able to build a profile of the hashspace, which might help them crack it.
Note that there still may be privacy concerns. If your OnlineCommunity has a tendency to attract controversial contributions, you may not want to do this at all. For instance, consider two employees posting from the same company server. If one makes a comment that the other thinks merits discipline from the employer, such as insulting the company or even merely making deeply offensive comments, retaliation could be sought (e.g. asking IT to track usage of the OnlineCommunity from within the company to identify the culprit).
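One way the two-part hash might look in practice is sketched below. Several things here are assumptions rather than part of the proposed scheme: HMAC-SHA256 with a server-side secret is used for both hashes, the range is 2**32, and the "small, roughly Gaussian" offset is approximated by summing 16 digest bytes and subtracting their mean, which gives a bell-shaped value between -2040 and 2040. All names are hypothetical.

```python
import hashlib
import hmac
import ipaddress

RANGE = 2 ** 32          # size of the published hash space (assumption)
OFFSET_BYTES = 16        # bytes summed to get a bell-shaped offset
OFFSET_MEAN = OFFSET_BYTES * 255 // 2  # 2040, the mean of that sum


def _digest(secret: bytes, label: bytes, data: bytes) -> bytes:
    """Keyed hash; the label keeps the two hashes independent."""
    return hmac.new(secret, label + data, hashlib.sha256).digest()


def region(ip: str, secret: bytes) -> int:
    """First hash: first three octets, spread evenly over the range."""
    packed = ipaddress.IPv4Address(ip).packed
    digest = _digest(secret, b"net:", packed[:3])
    return int.from_bytes(digest[:8], "big") % RANGE


def offset(ip: str, secret: bytes) -> int:
    """Second hash: whole address, small value centred on zero."""
    packed = ipaddress.IPv4Address(ip).packed
    digest = _digest(secret, b"host:", packed)
    return sum(digest[:OFFSET_BYTES]) - OFFSET_MEAN


def published_hash(ip: str, secret: bytes) -> int:
    """Sum of the two hashes, modulo the range."""
    return (region(ip, secret) + offset(ip, secret)) % RANGE


def circular_gap(hash_a: int, hash_b: int) -> int:
    """Distance between two published hashes, allowing for wrap-around."""
    d = abs(hash_a - hash_b)
    return min(d, RANGE - d)
```

Two addresses that share their first three octets always land within 4080 of each other under these parameters, while unrelated networks will usually land far apart, although nothing prevents an occasional coincidence. How wide the offset should be relative to the range is a tuning question this sketch does not answer.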
A very closely related project is [Hubble] by Arvind Krishnamurthy and Ethan Katz-Bassett.