Schneier on Security: A Taxonomy of Social Networking Data

Vaclav Vincalek in IT News

Copied from Prosyna on 20 Nov 2009

A Taxonomy of Social Networking Data

At the Internet Governance Forum in Sharm El Sheikh this week, there was a conversation on social networking data. Someone made the point that there are several different types of data, and it would be useful to separate them. This is my taxonomy of social networking data.

  1. Service data. Service data is the data you need to give to a social networking site in order to use it. It might include your legal name, your age, and your credit card number.

  2. Disclosed data. This is what you post on your own pages: blog entries, photographs, messages, comments, and so on.

  3. Entrusted data. This is what you post on other people's pages. It's basically the same stuff as disclosed data, but the difference is that you don't have control over the data -- someone else does.

  4. Incidental data. Incidental data is data that other people post about you. Again, it's basically the same stuff as disclosed data, but the difference is that 1) you don't have control over it, and 2) you didn't create it in the first place.

  5. Behavioral data. This is data that the site collects about your habits by recording what you do and who you do it with.

 

Different social networking sites give users different rights for each data type. Some are always private, some can be made private, and some are always public. Some can be edited or deleted -- I know one site that allows entrusted data to be edited or deleted within a 24-hour period -- and some cannot. Some can be viewed and some cannot.
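One way to make the per-type rights concrete is a small policy table. The sketch below is a hypothetical policy for one imaginary site, not a description of any real service: the `Rights` fields, the `POLICY` values, and the `may_edit` helper are all assumptions chosen for illustration. The only detail taken from the text is the 24-hour edit window for entrusted data.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from enum import Enum
from typing import Optional


class DataType(Enum):
    """The five data types from the taxonomy above."""
    SERVICE = "service"
    DISCLOSED = "disclosed"
    ENTRUSTED = "entrusted"
    INCIDENTAL = "incidental"
    BEHAVIORAL = "behavioral"


@dataclass(frozen=True)
class Rights:
    """What a user may do with one type of data (hypothetical fields)."""
    can_view: bool
    can_make_private: bool
    can_edit_or_delete: bool
    edit_window: Optional[timedelta] = None  # None means no time limit


# Assumed policy for an imaginary site; every value here is illustrative.
POLICY = {
    DataType.SERVICE: Rights(True, True, True),
    DataType.DISCLOSED: Rights(True, True, True),
    DataType.ENTRUSTED: Rights(True, False, True, timedelta(hours=24)),
    DataType.INCIDENTAL: Rights(True, False, False),
    DataType.BEHAVIORAL: Rights(False, False, False),
}


def may_edit(kind: DataType, posted_at: datetime, now: datetime) -> bool:
    """Return True if the policy lets the user edit or delete this item now."""
    rights = POLICY[kind]
    if not rights.can_edit_or_delete:
        return False
    if rights.edit_window is None:
        return True
    return now - posted_at <= rights.edit_window
```

Writing the policy as data rather than scattered conditionals makes it easy to compare two sites, or to see at a glance which data types a user has no control over at all.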

And people should have different rights with respect to each data type. It's clear that people should be allowed to change and delete their disclosed data. It's less clear what rights they have for their entrusted data. And far less clear for their incidental data. If you post pictures of a party with me in them, can I demand you remove those pictures -- or at least blur out my face? And what about behavioral data? It's often a critical part of a social networking site's business model. We often don't mind if they use it to target advertisements, but are probably less sanguine about them selling it to third parties.

As we continue our conversations about what sorts of fundamental rights people have with respect to their data, this taxonomy will be useful.

Posted on November 19, 2009 at 12:51 PM | 17 Comments


Comments

Oh, you mean the conference at which members of an NGO had their property confiscated by UN security because they had the temerity to point out that China has a firewall?

http://www.boingboing.net/2009/11/15/...

Posted by: Andre LePlume at November 19, 2009 1:04 PM


As Andreas Weigend (former chief scientist, amazon.com) points out, there is also a huge difference between data given explicitly (2, 3, and 4) and data collected implicitly (5).

Posted by: Angel one at November 19, 2009 1:19 PM


@Angel one:

I'd say there's also a huge difference between (1,2,3) and (4,5).

Posted by: David at November 19, 2009 1:25 PM


As you note, items 2,3,4 are all types of "Disclosed data" - I suggest this be made more explicit by naming as follows:

2. Disclosed data (controlled)

3. Disclosed data (entrusted)

4. Disclosed data (incidental)

Posted by: uqbar at November 19, 2009 2:07 PM


@ Bruce,

There is another type of data you need to add to your list.

That is cross-linked or inferred data.

You may not post on a social network site that you live or work in any particular place, or other details. However, the time of day you post, etc., can narrow down your location, as can other things that give indicators of other places to search.

For instance, you may have once posted (as many admins have) to a newsgroup or been mentioned in corporate news (as many execs have). The time of day you post helps narrow your location to a time zone. This can then be used to filter out other people with similar names (there are at least six people with my name that I have tracked down ;)

As in "traffic analysis", sometimes it's not the message contents that are important but the times and places it originates and ends at.

As another example, assume the person trying to track you down has credit checking (CCN) and other "marketing target" DB access, and reads on your social site that you have just purchased the latest whiz-bang 90" home movie system with 10-channel surround sound and UWB networking.

There are not many places you could have obtained it, and you may have taken out a credit agreement to buy it, or forgotten to check the "no marketing contact" box on the warranty or other paperwork. Some or all of those details will nail you down cold.

Posted by: Clive Robinson at November 19, 2009 2:30 PM
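The time-of-day inference in the comment above can be illustrated with a rough sketch. It assumes post timestamps are available in UTC, that the poster is silent for one contiguous stretch while asleep, and that this stretch starts at about 23:00 local time. The function name and both default parameters are hypothetical, and real posting patterns are far noisier than this.

```python
from collections import Counter
from datetime import datetime, timezone


def guess_utc_offset(post_times_utc, sleep_hours=8, assumed_local_bedtime=23):
    """Guess a whole-hour UTC offset from when someone is NOT posting.

    Finds the quietest contiguous window of `sleep_hours` hours (in UTC) and
    assumes it begins at `assumed_local_bedtime` local time.
    """
    counts = Counter(t.hour for t in post_times_utc)

    def window_total(start):
        # Number of posts in the window starting at `start`, wrapping past midnight.
        return sum(counts[(start + i) % 24] for i in range(sleep_hours))

    quiet_start = min(range(24), key=window_total)
    # local = utc + offset, so offset = local - utc (mod 24).
    offset = (assumed_local_bedtime - quiet_start) % 24
    # Map 13..23 onto -11..-1 so the result lies in the range -11..12.
    return offset - 24 if offset > 12 else offset


# Synthetic data: someone who posts only between 12:00 and 03:59 UTC.
active_utc_hours = list(range(12, 24)) + [0, 1, 2, 3]
posts = [
    datetime(2009, 11, day, hour, tzinfo=timezone.utc)
    for day in range(1, 4)
    for hour in active_utc_hours
]
estimated_offset = guess_utc_offset(posts)
```

An estimate like this only narrows the search to a band of longitudes, but combined with a name or an employer it can be enough to separate one person from others with similar names.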


Great taxonomy. I like!

Perhaps worth finding a better term than "incidental disclosure" to describe third-party postings etc. about oneself though; that wasn't an obvious connection to me terminology-wise.

Posted by: GregW at November 19, 2009 2:45 PM


I agree with Clive's comment about "cross-linked"/"inferred" data being different than just "personal" data.

While I am not sure inferences made about data alone (e.g. time zones of postings) warrant a separate class, I do sense there's a substantial difference between a person's disclosed information, and that same information put in a "cross-linked" form with other people's personal information.

I once created a system to gather and cross-link such data, to vastly reduce credit card billing fraud, for a fraud detection system in a prior ecommerce/telecom position.

Without going into all the details, let's just say that service data that relates just to you ("who you called/who called you") is substantially different in nature from all the data that can be linked to you via all users (named and pseudo-anonymous) of a given service for all time. (Who calls the people who call you, and who do they in turn call? ... a full graph that has distinctly different/greater value when linked with data of many other people using the same/similar service.)

A service-wide graph that links your information with others connected to you in some way (calling/communication patterns, IPs/geographic locations, similar web traffic browsing patterns, credit card aliases used, etc) contains not just your personal information but a potentially exponentially larger amount of context when linked all together.

Cross-linked data can spiral into a mess and get one nowhere, but at times it can be tremendously powerful (cf isolating/identifying friends of Saddam).

Posted by: GregW at November 19, 2009 2:55 PM
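The "who calls the people who call you" question in the comment above is a bounded graph traversal. The sketch below assumes call records arrive as simple (caller, callee) pairs and treats a call as an undirected link. The function names and the sample records are hypothetical, and this is a plain breadth-first search rather than whatever system the commenter built.

```python
from collections import defaultdict, deque


def build_graph(call_records):
    """Build an undirected contact graph from (caller, callee) pairs."""
    graph = defaultdict(set)
    for caller, callee in call_records:
        graph[caller].add(callee)
        graph[callee].add(caller)
    return graph


def within_hops(graph, start, max_hops):
    """Return {person: hop_distance} for everyone within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in graph.get(node, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


# Made-up records for illustration only.
calls = [("you", "a"), ("a", "b"), ("b", "c"), ("d", "you")]
graph = build_graph(calls)
one_hop = within_hops(graph, "you", 1)
two_hops = within_hops(graph, "you", 2)
```

Each extra hop pulls in people whose records were never "yours" in any sense, which is why the service-wide graph carries so much more context than one user's own data.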


@GregW:

Incidental data is more a case of "observational" or "reputational" data.

Bujold described the difference between Honor and Reputation... but what other people post about you is more related to reputation, since it is authored by an observer.

It can be argued that these will have varying relationships to one's "ego"...

Posted by: John Campbell at November 19, 2009 3:25 PM


@Clive Robinson, GregW

There's a blog on 'de-anonymization' out there (http://33bits.org/) where problems like that are discussed. For example, its author has a paper 'De-anonymizing Social Networks' out (http://randomwalker.info/social-networks/).

Abstract:

Operators of online social networks are increasingly sharing potentially sensitive information about users and their relationships with advertisers, application developers, and data-mining researchers. Privacy is typically protected by anonymization, i.e., removing names, addresses, etc.

We present a framework for analyzing privacy and anonymity in social networks and develop a new re-identification algorithm targeting anonymized social-network graphs. To demonstrate its effectiveness on real-world networks, we show that a third of the users who can be verified to have accounts on both Twitter, a popular microblogging service, and Flickr, an online photo-sharing site, can be re-identified in the anonymous Twitter graph with only a 12% error rate.

Our de-anonymization algorithm is based purely on the network topology, does not require creation of a large number of dummy "sybil" nodes, is robust to noise and all existing defenses, and works even when the overlap between the target network and the adversary's auxiliary information is small.

Posted by: r721 at November 19, 2009 4:05 PM


@ John Campbell,

"It can be argued that these will have varying relationships to one's "ego"..."

Hmm, "personality type" possibly: some people "live inside their own heads" (techie types), others "live inside the heads of others" (your always "networking" managers / execs / politicos / con artists / etc).

Depending on your usage of "ego" you could say the former have none whilst the latter are all ego.

Techie types tend to have "social communications" issues, which tends to get them (unfairly) marked down by others. Whereas your networking types tend to be very good at social communications (and not a lot else), which tends to get them (unfairly) marked up by others.

The desire to be "top dog" etc (i.e. egotistical) is not in most cases related to ability to do a job or social communications ability (you only have to watch the X-factor to see when ego/self-belief is badly misplaced).

From the way you phrased your use of "ego" you could also argue that the "ego" has an inverse relationship to ability.

(A point many will have sympathy for as their "networking" brethren steal the credit for their work).

Posted by: Clive Robinson at November 19, 2009 4:09 PM


@ r721,

"Our de-anonymization algorithm is based purely on the network topology... ...is robust to noise and all existing defenses,... ...and the adversary's auxiliary information is small."

And people ask me why I pay in cash and don't "twitter" or "facebook" etc etc ;)

Posted by: Clive Robinson at November 19, 2009 4:19 PM


All these kinds of information also have differing levels of veracity (insofar as that term means anything any more). The fact that someone associates your name with a photo or a note or a link may or may not mean that it actually has anything to do with you. And information posted on sites may or may not be factual (and may draw, via links, third and fourth parties who have no direct association with a particular social networking site.)

I am amused, for example, to note that the overwhelming number of people following me on Twitter (which I don't really use) are malware bots. Not sure exactly what that says about me, though.

Posted by: paul at November 19, 2009 5:08 PM


"Perhaps worth finding a better term than 'incidental disclosure' to describe third-party postings etc. about oneself though; that wasn't an obvious connection to me terminology-wise."

I would love a better term, but I can't think of one. I'll take suggestions.

Posted by: Bruce Schneier at November 19, 2009 6:54 PM


"Oh, you mean the conference at which members of an NGO had their property confiscated by UN security because they had the temerity to point out that China has a firewall?"

Yes, that's the meeting. I hadn't arrived at that point, but everyone there was deliberately not talking about the incident.

Posted by: Bruce Schneier at November 19, 2009 6:55 PM


"There is another type of data you need to add to your list. That is cross linked or inferred data."

That feels like a subset of behavioral data to me.

Posted by: Bruce Schneier at November 19, 2009 6:56 PM


Following @uqbar, I'd suggest that 2, 3 & 4 are all variants on disclosed data, and that you're missing the fourth data type in the 2D grid where 'posted by you / posted by someone else' is one axis and 'posted in your area of control / posted in someone else's area of control' is the other.

The important thing about information about you that is posted by someone else in an area outside your control is that (compared to the other three data types) you're less likely to be aware of the disclosure, and so less likely to be able to do anything about it, even if the ability to in some way object to the disclosure exists.

Posted by: David at November 19, 2009 7:19 PM
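The two-axis grid proposed in the comment above can be written down as a lookup table. In this sketch the two "incidental" labels are assumptions made up for illustration: the post itself uses a single "incidental data" category and does not say which cells it covers, and the comment does not name the fourth cell.

```python
from enum import Enum


class Author(Enum):
    YOU = "posted by you"
    SOMEONE_ELSE = "posted by someone else"


class Area(Enum):
    YOURS = "your area of control"
    SOMEONE_ELSES = "someone else's area of control"


# One cell per (author, area) pair. The two "incidental" labels are
# hypothetical names; the original taxonomy does not split this category.
GRID = {
    (Author.YOU, Area.YOURS): "disclosed",
    (Author.YOU, Area.SOMEONE_ELSES): "entrusted",
    (Author.SOMEONE_ELSE, Area.YOURS): "incidental (in your area)",
    (Author.SOMEONE_ELSE, Area.SOMEONE_ELSES): "incidental (outside your control)",
}


def classify(author: Author, area: Area) -> str:
    """Look up which kind of disclosed data a post is."""
    return GRID[(author, area)]
```

Laid out this way, the last cell stands out as the one where the subject neither created the data nor controls where it lives, which matches the point that it is the disclosure a person is least likely to notice.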


@ Bruce,

"That feels like a subset of behavioral data to me."

Your type 5 is one half of it (i.e. traffic analysis on what you do), which may be visible just to the admins of the "social network" site, or to those that have access, restricted or otherwise.

An example of the former is where, for instance, the site owner/admin "outs sockpuppets". The admins have access to data that the site does not normally show (IP address / etc).

An example of the latter has been seen on some "social network" sites. What you think is limited to just a small group (say your family) can become available to many (through their friends lists). Either directly (your post and the family member's post) or indirectly (just the family member's half of the post).

The other half of the problem is that the actual data that is cross-linked to is not on your list. In a more general case (i.e. not just "social networking") it would be a subset of your type 4 data.

That is, it is not on a "social network" site at all, or you have "deleted it" but the site has not, or there are issues to do with metadata.

Examples of the first could be a business web site such as a newspaper's online edition, an e-commerce site (such as Amazon) where you have rated a product, or a "black hat" site that has posted / made available your bank / CC details. Or, as I previously noted, commercially available data on you such as credit rating or marketing DBs.

An example of the second is "orphaned data", where "social network" sites use many servers to build a page. Photos have still been visible on their original URL on the photo server even though the link reference has been removed on the HTML body server.

An example of the third type is unauthorised access due to predictable naming in site URLs that enables a private URL to be fairly easily determined.

One example of this is where a "thumbnail" picture URL could be used to find the high resolution image just by changing the end of the URL (which has commercial implications for those wishing to charge for the high resolution images).

Another is where the URL contains a sequence number either put in by the site software or the user (such as uploading files from a digital camera and not changing the file names in the process).

Then there is the possibility of inferring "missing data". For instance, a site admin might decide to delete one or more entries on an open comments page. The fact that each entry is given a unique serial number may be used to determine how many have been removed from public display, or if in fact they have been deleted at all or merely had the links removed.

Posted by: Clive Robinson at November 19, 2009 8:48 PM
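The serial-number inference in the last paragraph of the comment above takes only a few lines. The sketch assumes the visible entries on a page carry integer IDs drawn from one sequential counter for that page. The function name and the sample IDs are made up, and gaps can also come from drafts, spam filtering, or a counter shared across pages, so a gap is a hint rather than proof of removal.

```python
def missing_ids(visible_ids):
    """Return the IDs absent between the lowest and highest visible ID."""
    present = set(visible_ids)
    if not present:
        return []
    return [i for i in range(min(present), max(present) + 1) if i not in present]


# Made-up comment IDs as they might appear on a public page.
visible = [101, 102, 104, 107]
gaps = missing_ids(visible)
```

A site that wants to avoid leaking this can use identifiers that are not sequential, which also helps with the predictable-URL problem described in the same comment.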
