Anyone could download Cambridge researchers’ 4-million-user Facebook data set for years

INSUBCONTINENT EXCLUSIVE:
A data set of more than 3 million Facebook users and a variety of their personal details collected by Cambridge researchers was available
for anyone to download for some four years, New Scientist reports
It likely only one of many places where such huge sets of personal data collected during a period of permissive Facebook access terms have
been obtainable. The data were collected as part of a personality test, myPersonality, which, according to its own wiki (now taken down),
was operational from 2007 to 2012, but new data was added as late as August of 2016
It started as a side project by the Cambridge Psychometrics Centre David Stillwell (now deputy director there), but graduated to a more
organized research effort later
The project &has close academic links,& the site explains, &however, it is a standalone business.& (Presumably for liability purposes; the
group never charged for access to the data.) Though &Cambridge& is in the name, there no real connection to Cambridge Analytica, just a very
tenuous one through Aleksandr Kogan, which is explained below. Like other quiz apps, it requested consent to access the user profile
(friends& data was not collected), which combined with responses to questionnaires produced a rich data set with entries for millions of
users
Data collected included demographics, status updates, some profile pictures, likes and lots more, but not private messages or data from
friends. Exactly how many users are affected is a bit difficult to say: the wiki claims the database holds 6 million test results from 4
million profiles (hence the headline), though only 3.1 million sets of personality scores are in the set and far less data points are
available on certain metrics, such as employer or school
At any rate, the total number is on that order, though the same data is not available for every user. Although the data is stripped of
identifying information, such as the user actual name, the volume and breadth of it makes the set susceptible to de-anonymization, for lack
of a better term
(I should add there is no evidence that this has actually occurred; simple anonymizing processes on rich data sets are just fundamentally
more vulnerable to this kind of reassembly effort.) This data set was available via a wiki to credentialed academics who had to agree to the
team own terms of service
It was used by hundreds of researchers from dozens of institutions and companies for numerous papers and projects, including some from
Google, Microsoft, Yahoo and even Facebook itself
(I asked the latter about this curious occurrence, and a representative told me that two researchers listed signed up for the data before
working there; it unclear why in that case the name I saw would list Facebook as their affiliation, but there you have it.) This in itself
is in violation of Facebook terms of service, which ostensibly prohibited the distribution of such data to third parties
As we&ve seen over the last year or so, however, it appears to have exerted almost no effort at all in enforcing this policy, as hundreds
(potentially thousands) of apps were plainly and seemingly proudly violating the terms by sharing data sets gleaned from Facebook users. In
the case of myPersonality, the data was supposed to be distributed only to actual researchers; Stillwell and his collaborator at the time,
Michal Kosinski, personally vetted applications, which had to list the data they needed and why, as this sample application shows: I am a
full-time faculty member
[IF YOU ARE A STUDENT PLEASE HAVE YOU SUPERVISOR REQUEST ACCESS TO THE DATA FOR YOU.] I read and agree with the myPersonality Database Terms
of Use
[SERIOUSLY, PLEASE DO READ IT.] I will take responsibility for the use of the data by any students in my research group. I am planning to
use the following variables: * [LIST THE VARIABLES YOU INTEND TO * USE AND TELL US HOW * YOU PLAN TO ANALYZE THEM.] One lecturer, however,
published their credentials on GitHub in order to allow their students to use the data
Those credentials were available to anyone searching for access to the myPersonality database for, as New Scientist estimates, about four
years. This seems to demonstrate the laxity with which Facebook was policing the data it supposedly guarded
Once that data left company premises, there was no way for the company to control it in the first place, but the fact that a set of millions
of entries was being sent to any academic who asked, and anyone who had a publicly listed username and password, suggests it wasn&t even
trying. A Facebook researcher actually requested the data in violation of his own company policies
I&m not sure what to conclude from that, other than that the company was utterly uninterested in securing sets like this and far more
concerned with providing against any future liability
After all, if the app was in violation, Facebook can simply suspend it — as the company did last month, by the way — and lay the whole
burden on the violator. &We suspended the myPersonality app almost a month ago because we believe that it may have violated Facebook
policies,& said Facebook VP of product partnerships, Ime Archibong, in a statement
&We are currently investigating the app, and if myPersonality refuses to cooperate or fails our audit, we will ban it.& In a statement
provided to TechCrunch, David Stillwell defended the myPersonality project data collection and distribution. &myPersonality collaborators
have published more than 100 social science research papers on important topics that advance our understanding of the growing use and impact
of social networks,& he said
&We believe that academic research benefits from properly controlled sharing of anonymised data among the research community.& In a separate
email, Michal Kosinski also emphasized the importance of the published research based on their data set
Here a recent example looking into how people assess their own personalities versus how those who know them do, and how a computer trained
to do so performs. From the research paper based on myPersonality database
The computer performed almost as well as a spouse. &Facebook has been aware of and has encouraged our research since at least 2011,& the
statement continued
It hard to square this with Facebook allegation that the project was suspended for policy violations based on the language of its
redistribution terms, which is how a company spokesperson explained it to me
The likely explanation is that Facebook never looked closely until this type of profile data sharing became unpopular, and usage and
distribution among academics came under closer scrutiny. Stillwell said (and the Centre has specifically explained) that Aleksandr Kogan was
not in fact associated with the project; he was, however, one of the collaborators who received access to the data like those at other
institutions
He apparently certified that he did not use this data in his SCL and Cambridge Analytica dealings. The statement also says that the newest
data is six years old, which seems substantially accurate from what I can tell except, for a set of nearly 800,000 users& data regarding the
2015 rainbow profile picture filter campaign, added in August 2016
That doesn&t change much, but I thought it worth noting. Facebook has suspended hundreds of apps and services and is investigating thousands
more after it became clear in the Cambridge Analytica case that data collected from its users for one purpose was being redeployed for all
sorts of purposes by actors nefarious and otherwise
One is a separate endeavor from the Cambridge Psychometrics Centre called Apply Magic Sauce; I asked the researchers about the connection
between it and myPersonality data. The takeaway from the small sample of these suspensions and collection methods that have been made public
suggest that during its most permissive period (up until 2014 or so) Facebook allowed the data of countless users (the totals will only
increase) to escape its authority, and that data is still out there, totally out of the company control and being used by anyone for just
about anything. Researchers working with user data provided with consent aren&t the enemy, but the total inability of Facebook (and to a
certain extent the researchers themselves) to exert any kind of meaningful control over that data is indicative of grave missteps in digital
privacy. Ultimately it seems that Facebook should be the one taking responsibility for this massive oversight, but as Mark Zuckerberg
performance in the Capitol emphasized, it not really clear what taking responsibility looks like other than an appearance of contrition and
promises to do better.