Crowdsourcing for Internet Transparency
I originally posted on this topic (much more briefly) on the Global Coalition for Transparent Internet Performance (GCTIP) Forum but did not receive any meaningful feedback (probably because the forum at this point appears to be more of a soapbox for its creator than an active discussion group – much like my blog, but with more readers ;)), so I will expound a bit here.
My interest in this topic was first sparked in earnest while reading Jonathan Zittrain’s The Future of the Internet And How to Stop It. Part three of the book is titled “Solutions” and while I do not agree with everything contained within, he does float at least one very sound idea for combating the ills of the Internet without breaking or altering its brilliance. That is the concept of collaborative measurement and experimentation:
What might this system look like? Roughly, it would take the form of toolkits to overcome the digital solipsism that each of our PCs experiences when it attaches to the Internet at large, unaware of the size and dimension of the network to which it connects. These toolkits would have the same building blocks as spyware, but with the opposite ethos: they would run unobtrusively on the PCs of participating users, reporting back—to a central source, or perhaps only to each other—information about the vital signs and running code of that PC that could help other PCs figure out the level of risk posed by new code. Unlike spyware, the code’s purpose would be to use other PCs’ anonymized experiences to empower the PC’s user. At the moment someone is deciding whether to run some new software, the toolkit’s connections to other machines could say how many other machines on the Internet were running the code, what proportion of machines of self-described experts were running it, whether those experts had vouched for it, and how long the code had been in the wild. It could also signal the amount of unattended network traffic, pop-up ads, or crashes the code appeared to generate. This sort of data could become part of a simple dashboard that lets the users of PCs make quick judgments about the nature and quality of the code they are about to run in light of their own risk preferences, just as motor vehicle drivers use their dashboards to view displays of their vehicle’s speed and health and to tune their radios to get traffic updates.
Shortly after I finished Prof. Zittrain’s book, I received an email from Lauren Weinstein (via the NANOG mailing list) announcing the formation of this GCTIP forum:
…this project — The “Global Coalition for Transparent Internet Performance” — is the outgrowth of a network measurement workshop meeting sponsored by Vint Cerf and Google at their headquarters in June, 2008 for a number of academic network measurement researchers and other related parties. This is the same meeting that formed the genesis of the open platform M-Lab (“Measurement Lab”) project that was recently announced (http://www.measurementlab.net).
GCTIP was the original name for the mailing list that I maintained for that Google meeting and subsequent discussions (full disclosure: I helped to organize the agenda for the meeting and also attended).
Unless we know what the performance of the Internet for any given users really is — true bandwidth performance, traffic management, port blocking, server prohibitions, Terms of Service concerns, and a wide range of other parameters, it’s impossible for anyone who uses Internet services to really know if they’re getting what they’re paying for, if their data is being handled appropriately in terms of privacy and security, and all manner of other crucial related issues.
I registered on the forum to see if anything would come of it. So far it doesn’t appear that much has – most of the threads are started by Lauren and get no response. They are, however, apparently getting hundreds of views, and some have generated a bit of conversation, so the forum may gain more visibility and participation at some point. For now it is wait and see.
About a week after the launch of the GCTIP Forums, I saw the new Herdict Web video in a post on the ISOC NY blog and promptly installed the Firefox plugin to give it a try. In short, this is a project (launched by the Berkman Center for Internet and Society (BCIS) at Harvard) which aims to document and track inaccessible websites by “herd sourcing” data from users all around the globe. This is apparently the first step in the Herdict project, which should eventually also include software to accomplish the dashboard function described in The Future of the Internet. An article on the Herdict website explains:
Our current focus is on Herdict for Network Health. Netizens will be able to report any web sites they cannot access through the Herdict website or Firefox/IE plug-in.
By aggregating individuals’ reports across the Internet, Network Health strives to create a real-time picture of network accessibility. Users will be able to read reports of inaccessibility by region or by country, or track specific web sites over time:
Using this information Network Health users will then be able to start diagnosing why sites are inaccessible – network failure, government censorship, or something else – through the OpenNet Initiative.
But Network Health is just one application in the Herdict suite. PC Health, which is still in development, will allow users to track information about their computers’ performance and compare it to the performance of other computers on the network. For example, if PC Health tells a user that her computer is running poorly in comparison to other computers on the network, it might be an indication that she has badwares in her computer. Or if PC Health finds that all the computers with a certain piece of software are running poorly, it might be an indication that the specific software is bad.
Ok, so what? Well, to start with, a tool (or group of tools) based on this approach could make overall Internet performance and architecture very transparent by setting up a kind of “bot-net for good,” where page load time, file download speed, RTT, and other pertinent but generally anonymous data could be collected and aggregated for analysis and display by and to the public at large. Because many people are likely to visit the same sites and download the same files, relative performance should be easily visible from ISP to ISP, country to country, and region to region. Having this “good spyware” do the work all but eliminates the need for end-users to collect and assess their own data, which is a current hurdle to gathering such performance data in any scalable manner. Although this particular tool (Herdict Web) only collects data on blocked or inaccessible websites, I think it effectively demonstrates the potential of this type of opt-in crowd-sourced data collection for monitoring the condition of the Internet in various regions and locales as well as overall.
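To make the idea concrete, here is a minimal sketch of what one measurement in such an opt-in agent might look like. This is purely illustrative: the report format, the privacy notes, and any collector endpoint are my own assumptions, not part of Herdict or any real project.

```python
import json
import time
import urllib.request

def measure_page_load(url):
    """Fetch a page and return elapsed wall-clock time in milliseconds."""
    start = time.monotonic()
    with urllib.request.urlopen(url, timeout=10) as resp:
        resp.read()
    return (time.monotonic() - start) * 1000.0

def build_report(url, elapsed_ms):
    """Bundle a single anonymized measurement for aggregation.
    Only coarse data on well-known public sites would be reported --
    no private URLs and no user identifiers."""
    return json.dumps({
        "site": url,
        "load_ms": round(elapsed_ms),
        "ts": int(time.time()),
        # ISP/region could be derived server-side from the source address
        # and the address then discarded, keeping reports anonymous.
    })

# Example report for an invented measurement:
print(build_report("http://www.example.com/", 142.7))
```

Aggregating millions of such reports would let anyone compare, say, median load time for the same popular site across ISPs or countries.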
One thing this method of distributed measurement could provide is third-party performance information on various ISPs and content providers. Currently, most broadband subscribers are in the dark when it comes to the performance and restrictions of their Internet connections. Most subscribers rely on “flawed tests and false speed results” at best, and more commonly on subjective impressions: “the Internet seems slow today…” Having an impartial and empirical source of performance data collected in this way may help “keep folks honest” as well as raise awareness of the level of Internet connectivity in various parts of the world. At the least, it would allow consumers to make informed decisions when purchasing a connection to the Internet.
More important (at least to geeks like me) is that such a bot-net for good could do wonders to indicate and identify where Internet problems reside and when they happen. This is two-pronged. On one hand is pro-active research, like what is being done at the Cooperative Association for Internet Data Analysis (CAIDA). They are constantly trying to find new ways to beg, borrow, and steal (well, maybe not steal) data in order to have something to analyze. The title of CAIDA principal investigator KC Claffy’s blog sums it up quite well: According to the Best Available Data.
Let’s take one of the current CAIDA projects, Spoofer, as an example. The Spoofer Project “measures the Internet’s susceptibility to spoofed source address IP packets” by enlisting Internet users to run its software, which “attempts to send a series of spoofed UDP packets to servers distributed throughout the world.” The project originally ran tests to a single “receiver” server and was recently improved greatly by increasing the number of receivers, as KC explains:
We are studying an empirical Internet question central to its security, stability, and sustainability: how many networks allow packets with spoofed (fake) IP addresses to leave their network destined for the global Internet? In collaboration with MIT, we have designed an experiment that enables the most rigorous analysis of the prevalence of IP spoofing thus far, and we need your help running a measurement to support this study.
This week Rob Beverly finally announced to nanog an update to spoofer he’s been working on for a few months. Spoofer is one of the coolest Internet measurement tool we’ve seen in a long time — especially now that he is using Ark nodes as receivers (of spoofed and non-spoofed packets), giving him 20X more path coverage than he could get with a single receiver at MIT.
Now imagine if they were able to run these tests in a full mesh of thousands or tens of thousands (hopefully even hundreds of thousands) of participants! Even if just the folks who already participate in the Spoofer project (myself included) were all able to test to each other, instead of all testing back to the same 20 or 30 receivers, the path coverage would explode. Even better would be a pre-existing open platform for transparent distributed measurement that this (or any) project could simply produce an “add-on” for, significantly increasing the potential number of participants. Such a platform could obviously have massive positive implications for the entire field of Web Science.
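The back-of-the-envelope arithmetic behind “the path coverage would explode” is simple, assuming each test exercises one distinct source-to-receiver path (the participant counts below are illustrative, not figures from the Spoofer project):

```python
def paths(participants, receivers):
    """Distinct source->receiver paths when every participant
    tests to every receiver."""
    return participants * receivers

clients = 1000
print(paths(clients, 1))            # one receiver at MIT: 1,000 paths
print(paths(clients, 30))           # ~30 Ark receivers: 30,000 paths
print(paths(clients, clients - 1))  # full mesh of participants: 999,000 paths
```

Going from a single receiver to a full mesh multiplies path coverage by roughly the number of participants, which is why even a modest opt-in platform would dwarf today’s fixed receiver sets.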
The second of the two prongs that I mentioned above is re-active troubleshooting. This open, real-time repository of Internet data could prove invaluable in certain troubleshooting circumstances. Knowing that no one can reach any hosts on example.com is very valuable to NOC or helpdesk technicians when a customer (internal or external) calls to complain that they cannot reach mail.example.com, for example. Even better would be to eliminate the call completely by allowing the end user to see that the site is broken, not their PC or connection.
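The “is it just me, or is the site down?” check could be sketched roughly as follows. The crowd data here is a hard-coded stand-in for a real Herdict-style feed, and the decision rule (majority vote) is my own simplification:

```python
import socket

def reachable(host, port=80, timeout=5):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def diagnose(host, locally_ok, crowd_reports):
    """Combine a local probe with crowd reports for the same host.
    crowd_reports is a list of booleans (True = that user could reach it)."""
    crowd_ok = sum(crowd_reports) / len(crowd_reports) > 0.5
    if locally_ok:
        return "site up"
    if crowd_ok:
        return "check your PC or connection"   # down only for you
    return "site down for everyone"            # not your fault

# Example with invented crowd data for mail.example.com:
print(diagnose("mail.example.com", locally_ok=False,
               crowd_reports=[False, False, True, False]))
```

A helpdesk dashboard built on this logic could answer the customer’s question before the phone ever rings.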
Of course, such tools could go further than network diagnostics, as Zittrain discusses in his book – to identify spam, viruses, bad code, and other host-based problems. This is the promise of enabling, or rather empowering, users to better understand the files they are working with and the network they are connected to, based on the direct experience of other users – in very near real time. I see this as a great supplement to the anti-virus, anti-spam, anti-spyware, and other current defenses. The potential here is very exciting.
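As a toy illustration of how crowd experience might flag bad code, here is a minimal outlier check along the lines of the proposed PC Health: compare one machine’s crash rate against what other machines running the same program report. All numbers and the threshold are invented for the example:

```python
import statistics

def is_outlier(my_value, crowd_values, threshold=2.0):
    """Return True if my_value is more than `threshold` standard
    deviations above the crowd mean."""
    mean = statistics.mean(crowd_values)
    stdev = statistics.stdev(crowd_values)
    if stdev == 0:
        return my_value > mean
    return (my_value - mean) / stdev > threshold

# Crashes-per-week reported by other machines running the same program:
crowd = [0, 1, 0, 2, 1, 0, 1, 0, 0, 1]
print(is_outlier(8, crowd))  # far above the crowd: likely a local problem
print(is_outlier(1, crowd))  # in line with the crowd: likely fine
```

The same comparison run the other way – all machines with a given program crashing more than the baseline – would point the finger at the program itself rather than any one PC.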
It is also quite concerning, though. For such a tool to be useful, it will need to be widely adopted, and this will of course make it a target for attack itself. In my experience, and from what I have heard and read, the majority of current PC infections are the direct result of user ignorance and laziness. Any tool along the lines of the proposed Herdict PC Health would face the same challenge of engaging and educating end users which currently frustrates efforts to keep the Internet in top shape. The folks who do not install antivirus programs, or who do not update those applications once installed, are not likely to become responsible netizens just because a new tool is released. There is the possibility, though, that the new tool could be built in such a way that it fosters this feeling of social responsibility. In fact, it would have to, in order to be truly successful.
How to create this social responsibility, this feeling of community on the Internet and then parlay that into broad engagement and education of end users through a software tool is not something I have a concrete answer for at the moment. My initial thoughts are of the social networking sites that have taken root in our society and their relation to the mailing lists and forums where I go for answers.
Take my.is, for example. It is a Lexus IS enthusiasts’ forum that I am a member of. It has been the primary source of information for my purchase, modification, and planning related to my own IS300. This information has all been provided by other members of the forum, relating their direct experiences. I have since become a fairly committed contributor to the site because I feel compelled to help the community as they have helped me. There are lots of similar online communities for different areas of interest, including many built around specific pieces of software or hardware. I imagine that these folks would be very pleased to have the troubleshooting and diagnostic information provided by the tool(s) we are discussing here, and such communities may be a good breeding ground. From there, integration into the existing social framework of the Internet could help awareness and participation grow, and ultimately the tool would create such a community around itself.
Zittrain addresses this question in The Future of the Internet:
When tools drawing on group generativity are deployed… Their success is dependent on participation, and this helps establish the legitimacy of the project both to those participating and those not. It also means that the generative uses to which the tools are put may affect the number of people willing to assist. If it turned out that the data generated and shared from a PC vital signs tool went to help design viruses, word of this could induce people to abandon their commitment to help. Powerful norms that focus collaborators toward rather than against a commitment to the community are necessary. This is an emerging form of netizenship, where tools that embed particular norms grow more powerful with the public’s belief in the norms’ legitimacy.
His seems to be an attitude of “if you build it, they will come,” and he may be quite right. People have a way of converging, especially on the Internet. Think of those funny emails that you end up receiving two or three times over; their only use is to make you smile or maybe laugh a bit, yet they spread everywhere. This leads me to believe that an easy-to-use software tool which provides obvious utility would in fact spread quite quickly – virally, like a really funny YouTube video.
Look at the adoption of Mozilla’s Firefox browser as a possible corollary. Firefox is a great browser which I find far superior to the other current options, but for the average (non-geek) user it probably does not add much obvious benefit. The average user may have been told that it is more standards-compliant or that it provides better security, but what evidence of this do they really see day to day? Despite the fact that much of its benefit is “under the hood,” Firefox has raced to take over 20% of the browser market in only five years. This is even more remarkable when you take into account that the only browser with a greater share of the market is Internet Explorer, which comes preinstalled on almost every PC sold. It makes me wonder what Firefox’s market share would be if every user had to make an active choice to install a web browser.
More to the point: let us now imagine an easy-to-install, easy-to-operate application which provides constant, unobtrusive, and easy-to-understand feedback to the user about the state of the network and the state of their PC. A tool that not only provides empirical data gathered from around the world but also instant access to feedback, comments, and notes from experts and friends alike. A tool that gives the user reliable advice before they install software or click a link. A tool that empowers users by allowing them to better understand the state of their connection and of their PC. A tool (or suite of tools) which at the same time provides open, anonymous, and invaluable data to Internet researchers across the globe. What would its adoption rate be?
When it comes to the Herdict project specifically, I have to take issue with one important point. In my opinion, a project that aims to defend and invigorate generativity and openness on the Internet (or anywhere, for that matter) should be open itself. Professor Zittrain’s clearest description of why we need such a suite of tools is also a strong criticism of developing such tools in a less-than-open manner:
The touchstone for judging such efforts should be according to the generative principle: do the solutions encourage a system of experimentation? Are the users of the system able, so far as they are interested, to find out how the resources they control—such as a PC—are participating in the environment? Done well, these interventions can lower the ease of mastery of the technology, encouraging even casual users to have some part in directing it, while reducing the accessibility of those users’ machines to outsiders who have not been given explicit and informed permission by the users to make use of them. It is automatic accessibility by outsiders—whether by vendors, malware authors, or governments—that can end up depriving a system of its generative character as its own users are proportionately limited in their own control.
While I am sure that there are many very talented folks at the BCIS, I am also positive that there are at least a few people who do not work there who would be willing and able to contribute if the project were open enough to allow it. I am not a coder, but I would really like to see this project reach its full potential. I would love to hear comments from folks in either camp – those working on Herdict or those interested in doing so.
I would also be glad to hear about any other similar projects or initiatives out there today, as well as people’s experiences with, and thoughts on, Herdict Web and the presumably forthcoming PC Health. Please feel free to leave a comment or drop me an email!
EDIT/UPDATE (27-Apr): I just stumbled across a blog post about the Herdict project from last August and discovered that Herdict PC had been launched prior to Herdict Web. I suspected this but could not confirm it until reading the post on Toolness. You can find info here. I just downloaded the tool (it looks like I am one of 60) and will try to post another update after trying it out.