Monday, September 22, 2008 - The Progress & Freedom Foundation Blog

Nuts and Bolts: Everything You Wanted To Know About Cookies But Were Afraid To Ask

nuts_and_bolts_logo.jpg

This is the first in a series of articles that will focus directly on technology instead of technology policy. With an average age of 57, most members of Congress were at least 30 when the IBM PC was introduced in 1981. So it is not suprising that lawmakers have difficulty with cutting-edge technology. The goal of this series is to provide a solid technical foundation for the policy debates that new technologies often trigger. No prior knowledge of the technologies involved is assumed, but no insult to the reader's intelligence is intended.

This article focuses on cookies--not the cookies you eat, but the cookies associated with browsing the World Wide Web. There has been public concern over the privacy implications of cookies since they were first developed. But to understand them , you must know a bit of history.

According to Tim Berners Lee, the creator of the World Wide Web, "[g]etting people to put data on the Web often was a question of getting them to change perspective, from thinking of the user's access to it not as interaction with, say, an online library system, but as navigation th[r]ough a set of virtual pages in some abstract space. In this concept, users could bookmark any place and return to it, and could make links into any place from another document. This would give a feeling of persistence, of an ongoing existence, to each page."[1] The Web has changed quite a bit since the early 1990s.

Today, websites are much more dynamic and interactive, with every page being customized for each user. Such customization could include automatically selecting the appropriate language for the user based on where they're located, displaying only content that has been added since the last time the user visited the site, remembering a user who wants to stay logged into a site from a particular computer, or keeping track of items in a virtual shopping cart. These features are simply not possible without the ability for a website to distinguish one user from another and to remember a user as they navigate from one page to another. Today, in the Web 2.0 era, instead of Web pages having persistence (as Berners-Lee described), we have dynamic pages and "user-persistence."

This paper describes the various methods websites can use to enable user-persistence and how this affects user privacy. But the first thing the reader must realize is that the Web was not initially designed to be interactive; indeed, as the quote above shows, the goal was the exact opposite. Yet interactivity is critical to many of the things we all take for granted about web content and services today.

Stateful Sessions

On the original World Wide Web designed by Berners-Lee (Web 1.0), Web servers responded to each client request without relating that request to previous requests. There was no need to remember what other pages the user had requested because the requests were for static pages. But if you've used a Web-based email system like Gmail, Hotmail, Yahoo! Mail, etc., you know that once you log in, the service remembers who you are as you click from message to message. When a website can keep track of a user as they move from page to page within a site it is called a "stateful session." The website doesn't necessarily need to know anything about the user, it just needs to be able to distinguish that particular user from all other users. For example, if you go to an online store and place a few items in your virtual shopping cart, the site still does not know your name, email address, or billing information. But it does know what you've placed in your cart--or more precisely, it knows what someone using your browser has placed placed in a particular cart. If you leave the site before buying anything and then go back an hour later, it's possible that the site will have completely forgotten about you. In that case, the unique identifier persists during your "session" on the site, but it doesn't persist between sessions.

URLs and HTTP Requests

Web 1.0 sites achieve Web page persistence by having a unique address or Uniform Resource Locator (URL) for each Web page, which is displayed in the address bar at the top of your browser as you browse the web. For example, http://www.pff.org/about/ is a simple URL pointing to a specific Web page. Every user that visits the PFF site at www.pff.org and clicks on the "About" link will be taken to the exact same page.

URLs can also store information about the user. For example, if you search for "test" on Google, the URL of the resulting page may look like the following: http://www.google.com/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a. The URL contains a number of different pieces of data, separated by ampersands. There is the search query ("q=test"), the character encoding of the input ("ie=utf-8"), the character encoding of the output ("oe=utf-8"), the type and language of the client ("rls=org.mozilla:en-US:official"), and the Web browser used ("client=firefox-a").[2] None of this information can be used to uniquely identify the user, but this basic example illustrates how URLs can be used to specify more than simply static Web pages--and how some information can be remembered as a user navigates a website even without using cookies. Knowing how this works, you can create your own advanced searches or change the way the results are formatted (e.g., changing the language).

So how did Google know I speak English and use Firefox? That information is included in the HTTP request that my Web browser sends to the Google Web server when it requests a page. HTTP requests specify (among a few other more technical things) the desired language and a "User-Agent" field that includes the name of the browser and sometimes your operating system. This information allows websites to customize their content for different Web browsers (e.g., to ensure that it displays properly). HTTP requests also include your IP address so the Web server knows where to send its response, and geotagging allows Web servers to associate an IP address with a geographic area (though the area is rarely more accurate than the country or state). HTTP requests can also contain HTTP cookies.

HTTP Cookies

URLs can be used to uniquely identify individual users and allow stateful sessions, but unless a user bookmarks the URL containing their unique identifier, there is no way for the site to associate the same unique identifier with the same user on subsequent visits. Another option is to have users create an account and then log in each time they access the site. The website could then include the user's unique ID in the URL on subsequent pages, so that the user only needs to log in once per session. Having to bookmark or create an account on every site you want to remember you would quickly become unmanageable. It would be nice if mapping and weather websites, for example, just remembered your location. It would be nice if the blogs you follow remembered what post you last read and displayed only unread posts when you next visit their site. What was needed at this point in the Web's evolution was a way for websites to automatically store a unique identifier on the user's computer and send it back to the website automatically[3] --which is precisely what a cookie does.

To quote Wikipedia,

"HTTP cookies, or more commonly referred to as Web cookies, tracking cookies or just cookies, are parcels of text sent by a server to a Web client (usually a browser) and then sent back unchanged by the client each time it accesses that server. HTTP cookies are used for authenticating, session tracking (state maintenance), and maintaining specific information about users, such as site preferences or the contents of their electronic shopping carts."

A cookie can contain one or more pieces of data, a description and/or URL for an online description of the cookie, how long the Web browser should store the cookie, and the domain, path, and port that the cookie should be limited to. Cookies can be set to expire after a specified interval, or can be "session cookies" that will expire when the Web browser is closed. When a cookie expires, it is deleted by the Web browser. Unexpired cookies are automatically sent back to the originating Web server when the Web browser makes any subsequent requests to the same server (the same domain, path, and port).

Neither Web servers nor Web browsers are required to support cookies, but a server may refuse to work with a Web browser that does not return the cookie(s) it sends. Cookies do not contain any executable code and are extremely small in size. They only contain data sent by the website and the data is not changed by the client computer, so there generally should be no privacy concerns about sending a cookie back to the website that created it ("First-party cookies").

First-Party and Third-Party Cookies

Cookies are normally only sent to the server setting them or a server in the same domain (e.g., a cookie set by mail.google.com could be shared with calendar.google.com). These are called first-party cookies because they're set by the site displayed in the address bar of the Web browser. These cookies are typically used to tailor the website for the user. Third-party cookies, on the other hand, are typically used by advertising networks to track users across multiple Web sites where the networks have placed advertising--which allows the advertising network to target subsequent advertisements to the user's presumed interests and also to limit the number of times a user is shown a particular ad. This targeting allows the delivery of "smarter" advertising that is less annoying and more informative to the user--and therefore more valuable to the advertiser, who will be willing to pay websites more for their ad space. However, this targeting also raises privacy concerns.

It is trivial for a Web page to contain images or other components stored on servers in other domains ("third-party elements"). In fact, it is often easier to link to an image already hosted online elsewhere than it is to host an image on your own Website.

Examples:

Whenever a Web browser loads a Web page or component of a Web page, it will include in its request for that component any cookies already stored on the user's computer that are associated with the domain hosting the content. The Web server, in turn, can send a cookie or update a cookie already existing on the user's computer.

Although your Web browser will not send a third-party cookie to the first-party Web server (and it won't send a first-party cookie to the third-party Web server), the first-party Web server can send information to the third-party Web server by embedding it in the URL for the third-party content. The most common form of this communication between the sites you visit and the sites they rely on for content or ads is called a "web bug"--a small (usually 1 pixel by 1 pixel) graphic not meant to be noticed by the user. Its purpose is to cause the user's Web browser to load the third-party embedded content from the external Web server, which will allow the third party (usually an advertising network) to track the user.

While this all may seem scary and invasive,the fact that a website or ad network can uniquely identify your browser does not mean that they have any clue who you are. Even if you provide your name, email address, or other personally-identifiable information to the first-party Web site, most sites' privacy policies state that they will not share this information with their advertising partners. To use a real-world analogy, third-party advertising is equivalent to a marketer in a mall watching you come out of a music store and then offering you a flyer for a concert: The marketer may know that you're interested in music (because you were shopping at the music store), but they have no idea who you are. And as my colleagues Adam Thierer and Berin Szoka explained in their Technology Liberation Front post on Adblock Plus, websites (especially smaller independent websites) depend on advertising as a source of revenue and to cover their overhead costs.

Alternatives to Cookies

Cookies are not the only way websites can do stateful sessions. As has already been mentioned, Websites can put unique identifiers in URLs. But custom URLs don't last between sessions. Websites that need to remember users (e.g., websites that charge a fee for access) can require users to create an account and log into the site every time they use it.

But most websites do not require users to create an account and log in every time. And more and more users are configuring their Web browsers to delete all cookies when they close the browser. In response, Web site operators have found other methods to uniquely identify users by storing a unique identifier on users' computers.

The cookie alternatives listed below are not any more or less invasive of privacy than cookies if the user is aware of them and manages them the same way they manage cookies. But most Web browsers don't give users the same amount of control over cookie alternatives that they do over cookies, and few users know about these alternatives.

Per-session cookie alternatives - These cookie alternatives are not saved to disk and thus are not accessible after you close your Web browser.

Persistent cookie alternatives - These cookie alternatives are like cookies in that they are saved on your computer and can be accessed even after you've closed your Web browser.

Conclusion

This article should give you a better sense of what cookies are used for and how they work. You should now see that per-session cookies and cookie alternatives are completely harmless. Persistent cookies (and cookie alternatives) can make your Web browsing a bit easier, but deleting them will not (in most cases) cause any problems. If you are concerned about your privacy, you will need to do a bit more than just delete cookies--you also need to delete or disable the above-mentioned cookie alternatives.


* Background graphic in logo is Copyright 2006 by Joseph Robertson. Some rights reserved.

[1] Tim Berners-Lee, Weaving The Web: The Original Design and Ultimate Destiny of the World Wide Web. p. 37. Harper Business (2000).

[2] http://googlesystem.blogspot.com/2006/07/meaning-of-parameters-in-google-query.html

[3] A site could also try to uniquely identify users by the IP address of their computer, but this is unreliable as there can be many computers behind a firewall sharing a single IP address.

posted by Adam Marcus @ 3:38 PM | E-commerce , Economics , Internet , Privacy