According to Tim Berners-Lee, the creator of the World Wide
Web, "[g]etting people to put data on the Web often was a question of
getting
them to change perspective, from thinking of the user's access to it
not as
interaction with, say, an online library system, but as navigation
th[r]ough a
set of virtual pages in some abstract space. In this concept, users
could bookmark any
place and return to it, and could make links into any place from
another
document. This would give a feeling of
persistence, of an ongoing existence, to each page."[1]
The Web has changed quite a bit since the
early 1990s.
Today,
websites are much more dynamic and interactive, with every page being
customized for each user. Such customization
could include automatically selecting the appropriate language for the
user
based on where they're located, displaying only content that has been
added
since the last time the user visited the site, remembering a user who wants to stay logged into a site from a particular computer, or keeping track of
items in a
virtual shopping cart. These features are
simply not possible without the ability for a website to distinguish
one user
from another and to remember a user as they navigate from one page to
another. Today, in the Web 2.0 era, instead of Web pages having
persistence (as Berners-Lee
described), we have dynamic pages and "user-persistence."
This paper describes the various methods websites can use to
enable user-persistence and how this affects user privacy. But the
first thing the reader must realize is
that the Web was not initially designed to be interactive; indeed, as
the quote
above shows, the goal was the exact opposite. Yet interactivity is
critical to many of the things we all take for
granted about web content and services today.
Stateful Sessions
On
the original World Wide Web designed by Berners-Lee (Web 1.0), Web
servers
responded to each client request without relating that request to
previous
requests. There was no need to remember
what other pages the user had requested because the requests were for
static
pages. But if you've used a Web-based
email system like Gmail, Hotmail, Yahoo! Mail, etc., you know that once
you log in, the service remembers who you
are as you click from message to message. When a website can keep track
of a user as
they move from page to page within a site, it is called a "stateful
session." The website doesn't necessarily need to know
anything about the user; it just needs to be able to distinguish that
particular user from all other users. For
example, if you go to an online store and place a few items in your
virtual
shopping cart, the site still does not know your name, email address,
or
billing information. But it does know
what you've placed in your cart--or more precisely, it knows what
someone using your
browser has
placed in a particular cart. If you
leave the site before buying anything and then go back an hour later,
it's
possible that the site will have completely forgotten about you. In
that case, the unique identifier persists during your
"session" on the
site, but
it doesn't persist between
sessions.
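To make the idea concrete, here is a minimal sketch of how a site might keep a per-session shopping cart. The class name SessionStore, its methods, and the cart items are hypothetical, and a real site would also expire old sessions; the point is only that the server distinguishes browsers by a random identifier and never needs to know who the user is.

```python
import secrets


class SessionStore:
    """In-memory map from an opaque session ID to that session's cart."""

    def __init__(self):
        self._carts = {}

    def new_session(self):
        # The ID is random; it carries no name, email, or billing data.
        session_id = secrets.token_hex(16)
        self._carts[session_id] = []
        return session_id

    def add_to_cart(self, session_id, item):
        self._carts[session_id].append(item)

    def cart(self, session_id):
        # A forgotten or unknown session simply looks like an empty cart.
        return list(self._carts.get(session_id, []))

    def end_session(self, session_id):
        self._carts.pop(session_id, None)


store = SessionStore()
browser_a = store.new_session()
browser_b = store.new_session()

store.add_to_cart(browser_a, "headphones")
store.add_to_cart(browser_a, "usb cable")
store.add_to_cart(browser_b, "notebook")
print(store.cart(browser_a))   # the cart for whoever is using browser A

# Once the session ends, the site has "forgotten" that browser.
store.end_session(browser_a)
print(store.cart(browser_a))
```

How the identifier travels between the browser and the server on each request is a separate question; the rest of this paper covers the options.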
URLs and HTTP Requests
Web 1.0 sites achieve Web page persistence by
having a unique address or Uniform Resource Locator (URL) for each Web
page,
which is displayed in the address bar at the top of your browser as you
browse
the web. For example, http://www.pff.org/about/
is a simple URL pointing to a specific Web page. Every user that visits
the PFF site at
www.pff.org and clicks on the "About" link will be taken to the exact
same
page.
URLs can also store information about the user. For example,
if you search for "test" on
Google, the URL of the resulting page may look like the following:
http://www.google.com/search?q=test&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a.
The URL contains a number of different
pieces of data, separated by ampersands. There is the search query
("q=test"), the
character encoding of the input ("ie=utf-8"), the character encoding of
the
output ("oe=utf-8"), the type and language of the client
("rls=org.mozilla:en-US:official"), and the Web browser used
("client=firefox-a").[2] None of this
information can be used to
uniquely identify the user, but this basic example illustrates how URLs
can be
used to specify more than simply static Web pages--and how some information can be remembered as a user navigates a website even without using cookies. Knowing how this
works, you can create your
own advanced searches or change the way the results are formatted
(e.g., changing the language).
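As a rough illustration, the Google URL above can be taken apart and rebuilt with Python's standard urllib.parse module. Only the parameters already present in that URL are used here; swapping the search term is just an example of editing a query by hand.

```python
from urllib.parse import parse_qs, urlencode, urlsplit, urlunsplit

url = ("http://www.google.com/search?q=test&ie=utf-8&oe=utf-8&aq=t"
       "&rls=org.mozilla:en-US:official&client=firefox-a")

parts = urlsplit(url)
params = parse_qs(parts.query)      # each value is a list of strings

# Show every piece of data carried in the URL.
for name, values in params.items():
    print(name, "=", values[0])

# Change the search term and rebuild the URL.
# urlencode percent-encodes reserved characters such as ":".
params["q"] = ["cookies"]
new_url = urlunsplit(parts._replace(query=urlencode(params, doseq=True)))
print(new_url)
```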
So how did Google know I speak English and use Firefox? That
information is included in the HTTP
request
that my Web browser sends to the Google Web server when it requests a
page. HTTP requests specify (among a few other more
technical things) the desired language and a "User-Agent"
field that
includes the name
of the browser and sometimes your operating system. This information
allows websites to customize
their content for different Web browsers (e.g.,
to ensure that it displays properly). HTTP
requests also arrive with your IP address (carried by the underlying network connection rather than in the request itself) so the Web server knows where to
send its
response, and IP geolocation
allows Web servers to associate an IP address with a geographic area
(though
the area is rarely more accurate than the country or state). HTTP
requests can also contain HTTP cookies.
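The sketch below shows roughly what such a request looks like on the wire and how a server might read the language and browser from it. The request text, including the User-Agent string, is an illustrative example rather than a captured request, and the helper names parse_request and preferred_language are hypothetical.

```python
# An illustrative HTTP/1.1 request; the header values are made up.
RAW_REQUEST = (
    "GET /search?q=test HTTP/1.1\r\n"
    "Host: www.google.com\r\n"
    "User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) "
    "Gecko/2008052906 Firefox/3.0\r\n"
    "Accept-Language: en-us,en;q=0.5\r\n"
    "\r\n"
)


def parse_request(raw):
    """Split a raw request head into method, target, and a header dict."""
    head = raw.split("\r\n\r\n", 1)[0]
    request_line, *header_lines = head.split("\r\n")
    method, target, _version = request_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return method, target, headers


def preferred_language(headers, default="en"):
    """Return the first language tag listed in Accept-Language."""
    value = headers.get("accept-language", "")
    first = value.split(",")[0].split(";")[0].strip()
    return first or default


method, target, headers = parse_request(RAW_REQUEST)
print(method, target)
print(headers["user-agent"])
print(preferred_language(headers))
```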
HTTP Cookies
URLs can be used to uniquely identify individual users and
allow stateful sessions, but unless a user bookmarks the URL containing
their
unique identifier, there is no way for the site to associate the same
unique
identifier with the same user on subsequent visits. Another option is
to have users create an account
and then log in each time they access the site. The website could then
include the user's
unique ID in the URL on subsequent pages, so that the user only needs
to log in
once per session. But having to bookmark a URL or
create an account on every site you want to remember you would quickly
become
unmanageable. It would be nice if
mapping and weather websites, for example, just remembered your
location. It would be nice if the blogs you follow
remembered what post you last read and displayed only unread posts when
you
next visit their site. What was needed at
this point in the Web's evolution was a way for websites to
automatically store
a unique identifier on the user's computer and send it back to the
website
automatically[3] --which is
precisely what a cookie does.
To quote Wikipedia,
- "HTTP
cookies, or more commonly referred to as Web cookies, tracking cookies
or just
cookies, are parcels of text sent by a server to a Web client (usually
a
browser) and then sent back unchanged by the client each time it
accesses that
server. HTTP cookies are used for authenticating, session tracking
(state
maintenance), and maintaining specific information about users, such as
site
preferences or the contents of their electronic shopping carts."
A cookie can contain one or more pieces of data, a
description and/or URL for an online description of the cookie, how
long the
Web browser should store the cookie, and the domain, path, and port
that the
cookie should be limited to. Cookies can
be set to expire after a specified interval, or can be "session
cookies" that
will expire when the Web browser is closed. When a cookie expires, it
is deleted by the
Web browser. Unexpired cookies are
automatically sent back to the originating Web server when the Web
browser
makes any subsequent requests to the same server (the same domain,
path, and
port).
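A short sketch using Python's standard http.cookies module shows the round trip. The cookie names, values, and the domain shop.example are hypothetical; the first cookie is persistent (it has a lifetime), the second is a session cookie (it has none).

```python
from http.cookies import SimpleCookie

# Server side: a persistent cookie limited to one domain and path.
persistent = SimpleCookie()
persistent["uid"] = "a1b2c3"
persistent["uid"]["domain"] = "shop.example"
persistent["uid"]["path"] = "/"
persistent["uid"]["max-age"] = 30 * 24 * 3600   # keep for 30 days
set_cookie_header = persistent.output()
print(set_cookie_header)

# Server side: a session cookie has no lifetime attribute, so the
# browser discards it when it is closed.
session = SimpleCookie()
session["cart"] = "42"
print(session.output())

# Browser side: on later requests the stored cookies come back
# unchanged in a single Cookie header, which the server parses.
returned = SimpleCookie()
returned.load("uid=a1b2c3; cart=42")
print(returned["uid"].value, returned["cart"].value)
```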
Neither Web servers nor Web browsers are required to support
cookies, but a server may refuse to work with a Web browser that does
not
return the cookie(s) it sends. Cookies
do not contain any executable code and are extremely small in size.
They only contain data sent by the website and
the data is not changed by the client computer, so there generally
should be no
privacy concerns about sending a cookie back to the website that
created it
("First-party cookies").
First-Party and Third-Party Cookies
Cookies
are normally only sent to the server setting them or a server in the
same
domain (e.g., a cookie set by
mail.google.com could be shared with calendar.google.com). These are
called first-party cookies because
they're set by the site displayed in the address bar of the Web
browser. These cookies are typically used to tailor the
website for the user. Third-party
cookies, on the other hand, are typically used by advertising networks
to track
users across multiple Web sites where the networks have placed advertising--which allows the advertising network to target subsequent
advertisements to the user's presumed interests and also
to limit
the number of times a user is shown a particular ad. This targeting
allows the delivery of
"smarter" advertising that is less annoying and more informative to the
user--and therefore more valuable to the advertiser, who will be willing
to pay
websites more for their ad space. However, this targeting also raises
privacy concerns.
It is trivial for a Web page to contain images or other
components stored on servers in other domains ("third-party elements").
In fact, it is often easier to link to an
image already hosted online elsewhere than it is to host an image on
your own
website.
Examples:
- Typical first-party embedded image: <img src="graphic.jpg">
- Typical third-party embedded image: <img src="http://images.icanhascheezburger.com/completestore/2008/5/8/iizabowttod128547543706260000.jpg">
Whenever a Web browser loads a Web page or component of a
Web page, it will include in its request for that component any cookies
already
stored on the user's computer that are associated with the domain
hosting the
content. The Web server, in turn, can
send a cookie or update a cookie already existing on the user's
computer.
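The following sketch models that behavior with a toy cookie jar. The hosts news.example and ads.example and the cookie values are hypothetical, and the jar matches on the exact host name only, which is a simplification of the domain and path matching that real browsers perform.

```python
from urllib.parse import urlsplit

# A toy cookie jar: host name -> cookies stored for that host.
cookie_jar = {
    "news.example": {"prefs": "lang=en"},
    "ads.example": {"uid": "x9y8z7"},
}


def cookies_for(url):
    """Cookies the browser would attach to a request for this URL."""
    host = urlsplit(url).hostname
    return dict(cookie_jar.get(host, {}))


def store_cookie(url, name, value):
    """Record a cookie set (or updated) by the server that answered."""
    host = urlsplit(url).hostname
    cookie_jar.setdefault(host, {})[name] = value


page = "http://news.example/story.html"
components = [
    "http://news.example/graphic.jpg",   # first-party image
    "http://ads.example/banner.gif",     # third-party image
]

# Each request carries only the cookies belonging to the host it goes to.
for url in [page] + components:
    print(url, "->", cookies_for(url))

# The third-party server can set or update its own cookie in its response.
store_cookie("http://ads.example/banner.gif", "seen", "1")
print(cookies_for("http://ads.example/banner.gif"))
```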
Although
your Web browser will not send a third-party cookie to the first-party
Web
server (and it won't send a first-party cookie to the third-party Web
server),
the first-party Web server can send information to the third-party Web
server
by embedding it in the URL for the third-party content. The most common
form of this communication
between the sites you visit and the sites they rely on for content or
ads is
called a "web bug"--a small (usually 1 pixel by 1 pixel) graphic not
meant to be
noticed by the user. Its purpose is to
cause the user's Web browser to load the third-party embedded content
from the
external Web server, which will allow the third party (usually an
advertising
network) to track the user.
- Example third-party embedded web bug: <img src="http://pr.atwola.com/promoimp/237375632bb2334784833/aol">
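As a rough sketch of the mechanism, a first-party page could build such a tag by encoding whatever it wants to pass along into the image URL. The tracker address tracker.example, the parameter names page and section, and the helper functions are all hypothetical.

```python
from html import escape
from urllib.parse import parse_qs, urlencode, urlsplit


def web_bug_url(tracker_base, **info):
    """Embed first-party information in the third-party image URL."""
    return tracker_base + "?" + urlencode(info)


def web_bug_tag(url):
    """A 1x1 image tag that makes the browser request the URL."""
    return '<img src="%s" width="1" height="1" alt="">' % escape(url, quote=True)


url = web_bug_url("http://tracker.example/pixel.gif",
                  page="/music/jazz", section="music")
print(web_bug_tag(url))

# On the third-party server, the same information is read back
# from the query string of the incoming request.
received = parse_qs(urlsplit(url).query)
print(received)
```

Because the browser also attaches any cookie it holds for the third-party host to this request, the third party can tie the embedded information to a browser it has seen before.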
While this all may seem scary and invasive, the fact that a website or ad network can uniquely
identify your browser
does not mean that it has any clue who
you are. Even if you provide your name,
email address, or other personally-identifiable information to the
first-party
Web site, most sites' privacy policies state that they will not share
this
information with their advertising partners. To use a real-world
analogy, third-party
advertising is equivalent to a marketer in a mall watching you come out
of a
music store and then offering you a flyer for a concert: The marketer
may know that you're interested
in music (because you were shopping at the music store), but they have
no idea who you are. And as my colleagues Adam Thierer and Berin Szoka
explained in their Technology
Liberation Front post on Adblock Plus, websites (especially smaller
independent
websites) depend on advertising as a source of revenue and to cover
their
overhead costs.
Alternatives to Cookies
Cookies are not the only way websites can do stateful
sessions. As has already been mentioned, websites can put unique
identifiers in
URLs. But custom URLs don't last between sessions. Websites that need
to remember
users (e.g., websites that charge a fee for access) can require users to
create
an account and log into the site every time they use it.
But most websites do not require users to create an account
and log in every time. And more and more users are configuring their
Web
browsers to delete all cookies when they close the browser. In
response, Web
site operators have found other methods to uniquely identify users by
storing a
unique identifier on users' computers.
The cookie alternatives listed below are not any more or
less invasive of privacy than cookies if the user is aware of them and
manages
them the same way they manage cookies. But most Web browsers don't give
users
the same amount of control over cookie alternatives that they do over
cookies,
and few users know about these alternatives.
Per-session
cookie alternatives - These cookie alternatives are not saved to disk
and thus are not accessible after you close your Web browser.
- Hidden form fields - Web pages can contain hidden Web forms
that submit data back
to
the Web server when an on-screen button is pressed. This method is
quite
limited because it requires the user to click a specific button, and
there is
no method for saving data after you've navigated away from the site.
Beyond
these limitations, the only way to detect hidden form fields is to
inspect the
HTML code for a page. There is also no easy way to block hidden form
fields.
- window.name -
JavaScript embedded in a Web page can set or read this internal
value
that's not really used for anything else. The value can be up to 32
megabytes
in size and, once set, can be read by any website later loaded in the same browser window. Although
the only
way to detect this is to inspect the HTML code for a page, you can
disable
JavaScript.
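The hidden form field technique in the first item above can be sketched as follows. The field name sid, the form action /next, and the helper functions are hypothetical; the example only shows the identifier leaving the server inside the page and coming back when the button is pressed.

```python
from html import escape
from urllib.parse import parse_qs


def render_form(session_id):
    """Return a form that carries the identifier in a hidden field."""
    return (
        '<form method="post" action="/next">\n'
        '  <input type="hidden" name="sid" value="%s">\n'
        '  <input type="submit" value="Continue">\n'
        '</form>' % escape(session_id, quote=True)
    )


def read_submission(body):
    """Recover the identifier from the submitted form body, if any."""
    fields = parse_qs(body)
    return fields.get("sid", [None])[0]


print(render_form("a1b2c3"))

# When the user presses the button, the browser posts the hidden
# field back along with any visible ones.
posted_body = "sid=a1b2c3"
print(read_submission(posted_body))

# If the user navigates away without pressing it, nothing comes back.
print(read_submission(""))
```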
Persistent cookie
alternatives - These cookie alternatives are like cookies in that they
are saved on your computer and can be accessed even after you've closed
your
Web browser.
- Flash Cookies - Also
known as Local Shared Objects, Flash cookies require Adobe Flash to be
installed on your computer. Whereas HTTP cookies are limited to 4
kilobytes, Flash
cookies can contain up to 100 kilobytes by default and can contain an
unlimited
amount of data if the user desires. To view and delete the Flash
cookies stored
on your computer, go to this
page (although accessed via a Web page, the Flash cookies shown are
stored
on your computer). You can also permanently disable Flash cookies on
that page.
- DOM Storage - DOM
storage was designed specifically to allow Web 2.0 applications to work
offline, saving data locally when they are unable to access the host
website
and to save data that would otherwise be lost if a page is accidentally
reloaded. DOM storage is currently only implemented in Firefox (and Internet Explorer 8 Beta). If
cookies are disabled, DOM storage is also disabled. Users can also
manually
disable DOM storage even when cookies are enabled.
- userData behavior
- The userData behavior does for Internet Explorer what DOM storage
does for
Firefox. Each "document" is limited to 128 kilobytes of storage, with a
per-domain limit of 1024 kilobytes. The data is stored in Internet
Explorer's
cache and is deleted when you delete cookies using the Delete Browsing
History
dialog box.
Conclusion
This article should give you a better sense of what cookies are used for and how they work. You should now see that per-session cookies and cookie alternatives are completely harmless. Persistent cookies (and cookie alternatives) can make your Web browsing a bit easier, but deleting them will not (in most cases) cause any problems. If you are concerned about your privacy, you will need to do a bit more than just delete cookies--you also need to delete or disable the above-mentioned cookie alternatives.
[1] Tim Berners-Lee, Weaving The Web: The Original Design and Ultimate Destiny of the World Wide Web. p. 37. Harper Business (2000).
[2] http://googlesystem.blogspot.com/2006/07/meaning-of-parameters-in-google-query.html
[3] A site could also try to uniquely identify users by the IP address of their computer, but this is unreliable as there can be many computers behind a firewall sharing a single IP address.