Last time around, I described the architecture of a very simple wiki system that stored pages, along with their histories and meta-data, in a database, and let users view and edit those pages over the web. In an ideal world, the next step would be to add either a work ticketing system, or an interface to version control.
But our world is far from ideal: there are pranksters out there, and spammers, and outright villains, too. The next step in our rational reconstruction of DrProject is therefore to worry about security. Bitter experience shows that it's hard---some would say impossible---to make systems secure after the fact. Security must be designed in, and tested, right from day one.
The simplest useful model of security breaks the problem down into authentication, authorization, and access control. Authentication is the process of binding a session to a stored identity. This is not the same as establishing who the user is: people can easily share user IDs and passwords, and even biometric systems can be spoofed. Instead, authentication takes something the user is, has, or knows, like a fingerprint, smart card, or password, and figures out which of the user profiles stored in the system it corresponds to.
Authorization is the determination of who can do what. Can user X read file Y? Can she append data to it? Delete it? Give someone else permission to do any of these operations? Lastly, access control is the enforcement of these rules; it's what prevents X from reading things she's not allowed to, no matter how cleverly she asks.
The first thing this simple security model does is help us think about possible attacks. To break in, an attacker must:
convince the system she's someone she's not;
get permissions she isn't supposed to have; or
bypass the controls that are supposed to prevent her from doing something.
Secondly, this model can help us build a domain model---an abstract picture of what a security system needs to contain. Here are some of the concepts we have so far:
User profile: a unique electronic identity, such as a login account. A single human being may have many profiles in the system; many human beings may have access to a single profile.
Credentials: what an actual user is, has, or knows that binds her to a user profile.
Authentication mechanism: something that finds the user profile corresponding to a set of credentials. Every authentication mechanism needs a way to say, "I don't recognize these credentials." This is usually done either by signalling an error, or by returning a specially-marked user profile.
Capability: something that can be done to something in the system, such as reading the contents of a particular file, changing the "last modified" time of a particular directory, deleting a particular user profile, and so on. The word "particular" is important here, because the system needs to distinguish the capability of deleting file X from the capability of deleting file Y (or all files).
Permission: a pairing of a user profile and a capability, i.e., a representation of user X's right to perform operation Z.
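To make these concepts concrete, here is a minimal sketch of the domain model in Python. It is an illustration only, not DrProject's actual code: the class names, the fields, and the helper is_allowed are all assumptions, and the in-memory dictionary of plain-text passwords merely stands in for a real authentication mechanism. This sketch takes the "specially-marked" route for unrecognized credentials by returning None.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class UserProfile:
    """A unique electronic identity, such as a login account."""
    user_id: str


@dataclass(frozen=True)
class Capability:
    """One operation on one particular object, e.g. ("delete", "X")."""
    operation: str
    target: str


@dataclass(frozen=True)
class Permission:
    """A pairing of a user profile and a capability."""
    profile: UserProfile
    capability: Capability


class Authenticator:
    """Maps credentials to a stored profile, or to None if unrecognized."""

    def __init__(self, known):
        # known: dict mapping (user_id, password) pairs to UserProfile objects.
        # A real system would never keep plain-text passwords like this.
        self._known = dict(known)

    def authenticate(self, user_id, password) -> Optional[UserProfile]:
        return self._known.get((user_id, password))


def is_allowed(permissions, profile, capability):
    """Access control: permit only what has been explicitly granted."""
    return Permission(profile, capability) in permissions
```

Because a Capability names both the operation and its target, permission to read file X says nothing about file Y, which is exactly the distinction the word "particular" is meant to capture.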
Let's start with user profiles. Most web-based systems start off by managing these themselves: users create uniquely-named accounts and choose passwords, which are then stored in the system's database. These schemes run into all kinds of trouble as the system grows:
People have a hard time remembering dozens of different account names and passwords, so they either forget them (which adds to the user support burden), or re-use them (which means that when one system is compromised, others can be broken as well).
The more places user information is stored, the harder it is to keep everything up to date. In our department, for example, we have to keep track of over a thousand students as they add and drop courses, change degree programs, and so on. Keeping track of all that is hard; keeping track of it twice would be a nightmare.
Managing passwords and other credentials requires a lot of tricky code. On one of the systems I used to administer, user passwords had to be at least eight characters long, with at least two non-alphabetic characters. They couldn't contain dictionary words or use simple spe11ing trix, had to be changed every three months, and couldn't be recycled within a year. This is not code you want to have to write twice...
For all these reasons, DrProject doesn't manage accounts itself. Instead, it passes the credentials users give it (such as IDs and passwords) to an external program called validate. Here in Toronto, that program checks those credentials against the host Linux system's password file (Figure 1). At Queen's, on the other hand, those credentials are checked against the university-wide Kerberos system.
Time for a quick FAQ:
Why use an external program? Why not just have the CGI program check the credentials?
The DrProject CGI program runs under the same user ID as the web server. Typically, this is a dummy account called www-data or apache which has very few privileges (to limit the damage an attacker can do by compromising the web server). We didn't want to give the web server account access to the password file, so we created a separate program that uses Unix's setuid mechanism to run under a different identity.
DrProject writes the user ID and password to validate's standard input, and validate then returns either 0 or 1. It would have been simpler to pass the user ID and password as command-line arguments, i.e., to run validate as a sub-process with validate myName myPassword. However, this would create a security hole: if an attacker with an account on the host ran the ps command at the right moment, with the right flags, she could see validate's arguments, and harvest the user ID and password. Also, more complex credentials such as digital certificates can't be represented as short strings.
Does everybody with a Unix account automatically have access to DrProject?
No. The administrator still has to tell DrProject which of the underlying Unix accounts to recognize. However, that's all the administrator has to do: when the user changes her Unix password, DrProject automatically "sees" the change.
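The hand-off to validate described in the FAQ can be sketched roughly as follows. Several details here are assumptions rather than facts about DrProject: that the user ID and password are written as two newline-terminated lines, that an exit status of 0 means "accepted", and the function name check_credentials itself. FAKE_VALIDATE is a stand-in so that the sketch can be run without the real program.

```python
import subprocess
import sys


def check_credentials(command, user_id, password, timeout=10):
    """Return True if the external validator accepts the credentials.

    Assumptions (not from the article): the validator reads the user ID and
    password as two newline-terminated lines on standard input, and reports
    success with exit status 0.
    """
    if "\n" in user_id or "\n" in password:
        return False  # would corrupt the line-based format assumed above
    result = subprocess.run(
        command,                          # hypothetical: ["/path/to/validate"]
        input=f"{user_id}\n{password}\n",  # sent on stdin, never in argv
        text=True,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        timeout=timeout,
    )
    return result.returncode == 0


# Stand-in validator for experimentation. Its expected values sit on its own
# command line only because it is a test double, not a real credential check.
FAKE_VALIDATE = [
    sys.executable, "-c",
    "import sys; u = sys.stdin.readline().strip(); "
    "p = sys.stdin.readline().strip(); "
    "sys.exit(0 if (u, p) == ('myName', 'myPassword') else 1)",
]
```

Calling check_credentials(FAKE_VALIDATE, "myName", "myPassword") returns True, and any other password returns False. The point of the design is that the real credentials travel through a pipe, so they never show up in the output of ps.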
Now that our user has authenticated, we can move on to---oops, wait a second. HTTP is a stateless protocol: each request is completely separate from every other. We don't want users to have to re-send their credentials every time they click on a link, so we have to find some way to keep track of them after they have logged in. (In fact, having the system keep track of them after they provide their credentials is what we mean by "logging in".)
The standard way to do this is with cookies. A cookie is a short piece of text that can be passed back and forth in the headers of HTTP requests and responses. If a CGI program puts a cookie in the header of an HTTP response, then the client can send it back with the next request. Used this way, the cookie acts as a session token (not, strictly speaking, a nonce, since it is presented again on every request rather than used once); like half of a torn playing card, someone can use it to re-establish their identity at some future point.
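As a rough illustration of that round trip, the sketch below uses Python's standard http.cookies module to build the response header and to parse what the client sends back. The cookie name "session" and the two helper functions are assumptions made for the example, not DrProject's own names.

```python
from http.cookies import SimpleCookie


def make_set_cookie_header(name, value):
    """Build the header line a CGI program would print in its response."""
    cookie = SimpleCookie()
    cookie[name] = value
    cookie[name]["path"] = "/"
    return cookie.output()


def read_cookie(header_value, name):
    """Extract one cookie from the Cookie header sent back by the client.

    In a CGI program that header arrives as os.environ["HTTP_COOKIE"].
    """
    cookie = SimpleCookie()
    cookie.load(header_value or "")
    morsel = cookie.get(name)
    return morsel.value if morsel is not None else None
```

The server would print the result of make_set_cookie_header("session", some_value) among its response headers, and on the next request recover the value with read_cookie(os.environ.get("HTTP_COOKIE"), "session").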
Back in the bad old days, programmers sometimes used real data as cookies: for example, I remember a system (I actually wrote it *cough*) that used the user ID as the cookie value. Of course, this meant that anyone who knew how to create an HTTP request could impersonate anyone else. Using a sequence of numbers, such as 1000, 1001, 1002, ..., doesn't help, because an attacker can still:
look at her cookie value;
add or subtract one; and
have good odds of hijacking someone else's session.
Most modern systems therefore generate a random number (or random string), and use that as a cookie. Internally, such systems store a dictionary that maps cookie values to sessions; each session has a reference to a user profile, and any other data the system needs to remember. One piece of data is when the cookie was generated: if the system is presented with a cookie that's a week old, it may decide that someone is trying a replay attack, and refuse to accept it. (This is why so many web systems throw you out if you're idle for too long.)
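A minimal sketch of such a session table might look like this. It is not DrProject's implementation: the class name SessionStore, the one-hour default lifetime, and the injectable clock (there only so that expiry can be exercised without waiting) are all assumptions. The random values come from Python's secrets module, which draws on the operating system's cryptographic random source.

```python
import secrets
import time


class SessionStore:
    """Maps unguessable cookie values to sessions, and rejects old ones."""

    def __init__(self, max_age_seconds=3600, clock=time.time):
        self._sessions = {}            # cookie value -> session record
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so expiry can be tested

    def create(self, user_id):
        """Start a session and return the cookie value that identifies it."""
        token = secrets.token_urlsafe(32)   # 32 random bytes, URL-safe text
        self._sessions[token] = {"user_id": user_id, "created": self._clock()}
        return token

    def lookup(self, token):
        """Return the session's user ID, or None if unknown or too old."""
        session = self._sessions.get(token)
        if session is None:
            return None
        if self._clock() - session["created"] > self._max_age:
            del self._sessions[token]       # too old: treat as invalid
            return None
        return session["user_id"]
```

With values like these, adding or subtracting one from your own cookie gets an attacker nowhere: the neighbouring value is almost certainly not a key in the dictionary.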
By themselves, randomly-generated session IDs aren't enough to make the system secure, because attackers can use packet sniffers and other tools to eavesdrop on network traffic. We actually offer a course (CSC309: Web Programming) that teaches students the principles behind this, so we had to ensure that DrProject wasn't vulnerable to such attacks. The solution we adopted was simple: every connection to DrProject uses HTTPS, the encrypted version of HTTP. Encrypting and decrypting messages does slow the system down, but (a) the slowdown isn't very large (a few percent), and (b) slowing down is better than plowing into a lamppost.
One last note on sessions: most people never bother to log out of systems that are less important to them than their bank or dating service. "Stale" sessions can therefore accumulate in the database over time. The space these take up isn't much of a concern (unless you're Hotmail, Yahoo, or one of the other giants), but each stale session is another point of attack. DrProject and systems like it therefore have to implement some garbage collection mechanism that sweeps through the sessions periodically, closing and deleting any that have expired.
Implementing this would be easy if DrProject were a long-running process: we'd just set a timer, and take a couple of seconds every ten minutes or so to check the timestamp on every open session. But DrProject is a CGI program: it only runs when a request comes in. Our choices were therefore:
have it do a little garbage collection each time it ran; or
use cron to run a separate program every ten minutes or so.
We chose the first option, since it was one less thing the sys admin would have to install, configure, and remember to restart. The downside is that every once in a while, a user's attempt to view a page is delayed by a second or two. However, this happens anyway because of network traffic, so people don't notice.
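The "little garbage collection on each request" option could be sketched along these lines. The function name sweep_expired, the shape of the session records, and the limit parameter (which caps how much work any single request does) are assumptions for illustration; DrProject's real sessions live in its database rather than in a dictionary.

```python
import time


def sweep_expired(sessions, max_age_seconds, now=None, limit=100):
    """Delete up to `limit` expired sessions; return how many were removed.

    `sessions` maps cookie values to dicts holding a "created" timestamp.
    """
    if now is None:
        now = time.time()
    # Collect the keys first so the dictionary isn't modified while iterating.
    expired = [
        token for token, session in sessions.items()
        if now - session["created"] > max_age_seconds
    ]
    for token in expired[:limit]:
        del sessions[token]
    return min(len(expired), limit)
```

A CGI program would call this once near the start of each request, before looking up the caller's own session. Capping the number of deletions is one way to keep the occasional delay short, at the cost of stale sessions lingering a little longer.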
Next time: how we implemented authorization and access control.