|  | Mandatory File Locking For The Linux Operating System | 
|  |  | 
|  | Andy Walker <andy@lysaker.kvaerner.no> | 
|  |  | 
|  | 15 April 1996 | 
|  | (Updated September 2007) | 
|  |  | 
|  | 0. Why you should avoid mandatory locking | 
|  | ----------------------------------------- | 
|  |  | 
|  | The Linux implementation is prey to a number of difficult-to-fix race | 
|  | conditions which make it undependable in practice: | 
|  |  | 
|  | - The write system call checks for a mandatory lock only once | 
|  |   at its start.  It is therefore possible for a lock request to | 
|  |   be granted after this check but before the data is modified. | 
|  |   A process may then see file data change even while a mandatory | 
|  |   lock was held. | 
|  | - Similarly, an exclusive lock may be granted on a file after | 
|  |   the kernel has decided to proceed with a read, but before the | 
|  |   read has actually completed, and the reading process may see | 
|  |   the file data in a state which should not have been visible | 
|  |   to it. | 
|  | - Races of the same kind make the claimed mutual exclusion between | 
|  |   locks and mmap unreliable as well. | 
|  |  | 
|  | 1. What is mandatory locking? | 
|  | ----------------------------- | 
|  |  | 
|  | Mandatory locking is kernel-enforced file locking, as opposed to the more usual | 
|  | cooperative file locking used to guarantee sequential access to files among | 
|  | processes. File locks are applied using the flock() and fcntl() system calls | 
|  | (and the lockf() library routine, which is a wrapper around fcntl()). It is | 
|  | normally a process's responsibility to check for locks on a file it wishes to | 
|  | update, before applying its own lock, updating the file and unlocking it again. | 
|  | The most commonly used example of this (and in the case of sendmail, the most | 
|  | troublesome) is access to a user's mailbox. The mail user agent and the mail | 
|  | transfer agent must guard against updating the mailbox at the same time, and | 
|  | prevent reading the mailbox while it is being updated. | 
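
The cooperative cycle described above (take the lock, update the file, release
the lock) can be sketched in a few lines of Python. This is only an
illustration, not code from any mail program: the function name
append_with_advisory_lock() is made up for the example, and Python's
fcntl.lockf() is used because it issues fcntl() record locks underneath.

```python
import fcntl
import os


def append_with_advisory_lock(path, data):
    """Append `data` (bytes) to `path` while holding an exclusive fcntl() lock.

    The lock is advisory: it only excludes other processes that also
    ask for a lock on the same file before touching it.
    """
    fd = os.open(path, os.O_WRONLY | os.O_APPEND | os.O_CREAT, 0o600)
    try:
        # Blocks until no other process holds a conflicting lock.
        fcntl.lockf(fd, fcntl.LOCK_EX)
        try:
            return os.write(fd, data)
        finally:
            fcntl.lockf(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A process that writes to the file without calling a function like this one is
not stopped by the lock, which is exactly the weakness of advisory locking.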
|  |  | 
|  | In a perfect world all processes would use and honour a cooperative, or | 
|  | "advisory" locking scheme. However, the world isn't perfect, and there's | 
|  | a lot of poorly written code out there. | 
|  |  | 
|  | In trying to address this problem, the designers of System V UNIX came up | 
|  | with a "mandatory" locking scheme, whereby the operating system kernel would | 
|  | block attempts by a process to write to a file that another process holds a | 
|  | "read" -or- "shared" lock on, and block attempts to either read or write to a | 
|  | file that a process holds a "write" -or- "exclusive" lock on. | 
|  |  | 
|  | The System V mandatory locking scheme was intended to have as little impact as | 
|  | possible on existing user code. The scheme is based on marking individual files | 
|  | as candidates for mandatory locking, and using the existing fcntl()/lockf() | 
|  | interface for applying locks just as if they were normal, advisory locks. | 
|  |  | 
|  | Note 1: In saying "file" in the paragraphs above I am actually not telling | 
|  | the whole truth. System V locking is based on fcntl(). The granularity of | 
|  | fcntl() is such that it allows the locking of byte ranges in files, in addition | 
|  | to entire files, so the mandatory locking rules also have byte level | 
|  | granularity. | 
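
To make the byte-range granularity of Note 1 concrete, here is a small Python
sketch. The helper names try_lock_range() and probe_from_child() are
hypothetical. The probe runs in a forked child because fcntl() locks belong to
a process, so a process never conflicts with its own locks.

```python
import errno
import fcntl
import os


def try_lock_range(fd, start, length):
    """Try to take an exclusive lock on `length` bytes at offset `start`.

    Returns True if the lock was granted, False if another process
    holds a conflicting lock on any byte of the range.
    """
    try:
        fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, length, start, os.SEEK_SET)
        return True
    except OSError as exc:
        # A conflict is reported as EACCES or EAGAIN, depending on the system.
        if exc.errno in (errno.EACCES, errno.EAGAIN):
            return False
        raise


def probe_from_child(path, start, length):
    """Fork a child that tries to lock the range; return whether it succeeded."""
    pid = os.fork()
    if pid == 0:
        status = 2  # reported if anything unexpected goes wrong
        try:
            fd = os.open(path, os.O_RDWR)
            status = 0 if try_lock_range(fd, start, length) else 1
        finally:
            os._exit(status)
    _, wait_status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(wait_status) == 0
```

If the parent holds a lock on bytes 0 to 9, a child asking for bytes 5 to 14 is
refused, while a child asking for bytes 10 to 19 gets its lock.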
|  |  | 
|  | Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite | 
|  | borrowing the fcntl() locking scheme from System V. The mandatory locking | 
|  | scheme is defined by the System V Interface Definition (SVID) Version 3. | 
|  |  | 
|  | 2. Marking a file for mandatory locking | 
|  | --------------------------------------- | 
|  |  | 
|  | A file is marked as a candidate for mandatory locking by setting the group-id | 
|  | bit in its file mode but removing the group-execute bit. This is an otherwise | 
|  | meaningless combination, and was chosen by the System V implementors so as not | 
|  | to break existing user programs. | 
|  |  | 
|  | Note that the group-id bit is usually automatically cleared by the kernel when | 
|  | a setgid file is written to. This is a security measure. The kernel has been | 
|  | modified to recognize the special case of a mandatory lock candidate and to | 
|  | refrain from clearing this bit. Similarly the kernel has been modified not | 
|  | to run mandatory lock candidates with setgid privileges. | 
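
The mode test described above (setgid set, group-execute clear) is easy to get
wrong, so here is a minimal Python sketch of it. The function names are
invented for this example. Note also that chmod() may refuse to set the setgid
bit, or drop it silently, if the caller is not allowed to set it on the file.

```python
import os
import stat


def is_mandatory_lock_candidate(mode):
    """True if `mode` has setgid set and group-execute clear."""
    return (mode & (stat.S_ISGID | stat.S_IXGRP)) == stat.S_ISGID


def candidate_mode(mode):
    """Return `mode` with setgid added and group-execute removed."""
    return (mode | stat.S_ISGID) & ~stat.S_IXGRP


def mark_for_mandatory_locking(path):
    """Apply the candidate mode to `path` (the effect of chmod g+s,g-x)."""
    current = stat.S_IMODE(os.stat(path).st_mode)
    os.chmod(path, candidate_mode(current))
```

For example, candidate_mode(0o664) gives 0o2664, which is a candidate, whereas
0o2775 is an ordinary setgid executable and is not.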
|  |  | 
|  | 3. Available implementations | 
|  | ---------------------------- | 
|  |  | 
|  | I have considered the implementations of mandatory locking available with | 
|  | SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. | 
|  |  | 
|  | Generally I have tried to make the most sense out of the behaviour exhibited | 
|  | by these three reference systems. There are many anomalies. | 
|  |  | 
|  | All the reference systems reject all calls to open() for a file on which | 
|  | another process has outstanding mandatory locks. This is in direct | 
|  | contravention of SVID 3, which states that only calls to open() with the | 
|  | O_TRUNC flag set should be rejected. The Linux implementation follows the SVID | 
|  | definition, which is the "Right Thing", since only calls with O_TRUNC can | 
|  | modify the contents of the file. | 
|  |  | 
|  | HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not | 
|  | just mandatory locks. That would appear to contravene POSIX.1. | 
|  |  | 
|  | mmap() is another interesting case. All the operating systems mentioned | 
|  | prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX | 
|  | also disallows advisory locks for such a file. SVID actually specifies the | 
|  | paranoid HP-UX behaviour. | 
|  |  | 
|  | In my opinion only MAP_SHARED mappings should be immune from locking, and then | 
|  | only from mandatory locks - that is what is currently implemented. | 
|  |  | 
|  | SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for | 
|  | mandatory locks, so reads and writes to locked files always block when they | 
|  | should return EAGAIN. | 
|  |  | 
|  | I'm afraid that this is such an esoteric area that the semantics described | 
|  | below are just as valid as any others, so long as the main points seem to | 
|  | agree. | 
|  |  | 
|  | 4. Semantics | 
|  | ------------ | 
|  |  | 
|  | 1. Mandatory locks can only be applied via the fcntl()/lockf() locking | 
|  | interface - in other words the System V/POSIX interface. BSD-style | 
|  | locks using flock() never result in a mandatory lock. | 
|  |  | 
|  | 2. If a process has locked a region of a file with a mandatory read lock, then | 
|  | other processes are permitted to read from that region. If any of these | 
|  | processes attempts to write to the region it will block until the lock is | 
|  | released, unless the process has opened the file with the O_NONBLOCK | 
|  | flag in which case the system call will return immediately with the error | 
|  | status EAGAIN. | 
|  |  | 
|  | 3. If a process has locked a region of a file with a mandatory write lock, all | 
|  | attempts to read or write to that region block until the lock is released, | 
|  | unless a process has opened the file with the O_NONBLOCK flag in which case | 
|  | the system call will return immediately with the error status EAGAIN. | 
|  |  | 
|  | 4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has | 
|  | any mandatory locks owned by other processes will be rejected with the | 
|  | error status EAGAIN. | 
|  |  | 
|  | 5. Attempts to apply a mandatory lock to a file that is memory mapped and | 
|  | shared (via mmap() with MAP_SHARED) will be rejected with the error status | 
|  | EAGAIN. | 
|  |  | 
|  | 6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) | 
|  | that has any mandatory locks in effect will be rejected with the error status | 
|  | EAGAIN. | 
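
From the point of view of an ordinary program, rules 2 and 3 mean that a
write() on a file opened with O_NONBLOCK may fail with EAGAIN. The following
Python sketch shows one way a caller might handle that. The function name is
hypothetical, and the EAGAIN branch is only reached on a system where
mandatory locks are actually being enforced on the file; elsewhere the write
simply succeeds.

```python
import os


def write_or_report_locked(path, data, offset=0):
    """Write `data` at `offset` without blocking on a mandatory lock.

    Returns the number of bytes written, or None if the kernel
    refused the write with EAGAIN.
    """
    fd = os.open(path, os.O_WRONLY | os.O_NONBLOCK)
    try:
        return os.pwrite(fd, data, offset)
    except BlockingIOError:
        # EAGAIN: another process holds a conflicting mandatory lock.
        return None
    finally:
        os.close(fd)
```

A caller that gets None back can retry later or give up, instead of hanging
until the lock holder lets go.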
|  |  | 
|  | 5. Which system calls are affected? | 
|  | ----------------------------------- | 
|  |  | 
|  | Those which read or modify a file's contents, not just the inode. That gives | 
|  | read(), write(), readv(), writev(), open(), creat(), mmap(), truncate() and | 
|  | ftruncate(). truncate() and ftruncate() are considered to be "write" actions | 
|  | for the purposes of mandatory locking. | 
|  |  | 
|  | The affected region is usually defined as stretching from the current position | 
|  | for the total number of bytes read or written. For the truncate calls it is | 
|  | defined as the bytes of a file removed or added (we must also consider bytes | 
|  | added, as a lock can specify just "the whole file", rather than a specific | 
|  | range of bytes.) | 
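
The definition of the affected region can be written down as simple
arithmetic. The Python sketch below only illustrates the rule as worded above
and is not a transcription of the kernel code. All three function names are
made up, and a lock length of 0 is taken to mean "to the end of the file,
however large it grows", as it does for fcntl().

```python
def io_region(position, count):
    """Half-open byte range [start, end) touched by a read or write."""
    return (position, position + count)


def truncate_region(old_size, new_size):
    """Half-open byte range removed or added by a truncate call."""
    return (min(old_size, new_size), max(old_size, new_size))


def conflicts(region, lock_start, lock_len):
    """True if `region` overlaps the locked range.

    lock_len == 0 means the lock extends to the end of the file and beyond.
    """
    start, end = region
    if start == end:
        return False  # nothing is read, written, removed or added
    lock_end = float("inf") if lock_len == 0 else lock_start + lock_len
    return start < lock_end and lock_start < end
```

Growing a 100-byte file to 200 bytes touches bytes 100 to 199. That does not
conflict with a lock on bytes 0 to 99, but it does conflict with a whole-file
lock, which is why the added bytes have to be considered.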
|  |  | 
|  | Note 3: I may have overlooked some system calls that need mandatory lock | 
|  | checking in my eagerness to get this code out the door. Please let me know, or | 
|  | better still fix the system calls yourself and submit a patch to me or Linus. | 
|  |  | 
|  | 6. Warning! | 
|  | ----------- | 
|  |  | 
|  | Not even root can override a mandatory lock, so runaway processes can wreak | 
|  | havoc if they lock crucial files. The way around it is to change the file | 
|  | permissions (remove the setgid bit) before trying to read or write to it. | 
|  | Of course, that might be a bit tricky if the system is hung :-( | 
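
For completeness, removing the setgid bit as suggested above amounts to a
chmod g-s. A minimal Python sketch follows, with a function name invented for
this example:

```python
import os
import stat


def clear_mandatory_marking(path):
    """Remove the setgid bit from `path` (like chmod g-s); return the new mode."""
    mode = stat.S_IMODE(os.stat(path).st_mode) & ~stat.S_ISGID
    os.chmod(path, mode)
    return mode
```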
|  |  |