Differences between revisions 41 and 43 (spanning 2 versions)
Revision 41 as of 2007-04-06 17:08:42
Size: 16134
Comment:
Revision 43 as of 2007-04-07 13:46:00
Size: 10305
Comment:
Line 6: Line 6:
problems with reliable and safe device locking. This paper enlightens the issues behind the scenes and presents possible future solutions.
It collects our ponderings after having received this advise from Alan Cox
problems with reliable and safe device locking. This paper
collects our ponderings after having received this advise from Alan Cox
Line 10: Line 10:
'''NOTE:''' Don't think this is a simple problem and can be easily solved with solution ''FOO'' by ''SOMEONEELSE'' with ''FOO'' and ''SOMEONEELSE'' immediately coming to your mind. This doesn't work. The problem has many layers and many stakeholders. It is a mess since years.
Line 14: Line 12:
Our original concern is
the influence of even read-only operations on optical media drives
(recorders) during their duty as recorders -- depending on the device
model such read-only work may interrupt the process badly practically
destroying the medium.

Since many programs already do act on such devices in an unsafe manner,
either willingly (e.g. liblkid) or accidentally (e.g. hald, opening with
O_EXCL but still clashing with cdr applications working on the competing
sg driver), we see the need for reliable communication in order to
ensure proper device locking where appropriate, in a way which is
appropriate for the particular application.

In the following document, first the currently possible mechanisms are itemized
with their advantages and their problems, followed by a draft of an incomplete
locking scheme of limited system impact. Finally a complete locking mechanism
is proposed which needs coordination with the Linux community, nevertheless.

Assuming that we use file-based locking, the remaining open questions are:
Our concern is the influence of even read-only operations on
optical media drives (recorders) during their duty as recorders --
depending on the device model such read-only work may interrupt the
process badly, spoiling the result, eventually wasting the medium.

Since many programs already act on such devices we see the need for
reliable communication in order to allow proper device locking if
good will for cooperation is present. After such a locking mechanism is
implemented, we will invite any project to join it.

In the following document, at first a few possible mechanisms are
evaluated. Then a suitable locking algorithm is proposed which needs
coordination with the Linux community, nevertheless.

The remaining open questions is:
Line 35: Line 29:
 ''How to create lock files covering devices across drivers and multiple nodes?'' We do not want to make this choice a reason for the rejection of our proposal.
Line 41: Line 35:
=== General inter-process locking mechanisms ===

In general, all the mechanisms listed below are not optimally appropriate for our purpose. Why? They use the '''filename as identity'''. Which is okay for normal files, but they lack on two places which make then not reliable when used alone:

 * they do not cope with multiple device file which imply the access to the same driver through different files
 * they do not automatically cope with multiple device '''drivers''' accessible through '''different''' user space interfaces, like with sg vs. sr drivers on Linux. No matter how many excuses some kernel developers do present to paper over this obvious shortcomings. Automatic use of /dev/sr instead of /dev/sg is not always possible or may not be wanted by the user.

Finally, they may be sufficient to lower the risk on inappropriate operation. Which exactly are available in the wild?
=== Path/Inode based locking mechanisms ===

In general, these mechanisms are not optimally appropriate for our purpose.
They use the filename or inode as identity. In our case this imposes problems:
but they lack on two places which make then not reliable when used alone:
 * they do not cope with multiple device files which imply the access to the same driver through different files
 * they do not automatically cope with multiple device '''drivers''' accessible through different co-existing user space interfaces, like with sg vs. sr drivers.

We evaluated:
Line 63: Line 59:
 Principle: lock applied on open file handles. Internally associated with a path, see fcntl(2) for details.
 Principle: lock applied on open file handles. Thus probably refering to an inode. See fcntl(2) for details.
Line 71: Line 67:
  * diverges from flock() implementation on Linux, see below. Results in independent locking.
  * possible problems on network file systems

 * flock(2) exclusive file locking

 Principle: similar to fcntl locks, applied with a different system function.

 Pros: see fcntl(2) locking above

 Cons: like fcntl(2), but less portable, not working over network file systems
  * needs open(2) as precondition which has to be avoided on unlocked device files

==== Other locking mechanisms ====
Line 82: Line 71:
==== Advanced locking mechanisms ====

 * System V Semaphores

 Principle: a magic integer, "key" or "semid", identifies a set of state objects on which the atomic operations can be performed which are necessary for implementing a proper locking algorithm. See man semget(2), semop(2) SEM_UNDO.

 Pros: semaphores are originally designed for our purpose and they are very traditional Unix requisites.

 Cons:
  * the semaphore key must be systemwide unique for the set of lockable drives and all participating programs have to use the same key. This situation is prone to collisions with locking mechanisms for other system resources. Function ftok(3) is not a secure solution.
  * each device needs a fixely defined index number in the set of semaphores which are allocated system resources. So we can hardly span up a giant index space where we can map different device file classes to disjoint index intervals.
Line 96: Line 73:
 Principle: passing of the O_EXCL flag to the open call. The device is
locked exclusively for the calling PID, the lock is maintained in the
device driver to the particular major/minor combination.
 Principle: passing of the O_EXCL flag to the open call of a device file. The device is locked exclusively for the calling PID, the lock is maintained in the device driver to the particular major/minor combination.
Line 105: Line 80:
  * does not automagicaly make the device inaccessible, only applications using O_EXCL will know about the locked state when getting negative result with EBUSY errno value.
  * it does not solve the problem with the co-existing drivers for sr and sg

 * System V Semaphores

 See man semget(2), semop(2) SEM_UNDO. They have been considered and rejected mainly because of too many potential device names which would need pre-allocated semaphore objects.
Line 109: Line 88:
As explained in the introduction, the locking is important on optical media recording due to the delicate operation mode during the recording. Ideally, no application should touch them, even reading from the media is an evil task. But how does the state of the practice look like?

 * mount: the block device is mounted with the O_EXCL flag '''BUT''' the mount executable also uses '''libblkid''' which opens the devices without locking and    read magic data from it. This also provides no solution for operation through the sg driver.
As explained in the introduction, the locking is important on optical media
recording due to the delicate operation mode during the recording.
Ideally, no other application should touch them. Even reading info from the
drive can spoil the recording run.
Currently we are aware of at least the following participants in drive
collisions. They take differing precautions for this case, of which none
is really able to prevent inadverted open(2) of a busy drive under all
circumstances.

 * mount: the block device is mounted with the O_EXCL flag but the mount executable also uses libblkid which opens the devices without locking and reads magic data from it. (We understand that for the duration of a recording run, mounting should best be prevented in total.)
Line 113: Line 99:
 * hald (HAL daemon): periodically opens the cdrom block devices with O_EXCL flag. Clashes with operation on sg is possible.
 * hald (HAL daemon): frequently opens the block devices with O_EXCL flag.
Line 119: Line 105:
 * cdrskin (via libburn): opens the devices with O_EXCL flag. It uses a unique device file path for serious operations on the drive. This is /dev/sg* on kernel 2.4, and recently has become /dev/sr* on kernel 2.6. Operations on other path representations of the same device are restricted to open(2) O_RDONLY and to obtaining SCSI parameters host,channel,id,lun. Mapping to the recommended device names is strictly enforced.

 * cdrecord: no locking. Author recommends to just get rid of hald (http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=361643) or do it like Solaris does which seems to do explicite locking, maintained internally on device driver or on major/minor pairs. However, we need to implement safe locking in our sand box.

'''NOTE:''' Some of the things above may be changed but even then it won't happen overnight. We are not in the Linux kernel development or some other big project where one developer can go through the whole code end refactor everything in one run and change all dependencies in the next version. In the big real world, there are ''n'' users with ''m'' software versions of ''n'' involved applications on ''o'' different kernels, creating ''p'' permutations. Do the math!

In the practice, the burn process can be interrupted by mount(libblkid), or even by hald on undetected sg-vs.-sr conflict. This things do already happen!

=== Proposed general locking algorithms ===

=== What can be done with device files alone ? ===

The following method is proposed to create a midway between the limitations of
kernel and the requirements of others, also unifying the way of dealing with the
device locks.

The /var/lock locking method with proxy lock files was identified as obstacle
because of restrictive security settings on popular Linux distributions (see
above). Instead, this proposal relies on a two-step locking method directly on
the device file and solves the ambiguity problem of sr-scd-sg paths at the risk
that open(O_RDONLY); ioctl(SCSI_IOCTL_GET_IDLUN) could be harmful to running
burner activities.

 1. Open the device. For applications that operate in a delicate way (burning tools), O_EXCL shall be set. For others, it may be omited.
 2. Set or check the additional fcntl lock on the device file. It must be exclusive! Sample code (by Thomas Schmitt)
{{{
        struct flock lockthing;
        ...
        f = open(device, mode|O_EXCL);
        if (f != -1) {
             memset(&lockthing, 0, sizeof(lockthing));
             lockthing.l_type = F_WRLCK;
             lockthing.l_whence = SEEK_SET;
             lockthing.l_start = 0;
             lockthing.l_len = 0;
             if (fcntl(f, F_SETLK, &lockthing)) {
                close(f);
                /* user feedback, report error, etc... */
                f = -1;
             }
        }
}}}

Normally, fcntl(2) imposes advisory locking. But the sysadmin can make this
mandatory locking by a mount option. (Is /dev mount(8)ed at all ?)

Unique path resolution is necessary so all interested processes come together at
the same inode where this locking is guaranteed to collide. (Does F_SETLK work
with all the implementations of directory /dev/ ?)

Links or additional device paths created by mknod(1) can be translated into
/dev/sg* resp. /dev/sr* by help of call stat(2) and its result element .st_rdev.
But the jump from /dev/sg* to /dev/sr* is not possible via stat(2). For that we
need open(2) O_RDONLY, ioctl(SCSI_IOCTL_GET_IDLUN), close(2).

The translation is done by obtaining info from the given path, by iterating over
the desired device paths /dev/hd%c , /dev/sg%d , /dev/sr%d, and by comparing
their info with the one we look for. Kernel 2.6: If the result is a /dev/sg%d
then it has to be translated into a /dev/sr%d in another step.

NOTE: there are sysfs symlinks that set up a path usable to map exactly.
However, this depends on a mounted sysfs and the required symlinks have also
been declared deprecated in the recent Linux kernel versions.

Kernel 2.4 imposes the problem that ioctl(SG_IO) is not possible with sr, so
most of the burn programs have to use sg. But growisofs uses sr via
ioctl(CDROM_SEND_PACKET) which does not work with sg. We will possibly not come
to a completely sufficient agreement under these circumstances. Well, we 2.4ers
are used to suffer neglect. (sob, hehe, see below)
 * cdrskin (via libburn): opens the devices with O_EXCL flag. It uses a unique device file path for serious operations on the drive. This is /dev/sg* on kernel 2.4, and recently has become /dev/sr* on kernel 2.6. The only other permissible paths are /dev/hd* . Operations on other path representations of the same device are restricted to open(2) O_RDONLY and to obtaining SCSI parameters host,channel,id,lun. Mapping to the unique device names is strictly enforced.

 * cdrecord: no locking. Author recommends to do it like Solaris does (which seems to do explicite locking, maintained internally on device driver or on major/minor pairs).

Any of the listed programs is currently able to spoil a recording run
just by its proper operation if only the circumstances are unfortunate enough.
This compilation is mostly heuristic and may be erroneous in details.
Whatever, the problems and the users' disappointment are real.
Line 191: Line 116:
First: races
First: race conditions
Line 205: Line 130:
This system may work with the plain old UUCP program and few others programs
with low device opening activity AND administrated by root but is a real PITA
nowadays.

=== And the better solution would be ... ===

* to adopt the FHS idea of locking a proxy before any open(2)
is performed, but to avoid the known drawbacks of FHS /var/lock/

=== Proposed locking algorithm ===

It adopts the FHS idea of locking a proxy before any open(2)
is performed, but avoids the known drawbacks of the FHS /var/lock/
Line 215: Line 137:
* to allow the use of any of the sg, sr, scd device drivers at
the discretion of the programs.
It is designed to allow the use of any of the sg, sr, scd device drivers at
the discretion of the programs. It is also designed to include the less
ambiguous situation of drive access via /dev/hd*.
Line 220: Line 143:
Proposal for unambigous advisory device locking on Linux kernel 2.4 and 2.6

It is inspired in part by traditional UUCP locking in /var/lock/ as described by
http://www.pathname.com/fhs/pub/fhs-2.3.html#VARLOCKLOCKFILES

----------------------------------------------------------------------------

Compliant processes apply open(2) to suspected CD/DVD burner device files only if they are able to do this via one of the following paths:
Compliant processes apply open(2) to suspected CD/DVD burner device files
only if they are able to do this via one of the following paths:
Line 229: Line 146:
and only after they have obtained a lock on them. (N= 31 or 255 ?) Further
precautions like open(O_EXCL) or fcntl(F_SETLK) on the device file are allowed.
Programs should offer expert options to disable them, though.
and only after they have obtained a lock on them. (N= 31 or 255 ?)
Line 234: Line 149:
conditions or potential stale locks: Other than with FHS /var/lock, not the mere
existence of the lock file establishes the lock state. It is instead implemented
by open(2) with O_RDWR and then fcntl(2) with F_SETLK. The lock file descriptor
is held open until the lock is obsolete.
conditions or potential stale locks: Other than with FHS /var/lock, not the
mere existence of the lock file establishes the lock state. It is instead
implemented by open(2) with O_RDWR and then fcntl(2) with F_SETLK. The lock
file descriptor is held open until the lock is obsolete.
Line 241: Line 156:
the permissible ones. So /dev/nec_burner can be translated to one of /dev/sr1 ,
/dev/sg2, /dev/hdd. (If not, then it is hardly a burner device.)
the permissible ones. So /dev/nec_burner can be translated to exactly one of
/dev/sr0, /dev/sg2, /dev/hdd. (If not, then it is hardly a burner device.)
Line 245: Line 160:
their possible path instances.
their three permissible path instances.
Line 247: Line 162:
allowed, but also /dev/sg1 and /dev/scd0.
allowed, but also /dev/sg2 and /dev/scd0.
Line 261: Line 176:
Further precautions like open(O_EXCL) or fcntl(F_SETLK) on the device file
itself are allowed. Programs are asked politely to offer expert options to
disable them. In general a program is free to use a device in any way after
a lock has been obtained successfully.
Line 267: Line 187:
As an application programmer i would propose /tmp/ and some file name prefix. It
would work, after all. Possibly one would have to remove the lock file after
releasing the lock. That would play nice with the t-permission.
As an application programmer i would propose /tmp/ and some file name prefix.
It would work, after all. It would be covered by FHS specs except the fact that
/var/lock is the paragraph which matches our problem more specifically
- and fails to solve it.
Line 281: Line 202:

E.g. shall we include st and sd into the locking range ? It would not touch the
devices themselves but would make users of sg aware of them. It is orthogonal to
our core topic of CD/DVD drives since they are not supposed to appear as st or
sd.

On Locking Schemes on Linux Device Drivers

Hello fellow application developer or maintainer,

recently we (cdrkit and libburnia developers) came across increasing problems with reliable and safe device locking. This paper collects our ponderings after having received this advice from Alan Cox on LKML: http://lkml.org/lkml/2007/3/31/175

Introduction

Our concern is the influence of even read-only operations on optical media drives (recorders) while they are recording. Depending on the device model, such read-only access may badly interrupt the process, spoiling the result and possibly wasting the medium.

Since many programs already act on such devices we see the need for reliable communication in order to allow proper device locking if good will for cooperation is present. After such a locking mechanism is implemented, we will invite any project to join it.

In the following document, a few possible mechanisms are evaluated first. Then a suitable locking algorithm is proposed, which nevertheless needs coordination with the Linux community.

The remaining open question is:

  • Where to create lock files under a protocol that is not (yet) covered by FHS ?

We do not want to make this choice a reason for the rejection of our proposal.

State of the practice

There are various locking techniques used in other areas which are more or less applicable in our case.

Path/Inode based locking mechanisms

In general, these mechanisms are not optimally appropriate for our purpose. They use the filename or inode as identity. In our case this poses problems, because they fall short in two respects which make them unreliable when used alone:

  • they do not cope with multiple device files that give access to the same driver through different files
  • they do not automatically cope with multiple device drivers that are accessible through different co-existing user space interfaces, as with the sg and sr drivers.

We evaluated:

  • Lock files associated with target file

    Principle: an additional file is created during the action on the real target file. See http://www.pathname.com/fhs/pub/fhs-2.3.html#VARLOCKLOCKFILES
    Pros: regular filesystem operation, no additional infrastructure required
    Cons:

    • Possible races unless OS mechanisms are used for exclusive operation on the lock file, see below
    • The location and name of the lock file need to be known and discussed upfront among all application developers, or be documented excessively
    • Permission problems may disallow the creation of lock files (security issues), especially for self-compiled applications whose users have no root permission to install them in the required way
    • Special precautions are necessary against stale locks
  • fcntl(2) exclusive file locking
    Principle: lock applied on open file handles, thus probably referring to an inode. See fcntl(2) for details.
    Pros:
    • known (POSIX.1-2001), usually reliable mechanism
    Cons:
    • needs open(2) as precondition which has to be avoided on unlocked device files

Other locking mechanisms

  • O_EXCL locking
    Principle: the O_EXCL flag is passed to the open call of a device file. The device is locked exclusively for the calling PID; the lock is maintained in the device driver for the particular major/minor combination.
    Pros:
    • reliable for a device accessible through one driver
    Cons:
    • for sr it requires kernel 2.6.x (x>=7 or so), with sg it might work on 2.4.

    • it does not solve the problem with the co-existing drivers for sr and sg
  • System V Semaphores
    See man semget(2), semop(2) SEM_UNDO. They have been considered and rejected mainly because of too many potential device names which would need pre-allocated semaphore objects.
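
The O_EXCL principle described above can be illustrated by a minimal sketch. The function name open_device_excl and the use of O_NONBLOCK are assumptions made for this example; they are not taken from any of the programs discussed in this paper. The sketch only distinguishes a busy device (errno EBUSY) from other failures.

```c
#include <errno.h>
#include <fcntl.h>

/* Hypothetical helper for illustration only.
 * Returns an open file descriptor on success,
 * -2 if the driver reports the device as busy (EBUSY),
 * -1 on any other error. */
int open_device_excl(const char *path)
{
    /* O_NONBLOCK is an assumption of this sketch: it is commonly used so
     * that opening a drive does not wait for a medium to become ready. */
    int fd = open(path, O_RDWR | O_EXCL | O_NONBLOCK);

    if (fd == -1) {
        if (errno == EBUSY)
            return -2;   /* somebody else holds the device with O_EXCL */
        return -1;       /* missing file, no permission, ... */
    }
    return fd;
}
```

Note that this only protects against other openers which also use O_EXCL on the same driver, which is exactly the limitation listed under Cons.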

Applicability on CD/(HD)DVD/BD drives

As explained in the introduction, locking is important for optical media recording because the drive is in a delicate operation mode during the recording run. Ideally, no other application should touch the drive. Even reading info from the drive can spoil the recording run. Currently we are aware of at least the following participants in drive collisions. They take differing precautions for this case, none of which is really able to prevent an inadvertent open(2) of a busy drive under all circumstances.

  • mount: the block device is mounted with the O_EXCL flag but the mount executable also uses libblkid which opens the devices without locking and reads magic data from it. (We understand that for the duration of a recording run, mounting is best prevented altogether.)
  • hald (HAL daemon): frequently opens the block devices with O_EXCL flag.
  • wodim: opens the devices with O_EXCL flag. Opening /dev/sg is possible and is more likely to happen with versions prior to 1.1.4.
  • growisofs: opens the block devices with O_EXCL flag. Opening /dev/sg was never encouraged and does not work on kernel 2.4 (not tested yet on 2.6).
  • cdrskin (via libburn): opens the devices with O_EXCL flag. It uses a unique device file path for serious operations on the drive. This is /dev/sg* on kernel 2.4, and recently has become /dev/sr* on kernel 2.6. The only other permissible paths are /dev/hd* . Operations on other path representations of the same device are restricted to open(2) O_RDONLY and to obtaining SCSI parameters host,channel,id,lun. Mapping to the unique device names is strictly enforced.
  • cdrecord: no locking. Author recommends doing it like Solaris does (which seems to do explicit locking, maintained internally in the device driver or on major/minor pairs).

Any of the listed programs is currently able to spoil a recording run just by its proper operation, if only the circumstances are unfortunate enough. This compilation is mostly heuristic and may be erroneous in details. In any case, the problems and the users' disappointment are real.

Obstacles for using FHS compliant /var/lock/ files

First: race conditions

Second: unclear or unreliable cleanup technique, dangling bad lockfiles possible

Third: The most obvious problem is the usual permission setting of /var/lock :

SuSE 9.0 (kernel 2.4):
  drwxrwxr-x    3 root     uucp         4096 Apr  4     05:07 /var/lock
SuSE 9.3 (kernel 2.6):
  drwxrwxr-t    4 root     uucp         4096 2007-04-04 17:50 /var/lock
Fedora Core 3.x:
  drwxrwxr-x    5 root     lock         4096 Apr  4     04:03 /var/lock
Debian gives rw-permission to anybody and thus would be no problem.

Proposed locking algorithm

It adopts the FHS idea of locking a proxy before any open(2) is performed, but avoids the known drawbacks of the FHS /var/lock/ protocol.

It is designed to allow the use of any of the sg, sr, scd device drivers at the discretion of the programs. It is also designed to include the less ambiguous situation of drive access via /dev/hd*.


Compliant processes apply open(2) to suspected CD/DVD burner device files only if they are able to do this via one of the following paths:

  • /dev/sg[0..N] , /dev/scd[0..N] , /dev/sr[0..N] , /dev/hd[a-z]

and only after they have obtained a lock on them. (N= 31 or 255 ?)

Locking is performed similarly to UUCP tradition but without the potential race conditions or potential stale locks: Unlike with FHS /var/lock, it is not the mere existence of the lock file which establishes the lock state. The lock is instead implemented by open(2) with O_RDWR and then fcntl(2) with F_SETLK. The lock file descriptor is held open until the lock is obsolete.
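
A minimal sketch of this proxy locking step might look as follows. The function names and the lock file path are hypothetical, since the proposal deliberately leaves the lock directory open. Passing O_CREAT to open(2) is an assumption of this sketch, too; the text above only prescribes O_RDWR and F_SETLK.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: tries to obtain the advisory lock on a proxy
 * lock file. Returns the lock file descriptor, which must stay open
 * as long as the lock is needed, or -1 if the lock was not granted. */
int acquire_proxy_lock(const char *lock_path)
{
    struct flock lk;
    /* O_CREAT is an assumption: the lock file may not exist yet. */
    int fd = open(lock_path, O_RDWR | O_CREAT, 0666);

    if (fd == -1)
        return -1;

    memset(&lk, 0, sizeof(lk));
    lk.l_type = F_WRLCK;      /* exclusive lock */
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0;             /* 0 means: up to end of file */

    /* F_SETLK does not block. It fails if another process holds the lock. */
    if (fcntl(fd, F_SETLK, &lk) == -1) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Closing the descriptor releases the lock. The file itself may stay. */
void release_proxy_lock(int fd)
{
    if (fd >= 0)
        close(fd);
}
```

Since the kernel drops the lock when the holding process ends, a crashed program leaves behind a file but no stale lock. One caveat of fcntl(2) locks is that they are released as soon as the process closes any descriptor that refers to the lock file, so a program should open each lock file only once.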

Paths other than the permissible ones have to be translated. The call stat(2) with its result element .st_rdev allows searching for a matching device file among the permissible ones. So /dev/nec_burner can be translated to exactly one of /dev/sr0, /dev/sg2, /dev/hdd. (If not, then it is hardly a burner device.)

To circumvent the sg-sr-scd ambiguity, those devices must be locked in all three of their permissible path instances. E.g. not only /dev/sr0 has to be locked before open(2) for serious usage is allowed, but also /dev/sg2 and /dev/scd0.

The device triples are formed from those device files which have the same SCSI parameters Host,Channel,Id,Lun from ioctl(SCSI_IOCTL_GET_IDLUN). Since this needs open(2), the search has to be accompanied by the locking of the tested files. Those which do not match get released immediately. If all three files are found and locked, it is guaranteed that any of them is free for usage. If any of the three is not found, then the lock is not granted due to a suspected collision between two locking contestants.

This cannot disturb a serious drive operation because such an operation is allowed to start only if all three paths are locked. Thus there would be no starting point for a device-triple search at all.
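
The comparison of SCSI parameters during the triple search might be sketched as below. The struct and function names are hypothetical. The fallback value 0x5382 for SCSI_IOCTL_GET_IDLUN and the bit layout of the reply (id, lun, channel, host packed into one 32 bit word) are stated here to our best knowledge of the Linux SCSI headers and should be verified against the kernel in use.

```c
#include <sys/ioctl.h>

/* Fallback in case the SCSI headers are not included (assumption, see above). */
#ifndef SCSI_IOCTL_GET_IDLUN
#define SCSI_IOCTL_GET_IDLUN 0x5382
#endif

/* Hypothetical container for the four SCSI address parameters. */
struct scsi_address {
    int host, channel, id, lun;
};

/* Assumed layout of the ioctl reply: two 32 bit words. */
struct idlun_reply {
    unsigned int dev_id;
    unsigned int host_unique_id;
};

/* Unpacks the first reply word: id in bits 0-7, lun in bits 8-15,
 * channel in bits 16-23, host number in bits 24-31. */
struct scsi_address decode_idlun(unsigned int dev_id)
{
    struct scsi_address a;

    a.id = dev_id & 0xff;
    a.lun = (dev_id >> 8) & 0xff;
    a.channel = (dev_id >> 16) & 0xff;
    a.host = (dev_id >> 24) & 0xff;
    return a;
}

/* Queries an already opened (and, by the protocol above, already
 * locked) device file. Returns 0 on success, -1 on failure. */
int get_scsi_address(int fd, struct scsi_address *out)
{
    struct idlun_reply reply;

    if (ioctl(fd, SCSI_IOCTL_GET_IDLUN, &reply) == -1)
        return -1;
    *out = decode_idlun(reply.dev_id);
    return 0;
}

/* Two device files belong to the same triple if all four values match. */
int same_scsi_address(const struct scsi_address *a,
                      const struct scsi_address *b)
{
    return a->host == b->host && a->channel == b->channel &&
           a->id == b->id && a->lun == b->lun;
}
```

A triple search would lock a candidate's proxy file, open the candidate, call get_scsi_address(), and release the lock again immediately if same_scsi_address() reports a mismatch.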

Further precautions like open(O_EXCL) or fcntl(F_SETLK) on the device file itself are allowed. Programs are asked politely to offer expert options to disable them. In general a program is free to use a device in any way after a lock has been obtained successfully.


All we need for this is a directory which is present on any Linux system and is supposed to offer rwx-permissions to anybody who is allowed to access the devices.

As an application programmer I would propose /tmp/ and some file name prefix. It would work, after all. It would be covered by the FHS specs, except for the fact that /var/lock is the paragraph which matches our problem more specifically - and fails to solve it.

To perform the sketched algorithm in /var/lock would violate FHS. The often restrictive permission settings of /var/lock would also make an additional rule necessary: A missing lock file which cannot be created allows using the device as if a lock had been granted. (Provident sysadmins would then create the lock files in /var/lock/ once and allow rw-permission for all intended users.)

This is where we should ask the broad Linux public for opinions and advice. We are not much in a hurry and therefore should ponder duly over any aspect.