Re: Linux kernel file offset pointer races

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* Re: Linux kernel file offset pointer races
       [not found] <Pine.LNX.4.44.0408041220550.26961-100000@isec.pl>
@ 2004-08-04 20:36 ` Pavel Kankovsky
  2004-08-06 12:56   ` Alan Cox
  0 siblings, 1 reply; 8+ messages in thread
From: Pavel Kankovsky @ 2004-08-04 20:36 UTC (permalink / raw)
  To: linux-kernel

On Wed, 4 Aug 2004, Paul Starzetz wrote:

> static ssize_t mtrr_read(struct file *file, char *buf, size_t len,
>                          loff_t *ppos)
> {
> [1] if (*ppos >= ascii_buf_bytes) return 0;
> [2] if (*ppos + len > ascii_buf_bytes) len = ascii_buf_bytes - *ppos;
>     if ( copy_to_user (buf, ascii_buffer + *ppos, len) ) return -EFAULT;
> [3] *ppos += len;
>     return len;
> }  /*  End Function mtrr_read  */
>
> It is quite easy to see that since copy_to_user can  sleep,  the  second
> reference  to  *ppos  may  use  another  value.  Or in other words, code
> operating on the file->f_pos variable through a pointer must  be  atomic
> in  respect  to  the current thread. We expect even more troubles in the
> SMP case though.

Oh no! Another old bug getting bored of being ignored and promoting itself
into a security vulnerability (see Bugtraq/Full-Disclosure archive for a 
complete story).

IMHO, the proper fix is to serialize all operations modifying a shared
file pointer (file->f_pos): read(), readv(), write(), writev(),
lseek()/llseek(). As far as I can tell, this is required by POSIX:

  [read() specification]
  On files that support seeking (for example, a regular file), the
  read() shall start at a position in the file given by the file offset
  associated with fildes. The file offset shall be incremented by the
  number of bytes actually read.

  [write() specification]
  On a regular file or other file capable of seeking, the actual writing
  of data shall proceed from the position in the file indicated by the
  file offset associated with fildes. Before successful return from
  write(), the file offset shall be incremented by the number of bytes
  actually written.

  [write() specification]
  If the O_APPEND flag of the file status flags is set, the file offset
  shall be set to the end of the file prior to each write and no
  intervening file modification operation shall occur between changing the
  file offset and the write operation.

Moreover, there should probably be some kind of serialization between
groups of reads from a file (read: inode) and individual writes to the
same file even among different struct file:

  [read() rationale]
  I/O is intended to be atomic to ordinary files and pipes and
  FIFOs. Atomic means that all the bytes from a single operation that
  started out together end up together, without interleaving from other
  I/O operations.

  [write() rationale]
  Writes can be serialized with respect to other reads and writes. If a
  read() of file data can be proven (by any means) to occur after a
  write() of the data, it must reflect that write(), even if the calls are
  made by different processes. A similar requirement applies to multiple
  write operations to the same file position.

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-04 20:36 ` Linux kernel file offset pointer races Pavel Kankovsky
@ 2004-08-06 12:56   ` Alan Cox
  2004-08-07 12:38     ` Amon Ott
  2004-08-12 21:38     ` Pavel Kankovsky
  0 siblings, 2 replies; 8+ messages in thread
From: Alan Cox @ 2004-08-06 12:56 UTC (permalink / raw)
  To: Pavel Kankovsky; +Cc: Linux Kernel Mailing List

On Mer, 2004-08-04 at 21:36, Pavel Kankovsky wrote:
> IMHO, the proper fix is to serialize all operations modifying a shared
> file pointer (file->f_pos): read(), readv(), write(), writev(),
> lseek()/llseek(). As far as I can tell, this is required by POSIX:

Not if you want to get any useful work done. No Unix does this. The
situation with multiple parallel lseek/read/writes is somewhat undefined
anyway since you don't know if the seek or the write occurred first in
user space.

O_APPEND is a bit different, as are pread/pwrite but those are dealt
with using locking for files.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-06 12:56   ` Alan Cox
@ 2004-08-07 12:38     ` Amon Ott
  2004-08-07 13:18       ` viro
  2004-08-12 21:38     ` Pavel Kankovsky
  1 sibling, 1 reply; 8+ messages in thread
From: Amon Ott @ 2004-08-07 12:38 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 878 bytes --]

On Freitag, 6. August 2004 14:56, Alan Cox wrote:
> On Mer, 2004-08-04 at 21:36, Pavel Kankovsky wrote:
> > IMHO, the proper fix is to serialize all operations modifying a shared
> > file pointer (file->f_pos): read(), readv(), write(), writev(),
> > lseek()/llseek(). As far as I can tell, this is required by POSIX:
> 
> Not if you want to get any useful work done. No Unix does this. The
> situation with multiple parallel lseek/read/writes is somewhat undefined
> anyway since you don't know if the seek or the write occurred first in
> user space.

Would it not be useful to have per-process or per-thread offsets? Do 
parallel processes really need to share the offset?

E.g., the struct file could (optionally) be copied on fork with 
copy-on-write to avoid extra memory consumption.

Amon.
-- 
http://www.rsbac.org - GnuPG: 2048g/5DEAAA30 2002-10-22

[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-07 12:38     ` Amon Ott
@ 2004-08-07 13:18       ` viro
  2004-08-07 16:02         ` Amon Ott
  0 siblings, 1 reply; 8+ messages in thread
From: viro @ 2004-08-07 13:18 UTC (permalink / raw)
  To: Amon Ott; +Cc: linux-kernel

On Sat, Aug 07, 2004 at 02:38:12PM +0200, Amon Ott wrote:
> Would it not be useful to have per-process or per-thread offsets? Do 
> parallel processes really need to share the offset?
> 
> E.g., the struct file could (optionally) be copied on fork with 
> copy-on-write to avoid extra memory consumption.

(cat a; cat b) > /tmp/foo

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-07 13:18       ` viro
@ 2004-08-07 16:02         ` Amon Ott
  0 siblings, 0 replies; 8+ messages in thread
From: Amon Ott @ 2004-08-07 16:02 UTC (permalink / raw)
  To: linux-kernel

[-- Attachment #1: Type: text/plain, Size: 930 bytes --]

On Samstag, 7. August 2004 15:18, viro@parcelfarce.linux.theplanet.co.uk 
wrote:
> On Sat, Aug 07, 2004 at 02:38:12PM +0200, Amon Ott wrote:
> > Would it not be useful to have per-process or per-thread offsets? Do 
> > parallel processes really need to share the offset?
> > 
> > E.g., the struct file could (optionally) be copied on fork with 
> > copy-on-write to avoid extra memory consumption.
> 
> (cat a; cat b) > /tmp/foo

Right, the process running (...) would not know what offset to give "cat b" 
as a start, because it cannot see "cat a"'s final offset. It could seek to 
the end of /tmp/foo before starting "cat b", but this requirement would 
break existing shells. Bad luck.

As there are probably more examples why it does not work, there will 
probably have to be a macro for all offset changes, which checks for 
overruns.

Amon.
-- 
http://www.rsbac.org - GnuPG: 2048g/5DEAAA30 2002-10-22

[-- Attachment #2: signature --]
[-- Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-06 12:56   ` Alan Cox
  2004-08-07 12:38     ` Amon Ott
@ 2004-08-12 21:38     ` Pavel Kankovsky
  2004-08-12 21:12       ` Alan Cox
  1 sibling, 1 reply; 8+ messages in thread
From: Pavel Kankovsky @ 2004-08-12 21:38 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Fri, 6 Aug 2004, Alan Cox wrote:

> On Mer, 2004-08-04 at 21:36, Pavel Kankovsky wrote:
> > IMHO, the proper fix is to serialize all operations modifying a shared
> > file pointer (file->f_pos): read(), readv(), write(), writev(),
> > lseek()/llseek(). As far as I can tell, this is required by POSIX:
> 
> Not if you want to get any useful work done. No Unix does this.

...serialize all operations modifying a shared file pointer wrt operations
modifying the *same* pointer.

> The situation with multiple parallel lseek/read/writes is somewhat
> undefined anyway since you don't know if the seek or the write
> occurred first in user space.

buffer[0] = 0;
lseek(fd, 0, SEEK_SET);
write(fd, buffer, 1);
lseek(fd, 0, SEEK_SET);

if (fork() > 0) {
  buffer[0] = 1;
  write(fd, buffer, 1000000);
}
else {
  while (buffer[0] == 0)
    pread(fd, buffer, 1, 0);
  lseek(fd, 1234, SEEK_SET);
}

lseek(...1234...) cannot occur before write(...1000000) starts but it can
occur before the big write() ends (unless write() is atomic).

There is a similar scenario with a big read() but it is somewhat more
complicated because it needs a piece of shared memory.

> O_APPEND is a bit different, as are pread/pwrite but those are dealt
> with using locking for files.

Two write()'s are serialized by inode semaphore. But as far as can tell,
there is no serialization between read()'s and write()'s. A read()
overlapping a simultaneous write() might produce inconsistent results,
e.g.:

1. read() starts at offset 0
2. read() reads a page of data and blocks
3. write() starts at offset 0
4. write() writes two pages of data and block
5. read() wakes up, reads two pages of data
6. write() wakes up, writes a page of data
7. read() finished
8. write() finished

In this scenario, the 1st and 3rd pages read by read() contain the old
data (before write()) but the 2nd page contains the new data (after
write()). This is absurd.

BTW: What about writev() (esp. with O_APPEND)? It appears Linux
implementation makes it possible to interleave parts of writev() with
other writes.

Moreover, there appears to be a race condition between locks_verify_area()
and the actual I/O operation(s).

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-12 21:38     ` Pavel Kankovsky
@ 2004-08-12 21:12       ` Alan Cox
  2004-08-18 11:03         ` Pavel Kankovsky
  0 siblings, 1 reply; 8+ messages in thread
From: Alan Cox @ 2004-08-12 21:12 UTC (permalink / raw)
  To: Pavel Kankovsky; +Cc: Linux Kernel Mailing List

On Iau, 2004-08-12 at 22:38, Pavel Kankovsky wrote:
> In this scenario, the 1st and 3rd pages read by read() contain the old
> data (before write()) but the 2nd page contains the new data (after
> write()). This is absurd.

Why ?

> BTW: What about writev() (esp. with O_APPEND)? It appears Linux
> implementation makes it possible to interleave parts of writev() with
> other writes.

If that can occur with O_APPEND it might be a bug. SuS does make some
real guarantees about what O_APPEND means.

> Moreover, there appears to be a race condition between locks_verify_area()
> and the actual I/O operation(s).

Details ?



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: Linux kernel file offset pointer races
  2004-08-12 21:12       ` Alan Cox
@ 2004-08-18 11:03         ` Pavel Kankovsky
  0 siblings, 0 replies; 8+ messages in thread
From: Pavel Kankovsky @ 2004-08-18 11:03 UTC (permalink / raw)
  To: Linux Kernel Mailing List

On Thu, 12 Aug 2004, Alan Cox wrote:

> > In this scenario, the 1st and 3rd pages read by read() contain the old
> > data (before write()) but the 2nd page contains the new data (after
> > write()). This is absurd.
> 
> Why ?

Ok, it is not absurd. It is insane. :)

As far as I know, POSIX compliant read() and write() et al. are
supposed to be serializable.

> > BTW: What about writev() (esp. with O_APPEND)? It appears Linux
> > implementation makes it possible to interleave parts of writev() with
> > other writes.
> 
> If that can occur with O_APPEND it might be a bug. SuS does make some
> real guarantees about what O_APPEND means.

The following scenario appears to be possible when the default
implementation in do_readv_writev() is used:

1. process P1 calls writev(), writes the 1st chunk
   fs specific write operation grabs i_sem when it starts...
   and releases it when it finished
2. process P2 calls write() on the same file
   i_sem is free now, and P2 can grab it..and release it again
3. process P1 writes the 2nd chunk
   i_sem is free again, and P1 can grab it to write the 2nd chunk

The good news is some filesystems use generic_file_writev() (ext2, ext3)
or their own implementation (xfs) rather than the "supergeneric" code in
do_readv_writev(). The bad news is some other filesystems (reiser) use the
"supergeneric" code.

> > Moreover, there appears to be a race condition between locks_verify_area()
> > and the actual I/O operation(s).
> 
> Details ?

When I hold a mandatory lock on a file then no other process will be
allowed to read from/write to (depending on the type of the lock) that
file, right? The code (vfs_read(), vfs_write(), do_readv_writev()) in
Linux appears to work as follows:

1. locks_verify_area() is called
2. file i/o is performed

As far as I can tell, the process holds no locks covering both steps.
Another process might call fcntl(...F_SETLK...) between steps 1 and 2,
i.e. after locks_verify_area() said "ok" but before the actual i/o is
finished, and the result would be one process doing i/o on a file
while the other process holds a mandatory lock on the same file.

--Pavel Kankovsky aka Peak  [ Boycott Microsoft--http://www.vcnet.com/bms ]
"Resistance is futile. Open your source code and prepare for assimilation."

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2004-08-18 11:03 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <Pine.LNX.4.44.0408041220550.26961-100000@isec.pl>
2004-08-04 20:36 ` Linux kernel file offset pointer races Pavel Kankovsky
2004-08-06 12:56   ` Alan Cox
2004-08-07 12:38     ` Amon Ott
2004-08-07 13:18       ` viro
2004-08-07 16:02         ` Amon Ott
2004-08-12 21:38     ` Pavel Kankovsky
2004-08-12 21:12       ` Alan Cox
2004-08-18 11:03         ` Pavel Kankovsky

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox