Question : are concurrent write() calls with O_APPEND on local files atomic ?

public inbox for linux-fsdevel@vger.kernel.org

* Question : are concurrent write() calls with O_APPEND on local files atomic ?
@ 2009-08-19 12:40 Cornelius, Martin (DWBI)
  2009-08-19 13:17 ` Josef Bacik
  0 siblings, 1 reply; 6+ messages in thread
From: Cornelius, Martin (DWBI) @ 2009-08-19 12:40 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Roeder, Patrick (DWBI)

Hi linux-filesystem experts,

First, our apologies if this is the wrong place to ask this question.
We searched extensively and couldn't find an answer, which is why we are
finally trying here.

The question arose from our reasoning about the robustness of the
OpenSSH code. Every invocation of ssh may add a line to the file
$HOME/.ssh/known_hosts, and (contrary to our expectations) we couldn't
find any explicit locking in the code. Instead, the ssh code just opens
the file with O_APPEND, writes to it, and closes it. We have already run
a simple test that tries to produce a corrupted known_hosts file by
starting lots of ssh commands concurrently, but so far we have not
observed any corruption. We now wonder whether this is just luck or
whether a programmer can rely on this behaviour.

The generalized question is: if two (or more) different processes open
the same file on a LOCAL disk with O_APPEND, and then concurrently
issue write() calls to store data into this file, is there any guarantee
that the data of each single write() call is written atomically? Or
could the data of different write() calls be interleaved, or could one
write() overwrite data already written by another? To prevent
misunderstandings, we assume that ALL writers have opened the file with
O_APPEND, and that all write() calls return normally without being
interrupted by a signal.

The POSIX standard states that, with O_APPEND, advancing the file
offset to the end of the file and the subsequent write are performed
atomically, but as far as we understand it does not state whether the
write itself is also atomic with respect to other concurrent write()
calls.

If there is some guarantee:
- Does a (perhaps filesystem-dependent) limit for this guarantee exist
  (like the PIPE_BUF size limit when writing to a pipe), and is there a
  way to detect this limit programmatically?
- Does this guarantee also hold if several threads in one process write
  to a single file DESCRIPTOR concurrently?
- Does this guarantee also hold for remote filesystems (NFS / SMB)?

If the answer to the last question is 'no': is there a simple way to
programmatically detect whether the guarantee holds for a specific file?

Many thanks for any answers!

Kind regards, Martin Cornelius

************************************************
The information contained in, or attached to, this e-mail, may contain confidential information and is intended solely for the use of the individual or entity to whom they are addressed and may be subject to legal privilege.  If you have received this e-mail in error you should notify the sender immediately by reply e-mail, delete the message from your system and notify your system manager.  Please do not copy it for any purpose, or disclose its contents to any other person.  The views or opinions presented in this e-mail are solely those of the author and do not necessarily represent those of the company.  The recipient should check this e-mail and any attachments for the presence of viruses.  The company accepts no liability for any damage caused, directly or indirectly, by any virus transmitted in this email.
************************************************


* Re: Question : are concurrent write() calls with O_APPEND on local files atomic ?
  2009-08-19 12:40 Question : are concurrent write() calls with O_APPEND on local files atomic ? Cornelius, Martin (DWBI)
@ 2009-08-19 13:17 ` Josef Bacik
  2009-08-20  2:10   ` Andreas Dilger
  2009-08-20 14:50   ` AW: Question : are concurrent write() calls with O_APPEND on localfiles " Cornelius, Martin (DWBI)
  0 siblings, 2 replies; 6+ messages in thread
From: Josef Bacik @ 2009-08-19 13:17 UTC (permalink / raw)
  To: Cornelius, Martin (DWBI); +Cc: linux-fsdevel, Roeder, Patrick (DWBI)

On Wed, Aug 19, 2009 at 06:40:33AM -0600, Cornelius, Martin (DWBI) wrote:
> 
> Hi linux-filesystem experts,
>
> First, our apologies if this is the wrong place to ask this question.
> We searched extensively and couldn't find an answer, which is why we are
> finally trying here.
>
> The question arose from our reasoning about the robustness of the
> OpenSSH code. Every invocation of ssh may add a line to the file
> $HOME/.ssh/known_hosts, and (contrary to our expectations) we couldn't
> find any explicit locking in the code. Instead, the ssh code just opens
> the file with O_APPEND, writes to it, and closes it. We have already run
> a simple test that tries to produce a corrupted known_hosts file by
> starting lots of ssh commands concurrently, but so far we have not
> observed any corruption. We now wonder whether this is just luck or
> whether a programmer can rely on this behaviour.
>
> The generalized question is: if two (or more) different processes open
> the same file on a LOCAL disk with O_APPEND, and then concurrently
> issue write() calls to store data into this file, is there any guarantee
> that the data of each single write() call is written atomically? Or
> could the data of different write() calls be interleaved, or could one
> write() overwrite data already written by another? To prevent
> misunderstandings, we assume that ALL writers have opened the file with
> O_APPEND, and that all write() calls return normally without being
> interrupted by a signal.
> 

So looking at the code, with O_APPEND set, every time the app calls write() the
position it's writing to is set to the end of the file.  It looks like most
filesystems (with the exception of btrfs) hold inode->i_mutex when they call
generic_write_checks(), which gives the position to write to.  So picking the
position and then doing the write happen atomically under that mutex, and
unless the fs is btrfs (which may or may not be a bug, I'll leave that to the
smarter people), O_APPEND should appear to be atomic.
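
The kind of stress test mentioned in the original question can be sketched in userspace as follows. This is only an illustration under stated assumptions, not a proof of atomicity: the function names, the fixed record length REC_LEN and the record layout are all hypothetical, and the file is assumed to start out empty on a local filesystem. Each child appends whole records with exactly one write() per record, and the parent then counts records that are torn, mixed or missing.

```c
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define REC_LEN 32 /* hypothetical fixed record size, newline included */

/* Child body: append nrecs records, each with exactly one write() call. */
static int append_records(const char *path, int id, int nrecs)
{
    char rec[REC_LEN];
    memset(rec, 'A' + id % 26, REC_LEN - 1);
    rec[REC_LEN - 1] = '\n';

    int fd = open(path, O_WRONLY | O_APPEND | O_CREAT, 0600);
    if (fd < 0)
        return -1;
    for (int i = 0; i < nrecs; i++) {
        if (write(fd, rec, REC_LEN) != REC_LEN) { /* short write or error */
            close(fd);
            return -1;
        }
    }
    return close(fd);
}

/* Forks nprocs appenders, then returns the number of damaged or missing
 * records in the file, or -1 if a child or the setup failed. */
int append_stress(const char *path, int nprocs, int nrecs)
{
    int failed = 0, status;

    for (int p = 0; p < nprocs; p++) {
        pid_t pid = fork();
        if (pid < 0) {
            failed = 1;
            break;
        }
        if (pid == 0)
            _exit(append_records(path, p, nrecs) == 0 ? 0 : 1);
    }
    while (wait(&status) > 0) /* reap every child that was started */
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0)
            failed = 1;
    if (failed)
        return -1;

    FILE *f = fopen(path, "rb");
    if (!f)
        return -1;
    char rec[REC_LEN];
    long seen = 0, bad = 0;
    size_t n;
    while ((n = fread(rec, 1, REC_LEN, f)) > 0) {
        /* An intact record is one repeated byte followed by a newline. */
        int ok = (n == REC_LEN && rec[REC_LEN - 1] == '\n');
        for (int i = 1; ok && i < REC_LEN - 1; i++)
            ok = (rec[i] == rec[0]);
        seen++;
        if (!ok)
            bad++;
    }
    fclose(f);

    long expected = (long)nprocs * nrecs;
    if (seen < expected)
        bad += expected - seen; /* overwritten records show up as missing */
    return (int)bad;
}
```

A result of 0 on a given machine only says that no corruption was observed in that run; it does not turn the locking behaviour described above into a documented guarantee.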

> The POSIX standard states that, with O_APPEND, advancing the file
> offset to the end of the file and the subsequent write are performed
> atomically, but as far as we understand it does not state whether the
> write itself is also atomic with respect to other concurrent write()
> calls.
>
> If there is some guarantee:
> - Does a (perhaps filesystem-dependent) limit for this guarantee exist
>   (like the PIPE_BUF size limit when writing to a pipe), and is there a
>   way to detect this limit programmatically?

Like I said, it seems most filesystems hold the i_mutex when doing the check,
but it appears btrfs does not.  I think it's a bug, but I'm not sure.  There
would not be a way to tell programmatically.

> - Does this guarantee also hold if several threads in one process write
>   to a single file DESCRIPTOR concurrently?

Yes, the position is set every single time write() is called.

> - Does this guarantee also hold for remote filesystems (NFS / SMB)?
> 

This I'm more likely to be wrong on, but I don't think so.  It would be atomic
on the local machine, but if there is somebody else on another machine writing
to the same file I think you would probably be screwed.

> If the answer to the last question is 'no': is there a simple way to
> programmatically detect whether the guarantee holds for a specific file?

I don't think so.  Really your best bet, if you are going to use a remote fs
that can have concurrent writers that have no knowledge of each other, is to
use fcntl() locking.  Thanks,

Josef
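
A minimal sketch of the fcntl() suggestion, assuming every writer cooperates by calling the same helper: take an exclusive advisory lock over the whole file, seek to the end, write, and unlock. The name locked_append() is hypothetical, and whether advisory locks are honoured across clients on a particular remote mount depends on the server and mount options, which is not checked here.

```c
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Appends len bytes under an exclusive whole-file advisory lock.
 * Returns 0 on success and -1 on error. The lock only protects against
 * other writers that take the same lock before writing. */
int locked_append(int fd, const void *buf, size_t len)
{
    struct flock lk = {0};
    lk.l_type = F_WRLCK;
    lk.l_whence = SEEK_SET;
    lk.l_start = 0;
    lk.l_len = 0; /* 0 means "through end of file", however far it grows */

    /* F_SETLKW blocks until the lock is granted (or a signal arrives). */
    if (fcntl(fd, F_SETLKW, &lk) < 0)
        return -1;

    int rc = 0;
    const char *p = buf;
    if (lseek(fd, 0, SEEK_END) < 0)
        rc = -1;
    while (rc == 0 && len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            rc = -1;
            break;
        }
        p += n; /* continue after a short write */
        len -= (size_t)n;
    }

    lk.l_type = F_UNLCK;
    if (fcntl(fd, F_SETLK, &lk) < 0)
        rc = -1;
    return rc;
}
```

The descriptor must be open for writing for F_WRLCK to be granted. A caller that installs signal handlers would probably also want to retry F_SETLKW when it fails with EINTR.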


* Re: Question : are concurrent write() calls with O_APPEND on local files atomic ?
  2009-08-19 13:17 ` Josef Bacik
@ 2009-08-20  2:10   ` Andreas Dilger
  2009-08-20 12:28     ` Trond Myklebust
  2009-08-20 14:50   ` AW: Question : are concurrent write() calls with O_APPEND on localfiles " Cornelius, Martin (DWBI)
  1 sibling, 1 reply; 6+ messages in thread
From: Andreas Dilger @ 2009-08-20  2:10 UTC (permalink / raw)
  To: Josef Bacik
  Cc: Cornelius, Martin (DWBI), linux-fsdevel, Roeder, Patrick (DWBI)

On Aug 19, 2009  09:17 -0400, Josef Bacik wrote:
> > - does this guarantee also hold for remote filesystems (nfs / smb) ?
> 
> This I'm more likely to be wrong on, but I don't think so.  It would be atomic
> on the local machine, but if there is somebody else on another machine writing
> to the same file I think you would probably be screwed.

With NFS at least, there is absolutely no guarantee of any kind when
multiple clients write to the same file, even with non-overlapping
writes (i.e. no O_APPEND, but application seeks to different file
offsets).  That is because an NFS client does not necessarily flush
its local cache until it closes the file.

Cheers, Andreas
--
Andreas Dilger
Sr. Staff Engineer, Lustre Group
Sun Microsystems of Canada, Inc.



* Re: Question : are concurrent write() calls with O_APPEND on local files atomic ?
  2009-08-20  2:10   ` Andreas Dilger
@ 2009-08-20 12:28     ` Trond Myklebust
  0 siblings, 0 replies; 6+ messages in thread
From: Trond Myklebust @ 2009-08-20 12:28 UTC (permalink / raw)
  To: Andreas Dilger
  Cc: Josef Bacik, Cornelius, Martin (DWBI), linux-fsdevel,
	Roeder, Patrick (DWBI)

On Wed, 2009-08-19 at 20:10 -0600, Andreas Dilger wrote:
> On Aug 19, 2009  09:17 -0400, Josef Bacik wrote:
> > > - does this guarantee also hold for remote filesystems (nfs / smb) ?
> > 
> > This I'm more likely to be wrong on, but I don't think so.  It would be atomic
> > on the local machine, but if there is somebody else on another machine writing
> > to the same file I think you would probably be screwed.
> 
> With NFS at least, there is absolutely no guarantee of any kind when
> multiple clients write to the same file, even with non-overlapping
> writes (i.e. no O_APPEND, but application seeks to different file
> offsets).  That is because an NFS client does not necessarily flush
> its local cache until it closes the file.

That's not the main problem. NFS clients can (and do) flush writes
immediately if you use O_DIRECT or synchronous writes.

The real issue is rather that the NFS protocol does not support an
atomic APPEND operation. Instead it requires the client to simulate
append semantics, by first retrieving the file size (so that it can
calculate the end-of-file offset) and then issuing the write. This means
that races with other clients can always occur unless they are using
some locking mechanism to prevent it.
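
The race in the emulated append (fetch the size, then write at that offset) can be modelled on a local file with two descriptors standing in for two clients. This is a sketch with hypothetical helper names, not NFS client code: both "clients" query the end-of-file offset before either one writes, so the second write lands on top of the first.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Step 1 of the emulation: ask for the current file size. */
static off_t query_eof(int fd)
{
    struct stat st;
    return fstat(fd, &st) == 0 ? st.st_size : (off_t)-1;
}

/* Step 2: write at whatever offset step 1 returned. */
static ssize_t write_at(int fd, const void *buf, size_t len, off_t off)
{
    return pwrite(fd, buf, len, off);
}

/* Interleaves two emulated appends in the unlucky order and returns the
 * final file size, or -1 on error. */
off_t racy_append_demo(const char *path)
{
    int a = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);
    if (a < 0)
        return -1;
    int b = open(path, O_WRONLY);
    if (b < 0) {
        close(a);
        return -1;
    }

    off_t off_a = query_eof(a); /* client A sees size 0 */
    off_t off_b = query_eof(b); /* client B also sees size 0 */

    off_t size = -1;
    if (off_a >= 0 && off_b >= 0 &&
        write_at(a, "AAAAA\n", 6, off_a) == 6 &&
        write_at(b, "BBBBB\n", 6, off_b) == 6) /* overwrites A's record */
        size = query_eof(a);

    close(a);
    close(b);
    return size;
}
```

Two 6-byte records were "appended", yet the file ends up 6 bytes long and only the second record survives. Neither descriptor uses O_APPEND here, since on Linux pwrite() on an O_APPEND descriptor appends regardless of the offset given.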

I believe the same is true of CIFS.

Cheers
  Trond


* AW: Question : are concurrent write() calls with O_APPEND on localfiles atomic ?
  2009-08-19 13:17 ` Josef Bacik
  2009-08-20  2:10   ` Andreas Dilger
@ 2009-08-20 14:50   ` Cornelius, Martin (DWBI)
  2009-08-20 15:02     ` Josef Bacik
  1 sibling, 1 reply; 6+ messages in thread
From: Cornelius, Martin (DWBI) @ 2009-08-20 14:50 UTC (permalink / raw)
  To: Josef Bacik, linux-fsdevel; +Cc: Roeder, Patrick (DWBI)

Josef Bacik wrote:

> So looking at the code, with O_APPEND set, every time the app calls write() the
> position it's writing to is set to the end of the file.  It looks like most
> people (with the exception of btrfs) will be holding the inode->i_mutex when
> they do a generic_write_checks, which gives the position to write to.  So the
> position to write to and then the subsequent writing are atomic, so unless the
> fs is btrfs (which may or may not be a bug, I'll leave that to the smarter
> people), O_APPEND should appear to be atomic.

Many thanks for your reply, Josef, but I'm still a little uncertain
about what's going on...

Does this mean that in the case of two concurrent write() calls to the
same file, both with O_APPEND, one of them will be blocked until the
other one has finished?



* Re: Question : are concurrent write() calls with O_APPEND on localfiles atomic ?
  2009-08-20 14:50   ` AW: Question : are concurrent write() calls with O_APPEND on localfiles " Cornelius, Martin (DWBI)
@ 2009-08-20 15:02     ` Josef Bacik
  0 siblings, 0 replies; 6+ messages in thread
From: Josef Bacik @ 2009-08-20 15:02 UTC (permalink / raw)
  To: Cornelius, Martin (DWBI)
  Cc: Josef Bacik, linux-fsdevel, Roeder, Patrick (DWBI)

On Thu, Aug 20, 2009 at 08:50:24AM -0600, Cornelius, Martin (DWBI) wrote:
> 
> Josef Bacik wrote:
> 
> > So looking at the code, with O_APPEND set, every time the app calls write() the
> > position it's writing to is set to the end of the file.  It looks like most
> > people (with the exception of btrfs) will be holding the inode->i_mutex when
> > they do a generic_write_checks, which gives the position to write to.  So the
> > position to write to and then the subsequent writing are atomic, so unless the
> > fs is btrfs (which may or may not be a bug, I'll leave that to the smarter
> > people), O_APPEND should appear to be atomic.
> 
> Many thanks for your reply, Josef, but I'm still a little uncertain
> about what's going on...
>
> Does this mean that in the case of two concurrent write() calls to the
> same file, both with O_APPEND, one of them will be blocked until the
> other one has finished?
>

Yeah, all writes are protected by inode->i_mutex, so if two processes are trying
to write to the same file at the same time, one will block until the other one
is finished.  Now without O_APPEND, if they were writing to the same position,
the second one would overwrite the first one.  However, in the case of O_APPEND,
we take the inode->i_mutex, read inode->i_size, and then write to that position,
so O_APPEND will be atomic and will always end up writing to a position that has
not been written to yet.  Thanks,

Josef 
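
The contrast described above can be reproduced deterministically with two descriptors on the same local file. The helper name two_writers() is made up for this sketch; it opens the file twice, writes one 6-byte record through each descriptor, and reports the resulting file size.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Opens path twice with the given extra open() flags, writes one 6-byte
 * record through each descriptor, and returns the final file size
 * (or -1 on error). */
off_t two_writers(const char *path, int extra_flags)
{
    int a = open(path, O_WRONLY | O_CREAT | O_TRUNC | extra_flags, 0600);
    if (a < 0)
        return -1;
    int b = open(path, O_WRONLY | extra_flags);
    if (b < 0) {
        close(a);
        return -1;
    }

    off_t size = -1;
    struct stat st;
    /* Each descriptor has its own file position, initially 0. */
    if (write(a, "first\n", 6) == 6 &&
        write(b, "other\n", 6) == 6 &&
        fstat(a, &st) == 0)
        size = st.st_size;

    close(a);
    close(b);
    return size;
}
```

Called as two_writers(path, 0), both writes start at position 0 and the second overwrites the first, leaving 6 bytes. Called as two_writers(path, O_APPEND), each write is positioned at the current end of file, leaving 12 bytes with both records intact.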

