Re: O_DIRECT wierd behavior..

public inbox for linux-kernel@vger.kernel.org

From: Andrew Morton <akpm@zip.com.au>
To: Hugh Dickins <hugh@veritas.com>
Cc: Andrea Arcangeli <andrea@suse.de>,
	GOTO Masanori <gotom@debian.org>,
	Suresh Gopalakrishnan <gsuresh@cs.rutgers.edu>,
	linux-kernel@vger.kernel.org
Subject: Re: O_DIRECT wierd behavior..
Date: Mon, 17 Dec 2001 10:57:15 -0800
Message-ID: <3C1E400B.A4D25F9D@zip.com.au>
In-Reply-To: <20011217181840.G2431@athlon.random> <Pine.LNX.4.21.0112171757530.2812-100000@localhost.localdomain>

Hugh Dickins wrote:
> 
> On Mon, 17 Dec 2001, Andrea Arcangeli wrote:
> >
> > I'm unsure (it's basically a matter of API, not something a kernel
> > developer can choose liberally), and SUSv2 doesn't say anything about
> > O_SYNC failures in the write(2) manpage, but I guess it would be at
> > least saner to move "pos" backwards if we fail the O_SYNC after having
> > just written something (so if we previously advanced pos).
> 
> I don't have references to back me up, don't take my word for it:
> but I'm sure that the correct behaviour for a partially successful
> read or write in any UNIX is that it return the count done, O_SYNC
> or not, and file position should match that count; only when none
> has been done is -1 returned with errno set.  Most implementations will
> get this wrong in one corner or another, but that's how it should be.
> 
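Hugh's rule can be sketched as a return convention. This is a hypothetical illustration, not kernel code: sketch_write() and the chunk_status[] array (which simulates the per-chunk result a filesystem might get from its page-level write path) are invented for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical sketch of the "count done, else error" convention.
 * chunk_status[i] simulates the result of writing chunk i:
 * 0 for success, or a negative errno such as -EIO or -ENOSPC.
 */
static ssize_t sketch_write(const int *chunk_status, size_t nchunks,
                            size_t chunk_size, off_t *pos)
{
    ssize_t done = 0;
    int err = 0;

    for (size_t i = 0; i < nchunks; i++) {
        err = chunk_status[i];
        if (err < 0)
            break;                      /* stop at the first failing chunk */
        done += (ssize_t)chunk_size;
    }

    *pos += done;                       /* position advances by exactly what was written */
    return done > 0 ? done : err;       /* partial success wins; -errno only if nothing was written */
}
```

With chunk_status = {0, 0, -EIO, 0} and 4096-byte chunks, the call returns 8192 and advances the position by 8192; the -EIO is reported only if the very first chunk fails. That silent loss of the error code is what Andrew's reply takes issue with.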

SUS says: ( http://www.opengroup.org/onlinepubs/007908799/xsh/write.html )

 RETURN VALUE

     Upon successful completion, write() and pwrite() will return the number of bytes
     actually written to the file associated with fildes. This number will never be greater
     than nbyte. Otherwise, -1 is returned and errno is set to indicate the error. 

I take that to mean that if an error occurs, we return that
error regardless of how much was written.

Which makes sense.  Consider this code:

	fd = open(file, O_WRONLY);
	write(fd, buf, 100 * 1024);
	close(fd);

If the write gets an I/O error halfway through, it looks as though the
caller currently never hears about it, except via the short return
value from write().  But from my reading of SUS, a short return value
from write() implicitly means ENOSPC.  If we give a short return for
EIO, the calling app has no way to distinguish it from ENOSPC.

Regarding ENOSPC, SUS says:

     If a write() requests that more bytes be written than there is room for (for example, the
     ulimit or the physical end of a medium), only as many bytes as there is room for will be
     written. For example, suppose there is space for 20 bytes more in a file before reaching
     a limit. A write of 512 bytes will return 20. The next write of a non-zero number of
     bytes will give a failure return (except as noted below)  and the implementation will
     generate a SIGXFSZ signal for the thread.

(We don't raise SIGXFSZ in this case either.)
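The SUS example can be reproduced on a current Linux kernel, with RLIMIT_FSIZE standing in for the ulimit. This is an assumption about a modern platform, not the 2.4 kernel under discussion; current Linux does send SIGXFSZ, so the sketch ignores that signal to observe EFBIG instead of having the default action terminate the process. The function name fsize_demo() and the /tmp scratch path are hypothetical.

```c
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Reproduce the SUS example: room for 20 more bytes (via RLIMIT_FSIZE),
 * a 512-byte write, then a second write. Reports both results, the second
 * errno, and the file offset after the first write.
 * Returns 0 if the setup succeeded, -1 otherwise.
 */
static int fsize_demo(ssize_t *first, ssize_t *second, int *second_errno,
                      off_t *pos_after_first)
{
    char buf[512] = {0};
    char path[] = "/tmp/fsize-demo-XXXXXX";    /* hypothetical scratch file */
    struct rlimit old, lim;
    void (*old_handler)(int);
    int fd = mkstemp(path);

    if (fd < 0)
        return -1;
    unlink(path);                              /* removed once fd is closed */

    if (getrlimit(RLIMIT_FSIZE, &old) != 0) {
        close(fd);
        return -1;
    }
    lim = old;
    lim.rlim_cur = 20;                         /* "space for 20 bytes more" */

    /* Ignore SIGXFSZ so the failing write returns EFBIG instead of killing us. */
    old_handler = signal(SIGXFSZ, SIG_IGN);
    if (setrlimit(RLIMIT_FSIZE, &lim) != 0) {
        signal(SIGXFSZ, old_handler);
        close(fd);
        return -1;
    }

    *first = write(fd, buf, sizeof buf);       /* short write: 20 expected on Linux */
    *pos_after_first = lseek(fd, 0, SEEK_CUR); /* offset should match that count */
    errno = 0;
    *second = write(fd, buf, sizeof buf);      /* at the limit: -1 expected */
    *second_errno = errno;                     /* EFBIG expected */

    setrlimit(RLIMIT_FSIZE, &old);             /* restore the original soft limit */
    signal(SIGXFSZ, old_handler);
    close(fd);
    return 0;
}
```

The lseek() check also illustrates Hugh's point that the file position should match the count returned by the short write.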

Note that I'm not talking about the O_SYNC case here, just a
bog-standard write() in which ->prepare_write() fails.

Blah.  Hard.  Our behaviour at present seems to be mostly correct
for ENOSPC, and probably incorrect (and undesirable) for EIO.
I'd vote for leaving it as-is for the time being.  Getting this
right is a medium-sized project.  There's also the matter of getting
the file pointer in the correct place on error.

-

