linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: NeilBrown <neilb@suse.com>
Cc: Theodore Ts'o <tytso@mit.edu>,
	Trond Myklebust <trondmy@primarydata.com>,
	"kwolf@redhat.com" <kwolf@redhat.com>,
	"riel@redhat.com" <riel@redhat.com>,
	"hch@infradead.org" <hch@infradead.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"jlayton@poochiereds.net" <jlayton@poochiereds.net>,
	"lsf-pc@lists.linux-foundation.org"
	<lsf-pc@lists.linux-foundation.org>,
	"rwheeler@redhat.com" <rwheeler@redhat.com>
Subject: Re: [Lsf-pc] [LSF/MM TOPIC] I/O error handling and fsync()
Date: Thu, 26 Jan 2017 10:25:42 +0100	[thread overview]
Message-ID: <20170126092542.GA17099@quack2.suse.cz> (raw)
In-Reply-To: <87ziieu06k.fsf@notabene.neil.brown.name>

On Thu 26-01-17 11:36:35, NeilBrown wrote:
> On Wed, Jan 25 2017, Theodore Ts'o wrote:
> > On Tue, Jan 24, 2017 at 03:34:04AM +0000, Trond Myklebust wrote:
> >> The reason why I'm thinking open() is because it has to be a contract
> >> between a specific application and the kernel. If the application
> >> doesn't open the file with the O_TIMEOUT flag, then it shouldn't see
> >> nasty non-POSIX timeout errors, even if there is another process that
> >> is using that flag on the same file.
> >> 
> >> The only place where that is difficult to manage is when the file is
> >> mmap()ed (no file descriptor), so you'd presumably have to disallow
> >> mixing mmap and O_TIMEOUT.
> >
> > Well, technically there *is* a file descriptor when you do an mmap.
> > You can close the fd after you call mmap(), but the mmap bumps the
> > refcount on the struct file while the memory map is active.
> >
> > I would argue though that at least for buffered writes, the timeout
> > has to be property of the underlying inode, and if there is an attempt
> > to set timeout on an inode that already has a timeout set to some
> > other non-zero value, the "set timeout" operation should fail with a
> > "timeout already set".  That's becuase we really don't want to have to
> > keep track, on a per-page basis, which struct file was responsible for
> > dirtying a page --- and what if it is dirtied by two different file
> > descriptors?
> 
> You seem to have a very different idea to the one that is forming in my
> mind.  In my vision, once the data has entered the page cache, it
> doesn't matter at all where it came from.  It will remain in the page
> cache, as a dirty page, until it is successfully written or until an
> unrecoverable error occurs.  There are no timeouts once the data is in
> the page cache.

Heh, this has somehow drifted away from the original topic of handling IO
errors :)

> Actually, I'm leaning away from timeouts in general.  I'm not against
> them, but not entirely sure they are useful.
> 
> To be more specific, I imagine a new open flag "O_IO_NDELAY".  It is a
> bit like O_NDELAY, but it explicitly affects IO, never the actual open()
> call, and it is explicitly allowed on regular files and block devices.
> 
> When combined with O_DIRECT, it effectively means "no retries".  For
> block devices and files backed by block devices,
> REQ_FAILFAST_DEV|REQ_FAILFAST_TRANSPORT is used and a failure will be
> reported as EWOULDBLOCK, unless it is obvious that retrying wouldn't
> help.
> Non-block-device filesystems would behave differently.  e.g. NFS would
> probably use a RPC_TASK_SOFT call instead of the normal 'hard' call.
> 
> When used without O_DIRECT:
>  - read would trigger read-ahead much as it does now (which can do
>    nothing if there are resource issues) and would only return data
>    if it was already in the cache.

There was a patch set which did this [1]. Not on per-fd basis but rather on
per-IO basis. Andrew blocked it because he was convinced that mincore() is
good enough interface for this.

>  - write would try to allocate a page, tell the filesystem that it
>    is dirty so that journal space is reserved or whatever is needed,
>    and would tell the dirty_pages rate-limiting that another page was
>    dirty.  If the rate-limiting reported that we cannot dirty a page
>    without waiting, or if any other needed resources were not available,
>    then the write would fail (-EWOULDBLOCK).
>  - fsync would just fail if there were any dirty pages.  It might also
>    do the equivalent of sync_file_range(SYNC_FILE_RANGE_WRITE) without
>    any *WAIT* flags. (alternately, fsync could remain unchanged, and
>    sync_file_range() could gain a SYNC_FILE_RANGE_TEST flag).
> 
> 
> With O_DIRECT there would be a delay, but it would be limited and there
> would be no retry.  There is not currently any way to impose a specific
> delay on REQ_FAILFAST* requests.
> Without O_DIRECT, there could be no significant delay, though code might
> have to wait for a mutex or similar.
> There are a few places that a timeout could usefully be inserted, but
> I'm not sure that would be better than just having the app try again in
> a little while - it would have to be prepared for that anyway.
> 
> I would like O_DIRECT|O_IO_NDELAY for mdadm so we could safely work with
> devices that block when no paths are available.

For O_DIRECT writes, there are database people who want to do non-blocking
AIO writes. Although the problem they want to solve is different - rather
similar to the one patch set [1] is trying to solve for buffered reads -
they want to do AIO write and they want it really non-blocking so they can
do IO submission directly from computation thread without the cost of the
offload to a different process which normally does the IO.

Now you need something different for mdadm but interfaces should probably
be consistent...

> > That being said, I suspect that for many applications, the timeout is
> > going to be *much* more interesting for O_DIRECT writes, and there we
> > can certainly have different timeouts on a per-fd basis.  This is
> > especially for cases where the timeout is implemented in storage
> > device, using multi-media extensions, and where the timout might be
> > measured in milliseconds (e.g., no point reading a video frame if its
> > been delayed too long).  That being said, it block layer would need to
> > know about this as well, since the timeout needs to be relative to
> > when the read(2) system call is issued, not to when it is finally
> > submitted to the storage device.
> 
> Yes. If a deadline could be added to "struct bio", and honoured by
> drivers, then that would make a timeout much more interesting for
> O_DIRECT.

Timeouts are nice but IMO a lot of work and I suspect you'd really need a
dedicated "real-time" IO scheduler for this.

								Honza

[1] https://lwn.net/Articles/636955/

-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  reply	other threads:[~2017-01-26  9:25 UTC|newest]

Thread overview: 42+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-01-10 16:02 [LSF/MM TOPIC] I/O error handling and fsync() Kevin Wolf
2017-01-11  0:41 ` NeilBrown
2017-01-13 11:09   ` Kevin Wolf
2017-01-13 14:21     ` Theodore Ts'o
2017-01-13 16:00       ` Kevin Wolf
2017-01-13 22:28         ` NeilBrown
2017-01-14  6:18           ` Darrick J. Wong
2017-01-16 12:14           ` [Lsf-pc] " Jeff Layton
2017-01-22 22:44             ` NeilBrown
2017-01-22 23:31               ` Jeff Layton
2017-01-23  0:21                 ` Theodore Ts'o
2017-01-23 10:09                   ` Kevin Wolf
2017-01-23 12:10                     ` Jeff Layton
2017-01-23 17:25                       ` Theodore Ts'o
2017-01-23 17:53                         ` Chuck Lever
2017-01-23 22:40                         ` Jeff Layton
2017-01-23 22:35                     ` Jeff Layton
2017-01-23 23:09                       ` Trond Myklebust
2017-01-24  0:16                         ` NeilBrown
2017-01-24  0:46                           ` Jeff Layton
2017-01-24 21:58                             ` NeilBrown
2017-01-25 13:00                               ` Jeff Layton
2017-01-30  5:30                                 ` NeilBrown
2017-01-24  3:34                           ` Trond Myklebust
2017-01-25 18:35                             ` Theodore Ts'o
2017-01-26  0:36                               ` NeilBrown
2017-01-26  9:25                                 ` Jan Kara [this message]
2017-01-26 22:19                                   ` NeilBrown
2017-01-27  3:23                                     ` Theodore Ts'o
2017-01-27  6:03                                       ` NeilBrown
2017-01-30 16:04                                       ` Jan Kara
2017-01-13 18:40     ` Al Viro
2017-01-13 19:06       ` Kevin Wolf
2017-01-11  5:03 ` Theodore Ts'o
2017-01-11  9:47   ` [Lsf-pc] " Jan Kara
2017-01-11 15:45     ` Theodore Ts'o
2017-01-11 10:55   ` Chris Vest
2017-01-11 11:40   ` Kevin Wolf
2017-01-13  4:51     ` NeilBrown
2017-01-13 11:51       ` Kevin Wolf
2017-01-13 21:55         ` NeilBrown
2017-01-11 12:14   ` Chris Vest

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170126092542.GA17099@quack2.suse.cz \
    --to=jack@suse.cz \
    --cc=hch@infradead.org \
    --cc=jlayton@poochiereds.net \
    --cc=kwolf@redhat.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=lsf-pc@lists.linux-foundation.org \
    --cc=neilb@suse.com \
    --cc=riel@redhat.com \
    --cc=rwheeler@redhat.com \
    --cc=trondmy@primarydata.com \
    --cc=tytso@mit.edu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).