From: Dave Chinner <david@fromorbit.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Jens Axboe <axboe@fb.com>,
linux-fsdevel@vger.kernel.org, hch@infradead.org,
viro@zeniv.linux.org.uk, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] direct-io: only inc/dec inode->i_dio_count for file systems
Date: Thu, 16 Apr 2015 08:25:36 +1000 [thread overview]
Message-ID: <20150415222536.GT13731@dastard> (raw)
In-Reply-To: <20150415115653.f69d970e4b155deea98a0829@linux-foundation.org>
On Wed, Apr 15, 2015 at 11:56:53AM -0700, Andrew Morton wrote:
> On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@fb.com> wrote:
>
> > Hi,
> >
> > This is a reposting of a patch that was originally in the blk-mq series.
> > It has huge upside on shared access to a multiqueue device doing
> > O_DIRECT, it's basically the scaling block that ends up killing
> > performance. A quick test here reveals that we spend 30% of all system
> > time just incrementing and decremening inode->i_dio_count. For block
> > devices this isn't useful at all, as we don't need protection against
> > truncate. For that test case, performance increases about 3.6x (!!) by
> > getting rid of this inc/dec per IO.
> >
> > I've cleaned it up a bit since last time, integrating the checks in
> > inode_dio_done() and adding a inode_dio_begin() so that callers don't
> > need to know about this.
> >
> > We've been running a variant of this patch in the FB kernel for a while.
> > I'd like to finally get this upstream.
....
> Is there similar impact to direct-io-to-file? It would be nice to fix
> that up also. Many filesystems do something along the lines of
>
> atomic_inc(i_dio_count);
> wibble()
> atomic_dev(i_dio_count);
> __blockdev_direct_IO(...);
>
> and with your patch I think we could change them to
>
> atomic_inc(i_dio_count);
> wibble()
> __blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
> atomic_dev(i_dio_count);
Can't do it quite that way.
AIO requires the i_dio_count to held until IO completion for all
outstanding IOs. i.e. the increment needs to be in the submission
path, the decrement needs to be in the dio_complete() path,
otherwise we have AIO DIO in progress without a reference count we
can wait on in truncate.
Yes, we might be able to pull it up to the filesystem level now that
dio_complete() is only ever called once per __blockdev_direct_IO()
call, so that may be a solution we can use via filesystem ->end_io
callbacks provided to __blockdev_direct_IO.
> which would halve the atomic op load.
XFS doesn't touch i_dio_count, so it would make no difference to it
at all, which is important, given the DIO rates I can drive through
a single file on XFS - it becomes rwsem cacheline bound on the
shared IO lock at about 2 million IOPS (random 4k read) to a single
file.
FWIW, keep in mind that this i_dio_count originally came from XFS in
the first place, and was pushed into the DIO layer to solve all the
DIO vs extent manipulation problems other fileystems were having...
> But that's piling hack on top of hack. Can we change the
> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> increment i_dio_count"? ie: exclusion against truncate is wholly the
> caller's responsibility. That way, this awkward sharing of
> responsibility between caller and callee gets cleaned up and
> DIO_IGNORE_TRUNCATE goes away.
>
> inode_dio_begin() would be a good place to assert that i_mutex is held,
> btw.
Can't do that, either, as filesystems like XFS don't hold the
i_mutex during direct IO submission.
> This whole i_dio_count thing is pretty nasty, really. If you stand
> back and squint, it's basically an rwsem. I wonder if we can use an
> rwsem...
That doesn't avoid the atomic operations that limit performance.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
prev parent reply other threads:[~2015-04-15 22:25 UTC|newest]
Thread overview: 6+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-04-15 18:22 [PATCH] direct-io: only inc/dec inode->i_dio_count for file systems Jens Axboe
2015-04-15 18:56 ` Andrew Morton
2015-04-15 19:26 ` Jens Axboe
2015-04-15 19:46 ` Andrew Morton
2015-04-15 20:08 ` Jens Axboe
2015-04-15 22:25 ` Dave Chinner [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20150415222536.GT13731@dastard \
--to=david@fromorbit.com \
--cc=akpm@linux-foundation.org \
--cc=axboe@fb.com \
--cc=hch@infradead.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.