From: Jens Axboe <axboe@fb.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: <linux-fsdevel@vger.kernel.org>, <hch@infradead.org>,
	<viro@zeniv.linux.org.uk>, <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH] direct-io: only inc/dec inode->i_dio_count for file systems
Date: Wed, 15 Apr 2015 13:26:51 -0600
Message-ID: <552EBB7B.4030107@fb.com>
In-Reply-To: <20150415115653.f69d970e4b155deea98a0829@linux-foundation.org>

On 04/15/2015 12:56 PM, Andrew Morton wrote:
> On Wed, 15 Apr 2015 12:22:56 -0600 Jens Axboe <axboe@fb.com> wrote:
>
>> Hi,
>>
>> This is a reposting of a patch that was originally in the blk-mq series.
>> It has a huge upside for shared access to a multiqueue device doing
>> O_DIRECT; this inc/dec is basically the scaling bottleneck that ends
>> up killing performance. A quick test here reveals that we spend 30%
>> of all system time just incrementing and decrementing
>> inode->i_dio_count. For block devices this isn't useful at all, as we
>> don't need protection against truncate. For that test case,
>> performance increases about 3.6x (!!) by getting rid of this inc/dec
>> per IO.
>>
>> I've cleaned it up a bit since last time, integrating the checks into
>> inode_dio_done() and adding an inode_dio_begin() so that callers don't
>> need to know about this.
>>
>> We've been running a variant of this patch in the FB kernel for a while.
>> I'd like to finally get this upstream.
>
> 30% overhead for one atomic_inc+atomic_dec+wake_up_bit() per IO?  That
> seems very high!  Is there something else going on?

Nope, that is it. But for this simple test case, it's essentially the
only shared state that exists and gets dirtied by all the CPUs running
an application that does O_DIRECT to the device. The test case ran at
~9.6M IOPS after the patch, and at ~2.5M IOPS before. On a simple
2-socket box, no less, nothing fancy there.

And it's _just_ the atomic inc/dec. I ran with the exact patch posted, 
and that still does (needlessly, I think) the wake_up_bit() for 
inode_dio_done().
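
For reference, inode_dio_done() is essentially just this today (a
sketch of the fs/inode.c version, trimmed):

	void inode_dio_done(struct inode *inode)
	{
		/* the last DIO in flight wakes up a waiting truncate */
		if (atomic_dec_and_test(&inode->i_dio_count))
			wake_up_bit(&inode->i_state, __I_DIO_WAKEUP);
	}

So even with nobody waiting, every IO pays for the atomic dec plus the
hashed waitqueue lookup in wake_up_bit().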

> Is there similar impact to direct-io-to-file?  It would be nice to fix
> that up also.  Many filesystems do something along the lines of
>
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	atomic_dec(i_dio_count);
> 	__blockdev_direct_IO(...);
>
> and with your patch I think we could change them to
>
> 	atomic_inc(i_dio_count);
> 	wibble()
> 	__blockdev_direct_IO(..., flags|DIO_IGNORE_TRUNCATE);
> 	atomic_dec(i_dio_count);
>
> which would halve the atomic op load.

I haven't checked pure file IO, but without extending the file, I
suspect we'd see similar benefits there. In any case, it'd make sense
to do; doing the atomic inc/dec twice per IO is just a bad idea in
general, if we can get rid of it.
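
To make that concrete, the callee side would grow a check along these
lines (a sketch only; the flag name follows this posting, and the exact
placement inside do_blockdev_direct_IO() and its completion path may
differ):

	/* submission: skip the count if the caller already holds it */
	if (!(flags & DIO_IGNORE_TRUNCATE))
		inode_dio_begin(inode);

	/* ... and symmetrically at IO completion */
	if (!(flags & DIO_IGNORE_TRUNCATE))
		inode_dio_done(inode);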

A quick grep doesn't show this use case, or I'm just blind. Where do you 
see that?

> But that's piling hack on top of hack.  Can we change the

I'd view it more as reducing the hack. The real hack is the way we
manually do atomic_inc() on i_dio_count, and then call a magic
inode_dio_done() that decrements it again. It's not very pretty; I'm
just reducing the scope of the hack :-)

> do_blockdev_direct_IO() interface to "caller shall hold i_mutex, or
> increment i_dio_count"?  ie: exclusion against truncate is wholly the
> caller's responsibility.  That way, this awkward sharing of
> responsibility between caller and callee gets cleaned up and
> DIO_IGNORE_TRUNCATE goes away.

That would clean it up, at the expense of more churn. I'd be fine with
doing it that way.

> inode_dio_begin() would be a good place to assert that i_mutex is held,
> btw.

How would you check it? If i_dio_count is already bumped, then you
wouldn't need to hold i_mutex.
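
The naive version would be something like this, but per the above it
would trip on callers that are safe via i_dio_count rather than the
mutex:

	/* sketch: an incomplete assertion in inode_dio_begin() */
	WARN_ON_ONCE(!mutex_is_locked(&inode->i_mutex));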

> This whole i_dio_count thing is pretty nasty, really.  If you stand
> back and squint, it's basically an rwsem.  I wonder if we can use an
> rwsem...

We could, but that's a bit orthogonal to this patch, since we'd want
the same avoidance for blkdevs either way.
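
Purely as an illustration of the squinting, with a hypothetical
i_dio_rwsem in place of i_dio_count it would look like:

	/* a DIO submitter acts as a reader */
	down_read(&inode->i_dio_rwsem);
	/* ... issue and complete the IO ... */
	up_read(&inode->i_dio_rwsem);

	/* truncate acts as the writer, waiting out all DIO */
	down_write(&inode->i_dio_rwsem);
	/* ... shoot down pages and truncate ... */
	up_write(&inode->i_dio_rwsem);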

> What's the reason for DIO_IGNORE_TRUNCATE rather than boring old
> !S_ISBLK?

Didn't want to tie it to block devices, but rather just keep the "need 
lock or not" as a separate flag in case there were more use cases.

-- 
Jens Axboe
