From: Jens Axboe <axboe@kernel.dk>
To: Dave Chinner <david@fromorbit.com>
Cc: Christoph Hellwig <hch@infradead.org>,
linux-fsdevel@vger.kernel.org, linux-man@vger.kernel.org,
xfs@oss.sgi.com
Subject: Re: [PATCH] [RFC] xfs: wire up aio_fsync method
Date: Tue, 17 Jun 2014 07:23:58 -0600
Message-ID: <53A0416E.20105@kernel.dk>
In-Reply-To: <20140616222729.GE9508@dastard>
On 2014-06-16 16:27, Dave Chinner wrote:
> On Mon, Jun 16, 2014 at 01:30:42PM -0600, Jens Axboe wrote:
>> On 06/16/2014 01:19 AM, Dave Chinner wrote:
>>> On Sun, Jun 15, 2014 at 08:58:46PM -0600, Jens Axboe wrote:
>>>> On 2014-06-15 20:00, Dave Chinner wrote:
>>>>> On Mon, Jun 16, 2014 at 08:33:23AM +1000, Dave Chinner wrote:
>>>>> FWIW, the non-linear system CPU overhead of a fs_mark test I've been
>>>>> running isn't anything related to XFS. The async fsync workqueue
>>>>> results in several thousand worker threads dispatching IO
>>>>> concurrently across 16 CPUs:
>>>>>
>>>>> $ ps -ef |grep kworker |wc -l
>>>>> 4693
>>>>> $
>>>>>
>>>>> Profiles from 3.15 + xfs for-next + xfs aio_fsync show:
>>>>>
>>>>> - 51.33% [kernel] [k] percpu_ida_alloc
>>>>> - percpu_ida_alloc
>>>>> + 85.73% blk_mq_wait_for_tags
>>>>> + 14.23% blk_mq_get_tag
>>>>> - 14.25% [kernel] [k] _raw_spin_unlock_irqrestore
>>>>> - _raw_spin_unlock_irqrestore
>>>>> - 66.26% virtio_queue_rq
>>>>> - __blk_mq_run_hw_queue
>>>>> - 99.65% blk_mq_run_hw_queue
>>>>> + 99.47% blk_mq_insert_requests
>>>>> + 0.53% blk_mq_insert_request
>>>>> .....
>>>>> - 7.91% [kernel] [k] _raw_spin_unlock_irq
>>>>> - _raw_spin_unlock_irq
>>>>> - 69.59% __schedule
>>>>> - 86.49% schedule
>>>>> + 47.72% percpu_ida_alloc
>>>>> + 21.75% worker_thread
>>>>> + 19.12% schedule_timeout
>>>>> ....
>>>>> + 18.06% blk_mq_make_request
>>>>>
>>>>> Runtime:
>>>>>
>>>>> real 4m1.243s
>>>>> user 0m47.724s
>>>>> sys 11m56.724s
>>>>>
>>>>> Most of the excessive CPU usage is coming from the blk-mq layer, and
>>>>> XFS is barely showing up in the profiles at all - the IDA tag
>>>>> allocator is burning 8 CPUs at about 60,000 write IOPS....
>>>>>
>>>>> I know that the tag allocator has been rewritten, so I tested
>>>>> against a current Linus kernel with the XFS aio-fsync
>>>>> patch. The results are all over the place - from several sequential
>>>>> runs of the same test (removing the files in between so each test
>>>>> starts from an empty fs):
>>>>>
>>>>> Wall time sys time IOPS files/s
>>>>> 4m58.151s 11m12.648s 30,000 13,500
>>>>> 4m35.075s 12m45.900s 45,000 15,000
>>>>> 3m10.665s 11m15.804s 65,000 21,000
>>>>> 3m27.384s 11m54.723s 85,000 20,000
>>>>> 3m59.574s 11m12.012s 50,000 16,500
>>>>> 4m12.704s 12m15.720s 50,000 17,000
>>>>>
>>>>> The 3.15 based kernel was pretty consistent around the 4m10 mark,
>>>>> generally only +/-10s in runtime and not much change in system time.
>>>>> The files/s rate reported by fs_mark doesn't vary that much, either.
>>>>> So the new tag allocator seems to be no better in terms of IO
>>>>> dispatch scalability, yet adds significant variability to IO
>>>>> performance.
>>>>>
>>>>> What I noticed is a massive jump in context switch overhead: from
>>>>> around 250,000/s to over 800,000/s and the CPU profiles show that
>>>>> this comes from the new tag allocator:
>>>>>
>>>>> - 34.62% [kernel] [k] _raw_spin_unlock_irqrestore
>>>>> - _raw_spin_unlock_irqrestore
>>>>> - 58.22% prepare_to_wait
>>>>> 100.00% bt_get
>>>>> blk_mq_get_tag
>>>>> __blk_mq_alloc_request
>>>>> blk_mq_map_request
>>>>> blk_sq_make_request
>>>>> generic_make_request
>>>>> - 22.51% virtio_queue_rq
>>>>> __blk_mq_run_hw_queue
>>>>> ....
>>>>> - 21.56% [kernel] [k] _raw_spin_unlock_irq
>>>>> - _raw_spin_unlock_irq
>>>>> - 58.73% __schedule
>>>>> - 53.42% io_schedule
>>>>> 99.88% bt_get
>>>>> blk_mq_get_tag
>>>>> __blk_mq_alloc_request
>>>>> blk_mq_map_request
>>>>> blk_sq_make_request
>>>>> generic_make_request
>>>>> - 35.58% schedule
>>>>> + 49.31% worker_thread
>>>>> + 32.45% schedule_timeout
>>>>> + 10.35% _xfs_log_force_lsn
>>>>> + 3.10% xlog_cil_force_lsn
>>>>> ....
> .....
>> Can you try with this patch?
>
> Ok, context switches are back down in the realm of 400,000/s. It's
> better, but it's still a bit higher than with the 3.15 code. XFS is
> actually showing up in the context switch path profiles now...
>
> However, performance is still excitingly variable and not much
> different to not having this patch applied. System time is unchanged
> (still around the 11m20s +/- 1m) and IOPS, wall time and files/s all
> show significant variance (at least +/-25%) from run to run. The
> worst case is not as slow as the unpatched kernel, but it's no
> better than the 3.15 worst case.
>
> Profiles on a slow run look like:
>
> - 43.43% [kernel] [k] _raw_spin_unlock_irq
> - _raw_spin_unlock_irq
> - 64.23% blk_sq_make_request
> generic_make_request
> submit_bio
> + 26.79% __schedule
> ...
> - 15.00% [kernel] [k] _raw_spin_unlock_irqrestore
> - _raw_spin_unlock_irqrestore
> - 39.81% virtio_queue_rq
> __blk_mq_run_hw_queue
> + 24.13% complete
> + 17.74% prepare_to_wait_exclusive
> + 9.66% remove_wait_queue
>
> Looks like the main contention problem is in blk_sq_make_request().
> Also, there looks to be quite a bit of lock contention on the tag
> wait queues given that this patch made prepare_to_wait_exclusive()
> suddenly show up in the profiles.
>
> FWIW, on a fast run there is very little time in
> blk_sq_make_request() lock contention, and overall spin lock/unlock
> overhead of these two functions is around 10% each....
>
> So, yes, the patch reduces context switches but doesn't really
> reduce system time, improve performance noticeably, or address the
> run-to-run variability issue...
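The prepare_to_wait_exclusive() hits are expected with that patch applied -
the tag wait path now queues waiters exclusively, so freeing a tag wakes one
sleeper instead of the whole queue. For reference, the wait loop is just the
standard exclusive waitqueue pattern, roughly like this (a simplified sketch,
with try_get_tag() standing in for the real bitmap allocation - not the
actual blk-mq-tag.c code):

	DEFINE_WAIT(wait);
	int tag;

	tag = try_get_tag(tags);		/* lockless fast path */
	while (tag < 0) {
		prepare_to_wait_exclusive(&bs->wait, &wait,
					  TASK_UNINTERRUPTIBLE);
		/* recheck after queueing ourselves to avoid a lost wakeup */
		tag = try_get_tag(tags);
		if (tag >= 0)
			break;
		io_schedule();			/* sleep until a tag is freed */
	}
	finish_wait(&bs->wait, &wait);

If enough submitters pile onto the same wait queue, the waitqueue spinlock
taken inside prepare_to_wait_exclusive() becomes the hot spot, which would
fit what your profile shows.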
OK, so one more thing to try. With the same patch still applied, could
you edit block/blk-mq-tag.h and change
BT_WAIT_QUEUES = 8,
to
BT_WAIT_QUEUES = 1,
and see if that smooths things out?
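In code terms it's just this one-liner, as a diff (against the enum at the
top of that file - the exact context may differ a bit on your tree):

	--- a/block/blk-mq-tag.h
	+++ b/block/blk-mq-tag.h
	@@ ... @@
	-	BT_WAIT_QUEUES	= 8,
	+	BT_WAIT_QUEUES	= 1,

That collapses all tag waiters onto a single wait queue. If spreading the
waiters across 8 queues is what's making the wakeups (and hence the
runtimes) uneven, a single queue should smooth that out, at the cost of
piling more contention onto that one queue.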
I'm on the road for the next few days, so it may take me a few days to get
back to this. I was able to reproduce the horrible contention on the wait
queue, but everything seemed to behave nicely for me with just the
exclusive_wait/batch_wakeup change. Looks like I'll have to fire up kvm and
set it up the way you have.
--
Jens Axboe