From: Jens Axboe <axboe@kernel.dk>
To: Gabriel Krisman Bertazi <krisman@suse.de>
Cc: linux-kernel@vger.kernel.org, linux-block@vger.kernel.org,
Hugh Dickins <hughd@google.com>, Keith Busch <kbusch@kernel.org>,
Liu Song <liusong@linux.alibaba.com>, Jan Kara <jack@suse.cz>
Subject: Re: [PATCH] sbitmap: Use single per-bitmap counting to wake up queued tags
Date: Wed, 9 Nov 2022 15:06:52 -0700
Message-ID: <cd88f306-1da4-a243-ec23-fea033142fbb@kernel.dk>
In-Reply-To: <20221105231055.25953-1-krisman@suse.de>

On 11/5/22 5:10 PM, Gabriel Krisman Bertazi wrote:
> sbitmap suffers from code complexity, as demonstrated by recent fixes,
> and eventual lost wake-ups on nested I/O completion. The latter happens,
> from what I understand, due to the non-atomic nature of the updates to
> wait_cnt, which needs to be subtracted and eventually reset when equal
> to zero. This two-step process can miss an update when a nested
> completion happens to interrupt the CPU in between the wait_cnt
> updates. This is very hard to fix, as shown by the recent changes to
> this code.
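
Roughly, the two-step pattern being described can be sketched in
userspace C11 as follows (struct and function names are made up for
illustration; this is not the sbitmap code itself):

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Sketch of the old scheme: subtract from wait_cnt, then reset it when
 * it reaches zero.
 */
struct wait_state {
        atomic_int wait_cnt;    /* completions still needed before a wake-up */
        int wake_batch;         /* value wait_cnt is re-armed with */
};

/* Returns true when the caller should wake a batch of waiters. */
static bool two_step_complete(struct wait_state *ws)
{
        int cnt = atomic_fetch_sub(&ws->wait_cnt, 1) - 1;       /* step 1 */

        if (cnt <= 0) {
                /*
                 * Window: a nested completion interrupting the CPU here
                 * also sees wait_cnt <= 0 and races with the reset below,
                 * which is the kind of spot where wake-ups get lost.
                 */
                atomic_store(&ws->wait_cnt, ws->wake_batch);    /* step 2 */
                return true;
        }
        return false;
}
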
>
> The code complexity arises mostly from the corner cases to avoid missed
> wakes in this scenario. In addition, the handling of wake_batch
> recalculation plus the synchronization with sbq_queue_wake_up is
> non-trivial.
>
> This patchset implements the idea originally proposed by Jan [1], which
> removes the need for the two-step updates of wait_cnt. This is done by
> tracking the number of completions and wakeups in always-increasing,
> per-bitmap counters. Instead of having to reset the wait_cnt when it
> reaches zero, we simply keep counting, and attempt to wake up N threads
> in a single wait queue whenever there is enough space for a batch.
> Waking up fewer than wake_batch shouldn't be a problem, because we
> haven't changed the conditions for wake up, and the existing batch
> calculation guarantees at least enough remaining completions to wake up
> a batch for each queue at any time.
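
The counting scheme described above boils down to something like the
following sketch (again with illustrative names, using C11 atomics
rather than the kernel API; the actual patch is more involved):

#include <stdatomic.h>
#include <stdbool.h>

/*
 * Two monotonically increasing per-bitmap counters replace the
 * decrement-and-reset wait_cnt.
 */
struct batch_wake {
        atomic_uint completions;   /* total completions seen on this bitmap */
        atomic_uint wakeups;       /* total waiters claimed for wake-up */
        unsigned int wake_batch;   /* how many waiters to wake at a time */
};

/*
 * Called on each completion; returns true when the caller should wake up
 * wake_batch waiters on the current wait queue.
 */
static bool batch_wake_complete(struct batch_wake *bw)
{
        unsigned int done = atomic_fetch_add(&bw->completions, 1) + 1;
        unsigned int woken = atomic_load(&bw->wakeups);

        /*
         * Unsigned subtraction stays correct across wrap-around, and the
         * compare-and-swap ensures each batch is claimed exactly once
         * even when completions race with each other.
         */
        while (done - woken >= bw->wake_batch) {
                if (atomic_compare_exchange_weak(&bw->wakeups, &woken,
                                                 woken + bw->wake_batch))
                        return true;
                /* woken was reloaded by the failed CAS; re-check the gap */
        }
        return false;
}

The point is that neither counter is ever reset, so a nested completion
only ever adds to completions and there is no reset step left to race
with.
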
>
> Performance-wise, one should expect very similar performance to the
> original algorithm for the case where there is no queueing. In both the
> old algorithm and this implementation, the first step is to check
> ws_active, which bails out if there is no queueing to be managed. In the
> new code, we took care to avoid accounting completions and wakeups when
> there is no queueing, to avoid paying the cost of atomic operations
> unnecessarily; skipping the accounting there doesn't skew the numbers.
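
The no-queueing fast path mentioned here is then essentially a single
check in front of that accounting, e.g. building on the batch_wake
sketch above (ws_active standing in for the count of active wait
queues):

/*
 * Completion-side fast path: when no wait queue is active, skip the
 * atomic accounting entirely so the uncontended case pays no extra cost.
 */
static void on_completion(struct batch_wake *bw, atomic_int *ws_active)
{
        if (!atomic_load(ws_active))
                return;                 /* nobody waiting: nothing to do */

        if (batch_wake_complete(bw)) {
                /* wake up to bw->wake_batch waiters here (omitted) */
        }
}

Both the old and the new code bail out on this check, which is why the
uncontended numbers below are expected to be close.
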
>
> For more interesting cases, where there is queueing, we need to take
> into account the cross-CPU cost of the atomic operations. I've
> been benchmarking by running parallel fio jobs against a single hctx
> nullb in different hardware queue depth scenarios, and verifying both
> IOPS and queueing.
>
> Each experiment was repeated 5 times on a 20-CPU box, with 20 parallel
> jobs. fio was issuing fixed-size randwrites with qd=64 against nullb,
> varying only the hardware queue length per test.
>
> queue size  2                 4                 8                 16                32                64
> 6.1-rc2     1681.1K (1.6K)    2633.0K (12.7K)   6940.8K (16.3K)   8172.3K (617.5K)  8391.7K (367.1K)  8606.1K (351.2K)
> patched     1721.8K (15.1K)   3016.7K (3.8K)    7543.0K (89.4K)   8132.5K (303.4K)  8324.2K (230.6K)  8401.8K (284.7K)
>
> The following is a similar experiment, run against a nullb with a single
> bitmap shared by 20 hctx spread across 2 NUMA nodes. This run has 40
> parallel fio jobs operating on the same device.
>
> queue size  2                 4                 8                 16                32                64
> 6.1-rc2     1081.0K (2.3K)    957.2K (1.5K)     1699.1K (5.7K)    6178.2K (124.6K)  12227.9K (37.7K)  13286.6K (92.9K)
> patched     1081.8K (2.8K)    1316.5K (5.4K)    2364.4K (1.8K)    6151.4K (20.0K)   11893.6K (17.5K)  12385.6K (18.4K)

What's the queue depth of these devices? That's the interesting question
here, as it'll tell us if any of these are actually hitting the slower
path where you made changes. I suspect you are for the second set of
numbers, but not for the first one?
Anything that isn't hitting the wait path for tags isn't a very useful
test, as I would not expect any changes there.
--
Jens Axboe

Thread overview: 18+ messages
2022-11-05 23:10 [PATCH] sbitmap: Use single per-bitmap counting to wake up queued tags Gabriel Krisman Bertazi
2022-11-08 23:28 ` Chaitanya Kulkarni
2022-11-09 3:03 ` Gabriel Krisman Bertazi
2022-11-09 3:35 ` Chaitanya Kulkarni
2022-11-09 22:06 ` Jens Axboe [this message]
2022-11-09 22:48 ` Gabriel Krisman Bertazi
2022-11-10 3:25 ` Jens Axboe
2022-11-10 9:42 ` Yu Kuai
2022-11-10 11:16 ` Jan Kara
2022-11-10 13:18 ` Yu Kuai
2022-11-10 15:35 ` Jan Kara
2022-11-11 0:59 ` Yu Kuai
2022-11-11 15:38 ` Jens Axboe
2022-11-14 13:23 ` Jan Kara
2022-11-14 14:20 ` [PATCH] sbitmap: Advance the queue index before waking up the queue Gabriel Krisman Bertazi
2022-11-14 14:34 ` Jan Kara
2022-11-15 3:52 ` [PATCH] sbitmap: Use single per-bitmap counting to wake up queued tags Gabriel Krisman Bertazi
2022-11-15 10:24 ` Jan Kara