From: Jialin Wang <wjl.linux@gmail.com>
To: axboe@kernel.dk, yukuai@fnnas.com
Cc: linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	lianux.mm@gmail.com, lenohou@gmail.com,
	Jialin Wang <wjl.linux@gmail.com>
Subject: [PATCH] block: fix race in update_io_ticks causing inflated disk statistics
Date: Sat, 28 Feb 2026 18:01:44 +0800
Message-ID: <20260228100144.254436-1-wjl.linux@gmail.com>

When multiple threads issue I/O requests concurrently after a period of
disk idle time, iostat can report abnormal %util spikes (100%+) even
when the actual I/O load is extremely light.

This issue can be reproduced with fio by binding 8 threads to different
CPUs and having each of them issue a 4KB I/O every second:

  fio --name=test --ioengine=sync --rw=randwrite --direct=1 --bs=4k \
    --numjobs=8 --cpus_allowed=0-7 --cpus_allowed_policy=split \
    --thinktime=1s --time_based --runtime=60 --group_reporting \
    --filename=/mnt/sdb/test

The iostat -d sdb 1 output intermittently shows a spurious 100%+ %util:

Device  ...    w/s   wkB/s   wrqm/s  %wrqm w_await wareq-sz  ...  aqu-sz  %util
sdb     ...  16.00  104.00     0.00   0.00    1.25     6.50  ...    0.02   0.90

Device  ...    w/s   wkB/s   wrqm/s  %wrqm w_await wareq-sz  ...  aqu-sz  %util
sdb     ...   8.00   32.00     0.00   0.00    1.38     4.00  ...    0.01 100.30

Device  ...    w/s   wkB/s   wrqm/s  %wrqm w_await wareq-sz  ...  aqu-sz  %util
sdb     ...   8.00   32.00     0.00   0.00    1.38     4.00  ...    0.01   0.20

Device  ...    w/s   wkB/s   wrqm/s  %wrqm w_await wareq-sz  ...  aqu-sz  %util
sdb     ...  11.00   44.00     0.00   0.00    1.27     4.00  ...    0.01  82.80
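
For scale: io_ticks is exported in milliseconds (field 13 of
/proc/diskstats), and iostat derives %util from how many of those
milliseconds accumulated during its sampling interval. Wrongly crediting
a ~1 second idle window therefore adds roughly 1000 ms of "busy" time to
a single 1-second sample:

  %util ~= credited_busy_ms / interval_ms * 100
        ~= 1000 / 1000 * 100
        ~= 100%

which matches the spurious readings above.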

The root cause is a race condition in update_io_ticks(). When the disk
has been idle for a while (e.g., 1 second), part->bd_stamp holds an
old timestamp. If CPU A and CPU B then start I/O at nearly the same
time, the following can happen (sketched as an interleaving after the
list):

1. Both CPUs read the same old 'stamp' and pass the time_after() check.
2. CPU A executes try_cmpxchg() successfully.
3. CPU B fails try_cmpxchg(), exits update_io_ticks(), and immediately
   increments its local in_flight counter via part_stat_local_inc().
4. CPU A continues to evaluate the 'busy' condition:
   end || bdev_count_inflight(part).
5. Since it is an I/O start, 'end' is false, so CPU A calls
   bdev_count_inflight() to check.
6. However, bdev_count_inflight() iterates over all CPUs, sees CPU B's
   newly incremented in_flight count, and returns true.
7. CPU A incorrectly assumes the disk was busy during the entire
   'now - stamp' window (the 1-second idle period) and adds this large
   delta to io_ticks.
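
Roughly, with helpers and arguments abbreviated (the exact pre-patch
condition is the one removed in the diff below):

  CPU A                                    CPU B
  -----                                    -----
  stamp = READ_ONCE(bd_stamp)              stamp = READ_ONCE(bd_stamp)
  time_after(now, stamp)  /* true */       time_after(now, stamp)  /* true */
  try_cmpxchg(...)        /* wins */
                                           try_cmpxchg(...)  /* loses */
                                           part_stat_local_inc(in_flight)
  bdev_count_inflight()  /* sees B's increment -> true */
  __part_stat_add(io_ticks, now - stamp)  /* ~1s of idle credited */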

Fix this by capturing the 'busy' state before performing the
try_cmpxchg(). Taking the snapshot of whether the device is active
before bd_stamp is updated prevents CPU A from being misled by
concurrent I/O submissions from other CPUs that occur after the
timestamp comparison but before the inflight check.

Fixes: 99dc422335d8 ("block: support to account io_ticks precisely")
Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
---
 block/blk-core.c | 9 +++++----
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/block/blk-core.c b/block/blk-core.c
index 474700ffaa1c..1481daf1e664 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1026,10 +1026,11 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
 	unsigned long stamp;
 again:
 	stamp = READ_ONCE(part->bd_stamp);
-	if (unlikely(time_after(now, stamp)) &&
-	    likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&
-	    (end || bdev_count_inflight(part)))
-		__part_stat_add(part, io_ticks, now - stamp);
+	if (unlikely(time_after(now, stamp))) {
+		bool busy = end || bdev_count_inflight(part);
+		if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) && busy)
+			__part_stat_add(part, io_ticks, now - stamp);
+	}
 
 	if (bdev_is_partition(part)) {
 		part = bdev_whole(part);
-- 
2.52.0

