* [PATCH] block: fix race in update_io_ticks causing inflated disk statistics
@ 2026-02-28 10:01 Jialin Wang
2026-03-04 14:24 ` Jialin Wang
2026-03-05 17:16 ` Yu Kuai
0 siblings, 2 replies; 3+ messages in thread
From: Jialin Wang @ 2026-02-28 10:01 UTC (permalink / raw)
To: axboe, yukuai; +Cc: linux-block, linux-kernel, lianux.mm, lenohou, Jialin Wang
When multiple threads issue I/O requests concurrently after a period of
disk idle time, iostat can report abnormal %util spikes (100%+) even
when the actual I/O load is extremely light.
This issue can be reproduced using fio. By binding 8 fio threads to
different CPUs, and having them issue 4KB I/Os every 1 second:
fio --name=test --ioengine=sync --rw=randwrite --direct=1 --bs=4k \
--numjobs=8 --cpus_allowed=0-7 --cpus_allowed_policy=split \
--thinktime=1s --time_based --runtime=60 --group_reporting \
--filename=/mnt/sdb/test
The iostat -d sda 1 output will show a false 100%+ %util randomly:
Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
sdb ... 16.00 104.00 0.00 0.00 1.25 6.50 ... 0.02 0.90
Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
sdb ... 8.00 32.00 0.00 0.00 1.38 4.00 ... 0.01 100.30
Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
sdb ... 8.00 32.00 0.00 0.00 1.38 4.00 ... 0.01 0.20
Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
sdb ... 11.00 44.00 0.00 0.00 1.27 4.00 ... 0.01 82.80
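For reference, iostat derives %util from the io_ticks counter (the "time
spent doing I/Os" field of /proc/diskstats), roughly:

    %util = (io_ticks_now - io_ticks_prev) / interval_ms * 100

so a single bogus charge of ~1000 ms (the idle gap) landing in a 1-second
sampling interval is enough to print ~100%, which matches the 100.30 and
82.80 samples above.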
The root cause is a race condition in update_io_ticks(). When the disk
has been idle for a while (e.g., 1 second), part->bd_stamp holds an
old timestamp. If CPU A and CPU B start I/O at the exact same time:
1. Both CPUs read the same old 'stamp' and pass the time_after() check.
2. CPU A executes try_cmpxchg() successfully.
3. CPU B fails try_cmpxchg(), exits update_io_ticks(), and immediately
increments its local in_flight counter via part_stat_local_inc().
4. CPU A continues to evaluate the 'busy' condition:
end || bdev_count_inflight(part).
5. Since it is an I/O start, 'end' is false, so CPU A calls
bdev_count_inflight() to check.
6. However, bdev_count_inflight() iterates over all CPUs and sees CPU B's
newly incremented in_flight count. It returns true.
7. CPU A incorrectly assumes the disk was busy during the entire
'now - stamp' window (the 1-second idle period) and adds this large
delta to io_ticks.
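For clarity, here is the pre-patch update_io_ticks() reconstructed from the
lines removed by the diff below plus its surrounding context (a sketch for
illustration, not a verbatim copy of the tree), with comments keyed to the
steps above:

void update_io_ticks(struct block_device *part, unsigned long now, bool end)
{
	unsigned long stamp;
again:
	stamp = READ_ONCE(part->bd_stamp);	/* step 1: both CPUs read the stale stamp */
	if (unlikely(time_after(now, stamp)) &&
	    likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&	/* steps 2-3: CPU A wins, CPU B bails out */
	    (end || bdev_count_inflight(part)))	/* steps 4-6: CPU A sees CPU B's new in_flight */
		__part_stat_add(part, io_ticks, now - stamp);	/* step 7: the whole idle gap is charged */

	if (bdev_is_partition(part)) {
		part = bdev_whole(part);
		goto again;
	}
}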
To fix this, we capture the 'busy' state before performing the
try_cmpxchg(). By taking a snapshot of whether the device is active
prior to updating bd_stamp, we prevent CPU A from being misled by
concurrent I/O submissions from other CPUs that occur after the
timestamp comparison but before the inflight check.
Fixes: 99dc422335d8 ("block: support to account io_ticks precisely")
Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
---
block/blk-core.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)
diff --git a/block/blk-core.c b/block/blk-core.c
index 474700ffaa1c..1481daf1e664 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1026,10 +1026,11 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
unsigned long stamp;
again:
stamp = READ_ONCE(part->bd_stamp);
- if (unlikely(time_after(now, stamp)) &&
- likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&
- (end || bdev_count_inflight(part)))
- __part_stat_add(part, io_ticks, now - stamp);
+ if (unlikely(time_after(now, stamp))) {
+ bool busy = end || bdev_count_inflight(part);
+ if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) && busy)
+ __part_stat_add(part, io_ticks, now - stamp);
+ }

if (bdev_is_partition(part)) {
part = bdev_whole(part);
--
2.52.0
* Re: [PATCH] block: fix race in update_io_ticks causing inflated disk statistics
2026-02-28 10:01 [PATCH] block: fix race in update_io_ticks causing inflated disk statistics Jialin Wang
@ 2026-03-04 14:24 ` Jialin Wang
2026-03-05 17:16 ` Yu Kuai
1 sibling, 0 replies; 3+ messages in thread
From: Jialin Wang @ 2026-03-04 14:24 UTC (permalink / raw)
To: wjl.linux; +Cc: axboe, lenohou, lianux.mm, linux-block, linux-kernel, yukuai
On Sat, Feb 28, 2026 at 06:01:44PM +0800, Jialin Wang wrote:
> When multiple threads issue I/O requests concurrently after a period of
> disk idle time, iostat can report abnormal %util spikes (100%+) even
> when the actual I/O load is extremely light.
>
> This issue can be reproduced using fio. By binding 8 fio threads to
> different CPUs, and having them issue 4KB I/Os every 1 second:
>
> fio --name=test --ioengine=sync --rw=randwrite --direct=1 --bs=4k \
> --numjobs=8 --cpus_allowed=0-7 --cpus_allowed_policy=split \
> --thinktime=1s --time_based --runtime=60 --group_reporting \
> --filename=/mnt/sdb/test
>
> The iostat -d sda 1 output will show a false 100%+ %util randomly:
Sorry, I made a typo. It should be 'iostat -d sdb -x 1'.
[...]
--
Regards,
Jialin
* Re: [PATCH] block: fix race in update_io_ticks causing inflated disk statistics
2026-02-28 10:01 [PATCH] block: fix race in update_io_ticks causing inflated disk statistics Jialin Wang
2026-03-04 14:24 ` Jialin Wang
@ 2026-03-05 17:16 ` Yu Kuai
1 sibling, 0 replies; 3+ messages in thread
From: Yu Kuai @ 2026-03-05 17:16 UTC (permalink / raw)
To: Jialin Wang, axboe; +Cc: linux-block, linux-kernel, lianux.mm, lenohou, yukuai
Hi,
On 2026/2/28 18:01, Jialin Wang wrote:
> When multiple threads issue I/O requests concurrently after a period of
> disk idle time, iostat can report abnormal %util spikes (100%+) even
> when the actual I/O load is extremely light.
>
> This issue can be reproduced using fio. By binding 8 fio threads to
> different CPUs, and having them issue 4KB I/Os every 1 second:
>
> fio --name=test --ioengine=sync --rw=randwrite --direct=1 --bs=4k \
> --numjobs=8 --cpus_allowed=0-7 --cpus_allowed_policy=split \
> --thinktime=1s --time_based --runtime=60 --group_reporting \
> --filename=/mnt/sdb/test
>
> The iostat -d sda 1 output will show a false 100%+ %util randomly:
>
> Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
> sdb ... 16.00 104.00 0.00 0.00 1.25 6.50 ... 0.02 0.90
>
> Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
> sdb ... 8.00 32.00 0.00 0.00 1.38 4.00 ... 0.01 100.30
>
> Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
> sdb ... 8.00 32.00 0.00 0.00 1.38 4.00 ... 0.01 0.20
>
> Device ... w/s wkB/s wrqm/s %wrqm w_await wareq-sz ... aqu-sz %util
> sdb ... 11.00 44.00 0.00 0.00 1.27 4.00 ... 0.01 82.80
>
> The root cause is a race condition in update_io_ticks(). When the disk
> has been idle for a while (e.g., 1 second), part->bd_stamp holds an
> old timestamp. If CPU A and CPU B start I/O at the exact same time:
>
> 1. Both CPUs read the same old 'stamp' and pass the time_after() check.
> 2. CPU A executes try_cmpxchg() successfully.
> 3. CPU B fails try_cmpxchg(), exits update_io_ticks(), and immediately
> increments its local in_flight counter via part_stat_local_inc().
> 4. CPU A continues to evaluate the 'busy' condition:
> end || bdev_count_inflight(part).
> 5. Since it is an I/O start, 'end' is false, so CPU A calls
> bdev_count_inflight() to check.
> 6. However, bdev_count_inflight() iterates over all CPUs and sees CPU B's
> newly incremented in_flight count. It returns true.
> 7. CPU A incorrectly assumes the disk was busy during the entire
> 'now - stamp' window (the 1-second idle period) and adds this large
> delta to io_ticks.
>
> To fix this, we capture the 'busy' state before performing the
> try_cmpxchg(). By taking a snapshot of whether the device is active
> prior to updating bd_stamp, we prevent CPU A from being misled by
> concurrent I/O submissions from other CPUs that occur after the
> timestamp comparison but before the inflight check.
>
> Fixes: 99dc422335d8 ("block: support to account io_ticks precisely")
> Signed-off-by: Jialin Wang <wjl.linux@gmail.com>
> ---
> block/blk-core.c | 9 +++++----
> 1 file changed, 5 insertions(+), 4 deletions(-)
>
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 474700ffaa1c..1481daf1e664 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1026,10 +1026,11 @@ void update_io_ticks(struct block_device *part, unsigned long now, bool end)
> unsigned long stamp;
> again:
> stamp = READ_ONCE(part->bd_stamp);
> - if (unlikely(time_after(now, stamp)) &&
> - likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) &&
> - (end || bdev_count_inflight(part)))
> - __part_stat_add(part, io_ticks, now - stamp);
> + if (unlikely(time_after(now, stamp))) {
> + bool busy = end || bdev_count_inflight(part);
First of all, you're moving bdev_count_inflight() before try_cmpxchg(), which
means there is no longer any guarantee that it is called only once per jiffy,
and I think the resulting performance overhead might be unacceptable.
BTW, this is a known issue for us, and we're fixing it in iostat by checking
util together with aqu-sz, specifically:
util = min(min(util, aqusz * 100), 100)
With this approach, if the user issues I/O one by one, util is the same as
aqu-sz; and if the user issues I/O concurrently, aqu-sz must be greater than
util.
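A minimal user-space sketch of that clamp (names are illustrative, not taken
from the real iostat sources; util is the reported %util and aqusz the
reported aqu-sz):

static double clamped_util(double util, double aqusz)
{
	/* an average queue depth of aqusz can justify at most aqusz * 100 percent */
	if (util > aqusz * 100.0)
		util = aqusz * 100.0;
	/* and never report more than 100% busy */
	if (util > 100.0)
		util = 100.0;
	return util;
}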
> + if (likely(try_cmpxchg(&part->bd_stamp, &stamp, now)) && busy)
> + __part_stat_add(part, io_ticks, now - stamp);
> + }
>
> if (bdev_is_partition(part)) {
> part = bdev_whole(part);
--
Thanks,
Kuai