From: Jialin Wang <wjl.linux@gmail.com>
To: tj@kernel.org
Cc: axboe@kernel.dk, cgroups@vger.kernel.org, josef@toxicpanda.com,
linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
wjl.linux@gmail.com
Subject: Re: [PATCH v2] blk-iocost: fix busy_level reset when no IOs complete
Date: Tue, 31 Mar 2026 08:48:04 +0000
Message-ID: <20260331084804.146325-1-wjl.linux@gmail.com>
In-Reply-To: <acrM1flqKQlcONbL@slm.duckdns.org>
Hi,
On Mon, Mar 30, 2026 at 09:19:49AM -1000, Tejun Heo wrote:
> Hello,
>
> On Sun, Mar 29, 2026 at 03:41:12PM +0000, Jialin Wang wrote:
> ...
> > Before:
> > CGROUP     IOPS    MB/s  Avg(ms)  Max(ms)  P90(ms)      P99    P99.9   P99.99
> >
> > cgA-1m      167  167.02   748.65  1641.43   960.50  1551.89  1635.78  1635.78
> > cgB-4k        5    0.02   190.57   806.84   742.39   809.50   809.50   809.50
> >
> > cgA-1m      166  166.36   751.38  1744.31   994.05  1451.23  1736.44  1736.44
> > cgB-32k       4    0.14   225.71  1057.25   759.17  1061.16  1061.16  1061.16
> >
> > cgA-1m      166  165.91   751.48  1610.94  1010.83  1417.67  1602.22  1619.00
> > cgB-256k      5    1.26   198.50  1046.30   742.39  1044.38  1044.38  1044.38
> >
> > After:
> > CGROUP     IOPS    MB/s  Avg(ms)  Max(ms)  P90(ms)      P99    P99.9   P99.99
> >
> > cgA-1m      159  158.59   769.06   828.52   809.50   817.89   826.28   826.28
> > cgB-4k      200    0.78     2.01    26.11     2.87     6.26    12.39    26.08
> >
> > cgA-1m      147  146.84   832.05   985.80   943.72   960.50   985.66   985.66
> > cgB-32k     200    6.25     2.82    71.05     3.42    15.40    50.07    70.78
> >
> > cgA-1m      114  114.47  1044.98  1294.48  1199.57  1283.46  1300.23  1300.23
> > cgB-256k    200   50.00     4.01    34.49     5.08    15.66    30.54    34.34
>
> Are the latency numbers end-to-end or on-device? If former, can you provide
> on-device numbers? What period duration are you using?
These latency numbers are completion latency results from fio using
ioengine=libaio. For cgB, since --iodepth=1 is used, these completion
latencies are very close to the actual on-device times.
I used the following QoS parameters:
rpct=90 rlat=3500 wpct=90 wlat=3500 min=80 max=10000 (period: 7ms)
I then switched to:
rpct=80 rlat=10000 wpct=80 wlat=10000 min=80 max=10000 (period: 40ms)
This showed some improvement, but cgB still failed to reach the
expected 200 IOPS, and its P99 latency remained high:
CGROUP     IOPS    MB/s  Avg(ms)  Max(ms)  P90(ms)      P99    P99.9   P99.99

cgA-1m      161  160.81   758.52  1462.38  1044.38  1317.01  1451.23  1468.01
cgB-4k      125    0.49     7.18   661.39     2.70   189.79   650.12   658.51

cgA-1m      155  154.63   784.92  1234.01  1010.83  1182.79  1233.13  1233.13
cgB-32k     136    4.26     6.40   300.78     3.85   160.43   295.70   299.89

cgA-1m      138  137.91   860.32  1704.14  1317.01  1669.33  1702.89  1702.89
cgB-256k     95   23.70     9.83   394.73     5.34   206.57   396.36   396.36
I also tested several other sets of parameters and the results were similar.
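
For reference, the two period values quoted with the parameter sets above
(7ms and 40ms) are consistent with deriving the period as a multiple of the
latency target, with the multiplier scaled by the percentile. The following
is only a minimal sketch of that arithmetic: the helper name, the
percent-based interface, and the clamp bounds are assumptions made for
illustration, not the kernel's actual code.

```c
#include <assert.h>
#include <stdint.h>

#define MILLION        1000000u
#define MIN_PERIOD_US  1000u     /* assumed lower clamp: 1 ms */
#define MAX_PERIOD_US  1000000u  /* assumed upper clamp: 1 s */

/*
 * Hypothetical helper: derive the planning period (in usecs) from the
 * QoS percentile/latency pairs. The higher latency target is used.
 */
uint32_t qos_period_us(uint32_t rpct, uint32_t rlat_us,
                       uint32_t wpct, uint32_t wlat_us)
{
    uint32_t pct = (rlat_us >= wlat_us) ? rpct : wpct;
    uint32_t lat = (rlat_us >= wlat_us) ? rlat_us : wlat_us;
    uint32_t ppm, multi;
    uint64_t period;

    assert(pct <= 100);
    ppm = pct * 10000;  /* percent -> parts per million */

    /* 2x the latency target at pct >= 90, growing linearly to 10x at pct 50 */
    multi = ppm ? (MILLION - ppm) / 50000 : 2;
    if (multi < 2)
        multi = 2;

    period = (uint64_t)multi * lat;
    if (period < MIN_PERIOD_US)
        period = MIN_PERIOD_US;
    if (period > MAX_PERIOD_US)
        period = MAX_PERIOD_US;
    return (uint32_t)period;
}
```

With rpct=90 rlat=3500 the multiplier is 2, giving 7000us; with rpct=80
rlat=10000 it is 4, giving 40000us.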
Using bpftrace, I can still frequently observe busy_level being reset to 0
when no IOs complete, so the vrate cannot be lowered in time.
08:26:20.186950 iocost_ioc_vrate_adj: [sdb] vrate=127.50%->126.23% busy=4 missed_ppm=1000000:1000000 rq_wait_pct=0 lagging=3 shortages=1
08:26:20.220910 ioc_rqos_done
08:26:20.222616 ioc_rqos_done
08:26:20.226913 ioc_rqos_done
08:26:20.227951 iocost_ioc_vrate_adj: [sdb] vrate=126.23%->124.97% busy=5 missed_ppm=1000000:1000000 rq_wait_pct=0 lagging=3 shortages=1
-- no IOs completed, busy_level was reset to 0 --
08:26:20.268945 iocost_ioc_vrate_adj: [sdb] vrate=124.97%->124.97% busy=0 missed_ppm=0:0 rq_wait_pct=0 lagging=3 shortages=1
bpftrace -e '
#define VTIME_PER_USEC 137438

kfunc:ioc_rqos_done
{
    printf("%s ioc_rqos_done\n", strftime("%H:%M:%S.%f", nsecs));
}

tracepoint:iocost:iocost_ioc_vrate_adj
{
    $old_vrate = args->old_vrate * 10000 / VTIME_PER_USEC;
    $new_vrate = args->new_vrate * 10000 / VTIME_PER_USEC;
    printf("%s iocost_ioc_vrate_adj: [%s] vrate=%d.%02d%%->%d.%02d%% busy=%d missed_ppm=%u:%u rq_wait_pct=%u lagging=%d shortages=%d\n",
           strftime("%H:%M:%S.%f", nsecs), str(args->devname),
           $old_vrate / 100, $old_vrate % 100, $new_vrate / 100,
           $new_vrate % 100, args->busy_level, args->read_missed_ppm,
           args->write_missed_ppm, args->rq_wait_pct, args->nr_lagging,
           args->nr_shortages);
}'
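
The percentage conversion in the script above can be checked in isolation.
This is a small sketch of the same arithmetic in C; it assumes, as the
script does, that a raw vrate of VTIME_PER_USEC (137438) corresponds to
100%. The function names are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define VTIME_PER_USEC 137438ULL  /* raw vrate value treated as 100% */

/* Convert a raw vrate into hundredths of a percent (truncating). */
uint64_t vrate_to_centipct(uint64_t raw_vrate)
{
    return raw_vrate * 10000 / VTIME_PER_USEC;
}

/* Format a raw vrate the way the trace output prints it, e.g. "100.00%". */
int format_vrate(char *buf, size_t len, uint64_t raw_vrate)
{
    uint64_t c = vrate_to_centipct(raw_vrate);

    return snprintf(buf, len, "%llu.%02llu%%",
                    (unsigned long long)(c / 100),
                    (unsigned long long)(c % 100));
}
```

Because the division truncates, the printed percentages can be off by up to
0.01% from the exact value.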
> > @@ -2397,9 +2400,29 @@ static void ioc_timer_fn(struct timer_list *timer)
> > * and should increase vtime rate.
> > */
> > prev_busy_level = ioc->busy_level;
> > - if (rq_wait_pct > RQ_WAIT_BUSY_PCT ||
> > - missed_ppm[READ] > ppm_rthr ||
> > - missed_ppm[WRITE] > ppm_wthr) {
> > + if (!nr_done) {
> > + if (nr_lagging)
>
> Please use {} even when it's just comments that makes the bodies multi-line.
>
> > + /*
> > + * When there are lagging IOs but no completions, we
> > + * don't know if the IO latency will meet the QoS
> > + * targets. The disk might be saturated or not. We
> > + * should not reset busy_level to 0 (which would
> > + * prevent vrate from scaling up or down), but rather
> > + * try to keep it unchanged. To avoid drastic vrate
> > + * oscillations, we clamp it between -4 and 4.
> > + */
> > + ioc->busy_level = clamp(ioc->busy_level, -4, 4);
>
> Is this from some observed behavior or just out of intuition? The
> justification seems a bit flimsy. Why -4 and 4?
During my testing with the parameters rpct=90 rlat=3500 wpct=90 wlat=3500
min=10 max=10000, I noticed that vrate occasionally dropped significantly
(to 50% or lower), which hurt the IOPS of cgA. So I capped busy_level at 4
so that vrate would be reduced at the slowest rate.
CGROUP     IOPS    MB/s  Avg(ms)  Max(ms)  P90(ms)      P99    P99.9   P99.99

cgA-1m      137  137.11   891.21  1278.66  1082.13  1216.35  1266.68  1283.46
cgB-4k      200    0.78     2.12    62.64     2.47     7.44    49.55    62.65

I have since realized that raising min to 80 effectively mitigates this
issue, so I will remove the clamp in v3.
> > + else if (nr_shortages)
> > + /*
> > + * The vrate might be too low to issue any IOs. We
> > + * should allow vrate to increase but not decrease.
> > + */
> > + ioc->busy_level = min(ioc->busy_level, 0);
>
> So, this is no completion, no lagging and shortages case. In the existing
> > code, this would also get busy_level-- to get things moving. Wouldn't this
> path need that too? Or rather, would it make more sense to handle !nr_done
> && nr_lagging case and leave the other cases as-are?
That's a fair point. My initial thought was not to adjust busy_level
when there is no latency data, and I haven't observed this specific path
(no completions, no lagging, but with shortages) occurring in my testing
so far, so I might have been overthinking it. I will simplify the logic
in v3 to handle only the !nr_done && nr_lagging case and leave the other
cases as they are.
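
To make the difference concrete, here is a simplified userspace model of
the busy_level update, with a switch for the "!nr_done && nr_lagging" case
discussed above. It is a sketch only: the struct, the function name, and
the threshold constants (5% rq_wait, 75% unbusy margin, the +/-1000 clamp)
are assumptions made for illustration and may not match the kernel source
exactly.

```c
#include <assert.h>
#include <stdbool.h>

#define RQ_WAIT_BUSY_PCT 5   /* assumed threshold */
#define UNBUSY_THR_PCT   75  /* assumed hysteresis margin */

struct period_stats {        /* hypothetical per-period summary */
    unsigned nr_done;        /* IOs completed in this period */
    unsigned rq_wait_pct;
    unsigned missed_ppm_r, missed_ppm_w;
    unsigned ppm_rthr, ppm_wthr;
    unsigned nr_lagging, nr_shortages;
};

int next_busy_level(int busy, const struct period_stats *s,
                    bool hold_without_completions)
{
    if (hold_without_completions && !s->nr_done && s->nr_lagging) {
        /* No latency samples but IOs still in flight: keep as-is. */
        return busy;
    }

    if (s->rq_wait_pct > RQ_WAIT_BUSY_PCT ||
        s->missed_ppm_r > s->ppm_rthr ||
        s->missed_ppm_w > s->ppm_wthr) {
        /* Clearly missing QoS targets: push towards slowing down. */
        if (busy < 0)
            busy = 0;
        busy++;
    } else if (s->rq_wait_pct <= RQ_WAIT_BUSY_PCT * UNBUSY_THR_PCT / 100 &&
               s->missed_ppm_r <= s->ppm_rthr * UNBUSY_THR_PCT / 100 &&
               s->missed_ppm_w <= s->ppm_wthr * UNBUSY_THR_PCT / 100) {
        if (s->nr_shortages) {
            /* Throttling with spare capacity: stop slowing down. */
            if (busy > 0)
                busy = 0;
            if (!s->nr_lagging)
                busy--;
        } else {
            busy = 0;
        }
    } else {
        /* Inside the hysteresis margin. */
        busy = 0;
    }

    if (busy < -1000)
        busy = -1000;
    if (busy > 1000)
        busy = 1000;
    return busy;
}
```

Feeding in the period from the trace above (no completions, missed_ppm=0:0,
lagging=3, shortages=1, busy_level=5) yields 0 without the hold and 5 with
it, which matches the reset seen in the trace.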
--
Thanks,
Jialin