Linux cgroups development
From: Tejun Heo <tj@kernel.org>
To: Jan Kara <jack@suse.cz>
Cc: "Michal Koutný" <mkoutny@suse.com>,
	"Jinke Han" <hanjinke.666@bytedance.com>,
	josef@toxicpanda.com, axboe@kernel.dk, cgroups@vger.kernel.org,
	linux-block@vger.kernel.org, linux-kernel@vger.kernel.org,
	yinxin.x@bytedance.com
Subject: Re: [PATCH v3] blk-throtl: Introduce sync and async queues for blk-throtl
Date: Mon, 9 Jan 2023 07:10:22 -1000	[thread overview]
Message-ID: <Y7xKfl7gGt+wb/I2@slm.duckdns.org> (raw)
In-Reply-To: <20230109105916.jvnhjdseqkwejmws@quack3>

Hello, Jan.

On Mon, Jan 09, 2023 at 11:59:16AM +0100, Jan Kara wrote:
> Yeah, I agree there's no way back :). But actually I think a lot of the
> functionality of IO schedulers is not needed (by you ;)) only because the
> HW got performant enough and so some issues became less visible. And that
> is all fine, but if you end up in a configuration where your cgroup's IO
> limits and IO demands are similar to how the old rotational disks were
> underprovisioned for the amount of IO the system needed to do
> (i.e., you can easily generate an amount of IO that then takes minutes or
> tens of minutes for your IO subsystem to crunch through), you hit all the
> same problems IO schedulers were trying to solve again. And maybe these
> days we lean more towards the answer "buy more appropriate HW / buy higher
> limits from your infrastructure provider" but it is not like the original
> issues in such configurations disappeared.

Yeah, but I think there's a better way out, as there's still a difference
between the two situations. W/ hard disks, you're actually out of bandwidth.
With SSDs, we know there's capacity we can borrow to get out of the tough
spot. e.g. w/ iocost, you can constrain a cgroup to the point where its
throughput gets to a similar level as a hard disk; however, that still
doesn't (or at least shouldn't) cause noticeable priority inversions outside
of that cgroup, because issue_as_root promotes the IOs which other cgroups
may end up waiting on to root, charging the cost to the cgroup as debt and
further slowing it down afterwards.

There's a lot to be improved - e.g. the debt accounting and payback, and its
propagation to originator throttling, aren't very accurate, usually leading
to over-throttling and, in some cases, under-utilization. The coupling
between IO control and dirty throttling is there and kinda works, but it
seems pretty easy to make it misbehave under heavy control, and so on. But,
even with all those shortcomings, at least iocost is feature complete and
already works (not perfectly, but still) in most cases - it can actually
distribute IO bandwidth across cgroups with arbitrary weights without
causing noticeable priority inversions across cgroups.
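To make the issue_as_root / debt idea above concrete, here's a toy model (in
Python, not the actual kernel code - the names and numbers are made up for
illustration): an IO from a heavily throttled cgroup is issued immediately
"as root" so nobody else ends up waiting on it, its cost is recorded as debt
against the cgroup, and the debt inflates how long that cgroup's subsequent
submissions are delayed until it's paid back.

```python
class Cgroup:
    """Toy stand-in for a blkcg; 'debt' mimics iocost's vtime debt."""
    def __init__(self, name):
        self.name = name
        self.debt = 0  # cost owed from IOs that were promoted to root

def issue_as_root(cg, cost):
    """Issue the IO immediately instead of blocking it, charging the
    cost as debt so the cgroup pays for it afterwards."""
    cg.debt += cost
    return "issued"  # the IO itself never waits

def issuer_delay(cg, pay_rate):
    """Delay applied to the *originator* on its next submission,
    proportional to the outstanding debt."""
    return cg.debt / pay_rate

def pay_debt(cg, amount):
    """Debt gets paid back over time while the cgroup is throttled."""
    cg.debt = max(0, cg.debt - amount)

cg = Cgroup("heavy")
# Three IOs that other cgroups might be waiting on: all complete right away...
for cost in (100, 100, 100):
    assert issue_as_root(cg, cost) == "issued"
# ...but the cgroup now owes 300 cost units, so its own next submission
# is delayed rather than anyone else's.
print(cg.debt)               # 300
print(issuer_delay(cg, 10))  # 30.0
pay_debt(cg, 300)
print(issuer_delay(cg, 10))  # 0.0
```

The point of the model is just the decoupling: the inversion-prone wait is
moved off the IO itself and onto the issuing cgroup after the fact.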

blk-throttle unfortunately doesn't have issue_as_root and the issuer-delay
mechanism hooked up, and we found it near impossible to configure properly
in any scalable manner. Raw bandwidth and IOPS limits just can't capture
variances in application behavior well enough. Often, the valid parameter
space becomes null when trying to cover varied behaviors. Given that the
problem is pretty fundamental to the control scheme, I largely gave up on
it, with the long term goal of implementing io.max on top of iocost down
the line.
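For reference, the knobs in question are the raw per-device limits in the
cgroup2 io.max file; a sketch of the kind of static configuration being
discussed (the device number, cgroup path, and values are illustrative):

```shell
# Cap the "app" cgroup on device 8:0. rbps/wbps are bytes per second,
# riops/wiops are IOs per second; every limit is a fixed number that has
# to fit all of the application's behaviors at once.
echo "8:0 rbps=10485760 wbps=10485760 riops=1000 wiops=1000" \
    > /sys/fs/cgroup/app/io.max

# "max" removes a limit again:
echo "8:0 rbps=max wbps=max riops=max wiops=max" > /sys/fs/cgroup/app/io.max
```

Because the limits are absolute rather than proportional, a value loose
enough for one workload phase is often uselessly tight or loose for another,
which is the "valid parameter space becomes null" problem above.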

> > Another layering problem w/ controlling from elevators is that that's after
> > request allocation and the issuer has already moved on. We used to have
> > per-cgroup rq pools but ripped that out, so it's pretty easy to cause severe
> > priority inversions by depleting the shared request pool, and the fact that
> > throttling takes place after the issuing task returned from issue path makes
> > propagating the throttling operation upwards more challenging too.
> 
> Well, we do have .limit_depth IO scheduler callback these days so BFQ uses
> that to solve the problem of exhaustion of shared request pool but I agree
> it's a bit of a hack on the side.

Ah didn't know about that. Yeah, that'd help the situation to some degree.

> > My bet is that inversion issues are a lot more severe with blk-throttle
> > because it's not work-conserving and not doing things like issue-as-root or
> > other measures to alleviate issues which can arise from inversions.
> 
> Yes, I agree these features of blk-throttle make the problems much more
> likely to happen in practice.

As I wrote above, I largely gave up on blk-throttle, and things like
tweaking sync write priority don't address most of its problems (e.g. it's
still gonna be super easy to stall the whole system with a heavily throttled
cgroup). However, it can still be useful for some use cases, and if it can
be tweaked to become a bit better, I don't see a reason not to do that.

Thanks.

-- 
tejun


Thread overview: 19+ messages
2022-12-26 13:05 [PATCH v3] blk-throtl: Introduce sync and async queues for blk-throtl Jinke Han
2022-12-26 15:24 ` kernel test robot
2023-01-04 22:11   ` Tejun Heo
2023-01-05  7:28       ` [External] " hanjinke
2023-01-05 16:18   ` Michal Koutný
2023-01-05 17:35       ` Tejun Heo
2023-01-05 19:22           ` Michal Koutný
2023-01-05 21:39               ` Tejun Heo
2023-01-06 15:38       ` Jan Kara
2023-01-06 16:58         ` Tejun Heo
2023-01-06 18:07             ` [External] " hanjinke
2023-01-06 18:15                 ` Tejun Heo
2023-01-07  4:44                   ` hanjinke
2023-01-09 18:08                       ` Tejun Heo
2023-01-10 13:07                         ` hanjinke
2023-01-11 12:35               ` Michal Koutný
2023-01-12  3:26                   ` hanjinke
2023-01-09 10:59           ` Jan Kara
2023-01-09 17:10             ` Tejun Heo [this message]
