Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Kemeng Shi <shikemeng@huaweicloud.com>
To: Tejun Heo <tj@kernel.org>
Cc: willy@infradead.org, akpm@linux-foundation.org,
	hcochran@kernelspring.com, mszeredi@redhat.com, axboe@kernel.dk,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org
Subject: Re: [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh
Date: Thu, 8 Feb 2024 17:26:10 +0800	[thread overview]
Message-ID: <ad794d74-5f58-2fed-5a04-2c50c8594723@huaweicloud.com> (raw)
In-Reply-To: <ZbgR5-yOn7f5MtcD@slm.duckdns.org>



on 1/30/2024 5:00 AM, Tejun Heo wrote:
> Hello,
> 
> On Wed, Jan 24, 2024 at 10:01:47AM +0800, Kemeng Shi wrote:
>> Hi Tejun, thanks for reply. For cgroup wb, it will belongs to a global wb
>> domain and a cgroup domain. I agree the way how we calculate wb's threshold
>> in global domain as you described above. This patch tries to fix calculation
>> of wb's threshold in cgroup domain which now is wb_calc_thresh(mdtc->wb,
>> mdtc->bg_thresh)), means:
>> (wb bandwidth) / (system bandwidth) * (*cgroup domain threshold*)
>> The cgroup domain threshold is
>> (memory of cgroup domain) / (memory of system) * (system threshold).
>> Then the wb's threshold in cgroup will be smaller than expected.
>>
>> Consider following domain hierarchy:
>>                 global domain (100G)
>>                 /                 \
>>         cgroup domain1(50G)     cgroup domain2(50G)
>>                 |                 |
>> bdi            wb1               wb2
>> Assume wb1 and wb2 has the same bandwidth.
>> We have global domain bg_thresh 10G, cgroup domain bg_thresh 5G.
>> Then we have:
>> wb's thresh in global domain = 10G * (wb bandwidth) / (system bandwidth)
>> = 10G * 1/2 = 5G
>> wb's thresh in cgroup domain = 5G * (wb bandwidth) / (system bandwidth)
>> = 5G * 1/2 = 2.5G
>> At last, wb1 and wb2 will be limited at 2.5G, the system will be limited
>> at 5G which is less than global domain bg_thresh 10G.
>>
>> After the fix, threshold in cgroup domain will be:
>> (wb bandwidth) / (cgroup bandwidth) * (cgroup domain threshold)
>> The wb1 and wb2 will be limited at 5G, the system will be limited at
>> 10G which equals to global domain bg_thresh 10G.
>>
>> As I didn't take a deep look into memory cgroup, please correct me if
>> anything is wrong. Thanks!
>>> Also, how is this tested? Was there a case where the existing code
>>> misbehaved that's improved by this patch? Or is this just from reading code?
>>
>> This is jut from reading code. Would the case showed above convince you
>> a bit. Look forward to your reply, thanks!.
> 
> So, the explanation makes some sense to me but can you please construct a
> case to actually demonstrate the problem and fix? I don't think it'd be wise
> to apply the change without actually observing the code change does what it
> says it does.
Hi Tejun, sorry for the delay as I found there is a issue that keep triggering
writeback even the dirty page is under dirty background threshold. The issue
make it difficult to observe the expected improvment from this patch. I try to
fix it in [1] and test this patch based on the fix patches.
Run test as following:

/* make background writeback easier to observe */
echo 300000 > /proc/sys/vm/dirty_expire_centisecs
echo 100 > /proc/sys/vm/dirty_writeback_centisecs
/* enable memory and io cgroup */
echo "+memory +io" > /sys/fs/cgroup/cgroup.subtree_control

/* run fio in group1 with shell */
cd /sys/fs/cgroup
mkdir group1
cd group1
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdb
mount /dev/vdb /bdi1/
fio -name test -filename=/bdi1/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

/* run another fio in group2 with another shell */
cd /sys/fs/cgroup
mkdir group2
cd group2
echo 10G > memory.high
echo 10G > memory.max
echo $$ > cgroup.procs
mkfs.ext4 -F /dev/vdc
mount /dev/vdc /bdi2/
fio -name test -filename=/bdi2/file -size=800M -ioengine=libaio -bs=4K -iodepth=1 -rw=write -direct=0 --time_based -runtime=60 -invalidate=0

Before the fix we got (result of three tests):
fio1
  WRITE: bw=1304MiB/s (1367MB/s), 1304MiB/s-1304MiB/s (1367MB/s-1367MB/s), io=76.4GiB (82.0GB), run=60001-60001msec
  WRITE: bw=1351MiB/s (1417MB/s), 1351MiB/s-1351MiB/s (1417MB/s-1417MB/s), io=79.2GiB (85.0GB), run=60001-60001msec
  WRITE: bw=1373MiB/s (1440MB/s), 1373MiB/s-1373MiB/s (1440MB/s-1440MB/s), io=80.5GiB (86.4GB), run=60001-60001msec
fio2
  WRITE: bw=1134MiB/s (1190MB/s), 1134MiB/s-1134MiB/s (1190MB/s-1190MB/s), io=66.5GiB (71.4GB), run=60001-60001msec
  WRITE: bw=1414MiB/s (1483MB/s), 1414MiB/s-1414MiB/s (1483MB/s-1483MB/s), io=82.8GiB (88.0GB), run=60001-60001msec
  WRITE: bw=1469MiB/s (1540MB/s), 1469MiB/s-1469MiB/s (1540MB/s-1540MB/s), io=86.0GiB (92.4GB), run=60001-60001msec

After the fix we got (result of three tests):
fio1
  WRITE: bw=1719MiB/s (1802MB/s), 1719MiB/s-1719MiB/s (1802MB/s-1802MB/s), io=101GiB (108GB), run=60001-60001msec
  WRITE: bw=1723MiB/s (1806MB/s), 1723MiB/s-1723MiB/s (1806MB/s-1806MB/s), io=101GiB (108GB), run=60001-60001msec
  WRITE: bw=1691MiB/s (1774MB/s), 1691MiB/s-1691MiB/s (1774MB/s-1774MB/s), io=99.2GiB (106GB), run=60036-60036msec
fio2
  WRITE: bw=1692MiB/s (1774MB/s), 1692MiB/s-1692MiB/s (1774MB/s-1774MB/s), io=99.1GiB (106GB), run=60001-60001msec
  WRITE: bw=1681MiB/s (1763MB/s), 1681MiB/s-1681MiB/s (1763MB/s-1763MB/s), io=98.5GiB (106GB), run=60001-60001msec
  WRITE: bw=1671MiB/s (1752MB/s), 1671MiB/s-1671MiB/s (1752MB/s-1752MB/s), io=97.9GiB (105GB), run=60001-60001msec
I also add code to print the pages written in writeback and pages written in
writeback reduce a lot and are rare after this fix.

[1] https://lore.kernel.org/linux-fsdevel/20240208172024.23625-2-shikemeng@huaweicloud.com/T/#u
> 
> Thanks.
>

next prev parent reply	other threads:[~2024-02-08  9:26 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-23 18:33 [PATCH 0/5] Fix and cleanups to page-writeback Kemeng Shi
2024-01-23 18:33 ` [PATCH 1/5] mm: enable __wb_calc_thresh to calculate dirty background threshold Kemeng Shi
2024-01-23 18:33 ` [PATCH 2/5] mm: correct calculation of cgroup wb's bg_thresh in wb_over_bg_thresh Kemeng Shi
2024-01-23 20:43   ` Tejun Heo
2024-01-24  2:01     ` Kemeng Shi
2024-01-29 21:00       ` Tejun Heo
2024-02-08  9:26         ` Kemeng Shi [this message]
2024-02-08 19:32           ` Tejun Heo
2024-02-18  2:35             ` Kemeng Shi
2024-02-20 17:34               ` Tejun Heo
2024-01-23 18:33 ` [PATCH 3/5] mm: call __wb_calc_thresh instead of wb_calc_thresh " Kemeng Shi
2024-01-23 18:33 ` [PATCH 4/5] mm: remove redundant check in wb_min_max_ratio Kemeng Shi
2024-01-23 18:33 ` [PATCH 5/5] mm: remove stale comment __folio_mark_dirty Kemeng Shi
2024-01-23 14:07   ` Matthew Wilcox
2024-01-23 20:46 ` [PATCH 0/5] Fix and cleanups to page-writeback Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ad794d74-5f58-2fed-5a04-2c50c8594723@huaweicloud.com \
    --to=shikemeng@huaweicloud.com \
    --cc=akpm@linux-foundation.org \
    --cc=axboe@kernel.dk \
    --cc=hcochran@kernelspring.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mszeredi@redhat.com \
    --cc=tj@kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).