linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: Yafang Shao <laoar.shao@gmail.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: Chris Down <chris@chrisdown.name>,
	cgroups@vger.kernel.org, hannes@cmpxchg.org,  kernel-team@fb.com,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	 yuzhao@google.com
Subject: Re: MGLRU premature memcg OOM on slow writes
Date: Mon, 11 Mar 2024 17:11:06 +0800	[thread overview]
Message-ID: <CALOAHbBupMYBMWEzMK2xdhnqwR1C1+mJSrrZC1L0CKE2BMSC+g@mail.gmail.com> (raw)
In-Reply-To: <CAJHvVch2qVUDTJjNeSMqLBx0yoEm4zzb=ZXmABbd_5dWGQTpNg@mail.gmail.com>

On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen <axelrasmussen@google.com> wrote:
>
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down <chris@chrisdown.name> wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).

We encountered the same premature OOM issue caused by numerous dirty pages.
The issue disappears after we revert the commit 14aa8b2d5c2e
"mm/mglru: don't sync disk for each aging cycle"

To aid in replicating the issue, we've developed a straightforward
script, which consistently reproduces it, even on the latest kernel.
You can find the script provided below:

```
#!/bin/bash

MEMCG="/sys/fs/cgroup/memory/mglru"
ENABLE=$1

# Avoid waking up the flusher
sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 *4))
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 *4))

if [ ! -d ${MEMCG} ]; then
        mkdir -p ${MEMCG}
fi

echo $$ > ${MEMCG}/cgroup.procs
echo 1g > ${MEMCG}/memory.limit_in_bytes

if [ $ENABLE -eq 0 ]; then
        echo 0 > /sys/kernel/mm/lru_gen/enabled
else
        echo 0x7 > /sys/kernel/mm/lru_gen/enabled
fi

dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
rm -rf /data0/mglru.test
```

This issue disappears as well after we disable the mglru.

We hope this script proves helpful in identifying and addressing the
root cause. We eagerly await your insights and proposed fixes.

>
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it. :)
>
>
>
> #!/usr/bin/env bash
>
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
>
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
>
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
>
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
>
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
>
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
>
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
>         fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
>         -numjobs=10 -nrfiles=90 -filesize=1048576 \
>         -fallocate=posix \
>         -blocksize=4k -ioengine=mmap \
>         -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
>         -runtime=300 -time_based
>


-- 
Regards
Yafang


  parent reply	other threads:[~2024-03-11  9:11 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-09  2:31 MGLRU premature memcg OOM on slow writes Chris Down
2024-02-29 17:28 ` Chris Down
2024-02-29 23:51 ` Axel Rasmussen
2024-03-01  0:30   ` Chris Down
2024-03-08 19:18     ` Axel Rasmussen
2024-03-08 21:22       ` Johannes Weiner
2024-03-11  9:11       ` Yafang Shao [this message]
2024-03-12 16:44         ` Axel Rasmussen
2024-03-12 20:07           ` Yu Zhao
2024-03-12 20:11             ` Yu Zhao
2024-03-13  3:33               ` Yafang Shao
2024-03-14 22:23                 ` Yu Zhao
2024-03-15  2:38                   ` Yafang Shao
2024-03-15 14:27                     ` Johannes Weiner
2024-03-12 21:08             ` Johannes Weiner
2024-03-13  2:08               ` Yu Zhao
2024-03-13  3:22                 ` Johannes Weiner
2024-03-13 10:59               ` Hillf Danton
2024-03-01 11:25   ` Hillf Danton

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CALOAHbBupMYBMWEzMK2xdhnqwR1C1+mJSrrZC1L0CKE2BMSC+g@mail.gmail.com \
    --to=laoar.shao@gmail.com \
    --cc=axelrasmussen@google.com \
    --cc=cgroups@vger.kernel.org \
    --cc=chris@chrisdown.name \
    --cc=hannes@cmpxchg.org \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=yuzhao@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).