From: Yafang Shao
Date: Mon, 11 Mar 2024 17:11:06 +0800
Subject: Re: MGLRU premature memcg OOM on slow writes
To: Axel Rasmussen
Cc: Chris Down, cgroups@vger.kernel.org, hannes@cmpxchg.org, kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, yuzhao@google.com
References: <20240229235134.2447718-1-axelrasmussen@google.com>

On Sat, Mar 9, 2024 at 3:19 AM Axel Rasmussen wrote:
>
> On Thu, Feb 29, 2024 at 4:30 PM Chris Down wrote:
> >
> > Axel Rasmussen writes:
> > >A couple of dumb questions. In your test, do you have any of the following
> > >configured / enabled?
> > >
> > >/proc/sys/vm/laptop_mode
> > >memory.low
> > >memory.min
> >
> > None of these are enabled. The issue is trivially reproducible by writing to
> > any slow device with memory.max enabled, but from the code it looks like MGLRU
> > is also susceptible to this on global reclaim (although it's less likely due to
> > page diversity).
> >
> > >Besides that, it looks like the place non-MGLRU reclaim wakes up the
> > >flushers is in shrink_inactive_list() (which calls wakeup_flusher_threads()).
> > >Since MGLRU calls shrink_folio_list() directly (from evict_folios()), I agree it
> > >looks like it simply will not do this.
> > >
> > >Yosry pointed out [1], where MGLRU used to call this but stopped doing that. It
> > >makes sense to me at least that doing writeback every time we age is too
> > >aggressive, but doing it in evict_folios() makes some sense to me, basically to
> > >copy the behavior the non-MGLRU path (shrink_inactive_list()) has.
> >
> > Thanks! We may also need reclaim_throttle(), depending on how you implement it.
> > Current non-MGLRU behaviour on slow storage is also highly suspect in terms of
> > (lack of) throttling after moving away from VMSCAN_THROTTLE_WRITEBACK, but one
> > thing at a time :-)
>
>
> Hmm, so I have a patch which I think will help with this situation,
> but I'm having some trouble reproducing the problem on 6.8-rc7 (so
> then I can verify the patch fixes it).

We encountered the same premature OOM issue, caused by a large number of
dirty pages. The issue disappears after we revert commit 14aa8b2d5c2e
("mm/mglru: don't sync disk for each aging cycle").

To help reproduce the issue, we wrote a simple script that triggers it
consistently, even on the latest kernel. The script is below:

```
#!/bin/bash

MEMCG="/sys/fs/cgroup/memory/mglru"
# $1: 0 disables MGLRU, any other integer enables it
ENABLE=$1

# Avoid waking up the flusher
sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 4))
sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))

if [ ! -d "${MEMCG}" ]; then
    mkdir -p "${MEMCG}"
fi

# Move this shell into the memcg and cap it at 1G (cgroup v1 interface)
echo $$ > "${MEMCG}/cgroup.procs"
echo 1g > "${MEMCG}/memory.limit_in_bytes"

if [ "$ENABLE" -eq 0 ]; then
    echo 0 > /sys/kernel/mm/lru_gen/enabled
else
    echo 0x7 > /sys/kernel/mm/lru_gen/enabled
fi

# Write just under 1G of dirty page cache, then clean up
dd if=/dev/zero of=/data0/mglru.test bs=1M count=1023
rm -rf /data0/mglru.test
```

The issue also disappears when MGLRU is disabled.

We hope this script helps in identifying and fixing the root cause, and
we look forward to your insights and proposed fixes.

>
> If I understand the issue right, all we should need to do is get a
> slow filesystem, and then generate a bunch of dirty file pages on it,
> while running in a tightly constrained memcg. To that end, I tried the
> following script. But, in reality I seem to get little or no
> accumulation of dirty file pages.
>
> I thought maybe fio does something different than rsync which you said
> you originally tried, so I also tried rsync (copying /usr/bin into
> this loop mount) and didn't run into an OOM situation either.
>
> Maybe some dirty ratio settings need tweaking or something to get the
> behavior you see? Or maybe my test has a dumb mistake in it.
> :)
>
>
>
> #!/usr/bin/env bash
>
> echo 0 > /proc/sys/vm/laptop_mode || exit 1
> echo y > /sys/kernel/mm/lru_gen/enabled || exit 1
>
> echo "Allocate disk image"
> IMAGE_SIZE_MIB=1024
> IMAGE_PATH=/tmp/slow.img
> dd if=/dev/zero of=$IMAGE_PATH bs=1024k count=$IMAGE_SIZE_MIB || exit 1
>
> echo "Setup loop device"
> LOOP_DEV=$(losetup --show --find $IMAGE_PATH) || exit 1
> LOOP_BLOCKS=$(blockdev --getsize $LOOP_DEV) || exit 1
>
> echo "Create dm-slow"
> DM_NAME=dm-slow
> DM_DEV=/dev/mapper/$DM_NAME
> echo "0 $LOOP_BLOCKS delay $LOOP_DEV 0 100" | dmsetup create $DM_NAME || exit 1
>
> echo "Create fs"
> mkfs.ext4 "$DM_DEV" || exit 1
>
> echo "Mount fs"
> MOUNT_PATH="/tmp/$DM_NAME"
> mkdir -p "$MOUNT_PATH" || exit 1
> mount -t ext4 "$DM_DEV" "$MOUNT_PATH" || exit 1
>
> echo "Generate dirty file pages"
> systemd-run --wait --pipe --collect -p MemoryMax=32M \
>         fio -name=writes -directory=$MOUNT_PATH -readwrite=randwrite \
>           -numjobs=10 -nrfiles=90 -filesize=1048576 \
>           -fallocate=posix \
>           -blocksize=4k -ioengine=mmap \
>           -direct=0 -buffered=1 -fsync=0 -fdatasync=0 -sync=0 \
>           -runtime=300 -time_based
>

-- 
Regards
Yafang
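
For anyone comparing runs with MGLRU on and off, it may help to watch how many dirty and writeback bytes the memcg is holding while a reproducer runs. The following is only a minimal sketch and not part of either script in this thread: the helper name memcg_dirty_stats is hypothetical, and it assumes memory.stat reports these counters as "dirty"/"writeback" (cgroup v1) or "file_dirty"/"file_writeback" (cgroup v2).

```shell
#!/bin/bash

# Hypothetical helper (an assumption, not from the thread): print the dirty
# and writeback byte counts found in a cgroup memory.stat file.
# Matches the cgroup v1 keys (dirty, writeback) and the cgroup v2 keys
# (file_dirty, file_writeback); hierarchical total_* keys are ignored
# because the key must match exactly.
memcg_dirty_stats() {
    local stat_file=$1
    awk '$1 == "dirty" || $1 == "file_dirty"         { dirty = $2 }
         $1 == "writeback" || $1 == "file_writeback" { wb = $2 }
         END { printf "dirty=%d writeback=%d\n", dirty, wb }' "$stat_file"
}

# Example use, from a second shell while a reproducer is running.
# The path is assumed to match the memcg created by the reproducer:
#   while sleep 1; do
#       memcg_dirty_stats /sys/fs/cgroup/memory/mglru/memory.stat
#   done
```

If the dirty count climbs toward the memcg limit while writeback stays near zero right before the OOM, that would be consistent with the flusher not being woken, though this sketch by itself does not prove it.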