Date: Fri, 15 Mar 2024 10:27:11 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Yafang Shao
Cc: Yu Zhao, Axel Rasmussen, Chris Down, cgroups@vger.kernel.org,
	kernel-team@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: MGLRU premature memcg OOM on slow writes
Message-ID: <20240315142711.GA1944@cmpxchg.org>
References: <20240229235134.2447718-1-axelrasmussen@google.com>
On Fri, Mar 15, 2024 at 10:38:31AM +0800, Yafang Shao wrote:
> On Fri, Mar 15, 2024 at 6:23 AM Yu Zhao wrote:
> > I'm surprised to see there were 0 pages under writeback:
> >
> > [Wed Mar 13 11:16:48 2024] total_writeback 0
> >
> > What's your dirty limit?
>
> The background dirty threshold is 2G, and the dirty threshold is 4G.
>
> sysctl -w vm.dirty_background_bytes=$((1024 * 1024 * 1024 * 2))
> sysctl -w vm.dirty_bytes=$((1024 * 1024 * 1024 * 4))
>
> > It's unfortunate that the mainline has no per-memcg dirty limit. (We
> > do at Google.)
>
> Per-memcg dirty limit is a useful feature. We also support it in our
> local kernel, but we didn't enable it for this test case.
> It is unclear why the memcg maintainers insist on rejecting the
> per-memcg dirty limit :(

I don't think that assessment is fair. It's just that nobody has
seriously proposed it (at least not that I remember) since cgroup-aware
writeback was merged in 2015.

We run millions of machines with different workloads, memory sizes,
and IO devices, and don't feel the need to tune the global dirty limit
settings away from the defaults. Cgroup-aware writeback allots those
allowances in proportion to observed writeback speed and available
memory in the container. We set IO rate and memory limits per
container, and it adapts as necessary.

If you have an actual usecase, I'm more than willing to hear you out.
I'm sure that the other maintainers feel the same.
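To illustrate the proportional allotment, here's a rough userspace
sketch (my own Python and my own example figures, not the kernel's
actual writeback code, which also weighs in the device's observed
writeback bandwidth) of how the global thresholds scale down to a
container's share of memory:

```python
GiB = 1024 ** 3

def cgroup_dirty_allowance(global_bg, global_limit,
                           cgroup_avail, system_avail):
    """Scale the global dirty thresholds by the cgroup's share of
    available memory (simplified model)."""
    share = cgroup_avail / system_avail
    return int(global_bg * share), int(global_limit * share)

# The global thresholds quoted above: 2G background, 4G hard limit.
# A container holding 8G of a 64G machine gets 1/8 of each:
bg, hard = cgroup_dirty_allowance(2 * GiB, 4 * GiB,
                                  cgroup_avail=8 * GiB,
                                  system_avail=64 * GiB)
print(bg >> 20, hard >> 20)  # 256 512 (MiB)
```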
If you're proposing it as a workaround for cgroup1 being architecturally
unable to implement proper writeback cache management, then it's a more
difficult argument. That's one of the big reasons why cgroup2 exists,
after all.

> > > As of now, it appears that the most effective solution to address
> > > this issue is to revert the commit 14aa8b2d5c2e. Its original
> > > intention was to eliminate potential SSD wearout, although there's
> > > no concrete data available on how it might impact SSD longevity.
> > > If the concern about SSD wearout is purely theoretical, it might
> > > be reasonable to consider reverting this commit.
> >
> > The SSD wearout problem was real -- it wasn't really due to
> > wakeup_flusher_threads() itself; rather, the original MGLRU code
> > called the function improperly. It needs to be called under more
> > restricted conditions so that it doesn't cause the SSD wearout
> > problem again.
> > However, IMO, wakeup_flusher_threads() is just another bandaid
> > trying to work around a more fundamental problem. There is no
> > guarantee that the flusher will target the dirty pages in the memcg
> > under reclaim, right?
>
> Right, it is a system-wide flusher.

Is it possible it was woken up just too frequently? Conventional
reclaim wakes it based on dirty pages actually observed off the LRU.
I'm not super familiar with MGLRU, but it looks like it woke the
flusher on every generational bump? That might indeed be too frequent,
and doesn't seem related to the writeback cache state.

We're monitoring write rates quite closely due to wearout concerns as
well, especially because we use disk swap too. This is the first time
I'm hearing about reclaim-driven wakeups being a concern. (The direct
writepage calls were a huge problem. But not waking the flushers.)

Frankly, I don't think the issue is fixable without bringing the
wakeup back in some form. Even if you had per-cgroup dirty limits.
As soon as you have non-zero dirty pages, you can produce allocation
patterns that drive reclaim into them before background writeback
kicks in. If reclaim doesn't wake the flushers and just waits for
writeback, the premature OOM margin is the size of the background
limit - 1.

Yes, cgroup1 and cgroup2 react differently to seeing pages under
writeback: cgroup1 does wait_on_page_writeback(); cgroup2 samples
batches of pages and throttles at a higher level. But both of them
need the flushers woken, or there is nothing to wait for. Unless you
want to wait for dirty expiration :)
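To put numbers on that margin, a toy model (my own helper and made-up
figures, not kernel code): park dirty pages just below the background
threshold so the flushers never start on their own, and reclaim that
refuses to wake them can only free the clean pages:

```python
# All page counts are invented for illustration.
background_limit = 512          # dirty background threshold, in pages
dirty = background_limit - 1    # just below it: flushers stay asleep
clean = 100                     # all that reclaim can free unaided
demand = 200                    # allocation burst driving reclaim

def reclaimable(clean, dirty, flushers_woken):
    # Without a wakeup, dirty pages never become clean and freeable,
    # and there is nothing under writeback to wait for.
    return clean + (dirty if flushers_woken else 0)

# No wakeup: only the clean pages are coverable -> premature OOM,
# despite ~background_limit - 1 pages of perfectly flushable data.
print(reclaimable(clean, dirty, flushers_woken=False) >= demand)  # False

# With the wakeup, the same burst is absorbed.
print(reclaimable(clean, dirty, flushers_woken=True) >= demand)   # True
```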