Subject: Re: [LSF/MM/BPF TOPIC] Reimagining Memory Cgroup (memcg_ext)
From: Muchun Song
Date: Thu, 12 Mar 2026 10:46:10 +0800
Cc: lsf-pc@lists.linux-foundation.org, Andrew Morton, Tejun Heo,
	Michal Hocko, Johannes Weiner, Alexei Starovoitov, Michal Koutný,
	Roman Gushchin, Hui Zhu, JP Kobryn, Geliang Tang, Sweet Tea Dorminy,
	Emil Tsalapatis, David Rientjes, Martin KaFai Lau, Meta kernel team,
	linux-mm@kvack.org, cgroups@vger.kernel.org, bpf@vger.kernel.org,
	linux-kernel@vger.kernel.org
Message-Id: <8F3593EB-9D81-4459-8675-E922426DCB1E@linux.dev>
References: <20260307182424.2889780-1-shakeel.butt@linux.dev>
	<3ECC9B38-6C1A-4F60-9C18-98B7A1A56355@linux.dev>
To: Shakeel Butt

> On Mar 12, 2026, at 04:39, Shakeel Butt wrote:
>
> On Wed, Mar 11, 2026 at 03:19:31PM +0800, Muchun Song wrote:
>>
>>> On Mar 8, 2026, at 02:24, Shakeel Butt wrote:
>>>
>
> [...]
>
>>>
>>> Per-Memcg Background Reclaim
>>>
>>> In the new memcg world, with the goal of (mostly) eliminating direct synchronous
>>> reclaim for limit enforcement, provide per-memcg background reclaimers which can
>>> scale across CPUs with the allocation rate.
>>
>> Hi Shakeel,
>>
>> I'm quite interested in this.
>> Internally, we maintain a private set
>> of patches implementing asynchronous reclamation, but we are also trying to
>> drop as much of this private code as possible. Therefore, we want to
>> implement a similar asynchronous reclamation mechanism in user space
>> on top of the memory.reclaim interface. However, there is currently no
>> suitable notification mechanism to tell user-space threads when to
>> start reclaiming proactively.
>
> Cool, can you please share what "suitable policy notification mechanisms" you
> need for your use-case? This will give me more data on the comparison between
> memory.reclaim and the proposed approach.

If we expect proactive reclamation to be triggered when the current
memcg's memory usage reaches a certain point, we have to continuously read
memory.current to check whether it has crossed the watermark we set, and
then trigger asynchronous reclamation. What we really need is an event
that notifies user-space threads when memory usage reaches a specific
watermark; the events currently exposed through memory.events lack
support for custom watermarks.

>
>>>
>>> Lock-Aware Throttling
>>>
>>> The ability to avoid throttling an allocating task that is holding locks, to
>>> prevent priority inversion. In Meta's fleet, we have observed lock holders stuck
>>> in memcg reclaim, blocking all waiters regardless of their priority or
>>> criticality.
>>
>> This is a real problem we have encountered, especially with the jbd handle
>> resources of the ext4 file system. Our current attempt is to defer
>> memory reclamation until returning to user space, in order to solve
>> the various priority inversion issues caused by the jbd handle. Therefore,
>> I would be interested in discussing this topic.
>
> Awesome, do you use memory.max and memory.high both and defer the reclaim for
> both? Are you deferring all the reclaims or just the ones where the charging
> process has the lock? (I need to look at what the jbd handle is.)

We do not use memory.high: although it supports deferring memory reclamation
to the return to user space, it also throttles the memory allocation rate,
which introduces significant latency. In our application's case, we would
rather accept an OOM kill under such circumstances. We previously attempted to
address the priority inversion caused by the jbd handle on its own (we hit it
frequently because we use the ext4 file system); see [1]. Of course, that
solution lacks generality, as it requires calling new interfaces for each kind
of lock resource. Therefore, we have a more aggressive idea internally: defer
all reclamation triggered by kernel-space memory allocation until just before
returning to user space. This should resolve the vast majority of priority
inversion problems. The only potential issue it introduces is that
kernel-space memory usage may briefly exceed memory.max.

[1] https://lore.kernel.org/linux-mm/cover.1750234270.git.hezhongkun.hzk@bytedance.com/#r

Thanks,
Muchun
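For reference, the user-space scheme described above (continuously read
memory.current, and when it crosses a watermark write the overage to
memory.reclaim) can be sketched roughly as below. This is only an
illustrative sketch of the polling workaround, not code from either party's
implementation; the cgroup path, watermark, and poll interval are invented
example values, and a real deployment would want the watermark-event
mechanism the thread is asking for instead of busy polling.

```python
# Sketch of a user-space asynchronous reclaimer for a cgroup v2 memcg.
# Assumes cgroup v2 is mounted at /sys/fs/cgroup; CGROUP and WATERMARK
# are hypothetical example values.
import os
import time

CGROUP = "/sys/fs/cgroup/mygroup"     # hypothetical memcg path
WATERMARK = 512 * 1024 * 1024         # example watermark: 512 MiB

def bytes_to_reclaim(current, watermark):
    """Amount to request via memory.reclaim: the overage, or 0."""
    return max(0, current - watermark)

def reclaim_loop(poll_interval=1.0):
    """Poll memory.current; ask the kernel to reclaim the overage."""
    while True:
        with open(os.path.join(CGROUP, "memory.current")) as f:
            current = int(f.read())
        excess = bytes_to_reclaim(current, WATERMARK)
        if excess:
            # memory.reclaim accepts a byte count; the kernel makes a
            # best-effort attempt to reclaim that much from this memcg.
            with open(os.path.join(CGROUP, "memory.reclaim"), "w") as f:
                f.write(str(excess))
        time.sleep(poll_interval)
```

The loop makes the cost of the missing notification mechanism visible: every
iteration pays a read of memory.current even when usage is far below the
watermark, which is exactly why a custom-watermark event would be preferable.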