From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pj1-f51.google.com (mail-pj1-f51.google.com [209.85.216.51]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 552FF2BE05E for ; Tue, 11 Nov 2025 06:13:51 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.216.51 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762841632; cv=none; b=AU2H3jqObZpgHHQwp3rucWNaslvyNXL4jpJFGmxP4r0HlBTObaI1uGwZKTOgXaA9OofThcJOAx91Elkamv3wTET/WqEfXLXm3jJ26mc2K1MH8fT03C2FhqjkYRlqS20J98IVf2eGYu1ZGjAnz1epUZEFFR8yG/ul5nDjOyiOmhY= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1762841632; c=relaxed/simple; bh=85dZTVJaKBjootEEbBhkM6OmEcD8qaRmWyiKB5Req1k=; h=From:To:Cc:Subject:Date:Message-ID:In-Reply-To:References: MIME-Version:Content-Type; b=JmX4srDUmBQ9DOhCghH0GpkfJPG0UbPILzzlC+gRcZnvANFk9Cv90lCsnCN3wrF/MU1l7z4hrr5R1AhfYhjCWPXMjleGo934pYphLG0XCDE+p/P9UvyBHI1J5Oryol9OVEYPuuLaqh3CG4cg0/KULUxBtunRVSjm14XLNz1leA4= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=shopee.com; spf=pass smtp.mailfrom=shopee.com; dkim=pass (2048-bit key) header.d=shopee.com header.i=@shopee.com header.b=RQ0rlIKa; arc=none smtp.client-ip=209.85.216.51 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=reject dis=none) header.from=shopee.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=shopee.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=shopee.com header.i=@shopee.com header.b="RQ0rlIKa" Received: by mail-pj1-f51.google.com with SMTP id 98e67ed59e1d1-3437c093ef5so2112589a91.0 for ; Mon, 10 Nov 2025 22:13:51 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=shopee.com; s=shopee.com; t=1762841630; x=1763446430; darn=vger.kernel.org; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:from:to:cc:subject:date :message-id:reply-to; bh=x6ye0KUrS4UUrF7S8bH5+znRQmU5zqbKdnyimkow66o=; b=RQ0rlIKaq+KwB13a8gNA4DS3ZRe8OXDwlE4CzJp4LkSmRz5QztnLzpi6bv+Fjfa+WF w5Jt15m6vAkXg99fixu9Q5PxYojb0S3hAs/Nj3JgM23O9fkr8YtmqM/6RqdShL9qHkrN Upi6IkD+ZeDCJm0BOidraZl36UgTPuRA7haCMlUTpteTEYBdhNEfd09DB53o+FnGUk7j 5+BdUOJ6hkO29QoLFYcXtwKTzKTUfQbiNTHok1JblAgN9hLD63BAV6AqFAUJAjv1VYau SFVQ+UB1g6N6immU4EBfOQ1RQNEdnb5uL3dWygx99tF/4byx4MnLnxxl538kaAv1kaNx BAaQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1762841630; x=1763446430; h=content-transfer-encoding:mime-version:references:in-reply-to :message-id:date:subject:cc:to:from:x-gm-gg:x-gm-message-state:from :to:cc:subject:date:message-id:reply-to; bh=x6ye0KUrS4UUrF7S8bH5+znRQmU5zqbKdnyimkow66o=; b=higACmqZsqjgYwKzAI4givnpFgC5FMuqfbtcS0ZWgmZuIo4Yuqfp0Ly+aZIb00IyR7 dHDUo+Y1i+yO5/owgZMfKvaBLET5qNBezAiTAOZiC8tOPbO725rNcSETmB1n7vzmI3KU jYrLNn+1sBqMYZ3CagSl4dKs1Miur3Tb8F+YoCU9riT1O49n9Cp9rnbihPopOvdlAfHY 47Pg+P7bMQ1fCLN4op7pDiLSw5hJ4ujUDtpD/NWC5+Ks4+Y5iQGrFL+QSjHSmrPNB6ei C/CqdYNmegTuQ3I+kg9FAPg9R6iwmFoTtXzJiM1m16nqwPYXjm3SKkf6/ocaXyOzUU68 VHHQ== X-Forwarded-Encrypted: i=1; AJvYcCXYkXMq/obnw0KyfphSxnndePjxuf8tLkJtEz6kScNFlNYv1SP8+ciy4L+uWHWls6bEYhm/zaMS32o=@vger.kernel.org X-Gm-Message-State: AOJu0Yz4TUI86oFOygGCZ9qZiCag8Tb7ijRjAO5hAWTc6sdb2Duzo2nn ND7DGvy/LR6za0Kv/0D6Nc20vgd9YmKhVa7JoJ1MZLC4IdhJ2NrismSqHRrT0YvwHDQ= X-Gm-Gg: ASbGncuk7ohIMPsoXAc/m/tNk4xvjIlupzatiFn5ZwkgD+j5HBBdfKJOC4Ic7/pOuq9 FwpO6DYE0k9kFZzonv8zSOo1rGor4YgYyWmlo/gqYmHivJvOwQ87/239dut02tuqWgqNt5SK6pX PWHNZHRljO3zhRtT3upbPfHu57exbpAKFRT75sx+GwzyAdueAcVo+Xu3j9qILFND+UMAWRi8PHD RU99yQQM0obH3n0udh/3M5sG3LmPrCt1OxY806KuiPFhsSUvJ1tIyJECBG6VeKRY8Ei/HGJmvG+ B864o/VTKI6eBCy5n6WFj4JxjzE3ni9CkQ/fcpfPVAC1G1U6B4MxCqBHpJ40UO5+AhxQN47tNA1 /51qdFjtAY3OM74dLzZ+ZwAuXyOqmr1Igwxp2c+O6dpUr9xDVJW8JcZfTJbvntimUYlYFRu2KeM DRqbgSGrA5ZBPe2vVY9hzB8lDm X-Google-Smtp-Source: AGHT+IFit24Wfiz5/okT1CYzaOFSn2pdykfe7Yx20hV//KiCLFFviTnyJczCSUcO4P48kN01R/ygpg== X-Received: by 2002:a17:90b:54cc:b0:32b:df0e:9283 with SMTP id 98e67ed59e1d1-3436cd0bcd5mr13428881a91.34.1762841630530; Mon, 10 Nov 2025 22:13:50 -0800 (PST) Received: from .shopee.com ([122.11.166.8]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-7b0cc175d85sm14312666b3a.43.2025.11.10.22.13.46 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 10 Nov 2025 22:13:50 -0800 (PST) From: Leon Huang Fu To: mkoutny@suse.com Cc: akpm@linux-foundation.org, cgroups@vger.kernel.org, corbet@lwn.net, hannes@cmpxchg.org, jack@suse.cz, joel.granados@kernel.org, kyle.meyer@hpe.com, lance.yang@linux.dev, laoar.shao@gmail.com, leon.huangfu@shopee.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, mclapinski@google.com, mhocko@kernel.org, muchun.song@linux.dev, roman.gushchin@linux.dev, shakeel.butt@linux.dev, tj@kernel.org Subject: Re: [PATCH mm-new v3] mm/memcontrol: Add memory.stat_refresh for on-demand stats flushing Date: Tue, 11 Nov 2025 14:13:42 +0800 Message-ID: <20251111061343.71045-1-leon.huangfu@shopee.com> X-Mailer: git-send-email 2.51.2 In-Reply-To: References: Precedence: bulk X-Mailing-List: linux-doc@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit On Mon, Nov 10, 2025 at 9:50 PM Michal Koutný wrote: > > Hello Leon. Hi Michal, > > On Mon, Nov 10, 2025 at 06:19:48PM +0800, Leon Huang Fu wrote: > > Memory cgroup statistics are updated asynchronously with periodic > > flushing to reduce overhead. The current implementation uses a flush > > threshold calculated as MEMCG_CHARGE_BATCH * num_online_cpus() for > > determining when to aggregate per-CPU memory cgroup statistics. On > > systems with high core counts, this threshold can become very large > > (e.g., 64 * 256 = 16,384 on a 256-core system), leading to stale > > statistics when userspace reads memory.stat files. > > > > This is particularly problematic for monitoring and management tools > > that rely on reasonably fresh statistics, as they may observe data > > that is thousands of updates out of date. > > > > Introduce a new write-only file, memory.stat_refresh, that allows > > userspace to explicitly trigger an immediate flush of memory statistics. > > I think it's worth thinking twice when introducing a new file like > this... > > > Writing any value to this file forces a synchronous flush via > > __mem_cgroup_flush_stats(memcg, true) for the cgroup and all its > > descendants, ensuring that subsequent reads of memory.stat and > > memory.numa_stat reflect current data. > > > > This approach follows the pattern established by /proc/sys/vm/stat_refresh > > and memory.peak, where the written value is ignored, keeping the > > interface simple and consistent with existing kernel APIs. > > > > Usage example: > >   echo 1 > /sys/fs/cgroup/mygroup/memory.stat_refresh > >   cat /sys/fs/cgroup/mygroup/memory.stat > > > > The feature is available in both cgroup v1 and v2 for consistency. > > First, I find the motivation by the testcase (not real world) weak when > considering such an API change (e.g. real world would be confined to > fewer CPUs or there'd be other "traffic" causing flushes making this a > non-issue, we don't know here). Fewer CPUs? We are going to run kernels on 224/256 cores machines, and the flush threshold is 16384 on a 256-core machine. That means we will have stale statistics often, and we will need a way to improve the stats accuracy. > > Second, this is open to everyone (non-root) who mkdir's their cgroups. > Then why not make it the default memory.stat behavior? (Tongue-in-cheek, > but [*].) > > With this change, we admit the implementation (async flushing) and leak > it to the users which is hard to take back. Why should we continue doing > any implicit in-kernel flushing afterwards? If the concern is that we're papering over a suboptimal flush path, I'm happy to take a closer look. I'll review both the synchronous and asynchronous flushing paths to see how to improve it. > > Next, v1 and v2 haven't been consistent since introduction of v2 (unlike > some other controllers that share code or even cftypes between v1 and > v2). So I'd avoid introducing a new file to V1 API. > > When looking for analogies, I admittedly like memory.reclaim's > O_NONBLOCK better (than /proc/sys/vm/stat_refresh). That would be an > argument for flushing by default mentioned abovee [*]). > > Also, this undercuts the hooking of rstat flushing into BPF. I think the > attempts were given up too early (I read about the verifier vs > seq_file). Have you tried bypassing bailout from > __mem_cgroup_flush_stats via trace_memcg_flush_stats? > I tried "tp_btf/memcg_flush_stats", but it didn't work: 10: (85) call css_rstat_flush#80218 program must be sleepable to call sleepable kfunc css_rstat_flush The bpf code and the error message are attached at last section. > > All in all, I'd like to have more backing data on insufficiency of (all > the) rstat optimizations before opening explicit flushes like this > (especially when it's meant to be exposed by BPF already). > It's proving non-trivial to capture a persuasive delta. The global worker already flushes rstat every two seconds (2UL*HZ), so the window where userspace can observe stale numbers is short. [...] Thanks, Leon --- #include "vmlinux.h" #include "bpf_helpers.h" #include "bpf_tracing.h" char _license[] SEC("license") = "GPL"; extern void css_rstat_flush(struct cgroup_subsys_state *css) __weak __ksym; SEC("tp_btf/memcg_flush_stats") int BPF_PROG(memcg_flush_stats, struct mem_cgroup *memcg, s64 stats_updates, bool force, bool needs_flush) { if (!force || !needs_flush) { css_rstat_flush(&memcg->css); __bpf_vprintk("memcg_flush_stats: memcg id=%d, stats_updates=%lld, force=%d, needs_flush=%d\n", memcg->id.id, stats_updates, force, needs_flush); } return 0; } --- permission denied: 0: R1=ctx() R10=fp0 ; int BPF_PROG(memcg_flush_stats, struct mem_cgroup *memcg, s64 stats_updates, bool force, bool needs_flush) @ memcg.c:13 0: (79) r6 = *(u64 *)(r1 +24) ; R1=ctx() R6_w=scalar() 1: (79) r9 = *(u64 *)(r1 +16) ; R1=ctx() R9_w=scalar() ; if (!force || !needs_flush) { @ memcg.c:15 2: (15) if r9 == 0x0 goto pc+1 ; R9_w=scalar(umin=1) 3: (55) if r6 != 0x0 goto pc+27 ; R6_w=0 4: (b7) r3 = 0 ; R3_w=0 ; int BPF_PROG(memcg_flush_stats, struct mem_cgroup *memcg, s64 stats_updates, bool force, bool needs_flush) @ memcg.c:13 5: (79) r7 = *(u64 *)(r1 +0) func 'memcg_flush_stats' arg0 has btf_id 623 type STRUCT 'mem_cgroup' 6: R1=ctx() R7_w=trusted_ptr_mem_cgroup() 6: (bf) r2 = r7 ; R2_w=trusted_ptr_mem_cgroup() R7_w=trusted_ptr_mem_cgroup() 7: (0f) r2 += r3 ; R2_w=trusted_ptr_mem_cgroup() R3_w=0 8: (79) r8 = *(u64 *)(r1 +8) ; R1=ctx() R8_w=scalar() ; css_rstat_flush(&memcg->css); @ memcg.c:16 9: (bf) r1 = r2 ; R1_w=trusted_ptr_mem_cgroup() R2_w=trusted_ptr_mem_cgroup() 10: (85) call css_rstat_flush#80218 program must be sleepable to call sleepable kfunc css_rstat_flush processed 11 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0