Date: Thu, 2 Apr 2026 17:27:04 +0800
From: Li Wang
To: Michal Koutný
Cc: Waiman Long, Johannes Weiner, Michal Hocko, Roman Gushchin, Shakeel Butt, Muchun Song, Andrew Morton, Tejun Heo, Shuah Khan, Mike Rapoport, linux-kernel@vger.kernel.org,
 cgroups@vger.kernel.org, linux-mm@kvack.org, linux-kselftest@vger.kernel.org, Sean Christopherson, James Houghton, Sebastian Chlad, Guopeng Zhang, Li Wang
Subject: Re: [PATCH v2 1/7] memcg: Scale up vmstats flush threshold with int_sqrt(nr_cpus+2)
References: <20260320204241.1613861-1-longman@redhat.com> <20260320204241.1613861-2-longman@redhat.com>
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Michal,

> Hello Waiman and Li.
> ...
> The explanation seems [1] to just pick a function because log seemed too
> slow.
>
> (We should add a BPF hook to calculate the threshold. Haha, Date:)
>
> The threshold has twofold role: to bound error and to preserve some
> performance thanks to laziness and these two go against each other when
> determining the threshold. The reasoning for linear scaling is that
> _each_ CPU contributes some updates so that preserves the laziness.
> Whereas error capping would hint to no dependency on nr_cpus.
>
> My idea is that a job associated to a selected memcg doesn't necessarily
> run on _all_ CPUs of (such big) machines but effectively cause updates
> on J CPUs. (Either they're artificially constrained or they simply are
> not-so-parallel jobs.)
> Hence the threshold should be based on that J and not actual nr_cpus.

I completely agree on this point.

> Now the question is what is expected (CPU) size of a job and for that
> I'd would consider a distribution like:
> - 1 job of size nr_cpus, // you'd overcommit your machine with bigger job
> - 2 jobs of size nr_cpus/2,
> - 3 jobs of size nr_cpus/3,
> - ...
> - nr_cpus jobs of size 1. // you'd underutilize the machine with fewer
>
> Note this is quite naïve and arbitrary deliberation of mine but it
> results in something like Pareto distribution which is IMO quite
> reasonable.
> With (only) that assumption, I can estimate the average size
> of jobs like
>   nr_cpus / (log(nr_cpus) + 1)
> (it's natural logarithm from harmonic series and +1 is from that
> approximation too, it comes handy also on UP)
>
>   log(x) = ilog2(x) * log(2)/log(e) ~ ilog2(x) * 0.69
>   log(x) ~ 45426 * ilog2(x) / 65536
>
> or
>   65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>
>
> with kernel functions:
>   var1 = 65536*nr_cpus / (45426 * ilog2(nr_cpus) + 65536)
>   var2 = DIV_ROUND_UP(65536*nr_cpus, 45426 * ilog2(nr_cpus) + 65536)
>   var3 = roundup_pow_of_two(var2)
>
> I hope I don't need to present any more numbers at this moment because
> the parameter derivation is backed by solid theory ;-) [*]
> [*] It is a elegant method but still not based on the J CPUs.

You capture the core tension well: bounding the error wants the threshold
as small as possible, while preserving laziness wants it as large as
possible. Any scheme is a compromise between the two.

But there are several practical issues:

The threshold formula is system-wide. Each memcg has its own counter, but
they are all evaluated against the same MEMCG_CHARGE_BATCH * f(nr_cpu_ids),
with no awareness of how many CPUs are actually active for that particular
memcg. Small tasks with J=2 coexist with large services where J approaches
nr_cpus, yet they all face the same threshold.

The ln-harmonic formula optimizes for the average J, but the workloads
that most critically need an accurate memory.stat are precisely those
spanning many CPUs, well above the average.

Moreover, the "average J" estimate assumes tasks are uniformly distributed
across CPUs, which rarely holds in practice with cpuset constraints, NUMA
affinity, and nested cgroup hierarchies. And even accepting that estimate,
the data shows ln-harmonic still yields 237MB of error at 2048 CPUs with
64K pages, which is still large enough to cause selftest failures.

In short: the theoretical analysis is sound, but the conclusion conflates
the average case with the worst case.
Under the constraint of a single global threshold, sqrt remains the more
robust choice.

In the future, if a J-aware per-memcg threshold can be achieved, your
ln-harmonic method would be the ideal formula.

To compare the three methods (linear, sqrt, ln-harmonic):

4K page size (BATCH=64):

  CPUs    linear    sqrt      ln-var3
  -----------------------------------
  1       256KB     256KB     256KB
  2       512KB     512KB     512KB
  4       1MB       512KB     512KB
  8       2MB       768KB     1MB
  16      4MB       1MB       2MB
  32      8MB       1.25MB    2MB
  64      16MB      2MB       4MB
  128     32MB      2.75MB    8MB
  256     64MB      4MB       16MB
  512     128MB     5.5MB     32MB
  1024    256MB     8MB       64MB
  2048    512MB     11.25MB   64MB

64K page size (BATCH=16):

  CPUs    linear    sqrt      ln-var3
  -----------------------------------
  1       1MB       1MB       1MB
  2       2MB       2MB       2MB
  4       4MB       2MB       2MB
  8       8MB       3MB       4MB
  16      16MB      4MB       8MB
  32      32MB      5MB       8MB
  64      64MB      8MB       16MB
  128     128MB     11MB      32MB
  256     256MB     16MB      64MB
  512     512MB     22MB      128MB
  1024    1GB       32MB      256MB
  2048    2GB       45MB      256MB

-- 
Regards,
Li Wang
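
For reference, the three columns in the tables can be recomputed with a small userspace sketch like the one below. It assumes the sqrt column is BATCH * int_sqrt(nr_cpus + 2) pages, as in the patch subject, and that the ln-var3 column is BATCH * var3 pages using the var2/var3 expressions quoted in this thread. The helper names (int_sqrt_ul, ilog2_ul, roundup_pow2_ul, scale_*, threshold_kb, print_table) are hypothetical stand-ins written for this example, not the kernel implementations.

```c
#include <assert.h>
#include <stdio.h>

/* Userspace stand-ins for the kernel helpers named in the thread. */
static unsigned long int_sqrt_ul(unsigned long x)	/* floor(sqrt(x)) */
{
	unsigned long r = 0;

	while ((r + 1) * (r + 1) <= x)
		r++;
	return r;
}

static unsigned int ilog2_ul(unsigned long x)	/* floor(log2(x)), x >= 1 */
{
	unsigned int l = 0;

	while (x >>= 1)
		l++;
	return l;
}

static unsigned long roundup_pow2_ul(unsigned long x)	/* smallest 2^k >= x */
{
	unsigned long p = 1;

	while (p < x)
		p <<= 1;
	return p;
}

/* Scaling factors f(nr_cpus) for the three schemes being compared. */
static unsigned long scale_linear(unsigned long nr_cpus)
{
	return nr_cpus;
}

static unsigned long scale_sqrt(unsigned long nr_cpus)
{
	return int_sqrt_ul(nr_cpus + 2);
}

static unsigned long scale_ln_var3(unsigned long nr_cpus)
{
	unsigned long num, den;

	assert(nr_cpus >= 1);
	num = 65536UL * nr_cpus;
	/* 45426/65536 approximates ln(2), so den/65536 ~ ln(nr_cpus) + 1 */
	den = 45426UL * ilog2_ul(nr_cpus) + 65536UL;
	/* var2 = DIV_ROUND_UP(num, den); var3 = roundup_pow_of_two(var2) */
	return roundup_pow2_ul((num + den - 1) / den);
}

/* Threshold in KB: batch pages per unit of scale, page_kb KB per page. */
static unsigned long threshold_kb(unsigned long scale, unsigned long batch,
				  unsigned long page_kb)
{
	return scale * batch * page_kb;
}

/* Print one table (values in KB) for nr_cpus = 1, 2, 4, ..., 2048. */
static void print_table(unsigned long batch, unsigned long page_kb)
{
	unsigned long n;

	printf("%-6s %12s %12s %12s\n", "CPUs", "linear", "sqrt", "ln-var3");
	for (n = 1; n <= 2048; n <<= 1)
		printf("%-6lu %10luKB %10luKB %10luKB\n", n,
		       threshold_kb(scale_linear(n), batch, page_kb),
		       threshold_kb(scale_sqrt(n), batch, page_kb),
		       threshold_kb(scale_ln_var3(n), batch, page_kb));
}
```

Calling print_table(64, 4) and print_table(16, 64) from a main() prints the 4K and 64K tables in KB. The 237MB figure mentioned earlier corresponds to the unrounded var1 at 2048 CPUs with 64K pages; var3 rounds that up to 256.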