Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
From: Jesper Dangaard Brouer
Date: Fri, 3 May 2024 16:00:20 +0200
To: Waiman Long, tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com, cgroups@vger.kernel.org, yosryahmed@google.com
Cc: netdev@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev, kernel-team@cloudflare.com, Arnaldo Carvalho de Melo, Sebastian Andrzej Siewior
Message-ID: <42a6d218-206b-4f87-a8fa-ef42d107fb23@kernel.org>
References: <171457225108.4159924.12821205549807669839.stgit@firesoul> <30d64e25-561a-41c6-ab95-f0820248e9b6@redhat.com> <4a680b80-b296-4466-895a-13239b982c85@kernel.org> <203fdb35-f4cf-4754-9709-3c024eecade9@redhat.com>

On 02/05/2024 20.19, Waiman Long wrote:
> On 5/2/24 07:23, Jesper Dangaard Brouer wrote:
>>
>> On 01/05/2024 20.41, Waiman Long wrote:
>>> On 5/1/24 13:22, Jesper Dangaard Brouer wrote:
>>>>
>>>> On 01/05/2024 16.24, Waiman Long wrote:
>>>>> On 5/1/24 10:04, Jesper Dangaard Brouer wrote:
>>>>>> This closely resembles helpers added for the global cgroup_rstat_lock in
>>>>>> commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers and
>>>>>> tracepoints"). This is for the per CPU lock cgroup_rstat_cpu_lock.
>>>>>>
>>>>>> Based on production workloads, we observe the fast-path "update" function
>>>>>> cgroup_rstat_updated() is invoked around 3 million times per sec, while the
>>>>>> "flush" function cgroup_rstat_flush_locked(), walking each possible CPU,
>>>>>> can see periodic spikes of 700 invocations/sec.
>>>>>>
>>>>>> For this reason, the tracepoints are split into normal and fastpath
>>>>>> versions for this per-CPU lock, making it feasible for production to
>>>>>> continuously monitor the non-fastpath tracepoint to detect lock contention
>>>>>> issues. The reason for monitoring is that the lock disables IRQs, which
>>>>>> can disturb e.g. softirq processing on the local CPUs involved. When the
>>>>>> global cgroup_rstat_lock stops disabling IRQs (e.g. converted to a mutex),
>>>>>> this per-CPU lock becomes the next bottleneck that can introduce latency
>>>>>> variations.
>>>>>>
>>>>>> A practical bpftrace script for monitoring contention latency:
>>>>>>
>>>>>>   bpftrace -e '
>>>>>>     tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>>>>>>       @start[tid]=nsecs; @cnt[probe]=count()}
>>>>>>     tracepoint:cgroup:cgroup_rstat_cpu_locked {
>>>>>>       if (args->contended) {
>>>>>>         @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
>>>>>>       @cnt[probe]=count()}
>>>>>>     interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt);
>>>>>>       clear(@cnt);}'
>>>>>
>>>>> This is a per-cpu lock. So the only possible contention involves
>>>>> only 2 CPUs - a local CPU invoking cgroup_rstat_updated() and a
>>>>> flusher CPU doing cgroup_rstat_flush_locked() calling into
>>>>> cgroup_rstat_updated_list(). With recent commits to reduce the
>>>>> percpu lock hold time, I doubt lock contention on the percpu lock
>>>>> will have a great impact on latency.
>>>>
>>>> I do appreciate your recent changes to reduce the percpu lock hold time.
>>>> These tracepoints allow me to measure and differentiate the percpu lock
>>>> hold time vs. the flush time.
>>>>
>>>> In production (using [1]) I'm seeing "Long lock-hold time" [L100], e.g.
>>>> up to 29 ms, which is time spent after obtaining the lock (runtime under
>>>> lock).  I was expecting to see "High Lock-contention wait" [L82], which
>>>> is the time spent waiting to obtain the lock.
>>>>
>>>> This is why I'm adding these tracepoints, as they allow me to dig
>>>> deeper, to understand where these high runtime variations originate
>>>> from.
>>>>
>>>> Data:
>>>>
>>>>  16:52:09 Long lock-hold time: 14950 usec (14 ms) on CPU:34 comm:kswapd4
>>>>  16:52:09 Long lock-hold time: 14821 usec (14 ms) on CPU:34 comm:kswapd4
>>>>  16:52:09 Long lock-hold time: 11299 usec (11 ms) on CPU:98 comm:kswapd4
>>>>  16:52:09 Long lock-hold time: 17237 usec (17 ms) on CPU:113 comm:kswapd6
>>>>  16:52:09 Long lock-hold time: 29000 usec (29 ms) on CPU:36 comm:kworker/u261:12
>>> That lock hold time is much higher than I would have expected.
>>>>  16:52:09 time elapsed: 80 sec (interval = 1 sec)
>>>>   Flushes(5033) 294/interval (avg 62/sec)
>>>>   Locks(53374) 1748/interval (avg 667/sec)
>>>>   Yields(48341) 1454/interval (avg 604/sec)
>>>>   Contended(48104) 1450/interval (avg 601/sec)
>>>>
>>>>> So do we really need such an elaborate scheme to monitor this? BTW,
>>>>> the additional code will also add to the worst case latency.
>>>>
>>>> Hmm, I designed this code to have minimal impact, as tracepoints are
>>>> no-ops until activated.
>>>> I really doubt this code will change the latency.
>>>>
>>>> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
>>>> [L100] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L100
>>>> [L82] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L82
>>>>
>>>>>>
>>>>>> Signed-off-by: Jesper Dangaard Brouer
>>>>
>>>> More data: the histogram of time spent under the lock has some strange
>>>> variation issues, with a group in the 4 ms to 65 ms area. I am
>>>> investigating what can be causing this... and the next step depends on
>>>> these tracepoints.
>>>>
>>>> @lock_cnt: 759146
>>>>
>>>> @locked_ns:
>>>> [1K, 2K)             499 |                                                    |
>>>> [2K, 4K)          206928 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>>>> [4K, 8K)          147904 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@               |
>>>> [8K, 16K)          64453 |@@@@@@@@@@@@@@@@                                    |
>>>> [16K, 32K)        135467 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
>>>> [32K, 64K)         75943 |@@@@@@@@@@@@@@@@@@@                                 |
>>>> [64K, 128K)        38359 |@@@@@@@@@                                           |
>>>> [128K, 256K)       46597 |@@@@@@@@@@@                                         |
>>>> [256K, 512K)       32466 |@@@@@@@@                                            |
>>>> [512K, 1M)          3945 |                                                    |
>>>> [1M, 2M)             642 |                                                    |
>>>> [2M, 4M)             750 |                                                    |
>>>> [4M, 8M)            1932 |                                                    |
>>>> [8M, 16M)           2114 |                                                    |
>>>> [16M, 32M)          1039 |                                                    |
>>>> [32M, 64M)           108 |                                                    |
>>>>
>>>>>> ---
>>>>>>   include/trace/events/cgroup.h |   56 +++++++++++++++++++++++++++++----
>>>>>>   kernel/cgroup/rstat.c         |   70 ++++++++++++++++++++++++++++++++++-------
>>>>>>   2 files changed, 108 insertions(+), 18 deletions(-)
>>>>>>
>>>>>> diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
>>>>>> index 13f375800135..0b95865a90f3 100644
>>>>>> --- a/include/trace/events/cgroup.h
>> [...]
>>>>>> +++ b/include/trace/events/cgroup.h
>>>>
>>>>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
>>>>>> +
>>>>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>>>>> +
>>>>>> +    TP_ARGS(cgrp, cpu, contended)
>>>>>> +);
>>>>>> +
>>>>>>   #endif /* _TRACE_CGROUP_H */
>>>>>>   /* This part must be outside protection */
>>>>>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>>>>>> index 52e3b0ed1cee..fb8b49437573 100644
>>>>>> --- a/kernel/cgroup/rstat.c
>>>>>> +++ b/kernel/cgroup/rstat.c
>>>>>> @@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>>>>>>       return per_cpu_ptr(cgrp->rstat_cpu, cpu);
>>>>>>   }
>>>>>> +/*
>>>>>> + * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>>>>>> + *
>>>>>> + * This makes it easier to diagnose locking issues and contention in
>>>>>> + * production environments. The parameter @fast_path determines the
>>>>>> + * tracepoints being added, allowing us to diagnose "flush" related
>>>>>> + * operations without handling high-frequency fast-path "update" events.
>>>>>> + */
>>>>>> +static __always_inline
>>>>>> +unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>>>>>> +                     struct cgroup *cgrp, const bool fast_path)
>>>>>> +{
>>>>>> +    unsigned long flags;
>>>>>> +    bool contended;
>>>>>> +
>>>>>> +    /*
>>>>>> +     * The _irqsave() is needed because cgroup_rstat_lock is
>>>>>> +     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>>>>>> +     * this lock with the _irq() suffix only disables interrupts on
>>>>>> +     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>>>>>> +     * interrupts on both configurations. The _irqsave() ensures
>>>>>> +     * that interrupts are always disabled and later restored.
>>>>>> +     */
>>>>>> +    contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>>>>>> +    if (contended) {
>>>>>> +        if (fast_path)
>>>>>> +            trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
>>>>>> +        else
>>>>>> +            trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
>>>>>> +
>>>>>> +        raw_spin_lock_irqsave(cpu_lock, flags);
>>>
>>> Could you do a local_irq_save() before calling trace_cgroup*() and
>>> raw_spin_lock()? Would that help in eliminating this high lock hold
>>> time?
>>>
>>
>> Nope, it will not eliminate the high lock *hold* time, because the hold
>> start timestamp is first taken *AFTER* obtaining the lock.
>>
>> It could help the contended "wait-time" measurement, but my prod
>> measurements show this isn't an issue.
>
> Right.
>
>>
>>> You can also do a local_irq_save() first before the trylock. That
>>> will eliminate the duplicated irq_restore() and irq_save() when there
>>> is contention.
>>
>> I wrote the code like this on purpose ;-)
>> My issue with this code/lock is that it causes latency issues for softirq
>> NET_RX. So, when I detect a "contended" lock event, I do want an
>> irq_restore(), as that will allow networking/do_softirq() to run before
>> I start waiting for the lock (with IRQs disabled).
>>
> Assuming the time taken by the tracing code is negligible, we are
> talking about disabling IRQs almost immediately after enabling them. The
> trylock time should be relatively short, so the additional delay due to
> IRQs being disabled for the whole period is insignificant.
>>
>>> If not, there may be NMIs mixed in.
>>>
>>
>> NMIs are definitely on my list of things to investigate.
>> These AMD CPUs also have other types of interrupts that need a close
>> look.
>>
>> The easier explanation is that the lock isn't "yielded" on every cycle
>> through the for-each-CPU loop.
>>
>> Let's look at the data I provided above:
>>
>> >>   Flushes(5033) 294/interval (avg 62/sec)
>> >>   Locks(53374) 1748/interval (avg 667/sec)
>> >>   Yields(48341) 1454/interval (avg 604/sec)
>> >>   Contended(48104) 1450/interval (avg 601/sec)
>>
>> In this 1-second sample, we have 294 flushes and more yields (1454),
>> but the factor between them is not 128 (num-of-CPUs) but closer to 5.
>> Thus, on average we hold the lock for 25.6 (128/5) CPU-walks.
>>
>> We have spoken before about releasing the lock on each iteration of the
>> for-each-CPU loop... it will likely solve this long hold time, but IMHO
>> a mutex is still the better solution.
>
> I may have mistakenly thought the lock hold time refers to just the
> cpu_lock. Your reported times here are about the cgroup_rstat_lock,
> right? If so, the numbers make sense to me.
>

True, my reported numbers here are about the cgroup_rstat_lock.
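
To spell out the back-of-envelope estimate behind the 25.6 CPU-walks
figure above (a rough sketch only; the 128 comes from num-of-CPUs, and
the standalone program below is just for illustration, not kernel code):

 /* Rough estimate: how many CPU-walks each cgroup_rstat_lock hold covers,
  * derived from the 1-second sample above (294 flushes, 1454 yields).
  */
 #include <stdio.h>

 int main(void)
 {
 	double flushes = 294.0, yields = 1454.0, possible_cpus = 128.0;

 	double yields_per_flush = yields / flushes;            /* ~4.9, i.e. "closer to 5" */
 	double walks_per_hold = possible_cpus / yields_per_flush; /* ~25.9; 128/5 = 25.6 when rounding to 5 */

 	printf("yields per flush   : %.1f\n", yields_per_flush);
 	printf("CPU-walks per hold : %.1f\n", walks_per_hold);
 	return 0;
 }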
Glad to hear; we are more aligned, then :-)

Given I just got some prod machines online with this patch's
cgroup_rstat_cpu_lock tracepoints, I can give you some early results
about hold-time for the cgroup_rstat_cpu_lock.

From this one-liner bpftrace command:

 sudo bpftrace -e '
         tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
           @start[tid]=nsecs; @cnt[probe]=count()}
         tracepoint:cgroup:cgroup_rstat_cpu_locked {
           $now=nsecs;
           if (args->contended) {
             @wait_per_cpu_ns=hist($now-@start[tid]); delete(@start[tid]);}
           @cnt[probe]=count(); @locked[tid]=$now}
         tracepoint:cgroup:cgroup_rstat_cpu_unlock {
           $now=nsecs;
           @locked_per_cpu_ns=hist($now-@locked[tid]); delete(@locked[tid]);
           @cnt[probe]=count()}
         interval:s:1 {time("%H:%M:%S "); print(@wait_per_cpu_ns);
           print(@locked_per_cpu_ns); print(@cnt); clear(@cnt);}'

Results from one 1-second period:

13:39:55 @wait_per_cpu_ns:
[512, 1K)              3 |                                                    |
[1K, 2K)              12 |@                                                   |
[2K, 4K)             390 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[4K, 8K)              70 |@@@@@@@@@                                           |
[8K, 16K)             24 |@@@                                                 |
[16K, 32K)           183 |@@@@@@@@@@@@@@@@@@@@@@@@                            |
[32K, 64K)            11 |@                                                   |

@locked_per_cpu_ns:
[256, 512)         75592 |@                                                   |
[512, 1K)        2537357 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[1K, 2K)          528615 |@@@@@@@@@@                                          |
[2K, 4K)          168519 |@@@                                                 |
[4K, 8K)          162039 |@@@                                                 |
[8K, 16K)         100730 |@@                                                  |
[16K, 32K)         42276 |                                                    |
[32K, 64K)          1423 |                                                    |
[64K, 128K)           89 |                                                    |

@cnt[tracepoint:cgroup:cgroup_rstat_cpu_lock_contended]: 3 /sec
@cnt[tracepoint:cgroup:cgroup_rstat_cpu_unlock]: 3200 /sec
@cnt[tracepoint:cgroup:cgroup_rstat_cpu_locked]: 3200 /sec

So we see that the "flush-code-path" per-CPU hold time, @locked_per_cpu_ns,
isn't exceeding 128 usec.

My latency requirement, to avoid RX-queue overflow with 1024 slots
running at 25 Gbit/s, is 27.6 usec with small packets and 500 usec
(0.5 ms) with MTU-size packets. This is very close to my latency
requirements.

--Jesper
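
P.S. For anyone re-deriving the 27.6 usec / 500 usec budget above, the
arithmetic is roughly as below (a rough sketch; the 84-byte and 1538-byte
on-wire frame sizes, which include Ethernet preamble, FCS and inter-frame
gap, are my assumed inputs and are not stated elsewhere in this thread):

 /* Back-of-envelope RX-queue latency budget: time to fill 1024 RX
  * descriptors at 25 Gbit/s line rate, for min-size and MTU-size frames.
  */
 #include <stdio.h>

 static double budget_usec(double wire_bytes, double slots, double gbit_per_sec)
 {
 	/* bits divided by (Gbit/s) yields nanoseconds per packet on the wire */
 	double ns_per_pkt = wire_bytes * 8.0 / gbit_per_sec;

 	return ns_per_pkt * slots / 1000.0;
 }

 int main(void)
 {
 	printf("small packets : %.1f usec\n", budget_usec(84.0, 1024, 25.0));   /* ~27.5 */
 	printf("MTU packets   : %.1f usec\n", budget_usec(1538.0, 1024, 25.0)); /* ~504  */
 	return 0;
 }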