Date: Thu, 2 May 2024 13:23:26 +0200
Subject: Re: [PATCH v1] cgroup/rstat: add cgroup_rstat_cpu_lock helpers and tracepoints
From: Jesper Dangaard Brouer
To: Waiman Long, tj@kernel.org, hannes@cmpxchg.org, lizefan.x@bytedance.com,
    cgroups@vger.kernel.org, yosryahmed@google.com
Cc: netdev@vger.kernel.org, linux-mm@kvack.org, shakeel.butt@linux.dev,
    kernel-team@cloudflare.com, Arnaldo Carvalho de Melo,
    Sebastian Andrzej Siewior
References: <171457225108.4159924.12821205549807669839.stgit@firesoul>
    <30d64e25-561a-41c6-ab95-f0820248e9b6@redhat.com>
    <4a680b80-b296-4466-895a-13239b982c85@kernel.org>
    <203fdb35-f4cf-4754-9709-3c024eecade9@redhat.com>
In-Reply-To: <203fdb35-f4cf-4754-9709-3c024eecade9@redhat.com>

On 01/05/2024 20.41, Waiman Long wrote:
> On 5/1/24 13:22, Jesper Dangaard Brouer wrote:
>>
>> On 01/05/2024 16.24, Waiman Long wrote:
>>> On 5/1/24 10:04, Jesper Dangaard Brouer wrote:
>>>> This closely resembles helpers added for the global cgroup_rstat_lock
>>>> in commit fc29e04ae1ad ("cgroup/rstat: add cgroup_rstat_lock helpers
>>>> and tracepoints"). This is for the per-CPU lock cgroup_rstat_cpu_lock.
>>>>
>>>> Based on production workloads, we observe the fast-path "update"
>>>> function cgroup_rstat_updated() is invoked around 3 million times per
>>>> sec, while the "flush" function cgroup_rstat_flush_locked(), walking
>>>> each possible CPU, can see periodic spikes of 700 invocations/sec.
>>>>
>>>> For this reason, the tracepoints are split into normal and fastpath
>>>> versions for this per-CPU lock. This makes it feasible for production
>>>> to continuously monitor the non-fastpath tracepoint to detect lock
>>>> contention issues. The reason for monitoring is that the lock disables
>>>> IRQs, which can disturb e.g. softirq processing on the local CPUs
>>>> involved. When the global cgroup_rstat_lock stops disabling IRQs
>>>> (e.g. converted to a mutex), this per-CPU lock becomes the next
>>>> bottleneck that can introduce latency variations.
>>>>
>>>> A practical bpftrace script for monitoring contention latency:
>>>>
>>>>   bpftrace -e '
>>>>     tracepoint:cgroup:cgroup_rstat_cpu_lock_contended {
>>>>       @start[tid]=nsecs; @cnt[probe]=count()}
>>>>     tracepoint:cgroup:cgroup_rstat_cpu_locked {
>>>>       if (args->contended) {
>>>>         @wait_ns=hist(nsecs-@start[tid]); delete(@start[tid]);}
>>>>       @cnt[probe]=count()}
>>>>     interval:s:1 {time("%H:%M:%S "); print(@wait_ns); print(@cnt);
>>>>       clear(@cnt);}'
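To make the fastpath split concrete, here is a minimal sketch of how the
helper's @fast_path flag maps onto the two call sites named in the
changelog. The actual call sites are in the elided part of the diff
below, so the abbreviated function bodies and the _cgroup_rstat_cpu_unlock()
counterpart shown here are assumptions, not verbatim patch code:

  /* Update side: ~3 million calls/sec, so pass fast_path=true, which
   * selects the low-frequency-safe *_fastpath tracepoint variants. */
  void cgroup_rstat_updated(struct cgroup *cgrp, int cpu)
  {
          raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
          unsigned long flags;

          flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, cgrp, true);
          /* ... link cgrp into the per-CPU updated tree ... */
          _cgroup_rstat_cpu_unlock(cpu_lock, cpu, cgrp, flags, true);
  }

  /* Flush side: ~700 calls/sec, so pass fast_path=false; the normal
   * tracepoints fire rarely enough to monitor continuously. */
  static struct cgroup *cgroup_rstat_updated_list(struct cgroup *root, int cpu)
  {
          raw_spinlock_t *cpu_lock = per_cpu_ptr(&cgroup_rstat_cpu_lock, cpu);
          unsigned long flags;
          struct cgroup *head = NULL;

          flags = _cgroup_rstat_cpu_lock(cpu_lock, cpu, root, false);
          /* ... pop the updated tree into a local flush list ... */
          _cgroup_rstat_cpu_unlock(cpu_lock, cpu, root, flags, false);
          return head;
  }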
>>>
>>> This is a per-cpu lock. So the only possible contention involves
>>> 2 CPUs: a local CPU invoking cgroup_rstat_updated(), and a flusher CPU
>>> doing cgroup_rstat_flush_locked() calling into
>>> cgroup_rstat_updated_list(). With recent commits to reduce the percpu
>>> lock hold time, I doubt lock contention on the percpu lock will have
>>> a great impact on latency.
>>
>> I do appreciate your recent changes to reduce the percpu lock hold time.
>> These tracepoints allow me to measure and differentiate the percpu lock
>> hold time vs. the flush time.
>>
>> In production (using [1]) I'm seeing "Long lock-hold time" [L100], e.g.
>> up to 29 ms, which is time spent after obtaining the lock (runtime under
>> the lock). I was expecting to see "High Lock-contention wait" [L82],
>> which is the time spent waiting to obtain the lock.
>>
>> This is why I'm adding these tracepoints: they allow me to dig deeper,
>> to understand where these high runtime variations originate from.
>>
>> Data:
>>
>>  16:52:09 Long lock-hold time: 14950 usec (14 ms) on CPU:34 comm:kswapd4
>>  16:52:09 Long lock-hold time: 14821 usec (14 ms) on CPU:34 comm:kswapd4
>>  16:52:09 Long lock-hold time: 11299 usec (11 ms) on CPU:98 comm:kswapd4
>>  16:52:09 Long lock-hold time: 17237 usec (17 ms) on CPU:113 comm:kswapd6
>>  16:52:09 Long lock-hold time: 29000 usec (29 ms) on CPU:36 comm:kworker/u261:12
>
> That lock hold time is much higher than I would have expected.
>
>>  16:52:09 time elapsed: 80 sec (interval = 1 sec)
>>   Flushes(5033) 294/interval (avg 62/sec)
>>   Locks(53374) 1748/interval (avg 667/sec)
>>   Yields(48341) 1454/interval (avg 604/sec)
>>   Contended(48104) 1450/interval (avg 601/sec)
>>
>>> So do we really need such an elaborate scheme to monitor this? BTW,
>>> the additional code will also add to the worst case latency.
>>
>> Hmm, I designed this code to have minimal impact, as tracepoints are
>> no-ops until activated. I really doubt this code will change the
>> latency.
>>
>> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt
>> [L100] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L100
>> [L82] https://github.com/xdp-project/xdp-project/blob/master/areas/latency/cgroup_rstat_tracepoint.bt#L82
>>
>>>> Signed-off-by: Jesper Dangaard Brouer
>>
>> More data: the histogram of time spent under the lock shows some strange
>> variation, with a group in the 4 ms to 65 ms area. I am investigating
>> what is causing this; the next step depends on these tracepoints.
>>
>> @lock_cnt: 759146
>>
>> @locked_ns:
>> [1K, 2K)             499 |                                                    |
>> [2K, 4K)          206928 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
>> [4K, 8K)          147904 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@              |
>> [8K, 16K)          64453 |@@@@@@@@@@@@@@@@                                    |
>> [16K, 32K)        135467 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                 |
>> [32K, 64K)         75943 |@@@@@@@@@@@@@@@@@@@                                 |
>> [64K, 128K)        38359 |@@@@@@@@@                                           |
>> [128K, 256K)       46597 |@@@@@@@@@@@                                         |
>> [256K, 512K)       32466 |@@@@@@@@                                            |
>> [512K, 1M)          3945 |                                                    |
>> [1M, 2M)             642 |                                                    |
>> [2M, 4M)             750 |                                                    |
>> [4M, 8M)            1932 |                                                    |
>> [8M, 16M)           2114 |                                                    |
>> [16M, 32M)          1039 |                                                    |
>> [32M, 64M)           108 |                                                    |
>>
>>>> ---
>>>>   include/trace/events/cgroup.h |   56 +++++++++++++++++++++++++++++----
>>>>   kernel/cgroup/rstat.c         |   70 ++++++++++++++++++++++++++++++++++-------
>>>>   2 files changed, 108 insertions(+), 18 deletions(-)
>>>>
>>>> diff --git a/include/trace/events/cgroup.h b/include/trace/events/cgroup.h
>>>> index 13f375800135..0b95865a90f3 100644
>>>> --- a/include/trace/events/cgroup.h
[...]
>>>> +++ b/include/trace/events/cgroup.h
>>>> +DEFINE_EVENT(cgroup_rstat, cgroup_rstat_cpu_unlock_fastpath,
>>>> +
>>>> +    TP_PROTO(struct cgroup *cgrp, int cpu, bool contended),
>>>> +
>>>> +    TP_ARGS(cgrp, cpu, contended)
>>>> +);
>>>> +
>>>>   #endif /* _TRACE_CGROUP_H */
>>>>   /* This part must be outside protection */
>>>> diff --git a/kernel/cgroup/rstat.c b/kernel/cgroup/rstat.c
>>>> index 52e3b0ed1cee..fb8b49437573 100644
>>>> --- a/kernel/cgroup/rstat.c
>>>> +++ b/kernel/cgroup/rstat.c
>>>> @@ -19,6 +19,60 @@ static struct cgroup_rstat_cpu *cgroup_rstat_cpu(struct cgroup *cgrp, int cpu)
>>>>       return per_cpu_ptr(cgrp->rstat_cpu, cpu);
>>>>   }
>>>> +/*
>>>> + * Helper functions for rstat per CPU lock (cgroup_rstat_cpu_lock).
>>>> + *
>>>> + * This makes it easier to diagnose locking issues and contention in
>>>> + * production environments. The parameter @fast_path determines the
>>>> + * tracepoints being added, allowing us to diagnose "flush" related
>>>> + * operations without handling high-frequency fast-path "update" events.
>>>> + */
>>>> +static __always_inline
>>>> +unsigned long _cgroup_rstat_cpu_lock(raw_spinlock_t *cpu_lock, int cpu,
>>>> +                     struct cgroup *cgrp, const bool fast_path)
>>>> +{
>>>> +    unsigned long flags;
>>>> +    bool contended;
>>>> +
>>>> +    /*
>>>> +     * The _irqsave() is needed because cgroup_rstat_lock is
>>>> +     * spinlock_t which is a sleeping lock on PREEMPT_RT. Acquiring
>>>> +     * this lock with the _irq() suffix only disables interrupts on
>>>> +     * a non-PREEMPT_RT kernel. The raw_spinlock_t below disables
>>>> +     * interrupts on both configurations. The _irqsave() ensures
>>>> +     * that interrupts are always disabled and later restored.
>>>> +     */
>>>> +    contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
>>>> +    if (contended) {
>>>> +        if (fast_path)
>>>> +            trace_cgroup_rstat_cpu_lock_contended_fastpath(cgrp, cpu, contended);
>>>> +        else
>>>> +            trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
>>>> +
>>>> +        raw_spin_lock_irqsave(cpu_lock, flags);
>
> Could you do a local_irq_save() before calling trace_cgroup*() and
> raw_spin_lock()? Would that help in eliminating this high lock hold time?
>

Nope, it will not eliminate the high lock *hold* time, because the hold
start timestamp is first taken *AFTER* obtaining the lock. It could help
the contended "wait-time" measurement, but my prod measurements show
this isn't an issue.

> You can also do a local_irq_save() first before the trylock. That will
> eliminate the duplicated irq_restore() and irq_save() when there is
> contention.

I wrote the code like this on purpose ;-)

My issue with this code/lock is that it causes latency issues for softirq
NET_RX. So, when I detect a "contended" lock event, I do want an
irq_restore(), as that will allow networking/do_softirq() to run before
I start waiting for the lock (with IRQs disabled).
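To make the two locking orders under discussion concrete, here is a
minimal side-by-side sketch. The first variant is condensed from the
patch above; the second is reconstructed from Waiman's description and
is an assumption, not posted code:

  /* Patch ordering (condensed): a failed trylock has already restored
   * IRQs, so pending hardirqs/softirqs (e.g. NET_RX) get a chance to
   * run in the window before we block with IRQs disabled. */
  contended = !raw_spin_trylock_irqsave(cpu_lock, flags);
  if (contended) {
          /* IRQs are enabled again in this window */
          trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
          raw_spin_lock_irqsave(cpu_lock, flags); /* disable IRQs + spin */
  }

  /* Waiman's suggested ordering (assumed shape): saves the duplicated
   * irq_restore()/irq_save() pair on contention, but keeps IRQs
   * disabled for the entire wait. */
  local_irq_save(flags);
  contended = !raw_spin_trylock(cpu_lock);
  if (contended) {
          trace_cgroup_rstat_cpu_lock_contended(cgrp, cpu, contended);
          raw_spin_lock(cpu_lock);
  }

The trade-off is exactly the one argued above: fewer IRQ save/restore
operations versus a window where softirq processing can make progress.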
> If not, there may be NMIs mixed in.
>

NMIs are definitely on my list of things to investigate. These AMD CPUs
also have other types of interrupts that need a close look.

The easier explanation is that the lock isn't "yielded" on every cycle
through the for-each-CPU loop. Let's look at the data I provided above:

>> Flushes(5033) 294/interval (avg 62/sec)
>> Locks(53374) 1748/interval (avg 667/sec)
>> Yields(48341) 1454/interval (avg 604/sec)
>> Contended(48104) 1450/interval (avg 601/sec)

In this 1 second sample, we have 294 flushes and more yields (1454),
but the yields-per-flush factor is not 128 (the number of CPUs), only
about 5 (1454/294). Thus, on average we hold the lock for roughly
128/5 = 25.6 CPU-walks between yields.

We have spoken about releasing the lock on each for_each CPU iteration
before... it will likely solve this long hold time, but IMHO a mutex is
still the better solution.

--Jesper
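For reference, a simplified sketch of the flush-loop yield logic referred
to above, modeled on cgroup_rstat_flush_locked() in kernels of this era
(after commit fc29e04ae1ad added the __cgroup_rstat_lock/unlock helpers);
treat the details as approximate rather than verbatim:

  /* Sketch: the global cgroup_rstat_lock is only dropped when a yield
   * condition triggers, not on every CPU iteration. With ~5 yields per
   * flush on a 128-CPU box, the lock stays held for ~25 CPU-walks on
   * average, which matches the long hold times measured above. */
  static void cgroup_rstat_flush_locked(struct cgroup *cgrp)
  {
          int cpu;

          lockdep_assert_held(&cgroup_rstat_lock);

          for_each_possible_cpu(cpu) {
                  struct cgroup *pos = cgroup_rstat_updated_list(cgrp, cpu);

                  for (; pos; pos = pos->rstat_flush_next)
                          cgroup_base_stat_flush(pos, cpu);

                  /* Yield only if someone needs the CPU or the lock */
                  if (need_resched() || spin_needbreak(&cgroup_rstat_lock)) {
                          __cgroup_rstat_unlock(cgrp, cpu);
                          if (!cond_resched())
                                  cpu_relax();
                          __cgroup_rstat_lock(cgrp, cpu);
                  }
          }
  }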