From: Uladzislau Rezki
Date: Mon, 29 Dec 2025 17:25:24 +0100
To: "Paul E. McKenney"
Cc: Uladzislau Rezki, Joel Fernandes, Joel Fernandes, linux-kernel@vger.kernel.org, Frederic Weisbecker, Neeraj Upadhyay, Josh Triplett, Boqun Feng, Steven Rostedt, Mathieu Desnoyers, Lai Jiangshan, Zqiang, rcu@vger.kernel.org
Subject: Re: [PATCH v2] rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early
References: <1033a68f-c17b-4847-819d-7fb4e9e45016@paulmck-laptop> <164E7707-758C-44AA-BB75-B6560725C8CD@joelfernandes.org> <177c56e1-f194-432d-b6fb-e2efd3866ce9@paulmck-laptop>
In-Reply-To: <177c56e1-f194-432d-b6fb-e2efd3866ce9@paulmck-laptop>
X-Mailing-List: rcu@vger.kernel.org

On Mon, Dec 29, 2025 at 07:53:59AM -0800, Paul E. McKenney wrote:
> On Mon, Dec 29, 2025 at 02:28:43PM +0100, Uladzislau Rezki wrote:
> > On Sun, Dec 28, 2025 at 09:49:45PM -0500, Joel Fernandes wrote:
> > >
> > >
> > > > On Dec 28, 2025, at 7:04 PM, Paul E. McKenney wrote:
> > > >
> > > > On Sun, Dec 28, 2025 at 06:57:58PM +0100, Uladzislau Rezki wrote:
> > > >>> On Thu, Dec 25, 2025 at 09:33:39PM -0500, Joel Fernandes wrote:
> > > >>> On Thu, Dec 25, 2025 at 10:35:44AM -0800, Paul E.
> > > >>> McKenney wrote:
> > > >>>> On Mon, Dec 22, 2025 at 10:46:29PM -0500, Joel Fernandes wrote:
> > > >>>>> The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
> > > >>>>> State) design where the first FQS saves dyntick-idle snapshots and
> > > >>>>> the second FQS compares them. This results in long and unnecessary latency
> > > >>>>> for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
> > > >>>>> 1000HZ) when one FQS wait would have sufficed.
> > > >>>>>
> > > >>>>> Investigation showed that the GP kthread's CPU is often the holdout CPU
> > > >>>>> after the first FQS, because it cannot be detected as "idle"
> > > >>>>> while it is actively running the FQS scan in the GP kthread.
> > > >>>>>
> > > >>>>> Therefore, at the end of rcu_gp_init(), immediately report a quiescent
> > > >>>>> state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
> > > >>>>> GP kthread cannot be in an RCU read-side critical section while running
> > > >>>>> GP initialization, so this is safe and results in significant latency
> > > >>>>> improvements.
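The two-phase behaviour described in the patch text above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not kernel code: the names Cpu, fqs_scan and fqs_waits_needed are hypothetical, every CPU other than the GP kthread's is assumed to stay idle, and a plain transition counter stands in for the dyntick-idle state.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Cpu:
    """Toy stand-in for per-CPU state (hypothetical, not the kernel's rcu_data)."""
    idle: bool = True
    transitions: int = 0            # bumped on every idle entry or exit
    snapshot: Optional[int] = None  # value saved by the first scan that saw the CPU busy
    qs_reported: bool = False


def fqs_scan(cpus: List[Cpu]) -> None:
    """One force-quiescent-state pass over all CPUs still holding out."""
    for cpu in cpus:
        if cpu.qs_reported:
            continue
        if cpu.idle:
            cpu.qs_reported = True          # an idle CPU is quiescent
        elif cpu.snapshot is None:
            cpu.snapshot = cpu.transitions  # first sighting: save a snapshot
        elif cpu.transitions != cpu.snapshot:
            cpu.qs_reported = True          # passed through idle since the snapshot


def fqs_waits_needed(num_cpus: int, report_gp_cpu_early: bool) -> int:
    """Count FQS waits until every CPU has reported, on an otherwise idle system."""
    cpus = [Cpu() for _ in range(num_cpus)]
    gp_cpu = cpus[0]                        # CPU running the GP kthread
    if report_gp_cpu_early:
        gp_cpu.qs_reported = True           # models the early report at GP init
    waits = 0
    while True:
        # The GP kthread sleeps for one FQS wait, so its CPU enters idle...
        gp_cpu.idle = True
        gp_cpu.transitions += 1
        waits += 1
        # ...then wakes up to scan, so its CPU is busy during the scan.
        gp_cpu.idle = False
        gp_cpu.transitions += 1
        fqs_scan(cpus)
        if all(cpu.qs_reported for cpu in cpus):
            return waits


print("waits without early report:", fqs_waits_needed(32, False))
print("waits with early report:   ", fqs_waits_needed(32, True))
```

In this model the GP kthread's CPU is the only holdout after the first scan, which is why reporting it early removes one wait. Real grace periods involve more than this (timer wheel delays, for example), so the model says nothing about absolute latencies.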
> > > >>>>>
> > > >>>>> I benchmarked 100 synchronize_rcu() calls with 32 CPUs, 10 runs each,
> > > >>>>> showing significant latency improvements (default settings for fqs jiffies):
> > > >>>>>
> > > >>>>> Baseline (without fix):
> > > >>>>> | Run | Mean      | Min      | Max       |
> > > >>>>> |-----|-----------|----------|-----------|
> > > >>>>> | 1   | 10.088 ms | 9.989 ms | 18.848 ms |
> > > >>>>> | 2   | 10.064 ms | 9.982 ms | 16.470 ms |
> > > >>>>> | 3   | 10.051 ms | 9.988 ms | 15.113 ms |
> > > >>>>> | 4   | 10.125 ms | 9.929 ms | 22.411 ms |
> > > >>>>> | 5   |  8.695 ms | 5.996 ms | 15.471 ms |
> > > >>>>> | 6   | 10.157 ms | 9.977 ms | 25.723 ms |
> > > >>>>> | 7   | 10.102 ms | 9.990 ms | 20.224 ms |
> > > >>>>> | 8   |  8.050 ms | 5.985 ms | 10.007 ms |
> > > >>>>> | 9   | 10.059 ms | 9.978 ms | 15.934 ms |
> > > >>>>> | 10  | 10.077 ms | 9.984 ms | 17.703 ms |
> > > >>>>>
> > > >>>>> With fix:
> > > >>>>> | Run | Mean     | Min      | Max       |
> > > >>>>> |-----|----------|----------|-----------|
> > > >>>>> | 1   | 6.027 ms | 5.915 ms |  8.589 ms |
> > > >>>>> | 2   | 6.032 ms | 5.984 ms |  9.241 ms |
> > > >>>>> | 3   | 6.010 ms | 5.986 ms |  7.004 ms |
> > > >>>>> | 4   | 6.076 ms | 5.993 ms | 10.001 ms |
> > > >>>>> | 5   | 6.084 ms | 5.893 ms | 10.250 ms |
> > > >>>>> | 6   | 6.034 ms | 5.908 ms |  9.456 ms |
> > > >>>>> | 7   | 6.051 ms | 5.993 ms | 10.000 ms |
> > > >>>>> | 8   | 6.057 ms | 5.941 ms | 10.001 ms |
> > > >>>>> | 9   | 6.016 ms | 5.927 ms |  7.540 ms |
> > > >>>>> | 10  | 6.036 ms | 5.993 ms |  9.579 ms |
> > > >>>>>
> > > >>>>> Summary:
> > > >>>>> - Mean latency: 9.75 ms -> 6.04 ms (38% improvement)
> > > >>>>> - Max latency: 25.72 ms -> 10.25 ms (60% improvement)
> > > >>>>>
> > > >>>>> Tested rcutorture TREE and SRCU configurations.
> > > >>>>>
> > > >>>>> [apply paulmck feedback on moving logic to rcu_gp_init()]
> > > >>>>
> > > >>>> If anything, these numbers look better, so good show!!!
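As a cross-check of the summary figures quoted above, here is a minimal sketch that recomputes them from the two tables. It assumes the summary mean is the average of the ten per-run means (reasonable here since every run has the same number of calls) and that the summary max is the worst per-run max; the variable and function names are illustrative only.

```python
# Per-run (mean, max) latencies in ms, copied from the two tables above.
baseline = [
    (10.088, 18.848), (10.064, 16.470), (10.051, 15.113), (10.125, 22.411),
    (8.695, 15.471), (10.157, 25.723), (10.102, 20.224), (8.050, 10.007),
    (10.059, 15.934), (10.077, 17.703),
]
with_fix = [
    (6.027, 8.589), (6.032, 9.241), (6.010, 7.004), (6.076, 10.001),
    (6.084, 10.250), (6.034, 9.456), (6.051, 10.000), (6.057, 10.001),
    (6.016, 7.540), (6.036, 9.579),
]


def summarize(runs):
    """Return (average of the per-run means, worst per-run max)."""
    means = [mean for mean, _ in runs]
    maxes = [worst for _, worst in runs]
    return sum(means) / len(means), max(maxes)


def improvement_pct(before, after):
    """Relative reduction from before to after, in percent."""
    return 100.0 * (before - after) / before


base_mean, base_max = summarize(baseline)
fix_mean, fix_max = summarize(with_fix)

print(f"mean: {base_mean:.2f} ms -> {fix_mean:.2f} ms "
      f"({improvement_pct(base_mean, fix_mean):.0f}% improvement)")
print(f"max:  {base_max:.2f} ms -> {fix_max:.2f} ms "
      f"({improvement_pct(base_max, fix_max):.0f}% improvement)")
```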
> > > >>>
> > > >>> Thanks, I ended up collecting more samples in the v2 to further confirm the
> > > >>> improvements.
> > > >>>
> > > >>>> Are there workloads that might be hurt by some side effect such
> > > >>>> as increased CPU utilization by the RCU grace-period kthread? One
> > > >>>> non-mainstream hypothetical situation that comes to mind is a kernel
> > > >>>> built with SMP=y but running on a single-CPU system with a high-frequency
> > > >>>> periodic interrupt that does call_rcu(). Might that result in the RCU
> > > >>>> grace-period kthread chewing up the entire CPU?
> > > >>>
> > > >>> There are still GP delays due to FQS, even with this change, so it could not
> > > >>> chew up the entire CPU I believe. The GP cycle should still insert delays
> > > >>> into the GP kthread. In my testing I did not see synchronize_rcu()
> > > >>> latency drop below a millisecond; it was still limited by the timer wheel
> > > >>> delays and the FQS delays.
> > > >>>
> > > >>>> For a non-hypothetical case, could you please see if one of the
> > > >>>> battery-powered embedded guys would be willing to test this?
> > > >>>
> > > >>> My suspicion is the battery-powered folks are already running RCU_LAZY to
> > > >>> reduce RCU activity, so they wouldn't be affected. call_rcu() during idleness
> > > >>> will be going to the bypass. Last I checked, Android and ChromeOS were both
> > > >>> enabling RCU_LAZY everywhere (back when I was at Google).
> > > >>>
> > > >>> Uladzislau works on embedded (or at least did until recently) and had recently
> > > >>> checked this area for improvements so I think he can help quantify too
> > > >>> perhaps. I personally don't directly work on embedded at the
> > > >>> moment, just big compute hungry machines. ;-) Uladzislau, would you have some
> > > >>> time to test on your Android devices? He is on CC.
> > > >>>
> > > >> I will check the patch on my home based systems, big machines also :)
> > > >> I do not work in the mobile area any more and thus do not have access to our
> > > >> mobile devices. In fact I am glad that I have switched to something new.
> > > >> I was a bit tired of the Google restrictions that apply to
> > > >> changes to the kernel and other Android layers.
> > > >
> > > > How quickly I forget! ;-)
> > > >
> > > > Any thoughts on who would be a good person to ask about testing Joel's
> > > > patch on mobile platforms?
> > >
> > > Maybe Suren? As precedent and fwiw, when the rcu_normal_wake_from_gp optimization happened, it only improved things for Android.
> > >
> > > Also Android already uses RCU_LAZY so this should not affect power for non-hurry usages.
> > >
> > > Also networking bridge removal depends on synchronize_rcu() latency. When I forced rcu_normal_wake_from_gp on large machines, it improved bridge removal speed by about 5% per my notes. I would expect similar improvements with this.
> > >
> >
> > Here we go with some results.
> > I tested bridge setup test case (100 loops):
> >
> > urezki@pc638:~$ cat bridge.sh
> > #!/bin/sh
> >
> > BRIDGE="virbr0"
> > NETWORK="192.0.0.1"
> >
> > # setup bridge
> > sudo brctl addbr ${BRIDGE}
> > sudo ifconfig ${BRIDGE} ${NETWORK} up
> > sudo ifconfig ${BRIDGE} ${NETWORK} down
> >
> > sudo brctl delbr ${BRIDGE}
> > urezki@pc638:~$
> >
> > 1)
> > # /tmp/default.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real 0m24.221s
> > user 0m1.875s
> > sys 0m2.013s
> > urezki@pc638:~$
> >
> > 2)
> > # echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
> > # /tmp/enable_joel_patch.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real 0m20.754s
> > user 0m1.950s
> > sys 0m1.888s
> > urezki@pc638:~$
> >
> > 3)
> > # echo 1 > /sys/module/rcutree/parameters/enable_joel_patch
> > # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > # /tmp/enable_joel_patch_enable_rcu_normal_wake_from_gp.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real 0m15.895s
> > user 0m2.023s
> > sys 0m1.935s
> > urezki@pc638:~$
> >
> > 4)
> > # echo 1 > /sys/module/rcutree/parameters/rcu_normal_wake_from_gp
> > # /tmp/enable_rcu_normal_wake_from_gp.txt
> > urezki@pc638:~$ time for i in $(seq 1 100); do ./bridge.sh; done
> > real 0m18.947s
> > user 0m2.145s
> > sys 0m1.735s
> > urezki@pc638:~$
> >
> > x86_64/64CPUs (in usec)
> >            1         2         3       4
> > median:  37249.5   31540.5   15765   22480
> > min:      7881      7918      9803    7857
> > max:     63651     55639     31861   32040
> >
> > 1 - default;
> > 2 - Joel patch
> > 3 - Joel patch + enable_rcu_normal_wake_from_gp
> > 4 - enable_rcu_normal_wake_from_gp
> >
> > Joel patch + enable_rcu_normal_wake_from_gp is a winner.
> > Time dropped from 24 seconds to 15 seconds to complete the test.
>
> There was also an increase in system time from 1.735s to 1.935s with
> Joel's patch, correct? Or is that in the noise?
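To put the four wall-clock results quoted above side by side, here is a small sketch that expresses each configuration as a percentage reduction against the default run. The dictionary keys are labels chosen for this illustration; the numbers are the "real" times from the four runs.

```python
# "real" wall-clock seconds for 100 loops of bridge.sh, from runs 1) to 4) above.
real_seconds = {
    "default": 24.221,
    "joel_patch": 20.754,
    "joel_patch+rcu_normal_wake_from_gp": 15.895,
    "rcu_normal_wake_from_gp": 18.947,
}


def reduction_pct(base, other):
    """Percentage of wall-clock time saved relative to base."""
    return 100.0 * (base - other) / base


base = real_seconds["default"]
reductions = {
    name: reduction_pct(base, secs)
    for name, secs in real_seconds.items()
    if name != "default"
}
fastest = min(real_seconds, key=real_seconds.get)

for name, pct in reductions.items():
    print(f"{name}: {real_seconds[name]:.3f}s ({pct:.1f}% less than default)")
print("fastest configuration:", fastest)
```

These are single runs per configuration, so the percentages are indicative rather than statistically firm.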
>
See below 5 runs, showing just the "sys" time:

# default
sys 0m1.936s
sys 0m1.894s
sys 0m1.937s
sys 0m1.698s
sys 0m1.740s

# Joel patch
sys 0m1.753s
sys 0m1.667s
sys 0m1.861s
sys 0m1.930s
sys 0m1.896s

I do not see an increase; IMO it is noise.

--
Uladzislau Rezki
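For what it is worth, the "noise" reading of the ten sys samples above can be sanity-checked with a short sketch. The criterion used here (difference of means smaller than either sample standard deviation) is an assumption made for illustration, not a rigorous significance test, and five samples per side is a small set.

```python
from statistics import mean, stdev

# "sys" times in seconds, copied from the two lists above.
default_sys = [1.936, 1.894, 1.937, 1.698, 1.740]
patched_sys = [1.753, 1.667, 1.861, 1.930, 1.896]


def summarize(samples):
    """Return (mean, sample standard deviation) of the samples."""
    return mean(samples), stdev(samples)


default_mean, default_sd = summarize(default_sys)
patched_mean, patched_sd = summarize(patched_sys)
delta = patched_mean - default_mean

# Rough heuristic: call it noise if the shift in the mean is smaller than
# the run-to-run spread of either configuration.
looks_like_noise = abs(delta) < min(default_sd, patched_sd)

print(f"default: mean {default_mean:.3f}s, stdev {default_sd:.3f}s")
print(f"patched: mean {patched_mean:.3f}s, stdev {patched_sd:.3f}s")
print(f"delta:   {delta:+.3f}s, looks like noise: {looks_like_noise}")
```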