From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from galois.linutronix.de (Galois.linutronix.de [193.142.43.55]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id 02B7D1D555; Mon, 24 Nov 2025 10:33:59 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=193.142.43.55 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763980441; cv=none; b=tZb/oRkjFz5CCKthH+cjrZ6prRti07tbgP+UXcFLahTFP8BJ6eNyPehvy3y/EWnpuehFG8GZ2H23NwCVrS9NcHo29f6HtU86GENep4R/ebTcRwqPUyYZSyJK1Zbef2pC8DA0Rx9F6lFPvg0gFcj2iU/YmRwAOsJgnae4sAxsxSk= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1763980441; c=relaxed/simple; bh=zVF0pssYx/aD9hfannkSvINx8rgMzcrmzWpuQtmSRqw=; h=From:To:Cc:Subject:In-Reply-To:References:Date:Message-ID: MIME-Version:Content-Type; b=poKOaKZoOhSxrjTte2GM/pxNnvg4QWXiIi9E9yv59P4ghh10MoI0xNu9+nyhp5A7dwhP8UMwdu6HNrIKEj1dQp2pUhCKjSbpxpgq2NR7NXtpYDoUsOltcSIsM1CuL7b1QUYNs7dj3RkVzE2yA0Oe/Z9ofk/h5b4nJRP5/QXlxsM= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de; spf=pass smtp.mailfrom=linutronix.de; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=wmPZhg1X; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b=LjoG/p8R; arc=none smtp.client-ip=193.142.43.55 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=none dis=none) header.from=linutronix.de Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=linutronix.de Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="wmPZhg1X"; dkim=permerror (0-bit key) header.d=linutronix.de header.i=@linutronix.de header.b="LjoG/p8R" From: Thomas Gleixner DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020; t=1763980437; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=aDvnFMrZR0/fwpKzOwSpmViR7udLCjy52/JaaSKqEOk=; b=wmPZhg1XZoz3UWC7RAX2YUKFRl4CC7teREVXdNX73BPIZRFD3J/ziLtBuT4Q5LLndbpIHt YZk7/XYn/pTcqWm01qu+fDoPSTPrhta96qeg6GKImoBqPhhUD81wtlc1EQDHKauCtqZxnJ UV9/3aICC7UmD2LKxHFpy4M3cV4BcP6xC1NAhuT/8VYhqwJDRFkSwvN2OlPMZVtFhBgD60 NC2kbayijM5MYEBv41YI5iQ3q1CROu0UPI5O+wB28rsB+6kEGg0nmLlEzodmqV6ODxftiv /ALDYcHd9Ytj7vFFD4x7aMH4lNspVoT/+N2RMSkl9/N0Y3oMHMNdjHOTEl9AIQ== DKIM-Signature: v=1; a=ed25519-sha256; c=relaxed/relaxed; d=linutronix.de; s=2020e; t=1763980437; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: in-reply-to:in-reply-to:references:references; bh=aDvnFMrZR0/fwpKzOwSpmViR7udLCjy52/JaaSKqEOk=; b=LjoG/p8R19K60Ncuo24Hp7CnVRdM/p/NWn5lJFmPTkTTBQlN8m8z+fBGDie2hy0EuDasDY R2Asrc0FZYATD0Bw== To: K Prateek Nayak , LKML Cc: Mathieu Desnoyers , "Paul E. McKenney" , Boqun Feng , Jonathan Corbet , Prakash Sangappa , Madadi Vineeth Reddy , Steven Rostedt , Sebastian Andrzej Siewior , Arnd Bergmann , linux-arch@vger.kernel.org, Randy Dunlap , Peter Zijlstra Subject: Re: [patch V4 00/12] rseq: Implement time slice extension mechanism In-Reply-To: <8cefc4be-b11a-414c-b3f4-280c900be67b@amd.com> References: <20251116173423.031443519@linutronix.de> <8cefc4be-b11a-414c-b3f4-280c900be67b@amd.com> Date: Mon, 24 Nov 2025 11:33:56 +0100 Message-ID: <87h5ujiusr.ffs@tglx> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain On Mon, Nov 24 2025 at 12:40, K. Prateek Nayak wrote: > On 11/17/2025 2:21 AM, Thomas Gleixner wrote: > I got a chance to test the series with Netflix's userspace locking > benchmark [1] which ended up looking very similar to the test > Steven had written initially when discussing the PoC except this > can scale the number of threads. > > [1] https://www.github.com/Netflix/global-lock-bench/ > > Critical section is just a single increment operation. Metric is > average time taken to run a fixed amount of critical sections across > #Threads over 3 runs. > > Here are the results of running the test with the default config on > my 256CPU machine in a root cpuset containing 32CPUs to actually hit > the contention: > > o rseq/slice with no benchmark modifications and "rseq_slice_ext=0" > > | Threads | Threaded (s) | Threaded/s | > +---------+-----------------+--------------+ > | 1 | .026103 | 383493128.79 | > | 2 | .086320 | 116134267.40 | > | 4 | .669743 | 14937390.67 | > | 8 | 1.105109 | 9053764.30 | > | 16 | 1.863516 | 5366809.94 | > | 32 | 7.249873 | 1379590.12 | > | 64 | 14.360199 | 696486.76 | > | 96 | 21.909887 | 456458.03 | > | 128 | 29.126423 | 343358.95 | > | 192 | 43.112188 | 231980.16 | > | 256 | 57.628748 | 173554.39 | > | 384 | 86.274354 | 115909.73 | > | 512 | 114.564142 | 87289.97 | > > > o rseq/slice with modified benchmark and "rseq_slice_ext=1" > > | Threads | Threaded (s) | Threaded/s | %diff (s) | > +---------+-----------------+--------------+-----------+ > | 1 | .036438 | 274437690.71 | 40% | > | 2 | .147520 | 68851845.82 | 71% | > | 4 | .829240 | 12176948.03 | 24% | > | 8 | 1.259632 | 7993476.42 | 14% | > | 16 | 1.988396 | 5029209.62 | 7% | > | 32 | 9.844307 | 1015837.43 | 36% | > | 64 | 14.590723 | 685979.41 | 2% | > | 96 | 18.898278 | 529171.84 | -14% | > | 128 | 23.921747 | 418033.09 | -18% | > | 192 | 33.284228 | 300673.66 | -23% | > | 256 | 42.934755 | 232934.87 | -25% | > | 384 | 61.794499 | 161924.64 | -28% | > | 512 | 82.005069 | 121951.34 | -28% | > > ( Lower %diff is better ) > > > Until the contention begins (> 32 threads), there is a consistent > regression which I believe can be attributed to the additional > overhead in the critical section from setting the slice_ext > request, however, once heavy contention begins, there is a clear > win with slice extension. Yes, that's about right. I just played with that benchmark on a random machine and a cpuset of 28 CPUs (primary threads of second socket). First naive attempt doing: request_slice(); // sets the request bit lock(); unlock(); check_slice(); // clears the request bit, checks grant, yield Threads noslice slice t ops 1 0.006 s 155101817 ops/s 0.008 s 128339525 ops/s 20% -17% 2 0.060 s 16709158 ops/s 0.071 s 14207646 ops/s 18% -14% 4 0.150 s 6681970 ops/s 0.150 s 6681428 ops/s 0% 0% 8 0.311 s 3217183 ops/s 0.290 s 3448088 ops/s -6% 7% 16 0.624 s 1602136 ops/s 0.581 s 1723563 ops/s -7% 7% 32 1.170 s 855494 ops/s 1.018 s 983453 ops/s -13% 14% 64 2.287 s 437461 ops/s 1.414 s 708871 ops/s -38% 62% 128 4.353 s 229926 ops/s 2.010 s 499227 ops/s -53% 117% 256 8.321 s 120311 ops/s 3.210 s 314349 ops/s -61% 161% Then I modified lock(): request_slice(); while (!trylock) { check_slice(); pause(); request_slice(); And it becomes this: Threads noslice slice t ops 1 0.006 s 167464787 ops/s 0.008 s 126824764 ops/s 32% -24% 2 0.038 s 26868534 ops/s 0.044 s 23020570 ops/s 16% -14% 4 0.119 s 8411120 ops/s 0.126 s 7930586 ops/s 6% -5% 8 0.241 s 4157058 ops/s 0.257 s 3896290 ops/s 6% -6% 16 0.463 s 2161893 ops/s 0.468 s 2149853 ops/s 0% 0% 32 1.037 s 965391 ops/s 0.888 s 1127000 ops/s -14% 16% 64 2.018 s 495862 ops/s 0.880 s 1136807 ops/s -56% 129% 128 3.834 s 262185 ops/s 0.874 s 1144878 ops/s -77% 336% 256 7.704 s 129885 ops/s 0.875 s 1143316 ops/s -88% 780% It's interesting that both runtime _and_ ops/s stay stable once contention is hit. Thanks, tglx