public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH v2 0/5] locking/rwsem: Rework reader optimistic spinning
@ 2020-11-21  4:14 Waiman Long
  2020-11-21  4:14 ` [PATCH v2 1/5] locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath() Waiman Long
                   ` (5 more replies)
  0 siblings, 6 replies; 23+ messages in thread
From: Waiman Long @ 2020-11-21  4:14 UTC (permalink / raw)
  To: Peter Zijlstra, Ingo Molnar, Will Deacon
  Cc: linux-kernel, Davidlohr Bueso, Phil Auld, Waiman Long

 v2:
  - Update some commit logs to incorporate review comments.
  - Patch 2: remove unnecessary comment.
  - Patch 3: rename osq_is_empty() to rwsem_no_spinners() as suggested.
  - Patch 4: correctly handle HANDOFF clearing.
  - Patch 5: fix !CONFIG_RWSEM_SPIN_ON_OWNER compilation errors.

A recent report of a SAP certification failure caused by increased system
time due to rwsem reader optimistic spinning led me to reexamine the
code to see the pros and cons of doing it. This led me to discover a
potential lock starvation scenario as explained in patch 2. That patch
reduces reader spinning to avoid this potential problem. Patches
3 and 4 are further optimizations of the current code.

Then there is the issue of reader fragmentation that can potentially
reduce performance in some heavily contended cases. Two different
approaches are attempted:
 1) further reduce reader optimistic spinning
 2) disable reader spinning

See the performance data shown in patch 5.

This patch series adopts the second approach by dropping reader spinning
for now as it simplifies the code. However, writers are still allowed
to spin on a reader-owned rwsem for a limited time.
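For readers unfamiliar with the mechanism, the spin-then-sleep idea behind
optimistic spinning can be sketched as a userspace analogue in plain C. This
is only an illustration under simplified assumptions, not the kernel's rwsem
implementation; the lock word, SPIN_LIMIT, and the use of sched_yield() as a
stand-in for cpu_relax()/blocking are invented for the sketch:

```c
/*
 * Illustrative userspace sketch of spin-then-sleep lock acquisition.
 * NOT the kernel rwsem code: a spinner retries for a bounded number of
 * iterations, hoping the owner releases the lock soon, before falling
 * back to the (here simulated) blocking slow path.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

#define SPIN_LIMIT 1000

static atomic_int lock_word;	/* 0 = free, 1 = held */

static bool try_acquire(void)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&lock_word, &expected, 1);
}

void acquire_spin_then_sleep(void)
{
	for (int i = 0; i < SPIN_LIMIT; i++) {
		if (try_acquire())
			return;		/* got it while spinning */
		sched_yield();		/* stand-in for cpu_relax() */
	}
	/* slow path: the kernel would queue on the wait list and sleep */
	while (!try_acquire())
		sched_yield();
}

void release(void)
{
	atomic_store(&lock_word, 0);
}
```

The trade-off the series wrestles with is visible even in this toy: spinning
wins when hold times are short, but wastes CPU time when they are not.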

Waiman Long (5):
  locking/rwsem: Pass the current atomic count to
    rwsem_down_read_slowpath()
  locking/rwsem: Prevent potential lock starvation
  locking/rwsem: Enable reader optimistic lock stealing
  locking/rwsem: Wake up all waiting readers if RWSEM_WAKE_READ_OWNED
  locking/rwsem: Remove reader optimistic spinning

 kernel/locking/lock_events_list.h |   6 +-
 kernel/locking/rwsem.c            | 293 ++++++++----------------------
 2 files changed, 82 insertions(+), 217 deletions(-)

-- 
2.18.1


^ permalink raw reply	[flat|nested] 23+ messages in thread
* Re: [PATCH v2 5/5] locking/rwsem: Remove reader optimistic spinning
@ 2025-05-27  9:08 Krcka, Tomas
  2026-03-24 14:36 ` David Woodhouse
  0 siblings, 1 reply; 23+ messages in thread
From: Krcka, Tomas @ 2025-05-27  9:08 UTC (permalink / raw)
  To: longman@redhat.com; +Cc: dave@stgolabs.net, linux-kernel@vger.kernel.org

Hi Waiman,

I recently discovered that this patch (commit 617f3ef95177
("locking/rwsem: Remove reader optimistic spinning")) results in up to a 50% performance drop in
real-life scenarios where multiple processes perform operations (reads and writes) on top of sysfs.
Reverting the patch restores the performance. My suggestion would be to revert the patch
from mainline as well.

I initially noticed the degradation with a workload where multiple processes access the
same sysfs node, showing up to a 50% drop in performance, measured as the time needed to complete the test.
After investigation, I traced the root cause back to two related changes:
first, kernfs switching from a mutex to an rwsem (commit 7ba0273b2f34 ("kernfs: switch kernfs to use an rwsem")),
and ultimately the removal of reader optimistic spinning.

The lock contention tracing shows a clear pattern: a process accessing a kernfs_dentry_node
for read is now forced into the slow path because the lock is already held for write by another
process operating on the same node.
(See the ftrace output below for the exact operations.)

This contrasts with the previous behavior where optimistic spinning prevented such 
situations.

I have confirmed this behavior across multiple kernel versions (6.14.4, 6.8.12, 6.6.80), as well as when
backporting the mentioned commits to older versions (specifically v5.10).
While the real-world impact was observed on AArch64, I've successfully reproduced
the core issue using our test case on both AArch64 (192 vCPUs) and x86 Ice Lake (128 vCPUs) systems.

While I identified this through sysfs (kernfs) operations, I believe this regression 
could affect other subsystems using reader-writer semaphores with similar access 
patterns. 

ftrace with the commit showing this pattern:

""""
userspace_bench-6796    [000] .....  2328.023515: contention_begin: 00000000ca66c48e (flags=READ)
^^^ - waiting for the lock; all subsequent threads will now wait as well

userspace_bench-6796    [000] d....  2328.023518: sched_switch: prev_comm=userspace_bench prev_pid=6796 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6798 next_prio=120
userspace_bench-6806    [009] d....  2328.023524: sched_switch: prev_comm=userspace_bench prev_pid=6806 prev_prio=120 prev_state=R+ ==> next_comm=migration/9 next_pid=70 next_prio=0
userspace_bench-6804    [004] d....  2328.023532: contention_begin: 00000000ca66c48e (flags=WRITE)
userspace_bench-6805    [005] d....  2328.023532: contention_begin: 00000000ca66c48e (flags=WRITE)
userspace_bench-6804    [004] d....  2328.023533: sched_switch: prev_comm=userspace_bench prev_pid=6804 prev_prio=120 prev_state=D ==> next_comm=swapper/4 next_pid=0 next_prio=120
userspace_bench-6797    [001] .....  2328.023534: contention_begin: 00000000ca66c48e (flags=READ)

... [cut] .... 

userspace_bench-6807    [007] .....  2328.023661: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6807    [007] d....  2328.023666: sched_switch: prev_comm=userspace_bench prev_pid=6807 prev_prio=120 prev_state=D ==> next_comm=swapper/7 next_pid=0 next_prio=120
userspace_bench-6813    [013] .....  2328.023669: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6815    [015] .....  2328.023673: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6813    [013] d....  2328.023674: sched_switch: prev_comm=userspace_bench prev_pid=6813 prev_prio=120 prev_state=D ==> next_comm=swapper/13 next_pid=0 next_prio=120
userspace_bench-6815    [015] d....  2328.023675: sched_switch: prev_comm=userspace_bench prev_pid=6815 prev_prio=120 prev_state=D ==> next_comm=swapper/15 next_pid=0 next_prio=120
userspace_bench-6803    [003] .....  2328.026170: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6803    [003] d....  2328.026171: sched_switch: prev_comm=userspace_bench prev_pid=6803 prev_prio=120 prev_state=D ==> next_comm=swapper/3 next_pid=0 next_prio=120
userspace_bench-6798    [000] d....  2328.027162: sched_switch: prev_comm=userspace_bench prev_pid=6798 prev_prio=120 prev_state=R ==> next_comm=userspace_bench next_pid=6800 next_prio=120
userspace_bench-6799    [001] d....  2328.027162: sched_switch: prev_comm=userspace_bench prev_pid=6799 prev_prio=120 prev_state=R ==> next_comm=userspace_bench next_pid=6801 next_prio=120
userspace_bench-6800    [000] .....  2328.027165: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6801    [001] .....  2328.027166: contention_begin: 00000000ca66c48e (flags=READ)
userspace_bench-6800    [000] d....  2328.027167: sched_switch: prev_comm=userspace_bench prev_pid=6800 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6796 next_prio=120
userspace_bench-6801    [001] d....  2328.027167: sched_switch: prev_comm=userspace_bench prev_pid=6801 prev_prio=120 prev_state=D ==> next_comm=userspace_bench next_pid=6799 next_prio=120
userspace_bench-6796    [000] .....  2328.027168: contention_end: 00000000ca66c48e (ret=0)
^^^^ -- here the reader finally got the lock; it took ~3.7 ms (2328.023515 -> 2328.027168)
""""
Without the commit we don't see any of the above waiting; the processes only spend time in optimistic spinning.


In our situation the writer is performing this operation:
"""
userspace_bench-6800     [031] .....     0.251700: contention_begin: 000000007a4d517c (ret=0)
userspace_bench-6800     [031] .....     0.251700: <stack trace>
=> __traceiter_contention_fastpath
=> rwsem_down_write_slowpath
=> down_write
=> kernfs_activate
=> kernfs_add_one
=> __kernfs_create_file
=> sysfs_add_file_mode_ns
=> internal_create_group
=> internal_create_groups.part.6
=> sysfs_create_groups
"""

and the reader is doing this; both are taking the same semaphore:
"""
userspace_bench-6801     [095] .....     0.251700: contention_begin: 000000007a4d517c (flags=READ)
userspace_bench-6801     [095] .....     0.251700: <stack trace>
=> __traceiter_contention_begin
=> rwsem_down_read_slowpath
=> down_read
=> kernfs_dop_revalidate
=> lookup_fast
=> walk_component
=> link_path_walk.part.74
=> path_openat
=> do_filp_open
=> do_sys_openat2
=> do_sys_open
"""

----

To help investigate this issue, I've created a minimal reproduction case:
1. Test repository: https://github.com/tomaskrcka/sysfs_bench
2. The test consists of:
  - A kernel module that creates a sysfs interface and handles file operations
  - A userspace application that spawns writers (matching the CPU core count) and readers

Using the test case on kernel 6.14.4, I collected the following measurements 
(100 samples each):

On AArch64 c8g (192 vCPUs):
- Without revert:
 * Avg: 3.50s (min: 3.39s, max: 3.63s, p99: 3.59s)
- With revert:
 * Avg: 2.70s (min: 2.65s, max: 3.10s, p99: 2.83s)
 * ~23% improvement

On x86 Ice Lake m6i (128 vCPUs):
- Without revert:
 * Avg: 6.71s (min: 6.61s, max: 7.60s, p99: 6.82s)
- With revert:
 * Avg: 6.28s (min: 5.89s, max: 7.52s, p99: 6.65s)
 * ~6% improvement
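For reference, the mean and p99 figures above can be computed from the raw
samples with something like the following sketch (the nearest-rank
convention for p99 is an assumption on my part; other conventions differ
slightly):

```c
/* Sketch: mean and nearest-rank p99 over an array of timing samples. */
#include <stdlib.h>

static int cmp_double(const void *a, const void *b)
{
	double x = *(const double *)a, y = *(const double *)b;

	return (x > y) - (x < y);
}

double mean(const double *s, int n)
{
	double sum = 0.0;

	for (int i = 0; i < n; i++)
		sum += s[i];
	return sum / n;
}

double p99(double *s, int n)
{
	qsort(s, n, sizeof(*s), cmp_double);
	/* nearest-rank: ceil(0.99 * n)-th sample, 0-based index */
	return s[(99 * n + 99) / 100 - 1];
}
```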

Could you take a look at this and let me know your thoughts?

I’m happy to help with further investigation.

Thanks,
Tomas



Amazon Web Services Development Center Germany GmbH
Tamara-Danz-Str. 13
10243 Berlin
Geschaeftsfuehrung: Christian Schlaeger, Jonathan Weiss
Eingetragen am Amtsgericht Charlottenburg unter HRB 257764 B
Sitz: Berlin
Ust-ID: DE 365 538 597

^ permalink raw reply	[flat|nested] 23+ messages in thread

end of thread, other threads:[~2026-03-24 14:36 UTC | newest]

Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-21  4:14 [PATCH v2 0/5] locking/rwsem: Rework reader optimistic spinning Waiman Long
2020-11-21  4:14 ` [PATCH v2 1/5] locking/rwsem: Pass the current atomic count to rwsem_down_read_slowpath() Waiman Long
2020-12-09 18:38   ` [tip: locking/core] " tip-bot2 for Waiman Long
2020-11-21  4:14 ` [PATCH v2 2/5] locking/rwsem: Prevent potential lock starvation Waiman Long
2020-11-26  8:12   ` [locking/rwsem] 25d0c60b0e: vm-scalability.throughput 316.2% improvement kernel test robot
2020-12-09 18:38   ` [tip: locking/core] locking/rwsem: Prevent potential lock starvation tip-bot2 for Waiman Long
2020-11-21  4:14 ` [PATCH v2 3/5] locking/rwsem: Enable reader optimistic lock stealing Waiman Long
2020-12-09 18:38   ` [tip: locking/core] " tip-bot2 for Waiman Long
2020-11-21  4:14 ` [PATCH v2 4/5] locking/rwsem: Wake up all waiting readers if RWSEM_WAKE_READ_OWNED Waiman Long
2020-11-24  3:15   ` [locking/rwsem] c9847a7f94: aim7.jobs-per-min -91.8% regression kernel test robot
2020-11-21  4:14 ` [PATCH v2 5/5] locking/rwsem: Remove reader optimistic spinning Waiman Long
2020-11-23 15:53   ` [locking/rwsem] 10a59003d2: unixbench.score -25.5% regression kernel test robot
2020-11-23 19:28     ` Waiman Long
2020-12-08  3:56   ` [PATCH v2 5/5] locking/rwsem: Remove reader optimistic spinning Davidlohr Bueso
2020-12-08 10:07   ` Peter Zijlstra
2020-12-08 15:29     ` Waiman Long
2020-12-09 18:38   ` [tip: locking/core] " tip-bot2 for Waiman Long
2020-12-08 14:57 ` [PATCH v2 0/5] locking/rwsem: Rework " Peter Zijlstra
2020-12-08 16:33   ` Waiman Long
2020-12-08 17:02     ` Peter Zijlstra
2020-12-08 17:30       ` Waiman Long
  -- strict thread matches above, loose matches on Subject: below --
2025-05-27  9:08 [PATCH v2 5/5] locking/rwsem: Remove " Krcka, Tomas
2026-03-24 14:36 ` David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox