* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
[not found] ` <e5fb8b34-c1ad-92e0-e7e5-f7ed1605dbc6@linux.dev>
@ 2023-05-24 11:22 ` Qi Zheng
2023-05-24 11:56 ` Qi Zheng
1 sibling, 0 replies; 10+ messages in thread
From: Qi Zheng @ 2023-05-24 11:22 UTC (permalink / raw)
To: RCU
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Vlastimil Babka,
Kirill Tkhai, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Paul E. McKenney, Shakeel Butt, Yang Shi, linux-mm,
ying.huang, feng.tang, fengwei.yin, Yujie Liu
On 2023/5/24 19:08, Qi Zheng wrote:
>
[...]
>
> Well, I just ran the following command and reproduced the result:
>
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>
> 1) with commit 42c9db3970483:
>
> stress-ng: info: [11023] setting to a 60 second run per stressor
> stress-ng: info: [11023] dispatching hogs: 9 ramfs
> stress-ng: info: [11023] stressor bogo ops real time usr time
> sys time bogo ops/s bogo ops/s
> stress-ng: info: [11023] (secs) (secs)
> (secs) (real time) (usr+sys time)
> stress-ng: info: [11023] ramfs 774966 60.00 10.18
> 169.45 12915.89 4314.26
> stress-ng: info: [11023] for a 60.00s run time:
> stress-ng: info: [11023] 1920.11s available CPU time
> stress-ng: info: [11023] 10.18s user time ( 0.53%)
> stress-ng: info: [11023] 169.44s system time ( 8.82%)
> stress-ng: info: [11023] 179.62s total time ( 9.35%)
> stress-ng: info: [11023] load average: 8.99 2.69 0.93
> stress-ng: info: [11023] successful run completed in 60.00s (1 min,
> 0.00 secs)
>
> 2) with commit f95bdb700bc6b:
>
> stress-ng: info: [37676] dispatching hogs: 9 ramfs
> stress-ng: info: [37676] stressor bogo ops real time usr time
> sys time bogo ops/s bogo ops/s
> stress-ng: info: [37676] (secs) (secs)
> (secs) (real time) (usr+sys time)
> stress-ng: info: [37676] ramfs 168673 60.00 1.61
> 39.66 2811.08 4087.47
> stress-ng: info: [37676] for a 60.10s run time:
> stress-ng: info: [37676] 1923.36s available CPU time
> stress-ng: info: [37676] 1.60s user time ( 0.08%)
> stress-ng: info: [37676] 39.66s system time ( 2.06%)
> stress-ng: info: [37676] 41.26s total time ( 2.15%)
> stress-ng: info: [37676] load average: 7.69 3.63 2.36
> stress-ng: info: [37676] successful run completed in 60.10s (1 min,
> 0.10 secs)
>
> The bogo ops/s (real time) did drop significantly.
>
> And memory reclamation was not triggered during the whole run, so
> theoretically no one is in a read-side critical section of shrinker_srcu.
>
> Then I found that some stress-ng-ramfs processes were in
> TASK_UNINTERRUPTIBLE state for a long time:
>
> root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
>
> Their call stack is as follows:
>
> cat /proc/42330/stack
>
> [<0>] __synchronize_srcu.part.21+0x83/0xb0
> [<0>] unregister_shrinker+0x85/0xb0
> [<0>] deactivate_locked_super+0x27/0x70
> [<0>] cleanup_mnt+0xb8/0x140
> [<0>] task_work_run+0x65/0x90
> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> [<0>] syscall_exit_to_user_mode+0x1b/0x40
> [<0>] do_syscall_64+0x44/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> + RCU folks: is this result expected? I would have thought that
> synchronize_srcu() would return quickly if no one is in a read-side
> critical section. :(
>
I received this bounce message:
BOUNCE rcu@vger.kernel.org: Message too long (>100000 chars)
So I am adding the RCU folks again.
--
Thanks,
Qi
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
[not found] ` <e5fb8b34-c1ad-92e0-e7e5-f7ed1605dbc6@linux.dev>
2023-05-24 11:22 ` [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression Qi Zheng
@ 2023-05-24 11:56 ` Qi Zheng
2023-05-25 4:03 ` Qi Zheng
1 sibling, 1 reply; 10+ messages in thread
From: Qi Zheng @ 2023-05-24 11:56 UTC (permalink / raw)
To: RCU
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Vlastimil Babka,
Kirill Tkhai, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Paul E. McKenney, Shakeel Butt, Yang Shi, linux-mm,
ying.huang, feng.tang, fengwei.yin, Yujie Liu
On 2023/5/24 19:08, Qi Zheng wrote:
[...]
>
> Well, I just ran the following command and reproduced the result:
>
> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>
> 1) with commit 42c9db3970483:
>
> stress-ng: info: [11023] setting to a 60 second run per stressor
> stress-ng: info: [11023] dispatching hogs: 9 ramfs
> stress-ng: info: [11023] stressor bogo ops real time usr time
> sys time bogo ops/s bogo ops/s
> stress-ng: info: [11023] (secs) (secs)
> (secs) (real time) (usr+sys time)
> stress-ng: info: [11023] ramfs 774966 60.00 10.18
> 169.45 12915.89 4314.26
> stress-ng: info: [11023] for a 60.00s run time:
> stress-ng: info: [11023] 1920.11s available CPU time
> stress-ng: info: [11023] 10.18s user time ( 0.53%)
> stress-ng: info: [11023] 169.44s system time ( 8.82%)
> stress-ng: info: [11023] 179.62s total time ( 9.35%)
> stress-ng: info: [11023] load average: 8.99 2.69 0.93
> stress-ng: info: [11023] successful run completed in 60.00s (1 min,
> 0.00 secs)
>
> 2) with commit f95bdb700bc6b:
>
> stress-ng: info: [37676] dispatching hogs: 9 ramfs
> stress-ng: info: [37676] stressor bogo ops real time usr time
> sys time bogo ops/s bogo ops/s
> stress-ng: info: [37676] (secs) (secs)
> (secs) (real time) (usr+sys time)
> stress-ng: info: [37676] ramfs 168673 60.00 1.61
> 39.66 2811.08 4087.47
> stress-ng: info: [37676] for a 60.10s run time:
> stress-ng: info: [37676] 1923.36s available CPU time
> stress-ng: info: [37676] 1.60s user time ( 0.08%)
> stress-ng: info: [37676] 39.66s system time ( 2.06%)
> stress-ng: info: [37676] 41.26s total time ( 2.15%)
> stress-ng: info: [37676] load average: 7.69 3.63 2.36
> stress-ng: info: [37676] successful run completed in 60.10s (1 min,
> 0.10 secs)
>
> The bogo ops/s (real time) did drop significantly.
>
> And memory reclamation was not triggered during the whole run, so
> theoretically no one is in a read-side critical section of shrinker_srcu.
>
> Then I found that some stress-ng-ramfs processes were in
> TASK_UNINTERRUPTIBLE state for a long time:
>
> root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> stress-ng-ramfs [run]
> root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
> root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
> stress-ng-ramfs [run]
>
> Their call stack is as follows:
>
> cat /proc/42330/stack
>
> [<0>] __synchronize_srcu.part.21+0x83/0xb0
> [<0>] unregister_shrinker+0x85/0xb0
> [<0>] deactivate_locked_super+0x27/0x70
> [<0>] cleanup_mnt+0xb8/0x140
> [<0>] task_work_run+0x65/0x90
> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> [<0>] syscall_exit_to_user_mode+0x1b/0x40
> [<0>] do_syscall_64+0x44/0x80
> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>
> + RCU folks: is this result expected? I would have thought that
> synchronize_srcu() would return quickly if no one is in a read-side
> critical section. :(
>
With the following changes, ops/s can return to previous levels:
diff --git a/mm/vmscan.c b/mm/vmscan.c
index db2ed6e08f67..90f541b07cd1 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
debugfs_entry = shrinker_debugfs_remove(shrinker);
up_write(&shrinker_rwsem);
- synchronize_srcu(&shrinker_srcu);
+ synchronize_srcu_expedited(&shrinker_srcu);
debugfs_remove_recursive(debugfs_entry);
stress-ng: info: [13159] dispatching hogs: 9 ramfs
stress-ng: info: [13159] stressor bogo ops real time usr time
sys time bogo ops/s bogo ops/s
stress-ng: info: [13159] (secs) (secs)
(secs) (real time) (usr+sys time)
stress-ng: info: [13159] ramfs 710062 60.00 9.63
157.26 11834.18 4254.75
stress-ng: info: [13159] for a 60.00s run time:
stress-ng: info: [13159] 1920.14s available CPU time
stress-ng: info: [13159] 9.62s user time ( 0.50%)
stress-ng: info: [13159] 157.26s system time ( 8.19%)
stress-ng: info: [13159] 166.88s total time ( 8.69%)
stress-ng: info: [13159] load average: 9.49 4.02 1.65
stress-ng: info: [13159] successful run completed in 60.00s (1 min,
0.00 secs)
Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
one is in the read critical section?
>
--
Thanks,
Qi
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-05-24 11:56 ` Qi Zheng
@ 2023-05-25 4:03 ` Qi Zheng
2023-05-27 11:14 ` Paul E. McKenney
0 siblings, 1 reply; 10+ messages in thread
From: Qi Zheng @ 2023-05-25 4:03 UTC (permalink / raw)
To: RCU, Yujie Liu
Cc: oe-lkp, lkp, linux-kernel, Andrew Morton, Vlastimil Babka,
Kirill Tkhai, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Paul E. McKenney, Shakeel Butt, Yang Shi, linux-mm,
ying.huang, feng.tang, fengwei.yin
On 2023/5/24 19:56, Qi Zheng wrote:
>
>
> On 2023/5/24 19:08, Qi Zheng wrote:
>
> [...]
>
>>
>> Well, I just ran the following command and reproduced the result:
>>
>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>
>> 1) with commit 42c9db3970483:
>>
>> stress-ng: info: [11023] setting to a 60 second run per stressor
>> stress-ng: info: [11023] dispatching hogs: 9 ramfs
>> stress-ng: info: [11023] stressor bogo ops real time usr time
>> sys time bogo ops/s bogo ops/s
>> stress-ng: info: [11023] (secs) (secs)
>> (secs) (real time) (usr+sys time)
>> stress-ng: info: [11023] ramfs 774966 60.00 10.18
>> 169.45 12915.89 4314.26
>> stress-ng: info: [11023] for a 60.00s run time:
>> stress-ng: info: [11023] 1920.11s available CPU time
>> stress-ng: info: [11023] 10.18s user time ( 0.53%)
>> stress-ng: info: [11023] 169.44s system time ( 8.82%)
>> stress-ng: info: [11023] 179.62s total time ( 9.35%)
>> stress-ng: info: [11023] load average: 8.99 2.69 0.93
>> stress-ng: info: [11023] successful run completed in 60.00s (1 min,
>> 0.00 secs)
>>
>> 2) with commit f95bdb700bc6b:
>>
>> stress-ng: info: [37676] dispatching hogs: 9 ramfs
>> stress-ng: info: [37676] stressor bogo ops real time usr time
>> sys time bogo ops/s bogo ops/s
>> stress-ng: info: [37676] (secs) (secs)
>> (secs) (real time) (usr+sys time)
>> stress-ng: info: [37676] ramfs 168673 60.00 1.61
>> 39.66 2811.08 4087.47
>> stress-ng: info: [37676] for a 60.10s run time:
>> stress-ng: info: [37676] 1923.36s available CPU time
>> stress-ng: info: [37676] 1.60s user time ( 0.08%)
>> stress-ng: info: [37676] 39.66s system time ( 2.06%)
>> stress-ng: info: [37676] 41.26s total time ( 2.15%)
>> stress-ng: info: [37676] load average: 7.69 3.63 2.36
>> stress-ng: info: [37676] successful run completed in 60.10s (1 min,
>> 0.10 secs)
>>
>> The bogo ops/s (real time) did drop significantly.
>>
>> And memory reclamation was not triggered during the whole run, so
>> theoretically no one is in a read-side critical section of shrinker_srcu.
>>
>> Then I found that some stress-ng-ramfs processes were in
>> TASK_UNINTERRUPTIBLE state for a long time:
>>
>> root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>> stress-ng-ramfs [run]
>> root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>> root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
>> stress-ng-ramfs [run]
>>
>> Their call stack is as follows:
>>
>> cat /proc/42330/stack
>>
>> [<0>] __synchronize_srcu.part.21+0x83/0xb0
>> [<0>] unregister_shrinker+0x85/0xb0
>> [<0>] deactivate_locked_super+0x27/0x70
>> [<0>] cleanup_mnt+0xb8/0x140
>> [<0>] task_work_run+0x65/0x90
>> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
>> [<0>] syscall_exit_to_user_mode+0x1b/0x40
>> [<0>] do_syscall_64+0x44/0x80
>> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>
>> + RCU folks: is this result expected? I would have thought that
>> synchronize_srcu() would return quickly if no one is in a read-side
>> critical section. :(
>>
>
> With the following changes, ops/s can return to previous levels:
Or just set rcu_expedited to 1:
echo 1 > /sys/kernel/rcu_expedited
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index db2ed6e08f67..90f541b07cd1 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> debugfs_entry = shrinker_debugfs_remove(shrinker);
> up_write(&shrinker_rwsem);
>
> - synchronize_srcu(&shrinker_srcu);
> + synchronize_srcu_expedited(&shrinker_srcu);
>
> debugfs_remove_recursive(debugfs_entry);
>
> stress-ng: info: [13159] dispatching hogs: 9 ramfs
> stress-ng: info: [13159] stressor bogo ops real time usr time
> sys time bogo ops/s bogo ops/s
> stress-ng: info: [13159] (secs) (secs)
> (secs) (real time) (usr+sys time)
> stress-ng: info: [13159] ramfs 710062 60.00 9.63
> 157.26 11834.18 4254.75
> stress-ng: info: [13159] for a 60.00s run time:
> stress-ng: info: [13159] 1920.14s available CPU time
> stress-ng: info: [13159] 9.62s user time ( 0.50%)
> stress-ng: info: [13159] 157.26s system time ( 8.19%)
> stress-ng: info: [13159] 166.88s total time ( 8.69%)
> stress-ng: info: [13159] load average: 9.49 4.02 1.65
> stress-ng: info: [13159] successful run completed in 60.00s (1 min,
> 0.00 secs)
>
> Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> one is in the read critical section?
>
>>
>
--
Thanks,
Qi
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-05-25 4:03 ` Qi Zheng
@ 2023-05-27 11:14 ` Paul E. McKenney
2023-05-29 2:39 ` Qi Zheng
0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2023-05-27 11:14 UTC (permalink / raw)
To: Qi Zheng
Cc: RCU, Yujie Liu, oe-lkp, lkp, linux-kernel, Andrew Morton,
Vlastimil Babka, Kirill Tkhai, Roman Gushchin,
Christian König, David Hildenbrand, Davidlohr Bueso,
Johannes Weiner, Michal Hocko, Muchun Song, Shakeel Butt,
Yang Shi, linux-mm, ying.huang, feng.tang, fengwei.yin
On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
> On 2023/5/24 19:56, Qi Zheng wrote:
> > On 2023/5/24 19:08, Qi Zheng wrote:
> >
> > [...]
> >
> > >
> > > Well, I just ran the following command and reproduced the result:
> > >
> > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> > >
> > > 1) with commit 42c9db3970483:
> > >
> > > stress-ng: info: [11023] setting to a 60 second run per stressor
> > > stress-ng: info: [11023] dispatching hogs: 9 ramfs
> > > stress-ng: info: [11023] stressor bogo ops real time usr
> > > time sys time bogo ops/s bogo ops/s
> > > stress-ng: info: [11023] (secs) (secs)
> > > (secs) (real time) (usr+sys time)
> > > stress-ng: info: [11023] ramfs 774966 60.00
> > > 10.18 169.45 12915.89 4314.26
> > > stress-ng: info: [11023] for a 60.00s run time:
> > > stress-ng: info: [11023] 1920.11s available CPU time
> > > stress-ng: info: [11023] 10.18s user time ( 0.53%)
> > > stress-ng: info: [11023] 169.44s system time ( 8.82%)
> > > stress-ng: info: [11023] 179.62s total time ( 9.35%)
> > > stress-ng: info: [11023] load average: 8.99 2.69 0.93
> > > stress-ng: info: [11023] successful run completed in 60.00s (1 min,
> > > 0.00 secs)
> > >
> > > 2) with commit f95bdb700bc6b:
> > >
> > > stress-ng: info: [37676] dispatching hogs: 9 ramfs
> > > stress-ng: info: [37676] stressor bogo ops real time usr
> > > time sys time bogo ops/s bogo ops/s
> > > stress-ng: info: [37676] (secs) (secs)
> > > (secs) (real time) (usr+sys time)
> > > stress-ng: info: [37676] ramfs 168673 60.00
> > > 1.61 39.66 2811.08 4087.47
> > > stress-ng: info: [37676] for a 60.10s run time:
> > > stress-ng: info: [37676] 1923.36s available CPU time
> > > stress-ng: info: [37676] 1.60s user time ( 0.08%)
> > > stress-ng: info: [37676] 39.66s system time ( 2.06%)
> > > stress-ng: info: [37676] 41.26s total time ( 2.15%)
> > > stress-ng: info: [37676] load average: 7.69 3.63 2.36
> > > stress-ng: info: [37676] successful run completed in 60.10s (1 min,
> > > 0.10 secs)
> > >
> > > The bogo ops/s (real time) did drop significantly.
> > >
> > > And memory reclamation was not triggered during the whole run, so
> > > theoretically no one is in a read-side critical section of shrinker_srcu.
> > >
> > > Then I found that some stress-ng-ramfs processes were in
> > > TASK_UNINTERRUPTIBLE state for a long time:
> > >
> > > root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > stress-ng-ramfs [run]
> > > root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > > root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
> > > stress-ng-ramfs [run]
> > >
> > > Their call stack is as follows:
> > >
> > > cat /proc/42330/stack
> > >
> > > [<0>] __synchronize_srcu.part.21+0x83/0xb0
> > > [<0>] unregister_shrinker+0x85/0xb0
> > > [<0>] deactivate_locked_super+0x27/0x70
> > > [<0>] cleanup_mnt+0xb8/0x140
> > > [<0>] task_work_run+0x65/0x90
> > > [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> > > [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > > [<0>] do_syscall_64+0x44/0x80
> > > [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > >
> > > + RCU folks: is this result expected? I would have thought that
> > > synchronize_srcu() would return quickly if no one is in a read-side
> > > critical section. :(
In theory, it would indeed be nice if synchronize_srcu() would do that.
In practice, the act of checking to see if there is anyone in an SRCU
read-side critical section is a heavy-weight operation, involving at
least one cache miss per CPU along with a number of full memory barriers.
So SRCU has to be careful to not check too frequently.
However, if SRCU has been idle for some time, normal synchronize_srcu()
will do an immediate check. And this will of course mark SRCU as
non-idle.
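
For illustration, here is a deliberately simplified model of that reader
check. It is not the real code (which lives in kernel/rcu/srcutree.c and
also has to handle the two counter slots, ->srcu_idx flipping, and the
required memory barriers), and the field names are only approximate. The
point is that every CPU's counters must be visited:

static bool srcu_readers_active_model(struct srcu_struct *ssp, int idx)
{
        unsigned long locks = 0, unlocks = 0;
        int cpu;

        /* Typically at least one cache miss per CPU. */
        for_each_possible_cpu(cpu) {
                struct srcu_data *sdp = per_cpu_ptr(ssp->sda, cpu);

                unlocks += atomic_long_read(&sdp->srcu_unlock_count[idx]);
                locks += atomic_long_read(&sdp->srcu_lock_count[idx]);
        }

        /* A non-zero difference means a reader may still be in flight. */
        return locks != unlocks;
}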
> > With the following changes, ops/s can return to previous levels:
>
> Or just set rcu_expedited to 1:
> echo 1 > /sys/kernel/rcu_expedited
This does cause SRCU to be much more aggressive. This can be a good
choice for small systems, but please keep in mind that this affects normal
RCU as well as SRCU. It will cause RCU to also be much more aggressive,
sending IPIs to CPUs that are (or might be) in RCU read-side critical
sections. Depending on your workload, this might or might not be what
you want RCU to be doing. For example, if you are running aggressive
real-time workloads, it most definitely is not what you want.
> > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > index db2ed6e08f67..90f541b07cd1 100644
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> > debugfs_entry = shrinker_debugfs_remove(shrinker);
> > up_write(&shrinker_rwsem);
> >
> > - synchronize_srcu(&shrinker_srcu);
> > + synchronize_srcu_expedited(&shrinker_srcu);
If shrinkers are unregistered only occasionally, this is an entirely
reasonable change.
> > debugfs_remove_recursive(debugfs_entry);
> >
> > stress-ng: info: [13159] dispatching hogs: 9 ramfs
> > stress-ng: info: [13159] stressor bogo ops real time usr time
> > sys time bogo ops/s bogo ops/s
> > stress-ng: info: [13159] (secs) (secs)
> > (secs) (real time) (usr+sys time)
> > stress-ng: info: [13159] ramfs 710062 60.00 9.63
> > 157.26 11834.18 4254.75
> > stress-ng: info: [13159] for a 60.00s run time:
> > stress-ng: info: [13159] 1920.14s available CPU time
> > stress-ng: info: [13159] 9.62s user time ( 0.50%)
> > stress-ng: info: [13159] 157.26s system time ( 8.19%)
> > stress-ng: info: [13159] 166.88s total time ( 8.69%)
> > stress-ng: info: [13159] load average: 9.49 4.02 1.65
> > stress-ng: info: [13159] successful run completed in 60.00s (1 min,
> > 0.00 secs)
> >
> > Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> > one is in the read critical section?
Yes, in theory we could, but this would be a bad thing in practice.
After all, the point of having synchronize_srcu() be separate from
synchronize_srcu_expedited() is to allow uses that are OK with longer
latency to avoid consuming too much CPU. In addition, that longer
SRCU grace-period latency allows the next grace period to handle more
synchronize_srcu() and call_srcu() requests. This amortizes the
overhead of that next grace period over a larger number of updates.
However, your use of synchronize_srcu_expedited() does have that effect,
but only for this call point. Which has the advantage of avoiding
burning excessive quantities of CPU for the other 50+ call points.
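
For comparison, the fully asynchronous option mentioned above would use
call_srcu() so that nothing blocks at all. A rough sketch (not what is
being proposed in this thread; all names other than call_srcu() are
illustrative) could free ->nr_deferred from an SRCU callback:

struct nr_deferred_free {
        struct rcu_head rcu;
        atomic_long_t *nr_deferred;
};

static void free_nr_deferred_cb(struct rcu_head *rcu)
{
        struct nr_deferred_free *f =
                container_of(rcu, struct nr_deferred_free, rcu);

        kfree(f->nr_deferred);  /* safe: a grace period has elapsed */
        kfree(f);
}

/* Called after the shrinker has been unlinked from shrinker_list. */
static void queue_nr_deferred_free(struct shrinker *shrinker)
{
        struct nr_deferred_free *f = kmalloc(sizeof(*f), GFP_KERNEL);

        if (!f)
                return;         /* error handling elided in this sketch */
        f->nr_deferred = shrinker->nr_deferred;
        shrinker->nr_deferred = NULL;
        call_srcu(&shrinker_srcu, &f->rcu, free_nr_deferred_cb);
}

Note that this only defers freeing the memory; unlike synchronize_srcu(),
it by itself gives the caller no guarantee that the shrinker has finished
running by the time unregistration returns.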
Thanx, Paul
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-05-27 11:14 ` Paul E. McKenney
@ 2023-05-29 2:39 ` Qi Zheng
2023-05-29 12:51 ` Paul E. McKenney
0 siblings, 1 reply; 10+ messages in thread
From: Qi Zheng @ 2023-05-29 2:39 UTC (permalink / raw)
To: paulmck, Kirill Tkhai
Cc: RCU, Yujie Liu, oe-lkp, lkp, linux-kernel, Andrew Morton,
Vlastimil Babka, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Shakeel Butt, Yang Shi, linux-mm, ying.huang,
feng.tang, fengwei.yin
Hi Paul,
On 2023/5/27 19:14, Paul E. McKenney wrote:
> On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
>> On 2023/5/24 19:56, Qi Zheng wrote:
>>> On 2023/5/24 19:08, Qi Zheng wrote:
>>>
>>> [...]
>>>
>>>>
>>>> Well, I just ran the following command and reproduced the result:
>>>>
>>>> stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
>>>>
>>>> 1) with commit 42c9db3970483:
>>>>
>>>> stress-ng: info: [11023] setting to a 60 second run per stressor
>>>> stress-ng: info: [11023] dispatching hogs: 9 ramfs
>>>> stress-ng: info: [11023] stressor bogo ops real time usr
>>>> time sys time bogo ops/s bogo ops/s
>>>> stress-ng: info: [11023] (secs) (secs)
>>>> (secs) (real time) (usr+sys time)
>>>> stress-ng: info: [11023] ramfs 774966 60.00
>>>> 10.18 169.45 12915.89 4314.26
>>>> stress-ng: info: [11023] for a 60.00s run time:
>>>> stress-ng: info: [11023] 1920.11s available CPU time
>>>> stress-ng: info: [11023] 10.18s user time ( 0.53%)
>>>> stress-ng: info: [11023] 169.44s system time ( 8.82%)
>>>> stress-ng: info: [11023] 179.62s total time ( 9.35%)
>>>> stress-ng: info: [11023] load average: 8.99 2.69 0.93
>>>> stress-ng: info: [11023] successful run completed in 60.00s (1 min,
>>>> 0.00 secs)
>>>>
>>>> 2) with commit f95bdb700bc6b:
>>>>
>>>> stress-ng: info: [37676] dispatching hogs: 9 ramfs
>>>> stress-ng: info: [37676] stressor bogo ops real time usr
>>>> time sys time bogo ops/s bogo ops/s
>>>> stress-ng: info: [37676] (secs) (secs)
>>>> (secs) (real time) (usr+sys time)
>>>> stress-ng: info: [37676] ramfs 168673 60.00
>>>> 1.61 39.66 2811.08 4087.47
>>>> stress-ng: info: [37676] for a 60.10s run time:
>>>> stress-ng: info: [37676] 1923.36s available CPU time
>>>> stress-ng: info: [37676] 1.60s user time ( 0.08%)
>>>> stress-ng: info: [37676] 39.66s system time ( 2.06%)
>>>> stress-ng: info: [37676] 41.26s total time ( 2.15%)
>>>> stress-ng: info: [37676] load average: 7.69 3.63 2.36
>>>> stress-ng: info: [37676] successful run completed in 60.10s (1 min,
>>>> 0.10 secs)
>>>>
>>>> The bogo ops/s (real time) did drop significantly.
>>>>
>>>> And memory reclamation was not triggered during the whole run, so
>>>> theoretically no one is in a read-side critical section of shrinker_srcu.
>>>>
>>>> Then I found that some stress-ng-ramfs processes were in
>>>> TASK_UNINTERRUPTIBLE state for a long time:
>>>>
>>>> root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
>>>> stress-ng-ramfs [run]
>>>> root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>> root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
>>>> stress-ng-ramfs [run]
>>>>
>>>> Their call stack is as follows:
>>>>
>>>> cat /proc/42330/stack
>>>>
>>>> [<0>] __synchronize_srcu.part.21+0x83/0xb0
>>>> [<0>] unregister_shrinker+0x85/0xb0
>>>> [<0>] deactivate_locked_super+0x27/0x70
>>>> [<0>] cleanup_mnt+0xb8/0x140
>>>> [<0>] task_work_run+0x65/0x90
>>>> [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
>>>> [<0>] syscall_exit_to_user_mode+0x1b/0x40
>>>> [<0>] do_syscall_64+0x44/0x80
>>>> [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>>>
>>>> + RCU folks: is this result expected? I would have thought that
>>>> synchronize_srcu() would return quickly if no one is in a read-side
>>>> critical section. :(
>
> In theory, it would indeed be nice if synchronize_srcu() would do that.
> In practice, the act of checking to see if there is anyone in an SRCU
> read-side critical section is a heavy-weight operation, involving at
> least one cache miss per CPU along with a number of full memory barriers.
>
> So SRCU has to be careful to not check too frequently.
Got it.
>
> However, if SRCU has been idle for some time, normal synchronize_srcu()
> will do an immediate check. And this will of course mark SRCU as
> non-idle.
>
>>> With the following changes, ops/s can return to previous levels:
>>
>> Or just set rcu_expedited to 1:
>> echo 1 > /sys/kernel/rcu_expedited
>
> This does cause SRCU to be much more aggressive. This can be a good
> choice for small systems, but please keep in mind that this affects normal
> RCU as well as SRCU. It will cause RCU to also be much more aggressive,
> sending IPIs to CPUs that are (or might be) in RCU read-side critical
> sections. Depending on your workload, this might or might not be what
> you want RCU to be doing. For example, if you are running aggressive
> real-time workloads, it most definitely is not what you want.
Yeah, that's not what I want; a shrinker might run for a long time.
>
>>> diff --git a/mm/vmscan.c b/mm/vmscan.c
>>> index db2ed6e08f67..90f541b07cd1 100644
>>> --- a/mm/vmscan.c
>>> +++ b/mm/vmscan.c
>>> @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
>>> debugfs_entry = shrinker_debugfs_remove(shrinker);
>>> up_write(&shrinker_rwsem);
>>>
>>> - synchronize_srcu(&shrinker_srcu);
>>> + synchronize_srcu_expedited(&shrinker_srcu);
>
> If shrinkers are unregistered only occasionally, this is an entirely
> reasonable change.
>
>>> debugfs_remove_recursive(debugfs_entry);
>>>
>>> stress-ng: info: [13159] dispatching hogs: 9 ramfs
>>> stress-ng: info: [13159] stressor bogo ops real time usr time
>>> sys time bogo ops/s bogo ops/s
>>> stress-ng: info: [13159] (secs) (secs)
>>> (secs) (real time) (usr+sys time)
>>> stress-ng: info: [13159] ramfs 710062 60.00 9.63
>>> 157.26 11834.18 4254.75
>>> stress-ng: info: [13159] for a 60.00s run time:
>>> stress-ng: info: [13159] 1920.14s available CPU time
>>> stress-ng: info: [13159] 9.62s user time ( 0.50%)
>>> stress-ng: info: [13159] 157.26s system time ( 8.19%)
>>> stress-ng: info: [13159] 166.88s total time ( 8.69%)
>>> stress-ng: info: [13159] load average: 9.49 4.02 1.65
>>> stress-ng: info: [13159] successful run completed in 60.00s (1 min,
>>> 0.00 secs)
>>>
>>> Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
>>> one is in the read critical section?
>
> Yes, in theory we could, but this would be a bad thing in practice.
> After all, the point of having synchronize_srcu() be separate from
> synchronize_srcu_expedited() is to allow uses that are OK with longer
> > latency to avoid consuming too much CPU. In addition, that longer
> SRCU grace-period latency allows the next grace period to handle more
> synchronize_srcu() and call_srcu() requests. This amortizes the
> overhead of that next grace period over a larger number of updates.
>
> However, your use of synchronize_srcu_expedited() does have that effect,
> but only for this call point. Which has the advantage of avoiding
> burning excessive quantities of CPU for the other 50+ call points.
Thanks for such a detailed explanation.
Now I think we can continue to try to complete the idea[1] from
Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
work, so it doesn't affect the user-visible unregistration speed.
[1].
https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
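
The general shape of that idea, as I understand it, looks like the sketch
below. This is only a sketch, not the patch from [1]; the helper names are
made up, and the object owning the shrinker (e.g. the superblock) must
keep it and the work item alive until the work has run:

struct shrinker_unreg_work {
        struct work_struct work;
        struct shrinker *shrinker;
};

static void shrinker_unreg_workfn(struct work_struct *work)
{
        struct shrinker_unreg_work *w =
                container_of(work, struct shrinker_unreg_work, work);

        /* The slow grace-period wait now happens off the unmount path. */
        synchronize_srcu(&shrinker_srcu);
        kfree(w->shrinker->nr_deferred);
        w->shrinker->nr_deferred = NULL;
        /* The owner may free the shrinker (and w) only after this point. */
}

/* Fast, user-visible part: unlink the shrinker and queue the rest. */
static void unregister_shrinker_deferred(struct shrinker *shrinker,
                                         struct shrinker_unreg_work *w)
{
        /* list_del_rcu(&shrinker->list), flag clearing, etc. as today. */
        w->shrinker = shrinker;
        INIT_WORK(&w->work, shrinker_unreg_workfn);
        schedule_work(&w->work);
}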
Thanks,
Qi
>
> Thanx, Paul
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-05-29 2:39 ` Qi Zheng
@ 2023-05-29 12:51 ` Paul E. McKenney
2023-05-30 3:07 ` Qi Zheng
0 siblings, 1 reply; 10+ messages in thread
From: Paul E. McKenney @ 2023-05-29 12:51 UTC (permalink / raw)
To: Qi Zheng
Cc: Kirill Tkhai, RCU, Yujie Liu, oe-lkp, lkp, linux-kernel,
Andrew Morton, Vlastimil Babka, Roman Gushchin,
Christian König, David Hildenbrand, Davidlohr Bueso,
Johannes Weiner, Michal Hocko, Muchun Song, Shakeel Butt,
Yang Shi, linux-mm, ying.huang, feng.tang, fengwei.yin
On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
> Hi Paul,
>
> On 2023/5/27 19:14, Paul E. McKenney wrote:
> > On Thu, May 25, 2023 at 12:03:16PM +0800, Qi Zheng wrote:
> > > On 2023/5/24 19:56, Qi Zheng wrote:
> > > > On 2023/5/24 19:08, Qi Zheng wrote:
> > > >
> > > > [...]
> > > >
> > > > >
> > > > > Well, I just ran the following command and reproduced the result:
> > > > >
> > > > > stress-ng --timeout 60 --times --verify --metrics-brief --ramfs 9 &
> > > > >
> > > > > 1) with commit 42c9db3970483:
> > > > >
> > > > > stress-ng: info: [11023] setting to a 60 second run per stressor
> > > > > stress-ng: info: [11023] dispatching hogs: 9 ramfs
> > > > > stress-ng: info: [11023] stressor bogo ops real time usr
> > > > > time sys time bogo ops/s bogo ops/s
> > > > > stress-ng: info: [11023] (secs) (secs)
> > > > > (secs) (real time) (usr+sys time)
> > > > > stress-ng: info: [11023] ramfs 774966 60.00
> > > > > 10.18 169.45 12915.89 4314.26
> > > > > stress-ng: info: [11023] for a 60.00s run time:
> > > > > stress-ng: info: [11023] 1920.11s available CPU time
> > > > > stress-ng: info: [11023] 10.18s user time ( 0.53%)
> > > > > stress-ng: info: [11023] 169.44s system time ( 8.82%)
> > > > > stress-ng: info: [11023] 179.62s total time ( 9.35%)
> > > > > stress-ng: info: [11023] load average: 8.99 2.69 0.93
> > > > > stress-ng: info: [11023] successful run completed in 60.00s (1 min,
> > > > > 0.00 secs)
> > > > >
> > > > > 2) with commit f95bdb700bc6b:
> > > > >
> > > > > stress-ng: info: [37676] dispatching hogs: 9 ramfs
> > > > > stress-ng: info: [37676] stressor bogo ops real time usr
> > > > > time sys time bogo ops/s bogo ops/s
> > > > > stress-ng: info: [37676] (secs) (secs)
> > > > > (secs) (real time) (usr+sys time)
> > > > > stress-ng: info: [37676] ramfs 168673 60.00
> > > > > 1.61 39.66 2811.08 4087.47
> > > > > stress-ng: info: [37676] for a 60.10s run time:
> > > > > stress-ng: info: [37676] 1923.36s available CPU time
> > > > > stress-ng: info: [37676] 1.60s user time ( 0.08%)
> > > > > stress-ng: info: [37676] 39.66s system time ( 2.06%)
> > > > > stress-ng: info: [37676] 41.26s total time ( 2.15%)
> > > > > stress-ng: info: [37676] load average: 7.69 3.63 2.36
> > > > > stress-ng: info: [37676] successful run completed in 60.10s (1 min,
> > > > > 0.10 secs)
> > > > >
> > > > > The bogo ops/s (real time) did drop significantly.
> > > > >
> > > > > And memory reclamation was not triggered during the whole run, so
> > > > > theoretically no one is in a read-side critical section of shrinker_srcu.
> > > > >
> > > > > Then I found that some stress-ng-ramfs processes were in
> > > > > TASK_UNINTERRUPTIBLE state for a long time:
> > > > >
> > > > > root 42313 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42314 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42315 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42316 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42317 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42318 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42319 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42320 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42321 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42322 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42323 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42324 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42325 7.8 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42326 0.0 0.0 69592 2068 pts/0 S 19:00 0:00
> > > > > stress-ng-ramfs [run]
> > > > > root 42327 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42328 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42329 7.9 0.0 69592 1812 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > > root 42330 7.9 0.0 69592 1556 pts/0 D 19:00 0:02
> > > > > stress-ng-ramfs [run]
> > > > >
> > > > > Their call stack is as follows:
> > > > >
> > > > > cat /proc/42330/stack
> > > > >
> > > > > [<0>] __synchronize_srcu.part.21+0x83/0xb0
> > > > > [<0>] unregister_shrinker+0x85/0xb0
> > > > > [<0>] deactivate_locked_super+0x27/0x70
> > > > > [<0>] cleanup_mnt+0xb8/0x140
> > > > > [<0>] task_work_run+0x65/0x90
> > > > > [<0>] exit_to_user_mode_prepare+0x1ba/0x1c0
> > > > > [<0>] syscall_exit_to_user_mode+0x1b/0x40
> > > > > [<0>] do_syscall_64+0x44/0x80
> > > > > [<0>] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > > >
> > > > > + RCU folks: is this result expected? I would have thought that
> > > > > synchronize_srcu() would return quickly if no one is in a read-side
> > > > > critical section. :(
> >
> > In theory, it would indeed be nice if synchronize_srcu() would do that.
> > In practice, the act of checking to see if there is anyone in an SRCU
> > read-side critical section is a heavy-weight operation, involving at
> > least one cache miss per CPU along with a number of full memory barriers.
> >
> > So SRCU has to be careful to not check too frequently.
>
> Got it.
>
> >
> > However, if SRCU has been idle for some time, normal synchronize_srcu()
> > will do an immediate check. And this will of course mark SRCU as
> > non-idle.
> >
> > > > With the following changes, ops/s can return to previous levels:
> > >
> > > Or just set rcu_expedited to 1:
> > > echo 1 > /sys/kernel/rcu_expedited
> >
> > This does cause SRCU to be much more aggressive. This can be a good
> > choice for small systems, but please keep in mind that this affects normal
> > RCU as well as SRCU. It will cause RCU to also be much more aggressive,
> > sending IPIs to CPUs that are (or might be) in RCU read-side critical
> > sections. Depending on your workload, this might or might not be what
> > you want RCU to be doing. For example, if you are running aggressive
> > real-time workloads, it most definitely is not what you want.
>
> Yeah, that's not what I want; a shrinker might run for a long time.
>
> >
> > > > diff --git a/mm/vmscan.c b/mm/vmscan.c
> > > > index db2ed6e08f67..90f541b07cd1 100644
> > > > --- a/mm/vmscan.c
> > > > +++ b/mm/vmscan.c
> > > > @@ -763,7 +763,7 @@ void unregister_shrinker(struct shrinker *shrinker)
> > > > debugfs_entry = shrinker_debugfs_remove(shrinker);
> > > > up_write(&shrinker_rwsem);
> > > >
> > > > - synchronize_srcu(&shrinker_srcu);
> > > > + synchronize_srcu_expedited(&shrinker_srcu);
> >
> > If shrinkers are unregistered only occasionally, this is an entirely
> > reasonable change.
> >
> > > > debugfs_remove_recursive(debugfs_entry);
> > > >
> > > > stress-ng: info: [13159] dispatching hogs: 9 ramfs
> > > > stress-ng: info: [13159] stressor bogo ops real time usr time
> > > > sys time bogo ops/s bogo ops/s
> > > > stress-ng: info: [13159] (secs) (secs)
> > > > (secs) (real time) (usr+sys time)
> > > > stress-ng: info: [13159] ramfs 710062 60.00 9.63
> > > > 157.26 11834.18 4254.75
> > > > stress-ng: info: [13159] for a 60.00s run time:
> > > > stress-ng: info: [13159] 1920.14s available CPU time
> > > > stress-ng: info: [13159] 9.62s user time ( 0.50%)
> > > > stress-ng: info: [13159] 157.26s system time ( 8.19%)
> > > > stress-ng: info: [13159] 166.88s total time ( 8.69%)
> > > > stress-ng: info: [13159] load average: 9.49 4.02 1.65
> > > > stress-ng: info: [13159] successful run completed in 60.00s (1 min,
> > > > 0.00 secs)
> > > >
> > > > Can we make synchronize_srcu() call synchronize_srcu_expedited() when no
> > > > one is in the read critical section?
> >
> > Yes, in theory we could, but this would be a bad thing in practice.
> > After all, the point of having synchronize_srcu() be separate from
> > synchronize_srcu_expedited() is to allow uses that are OK with longer
> > > latency to avoid consuming too much CPU. In addition, that longer
> > SRCU grace-period latency allows the next grace period to handle more
> > synchronize_srcu() and call_srcu() requests. This amortizes the
> > overhead of that next grace period over a larger number of updates.
> >
> > However, your use of synchronize_srcu_expedited() does have that effect,
> > but only for this call point. Which has the advantage of avoiding
> > burning excessive quantities of CPU for the other 50+ call points.
>
> Thanks for such a detailed explanation.
>
> Now I think we can continue to try to complete the idea[1] from
> Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
> work, so it doesn't affect the user-visible unregistration speed.
>
> [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
A blast from the past! ;-)
But yes, moving the long-latency synchronize_srcu() off the user-visible
critical code path can be even better.
Thanx, Paul
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-05-29 12:51 ` Paul E. McKenney
@ 2023-05-30 3:07 ` Qi Zheng
2023-06-01 0:57 ` Kirill Tkhai
0 siblings, 1 reply; 10+ messages in thread
From: Qi Zheng @ 2023-05-30 3:07 UTC (permalink / raw)
To: paulmck, Kirill Tkhai
Cc: RCU, Yujie Liu, oe-lkp, lkp, linux-kernel, Andrew Morton,
Vlastimil Babka, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Shakeel Butt, Yang Shi, linux-mm, ying.huang,
feng.tang, fengwei.yin
Hi,
On 2023/5/29 20:51, Paul E. McKenney wrote:
> On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
[...]
>>
>> Thanks for such a detailed explanation.
>>
>> Now I think we can continue to try to complete the idea[1] from
>> Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
>> work, so it doesn't affect on user-visible unregistration speed.
>>
>> [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
>
> A blast from the past! ;-)
>
> But yes, moving the long-latency synchronize_srcu() off the user-visible
> critical code path can be even better.
Yeah, I applied these patches ([PATCH RFC 04/10]~[PATCH RFC 10/10],
with a few conflicts), and the ops/s does get back to the previous levels.
I'll continue updating this patchset after doing more testing.
Thanks,
Qi
>
> Thanx, Paul
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-05-30 3:07 ` Qi Zheng
@ 2023-06-01 0:57 ` Kirill Tkhai
2023-06-01 8:34 ` Qi Zheng
0 siblings, 1 reply; 10+ messages in thread
From: Kirill Tkhai @ 2023-06-01 0:57 UTC (permalink / raw)
To: Qi Zheng, paulmck
Cc: RCU, Yujie Liu, oe-lkp, lkp, linux-kernel, Andrew Morton,
Vlastimil Babka, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Shakeel Butt, Yang Shi, linux-mm, ying.huang,
feng.tang, fengwei.yin
Hi,
On 30.05.2023 06:07, Qi Zheng wrote:
> Hi,
>
> On 2023/5/29 20:51, Paul E. McKenney wrote:
>> On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
>
> [...]
>
>>>
>>> Thanks for such a detailed explanation.
>>>
>>> Now I think we can continue to try to complete the idea[1] from
>>> Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
>>> work, so it doesn't affect the user-visible unregistration speed.
>>>
>>> [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
>>
>> A blast from the past! ;-)
>>
>> But yes, moving the long-latency synchronize_srcu() off the user-visible
>> critical code path can be even better.
>
> Yeah, I applied these patches ([PATCH RFC 04/10]~[PATCH RFC 10/10],
> with a few conflicts), and the ops/s does get back to the previous levels.
>
> I'll continue updating this patchset after doing more testing.
You may also fix the issue using the below generic solution.
In addition to this we need a patch that calls unregister_shrinker_delayed_initiate()
instead of unregister_shrinker() in deactivate_locked_super(), and calls
unregister_shrinker_delayed_finalize() from destroy_super_work(). Compile-tested only.
---
include/linux/shrinker.h | 2 ++
mm/vmscan.c | 38 ++++++++++++++++++++++++++++++--------
2 files changed, 32 insertions(+), 8 deletions(-)
diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
index 224293b2dd06..4ba2986716d3 100644
--- a/include/linux/shrinker.h
+++ b/include/linux/shrinker.h
@@ -4,6 +4,7 @@
#include <linux/atomic.h>
#include <linux/types.h>
+#include <linux/rwsem.h>
/*
* This struct is used to pass information from page reclaim to the shrinkers.
@@ -83,6 +84,7 @@ struct shrinker {
#endif
/* objs pending delete, per node */
atomic_long_t *nr_deferred;
+ struct rw_semaphore rwsem;
};
#define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
diff --git a/mm/vmscan.c b/mm/vmscan.c
index eeca83e28c9b..19fc129771ce 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -706,6 +706,7 @@ static int __prealloc_shrinker(struct shrinker *shrinker)
if (!shrinker->nr_deferred)
return -ENOMEM;
+ init_rwsem(&shrinker->rwsem);
return 0;
}
@@ -757,7 +758,9 @@ void register_shrinker_prepared(struct shrinker *shrinker)
{
mutex_lock(&shrinker_mutex);
list_add_tail_rcu(&shrinker->list, &shrinker_list);
+ down_write(&shrinker->rwsem);
shrinker->flags |= SHRINKER_REGISTERED;
+ up_write(&shrinker->rwsem);
shrinker_debugfs_add(shrinker);
mutex_unlock(&shrinker_mutex);
}
@@ -802,7 +805,7 @@ EXPORT_SYMBOL(register_shrinker);
/*
* Remove one
*/
-void unregister_shrinker(struct shrinker *shrinker)
+void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
{
struct dentry *debugfs_entry;
int debugfs_id;
@@ -812,20 +815,33 @@ void unregister_shrinker(struct shrinker *shrinker)
mutex_lock(&shrinker_mutex);
list_del_rcu(&shrinker->list);
+ down_write(&shrinker->rwsem);
shrinker->flags &= ~SHRINKER_REGISTERED;
+ up_write(&shrinker->rwsem);
if (shrinker->flags & SHRINKER_MEMCG_AWARE)
unregister_memcg_shrinker(shrinker);
debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
mutex_unlock(&shrinker_mutex);
+ shrinker_debugfs_remove(debugfs_entry, debugfs_id); // This is moved in your patch
+}
+EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
+
+void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
+{
atomic_inc(&shrinker_srcu_generation);
synchronize_srcu(&shrinker_srcu);
- shrinker_debugfs_remove(debugfs_entry, debugfs_id);
-
kfree(shrinker->nr_deferred);
shrinker->nr_deferred = NULL;
}
+EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
+
+void unregister_shrinker(struct shrinker *shrinker)
+{
+ unregister_shrinker_delayed_initiate(shrinker);
+ unregister_shrinker_delayed_finalize(shrinker);
+}
EXPORT_SYMBOL(unregister_shrinker);
/**
@@ -856,9 +872,14 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
: SHRINK_BATCH;
long scanned = 0, next_deferred;
+ down_read(&shrinker->rwsem);
+ if (!(shrinker->flags & SHRINKER_REGISTERED))
+ goto unlock;
freeable = shrinker->count_objects(shrinker, shrinkctl);
- if (freeable == 0 || freeable == SHRINK_EMPTY)
- return freeable;
+ if (freeable == 0 || freeable == SHRINK_EMPTY) {
+ freed = freeable;
+ goto unlock;
+ }
/*
* copy the current shrinker scan count into a local variable
@@ -935,6 +956,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
* manner that handles concurrent updates.
*/
new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
+unlock:
+ up_read(&shrinker->rwsem);
trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
return freed;
@@ -968,9 +991,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
struct shrinker *shrinker;
shrinker = idr_find(&shrinker_idr, i);
- if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
- if (!shrinker)
- clear_bit(i, info->map);
+ if (unlikely(!shrinker)) {
+ clear_bit(i, info->map);
continue;
}
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
2023-06-01 0:57 ` Kirill Tkhai
@ 2023-06-01 8:34 ` Qi Zheng
[not found] ` <932751685611907@mail.yandex.ru>
0 siblings, 1 reply; 10+ messages in thread
From: Qi Zheng @ 2023-06-01 8:34 UTC (permalink / raw)
To: Kirill Tkhai, paulmck
Cc: RCU, Yujie Liu, oe-lkp, lkp, linux-kernel, Andrew Morton,
Vlastimil Babka, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Shakeel Butt, Yang Shi, linux-mm, ying.huang,
feng.tang, fengwei.yin, david
On 2023/6/1 08:57, Kirill Tkhai wrote:
> Hi,
>
> On 30.05.2023 06:07, Qi Zheng wrote:
>> Hi,
>>
>> On 2023/5/29 20:51, Paul E. McKenney wrote:
>>> On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
>>
>> [...]
>>
>>>>
>>>> Thanks for such a detailed explanation.
>>>>
>>>> Now I think we can continue to try to complete the idea[1] from
>>>> Kirill Tkhai. The patch moves heavy synchronize_srcu() to delayed
>>>> work, so it doesn't affect the user-visible unregistration speed.
>>>>
>>>> [1]. https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
>>>
>>> A blast from the past! ;-)
>>>
>>> But yes, moving the long-latency synchronize_srcu() off the user-visible
>>> critical code path can be even better.
>>
>> Yeah, I applied these patches ([PATCH RFC 04/10]~[PATCH RFC 10/10],
>> with a few conflicts), and the ops/s does get back to the previous levels.
>>
>> I'll continue updating this patchset after doing more testing.
>
> You may also fix the issue using the below generic solution.
>
> In addition to this we need a patch that calls unregister_shrinker_delayed_initiate()
> instead of unregister_shrinker() in deactivate_locked_super(), and calls
> unregister_shrinker_delayed_finalize() from destroy_super_work(). Compile-tested only.
>
> ---
> include/linux/shrinker.h | 2 ++
> mm/vmscan.c | 38 ++++++++++++++++++++++++++++++--------
> 2 files changed, 32 insertions(+), 8 deletions(-)
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 224293b2dd06..4ba2986716d3 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,7 @@
>
> #include <linux/atomic.h>
> #include <linux/types.h>
> +#include <linux/rwsem.h>
>
> /*
> * This struct is used to pass information from page reclaim to the shrinkers.
> @@ -83,6 +84,7 @@ struct shrinker {
> #endif
> /* objs pending delete, per node */
> atomic_long_t *nr_deferred;
> + struct rw_semaphore rwsem;
> };
> #define DEFAULT_SEEKS 2 /* A good number if you don't know better. */
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeca83e28c9b..19fc129771ce 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -706,6 +706,7 @@ static int __prealloc_shrinker(struct shrinker *shrinker)
> if (!shrinker->nr_deferred)
> return -ENOMEM;
>
> + init_rwsem(&shrinker->rwsem);
> return 0;
> }
>
> @@ -757,7 +758,9 @@ void register_shrinker_prepared(struct shrinker *shrinker)
> {
> mutex_lock(&shrinker_mutex);
> list_add_tail_rcu(&shrinker->list, &shrinker_list);
> + down_write(&shrinker->rwsem);
> shrinker->flags |= SHRINKER_REGISTERED;
> + up_write(&shrinker->rwsem);
> shrinker_debugfs_add(shrinker);
> mutex_unlock(&shrinker_mutex);
> }
> @@ -802,7 +805,7 @@ EXPORT_SYMBOL(register_shrinker);
> /*
> * Remove one
> */
> -void unregister_shrinker(struct shrinker *shrinker)
> +void unregister_shrinker_delayed_initiate(struct shrinker *shrinker)
> {
> struct dentry *debugfs_entry;
> int debugfs_id;
> @@ -812,20 +815,33 @@ void unregister_shrinker(struct shrinker *shrinker)
>
> mutex_lock(&shrinker_mutex);
> list_del_rcu(&shrinker->list);
> + down_write(&shrinker->rwsem);
> shrinker->flags &= ~SHRINKER_REGISTERED;
> + up_write(&shrinker->rwsem);
> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> unregister_memcg_shrinker(shrinker);
> debugfs_entry = shrinker_debugfs_detach(shrinker, &debugfs_id);
> mutex_unlock(&shrinker_mutex);
>
> + shrinker_debugfs_remove(debugfs_entry, debugfs_id); // This is moved in your patch
> +}
> +EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
> +
> +void unregister_shrinker_delayed_finalize(struct shrinker *shrinker)
> +{
> atomic_inc(&shrinker_srcu_generation);
> synchronize_srcu(&shrinker_srcu);
>
> - shrinker_debugfs_remove(debugfs_entry, debugfs_id);
> -
> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
> }
> +EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
> +
> +void unregister_shrinker(struct shrinker *shrinker)
> +{
> + unregister_shrinker_delayed_initiate(shrinker);
> + unregister_shrinker_delayed_finalize(shrinker);
> +}
> EXPORT_SYMBOL(unregister_shrinker);
>
> /**
> @@ -856,9 +872,14 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> : SHRINK_BATCH;
> long scanned = 0, next_deferred;
>
> + down_read(&shrinker->rwsem);
> + if (!(shrinker->flags & SHRINKER_REGISTERED))
> + goto unlock;
> freeable = shrinker->count_objects(shrinker, shrinkctl);
> - if (freeable == 0 || freeable == SHRINK_EMPTY)
> - return freeable;
> + if (freeable == 0 || freeable == SHRINK_EMPTY) {
> + freed = freeable;
> + goto unlock;
> + }
>
> /*
> * copy the current shrinker scan count into a local variable
> @@ -935,6 +956,8 @@ static unsigned long do_shrink_slab(struct shrink_control *shrinkctl,
> * manner that handles concurrent updates.
> */
> new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);
> +unlock:
> + up_read(&shrinker->rwsem);
It should be moved after trace_mm_shrink_slab_end().
>
> trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr, new_nr, total_scan);
> return freed;
> @@ -968,9 +991,8 @@ static unsigned long shrink_slab_memcg(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
>
> shrinker = idr_find(&shrinker_idr, i);
> - if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
> - if (!shrinker)
> - clear_bit(i, info->map);
> + if (unlikely(!shrinker)) {
> + clear_bit(i, info->map);
> continue;
> }
Keep this as a fast path?
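I.e. keep the unlocked check here as a fast path (just a sketch on top
of your patch, untested):

		shrinker = idr_find(&shrinker_idr, i);
		if (unlikely(!shrinker || !(shrinker->flags & SHRINKER_REGISTERED))) {
			if (!shrinker)
				clear_bit(i, info->map);
			continue;
		}

so that already-unregistered shrinkers are skipped without touching
shrinker->rwsem at all, and we rely on the SHRINKER_REGISTERED re-check
under the per-shrinker lock in do_shrink_slab() for correctness.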
>
After applying the above patch, the ops/s performance regression is
resolved. And it is still guaranteed that the shrinker is not running
after unregister_shrinker_delayed_initiate(), so the previous
semantics are not broken.

Since the lock granularity of down_read() has changed to per-shrinker
granularity, it seems that the down_read() perf hotspot will not be
very high. I'm not quite sure why.
(The test script used is the script in
https://lore.kernel.org/all/20230313112819.38938-4-zhengqi.arch@bytedance.com/)
25.28% [kernel] [k] do_shrink_slab
21.91% [kernel] [k] pv_native_safe_halt
10.81% [kernel] [k] _find_next_bit
10.47% [kernel] [k] down_read
8.75% [kernel] [k] shrink_slab
4.03% [kernel] [k] up_read
3.29% [kernel] [k] shrink_lruvec
2.75% [kernel] [k] xa_load
2.73% [kernel] [k] mem_cgroup_iter
2.67% [kernel] [k] shrink_node
1.30% [kernel] [k] list_lru_count_one
Thanks,
Qi
>
>
^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression
[not found] ` <932751685611907@mail.yandex.ru>
@ 2023-06-01 10:44 ` Qi Zheng
0 siblings, 0 replies; 10+ messages in thread
From: Qi Zheng @ 2023-06-01 10:44 UTC (permalink / raw)
To: Kirill Tkhai
Cc: RCU, paulmck@kernel.org, Yujie Liu, oe-lkp@lists.linux.dev,
lkp@intel.com, linux-kernel@vger.kernel.org, Andrew Morton,
Vlastimil Babka, Roman Gushchin, Christian König,
David Hildenbrand, Davidlohr Bueso, Johannes Weiner, Michal Hocko,
Muchun Song, Shakeel Butt, Yang Shi, linux-mm@kvack.org,
ying.huang@intel.com, feng.tang@intel.com, fengwei.yin@intel.com,
david@fromorbit.com, Christian Brauner
Hi Kirill,
On 2023/6/1 18:06, Kirill Tkhai wrote:
> 01.06.2023, 11:34, "Qi Zheng" <qi.zheng@linux.dev>:
>
>
>
> On 2023/6/1 08:57, Kirill Tkhai wrote:
>
> Hi,
>
> On 30.05.2023 06:07, Qi Zheng wrote:
>
> Hi,
>
> On 2023/5/29 20:51, Paul E. McKenney wrote:
>
> On Mon, May 29, 2023 at 10:39:21AM +0800, Qi Zheng wrote:
>
>
> [...]
>
>
> Thanks for such a detailed explanation.
>
> Now I think we can continue to try to complete the idea[1] from
> Kirill Tkhai. The patch moves the heavy synchronize_srcu() to delayed
> work, so it doesn't affect the user-visible unregistration speed.
>
> [1].
> https://lore.kernel.org/lkml/153365636747.19074.12610817307548583381.stgit@localhost.localdomain/
>
>
> A blast from the past! ;-)
>
> But yes, moving the long-latency synchronize_srcu() off the
> user-visible critical code path can be even better.
>
>
> Yeah, I applied these patches ([PATCH RFC 04/10]~[PATCH RFC 10/10],
> with a few conflicts), and the ops/s does get back to the previous
> levels.
>
> I'll continue updating this patchset after doing more testing.
>
>
> You may also fix the issue using the below generic solution.
>
> In addition to this we need a patch which calls
> unregister_shrinker_delayed_initiate() instead of unregister_shrinker()
> in deactivate_locked_super(), and calls
> unregister_shrinker_delayed_finalize() from destroy_super_work().
> Compilation tested only.
>
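If I understand correctly, the fs/super.c side would then look roughly
like this (just a sketch, not compile-tested):

	void deactivate_locked_super(struct super_block *s)
	{
		...
		unregister_shrinker_delayed_initiate(&s->s_shrink);
		...
	}

	static void destroy_super_work(struct work_struct *work)
	{
		struct super_block *s = container_of(work, struct super_block,
						     destroy_work);

		unregister_shrinker_delayed_finalize(&s->s_shrink);
		...
	}

so the synchronize_srcu() latency is paid in the workqueue context
instead of on the umount path.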
> ---
> include/linux/shrinker.h | 2 ++
> mm/vmscan.c | 38 ++++++++++++++++++++++++++++++--------
> 2 files changed, 32 insertions(+), 8 deletions(-)
> diff --git a/include/linux/shrinker.h b/include/linux/shrinker.h
> index 224293b2dd06..4ba2986716d3 100644
> --- a/include/linux/shrinker.h
> +++ b/include/linux/shrinker.h
> @@ -4,6 +4,7 @@
>
> #include <linux/atomic.h>
> #include <linux/types.h>
> +#include <linux/rwsem.h>
>
> /*
> * This struct is used to pass information from page reclaim
> to the shrinkers.
> @@ -83,6 +84,7 @@ struct shrinker {
> #endif
> /* objs pending delete, per node */
> atomic_long_t *nr_deferred;
> + struct rw_semaphore rwsem;
> };
> #define DEFAULT_SEEKS 2 /* A good number if you don't know
> better. */
>
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index eeca83e28c9b..19fc129771ce 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -706,6 +706,7 @@ static int __prealloc_shrinker(struct
> shrinker *shrinker)
> if (!shrinker->nr_deferred)
> return -ENOMEM;
>
> + init_rwsem(&shrinker->rwsem);
> return 0;
> }
>
> @@ -757,7 +758,9 @@ void register_shrinker_prepared(struct
> shrinker *shrinker)
> {
> mutex_lock(&shrinker_mutex);
> list_add_tail_rcu(&shrinker->list, &shrinker_list);
> + down_write(&shrinker->rwsem);
> shrinker->flags |= SHRINKER_REGISTERED;
> + up_write(&shrinker->rwsem);
> shrinker_debugfs_add(shrinker);
> mutex_unlock(&shrinker_mutex);
> }
> @@ -802,7 +805,7 @@ EXPORT_SYMBOL(register_shrinker);
> /*
> * Remove one
> */
> -void unregister_shrinker(struct shrinker *shrinker)
> +void unregister_shrinker_delayed_initiate(struct shrinker
> *shrinker)
> {
> struct dentry *debugfs_entry;
> int debugfs_id;
> @@ -812,20 +815,33 @@ void unregister_shrinker(struct shrinker
> *shrinker)
>
> mutex_lock(&shrinker_mutex);
> list_del_rcu(&shrinker->list);
> + down_write(&shrinker->rwsem);
> shrinker->flags &= ~SHRINKER_REGISTERED;
> + up_write(&shrinker->rwsem);
> if (shrinker->flags & SHRINKER_MEMCG_AWARE)
> unregister_memcg_shrinker(shrinker);
> debugfs_entry = shrinker_debugfs_detach(shrinker,
> &debugfs_id);
> mutex_unlock(&shrinker_mutex);
>
> + shrinker_debugfs_remove(debugfs_entry, debugfs_id); // This
> is moved in your patch
> +}
> +EXPORT_SYMBOL(unregister_shrinker_delayed_initiate);
> +
> +void unregister_shrinker_delayed_finalize(struct shrinker
> *shrinker)
> +{
> atomic_inc(&shrinker_srcu_generation);
> synchronize_srcu(&shrinker_srcu);
>
> - shrinker_debugfs_remove(debugfs_entry, debugfs_id);
> -
> kfree(shrinker->nr_deferred);
> shrinker->nr_deferred = NULL;
> }
> +EXPORT_SYMBOL(unregister_shrinker_delayed_finalize);
> +
> +void unregister_shrinker(struct shrinker *shrinker)
> +{
> + unregister_shrinker_delayed_initiate(shrinker);
> + unregister_shrinker_delayed_finalize(shrinker);
> +}
> EXPORT_SYMBOL(unregister_shrinker);
>
> /**
> @@ -856,9 +872,14 @@ static unsigned long
> do_shrink_slab(struct shrink_control *shrinkctl,
> : SHRINK_BATCH;
> long scanned = 0, next_deferred;
>
> + down_read(&shrinker->rwsem);
> + if (!(shrinker->flags & SHRINKER_REGISTERED))
> + goto unlock;
> freeable = shrinker->count_objects(shrinker, shrinkctl);
> - if (freeable == 0 || freeable == SHRINK_EMPTY)
> - return freeable;
> + if (freeable == 0 || freeable == SHRINK_EMPTY) {
> + freed = freeable;
> + goto unlock;
> + }
>
> /*
> * copy the current shrinker scan count into a local
> variable
> @@ -935,6 +956,8 @@ static unsigned long do_shrink_slab(struct
> shrink_control *shrinkctl,
> * manner that handles concurrent updates.
> */
> new_nr = add_nr_deferred(next_deferred, shrinker,
> shrinkctl);
> +unlock:
> + up_read(&shrinker->rwsem);
>
>
> It should be moved after trace_mm_shrink_slab_end().
>
> Could you explain the reason? I don't see the variable it will protect.
We jump to the unlock label before trace_mm_shrink_slab_start() is
called, so I think we should not end up calling
trace_mm_shrink_slab_end() on those paths.
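I.e. the tail of do_shrink_slab() would become something like this
(sketch only, untested):

	new_nr = add_nr_deferred(next_deferred, shrinker, shrinkctl);

	trace_mm_shrink_slab_end(shrinker, shrinkctl->nid, freed, nr,
				 new_nr, total_scan);
unlock:
	up_read(&shrinker->rwsem);
	return freed;

so the two early "goto unlock" paths skip trace_mm_shrink_slab_end()
(its trace_mm_shrink_slab_start() counterpart was never called) and
only drop the read lock before returning.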
>
>
> trace_mm_shrink_slab_end(shrinker, shrinkctl->nid,
> freed, nr, new_nr, total_scan);
> return freed;
> @@ -968,9 +991,8 @@ static unsigned long
> shrink_slab_memcg(gfp_t gfp_mask, int nid,
> struct shrinker *shrinker;
>
> shrinker = idr_find(&shrinker_idr, i);
> - if (unlikely(!shrinker || !(shrinker->flags &
> SHRINKER_REGISTERED))) {
> - if (!shrinker)
> - clear_bit(i, info->map);
> + if (unlikely(!shrinker)) {
> + clear_bit(i, info->map);
> continue;
> }
>
>
> Keep this as a fast path?
>
> Probably, yes. And also we need 1) down_read_trylock() instead of
> down_read() and 2) rwsem_is_contended() in do_shrink_slab().
Agree. Although the write-side critical section of shrinker->rwsem is
very short, using the trylock avoids unnecessary sleeps.
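Something like this in do_shrink_slab() (sketch only, untested;
returning 0 when the trylock fails is just one option):

	if (!down_read_trylock(&shrinker->rwsem))
		return 0;

	if (!(shrinker->flags & SHRINKER_REGISTERED))
		goto unlock;

	...

	/* and inside the scan loop, back off if an unregister is waiting: */
	if (rwsem_is_contended(&shrinker->rwsem))
		break;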
>
>
> After applying the above patch, the ops/s performance regression is
> resolved. And it is still guaranteed that the shrinker is not running
> after unregister_shrinker_delayed_initiate(), so the previous
> semantics are not broken.
>
> Keeping the old semantics or not is quite subjective, I think. It's
> possible to provide strong arguments for both cases. The first is
> faster, the second is easier to adopt for users. For me personally
> the faster approach looks better.
Agree. I also like the completely lock-less slab shrink approach.
> Since the lock granularity of down_read() has changed to per-shrinker
> granularity, it seems that the down_read() perf hotspot will not be
> very high. I'm not quite sure why.
>
> (The test script used is the script in
> https://lore.kernel.org/all/20230313112819.38938-4-zhengqi.arch@bytedance.com/)
>
> Hmm, possibly the original shrinker_rwsem had a lot of atomic contention
> between cpus on down_read(), while with small locks it does not.
I guess so.
> Are the CONFIG_ options for lock debugging the same in the original
> case and in this one?
Basically yes. I will do some more testing.
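For reference, the lock debugging options I will double-check are the
same in both configs (just the usual suspects, not an exhaustive list):

	CONFIG_DEBUG_RWSEMS
	CONFIG_DEBUG_LOCK_ALLOC
	CONFIG_PROVE_LOCKING
	CONFIG_LOCK_STAT
	CONFIG_DEBUG_ATOMIC_SLEEP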
Thanks,
Qi
>
>
> 25.28% [kernel] [k] do_shrink_slab
> 21.91% [kernel] [k] pv_native_safe_halt
> 10.81% [kernel] [k] _find_next_bit
> 10.47% [kernel] [k] down_read
> 8.75% [kernel] [k] shrink_slab
> 4.03% [kernel] [k] up_read
> 3.29% [kernel] [k] shrink_lruvec
> 2.75% [kernel] [k] xa_load
> 2.73% [kernel] [k] mem_cgroup_iter
> 2.67% [kernel] [k] shrink_node
> 1.30% [kernel] [k] list_lru_count_one
>
> Thanks,
> Qi
>
>
--
Thanks,
Qi
^ permalink raw reply [flat|nested] 10+ messages in thread
end of thread
Thread overview: 10+ messages
[not found] <202305230837.db2c233f-yujie.liu@intel.com>
[not found] ` <eba38fce-2454-d7a4-10ef-240b4686f23d@linux.dev>
[not found] ` <ZG29ULGNJdErnatI@yujie-X299>
[not found] ` <896bbb09-d400-ec73-ba3a-b64c6e9bbe46@linux.dev>
[not found] ` <e5fb8b34-c1ad-92e0-e7e5-f7ed1605dbc6@linux.dev>
2023-05-24 11:22 ` [linus:master] [mm] f95bdb700b: stress-ng.ramfs.ops_per_sec -88.8% regression Qi Zheng
2023-05-24 11:56 ` Qi Zheng
2023-05-25 4:03 ` Qi Zheng
2023-05-27 11:14 ` Paul E. McKenney
2023-05-29 2:39 ` Qi Zheng
2023-05-29 12:51 ` Paul E. McKenney
2023-05-30 3:07 ` Qi Zheng
2023-06-01 0:57 ` Kirill Tkhai
2023-06-01 8:34 ` Qi Zheng
[not found] ` <932751685611907@mail.yandex.ru>
2023-06-01 10:44 ` Qi Zheng