Re: lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Guo Ren <guoren@kernel.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput())
Date: Wed, 29 Nov 2023 04:52:45 -0500	[thread overview]
Message-ID: <ZWcJ7QgPHP6EmLTy@gmail.com> (raw)
In-Reply-To: <CAHk-=wjErxMmQaPqS8tVr=4ZqYcvs5Xw3yyBdAG1oO6KcwV0Vg@mail.gmail.com>

On Sun, Nov 26, 2023 at 09:06:03AM -0800, Linus Torvalds wrote:
> On Sun, 26 Nov 2023 at 08:39, Guo Ren <guoren@kernel.org> wrote:
> >
> > Here, what I want to improve is to prevent stack frame setup in the fast
> > path, and that's the most benefit my patch could give out.
> 
> Side note: what patch do you have that avoids the stack frame setup?
> Because I still saw the stack frame even with the
> arch_spin_value_unlocked() fix and the improved code generation. The
> compiler still does
> 
>         addi    sp,sp,-32
>         sd      s0,16(sp)
>         sd      s1,8(sp)
>         sd      ra,24(sp)
>         addi    s0,sp,32
I found below affect you:

 #define CMPXCHG_LOOP(CODE, SUCCESS) do {                                       \
-       int retry = 100;                                                        \
        struct lockref old;                                                     \
        BUILD_BUG_ON(sizeof(old) != 8);                                         \
        old.lock_count = READ_ONCE(lockref->lock_count);                        \
@@ -21,11 +20,21 @@
                                                 new.lock_count))) {            \
                        SUCCESS;                                                \
                }                                                               \
-               if (!--retry)                                                   \
-                       break;                                                  \

Yes, The 'retry' count patch [1] hurts us.

[1]: https://lore.kernel.org/lkml/20190607072652.GA5522@hc/T/#m091df9dca68c27c28f8f69a72cae0e1361dba4fa

> 
> at the top of the function for me - not because of the (now fixed)
> lock value spilling, but just because it wants to save registers.
> 
> The reason seems to be that gcc isn't smart enough to delay the frame
> setup to the slow path where it then has to do the actual spinlock, so
> it has to generate a stack frame just for the return address and then
> it does the whole frame setup thing.
> 
> I was using just the risc-v defconfig (with the cmpxchg lockrefs
> enabled, and spinlock debugging disabled so that lockrefs actually do
> something), so there might be some other config thing like "force
> frame pointers" that then causes problems.
> 
> But while the current tree avoids the silly lock value spill and
> reload, and my patch improved the integer instruction selection, I
> really couldn't get rid of the stack frame entirely. The x86 code also
> ends up looking quite nice, although part of that is that the
> qspinlock test is a simple compare against zero:
> 
>   lockref_get:
>         pushq   %rbx
>         movq    %rdi, %rbx
>         movq    (%rdi), %rax
>         movl    $-100, %ecx
>         movabsq $4294967296, %rdx
>   .LBB0_1:
>         testl   %eax, %eax
>         jne     .LBB0_4
>         leaq    (%rax,%rdx), %rsi
>         lock
>         cmpxchgq        %rsi, (%rbx)
>         je      .LBB0_5
>         incl    %ecx
>         jne     .LBB0_1
>   .LBB0_4:
>         movq    %rbx, %rdi
>         callq   _raw_spin_lock
>         incl    4(%rbx)
>         movb    $0, (%rbx)
>   .LBB0_5:
>         popq    %rbx
>         retq
> 
> (That 'movabsq' thing is what generates the big constant that adds '1'
> in the upper word - that add is then done as a 'leaq').
> 
> In this case, the 'retry' count is actually a noticeable part of the
> code generation, and is probably also why it has to save/restore
> '%rbx'. Oh well. We limited the cmpxchg loop because of horrible
> issues with starvation on bad arm64 cores.  It turns out that SMP
> cacheline bouncing is hard, and if you haven't been doing it for a
> couple of decades, you'll do it wrong.
> 
> You'll find out the hard way that the same is probably true on any
> early RISC-V SMP setups. You wanting to use prefetchw is a pretty
> clear indication of the same kind of thing.

The 'retry' count is bad solution, which hides the problem. ThunderX2's
problem is mainly about unnecessary cpu_relax & cacheline sticky less.
In the AMBA 5 CHI spec "Home behavior" section says: [2]
"When a Home(CIU/LLcache) determines that an Exclusive Store transaction
has failed, the following rules must be followed: If the Requester has
lost the cache line, then the Home is expected to send SnpPreferUniqueFwd
or SnpPreferUnique to get a copy of the cache line."
The SnpPreferUnique is not SnpUnique, which means it would return a shared
cacheline in case of serious contention. No guarantee for the next cmpxchg.

But, we want a unique cache line right? You said: [1]
"... And once one CPU gets ownership of the line, it doesn't lose it
immediately, so the next cmpxchg will *succeed*.
So at most, the *first* cmpxchg will fail (because that's the one that
was fed not by a previous cmpxchg, but by a regular load (which we'd
*like* to do as a "load-for-ownership" load, but we don't have the
interfaces to do that). But the second cmpxchg should basically always
succeed, ..."
(Sorry, I quoted you like this.)

My argue is:
Why do we need to wait for cmpxchg failure? You have the
"load-for-ownership" interface: "prefetchw"!

   lockref_get:
         pushq   %rbx
  +------prefetchw (%rdi)    --------> doesn't lose it immediately,
st|				so the next cmpxchg will *succeed*
ic|						- Linus
ky|      movq    %rdi, %rbx
 t|      movq    (%rdi), %rax  ------> local acquire, comfortable!
im|      movl    $-100, %ecx
e |      movabsq $4294967296, %rdx
  |.LBB0_1:
  |      testl   %eax, %eax
  |      jne     .LBB0_4
  |      leaq    (%rax,%rdx), %rsi
  |      lock
  |      cmpxchgq        %rsi, (%rbx) --> local cas, success!
  +----- je      .LBB0_5          ------> Farewell to the slowpath!

If x86 is not a crap machine, "movq+movq+movl+movabsq+testl+jne+leak+
cmpxchg" should be fast enough to satisfy the sticky time.

The prefetchw primitive has been defined in include/linux/prefetch.h
for many years.

The prefetchw has been used for generic code: 
➜  linux git:(master) ✗ grep prefetchw mm/ fs/ kernel/ -r
mm/slub.c:      prefetchw(object + s->offset);
mm/slab.c:      prefetchw(objp);
mm/page_alloc.c:        prefetchw(p);
mm/page_alloc.c:                prefetchw(p + 1);
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field)                     \
mm/vmscan.c:                    prefetchw(&prev->_field);                       \
mm/vmscan.c:#define prefetchw_prev_lru_folio(_folio, _base, _field) do { } while (0)
mm/vmscan.c:            prefetchw_prev_lru_folio(folio, src, flags);
fs/mpage.c:             prefetchw(&folio->flags);
fs/f2fs/data.c:                 prefetchw(&page->flags);
fs/ext4/readpage.c:             prefetchw(&folio->flags);
kernel/bpf/cpumap.c:                    prefetchw(page);
kernel/locking/qspinlock.c:                     prefetchw(next);
➜  linux git:(master) ✗ grep prefetchw drivers/ -r | wc -l
80

The prefetchw is okay for all good hardware. Not like the 'retry' one.

[1]: https://lore.kernel.org/lkml/CAHk-=wiEahkwDXpoy=-SzJHNMRXKVSjPa870+eKKenufhO_Hgw@mail.gmail.com/raw
[2]: https://kolegite.com/EE_library/datasheets_and_manuals/FPGA/AMBA/IHI0050E_a_amba_5_chi_architecture_spec.pdf

> 
>              Linus
>

next prev parent reply	other threads:[~2023-11-29  9:52 UTC|newest]

Thread overview: 119+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-30  0:37 [RFC] simplifying fast_dput(), dentry_kill() et.al Al Viro
2023-10-30 21:53 ` Al Viro
2023-10-30 22:18   ` Linus Torvalds
2023-10-31  0:18     ` Al Viro
2023-10-31  1:53       ` Al Viro
2023-10-31  6:12         ` Al Viro
2023-11-01  6:18           ` Al Viro
2023-11-01  6:20           ` [PATCH 01/15] fast_dput(): having ->d_delete() is not reason to delay refcount decrement Al Viro
2023-11-01  6:20             ` [PATCH 02/15] fast_dput(): handle underflows gracefully Al Viro
2023-11-01  6:20             ` [PATCH 03/15] fast_dput(): new rules for refcount Al Viro
2023-11-01  6:20             ` [PATCH 04/15] __dput_to_list(): do decrement of refcount in the caller Al Viro
2023-11-01  6:20             ` [PATCH 05/15] retain_dentry(): lift decrement of ->d_count into callers Al Viro
2023-11-01  6:20             ` [PATCH 06/15] __dentry_kill(): get consistent rules for ->d_count Al Viro
2023-11-01  6:20             ` [PATCH 07/15] dentry_kill(): don't bother with retain_dentry() on slow path Al Viro
2023-11-01  6:20             ` [PATCH 08/15] Call retain_dentry() with refcount 0 Al Viro
2023-11-01  6:20             ` [PATCH 09/15] fold the call of retain_dentry() into fast_dput() Al Viro
2023-11-01  8:45               ` Al Viro
2023-11-01 17:30                 ` Linus Torvalds
2023-11-01 18:19                   ` Al Viro
2023-11-10  4:20                     ` lockless case of retain_dentry() (was Re: [PATCH 09/15] fold the call of retain_dentry() into fast_dput()) Al Viro
2023-11-10  5:57                       ` Linus Torvalds
2023-11-10  6:22                         ` Linus Torvalds
2023-11-22  6:29                           ` Guo Ren
2023-11-10  8:19                         ` Al Viro
2023-11-22  7:19                         ` Guo Ren
2023-11-22 17:20                           ` Linus Torvalds
2023-11-22 17:52                             ` Linus Torvalds
2023-11-22 18:05                               ` Linus Torvalds
2023-11-22 19:11                               ` Linus Torvalds
2023-11-29  7:14                                 ` Guo Ren
2023-11-29 12:25                                 ` Guo Ren
2023-11-29 14:42                                   ` Linus Torvalds
2023-11-26 16:39                             ` Guo Ren
2023-11-26 16:51                               ` Linus Torvalds
2023-11-30 10:00                                 ` Guo Ren
2023-12-01  1:09                                   ` Linus Torvalds
2023-12-01  3:36                                     ` Guo Ren
2023-12-01  5:15                                       ` Linus Torvalds
2023-12-01  7:31                                         ` Guo Ren
2023-11-26 16:51                               ` Guo Ren
2023-11-26 17:06                               ` Linus Torvalds
2023-11-26 17:59                                 ` Linus Torvalds
2023-11-29  9:52                                 ` Guo Ren [this message]
2023-11-01  6:20             ` [PATCH 10/15] don't try to cut corners in shrink_lock_dentry() Al Viro
2023-11-01  6:21             ` [PATCH 11/15] fold dentry_kill() into dput() Al Viro
2023-11-01  6:21             ` [PATCH 12/15] get rid of __dget() Al Viro
2023-11-01  6:21             ` [PATCH 13/15] shrink_dentry_list(): no need to check that dentry refcount is marked dead Al Viro
2023-11-01  6:21             ` [PATCH 14/15] to_shrink_list(): call only if refcount is 0 Al Viro
2023-11-01  6:21             ` [PATCH 15/15] switch select_collect{,2}() to use of to_shrink_list() Al Viro
2023-11-01  2:22       ` [RFC] simplifying fast_dput(), dentry_kill() et.al Al Viro
2023-11-01 14:29         ` Benjamin Coddington
2023-11-05 19:54       ` Al Viro
2023-11-05 21:59         ` Al Viro
2023-11-06  5:53         ` Al Viro
2023-11-07  2:08           ` Al Viro
2023-11-09  6:19             ` [RFC][PATCHSET v2] " Al Viro
2023-11-09  6:20               ` [PATCH 01/22] struct dentry: get rid of randomize_layout idiocy Al Viro
2023-11-09  6:20                 ` [PATCH 02/22] switch nfsd_client_rmdir() to use of simple_recursive_removal() Al Viro
2023-11-09 13:42                   ` Christian Brauner
2023-11-09 14:01                   ` Chuck Lever
2023-11-09 18:47                     ` Al Viro
2023-11-09 18:50                       ` Chuck Lever III
2023-11-09  6:20                 ` [PATCH 03/22] coda_flag_children(): cope with dentries turning negative Al Viro
2023-11-09 13:43                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 04/22] dentry: switch the lists of children to hlist Al Viro
2023-11-09 13:48                   ` Christian Brauner
2023-11-09 19:32                     ` Al Viro
2023-11-09  6:20                 ` [PATCH 05/22] centralize killing dentry from shrink list Al Viro
2023-11-09 13:49                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 06/22] get rid of __dget() Al Viro
2023-11-09 13:50                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 07/22] shrink_dentry_list(): no need to check that dentry refcount is marked dead Al Viro
2023-11-09 13:53                   ` Christian Brauner
2023-11-09 20:28                     ` Al Viro
2023-11-09  6:20                 ` [PATCH 08/22] fast_dput(): having ->d_delete() is not reason to delay refcount decrement Al Viro
2023-11-09 13:58                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 09/22] fast_dput(): handle underflows gracefully Al Viro
2023-11-09 14:46                   ` Christian Brauner
2023-11-09 20:39                     ` Al Viro
2023-11-09  6:20                 ` [PATCH 10/22] fast_dput(): new rules for refcount Al Viro
2023-11-09 14:54                   ` Christian Brauner
2023-11-09 20:52                     ` Al Viro
2023-11-09  6:20                 ` [PATCH 11/22] __dput_to_list(): do decrement of refcount in the callers Al Viro
2023-11-09 15:21                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 12/22] Make retain_dentry() neutral with respect to refcounting Al Viro
2023-11-09 15:22                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 13/22] __dentry_kill(): get consistent rules for victim's refcount Al Viro
2023-11-09 15:27                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 14/22] dentry_kill(): don't bother with retain_dentry() on slow path Al Viro
2023-11-09 15:53                   ` Christian Brauner
2023-11-09 21:29                     ` Al Viro
2023-11-09  6:20                 ` [PATCH 15/22] Call retain_dentry() with refcount 0 Al Viro
2023-11-09 16:09                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 16/22] fold the call of retain_dentry() into fast_dput() Al Viro
2023-11-09 16:17                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 17/22] don't try to cut corners in shrink_lock_dentry() Al Viro
2023-11-09 17:20                   ` Christian Brauner
2023-11-09 21:45                     ` Al Viro
2023-11-10  9:07                       ` Christian Brauner
2023-11-09 17:39                   ` Linus Torvalds
2023-11-09 18:11                     ` Linus Torvalds
2023-11-09 18:20                     ` Al Viro
2023-11-09  6:20                 ` [PATCH 18/22] fold dentry_kill() into dput() Al Viro
2023-11-09 17:22                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 19/22] to_shrink_list(): call only if refcount is 0 Al Viro
2023-11-09 17:29                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 20/22] switch select_collect{,2}() to use of to_shrink_list() Al Viro
2023-11-09 17:31                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 21/22] d_prune_aliases(): use a shrink list Al Viro
2023-11-09 17:33                   ` Christian Brauner
2023-11-09  6:20                 ` [PATCH 22/22] __dentry_kill(): new locking scheme Al Viro
2023-11-10 13:34                   ` Christian Brauner
2023-11-09 13:33                 ` [PATCH 01/22] struct dentry: get rid of randomize_layout idiocy Christian Brauner
2023-10-31  2:25     ` [RFC] simplifying fast_dput(), dentry_kill() et.al Gao Xiang
2023-10-31  2:29       ` Gao Xiang
2023-10-31  3:02       ` Linus Torvalds
2023-10-31  3:13         ` Gao Xiang
2023-10-31  3:26         ` Al Viro
2023-10-31  3:41           ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZWcJ7QgPHP6EmLTy@gmail.com \
    --to=guoren@kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=peterz@infradead.org \
    --cc=torvalds@linux-foundation.org \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.