Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Eric Dumazet <edumazet@google.com>
To: luoxuanqiang <xuanqiang.luo@linux.dev>
Cc: kuniyu@google.com, "Paul E. McKenney" <paulmck@kernel.org>,
	kerneljasonxing@gmail.com,  davem@davemloft.net, kuba@kernel.org,
	netdev@vger.kernel.org,  Xuanqiang Luo <luoxuanqiang@kylinos.cn>,
	Frederic Weisbecker <frederic@kernel.org>,
	 Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
Date: Tue, 14 Oct 2025 00:34:42 -0700	[thread overview]
Message-ID: <CANn89iJdXiE3b8x8vQtbOOi2DTC5P9bOO1HsnRwSPC8qQC--8g@mail.gmail.com> (raw)
In-Reply-To: <82a04cb1-9451-493c-9b1e-b4a34f2175cd@linux.dev>

On Tue, Oct 14, 2025 at 12:21 AM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>
>
> 在 2025/10/13 17:49, Eric Dumazet 写道:
> > On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
> >>
> >> 在 2025/10/13 15:31, Eric Dumazet 写道:
> >>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@linux.dev> wrote:
> >>>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >>>>
> >>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
> >>>>
> >>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
> >>>> mentioned in the patch below:
> >>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
> >>>> rculist_nulls")
> >>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
> >>>> hlist_nulls")
> >>>>
> >>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >>>> ---
> >>>>    include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
> >>>>    1 file changed, 59 insertions(+)
> >>>>
> >>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> >>>> index 89186c499dd4..c26cb83ca071 100644
> >>>> --- a/include/linux/rculist_nulls.h
> >>>> +++ b/include/linux/rculist_nulls.h
> >>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
> >>>>    #define hlist_nulls_next_rcu(node) \
> >>>>           (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
> >>>>
> >>>> +/**
> >>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
> >>>> + * @node: element of the list.
> >>>> + */
> >>>> +#define hlist_nulls_pprev_rcu(node) \
> >>>> +       (*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
> >>>> +
> >>>>    /**
> >>>>     * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
> >>>>     * @n: the element to delete from the hash list.
> >>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
> >>>>           n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
> >>>>    }
> >>>>
> >>>> +/**
> >>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
> >>>> + * @old: the element to be replaced
> >>>> + * @new: the new element to insert
> >>>> + *
> >>>> + * Description:
> >>>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
> >>>> + * permitting racing traversals.
> >>>> + *
> >>>> + * The caller must take whatever precautions are necessary (such as holding
> >>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
> >>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
> >>>> + * list.  However, it is perfectly legal to run concurrently with the _rcu
> >>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
> >>>> + */
> >>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
> >>>> +                                          struct hlist_nulls_node *new)
> >>>> +{
> >>>> +       struct hlist_nulls_node *next = old->next;
> >>>> +
> >>>> +       WRITE_ONCE(new->next, next);
> >>>> +       WRITE_ONCE(new->pprev, old->pprev);
> >>> I do not think these two WRITE_ONCE() are needed.
> >>>
> >>> At this point new is not yet visible.
> >>>
> >>> The following  rcu_assign_pointer() is enough to make sure prior
> >>> writes are committed to memory.
> >> Dear Eric,
> >>
> >> I’m quoting your more detailed explanation from the other patch [0], thank
> >> you for that!
> >>
> >> However, regarding new->next, if the new object is allocated with
> >> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
> >> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
> >> rculist_nulls”)?
> >>
> >> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
> >> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
> >> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
> > I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
> >
> > It is a bit weird to annotate the writes, but not the lockless reads,
> > even if apparently KCSAN
> > is okay with that.
> >
> Dear Eric,
>
> I’m sorry—I still haven’t fully grasped the scenario you mentioned where
> sk_unhashed()/sk_hashed() can be called from lock‑less contexts. It seems
> similar to the race described in commit 860c8802ace1 (“rcu: Use
> WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g.: [0].
>

inet_unhash() does a lockless sk_unhash(sk) call, while no lock is
held in some cases (look at tcp_done())

void inet_unhash(struct sock *sk)
{
struct inet_hashinfo *hashinfo = tcp_get_hashinfo(sk);

if (sk_unhashed(sk))    // Here no lock is held
    return;

Relevant lock (depending on (sk->sk_state == TCP_LISTEN)) is acquired
a few lines later.

Then

__sk_nulls_del_node_init_rcu() is called safely, while the bucket lock is held.




> Two CPUs invoke inet_unhash() from the tcp_retransmit_timer() path on the
> same sk, causing a race even though tcp_retransmit_timer() checks
> lockdep_sock_is_held(sk).
>
> How does this race happen? I can’t find more details to understand the
> situation, so any hints would be greatly appreciated.
>
> My simple understanding is that hlist_nulls_replace_rcu() might have the
> same call path as hlist_nulls_add_head_rcu(), so I keep using WRITE_ONCE().
>
> Finally, Kuniyuki Iwashima also raised a similar discussion in the v3
> series; here’s the link [1].
>
> [0]:
> ------------------------------------------------------------------------
>
> BUG: KCSAN: data-race in inet_unhash / inet_unhash
>
> write to 0xffff8880a69a0170 of 8 bytes by interrupt on cpu 1:
>   __hlist_nulls_del include/linux/list_nulls.h:88 [inline]
>   hlist_nulls_del_init_rcu include/linux/rculist_nulls.h:36 [inline]
>   __sk_nulls_del_node_init_rcu include/net/sock.h:676 [inline]
>   inet_unhash+0x38f/0x4a0 net/ipv4/inet_hashtables.c:612
>   tcp_set_state+0xfa/0x3e0 net/ipv4/tcp.c:2249
>   tcp_done+0x93/0x1e0 net/ipv4/tcp.c:3854
>   tcp_write_err+0x7e/0xc0 net/ipv4/tcp_timer.c:56
>   tcp_retransmit_timer+0x9b8/0x16d0 net/ipv4/tcp_timer.c:479
>   tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:599
>   tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:619
>   call_timer_fn+0x5f/0x2f0 kernel/time/timer.c:1404
>   expire_timers kernel/time/timer.c:1449 [inline]
>   __run_timers kernel/time/timer.c:1773 [inline]
>   __run_timers kernel/time/timer.c:1740 [inline]
>   run_timer_softirq+0xc0c/0xcd0 kernel/time/timer.c:1786
>   __do_softirq+0x115/0x33f kernel/softirq.c:292
>   invoke_softirq kernel/softirq.c:373 [inline]
>   irq_exit+0xbb/0xe0 kernel/softirq.c:413
>   exiting_irq arch/x86/include/asm/apic.h:536 [inline]
>   smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
>   apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
>   native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
>   arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
>   default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
>   cpuidle_idle_call kernel/sched/idle.c:154 [inline]
>   do_idle+0x1af/0x280 kernel/sched/idle.c:263
>   cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
>   start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
>   secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
>
> read to 0xffff8880a69a0170 of 8 bytes by interrupt on cpu 0:
>   sk_unhashed include/net/sock.h:607 [inline]
>   inet_unhash+0x3d/0x4a0 net/ipv4/inet_hashtables.c:592
>   tcp_set_state+0xfa/0x3e0 net/ipv4/tcp.c:2249
>   tcp_done+0x93/0x1e0 net/ipv4/tcp.c:3854
>   tcp_write_err+0x7e/0xc0 net/ipv4/tcp_timer.c:56
>   tcp_retransmit_timer+0x9b8/0x16d0 net/ipv4/tcp_timer.c:479
>   tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:599
>   tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:619
>   call_timer_fn+0x5f/0x2f0 kernel/time/timer.c:1404
>   expire_timers kernel/time/timer.c:1449 [inline]
>   __run_timers kernel/time/timer.c:1773 [inline]
>   __run_timers kernel/time/timer.c:1740 [inline]
>   run_timer_softirq+0xc0c/0xcd0 kernel/time/timer.c:1786
>   __do_softirq+0x115/0x33f kernel/softirq.c:292
>   invoke_softirq kernel/softirq.c:373 [inline]
>   irq_exit+0xbb/0xe0 kernel/softirq.c:413
>   exiting_irq arch/x86/include/asm/apic.h:536 [inline]
>   smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
>   apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
>   native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
>   arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
>   default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
>   cpuidle_idle_call kernel/sched/idle.c:154 [inline]
>   do_idle+0x1af/0x280 kernel/sched/idle.c:263
>   cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
>   rest_init+0xec/0xf6 init/main.c:452
>   arch_call_rest_init+0x17/0x37
>   start_kernel+0x838/0x85e init/main.c:786
>   x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
>   x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471
>   secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
>
> Reported by Kernel Concurrency Sanitizer on:
> CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc6+ #0
> Hardware name: Google Google Compute Engine/Google Compute Engine,
> BIOS Google 01/01/2011
>
> ------------------------------------------------------------------------
>
> [1]: https://lore.kernel.org/all/CAAVpQUCoCizxTm6wRs0+n6_kPK+kgxwszsYKNds3YvuBfBvrhg@mail.gmail.com/
>
> Thanks!
>

next prev parent reply	other threads:[~2025-10-14  7:34 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-26  7:40 [PATCH net-next v7 0/3] net: Avoid ehash lookup races xuanqiang.luo
2025-09-26  7:40 ` [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
2025-09-27 20:31   ` Kuniyuki Iwashima
2025-09-30  9:16   ` Paolo Abeni
2025-10-01 15:03     ` luoxuanqiang
2025-10-13  5:36     ` Jiayuan Chen
2025-10-13  6:26       ` Jason Xing
2025-10-13  7:04         ` luoxuanqiang
2025-10-13 12:08           ` Simon Horman
2025-10-14  2:29             ` luoxuanqiang
2025-10-01 12:19   ` Frederic Weisbecker
2025-10-13  7:31   ` Eric Dumazet
2025-10-13  8:25     ` luoxuanqiang
2025-10-13  9:49       ` Eric Dumazet
2025-10-14  7:20         ` luoxuanqiang
2025-10-14  7:34           ` Eric Dumazet [this message]
2025-10-14  8:04             ` luoxuanqiang
2025-10-14  8:09               ` Eric Dumazet
2025-10-14  8:40                 ` luoxuanqiang
2025-10-14 10:02                   ` Eric Dumazet
2025-10-14 11:40                     ` luoxuanqiang
2025-09-26  7:40 ` [PATCH net-next v7 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
2025-09-26  7:40 ` [PATCH net-next v7 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
2025-09-27  2:56 ` [PATCH net-next v7 0/3] net: Avoid ehash lookup races Jiayuan Chen

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CANn89iJdXiE3b8x8vQtbOOi2DTC5P9bOO1HsnRwSPC8qQC--8g@mail.gmail.com \
    --to=edumazet@google.com \
    --cc=davem@davemloft.net \
    --cc=frederic@kernel.org \
    --cc=kerneljasonxing@gmail.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@google.com \
    --cc=luoxuanqiang@kylinos.cn \
    --cc=neeraj.upadhyay@kernel.org \
    --cc=netdev@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=xuanqiang.luo@linux.dev \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).