From: luoxuanqiang <xuanqiang.luo@linux.dev>
To: Eric Dumazet <edumazet@google.com>
Cc: kuniyu@google.com, "Paul E. McKenney" <paulmck@kernel.org>,
kerneljasonxing@gmail.com, davem@davemloft.net, kuba@kernel.org,
netdev@vger.kernel.org, Xuanqiang Luo <luoxuanqiang@kylinos.cn>,
Frederic Weisbecker <frederic@kernel.org>,
Neeraj Upadhyay <neeraj.upadhyay@kernel.org>
Subject: Re: [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
Date: Tue, 14 Oct 2025 15:20:33 +0800 [thread overview]
Message-ID: <82a04cb1-9451-493c-9b1e-b4a34f2175cd@linux.dev> (raw)
In-Reply-To: <CANn89iLQMVms1GF_oY1WSCtmxLZaBJrTKaeHnwRo5p9uzFwnVw@mail.gmail.com>
在 2025/10/13 17:49, Eric Dumazet 写道:
> On Mon, Oct 13, 2025 at 1:26 AM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>>
>> 在 2025/10/13 15:31, Eric Dumazet 写道:
>>> On Fri, Sep 26, 2025 at 12:41 AM <xuanqiang.luo@linux.dev> wrote:
>>>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>>
>>>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>>>
>>>> Keep using WRITE_ONCE() to assign values to ->next and ->pprev, as
>>>> mentioned in the patch below:
>>>> commit efd04f8a8b45 ("rcu: Use WRITE_ONCE() for assignments to ->next for
>>>> rculist_nulls")
>>>> commit 860c8802ace1 ("rcu: Use WRITE_ONCE() for assignments to ->pprev for
>>>> hlist_nulls")
>>>>
>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>> ---
>>>> include/linux/rculist_nulls.h | 59 +++++++++++++++++++++++++++++++++++
>>>> 1 file changed, 59 insertions(+)
>>>>
>>>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>>>> index 89186c499dd4..c26cb83ca071 100644
>>>> --- a/include/linux/rculist_nulls.h
>>>> +++ b/include/linux/rculist_nulls.h
>>>> @@ -52,6 +52,13 @@ static inline void hlist_nulls_del_init_rcu(struct hlist_nulls_node *n)
>>>> #define hlist_nulls_next_rcu(node) \
>>>> (*((struct hlist_nulls_node __rcu __force **)&(node)->next))
>>>>
>>>> +/**
>>>> + * hlist_nulls_pprev_rcu - returns the dereferenced pprev of @node.
>>>> + * @node: element of the list.
>>>> + */
>>>> +#define hlist_nulls_pprev_rcu(node) \
>>>> + (*((struct hlist_nulls_node __rcu __force **)(node)->pprev))
>>>> +
>>>> /**
>>>> * hlist_nulls_del_rcu - deletes entry from hash list without re-initialization
>>>> * @n: the element to delete from the hash list.
>>>> @@ -152,6 +159,58 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>>> n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>>> }
>>>>
>>>> +/**
>>>> + * hlist_nulls_replace_rcu - replace an old entry by a new one
>>>> + * @old: the element to be replaced
>>>> + * @new: the new element to insert
>>>> + *
>>>> + * Description:
>>>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
>>>> + * permitting racing traversals.
>>>> + *
>>>> + * The caller must take whatever precautions are necessary (such as holding
>>>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>>>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>>>> + * list. However, it is perfectly legal to run concurrently with the _rcu
>>>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>>>> + */
>>>> +static inline void hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>>>> + struct hlist_nulls_node *new)
>>>> +{
>>>> + struct hlist_nulls_node *next = old->next;
>>>> +
>>>> + WRITE_ONCE(new->next, next);
>>>> + WRITE_ONCE(new->pprev, old->pprev);
>>> I do not think these two WRITE_ONCE() are needed.
>>>
>>> At this point new is not yet visible.
>>>
>>> The following rcu_assign_pointer() is enough to make sure prior
>>> writes are committed to memory.
>> Dear Eric,
>>
>> I’m quoting your more detailed explanation from the other patch [0], thank
>> you for that!
>>
>> However, regarding new->next, if the new object is allocated with
>> SLAB_TYPESAFE_BY_RCU, would we still encounter the same issue as in commit
>> efd04f8a8b45 (“rcu: Use WRITE_ONCE() for assignments to ->next for
>> rculist_nulls”)?
>>
>> Also, for the WRITE_ONCE() assignments to ->pprev introduced in commit
>> 860c8802ace1 (“rcu: Use WRITE_ONCE() for assignments to ->pprev for
>> hlist_nulls”) within hlist_nulls_add_head_rcu(), is that also unnecessary?
> I forgot sk_unhashed()/sk_hashed() could be called from lockless contexts.
>
> It is a bit weird to annotate the writes, but not the lockless reads,
> even if apparently KCSAN
> is okay with that.
>
Dear Eric,
I’m sorry—I still haven’t fully grasped the scenario you mentioned where
sk_unhashed()/sk_hashed() can be called from lock‑less contexts. It seems
similar to the race described in commit 860c8802ace1 (“rcu: Use
WRITE_ONCE() for assignments to ->pprev for hlist_nulls”), e.g.: [0].
Two CPUs invoke inet_unhash() from the tcp_retransmit_timer() path on the
same sk, causing a race even though tcp_retransmit_timer() checks
lockdep_sock_is_held(sk).
How does this race happen? I can’t find more details to understand the
situation, so any hints would be greatly appreciated.
My simple understanding is that hlist_nulls_replace_rcu() might have the
same call path as hlist_nulls_add_head_rcu(), so I keep using WRITE_ONCE().
Finally, Kuniyuki Iwashima also raised a similar discussion in the v3
series; here’s the link [1].
[0]:
------------------------------------------------------------------------
BUG: KCSAN: data-race in inet_unhash / inet_unhash
write to 0xffff8880a69a0170 of 8 bytes by interrupt on cpu 1:
__hlist_nulls_del include/linux/list_nulls.h:88 [inline]
hlist_nulls_del_init_rcu include/linux/rculist_nulls.h:36 [inline]
__sk_nulls_del_node_init_rcu include/net/sock.h:676 [inline]
inet_unhash+0x38f/0x4a0 net/ipv4/inet_hashtables.c:612
tcp_set_state+0xfa/0x3e0 net/ipv4/tcp.c:2249
tcp_done+0x93/0x1e0 net/ipv4/tcp.c:3854
tcp_write_err+0x7e/0xc0 net/ipv4/tcp_timer.c:56
tcp_retransmit_timer+0x9b8/0x16d0 net/ipv4/tcp_timer.c:479
tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:599
tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:619
call_timer_fn+0x5f/0x2f0 kernel/time/timer.c:1404
expire_timers kernel/time/timer.c:1449 [inline]
__run_timers kernel/time/timer.c:1773 [inline]
__run_timers kernel/time/timer.c:1740 [inline]
run_timer_softirq+0xc0c/0xcd0 kernel/time/timer.c:1786
__do_softirq+0x115/0x33f kernel/softirq.c:292
invoke_softirq kernel/softirq.c:373 [inline]
irq_exit+0xbb/0xe0 kernel/softirq.c:413
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
cpuidle_idle_call kernel/sched/idle.c:154 [inline]
do_idle+0x1af/0x280 kernel/sched/idle.c:263
cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
start_secondary+0x208/0x260 arch/x86/kernel/smpboot.c:264
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
read to 0xffff8880a69a0170 of 8 bytes by interrupt on cpu 0:
sk_unhashed include/net/sock.h:607 [inline]
inet_unhash+0x3d/0x4a0 net/ipv4/inet_hashtables.c:592
tcp_set_state+0xfa/0x3e0 net/ipv4/tcp.c:2249
tcp_done+0x93/0x1e0 net/ipv4/tcp.c:3854
tcp_write_err+0x7e/0xc0 net/ipv4/tcp_timer.c:56
tcp_retransmit_timer+0x9b8/0x16d0 net/ipv4/tcp_timer.c:479
tcp_write_timer_handler+0x42d/0x510 net/ipv4/tcp_timer.c:599
tcp_write_timer+0xd1/0xf0 net/ipv4/tcp_timer.c:619
call_timer_fn+0x5f/0x2f0 kernel/time/timer.c:1404
expire_timers kernel/time/timer.c:1449 [inline]
__run_timers kernel/time/timer.c:1773 [inline]
__run_timers kernel/time/timer.c:1740 [inline]
run_timer_softirq+0xc0c/0xcd0 kernel/time/timer.c:1786
__do_softirq+0x115/0x33f kernel/softirq.c:292
invoke_softirq kernel/softirq.c:373 [inline]
irq_exit+0xbb/0xe0 kernel/softirq.c:413
exiting_irq arch/x86/include/asm/apic.h:536 [inline]
smp_apic_timer_interrupt+0xe6/0x280 arch/x86/kernel/apic/apic.c:1137
apic_timer_interrupt+0xf/0x20 arch/x86/entry/entry_64.S:830
native_safe_halt+0xe/0x10 arch/x86/kernel/paravirt.c:71
arch_cpu_idle+0x1f/0x30 arch/x86/kernel/process.c:571
default_idle_call+0x1e/0x40 kernel/sched/idle.c:94
cpuidle_idle_call kernel/sched/idle.c:154 [inline]
do_idle+0x1af/0x280 kernel/sched/idle.c:263
cpu_startup_entry+0x1b/0x20 kernel/sched/idle.c:355
rest_init+0xec/0xf6 init/main.c:452
arch_call_rest_init+0x17/0x37
start_kernel+0x838/0x85e init/main.c:786
x86_64_start_reservations+0x29/0x2b arch/x86/kernel/head64.c:490
x86_64_start_kernel+0x72/0x76 arch/x86/kernel/head64.c:471
secondary_startup_64+0xa4/0xb0 arch/x86/kernel/head_64.S:241
Reported by Kernel Concurrency Sanitizer on:
CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.4.0-rc6+ #0
Hardware name: Google Google Compute Engine/Google Compute Engine,
BIOS Google 01/01/2011
------------------------------------------------------------------------
[1]: https://lore.kernel.org/all/CAAVpQUCoCizxTm6wRs0+n6_kPK+kgxwszsYKNds3YvuBfBvrhg@mail.gmail.com/
Thanks!
next prev parent reply other threads:[~2025-10-14 7:21 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-26 7:40 [PATCH net-next v7 0/3] net: Avoid ehash lookup races xuanqiang.luo
2025-09-26 7:40 ` [PATCH net-next v7 1/3] rculist: Add hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
2025-09-27 20:31 ` Kuniyuki Iwashima
2025-09-30 9:16 ` Paolo Abeni
2025-10-01 15:03 ` luoxuanqiang
2025-10-13 5:36 ` Jiayuan Chen
2025-10-13 6:26 ` Jason Xing
2025-10-13 7:04 ` luoxuanqiang
2025-10-13 12:08 ` Simon Horman
2025-10-14 2:29 ` luoxuanqiang
2025-10-01 12:19 ` Frederic Weisbecker
2025-10-13 7:31 ` Eric Dumazet
2025-10-13 8:25 ` luoxuanqiang
2025-10-13 9:49 ` Eric Dumazet
2025-10-14 7:20 ` luoxuanqiang [this message]
2025-10-14 7:34 ` Eric Dumazet
2025-10-14 8:04 ` luoxuanqiang
2025-10-14 8:09 ` Eric Dumazet
2025-10-14 8:40 ` luoxuanqiang
2025-10-14 10:02 ` Eric Dumazet
2025-10-14 11:40 ` luoxuanqiang
2025-09-26 7:40 ` [PATCH net-next v7 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
2025-09-26 7:40 ` [PATCH net-next v7 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
2025-09-27 2:56 ` [PATCH net-next v7 0/3] net: Avoid ehash lookup races Jiayuan Chen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=82a04cb1-9451-493c-9b1e-b4a34f2175cd@linux.dev \
--to=xuanqiang.luo@linux.dev \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=frederic@kernel.org \
--cc=kerneljasonxing@gmail.com \
--cc=kuba@kernel.org \
--cc=kuniyu@google.com \
--cc=luoxuanqiang@kylinos.cn \
--cc=neeraj.upadhyay@kernel.org \
--cc=netdev@vger.kernel.org \
--cc=paulmck@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).