* [PATCH bpf-next 0/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
@ 2026-01-19 14:21 Leon Hwang
2026-01-19 14:21 ` [PATCH bpf-next 1/3] bpf: Factor out bpf_lru_node_set_hash() helper Leon Hwang
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Leon Hwang @ 2026-01-19 14:21 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Leon Hwang, linux-kernel, linux-kselftest,
kernel-patches-bot
Switch the free-node pop paths to raw_spin_trylock*() so callers don't block
on contended LRU locks. This is a narrower change than Menglong's approach [1],
which aimed to eliminate the deadlock entirely.
The trylock-based approach avoids deadlocks in long-lived critical
sections, while still allowing locking in short-lived ones. Although it
does not completely eliminate the possibility of deadlock, it
significantly reduces the likelihood in practice.
LRU-related deadlocks have been observed multiple times, including:
- [syzbot] [bpf?] possible deadlock in bpf_lru_push_free (2) [2]
- Re: [PATCH bpf v3 0/4] bpf: Free special fields when update hash and local storage maps [3]
- Raw log of CI failure [4]
BTW, this series also factors out the bpf_lru_node_set_hash() helper, along with
a comment describing the required ordering and locking constraints.
Links:
[1] https://lore.kernel.org/bpf/20251030030010.95352-1-dongml2@chinatelecom.cn/
[2] https://lore.kernel.org/bpf/69155df5.a70a0220.3124cb.0018.GAE@google.com/
[3] https://lore.kernel.org/bpf/CAEf4BzbTJCUx0D=zjx6+5m5iiGhwLzaP94hnw36ZMDHAf4-U_w@mail.gmail.com/
[4] https://github.com/kernel-patches/bpf/actions/runs/20943173932/job/60181505085
Leon Hwang (3):
bpf: Factor out bpf_lru_node_set_hash() helper
bpf: Avoid deadlock using trylock when popping LRU free nodes
selftests/bpf: Allow -ENOMEM on LRU map updates
kernel/bpf/bpf_lru_list.c | 35 ++++++++++++++-----
.../bpf/map_tests/map_percpu_stats.c | 3 +-
2 files changed, 28 insertions(+), 10 deletions(-)
--
2.52.0
^ permalink raw reply [flat|nested] 11+ messages in thread
* [PATCH bpf-next 1/3] bpf: Factor out bpf_lru_node_set_hash() helper
2026-01-19 14:21 [PATCH bpf-next 0/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
@ 2026-01-19 14:21 ` Leon Hwang
2026-01-19 14:21 ` [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
2026-01-19 14:21 ` [PATCH bpf-next 3/3] selftests/bpf: Allow -ENOMEM on LRU map updates Leon Hwang
2 siblings, 0 replies; 11+ messages in thread
From: Leon Hwang @ 2026-01-19 14:21 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Leon Hwang, linux-kernel, linux-kselftest,
kernel-patches-bot
The hash field is not used directly by the LRU list itself; it is
consumed by the 'del_from_htab' callback when removing entries from
the hash map.
The hash initialization must be performed under the LRU lock to avoid
a race where a popped LRU node is evicted and deleted from the hash
map with an uninitialized hash value, as would happen if the hash
setting were deferred to hashtab.c::prealloc_lru_pop().
Factor out a dedicated bpf_lru_node_set_hash() helper and document
this requirement to make the ordering and locking constraints explicit.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
kernel/bpf/bpf_lru_list.c | 18 ++++++++++++++++--
1 file changed, 16 insertions(+), 2 deletions(-)
diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
index e7a2fc60523f..c091f3232cc5 100644
--- a/kernel/bpf/bpf_lru_list.c
+++ b/kernel/bpf/bpf_lru_list.c
@@ -341,13 +341,27 @@ static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
raw_spin_unlock(&l->lock);
}
+/*
+ * The hash field is consumed by the 'del_from_htab' callback rather than
+ * the LRU list itself. Initialize it while holding the LRU lock to avoid
+ * a race where a popped LRU node is evicted and removed from the hash map
+ * with an uninitialized hash value, as would happen if the hash setting
+ * were deferred to hashtab.c::prealloc_lru_pop().
+ */
+static void bpf_lru_node_set_hash(struct bpf_lru *lru,
+ struct bpf_lru_node *node,
+ u32 hash)
+{
+ *(u32 *)((void *)node + lru->hash_offset) = hash;
+}
+
static void __local_list_add_pending(struct bpf_lru *lru,
struct bpf_lru_locallist *loc_l,
int cpu,
struct bpf_lru_node *node,
u32 hash)
{
- *(u32 *)((void *)node + lru->hash_offset) = hash;
+ bpf_lru_node_set_hash(lru, node, hash);
node->cpu = cpu;
node->type = BPF_LRU_LOCAL_LIST_T_PENDING;
bpf_lru_node_clear_ref(node);
@@ -415,7 +429,7 @@ static struct bpf_lru_node *bpf_percpu_lru_pop_free(struct bpf_lru *lru,
if (!list_empty(free_list)) {
node = list_first_entry(free_list, struct bpf_lru_node, list);
- *(u32 *)((void *)node + lru->hash_offset) = hash;
+ bpf_lru_node_set_hash(lru, node, hash);
bpf_lru_node_clear_ref(node);
__bpf_lru_node_move(l, node, BPF_LRU_LIST_T_INACTIVE);
}
--
2.52.0
* [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-19 14:21 [PATCH bpf-next 0/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
2026-01-19 14:21 ` [PATCH bpf-next 1/3] bpf: Factor out bpf_lru_node_set_hash() helper Leon Hwang
@ 2026-01-19 14:21 ` Leon Hwang
2026-01-19 18:46 ` bot+bpf-ci
2026-01-19 19:47 ` Daniel Borkmann
2026-01-19 14:21 ` [PATCH bpf-next 3/3] selftests/bpf: Allow -ENOMEM on LRU map updates Leon Hwang
2 siblings, 2 replies; 11+ messages in thread
From: Leon Hwang @ 2026-01-19 14:21 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Leon Hwang, linux-kernel, linux-kselftest,
kernel-patches-bot
Switch the free-node pop paths to raw_spin_trylock*() to avoid blocking
on contended LRU locks.
If the global or per-CPU LRU lock is unavailable, refuse to refill the
local free list and return NULL instead. This allows callers to back
off safely rather than blocking or re-entering the same lock context.
This change avoids lockdep warnings and potential deadlocks caused by
re-entrant LRU lock acquisition from NMI context, as shown below:
[ 418.260323] bpf_testmod: oh no, recursing into test_1, recursion_misses 1
[ 424.982207] ================================
[ 424.982216] WARNING: inconsistent lock state
[ 424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
[ 424.982314] *** DEADLOCK ***
[...]
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
kernel/bpf/bpf_lru_list.c | 17 ++++++++++-------
1 file changed, 10 insertions(+), 7 deletions(-)
diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
index c091f3232cc5..03d37f72731a 100644
--- a/kernel/bpf/bpf_lru_list.c
+++ b/kernel/bpf/bpf_lru_list.c
@@ -312,14 +312,15 @@ static void bpf_lru_list_push_free(struct bpf_lru_list *l,
raw_spin_unlock_irqrestore(&l->lock, flags);
}
-static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
+static bool bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
struct bpf_lru_locallist *loc_l)
{
struct bpf_lru_list *l = &lru->common_lru.lru_list;
struct bpf_lru_node *node, *tmp_node;
unsigned int nfree = 0;
- raw_spin_lock(&l->lock);
+ if (!raw_spin_trylock(&l->lock))
+ return false;
__local_list_flush(l, loc_l);
@@ -339,6 +340,8 @@ static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
BPF_LRU_LOCAL_LIST_T_FREE);
raw_spin_unlock(&l->lock);
+
+ return true;
}
/*
@@ -418,7 +421,8 @@ static struct bpf_lru_node *bpf_percpu_lru_pop_free(struct bpf_lru *lru,
l = per_cpu_ptr(lru->percpu_lru, cpu);
- raw_spin_lock_irqsave(&l->lock, flags);
+ if (!raw_spin_trylock_irqsave(&l->lock, flags))
+ return NULL;
__bpf_lru_list_rotate(lru, l);
@@ -451,13 +455,12 @@ static struct bpf_lru_node *bpf_common_lru_pop_free(struct bpf_lru *lru,
loc_l = per_cpu_ptr(clru->local_list, cpu);
- raw_spin_lock_irqsave(&loc_l->lock, flags);
+ if (!raw_spin_trylock_irqsave(&loc_l->lock, flags))
+ return NULL;
node = __local_list_pop_free(loc_l);
- if (!node) {
- bpf_lru_list_pop_free_to_local(lru, loc_l);
+ if (!node && bpf_lru_list_pop_free_to_local(lru, loc_l))
node = __local_list_pop_free(loc_l);
- }
if (node)
__local_list_add_pending(lru, loc_l, cpu, node, hash);
--
2.52.0
* [PATCH bpf-next 3/3] selftests/bpf: Allow -ENOMEM on LRU map updates
2026-01-19 14:21 [PATCH bpf-next 0/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
2026-01-19 14:21 ` [PATCH bpf-next 1/3] bpf: Factor out bpf_lru_node_set_hash() helper Leon Hwang
2026-01-19 14:21 ` [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
@ 2026-01-19 14:21 ` Leon Hwang
2 siblings, 0 replies; 11+ messages in thread
From: Leon Hwang @ 2026-01-19 14:21 UTC (permalink / raw)
To: bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Daniel Borkmann,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, Leon Hwang, linux-kernel, linux-kselftest,
kernel-patches-bot
LRU hash map updates may legitimately return -ENOMEM. Relax the percpu
stats selftest to accept that error for LRU map types so it matches the
map's expected behavior.
Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
---
tools/testing/selftests/bpf/map_tests/map_percpu_stats.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/bpf/map_tests/map_percpu_stats.c b/tools/testing/selftests/bpf/map_tests/map_percpu_stats.c
index 1c7c04288eff..d510a9c54978 100644
--- a/tools/testing/selftests/bpf/map_tests/map_percpu_stats.c
+++ b/tools/testing/selftests/bpf/map_tests/map_percpu_stats.c
@@ -188,7 +188,8 @@ static void *patch_map_thread(void *arg)
40, retry_for_nomem_fn);
else
ret = bpf_map_update_elem(opts->map_fd, &i, val_ptr, 0);
- CHECK(ret < 0, "bpf_map_update_elem", "key=%d error: %s\n", i, strerror(errno));
+ CHECK(ret < 0 && (!is_lru(opts->map_type) || ret != -ENOMEM),
+ "bpf_map_update_elem", "key=%d error: %s\n", i, strerror(errno));
if (opts->map_type == BPF_MAP_TYPE_HASH_OF_MAPS)
close(val);
--
2.52.0
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-19 14:21 ` [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
@ 2026-01-19 18:46 ` bot+bpf-ci
2026-01-20 1:56 ` Leon Hwang
2026-01-19 19:47 ` Daniel Borkmann
1 sibling, 1 reply; 11+ messages in thread
From: bot+bpf-ci @ 2026-01-19 18:46 UTC (permalink / raw)
To: leon.hwang, bpf
Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, leon.hwang,
linux-kernel, linux-kselftest, kernel-patches-bot, ast, andrii,
daniel, martin.lau, eddyz87, yonghong.song, clm, ihor.solodrai
> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
> index c091f3232cc5..03d37f72731a 100644
> --- a/kernel/bpf/bpf_lru_list.c
> +++ b/kernel/bpf/bpf_lru_list.c
[ ... ]
> @@ -451,13 +455,12 @@ static struct bpf_lru_node *bpf_common_lru_pop_free(struct bpf_lru *lru,
>
> loc_l = per_cpu_ptr(clru->local_list, cpu);
>
> - raw_spin_lock_irqsave(&loc_l->lock, flags);
> + if (!raw_spin_trylock_irqsave(&loc_l->lock, flags))
> + return NULL;
>
> node = __local_list_pop_free(loc_l);
> - if (!node) {
> - bpf_lru_list_pop_free_to_local(lru, loc_l);
> + if (!node && bpf_lru_list_pop_free_to_local(lru, loc_l))
> node = __local_list_pop_free(loc_l);
> - }
>
> if (node)
> __local_list_add_pending(lru, loc_l, cpu, node, hash);
The trylock conversion here looks correct, but what about the steal path
that follows later in bpf_common_lru_pop_free()? Looking at the existing
code (not shown in this diff), the steal loop still uses unconditional
raw_spin_lock_irqsave():
do {
steal_loc_l = per_cpu_ptr(clru->local_list, steal);
raw_spin_lock_irqsave(&steal_loc_l->lock, flags);
...
} while (...);
If the goal is to avoid NMI-context deadlocks when acquiring LRU locks,
can the same deadlock scenario occur when NMI interrupts during the steal
loop and the NMI handler tries to acquire the same steal_loc_l->lock?
Similarly, after a successful steal, there is another unconditional lock:
if (node) {
raw_spin_lock_irqsave(&loc_l->lock, flags);
__local_list_add_pending(lru, loc_l, cpu, node, hash);
raw_spin_unlock_irqrestore(&loc_l->lock, flags);
}
Should these also use trylock to maintain consistency with the stated goal
of avoiding NMI-context deadlocks?
---
AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21147913717
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-19 14:21 ` [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes Leon Hwang
2026-01-19 18:46 ` bot+bpf-ci
@ 2026-01-19 19:47 ` Daniel Borkmann
2026-01-20 1:49 ` Leon Hwang
1 sibling, 1 reply; 11+ messages in thread
From: Daniel Borkmann @ 2026-01-19 19:47 UTC (permalink / raw)
To: Leon Hwang, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
linux-kernel, linux-kselftest, kernel-patches-bot,
Kumar Kartikeya Dwivedi
On 1/19/26 3:21 PM, Leon Hwang wrote:
> Switch the free-node pop paths to raw_spin_trylock*() to avoid blocking
> on contended LRU locks.
>
> If the global or per-CPU LRU lock is unavailable, refuse to refill the
> local free list and return NULL instead. This allows callers to back
> off safely rather than blocking or re-entering the same lock context.
>
> This change avoids lockdep warnings and potential deadlocks caused by
> re-entrant LRU lock acquisition from NMI context, as shown below:
>
> [ 418.260323] bpf_testmod: oh no, recursing into test_1, recursion_misses 1
> [ 424.982207] ================================
> [ 424.982216] WARNING: inconsistent lock state
> [ 424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> [ 424.982314] *** DEADLOCK ***
> [...]
>
> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> ---
> kernel/bpf/bpf_lru_list.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
Documentation/bpf/map_lru_hash_update.dot needs update?
> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
> index c091f3232cc5..03d37f72731a 100644
> --- a/kernel/bpf/bpf_lru_list.c
> +++ b/kernel/bpf/bpf_lru_list.c
> @@ -312,14 +312,15 @@ static void bpf_lru_list_push_free(struct bpf_lru_list *l,
> raw_spin_unlock_irqrestore(&l->lock, flags);
> }
>
> -static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
> +static bool bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
> struct bpf_lru_locallist *loc_l)
> {
> struct bpf_lru_list *l = &lru->common_lru.lru_list;
> struct bpf_lru_node *node, *tmp_node;
> unsigned int nfree = 0;
>
> - raw_spin_lock(&l->lock);
> + if (!raw_spin_trylock(&l->lock))
> + return false;
>
Could you provide some more analysis, and the effect this has on real-world
programs? Presumably they'll unexpectedly encounter a lot more frequent
-ENOMEM as an error on bpf_map_update_elem even though memory might be
available just that locks are contended?
Also, have you considered rqspinlock as a potential candidate to discover
deadlocks?
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-19 19:47 ` Daniel Borkmann
@ 2026-01-20 1:49 ` Leon Hwang
2026-01-20 1:54 ` Alexei Starovoitov
0 siblings, 1 reply; 11+ messages in thread
From: Leon Hwang @ 2026-01-20 1:49 UTC (permalink / raw)
To: Daniel Borkmann, bpf
Cc: Martin KaFai Lau, Alexei Starovoitov, Andrii Nakryiko,
Eduard Zingerman, Song Liu, Yonghong Song, John Fastabend,
KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, Shuah Khan,
linux-kernel, linux-kselftest, kernel-patches-bot,
Kumar Kartikeya Dwivedi
On 20/1/26 03:47, Daniel Borkmann wrote:
> On 1/19/26 3:21 PM, Leon Hwang wrote:
>> Switch the free-node pop paths to raw_spin_trylock*() to avoid blocking
>> on contended LRU locks.
>>
>> If the global or per-CPU LRU lock is unavailable, refuse to refill the
>> local free list and return NULL instead. This allows callers to back
>> off safely rather than blocking or re-entering the same lock context.
>>
>> This change avoids lockdep warnings and potential deadlocks caused by
>> re-entrant LRU lock acquisition from NMI context, as shown below:
>>
>> [ 418.260323] bpf_testmod: oh no, recursing into test_1,
>> recursion_misses 1
>> [ 424.982207] ================================
>> [ 424.982216] WARNING: inconsistent lock state
>> [ 424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
>> [ 424.982314] *** DEADLOCK ***
>> [...]
>>
>> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
>> ---
>> kernel/bpf/bpf_lru_list.c | 17 ++++++++++-------
>> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> Documentation/bpf/map_lru_hash_update.dot needs update?
>
Yes, it needs update.
>> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
>> index c091f3232cc5..03d37f72731a 100644
>> --- a/kernel/bpf/bpf_lru_list.c
>> +++ b/kernel/bpf/bpf_lru_list.c
>> @@ -312,14 +312,15 @@ static void bpf_lru_list_push_free(struct
>> bpf_lru_list *l,
>> raw_spin_unlock_irqrestore(&l->lock, flags);
>> }
>> -static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
>> +static bool bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
>> struct bpf_lru_locallist *loc_l)
>> {
>> struct bpf_lru_list *l = &lru->common_lru.lru_list;
>> struct bpf_lru_node *node, *tmp_node;
>> unsigned int nfree = 0;
>> - raw_spin_lock(&l->lock);
>> + if (!raw_spin_trylock(&l->lock))
>> + return false;
>>
>
> Could you provide some more analysis, and the effect this has on real-world
> programs? Presumably they'll unexpectedly encounter a lot more frequent
> -ENOMEM as an error on bpf_map_update_elem even though memory might be
> available just that locks are contended?
>
> Also, have you considered rqspinlock as a potential candidate to discover
> deadlocks?
Thanks for the questions.
While I haven’t encountered this issue in production systems myself, the
deadlock has been observed repeatedly in practice, including the cases
shown in the cover letter. It can also be reproduced reliably when
running the LRU tests locally, so this is a real and recurring problem.
I agree that returning -ENOMEM when locks are contended is not ideal.
Using -EBUSY would better reflect the situation where memory is
available but forward progress is temporarily blocked by lock
contention. I can update the patch accordingly.
Regarding rqspinlock: as mentioned in the cover letter, Menglong
previously explored using rqspinlock to address these deadlocks but was
unable to arrive at a complete solution. After further off-list
discussion, we agreed that using trylock is a more practical approach
here. In most observed cases, the lock contention leading to deadlock
occurs in bpf_common_lru_pop_free(), and trylock allows callers to back
off safely rather than risking re-entrancy and deadlock.
Thanks,
Leon
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-20 1:49 ` Leon Hwang
@ 2026-01-20 1:54 ` Alexei Starovoitov
0 siblings, 0 replies; 11+ messages in thread
From: Alexei Starovoitov @ 2026-01-20 1:54 UTC (permalink / raw)
To: Leon Hwang
Cc: Daniel Borkmann, bpf, Martin KaFai Lau, Alexei Starovoitov,
Andrii Nakryiko, Eduard Zingerman, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, LKML, open list:KERNEL SELFTEST FRAMEWORK,
kernel-patches-bot, Kumar Kartikeya Dwivedi
On Mon, Jan 19, 2026 at 5:50 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 20/1/26 03:47, Daniel Borkmann wrote:
> > On 1/19/26 3:21 PM, Leon Hwang wrote:
> >> Switch the free-node pop paths to raw_spin_trylock*() to avoid blocking
> >> on contended LRU locks.
> >>
> >> If the global or per-CPU LRU lock is unavailable, refuse to refill the
> >> local free list and return NULL instead. This allows callers to back
> >> off safely rather than blocking or re-entering the same lock context.
> >>
> >> This change avoids lockdep warnings and potential deadlocks caused by
> >> re-entrant LRU lock acquisition from NMI context, as shown below:
> >>
> >> [ 418.260323] bpf_testmod: oh no, recursing into test_1,
> >> recursion_misses 1
> >> [ 424.982207] ================================
> >> [ 424.982216] WARNING: inconsistent lock state
> >> [ 424.982223] inconsistent {INITIAL USE} -> {IN-NMI} usage.
> >> [ 424.982314] *** DEADLOCK ***
> >> [...]
> >>
> >> Signed-off-by: Leon Hwang <leon.hwang@linux.dev>
> >> ---
> >> kernel/bpf/bpf_lru_list.c | 17 ++++++++++-------
> >> 1 file changed, 10 insertions(+), 7 deletions(-)
> >
> > Documentation/bpf/map_lru_hash_update.dot needs update?
> >
>
> Yes, it needs update.
>
> >> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
> >> index c091f3232cc5..03d37f72731a 100644
> >> --- a/kernel/bpf/bpf_lru_list.c
> >> +++ b/kernel/bpf/bpf_lru_list.c
> >> @@ -312,14 +312,15 @@ static void bpf_lru_list_push_free(struct
> >> bpf_lru_list *l,
> >> raw_spin_unlock_irqrestore(&l->lock, flags);
> >> }
> >> -static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
> >> +static bool bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
> >> struct bpf_lru_locallist *loc_l)
> >> {
> >> struct bpf_lru_list *l = &lru->common_lru.lru_list;
> >> struct bpf_lru_node *node, *tmp_node;
> >> unsigned int nfree = 0;
> >> - raw_spin_lock(&l->lock);
> >> + if (!raw_spin_trylock(&l->lock))
> >> + return false;
> >>
> >
> > Could you provide some more analysis, and the effect this has on real-world
> > programs? Presumably they'll unexpectedly encounter a lot more frequent
> > -ENOMEM as an error on bpf_map_update_elem even though memory might be
> > available just that locks are contended?
> >
> > Also, have you considered rqspinlock as a potential candidate to discover
> > deadlocks?
>
> Thanks for the questions.
>
> While I haven’t encountered this issue in production systems myself, the
> deadlock has been observed repeatedly in practice, including the cases
> shown in the cover letter. It can also be reproduced reliably when
> running the LRU tests locally, so this is a real and recurring problem.
>
> I agree that returning -ENOMEM when locks are contended is not ideal.
> Using -EBUSY would better reflect the situation where memory is
> available but forward progress is temporarily blocked by lock
> contention. I can update the patch accordingly.
>
> Regarding rqspinlock: as mentioned in the cover letter, Menglong
> previously explored using rqspinlock to address these deadlocks but was
> unable to arrive at a complete solution. After further off-list
> discussion, we agreed that using trylock is a more practical approach
> here. In most observed cases, the lock contention leading to deadlock
> occurs in bpf_common_lru_pop_free(), and trylock allows callers to back
> off safely rather than risking re-entrancy and deadlock.
Sorry, trylock is not an option here.
We are not going to sacrifice LRU map reliability for the sake of syzbot.
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-19 18:46 ` bot+bpf-ci
@ 2026-01-20 1:56 ` Leon Hwang
2026-01-20 2:01 ` Alexei Starovoitov
0 siblings, 1 reply; 11+ messages in thread
From: Leon Hwang @ 2026-01-20 1:56 UTC (permalink / raw)
To: bot+bpf-ci, bpf
Cc: martin.lau, ast, daniel, andrii, eddyz87, song, yonghong.song,
john.fastabend, kpsingh, sdf, haoluo, jolsa, shuah, linux-kernel,
linux-kselftest, kernel-patches-bot, martin.lau, clm,
ihor.solodrai
On 20/1/26 02:46, bot+bpf-ci@kernel.org wrote:
>> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
>> index c091f3232cc5..03d37f72731a 100644
>> --- a/kernel/bpf/bpf_lru_list.c
>> +++ b/kernel/bpf/bpf_lru_list.c
>
> [ ... ]
>
>> @@ -451,13 +455,12 @@ static struct bpf_lru_node *bpf_common_lru_pop_free(struct bpf_lru *lru,
>>
>> loc_l = per_cpu_ptr(clru->local_list, cpu);
>>
>> - raw_spin_lock_irqsave(&loc_l->lock, flags);
>> + if (!raw_spin_trylock_irqsave(&loc_l->lock, flags))
>> + return NULL;
>>
>> node = __local_list_pop_free(loc_l);
>> - if (!node) {
>> - bpf_lru_list_pop_free_to_local(lru, loc_l);
>> + if (!node && bpf_lru_list_pop_free_to_local(lru, loc_l))
>> node = __local_list_pop_free(loc_l);
>> - }
>>
>> if (node)
>> __local_list_add_pending(lru, loc_l, cpu, node, hash);
>
> The trylock conversion here looks correct, but what about the steal path
> that follows later in bpf_common_lru_pop_free()? Looking at the existing
> code (not shown in this diff), the steal loop still uses unconditional
> raw_spin_lock_irqsave():
>
> do {
> steal_loc_l = per_cpu_ptr(clru->local_list, steal);
>
> raw_spin_lock_irqsave(&steal_loc_l->lock, flags);
> ...
> } while (...);
>
> If the goal is to avoid NMI-context deadlocks when acquiring LRU locks,
> can the same deadlock scenario occur when NMI interrupts during the steal
> loop and the NMI handler tries to acquire the same steal_loc_l->lock?
>
> Similarly, after a successful steal, there is another unconditional lock:
>
> if (node) {
> raw_spin_lock_irqsave(&loc_l->lock, flags);
> __local_list_add_pending(lru, loc_l, cpu, node, hash);
> raw_spin_unlock_irqrestore(&loc_l->lock, flags);
> }
>
> Should these also use trylock to maintain consistency with the stated goal
> of avoiding NMI-context deadlocks?
>
This patch is not intended to eliminate all possible deadlock scenarios.
Its goal is to avoid deadlocks caused by long-lived critical sections
in the free-node pop paths, where lock contention can persist and lead
to re-entrant lock acquisition from NMI context.
The steal path and the post-steal update are both short-lived critical
sections. They do not exhibit the same contention characteristics and
have not been observed to trigger the reported deadlock scenarios.
Converting these paths to trylock would add complexity without clear
benefit, and is therefore unnecessary for the stated goal of this change.
Thanks,
Leon
>
> ---
> AI reviewed your patch. Please fix the bug or email reply why it's not a bug.
> See: https://github.com/kernel-patches/vmtest/blob/master/ci/claude/README.md
>
> CI run summary: https://github.com/kernel-patches/bpf/actions/runs/21147913717
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-20 1:56 ` Leon Hwang
@ 2026-01-20 2:01 ` Alexei Starovoitov
2026-01-20 2:19 ` Leon Hwang
0 siblings, 1 reply; 11+ messages in thread
From: Alexei Starovoitov @ 2026-01-20 2:01 UTC (permalink / raw)
To: Leon Hwang
Cc: bot+bpf-ci, bpf, Martin KaFai Lau, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, LKML, open list:KERNEL SELFTEST FRAMEWORK,
kernel-patches-bot, Martin KaFai Lau, Chris Mason, Ihor Solodrai
On Mon, Jan 19, 2026 at 5:57 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>
>
>
> On 20/1/26 02:46, bot+bpf-ci@kernel.org wrote:
> >> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
> >> index c091f3232cc5..03d37f72731a 100644
> >> --- a/kernel/bpf/bpf_lru_list.c
> >> +++ b/kernel/bpf/bpf_lru_list.c
> >
> > [ ... ]
> >
> >> @@ -451,13 +455,12 @@ static struct bpf_lru_node *bpf_common_lru_pop_free(struct bpf_lru *lru,
> >>
> >> loc_l = per_cpu_ptr(clru->local_list, cpu);
> >>
> >> - raw_spin_lock_irqsave(&loc_l->lock, flags);
> >> + if (!raw_spin_trylock_irqsave(&loc_l->lock, flags))
> >> + return NULL;
> >>
> >> node = __local_list_pop_free(loc_l);
> >> - if (!node) {
> >> - bpf_lru_list_pop_free_to_local(lru, loc_l);
> >> + if (!node && bpf_lru_list_pop_free_to_local(lru, loc_l))
> >> node = __local_list_pop_free(loc_l);
> >> - }
> >>
> >> if (node)
> >> __local_list_add_pending(lru, loc_l, cpu, node, hash);
> >
> > The trylock conversion here looks correct, but what about the steal path
> > that follows later in bpf_common_lru_pop_free()? Looking at the existing
> > code (not shown in this diff), the steal loop still uses unconditional
> > raw_spin_lock_irqsave():
> >
> > do {
> > steal_loc_l = per_cpu_ptr(clru->local_list, steal);
> >
> > raw_spin_lock_irqsave(&steal_loc_l->lock, flags);
> > ...
> > } while (...);
> >
> > If the goal is to avoid NMI-context deadlocks when acquiring LRU locks,
> > can the same deadlock scenario occur when NMI interrupts during the steal
> > loop and the NMI handler tries to acquire the same steal_loc_l->lock?
> >
> > Similarly, after a successful steal, there is another unconditional lock:
> >
> > if (node) {
> > raw_spin_lock_irqsave(&loc_l->lock, flags);
> > __local_list_add_pending(lru, loc_l, cpu, node, hash);
> > raw_spin_unlock_irqrestore(&loc_l->lock, flags);
> > }
> >
> > Should these also use trylock to maintain consistency with the stated goal
> > of avoiding NMI-context deadlocks?
> >
>
> This patch is not intended to eliminate all possible deadlock scenarios.
> Its goal is to avoid deadlocks caused by long-lived critical sections
> in the free-node pop paths, where lock contention can persist and lead
> to re-entrant lock acquisition from NMI context.
>
> The steal path and the post-steal update are both short-lived critical
> sections. They do not exhibit the same contention characteristics and
> have not been observed to trigger the reported deadlock scenarios.
> Converting these paths to trylock would add complexity without clear
> benefit, and is therefore unnecessary for the stated goal of this change.
AI is correct. Either everything needs to be converted or none.
Adding trylock in a few places because syzbot found them is not fixing anything.
Just silencing one (or a few?) syzbot reports.
As I said in the other email, trylock is not an option.
rqspinlock is the only true way of addressing potential deadlocks.
If it's too hard, then leave it as-is. Do not hack things half way.
* Re: [PATCH bpf-next 2/3] bpf: Avoid deadlock using trylock when popping LRU free nodes
2026-01-20 2:01 ` Alexei Starovoitov
@ 2026-01-20 2:19 ` Leon Hwang
0 siblings, 0 replies; 11+ messages in thread
From: Leon Hwang @ 2026-01-20 2:19 UTC (permalink / raw)
To: Alexei Starovoitov
Cc: bot+bpf-ci, bpf, Martin KaFai Lau, Alexei Starovoitov,
Daniel Borkmann, Andrii Nakryiko, Eduard, Song Liu, Yonghong Song,
John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
Shuah Khan, LKML, open list:KERNEL SELFTEST FRAMEWORK,
kernel-patches-bot, Martin KaFai Lau, Chris Mason, Ihor Solodrai
On 20/1/26 10:01, Alexei Starovoitov wrote:
> On Mon, Jan 19, 2026 at 5:57 PM Leon Hwang <leon.hwang@linux.dev> wrote:
>>
>>
>>
>> On 20/1/26 02:46, bot+bpf-ci@kernel.org wrote:
>>>> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
>>>> index c091f3232cc5..03d37f72731a 100644
>>>> --- a/kernel/bpf/bpf_lru_list.c
>>>> +++ b/kernel/bpf/bpf_lru_list.c
>>>
>>> [ ... ]
>>>
>>>> @@ -451,13 +455,12 @@ static struct bpf_lru_node *bpf_common_lru_pop_free(struct bpf_lru *lru,
>>>>
>>>> loc_l = per_cpu_ptr(clru->local_list, cpu);
>>>>
>>>> - raw_spin_lock_irqsave(&loc_l->lock, flags);
>>>> + if (!raw_spin_trylock_irqsave(&loc_l->lock, flags))
>>>> + return NULL;
>>>>
>>>> node = __local_list_pop_free(loc_l);
>>>> - if (!node) {
>>>> - bpf_lru_list_pop_free_to_local(lru, loc_l);
>>>> + if (!node && bpf_lru_list_pop_free_to_local(lru, loc_l))
>>>> node = __local_list_pop_free(loc_l);
>>>> - }
>>>>
>>>> if (node)
>>>> __local_list_add_pending(lru, loc_l, cpu, node, hash);
>>>
>>> The trylock conversion here looks correct, but what about the steal path
>>> that follows later in bpf_common_lru_pop_free()? Looking at the existing
>>> code (not shown in this diff), the steal loop still uses unconditional
>>> raw_spin_lock_irqsave():
>>>
>>> do {
>>> steal_loc_l = per_cpu_ptr(clru->local_list, steal);
>>>
>>> raw_spin_lock_irqsave(&steal_loc_l->lock, flags);
>>> ...
>>> } while (...);
>>>
>>> If the goal is to avoid NMI-context deadlocks when acquiring LRU locks,
>>> can the same deadlock scenario occur when NMI interrupts during the steal
>>> loop and the NMI handler tries to acquire the same steal_loc_l->lock?
>>>
>>> Similarly, after a successful steal, there is another unconditional lock:
>>>
>>> if (node) {
>>> raw_spin_lock_irqsave(&loc_l->lock, flags);
>>> __local_list_add_pending(lru, loc_l, cpu, node, hash);
>>> raw_spin_unlock_irqrestore(&loc_l->lock, flags);
>>> }
>>>
>>> Should these also use trylock to maintain consistency with the stated goal
>>> of avoiding NMI-context deadlocks?
>>>
>>
>> This patch is not intended to eliminate all possible deadlock scenarios.
>> Its goal is to avoid deadlocks caused by long-lived critical sections
>> in the free-node pop paths, where lock contention can persist and lead
>> to re-entrant lock acquisition from NMI context.
>>
>> The steal path and the post-steal update are both short-lived critical
>> sections. They do not exhibit the same contention characteristics and
>> have not been observed to trigger the reported deadlock scenarios.
>> Converting these paths to trylock would add complexity without clear
>> benefit, and is therefore unnecessary for the stated goal of this change.
>
> AI is correct. Either everything needs to be converted or none.
> Adding trylock in a few places because syzbot found them is not fixing anything.
> Just silencing one (or a few?) syzbot reports.
> As I said in the other email, trylock is not an option.
> rqspinlock is the only true way of addressing potential deadlocks.
> If it's too hard, then leave it as-is. Do not hack things half way.
Understood.
Leave it as-is.
Thanks,
Leon