* [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
@ 2025-06-16 14:38 Willem de Bruijn
2025-06-17 17:39 ` Stanislav Fomichev
2025-06-18 13:56 ` Anton Protopopov
0 siblings, 2 replies; 7+ messages in thread
From: Willem de Bruijn @ 2025-06-16 14:38 UTC (permalink / raw)
To: bpf; +Cc: netdev, ast, daniel, john.fastabend, martin.lau, Willem de Bruijn
From: Willem de Bruijn <willemb@google.com>
BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
map is full, due to percpu reservations and force shrink before
neighbor stealing. Once a CPU is unable to borrow from the global
free list, it steals a single element from a neighbor once, and
thereafter repeatedly flushes that one element back to the global list
and immediately recycles it.
Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
with 79 CPUs. CPU 79 will observe this behavior even while its
neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
CPUs need not be active concurrently. The issue can appear with
affinity migration, e.g., irqbalance. Each CPU can reserve and then
hold onto its 128 elements indefinitely.
Avoid global list exhaustion by limiting aggregate percpu caches to
half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
This change has no effect on sufficiently large tables.
Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
lru->target_free. The extra field fits in a hole in struct bpf_lru.
The cacheline is already warm where it is read in the hot path. The
field is only accessed with the lru lock held.
The tests are updated to pass. Test comments are extensive: updating
those is left for a v2 if the approach is considered ok.
Signed-off-by: Willem de Bruijn <willemb@google.com>
---
This suggested approach is a small patch, easy to understand, and
easy to verify to have no effect on sufficiently sized maps.
Global table exhaustion can be mitigated in other ways besides or
alongside this approach:
Grow the map.
The most obvious approach, but it increases memory use and requires
users to know map implementation details to choose the right size.
Round up map size.
As htab_map_alloc does for the PERCPU variant of this list.
Increases memory use, and a conservative strategy still leaves one
element per CPU, and thus MRU behavior.
Steal from neighbors before force shrink.
May increase lock contention.
Steal more than one element from a neighbor when stealing.
For instance, half its list. Requires traversing the entire list.
Steal from least recently active neighbors.
Needs inactive CPU tracking.
Dynamic target_free: high and low watermarks to double/halve size.
I also implemented this. It adjusts to the active CPU count, which may
be << NR_CPUS. But it is hard to reason about whether, or at what
level, target_free will stabilize.
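For reference, a quick userspace back-of-the-envelope (illustrative
only, not part of the patch; the constants mirror the 10K element /
79 CPU example in the commit message) of how target_free comes out:

#include <stdio.h>

/* userspace stand-in for the kernel's clamp() */
#define CLAMP(v, lo, hi) ((v) < (lo) ? (lo) : (v) > (hi) ? (hi) : (v))

int main(void)
{
	unsigned int nr_elems = 10000, nr_cpus = 79;
	unsigned int target_free;

	/* same expression as the bpf_common_lru_populate() hunk below */
	target_free = CLAMP(nr_elems / nr_cpus / 2, 1u, 128u);

	/* prints: target_free=63, worst-case reserved 4977 of 10000 */
	printf("target_free=%u, worst-case reserved %u of %u\n",
	       target_free, target_free * nr_cpus, nr_elems);
	return 0;
}

Once nr_elems / nr_cpus / 2 reaches 128, the clamp returns
LOCAL_FREE_TARGET again, which is why sufficiently large maps are
unaffected.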
---
kernel/bpf/bpf_lru_list.c | 9 ++++---
kernel/bpf/bpf_lru_list.h | 1 +
tools/testing/selftests/bpf/test_lru_map.c | 28 ++++++++++++++--------
3 files changed, 25 insertions(+), 13 deletions(-)
diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
index 3dabdd137d10..2d6e1c98d8ad 100644
--- a/kernel/bpf/bpf_lru_list.c
+++ b/kernel/bpf/bpf_lru_list.c
@@ -337,12 +337,12 @@ static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
list) {
__bpf_lru_node_move_to_free(l, node, local_free_list(loc_l),
BPF_LRU_LOCAL_LIST_T_FREE);
- if (++nfree == LOCAL_FREE_TARGET)
+ if (++nfree == lru->target_free)
break;
}
- if (nfree < LOCAL_FREE_TARGET)
- __bpf_lru_list_shrink(lru, l, LOCAL_FREE_TARGET - nfree,
+ if (nfree < lru->target_free)
+ __bpf_lru_list_shrink(lru, l, lru->target_free - nfree,
local_free_list(loc_l),
BPF_LRU_LOCAL_LIST_T_FREE);
@@ -577,6 +577,9 @@ static void bpf_common_lru_populate(struct bpf_lru *lru, void *buf,
list_add(&node->list, &l->lists[BPF_LRU_LIST_T_FREE]);
buf += elem_size;
}
+
+ lru->target_free = clamp((nr_elems / num_possible_cpus()) / 2,
+ 1, LOCAL_FREE_TARGET);
}
static void bpf_percpu_lru_populate(struct bpf_lru *lru, void *buf,
diff --git a/kernel/bpf/bpf_lru_list.h b/kernel/bpf/bpf_lru_list.h
index cbd8d3720c2b..fe2661a58ea9 100644
--- a/kernel/bpf/bpf_lru_list.h
+++ b/kernel/bpf/bpf_lru_list.h
@@ -58,6 +58,7 @@ struct bpf_lru {
del_from_htab_func del_from_htab;
void *del_arg;
unsigned int hash_offset;
+ unsigned int target_free;
unsigned int nr_scans;
bool percpu;
};
diff --git a/tools/testing/selftests/bpf/test_lru_map.c b/tools/testing/selftests/bpf/test_lru_map.c
index fda7589c5023..abb592553394 100644
--- a/tools/testing/selftests/bpf/test_lru_map.c
+++ b/tools/testing/selftests/bpf/test_lru_map.c
@@ -138,6 +138,12 @@ static int sched_next_online(int pid, int *next_to_try)
return ret;
}
+/* inverse of how bpf_common_lru_populate derives target_free from map_size. */
+static unsigned int __map_size(unsigned int tgt_free)
+{
+ return tgt_free * nr_cpus * 2;
+}
+
/* Size of the LRU map is 2
* Add key=1 (+1 key)
* Add key=2 (+1 key)
@@ -257,7 +263,7 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
batch_size = tgt_free / 2;
assert(batch_size * 2 == tgt_free);
- map_size = tgt_free + batch_size;
+ map_size = __map_size(tgt_free) + batch_size;
lru_map_fd = create_map(map_type, map_flags, map_size);
assert(lru_map_fd != -1);
@@ -267,7 +273,7 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
value[0] = 1234;
/* Insert 1 to tgt_free (+tgt_free keys) */
- end_key = 1 + tgt_free;
+ end_key = 1 + __map_size(tgt_free);
for (key = 1; key < end_key; key++)
assert(!bpf_map_update_elem(lru_map_fd, &key, value,
BPF_NOEXIST));
@@ -284,8 +290,8 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
* => 1+tgt_free/2 to LOCALFREE_TARGET will be
* removed by LRU
*/
- key = 1 + tgt_free;
- end_key = key + tgt_free;
+ key = 1 + __map_size(tgt_free);
+ end_key = key + __map_size(tgt_free);
for (; key < end_key; key++) {
assert(!bpf_map_update_elem(lru_map_fd, &key, value,
BPF_NOEXIST));
@@ -334,7 +340,7 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
batch_size = tgt_free / 2;
assert(batch_size * 2 == tgt_free);
- map_size = tgt_free + batch_size;
+ map_size = __map_size(tgt_free) + batch_size;
lru_map_fd = create_map(map_type, map_flags, map_size);
assert(lru_map_fd != -1);
@@ -344,7 +350,7 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
value[0] = 1234;
/* Insert 1 to tgt_free (+tgt_free keys) */
- end_key = 1 + tgt_free;
+ end_key = 1 + __map_size(tgt_free);
for (key = 1; key < end_key; key++)
assert(!bpf_map_update_elem(lru_map_fd, &key, value,
BPF_NOEXIST));
@@ -388,8 +394,9 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
value[0] = 1234;
/* Insert 1+tgt_free to tgt_free*3/2 */
- end_key = 1 + tgt_free + batch_size;
- for (key = 1 + tgt_free; key < end_key; key++)
+ key = 1 + __map_size(tgt_free);
+ end_key = key + batch_size;
+ for (; key < end_key; key++)
/* These newly added but not referenced keys will be
* gone during the next LRU shrink.
*/
@@ -397,7 +404,7 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
BPF_NOEXIST));
/* Insert 1+tgt_free*3/2 to tgt_free*5/2 */
- end_key = key + tgt_free;
+ end_key += __map_size(tgt_free);
for (; key < end_key; key++) {
assert(!bpf_map_update_elem(lru_map_fd, &key, value,
BPF_NOEXIST));
@@ -500,7 +507,8 @@ static void test_lru_sanity4(int map_type, int map_flags, unsigned int tgt_free)
lru_map_fd = create_map(map_type, map_flags,
3 * tgt_free * nr_cpus);
else
- lru_map_fd = create_map(map_type, map_flags, 3 * tgt_free);
+ lru_map_fd = create_map(map_type, map_flags,
+ 3 * __map_size(tgt_free));
assert(lru_map_fd != -1);
expected_map_fd = create_map(BPF_MAP_TYPE_HASH, 0,
--
2.50.0.rc1.591.g9c95f17f64-goog
* Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
2025-06-16 14:38 [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation Willem de Bruijn
@ 2025-06-17 17:39 ` Stanislav Fomichev
2025-06-17 20:08 ` Willem de Bruijn
2025-06-18 13:56 ` Anton Protopopov
1 sibling, 1 reply; 7+ messages in thread
From: Stanislav Fomichev @ 2025-06-17 17:39 UTC (permalink / raw)
To: Willem de Bruijn
Cc: bpf, netdev, ast, daniel, john.fastabend, martin.lau,
Willem de Bruijn
On 06/16, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> map is full, due to percpu reservations and force shrink before
> neighbor stealing. Once a CPU is unable to borrow from the global map,
> it will once steal one elem from a neighbor and after that each time
> flush this one element to the global list and immediately recycle it.
>
> Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> with 79 CPUs. CPU 79 will observe this behavior even while its
> neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
>
> CPUs need not be active concurrently. The issue can appear with
> affinity migration, e.g., irqbalance. Each CPU can reserve and then
> hold onto its 128 elements indefinitely.
>
> Avoid global list exhaustion by limiting aggregate percpu caches to
> half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> This change has no effect on sufficiently large tables.
The code and rationale look good to me! There is also
Documentation/bpf/map_lru_hash_update.dot which mentions
LOCAL_FREE_TARGET, not sure if it's easy to convey these clamping
details in there? Or, instead, maybe expand on it in
Documentation/bpf/map_hash.rst? This <size>/<nrcpu>/2 is a heuristic,
so maybe we can give some guidance on the recommended fill level for
small (size/nrcpu < 128) maps?
* Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
2025-06-17 17:39 ` Stanislav Fomichev
@ 2025-06-17 20:08 ` Willem de Bruijn
2025-06-17 20:29 ` Stanislav Fomichev
0 siblings, 1 reply; 7+ messages in thread
From: Willem de Bruijn @ 2025-06-17 20:08 UTC (permalink / raw)
To: Stanislav Fomichev, Willem de Bruijn
Cc: bpf, netdev, ast, daniel, john.fastabend, martin.lau,
Willem de Bruijn
Stanislav Fomichev wrote:
> On 06/16, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> > map is full, due to percpu reservations and force shrink before
> > neighbor stealing. Once a CPU is unable to borrow from the global map,
> > it will once steal one elem from a neighbor and after that each time
> > flush this one element to the global list and immediately recycle it.
> >
> > Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> > with 79 CPUs. CPU 79 will observe this behavior even while its
> > neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
> >
> > CPUs need not be active concurrently. The issue can appear with
> > affinity migration, e.g., irqbalance. Each CPU can reserve and then
> > hold onto its 128 elements indefinitely.
> >
> > Avoid global list exhaustion by limiting aggregate percpu caches to
> > half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> > This change has no effect on sufficiently large tables.
>
> The code and rationale look good to me!
Great :)
> There is also
> Documentation/bpf/map_lru_hash_update.dot which mentions
> LOCAL_FREE_TARGET, not sure if it's easy to convey these clamping
> details in there? Or, instead, maybe expand on it in
> Documentation/bpf/map_hash.rst?
Good catch. How about I replace LOCAL_FREE_TARGET with target_free in
the graph, and add something like the following diff to map_hash.rst:
- Attempt to use CPU-local state to batch operations
-- Attempt to fetch free nodes from global lists
+- Attempt to fetch ``target_free`` free nodes from global lists
- Attempt to pull any node from a global list and remove it from the hashmap
- Attempt to pull any node from any CPU's list and remove it from the hashmap
+The number of nodes to borrow from the global list in a batch, ``target_free``,
+depends on the size of the map. Larger batch size reduces lock contention, but
+may also exhaust the global structure. The value is computed at map init to
+avoid exhaustion, by limiting aggregate reservation by all CPUs to half the map
+size. Bounded by a minimum of 1 and maximum budget of 128 at a time.
Btw, there is also great documentation on
https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LRU_HASH/. That had a
small error in the order of those Attempt operations above that I
fixed up this week. I'll also update the LOCAL_FREE_TARGET there.
Since it explains the LRU mechanism well, should I link to it as well?
> This <size>/<nrcpu>/2 is a heuristic,
> so maybe we can give some guidance on the recommended fill level for
> small (size/nrcpu < 128) maps?
I don't know if we can suggest a size that works for all cases. It
depends on factors like the number of CPUs that actively update the
map and how tolerant the workload is of prematurely removed elements.
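The clamp does give one workload-independent floor, though. As a
made-up example: a 1024 element map on 64 CPUs gets

  target_free = clamp((1024 / 64) / 2, 1, 128) = 8

so the per-cpu caches hold at most 64 * 8 = 512 elements and at least
half of the map stays reclaimable through the global list. Whether
that remaining half is enough headroom is, again, workload dependent.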
* Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
2025-06-17 20:08 ` Willem de Bruijn
@ 2025-06-17 20:29 ` Stanislav Fomichev
0 siblings, 0 replies; 7+ messages in thread
From: Stanislav Fomichev @ 2025-06-17 20:29 UTC (permalink / raw)
To: Willem de Bruijn
Cc: bpf, netdev, ast, daniel, john.fastabend, martin.lau,
Willem de Bruijn
On 06/17, Willem de Bruijn wrote:
> Stanislav Fomichev wrote:
> > On 06/16, Willem de Bruijn wrote:
> > > From: Willem de Bruijn <willemb@google.com>
> > >
> > > BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> > > map is full, due to percpu reservations and force shrink before
> > > neighbor stealing. Once a CPU is unable to borrow from the global map,
> > > it will once steal one elem from a neighbor and after that each time
> > > flush this one element to the global list and immediately recycle it.
> > >
> > > Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> > > with 79 CPUs. CPU 79 will observe this behavior even while its
> > > neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
> > >
> > > CPUs need not be active concurrently. The issue can appear with
> > > affinity migration, e.g., irqbalance. Each CPU can reserve and then
> > > hold onto its 128 elements indefinitely.
> > >
> > > Avoid global list exhaustion by limiting aggregate percpu caches to
> > > half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> > > This change has no effect on sufficiently large tables.
> >
> > The code and rationale look good to me!
>
> Great :)
>
> > There is also
> > Documentation/bpf/map_lru_hash_update.dot which mentions
> > LOCAL_FREE_TARGET, not sure if it's easy to convey these clamping
> > details in there? Or, instead, maybe expand on it in
> > Documentation/bpf/map_hash.rst?
>
> Good catch. How about in the graph I replace LOCAL_FREE_TARGET by
> target_free and in map_hash.rst something like the following diff:
>
> - Attempt to use CPU-local state to batch operations
> -- Attempt to fetch free nodes from global lists
> +- Attempt to fetch ``target_free`` free nodes from global lists
> - Attempt to pull any node from a global list and remove it from the hashmap
> - Attempt to pull any node from any CPU's list and remove it from the hashmap
>
> +The number of nodes to borrow from the global list in a batch, ``target_free``,
> +depends on the size of the map. Larger batch size reduces lock contention, but
> +may also exhaust the global structure. The value is computed at map init to
> +avoid exhaustion, by limiting aggregate reservation by all CPUs to half the map
> +size. Bounded by a minimum of 1 and maximum budget of 128 at a time.
>
> Btw, there is also great documentation on
> https://docs.ebpf.io/linux/map-type/BPF_MAP_TYPE_LRU_HASH/. That had a
> small error in the order of those Attempt operations above that I
> fixed up this week. I'll also update the LOCAL_FREE_TARGET there.
> Since it explains the LRU mechanism well, should I link to it as well?
SG, yeah, let's do both (add info + link the doc), thanks!
> > This <size>/<nrcpu>/2 is a heuristic,
> > so maybe we can give some guidance on the recommended fill level for
> > small (size/nrcpu < 128) maps?
>
> I don't know if we can suggest a size that works for all cases. It depends on
> factors like the number of CPUs that actively update the map and how tolerable
> prematurely removed elements are to the workload.
Makes sense!
* Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
2025-06-16 14:38 [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation Willem de Bruijn
2025-06-17 17:39 ` Stanislav Fomichev
@ 2025-06-18 13:56 ` Anton Protopopov
2025-06-18 13:55 ` Alexei Starovoitov
1 sibling, 1 reply; 7+ messages in thread
From: Anton Protopopov @ 2025-06-18 13:56 UTC (permalink / raw)
To: Willem de Bruijn
Cc: bpf, netdev, ast, daniel, john.fastabend, martin.lau,
Willem de Bruijn
On 25/06/16 10:38AM, Willem de Bruijn wrote:
> From: Willem de Bruijn <willemb@google.com>
>
> BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> map is full, due to percpu reservations and force shrink before
> neighbor stealing. Once a CPU is unable to borrow from the global map,
> it will once steal one elem from a neighbor and after that each time
> flush this one element to the global list and immediately recycle it.
>
> Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> with 79 CPUs. CPU 79 will observe this behavior even while its
> neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
>
> CPUs need not be active concurrently. The issue can appear with
> affinity migration, e.g., irqbalance. Each CPU can reserve and then
> hold onto its 128 elements indefinitely.
>
> Avoid global list exhaustion by limiting aggregate percpu caches to
> half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> This change has no effect on sufficiently large tables.
>
> Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
> lru->target_free. The extra field fits in a hole in struct bpf_lru.
> The cacheline is already warm where read in the hot path. The field is
> only accessed with the lru lock held.
Hi Willem! The patch looks very reasonable. I've bumped into this
issue before (see https://lore.kernel.org/bpf/ZJwy478jHkxYNVMc@zh-lab-node-5/)
but didn't follow up, as we typically have large enough LRU maps.
I've tested your patch (with a patched map_tests/map_percpu_stats.c
selftest); it works as expected for small maps. E.g., before your patch
a map of size 4096, after 2176 updates from 32 threads on 32 CPUs,
contains around 150 elements; after your patch, around 2100 elements,
as expected.
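For reference, a rough standalone sketch of this kind of check
(simplified, not the actual patched selftest: plain libbpf API, no
pinning of threads to CPUs, needs CAP_BPF, link with -lbpf -lpthread):

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <bpf/bpf.h>

#define NR_THREADS		32
#define UPDATES_PER_THREAD	68	/* 32 * 68 == 2176 updates total */
#define MAP_SIZE		4096

static int map_fd;

static void *updater(void *arg)
{
	unsigned long long key, value = 1234;
	unsigned long i, base = (unsigned long)arg * UPDATES_PER_THREAD;

	for (i = 0; i < UPDATES_PER_THREAD; i++) {
		key = base + i + 1;	/* keys are distinct across threads */
		/* updates may transiently fail on a starved LRU; ignore */
		bpf_map_update_elem(map_fd, &key, &value, BPF_ANY);
	}
	return NULL;
}

int main(void)
{
	unsigned long long key = 0, next_key;
	pthread_t tids[NR_THREADS];
	int i, nr_entries = 0;

	map_fd = bpf_map_create(BPF_MAP_TYPE_LRU_HASH, NULL, sizeof(key),
				sizeof(unsigned long long), MAP_SIZE, NULL);
	assert(map_fd >= 0);

	for (i = 0; i < NR_THREADS; i++)
		assert(!pthread_create(&tids[i], NULL, updater,
				       (void *)(unsigned long)i));
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(tids[i], NULL);

	/* count how many of the 2176 inserted keys survived */
	while (!bpf_map_get_next_key(map_fd, nr_entries ? &key : NULL,
				     &next_key)) {
		key = next_key;
		nr_entries++;
	}
	printf("%d entries left in a %d-entry LRU map\n", nr_entries, MAP_SIZE);
	return 0;
}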
Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
> The tests are updated to pass. Test comments are extensive: updating
> those is left for a v2 if the approach is considered ok.
>
> Signed-off-by: Willem de Bruijn <willemb@google.com>
>
> ---
>
> This suggested approach is a small patch, easy to understand, and
> easy to verify to have no effect on sufficiently sized maps.
>
> Global table exhaustion can be mitigated in other ways besides or
> alongside this approach:
>
> Grow the map.
> The most obvious approach, but increases memory use. And requires
> users to know map implementation details to choose a right size.
>
> Round up map size.
> As htab_map_alloc does for the PERCPU variant of this list.
> Increases memory use, and a conservative strategy still leaves one
> element per CPU, and thus MRU behavior.
>
> Steal from neighbors before force shrink.
> May increase lock contention.
>
> Steal more than 1 element from neighbor when stealing.
> For instance, half its list. Requires traversing the entire list.
>
> Steal from least recently active neighbors.
> Needs inactive CPU tracking.
>
> Dynamic target_free: high and low watermarks to double/halve size.
> I also implemented this. Adjusts to the active CPU count, which may
> be << NR_CPUS. But it is hard to reason whether or at what level
> target_free will stabilize.
> ---
> kernel/bpf/bpf_lru_list.c | 9 ++++---
> kernel/bpf/bpf_lru_list.h | 1 +
> tools/testing/selftests/bpf/test_lru_map.c | 28 ++++++++++++++--------
> 3 files changed, 25 insertions(+), 13 deletions(-)
>
> diff --git a/kernel/bpf/bpf_lru_list.c b/kernel/bpf/bpf_lru_list.c
> index 3dabdd137d10..2d6e1c98d8ad 100644
> --- a/kernel/bpf/bpf_lru_list.c
> +++ b/kernel/bpf/bpf_lru_list.c
> @@ -337,12 +337,12 @@ static void bpf_lru_list_pop_free_to_local(struct bpf_lru *lru,
> list) {
> __bpf_lru_node_move_to_free(l, node, local_free_list(loc_l),
> BPF_LRU_LOCAL_LIST_T_FREE);
> - if (++nfree == LOCAL_FREE_TARGET)
> + if (++nfree == lru->target_free)
> break;
> }
>
> - if (nfree < LOCAL_FREE_TARGET)
> - __bpf_lru_list_shrink(lru, l, LOCAL_FREE_TARGET - nfree,
> + if (nfree < lru->target_free)
> + __bpf_lru_list_shrink(lru, l, lru->target_free - nfree,
> local_free_list(loc_l),
> BPF_LRU_LOCAL_LIST_T_FREE);
>
> @@ -577,6 +577,9 @@ static void bpf_common_lru_populate(struct bpf_lru *lru, void *buf,
> list_add(&node->list, &l->lists[BPF_LRU_LIST_T_FREE]);
> buf += elem_size;
> }
> +
> + lru->target_free = clamp((nr_elems / num_possible_cpus()) / 2,
> + 1, LOCAL_FREE_TARGET);
> }
>
> static void bpf_percpu_lru_populate(struct bpf_lru *lru, void *buf,
> diff --git a/kernel/bpf/bpf_lru_list.h b/kernel/bpf/bpf_lru_list.h
> index cbd8d3720c2b..fe2661a58ea9 100644
> --- a/kernel/bpf/bpf_lru_list.h
> +++ b/kernel/bpf/bpf_lru_list.h
> @@ -58,6 +58,7 @@ struct bpf_lru {
> del_from_htab_func del_from_htab;
> void *del_arg;
> unsigned int hash_offset;
> + unsigned int target_free;
> unsigned int nr_scans;
> bool percpu;
> };
> diff --git a/tools/testing/selftests/bpf/test_lru_map.c b/tools/testing/selftests/bpf/test_lru_map.c
> index fda7589c5023..abb592553394 100644
> --- a/tools/testing/selftests/bpf/test_lru_map.c
> +++ b/tools/testing/selftests/bpf/test_lru_map.c
> @@ -138,6 +138,12 @@ static int sched_next_online(int pid, int *next_to_try)
> return ret;
> }
>
> +/* inverse of how bpf_common_lru_populate derives target_free from map_size. */
> +static unsigned int __map_size(unsigned int tgt_free)
> +{
> + return tgt_free * nr_cpus * 2;
> +}
> +
> /* Size of the LRU map is 2
> * Add key=1 (+1 key)
> * Add key=2 (+1 key)
> @@ -257,7 +263,7 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
> batch_size = tgt_free / 2;
> assert(batch_size * 2 == tgt_free);
>
> - map_size = tgt_free + batch_size;
> + map_size = __map_size(tgt_free) + batch_size;
> lru_map_fd = create_map(map_type, map_flags, map_size);
> assert(lru_map_fd != -1);
>
> @@ -267,7 +273,7 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
> value[0] = 1234;
>
> /* Insert 1 to tgt_free (+tgt_free keys) */
> - end_key = 1 + tgt_free;
> + end_key = 1 + __map_size(tgt_free);
> for (key = 1; key < end_key; key++)
> assert(!bpf_map_update_elem(lru_map_fd, &key, value,
> BPF_NOEXIST));
> @@ -284,8 +290,8 @@ static void test_lru_sanity1(int map_type, int map_flags, unsigned int tgt_free)
> * => 1+tgt_free/2 to LOCALFREE_TARGET will be
> * removed by LRU
> */
> - key = 1 + tgt_free;
> - end_key = key + tgt_free;
> + key = 1 + __map_size(tgt_free);
> + end_key = key + __map_size(tgt_free);
> for (; key < end_key; key++) {
> assert(!bpf_map_update_elem(lru_map_fd, &key, value,
> BPF_NOEXIST));
> @@ -334,7 +340,7 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
> batch_size = tgt_free / 2;
> assert(batch_size * 2 == tgt_free);
>
> - map_size = tgt_free + batch_size;
> + map_size = __map_size(tgt_free) + batch_size;
> lru_map_fd = create_map(map_type, map_flags, map_size);
> assert(lru_map_fd != -1);
>
> @@ -344,7 +350,7 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
> value[0] = 1234;
>
> /* Insert 1 to tgt_free (+tgt_free keys) */
> - end_key = 1 + tgt_free;
> + end_key = 1 + __map_size(tgt_free);
> for (key = 1; key < end_key; key++)
> assert(!bpf_map_update_elem(lru_map_fd, &key, value,
> BPF_NOEXIST));
> @@ -388,8 +394,9 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
> value[0] = 1234;
>
> /* Insert 1+tgt_free to tgt_free*3/2 */
> - end_key = 1 + tgt_free + batch_size;
> - for (key = 1 + tgt_free; key < end_key; key++)
> + key = 1 + __map_size(tgt_free);
> + end_key = key + batch_size;
> + for (; key < end_key; key++)
> /* These newly added but not referenced keys will be
> * gone during the next LRU shrink.
> */
> @@ -397,7 +404,7 @@ static void test_lru_sanity2(int map_type, int map_flags, unsigned int tgt_free)
> BPF_NOEXIST));
>
> /* Insert 1+tgt_free*3/2 to tgt_free*5/2 */
> - end_key = key + tgt_free;
> + end_key += __map_size(tgt_free);
> for (; key < end_key; key++) {
> assert(!bpf_map_update_elem(lru_map_fd, &key, value,
> BPF_NOEXIST));
> @@ -500,7 +507,8 @@ static void test_lru_sanity4(int map_type, int map_flags, unsigned int tgt_free)
> lru_map_fd = create_map(map_type, map_flags,
> 3 * tgt_free * nr_cpus);
> else
> - lru_map_fd = create_map(map_type, map_flags, 3 * tgt_free);
> + lru_map_fd = create_map(map_type, map_flags,
> + 3 * __map_size(tgt_free));
> assert(lru_map_fd != -1);
>
> expected_map_fd = create_map(BPF_MAP_TYPE_HASH, 0,
> --
> 2.50.0.rc1.591.g9c95f17f64-goog
>
* Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
2025-06-18 13:56 ` Anton Protopopov
@ 2025-06-18 13:55 ` Alexei Starovoitov
2025-06-18 22:03 ` Willem de Bruijn
0 siblings, 1 reply; 7+ messages in thread
From: Alexei Starovoitov @ 2025-06-18 13:55 UTC (permalink / raw)
To: Anton Protopopov
Cc: Willem de Bruijn, bpf, Network Development, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Martin KaFai Lau,
Willem de Bruijn
On Wed, Jun 18, 2025 at 6:50 AM Anton Protopopov
<a.s.protopopov@gmail.com> wrote:
>
> On 25/06/16 10:38AM, Willem de Bruijn wrote:
> > From: Willem de Bruijn <willemb@google.com>
> >
> > BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> > map is full, due to percpu reservations and force shrink before
> > neighbor stealing. Once a CPU is unable to borrow from the global map,
> > it will once steal one elem from a neighbor and after that each time
> > flush this one element to the global list and immediately recycle it.
> >
> > Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> > with 79 CPUs. CPU 79 will observe this behavior even while its
> > neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
> >
> > CPUs need not be active concurrently. The issue can appear with
> > affinity migration, e.g., irqbalance. Each CPU can reserve and then
> > hold onto its 128 elements indefinitely.
> >
> > Avoid global list exhaustion by limiting aggregate percpu caches to
> > half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> > This change has no effect on sufficiently large tables.
> >
> > Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
> > > lru->target_free. The extra field fits in a hole in struct bpf_lru.
> > The cacheline is already warm where read in the hot path. The field is
> > only accessed with the lru lock held.
>
> Hi Willem! The patch looks very reasonable. I've bumbed into this
> issue before (see https://lore.kernel.org/bpf/ZJwy478jHkxYNVMc@zh-lab-node-5/)
> but didn't follow up, as we typically have large enough LRU maps.
>
> I've tested your patch (with a patched map_tests/map_percpu_stats.c
> selftest), works as expected for small maps. E.g., before your patch
> map of size 4096 after being updated 2176 times from 32 threads on 32
> CPUS contains around 150 elements, after your patch around (expected)
> 2100 elements.
>
> Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
Looks like we have consensus.
Willem,
please target bpf tree when you respin.
* Re: [PATCH bpf-next] bpf: lru: adjust free target to avoid global table starvation
2025-06-18 13:55 ` Alexei Starovoitov
@ 2025-06-18 22:03 ` Willem de Bruijn
0 siblings, 0 replies; 7+ messages in thread
From: Willem de Bruijn @ 2025-06-18 22:03 UTC (permalink / raw)
To: Alexei Starovoitov, Anton Protopopov
Cc: Willem de Bruijn, bpf, Network Development, Alexei Starovoitov,
Daniel Borkmann, John Fastabend, Martin KaFai Lau,
Willem de Bruijn
Alexei Starovoitov wrote:
> On Wed, Jun 18, 2025 at 6:50 AM Anton Protopopov
> <a.s.protopopov@gmail.com> wrote:
> >
> > On 25/06/16 10:38AM, Willem de Bruijn wrote:
> > > From: Willem de Bruijn <willemb@google.com>
> > >
> > > BPF_MAP_TYPE_LRU_HASH can recycle most recent elements well before the
> > > map is full, due to percpu reservations and force shrink before
> > > neighbor stealing. Once a CPU is unable to borrow from the global map,
> > > it will once steal one elem from a neighbor and after that each time
> > > flush this one element to the global list and immediately recycle it.
> > >
> > > Batch value LOCAL_FREE_TARGET (128) will exhaust a 10K element map
> > > with 79 CPUs. CPU 79 will observe this behavior even while its
> > > neighbors hold 78 * 127 + 1 * 15 == 9921 free elements (99%).
> > >
> > > CPUs need not be active concurrently. The issue can appear with
> > > affinity migration, e.g., irqbalance. Each CPU can reserve and then
> > > hold onto its 128 elements indefinitely.
> > >
> > > Avoid global list exhaustion by limiting aggregate percpu caches to
> > > half of map size, by adjusting LOCAL_FREE_TARGET based on cpu count.
> > > This change has no effect on sufficiently large tables.
> > >
> > > Similar to LOCAL_NR_SCANS and lru->nr_scans, introduce a map variable
> > > > lru->target_free. The extra field fits in a hole in struct bpf_lru.
> > > The cacheline is already warm where read in the hot path. The field is
> > > only accessed with the lru lock held.
> >
> > Hi Willem! The patch looks very reasonable. I've bumbed into this
> > issue before (see https://lore.kernel.org/bpf/ZJwy478jHkxYNVMc@zh-lab-node-5/)
> > but didn't follow up, as we typically have large enough LRU maps.
> >
> > I've tested your patch (with a patched map_tests/map_percpu_stats.c
> > selftest), works as expected for small maps. E.g., before your patch
> > map of size 4096 after being updated 2176 times from 32 threads on 32
> > CPUS contains around 150 elements, after your patch around (expected)
> > 2100 elements.
> >
> > Tested-by: Anton Protopopov <a.s.protopopov@gmail.com>
>
> Looks like we have consensus.
Great. Thanks for the reviews and testing. Good to have more data that
the issue is well understood and the approach helps.
> Willem,
> please target bpf tree when you respin.
Done: https://lore.kernel.org/bpf/20250618215803.3587312-1-willemdebruijn.kernel@gmail.com/T/#u