linux-kernel.vger.kernel.org archive mirror
* [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
@ 2025-06-16  9:55 Matt Fleming
  2025-06-16 15:50 ` Song Liu
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2025-06-16  9:55 UTC (permalink / raw)
  To: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko
  Cc: Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	bpf, linux-kernel, kernel-team, Matt Fleming

From: Matt Fleming <mfleming@cloudflare.com>

Calls to kfree() in trie_free() can be expensive for KASAN-enabled
kernels. This can cause soft lockup warnings when traversing large maps,

  watchdog: BUG: soft lockup - CPU#41 stuck for 76s! [kworker/u518:14:1158211]

Avoid an unbounded loop and periodically check whether we should reschedule.

Signed-off-by: Matt Fleming <mfleming@cloudflare.com>
---
 kernel/bpf/lpm_trie.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index be66d7e520e0..a35619cd99f6 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -646,6 +646,8 @@ static void trie_free(struct bpf_map *map)
 			RCU_INIT_POINTER(*slot, NULL);
 			break;
 		}
+
+		cond_resched();
 	}
 
 out:
-- 
2.34.1



* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-16  9:55 [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free() Matt Fleming
@ 2025-06-16 15:50 ` Song Liu
  2025-06-17  9:42   ` Matt Fleming
  0 siblings, 1 reply; 14+ messages in thread
From: Song Liu @ 2025-06-16 15:50 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	linux-kernel, kernel-team, Matt Fleming

On Mon, Jun 16, 2025 at 2:55 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> From: Matt Fleming <mfleming@cloudflare.com>
>
> Calls to kfree() in trie_free() can be expensive for KASAN-enabled
> kernels. This can cause soft lockup warnings when traversing large maps,

I think this could also happen to KASAN-disabled kernels, so the commit log
is a bit misleading.

>
>   watchdog: BUG: soft lockup - CPU#41 stuck for 76s! [kworker/u518:14:1158211]
>
> Avoid an unbounded loop and periodically check whether we should reschedule.
>
> Signed-off-by: Matt Fleming <mfleming@cloudflare.com>

Other than that:

Acked-by: Song Liu <song@kernel.org>

> ---
>  kernel/bpf/lpm_trie.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
> index be66d7e520e0..a35619cd99f6 100644
> --- a/kernel/bpf/lpm_trie.c
> +++ b/kernel/bpf/lpm_trie.c
> @@ -646,6 +646,8 @@ static void trie_free(struct bpf_map *map)
>                         RCU_INIT_POINTER(*slot, NULL);
>                         break;
>                 }
> +
> +               cond_resched();
>         }
>
>  out:
> --
> 2.34.1
>


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-16 15:50 ` Song Liu
@ 2025-06-17  9:42   ` Matt Fleming
  2025-06-17 15:55     ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2025-06-17  9:42 UTC (permalink / raw)
  To: Song Liu
  Cc: Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf,
	linux-kernel, kernel-team, Matt Fleming

On Mon, Jun 16, 2025 at 4:51 PM Song Liu <song@kernel.org> wrote:
>
> On Mon, Jun 16, 2025 at 2:55 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >
> > From: Matt Fleming <mfleming@cloudflare.com>
> >
> > Calls to kfree() in trie_free() can be expensive for KASAN-enabled
> > kernels. This can cause soft lockup warnings when traversing large maps,
>
> I think this could also happen to KASAN-disabled kernels, so the commit log
> is a bit misleading.

This issue can definitely affect KASAN-disabled kernels.

I mentioned KASAN to give context and explain why I saw this and
nobody else seems to have reported it. I'm happy to reword this part
of the commit message if needed but I still think it should mention
KASAN somewhere because that's the reason I discovered it.

Thanks,
Matt


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-17  9:42   ` Matt Fleming
@ 2025-06-17 15:55     ` Alexei Starovoitov
  2025-06-18 12:29       ` Matt Fleming
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2025-06-17 15:55 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Song Liu, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	kernel-team, Matt Fleming

On Tue, Jun 17, 2025 at 2:43 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> On Mon, Jun 16, 2025 at 4:51 PM Song Liu <song@kernel.org> wrote:
> >
> > On Mon, Jun 16, 2025 at 2:55 AM Matt Fleming <matt@readmodwrite.com> wrote:
> > >
> > > From: Matt Fleming <mfleming@cloudflare.com>
> > >
> > > Calls to kfree() in trie_free() can be expensive for KASAN-enabled
> > > kernels. This can cause soft lockup warnings when traversing large maps,
> >
> > I think this could also happen to KASAN-disabled kernels, so the commit log
> > is a bit misleading.
>
> This issue can definitely affect KASAN-disabled kernels.
>
> I mentioned KASAN to give context and explain why I saw this and
> nobody else seems to have reported it. I'm happy to reword this part
> of the commit message if needed but I still think it should mention
> KASAN somewhere because that's the reason I discovered it.

kfree() is so slow that it triggers a soft lockup?

> soft lockup - CPU#41 stuck for 76s

How many elements are in the trie that it takes 76 seconds??

I feel the issue is different.
It seems the trie_free() algorithm doesn't scale.
Pls share a full reproducer.

pw-bot: cr


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-17 15:55     ` Alexei Starovoitov
@ 2025-06-18 12:29       ` Matt Fleming
  2025-06-18 14:01         ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2025-06-18 12:29 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Song Liu, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	kernel-team, Matt Fleming, hawk

On Tue, Jun 17, 2025 at 4:55 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Tue, Jun 17, 2025 at 2:43 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >
>
> > soft lockup - CPU#41 stuck for 76s
>
> How many elements are in the trie that it takes 76 seconds??

We run our maps with potentially millions of entries, so it's the size
of the map plus the fact that kfree() does more work with KASAN that
triggers this for us.

> I feel the issue is different.
> It seems the trie_free() algorithm doesn't scale.
> Pls share a full reproducer.

Yes, the scalability of the algorithm is also an issue. Jesper (CC'd)
had some thoughts on this.

But regardless, it seems like a bad idea to have an unbounded loop
inside the kernel that processes user-controlled data.

Thanks,
Matt


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-18 12:29       ` Matt Fleming
@ 2025-06-18 14:01         ` Alexei Starovoitov
  2025-06-18 14:27           ` Ignat Korchagin
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2025-06-18 14:01 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Song Liu, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Yonghong Song, John Fastabend,
	KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML,
	kernel-team, Matt Fleming, Jesper Dangaard Brouer

On Wed, Jun 18, 2025 at 5:29 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> On Tue, Jun 17, 2025 at 4:55 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 2:43 AM Matt Fleming <matt@readmodwrite.com> wrote:
> > >
> >
> > > soft lockup - CPU#41 stuck for 76s
> >
> > How many elements are in the trie that it takes 76 seconds??
>
> We run our maps with potentially millions of entries, so it's the size
> of the map plus the fact that kfree() does more work with KASAN that
> triggers this for us.
>
> > I feel the issue is different.
> > It seems the trie_free() algorithm doesn't scale.
> > Pls share a full reproducer.
>
> Yes, the scalability of the algorithm is also an issue. Jesper (CC'd)
> had some thoughts on this.
>
> But regardless, it seems like a bad idea to have an unbounded loop
> inside the kernel that processes user-controlled data.

1M kfree should still be very fast even with kasan, lockdep, etc.
76 seconds is an algorithm problem. Address the root cause.


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-18 14:01         ` Alexei Starovoitov
@ 2025-06-18 14:27           ` Ignat Korchagin
  2025-06-18 14:50             ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Ignat Korchagin @ 2025-06-18 14:27 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matt Fleming, Song Liu, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, bpf, LKML, kernel-team, Matt Fleming,
	Jesper Dangaard Brouer

On Wed, Jun 18, 2025 at 3:01 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Wed, Jun 18, 2025 at 5:29 AM Matt Fleming <matt@readmodwrite.com> wrote:
> >
> > On Tue, Jun 17, 2025 at 4:55 PM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > On Tue, Jun 17, 2025 at 2:43 AM Matt Fleming <matt@readmodwrite.com> wrote:
> > > >
> > >
> > > > soft lockup - CPU#41 stuck for 76s
> > >
> > > How many elements are in the trie that it takes 76 seconds??
> >
> > We run our maps with potentially millions of entries, so it's the size
> > of the map plus the fact that kfree() does more work with KASAN that
> > triggers this for us.
> >
> > > I feel the issue is different.
> > > It seems the trie_free() algorithm doesn't scale.
> > > Pls share a full reproducer.
> >
> > Yes, the scalability of the algorithm is also an issue. Jesper (CC'd)
> > had some thoughts on this.
> >
> > But regardless, it seems like a bad idea to have an unbounded loop
> > inside the kernel that processes user-controlled data.
>
> 1M kfree should still be very fast even with kasan, lockdep, etc.
> 76 seconds is an algorithm problem. Address the root cause.

What if later we have 1G? 100G? Apart from the root cause we still
have "scalability concerns" unless we can somehow reimplement this as
O(1).

Ignat


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-18 14:27           ` Ignat Korchagin
@ 2025-06-18 14:50             ` Alexei Starovoitov
  2025-06-27 13:20               ` Matt Fleming
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2025-06-18 14:50 UTC (permalink / raw)
  To: Ignat Korchagin
  Cc: Matt Fleming, Song Liu, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, bpf, LKML, kernel-team, Matt Fleming,
	Jesper Dangaard Brouer

On Wed, Jun 18, 2025 at 7:27 AM Ignat Korchagin <ignat@cloudflare.com> wrote:
>
> On Wed, Jun 18, 2025 at 3:01 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > On Wed, Jun 18, 2025 at 5:29 AM Matt Fleming <matt@readmodwrite.com> wrote:
> > >
> > > On Tue, Jun 17, 2025 at 4:55 PM Alexei Starovoitov
> > > <alexei.starovoitov@gmail.com> wrote:
> > > >
> > > > On Tue, Jun 17, 2025 at 2:43 AM Matt Fleming <matt@readmodwrite.com> wrote:
> > > > >
> > > >
> > > > > soft lockup - CPU#41 stuck for 76s
> > > >
> > > > How many elements are in the trie that it takes 76 seconds??
> > >
> > > We run our maps with potentially millions of entries, so it's the size
> > > of the map plus the fact that kfree() does more work with KASAN that
> > > triggers this for us.
> > >
> > > > I feel the issue is different.
> > > > It seems the trie_free() algorithm doesn't scale.
> > > > Pls share a full reproducer.
> > >
> > > Yes, the scalability of the algorithm is also an issue. Jesper (CC'd)
> > > had some thoughts on this.
> > >
> > > But regardless, it seems like a bad idea to have an unbounded loop
> > > inside the kernel that processes user-controlled data.
> >
> > 1M kfree should still be very fast even with kasan, lockdep, etc.
> > 76 seconds is an algorithm problem. Address the root cause.
>
> What if later we have 1G? 100G? Apart from the root cause we still
> have "scalability concerns" unless we can somehow reimplement this as
> O(1)

Do your homework pls.
Set max_entries to 100G and report back.
Then set max_entries to 1G _with_ cond_resched() hack and report back.


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-18 14:50             ` Alexei Starovoitov
@ 2025-06-27 13:20               ` Matt Fleming
  2025-06-27 19:36                 ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2025-06-27 13:20 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Ignat Korchagin, Song Liu, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, bpf, LKML, kernel-team, Matt Fleming,
	Jesper Dangaard Brouer

On Wed, Jun 18, 2025 at 3:50 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Do your homework pls.
> Set max_entries to 100G and report back.
> Then set max_entries to 1G _with_ cond_rescehd() hack and report back.

Hi,

I put together a small reproducer
https://github.com/xdp-project/bpf-examples/pull/130 which gives the
following results on an AMD EPYC 9684X 96-Core machine:

| Num of map entries | Linux 6.12.32 |  KASAN  | cond_resched |
|--------------------|---------------|---------|--------------|
| 1K                 | 0ms           | 4ms     | 0ms          |
| 10K                | 2ms           | 50ms    | 2ms          |
| 100K               | 32ms          | 511ms   | 32ms         |
| 1M                 | 427ms         | 5478ms  | 420ms        |
| 10M                | 5056ms        | 55714ms | 5040ms       |
| 100M               | 67253ms       | *       | 62630ms      |

* - I gave up waiting after 11.5 hours

Enabling KASAN makes the durations an order of magnitude bigger. The
cond_resched() patch eliminates the soft lockups with no effect on the
times.

Thanks,
Matt


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-27 13:20               ` Matt Fleming
@ 2025-06-27 19:36                 ` Alexei Starovoitov
  2025-06-30 13:28                   ` Matt Fleming
  0 siblings, 1 reply; 14+ messages in thread
From: Alexei Starovoitov @ 2025-06-27 19:36 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Ignat Korchagin, Song Liu, Alexei Starovoitov, Daniel Borkmann,
	Andrii Nakryiko, Martin KaFai Lau, Eduard Zingerman,
	Yonghong Song, John Fastabend, KP Singh, Stanislav Fomichev,
	Hao Luo, Jiri Olsa, bpf, LKML, kernel-team, Matt Fleming,
	Jesper Dangaard Brouer

On Fri, Jun 27, 2025 at 6:20 AM Matt Fleming <matt@readmodwrite.com> wrote:
>
> On Wed, Jun 18, 2025 at 3:50 PM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Do your homework pls.
> > Set max_entries to 100G and report back.
> Then set max_entries to 1G _with_ cond_resched() hack and report back.
>
> Hi,
>
> I put together a small reproducer
> https://github.com/xdp-project/bpf-examples/pull/130 which gives the
> following results on an AMD EPYC 9684X 96-Core machine:
>
> | Num of map entries | Linux 6.12.32 |  KASAN  | cond_resched |
> |--------------------|---------------|---------|--------------|
> | 1K                 | 0ms           | 4ms     | 0ms          |
> | 10K                | 2ms           | 50ms    | 2ms          |
> | 100K               | 32ms          | 511ms   | 32ms         |
> | 1M                 | 427ms         | 5478ms  | 420ms        |
> | 10M                | 5056ms        | 55714ms | 5040ms       |
> | 100M               | 67253ms       | *       | 62630ms      |
>
> * - I gave up waiting after 11.5 hours
>
> Enabling KASAN makes the durations an order of magnitude bigger. The
> cond_resched() patch eliminates the soft lockups with no effect on the
> times.

Good. Now you see my point, right?
The cond_resched() doesn't fix the issue.
1hr to free a trie of 100M elements is horrible.
Try 100M kmalloc/kfree to see that slab is not the issue.
trie_free() algorithm is to blame. It doesn't need to start
from the root for every element. Fix the root cause.


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-27 19:36                 ` Alexei Starovoitov
@ 2025-06-30 13:28                   ` Matt Fleming
  2025-07-01 16:25                     ` Alexei Starovoitov
  0 siblings, 1 reply; 14+ messages in thread
From: Matt Fleming @ 2025-06-30 13:28 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matt Fleming, Ignat Korchagin, Song Liu, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML, kernel-team,
	Jesper Dangaard Brouer

On Fri, 27 Jun 2025 at 20:36, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Good. Now you see my point, right?
> The cond_resched() doesn't fix the issue.
> 1hr to free a trie of 100M elements is horrible.
> Try 100M kmalloc/kfree to see that slab is not the issue.
> trie_free() algorithm is to blame. It doesn't need to start
> from the root for every element. Fix the root cause.

It doesn't take an hour to free 100M entries; the table showed it
takes about a minute (67 or 62 seconds).

I never claimed that kmalloc/kfree was at fault. I said that the loop
in trie_free() has no preemption, and that's a problem with tries with
millions of entries.

Of course, rewriting the algorithm used in the lpm trie code would
make this less of an issue. But this would require a major rework.
It's not as simple as improving trie_free() alone. FWIW I tried using
a recursive algorithm in trie_free() and the results are slightly
better, but it still takes multiple seconds to free 10M entries (4.3s)
and under a minute for 100M (56.7s). To fix this properly it's
necessary to use more than two children per node to reduce the height
of the trie. And in the meantime, anyone who uses maps with millions
of entries is gonna have the kthread spin in a loop without
preemption.

Thanks,
Matt


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-06-30 13:28                   ` Matt Fleming
@ 2025-07-01 16:25                     ` Alexei Starovoitov
  2025-07-01 16:41                       ` Ignat Korchagin
  2025-07-03 11:00                       ` Matt Fleming
  0 siblings, 2 replies; 14+ messages in thread
From: Alexei Starovoitov @ 2025-07-01 16:25 UTC (permalink / raw)
  To: Matt Fleming
  Cc: Matt Fleming, Ignat Korchagin, Song Liu, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML, kernel-team,
	Jesper Dangaard Brouer

On Mon, Jun 30, 2025 at 6:28 AM Matt Fleming <mfleming@cloudflare.com> wrote:
>
> On Fri, 27 Jun 2025 at 20:36, Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Good. Now you see my point, right?
> > The cond_resched() doesn't fix the issue.
> > 1hr to free a trie of 100M elements is horrible.
> > Try 100M kmalloc/kfree to see that slab is not the issue.
> > trie_free() algorithm is to blame. It doesn't need to start
> > from the root for every element. Fix the root cause.
>
> It doesn't take an hour to free 100M entries, the table showed it
> takes about a minute (67 or 62 seconds).

yeah. I misread the numbers.

> I never claimed that kmalloc/kfree was at fault. I said that the loop
> in trie_free() has no preemption, and that's a problem with tries with
> millions of entries.
>
> Of course, rewriting the algorithm used in the lpm trie code would
> make this less of an issue. But this would require a major rework.
> It's not as simple as improving trie_free() alone. FWIW I tried using
> a recursive algorithm in trie_free() and the results are slightly
> better, but it still takes multiple seconds to free 10M entries (4.3s)
> and under a minute for 100M (56.7s). To fix this properly it's
> necessary to use more than two children per node to reduce the height
> of the trie.

What is the height of a 100M tree?

What kind of "recursive algo" you have in mind?
Could you try to keep a stack of nodes visited and once leaf is
freed pop a node and continue walking.
Then total height won't be a factor.
The stack would need to be kmalloc-ed, of course,
but still should be faster than walking from the root.

> And in the meantime, anyone who uses maps with millions
> of entries is gonna have the kthread spin in a loop without
> preemption.

Yes, because judging by this thread I don't believe you'll come
back and fix it properly.
I'd rather have this acute pain bothering somebody to fix it
for good instead of papering over.


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-07-01 16:25                     ` Alexei Starovoitov
@ 2025-07-01 16:41                       ` Ignat Korchagin
  2025-07-03 11:00                       ` Matt Fleming
  1 sibling, 0 replies; 14+ messages in thread
From: Ignat Korchagin @ 2025-07-01 16:41 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matt Fleming, Matt Fleming, Song Liu, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML, kernel-team,
	Jesper Dangaard Brouer

On Tue, Jul 1, 2025 at 6:25 PM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> On Mon, Jun 30, 2025 at 6:28 AM Matt Fleming <mfleming@cloudflare.com> wrote:
> >
> > On Fri, 27 Jun 2025 at 20:36, Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> > >
> > > Good. Now you see my point, right?
> > > The cond_resched() doesn't fix the issue.
> > > 1hr to free a trie of 100M elements is horrible.
> > > Try 100M kmalloc/kfree to see that slab is not the issue.
> > > trie_free() algorithm is to blame. It doesn't need to start
> > > from the root for every element. Fix the root cause.
> >
> > It doesn't take an hour to free 100M entries, the table showed it
> > takes about a minute (67 or 62 seconds).
>
> yeah. I misread the numbers.
>
> > I never claimed that kmalloc/kfree was at fault. I said that the loop
> > in trie_free() has no preemption, and that's a problem with tries with
> > millions of entries.
> >
> > Of course, rewriting the algorithm used in the lpm trie code would
> > make this less of an issue. But this would require a major rework.
> > It's not as simple as improving trie_free() alone. FWIW I tried using
> > a recursive algorithm in trie_free() and the results are slightly
> > better, but it still takes multiple seconds to free 10M entries (4.3s)
> > and under a minute for 100M (56.7s). To fix this properly it's
> > necessary to use more than two children per node to reduce the height
> > of the trie.
>
> What is the height of 100m tree ?
>
> What kind of "recursive algo" you have in mind?
> Could you try to keep a stack of nodes visited and once leaf is
> freed pop a node and continue walking.
> Then total height won't be a factor.
> The stack would need to be kmalloc-ed, of course,
> but still should be faster than walking from the root.
>
> > And in the meantime, anyone who uses maps with millions
> > of entries is gonna have the kthread spin in a loop without
> > preemption.
>
> Yes, because judging by this thread I don't believe you'll come
> back and fix it properly.
> I'd rather have this acute pain bothering somebody to fix it
> for good instead of papering over.

I think we need both anyway, if only because we need something to
backport to stable. A full re-implementation of the trie might be
viewed as a new feature, but older kernels need to be "fixed" as well.


* Re: [PATCH] bpf: Call cond_resched() to avoid soft lockup in trie_free()
  2025-07-01 16:25                     ` Alexei Starovoitov
  2025-07-01 16:41                       ` Ignat Korchagin
@ 2025-07-03 11:00                       ` Matt Fleming
  1 sibling, 0 replies; 14+ messages in thread
From: Matt Fleming @ 2025-07-03 11:00 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Matt Fleming, Ignat Korchagin, Song Liu, Alexei Starovoitov,
	Daniel Borkmann, Andrii Nakryiko, Martin KaFai Lau,
	Eduard Zingerman, Yonghong Song, John Fastabend, KP Singh,
	Stanislav Fomichev, Hao Luo, Jiri Olsa, bpf, LKML, kernel-team,
	Jesper Dangaard Brouer

On Tue, 1 Jul 2025 at 17:25, Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> What is the height of a 100M tree?

Because this trie implementation is essentially a binary tree, the
height is roughly log2(100M) ≈ 27.
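
For reference, that balanced-binary-tree bound can be computed exactly
(2^26 ≈ 67M is not enough, 2^27 ≈ 134M is, so 100M entries need a height of
about 27). A quick sketch in plain C; `min_trie_height` is an illustrative
name, not anything from the kernel, and the real trie can of course be
deeper than this if it is unbalanced, up to the prefix length:

```c
#include <assert.h>

/* Smallest h such that a full binary tree of height h has at least
 * `entries` leaves, i.e. ceil(log2(entries)), computed without libm. */
static int min_trie_height(unsigned long long entries)
{
	unsigned long long cap = 1;
	int height = 0;

	while (cap < entries) {
		cap <<= 1;
		height++;
	}
	return height;
}
```

min_trie_height(100000000ULL) comes out to 27, so a perfectly balanced
100M-entry trie is about 27 levels deep.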

> What kind of "recursive algo" you have in mind?

Something like this:

---
 kernel/bpf/lpm_trie.c | 38 ++++++++++++++++++++++++++++++++++++++
 1 file changed, 38 insertions(+)

diff --git a/kernel/bpf/lpm_trie.c b/kernel/bpf/lpm_trie.c
index 010e91ac978e..f4b07920a321 100644
--- a/kernel/bpf/lpm_trie.c
+++ b/kernel/bpf/lpm_trie.c
@@ -33,7 +33,13 @@ struct lpm_trie {
 	struct bpf_map map;
 	struct lpm_trie_node __rcu *root;
 	size_t n_entries;
+	/* Maximum prefix length permitted */
 	size_t max_prefixlen;
+	/* Largest prefix length of any node ever inserted. Used for an
+	 * optimisation in trie_free() and is not updated on node
+	 * deletion.
+	 */
+	size_t largest_prefixlen;
 	size_t data_size;
 	spinlock_t lock;
 };
@@ -450,6 +456,10 @@ static long trie_update_elem(struct bpf_map *map,
 out:
 	if (ret)
 		kfree(new_node);
+	else
+		trie->largest_prefixlen = max(trie->largest_prefixlen,
+					      key->prefixlen);
+
 	spin_unlock_irqrestore(&trie->lock, irq_flags);
 	kfree_rcu(free_node, rcu);
 
@@ -599,12 +609,40 @@ static struct bpf_map *trie_alloc(union bpf_attr *attr)
 	return &trie->map;
 }
 
+static void __trie_free(struct lpm_trie_node __rcu **slot)
+{
+	struct lpm_trie_node *node;
+
+	node = rcu_dereference_protected(*slot, 1);
+	if (!node)
+		return;
+
+	if (rcu_access_pointer(node->child[0]))
+		__trie_free(&node->child[0]);
+
+	if (rcu_access_pointer(node->child[1]))
+		__trie_free(&node->child[1]);
+
+	kfree(node);
+	RCU_INIT_POINTER(*slot, NULL);
+}
+
 static void trie_free(struct bpf_map *map)
 {
 	struct lpm_trie *trie = container_of(map, struct lpm_trie, map);
 	struct lpm_trie_node __rcu **slot;
 	struct lpm_trie_node *node;
 
+	/* When we know the largest prefixlen used by any node is <= 32
+	 * we're guaranteed that the height of the trie is at most 32.
+	 * And in that case, we can use a faster recursive freeing
+	 * algorithm without worrying about blowing the stack.
+	 */
+	if (trie->largest_prefixlen <= 32) {
+		__trie_free(&trie->root);
+		goto out;
+	}
+
 	/* Always start at the root and walk down to a node that has no
 	 * children. Then free that node, nullify its reference in the parent
 	 * and start over.

> Could you try to keep a stack of nodes visited, and once a leaf is
> freed pop a node and continue walking?
> Then total height won't be a factor.
> The stack would need to be kmalloc-ed, of course,
> but still should be faster than walking from the root.

Sure, I can give this a shot. Plus we won't need to guard against
blowing up the stack.
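
A userspace sketch of that explicit-stack walk, with a simplified two-child
node standing in for struct lpm_trie_node (no RCU, no data); the names
(`struct node`, `build`, `stack_free`) are illustrative, not kernel code:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Simplified stand-in for struct lpm_trie_node: two children, nothing else. */
struct node {
	struct node *child[2];
};

/* Build a full binary trie of the given height; returns nodes created. */
static size_t build(struct node **slot, int height)
{
	struct node *n = calloc(1, sizeof(*n));

	if (!n)
		abort();
	*slot = n;
	if (height == 0)
		return 1;
	return 1 + build(&n->child[0], height - 1) +
		   build(&n->child[1], height - 1);
}

/*
 * Free the trie with an explicit stack instead of restarting from the
 * root for every leaf. A node stays on the stack until both of its
 * subtrees are gone, so the stack never holds more than one node per
 * level (max_depth entries bound it) and each node is examined O(1)
 * times.
 */
static size_t stack_free(struct node *root, int max_depth)
{
	struct node **stack;
	size_t top = 0, freed = 0;

	if (!root)
		return 0;
	stack = malloc(sizeof(*stack) * (max_depth + 1));
	if (!stack)
		abort();

	stack[top++] = root;
	while (top) {
		struct node *n = stack[top - 1];

		if (n->child[0]) {		/* descend left, detach */
			stack[top++] = n->child[0];
			n->child[0] = NULL;
		} else if (n->child[1]) {	/* then right, detach */
			stack[top++] = n->child[1];
			n->child[1] = NULL;
		} else {			/* both gone: free and pop */
			free(n);
			freed++;
			top--;
		}
	}
	free(stack);
	return freed;
}
```

Because a node is only popped once both children are freed, the stack is
bounded by the trie height rather than the node count, which is what makes
this cheaper than re-walking from the root for every leaf.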
