handling EINTR from bpf_map_lookup_batch

* handling EINTR from bpf_map_lookup_batch
@ 2025-02-04 18:08 Yan Zhai
  2025-02-05  2:19 ` Hou Tao
  0 siblings, 1 reply; 10+ messages in thread
From: Yan Zhai @ 2025-02-04 18:08 UTC (permalink / raw)
  To: bpf; +Cc: kernel-team

I am getting EINTR when trying to use bpf_map_lookup_batch on an
array_of_maps. The error happens when there is a "hole" in the array.
For example, say the outer map has max entries of 256, each inner map
is used for a transport protocol, and I only populated key 6 and
17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
This so far seems to only happen with array of maps. Does it make
sense to allow skipping to the next key for this map type? Something
like:

diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
index c420edbfb7c8..83915a8059ef 100644
--- a/kernel/bpf/syscall.c
+++ b/kernel/bpf/syscall.c
@@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
                                         attr->batch.elem_flags);

                if (err == -ENOENT) {
+                       if (IS_FD_ARRAY(map))
+                               goto next_key;
                        if (retry) {
                                retry--;
                                continue;
@@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
                        goto free_buf;
                }

+next_key:
                if (!prev_key)
                        prev_key = buf_prevkey;

Some context about my scenario, in case anyone is curious: I am trying
to associate each map with a userspace service in a multi-tenant
environment. This is in addition to cgroup accounting, in case the
creator cgroup goes away, e.g. systemd service restarts always
recreate cgroups. We also want to monitor the utilization level of
non-prealloc maps of different tenants. Dealing with inner maps is
not always trivial. To connect the dots I chose to read these IDs
periodically and link them to the tenant of the outer map, and that
is where this EINTR occurred.

best
Yan


* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-04 18:08 handling EINTR from bpf_map_lookup_batch Yan Zhai
@ 2025-02-05  2:19 ` Hou Tao
  2025-02-05  9:56   ` Alexei Starovoitov
  2025-02-05 16:15   ` Yan Zhai
  0 siblings, 2 replies; 10+ messages in thread
From: Hou Tao @ 2025-02-05  2:19 UTC (permalink / raw)
  To: Yan Zhai, bpf; +Cc: kernel-team

Hi,

On 2/5/2025 2:08 AM, Yan Zhai wrote:
> I am getting EINTR when trying to use bpf_map_lookup_batch on an
> array_of_maps. The error happens when there is a "hole" in the array.
> For example, say the outer map has max entries of 256, each inner map
> is used for a transport protocol, and I only populated key 6 and
> 17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
> This so far seems to only happen with array of maps. Does it make
> sense to allow skipping to the next key for this map type? Something
> like:
>
> diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> index c420edbfb7c8..83915a8059ef 100644
> --- a/kernel/bpf/syscall.c
> +++ b/kernel/bpf/syscall.c
> @@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
>                                          attr->batch.elem_flags);
>
>                 if (err == -ENOENT) {
> +                       if (IS_FD_ARRAY(map)
> +                               goto next_key;

It seems only BPF_MAP_TYPE_ARRAY_OF_MAPS supports batched operation, so
map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS will be enough. It is also
better to reset err to 0, otherwise generic_map_lookup_batch may return
-ENOENT.
>                         if (retry) {
>                                 retry--;
>                                 continue;
> @@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
>                         goto free_buf;
>                 }
>
> +next_key:
>                 if (!prev_key)
>                         prev_key = buf_prevkey;
>

Makes sense. Please add a selftest for it. Another way is to return id 0
for these non-existent values in the fd array, but that may break existing
programs. Just skipping the empty array slot is better.
> Also the context about my scenario if anyone is curious: I am trying
> to associate each map to a userspace service in a multi tenant
> environment. This is an addition to cgroup accounting, in case the
> creator cgroup goes away, e.g. systemd service restarts always
> recreate cgroups. And we also want to monitor the utilization level of
> non-prealloc maps of different tenants. When dealing with inner maps,
> it is not always trivial. To connect dots I choose to read these IDs
> periodically and link them to the tenant of the outer map, that's
> where this EINTR occurred.
>
> best
> Yan
>
> .



* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-05  2:19 ` Hou Tao
@ 2025-02-05  9:56   ` Alexei Starovoitov
  2025-02-05 16:27     ` Yan Zhai
  2025-02-05 16:15   ` Yan Zhai
  1 sibling, 1 reply; 10+ messages in thread
From: Alexei Starovoitov @ 2025-02-05  9:56 UTC (permalink / raw)
  To: Hou Tao; +Cc: Yan Zhai, bpf, kernel-team

On Wed, Feb 5, 2025 at 2:19 AM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/5/2025 2:08 AM, Yan Zhai wrote:
> > I am getting EINTR when trying to use bpf_map_lookup_batch on an
> > array_of_maps. The error happens when there is a "hole" in the array.
> > For example, say the outer map has max entries of 256, each inner map
> > is used for a transport protocol, and I only populated key 6 and
> > 17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
> > This so far seems to only happen with array of maps. Does it make
> > sense to allow skipping to the next key for this map type? Something
> > like:
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c420edbfb7c8..83915a8059ef 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                                          attr->batch.elem_flags);
> >
> >                 if (err == -ENOENT) {
> > +                       if (IS_FD_ARRAY(map)
> > +                               goto next_key;
>
> It seems only BPF_MAP_TYPE_ARRAY_OF_MAPS supports batched operation, so
> map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS will be enough. It is also
> better to reset err as 0, otherwise generic_map_lookup_batch may return
> -ENOENT.
> >                         if (retry) {
> >                                 retry--;
> >                                 continue;
> > @@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                         goto free_buf;
> >                 }
> >
> > +next_key:
> >                 if (!prev_key)
> >                         prev_key = buf_prevkey;
> >
>
> Make sense.  Please add a selftest for it. Another way is to return id 0
> for these non-existent values in the fd array, but it may break existed
> prog. Just skipping the empty array slot is better.

Let's not invent new magic return values.

But stepping back... why do we have this EINTR case at all?
Can we always goto next_key for all map types?
The command returns a set of (key, value) pairs.
It's always better to skip than to get stuck in EINTR,
since EINTR implies that user space should retry and that it
might be successful next time.
Here that's not the case.
I don't see any selftests for EINTR, so I suspect it was added
as an escape path in case the retry count exceeds 3, and the author
assumed that it should never happen in practice, so EINTR was expected
to be 'never happens'. Clearly that's not the case.
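
From the caller's side, the point about EINTR can be illustrated with a
drain loop that treats EINTR as retryable. This is only a sketch:
batch_fn, drain_map, stuck_batch and MAX_EINTR_RETRIES are hypothetical,
and the callback is assumed to return 0 or a negative errno, with
-ENOENT marking the end of the map. In real use the callback would
presumably wrap the batch lookup call (for example libbpf's
bpf_map_lookup_batch()); that wiring is left out here.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_EINTR_RETRIES 5	/* arbitrary bound chosen for this sketch */

/*
 * Hypothetical callback: fetch up to *count entries starting after
 * *in_batch (NULL means "from the start"), store the resume token in
 * *out_batch, update *count, and return 0 or a negative errno.
 */
typedef int (*batch_fn)(void *ctx, const uint32_t *in_batch,
			uint32_t *out_batch, uint32_t *keys,
			uint32_t *values, uint32_t *count);

/* Drains a map in chunks; returns entries copied or a negative errno. */
static int drain_map(batch_fn fn, void *ctx, uint32_t *keys,
		     uint32_t *values, uint32_t capacity, uint32_t chunk)
{
	uint32_t token = 0, total = 0;
	const uint32_t *in = NULL;
	int eintr_left = MAX_EINTR_RETRIES;

	assert(fn && chunk > 0);
	while (total < capacity) {
		uint32_t count = capacity - total < chunk ?
				 capacity - total : chunk;
		int err = fn(ctx, in, &token, keys + total, values + total,
			     &count);

		if (err == -EINTR) {
			/* EINTR promises that a retry may succeed. */
			if (eintr_left-- > 0)
				continue;
			return -EINTR;	/* it never did: give up */
		}
		if (err && err != -ENOENT)
			return err;
		total += count;
		if (err == -ENOENT)
			break;		/* end of map */
		in = &token;
		eintr_left = MAX_EINTR_RETRIES;
	}
	return (int)total;
}

/* Stand-in for a lookup stuck on a hole: fails the same way every time. */
static int stuck_batch(void *ctx, const uint32_t *in_batch,
		       uint32_t *out_batch, uint32_t *keys,
		       uint32_t *values, uint32_t *count)
{
	(void)in_batch;
	(void)out_batch;
	(void)keys;
	(void)values;
	++*(int *)ctx;		/* count the attempts */
	*count = 0;
	return -EINTR;
}
```

Calling drain_map() with stuck_batch makes MAX_EINTR_RETRIES + 1
attempts and still ends in -EINTR. A well-behaved caller that honours
the usual meaning of EINTR gains nothing from retrying when the error is
deterministic.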


* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-05  2:19 ` Hou Tao
  2025-02-05  9:56   ` Alexei Starovoitov
@ 2025-02-05 16:15   ` Yan Zhai
  1 sibling, 0 replies; 10+ messages in thread
From: Yan Zhai @ 2025-02-05 16:15 UTC (permalink / raw)
  To: Hou Tao; +Cc: bpf, kernel-team

On Tue, Feb 4, 2025 at 8:19 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/5/2025 2:08 AM, Yan Zhai wrote:
> > I am getting EINTR when trying to use bpf_map_lookup_batch on an
> > array_of_maps. The error happens when there is a "hole" in the array.
> > For example, say the outer map has max entries of 256, each inner map
> > is used for a transport protocol, and I only populated key 6 and
> > 17 for TCP and UDP. Then when I do batch lookup, I always get EINTR.
> > This so far seems to only happen with array of maps. Does it make
> > sense to allow skipping to the next key for this map type? Something
> > like:
> >
> > diff --git a/kernel/bpf/syscall.c b/kernel/bpf/syscall.c
> > index c420edbfb7c8..83915a8059ef 100644
> > --- a/kernel/bpf/syscall.c
> > +++ b/kernel/bpf/syscall.c
> > @@ -2027,6 +2027,8 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                                          attr->batch.elem_flags);
> >
> >                 if (err == -ENOENT) {
> > +                       if (IS_FD_ARRAY(map)
> > +                               goto next_key;
>
> It seems only BPF_MAP_TYPE_ARRAY_OF_MAPS supports batched operation, so
> map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS will be enough. It is also
> better to reset err as 0, otherwise generic_map_lookup_batch may return
> -ENOENT.

Jumping to the next key always restarts the loop, so err will be
set correctly afterwards.

> >                         if (retry) {
> >                                 retry--;
> >                                 continue;
> > @@ -2048,6 +2050,7 @@ int generic_map_lookup_batch(struct bpf_map *map,
> >                         goto free_buf;
> >                 }
> >
> > +next_key:
> >                 if (!prev_key)
> >                         prev_key = buf_prevkey;
> >
>
> Make sense.  Please add a selftest for it. Another way is to return id 0
> for these non-existent values in the fd array, but it may break existed
> prog. Just skipping the empty array slot is better.

Working on it.

thanks
Yan

> > Also the context about my scenario if anyone is curious: I am trying
> > to associate each map to a userspace service in a multi tenant
> > environment. This is an addition to cgroup accounting, in case the
> > creator cgroup goes away, e.g. systemd service restarts always
> > recreate cgroups. And we also want to monitor the utilization level of
> > non-prealloc maps of different tenants. When dealing with inner maps,
> > it is not always trivial. To connect dots I choose to read these IDs
> > periodically and link them to the tenant of the outer map, that's
> > where this EINTR occurred.
> >
> > best
> > Yan
> >
> > .
>


* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-05  9:56   ` Alexei Starovoitov
@ 2025-02-05 16:27     ` Yan Zhai
  2025-02-05 17:00       ` Yan Zhai
  2025-02-06  0:46       ` Hou Tao
  0 siblings, 2 replies; 10+ messages in thread
From: Yan Zhai @ 2025-02-05 16:27 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Hou Tao, bpf, kernel-team

On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
<alexei.starovoitov@gmail.com> wrote:
>
> Let's not invent new magic return values.
>
> But stepping back... why do we have this EINTR case at all?
> Can we always goto next_key for all map types?
> The command returns and a set of (key, value) pairs.
> It's always better to skip then get stuck in EINTR,
> since EINTR implies that the user space should retry and it
> might be successful next time.
> While here it's not the case.
> I don't see any selftests for EINTR, so I suspect it was added
> as escape path in case retry count exceeds 3 and author assumed
> that it should never happen in practice, so EINTR was expected
> to be 'never happens'. Clearly that's not the case.

It makes more sense to me to just go to the next key for all types.
At least for the current users of generic batch lookup, arrays and
lpm_trie, I didn't notice any case where a retry would help.

best
Yan


* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-05 16:27     ` Yan Zhai
@ 2025-02-05 17:00       ` Yan Zhai
  2025-02-06  0:46       ` Hou Tao
  1 sibling, 0 replies; 10+ messages in thread
From: Yan Zhai @ 2025-02-05 17:00 UTC (permalink / raw)
  To: Alexei Starovoitov; +Cc: Hou Tao, bpf, kernel-team

On Wed, Feb 05, 2025 at 10:27:25AM -0600, Yan Zhai wrote:
> On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
> >
> > Let's not invent new magic return values.
> >
> > But stepping back... why do we have this EINTR case at all?
> > Can we always goto next_key for all map types?
> > The command returns and a set of (key, value) pairs.
> > It's always better to skip then get stuck in EINTR,
> > since EINTR implies that the user space should retry and it
> > might be successful next time.
> > While here it's not the case.
> > I don't see any selftests for EINTR, so I suspect it was added
> > as escape path in case retry count exceeds 3 and author assumed
> > that it should never happen in practice, so EINTR was expected
> > to be 'never happens'. Clearly that's not the case.
> 
> It makes more sense to me if we just goto the next key for all types.
> At least for current users of generic batch lookup, arrays and
> lpm_trie, I didn't notice in any case retry would help.
> 

I opened a patch here:
https://lore.kernel.org/bpf/Z6OYbS4WqQnmzi2z@debian.debian/

Yan


* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-05 16:27     ` Yan Zhai
  2025-02-05 17:00       ` Yan Zhai
@ 2025-02-06  0:46       ` Hou Tao
  2025-02-06  3:01         ` Yan Zhai
  1 sibling, 1 reply; 10+ messages in thread
From: Hou Tao @ 2025-02-06  0:46 UTC (permalink / raw)
  To: Yan Zhai, Alexei Starovoitov; +Cc: bpf, kernel-team


Hi,

On 2/6/2025 12:27 AM, Yan Zhai wrote:
> On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
> <alexei.starovoitov@gmail.com> wrote:
>> Let's not invent new magic return values.
>>
>> But stepping back... why do we have this EINTR case at all?
>> Can we always goto next_key for all map types?
>> The command returns and a set of (key, value) pairs.
>> It's always better to skip then get stuck in EINTR,
>> since EINTR implies that the user space should retry and it
>> might be successful next time.
>> While here it's not the case.
>> I don't see any selftests for EINTR, so I suspect it was added
>> as escape path in case retry count exceeds 3 and author assumed
>> that it should never happen in practice, so EINTR was expected
>> to be 'never happens'. Clearly that's not the case.
> It makes more sense to me if we just goto the next key for all types.
> At least for current users of generic batch lookup, arrays and
> lpm_trie, I didn't notice in any case retry would help.

I think it will break lpm_trie. In lpm_trie, if it tries to find the next
key of a non-existent key, it will restart from the leftmost node.
>
> best
> Yan
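
One reading of the lpm_trie concern above can be modelled with a sorted
key set. The sketch is a toy, not the lpm_trie code: struct toy_set,
toy_find and toy_get_next_key are hypothetical, and the only rule taken
from the discussion is that asking for the successor of a missing key
restarts from the first ("leftmost") key.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy ordered key set standing in for a trie; not the lpm_trie code. */
struct toy_set {
	const unsigned int *keys;	/* sorted ascending */
	size_t len;
};

/* Linear search; stores the position of key in *pos when it is found. */
static bool toy_find(const struct toy_set *s, unsigned int key, size_t *pos)
{
	for (size_t i = 0; i < s->len; i++) {
		if (s->keys[i] == key) {
			*pos = i;
			return true;
		}
	}
	return false;
}

/*
 * Next-key rule assumed from the discussion: a missing (or NULL) prev
 * key restarts the walk from the first ("leftmost") key.
 */
static int toy_get_next_key(const struct toy_set *s,
			    const unsigned int *prev, unsigned int *next)
{
	size_t pos = 0;

	assert(s && next);
	if (!s->len)
		return -ENOENT;
	if (!prev || !toy_find(s, *prev, &pos)) {
		*next = s->keys[0];
		return 0;
	}
	if (pos + 1 >= s->len)
		return -ENOENT;	/* prev was the last key */
	*next = s->keys[pos + 1];
	return 0;
}
```

Suppose the set held 10, 20 and 30, and key 20 is deleted after it was
returned as the next key but before its value is copied. Skipping ahead
with 20 as the previous key yields 10 again in this model rather than
30, so a batch walk could revisit keys it has already returned.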



* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-06  0:46       ` Hou Tao
@ 2025-02-06  3:01         ` Yan Zhai
  2025-02-06  4:17           ` Hou Tao
  0 siblings, 1 reply; 10+ messages in thread
From: Yan Zhai @ 2025-02-06  3:01 UTC (permalink / raw)
  To: Hou Tao; +Cc: Alexei Starovoitov, bpf, kernel-team

On Wed, Feb 5, 2025 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
>
> Hi,
>
> On 2/6/2025 12:27 AM, Yan Zhai wrote:
> > On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
> > <alexei.starovoitov@gmail.com> wrote:
> >> Let's not invent new magic return values.
> >>
> >> But stepping back... why do we have this EINTR case at all?
> >> Can we always goto next_key for all map types?
> >> The command returns and a set of (key, value) pairs.
> >> It's always better to skip then get stuck in EINTR,
> >> since EINTR implies that the user space should retry and it
> >> might be successful next time.
> >> While here it's not the case.
> >> I don't see any selftests for EINTR, so I suspect it was added
> >> as escape path in case retry count exceeds 3 and author assumed
> >> that it should never happen in practice, so EINTR was expected
> >> to be 'never happens'. Clearly that's not the case.
> > It makes more sense to me if we just goto the next key for all types.
> > At least for current users of generic batch lookup, arrays and
> > lpm_trie, I didn't notice in any case retry would help.
>
> I think it will break lpm_trie. In lpm_trie, if tries to find the next
> key of a non-existent key, it will restart from the left-mode node.

I am not sure how lpm_trie would break if we always skip to the next
key. The current retry logic does not change prev_key, so the lookup key
will always be the same. Retrying would make sense if searching with the
same key could temporarily fail, but that does not seem to be the case
for either lpm_trie or array-based maps.

Yan

> >
> > best
> > Yan
>


* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-06  3:01         ` Yan Zhai
@ 2025-02-06  4:17           ` Hou Tao
  2025-02-06  5:02             ` Yan Zhai
  0 siblings, 1 reply; 10+ messages in thread
From: Hou Tao @ 2025-02-06  4:17 UTC (permalink / raw)
  To: Yan Zhai; +Cc: Alexei Starovoitov, bpf, kernel-team

Hi,

On 2/6/2025 11:01 AM, Yan Zhai wrote:
> On Wed, Feb 5, 2025 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
>>
>> Hi,
>>
>> On 2/6/2025 12:27 AM, Yan Zhai wrote:
>>> On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
>>> <alexei.starovoitov@gmail.com> wrote:
>>>> Let's not invent new magic return values.
>>>>
>>>> But stepping back... why do we have this EINTR case at all?
>>>> Can we always goto next_key for all map types?
>>>> The command returns and a set of (key, value) pairs.
>>>> It's always better to skip then get stuck in EINTR,
>>>> since EINTR implies that the user space should retry and it
>>>> might be successful next time.
>>>> While here it's not the case.
>>>> I don't see any selftests for EINTR, so I suspect it was added
>>>> as escape path in case retry count exceeds 3 and author assumed
>>>> that it should never happen in practice, so EINTR was expected
>>>> to be 'never happens'. Clearly that's not the case.
>>> It makes more sense to me if we just goto the next key for all types.
>>> At least for current users of generic batch lookup, arrays and
>>> lpm_trie, I didn't notice in any case retry would help.
>> I think it will break lpm_trie. In lpm_trie, if tries to find the next
>> key of a non-existent key, it will restart from the left-mode node.
> I am not sure how lpm trie would break if we always skip to the next
> key. Current retry logic does not change prev_key, so the lookup key
> will always be the same. It would make sense if searching with the
> same key could temporarily fail, but it does not seem so for both
> lpm_tire and array based maps.

The retry logic does change prev_key; please see "swap(prev_key, key);"
below the next_key label. Otherwise the lookup_batch procedure would loop
forever for an array map.
>
> Yan
>
>>> best
>>> Yan
> .
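
The role of the swap mentioned above can be shown with a two-buffer
walk. This is a minimal sketch, not the kernel loop: toy_next, toy_walk
and SWAP_PTR are hypothetical, while prev_key, key and buf_prevkey are
borrowed from the quoted diff. The key space is assumed to be the array
indexes 0 to max_entries - 1.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Exchanges two key-buffer pointers without copying the keys. */
#define SWAP_PTR(a, b) \
	do { unsigned int *tmp_ = (a); (a) = (b); (b) = tmp_; } while (0)

/* Array-style successor: a NULL prev means "start at index 0". */
static int toy_next(const unsigned int *prev, unsigned int *next,
		    unsigned int max_entries)
{
	unsigned int idx = prev ? *prev + 1 : 0;

	if (idx >= max_entries)
		return -ENOENT;
	*next = idx;
	return 0;
}

/* Walks the key space with two buffers; returns how many keys it saw. */
static unsigned int toy_walk(unsigned int max_entries,
			     unsigned int *visited, unsigned int cap)
{
	unsigned int buf_prevkey = 0, buf = 0;
	unsigned int *prev_key = NULL, *key = &buf;
	unsigned int n = 0;

	assert(visited);
	while (n < cap && toy_next(prev_key, key, max_entries) == 0) {
		visited[n++] = *key;
		if (!prev_key)
			prev_key = &buf_prevkey;
		/* The key just produced becomes the next call's prev_key. */
		SWAP_PTR(prev_key, key);
	}
	return n;
}
```

Dropping the swap would leave prev_key pointing at a buffer that is
never updated, so every later call to toy_next() would hand back the
same key. That is the endless loop described above for array maps.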



* Re: handling EINTR from bpf_map_lookup_batch
  2025-02-06  4:17           ` Hou Tao
@ 2025-02-06  5:02             ` Yan Zhai
  0 siblings, 0 replies; 10+ messages in thread
From: Yan Zhai @ 2025-02-06  5:02 UTC (permalink / raw)
  To: Hou Tao; +Cc: Alexei Starovoitov, bpf, kernel-team

On Wed, Feb 5, 2025 at 10:17 PM Hou Tao <houtao@huaweicloud.com> wrote:
>
> Hi,
>
> On 2/6/2025 11:01 AM, Yan Zhai wrote:
> > On Wed, Feb 5, 2025 at 6:46 PM Hou Tao <houtao@huaweicloud.com> wrote:
> >>
> >> Hi,
> >>
> >> On 2/6/2025 12:27 AM, Yan Zhai wrote:
> >>> On Wed, Feb 5, 2025 at 3:56 AM Alexei Starovoitov
> >>> <alexei.starovoitov@gmail.com> wrote:
> >>>> Let's not invent new magic return values.
> >>>>
> >>>> But stepping back... why do we have this EINTR case at all?
> >>>> Can we always goto next_key for all map types?
> >>>> The command returns and a set of (key, value) pairs.
> >>>> It's always better to skip then get stuck in EINTR,
> >>>> since EINTR implies that the user space should retry and it
> >>>> might be successful next time.
> >>>> While here it's not the case.
> >>>> I don't see any selftests for EINTR, so I suspect it was added
> >>>> as escape path in case retry count exceeds 3 and author assumed
> >>>> that it should never happen in practice, so EINTR was expected
> >>>> to be 'never happens'. Clearly that's not the case.
> >>> It makes more sense to me if we just goto the next key for all types.
> >>> At least for current users of generic batch lookup, arrays and
> >>> lpm_trie, I didn't notice in any case retry would help.
> >> I think it will break lpm_trie. In lpm_trie, if tries to find the next
> >> key of a non-existent key, it will restart from the left-mode node.
> > I am not sure how lpm trie would break if we always skip to the next
> > key. Current retry logic does not change prev_key, so the lookup key
> > will always be the same. It would make sense if searching with the
> > same key could temporarily fail, but it does not seem so for both
> > lpm_tire and array based maps.
>
> Retry logic does change prev_key, please see "swap(prev_key, key);"
> below the next_key tag, otherwise the lookup_batch procedure will loop
> forever for array map.
>

We are probably not on the same page. Let me clarify:

By "retry logic" I mean this code snippet:
               if (err == -ENOENT) {
                       if (retry) {
                               retry--;
                               continue;
                       }
                       err = -EINTR;
                       break;
               }
It wouldn't execute the swap when ENOENT is returned from bpf_map_copy_value.

And by "skipping to the next key" I mean simply:

  if (err == -ENOENT)
       goto next_key;

Note that the "next_key" label is not in the current codebase. It is only
in my posted patch. I don't think this would break lpm_trie unless I
missed something.

Yan

