From: Toke Høiland-Jørgensen
To: Florian Kauer, Alexei Starovoitov, Daniel Borkmann, Andrii Nakryiko,
	Martin KaFai Lau, Eduard Zingerman, Song Liu, Yonghong Song,
	John Fastabend, KP Singh, Stanislav Fomichev, Hao Luo, Jiri Olsa,
	"David S. Miller", Jakub Kicinski, Jesper Dangaard Brouer,
	Steven Rostedt, Masami Hiramatsu, Mathieu Desnoyers, Eric Dumazet,
	Paolo Abeni
Cc: bpf@vger.kernel.org, linux-kernel@vger.kernel.org, netdev@vger.kernel.org,
	linux-trace-kernel@vger.kernel.org, xdp-newbies@vger.kernel.org
Subject: Re: [PATCH] bpf: devmap: allow for repeated redirect
In-Reply-To: <93a97de4-a93d-4d5b-841d-c2f95dcedb0f@linutronix.de>
References: <20240909-devel-koalo-fix-redirect-v1-1-2dd90771146c@linutronix.de>
	<87o74vewko.fsf@toke.dk>
	<098b5603-0feb-4013-a9ee-8d1c8edaf4f8@linutronix.de>
	<877cbiee3y.fsf@toke.dk>
	<93a97de4-a93d-4d5b-841d-c2f95dcedb0f@linutronix.de>
X-Clacks-Overhead: GNU Terry Pratchett
Date: Wed, 11 Sep 2024 16:14:24 +0200
Message-ID: <87ttemcm67.fsf@toke.dk>
X-Mailing-List: netdev@vger.kernel.org

Florian Kauer writes:

> On 9/11/24 11:25, Toke Høiland-Jørgensen wrote:
>> Florian Kauer writes:
>> 
>>> Hi Toke,
>>>
>>> On 9/10/24 10:34, Toke Høiland-Jørgensen wrote:
>>>> Florian Kauer writes:
>>>>
>>>>> Currently, returning XDP_REDIRECT from an xdp/devmap program
>>>>> is considered an invalid action and an exception is traced.
>>>>>
>>>>> While it might seem counterintuitive to redirect in an xdp/devmap
>>>>> program (why not just redirect to the correct interface in the
>>>>> first program?), we faced several use cases where supporting
>>>>> this would be very useful.
>>>>>
>>>>> Most importantly, they occur when the first redirect is used
>>>>> with the BPF_F_BROADCAST flag. Using this together with xdp/devmap
>>>>> programs makes it possible to perform different actions on clones of
>>>>> the same incoming frame. In that case, it is often useful
>>>>> to redirect either to a different CPU or device AFTER the cloning.
>>>>>
>>>>> For example:
>>>>> - Replicate the frame (for redundancy according to IEEE 802.1CB FRER)
>>>>>   and then use the second redirect with a cpumap to select
>>>>>   the path-specific egress queue.
>>>>>
>>>>> - Also, one of the paths might need an encapsulation that
>>>>>   exceeds the MTU.
>>>>>   So a second redirect can be used back
>>>>>   to the ingress interface to send an ICMP FRAG_NEEDED packet.
>>>>>
>>>>> - For OAM purposes, you might want to send one frame with
>>>>>   OAM information back, while the original frame is passed forward.
>>>>>
>>>>> To enable these use cases, add the XDP_REDIRECT case to
>>>>> dev_map_bpf_prog_run. Also, when this is called from inside
>>>>> xdp_do_flush, the redirect might add further entries to the
>>>>> flush lists that are currently being processed. Therefore, loop inside
>>>>> xdp_do_flush until no additional redirects are added.
>>>>>
>>>>> Signed-off-by: Florian Kauer
>>>>
>>>> This is an interesting use case! However, your implementation makes it
>>>> way too easy to end up in a situation that loops forever, so I think we
>>>> should add some protection against that. Some details below:
>>>>
>>>>> ---
>>>>>  include/linux/bpf.h        |  4 ++--
>>>>>  include/trace/events/xdp.h | 10 ++++++----
>>>>>  kernel/bpf/devmap.c        | 37 +++++++++++++++++++++++++++--------
>>>>>  net/core/filter.c          | 48 +++++++++++++++++++++++++++-------------------
>>>>>  4 files changed, 65 insertions(+), 34 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/bpf.h b/include/linux/bpf.h
>>>>> index 3b94ec161e8c..1b57cbabf199 100644
>>>>> --- a/include/linux/bpf.h
>>>>> +++ b/include/linux/bpf.h
>>>>> @@ -2498,7 +2498,7 @@ struct sk_buff;
>>>>>  struct bpf_dtab_netdev;
>>>>>  struct bpf_cpu_map_entry;
>>>>>  
>>>>> -void __dev_flush(struct list_head *flush_list);
>>>>> +void __dev_flush(struct list_head *flush_list, int *redirects);
>>>>>  int dev_xdp_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>>>>>  		    struct net_device *dev_rx);
>>>>>  int dev_map_enqueue(struct bpf_dtab_netdev *dst, struct xdp_frame *xdpf,
>>>>> @@ -2740,7 +2740,7 @@ static inline struct bpf_token *bpf_token_get_from_fd(u32 ufd)
>>>>>  	return ERR_PTR(-EOPNOTSUPP);
>>>>>  }
>>>>>  
>>>>> -static inline void __dev_flush(struct list_head *flush_list)
>>>>> +static inline void __dev_flush(struct list_head *flush_list, int *redirects)
>>>>>  {
>>>>>  }
>>>>>  
>>>>> diff --git a/include/trace/events/xdp.h b/include/trace/events/xdp.h
>>>>> index a7e5452b5d21..fba2c457e727 100644
>>>>> --- a/include/trace/events/xdp.h
>>>>> +++ b/include/trace/events/xdp.h
>>>>> @@ -269,9 +269,9 @@ TRACE_EVENT(xdp_devmap_xmit,
>>>>>  
>>>>>  	TP_PROTO(const struct net_device *from_dev,
>>>>>  		 const struct net_device *to_dev,
>>>>> -		 int sent, int drops, int err),
>>>>> +		 int sent, int drops, int redirects, int err),
>>>>>  
>>>>> -	TP_ARGS(from_dev, to_dev, sent, drops, err),
>>>>> +	TP_ARGS(from_dev, to_dev, sent, drops, redirects, err),
>>>>>  
>>>>>  	TP_STRUCT__entry(
>>>>>  		__field(int, from_ifindex)
>>>>> @@ -279,6 +279,7 @@ TRACE_EVENT(xdp_devmap_xmit,
>>>>>  		__field(int, to_ifindex)
>>>>>  		__field(int, drops)
>>>>>  		__field(int, sent)
>>>>> +		__field(int, redirects)
>>>>>  		__field(int, err)
>>>>>  	),
>>>>>  
>>>>> @@ -288,16 +289,17 @@ TRACE_EVENT(xdp_devmap_xmit,
>>>>>  		__entry->to_ifindex = to_dev->ifindex;
>>>>>  		__entry->drops = drops;
>>>>>  		__entry->sent = sent;
>>>>> +		__entry->redirects = redirects;
>>>>>  		__entry->err = err;
>>>>>  	),
>>>>>  
>>>>>  	TP_printk("ndo_xdp_xmit"
>>>>>  		  " from_ifindex=%d to_ifindex=%d action=%s"
>>>>> -		  " sent=%d drops=%d"
>>>>> +		  " sent=%d drops=%d redirects=%d"
>>>>>  		  " err=%d",
>>>>>  		  __entry->from_ifindex, __entry->to_ifindex,
>>>>>  		  __print_symbolic(__entry->act, __XDP_ACT_SYM_TAB),
>>>>> -		  __entry->sent, __entry->drops,
>>>>> +		  __entry->sent, __entry->drops, __entry->redirects,
>>>>>  		  __entry->err)
>>>>>  );
>>>>>  
>>>>> diff --git a/kernel/bpf/devmap.c b/kernel/bpf/devmap.c
>>>>> index 7878be18e9d2..89bdec49ea40 100644
>>>>> --- a/kernel/bpf/devmap.c
>>>>> +++ b/kernel/bpf/devmap.c
>>>>> @@ -334,7 +334,8 @@ static int dev_map_hash_get_next_key(struct bpf_map *map, void *key,
>>>>>  static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>>>>  				struct xdp_frame **frames, int n,
>>>>>  				struct net_device *tx_dev,
>>>>> -				struct net_device *rx_dev)
>>>>> +				struct net_device *rx_dev,
>>>>> +				int *redirects)
>>>>>  {
>>>>>  	struct xdp_txq_info txq = { .dev = tx_dev };
>>>>>  	struct xdp_rxq_info rxq = { .dev = rx_dev };
>>>>> @@ -359,6 +360,13 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>>>>  		else
>>>>>  			frames[nframes++] = xdpf;
>>>>>  		break;
>>>>> +	case XDP_REDIRECT:
>>>>> +		err = xdp_do_redirect(rx_dev, &xdp, xdp_prog);
>>>>> +		if (unlikely(err))
>>>>> +			xdp_return_frame_rx_napi(xdpf);
>>>>> +		else
>>>>> +			*redirects += 1;
>>>>> +		break;
>>>>
>>>> It's a bit subtle, but dev_map_bpf_prog_run() also filters the list of
>>>> frames in the queue in-place (the frames[nframes++] = xdpf; line above),
>>>> which only works under the assumption that the array in bq->q is not
>>>> modified while this loop is being run. But now you're adding a call in
>>>> the middle that may result in the packet being put back on the same
>>>> queue, which means that this assumption no longer holds.
>>>>
>>>> So you need to clear the bq->q queue first for this to work.
>>>> Specifically, at the start of bq_xmit_all(), you'll need to first copy
>>>> all the packet pointers onto an on-stack array, then run the rest of the
>>>> function on that array. There's already an initial loop that goes
>>>> through all the frames, so you can just do it there.
>>>>
>>>> So the loop at the start of bq_xmit_all() goes from the current:
>>>>
>>>>	for (i = 0; i < cnt; i++) {
>>>>		struct xdp_frame *xdpf = bq->q[i];
>>>>
>>>>		prefetch(xdpf);
>>>>	}
>>>>
>>>>
>>>> to something like:
>>>>
>>>>	struct xdp_frame *frames[DEV_MAP_BULK_SIZE];
>>>>
>>>>	for (i = 0; i < cnt; i++) {
>>>>		struct xdp_frame *xdpf = bq->q[i];
>>>>
>>>>		prefetch(xdpf);
>>>>		frames[i] = xdpf;
>>>>	}
>>>>
>>>>	bq->count = 0; /* bq is now empty, use the 'frames' and 'cnt'
>>>>			  stack variables for the rest of the function */
>>>>
>>>>
>>>>> 	default:
>>>>> 		bpf_warn_invalid_xdp_action(NULL, xdp_prog, act);
>>>>> 		fallthrough;
>>>>> @@ -373,7 +381,7 @@ static int dev_map_bpf_prog_run(struct bpf_prog *xdp_prog,
>>>>>  	return nframes; /* sent frames count */
>>>>>  }
>>>>>  
>>>>> -static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>> +static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags, int *redirects)
>>>>>  {
>>>>>  	struct net_device *dev = bq->dev;
>>>>>  	unsigned int cnt = bq->count;
>>>>> @@ -390,8 +398,10 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>>  		prefetch(xdpf);
>>>>>  	}
>>>>>  
>>>>> +	int new_redirects = 0;
>>>>>  	if (bq->xdp_prog) {
>>>>> -		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx);
>>>>> +		to_send = dev_map_bpf_prog_run(bq->xdp_prog, bq->q, cnt, dev, bq->dev_rx,
>>>>> +					       &new_redirects);
>>>>>  		if (!to_send)
>>>>>  			goto out;
>>>>>  	}
>>>>> @@ -413,19 +423,21 @@ static void bq_xmit_all(struct xdp_dev_bulk_queue *bq, u32 flags)
>>>>>  
>>>>>  out:
>>>>>  	bq->count = 0;
>>>>> -	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent, err);
>>>>> +	*redirects += new_redirects;
>>>>> +	trace_xdp_devmap_xmit(bq->dev_rx, dev, sent, cnt - sent - new_redirects,
>>>>> +			      new_redirects, err);
>>>>>  }
>>>>>  
>>>>>  /* __dev_flush is called from xdp_do_flush() which _must_ be signalled from the
>>>>>   * driver before returning from its napi->poll() routine. See the comment above
>>>>>   * xdp_do_flush() in filter.c.
>>>>>   */
>>>>> -void __dev_flush(struct list_head *flush_list)
>>>>> +void __dev_flush(struct list_head *flush_list, int *redirects)
>>>>>  {
>>>>>  	struct xdp_dev_bulk_queue *bq, *tmp;
>>>>>  
>>>>>  	list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
>>>>> -		bq_xmit_all(bq, XDP_XMIT_FLUSH);
>>>>> +		bq_xmit_all(bq, XDP_XMIT_FLUSH, redirects);
>>>>>  		bq->dev_rx = NULL;
>>>>>  		bq->xdp_prog = NULL;
>>>>>  		__list_del_clearprev(&bq->flush_node);
>>>>> @@ -458,8 +470,17 @@ static void bq_enqueue(struct net_device *dev, struct xdp_frame *xdpf,
>>>>>  {
>>>>>  	struct xdp_dev_bulk_queue *bq = this_cpu_ptr(dev->xdp_bulkq);
>>>>>  
>>>>> -	if (unlikely(bq->count == DEV_MAP_BULK_SIZE))
>>>>> -		bq_xmit_all(bq, 0);
>>>>> +	if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) {
>>>>> +		int redirects = 0;
>>>>> +
>>>>> +		bq_xmit_all(bq, 0, &redirects);
>>>>> +
>>>>> +		/* according to comment above xdp_do_flush() in
>>>>> +		 * filter.c, xdp_do_flush is always called at
>>>>> +		 * the end of the NAPI anyway, so no need to act
>>>>> +		 * on the redirects here
>>>>> +		 */
>>>>
>>>> While it's true that it will be called again in NAPI, the purpose of
>>>> calling bq_xmit_all() here is to make room for the packet on the
>>>> bulk queue that we're about to enqueue, and if bq_xmit_all() can just
>>>> put the packet back on the queue, there is no guarantee that this will
>>>> succeed. So you will have to handle that case here.
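
For illustration only: a minimal sketch of how the call site in bq_enqueue()
could guard against the queue still being full after bq_xmit_all() returns
(because a devmap-program redirect re-filled it). This is not from the patch
or the thread, and the paragraphs below argue for handling the problem inside
bq_xmit_all() instead, which would make a guard like this unnecessary:

	if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) {
		int redirects = 0;

		bq_xmit_all(bq, 0, &redirects);

		/* A redirect from the devmap program may have put frames
		 * back on this bulk queue; if it is still full, drop the
		 * new frame instead of overflowing bq->q.
		 */
		if (unlikely(bq->count == DEV_MAP_BULK_SIZE)) {
			xdp_return_frame_rx_napi(xdpf);
			return;
		}
	}
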
>>>>
>>>> Since there's also a potential infinite recursion issue in the
>>>> do_flush() functions below, I think it may be better to handle this
>>>> looping issue inside bq_xmit_all().
>>>>
>>>> I.e., structure the code so that bq_xmit_all() guarantees that when it
>>>> returns it has actually done its job; that is, that bq->q is empty.
>>>>
>>>> Given the above "move all frames out of bq->q at the start" change, this
>>>> is not all that hard. Simply add a check after the out: label (in
>>>> bq_xmit_all()) to check if bq->count is actually 0, and if it isn't,
>>>> start over from the beginning of that function. This also makes it
>>>> straightforward to add a recursion limit; after looping a set number of
>>>> times (say, XMIT_RECURSION_LIMIT), simply turn XDP_REDIRECT into drops.
>>>>
>>>> There will need to be some additional protection against looping forever
>>>> in __dev_flush(), to handle the case where a packet is looped between
>>>> two interfaces. This one is a bit trickier, but a similar recursion
>>>> counter could be used, I think.
>>>
>>>
>>> Thanks a lot for the extensive support!
>>> Regarding __dev_flush(), could the following already be sufficient?
>>>
>>> void __dev_flush(struct list_head *flush_list)
>>> {
>>> 	struct xdp_dev_bulk_queue *bq, *tmp;
>>> 	int i, j;
>>>
>>> 	for (i = 0; !list_empty(flush_list) && i < XMIT_RECURSION_LIMIT; i++) {
>>> 		/* go through list in reverse so that new items
>>> 		 * added to the flush_list will only be handled
>>> 		 * in the next iteration of the outer loop
>>> 		 */
>>> 		list_for_each_entry_safe_reverse(bq, tmp, flush_list, flush_node) {
>>> 			bq_xmit_all(bq, XDP_XMIT_FLUSH);
>>> 			bq->dev_rx = NULL;
>>> 			bq->xdp_prog = NULL;
>>> 			__list_del_clearprev(&bq->flush_node);
>>> 		}
>>> 	}
>>>
>>> 	if (i == XMIT_RECURSION_LIMIT) {
>>> 		pr_warn("XDP recursion limit hit, expect packet loss!\n");
>>>
>>> 		list_for_each_entry_safe(bq, tmp, flush_list, flush_node) {
>>> 			for (j = 0; j < bq->count; j++)
>>> 				xdp_return_frame_rx_napi(bq->q[j]);
>>>
>>> 			bq->count = 0;
>>> 			bq->dev_rx = NULL;
>>> 			bq->xdp_prog = NULL;
>>> 			__list_del_clearprev(&bq->flush_node);
>>> 		}
>>> 	}
>>> }
>> 
>> Yeah, this would work, I think (neat trick with iterating the list in
>> reverse!), but instead of the extra loop at the end, I would suggest
>> passing in an extra 'allow_redirect' parameter to bq_xmit_all(). Since
>> you'll already have to handle the recursion limit inside that function,
>> you can just reuse the same functionality by passing in the parameter in
>> the last iteration of the first loop.
>> 
>> Also, definitely don't put an unconditional warn_on() in the fast path -
>> that brings down the system really quickly if it's misconfigured :)
>> 
>> A warn_on_once() could work, but really I would suggest just reusing the
>> _trace_xdp_redirect_map_err() tracepoint with a new error code (it just
>> has to be one we're not currently using and that vaguely resembles what
>> this is about; ELOOP, EOVERFLOW or EDEADLOCK, maybe?).
>
>
> The 'allow_redirect' parameter is a very good idea! If I also forward that
> to dev_map_bpf_prog_run and do something like the following, I can just
> reuse the existing error handling. Or is that too implicit/too ugly?
>
> switch (act) {
> case XDP_PASS:
> 	err = xdp_update_frame_from_buff(&xdp, xdpf);
> 	if (unlikely(err < 0))
> 		xdp_return_frame_rx_napi(xdpf);
> 	else
> 		frames[nframes++] = xdpf;
> 	break;
> case XDP_REDIRECT:
> 	if (allow_redirect) {
> 		err = xdp_do_redirect(rx_dev, &xdp, xdp_prog);
> 		if (unlikely(err))
> 			xdp_return_frame_rx_napi(xdpf);
> 		else
> 			*redirects += 1;
> 		break;
> 	} else
> 		fallthrough;
> default:
> 	bpf_warn_invalid_xdp_action(NULL, xdp_prog, act);
> 	fallthrough;

Yes, I was imagining something like this. Not sure if we want to turn it
into an "invalid action" this way (it should probably be a separate
tracepoint as I mentioned in the previous email). But otherwise, yeah!

> case XDP_ABORTED:
> 	trace_xdp_exception(tx_dev, xdp_prog, act);
> 	fallthrough;
> case XDP_DROP:
> 	xdp_return_frame_rx_napi(xdpf);
> 	break;
> }
>
>
>> 
>>>> In any case, this needs extensive selftests, including ones with devmap
>>>> programs that loop packets (both redirect from a->a, and from
>>>> a->b->a->b) to make sure the limits work correctly.
>>>
>>>
>>> Good point! I am going to prepare some.
>>>
>>>
>>>>
>>>>> +	}
>>>>>  
>>>>>  	/* Ingress dev_rx will be the same for all xdp_frame's in
>>>>>  	 * bulk_queue, because bq stored per-CPU and must be flushed
>>>>> diff --git a/net/core/filter.c b/net/core/filter.c
>>>>> index 8569cd2482ee..b33fc0b1444a 100644
>>>>> --- a/net/core/filter.c
>>>>> +++ b/net/core/filter.c
>>>>> @@ -4287,14 +4287,18 @@ static const struct bpf_func_proto bpf_xdp_adjust_meta_proto = {
>>>>>  void xdp_do_flush(void)
>>>>>  {
>>>>>  	struct list_head *lh_map, *lh_dev, *lh_xsk;
>>>>> +	int redirect;
>>>>>  
>>>>> -	bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> -	if (lh_dev)
>>>>> -		__dev_flush(lh_dev);
>>>>> -	if (lh_map)
>>>>> -		__cpu_map_flush(lh_map);
>>>>> -	if (lh_xsk)
>>>>> -		__xsk_map_flush(lh_xsk);
>>>>> +	do {
>>>>> +		redirect = 0;
>>>>> +		bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> +		if (lh_dev)
>>>>> +			__dev_flush(lh_dev, &redirect);
>>>>> +		if (lh_map)
>>>>> +			__cpu_map_flush(lh_map);
>>>>> +		if (lh_xsk)
>>>>> +			__xsk_map_flush(lh_xsk);
>>>>> +	} while (redirect > 0);
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(xdp_do_flush);
>>>>>  
>>>>> @@ -4303,20 +4307,24 @@ void xdp_do_check_flushed(struct napi_struct *napi)
>>>>>  {
>>>>>  	struct list_head *lh_map, *lh_dev, *lh_xsk;
>>>>>  	bool missed = false;
>>>>> +	int redirect;
>>>>>  
>>>>> -	bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> -	if (lh_dev) {
>>>>> -		__dev_flush(lh_dev);
>>>>> -		missed = true;
>>>>> -	}
>>>>> -	if (lh_map) {
>>>>> -		__cpu_map_flush(lh_map);
>>>>> -		missed = true;
>>>>> -	}
>>>>> -	if (lh_xsk) {
>>>>> -		__xsk_map_flush(lh_xsk);
>>>>> -		missed = true;
>>>>> -	}
>>>>> +	do {
>>>>> +		redirect = 0;
>>>>> +		bpf_net_ctx_get_all_used_flush_lists(&lh_map, &lh_dev, &lh_xsk);
>>>>> +		if (lh_dev) {
>>>>> +			__dev_flush(lh_dev, &redirect);
>>>>> +			missed = true;
>>>>> +		}
>>>>> +		if (lh_map) {
>>>>> +			__cpu_map_flush(lh_map);
>>>>> +			missed = true;
>>>>> +		}
>>>>> +		if (lh_xsk) {
>>>>> +			__xsk_map_flush(lh_xsk);
>>>>> +			missed = true;
>>>>> +		}
>>>>> +	} while (redirect > 0);
>>>>
>>>> With the change suggested above (so that bq_xmit_all() guarantees the
>>>> flush is completely done), this looping is not needed anymore. However,
>>>> it becomes important in which *order* the flushing is done
>>>> (__dev_flush() should always happen first), so adding a comment to note
>>>> this would be good.
>>>
>>> I guess, if we remove the loop here and we still want to cover the case of
>>> redirecting from a devmap program via cpumap, we need to fetch the lh_map
>>> again after calling __dev_flush, right?
>>> So I think we should no longer use bpf_net_ctx_get_all_used_flush_lists then:
>>>
>>> 	lh_dev = bpf_net_ctx_get_dev_flush_list();
>>> 	if (lh_dev)
>>> 		__dev_flush(lh_dev);
>>> 	lh_map = bpf_net_ctx_get_cpu_map_flush_list();
>>> 	if (lh_map)
>>> 		__cpu_map_flush(lh_map);
>>> 	lh_xsk = bpf_net_ctx_get_xskmap_flush_list();
>>> 	if (lh_xsk)
>>> 		__xsk_map_flush(lh_xsk);
>>>
>>> But bpf_net_ctx_get_all_used_flush_lists also includes additional checks
>>> for IS_ENABLED(CONFIG_BPF_SYSCALL) and IS_ENABLED(CONFIG_XDP_SOCKETS),
>>> so I guess they should be directly included in xdp_do_flush and
>>> xdp_do_check_flushed?
>>> Then we could remove bpf_net_ctx_get_all_used_flush_lists.
>>> Or do you have an idea for a more elegant solution?
>> 
>> I think cpumap is fine because that doesn't immediately redirect back to
>> the bulk queue; instead, __cpu_map_flush() will put the frames on a
>> per-CPU ring buffer and wake the kworker on that CPU, which will in turn
>> do another xdp_do_flush() if the cpumap program does a redirect. So in
>> other words, we only need to handle recursion of devmap redirects :)
>
> I am likely missing something here, but the scenario I am thinking about
> is the following:
>
> 1. The cpu_map and xsk_map flush lists are empty, while the dev flush list
>    has a single frame pending, i.e. in the current implementation after
>    executing bpf_net_ctx_get_all_used_flush_lists:
>    lh_dev = something
>    lh_map = lh_xsk = NULL
>
> 2. __dev_flush gets called and executes a bpf program on the frame
>    that returns with "return bpf_redirect_map(&cpu_map, 0, 0);"
>    and that adds an item to the cpumap flush list.
>
> 3. Since __dev_flush is only able to handle devmap redirects itself,
>    the item is still on the cpumap flush list after __dev_flush
>    has returned.
>
> 4. lh_map, however, is still NULL, so __cpu_map_flush does not
>    get called and thus the kworker is never woken up.
>
> That is at least what I see on a running system: the kworker is never
> woken up. But I might have done something else wrong...

Ah, yes, I see what you mean. Yup, you're right,
bpf_net_ctx_get_all_used_flush_lists() will no longer work, we'll have to
get the others after __dev_flush().

-Toke
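
As background for readers who have not used the feature under discussion: a
minimal sketch of what the use case from the commit message could look like
on the BPF side, assuming this patch (or a later revision addressing the
flush-list ordering discussed above) is applied. Map names, sizes and key
values are made up for the example; the devmap program is attached by
storing its fd in bpf_devmap_val.bpf_prog.fd when populating tx_ports from
user space.

#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

struct {
	__uint(type, BPF_MAP_TYPE_DEVMAP);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, struct bpf_devmap_val);
} tx_ports SEC(".maps");

struct {
	__uint(type, BPF_MAP_TYPE_CPUMAP);
	__uint(max_entries, 8);
	__type(key, __u32);
	__type(value, struct bpf_cpumap_val);
} cpu_map SEC(".maps");

/* First program, attached to the ingress interface: replicate the frame to
 * every entry in tx_ports (e.g. for IEEE 802.1CB FRER redundancy).
 */
SEC("xdp")
int xdp_replicate(struct xdp_md *ctx)
{
	return bpf_redirect_map(&tx_ports, 0,
				BPF_F_BROADCAST | BPF_F_EXCLUDE_INGRESS);
}

/* Second program, attached to the devmap entries: runs once per clone and
 * issues the second redirect, here into a cpumap to pick a path-specific
 * CPU/egress queue. Returning XDP_REDIRECT from here is what the patch
 * under discussion makes valid.
 */
SEC("xdp/devmap")
int xdp_second_redirect(struct xdp_md *ctx)
{
	return bpf_redirect_map(&cpu_map, 0, 0);
}

char _license[] SEC("license") = "GPL";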