* Re: [PATCH v2 1/2] tg3: Increment tx_dropped in tg3_tso_bug()
From: Michael Chan @ 2023-11-02 20:04 UTC (permalink / raw)
To: alexey.pakhunov
Cc: mchan, vincent.wong2, netdev, linux-kernel, siva.kallam, prashant
In-Reply-To: <20231102172503.3413318-2-alexey.pakhunov@spacex.com>
[-- Attachment #1: Type: text/plain, Size: 1546 bytes --]
On Thu, Nov 2, 2023 at 10:25 AM <alexey.pakhunov@spacex.com> wrote:
>
> From: Alex Pakhunov <alexey.pakhunov@spacex.com>
>
> tg3_tso_bug() drops a packet if it cannot be segmented for any reason.
> The number of discarded frames should be incremeneted accordingly.
>
> Signed-off-by: Alex Pakhunov <alexey.pakhunov@spacex.com>
> Signed-off-by: Vincent Wong <vincent.wong2@spacex.com>
> ---
> drivers/net/ethernet/broadcom/tg3.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
> index 14b311196b8f..99638e6c9e16 100644
> --- a/drivers/net/ethernet/broadcom/tg3.c
> +++ b/drivers/net/ethernet/broadcom/tg3.c
> @@ -7874,8 +7874,10 @@ static int tg3_tso_bug(struct tg3 *tp, struct tg3_napi *tnapi,
>
> segs = skb_gso_segment(skb, tp->dev->features &
> ~(NETIF_F_TSO | NETIF_F_TSO6));
> - if (IS_ERR(segs) || !segs)
> + if (IS_ERR(segs) || !segs) {
> + tp->tx_dropped++;
This is prone to race conditions if we have more than one TX queue.
The original driver code only supported one TX queue and the counters
were never modified properly to support multiple queues. We should
convert them to per queue counters by moving tx_dropped and rx_dropped
to the tg3_napi struct.
> goto tg3_tso_bug_end;
> + }
>
> skb_list_walk_safe(segs, seg, next) {
> skb_mark_not_on_list(seg);
> --
> 2.39.3
>
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]
^ permalink raw reply
* Re: [PATCH v2 2/2] tg3: Fix the TX ring stall
From: Michael Chan @ 2023-11-02 20:09 UTC (permalink / raw)
To: alexey.pakhunov
Cc: mchan, vincent.wong2, netdev, linux-kernel, siva.kallam, prashant
In-Reply-To: <20231102172503.3413318-3-alexey.pakhunov@spacex.com>
[-- Attachment #1: Type: text/plain, Size: 2180 bytes --]
On Thu, Nov 2, 2023 at 10:25 AM <alexey.pakhunov@spacex.com> wrote:
>
> From: Alex Pakhunov <alexey.pakhunov@spacex.com>
>
> The TX ring maintained by the tg3 driver can end up in the state, when it
> has packets queued for sending but the NIC hardware is not informed, so no
> progress is made. This leads to a multi-second interruption in network
> traffic followed by dev_watchdog() firing and resetting the queue.
>
> The specific sequence of steps is:
>
> 1. tg3_start_xmit() is called at least once and queues packet(s) without
> updating tnapi->prodmbox (netdev_xmit_more() returns true)
> 2. tg3_start_xmit() is called with an SKB which causes tg3_tso_bug() to be
> called.
> 3. tg3_tso_bug() determines that the SKB is too large, ...
>
> if (unlikely(tg3_tx_avail(tnapi) <= frag_cnt_est)) {
>
> ... stops the queue, and returns NETDEV_TX_BUSY:
>
> netif_tx_stop_queue(txq);
> ...
> if (tg3_tx_avail(tnapi) <= frag_cnt_est)
> return NETDEV_TX_BUSY;
>
> 4. Since all tg3_tso_bug() call sites directly return, the code updating
> tnapi->prodmbox is skipped.
>
> 5. The queue is stuck now. tg3_start_xmit() is not called while the queue
> is stopped. The NIC is not processing new packets because
> tnapi->prodmbox wasn't updated. tg3_tx() is not called by
> tg3_poll_work() because the all TX descriptions that could be freed has
> been freed:
>
> /* run TX completion thread */
> if (tnapi->hw_status->idx[0].tx_consumer != tnapi->tx_cons) {
> tg3_tx(tnapi);
>
> 6. Eventually, dev_watchdog() fires triggering a reset of the queue.
>
> This fix makes sure that the tnapi->prodmbox update happens regardless of
> the reason tg3_start_xmit() returned.
>
> Signed-off-by: Alex Pakhunov <alexey.pakhunov@spacex.com>
> Signed-off-by: Vincent Wong <vincent.wong2@spacex.com>
> ---
> v2: Sort Order the local variables in tg3_start_xmit() in the RCS order
> v1: https://lore.kernel.org/netdev/20231101191858.2611154-1-alexey.pakhunov@spacex.com/T/#t
> ---
Thanks.
Reviewed-by: Michael Chan <michael.chan@broadcom.com>
[-- Attachment #2: S/MIME Cryptographic Signature --]
[-- Type: application/pkcs7-signature, Size: 4209 bytes --]
^ permalink raw reply
* Re: [PATCH] drivers/net/ppp: copy userspace array safely
From: Al Viro @ 2023-11-02 20:09 UTC (permalink / raw)
To: Philipp Stanner
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Stanislav Fomichev, Greg Kroah-Hartman, Benjamin Tissoires,
linux-ppp, netdev, linux-kernel, Dave Airlie
In-Reply-To: <20231102191914.52957-2-pstanner@redhat.com>
On Thu, Nov 02, 2023 at 08:19:15PM +0100, Philipp Stanner wrote:
> In ppp_generic.c memdup_user() is utilized to copy a userspace array.
> This is done without an overflow check.
>
> Use the new wrapper memdup_array_user() to copy the array more safely.
> fprog.len = uprog->len;
> - fprog.filter = memdup_user(uprog->filter,
> - uprog->len * sizeof(struct sock_filter));
> + fprog.filter = memdup_array_user(uprog->filter,
> + uprog->len, sizeof(struct sock_filter));
Far be it from me to discourage security theat^Whardening, but
struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
unsigned short len; /* Number of filter blocks */
struct sock_filter __user *filter;
};
struct sock_filter { /* Filter block */
__u16 code; /* Actual filter code */
__u8 jt; /* Jump true */
__u8 jf; /* Jump false */
__u32 k; /* Generic multiuse field */
};
so you might want to mention that overflow in question would have to be
in multiplying an untrusted 16bit value by 8...
^ permalink raw reply
* Re: [PATCH net] tcp: Fix -Wc23-extensions in tcp_options_write()
From: Palmer Dabbelt @ 2023-11-02 20:23 UTC (permalink / raw)
To: nathan
Cc: edumazet, davem, dsahern, kuba, pabeni, ndesaulniers, trix,
0x7f454c46, fruggeri, noureddine, netdev, linux-kernel, llvm,
patches
In-Reply-To: <mhng-41e9fb36-f703-461e-b585-9e8dd5984714@palmer-ri-x1c9a>
On Wed, 01 Nov 2023 18:42:10 PDT (-0700), Palmer Dabbelt wrote:
> On Wed, 01 Nov 2023 18:07:23 PDT (-0700), nathan@kernel.org wrote:
>> On Wed, Nov 01, 2023 at 05:41:10PM -0700, Palmer Dabbelt wrote:
>>> On Tue, 31 Oct 2023 13:23:35 PDT (-0700), nathan@kernel.org wrote:
>>> > Clang warns (or errors with CONFIG_WERROR=y) when CONFIG_TCP_AO is set:
>>> >
>>> > net/ipv4/tcp_output.c:663:2: error: label at end of compound statement is a C23 extension [-Werror,-Wc23-extensions]
>>> > 663 | }
>>> > | ^
>>> > 1 error generated.
>>> >
>>> > On earlier releases (such as clang-11, the current minimum supported
>>> > version for building the kernel) that do not support C23, this was a
>>> > hard error unconditionally:
>>> >
>>> > net/ipv4/tcp_output.c:663:2: error: expected statement
>>> > }
>>> > ^
>>> > 1 error generated.
>>> >
>>> > Add a semicolon after the label to create an empty statement, which
>>> > resolves the warning or error for all compilers.
>>> >
>>> > Closes: https://github.com/ClangBuiltLinux/linux/issues/1953
>>> > Fixes: 1e03d32bea8e ("net/tcp: Add TCP-AO sign to outgoing packets")
>>> > Signed-off-by: Nathan Chancellor <nathan@kernel.org>
>>> > ---
>>> > net/ipv4/tcp_output.c | 2 +-
>>> > 1 file changed, 1 insertion(+), 1 deletion(-)
>>> >
>>> > diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> > index f558c054cf6e..6064895daece 100644
>>> > --- a/net/ipv4/tcp_output.c
>>> > +++ b/net/ipv4/tcp_output.c
>>> > @@ -658,7 +658,7 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
>>> > memset(ptr, TCPOPT_NOP, sizeof(*ptr));
>>> > ptr++;
>>> > }
>>> > -out_ao:
>>> > +out_ao:;
>>> > #endif
>>> > }
>>> > if (unlikely(opts->mss)) {
>>> >
>>> > ---
>>> > base-commit: 55c900477f5b3897d9038446f72a281cae0efd86
>>> > change-id: 20231031-tcp-ao-fix-label-in-compound-statement-warning-ebd6c9978498
>>> >
>>> > Best regards,
>>>
>>> This gives me a
>>>
>>> linux/net/ipv4/tcp_output.c:663:2: error: expected statement
>>> }
>>>
>>> on GCC for me.
>>
>> What GCC version?
>
> 12.1, though I can't get a smaller reproducer so I'm going to roll back
> to your change and double-check. Might take a bit...
Looks like there was just some bug in my test scripts and the original
patch wasn't actually picked up for all the configs. It's working now,
so
Reviewed-by: Palmer Dabbelt <palmer@rivosinc.com>
Sorry for the confusion!
>> I cannot reproduce that error with my patch applied. I tested mainline
>> at commit deefd5024f07 ("Merge tag 'vfio-v6.7-rc1' of
>> https://github.com/awilliam/linux-vfio") using GCC 6 from kernel.org and
>> I can reproduce a similar failure with ARCH=x86_64 allyesconfig:
>>
>> net/ipv4/tcp_output.c: In function 'tcp_options_write':
>> net/ipv4/tcp_output.c:661:1: error: label at end of compound statement
>> out_ao:
>> ^~~~~~
>>
>> With this change applied, the error disappears for GCC 6 and GCC 13
>> continues to build without error. I can try the other supported versions
>> later, I just did an older and newer one for a quick test.
>>
>>> So I think something like
>>>
>>> diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
>>> index f558c054cf6e..ca09763acaa8 100644
>>> --- a/net/ipv4/tcp_output.c
>>> +++ b/net/ipv4/tcp_output.c
>>> @@ -659,6 +659,11 @@ static void tcp_options_write(struct tcphdr *th, struct tcp_sock *tp,
>>> ptr++;
>>> }
>>> out_ao:
>>> + /*
>>> + * Labels at the end of compound statements are a C23 feature, so
>>> + * introduce a block to avoid a warning/error on strict toolchains.
>>> + */
>>> + {}
>>> #endif
>>> }
>>> if (unlikely(opts->mss)) {
>>>
>>> should do it (though it's still build testing...)
>>
>> I am not opposed to this once we understand what versions are affected
>> by this so that we have some timeline of removing this workaround.
>>
>> Cheers,
>> Nathan
^ permalink raw reply
* Re: [PATCH v2] net/tg3: fix race condition in tg3_reset_task()
From: Thinh Tran @ 2023-11-02 20:37 UTC (permalink / raw)
To: Michael Chan
Cc: netdev, siva.kallam, prashant, mchan, pavan.chebbi, drc,
venkata.sai.duggi
In-Reply-To: <CACKFLimX4Pjm89cneeTa36B519DN3mdXXo5FXfDFi6e0SBwUSA@mail.gmail.com>
Thanks for the review.
On 11/2/2023 12:27 PM, Michael Chan wrote:
>
> This scenario can affect other drivers too, right? Shouldn't this be
> handled in a higher layer before calling ->ndo_tx_timeout() so we
> don't have to add this logic to all the other drivers? Thanks.
Yes, it does. We can add this into the dev_watchdog() function, but
further investigations are required. This is because each driver may
have a different approach to handling its own ->ndo_tx_timeout() function.
Thinh Tran
^ permalink raw reply
* Re: [PATCH] net/smc: avoid atomic_set and smp_wmb in the tx path when possible
From: Wenjia Zhang @ 2023-11-02 20:42 UTC (permalink / raw)
To: Li RongQing; +Cc: linux-s390, netdev
In-Reply-To: <20231102092712.30793-1-lirongqing@baidu.com>
On 02.11.23 10:27, Li RongQing wrote:
> these is less opportunity that conn->tx_pushing is not 1, since
> tx_pushing is just checked with 1, so move the setting tx_pushing
> to 1 after atomic_dec_and_test() return false, to avoid atomic_set
> and smp_wmb in tx path when possible
>
I think we should avoid to use argument like "less opportunity" in
commit message. Because "less opportunity" does not mean "no
opportunity". Once it occurs, does it mean that what the patch changes
is useless or wrong?
> Signed-off-by: Li RongQing <lirongqing@baidu.com>
> ---
> net/smc/smc_tx.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
> index 3b0ff3b..72dbdee 100644
> --- a/net/smc/smc_tx.c
> +++ b/net/smc/smc_tx.c
> @@ -667,8 +667,6 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
> return 0;
>
> again:
> - atomic_set(&conn->tx_pushing, 1);
> - smp_wmb(); /* Make sure tx_pushing is 1 before real send */
> rc = __smc_tx_sndbuf_nonempty(conn);
>
> /* We need to check whether someone else have added some data into
> @@ -677,8 +675,11 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
> * If so, we need to push again to prevent those data hang in the send
> * queue.
> */
> - if (unlikely(!atomic_dec_and_test(&conn->tx_pushing)))
> + if (unlikely(!atomic_dec_and_test(&conn->tx_pushing))) {
> + atomic_set(&conn->tx_pushing, 1);
> + smp_wmb(); /* Make sure tx_pushing is 1 before real send */
> goto again;
> + }
>
> return rc;
> }
I'm afraid that the *if* statement would never be true, without setting
the value of &conn->tx_pushing firstly.
^ permalink raw reply
* Re: [PATCH 0/2] net: mwifiex: add support for the SD8777 chipset
From: Brian Norris @ 2023-11-02 21:00 UTC (permalink / raw)
To: Karel Balej
Cc: Kalle Valo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Ulf Hansson, linux-wireless, netdev, devicetree, linux-kernel,
linux-mmc, Duje Mihanović, ~postmarketos/upstreaming,
phone-devel
In-Reply-To: <20231029111807.19261-1-balejk@matfyz.cz>
On Sun, Oct 29, 2023 at 12:08:15PM +0100, Karel Balej wrote:
> The driver requires proprietary firmware which is not yet part of
> linux-firmware, but it is packaged in postmarketOS.
You gotta get that done:
https://wireless.wiki.kernel.org/en/developers/documentation/submittingpatches#new_driver
"have firmware images submitted for linux-firmware with an acceptable
license allowing redistribution"
We can't have a driver requesting a mrvl/sd8777_uapsta.bin firmware that
isn't available for anyone [1].
Until that's done, NAK.
[1] I think you might be referring to this:
https://github.com/xcover3/android_vendor_samsung_xcover3lte/commit/6e324b43b32dc607327d89148dd5d83a14429ee6
But I don't see any license info, so I don't think that's going to be
appropriate for linux-firmware.
^ permalink raw reply
* Re: [PATCH 1/2] dt-bindings: mwifiex: document use with the SD8777 chipset
From: Brian Norris @ 2023-11-02 21:02 UTC (permalink / raw)
To: Karel Balej
Cc: Kalle Valo, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Ulf Hansson, linux-wireless, netdev, devicetree, linux-kernel,
linux-mmc, Duje Mihanović, ~postmarketos/upstreaming,
phone-devel
In-Reply-To: <20231029111807.19261-2-balejk@matfyz.cz>
On Sun, Oct 29, 2023 at 12:08:16PM +0100, Karel Balej wrote:
> Document the corresponding compatible string for the use of this driver
> with the Marvell SD8777 wireless chipset.
>
> Signed-off-by: Karel Balej <balejk@matfyz.cz>
FWIW, the binding looks fine from mwifiex point of view, so:
Acked-by: Brian Norris <briannorris@chromium.org>
But see cover letter. We can't merge driver support without a
linux-firmware-compatible (or otherwise redistributable) firmware, so
NAK for the series.
^ permalink raw reply
* [PATCH v1 net] tcp: Fix SYN option room calculation for TCP-AO.
From: Kuniyuki Iwashima @ 2023-11-02 21:05 UTC (permalink / raw)
To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
David Ahern
Cc: Salam Noureddine, Dmitry Safonov, Francesco Ruggeri,
Kuniyuki Iwashima, Kuniyuki Iwashima, netdev
When building SYN packet in tcp_syn_options(), MSS, TS, WS, and
SACKPERM are used without checking the remaining bytes in the
options area.
To keep that logic as is, we limit the TCP-AO MAC length in
tcp_ao_parse_crypto(). Currently, the limit is calculated as below.
MAX_TCP_OPTION_SPACE - TCPOLEN_TSTAMP_ALIGNED
- TCPOLEN_WSCALE_ALIGNED
- TCPOLEN_SACKPERM_ALIGNED
This looks confusing as (1) we pack SACKPERM into the leading
2-bytes of the aligned 12-bytes of TS and (2) TCPOLEN_MSS_ALIGNED
is not used. Fortunately, the calculated limit is not wrong as
TCPOLEN_SACKPERM_ALIGNED and TCPOLEN_MSS_ALIGNED are the same value.
However, we should use the proper constant in the formula.
MAX_TCP_OPTION_SPACE - TCPOLEN_MSS_ALIGNED
- TCPOLEN_TSTAMP_ALIGNED
- TCPOLEN_WSCALE_ALIGNED
Fixes: 4954f17ddefc ("net/tcp: Introduce TCP_AO setsockopt()s")
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
net/ipv4/tcp_ao.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
index ef5472ed6158..7696417d0640 100644
--- a/net/ipv4/tcp_ao.c
+++ b/net/ipv4/tcp_ao.c
@@ -1315,7 +1315,8 @@ static int tcp_ao_parse_crypto(struct tcp_ao_add *cmd, struct tcp_ao_key *key)
key->maclen = cmd->maclen ?: 12; /* 12 is the default in RFC5925 */
/* Check: maclen + tcp-ao header <= (MAX_TCP_OPTION_SPACE - mss
- * - tstamp - wscale - sackperm),
+ * - tstamp (including sackperm)
+ * - wscale),
* see tcp_syn_options(), tcp_synack_options(), commit 33ad798c924b.
*
* In order to allow D-SACK with TCP-AO, the header size should be:
@@ -1342,9 +1343,9 @@ static int tcp_ao_parse_crypto(struct tcp_ao_add *cmd, struct tcp_ao_key *key)
* large to leave sufficient option space.
*/
syn_tcp_option_space = MAX_TCP_OPTION_SPACE;
+ syn_tcp_option_space -= TCPOLEN_MSS_ALIGNED;
syn_tcp_option_space -= TCPOLEN_TSTAMP_ALIGNED;
syn_tcp_option_space -= TCPOLEN_WSCALE_ALIGNED;
- syn_tcp_option_space -= TCPOLEN_SACKPERM_ALIGNED;
if (tcp_ao_len(key) > syn_tcp_option_space) {
err = -EMSGSIZE;
goto err_kfree;
--
2.30.2
^ permalink raw reply related
* Re: [PATCH v1 net] tcp: Fix SYN option room calculation for TCP-AO.
From: Dmitry Safonov @ 2023-11-02 21:28 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: Salam Noureddine, Kuniyuki Iwashima, netdev, Francesco Ruggeri,
David S. Miller, Jakub Kicinski, Eric Dumazet, Paolo Abeni,
David Ahern
In-Reply-To: <20231102210548.94361-1-kuniyu@amazon.com>
On 11/2/23 21:05, Kuniyuki Iwashima wrote:
> When building SYN packet in tcp_syn_options(), MSS, TS, WS, and
> SACKPERM are used without checking the remaining bytes in the
> options area.
>
> To keep that logic as is, we limit the TCP-AO MAC length in
> tcp_ao_parse_crypto(). Currently, the limit is calculated as below.
>
> MAX_TCP_OPTION_SPACE - TCPOLEN_TSTAMP_ALIGNED
> - TCPOLEN_WSCALE_ALIGNED
> - TCPOLEN_SACKPERM_ALIGNED
>
> This looks confusing as (1) we pack SACKPERM into the leading
> 2-bytes of the aligned 12-bytes of TS and (2) TCPOLEN_MSS_ALIGNED
> is not used. Fortunately, the calculated limit is not wrong as
> TCPOLEN_SACKPERM_ALIGNED and TCPOLEN_MSS_ALIGNED are the same value.
>
> However, we should use the proper constant in the formula.
>
> MAX_TCP_OPTION_SPACE - TCPOLEN_MSS_ALIGNED
> - TCPOLEN_TSTAMP_ALIGNED
> - TCPOLEN_WSCALE_ALIGNED
>
> Fixes: 4954f17ddefc ("net/tcp: Introduce TCP_AO setsockopt()s")
> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Sharp eye!
Thanks for the patch and adjusting the comment.
Reviewed-by: Dmitry Safonov <dima@arista.com>
> ---
> net/ipv4/tcp_ao.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/net/ipv4/tcp_ao.c b/net/ipv4/tcp_ao.c
> index ef5472ed6158..7696417d0640 100644
> --- a/net/ipv4/tcp_ao.c
> +++ b/net/ipv4/tcp_ao.c
> @@ -1315,7 +1315,8 @@ static int tcp_ao_parse_crypto(struct tcp_ao_add *cmd, struct tcp_ao_key *key)
> key->maclen = cmd->maclen ?: 12; /* 12 is the default in RFC5925 */
>
> /* Check: maclen + tcp-ao header <= (MAX_TCP_OPTION_SPACE - mss
> - * - tstamp - wscale - sackperm),
> + * - tstamp (including sackperm)
> + * - wscale),
> * see tcp_syn_options(), tcp_synack_options(), commit 33ad798c924b.
> *
> * In order to allow D-SACK with TCP-AO, the header size should be:
> @@ -1342,9 +1343,9 @@ static int tcp_ao_parse_crypto(struct tcp_ao_add *cmd, struct tcp_ao_key *key)
> * large to leave sufficient option space.
> */
> syn_tcp_option_space = MAX_TCP_OPTION_SPACE;
> + syn_tcp_option_space -= TCPOLEN_MSS_ALIGNED;
> syn_tcp_option_space -= TCPOLEN_TSTAMP_ALIGNED;
> syn_tcp_option_space -= TCPOLEN_WSCALE_ALIGNED;
> - syn_tcp_option_space -= TCPOLEN_SACKPERM_ALIGNED;
> if (tcp_ao_len(key) > syn_tcp_option_space) {
> err = -EMSGSIZE;
> goto err_kfree;
Thanks,
Dmitry
^ permalink raw reply
* Re: [net-next PATCH v2 1/2] net: phy: aquantia: add firmware load support
From: kernel test robot @ 2023-11-02 21:34 UTC (permalink / raw)
To: Christian Marangi, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Rob Herring, Krzysztof Kozlowski, Conor Dooley,
Andrew Lunn, Heiner Kallweit, Russell King, devicetree,
linux-kernel
Cc: oe-kbuild-all, netdev, Robert Marko
In-Reply-To: <20231101123608.11157-1-ansuelsmth@gmail.com>
Hi Christian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Christian-Marangi/dt-bindings-Document-bindings-for-Marvell-Aquantia-PHY/20231101-203944
base: net-next/main
patch link: https://lore.kernel.org/r/20231101123608.11157-1-ansuelsmth%40gmail.com
patch subject: [net-next PATCH v2 1/2] net: phy: aquantia: add firmware load support
config: i386-randconfig-062-20231102 (https://download.01.org/0day-ci/archive/20231103/202311030505.vv0uoWBW-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231103/202311030505.vv0uoWBW-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202311030505.vv0uoWBW-lkp@intel.com/
sparse warnings: (new ones prefixed by >>)
>> drivers/net/phy/aquantia_main.c:746:14: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [usertype] addr @@ got restricted __le32 [usertype] @@
drivers/net/phy/aquantia_main.c:746:14: sparse: expected unsigned int [usertype] addr
drivers/net/phy/aquantia_main.c:746:14: sparse: got restricted __le32 [usertype]
>> drivers/net/phy/aquantia_main.c:776:22: sparse: sparse: incorrect type in assignment (different base types) @@ expected unsigned int [addressable] [usertype] word @@ got restricted __be32 [usertype] @@
drivers/net/phy/aquantia_main.c:776:22: sparse: expected unsigned int [addressable] [usertype] word
drivers/net/phy/aquantia_main.c:776:22: sparse: got restricted __be32 [usertype]
>> drivers/net/phy/aquantia_main.c:803:20: sparse: sparse: cast to restricted __be16
>> drivers/net/phy/aquantia_main.c:817:26: sparse: sparse: cast to restricted __le32
drivers/net/phy/aquantia_main.c:843:23: sparse: sparse: cast to restricted __le32
drivers/net/phy/aquantia_main.c:844:21: sparse: sparse: cast to restricted __le32
drivers/net/phy/aquantia_main.c:845:23: sparse: sparse: cast to restricted __le32
drivers/net/phy/aquantia_main.c:846:21: sparse: sparse: cast to restricted __le32
vim +746 drivers/net/phy/aquantia_main.c
737
738 /* load data into the phy's memory */
739 static int aquantia_load_memory(struct phy_device *phydev, u32 addr,
740 const u8 *data, size_t len)
741 {
742 u16 crc = 0, up_crc;
743 size_t pos;
744
745 /* PHY expect addr in LE */
> 746 addr = cpu_to_le32(addr);
747
748 phy_write_mmd(phydev, MDIO_MMD_VEND1,
749 VEND1_GLOBAL_MAILBOX_INTERFACE1,
750 VEND1_GLOBAL_MAILBOX_INTERFACE1_CRC_RESET);
751 phy_write_mmd(phydev, MDIO_MMD_VEND1,
752 VEND1_GLOBAL_MAILBOX_INTERFACE3,
753 VEND1_GLOBAL_MAILBOX_INTERFACE3_MSW_ADDR(addr));
754 phy_write_mmd(phydev, MDIO_MMD_VEND1,
755 VEND1_GLOBAL_MAILBOX_INTERFACE4,
756 VEND1_GLOBAL_MAILBOX_INTERFACE4_LSW_ADDR(addr));
757
758 for (pos = 0; pos < len; pos += min(sizeof(u32), len - pos)) {
759 u32 word = 0;
760
761 memcpy(&word, data + pos, min(sizeof(u32), len - pos));
762
763 phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_MAILBOX_INTERFACE5,
764 VEND1_GLOBAL_MAILBOX_INTERFACE5_MSW_DATA(word));
765 phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_MAILBOX_INTERFACE6,
766 VEND1_GLOBAL_MAILBOX_INTERFACE6_LSW_DATA(word));
767
768 phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_MAILBOX_INTERFACE1,
769 VEND1_GLOBAL_MAILBOX_INTERFACE1_EXECUTE |
770 VEND1_GLOBAL_MAILBOX_INTERFACE1_WRITE);
771
772 /* calculate CRC as we load data to the mailbox.
773 * We convert word to big-endiang as PHY is BE and mailbox will
774 * return a BE CRC.
775 */
> 776 word = cpu_to_be32(word);
777 crc = crc_ccitt_false(crc, (u8 *)&word, sizeof(word));
778 }
779
780 up_crc = phy_read_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_MAILBOX_INTERFACE2);
781 if (crc != up_crc) {
782 phydev_err(phydev, "CRC mismatch: calculated 0x%04x PHY 0x%04x\n",
783 crc, up_crc);
784 return -EINVAL;
785 }
786
787 return 0;
788 }
789
790 static int aqr_fw_boot(struct phy_device *phydev, const u8 *data, size_t size)
791 {
792 const struct aqr_fw_header *header;
793 u32 iram_offset = 0, iram_size = 0;
794 u32 dram_offset = 0, dram_size = 0;
795 char version[VERSION_STRING_SIZE];
796 u16 calculated_crc, read_crc;
797 u32 primary_offset = 0;
798 int ret;
799
800 /* extract saved CRC at the end of the fw */
801 memcpy(&read_crc, data + size - 2, sizeof(read_crc));
802 /* CRC is saved in big-endian as PHY is BE */
> 803 read_crc = be16_to_cpu(read_crc);
804 calculated_crc = crc_ccitt_false(0, data, size - 2);
805 if (read_crc != calculated_crc) {
806 phydev_err(phydev, "bad firmware CRC: file 0x%04x calculated 0x%04x\n",
807 read_crc, calculated_crc);
808 return -EINVAL;
809 }
810
811 /* Get the primary offset to extract DRAM and IRAM sections. */
812 memcpy(&primary_offset, data + PRIMARY_OFFSET_OFFSET, sizeof(u16));
813 if (!primary_offset) {
814 phydev_err(phydev, "bad primary offset in firmware\n");
815 return -EINVAL;
816 }
> 817 primary_offset = PRIMARY_OFFSET(le32_to_cpu(primary_offset));
818
819 /* Find the DRAM and IRAM sections within the firmware file. */
820 header = (struct aqr_fw_header *)(data + primary_offset + HEADER_OFFSET);
821 memcpy(&iram_offset, &header->iram_offset, sizeof(u8) * 3);
822 if (!iram_offset) {
823 phydev_err(phydev, "bad iram offset in firmware\n");
824 return -EINVAL;
825 }
826 memcpy(&iram_size, &header->iram_size, sizeof(u8) * 3);
827 if (!iram_size) {
828 phydev_err(phydev, "invalid iram size in firmware\n");
829 return -EINVAL;
830 }
831 memcpy(&dram_offset, &header->dram_offset, sizeof(u8) * 3);
832 if (!dram_offset) {
833 phydev_err(phydev, "bad dram offset in firmware\n");
834 return -EINVAL;
835 }
836 memcpy(&dram_size, &header->dram_size, sizeof(u8) * 3);
837 if (!dram_size) {
838 phydev_err(phydev, "invalid dram size in firmware\n");
839 return -EINVAL;
840 }
841
842 /* offset are in LE and values needs to be converted to cpu endian */
843 iram_offset = le32_to_cpu(iram_offset);
844 iram_size = le32_to_cpu(iram_size);
845 dram_offset = le32_to_cpu(dram_offset);
846 dram_size = le32_to_cpu(dram_size);
847
848 /* Increment the offset with the primary offset. */
849 iram_offset += primary_offset;
850 dram_offset += primary_offset;
851
852 phydev_dbg(phydev, "primary %d IRAM offset=%d size=%d DRAM offset=%d size=%d\n",
853 primary_offset, iram_offset, iram_size, dram_offset, dram_size);
854
855 strscpy(version, (char *)data + dram_offset + VERSION_STRING_OFFSET,
856 VERSION_STRING_SIZE);
857 if (!version) {
858 phydev_err(phydev, "invalid version in firmware\n");
859 return -EINVAL;
860 }
861 phydev_info(phydev, "loading firmware version '%s'\n", version);
862
863 /* stall the microcprocessor */
864 phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CONTROL2,
865 VEND1_GLOBAL_CONTROL2_UP_RUN_STALL | VEND1_GLOBAL_CONTROL2_UP_RUN_STALL_OVD);
866
867 phydev_dbg(phydev, "loading DRAM 0x%08x from offset=%d size=%d\n",
868 DRAM_BASE_ADDR, dram_offset, dram_size);
869 ret = aquantia_load_memory(phydev, DRAM_BASE_ADDR, data + dram_offset,
870 dram_size);
871 if (ret)
872 return ret;
873
874 phydev_dbg(phydev, "loading IRAM 0x%08x from offset=%d size=%d\n",
875 IRAM_BASE_ADDR, iram_offset, iram_size);
876 ret = aquantia_load_memory(phydev, IRAM_BASE_ADDR, data + iram_offset,
877 iram_size);
878 if (ret)
879 return ret;
880
881 /* make sure soft reset and low power mode are clear */
882 phy_clear_bits_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_SC,
883 VEND1_GLOBAL_SC_SOFT_RESET | VEND1_GLOBAL_SC_LOW_POWER);
884
885 /* Release the microprocessor. UP_RESET must be held for 100 usec. */
886 phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CONTROL2,
887 VEND1_GLOBAL_CONTROL2_UP_RUN_STALL |
888 VEND1_GLOBAL_CONTROL2_UP_RUN_STALL_OVD |
889 VEND1_GLOBAL_CONTROL2_UP_RUN_STALL_RST);
890 usleep_range(UP_RESET_SLEEP, UP_RESET_SLEEP * 2);
891
892 phy_write_mmd(phydev, MDIO_MMD_VEND1, VEND1_GLOBAL_CONTROL2,
893 VEND1_GLOBAL_CONTROL2_UP_RUN_STALL_OVD);
894
895 return 0;
896 }
897
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
^ permalink raw reply
* Re: [PATCH] drivers/net/ppp: copy userspace array safely
From: Philipp Stanner @ 2023-11-02 22:02 UTC (permalink / raw)
To: Al Viro
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Stanislav Fomichev, Greg Kroah-Hartman, Benjamin Tissoires,
linux-ppp, netdev, linux-kernel, Dave Airlie
In-Reply-To: <20231102200943.GK1957730@ZenIV>
Hallo Al,
On Thu, 2023-11-02 at 20:09 +0000, Al Viro wrote:
> On Thu, Nov 02, 2023 at 08:19:15PM +0100, Philipp Stanner wrote:
> > In ppp_generic.c memdup_user() is utilized to copy a userspace
> > array.
> > This is done without an overflow check.
> >
> > Use the new wrapper memdup_array_user() to copy the array more
> > safely.
>
> > fprog.len = uprog->len;
> > - fprog.filter = memdup_user(uprog->filter,
> > - uprog->len * sizeof(struct
> > sock_filter));
> > + fprog.filter = memdup_array_user(uprog->filter,
> > + uprog->len, sizeof(struct
> > sock_filter));
>
> Far be it from me to discourage security theat^Whardening, but
a bit about the background here:
(tl;dr: No reason to worry, I am not one of those security fanatics. In
fact, I worked for 12 months with those people with some mixed
experiences ^^')
(btw, note that the commit says 'safety', not 'security')
We introduced those wrappers to string.h hoping they will be useful.
Now that they're merged, I quickly wanted to establish them as the
standard for copying user-arrays, ideally in the current merge window.
Because its convenient, easy to read and, at times, safer.
I just want to help out a bit in the kernel, clean up here and there;
it's not yet the primary task assigned to me by my employer. Thus, I
quickly prepared 13 patches today implementing the wrapper. You'll find
the others on LKML. Getting to:
>
> struct sock_fprog { /* Required for SO_ATTACH_FILTER. */
> unsigned short len; /* Number of filter blocks */
> struct sock_filter __user *filter;
> };
>
> struct sock_filter { /* Filter block */
> __u16 code; /* Actual filter code */
> __u8 jt; /* Jump true */
> __u8 jf; /* Jump false */
> __u32 k; /* Generic multiuse field */
> };
>
> so you might want to mention that overflow in question would have to
> be
> in multiplying an untrusted 16bit value by 8...
>
I indeed did not even look at that.
When it was obvious to me that fearing an overflow is ridiculous, I
actually adjusted the commit-message, see for example here: [1]
I just didn't see it in ppp. Maybe I should have looked more
intensively for all 13 patches. But we'll get there, that's what v2 and
v3 are for :)
P.
[1] https://lore.kernel.org/all/20231102192402.53721-2-pstanner@redhat.com/
PS: Security != Safety
^ permalink raw reply
* Re: [PATCH net v3] net: dsa: tag_rtl4_a: Bump min packet size
From: Linus Walleij @ 2023-11-02 22:09 UTC (permalink / raw)
To: Florian Fainelli
Cc: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Luiz Angelo Daros de Luca, netdev,
linux-kernel
In-Reply-To: <ff7e60bf-13c9-44fe-b9e0-0f1ef4904745@gmail.com>
On Thu, Nov 2, 2023 at 7:43 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
> Looking at drivers/net/ethernet/cortina/gemini.c, should not we account
> for when the MAC is used as a conduit and include the right amount of
> "MTU" bytes? Something like this (compile tested only):
The DSA core already fixes this by adding the tag size to the MTU
of the conduit interface, so netdev->mtu is already 1504 for this
switch.
I found other oddities though so I'm digging into the driver!
Thanks,
Linus Walleij
^ permalink raw reply
* Re: [PATCH net v3] net: dsa: tag_rtl4_a: Bump min packet size
From: Florian Fainelli @ 2023-11-02 22:24 UTC (permalink / raw)
To: Linus Walleij
Cc: Andrew Lunn, Vladimir Oltean, David S. Miller, Eric Dumazet,
Jakub Kicinski, Paolo Abeni, Luiz Angelo Daros de Luca, netdev,
linux-kernel
In-Reply-To: <CACRpkdY2UiFyTvF=zuk-rSZBi+yH6cP-QRkegMgc3wf=9JD_Wg@mail.gmail.com>
On 11/2/23 15:09, Linus Walleij wrote:
> On Thu, Nov 2, 2023 at 7:43 PM Florian Fainelli <f.fainelli@gmail.com> wrote:
>
>> Looking at drivers/net/ethernet/cortina/gemini.c, should not we account
>> for when the MAC is used as a conduit and include the right amount of
>> "MTU" bytes? Something like this (compile tested only):
>
> The DSA core already fixes this by adding the tag size to the MTU
> of the conduit interface, so netdev->mtu is already 1504 for this
> switch.
>
> I found other oddities though so I'm digging into the driver!
Yes indeed, I forgot about that, never mind :)
--
Florian
^ permalink raw reply
* Re: [PATCH] drivers/net/ppp: copy userspace array safely
From: Al Viro @ 2023-11-02 22:30 UTC (permalink / raw)
To: Philipp Stanner
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Stanislav Fomichev, Greg Kroah-Hartman, Benjamin Tissoires,
linux-ppp, netdev, linux-kernel, Dave Airlie
In-Reply-To: <7a26cd1bafb22b16eab3868255706d44fa4f255d.camel@redhat.com>
On Thu, Nov 02, 2023 at 11:02:35PM +0100, Philipp Stanner wrote:
> We introduced those wrappers to string.h hoping they will be useful.
> Now that they're merged, I quickly wanted to establish them as the
> standard for copying user-arrays, ideally in the current merge window.
> Because its convenient, easy to read and, at times, safer.
They also save future readers a git grep to find the sizes, etc.
Again, the only suggestion is that regarding the commit message;
_some_ of those might end up fixing real overflows and you obviously
want to see how far do those need to be backported, etc. And "in this
case the overflow doesn't actually happen because <reasons>, but
not having to do such analysis is a good thing" is not a bad explanation
why the primitive in question is useful, IMO. Granted, in cases like
256 * sizeof(u32) that would be pointless, but for the ones that
are less obvious...
> I just didn't see it in ppp. Maybe I should have looked more
> intensively for all 13 patches. But we'll get there, that's what v2 and
> v3 are for :)
In any case you want to check if there are real bugs caught in that.
^ permalink raw reply
* [PATCH bpf-next v5 00/13] xsk: TX metadata
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
This series implements initial TX metadata (offloads) for AF_XDP.
See patch #2 for the main implementation and mlx5/stmmac ones for the
example on how to consume the metadata on the device side.
Starting with two types of offloads:
- request TX timestamp (and write it back into the metadata area)
- request TX checksum offload
Changes since v4:
- remove 'render-max: true' from spec (Jakub)
- move xsk_tx_metadata_ops into include/net/xdp_sock.h (Jakub)
- christmas tree in netdev_nl_dev_fill (Jakub)
- fix > vs >= when dumping masks in samples (Jakub)
- switch to 8-byte alignment for tx metadata length (Jakub)
- spelling fixes in the doc (Magnus)
- deny metadata length >= 256 (Magnus)
- validate metadata flags and deny unknown ones (Jakub)
- move XDP_TX_METADATA_CHECKSUM_SW into umem config flag (Jakub)
- don't print timestamps twice in xdp_hw_metadata (Song)
- rename anonymous xsk_tx_metadata member into request (Alexei)
- add comment to xsk_tx_metadata (Alexei)
I've separated new bits that need a closer review into separate patches:
- xsk_tx_metadata flags validation:
- xsk: Validate xsk_tx_metadata flags
- new umem flag for sw tx csum calculation (instead of per-packet flag)
- xsk: Add option to calculate TX checksum in SW
v4: https://lore.kernel.org/bpf/20231019174944.3376335-1-sdf@google.com/
Performance (mlx5):
I've implemented a small xskgen tool to try to saturate single tx queue:
https://github.com/fomichev/xskgen/tree/master
Here are the performance numbers with some analysis.
1. Baseline. Running with commit eb62e6aef940 ("Merge branch 'bpf:
Support bpf_get_func_ip helper in uprobes'"), nothing from this series:
- with 1400 bytes of payload: 98 gbps, 8 mpps
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189130 sec, 98.357623 gbps 8.409509 mpps
- with 200 bytes of payload: 49 gbps, 23 mpps
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.422235 sec, 49.640921 gbps 23.683645 mpps
2. Adding single commit that supports reserving tx_metadata_len
changes nothing numbers-wise.
- baseline for 1400
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189247 sec, 98.347946 gbps 8.408682 mpps
- baseline for 200
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.421248 sec, 49.756913 gbps 23.738985 mpps
3. Adding -M flag causes xskgen to reserve the metadata and fill it, but
doesn't set XDP_TX_METADATA descriptor option.
- new baseline for 1400 (with only filling the metadata)
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188767 sec, 98.387657 gbps 8.412077 mpps
- new baseline for 200 (with only filling the metadata)
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.410213 sec, 51.095407 gbps 24.377579 mpps
(the numbers go sligtly up here, not really sure why, maybe some cache-related
side-effects?
4. Next, I'm running the same test but with the commit that adds actual
general infra to parse XDP_TX_METADATA (but no driver support).
Essentially applying "xsk: add TX timestamp and TX checksum offload support"
from this series. Numbers are the same.
- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188430 sec, 98.415557 gbps 8.414463 mpps
- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 20960000000 bits, took 0.411559 sec, 50.928299 gbps 24.297853 mpps
- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.188723 sec, 98.391299 gbps 8.412389 mpps
- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000064 packets 20960134144 bits, took 0.411240 sec, 50.968131 gbps 24.316856 mpps
5. Now, for the most interesting part, I'm adding mlx5 driver support.
The mpps for 200 bytes case goes down from 23 mpps to 19 mpps, but
_only_ when I enable the metadata. This looks like a side effect
of me pushing extra metadata pointer via mlx5e_xdpi_fifo_push.
Hence, this part is wrapped into 'if (xp_tx_metadata_enabled)'
to not affect the existing non-metadata use-cases. Since this is not
regressing existing workloads, I'm not spending any time trying to
optimize it more (and leaving it up to mlx owners to purse if
they see any good way to do it).
- same baseline
./xskgen -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189434 sec, 98.332484 gbps 8.407360 mpps
./xskgen -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.425254 sec, 49.288821 gbps 23.515659 mpps
- fill metadata for 1400
./xskgen -M -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189528 sec, 98.324714 gbps 8.406696 mpps
- fill metadata for 200
./xskgen -M -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519085 sec, 40.379260 gbps 19.264914 mpps
- request metadata for 1400
./xskgen -m -s 1400 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000000 packets 116960000000 bits, took 1.189329 sec, 98.341165 gbps 8.408102 mpps
- request metadata for 200
./xskgen -m -s 200 -b eth3 10:70:fd:48:10:77 10:70:fd:48:10:87 fe80::1270:fdff:fe48:1077 fe80::1270:fdff:fe48:1087 1 1
sent 10000128 packets 20960268288 bits, took 0.519929 sec, 40.313713 gbps 19.233642 mpps
Acked-by: Magnus Karlsson <magnus.karlsson@intel.com>
Song Yoong Siang (1):
net: stmmac: Add Tx HWTS support to XDP ZC
Stanislav Fomichev (12):
xsk: Support tx_metadata_len
xsk: Add TX timestamp and TX checksum offload support
tools: ynl: Print xsk-features from the sample
net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
xsk: Document tx_metadata_len layout
xsk: Validate xsk_tx_metadata flags
xsk: Add option to calculate TX checksum in SW
selftests/xsk: Support tx_metadata_len
selftests/bpf: Add csum helpers
selftests/bpf: Add TX side to xdp_metadata
selftests/bpf: Convert xdp_hw_metadata to XDP_USE_NEED_WAKEUP
selftests/bpf: Add TX side to xdp_hw_metadata
Documentation/netlink/specs/netdev.yaml | 19 +-
Documentation/networking/index.rst | 1 +
Documentation/networking/xsk-tx-metadata.rst | 79 ++++++
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 72 +++++-
.../net/ethernet/mellanox/mlx5/core/en/xdp.h | 11 +-
.../ethernet/mellanox/mlx5/core/en/xsk/tx.c | 17 +-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 12 +
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 64 ++++-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 14 +-
include/net/xdp_sock.h | 111 +++++++++
include/net/xdp_sock_drv.h | 34 +++
include/net/xsk_buff_pool.h | 8 +
include/uapi/linux/if_xdp.h | 47 +++-
include/uapi/linux/netdev.h | 16 ++
net/core/netdev-genl.c | 13 +-
net/xdp/xdp_umem.c | 11 +-
net/xdp/xsk.c | 56 ++++-
net/xdp/xsk_buff_pool.c | 2 +
net/xdp/xsk_queue.h | 19 +-
tools/include/uapi/linux/if_xdp.h | 61 ++++-
tools/include/uapi/linux/netdev.h | 16 ++
tools/net/ynl/generated/netdev-user.c | 19 ++
tools/net/ynl/generated/netdev-user.h | 3 +
tools/net/ynl/samples/netdev.c | 10 +-
tools/testing/selftests/bpf/network_helpers.h | 43 ++++
.../selftests/bpf/prog_tests/xdp_metadata.c | 33 ++-
tools/testing/selftests/bpf/xdp_hw_metadata.c | 235 ++++++++++++++++--
tools/testing/selftests/bpf/xsk.c | 3 +
tools/testing/selftests/bpf/xsk.h | 1 +
32 files changed, 967 insertions(+), 70 deletions(-)
create mode 100644 Documentation/networking/xsk-tx-metadata.rst
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply
* [PATCH bpf-next v5 01/13] xsk: Support tx_metadata_len
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
For zerocopy mode, tx_desc->addr can point to an arbitrary offset
and carry some TX metadata in the headroom. For copy mode, there
is no way currently to populate skb metadata.
Introduce new tx_metadata_len umem config option that indicates how many
bytes to treat as metadata. Metadata bytes come prior to tx_desc address
(same as in RX case).
The size of the metadata has mostly the same constraints as XDP:
- less than 256 bytes
- 8-byte aligned (compared to 4-byte alignment on xdp, due to 8-byte
timestamp in the completion)
- non-zero
This data is not interpreted in any way right now.
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
include/net/xdp_sock.h | 1 +
include/net/xsk_buff_pool.h | 1 +
include/uapi/linux/if_xdp.h | 1 +
net/xdp/xdp_umem.c | 4 ++++
net/xdp/xsk.c | 12 +++++++++++-
net/xdp/xsk_buff_pool.c | 1 +
net/xdp/xsk_queue.h | 17 ++++++++++-------
tools/include/uapi/linux/if_xdp.h | 1 +
8 files changed, 30 insertions(+), 8 deletions(-)
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index f83128007fb0..bcf765124f72 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -30,6 +30,7 @@ struct xdp_umem {
struct user_struct *user;
refcount_t users;
u8 flags;
+ u8 tx_metadata_len;
bool zc;
struct page **pgs;
int id;
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index b0bdff26fc88..1985ffaf9b0c 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -77,6 +77,7 @@ struct xsk_buff_pool {
u32 chunk_size;
u32 chunk_shift;
u32 frame_len;
+ u8 tx_metadata_len; /* inherited from umem */
u8 cached_need_wakeup;
bool uses_need_wakeup;
bool dma_need_sync;
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 8d48863472b9..2ecf79282c26 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -76,6 +76,7 @@ struct xdp_umem_reg {
__u32 chunk_size;
__u32 headroom;
__u32 flags;
+ __u32 tx_metadata_len;
};
struct xdp_statistics {
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 06cead2b8e34..946a687fb8e8 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -199,6 +199,9 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
if (headroom >= chunk_size - XDP_PACKET_HEADROOM)
return -EINVAL;
+ if (mr->tx_metadata_len >= 256 || mr->tx_metadata_len % 8)
+ return -EINVAL;
+
umem->size = size;
umem->headroom = headroom;
umem->chunk_size = chunk_size;
@@ -207,6 +210,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
umem->pgs = NULL;
umem->user = NULL;
umem->flags = mr->flags;
+ umem->tx_metadata_len = mr->tx_metadata_len;
INIT_LIST_HEAD(&umem->xsk_dma_list);
refcount_set(&umem->users, 1);
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index ae9f8cb611f6..c904356e2800 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -1283,6 +1283,14 @@ struct xdp_umem_reg_v1 {
__u32 headroom;
};
+struct xdp_umem_reg_v2 {
+ __u64 addr; /* Start of packet data area */
+ __u64 len; /* Length of packet data area */
+ __u32 chunk_size;
+ __u32 headroom;
+ __u32 flags;
+};
+
static int xsk_setsockopt(struct socket *sock, int level, int optname,
sockptr_t optval, unsigned int optlen)
{
@@ -1326,8 +1334,10 @@ static int xsk_setsockopt(struct socket *sock, int level, int optname,
if (optlen < sizeof(struct xdp_umem_reg_v1))
return -EINVAL;
- else if (optlen < sizeof(mr))
+ else if (optlen < sizeof(struct xdp_umem_reg_v2))
mr_size = sizeof(struct xdp_umem_reg_v1);
+ else if (optlen < sizeof(mr))
+ mr_size = sizeof(struct xdp_umem_reg_v2);
if (copy_from_sockptr(&mr, optval, mr_size))
return -EFAULT;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 49cb9f9a09be..386eddcdf837 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -85,6 +85,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
XDP_PACKET_HEADROOM;
pool->umem = umem;
pool->addrs = umem->addrs;
+ pool->tx_metadata_len = umem->tx_metadata_len;
INIT_LIST_HEAD(&pool->free_list);
INIT_LIST_HEAD(&pool->xskb_list);
INIT_LIST_HEAD(&pool->xsk_tx_list);
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 13354a1e4280..c74a1372bcb9 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -143,15 +143,17 @@ static inline bool xp_unused_options_set(u32 options)
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 offset = desc->addr & (pool->chunk_size - 1);
+ u64 addr = desc->addr - pool->tx_metadata_len;
+ u64 len = desc->len + pool->tx_metadata_len;
+ u64 offset = addr & (pool->chunk_size - 1);
if (!desc->len)
return false;
- if (offset + desc->len > pool->chunk_size)
+ if (offset + len > pool->chunk_size)
return false;
- if (desc->addr >= pool->addrs_cnt)
+ if (addr >= pool->addrs_cnt)
return false;
if (xp_unused_options_set(desc->options))
@@ -162,16 +164,17 @@ static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
static inline bool xp_unaligned_validate_desc(struct xsk_buff_pool *pool,
struct xdp_desc *desc)
{
- u64 addr = xp_unaligned_add_offset_to_addr(desc->addr);
+ u64 addr = xp_unaligned_add_offset_to_addr(desc->addr) - pool->tx_metadata_len;
+ u64 len = desc->len + pool->tx_metadata_len;
if (!desc->len)
return false;
- if (desc->len > pool->chunk_size)
+ if (len > pool->chunk_size)
return false;
- if (addr >= pool->addrs_cnt || addr + desc->len > pool->addrs_cnt ||
- xp_desc_crosses_non_contig_pg(pool, addr, desc->len))
+ if (addr >= pool->addrs_cnt || addr + len > pool->addrs_cnt ||
+ xp_desc_crosses_non_contig_pg(pool, addr, len))
return false;
if (xp_unused_options_set(desc->options))
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 73a47da885dc..34411a2e5b6c 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -76,6 +76,7 @@ struct xdp_umem_reg {
__u32 chunk_size;
__u32 headroom;
__u32 flags;
+ __u32 tx_metadata_len;
};
struct xdp_statistics {
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 03/13] tools: ynl: Print xsk-features from the sample
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
In a similar fashion we do for the other bit masks.
Fix mask parsing (>= vs >) while we are it.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/net/ynl/samples/netdev.c | 10 ++++++++--
1 file changed, 8 insertions(+), 2 deletions(-)
diff --git a/tools/net/ynl/samples/netdev.c b/tools/net/ynl/samples/netdev.c
index b828225daad0..591b90e21890 100644
--- a/tools/net/ynl/samples/netdev.c
+++ b/tools/net/ynl/samples/netdev.c
@@ -33,17 +33,23 @@ static void netdev_print_device(struct netdev_dev_get_rsp *d, unsigned int op)
return;
printf("xdp-features (%llx):", d->xdp_features);
- for (int i = 0; d->xdp_features > 1U << i; i++) {
+ for (int i = 0; d->xdp_features >= 1U << i; i++) {
if (d->xdp_features & (1U << i))
printf(" %s", netdev_xdp_act_str(1 << i));
}
printf(" xdp-rx-metadata-features (%llx):", d->xdp_rx_metadata_features);
- for (int i = 0; d->xdp_rx_metadata_features > 1U << i; i++) {
+ for (int i = 0; d->xdp_rx_metadata_features >= 1U << i; i++) {
if (d->xdp_rx_metadata_features & (1U << i))
printf(" %s", netdev_xdp_rx_metadata_str(1 << i));
}
+ printf(" xsk-features (%llx):", d->xsk_features);
+ for (int i = 0; d->xsk_features >= 1U << i; i++) {
+ if (d->xsk_features & (1U << i))
+ printf(" %s", netdev_xsk_flags_str(1 << i));
+ }
+
printf(" xdp-zc-max-segs=%u", d->xdp_zc_max_segs);
name = netdev_op_str(op);
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 02/13] xsk: Add TX timestamp and TX checksum offload support
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
This change actually defines the (initial) metadata layout
that should be used by AF_XDP userspace (xsk_tx_metadata).
The first field is flags which requests appropriate offloads,
followed by the offload-specific fields. The supported per-device
offloads are exported via netlink (new xsk-flags).
The offloads themselves are still implemented in a bit of a
framework-y fashion that's left from my initial kfunc attempt.
I'm introducing new xsk_tx_metadata_ops which drivers are
supposed to implement. The drivers are also supposed
to call xsk_tx_metadata_request/xsk_tx_metadata_complete in
the right places. Since xsk_tx_metadata_{request,_complete}
are static inline, we don't incur any extra overhead doing
indirect calls.
The benefit of this scheme is as follows:
- keeps all metadata layout parsing away from driver code
- makes it easy to grep and see which drivers implement what
- don't need any extra flags to maintain to keep track of what
offloads are implemented; if the callback is implemented - the offload
is supported (used by netlink reporting code)
Two offloads are defined right now:
1. XDP_TXMD_FLAGS_CHECKSUM: skb-style csum_start+csum_offset
2. XDP_TXMD_FLAGS_TIMESTAMP: writes TX timestamp back into metadata
area upon completion (tx_timestamp field)
XDP_TXMD_FLAGS_TIMESTAMP is also implemented for XDP_COPY mode: it writes
SW timestamp from the skb destructor (note I'm reusing hwtstamps to pass
metadata pointer).
The struct is forward-compatible and can be extended in the future
by appending more fields.
Reviewed-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/netlink/specs/netdev.yaml | 19 +++-
include/linux/netdevice.h | 2 +
include/linux/skbuff.h | 14 ++-
include/net/xdp_sock.h | 110 ++++++++++++++++++++++++
include/net/xdp_sock_drv.h | 13 +++
include/net/xsk_buff_pool.h | 6 ++
include/uapi/linux/if_xdp.h | 38 ++++++++
include/uapi/linux/netdev.h | 16 ++++
net/core/netdev-genl.c | 13 ++-
net/xdp/xsk.c | 34 ++++++++
net/xdp/xsk_queue.h | 2 +-
tools/include/uapi/linux/if_xdp.h | 52 +++++++++--
tools/include/uapi/linux/netdev.h | 16 ++++
tools/net/ynl/generated/netdev-user.c | 19 ++++
tools/net/ynl/generated/netdev-user.h | 3 +
15 files changed, 348 insertions(+), 9 deletions(-)
diff --git a/Documentation/netlink/specs/netdev.yaml b/Documentation/netlink/specs/netdev.yaml
index 14511b13f305..00439bcbd2e3 100644
--- a/Documentation/netlink/specs/netdev.yaml
+++ b/Documentation/netlink/specs/netdev.yaml
@@ -45,7 +45,6 @@ name: netdev
-
type: flags
name: xdp-rx-metadata
- render-max: true
entries:
-
name: timestamp
@@ -55,6 +54,18 @@ name: netdev
name: hash
doc:
Device is capable of exposing receive packet hash via bpf_xdp_metadata_rx_hash().
+ -
+ type: flags
+ name: xsk-flags
+ entries:
+ -
+ name: tx-timestamp
+ doc:
+ HW timestamping egress packets is supported by the driver.
+ -
+ name: tx-checksum
+ doc:
+ L3 checksum HW offload is supported by the driver.
attribute-sets:
-
@@ -86,6 +97,11 @@ name: netdev
See Documentation/networking/xdp-rx-metadata.rst for more details.
type: u64
enum: xdp-rx-metadata
+ -
+ name: xsk-features
+ doc: Bitmask of enabled AF_XDP features.
+ type: u64
+ enum: xsk-flags
operations:
list:
@@ -103,6 +119,7 @@ name: netdev
- xdp-features
- xdp-zc-max-segs
- xdp-rx-metadata-features
+ - xsk-features
dump:
reply: *dev-all
-
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index a16c9cc063fe..fe915fa334cc 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1858,6 +1858,7 @@ enum netdev_ml_priv_type {
* @netdev_ops: Includes several pointers to callbacks,
* if one wants to override the ndo_*() functions
* @xdp_metadata_ops: Includes pointers to XDP metadata callbacks.
+ * @xsk_tx_metadata_ops: Includes pointers to AF_XDP TX metadata callbacks.
* @ethtool_ops: Management operations
* @l3mdev_ops: Layer 3 master device operations
* @ndisc_ops: Includes callbacks for different IPv6 neighbour
@@ -2117,6 +2118,7 @@ struct net_device {
unsigned long long priv_flags;
const struct net_device_ops *netdev_ops;
const struct xdp_metadata_ops *xdp_metadata_ops;
+ const struct xsk_tx_metadata_ops *xsk_tx_metadata_ops;
int ifindex;
unsigned short gflags;
unsigned short hard_header_len;
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 27998f73183e..b370eb8d70f7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -566,6 +566,15 @@ struct ubuf_info_msgzc {
int mm_account_pinned_pages(struct mmpin *mmp, size_t size);
void mm_unaccount_pinned_pages(struct mmpin *mmp);
+/* Preserve some data across TX submission and completion.
+ *
+ * Note, this state is stored in the driver. Extending the layout
+ * might need some special care.
+ */
+struct xsk_tx_metadata_compl {
+ __u64 *tx_timestamp;
+};
+
/* This data is invariant across clones and lives at
* the end of the header data, ie. at skb->end.
*/
@@ -578,7 +587,10 @@ struct skb_shared_info {
/* Warning: this field is not always filled in (UFO)! */
unsigned short gso_segs;
struct sk_buff *frag_list;
- struct skb_shared_hwtstamps hwtstamps;
+ union {
+ struct skb_shared_hwtstamps hwtstamps;
+ struct xsk_tx_metadata_compl xsk_meta;
+ };
unsigned int gso_type;
u32 tskey;
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index bcf765124f72..52afde24596b 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -93,12 +93,105 @@ struct xdp_sock {
struct xsk_queue *cq_tmp; /* Only as tmp storage before bind */
};
+/*
+ * AF_XDP TX metadata hooks for network devices.
+ * The following hooks can be defined; unless noted otherwise, they are
+ * optional and can be filled with a null pointer.
+ *
+ * void (*tmo_request_timestamp)(void *priv)
+ * Called when AF_XDP frame requested egress timestamp.
+ *
+ * u64 (*tmo_fill_timestamp)(void *priv)
+ * Called when AF_XDP frame, that had requested egress timestamp,
+ * received a completion. The hook needs to return the actual HW timestamp.
+ *
+ * void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv)
+ * Called when AF_XDP frame requested HW checksum offload. csum_start
+ * indicates position where checksumming should start.
+ * csum_offset indicates position where checksum should be stored.
+ *
+ */
+struct xsk_tx_metadata_ops {
+ void (*tmo_request_timestamp)(void *priv);
+ u64 (*tmo_fill_timestamp)(void *priv);
+ void (*tmo_request_checksum)(u16 csum_start, u16 csum_offset, void *priv);
+};
+
#ifdef CONFIG_XDP_SOCKETS
int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp);
int __xsk_map_redirect(struct xdp_sock *xs, struct xdp_buff *xdp);
void __xsk_map_flush(void);
+/**
+ * xsk_tx_metadata_to_compl - Save enough relevant metadata information
+ * to perform tx completion in the future.
+ * @meta: pointer to AF_XDP metadata area
+ * @compl: pointer to output struct xsk_tx_metadata_to_compl
+ *
+ * This function should be called by the networking device when
+ * it prepares AF_XDP egress packet. The value of @compl should be stored
+ * and passed to xsk_tx_metadata_complete upon TX completion.
+ */
+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
+ struct xsk_tx_metadata_compl *compl)
+{
+ if (!meta)
+ return;
+
+ if (meta->request.flags & XDP_TXMD_FLAGS_TIMESTAMP)
+ compl->tx_timestamp = &meta->completion.tx_timestamp;
+ else
+ compl->tx_timestamp = NULL;
+}
+
+/**
+ * xsk_tx_metadata_request - Evaluate AF_XDP TX metadata at submission
+ * and call appropriate xsk_tx_metadata_ops operation.
+ * @meta: pointer to AF_XDP metadata area
+ * @ops: pointer to struct xsk_tx_metadata_ops
+ * @priv: pointer to driver-private aread
+ *
+ * This function should be called by the networking device when
+ * it prepares AF_XDP egress packet.
+ */
+static inline void xsk_tx_metadata_request(const struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+ if (!meta)
+ return;
+
+ if (ops->tmo_request_timestamp)
+ if (meta->request.flags & XDP_TXMD_FLAGS_TIMESTAMP)
+ ops->tmo_request_timestamp(priv);
+
+ if (ops->tmo_request_checksum)
+ if (meta->request.flags & XDP_TXMD_FLAGS_CHECKSUM)
+ ops->tmo_request_checksum(meta->request.csum_start,
+ meta->request.csum_offset, priv);
+}
+
+/**
+ * xsk_tx_metadata_complete - Evaluate AF_XDP TX metadata at completion
+ * and call appropriate xsk_tx_metadata_ops operation.
+ * @compl: pointer to completion metadata produced from xsk_tx_metadata_to_compl
+ * @ops: pointer to struct xsk_tx_metadata_ops
+ * @priv: pointer to driver-private aread
+ *
+ * This function should be called by the networking device upon
+ * AF_XDP egress completion.
+ */
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+ if (!compl)
+ return;
+
+ *compl->tx_timestamp = ops->tmo_fill_timestamp(priv);
+}
+
#else
static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
@@ -115,6 +208,23 @@ static inline void __xsk_map_flush(void)
{
}
+static inline void xsk_tx_metadata_to_compl(struct xsk_tx_metadata *meta,
+ struct xsk_tx_metadata_compl *compl)
+{
+}
+
+static inline void xsk_tx_metadata_request(struct xsk_tx_metadata *meta,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+}
+
+static inline void xsk_tx_metadata_complete(struct xsk_tx_metadata_compl *compl,
+ const struct xsk_tx_metadata_ops *ops,
+ void *priv)
+{
+}
+
#endif /* CONFIG_XDP_SOCKETS */
#if defined(CONFIG_XDP_SOCKETS) && defined(CONFIG_DEBUG_NET)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 1f6fc8c7a84c..e2558ac3e195 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -165,6 +165,14 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return xp_raw_get_data(pool, addr);
}
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+ if (!pool->tx_metadata_len)
+ return NULL;
+
+ return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+}
+
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
{
struct xdp_buff_xsk *xskb = container_of(xdp, struct xdp_buff_xsk, xdp);
@@ -324,6 +332,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return NULL;
}
+static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
+{
+ return NULL;
+}
+
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
{
}
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 1985ffaf9b0c..97f5cc10d79e 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -33,6 +33,7 @@ struct xdp_buff_xsk {
};
#define XSK_CHECK_PRIV_TYPE(t) BUILD_BUG_ON(sizeof(t) > offsetofend(struct xdp_buff_xsk, cb))
+#define XSK_TX_COMPL_FITS(t) BUILD_BUG_ON(sizeof(struct xsk_tx_metadata_compl) > sizeof(t))
struct xsk_dma_map {
dma_addr_t *dma_pages;
@@ -234,4 +235,9 @@ static inline u64 xp_get_handle(struct xdp_buff_xsk *xskb)
return xskb->orig_addr + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
}
+static inline bool xp_tx_metadata_enabled(const struct xsk_buff_pool *pool)
+{
+ return pool->tx_metadata_len > 0;
+}
+
#endif /* XSK_BUFF_POOL_H_ */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index 2ecf79282c26..b0ee7ad19b51 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -106,6 +106,41 @@ struct xdp_options {
#define XSK_UNALIGNED_BUF_ADDR_MASK \
((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of struct xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_TIMESTAMP (1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of struct
+ * xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
+
+/* AF_XDP offloads request. 'request' union member is consumed by the driver
+ * when the packet is being transmitted. 'completion' union member is
+ * filled by the driver when the transmit completion arrives.
+ */
+struct xsk_tx_metadata {
+ union {
+ struct {
+ __u32 flags;
+
+ /* XDP_TXMD_FLAGS_CHECKSUM */
+
+ /* Offset from desc->addr where checksumming should start. */
+ __u16 csum_start;
+ /* Offset from csum_start where checksum should be stored. */
+ __u16 csum_offset;
+ } request;
+
+ struct {
+ /* XDP_TXMD_FLAGS_TIMESTAMP */
+ __u64 tx_timestamp;
+ } completion;
+ };
+};
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
@@ -122,4 +157,7 @@ struct xdp_desc {
*/
#define XDP_PKT_CONTD (1 << 0)
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
+
#endif /* _LINUX_IF_XDP_H */
diff --git a/include/uapi/linux/netdev.h b/include/uapi/linux/netdev.h
index 2943a151d4f1..48d5477a668c 100644
--- a/include/uapi/linux/netdev.h
+++ b/include/uapi/linux/netdev.h
@@ -53,12 +53,28 @@ enum netdev_xdp_rx_metadata {
NETDEV_XDP_RX_METADATA_MASK = 3,
};
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ * by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ * driver.
+ */
+enum netdev_xsk_flags {
+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+ /* private: */
+ NETDEV_XSK_FLAGS_MASK = 3,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
NETDEV_A_DEV_XDP_FEATURES,
NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
+ NETDEV_A_DEV_XSK_FEATURES,
__NETDEV_A_DEV_MAX,
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/net/core/netdev-genl.c b/net/core/netdev-genl.c
index fe61f85bcf33..10f2124e9e23 100644
--- a/net/core/netdev-genl.c
+++ b/net/core/netdev-genl.c
@@ -6,6 +6,7 @@
#include <net/net_namespace.h>
#include <net/sock.h>
#include <net/xdp.h>
+#include <net/xdp_sock.h>
#include "netdev-genl-gen.h"
@@ -13,6 +14,7 @@ static int
netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
const struct genl_info *info)
{
+ u64 xsk_features = 0;
u64 xdp_rx_meta = 0;
void *hdr;
@@ -26,11 +28,20 @@ netdev_nl_dev_fill(struct net_device *netdev, struct sk_buff *rsp,
XDP_METADATA_KFUNC_xxx
#undef XDP_METADATA_KFUNC
+ if (netdev->xsk_tx_metadata_ops) {
+ if (netdev->xsk_tx_metadata_ops->tmo_fill_timestamp)
+ xsk_features |= NETDEV_XSK_FLAGS_TX_TIMESTAMP;
+ if (netdev->xsk_tx_metadata_ops->tmo_request_checksum)
+ xsk_features |= NETDEV_XSK_FLAGS_TX_CHECKSUM;
+ }
+
if (nla_put_u32(rsp, NETDEV_A_DEV_IFINDEX, netdev->ifindex) ||
nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_FEATURES,
netdev->xdp_features, NETDEV_A_DEV_PAD) ||
nla_put_u64_64bit(rsp, NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
- xdp_rx_meta, NETDEV_A_DEV_PAD)) {
+ xdp_rx_meta, NETDEV_A_DEV_PAD) ||
+ nla_put_u64_64bit(rsp, NETDEV_A_DEV_XSK_FEATURES,
+ xsk_features, NETDEV_A_DEV_PAD)) {
genlmsg_cancel(rsp, hdr);
return -EINVAL;
}
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index c904356e2800..84fd10201f2a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -571,6 +571,13 @@ static u32 xsk_get_num_desc(struct sk_buff *skb)
static void xsk_destruct_skb(struct sk_buff *skb)
{
+ struct xsk_tx_metadata_compl *compl = &skb_shinfo(skb)->xsk_meta;
+
+ if (compl->tx_timestamp) {
+ /* sw completion timestamp, not a real one */
+ *compl->tx_timestamp = ktime_get_tai_fast_ns();
+ }
+
xsk_cq_submit_locked(xdp_sk(skb->sk), xsk_get_num_desc(skb));
sock_wfree(skb);
}
@@ -655,8 +662,10 @@ static struct sk_buff *xsk_build_skb_zerocopy(struct xdp_sock *xs,
static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
struct xdp_desc *desc)
{
+ struct xsk_tx_metadata *meta = NULL;
struct net_device *dev = xs->dev;
struct sk_buff *skb = xs->skb;
+ bool first_frag = false;
int err;
if (dev->priv_flags & IFF_TX_SKB_NO_LINEAR) {
@@ -687,6 +696,8 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
kfree_skb(skb);
goto free_err;
}
+
+ first_frag = true;
} else {
int nr_frags = skb_shinfo(skb)->nr_frags;
struct page *page;
@@ -709,12 +720,35 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
skb_add_rx_frag(skb, nr_frags, page, 0, len, 0);
}
+
+ if (first_frag && desc->options & XDP_TX_METADATA) {
+ if (unlikely(xs->pool->tx_metadata_len == 0)) {
+ err = -EINVAL;
+ goto free_err;
+ }
+
+ meta = buffer - xs->pool->tx_metadata_len;
+
+ if (meta->request.flags & XDP_TXMD_FLAGS_CHECKSUM) {
+ if (unlikely(meta->request.csum_start +
+ meta->request.csum_offset +
+ sizeof(__sum16) > len)) {
+ err = -EINVAL;
+ goto free_err;
+ }
+
+ skb->csum_start = hr + meta->request.csum_start;
+ skb->csum_offset = meta->request.csum_offset;
+ skb->ip_summed = CHECKSUM_PARTIAL;
+ }
+ }
}
skb->dev = dev;
skb->priority = READ_ONCE(xs->sk.sk_priority);
skb->mark = READ_ONCE(xs->sk.sk_mark);
skb->destructor = xsk_destruct_skb;
+ xsk_tx_metadata_to_compl(meta, &skb_shinfo(skb)->xsk_meta);
xsk_set_destructor_arg(skb);
return skb;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index c74a1372bcb9..6f2d1621c992 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -137,7 +137,7 @@ static inline bool xskq_cons_read_addr_unchecked(struct xsk_queue *q, u64 *addr)
static inline bool xp_unused_options_set(u32 options)
{
- return options & ~XDP_PKT_CONTD;
+ return options & ~(XDP_PKT_CONTD | XDP_TX_METADATA);
}
static inline bool xp_aligned_validate_desc(struct xsk_buff_pool *pool,
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 34411a2e5b6c..29942c2c32dc 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -26,11 +26,11 @@
*/
#define XDP_USE_NEED_WAKEUP (1 << 3)
/* By setting this option, userspace application indicates that it can
- * handle multiple descriptors per packet thus enabling xsk core to split
+ * handle multiple descriptors per packet thus enabling AF_XDP to split
* multi-buffer XDP frames into multiple Rx descriptors. Without this set
- * such frames will be dropped by xsk.
+ * such frames will be dropped.
*/
-#define XDP_USE_SG (1 << 4)
+#define XDP_USE_SG (1 << 4)
/* Flags for xsk_umem_config flags */
#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
@@ -106,6 +106,41 @@ struct xdp_options {
#define XSK_UNALIGNED_BUF_ADDR_MASK \
((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+/* Request transmit timestamp. Upon completion, put it into tx_timestamp
+ * field of union xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_TIMESTAMP (1 << 0)
+
+/* Request transmit checksum offload. Checksum start position and offset
+ * are communicated via csum_start and csum_offset fields of union
+ * xsk_tx_metadata.
+ */
+#define XDP_TXMD_FLAGS_CHECKSUM (1 << 1)
+
+/* AF_XDP offloads request. 'request' union member is consumed by the driver
+ * when the packet is being transmitted. 'completion' union member is
+ * filled by the driver when the transmit completion arrives.
+ */
+struct xsk_tx_metadata {
+ union {
+ struct {
+ __u32 flags;
+
+ /* XDP_TXMD_FLAGS_CHECKSUM */
+
+ /* Offset from desc->addr where checksumming should start. */
+ __u16 csum_start;
+ /* Offset from csum_start where checksum should be stored. */
+ __u16 csum_offset;
+ } request;
+
+ struct {
+ /* XDP_TXMD_FLAGS_TIMESTAMP */
+ __u64 tx_timestamp;
+ } completion;
+ };
+};
+
/* Rx/Tx descriptor */
struct xdp_desc {
__u64 addr;
@@ -113,9 +148,16 @@ struct xdp_desc {
__u32 options;
};
-/* Flag indicating packet constitutes of multiple buffers*/
+/* UMEM descriptor is __u64 */
+
+/* Flag indicating that the packet continues with the buffer pointed out by the
+ * next frame in the ring. The end of the packet is signalled by setting this
+ * bit to zero. For single buffer packets, every descriptor has 'options' set
+ * to 0 and this maintains backward compatibility.
+ */
#define XDP_PKT_CONTD (1 << 0)
-/* UMEM descriptor is __u64 */
+/* TX packet carries valid metadata. */
+#define XDP_TX_METADATA (1 << 1)
#endif /* _LINUX_IF_XDP_H */
diff --git a/tools/include/uapi/linux/netdev.h b/tools/include/uapi/linux/netdev.h
index 2943a151d4f1..48d5477a668c 100644
--- a/tools/include/uapi/linux/netdev.h
+++ b/tools/include/uapi/linux/netdev.h
@@ -53,12 +53,28 @@ enum netdev_xdp_rx_metadata {
NETDEV_XDP_RX_METADATA_MASK = 3,
};
+/**
+ * enum netdev_xsk_flags
+ * @NETDEV_XSK_FLAGS_TX_TIMESTAMP: HW timestamping egress packets is supported
+ * by the driver.
+ * @NETDEV_XSK_FLAGS_TX_CHECKSUM: L3 checksum HW offload is supported by the
+ * driver.
+ */
+enum netdev_xsk_flags {
+ NETDEV_XSK_FLAGS_TX_TIMESTAMP = 1,
+ NETDEV_XSK_FLAGS_TX_CHECKSUM = 2,
+
+ /* private: */
+ NETDEV_XSK_FLAGS_MASK = 3,
+};
+
enum {
NETDEV_A_DEV_IFINDEX = 1,
NETDEV_A_DEV_PAD,
NETDEV_A_DEV_XDP_FEATURES,
NETDEV_A_DEV_XDP_ZC_MAX_SEGS,
NETDEV_A_DEV_XDP_RX_METADATA_FEATURES,
+ NETDEV_A_DEV_XSK_FEATURES,
__NETDEV_A_DEV_MAX,
NETDEV_A_DEV_MAX = (__NETDEV_A_DEV_MAX - 1)
diff --git a/tools/net/ynl/generated/netdev-user.c b/tools/net/ynl/generated/netdev-user.c
index b5ffe8cd1144..6283d87dad37 100644
--- a/tools/net/ynl/generated/netdev-user.c
+++ b/tools/net/ynl/generated/netdev-user.c
@@ -58,6 +58,19 @@ const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value)
return netdev_xdp_rx_metadata_strmap[value];
}
+static const char * const netdev_xsk_flags_strmap[] = {
+ [0] = "tx-timestamp",
+ [1] = "tx-checksum",
+};
+
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value)
+{
+ value = ffs(value) - 1;
+ if (value < 0 || value >= (int)MNL_ARRAY_SIZE(netdev_xsk_flags_strmap))
+ return NULL;
+ return netdev_xsk_flags_strmap[value];
+}
+
/* Policies */
struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
[NETDEV_A_DEV_IFINDEX] = { .name = "ifindex", .type = YNL_PT_U32, },
@@ -65,6 +78,7 @@ struct ynl_policy_attr netdev_dev_policy[NETDEV_A_DEV_MAX + 1] = {
[NETDEV_A_DEV_XDP_FEATURES] = { .name = "xdp-features", .type = YNL_PT_U64, },
[NETDEV_A_DEV_XDP_ZC_MAX_SEGS] = { .name = "xdp-zc-max-segs", .type = YNL_PT_U32, },
[NETDEV_A_DEV_XDP_RX_METADATA_FEATURES] = { .name = "xdp-rx-metadata-features", .type = YNL_PT_U64, },
+ [NETDEV_A_DEV_XSK_FEATURES] = { .name = "xsk-features", .type = YNL_PT_U64, },
};
struct ynl_policy_nest netdev_dev_nest = {
@@ -116,6 +130,11 @@ int netdev_dev_get_rsp_parse(const struct nlmsghdr *nlh, void *data)
return MNL_CB_ERROR;
dst->_present.xdp_rx_metadata_features = 1;
dst->xdp_rx_metadata_features = mnl_attr_get_u64(attr);
+ } else if (type == NETDEV_A_DEV_XSK_FEATURES) {
+ if (ynl_attr_validate(yarg, attr))
+ return MNL_CB_ERROR;
+ dst->_present.xsk_features = 1;
+ dst->xsk_features = mnl_attr_get_u64(attr);
}
}
diff --git a/tools/net/ynl/generated/netdev-user.h b/tools/net/ynl/generated/netdev-user.h
index 4fafac879df3..39af1908444b 100644
--- a/tools/net/ynl/generated/netdev-user.h
+++ b/tools/net/ynl/generated/netdev-user.h
@@ -19,6 +19,7 @@ extern const struct ynl_family ynl_netdev_family;
const char *netdev_op_str(int op);
const char *netdev_xdp_act_str(enum netdev_xdp_act value);
const char *netdev_xdp_rx_metadata_str(enum netdev_xdp_rx_metadata value);
+const char *netdev_xsk_flags_str(enum netdev_xsk_flags value);
/* Common nested types */
/* ============== NETDEV_CMD_DEV_GET ============== */
@@ -50,12 +51,14 @@ struct netdev_dev_get_rsp {
__u32 xdp_features:1;
__u32 xdp_zc_max_segs:1;
__u32 xdp_rx_metadata_features:1;
+ __u32 xsk_features:1;
} _present;
__u32 ifindex;
__u64 xdp_features;
__u32 xdp_zc_max_segs;
__u64 xdp_rx_metadata_features;
+ __u64 xsk_features;
};
void netdev_dev_get_rsp_free(struct netdev_dev_get_rsp *rsp);
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 05/13] net: stmmac: Add Tx HWTS support to XDP ZC
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
From: Song Yoong Siang <yoong.siang.song@intel.com>
This patch enables transmit hardware timestamp support to XDP zero copy
via XDP Tx metadata framework.
This patchset is tested with tools/testing/selftests/bpf/xdp_hw_metadata
on Intel Tiger Lake platform. Below are the test steps and results.
Command on DUT:
sudo ./xdp_hw_metadata <interface name>
sudo hwstamp_ctl -i <interface name> -t 1 -r 1
Command on Link Partner:
echo -n xdp | nc -u -q1 <destination IPv4 addr> 9091
Result:
xsk_ring_cons__peek: 1
0x55bbbf08b6d0: rx_desc[2]->addr=8c100 addr=8c100 comp_addr=8c100 EoP
No rx_hash err=-95
rx_timestamp: 1677762688429141540 (sec:1677762688.4291)
HW RX-time: 1677762688429141540 (sec:1677762688.4291) delta to User RX-time sec:0.0003 (250.665 usec)
XDP RX-time: 1677762688429375597 (sec:1677762688.4294) delta to User RX-time sec:0.0000 (16.608 usec)
0x55bbbf08b6d0: ping-pong with csum=561c (want f488) csum_start=34 csum_offset=6
0x55bbbf08b6d0: complete tx idx=2 addr=2008
tx_timestamp: 1677762688431127273 (sec:1677762688.4311)
HW TX-complete-time: 1677762688431127273 (sec:1677762688.4311) delta to User TX-complete-time sec:0.0083 (8331.655 usec)
XDP RX-time: 1677762688429375597 (sec:1677762688.4294) delta to User TX-complete-time sec:0.0101 (10083.331 usec)
HW RX-time: 1677762688429141540 (sec:1677762688.4291) delta to HW TX-complete-time sec:0.0020 (1985.733 usec)
0x55bbbf08b6d0: complete rx idx=130 addr=8c100
Signed-off-by: Song Yoong Siang <yoong.siang.song@intel.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/ethernet/stmicro/stmmac/stmmac.h | 12 ++++
.../net/ethernet/stmicro/stmmac/stmmac_main.c | 64 ++++++++++++++++++-
2 files changed, 75 insertions(+), 1 deletion(-)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac.h b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
index cd7a9768de5f..686c94c2e8a7 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac.h
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac.h
@@ -51,6 +51,7 @@ struct stmmac_tx_info {
bool last_segment;
bool is_jumbo;
enum stmmac_txbuf_type buf_type;
+ struct xsk_tx_metadata_compl xsk_meta;
};
#define STMMAC_TBS_AVAIL BIT(0)
@@ -100,6 +101,17 @@ struct stmmac_xdp_buff {
struct dma_desc *ndesc;
};
+struct stmmac_metadata_request {
+ struct stmmac_priv *priv;
+ struct dma_desc *tx_desc;
+ bool *set_ic;
+};
+
+struct stmmac_xsk_tx_complete {
+ struct stmmac_priv *priv;
+ struct dma_desc *desc;
+};
+
struct stmmac_rx_queue {
u32 rx_count_frames;
u32 queue_index;
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
index 3e50fd53a617..001a07a69539 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_main.c
@@ -2430,6 +2430,46 @@ static void stmmac_dma_operation_mode(struct stmmac_priv *priv)
}
}
+static void stmmac_xsk_request_timestamp(void *_priv)
+{
+ struct stmmac_metadata_request *meta_req = _priv;
+
+ stmmac_enable_tx_timestamp(meta_req->priv, meta_req->tx_desc);
+ *meta_req->set_ic = true;
+}
+
+static u64 stmmac_xsk_fill_timestamp(void *_priv)
+{
+ struct stmmac_xsk_tx_complete *tx_compl = _priv;
+ struct stmmac_priv *priv = tx_compl->priv;
+ struct dma_desc *desc = tx_compl->desc;
+ bool found = false;
+ u64 ns = 0;
+
+ if (!priv->hwts_tx_en)
+ return 0;
+
+ /* check tx tstamp status */
+ if (stmmac_get_tx_timestamp_status(priv, desc)) {
+ stmmac_get_timestamp(priv, desc, priv->adv_ts, &ns);
+ found = true;
+ } else if (!stmmac_get_mac_tx_timestamp(priv, priv->hw, &ns)) {
+ found = true;
+ }
+
+ if (found) {
+ ns -= priv->plat->cdc_error_adj;
+ return ns_to_ktime(ns);
+ }
+
+ return 0;
+}
+
+static const struct xsk_tx_metadata_ops stmmac_xsk_tx_metadata_ops = {
+ .tmo_request_timestamp = stmmac_xsk_request_timestamp,
+ .tmo_fill_timestamp = stmmac_xsk_fill_timestamp,
+};
+
static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
{
struct netdev_queue *nq = netdev_get_tx_queue(priv->dev, queue);
@@ -2449,6 +2489,8 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
budget = min(budget, stmmac_tx_avail(priv, queue));
while (budget-- > 0) {
+ struct stmmac_metadata_request meta_req;
+ struct xsk_tx_metadata *meta = NULL;
dma_addr_t dma_addr;
bool set_ic;
@@ -2472,6 +2514,7 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
tx_desc = tx_q->dma_tx + entry;
dma_addr = xsk_buff_raw_get_dma(pool, xdp_desc.addr);
+ meta = xsk_buff_get_metadata(pool, xdp_desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, dma_addr, xdp_desc.len);
tx_q->tx_skbuff_dma[entry].buf_type = STMMAC_TXBUF_T_XSK_TX;
@@ -2499,6 +2542,11 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
else
set_ic = false;
+ meta_req.priv = priv;
+ meta_req.tx_desc = tx_desc;
+ meta_req.set_ic = &set_ic;
+ xsk_tx_metadata_request(meta, &stmmac_xsk_tx_metadata_ops,
+ &meta_req);
if (set_ic) {
tx_q->tx_count_frames = 0;
stmmac_set_tx_ic(priv, tx_desc);
@@ -2511,6 +2559,9 @@ static bool stmmac_xdp_xmit_zc(struct stmmac_priv *priv, u32 queue, u32 budget)
stmmac_enable_dma_transmission(priv, priv->ioaddr);
+ xsk_tx_metadata_to_compl(meta,
+ &tx_q->tx_skbuff_dma[entry].xsk_meta);
+
tx_q->cur_tx = STMMAC_GET_ENTRY(tx_q->cur_tx, priv->dma_conf.dma_tx_size);
entry = tx_q->cur_tx;
}
@@ -2620,8 +2671,18 @@ static int stmmac_tx_clean(struct stmmac_priv *priv, int budget, u32 queue,
} else {
tx_packets++;
}
- if (skb)
+ if (skb) {
stmmac_get_tx_hwtstamp(priv, p, skb);
+ } else {
+ struct stmmac_xsk_tx_complete tx_compl = {
+ .priv = priv,
+ .desc = p,
+ };
+
+ xsk_tx_metadata_complete(&tx_q->tx_skbuff_dma[entry].xsk_meta,
+ &stmmac_xsk_tx_metadata_ops,
+ &tx_compl);
+ }
}
if (likely(tx_q->tx_skbuff_dma[entry].buf &&
@@ -7449,6 +7510,7 @@ int stmmac_dvr_probe(struct device *device,
ndev->netdev_ops = &stmmac_netdev_ops;
ndev->xdp_metadata_ops = &stmmac_xdp_metadata_ops;
+ ndev->xsk_tx_metadata_ops = &stmmac_xsk_tx_metadata_ops;
ndev->hw_features = NETIF_F_SG | NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM |
NETIF_F_RXCSUM;
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 06/13] xsk: Document tx_metadata_len layout
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
- how to use
- how to query features
- pointers to the examples
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/networking/index.rst | 1 +
Documentation/networking/xsk-tx-metadata.rst | 70 ++++++++++++++++++++
2 files changed, 71 insertions(+)
create mode 100644 Documentation/networking/xsk-tx-metadata.rst
diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst
index 683eb42309cc..a297a894b366 100644
--- a/Documentation/networking/index.rst
+++ b/Documentation/networking/index.rst
@@ -123,6 +123,7 @@ Refer to :ref:`netdev-FAQ` for a guide on netdev development process specifics.
xfrm_sync
xfrm_sysctl
xdp-rx-metadata
+ xsk-tx-metadata
.. only:: subproject and html
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
new file mode 100644
index 000000000000..4f376560b23f
--- /dev/null
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -0,0 +1,70 @@
+==================
+AF_XDP TX Metadata
+==================
+
+This document describes how to enable offloads when transmitting packets
+via :doc:`af_xdp`. Refer to :doc:`xdp-rx-metadata` on how to access similar
+metadata on the receive side.
+
+General Design
+==============
+
+The headroom for the metadata is reserved via ``tx_metadata_len`` in
+``struct xdp_umem_reg``. The metadata length is therefore the same for
+every socket that shares the same umem. The metadata layout is a fixed UAPI,
+refer to ``union xsk_tx_metadata`` in ``include/uapi/linux/if_xdp.h``.
+Thus, generally, the ``tx_metadata_len`` field above should contain
+``sizeof(union xsk_tx_metadata)``.
+
+The headroom and the metadata itself should be located right before
+``xdp_desc->addr`` in the umem frame. Within a frame, the metadata
+layout is as follows::
+
+ tx_metadata_len
+ / \
+ +-----------------+---------+----------------------------+
+ | xsk_tx_metadata | padding | payload |
+ +-----------------+---------+----------------------------+
+ ^
+ |
+ xdp_desc->addr
+
+An AF_XDP application can request headrooms larger than ``sizeof(struct
+xsk_tx_metadata)``. The kernel will ignore the padding (and will still
+use ``xdp_desc->addr - tx_metadata_len`` to locate
+the ``xsk_tx_metadata``). For the frames that shouldn't carry
+any metadata (i.e., the ones that don't have ``XDP_TX_METADATA`` option),
+the metadata area is ignored by the kernel as well.
+
+The flags field enables the particular offload:
+
+- ``XDP_TXMD_FLAGS_TIMESTAMP``: requests the device to put transmission
+ timestamp into ``tx_timestamp`` field of ``union xsk_tx_metadata``.
+- ``XDP_TXMD_FLAGS_CHECKSUM``: requests the device to calculate L4
+ checksum. ``csum_start`` specifies byte offset of where the checksumming
+ should start and ``csum_offset`` specifies byte offset where the
+ device should store the computed checksum.
+
+Besides the flags above, in order to trigger the offloads, the first
+packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
+bit in the ``options`` field. Also note that in a multi-buffer packet
+only the first chunk should carry the metadata.
+
+Querying Device Capabilities
+============================
+
+Every devices exports its offloads capabilities via netlink netdev family.
+Refer to ``xsk-flags`` features bitmask in
+``Documentation/netlink/specs/netdev.yaml``.
+
+- ``tx-timestamp``: device supports ``XDP_TXMD_FLAGS_TIMESTAMP``
+- ``tx-checksum``: device supports ``XDP_TXMD_FLAGS_CHECKSUM``
+
+See ``tools/net/ynl/samples/netdev.c`` on how to query this information.
+
+Example
+=======
+
+See ``tools/testing/selftests/bpf/xdp_hw_metadata.c`` for an example
+program that handles TX metadata. Also see https://github.com/fomichev/xskgen
+for a more bare-bones example.
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 04/13] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints, Saeed Mahameed
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
TX timestamp:
- requires passing clock, not sure I'm passing the correct one (from
cq->mdev), but the timestamp value looks convincing
TX checksum:
- looks like device does packet parsing (and doesn't accept custom
start/offset), so I'm ignoring user offsets
Cc: Saeed Mahameed <saeedm@nvidia.com>
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 4 +-
.../net/ethernet/mellanox/mlx5/core/en/xdp.c | 72 ++++++++++++++++---
.../net/ethernet/mellanox/mlx5/core/en/xdp.h | 11 ++-
.../ethernet/mellanox/mlx5/core/en/xsk/tx.c | 17 ++++-
.../net/ethernet/mellanox/mlx5/core/en_main.c | 1 +
5 files changed, 89 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b2a5da9739d2..43f027bf2da3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -484,10 +484,12 @@ struct mlx5e_xdp_info_fifo {
struct mlx5e_xdpsq;
struct mlx5e_xmit_data;
+struct xsk_tx_metadata;
typedef int (*mlx5e_fp_xmit_xdp_frame_check)(struct mlx5e_xdpsq *);
typedef bool (*mlx5e_fp_xmit_xdp_frame)(struct mlx5e_xdpsq *,
struct mlx5e_xmit_data *,
- int);
+ int,
+ struct xsk_tx_metadata *);
struct mlx5e_xdpsq {
/* data path */
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index 7decc81ed33a..e2e7d82cfca4 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -103,7 +103,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
xdptxd->dma_addr = dma_addr;
if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
return false;
/* xmit_mode == MLX5E_XDP_XMIT_MODE_FRAME */
@@ -145,7 +145,7 @@ mlx5e_xmit_xdp_buff(struct mlx5e_xdpsq *sq, struct mlx5e_rq *rq,
xdptxd->dma_addr = dma_addr;
if (unlikely(!INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0)))
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL)))
return false;
/* xmit_mode == MLX5E_XDP_XMIT_MODE_PAGE */
@@ -261,6 +261,37 @@ const struct xdp_metadata_ops mlx5e_xdp_metadata_ops = {
.xmo_rx_hash = mlx5e_xdp_rx_hash,
};
+struct mlx5e_xsk_tx_complete {
+ struct mlx5_cqe64 *cqe;
+ struct mlx5e_cq *cq;
+};
+
+static u64 mlx5e_xsk_fill_timestamp(void *_priv)
+{
+ struct mlx5e_xsk_tx_complete *priv = _priv;
+ u64 ts;
+
+ ts = get_cqe_ts(priv->cqe);
+
+ if (mlx5_is_real_time_rq(priv->cq->mdev) || mlx5_is_real_time_sq(priv->cq->mdev))
+ return mlx5_real_time_cyc2time(&priv->cq->mdev->clock, ts);
+
+ return mlx5_timecounter_cyc2time(&priv->cq->mdev->clock, ts);
+}
+
+static void mlx5e_xsk_request_checksum(u16 csum_start, u16 csum_offset, void *priv)
+{
+ struct mlx5_wqe_eth_seg *eseg = priv;
+
+ /* HW/FW is doing parsing, so offsets are largely ignored. */
+ eseg->cs_flags |= MLX5_ETH_WQE_L3_CSUM | MLX5_ETH_WQE_L4_CSUM;
+}
+
+const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops = {
+ .tmo_fill_timestamp = mlx5e_xsk_fill_timestamp,
+ .tmo_request_checksum = mlx5e_xsk_request_checksum,
+};
+
/* returns true if packet was consumed by xdp */
bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
struct bpf_prog *prog, struct mlx5e_xdp_buff *mxbuf)
@@ -398,11 +429,11 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result);
+ int check_result, struct xsk_tx_metadata *meta);
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result)
+ int check_result, struct xsk_tx_metadata *meta)
{
struct mlx5e_tx_mpwqe *session = &sq->mpwqe;
struct mlx5e_xdpsq_stats *stats = sq->stats;
@@ -420,7 +451,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
*/
if (unlikely(sq->mpwqe.wqe))
mlx5e_xdp_mpwqe_complete(sq);
- return mlx5e_xmit_xdp_frame(sq, xdptxd, 0);
+ return mlx5e_xmit_xdp_frame(sq, xdptxd, 0, meta);
}
if (!xdptxd->len) {
skb_frag_t *frag = &xdptxdf->sinfo->frags[0];
@@ -450,6 +481,7 @@ mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptx
* and it's safe to complete it at any time.
*/
mlx5e_xdp_mpwqe_session_start(sq);
+ xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, &session->wqe->eth);
}
mlx5e_xdp_mpwqe_add_dseg(sq, p, stats);
@@ -480,7 +512,7 @@ INDIRECT_CALLABLE_SCOPE int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq)
INDIRECT_CALLABLE_SCOPE bool
mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
- int check_result)
+ int check_result, struct xsk_tx_metadata *meta)
{
struct mlx5e_xmit_data_frags *xdptxdf =
container_of(xdptxd, struct mlx5e_xmit_data_frags, xd);
@@ -599,6 +631,8 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
sq->pc++;
}
+ xsk_tx_metadata_request(meta, &mlx5e_xsk_tx_metadata_ops, eseg);
+
sq->doorbell_cseg = cseg;
stats->xmit++;
@@ -608,7 +642,9 @@ mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq, struct mlx5e_xmit_data *xdptxd,
static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
struct mlx5e_xdp_wqe_info *wi,
u32 *xsk_frames,
- struct xdp_frame_bulk *bq)
+ struct xdp_frame_bulk *bq,
+ struct mlx5e_cq *cq,
+ struct mlx5_cqe64 *cqe)
{
struct mlx5e_xdp_info_fifo *xdpi_fifo = &sq->db.xdpi_fifo;
u16 i;
@@ -668,10 +704,24 @@ static void mlx5e_free_xdpsq_desc(struct mlx5e_xdpsq *sq,
break;
}
- case MLX5E_XDP_XMIT_MODE_XSK:
+ case MLX5E_XDP_XMIT_MODE_XSK: {
/* AF_XDP send */
+ struct xsk_tx_metadata_compl *compl = NULL;
+ struct mlx5e_xsk_tx_complete priv = {
+ .cqe = cqe,
+ .cq = cq,
+ };
+
+ if (xp_tx_metadata_enabled(sq->xsk_pool)) {
+ xdpi = mlx5e_xdpi_fifo_pop(xdpi_fifo);
+ compl = &xdpi.xsk_meta;
+
+ xsk_tx_metadata_complete(compl, &mlx5e_xsk_tx_metadata_ops, &priv);
+ }
+
(*xsk_frames)++;
break;
+ }
default:
WARN_ON_ONCE(true);
}
@@ -720,7 +770,7 @@ bool mlx5e_poll_xdpsq_cq(struct mlx5e_cq *cq)
sqcc += wi->num_wqebbs;
- mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+ mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, cq, cqe);
} while (!last_wqe);
if (unlikely(get_cqe_opcode(cqe) != MLX5_CQE_REQ)) {
@@ -767,7 +817,7 @@ void mlx5e_free_xdpsq_descs(struct mlx5e_xdpsq *sq)
sq->cc += wi->num_wqebbs;
- mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq);
+ mlx5e_free_xdpsq_desc(sq, wi, &xsk_frames, &bq, NULL, NULL);
}
xdp_flush_frame_bulk(&bq);
@@ -840,7 +890,7 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
}
ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
- mlx5e_xmit_xdp_frame, sq, xdptxd, 0);
+ mlx5e_xmit_xdp_frame, sq, xdptxd, 0, NULL);
if (unlikely(!ret)) {
int j;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
index ecfe93a479da..e054db1e10f8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.h
@@ -33,6 +33,7 @@
#define __MLX5_EN_XDP_H__
#include <linux/indirect_call_wrapper.h>
+#include <net/xdp_sock.h>
#include "en.h"
#include "en/txrx.h"
@@ -82,7 +83,7 @@ enum mlx5e_xdp_xmit_mode {
* num, page_1, page_2, ... , page_num.
*
* MLX5E_XDP_XMIT_MODE_XSK:
- * none.
+ * frame.xsk_meta.
*/
#define MLX5E_XDP_FIFO_ENTRIES2DS_MAX_RATIO 4
@@ -97,6 +98,7 @@ union mlx5e_xdp_info {
u8 num;
struct page *page;
} page;
+ struct xsk_tx_metadata_compl xsk_meta;
};
struct mlx5e_xsk_param;
@@ -112,13 +114,16 @@ int mlx5e_xdp_xmit(struct net_device *dev, int n, struct xdp_frame **frames,
u32 flags);
extern const struct xdp_metadata_ops mlx5e_xdp_metadata_ops;
+extern const struct xsk_tx_metadata_ops mlx5e_xsk_tx_metadata_ops;
INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame_mpwqe(struct mlx5e_xdpsq *sq,
struct mlx5e_xmit_data *xdptxd,
- int check_result));
+ int check_result,
+ struct xsk_tx_metadata *meta));
INDIRECT_CALLABLE_DECLARE(bool mlx5e_xmit_xdp_frame(struct mlx5e_xdpsq *sq,
struct mlx5e_xmit_data *xdptxd,
- int check_result));
+ int check_result,
+ struct xsk_tx_metadata *meta));
INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check_mpwqe(struct mlx5e_xdpsq *sq));
INDIRECT_CALLABLE_DECLARE(int mlx5e_xmit_xdp_frame_check(struct mlx5e_xdpsq *sq));
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
index 597f319d4770..a59199ed590d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/tx.c
@@ -55,12 +55,16 @@ static void mlx5e_xsk_tx_post_err(struct mlx5e_xdpsq *sq,
nopwqe = mlx5e_post_nop(&sq->wq, sq->sqn, &sq->pc);
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, *xdpi);
+ if (xp_tx_metadata_enabled(sq->xsk_pool))
+ mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+ (union mlx5e_xdp_info) { .xsk_meta = {} });
sq->doorbell_cseg = &nopwqe->ctrl;
}
bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
{
struct xsk_buff_pool *pool = sq->xsk_pool;
+ struct xsk_tx_metadata *meta = NULL;
union mlx5e_xdp_info xdpi;
bool work_done = true;
bool flush = false;
@@ -93,12 +97,13 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
xdptxd.dma_addr = xsk_buff_raw_get_dma(pool, desc.addr);
xdptxd.data = xsk_buff_raw_get_data(pool, desc.addr);
xdptxd.len = desc.len;
+ meta = xsk_buff_get_metadata(pool, desc.addr);
xsk_buff_raw_dma_sync_for_device(pool, xdptxd.dma_addr, xdptxd.len);
ret = INDIRECT_CALL_2(sq->xmit_xdp_frame, mlx5e_xmit_xdp_frame_mpwqe,
mlx5e_xmit_xdp_frame, sq, &xdptxd,
- check_result);
+ check_result, meta);
if (unlikely(!ret)) {
if (sq->mpwqe.wqe)
mlx5e_xdp_mpwqe_complete(sq);
@@ -106,6 +111,16 @@ bool mlx5e_xsk_tx(struct mlx5e_xdpsq *sq, unsigned int budget)
mlx5e_xsk_tx_post_err(sq, &xdpi);
} else {
mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo, xdpi);
+ if (xp_tx_metadata_enabled(sq->xsk_pool)) {
+ struct xsk_tx_metadata_compl compl;
+
+ xsk_tx_metadata_to_compl(meta, &compl);
+ XSK_TX_COMPL_FITS(void *);
+
+ mlx5e_xdpi_fifo_push(&sq->db.xdpi_fifo,
+ (union mlx5e_xdp_info)
+ { .xsk_meta = compl });
+ }
}
flush = true;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index ea58c6917433..9da668f2b10b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -5164,6 +5164,7 @@ static void mlx5e_build_nic_netdev(struct net_device *netdev)
netdev->netdev_ops = &mlx5e_netdev_ops;
netdev->xdp_metadata_ops = &mlx5e_xdp_metadata_ops;
+ netdev->xsk_tx_metadata_ops = &mlx5e_xsk_tx_metadata_ops;
mlx5e_dcbnl_build_netdev(netdev);
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 07/13] xsk: Validate xsk_tx_metadata flags
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
Accept only the flags that the kernel knows about to make
sure we can extend this field in the future. Note that only
in XDP_COPY mode we propagate the error signal back to the user
(via sendmsg). For zerocopy mode we silently skip the metadata
for the descriptors that have wrong flags (since we process
the descriptors deep in the driver).
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
include/net/xdp_sock_drv.h | 23 ++++++++++++++++++++++-
net/xdp/xsk.c | 4 ++++
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index e2558ac3e195..5885176ea01e 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -165,12 +165,28 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return xp_raw_get_data(pool, addr);
}
+#define XDP_TXMD_FLAGS_VALID ( \
+ XDP_TXMD_FLAGS_TIMESTAMP | \
+ XDP_TXMD_FLAGS_CHECKSUM | \
+ 0)
+
+static inline bool xsk_buff_valid_tx_metadata(struct xsk_tx_metadata *meta)
+{
+ return !(meta->request.flags & ~XDP_TXMD_FLAGS_VALID);
+}
+
static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
{
+ struct xsk_tx_metadata *meta;
+
if (!pool->tx_metadata_len)
return NULL;
- return xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+ meta = xp_raw_get_data(pool, addr) - pool->tx_metadata_len;
+ if (unlikely(!xsk_buff_valid_tx_metadata(meta)))
+ return NULL; /* no way to signal the error to the user */
+
+ return meta;
}
static inline void xsk_buff_dma_sync_for_cpu(struct xdp_buff *xdp, struct xsk_buff_pool *pool)
@@ -332,6 +348,11 @@ static inline void *xsk_buff_raw_get_data(struct xsk_buff_pool *pool, u64 addr)
return NULL;
}
+static inline bool xsk_buff_valid_tx_metadata(struct xsk_tx_metadata *meta)
+{
+ return false;
+}
+
static inline struct xsk_tx_metadata *xsk_buff_get_metadata(struct xsk_buff_pool *pool, u64 addr)
{
return NULL;
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 84fd10201f2a..0e81ae6bfff4 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -728,6 +728,10 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
}
meta = buffer - xs->pool->tx_metadata_len;
+ if (unlikely(!xsk_buff_valid_tx_metadata(meta))) {
+ err = -EINVAL;
+ goto free_err;
+ }
if (meta->request.flags & XDP_TXMD_FLAGS_CHECKSUM) {
if (unlikely(meta->request.csum_start +
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 08/13] xsk: Add option to calculate TX checksum in SW
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
For XDP_COPY mode, add a UMEM option XDP_UMEM_TX_SW_CSUM
to call skb_checksum_help in transmit path. Might be useful
to debugging issues with real hardware. I also use this mode
in the selftests.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
Documentation/networking/xsk-tx-metadata.rst | 9 +++++++++
include/net/xsk_buff_pool.h | 1 +
include/uapi/linux/if_xdp.h | 8 +++++++-
net/xdp/xdp_umem.c | 7 ++++++-
net/xdp/xsk.c | 6 ++++++
net/xdp/xsk_buff_pool.c | 1 +
tools/include/uapi/linux/if_xdp.h | 8 +++++++-
7 files changed, 37 insertions(+), 3 deletions(-)
diff --git a/Documentation/networking/xsk-tx-metadata.rst b/Documentation/networking/xsk-tx-metadata.rst
index 4f376560b23f..97ecfa480d00 100644
--- a/Documentation/networking/xsk-tx-metadata.rst
+++ b/Documentation/networking/xsk-tx-metadata.rst
@@ -50,6 +50,15 @@ packet's ``struct xdp_desc`` descriptor should set ``XDP_TX_METADATA``
bit in the ``options`` field. Also note that in a multi-buffer packet
only the first chunk should carry the metadata.
+Software TX Checksum
+====================
+
+For development and testing purposes its possible to pass
+``XDP_UMEM_TX_SW_CSUM`` flag to ``XDP_UMEM_REG`` UMEM registration call.
+In this case, when running in ``XDK_COPY`` mode, the TX checksum
+is calculated on the CPU. Do not enable this option in production because
+it will negatively affect performance.
+
Querying Device Capabilities
============================
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index 97f5cc10d79e..8d48d37ab7c0 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -83,6 +83,7 @@ struct xsk_buff_pool {
bool uses_need_wakeup;
bool dma_need_sync;
bool unaligned;
+ bool tx_sw_csum;
void *addrs;
/* Mutual exclusion of the completion ring in the SKB mode. Two cases to protect:
* NAPI TX thread and sendmsg error paths in the SKB destructor callback and when
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index b0ee7ad19b51..bed6adc3e9b4 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -33,7 +33,13 @@
#define XDP_USE_SG (1 << 4)
/* Flags for xsk_umem_config flags */
-#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_UMEM_TX_SW_CSUM (1 << 1)
struct sockaddr_xdp {
__u16 sxdp_family;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 946a687fb8e8..caa340134b0e 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -148,6 +148,11 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
return 0;
}
+#define XDP_UMEM_FLAGS_VALID ( \
+ XDP_UMEM_UNALIGNED_CHUNK_FLAG | \
+ XDP_UMEM_TX_SW_CSUM | \
+ 0)
+
static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
{
bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG;
@@ -167,7 +172,7 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
return -EINVAL;
}
- if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
+ if (mr->flags & ~XDP_UMEM_FLAGS_VALID)
return -EINVAL;
if (!unaligned_chunks && !is_power_of_2(chunk_size))
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 0e81ae6bfff4..e109b2aaeb2a 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -744,6 +744,12 @@ static struct sk_buff *xsk_build_skb(struct xdp_sock *xs,
skb->csum_start = hr + meta->request.csum_start;
skb->csum_offset = meta->request.csum_offset;
skb->ip_summed = CHECKSUM_PARTIAL;
+
+ if (unlikely(xs->pool->tx_sw_csum)) {
+ err = skb_checksum_help(skb);
+ if (err)
+ goto free_err;
+ }
}
}
}
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index 386eddcdf837..4f6f538a5462 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -86,6 +86,7 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
pool->umem = umem;
pool->addrs = umem->addrs;
pool->tx_metadata_len = umem->tx_metadata_len;
+ pool->tx_sw_csum = umem->flags & XDP_UMEM_TX_SW_CSUM;
INIT_LIST_HEAD(&pool->free_list);
INIT_LIST_HEAD(&pool->xskb_list);
INIT_LIST_HEAD(&pool->xsk_tx_list);
diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index 29942c2c32dc..8ed2ed9d1b17 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -33,7 +33,13 @@
#define XDP_USE_SG (1 << 4)
/* Flags for xsk_umem_config flags */
-#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << 0)
+
+/* Force checksum calculation in software. Can be used for testing or
+ * working around potential HW issues. This option causes performance
+ * degradation and only works in XDP_COPY mode.
+ */
+#define XDP_UMEM_TX_SW_CSUM (1 << 1)
struct sockaddr_xdp {
__u16 sxdp_family;
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
* [PATCH bpf-next v5 09/13] selftests/xsk: Support tx_metadata_len
From: Stanislav Fomichev @ 2023-11-02 22:58 UTC (permalink / raw)
To: bpf
Cc: ast, daniel, andrii, martin.lau, song, yhs, john.fastabend,
kpsingh, sdf, haoluo, jolsa, kuba, toke, willemb, dsahern,
magnus.karlsson, bjorn, maciej.fijalkowski, hawk,
yoong.siang.song, netdev, xdp-hints
In-Reply-To: <20231102225837.1141915-1-sdf@google.com>
Add new config field and propagate to UMEM registration setsockopt.
Signed-off-by: Stanislav Fomichev <sdf@google.com>
---
tools/testing/selftests/bpf/xsk.c | 3 +++
tools/testing/selftests/bpf/xsk.h | 1 +
2 files changed, 4 insertions(+)
diff --git a/tools/testing/selftests/bpf/xsk.c b/tools/testing/selftests/bpf/xsk.c
index e574711eeb84..25d568abf0f2 100644
--- a/tools/testing/selftests/bpf/xsk.c
+++ b/tools/testing/selftests/bpf/xsk.c
@@ -115,6 +115,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
cfg->flags = XSK_UMEM__DEFAULT_FLAGS;
+ cfg->tx_metadata_len = 0;
return;
}
@@ -123,6 +124,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
cfg->frame_size = usr_cfg->frame_size;
cfg->frame_headroom = usr_cfg->frame_headroom;
cfg->flags = usr_cfg->flags;
+ cfg->tx_metadata_len = usr_cfg->tx_metadata_len;
}
static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
@@ -252,6 +254,7 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area,
mr.chunk_size = umem->config.frame_size;
mr.headroom = umem->config.frame_headroom;
mr.flags = umem->config.flags;
+ mr.tx_metadata_len = umem->config.tx_metadata_len;
err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
if (err) {
diff --git a/tools/testing/selftests/bpf/xsk.h b/tools/testing/selftests/bpf/xsk.h
index 771570bc3731..93c2cc413cfc 100644
--- a/tools/testing/selftests/bpf/xsk.h
+++ b/tools/testing/selftests/bpf/xsk.h
@@ -200,6 +200,7 @@ struct xsk_umem_config {
__u32 frame_size;
__u32 frame_headroom;
__u32 flags;
+ __u32 tx_metadata_len;
};
int xsk_attach_xdp_program(struct bpf_program *prog, int ifindex, u32 xdp_flags);
--
2.42.0.869.gea05f2083d-goog
^ permalink raw reply related
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox