All of lore.kernel.org
 help / color / mirror / Atom feed
From: Stanislav Fomichev <sdf@google.com>
To: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: bpf@vger.kernel.org, ast@kernel.org, daniel@iogearbox.net,
	 andrii@kernel.org, martin.lau@linux.dev, song@kernel.org,
	yhs@fb.com,  john.fastabend@gmail.com, kpsingh@kernel.org,
	haoluo@google.com,  jolsa@kernel.org, kuba@kernel.org,
	toke@kernel.org, willemb@google.com,  dsahern@kernel.org,
	magnus.karlsson@intel.com, bjorn@kernel.org,
	 maciej.fijalkowski@intel.com, yoong.siang.song@intel.com,
	 netdev@vger.kernel.org, xdp-hints@xdp-project.net
Subject: Re: [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata
Date: Mon, 9 Oct 2023 09:37:20 -0700	[thread overview]
Message-ID: <ZSQsQMLUIum1hmOI@google.com> (raw)
In-Reply-To: <532af030-f2a6-5569-549b-bbd012ad2640@kernel.org>

On 10/09, Jesper Dangaard Brouer wrote:
> 
> 
> On 03/10/2023 22.05, Stanislav Fomichev wrote:
> > When we get a packet on port 9091, we swap src/dst and send it out.
> > At this point we also request the timestamp and checksum offloads.
> > 
> > Checksum offload is verified by looking at the tcpdump on the other side.
> > The tool prints pseudo-header csum and the final one it expects.
> > The final checksum actually matches the incoming packets checksum
> > because we only flip the src/dst and don't change the payload.
> > 
> > Some other related changes:
> > - switched to zerocopy mode by default; new flag can be used to force
> >    old behavior
> > - request fixed tx_metadata_len headroom
> > - some other small fixes (umem size, fill idx+i, etc)
> > 
> > mvbz3:~# ./xdp_hw_metadata eth3
> > ...
> > 0x1062cb8: rx_desc[0]->addr=80100 addr=80100 comp_addr=80100
> > rx_hash: 0x2E1B50B9 with RSS type:0x2A
> > rx_timestamp:  1691436369532047139 (sec:1691436369.5320)
> > XDP RX-time:   1691436369261756803 (sec:1691436369.2618) delta sec:-0.2703 (-270290.336 usec)
> 
> I guess system time isn't configured to be in sync with NIC HW time,
> as delta is a negative offset.
> 
> > AF_XDP time:   1691436369261878839 (sec:1691436369.2619) delta sec:0.0001 (122.036 usec)
> The AF_XDP time is also software system time and compared to XDP
> RX-time, it shows a delta of 122 usec.  This number indicate to me that
> the CPU is likely configured with power saving sleep states.

Yes, I don't do any synchronization and don't disable the sleep states.

> > 0x1062cb8: ping-pong with csum=3b8e (want de7e) csum_start=54 csum_offset=6
> > 0x1062cb8: complete tx idx=0 addr=10
> > 0x1062cb8: tx_timestamp:  1691436369598419505 (sec:1691436369.5984)
> 
> Could we add something that we can relate tx_timestamp to?
> 
> Like we do with the other delta calculations, as it helps us to
> understand/validate if the number we get back is sane. Like negative
> offset aboves tells us that system time sync isn't configured, and that
> system have configures C-states.
> 
> I suggest delta comparing "tx_timestamp" to "rx_timestamp", as they are
> the same clock domain.  It will tell us the total time spend from HW RX
> to HW TX, counting all the time used by software "ping-pong".
> 
>  1691436369.5984-1691436369.5320 = 0.0664 sec = 66.4 ms
> 
> When implementing this, it could be (1) practical to store the
> "rx_timestamp" in the metadata area of the completion packet, or (2)
> should we have a mechanism for external storage that can be keyed on the
> umem "addr"?

Sounds good. I can probably just store last rx_timestamp somewhere in the
global var and do a delta on tx? Since the test is single threaded
and sequential, not sure we need the mechanism to pass the tstamp around.
LMK if you disagree and I'm missing something.

> > 0x1062cb8: complete rx idx=128 addr=80100
> > 
> > mvbz4:~# nc  -Nu -q1 ${MVBZ3_LINK_LOCAL_IP}%eth3 9091
> > 
> > mvbz4:~# tcpdump -vvx -i eth3 udp
> > 	tcpdump: listening on eth3, link-type EN10MB (Ethernet), snapshot length 262144 bytes
> > 12:26:09.301074 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1087.55807 > fe80::1270:fdff:fe48:1077.9091: [bad udp cksum 0x3b8e -> 0xde7e!] UDP, length 3
> >          0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
> >          0x0010:  1270 fdff fe48 1087 fe80 0000 0000 0000
> >          0x0020:  1270 fdff fe48 1077 d9ff 2383 000b 3b8e
> >          0x0030:  7864 70
> > 12:26:09.301976 IP6 (flowlabel 0x35fa5, hlim 127, next-header UDP (17) payload length: 11) fe80::1270:fdff:fe48:1077.9091 > fe80::1270:fdff:fe48:1087.55807: [udp sum ok] UDP, length 3
> >          0x0000:  6003 5fa5 000b 117f fe80 0000 0000 0000
> >          0x0010:  1270 fdff fe48 1077 fe80 0000 0000 0000
> >          0x0020:  1270 fdff fe48 1087 2383 d9ff 000b de7e
> >          0x0030:  7864 70
> > 
> > Signed-off-by: Stanislav Fomichev <sdf@google.com>
> > ---
> >   tools/testing/selftests/bpf/xdp_hw_metadata.c | 202 +++++++++++++++++-
> >   1 file changed, 192 insertions(+), 10 deletions(-)
> > 
> > diff --git a/tools/testing/selftests/bpf/xdp_hw_metadata.c b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > index 613321eb84c1..ab83d0ba6763 100644
> > --- a/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > +++ b/tools/testing/selftests/bpf/xdp_hw_metadata.c
> > @@ -10,7 +10,9 @@
> >    *   - rx_hash
> >    *
> >    * TX:
> > - * - TBD
> > + * - UDP 9091 packets trigger TX reply
> > + * - TX HW timestamp is requested and reported back upon completion
> > + * - TX checksum is requested
> >    */
> >   #include <test_progs.h>
> > @@ -24,14 +26,17 @@
> [...]
> > @@ -51,22 +56,24 @@ struct xsk *rx_xsk;
> [...]
> > @@ -129,12 +136,22 @@ static void refill_rx(struct xsk *xsk, __u64 addr)
> [...]
> > @@ -228,6 +245,117 @@ static void verify_skb_metadata(int fd)
> >   	printf("skb hwtstamp is not found!\n");
> >   }
> > +static bool complete_tx(struct xsk *xsk)
> > +{
> > +	struct xsk_tx_metadata *meta;
> > +	__u64 addr;
> > +	void *data;
> > +	__u32 idx;
> > +
> > +	if (!xsk_ring_cons__peek(&xsk->comp, 1, &idx))
> > +		return false;
> > +
> > +	addr = *xsk_ring_cons__comp_addr(&xsk->comp, idx);
> > +	data = xsk_umem__get_data(xsk->umem_area, addr);
> > +	meta = data - sizeof(struct xsk_tx_metadata);
> > +
> > +	printf("%p: complete tx idx=%u addr=%llx\n", xsk, idx, addr);
> > +	printf("%p: tx_timestamp:  %llu (sec:%0.4f)\n", xsk,
> > +	       meta->completion.tx_timestamp,
> > +	       (double)meta->completion.tx_timestamp / NANOSEC_PER_SEC);
> > +	xsk_ring_cons__release(&xsk->comp, 1);
> > +
> > +	return true;
> > +}
> > +
> > +#define swap(a, b, len) do { \
> > +	for (int i = 0; i < len; i++) { \
> > +		__u8 tmp = ((__u8 *)a)[i]; \
> > +		((__u8 *)a)[i] = ((__u8 *)b)[i]; \
> > +		((__u8 *)b)[i] = tmp; \
> > +	} \
> > +} while (0)
> > +
> > +static void ping_pong(struct xsk *xsk, void *rx_packet)
> > +{
> > +	struct xsk_tx_metadata *meta;
> > +	struct ipv6hdr *ip6h = NULL;
> > +	struct iphdr *iph = NULL;
> > +	struct xdp_desc *tx_desc;
> > +	struct udphdr *udph;
> > +	struct ethhdr *eth;
> > +	__sum16 want_csum;
> > +	void *data;
> > +	__u32 idx;
> > +	int ret;
> > +	int len;
> > +
> > +	ret = xsk_ring_prod__reserve(&xsk->tx, 1, &idx);
> > +	if (ret != 1) {
> > +		printf("%p: failed to reserve tx slot\n", xsk);
> > +		return;
> > +	}
> > +
> > +	tx_desc = xsk_ring_prod__tx_desc(&xsk->tx, idx);
> > +	tx_desc->addr = idx % (UMEM_NUM / 2) * UMEM_FRAME_SIZE + sizeof(struct xsk_tx_metadata);
> > +	data = xsk_umem__get_data(xsk->umem_area, tx_desc->addr);
> > +
> > +	meta = data - sizeof(struct xsk_tx_metadata);
> > +	memset(meta, 0, sizeof(*meta));
> > +	meta->flags = XDP_TX_METADATA_TIMESTAMP;
> > +
> > +	eth = rx_packet;
> > +
> > +	if (eth->h_proto == htons(ETH_P_IP)) {
> > +		iph = (void *)(eth + 1);
> > +		udph = (void *)(iph + 1);
> > +	} else if (eth->h_proto == htons(ETH_P_IPV6)) {
> > +		ip6h = (void *)(eth + 1);
> > +		udph = (void *)(ip6h + 1);
> > +	} else {
> > +		printf("%p: failed to detect IP version for ping pong %04x\n", xsk, eth->h_proto);
> > +		xsk_ring_prod__cancel(&xsk->tx, 1);
> > +		return;
> > +	}
> > +
> > +	len = ETH_HLEN;
> > +	if (ip6h)
> > +		len += sizeof(*ip6h) + ntohs(ip6h->payload_len);
> > +	if (iph)
> > +		len += ntohs(iph->tot_len);
> > +
> > +	swap(eth->h_dest, eth->h_source, ETH_ALEN);
> > +	if (iph)
> > +		swap(&iph->saddr, &iph->daddr, 4);
> > +	else
> > +		swap(&ip6h->saddr, &ip6h->daddr, 16);
> > +	swap(&udph->source, &udph->dest, 2);
> > +
> > +	want_csum = udph->check;
> > +	if (ip6h)
> > +		udph->check = ~csum_ipv6_magic(&ip6h->saddr, &ip6h->daddr,
> > +					       ntohs(udph->len), IPPROTO_UDP, 0);
> > +	else
> > +		udph->check = ~csum_tcpudp_magic(iph->saddr, iph->daddr,
> > +						 ntohs(udph->len), IPPROTO_UDP, 0);
> > +
> > +	meta->flags |= XDP_TX_METADATA_CHECKSUM;
> > +	if (iph)
> > +		meta->csum_start = sizeof(*eth) + sizeof(*iph);
> > +	else
> > +		meta->csum_start = sizeof(*eth) + sizeof(*ip6h);
> > +	meta->csum_offset = offsetof(struct udphdr, check);
> > +
> > +	printf("%p: ping-pong with csum=%04x (want %04x) csum_start=%d csum_offset=%d\n",
> > +	       xsk, ntohs(udph->check), ntohs(want_csum), meta->csum_start, meta->csum_offset);
> > +
> > +	memcpy(data, rx_packet, len); /* don't share umem chunk for simplicity */
> > +	tx_desc->options |= XDP_TX_METADATA;
> > +	tx_desc->len = len;
> > +
> > +	xsk_ring_prod__submit(&xsk->tx, 1);
> > +}
> > +
> >   static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t clock_id)
> >   {
> >   	const struct xdp_desc *rx_desc;
> > @@ -250,6 +378,13 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> >   	while (true) {
> >   		errno = 0;
> > +
> > +		for (i = 0; i < rxq; i++) {
> > +			ret = kick_rx(&rx_xsk[i]);
> > +			if (ret)
> > +				printf("kick_rx ret=%d\n", ret);
> > +		}
> > +
> >   		ret = poll(fds, rxq + 1, 1000);
> >   		printf("poll: %d (%d) skip=%llu fail=%llu redir=%llu\n",
> >   		       ret, errno, bpf_obj->bss->pkts_skip,
> > @@ -280,6 +415,22 @@ static int verify_metadata(struct xsk *rx_xsk, int rxq, int server_fd, clockid_t
> >   			       xsk, idx, rx_desc->addr, addr, comp_addr);
> >   			verify_xdp_metadata(xsk_umem__get_data(xsk->umem_area, addr),
> >   					    clock_id);
> > +
> > +			if (!skip_tx) {
> > +				/* mirror the packet back */
> > +				ping_pong(xsk, xsk_umem__get_data(xsk->umem_area, addr));
> > +
> > +				ret = kick_tx(xsk);
> > +				if (ret)
> > +					printf("kick_tx ret=%d\n", ret);
> > +
> > +				for (int j = 0; j < 500; j++) {
> > +					if (complete_tx(xsk))
> > +						break;
> > +					usleep(10*1000);
> 
> I don't fully follow why we need this usleep here.

To avoid the busypoll here (since we don't care too much about perf in
the test). But I agree, should be ok to drop, will do.

  reply	other threads:[~2023-10-09 16:37 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-10-03 20:05 [PATCH bpf-next v3 00/10] xsk: TX metadata Stanislav Fomichev
2023-10-03 20:05 ` [PATCH bpf-next v3 01/10] xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [PATCH bpf-next v3 02/10] xsk: add TX timestamp and TX checksum offload support Stanislav Fomichev
2023-10-04  6:18   ` Song, Yoong Siang
2023-10-04 17:48     ` Stanislav Fomichev
2023-10-04 17:56       ` Stanislav Fomichev
2023-10-05  1:16         ` Song, Yoong Siang
2023-10-03 20:05 ` [PATCH bpf-next v3 03/10] tools: ynl: print xsk-features from the sample Stanislav Fomichev
2023-10-03 20:05 ` [PATCH bpf-next v3 04/10] net/mlx5e: Implement AF_XDP TX timestamp and checksum offload Stanislav Fomichev
2023-10-04 23:47   ` kernel test robot
2023-10-03 20:05 ` [PATCH bpf-next v3 05/10] net: stmmac: Add Tx HWTS support to XDP ZC Stanislav Fomichev
2023-10-04 23:05   ` kernel test robot
2023-10-04 23:14     ` Stanislav Fomichev
2023-10-06  4:38   ` kernel test robot
2023-10-03 20:05 ` [PATCH bpf-next v3 06/10] selftests/xsk: Support tx_metadata_len Stanislav Fomichev
2023-10-03 20:05 ` [PATCH bpf-next v3 07/10] selftests/bpf: Add csum helpers Stanislav Fomichev
2023-10-03 20:05 ` [PATCH bpf-next v3 08/10] selftests/bpf: Add TX side to xdp_metadata Stanislav Fomichev
2023-10-03 20:05 ` [PATCH bpf-next v3 09/10] selftests/bpf: Add TX side to xdp_hw_metadata Stanislav Fomichev
2023-10-09  8:12   ` Jesper Dangaard Brouer
2023-10-09 16:37     ` Stanislav Fomichev [this message]
2023-10-10 20:40       ` Stanislav Fomichev
2023-10-13  1:13         ` Song, Yoong Siang
2023-10-13 18:47           ` Stanislav Fomichev
2023-10-15 13:28             ` Song, Yoong Siang
2023-10-03 20:05 ` [PATCH bpf-next v3 10/10] xsk: document tx_metadata_len layout Stanislav Fomichev

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZSQsQMLUIum1hmOI@google.com \
    --to=sdf@google.com \
    --cc=andrii@kernel.org \
    --cc=ast@kernel.org \
    --cc=bjorn@kernel.org \
    --cc=bpf@vger.kernel.org \
    --cc=daniel@iogearbox.net \
    --cc=dsahern@kernel.org \
    --cc=haoluo@google.com \
    --cc=hawk@kernel.org \
    --cc=john.fastabend@gmail.com \
    --cc=jolsa@kernel.org \
    --cc=kpsingh@kernel.org \
    --cc=kuba@kernel.org \
    --cc=maciej.fijalkowski@intel.com \
    --cc=magnus.karlsson@intel.com \
    --cc=martin.lau@linux.dev \
    --cc=netdev@vger.kernel.org \
    --cc=song@kernel.org \
    --cc=toke@kernel.org \
    --cc=willemb@google.com \
    --cc=xdp-hints@xdp-project.net \
    --cc=yhs@fb.com \
    --cc=yoong.siang.song@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.