Netdev List

* [PATCH bpf-next v4 09/11] samples/bpf: add buffer recycling for unaligned chunks to xdpsock
From: Kevin Laatz @ 2019-07-30  8:53 UTC
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

This patch adds buffer recycling support for unaligned buffers. Since we
don't mask the addr to 2k at umem_reg in unaligned mode, we need to make
sure we give back the correct (original) addr to the fill queue. We achieve
this using the new descriptor format and associated masks. The new format
uses the upper 16 bits for the offset and the lower 48 bits for the addr.
Since we have a field for the offset, we no longer need to modify the
actual address. As such, all we have to do to get back the original address
is mask for the lower 48 bits (i.e. strip the offset and we get the address
on its own).
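
As a rough illustration of the descriptor layout described above (not part
of the patch), the sketch below packs and unpacks a 64-bit descriptor. The
two defines mirror the ones this series adds to if_xdp.h; the helper names
desc_pack(), desc_base() and desc_offset() are hypothetical and exist only
for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values mirror the defines added to include/uapi/linux/if_xdp.h. */
#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
#define XSK_UNALIGNED_BUF_ADDR_MASK \
	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)

/* Hypothetical helper: combine a base address and an offset. */
static inline uint64_t desc_pack(uint64_t base, uint64_t offset)
{
	assert(base <= XSK_UNALIGNED_BUF_ADDR_MASK);	/* fits in 48 bits */
	assert(offset <= 0xffff);			/* fits in 16 bits */
	return base | (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
}

/* Lower 48 bits: the original address to hand back to the fill queue. */
static inline uint64_t desc_base(uint64_t desc)
{
	return desc & XSK_UNALIGNED_BUF_ADDR_MASK;
}

/* Upper 16 bits: where the packet data starts relative to the base. */
static inline uint64_t desc_offset(uint64_t desc)
{
	return desc >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
}
```

The packet data would then live at desc_base(d) + desc_offset(d), while
desc_base(d) alone is the value to recycle.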

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

---
v2:
  - Removed unused defines
  - Fix buffer recycling for unaligned case
  - Remove --buf-size (--frame-size merged before this)
  - Modifications to use the new descriptor format for buffer recycling
---
 samples/bpf/xdpsock_user.c | 24 +++++++++++++++---------
 1 file changed, 15 insertions(+), 9 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 756b00eb1afe..62b2059cd0e3 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -475,6 +475,7 @@ static void kick_tx(struct xsk_socket_info *xsk)
 
 static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 {
+	struct xsk_umem_info *umem = xsk->umem;
 	u32 idx_cq = 0, idx_fq = 0;
 	unsigned int rcvd;
 	size_t ndescs;
@@ -487,22 +488,21 @@ static inline void complete_tx_l2fwd(struct xsk_socket_info *xsk)
 		xsk->outstanding_tx;
 
 	/* re-add completed Tx buffers */
-	rcvd = xsk_ring_cons__peek(&xsk->umem->cq, ndescs, &idx_cq);
+	rcvd = xsk_ring_cons__peek(&umem->cq, ndescs, &idx_cq);
 	if (rcvd > 0) {
 		unsigned int i;
 		int ret;
 
-		ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd, &idx_fq);
+		ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
 		while (ret != rcvd) {
 			if (ret < 0)
 				exit_with_error(-ret);
-			ret = xsk_ring_prod__reserve(&xsk->umem->fq, rcvd,
-						     &idx_fq);
+			ret = xsk_ring_prod__reserve(&umem->fq, rcvd, &idx_fq);
 		}
+
 		for (i = 0; i < rcvd; i++)
-			*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) =
-				*xsk_ring_cons__comp_addr(&xsk->umem->cq,
-							  idx_cq++);
+			*xsk_ring_prod__fill_addr(&umem->fq, idx_fq++) =
+				*xsk_ring_cons__comp_addr(&umem->cq, idx_cq++);
 
 		xsk_ring_prod__submit(&xsk->umem->fq, rcvd);
 		xsk_ring_cons__release(&xsk->umem->cq, rcvd);
@@ -549,7 +549,11 @@ static void rx_drop(struct xsk_socket_info *xsk)
 	for (i = 0; i < rcvd; i++) {
 		u64 addr = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx)->addr;
 		u32 len = xsk_ring_cons__rx_desc(&xsk->rx, idx_rx++)->len;
-		char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+		u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+
+		addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+		char *pkt = xsk_umem__get_data(xsk->umem->buffer,
+				addr + offset);
 
 		hex_dump(pkt, len, addr);
 		*xsk_ring_prod__fill_addr(&xsk->umem->fq, idx_fq++) = addr;
@@ -655,7 +659,9 @@ static void l2fwd(struct xsk_socket_info *xsk)
 							  idx_rx)->addr;
 			u32 len = xsk_ring_cons__rx_desc(&xsk->rx,
 							 idx_rx++)->len;
-			char *pkt = xsk_umem__get_data(xsk->umem->buffer, addr);
+			u64 offset = addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+			char *pkt = xsk_umem__get_data(xsk->umem->buffer,
+				(addr & XSK_UNALIGNED_BUF_ADDR_MASK) + offset);
 
 			swap_mac_addresses(pkt);
 
-- 
2.17.1



* [PATCH bpf-next v4 08/11] samples/bpf: add unaligned chunks mode support to xdpsock
From: Kevin Laatz @ 2019-07-30  8:53 UTC
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

This patch adds support for the unaligned chunks mode. The addition of the
unaligned chunks option will allow users to run the application with more
relaxed chunk placement in the XDP umem.

Unaligned chunks mode can be used with the '-u' or '--unaligned' command
line options.
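
For reference, the frame-size rule that follows from this option can be
sketched as below. This is a minimal stand-alone version, not the sample's
code: frame_size_ok() is a hypothetical name, and rejecting a frame size of
zero is an extra assumption that the patch itself does not make.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: in aligned mode the frame size must be a power of
 * two; in unaligned mode any non-zero size is accepted.
 */
static bool frame_size_ok(unsigned int frame_size, bool unaligned)
{
	if (frame_size == 0)	/* assumption: zero is never useful */
		return false;
	if (unaligned)
		return true;
	/* A power of two has exactly one bit set. */
	return (frame_size & (frame_size - 1)) == 0;
}
```

With this rule, a size such as 3000 would be rejected without '-u' and
accepted with it.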

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>

---
v4:
  - updated help text for -f
  - use new chunk flag define
---
 samples/bpf/xdpsock_user.c | 20 +++++++++++++++++---
 1 file changed, 17 insertions(+), 3 deletions(-)

diff --git a/samples/bpf/xdpsock_user.c b/samples/bpf/xdpsock_user.c
index 93eaaf7239b2..756b00eb1afe 100644
--- a/samples/bpf/xdpsock_user.c
+++ b/samples/bpf/xdpsock_user.c
@@ -51,6 +51,7 @@
 
 typedef __u64 u64;
 typedef __u32 u32;
+typedef __u16 u16;
 
 static unsigned long prev_time;
 
@@ -67,6 +68,8 @@ static int opt_ifindex;
 static int opt_queue;
 static int opt_poll;
 static int opt_interval = 1;
+static u16 opt_umem_flags;
+static int opt_unaligned_chunks;
 static u32 opt_xdp_bind_flags;
 static int opt_xsk_frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
 static __u32 prog_id;
@@ -282,7 +285,9 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
 		.comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS,
 		.frame_size = opt_xsk_frame_size,
 		.frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM,
+		.flags = opt_umem_flags
 	};
+
 	int ret;
 
 	umem = calloc(1, sizeof(*umem));
@@ -291,6 +296,7 @@ static struct xsk_umem_info *xsk_configure_umem(void *buffer, u64 size)
 
 	ret = xsk_umem__create(&umem->umem, buffer, size, &umem->fq, &umem->cq,
 			       &cfg);
+
 	if (ret)
 		exit_with_error(-ret);
 
@@ -352,6 +358,7 @@ static struct option long_options[] = {
 	{"zero-copy", no_argument, 0, 'z'},
 	{"copy", no_argument, 0, 'c'},
 	{"frame-size", required_argument, 0, 'f'},
+	{"unaligned", no_argument, 0, 'u'},
 	{0, 0, 0, 0}
 };
 
@@ -371,7 +378,8 @@ static void usage(const char *prog)
 		"  -n, --interval=n	Specify statistics update interval (default 1 sec).\n"
 		"  -z, --zero-copy      Force zero-copy mode.\n"
 		"  -c, --copy           Force copy mode.\n"
-		"  -f, --frame-size=n   Set the frame size (must be a power of two, default is %d).\n"
+		"  -f, --frame-size=n   Set the frame size (must be a power of two in aligned mode, default is %d).\n"
+		"  -u, --unaligned	Enable unaligned chunk placement\n"
 		"\n";
 	fprintf(stderr, str, prog, XSK_UMEM__DEFAULT_FRAME_SIZE);
 	exit(EXIT_FAILURE);
@@ -384,7 +392,7 @@ static void parse_command_line(int argc, char **argv)
 	opterr = 0;
 
 	for (;;) {
-		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:", long_options,
+		c = getopt_long(argc, argv, "Frtli:q:psSNn:czf:u", long_options,
 				&option_index);
 		if (c == -1)
 			break;
@@ -424,12 +432,17 @@ static void parse_command_line(int argc, char **argv)
 		case 'c':
 			opt_xdp_bind_flags |= XDP_COPY;
 			break;
+		case 'u':
+			opt_umem_flags |= XDP_UMEM_UNALIGNED_CHUNK_FLAG;
+			opt_unaligned_chunks = 1;
+			break;
 		case 'F':
 			opt_xdp_flags &= ~XDP_FLAGS_UPDATE_IF_NOEXIST;
 			break;
 		case 'f':
 			opt_xsk_frame_size = atoi(optarg);
 			break;
+
 		default:
 			usage(basename(argv[0]));
 		}
@@ -442,7 +455,8 @@ static void parse_command_line(int argc, char **argv)
 		usage(basename(argv[0]));
 	}
 
-	if (opt_xsk_frame_size & (opt_xsk_frame_size - 1)) {
+	if ((opt_xsk_frame_size & (opt_xsk_frame_size - 1)) &&
+	    !opt_unaligned_chunks) {
 		fprintf(stderr, "--frame-size=%d is not a power of two\n",
 			opt_xsk_frame_size);
 		usage(basename(argv[0]));
-- 
2.17.1



* [PATCH bpf-next v4 07/11] mlx5e: modify driver for handling offsets
From: Kevin Laatz @ 2019-07-30  8:53 UTC
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.
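
To make the two cases concrete, here is a small user-space sketch modelled
on the xsk_umem_adjust_offset() helper that this series adds to xdp_sock.h.
The function name adjust_offset() and its bool parameter are hypothetical
stand-ins for the umem flag check.

```c
#include <stdbool.h>
#include <stdint.h>

/* Value mirrors the define added to include/uapi/linux/if_xdp.h. */
#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48

/*
 * Hypothetical stand-in for the kernel helper:
 *  - aligned mode:   the offset is added directly to the handle;
 *  - unaligned mode: the offset accumulates in the upper 16 bits, so the
 *    lower 48 bits keep the original buffer address.
 */
static inline uint64_t adjust_offset(bool unaligned, uint64_t handle,
				     uint64_t offset)
{
	if (unaligned)
		return handle + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
	return handle + offset;
}
```

In unaligned mode, repeated adjustments simply add up in the upper bits,
which is what allows the original address to be recovered later with a
mask.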

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v3:
  - Use new helper function to handle offset

v4:
  - fixed headroom addition to the handle; now using
    xsk_umem_adjust_offset().
---
 drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c    | 8 ++++++--
 drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c | 3 ++-
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
index b0b982cf69bb..d5245893d2c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xdp.c
@@ -122,6 +122,7 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 		      void *va, u16 *rx_headroom, u32 *len, bool xsk)
 {
 	struct bpf_prog *prog = READ_ONCE(rq->xdp_prog);
+	struct xdp_umem *umem = rq->umem;
 	struct xdp_buff xdp;
 	u32 act;
 	int err;
@@ -138,8 +139,11 @@ bool mlx5e_xdp_handle(struct mlx5e_rq *rq, struct mlx5e_dma_info *di,
 	xdp.rxq = &rq->xdp_rxq;
 
 	act = bpf_prog_run_xdp(prog, &xdp);
-	if (xsk)
-		xdp.handle += xdp.data - xdp.data_hard_start;
+	if (xsk) {
+		u64 off = xdp.data - xdp.data_hard_start;
+
+		xdp.handle = xsk_umem_adjust_offset(umem, xdp.handle, off);
+	}
 	switch (act) {
 	case XDP_PASS:
 		*rx_headroom = xdp.data - xdp.data_hard_start;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
index 6a55573ec8f2..7c49a66d28c9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c
@@ -24,7 +24,8 @@ int mlx5e_xsk_page_alloc_umem(struct mlx5e_rq *rq,
 	if (!xsk_umem_peek_addr_rq(umem, &handle))
 		return -ENOMEM;
 
-	dma_info->xsk.handle = handle + rq->buff.umem_headroom;
+	dma_info->xsk.handle = xsk_umem_adjust_offset(umem, handle,
+						      rq->buff.umem_headroom);
 	dma_info->xsk.data = xdp_umem_get_data(umem, dma_info->xsk.handle);
 
 	/* No need to add headroom to the DMA address. In striding RQ case, we
-- 
2.17.1



* [PATCH bpf-next v4 06/11] ixgbe: modify driver for handling offsets
From: Kevin Laatz @ 2019-07-30  8:53 UTC
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v3:
  - Use new helper function to handle offset
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index bc86057628c8..11c400f8a6df 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -143,7 +143,9 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 			    struct ixgbe_ring *rx_ring,
 			    struct xdp_buff *xdp)
 {
+	struct xdp_umem *umem = rx_ring->xsk_umem;
 	int err, result = IXGBE_XDP_PASS;
+	u64 offset = umem->headroom;
 	struct bpf_prog *xdp_prog;
 	struct xdp_frame *xdpf;
 	u32 act;
@@ -151,7 +153,10 @@ static int ixgbe_run_xdp_zc(struct ixgbe_adapter *adapter,
 	rcu_read_lock();
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-	xdp->handle += xdp->data - xdp->data_hard_start;
+	offset += xdp->data - xdp->data_hard_start;
+
+	xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
+
 	switch (act) {
 	case XDP_PASS:
 		break;
@@ -243,7 +248,7 @@ void ixgbe_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
 	bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
 	bi->addr += hr;
 
-	bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+	bi->handle = (u64)handle;
 }
 
 static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
@@ -269,7 +274,7 @@ static bool ixgbe_alloc_buffer_zc(struct ixgbe_ring *rx_ring,
 	bi->addr = xdp_umem_get_data(umem, handle);
 	bi->addr += hr;
 
-	bi->handle = handle + umem->headroom;
+	bi->handle = handle;
 
 	xsk_umem_discard_addr(umem);
 	return true;
@@ -296,7 +301,7 @@ static bool ixgbe_alloc_buffer_slow_zc(struct ixgbe_ring *rx_ring,
 	bi->addr = xdp_umem_get_data(umem, handle);
 	bi->addr += hr;
 
-	bi->handle = handle + umem->headroom;
+	bi->handle = handle;
 
 	xsk_umem_discard_addr_rq(umem);
 	return true;
-- 
2.17.1



* [PATCH bpf-next v4 05/11] i40e: modify driver for handling offsets
From: Kevin Laatz @ 2019-07-30  8:53 UTC
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

With the addition of the unaligned chunks option, we need to make sure we
handle the offsets accordingly based on the mode we are currently running
in. This patch modifies the driver to appropriately mask the address for
each case.

Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>
Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>

---
v3:
  - Use new helper function for handling the offset
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 13 +++++++++----
 1 file changed, 9 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index dfa096db2244..09dd8fe28c35 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -190,7 +190,9 @@ int i40e_xsk_umem_setup(struct i40e_vsi *vsi, struct xdp_umem *umem,
  **/
 static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
 {
+	struct xdp_umem *umem = rx_ring->xsk_umem;
 	int err, result = I40E_XDP_PASS;
+	u64 offset = umem->headroom;
 	struct i40e_ring *xdp_ring;
 	struct bpf_prog *xdp_prog;
 	u32 act;
@@ -201,7 +203,10 @@ static int i40e_run_xdp_zc(struct i40e_ring *rx_ring, struct xdp_buff *xdp)
 	 */
 	xdp_prog = READ_ONCE(rx_ring->xdp_prog);
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
-	xdp->handle += xdp->data - xdp->data_hard_start;
+	offset += xdp->data - xdp->data_hard_start;
+
+	xdp->handle = xsk_umem_adjust_offset(umem, xdp->handle, offset);
+
 	switch (act) {
 	case XDP_PASS:
 		break;
@@ -262,7 +267,7 @@ static bool i40e_alloc_buffer_zc(struct i40e_ring *rx_ring,
 	bi->addr = xdp_umem_get_data(umem, handle);
 	bi->addr += hr;
 
-	bi->handle = handle + umem->headroom;
+	bi->handle = handle;
 
 	xsk_umem_discard_addr(umem);
 	return true;
@@ -299,7 +304,7 @@ static bool i40e_alloc_buffer_slow_zc(struct i40e_ring *rx_ring,
 	bi->addr = xdp_umem_get_data(umem, handle);
 	bi->addr += hr;
 
-	bi->handle = handle + umem->headroom;
+	bi->handle = handle;
 
 	xsk_umem_discard_addr_rq(umem);
 	return true;
@@ -464,7 +469,7 @@ void i40e_zca_free(struct zero_copy_allocator *alloc, unsigned long handle)
 	bi->addr = xdp_umem_get_data(rx_ring->xsk_umem, handle);
 	bi->addr += hr;
 
-	bi->handle = (u64)handle + rx_ring->xsk_umem->headroom;
+	bi->handle = (u64)handle;
 }
 
 /**
-- 
2.17.1



* [PATCH bpf-next v4 04/11] xsk: add support to allow unaligned chunk placement
From: Kevin Laatz @ 2019-07-30  8:53 UTC
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

Currently, addresses are chunk-size aligned. This means we are very
restricted in terms of where we can place chunks within the umem. For
example, if we have a chunk size of 2k, then our chunks can only be placed
at 0, 2k, 4k, 6k, 8k... and so on (i.e. every 2k starting from 0).
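
The aligned-mode restriction amounts to masking every address down to a
chunk boundary. A minimal sketch, assuming a power-of-two chunk size;
chunk_align() is a hypothetical name, but the mask is built the same way as
umem->chunk_mask in xdp_umem_reg():

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: round addr down to the start of its chunk.
 * Only valid when chunk_size is a non-zero power of two.
 */
static inline uint64_t chunk_align(uint64_t addr, uint32_t chunk_size)
{
	assert(chunk_size && !(chunk_size & (chunk_size - 1)));
	return addr & ~((uint64_t)chunk_size - 1);
}
```

With a 2k chunk size, an address of 5000 would be treated as 4096, so any
placement between chunk boundaries is lost in aligned mode.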

This patch introduces the ability to use unaligned chunks. With these
changes, we are no longer bound to having to place chunks at a 2k (or
whatever your chunk size is) interval. Since we are no longer dealing with
aligned chunks, they can now cross page boundaries. Checks for page
contiguity have been added in order to keep track of which pages are
followed by a physically contiguous page.
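
The contiguity check can be pictured with the user-space sketch below. It
is an illustration under stated assumptions, not the kernel code: struct
demo_page, DEMO_PAGE_SIZE and both function names are hypothetical, and the
flag is kept in a separate bool here, whereas the patch stores it in the
low bits of the page address.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_PAGE_SIZE 4096ULL	/* assumption: 4k pages */

/* Hypothetical per-page record for this example only. */
struct demo_page {
	uint64_t phys;		/* address of the page */
	bool next_contig;	/* true if the next page follows directly */
};

/* Record, for each page, whether the following page is contiguous. */
static void mark_contiguity(struct demo_page *pgs, size_t npgs)
{
	for (size_t i = 0; i + 1 < npgs; i++)
		pgs[i].next_contig =
			pgs[i].phys + DEMO_PAGE_SIZE == pgs[i + 1].phys;
	if (npgs)
		pgs[npgs - 1].next_contig = false;	/* no next page */
}

/* A buffer is a problem only if it crosses into a non-contiguous page. */
static bool crosses_non_contig_page(const struct demo_page *pgs,
				    uint64_t addr, uint64_t len)
{
	bool crosses = (addr & (DEMO_PAGE_SIZE - 1)) + len > DEMO_PAGE_SIZE;

	return crosses && !pgs[addr / DEMO_PAGE_SIZE].next_contig;
}
```

A buffer that stays within one page, or that spills into a contiguous
neighbour, passes the check; only a spill into a non-contiguous page is
flagged.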

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>
Signed-off-by: Bruce Richardson <bruce.richardson@intel.com>

---
v2:
  - Add checks for the flags coming from userspace
  - Fix how we get chunk_size in xsk_diag.c
  - Add defines for masking the new descriptor format
  - Modified the rx functions to use new descriptor format
  - Modified the tx functions to use new descriptor format

v3:
  - Add helper function to do address/offset masking/addition

v4:
  - fixed page_start calculation in __xsk_rcv_memcpy().
  - move offset handling to the xdp_umem_get_* functions
  - modified the len field in xdp_umem_reg struct. We now use 16 bits from
    this for the flags field.
  - removed next_pg_contig field from xdp_umem_page struct. Using low 12
    bits of addr to store flags instead.
  - other minor changes based on review comments
---
 include/net/xdp_sock.h      | 40 ++++++++++++++++++-
 include/uapi/linux/if_xdp.h | 14 ++++++-
 net/xdp/xdp_umem.c          | 18 ++++++---
 net/xdp/xsk.c               | 79 +++++++++++++++++++++++++++++--------
 net/xdp/xsk_diag.c          |  2 +-
 net/xdp/xsk_queue.h         | 69 ++++++++++++++++++++++++++++----
 6 files changed, 188 insertions(+), 34 deletions(-)

diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index 69796d264f06..a755e8ab6cac 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -16,6 +16,13 @@
 struct net_device;
 struct xsk_queue;
 
+/* Masks for xdp_umem_page flags.
+ * The low 12-bits of the addr will be 0 since this is the page address, so we
+ * can use them for flags.
+ */
+#define XSK_NEXT_PG_CONTIG_SHIFT 0
+#define XSK_NEXT_PG_CONTIG_MASK (1ULL << XSK_NEXT_PG_CONTIG_SHIFT)
+
 struct xdp_umem_page {
 	void *addr;
 	dma_addr_t dma;
@@ -48,6 +55,7 @@ struct xdp_umem {
 	bool zc;
 	spinlock_t xsk_list_lock;
 	struct list_head xsk_list;
+	u16 flags;
 };
 
 struct xdp_sock {
@@ -98,12 +106,21 @@ struct xdp_umem *xdp_get_umem_from_qid(struct net_device *dev, u16 queue_id);
 
 static inline char *xdp_umem_get_data(struct xdp_umem *umem, u64 addr)
 {
-	return umem->pages[addr >> PAGE_SHIFT].addr + (addr & (PAGE_SIZE - 1));
+	unsigned long page_addr;
+
+	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+	page_addr = (unsigned long)umem->pages[addr >> PAGE_SHIFT].addr;
+
+	return (char *)(page_addr & PAGE_MASK) + (addr & ~PAGE_MASK);
 }
 
 static inline dma_addr_t xdp_umem_get_dma(struct xdp_umem *umem, u64 addr)
 {
-	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & (PAGE_SIZE - 1));
+	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+
+	return umem->pages[addr >> PAGE_SHIFT].dma + (addr & ~PAGE_MASK);
 }
 
 /* Reuse-queue aware version of FILL queue helpers */
@@ -144,6 +161,19 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
 
 	rq->handles[rq->length++] = addr;
 }
+
+/* Handle the offset appropriately depending on aligned or unaligned mode.
+ * For unaligned mode, we store the offset in the upper 16-bits of the address.
+ * For aligned mode, we simply add the offset to the address.
+ */
+static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 address,
+					 u64 offset)
+{
+	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
+		return address + (offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+	else
+		return address + offset;
+}
 #else
 static inline int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
@@ -241,6 +271,12 @@ static inline void xsk_umem_fq_reuse(struct xdp_umem *umem, u64 addr)
 {
 }
 
+static inline u64 xsk_umem_adjust_offset(struct xdp_umem *umem, u64 handle,
+					 u64 offset)
+{
+	return 0;
+}
+
 #endif /* CONFIG_XDP_SOCKETS */
 
 #endif /* _LINUX_XDP_SOCK_H */
diff --git a/include/uapi/linux/if_xdp.h b/include/uapi/linux/if_xdp.h
index faaa5ca2a117..4a5490651b22 100644
--- a/include/uapi/linux/if_xdp.h
+++ b/include/uapi/linux/if_xdp.h
@@ -17,6 +17,10 @@
 #define XDP_COPY	(1 << 1) /* Force copy-mode */
 #define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
 
+/* Flags for xsk_umem_config flags */
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT 15
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT)
+
 struct sockaddr_xdp {
 	__u16 sxdp_family;
 	__u16 sxdp_flags;
@@ -49,8 +53,9 @@ struct xdp_mmap_offsets {
 #define XDP_OPTIONS			8
 
 struct xdp_umem_reg {
-	__u64 addr; /* Start of packet data area */
-	__u64 len; /* Length of packet data area */
+	__u64 addr;    /* Start of packet data area */
+	__u64 len:48;  /* Length of packet data area */
+	__u64 flags:16; /*Flags for umem */
 	__u32 chunk_size;
 	__u32 headroom;
 };
@@ -74,6 +79,11 @@ struct xdp_options {
 #define XDP_UMEM_PGOFF_FILL_RING	0x100000000ULL
 #define XDP_UMEM_PGOFF_COMPLETION_RING	0x180000000ULL
 
+/* Masks for unaligned chunks mode */
+#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
+#define XSK_UNALIGNED_BUF_ADDR_MASK \
+	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)
+
 /* Rx/Tx descriptor */
 struct xdp_desc {
 	__u64 addr;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 83de74ca729a..5590ca7bbe15 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -299,6 +299,7 @@ static int xdp_umem_account_pages(struct xdp_umem *umem)
 
 static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 {
+	bool unaligned_chunks = mr->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG;
 	u32 chunk_size = mr->chunk_size, headroom = mr->headroom;
 	unsigned int chunks, chunks_per_page;
 	u64 addr = mr->addr, size = mr->len;
@@ -314,7 +315,10 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 		return -EINVAL;
 	}
 
-	if (!is_power_of_2(chunk_size))
+	if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
+		return -EINVAL;
+
+	if (!unaligned_chunks && !is_power_of_2(chunk_size))
 		return -EINVAL;
 
 	if (!PAGE_ALIGNED(addr)) {
@@ -331,9 +335,11 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 	if (chunks == 0)
 		return -EINVAL;
 
-	chunks_per_page = PAGE_SIZE / chunk_size;
-	if (chunks < chunks_per_page || chunks % chunks_per_page)
-		return -EINVAL;
+	if (!unaligned_chunks) {
+		chunks_per_page = PAGE_SIZE / chunk_size;
+		if (chunks < chunks_per_page || chunks % chunks_per_page)
+			return -EINVAL;
+	}
 
 	headroom = ALIGN(headroom, 64);
 
@@ -342,13 +348,15 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
 		return -EINVAL;
 
 	umem->address = (unsigned long)addr;
-	umem->chunk_mask = ~((u64)chunk_size - 1);
+	umem->chunk_mask = unaligned_chunks ? XSK_UNALIGNED_BUF_ADDR_MASK
+					    : ~((u64)chunk_size - 1);
 	umem->size = size;
 	umem->headroom = headroom;
 	umem->chunk_size_nohr = chunk_size - headroom;
 	umem->npgs = size / PAGE_SIZE;
 	umem->pgs = NULL;
 	umem->user = NULL;
+	umem->flags = mr->flags;
 	INIT_LIST_HEAD(&umem->xsk_list);
 	spin_lock_init(&umem->xsk_list_lock);
 
diff --git a/net/xdp/xsk.c b/net/xdp/xsk.c
index 59b57d708697..9b834d54549e 100644
--- a/net/xdp/xsk.c
+++ b/net/xdp/xsk.c
@@ -45,7 +45,7 @@ EXPORT_SYMBOL(xsk_umem_has_addrs);
 
 u64 *xsk_umem_peek_addr(struct xdp_umem *umem, u64 *addr)
 {
-	return xskq_peek_addr(umem->fq, addr);
+	return xskq_peek_addr(umem->fq, addr, umem);
 }
 EXPORT_SYMBOL(xsk_umem_peek_addr);
 
@@ -55,21 +55,42 @@ void xsk_umem_discard_addr(struct xdp_umem *umem)
 }
 EXPORT_SYMBOL(xsk_umem_discard_addr);
 
+/* If a buffer crosses a page boundary, we need to do 2 memcpy's, one for
+ * each page. This is only required in copy mode.
+ */
+static void __xsk_rcv_memcpy(struct xdp_umem *umem, u64 addr, void *from_buf,
+			     u32 len, u32 metalen)
+{
+	void *to_buf = xdp_umem_get_data(umem, addr);
+
+	if (xskq_crosses_non_contig_pg(umem, addr, len + metalen)) {
+		void *next_pg_addr = umem->pages[(addr >> PAGE_SHIFT) + 1].addr;
+		u64 page_start = addr & ~(PAGE_SIZE - 1);
+		u64 first_len = PAGE_SIZE - (addr - page_start);
+
+		memcpy(to_buf, from_buf, first_len + metalen);
+		memcpy(next_pg_addr, from_buf + first_len, len - first_len);
+
+		return;
+	}
+
+	memcpy(to_buf, from_buf, len + metalen);
+}
+
 static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 {
-	void *to_buf, *from_buf;
+	u64 offset = xs->umem->headroom;
+	void *from_buf;
 	u32 metalen;
 	u64 addr;
 	int err;
 
-	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
+	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
 	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
 		xs->rx_dropped++;
 		return -ENOSPC;
 	}
 
-	addr += xs->umem->headroom;
-
 	if (unlikely(xdp_data_meta_unsupported(xdp))) {
 		from_buf = xdp->data;
 		metalen = 0;
@@ -78,9 +99,10 @@ static int __xsk_rcv(struct xdp_sock *xs, struct xdp_buff *xdp, u32 len)
 		metalen = xdp->data - xdp->data_meta;
 	}
 
-	to_buf = xdp_umem_get_data(xs->umem, addr);
-	memcpy(to_buf, from_buf, len + metalen);
-	addr += metalen;
+	__xsk_rcv_memcpy(xs->umem, addr + offset, from_buf, len, metalen);
+
+	offset += metalen;
+	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
 	err = xskq_produce_batch_desc(xs->rx, addr, len);
 	if (!err) {
 		xskq_discard_addr(xs->umem->fq);
@@ -125,6 +147,7 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 {
 	u32 metalen = xdp->data - xdp->data_meta;
 	u32 len = xdp->data_end - xdp->data;
+	u64 offset = xs->umem->headroom;
 	void *buffer;
 	u64 addr;
 	int err;
@@ -136,17 +159,17 @@ int xsk_generic_rcv(struct xdp_sock *xs, struct xdp_buff *xdp)
 		goto out_unlock;
 	}
 
-	if (!xskq_peek_addr(xs->umem->fq, &addr) ||
+	if (!xskq_peek_addr(xs->umem->fq, &addr, xs->umem) ||
 	    len > xs->umem->chunk_size_nohr - XDP_PACKET_HEADROOM) {
 		err = -ENOSPC;
 		goto out_drop;
 	}
 
-	addr += xs->umem->headroom;
-
-	buffer = xdp_umem_get_data(xs->umem, addr);
+	buffer = xdp_umem_get_data(xs->umem, addr + offset);
 	memcpy(buffer, xdp->data_meta, len + metalen);
-	addr += metalen;
+	offset += metalen;
+
+	addr = xsk_umem_adjust_offset(xs->umem, addr, offset);
 	err = xskq_produce_batch_desc(xs->rx, addr, len);
 	if (err)
 		goto out_drop;
@@ -190,7 +213,7 @@ bool xsk_umem_consume_tx(struct xdp_umem *umem, struct xdp_desc *desc)
 
 	rcu_read_lock();
 	list_for_each_entry_rcu(xs, &umem->xsk_list, list) {
-		if (!xskq_peek_desc(xs->tx, desc))
+		if (!xskq_peek_desc(xs->tx, desc, umem))
 			continue;
 
 		if (xskq_produce_addr_lazy(umem->cq, desc->addr))
@@ -243,7 +266,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 	if (xs->queue_id >= xs->dev->real_num_tx_queues)
 		goto out;
 
-	while (xskq_peek_desc(xs->tx, &desc)) {
+	while (xskq_peek_desc(xs->tx, &desc, xs->umem)) {
 		char *buffer;
 		u64 addr;
 		u32 len;
@@ -262,6 +285,10 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 
 		skb_put(skb, len);
 		addr = desc.addr;
+		if (xs->umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG)
+			addr = (addr & XSK_UNALIGNED_BUF_ADDR_MASK) +
+				(addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT);
+
 		buffer = xdp_umem_get_data(xs->umem, addr);
 		err = skb_store_bits(skb, 0, buffer, len);
 		if (unlikely(err) || xskq_reserve_addr(xs->umem->cq)) {
@@ -272,7 +299,7 @@ static int xsk_generic_xmit(struct sock *sk, struct msghdr *m,
 		skb->dev = xs->dev;
 		skb->priority = sk->sk_priority;
 		skb->mark = sk->sk_mark;
-		skb_shinfo(skb)->destructor_arg = (void *)(long)addr;
+		skb_shinfo(skb)->destructor_arg = (void *)(long)desc.addr;
 		skb->destructor = xsk_destruct_skb;
 
 		err = dev_direct_xmit(skb, xs->queue_id);
@@ -412,6 +439,24 @@ static struct socket *xsk_lookup_xsk_from_fd(int fd)
 	return sock;
 }
 
+/* Check if umem pages are contiguous.
+ * If zero-copy mode, use the DMA address to do the page contiguity check
+ * For all other modes we use addr (kernel virtual address)
+ * Store the result in the low bits of addr.
+ */
+static void xsk_check_page_contiguity(struct xdp_umem *umem, u32 flags)
+{
+	struct xdp_umem_page *pgs = umem->pages;
+	int i, is_contig;
+
+	for (i = 0; i < umem->npgs - 1; i++) {
+		is_contig = (flags & XDP_ZEROCOPY) ?
+			(pgs[i].dma + PAGE_SIZE == pgs[i + 1].dma) :
+			(pgs[i].addr + PAGE_SIZE == pgs[i + 1].addr);
+		pgs[i].addr += is_contig << XSK_NEXT_PG_CONTIG_SHIFT;
+	}
+}
+
 static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 {
 	struct sockaddr_xdp *sxdp = (struct sockaddr_xdp *)addr;
@@ -500,6 +545,8 @@ static int xsk_bind(struct socket *sock, struct sockaddr *addr, int addr_len)
 		err = xdp_umem_assign_dev(xs->umem, dev, qid, flags);
 		if (err)
 			goto out_unlock;
+
+		xsk_check_page_contiguity(xs->umem, flags);
 	}
 
 	xs->dev = dev;
diff --git a/net/xdp/xsk_diag.c b/net/xdp/xsk_diag.c
index d5e06c8e0cbf..9986a759fe06 100644
--- a/net/xdp/xsk_diag.c
+++ b/net/xdp/xsk_diag.c
@@ -56,7 +56,7 @@ static int xsk_diag_put_umem(const struct xdp_sock *xs, struct sk_buff *nlskb)
 	du.id = umem->id;
 	du.size = umem->size;
 	du.num_pages = umem->npgs;
-	du.chunk_size = (__u32)(~umem->chunk_mask + 1);
+	du.chunk_size = umem->chunk_size_nohr + umem->headroom;
 	du.headroom = umem->headroom;
 	du.ifindex = umem->dev ? umem->dev->ifindex : 0;
 	du.queue_id = umem->queue_id;
diff --git a/net/xdp/xsk_queue.h b/net/xdp/xsk_queue.h
index 909c5168ed0f..3d045c1c94b1 100644
--- a/net/xdp/xsk_queue.h
+++ b/net/xdp/xsk_queue.h
@@ -133,6 +133,17 @@ static inline bool xskq_has_addrs(struct xsk_queue *q, u32 cnt)
 
 /* UMEM queue */
 
+static inline bool xskq_crosses_non_contig_pg(struct xdp_umem *umem, u64 addr,
+					      u64 length)
+{
+	bool cross_pg = (addr & (PAGE_SIZE - 1)) + length > PAGE_SIZE;
+	bool next_pg_contig =
+		(unsigned long)umem->pages[(addr >> PAGE_SHIFT)].addr &
+			XSK_NEXT_PG_CONTIG_MASK;
+
+	return cross_pg && !next_pg_contig;
+}
+
 static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
 {
 	if (addr >= q->size) {
@@ -143,23 +154,50 @@ static inline bool xskq_is_valid_addr(struct xsk_queue *q, u64 addr)
 	return true;
 }
 
-static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr)
+static inline bool xskq_is_valid_addr_unaligned(struct xsk_queue *q, u64 addr,
+						u64 length,
+						struct xdp_umem *umem)
+{
+	addr += addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
+	addr &= XSK_UNALIGNED_BUF_ADDR_MASK;
+	if (addr >= q->size ||
+	    xskq_crosses_non_contig_pg(umem, addr, length)) {
+		q->invalid_descs++;
+		return false;
+	}
+
+	return true;
+}
+
+static inline u64 *xskq_validate_addr(struct xsk_queue *q, u64 *addr,
+				      struct xdp_umem *umem)
 {
 	while (q->cons_tail != q->cons_head) {
 		struct xdp_umem_ring *ring = (struct xdp_umem_ring *)q->ring;
 		unsigned int idx = q->cons_tail & q->ring_mask;
 
 		*addr = READ_ONCE(ring->desc[idx]) & q->chunk_mask;
+
+		if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
+			if (xskq_is_valid_addr_unaligned(q, *addr,
+							 umem->chunk_size_nohr,
+							 umem))
+				return addr;
+			goto out;
+		}
+
 		if (xskq_is_valid_addr(q, *addr))
 			return addr;
 
+out:
 		q->cons_tail++;
 	}
 
 	return NULL;
 }
 
-static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
+static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr,
+				  struct xdp_umem *umem)
 {
 	if (q->cons_tail == q->cons_head) {
 		smp_mb(); /* D, matches A */
@@ -170,7 +208,7 @@ static inline u64 *xskq_peek_addr(struct xsk_queue *q, u64 *addr)
 		smp_rmb();
 	}
 
-	return xskq_validate_addr(q, addr);
+	return xskq_validate_addr(q, addr, umem);
 }
 
 static inline void xskq_discard_addr(struct xsk_queue *q)
@@ -229,8 +267,21 @@ static inline int xskq_reserve_addr(struct xsk_queue *q)
 
 /* Rx/Tx queue */
 
-static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
+static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d,
+				      struct xdp_umem *umem)
 {
+	if (umem->flags & XDP_UMEM_UNALIGNED_CHUNK_FLAG) {
+		if (!xskq_is_valid_addr_unaligned(q, d->addr, d->len, umem))
+			return false;
+
+		if (d->len > umem->chunk_size_nohr || d->options) {
+			q->invalid_descs++;
+			return false;
+		}
+
+		return true;
+	}
+
 	if (!xskq_is_valid_addr(q, d->addr))
 		return false;
 
@@ -244,14 +295,15 @@ static inline bool xskq_is_valid_desc(struct xsk_queue *q, struct xdp_desc *d)
 }
 
 static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
-						  struct xdp_desc *desc)
+						  struct xdp_desc *desc,
+						  struct xdp_umem *umem)
 {
 	while (q->cons_tail != q->cons_head) {
 		struct xdp_rxtx_ring *ring = (struct xdp_rxtx_ring *)q->ring;
 		unsigned int idx = q->cons_tail & q->ring_mask;
 
 		*desc = READ_ONCE(ring->desc[idx]);
-		if (xskq_is_valid_desc(q, desc))
+		if (xskq_is_valid_desc(q, desc, umem))
 			return desc;
 
 		q->cons_tail++;
@@ -261,7 +313,8 @@ static inline struct xdp_desc *xskq_validate_desc(struct xsk_queue *q,
 }
 
 static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
-					      struct xdp_desc *desc)
+					      struct xdp_desc *desc,
+					      struct xdp_umem *umem)
 {
 	if (q->cons_tail == q->cons_head) {
 		smp_mb(); /* D, matches A */
@@ -272,7 +325,7 @@ static inline struct xdp_desc *xskq_peek_desc(struct xsk_queue *q,
 		smp_rmb(); /* C, matches B */
 	}
 
-	return xskq_validate_desc(q, desc);
+	return xskq_validate_desc(q, desc, umem);
 }
 
 static inline void xskq_discard_desc(struct xsk_queue *q)
-- 
2.17.1



* [PATCH bpf-next v4 03/11] libbpf: add flags to umem config
From: Kevin Laatz @ 2019-07-30  8:53 UTC (permalink / raw)
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

This patch adds a 'flags' field to the umem_config and umem_reg structs.
This will allow for more options to be added for configuring umems.

The first use for the flags field is to add a flag for unaligned chunks
mode. These flags can either be user-provided or filled with a default.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
Signed-off-by: Ciara Loftus <ciara.loftus@intel.com>

---
v2:
  - Removed the headroom check from this patch. It has moved to the
    previous patch.

v4:
  - modified chunk flag define
---
 tools/include/uapi/linux/if_xdp.h | 9 +++++++--
 tools/lib/bpf/xsk.c               | 3 +++
 tools/lib/bpf/xsk.h               | 2 ++
 3 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/tools/include/uapi/linux/if_xdp.h b/tools/include/uapi/linux/if_xdp.h
index faaa5ca2a117..a691802d7915 100644
--- a/tools/include/uapi/linux/if_xdp.h
+++ b/tools/include/uapi/linux/if_xdp.h
@@ -17,6 +17,10 @@
 #define XDP_COPY	(1 << 1) /* Force copy-mode */
 #define XDP_ZEROCOPY	(1 << 2) /* Force zero-copy mode */
 
+/* Flags for xsk_umem_config flags */
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT 15
+#define XDP_UMEM_UNALIGNED_CHUNK_FLAG (1 << XDP_UMEM_UNALIGNED_CHUNK_FLAG_SHIFT)
+
 struct sockaddr_xdp {
 	__u16 sxdp_family;
 	__u16 sxdp_flags;
@@ -49,8 +53,9 @@ struct xdp_mmap_offsets {
 #define XDP_OPTIONS			8
 
 struct xdp_umem_reg {
-	__u64 addr; /* Start of packet data area */
-	__u64 len; /* Length of packet data area */
+	__u64 addr;     /* Start of packet data area */
+	__u64 len:48;   /* Length of packet data area */
+	__u64 flags:16; /* Flags for umem */
 	__u32 chunk_size;
 	__u32 headroom;
 };
diff --git a/tools/lib/bpf/xsk.c b/tools/lib/bpf/xsk.c
index 5007b5d4fd2c..5e7e4d420ee0 100644
--- a/tools/lib/bpf/xsk.c
+++ b/tools/lib/bpf/xsk.c
@@ -116,6 +116,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
 		cfg->comp_size = XSK_RING_CONS__DEFAULT_NUM_DESCS;
 		cfg->frame_size = XSK_UMEM__DEFAULT_FRAME_SIZE;
 		cfg->frame_headroom = XSK_UMEM__DEFAULT_FRAME_HEADROOM;
+		cfg->flags = XSK_UMEM__DEFAULT_FLAGS;
 		return;
 	}
 
@@ -123,6 +124,7 @@ static void xsk_set_umem_config(struct xsk_umem_config *cfg,
 	cfg->comp_size = usr_cfg->comp_size;
 	cfg->frame_size = usr_cfg->frame_size;
 	cfg->frame_headroom = usr_cfg->frame_headroom;
+	cfg->flags = usr_cfg->flags;
 }
 
 static int xsk_set_xdp_socket_config(struct xsk_socket_config *cfg,
@@ -182,6 +184,7 @@ int xsk_umem__create(struct xsk_umem **umem_ptr, void *umem_area, __u64 size,
 	mr.len = size;
 	mr.chunk_size = umem->config.frame_size;
 	mr.headroom = umem->config.frame_headroom;
+	mr.flags = umem->config.flags;
 
 	err = setsockopt(umem->fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
 	if (err) {
diff --git a/tools/lib/bpf/xsk.h b/tools/lib/bpf/xsk.h
index 833a6e60d065..44a03d8c34b9 100644
--- a/tools/lib/bpf/xsk.h
+++ b/tools/lib/bpf/xsk.h
@@ -170,12 +170,14 @@ LIBBPF_API int xsk_socket__fd(const struct xsk_socket *xsk);
 #define XSK_UMEM__DEFAULT_FRAME_SHIFT    12 /* 4096 bytes */
 #define XSK_UMEM__DEFAULT_FRAME_SIZE     (1 << XSK_UMEM__DEFAULT_FRAME_SHIFT)
 #define XSK_UMEM__DEFAULT_FRAME_HEADROOM 0
+#define XSK_UMEM__DEFAULT_FLAGS 0
 
 struct xsk_umem_config {
 	__u32 fill_size;
 	__u32 comp_size;
 	__u32 frame_size;
 	__u32 frame_headroom;
+	__u32 flags;
 };
 
 /* Flags for the libbpf_flags field. */
-- 
2.17.1



* [PATCH bpf-next v4 02/11] ixgbe: simplify Rx buffer recycle
From: Kevin Laatz @ 2019-07-30  8:53 UTC (permalink / raw)
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'obi' values
directly without having to mask and add the headroom.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
index 6b609553329f..bc86057628c8 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c
@@ -201,8 +201,6 @@ ixgbe_rx_buffer *ixgbe_get_rx_buffer_zc(struct ixgbe_ring *rx_ring,
 static void ixgbe_reuse_rx_buffer_zc(struct ixgbe_ring *rx_ring,
 				     struct ixgbe_rx_buffer *obi)
 {
-	unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
-	u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
 	u16 nta = rx_ring->next_to_alloc;
 	struct ixgbe_rx_buffer *nbi;
 
@@ -212,14 +210,9 @@ static void ixgbe_reuse_rx_buffer_zc(struct ixgbe_ring *rx_ring,
 	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
 
 	/* transfer page from old buffer to new buffer */
-	nbi->dma = obi->dma & mask;
-	nbi->dma += hr;
-
-	nbi->addr = (void *)((unsigned long)obi->addr & mask);
-	nbi->addr += hr;
-
-	nbi->handle = obi->handle & mask;
-	nbi->handle += rx_ring->xsk_umem->headroom;
+	nbi->dma = obi->dma;
+	nbi->addr = obi->addr;
+	nbi->handle = obi->handle;
 
 	obi->addr = NULL;
 	obi->skb = NULL;
-- 
2.17.1



* [PATCH bpf-next v4 01/11] i40e: simplify Rx buffer recycle
From: Kevin Laatz @ 2019-07-30  8:53 UTC (permalink / raw)
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190730085400.10376-1-kevin.laatz@intel.com>

Currently, the dma, addr and handle are modified when we reuse Rx buffers
in zero-copy mode. However, this is not required as the inputs to the
function are copies, not the original values themselves. As we use the
copies within the function, we can use the original 'old_bi' values
directly without having to mask and add the headroom.

Signed-off-by: Kevin Laatz <kevin.laatz@intel.com>
---
 drivers/net/ethernet/intel/i40e/i40e_xsk.c | 13 +++----------
 1 file changed, 3 insertions(+), 10 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_xsk.c b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
index 32bad014d76c..dfa096db2244 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_xsk.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_xsk.c
@@ -420,8 +420,6 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
 				    struct i40e_rx_buffer *old_bi)
 {
 	struct i40e_rx_buffer *new_bi = &rx_ring->rx_bi[rx_ring->next_to_alloc];
-	unsigned long mask = (unsigned long)rx_ring->xsk_umem->chunk_mask;
-	u64 hr = rx_ring->xsk_umem->headroom + XDP_PACKET_HEADROOM;
 	u16 nta = rx_ring->next_to_alloc;
 
 	/* update, and store next to alloc */
@@ -429,14 +427,9 @@ static void i40e_reuse_rx_buffer_zc(struct i40e_ring *rx_ring,
 	rx_ring->next_to_alloc = (nta < rx_ring->count) ? nta : 0;
 
 	/* transfer page from old buffer to new buffer */
-	new_bi->dma = old_bi->dma & mask;
-	new_bi->dma += hr;
-
-	new_bi->addr = (void *)((unsigned long)old_bi->addr & mask);
-	new_bi->addr += hr;
-
-	new_bi->handle = old_bi->handle & mask;
-	new_bi->handle += rx_ring->xsk_umem->headroom;
+	new_bi->dma = old_bi->dma;
+	new_bi->addr = old_bi->addr;
+	new_bi->handle = old_bi->handle;
 
 	old_bi->addr = NULL;
 }
-- 
2.17.1



* [PATCH bpf-next v4 00/11] XDP unaligned chunk placement support
From: Kevin Laatz @ 2019-07-30  8:53 UTC (permalink / raw)
  To: netdev, ast, daniel, bjorn.topel, magnus.karlsson, jakub.kicinski,
	jonathan.lemon, saeedm, maximmi, stephen
  Cc: bruce.richardson, ciara.loftus, bpf, intel-wired-lan, Kevin Laatz
In-Reply-To: <20190724051043.14348-1-kevin.laatz@intel.com>

This patch set adds the ability to use unaligned chunks in the XDP umem.

Currently, all chunk addresses passed to the umem are masked to be chunk
size aligned (max is PAGE_SIZE). This limits where we can place chunks
within the umem as well as limiting the packet sizes that are supported.

The changes in this patch set remove these restrictions, allowing XDP to
be more flexible in where it can place a chunk within a umem. By relaxing
where the chunks can be placed, it allows us to use an arbitrary buffer
size and place that wherever we have a free address in the umem. These
changes add the ability to support arbitrary frame sizes up to 4k
(PAGE_SIZE) and make it easy to integrate with other existing frameworks
that have their own memory management systems, such as DPDK.
In DPDK, for example, there is already support for AF_XDP with zero-copy.
However, with this patch set the integration will be much more seamless.
You can find the DPDK AF_XDP driver at:
https://git.dpdk.org/dpdk/tree/drivers/net/af_xdp

Since we are now dealing with arbitrary frame sizes, we also need to
update how we pass around addresses. Currently, the addresses can simply be
masked to 2k to get back to the original address. This becomes less trivial
when using frame sizes that are not a 'power of 2' size. This patch set
modifies the Rx/Tx descriptor format to use the upper 16-bits of the addr
field for an offset value, leaving the lower 48-bits for the address (this
leaves us with 256 Terabytes, which should be enough!). We only need to use
the upper 16-bits to store the offset when running in unaligned mode.
Rather than adding the offset (headroom etc) to the address, we will store
it in the upper 16-bits of the address field. This way, we can easily add
the offset to the address where we need it, using some bit manipulation and
addition, and we can also easily get the original address wherever we need
it (for example in i40e_zca_free) by simply masking to get the lower
48-bits of the address field.
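
As a rough user-space sketch of the descriptor layout described above
(not the kernel code itself): the shift and mask values below are
assumptions derived from the 16/48-bit split, and the desc_*() helper
names are hypothetical, made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values, derived from the 16-bit offset / 48-bit address split. */
#define XSK_UNALIGNED_BUF_OFFSET_SHIFT 48
#define XSK_UNALIGNED_BUF_ADDR_MASK \
	((1ULL << XSK_UNALIGNED_BUF_OFFSET_SHIFT) - 1)

/* Hypothetical helper: store the offset in the upper 16 bits of addr. */
static inline uint64_t desc_pack(uint64_t base, uint16_t offset)
{
	/* The base address must fit in the lower 48 bits. */
	assert((base & ~XSK_UNALIGNED_BUF_ADDR_MASK) == 0);
	return base | ((uint64_t)offset << XSK_UNALIGNED_BUF_OFFSET_SHIFT);
}

/* Original address: mask for the lower 48 bits (strips the offset). */
static inline uint64_t desc_base(uint64_t addr)
{
	return addr & XSK_UNALIGNED_BUF_ADDR_MASK;
}

/* Offset (headroom etc.): the upper 16 bits. */
static inline uint64_t desc_offset(uint64_t addr)
{
	return addr >> XSK_UNALIGNED_BUF_OFFSET_SHIFT;
}

/* Address of the data: original address plus the stored offset. */
static inline uint64_t desc_data_addr(uint64_t addr)
{
	return desc_base(addr) + desc_offset(addr);
}
```

With this layout, recycling a buffer only needs desc_base(); the base
address is never modified, so no chunk-size mask is required to recover
it.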

The numbers below were recorded with the following set up:
  - Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz
  - Intel Corporation Ethernet Controller XXV710 for 25GbE SFP28 (rev 02)
  - Driver: i40e
  - Application: xdpsock with l2fwd (single interface)
  - Turbo disabled in BIOS

These are solely for comparing performance with and without the patches.
The largest drop was ~1.5% (in zero-copy mode).

+-------------------------+-------------+-----------------+-------------+
| Buffer size: 2048       | SKB mode    | Zero-copy       | Copy        |
+-------------------------+-------------+-----------------+-------------+
| Aligned (baseline)      | 1.25 Mpps   | 10.0 Mpps       | 1.66 Mpps   |
+-------------------------+-------------+-----------------+-------------+
| Aligned (with patches)  | 1.25 Mpps   | 9.85 Mpps       | 1.66 Mpps   |
+-------------------------+-------------+-----------------+-------------+
| Unaligned               | 1.25 Mpps   | 9.65 Mpps       | 1.66 Mpps   |
+-------------------------+-------------+-----------------+-------------+

This patch set has been applied against
commit 475e31f8da1b ("Merge branch 'revamp-test_progs'")

Structure of the patch set:
Patch 1:
  - Remove unnecessary masking and headroom addition during zero-copy Rx
    buffer recycling in i40e. This change is required in order for the
    buffer recycling to work in the unaligned chunk mode.

Patch 2:
  - Remove unnecessary masking and headroom addition during
    zero-copy Rx buffer recycling in ixgbe. This change is required in
    order for the buffer recycling to work in the unaligned chunk mode.

Patch 3:
  - Add flags for umem configuration to libbpf

Patch 4:
  - Add infrastructure for unaligned chunks. Since we are dealing with
    unaligned chunks that could potentially cross a physical page boundary,
    we add checks to keep track of that information. We can later use this
    information to correctly handle buffers that are placed at an address
    where they cross a page boundary.  This patch also modifies the
    existing Rx and Tx functions to use the new descriptor format. To
    handle addresses correctly, we need to mask appropriately based on
    whether we are in aligned or unaligned mode.

Patch 5:
  - This patch updates the i40e driver to make use of the new descriptor
    format.

Patch 6:
  - This patch updates the ixgbe driver to make use of the new descriptor
    format.

Patch 7:
  - This patch updates the mlx5e driver to make use of the new descriptor
    format. These changes are required to handle the new descriptor format
    and for unaligned chunks support.

Patch 8:
  - Modify xdpsock application to add a command line option for
    unaligned chunks

Patch 9:
  - Since we can now run the application in unaligned chunk mode, we need
    to make sure we recycle the buffers appropriately.

Patch 10:
  - Adds hugepage support to the xdpsock application

Patch 11:
  - Documentation update to include the unaligned chunk scenario. We need
    to explicitly state that the incoming addresses are only masked in the
    aligned chunk mode and not the unaligned chunk mode.
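
To illustrate the page-boundary check mentioned under Patch 4, here is a
minimal user-space sketch. It assumes 4k pages and uses a plain bool
array in place of the flag the kernel keeps in the low bits of the page
address; the SKETCH_* names and both function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: 4k pages. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_PAGE_SIZE  (1ULL << SKETCH_PAGE_SHIFT)

/* True if a buffer of 'len' bytes starting at 'addr' spills into the
 * next page.
 */
static bool crosses_page(uint64_t addr, uint64_t len)
{
	/* Frame sizes are limited to PAGE_SIZE. */
	assert(len <= SKETCH_PAGE_SIZE);
	return (addr & (SKETCH_PAGE_SIZE - 1)) + len > SKETCH_PAGE_SIZE;
}

/* next_contig[i] is true when page i + 1 physically follows page i.
 * A buffer is only a problem if it crosses a page boundary AND the
 * next page is not contiguous.
 */
static bool crosses_non_contig_page(const bool *next_contig, uint64_t addr,
				    uint64_t len)
{
	return crosses_page(addr, len) &&
	       !next_contig[addr >> SKETCH_PAGE_SHIFT];
}
```

A buffer that stays within one page, or that crosses into a physically
contiguous page, passes the check; only the non-contiguous crossing is
rejected.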

---
v2:
  - fixed checkpatch issues
  - fixed Rx buffer recycling for unaligned chunks in xdpsock
  - removed unused defines
  - fixed how chunk_size is calculated in xsk_diag.c
  - added some performance numbers to cover letter
  - modified descriptor format to make it easier to retrieve original
    address
  - removed patch adding off_t off to the zero copy allocator. This is no
    longer needed with the new descriptor format.

v3:
  - added patch for mlx5 driver changes needed for unaligned chunks
  - moved offset handling to new helper function
  - changed value used for the umem chunk_mask. Now using the new
    descriptor format to save us doing the calculations in a number of
    places meaning more of the code is left unchanged while adding
    unaligned chunk support.

v4:
  - reworked the next_pg_contig field in the xdp_umem_page struct. We now
    use the low 12 bits of the addr for flags rather than adding an extra
    field in the struct.
  - modified unaligned chunks flag define
  - fixed page_start calculation in __xsk_rcv_memcpy().
  - move offset handling to the xdp_umem_get_* functions
  - modified the len field in xdp_umem_reg struct. We now use 16 bits from
    this for the flags field.
  - removed next_pg_contig field from xdp_umem_page struct. Using low 12
    bits of addr to store flags instead.
  - fixed headroom addition to handle in the mlx5e driver
  - other minor changes based on review comments

Kevin Laatz (11):
  i40e: simplify Rx buffer recycle
  ixgbe: simplify Rx buffer recycle
  libbpf: add flags to umem config
  xsk: add support to allow unaligned chunk placement
  i40e: modify driver for handling offsets
  ixgbe: modify driver for handling offsets
  mlx5e: modify driver for handling offsets
  samples/bpf: add unaligned chunks mode support to xdpsock
  samples/bpf: add buffer recycling for unaligned chunks to xdpsock
  samples/bpf: use hugepages in xdpsock app
  doc/af_xdp: include unaligned chunk case

 Documentation/networking/af_xdp.rst           | 10 ++-
 drivers/net/ethernet/intel/i40e/i40e_xsk.c    | 26 +++---
 drivers/net/ethernet/intel/ixgbe/ixgbe_xsk.c  | 26 +++---
 .../net/ethernet/mellanox/mlx5/core/en/xdp.c  |  8 +-
 .../ethernet/mellanox/mlx5/core/en/xsk/rx.c   |  3 +-
 include/net/xdp_sock.h                        | 40 +++++++++-
 include/uapi/linux/if_xdp.h                   | 14 +++-
 net/xdp/xdp_umem.c                            | 18 +++--
 net/xdp/xsk.c                                 | 79 +++++++++++++++----
 net/xdp/xsk_diag.c                            |  2 +-
 net/xdp/xsk_queue.h                           | 69 ++++++++++++++--
 samples/bpf/xdpsock_user.c                    | 59 ++++++++++----
 tools/include/uapi/linux/if_xdp.h             |  9 ++-
 tools/lib/bpf/xsk.c                           |  3 +
 tools/lib/bpf/xsk.h                           |  2 +
 15 files changed, 280 insertions(+), 88 deletions(-)

-- 
2.17.1



* Re: [PATCH v2 0/2] mmc: core: Fix Marvell WiFi reset by adding SDIO API to replug card
From: Doug Anderson @ 2019-07-30 16:59 UTC (permalink / raw)
  To: Andreas Fenkart
  Cc: Ulf Hansson, Kalle Valo, Adrian Hunter, Ganapathi Bhat,
	linux-wireless, Brian Norris, Amitkumar Karwar,
	open list:ARM/Rockchip SoC..., Wolfram Sang, Nishant Sarmukadam,
	netdev, Avri Altman, linux-mmc, David Miller, Xinming Hu, LKML,
	Thomas Gleixner, Kate Stewart
In-Reply-To: <CALtMJEB871Redpzx1u6G5GVEXz-kAP=vT6Wt98=X=xm4SEMeAQ@mail.gmail.com>

Hi,

On Tue, Jul 30, 2019 at 1:47 AM Andreas Fenkart <afenkart@gmail.com> wrote:
>
> > * Sometimes while I was testing I saw "Fail WiFi 1" indicating a
> >   transitory failure.  Usually this was an association failure, but in
> >   one case I saw the device do "Firmware wakeup failed" after I
> >   triggered the reset.  This caused the driver to trigger a re-reset
> >   of itself which eventually recovered things.  This was good because
> >   it was an actual test of the normal reset flow (not the one
> >   triggered via sysfs).
>
> This error triggers something. I remember that when I was working on
> the suspend-to-ram feature, we had problems waking up the firmware
> reliably. I found this patch in one of my old 3.13 trees
>
>     the missing bit -- ugly hack to force cmd52 before cmd53.

Thanks for the reference!  At the moment I'm not terribly worried
about this particular failure case (compared to other failure modes)
because it's rare and it self-heals.

...my best guess, though, is that the problem isn't exactly the same.
The "Firmware wakeup failed" is a pretty generic error message, kind
of like "something went wrong" and not all instances of this message
will have the same root cause.

I actually dealt with a few suspend/resume issues around mwifiex
recently though.  If you ever uprev, you might be interested in:

b82d6c1f8f82 mwifiex: Make resume actually do something useful again
on SDIO cards
83293386bc95 mmc: core: Prevent processing SDIO IRQs when the card is suspended

-Doug


* Re: [PATCH] net: dsa: mv88e6xxx: use link-down-define instead of plain value
From: David Miller @ 2019-07-30 16:56 UTC (permalink / raw)
  To: h.feurstein; +Cc: netdev, linux-kernel, andrew, vivien.didelot, f.fainelli
In-Reply-To: <20190730101142.548-1-h.feurstein@gmail.com>

From: Hubert Feurstein <h.feurstein@gmail.com>
Date: Tue, 30 Jul 2019 12:11:42 +0200

> Using the define here makes the code more expressive.
> 
> Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>

Applied, thank you.


* Re: [PATCH] net: phy: fixed_phy: print gpio error only if gpio node is present
From: David Miller @ 2019-07-30 16:55 UTC (permalink / raw)
  To: h.feurstein; +Cc: netdev, linux-kernel, andrew, f.fainelli, hkallweit1
In-Reply-To: <20190730094623.31640-1-h.feurstein@gmail.com>

From: Hubert Feurstein <h.feurstein@gmail.com>
Date: Tue, 30 Jul 2019 11:46:23 +0200

> It is perfectly ok to not have a gpio attached to the fixed-link node. So
> the driver should not throw an error message when the gpio is missing.
> 
> Signed-off-by: Hubert Feurstein <h.feurstein@gmail.com>

Applied and queued up for -stable, thanks.


* Re: [PATCH net-next v4 0/4] enetc: Add mdio bus driver for the PCIe MDIO endpoint
From: David Miller @ 2019-07-30 16:53 UTC (permalink / raw)
  To: claudiu.manoil
  Cc: andrew, robh+dt, leoyang.li, alexandru.marginean, netdev,
	devicetree, linux-arm-kernel, linux-kernel
In-Reply-To: <20190730.094436.855806617449032791.davem@davemloft.net>

From: David Miller <davem@davemloft.net>
Date: Tue, 30 Jul 2019 09:44:36 -0700 (PDT)

> From: Claudiu Manoil <claudiu.manoil@nxp.com>
> Date: Tue, 30 Jul 2019 12:45:15 +0300
> 
>> First patch fixes a sparse issue and cleans up accessors to avoid
>> casting to __iomem.
>> Second patch just registers the PCIe endpoint device containing
>> the MDIO registers as a standalone MDIO bus driver, to allow
>> an alternative way to control the MDIO bus.  The same code used
>> by the ENETC ports (eth controllers) to manage MDIO via local
>> registers applies and is reused.
>> 
>> Bindings are provided for the new MDIO node, similarly to ENETC
>> port nodes bindings.
>> 
>> Last patch enables the ENETC port 1 and its RGMII PHY on the
>> LS1028A QDS board, where the MDIO muxing configuration relies
>> on the MDIO support provided in the first patch.
>  ...
> 
> Series applied, thank you.

Actually this doesn't compile, I had to revert:

In file included from ./include/linux/phy.h:20,
                 from ./include/linux/of_mdio.h:11,
                 from drivers/net/ethernet/freescale/enetc/enetc_mdio.c:5:
drivers/net/ethernet/freescale/enetc/enetc_mdio.c:284:26: error: ‘enetc_mdio_id_table’ undeclared here (not in a function); did you mean ‘enetc_pci_mdio_id_table’?
 MODULE_DEVICE_TABLE(pci, enetc_mdio_id_table);
                          ^~~~~~~~~~~~~~~~~~~
./include/linux/module.h:230:15: note: in definition of macro ‘MODULE_DEVICE_TABLE’
 extern typeof(name) __mod_##type##__##name##_device_table  \
               ^~~~
./include/linux/module.h:230:21: error: ‘__mod_pci__enetc_mdio_id_table_device_table’ aliased to undefined symbol ‘enetc_mdio_id_table’
 extern typeof(name) __mod_##type##__##name##_device_table  \
                     ^~~~~~
drivers/net/ethernet/freescale/enetc/enetc_mdio.c:284:1: note: in expansion of macro ‘MODULE_DEVICE_TABLE’
 MODULE_DEVICE_TABLE(pci, enetc_mdio_id_table);
 ^~~~~~~~~~~~~~~~~~~


* Re: [PATCH net-next v4 0/4] enetc: Add mdio bus driver for the PCIe MDIO endpoint
From: David Miller @ 2019-07-30 16:44 UTC (permalink / raw)
  To: claudiu.manoil
  Cc: andrew, robh+dt, leoyang.li, alexandru.marginean, netdev,
	devicetree, linux-arm-kernel, linux-kernel
In-Reply-To: <1564479919-18835-1-git-send-email-claudiu.manoil@nxp.com>

From: Claudiu Manoil <claudiu.manoil@nxp.com>
Date: Tue, 30 Jul 2019 12:45:15 +0300

> First patch fixes a sparse issue and cleans up accessors to avoid
> casting to __iomem.
> Second patch just registers the PCIe endpoint device containing
> the MDIO registers as a standalone MDIO bus driver, to allow
> an alternative way to control the MDIO bus.  The same code used
> by the ENETC ports (eth controllers) to manage MDIO via local
> registers applies and is reused.
> 
> Bindings are provided for the new MDIO node, similarly to ENETC
> port nodes bindings.
> 
> Last patch enables the ENETC port 1 and its RGMII PHY on the
> LS1028A QDS board, where the MDIO muxing configuration relies
> on the MDIO support provided in the first patch.
 ...

Series applied, thank you.


* Re: [net-next 08/13] net/mlx5e: Protect tc flows hashtable with rcu
From: Willem de Bruijn @ 2019-07-30 16:37 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev@vger.kernel.org, Vlad Buslov, Jianbo Liu,
	Roi Dayan
In-Reply-To: <CA+FuTSfnikCV_J2cUEeafCaui8KxrK4njRR9rqgpo+5JhBxR9g@mail.gmail.com>

On Tue, Jul 30, 2019 at 12:16 PM Willem de Bruijn
<willemdebruijn.kernel@gmail.com> wrote:
>
> On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
> >
> > From: Vlad Buslov <vladbu@mellanox.com>
> >
> > In order to remove dependency on rtnl lock, access to tc flows hashtable
> > must be explicitly protected from concurrent flows removal.
> >
> > Extend tc flow structure with rcu to allow concurrent parallel access. Use
> > rcu read lock to safely lookup flow in tc flows hash table, and take
> > reference to it. Use rcu free for flow deletion to accommodate concurrent
> > stats requests.
> >
> > Add new DELETED flow flag. Implement new flow_flag_test_and_set() helper
> > that is used to set a flag and return its previous value. Use it to
> > atomically set the flag in mlx5e_delete_flower() to guarantee that flow can
> > only be deleted once, even when same flow is deleted concurrently by
> > multiple tasks.
> >
> > Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> > Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
> > Reviewed-by: Roi Dayan <roid@mellanox.com>
> > Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> > ---
>
> > @@ -3492,16 +3507,32 @@ int mlx5e_delete_flower(struct net_device *dev, struct mlx5e_priv *priv,
> >  {
> >         struct rhashtable *tc_ht = get_tc_ht(priv, flags);
> >         struct mlx5e_tc_flow *flow;
> > +       int err;
> >
> > +       rcu_read_lock();
> >         flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
> > -       if (!flow || !same_flow_direction(flow, flags))
> > -               return -EINVAL;
> > +       if (!flow || !same_flow_direction(flow, flags)) {
> > +               err = -EINVAL;
> > +               goto errout;
> > +       }
> >
> > +       /* Only delete the flow if it doesn't have MLX5E_TC_FLOW_DELETED flag
> > +        * set.
> > +        */
> > +       if (flow_flag_test_and_set(flow, DELETED)) {
> > +               err = -EINVAL;
> > +               goto errout;
> > +       }
> >         rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
> > +       rcu_read_unlock();
> >
> >         mlx5e_flow_put(priv, flow);
>
> Dereferencing flow outside rcu readside critical section? Does a build
> with lockdep not complain?

Eh no, it won't. The surprising part to me was to use a readside
critical section when performing a write action on an RCU ptr. The
DELETED flag ensures that multiple writers will not compete to call
rhashtable_remove_fast. rcu_read_lock is a common pattern to do
rhashtable lookup + delete.
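
A user-space sketch of why the test-and-set makes deletion single-shot.
C11 atomics stand in for the kernel helpers here; the struct, the flag
bit and both function signatures are hypothetical and this is not the
mlx5e code.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define FLOW_FLAG_DELETED (1u << 0)	/* hypothetical flag bit */

struct flow {
	atomic_uint flags;
};

/* Atomically set 'flag' and report whether it was already set. */
static bool flow_flag_test_and_set(struct flow *f, unsigned int flag)
{
	assert(f != NULL);
	return (atomic_fetch_or(&f->flags, flag) & flag) != 0;
}

/* Only the first caller sees the flag clear and goes on to remove the
 * flow; every later (or concurrent) caller gets -EINVAL.
 */
static int delete_flow(struct flow *f)
{
	if (flow_flag_test_and_set(f, FLOW_FLAG_DELETED))
		return -EINVAL;

	/* The winner would remove the flow from the table and drop its
	 * reference here.
	 */
	return 0;
}
```

Calling delete_flow() twice on the same flow returns 0 the first time
and -EINVAL the second, regardless of which task gets there first.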

>
> >
> >         return 0;
> > +
> > +errout:
> > +       rcu_read_unlock();
> > +       return err;
> >  }
> >
> >  int mlx5e_stats_flower(struct net_device *dev, struct mlx5e_priv *priv,
> > @@ -3517,8 +3548,10 @@ int mlx5e_stats_flower(struct net_device *dev, struct mlx5e_priv *priv,
> >         u64 bytes = 0;
> >         int err = 0;
> >
> > -       flow = mlx5e_flow_get(rhashtable_lookup_fast(tc_ht, &f->cookie,
> > -                                                    tc_ht_params));
> > +       rcu_read_lock();
> > +       flow = mlx5e_flow_get(rhashtable_lookup(tc_ht, &f->cookie,
> > +                                               tc_ht_params));
> > +       rcu_read_unlock();
> >         if (IS_ERR(flow))
> >                 return PTR_ERR(flow);
>
> Same, in code below this check?

Never mind, sorry. I missed that this took a reference on the ptr
returned from rhashtable_lookup.


* Re: [PATCH 4/4] net: dsa: mv88e6xxx: add PTP support for MV88E6250 family
From: Hubert Feurstein @ 2019-07-30 16:20 UTC (permalink / raw)
  To: Richard Cochran
  Cc: netdev, linux-kernel, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, David S. Miller, Rasmus Villemoes
In-Reply-To: <20190730160032.GA1251@localhost>

Hi Richard,

thank you for your comments.

Am Di., 30. Juli 2019 um 18:00 Uhr schrieb Richard Cochran
<richardcochran@gmail.com>:
[...]
> > -/* Raw timestamps are in units of 8-ns clock periods. */
> > -#define CC_SHIFT     28
> > -#define CC_MULT              (8 << CC_SHIFT)
> > -#define CC_MULT_NUM  (1 << 9)
> > -#define CC_MULT_DEM  15625ULL
> > +/* The adjfine API clamps ppb between [-32,768,000, 32,768,000], and
>
> That is not true.
>
> > + * therefore scaled_ppm between [-2,147,483,648, 2,147,483,647].
> > + * Set the maximum supported ppb to a round value smaller than the maximum.
> > + *
> > + * Percentually speaking, this is a +/- 0.032x adjustment of the
> > + * free-running counter (0.968x to 1.032x).
> > + */
> > +#define MV88E6XXX_MAX_ADJ_PPB        32000000
>
> I had set an arbitrary limit of 1000 ppm.  I can't really see any
> point in raising the limit.
>
> > +/* Family MV88E6250:
> > + * Raw timestamps are in units of 10-ns clock periods.
> > + *
> > + * clkadj = scaled_ppm * 10*2^28 / (10^6 * 2^16)
> > + * simplifies to
> > + * clkadj = scaled_ppm * 2^7 / 5^5
> > + */
> > +#define MV88E6250_CC_SHIFT   28
> > +#define MV88E6250_CC_MULT    (10 << MV88E6250_CC_SHIFT)
> > +#define MV88E6250_CC_MULT_NUM        (1 << 7)
> > +#define MV88E6250_CC_MULT_DEM        3125ULL
> > +
> > +/* Other families:
> > + * Raw timestamps are in units of 8-ns clock periods.
> > + *
> > + * clkadj = scaled_ppm * 8*2^28 / (10^6 * 2^16)
> > + * simplifies to
> > + * clkadj = scaled_ppm * 2^9 / 5^6
> > + */
> > +#define MV88E6XXX_CC_SHIFT   28
> > +#define MV88E6XXX_CC_MULT    (8 << MV88E6XXX_CC_SHIFT)
> > +#define MV88E6XXX_CC_MULT_NUM        (1 << 9)
> > +#define MV88E6XXX_CC_MULT_DEM        15625ULL
> >
> >  #define TAI_EVENT_WORK_INTERVAL msecs_to_jiffies(100)
> >
> > @@ -179,24 +206,14 @@ static void mv88e6352_tai_event_work(struct work_struct *ugly)
> >  static int mv88e6xxx_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
> >  {
> >       struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
> > -     int neg_adj = 0;
> > -     u32 diff, mult;
> > -     u64 adj;
> > +     s64 adj;
> >
> > -     if (scaled_ppm < 0) {
> > -             neg_adj = 1;
> > -             scaled_ppm = -scaled_ppm;
> > -     }
>
> Please don't re-write this logic.  It is written like that for a reason.
I used the sja1105_ptp.c as a reference. So it is also wrong there.

Hubert

* Re: [net-next 08/13] net/mlx5e: Protect tc flows hashtable with rcu
From: Willem de Bruijn @ 2019-07-30 16:15 UTC (permalink / raw)
  To: Saeed Mahameed
  Cc: David S. Miller, netdev@vger.kernel.org, Vlad Buslov, Jianbo Liu,
	Roi Dayan
In-Reply-To: <20190729234934.23595-9-saeedm@mellanox.com>

On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> From: Vlad Buslov <vladbu@mellanox.com>
>
> In order to remove dependency on rtnl lock, access to tc flows hashtable
> must be explicitly protected from concurrent flows removal.
>
> Extend tc flow structure with rcu to allow concurrent parallel access. Use
> rcu read lock to safely lookup flow in tc flows hash table, and take
> reference to it. Use rcu free for flow deletion to accommodate concurrent
> stats requests.
>
> Add new DELETED flow flag. Implement new flow_flag_test_and_set() helper
> that is used to set a flag and return its previous value. Use it to
> atomically set the flag in mlx5e_delete_flower() to guarantee that flow can
> only be deleted once, even when same flow is deleted concurrently by
> multiple tasks.
>
> Signed-off-by: Vlad Buslov <vladbu@mellanox.com>
> Reviewed-by: Jianbo Liu <jianbol@mellanox.com>
> Reviewed-by: Roi Dayan <roid@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---

> @@ -3492,16 +3507,32 @@ int mlx5e_delete_flower(struct net_device *dev, struct mlx5e_priv *priv,
>  {
>         struct rhashtable *tc_ht = get_tc_ht(priv, flags);
>         struct mlx5e_tc_flow *flow;
> +       int err;
>
> +       rcu_read_lock();
>         flow = rhashtable_lookup_fast(tc_ht, &f->cookie, tc_ht_params);
> -       if (!flow || !same_flow_direction(flow, flags))
> -               return -EINVAL;
> +       if (!flow || !same_flow_direction(flow, flags)) {
> +               err = -EINVAL;
> +               goto errout;
> +       }
>
> +       /* Only delete the flow if it doesn't have MLX5E_TC_FLOW_DELETED flag
> +        * set.
> +        */
> +       if (flow_flag_test_and_set(flow, DELETED)) {
> +               err = -EINVAL;
> +               goto errout;
> +       }
>         rhashtable_remove_fast(tc_ht, &flow->node, tc_ht_params);
> +       rcu_read_unlock();
>
>         mlx5e_flow_put(priv, flow);

Dereferencing flow outside rcu readside critical section? Does a build
with lockdep not complain?

>
>         return 0;
> +
> +errout:
> +       rcu_read_unlock();
> +       return err;
>  }
>
>  int mlx5e_stats_flower(struct net_device *dev, struct mlx5e_priv *priv,
> @@ -3517,8 +3548,10 @@ int mlx5e_stats_flower(struct net_device *dev, struct mlx5e_priv *priv,
>         u64 bytes = 0;
>         int err = 0;
>
> -       flow = mlx5e_flow_get(rhashtable_lookup_fast(tc_ht, &f->cookie,
> -                                                    tc_ht_params));
> +       rcu_read_lock();
> +       flow = mlx5e_flow_get(rhashtable_lookup(tc_ht, &f->cookie,
> +                                               tc_ht_params));
> +       rcu_read_unlock();
>         if (IS_ERR(flow))
>                 return PTR_ERR(flow);

Same, in code below this check?
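
For readers following the "delete only once" argument in the quoted commit
message, here is a minimal userspace sketch of a test-and-set flag helper.
It uses C11 atomics rather than the kernel bitops; struct flow,
FLOW_FLAG_DELETED and flow_delete_once() are hypothetical names chosen for
illustration and are not taken from the mlx5e driver. It does not model RCU
or the reference counting that the review comments above ask about.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag bit; the name is illustrative, not from the driver. */
enum { FLOW_FLAG_DELETED = 0 };

/* Hypothetical stand-in for a flow object that carries a flags word. */
struct flow {
	atomic_ulong flags;
};

/* Atomically set the flag bit and report whether it was already set. */
static bool flow_flag_test_and_set(struct flow *f, unsigned int bit)
{
	unsigned long mask = 1UL << bit;

	return (atomic_fetch_or(&f->flags, mask) & mask) != 0;
}

/*
 * Returns 0 for the single caller that wins the deletion and -1 for
 * every later (or concurrent) caller that finds the flag already set.
 */
static int flow_delete_once(struct flow *f)
{
	if (flow_flag_test_and_set(f, FLOW_FLAG_DELETED))
		return -1;

	/* The winner would unlink the flow and drop its reference here. */
	return 0;
}
```

Because the fetch-or is a single atomic read-modify-write, exactly one of
several racing callers observes the bit as clear, which is the property the
commit message relies on.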

* Re: [PATCH net-next v5 0/5] vsock/virtio: optimizations to increase the throughput
From: Stefano Garzarella @ 2019-07-30 16:02 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: netdev, virtualization, linux-kernel, Jason Wang, kvm,
	David S. Miller, Stefan Hajnoczi
In-Reply-To: <20190730115503-mutt-send-email-mst@kernel.org>

On Tue, Jul 30, 2019 at 11:55:09AM -0400, Michael S. Tsirkin wrote:
> On Tue, Jul 30, 2019 at 11:54:53AM -0400, Michael S. Tsirkin wrote:
> > On Tue, Jul 30, 2019 at 05:43:29PM +0200, Stefano Garzarella wrote:
> > > This series tries to increase the throughput of virtio-vsock with slight
> > > changes.
> > > While I was testing the v2 of this series I discovered a huge use of memory,
> > > so I added patch 1 to mitigate this issue. I put it in this series in order
> > > to better track the performance trends.
> > > 
> > > v5:
> > > - rebased all patches on net-next
> > > - added Stefan's R-b and Michael's A-b
> > 
> > This doesn't solve all issues around allocation - as I mentioned I think
> > we will need to improve accounting for that,
> > and maybe add pre-allocation.

Yes, I'll work on it following your suggestions.

> > But it's a great series of steps in the right direction!
> > 

Thank you very much :)
Stefano

> 
> 
> So
> 
> Acked-by: Michael S. Tsirkin <mst@redhat.com>
> 
> > 
> > > v4: https://patchwork.kernel.org/cover/11047717
> > > v3: https://patchwork.kernel.org/cover/10970145
> > > v2: https://patchwork.kernel.org/cover/10938743
> > > v1: https://patchwork.kernel.org/cover/10885431
> > > 
> > > Below are the benchmarks step by step. I used iperf3 [1] modified with VSOCK
> > > support. As Michael suggested in the v1, I booted host and guest with 'nosmap'.
> > > 
> > > A brief description of patches:
> > > - Patches 1:   limit the memory usage with an extra copy for small packets
> > > - Patches 2+3: reduce the number of credit update messages sent to the
> > >                transmitter
> > > - Patches 4+5: allow the host to split packets on multiple buffers and use
> > >                VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size allowed
> > > 
> > >                     host -> guest [Gbps]
> > > pkt_size before opt   p 1     p 2+3    p 4+5
> > > 
> > > 32         0.032     0.030    0.048    0.051
> > > 64         0.061     0.059    0.108    0.117
> > > 128        0.122     0.112    0.227    0.234
> > > 256        0.244     0.241    0.418    0.415
> > > 512        0.459     0.466    0.847    0.865
> > > 1K         0.927     0.919    1.657    1.641
> > > 2K         1.884     1.813    3.262    3.269
> > > 4K         3.378     3.326    6.044    6.195
> > > 8K         5.637     5.676   10.141   11.287
> > > 16K        8.250     8.402   15.976   16.736
> > > 32K       13.327    13.204   19.013   20.515
> > > 64K       21.241    21.341   20.973   21.879
> > > 128K      21.851    22.354   21.816   23.203
> > > 256K      21.408    21.693   21.846   24.088
> > > 512K      21.600    21.899   21.921   24.106
> > > 
> > >                     guest -> host [Gbps]
> > > pkt_size before opt   p 1     p 2+3    p 4+5
> > > 
> > > 32         0.045     0.046    0.057    0.057
> > > 64         0.089     0.091    0.103    0.104
> > > 128        0.170     0.179    0.192    0.200
> > > 256        0.364     0.351    0.361    0.379
> > > 512        0.709     0.699    0.731    0.790
> > > 1K         1.399     1.407    1.395    1.427
> > > 2K         2.670     2.684    2.745    2.835
> > > 4K         5.171     5.199    5.305    5.451
> > > 8K         8.442     8.500   10.083    9.941
> > > 16K       12.305    12.259   13.519   15.385
> > > 32K       11.418    11.150   11.988   24.680
> > > 64K       10.778    10.659   11.589   35.273
> > > 128K      10.421    10.339   10.939   40.338
> > > 256K      10.300     9.719   10.508   36.562
> > > 512K       9.833     9.808   10.612   35.979
> > > 
> > > As Stefan suggested in the v1, I measured also the efficiency in this way:
> > >     efficiency = Mbps / (%CPU_Host + %CPU_Guest)
> > > 
> > > The '%CPU_Guest' is taken inside the VM. I know that it is not the best way,
> > > but it's provided for free from iperf3 and could be an indication.
> > > 
> > >         host -> guest efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> > > pkt_size before opt   p 1     p 2+3    p 4+5
> > > 
> > > 32         0.35      0.45     0.79     1.02
> > > 64         0.56      0.80     1.41     1.54
> > > 128        1.11      1.52     3.03     3.12
> > > 256        2.20      2.16     5.44     5.58
> > > 512        4.17      4.18    10.96    11.46
> > > 1K         8.30      8.26    20.99    20.89
> > > 2K        16.82     16.31    39.76    39.73
> > > 4K        30.89     30.79    74.07    75.73
> > > 8K        53.74     54.49   124.24   148.91
> > > 16K       80.68     83.63   200.21   232.79
> > > 32K      132.27    132.52   260.81   357.07
> > > 64K      229.82    230.40   300.19   444.18
> > > 128K     332.60    329.78   331.51   492.28
> > > 256K     331.06    337.22   339.59   511.59
> > > 512K     335.58    328.50   331.56   504.56
> > > 
> > >         guest -> host efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> > > pkt_size before opt   p 1     p 2+3    p 4+5
> > > 
> > > 32         0.43      0.43     0.53     0.56
> > > 64         0.85      0.86     1.04     1.10
> > > 128        1.63      1.71     2.07     2.13
> > > 256        3.48      3.35     4.02     4.22
> > > 512        6.80      6.67     7.97     8.63
> > > 1K        13.32     13.31    15.72    15.94
> > > 2K        25.79     25.92    30.84    30.98
> > > 4K        50.37     50.48    58.79    59.69
> > > 8K        95.90     96.15   107.04   110.33
> > > 16K      145.80    145.43   143.97   174.70
> > > 32K      147.06    144.74   146.02   282.48
> > > 64K      145.25    143.99   141.62   406.40
> > > 128K     149.34    146.96   147.49   489.34
> > > 256K     156.35    149.81   152.21   536.37
> > > 512K     151.65    150.74   151.52   519.93
> > > 
> > > [1] https://github.com/stefano-garzarella/iperf/
> > > 
> > > Stefano Garzarella (5):
> > >   vsock/virtio: limit the memory used per-socket
> > >   vsock/virtio: reduce credit update messages
> > >   vsock/virtio: fix locking in virtio_transport_inc_tx_pkt()
> > >   vhost/vsock: split packets to send using multiple buffers
> > >   vsock/virtio: change the maximum packet size allowed
> > > 
> > >  drivers/vhost/vsock.c                   | 68 ++++++++++++-----
> > >  include/linux/virtio_vsock.h            |  4 +-
> > >  net/vmw_vsock/virtio_transport.c        |  1 +
> > >  net/vmw_vsock/virtio_transport_common.c | 99 ++++++++++++++++++++-----
> > >  4 files changed, 134 insertions(+), 38 deletions(-)
> > > 
> > > -- 
> > > 2.20.1

-- 

* Re: [PATCH 4/4] net: dsa: mv88e6xxx: add PTP support for MV88E6250 family
From: Richard Cochran @ 2019-07-30 16:00 UTC (permalink / raw)
  To: Hubert Feurstein
  Cc: netdev, linux-kernel, Andrew Lunn, Vivien Didelot,
	Florian Fainelli, David S. Miller, Rasmus Villemoes
In-Reply-To: <20190730100429.32479-5-h.feurstein@gmail.com>

On Tue, Jul 30, 2019 at 12:04:29PM +0200, Hubert Feurstein wrote:
> diff --git a/drivers/net/dsa/mv88e6xxx/ptp.c b/drivers/net/dsa/mv88e6xxx/ptp.c
> index 768d256f7c9f..51cdf4712517 100644
> --- a/drivers/net/dsa/mv88e6xxx/ptp.c
> +++ b/drivers/net/dsa/mv88e6xxx/ptp.c
> @@ -15,11 +15,38 @@
>  #include "hwtstamp.h"
>  #include "ptp.h"
>  
> -/* Raw timestamps are in units of 8-ns clock periods. */
> -#define CC_SHIFT	28
> -#define CC_MULT		(8 << CC_SHIFT)
> -#define CC_MULT_NUM	(1 << 9)
> -#define CC_MULT_DEM	15625ULL
> +/* The adjfine API clamps ppb between [-32,768,000, 32,768,000], and

That is not true.

> + * therefore scaled_ppm between [-2,147,483,648, 2,147,483,647].
> + * Set the maximum supported ppb to a round value smaller than the maximum.
> + *
> + * Percentually speaking, this is a +/- 0.032x adjustment of the
> + * free-running counter (0.968x to 1.032x).
> + */
> +#define MV88E6XXX_MAX_ADJ_PPB	32000000

I had set an arbitrary limit of 1000 ppm.  I can't really see any
point in raising the limit.

> +/* Family MV88E6250:
> + * Raw timestamps are in units of 10-ns clock periods.
> + *
> + * clkadj = scaled_ppm * 10*2^28 / (10^6 * 2^16)
> + * simplifies to
> + * clkadj = scaled_ppm * 2^7 / 5^5
> + */
> +#define MV88E6250_CC_SHIFT	28
> +#define MV88E6250_CC_MULT	(10 << MV88E6250_CC_SHIFT)
> +#define MV88E6250_CC_MULT_NUM	(1 << 7)
> +#define MV88E6250_CC_MULT_DEM	3125ULL
> +
> +/* Other families:
> + * Raw timestamps are in units of 8-ns clock periods.
> + *
> + * clkadj = scaled_ppm * 8*2^28 / (10^6 * 2^16)
> + * simplifies to
> + * clkadj = scaled_ppm * 2^9 / 5^6
> + */
> +#define MV88E6XXX_CC_SHIFT	28
> +#define MV88E6XXX_CC_MULT	(8 << MV88E6XXX_CC_SHIFT)
> +#define MV88E6XXX_CC_MULT_NUM	(1 << 9)
> +#define MV88E6XXX_CC_MULT_DEM	15625ULL
>  
>  #define TAI_EVENT_WORK_INTERVAL msecs_to_jiffies(100)
>  
> @@ -179,24 +206,14 @@ static void mv88e6352_tai_event_work(struct work_struct *ugly)
>  static int mv88e6xxx_ptp_adjfine(struct ptp_clock_info *ptp, long scaled_ppm)
>  {
>  	struct mv88e6xxx_chip *chip = ptp_to_chip(ptp);
> -	int neg_adj = 0;
> -	u32 diff, mult;
> -	u64 adj;
> +	s64 adj;
>  
> -	if (scaled_ppm < 0) {
> -		neg_adj = 1;
> -		scaled_ppm = -scaled_ppm;
> -	}

Please don't re-write this logic.  It is written like that for a reason.

> -	mult = CC_MULT;
> -	adj = CC_MULT_NUM;
> -	adj *= scaled_ppm;
> -	diff = div_u64(adj, CC_MULT_DEM);

Just substitute CC_MULT* with your new chip->ptp_cc_mult*
and leave the rest alone.

Thanks,
Richard
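
As an illustration of the review comment above, the following userspace
sketch keeps the original sign handling and unsigned arithmetic and only
parameterizes the per-family constants. The function name adjusted_mult()
and its parameters are hypothetical; the constants are the ones quoted in
the patch. The final "mult - diff / mult + diff" step is assumed from the
existing driver logic, which the quoted hunk does not show in full.

```c
#include <stdint.h>

#define CC_SHIFT	28

/* Families with 8-ns clock periods (values quoted in the patch). */
#define CC8_MULT	(8u << CC_SHIFT)
#define CC8_MULT_NUM	(1u << 9)
#define CC8_MULT_DEM	15625ULL

/* MV88E6250 family with 10-ns clock periods (values quoted in the patch). */
#define CC10_MULT	(10u << CC_SHIFT)
#define CC10_MULT_NUM	(1u << 7)
#define CC10_MULT_DEM	3125ULL

/*
 * Compute the adjusted cyclecounter multiplier for a given scaled_ppm.
 * The sign is stripped first so that the multiply and divide operate on
 * unsigned values, as in the original code.
 */
static uint32_t adjusted_mult(long scaled_ppm, uint32_t mult,
			      uint32_t num, uint64_t dem)
{
	int neg_adj = 0;
	uint32_t diff;
	uint64_t adj;

	if (scaled_ppm < 0) {
		neg_adj = 1;
		scaled_ppm = -scaled_ppm;
	}

	adj = num;
	adj *= (uint64_t)scaled_ppm;
	diff = (uint32_t)(adj / dem);

	return neg_adj ? mult - diff : mult + diff;
}
```

For scaled_ppm = 65536 (that is, 1 ppm) the 8-ns case gives
diff = 512 * 65536 / 15625 = 2147, which is one millionth of 8 << 28, and
the 10-ns case gives diff = 128 * 65536 / 3125 = 2684, one millionth of
10 << 28.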

* Re: [PATCH 03/12] block: bio_release_pages: use flags arg instead of bool
From: Jerome Glisse @ 2019-07-30 15:57 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Christoph Hellwig, john.hubbard, Andrew Morton, Alexander Viro,
	Anna Schumaker, David S . Miller, Dominique Martinet,
	Eric Van Hensbergen, Jason Gunthorpe, Jason Wang, Jens Axboe,
	Latchesar Ionkov, Michael S . Tsirkin, Miklos Szeredi,
	Trond Myklebust, Matthew Wilcox, linux-mm, LKML, ceph-devel, kvm,
	linux-block, linux-cifs, linux-fsdevel, linux-nfs, linux-rdma,
	netdev, samba-technical, v9fs-developer, virtualization,
	John Hubbard, Minwoo Im
In-Reply-To: <20190730102557.GA1700@lst.de>

On Tue, Jul 30, 2019 at 12:25:57PM +0200, Christoph Hellwig wrote:
> On Mon, Jul 29, 2019 at 04:57:21PM -0400, Jerome Glisse wrote:
> > > All pages released by bio_release_pages should come from
> > > get_user_pages, so I don't really see the point here.
> > 
> > No, they do not all come from GUP; see the various callers
> > of bio_check_pages_dirty(), for instance iomap_dio_zero().
> > 
> > I have carefully tracked all of this down and I did not do
> > any conversion just for the fun of it :)
> 
> Well, the point is that they _should_, not necessarily that they do.
> iomap_dio_zero adds the ZERO_PAGE, which we by definition don't need to
> refcount.  So we can mark this bio BIO_NO_PAGE_REF safely after removing
> the get_page there.
> 
> Note that the equivalent in the old direct I/O code, dio_refill_pages,
> will be a little more complicated as it can match user pages and the
> ZERO_PAGE in a single bio, so a per-bio flag won't handle it easily.
> Maybe we just need to use a separate bio there as well.
> 
> In general, with series like this we should not encode the status quo and
> pile new hacks upon the old ones, but think about where we should be and
> fix up the old warts while we have to wade through all that code.

Other users can also add pages that do not come from GUP but still need
to hold a reference, see __blkdev_direct_IO(). Sadly, a bio gets filled
from many different places, and not always with GUP, so we cannot say that
all pages here come from GUP. I had a different version of the patchset
that, I think, added a new release-dirty function for GUP versus non-GUP
bios. I posted it a while ago; I will try to dig it up once I am back.

Cheers,
Jérôme

* Re: [PATCH net-next v5 0/5] vsock/virtio: optimizations to increase the throughput
From: Michael S. Tsirkin @ 2019-07-30 15:55 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, virtualization, linux-kernel, Jason Wang, kvm,
	David S. Miller, Stefan Hajnoczi
In-Reply-To: <20190730115339-mutt-send-email-mst@kernel.org>

On Tue, Jul 30, 2019 at 11:54:53AM -0400, Michael S. Tsirkin wrote:
> On Tue, Jul 30, 2019 at 05:43:29PM +0200, Stefano Garzarella wrote:
> > This series tries to increase the throughput of virtio-vsock with slight
> > changes.
> > While I was testing the v2 of this series I discovered a huge use of memory,
> > so I added patch 1 to mitigate this issue. I put it in this series in order
> > to better track the performance trends.
> > 
> > v5:
> > - rebased all patches on net-next
> > - added Stefan's R-b and Michael's A-b
> 
> This doesn't solve all issues around allocation - as I mentioned I think
> we will need to improve accounting for that,
> and maybe add pre-allocation.
> But it's a great series of steps in the right direction!
> 


So

Acked-by: Michael S. Tsirkin <mst@redhat.com>

> 
> > v4: https://patchwork.kernel.org/cover/11047717
> > v3: https://patchwork.kernel.org/cover/10970145
> > v2: https://patchwork.kernel.org/cover/10938743
> > v1: https://patchwork.kernel.org/cover/10885431
> > 
> > Below are the benchmarks step by step. I used iperf3 [1] modified with VSOCK
> > support. As Michael suggested in the v1, I booted host and guest with 'nosmap'.
> > 
> > A brief description of patches:
> > - Patches 1:   limit the memory usage with an extra copy for small packets
> > - Patches 2+3: reduce the number of credit update messages sent to the
> >                transmitter
> > - Patches 4+5: allow the host to split packets on multiple buffers and use
> >                VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size allowed
> > 
> >                     host -> guest [Gbps]
> > pkt_size before opt   p 1     p 2+3    p 4+5
> > 
> > 32         0.032     0.030    0.048    0.051
> > 64         0.061     0.059    0.108    0.117
> > 128        0.122     0.112    0.227    0.234
> > 256        0.244     0.241    0.418    0.415
> > 512        0.459     0.466    0.847    0.865
> > 1K         0.927     0.919    1.657    1.641
> > 2K         1.884     1.813    3.262    3.269
> > 4K         3.378     3.326    6.044    6.195
> > 8K         5.637     5.676   10.141   11.287
> > 16K        8.250     8.402   15.976   16.736
> > 32K       13.327    13.204   19.013   20.515
> > 64K       21.241    21.341   20.973   21.879
> > 128K      21.851    22.354   21.816   23.203
> > 256K      21.408    21.693   21.846   24.088
> > 512K      21.600    21.899   21.921   24.106
> > 
> >                     guest -> host [Gbps]
> > pkt_size before opt   p 1     p 2+3    p 4+5
> > 
> > 32         0.045     0.046    0.057    0.057
> > 64         0.089     0.091    0.103    0.104
> > 128        0.170     0.179    0.192    0.200
> > 256        0.364     0.351    0.361    0.379
> > 512        0.709     0.699    0.731    0.790
> > 1K         1.399     1.407    1.395    1.427
> > 2K         2.670     2.684    2.745    2.835
> > 4K         5.171     5.199    5.305    5.451
> > 8K         8.442     8.500   10.083    9.941
> > 16K       12.305    12.259   13.519   15.385
> > 32K       11.418    11.150   11.988   24.680
> > 64K       10.778    10.659   11.589   35.273
> > 128K      10.421    10.339   10.939   40.338
> > 256K      10.300     9.719   10.508   36.562
> > 512K       9.833     9.808   10.612   35.979
> > 
> > As Stefan suggested in the v1, I measured also the efficiency in this way:
> >     efficiency = Mbps / (%CPU_Host + %CPU_Guest)
> > 
> > The '%CPU_Guest' is taken inside the VM. I know that it is not the best way,
> > but it's provided for free from iperf3 and could be an indication.
> > 
> >         host -> guest efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> > pkt_size before opt   p 1     p 2+3    p 4+5
> > 
> > 32         0.35      0.45     0.79     1.02
> > 64         0.56      0.80     1.41     1.54
> > 128        1.11      1.52     3.03     3.12
> > 256        2.20      2.16     5.44     5.58
> > 512        4.17      4.18    10.96    11.46
> > 1K         8.30      8.26    20.99    20.89
> > 2K        16.82     16.31    39.76    39.73
> > 4K        30.89     30.79    74.07    75.73
> > 8K        53.74     54.49   124.24   148.91
> > 16K       80.68     83.63   200.21   232.79
> > 32K      132.27    132.52   260.81   357.07
> > 64K      229.82    230.40   300.19   444.18
> > 128K     332.60    329.78   331.51   492.28
> > 256K     331.06    337.22   339.59   511.59
> > 512K     335.58    328.50   331.56   504.56
> > 
> >         guest -> host efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> > pkt_size before opt   p 1     p 2+3    p 4+5
> > 
> > 32         0.43      0.43     0.53     0.56
> > 64         0.85      0.86     1.04     1.10
> > 128        1.63      1.71     2.07     2.13
> > 256        3.48      3.35     4.02     4.22
> > 512        6.80      6.67     7.97     8.63
> > 1K        13.32     13.31    15.72    15.94
> > 2K        25.79     25.92    30.84    30.98
> > 4K        50.37     50.48    58.79    59.69
> > 8K        95.90     96.15   107.04   110.33
> > 16K      145.80    145.43   143.97   174.70
> > 32K      147.06    144.74   146.02   282.48
> > 64K      145.25    143.99   141.62   406.40
> > 128K     149.34    146.96   147.49   489.34
> > 256K     156.35    149.81   152.21   536.37
> > 512K     151.65    150.74   151.52   519.93
> > 
> > [1] https://github.com/stefano-garzarella/iperf/
> > 
> > Stefano Garzarella (5):
> >   vsock/virtio: limit the memory used per-socket
> >   vsock/virtio: reduce credit update messages
> >   vsock/virtio: fix locking in virtio_transport_inc_tx_pkt()
> >   vhost/vsock: split packets to send using multiple buffers
> >   vsock/virtio: change the maximum packet size allowed
> > 
> >  drivers/vhost/vsock.c                   | 68 ++++++++++++-----
> >  include/linux/virtio_vsock.h            |  4 +-
> >  net/vmw_vsock/virtio_transport.c        |  1 +
> >  net/vmw_vsock/virtio_transport_common.c | 99 ++++++++++++++++++++-----
> >  4 files changed, 134 insertions(+), 38 deletions(-)
> > 
> > -- 
> > 2.20.1

* Re: [PATCH net-next v5 0/5] vsock/virtio: optimizations to increase the throughput
From: Michael S. Tsirkin @ 2019-07-30 15:54 UTC (permalink / raw)
  To: Stefano Garzarella
  Cc: netdev, virtualization, linux-kernel, Jason Wang, kvm,
	David S. Miller, Stefan Hajnoczi
In-Reply-To: <20190730154334.237789-1-sgarzare@redhat.com>

On Tue, Jul 30, 2019 at 05:43:29PM +0200, Stefano Garzarella wrote:
> This series tries to increase the throughput of virtio-vsock with slight
> changes.
> While I was testing the v2 of this series I discovered a huge use of memory,
> so I added patch 1 to mitigate this issue. I put it in this series in order
> to better track the performance trends.
> 
> v5:
> - rebased all patches on net-next
> - added Stefan's R-b and Michael's A-b

This doesn't solve all issues around allocation - as I mentioned I think
we will need to improve accounting for that,
and maybe add pre-allocation.
But it's a great series of steps in the right direction!



> v4: https://patchwork.kernel.org/cover/11047717
> v3: https://patchwork.kernel.org/cover/10970145
> v2: https://patchwork.kernel.org/cover/10938743
> v1: https://patchwork.kernel.org/cover/10885431
> 
> Below are the benchmarks step by step. I used iperf3 [1] modified with VSOCK
> support. As Michael suggested in the v1, I booted host and guest with 'nosmap'.
> 
> A brief description of patches:
> - Patches 1:   limit the memory usage with an extra copy for small packets
> - Patches 2+3: reduce the number of credit update messages sent to the
>                transmitter
> - Patches 4+5: allow the host to split packets on multiple buffers and use
>                VIRTIO_VSOCK_MAX_PKT_BUF_SIZE as the max packet size allowed
> 
>                     host -> guest [Gbps]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.032     0.030    0.048    0.051
> 64         0.061     0.059    0.108    0.117
> 128        0.122     0.112    0.227    0.234
> 256        0.244     0.241    0.418    0.415
> 512        0.459     0.466    0.847    0.865
> 1K         0.927     0.919    1.657    1.641
> 2K         1.884     1.813    3.262    3.269
> 4K         3.378     3.326    6.044    6.195
> 8K         5.637     5.676   10.141   11.287
> 16K        8.250     8.402   15.976   16.736
> 32K       13.327    13.204   19.013   20.515
> 64K       21.241    21.341   20.973   21.879
> 128K      21.851    22.354   21.816   23.203
> 256K      21.408    21.693   21.846   24.088
> 512K      21.600    21.899   21.921   24.106
> 
>                     guest -> host [Gbps]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.045     0.046    0.057    0.057
> 64         0.089     0.091    0.103    0.104
> 128        0.170     0.179    0.192    0.200
> 256        0.364     0.351    0.361    0.379
> 512        0.709     0.699    0.731    0.790
> 1K         1.399     1.407    1.395    1.427
> 2K         2.670     2.684    2.745    2.835
> 4K         5.171     5.199    5.305    5.451
> 8K         8.442     8.500   10.083    9.941
> 16K       12.305    12.259   13.519   15.385
> 32K       11.418    11.150   11.988   24.680
> 64K       10.778    10.659   11.589   35.273
> 128K      10.421    10.339   10.939   40.338
> 256K      10.300     9.719   10.508   36.562
> 512K       9.833     9.808   10.612   35.979
> 
> As Stefan suggested in the v1, I measured also the efficiency in this way:
>     efficiency = Mbps / (%CPU_Host + %CPU_Guest)
> 
> The '%CPU_Guest' is taken inside the VM. I know that it is not the best way,
> but it's provided for free from iperf3 and could be an indication.
> 
>         host -> guest efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.35      0.45     0.79     1.02
> 64         0.56      0.80     1.41     1.54
> 128        1.11      1.52     3.03     3.12
> 256        2.20      2.16     5.44     5.58
> 512        4.17      4.18    10.96    11.46
> 1K         8.30      8.26    20.99    20.89
> 2K        16.82     16.31    39.76    39.73
> 4K        30.89     30.79    74.07    75.73
> 8K        53.74     54.49   124.24   148.91
> 16K       80.68     83.63   200.21   232.79
> 32K      132.27    132.52   260.81   357.07
> 64K      229.82    230.40   300.19   444.18
> 128K     332.60    329.78   331.51   492.28
> 256K     331.06    337.22   339.59   511.59
> 512K     335.58    328.50   331.56   504.56
> 
>         guest -> host efficiency [Mbps / (%CPU_Host + %CPU_Guest)]
> pkt_size before opt   p 1     p 2+3    p 4+5
> 
> 32         0.43      0.43     0.53     0.56
> 64         0.85      0.86     1.04     1.10
> 128        1.63      1.71     2.07     2.13
> 256        3.48      3.35     4.02     4.22
> 512        6.80      6.67     7.97     8.63
> 1K        13.32     13.31    15.72    15.94
> 2K        25.79     25.92    30.84    30.98
> 4K        50.37     50.48    58.79    59.69
> 8K        95.90     96.15   107.04   110.33
> 16K      145.80    145.43   143.97   174.70
> 32K      147.06    144.74   146.02   282.48
> 64K      145.25    143.99   141.62   406.40
> 128K     149.34    146.96   147.49   489.34
> 256K     156.35    149.81   152.21   536.37
> 512K     151.65    150.74   151.52   519.93
> 
> [1] https://github.com/stefano-garzarella/iperf/
> 
> Stefano Garzarella (5):
>   vsock/virtio: limit the memory used per-socket
>   vsock/virtio: reduce credit update messages
>   vsock/virtio: fix locking in virtio_transport_inc_tx_pkt()
>   vhost/vsock: split packets to send using multiple buffers
>   vsock/virtio: change the maximum packet size allowed
> 
>  drivers/vhost/vsock.c                   | 68 ++++++++++++-----
>  include/linux/virtio_vsock.h            |  4 +-
>  net/vmw_vsock/virtio_transport.c        |  1 +
>  net/vmw_vsock/virtio_transport_common.c | 99 ++++++++++++++++++++-----
>  4 files changed, 134 insertions(+), 38 deletions(-)
> 
> -- 
> 2.20.1
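
To make the efficiency metric in the quoted cover letter concrete, here is
a small sketch of the formula and its inverse. The function names are
hypothetical, and the conversion from the Gbps tables to Mbps assumes
decimal units (1 Gbps = 1000 Mbps), which the cover letter does not state
explicitly.

```c
/* efficiency = Mbps / (%CPU_Host + %CPU_Guest), as defined in the cover letter. */
static double efficiency(double mbps, double cpu_host_pct, double cpu_guest_pct)
{
	return mbps / (cpu_host_pct + cpu_guest_pct);
}

/*
 * Inverse of the formula: the combined CPU percentage implied by one
 * throughput value (in Gbps) and the matching efficiency value.
 * Assumes 1 Gbps = 1000 Mbps.
 */
static double implied_cpu_pct(double gbps, double eff)
{
	return gbps * 1000.0 / eff;
}
```

Applied to the host -> guest 64K row before optimization (21.241 Gbps,
efficiency 229.82), the inverse gives a combined CPU usage of roughly 92%;
this is only a consistency check on the published numbers, not a new
measurement.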

* Re: [net-next 01/13] net/mlx5e: Print a warning when LRO feature is dropped or not allowed
From: Willem de Bruijn @ 2019-07-30 15:52 UTC (permalink / raw)
  To: Saeed Mahameed; +Cc: David S. Miller, netdev@vger.kernel.org, Huy Nguyen
In-Reply-To: <20190729234934.23595-2-saeedm@mellanox.com>

On Mon, Jul 29, 2019 at 7:50 PM Saeed Mahameed <saeedm@mellanox.com> wrote:
>
> From: Huy Nguyen <huyn@mellanox.com>
>
> When user enables LRO via ethtool and if the RQ mode is legacy,
> mlx5e_fix_features drops the request without any explanation.
> Add netdev_warn to cover this case.
>
> Fixes: 6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in legacy RQ")
> Signed-off-by: Huy Nguyen <huyn@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
>  drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 47eea6b3a1c3..776eb46d263d 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -3788,9 +3788,10 @@ static netdev_features_t mlx5e_fix_features(struct net_device *netdev,
>                         netdev_warn(netdev, "Dropping C-tag vlan stripping offload due to S-tag vlan\n");
>         }
>         if (!MLX5E_GET_PFLAG(params, MLX5E_PFLAG_RX_STRIDING_RQ)) {
> -               features &= ~NETIF_F_LRO;
> -               if (params->lro_en)
> +               if (features & NETIF_F_LRO) {
>                         netdev_warn(netdev, "Disabling LRO, not supported in legacy RQ\n");

This warns about "Disabling LRO" on an enable request?

More fundamentally, it appears that the device does not advertise
the feature as configurable in netdev_hw_features as of commit
6c3a823e1e9c ("net/mlx5e: RX, Remove HW LRO support in
legacy RQ"), so shouldn't this be caught by the device driver
independent ethtool code?
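
The behaviour under discussion can be sketched in plain C as follows. This
is one reading of the quoted hunk, which is cut off after the netdev_warn()
line, so the assumption that the LRO bit is still cleared inside the new
branch is mine. FEATURE_LRO and fix_features() are hypothetical stand-ins,
not the kernel's NETIF_F_LRO or mlx5e_fix_features().

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bit standing in for NETIF_F_LRO. */
#define FEATURE_LRO	(1u << 0)

/*
 * Drop a requested feature that the current RQ mode cannot support and
 * tell the caller whether a warning should be printed. The warning fires
 * only when the feature was actually requested.
 */
static uint32_t fix_features(uint32_t requested, bool striding_rq, bool *warn)
{
	*warn = false;

	if (!striding_rq && (requested & FEATURE_LRO)) {
		requested &= ~FEATURE_LRO;
		*warn = true;
	}

	return requested;
}
```

With this shape the warning is tied to the request rather than to the
currently configured state, which is why the "Disabling LRO" wording reads
oddly for an enable request.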

* Re: [PATCH 1/2] net: dsa: mv88e6xxx: add support to setup led-control register through device-tree
From: Hubert Feurstein @ 2019-07-30 15:50 UTC (permalink / raw)
  To: Andrew Lunn
  Cc: netdev, linux-kernel, Vivien Didelot, Florian Fainelli,
	David S. Miller
In-Reply-To: <20190730140930.GM28552@lunn.ch>

Am Di., 30. Juli 2019 um 16:09 Uhr schrieb Andrew Lunn <andrew@lunn.ch>:
[...]
> Sorry, but this is not going to be accepted. There is an ongoing
> discussion about PHY LEDs and how they should be configured. Switch
> LEDs are no different from PHY LEDs. So they should use the same basic
> concept.
>
> Please take a look at the discussion around:
>
> [RFC] dt-bindings: net: phy: Add subnode for LED configuration
>
> Marvell designers have made this more difficult than it should be by
> moving the registers out of the PHY address space and into the switch
> address space. So we are going to have to implement this code twice
> :-(
Ok, good to know. I'll wait for the first implementation and take it
as a reference.

Hubert

