Netdev List

Netdev List
 help / color / mirror / Atom feed

* [net-next V11 PATCH 14/17] mlx5: use page_pool for xdp_return_frame call
From: Jesper Dangaard Brouer @ 2018-04-17 14:46 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152397622657.20272.10121948713784224943.stgit@firesoul>

This patch shows how it is possible to have both the driver local page
cache, which uses elevated refcnt for "catching"/avoiding SKB
put_page returns the page through the page allocator.  And at the
same time, have pages getting returned to the page_pool from
ndp_xdp_xmit DMA completion.

The performance improvement for XDP_REDIRECT in this patch is really
good.  Especially considering that (currently) the xdp_return_frame
API and page_pool_put_page() does per frame operations of both
rhashtable ID-lookup and locked return into (page_pool) ptr_ring.
(It is the plan to remove these per frame operation in a followup
patchset).

The benchmark performed was RX on mlx5 and XDP_REDIRECT out ixgbe,
with xdp_redirect_map (using devmap) . And the target/maximum
capability of ixgbe is 13Mpps (on this HW setup).

Before this patch for mlx5, XDP redirected frames were returned via
the page allocator.  The single flow performance was 6Mpps, and if I
started two flows the collective performance drop to 4Mpps, because we
hit the page allocator lock (further negative scaling occurs).

Two test scenarios need to be covered, for xdp_return_frame API, which
is DMA-TX completion running on same-CPU or cross-CPU free/return.
Results were same-CPU=10Mpps, and cross-CPU=12Mpps.  This is very
close to our 13Mpps max target.

The reason max target isn't reached in cross-CPU test, is likely due
to RX-ring DMA unmap/map overhead (which doesn't occur in ixgbe to
ixgbe testing).  It is also planned to remove this unnecessary DMA
unmap in a later patchset

V2: Adjustments requested by Tariq
 - Changed page_pool_create return codes not return NULL, only
   ERR_PTR, as this simplifies err handling in drivers.
 - Save a branch in mlx5e_page_release
 - Correct page_pool size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V5: Updated patch desc

V8: Adjust for b0cedc844c00 ("net/mlx5e: Remove rq_headroom field from params")
V9:
 - Adjust for 121e89275471 ("net/mlx5e: Refactor RQ XDP_TX indication")
 - Adjust for 73281b78a37a ("net/mlx5e: Derive Striding RQ size from MTU")
 - Correct handling if page_pool_create fail for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

V10: Req from Tariq
 - Change pool_size calc for MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
Reviewed-by: Tariq Toukan <tariqt@mellanox.com>
Acked-by: Saeed Mahameed <saeedm@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlx5/core/en.h      |    3 ++
 drivers/net/ethernet/mellanox/mlx5/core/en_main.c |   41 +++++++++++++++++----
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c   |   16 ++++++--
 3 files changed, 48 insertions(+), 12 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 1a05d1072c5e..3317a4da87cb 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -53,6 +53,8 @@
 #include "mlx5_core.h"
 #include "en_stats.h"
 
+struct page_pool;
+
 #define MLX5_SET_CFG(p, f, v) MLX5_SET(create_flow_group_in, p, f, v)
 
 #define MLX5E_ETH_HARD_MTU (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN)
@@ -534,6 +536,7 @@ struct mlx5e_rq {
 	unsigned int           hw_mtu;
 	struct mlx5e_xdpsq     xdpsq;
 	DECLARE_BITMAP(flags, 8);
+	struct page_pool      *page_pool;
 
 	/* control */
 	struct mlx5_wq_ctrl    wq_ctrl;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2dca0933dfd3..f10037419978 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -35,6 +35,7 @@
 #include <linux/mlx5/fs.h>
 #include <net/vxlan.h>
 #include <linux/bpf.h>
+#include <net/page_pool.h>
 #include "eswitch.h"
 #include "en.h"
 #include "en_tc.h"
@@ -389,10 +390,11 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 			  struct mlx5e_rq_param *rqp,
 			  struct mlx5e_rq *rq)
 {
+	struct page_pool_params pp_params = { 0 };
 	struct mlx5_core_dev *mdev = c->mdev;
 	void *rqc = rqp->rqc;
 	void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
-	u32 byte_count;
+	u32 byte_count, pool_size;
 	int npages;
 	int wq_sz;
 	int err;
@@ -432,9 +434,12 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 
 	rq->buff.map_dir = rq->xdp_prog ? DMA_BIDIRECTIONAL : DMA_FROM_DEVICE;
 	rq->buff.headroom = mlx5e_get_rq_headroom(mdev, params);
+	pool_size = 1 << params->log_rq_mtu_frames;
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+
+		pool_size = MLX5_MPWRQ_PAGES_PER_WQE << mlx5e_mpwqe_get_log_rq_size(params);
 		rq->post_wqes = mlx5e_post_rx_mpwqes;
 		rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
 
@@ -512,13 +517,31 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 		rq->mkey_be = c->mkey_be;
 	}
 
-	/* This must only be activate for order-0 pages */
-	if (rq->xdp_prog) {
-		err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
-						 MEM_TYPE_PAGE_ORDER0, NULL);
-		if (err)
-			goto err_rq_wq_destroy;
+	/* Create a page_pool and register it with rxq */
+	pp_params.order     = rq->buff.page_order;
+	pp_params.flags     = 0; /* No-internal DMA mapping in page_pool */
+	pp_params.pool_size = pool_size;
+	pp_params.nid       = cpu_to_node(c->cpu);
+	pp_params.dev       = c->pdev;
+	pp_params.dma_dir   = rq->buff.map_dir;
+
+	/* page_pool can be used even when there is no rq->xdp_prog,
+	 * given page_pool does not handle DMA mapping there is no
+	 * required state to clear. And page_pool gracefully handle
+	 * elevated refcnt.
+	 */
+	rq->page_pool = page_pool_create(&pp_params);
+	if (IS_ERR(rq->page_pool)) {
+		if (rq->wq_type != MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
+			kfree(rq->wqe.frag_info);
+		err = PTR_ERR(rq->page_pool);
+		rq->page_pool = NULL;
+		goto err_rq_wq_destroy;
 	}
+	err = xdp_rxq_info_reg_mem_model(&rq->xdp_rxq,
+					 MEM_TYPE_PAGE_POOL, rq->page_pool);
+	if (err)
+		goto err_rq_wq_destroy;
 
 	for (i = 0; i < wq_sz; i++) {
 		struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
@@ -556,6 +579,8 @@ static int mlx5e_alloc_rq(struct mlx5e_channel *c,
 	if (rq->xdp_prog)
 		bpf_prog_put(rq->xdp_prog);
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy(rq->page_pool);
 	mlx5_wq_destroy(&rq->wq_ctrl);
 
 	return err;
@@ -569,6 +594,8 @@ static void mlx5e_free_rq(struct mlx5e_rq *rq)
 		bpf_prog_put(rq->xdp_prog);
 
 	xdp_rxq_info_unreg(&rq->xdp_rxq);
+	if (rq->page_pool)
+		page_pool_destroy(rq->page_pool);
 
 	switch (rq->wq_type) {
 	case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 0e24be05907f..f42436d7f2d9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -37,6 +37,7 @@
 #include <linux/bpf_trace.h>
 #include <net/busy_poll.h>
 #include <net/ip6_checksum.h>
+#include <net/page_pool.h>
 #include "en.h"
 #include "en_tc.h"
 #include "eswitch.h"
@@ -221,7 +222,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
 	if (mlx5e_rx_cache_get(rq, dma_info))
 		return 0;
 
-	dma_info->page = dev_alloc_pages(rq->buff.page_order);
+	dma_info->page = page_pool_dev_alloc_pages(rq->page_pool);
 	if (unlikely(!dma_info->page))
 		return -ENOMEM;
 
@@ -246,11 +247,16 @@ static void mlx5e_page_dma_unmap(struct mlx5e_rq *rq,
 void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
 			bool recycle)
 {
-	if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
-		return;
+	if (likely(recycle)) {
+		if (mlx5e_rx_cache_put(rq, dma_info))
+			return;
 
-	mlx5e_page_dma_unmap(rq, dma_info);
-	put_page(dma_info->page);
+		mlx5e_page_dma_unmap(rq, dma_info);
+		page_pool_recycle_direct(rq->page_pool, dma_info->page);
+	} else {
+		mlx5e_page_dma_unmap(rq, dma_info);
+		put_page(dma_info->page);
+	}
 }
 
 static inline bool mlx5e_page_reuse(struct mlx5e_rq *rq,

^ permalink raw reply related

* [net-next V11 PATCH 15/17] xdp: transition into using xdp_frame for return API
From: Jesper Dangaard Brouer @ 2018-04-17 14:46 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152397622657.20272.10121948713784224943.stgit@firesoul>

Changing API xdp_return_frame() to take struct xdp_frame as argument,
seems like a natural choice. But there are some subtle performance
details here that needs extra care, which is a deliberate choice.

When de-referencing xdp_frame on a remote CPU during DMA-TX
completion, result in the cache-line is change to "Shared"
state. Later when the page is reused for RX, then this xdp_frame
cache-line is written, which change the state to "Modified".

This situation already happens (naturally) for, virtio_net, tun and
cpumap as the xdp_frame pointer is the queued object.  In tun and
cpumap, the ptr_ring is used for efficiently transferring cache-lines
(with pointers) between CPUs. Thus, the only option is to
de-referencing xdp_frame.

It is only the ixgbe driver that had an optimization, in which it can
avoid doing the de-reference of xdp_frame.  The driver already have
TX-ring queue, which (in case of remote DMA-TX completion) have to be
transferred between CPUs anyhow.  In this data area, we stored a
struct xdp_mem_info and a data pointer, which allowed us to avoid
de-referencing xdp_frame.

To compensate for this, a prefetchw is used for telling the cache
coherency protocol about our access pattern.  My benchmarks show that
this prefetchw is enough to compensate the ixgbe driver.

V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")
V8: Adjust for commit bd658dda4237 ("net/mlx5e: Separate dma base address
and offset in dma_sync call")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c     |    5 ++---
 drivers/net/ethernet/intel/ixgbe/ixgbe.h        |    4 +---
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c   |   17 +++++++++++------
 drivers/net/ethernet/mellanox/mlx5/core/en_rx.c |    1 +
 drivers/net/tun.c                               |    4 ++--
 drivers/net/virtio_net.c                        |    2 +-
 include/net/xdp.h                               |    2 +-
 kernel/bpf/cpumap.c                             |    6 +++---
 net/core/xdp.c                                  |    4 +++-
 9 files changed, 25 insertions(+), 20 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index 96c54cbfb1f9..c8bf4d35fdea 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -638,8 +638,7 @@ static void i40e_unmap_and_free_tx_resource(struct i40e_ring *ring,
 		if (tx_buffer->tx_flags & I40E_TX_FLAGS_FD_SB)
 			kfree(tx_buffer->raw_buf);
 		else if (ring_is_xdp(ring))
-			xdp_return_frame(tx_buffer->xdpf->data,
-					 &tx_buffer->xdpf->mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 		if (dma_unmap_len(tx_buffer, len))
@@ -842,7 +841,7 @@ static bool i40e_clean_tx_irq(struct i40e_vsi *vsi,
 
 		/* free the skb/XDP data */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buf->xdpf->data, &tx_buf->xdpf->mem);
+			xdp_return_frame(tx_buf->xdpf);
 		else
 			napi_consume_skb(tx_buf->skb, napi_budget);
 
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe.h b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
index abb5248e917e..7dd5038cfcc4 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe.h
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe.h
@@ -241,8 +241,7 @@ struct ixgbe_tx_buffer {
 	unsigned long time_stamp;
 	union {
 		struct sk_buff *skb;
-		/* XDP uses address ptr on irq_clean */
-		void *data;
+		struct xdp_frame *xdpf;
 	};
 	unsigned int bytecount;
 	unsigned short gso_segs;
@@ -250,7 +249,6 @@ struct ixgbe_tx_buffer {
 	DEFINE_DMA_UNMAP_ADDR(dma);
 	DEFINE_DMA_UNMAP_LEN(len);
 	u32 tx_flags;
-	struct xdp_mem_info xdp_mem;
 };
 
 struct ixgbe_rx_buffer {
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index f10904ec2172..4f2864165723 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -1216,7 +1216,7 @@ static bool ixgbe_clean_tx_irq(struct ixgbe_q_vector *q_vector,
 
 		/* free the skb */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			napi_consume_skb(tx_buffer->skb, napi_budget);
 
@@ -2386,6 +2386,7 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
+			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -5797,7 +5798,7 @@ static void ixgbe_clean_tx_ring(struct ixgbe_ring *tx_ring)
 
 		/* Free all the Tx ring sk_buffs */
 		if (ring_is_xdp(tx_ring))
-			xdp_return_frame(tx_buffer->data, &tx_buffer->xdp_mem);
+			xdp_return_frame(tx_buffer->xdpf);
 		else
 			dev_kfree_skb_any(tx_buffer->skb);
 
@@ -8348,16 +8349,21 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
+	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	len = xdp->data_end - xdp->data;
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
 		return IXGBE_XDP_CONSUMED;
 
-	dma = dma_map_single(ring->dev, xdp->data, len, DMA_TO_DEVICE);
+	dma = dma_map_single(ring->dev, xdpf->data, len, DMA_TO_DEVICE);
 	if (dma_mapping_error(ring->dev, dma))
 		return IXGBE_XDP_CONSUMED;
 
@@ -8372,8 +8378,7 @@ static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
 
 	dma_unmap_len_set(tx_buffer, len, len);
 	dma_unmap_addr_set(tx_buffer, dma, dma);
-	tx_buffer->data = xdp->data;
-	tx_buffer->xdp_mem = xdp->rxq->mem;
+	tx_buffer->xdpf = xdpf;
 
 	tx_desc->read.buffer_addr = cpu_to_le64(dma);
 
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index f42436d7f2d9..7bbf0db27a01 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -890,6 +890,7 @@ struct sk_buff *skb_from_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
 
 	dma_sync_single_range_for_cpu(rq->pdev, di->addr, wi->offset,
 				      frag_size, DMA_FROM_DEVICE);
+	prefetchw(va); /* xdp_frame data area */
 	prefetch(data);
 	wi->offset += frag_size;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index 283bde85c455..bec130cdbd9d 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -663,7 +663,7 @@ void tun_ptr_free(void *ptr)
 	if (tun_is_xdp_frame(ptr)) {
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		__skb_array_destroy_skb(ptr);
 	}
@@ -2196,7 +2196,7 @@ static ssize_t tun_do_read(struct tun_struct *tun, struct tun_file *tfile,
 		struct xdp_frame *xdpf = tun_ptr_to_xdp(ptr);
 
 		ret = tun_put_user_xdp(tun, tfile, xdpf, to);
-		xdp_return_frame(xdpf->data, &xdpf->mem);
+		xdp_return_frame(xdpf);
 	} else {
 		struct sk_buff *skb = ptr;
 
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index 42d338fe9a8d..ab3d7cbc4c49 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -430,7 +430,7 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 
 	/* Free up any pending old buffers before queueing new ones. */
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
-		xdp_return_frame(xdpf_sent->data, &xdpf_sent->mem);
+		xdp_return_frame(xdpf_sent);
 
 	xdpf = convert_to_xdp_frame(xdp);
 	if (unlikely(!xdpf))
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d0ee437753dc..137ad5f9f40f 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -103,7 +103,7 @@ struct xdp_frame *convert_to_xdp_frame(struct xdp_buff *xdp)
 	return xdp_frame;
 }
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem);
+void xdp_return_frame(struct xdp_frame *xdpf);
 
 int xdp_rxq_info_reg(struct xdp_rxq_info *xdp_rxq,
 		     struct net_device *dev, u32 queue_index);
diff --git a/kernel/bpf/cpumap.c b/kernel/bpf/cpumap.c
index bcdc4dea5ce7..c95b04ec103e 100644
--- a/kernel/bpf/cpumap.c
+++ b/kernel/bpf/cpumap.c
@@ -219,7 +219,7 @@ static void __cpu_map_ring_cleanup(struct ptr_ring *ring)
 
 	while ((xdpf = ptr_ring_consume(ring)))
 		if (WARN_ON_ONCE(xdpf))
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 }
 
 static void put_cpu_map_entry(struct bpf_cpu_map_entry *rcpu)
@@ -275,7 +275,7 @@ static int cpu_map_kthread_run(void *data)
 
 			skb = cpu_map_build_skb(rcpu, xdpf);
 			if (!skb) {
-				xdp_return_frame(xdpf->data, &xdpf->mem);
+				xdp_return_frame(xdpf);
 				continue;
 			}
 
@@ -578,7 +578,7 @@ static int bq_flush_to_queue(struct bpf_cpu_map_entry *rcpu,
 		err = __ptr_ring_produce(q, xdpf);
 		if (err) {
 			drops++;
-			xdp_return_frame(xdpf->data, &xdpf->mem);
+			xdp_return_frame(xdpf);
 		}
 		processed++;
 	}
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 33e382afbd95..0c86b53a3a63 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -308,9 +308,11 @@ int xdp_rxq_info_reg_mem_model(struct xdp_rxq_info *xdp_rxq,
 }
 EXPORT_SYMBOL_GPL(xdp_rxq_info_reg_mem_model);
 
-void xdp_return_frame(void *data, struct xdp_mem_info *mem)
+void xdp_return_frame(struct xdp_frame *xdpf)
 {
+	struct xdp_mem_info *mem = &xdpf->mem;
 	struct xdp_mem_allocator *xa;
+	void *data = xdpf->data;
 	struct page *page;
 
 	switch (mem->type) {

^ permalink raw reply related

* [net-next V11 PATCH 16/17] xdp: transition into using xdp_frame for ndo_xdp_xmit
From: Jesper Dangaard Brouer @ 2018-04-17 14:46 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152397622657.20272.10121948713784224943.stgit@firesoul>

Changing API ndo_xdp_xmit to take a struct xdp_frame instead of struct
xdp_buff.  This brings xdp_return_frame and ndp_xdp_xmit in sync.

This builds towards changing the API further to become a bulk API,
because xdp_buff is not a queue-able object while xdp_frame is.

V4: Adjust for commit 59655a5b6c83 ("tuntap: XDP_TX can use native XDP")
V7: Adjust for commit d9314c474d4f ("i40e: add support for XDP_REDIRECT")

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 drivers/net/ethernet/intel/i40e/i40e_txrx.c   |   30 ++++++++++++++-----------
 drivers/net/ethernet/intel/i40e/i40e_txrx.h   |    2 +-
 drivers/net/ethernet/intel/ixgbe/ixgbe_main.c |   24 +++++++++++---------
 drivers/net/tun.c                             |   19 ++++++++++------
 drivers/net/virtio_net.c                      |   24 ++++++++++++--------
 include/linux/netdevice.h                     |    4 ++-
 net/core/filter.c                             |   17 +++++++++++++-
 7 files changed, 74 insertions(+), 46 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.c b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
index c8bf4d35fdea..87fb27ab9c24 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.c
@@ -2203,9 +2203,20 @@ static bool i40e_is_non_eop(struct i40e_ring *rx_ring,
 #define I40E_XDP_CONSUMED 1
 #define I40E_XDP_TX 2
 
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring);
 
+static int i40e_xmit_xdp_tx_ring(struct xdp_buff *xdp,
+				 struct i40e_ring *xdp_ring)
+{
+	struct xdp_frame *xdpf = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!xdpf))
+		return I40E_XDP_CONSUMED;
+
+	return i40e_xmit_xdp_ring(xdpf, xdp_ring);
+}
+
 /**
  * i40e_run_xdp - run an XDP program
  * @rx_ring: Rx ring being processed
@@ -2233,7 +2244,7 @@ static struct sk_buff *i40e_run_xdp(struct i40e_ring *rx_ring,
 		break;
 	case XDP_TX:
 		xdp_ring = rx_ring->vsi->xdp_rings[rx_ring->queue_index];
-		result = i40e_xmit_xdp_ring(xdp, xdp_ring);
+		result = i40e_xmit_xdp_tx_ring(xdp, xdp_ring);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(rx_ring->netdev, xdp, xdp_prog);
@@ -3480,21 +3491,14 @@ static inline int i40e_tx_map(struct i40e_ring *tx_ring, struct sk_buff *skb,
  * @xdp: data to transmit
  * @xdp_ring: XDP Tx ring
  **/
-static int i40e_xmit_xdp_ring(struct xdp_buff *xdp,
+static int i40e_xmit_xdp_ring(struct xdp_frame *xdpf,
 			      struct i40e_ring *xdp_ring)
 {
 	u16 i = xdp_ring->next_to_use;
 	struct i40e_tx_buffer *tx_bi;
 	struct i40e_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
+	u32 size = xdpf->len;
 	dma_addr_t dma;
-	u32 size;
-
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return I40E_XDP_CONSUMED;
-
-	size = xdpf->len;
 
 	if (!unlikely(I40E_DESC_UNUSED(xdp_ring))) {
 		xdp_ring->tx_stats.tx_busy++;
@@ -3684,7 +3688,7 @@ netdev_tx_t i40e_lan_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
  *
  * Returns Zero if sent, else an error code
  **/
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct i40e_netdev_priv *np = netdev_priv(dev);
 	unsigned int queue_index = smp_processor_id();
@@ -3697,7 +3701,7 @@ int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!i40e_enabled_xdp_vsi(vsi) || queue_index >= vsi->num_queue_pairs)
 		return -ENXIO;
 
-	err = i40e_xmit_xdp_ring(xdp, vsi->xdp_rings[queue_index]);
+	err = i40e_xmit_xdp_ring(xdpf, vsi->xdp_rings[queue_index]);
 	if (err != I40E_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/ethernet/intel/i40e/i40e_txrx.h b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
index 857b1d743c8d..4bf318b8be85 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_txrx.h
+++ b/drivers/net/ethernet/intel/i40e/i40e_txrx.h
@@ -511,7 +511,7 @@ u32 i40e_get_tx_pending(struct i40e_ring *ring, bool in_sw);
 void i40e_detect_recover_hung(struct i40e_vsi *vsi);
 int __i40e_maybe_stop_tx(struct i40e_ring *tx_ring, int size);
 bool __i40e_chk_linearize(struct sk_buff *skb);
-int i40e_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp);
+int i40e_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf);
 void i40e_xdp_flush(struct net_device *dev);
 
 /**
diff --git a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
index 4f2864165723..51e7d82a5860 100644
--- a/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
+++ b/drivers/net/ethernet/intel/ixgbe/ixgbe_main.c
@@ -2262,7 +2262,7 @@ static struct sk_buff *ixgbe_build_skb(struct ixgbe_ring *rx_ring,
 #define IXGBE_XDP_TX 2
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp);
+			       struct xdp_frame *xdpf);
 
 static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 				     struct ixgbe_ring *rx_ring,
@@ -2270,6 +2270,7 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 {
 	int err, result = IXGBE_XDP_PASS;
 	struct bpf_prog *xdp_prog;
+	struct xdp_frame *xdpf;
 	u32 act;
 
 	rcu_read_lock();
@@ -2278,12 +2279,19 @@ static struct sk_buff *ixgbe_run_xdp(struct ixgbe_adapter *adapter,
 	if (!xdp_prog)
 		goto xdp_out;
 
+	prefetchw(xdp->data_hard_start); /* xdp_frame write */
+
 	act = bpf_prog_run_xdp(xdp_prog, xdp);
 	switch (act) {
 	case XDP_PASS:
 		break;
 	case XDP_TX:
-		result = ixgbe_xmit_xdp_ring(adapter, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf)) {
+			result = IXGBE_XDP_CONSUMED;
+			break;
+		}
+		result = ixgbe_xmit_xdp_ring(adapter, xdpf);
 		break;
 	case XDP_REDIRECT:
 		err = xdp_do_redirect(adapter->netdev, xdp, xdp_prog);
@@ -2386,7 +2394,6 @@ static int ixgbe_clean_rx_irq(struct ixgbe_q_vector *q_vector,
 			xdp.data_hard_start = xdp.data -
 					      ixgbe_rx_offset(rx_ring);
 			xdp.data_end = xdp.data + size;
-			prefetchw(xdp.data_hard_start); /* xdp_frame write */
 
 			skb = ixgbe_run_xdp(adapter, rx_ring, &xdp);
 		}
@@ -8344,20 +8351,15 @@ static u16 ixgbe_select_queue(struct net_device *dev, struct sk_buff *skb,
 }
 
 static int ixgbe_xmit_xdp_ring(struct ixgbe_adapter *adapter,
-			       struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct ixgbe_ring *ring = adapter->xdp_ring[smp_processor_id()];
 	struct ixgbe_tx_buffer *tx_buffer;
 	union ixgbe_adv_tx_desc *tx_desc;
-	struct xdp_frame *xdpf;
 	u32 len, cmd_type;
 	dma_addr_t dma;
 	u16 i;
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	len = xdpf->len;
 
 	if (unlikely(!ixgbe_desc_unused(ring)))
@@ -10010,7 +10012,7 @@ static int ixgbe_xdp(struct net_device *dev, struct netdev_bpf *xdp)
 	}
 }
 
-static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct ixgbe_adapter *adapter = netdev_priv(dev);
 	struct ixgbe_ring *ring;
@@ -10026,7 +10028,7 @@ static int ixgbe_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (unlikely(!ring))
 		return -ENXIO;
 
-	err = ixgbe_xmit_xdp_ring(adapter, xdp);
+	err = ixgbe_xmit_xdp_ring(adapter, xdpf);
 	if (err != IXGBE_XDP_TX)
 		return -ENOSPC;
 
diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bec130cdbd9d..1e58be152d5c 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1301,18 +1301,13 @@ static const struct net_device_ops tun_netdev_ops = {
 	.ndo_get_stats64	= tun_net_get_stats64,
 };
 
-static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int tun_xdp_xmit(struct net_device *dev, struct xdp_frame *frame)
 {
 	struct tun_struct *tun = netdev_priv(dev);
-	struct xdp_frame *frame;
 	struct tun_file *tfile;
 	u32 numqueues;
 	int ret = 0;
 
-	frame = convert_to_xdp_frame(xdp);
-	if (unlikely(!frame))
-		return -EOVERFLOW;
-
 	rcu_read_lock();
 
 	numqueues = READ_ONCE(tun->numqueues);
@@ -1336,6 +1331,16 @@ static int tun_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	return ret;
 }
 
+static int tun_xdp_tx(struct net_device *dev, struct xdp_buff *xdp)
+{
+	struct xdp_frame *frame = convert_to_xdp_frame(xdp);
+
+	if (unlikely(!frame))
+		return -EOVERFLOW;
+
+	return tun_xdp_xmit(dev, frame);
+}
+
 static void tun_xdp_flush(struct net_device *dev)
 {
 	struct tun_struct *tun = netdev_priv(dev);
@@ -1683,7 +1688,7 @@ static struct sk_buff *tun_build_skb(struct tun_struct *tun,
 		case XDP_TX:
 			get_page(alloc_frag->page);
 			alloc_frag->offset += buflen;
-			if (tun_xdp_xmit(tun->dev, &xdp))
+			if (tun_xdp_tx(tun->dev, &xdp))
 				goto err_redirect;
 			tun_xdp_flush(tun->dev);
 			rcu_read_unlock();
diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
index ab3d7cbc4c49..01694e26f03e 100644
--- a/drivers/net/virtio_net.c
+++ b/drivers/net/virtio_net.c
@@ -416,10 +416,10 @@ static void virtnet_xdp_flush(struct net_device *dev)
 }
 
 static int __virtnet_xdp_xmit(struct virtnet_info *vi,
-			      struct xdp_buff *xdp)
+			       struct xdp_frame *xdpf)
 {
 	struct virtio_net_hdr_mrg_rxbuf *hdr;
-	struct xdp_frame *xdpf, *xdpf_sent;
+	struct xdp_frame *xdpf_sent;
 	struct send_queue *sq;
 	unsigned int len;
 	unsigned int qp;
@@ -432,10 +432,6 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 	while ((xdpf_sent = virtqueue_get_buf(sq->vq, &len)) != NULL)
 		xdp_return_frame(xdpf_sent);
 
-	xdpf = convert_to_xdp_frame(xdp);
-	if (unlikely(!xdpf))
-		return -EOVERFLOW;
-
 	/* virtqueue want to use data area in-front of packet */
 	if (unlikely(xdpf->metasize > 0))
 		return -EOPNOTSUPP;
@@ -459,7 +455,7 @@ static int __virtnet_xdp_xmit(struct virtnet_info *vi,
 	return 0;
 }
 
-static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
+static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_frame *xdpf)
 {
 	struct virtnet_info *vi = netdev_priv(dev);
 	struct receive_queue *rq = vi->rq;
@@ -472,7 +468,7 @@ static int virtnet_xdp_xmit(struct net_device *dev, struct xdp_buff *xdp)
 	if (!xdp_prog)
 		return -ENXIO;
 
-	return __virtnet_xdp_xmit(vi, xdp);
+	return __virtnet_xdp_xmit(vi, xdpf);
 }
 
 static unsigned int virtnet_get_headroom(struct virtnet_info *vi)
@@ -569,6 +565,7 @@ static struct sk_buff *receive_small(struct net_device *dev,
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
 		struct virtio_net_hdr_mrg_rxbuf *hdr = buf + header_offset;
+		struct xdp_frame *xdpf;
 		struct xdp_buff xdp;
 		void *orig_data;
 		u32 act;
@@ -611,7 +608,10 @@ static struct sk_buff *receive_small(struct net_device *dev,
 			delta = orig_data - xdp.data;
 			break;
 		case XDP_TX:
-			err = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			err = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(err)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				goto err_xdp;
@@ -702,6 +702,7 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 	rcu_read_lock();
 	xdp_prog = rcu_dereference(rq->xdp_prog);
 	if (xdp_prog) {
+		struct xdp_frame *xdpf;
 		struct page *xdp_page;
 		struct xdp_buff xdp;
 		void *data;
@@ -766,7 +767,10 @@ static struct sk_buff *receive_mergeable(struct net_device *dev,
 			}
 			break;
 		case XDP_TX:
-			err = __virtnet_xdp_xmit(vi, &xdp);
+			xdpf = convert_to_xdp_frame(&xdp);
+			if (unlikely(!xdpf))
+				goto err_xdp;
+			err = __virtnet_xdp_xmit(vi, xdpf);
 			if (unlikely(err)) {
 				trace_xdp_exception(vi->dev, xdp_prog, act);
 				if (unlikely(xdp_page != page))
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index cf44503ea81a..14e0777ffcfb 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1165,7 +1165,7 @@ struct dev_ifalias {
  *	This function is used to set or query state related to XDP on the
  *	netdevice and manage BPF offload. See definition of
  *	enum bpf_netdev_command for details.
- * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_buff *xdp);
+ * int (*ndo_xdp_xmit)(struct net_device *dev, struct xdp_frame *xdp);
  *	This function is used to submit a XDP packet for transmit on a
  *	netdevice.
  * void (*ndo_xdp_flush)(struct net_device *dev);
@@ -1356,7 +1356,7 @@ struct net_device_ops {
 	int			(*ndo_bpf)(struct net_device *dev,
 					   struct netdev_bpf *bpf);
 	int			(*ndo_xdp_xmit)(struct net_device *dev,
-						struct xdp_buff *xdp);
+						struct xdp_frame *xdp);
 	void			(*ndo_xdp_flush)(struct net_device *dev);
 };
 
diff --git a/net/core/filter.c b/net/core/filter.c
index d31aff93270d..3bb0cb98a9be 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2749,13 +2749,18 @@ static int __bpf_tx_xdp(struct net_device *dev,
 			struct xdp_buff *xdp,
 			u32 index)
 {
+	struct xdp_frame *xdpf;
 	int err;
 
 	if (!dev->netdev_ops->ndo_xdp_xmit) {
 		return -EOPNOTSUPP;
 	}
 
-	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+	xdpf = convert_to_xdp_frame(xdp);
+	if (unlikely(!xdpf))
+		return -EOVERFLOW;
+
+	err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 	if (err)
 		return err;
 	dev->netdev_ops->ndo_xdp_flush(dev);
@@ -2771,11 +2776,19 @@ static int __bpf_tx_xdp_map(struct net_device *dev_rx, void *fwd,
 
 	if (map->map_type == BPF_MAP_TYPE_DEVMAP) {
 		struct net_device *dev = fwd;
+		struct xdp_frame *xdpf;
 
 		if (!dev->netdev_ops->ndo_xdp_xmit)
 			return -EOPNOTSUPP;
 
-		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdp);
+		xdpf = convert_to_xdp_frame(xdp);
+		if (unlikely(!xdpf))
+			return -EOVERFLOW;
+
+		/* TODO: move to inside map code instead, for bulk support
+		 * err = dev_map_enqueue(dev, xdp);
+		 */
+		err = dev->netdev_ops->ndo_xdp_xmit(dev, xdpf);
 		if (err)
 			return err;
 		__dev_map_insert_ctx(map, index);

^ permalink raw reply related

* [net-next V11 PATCH 17/17] xdp: avoid leaking info stored in frame data on page reuse
From: Jesper Dangaard Brouer @ 2018-04-17 14:46 UTC (permalink / raw)
  To: netdev, BjörnTöpel, magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Jesper Dangaard Brouer, Daniel Borkmann,
	Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152397622657.20272.10121948713784224943.stgit@firesoul>

The bpf infrastructure and verifier goes to great length to avoid
bpf progs leaking kernel (pointer) info.

For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
Checks are in place to assure enough headroom is available for this.

This info is not cleared, and if the frame is reused, then a
malicious user could use bpf_xdp_adjust_head helper to move
xdp->data into this area.  Thus, making this area readable.

This is not super critical as XDP progs requires root or
CAP_SYS_ADMIN, which are privileged enough for such info.  An
effort (is underway) towards moving networking bpf hooks to the
lesser privileged mode CAP_NET_ADMIN, where leaking such info
should be avoided.  Thus, this patch to clear the info when
needed.

Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---
 net/core/filter.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/net/core/filter.c b/net/core/filter.c
index 3bb0cb98a9be..a374b8560bc4 100644
--- a/net/core/filter.c
+++ b/net/core/filter.c
@@ -2692,6 +2692,7 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)

 BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 {
+	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
 	unsigned long metalen = xdp_get_metalen(xdp);
 	void *data_start = xdp->data_hard_start + metalen;
 	void *data = xdp->data + offset;
@@ -2700,6 +2701,13 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
 		     data > xdp->data_end - ETH_HLEN))
 		return -EINVAL;

+	/* Avoid info leak, when reusing area prev used by xdp_frame */
+	if (data < xdp_frame_end) {
+		unsigned long clearlen = xdp_frame_end - data;
+
+		memset(data, 0, clearlen);
+	}
+
 	if (metalen)
 		memmove(xdp->data_meta + offset,
 			xdp->data_meta, metalen);

^ permalink raw reply related

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-17 14:47 UTC (permalink / raw)
  To: Or Gerlitz
  Cc: Andy Gospodarek, Samudrala, Sridhar, David Miller,
	Anjali Singhai Jain, Michael Chan, Simon Horman, Jakub Kicinski,
	John Fastabend, Saeed Mahameed, Jiri Pirko, Rony Efraim,
	Linux Netdev List
In-Reply-To: <CAJ3xEMgsLTqU2WNBYNBPGuON-z7fxwHcBiooN74Fgk87g0L18A@mail.gmail.com>

On Tue, Apr 17, 2018 at 04:58:05PM +0300, Or Gerlitz wrote:
> On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek
> <andrew.gospodarek@broadcom.com> wrote:
> > On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
> >>
> >> On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
> >> > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
> >> > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
> >> > > <sridhar.samudrala@intel.com> wrote:
> >> > >
> >> > > > I meant between PFs on 2 compute nodes.
> >> > > If the PF serves as uplink rep, it functions as  a switch port -- applications
> >> > > don't run on switch ports. One way to get apps to run on the host in switchdev
> >> > > mode is probe one of the VFs there.
> >> > >
> >> > >
> >> > >
> >> So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
> >> seen on the host, no more PF netdev.
> >
> > That is not the functionality I would propose.  The PF netdev will still be there.
> 
> Andy,
> 
> Basically LGTM, so even in smartnic configs, the PF @ the host is
> still privileged to
> create/destroy VFs or provision MACs for them even if it is not the
> e-switch manager
> anymore?

Yes, in a SmartNIC world one config we aim to have is that a host can create
and destroy VFs as needed.  One of the challenges is how the VF reps are
managed by applications in the SmartNIC when the host could make them
disappear.  

> Actually AFAIK this  can also work somehow otherwise, e.g a smartnic FW
> "pushes" the VFs into the host w.o them being under a host admin directive.

The model to 'push' VFs to a host is also another option, but I do not
like it as much.  My general preference is to allow the host to use a
SmartNIC as if it was any other standard NIC (we have been using the
word 'Performance NIC' to desribe what we would call a standard NIC, but
the name is not terribly important).

There is also a school of thought that the VF reps could be
pre-allocated on the SmartNIC so that any application processing that
traffic would sit idle when no traffic arrives on the rep, but could
process frames that do arrive when the VFs were created on the host.
This implementation will depend on how resources are allocated on a
given bit of hardware, but can really work well.

^ permalink raw reply

* Re: [net-next V11 PATCH 00/17] XDP redirect memory return API
From: Alexei Starovoitov @ 2018-04-17 14:48 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: netdev, BjörnTöpel, magnus.karlsson, eugenia,
	Jason Wang, John Fastabend, Eran Ben Elisha, Saeed Mahameed, galp,
	Daniel Borkmann, Tariq Toukan
In-Reply-To: <152397622657.20272.10121948713784224943.stgit@firesoul>

On Tue, Apr 17, 2018 at 04:45:16PM +0200, Jesper Dangaard Brouer wrote:
> Submitted against net-next, as it contains NIC driver changes.
> 
> This patchset works towards supporting different XDP RX-ring memory
> allocators.  As this will be needed by the AF_XDP zero-copy mode.
> 
> The patchset uses mlx5 as the sample driver, which gets implemented
> XDP_REDIRECT RX-mode, but not ndo_xdp_xmit (as this API is subject to
> change thought the patchset).
> 
> A new struct xdp_frame is introduced (modeled after cpumap xdp_pkt).
> And both ndo_xdp_xmit and the new xdp_return_frame end-up using this.
> 
> Support for a driver supplied allocator is implemented, and a
> refurbished version of page_pool is the first return allocator type
> introduced.  This will be a integration point for AF_XDP zero-copy.
> 
> The mlx5 driver evolve into using the page_pool, and see a performance
> increase (with ndo_xdp_xmit out ixgbe driver) from 6Mpps to 12Mpps.
> 
> 
> The patchset stop at 16 patches (one over limit), but more API changes
> are planned.  Specifically extending ndo_xdp_xmit and xdp_return_frame
> APIs to support bulking.  As this will address some known limits.
> 
> V2: Updated according to Tariq's feedback
> V3: Updated based on feedback from Jason Wang and Alex Duyck
> V4: Updated based on feedback from Tariq and Jason
> V5: Fix SPDX license, add Tariq's reviews, improve patch desc for perf test
> V6: Updated based on feedback from Eric Dumazet and Alex Duyck
> V7: Adapt to i40e that got XDP_REDIRECT support in-between
> V8:
>  Updated based on feedback kbuild test robot, and adjust for mlx5 changes
>  page_pool only compiled into kernel when drivers Kconfig 'select' feature
> V9:
>  Remove some inline statements, let compiler decide what to inline
>  Fix return value in virtio_net driver
>  Adjust for mlx5 changes in-between submissions
> V10:
>  Minor adjust for mlx5 requested by Tariq
>  Resubmit against net-next
> V11: avoid leaking info stored in frame data on page reuse

Thanks.
The series look good to me now.
Acked-by: Alexei Starovoitov <ast@kernel.org>

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Tiwei Bie @ 2018-04-17 14:56 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180417170354-mutt-send-email-mst@kernel.org>

On Tue, Apr 17, 2018 at 05:04:59PM +0300, Michael S. Tsirkin wrote:
> On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> > On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > [...]
> > > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > > +			      void **ctx)
> > > > > > > > +{
> > > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > > +	unsigned int i, j;
> > > > > > > > +
> > > > > > > > +	/* Clear data ptr. */
> > > > > > > > +	vq->desc_state[head].data = NULL;
> > > > > > > > +
> > > > > > > > +	i = head;
> > > > > > > > +
> > > > > > > > +	for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > > +		desc->flags = 0x0;
> > > > > > > Looks like this is unnecessary.
> > > > > > It's safer to zero it. If we don't zero it, after we
> > > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > > this function, the desc is still available to the
> > > > > > device.
> > > > > 
> > > > > Well detach_unused_buf_packed() should be called after device is stopped,
> > > > > otherwise even if you try to clear, there will still be a window that device
> > > > > may use it.
> > > > 
> > > > This is not about whether the device has been stopped or
> > > > not. We don't have other places to re-initialize the ring
> > > > descriptors and wrap_counter. So they need to be set to
> > > > the correct values when doing detach_unused_buf.
> > > > 
> > > > Best regards,
> > > > Tiwei Bie
> > > 
> > > find vqs is the time to do it.
> > 
> > The .find_vqs() will call .setup_vq() which will eventually
> > call vring_create_virtqueue(). It's a different case. Here
> > we're talking about re-initializing the descs and updating
> > the wrap counter when detaching the unused descs (In this
> > case, split ring just needs to decrease vring.avail->idx).
> > 
> > Best regards,
> > Tiwei Bie
> 
> There's no requirement that  virtqueue_detach_unused_buf re-initializes
> the descs. It happens on cleanup path just before drivers delete the
> vqs.

Cool, I wasn't aware of it. I saw split ring decrease
vring.avail->idx after detaching an unused desc, so I
thought detaching unused desc also needs to make sure
that the ring state will be updated correspondingly.

If there is no such requirement, do you think it's OK
to remove below two lines:

vq->avail_idx_shadow--;
vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);

from virtqueue_detach_unused_buf(), and we could have
one generic function to handle both rings:

void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
{
	struct vring_virtqueue *vq = to_vvq(_vq);
	unsigned int num, i;
	void *buf;

	START_USE(vq);

	num = vq->packed ? vq->vring_packed.num : vq->vring.num;

	for (i = 0; i < num; i++) {
		if (!vq->desc_state[i].data)
			continue;
		/* detach_buf clears data, so grab it now. */
		buf = vq->desc_state[i].data;
		detach_buf(vq, i, NULL);
		END_USE(vq);
		return buf;
	}
	/* That should have freed everything. */
	BUG_ON(vq->vq.num_free != num);

	END_USE(vq);
	return NULL;
}

Best regards,
Tiwei Bie

^ permalink raw reply

* Re: linux-next on x60: network manager often complains "network is disabled" after resume
From: Dan Williams @ 2018-04-17 15:03 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Woody Suwalski, Rafael J. Wysocki, kernel list,
	Linux-pm mailing list, Netdev list
In-Reply-To: <20180415161604.GG10802@amd>

On Sun, 2018-04-15 at 18:16 +0200, Pavel Machek wrote:
> On Mon 2018-03-26 10:33:55, Dan Williams wrote:
> > On Sun, 2018-03-25 at 08:19 +0200, Pavel Machek wrote:
> > > > > > Ok, what does 'nmcli dev' and 'nmcli radio' show?
> > > > > 
> > > > > Broken state.
> > > > > 
> > > > > pavel@amd:~$ nmcli dev
> > > > > DEVICE  TYPE      STATE        CONNECTION
> > > > > eth1    ethernet  unavailable  --
> > > > > lo      loopback  unmanaged    --
> > > > > wlan0   wifi      unmanaged    --
> > > > 
> > > > If the state is "unmanaged" on resume, that would indicate a
> > > > problem
> > > > with sleep/wake and likely not a kernel network device issue.
> > > > 
> > > > We should probably move this discussion to the NM lists to
> > > > debug
> > > > further.  Before you suspend, run "nmcli gen log level trace"
> > > > to
> > > > turn
> > > > on full debug logging, then reproduce the issue, and send a
> > > > pointer
> > > > to
> > > > those logs (scrubbed for anything you consider sensitive) to
> > > > the NM
> > > > mailing list.
> > > 
> > > Hmm :-)
> > > 
> > > root@amd:/data/pavel# nmcli gen log level trace
> > > Error: Unknown log level 'trace'
> > 
> > What NM version?  'trace' is pretty old (since 1.0 from December
> > 2014)
> > so unless you're using a really, really old version of Debian I'd
> > expect you'd have it.  Anyway, debug would do.
> 
> Hmm.
> 
> pavel@duo:~$ /usr/sbin/NetworkManager --version
> You must be root to run NetworkManager!
> pavel@duo:~$ sudo /usr/sbin/NetworkManager --version
> 0.9.10.0
> 
> So I set the log level, but I still don't see much in the log:
> 
> Apr 14 18:14:29 duo dbus[3009]: [system] Successfully activated
> service 'org.freedesktop.nm_dispatcher'
> Apr 14 18:14:29 duo nm-dispatcher: Dispatching action 'down' for
> wlan1
> Apr 14 18:14:29 duo systemd[1]: Started Network Manager Script
> Dispatcher Service.
> Apr 14 18:14:29 duo systemd-sleep[6853]: Suspending system...

I think if systemd really is handling the suspend/resume, you'll need
to make sure NM has systemd suspend/resume handling enabled as well. 
NM 0.9.10 has build-time support for both:

[--with-suspend-resume=upower|systemd]

while later versions also support ConsoleKit2 and elogind.

Any idea what your NM was compiled with?  Though honestly I'm not sure
how all the suspend/resume stuff is supposed to work these days on
different distros that provide the choice between systemd/upower/pm-
utils/CK2/etc.

Dan

> Apr 14 21:27:53 duo systemd[1]: systemd-journald.service watchdog
> timeout (limit 1min)!
> pavel@duo:~$ date
> Sun Apr 15 12:26:32 CEST 2018
> pavel@duo:~$
> 
> Is it possible that time handling accross suspend changed in v4.17?
> 
> I get some weird effects. With display backlight...
> 
> > > Where do I get the logs? I don't see much in the syslog...
> > > And.. It seems that it is "every other suspend". One resume
> > > results
> > > in
> > > broken network, one in working one, one in broken one...
> > 
> > Does your distro use pm-utils, upower, or systemd for
> > suspend/resume
> > handling?
> 
> upower, I guess:
> 
> pavel@duo:/data/l/linux$ ps aux | grep upower
> root      3820  0.0  0.1  42848  7984 ?        Ssl  Apr14   0:01
> /usr/lib/upower/upowerd
> 
> 									
> Pavel

^ permalink raw reply

* Re: [net-next V11 PATCH 17/17] xdp: avoid leaking info stored in frame data on page reuse
From: Daniel Borkmann @ 2018-04-17 15:04 UTC (permalink / raw)
  To: Jesper Dangaard Brouer, netdev, BjörnTöpel,
	magnus.karlsson
  Cc: eugenia, Jason Wang, John Fastabend, Eran Ben Elisha,
	Saeed Mahameed, galp, Alexei Starovoitov, Tariq Toukan
In-Reply-To: <152397640301.20272.9781402055431898663.stgit@firesoul>

Hi Jesper,

On 04/17/2018 04:46 PM, Jesper Dangaard Brouer wrote:
> The bpf infrastructure and verifier goes to great length to avoid
> bpf progs leaking kernel (pointer) info.
> 
> For queueing an xdp_buff via XDP_REDIRECT, xdp_frame info stores
> kernel info (incl pointers) in top part of frame data (xdp->data_hard_start).
> Checks are in place to assure enough headroom is available for this.
> 
> This info is not cleared, and if the frame is reused, then a
> malicious user could use bpf_xdp_adjust_head helper to move
> xdp->data into this area.  Thus, making this area readable.
> 
> This is not super critical as XDP progs requires root or
> CAP_SYS_ADMIN, which are privileged enough for such info.  An
> effort (is underway) towards moving networking bpf hooks to the
> lesser privileged mode CAP_NET_ADMIN, where leaking such info
> should be avoided.  Thus, this patch to clear the info when
> needed.
> 
> Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
> ---
>  net/core/filter.c |    8 ++++++++
>  1 file changed, 8 insertions(+)
> 
> diff --git a/net/core/filter.c b/net/core/filter.c
> index 3bb0cb98a9be..a374b8560bc4 100644
> --- a/net/core/filter.c
> +++ b/net/core/filter.c
> @@ -2692,6 +2692,7 @@ static unsigned long xdp_get_metalen(const struct xdp_buff *xdp)
>  
>  BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>  {
> +	void *xdp_frame_end = xdp->data_hard_start + sizeof(struct xdp_frame);
>  	unsigned long metalen = xdp_get_metalen(xdp);
>  	void *data_start = xdp->data_hard_start + metalen;
>  	void *data = xdp->data + offset;
> @@ -2700,6 +2701,13 @@ BPF_CALL_2(bpf_xdp_adjust_head, struct xdp_buff *, xdp, int, offset)
>  		     data > xdp->data_end - ETH_HLEN))
>  		return -EINVAL;
>  
> +	/* Avoid info leak, when reusing area prev used by xdp_frame */
> +	if (data < xdp_frame_end) {
> +		unsigned long clearlen = xdp_frame_end - data;
> +
> +		memset(data, 0, clearlen);
> +	}
> +

Sorry, but this patch is actually incomplete: bpf_xdp_adjust_meta() needs the
same treatment, otherwise there is no point in clearing here, but not when
using meta data.

But also I think this is a bit suboptimal resolved in general (sorry, haven't
had a chance to review this particular patch earlier as I was swamped). What
happens when you adjust head for the data, where you first cleared this frame
data, but then later on in the program you would adjust data_meta pointer,
which also falls under the 'data{,_meta} < xdp_frame_end' condition. Then,
you are going to clear again valid data that the user wrote to the packet
before.

>  	if (metalen)
>  		memmove(xdp->data_meta + offset,
>  			xdp->data_meta, metalen);
> 

^ permalink raw reply

* [PATCH net-next 2/4] net/smc: handle sockopt TCP_NODELAY
From: Ursula Braun @ 2018-04-17 15:18 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180417151815.77191-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.vnet.ibm.com>

TCP sockopts must not interfere with the CLC handshake on the
CLC socket. Therefore, we defer some of them till the CLC
handshake has completed, like resetting TCP_NODELAY.

While touching setsockopt, the TCP_FASTOPEN sockopts are
ignored, since SMC-connection setup is based on the TCP
three-way-handshake.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
---
 net/smc/af_smc.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/smc/smc.h    |   4 ++
 2 files changed, 111 insertions(+), 2 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 5f8046c62d90..96f4d182f998 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -377,6 +377,22 @@ static void smc_link_save_peer_info(struct smc_link *link,
 	link->peer_mtu = clc->qp_mtu;
 }
 
+/* deferred setsockopt's not desired during clc handshake */
+static void smc_apply_deferred_sockopts(struct smc_sock *smc)
+{
+	struct smc_sock *opt_smc = smc;
+	u8 val;
+
+	if (smc->listen_smc)
+		opt_smc = smc->listen_smc;
+	if (opt_smc->deferred_nodelay_reset) {
+		val = 0;
+		kernel_setsockopt(smc->clcsock, SOL_TCP, TCP_NODELAY, &val,
+				  sizeof(val));
+		opt_smc->deferred_nodelay_reset = 0;
+	}
+}
+
 /* setup for RDMA connection of client */
 static int smc_connect_rdma(struct smc_sock *smc)
 {
@@ -506,6 +522,7 @@ static int smc_connect_rdma(struct smc_sock *smc)
 	smc_tx_init(smc);
 
 out_connected:
+	smc_apply_deferred_sockopts(smc);
 	smc_copy_sock_settings_to_clc(smc);
 	if (smc->sk.sk_state == SMC_INIT)
 		smc->sk.sk_state = SMC_ACTIVE;
@@ -908,6 +925,7 @@ static void smc_listen_work(struct work_struct *work)
 	mutex_unlock(&smc_create_lgr_pending);
 
 out_connected:
+	smc_apply_deferred_sockopts(new_smc);
 	sk_refcnt_debug_inc(newsmcsk);
 	if (newsmcsk->sk_state == SMC_INIT)
 		newsmcsk->sk_state = SMC_ACTIVE;
@@ -1280,9 +1298,60 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 {
 	struct sock *sk = sock->sk;
 	struct smc_sock *smc;
+	int val;
 
 	smc = smc_sk(sk);
+	if (smc->use_fallback || level != SOL_TCP)
+		goto clcsock;
+
+	/* level SOL_TCP */
+	switch (optname) {
+	case TCP_CONGESTION:
+	case TCP_ULP:
+		/* sockopts without integer value; do not apply to SMC */
+		goto clcsock;
+	default:
+		break;
+	}
 
+	if (optlen < sizeof(int))
+		return -EINVAL;
+	if (get_user(val, (int __user *)optval))
+		return -EFAULT;
+
+	lock_sock(sk);
+	switch (optname) {
+	case TCP_NODELAY:
+		if (sk->sk_state != SMC_INIT && sk->sk_state != SMC_LISTEN) {
+			release_sock(sk);
+			goto clcsock;
+		}
+		/* for the CLC-handshake TCP_NODELAY is desired;
+		 * in case of fallback to TCP, a nodelay reset is
+		 * triggered afterwards.
+		 */
+		if (val)
+			smc->deferred_nodelay_reset = 0;
+		else
+			smc->deferred_nodelay_reset = 1;
+		break;
+	case TCP_FASTOPEN:
+	case TCP_FASTOPEN_CONNECT:
+	case TCP_FASTOPEN_KEY:
+	case TCP_FASTOPEN_NO_COOKIE:
+		/* ignore these options; 3-way handshake shouldn't be
+		 * bypassed with SMC
+		 */
+		break;
+	default:
+		/* apply option to the CLC socket */
+		release_sock(sk);
+		goto clcsock;
+	}
+	release_sock(sk);
+	return 0;
+
+clcsock:
 	/* generic setsockopts reaching us here always apply to the
 	 * CLC socket
 	 */
@@ -1293,10 +1362,41 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 static int smc_getsockopt(struct socket *sock, int level, int optname,
 			  char __user *optval, int __user *optlen)
 {
+	struct sock *sk = sock->sk;
 	struct smc_sock *smc;
+	int val, len;
 
-	smc = smc_sk(sock->sk);
-	/* socket options apply to the CLC socket */
+	smc = smc_sk(sk);
+
+	if (smc->use_fallback || level != SOL_TCP)
+		goto clcsock;
+
+	if (get_user(len, optlen))
+		return -EFAULT;
+	len = min_t(unsigned int, len, sizeof(int));
+	if (len < 0)
+		return -EINVAL;
+
+	/* level SOL_TCP */
+	switch (optname) {
+	case TCP_NODELAY:
+		if (smc->deferred_nodelay_reset)
+			val = 0;
+		else
+			goto clcsock;
+		break;
+	default:
+		goto clcsock;
+	}
+
+	if (put_user(len, optlen))
+		return -EFAULT;
+	if (copy_to_user(optval, &val, len))
+		return -EFAULT;
+	return 0;
+
+clcsock:
+	/* socket options applying to the CLC socket */
 	return smc->clcsock->ops->getsockopt(smc->clcsock, level, optname,
 					     optval, optlen);
 }
@@ -1387,6 +1487,7 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
 	int family = (protocol == SMCPROTO_SMC6) ? PF_INET6 : PF_INET;
 	struct smc_sock *smc;
 	struct sock *sk;
+	u8 val = 1;
 	int rc;
 
 	rc = -ESOCKTNOSUPPORT;
@@ -1412,6 +1513,10 @@ static int smc_create(struct net *net, struct socket *sock, int protocol,
 		sk_common_release(sk);
 		goto out;
 	}
+	/* clc handshake should run with disabled Nagle algorithm */
+	kernel_setsockopt(smc->clcsock, SOL_TCP, TCP_NODELAY, &val,
+			  sizeof(val));
+	smc->deferred_nodelay_reset = 1; /* TCP_NODELAY is not the default */
 	smc->sk.sk_sndbuf = max(smc->clcsock->sk->sk_sndbuf, SMC_BUF_MIN_SIZE);
 	smc->sk.sk_rcvbuf = max(smc->clcsock->sk->sk_rcvbuf, SMC_BUF_MIN_SIZE);
 
diff --git a/net/smc/smc.h b/net/smc/smc.h
index e4829a2f46ba..6dfc1c90bed2 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -185,6 +185,10 @@ struct smc_sock {				/* smc sock container */
 						 * started, waiting for unsent
 						 * data to be sent
 						 */
+	u8			deferred_nodelay_reset : 1;
+						/* defer Nagle after CLC
+						 * handshake
+						 */
 };
 
 static inline struct smc_sock *smc_sk(const struct sock *sk)
-- 
2.13.5

^ permalink raw reply related

* Re: [net-next V11 PATCH 00/17] XDP redirect memory return API
From: David Miller @ 2018-04-17 15:18 UTC (permalink / raw)
  To: alexei.starovoitov
  Cc: brouer, netdev, bjorn.topel, magnus.karlsson, eugenia, jasowang,
	john.fastabend, eranbe, saeedm, galp, borkmann, tariqt
In-Reply-To: <20180417144841.grzupfd6zax6zhca@ast-mbp.dhcp.thefacebook.com>

From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
Date: Tue, 17 Apr 2018 07:48:42 -0700

> On Tue, Apr 17, 2018 at 04:45:16PM +0200, Jesper Dangaard Brouer wrote:
>> Submitted against net-next, as it contains NIC driver changes.
>> 
>> This patchset works towards supporting different XDP RX-ring memory
>> allocators.  As this will be needed by the AF_XDP zero-copy mode.
 ...
> The series look good to me now.
> Acked-by: Alexei Starovoitov <ast@kernel.org>

Pushed out to net-next, thanks everyone!

^ permalink raw reply

* [PATCH net-next 1/4] net/smc: fix structure size
From: Ursula Braun @ 2018-04-17 15:18 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180417151815.77191-1-ubraun@linux.ibm.com>

From: Karsten Graul <kgraul@linux.vnet.ibm.com>

The struct smc_cdc_msg must be defined as packed so the
size is 44 bytes.
And change the structure size check so sizeof is checked.

Signed-off-by: Karsten Graul <kgraul@linux.vnet.ibm.com>
---
 net/smc/smc_cdc.c | 2 +-
 net/smc/smc_cdc.h | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/net/smc/smc_cdc.c b/net/smc/smc_cdc.c
index b42395d24cba..42ad57365eca 100644
--- a/net/smc/smc_cdc.c
+++ b/net/smc/smc_cdc.c
@@ -82,7 +82,7 @@ static inline void smc_cdc_add_pending_send(struct smc_connection *conn,
 		sizeof(struct smc_cdc_msg) > SMC_WR_BUF_SIZE,
 		"must increase SMC_WR_BUF_SIZE to at least sizeof(struct smc_cdc_msg)");
 	BUILD_BUG_ON_MSG(
-		offsetof(struct smc_cdc_msg, reserved) > SMC_WR_TX_SIZE,
+		sizeof(struct smc_cdc_msg) != SMC_WR_TX_SIZE,
 		"must adapt SMC_WR_TX_SIZE to sizeof(struct smc_cdc_msg); if not all smc_wr upper layer protocols use the same message size any more, must start to set link->wr_tx_sges[i].length on each individual smc_wr_tx_send()");
 	BUILD_BUG_ON_MSG(
 		sizeof(struct smc_cdc_tx_pend) > SMC_WR_TX_PEND_PRIV_SIZE,
diff --git a/net/smc/smc_cdc.h b/net/smc/smc_cdc.h
index ab240b37ad11..d2012fd22100 100644
--- a/net/smc/smc_cdc.h
+++ b/net/smc/smc_cdc.h
@@ -48,7 +48,7 @@ struct smc_cdc_msg {
 	struct smc_cdc_producer_flags	prod_flags;
 	struct smc_cdc_conn_state_flags	conn_state_flags;
 	u8				reserved[18];
-} __aligned(8);
+} __packed;					/* format defined in RFC7609 */
 
 static inline bool smc_cdc_rxed_any_close(struct smc_connection *conn)
 {
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 3/4] net/smc: handle sockopt TCP_CORK
From: Ursula Braun @ 2018-04-17 15:18 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180417151815.77191-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.vnet.ibm.com>

TCP sockopts must not interfere with the CLC handshake on the
CLC socket. Therefore, we defer some of them till the CLC
handshake has completed, like setting TCP_CORK.

For a corked SMC socket RDMA writes are deferred, if there is
still sufficient send buffer space available.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
---
 net/smc/af_smc.c | 32 ++++++++++++++++++++++++++++++++
 net/smc/smc.h    |  4 ++++
 net/smc/smc_tx.c | 16 +++++++++++++---
 net/smc/smc_tx.h |  8 ++++++++
 4 files changed, 57 insertions(+), 3 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index 96f4d182f998..c3a950c6950d 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -391,6 +391,12 @@ static void smc_apply_deferred_sockopts(struct smc_sock *smc)
 				  sizeof(val));
 		opt_smc->deferred_nodelay_reset = 0;
 	}
+	if (opt_smc->deferred_cork_set) {
+		val = 1;
+		kernel_setsockopt(smc->clcsock, SOL_TCP, TCP_CORK, &val,
+				  sizeof(val));
+		opt_smc->deferred_cork_set = 0;
+	}
 }
 
 /* setup for RDMA connection of client */
@@ -1323,6 +1329,9 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 	switch (optname) {
 	case TCP_NODELAY:
 		if (sk->sk_state != SMC_INIT && sk->sk_state != SMC_LISTEN) {
+			if (val && smc_tx_is_corked(smc))
+				mod_delayed_work(system_wq, &smc->conn.tx_work,
+						 0);
 			release_sock(sk);
 			goto clcsock;
 		}
@@ -1335,6 +1344,23 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 		else
 			smc->deferred_nodelay_reset = 1;
 		break;
+	case TCP_CORK:
+		if (sk->sk_state != SMC_INIT && sk->sk_state != SMC_LISTEN) {
+			if (!val)
+				mod_delayed_work(system_wq, &smc->conn.tx_work,
+						 0);
+			release_sock(sk);
+			goto clcsock;
+		}
+		/* for the CLC-handshake TCP_CORK is not desired;
+		 * in case of fallback to TCP, cork setting is
+		 * triggered afterwards.
+		 */
+		if (val)
+			smc->deferred_cork_set = 1;
+		else
+			smc->deferred_cork_set = 0;
+		break;
 	case TCP_FASTOPEN:
 	case TCP_FASTOPEN_CONNECT:
 	case TCP_FASTOPEN_KEY:
@@ -1385,6 +1411,12 @@ static int smc_getsockopt(struct socket *sock, int level, int optname,
 		else
 			goto clcsock;
 		break;
+	case TCP_CORK:
+		if (smc->deferred_cork_set)
+			val = 1;
+		else
+			goto clcsock;
+		break;
 	default:
 		goto clcsock;
 	}
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 6dfc1c90bed2..38888da5a5ea 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -189,6 +189,10 @@ struct smc_sock {				/* smc sock container */
 						/* defer Nagle after CLC
 						 * handshake
 						 */
+	u8			deferred_cork_set : 1;
+						/* defer corking after CLC
+						 * handshake
+						 */
 };
 
 static inline struct smc_sock *smc_sk(const struct sock *sk)
diff --git a/net/smc/smc_tx.c b/net/smc/smc_tx.c
index 72f004c9c9b1..a31377bb400b 100644
--- a/net/smc/smc_tx.c
+++ b/net/smc/smc_tx.c
@@ -26,6 +26,7 @@
 #include "smc_tx.h"
 
 #define SMC_TX_WORK_DELAY	HZ
+#define SMC_TX_CORK_DELAY	(HZ >> 2)	/* 250 ms */
 
 /***************************** sndbuf producer *******************************/
 
@@ -209,7 +210,16 @@ int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len)
 		/* since we just produced more new data into sndbuf,
 		 * trigger sndbuf consumer: RDMA write into peer RMBE and CDC
 		 */
-		smc_tx_sndbuf_nonempty(conn);
+		if ((msg->msg_flags & MSG_MORE || smc_tx_is_corked(smc)) &&
+		    (atomic_read(&conn->sndbuf_space) >
+						(conn->sndbuf_size >> 1)))
+			/* for a corked socket defer the RDMA writes if there
+			 * is still sufficient sndbuf_space available
+			 */
+			schedule_delayed_work(&conn->tx_work,
+					      SMC_TX_CORK_DELAY);
+		else
+			smc_tx_sndbuf_nonempty(conn);
 	} /* while (msg_data_left(msg)) */
 
 	return send_done;
@@ -409,8 +419,8 @@ int smc_tx_sndbuf_nonempty(struct smc_connection *conn)
 			}
 			rc = 0;
 			if (conn->alert_token_local) /* connection healthy */
-				schedule_delayed_work(&conn->tx_work,
-						      SMC_TX_WORK_DELAY);
+				mod_delayed_work(system_wq, &conn->tx_work,
+						 SMC_TX_WORK_DELAY);
 		}
 		goto out_unlock;
 	}
diff --git a/net/smc/smc_tx.h b/net/smc/smc_tx.h
index 78255964fa4d..e5f4188b4bdb 100644
--- a/net/smc/smc_tx.h
+++ b/net/smc/smc_tx.h
@@ -14,6 +14,7 @@
 
 #include <linux/socket.h>
 #include <linux/types.h>
+#include <net/tcp.h>
 
 #include "smc.h"
 #include "smc_cdc.h"
@@ -27,6 +28,13 @@ static inline int smc_tx_prepared_sends(struct smc_connection *conn)
 	return smc_curs_diff(conn->sndbuf_size, &sent, &prep);
 }
 
+static inline bool smc_tx_is_corked(struct smc_sock *smc)
+{
+	struct tcp_sock *tp = tcp_sk(smc->clcsock->sk);
+
+	return (tp->nonagle & TCP_NAGLE_CORK) ? true : false;
+}
+
 void smc_tx_init(struct smc_sock *smc);
 int smc_tx_sendmsg(struct smc_sock *smc, struct msghdr *msg, size_t len);
 int smc_tx_sndbuf_nonempty(struct smc_connection *conn);
-- 
2.13.5

^ permalink raw reply related

* [PATCH net-next 0/4] net/smc: fixes 2018-04-17
From: Ursula Braun @ 2018-04-17 15:18 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun

From: Ursula Braun <ursula.braun@linux.vnet.ibm.com>

Dave,

here are smc changes for the net-next tree.
The first patch assures the size of struct smc_cdc_msg is 44 bytes.
Patches 2 to 4 implement sockopts that require special handling in SMC.

Thanks, Ursula

Karsten Graul (1):
  net/smc: fix structure size

Ursula Braun (3):
  net/smc: handle sockopt TCP_NODELAY
  net/smc: handle sockopt TCP_CORK
  net/smc: handle sockopt TCP_DEFER_ACCEPT

 net/smc/af_smc.c  | 177 +++++++++++++++++++++++++++++++++++++++++++++++++++++-
 net/smc/smc.h     |  12 ++++
 net/smc/smc_cdc.c |   2 +-
 net/smc/smc_cdc.h |   2 +-
 net/smc/smc_rx.c  |   2 +-
 net/smc/smc_rx.h  |   1 +
 net/smc/smc_tx.c  |  16 ++++-
 net/smc/smc_tx.h  |   8 +++
 8 files changed, 211 insertions(+), 9 deletions(-)

-- 
2.13.5

^ permalink raw reply

* [PATCH net-next 4/4] net/smc: handle sockopt TCP_DEFER_ACCEPT
From: Ursula Braun @ 2018-04-17 15:18 UTC (permalink / raw)
  To: davem; +Cc: netdev, linux-s390, schwidefsky, heiko.carstens, raspl, ubraun
In-Reply-To: <20180417151815.77191-1-ubraun@linux.ibm.com>

From: Ursula Braun <ubraun@linux.vnet.ibm.com>

TCP sockopt setting of TCP_DEFER_ACCEPT should just be applied
to the SMC socket, but not to the internal CLC socket.

If set, the accept is delayed till data is available.

Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com>
---
 net/smc/af_smc.c | 36 +++++++++++++++++++++++++++++++++++-
 net/smc/smc.h    |  4 ++++
 net/smc/smc_rx.c |  2 +-
 net/smc/smc_rx.h |  1 +
 4 files changed, 41 insertions(+), 2 deletions(-)

diff --git a/net/smc/af_smc.c b/net/smc/af_smc.c
index c3a950c6950d..2440e9b8d3b7 100644
--- a/net/smc/af_smc.c
+++ b/net/smc/af_smc.c
@@ -1092,9 +1092,29 @@ static int smc_accept(struct socket *sock, struct socket *new_sock,
 
 	if (!rc)
 		rc = sock_error(nsk);
+	release_sock(sk);
+	if (rc)
+		goto out;
+
+	if (lsmc->sockopt_defer_accept && !(flags & O_NONBLOCK)) {
+		/* wait till data arrives on the socket */
+		timeo = msecs_to_jiffies(lsmc->sockopt_defer_accept *
+								MSEC_PER_SEC);
+		if (smc_sk(nsk)->use_fallback) {
+			struct sock *clcsk = smc_sk(nsk)->clcsock->sk;
+
+			lock_sock(clcsk);
+			if (skb_queue_empty(&clcsk->sk_receive_queue))
+				sk_wait_data(clcsk, &timeo, NULL);
+			release_sock(clcsk);
+		} else if (!atomic_read(&smc_sk(nsk)->conn.bytes_to_rcv)) {
+			lock_sock(nsk);
+			smc_rx_wait_data(smc_sk(nsk), &timeo);
+			release_sock(nsk);
+		}
+	}
 
 out:
-	release_sock(sk);
 	sock_put(sk); /* sock_hold above */
 	return rc;
 }
@@ -1361,6 +1381,14 @@ static int smc_setsockopt(struct socket *sock, int level, int optname,
 		else
 			smc->deferred_cork_set = 0;
 		break;
+	case TCP_DEFER_ACCEPT:
+		if (sk->sk_state != SMC_INIT && sk->sk_state != SMC_LISTEN) {
+			release_sock(sk);
+			goto clcsock;
+		}
+		/* for the CLC-handshake TCP_DEFER_ACCEPT is not desired */
+		smc->sockopt_defer_accept = val;
+		break;
 	case TCP_FASTOPEN:
 	case TCP_FASTOPEN_CONNECT:
 	case TCP_FASTOPEN_KEY:
@@ -1417,6 +1445,12 @@ static int smc_getsockopt(struct socket *sock, int level, int optname,
 		else
 			goto clcsock;
 		break;
+	case TCP_DEFER_ACCEPT:
+		if (smc->sockopt_defer_accept)
+			val = smc->sockopt_defer_accept;
+		else
+			goto clcsock;
+		break;
 	default:
 		goto clcsock;
 	}
diff --git a/net/smc/smc.h b/net/smc/smc.h
index 38888da5a5ea..11f869d1f28a 100644
--- a/net/smc/smc.h
+++ b/net/smc/smc.h
@@ -180,6 +180,10 @@ struct smc_sock {				/* smc sock container */
 	struct list_head	accept_q;	/* sockets to be accepted */
 	spinlock_t		accept_q_lock;	/* protects accept_q */
 	bool			use_fallback;	/* fallback to tcp */
+	int			sockopt_defer_accept;
+						/* sockopt TCP_DEFER_ACCEPT
+						 * value
+						 */
 	u8			wait_close_tx_prepared : 1;
 						/* shutdown wr or close
 						 * started, waiting for unsent
diff --git a/net/smc/smc_rx.c b/net/smc/smc_rx.c
index eff4e0d0bb31..af851d8df1f8 100644
--- a/net/smc/smc_rx.c
+++ b/net/smc/smc_rx.c
@@ -51,7 +51,7 @@ static void smc_rx_data_ready(struct sock *sk)
  * 1 if at least 1 byte available in rcvbuf or if socket error/shutdown.
  * 0 otherwise (nothing in rcvbuf nor timeout, e.g. interrupted).
  */
-static int smc_rx_wait_data(struct smc_sock *smc, long *timeo)
+int smc_rx_wait_data(struct smc_sock *smc, long *timeo)
 {
 	DEFINE_WAIT_FUNC(wait, woken_wake_function);
 	struct smc_connection *conn = &smc->conn;
diff --git a/net/smc/smc_rx.h b/net/smc/smc_rx.h
index 3a32b59bf06c..0b75a6b470e6 100644
--- a/net/smc/smc_rx.h
+++ b/net/smc/smc_rx.h
@@ -20,5 +20,6 @@
 void smc_rx_init(struct smc_sock *smc);
 int smc_rx_recvmsg(struct smc_sock *smc, struct msghdr *msg, size_t len,
 		   int flags);
+int smc_rx_wait_data(struct smc_sock *smc, long *timeo);
 
 #endif /* SMC_RX_H */
-- 
2.13.5

^ permalink raw reply related

* Re: [net-next V11 PATCH 00/17] XDP redirect memory return API
From: Daniel Borkmann @ 2018-04-17 15:24 UTC (permalink / raw)
  To: David Miller, alexei.starovoitov
  Cc: brouer, netdev, bjorn.topel, magnus.karlsson, eugenia, jasowang,
	john.fastabend, eranbe, saeedm, galp, borkmann, tariqt
In-Reply-To: <20180417.111820.1693265050355969717.davem@davemloft.net>

On 04/17/2018 05:18 PM, David Miller wrote:
> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> Date: Tue, 17 Apr 2018 07:48:42 -0700
> 
>> On Tue, Apr 17, 2018 at 04:45:16PM +0200, Jesper Dangaard Brouer wrote:
>>> Submitted against net-next, as it contains NIC driver changes.
>>>
>>> This patchset works towards supporting different XDP RX-ring memory
>>> allocators.  As this will be needed by the AF_XDP zero-copy mode.
>  ...
>> The series look good to me now.
>> Acked-by: Alexei Starovoitov <ast@kernel.org>
> 
> Pushed out to net-next, thanks everyone!

See my comment on the last patch which is actually buggy. But given it's
pushed out already, then it needs to be fixed as follow-up..

Thanks,
Daniel

^ permalink raw reply

* Re: [net-next V11 PATCH 00/17] XDP redirect memory return API
From: David Miller @ 2018-04-17 15:35 UTC (permalink / raw)
  To: daniel
  Cc: alexei.starovoitov, brouer, netdev, bjorn.topel, magnus.karlsson,
	eugenia, jasowang, john.fastabend, eranbe, saeedm, galp, borkmann,
	tariqt
In-Reply-To: <89dee120-76a4-f7dd-82ec-1d8a9fcb29d8@iogearbox.net>

From: Daniel Borkmann <daniel@iogearbox.net>
Date: Tue, 17 Apr 2018 17:24:03 +0200

> On 04/17/2018 05:18 PM, David Miller wrote:
>> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
>> Date: Tue, 17 Apr 2018 07:48:42 -0700
>> 
>>> On Tue, Apr 17, 2018 at 04:45:16PM +0200, Jesper Dangaard Brouer wrote:
>>>> Submitted against net-next, as it contains NIC driver changes.
>>>>
>>>> This patchset works towards supporting different XDP RX-ring memory
>>>> allocators.  As this will be needed by the AF_XDP zero-copy mode.
>>  ...
>>> The series look good to me now.
>>> Acked-by: Alexei Starovoitov <ast@kernel.org>
>> 
>> Pushed out to net-next, thanks everyone!
> 
> See my comment on the last patch which is actually buggy. But given it's
> pushed out already, then it needs to be fixed as follow-up..

Yes it will need to be dealt with as a follow-up.

^ permalink raw reply

* Re: [PATCH 1/1] net/mlx4_core: avoid resetting HCA when accessing an offline device
From: Tariq Toukan @ 2018-04-17 15:37 UTC (permalink / raw)
  To: Zhu Yanjun, tariqt, netdev, linux-rdma, haakon.bugge
In-Reply-To: <1523840527-22746-1-git-send-email-yanjun.zhu@oracle.com>



On 16/04/2018 4:02 AM, Zhu Yanjun wrote:
> While a faulty cable is used or HCA firmware error, HCA device will
> be offline. When the driver is accessing this offline device, the
> following call trace will pop out.
> 
> "
> ...
>    [<ffffffff816e4842>] dump_stack+0x63/0x81
>    [<ffffffff816e459e>] panic+0xcc/0x21b
>    [<ffffffffa03e5f8a>] mlx4_enter_error_state+0xba/0xf0 [mlx4_core]
>    [<ffffffffa03e7298>] mlx4_cmd_reset_flow+0x38/0x60 [mlx4_core]
>    [<ffffffffa03e7381>] mlx4_cmd_poll+0xc1/0x2e0 [mlx4_core]
>    [<ffffffffa03e9f00>] __mlx4_cmd+0xb0/0x160 [mlx4_core]
>    [<ffffffffa0406934>] mlx4_SENSE_PORT+0x54/0xd0 [mlx4_core]
>    [<ffffffffa03f5f54>] mlx4_dev_cap+0x4a4/0xb50 [mlx4_core]
> ...
> "
> In the above call trace, the function mlx4_cmd_poll calls the function
> mlx4_cmd_post to access the HCA while HCA is offline. Then mlx4_cmd_post
> returns an error -EIO. Per -EIO, the function mlx4_cmd_poll calls
> mlx4_cmd_reset_flow to reset HCA. And the above call trace pops out.
> 
> This is not reasonable. Since HCA device is offline when it is being
> accessed, it should not be reset again.
> 
> In this patch, since HCA is offline, the function mlx4_cmd_post returns
> an error -EINVAL. Per -EINVAL, the function mlx4_cmd_poll directly returns
> instead of resetting HCA.
> 
> CC: Srinivas Eeda <srinivas.eeda@oracle.com>
> CC: Junxiao Bi <junxiao.bi@oracle.com>
> Suggested-by: Håkon Bugge <haakon.bugge@oracle.com>
> Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
> ---
>   drivers/net/ethernet/mellanox/mlx4/cmd.c | 8 ++++++++
>   1 file changed, 8 insertions(+)
> 
> diff --git a/drivers/net/ethernet/mellanox/mlx4/cmd.c b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> index 6a9086d..f1c8c42 100644
> --- a/drivers/net/ethernet/mellanox/mlx4/cmd.c
> +++ b/drivers/net/ethernet/mellanox/mlx4/cmd.c
> @@ -451,6 +451,8 @@ static int mlx4_cmd_post(struct mlx4_dev *dev, u64 in_param, u64 out_param,
>   		 * Device is going through error recovery
>   		 * and cannot accept commands.
>   		 */
> +		mlx4_err(dev, "%s : Device is in error recovery.\n", __func__);
> +		ret = -EINVAL;
>   		goto out;
>   	}
>   
> @@ -657,6 +659,9 @@ static int mlx4_cmd_poll(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
>   	}
>   
>   out_reset:
> +	if (err == -EINVAL)
> +		goto out;
> +

See below.

>   	if (err)
>   		err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>   out:
> @@ -766,6 +771,9 @@ static int mlx4_cmd_wait(struct mlx4_dev *dev, u64 in_param, u64 *out_param,
>   		*out_param = context->out_param;
>   
>   out_reset:
> +	if (err == -EINVAL)
> +		goto out;
> +
>   	if (err)

Instead, just do here: if (err && err != -EINVAL)

>   		err = mlx4_cmd_reset_flow(dev, op, op_modifier, err);
>   out:
> 

I am not sure this does not mistakenly cover other cases that already 
exist and have (err == -EINVAL).

For example, this line is hard to predict:
err = mlx4_status_to_errno
and later on, we might get into
if (mlx4_closing_cmd_fatal_error(op, stat))
which leads to out_reset.

We must have a deeper look at this.
But a better option is, change the error indication to uniquely indicate 
"already in error recovery".

^ permalink raw reply

* Re: [RFC v2] virtio: support packed ring
From: Michael S. Tsirkin @ 2018-04-17 15:54 UTC (permalink / raw)
  To: Tiwei Bie
  Cc: Jason Wang, wexu, virtualization, linux-kernel, netdev, jfreimann
In-Reply-To: <20180417145626.y5vei4y6irrdw7ky@debian>

On Tue, Apr 17, 2018 at 10:56:26PM +0800, Tiwei Bie wrote:
> On Tue, Apr 17, 2018 at 05:04:59PM +0300, Michael S. Tsirkin wrote:
> > On Tue, Apr 17, 2018 at 08:47:16PM +0800, Tiwei Bie wrote:
> > > On Tue, Apr 17, 2018 at 03:17:41PM +0300, Michael S. Tsirkin wrote:
> > > > On Tue, Apr 17, 2018 at 10:51:33AM +0800, Tiwei Bie wrote:
> > > > > On Tue, Apr 17, 2018 at 10:11:58AM +0800, Jason Wang wrote:
> > > > > > On 2018年04月13日 15:15, Tiwei Bie wrote:
> > > > > > > On Fri, Apr 13, 2018 at 12:30:24PM +0800, Jason Wang wrote:
> > > > > > > > On 2018年04月01日 22:12, Tiwei Bie wrote:
> > > > > [...]
> > > > > > > > > +static int detach_buf_packed(struct vring_virtqueue *vq, unsigned int head,
> > > > > > > > > +			      void **ctx)
> > > > > > > > > +{
> > > > > > > > > +	struct vring_packed_desc *desc;
> > > > > > > > > +	unsigned int i, j;
> > > > > > > > > +
> > > > > > > > > +	/* Clear data ptr. */
> > > > > > > > > +	vq->desc_state[head].data = NULL;
> > > > > > > > > +
> > > > > > > > > +	i = head;
> > > > > > > > > +
> > > > > > > > > +	for (j = 0; j < vq->desc_state[head].num; j++) {
> > > > > > > > > +		desc = &vq->vring_packed.desc[i];
> > > > > > > > > +		vring_unmap_one_packed(vq, desc);
> > > > > > > > > +		desc->flags = 0x0;
> > > > > > > > Looks like this is unnecessary.
> > > > > > > It's safer to zero it. If we don't zero it, after we
> > > > > > > call virtqueue_detach_unused_buf_packed() which calls
> > > > > > > this function, the desc is still available to the
> > > > > > > device.
> > > > > > 
> > > > > > Well detach_unused_buf_packed() should be called after device is stopped,
> > > > > > otherwise even if you try to clear, there will still be a window that device
> > > > > > may use it.
> > > > > 
> > > > > This is not about whether the device has been stopped or
> > > > > not. We don't have other places to re-initialize the ring
> > > > > descriptors and wrap_counter. So they need to be set to
> > > > > the correct values when doing detach_unused_buf.
> > > > > 
> > > > > Best regards,
> > > > > Tiwei Bie
> > > > 
> > > > find vqs is the time to do it.
> > > 
> > > The .find_vqs() will call .setup_vq() which will eventually
> > > call vring_create_virtqueue(). It's a different case. Here
> > > we're talking about re-initializing the descs and updating
> > > the wrap counter when detaching the unused descs (In this
> > > case, split ring just needs to decrease vring.avail->idx).
> > > 
> > > Best regards,
> > > Tiwei Bie
> > 
> > There's no requirement that  virtqueue_detach_unused_buf re-initializes
> > the descs. It happens on cleanup path just before drivers delete the
> > vqs.
> 
> Cool, I wasn't aware of it. I saw split ring decrease
> vring.avail->idx after detaching an unused desc, so I
> thought detaching unused desc also needs to make sure
> that the ring state will be updated correspondingly.


Hmm. You are right. Seems to be out console driver being out of spec.
Will have to look at how to fix that :(

It was done here:

Commit b3258ff1d6086bd2b9eeb556844a868ad7d49bc8
Author: Amit Shah <amit.shah@redhat.com>
Date:   Wed Mar 16 19:12:10 2011 +0530

    virtio: Decrement avail idx on buffer detach
    
    When detaching a buffer from a vq, the avail.idx value should be
    decremented as well.
    
    This was noticed by hot-unplugging a virtio console port and then
    plugging in a new one on the same number (re-using the vqs which were
    just 'disowned').  qemu reported
    
       'Guest moved used index from 0 to 256'
    
    when any IO was attempted on the new port.
    
    CC: stable@kernel.org
    Reported-by: juzhang <juzhang@redhat.com>
    Signed-off-by: Amit Shah <amit.shah@redhat.com>
    Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>

The spec is quite explicit though:
	A driver MUST NOT decrement the available idx on a live virtqueue (ie. there is no way to “unexpose”
	buffers).





> If there is no such requirement, do you think it's OK
> to remove below two lines:
> 
> vq->avail_idx_shadow--;
> vq->vring.avail->idx = cpu_to_virtio16(_vq->vdev, vq->avail_idx_shadow);
> 
> from virtqueue_detach_unused_buf(), and we could have
> one generic function to handle both rings:
> 
> void *virtqueue_detach_unused_buf(struct virtqueue *_vq)
> {
> 	struct vring_virtqueue *vq = to_vvq(_vq);
> 	unsigned int num, i;
> 	void *buf;
> 
> 	START_USE(vq);
> 
> 	num = vq->packed ? vq->vring_packed.num : vq->vring.num;
> 
> 	for (i = 0; i < num; i++) {
> 		if (!vq->desc_state[i].data)
> 			continue;
> 		/* detach_buf clears data, so grab it now. */
> 		buf = vq->desc_state[i].data;
> 		detach_buf(vq, i, NULL);
> 		END_USE(vq);
> 		return buf;
> 	}
> 	/* That should have freed everything. */
> 	BUG_ON(vq->vq.num_free != num);
> 
> 	END_USE(vq);
> 	return NULL;
> }
> 
> Best regards,
> Tiwei Bie

^ permalink raw reply

* Re: [PATCH/RFC net-next 2/5] ravb: correct ptp does failure after suspend and resume
From: Sergei Shtylyov @ 2018-04-17 16:05 UTC (permalink / raw)
  To: Wolfram Sang, Simon Horman
  Cc: Magnus Damm, netdev, linux-renesas-soc, Wolfram Sang,
	Kazuya Mizuguchi
In-Reply-To: <20180417100532.tfjgm4hewmi42al4@ninjato>

On 04/17/2018 01:05 PM, Wolfram Sang wrote:

>> @@ -2302,6 +2305,7 @@ static int __maybe_unused ravb_resume(struct device *dev)
>>  {
>>  	struct net_device *ndev = dev_get_drvdata(dev);
>>  	struct ravb_private *priv = netdev_priv(ndev);
>> +	struct platform_device *pdev = priv->pdev;

   Could infer 'pdev' from 'dev' (avoiding the dereference)...

> Minor nit: I'd save this line...
> 
>> +	if (priv->chip_id != RCAR_GEN2)
>> +		ravb_ptp_init(ndev, pdev);
> 
> ... and use ravb_ptp_init(ndev, priv->pdev); here.

   Agreed, no dire need for the new variable used only once.

MBR, Sergei

^ permalink raw reply

* Re: [PATCH net,stable] tun: fix vlan packet truncation
From: Bjørn Mork @ 2018-04-17  8:08 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev
In-Reply-To: <bc626bc5-5d48-ef8e-29d8-56f401a054eb@redhat.com>

Jason Wang <jasowang@redhat.com> writes:

> Good catch,

Thanks.  I guess I am "lucky" as I apparently use configs no one else
are using, and therefore often experience breakage ;-)  

I have been using tagged interfaces over tap for an occasional virtual
"lab" of kvm machines for years. It's a simple way to have a flexible
netowrk between host and virtual machine.  So v4.16 broke my "lab". I'm
sorry it took me a while before I had time to investigate why.

> but I still think we should do the truncation in
> run_ebpf_filter to match the behavior socket ebpf filter.

Haven't looked at that code.  But I'd like to point out that any length
adjustments should be done next to the code adding or removing data.
Doing data and length modifications in different funtions is either a
bug, or a bug bound to happen.

Bjørn

^ permalink raw reply

* Re: [net-next V11 PATCH 00/17] XDP redirect memory return API
From: Jesper Dangaard Brouer @ 2018-04-17 16:27 UTC (permalink / raw)
  To: David Miller
  Cc: daniel, alexei.starovoitov, netdev, bjorn.topel, magnus.karlsson,
	eugenia, jasowang, john.fastabend, eranbe, saeedm, galp, borkmann,
	tariqt, brouer
In-Reply-To: <20180417.113548.676760038759107548.davem@davemloft.net>

On Tue, 17 Apr 2018 11:35:48 -0400 (EDT)
David Miller <davem@davemloft.net> wrote:

> From: Daniel Borkmann <daniel@iogearbox.net>
> Date: Tue, 17 Apr 2018 17:24:03 +0200
> 
> > On 04/17/2018 05:18 PM, David Miller wrote:  
> >> From: Alexei Starovoitov <alexei.starovoitov@gmail.com>
> >> Date: Tue, 17 Apr 2018 07:48:42 -0700
> >>   
> >>> On Tue, Apr 17, 2018 at 04:45:16PM +0200, Jesper Dangaard Brouer wrote:  
> >>>> Submitted against net-next, as it contains NIC driver changes.
> >>>>
> >>>> This patchset works towards supporting different XDP RX-ring memory
> >>>> allocators.  As this will be needed by the AF_XDP zero-copy mode.  
> >>  ...  
> >>> The series look good to me now.
> >>> Acked-by: Alexei Starovoitov <ast@kernel.org>  
> >> 
> >> Pushed out to net-next, thanks everyone!  
> > 
> > See my comment on the last patch which is actually buggy. But given it's
> > pushed out already, then it needs to be fixed as follow-up..  
> 
> Yes it will need to be dealt with as a follow-up.

Okay, will deal with this as a followup.

That was actually why I submitted V10 without the last patch, as I
though that required separate upstream review, which turned out to be
true.  And I was planning to submit it after V10 was accepted. (I did
have the last patch in "offlist" review with Alexei and Daniel (but I
guess Daniel didn't had time to review it, I just falsely assumed he
had looked at it)).

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Principal Kernel Engineer at Red Hat
  LinkedIn: http://www.linkedin.com/in/brouer

^ permalink raw reply

* [PATCH v2 net-next] net: introduce a new tracepoint for tcp_rcv_space_adjust
From: Yafang Shao @ 2018-04-17 16:36 UTC (permalink / raw)
  To: eric.dumazet, davem
  Cc: kuznet, yoshfuji, songliubraving, netdev, linux-kernel,
	Yafang Shao

tcp_rcv_space_adjust is called every time data is copied to user space,
introducing a tcp tracepoint for which could show us when the packet is
copied to user.
This could help us figure out whether there's latency in user process.

When a tcp packet arrives, tcp_rcv_established() will be called and with
the existed tracepoint tcp_probe we could get the time when this packet
arrives.
Then this packet will be copied to user, and tcp_rcv_space_adjust will
be called and with this new introduced tracepoint we could get the time
when this packet is copied to user.

        arrives time : user process time    => latency caused by user
        tcp_probe      tcp_rcv_space_adjust

Hence in the printk message, sk_cookie is printed as a key to relate
tcp_rcv_space_adjust with tcp_probe.

Maybe we could export sockfd in this new tracepoint as well, then we
could relate this new tracepoint with epoll/read/recv* tracepoints, and
finally that could show us the whole lifespan of this packet. But we
could also implement that with pid as these functions are executed in
process context.

Signed-off-by: Yafang Shao <laoar.shao@gmail.com>

---
v1 -> v2: use sk_cookie as key suggested by Eric.
---
 include/trace/events/tcp.h | 33 +++++++++++++++++++++++++++------
 net/ipv4/tcp_input.c       |  2 ++
 2 files changed, 29 insertions(+), 6 deletions(-)

diff --git a/include/trace/events/tcp.h b/include/trace/events/tcp.h
index 3dd6802..814f754 100644
--- a/include/trace/events/tcp.h
+++ b/include/trace/events/tcp.h
@@ -10,6 +10,7 @@
 #include <linux/tracepoint.h>
 #include <net/ipv6.h>
 #include <net/tcp.h>
+#include <linux/sock_diag.h>
 
 #define TP_STORE_V4MAPPED(__entry, saddr, daddr)		\
 	do {							\
@@ -125,6 +126,7 @@
 		__array(__u8, daddr, 4)
 		__array(__u8, saddr_v6, 16)
 		__array(__u8, daddr_v6, 16)
+		__field(__u64, sock_cookie)
 	),
 
 	TP_fast_assign(
@@ -144,12 +146,24 @@
 
 		TP_STORE_ADDRS(__entry, inet->inet_saddr, inet->inet_daddr,
 			       sk->sk_v6_rcv_saddr, sk->sk_v6_daddr);
+
+		/*
+		 * sk_cookie is used to identify a socket, with which we could
+		 * relate this tracepoint with other tracepoints,
+		 * i.e. tcp_probe.
+		 * If we needn't this relation, then sk_cookie is useless;
+		 * if we need this relation, then tcp_probe is already set,
+		 * and sk_cookie is already set in tcp_probe, so we could get
+		 * the value directly.
+		 */
+		__entry->sock_cookie = atomic64_read(&sk->sk_cookie);
 	),
 
-	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c",
+	TP_printk("sport=%hu dport=%hu saddr=%pI4 daddr=%pI4 saddrv6=%pI6c daddrv6=%pI6c sock_cookie=%llu",
 		  __entry->sport, __entry->dport,
 		  __entry->saddr, __entry->daddr,
-		  __entry->saddr_v6, __entry->daddr_v6)
+		  __entry->saddr_v6, __entry->daddr_v6,
+		  __entry->sock_cookie)
 );
 
 DEFINE_EVENT(tcp_event_sk, tcp_receive_reset,
@@ -166,6 +180,13 @@
 	TP_ARGS(sk)
 );
 
+DEFINE_EVENT(tcp_event_sk, tcp_rcv_space_adjust,
+
+	TP_PROTO(const struct sock *sk),
+
+	TP_ARGS(sk)
+);
+
 TRACE_EVENT(tcp_retransmit_synack,
 
 	TP_PROTO(const struct sock *sk, const struct request_sock *req),
@@ -232,6 +253,7 @@
 		__field(__u32, snd_wnd)
 		__field(__u32, srtt)
 		__field(__u32, rcv_wnd)
+		__field(__u64, sock_cookie)
 	),
 
 	TP_fast_assign(
@@ -256,15 +278,14 @@
 		__entry->rcv_wnd = tp->rcv_wnd;
 		__entry->ssthresh = tcp_current_ssthresh(sk);
 		__entry->srtt = tp->srtt_us >> 3;
+		__entry->sock_cookie = sock_gen_cookie(sk);
 	),
 
-	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x "
-		  "snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u "
-		  "rcv_wnd=%u",
+	TP_printk("src=%pISpc dest=%pISpc mark=%#x length=%d snd_nxt=%#x snd_una=%#x snd_cwnd=%u ssthresh=%u snd_wnd=%u srtt=%u rcv_wnd=%u sock_cookie=%llu",
 		  __entry->saddr, __entry->daddr, __entry->mark,
 		  __entry->length, __entry->snd_nxt, __entry->snd_una,
 		  __entry->snd_cwnd, __entry->ssthresh, __entry->snd_wnd,
-		  __entry->srtt, __entry->rcv_wnd)
+		  __entry->srtt, __entry->rcv_wnd, __entry->sock_cookie)
 );
 
 #endif /* _TRACE_TCP_H */
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index f93687f..43ad468 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -582,6 +582,8 @@ void tcp_rcv_space_adjust(struct sock *sk)
 	u32 copied;
 	int time;
 
+	trace_tcp_rcv_space_adjust(sk);
+
 	tcp_mstamp_refresh(tp);
 	time = tcp_stamp_us_delta(tp->tcp_mstamp, tp->rcvq_space.time);
 	if (time < (tp->rcv_rtt_est.rtt_us >> 3) || tp->rcv_rtt_est.rtt_us == 0)
-- 
1.8.3.1

^ permalink raw reply related

* Re: SRIOV switchdev mode BoF minutes
From: Samudrala, Sridhar @ 2018-04-17 16:46 UTC (permalink / raw)
  To: Andy Gospodarek, Or Gerlitz
  Cc: David Miller, Anjali Singhai Jain, Michael Chan, Simon Horman,
	Jakub Kicinski, John Fastabend, Saeed Mahameed, Jiri Pirko,
	Rony Efraim, Linux Netdev List
In-Reply-To: <20180417144700.GJ33938@C02RW35GFVH8.dhcp.broadcom.net>

On 4/17/2018 7:47 AM, Andy Gospodarek wrote:
> On Tue, Apr 17, 2018 at 04:58:05PM +0300, Or Gerlitz wrote:
>> On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek
>> <andrew.gospodarek@broadcom.com> wrote:
>>> On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
>>>> On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
>>>>> On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
>>>>>> On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
>>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>
>>>>>>> I meant between PFs on 2 compute nodes.
>>>>>> If the PF serves as uplink rep, it functions as  a switch port -- applications
>>>>>> don't run on switch ports. One way to get apps to run on the host in switchdev
>>>>>> mode is probe one of the VFs there.
>>>>>>
>>>>>>
>>>>>>
>>>> So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
>>>> seen on the host, no more PF netdev.
>>> That is not the functionality I would propose.  The PF netdev will still be there.
>> Andy,
>>
>> Basically LGTM, so even in smartnic configs, the PF @ the host is
>> still privileged to
>> create/destroy VFs or provision MACs for them even if it is not the
>> e-switch manager
>> anymore?
> Yes, in a SmartNIC world one config we aim to have is that a host can create
> and destroy VFs as needed.  One of the challenges is how the VF reps are
> managed by applications in the SmartNIC when the host could make them
> disappear.

OK. So are we saying that in 'switchdev' mode with 2 VFs and 1 uplink, the host will
see PF netdev, 2 vf-rep netdev's corresponding to 2 VFs and 1 uplink-rep netdev.

Is PF netdev used only for the control/configure of the VFs? If it used as a datapath,
i think we need a pf-rep netdev too.

^ permalink raw reply

* Re: SRIOV switchdev mode BoF minutes
From: Andy Gospodarek @ 2018-04-17 16:53 UTC (permalink / raw)
  To: Samudrala, Sridhar
  Cc: Andy Gospodarek, Or Gerlitz, David Miller, Anjali Singhai Jain,
	Michael Chan, Simon Horman, Jakub Kicinski, John Fastabend,
	Saeed Mahameed, Jiri Pirko, Rony Efraim, Linux Netdev List
In-Reply-To: <a922a288-f81e-290c-0f44-c7405e664dfe@intel.com>

On Tue, Apr 17, 2018 at 09:46:38AM -0700, Samudrala, Sridhar wrote:
> On 4/17/2018 7:47 AM, Andy Gospodarek wrote:
> > On Tue, Apr 17, 2018 at 04:58:05PM +0300, Or Gerlitz wrote:
> > > On Tue, Apr 17, 2018 at 4:30 PM, Andy Gospodarek
> > > <andrew.gospodarek@broadcom.com> wrote:
> > > > On Mon, Apr 16, 2018 at 07:08:39PM -0700, Samudrala, Sridhar wrote:
> > > > > On 4/16/2018 5:39 AM, Andy Gospodarek wrote:
> > > > > > On Sun, Apr 15, 2018 at 09:01:16AM +0300, Or Gerlitz wrote:
> > > > > > > On Sat, Apr 14, 2018 at 2:03 AM, Samudrala, Sridhar
> > > > > > > <sridhar.samudrala@intel.com> wrote:
> > > > > > > 
> > > > > > > > I meant between PFs on 2 compute nodes.
> > > > > > > If the PF serves as uplink rep, it functions as  a switch port -- applications
> > > > > > > don't run on switch ports. One way to get apps to run on the host in switchdev
> > > > > > > mode is probe one of the VFs there.
> > > > > > > 
> > > > > > > 
> > > > > > > 
> > > > > So once a pci device is configured in 'switchdev' mode,  only port representor netdevs are
> > > > > seen on the host, no more PF netdev.
> > > > That is not the functionality I would propose.  The PF netdev will still be there.
> > > Andy,
> > > 
> > > Basically LGTM, so even in smartnic configs, the PF @ the host is
> > > still privileged to
> > > create/destroy VFs or provision MACs for them even if it is not the
> > > e-switch manager
> > > anymore?
> > Yes, in a SmartNIC world one config we aim to have is that a host can create
> > and destroy VFs as needed.  One of the challenges is how the VF reps are
> > managed by applications in the SmartNIC when the host could make them
> > disappear.
> 
> OK. So are we saying that in 'switchdev' mode with 2 VFs and 1 uplink, the host will
> see PF netdev, 2 vf-rep netdev's corresponding to 2 VFs and 1 uplink-rep netdev.
> 
> Is PF netdev used only for the control/configure of the VFs? If it used as a datapath,
> i think we need a pf-rep netdev too.
> 

Yes, that is correct.  PF reps could be used for datapath configuration to
redirect traffic to a PF.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox