* [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
[not found] ` <1473252152-11379-2-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-07 19:18 ` Jesper Dangaard Brouer via iovisor-dev
2016-09-07 12:42 ` [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages Saeed Mahameed
` (10 subsequent siblings)
11 siblings, 2 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
From: Tariq Toukan <tariqt@mellanox.com>
To improve the memory consumption scheme, we omit the flow that
demands and splits high-order pages in Striding RQ, and stay
with a single Striding RQ flow that uses order-0 pages.
Moving to fragmented memory allows the use of larger MPWQEs,
which reduces the number of UMR posts and filler CQEs.
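A rough sketch of the sizing arithmetic behind this change, using the constants touched in en.h (MLX5_MPWRQ_LOG_WQE_SZ 17 -> 18, MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4 -> 0x3). The helper names and the 4 KiB page size are assumptions made for illustration; they are not part of the driver.

```c
#include <assert.h>

/* Assumption: 4 KiB pages (PAGE_SHIFT == 12). */
#define DEMO_PAGE_SHIFT 12

/* Pages backing one MPWQE, mirroring the MLX5_MPWRQ_PAGES_PER_WQE macro. */
static unsigned int demo_pages_per_wqe(unsigned int log_wqe_sz)
{
	unsigned int order;

	assert(log_wqe_sz < 32);
	order = log_wqe_sz > DEMO_PAGE_SHIFT ? log_wqe_sz - DEMO_PAGE_SHIFT : 0;
	return 1u << order;
}

/* Number of MPWQEs in a ring of the given log size. */
static unsigned int demo_wqes_in_ring(unsigned int log_rq_size)
{
	assert(log_rq_size < 32);
	return 1u << log_rq_size;
}

/* Total RX buffer bytes: WQE size times number of WQEs. */
static unsigned long long demo_ring_bytes(unsigned int log_wqe_sz,
					  unsigned int log_rq_size)
{
	assert(log_wqe_sz + log_rq_size < 64);
	return 1ull << (log_wqe_sz + log_rq_size);
}
```

Under these assumptions the default ring goes from 16 WQEs of 32 pages each to 8 WQEs of 64 pages each, so the default buffer stays at 2 MiB while each UMR post covers twice as many pages.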
Moving to a single flow allows several optimizations that improve
performance, especially in production servers where we would
fall back to order-0 allocations anyway:
- inline functions that were called via function pointers.
- improve the UMR post process.
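The second item can be pictured as a "build once, copy per post" pattern: the constant parts of the UMR WQE are prepared at ring creation, and the fast path only copies the template and patches the field that depends on the producer counter. The sketch below is a simplified user-space model with hypothetical names (demo_umr_wqe, demo_build_template, demo_post), illustrative constant values, and no endianness conversion; it is not the driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct mlx5e_umr_wqe. */
struct demo_umr_wqe {
	uint32_t opmod_idx_opcode;	/* changes on every post */
	uint32_t qpn_ds;		/* constant per send queue */
	uint64_t mtt_addr;		/* constant per RX WQE */
};

#define DEMO_OPCODE_UMR		0x25u	/* illustrative value only */
#define DEMO_WQE_INDEX_SHIFT	8	/* illustrative value only */

/* Slow path, run once per RX WQE at ring creation. */
static void demo_build_template(struct demo_umr_wqe *tmpl,
				uint32_t qpn_ds, uint64_t mtt_addr)
{
	assert(tmpl);
	memset(tmpl, 0, sizeof(*tmpl));
	tmpl->qpn_ds = qpn_ds;
	tmpl->mtt_addr = mtt_addr;
}

/* Fast path: copy the template, then set the only per-post field. */
static void demo_post(struct demo_umr_wqe *slot,
		      const struct demo_umr_wqe *tmpl, uint16_t pc)
{
	assert(slot && tmpl);
	memcpy(slot, tmpl, sizeof(*slot));
	slot->opmod_idx_opcode =
		((uint32_t)pc << DEMO_WQE_INDEX_SHIFT) | DEMO_OPCODE_UMR;
}
```

The trade-off is one extra template per RX WQE in exchange for not rebuilding the control and data segments on every post.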
This patch alone is expected to give a slight performance reduction.
However, the new memory scheme makes it possible to use a page-cache
of a fair size that doesn't inflate the memory footprint, which is
expected to recover the reduction and even give a net gain.
We ran pktgen single-stream benchmarks, with iptables-raw-drop:
Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - this patch
no reduction
Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - this patch
3.5% reduction
Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - this patch
4% reduction
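For reference, the percentages above follow from the usual relative-difference formula. A minimal helper (the name demo_reduction_pct is made up for this note) that reproduces them from the listed figures:

```c
#include <assert.h>

/* Relative reduction of 'patched' versus 'baseline', in percent.
 * A negative result means the patched figure is higher.
 */
static double demo_reduction_pct(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

Applied to the three cases, this gives roughly -0.2% (a slight increase), 3.4% and 4.1%, which matches the rounded values quoted.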
Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 ++++++++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 ++++-----------------
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 -
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +-
5 files changed, 184 insertions(+), 304 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index bf722aa..075cdfc 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -62,12 +62,12 @@
#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE 0xd
#define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x1
-#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4
+#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3
#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6
#define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */
#define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW restriction */
-#define MLX5_MPWRQ_LOG_WQE_SZ 17
+#define MLX5_MPWRQ_LOG_WQE_SZ 18
#define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
#define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
@@ -293,8 +293,8 @@ struct mlx5e_rq {
u32 wqe_sz;
struct sk_buff **skb;
struct mlx5e_mpw_info *wqe_info;
+ void *mtt_no_align;
__be32 mkey_be;
- __be32 umr_mkey_be;
struct device *pdev;
struct net_device *netdev;
@@ -323,32 +323,15 @@ struct mlx5e_rq {
struct mlx5e_umr_dma_info {
__be64 *mtt;
- __be64 *mtt_no_align;
dma_addr_t mtt_addr;
- struct mlx5e_dma_info *dma_info;
+ struct mlx5e_dma_info dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
+ struct mlx5e_umr_wqe wqe;
};
struct mlx5e_mpw_info {
- union {
- struct mlx5e_dma_info dma_info;
- struct mlx5e_umr_dma_info umr;
- };
+ struct mlx5e_umr_dma_info umr;
u16 consumed_strides;
u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
-
- void (*dma_pre_sync)(struct device *pdev,
- struct mlx5e_mpw_info *wi,
- u32 wqe_offset, u32 len);
- void (*add_skb_frag)(struct mlx5e_rq *rq,
- struct sk_buff *skb,
- struct mlx5e_mpw_info *wi,
- u32 page_idx, u32 frag_offset, u32 len);
- void (*copy_skb_header)(struct device *pdev,
- struct sk_buff *skb,
- struct mlx5e_mpw_info *wi,
- u32 page_idx, u32 offset,
- u32 headlen);
- void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
};
struct mlx5e_tx_wqe_info {
@@ -706,24 +689,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
-void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
-void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
- struct mlx5_cqe64 *cqe,
- u16 byte_cnt,
- struct mlx5e_mpw_info *wi,
- struct sk_buff *skb);
-void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
- struct mlx5_cqe64 *cqe,
- u16 byte_cnt,
- struct mlx5e_mpw_info *wi,
- struct sk_buff *skb);
-void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
- struct mlx5e_mpw_info *wi);
-void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
- struct mlx5e_mpw_info *wi);
+void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq);
+void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
void mlx5e_rx_am(struct mlx5e_rq *rq);
@@ -810,6 +780,12 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, NULL, cq->wq.cc);
}
+static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
+{
+ return rq->mpwqe_mtt_offset +
+ wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
+}
+
static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
{
return min_t(int, mdev->priv.eq_table.num_comp_vectors,
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 2459c7f..0db4d3b 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -138,7 +138,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
s->rx_wqe_err += rq_stats->wqe_err;
s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
- s->rx_mpwqe_frag += rq_stats->mpwqe_frag;
s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
@@ -298,6 +297,107 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
#define MLX5E_HW2SW_MTU(hwmtu) (hwmtu - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
#define MLX5E_SW2HW_MTU(swmtu) (swmtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
+static inline int mlx5e_get_wqe_mtt_sz(void)
+{
+ /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
+ * To avoid copying garbage after the mtt array, we allocate
+ * a little more.
+ */
+ return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
+ MLX5_UMR_MTT_ALIGNMENT);
+}
+
+static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
+ struct mlx5e_umr_wqe *wqe, u16 ix)
+{
+ struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+ struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
+ struct mlx5_wqe_data_seg *dseg = &wqe->data;
+ struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+ u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
+ u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
+
+ cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
+ ds_cnt);
+ cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+ cseg->imm = rq->mkey_be;
+
+ ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
+ ucseg->klm_octowords =
+ cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
+ ucseg->bsf_octowords =
+ cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
+ ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
+
+ dseg->lkey = sq->mkey_be;
+ dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
+}
+
+static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
+ struct mlx5e_channel *c)
+{
+ int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
+ int mtt_sz = mlx5e_get_wqe_mtt_sz();
+ int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
+ int i;
+
+ rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
+ GFP_KERNEL, cpu_to_node(c->cpu));
+ if (!rq->wqe_info)
+ goto err_out;
+
+ /* We allocate more than mtt_sz as we will align the pointer */
+ rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
+ cpu_to_node(c->cpu));
+ if (unlikely(!rq->mtt_no_align))
+ goto err_free_wqe_info;
+
+ for (i = 0; i < wq_sz; i++) {
+ struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+
+ wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
+ MLX5_UMR_ALIGN);
+ wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
+ PCI_DMA_TODEVICE);
+ if (unlikely(dma_mapping_error(c->pdev, wi->umr.mtt_addr)))
+ goto err_unmap_mtts;
+
+ mlx5e_build_umr_wqe(rq, &c->icosq, &wi->umr.wqe, i);
+ }
+
+ return 0;
+
+err_unmap_mtts:
+ while (--i >= 0) {
+ struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+
+ dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
+ PCI_DMA_TODEVICE);
+ }
+ kfree(rq->mtt_no_align);
+err_free_wqe_info:
+ kfree(rq->wqe_info);
+
+err_out:
+ return -ENOMEM;
+}
+
+static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
+{
+ int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
+ int mtt_sz = mlx5e_get_wqe_mtt_sz();
+ int i;
+
+ for (i = 0; i < wq_sz; i++) {
+ struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+
+ dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
+ PCI_DMA_TODEVICE);
+ }
+ kfree(rq->mtt_no_align);
+ kfree(rq->wqe_info);
+}
+
static int mlx5e_create_rq(struct mlx5e_channel *c,
struct mlx5e_rq_param *param,
struct mlx5e_rq *rq)
@@ -322,14 +422,16 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
wq_sz = mlx5_wq_ll_get_size(&rq->wq);
+ rq->wq_type = priv->params.rq_wq_type;
+ rq->pdev = c->pdev;
+ rq->netdev = c->netdev;
+ rq->tstamp = &priv->tstamp;
+ rq->channel = c;
+ rq->ix = c->ix;
+ rq->priv = c->priv;
+
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
- rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
- GFP_KERNEL, cpu_to_node(c->cpu));
- if (!rq->wqe_info) {
- err = -ENOMEM;
- goto err_rq_wq_destroy;
- }
rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
@@ -341,6 +443,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
byte_count = rq->wqe_sz;
+ rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
+ err = mlx5e_rq_alloc_mpwqe_info(rq, c);
+ if (err)
+ goto err_rq_wq_destroy;
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
@@ -359,27 +465,19 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
byte_count = rq->wqe_sz;
byte_count |= MLX5_HW_START_PADDING;
+ rq->mkey_be = c->mkey_be;
}
for (i = 0; i < wq_sz; i++) {
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
wqe->data.byte_count = cpu_to_be32(byte_count);
+ wqe->data.lkey = rq->mkey_be;
}
INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
rq->am.mode = priv->params.rx_cq_period_mode;
- rq->wq_type = priv->params.rq_wq_type;
- rq->pdev = c->pdev;
- rq->netdev = c->netdev;
- rq->tstamp = &priv->tstamp;
- rq->channel = c;
- rq->ix = c->ix;
- rq->priv = c->priv;
- rq->mkey_be = c->mkey_be;
- rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
-
return 0;
err_rq_wq_destroy:
@@ -392,7 +490,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
{
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
- kfree(rq->wqe_info);
+ mlx5e_rq_free_mpwqe_info(rq);
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
kfree(rq->skb);
@@ -530,7 +628,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
/* UMR WQE (if in progress) is always at wq->head */
if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
- mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
+ mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
while (!mlx5_wq_ll_is_empty(wq)) {
wqe_ix_be = *wq->tail_next;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index b6f8ebb..8ad4d32 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -200,7 +200,6 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
*((dma_addr_t *)skb->cb) = dma_addr;
wqe->data.addr = cpu_to_be64(dma_addr);
- wqe->data.lkey = rq->mkey_be;
rq->skb[ix] = skb;
@@ -231,44 +230,11 @@ static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
return rq->mpwqe_num_strides >> MLX5_MPWRQ_WQE_PAGE_ORDER;
}
-static inline void
-mlx5e_dma_pre_sync_linear_mpwqe(struct device *pdev,
- struct mlx5e_mpw_info *wi,
- u32 wqe_offset, u32 len)
-{
- dma_sync_single_for_cpu(pdev, wi->dma_info.addr + wqe_offset,
- len, DMA_FROM_DEVICE);
-}
-
-static inline void
-mlx5e_dma_pre_sync_fragmented_mpwqe(struct device *pdev,
- struct mlx5e_mpw_info *wi,
- u32 wqe_offset, u32 len)
-{
- /* No dma pre sync for fragmented MPWQE */
-}
-
-static inline void
-mlx5e_add_skb_frag_linear_mpwqe(struct mlx5e_rq *rq,
- struct sk_buff *skb,
- struct mlx5e_mpw_info *wi,
- u32 page_idx, u32 frag_offset,
- u32 len)
-{
- unsigned int truesize = ALIGN(len, rq->mpwqe_stride_sz);
-
- wi->skbs_frags[page_idx]++;
- skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
- &wi->dma_info.page[page_idx], frag_offset,
- len, truesize);
-}
-
-static inline void
-mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
- struct sk_buff *skb,
- struct mlx5e_mpw_info *wi,
- u32 page_idx, u32 frag_offset,
- u32 len)
+static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
+ struct sk_buff *skb,
+ struct mlx5e_mpw_info *wi,
+ u32 page_idx, u32 frag_offset,
+ u32 len)
{
unsigned int truesize = ALIGN(len, rq->mpwqe_stride_sz);
@@ -282,24 +248,11 @@ mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
}
static inline void
-mlx5e_copy_skb_header_linear_mpwqe(struct device *pdev,
- struct sk_buff *skb,
- struct mlx5e_mpw_info *wi,
- u32 page_idx, u32 offset,
- u32 headlen)
-{
- struct page *page = &wi->dma_info.page[page_idx];
-
- skb_copy_to_linear_data(skb, page_address(page) + offset,
- ALIGN(headlen, sizeof(long)));
-}
-
-static inline void
-mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
- struct sk_buff *skb,
- struct mlx5e_mpw_info *wi,
- u32 page_idx, u32 offset,
- u32 headlen)
+mlx5e_copy_skb_header_mpwqe(struct device *pdev,
+ struct sk_buff *skb,
+ struct mlx5e_mpw_info *wi,
+ u32 page_idx, u32 offset,
+ u32 headlen)
{
u16 headlen_pg = min_t(u32, headlen, PAGE_SIZE - offset);
struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[page_idx];
@@ -324,46 +277,9 @@ mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
}
}
-static u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
-{
- return rq->mpwqe_mtt_offset +
- wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
-}
-
-static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
- struct mlx5e_sq *sq,
- struct mlx5e_umr_wqe *wqe,
- u16 ix)
+static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
{
- struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
- struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
- struct mlx5_wqe_data_seg *dseg = &wqe->data;
struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
- u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
- u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
-
- memset(wqe, 0, sizeof(*wqe));
- cseg->opmod_idx_opcode =
- cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
- MLX5_OPCODE_UMR);
- cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
- ds_cnt);
- cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
- cseg->imm = rq->umr_mkey_be;
-
- ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
- ucseg->klm_octowords =
- cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
- ucseg->bsf_octowords =
- cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
- ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
-
- dseg->lkey = sq->mkey_be;
- dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
-}
-
-static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
-{
struct mlx5e_sq *sq = &rq->channel->icosq;
struct mlx5_wq_cyc *wq = &sq->wq;
struct mlx5e_umr_wqe *wqe;
@@ -378,30 +294,22 @@ static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
}
wqe = mlx5_wq_cyc_get_wqe(wq, pi);
- mlx5e_build_umr_wqe(rq, sq, wqe, ix);
+ memcpy(wqe, &wi->umr.wqe, sizeof(*wqe));
+ wqe->ctrl.opmod_idx_opcode =
+ cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
+ MLX5_OPCODE_UMR);
+
sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
sq->pc += num_wqebbs;
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
}
-static inline int mlx5e_get_wqe_mtt_sz(void)
-{
- /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
- * To avoid copying garbage after the mtt array, we allocate
- * a little more.
- */
- return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
- MLX5_UMR_MTT_ALIGNMENT);
-}
-
-static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
- struct mlx5e_mpw_info *wi,
- int i)
+static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
+ struct mlx5e_mpw_info *wi,
+ int i)
{
- struct page *page;
-
- page = dev_alloc_page();
+ struct page *page = dev_alloc_page();
if (unlikely(!page))
return -ENOMEM;
@@ -417,47 +325,25 @@ static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
return 0;
}
-static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
- struct mlx5e_rx_wqe *wqe,
- u16 ix)
+static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
+ struct mlx5e_rx_wqe *wqe,
+ u16 ix)
{
struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
- int mtt_sz = mlx5e_get_wqe_mtt_sz();
u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
+ int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
+ int err;
int i;
- wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
- MLX5_MPWRQ_PAGES_PER_WQE,
- GFP_ATOMIC);
- if (unlikely(!wi->umr.dma_info))
- goto err_out;
-
- /* We allocate more than mtt_sz as we will align the pointer */
- wi->umr.mtt_no_align = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
- GFP_ATOMIC);
- if (unlikely(!wi->umr.mtt_no_align))
- goto err_free_umr;
-
- wi->umr.mtt = PTR_ALIGN(wi->umr.mtt_no_align, MLX5_UMR_ALIGN);
- wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
- PCI_DMA_TODEVICE);
- if (unlikely(dma_mapping_error(rq->pdev, wi->umr.mtt_addr)))
- goto err_free_mtt;
-
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
- if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i)))
+ err = mlx5e_alloc_and_map_page(rq, wi, i);
+ if (unlikely(err))
goto err_unmap;
- page_ref_add(wi->umr.dma_info[i].page,
- mlx5e_mpwqe_strides_per_page(rq));
+ page_ref_add(wi->umr.dma_info[i].page, pg_strides);
wi->skbs_frags[i] = 0;
}
wi->consumed_strides = 0;
- wi->dma_pre_sync = mlx5e_dma_pre_sync_fragmented_mpwqe;
- wi->add_skb_frag = mlx5e_add_skb_frag_fragmented_mpwqe;
- wi->copy_skb_header = mlx5e_copy_skb_header_fragmented_mpwqe;
- wi->free_wqe = mlx5e_free_rx_fragmented_mpwqe;
- wqe->data.lkey = rq->umr_mkey_be;
wqe->data.addr = cpu_to_be64(dma_offset);
return 0;
@@ -466,41 +352,28 @@ err_unmap:
while (--i >= 0) {
dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
PCI_DMA_FROMDEVICE);
- page_ref_sub(wi->umr.dma_info[i].page,
- mlx5e_mpwqe_strides_per_page(rq));
+ page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
put_page(wi->umr.dma_info[i].page);
}
- dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
-
-err_free_mtt:
- kfree(wi->umr.mtt_no_align);
-
-err_free_umr:
- kfree(wi->umr.dma_info);
-err_out:
- return -ENOMEM;
+ return err;
}
-void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
- struct mlx5e_mpw_info *wi)
+void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
{
- int mtt_sz = mlx5e_get_wqe_mtt_sz();
+ int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
int i;
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
PCI_DMA_FROMDEVICE);
page_ref_sub(wi->umr.dma_info[i].page,
- mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
+ pg_strides - wi->skbs_frags[i]);
put_page(wi->umr.dma_info[i].page);
}
- dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
- kfree(wi->umr.mtt_no_align);
- kfree(wi->umr.dma_info);
}
-void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
+void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
{
struct mlx5_wq_ll *wq = &rq->wq;
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
@@ -508,12 +381,11 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
- mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
+ mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
return;
}
mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
- rq->stats.mpwqe_frag++;
/* ensure wqes are visible to device before updating doorbell record */
dma_wmb();
@@ -521,84 +393,23 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
mlx5_wq_ll_update_db_record(wq);
}
-static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
- struct mlx5e_rx_wqe *wqe,
- u16 ix)
-{
- struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
- gfp_t gfp_mask;
- int i;
-
- gfp_mask = GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC;
- wi->dma_info.page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
- MLX5_MPWRQ_WQE_PAGE_ORDER);
- if (unlikely(!wi->dma_info.page))
- return -ENOMEM;
-
- wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
- rq->wqe_sz, PCI_DMA_FROMDEVICE);
- if (unlikely(dma_mapping_error(rq->pdev, wi->dma_info.addr))) {
- put_page(wi->dma_info.page);
- return -ENOMEM;
- }
-
- /* We split the high-order page into order-0 ones and manage their
- * reference counter to minimize the memory held by small skb fragments
- */
- split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER);
- for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
- page_ref_add(&wi->dma_info.page[i],
- mlx5e_mpwqe_strides_per_page(rq));
- wi->skbs_frags[i] = 0;
- }
-
- wi->consumed_strides = 0;
- wi->dma_pre_sync = mlx5e_dma_pre_sync_linear_mpwqe;
- wi->add_skb_frag = mlx5e_add_skb_frag_linear_mpwqe;
- wi->copy_skb_header = mlx5e_copy_skb_header_linear_mpwqe;
- wi->free_wqe = mlx5e_free_rx_linear_mpwqe;
- wqe->data.lkey = rq->mkey_be;
- wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
-
- return 0;
-}
-
-void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
- struct mlx5e_mpw_info *wi)
-{
- int i;
-
- dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
- PCI_DMA_FROMDEVICE);
- for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
- page_ref_sub(&wi->dma_info.page[i],
- mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
- put_page(&wi->dma_info.page[i]);
- }
-}
-
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
{
int err;
- err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
- if (unlikely(err)) {
- err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
- if (unlikely(err))
- return err;
- set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
- mlx5e_post_umr_wqe(rq, ix);
- return -EBUSY;
- }
-
- return 0;
+ err = mlx5e_alloc_rx_umr_mpwqe(rq, wqe, ix);
+ if (unlikely(err))
+ return err;
+ set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
+ mlx5e_post_umr_wqe(rq, ix);
+ return -EBUSY;
}
void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
{
struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
- wi->free_wqe(rq, wi);
+ mlx5e_free_rx_mpwqe(rq, wi);
}
#define RQ_CANNOT_POST(rq) \
@@ -617,9 +428,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
int err;
err = rq->alloc_wqe(rq, wqe, wq->head);
+ if (err == -EBUSY)
+ return true;
if (unlikely(err)) {
- if (err != -EBUSY)
- rq->stats.buff_alloc_err++;
+ rq->stats.buff_alloc_err++;
break;
}
@@ -823,7 +635,6 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
u32 cqe_bcnt,
struct sk_buff *skb)
{
- u32 consumed_bytes = ALIGN(cqe_bcnt, rq->mpwqe_stride_sz);
u16 stride_ix = mpwrq_get_cqe_stride_index(cqe);
u32 wqe_offset = stride_ix * rq->mpwqe_stride_sz;
u32 head_offset = wqe_offset & (PAGE_SIZE - 1);
@@ -837,21 +648,20 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
page_idx++;
frag_offset -= PAGE_SIZE;
}
- wi->dma_pre_sync(rq->pdev, wi, wqe_offset, consumed_bytes);
while (byte_cnt) {
u32 pg_consumed_bytes =
min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
- wi->add_skb_frag(rq, skb, wi, page_idx, frag_offset,
- pg_consumed_bytes);
+ mlx5e_add_skb_frag_mpwqe(rq, skb, wi, page_idx, frag_offset,
+ pg_consumed_bytes);
byte_cnt -= pg_consumed_bytes;
frag_offset = 0;
page_idx++;
}
/* copy header */
- wi->copy_skb_header(rq->pdev, skb, wi, head_page_idx, head_offset,
- headlen);
+ mlx5e_copy_skb_header_mpwqe(rq->pdev, skb, wi, head_page_idx,
+ head_offset, headlen);
/* skb linear part was allocated with headlen and aligned to long */
skb->tail += headlen;
skb->len += headlen;
@@ -896,7 +706,7 @@ mpwrq_cqe_out:
if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
return;
- wi->free_wqe(rq, wi);
+ mlx5e_free_rx_mpwqe(rq, wi);
mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 499487c..1f56543 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -73,7 +73,6 @@ struct mlx5e_sw_stats {
u64 tx_xmit_more;
u64 rx_wqe_err;
u64 rx_mpwqe_filler;
- u64 rx_mpwqe_frag;
u64 rx_buff_alloc_err;
u64 rx_cqe_compress_blks;
u64 rx_cqe_compress_pkts;
@@ -105,7 +104,6 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
- { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
@@ -274,7 +272,6 @@ struct mlx5e_rq_stats {
u64 lro_bytes;
u64 wqe_err;
u64 mpwqe_filler;
- u64 mpwqe_frag;
u64 buff_alloc_err;
u64 cqe_compress_blks;
u64 cqe_compress_pkts;
@@ -290,7 +287,6 @@ static const struct counter_desc rq_stats_desc[] = {
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler) },
- { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_frag) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 9bf33bb..08d8b0c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -87,7 +87,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
case MLX5_OPCODE_NOP:
break;
case MLX5_OPCODE_UMR:
- mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
+ mlx5e_post_rx_mpwqe(&sq->channel->rq);
break;
default:
WARN_ONCE(true,
--
2.7.4
* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
[not found] ` <1473252152-11379-2-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-07 17:31 ` Alexei Starovoitov via iovisor-dev
[not found] ` <20160907173131.GA64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
0 siblings, 1 reply; 72+ messages in thread
From: Alexei Starovoitov via iovisor-dev @ 2016-09-07 17:31 UTC
To: Saeed Mahameed
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
Eric Dumazet, Tom Herbert
On Wed, Sep 07, 2016 at 03:42:22PM +0300, Saeed Mahameed wrote:
> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.
>
> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
>
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
>
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
>
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
>
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
>
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
imo it's not a realistic use case, but would be good to mention that
patch 3 brings performance back for this use case anyway.
* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
[not found] ` <1473252152-11379-2-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-07 19:18 ` Jesper Dangaard Brouer via iovisor-dev
0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-07 19:18 UTC
To: Saeed Mahameed
Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco,
Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, brouer, linux-mm
On Wed, 7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
> From: Tariq Toukan <tariqt@mellanox.com>
>
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.
Thank you for doing this! MM-list people thank you!
For others to understand what this means: This driver was doing
split_page() on high-order pages (for Striding RQ). This was really bad
because it fragments the page allocator and quickly depletes the
available high-order pages.
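To put a number on "high-order": a small user-space analogue of the kernel's get_order() shows what a single contiguous allocation per WQE would require. The helper name and the 4 KiB page size are assumptions made for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption: 4 KiB pages. */
#define DEMO_PAGE_SHIFT 12

/* Smallest order such that (page size << order) >= size. */
static unsigned int demo_get_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while (((size_t)1 << (DEMO_PAGE_SHIFT + order)) < size)
		order++;
	return order;
}
```

With the pre-patch WQE size of 2^17 bytes this yields order 5, i.e. 32 physically contiguous pages per WQE; the order-0 flow instead asks for 32 independent single pages, which the allocator can satisfy even when memory is fragmented.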
(I've left the rest of the patch intact below, in case some MM people
are interested in looking at the changes.)
There is even a funny comment in split_page() relevant to this:
/* [...]
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
*/
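To put numbers on the high-order allocations, here is a small userspace sketch (not driver code) of the arithmetic behind MLX5_MPWRQ_WQE_PAGE_ORDER and MLX5_MPWRQ_PAGES_PER_WQE from the quoted en.h hunk. It assumes 4 KiB pages, and the sketch_* names are made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB pages (PAGE_SHIFT == 12), as on x86-64. */
#define SKETCH_PAGE_SHIFT 12

/* Mirrors MLX5_MPWRQ_WQE_PAGE_ORDER: log2 of pages covered by one MPWQE. */
static inline unsigned int sketch_wqe_page_order(unsigned int log_wqe_sz)
{
	return log_wqe_sz > SKETCH_PAGE_SHIFT ?
	       log_wqe_sz - SKETCH_PAGE_SHIFT : 0;
}

/* Mirrors MLX5_MPWRQ_PAGES_PER_WQE: BIT(page order). */
static inline unsigned int sketch_pages_per_wqe(unsigned int log_wqe_sz)
{
	assert(log_wqe_sz < 32);
	return 1u << sketch_wqe_page_order(log_wqe_sz);
}

/* Total RX buffer bytes for a ring of 2^log_rq_size MPWQEs. */
static inline uint64_t sketch_ring_bytes(unsigned int log_rq_size,
					 unsigned int log_wqe_sz)
{
	return (uint64_t)1 << (log_rq_size + log_wqe_sz);
}
```

Under that assumption, the old value of 17 meant one order-5 allocation (32 contiguous pages) per WQE in the linear flow, while the new value of 18 spans 64 pages that are allocated one order-0 page at a time. Since the default ring size drops from 2^4 to 2^3 WQEs, sketch_ring_bytes() gives 2 MiB in both cases.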
> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
>
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
>
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
>
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
>
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
>
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
>
Well, the reduction does not really matter that much, because your
baseline benchmarks are from a freshly booted system, where you have
not fragmented and depleted the high-order pages yet... ;-)
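For reference, the quoted percentages can be recomputed from the raw pktgen figures with a helper like the one below. This is only a sketch; sketch_reduction_pct is a made-up name, and it assumes both figures are rates in the same unit.

```c
#include <assert.h>

/* Relative drop from the baseline rate to the patched rate, in percent.
 * A negative result means the patched rate is higher than the baseline.
 */
static inline double sketch_reduction_pct(double baseline, double patched)
{
	assert(baseline > 0.0);
	return (baseline - patched) / baseline * 100.0;
}
```

Fed with the numbers above, this gives roughly 3.4% for the 1024-byte case and roughly 4.1% for the 1500-byte case, and a gain of about 0.2% for the 64-byte case.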
> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++--
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 ++++++++--
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 ++++-----------------
> drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 -
> drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +-
> 5 files changed, 184 insertions(+), 304 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index bf722aa..075cdfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -62,12 +62,12 @@
> #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE 0xd
>
> #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x1
> -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4
> +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3
> #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6
>
> #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */
> #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW restriction */
> -#define MLX5_MPWRQ_LOG_WQE_SZ 17
> +#define MLX5_MPWRQ_LOG_WQE_SZ 18
> #define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
> MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
> #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
> @@ -293,8 +293,8 @@ struct mlx5e_rq {
> u32 wqe_sz;
> struct sk_buff **skb;
> struct mlx5e_mpw_info *wqe_info;
> + void *mtt_no_align;
> __be32 mkey_be;
> - __be32 umr_mkey_be;
>
> struct device *pdev;
> struct net_device *netdev;
> @@ -323,32 +323,15 @@ struct mlx5e_rq {
>
> struct mlx5e_umr_dma_info {
> __be64 *mtt;
> - __be64 *mtt_no_align;
> dma_addr_t mtt_addr;
> - struct mlx5e_dma_info *dma_info;
> + struct mlx5e_dma_info dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
> + struct mlx5e_umr_wqe wqe;
> };
>
> struct mlx5e_mpw_info {
> - union {
> - struct mlx5e_dma_info dma_info;
> - struct mlx5e_umr_dma_info umr;
> - };
> + struct mlx5e_umr_dma_info umr;
> u16 consumed_strides;
> u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
> -
> - void (*dma_pre_sync)(struct device *pdev,
> - struct mlx5e_mpw_info *wi,
> - u32 wqe_offset, u32 len);
> - void (*add_skb_frag)(struct mlx5e_rq *rq,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 frag_offset, u32 len);
> - void (*copy_skb_header)(struct device *pdev,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 offset,
> - u32 headlen);
> - void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
> };
>
> struct mlx5e_tx_wqe_info {
> @@ -706,24 +689,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
> int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
> void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
> -void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5_cqe64 *cqe,
> - u16 byte_cnt,
> - struct mlx5e_mpw_info *wi,
> - struct sk_buff *skb);
> -void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5_cqe64 *cqe,
> - u16 byte_cnt,
> - struct mlx5e_mpw_info *wi,
> - struct sk_buff *skb);
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi);
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi);
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq);
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
> struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
>
> void mlx5e_rx_am(struct mlx5e_rq *rq);
> @@ -810,6 +780,12 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
> mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, NULL, cq->wq.cc);
> }
>
> +static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> +{
> + return rq->mpwqe_mtt_offset +
> + wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> +}
> +
> static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
> {
> return min_t(int, mdev->priv.eq_table.num_comp_vectors,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 2459c7f..0db4d3b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -138,7 +138,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
> s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
> s->rx_wqe_err += rq_stats->wqe_err;
> s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
> - s->rx_mpwqe_frag += rq_stats->mpwqe_frag;
> s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
> s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
> s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
> @@ -298,6 +297,107 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
> #define MLX5E_HW2SW_MTU(hwmtu) (hwmtu - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
> #define MLX5E_SW2HW_MTU(swmtu) (swmtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
>
> +static inline int mlx5e_get_wqe_mtt_sz(void)
> +{
> + /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> + * To avoid copying garbage after the mtt array, we allocate
> + * a little more.
> + */
> + return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> + MLX5_UMR_MTT_ALIGNMENT);
> +}
> +
> +static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
> + struct mlx5e_umr_wqe *wqe, u16 ix)
> +{
> + struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
> + struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> + struct mlx5_wqe_data_seg *dseg = &wqe->data;
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> + u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> + u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> +
> + cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> + ds_cnt);
> + cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
> + cseg->imm = rq->mkey_be;
> +
> + ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> + ucseg->klm_octowords =
> + cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> + ucseg->bsf_octowords =
> + cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> + ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> +
> + dseg->lkey = sq->mkey_be;
> + dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> +}
> +
> +static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
> + struct mlx5e_channel *c)
> +{
> + int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> + int mtt_sz = mlx5e_get_wqe_mtt_sz();
> + int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
> + int i;
> +
> + rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> + GFP_KERNEL, cpu_to_node(c->cpu));
> + if (!rq->wqe_info)
> + goto err_out;
> +
> + /* We allocate more than mtt_sz as we will align the pointer */
> + rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
> + cpu_to_node(c->cpu));
> + if (unlikely(!rq->mtt_no_align))
> + goto err_free_wqe_info;
> +
> + for (i = 0; i < wq_sz; i++) {
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> + wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
> + MLX5_UMR_ALIGN);
> + wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
> + PCI_DMA_TODEVICE);
> + if (unlikely(dma_mapping_error(c->pdev, wi->umr.mtt_addr)))
> + goto err_unmap_mtts;
> +
> + mlx5e_build_umr_wqe(rq, &c->icosq, &wi->umr.wqe, i);
> + }
> +
> + return 0;
> +
> +err_unmap_mtts:
> + while (--i >= 0) {
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> + dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
> + PCI_DMA_TODEVICE);
> + }
> + kfree(rq->mtt_no_align);
> +err_free_wqe_info:
> + kfree(rq->wqe_info);
> +
> +err_out:
> + return -ENOMEM;
> +}
> +
> +static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
> +{
> + int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> + int mtt_sz = mlx5e_get_wqe_mtt_sz();
> + int i;
> +
> + for (i = 0; i < wq_sz; i++) {
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> + dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
> + PCI_DMA_TODEVICE);
> + }
> + kfree(rq->mtt_no_align);
> + kfree(rq->wqe_info);
> +}
> +
> static int mlx5e_create_rq(struct mlx5e_channel *c,
> struct mlx5e_rq_param *param,
> struct mlx5e_rq *rq)
> @@ -322,14 +422,16 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>
> wq_sz = mlx5_wq_ll_get_size(&rq->wq);
>
> + rq->wq_type = priv->params.rq_wq_type;
> + rq->pdev = c->pdev;
> + rq->netdev = c->netdev;
> + rq->tstamp = &priv->tstamp;
> + rq->channel = c;
> + rq->ix = c->ix;
> + rq->priv = c->priv;
> +
> switch (priv->params.rq_wq_type) {
> case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> - rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> - GFP_KERNEL, cpu_to_node(c->cpu));
> - if (!rq->wqe_info) {
> - err = -ENOMEM;
> - goto err_rq_wq_destroy;
> - }
> rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
> rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
> rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
> @@ -341,6 +443,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
> rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
> rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> byte_count = rq->wqe_sz;
> + rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> + err = mlx5e_rq_alloc_mpwqe_info(rq, c);
> + if (err)
> + goto err_rq_wq_destroy;
> break;
> default: /* MLX5_WQ_TYPE_LINKED_LIST */
> rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
> @@ -359,27 +465,19 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
> rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
> byte_count = rq->wqe_sz;
> byte_count |= MLX5_HW_START_PADDING;
> + rq->mkey_be = c->mkey_be;
> }
>
> for (i = 0; i < wq_sz; i++) {
> struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
>
> wqe->data.byte_count = cpu_to_be32(byte_count);
> + wqe->data.lkey = rq->mkey_be;
> }
>
> INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
> rq->am.mode = priv->params.rx_cq_period_mode;
>
> - rq->wq_type = priv->params.rq_wq_type;
> - rq->pdev = c->pdev;
> - rq->netdev = c->netdev;
> - rq->tstamp = &priv->tstamp;
> - rq->channel = c;
> - rq->ix = c->ix;
> - rq->priv = c->priv;
> - rq->mkey_be = c->mkey_be;
> - rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> -
> return 0;
>
> err_rq_wq_destroy:
> @@ -392,7 +490,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
> {
> switch (rq->wq_type) {
> case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> - kfree(rq->wqe_info);
> + mlx5e_rq_free_mpwqe_info(rq);
> break;
> default: /* MLX5_WQ_TYPE_LINKED_LIST */
> kfree(rq->skb);
> @@ -530,7 +628,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
>
> /* UMR WQE (if in progress) is always at wq->head */
> if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
> - mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> + mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
>
> while (!mlx5_wq_ll_is_empty(wq)) {
> wqe_ix_be = *wq->tail_next;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b6f8ebb..8ad4d32 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -200,7 +200,6 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
>
> *((dma_addr_t *)skb->cb) = dma_addr;
> wqe->data.addr = cpu_to_be64(dma_addr);
> - wqe->data.lkey = rq->mkey_be;
>
> rq->skb[ix] = skb;
>
> @@ -231,44 +230,11 @@ static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
> return rq->mpwqe_num_strides >> MLX5_MPWRQ_WQE_PAGE_ORDER;
> }
>
> -static inline void
> -mlx5e_dma_pre_sync_linear_mpwqe(struct device *pdev,
> - struct mlx5e_mpw_info *wi,
> - u32 wqe_offset, u32 len)
> -{
> - dma_sync_single_for_cpu(pdev, wi->dma_info.addr + wqe_offset,
> - len, DMA_FROM_DEVICE);
> -}
> -
> -static inline void
> -mlx5e_dma_pre_sync_fragmented_mpwqe(struct device *pdev,
> - struct mlx5e_mpw_info *wi,
> - u32 wqe_offset, u32 len)
> -{
> - /* No dma pre sync for fragmented MPWQE */
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_linear_mpwqe(struct mlx5e_rq *rq,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 frag_offset,
> - u32 len)
> -{
> - unsigned int truesize = ALIGN(len, rq->mpwqe_stride_sz);
> -
> - wi->skbs_frags[page_idx]++;
> - skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> - &wi->dma_info.page[page_idx], frag_offset,
> - len, truesize);
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 frag_offset,
> - u32 len)
> +static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
> + struct sk_buff *skb,
> + struct mlx5e_mpw_info *wi,
> + u32 page_idx, u32 frag_offset,
> + u32 len)
> {
> unsigned int truesize = ALIGN(len, rq->mpwqe_stride_sz);
>
> @@ -282,24 +248,11 @@ mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
> }
>
> static inline void
> -mlx5e_copy_skb_header_linear_mpwqe(struct device *pdev,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 offset,
> - u32 headlen)
> -{
> - struct page *page = &wi->dma_info.page[page_idx];
> -
> - skb_copy_to_linear_data(skb, page_address(page) + offset,
> - ALIGN(headlen, sizeof(long)));
> -}
> -
> -static inline void
> -mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 offset,
> - u32 headlen)
> +mlx5e_copy_skb_header_mpwqe(struct device *pdev,
> + struct sk_buff *skb,
> + struct mlx5e_mpw_info *wi,
> + u32 page_idx, u32 offset,
> + u32 headlen)
> {
> u16 headlen_pg = min_t(u32, headlen, PAGE_SIZE - offset);
> struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[page_idx];
> @@ -324,46 +277,9 @@ mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
> }
> }
>
> -static u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> -{
> - return rq->mpwqe_mtt_offset +
> - wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> -}
> -
> -static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
> - struct mlx5e_sq *sq,
> - struct mlx5e_umr_wqe *wqe,
> - u16 ix)
> +static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> {
> - struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
> - struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> - struct mlx5_wqe_data_seg *dseg = &wqe->data;
> struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> - u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> - u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> -
> - memset(wqe, 0, sizeof(*wqe));
> - cseg->opmod_idx_opcode =
> - cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> - MLX5_OPCODE_UMR);
> - cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> - ds_cnt);
> - cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
> - cseg->imm = rq->umr_mkey_be;
> -
> - ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> - ucseg->klm_octowords =
> - cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> - ucseg->bsf_octowords =
> - cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> - ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> -
> - dseg->lkey = sq->mkey_be;
> - dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> -}
> -
> -static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> -{
> struct mlx5e_sq *sq = &rq->channel->icosq;
> struct mlx5_wq_cyc *wq = &sq->wq;
> struct mlx5e_umr_wqe *wqe;
> @@ -378,30 +294,22 @@ static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> }
>
> wqe = mlx5_wq_cyc_get_wqe(wq, pi);
> - mlx5e_build_umr_wqe(rq, sq, wqe, ix);
> + memcpy(wqe, &wi->umr.wqe, sizeof(*wqe));
> + wqe->ctrl.opmod_idx_opcode =
> + cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> + MLX5_OPCODE_UMR);
> +
> sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
> sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
> sq->pc += num_wqebbs;
> mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> }
>
> -static inline int mlx5e_get_wqe_mtt_sz(void)
> -{
> - /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> - * To avoid copying garbage after the mtt array, we allocate
> - * a little more.
> - */
> - return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> - MLX5_UMR_MTT_ALIGNMENT);
> -}
> -
> -static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi,
> - int i)
> +static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> + struct mlx5e_mpw_info *wi,
> + int i)
> {
> - struct page *page;
> -
> - page = dev_alloc_page();
> + struct page *page = dev_alloc_page();
> if (unlikely(!page))
> return -ENOMEM;
>
> @@ -417,47 +325,25 @@ static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> return 0;
> }
>
> -static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_rx_wqe *wqe,
> - u16 ix)
> +static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
> + struct mlx5e_rx_wqe *wqe,
> + u16 ix)
> {
> struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> - int mtt_sz = mlx5e_get_wqe_mtt_sz();
> u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
> + int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
> + int err;
> int i;
>
> - wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
> - MLX5_MPWRQ_PAGES_PER_WQE,
> - GFP_ATOMIC);
> - if (unlikely(!wi->umr.dma_info))
> - goto err_out;
> -
> - /* We allocate more than mtt_sz as we will align the pointer */
> - wi->umr.mtt_no_align = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
> - GFP_ATOMIC);
> - if (unlikely(!wi->umr.mtt_no_align))
> - goto err_free_umr;
> -
> - wi->umr.mtt = PTR_ALIGN(wi->umr.mtt_no_align, MLX5_UMR_ALIGN);
> - wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
> - PCI_DMA_TODEVICE);
> - if (unlikely(dma_mapping_error(rq->pdev, wi->umr.mtt_addr)))
> - goto err_free_mtt;
> -
> for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> - if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i)))
> + err = mlx5e_alloc_and_map_page(rq, wi, i);
> + if (unlikely(err))
> goto err_unmap;
> - page_ref_add(wi->umr.dma_info[i].page,
> - mlx5e_mpwqe_strides_per_page(rq));
> + page_ref_add(wi->umr.dma_info[i].page, pg_strides);
> wi->skbs_frags[i] = 0;
> }
>
> wi->consumed_strides = 0;
> - wi->dma_pre_sync = mlx5e_dma_pre_sync_fragmented_mpwqe;
> - wi->add_skb_frag = mlx5e_add_skb_frag_fragmented_mpwqe;
> - wi->copy_skb_header = mlx5e_copy_skb_header_fragmented_mpwqe;
> - wi->free_wqe = mlx5e_free_rx_fragmented_mpwqe;
> - wqe->data.lkey = rq->umr_mkey_be;
> wqe->data.addr = cpu_to_be64(dma_offset);
>
> return 0;
> @@ -466,41 +352,28 @@ err_unmap:
> while (--i >= 0) {
> dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
> PCI_DMA_FROMDEVICE);
> - page_ref_sub(wi->umr.dma_info[i].page,
> - mlx5e_mpwqe_strides_per_page(rq));
> + page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
> put_page(wi->umr.dma_info[i].page);
> }
> - dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> -
> -err_free_mtt:
> - kfree(wi->umr.mtt_no_align);
> -
> -err_free_umr:
> - kfree(wi->umr.dma_info);
>
> -err_out:
> - return -ENOMEM;
> + return err;
> }
>
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi)
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
> {
> - int mtt_sz = mlx5e_get_wqe_mtt_sz();
> + int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
> int i;
>
> for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
> PCI_DMA_FROMDEVICE);
> page_ref_sub(wi->umr.dma_info[i].page,
> - mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> + pg_strides - wi->skbs_frags[i]);
> put_page(wi->umr.dma_info[i].page);
> }
> - dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> - kfree(wi->umr.mtt_no_align);
> - kfree(wi->umr.dma_info);
> }
>
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
> {
> struct mlx5_wq_ll *wq = &rq->wq;
> struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
> @@ -508,12 +381,11 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
>
> if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
> - mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> + mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
> return;
> }
>
> mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
> - rq->stats.mpwqe_frag++;
>
> /* ensure wqes are visible to device before updating doorbell record */
> dma_wmb();
> @@ -521,84 +393,23 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> mlx5_wq_ll_update_db_record(wq);
> }
>
> -static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_rx_wqe *wqe,
> - u16 ix)
> -{
> - struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> - gfp_t gfp_mask;
> - int i;
> -
> - gfp_mask = GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC;
> - wi->dma_info.page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> - MLX5_MPWRQ_WQE_PAGE_ORDER);
> - if (unlikely(!wi->dma_info.page))
> - return -ENOMEM;
> -
> - wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> - rq->wqe_sz, PCI_DMA_FROMDEVICE);
> - if (unlikely(dma_mapping_error(rq->pdev, wi->dma_info.addr))) {
> - put_page(wi->dma_info.page);
> - return -ENOMEM;
> - }
> -
> - /* We split the high-order page into order-0 ones and manage their
> - * reference counter to minimize the memory held by small skb fragments
> - */
> - split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER);
> - for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> - page_ref_add(&wi->dma_info.page[i],
> - mlx5e_mpwqe_strides_per_page(rq));
> - wi->skbs_frags[i] = 0;
> - }
> -
> - wi->consumed_strides = 0;
> - wi->dma_pre_sync = mlx5e_dma_pre_sync_linear_mpwqe;
> - wi->add_skb_frag = mlx5e_add_skb_frag_linear_mpwqe;
> - wi->copy_skb_header = mlx5e_copy_skb_header_linear_mpwqe;
> - wi->free_wqe = mlx5e_free_rx_linear_mpwqe;
> - wqe->data.lkey = rq->mkey_be;
> - wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> -
> - return 0;
> -}
> -
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi)
> -{
> - int i;
> -
> - dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> - PCI_DMA_FROMDEVICE);
> - for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> - page_ref_sub(&wi->dma_info.page[i],
> - mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> - put_page(&wi->dma_info.page[i]);
> - }
> -}
> -
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> {
> int err;
>
> - err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
> - if (unlikely(err)) {
> - err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
> - if (unlikely(err))
> - return err;
> - set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> - mlx5e_post_umr_wqe(rq, ix);
> - return -EBUSY;
> - }
> -
> - return 0;
> + err = mlx5e_alloc_rx_umr_mpwqe(rq, wqe, ix);
> + if (unlikely(err))
> + return err;
> + set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> + mlx5e_post_umr_wqe(rq, ix);
> + return -EBUSY;
> }
>
> void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> {
> struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
>
> - wi->free_wqe(rq, wi);
> + mlx5e_free_rx_mpwqe(rq, wi);
> }
>
> #define RQ_CANNOT_POST(rq) \
> @@ -617,9 +428,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
> int err;
>
> err = rq->alloc_wqe(rq, wqe, wq->head);
> + if (err == -EBUSY)
> + return true;
> if (unlikely(err)) {
> - if (err != -EBUSY)
> - rq->stats.buff_alloc_err++;
> + rq->stats.buff_alloc_err++;
> break;
> }
>
> @@ -823,7 +635,6 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
> u32 cqe_bcnt,
> struct sk_buff *skb)
> {
> - u32 consumed_bytes = ALIGN(cqe_bcnt, rq->mpwqe_stride_sz);
> u16 stride_ix = mpwrq_get_cqe_stride_index(cqe);
> u32 wqe_offset = stride_ix * rq->mpwqe_stride_sz;
> u32 head_offset = wqe_offset & (PAGE_SIZE - 1);
> @@ -837,21 +648,20 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
> page_idx++;
> frag_offset -= PAGE_SIZE;
> }
> - wi->dma_pre_sync(rq->pdev, wi, wqe_offset, consumed_bytes);
>
> while (byte_cnt) {
> u32 pg_consumed_bytes =
> min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
>
> - wi->add_skb_frag(rq, skb, wi, page_idx, frag_offset,
> - pg_consumed_bytes);
> + mlx5e_add_skb_frag_mpwqe(rq, skb, wi, page_idx, frag_offset,
> + pg_consumed_bytes);
> byte_cnt -= pg_consumed_bytes;
> frag_offset = 0;
> page_idx++;
> }
> /* copy header */
> - wi->copy_skb_header(rq->pdev, skb, wi, head_page_idx, head_offset,
> - headlen);
> + mlx5e_copy_skb_header_mpwqe(rq->pdev, skb, wi, head_page_idx,
> + head_offset, headlen);
> /* skb linear part was allocated with headlen and aligned to long */
> skb->tail += headlen;
> skb->len += headlen;
> @@ -896,7 +706,7 @@ mpwrq_cqe_out:
> if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
> return;
>
> - wi->free_wqe(rq, wi);
> + mlx5e_free_rx_mpwqe(rq, wi);
> mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
> }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> index 499487c..1f56543 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> @@ -73,7 +73,6 @@ struct mlx5e_sw_stats {
> u64 tx_xmit_more;
> u64 rx_wqe_err;
> u64 rx_mpwqe_filler;
> - u64 rx_mpwqe_frag;
> u64 rx_buff_alloc_err;
> u64 rx_cqe_compress_blks;
> u64 rx_cqe_compress_pkts;
> @@ -105,7 +104,6 @@ static const struct counter_desc sw_stats_desc[] = {
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
> - { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
> @@ -274,7 +272,6 @@ struct mlx5e_rq_stats {
> u64 lro_bytes;
> u64 wqe_err;
> u64 mpwqe_filler;
> - u64 mpwqe_frag;
> u64 buff_alloc_err;
> u64 cqe_compress_blks;
> u64 cqe_compress_pkts;
> @@ -290,7 +287,6 @@ static const struct counter_desc rq_stats_desc[] = {
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler) },
> - { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_frag) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> index 9bf33bb..08d8b0c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> @@ -87,7 +87,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
> case MLX5_OPCODE_NOP:
> break;
> case MLX5_OPCODE_UMR:
> - mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
> + mlx5e_post_rx_mpwqe(&sq->channel->rq);
> break;
> default:
> WARN_ONCE(true,
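The UMR post change in the quoted en_rx.c hunk follows a build-once, copy-per-post pattern: the constant fields of the WQE are filled in at RQ creation time, and the hot path only does a memcpy() plus one field update. A simplified userspace sketch of that pattern is below. The struct layout, the sketch_* names and the constants are placeholders rather than the real mlx5 definitions, and the byte-order conversions done by the driver are left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct mlx5e_umr_wqe. */
struct sketch_wqe {
	uint32_t opmod_idx_opcode; /* depends on the producer counter */
	uint32_t qpn_ds;           /* fixed per send queue */
	uint32_t mkey;             /* fixed per receive queue */
	uint64_t mtt_addr;         /* fixed per ring slot */
};

/* Placeholder values, not taken from the mlx5 headers. */
#define SKETCH_WQE_INDEX_SHIFT 8
#define SKETCH_OPCODE_UMR      0x25

/* Setup time: fill in everything that never changes for this slot. */
static inline void sketch_build_template(struct sketch_wqe *tmpl,
					 uint32_t qpn_ds, uint32_t mkey,
					 uint64_t mtt_addr)
{
	memset(tmpl, 0, sizeof(*tmpl));
	tmpl->qpn_ds = qpn_ds;
	tmpl->mkey = mkey;
	tmpl->mtt_addr = mtt_addr;
}

/* Hot path: copy the template into the queue slot, then patch the one
 * field that depends on the current producer counter.
 */
static inline void sketch_post(struct sketch_wqe *slot,
			       const struct sketch_wqe *tmpl, uint16_t pc)
{
	assert(slot != tmpl);
	memcpy(slot, tmpl, sizeof(*slot));
	slot->opmod_idx_opcode = ((uint32_t)pc << SKETCH_WQE_INDEX_SHIFT) |
				 SKETCH_OPCODE_UMR;
}
```

The template itself is never modified on the hot path, so the same prebuilt copy can be reused every time the slot is reposted.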
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
@ 2016-09-07 19:18 ` Jesper Dangaard Brouer via iovisor-dev
0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-07 19:18 UTC (permalink / raw)
To: Saeed Mahameed
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
linux-mm, Eric Dumazet, Tom Herbert
On Wed, 7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> To improve the memory consumption scheme, we omit the flow that
> demands and splits high-order pages in Striding RQ, and stay
> with a single Striding RQ flow that uses order-0 pages.
Thank you for doing this! The MM-list people thank you!
For others to understand what this means: this driver was calling
split_page() on high-order pages (for Striding RQ). This was really bad
because it fragments the page allocator and quickly depletes the
available high-order pages.
(I've left the rest of the patch intact below, in case some MM people
are interested in looking at the changes.)
There is even a funny comment in split_page() relevant to this:
/* [...]
* Note: this is probably too low level an operation for use in drivers.
* Please consult with lkml before using this in your driver.
*/
> Moving to fragmented memory allows the use of larger MPWQEs,
> which reduces the number of UMR posts and filler CQEs.
>
> Moving to a single flow allows several optimizations that improve
> performance, especially in production servers where we would
> anyway fallback to order-0 allocations:
> - inline functions that were called via function pointers.
> - improve the UMR post process.
>
> This patch alone is expected to give a slight performance reduction.
> However, the new memory scheme gives the possibility to use a page-cache
> of a fair size, that doesn't inflate the memory footprint, which will
> dramatically fix the reduction and even give a huge gain.
>
> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>
> Single stride, 64 bytes:
> * 4,739,057 - baseline
> * 4,749,550 - this patch
> no reduction
>
> Larger packets, no page cross, 1024 bytes:
> * 3,982,361 - baseline
> * 3,845,682 - this patch
> 3.5% reduction
>
> Larger packets, every 3rd packet crosses a page, 1500 bytes:
> * 3,731,189 - baseline
> * 3,579,414 - this patch
> 4% reduction
>
Well, the reduction does not really matter that much, because your
baseline benchmarks are from a freshly booted system, where you have
not fragmented and depleted the high-order pages yet... ;-)
> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++--
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 ++++++++--
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 ++++-----------------
> drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 -
> drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +-
> 5 files changed, 184 insertions(+), 304 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index bf722aa..075cdfc 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -62,12 +62,12 @@
> #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE 0xd
>
> #define MLX5E_PARAMS_MINIMUM_LOG_RQ_SIZE_MPW 0x1
> -#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x4
> +#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3
> #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6
>
> #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */
> #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW restriction */
> -#define MLX5_MPWRQ_LOG_WQE_SZ 17
> +#define MLX5_MPWRQ_LOG_WQE_SZ 18
> #define MLX5_MPWRQ_WQE_PAGE_ORDER (MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT > 0 ? \
> MLX5_MPWRQ_LOG_WQE_SZ - PAGE_SHIFT : 0)
> #define MLX5_MPWRQ_PAGES_PER_WQE BIT(MLX5_MPWRQ_WQE_PAGE_ORDER)
> @@ -293,8 +293,8 @@ struct mlx5e_rq {
> u32 wqe_sz;
> struct sk_buff **skb;
> struct mlx5e_mpw_info *wqe_info;
> + void *mtt_no_align;
> __be32 mkey_be;
> - __be32 umr_mkey_be;
>
> struct device *pdev;
> struct net_device *netdev;
> @@ -323,32 +323,15 @@ struct mlx5e_rq {
>
> struct mlx5e_umr_dma_info {
> __be64 *mtt;
> - __be64 *mtt_no_align;
> dma_addr_t mtt_addr;
> - struct mlx5e_dma_info *dma_info;
> + struct mlx5e_dma_info dma_info[MLX5_MPWRQ_PAGES_PER_WQE];
> + struct mlx5e_umr_wqe wqe;
> };
>
> struct mlx5e_mpw_info {
> - union {
> - struct mlx5e_dma_info dma_info;
> - struct mlx5e_umr_dma_info umr;
> - };
> + struct mlx5e_umr_dma_info umr;
> u16 consumed_strides;
> u16 skbs_frags[MLX5_MPWRQ_PAGES_PER_WQE];
> -
> - void (*dma_pre_sync)(struct device *pdev,
> - struct mlx5e_mpw_info *wi,
> - u32 wqe_offset, u32 len);
> - void (*add_skb_frag)(struct mlx5e_rq *rq,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 frag_offset, u32 len);
> - void (*copy_skb_header)(struct device *pdev,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 offset,
> - u32 headlen);
> - void (*free_wqe)(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
> };
>
> struct mlx5e_tx_wqe_info {
> @@ -706,24 +689,11 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
> int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix);
> void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix);
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq);
> -void mlx5e_complete_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5_cqe64 *cqe,
> - u16 byte_cnt,
> - struct mlx5e_mpw_info *wi,
> - struct sk_buff *skb);
> -void mlx5e_complete_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5_cqe64 *cqe,
> - u16 byte_cnt,
> - struct mlx5e_mpw_info *wi,
> - struct sk_buff *skb);
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi);
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi);
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq);
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi);
> struct mlx5_cqe64 *mlx5e_get_cqe(struct mlx5e_cq *cq);
>
> void mlx5e_rx_am(struct mlx5e_rq *rq);
> @@ -810,6 +780,12 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
> mlx5_cq_arm(mcq, MLX5_CQ_DB_REQ_NOT, mcq->uar->map, NULL, cq->wq.cc);
> }
>
> +static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> +{
> + return rq->mpwqe_mtt_offset +
> + wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> +}
> +
> static inline int mlx5e_get_max_num_channels(struct mlx5_core_dev *mdev)
> {
> return min_t(int, mdev->priv.eq_table.num_comp_vectors,
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index 2459c7f..0db4d3b 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -138,7 +138,6 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
> s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
> s->rx_wqe_err += rq_stats->wqe_err;
> s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
> - s->rx_mpwqe_frag += rq_stats->mpwqe_frag;
> s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
> s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
> s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
> @@ -298,6 +297,107 @@ static void mlx5e_disable_async_events(struct mlx5e_priv *priv)
> #define MLX5E_HW2SW_MTU(hwmtu) (hwmtu - (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
> #define MLX5E_SW2HW_MTU(swmtu) (swmtu + (ETH_HLEN + VLAN_HLEN + ETH_FCS_LEN))
>
> +static inline int mlx5e_get_wqe_mtt_sz(void)
> +{
> + /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> + * To avoid copying garbage after the mtt array, we allocate
> + * a little more.
> + */
> + return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> + MLX5_UMR_MTT_ALIGNMENT);
> +}
> +
> +static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
> + struct mlx5e_umr_wqe *wqe, u16 ix)
> +{
> + struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
> + struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> + struct mlx5_wqe_data_seg *dseg = &wqe->data;
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> + u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> + u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> +
> + cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> + ds_cnt);
> + cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
> + cseg->imm = rq->mkey_be;
> +
> + ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> + ucseg->klm_octowords =
> + cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> + ucseg->bsf_octowords =
> + cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> + ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> +
> + dseg->lkey = sq->mkey_be;
> + dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> +}
> +
> +static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
> + struct mlx5e_channel *c)
> +{
> + int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> + int mtt_sz = mlx5e_get_wqe_mtt_sz();
> + int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
> + int i;
> +
> + rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> + GFP_KERNEL, cpu_to_node(c->cpu));
> + if (!rq->wqe_info)
> + goto err_out;
> +
> + /* We allocate more than mtt_sz as we will align the pointer */
> + rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
> + cpu_to_node(c->cpu));
> + if (unlikely(!rq->mtt_no_align))
> + goto err_free_wqe_info;
> +
> + for (i = 0; i < wq_sz; i++) {
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> + wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
> + MLX5_UMR_ALIGN);
> + wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
> + PCI_DMA_TODEVICE);
> + if (unlikely(dma_mapping_error(c->pdev, wi->umr.mtt_addr)))
> + goto err_unmap_mtts;
> +
> + mlx5e_build_umr_wqe(rq, &c->icosq, &wi->umr.wqe, i);
> + }
> +
> + return 0;
> +
> +err_unmap_mtts:
> + while (--i >= 0) {
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> + dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
> + PCI_DMA_TODEVICE);
> + }
> + kfree(rq->mtt_no_align);
> +err_free_wqe_info:
> + kfree(rq->wqe_info);
> +
> +err_out:
> + return -ENOMEM;
> +}
> +
> +static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
> +{
> + int wq_sz = mlx5_wq_ll_get_size(&rq->wq);
> + int mtt_sz = mlx5e_get_wqe_mtt_sz();
> + int i;
> +
> + for (i = 0; i < wq_sz; i++) {
> + struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
> +
> + dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
> + PCI_DMA_TODEVICE);
> + }
> + kfree(rq->mtt_no_align);
> + kfree(rq->wqe_info);
> +}
> +
> static int mlx5e_create_rq(struct mlx5e_channel *c,
> struct mlx5e_rq_param *param,
> struct mlx5e_rq *rq)
> @@ -322,14 +422,16 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>
> wq_sz = mlx5_wq_ll_get_size(&rq->wq);
>
> + rq->wq_type = priv->params.rq_wq_type;
> + rq->pdev = c->pdev;
> + rq->netdev = c->netdev;
> + rq->tstamp = &priv->tstamp;
> + rq->channel = c;
> + rq->ix = c->ix;
> + rq->priv = c->priv;
> +
> switch (priv->params.rq_wq_type) {
> case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> - rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
> - GFP_KERNEL, cpu_to_node(c->cpu));
> - if (!rq->wqe_info) {
> - err = -ENOMEM;
> - goto err_rq_wq_destroy;
> - }
> rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
> rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
> rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
> @@ -341,6 +443,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
> rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
> rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> byte_count = rq->wqe_sz;
> + rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> + err = mlx5e_rq_alloc_mpwqe_info(rq, c);
> + if (err)
> + goto err_rq_wq_destroy;
> break;
> default: /* MLX5_WQ_TYPE_LINKED_LIST */
> rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
> @@ -359,27 +465,19 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
> rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
> byte_count = rq->wqe_sz;
> byte_count |= MLX5_HW_START_PADDING;
> + rq->mkey_be = c->mkey_be;
> }
>
> for (i = 0; i < wq_sz; i++) {
> struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, i);
>
> wqe->data.byte_count = cpu_to_be32(byte_count);
> + wqe->data.lkey = rq->mkey_be;
> }
>
> INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
> rq->am.mode = priv->params.rx_cq_period_mode;
>
> - rq->wq_type = priv->params.rq_wq_type;
> - rq->pdev = c->pdev;
> - rq->netdev = c->netdev;
> - rq->tstamp = &priv->tstamp;
> - rq->channel = c;
> - rq->ix = c->ix;
> - rq->priv = c->priv;
> - rq->mkey_be = c->mkey_be;
> - rq->umr_mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> -
> return 0;
>
> err_rq_wq_destroy:
> @@ -392,7 +490,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
> {
> switch (rq->wq_type) {
> case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
> - kfree(rq->wqe_info);
> + mlx5e_rq_free_mpwqe_info(rq);
> break;
> default: /* MLX5_WQ_TYPE_LINKED_LIST */
> kfree(rq->skb);
> @@ -530,7 +628,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
>
> /* UMR WQE (if in progress) is always at wq->head */
> if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
> - mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> + mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
>
> while (!mlx5_wq_ll_is_empty(wq)) {
> wqe_ix_be = *wq->tail_next;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index b6f8ebb..8ad4d32 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -200,7 +200,6 @@ int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
>
> *((dma_addr_t *)skb->cb) = dma_addr;
> wqe->data.addr = cpu_to_be64(dma_addr);
> - wqe->data.lkey = rq->mkey_be;
>
> rq->skb[ix] = skb;
>
> @@ -231,44 +230,11 @@ static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
> return rq->mpwqe_num_strides >> MLX5_MPWRQ_WQE_PAGE_ORDER;
> }
>
> -static inline void
> -mlx5e_dma_pre_sync_linear_mpwqe(struct device *pdev,
> - struct mlx5e_mpw_info *wi,
> - u32 wqe_offset, u32 len)
> -{
> - dma_sync_single_for_cpu(pdev, wi->dma_info.addr + wqe_offset,
> - len, DMA_FROM_DEVICE);
> -}
> -
> -static inline void
> -mlx5e_dma_pre_sync_fragmented_mpwqe(struct device *pdev,
> - struct mlx5e_mpw_info *wi,
> - u32 wqe_offset, u32 len)
> -{
> - /* No dma pre sync for fragmented MPWQE */
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_linear_mpwqe(struct mlx5e_rq *rq,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 frag_offset,
> - u32 len)
> -{
> - unsigned int truesize = ALIGN(len, rq->mpwqe_stride_sz);
> -
> - wi->skbs_frags[page_idx]++;
> - skb_add_rx_frag(skb, skb_shinfo(skb)->nr_frags,
> - &wi->dma_info.page[page_idx], frag_offset,
> - len, truesize);
> -}
> -
> -static inline void
> -mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 frag_offset,
> - u32 len)
> +static inline void mlx5e_add_skb_frag_mpwqe(struct mlx5e_rq *rq,
> + struct sk_buff *skb,
> + struct mlx5e_mpw_info *wi,
> + u32 page_idx, u32 frag_offset,
> + u32 len)
> {
> unsigned int truesize = ALIGN(len, rq->mpwqe_stride_sz);
>
> @@ -282,24 +248,11 @@ mlx5e_add_skb_frag_fragmented_mpwqe(struct mlx5e_rq *rq,
> }
>
> static inline void
> -mlx5e_copy_skb_header_linear_mpwqe(struct device *pdev,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 offset,
> - u32 headlen)
> -{
> - struct page *page = &wi->dma_info.page[page_idx];
> -
> - skb_copy_to_linear_data(skb, page_address(page) + offset,
> - ALIGN(headlen, sizeof(long)));
> -}
> -
> -static inline void
> -mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
> - struct sk_buff *skb,
> - struct mlx5e_mpw_info *wi,
> - u32 page_idx, u32 offset,
> - u32 headlen)
> +mlx5e_copy_skb_header_mpwqe(struct device *pdev,
> + struct sk_buff *skb,
> + struct mlx5e_mpw_info *wi,
> + u32 page_idx, u32 offset,
> + u32 headlen)
> {
> u16 headlen_pg = min_t(u32, headlen, PAGE_SIZE - offset);
> struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[page_idx];
> @@ -324,46 +277,9 @@ mlx5e_copy_skb_header_fragmented_mpwqe(struct device *pdev,
> }
> }
>
> -static u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
> -{
> - return rq->mpwqe_mtt_offset +
> - wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
> -}
> -
> -static void mlx5e_build_umr_wqe(struct mlx5e_rq *rq,
> - struct mlx5e_sq *sq,
> - struct mlx5e_umr_wqe *wqe,
> - u16 ix)
> +static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> {
> - struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
> - struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
> - struct mlx5_wqe_data_seg *dseg = &wqe->data;
> struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> - u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
> - u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
> -
> - memset(wqe, 0, sizeof(*wqe));
> - cseg->opmod_idx_opcode =
> - cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> - MLX5_OPCODE_UMR);
> - cseg->qpn_ds = cpu_to_be32((sq->sqn << MLX5_WQE_CTRL_QPN_SHIFT) |
> - ds_cnt);
> - cseg->fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
> - cseg->imm = rq->umr_mkey_be;
> -
> - ucseg->flags = MLX5_UMR_TRANSLATION_OFFSET_EN;
> - ucseg->klm_octowords =
> - cpu_to_be16(MLX5_MTT_OCTW(MLX5_MPWRQ_PAGES_PER_WQE));
> - ucseg->bsf_octowords =
> - cpu_to_be16(MLX5_MTT_OCTW(umr_wqe_mtt_offset));
> - ucseg->mkey_mask = cpu_to_be64(MLX5_MKEY_MASK_FREE);
> -
> - dseg->lkey = sq->mkey_be;
> - dseg->addr = cpu_to_be64(wi->umr.mtt_addr);
> -}
> -
> -static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> -{
> struct mlx5e_sq *sq = &rq->channel->icosq;
> struct mlx5_wq_cyc *wq = &sq->wq;
> struct mlx5e_umr_wqe *wqe;
> @@ -378,30 +294,22 @@ static void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> }
>
> wqe = mlx5_wq_cyc_get_wqe(wq, pi);
> - mlx5e_build_umr_wqe(rq, sq, wqe, ix);
> + memcpy(wqe, &wi->umr.wqe, sizeof(*wqe));
> + wqe->ctrl.opmod_idx_opcode =
> + cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
> + MLX5_OPCODE_UMR);
> +
> sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
> sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
> sq->pc += num_wqebbs;
> mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> }
>
> -static inline int mlx5e_get_wqe_mtt_sz(void)
> -{
> - /* UMR copies MTTs in units of MLX5_UMR_MTT_ALIGNMENT bytes.
> - * To avoid copying garbage after the mtt array, we allocate
> - * a little more.
> - */
> - return ALIGN(MLX5_MPWRQ_PAGES_PER_WQE * sizeof(__be64),
> - MLX5_UMR_MTT_ALIGNMENT);
> -}
> -
> -static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi,
> - int i)
> +static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> + struct mlx5e_mpw_info *wi,
> + int i)
> {
> - struct page *page;
> -
> - page = dev_alloc_page();
> + struct page *page = dev_alloc_page();
> if (unlikely(!page))
> return -ENOMEM;
>
> @@ -417,47 +325,25 @@ static int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
> return 0;
> }
>
> -static int mlx5e_alloc_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_rx_wqe *wqe,
> - u16 ix)
> +static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
> + struct mlx5e_rx_wqe *wqe,
> + u16 ix)
> {
> struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> - int mtt_sz = mlx5e_get_wqe_mtt_sz();
> u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
> + int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
> + int err;
> int i;
>
> - wi->umr.dma_info = kmalloc(sizeof(*wi->umr.dma_info) *
> - MLX5_MPWRQ_PAGES_PER_WQE,
> - GFP_ATOMIC);
> - if (unlikely(!wi->umr.dma_info))
> - goto err_out;
> -
> - /* We allocate more than mtt_sz as we will align the pointer */
> - wi->umr.mtt_no_align = kzalloc(mtt_sz + MLX5_UMR_ALIGN - 1,
> - GFP_ATOMIC);
> - if (unlikely(!wi->umr.mtt_no_align))
> - goto err_free_umr;
> -
> - wi->umr.mtt = PTR_ALIGN(wi->umr.mtt_no_align, MLX5_UMR_ALIGN);
> - wi->umr.mtt_addr = dma_map_single(rq->pdev, wi->umr.mtt, mtt_sz,
> - PCI_DMA_TODEVICE);
> - if (unlikely(dma_mapping_error(rq->pdev, wi->umr.mtt_addr)))
> - goto err_free_mtt;
> -
> for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> - if (unlikely(mlx5e_alloc_and_map_page(rq, wi, i)))
> + err = mlx5e_alloc_and_map_page(rq, wi, i);
> + if (unlikely(err))
> goto err_unmap;
> - page_ref_add(wi->umr.dma_info[i].page,
> - mlx5e_mpwqe_strides_per_page(rq));
> + page_ref_add(wi->umr.dma_info[i].page, pg_strides);
> wi->skbs_frags[i] = 0;
> }
>
> wi->consumed_strides = 0;
> - wi->dma_pre_sync = mlx5e_dma_pre_sync_fragmented_mpwqe;
> - wi->add_skb_frag = mlx5e_add_skb_frag_fragmented_mpwqe;
> - wi->copy_skb_header = mlx5e_copy_skb_header_fragmented_mpwqe;
> - wi->free_wqe = mlx5e_free_rx_fragmented_mpwqe;
> - wqe->data.lkey = rq->umr_mkey_be;
> wqe->data.addr = cpu_to_be64(dma_offset);
>
> return 0;
> @@ -466,41 +352,28 @@ err_unmap:
> while (--i >= 0) {
> dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
> PCI_DMA_FROMDEVICE);
> - page_ref_sub(wi->umr.dma_info[i].page,
> - mlx5e_mpwqe_strides_per_page(rq));
> + page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
> put_page(wi->umr.dma_info[i].page);
> }
> - dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> -
> -err_free_mtt:
> - kfree(wi->umr.mtt_no_align);
> -
> -err_free_umr:
> - kfree(wi->umr.dma_info);
>
> -err_out:
> - return -ENOMEM;
> + return err;
> }
>
> -void mlx5e_free_rx_fragmented_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi)
> +void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
> {
> - int mtt_sz = mlx5e_get_wqe_mtt_sz();
> + int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
> int i;
>
> for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
> PCI_DMA_FROMDEVICE);
> page_ref_sub(wi->umr.dma_info[i].page,
> - mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> + pg_strides - wi->skbs_frags[i]);
> put_page(wi->umr.dma_info[i].page);
> }
> - dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz, PCI_DMA_TODEVICE);
> - kfree(wi->umr.mtt_no_align);
> - kfree(wi->umr.dma_info);
> }
>
> -void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> +void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
> {
> struct mlx5_wq_ll *wq = &rq->wq;
> struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(wq, wq->head);
> @@ -508,12 +381,11 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
>
> if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
> - mlx5e_free_rx_fragmented_mpwqe(rq, &rq->wqe_info[wq->head]);
> + mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
> return;
> }
>
> mlx5_wq_ll_push(wq, be16_to_cpu(wqe->next.next_wqe_index));
> - rq->stats.mpwqe_frag++;
>
> /* ensure wqes are visible to device before updating doorbell record */
> dma_wmb();
> @@ -521,84 +393,23 @@ void mlx5e_post_rx_fragmented_mpwqe(struct mlx5e_rq *rq)
> mlx5_wq_ll_update_db_record(wq);
> }
>
> -static int mlx5e_alloc_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_rx_wqe *wqe,
> - u16 ix)
> -{
> - struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
> - gfp_t gfp_mask;
> - int i;
> -
> - gfp_mask = GFP_ATOMIC | __GFP_COLD | __GFP_MEMALLOC;
> - wi->dma_info.page = alloc_pages_node(NUMA_NO_NODE, gfp_mask,
> - MLX5_MPWRQ_WQE_PAGE_ORDER);
> - if (unlikely(!wi->dma_info.page))
> - return -ENOMEM;
> -
> - wi->dma_info.addr = dma_map_page(rq->pdev, wi->dma_info.page, 0,
> - rq->wqe_sz, PCI_DMA_FROMDEVICE);
> - if (unlikely(dma_mapping_error(rq->pdev, wi->dma_info.addr))) {
> - put_page(wi->dma_info.page);
> - return -ENOMEM;
> - }
> -
> - /* We split the high-order page into order-0 ones and manage their
> - * reference counter to minimize the memory held by small skb fragments
> - */
> - split_page(wi->dma_info.page, MLX5_MPWRQ_WQE_PAGE_ORDER);
> - for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> - page_ref_add(&wi->dma_info.page[i],
> - mlx5e_mpwqe_strides_per_page(rq));
> - wi->skbs_frags[i] = 0;
> - }
> -
> - wi->consumed_strides = 0;
> - wi->dma_pre_sync = mlx5e_dma_pre_sync_linear_mpwqe;
> - wi->add_skb_frag = mlx5e_add_skb_frag_linear_mpwqe;
> - wi->copy_skb_header = mlx5e_copy_skb_header_linear_mpwqe;
> - wi->free_wqe = mlx5e_free_rx_linear_mpwqe;
> - wqe->data.lkey = rq->mkey_be;
> - wqe->data.addr = cpu_to_be64(wi->dma_info.addr);
> -
> - return 0;
> -}
> -
> -void mlx5e_free_rx_linear_mpwqe(struct mlx5e_rq *rq,
> - struct mlx5e_mpw_info *wi)
> -{
> - int i;
> -
> - dma_unmap_page(rq->pdev, wi->dma_info.addr, rq->wqe_sz,
> - PCI_DMA_FROMDEVICE);
> - for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
> - page_ref_sub(&wi->dma_info.page[i],
> - mlx5e_mpwqe_strides_per_page(rq) - wi->skbs_frags[i]);
> - put_page(&wi->dma_info.page[i]);
> - }
> -}
> -
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> {
> int err;
>
> - err = mlx5e_alloc_rx_linear_mpwqe(rq, wqe, ix);
> - if (unlikely(err)) {
> - err = mlx5e_alloc_rx_fragmented_mpwqe(rq, wqe, ix);
> - if (unlikely(err))
> - return err;
> - set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> - mlx5e_post_umr_wqe(rq, ix);
> - return -EBUSY;
> - }
> -
> - return 0;
> + err = mlx5e_alloc_rx_umr_mpwqe(rq, wqe, ix);
> + if (unlikely(err))
> + return err;
> + set_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
> + mlx5e_post_umr_wqe(rq, ix);
> + return -EBUSY;
> }
>
> void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
> {
> struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
>
> - wi->free_wqe(rq, wi);
> + mlx5e_free_rx_mpwqe(rq, wi);
> }
>
> #define RQ_CANNOT_POST(rq) \
> @@ -617,9 +428,10 @@ bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq)
> int err;
>
> err = rq->alloc_wqe(rq, wqe, wq->head);
> + if (err == -EBUSY)
> + return true;
> if (unlikely(err)) {
> - if (err != -EBUSY)
> - rq->stats.buff_alloc_err++;
> + rq->stats.buff_alloc_err++;
> break;
> }
>
> @@ -823,7 +635,6 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
> u32 cqe_bcnt,
> struct sk_buff *skb)
> {
> - u32 consumed_bytes = ALIGN(cqe_bcnt, rq->mpwqe_stride_sz);
> u16 stride_ix = mpwrq_get_cqe_stride_index(cqe);
> u32 wqe_offset = stride_ix * rq->mpwqe_stride_sz;
> u32 head_offset = wqe_offset & (PAGE_SIZE - 1);
> @@ -837,21 +648,20 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
> page_idx++;
> frag_offset -= PAGE_SIZE;
> }
> - wi->dma_pre_sync(rq->pdev, wi, wqe_offset, consumed_bytes);
>
> while (byte_cnt) {
> u32 pg_consumed_bytes =
> min_t(u32, PAGE_SIZE - frag_offset, byte_cnt);
>
> - wi->add_skb_frag(rq, skb, wi, page_idx, frag_offset,
> - pg_consumed_bytes);
> + mlx5e_add_skb_frag_mpwqe(rq, skb, wi, page_idx, frag_offset,
> + pg_consumed_bytes);
> byte_cnt -= pg_consumed_bytes;
> frag_offset = 0;
> page_idx++;
> }
> /* copy header */
> - wi->copy_skb_header(rq->pdev, skb, wi, head_page_idx, head_offset,
> - headlen);
> + mlx5e_copy_skb_header_mpwqe(rq->pdev, skb, wi, head_page_idx,
> + head_offset, headlen);
> /* skb linear part was allocated with headlen and aligned to long */
> skb->tail += headlen;
> skb->len += headlen;
> @@ -896,7 +706,7 @@ mpwrq_cqe_out:
> if (likely(wi->consumed_strides < rq->mpwqe_num_strides))
> return;
>
> - wi->free_wqe(rq, wi);
> + mlx5e_free_rx_mpwqe(rq, wi);
> mlx5_wq_ll_pop(&rq->wq, cqe->wqe_id, &wqe->next.next_wqe_index);
> }
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> index 499487c..1f56543 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
> @@ -73,7 +73,6 @@ struct mlx5e_sw_stats {
> u64 tx_xmit_more;
> u64 rx_wqe_err;
> u64 rx_mpwqe_filler;
> - u64 rx_mpwqe_frag;
> u64 rx_buff_alloc_err;
> u64 rx_cqe_compress_blks;
> u64 rx_cqe_compress_pkts;
> @@ -105,7 +104,6 @@ static const struct counter_desc sw_stats_desc[] = {
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_xmit_more) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_wqe_err) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_filler) },
> - { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_mpwqe_frag) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
> { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
> @@ -274,7 +272,6 @@ struct mlx5e_rq_stats {
> u64 lro_bytes;
> u64 wqe_err;
> u64 mpwqe_filler;
> - u64 mpwqe_frag;
> u64 buff_alloc_err;
> u64 cqe_compress_blks;
> u64 cqe_compress_pkts;
> @@ -290,7 +287,6 @@ static const struct counter_desc rq_stats_desc[] = {
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_filler) },
> - { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, mpwqe_frag) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
> { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> index 9bf33bb..08d8b0c 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
> @@ -87,7 +87,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
> case MLX5_OPCODE_NOP:
> break;
> case MLX5_OPCODE_UMR:
> - mlx5e_post_rx_fragmented_mpwqe(&sq->channel->rq);
> + mlx5e_post_rx_mpwqe(&sq->channel->rq);
> break;
> default:
> WARN_ONCE(true,
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
@ 2016-09-15 14:28 ` Tariq Toukan via iovisor-dev
0 siblings, 0 replies; 72+ messages in thread
From: Tariq Toukan @ 2016-09-15 14:28 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Saeed Mahameed
Cc: iovisor-dev, netdev, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Daniel Borkmann, Eric Dumazet,
Jamal Hadi Salim, linux-mm
Hi Jesper,
On 07/09/2016 10:18 PM, Jesper Dangaard Brouer wrote:
> On Wed, 7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
>
>> From: Tariq Toukan <tariqt@mellanox.com>
>>
>> To improve the memory consumption scheme, we omit the flow that
>> demands and splits high-order pages in Striding RQ, and stay
>> with a single Striding RQ flow that uses order-0 pages.
> Thank you for doing this! The MM-list people thank you!
Thanks. I've just submitted it to net-next.
> For others to understand what this means: this driver was doing
> split_page() on high-order pages (for Striding RQ). This was really bad
> because it fragments the page allocator and quickly depletes the
> available high-order pages.
>
> (I've left the rest of the patch intact below, in case some MM people
> are interested in looking at the changes.)
>
> There is even a funny comment in split_page() relevant to this:
>
> /* [...]
> * Note: this is probably too low level an operation for use in drivers.
> * Please consult with lkml before using this in your driver.
> */
>
>
>> Moving to fragmented memory allows the use of larger MPWQEs,
>> which reduces the number of UMR posts and filler CQEs.
>>
>> Moving to a single flow allows several optimizations that improve
>> performance, especially in production servers where we would
>> anyway fallback to order-0 allocations:
>> - inline functions that were called via function pointers.
>> - improve the UMR post process.
>>
>> This patch alone is expected to give a slight performance reduction.
>> However, the new memory scheme gives the possibility to use a page-cache
>> of a fair size, that doesn't inflate the memory footprint, which will
>> dramatically fix the reduction and even give a huge gain.
>>
>> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>>
>> Single stride, 64 bytes:
>> * 4,739,057 - baseline
>> * 4,749,550 - this patch
>> no reduction
>>
>> Larger packets, no page cross, 1024 bytes:
>> * 3,982,361 - baseline
>> * 3,845,682 - this patch
>> 3.5% reduction
>>
>> Larger packets, every 3rd packet crosses a page, 1500 bytes:
>> * 3,731,189 - baseline
>> * 3,579,414 - this patch
>> 4% reduction
>>
> Well, the reduction does not really matter that much, because your
> baseline benchmarks are from a freshly booted system, where you have
> not fragmented and depleted the high-order pages yet... ;-)
Indeed. On fragmented systems we'll get a gain even without the
page-cache mechanism, as no time is wasted looking for high-order pages.
>
>
>> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
>> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
>> Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
>> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
>> ---
>> drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++--
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 ++++++++--
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 ++++-----------------
>> drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 -
>> drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +-
>> 5 files changed, 184 insertions(+), 304 deletions(-)
>>
Regards,
Tariq
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ
@ 2016-09-15 14:28 ` Tariq Toukan via iovisor-dev
0 siblings, 0 replies; 72+ messages in thread
From: Tariq Toukan via iovisor-dev @ 2016-09-15 14:28 UTC (permalink / raw)
To: Jesper Dangaard Brouer, Saeed Mahameed
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
linux-mm, Eric Dumazet, Tom Herbert
Hi Jesper,
On 07/09/2016 10:18 PM, Jesper Dangaard Brouer wrote:
> On Wed, 7 Sep 2016 15:42:22 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
>
>> From: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>>
>> To improve the memory consumption scheme, we omit the flow that
>> demands and splits high-order pages in Striding RQ, and stay
>> with a single Striding RQ flow that uses order-0 pages.
> Thank you for doing this! MM-list people thank you!
Thanks. I've just submitted it to net-next.
> For others to understand what this means: This driver was doing
> split_page() on high-order pages (for Striding RQ). This was really bad
> because it will cause fragmenting the page-allocator, and depleting the
> high-order pages available quickly.
>
> (I've left rest of patch intact below, if some MM people should be
> interested in looking at the changes).
>
> There is even a funny comment in split_page() relevant to this:
>
> /* [...]
> * Note: this is probably too low level an operation for use in drivers.
> * Please consult with lkml before using this in your driver.
> */
>
>
>> Moving to fragmented memory allows the use of larger MPWQEs,
>> which reduces the number of UMR posts and filler CQEs.
>>
>> Moving to a single flow allows several optimizations that improve
>> performance, especially in production servers where we would
>> anyway fallback to order-0 allocations:
>> - inline functions that were called via function pointers.
>> - improve the UMR post process.
>>
>> This patch alone is expected to give a slight performance reduction.
>> However, the new memory scheme gives the possibility to use a page-cache
>> of a fair size, that doesn't inflate the memory footprint, which will
>> dramatically fix the reduction and even give a huge gain.
>>
>> We ran pktgen single-stream benchmarks, with iptables-raw-drop:
>>
>> Single stride, 64 bytes:
>> * 4,739,057 - baseline
>> * 4,749,550 - this patch
>> no reduction
>>
>> Larger packets, no page cross, 1024 bytes:
>> * 3,982,361 - baseline
>> * 3,845,682 - this patch
>> 3.5% reduction
>>
>> Larger packets, every 3rd packet crosses a page, 1500 bytes:
>> * 3,731,189 - baseline
>> * 3,579,414 - this patch
>> 4% reduction
>>
> Well, the reduction does not really matter that much, because your
> baseline benchmarks are from a freshly booted system, where you have
> not fragmented and depleted the high-order pages yet... ;-)
Indeed. On fragmented systems we'll get a gain, even w/o the page-cache
mechanism, as no time is wasted looking for high-order-pages.
>
>
>> Fixes: 461017cb006a ("net/mlx5e: Support RX multi-packet WQE (Striding RQ)")
>> Fixes: bc77b240b3c5 ("net/mlx5e: Add fragmented memory support for RX multi packet WQE")
>> Signed-off-by: Tariq Toukan <tariqt-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>> ---
>> drivers/net/ethernet/mellanox/mlx5/core/en.h | 54 ++--
>> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 136 ++++++++--
>> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 292 ++++-----------------
>> drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 -
>> drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +-
>> 5 files changed, 184 insertions(+), 304 deletions(-)
>>
Regards,
Tariq
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle Saeed Mahameed
` (9 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
From: Tariq Toukan <tariqt@mellanox.com>

Manage the allocation and deallocation of mapped RX pages only
through dedicated API functions.

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 46 +++++++++++++++----------
1 file changed, 27 insertions(+), 19 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8ad4d32..c1cb510 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -305,26 +305,32 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
}
-static inline int mlx5e_alloc_and_map_page(struct mlx5e_rq *rq,
- struct mlx5e_mpw_info *wi,
- int i)
+static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
{
struct page *page = dev_alloc_page();
+
if (unlikely(!page))
return -ENOMEM;
- wi->umr.dma_info[i].page = page;
- wi->umr.dma_info[i].addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
- PCI_DMA_FROMDEVICE);
- if (unlikely(dma_mapping_error(rq->pdev, wi->umr.dma_info[i].addr))) {
+ dma_info->page = page;
+ dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
put_page(page);
return -ENOMEM;
}
- wi->umr.mtt[i] = cpu_to_be64(wi->umr.dma_info[i].addr | MLX5_EN_WR);
return 0;
}
+static inline void mlx5e_page_release(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
+{
+ dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
+ put_page(dma_info->page);
+}
+
static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
struct mlx5e_rx_wqe *wqe,
u16 ix)
@@ -336,10 +342,13 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
int i;
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
- err = mlx5e_alloc_and_map_page(rq, wi, i);
+ struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
+
+ err = mlx5e_page_alloc_mapped(rq, dma_info);
if (unlikely(err))
goto err_unmap;
- page_ref_add(wi->umr.dma_info[i].page, pg_strides);
+ wi->umr.mtt[i] = cpu_to_be64(dma_info->addr | MLX5_EN_WR);
+ page_ref_add(dma_info->page, pg_strides);
wi->skbs_frags[i] = 0;
}
@@ -350,10 +359,10 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
err_unmap:
while (--i >= 0) {
- dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
- PCI_DMA_FROMDEVICE);
- page_ref_sub(wi->umr.dma_info[i].page, pg_strides);
- put_page(wi->umr.dma_info[i].page);
+ struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
+
+ page_ref_sub(dma_info->page, pg_strides);
+ mlx5e_page_release(rq, dma_info);
}
return err;
@@ -365,11 +374,10 @@ void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
int i;
for (i = 0; i < MLX5_MPWRQ_PAGES_PER_WQE; i++) {
- dma_unmap_page(rq->pdev, wi->umr.dma_info[i].addr, PAGE_SIZE,
- PCI_DMA_FROMDEVICE);
- page_ref_sub(wi->umr.dma_info[i].page,
- pg_strides - wi->skbs_frags[i]);
- put_page(wi->umr.dma_info[i].page);
+ struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
+
+ page_ref_sub(dma_info->page, pg_strides - wi->skbs_frags[i]);
+ mlx5e_page_release(rq, dma_info);
}
}
--
2.7.4
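
The allocate/release pairing introduced by this patch, together with the
unwind loop at err_unmap, can be sketched in userspace as follows. This is
only an illustration under stated assumptions: malloc() and free() stand in
for dev_alloc_page(), dma_map_page(), dma_unmap_page() and put_page(), and
PAGES_PER_WQE, FAKE_PAGE_SIZE, fail_at, alloc_calls and live_pages are
hypothetical names that do not exist in the driver.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

#define PAGES_PER_WQE  4    /* hypothetical stand-in for MLX5_MPWRQ_PAGES_PER_WQE */
#define FAKE_PAGE_SIZE 4096 /* hypothetical page size */

/* Userspace stand-in for struct mlx5e_dma_info: a buffer and its "mapping". */
struct dma_info {
	void *page;
	uintptr_t addr;
};

static int live_pages;   /* pages currently allocated and "mapped" */
static int alloc_calls;  /* number of allocation attempts so far */
static int fail_at = -1; /* test hook: fail this attempt (0-based), -1 = never */

/* Allocate and "map" one page; same shape as mlx5e_page_alloc_mapped(). */
static int page_alloc_mapped(struct dma_info *di)
{
	void *page;

	if (alloc_calls++ == fail_at)
		return -ENOMEM;

	page = malloc(FAKE_PAGE_SIZE);
	if (!page)
		return -ENOMEM;

	di->page = page;
	di->addr = (uintptr_t)page; /* pretend this is the DMA address */
	live_pages++;
	return 0;
}

/* "Unmap" and free one page; same shape as mlx5e_page_release(). */
static void page_release(struct dma_info *di)
{
	free(di->page);
	di->page = NULL;
	live_pages--;
}

/* Fill every slot of one WQE, releasing the already filled slots on failure. */
static int alloc_wqe(struct dma_info *infos)
{
	int err;
	int i;

	for (i = 0; i < PAGES_PER_WQE; i++) {
		err = page_alloc_mapped(&infos[i]);
		if (err)
			goto err_unwind;
	}
	return 0;

err_unwind:
	/* i indexes the slot that failed; release only the slots before it. */
	while (--i >= 0)
		page_release(&infos[i]);
	return err;
}
```

Because both the error path and the normal free path go through a single
release helper, a later change to the release policy (such as the page cache
in the next patch) only has to touch that one function.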
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 01/11] net/mlx5e: Single flow order-0 pages for Striding RQ Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 02/11] net/mlx5e: Introduce API for RX mapped pages Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
[not found] ` <1473252152-11379-4-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
` (8 subsequent siblings)
11 siblings, 1 reply; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
From: Tariq Toukan <tariqt@mellanox.com>

Instead of reallocating and mapping pages for RX data-path,
recycle already used pages in a per ring cache.

We ran pktgen single-stream benchmarks, with iptables-raw-drop:

Single stride, 64 bytes:
* 4,739,057 - baseline
* 4,749,550 - order0 no cache
* 4,786,899 - order0 with cache
1% gain

Larger packets, no page cross, 1024 bytes:
* 3,982,361 - baseline
* 3,845,682 - order0 no cache
* 4,127,852 - order0 with cache
3.7% gain

Larger packets, every 3rd packet crosses a page, 1500 bytes:
* 3,731,189 - baseline
* 3,579,414 - order0 no cache
* 3,931,708 - order0 with cache
5.4% gain

Signed-off-by: Tariq Toukan <tariqt@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 16 ++++++
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 15 ++++++
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 ++++++++++++++++++++--
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 16 ++++++
4 files changed, 99 insertions(+), 5 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 075cdfc..afbdf70 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -287,6 +287,18 @@ struct mlx5e_rx_am { /* Adaptive Moderation */
u8 tired;
};
+/* a single cache unit is capable to serve one napi call (for non-striding rq)
+ * or a MPWQE (for striding rq).
+ */
+#define MLX5E_CACHE_UNIT (MLX5_MPWRQ_PAGES_PER_WQE > NAPI_POLL_WEIGHT ? \
+ MLX5_MPWRQ_PAGES_PER_WQE : NAPI_POLL_WEIGHT)
+#define MLX5E_CACHE_SIZE (2 * roundup_pow_of_two(MLX5E_CACHE_UNIT))
+struct mlx5e_page_cache {
+ u32 head;
+ u32 tail;
+ struct mlx5e_dma_info page_cache[MLX5E_CACHE_SIZE];
+};
+
struct mlx5e_rq {
/* data path */
struct mlx5_wq_ll wq;
@@ -301,6 +313,8 @@ struct mlx5e_rq {
struct mlx5e_tstamp *tstamp;
struct mlx5e_rq_stats stats;
struct mlx5e_cq cq;
+ struct mlx5e_page_cache page_cache;
+
mlx5e_fp_handle_rx_cqe handle_rx_cqe;
mlx5e_fp_alloc_wqe alloc_wqe;
mlx5e_fp_dealloc_wqe dealloc_wqe;
@@ -685,6 +699,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
void mlx5e_free_tx_descs(struct mlx5e_sq *sq);
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+ bool recycle);
void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 0db4d3b..c84702c 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -141,6 +141,10 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
s->rx_cqe_compress_blks += rq_stats->cqe_compress_blks;
s->rx_cqe_compress_pkts += rq_stats->cqe_compress_pkts;
+ s->rx_cache_reuse += rq_stats->cache_reuse;
+ s->rx_cache_full += rq_stats->cache_full;
+ s->rx_cache_empty += rq_stats->cache_empty;
+ s->rx_cache_busy += rq_stats->cache_busy;
for (j = 0; j < priv->params.num_tc; j++) {
sq_stats = &priv->channel[i]->sq[j].stats;
@@ -478,6 +482,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
INIT_WORK(&rq->am.work, mlx5e_rx_am_work);
rq->am.mode = priv->params.rx_cq_period_mode;
+ rq->page_cache.head = 0;
+ rq->page_cache.tail = 0;
+
return 0;
err_rq_wq_destroy:
@@ -488,6 +495,8 @@ err_rq_wq_destroy:
static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
{
+ int i;
+
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
mlx5e_rq_free_mpwqe_info(rq);
@@ -496,6 +505,12 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
kfree(rq->skb);
}
+ for (i = rq->page_cache.head; i != rq->page_cache.tail;
+ i = (i + 1) & (MLX5E_CACHE_SIZE - 1)) {
+ struct mlx5e_dma_info *dma_info = &rq->page_cache.page_cache[i];
+
+ mlx5e_page_release(rq, dma_info, false);
+ }
mlx5_wq_destroy(&rq->wq_ctrl);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index c1cb510..8e02af3 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -305,11 +305,55 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
}
+static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
+{
+ struct mlx5e_page_cache *cache = &rq->page_cache;
+ u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
+
+ if (tail_next == cache->head) {
+ rq->stats.cache_full++;
+ return false;
+ }
+
+ cache->page_cache[cache->tail] = *dma_info;
+ cache->tail = tail_next;
+ return true;
+}
+
+static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
+{
+ struct mlx5e_page_cache *cache = &rq->page_cache;
+
+ if (unlikely(cache->head == cache->tail)) {
+ rq->stats.cache_empty++;
+ return false;
+ }
+
+ if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+ rq->stats.cache_busy++;
+ return false;
+ }
+
+ *dma_info = cache->page_cache[cache->head];
+ cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
+ rq->stats.cache_reuse++;
+
+ dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
+ DMA_FROM_DEVICE);
+ return true;
+}
+
static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
struct mlx5e_dma_info *dma_info)
{
- struct page *page = dev_alloc_page();
+ struct page *page;
+
+ if (mlx5e_rx_cache_get(rq, dma_info))
+ return 0;
+ page = dev_alloc_page();
if (unlikely(!page))
return -ENOMEM;
@@ -324,9 +368,12 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
return 0;
}
-static inline void mlx5e_page_release(struct mlx5e_rq *rq,
- struct mlx5e_dma_info *dma_info)
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+ bool recycle)
{
+ if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+ return;
+
dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
put_page(dma_info->page);
}
@@ -362,7 +409,7 @@ err_unmap:
struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
page_ref_sub(dma_info->page, pg_strides);
- mlx5e_page_release(rq, dma_info);
+ mlx5e_page_release(rq, dma_info, true);
}
return err;
@@ -377,7 +424,7 @@ void mlx5e_free_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_mpw_info *wi)
struct mlx5e_dma_info *dma_info = &wi->umr.dma_info[i];
page_ref_sub(dma_info->page, pg_strides - wi->skbs_frags[i]);
- mlx5e_page_release(rq, dma_info);
+ mlx5e_page_release(rq, dma_info, true);
}
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 1f56543..6af8d79 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -76,6 +76,10 @@ struct mlx5e_sw_stats {
u64 rx_buff_alloc_err;
u64 rx_cqe_compress_blks;
u64 rx_cqe_compress_pkts;
+ u64 rx_cache_reuse;
+ u64 rx_cache_full;
+ u64 rx_cache_empty;
+ u64 rx_cache_busy;
/* Special handling counters */
u64 link_down_events_phy;
@@ -107,6 +111,10 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_buff_alloc_err) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_blks) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cqe_compress_pkts) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_reuse) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_full) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_empty) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_cache_busy) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, link_down_events_phy) },
};
@@ -275,6 +283,10 @@ struct mlx5e_rq_stats {
u64 buff_alloc_err;
u64 cqe_compress_blks;
u64 cqe_compress_pkts;
+ u64 cache_reuse;
+ u64 cache_full;
+ u64 cache_empty;
+ u64 cache_busy;
};
static const struct counter_desc rq_stats_desc[] = {
@@ -290,6 +302,10 @@ static const struct counter_desc rq_stats_desc[] = {
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, buff_alloc_err) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_blks) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cqe_compress_pkts) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_reuse) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_full) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_empty) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, cache_busy) },
};
struct mlx5e_sq_stats {
--
2.7.4
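
The head/tail ring used for the page cache in this patch can be sketched in
userspace as below. This is a minimal model, not the driver code: struct
fake_page with a plain integer refcount stands in for struct page and
page_ref_count(), CACHE_SIZE is a hypothetical small value in place of
MLX5E_CACHE_SIZE, and the statistics counters and DMA sync are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define CACHE_SIZE 8u /* hypothetical; must be a power of two for the mask below */
_Static_assert((CACHE_SIZE & (CACHE_SIZE - 1)) == 0,
	       "CACHE_SIZE must be a power of two");

/* Stand-in for struct page: only the reference count matters here. */
struct fake_page {
	int refcount;
};

struct page_cache {
	uint32_t head; /* next entry to hand out */
	uint32_t tail; /* next free slot */
	struct fake_page *slots[CACHE_SIZE];
};

/* Store a page for later reuse; fails when the ring is full. */
static bool cache_put(struct page_cache *c, struct fake_page *p)
{
	uint32_t tail_next = (c->tail + 1) & (CACHE_SIZE - 1);

	/* One slot always stays unused so that head == tail means "empty". */
	if (tail_next == c->head)
		return false;

	c->slots[c->tail] = p;
	c->tail = tail_next;
	return true;
}

/* Take the oldest page back, but only if nobody else still references it. */
static bool cache_get(struct page_cache *c, struct fake_page **out)
{
	if (c->head == c->tail)
		return false; /* empty */

	if (c->slots[c->head]->refcount != 1)
		return false; /* busy: the page is still in use elsewhere */

	*out = c->slots[c->head];
	c->head = (c->head + 1) & (CACHE_SIZE - 1);
	return true;
}
```

Two properties of this scheme are visible in the sketch: the ring holds at
most CACHE_SIZE - 1 pages, and only the head entry is examined, so a page
that is still referenced at the head makes the get fail even when later
entries are already free.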
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (2 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 03/11] net/mlx5e: Implement RX mapped page cache for page recycle Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 17:34 ` Alexei Starovoitov
2016-09-07 19:32 ` Jesper Dangaard Brouer via iovisor-dev
2016-09-07 12:42 ` [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type Saeed Mahameed
` (7 subsequent siblings)
11 siblings, 2 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
For non-striding RQ configuration before this patch we had a ring
with pre-allocated SKBs and mapped the SKB->data buffers for
device.

For robustness and better RX data buffers management, we allocate a
page per packet and build_skb around it.

This patch (which is a prerequisite for XDP) will actually reduce
performance for normal stack usage, because we are now hitting a bottleneck
in the page allocator. A later patch of page reuse mechanism will be
needed to restore or even improve performance in comparison to the old
RX scheme.

Packet rate performance testing was done with pktgen 64B packets on xmit
side and TC drop action on RX side.

CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz

Comparison is done between:
 1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
 2.Build SKB with RX page cache (This patch)

Streams    Baseline    Build SKB+page-cache    Improvement
-----------------------------------------------------------
1          4.33Mpps    5.51Mpps                27%
2          7.35Mpps    11.5Mpps                52%
4          14.0Mpps    16.3Mpps                16%
8          22.2Mpps    29.6Mpps                20%
16         24.8Mpps    34.0Mpps                17%

Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 10 +-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 31 +++-
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 215 +++++++++++-----------
3 files changed, 133 insertions(+), 123 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index afbdf70..a346112 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -65,6 +65,8 @@
#define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3
#define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6
+#define MLX5_RX_HEADROOM NET_SKB_PAD
+
#define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */
#define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW restriction */
#define MLX5_MPWRQ_LOG_WQE_SZ 18
@@ -302,10 +304,14 @@ struct mlx5e_page_cache {
struct mlx5e_rq {
/* data path */
struct mlx5_wq_ll wq;
- u32 wqe_sz;
- struct sk_buff **skb;
+
+ struct mlx5e_dma_info *dma_info;
struct mlx5e_mpw_info *wqe_info;
void *mtt_no_align;
+ struct {
+ u8 page_order;
+ u32 wqe_sz; /* wqe data buffer size */
+ } buff;
__be32 mkey_be;
struct device *pdev;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c84702c..c9f1dea 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
void *rqc = param->rqc;
void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
u32 byte_count;
+ u32 frag_sz;
+ int npages;
int wq_sz;
int err;
int i;
@@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
- rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
- byte_count = rq->wqe_sz;
+
+ rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
+ byte_count = rq->buff.wqe_sz;
rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
err = mlx5e_rq_alloc_mpwqe_info(rq, c);
if (err)
goto err_rq_wq_destroy;
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
- rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
- cpu_to_node(c->cpu));
- if (!rq->skb) {
+ rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info), GFP_KERNEL,
+ cpu_to_node(c->cpu));
+ if (!rq->dma_info) {
err = -ENOMEM;
goto err_rq_wq_destroy;
}
+
rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
rq->alloc_wqe = mlx5e_alloc_rx_wqe;
rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
- rq->wqe_sz = (priv->params.lro_en) ?
+ rq->buff.wqe_sz = (priv->params.lro_en) ?
priv->params.lro_wqe_sz :
MLX5E_SW2HW_MTU(priv->netdev->mtu);
- rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
- byte_count = rq->wqe_sz;
+ byte_count = rq->buff.wqe_sz;
+
+ /* calc the required page order */
+ frag_sz = MLX5_RX_HEADROOM +
+ byte_count /* packet data */ +
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ frag_sz = SKB_DATA_ALIGN(frag_sz);
+
+ npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
+ rq->buff.page_order = order_base_2(npages);
+
byte_count |= MLX5_HW_START_PADDING;
rq->mkey_be = c->mkey_be;
}
@@ -502,7 +515,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
mlx5e_rq_free_mpwqe_info(rq);
break;
default: /* MLX5_WQ_TYPE_LINKED_LIST */
- kfree(rq->skb);
+ kfree(rq->dma_info);
}
for (i = rq->page_cache.head; i != rq->page_cache.tail;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 8e02af3..2f5bc6f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -179,50 +179,99 @@ unlock:
mutex_unlock(&priv->state_lock);
}
-int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+#define RQ_PAGE_SIZE(rq) ((1 << (rq)->buff.page_order) << PAGE_SHIFT)
+
+static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
{
- struct sk_buff *skb;
- dma_addr_t dma_addr;
+ struct mlx5e_page_cache *cache = &rq->page_cache;
+ u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
- skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
- if (unlikely(!skb))
- return -ENOMEM;
+ if (tail_next == cache->head) {
+ rq->stats.cache_full++;
+ return false;
+ }
+
+ cache->page_cache[cache->tail] = *dma_info;
+ cache->tail = tail_next;
+ return true;
+}
+
+static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
+{
+ struct mlx5e_page_cache *cache = &rq->page_cache;
+
+ if (unlikely(cache->head == cache->tail)) {
+ rq->stats.cache_empty++;
+ return false;
+ }
+
+ if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
+ rq->stats.cache_busy++;
+ return false;
+ }
+
+ *dma_info = cache->page_cache[cache->head];
+ cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
+ rq->stats.cache_reuse++;
+
+ dma_sync_single_for_device(rq->pdev, dma_info->addr,
+ RQ_PAGE_SIZE(rq),
+ DMA_FROM_DEVICE);
+ return true;
+}
- dma_addr = dma_map_single(rq->pdev,
- /* hw start padding */
- skb->data,
- /* hw end padding */
- rq->wqe_sz,
- DMA_FROM_DEVICE);
+static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
+ struct mlx5e_dma_info *dma_info)
+{
+ struct page *page;
- if (unlikely(dma_mapping_error(rq->pdev, dma_addr)))
- goto err_free_skb;
+ if (mlx5e_rx_cache_get(rq, dma_info))
+ return 0;
- *((dma_addr_t *)skb->cb) = dma_addr;
- wqe->data.addr = cpu_to_be64(dma_addr);
+ page = dev_alloc_pages(rq->buff.page_order);
+ if (unlikely(!page))
+ return -ENOMEM;
- rq->skb[ix] = skb;
+ dma_info->page = page;
+ dma_info->addr = dma_map_page(rq->pdev, page, 0,
+ RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
+ if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
+ put_page(page);
+ return -ENOMEM;
+ }
return 0;
+}
-err_free_skb:
- dev_kfree_skb(skb);
+void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
+ bool recycle)
+{
+ if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
+ return;
+
+ dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
+ DMA_FROM_DEVICE);
+ put_page(dma_info->page);
+}
+
+int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+{
+ struct mlx5e_dma_info *di = &rq->dma_info[ix];
- return -ENOMEM;
+ if (unlikely(mlx5e_page_alloc_mapped(rq, di)))
+ return -ENOMEM;
+
+ wqe->data.addr = cpu_to_be64(di->addr + MLX5_RX_HEADROOM);
+ return 0;
}
void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix)
{
- struct sk_buff *skb = rq->skb[ix];
+ struct mlx5e_dma_info *di = &rq->dma_info[ix];
- if (skb) {
- rq->skb[ix] = NULL;
- dma_unmap_single(rq->pdev,
- *((dma_addr_t *)skb->cb),
- rq->wqe_sz,
- DMA_FROM_DEVICE);
- dev_kfree_skb(skb);
- }
+ mlx5e_page_release(rq, di, true);
}
static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
@@ -305,79 +354,6 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
}
-static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
- struct mlx5e_dma_info *dma_info)
-{
- struct mlx5e_page_cache *cache = &rq->page_cache;
- u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
-
- if (tail_next == cache->head) {
- rq->stats.cache_full++;
- return false;
- }
-
- cache->page_cache[cache->tail] = *dma_info;
- cache->tail = tail_next;
- return true;
-}
-
-static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
- struct mlx5e_dma_info *dma_info)
-{
- struct mlx5e_page_cache *cache = &rq->page_cache;
-
- if (unlikely(cache->head == cache->tail)) {
- rq->stats.cache_empty++;
- return false;
- }
-
- if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
- rq->stats.cache_busy++;
- return false;
- }
-
- *dma_info = cache->page_cache[cache->head];
- cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
- rq->stats.cache_reuse++;
-
- dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
- DMA_FROM_DEVICE);
- return true;
-}
-
-static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
- struct mlx5e_dma_info *dma_info)
-{
- struct page *page;
-
- if (mlx5e_rx_cache_get(rq, dma_info))
- return 0;
-
- page = dev_alloc_page();
- if (unlikely(!page))
- return -ENOMEM;
-
- dma_info->page = page;
- dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
- DMA_FROM_DEVICE);
- if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
- put_page(page);
- return -ENOMEM;
- }
-
- return 0;
-}
-
-void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
- bool recycle)
-{
- if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
- return;
-
- dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
- put_page(dma_info->page);
-}
-
static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
struct mlx5e_rx_wqe *wqe,
u16 ix)
@@ -448,7 +424,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
mlx5_wq_ll_update_db_record(wq);
}
-int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
+int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
{
int err;
@@ -650,31 +626,46 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
{
+ struct mlx5e_dma_info *di;
struct mlx5e_rx_wqe *wqe;
- struct sk_buff *skb;
__be16 wqe_counter_be;
+ struct sk_buff *skb;
u16 wqe_counter;
u32 cqe_bcnt;
+ void *va;
wqe_counter_be = cqe->wqe_counter;
wqe_counter = be16_to_cpu(wqe_counter_be);
wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
- skb = rq->skb[wqe_counter];
- prefetch(skb->data);
- rq->skb[wqe_counter] = NULL;
+ di = &rq->dma_info[wqe_counter];
+ va = page_address(di->page);
- dma_unmap_single(rq->pdev,
- *((dma_addr_t *)skb->cb),
- rq->wqe_sz,
- DMA_FROM_DEVICE);
+ dma_sync_single_range_for_cpu(rq->pdev,
+ di->addr,
+ MLX5_RX_HEADROOM,
+ rq->buff.wqe_sz,
+ DMA_FROM_DEVICE);
+ prefetch(va + MLX5_RX_HEADROOM);
if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
rq->stats.wqe_err++;
- dev_kfree_skb(skb);
+ mlx5e_page_release(rq, di, true);
goto wq_ll_pop;
}
+ skb = build_skb(va, RQ_PAGE_SIZE(rq));
+ if (unlikely(!skb)) {
+ rq->stats.buff_alloc_err++;
+ mlx5e_page_release(rq, di, true);
+ goto wq_ll_pop;
+ }
+
+ /* queue up for recycling ..*/
+ page_ref_inc(di->page);
+ mlx5e_page_release(rq, di, true);
+
cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
+ skb_reserve(skb, MLX5_RX_HEADROOM);
skb_put(skb, cqe_bcnt);
mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
--
2.7.4
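
The page-order calculation added to mlx5e_create_rq() in this patch can be
worked through with a small userspace sketch. All constants are assumptions
chosen for illustration only: PAGE_SZ assumes 4 KiB pages, RX_HEADROOM
stands in for NET_SKB_PAD, SHINFO_SZ for sizeof(struct skb_shared_info) and
CACHE_LINE for the alignment used by SKB_DATA_ALIGN(); the real values
depend on architecture and kernel configuration. ceil_log2() is a
hypothetical helper that mimics order_base_2().

```c
#include <assert.h>

#define PAGE_SZ     4096u /* assumption: 4 KiB pages */
#define RX_HEADROOM 64u   /* assumption: stands in for NET_SKB_PAD */
#define SHINFO_SZ   320u  /* assumption: sizeof(struct skb_shared_info) */
#define CACHE_LINE  64u   /* assumption: alignment used by SKB_DATA_ALIGN() */

/* Round x up to a multiple of CACHE_LINE, like SKB_DATA_ALIGN(). */
static unsigned int data_align(unsigned int x)
{
	return (x + CACHE_LINE - 1) & ~(CACHE_LINE - 1);
}

/* Smallest order such that (1 << order) >= n, for n >= 1. */
static unsigned int ceil_log2(unsigned int n)
{
	unsigned int order = 0;

	while ((1u << order) < n)
		order++;
	return order;
}

/* Page order needed for one RX buffer holding wqe_sz bytes of packet data. */
static unsigned int rx_page_order(unsigned int wqe_sz)
{
	unsigned int frag_sz;
	unsigned int npages;

	/* headroom + packet data + room for the shared info at the end */
	frag_sz = RX_HEADROOM + wqe_sz + data_align(SHINFO_SZ);
	frag_sz = data_align(frag_sz);

	npages = (frag_sz + PAGE_SZ - 1) / PAGE_SZ; /* DIV_ROUND_UP */
	return ceil_log2(npages);
}
```

Under these assumptions a 1522-byte buffer fits in a single order-0 page,
while a 9022-byte buffer needs 3 pages and is therefore rounded up to
order 2, that is 4 pages per packet.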
^ permalink raw reply related [flat|nested] 72+ messages in thread
* Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
@ 2016-09-07 17:34 ` Alexei Starovoitov
[not found] ` <20160907173449.GB64688-+o4/htvd0TDFYCXBM6kdu7fOX0fSgVTm@public.gmane.org>
2016-09-07 19:32 ` Jesper Dangaard Brouer via iovisor-dev
1 sibling, 1 reply; 72+ messages in thread
From: Alexei Starovoitov @ 2016-09-07 17:34 UTC (permalink / raw)
To: Saeed Mahameed
Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco, Tom Herbert,
Martin KaFai Lau, Jesper Dangaard Brouer, Daniel Borkmann,
Eric Dumazet, Jamal Hadi Salim
On Wed, Sep 07, 2016 at 03:42:25PM +0300, Saeed Mahameed wrote:
> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
>
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
>
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.
>
> Packet rate performance testing was done with pktgen 64B packets on xmit
> side and TC drop action on RX side.
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
> 1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
> 2.Build SKB with RX page cache (This patch)
>
> Streams Baseline Build SKB+page-cache Improvement
> -----------------------------------------------------------
> 1 4.33Mpps 5.51Mpps 27%
> 2 7.35Mpps 11.5Mpps 52%
> 4 14.0Mpps 16.3Mpps 16%
> 8 22.2Mpps 29.6Mpps 20%
> 16 24.8Mpps 34.0Mpps 17%
Impressive gains for build_skb. I think it should help ip forwarding too
and likely tcp_rr. tcp_stream shouldn't see any difference.
If you can benchmark that along with pktgen+tc_drop it would
help to better understand the impact of the changes.
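For reference when comparing against further benchmarks, the relative gain can be recomputed from the two Mpps columns of the quoted table. This is a minimal sketch that assumes "Improvement" means (new - baseline) / baseline; the struct and function names are hypothetical. Under that assumption the 1- and 4-stream rows reproduce the quoted 27% and 16%, while the 2-, 8- and 16-stream rows come out at 56%, 33% and 37%, so the quoted column may use a different definition or come from different runs.

```c
#include <assert.h>
#include <stdio.h>

struct bench_row {
	int streams;
	double baseline_mpps;
	double patched_mpps;
};

/* Mpps figures copied from the quoted table. */
static const struct bench_row rows[] = {
	{ 1, 4.33, 5.51 },
	{ 2, 7.35, 11.5 },
	{ 4, 14.0, 16.3 },
	{ 8, 22.2, 29.6 },
	{ 16, 24.8, 34.0 },
};

/* Relative gain in percent, rounded to the nearest integer.
 * Assumption: gain = (patched - baseline) / baseline. */
static int gain_pct(double baseline_mpps, double patched_mpps)
{
	assert(baseline_mpps > 0.0);
	return (int)((patched_mpps - baseline_mpps) / baseline_mpps * 100.0 + 0.5);
}

/* Print one line per row of the table. */
void print_gains(void)
{
	for (size_t i = 0; i < sizeof(rows) / sizeof(rows[0]); i++)
		printf("%2d streams: %d%%\n", rows[i].streams,
		       gain_pct(rows[i].baseline_mpps, rows[i].patched_mpps));
}
```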
^ permalink raw reply [flat|nested] 72+ messages in thread
* Re: [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand
@ 2016-09-07 19:32 ` Jesper Dangaard Brouer via iovisor-dev
0 siblings, 0 replies; 72+ messages in thread
From: Jesper Dangaard Brouer @ 2016-09-07 19:32 UTC (permalink / raw)
To: Saeed Mahameed
Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco,
Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, brouer, linux-mm
On Wed, 7 Sep 2016 15:42:25 +0300 Saeed Mahameed <saeedm@mellanox.com> wrote:
> For non-striding RQ configuration before this patch we had a ring
> with pre-allocated SKBs and mapped the SKB->data buffers for
> device.
>
> For robustness and better RX data buffers management, we allocate a
> page per packet and build_skb around it.
>
> This patch (which is a prerequisite for XDP) will actually reduce
> performance for normal stack usage, because we are now hitting a bottleneck
> in the page allocator. A later patch of page reuse mechanism will be
> needed to restore or even improve performance in comparison to the old
> RX scheme.
Yes, it is true that there is a performance reduction (for the normal
stack, not XDP) caused by hitting a bottleneck in the page allocator.
I actually have a PoC implementation of my page_pool that shows we
regain the performance and then some. It is based on an earlier version
of this patch, where I hook it into the mlx5 driver (50Gbit/s version).
Your description might be a bit outdated, as this patch and the patch
before it do contain your own driver-local page-cache recycle facility.
And you also show that you regain quite a lot of the lost performance.
Your driver-local page_cache does have its limitations (see comments on
the other patch), as it depends on a timely refcnt decrease by the users
of the page. If they hold onto pages (like TCP does) then your page-cache
will not be efficient.
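The limitation can be illustrated with a small userspace model of the ring in the quoted patch. This is only a sketch: toy_page, toy_cache and the cache size of 8 are hypothetical stand-ins, and the refcount is a plain int rather than a real page reference count. As in mlx5e_rx_cache_get(), only the head entry is inspected, so a single page that is still held elsewhere blocks reuse of every entry queued behind it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_CACHE_SIZE 8u /* assumed power of two, stands in for MLX5E_CACHE_SIZE */

struct toy_page {
	int refcount; /* 1 means only the cache holds the page */
};

struct toy_cache {
	struct toy_page *slot[TOY_CACHE_SIZE];
	unsigned int head;
	unsigned int tail;
	unsigned int full, empty, busy, reuse; /* mirror rq->stats.cache_* */
};

/* Queue a page at the tail; fails when the ring is full. */
static bool toy_cache_put(struct toy_cache *c, struct toy_page *p)
{
	unsigned int tail_next = (c->tail + 1) & (TOY_CACHE_SIZE - 1);

	assert(p != NULL);
	if (tail_next == c->head) {
		c->full++;
		return false;
	}
	c->slot[c->tail] = p;
	c->tail = tail_next;
	return true;
}

/* Take the head page, but only if nobody else still references it. */
static bool toy_cache_get(struct toy_cache *c, struct toy_page **p)
{
	if (c->head == c->tail) {
		c->empty++;
		return false;
	}
	if (c->slot[c->head]->refcount != 1) {
		c->busy++; /* head is still held, e.g. by a socket queue */
		return false;
	}
	*p = c->slot[c->head];
	c->head = (c->head + 1) & (TOY_CACHE_SIZE - 1);
	c->reuse++;
	return true;
}
```

If page A (refcount 2) is queued before page B (refcount 1), every get fails as busy until A's extra reference is dropped, even though B is free to reuse.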
> Packet rate performance testing was done with pktgen 64B packets on
> xmit side and TC drop action on RX side.
I assume this is TC _ingress_ dropping, like [1]
[1] https://github.com/netoptimizer/network-testing/blob/master/bin/tc_ingress_drop.sh
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
> 1.Baseline, before 'net/mlx5e: Build RX SKB on demand'
> 2.Build SKB with RX page cache (This patch)
>
> Streams Baseline Build SKB+page-cache Improvement
> -----------------------------------------------------------
> 1 4.33Mpps 5.51Mpps 27%
> 2 7.35Mpps 11.5Mpps 52%
> 4 14.0Mpps 16.3Mpps 16%
> 8 22.2Mpps 29.6Mpps 20%
> 16 24.8Mpps 34.0Mpps 17%
The improvements gained from using your page-cache are impressively high.
Thanks for working on this,
--Jesper
> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en.h | 10 +-
> drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 31 +++-
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 215 +++++++++++-----------
> 3 files changed, 133 insertions(+), 123 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index afbdf70..a346112 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -65,6 +65,8 @@
> #define MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW 0x3
> #define MLX5E_PARAMS_MAXIMUM_LOG_RQ_SIZE_MPW 0x6
>
> +#define MLX5_RX_HEADROOM NET_SKB_PAD
> +
> #define MLX5_MPWRQ_LOG_STRIDE_SIZE 6 /* >= 6, HW restriction */
> #define MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS 8 /* >= 6, HW restriction */
> #define MLX5_MPWRQ_LOG_WQE_SZ 18
> @@ -302,10 +304,14 @@ struct mlx5e_page_cache {
> struct mlx5e_rq {
> /* data path */
> struct mlx5_wq_ll wq;
> - u32 wqe_sz;
> - struct sk_buff **skb;
> +
> + struct mlx5e_dma_info *dma_info;
> struct mlx5e_mpw_info *wqe_info;
> void *mtt_no_align;
> + struct {
> + u8 page_order;
> + u32 wqe_sz; /* wqe data buffer size */
> + } buff;
> __be32 mkey_be;
>
> struct device *pdev;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> index c84702c..c9f1dea 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
> @@ -411,6 +411,8 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
> void *rqc = param->rqc;
> void *rqc_wq = MLX5_ADDR_OF(rqc, rqc, wq);
> u32 byte_count;
> + u32 frag_sz;
> + int npages;
> int wq_sz;
> int err;
> int i;
> @@ -445,29 +447,40 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
>
> rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
> rq->mpwqe_num_strides = BIT(priv->params.mpwqe_log_num_strides);
> - rq->wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> - byte_count = rq->wqe_sz;
> +
> + rq->buff.wqe_sz = rq->mpwqe_stride_sz * rq->mpwqe_num_strides;
> + byte_count = rq->buff.wqe_sz;
> rq->mkey_be = cpu_to_be32(c->priv->umr_mkey.key);
> err = mlx5e_rq_alloc_mpwqe_info(rq, c);
> if (err)
> goto err_rq_wq_destroy;
> break;
> default: /* MLX5_WQ_TYPE_LINKED_LIST */
> - rq->skb = kzalloc_node(wq_sz * sizeof(*rq->skb), GFP_KERNEL,
> - cpu_to_node(c->cpu));
> - if (!rq->skb) {
> + rq->dma_info = kzalloc_node(wq_sz * sizeof(*rq->dma_info), GFP_KERNEL,
> + cpu_to_node(c->cpu));
> + if (!rq->dma_info) {
> err = -ENOMEM;
> goto err_rq_wq_destroy;
> }
> +
> rq->handle_rx_cqe = mlx5e_handle_rx_cqe;
> rq->alloc_wqe = mlx5e_alloc_rx_wqe;
> rq->dealloc_wqe = mlx5e_dealloc_rx_wqe;
>
> - rq->wqe_sz = (priv->params.lro_en) ?
> + rq->buff.wqe_sz = (priv->params.lro_en) ?
> priv->params.lro_wqe_sz :
> MLX5E_SW2HW_MTU(priv->netdev->mtu);
> - rq->wqe_sz = SKB_DATA_ALIGN(rq->wqe_sz);
> - byte_count = rq->wqe_sz;
> + byte_count = rq->buff.wqe_sz;
> +
> + /* calc the required page order */
> + frag_sz = MLX5_RX_HEADROOM +
> + byte_count /* packet data */ +
> + SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
> + frag_sz = SKB_DATA_ALIGN(frag_sz);
> +
> + npages = DIV_ROUND_UP(frag_sz, PAGE_SIZE);
> + rq->buff.page_order = order_base_2(npages);
> +
> byte_count |= MLX5_HW_START_PADDING;
> rq->mkey_be = c->mkey_be;
> }
> @@ -502,7 +515,7 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
> mlx5e_rq_free_mpwqe_info(rq);
> break;
> default: /* MLX5_WQ_TYPE_LINKED_LIST */
> - kfree(rq->skb);
> + kfree(rq->dma_info);
> }
>
> for (i = rq->page_cache.head; i != rq->page_cache.tail;
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 8e02af3..2f5bc6f 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -179,50 +179,99 @@ unlock:
> mutex_unlock(&priv->state_lock);
> }
>
> -int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +#define RQ_PAGE_SIZE(rq) ((1 << rq->buff.page_order) << PAGE_SHIFT)
> +
> +static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> + struct mlx5e_dma_info *dma_info)
> {
> - struct sk_buff *skb;
> - dma_addr_t dma_addr;
> + struct mlx5e_page_cache *cache = &rq->page_cache;
> + u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
>
> - skb = napi_alloc_skb(rq->cq.napi, rq->wqe_sz);
> - if (unlikely(!skb))
> - return -ENOMEM;
> + if (tail_next == cache->head) {
> + rq->stats.cache_full++;
> + return false;
> + }
> +
> + cache->page_cache[cache->tail] = *dma_info;
> + cache->tail = tail_next;
> + return true;
> +}
> +
> +static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> + struct mlx5e_dma_info *dma_info)
> +{
> + struct mlx5e_page_cache *cache = &rq->page_cache;
> +
> + if (unlikely(cache->head == cache->tail)) {
> + rq->stats.cache_empty++;
> + return false;
> + }
> +
> + if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> + rq->stats.cache_busy++;
> + return false;
> + }
> +
> + *dma_info = cache->page_cache[cache->head];
> + cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> + rq->stats.cache_reuse++;
> +
> + dma_sync_single_for_device(rq->pdev, dma_info->addr,
> + RQ_PAGE_SIZE(rq),
> + DMA_FROM_DEVICE);
> + return true;
> +}
>
> - dma_addr = dma_map_single(rq->pdev,
> - /* hw start padding */
> - skb->data,
> - /* hw end padding */
> - rq->wqe_sz,
> - DMA_FROM_DEVICE);
> +static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
> + struct mlx5e_dma_info *dma_info)
> +{
> + struct page *page;
>
> - if (unlikely(dma_mapping_error(rq->pdev, dma_addr)))
> - goto err_free_skb;
> + if (mlx5e_rx_cache_get(rq, dma_info))
> + return 0;
>
> - *((dma_addr_t *)skb->cb) = dma_addr;
> - wqe->data.addr = cpu_to_be64(dma_addr);
> + page = dev_alloc_pages(rq->buff.page_order);
> + if (unlikely(!page))
> + return -ENOMEM;
>
> - rq->skb[ix] = skb;
> + dma_info->page = page;
> + dma_info->addr = dma_map_page(rq->pdev, page, 0,
> + RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
> + if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
> + put_page(page);
> + return -ENOMEM;
> + }
>
> return 0;
> +}
>
> -err_free_skb:
> - dev_kfree_skb(skb);
> +void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
> + bool recycle)
> +{
> + if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
> + return;
> +
> + dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
> + DMA_FROM_DEVICE);
> + put_page(dma_info->page);
> +}
> +
> +int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +{
> + struct mlx5e_dma_info *di = &rq->dma_info[ix];
>
> - return -ENOMEM;
> + if (unlikely(mlx5e_page_alloc_mapped(rq, di)))
> + return -ENOMEM;
> +
> + wqe->data.addr = cpu_to_be64(di->addr + MLX5_RX_HEADROOM);
> + return 0;
> }
>
> void mlx5e_dealloc_rx_wqe(struct mlx5e_rq *rq, u16 ix)
> {
> - struct sk_buff *skb = rq->skb[ix];
> + struct mlx5e_dma_info *di = &rq->dma_info[ix];
>
> - if (skb) {
> - rq->skb[ix] = NULL;
> - dma_unmap_single(rq->pdev,
> - *((dma_addr_t *)skb->cb),
> - rq->wqe_sz,
> - DMA_FROM_DEVICE);
> - dev_kfree_skb(skb);
> - }
> + mlx5e_page_release(rq, di, true);
> }
>
> static inline int mlx5e_mpwqe_strides_per_page(struct mlx5e_rq *rq)
> @@ -305,79 +354,6 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
> mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
> }
>
> -static inline bool mlx5e_rx_cache_put(struct mlx5e_rq *rq,
> - struct mlx5e_dma_info *dma_info)
> -{
> - struct mlx5e_page_cache *cache = &rq->page_cache;
> - u32 tail_next = (cache->tail + 1) & (MLX5E_CACHE_SIZE - 1);
> -
> - if (tail_next == cache->head) {
> - rq->stats.cache_full++;
> - return false;
> - }
> -
> - cache->page_cache[cache->tail] = *dma_info;
> - cache->tail = tail_next;
> - return true;
> -}
> -
> -static inline bool mlx5e_rx_cache_get(struct mlx5e_rq *rq,
> - struct mlx5e_dma_info *dma_info)
> -{
> - struct mlx5e_page_cache *cache = &rq->page_cache;
> -
> - if (unlikely(cache->head == cache->tail)) {
> - rq->stats.cache_empty++;
> - return false;
> - }
> -
> - if (page_ref_count(cache->page_cache[cache->head].page) != 1) {
> - rq->stats.cache_busy++;
> - return false;
> - }
> -
> - *dma_info = cache->page_cache[cache->head];
> - cache->head = (cache->head + 1) & (MLX5E_CACHE_SIZE - 1);
> - rq->stats.cache_reuse++;
> -
> - dma_sync_single_for_device(rq->pdev, dma_info->addr, PAGE_SIZE,
> - DMA_FROM_DEVICE);
> - return true;
> -}
> -
> -static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
> - struct mlx5e_dma_info *dma_info)
> -{
> - struct page *page;
> -
> - if (mlx5e_rx_cache_get(rq, dma_info))
> - return 0;
> -
> - page = dev_alloc_page();
> - if (unlikely(!page))
> - return -ENOMEM;
> -
> - dma_info->page = page;
> - dma_info->addr = dma_map_page(rq->pdev, page, 0, PAGE_SIZE,
> - DMA_FROM_DEVICE);
> - if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
> - put_page(page);
> - return -ENOMEM;
> - }
> -
> - return 0;
> -}
> -
> -void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
> - bool recycle)
> -{
> - if (likely(recycle) && mlx5e_rx_cache_put(rq, dma_info))
> - return;
> -
> - dma_unmap_page(rq->pdev, dma_info->addr, PAGE_SIZE, DMA_FROM_DEVICE);
> - put_page(dma_info->page);
> -}
> -
> static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
> struct mlx5e_rx_wqe *wqe,
> u16 ix)
> @@ -448,7 +424,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
> mlx5_wq_ll_update_db_record(wq);
> }
>
> -int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> +int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
> {
> int err;
>
> @@ -650,31 +626,46 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
>
> void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> {
> + struct mlx5e_dma_info *di;
> struct mlx5e_rx_wqe *wqe;
> - struct sk_buff *skb;
> __be16 wqe_counter_be;
> + struct sk_buff *skb;
> u16 wqe_counter;
> u32 cqe_bcnt;
> + void *va;
>
> wqe_counter_be = cqe->wqe_counter;
> wqe_counter = be16_to_cpu(wqe_counter_be);
> wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
> - skb = rq->skb[wqe_counter];
> - prefetch(skb->data);
> - rq->skb[wqe_counter] = NULL;
> + di = &rq->dma_info[wqe_counter];
> + va = page_address(di->page);
>
> - dma_unmap_single(rq->pdev,
> - *((dma_addr_t *)skb->cb),
> - rq->wqe_sz,
> - DMA_FROM_DEVICE);
> + dma_sync_single_range_for_cpu(rq->pdev,
> + di->addr,
> + MLX5_RX_HEADROOM,
> + rq->buff.wqe_sz,
> + DMA_FROM_DEVICE);
> + prefetch(va + MLX5_RX_HEADROOM);
>
> if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
> rq->stats.wqe_err++;
> - dev_kfree_skb(skb);
> + mlx5e_page_release(rq, di, true);
> goto wq_ll_pop;
> }
>
> + skb = build_skb(va, RQ_PAGE_SIZE(rq));
> + if (unlikely(!skb)) {
> + rq->stats.buff_alloc_err++;
> + mlx5e_page_release(rq, di, true);
> + goto wq_ll_pop;
> + }
> +
> + /* queue up for recycling ..*/
> + page_ref_inc(di->page);
> + mlx5e_page_release(rq, di, true);
> +
> cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
> + skb_reserve(skb, MLX5_RX_HEADROOM);
> skb_put(skb, cqe_bcnt);
>
> mlx5e_complete_rx_cqe(rq, cqe, cqe_bcnt, skb);
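The page-order calculation added to mlx5e_create_rq() in the quoted diff can be followed with a small userspace sketch. All constants here are assumptions for illustration (4096-byte pages, 64-byte alignment for SKB_DATA_ALIGN, 64 bytes of headroom, 320 bytes of shared info), and the toy_* helpers are hypothetical stand-ins for the kernel macros.

```c
#include <assert.h>

#define TOY_PAGE_SIZE   4096u /* assumed PAGE_SIZE */
#define TOY_ALIGN_BYTES   64u /* assumed SMP_CACHE_BYTES used by SKB_DATA_ALIGN */
#define TOY_HEADROOM      64u /* assumed value of MLX5_RX_HEADROOM */
#define TOY_SHINFO_SIZE  320u /* assumed sizeof(struct skb_shared_info) */

/* Round x up to a multiple of a (a must be a power of two). */
static unsigned int toy_align(unsigned int x, unsigned int a)
{
	assert(a != 0 && (a & (a - 1)) == 0);
	return (x + a - 1) & ~(a - 1);
}

/* Smallest order such that (1 << order) >= n, like order_base_2(). */
static unsigned int toy_order_base_2(unsigned int n)
{
	unsigned int order = 0;

	while ((1u << order) < n)
		order++;
	return order;
}

/* Mirrors the frag_sz / npages / page_order steps of the patch. */
static unsigned int toy_page_order(unsigned int byte_count)
{
	unsigned int frag_sz, npages;

	frag_sz = TOY_HEADROOM + byte_count +
		  toy_align(TOY_SHINFO_SIZE, TOY_ALIGN_BYTES);
	frag_sz = toy_align(frag_sz, TOY_ALIGN_BYTES);
	npages = (frag_sz + TOY_PAGE_SIZE - 1) / TOY_PAGE_SIZE;
	return toy_order_base_2(npages);
}
```

Under these assumptions a 1522-byte buffer fits in an order-0 page, a 9018-byte buffer needs 3 pages and is rounded up to order 2, and a 64KB buffer needs 17 pages and is rounded up to order 5.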
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (3 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 04/11] net/mlx5e: Build RX SKB on demand Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size Saeed Mahameed
` (6 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
We have two types of RX RQs, and they use two separate sets of
info arrays and structures in the RX data path functions. Today those
structures are mutually exclusive per RQ type, hence only one kind is
allocated on RQ creation, according to the RQ type.
For better cache locality and to minimize
sizeof(struct mlx5e_rq), in this patch we define them as a union.
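The size effect of the union can be sketched in userspace. The two structs below are hypothetical reductions of struct mlx5e_rq that keep only the per-type fields, with void pointers standing in for the real info types, so the absolute sizes are illustrative only.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Before: both sets of fields occupy space in every RQ. */
struct rq_separate {
	void *dma_info;     /* used by linked-list RQs only */
	void *wqe_info;     /* used by striding RQs only */
	void *mtt_no_align; /* used by striding RQs only */
	uint32_t mpwqe_mtt_offset;
};

/* After: the mutually exclusive fields share storage. */
struct rq_union {
	union {
		void *dma_info;
		struct {
			void *info;
			void *mtt_no_align;
			uint32_t mtt_offset;
		} mpwqe;
	};
};

/* Print both sizes and check that the two variants overlay each other. */
void rq_layout_check(void)
{
	printf("separate: %zu bytes, union: %zu bytes\n",
	       sizeof(struct rq_separate), sizeof(struct rq_union));
	assert(offsetof(struct rq_union, dma_info) ==
	       offsetof(struct rq_union, mpwqe));
	assert(sizeof(struct rq_union) < sizeof(struct rq_separate));
}
```

Only one member of the union is valid for a given RQ, so the code must pick the member according to the RQ type, as the hunks below do.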
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 14 ++++++----
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 32 +++++++++++------------
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 10 +++----
3 files changed, 30 insertions(+), 26 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index a346112..7dfb34e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -305,9 +305,14 @@ struct mlx5e_rq {
/* data path */
struct mlx5_wq_ll wq;
- struct mlx5e_dma_info *dma_info;
- struct mlx5e_mpw_info *wqe_info;
- void *mtt_no_align;
+ union {
+ struct mlx5e_dma_info *dma_info;
+ struct {
+ struct mlx5e_mpw_info *info;
+ void *mtt_no_align;
+ u32 mtt_offset;
+ } mpwqe;
+ };
struct {
u8 page_order;
u32 wqe_sz; /* wqe data buffer size */
@@ -327,7 +332,6 @@ struct mlx5e_rq {
unsigned long state;
int ix;
- u32 mpwqe_mtt_offset;
struct mlx5e_rx_am am; /* Adaptive Moderation */
@@ -804,7 +808,7 @@ static inline void mlx5e_cq_arm(struct mlx5e_cq *cq)
static inline u32 mlx5e_get_wqe_mtt_offset(struct mlx5e_rq *rq, u16 wqe_ix)
{
- return rq->mpwqe_mtt_offset +
+ return rq->mpwqe.mtt_offset +
wqe_ix * ALIGN(MLX5_MPWRQ_PAGES_PER_WQE, 8);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index c9f1dea..9f0f5f6 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -317,7 +317,7 @@ static inline void mlx5e_build_umr_wqe(struct mlx5e_rq *rq, struct mlx5e_sq *sq,
struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
struct mlx5_wqe_umr_ctrl_seg *ucseg = &wqe->uctrl;
struct mlx5_wqe_data_seg *dseg = &wqe->data;
- struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
u8 ds_cnt = DIV_ROUND_UP(sizeof(*wqe), MLX5_SEND_WQE_DS);
u32 umr_wqe_mtt_offset = mlx5e_get_wqe_mtt_offset(rq, ix);
@@ -345,21 +345,21 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
int mtt_alloc = mtt_sz + MLX5_UMR_ALIGN - 1;
int i;
- rq->wqe_info = kzalloc_node(wq_sz * sizeof(*rq->wqe_info),
- GFP_KERNEL, cpu_to_node(c->cpu));
- if (!rq->wqe_info)
+ rq->mpwqe.info = kzalloc_node(wq_sz * sizeof(*rq->mpwqe.info),
+ GFP_KERNEL, cpu_to_node(c->cpu));
+ if (!rq->mpwqe.info)
goto err_out;
/* We allocate more than mtt_sz as we will align the pointer */
- rq->mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
+ rq->mpwqe.mtt_no_align = kzalloc_node(mtt_alloc * wq_sz, GFP_KERNEL,
cpu_to_node(c->cpu));
- if (unlikely(!rq->mtt_no_align))
+ if (unlikely(!rq->mpwqe.mtt_no_align))
goto err_free_wqe_info;
for (i = 0; i < wq_sz; i++) {
- struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
- wi->umr.mtt = PTR_ALIGN(rq->mtt_no_align + i * mtt_alloc,
+ wi->umr.mtt = PTR_ALIGN(rq->mpwqe.mtt_no_align + i * mtt_alloc,
MLX5_UMR_ALIGN);
wi->umr.mtt_addr = dma_map_single(c->pdev, wi->umr.mtt, mtt_sz,
PCI_DMA_TODEVICE);
@@ -373,14 +373,14 @@ static int mlx5e_rq_alloc_mpwqe_info(struct mlx5e_rq *rq,
err_unmap_mtts:
while (--i >= 0) {
- struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
dma_unmap_single(c->pdev, wi->umr.mtt_addr, mtt_sz,
PCI_DMA_TODEVICE);
}
- kfree(rq->mtt_no_align);
+ kfree(rq->mpwqe.mtt_no_align);
err_free_wqe_info:
- kfree(rq->wqe_info);
+ kfree(rq->mpwqe.info);
err_out:
return -ENOMEM;
@@ -393,13 +393,13 @@ static void mlx5e_rq_free_mpwqe_info(struct mlx5e_rq *rq)
int i;
for (i = 0; i < wq_sz; i++) {
- struct mlx5e_mpw_info *wi = &rq->wqe_info[i];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[i];
dma_unmap_single(rq->pdev, wi->umr.mtt_addr, mtt_sz,
PCI_DMA_TODEVICE);
}
- kfree(rq->mtt_no_align);
- kfree(rq->wqe_info);
+ kfree(rq->mpwqe.mtt_no_align);
+ kfree(rq->mpwqe.info);
}
static int mlx5e_create_rq(struct mlx5e_channel *c,
@@ -442,7 +442,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->alloc_wqe = mlx5e_alloc_rx_mpwqe;
rq->dealloc_wqe = mlx5e_dealloc_rx_mpwqe;
- rq->mpwqe_mtt_offset = c->ix *
+ rq->mpwqe.mtt_offset = c->ix *
MLX5E_REQUIRED_MTTS(1, BIT(priv->params.log_rq_size));
rq->mpwqe_stride_sz = BIT(priv->params.mpwqe_log_stride_sz);
@@ -656,7 +656,7 @@ static void mlx5e_free_rx_descs(struct mlx5e_rq *rq)
/* UMR WQE (if in progress) is always at wq->head */
if (test_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state))
- mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
+ mlx5e_free_rx_mpwqe(rq, &rq->mpwqe.info[wq->head]);
while (!mlx5_wq_ll_is_empty(wq)) {
wqe_ix_be = *wq->tail_next;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 2f5bc6f..95f9b1e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -328,7 +328,7 @@ mlx5e_copy_skb_header_mpwqe(struct device *pdev,
static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
{
- struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
struct mlx5e_sq *sq = &rq->channel->icosq;
struct mlx5_wq_cyc *wq = &sq->wq;
struct mlx5e_umr_wqe *wqe;
@@ -358,7 +358,7 @@ static int mlx5e_alloc_rx_umr_mpwqe(struct mlx5e_rq *rq,
struct mlx5e_rx_wqe *wqe,
u16 ix)
{
- struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
u64 dma_offset = (u64)mlx5e_get_wqe_mtt_offset(rq, ix) << PAGE_SHIFT;
int pg_strides = mlx5e_mpwqe_strides_per_page(rq);
int err;
@@ -412,7 +412,7 @@ void mlx5e_post_rx_mpwqe(struct mlx5e_rq *rq)
clear_bit(MLX5E_RQ_STATE_UMR_WQE_IN_PROGRESS, &rq->state);
if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state))) {
- mlx5e_free_rx_mpwqe(rq, &rq->wqe_info[wq->head]);
+ mlx5e_free_rx_mpwqe(rq, &rq->mpwqe.info[wq->head]);
return;
}
@@ -438,7 +438,7 @@ int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix)
void mlx5e_dealloc_rx_mpwqe(struct mlx5e_rq *rq, u16 ix)
{
- struct mlx5e_mpw_info *wi = &rq->wqe_info[ix];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[ix];
mlx5e_free_rx_mpwqe(rq, wi);
}
@@ -717,7 +717,7 @@ void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
{
u16 cstrides = mpwrq_get_cqe_consumed_strides(cqe);
u16 wqe_id = be16_to_cpu(cqe->wqe_id);
- struct mlx5e_mpw_info *wi = &rq->wqe_info[wqe_id];
+ struct mlx5e_mpw_info *wi = &rq->mpwqe.info[wqe_id];
struct mlx5e_rx_wqe *wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_id);
struct sk_buff *skb;
u16 cqe_bcnt;
--
2.7.4
* [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (4 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 05/11] net/mlx5e: Union RQ RX info per RQ type Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure Saeed Mahameed
` (5 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
Before this patch the LRO size was 64K. Now that build_skb requires
extra room, the headroom + sizeof(skb_shared_info) added to the data
buffer would make the WQE size or page_frag_size slightly larger than
64K, which would demand an order-5 page instead of order-4 on 4K page systems.
We take those extra bytes from the hardware LRO data size so as not to
increase the required page order when hardware LRO is enabled.
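The page-order arithmetic can be checked with a small user-space sketch. The helpers page_order() and rx_buf_size() are hypothetical, and the headroom (64) and shared-info (320) byte counts used in the usage note are placeholders, not the driver's actual MLX5_RX_HEADROOM or SKB_DATA_ALIGN(sizeof(struct skb_shared_info)) values.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE_4K ((size_t)4096)

/* Smallest order such that (4K << order) >= size, in the spirit of
 * the kernel's get_order(); hypothetical user-space helper.
 */
static unsigned int page_order(size_t size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((PAGE_SIZE_4K << order) < size)
		order++;
	return order;
}

/* Total buffer needed once build_skb's extra room is accounted for. */
static size_t rx_buf_size(size_t lro_data_sz, size_t headroom,
			  size_t shinfo_sz)
{
	return headroom + lro_data_sz + shinfo_sz;
}
```

With the placeholder values, rx_buf_size(65536, 64, 320) needs order 5, while rx_buf_size(65536 - 64 - 320, 64, 320) is exactly 64K and stays at order 4, which is the effect the patch is after.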
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 9f0f5f6..17f84f9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -3185,8 +3185,11 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
mlx5e_build_default_indir_rqt(mdev, priv->params.indirection_rqt,
MLX5E_INDIR_RQT_SIZE, profile->max_nch(mdev));
- priv->params.lro_wqe_sz =
- MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ;
+ priv->params.lro_wqe_sz =
+ MLX5E_PARAMS_DEFAULT_LRO_WQE_SZ -
+ /* Extra room needed for build_skb */
+ MLX5_RX_HEADROOM -
+ SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
/* Initialize pflags */
MLX5E_SET_PRIV_FLAG(priv, MLX5E_PFLAG_RX_CQE_BASED_MODER,
--
2.7.4
* [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (5 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 06/11] net/mlx5e: Slightly reduce hardware LRO size Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
` (4 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
Add two helper functions to allow dynamic changes of RQ type.
mlx5e_set_rq_priv_params and mlx5e_set_rq_type_params will be
used on netdev creation to determine the default RQ type.
This will be needed later for downstream patches of XDP support.
When enabling XDP we will dynamically move from striding RQ to
linked list RQ type.
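The selection logic can be sketched in plain C as follows. This is a simplified model, not the driver code: enum rq_type, struct rq_caps and both function names are hypothetical, and the xdp_attached parameter anticipates the check that the later XDP patch adds.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the two MLX5_WQ_TYPE_* values involved. */
enum rq_type { RQ_TYPE_LINKED_LIST, RQ_TYPE_STRIDING };

/* Hypothetical snapshot of the three capability bits that are checked. */
struct rq_caps {
	bool striding_rq;
	bool umr_ptr_rlky;
	bool reg_umr_sq;
};

/* Striding RQ with fragmented memory needs all three capabilities. */
static bool striding_rq_supported(const struct rq_caps *caps)
{
	return caps->striding_rq && caps->umr_ptr_rlky && caps->reg_umr_sq;
}

/* Default type: striding RQ when supported, unless XDP is attached,
 * in which case we fall back to the linked-list RQ.
 */
static enum rq_type default_rq_type(const struct rq_caps *caps,
				    bool xdp_attached)
{
	if (striding_rq_supported(caps) && !xdp_attached)
		return RQ_TYPE_STRIDING;
	return RQ_TYPE_LINKED_LIST;
}
```

Keeping the decision in one helper is what lets a later caller re-run it whenever the inputs change, instead of only once at netdev creation.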
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 92 ++++++++++++-----------
1 file changed, 50 insertions(+), 42 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 17f84f9..a6a2e60 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -69,6 +69,47 @@ struct mlx5e_channel_param {
struct mlx5e_cq_param icosq_cq;
};
+static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
+{
+ return MLX5_CAP_GEN(mdev, striding_rq) &&
+ MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
+ MLX5_CAP_ETH(mdev, reg_umr_sq);
+}
+
+static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
+{
+ priv->params.rq_wq_type = rq_type;
+ switch (priv->params.rq_wq_type) {
+ case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
+ priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
+ priv->params.mpwqe_log_stride_sz = priv->params.rx_cqe_compress ?
+ MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS :
+ MLX5_MPWRQ_LOG_STRIDE_SIZE;
+ priv->params.mpwqe_log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ -
+ priv->params.mpwqe_log_stride_sz;
+ break;
+ default: /* MLX5_WQ_TYPE_LINKED_LIST */
+ priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
+ }
+ priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
+ BIT(priv->params.log_rq_size));
+
+ mlx5_core_info(priv->mdev,
+ "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
+ priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
+ BIT(priv->params.log_rq_size),
+ BIT(priv->params.mpwqe_log_stride_sz),
+ priv->params.rx_cqe_compress_admin);
+}
+
+static void mlx5e_set_rq_priv_params(struct mlx5e_priv *priv)
+{
+ u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) ?
+ MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
+ MLX5_WQ_TYPE_LINKED_LIST;
+ mlx5e_set_rq_type_params(priv, rq_type);
+}
+
static void mlx5e_update_carrier(struct mlx5e_priv *priv)
{
struct mlx5_core_dev *mdev = priv->mdev;
@@ -3036,13 +3077,6 @@ void mlx5e_build_default_indir_rqt(struct mlx5_core_dev *mdev,
indirection_rqt[i] = i % num_channels;
}
-static bool mlx5e_check_fragmented_striding_rq_cap(struct mlx5_core_dev *mdev)
-{
- return MLX5_CAP_GEN(mdev, striding_rq) &&
- MLX5_CAP_GEN(mdev, umr_ptr_rlky) &&
- MLX5_CAP_ETH(mdev, reg_umr_sq);
-}
-
static int mlx5e_get_pci_bw(struct mlx5_core_dev *mdev, u32 *pci_bw)
{
enum pcie_link_width width;
@@ -3122,11 +3156,13 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
MLX5_CQ_PERIOD_MODE_START_FROM_CQE :
MLX5_CQ_PERIOD_MODE_START_FROM_EQE;
- priv->params.log_sq_size =
- MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
- priv->params.rq_wq_type = mlx5e_check_fragmented_striding_rq_cap(mdev) ?
- MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
- MLX5_WQ_TYPE_LINKED_LIST;
+ priv->mdev = mdev;
+ priv->netdev = netdev;
+ priv->params.num_channels = profile->max_nch(mdev);
+ priv->profile = profile;
+ priv->ppriv = ppriv;
+
+ priv->params.log_sq_size = MLX5E_PARAMS_DEFAULT_LOG_SQ_SIZE;
/* set CQE compression */
priv->params.rx_cqe_compress_admin = false;
@@ -3139,33 +3175,11 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
priv->params.rx_cqe_compress_admin =
cqe_compress_heuristic(link_speed, pci_bw);
}
-
priv->params.rx_cqe_compress = priv->params.rx_cqe_compress_admin;
- switch (priv->params.rq_wq_type) {
- case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
- priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE_MPW;
- priv->params.mpwqe_log_stride_sz =
- priv->params.rx_cqe_compress ?
- MLX5_MPWRQ_LOG_STRIDE_SIZE_CQE_COMPRESS :
- MLX5_MPWRQ_LOG_STRIDE_SIZE;
- priv->params.mpwqe_log_num_strides = MLX5_MPWRQ_LOG_WQE_SZ -
- priv->params.mpwqe_log_stride_sz;
+ mlx5e_set_rq_priv_params(priv);
+ if (priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ)
priv->params.lro_en = true;
- break;
- default: /* MLX5_WQ_TYPE_LINKED_LIST */
- priv->params.log_rq_size = MLX5E_PARAMS_DEFAULT_LOG_RQ_SIZE;
- }
-
- mlx5_core_info(mdev,
- "MLX5E: StrdRq(%d) RqSz(%ld) StrdSz(%ld) RxCqeCmprss(%d)\n",
- priv->params.rq_wq_type == MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ,
- BIT(priv->params.log_rq_size),
- BIT(priv->params.mpwqe_log_stride_sz),
- priv->params.rx_cqe_compress_admin);
-
- priv->params.min_rx_wqes = mlx5_min_rx_wqes(priv->params.rq_wq_type,
- BIT(priv->params.log_rq_size));
priv->params.rx_am_enabled = MLX5_CAP_GEN(mdev, cq_moderation);
mlx5e_set_rx_cq_mode_params(&priv->params, cq_period_mode);
@@ -3195,12 +3209,6 @@ static void mlx5e_build_nic_netdev_priv(struct mlx5_core_dev *mdev,
MLX5E_SET_PRIV_FLAG(priv, MLX5E_PFLAG_RX_CQE_BASED_MODER,
priv->params.rx_cq_period_mode == MLX5_CQ_PERIOD_MODE_START_FROM_CQE);
- priv->mdev = mdev;
- priv->netdev = netdev;
- priv->params.num_channels = profile->max_nch(mdev);
- priv->profile = profile;
- priv->ppriv = ppriv;
-
#ifdef CONFIG_MLX5_CORE_EN_DCB
mlx5e_ets_init(priv);
#endif
--
2.7.4
* [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (6 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 07/11] net/mlx5e: Dynamic RQ type infrastructure Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 13:32 ` Or Gerlitz
` (2 more replies)
2016-09-07 12:42 ` [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types Saeed Mahameed
` (3 subsequent siblings)
11 siblings, 3 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Rana Shahout,
Saeed Mahameed
From: Rana Shahout <ranas@mellanox.com>
Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
When XDP is on we make sure to change the channels' RQ type to
MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
ensure "page per packet".
On XDP set, we fail if HW LRO is set and ask the user to turn it
off. Since on ConnectX4-LX HW LRO is always on by default, this will be
annoying, but we prefer not to enforce LRO off from the XDP set function.
Full channels reset (close/open) is required only when setting XDP
on/off.
When XDP set is called just to exchange programs, we update
each RQ's xdp program on the fly. To synchronize with the current
data path RX activity of that RQ, we temporarily disable that RQ and
ensure the RX path is not running, quickly update and re-enable that RQ.
For that we do:
- rq.state = disabled
- napi_synchronize
- xchg(rq->xdp_prog)
- rq.state = enabled
- napi_schedule // Just in case we've missed an IRQ
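The five steps above can be modelled in user space with C11 atomics. This is only a sketch of the ordering: struct rq_model, wait_for_poll_idle() and kick_poll() are hypothetical placeholders for the RQ state bit, napi_synchronize() and napi_schedule(), and the model has no real poller running concurrently.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct prog { int id; };                 /* stand-in for struct bpf_prog */

struct rq_model {
	atomic_bool flush;               /* models the "disabled" state */
	_Atomic(struct prog *) xdp_prog; /* program read by the RX path */
	int kicks;                       /* times the poller was rescheduled */
};

/* Placeholder for napi_synchronize(): would wait for a running poll. */
static void wait_for_poll_idle(struct rq_model *rq)
{
	(void)rq;
}

/* Placeholder for napi_schedule(): re-arm polling in case of a missed IRQ. */
static void kick_poll(struct rq_model *rq)
{
	rq->kicks++;
}

/* Swap the program on one RQ; returns the old program so the caller
 * can drop its reference.
 */
static struct prog *rq_swap_prog(struct rq_model *rq, struct prog *new_prog)
{
	struct prog *old;

	atomic_store(&rq->flush, true);                  /* rq.state = disabled */
	wait_for_poll_idle(rq);                          /* napi_synchronize */
	old = atomic_exchange(&rq->xdp_prog, new_prog);  /* xchg */
	atomic_store(&rq->flush, false);                 /* rq.state = enabled */
	kick_poll(rq);                                   /* napi_schedule */
	return old;
}
```

Returning the old pointer instead of releasing it inside the helper keeps the reference counting decision with the caller, which matches how the set function below puts the old program after the exchange.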
Packet rate performance testing was done with pktgen 64B packets on the
TX side, comparing TC drop action on the RX side to XDP fast drop.
CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
Comparison is done between:
1. Baseline, Before this patch with TC drop action
2. This patch with TC drop action
3. This patch with XDP RX fast drop
Streams Baseline(TC drop) TC drop XDP fast Drop
--------------------------------------------------------------
1 5.51Mpps 5.14Mpps 13.5Mpps
2 11.5Mpps 10.0Mpps 25.1Mpps
4 16.3Mpps 17.2Mpps 35.4Mpps
8 29.6Mpps 28.2Mpps 45.8Mpps*
16 34.0Mpps 30.1Mpps 45.8Mpps*
There appears to be around 5% degradation between the baseline
and this patch with a single stream when comparing packet rate with TC drop.
It might be related to XDP code overhead or to new cache misses added by
the XDP code.
*My xmitter was limited to 45Mpps, so for 8/16 streams the xmitter is
the bottleneck and it seems that XDP drop can handle more.
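For readers who want to re-derive the ratios from the table, a tiny helper sketch follows; pct_change() and speedup() are hypothetical names and the code is not part of the patch. Fed with the single-stream row (5.51 baseline, 5.14 with this patch, 13.5 with XDP), the TC drop change works out to roughly -6.7%, a little more than the rounded figure quoted above, and XDP drop is roughly 2.6x the TC drop rate.

```c
/* Relative change of val against base, in percent (negative = slower). */
static double pct_change(double base, double val)
{
	return (val - base) / base * 100.0;
}

/* How many times faster val is than base. */
static double speedup(double base, double val)
{
	return val / base;
}
```

The same helpers can be applied row by row; note that the 8 and 16 stream XDP numbers are capped by the transmitter, so ratios computed from them only give a lower bound.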
Signed-off-by: Rana Shahout <ranas@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 2 +
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 100 ++++++++++++++++++++-
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 26 +++++-
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 4 +
4 files changed, 130 insertions(+), 2 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 7dfb34e..729bae8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -334,6 +334,7 @@ struct mlx5e_rq {
int ix;
struct mlx5e_rx_am am; /* Adaptive Moderation */
+ struct bpf_prog *xdp_prog;
/* control */
struct mlx5_wq_ctrl wq_ctrl;
@@ -627,6 +628,7 @@ struct mlx5e_priv {
/* priv data path fields - start */
struct mlx5e_sq **txq_to_sq_map;
int channeltc_to_txq_map[MLX5E_MAX_NUM_CHANNELS][MLX5E_MAX_NUM_TC];
+ struct bpf_prog *xdp_prog;
/* priv data path fields - end */
unsigned long state;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index a6a2e60..dab8486 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -34,6 +34,7 @@
#include <net/pkt_cls.h>
#include <linux/mlx5/fs.h>
#include <net/vxlan.h>
+#include <linux/bpf.h>
#include "en.h"
#include "en_tc.h"
#include "eswitch.h"
@@ -104,7 +105,8 @@ static void mlx5e_set_rq_type_params(struct mlx5e_priv *priv, u8 rq_type)
static void mlx5e_set_rq_priv_params(struct mlx5e_priv *priv)
{
- u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) ?
+ u8 rq_type = mlx5e_check_fragmented_striding_rq_cap(priv->mdev) &&
+ !priv->xdp_prog ?
MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ :
MLX5_WQ_TYPE_LINKED_LIST;
mlx5e_set_rq_type_params(priv, rq_type);
@@ -177,6 +179,7 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
s->rx_csum_none += rq_stats->csum_none;
s->rx_csum_complete += rq_stats->csum_complete;
s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
+ s->rx_xdp_drop += rq_stats->xdp_drop;
s->rx_wqe_err += rq_stats->wqe_err;
s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
@@ -476,6 +479,7 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->channel = c;
rq->ix = c->ix;
rq->priv = c->priv;
+ rq->xdp_prog = priv->xdp_prog;
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
@@ -539,6 +543,9 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->page_cache.head = 0;
rq->page_cache.tail = 0;
+ if (rq->xdp_prog)
+ bpf_prog_add(rq->xdp_prog, 1);
+
return 0;
err_rq_wq_destroy:
@@ -551,6 +558,9 @@ static void mlx5e_destroy_rq(struct mlx5e_rq *rq)
{
int i;
+ if (rq->xdp_prog)
+ bpf_prog_put(rq->xdp_prog);
+
switch (rq->wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
mlx5e_rq_free_mpwqe_info(rq);
@@ -2953,6 +2963,92 @@ static void mlx5e_tx_timeout(struct net_device *dev)
schedule_work(&priv->tx_timeout_work);
}
+static int mlx5e_xdp_set(struct net_device *netdev, struct bpf_prog *prog)
+{
+ struct mlx5e_priv *priv = netdev_priv(netdev);
+ struct bpf_prog *old_prog;
+ int err = 0;
+ bool reset, was_opened;
+ int i;
+
+ mutex_lock(&priv->state_lock);
+
+ if ((netdev->features & NETIF_F_LRO) && prog) {
+ netdev_warn(netdev, "can't set XDP while LRO is on, disable LRO first\n");
+ err = -EINVAL;
+ goto unlock;
+ }
+
+ was_opened = test_bit(MLX5E_STATE_OPENED, &priv->state);
+ /* no need for full reset when exchanging programs */
+ reset = (!priv->xdp_prog || !prog);
+
+ if (was_opened && reset)
+ mlx5e_close_locked(netdev);
+
+ /* exchange programs */
+ old_prog = xchg(&priv->xdp_prog, prog);
+ if (prog)
+ bpf_prog_add(prog, 1);
+ if (old_prog)
+ bpf_prog_put(old_prog);
+
+ if (reset) /* change RQ type according to priv->xdp_prog */
+ mlx5e_set_rq_priv_params(priv);
+
+ if (was_opened && reset)
+ mlx5e_open_locked(netdev);
+
+ if (!test_bit(MLX5E_STATE_OPENED, &priv->state) || reset)
+ goto unlock;
+
+ /* exchanging programs w/o reset, we update ref counts on behalf
+ * of the channels RQs here.
+ */
+ bpf_prog_add(prog, priv->params.num_channels);
+ for (i = 0; i < priv->params.num_channels; i++) {
+ struct mlx5e_channel *c = priv->channel[i];
+
+ set_bit(MLX5E_RQ_STATE_FLUSH, &c->rq.state);
+ napi_synchronize(&c->napi);
+ /* prevent mlx5e_poll_rx_cq from accessing rq->xdp_prog */
+
+ old_prog = xchg(&c->rq.xdp_prog, prog);
+
+ clear_bit(MLX5E_RQ_STATE_FLUSH, &c->rq.state);
+ /* napi_schedule in case we have missed anything */
+ set_bit(MLX5E_CHANNEL_NAPI_SCHED, &c->flags);
+ napi_schedule(&c->napi);
+
+ if (old_prog)
+ bpf_prog_put(old_prog);
+ }
+
+unlock:
+ mutex_unlock(&priv->state_lock);
+ return err;
+}
+
+static bool mlx5e_xdp_attached(struct net_device *dev)
+{
+ struct mlx5e_priv *priv = netdev_priv(dev);
+
+ return !!priv->xdp_prog;
+}
+
+static int mlx5e_xdp(struct net_device *dev, struct netdev_xdp *xdp)
+{
+ switch (xdp->command) {
+ case XDP_SETUP_PROG:
+ return mlx5e_xdp_set(dev, xdp->prog);
+ case XDP_QUERY_PROG:
+ xdp->prog_attached = mlx5e_xdp_attached(dev);
+ return 0;
+ default:
+ return -EINVAL;
+ }
+}
+
static const struct net_device_ops mlx5e_netdev_ops_basic = {
.ndo_open = mlx5e_open,
.ndo_stop = mlx5e_close,
@@ -2972,6 +3068,7 @@ static const struct net_device_ops mlx5e_netdev_ops_basic = {
.ndo_rx_flow_steer = mlx5e_rx_flow_steer,
#endif
.ndo_tx_timeout = mlx5e_tx_timeout,
+ .ndo_xdp = mlx5e_xdp,
};
static const struct net_device_ops mlx5e_netdev_ops_sriov = {
@@ -3003,6 +3100,7 @@ static const struct net_device_ops mlx5e_netdev_ops_sriov = {
.ndo_set_vf_link_state = mlx5e_set_vf_link_state,
.ndo_get_vf_stats = mlx5e_get_vf_stats,
.ndo_tx_timeout = mlx5e_tx_timeout,
+ .ndo_xdp = mlx5e_xdp,
};
static int mlx5e_check_required_hca_cap(struct mlx5_core_dev *mdev)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 95f9b1e..cde34c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -624,8 +624,20 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
napi_gro_receive(rq->cq.napi, skb);
}
+static inline enum xdp_action mlx5e_xdp_handle(struct mlx5e_rq *rq,
+ const struct bpf_prog *prog,
+ void *data, u32 len)
+{
+ struct xdp_buff xdp;
+
+ xdp.data = data;
+ xdp.data_end = xdp.data + len;
+ return bpf_prog_run_xdp(prog, &xdp);
+}
+
void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
{
+ struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
struct mlx5e_dma_info *di;
struct mlx5e_rx_wqe *wqe;
__be16 wqe_counter_be;
@@ -646,6 +658,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
rq->buff.wqe_sz,
DMA_FROM_DEVICE);
prefetch(va + MLX5_RX_HEADROOM);
+ cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
rq->stats.wqe_err++;
@@ -653,6 +666,18 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
goto wq_ll_pop;
}
+ if (xdp_prog) {
+ enum xdp_action act =
+ mlx5e_xdp_handle(rq, xdp_prog, va + MLX5_RX_HEADROOM,
+ cqe_bcnt);
+
+ if (act != XDP_PASS) {
+ rq->stats.xdp_drop++;
+ mlx5e_page_release(rq, di, true);
+ goto wq_ll_pop;
+ }
+ }
+
skb = build_skb(va, RQ_PAGE_SIZE(rq));
if (unlikely(!skb)) {
rq->stats.buff_alloc_err++;
@@ -664,7 +689,6 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
page_ref_inc(di->page);
mlx5e_page_release(rq, di, true);
- cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
skb_reserve(skb, MLX5_RX_HEADROOM);
skb_put(skb, cqe_bcnt);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 6af8d79..084d6c8 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -65,6 +65,7 @@ struct mlx5e_sw_stats {
u64 rx_csum_none;
u64 rx_csum_complete;
u64 rx_csum_unnecessary_inner;
+ u64 rx_xdp_drop;
u64 tx_csum_partial;
u64 tx_csum_partial_inner;
u64 tx_queue_stopped;
@@ -100,6 +101,7 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_none) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_complete) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary_inner) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_drop) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial_inner) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
@@ -278,6 +280,7 @@ struct mlx5e_rq_stats {
u64 csum_none;
u64 lro_packets;
u64 lro_bytes;
+ u64 xdp_drop;
u64 wqe_err;
u64 mpwqe_filler;
u64 buff_alloc_err;
@@ -295,6 +298,7 @@ static const struct counter_desc rq_stats_desc[] = {
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_complete) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_none) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_drop) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
--
2.7.4
* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
@ 2016-09-07 13:32 ` Or Gerlitz
[not found] ` <CAJ3xEMhh=fu+mrCGAjv1PDdGn9GPLJv9MssMzwzvppoqZUY01A-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
[not found] ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2016-09-08 10:58 ` Jamal Hadi Salim
2 siblings, 1 reply; 72+ messages in thread
From: Or Gerlitz @ 2016-09-07 13:32 UTC (permalink / raw)
To: Saeed Mahameed, Tariq Toukan, Rana Shahout
Cc: iovisor-dev, Linux Netdev List, Brenden Blanco,
Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
Jesper Dangaard Brouer, Daniel Borkmann, Eric Dumazet,
Jamal Hadi Salim
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm@mellanox.com> wrote:
> Packet rate performance testing was done with pktgen 64B packets and on
> TX side and, TC drop action on RX side compared to XDP fast drop.
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
> 1. Baseline, Before this patch with TC drop action
> 2. This patch with TC drop action
> 3. This patch with XDP RX fast drop
>
> Streams Baseline(TC drop) TC drop XDP fast Drop
> --------------------------------------------------------------
> 1 5.51Mpps 5.14Mpps 13.5Mpps
> 2 11.5Mpps 10.0Mpps 25.1Mpps
> 4 16.3Mpps 17.2Mpps 35.4Mpps
> 8 29.6Mpps 28.2Mpps 45.8Mpps*
> 16 34.0Mpps 30.1Mpps 45.8Mpps*
Rana, Guys, congrat!!
When you say X streams, is each stream mapped by RSS to a different RX ring,
or are we on the same RX ring for all rows of the above table?
In the CX3 work, we had X sender "streams" that all mapped to the same RX ring,
I don't think we went beyond one RX ring.
Here, I guess you want to first get an initial max for N pktgen TX threads
all sending the same stream so you land on a single RX ring, and then move
to M * N pktgen TX threads to max that further.
I don't see how the current Linux stack would be able to happily drive 34M PPS
(== allocate SKB, etc, you know...) on a single CPU, Jesper?
Or.
[parent not found: <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>]
* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
[not found] ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-07 20:55 ` Or Gerlitz via iovisor-dev
[not found] ` <CAJ3xEMgsGHqQ7x8wky6Sfs34Ry67PnZEhYmnK=g8XnnXbgWagg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
0 siblings, 1 reply; 72+ messages in thread
From: Or Gerlitz via iovisor-dev @ 2016-09-07 20:55 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Eric Dumazet,
Tom Herbert, Rana Shahout
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> From: Rana Shahout <ranas-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
>
> Add support for the BPF_PROG_TYPE_PHYS_DEV hook in mlx5e driver.
>
> When XDP is on we make sure to change channels RQs type to
> MLX5_WQ_TYPE_LINKED_LIST rather than "striding RQ" type to
> ensure "page per packet".
>
> On XDP set, we fail if HW LRO is set and request from user to turn it
> off. Since on ConnectX4-LX HW LRO is always on by default, this will be
> annoying, but we prefer not to enforce LRO off from XDP set function.
>
> Full channels reset (close/open) is required only when setting XDP
> on/off.
>
> When XDP set is called just to exchange programs, we will update
> each RQ xdp program on the fly and for synchronization with current
> data path RX activity of that RQ, we temporally disable that RQ and
> ensure RX path is not running, quickly update and re-enable that RQ,
> for that we do:
> - rq.state = disabled
> - napi_synnchronize
> - xchg(rq->xdp_prg)
> - rq.state = enabled
> - napi_schedule // Just in case we've missed an IRQ
>
> Packet rate performance testing was done with pktgen 64B packets and on
> TX side and, TC drop action on RX side compared to XDP fast drop.
>
> CPU: Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
>
> Comparison is done between:
> 1. Baseline, Before this patch with TC drop action
> 2. This patch with TC drop action
> 3. This patch with XDP RX fast drop
>
> Streams Baseline(TC drop) TC drop XDP fast Drop
> --------------------------------------------------------------
> 1 5.51Mpps 5.14Mpps 13.5Mpps
This (13.5 M PPS) is less than 50% of the result we presented @ the
XDP summit which was obtained by Rana. Please see if/how much
this grows if you use more sender threads, with all of them xmitting the
same stream/flows, so we're on one ring. That (XDP with a single RX ring
getting packets from N remote TX rings) would be your canonical
base-line for any further numbers.
* Re: [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support
2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
2016-09-07 13:32 ` Or Gerlitz
[not found] ` <1473252152-11379-9-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-08 10:58 ` Jamal Hadi Salim
2 siblings, 0 replies; 72+ messages in thread
From: Jamal Hadi Salim @ 2016-09-08 10:58 UTC (permalink / raw)
To: Saeed Mahameed, iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Rana Shahout
On 16-09-07 08:42 AM, Saeed Mahameed wrote:
> Comparison is done between:
> 1. Baseline, Before this patch with TC drop action
> 2. This patch with TC drop action
> 3. This patch with XDP RX fast drop
>
> Streams Baseline(TC drop) TC drop XDP fast Drop
> --------------------------------------------------------------
> 1 5.51Mpps 5.14Mpps 13.5Mpps
> 2 11.5Mpps 10.0Mpps 25.1Mpps
> 4 16.3Mpps 17.2Mpps 35.4Mpps
> 8 29.6Mpps 28.2Mpps 45.8Mpps*
> 16 34.0Mpps 30.1Mpps 45.8Mpps*
>
> It seems that there is around ~5% degradation between Baseline
> and this patch with single stream when comparing packet rate with TC drop,
> it might be related to XDP code overhead or new cache misses added by
> XDP code.
I would suspect this degradation would affect every other packet that
has no interest in XDP.
If you were trying to test forwarding, adding a tc action to
accept and count packets would be sufficient. Since you are not:
Try to baseline by sending to the wrong destination MAC address (i.e. one
not understood by the host). The kernel will eventually drop it
somewhere at pre-IP processing time (and you can see the difference with
XDP compiled in).
Slightly tangential question: Would it be fair to assume that this
hardware can drop at wire rate if you instead used an offloaded
tc rule?
cheers,
jamal
^ permalink raw reply [flat|nested] 72+ messages in thread
* [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (7 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 08/11] net/mlx5e: XDP fast RX drop bpf programs support Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support Saeed Mahameed
` (2 subsequent siblings)
11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
Make a clear separation between regular SQ (TXQ) and ICO SQ creation and
destruction, and place their per-type information structures in a union.
Don't allocate redundant TXQ skb/wqe_info/dma_fifo arrays for the ICO SQ.
Also use a different SQ edge for the ICO SQ than for the TXQ SQ, to be more
accurate.
In preparation for XDP TX support.
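As a reading aid for the change described above, here is a minimal user-space sketch of the idea: one SQ structure whose bookkeeping arrays live in a union selected by a type tag, with a per-type edge. All names (`struct sq`, `sq_init`, `sq_max_wqebbs`) and the sizes 16 and 2 are illustrative assumptions, not the mlx5e definitions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified stand-ins for the driver types. */
enum sq_type { SQ_TXQ, SQ_ICO };

struct txq_info { void *skb; uint8_t num_wqebbs; };
struct ico_info { uint8_t opcode; uint8_t num_wqebbs; };

struct sq {
	enum sq_type type;
	uint16_t size;	/* ring size in WQEBBs, power of two */
	uint16_t edge;	/* last index at which a max-sized WQE still fits */
	union {		/* only the member matching 'type' is allocated */
		struct txq_info *txq;
		struct ico_info *ico;
	} db;
};

/* Assumed per-type maximum WQE size in WQEBBs (illustrative values). */
static uint16_t sq_max_wqebbs(enum sq_type type)
{
	return type == SQ_ICO ? 2 : 16;
}

static int sq_init(struct sq *sq, enum sq_type type, uint16_t size)
{
	assert(size && (size & (size - 1)) == 0);
	assert(size >= sq_max_wqebbs(type));

	sq->type = type;
	sq->size = size;

	switch (type) {
	case SQ_TXQ:
		sq->db.txq = calloc(size, sizeof(*sq->db.txq));
		if (!sq->db.txq)
			return -1;
		break;
	case SQ_ICO:
		sq->db.ico = calloc(size, sizeof(*sq->db.ico));
		if (!sq->db.ico)
			return -1;
		break;
	}

	/* A smaller maximum WQE lets the edge sit closer to the ring end. */
	sq->edge = (uint16_t)(size - sq_max_wqebbs(type));
	return 0;
}

static void sq_free(struct sq *sq)
{
	switch (sq->type) {
	case SQ_TXQ:
		free(sq->db.txq);
		break;
	case SQ_ICO:
		free(sq->db.ico);
		break;
	}
}
```

With these assumed sizes, a 64-entry ICO ring only needs NOP padding past index 62, whereas a TXQ ring of the same size pads past index 48.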
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 23 +++-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 121 ++++++++++++++--------
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 8 +-
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 28 ++---
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 2 +-
5 files changed, 118 insertions(+), 64 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index 729bae8..b2da9bf 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -101,6 +101,9 @@
#define MLX5E_UPDATE_STATS_INTERVAL 200 /* msecs */
#define MLX5E_SQ_BF_BUDGET 16
+#define MLX5E_ICOSQ_MAX_WQEBBS \
+ (DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
+
#define MLX5E_NUM_MAIN_GROUPS 9
static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
@@ -386,6 +389,11 @@ struct mlx5e_ico_wqe_info {
u8 num_wqebbs;
};
+enum mlx5e_sq_type {
+ MLX5E_SQ_TXQ,
+ MLX5E_SQ_ICO
+};
+
struct mlx5e_sq {
/* data path */
@@ -403,10 +411,15 @@ struct mlx5e_sq {
struct mlx5e_cq cq;
- /* pointers to per packet info: write@xmit, read@completion */
- struct sk_buff **skb;
- struct mlx5e_sq_dma *dma_fifo;
- struct mlx5e_tx_wqe_info *wqe_info;
+ /* pointers to per tx element info: write@xmit, read@completion */
+ union {
+ struct {
+ struct sk_buff **skb;
+ struct mlx5e_sq_dma *dma_fifo;
+ struct mlx5e_tx_wqe_info *wqe_info;
+ } txq;
+ struct mlx5e_ico_wqe_info *ico_wqe;
+ } db;
/* read only */
struct mlx5_wq_cyc wq;
@@ -428,8 +441,8 @@ struct mlx5e_sq {
struct mlx5_uar uar;
struct mlx5e_channel *channel;
int tc;
- struct mlx5e_ico_wqe_info *ico_wqe_info;
u32 rate_limit;
+ u8 type;
} ____cacheline_aligned_in_smp;
static inline bool mlx5e_sq_has_room_for(struct mlx5e_sq *sq, u16 n)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index dab8486..8baeb9e 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -51,7 +51,7 @@ struct mlx5e_sq_param {
struct mlx5_wq_param wq;
u16 max_inline;
u8 min_inline_mode;
- bool icosq;
+ enum mlx5e_sq_type type;
};
struct mlx5e_cq_param {
@@ -742,8 +742,8 @@ static int mlx5e_open_rq(struct mlx5e_channel *c,
if (param->am_enabled)
set_bit(MLX5E_RQ_STATE_AM, &c->rq.state);
- sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
- sq->ico_wqe_info[pi].num_wqebbs = 1;
+ sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_NOP;
+ sq->db.ico_wqe[pi].num_wqebbs = 1;
mlx5e_send_nop(sq, true); /* trigger mlx5e_post_rx_wqes() */
return 0;
@@ -767,26 +767,43 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
mlx5e_destroy_rq(rq);
}
-static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
+static void mlx5e_free_sq_ico_db(struct mlx5e_sq *sq)
{
- kfree(sq->wqe_info);
- kfree(sq->dma_fifo);
- kfree(sq->skb);
+ kfree(sq->db.ico_wqe);
}
-static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
+static int mlx5e_alloc_sq_ico_db(struct mlx5e_sq *sq, int numa)
+{
+ u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+ sq->db.ico_wqe = kzalloc_node(sizeof(*sq->db.ico_wqe) * wq_sz,
+ GFP_KERNEL, numa);
+ if (!sq->db.ico_wqe)
+ return -ENOMEM;
+
+ return 0;
+}
+
+static void mlx5e_free_sq_txq_db(struct mlx5e_sq *sq)
+{
+ kfree(sq->db.txq.wqe_info);
+ kfree(sq->db.txq.dma_fifo);
+ kfree(sq->db.txq.skb);
+}
+
+static int mlx5e_alloc_sq_txq_db(struct mlx5e_sq *sq, int numa)
{
int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
int df_sz = wq_sz * MLX5_SEND_WQEBB_NUM_DS;
- sq->skb = kzalloc_node(wq_sz * sizeof(*sq->skb), GFP_KERNEL, numa);
- sq->dma_fifo = kzalloc_node(df_sz * sizeof(*sq->dma_fifo), GFP_KERNEL,
- numa);
- sq->wqe_info = kzalloc_node(wq_sz * sizeof(*sq->wqe_info), GFP_KERNEL,
- numa);
-
- if (!sq->skb || !sq->dma_fifo || !sq->wqe_info) {
- mlx5e_free_sq_db(sq);
+ sq->db.txq.skb = kzalloc_node(wq_sz * sizeof(*sq->db.txq.skb),
+ GFP_KERNEL, numa);
+ sq->db.txq.dma_fifo = kzalloc_node(df_sz * sizeof(*sq->db.txq.dma_fifo),
+ GFP_KERNEL, numa);
+ sq->db.txq.wqe_info = kzalloc_node(wq_sz * sizeof(*sq->db.txq.wqe_info),
+ GFP_KERNEL, numa);
+ if (!sq->db.txq.skb || !sq->db.txq.dma_fifo || !sq->db.txq.wqe_info) {
+ mlx5e_free_sq_txq_db(sq);
return -ENOMEM;
}
@@ -795,6 +812,30 @@ static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
return 0;
}
+static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
+{
+ switch (sq->type) {
+ case MLX5E_SQ_TXQ:
+ mlx5e_free_sq_txq_db(sq);
+ break;
+ case MLX5E_SQ_ICO:
+ mlx5e_free_sq_ico_db(sq);
+ break;
+ }
+}
+
+static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
+{
+ switch (sq->type) {
+ case MLX5E_SQ_TXQ:
+ return mlx5e_alloc_sq_txq_db(sq, numa);
+ case MLX5E_SQ_ICO:
+ return mlx5e_alloc_sq_ico_db(sq, numa);
+ }
+
+ return 0;
+}
+
static int mlx5e_create_sq(struct mlx5e_channel *c,
int tc,
struct mlx5e_sq_param *param,
@@ -805,8 +846,16 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
void *sqc = param->sqc;
void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
+ u16 sq_max_wqebbs;
int err;
+ sq->type = param->type;
+ sq->pdev = c->pdev;
+ sq->tstamp = &priv->tstamp;
+ sq->mkey_be = c->mkey_be;
+ sq->channel = c;
+ sq->tc = tc;
+
err = mlx5_alloc_map_uar(mdev, &sq->uar, !!MLX5_CAP_GEN(mdev, bf));
if (err)
return err;
@@ -835,18 +884,8 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
if (err)
goto err_sq_wq_destroy;
- if (param->icosq) {
- u8 wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
-
- sq->ico_wqe_info = kzalloc_node(sizeof(*sq->ico_wqe_info) *
- wq_sz,
- GFP_KERNEL,
- cpu_to_node(c->cpu));
- if (!sq->ico_wqe_info) {
- err = -ENOMEM;
- goto err_free_sq_db;
- }
- } else {
+ sq_max_wqebbs = MLX5_SEND_WQE_MAX_WQEBBS;
+ if (sq->type == MLX5E_SQ_TXQ) {
int txq_ix;
txq_ix = c->ix + tc * priv->params.num_channels;
@@ -854,19 +893,14 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
priv->txq_to_sq_map[txq_ix] = sq;
}
- sq->pdev = c->pdev;
- sq->tstamp = &priv->tstamp;
- sq->mkey_be = c->mkey_be;
- sq->channel = c;
- sq->tc = tc;
- sq->edge = (sq->wq.sz_m1 + 1) - MLX5_SEND_WQE_MAX_WQEBBS;
+ if (sq->type == MLX5E_SQ_ICO)
+ sq_max_wqebbs = MLX5E_ICOSQ_MAX_WQEBBS;
+
+ sq->edge = (sq->wq.sz_m1 + 1) - sq_max_wqebbs;
sq->bf_budget = MLX5E_SQ_BF_BUDGET;
return 0;
-err_free_sq_db:
- mlx5e_free_sq_db(sq);
-
err_sq_wq_destroy:
mlx5_wq_destroy(&sq->wq_ctrl);
@@ -881,7 +915,6 @@ static void mlx5e_destroy_sq(struct mlx5e_sq *sq)
struct mlx5e_channel *c = sq->channel;
struct mlx5e_priv *priv = c->priv;
- kfree(sq->ico_wqe_info);
mlx5e_free_sq_db(sq);
mlx5_wq_destroy(&sq->wq_ctrl);
mlx5_unmap_free_uar(priv->mdev, &sq->uar);
@@ -910,11 +943,12 @@ static int mlx5e_enable_sq(struct mlx5e_sq *sq, struct mlx5e_sq_param *param)
memcpy(sqc, param->sqc, sizeof(param->sqc));
- MLX5_SET(sqc, sqc, tis_num_0, param->icosq ? 0 : priv->tisn[sq->tc]);
+ MLX5_SET(sqc, sqc, tis_num_0, param->type == MLX5E_SQ_ICO ?
+ 0 : priv->tisn[sq->tc]);
MLX5_SET(sqc, sqc, cqn, sq->cq.mcq.cqn);
MLX5_SET(sqc, sqc, min_wqe_inline_mode, sq->min_inline_mode);
MLX5_SET(sqc, sqc, state, MLX5_SQC_STATE_RST);
- MLX5_SET(sqc, sqc, tis_lst_sz, param->icosq ? 0 : 1);
+ MLX5_SET(sqc, sqc, tis_lst_sz, param->type == MLX5E_SQ_ICO ? 0 : 1);
MLX5_SET(sqc, sqc, flush_in_error_en, 1);
MLX5_SET(wq, wq, wq_type, MLX5_WQ_TYPE_CYCLIC);
@@ -1029,8 +1063,10 @@ static void mlx5e_close_sq(struct mlx5e_sq *sq)
netif_tx_disable_queue(sq->txq);
/* last doorbell out, godspeed .. */
- if (mlx5e_sq_has_room_for(sq, 1))
+ if (mlx5e_sq_has_room_for(sq, 1)) {
+ sq->db.txq.skb[(sq->pc & sq->wq.sz_m1)] = NULL;
mlx5e_send_nop(sq, true);
+ }
}
mlx5e_disable_sq(sq);
@@ -1507,6 +1543,7 @@ static void mlx5e_build_sq_param(struct mlx5e_priv *priv,
param->max_inline = priv->params.tx_max_inline;
param->min_inline_mode = priv->params.tx_min_inline_mode;
+ param->type = MLX5E_SQ_TXQ;
}
static void mlx5e_build_common_cq_param(struct mlx5e_priv *priv,
@@ -1580,7 +1617,7 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
MLX5_SET(wq, wq, log_wq_sz, log_wq_size);
MLX5_SET(sqc, sqc, reg_umr, MLX5_CAP_ETH(priv->mdev, reg_umr_sq));
- param->icosq = true;
+ param->type = MLX5E_SQ_ICO;
}
static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index cde34c8..eb489e9 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -337,8 +337,8 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
/* fill sq edge with nops to avoid wqe wrap around */
while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
- sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_NOP;
- sq->ico_wqe_info[pi].num_wqebbs = 1;
+ sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_NOP;
+ sq->db.ico_wqe[pi].num_wqebbs = 1;
mlx5e_send_nop(sq, true);
}
@@ -348,8 +348,8 @@ static inline void mlx5e_post_umr_wqe(struct mlx5e_rq *rq, u16 ix)
cpu_to_be32((sq->pc << MLX5_WQE_CTRL_WQE_INDEX_SHIFT) |
MLX5_OPCODE_UMR);
- sq->ico_wqe_info[pi].opcode = MLX5_OPCODE_UMR;
- sq->ico_wqe_info[pi].num_wqebbs = num_wqebbs;
+ sq->db.ico_wqe[pi].opcode = MLX5_OPCODE_UMR;
+ sq->db.ico_wqe[pi].num_wqebbs = num_wqebbs;
sq->pc += num_wqebbs;
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index 988eca9..a728303 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -52,7 +52,6 @@ void mlx5e_send_nop(struct mlx5e_sq *sq, bool notify_hw)
cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_NOP);
cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | 0x01);
- sq->skb[pi] = NULL;
sq->pc++;
sq->stats.nop++;
@@ -82,15 +81,15 @@ static inline void mlx5e_dma_push(struct mlx5e_sq *sq,
u32 size,
enum mlx5e_dma_map_type map_type)
{
- sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr = addr;
- sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size = size;
- sq->dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].type = map_type;
+ sq->db.txq.dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].addr = addr;
+ sq->db.txq.dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].size = size;
+ sq->db.txq.dma_fifo[sq->dma_fifo_pc & sq->dma_fifo_mask].type = map_type;
sq->dma_fifo_pc++;
}
static inline struct mlx5e_sq_dma *mlx5e_dma_get(struct mlx5e_sq *sq, u32 i)
{
- return &sq->dma_fifo[i & sq->dma_fifo_mask];
+ return &sq->db.txq.dma_fifo[i & sq->dma_fifo_mask];
}
static void mlx5e_dma_unmap_wqe_err(struct mlx5e_sq *sq, u8 num_dma)
@@ -221,7 +220,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
u16 pi = sq->pc & wq->sz_m1;
struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi);
- struct mlx5e_tx_wqe_info *wi = &sq->wqe_info[pi];
+ struct mlx5e_tx_wqe_info *wi = &sq->db.txq.wqe_info[pi];
struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
struct mlx5_wqe_eth_seg *eseg = &wqe->eth;
@@ -341,7 +340,7 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | opcode);
cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | ds_cnt);
- sq->skb[pi] = skb;
+ sq->db.txq.skb[pi] = skb;
wi->num_wqebbs = DIV_ROUND_UP(ds_cnt, MLX5_SEND_WQEBB_NUM_DS);
sq->pc += wi->num_wqebbs;
@@ -367,8 +366,10 @@ static netdev_tx_t mlx5e_sq_xmit(struct mlx5e_sq *sq, struct sk_buff *skb)
}
/* fill sq edge with nops to avoid wqe wrap around */
- while ((sq->pc & wq->sz_m1) > sq->edge)
+ while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+ sq->db.txq.skb[pi] = NULL;
mlx5e_send_nop(sq, false);
+ }
if (bf)
sq->bf_budget--;
@@ -442,8 +443,8 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
last_wqe = (sqcc == wqe_counter);
ci = sqcc & sq->wq.sz_m1;
- skb = sq->skb[ci];
- wi = &sq->wqe_info[ci];
+ skb = sq->db.txq.skb[ci];
+ wi = &sq->db.txq.wqe_info[ci];
if (unlikely(!skb)) { /* nop */
sqcc++;
@@ -499,10 +500,13 @@ void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
u16 ci;
int i;
+ if (sq->type != MLX5E_SQ_TXQ)
+ return;
+
while (sq->cc != sq->pc) {
ci = sq->cc & sq->wq.sz_m1;
- skb = sq->skb[ci];
- wi = &sq->wqe_info[ci];
+ skb = sq->db.txq.skb[ci];
+ wi = &sq->db.txq.wqe_info[ci];
if (!skb) { /* nop */
sq->cc++;
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 08d8b0c..47cd561 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -72,7 +72,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
do {
u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
- struct mlx5e_ico_wqe_info *icowi = &sq->ico_wqe_info[ci];
+ struct mlx5e_ico_wqe_info *icowi = &sq->db.ico_wqe[ci];
mlx5_cqwq_pop(&cq->wq);
sqcc += icowi->num_wqebbs;
--
2.7.4
^ permalink raw reply related [flat|nested] 72+ messages in thread
* [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (8 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 09/11] net/mlx5e: Have a clear separation between different SQ types Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
[not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
11 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
Add support for XDP_TX forwarding from an XDP program.
With XDP, the user can now loop packets back out of the same port.
We create a dedicated TX SQ for each channel that will serve
XDP programs that return the XDP_TX action, looping packets back to
the wire directly from the channel RQ RX path.
For that, RX pages now need to be mapped bidirectionally,
and on an XDP_TX action we sync the page back to the device and then
queue it into the SQ for transmission. The XDP xmit frame function
reports back to the RX path whether the page was consumed (transmitted); if so,
the RX path forgets about that page as if it had been released to the stack.
Later, on XDP TX completion, the page is released back to the
page cache.
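The page-ownership handoff described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions: `struct tx_ring`, `struct page_cache`, `rx_handle` and `tx_complete` are hypothetical names, pages are plain integers, and no DMA or hardware is involved.

```c
#include <assert.h>
#include <stdbool.h>

#define RING_SIZE 4u	/* illustrative ring size */

/* Stand-ins for the XDP_PASS / XDP_TX / XDP_DROP verdicts. */
enum verdict { V_PASS, V_TX, V_DROP };

struct tx_ring {
	unsigned int pc;		/* producer counter */
	unsigned int cc;		/* consumer counter */
	int page[RING_SIZE];		/* pages owned by the ring until completion */
};

struct page_cache { unsigned int released; };

/* Returns true if the page was queued, i.e. the ring now owns it. */
static bool ring_xmit(struct tx_ring *r, int page)
{
	assert(r->pc - r->cc <= RING_SIZE);
	if (r->pc - r->cc == RING_SIZE)
		return false;		/* ring full, caller keeps the page */
	r->page[r->pc % RING_SIZE] = page;
	r->pc++;
	return true;
}

/* Returns true if the RX path must forget about the page. */
static bool rx_handle(enum verdict v, struct tx_ring *r,
		      struct page_cache *cache, int page)
{
	switch (v) {
	case V_TX:
		return ring_xmit(r, page);
	case V_DROP:
		cache->released++;	/* released immediately */
		return true;
	case V_PASS:
	default:
		return false;		/* continues to the normal stack path */
	}
}

/* On TX completion, every queued page goes back to the cache. */
static void tx_complete(struct tx_ring *r, struct page_cache *cache)
{
	while (r->cc != r->pc) {
		cache->released++;
		r->cc++;
	}
}
```

In this model a transmitted page is not released when it is queued; it is released only in `tx_complete`, which matches the "consumed now, recycled on completion" ordering the commit message describes.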
For simplicity, this patch hits the doorbell on every XDP TX packet.
The next patch will introduce an xmit-more-like mechanism that
queues more than one packet into the SQ without notifying the hardware;
once the RX NAPI loop is done, we hit the doorbell once for all XDP TX
packets from the previous loop. This should drastically improve
XDP TX performance.
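To make the doorbell trade-off concrete, the following user-space sketch counts doorbells for a per-packet policy versus a deferred one. It is an assumption-laden model, not the follow-up patch: `struct xdp_sq`, `queue_frame`, `flush_doorbell` and `poll_loop` are hypothetical names, and a "doorbell" is just a counter increment.

```c
#include <assert.h>
#include <stdbool.h>

struct xdp_sq {
	unsigned int pc;		/* frames queued so far */
	unsigned int doorbells;		/* times the hardware was notified */
	bool pending;			/* frames queued but not yet announced */
};

/* Queue one frame; ring the doorbell right away only if asked to. */
static void queue_frame(struct xdp_sq *sq, bool ring_now)
{
	sq->pc++;
	if (ring_now)
		sq->doorbells++;
	else
		sq->pending = true;
}

/* Announce all deferred frames with a single doorbell. */
static void flush_doorbell(struct xdp_sq *sq)
{
	if (sq->pending) {
		sq->doorbells++;
		sq->pending = false;
	}
}

/* Model one RX poll loop that forwards n frames; returns total doorbells. */
static unsigned int poll_loop(struct xdp_sq *sq, unsigned int n, bool batch)
{
	unsigned int i;

	assert(!sq->pending);		/* previous loop must have flushed */
	for (i = 0; i < n; i++)
		queue_frame(sq, !batch);
	if (batch)
		flush_doorbell(sq);
	return sq->doorbells;
}
```

For a loop that forwards 64 frames, the per-packet policy rings 64 times and the deferred policy rings once; a loop that forwards nothing rings zero times in both cases.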
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 24 ++++-
drivers/net/ethernet/mellanox/mlx5/core/en_main.c | 93 +++++++++++++++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 115 +++++++++++++++++----
drivers/net/ethernet/mellanox/mlx5/core/en_stats.h | 8 ++
drivers/net/ethernet/mellanox/mlx5/core/en_tx.c | 39 ++++++-
drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c | 65 +++++++++++-
6 files changed, 308 insertions(+), 36 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index b2da9bf..df2c9e0 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -104,6 +104,14 @@
#define MLX5E_ICOSQ_MAX_WQEBBS \
(DIV_ROUND_UP(sizeof(struct mlx5e_umr_wqe), MLX5_SEND_WQE_BB))
+#define MLX5E_XDP_MIN_INLINE (ETH_HLEN + VLAN_HLEN)
+#define MLX5E_XDP_IHS_DS_COUNT \
+ DIV_ROUND_UP(MLX5E_XDP_MIN_INLINE - 2, MLX5_SEND_WQE_DS)
+#define MLX5E_XDP_TX_DS_COUNT \
+ (MLX5E_XDP_IHS_DS_COUNT + (sizeof(struct mlx5e_tx_wqe) / MLX5_SEND_WQE_DS) + 1 /* SG DS */)
+#define MLX5E_XDP_TX_WQEBBS \
+ DIV_ROUND_UP(MLX5E_XDP_TX_DS_COUNT, MLX5_SEND_WQEBB_NUM_DS)
+
#define MLX5E_NUM_MAIN_GROUPS 9
static inline u16 mlx5_min_rx_wqes(int wq_type, u32 wq_size)
@@ -319,6 +327,7 @@ struct mlx5e_rq {
struct {
u8 page_order;
u32 wqe_sz; /* wqe data buffer size */
+ u8 map_dir; /* dma map direction */
} buff;
__be32 mkey_be;
@@ -384,14 +393,15 @@ enum {
MLX5E_SQ_STATE_BF_ENABLE,
};
-struct mlx5e_ico_wqe_info {
+struct mlx5e_sq_wqe_info {
u8 opcode;
u8 num_wqebbs;
};
enum mlx5e_sq_type {
MLX5E_SQ_TXQ,
- MLX5E_SQ_ICO
+ MLX5E_SQ_ICO,
+ MLX5E_SQ_XDP
};
struct mlx5e_sq {
@@ -418,7 +428,11 @@ struct mlx5e_sq {
struct mlx5e_sq_dma *dma_fifo;
struct mlx5e_tx_wqe_info *wqe_info;
} txq;
- struct mlx5e_ico_wqe_info *ico_wqe;
+ struct mlx5e_sq_wqe_info *ico_wqe;
+ struct {
+ struct mlx5e_sq_wqe_info *wqe_info;
+ struct mlx5e_dma_info *di;
+ } xdp;
} db;
/* read only */
@@ -458,8 +472,10 @@ enum channel_flags {
struct mlx5e_channel {
/* data path */
struct mlx5e_rq rq;
+ struct mlx5e_sq xdp_sq;
struct mlx5e_sq sq[MLX5E_MAX_NUM_TC];
struct mlx5e_sq icosq; /* internal control operations */
+ bool xdp;
struct napi_struct napi;
struct device *pdev;
struct net_device *netdev;
@@ -722,7 +738,7 @@ void mlx5e_cq_error_event(struct mlx5_core_cq *mcq, enum mlx5_event event);
int mlx5e_napi_poll(struct napi_struct *napi, int budget);
bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget);
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget);
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq);
+void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
index 8baeb9e..1d9c01f 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_main.c
@@ -64,6 +64,7 @@ struct mlx5e_cq_param {
struct mlx5e_channel_param {
struct mlx5e_rq_param rq;
struct mlx5e_sq_param sq;
+ struct mlx5e_sq_param xdp_sq;
struct mlx5e_sq_param icosq;
struct mlx5e_cq_param rx_cq;
struct mlx5e_cq_param tx_cq;
@@ -180,6 +181,8 @@ static void mlx5e_update_sw_counters(struct mlx5e_priv *priv)
s->rx_csum_complete += rq_stats->csum_complete;
s->rx_csum_unnecessary_inner += rq_stats->csum_unnecessary_inner;
s->rx_xdp_drop += rq_stats->xdp_drop;
+ s->rx_xdp_tx += rq_stats->xdp_tx;
+ s->rx_xdp_tx_full += rq_stats->xdp_tx_full;
s->rx_wqe_err += rq_stats->wqe_err;
s->rx_mpwqe_filler += rq_stats->mpwqe_filler;
s->rx_buff_alloc_err += rq_stats->buff_alloc_err;
@@ -481,6 +484,10 @@ static int mlx5e_create_rq(struct mlx5e_channel *c,
rq->priv = c->priv;
rq->xdp_prog = priv->xdp_prog;
+ rq->buff.map_dir = DMA_FROM_DEVICE;
+ if (rq->xdp_prog)
+ rq->buff.map_dir = DMA_BIDIRECTIONAL;
+
switch (priv->params.rq_wq_type) {
case MLX5_WQ_TYPE_LINKED_LIST_STRIDING_RQ:
rq->handle_rx_cqe = mlx5e_handle_rx_cqe_mpwrq;
@@ -767,6 +774,28 @@ static void mlx5e_close_rq(struct mlx5e_rq *rq)
mlx5e_destroy_rq(rq);
}
+static void mlx5e_free_sq_xdp_db(struct mlx5e_sq *sq)
+{
+ kfree(sq->db.xdp.di);
+ kfree(sq->db.xdp.wqe_info);
+}
+
+static int mlx5e_alloc_sq_xdp_db(struct mlx5e_sq *sq, int numa)
+{
+ int wq_sz = mlx5_wq_cyc_get_size(&sq->wq);
+
+ sq->db.xdp.di = kzalloc_node(sizeof(*sq->db.xdp.di) * wq_sz,
+ GFP_KERNEL, numa);
+ sq->db.xdp.wqe_info = kzalloc_node(sizeof(*sq->db.xdp.wqe_info) * wq_sz,
+ GFP_KERNEL, numa);
+ if (!sq->db.xdp.di || !sq->db.xdp.wqe_info) {
+ mlx5e_free_sq_xdp_db(sq);
+ return -ENOMEM;
+ }
+
+ return 0;
+}
+
static void mlx5e_free_sq_ico_db(struct mlx5e_sq *sq)
{
kfree(sq->db.ico_wqe);
@@ -821,6 +850,9 @@ static void mlx5e_free_sq_db(struct mlx5e_sq *sq)
case MLX5E_SQ_ICO:
mlx5e_free_sq_ico_db(sq);
break;
+ case MLX5E_SQ_XDP:
+ mlx5e_free_sq_xdp_db(sq);
+ break;
}
}
@@ -831,11 +863,24 @@ static int mlx5e_alloc_sq_db(struct mlx5e_sq *sq, int numa)
return mlx5e_alloc_sq_txq_db(sq, numa);
case MLX5E_SQ_ICO:
return mlx5e_alloc_sq_ico_db(sq, numa);
+ case MLX5E_SQ_XDP:
+ return mlx5e_alloc_sq_xdp_db(sq, numa);
}
return 0;
}
+static int mlx5e_sq_get_max_wqebbs(u8 sq_type)
+{
+ switch (sq_type) {
+ case MLX5E_SQ_ICO:
+ return MLX5E_ICOSQ_MAX_WQEBBS;
+ case MLX5E_SQ_XDP:
+ return MLX5E_XDP_TX_WQEBBS;
+ }
+ return MLX5_SEND_WQE_MAX_WQEBBS;
+}
+
static int mlx5e_create_sq(struct mlx5e_channel *c,
int tc,
struct mlx5e_sq_param *param,
@@ -846,7 +891,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
void *sqc = param->sqc;
void *sqc_wq = MLX5_ADDR_OF(sqc, sqc, wq);
- u16 sq_max_wqebbs;
int err;
sq->type = param->type;
@@ -884,7 +928,6 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
if (err)
goto err_sq_wq_destroy;
- sq_max_wqebbs = MLX5_SEND_WQE_MAX_WQEBBS;
if (sq->type == MLX5E_SQ_TXQ) {
int txq_ix;
@@ -893,10 +936,7 @@ static int mlx5e_create_sq(struct mlx5e_channel *c,
priv->txq_to_sq_map[txq_ix] = sq;
}
- if (sq->type == MLX5E_SQ_ICO)
- sq_max_wqebbs = MLX5E_ICOSQ_MAX_WQEBBS;
-
- sq->edge = (sq->wq.sz_m1 + 1) - sq_max_wqebbs;
+ sq->edge = (sq->wq.sz_m1 + 1) - mlx5e_sq_get_max_wqebbs(sq->type);
sq->bf_budget = MLX5E_SQ_BF_BUDGET;
return 0;
@@ -1070,7 +1110,7 @@ static void mlx5e_close_sq(struct mlx5e_sq *sq)
}
mlx5e_disable_sq(sq);
- mlx5e_free_tx_descs(sq);
+ mlx5e_free_sq_descs(sq);
mlx5e_destroy_sq(sq);
}
@@ -1431,14 +1471,31 @@ static int mlx5e_open_channel(struct mlx5e_priv *priv, int ix,
}
}
+ if (priv->xdp_prog) {
+ /* XDP SQ CQ params are same as normal TXQ sq CQ params */
+ err = mlx5e_open_cq(c, &cparam->tx_cq, &c->xdp_sq.cq,
+ priv->params.tx_cq_moderation);
+ if (err)
+ goto err_close_sqs;
+
+ err = mlx5e_open_sq(c, 0, &cparam->xdp_sq, &c->xdp_sq);
+ if (err) {
+ mlx5e_close_cq(&c->xdp_sq.cq);
+ goto err_close_sqs;
+ }
+ }
+
+ c->xdp = !!priv->xdp_prog;
err = mlx5e_open_rq(c, &cparam->rq, &c->rq);
if (err)
- goto err_close_sqs;
+ goto err_close_xdp_sq;
netif_set_xps_queue(netdev, get_cpu_mask(c->cpu), ix);
*cp = c;
return 0;
+err_close_xdp_sq:
+ mlx5e_close_sq(&c->xdp_sq);
err_close_sqs:
mlx5e_close_sqs(c);
@@ -1467,9 +1524,13 @@ err_napi_del:
static void mlx5e_close_channel(struct mlx5e_channel *c)
{
mlx5e_close_rq(&c->rq);
+ if (c->xdp)
+ mlx5e_close_sq(&c->xdp_sq);
mlx5e_close_sqs(c);
mlx5e_close_sq(&c->icosq);
napi_disable(&c->napi);
+ if (c->xdp)
+ mlx5e_close_cq(&c->xdp_sq.cq);
mlx5e_close_cq(&c->rq.cq);
mlx5e_close_tx_cqs(c);
mlx5e_close_cq(&c->icosq.cq);
@@ -1620,12 +1681,28 @@ static void mlx5e_build_icosq_param(struct mlx5e_priv *priv,
param->type = MLX5E_SQ_ICO;
}
+static void mlx5e_build_xdpsq_param(struct mlx5e_priv *priv,
+ struct mlx5e_sq_param *param)
+{
+ void *sqc = param->sqc;
+ void *wq = MLX5_ADDR_OF(sqc, sqc, wq);
+
+ mlx5e_build_sq_param_common(priv, param);
+ MLX5_SET(wq, wq, log_wq_sz, priv->params.log_sq_size);
+
+ param->max_inline = priv->params.tx_max_inline;
+ /* FOR XDP SQs will support only L2 inline mode */
+ param->min_inline_mode = MLX5_INLINE_MODE_NONE;
+ param->type = MLX5E_SQ_XDP;
+}
+
static void mlx5e_build_channel_param(struct mlx5e_priv *priv, struct mlx5e_channel_param *cparam)
{
u8 icosq_log_wq_sz = MLX5E_PARAMS_MINIMUM_LOG_SQ_SIZE;
mlx5e_build_rq_param(priv, &cparam->rq);
mlx5e_build_sq_param(priv, &cparam->sq);
+ mlx5e_build_xdpsq_param(priv, &cparam->xdp_sq);
mlx5e_build_icosq_param(priv, &cparam->icosq, icosq_log_wq_sz);
mlx5e_build_rx_cq_param(priv, &cparam->rx_cq);
mlx5e_build_tx_cq_param(priv, &cparam->tx_cq);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index eb489e9..912a0e2 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -236,7 +236,7 @@ static inline int mlx5e_page_alloc_mapped(struct mlx5e_rq *rq,
dma_info->page = page;
dma_info->addr = dma_map_page(rq->pdev, page, 0,
- RQ_PAGE_SIZE(rq), DMA_FROM_DEVICE);
+ RQ_PAGE_SIZE(rq), rq->buff.map_dir);
if (unlikely(dma_mapping_error(rq->pdev, dma_info->addr))) {
put_page(page);
return -ENOMEM;
@@ -252,7 +252,7 @@ void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
return;
dma_unmap_page(rq->pdev, dma_info->addr, RQ_PAGE_SIZE(rq),
- DMA_FROM_DEVICE);
+ rq->buff.map_dir);
put_page(dma_info->page);
}
@@ -624,15 +624,100 @@ static inline void mlx5e_complete_rx_cqe(struct mlx5e_rq *rq,
napi_gro_receive(rq->cq.napi, skb);
}
-static inline enum xdp_action mlx5e_xdp_handle(struct mlx5e_rq *rq,
- const struct bpf_prog *prog,
- void *data, u32 len)
+static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
+ struct mlx5e_dma_info *di,
+ unsigned int data_offset,
+ int len)
{
+ struct mlx5_wq_cyc *wq = &sq->wq;
+ u16 pi = sq->pc & wq->sz_m1;
+ struct mlx5e_tx_wqe *wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+ struct mlx5e_sq_wqe_info *wi = &sq->db.xdp.wqe_info[pi];
+
+ struct mlx5_wqe_ctrl_seg *cseg = &wqe->ctrl;
+ struct mlx5_wqe_eth_seg *eseg = &wqe->eth;
+ struct mlx5_wqe_data_seg *dseg;
+
+ dma_addr_t dma_addr = di->addr + data_offset + MLX5E_XDP_MIN_INLINE;
+ unsigned int dma_len = len - MLX5E_XDP_MIN_INLINE;
+ void *data = page_address(di->page) + data_offset;
+
+ if (unlikely(!mlx5e_sq_has_room_for(sq, MLX5E_XDP_TX_WQEBBS))) {
+ sq->channel->rq.stats.xdp_tx_full++;
+ return false;
+ }
+
+ dma_sync_single_for_device(sq->pdev, dma_addr, dma_len, PCI_DMA_TODEVICE);
+
+ memset(wqe, 0, sizeof(*wqe));
+
+ /* copy the inline part */
+ memcpy(eseg->inline_hdr_start, data, MLX5E_XDP_MIN_INLINE);
+ eseg->inline_hdr_sz = cpu_to_be16(MLX5E_XDP_MIN_INLINE);
+
+ dseg = (struct mlx5_wqe_data_seg *)cseg + (MLX5E_XDP_TX_DS_COUNT - 1);
+
+ /* write the dma part */
+ dseg->addr = cpu_to_be64(dma_addr);
+ dseg->byte_count = cpu_to_be32(dma_len);
+ dseg->lkey = sq->mkey_be;
+
+ cseg->opmod_idx_opcode = cpu_to_be32((sq->pc << 8) | MLX5_OPCODE_SEND);
+ cseg->qpn_ds = cpu_to_be32((sq->sqn << 8) | MLX5E_XDP_TX_DS_COUNT);
+
+ sq->db.xdp.di[pi] = *di;
+ wi->opcode = MLX5_OPCODE_SEND;
+ wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
+ sq->pc += MLX5E_XDP_TX_WQEBBS;
+
+ /* TODO: xmit more */
+ wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
+ mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
+
+ /* fill sq edge with nops to avoid wqe wrap around */
+ while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
+ sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
+ mlx5e_send_nop(sq, false);
+ }
+ return true;
+}
+
+/* returns true if packet was consumed by xdp */
+static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
+ const struct bpf_prog *prog,
+ struct mlx5e_dma_info *di,
+ void *data, u16 len)
+{
+ bool consumed = false;
struct xdp_buff xdp;
+ u32 act;
+
+ if (!prog)
+ return false;
xdp.data = data;
xdp.data_end = xdp.data + len;
- return bpf_prog_run_xdp(prog, &xdp);
+ act = bpf_prog_run_xdp(prog, &xdp);
+ switch (act) {
+ case XDP_PASS:
+ return false;
+ case XDP_TX:
+ consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
+ MLX5_RX_HEADROOM,
+ len);
+ rq->stats.xdp_tx += consumed;
+ return consumed;
+ default:
+ bpf_warn_invalid_xdp_action(act);
+ return false;
+ case XDP_ABORTED:
+ case XDP_DROP:
+ rq->stats.xdp_drop++;
+ mlx5e_page_release(rq, di, true);
+ return true;
+ }
+
+ return false;
}
void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
@@ -643,21 +728,22 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
__be16 wqe_counter_be;
struct sk_buff *skb;
u16 wqe_counter;
+ void *va, *data;
u32 cqe_bcnt;
- void *va;
wqe_counter_be = cqe->wqe_counter;
wqe_counter = be16_to_cpu(wqe_counter_be);
wqe = mlx5_wq_ll_get_wqe(&rq->wq, wqe_counter);
di = &rq->dma_info[wqe_counter];
va = page_address(di->page);
+ data = va + MLX5_RX_HEADROOM;
dma_sync_single_range_for_cpu(rq->pdev,
di->addr,
MLX5_RX_HEADROOM,
rq->buff.wqe_sz,
DMA_FROM_DEVICE);
- prefetch(va + MLX5_RX_HEADROOM);
+ prefetch(data);
cqe_bcnt = be32_to_cpu(cqe->byte_cnt);
if (unlikely((cqe->op_own >> 4) != MLX5_CQE_RESP_SEND)) {
@@ -666,17 +752,8 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
goto wq_ll_pop;
}
- if (xdp_prog) {
- enum xdp_action act =
- mlx5e_xdp_handle(rq, xdp_prog, va + MLX5_RX_HEADROOM,
- cqe_bcnt);
-
- if (act != XDP_PASS) {
- rq->stats.xdp_drop++;
- mlx5e_page_release(rq, di, true);
- goto wq_ll_pop;
- }
- }
+ if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+ goto wq_ll_pop; /* page/packet was consumed by XDP */
skb = build_skb(va, RQ_PAGE_SIZE(rq));
if (unlikely(!skb)) {
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
index 084d6c8..57452fd 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_stats.h
@@ -66,6 +66,8 @@ struct mlx5e_sw_stats {
u64 rx_csum_complete;
u64 rx_csum_unnecessary_inner;
u64 rx_xdp_drop;
+ u64 rx_xdp_tx;
+ u64 rx_xdp_tx_full;
u64 tx_csum_partial;
u64 tx_csum_partial_inner;
u64 tx_queue_stopped;
@@ -102,6 +104,8 @@ static const struct counter_desc sw_stats_desc[] = {
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_complete) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_csum_unnecessary_inner) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_drop) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_tx) },
+ { MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, rx_xdp_tx_full) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_csum_partial_inner) },
{ MLX5E_DECLARE_STAT(struct mlx5e_sw_stats, tx_queue_stopped) },
@@ -281,6 +285,8 @@ struct mlx5e_rq_stats {
u64 lro_packets;
u64 lro_bytes;
u64 xdp_drop;
+ u64 xdp_tx;
+ u64 xdp_tx_full;
u64 wqe_err;
u64 mpwqe_filler;
u64 buff_alloc_err;
@@ -299,6 +305,8 @@ static const struct counter_desc rq_stats_desc[] = {
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_unnecessary_inner) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, csum_none) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_drop) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx) },
+ { MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, xdp_tx_full) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_packets) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, lro_bytes) },
{ MLX5E_DECLARE_RX_STAT(struct mlx5e_rq_stats, wqe_err) },
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
index a728303..7191035 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_tx.c
@@ -493,16 +493,13 @@ bool mlx5e_poll_tx_cq(struct mlx5e_cq *cq, int napi_budget)
return (i == MLX5E_TX_CQ_POLL_BUDGET);
}
-void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
+static void mlx5e_free_txq_sq_descs(struct mlx5e_sq *sq)
{
struct mlx5e_tx_wqe_info *wi;
struct sk_buff *skb;
u16 ci;
int i;
- if (sq->type != MLX5E_SQ_TXQ)
- return;
-
while (sq->cc != sq->pc) {
ci = sq->cc & sq->wq.sz_m1;
skb = sq->db.txq.skb[ci];
@@ -524,3 +521,37 @@ void mlx5e_free_tx_descs(struct mlx5e_sq *sq)
sq->cc += wi->num_wqebbs;
}
}
+
+static void mlx5e_free_xdp_sq_descs(struct mlx5e_sq *sq)
+{
+ struct mlx5e_sq_wqe_info *wi;
+ struct mlx5e_dma_info *di;
+ u16 ci;
+
+ while (sq->cc != sq->pc) {
+ ci = sq->cc & sq->wq.sz_m1;
+ di = &sq->db.xdp.di[ci];
+ wi = &sq->db.xdp.wqe_info[ci];
+
+ if (wi->opcode == MLX5_OPCODE_NOP) {
+ sq->cc++;
+ continue;
+ }
+
+ sq->cc += wi->num_wqebbs;
+
+ mlx5e_page_release(&sq->channel->rq, di, false);
+ }
+}
+
+void mlx5e_free_sq_descs(struct mlx5e_sq *sq)
+{
+ switch (sq->type) {
+ case MLX5E_SQ_TXQ:
+ mlx5e_free_txq_sq_descs(sq);
+ break;
+ case MLX5E_SQ_XDP:
+ mlx5e_free_xdp_sq_descs(sq);
+ break;
+ }
+}
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
index 47cd561..397285d 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_txrx.c
@@ -72,7 +72,7 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
do {
u16 ci = be16_to_cpu(cqe->wqe_counter) & wq->sz_m1;
- struct mlx5e_ico_wqe_info *icowi = &sq->db.ico_wqe[ci];
+ struct mlx5e_sq_wqe_info *icowi = &sq->db.ico_wqe[ci];
mlx5_cqwq_pop(&cq->wq);
sqcc += icowi->num_wqebbs;
@@ -105,6 +105,66 @@ static void mlx5e_poll_ico_cq(struct mlx5e_cq *cq)
sq->cc = sqcc;
}
+static inline bool mlx5e_poll_xdp_tx_cq(struct mlx5e_cq *cq)
+{
+ struct mlx5e_sq *sq;
+ u16 sqcc;
+ int i;
+
+ sq = container_of(cq, struct mlx5e_sq, cq);
+
+ if (unlikely(test_bit(MLX5E_SQ_STATE_FLUSH, &sq->state)))
+ return false;
+
+ /* sq->cc must be updated only after mlx5_cqwq_update_db_record(),
+ * otherwise a cq overrun may occur
+ */
+ sqcc = sq->cc;
+
+ for (i = 0; i < MLX5E_TX_CQ_POLL_BUDGET; i++) {
+ struct mlx5_cqe64 *cqe;
+ u16 wqe_counter;
+ bool last_wqe;
+
+ cqe = mlx5e_get_cqe(cq);
+ if (!cqe)
+ break;
+
+ mlx5_cqwq_pop(&cq->wq);
+
+ wqe_counter = be16_to_cpu(cqe->wqe_counter);
+
+ do {
+ struct mlx5e_sq_wqe_info *wi;
+ struct mlx5e_dma_info *di;
+ u16 ci;
+
+ last_wqe = (sqcc == wqe_counter);
+
+ ci = sqcc & sq->wq.sz_m1;
+ di = &sq->db.xdp.di[ci];
+ wi = &sq->db.xdp.wqe_info[ci];
+
+ if (unlikely(wi->opcode == MLX5_OPCODE_NOP)) {
+ sqcc++;
+ continue;
+ }
+
+ sqcc += wi->num_wqebbs;
+ /* Recycle RX page */
+ mlx5e_page_release(&cq->channel->rq, di, true);
+ } while (!last_wqe);
+ }
+
+ mlx5_cqwq_update_db_record(&cq->wq);
+
+ /* ensure cq space is freed before enabling more cqes */
+ wmb();
+
+ sq->cc = sqcc;
+ return (i == MLX5E_TX_CQ_POLL_BUDGET);
+}
+
int mlx5e_napi_poll(struct napi_struct *napi, int budget)
{
struct mlx5e_channel *c = container_of(napi, struct mlx5e_channel,
@@ -121,6 +181,9 @@ int mlx5e_napi_poll(struct napi_struct *napi, int budget)
work_done = mlx5e_poll_rx_cq(&c->rq.cq, budget);
busy |= work_done == budget;
+ if (c->xdp)
+ busy |= mlx5e_poll_xdp_tx_cq(&c->xdp_sq.cq);
+
mlx5e_poll_ico_cq(&c->icosq.cq);
busy |= mlx5e_post_rx_wqes(&c->rq);
--
2.7.4
* [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
2016-09-07 12:42 [PATCH RFC 00/11] mlx5 RX refactoring and XDP support Saeed Mahameed
` (9 preceding siblings ...)
2016-09-07 12:42 ` [PATCH RFC 10/11] net/mlx5e: XDP TX forwarding support Saeed Mahameed
@ 2016-09-07 12:42 ` Saeed Mahameed
2016-09-07 13:44 ` John Fastabend
` (2 more replies)
[not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
11 siblings, 3 replies; 72+ messages in thread
From: Saeed Mahameed @ 2016-09-07 12:42 UTC (permalink / raw)
To: iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim, Saeed Mahameed
Previously we rang the XDP SQ doorbell on every forwarded XDP packet.
Here we introduce an xmit-more-like mechanism that queues up more
than one packet in the SQ (up to the RX napi budget) without notifying the hardware.
Once the RX napi budget is consumed and we exit the napi RX loop, we
flush (doorbell) all XDP looped packets, if there are any.
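The deferred-doorbell flow described above can be illustrated with a small user-space model. This is only a sketch under stated assumptions, not the driver code: `toy_sq`, `toy_xmit_frame`, `toy_doorbell`, `toy_handle_tx`, `toy_poll` and `TOY_SQ_SIZE` are hypothetical names, and the "hardware" is reduced to a counter. It mirrors the two flush points in the patch: once at the end of the poll pass, and early if the send queue fills up while a doorbell is still pending.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_SQ_SIZE 8u /* hypothetical ring size, not taken from the driver */

struct toy_sq {
	uint16_t pc;        /* producer counter: descriptors posted */
	uint16_t cc;        /* consumer counter: descriptors completed */
	uint16_t hw_pc;     /* last producer value made visible to "hardware" */
	unsigned doorbells; /* number of doorbell writes issued */
};

/* Queue one frame without notifying hardware; fails when the ring is full. */
static bool toy_xmit_frame(struct toy_sq *sq)
{
	if ((uint16_t)(sq->pc - sq->cc) >= TOY_SQ_SIZE)
		return false;
	sq->pc++;
	return true;
}

/* Publish everything queued so far with a single doorbell write. */
static void toy_doorbell(struct toy_sq *sq)
{
	sq->hw_pc = sq->pc;
	sq->doorbells++;
}

/* Handle one packet whose verdict is "transmit". */
static bool toy_handle_tx(struct toy_sq *sq, bool *doorbell_pending)
{
	bool consumed = toy_xmit_frame(sq);

	if (!consumed && *doorbell_pending) {
		/* Ring is full: flush now so hardware can start draining it. */
		toy_doorbell(sq);
		*doorbell_pending = false;
	}
	*doorbell_pending |= consumed;
	return consumed;
}

/* One poll pass over npkts packets; returns how many were queued. */
static unsigned toy_poll(struct toy_sq *sq, unsigned npkts)
{
	bool doorbell_pending = false;
	unsigned sent = 0;

	for (unsigned i = 0; i < npkts; i++)
		sent += toy_handle_tx(sq, &doorbell_pending);

	/* Single flush at the end of the pass, only if something was queued. */
	if (doorbell_pending)
		toy_doorbell(sq);
	return sent;
}
```

In this model a pass of five packets on an empty ring costs one doorbell instead of five; a second pass that overflows the ring flushes once at the point of overflow and drops the remainder.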
XDP forward packet rate:
Comparing XDP with and w/o xmit more (bulk transmit):
Streams    XDP TX       XDP TX (xmit more)
---------------------------------------------------
1          4.90Mpps     7.50Mpps
2          9.50Mpps     14.8Mpps
4          16.5Mpps     25.1Mpps
8          21.5Mpps     27.5Mpps*
16         24.1Mpps     27.5Mpps*
*It seems we hit a wall of 27.5Mpps for 8 and 16 streams;
we are working on the analysis and will publish the conclusions
later.
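As a quick reading aid for the table above, the relative gain per row can be computed directly from the reported rates. The sketch below only restates the published numbers; the names `rate_row`, `rate_rows` and `speedup` are hypothetical. The gain is roughly 1.5x for 1, 2 and 4 streams and shrinks to about 1.28x and 1.14x at 8 and 16 streams, where both runs approach the 27.5Mpps ceiling.

```c
#include <assert.h>
#include <stddef.h>

struct rate_row {
	unsigned streams;
	double base_mpps; /* XDP TX, doorbell per packet */
	double bulk_mpps; /* XDP TX with xmit more */
};

/* Rates copied from the table in the commit message. */
static const struct rate_row rate_rows[] = {
	{  1,  4.90,  7.50 },
	{  2,  9.50, 14.8  },
	{  4, 16.5,  25.1  },
	{  8, 21.5,  27.5  },
	{ 16, 24.1,  27.5  },
};

#define NUM_RATE_ROWS (sizeof(rate_rows) / sizeof(rate_rows[0]))

/* Ratio of bulk rate to baseline rate for one table row. */
static double speedup(size_t row)
{
	assert(row < NUM_RATE_ROWS);
	return rate_rows[row].bulk_mpps / rate_rows[row].base_mpps;
}
```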
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
---
drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
2 files changed, 49 insertions(+), 17 deletions(-)
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
index df2c9e0..6846208 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
@@ -265,7 +265,8 @@ struct mlx5e_cq {
struct mlx5e_rq;
typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
- struct mlx5_cqe64 *cqe);
+ struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell);
typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
u16 ix);
@@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
bool recycle);
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell);
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell);
bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
index 912a0e2..ed93251 100644
--- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
+++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
@@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
struct mlx5e_cq *cq,
int update_owner_only,
- int budget_rem)
+ int budget_rem,
+ bool *xdp_doorbell)
{
u32 cqcc = cq->wq.cc + update_owner_only;
u32 cqe_count;
@@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
mlx5e_read_mini_arr_slot(cq, cqcc);
mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
- rq->handle_rx_cqe(rq, &cq->title);
+ rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
}
mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
cq->wq.cc = cqcc;
@@ -143,15 +144,16 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
struct mlx5e_cq *cq,
- int budget_rem)
+ int budget_rem,
+ bool *xdp_doorbell)
{
mlx5e_read_title_slot(rq, cq, cq->wq.cc);
mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
- rq->handle_rx_cqe(rq, &cq->title);
+ rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
cq->mini_arr_idx++;
- return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
+ return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
}
void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
@@ -670,23 +672,36 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
sq->pc += MLX5E_XDP_TX_WQEBBS;
- /* TODO: xmit more */
+ /* mlx5e_xmit_xdp_doorbell() will be called after the RX napi loop */
+ return true;
+}
+
+static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
+{
+ struct mlx5_wq_cyc *wq = &sq->wq;
+ struct mlx5e_tx_wqe *wqe;
+ u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
+
+ wqe = mlx5_wq_cyc_get_wqe(wq, pi);
+
wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
+#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
/* fill sq edge with nops to avoid wqe wrap around */
while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
mlx5e_send_nop(sq, false);
}
- return true;
+#endif
}
/* returns true if packet was consumed by xdp */
static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
const struct bpf_prog *prog,
struct mlx5e_dma_info *di,
- void *data, u16 len)
+ void *data, u16 len,
+ bool *xdp_doorbell)
{
bool consumed = false;
struct xdp_buff xdp;
@@ -705,7 +720,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
MLX5_RX_HEADROOM,
len);
+ if (unlikely(!consumed) && (*xdp_doorbell)) {
+ /* SQ is full, ring doorbell */
+ mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+ *xdp_doorbell = false;
+ }
rq->stats.xdp_tx += consumed;
+ *xdp_doorbell |= consumed;
return consumed;
default:
bpf_warn_invalid_xdp_action(act);
@@ -720,7 +741,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
return false;
}
-void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell)
{
struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
struct mlx5e_dma_info *di;
@@ -752,7 +774,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
goto wq_ll_pop;
}
- if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
+ if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
goto wq_ll_pop; /* page/packet was consumed by XDP */
skb = build_skb(va, RQ_PAGE_SIZE(rq));
@@ -814,7 +836,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
skb->len += headlen;
}
-void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
+void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
+ bool *xdp_doorbell)
{
u16 cstrides = mpwrq_get_cqe_consumed_strides(cqe);
u16 wqe_id = be16_to_cpu(cqe->wqe_id);
@@ -860,13 +883,15 @@ mpwrq_cqe_out:
int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
{
struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
+ bool xdp_doorbell = false;
int work_done = 0;
if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
return 0;
if (cq->decmprs_left)
- work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
+ work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
+ &xdp_doorbell);
for (; work_done < budget; work_done++) {
struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
@@ -877,15 +902,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
work_done +=
mlx5e_decompress_cqes_start(rq, cq,
- budget - work_done);
+ budget - work_done,
+ &xdp_doorbell);
continue;
}
mlx5_cqwq_pop(&cq->wq);
- rq->handle_rx_cqe(rq, cqe);
+ rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
}
+ if (xdp_doorbell)
+ mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
+
mlx5_cqwq_update_db_record(&cq->wq);
/* ensure cq space is freed before enabling more cqes */
--
2.7.4
* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
@ 2016-09-07 13:44 ` John Fastabend
[not found] ` <57D019B2.7070007-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
2016-09-07 14:41 ` Eric Dumazet
[not found] ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2 siblings, 1 reply; 72+ messages in thread
From: John Fastabend @ 2016-09-07 13:44 UTC (permalink / raw)
To: Saeed Mahameed, iovisor-dev
Cc: netdev, Tariq Toukan, Brenden Blanco, Alexei Starovoitov,
Tom Herbert, Martin KaFai Lau, Jesper Dangaard Brouer,
Daniel Borkmann, Eric Dumazet, Jamal Hadi Salim
On 16-09-07 05:42 AM, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
>
> XDP forward packet rate:
>
> Comparing XDP with and w/o xmit more (bulk transmit):
>
> Streams XDP TX XDP TX (xmit more)
> ---------------------------------------------------
> 1 4.90Mpps 7.50Mpps
> 2 9.50Mpps 14.8Mpps
> 4 16.5Mpps 25.1Mpps
> 8 21.5Mpps 27.5Mpps*
> 16 24.1Mpps 27.5Mpps*
>
Hi Saeed,
How many cores are you using with these numbers? Just a single
core? Or are streams being RSS'd across cores somehow?
> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
>
Thanks,
John
* Re: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
2016-09-07 12:42 ` [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more Saeed Mahameed
2016-09-07 13:44 ` John Fastabend
@ 2016-09-07 14:41 ` Eric Dumazet
[not found] ` <1473259302.10725.31.camel-XN9IlZ5yJG9HTL0Zs8A6p+yfmBU6pStAUsxypvmhUTTZJqsBc5GL+g@public.gmane.org>
[not found] ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
2 siblings, 1 reply; 72+ messages in thread
From: Eric Dumazet @ 2016-09-07 14:41 UTC (permalink / raw)
To: Saeed Mahameed
Cc: iovisor-dev, netdev, Tariq Toukan, Brenden Blanco,
Alexei Starovoitov, Tom Herbert, Martin KaFai Lau,
Jesper Dangaard Brouer, Daniel Borkmann, Eric Dumazet,
Jamal Hadi Salim
On Wed, 2016-09-07 at 15:42 +0300, Saeed Mahameed wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
Why does this idea depend on XDP?
It looks like we could apply it to any driver having one IRQ servicing
one RX and one TX, without XDP being involved.
* README: [PATCH RFC 11/11] net/mlx5e: XDP TX xmit more
[not found] ` <1473252152-11379-12-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-08 8:11 ` Jesper Dangaard Brouer via iovisor-dev
[not found] ` <20160908101147.1b351432-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org>
0 siblings, 1 reply; 72+ messages in thread
From: Jesper Dangaard Brouer via iovisor-dev @ 2016-09-08 8:11 UTC (permalink / raw)
To: Saeed Mahameed
Cc: netdev-u79uwXL29TY76Z2rM5mHXA, iovisor-dev, Jamal Hadi Salim,
Eric Dumazet, Tom Herbert
I'm sorry but I have a problem with this patch!
Looking at this patch, I want to bring up a fundamental architectural
concern with the development direction of XDP transmit.
What you are trying to implement, with delaying the doorbell, is
basically TX bulking for TX_XDP.
Why not implement a TX bulking interface directly instead?!?
Yes, the tailptr/doorbell is the most costly operation, but why not
also take advantage of the benefits of bulking for other parts of the
code? (The benefit is smaller, but every cycle counts in this area.)
This whole XDP exercise is about avoiding a per-packet transaction
cost, which means "bulking" or "bundling" packets where possible.
Let's do bundling/bulking from the start!
The reason behind the xmit_more API is that we could not change the
API of all the drivers. And we found that calling an explicit NDO
flush came at a cost (only approx 7 ns IIRC), but it is still a cost that
would hit the common single-packet use-case.
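To make the trade-off above concrete, here is a minimal model of an xmit_more-style hint. It is a sketch, not the kernel interface: `toy_txq` and `toy_start_xmit` are hypothetical, and the `more` argument stands in for the stack's hint that another packet follows immediately. The point it illustrates is that a lone packet rings the doorbell inside the same transmit call, so the single-packet case pays for no separate flush call.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_txq {
	unsigned queued;    /* descriptors posted but not yet announced */
	unsigned announced; /* descriptors the device has been told about */
	unsigned doorbells; /* doorbell writes issued */
};

/* Hypothetical per-packet transmit hook.
 * 'more' == true means the caller promises another packet right away,
 * so the doorbell can be postponed to that later call.
 */
static void toy_start_xmit(struct toy_txq *q, bool more)
{
	q->queued++;
	if (!more) {
		q->announced += q->queued;
		q->queued = 0;
		q->doorbells++;
	}
}
```

A single packet sent with `more == false` costs one call and one doorbell; a burst of three sent as `true, true, false` also costs one doorbell.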
It should be really easy to build a bundle of packets that need XDP_TX
action, especially given you only have a single destination "port".
And then you XDP_TX send this bundle before mlx5_cqwq_update_db_record.
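A bundle for a single destination, as suggested above, could look roughly like the following. This is a hedged sketch and not a proposed driver API: `toy_frame`, `toy_bundle`, `toy_port`, `toy_bundle_add`, `toy_bundle_flush` and `TOY_BUNDLE_MAX` are all hypothetical. Frames are collected during the RX loop and handed to the port in one pass with one doorbell, which is also where other per-packet work could be amortized.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BUNDLE_MAX 64 /* hypothetical capacity, e.g. one poll budget */

struct toy_frame {
	void *data;
	unsigned len;
};

struct toy_bundle {
	struct toy_frame frames[TOY_BUNDLE_MAX];
	size_t count;
};

struct toy_port {
	unsigned long tx_frames;
	unsigned long tx_bytes;
	unsigned doorbells;
};

/* Send the whole bundle in one pass and notify the device once. */
static void toy_bundle_flush(struct toy_bundle *b, struct toy_port *port)
{
	if (b->count == 0)
		return;

	for (size_t i = 0; i < b->count; i++) {
		port->tx_frames++;
		port->tx_bytes += b->frames[i].len;
	}
	port->doorbells++;
	b->count = 0;
}

/* Collect one frame; flush first if the bundle is already full. */
static void toy_bundle_add(struct toy_bundle *b, struct toy_port *port,
			   void *data, unsigned len)
{
	if (b->count == TOY_BUNDLE_MAX)
		toy_bundle_flush(b, port);

	b->frames[b->count].data = data;
	b->frames[b->count].len = len;
	b->count++;
}
```

In a poll loop the caller would invoke `toy_bundle_add()` for each frame with a transmit verdict and `toy_bundle_flush()` once before updating the completion queue doorbell record.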
In the future, XDP needs to support XDP_FWD forwarding of packets/pages
out other interfaces. I also want bulk transmit from day-1 here. It
is slightly more tricky to sort packets for multiple outgoing
interfaces efficiently in the poll loop.
But the mSwitch[1] article actually already solved this destination
sorting. Please read[1] section 3.3 "Switch Fabric Algorithm" for
understanding the next steps, for a smarter data structure, when
starting to have more TX "ports". And perhaps align your single
XDP_TX destination data structure to this future development.
[1] http://info.iet.unipi.it/~luigi/papers/20150617-mswitch-paper.pdf
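One simple way to picture destination sorting for several TX ports is to bucket a received batch by destination, so that each port is flushed at most once per batch. The sketch below is a generic illustration of that idea and is not claimed to be the algorithm from the paper; `toy_pkt`, `toy_dst_queue`, `toy_sort_by_dst`, `TOY_BATCH` and `TOY_PORTS` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_BATCH 64 /* hypothetical maximum packets per poll batch */
#define TOY_PORTS 4  /* hypothetical number of TX destinations */

struct toy_pkt {
	unsigned dst_port;
	unsigned len;
};

/* Per-destination list of indices into the received batch. */
struct toy_dst_queue {
	size_t idx[TOY_BATCH];
	size_t count;
};

/* Bucket a batch by destination, preserving arrival order per port.
 * Returns the number of ports that received at least one packet,
 * i.e. the number of flushes (doorbells) the batch would need.
 */
static size_t toy_sort_by_dst(const struct toy_pkt *pkts, size_t n,
			      struct toy_dst_queue q[TOY_PORTS])
{
	size_t flushes = 0;

	assert(n <= TOY_BATCH);

	for (size_t p = 0; p < TOY_PORTS; p++)
		q[p].count = 0;

	for (size_t i = 0; i < n; i++) {
		unsigned d = pkts[i].dst_port;

		assert(d < TOY_PORTS);
		q[d].idx[q[d].count++] = i;
	}

	for (size_t p = 0; p < TOY_PORTS; p++)
		if (q[p].count)
			flushes++;

	return flushes;
}
```

With a single XDP_TX destination this degenerates to one bucket, which is why a single-port bundle structure can later be extended to this shape.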
--Jesper
(top post)
On Wed, 7 Sep 2016 15:42:32 +0300 Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> Previously we rang XDP SQ doorbell on every forwarded XDP packet.
>
> Here we introduce a xmit more like mechanism that will queue up more
> than one packet into SQ (up to RX napi budget) w/o notifying the hardware.
>
> Once RX napi budget is consumed and we exit napi RX loop, we will
> flush (doorbell) all XDP looped packets in case there are such.
>
> XDP forward packet rate:
>
> Comparing XDP with and w/o xmit more (bulk transmit):
>
> Streams XDP TX XDP TX (xmit more)
> ---------------------------------------------------
> 1 4.90Mpps 7.50Mpps
> 2 9.50Mpps 14.8Mpps
> 4 16.5Mpps 25.1Mpps
> 8 21.5Mpps 27.5Mpps*
> 16 24.1Mpps 27.5Mpps*
>
> *It seems we hit a wall of 27.5Mpps, for 8 and 16 streams,
> we will be working on the analysis and will publish the conclusions
> later.
>
> Signed-off-by: Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
> ---
> drivers/net/ethernet/mellanox/mlx5/core/en.h | 9 ++--
> drivers/net/ethernet/mellanox/mlx5/core/en_rx.c | 57 +++++++++++++++++++------
> 2 files changed, 49 insertions(+), 17 deletions(-)
>
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en.h b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> index df2c9e0..6846208 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en.h
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en.h
> @@ -265,7 +265,8 @@ struct mlx5e_cq {
>
> struct mlx5e_rq;
> typedef void (*mlx5e_fp_handle_rx_cqe)(struct mlx5e_rq *rq,
> - struct mlx5_cqe64 *cqe);
> + struct mlx5_cqe64 *cqe,
> + bool *xdp_doorbell);
> typedef int (*mlx5e_fp_alloc_wqe)(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe,
> u16 ix);
>
> @@ -742,8 +743,10 @@ void mlx5e_free_sq_descs(struct mlx5e_sq *sq);
>
> void mlx5e_page_release(struct mlx5e_rq *rq, struct mlx5e_dma_info *dma_info,
> bool recycle);
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe);
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> + bool *xdp_doorbell);
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> + bool *xdp_doorbell);
> bool mlx5e_post_rx_wqes(struct mlx5e_rq *rq);
> int mlx5e_alloc_rx_wqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> int mlx5e_alloc_rx_mpwqe(struct mlx5e_rq *rq, struct mlx5e_rx_wqe *wqe, u16 ix);
> diff --git a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> index 912a0e2..ed93251 100644
> --- a/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> +++ b/drivers/net/ethernet/mellanox/mlx5/core/en_rx.c
> @@ -117,7 +117,8 @@ static inline void mlx5e_decompress_cqe_no_hash(struct mlx5e_rq *rq,
> static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> struct mlx5e_cq *cq,
> int update_owner_only,
> - int budget_rem)
> + int budget_rem,
> + bool *xdp_doorbell)
> {
> u32 cqcc = cq->wq.cc + update_owner_only;
> u32 cqe_count;
> @@ -131,7 +132,7 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
> mlx5e_read_mini_arr_slot(cq, cqcc);
>
> mlx5e_decompress_cqe_no_hash(rq, cq, cqcc);
> - rq->handle_rx_cqe(rq, &cq->title);
> + rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
> }
> mlx5e_cqes_update_owner(cq, cq->wq.cc, cqcc - cq->wq.cc);
> cq->wq.cc = cqcc;
> @@ -143,15 +144,16 @@ static inline u32 mlx5e_decompress_cqes_cont(struct mlx5e_rq *rq,
>
> static inline u32 mlx5e_decompress_cqes_start(struct mlx5e_rq *rq,
> struct mlx5e_cq *cq,
> - int budget_rem)
> + int budget_rem,
> + bool *xdp_doorbell)
> {
> mlx5e_read_title_slot(rq, cq, cq->wq.cc);
> mlx5e_read_mini_arr_slot(cq, cq->wq.cc + 1);
> mlx5e_decompress_cqe(rq, cq, cq->wq.cc);
> - rq->handle_rx_cqe(rq, &cq->title);
> + rq->handle_rx_cqe(rq, &cq->title, xdp_doorbell);
> cq->mini_arr_idx++;
>
> - return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem) - 1;
> + return mlx5e_decompress_cqes_cont(rq, cq, 1, budget_rem, xdp_doorbell) - 1;
> }
>
> void mlx5e_modify_rx_cqe_compression(struct mlx5e_priv *priv, bool val)
> @@ -670,23 +672,36 @@ static inline bool mlx5e_xmit_xdp_frame(struct mlx5e_sq *sq,
> wi->num_wqebbs = MLX5E_XDP_TX_WQEBBS;
> sq->pc += MLX5E_XDP_TX_WQEBBS;
>
> - /* TODO: xmit more */
> + /* mlx5e_sq_xmit_doorbel will be called after RX napi loop */
> + return true;
> +}
> +
> +static inline void mlx5e_xmit_xdp_doorbell(struct mlx5e_sq *sq)
> +{
> + struct mlx5_wq_cyc *wq = &sq->wq;
> + struct mlx5e_tx_wqe *wqe;
> + u16 pi = (sq->pc - MLX5E_XDP_TX_WQEBBS) & wq->sz_m1; /* last pi */
> +
> + wqe = mlx5_wq_cyc_get_wqe(wq, pi);
> +
> wqe->ctrl.fm_ce_se = MLX5_WQE_CTRL_CQ_UPDATE;
> mlx5e_tx_notify_hw(sq, &wqe->ctrl, 0);
>
> +#if 0 /* enable this code only if MLX5E_XDP_TX_WQEBBS > 1 */
> /* fill sq edge with nops to avoid wqe wrap around */
> while ((pi = (sq->pc & wq->sz_m1)) > sq->edge) {
> sq->db.xdp.wqe_info[pi].opcode = MLX5_OPCODE_NOP;
> mlx5e_send_nop(sq, false);
> }
> - return true;
> +#endif
> }
>
> /* returns true if packet was consumed by xdp */
> static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
> const struct bpf_prog *prog,
> struct mlx5e_dma_info *di,
> - void *data, u16 len)
> + void *data, u16 len,
> + bool *xdp_doorbell)
> {
> bool consumed = false;
> struct xdp_buff xdp;
> @@ -705,7 +720,13 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
> consumed = mlx5e_xmit_xdp_frame(&rq->channel->xdp_sq, di,
> MLX5_RX_HEADROOM,
> len);
> + if (unlikely(!consumed) && (*xdp_doorbell)) {
> + /* SQ is full, ring doorbell */
> + mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
> + *xdp_doorbell = false;
> + }
> rq->stats.xdp_tx += consumed;
> + *xdp_doorbell |= consumed;
> return consumed;
> default:
> bpf_warn_invalid_xdp_action(act);
> @@ -720,7 +741,8 @@ static inline bool mlx5e_xdp_handle(struct mlx5e_rq *rq,
> return false;
> }
>
> -void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> + bool *xdp_doorbell)
> {
> struct bpf_prog *xdp_prog = READ_ONCE(rq->xdp_prog);
> struct mlx5e_dma_info *di;
> @@ -752,7 +774,7 @@ void mlx5e_handle_rx_cqe(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> goto wq_ll_pop;
> }
>
> - if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt))
> + if (mlx5e_xdp_handle(rq, xdp_prog, di, data, cqe_bcnt, xdp_doorbell))
> goto wq_ll_pop; /* page/packet was consumed by XDP */
>
> skb = build_skb(va, RQ_PAGE_SIZE(rq));
> @@ -814,7 +836,8 @@ static inline void mlx5e_mpwqe_fill_rx_skb(struct mlx5e_rq *rq,
> skb->len += headlen;
> }
>
> -void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe)
> +void mlx5e_handle_rx_cqe_mpwrq(struct mlx5e_rq *rq, struct mlx5_cqe64 *cqe,
> + bool *xdp_doorbell)
> {
> u16 cstrides = mpwrq_get_cqe_consumed_strides(cqe);
> u16 wqe_id = be16_to_cpu(cqe->wqe_id);
> @@ -860,13 +883,15 @@ mpwrq_cqe_out:
> int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
> {
> struct mlx5e_rq *rq = container_of(cq, struct mlx5e_rq, cq);
> + bool xdp_doorbell = false;
> int work_done = 0;
>
> if (unlikely(test_bit(MLX5E_RQ_STATE_FLUSH, &rq->state)))
> return 0;
>
> if (cq->decmprs_left)
> - work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget);
> + work_done += mlx5e_decompress_cqes_cont(rq, cq, 0, budget,
> + &xdp_doorbell);
>
> for (; work_done < budget; work_done++) {
> struct mlx5_cqe64 *cqe = mlx5e_get_cqe(cq);
> @@ -877,15 +902,19 @@ int mlx5e_poll_rx_cq(struct mlx5e_cq *cq, int budget)
> if (mlx5_get_cqe_format(cqe) == MLX5_COMPRESSED) {
> work_done +=
> mlx5e_decompress_cqes_start(rq, cq,
> - budget - work_done);
> + budget - work_done,
> + &xdp_doorbell);
> continue;
> }
>
> mlx5_cqwq_pop(&cq->wq);
>
> - rq->handle_rx_cqe(rq, cqe);
> + rq->handle_rx_cqe(rq, cqe, &xdp_doorbell);
> }
>
> + if (xdp_doorbell)
> + mlx5e_xmit_xdp_doorbell(&rq->channel->xdp_sq);
> +
> mlx5_cqwq_update_db_record(&cq->wq);
>
> /* ensure cq space is freed before enabling more cqes */
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
Author of http://www.iptv-analyzer.org
LinkedIn: http://www.linkedin.com/in/brouer
* Re: [PATCH RFC 00/11] mlx5 RX refactoring and XDP support
[not found] ` <1473252152-11379-1-git-send-email-saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org>
@ 2016-09-09 15:10 ` Saeed Mahameed via iovisor-dev
0 siblings, 0 replies; 72+ messages in thread
From: Saeed Mahameed via iovisor-dev @ 2016-09-09 15:10 UTC (permalink / raw)
To: Saeed Mahameed
Cc: Linux Netdev List, iovisor-dev, Jamal Hadi Salim, Eric Dumazet,
Tom Herbert
On Wed, Sep 7, 2016 at 3:42 PM, Saeed Mahameed <saeedm-VPRAkNaXOzVWk0Htik3J/w@public.gmane.org> wrote:
> Hi All,
>
> This patch set introduces some important data path RX refactoring
> addressing mlx5e memory allocation/management improvements and XDP support.
>
> Submitting as RFC since we would like to get an early feedback, while we
> continue reviewing testing and complete the performance analysis in house.
>
Hi,
I will be out of the office all of next week with only sporadic
mail access.
I will do my best to be as active as possible, but in the meantime
Tariq and Or will handle any questions regarding this series or
mlx5 in general while I am away.
Thanks,
Saeed.