* [PATCH net-next 1/9] nfp: don't enable TSO on the device when disabled
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
We advertise TSO to the stack but leave it disabled by default.
Make sure it's not only disabled in the netdev features but
also on the device itself.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 82bd6b0935f1..76251a09a1f3 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -3286,6 +3286,7 @@ int nfp_net_netdev_init(struct net_device *netdev)
/* Advertise but disable TSO by default. */
netdev->features &= ~(NETIF_F_TSO | NETIF_F_TSO6);
+ nn->dp.ctrl &= ~NFP_NET_CFG_CTRL_LSO;
/* Allow L2 Broadcast and Multicast through by default, if supported */
if (nn->cap & NFP_NET_CFG_CTRL_L2BC)
--
2.11.0
* [PATCH net-next 2/9] nfp: rename l4_offset in struct nfp_net_tx_desc to lso_hdrlen
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Edwin Peer
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
From: Edwin Peer <edwin.peer@netronome.com>
The l4_offset field referred to by NFD is confusingly named. It is not the
offset of the L4 transport header, but rather the offset of the L4 payload.
The LSO2 capability supported by alternative device firmware requires
the actual L4 offset, thus the rename seems prudent.
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
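A worked example of the two offsets, for readers of the archive. This
is only a sketch: struct demo_hdr_layout and both helpers are
hypothetical names and do not exist in the driver. It assumes an
untagged Ethernet frame (14 bytes) carrying IPv4 and TCP without
options (20 bytes each).

```c
#include <assert.h>

/* Hypothetical description of a frame's header lengths, in bytes. */
struct demo_hdr_layout {
	unsigned int l2_len;	/* e.g. 14 for untagged Ethernet */
	unsigned int l3_len;	/* e.g. 20 for IPv4 without options */
	unsigned int l4_len;	/* e.g. 20 for TCP without options */
};

/* Offset of the L4 (TCP) header from the start of the frame. */
static unsigned int demo_l4_offset(const struct demo_hdr_layout *h)
{
	return h->l2_len + h->l3_len;
}

/* Offset of the TCP payload, which is what lso_hdrlen carries. */
static unsigned int demo_lso_hdrlen(const struct demo_hdr_layout *h)
{
	assert(h->l4_len > 0);
	return demo_l4_offset(h) + h->l4_len;
}
```

With these lengths the L4 header starts at byte 34 while the payload
starts at byte 54, which is the distinction the rename is meant to
make visible.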
drivers/net/ethernet/netronome/nfp/nfp_net.h | 2 +-
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index fcf81b3be830..6bad11e5b845 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -153,7 +153,7 @@ struct nfp_net_tx_desc {
__le32 dma_addr_lo; /* Low 32bit of host buf addr */
__le16 mss; /* MSS to be used for LSO */
- u8 l4_offset; /* LSO, where the L4 data starts */
+ u8 lso_hdrlen; /* LSO, TCP payload offset */
u8 flags; /* TX Flags, see @PCIE_DESC_TX_* */
__le16 vlan; /* VLAN tag to add if indicated */
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 76251a09a1f3..0cebe9098451 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -677,7 +677,7 @@ static void nfp_net_tx_tso(struct nfp_net_r_vector *r_vec,
txbuf->real_len += hdrlen * (txbuf->pkt_cnt - 1);
mss = skb_shinfo(skb)->gso_size & PCIE_DESC_TX_MSS_MASK;
- txd->l4_offset = hdrlen;
+ txd->lso_hdrlen = hdrlen;
txd->mss = cpu_to_le16(mss);
txd->flags |= PCIE_DESC_TX_LSO;
@@ -823,7 +823,7 @@ static int nfp_net_tx(struct sk_buff *skb, struct net_device *netdev)
txd->flags = 0;
txd->mss = 0;
- txd->l4_offset = 0;
+ txd->lso_hdrlen = 0;
nfp_net_tx_tso(r_vec, txbuf, txd, skb);
@@ -1515,7 +1515,7 @@ nfp_net_tx_xdp_buf(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring,
txd->flags = 0;
txd->mss = 0;
- txd->l4_offset = 0;
+ txd->lso_hdrlen = 0;
tx_ring->wr_p++;
tx_ring->wr_ptr_add++;
--
2.11.0
* [PATCH net-next 7/9] nfp: complete the XDP TX ring only when it's full
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
Since the XDP TX ring holds "spare" RX buffers anyway, we don't have to
rush the completion. We can wait until the ring fills up completely
before trying to reclaim buffers. If RX poll has ended and no
buffer has been queued for XDP TX we have no guarantee we will see
another interrupt, so run the reclaim there as well, to make sure
TX statistics won't become stale.
This should help us reclaim more buffers per single queue controller
register read.
Note that the XDP completion is trivial: it only adds up
the sizes of transmitted frames for statistics, so the latency
spike should be acceptable. In case the user sets the ring sizes
to something crazy, limit the completion to 2048 entries.
The check if the ring is empty at the beginning of xdp_complete()
is no longer needed - the callers will perform it.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
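A standalone sketch of the capped reclaim, for illustration only.
struct demo_xdp_ring, DEMO_MAX_COMPLETE and demo_xdp_complete() are
hypothetical stand-ins, the hardware read pointer is passed in as a
plain argument instead of being read from the queue controller, and
the distance is computed with the masking form from patch 9/9 rather
than the if/else used in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_MAX_COMPLETE	2048	/* mirrors NFP_NET_XDP_MAX_COMPLETE */

/* Hypothetical, reduced view of an XDP TX ring. */
struct demo_xdp_ring {
	uint32_t cnt;		/* ring size, must be a power of 2 */
	uint32_t rd_p;		/* free-running software read pointer */
	uint32_t qcp_rd_p;	/* hardware read pointer already handled */
};

/*
 * Reclaim at most DEMO_MAX_COMPLETE entries up to hw_rd_p.
 * Returns true when no completed entries are left behind.
 */
static bool demo_xdp_complete(struct demo_xdp_ring *ring, uint32_t hw_rd_p)
{
	uint32_t todo;
	bool done_all;

	assert(ring->cnt && !(ring->cnt & (ring->cnt - 1)));

	if (hw_rd_p == ring->qcp_rd_p)
		return true;

	todo = (hw_rd_p + ring->cnt - ring->qcp_rd_p) & (ring->cnt - 1);
	done_all = todo <= DEMO_MAX_COMPLETE;
	if (todo > DEMO_MAX_COMPLETE)
		todo = DEMO_MAX_COMPLETE;

	/* Advance by what was handled, so the remainder is seen next time. */
	ring->qcp_rd_p = (ring->qcp_rd_p + todo) & (ring->cnt - 1);
	ring->rd_p += todo;

	return done_all;
}
```

A false return corresponds to the case where nfp_net_rx() reports the
full budget, so that NAPI polls again and the rest gets reclaimed.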
drivers/net/ethernet/netronome/nfp/nfp_net.h | 1 +
.../net/ethernet/netronome/nfp/nfp_net_common.c | 52 ++++++++++++++--------
2 files changed, 35 insertions(+), 18 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index acd9811d08d1..66319a1026bb 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -102,6 +102,7 @@
#define NFP_NET_RX_DESCS_DEFAULT 4096 /* Default # of Rx descs per ring */
#define NFP_NET_FL_BATCH 16 /* Add freelist in this Batch size */
+#define NFP_NET_XDP_MAX_COMPLETE 2048 /* XDP bufs to reclaim in NAPI poll */
/* Offload definitions */
#define NFP_NET_N_VXLAN_PORTS (NFP_NET_CFG_VXLAN_SZ / sizeof(__be16))
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index d640b3331741..a4c6878f1df1 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1001,27 +1001,30 @@ static void nfp_net_tx_complete(struct nfp_net_tx_ring *tx_ring)
tx_ring->rd_p, tx_ring->wr_p, tx_ring->cnt);
}
-static void nfp_net_xdp_complete(struct nfp_net_tx_ring *tx_ring)
+static bool nfp_net_xdp_complete(struct nfp_net_tx_ring *tx_ring)
{
struct nfp_net_r_vector *r_vec = tx_ring->r_vec;
u32 done_pkts = 0, done_bytes = 0;
+ bool done_all;
int idx, todo;
u32 qcp_rd_p;
- if (tx_ring->wr_p == tx_ring->rd_p)
- return;
-
/* Work out how many descriptors have been transmitted */
qcp_rd_p = nfp_qcp_rd_ptr_read(tx_ring->qcp_q);
if (qcp_rd_p == tx_ring->qcp_rd_p)
- return;
+ return true;
if (qcp_rd_p > tx_ring->qcp_rd_p)
todo = qcp_rd_p - tx_ring->qcp_rd_p;
else
todo = qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p;
+ done_all = todo <= NFP_NET_XDP_MAX_COMPLETE;
+ todo = min(todo, NFP_NET_XDP_MAX_COMPLETE);
+
+ tx_ring->qcp_rd_p = (tx_ring->qcp_rd_p + todo) & (tx_ring->cnt - 1);
+
done_pkts = todo;
while (todo--) {
idx = tx_ring->rd_p & (tx_ring->cnt - 1);
@@ -1030,16 +1033,16 @@ static void nfp_net_xdp_complete(struct nfp_net_tx_ring *tx_ring)
done_bytes += tx_ring->txbufs[idx].real_len;
}
- tx_ring->qcp_rd_p = qcp_rd_p;
-
u64_stats_update_begin(&r_vec->tx_sync);
r_vec->tx_bytes += done_bytes;
r_vec->tx_pkts += done_pkts;
u64_stats_update_end(&r_vec->tx_sync);
WARN_ONCE(tx_ring->wr_p - tx_ring->rd_p > tx_ring->cnt,
- "TX ring corruption rd_p=%u wr_p=%u cnt=%u\n",
+ "XDP TX ring corruption rd_p=%u wr_p=%u cnt=%u\n",
tx_ring->rd_p, tx_ring->wr_p, tx_ring->cnt);
+
+ return done_all;
}
/**
@@ -1500,15 +1503,23 @@ static bool
nfp_net_tx_xdp_buf(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring,
struct nfp_net_tx_ring *tx_ring,
struct nfp_net_rx_buf *rxbuf, unsigned int dma_off,
- unsigned int pkt_len)
+ unsigned int pkt_len, bool *completed)
{
struct nfp_net_tx_buf *txbuf;
struct nfp_net_tx_desc *txd;
int wr_idx;
if (unlikely(nfp_net_tx_full(tx_ring, 1))) {
- nfp_net_rx_drop(dp, rx_ring->r_vec, rx_ring, rxbuf, NULL);
- return false;
+ if (!*completed) {
+ nfp_net_xdp_complete(tx_ring);
+ *completed = true;
+ }
+
+ if (unlikely(nfp_net_tx_full(tx_ring, 1))) {
+ nfp_net_rx_drop(dp, rx_ring->r_vec, rx_ring, rxbuf,
+ NULL);
+ return false;
+ }
}
wr_idx = tx_ring->wr_p & (tx_ring->cnt - 1);
@@ -1580,6 +1591,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
struct nfp_net_dp *dp = &r_vec->nfp_net->dp;
struct nfp_net_tx_ring *tx_ring;
struct bpf_prog *xdp_prog;
+ bool xdp_tx_cmpl = false;
unsigned int true_bufsz;
struct sk_buff *skb;
int pkts_polled = 0;
@@ -1690,7 +1702,8 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
if (unlikely(!nfp_net_tx_xdp_buf(dp, rx_ring,
tx_ring, rxbuf,
dma_off,
- pkt_len)))
+ pkt_len,
+ &xdp_tx_cmpl)))
trace_xdp_exception(dp->netdev,
xdp_prog, act);
continue;
@@ -1738,8 +1751,14 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
napi_gro_receive(&rx_ring->r_vec->napi, skb);
}
- if (xdp_prog && tx_ring->wr_ptr_add)
- nfp_net_tx_xmit_more_flush(tx_ring);
+ if (xdp_prog) {
+ if (tx_ring->wr_ptr_add)
+ nfp_net_tx_xmit_more_flush(tx_ring);
+ else if (unlikely(tx_ring->wr_p != tx_ring->rd_p) &&
+ !xdp_tx_cmpl)
+ if (!nfp_net_xdp_complete(tx_ring))
+ pkts_polled = budget;
+ }
rcu_read_unlock();
return pkts_polled;
@@ -1760,11 +1779,8 @@ static int nfp_net_poll(struct napi_struct *napi, int budget)
if (r_vec->tx_ring)
nfp_net_tx_complete(r_vec->tx_ring);
- if (r_vec->rx_ring) {
+ if (r_vec->rx_ring)
pkts_polled = nfp_net_rx(r_vec->rx_ring, budget);
- if (r_vec->xdp_ring)
- nfp_net_xdp_complete(r_vec->xdp_ring);
- }
if (pkts_polled < budget)
if (napi_complete_done(napi, pkts_polled))
--
2.11.0
* [PATCH net-next 8/9] nfp: add a helper for wrapping descriptor index
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
We have a number of places where we calculate the descriptor
index based on a value which may have wrapped past the ring size.
Create a macro for masking with the ring size.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
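A minimal userspace sketch of how the macro is meant to be used.
struct demo_ring and demo_ring_next_wr_idx() are hypothetical; only
the D_IDX() definition is taken from this patch. The assert spells
out the assumption that the ring size is a power of 2, which is what
makes masking equivalent to a modulo.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the driver's ring structs. */
struct demo_ring {
	uint32_t cnt;	/* number of descriptors, must be a power of 2 */
	uint32_t wr_p;	/* free-running write pointer */
	uint32_t rd_p;	/* free-running read pointer */
};

/* Same definition as the macro added by this patch. */
#define D_IDX(ring, idx)	((idx) & ((ring)->cnt - 1))

/* Return the slot for the next write and advance the write pointer. */
static uint32_t demo_ring_next_wr_idx(struct demo_ring *ring)
{
	/* Masking only equals modulo when cnt is a power of 2. */
	assert(ring->cnt && !(ring->cnt & (ring->cnt - 1)));
	return D_IDX(ring, ring->wr_p++);
}
```

Because the pointers are free-running, the index stays valid even
when the 32-bit counter itself wraps around.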
drivers/net/ethernet/netronome/nfp/nfp_net.h | 3 +++
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 21 ++++++++++-----------
2 files changed, 13 insertions(+), 11 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 66319a1026bb..7b9518cbe965 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -117,6 +117,9 @@ struct nfp_eth_table_port;
struct nfp_net;
struct nfp_net_r_vector;
+/* Convenience macro for wrapping descriptor index on ring size */
+#define D_IDX(ring, idx) ((idx) & ((ring)->cnt - 1))
+
/* Convenience macro for writing dma address into RX/TX descriptors */
#define nfp_desc_set_dma_addr(desc, dma_addr) \
do { \
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index a4c6878f1df1..c64514f8ee65 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -809,7 +809,7 @@ static int nfp_net_tx(struct sk_buff *skb, struct net_device *netdev)
if (dma_mapping_error(dp->dev, dma_addr))
goto err_free;
- wr_idx = tx_ring->wr_p & (tx_ring->cnt - 1);
+ wr_idx = D_IDX(tx_ring, tx_ring->wr_p);
/* Stash the soft descriptor of the head then initialize it */
txbuf = &tx_ring->txbufs[wr_idx];
@@ -852,7 +852,7 @@ static int nfp_net_tx(struct sk_buff *skb, struct net_device *netdev)
if (dma_mapping_error(dp->dev, dma_addr))
goto err_unmap;
- wr_idx = (wr_idx + 1) & (tx_ring->cnt - 1);
+ wr_idx = D_IDX(tx_ring, wr_idx + 1);
tx_ring->txbufs[wr_idx].skb = skb;
tx_ring->txbufs[wr_idx].dma_addr = dma_addr;
tx_ring->txbufs[wr_idx].fidx = f;
@@ -946,8 +946,7 @@ static void nfp_net_tx_complete(struct nfp_net_tx_ring *tx_ring)
todo = qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p;
while (todo--) {
- idx = tx_ring->rd_p & (tx_ring->cnt - 1);
- tx_ring->rd_p++;
+ idx = D_IDX(tx_ring, tx_ring->rd_p++);
skb = tx_ring->txbufs[idx].skb;
if (!skb)
@@ -1023,11 +1022,11 @@ static bool nfp_net_xdp_complete(struct nfp_net_tx_ring *tx_ring)
done_all = todo <= NFP_NET_XDP_MAX_COMPLETE;
todo = min(todo, NFP_NET_XDP_MAX_COMPLETE);
- tx_ring->qcp_rd_p = (tx_ring->qcp_rd_p + todo) & (tx_ring->cnt - 1);
+ tx_ring->qcp_rd_p = D_IDX(tx_ring, tx_ring->qcp_rd_p + todo);
done_pkts = todo;
while (todo--) {
- idx = tx_ring->rd_p & (tx_ring->cnt - 1);
+ idx = D_IDX(tx_ring, tx_ring->rd_p);
tx_ring->rd_p++;
done_bytes += tx_ring->txbufs[idx].real_len;
@@ -1063,7 +1062,7 @@ nfp_net_tx_ring_reset(struct nfp_net_dp *dp, struct nfp_net_tx_ring *tx_ring)
struct sk_buff *skb;
int idx, nr_frags;
- idx = tx_ring->rd_p & (tx_ring->cnt - 1);
+ idx = D_IDX(tx_ring, tx_ring->rd_p);
tx_buf = &tx_ring->txbufs[idx];
skb = tx_ring->txbufs[idx].skb;
@@ -1216,7 +1215,7 @@ static void nfp_net_rx_give_one(const struct nfp_net_dp *dp,
{
unsigned int wr_idx;
- wr_idx = rx_ring->wr_p & (rx_ring->cnt - 1);
+ wr_idx = D_IDX(rx_ring, rx_ring->wr_p);
nfp_net_dma_sync_dev_rx(dp, dma_addr);
@@ -1254,7 +1253,7 @@ static void nfp_net_rx_ring_reset(struct nfp_net_rx_ring *rx_ring)
unsigned int wr_idx, last_idx;
/* Move the empty entry to the end of the list */
- wr_idx = rx_ring->wr_p & (rx_ring->cnt - 1);
+ wr_idx = D_IDX(rx_ring, rx_ring->wr_p);
last_idx = rx_ring->cnt - 1;
rx_ring->rxbufs[wr_idx].dma_addr = rx_ring->rxbufs[last_idx].dma_addr;
rx_ring->rxbufs[wr_idx].frag = rx_ring->rxbufs[last_idx].frag;
@@ -1522,7 +1521,7 @@ nfp_net_tx_xdp_buf(struct nfp_net_dp *dp, struct nfp_net_rx_ring *rx_ring,
}
}
- wr_idx = tx_ring->wr_p & (tx_ring->cnt - 1);
+ wr_idx = D_IDX(tx_ring, tx_ring->wr_p);
/* Stash the soft descriptor of the head then initialize it */
txbuf = &tx_ring->txbufs[wr_idx];
@@ -1610,7 +1609,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
dma_addr_t new_dma_addr;
void *new_frag;
- idx = rx_ring->rd_p & (rx_ring->cnt - 1);
+ idx = D_IDX(rx_ring, rx_ring->rd_p);
rxd = &rx_ring->rxds[idx];
if (!(rxd->rxd.meta_len_dd & PCIE_DESC_RX_DD))
--
2.11.0
* [PATCH net-next 5/9] nfp: version independent support for chained RSS metadata
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Edwin Peer
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
From: Edwin Peer <edwin.peer@netronome.com>
ABI version 4 introduced metadata chaining. Using the ABI version to signal
metadata chaining precludes firmware that advertises new capabilities which
rely on prepended metadata from working on older kernels.
Capability bits are thus better suited to signalling the chained metadata
format. A new version of the RSS capability is introduced to distinguish
between the differing metadata formats for ABI versions other than 4.
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
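The detection rule can be sketched outside the driver as follows. The
DEMO_* names and both helpers are hypothetical; the bit positions are
copied from the nfp_net_ctrl.h hunk in this patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_CTRL_RSS		(0x1u << 17)	/* RSS (version 1) */
#define DEMO_CTRL_RSS2		(0x1u << 29)	/* RSS (version 2) */
#define DEMO_CTRL_CHAIN_META	DEMO_CTRL_RSS2

/* ABI 4 always chains metadata; other versions signal it by capability. */
static bool demo_chained_meta(unsigned int fw_major, uint32_t cap)
{
	return fw_major == 4 || (cap & DEMO_CTRL_CHAIN_META);
}

/* Outside ABI 4, chained metadata rules out the version 1 RSS format. */
static uint32_t demo_fixup_cap(unsigned int fw_major, uint32_t cap)
{
	if (demo_chained_meta(fw_major, cap) && fw_major != 4)
		cap &= ~DEMO_CTRL_RSS;
	return cap;
}
```

For example, firmware with ABI major 5 advertising both RSS bits ends
up with only RSS2 set, while ABI 4 firmware keeps its RSS bit as is.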
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 20 +++++++++++++-------
drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 6 +++++-
drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c | 12 ++++++------
3 files changed, 24 insertions(+), 14 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index ae32c0e8d6e6..cc5a2eaef156 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2201,7 +2201,7 @@ static int nfp_net_set_config_and_enable(struct nfp_net *nn)
new_ctrl = nn->dp.ctrl;
- if (nn->dp.ctrl & NFP_NET_CFG_CTRL_RSS) {
+ if (nn->dp.ctrl & NFP_NET_CFG_CTRL_RSS_ANY) {
nfp_net_rss_write_key(nn);
nfp_net_rss_write_itbl(nn);
nn_writel(nn, NFP_NET_CFG_RSS_CTRL, nn->rss_cfg);
@@ -3035,7 +3035,7 @@ void nfp_net_info(struct nfp_net *nn)
nn->fw_ver.resv, nn->fw_ver.class,
nn->fw_ver.major, nn->fw_ver.minor,
nn->max_mtu);
- nn_info(nn, "CAP: %#x %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ nn_info(nn, "CAP: %#x %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
nn->cap,
nn->cap & NFP_NET_CFG_CTRL_PROMISC ? "PROMISC " : "",
nn->cap & NFP_NET_CFG_CTRL_L2BC ? "L2BCFILT " : "",
@@ -3048,7 +3048,8 @@ void nfp_net_info(struct nfp_net *nn)
nn->cap & NFP_NET_CFG_CTRL_GATHER ? "GATHER " : "",
nn->cap & NFP_NET_CFG_CTRL_LSO ? "TSO1 " : "",
nn->cap & NFP_NET_CFG_CTRL_LSO2 ? "TSO2 " : "",
- nn->cap & NFP_NET_CFG_CTRL_RSS ? "RSS " : "",
+ nn->cap & NFP_NET_CFG_CTRL_RSS ? "RSS1 " : "",
+ nn->cap & NFP_NET_CFG_CTRL_RSS2 ? "RSS2 " : "",
nn->cap & NFP_NET_CFG_CTRL_L2SWITCH ? "L2SWITCH " : "",
nn->cap & NFP_NET_CFG_CTRL_MSIXAUTO ? "AUTOMASK " : "",
nn->cap & NFP_NET_CFG_CTRL_IRQMOD ? "IRQMOD " : "",
@@ -3202,14 +3203,18 @@ int nfp_net_netdev_init(struct net_device *netdev)
struct nfp_net *nn = netdev_priv(netdev);
int err;
- nn->dp.chained_metadata_format = nn->fw_ver.major > 3;
-
nn->dp.rx_dma_dir = DMA_FROM_DEVICE;
/* Get some of the read-only fields from the BAR */
nn->cap = nn_readl(nn, NFP_NET_CFG_CAP);
nn->max_mtu = nn_readl(nn, NFP_NET_CFG_MAX_MTU);
+ /* Chained metadata is signalled by capabilities except in version 4 */
+ nn->dp.chained_metadata_format = nn->fw_ver.major == 4 ||
+ nn->cap & NFP_NET_CFG_CTRL_CHAIN_META;
+ if (nn->dp.chained_metadata_format && nn->fw_ver.major != 4)
+ nn->cap &= ~NFP_NET_CFG_CTRL_RSS;
+
nfp_net_write_mac_addr(nn);
/* Determine RX packet/metadata boundary offset */
@@ -3259,10 +3264,11 @@ int nfp_net_netdev_init(struct net_device *netdev)
nn->dp.ctrl |= nn->cap & NFP_NET_CFG_CTRL_LSO2 ?:
NFP_NET_CFG_CTRL_LSO;
}
- if (nn->cap & NFP_NET_CFG_CTRL_RSS) {
+ if (nn->cap & NFP_NET_CFG_CTRL_RSS_ANY) {
netdev->hw_features |= NETIF_F_RXHASH;
nfp_net_rss_init(nn);
- nn->dp.ctrl |= NFP_NET_CFG_CTRL_RSS;
+ nn->dp.ctrl |= nn->cap & NFP_NET_CFG_CTRL_RSS2 ?:
+ NFP_NET_CFG_CTRL_RSS;
}
if (nn->cap & NFP_NET_CFG_CTRL_VXLAN &&
nn->cap & NFP_NET_CFG_CTRL_NVGRE) {
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
index 1575e8fdb541..a049c5d6839d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
@@ -121,7 +121,7 @@
#define NFP_NET_CFG_CTRL_GATHER (0x1 << 9) /* Gather DMA */
#define NFP_NET_CFG_CTRL_LSO (0x1 << 10) /* LSO/TSO (version 1) */
#define NFP_NET_CFG_CTRL_RINGCFG (0x1 << 16) /* Ring runtime changes */
-#define NFP_NET_CFG_CTRL_RSS (0x1 << 17) /* RSS */
+#define NFP_NET_CFG_CTRL_RSS (0x1 << 17) /* RSS (version 1) */
#define NFP_NET_CFG_CTRL_IRQMOD (0x1 << 18) /* Interrupt moderation */
#define NFP_NET_CFG_CTRL_RINGPRIO (0x1 << 19) /* Ring priorities */
#define NFP_NET_CFG_CTRL_MSIXAUTO (0x1 << 20) /* MSI-X auto-masking */
@@ -132,9 +132,13 @@
#define NFP_NET_CFG_CTRL_NVGRE (0x1 << 25) /* NVGRE tunnel support */
#define NFP_NET_CFG_CTRL_BPF (0x1 << 27) /* BPF offload capable */
#define NFP_NET_CFG_CTRL_LSO2 (0x1 << 28) /* LSO/TSO (version 2) */
+#define NFP_NET_CFG_CTRL_RSS2 (0x1 << 29) /* RSS (version 2) */
#define NFP_NET_CFG_CTRL_LSO_ANY (NFP_NET_CFG_CTRL_LSO | \
NFP_NET_CFG_CTRL_LSO2)
+#define NFP_NET_CFG_CTRL_RSS_ANY (NFP_NET_CFG_CTRL_RSS | \
+ NFP_NET_CFG_CTRL_RSS2)
+#define NFP_NET_CFG_CTRL_CHAIN_META NFP_NET_CFG_CTRL_RSS2
#define NFP_NET_CFG_UPDATE 0x0004
#define NFP_NET_CFG_UPDATE_GEN (0x1 << 0) /* General update */
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
index abbb47e60cc3..70bb0a0152b9 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ethtool.c
@@ -496,7 +496,7 @@ static int nfp_net_get_rss_hash_opts(struct nfp_net *nn,
cmd->data = 0;
- if (!(nn->cap & NFP_NET_CFG_CTRL_RSS))
+ if (!(nn->cap & NFP_NET_CFG_CTRL_RSS_ANY))
return -EOPNOTSUPP;
nfp_rss_flag = ethtool_flow_to_nfp_flag(cmd->flow_type);
@@ -533,7 +533,7 @@ static int nfp_net_set_rss_hash_opt(struct nfp_net *nn,
u32 nfp_rss_flag;
int err;
- if (!(nn->cap & NFP_NET_CFG_CTRL_RSS))
+ if (!(nn->cap & NFP_NET_CFG_CTRL_RSS_ANY))
return -EOPNOTSUPP;
/* RSS only supports IP SA/DA and L4 src/dst ports */
@@ -595,7 +595,7 @@ static u32 nfp_net_get_rxfh_indir_size(struct net_device *netdev)
{
struct nfp_net *nn = netdev_priv(netdev);
- if (!(nn->cap & NFP_NET_CFG_CTRL_RSS))
+ if (!(nn->cap & NFP_NET_CFG_CTRL_RSS_ANY))
return 0;
return ARRAY_SIZE(nn->rss_itbl);
@@ -605,7 +605,7 @@ static u32 nfp_net_get_rxfh_key_size(struct net_device *netdev)
{
struct nfp_net *nn = netdev_priv(netdev);
- if (!(nn->cap & NFP_NET_CFG_CTRL_RSS))
+ if (!(nn->cap & NFP_NET_CFG_CTRL_RSS_ANY))
return -EOPNOTSUPP;
return nfp_net_rss_key_sz(nn);
@@ -617,7 +617,7 @@ static int nfp_net_get_rxfh(struct net_device *netdev, u32 *indir, u8 *key,
struct nfp_net *nn = netdev_priv(netdev);
int i;
- if (!(nn->cap & NFP_NET_CFG_CTRL_RSS))
+ if (!(nn->cap & NFP_NET_CFG_CTRL_RSS_ANY))
return -EOPNOTSUPP;
if (indir)
@@ -641,7 +641,7 @@ static int nfp_net_set_rxfh(struct net_device *netdev,
struct nfp_net *nn = netdev_priv(netdev);
int i;
- if (!(nn->cap & NFP_NET_CFG_CTRL_RSS) ||
+ if (!(nn->cap & NFP_NET_CFG_CTRL_RSS_ANY) ||
!(hfunc == ETH_RSS_HASH_NO_CHANGE || hfunc == nn->rss_hfunc))
return -EOPNOTSUPP;
--
2.11.0
* [PATCH net-next 4/9] nfp: don't assume RSS and IRQ moderation are always enabled
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
Even if the capabilities for RSS and IRQ moderation are present, we
may not have initialized them for the control vNIC. Depend on the
selected features mask (ctrl) rather than capabilities (cap) to
determine which features should be enabled.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
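A small sketch of the cap/ctrl distinction. The control bit positions
match nfp_net_ctrl.h as quoted elsewhere in this series; the
DEMO_UPDATE_* values are made up for the example and are not the
driver's real update bits, and demo_pending_updates() is a
hypothetical helper.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CTRL_RSS		(0x1u << 17)
#define DEMO_CTRL_IRQMOD	(0x1u << 18)

/* Assumed values, for illustration only. */
#define DEMO_UPDATE_RSS		(0x1u << 0)
#define DEMO_UPDATE_IRQMOD	(0x1u << 1)

/* Decide which updates to request from what was selected, not advertised. */
static uint32_t demo_pending_updates(uint32_t ctrl)
{
	uint32_t update = 0;

	if (ctrl & DEMO_CTRL_RSS)
		update |= DEMO_UPDATE_RSS;
	if (ctrl & DEMO_CTRL_IRQMOD)
		update |= DEMO_UPDATE_IRQMOD;

	return update;
}
```

A vNIC whose cap advertises both features but whose ctrl has neither
selected requests no RSS or IRQ moderation update at all.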
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 5e8049a84d16..ae32c0e8d6e6 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -2201,17 +2201,15 @@ static int nfp_net_set_config_and_enable(struct nfp_net *nn)
new_ctrl = nn->dp.ctrl;
- if (nn->cap & NFP_NET_CFG_CTRL_RSS) {
+ if (nn->dp.ctrl & NFP_NET_CFG_CTRL_RSS) {
nfp_net_rss_write_key(nn);
nfp_net_rss_write_itbl(nn);
nn_writel(nn, NFP_NET_CFG_RSS_CTRL, nn->rss_cfg);
update |= NFP_NET_CFG_UPDATE_RSS;
}
- if (nn->cap & NFP_NET_CFG_CTRL_IRQMOD) {
+ if (nn->dp.ctrl & NFP_NET_CFG_CTRL_IRQMOD) {
nfp_net_coalesce_write_cfg(nn);
-
- new_ctrl |= NFP_NET_CFG_CTRL_IRQMOD;
update |= NFP_NET_CFG_UPDATE_IRQMOD;
}
--
2.11.0
* [PATCH net-next 9/9] nfp: eliminate an if statement in calculation of completed frames
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Jakub Kicinski
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
Given that our ring sizes are always a power of 2, we can simplify the
calculation of the number of completed TX descriptors by using masking
instead of an if statement based on whether the index has wrapped
or not.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
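The equivalence of the two forms can be checked with a short
userspace sketch. todo_branch() and todo_mask() are hypothetical
helper names; they assume both read pointers are already within
[0, cnt) and that cnt is a power of 2.

```c
#include <assert.h>
#include <stdint.h>

/* Old form: branch on whether the hardware read pointer has wrapped. */
static uint32_t todo_branch(uint32_t hw_rd, uint32_t sw_rd, uint32_t cnt)
{
	if (hw_rd > sw_rd)
		return hw_rd - sw_rd;
	return hw_rd + cnt - sw_rd;
}

/* New form: add the ring size, subtract, and mask back into range. */
static uint32_t todo_mask(uint32_t hw_rd, uint32_t sw_rd, uint32_t cnt)
{
	assert(cnt && !(cnt & (cnt - 1)));
	return (hw_rd + cnt - sw_rd) & (cnt - 1);
}
```

The two differ only when the pointers are equal (the branch form
yields cnt, the mask form yields 0). Both callers return before the
calculation in that case, so the difference is not observable.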
drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 10 ++--------
1 file changed, 2 insertions(+), 8 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index c64514f8ee65..da83e17b8b20 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -940,10 +940,7 @@ static void nfp_net_tx_complete(struct nfp_net_tx_ring *tx_ring)
if (qcp_rd_p == tx_ring->qcp_rd_p)
return;
- if (qcp_rd_p > tx_ring->qcp_rd_p)
- todo = qcp_rd_p - tx_ring->qcp_rd_p;
- else
- todo = qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p;
+ todo = D_IDX(tx_ring, qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p);
while (todo--) {
idx = D_IDX(tx_ring, tx_ring->rd_p++);
@@ -1014,10 +1011,7 @@ static bool nfp_net_xdp_complete(struct nfp_net_tx_ring *tx_ring)
if (qcp_rd_p == tx_ring->qcp_rd_p)
return true;
- if (qcp_rd_p > tx_ring->qcp_rd_p)
- todo = qcp_rd_p - tx_ring->qcp_rd_p;
- else
- todo = qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p;
+ todo = D_IDX(tx_ring, qcp_rd_p + tx_ring->cnt - tx_ring->qcp_rd_p);
done_all = todo <= NFP_NET_XDP_MAX_COMPLETE;
todo = min(todo, NFP_NET_XDP_MAX_COMPLETE);
--
2.11.0
* [PATCH net-next 3/9] nfp: support LSO2 capability
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Edwin Peer, Jakub Kicinski
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
From: Edwin Peer <edwin.peer@netronome.com>
Firmware advertising the LSO2 capability uses driver-provided L3 and L4
offsets in order to avoid parsing packet headers in the TX path. The vlan
field in struct nfp_net_tx_desc is repurposed to carry these offsets, which
makes TXVLAN and LSO2 mutually exclusive configurations.
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
---
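The "cap & LSO2 ?: LSO" expressions in this patch rely on the GNU C
conditional with an omitted middle operand. A portable sketch of the
same selection is below; the DEMO_* names and helpers are
hypothetical, with bit positions copied from nfp_net_ctrl.h, and the
caller is assumed to have already checked that some LSO capability is
present.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_CTRL_LSO		(0x1u << 10)	/* LSO/TSO (version 1) */
#define DEMO_CTRL_LSO2		(0x1u << 28)	/* LSO/TSO (version 2) */
#define DEMO_CTRL_LSO_ANY	(DEMO_CTRL_LSO | DEMO_CTRL_LSO2)

/* Prefer LSO2 when the firmware advertises it, else fall back to LSO. */
static uint32_t demo_lso_ctrl_bit(uint32_t cap)
{
	return (cap & DEMO_CTRL_LSO2) ? DEMO_CTRL_LSO2 : DEMO_CTRL_LSO;
}

/* Enable exactly one LSO version, or clear both when disabling. */
static uint32_t demo_set_tso(uint32_t cap, uint32_t ctrl, int enable)
{
	if (enable)
		return ctrl | demo_lso_ctrl_bit(cap);
	return ctrl & ~DEMO_CTRL_LSO_ANY;
}
```

Clearing with the _ANY mask on disable means the code does not need
to remember which of the two versions was enabled earlier.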
drivers/net/ethernet/netronome/nfp/nfp_net.h | 9 +++--
.../net/ethernet/netronome/nfp/nfp_net_common.c | 38 ++++++++++++++--------
drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 7 +++-
3 files changed, 38 insertions(+), 16 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index 6bad11e5b845..c6b7141dc50d 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -155,8 +155,13 @@ struct nfp_net_tx_desc {
__le16 mss; /* MSS to be used for LSO */
u8 lso_hdrlen; /* LSO, TCP payload offset */
u8 flags; /* TX Flags, see @PCIE_DESC_TX_* */
-
- __le16 vlan; /* VLAN tag to add if indicated */
+ union {
+ struct {
+ u8 l3_offset; /* L3 header offset */
+ u8 l4_offset; /* L4 header offset */
+ };
+ __le16 vlan; /* VLAN tag to add if indicated */
+ };
__le16 data_len; /* Length of frame + meta data */
} __packed;
__le32 vals[4];
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index 0cebe9098451..5e8049a84d16 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -667,11 +667,16 @@ static void nfp_net_tx_tso(struct nfp_net_r_vector *r_vec,
if (!skb_is_gso(skb))
return;
- if (!skb->encapsulation)
+ if (!skb->encapsulation) {
+ txd->l3_offset = skb_network_offset(skb);
+ txd->l4_offset = skb_transport_offset(skb);
hdrlen = skb_transport_offset(skb) + tcp_hdrlen(skb);
- else
+ } else {
+ txd->l3_offset = skb_inner_network_offset(skb);
+ txd->l4_offset = skb_inner_transport_offset(skb);
hdrlen = skb_inner_transport_header(skb) - skb->data +
inner_tcp_hdrlen(skb);
+ }
txbuf->pkt_cnt = skb_shinfo(skb)->gso_segs;
txbuf->real_len += hdrlen * (txbuf->pkt_cnt - 1);
@@ -825,10 +830,9 @@ static int nfp_net_tx(struct sk_buff *skb, struct net_device *netdev)
txd->mss = 0;
txd->lso_hdrlen = 0;
+ /* Do not reorder - tso may adjust pkt cnt, vlan may override fields */
nfp_net_tx_tso(r_vec, txbuf, txd, skb);
-
nfp_net_tx_csum(dp, r_vec, txbuf, txd, skb);
-
if (skb_vlan_tag_present(skb) && dp->ctrl & NFP_NET_CFG_CTRL_TXVLAN) {
txd->flags |= PCIE_DESC_TX_VLAN;
txd->vlan = cpu_to_le16(skb_vlan_tag_get(skb));
@@ -2724,9 +2728,10 @@ static int nfp_net_set_features(struct net_device *netdev,
if (changed & (NETIF_F_TSO | NETIF_F_TSO6)) {
if (features & (NETIF_F_TSO | NETIF_F_TSO6))
- new_ctrl |= NFP_NET_CFG_CTRL_LSO;
+ new_ctrl |= nn->cap & NFP_NET_CFG_CTRL_LSO2 ?:
+ NFP_NET_CFG_CTRL_LSO;
else
- new_ctrl &= ~NFP_NET_CFG_CTRL_LSO;
+ new_ctrl &= ~NFP_NET_CFG_CTRL_LSO_ANY;
}
if (changed & NETIF_F_HW_VLAN_CTAG_RX) {
@@ -3032,7 +3037,7 @@ void nfp_net_info(struct nfp_net *nn)
nn->fw_ver.resv, nn->fw_ver.class,
nn->fw_ver.major, nn->fw_ver.minor,
nn->max_mtu);
- nn_info(nn, "CAP: %#x %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ nn_info(nn, "CAP: %#x %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
nn->cap,
nn->cap & NFP_NET_CFG_CTRL_PROMISC ? "PROMISC " : "",
nn->cap & NFP_NET_CFG_CTRL_L2BC ? "L2BCFILT " : "",
@@ -3043,7 +3048,8 @@ void nfp_net_info(struct nfp_net *nn)
nn->cap & NFP_NET_CFG_CTRL_TXVLAN ? "TXVLAN " : "",
nn->cap & NFP_NET_CFG_CTRL_SCATTER ? "SCATTER " : "",
nn->cap & NFP_NET_CFG_CTRL_GATHER ? "GATHER " : "",
- nn->cap & NFP_NET_CFG_CTRL_LSO ? "TSO " : "",
+ nn->cap & NFP_NET_CFG_CTRL_LSO ? "TSO1 " : "",
+ nn->cap & NFP_NET_CFG_CTRL_LSO2 ? "TSO2 " : "",
nn->cap & NFP_NET_CFG_CTRL_RSS ? "RSS " : "",
nn->cap & NFP_NET_CFG_CTRL_L2SWITCH ? "L2SWITCH " : "",
nn->cap & NFP_NET_CFG_CTRL_MSIXAUTO ? "AUTOMASK " : "",
@@ -3249,9 +3255,11 @@ int nfp_net_netdev_init(struct net_device *netdev)
netdev->hw_features |= NETIF_F_SG;
nn->dp.ctrl |= NFP_NET_CFG_CTRL_GATHER;
}
- if ((nn->cap & NFP_NET_CFG_CTRL_LSO) && nn->fw_ver.major > 2) {
+ if ((nn->cap & NFP_NET_CFG_CTRL_LSO && nn->fw_ver.major > 2) ||
+ nn->cap & NFP_NET_CFG_CTRL_LSO2) {
netdev->hw_features |= NETIF_F_TSO | NETIF_F_TSO6;
- nn->dp.ctrl |= NFP_NET_CFG_CTRL_LSO;
+ nn->dp.ctrl |= nn->cap & NFP_NET_CFG_CTRL_LSO2 ?:
+ NFP_NET_CFG_CTRL_LSO;
}
if (nn->cap & NFP_NET_CFG_CTRL_RSS) {
netdev->hw_features |= NETIF_F_RXHASH;
@@ -3275,8 +3283,12 @@ int nfp_net_netdev_init(struct net_device *netdev)
nn->dp.ctrl |= NFP_NET_CFG_CTRL_RXVLAN;
}
if (nn->cap & NFP_NET_CFG_CTRL_TXVLAN) {
- netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
- nn->dp.ctrl |= NFP_NET_CFG_CTRL_TXVLAN;
+ if (nn->cap & NFP_NET_CFG_CTRL_LSO2) {
+ nn_warn(nn, "Device advertises both TSO2 and TXVLAN. Refusing to enable TXVLAN.\n");
+ } else {
+ netdev->hw_features |= NETIF_F_HW_VLAN_CTAG_TX;
+ nn->dp.ctrl |= NFP_NET_CFG_CTRL_TXVLAN;
+ }
}
netdev->features = netdev->hw_features;
@@ -3286,7 +3298,7 @@ int nfp_net_netdev_init(struct net_device *netdev)
/* Advertise but disable TSO by default. */
netdev->features &= ~(NETIF_F_TSO | NETIF_F_TSO6);
- nn->dp.ctrl &= ~NFP_NET_CFG_CTRL_LSO;
+ nn->dp.ctrl &= ~NFP_NET_CFG_CTRL_LSO_ANY;
/* Allow L2 Broadcast and Multicast through by default, if supported */
if (nn->cap & NFP_NET_CFG_CTRL_L2BC)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
index d04ccc9f6116..1575e8fdb541 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
@@ -119,7 +119,7 @@
#define NFP_NET_CFG_CTRL_TXVLAN (0x1 << 7) /* Enable VLAN insert */
#define NFP_NET_CFG_CTRL_SCATTER (0x1 << 8) /* Scatter DMA */
#define NFP_NET_CFG_CTRL_GATHER (0x1 << 9) /* Gather DMA */
-#define NFP_NET_CFG_CTRL_LSO (0x1 << 10) /* LSO/TSO */
+#define NFP_NET_CFG_CTRL_LSO (0x1 << 10) /* LSO/TSO (version 1) */
#define NFP_NET_CFG_CTRL_RINGCFG (0x1 << 16) /* Ring runtime changes */
#define NFP_NET_CFG_CTRL_RSS (0x1 << 17) /* RSS */
#define NFP_NET_CFG_CTRL_IRQMOD (0x1 << 18) /* Interrupt moderation */
@@ -131,6 +131,11 @@
#define NFP_NET_CFG_CTRL_VXLAN (0x1 << 24) /* VXLAN tunnel support */
#define NFP_NET_CFG_CTRL_NVGRE (0x1 << 25) /* NVGRE tunnel support */
#define NFP_NET_CFG_CTRL_BPF (0x1 << 27) /* BPF offload capable */
+#define NFP_NET_CFG_CTRL_LSO2 (0x1 << 28) /* LSO/TSO (version 2) */
+
+#define NFP_NET_CFG_CTRL_LSO_ANY (NFP_NET_CFG_CTRL_LSO | \
+ NFP_NET_CFG_CTRL_LSO2)
+
#define NFP_NET_CFG_UPDATE 0x0004
#define NFP_NET_CFG_UPDATE_GEN (0x1 << 0) /* General update */
#define NFP_NET_CFG_UPDATE_RING (0x1 << 1) /* Ring config change */
--
2.11.0
* [PATCH net-next 6/9] nfp: add CHECKSUM_COMPLETE support
From: Jakub Kicinski @ 2017-05-16 0:55 UTC
To: netdev; +Cc: oss-drivers, Jakub Kicinski, Edwin Peer
In-Reply-To: <20170516005523.26124-1-jakub.kicinski@netronome.com>
Introduce NFP_NET_CFG_CTRL_CSUM_COMPLETE capability and implement parsing
of CHECKSUM_COMPLETE metadata.
Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
---
drivers/net/ethernet/netronome/nfp/nfp_net.h | 4 ++-
.../net/ethernet/netronome/nfp/nfp_net_common.c | 35 +++++++++++++++++-----
drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h | 7 ++++-
3 files changed, 36 insertions(+), 10 deletions(-)
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
index c6b7141dc50d..acd9811d08d1 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
@@ -292,9 +292,11 @@ struct nfp_net_rx_desc {
#define NFP_NET_META_FIELD_MASK GENMASK(NFP_NET_META_FIELD_SIZE - 1, 0)
struct nfp_meta_parsed {
- u32 hash_type;
+ u8 hash_type;
+ u8 csum_type;
u32 hash;
u32 mark;
+ __wsum csum;
};
struct nfp_net_rx_hash {
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
index cc5a2eaef156..d640b3331741 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_common.c
@@ -1354,17 +1354,28 @@ static int nfp_net_rx_csum_has_errors(u16 flags)
* @dp: NFP Net data path struct
* @r_vec: per-ring structure
* @rxd: Pointer to RX descriptor
+ * @meta: Parsed metadata prepend
* @skb: Pointer to SKB
*/
static void nfp_net_rx_csum(struct nfp_net_dp *dp,
struct nfp_net_r_vector *r_vec,
- struct nfp_net_rx_desc *rxd, struct sk_buff *skb)
+ struct nfp_net_rx_desc *rxd,
+ struct nfp_meta_parsed *meta, struct sk_buff *skb)
{
skb_checksum_none_assert(skb);
if (!(dp->netdev->features & NETIF_F_RXCSUM))
return;
+ if (meta->csum_type) {
+ skb->ip_summed = meta->csum_type;
+ skb->csum = meta->csum;
+ u64_stats_update_begin(&r_vec->rx_sync);
+ r_vec->hw_csum_rx_ok++;
+ u64_stats_update_end(&r_vec->rx_sync);
+ return;
+ }
+
if (nfp_net_rx_csum_has_errors(le16_to_cpu(rxd->rxd.flags))) {
u64_stats_update_begin(&r_vec->rx_sync);
r_vec->hw_csum_rx_error++;
@@ -1449,6 +1460,12 @@ nfp_net_parse_meta(struct net_device *netdev, struct nfp_meta_parsed *meta,
meta->mark = get_unaligned_be32(data);
data += 4;
break;
+ case NFP_NET_META_CSUM:
+ meta->csum_type = CHECKSUM_COMPLETE;
+ meta->csum =
+ (__force __wsum)__get_unaligned_cpu32(data);
+ data += 4;
+ break;
default:
return NULL;
}
@@ -1712,7 +1729,7 @@ static int nfp_net_rx(struct nfp_net_rx_ring *rx_ring, int budget)
skb_record_rx_queue(skb, rx_ring->idx);
skb->protocol = eth_type_trans(skb, dp->netdev);
- nfp_net_rx_csum(dp, r_vec, rxd, skb);
+ nfp_net_rx_csum(dp, r_vec, rxd, &meta, skb);
if (rxd->rxd.flags & PCIE_DESC_RX_VLAN)
__vlan_hwaccel_put_tag(skb, htons(ETH_P_8021Q),
@@ -2712,9 +2729,9 @@ static int nfp_net_set_features(struct net_device *netdev,
if (changed & NETIF_F_RXCSUM) {
if (features & NETIF_F_RXCSUM)
- new_ctrl |= NFP_NET_CFG_CTRL_RXCSUM;
+ new_ctrl |= nn->cap & NFP_NET_CFG_CTRL_RXCSUM_ANY;
else
- new_ctrl &= ~NFP_NET_CFG_CTRL_RXCSUM;
+ new_ctrl &= ~NFP_NET_CFG_CTRL_RXCSUM_ANY;
}
if (changed & (NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM)) {
@@ -3035,7 +3052,7 @@ void nfp_net_info(struct nfp_net *nn)
nn->fw_ver.resv, nn->fw_ver.class,
nn->fw_ver.major, nn->fw_ver.minor,
nn->max_mtu);
- nn_info(nn, "CAP: %#x %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
+ nn_info(nn, "CAP: %#x %s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s%s\n",
nn->cap,
nn->cap & NFP_NET_CFG_CTRL_PROMISC ? "PROMISC " : "",
nn->cap & NFP_NET_CFG_CTRL_L2BC ? "L2BCFILT " : "",
@@ -3055,7 +3072,9 @@ void nfp_net_info(struct nfp_net *nn)
nn->cap & NFP_NET_CFG_CTRL_IRQMOD ? "IRQMOD " : "",
nn->cap & NFP_NET_CFG_CTRL_VXLAN ? "VXLAN " : "",
nn->cap & NFP_NET_CFG_CTRL_NVGRE ? "NVGRE " : "",
- nfp_net_ebpf_capable(nn) ? "BPF " : "");
+ nfp_net_ebpf_capable(nn) ? "BPF " : "",
+ nn->cap & NFP_NET_CFG_CTRL_CSUM_COMPLETE ?
+ "RXCSUM_COMPLETE " : "");
}
/**
@@ -3246,9 +3265,9 @@ int nfp_net_netdev_init(struct net_device *netdev)
* supported. By default we enable most features.
*/
netdev->hw_features = NETIF_F_HIGHDMA;
- if (nn->cap & NFP_NET_CFG_CTRL_RXCSUM) {
+ if (nn->cap & NFP_NET_CFG_CTRL_RXCSUM_ANY) {
netdev->hw_features |= NETIF_F_RXCSUM;
- nn->dp.ctrl |= NFP_NET_CFG_CTRL_RXCSUM;
+ nn->dp.ctrl |= nn->cap & NFP_NET_CFG_CTRL_RXCSUM_ANY;
}
if (nn->cap & NFP_NET_CFG_CTRL_TXCSUM) {
netdev->hw_features |= NETIF_F_IP_CSUM | NETIF_F_IPV6_CSUM;
diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
index a049c5d6839d..df75b8dc3617 100644
--- a/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
+++ b/drivers/net/ethernet/netronome/nfp/nfp_net_ctrl.h
@@ -71,6 +71,7 @@
#define NFP_NET_META_FIELD_SIZE 4
#define NFP_NET_META_HASH 1 /* next field carries hash type */
#define NFP_NET_META_MARK 2
+#define NFP_NET_META_CSUM 6 /* checksum complete type */
/**
* Hash type pre-pended when a RSS hash was computed
@@ -133,12 +134,16 @@
#define NFP_NET_CFG_CTRL_BPF (0x1 << 27) /* BPF offload capable */
#define NFP_NET_CFG_CTRL_LSO2 (0x1 << 28) /* LSO/TSO (version 2) */
#define NFP_NET_CFG_CTRL_RSS2 (0x1 << 29) /* RSS (version 2) */
+#define NFP_NET_CFG_CTRL_CSUM_COMPLETE (0x1 << 30) /* Checksum complete */
#define NFP_NET_CFG_CTRL_LSO_ANY (NFP_NET_CFG_CTRL_LSO | \
NFP_NET_CFG_CTRL_LSO2)
#define NFP_NET_CFG_CTRL_RSS_ANY (NFP_NET_CFG_CTRL_RSS | \
NFP_NET_CFG_CTRL_RSS2)
-#define NFP_NET_CFG_CTRL_CHAIN_META NFP_NET_CFG_CTRL_RSS2
+#define NFP_NET_CFG_CTRL_RXCSUM_ANY (NFP_NET_CFG_CTRL_RXCSUM | \
+ NFP_NET_CFG_CTRL_CSUM_COMPLETE)
+#define NFP_NET_CFG_CTRL_CHAIN_META (NFP_NET_CFG_CTRL_RSS2 | \
+ NFP_NET_CFG_CTRL_CSUM_COMPLETE)
#define NFP_NET_CFG_UPDATE 0x0004
#define NFP_NET_CFG_UPDATE_GEN (0x1 << 0) /* General update */
--
2.11.0
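
Editor's sketch (not part of the patch): a user-space model of how a chained metadata prepend carrying the new NFP_NET_META_CSUM field might be walked. The field constants come from the hunks above. The leading 32-bit type word, the loop shape and the length checks are assumptions inferred from how the visible cases advance `data`; parse_meta(), struct meta_parsed, read_be32() and read_cpu32() are hypothetical names. As in the patch, the checksum is read in CPU byte order while hash and mark are big-endian.

```c
#include <assert.h>	/* for the checks in the usage note */
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NFP_NET_META_FIELD_SIZE	4
#define NFP_NET_META_FIELD_MASK	0xfu
#define NFP_NET_META_HASH	1	/* next field carries hash type */
#define NFP_NET_META_MARK	2
#define NFP_NET_META_CSUM	6	/* checksum complete type */

struct meta_parsed {
	uint8_t hash_type;
	uint8_t csum_complete;	/* stands in for csum_type == CHECKSUM_COMPLETE */
	uint32_t hash;
	uint32_t mark;
	uint32_t csum;
};

static uint32_t read_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

static uint32_t read_cpu32(const uint8_t *p)
{
	uint32_t v;

	memcpy(&v, p, sizeof(v));	/* unaligned-safe, native byte order */
	return v;
}

/* Returns a pointer just past the metadata, or NULL on malformed input. */
static const uint8_t *parse_meta(struct meta_parsed *meta,
				 const uint8_t *data, size_t len)
{
	const uint8_t *end = data + len;
	uint32_t info;

	if (len < 4)
		return NULL;
	info = read_be32(data);	/* assumed: 4-bit type codes, lowest first */
	data += 4;

	while (info) {
		if (end - data < 4)
			return NULL;
		switch (info & NFP_NET_META_FIELD_MASK) {
		case NFP_NET_META_HASH:
			/* the following nibble carries the hash type */
			info >>= NFP_NET_META_FIELD_SIZE;
			meta->hash_type = info & NFP_NET_META_FIELD_MASK;
			meta->hash = read_be32(data);
			break;
		case NFP_NET_META_MARK:
			meta->mark = read_be32(data);
			break;
		case NFP_NET_META_CSUM:
			meta->csum_complete = 1;
			meta->csum = read_cpu32(data);
			break;
		default:
			return NULL;	/* unknown field type */
		}
		data += 4;
		info >>= NFP_NET_META_FIELD_SIZE;
	}
	return data;
}
```

Under these assumptions, a type word of 0x62 describes a mark field followed by a checksum field, so parse_meta() consumes 12 bytes in total and sets csum_complete.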
* Re: cxgb4 is broken in v4.12-rc1
From: Ganesh GR @ 2017-05-16 1:00 UTC (permalink / raw)
To: Logan Gunthorpe
Cc: David S. Miller, Stephen Bates, SWise OGC, netdev@vger.kernel.org
In-Reply-To: <f5966da1-44e7-ebf7-c5df-28f815a77424@deltatee.com>
Hi Logan,
Thanks for reporting the issue. I will try to reproduce it. By the way, what is the firmware version
on your setup?
Regards
Ganesh
From: Logan Gunthorpe <logang@deltatee.com>
Sent: Tuesday, May 16, 2017 4:06 AM
To: Ganesh GR
Cc: David S. Miller; Stephen Bates; SWise OGC; netdev@vger.kernel.org
Subject: BUG: cxgb4 is broken in v4.12-rc1
Hi,
With rc1 my T62100-LP-CR no longer functions correctly. Everything
appears fine but the link never goes into the UP state. I have one peer
with an older (functioning) kernel and the other peer on rc1.
I've bisected to find this is the offending commit:
3bb4858fd: cxgb4: avoid disabling FEC by default
I've also attached a bisect log.
Let me know if you need anything else.
Thanks,
Logan
* Re: cxgb4 is broken in v4.12-rc1
From: Logan Gunthorpe @ 2017-05-16 1:50 UTC (permalink / raw)
To: Ganesh GR
Cc: David S. Miller, Stephen Bates, SWise OGC, netdev@vger.kernel.org
In-Reply-To: <CY4PR12MB1432CA69BD6B7AE335B5DBB2C1E60@CY4PR12MB1432.namprd12.prod.outlook.com>
Hi,
Thanks for looking into it. The version information of the card is:
[ 5.235956] cxgb4 0000:07:00.4: Chelsio T62100-LP-CR rev 0
[ 5.235957] cxgb4 0000:07:00.4: S/N: PT51160053, P/N: 11012106004
[ 5.235959] cxgb4 0000:07:00.4: Firmware version: 1.16.29.4
[ 5.235960] cxgb4 0000:07:00.4: Bootstrap version: 255.255.255.255
[ 5.235961] cxgb4 0000:07:00.4: TP Microcode version: 0.1.23.2
Logan
* [PATCH net-next] tcp: internal implementation for pacing
From: Eric Dumazet @ 2017-05-16 3:43 UTC (permalink / raw)
To: David S . Miller
Cc: netdev, Eric Dumazet, Eric Dumazet, Neal Cardwell, Yuchung Cheng,
Soheil Hassas Yeganeh, Van Jacobson, Jerry Chu
BBR congestion control depends on pacing, and pacing is
currently handled by the sch_fq packet scheduler for performance reasons,
and also because implementing pacing with FQ was convenient to truly
avoid bursts.
However, there are many cases where this packet scheduler constraint
is not practical.
- Many Linux hosts are not focusing on handling thousands of TCP
  flows in the most efficient way.
- Some routers use fq_codel or other AQM, but still would like
  to use BBR for the few TCP flows they initiate/terminate.
This patch implements an automatic fallback to internal pacing.
Pacing is requested either by BBR or use of SO_MAX_PACING_RATE option.
If sch_fq happens to be in the egress path, pacing is delegated to
the qdisc, otherwise pacing is done by TCP itself.
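
Editor's sketch (not part of the patch): the delegation rule above amounts to a three-state machine on sk_pacing_status. The model below uses C11 atomics in user space instead of the kernel's cmpxchg(), smp_load_acquire() and smp_store_release(); struct toy_sock and the three helper names are hypothetical, and only the enum values are taken from the patch.

```c
#include <assert.h>	/* for the checks in the usage note */
#include <stdatomic.h>
#include <stdbool.h>

enum sk_pacing {
	SK_PACING_NONE		= 0,
	SK_PACING_NEEDED	= 1,
	SK_PACING_FQ		= 2,
};

/* Hypothetical stand-in for the one struct sock field that matters here. */
struct toy_sock {
	_Atomic unsigned int pacing_status;
};

/* BBR init or SO_MAX_PACING_RATE: only NONE -> NEEDED, never override FQ. */
static void request_pacing(struct toy_sock *sk)
{
	unsigned int expected = SK_PACING_NONE;

	atomic_compare_exchange_strong(&sk->pacing_status, &expected,
				       SK_PACING_NEEDED);
}

/* sch_fq saw a packet of this socket with rate enforcement enabled. */
static void fq_claims_pacing(struct toy_sock *sk)
{
	atomic_store_explicit(&sk->pacing_status, SK_PACING_FQ,
			      memory_order_release);
}

/* TCP paces by itself only while nobody else has claimed the job. */
static bool needs_internal_pacing(struct toy_sock *sk)
{
	return atomic_load_explicit(&sk->pacing_status,
				    memory_order_acquire) == SK_PACING_NEEDED;
}
```

In this model, once fq_claims_pacing() has run, a later request_pacing() leaves the status at SK_PACING_FQ, so needs_internal_pacing() stays false and the qdisc keeps doing the pacing.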
One advantage of pacing from the TCP stack is more precise RTT
estimations, and less work done from TX completion, since TCP Small
Queues limits are generally not hit. Setups with a single TX queue but
many CPUs might even benefit from this.
Note that unlike sch_fq, we do not take into account header sizes.
Taking care of these headers would add additional complexity for
no practical differences in behavior.
Some performance numbers using 800 TCP_STREAM flows rate limited to
~48 Mbit per second on 40Gbit NIC.
If MQ+pfifo_fast is used on the NIC :
$ sar -n DEV 1 5 | grep eth
14:48:44 eth0 725743.00 2932134.00 46776.76 4335184.68 0.00 0.00 1.00
14:48:45 eth0 725349.00 2932112.00 46751.86 4335158.90 0.00 0.00 0.00
14:48:46 eth0 725101.00 2931153.00 46735.07 4333748.63 0.00 0.00 0.00
14:48:47 eth0 725099.00 2931161.00 46735.11 4333760.44 0.00 0.00 1.00
14:48:48 eth0 725160.00 2931731.00 46738.88 4334606.07 0.00 0.00 0.00
Average: eth0 725290.40 2931658.20 46747.54 4334491.74 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
4 0 0 259825920 45644 2708324 0 0 21 2 247 98 0 0 100 0 0
4 0 0 259823744 45644 2708356 0 0 0 0 2400825 159843 0 19 81 0 0
0 0 0 259824208 45644 2708072 0 0 0 0 2407351 159929 0 19 81 0 0
1 0 0 259824592 45644 2708128 0 0 0 0 2405183 160386 0 19 80 0 0
1 0 0 259824272 45644 2707868 0 0 0 32 2396361 158037 0 19 81 0 0
Now use MQ+FQ :
lpaa23:~# echo fq >/proc/sys/net/core/default_qdisc
lpaa23:~# tc qdisc replace dev eth0 root mq
$ sar -n DEV 1 5 | grep eth
14:49:57 eth0 678614.00 2727930.00 43739.13 4033279.14 0.00 0.00 0.00
14:49:58 eth0 677620.00 2723971.00 43674.69 4027429.62 0.00 0.00 1.00
14:49:59 eth0 676396.00 2719050.00 43596.83 4020125.02 0.00 0.00 0.00
14:50:00 eth0 675197.00 2714173.00 43518.62 4012938.90 0.00 0.00 1.00
14:50:01 eth0 676388.00 2719063.00 43595.47 4020171.64 0.00 0.00 0.00
Average: eth0 676843.00 2720837.40 43624.95 4022788.86 0.00 0.00 0.40
$ vmstat 1 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
r b swpd free buff cache si so bi bo in cs us sy id wa st
2 0 0 259832240 46008 2710912 0 0 21 2 223 192 0 1 99 0 0
1 0 0 259832896 46008 2710744 0 0 0 0 1702206 198078 0 17 82 0 0
0 0 0 259830272 46008 2710596 0 0 0 0 1696340 197756 1 17 83 0 0
4 0 0 259829168 46024 2710584 0 0 16 0 1688472 197158 1 17 82 0 0
3 0 0 259830224 46024 2710408 0 0 0 0 1692450 197212 0 18 82 0 0
As expected, the number of interrupts per second is very different:
roughly 2.4M/s with MQ+pfifo_fast versus roughly 1.7M/s with MQ+FQ
(vmstat "in" column).
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Neal Cardwell <ncardwell@google.com>
Cc: Yuchung Cheng <ycheng@google.com>
Cc: Soheil Hassas Yeganeh <soheil@google.com>
Cc: Van Jacobson <vanj@google.com>
Cc: Jerry Chu <hkchu@google.com>
---
include/linux/tcp.h | 2 ++
include/net/sock.h | 8 +++++-
include/net/tcp.h | 3 ++
net/core/sock.c | 4 +++
net/ipv4/tcp_bbr.c | 9 +++---
net/ipv4/tcp_output.c | 80 +++++++++++++++++++++++++++++++++++++++++++++++++++
net/ipv4/tcp_timer.c | 3 ++
net/sched/sch_fq.c | 8 ++++++
8 files changed, 112 insertions(+), 5 deletions(-)
diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index b6d5adcee8fcb611de202993623cc80274d262e4..22854f0284347a3bb047709478525ee5a9dd9b36 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -293,6 +293,8 @@ struct tcp_sock {
u32 sacked_out; /* SACK'd packets */
u32 fackets_out; /* FACK'd packets */
+ struct hrtimer pacing_timer;
+
/* from STCP, retrans queue hinting */
struct sk_buff* lost_skb_hint;
struct sk_buff *retransmit_skb_hint;
diff --git a/include/net/sock.h b/include/net/sock.h
index f33e3d134e0b7f66329f2122d7acc8b396c1787b..f21e07563991fecb2b67e092617aef0d63954c86 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -396,7 +396,7 @@ struct sock {
__s32 sk_peek_off;
int sk_write_pending;
__u32 sk_dst_pending_confirm;
- /* Note: 32bit hole on 64bit arches */
+ u32 sk_pacing_status; /* see enum sk_pacing */
long sk_sndtimeo;
struct timer_list sk_timer;
__u32 sk_priority;
@@ -475,6 +475,12 @@ struct sock {
struct rcu_head sk_rcu;
};
+enum sk_pacing {
+ SK_PACING_NONE = 0,
+ SK_PACING_NEEDED = 1,
+ SK_PACING_FQ = 2,
+};
+
#define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
#define rcu_dereference_sk_user_data(sk) rcu_dereference(__sk_user_data((sk)))
diff --git a/include/net/tcp.h b/include/net/tcp.h
index 38a7427ae902e35973a8b7fa0e95ff602ede0e87..b4dc93dae98c2d175ccadce150083705d237555e 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -574,6 +574,7 @@ void tcp_fin(struct sock *sk);
void tcp_init_xmit_timers(struct sock *);
static inline void tcp_clear_xmit_timers(struct sock *sk)
{
+ hrtimer_cancel(&tcp_sk(sk)->pacing_timer);
inet_csk_clear_xmit_timers(sk);
}
@@ -1945,4 +1946,6 @@ static inline void tcp_listendrop(const struct sock *sk)
__NET_INC_STATS(sock_net(sk), LINUX_MIB_LISTENDROPS);
}
+enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer);
+
#endif /* _TCP_H */
diff --git a/net/core/sock.c b/net/core/sock.c
index e43e71d7856b385111cd4c4b1bd835a78c670c60..93d011e35b8349954db6918055c2f90ae473d254 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1041,6 +1041,10 @@ int sock_setsockopt(struct socket *sock, int level, int optname,
#endif
case SO_MAX_PACING_RATE:
+ if (val != ~0U)
+ cmpxchg(&sk->sk_pacing_status,
+ SK_PACING_NONE,
+ SK_PACING_NEEDED);
sk->sk_max_pacing_rate = val;
sk->sk_pacing_rate = min(sk->sk_pacing_rate,
sk->sk_max_pacing_rate);
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
index b89bce4c721eed530f5cfc725b759147b38cef42..92b045c72163def1c1d6aa0f2002760186aa5dc3 100644
--- a/net/ipv4/tcp_bbr.c
+++ b/net/ipv4/tcp_bbr.c
@@ -52,10 +52,9 @@
* There is a public e-mail list for discussing BBR development and testing:
* https://groups.google.com/forum/#!forum/bbr-dev
*
- * NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled,
- * since pacing is integral to the BBR design and implementation.
- * BBR without pacing would not function properly, and may incur unnecessary
- * high packet loss rates.
+ * NOTE: BBR might be used with the fq qdisc ("man tc-fq") with pacing enabled,
+ * otherwise TCP stack falls back to an internal pacing using one high
+ * resolution timer per TCP socket and may use more resources.
*/
#include <linux/module.h>
#include <net/tcp.h>
@@ -830,6 +829,8 @@ static void bbr_init(struct sock *sk)
bbr->cycle_idx = 0;
bbr_reset_lt_bw_sampling(sk);
bbr_reset_startup_mode(sk);
+
+ cmpxchg(&sk->sk_pacing_status, SK_PACING_NONE, SK_PACING_NEEDED);
}
static u32 bbr_sndbuf_expand(struct sock *sk)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 4858e190f6ac130c9441f58cb8944cc82bf67270..a32172d69a03cbe76b45ec3094222f6c3a73e27d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -904,6 +904,72 @@ void tcp_wfree(struct sk_buff *skb)
sk_free(sk);
}
+/* Note: Called under hard irq.
+ * We can not call TCP stack right away.
+ */
+enum hrtimer_restart tcp_pace_kick(struct hrtimer *timer)
+{
+ struct tcp_sock *tp = container_of(timer, struct tcp_sock, pacing_timer);
+ struct sock *sk = (struct sock *)tp;
+ unsigned long nval, oval;
+
+ for (oval = READ_ONCE(sk->sk_tsq_flags);; oval = nval) {
+ struct tsq_tasklet *tsq;
+ bool empty;
+
+ if (oval & TSQF_QUEUED)
+ break;
+
+ nval = (oval & ~TSQF_THROTTLED) | TSQF_QUEUED | TCPF_TSQ_DEFERRED;
+ nval = cmpxchg(&sk->sk_tsq_flags, oval, nval);
+ if (nval != oval)
+ continue;
+
+ if (!atomic_inc_not_zero(&sk->sk_wmem_alloc))
+ break;
+ /* queue this socket to tasklet queue */
+ tsq = this_cpu_ptr(&tsq_tasklet);
+ empty = list_empty(&tsq->head);
+ list_add(&tp->tsq_node, &tsq->head);
+ if (empty)
+ tasklet_schedule(&tsq->tasklet);
+ break;
+ }
+ return HRTIMER_NORESTART;
+}
+
+/* BBR congestion control needs pacing.
+ * Same remark for SO_MAX_PACING_RATE.
+ * sch_fq packet scheduler is efficiently handling pacing,
+ * but is not always installed/used.
+ * Return true if TCP stack should pace packets itself.
+ */
+static bool tcp_needs_internal_pacing(const struct sock *sk)
+{
+ return smp_load_acquire(&sk->sk_pacing_status) == SK_PACING_NEEDED;
+}
+
+static void tcp_internal_pacing(struct sock *sk, const struct sk_buff *skb)
+{
+ u64 len_ns;
+ u32 rate;
+
+ if (!tcp_needs_internal_pacing(sk))
+ return;
+ rate = sk->sk_pacing_rate;
+ if (!rate || rate == ~0U)
+ return;
+
+ /* Should account for header sizes as sch_fq does,
+ * but lets make things simple.
+ */
+ len_ns = (u64)skb->len * NSEC_PER_SEC;
+ do_div(len_ns, rate);
+ hrtimer_start(&tcp_sk(sk)->pacing_timer,
+ ktime_add_ns(ktime_get(), len_ns),
+ HRTIMER_MODE_ABS_PINNED);
+}
+
/* This routine actually transmits TCP packets queued in by
* tcp_do_sendmsg(). This is used by both the initial
* transmission and possible later retransmissions.
@@ -1034,6 +1100,7 @@ static int tcp_transmit_skb(struct sock *sk, struct sk_buff *skb, int clone_it,
if (skb->len != tcp_header_size) {
tcp_event_data_sent(tp, sk);
tp->data_segs_out += tcp_skb_pcount(skb);
+ tcp_internal_pacing(sk, skb);
}
if (after(tcb->end_seq, tp->snd_nxt) || tcb->seq == tcb->end_seq)
@@ -2086,6 +2153,12 @@ static int tcp_mtu_probe(struct sock *sk)
return -1;
}
+static bool tcp_pacing_check(const struct sock *sk)
+{
+ return tcp_needs_internal_pacing(sk) &&
+ hrtimer_active(&tcp_sk(sk)->pacing_timer);
+}
+
/* TCP Small Queues :
* Control number of packets in qdisc/devices to two packets / or ~1 ms.
* (These limits are doubled for retransmits)
@@ -2210,6 +2283,9 @@ static bool tcp_write_xmit(struct sock *sk, unsigned int mss_now, int nonagle,
while ((skb = tcp_send_head(sk))) {
unsigned int limit;
+ if (tcp_pacing_check(sk))
+ break;
+
tso_segs = tcp_init_tso_segs(skb, mss_now);
BUG_ON(!tso_segs);
@@ -2878,6 +2954,10 @@ void tcp_xmit_retransmit_queue(struct sock *sk)
if (skb == tcp_send_head(sk))
break;
+
+ if (tcp_pacing_check(sk))
+ break;
+
/* we could do better than to assign each time */
if (!hole)
tp->retransmit_skb_hint = skb;
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 14672543cf0bd27bc59976d5cec38d2d3bbcdd2c..86934bcf685a65ec3af3d22f1801ffa33eea76e2 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -710,4 +710,7 @@ void tcp_init_xmit_timers(struct sock *sk)
{
inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
&tcp_keepalive_timer);
+ hrtimer_init(&tcp_sk(sk)->pacing_timer, CLOCK_MONOTONIC,
+ HRTIMER_MODE_ABS_PINNED);
+ tcp_sk(sk)->pacing_timer.function = tcp_pace_kick;
}
diff --git a/net/sched/sch_fq.c b/net/sched/sch_fq.c
index b488721a0059adb24aea47240afa0164a6e467a9..147fde73a0f566e8f6a26718adf176ef3943afa0 100644
--- a/net/sched/sch_fq.c
+++ b/net/sched/sch_fq.c
@@ -390,9 +390,17 @@ static int fq_enqueue(struct sk_buff *skb, struct Qdisc *sch,
q->stat_tcp_retrans++;
qdisc_qstats_backlog_inc(sch, skb);
if (fq_flow_is_detached(f)) {
+ struct sock *sk = skb->sk;
+
fq_flow_add_tail(&q->new_flows, f);
if (time_after(jiffies, f->age + q->flow_refill_delay))
f->credit = max_t(u32, f->credit, q->quantum);
+ if (sk && q->rate_enable) {
+ if (unlikely(smp_load_acquire(&sk->sk_pacing_status) !=
+ SK_PACING_FQ))
+ smp_store_release(&sk->sk_pacing_status,
+ SK_PACING_FQ);
+ }
q->inactive_flows--;
}
--
2.13.0.303.g4ebf302169-goog
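
Editor's sketch (not part of the patch): the timer arithmetic in tcp_internal_pacing() reduces to one division. The user-space function below mirrors it, assuming the pacing rate is expressed in bytes per second; pacing_delay_ns() is a hypothetical name, and like the patch it ignores header sizes.

```c
#include <assert.h>	/* for the checks in the usage note */
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/*
 * Time to wait after sending len_bytes at rate_bytes_per_sec.
 * A rate of 0 or ~0U means "no pacing", as in the patch, so return 0.
 */
static uint64_t pacing_delay_ns(uint32_t len_bytes, uint32_t rate_bytes_per_sec)
{
	if (rate_bytes_per_sec == 0 || rate_bytes_per_sec == UINT32_MAX)
		return 0;

	/* Widen before multiplying so len * 1e9 cannot overflow 32 bits. */
	return (uint64_t)len_bytes * NSEC_PER_SEC / rate_bytes_per_sec;
}
```

As a worked example, a flow limited to 48 Mbit/s (6,000,000 bytes per second) that sends a 1500-byte skb would wait pacing_delay_ns(1500, 6000000) = 250000 ns, i.e. 250 us, before the next transmit. The pairing of that rate with a 1500-byte skb is illustrative and does not come from the benchmark above.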
* Re: [PATCH net-next] tcp: internal implementation for pacing
From: Soheil Hassas Yeganeh @ 2017-05-16 3:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, netdev, Eric Dumazet, Neal Cardwell,
Yuchung Cheng, Van Jacobson, Jerry Chu
In-Reply-To: <20170516034318.9913-1-edumazet@google.com>
On Mon, May 15, 2017 at 11:43 PM, Eric Dumazet <edumazet@google.com> wrote:
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
This is superb! Thank you, Eric!
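
For readers following the arithmetic in tcp_internal_pacing() quoted above, here is a minimal userspace sketch of the delay it arms the timer with. It assumes the pacing rate is expressed in bytes per second; pacing_delay_ns() is a hypothetical name for illustration, not a kernel function, and plain 64-bit division stands in for do_div().

```c
#include <stdint.h>

#define NSEC_PER_SEC 1000000000ULL

/* Userspace sketch of the delay computed in tcp_internal_pacing().
 * pacing_delay_ns() is a hypothetical helper, not part of the patch.
 * rate is assumed to be in bytes per second; 0 and ~0U are treated as
 * "no pacing", mirroring the early return in the quoted code.
 */
static uint64_t pacing_delay_ns(uint32_t skb_len, uint32_t rate)
{
	if (!rate || rate == ~0U)
		return 0;

	/* Time needed to send skb_len bytes at the given rate. */
	return (uint64_t)skb_len * NSEC_PER_SEC / rate;
}
```

With these assumptions, a 1500 byte packet at 1,500,000 bytes per second gives a delay of 1,000,000 ns, i.e. the next transmit is held back by one millisecond.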
^ permalink raw reply
* Re: [PATCH v4 1/4] can: m_can: move Message RAM initialization to function
From: Oliver Hartkopp @ 2017-05-16 3:51 UTC (permalink / raw)
To: Marc Kleine-Budde, Quentin Schulz, wg, mario.huettel
Cc: linux-can, netdev, linux-kernel, alexandre.belloni,
thomas.petazzoni
In-Reply-To: <4676456c-823c-d7db-2139-35c8229109f0@pengutronix.de>
On 05/15/2017 06:50 AM, Marc Kleine-Budde wrote:
> On 05/12/2017 08:37 AM, Quentin Schulz wrote:
>> Hi all,
>>
>> On 05/05/2017 15:50, Quentin Schulz wrote:
>>> To avoid possible ECC/parity checksum errors when reading an
>>> uninitialized buffer, the entire Message RAM is initialized when probing
>>> the driver. This initialization is done in the same function reading the
>>> Device Tree properties.
>>>
>>> This patch moves the RAM initialization to a separate function so it can
>>> be called separately from device initialization from Device Tree.
>>>
>>> Signed-off-by: Quentin Schulz <quentin.schulz@free-electrons.com>
>>
>> It's been a week since I sent this patch series. Any comments?
>
> Looks good, added to linux-can-next.
Isn't this a fix for linux-can instead?
At least it would make no sense to me to have the upgraded M_CAN driver
in Linux 4.12 without this fix.
Regards,
Oliver
^ permalink raw reply
* [PATCH net v2] net: x25: fix one potential use-after-free issue
From: linzhang @ 2017-05-16 3:52 UTC (permalink / raw)
To: andrew.hendry, davem; +Cc: nhorman, linux-x25, netdev, linux-kernel, linzhang
The function x25_init does not properly unregister related resources
in its error path. This results in a kernel oops if x25_init fails,
so add the proper unregister calls to the error path.
Also adjust the coding style and make x25_register_sysctl properly
return failure.
Signed-off-by: linzhang <xiaolou4617@gmail.com>
---
include/net/x25.h | 4 ++--
net/x25/af_x25.c | 34 +++++++++++++++++++++-------------
net/x25/sysctl_net_x25.c | 5 ++++-
3 files changed, 27 insertions(+), 16 deletions(-)
diff --git a/include/net/x25.h b/include/net/x25.h
index c383aa4..339820c 100644
--- a/include/net/x25.h
+++ b/include/net/x25.h
@@ -298,10 +298,10 @@ int x25_decode(struct sock *, struct sk_buff *, int *, int *, int *, int *,
/* sysctl_net_x25.c */
#ifdef CONFIG_SYSCTL
-void x25_register_sysctl(void);
+int x25_register_sysctl(void);
void x25_unregister_sysctl(void);
#else
-static inline void x25_register_sysctl(void) {};
+static inline int x25_register_sysctl(void) { return 0; }
static inline void x25_unregister_sysctl(void) {};
#endif /* CONFIG_SYSCTL */
diff --git a/net/x25/af_x25.c b/net/x25/af_x25.c
index 8b911c2..75c64de 100644
--- a/net/x25/af_x25.c
+++ b/net/x25/af_x25.c
@@ -1791,34 +1791,42 @@ void x25_kill_by_neigh(struct x25_neigh *nb)
static int __init x25_init(void)
{
- int rc = proto_register(&x25_proto, 0);
+ int rc;
- if (rc != 0)
+ rc = proto_register(&x25_proto, 0);
+ if (rc)
goto out;
rc = sock_register(&x25_family_ops);
- if (rc != 0)
- goto out_proto;
+ if (rc)
+ goto out_sock;
dev_add_pack(&x25_packet_type);
rc = register_netdevice_notifier(&x25_dev_notifier);
- if (rc != 0)
- goto out_sock;
+ if (rc)
+ goto out_dev;
- pr_info("Linux Version 0.2\n");
+ rc = x25_register_sysctl();
+ if (rc)
+ goto out_sysctl;
- x25_register_sysctl();
rc = x25_proc_init();
- if (rc != 0)
- goto out_dev;
+ if (rc)
+ goto out_proc;
+
+ pr_info("Linux Version 0.2\n");
+
out:
return rc;
-out_dev:
+out_proc:
+ x25_unregister_sysctl();
+out_sysctl:
unregister_netdevice_notifier(&x25_dev_notifier);
-out_sock:
+out_dev:
+ dev_remove_pack(&x25_packet_type);
sock_unregister(AF_X25);
-out_proto:
+out_sock:
proto_unregister(&x25_proto);
goto out;
}
diff --git a/net/x25/sysctl_net_x25.c b/net/x25/sysctl_net_x25.c
index a06dfe1..ba078c8 100644
--- a/net/x25/sysctl_net_x25.c
+++ b/net/x25/sysctl_net_x25.c
@@ -73,9 +73,12 @@
{ },
};
-void __init x25_register_sysctl(void)
+int __init x25_register_sysctl(void)
{
x25_table_header = register_net_sysctl(&init_net, "net/x25", x25_table);
+ if (!x25_table_header)
+ return -ENOMEM;
+ return 0;
}
void x25_unregister_sysctl(void)
--
1.8.3.1
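
The unwinding order used in the patch above can be sketched in userspace as follows. Everything here (step_register(), step_unregister(), demo_init(), the held[] array and fail_at) is a hypothetical stand-in for the real registration calls; the point is only that each label is named after the step that failed and undoes the steps that had already succeeded, in reverse order.

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the registration steps in x25_init(). */
enum { STEP_PROTO, STEP_SOCK, STEP_NOTIFIER, STEP_SYSCTL, STEP_PROC, STEP_MAX };

static bool held[STEP_MAX];	/* which resources are currently registered */
static int fail_at = -1;	/* step that should fail, or -1 for none */

static int step_register(int step)
{
	if (step == fail_at)
		return -1;
	held[step] = true;
	return 0;
}

static void step_unregister(int step)
{
	held[step] = false;
}

/* Register five resources; on failure undo the ones already registered. */
static int demo_init(void)
{
	int rc;

	rc = step_register(STEP_PROTO);
	if (rc)
		goto out;
	rc = step_register(STEP_SOCK);
	if (rc)
		goto out_sock;
	rc = step_register(STEP_NOTIFIER);
	if (rc)
		goto out_notifier;
	rc = step_register(STEP_SYSCTL);
	if (rc)
		goto out_sysctl;
	rc = step_register(STEP_PROC);
	if (rc)
		goto out_proc;
	return 0;

out_proc:
	step_unregister(STEP_SYSCTL);
out_sysctl:
	step_unregister(STEP_NOTIFIER);
out_notifier:
	step_unregister(STEP_SOCK);
out_sock:
	step_unregister(STEP_PROTO);
out:
	return rc;
}
```

Setting fail_at to STEP_SYSCTL and calling demo_init() should leave every entry of held[] false, which is the property the patch is after for x25_init().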
^ permalink raw reply related
* Re: [patch net-next v2 10/10] net: sched: add termination action to allow goto chain
From: Jiri Pirko @ 2017-05-16 4:43 UTC (permalink / raw)
To: Daniel Borkmann
Cc: netdev, davem, jhs, xiyou.wangcong, dsa, edumazet, stephen,
alexander.h.duyck, simon.horman, mlxsw, alexei.starovoitov
In-Reply-To: <591A0940.2070801@iogearbox.net>
Mon, May 15, 2017 at 10:02:08PM CEST, daniel@iogearbox.net wrote:
>On 05/15/2017 10:38 AM, Jiri Pirko wrote:
>> From: Jiri Pirko <jiri@mellanox.com>
>>
>> Introduce new type of termination action called "goto_chain". This allows
>> user to specify a chain to be processed. This action type is
>> then processed as a return value in tcf_classify loop in similar
>> way as "reclassify" is, only it does not reset to the first filter
>> in chain but rather reset to the first filter of the desired chain.
>>
>> Signed-off-by: Jiri Pirko <jiri@mellanox.com>
>[...]
>> diff --git a/net/sched/cls_api.c b/net/sched/cls_api.c
>> index 1112a2b..98cc689 100644
>> --- a/net/sched/cls_api.c
>> +++ b/net/sched/cls_api.c
>> @@ -304,10 +304,14 @@ int tcf_classify(struct sk_buff *skb, const struct tcf_proto *tp,
>> continue;
>>
>> err = tp->classify(skb, tp, res);
>> - if (unlikely(err == TC_ACT_RECLASSIFY && !compat_mode))
>> + if (err == TC_ACT_RECLASSIFY && !compat_mode) {
>> goto reset;
>> - if (err >= 0)
>> + } else if (TC_ACT_EXT_CMP(err, TC_ACT_GOTO_CHAIN)) {
>> + old_tp = res->goto_tp;
>> + goto reset;
>> + } else if (err >= 0) {
>> return err;
>> + }
>
>Given this goto chain feature is pretty much only interesting for hw
>offloads, can we move this further away from the sw fast path to not
>add up to the cost per packet? (I doubt anyone is using TC_ACT_RECLASSIFY
>in sw as well ...)
I don't think so. First of all, the whole thing would be broken then in
sw. It is useful to have it in sw, at least for testing reasons.
So I would leave the unlikely and add it to the second check as well.
>
>> }
>>
>> return TC_ACT_UNSPEC; /* signal: continue lookup */
>>
^ permalink raw reply
* Re: [PATCH] ipmr: vrf: Find VIFs using the actual device
From: David Ahern @ 2017-05-16 4:50 UTC (permalink / raw)
To: Thomas Winter, netdev; +Cc: David Ahern, Nikolay Aleksandrov, roopa
In-Reply-To: <20170515221444.4913-1-Thomas.Winter@alliedtelesis.co.nz>
On 5/15/17 3:14 PM, Thomas Winter wrote:
> The skb->dev that is passed into ip_mr_input is
> the loX device for VRFs. When we look up a vif
> for this dev, none is found as we do not create
> vifs for loopbacks. Instead, look up a vif for the
> actual device that the packet was received on,
> e.g. the vlan.
>
> Signed-off-by: Thomas Winter <Thomas.Winter@alliedtelesis.co.nz>
> cc: David Ahern <dsa@cumulusnetworks.com>
> cc: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
> cc: roopa <roopa@cumulusnetworks.com>
> ---
> net/ipv4/ipmr.c | 18 ++++++++++++++++--
> 1 file changed, 16 insertions(+), 2 deletions(-)
LGTM
Acked-by: David Ahern <dsahern@gmail.com>
^ permalink raw reply
* Re: [PATCH net-next] tcp: internal implementation for pacing
From: kbuild test robot @ 2017-05-16 5:57 UTC (permalink / raw)
To: Eric Dumazet
Cc: kbuild-all, David S . Miller, netdev, Eric Dumazet, Eric Dumazet,
Neal Cardwell, Yuchung Cheng, Soheil Hassas Yeganeh, Van Jacobson,
Jerry Chu
In-Reply-To: <20170516034318.9913-1-edumazet@google.com>
[-- Attachment #1: Type: text/plain, Size: 3207 bytes --]
Hi Eric,
[auto build test WARNING on net-next/master]
url: https://github.com/0day-ci/linux/commits/Eric-Dumazet/tcp-internal-implementation-for-pacing/20170516-115441
reproduce: make htmldocs
All warnings (new ones prefixed by >>):
include/net/sock.h:476: warning: No description found for parameter 'sk_tsq_flags'
>> include/net/sock.h:476: warning: No description found for parameter 'sk_pacing_status'
include/net/sock.h:476: warning: No description found for parameter '__sk_flags_offset'
include/net/sock.h:476: warning: No description found for parameter 'sk_uid'
drivers/net/phy/phy.c:259: warning: No description found for parameter 'features'
drivers/net/phy/phy.c:259: warning: Excess function parameter 'feature' description in 'phy_lookup_setting'
drivers/net/phy/phy.c:259: warning: No description found for parameter 'features'
drivers/net/phy/phy.c:259: warning: Excess function parameter 'feature' description in 'phy_lookup_setting'
vim +/sk_pacing_status +476 include/net/sock.h
^1da177e4 Linus Torvalds 2005-04-16 460 struct socket *sk_socket;
^1da177e4 Linus Torvalds 2005-04-16 461 void *sk_user_data;
d5f642384 Alexey Dobriyan 2008-11-04 462 #ifdef CONFIG_SECURITY
^1da177e4 Linus Torvalds 2005-04-16 463 void *sk_security;
d5f642384 Alexey Dobriyan 2008-11-04 464 #endif
2a56a1fec Tejun Heo 2015-12-07 465 struct sock_cgroup_data sk_cgrp_data;
baac50bbc Johannes Weiner 2016-01-14 466 struct mem_cgroup *sk_memcg;
^1da177e4 Linus Torvalds 2005-04-16 467 void (*sk_state_change)(struct sock *sk);
676d23690 David S. Miller 2014-04-11 468 void (*sk_data_ready)(struct sock *sk);
^1da177e4 Linus Torvalds 2005-04-16 469 void (*sk_write_space)(struct sock *sk);
^1da177e4 Linus Torvalds 2005-04-16 470 void (*sk_error_report)(struct sock *sk);
^1da177e4 Linus Torvalds 2005-04-16 471 int (*sk_backlog_rcv)(struct sock *sk,
^1da177e4 Linus Torvalds 2005-04-16 472 struct sk_buff *skb);
^1da177e4 Linus Torvalds 2005-04-16 473 void (*sk_destruct)(struct sock *sk);
ef456144d Craig Gallek 2016-01-04 474 struct sock_reuseport __rcu *sk_reuseport_cb;
a4298e452 Eric Dumazet 2016-04-01 475 struct rcu_head sk_rcu;
^1da177e4 Linus Torvalds 2005-04-16 @476 };
^1da177e4 Linus Torvalds 2005-04-16 477
98acdc088 Eric Dumazet 2017-05-15 478 enum sk_pacing {
98acdc088 Eric Dumazet 2017-05-15 479 SK_PACING_NONE = 0,
98acdc088 Eric Dumazet 2017-05-15 480 SK_PACING_NEEDED = 1,
98acdc088 Eric Dumazet 2017-05-15 481 SK_PACING_FQ = 2,
98acdc088 Eric Dumazet 2017-05-15 482 };
98acdc088 Eric Dumazet 2017-05-15 483
559835ea7 Pravin B Shelar 2013-09-24 484 #define __sk_user_data(sk) ((*((void __rcu **)&(sk)->sk_user_data)))
:::::: The code at line 476 was first introduced by commit
:::::: 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Linux-2.6.12-rc2
:::::: TO: Linus Torvalds <torvalds@ppc970.osdl.org>
:::::: CC: Linus Torvalds <torvalds@ppc970.osdl.org>
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 6564 bytes --]
^ permalink raw reply
* [PATCH] net: Improve handling of failures on link and route dumps
From: David Ahern @ 2017-05-16 6:19 UTC (permalink / raw)
To: netdev; +Cc: mq, David Ahern
In general, rtnetlink dumps do not anticipate failure to dump a single
object (e.g., link or route) on a single pass. As both route and link
objects have grown via more attributes, that is no longer a given.
netlink dumps can handle a failure if the dump function returns an
error; specifically, netlink_dump adds the return code to the response
if it is <= 0 so userspace is notified of the failure. The missing
piece is the rtnetlink dump functions returning the error.
Fix route and link dump functions to return the errors if no object is
added to an skb (detected by skb->len == 0). IPv6 route dumps
(rt6_dump_route) already return the error; this patch updates IPv4 and
link dumps. Other dump functions may need to be adjusted as well.
Reported-by: Jan Moskyto Matejka <mq@ucw.cz>
Signed-off-by: David Ahern <dsahern@gmail.com>
---
The recent IPv6 multipath change brought this to light because of the
ease with which ipv6 route appends can exceed a buffer size, but it seems
to be a day 1 problem.
net/core/rtnetlink.c | 36 ++++++++++++++++++++++++------------
net/ipv4/fib_frontend.c | 15 +++++++++++----
net/ipv4/fib_trie.c | 26 ++++++++++++++------------
3 files changed, 49 insertions(+), 28 deletions(-)
diff --git a/net/core/rtnetlink.c b/net/core/rtnetlink.c
index d7f82c3450b1..49a279a7cc15 100644
--- a/net/core/rtnetlink.c
+++ b/net/core/rtnetlink.c
@@ -1627,13 +1627,13 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
cb->nlh->nlmsg_seq, 0,
flags,
ext_filter_mask);
- /* If we ran out of room on the first message,
- * we're in trouble
- */
- WARN_ON((err == -EMSGSIZE) && (skb->len == 0));
- if (err < 0)
- goto out;
+ if (err < 0) {
+ if (likely(skb->len))
+ goto out;
+
+ goto out_err;
+ }
nl_dump_check_consistent(cb, nlmsg_hdr(skb));
cont:
@@ -1641,10 +1641,12 @@ static int rtnl_dump_ifinfo(struct sk_buff *skb, struct netlink_callback *cb)
}
}
out:
+ err = skb->len;
+out_err:
cb->args[1] = idx;
cb->args[0] = h;
- return skb->len;
+ return err;
}
int rtnl_nla_parse_ifla(struct nlattr **tb, const struct nlattr *head, int len,
@@ -3453,8 +3455,12 @@ static int rtnl_bridge_getlink(struct sk_buff *skb, struct netlink_callback *cb)
err = br_dev->netdev_ops->ndo_bridge_getlink(
skb, portid, seq, dev,
filter_mask, NLM_F_MULTI);
- if (err < 0 && err != -EOPNOTSUPP)
- break;
+ if (err < 0 && err != -EOPNOTSUPP) {
+ if (likely(skb->len))
+ break;
+
+ goto out_err;
+ }
}
idx++;
}
@@ -3465,16 +3471,22 @@ static int rtnl_bridge_getlink(struct sk_buff *skb, struct netlink_callback *cb)
seq, dev,
filter_mask,
NLM_F_MULTI);
- if (err < 0 && err != -EOPNOTSUPP)
- break;
+ if (err < 0 && err != -EOPNOTSUPP) {
+ if (likely(skb->len))
+ break;
+
+ goto out_err;
+ }
}
idx++;
}
}
+ err = skb->len;
+out_err:
rcu_read_unlock();
cb->args[0] = idx;
- return skb->len;
+ return err;
}
static inline size_t bridge_nlmsg_size(void)
diff --git a/net/ipv4/fib_frontend.c b/net/ipv4/fib_frontend.c
index 39bd1edee676..83e3ed258467 100644
--- a/net/ipv4/fib_frontend.c
+++ b/net/ipv4/fib_frontend.c
@@ -763,7 +763,7 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
unsigned int e = 0, s_e;
struct fib_table *tb;
struct hlist_head *head;
- int dumped = 0;
+ int dumped = 0, err;
if (nlmsg_len(cb->nlh) >= sizeof(struct rtmsg) &&
((struct rtmsg *) nlmsg_data(cb->nlh))->rtm_flags & RTM_F_CLONED)
@@ -783,20 +783,27 @@ static int inet_dump_fib(struct sk_buff *skb, struct netlink_callback *cb)
if (dumped)
memset(&cb->args[2], 0, sizeof(cb->args) -
2 * sizeof(cb->args[0]));
- if (fib_table_dump(tb, skb, cb) < 0)
- goto out;
+ err = fib_table_dump(tb, skb, cb);
+ if (err < 0) {
+ if (likely(skb->len))
+ goto out;
+
+ goto out_err;
+ }
dumped = 1;
next:
e++;
}
}
out:
+ err = skb->len;
+out_err:
rcu_read_unlock();
cb->args[1] = e;
cb->args[0] = h;
- return skb->len;
+ return err;
}
/* Prepare and feed intra-kernel routing request.
diff --git a/net/ipv4/fib_trie.c b/net/ipv4/fib_trie.c
index 1201409ba1dc..51182ff2b441 100644
--- a/net/ipv4/fib_trie.c
+++ b/net/ipv4/fib_trie.c
@@ -1983,6 +1983,8 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
/* rcu_read_lock is hold by caller */
hlist_for_each_entry_rcu(fa, &l->leaf, fa_list) {
+ int err;
+
if (i < s_i) {
i++;
continue;
@@ -1993,17 +1995,14 @@ static int fn_trie_dump_leaf(struct key_vector *l, struct fib_table *tb,
continue;
}
- if (fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
- cb->nlh->nlmsg_seq,
- RTM_NEWROUTE,
- tb->tb_id,
- fa->fa_type,
- xkey,
- KEYLENGTH - fa->fa_slen,
- fa->fa_tos,
- fa->fa_info, NLM_F_MULTI) < 0) {
+ err = fib_dump_info(skb, NETLINK_CB(cb->skb).portid,
+ cb->nlh->nlmsg_seq, RTM_NEWROUTE,
+ tb->tb_id, fa->fa_type,
+ xkey, KEYLENGTH - fa->fa_slen,
+ fa->fa_tos, fa->fa_info, NLM_F_MULTI);
+ if (err < 0) {
cb->args[4] = i;
- return -1;
+ return err;
}
i++;
}
@@ -2025,10 +2024,13 @@ int fib_table_dump(struct fib_table *tb, struct sk_buff *skb,
t_key key = cb->args[3];
while ((l = leaf_walk_rcu(&tp, key)) != NULL) {
- if (fn_trie_dump_leaf(l, tb, skb, cb) < 0) {
+ int err;
+
+ err = fn_trie_dump_leaf(l, tb, skb, cb);
+ if (err < 0) {
cb->args[3] = key;
cb->args[2] = count;
- return -1;
+ return err;
}
++count;
--
2.11.0 (Apple Git-81)
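
The return convention this patch applies to the dump functions can be sketched in userspace as follows. The names (struct demo_buf, demo_fill_one(), demo_dump(), DEMO_EMSGSIZE) are hypothetical and only model the skb and the per-object fill function; the sketch assumes a pass should return the bytes written when at least one object fit, and the error itself when not even the first object fit.

```c
#include <stddef.h>

#define DEMO_EMSGSIZE 90	/* stand-in for the kernel's EMSGSIZE */

/* Hypothetical buffer standing in for the skb of one dump pass. */
struct demo_buf {
	size_t len;	/* bytes used so far, like skb->len */
	size_t cap;	/* total room available in this pass */
};

/* Append one object, or fail with -DEMO_EMSGSIZE if it does not fit. */
static int demo_fill_one(struct demo_buf *buf, size_t obj_size)
{
	if (obj_size > buf->cap - buf->len)
		return -DEMO_EMSGSIZE;
	buf->len += obj_size;
	return 0;
}

/* Dump objects starting at *idx, which is left pointing at the next
 * object to dump so that a later pass can resume from there.
 */
static int demo_dump(struct demo_buf *buf, const size_t *sizes, size_t n,
		     size_t *idx)
{
	int err;

	for (; *idx < n; (*idx)++) {
		err = demo_fill_one(buf, sizes[*idx]);
		if (err < 0) {
			if (buf->len)
				break;	/* partial pass: resume at *idx later */
			return err;	/* nothing fit: report the failure */
		}
	}
	return (int)buf->len;
}
```

In this model, an object that can never fit in an empty buffer yields a negative return value instead of an endless sequence of empty passes, which is the case the patch description is concerned with.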
^ permalink raw reply related
* (unknown),
From: momofr @ 2017-05-16 6:37 UTC (permalink / raw)
To: netdev
[-- Attachment #1: EMAIL_373084188081_netdev.zip --]
[-- Type: application/zip, Size: 2077 bytes --]
^ permalink raw reply
* [PATCH net v1] net/smc: Add warning about remote memory exposure
From: Leon Romanovsky @ 2017-05-16 6:51 UTC (permalink / raw)
To: davem; +Cc: ubraun, netdev, iinux-rdma, Christoph Hellwig
From: Christoph Hellwig <hch@lst.de>
The driver explicitly bypasses APIs to register all memory once a
connection is made, and thus allows remote access to memory.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Leon Romanovsky <leon@kernel.org>
---
Dave,
Can you please forward this patch to stable?
Thanks
---
Changes from v0:
* Remove BROKEN Kconfig option as a followup of this discussion
https://patchwork.ozlabs.org/patch/760454/
* Refine commit message
---
net/smc/Kconfig | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/net/smc/Kconfig b/net/smc/Kconfig
index c717ef0896aa..33954852f3f8 100644
--- a/net/smc/Kconfig
+++ b/net/smc/Kconfig
@@ -8,6 +8,10 @@ config SMC
The Linux implementation of the SMC-R solution is designed as
a separate socket family SMC.
+ Warning: SMC will expose all memory for remote reads and writes
+ once a connection is established. Don't enable this option except
+ for tightly controlled lab environment.
+
Select this option if you want to run SMC socket applications
config SMC_DIAG
^ permalink raw reply related
* Re: [oss-drivers] [PATCH net-next 3/9] nfp: support LSO2 capability
From: Simon Horman @ 2017-05-16 6:56 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, oss-drivers, Edwin Peer
In-Reply-To: <20170516005523.26124-4-jakub.kicinski@netronome.com>
On Mon, May 15, 2017 at 05:55:17PM -0700, Jakub Kicinski wrote:
> From: Edwin Peer <edwin.peer@netronome.com>
>
> Firmware advertising the LSO2 capability exploits driver provided L3 and L4
> offsets in order to avoid parsing packet headers in the TX path. The vlan
> field in struct nfp_net_tx_desc is repurposed, making TXVLAN a mutually
> exclusive configuration to LSO2.
>
> Signed-off-by: Edwin Peer <edwin.peer@netronome.com>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
^ permalink raw reply
* Re: [oss-drivers] [PATCH net-next 9/9] nfp: eliminate an if statement in calculation of completed frames
From: Simon Horman @ 2017-05-16 6:57 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170516005523.26124-10-jakub.kicinski@netronome.com>
On Mon, May 15, 2017 at 05:55:23PM -0700, Jakub Kicinski wrote:
> Given that our rings are always a power of 2, we can simplify the
> calculation of number of completed TX descriptors by using masking
> instead of an if statement based on whether the index has wrapped
> or not.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
Reviewed-by: Simon Horman <simon.horman@netronome.com>
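
As an illustration of the simplification described in the quoted commit message, the sketch below compares a branching and a masking way of counting completed descriptors. It is not the driver code: ring_completed_branch() and ring_completed_mask() are hypothetical helpers, and both indices are assumed to already lie in the range [0, ring_cnt) with ring_cnt a power of 2.

```c
#include <stdint.h>

/* Branching form: handle the wrapped and non-wrapped cases separately. */
static uint32_t ring_completed_branch(uint32_t hw_idx, uint32_t sw_idx,
				      uint32_t ring_cnt)
{
	if (hw_idx >= sw_idx)
		return hw_idx - sw_idx;
	return hw_idx + ring_cnt - sw_idx;
}

/* Masking form: unsigned subtraction wraps modulo 2^32, and masking with
 * ring_cnt - 1 reduces that modulo ring_cnt because ring_cnt is a power
 * of 2.
 */
static uint32_t ring_completed_mask(uint32_t hw_idx, uint32_t sw_idx,
				    uint32_t ring_cnt)
{
	return (hw_idx - sw_idx) & (ring_cnt - 1);
}
```

For a ring of 8 entries with hw_idx 1 and sw_idx 6, both forms give 3.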
^ permalink raw reply
* Re: [PATCH net v1] net/smc: Add warning about remote memory exposure
From: Leon Romanovsky @ 2017-05-16 7:03 UTC (permalink / raw)
To: davem-fT/PcQaiUtIeIZ0/mPfg9Q
Cc: ubraun-23VcF4HTsmIX0ybBhKVfKdBPR1lH4CV8,
netdev-u79uwXL29TY76Z2rM5mHXA, linux-rdma-u79uwXL29TY76Z2rM5mHXA,
Christoph Hellwig
In-Reply-To: <20170516065138.24789-1-leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
[-- Attachment #1: Type: text/plain, Size: 1314 bytes --]
+ linux-rdma ML.
On Tue, May 16, 2017 at 09:51:38AM +0300, Leon Romanovsky wrote:
> From: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
>
> The driver explicitly bypasses APIs to register all memory once a
> connection is made, and thus allows remote access to memory.
>
> Signed-off-by: Christoph Hellwig <hch-jcswGhMUV9g@public.gmane.org>
> Signed-off-by: Leon Romanovsky <leon-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
> ---
> Dave,
> Can you please forward this patch to stable?
> Thanks
> ---
> Changes from v0:
> * Remove BROKEN Kconfig option as a followup of this discussion
> https://patchwork.ozlabs.org/patch/760454/
> * Refine commit message
> ---
> net/smc/Kconfig | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/net/smc/Kconfig b/net/smc/Kconfig
> index c717ef0896aa..33954852f3f8 100644
> --- a/net/smc/Kconfig
> +++ b/net/smc/Kconfig
> @@ -8,6 +8,10 @@ config SMC
> The Linux implementation of the SMC-R solution is designed as
> a separate socket family SMC.
>
> + Warning: SMC will expose all memory for remote reads and writes
> + once a connection is established. Don't enable this option except
> + for tightly controlled lab environment.
> +
> Select this option if you want to run SMC socket applications
>
> config SMC_DIAG
> --
> 2.12.2
>
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 833 bytes --]
^ permalink raw reply
* Re: [oss-drivers] [PATCH net-next 8/9] nfp: add a helper for wrapping descriptor index
From: Simon Horman @ 2017-05-16 7:08 UTC (permalink / raw)
To: Jakub Kicinski; +Cc: netdev, oss-drivers
In-Reply-To: <20170516005523.26124-9-jakub.kicinski@netronome.com>
On Mon, May 15, 2017 at 05:55:22PM -0700, Jakub Kicinski wrote:
> We have a number of places where we calculate the descriptor
> index based on a value which may have overflown. Create a
> macro for masking with the ring size.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> ---
> drivers/net/ethernet/netronome/nfp/nfp_net.h | 3 +++
> drivers/net/ethernet/netronome/nfp/nfp_net_common.c | 21 ++++++++++-----------
> 2 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/drivers/net/ethernet/netronome/nfp/nfp_net.h b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> index 66319a1026bb..7b9518cbe965 100644
> --- a/drivers/net/ethernet/netronome/nfp/nfp_net.h
> +++ b/drivers/net/ethernet/netronome/nfp/nfp_net.h
> @@ -117,6 +117,9 @@ struct nfp_eth_table_port;
> struct nfp_net;
> struct nfp_net_r_vector;
>
> +/* Convenience macro for wrapping descriptor index on ring size */
> +#define D_IDX(ring, idx) ((idx) & ((ring)->cnt - 1))
Any reason not to make this a function?
That notwithstanding:
Reviewed-by: Simon Horman <simon.horman@netronome.com>
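
For comparison, here is a sketch of what a function form of the helper might look like next to the quoted macro. struct demo_ring and demo_d_idx() are hypothetical; the real driver has separate ring structures, and one possible reason for a macro (an assumption, not something stated in the thread) is that it works with any structure that has a cnt member, whereas a function is tied to a single ring type.

```c
/* Hypothetical ring type with a power-of-2 descriptor count. */
struct demo_ring {
	unsigned int cnt;
};

/* Macro as quoted in the patch: works for any type with a cnt member. */
#define D_IDX(ring, idx)	((idx) & ((ring)->cnt - 1))

/* Function form: type-checked, but bound to struct demo_ring. */
static inline unsigned int demo_d_idx(const struct demo_ring *ring,
				      unsigned int idx)
{
	return idx & (ring->cnt - 1);
}
```

Both forms return 1 for index 9 on a ring of 8 descriptors.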
^ permalink raw reply