Netdev List
* [PATCH net-next 6/7] net/faraday: Fix phy link irq on Aspeed G5 SoCs
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: gwshan, andrew, andrew, netdev, linux-kernel, benh
In-Reply-To: <20160920063007.24291-1-joel@jms.id.au>

On Aspeed SoCs with a direct PHY connection (non-NCSI), we receive
continual PHYSTS_CHG interrupts:

 [   20.280000] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.280000] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.280000] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG
 [   20.300000] ftgmac100 1e660000.ethernet eth0: [ISR] = 0x200: PHYSTS_CHG

This is because the driver enables low-level-sensitive interrupt
generation while these systems are wired for high-level. As a result,
all CPU cycles are spent servicing this interrupt.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 4 ++++
 1 file changed, 4 insertions(+)
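
For readers following along, the masking step can be modelled outside
the kernel. This is a minimal sketch, assuming bit 11 is the PHY link
level select as defined in patch 4/7 and bit 8 is FULLDUP as in
ftgmac100.h. ENABLE_ALL_SKETCH, maccr_for_machine() and the boolean
is_ast2500 are hypothetical stand-ins for MACCR_ENABLE_ALL and the
of_machine_is_compatible() check; they are not driver code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 11 of MACCR, as defined in patch 4/7 of this series. */
#define MACCR_PHY_LINK_LEVEL	(UINT32_C(1) << 11)
/* Bit 8 of MACCR (full duplex), used here as an unrelated enable bit. */
#define MACCR_FULLDUP		(UINT32_C(1) << 8)

/* Hypothetical stand-in for MACCR_ENABLE_ALL: only the bits needed here. */
#define ENABLE_ALL_SKETCH	(MACCR_PHY_LINK_LEVEL | MACCR_FULLDUP)

/*
 * Return the MACCR value to program. On ast2500 the level-select bit is
 * cleared again; every other machine keeps the default mask.
 */
static uint32_t maccr_for_machine(bool is_ast2500)
{
	uint32_t maccr = ENABLE_ALL_SKETCH;

	if (is_ast2500)
		maccr &= ~MACCR_PHY_LINK_LEVEL;

	/* Unrelated enable bits must survive the masking. */
	assert(maccr & MACCR_FULLDUP);
	return maccr;
}
```

Clearing the bit with `&= ~mask` rather than rebuilding the value keeps
the remaining enable bits in one place.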

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 7ba0f2d58a8b..5466df028381 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -223,6 +223,10 @@ static void ftgmac100_start_hw(struct ftgmac100 *priv, int speed)
 {
 	int maccr = MACCR_ENABLE_ALL;
 
+	if (of_machine_is_compatible("aspeed,ast2500")) {
+		maccr &= ~FTGMAC100_MACCR_PHY_LINK_LEVEL;
+	}
+
 	switch (speed) {
 	default:
 	case 10:
-- 
2.9.3


* [PATCH net-next 5/7] net/faraday: Clear stale interrupts
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: Gavin Shan, andrew, andrew, netdev, linux-kernel, benh
In-Reply-To: <20160920063007.24291-1-joel@jms.id.au>

From: Gavin Shan <gwshan@linux.vnet.ibm.com>

There is a stale interrupt (PHYSTS_CHG in ISR, bit#6 in 0x0) left over
from the bootloader (U-Boot) when enabling the MAC. Stale interrupts
were not raised on behalf of the kernel and should be cleared.

This clears the stale interrupts in ISR (0x0) when enabling the MAC.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 6 ++++++
 1 file changed, 6 insertions(+)
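
The read-then-write-back idiom used here relies on the ISR being
write-one-to-clear, which the patch implies but does not state. Below is
a minimal user-space model under that assumption. fake_isr, isr_read()
and isr_write() are hypothetical stand-ins for the register and for
ioread32()/iowrite32().

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of a write-one-to-clear interrupt status register. */
static uint32_t fake_isr;

static uint32_t isr_read(void)
{
	return fake_isr;
}

/* Writing a 1 to a bit clears it; writing 0 leaves the bit untouched. */
static void isr_write(uint32_t val)
{
	fake_isr &= ~val;
}

/* Acknowledge whatever is pending; returns the bits that were cleared. */
static uint32_t clear_stale_interrupts(void)
{
	uint32_t status = isr_read();

	isr_write(status);
	/* In this model nothing can raise a new interrupt in between. */
	assert(isr_read() == 0);
	return status;
}
```

Writing back exactly what was read acknowledges only the bits that were
pending at the time of the read.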

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index f2ea6c2f1fbd..7ba0f2d58a8b 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1113,6 +1113,7 @@ static int ftgmac100_poll(struct napi_struct *napi, int budget)
 static int ftgmac100_open(struct net_device *netdev)
 {
 	struct ftgmac100 *priv = netdev_priv(netdev);
+	unsigned int status;
 	int err;
 
 	err = ftgmac100_alloc_buffers(priv);
@@ -1138,6 +1139,11 @@ static int ftgmac100_open(struct net_device *netdev)
 
 	ftgmac100_init_hw(priv);
 	ftgmac100_start_hw(priv, priv->use_ncsi ? 100 : 10);
+
+	/* Clear stale interrupts */
+	status = ioread32(priv->base + FTGMAC100_OFFSET_ISR);
+	iowrite32(status, priv->base + FTGMAC100_OFFSET_ISR);
+
 	if (netdev->phydev)
 		phy_start(netdev->phydev);
 	else if (priv->use_ncsi)
-- 
2.9.3


* [PATCH net-next 4/7] net/faraday: Avoid PHYSTS_CHG interrupt
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: Gavin Shan, andrew, andrew, netdev, linux-kernel, benh
In-Reply-To: <20160920063007.24291-1-joel@jms.id.au>

From: Gavin Shan <gwshan@linux.vnet.ibm.com>

Bit#11 in MACCR (0x50) designates the signal level for PHY link
status change. It's cleared, meaning high level enabled, by default.
However, we see continuous interrupts (bit#6) in ISR (0x0) for it,
and they are obviously false alarms. As a side effect, CPU cycles are
wasted processing them.

This sets bit#11 in MACCR (0x50) to avoid the bogus interrupt.

Signed-off-by: Gavin Shan <gwshan@linux.vnet.ibm.com>
Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 1 +
 drivers/net/ethernet/faraday/ftgmac100.h | 1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 47f512224b57..f2ea6c2f1fbd 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -215,6 +215,7 @@ static void ftgmac100_init_hw(struct ftgmac100 *priv)
 				 FTGMAC100_MACCR_RXMAC_EN	| \
 				 FTGMAC100_MACCR_FULLDUP	| \
 				 FTGMAC100_MACCR_CRC_APD	| \
+				 FTGMAC100_MACCR_PHY_LINK_LEVEL | \
 				 FTGMAC100_MACCR_RX_RUNT	| \
 				 FTGMAC100_MACCR_RX_BROADPKT)
 
diff --git a/drivers/net/ethernet/faraday/ftgmac100.h b/drivers/net/ethernet/faraday/ftgmac100.h
index c258586ce4a4..d07b6ea5d1b5 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.h
+++ b/drivers/net/ethernet/faraday/ftgmac100.h
@@ -152,6 +152,7 @@
 #define FTGMAC100_MACCR_FULLDUP		(1 << 8)
 #define FTGMAC100_MACCR_GIGA_MODE	(1 << 9)
 #define FTGMAC100_MACCR_CRC_APD		(1 << 10)
+#define FTGMAC100_MACCR_PHY_LINK_LEVEL	(1 << 11)
 #define FTGMAC100_MACCR_RX_RUNT		(1 << 12)
 #define FTGMAC100_MACCR_JUMBO_LF	(1 << 13)
 #define FTGMAC100_MACCR_RX_ALL		(1 << 14)
-- 
2.9.3


* [PATCH net-next 3/7] net/faraday: Adapt for Aspeed SoCs
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: gwshan, andrew, andrew, netdev, linux-kernel, benh
In-Reply-To: <20160920063007.24291-1-joel@jms.id.au>

The RXDES and TXDES register bits in the ftgmac100 indicate EDO{R,T}R
at bit position 15 for the Faraday Tech IP. However, the version of this
IP present in the Aspeed SoCs has these bits at position 30 in the
registers.

It appears that ast2400 SoCs support both positions, with the 15th bit
marked as reserved but still functional. In the ast2500 this bit is
reused for another function, so we need a workaround.

Engineers from Aspeed confirmed that using bit 30 is correct for both
the ast2400 and ast2500 SoCs.

Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)
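
The selection logic can be tried outside the kernel. This is a minimal
sketch, assuming a single compatible string is passed in, whereas
of_machine_is_compatible() consults the device tree root node.
edo_mask_for() is a hypothetical helper and BIT() is redefined locally.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BIT(n) (UINT32_C(1) << (n))

/*
 * Pick the end-of-ring bit for a machine compatible string.
 * Aspeed parts use bit 30; everything else keeps the Faraday default,
 * bit 15.
 */
static uint32_t edo_mask_for(const char *compatible)
{
	assert(compatible != NULL);

	if (strcmp(compatible, "aspeed,ast2400") == 0 ||
	    strcmp(compatible, "aspeed,ast2500") == 0)
		return BIT(30);

	return BIT(15);
}
```

The same mask is used for both the RX and TX descriptors, matching the
assignments in the probe function.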

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 62a88d1a1f99..47f512224b57 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -1345,9 +1345,6 @@ static int ftgmac100_probe(struct platform_device *pdev)
 	priv->netdev = netdev;
 	priv->dev = &pdev->dev;
 
-	priv->rxdes0_edorr_mask = BIT(15);
-	priv->txdes0_edotr_mask = BIT(15);
-
 	spin_lock_init(&priv->tx_lock);
 
 	/* initialize NAPI */
@@ -1381,6 +1378,16 @@ static int ftgmac100_probe(struct platform_device *pdev)
 			      FTGMAC100_INT_PHYSTS_CHG |
 			      FTGMAC100_INT_RPKT_BUF |
 			      FTGMAC100_INT_NO_RXBUF);
+
+	if (of_machine_is_compatible("aspeed,ast2400") ||
+	    of_machine_is_compatible("aspeed,ast2500")) {
+		priv->rxdes0_edorr_mask = BIT(30);
+		priv->txdes0_edotr_mask = BIT(30);
+	} else {
+		priv->rxdes0_edorr_mask = BIT(15);
+		priv->txdes0_edotr_mask = BIT(15);
+	}
+
 	if (pdev->dev.of_node &&
 	    of_get_property(pdev->dev.of_node, "use-ncsi", NULL)) {
 		if (!IS_ENABLED(CONFIG_NET_NCSI)) {
-- 
2.9.3


* [PATCH net-next 2/7] net/faraday: Make EDO{R,T}R bits configurable
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: Andrew Jeffery, gwshan, andrew, netdev, linux-kernel, benh
In-Reply-To: <20160920063007.24291-1-joel@jms.id.au>

From: Andrew Jeffery <andrew@aj.id.au>

These bits are #defined at a fixed location. To support future hardware
that moves these bits around, store the masks in members of struct
ftgmac100.

Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 40 +++++++++++++++++++++-----------
 drivers/net/ethernet/faraday/ftgmac100.h |  2 --
 2 files changed, 26 insertions(+), 16 deletions(-)
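
The effect of carrying the mask in the private structure can be seen in
a stripped-down model. This is a sketch only. The sketch_* types and
functions are hypothetical, and the cpu_to_le32() conversion is left
out, so it assumes the descriptor word is already in host byte order.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, trimmed-down stand-ins for the driver structures. */
struct sketch_priv {
	uint32_t txdes0_edotr_mask;	/* end-of-ring bit for this hardware */
};

struct sketch_txdes {
	uint32_t txdes0;
};

static void sketch_set_end_of_ring(const struct sketch_priv *priv,
				   struct sketch_txdes *txdes)
{
	txdes->txdes0 |= priv->txdes0_edotr_mask;
}

/* Clear every status bit except the end-of-ring marker. */
static void sketch_txdes_reset(const struct sketch_priv *priv,
			       struct sketch_txdes *txdes)
{
	txdes->txdes0 &= priv->txdes0_edotr_mask;
	/* Only the end-of-ring bit may remain set. */
	assert((txdes->txdes0 & ~priv->txdes0_edotr_mask) == 0);
}
```

Because both helpers read the mask from priv, changing the bit position
for a new SoC only requires a different value at probe time.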

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 40622567159a..62a88d1a1f99 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -79,6 +79,9 @@ struct ftgmac100 {
 	int int_mask_all;
 	bool use_ncsi;
 	bool enabled;
+
+	u32 rxdes0_edorr_mask;
+	u32 txdes0_edotr_mask;
 };
 
 static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
@@ -259,10 +262,11 @@ static bool ftgmac100_rxdes_packet_ready(struct ftgmac100_rxdes *rxdes)
 	return rxdes->rxdes0 & cpu_to_le32(FTGMAC100_RXDES0_RXPKT_RDY);
 }
 
-static void ftgmac100_rxdes_set_dma_own(struct ftgmac100_rxdes *rxdes)
+static void ftgmac100_rxdes_set_dma_own(const struct ftgmac100 *priv,
+					struct ftgmac100_rxdes *rxdes)
 {
 	/* clear status bits */
-	rxdes->rxdes0 &= cpu_to_le32(FTGMAC100_RXDES0_EDORR);
+	rxdes->rxdes0 &= cpu_to_le32(priv->rxdes0_edorr_mask);
 }
 
 static bool ftgmac100_rxdes_rx_error(struct ftgmac100_rxdes *rxdes)
@@ -300,9 +304,10 @@ static bool ftgmac100_rxdes_multicast(struct ftgmac100_rxdes *rxdes)
 	return rxdes->rxdes0 & cpu_to_le32(FTGMAC100_RXDES0_MULTICAST);
 }
 
-static void ftgmac100_rxdes_set_end_of_ring(struct ftgmac100_rxdes *rxdes)
+static void ftgmac100_rxdes_set_end_of_ring(const struct ftgmac100 *priv,
+					    struct ftgmac100_rxdes *rxdes)
 {
-	rxdes->rxdes0 |= cpu_to_le32(FTGMAC100_RXDES0_EDORR);
+	rxdes->rxdes0 |= cpu_to_le32(priv->rxdes0_edorr_mask);
 }
 
 static void ftgmac100_rxdes_set_dma_addr(struct ftgmac100_rxdes *rxdes,
@@ -393,7 +398,7 @@ ftgmac100_rx_locate_first_segment(struct ftgmac100 *priv)
 		if (ftgmac100_rxdes_first_segment(rxdes))
 			return rxdes;
 
-		ftgmac100_rxdes_set_dma_own(rxdes);
+		ftgmac100_rxdes_set_dma_own(priv, rxdes);
 		ftgmac100_rx_pointer_advance(priv);
 		rxdes = ftgmac100_current_rxdes(priv);
 	}
@@ -464,7 +469,7 @@ static void ftgmac100_rx_drop_packet(struct ftgmac100 *priv)
 		if (ftgmac100_rxdes_last_segment(rxdes))
 			done = true;
 
-		ftgmac100_rxdes_set_dma_own(rxdes);
+		ftgmac100_rxdes_set_dma_own(priv, rxdes);
 		ftgmac100_rx_pointer_advance(priv);
 		rxdes = ftgmac100_current_rxdes(priv);
 	} while (!done && ftgmac100_rxdes_packet_ready(rxdes));
@@ -556,10 +561,11 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int *processed)
 /******************************************************************************
  * internal functions (transmit descriptor)
  *****************************************************************************/
-static void ftgmac100_txdes_reset(struct ftgmac100_txdes *txdes)
+static void ftgmac100_txdes_reset(const struct ftgmac100 *priv,
+				  struct ftgmac100_txdes *txdes)
 {
 	/* clear all except end of ring bit */
-	txdes->txdes0 &= cpu_to_le32(FTGMAC100_TXDES0_EDOTR);
+	txdes->txdes0 &= cpu_to_le32(priv->txdes0_edotr_mask);
 	txdes->txdes1 = 0;
 	txdes->txdes2 = 0;
 	txdes->txdes3 = 0;
@@ -580,9 +586,10 @@ static void ftgmac100_txdes_set_dma_own(struct ftgmac100_txdes *txdes)
 	txdes->txdes0 |= cpu_to_le32(FTGMAC100_TXDES0_TXDMA_OWN);
 }
 
-static void ftgmac100_txdes_set_end_of_ring(struct ftgmac100_txdes *txdes)
+static void ftgmac100_txdes_set_end_of_ring(const struct ftgmac100 *priv,
+					    struct ftgmac100_txdes *txdes)
 {
-	txdes->txdes0 |= cpu_to_le32(FTGMAC100_TXDES0_EDOTR);
+	txdes->txdes0 |= cpu_to_le32(priv->txdes0_edotr_mask);
 }
 
 static void ftgmac100_txdes_set_first_segment(struct ftgmac100_txdes *txdes)
@@ -701,7 +708,7 @@ static bool ftgmac100_tx_complete_packet(struct ftgmac100 *priv)
 
 	dev_kfree_skb(skb);
 
-	ftgmac100_txdes_reset(txdes);
+	ftgmac100_txdes_reset(priv, txdes);
 
 	ftgmac100_tx_clean_pointer_advance(priv);
 
@@ -792,7 +799,7 @@ static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
 
 	ftgmac100_rxdes_set_page(priv, rxdes, page);
 	ftgmac100_rxdes_set_dma_addr(rxdes, map);
-	ftgmac100_rxdes_set_dma_own(rxdes);
+	ftgmac100_rxdes_set_dma_own(priv, rxdes);
 	return 0;
 }
 
@@ -839,7 +846,8 @@ static int ftgmac100_alloc_buffers(struct ftgmac100 *priv)
 		return -ENOMEM;
 
 	/* initialize RX ring */
-	ftgmac100_rxdes_set_end_of_ring(&priv->descs->rxdes[RX_QUEUE_ENTRIES - 1]);
+	ftgmac100_rxdes_set_end_of_ring(priv,
+					&priv->descs->rxdes[RX_QUEUE_ENTRIES - 1]);
 
 	for (i = 0; i < RX_QUEUE_ENTRIES; i++) {
 		struct ftgmac100_rxdes *rxdes = &priv->descs->rxdes[i];
@@ -849,7 +857,8 @@ static int ftgmac100_alloc_buffers(struct ftgmac100 *priv)
 	}
 
 	/* initialize TX ring */
-	ftgmac100_txdes_set_end_of_ring(&priv->descs->txdes[TX_QUEUE_ENTRIES - 1]);
+	ftgmac100_txdes_set_end_of_ring(priv,
+					&priv->descs->txdes[TX_QUEUE_ENTRIES - 1]);
 	return 0;
 
 err:
@@ -1336,6 +1345,9 @@ static int ftgmac100_probe(struct platform_device *pdev)
 	priv->netdev = netdev;
 	priv->dev = &pdev->dev;
 
+	priv->rxdes0_edorr_mask = BIT(15);
+	priv->txdes0_edotr_mask = BIT(15);
+
 	spin_lock_init(&priv->tx_lock);
 
 	/* initialize NAPI */
diff --git a/drivers/net/ethernet/faraday/ftgmac100.h b/drivers/net/ethernet/faraday/ftgmac100.h
index 13408d448b05..c258586ce4a4 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.h
+++ b/drivers/net/ethernet/faraday/ftgmac100.h
@@ -189,7 +189,6 @@ struct ftgmac100_txdes {
 } __attribute__ ((aligned(16)));
 
 #define FTGMAC100_TXDES0_TXBUF_SIZE(x)	((x) & 0x3fff)
-#define FTGMAC100_TXDES0_EDOTR		(1 << 15)
 #define FTGMAC100_TXDES0_CRC_ERR	(1 << 19)
 #define FTGMAC100_TXDES0_LTS		(1 << 28)
 #define FTGMAC100_TXDES0_FTS		(1 << 29)
@@ -215,7 +214,6 @@ struct ftgmac100_rxdes {
 } __attribute__ ((aligned(16)));
 
 #define FTGMAC100_RXDES0_VDBC		0x3fff
-#define FTGMAC100_RXDES0_EDORR		(1 << 15)
 #define FTGMAC100_RXDES0_MULTICAST	(1 << 16)
 #define FTGMAC100_RXDES0_BROADCAST	(1 << 17)
 #define FTGMAC100_RXDES0_RX_ERR		(1 << 18)
-- 
2.9.3


* [PATCH net-next 1/7] net/faraday: Separate rx page storage from rxdesc
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: Andrew Jeffery, gwshan, andrew, netdev, linux-kernel, benh
In-Reply-To: <20160920063007.24291-1-joel@jms.id.au>

From: Andrew Jeffery <andrew@aj.id.au>

The ftgmac100 hardware revision in e.g. the Aspeed AST2500 no longer
reserves all bits in RXDES#2 but instead uses the bottom 16 bits to
store MAC frame metadata. Avoid corruption by shifting struct page
pointers out to their own member in struct ftgmac100.

Signed-off-by: Andrew Jeffery <andrew@aj.id.au>
Signed-off-by: Joel Stanley <joel@jms.id.au>
---
 drivers/net/ethernet/faraday/ftgmac100.c | 25 ++++++++++++++++++-------
 1 file changed, 18 insertions(+), 7 deletions(-)
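
The slot lookup relies on pointer subtraction to turn a descriptor
pointer into a ring index. Below is a minimal model of that idea. The
sketch_* names and the ring size of 4 are hypothetical, and void * is
used in place of struct page *.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_RX_ENTRIES 4	/* hypothetical ring size */

struct sketch_rxdes {
	unsigned int rxdes0, rxdes1, rxdes2, rxdes3;
};

struct sketch_priv {
	struct sketch_rxdes ring[SKETCH_RX_ENTRIES];
	/* Side table: one slot per descriptor, never seen by hardware. */
	void *rx_pages[SKETCH_RX_ENTRIES];
};

/* Map a descriptor to its slot: pointer subtraction gives the ring index. */
static void **sketch_page_slot(struct sketch_priv *priv,
			       struct sketch_rxdes *rxdes)
{
	ptrdiff_t idx = rxdes - priv->ring;

	assert(idx >= 0 && idx < SKETCH_RX_ENTRIES);
	return &priv->rx_pages[idx];
}
```

Storing the pointer in the side table leaves rxdes2 untouched, so the
hardware is free to use that word.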

diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 36361f8bf894..40622567159a 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -60,6 +60,8 @@ struct ftgmac100 {
 	struct ftgmac100_descs *descs;
 	dma_addr_t descs_dma_addr;
 
+	struct page *rx_pages[RX_QUEUE_ENTRIES];
+
 	unsigned int rx_pointer;
 	unsigned int tx_clean_pointer;
 	unsigned int tx_pointer;
@@ -341,18 +343,27 @@ static bool ftgmac100_rxdes_ipcs_err(struct ftgmac100_rxdes *rxdes)
 	return rxdes->rxdes1 & cpu_to_le32(FTGMAC100_RXDES1_IP_CHKSUM_ERR);
 }
 
+static inline struct page **ftgmac100_rxdes_page_slot(struct ftgmac100 *priv,
+						      struct ftgmac100_rxdes *rxdes)
+{
+	return &priv->rx_pages[rxdes - priv->descs->rxdes];
+}
+
 /*
  * rxdes2 is not used by hardware. We use it to keep track of page.
  * Since hardware does not touch it, we can skip cpu_to_le32()/le32_to_cpu().
  */
-static void ftgmac100_rxdes_set_page(struct ftgmac100_rxdes *rxdes, struct page *page)
+static void ftgmac100_rxdes_set_page(struct ftgmac100 *priv,
+				     struct ftgmac100_rxdes *rxdes,
+				     struct page *page)
 {
-	rxdes->rxdes2 = (unsigned int)page;
+	*ftgmac100_rxdes_page_slot(priv, rxdes) = page;
 }
 
-static struct page *ftgmac100_rxdes_get_page(struct ftgmac100_rxdes *rxdes)
+static struct page *ftgmac100_rxdes_get_page(struct ftgmac100 *priv,
+					     struct ftgmac100_rxdes *rxdes)
 {
-	return (struct page *)rxdes->rxdes2;
+	return *ftgmac100_rxdes_page_slot(priv, rxdes);
 }
 
 /******************************************************************************
@@ -501,7 +512,7 @@ static bool ftgmac100_rx_packet(struct ftgmac100 *priv, int *processed)
 
 	do {
 		dma_addr_t map = ftgmac100_rxdes_get_dma_addr(rxdes);
-		struct page *page = ftgmac100_rxdes_get_page(rxdes);
+		struct page *page = ftgmac100_rxdes_get_page(priv, rxdes);
 		unsigned int size;
 
 		dma_unmap_page(priv->dev, map, RX_BUF_SIZE, DMA_FROM_DEVICE);
@@ -779,7 +790,7 @@ static int ftgmac100_alloc_rx_page(struct ftgmac100 *priv,
 		return -ENOMEM;
 	}
 
-	ftgmac100_rxdes_set_page(rxdes, page);
+	ftgmac100_rxdes_set_page(priv, rxdes, page);
 	ftgmac100_rxdes_set_dma_addr(rxdes, map);
 	ftgmac100_rxdes_set_dma_own(rxdes);
 	return 0;
@@ -791,7 +802,7 @@ static void ftgmac100_free_buffers(struct ftgmac100 *priv)
 
 	for (i = 0; i < RX_QUEUE_ENTRIES; i++) {
 		struct ftgmac100_rxdes *rxdes = &priv->descs->rxdes[i];
-		struct page *page = ftgmac100_rxdes_get_page(rxdes);
+		struct page *page = ftgmac100_rxdes_get_page(priv, rxdes);
 		dma_addr_t map = ftgmac100_rxdes_get_dma_addr(rxdes);
 
 		if (!page)
-- 
2.9.3


* [PATCH net-next 0/7] ftgmac100 support for ast2500
From: Joel Stanley @ 2016-09-20  6:30 UTC (permalink / raw)
  To: davem; +Cc: gwshan, andrew, andrew, netdev, linux-kernel

Hello Dave,

This series adds support to the ftgmac100 driver for the Aspeed ast2400 and
ast2500 SoCs. In particular, they ensure the driver works correctly on the
ast2500 where the MAC block has seen some changes in register layout.

They have been tested on ast2400 and ast2500 systems with the NCSI stack and
with a directly attached PHY.

Cheers,

Joel

Andrew Jeffery (2):
  net/ftgmac100: Separate rx page storage from rxdesc
  net/ftgmac100: Make EDO{R,T}R bits configurable

Gavin Shan (2):
  net/faraday: Avoid PHYSTS_CHG interrupt
  net/faraday: Clear stale interrupts

Joel Stanley (3):
  net/ftgmac100: Adapt for Aspeed SoCs
  net/faraday: Fix phy link irq on Aspeed G5 SoCs
  net/faraday: Configure old MDIO interface on Aspeed SoCs

 drivers/net/ethernet/faraday/ftgmac100.c | 92 ++++++++++++++++++++++++--------
 drivers/net/ethernet/faraday/ftgmac100.h |  8 ++-
 2 files changed, 77 insertions(+), 23 deletions(-)

-- 
2.9.3


* [PATCHv2 net] cxgb4/cxgb4vf: Allocate more queues for 25G and 100G adapter
From: Hariprasad Shenai @ 2016-09-20  6:30 UTC (permalink / raw)
  To: netdev; +Cc: davem, leedom, nirranjan, Hariprasad Shenai

We were missing the check for 25G and 100G when checking port speed,
which led to fewer queues being allocated for 25G and 100G adapters
and hence to low throughput. Add the missing check to both the NIC and
vNIC drivers.

Also fix port advertisement for 25G and 100G in ethtool output.

Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
---
V2: Missed 25G in the first one

 drivers/net/ethernet/chelsio/cxgb4/cxgb4.h         |  4 ++--
 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c    | 15 +++++++++++++--
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c         |  7 ++++++-
 drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h      |  6 ++++++
 drivers/net/ethernet/chelsio/cxgb4vf/t4vf_common.h | 15 +++++++++++----
 drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c     |  9 +++++++--
 6 files changed, 45 insertions(+), 11 deletions(-)
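
The mask-based check can be exercised in isolation. This is a minimal
sketch. The shift (0) and mask (0x3f) are the ones added by this patch,
while the individual CAP_* bit values and CAP_ANEG are illustrative
assumptions, and is_high_speed_port() is a hypothetical stand-in for
is_x_10g_port().

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative bit assignments; the real values live in t4fw_api.h. */
#define CAP_SPEED_100M	0x01u
#define CAP_SPEED_1G	0x02u
#define CAP_SPEED_25G	0x04u
#define CAP_SPEED_10G	0x08u
#define CAP_SPEED_40G	0x10u
#define CAP_SPEED_100G	0x20u
#define CAP_ANEG	0x100u	/* hypothetical non-speed capability bit */

/* Shift and mask as added by the patch. */
#define CAP_SPEED_S	0
#define CAP_SPEED_M	0x3fu
#define CAP_SPEED_V(x)	((x) << CAP_SPEED_S)
#define CAP_SPEED_G(x)	(((x) >> CAP_SPEED_S) & CAP_SPEED_M)

/* True if any supported speed is above 1Gb/s. */
static bool is_high_speed_port(unsigned int supported)
{
	unsigned int speeds = CAP_SPEED_V(CAP_SPEED_G(supported));
	unsigned int high_speeds = speeds & ~(CAP_SPEED_100M | CAP_SPEED_1G);

	/* Non-speed capability bits must not leak into the result. */
	assert((high_speeds & ~CAP_SPEED_M) == 0);
	return high_speeds != 0;
}
```

With this form, any speed bit inside the mask other than 100M and 1G
counts as high speed, so a newly added speed does not need its own
check. The widening of speed to unsigned int matters too: 100000 does
not fit in a 16-bit unsigned short.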

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
index 2e2aa9fec9bb..edd23386b47d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4.h
@@ -419,8 +419,8 @@ struct link_config {
 	unsigned short supported;        /* link capabilities */
 	unsigned short advertising;      /* advertised capabilities */
 	unsigned short lp_advertising;   /* peer advertised capabilities */
-	unsigned short requested_speed;  /* speed user has requested */
-	unsigned short speed;            /* actual link speed */
+	unsigned int   requested_speed;  /* speed user has requested */
+	unsigned int   speed;            /* actual link speed */
 	unsigned char  requested_fc;     /* flow control user has requested */
 	unsigned char  fc;               /* actual link flow control */
 	unsigned char  autoneg;          /* autonegotiating? */
diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index c762a8c8c954..3ceafb55d6da 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -4305,10 +4305,17 @@ static const struct pci_error_handlers cxgb4_eeh = {
 	.resume         = eeh_resume,
 };
 
+/* Return true if the Link Configuration supports "High Speeds" (those greater
+ * than 1Gb/s).
+ */
 static inline bool is_x_10g_port(const struct link_config *lc)
 {
-	return (lc->supported & FW_PORT_CAP_SPEED_10G) != 0 ||
-	       (lc->supported & FW_PORT_CAP_SPEED_40G) != 0;
+	unsigned int speeds, high_speeds;
+
+	speeds = FW_PORT_CAP_SPEED_V(FW_PORT_CAP_SPEED_G(lc->supported));
+	high_speeds = speeds & ~(FW_PORT_CAP_SPEED_100M | FW_PORT_CAP_SPEED_1G);
+
+	return high_speeds != 0;
 }
 
 static inline void init_rspq(struct adapter *adap, struct sge_rspq *q,
@@ -4756,8 +4763,12 @@ static void print_port_info(const struct net_device *dev)
 		bufp += sprintf(bufp, "1000/");
 	if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_10G)
 		bufp += sprintf(bufp, "10G/");
+	if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_25G)
+		bufp += sprintf(bufp, "25G/");
 	if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_40G)
 		bufp += sprintf(bufp, "40G/");
+	if (pi->link_cfg.supported & FW_PORT_CAP_SPEED_100G)
+		bufp += sprintf(bufp, "100G/");
 	if (bufp != buf)
 		--bufp;
 	sprintf(bufp, "BASE-%s", t4_get_port_type_description(pi->port_type));
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index dc92c80a75f4..660204bff726 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -3627,7 +3627,8 @@ void t4_ulprx_read_la(struct adapter *adap, u32 *la_buf)
 }
 
 #define ADVERT_MASK (FW_PORT_CAP_SPEED_100M | FW_PORT_CAP_SPEED_1G |\
-		     FW_PORT_CAP_SPEED_10G | FW_PORT_CAP_SPEED_40G | \
+		     FW_PORT_CAP_SPEED_10G | FW_PORT_CAP_SPEED_25G | \
+		     FW_PORT_CAP_SPEED_40G | FW_PORT_CAP_SPEED_100G | \
 		     FW_PORT_CAP_ANEG)
 
 /**
@@ -7196,8 +7197,12 @@ void t4_handle_get_port_info(struct port_info *pi, const __be64 *rpl)
 		speed = 1000;
 	else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_10G))
 		speed = 10000;
+	else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_25G))
+		speed = 25000;
 	else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_40G))
 		speed = 40000;
+	else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_100G))
+		speed = 100000;
 
 	lc = &pi->link_cfg;
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
index a89b30720e38..30507d44422c 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4fw_api.h
@@ -2265,6 +2265,12 @@ enum fw_port_cap {
 	FW_PORT_CAP_802_3_ASM_DIR	= 0x8000,
 };
 
+#define FW_PORT_CAP_SPEED_S     0
+#define FW_PORT_CAP_SPEED_M     0x3f
+#define FW_PORT_CAP_SPEED_V(x)  ((x) << FW_PORT_CAP_SPEED_S)
+#define FW_PORT_CAP_SPEED_G(x) \
+	(((x) >> FW_PORT_CAP_SPEED_S) & FW_PORT_CAP_SPEED_M)
+
 enum fw_port_mdi {
 	FW_PORT_CAP_MDI_UNCHANGED,
 	FW_PORT_CAP_MDI_AUTO,
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_common.h b/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_common.h
index 8ee541431e8b..17a2bbcf93f0 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_common.h
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_common.h
@@ -108,8 +108,8 @@ struct link_config {
 	unsigned int   supported;        /* link capabilities */
 	unsigned int   advertising;      /* advertised capabilities */
 	unsigned short lp_advertising;   /* peer advertised capabilities */
-	unsigned short requested_speed;  /* speed user has requested */
-	unsigned short speed;            /* actual link speed */
+	unsigned int   requested_speed;  /* speed user has requested */
+	unsigned int   speed;            /* actual link speed */
 	unsigned char  requested_fc;     /* flow control user has requested */
 	unsigned char  fc;               /* actual link flow control */
 	unsigned char  autoneg;          /* autonegotiating? */
@@ -271,10 +271,17 @@ static inline bool is_10g_port(const struct link_config *lc)
 	return (lc->supported & FW_PORT_CAP_SPEED_10G) != 0;
 }
 
+/* Return true if the Link Configuration supports "High Speeds" (those greater
+ * than 1Gb/s).
+ */
 static inline bool is_x_10g_port(const struct link_config *lc)
 {
-	return (lc->supported & FW_PORT_CAP_SPEED_10G) != 0 ||
-		(lc->supported & FW_PORT_CAP_SPEED_40G) != 0;
+	unsigned int speeds, high_speeds;
+
+	speeds = FW_PORT_CAP_SPEED_V(FW_PORT_CAP_SPEED_G(lc->supported));
+	high_speeds = speeds & ~(FW_PORT_CAP_SPEED_100M | FW_PORT_CAP_SPEED_1G);
+
+	return high_speeds != 0;
 }
 
 static inline unsigned int core_ticks_per_usec(const struct adapter *adapter)
diff --git a/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c b/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c
index 427bfa71388b..b5622b1689e9 100644
--- a/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4vf/t4vf_hw.c
@@ -314,8 +314,9 @@ int t4vf_wr_mbox_core(struct adapter *adapter, const void *cmd, int size,
 }
 
 #define ADVERT_MASK (FW_PORT_CAP_SPEED_100M | FW_PORT_CAP_SPEED_1G |\
-		     FW_PORT_CAP_SPEED_10G | FW_PORT_CAP_SPEED_40G | \
-		     FW_PORT_CAP_SPEED_100G | FW_PORT_CAP_ANEG)
+		     FW_PORT_CAP_SPEED_10G | FW_PORT_CAP_SPEED_25G | \
+		     FW_PORT_CAP_SPEED_40G | FW_PORT_CAP_SPEED_100G | \
+		     FW_PORT_CAP_ANEG)
 
 /**
  *	init_link_config - initialize a link's SW state
@@ -1712,8 +1713,12 @@ int t4vf_handle_fw_rpl(struct adapter *adapter, const __be64 *rpl)
 			speed = 1000;
 		else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_10G))
 			speed = 10000;
+		else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_25G))
+			speed = 25000;
 		else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_40G))
 			speed = 40000;
+		else if (stat & FW_PORT_CMD_LSPEED_V(FW_PORT_CAP_SPEED_100G))
+			speed = 100000;
 
 		/*
 		 * Scan all of our "ports" (Virtual Interfaces) looking for
-- 
2.3.4


* Re: [PATCH net] cxgb4/cxgb4vf: Allocate more queues for 100G adapter
From: Hariprasad Shenai @ 2016-09-20  6:18 UTC (permalink / raw)
  To: netdev; +Cc: davem, leedom, nirranjan
In-Reply-To: <1474272166-30147-1-git-send-email-hariprasad@chelsio.com>

On Mon, Sep 19, 2016 at 01:32:46PM +0530, Hariprasad Shenai wrote:
> We were missing check for 100G while checking port speed, which lead to
> less number of queues getting allocated for 100G and leading to low
> throughput. Adding the missing check for both NIC and vNIC driver.
> 
> Signed-off-by: Hariprasad Shenai <hariprasad@chelsio.com>
> ---

Hi David,

I missed 25G, will send a V2 for the same with changes for 25Gbps adapter too.

Thanks,
Hari 


* Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes
From: Roopa Prabhu @ 2016-09-20  6:18 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, netdev@vger.kernel.org, davem@davemloft.net,
	Ido Schimmel, Elad Raz, Yotam Gigi, Nogah Frankel, Or Gerlitz,
	Nikolay Aleksandrov, John Linville, Thomas Graf, Andy Gospodarek,
	Scott Feldman, Alexei Starovoitov, Eric Dumazet,
	hannes@stressinduktion.org, David Ahern, Jamal Hadi Salim,
	Vivien Didelot <vivien.didelot@
In-Reply-To: <20160920060239.GC1843@nanopsycho.orion>

On Mon, Sep 19, 2016 at 11:02 PM, Jiri Pirko <jiri@resnulli.us> wrote:
> Tue, Sep 20, 2016 at 07:49:47AM CEST, roopa@cumulusnetworks.com wrote:

[snip]

>>
>>Do you see any scale problems with using notifiers ?. as you know these ascis can scale to
>>32k-128k routes.
>
> I don't see any problem there. What do you think might be wrong?
>

We had seen some overhead with link notifiers in older kernels with a
large number of link flaps.
But that could have been due to the rtnl lock. I don't have any more
data than that, so ignore it.
I don't see anything obvious from a perf perspective... rtnl is already
held. But I thought I'd just ask.


* Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes
From: Jiri Pirko @ 2016-09-20  6:02 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Florian Fainelli, netdev, davem, idosch, eladr, yotamg, nogahf,
	ogerlitz, nikolay, linville, tgraf, gospo, sfeldma, ast, edumazet,
	hannes, dsa, jhs, vivien.didelot, john.fastabend, andrew, ivecera
In-Reply-To: <57E0CDFB.2020704@cumulusnetworks.com>

Tue, Sep 20, 2016 at 07:49:47AM CEST, roopa@cumulusnetworks.com wrote:
>On 9/19/16, 8:15 AM, Jiri Pirko wrote:
>> Mon, Sep 19, 2016 at 04:59:22PM CEST, roopa@cumulusnetworks.com wrote:
>>> On 9/18/16, 11:14 PM, Jiri Pirko wrote:
>>>> Mon, Sep 19, 2016 at 01:16:17AM CEST, roopa@cumulusnetworks.com wrote:
>>>>> On 9/18/16, 1:00 PM, Florian Fainelli wrote:
>>>>>> Le 06/09/2016 à 05:01, Jiri Pirko a écrit :
>>>>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>>>>
>>>>>>> This is RFC, unfinished. I came across some issues in the process so I would
>>>>>>> like to share those and restart the fib offload discussion in order to make it
>>>>>>> really usable.
>>>>>>>
>>>>>>> So the goal of this patchset is to allow driver to propagate all prefixes
>>>>>>> configured in kernel down HW. This is necessary for routing to work
>>>>>>> as expected. If we don't do that HW might forward prefixes known to kernel
>>>>>>> incorrectly. Take an example when default route is set in switch HW and there
>>>>>>> is an IP address set on a management (non-switch) port.
>>>>>>>
>>>>>>> Currently, only fibs related to the switch port netdev are offloaded using
>>>>>>> switchdev ops. This model is not extendable so the first patch introduces
>>>>>>> a replacement: notifier to propagate fib additions and removals to whoever
>>>>>>> interested. The second patch makes mlxsw to adopt this new way, registering
>>>>>>> one notifier block for each mlxsw (asic) instance.
>>>>>> Instead of introducing another specialization of a notifier_block
>>>>>> implementation, could we somehow have a kernel-based netlink listener
>>>>>> which receives the same kind of event information from rtmsg_fib()?
>>>>>>
>>>>>> The reason is that having such a facility would hook directly onto
>>>>>> existing rtmsg_* calls that exist throughout the stack, and that seems
>>>>>> to scale better.
>>>>> I was thinking along the same lines. Instead of proliferating notifier blocks
>>>>> through-out the stack for switchdev offload, putting existing events to use would be nice.
>>>>>
>>>>> But the problem though is drivers having to parse the netlink msg again. also, the intent
>>>>> here is to do the offload first ..before the route is added to the kernel (though i don't see that in
>>>>> the current series). existing netlink rmsg_fib events are generated after the route is added to the kernel.
>>>>>
>>>>>
>>>>> Jiri, instead of the notifier, do you see a problem with always calling the existing switchdev
>>>>> offload api for every route  for every asic instance ?. the first device where the route fits wins.
>>>> There is not list of asic instances. Therefore the notifier fits much better here.
>>>>
>>>>
>>>>
>>>>> it seems similar to driver registering for notifier and looking at every route ...
>>>>> am i missing something ?
>>>>> and the policies you mention could help around selecting the asic instance (FCFS or mirror).
>>>>> you will need to abstract out the asic instance for switchdev api to call on, but I thought you
>>>>> already have that in some form in your devlink infrastructure.
>>>> switchdev asic instances and devlink instances are orthogonal.
>>> maybe it is not today...but the requirement for devlink was to provide a way to communicate
>>> to the switch driver
>>> - global switch attributes or
>>> - things that cannot go via switch ports (exactly the problem you are trying to solve for routes here)
>> Devlink is a general beast, not switch specific one. I see no need to
>> use fib->devlink->driver route inside kernel. Devlink is for userspace
>> facing.
>
>yes, sure. it has a dev abstraction and an api. devlink discussion started a few years ago in the context
>of switch asics for the very same reason that it will help direct the offload call to the
>switch device driver when you cant apply the settings on a per port basis.
>You have kept the abstraction and api generic ..which is a great thing.
>But that can't be the reason for it to not support its original intent...if there is a way.
>
>>
>>
>>> so,  maybe an instance of switch asic modeled via devlink will help here and possibly all/other switchdev
>>> offload hooks ?
>> Maybe, but in case of fibs, the notifier just fits great. I see no need
>> for anything else.
>
>I think its better to stick with 'offload api or notifier' whichever we pick ..
>to be consistent with other switchdev offload areas. That was the original intent of
>introducing the switchdev api layer. If we are now replacing the switchdev api with notifiers,

I strongly disagree. Making it uniform is not desirable. For some things,
a direct ndo/sdo makes sense and is better. For other things, a notifier
fits better. For example, when I was implementing LAG offload,
I also chose a notifier.
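
For illustration only, here is a minimal userspace sketch of the notifier
model under discussion. It is not the kernel notifier API; every name in it
(fib_notifier, fib_notify, struct asic) is hypothetical. Each asic instance
registers one block and then sees every fib event, whichever netdev the
route belongs to.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical event types; not the kernel's definitions. */
enum fib_event { FIB_EVENT_ADD, FIB_EVENT_DEL };

struct fib_notifier {
	/* Called once for every fib event on the chain. */
	void (*call)(struct fib_notifier *nb, enum fib_event ev,
		     unsigned int prefix);
	struct fib_notifier *next;
};

static struct fib_notifier *fib_chain;

/* Add a listener to the head of the chain. */
static void fib_notifier_register(struct fib_notifier *nb)
{
	assert(nb != NULL && nb->call != NULL);
	nb->next = fib_chain;
	fib_chain = nb;
}

/* Deliver one event to every registered listener. */
static void fib_notify(enum fib_event ev, unsigned int prefix)
{
	for (struct fib_notifier *nb = fib_chain; nb; nb = nb->next)
		nb->call(nb, ev, prefix);
}

/* Hypothetical per-asic listener that only counts offloaded routes. */
struct asic {
	struct fib_notifier nb;	/* must stay first so the cast below is valid */
	int routes;
};

static void asic_fib_event(struct fib_notifier *nb, enum fib_event ev,
			   unsigned int prefix)
{
	struct asic *a = (struct asic *)nb;

	(void)prefix;	/* a real driver would program the prefix into hw */
	a->routes += (ev == FIB_EVENT_ADD) ? 1 : -1;
}
```

The core code needs no list of asic instances here; it only walks whatever
happens to be registered.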


>assuming 'notifiers are the best way' to offload routes, lets keep it consistent with
>other switchdev offload areas too.
>
>I know you already have them for links...and that is good..because links already have notifiers.
>we will need the same thing for acls. Having notifiers for acls too seems like an overkill.

Acls will reuse the tc ndo infra. No notifiers are required there.


>we will then have to extend this to multicast and mpls routes too. will all these be notifiers too ?

I believe so.


>
>Do you see any scale problems with using notifiers ?. as you know these ascis can scale to
>32k-128k routes.

I don't see any problem there. What do you think might be wrong?


>
>lets discuss more at netdev1.2..if your patches are not in by then.
>
>thanks,
>Roopa
>
>

^ permalink raw reply

* [PATCH v3] iproute2: build nsid-name cache only for commands that need it
From: Anton Aksola @ 2016-09-20  6:01 UTC (permalink / raw)
  To: netdev; +Cc: nicolas.dichtel, vadim4j

Calling netns_map_init() before command parsing introduced
a performance issue with a large number of namespaces.

As commands such as add, del and exec do not need to iterate through
/var/run/netns, it would be good not to build the cache before executing
these commands.

Example:
unpatched:
time seq 1 1000 | xargs -n 1 ip netns add

real    0m16.832s
user    0m1.350s
sys    0m15.029s

patched:
time seq 1 1000 | xargs -n 1 ip netns add

real    0m3.859s
user    0m0.132s
sys    0m3.205s

Signed-off-by: Anton Aksola <aakso@iki.fi>
---
 ip/ip_common.h                            |  1 +
 ip/ipmonitor.c                            |  1 +
 ip/ipnetns.c                              | 31 ++++++++++++++++++++++---------
 testsuite/tests/ip/netns/set_nsid.t       | 22 ++++++++++++++++++++++
 testsuite/tests/ip/netns/set_nsid_batch.t | 18 ++++++++++++++++++
 5 files changed, 64 insertions(+), 9 deletions(-)
 create mode 100755 testsuite/tests/ip/netns/set_nsid.t
 create mode 100755 testsuite/tests/ip/netns/set_nsid_batch.t

diff --git a/ip/ip_common.h b/ip/ip_common.h
index 93ff5bc..fabc4b5 100644
--- a/ip/ip_common.h
+++ b/ip/ip_common.h
@@ -31,6 +31,7 @@ int print_netconf(const struct sockaddr_nl *who,
 		  struct rtnl_ctrl_data *ctrl,
 		  struct nlmsghdr *n, void *arg);
 void netns_map_init(void);
+void netns_nsid_socket_init(void);
 int print_nsid(const struct sockaddr_nl *who,
 	       struct nlmsghdr *n, void *arg);
 int do_ipaddr(int argc, char **argv);
diff --git a/ip/ipmonitor.c b/ip/ipmonitor.c
index 2090a45..c892b8f 100644
--- a/ip/ipmonitor.c
+++ b/ip/ipmonitor.c
@@ -301,6 +301,7 @@ int do_ipmonitor(int argc, char **argv)
 		exit(1);
 
 	ll_init_map(&rth);
+	netns_nsid_socket_init();
 	netns_map_init();
 
 	if (rtnl_listen(&rth, accept_msg, stdout) < 0)
diff --git a/ip/ipnetns.c b/ip/ipnetns.c
index af87065..6b42751 100644
--- a/ip/ipnetns.c
+++ b/ip/ipnetns.c
@@ -194,6 +194,18 @@ static void netns_map_del(struct nsid_cache *c)
 	free(c);
 }
 
+void netns_nsid_socket_init(void)
+{
+	if (rtnsh.fd > -1 || !ipnetns_have_nsid())
+		return;
+
+	if (rtnl_open(&rtnsh, 0) < 0) {
+		fprintf(stderr, "Cannot open rtnetlink\n");
+		exit(1);
+	}
+
+}
+
 void netns_map_init(void)
 {
 	static int initialized;
@@ -204,11 +216,6 @@ void netns_map_init(void)
 	if (initialized || !ipnetns_have_nsid())
 		return;
 
-	if (rtnl_open(&rtnsh, 0) < 0) {
-		fprintf(stderr, "Cannot open rtnetlink\n");
-		exit(1);
-	}
-
 	dir = opendir(NETNS_RUN_DIR);
 	if (!dir)
 		return;
@@ -775,17 +782,23 @@ static int netns_monitor(int argc, char **argv)
 
 int do_netns(int argc, char **argv)
 {
-	netns_map_init();
+	netns_nsid_socket_init();
 
-	if (argc < 1)
+	if (argc < 1) {
+		netns_map_init();
 		return netns_list(0, NULL);
+	}
 
 	if ((matches(*argv, "list") == 0) || (matches(*argv, "show") == 0) ||
-	    (matches(*argv, "lst") == 0))
+	    (matches(*argv, "lst") == 0)) {
+		netns_map_init();
 		return netns_list(argc-1, argv+1);
+	}
 
-	if ((matches(*argv, "list-id") == 0))
+	if ((matches(*argv, "list-id") == 0)) {
+		netns_map_init();
 		return netns_list_id(argc-1, argv+1);
+	}
 
 	if (matches(*argv, "help") == 0)
 		return usage();
diff --git a/testsuite/tests/ip/netns/set_nsid.t b/testsuite/tests/ip/netns/set_nsid.t
new file mode 100755
index 0000000..606d45a
--- /dev/null
+++ b/testsuite/tests/ip/netns/set_nsid.t
@@ -0,0 +1,22 @@
+#!/bin/sh
+
+source lib/generic.sh
+
+ts_log "[Testing netns nsid]"
+
+NS=testnsid
+NSID=99
+
+ts_ip "$0" "Add new netns $NS" netns add $NS
+ts_ip "$0" "Set $NS nsid to $NSID" netns set $NS $NSID
+
+ts_ip "$0" "List netns" netns list
+test_on "$NS \(id: $NSID\)"
+
+ts_ip "$0" "List netns without explicit list or show" netns
+test_on "$NS \(id: $NSID\)"
+
+ts_ip "$0" "List nsid" netns list-id
+test_on "$NSID \(iproute2 netns name: $NS\)"
+
+ts_ip "$0" "Delete netns $NS" netns del $NS
diff --git a/testsuite/tests/ip/netns/set_nsid_batch.t b/testsuite/tests/ip/netns/set_nsid_batch.t
new file mode 100755
index 0000000..abb3f1b
--- /dev/null
+++ b/testsuite/tests/ip/netns/set_nsid_batch.t
@@ -0,0 +1,18 @@
+#!/bin/sh
+
+source lib/generic.sh
+
+ts_log "[Testing netns nsid in batch mode]"
+
+NS=testnsid
+NSID=99
+BATCHFILE=`mktemp`
+
+echo "netns add $NS" >> $BATCHFILE
+echo "netns set $NS $NSID" >> $BATCHFILE
+echo "netns list-id" >> $BATCHFILE
+ts_ip "$0" "Add ns, set nsid and list in batch mode" -b $BATCHFILE
+test_on "nsid $NSID \(iproute2 netns name: $NS\)"
+rm -f $BATCHFILE
+
+ts_ip "$0" "Delete netns $NS" netns del $NS
-- 
1.8.3.1
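
The idea behind the patch can be sketched in isolation. This is a simplified
userspace model, not the iproute2 code: cache_init() stands in for
netns_map_init(), run_command() stands in for the dispatch in do_netns(),
and exact strcmp() matching replaces iproute2's matches() prefix matching.
All of these names are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

static int cache_builds;	/* counts how often the expensive scan ran */

/* Stand-in for netns_map_init(): scan once, then do nothing. */
static void cache_init(void)
{
	static int initialized;

	if (initialized)
		return;
	initialized = 1;
	cache_builds++;		/* the real code walks /var/run/netns here */
}

/* Returns 1 if the subcommand needed the cache, 0 otherwise. */
static int run_command(const char *cmd)
{
	assert(cmd != NULL);

	if (strcmp(cmd, "list") == 0 || strcmp(cmd, "list-id") == 0) {
		cache_init();	/* built only on the paths that read it */
		return 1;
	}
	return 0;		/* add, del, exec, ...: cache never built */
}
```

With this shape, a run of 1000 "add" commands never pays for the directory
scan, which is where the sys time in the unpatched numbers goes.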

^ permalink raw reply related

* Re: [net-next PATCH] net: netlink messages for HW addr programming
From: Jiri Pirko @ 2016-09-20  5:49 UTC (permalink / raw)
  To: Roopa Prabhu
  Cc: Patrick Ruddy, stephen@networkplumber.org, netdev@vger.kernel.org,
	davem@davemloft.net, Luca Boccassi, alexander.h.duyck@intel.com,
	Sven-Thorsten Dietrich
In-Reply-To: <57E0C9AF.4090406@cumulusnetworks.com>

Tue, Sep 20, 2016 at 07:31:27AM CEST, roopa@cumulusnetworks.com wrote:
>On 9/19/16, 7:46 AM, Patrick Ruddy wrote:
>> On Sun, 2016-09-18 at 07:51 -0700, Roopa Prabhu wrote:
>>> On 9/15/16, 9:48 AM, Patrick Ruddy wrote:
>>>> Add RTM_NEWADDR and RTM_DELADDR netlink messages with family
>>>> AF_UNSPEC to indicate interest in specific unicast and multicast
>>>> hardware addresses. These messages are sent when addresses are
>>>> added or deleted from the appropriate interface driver.
>>>> Added AF_UNSPEC GETADDR function to allow the netlink notifications
>>>> to be replayed to avoid loss of state due to application start
>>>> ordering or restart.
>>>>
>>>> Signed-off-by: Patrick Ruddy <pruddy@brocade.com>
>>>> ---
>>> RTM_NEWADDR and RTM_DELADDR are not used to add these entries to the kernel.
>>> so, it seems a bit wrong to use RTM_NEWADDR and RTM_DELADDR to notify them to
>>> userspace and also to request a special dump of these addresses.
>>>
>>> This could just be a new nested netlink attribute in the existing link dump ?
>> Hi Roopa
>>
>> Thanks for the review. I did initially code this using NEW/DEL/GET_LINK
>> messages but was asked to change to to ADDR messages by Stephen
>> Hemminger (cc'd). 
>>
>> However I agree that these addresses fall between the LINK and ADDR
>> areas so I'm happy to change this if we can reach some consensus on the
>> format.
>>
>ok, thanks for the history. yes, they do lie in a weird spot.

They are l2 addresses; they should be treated accordingly. Am I missing
something?


>the general convention for other rtnl registrations seems to be
>AF_UNSPEC family means include all supported families. thats where this seems a bit odd.
>
>On the other hand, one reason I see where using RTM_*ADDR will be useful for this is if we wanted
>to provide a way to add these uc and mc address via ip addr add in the future.
>ip addr add <lladdr> dev eth0
>
>Does this patch allow that in the future ?

This should go under ip link, I believe. "ip addr" is for l3.


>
>also, will these l2 addresses now show up in 'ip addr show' output ?.
>
>thanks,
>Roopa
>

^ permalink raw reply

* Re: [patch net-next RFC 0/2] fib4 offload: notifier to let hw to be aware of all prefixes
From: Roopa Prabhu @ 2016-09-20  5:49 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: Florian Fainelli, netdev, davem, idosch, eladr, yotamg, nogahf,
	ogerlitz, nikolay, linville, tgraf, gospo, sfeldma, ast, edumazet,
	hannes, dsa, jhs, vivien.didelot, john.fastabend, andrew, ivecera
In-Reply-To: <20160919151549.GE1846@nanopsycho.orion>

On 9/19/16, 8:15 AM, Jiri Pirko wrote:
> Mon, Sep 19, 2016 at 04:59:22PM CEST, roopa@cumulusnetworks.com wrote:
>> On 9/18/16, 11:14 PM, Jiri Pirko wrote:
>>> Mon, Sep 19, 2016 at 01:16:17AM CEST, roopa@cumulusnetworks.com wrote:
>>>> On 9/18/16, 1:00 PM, Florian Fainelli wrote:
>>>>> Le 06/09/2016 à 05:01, Jiri Pirko a écrit :
>>>>>> From: Jiri Pirko <jiri@mellanox.com>
>>>>>>
>>>>>> This is RFC, unfinished. I came across some issues in the process so I would
>>>>>> like to share those and restart the fib offload discussion in order to make it
>>>>>> really usable.
>>>>>>
>>>>>> So the goal of this patchset is to allow driver to propagate all prefixes
>>>>>> configured in kernel down HW. This is necessary for routing to work
>>>>>> as expected. If we don't do that HW might forward prefixes known to kernel
>>>>>> incorrectly. Take an example when default route is set in switch HW and there
>>>>>> is an IP address set on a management (non-switch) port.
>>>>>>
>>>>>> Currently, only fibs related to the switch port netdev are offloaded using
>>>>>> switchdev ops. This model is not extendable so the first patch introduces
>>>>>> a replacement: notifier to propagate fib additions and removals to whoever
>>>>>> interested. The second patch makes mlxsw to adopt this new way, registering
>>>>>> one notifier block for each mlxsw (asic) instance.
>>>>> Instead of introducing another specialization of a notifier_block
>>>>> implementation, could we somehow have a kernel-based netlink listener
>>>>> which receives the same kind of event information from rtmsg_fib()?
>>>>>
>>>>> The reason is that having such a facility would hook directly onto
>>>>> existing rtmsg_* calls that exist throughout the stack, and that seems
>>>>> to scale better.
>>>> I was thinking along the same lines. Instead of proliferating notifier blocks
>>>> through-out the stack for switchdev offload, putting existing events to use would be nice.
>>>>
>>>> But the problem though is drivers having to parse the netlink msg again. also, the intent
>>>> here is to do the offload first ..before the route is added to the kernel (though i don't see that in
>>>> the current series). existing netlink rmsg_fib events are generated after the route is added to the kernel.
>>>>
>>>>
>>>> Jiri, instead of the notifier, do you see a problem with always calling the existing switchdev
>>>> offload api for every route  for every asic instance ?. the first device where the route fits wins.
>>> There is not list of asic instances. Therefore the notifier fits much better here.
>>>
>>>
>>>
>>>> it seems similar to driver registering for notifier and looking at every route ...
>>>> am i missing something ?
>>>> and the policies you mention could help around selecting the asic instance (FCFS or mirror).
>>>> you will need to abstract out the asic instance for switchdev api to call on, but I thought you
>>>> already have that in some form in your devlink infrastructure.
>>> switchdev asic instances and devlink instances are orthogonal.
>> maybe it is not today...but the requirement for devlink was to provide a way to communicate
>> to the switch driver
>> - global switch attributes or
>> - things that cannot go via switch ports (exactly the problem you are trying to solve for routes here)
> Devlink is a general beast, not switch specific one. I see no need to
> use fib->devlink->driver route inside kernel. Devlink is for userspace
> facing.

Yes, sure. It has a dev abstraction and an api. The devlink discussion started a few years ago in the context
of switch asics for the very same reason: it will help direct the offload call to the
switch device driver when you can't apply the settings on a per-port basis.
You have kept the abstraction and api generic, which is a great thing.
But that can't be the reason for it to not support its original intent, if there is a way.

>
>
>> so,  maybe an instance of switch asic modeled via devlink will help here and possibly all/other switchdev
>> offload hooks ?
> Maybe, but in case of fibs, the notifier just fits great. I see no need
> for anything else.

I think it's better to stick with 'offload api or notifier', whichever we pick,
to be consistent with other switchdev offload areas. That was the original intent of
introducing the switchdev api layer. If we are now replacing the switchdev api with notifiers,
assuming 'notifiers are the best way' to offload routes, let's keep it consistent with
other switchdev offload areas too.

I know you already have them for links, and that is good, because links already have notifiers.
We will need the same thing for acls. Having notifiers for acls too seems like overkill.
We will then have to extend this to multicast and mpls routes too. Will all these be notifiers too?

Do you see any scale problems with using notifiers? As you know, these asics can scale to
32k-128k routes.

Let's discuss more at netdev1.2, if your patches are not in by then.

thanks,
Roopa

^ permalink raw reply

* Re: [PATCH v6 5/6] net: ipv4, ipv6: run cgroup eBPF egress programs
From: kbuild test robot @ 2016-09-20  5:44 UTC (permalink / raw)
  To: Daniel Mack
  Cc: kbuild-all-JC7UmRfGjtg, htejun-b10kYP2dOMg,
	daniel-FeC+5ew28dpmcu3hnIyYJQ, ast-b10kYP2dOMg,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q, kafai-b10kYP2dOMg,
	fw-HFFVJYpyMKqzQB+pC5nmwQ, pablo-Cap9r6Oaw4JrovVCs/uTlw,
	harald-H+wXaHxf7aLQT0dZR+AlfA, netdev-u79uwXL29TY76Z2rM5mHXA,
	sargun-GaZTRHToo+CzQB+pC5nmwQ, cgroups-u79uwXL29TY76Z2rM5mHXA,
	Daniel Mack
In-Reply-To: <1474303441-3745-6-git-send-email-daniel-cYrQPVfZoowdnm+yROfE0A@public.gmane.org>

[-- Attachment #1: Type: text/plain, Size: 1039 bytes --]

Hi Daniel,

[auto build test ERROR on next-20160919]
[cannot apply to linus/master linux/master net/master v4.8-rc7 v4.8-rc6 v4.8-rc5 v4.8-rc7]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
[Suggest to use git(>=2.9.0) format-patch --base=<commit> (or --base=auto for convenience) to record what (public, well-known) commit your patch series was built on]
[Check https://git-scm.com/docs/git-format-patch for more information]

url:    https://github.com/0day-ci/linux/commits/Daniel-Mack/Add-eBPF-hooks-for-cgroups/20160920-010551
config: i386-allmodconfig (attached as .config)
compiler: gcc-6 (Debian 6.2.0-3) 6.2.0 20160901
reproduce:
        # save the attached .config to linux build tree
        make ARCH=i386 

All errors (new ones prefixed by >>):

>> ERROR: "__cgroup_bpf_run_filter" [net/ipv6/ipv6.ko] undefined!

---
0-DAY kernel test infrastructure                Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all                   Intel Corporation

[-- Attachment #2: .config.gz --]
[-- Type: application/gzip, Size: 56527 bytes --]

^ permalink raw reply

* Re: [net-next PATCH] net: netlink messages for HW addr programming
From: Roopa Prabhu @ 2016-09-20  5:31 UTC (permalink / raw)
  To: Patrick Ruddy
  Cc: stephen@networkplumber.org, netdev@vger.kernel.org,
	davem@davemloft.net, Luca Boccassi, alexander.h.duyck@intel.com,
	jiri@resnulli.us, Sven-Thorsten Dietrich
In-Reply-To: <1474296378.31262.59.camel@brocade.com>

On 9/19/16, 7:46 AM, Patrick Ruddy wrote:
> On Sun, 2016-09-18 at 07:51 -0700, Roopa Prabhu wrote:
>> On 9/15/16, 9:48 AM, Patrick Ruddy wrote:
>>> Add RTM_NEWADDR and RTM_DELADDR netlink messages with family
>>> AF_UNSPEC to indicate interest in specific unicast and multicast
>>> hardware addresses. These messages are sent when addresses are
>>> added or deleted from the appropriate interface driver.
>>> Added AF_UNSPEC GETADDR function to allow the netlink notifications
>>> to be replayed to avoid loss of state due to application start
>>> ordering or restart.
>>>
>>> Signed-off-by: Patrick Ruddy <pruddy@brocade.com>
>>> ---
>> RTM_NEWADDR and RTM_DELADDR are not used to add these entries to the kernel.
>> so, it seems a bit wrong to use RTM_NEWADDR and RTM_DELADDR to notify them to
>> userspace and also to request a special dump of these addresses.
>>
>> This could just be a new nested netlink attribute in the existing link dump ?
> Hi Roopa
>
> Thanks for the review. I did initially code this using NEW/DEL/GET_LINK
> messages but was asked to change to to ADDR messages by Stephen
> Hemminger (cc'd). 
>
> However I agree that these addresses fall between the LINK and ADDR
> areas so I'm happy to change this if we can reach some consensus on the
> format.
>
OK, thanks for the history. Yes, they do lie in a weird spot.
The general convention for other rtnl registrations seems to be that
AF_UNSPEC family means include all supported families. That's where this seems a bit odd.

On the other hand, one reason I see where using RTM_*ADDR will be useful for this is if we wanted
to provide a way to add these uc and mc addresses via ip addr add in the future:
ip addr add <lladdr> dev eth0

Does this patch allow that in the future?

Also, will these l2 addresses now show up in 'ip addr show' output?

thanks,
Roopa

^ permalink raw reply

* Re: [PATCH v2 4/6] net: ethernet: bgmac: convert to feature flags
From: Rafał Miłecki @ 2016-09-20  5:19 UTC (permalink / raw)
  To: Jon Mason
  Cc: David Miller, Florian Fainelli, Hauke Mehrtens, Rob Herring,
	Pawel Moll, Mark Rutland, Ian Campbell, Kumar Gala, Ray Jui,
	Scott Branden, bcm-kernel-feedback-list, Network Development,
	Linux Kernel Mailing List,
	devicetree-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r@public.gmane.org
In-Reply-To: <CACna6rw1j3myPEkbn+YR71W7CF7Of_yahn9eDT7t3hKWGmWUmg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>

On 17 August 2016 at 13:34, Rafał Miłecki <zajec5-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org> wrote:
> On 8 July 2016 at 01:08, Jon Mason <jon.mason-dY08KVG/lbpWk0Htik3J/w@public.gmane.org> wrote:
>>         mode = (bgmac_read(bgmac, BGMAC_DEV_STATUS) & BGMAC_DS_MM_MASK) >>
>>                 BGMAC_DS_MM_SHIFT;
>> -       if (ci->id != BCMA_CHIP_ID_BCM47162 || mode != 0)
>> +       if (bgmac->feature_flags & BGMAC_FEAT_CLKCTLST || mode != 0)
>>                 bgmac_set(bgmac, BCMA_CLKCTLST, BCMA_CLKCTLST_FORCEHT);
>> -       if (ci->id == BCMA_CHIP_ID_BCM47162 && mode == 2)
>> +       if (bgmac->feature_flags & BGMAC_FEAT_CLKCTLST && mode == 2)
>>                 bcma_chipco_chipctl_maskset(&bgmac->core->bus->drv_cc, 1, ~0,
>>                                             BGMAC_CHIPCTL_1_RXC_DLL_BYPASS);
>
> Jon, it looks to me you translated two following conditions:
> ci->id != BCMA_CHIP_ID_BCM47162
> and
> ci->id == BCMA_CHIP_ID_BCM47162
> into the same flag check:
> bgmac->feature_flags & BGMAC_FEAT_CLKCTLST
>
> I don't think it's intentional, is it? Do you have a moment to fix this?

Ping
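
To spell out the concern, here is a small standalone sketch, not driver
code. It assumes the feature flag is set exactly on BCM47162 and that mode
takes the values 0..3; both are assumptions, and the function names are made
up for the example. force_ht_negated() is only one possible repair.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Original check: force HT on every chip except BCM47162, or if mode != 0. */
static bool force_ht_original(bool is_bcm47162, int mode)
{
	return !is_bcm47162 || mode != 0;
}

/* Check as converted in the quoted hunk: the negation is gone. */
static bool force_ht_converted(bool has_clkctlst_flag, int mode)
{
	return has_clkctlst_flag || mode != 0;
}

/* One possible repair: negate the flag test. */
static bool force_ht_negated(bool has_clkctlst_flag, int mode)
{
	return !has_clkctlst_flag || mode != 0;
}

/* Count (chip, mode) pairs where a check disagrees with the original. */
static int count_mismatches(bool (*check)(bool, int))
{
	int mismatches = 0;

	assert(check != NULL);
	for (int chip = 0; chip <= 1; chip++)
		for (int mode = 0; mode <= 3; mode++)
			if (check(chip != 0, mode) !=
			    force_ht_original(chip != 0, mode))
				mismatches++;
	return mismatches;
}
```

Under those assumptions the converted check differs from the original only
when mode == 0, and there it is inverted for both kinds of chip.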

-- 
Rafał
--
To unsubscribe from this list: send the line "unsubscribe devicetree" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH net-next] mlxsw: spectrum: Make offloads stats functions static
From: Or Gerlitz @ 2016-09-20  5:14 UTC (permalink / raw)
  To: David S. Miller; +Cc: Jiri Pirko, Nogah Frankel, netdev, Or Gerlitz

The offloads stats functions are local to this file, make them static.

Fixes: fc1bbb0f1831 ('mlxsw: spectrum: Implement offload stats ndo [..]')
Signed-off-by: Or Gerlitz <ogerlitz@mellanox.com>
---
 drivers/net/ethernet/mellanox/mlxsw/spectrum.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
index 171f8dd..efac909 100644
--- a/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
+++ b/drivers/net/ethernet/mellanox/mlxsw/spectrum.c
@@ -819,7 +819,7 @@ err_span_port_mtu_update:
 	return err;
 }
 
-int
+static int
 mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
 			     struct rtnl_link_stats64 *stats)
 {
@@ -851,7 +851,7 @@ mlxsw_sp_port_get_sw_stats64(const struct net_device *dev,
 	return 0;
 }
 
-bool mlxsw_sp_port_has_offload_stats(int attr_id)
+static bool mlxsw_sp_port_has_offload_stats(int attr_id)
 {
 	switch (attr_id) {
 	case IFLA_OFFLOAD_XSTATS_CPU_HIT:
@@ -861,8 +861,8 @@ bool mlxsw_sp_port_has_offload_stats(int attr_id)
 	return false;
 }
 
-int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev,
-				    void *sp)
+static int mlxsw_sp_port_get_offload_stats(int attr_id, const struct net_device *dev,
+					   void *sp)
 {
 	switch (attr_id) {
 	case IFLA_OFFLOAD_XSTATS_CPU_HIT:
-- 
2.3.7

^ permalink raw reply related

* Re: Reply: [PATCH] sunrpc: queue work on system_power_efficient_wq
From: Chunyan Zhang @ 2016-09-20  5:03 UTC (permalink / raw)
  To: Ke Wang (王科)
  Cc: Anna Schumaker,
	trond.myklebust-7I+n7zu2hftEKMMhf/gKZA@public.gmane.org,
	davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org,
	linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
In-Reply-To: <d72ce63e4a054c8797c9ba92cdad3b5d-GFC6RAiklXKXEiejXgtQ7QQSgKfZeEaX@public.gmane.org>

Resending on behalf of Ke Wang.

Thanks,
Chunyan

On 20 September 2016 at 10:33, Ke Wang (王科) <Ke.Wang@spreadtrum.com> wrote:
> May I have any comments for this patch?
> or
> This patch can be merged directly into next release?
>
> Thanks,
> Ke
> ________________________________________
> From: Anna Schumaker <Anna.Schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org>
> Sent: 2 September 2016 2:46
> To: Chunyan Zhang; trond.myklebust@primarydata.com; anna.schumaker-HgOvQuBEEgTQT0dZR+AlfA@public.gmane.org; davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org
> Cc: linux-nfs-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; linux-kernel-u79uwXL29TY76Z2rM5mHXA@public.gmane.org; Ke Wang (王科)
> Subject: Re: [PATCH] sunrpc: queue work on system_power_efficient_wq
>
> On 09/01/2016 03:30 AM, Chunyan Zhang wrote:
>> From: Ke Wang <ke.wang-lxIno14LUO0EEoCn2XhGlw@public.gmane.org>
>>
>> sunrpc uses workqueue to clean cache regulary. There is no real dependency
>> of executing work on the cpu which queueing it.
>>
>> On a idle system, especially for a heterogeneous systems like big.LITTLE,
>> it is observed that the big idle cpu was woke up many times just to service
>> this work, which against the principle of power saving. It would be better
>> if we can schedule it on a cpu which the scheduler believes to be the most
>> appropriate one.
>>
>> After apply this patch, system_wq will be replaced by
>> system_power_efficient_wq for sunrpc. This functionality is enabled when
>> CONFIG_WQ_POWER_EFFICIENT is selected.
>
> Makes sense to me, but I'm a little surprised that there isn't a "schedule_delayed_power_efficient_work()" function to match how the normal workqueue is used.
>
> Thanks,
> Anna
>
>>
>> Signed-off-by: Ke Wang <ke.wang-lxIno14LUO0EEoCn2XhGlw@public.gmane.org>
>> ---
>>  net/sunrpc/cache.c | 5 +++--
>>  1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/sunrpc/cache.c b/net/sunrpc/cache.c
>> index 4d8e11f..8aabe12 100644
>> --- a/net/sunrpc/cache.c
>> +++ b/net/sunrpc/cache.c
>> @@ -353,7 +353,7 @@ void sunrpc_init_cache_detail(struct cache_detail *cd)
>>       spin_unlock(&cache_list_lock);
>>
>>       /* start the cleaning process */
>> -     schedule_delayed_work(&cache_cleaner, 0);
>> +     queue_delayed_work(system_power_efficient_wq, &cache_cleaner, 0);
>>  }
>>  EXPORT_SYMBOL_GPL(sunrpc_init_cache_detail);
>>
>> @@ -476,7 +476,8 @@ static void do_cache_clean(struct work_struct *work)
>>               delay = 0;
>>
>>       if (delay)
>> -             schedule_delayed_work(&cache_cleaner, delay);
>> +             queue_delayed_work(system_power_efficient_wq,
>> +                                &cache_cleaner, delay);
>>  }
>>
>>
>>
>
--
To unsubscribe from this list: send the line "unsubscribe linux-nfs" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH v2 0/2] act_vlan: Introduce TCA_VLAN_ACT_MODIFY vlan action
From: Shmulik Ladkani @ 2016-09-20  4:39 UTC (permalink / raw)
  To: David S . Miller; +Cc: Jiri Pirko, Jamal Hadi Salim, netdev
In-Reply-To: <1474301470-17965-1-git-send-email-shmulik.ladkani@gmail.com>

This is for net-next, forgot to mention.

Deprecates the v1 of https://patchwork.ozlabs.org/patch/671403/

^ permalink raw reply

* Re: [RFC v3 18/22] cgroup,landlock: Add CGRP_NO_NEW_PRIVS to handle unprivileged hooks
From: Sargun Dhillon @ 2016-09-20  4:37 UTC (permalink / raw)
  To: Mickaël Salaün
  Cc: Alexei Starovoitov, Andy Lutomirski, linux-kernel@vger.kernel.org,
	Alexei Starovoitov, Arnd Bergmann, Casey Schaufler,
	Daniel Borkmann, Daniel Mack, David Drysdale, David S . Miller,
	Elena Reshetova, Eric W . Biederman, James Morris, Kees Cook,
	Paul Moore, Serge E . Hallyn, Tejun Heo, Will Drewry,
	"kernel-hardening@lists.openwall.com"
In-Reply-To: <57DAF96D.3060609@digikod.net>

On Thu, Sep 15, 2016 at 09:41:33PM +0200, Mickaël Salaün wrote:
> 
> On 15/09/2016 06:48, Alexei Starovoitov wrote:
> > On Wed, Sep 14, 2016 at 09:38:16PM -0700, Andy Lutomirski wrote:
> >> On Wed, Sep 14, 2016 at 9:31 PM, Alexei Starovoitov
> >> <alexei.starovoitov@gmail.com> wrote:
> >>> On Wed, Sep 14, 2016 at 09:08:57PM -0700, Andy Lutomirski wrote:
> >>>> On Wed, Sep 14, 2016 at 9:00 PM, Alexei Starovoitov
> >>>> <alexei.starovoitov@gmail.com> wrote:
> >>>>> On Wed, Sep 14, 2016 at 07:27:08PM -0700, Andy Lutomirski wrote:
> >>>>>>>>>
> >>>>>>>>> This RFC handle both cgroup and seccomp approaches in a similar way. I
> >>>>>>>>> don't see why building on top of cgroup v2 is a problem. Is there
> >>>>>>>>> security issues with delegation?
> >>>>>>>>
> >>>>>>>> What I mean is: cgroup v2 delegation has a functionality problem.
> >>>>>>>> Tejun says [1]:
> >>>>>>>>
> >>>>>>>> We haven't had to face this decision because cgroup has never properly
> >>>>>>>> supported delegating to applications and the in-use setups where this
> >>>>>>>> happens are custom configurations where there is no boundary between
> >>>>>>>> system and applications and adhoc trial-and-error is good enough a way
> >>>>>>>> to find a working solution.  That wiggle room goes away once we
> >>>>>>>> officially open this up to individual applications.
> >>>>>>>>
> >>>>>>>> Unless and until that changes, I think that landlock should stay away
> >>>>>>>> from cgroups.  Others could reasonably disagree with me.
> >>>>>>>
> >>>>>>> Ours and Sargun's use cases for cgroup+lsm+bpf is not for security
> >>>>>>> and not for sandboxing. So the above doesn't matter in such contexts.
> >>>>>>> lsm hooks + cgroups provide convenient scope and existing entry points.
> >>>>>>> Please see checmate examples how it's used.
> >>>>>>>
> >>>>>>
> >>>>>> To be clear: I'm not arguing at all that there shouldn't be
> >>>>>> bpf+lsm+cgroup integration.  I'm arguing that the unprivileged
> >>>>>> landlock interface shouldn't expose any cgroup integration, at least
> >>>>>> until the cgroup situation settles down a lot.
> >>>>>
> >>>>> ahh. yes. we're perfectly in agreement here.
> >>>>> I'm suggesting that the next RFC shouldn't include unpriv
> >>>>> and seccomp at all. Once bpf+lsm+cgroup is merged, we can
> >>>>> argue about unpriv with cgroups and even unpriv as a whole,
> >>>>> since it's not a given. Seccomp integration is also questionable.
> >>>>> I'd rather not have seccomp as a gate keeper for this lsm.
> >>>>> lsm and seccomp are orthogonal hook points. Syscalls and lsm hooks
> >>>>> don't have one to one relationship, so mixing them up is only
> >>>>> asking for trouble further down the road.
> >>>>> If we really need to carry some information from seccomp to lsm+bpf,
> >>>>> it's easier to add eBPF support to seccomp and let bpf side deal
> >>>>> with passing whatever information.
> >>>>>
> >>>>
> >>>> As an argument for keeping seccomp (or an extended seccomp) as the
> >>>> interface for an unprivileged bpf+lsm: seccomp already checks off most
> >>>> of the boxes for safely letting unprivileged programs sandbox
> >>>> themselves.
> >>>
> >>> you mean the attach part of seccomp syscall that deals with no_new_priv?
> >>> sure, that's reusable.
> >>>
> >>>> Furthermore, to the extent that there are use cases for
> >>>> unprivileged bpf+lsm that *aren't* expressible within the seccomp
> >>>> hierarchy, I suspect that syscall filters have exactly the same
> >>>> problem and that we should fix seccomp to cover it.
> >>>
> >>> not sure what you mean by 'seccomp hierarchy'. The normal process
> >>> hierarchy ?
> >>
> >> Kind of.  I mean the filter layers that are inherited across fork(),
> >> the TSYNC mechanism, etc.
> >>
> >>> imo the main deficiency of secccomp is inability to look into arguments.
> >>> One can argue that it's a blessing, since composite args
> >>> are not yet copied into the kernel memory.
> >>> But in a lot of cases the seccomp arguments are FDs pointing
> >>> to kernel objects and if programs could examine those objects
> >>> the sandboxing scope would be more precise.
> >>> lsm+bpf solves that part and I'd still argue that it's
> >>> orthogonal to seccomp's pass/reject flow.
> >>> I mean if seccomp says 'ok' the syscall should continue executing
> >>> as normal and whatever LSM hooks were triggered by it may have
> >>> their own lsm+bpf verdicts.
> >>
> >> I agree with all of this...
> >>
> >>> Furthermore in the process hierarchy different children
> >>> should be able to set their own lsm+bpf filters that are not
> >>> related to parallel seccomp+bpf hierarchy of programs.
> >>> seccomp syscall can be an interface to attach programs
> >>> to lsm hooks, but nothing more than that.
> >>
> >> I'm not sure what you mean.  I mean that, logically, I think we should
> >> be able to do:
> >>
> >> seccomp(attach a syscall filter);
> >> fork();
> >> child does seccomp(attach some lsm filters);
> >>
> >> I think that they *should* be related to the seccomp+bpf hierarchy of
> >> programs in that they are entries in the same logical list of filter
> >> layers installed.  Some of those layers can be syscall filters and
> >> some of the layers can be lsm filters.  If we subsequently add a way
> >> to attach a removable seccomp filter or a way to attach a seccomp
> >> filter that logs failures to some fd watched by an outside monitor, I
> >> think that should work for lsm, too, with more or less the same
> >> interface.
> >>
> >> If we need a way for a sandbox manager to opt different children into
> >> different subsets of fancy filters, then I think that syscall filters
> >> and lsm filters should use the same mechanism.
> >>
> >> I think we might be on the same page here and just saying it different ways.
> > 
> > Sounds like it :)
> > All of the above makes sense to me.
> > The 'orthogonal' part is that the user should be able to use
> > this seccomp-managed hierarchy without actually enabling
> > TIF_SECCOMP for the task and syscalls should still go through
> > fast path and all the way till lsm hooks as normal.
> > I don't want to pay _any_ performance penalty for this feature
> > for lsm hooks (and all syscalls) that don't have bpf programs attached.
> 
> Yes, it seems that we are all on the same page here, and that matches this
> RFC implementation. So, using the seccomp(2) *interface* to attach
> Landlock programs to a process hierarchy is still on track. :)
> 

So, I'm catching up on this after a little while away. I really like the
simplicity of the approach Daniel took with his patches. I began to have
difficulty reading your patchset once you got into using seccomp + unprivileged
mode. I would love to see a separate patchset that only has the verifier and
LSM hook changes. Do you think you could decompose your patchset into an MVP?


* Re: [PATCH] net: skbuff: Fix length validation in skb_vlan_pop()
From: Shmulik Ladkani @ 2016-09-20  4:36 UTC (permalink / raw)
  To: pravin shelar
  Cc: Jiri Pirko, David S . Miller, Linux Kernel Network Developers,
	Daniel Borkmann, Jamal Hadi Salim
In-Reply-To: <CAOrHB_Dy3osOea19e+iTxHFpCgJOYe1fUJJnaN-PB-hOq0-rmA@mail.gmail.com>

Hi,

On Mon, 19 Sep 2016 13:46:10 -0700 pravin shelar <pshelar@ovn.org> wrote:
> On Mon, Sep 19, 2016 at 1:04 PM, Shmulik Ladkani
> <shmulik.ladkani@gmail.com> wrote:
> > Hi Pravin,
> >
> > On Sun, 18 Sep 2016 13:26:30 -0700 pravin shelar <pshelar@ovn.org> wrote:  
> >> > +++ b/net/core/skbuff.c
> >> > @@ -4537,7 +4537,7 @@ int skb_vlan_pop(struct sk_buff *skb)
> >> >         } else {
> >> >                 if (unlikely((skb->protocol != htons(ETH_P_8021Q) &&
> >> >                               skb->protocol != htons(ETH_P_8021AD)) ||
> >> > -                            skb->len < VLAN_ETH_HLEN))
> >> > +                            skb->mac_len < VLAN_ETH_HLEN))  
> >>
> >> There is already check in __skb_vlan_pop() to validate skb for a vlan
> >> header. So it is safe to drop this check entirely.  
> >
> > Yep, I submitted a v2 with your suggestion, however I withdrew it, as
> > there is a slight behavior difference noticeable by 'skb_vlan_pop' callers.
> >
> > Suppose the rare case where skb->len is too small.
> >
> > pre:
> >   skb_vlan_pop returns 0 (at least for the correct tx path).
> >   Meaning, callers do not see it as a failure.
> > post:
> >   skb_ensure_writable fails (!pskb_may_pull), therefore -ENOMEM returned
> >   to the callers of 'skb_vlan_pop'.
> >
> > For ovs, it means do_execute_actions's loop is terminated, no further
> > actions are executed, and skb gets freed.
> >
> > For tc act vlan, it means skb gets dropped.
> >
> > This actually makes sense, but do we want to present this change?
> >  
> I think this is correct behavior over existing code.

Ok.
I'll submit a v3 identical to v2 but with proper statement of this
behavior change in the commit log.

Thanks.


* [PATCH v4 net-next 16/16] tcp_bbr: add BBR congestion control
From: Neal Cardwell @ 2016-09-20  3:39 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Van Jacobson, Yuchung Cheng,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474342763-16715-1-git-send-email-ncardwell@google.com>

This commit implements a new TCP congestion control algorithm: BBR
(Bottleneck Bandwidth and RTT). A detailed description of BBR will be
published in ACM Queue, Vol. 14 No. 5, September-October 2016, as
"BBR: Congestion-Based Congestion Control".

BBR has significantly increased throughput and reduced latency for
connections on Google's internal backbone networks and google.com and
YouTube Web servers.

BBR requires only changes on the sender side, not in the network or
the receiver side. Thus it can be incrementally deployed on today's
Internet, or in datacenters.

The Internet has predominantly used loss-based congestion control
(largely Reno or CUBIC) since the 1980s, relying on packet loss as the
signal to slow down. While this worked well for many years, loss-based
congestion control is unfortunately out-dated in today's networks. On
today's Internet, loss-based congestion control causes the infamous
bufferbloat problem, often causing seconds of needless queuing delay,
since it fills the bloated buffers in many last-mile links. On today's
high-speed long-haul links using commodity switches with shallow
buffers, loss-based congestion control has abysmal throughput because
it over-reacts to losses caused by transient traffic bursts.

In 1981 Kleinrock and Gail showed that the optimal operating point for
a network maximizes delivered bandwidth while minimizing delay and
loss, not only for single connections but for the network as a
whole. Finding that optimal operating point has been elusive, since
any single network measurement is ambiguous: network measurements are
the result of both bandwidth and propagation delay, and those two
cannot be measured simultaneously.

While it is impossible to disambiguate any single bandwidth or RTT
measurement, a connection's behavior over time tells a clearer
story. BBR uses a measurement strategy designed to resolve this
ambiguity. It combines these measurements with a robust servo loop
using recent control systems advances to implement a distributed
congestion control algorithm that reacts to actual congestion, not
packet loss or transient queue delay, and is designed to converge with
high probability to a point near the optimal operating point.

In a nutshell, BBR creates an explicit model of the network pipe by
sequentially probing the bottleneck bandwidth and RTT. On the arrival
of each ACK, BBR derives the current delivery rate of the last round
trip, and feeds it through a windowed max-filter to estimate the
bottleneck bandwidth. Conversely it uses a windowed min-filter to
estimate the round trip propagation delay. The max-filtered bandwidth
and min-filtered RTT estimates form BBR's model of the network pipe.
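
As a rough illustration of the two filters described above, here is a
minimal sketch in plain C. It is not the code from this patch: the
ring-buffer max filter, the helper names (bw_filter_update,
min_rtt_update) and the window constants are assumptions made for the
example, chosen to match the 10-round and 10-second windows mentioned in
the tcp_bbr.c header comment.

```c
#include <stdint.h>

#define BW_WIN_ROUNDS	10	/* assumed bw window, in round trips */
#define MIN_RTT_WIN_MS	10000	/* assumed min_rtt window, in ms */

/* Hypothetical windowed max filter: best delivery rate seen in each round. */
struct bw_filter {
	uint32_t round[BW_WIN_ROUNDS];	/* round number stored in each slot */
	uint32_t best[BW_WIN_ROUNDS];	/* max rate sample seen in that round */
};

/* Feed one delivery-rate sample; return the max over the last 10 rounds. */
static uint32_t bw_filter_update(struct bw_filter *f, uint32_t round,
				 uint32_t rate)
{
	uint32_t slot = round % BW_WIN_ROUNDS, max = 0, i;

	if (f->round[slot] != round) {	/* slot still holds an expired round */
		f->round[slot] = round;
		f->best[slot] = 0;
	}
	if (rate > f->best[slot])
		f->best[slot] = rate;

	for (i = 0; i < BW_WIN_ROUNDS; i++)
		if (round - f->round[i] < BW_WIN_ROUNDS && f->best[i] > max)
			max = f->best[i];
	return max;
}

/* Hypothetical windowed min filter: lowest RTT seen in the last 10 seconds. */
struct min_rtt_filter {
	uint32_t min_rtt_us;	/* start at UINT32_MAX: no sample yet */
	uint32_t stamp_ms;	/* time the current minimum was recorded */
};

/* Feed one RTT sample; a lower sample or an expired window replaces the min. */
static uint32_t min_rtt_update(struct min_rtt_filter *f, uint32_t now_ms,
			       uint32_t rtt_us)
{
	int expired = now_ms - f->stamp_ms > MIN_RTT_WIN_MS;

	if (rtt_us <= f->min_rtt_us || expired) {
		f->min_rtt_us = rtt_us;
		f->stamp_ms = now_ms;
	}
	return f->min_rtt_us;
}
```

The max filter forgets a high bandwidth sample once it is 10 rounds old,
while the min filter keeps a low RTT sample until a lower one arrives or
the 10-second window expires.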

Using its model, BBR sets control parameters to govern sending
behavior. The primary control is the pacing rate: BBR applies a gain
multiplier to transmit faster or slower than the observed bottleneck
bandwidth. The conventional congestion window (cwnd) is now the
secondary control; the cwnd is set to a small multiple of the
estimated BDP (bandwidth-delay product) in order to allow full
utilization and bandwidth probing while bounding the potential amount
of queue at the bottleneck.
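
The two controls can be sketched as follows. This is a simplified
illustration, not the patch's code: the function names, the byte-based
units and the packet size are assumptions for the example, and the 1/256
fixed-point gain encoding is borrowed from BBR_SCALE in tcp_bbr.c.

```c
#include <stdint.h>

#define GAIN_SCALE	8			/* gains are fixed-point, 1/256 units */
#define GAIN_UNIT	(1 << GAIN_SCALE)	/* 1.0x */
#define CWND_MIN_PKTS	4			/* floor on the congestion window */

/* Primary control: pace at gain * estimated bottleneck bandwidth. */
static uint64_t pacing_rate(uint64_t bw_bytes_per_sec, uint32_t gain)
{
	return (bw_bytes_per_sec * gain) >> GAIN_SCALE;
}

/* Secondary control: cwnd = gain * BDP, in packets, never below the floor.
 * No overflow checking is attempted in this sketch.
 */
static uint32_t target_cwnd(uint64_t bw_bytes_per_sec, uint32_t min_rtt_us,
			    uint32_t mss, uint32_t gain)
{
	uint64_t bdp = bw_bytes_per_sec * min_rtt_us / 1000000;	/* bytes */
	uint64_t bytes = (bdp * gain) >> GAIN_SCALE;
	uint64_t pkts = (bytes + mss - 1) / mss;		/* round up */

	return pkts < CWND_MIN_PKTS ? CWND_MIN_PKTS : (uint32_t)pkts;
}
```

For a 10 Mbit/s, 40 ms path (BDP of 50,000 bytes) and a cwnd gain of 2,
target_cwnd() yields 80 packets of 1250 bytes.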

When a BBR connection starts, it enters STARTUP mode and applies a
high gain to perform an exponential search to quickly probe the
bottleneck bandwidth (doubling its sending rate each round trip, like
slow start). However, instead of continuing until it fills up the
buffer (i.e. a loss), or until delay or ACK spacing reaches some
threshold (like Hystart), it uses its model of the pipe to estimate
when that pipe is full: it estimates the pipe is full when it notices
the estimated bandwidth has stopped growing. At that point it exits
STARTUP and enters DRAIN mode, where it reduces its pacing rate to
drain the queue it estimates it has created.
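
The "bandwidth has stopped growing" test can be sketched as below. This
is an illustration under assumptions, not necessarily the exact check in
the patch: the struct and function names are hypothetical, and the 1.25x
growth threshold and 3-round count are taken from bbr_full_bw_thresh and
bbr_full_bw_cnt in tcp_bbr.c. The function is meant to be called once per
round trip with the current bandwidth estimate.

```c
#include <stdbool.h>
#include <stdint.h>

#define GAIN_SCALE	8
#define GAIN_UNIT	(1 << GAIN_SCALE)
#define FULL_BW_THRESH	(GAIN_UNIT * 5 / 4)	/* 1.25x growth counts as "growing" */
#define FULL_BW_CNT	3			/* rounds without growth => pipe full */

struct full_pipe {
	uint32_t full_bw;	/* bandwidth baseline for the growth test */
	uint32_t full_bw_cnt;	/* consecutive rounds without large growth */
};

/* Returns true once the estimate has failed to grow for FULL_BW_CNT rounds. */
static bool full_pipe_round(struct full_pipe *s, uint32_t bw)
{
	uint64_t thresh = ((uint64_t)s->full_bw * FULL_BW_THRESH) >> GAIN_SCALE;

	if (bw >= thresh) {		/* still growing: reset the baseline */
		s->full_bw = bw;
		s->full_bw_cnt = 0;
		return false;
	}
	return ++s->full_bw_cnt >= FULL_BW_CNT;
}
```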

Then BBR enters steady state. In steady state, PROBE_BW mode cycles
between first pacing faster to probe for more bandwidth, then pacing
slower to drain any queue that was created if no more bandwidth was
available, and then cruising at the estimated bandwidth to utilize the
pipe without creating excess queue. Occasionally, on an as-needed
basis, it sends significantly slower to probe for RTT (PROBE_RTT
mode).
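
The PROBE_BW gain cycle can be illustrated with a small table walk. The
sketch below reuses the gain values from bbr_pacing_gain[] in tcp_bbr.c;
the helper names (next_phase, cycle_avg_gain) are hypothetical and exist
only for this example.

```c
#include <stdint.h>

#define GAIN_SCALE	8
#define GAIN_UNIT	(1 << GAIN_SCALE)	/* 1.0x */
#define CYCLE_LEN	8			/* phases per cycle (power of two) */

static const uint32_t pacing_gain[CYCLE_LEN] = {
	GAIN_UNIT * 5 / 4,	/* probe: pace 25% faster */
	GAIN_UNIT * 3 / 4,	/* drain: pace 25% slower */
	GAIN_UNIT, GAIN_UNIT, GAIN_UNIT,	/* cruise at the estimated bw */
	GAIN_UNIT, GAIN_UNIT, GAIN_UNIT
};

/* Advance to the next phase, wrapping around at the end of the cycle. */
static uint32_t next_phase(uint32_t idx)
{
	return (idx + 1) & (CYCLE_LEN - 1);
}

/* Average gain over one full cycle, in GAIN_UNIT fixed point. */
static uint32_t cycle_avg_gain(void)
{
	uint32_t sum = 0, i;

	for (i = 0; i < CYCLE_LEN; i++)
		sum += pacing_gain[i];
	return sum / CYCLE_LEN;
}
```

With these values the probe and drain phases cancel out, so the average
pacing gain over a full cycle is exactly 1.0 (GAIN_UNIT).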

BBR has been fully deployed on Google's wide-area backbone networks
and we're experimenting with BBR on Google.com and YouTube on a global
scale.  Replacing CUBIC with BBR has resulted in significant
improvements in network latency and application (RPC, browser, and
video) metrics. For more details please refer to our upcoming ACM
Queue publication.

Example performance results, to illustrate the difference between BBR
and CUBIC:

Resilience to random loss (e.g. from shallow buffers):
  Consider a netperf TCP_STREAM test lasting 30 secs on an emulated
  path with a 10Gbps bottleneck, 100ms RTT, and 1% packet loss
  rate. CUBIC gets 3.27 Mbps, and BBR gets 9150 Mbps (2798x higher).

Low latency with the bloated buffers common in today's last-mile links:
  Consider a netperf TCP_STREAM test lasting 120 secs on an emulated
  path with a 10Mbps bottleneck, 40ms RTT, and 1000-packet bottleneck
  buffer. Both fully utilize the bottleneck bandwidth, but BBR
  achieves this with a median RTT 25x lower (43 ms instead of 1.09
  secs).
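
The numbers in the second scenario can be sanity-checked with a little
arithmetic. The sketch below is only a back-of-the-envelope aid: the
helper names are hypothetical, and the 1500-byte packet size is an
assumption, since the packet size is not stated above.

```c
#include <stdint.h>

/* Bandwidth-delay product in bytes for a given bottleneck rate and RTT. */
static uint64_t bdp_bytes(uint64_t rate_bps, uint32_t rtt_ms)
{
	return rate_bps / 8 * rtt_ms / 1000;
}

/* Time in ms to drain a full bottleneck buffer at the bottleneck rate. */
static uint64_t queue_delay_ms(uint32_t buf_pkts, uint32_t pkt_bytes,
			       uint64_t rate_bps)
{
	return (uint64_t)buf_pkts * pkt_bytes * 8 * 1000 / rate_bps;
}
```

Under that assumption, bdp_bytes(10000000, 40) gives 50,000 bytes (about
33 packets), while queue_delay_ms(1000, 1500, 10000000) gives 1200 ms for
a full buffer. A flow that keeps the 1000-packet buffer mostly full would
therefore see RTTs on the order of a second, which is the same magnitude
as the 1.09 secs quoted for CUBIC.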

Our long-term goal is to improve the congestion control algorithms
used on the Internet. We are hopeful that BBR can help advance the
efforts toward this goal, and motivate the community to do further
research.

Test results, performance evaluations, feedback, and BBR-related
discussions are very welcome in the public e-mail list for BBR:

  https://groups.google.com/forum/#!forum/bbr-dev

NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing
enabled, since pacing is integral to the BBR design and
implementation. BBR without pacing would not function properly, and
may incur unnecessarily high packet loss rates.
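
For per-socket testing, an application can request a congestion control
module through the standard Linux TCP_CONGESTION socket option. The
wrapper below is a minimal sketch; set_congestion_control is a
hypothetical helper name, and "bbr" as the module's registered name is an
assumption here, since the registration is not shown in this part of the
patch. The fq qdisc still has to be configured separately on the sending
interface.

```c
#include <string.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <netinet/tcp.h>

/* Ask the kernel to use the named congestion control on one TCP socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt()).
 */
static int set_congestion_control(int fd, const char *name)
{
	return setsockopt(fd, IPPROTO_TCP, TCP_CONGESTION, name, strlen(name));
}
```

A caller would typically invoke set_congestion_control(fd, "bbr") right
after socket() and before connect(), and fall back to the system default
if the call fails.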

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/uapi/linux/inet_diag.h |  13 +
 net/ipv4/Kconfig               |  18 +
 net/ipv4/Makefile              |   1 +
 net/ipv4/tcp_bbr.c             | 896 +++++++++++++++++++++++++++++++++++++++++
 4 files changed, 928 insertions(+)
 create mode 100644 net/ipv4/tcp_bbr.c

diff --git a/include/uapi/linux/inet_diag.h b/include/uapi/linux/inet_diag.h
index b5c366f..509cd96 100644
--- a/include/uapi/linux/inet_diag.h
+++ b/include/uapi/linux/inet_diag.h
@@ -124,6 +124,7 @@ enum {
 	INET_DIAG_PEERS,
 	INET_DIAG_PAD,
 	INET_DIAG_MARK,
+	INET_DIAG_BBRINFO,
 	__INET_DIAG_MAX,
 };
 
@@ -157,8 +158,20 @@ struct tcp_dctcp_info {
 	__u32	dctcp_ab_tot;
 };
 
+/* INET_DIAG_BBRINFO */
+
+struct tcp_bbr_info {
+	/* u64 bw: max-filtered BW (app throughput) estimate in bytes per sec: */
+	__u32	bbr_bw_lo;		/* lower 32 bits of bw */
+	__u32	bbr_bw_hi;		/* upper 32 bits of bw */
+	__u32	bbr_min_rtt;		/* min-filtered RTT in uSec */
+	__u32	bbr_pacing_gain;	/* pacing gain shifted left 8 bits */
+	__u32	bbr_cwnd_gain;		/* cwnd gain shifted left 8 bits */
+};
+
 union tcp_cc_info {
 	struct tcpvegas_info	vegas;
 	struct tcp_dctcp_info	dctcp;
+	struct tcp_bbr_info	bbr;
 };
 #endif /* _UAPI_INET_DIAG_H_ */
diff --git a/net/ipv4/Kconfig b/net/ipv4/Kconfig
index 50d6a9b..300b068 100644
--- a/net/ipv4/Kconfig
+++ b/net/ipv4/Kconfig
@@ -640,6 +640,21 @@ config TCP_CONG_CDG
 	  D.A. Hayes and G. Armitage. "Revisiting TCP congestion control using
 	  delay gradients." In Networking 2011. Preprint: http://goo.gl/No3vdg
 
+config TCP_CONG_BBR
+	tristate "BBR TCP"
+	default n
+	---help---
+
+	BBR (Bottleneck Bandwidth and RTT) TCP congestion control aims to
+	maximize network utilization and minimize queues. It builds an explicit
+	model of the bottleneck delivery rate and path round-trip
+	propagation delay. It tolerates packet loss and delay unrelated to
+	congestion. It can operate over LAN, WAN, cellular, wifi, or cable
+	modem links. It can coexist with flows that use loss-based congestion
+	control, and can operate with shallow buffers, deep buffers,
+	bufferbloat, policers, or AQM schemes that do not provide a delay
+	signal. It requires the fq ("Fair Queue") pacing packet scheduler.
+
 choice
 	prompt "Default TCP congestion control"
 	default DEFAULT_CUBIC
@@ -674,6 +689,9 @@ choice
 	config DEFAULT_CDG
 		bool "CDG" if TCP_CONG_CDG=y
 
+	config DEFAULT_BBR
+		bool "BBR" if TCP_CONG_BBR=y
+
 	config DEFAULT_RENO
 		bool "Reno"
 endchoice
diff --git a/net/ipv4/Makefile b/net/ipv4/Makefile
index 9cfff1a..bc6a6c8 100644
--- a/net/ipv4/Makefile
+++ b/net/ipv4/Makefile
@@ -41,6 +41,7 @@ obj-$(CONFIG_INET_DIAG) += inet_diag.o
 obj-$(CONFIG_INET_TCP_DIAG) += tcp_diag.o
 obj-$(CONFIG_INET_UDP_DIAG) += udp_diag.o
 obj-$(CONFIG_NET_TCPPROBE) += tcp_probe.o
+obj-$(CONFIG_TCP_CONG_BBR) += tcp_bbr.o
 obj-$(CONFIG_TCP_CONG_BIC) += tcp_bic.o
 obj-$(CONFIG_TCP_CONG_CDG) += tcp_cdg.o
 obj-$(CONFIG_TCP_CONG_CUBIC) += tcp_cubic.o
diff --git a/net/ipv4/tcp_bbr.c b/net/ipv4/tcp_bbr.c
new file mode 100644
index 0000000..0ea66c2
--- /dev/null
+++ b/net/ipv4/tcp_bbr.c
@@ -0,0 +1,896 @@
+/* Bottleneck Bandwidth and RTT (BBR) congestion control
+ *
+ * BBR congestion control computes the sending rate based on the delivery
+ * rate (throughput) estimated from ACKs. In a nutshell:
+ *
+ *   On each ACK, update our model of the network path:
+ *      bottleneck_bandwidth = windowed_max(delivered / elapsed, 10 round trips)
+ *      min_rtt = windowed_min(rtt, 10 seconds)
+ *   pacing_rate = pacing_gain * bottleneck_bandwidth
+ *   cwnd = max(cwnd_gain * bottleneck_bandwidth * min_rtt, 4)
+ *
+ * The core algorithm does not react directly to packet losses or delays,
+ * although BBR may adjust the size of next send per ACK when loss is
+ * observed, or adjust the sending rate if it estimates there is a
+ * traffic policer, in order to keep the drop rate reasonable.
+ *
+ * BBR is described in detail in:
+ *   "BBR: Congestion-Based Congestion Control",
+ *   Neal Cardwell, Yuchung Cheng, C. Stephen Gunn, Soheil Hassas Yeganeh,
+ *   Van Jacobson. ACM Queue, Vol. 14 No. 5, September-October 2016.
+ *
+ * There is a public e-mail list for discussing BBR development and testing:
+ *   https://groups.google.com/forum/#!forum/bbr-dev
+ *
+ * NOTE: BBR *must* be used with the fq qdisc ("man tc-fq") with pacing enabled,
+ * since pacing is integral to the BBR design and implementation.
+ * BBR without pacing would not function properly, and may incur unnecessarily
+ * high packet loss rates.
+ */
+#include <linux/module.h>
+#include <net/tcp.h>
+#include <linux/inet_diag.h>
+#include <linux/inet.h>
+#include <linux/random.h>
+#include <linux/win_minmax.h>
+
+/* Scale factor for rate in pkt/uSec unit to avoid truncation in bandwidth
+ * estimation. The rate unit ~= (1500 bytes / 1 usec / 2^24) ~= 715 bps.
+ * This handles bandwidths from 0.06pps (715bps) to 256Mpps (3Tbps) in a u32.
+ * Since the minimum window is >=4 packets, the lower bound isn't
+ * an issue. The upper bound isn't an issue with existing technologies.
+ */
+#define BW_SCALE 24
+#define BW_UNIT (1 << BW_SCALE)
+
+#define BBR_SCALE 8	/* scaling factor for fractions in BBR (e.g. gains) */
+#define BBR_UNIT (1 << BBR_SCALE)
+
+/* BBR has the following modes for deciding how fast to send: */
+enum bbr_mode {
+	BBR_STARTUP,	/* ramp up sending rate rapidly to fill pipe */
+	BBR_DRAIN,	/* drain any queue created during startup */
+	BBR_PROBE_BW,	/* discover, share bw: pace around estimated bw */
+	BBR_PROBE_RTT,	/* cut cwnd to min to probe min_rtt */
+};
+
+/* BBR congestion control block */
+struct bbr {
+	u32	min_rtt_us;	        /* min RTT in min_rtt_win_sec window */
+	u32	min_rtt_stamp;	        /* timestamp of min_rtt_us */
+	u32	probe_rtt_done_stamp;   /* end time for BBR_PROBE_RTT mode */
+	struct minmax bw;	/* Max recent delivery rate in pkts/uS << 24 */
+	u32	rtt_cnt;	    /* count of packet-timed rounds elapsed */
+	u32     next_rtt_delivered; /* scb->tx.delivered at end of round */
+	struct skb_mstamp cycle_mstamp;  /* time of this cycle phase start */
+	u32     mode:3,		     /* current bbr_mode in state machine */
+		prev_ca_state:3,     /* CA state on previous ACK */
+		packet_conservation:1,  /* use packet conservation? */
+		restore_cwnd:1,	     /* decided to revert cwnd to old value */
+		round_start:1,	     /* start of packet-timed tx->ack round? */
+		tso_segs_goal:7,     /* segments we want in each skb we send */
+		idle_restart:1,	     /* restarting after idle? */
+		probe_rtt_round_done:1,  /* a BBR_PROBE_RTT round at 4 pkts? */
+		unused:5,
+		lt_is_sampling:1,    /* taking long-term ("LT") samples now? */
+		lt_rtt_cnt:7,	     /* round trips in long-term interval */
+		lt_use_bw:1;	     /* use lt_bw as our bw estimate? */
+	u32	lt_bw;		     /* LT est delivery rate in pkts/uS << 24 */
+	u32	lt_last_delivered;   /* LT intvl start: tp->delivered */
+	u32	lt_last_stamp;	     /* LT intvl start: tp->delivered_mstamp */
+	u32	lt_last_lost;	     /* LT intvl start: tp->lost */
+	u32	pacing_gain:10,	/* current gain for setting pacing rate */
+		cwnd_gain:10,	/* current gain for setting cwnd */
+		full_bw_cnt:3,	/* number of rounds without large bw gains */
+		cycle_idx:3,	/* current index in pacing_gain cycle array */
+		unused_b:6;
+	u32	prior_cwnd;	/* prior cwnd upon entering loss recovery */
+	u32	full_bw;	/* recent bw, to estimate if pipe is full */
+};
+
+#define CYCLE_LEN	8	/* number of phases in a pacing gain cycle */
+
+/* Window length of bw filter (in rounds): */
+static const int bbr_bw_rtts = CYCLE_LEN + 2;
+/* Window length of min_rtt filter (in sec): */
+static const u32 bbr_min_rtt_win_sec = 10;
+/* Minimum time (in ms) spent at bbr_cwnd_min_target in BBR_PROBE_RTT mode: */
+static const u32 bbr_probe_rtt_mode_ms = 200;
+/* Skip TSO below the following bandwidth (bits/sec): */
+static const int bbr_min_tso_rate = 1200000;
+
+/* We use a high_gain value of 2/ln(2) because it's the smallest pacing gain
+ * that will allow a smoothly increasing pacing rate that will double each RTT
+ * and send the same number of packets per RTT that an un-paced, slow-starting
+ * Reno or CUBIC flow would:
+ */
+static const int bbr_high_gain  = BBR_UNIT * 2885 / 1000 + 1;
+/* The pacing gain of 1/high_gain in BBR_DRAIN is calculated to typically drain
+ * the queue created in BBR_STARTUP in a single round:
+ */
+static const int bbr_drain_gain = BBR_UNIT * 1000 / 2885;
+/* The gain for deriving steady-state cwnd tolerates delayed/stretched ACKs: */
+static const int bbr_cwnd_gain  = BBR_UNIT * 2;
+/* The pacing_gain values for the PROBE_BW gain cycle, to discover/share bw: */
+static const int bbr_pacing_gain[] = {
+	BBR_UNIT * 5 / 4,	/* probe for more available bw */
+	BBR_UNIT * 3 / 4,	/* drain queue and/or yield bw to other flows */
+	BBR_UNIT, BBR_UNIT, BBR_UNIT,	/* cruise at 1.0*bw to utilize pipe, */
+	BBR_UNIT, BBR_UNIT, BBR_UNIT	/* without creating excess queue... */
+};
+/* Randomize the starting gain cycling phase over N phases: */
+static const u32 bbr_cycle_rand = 7;
+
+/* Try to keep at least this many packets in flight, if things go smoothly. For
+ * smooth functioning, a sliding window protocol ACKing every other packet
+ * needs at least 4 packets in flight:
+ */
+static const u32 bbr_cwnd_min_target = 4;
+
+/* To estimate if BBR_STARTUP mode (i.e. high_gain) has filled pipe... */
+/* If bw has increased significantly (1.25x), there may be more bw available: */
+static const u32 bbr_full_bw_thresh = BBR_UNIT * 5 / 4;
+/* But after 3 rounds w/o significant bw growth, estimate pipe is full: */
+static const u32 bbr_full_bw_cnt = 3;
+
+/* "long-term" ("LT") bandwidth estimator parameters... */
+/* The minimum number of rounds in an LT bw sampling interval: */
+static const u32 bbr_lt_intvl_min_rtts = 4;
+/* If lost/delivered ratio > 20%, interval is "lossy" and we may be policed: */
+static const u32 bbr_lt_loss_thresh = 50;
+/* If 2 intervals have a bw ratio <= 1/8, their bw is "consistent": */
+static const u32 bbr_lt_bw_ratio = BBR_UNIT / 8;
+/* If 2 intervals have a bw diff <= 4 Kbit/sec their bw is "consistent": */
+static const u32 bbr_lt_bw_diff = 4000 / 8;
+/* If we estimate we're policed, use lt_bw for this many round trips: */
+static const u32 bbr_lt_bw_max_rtts = 48;
+
+/* Do we estimate that STARTUP filled the pipe? */
+static bool bbr_full_bw_reached(const struct sock *sk)
+{
+	const struct bbr *bbr = inet_csk_ca(sk);
+
+	return bbr->full_bw_cnt >= bbr_full_bw_cnt;
+}
+
+/* Return the windowed max recent bandwidth sample, in pkts/uS << BW_SCALE. */
+static u32 bbr_max_bw(const struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	return minmax_get(&bbr->bw);
+}
+
+/* Return the estimated bandwidth of the path, in pkts/uS << BW_SCALE. */
+static u32 bbr_bw(const struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	return bbr->lt_use_bw ? bbr->lt_bw : bbr_max_bw(sk);
+}
+
+/* Return rate in bytes per second, optionally with a gain.
+ * The order here is chosen carefully to avoid overflow of u64. This should
+ * work for input rates of up to 2.9Tbit/sec and gain of 2.89x.
+ */
+static u64 bbr_rate_bytes_per_sec(struct sock *sk, u64 rate, int gain)
+{
+	rate *= tcp_mss_to_mtu(sk, tcp_sk(sk)->mss_cache);
+	rate *= gain;
+	rate >>= BBR_SCALE;
+	rate *= USEC_PER_SEC;
+	return rate >> BW_SCALE;
+}
+
+/* Pace using current bw estimate and a gain factor. In order to help drive the
+ * network toward lower queues while maintaining high utilization and low
+ * latency, the average pacing rate aims to be slightly (~1%) lower than the
+ * estimated bandwidth. This is an important aspect of the design. In this
+ * implementation this slightly lower pacing rate is achieved implicitly by not
+ * including link-layer headers in the packet size used for the pacing rate.
+ */
+static void bbr_set_pacing_rate(struct sock *sk, u32 bw, int gain)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 rate = bw;
+
+	rate = bbr_rate_bytes_per_sec(sk, rate, gain);
+	rate = min_t(u64, rate, sk->sk_max_pacing_rate);
+	if (bbr->mode != BBR_STARTUP || rate > sk->sk_pacing_rate)
+		sk->sk_pacing_rate = rate;
+}
+
+/* Return count of segments we want in the skbs we send, or 0 for default. */
+static u32 bbr_tso_segs_goal(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	return bbr->tso_segs_goal;
+}
+
+static void bbr_set_tso_segs_goal(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 min_segs;
+
+	min_segs = sk->sk_pacing_rate < (bbr_min_tso_rate >> 3) ? 1 : 2;
+	bbr->tso_segs_goal = min(tcp_tso_autosize(sk, tp->mss_cache, min_segs),
+				 0x7FU);
+}
+
+/* Save "last known good" cwnd so we can restore it after losses or PROBE_RTT */
+static void bbr_save_cwnd(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr->prev_ca_state < TCP_CA_Recovery && bbr->mode != BBR_PROBE_RTT)
+		bbr->prior_cwnd = tp->snd_cwnd;  /* this cwnd is good enough */
+	else  /* loss recovery or BBR_PROBE_RTT have temporarily cut cwnd */
+		bbr->prior_cwnd = max(bbr->prior_cwnd, tp->snd_cwnd);
+}
+
+static void bbr_cwnd_event(struct sock *sk, enum tcp_ca_event event)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (event == CA_EVENT_TX_START && tp->app_limited) {
+		bbr->idle_restart = 1;
+		/* Avoid pointless buffer overflows: pace at est. bw if we don't
+		 * need more speed (we're restarting from idle and app-limited).
+		 */
+		if (bbr->mode == BBR_PROBE_BW)
+			bbr_set_pacing_rate(sk, bbr_bw(sk), BBR_UNIT);
+	}
+}
+
+/* Find target cwnd. Right-size the cwnd based on min RTT and the
+ * estimated bottleneck bandwidth:
+ *
+ * cwnd = bw * min_rtt * gain = BDP * gain
+ *
+ * The key factor, gain, controls the amount of queue. While a small gain
+ * builds a smaller queue, it becomes more vulnerable to noise in RTT
+ * measurements (e.g., delayed ACKs or other ACK compression effects). This
+ * noise may cause BBR to under-estimate the rate.
+ *
+ * To achieve full performance in high-speed paths, we budget enough cwnd to
+ * fit full-sized skbs in-flight on both end hosts to fully utilize the path:
+ *   - one skb in sending host Qdisc,
+ *   - one skb in sending host TSO/GSO engine
+ *   - one skb being received by receiver host LRO/GRO/delayed-ACK engine
+ * Don't worry, at low rates (bbr_min_tso_rate) this won't bloat cwnd because
+ * in such cases tso_segs_goal is 1. The minimum cwnd is 4 packets,
+ * which allows 2 outstanding 2-packet sequences, to try to keep pipe
+ * full even with ACK-every-other-packet delayed ACKs.
+ */
+static u32 bbr_target_cwnd(struct sock *sk, u32 bw, int gain)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 cwnd;
+	u64 w;
+
+	/* If we've never had a valid RTT sample, cap cwnd at the initial
+	 * default. This should only happen when the connection is not using TCP
+	 * timestamps and has retransmitted all of the SYN/SYNACK/data packets
+	 * ACKed so far. In this case, an RTO can cut cwnd to 1, in which
+	 * case we need to slow-start up toward something safe: TCP_INIT_CWND.
+	 */
+	if (unlikely(bbr->min_rtt_us == ~0U))	 /* no valid RTT samples yet? */
+		return TCP_INIT_CWND;  /* be safe: cap at default initial cwnd*/
+
+	w = (u64)bw * bbr->min_rtt_us;
+
+	/* Apply a gain to the given value, then remove the BW_SCALE shift. */
+	cwnd = (((w * gain) >> BBR_SCALE) + BW_UNIT - 1) / BW_UNIT;
+
+	/* Allow enough full-sized skbs in flight to utilize end systems. */
+	cwnd += 3 * bbr->tso_segs_goal;
+
+	/* Reduce delayed ACKs by rounding up cwnd to the next even number. */
+	cwnd = (cwnd + 1) & ~1U;
+
+	return cwnd;
+}
+
+/* An optimization in BBR to reduce losses: On the first round of recovery, we
+ * follow the packet conservation principle: send P packets per P packets acked.
+ * After that, we slow-start and send at most 2*P packets per P packets acked.
+ * After recovery finishes, or upon undo, we restore the cwnd we had when
+ * recovery started (capped by the target cwnd based on estimated BDP).
+ *
+ * TODO(ycheng/ncardwell): implement a rate-based approach.
+ */
+static bool bbr_set_cwnd_to_recover_or_restore(
+	struct sock *sk, const struct rate_sample *rs, u32 acked, u32 *new_cwnd)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u8 prev_state = bbr->prev_ca_state, state = inet_csk(sk)->icsk_ca_state;
+	u32 cwnd = tp->snd_cwnd;
+
+	/* An ACK for P pkts should release at most 2*P packets. We do this
+	 * in two steps. First, here we deduct the number of lost packets.
+	 * Then, in bbr_set_cwnd() we slow start up toward the target cwnd.
+	 */
+	if (rs->losses > 0)
+		cwnd = max_t(s32, cwnd - rs->losses, 1);
+
+	if (state == TCP_CA_Recovery && prev_state != TCP_CA_Recovery) {
+		/* Starting 1st round of Recovery, so do packet conservation. */
+		bbr->packet_conservation = 1;
+		bbr->next_rtt_delivered = tp->delivered;  /* start round now */
+		/* Cut unused cwnd from app behavior, TSQ, or TSO deferral: */
+		cwnd = tcp_packets_in_flight(tp) + acked;
+	} else if (prev_state >= TCP_CA_Recovery && state < TCP_CA_Recovery) {
+		/* Exiting loss recovery; restore cwnd saved before recovery. */
+		bbr->restore_cwnd = 1;
+		bbr->packet_conservation = 0;
+	}
+	bbr->prev_ca_state = state;
+
+	if (bbr->restore_cwnd) {
+		/* Restore cwnd after exiting loss recovery or PROBE_RTT. */
+		cwnd = max(cwnd, bbr->prior_cwnd);
+		bbr->restore_cwnd = 0;
+	}
+
+	if (bbr->packet_conservation) {
+		*new_cwnd = max(cwnd, tcp_packets_in_flight(tp) + acked);
+		return true;	/* yes, using packet conservation */
+	}
+	*new_cwnd = cwnd;
+	return false;
+}
+
+/* Slow-start up toward target cwnd (if bw estimate is growing, or packet loss
+ * has drawn us down below target), or snap down to target if we're above it.
+ */
+static void bbr_set_cwnd(struct sock *sk, const struct rate_sample *rs,
+			 u32 acked, u32 bw, int gain)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 cwnd = 0, target_cwnd = 0;
+
+	if (!acked)
+		return;
+
+	if (bbr_set_cwnd_to_recover_or_restore(sk, rs, acked, &cwnd))
+		goto done;
+
+	/* If we're below target cwnd, slow start cwnd toward target cwnd. */
+	target_cwnd = bbr_target_cwnd(sk, bw, gain);
+	if (bbr_full_bw_reached(sk))  /* only cut cwnd if we filled the pipe */
+		cwnd = min(cwnd + acked, target_cwnd);
+	else if (cwnd < target_cwnd || tp->delivered < TCP_INIT_CWND)
+		cwnd = cwnd + acked;
+	cwnd = max(cwnd, bbr_cwnd_min_target);
+
+done:
+	tp->snd_cwnd = min(cwnd, tp->snd_cwnd_clamp);	/* apply global cap */
+	if (bbr->mode == BBR_PROBE_RTT)  /* drain queue, refresh min_rtt */
+		tp->snd_cwnd = min(tp->snd_cwnd, bbr_cwnd_min_target);
+}
+
+/* End cycle phase if it's time and/or we hit the phase's in-flight target. */
+static bool bbr_is_next_cycle_phase(struct sock *sk,
+				    const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	bool is_full_length =
+		skb_mstamp_us_delta(&tp->delivered_mstamp, &bbr->cycle_mstamp) >
+		bbr->min_rtt_us;
+	u32 inflight, bw;
+
+	/* The pacing_gain of 1.0 paces at the estimated bw to try to fully
+	 * use the pipe without increasing the queue.
+	 */
+	if (bbr->pacing_gain == BBR_UNIT)
+		return is_full_length;		/* just use wall clock time */
+
+	inflight = rs->prior_in_flight;  /* what was in-flight before ACK? */
+	bw = bbr_max_bw(sk);
+
+	/* A pacing_gain > 1.0 probes for bw by trying to raise inflight to at
+	 * least pacing_gain*BDP; this may take more than min_rtt if min_rtt is
+	 * small (e.g. on a LAN). We do not persist if packets are lost, since
+	 * a path with small buffers may not hold that much.
+	 */
+	if (bbr->pacing_gain > BBR_UNIT)
+		return is_full_length &&
+			(rs->losses ||  /* perhaps pacing_gain*BDP won't fit */
+			 inflight >= bbr_target_cwnd(sk, bw, bbr->pacing_gain));
+
+	/* A pacing_gain < 1.0 tries to drain extra queue we added if bw
+	 * probing didn't find more bw. If inflight falls to match BDP then we
+	 * estimate queue is drained; persisting would underutilize the pipe.
+	 */
+	return is_full_length ||
+		inflight <= bbr_target_cwnd(sk, bw, BBR_UNIT);
+}
+
+static void bbr_advance_cycle_phase(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->cycle_idx = (bbr->cycle_idx + 1) & (CYCLE_LEN - 1);
+	bbr->cycle_mstamp = tp->delivered_mstamp;
+	bbr->pacing_gain = bbr_pacing_gain[bbr->cycle_idx];
+}
+
+/* Gain cycling: cycle pacing gain to converge to fair share of available bw. */
+static void bbr_update_cycle_phase(struct sock *sk,
+				   const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if ((bbr->mode == BBR_PROBE_BW) && !bbr->lt_use_bw &&
+	    bbr_is_next_cycle_phase(sk, rs))
+		bbr_advance_cycle_phase(sk);
+}
+
+static void bbr_reset_startup_mode(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->mode = BBR_STARTUP;
+	bbr->pacing_gain = bbr_high_gain;
+	bbr->cwnd_gain	 = bbr_high_gain;
+}
+
+static void bbr_reset_probe_bw_mode(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->mode = BBR_PROBE_BW;
+	bbr->pacing_gain = BBR_UNIT;
+	bbr->cwnd_gain = bbr_cwnd_gain;
+	bbr->cycle_idx = CYCLE_LEN - 1 - prandom_u32_max(bbr_cycle_rand);
+	bbr_advance_cycle_phase(sk);	/* flip to next phase of gain cycle */
+}
+
+static void bbr_reset_mode(struct sock *sk)
+{
+	if (!bbr_full_bw_reached(sk))
+		bbr_reset_startup_mode(sk);
+	else
+		bbr_reset_probe_bw_mode(sk);
+}
+
+/* Start a new long-term sampling interval. */
+static void bbr_reset_lt_bw_sampling_interval(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->lt_last_stamp = tp->delivered_mstamp.stamp_jiffies;
+	bbr->lt_last_delivered = tp->delivered;
+	bbr->lt_last_lost = tp->lost;
+	bbr->lt_rtt_cnt = 0;
+}
+
+/* Completely reset long-term bandwidth sampling. */
+static void bbr_reset_lt_bw_sampling(struct sock *sk)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	bbr->lt_bw = 0;
+	bbr->lt_use_bw = 0;
+	bbr->lt_is_sampling = false;
+	bbr_reset_lt_bw_sampling_interval(sk);
+}
+
+/* Long-term bw sampling interval is done. Estimate whether we're policed. */
+static void bbr_lt_bw_interval_done(struct sock *sk, u32 bw)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 diff;
+
+	if (bbr->lt_bw) {  /* do we have bw from a previous interval? */
+		/* Is new bw close to the lt_bw from the previous interval? */
+		diff = abs(bw - bbr->lt_bw);
+		if ((diff * BBR_UNIT <= bbr_lt_bw_ratio * bbr->lt_bw) ||
+		    (bbr_rate_bytes_per_sec(sk, diff, BBR_UNIT) <=
+		     bbr_lt_bw_diff)) {
+			/* All criteria are met; estimate we're policed. */
+			bbr->lt_bw = (bw + bbr->lt_bw) >> 1;  /* avg 2 intvls */
+			bbr->lt_use_bw = 1;
+			bbr->pacing_gain = BBR_UNIT;  /* try to avoid drops */
+			bbr->lt_rtt_cnt = 0;
+			return;
+		}
+	}
+	bbr->lt_bw = bw;
+	bbr_reset_lt_bw_sampling_interval(sk);
+}
+
+/* Token-bucket traffic policers are common (see "An Internet-Wide Analysis of
+ * Traffic Policing", SIGCOMM 2016). BBR detects token-bucket policers and
+ * explicitly models their policed rate, to reduce unnecessary losses. We
+ * estimate that we're policed if we see 2 consecutive sampling intervals with
+ * consistent throughput and high packet loss. If we think we're being policed,
+ * set lt_bw to the "long-term" average delivery rate from those 2 intervals.
+ */
+static void bbr_lt_bw_sampling(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 lost, delivered;
+	u64 bw;
+	s32 t;
+
+	if (bbr->lt_use_bw) {	/* already using long-term rate, lt_bw? */
+		if (bbr->mode == BBR_PROBE_BW && bbr->round_start &&
+		    ++bbr->lt_rtt_cnt >= bbr_lt_bw_max_rtts) {
+			bbr_reset_lt_bw_sampling(sk);    /* stop using lt_bw */
+			bbr_reset_probe_bw_mode(sk);  /* restart gain cycling */
+		}
+		return;
+	}
+
+	/* Wait for the first loss before sampling, to let the policer exhaust
+	 * its tokens and estimate the steady-state rate allowed by the policer.
+	 * Starting samples earlier includes bursts that over-estimate the bw.
+	 */
+	if (!bbr->lt_is_sampling) {
+		if (!rs->losses)
+			return;
+		bbr_reset_lt_bw_sampling_interval(sk);
+		bbr->lt_is_sampling = true;
+	}
+
+	/* To avoid underestimates, reset sampling if we run out of data. */
+	if (rs->is_app_limited) {
+		bbr_reset_lt_bw_sampling(sk);
+		return;
+	}
+
+	if (bbr->round_start)
+		bbr->lt_rtt_cnt++;	/* count round trips in this interval */
+	if (bbr->lt_rtt_cnt < bbr_lt_intvl_min_rtts)
+		return;		/* sampling interval needs to be longer */
+	if (bbr->lt_rtt_cnt > 4 * bbr_lt_intvl_min_rtts) {
+		bbr_reset_lt_bw_sampling(sk);  /* interval is too long */
+		return;
+	}
+
+	/* End sampling interval when a packet is lost, so we estimate the
+	 * policer tokens were exhausted. Stopping the sampling before the
+	 * tokens are exhausted under-estimates the policed rate.
+	 */
+	if (!rs->losses)
+		return;
+
+	/* Calculate packets lost and delivered in sampling interval. */
+	lost = tp->lost - bbr->lt_last_lost;
+	delivered = tp->delivered - bbr->lt_last_delivered;
+	/* Is loss rate (lost/delivered) >= lt_loss_thresh? If not, wait. */
+	if (!delivered || (lost << BBR_SCALE) < bbr_lt_loss_thresh * delivered)
+		return;
+
+	/* Find average delivery rate in this sampling interval. */
+	t = (s32)(tp->delivered_mstamp.stamp_jiffies - bbr->lt_last_stamp);
+	if (t < 1)
+		return;		/* interval is less than one jiffy, so wait */
+	t = jiffies_to_usecs(t);
+	/* Interval long enough for jiffies_to_usecs() to return a bogus 0? */
+	if (t < 1) {
+		bbr_reset_lt_bw_sampling(sk);  /* interval too long; reset */
+		return;
+	}
+	bw = (u64)delivered * BW_UNIT;
+	do_div(bw, t);
+	bbr_lt_bw_interval_done(sk, bw);
+}
+
+/* Estimate the bandwidth based on how fast packets are delivered */
+static void bbr_update_bw(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 bw;
+
+	bbr->round_start = 0;
+	if (rs->delivered < 0 || rs->interval_us <= 0)
+		return; /* Not a valid observation */
+
+	/* See if we've reached the next RTT */
+	if (!before(rs->prior_delivered, bbr->next_rtt_delivered)) {
+		bbr->next_rtt_delivered = tp->delivered;
+		bbr->rtt_cnt++;
+		bbr->round_start = 1;
+		bbr->packet_conservation = 0;
+	}
+
+	bbr_lt_bw_sampling(sk, rs);
+
+	/* Divide delivered by the interval to find a (lower bound) bottleneck
+	 * bandwidth sample. Delivered is in packets and interval_us in usec,
+	 * so the ratio is far below 1 for most connections. Thus delivered is
+	 * first scaled by BW_UNIT to preserve precision.
+	 */
+	bw = (u64)rs->delivered * BW_UNIT;
+	do_div(bw, rs->interval_us);
+
+	/* If this sample is application-limited, it is likely to have a very
+	 * low delivered count that represents application behavior rather than
+	 * the available network rate. Such a sample could drag down estimated
+	 * bw, causing needless slow-down. Thus, to continue to send at the
+	 * last measured network rate, we filter out app-limited samples unless
+	 * they describe the path bw at least as well as our bw model.
+	 *
+	 * So the goal during app-limited phase is to proceed with the best
+	 * network rate no matter how long. We automatically leave this
+	 * phase when app writes faster than the network can deliver :)
+	 */
+	if (!rs->is_app_limited || bw >= bbr_max_bw(sk)) {
+		/* Incorporate new sample into our max bw filter. */
+		minmax_running_max(&bbr->bw, bbr_bw_rtts, bbr->rtt_cnt, bw);
+	}
+}
+
+/* Estimate when the pipe is full, using the change in delivery rate: BBR
+ * estimates that STARTUP filled the pipe if the estimated bw hasn't changed by
+ * at least bbr_full_bw_thresh (25%) after bbr_full_bw_cnt (3) non-app-limited
+ * rounds. Why 3 rounds: 1: rwin autotuning grows the rwin, 2: we fill the
+ * higher rwin, 3: we get higher delivery rate samples. Or transient
+ * cross-traffic or radio noise can go away. CUBIC Hystart shares a similar
+ * design goal, but uses delay and inter-ACK spacing instead of bandwidth.
+ */
+static void bbr_check_full_bw_reached(struct sock *sk,
+				      const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 bw_thresh;
+
+	if (bbr_full_bw_reached(sk) || !bbr->round_start || rs->is_app_limited)
+		return;
+
+	bw_thresh = (u64)bbr->full_bw * bbr_full_bw_thresh >> BBR_SCALE;
+	if (bbr_max_bw(sk) >= bw_thresh) {
+		bbr->full_bw = bbr_max_bw(sk);
+		bbr->full_bw_cnt = 0;
+		return;
+	}
+	++bbr->full_bw_cnt;
+}
+
+/* If pipe is probably full, drain the queue and then enter steady-state. */
+static void bbr_check_drain(struct sock *sk, const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (bbr->mode == BBR_STARTUP && bbr_full_bw_reached(sk)) {
+		bbr->mode = BBR_DRAIN;	/* drain queue we created */
+		bbr->pacing_gain = bbr_drain_gain;	/* pace slow to drain */
+		bbr->cwnd_gain = bbr_high_gain;	/* maintain cwnd */
+	}	/* fall through to check if in-flight is already small: */
+	if (bbr->mode == BBR_DRAIN &&
+	    tcp_packets_in_flight(tcp_sk(sk)) <=
+	    bbr_target_cwnd(sk, bbr_max_bw(sk), BBR_UNIT))
+		bbr_reset_probe_bw_mode(sk);  /* we estimate queue is drained */
+}
+
+/* The goal of PROBE_RTT mode is to have BBR flows cooperatively and
+ * periodically drain the bottleneck queue, so that they converge on a
+ * measurement of the true min_rtt (unloaded propagation delay). This allows
+ * the flows to keep queues small (reducing queuing delay and packet loss)
+ * and achieve fairness among BBR flows.
+ *
+ * The min_rtt filter window is 10 seconds. When the min_rtt estimate expires,
+ * we enter PROBE_RTT mode and cap the cwnd at bbr_cwnd_min_target=4 packets.
+ * After at least bbr_probe_rtt_mode_ms=200ms and at least one packet-timed
+ * round trip elapsed with that flight size <= 4, we leave PROBE_RTT mode and
+ * re-enter the previous mode. BBR uses 200ms to approximately bound the
+ * performance penalty of PROBE_RTT's cwnd capping to roughly 2% (200ms/10s).
+ *
+ * Note that flows pay this 2% only if they were busy sending over the last 10
+ * seconds. Interactive applications (e.g., Web, RPCs, video chunks) often have
+ * natural silences or low-rate periods within 10 seconds where the rate is low
+ * enough for long enough to drain the flow's queue at the bottleneck. We pick
+ * up these min RTT measurements opportunistically with our min_rtt filter. :-)
+ */
+static void bbr_update_min_rtt(struct sock *sk, const struct rate_sample *rs)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	bool filter_expired;
+
+	/* Track min RTT seen in the min_rtt_win_sec filter window: */
+	filter_expired = after(tcp_time_stamp,
+			       bbr->min_rtt_stamp + bbr_min_rtt_win_sec * HZ);
+	if (rs->rtt_us >= 0 &&
+	    (rs->rtt_us <= bbr->min_rtt_us || filter_expired)) {
+		bbr->min_rtt_us = rs->rtt_us;
+		bbr->min_rtt_stamp = tcp_time_stamp;
+	}
+
+	if (bbr_probe_rtt_mode_ms > 0 && filter_expired &&
+	    !bbr->idle_restart && bbr->mode != BBR_PROBE_RTT) {
+		bbr->mode = BBR_PROBE_RTT;  /* dip, drain queue */
+		bbr->pacing_gain = BBR_UNIT;
+		bbr->cwnd_gain = BBR_UNIT;
+		bbr_save_cwnd(sk);  /* note cwnd so we can restore it */
+		bbr->probe_rtt_done_stamp = 0;
+	}
+
+	if (bbr->mode == BBR_PROBE_RTT) {
+		/* Ignore low rate samples during this mode. */
+		tp->app_limited =
+			(tp->delivered + tcp_packets_in_flight(tp)) ? : 1;
+		/* Maintain min packets in flight for max(200 ms, 1 round). */
+		if (!bbr->probe_rtt_done_stamp &&
+		    tcp_packets_in_flight(tp) <= bbr_cwnd_min_target) {
+			bbr->probe_rtt_done_stamp = tcp_time_stamp +
+				msecs_to_jiffies(bbr_probe_rtt_mode_ms);
+			bbr->probe_rtt_round_done = 0;
+			bbr->next_rtt_delivered = tp->delivered;
+		} else if (bbr->probe_rtt_done_stamp) {
+			if (bbr->round_start)
+				bbr->probe_rtt_round_done = 1;
+			if (bbr->probe_rtt_round_done &&
+			    after(tcp_time_stamp, bbr->probe_rtt_done_stamp)) {
+				bbr->min_rtt_stamp = tcp_time_stamp;
+				bbr->restore_cwnd = 1;  /* snap to prior_cwnd */
+				bbr_reset_mode(sk);
+			}
+		}
+	}
+	bbr->idle_restart = 0;
+}
+
+static void bbr_update_model(struct sock *sk, const struct rate_sample *rs)
+{
+	bbr_update_bw(sk, rs);
+	bbr_update_cycle_phase(sk, rs);
+	bbr_check_full_bw_reached(sk, rs);
+	bbr_check_drain(sk, rs);
+	bbr_update_min_rtt(sk, rs);
+}
+
+static void bbr_main(struct sock *sk, const struct rate_sample *rs)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+	u32 bw;
+
+	bbr_update_model(sk, rs);
+
+	bw = bbr_bw(sk);
+	bbr_set_pacing_rate(sk, bw, bbr->pacing_gain);
+	bbr_set_tso_segs_goal(sk);
+	bbr_set_cwnd(sk, rs, rs->acked_sacked, bw, bbr->cwnd_gain);
+}
+
+static void bbr_init(struct sock *sk)
+{
+	struct tcp_sock *tp = tcp_sk(sk);
+	struct bbr *bbr = inet_csk_ca(sk);
+	u64 bw;
+
+	bbr->prior_cwnd = 0;
+	bbr->tso_segs_goal = 0;	 /* default segs per skb until first ACK */
+	bbr->rtt_cnt = 0;
+	bbr->next_rtt_delivered = 0;
+	bbr->prev_ca_state = TCP_CA_Open;
+	bbr->packet_conservation = 0;
+
+	bbr->probe_rtt_done_stamp = 0;
+	bbr->probe_rtt_round_done = 0;
+	bbr->min_rtt_us = tcp_min_rtt(tp);
+	bbr->min_rtt_stamp = tcp_time_stamp;
+
+	minmax_reset(&bbr->bw, bbr->rtt_cnt, 0);  /* init max bw to 0 */
+
+	/* Initialize pacing rate to: high_gain * init_cwnd / RTT. */
+	bw = (u64)tp->snd_cwnd * BW_UNIT;
+	do_div(bw, (tp->srtt_us >> 3) ? : USEC_PER_MSEC);
+	sk->sk_pacing_rate = 0;		/* force an update of sk_pacing_rate */
+	bbr_set_pacing_rate(sk, bw, bbr_high_gain);
+
+	bbr->restore_cwnd = 0;
+	bbr->round_start = 0;
+	bbr->idle_restart = 0;
+	bbr->full_bw = 0;
+	bbr->full_bw_cnt = 0;
+	bbr->cycle_mstamp.v64 = 0;
+	bbr->cycle_idx = 0;
+	bbr_reset_lt_bw_sampling(sk);
+	bbr_reset_startup_mode(sk);
+}
+
+static u32 bbr_sndbuf_expand(struct sock *sk)
+{
+	/* Provision 3 * cwnd since BBR may slow-start even during recovery. */
+	return 3;
+}
+
+/* In theory BBR does not need to undo the cwnd since it does not
+ * always reduce cwnd on losses (see bbr_main()). Keep it for now.
+ */
+static u32 bbr_undo_cwnd(struct sock *sk)
+{
+	return tcp_sk(sk)->snd_cwnd;
+}
+
+/* Entering loss recovery, so save cwnd for when we exit or undo recovery. */
+static u32 bbr_ssthresh(struct sock *sk)
+{
+	bbr_save_cwnd(sk);
+	return TCP_INFINITE_SSTHRESH;	 /* BBR does not use ssthresh */
+}
+
+static size_t bbr_get_info(struct sock *sk, u32 ext, int *attr,
+			   union tcp_cc_info *info)
+{
+	if (ext & (1 << (INET_DIAG_BBRINFO - 1)) ||
+	    ext & (1 << (INET_DIAG_VEGASINFO - 1))) {
+		struct tcp_sock *tp = tcp_sk(sk);
+		struct bbr *bbr = inet_csk_ca(sk);
+		u64 bw = bbr_bw(sk);
+
+		bw = bw * tp->mss_cache * USEC_PER_SEC >> BW_SCALE;
+		memset(&info->bbr, 0, sizeof(info->bbr));
+		info->bbr.bbr_bw_lo		= (u32)bw;
+		info->bbr.bbr_bw_hi		= (u32)(bw >> 32);
+		info->bbr.bbr_min_rtt		= bbr->min_rtt_us;
+		info->bbr.bbr_pacing_gain	= bbr->pacing_gain;
+		info->bbr.bbr_cwnd_gain		= bbr->cwnd_gain;
+		*attr = INET_DIAG_BBRINFO;
+		return sizeof(info->bbr);
+	}
+	return 0;
+}
+
+static void bbr_set_state(struct sock *sk, u8 new_state)
+{
+	struct bbr *bbr = inet_csk_ca(sk);
+
+	if (new_state == TCP_CA_Loss) {
+		struct rate_sample rs = { .losses = 1 };
+
+		bbr->prev_ca_state = TCP_CA_Loss;
+		bbr->full_bw = 0;
+		bbr->round_start = 1;	/* treat RTO like end of a round */
+		bbr_lt_bw_sampling(sk, &rs);
+	}
+}
+
+static struct tcp_congestion_ops tcp_bbr_cong_ops __read_mostly = {
+	.flags		= TCP_CONG_NON_RESTRICTED,
+	.name		= "bbr",
+	.owner		= THIS_MODULE,
+	.init		= bbr_init,
+	.cong_control	= bbr_main,
+	.sndbuf_expand	= bbr_sndbuf_expand,
+	.undo_cwnd	= bbr_undo_cwnd,
+	.cwnd_event	= bbr_cwnd_event,
+	.ssthresh	= bbr_ssthresh,
+	.tso_segs_goal	= bbr_tso_segs_goal,
+	.get_info	= bbr_get_info,
+	.set_state	= bbr_set_state,
+};
+
+static int __init bbr_register(void)
+{
+	BUILD_BUG_ON(sizeof(struct bbr) > ICSK_CA_PRIV_SIZE);
+	return tcp_register_congestion_control(&tcp_bbr_cong_ops);
+}
+
+static void __exit bbr_unregister(void)
+{
+	tcp_unregister_congestion_control(&tcp_bbr_cong_ops);
+}
+
+module_init(bbr_register);
+module_exit(bbr_unregister);
+
+MODULE_AUTHOR("Van Jacobson <vanj@google.com>");
+MODULE_AUTHOR("Neal Cardwell <ncardwell@google.com>");
+MODULE_AUTHOR("Yuchung Cheng <ycheng@google.com>");
+MODULE_AUTHOR("Soheil Hassas Yeganeh <soheil@google.com>");
+MODULE_LICENSE("Dual BSD/GPL");
+MODULE_DESCRIPTION("TCP BBR (Bottleneck Bandwidth and RTT)");
-- 
2.8.0.rc3.226.g39d4020
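
For readers following the fixed-point arithmetic in bbr_update_bw() and
bbr_check_full_bw_reached() in the patch above, here is a minimal user-space
sketch, not kernel code. It assumes BW_SCALE = 24, BBR_SCALE = 8, a 1.25x
growth threshold and a 3-round limit; those values are defined outside this
excerpt (the comment in the patch only states 25% and 3). The names
bw_sample(), full_bw_round() and struct full_bw_state are made up for
illustration.

```c
#include <assert.h>
#include <stdint.h>

#define BW_SCALE	24			/* assumed shift for bw values */
#define BW_UNIT		(1ULL << BW_SCALE)
#define BBR_SCALE	8			/* assumed shift for gain values */
#define BBR_UNIT	(1U << BBR_SCALE)

/* Assumed defaults: 1.25x growth threshold, 3 rounds without growth. */
static const uint32_t full_bw_thresh = BBR_UNIT * 5 / 4;
static const uint32_t full_bw_cnt_max = 3;

/* Hypothetical stand-in for the two fields the patch keeps in struct bbr. */
struct full_bw_state {
	uint32_t full_bw;	/* baseline bw at the last significant growth */
	uint32_t full_bw_cnt;	/* rounds since then without >= 25% growth */
};

/* Packets per usec, scaled by BW_UNIT so that small ratios keep precision.
 * Like the patch, the result is narrowed to 32 bits.
 */
static uint32_t bw_sample(uint32_t delivered_pkts, uint32_t interval_us)
{
	assert(interval_us > 0);
	return (uint32_t)(((uint64_t)delivered_pkts * BW_UNIT) / interval_us);
}

/* Call once per non-app-limited round with the current max-filtered bw.
 * Returns 1 once bw has failed to grow by 25% for 3 rounds in a row.
 */
static int full_bw_round(struct full_bw_state *s, uint32_t max_bw)
{
	uint32_t thresh;

	thresh = (uint32_t)(((uint64_t)s->full_bw * full_bw_thresh) >> BBR_SCALE);
	if (max_bw >= thresh) {
		s->full_bw = max_bw;	/* still growing: move the baseline */
		s->full_bw_cnt = 0;
	} else {
		s->full_bw_cnt++;	/* plateau: count this round */
	}
	return s->full_bw_cnt >= full_bw_cnt_max;
}
```

Feeding per-round values of 1000, 1300, 1400, 1500 and 1600 into
full_bw_round() resets the count on the second round (30% growth) and reports
a full pipe on the fifth, after three rounds that each stayed under 1.25x of
the 1300 baseline.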


* [PATCH v4 net-next 15/16] tcp: increase ICSK_CA_PRIV_SIZE from 64 bytes to 88
From: Neal Cardwell @ 2016-09-20  3:39 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Neal Cardwell, Van Jacobson, Yuchung Cheng,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474342763-16715-1-git-send-email-ncardwell@google.com>

The TCP CUBIC module already uses all 64 bytes of icsk_ca_priv.
The upcoming TCP BBR module needs 88 bytes, so grow the array to fit.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/inet_connection_sock.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 49dcad4..197a30d 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -134,8 +134,8 @@ struct inet_connection_sock {
 	} icsk_mtup;
 	u32			  icsk_user_timeout;
 
-	u64			  icsk_ca_priv[64 / sizeof(u64)];
-#define ICSK_CA_PRIV_SIZE      (8 * sizeof(u64))
+	u64			  icsk_ca_priv[88 / sizeof(u64)];
+#define ICSK_CA_PRIV_SIZE      (11 * sizeof(u64))
 };
 
 #define ICSK_TIME_RETRANS	1	/* Retransmit timer */
-- 
2.8.0.rc3.226.g39d4020
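
The size limit above is what bbr_register() checks with BUILD_BUG_ON() in
the BBR patch. As a rough user-space analogue (a sketch only, assuming a C11
compiler), the same guard can be written with _Static_assert. CA_PRIV_WORDS,
CA_PRIV_SIZE, struct demo_cc and ca_priv_fits() are hypothetical names that
mirror, but are not, the kernel definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirror the two patched definitions: array length and size macro. */
#define CA_PRIV_WORDS	(88 / sizeof(uint64_t))
#define CA_PRIV_SIZE	(11 * sizeof(uint64_t))

/* Hypothetical per-connection state for some CC module (not struct bbr). */
struct demo_cc {
	uint64_t cycle_stamp;
	uint32_t min_rtt_us;
	uint32_t bw_filter[6];
	uint32_t flags;
};

/* The array length and the macro are written separately, so check both. */
_Static_assert(CA_PRIV_WORDS * sizeof(uint64_t) == CA_PRIV_SIZE,
	       "array length and size macro disagree");

/* Compile-time guard, analogous in spirit to BUILD_BUG_ON(). */
_Static_assert(sizeof(struct demo_cc) <= CA_PRIV_SIZE,
	       "CC private state does not fit in the private area");

/* Runtime form of the same check, e.g. for sizes known only at run time. */
static int ca_priv_fits(size_t state_size)
{
	assert(state_size > 0);
	return state_size <= CA_PRIV_SIZE;
}
```

Because the array length (88 / sizeof(u64)) and the macro (11 * sizeof(u64))
are spelled independently in the patch, the first assertion is the kind of
check that would catch one being updated without the other.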


* [PATCH v4 net-next 14/16] tcp: new CC hook to set sending rate with rate_sample in any CA state
From: Neal Cardwell @ 2016-09-20  3:39 UTC (permalink / raw)
  To: David Miller
  Cc: netdev, Yuchung Cheng, Van Jacobson, Neal Cardwell,
	Nandita Dukkipati, Eric Dumazet, Soheil Hassas Yeganeh
In-Reply-To: <1474342763-16715-1-git-send-email-ncardwell@google.com>

From: Yuchung Cheng <ycheng@google.com>

This commit introduces an optional new "omnipotent" hook,
cong_control(), for congestion control modules. The cong_control()
function is called at the end of processing an ACK (i.e., after
updating sequence numbers, the SACK scoreboard, and loss
detection). At that moment we have precise delivery rate information
the congestion control module can use to control the sending behavior
(using cwnd, TSO skb size, and pacing rate) in any CA state.

A congestion control module that prefers not to use the default cwnd
reduction approach (i.e., the PRR algorithm) during CA_Recovery can
also use this function to control the cwnd and sending rate during
loss recovery.

We take advantage of the fact that recent changes defer the
retransmission or transmission of new data (e.g. by F-RTO) in recovery
until the new tcp_cong_control() function is run.

With this commit, we only run tcp_update_pacing_rate() if the
congestion control is not using this new API. New congestion controls
which use the new API do not want the TCP stack to run the default
pacing rate calculation and overwrite whatever pacing rate they have
chosen at initialization time.

Signed-off-by: Van Jacobson <vanj@google.com>
Signed-off-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: Soheil Hassas Yeganeh <soheil@google.com>
---
 include/net/tcp.h    |  4 ++++
 net/ipv4/tcp_cong.c  |  2 +-
 net/ipv4/tcp_input.c | 17 ++++++++++++++---
 3 files changed, 19 insertions(+), 4 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 1aa9628..f83b7f2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -919,6 +919,10 @@ struct tcp_congestion_ops {
 	u32 (*tso_segs_goal)(struct sock *sk);
 	/* returns the multiplier used in tcp_sndbuf_expand (optional) */
 	u32 (*sndbuf_expand)(struct sock *sk);
+	/* called when packets are delivered to update cwnd and pacing rate,
+	 * after all the ca_state processing. (optional)
+	 */
+	void (*cong_control)(struct sock *sk, const struct rate_sample *rs);
 	/* get info for inet_diag (optional) */
 	size_t (*get_info)(struct sock *sk, u32 ext, int *attr,
 			   union tcp_cc_info *info);
diff --git a/net/ipv4/tcp_cong.c b/net/ipv4/tcp_cong.c
index 882caa4..1294af4 100644
--- a/net/ipv4/tcp_cong.c
+++ b/net/ipv4/tcp_cong.c
@@ -69,7 +69,7 @@ int tcp_register_congestion_control(struct tcp_congestion_ops *ca)
 	int ret = 0;
 
 	/* all algorithms must implement ssthresh and cong_avoid ops */
-	if (!ca->ssthresh || !ca->cong_avoid) {
+	if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control)) {
 		pr_err("%s does not implement required ops\n", ca->name);
 		return -EINVAL;
 	}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 5af0bf3..28cfe99 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2536,6 +2536,9 @@ static inline void tcp_end_cwnd_reduction(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
 
+	if (inet_csk(sk)->icsk_ca_ops->cong_control)
+		return;
+
 	/* Reset cwnd to ssthresh in CWR or Recovery (unless it's undone) */
 	if (inet_csk(sk)->icsk_ca_state == TCP_CA_CWR ||
 	    (tp->undo_marker && tp->snd_ssthresh < TCP_INFINITE_SSTHRESH)) {
@@ -3312,8 +3315,15 @@ static inline bool tcp_may_raise_cwnd(const struct sock *sk, const int flag)
  * information. All transmission or retransmission are delayed afterwards.
  */
 static void tcp_cong_control(struct sock *sk, u32 ack, u32 acked_sacked,
-			     int flag)
+			     int flag, const struct rate_sample *rs)
 {
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ca_ops->cong_control) {
+		icsk->icsk_ca_ops->cong_control(sk, rs);
+		return;
+	}
+
 	if (tcp_in_cwnd_reduction(sk)) {
 		/* Reduce cwnd if state mandates */
 		tcp_cwnd_reduction(sk, acked_sacked, flag);
@@ -3683,7 +3693,7 @@ static int tcp_ack(struct sock *sk, const struct sk_buff *skb, int flag)
 	delivered = tp->delivered - delivered;	/* freshly ACKed or SACKed */
 	lost = tp->lost - lost;			/* freshly marked lost */
 	tcp_rate_gen(sk, delivered, lost, &now, &rs);
-	tcp_cong_control(sk, ack, delivered, flag);
+	tcp_cong_control(sk, ack, delivered, flag, &rs);
 	tcp_xmit_recovery(sk, rexmit);
 	return 1;
 
@@ -5981,7 +5991,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		} else
 			tcp_init_metrics(sk);
 
-		tcp_update_pacing_rate(sk);
+		if (!inet_csk(sk)->icsk_ca_ops->cong_control)
+			tcp_update_pacing_rate(sk);
 
 		/* Prevent spurious tcp_cwnd_restart() on first data packet */
 		tp->lsndtime = tcp_time_stamp;
-- 
2.8.0.rc3.226.g39d4020
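
To make the control flow of this patch easier to see in isolation, here is a
small user-space mock, offered as a sketch rather than the kernel code. It
reproduces two rules from the diff: registration accepts a module that has
ssthresh plus either cong_avoid or cong_control, and the per-ACK path hands
everything to cong_control when it is set. All *_mock and demo_* names are
invented for this example, and the default path is reduced to a bare
cong_avoid call (the real one also handles cwnd reduction and pacing).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct rate_sample_mock {
	int delivered;		/* packets delivered over the interval */
	long interval_us;	/* length of the sampling interval */
};

struct cc_ops_mock {
	unsigned int (*ssthresh)(void *sk);
	void (*cong_avoid)(void *sk, unsigned int ack, unsigned int acked);
	void (*cong_control)(void *sk, const struct rate_sample_mock *rs);
};

/* Same test as the patched tcp_register_congestion_control(). */
static int cc_register_check(const struct cc_ops_mock *ca)
{
	if (!ca->ssthresh || !(ca->cong_avoid || ca->cong_control))
		return -EINVAL;
	return 0;
}

/* Returns 1 if cong_control took over the ACK, 0 if the default path ran. */
static int cc_dispatch(const struct cc_ops_mock *ca, void *sk,
		       unsigned int ack, unsigned int acked,
		       const struct rate_sample_mock *rs)
{
	if (ca->cong_control) {
		ca->cong_control(sk, rs);
		return 1;
	}
	if (ca->cong_avoid)	/* simplified stand-in for the default path */
		ca->cong_avoid(sk, ack, acked);
	return 0;
}

/* Demo callbacks: "sk" is just a pointer to an int call counter here. */
static unsigned int demo_ssthresh(void *sk)
{
	(void)sk;
	return ~0U;
}

static void demo_cong_control(void *sk, const struct rate_sample_mock *rs)
{
	int *calls = sk;

	assert(rs != NULL);
	(*calls)++;
}
```

A module that fills in only ssthresh and cong_control, as tcp_bbr_cong_ops
does in this series, passes cc_register_check() and has every ACK routed to
its cong_control callback.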


