Netdev List
* Re: [PATCH] tcp: bound RTO to minimum
From: Eric Dumazet @ 2011-08-25  8:26 UTC (permalink / raw)
  To: Alexander Zimmermann
  Cc: Yuchung Cheng, Hagen Paul Pfeifer, netdev, Hannemann Arnd,
	Lukowski Damian
In-Reply-To: <4033BFEE-C432-4D94-8372-BA166AF2AA26@comsys.rwth-aachen.de>

On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
> Hi Eric,
> 
> On 25.08.2011 at 07:28, Eric Dumazet wrote:

> > The real question is: do we really want to process ~1000 timer interrupts
> > per TCP session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> > requests, only to make TCP recover in ~1 sec when connectivity comes
> > back? This just doesn't scale.
> 
> Maybe a stupid question, but 1000? With a minRTO of 200 ms and a maximum
> probing time of 120 s, we get 600 retransmits in the worst-case scenario
> (assuming that we get an ICMP for every RTO retransmission). No?

Where is the "max probing time of 120s" asserted?

It is not the case on my machine:
I have way more retransmits than that, even if spaced by 1600 ms

07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)

Old kernels were performing up to 15 retries, doing exponential backoff.

Now it's kind of unlimited, according to experimental results.
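
For scale, the elapsed time between the select() and the failing read() in the trace above can be worked out directly. The following is a minimal sketch, not taken from the thread: the helper names `parse_us`, `elapsed_us` and `intervals` are hypothetical, and it assumes both timestamps fall on the same day and that retransmits really are spaced a constant 1600 ms apart.

```c
#include <stdio.h>

/* Hypothetical helper: convert an strace-style "HH:MM:SS.uuuuuu"
 * timestamp (six fractional digits) to microseconds since midnight.
 * Returns -1 if the string does not match that layout. */
static long long parse_us(const char *ts)
{
	int h, m, s;
	long long us;

	if (sscanf(ts, "%d:%d:%d.%6lld", &h, &m, &s, &us) != 4)
		return -1;
	return ((h * 60LL + m) * 60LL + s) * 1000000LL + us;
}

/* Elapsed microseconds between two timestamps taken on the same day. */
static long long elapsed_us(const char *from, const char *to)
{
	return parse_us(to) - parse_us(from);
}

/* Number of whole retransmit intervals that fit into the elapsed time. */
static long long intervals(long long elapsed, long long spacing_us)
{
	return elapsed / spacing_us;
}
```

With the two timestamps from the trace, `elapsed_us("07:16:13.389417", "07:31:39.901311")` comes to about 926.5 seconds. At a constant 1600 ms spacing that is roughly 579 intervals, against 75 if probing were bounded at 120 s.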


* [PATCH 0/9] skb fragment API: convert non-network drivers
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: linux-atm-general-5NWGOfrQmneRv+LV9MX5uipxlwaOVQ5f,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA,
	linux-scsi-u79uwXL29TY76Z2rM5mHXA, devel-s9riP+hp16TNLxjTenLetw

The following series converts some non-network drivers to the SKB paged
fragment API introduced in 131ea6675c76. Included are ATM, InfiniBand,
and Fibre Channel. I also included the Broadcom network drivers since I
was touching the related FC driver.

This is part of my series to enable visibility into the lifecycle of SKB
paged fragments; [0] contains more background and rationale. Basically,
the completed series will allow entities which inject pages into the
networking stack to receive a notification when the stack has really
finished with those pages (i.e. including retransmissions, clones,
pull-ups etc.) and not just when the original skb is finished with. This
is beneficial to the many subsystems which wish to inject pages into the
network stack without giving up full ownership of those pages'
lifecycle. It implements something broadly along the lines of what was
described in [1].

Cheers,
Ian.

[0] http://marc.info/?l=linux-netdev&m=131072801125521&w=2
[1] http://marc.info/?l=linux-netdev&m=130925719513084&w=2
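
Most of the conversions in this series replace an open-coded pci_map_page()/dma_map_page() call with skb_frag_dma_map(), and the new calls pass 0 where the old ones passed frag->page_offset. The following userspace model sketches why that is equivalent, under the assumption that the helper forwards to dma_map_page() with the fragment's page and its page_offset plus the caller's offset. All types and the `model_` functions are hypothetical stand-ins, not kernel definitions.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in types for illustration only; not the kernel's definitions. */
struct model_page {
	uint64_t bus_base;	/* pretend bus address of the page start */
};

struct model_frag {
	struct model_page *page;
	uint32_t page_offset;	/* offset of the fragment within the page */
	uint32_t size;
};

/* Stand-in for dma_map_page(): bus address = page base + offset. */
static uint64_t model_dma_map_page(struct model_page *page, size_t offset)
{
	return page->bus_base + offset;
}

/* Accessor in the spirit of skb_frag_page(): hides the page member. */
static struct model_page *model_frag_page(const struct model_frag *frag)
{
	return frag->page;
}

/* Model of skb_frag_dma_map(): the offset argument is relative to the
 * start of the fragment, so the helper adds page_offset itself. */
static uint64_t model_frag_dma_map(const struct model_frag *frag, size_t offset)
{
	return model_dma_map_page(model_frag_page(frag),
				  frag->page_offset + offset);
}
```

In this model, calling `model_frag_dma_map(&frag, 0)` yields the same address as the old-style `model_dma_map_page(frag.page, frag.page_offset)`, which mirrors the mechanical substitution made in the driver patches.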

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* [PATCH 2/9] IB: amso1100: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Ian Campbell, Tom Tucker, Steve Wise, Roland Dreier, Sean Hefty,
	Hal Rosenstock, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1314260881.10283.48.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>

Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Tom Tucker <tom-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Cc: Steve Wise <swise-7bPotxP6k4+P2YhJcF5u+vpXobYPEAuW@public.gmane.org>
Cc: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Hal Rosenstock <hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 drivers/infiniband/hw/amso1100/c2.c |    8 +++-----
 1 files changed, 3 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/hw/amso1100/c2.c b/drivers/infiniband/hw/amso1100/c2.c
index 444470a..6a8f36e 100644
--- a/drivers/infiniband/hw/amso1100/c2.c
+++ b/drivers/infiniband/hw/amso1100/c2.c
@@ -802,11 +802,9 @@ static int c2_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
 		for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 			maplen = frag->size;
-			mapaddr =
-			    pci_map_page(c2dev->pcidev, frag->page,
-					 frag->page_offset, maplen,
-					 PCI_DMA_TODEVICE);
-
+			mapaddr = skb_frag_dma_map(&c2dev->pcidev->dev, frag,
+						   0, maplen,
+						   PCI_DMA_TODEVICE);
 			elem = elem->next;
 			elem->skb = NULL;
 			elem->mapaddr = mapaddr;
-- 
1.7.2.5


* [PATCH 3/9] IB: nes: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Ian Campbell, Faisal Latif, Roland Dreier, Sean Hefty,
	Hal Rosenstock, linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1314260881.10283.48.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>

Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Faisal Latif <faisal.latif-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Hal Rosenstock <hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 drivers/infiniband/hw/nes/nes_nic.c |   21 +++++++++++----------
 1 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/drivers/infiniband/hw/nes/nes_nic.c b/drivers/infiniband/hw/nes/nes_nic.c
index 66e1229..96cb35a 100644
--- a/drivers/infiniband/hw/nes/nes_nic.c
+++ b/drivers/infiniband/hw/nes/nes_nic.c
@@ -441,11 +441,11 @@ static int nes_nic_send(struct sk_buff *skb, struct net_device *netdev)
 		nesnic->tx_skb[nesnic->sq_head] = skb;
 		for (skb_fragment_index = 0; skb_fragment_index < skb_shinfo(skb)->nr_frags;
 				skb_fragment_index++) {
-			bus_address = pci_map_page( nesdev->pcidev,
-					skb_shinfo(skb)->frags[skb_fragment_index].page,
-					skb_shinfo(skb)->frags[skb_fragment_index].page_offset,
-					skb_shinfo(skb)->frags[skb_fragment_index].size,
-					PCI_DMA_TODEVICE);
+			skb_frag_t *frag =
+				&skb_shinfo(skb)->frags[skb_fragment_index];
+			bus_address = skb_frag_dma_map(&nesdev->pcidev->dev,
+						       frag, 0, frag->size,
+						       PCI_DMA_TODEVICE);
 			wqe_fragment_length[wqe_fragment_index] =
 					cpu_to_le16(skb_shinfo(skb)->frags[skb_fragment_index].size);
 			set_wqe_64bit_value(nic_sqe->wqe_words, NES_NIC_SQ_WQE_FRAG0_LOW_IDX+(2*wqe_fragment_index),
@@ -561,11 +561,12 @@ tso_sq_no_longer_full:
 			/* Map all the buffers */
 			for (tso_frag_count=0; tso_frag_count < skb_shinfo(skb)->nr_frags;
 					tso_frag_count++) {
-				tso_bus_address[tso_frag_count] = pci_map_page( nesdev->pcidev,
-						skb_shinfo(skb)->frags[tso_frag_count].page,
-						skb_shinfo(skb)->frags[tso_frag_count].page_offset,
-						skb_shinfo(skb)->frags[tso_frag_count].size,
-						PCI_DMA_TODEVICE);
+				skb_frag_t *frag =
+					&skb_shinfo(skb)->frags[tso_frag_count];
+				tso_bus_address[tso_frag_count] =
+					skb_frag_dma_map(&nesdev->pcidev->dev,
+							 frag, 0, frag->size,
+							 PCI_DMA_TODEVICE);
 			}
 
 			tso_frag_index = 0;
-- 
1.7.2.5


* [PATCH 4/9] IPoIB: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: Ian Campbell, Roland Dreier, Sean Hefty, Hal Rosenstock,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA
In-Reply-To: <1314260881.10283.48.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>

Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Roland Dreier <roland-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>
Cc: Sean Hefty <sean.hefty-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org>
Cc: Hal Rosenstock <hal.rosenstock-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |    5 +++--
 drivers/infiniband/ulp/ipoib/ipoib_ib.c |    5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 39913a0..67a477b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -169,7 +169,7 @@ static struct sk_buff *ipoib_cm_alloc_rx_skb(struct net_device *dev,
 			goto partial_error;
 		skb_fill_page_desc(skb, i, page, 0, PAGE_SIZE);
 
-		mapping[i + 1] = ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[i].page,
+		mapping[i + 1] = ib_dma_map_page(priv->ca, page,
 						 0, PAGE_SIZE, DMA_FROM_DEVICE);
 		if (unlikely(ib_dma_mapping_error(priv->ca, mapping[i + 1])))
 			goto partial_error;
@@ -537,7 +537,8 @@ static void skb_put_frags(struct sk_buff *skb, unsigned int hdr_space,
 
 		if (length == 0) {
 			/* don't need this page */
-			skb_fill_page_desc(toskb, i, frag->page, 0, PAGE_SIZE);
+			skb_fill_page_desc(toskb, i, skb_frag_page(frag),
+					   0, PAGE_SIZE);
 			--skb_shinfo(skb)->nr_frags;
 		} else {
 			size = min(length, (unsigned) PAGE_SIZE);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 81ae61d..00435be 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -182,7 +182,7 @@ static struct sk_buff *ipoib_alloc_rx_skb(struct net_device *dev, int id)
 			goto partial_error;
 		skb_fill_page_desc(skb, 0, page, 0, PAGE_SIZE);
 		mapping[1] =
-			ib_dma_map_page(priv->ca, skb_shinfo(skb)->frags[0].page,
+			ib_dma_map_page(priv->ca, page,
 					0, PAGE_SIZE, DMA_FROM_DEVICE);
 		if (unlikely(ib_dma_mapping_error(priv->ca, mapping[1])))
 			goto partial_error;
@@ -323,7 +323,8 @@ static int ipoib_dma_map_tx(struct ib_device *ca,
 
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; ++i) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
-		mapping[i + off] = ib_dma_map_page(ca, frag->page,
+		mapping[i + off] = ib_dma_map_page(ca,
+						 skb_frag_page(frag),
 						 frag->page_offset, frag->size,
 						 DMA_TO_DEVICE);
 		if (unlikely(ib_dma_mapping_error(ca, mapping[i + off])))
-- 
1.7.2.5


* [PATCH 5/9] tg3: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev-u79uwXL29TY76Z2rM5mHXA
  Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ, Matt Carlson,
	Ian Campbell, Michael Chan
In-Reply-To: <1314260881.10283.48.camel-o4Be2W7LfRlXesXXhkcM7miJhflN2719@public.gmane.org>

Signed-off-by: Ian Campbell <ian.campbell-Sxgqhf6Nn4DQT0dZR+AlfA@public.gmane.org>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk-QHcLZuEGTsvQT0dZR+AlfA@public.gmane.org>
Cc: Matt Carlson <mcarlson-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Cc: Michael Chan <mchan-dY08KVG/lbpWk0Htik3J/w@public.gmane.org>
Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
Cc: devicetree-discuss-uLR06cmDAlY/bJ5BZ2RsiQ@public.gmane.org
---
 drivers/net/ethernet/broadcom/tg3.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index 0f81111..a7e28a2 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -6311,10 +6311,8 @@ static netdev_tx_t tg3_start_xmit(struct sk_buff *skb, struct net_device *dev)
 			skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
 			len = frag->size;
-			mapping = pci_map_page(tp->pdev,
-					       frag->page,
-					       frag->page_offset,
-					       len, PCI_DMA_TODEVICE);
+			mapping = skb_frag_dma_map(&tp->pdev->dev, frag, 0,
+						   len, PCI_DMA_TODEVICE);
 
 			tnapi->tx_buffers[entry].skb = NULL;
 			dma_unmap_addr_set(&tnapi->tx_buffers[entry], mapping,
-- 
1.7.2.5


* [PATCH 1/9] atm: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev; +Cc: Ian Campbell, Chas Williams, linux-atm-general
In-Reply-To: <1314260881.10283.48.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Chas Williams <chas@cmf.nrl.navy.mil>
Cc: linux-atm-general@lists.sourceforge.net
Cc: netdev@vger.kernel.org

--
The original logic here appears to be bogus (adding page-offset to the struct
page * itself doesn't seem likely to be correct) but I left that unchanged for
this mechanical change.
---
 drivers/atm/eni.c |    3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/atm/eni.c b/drivers/atm/eni.c
index 9307141..f7ca4c1 100644
--- a/drivers/atm/eni.c
+++ b/drivers/atm/eni.c
@@ -1134,7 +1134,8 @@ DPRINTK("doing direct send\n"); /* @@@ well, this doesn't work anyway */
 				    skb_headlen(skb));
 			else
 				put_dma(tx->index,eni_dev->dma,&j,(unsigned long)
-				    skb_shinfo(skb)->frags[i].page + skb_shinfo(skb)->frags[i].page_offset,
+				    skb_frag_page(&skb_shinfo(skb)->frags[i]) +
+					skb_shinfo(skb)->frags[i].page_offset,
 				    skb_shinfo(skb)->frags[i].size);
 	}
 	if (skb->len & 3)
-- 
1.7.2.5


* [PATCH 6/9] bnx2: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev; +Cc: Ian Campbell, Michael Chan
In-Reply-To: <1314260881.10283.48.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Michael Chan <mchan@broadcom.com>
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/broadcom/bnx2.c |    8 ++++----
 1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2.c b/drivers/net/ethernet/broadcom/bnx2.c
index 4a9a8c81..9afb653 100644
--- a/drivers/net/ethernet/broadcom/bnx2.c
+++ b/drivers/net/ethernet/broadcom/bnx2.c
@@ -2930,8 +2930,8 @@ bnx2_reuse_rx_skb_pages(struct bnx2 *bp, struct bnx2_rx_ring_info *rxr,
 
 		shinfo = skb_shinfo(skb);
 		shinfo->nr_frags--;
-		page = shinfo->frags[shinfo->nr_frags].page;
-		shinfo->frags[shinfo->nr_frags].page = NULL;
+		page = skb_frag_page(&shinfo->frags[shinfo->nr_frags]);
+		__skb_frag_set_page(&shinfo->frags[shinfo->nr_frags], NULL);
 
 		cons_rx_pg->page = page;
 		dev_kfree_skb(skb);
@@ -6511,8 +6511,8 @@ bnx2_start_xmit(struct sk_buff *skb, struct net_device *dev)
 		txbd = &txr->tx_desc_ring[ring_prod];
 
 		len = frag->size;
-		mapping = dma_map_page(&bp->pdev->dev, frag->page, frag->page_offset,
-				       len, PCI_DMA_TODEVICE);
+		mapping = skb_frag_dma_map(&bp->pdev->dev, frag, 0, len,
+					   PCI_DMA_TODEVICE);
 		if (dma_mapping_error(&bp->pdev->dev, mapping))
 			goto dma_error;
 		dma_unmap_addr_set(&txr->tx_buf_ring[ring_prod], mapping,
-- 
1.7.2.5


* [PATCH 7/9] bnx2x: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev; +Cc: Ian Campbell, Eilon Greenstein
In-Reply-To: <1314260881.10283.48.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Eilon Greenstein <eilong@broadcom.com>
Cc: netdev@vger.kernel.org
---
 drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c |    5 ++---
 1 files changed, 2 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
index 93bff08..5c3eb17 100644
--- a/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
+++ b/drivers/net/ethernet/broadcom/bnx2x/bnx2x_cmn.c
@@ -2800,9 +2800,8 @@ netdev_tx_t bnx2x_start_xmit(struct sk_buff *skb, struct net_device *dev)
 	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++) {
 		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];
 
-		mapping = dma_map_page(&bp->pdev->dev, frag->page,
-				       frag->page_offset, frag->size,
-				       DMA_TO_DEVICE);
+		mapping = skb_frag_dma_map(&bp->pdev->dev, frag, 0, frag->size,
+					   DMA_TO_DEVICE);
 		if (unlikely(dma_mapping_error(&bp->pdev->dev, mapping))) {
 
 			DP(NETIF_MSG_TX_QUEUED, "Unable to map page - "
-- 
1.7.2.5


* [PATCH 8/9] bnx2fc: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev
  Cc: Ian Campbell, Bhanu Prakash Gollapudi, James E.J. Bottomley,
	linux-scsi
In-Reply-To: <1314260881.10283.48.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Bhanu Prakash Gollapudi <bprakash@broadcom.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: linux-scsi@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 drivers/scsi/bnx2fc/bnx2fc_fcoe.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
index 7cb2cd4..2c780a7 100644
--- a/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
+++ b/drivers/scsi/bnx2fc/bnx2fc_fcoe.c
@@ -302,7 +302,7 @@ static int bnx2fc_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(frag->page, KM_SKB_DATA_SOFTIRQ)
+		cp = kmap_atomic(skb_frag_page(frag), KM_SKB_DATA_SOFTIRQ)
 				+ frag->page_offset;
 	} else {
 		cp = (struct fcoe_crc_eof *)skb_put(skb, tlen);
-- 
1.7.2.5


* [PATCH 9/9] fcoe: convert to SKB paged frag API.
From: Ian Campbell @ 2011-08-25  8:28 UTC (permalink / raw)
  To: netdev; +Cc: Ian Campbell, Robert Love, James E.J. Bottomley, devel,
	linux-scsi
In-Reply-To: <1314260881.10283.48.camel@zakaz.uk.xensource.com>

Signed-off-by: Ian Campbell <ian.campbell@citrix.com>
Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
Cc: Robert Love <robert.w.love@intel.com>
Cc: "James E.J. Bottomley" <JBottomley@parallels.com>
Cc: devel@open-fcoe.org
Cc: linux-scsi@vger.kernel.org
Cc: netdev@vger.kernel.org
---
 drivers/scsi/fcoe/fcoe.c           |    2 +-
 drivers/scsi/fcoe/fcoe_transport.c |    5 +++--
 2 files changed, 4 insertions(+), 3 deletions(-)

diff --git a/drivers/scsi/fcoe/fcoe.c b/drivers/scsi/fcoe/fcoe.c
index ba710e3..3416ab6 100644
--- a/drivers/scsi/fcoe/fcoe.c
+++ b/drivers/scsi/fcoe/fcoe.c
@@ -1514,7 +1514,7 @@ int fcoe_xmit(struct fc_lport *lport, struct fc_frame *fp)
 			return -ENOMEM;
 		}
 		frag = &skb_shinfo(skb)->frags[skb_shinfo(skb)->nr_frags - 1];
-		cp = kmap_atomic(frag->page, KM_SKB_DATA_SOFTIRQ)
+		cp = kmap_atomic(skb_frag_page(frag), KM_SKB_DATA_SOFTIRQ)
 			+ frag->page_offset;
 	} else {
 		cp = (struct fcoe_crc_eof *)skb_put(skb, tlen);
diff --git a/drivers/scsi/fcoe/fcoe_transport.c b/drivers/scsi/fcoe/fcoe_transport.c
index 41068e8..f6613f9 100644
--- a/drivers/scsi/fcoe/fcoe_transport.c
+++ b/drivers/scsi/fcoe/fcoe_transport.c
@@ -108,8 +108,9 @@ u32 fcoe_fc_crc(struct fc_frame *fp)
 		len = frag->size;
 		while (len > 0) {
 			clen = min(len, PAGE_SIZE - (off & ~PAGE_MASK));
-			data = kmap_atomic(frag->page + (off >> PAGE_SHIFT),
-					   KM_SKB_DATA_SOFTIRQ);
+			data = kmap_atomic(
+				skb_frag_page(frag) + (off >> PAGE_SHIFT),
+				KM_SKB_DATA_SOFTIRQ);
 			crc = crc32(crc, data + (off & ~PAGE_MASK), clen);
 			kunmap_atomic(data, KM_SKB_DATA_SOFTIRQ);
 			off += clen;
-- 
1.7.2.5


* Re: [PATCH] tcp: bound RTO to minimum
From: Alexander Zimmermann @ 2011-08-25  8:44 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Yuchung Cheng, Hagen Paul Pfeifer, netdev, Hannemann Arnd,
	Lukowski Damian
In-Reply-To: <1314260805.2387.11.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>


On 25.08.2011 at 10:26, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>> Hi Eric,
>> 
>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
> 
>>> The real question is: do we really want to process ~1000 timer interrupts
>>> per TCP session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>> requests, only to make TCP recover in ~1 sec when connectivity comes
>>> back? This just doesn't scale.
>> 
>> Maybe a stupid question, but 1000? With a minRTO of 200 ms and a maximum
>> probing time of 120 s, we get 600 retransmits in the worst-case scenario
>> (assuming that we get an ICMP for every RTO retransmission). No?
> 
> Where is the "max probing time of 120s" asserted?
> 
> It is not the case on my machine:
> I have way more retransmits than that, even if spaced by 1600 ms
> 
> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> 
> Old kernels were performing up to 15 retries, doing exponential backoff.

Yes, I know. And in combination with RFC 6069 we have to convert this.
See Section 7.1

and

http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6fa12c85031485dff38ce550c24f10da23b0adaa

Is the transformation broken? Damian?


> 
> Now it's kind of unlimited, according to experimental results.

Ok, unlimited is not what I expect...


> 
> 
> 

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: zimmermann@cs.rwth-aachen.de
// web: http://www.umic-mesh.net
//


* Re: [PATCH] tcp: bound RTO to minimum
From: Arnd Hannemann @ 2011-08-25  8:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <1314260805.2387.11.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Hi,

On 25.08.2011 at 10:26, Eric Dumazet wrote:
> On Thursday, 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>> Hi Eric,
>>
>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
> 
>>> The real question is: do we really want to process ~1000 timer interrupts
>>> per TCP session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>> requests, only to make TCP recover in ~1 sec when connectivity comes
>>> back? This just doesn't scale.
>>
>> Maybe a stupid question, but 1000? With a minRTO of 200 ms and a maximum
>> probing time of 120 s, we get 600 retransmits in the worst-case scenario
>> (assuming that we get an ICMP for every RTO retransmission). No?
> 
> Where is the "max probing time of 120s" asserted?
> 
> It is not the case on my machine:
> I have way more retransmits than that, even if spaced by 1600 ms
> 
> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> 
> Old kernels were performing up to 15 retries, doing exponential backoff.
> 
> Now it's kind of unlimited, according to experimental results.

That shouldn't be. It should stop after the same amount of time as a TCP
connection whose RTO equals the minimum RTO and which does 15 retries
(tcp_retries2=15) with exponential backoff. So it should be around 900s*.
But it could be that, because of the icsk_retransmit wraparound, this
doesn't work as expected.

* 200ms + 400ms + 800ms ...

Best regards,
Arnd


* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Ilpo Järvinen @ 2011-08-25  8:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <1314226834.6797.5.camel@edumazet-laptop>


On Thu, 25 Aug 2011, Eric Dumazet wrote:

> On Thursday, 25 August 2011 at 01:44 +0300, Ilpo Järvinen wrote:
> > On Wed, 24 Aug 2011, Eric Dumazet wrote:
> > 
> > > On one dev machine running net-next, I just found strange tcp sessions
> > > that retransmit a frame forever (The other peer disappeared)
> > > 
> > > # ss -emoi dst 10.2.1.1
> > > State      Recv-Q Send-Q      Local Address:Port          Peer Address:Port   
> > > ESTAB      0      816              10.2.1.2:37930             10.2.1.1:ssh      timer:(on,630ms,246) ino:60786 sk:ffff8801189aa400
> > > 	 mem:(r0,w3776,f320,t0) ts sack ecn cubic wscale:8,6 rto:1680 rtt:16.25/7.5 ato:40 ssthresh:7 send 1.4Mbps rcv_rtt:10 rcv_space:16632
> > > 
> > > 
> > > You can see the retransmit count : 246 
> > > 
> > > What possibly can be going on ?
> > > 
> > > What happened to backoff ?
> > 
> > But RTO (even without any backoffs) should be lower bounded to some not so 
> > zeroish value?
> 
> Apparently not.
> 
> The only thing that protects us from a flood is that ip_error() uses the
> inetpeer cache to ratelimit the icmp_send(ICMP_DEST_UNREACH)
> 
> This is why we get a retransmit period >= 1 sec
>
> vi +432 net/ipv4/tcp_ipv4.c
> 
>                 icsk->icsk_backoff--;
>                 inet_csk(sk)->icsk_rto = (tp->srtt ? __tcp_set_rto(tp) :
>                         TCP_TIMEOUT_INIT) << icsk->icsk_backoff;
>                 tcp_bound_rto(sk);
> 
> and __tcp_set_rto() uses : return (tp->srtt >> 3) + tp->rttvar;

So you think that this is not true?

        /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
         * guarantees that rto is higher.
         */

...it would still be smaller than 1 sec though, but certainly not going to
cause flooding either. Default tcp_rto_min should be 200 ms, so it's
5 pkts + 5 ICMPs sent, received and processed per second. That doesn't
sound like that bad a CPU load?!?

It is unclear to me how tp->rttvar could become smaller than
tcp_rto_min().

-- 
 i.
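
The quoted snippet can be modelled in userspace to see what values it produces. This is a sketch only: it works in plain milliseconds rather than the kernel's internal units, `model_rto_after_revert` and the `model_tp` struct are hypothetical, the 120 s upper bound stands in for tcp_bound_rto(), and the optional lower clamp is an assumption about what bounding the RTO to a minimum could look like, not the actual patch.

```c
/* Stand-in for the relevant tcp_sock fields, in milliseconds. */
struct model_tp {
	unsigned int srtt8;	/* smoothed RTT scaled by 8 */
	unsigned int rttvar;	/* RTT variance term */
};

#define MODEL_RTO_MAX_MS 120000u	/* upper bound, 120 s */

/* Mirrors the quoted formula: (srtt >> 3) + rttvar. */
static unsigned int model_set_rto(const struct model_tp *tp)
{
	return (tp->srtt8 >> 3) + tp->rttvar;
}

/* Model of the quoted ICMP handling: undo one backoff step, recompute
 * the RTO, bound it from above. rto_min_ms is a hypothetical lower
 * clamp; pass 0 to model the code as quoted. */
static unsigned int model_rto_after_revert(const struct model_tp *tp,
					    unsigned int *backoff,
					    unsigned int timeout_init_ms,
					    unsigned int rto_min_ms)
{
	unsigned int rto;

	if (*backoff > 0)
		(*backoff)--;
	rto = (tp->srtt8 ? model_set_rto(tp) : timeout_init_ms) << *backoff;
	if (rto > MODEL_RTO_MAX_MS)
		rto = MODEL_RTO_MAX_MS;
	if (rto < rto_min_ms)
		rto = rto_min_ms;
	return rto;
}
```

With made-up inputs srtt8 = 130 and rttvar = 8 (a small rttvar being exactly the case the message above doubts can occur), the unclamped result after the last backoff step is 24 ms, or roughly 40 retransmissions per second if each one is answered by an ICMP. With a 200 ms floor it is 200 ms, which gives the 5 per second mentioned above.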


* Re: slow performance on disk/network i/o full speed after drop_caches
From: Stefan Priebe - Profihost AG @ 2011-08-25  9:00 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Pekka Enberg, LKML, linux-mm@kvack.org, Andrew Morton, Mel Gorman,
	Jens Axboe, Linux Netdev List
In-Reply-To: <20110824093336.GB5214@localhost>

On 24.08.2011 at 11:33, Wu Fengguang wrote:
> On Wed, Aug 24, 2011 at 05:01:03PM +0800, Stefan Priebe - Profihost AG wrote:
>>
>>>> sync && echo 3 > /proc/sys/vm/drop_caches && sleep 2 && echo 0 > /proc/sys/vm/drop_caches
>>
>> Another way to get it working again is to stop some processes. Could be
>> mysql or apache or php fcgi doesn't matter. Just free some memory.
>> Although there are already 5GB free.
>
> Is it a NUMA machine and _every_ node has enough free pages?
>
>          grep . /sys/devices/system/node/node*/vmstat
>
> Thanks,
> Fengguang
Hi Fengguang,

thanks for your fast reply.

Here is the data you requested:

root@server1015-han:~# grep . /sys/devices/system/node/node*/vmstat
/sys/devices/system/node/node0/vmstat:nr_written 5546561
/sys/devices/system/node/node0/vmstat:nr_dirtied 5572497
/sys/devices/system/node/node1/vmstat:nr_written 3936
/sys/devices/system/node/node1/vmstat:nr_dirtied 4190

modified it a little bit:
~# while [ true ]; do ps -eo user,pid,tid,class,rtprio,ni,pri,psr,pcpu,vsz,rss,pmem,stat,wchan:28,cmd | grep scp | grep -v grep; sleep 1; done

root     12409 12409 TS       -   0  19   0 59.8  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 64.0  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 67.7  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 70.6  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 76.0  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 78.2  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 80.0  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 80.9  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   2 76.7  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 75.6  42136  1724  0.0 Ds   pipe_read                    scp -t /tmp/
root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 75.2  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 76.6  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 77.9  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 79.0  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 72.8  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 73.0  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 73.8  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 74.3  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.4  42136  1724  0.0 Ss   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   1 71.3  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 71.9  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 72.7  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   3 73.5  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   3 74.4  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   3 75.2  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   0 76.0  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 76.6  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 74.8  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.2  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   1 73.9  42136  1724  0.0 Rs   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   0 72.4  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.0  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.5  42136  1724  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12409 12409 TS       -   0  19   8 72.9  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12409 12409 TS       -   0  19   8 73.5  42136  1724  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1  0.0  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 23.0  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 49.5  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   2 63.3  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 71.5  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 77.4  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 70.3  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 73.1  42136  1728  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12566 12566 TS       -   0  19   0 65.7  42136  1728  0.0 Ss   poll_schedule_timeout        scp -t /tmp/
root     12566 12566 TS       -   0  19   1 61.2  42136  1728  0.0 Ss   -                            scp -t /tmp/
root     12566 12566 TS       -   0  19   1 63.7  42136  1728  0.0 Rs   -                            scp -t /tmp/
root     12636 12636 TS       -   0  19   8  0.0  42136  1728  0.0 Ss   poll_schedule_timeout        scp -t /tmp/


Stefan


^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Eric Dumazet @ 2011-08-25  9:09 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <4E560BFD.5020301@arndnet.de>

On Thursday 25 August 2011 at 10:46 +0200, Arnd Hannemann wrote:
> Hi,
> 
> On 25.08.2011 10:26, Eric Dumazet wrote:
> > On Thursday 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
> >> Hi Eric,
> >>
> >> On 25.08.2011 at 07:28, Eric Dumazet wrote:
> > 
> >>> The real question is: do we really want to process ~1000 timer interrupts
> >>> per TCP session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
> >>> requests, only to make TCP recover in ~1 sec when connectivity
> >>> returns? This just doesn't scale.
> >>
> >> Maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
> >> probing time of 120s, we get 600 retransmits in a worst-case scenario
> >> (assuming that we get an ICMP for every RTO retransmission). No?
> > 
> > Where is the "max probing time of 120s" asserted?
> > 
> > It is not the case on my machine:
> > I have way more retransmits than that, even if spaced by 1600 ms
> > 
> > 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
> > 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
> > 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
> > 
> > Old kernels were performing up to 15 retries, doing exponential backoff.
> > 
> > Now it's kind of unlimited, according to experimental results.
> 
> That shouldn't be. It should stop after the same time as a TCP connection that
> starts with an RTO of minimum RTO, does 15 retries (tcp_retries2=15) and uses
> exponential backoff. So it should be around 900s*. But it could be that, because
> of the icsk_retransmit wraparound, this doesn't work as expected.
> 
> * 200ms + 400ms + 800ms ...

It is 924 seconds with tcp_retries2=15 (the default value).

I said ~1000 probes.

If ICMP messages are not rate limited, that could be about 924*5 probes,
instead of 15 probes on old kernels.

Maybe we should refine the thing a bit, to not reverse backoff unless
rto is > some_threshold.

With 10s as the value, that would give at most 92 tries.

I mean, what is the gain of being able to restart a frozen TCP session with
a 1 sec latency instead of 10s if it was blocked for more than 60 seconds?
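
The figures in this mail can be reproduced with a small sketch. It assumes
that the RTO starts at 200 ms, doubles after every timeout and is capped at
120 s (the cap is not stated in this thread; it is the value I understand
TCP_RTO_MAX to have), and that 15 retries mean 16 timeouts in total. The
function names are illustrative only and do not exist in the kernel.

```c
#include <assert.h>

/* Sum of the retransmission timeouts, in milliseconds, for a connection
 * whose RTO starts at rto_min_ms, doubles after every timeout and is capped
 * at rto_max_ms. "retries" retransmissions mean retries + 1 timeouts: one
 * for the original transmission and one per retransmission.
 */
long backoff_total_ms(long rto_min_ms, long rto_max_ms, int retries)
{
	long rto = rto_min_ms;
	long total = 0;
	int i;

	assert(rto_min_ms > 0 && rto_max_ms >= rto_min_ms && retries >= 0);

	for (i = 0; i <= retries; i++) {
		total += rto;
		rto *= 2;
		if (rto > rto_max_ms)
			rto = rto_max_ms;
	}
	return total;
}

/* Number of probes sent during window_ms when one probe goes out every
 * interval_ms, i.e. when the backoff is reverted after each ICMP.
 */
long probes_in_window(long window_ms, long interval_ms)
{
	assert(window_ms >= 0 && interval_ms > 0);
	return window_ms / interval_ms;
}
```

Under these assumptions backoff_total_ms(200, 120000, 15) gives 924600 ms,
probes_in_window(924600, 200) gives 4623 (the "924*5" case) and
probes_in_window(924600, 10000) gives 92 (the 10s threshold case).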

^ permalink raw reply

* When set mtu 9600 by gfar_change_mtu, the maxfrm register is greater than 9600
From: Rongqing Li @ 2011-08-25  9:24 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: netdev

Hi:

When the MTU is set to 9600 by gfar_change_mtu() in gianfar.c, the maxfrm
register is set to 9728, which is greater than 9600.

But the MPC8315 Reference Manual says the value of maxfrm cannot be
greater than 9600.

Is this a defect? Do we need to fix it?
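
One plausible origin of the 9728 value is sketched below. It assumes that
the driver adds the 14-byte Ethernet header to the MTU and then rounds the
frame size up to the next 512-byte increment; I believe gfar_change_mtu()
does roughly this, but the constants and the helper name here are
illustrative and are not copied from gianfar.c.

```c
#include <assert.h>

/* Assumed values, not taken from this mail: a 14-byte Ethernet header and
 * a 512-byte buffer size increment used for rounding.
 */
#define SKETCH_ETH_HLEN		14
#define SKETCH_BUFFER_INCREMENT	512

/* Hypothetical helper: buffer size (and maxfrm value) derived from an MTU.
 * extra_bytes stands for any additional per-frame overhead such as a VLAN
 * tag or padding.
 */
int sketch_maxfrm_for_mtu(int mtu, int extra_bytes)
{
	int frame_size;

	assert(mtu > 0 && extra_bytes >= 0);

	frame_size = mtu + SKETCH_ETH_HLEN + extra_bytes;

	/* Round down to a multiple of the increment, then add one increment. */
	return (frame_size & ~(SKETCH_BUFFER_INCREMENT - 1)) +
	       SKETCH_BUFFER_INCREMENT;
}
```

With these assumptions an MTU of 9600 yields 9728, matching the value
reported above, and an MTU of 1500 yields 1536. Because the result is always
a multiple of 512, the largest value it can take without exceeding 9600 is
9216.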


-- 
Best Regards,
Roy | RongQing Li

^ permalink raw reply

* Re: Use of 802.3ad bonding for increasing link throughput
From: Simon Horman @ 2011-08-25  9:35 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Tom Brown, netdev
In-Reply-To: <5344.1312998372@death>

On Wed, Aug 10, 2011 at 10:46:12AM -0700, Jay Vosburgh wrote:

[snip]

> 	On linux, the tcp_reordering sysctl value can be raised to
> compensate, but it will still result in increased packet overhead, and
> is not likely to be very efficient, and doesn't help with anything
> that's not TCP/IP.  I have not tested balance-rr in a few years now, but
> my recollection is that, as a best case, throughput of one TCP
> connection could reach about 1.5x with 2 slaves, or about 2.5x with 4
> slaves (where the multipliers are in units of "bandwidth of one slave").

Hi Jay,

for what it is worth I would like to chip in with the results of some
testing I did using balance-rr and 3 gigabit NICs late last year.  The
link was three direct ("cross-over") cables to a machine that was also
using balance-rr.


I found that by increasing rx-usecs (from 3 to 45) and enabling both GRO
and TSO I was able to push 2.7*10^9 bits/s.

Local CPU utilisation was 30% and remote CPU utilisation was 10%.
Local service demand was 1.7 us/KB and remote service demand was 2.2us/KB.

The MTU was 1500 bytes.

In this configuration, with the tuning options described above, increasing
tcp_reordering (to 127) did not have a noticeable effect on throughput but
did increase local CPU utilisation to about 50% and local service demand to
3.0 us/KB.  There was also increased remote CPU utilisation and service
demand, although not as significant.


By using a 9000 byte MTU I was able to get close to 3*10^9 bits/s
with other parameters at their default values.

Local CPU utilisation was 15% and remote CPU utilisation was 5%.
Local service demand was 0.8us/KB and remote service demand was 1.1us/KB.


Increasing rx-usecs was suggested to me by Eric Dumazet on this list.

I no longer have access to the systems that I used to run these tests but I
do have other results that I have omitted from this email for the sake of
brevity.


Anecdotally my opinion after running these and other tests is that if you
want to push more than a gigabit/s over a single TCP stream then you would
be well advised to get a faster link rather than bond gigabit devices.  I
believe you stated something similar earlier on in this thread.

^ permalink raw reply

* [PATCH net-next 0/2] Duplication of #define with mii.h.
From: Francois Romieu @ 2011-08-25  9:20 UTC (permalink / raw)
  To: davem; +Cc: netdev

Please pull from branch 'davem-next.mii' in repository

git://git.kernel.org/pub/scm/linux/kernel/git/romieu/netdev-2.6.git davem-next.mii

to get the changes below.

The sunbmac changes are not compile tested. Sunbmac changeset is on top of the
stack so it can be instantly removed if untrusted. Building a packaged rpm for a
cross sparc-linux compiler quickly turned more interesting than expected.

Distance from 'davem-next' (0856a304091b33a8e8f9f9c98e776f425af2b625)
---------------------------------------------------------------------

cd2967803617cd0a0bb8611e7d41c33a451207a5
78f6a6bd89e9a33e4be1bc61e6990a1172aa396e

Diffstat
--------

 drivers/net/ethernet/dlink/dl2k.c  |  105 +++++++++++++++++------------------
 drivers/net/ethernet/dlink/dl2k.h  |  110 +-----------------------------------
 drivers/net/ethernet/sun/sunbmac.c |   31 +++++-----
 drivers/net/ethernet/sun/sunbmac.h |   17 ------
 4 files changed, 69 insertions(+), 194 deletions(-)

Shortlog
--------

Francois Romieu (2):
      dl2k: use standard #defines from mii.h.
      sunbmac: use standard #defines from mii.h.

Patch
-----

See patches #1 and #2.

-- 
Ueimor

^ permalink raw reply

* [PATCH net-next 1/2] dl2k: use standard #defines from mii.h.
From: Francois Romieu @ 2011-08-25  9:21 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110825092019.GA21777@electric-eye.fr.zoreil.com>

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/dlink/dl2k.c |  105 +++++++++++++++++------------------
 drivers/net/ethernet/dlink/dl2k.h |  110 +------------------------------------
 2 files changed, 53 insertions(+), 162 deletions(-)

diff --git a/drivers/net/ethernet/dlink/dl2k.c b/drivers/net/ethernet/dlink/dl2k.c
index 3fa9140..b2dc2c8 100644
--- a/drivers/net/ethernet/dlink/dl2k.c
+++ b/drivers/net/ethernet/dlink/dl2k.c
@@ -1428,7 +1428,7 @@ mii_wait_link (struct net_device *dev, int wait)
 
 	do {
 		bmsr = mii_read (dev, phy_addr, MII_BMSR);
-		if (bmsr & MII_BMSR_LINK_STATUS)
+		if (bmsr & BMSR_LSTATUS)
 			return 0;
 		mdelay (1);
 	} while (--wait > 0);
@@ -1449,60 +1449,60 @@ mii_get_media (struct net_device *dev)
 
 	bmsr = mii_read (dev, phy_addr, MII_BMSR);
 	if (np->an_enable) {
-		if (!(bmsr & MII_BMSR_AN_COMPLETE)) {
+		if (!(bmsr & BMSR_ANEGCOMPLETE)) {
 			/* Auto-Negotiation not completed */
 			return -1;
 		}
-		negotiate = mii_read (dev, phy_addr, MII_ANAR) &
-			mii_read (dev, phy_addr, MII_ANLPAR);
-		mscr = mii_read (dev, phy_addr, MII_MSCR);
-		mssr = mii_read (dev, phy_addr, MII_MSSR);
-		if (mscr & MII_MSCR_1000BT_FD && mssr & MII_MSSR_LP_1000BT_FD) {
+		negotiate = mii_read (dev, phy_addr, MII_ADVERTISE) &
+			mii_read (dev, phy_addr, MII_LPA);
+		mscr = mii_read (dev, phy_addr, MII_CTRL1000);
+		mssr = mii_read (dev, phy_addr, MII_STAT1000);
+		if (mscr & ADVERTISE_1000FULL && mssr & LPA_1000FULL) {
 			np->speed = 1000;
 			np->full_duplex = 1;
 			printk (KERN_INFO "Auto 1000 Mbps, Full duplex\n");
-		} else if (mscr & MII_MSCR_1000BT_HD && mssr & MII_MSSR_LP_1000BT_HD) {
+		} else if (mscr & ADVERTISE_1000HALF && mssr & LPA_1000HALF) {
 			np->speed = 1000;
 			np->full_duplex = 0;
 			printk (KERN_INFO "Auto 1000 Mbps, Half duplex\n");
-		} else if (negotiate & MII_ANAR_100BX_FD) {
+		} else if (negotiate & ADVERTISE_100FULL) {
 			np->speed = 100;
 			np->full_duplex = 1;
 			printk (KERN_INFO "Auto 100 Mbps, Full duplex\n");
-		} else if (negotiate & MII_ANAR_100BX_HD) {
+		} else if (negotiate & ADVERTISE_100HALF) {
 			np->speed = 100;
 			np->full_duplex = 0;
 			printk (KERN_INFO "Auto 100 Mbps, Half duplex\n");
-		} else if (negotiate & MII_ANAR_10BT_FD) {
+		} else if (negotiate & ADVERTISE_10FULL) {
 			np->speed = 10;
 			np->full_duplex = 1;
 			printk (KERN_INFO "Auto 10 Mbps, Full duplex\n");
-		} else if (negotiate & MII_ANAR_10BT_HD) {
+		} else if (negotiate & ADVERTISE_10HALF) {
 			np->speed = 10;
 			np->full_duplex = 0;
 			printk (KERN_INFO "Auto 10 Mbps, Half duplex\n");
 		}
-		if (negotiate & MII_ANAR_PAUSE) {
+		if (negotiate & ADVERTISE_PAUSE_CAP) {
 			np->tx_flow &= 1;
 			np->rx_flow &= 1;
-		} else if (negotiate & MII_ANAR_ASYMMETRIC) {
+		} else if (negotiate & ADVERTISE_PAUSE_ASYM) {
 			np->tx_flow = 0;
 			np->rx_flow &= 1;
 		}
 		/* else tx_flow, rx_flow = user select  */
 	} else {
 		__u16 bmcr = mii_read (dev, phy_addr, MII_BMCR);
-		switch (bmcr & (MII_BMCR_SPEED_100 | MII_BMCR_SPEED_1000)) {
-		case MII_BMCR_SPEED_1000:
+		switch (bmcr & (BMCR_SPEED100 | BMCR_SPEED1000)) {
+		case BMCR_SPEED1000:
 			printk (KERN_INFO "Operating at 1000 Mbps, ");
 			break;
-		case MII_BMCR_SPEED_100:
+		case BMCR_SPEED100:
 			printk (KERN_INFO "Operating at 100 Mbps, ");
 			break;
 		case 0:
 			printk (KERN_INFO "Operating at 10 Mbps, ");
 		}
-		if (bmcr & MII_BMCR_DUPLEX_MODE) {
+		if (bmcr & BMCR_FULLDPLX) {
 			printk (KERN_CONT "Full duplex\n");
 		} else {
 			printk (KERN_CONT "Half duplex\n");
@@ -1536,24 +1536,22 @@ mii_set_media (struct net_device *dev)
 	if (np->an_enable) {
 		/* Advertise capabilities */
 		bmsr = mii_read (dev, phy_addr, MII_BMSR);
-		anar = mii_read (dev, phy_addr, MII_ANAR) &
-			     ~MII_ANAR_100BX_FD &
-			     ~MII_ANAR_100BX_HD &
-			     ~MII_ANAR_100BT4 &
-			     ~MII_ANAR_10BT_FD &
-			     ~MII_ANAR_10BT_HD;
-		if (bmsr & MII_BMSR_100BX_FD)
-			anar |= MII_ANAR_100BX_FD;
-		if (bmsr & MII_BMSR_100BX_HD)
-			anar |= MII_ANAR_100BX_HD;
-		if (bmsr & MII_BMSR_100BT4)
-			anar |= MII_ANAR_100BT4;
-		if (bmsr & MII_BMSR_10BT_FD)
-			anar |= MII_ANAR_10BT_FD;
-		if (bmsr & MII_BMSR_10BT_HD)
-			anar |= MII_ANAR_10BT_HD;
-		anar |= MII_ANAR_PAUSE | MII_ANAR_ASYMMETRIC;
-		mii_write (dev, phy_addr, MII_ANAR, anar);
+		anar = mii_read (dev, phy_addr, MII_ADVERTISE) &
+			~(ADVERTISE_100FULL | ADVERTISE_10FULL |
+			  ADVERTISE_100HALF | ADVERTISE_10HALF |
+			  ADVERTISE_100BASE4);
+		if (bmsr & BMSR_100FULL)
+			anar |= ADVERTISE_100FULL;
+		if (bmsr & BMSR_100HALF)
+			anar |= ADVERTISE_100HALF;
+		if (bmsr & BMSR_100BASE4)
+			anar |= ADVERTISE_100BASE4;
+		if (bmsr & BMSR_10FULL)
+			anar |= ADVERTISE_10FULL;
+		if (bmsr & BMSR_10HALF)
+			anar |= ADVERTISE_10HALF;
+		anar |= ADVERTISE_PAUSE_CAP | ADVERTISE_PAUSE_ASYM;
+		mii_write (dev, phy_addr, MII_ADVERTISE, anar);
 
 		/* Enable Auto crossover */
 		pscr = mii_read (dev, phy_addr, MII_PHY_SCR);
@@ -1561,8 +1559,8 @@ mii_set_media (struct net_device *dev)
 		mii_write (dev, phy_addr, MII_PHY_SCR, pscr);
 
 		/* Soft reset PHY */
-		mii_write (dev, phy_addr, MII_BMCR, MII_BMCR_RESET);
-		bmcr = MII_BMCR_AN_ENABLE | MII_BMCR_RESTART_AN | MII_BMCR_RESET;
+		mii_write (dev, phy_addr, MII_BMCR, BMCR_RESET);
+		bmcr = BMCR_ANENABLE | BMCR_ANRESTART | BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 		mdelay(1);
 	} else {
@@ -1574,7 +1572,7 @@ mii_set_media (struct net_device *dev)
 
 		/* 2) PHY Reset */
 		bmcr = mii_read (dev, phy_addr, MII_BMCR);
-		bmcr |= MII_BMCR_RESET;
+		bmcr |= BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 
 		/* 3) Power Down */
@@ -1583,25 +1581,25 @@ mii_set_media (struct net_device *dev)
 		mdelay (100);	/* wait a certain time */
 
 		/* 4) Advertise nothing */
-		mii_write (dev, phy_addr, MII_ANAR, 0);
+		mii_write (dev, phy_addr, MII_ADVERTISE, 0);
 
 		/* 5) Set media and Power Up */
-		bmcr = MII_BMCR_POWER_DOWN;
+		bmcr = BMCR_PDOWN;
 		if (np->speed == 100) {
-			bmcr |= MII_BMCR_SPEED_100;
+			bmcr |= BMCR_SPEED100;
 			printk (KERN_INFO "Manual 100 Mbps, ");
 		} else if (np->speed == 10) {
 			printk (KERN_INFO "Manual 10 Mbps, ");
 		}
 		if (np->full_duplex) {
-			bmcr |= MII_BMCR_DUPLEX_MODE;
+			bmcr |= BMCR_FULLDPLX;
 			printk (KERN_CONT "Full duplex\n");
 		} else {
 			printk (KERN_CONT "Half duplex\n");
 		}
 #if 0
 		/* Set 1000BaseT Master/Slave setting */
-		mscr = mii_read (dev, phy_addr, MII_MSCR);
+		mscr = mii_read (dev, phy_addr, MII_CTRL1000);
 		mscr |= MII_MSCR_CFG_ENABLE;
 		mscr &= ~MII_MSCR_CFG_VALUE = 0;
 #endif
@@ -1624,7 +1622,7 @@ mii_get_media_pcs (struct net_device *dev)
 
 	bmsr = mii_read (dev, phy_addr, PCS_BMSR);
 	if (np->an_enable) {
-		if (!(bmsr & MII_BMSR_AN_COMPLETE)) {
+		if (!(bmsr & BMSR_ANEGCOMPLETE)) {
 			/* Auto-Negotiation not completed */
 			return -1;
 		}
@@ -1649,7 +1647,7 @@ mii_get_media_pcs (struct net_device *dev)
 	} else {
 		__u16 bmcr = mii_read (dev, phy_addr, PCS_BMCR);
 		printk (KERN_INFO "Operating at 1000 Mbps, ");
-		if (bmcr & MII_BMCR_DUPLEX_MODE) {
+		if (bmcr & BMCR_FULLDPLX) {
 			printk (KERN_CONT "Full duplex\n");
 		} else {
 			printk (KERN_CONT "Half duplex\n");
@@ -1682,7 +1680,7 @@ mii_set_media_pcs (struct net_device *dev)
 	if (np->an_enable) {
 		/* Advertise capabilities */
 		esr = mii_read (dev, phy_addr, PCS_ESR);
-		anar = mii_read (dev, phy_addr, MII_ANAR) &
+		anar = mii_read (dev, phy_addr, MII_ADVERTISE) &
 			~PCS_ANAR_HALF_DUPLEX &
 			~PCS_ANAR_FULL_DUPLEX;
 		if (esr & (MII_ESR_1000BT_HD | MII_ESR_1000BX_HD))
@@ -1690,22 +1688,21 @@ mii_set_media_pcs (struct net_device *dev)
 		if (esr & (MII_ESR_1000BT_FD | MII_ESR_1000BX_FD))
 			anar |= PCS_ANAR_FULL_DUPLEX;
 		anar |= PCS_ANAR_PAUSE | PCS_ANAR_ASYMMETRIC;
-		mii_write (dev, phy_addr, MII_ANAR, anar);
+		mii_write (dev, phy_addr, MII_ADVERTISE, anar);
 
 		/* Soft reset PHY */
-		mii_write (dev, phy_addr, MII_BMCR, MII_BMCR_RESET);
-		bmcr = MII_BMCR_AN_ENABLE | MII_BMCR_RESTART_AN |
-		       MII_BMCR_RESET;
+		mii_write (dev, phy_addr, MII_BMCR, BMCR_RESET);
+		bmcr = BMCR_ANENABLE | BMCR_ANRESTART | BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 		mdelay(1);
 	} else {
 		/* Force speed setting */
 		/* PHY Reset */
-		bmcr = MII_BMCR_RESET;
+		bmcr = BMCR_RESET;
 		mii_write (dev, phy_addr, MII_BMCR, bmcr);
 		mdelay(10);
 		if (np->full_duplex) {
-			bmcr = MII_BMCR_DUPLEX_MODE;
+			bmcr = BMCR_FULLDPLX;
 			printk (KERN_INFO "Manual full duplex\n");
 		} else {
 			bmcr = 0;
@@ -1715,7 +1712,7 @@ mii_set_media_pcs (struct net_device *dev)
 		mdelay(10);
 
 		/*  Advertise nothing */
-		mii_write (dev, phy_addr, MII_ANAR, 0);
+		mii_write (dev, phy_addr, MII_ADVERTISE, 0);
 	}
 	return 0;
 }
diff --git a/drivers/net/ethernet/dlink/dl2k.h b/drivers/net/ethernet/dlink/dl2k.h
index 7caab3d..ba0adca 100644
--- a/drivers/net/ethernet/dlink/dl2k.h
+++ b/drivers/net/ethernet/dlink/dl2k.h
@@ -28,6 +28,7 @@
 #include <linux/init.h>
 #include <linux/crc32.h>
 #include <linux/ethtool.h>
+#include <linux/mii.h>
 #include <linux/bitops.h>
 #include <asm/processor.h>	/* Processor type for cache alignment. */
 #include <asm/io.h>
@@ -271,20 +272,9 @@ enum RFS_bits {
 #define MII_RESET_TIME_OUT		10000
 /* MII register */
 enum _mii_reg {
-	MII_BMCR = 0,
-	MII_BMSR = 1,
-	MII_PHY_ID1 = 2,
-	MII_PHY_ID2 = 3,
-	MII_ANAR = 4,
-	MII_ANLPAR = 5,
-	MII_ANER = 6,
-	MII_ANNPT = 7,
-	MII_ANLPRNP = 8,
-	MII_MSCR = 9,
-	MII_MSSR = 10,
-	MII_ESR = 15,
 	MII_PHY_SCR = 16,
 };
+
 /* PCS register */
 enum _pcs_reg {
 	PCS_BMCR = 0,
@@ -297,102 +287,6 @@ enum _pcs_reg {
 	PCS_ESR = 15,
 };
 
-/* Basic Mode Control Register */
-enum _mii_bmcr {
-	MII_BMCR_RESET = 0x8000,
-	MII_BMCR_LOOP_BACK = 0x4000,
-	MII_BMCR_SPEED_LSB = 0x2000,
-	MII_BMCR_AN_ENABLE = 0x1000,
-	MII_BMCR_POWER_DOWN = 0x0800,
-	MII_BMCR_ISOLATE = 0x0400,
-	MII_BMCR_RESTART_AN = 0x0200,
-	MII_BMCR_DUPLEX_MODE = 0x0100,
-	MII_BMCR_COL_TEST = 0x0080,
-	MII_BMCR_SPEED_MSB = 0x0040,
-	MII_BMCR_SPEED_RESERVED = 0x003f,
-	MII_BMCR_SPEED_10 = 0,
-	MII_BMCR_SPEED_100 = MII_BMCR_SPEED_LSB,
-	MII_BMCR_SPEED_1000 = MII_BMCR_SPEED_MSB,
-};
-
-/* Basic Mode Status Register */
-enum _mii_bmsr {
-	MII_BMSR_100BT4 = 0x8000,
-	MII_BMSR_100BX_FD = 0x4000,
-	MII_BMSR_100BX_HD = 0x2000,
-	MII_BMSR_10BT_FD = 0x1000,
-	MII_BMSR_10BT_HD = 0x0800,
-	MII_BMSR_100BT2_FD = 0x0400,
-	MII_BMSR_100BT2_HD = 0x0200,
-	MII_BMSR_EXT_STATUS = 0x0100,
-	MII_BMSR_PREAMBLE_SUPP = 0x0040,
-	MII_BMSR_AN_COMPLETE = 0x0020,
-	MII_BMSR_REMOTE_FAULT = 0x0010,
-	MII_BMSR_AN_ABILITY = 0x0008,
-	MII_BMSR_LINK_STATUS = 0x0004,
-	MII_BMSR_JABBER_DETECT = 0x0002,
-	MII_BMSR_EXT_CAP = 0x0001,
-};
-
-/* ANAR */
-enum _mii_anar {
-	MII_ANAR_NEXT_PAGE = 0x8000,
-	MII_ANAR_REMOTE_FAULT = 0x4000,
-	MII_ANAR_ASYMMETRIC = 0x0800,
-	MII_ANAR_PAUSE = 0x0400,
-	MII_ANAR_100BT4 = 0x0200,
-	MII_ANAR_100BX_FD = 0x0100,
-	MII_ANAR_100BX_HD = 0x0080,
-	MII_ANAR_10BT_FD = 0x0020,
-	MII_ANAR_10BT_HD = 0x0010,
-	MII_ANAR_SELECTOR = 0x001f,
-	MII_IEEE8023_CSMACD = 0x0001,
-};
-
-/* ANLPAR */
-enum _mii_anlpar {
-	MII_ANLPAR_NEXT_PAGE = MII_ANAR_NEXT_PAGE,
-	MII_ANLPAR_REMOTE_FAULT = MII_ANAR_REMOTE_FAULT,
-	MII_ANLPAR_ASYMMETRIC = MII_ANAR_ASYMMETRIC,
-	MII_ANLPAR_PAUSE = MII_ANAR_PAUSE,
-	MII_ANLPAR_100BT4 = MII_ANAR_100BT4,
-	MII_ANLPAR_100BX_FD = MII_ANAR_100BX_FD,
-	MII_ANLPAR_100BX_HD = MII_ANAR_100BX_HD,
-	MII_ANLPAR_10BT_FD = MII_ANAR_10BT_FD,
-	MII_ANLPAR_10BT_HD = MII_ANAR_10BT_HD,
-	MII_ANLPAR_SELECTOR = MII_ANAR_SELECTOR,
-};
-
-/* Auto-Negotiation Expansion Register */
-enum _mii_aner {
-	MII_ANER_PAR_DETECT_FAULT = 0x0010,
-	MII_ANER_LP_NEXTPAGABLE = 0x0008,
-	MII_ANER_NETXTPAGABLE = 0x0004,
-	MII_ANER_PAGE_RECEIVED = 0x0002,
-	MII_ANER_LP_NEGOTIABLE = 0x0001,
-};
-
-/* MASTER-SLAVE Control Register */
-enum _mii_mscr {
-	MII_MSCR_TEST_MODE = 0xe000,
-	MII_MSCR_CFG_ENABLE = 0x1000,
-	MII_MSCR_CFG_VALUE = 0x0800,
-	MII_MSCR_PORT_VALUE = 0x0400,
-	MII_MSCR_1000BT_FD = 0x0200,
-	MII_MSCR_1000BT_HD = 0X0100,
-};
-
-/* MASTER-SLAVE Status Register */
-enum _mii_mssr {
-	MII_MSSR_CFG_FAULT = 0x8000,
-	MII_MSSR_CFG_RES = 0x4000,
-	MII_MSSR_LOCAL_RCV_STATUS = 0x2000,
-	MII_MSSR_REMOTE_RCVR = 0x1000,
-	MII_MSSR_LP_1000BT_FD = 0x0800,
-	MII_MSSR_LP_1000BT_HD = 0x0400,
-	MII_MSSR_IDLE_ERR_COUNT = 0x00ff,
-};
-
 /* IEEE Extened Status Register */
 enum _mii_esr {
 	MII_ESR_1000BX_FD = 0x8000,
-- 
1.7.4.4
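
Since the patch swaps driver-private constants for the shared ones, it can
be useful to check that each replacement has the same numeric value. The
sketch below does this for the advertisement register bits only. The "old"
values are the ones removed from dl2k.h in this patch; the "std" values are
what I believe linux/mii.h defines and should be treated as an assumption to
verify against the tree. The struct and function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

struct define_pair {
	const char *old_name;	/* name removed from dl2k.h */
	unsigned int old_value;	/* value in the removed enum */
	const char *std_name;	/* replacement used in the patch */
	unsigned int std_value;	/* value assumed for linux/mii.h */
};

static const struct define_pair anar_pairs[] = {
	{ "MII_ANAR_ASYMMETRIC", 0x0800, "ADVERTISE_PAUSE_ASYM", 0x0800 },
	{ "MII_ANAR_PAUSE",      0x0400, "ADVERTISE_PAUSE_CAP",  0x0400 },
	{ "MII_ANAR_100BT4",     0x0200, "ADVERTISE_100BASE4",   0x0200 },
	{ "MII_ANAR_100BX_FD",   0x0100, "ADVERTISE_100FULL",    0x0100 },
	{ "MII_ANAR_100BX_HD",   0x0080, "ADVERTISE_100HALF",    0x0080 },
	{ "MII_ANAR_10BT_FD",    0x0020, "ADVERTISE_10FULL",     0x0040 },
	{ "MII_ANAR_10BT_HD",    0x0010, "ADVERTISE_10HALF",     0x0020 },
};

/* Count the entries whose numeric value differs between old and new name. */
size_t count_value_changes(const struct define_pair *pairs, size_t n)
{
	size_t i, changed = 0;

	assert(pairs != NULL || n == 0);

	for (i = 0; i < n; i++)
		if (pairs[i].old_value != pairs[i].std_value)
			changed++;
	return changed;
}
```

If the assumed mii.h values are right, two entries differ (the 10BASE-T
full and half duplex bits), which would mean the conversion also changes
which bits are advertised for 10 Mbps rather than being a pure rename.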

^ permalink raw reply related

* Re: [PATCH net-next 2/2] sunbmac: use standard #defines from mii.h.
From: Francois Romieu @ 2011-08-25  9:22 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <20110825092019.GA21777@electric-eye.fr.zoreil.com>

Signed-off-by: Francois Romieu <romieu@fr.zoreil.com>
---
 drivers/net/ethernet/sun/sunbmac.c |   31 ++++++++++++++++---------------
 drivers/net/ethernet/sun/sunbmac.h |   17 -----------------
 2 files changed, 16 insertions(+), 32 deletions(-)

diff --git a/drivers/net/ethernet/sun/sunbmac.c b/drivers/net/ethernet/sun/sunbmac.c
index c94f5ef..0d8cfd9 100644
--- a/drivers/net/ethernet/sun/sunbmac.c
+++ b/drivers/net/ethernet/sun/sunbmac.c
@@ -17,6 +17,7 @@
 #include <linux/crc32.h>
 #include <linux/errno.h>
 #include <linux/ethtool.h>
+#include <linux/mii.h>
 #include <linux/netdevice.h>
 #include <linux/etherdevice.h>
 #include <linux/skbuff.h>
@@ -500,13 +501,13 @@ static int try_next_permutation(struct bigmac *bp, void __iomem *tregs)
 
 		/* Reset the PHY. */
 		bp->sw_bmcr	= (BMCR_ISOLATE | BMCR_PDOWN | BMCR_LOOPBACK);
-		bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+		bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 		bp->sw_bmcr	= (BMCR_RESET);
-		bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+		bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 
 		timeout = 64;
 		while (--timeout) {
-			bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+			bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 			if ((bp->sw_bmcr & BMCR_RESET) == 0)
 				break;
 			udelay(20);
@@ -514,11 +515,11 @@ static int try_next_permutation(struct bigmac *bp, void __iomem *tregs)
 		if (timeout == 0)
 			printk(KERN_ERR "%s: PHY reset failed.\n", bp->dev->name);
 
-		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 
 		/* Now we try 10baseT. */
 		bp->sw_bmcr &= ~(BMCR_SPEED100);
-		bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+		bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 		return 0;
 	}
 
@@ -534,8 +535,8 @@ static void bigmac_timer(unsigned long data)
 
 	bp->timer_ticks++;
 	if (bp->timer_state == ltrywait) {
-		bp->sw_bmsr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMSR);
-		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+		bp->sw_bmsr = bigmac_tcvr_read(bp, tregs, MII_BMSR);
+		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 		if (bp->sw_bmsr & BMSR_LSTATUS) {
 			printk(KERN_INFO "%s: Link is now up at %s.\n",
 			       bp->dev->name,
@@ -588,18 +589,18 @@ static void bigmac_begin_auto_negotiation(struct bigmac *bp)
 	int timeout;
 
 	/* Grab new software copies of PHY registers. */
-	bp->sw_bmsr	= bigmac_tcvr_read(bp, tregs, BIGMAC_BMSR);
-	bp->sw_bmcr	= bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+	bp->sw_bmsr	= bigmac_tcvr_read(bp, tregs, MII_BMSR);
+	bp->sw_bmcr	= bigmac_tcvr_read(bp, tregs, MII_BMCR);
 
 	/* Reset the PHY. */
 	bp->sw_bmcr	= (BMCR_ISOLATE | BMCR_PDOWN | BMCR_LOOPBACK);
-	bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+	bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 	bp->sw_bmcr	= (BMCR_RESET);
-	bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+	bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 
 	timeout = 64;
 	while (--timeout) {
-		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+		bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 		if ((bp->sw_bmcr & BMCR_RESET) == 0)
 			break;
 		udelay(20);
@@ -607,11 +608,11 @@ static void bigmac_begin_auto_negotiation(struct bigmac *bp)
 	if (timeout == 0)
 		printk(KERN_ERR "%s: PHY reset failed.\n", bp->dev->name);
 
-	bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, BIGMAC_BMCR);
+	bp->sw_bmcr = bigmac_tcvr_read(bp, tregs, MII_BMCR);
 
 	/* First we try 100baseT. */
 	bp->sw_bmcr |= BMCR_SPEED100;
-	bigmac_tcvr_write(bp, tregs, BIGMAC_BMCR, bp->sw_bmcr);
+	bigmac_tcvr_write(bp, tregs, MII_BMCR, bp->sw_bmcr);
 
 	bp->timer_state = ltrywait;
 	bp->timer_ticks = 0;
@@ -1054,7 +1055,7 @@ static u32 bigmac_get_link(struct net_device *dev)
 	struct bigmac *bp = netdev_priv(dev);
 
 	spin_lock_irq(&bp->lock);
-	bp->sw_bmsr = bigmac_tcvr_read(bp, bp->tregs, BIGMAC_BMSR);
+	bp->sw_bmsr = bigmac_tcvr_read(bp, bp->tregs, MII_BMSR);
 	spin_unlock_irq(&bp->lock);
 
 	return (bp->sw_bmsr & BMSR_LSTATUS);
diff --git a/drivers/net/ethernet/sun/sunbmac.h b/drivers/net/ethernet/sun/sunbmac.h
index 4943e97..06dd217 100644
--- a/drivers/net/ethernet/sun/sunbmac.h
+++ b/drivers/net/ethernet/sun/sunbmac.h
@@ -223,23 +223,6 @@
 #define BIGMAC_PHY_EXTERNAL   0 /* External transceiver */
 #define BIGMAC_PHY_INTERNAL   1 /* Internal transceiver */
 
-/* PHY registers */
-#define BIGMAC_BMCR           0x00 /* Basic mode control register	*/
-#define BIGMAC_BMSR           0x01 /* Basic mode status register	*/
-
-/* BMCR bits */
-#define BMCR_ISOLATE            0x0400  /* Disconnect DP83840 from MII */
-#define BMCR_PDOWN              0x0800  /* Powerdown the DP83840       */
-#define BMCR_ANENABLE           0x1000  /* Enable auto negotiation     */
-#define BMCR_SPEED100           0x2000  /* Select 100Mbps              */
-#define BMCR_LOOPBACK           0x4000  /* TXD loopback bits           */
-#define BMCR_RESET              0x8000  /* Reset the DP83840           */
-
-/* BMSR bits */
-#define BMSR_ERCAP              0x0001  /* Ext-reg capability          */
-#define BMSR_JCD                0x0002  /* Jabber detected             */
-#define BMSR_LSTATUS            0x0004  /* Link status                 */
-
 /* Ring descriptors and such, same as Quad Ethernet. */
 struct be_rxd {
 	u32 rx_flags;
-- 
1.7.4.4

^ permalink raw reply related

* Re: [BUG] tcp : how many times a frame can possibly be retransmitted ?
From: Eric Dumazet @ 2011-08-25  9:40 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: netdev, Jerry Chu, Damian Lukowski
In-Reply-To: <alpine.DEB.2.00.1108251150050.12780@wel-95.cs.helsinki.fi>

On Thursday 25 August 2011 at 11:56 +0300, Ilpo Järvinen wrote:

> So you think that this is not true: ?
> 
>         /* NOTE: clamping at TCP_RTO_MIN is not required, current algo
>          * guarantees that rto is higher.
>          */
> 
> ...it would still be smaller than 1sec though, but certainly not going to 
> cause flooding either. Default tcp_rto_min should be 200ms so it's 
> 5pkts+5ICMP sent, received and processed per second. Which doesn't sound 
> that bad CPU load?!?
> 

Unless you have 100,000 active sessions, maybe?

Some years ago, I helped people running servers with more than 1,000,000
long-lived active sessions, and a temporary network disruption was
already very critical at that time with old kernels. (Back then, the IP
route cache could blow up and consume too much RAM or CPU time; things
are now under control.)

I guess they would not try a new kernel :(

> It is unclear to me how tp->rttvar could become smaller than 
> tcp_rto_min().

I believe this part is fine Ilpo.

As long as we handle few TCP sessions, it's fine to send 5 messages per
session per second.
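
The scaling concern can be put into numbers with a short sketch. It assumes
that every stalled session sends one probe per minimum RTO and that each
probe is answered by one ICMP message, as in the "5pkts+5ICMP" estimate
quoted above. The function name is illustrative.

```c
#include <assert.h>

/* Packets handled per second by a host when every stalled session sends
 * one probe per rto_min_ms and each probe triggers one ICMP reply.
 */
long long probe_load_pps(long long sessions, long rto_min_ms)
{
	long long probes_per_session;

	assert(sessions >= 0 && rto_min_ms > 0 && rto_min_ms <= 1000);

	probes_per_session = 1000 / rto_min_ms;	/* 5 for 200 ms */
	return sessions * probes_per_session * 2;	/* probe + ICMP */
}
```

For a single session this gives 10 packets per second; for 100,000 sessions
it gives 1,000,000 and for 1,000,000 sessions 10,000,000 packets per second,
all during a period when no useful data can flow.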

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Arnd Hannemann @ 2011-08-25  9:46 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <1314263389.2387.21.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

Hi Eric,

On 25.08.2011 11:09, Eric Dumazet wrote:
> On Thursday 25 August 2011 at 10:46 +0200, Arnd Hannemann wrote:
>> On 25.08.2011 10:26, Eric Dumazet wrote:
>>> On Thursday 25 August 2011 at 09:28 +0200, Alexander Zimmermann wrote:
>>>> On 25.08.2011 at 07:28, Eric Dumazet wrote:
>>>
>>>>> The real question is: do we really want to process ~1000 timer interrupts
>>>>> per TCP session, ~2000 skb alloc/free/build/handling, possibly ~1000 ARP
>>>>> requests, only to make TCP recover in ~1 sec when connectivity
>>>>> returns? This just doesn't scale.
>>>>
>>>> Maybe a stupid question, but 1000? With a minRTO of 200ms and a maximum
>>>> probing time of 120s, we get 600 retransmits in a worst-case scenario
>>>> (assuming that we get an ICMP for every RTO retransmission). No?
>>>
>>> Where is the "max probing time of 120s" asserted?
>>>
>>> It is not the case on my machine:
>>> I have way more retransmits than that, even if spaced by 1600 ms
>>>
>>> 07:16:13.389331 write(3, "\350F\235JC\357\376\363&\3\374\270R\21L\26\324{\37p\342\244i\304\356\241I:\301\332\222\26"..., 48) = 48
>>> 07:16:13.389417 select(7, [3 4], [], NULL, NULL) = 1 (in [3])
>>> 07:31:39.901311 read(3, 0xff8c4c90, 8192) = -1 EHOSTUNREACH (No route to host)
>>>
>>> Old kernels were performing up to 15 retries, doing exponential backoff.
>>>
>>> Now it's kind of unlimited, according to experimental results.
>>
>> That shouldn't be. It should stop after the same time as a TCP connection that
>> starts with an RTO of minimum RTO, does 15 retries (tcp_retries2=15) and uses
>> exponential backoff. So it should be around 900s*. But it could be that, because
>> of the icsk_retransmit wraparound, this doesn't work as expected.
>>
>> * 200ms + 400ms + 800ms ...
> 
> It is 924 seconds with tcp_retries2=15 (the default value).
> 
> I said ~1000 probes.
> 
> If ICMP messages are not rate limited, that could be about 924*5 probes,
> instead of 15 probes on old kernels.

At a rate of 5 packets/s if RTT is zero, yes. I would like to say: so
what? But your example with millions of idle connections stands.

> Maybe we should refine the thing a bit, to not reverse backoff unless
> rto is > some_threshold.
> 
> With 10s as the value, that would give at most 92 tries.

I personally think that 10s would be too large and would eliminate the benefit
of the algorithm, so I would prefer a different solution.

In the case of one bulk data TCP session, which was transmitting hundreds of
packets/s before the connectivity disruption, that worst-case rate of
5 packets/s really seems conservative enough.

However, in the case of a lot of idle connections, which were transmitting
only a few packets per minute, we might increase the rate drastically for
a certain period until it throttles down. You are saying that we have a
problem here, correct?

Do you think it would be possible without much hassle to use a kind of "global"
rate limiting only for these probe packets of a TCP connection?
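
One way such a "global" limit could look is a token bucket shared by all
connections. The sketch below is purely illustrative: nothing like it is
proposed as code in this thread, the names are hypothetical, and time is
passed in explicitly as milliseconds instead of using any kernel clock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical host-wide budget for probes sent with a reverted backoff. */
struct probe_bucket {
	long rate_per_sec;	/* tokens added per second */
	long burst;		/* maximum tokens that can accumulate */
	long tokens;		/* tokens currently available */
	long last_ms;		/* time accounted for so far, in ms */
};

void probe_bucket_init(struct probe_bucket *b, long rate_per_sec,
		       long burst, long now_ms)
{
	assert(b && rate_per_sec > 0 && burst > 0);

	b->rate_per_sec = rate_per_sec;
	b->burst = burst;
	b->tokens = burst;
	b->last_ms = now_ms;
}

/* Returns true if a fast probe may be sent now. A false return means the
 * caller should keep the normal exponential backoff for this connection.
 */
bool probe_bucket_allow(struct probe_bucket *b, long now_ms)
{
	long elapsed_ms = now_ms - b->last_ms;
	long refill = elapsed_ms * b->rate_per_sec / 1000;

	if (refill > 0) {
		b->tokens += refill;
		if (b->tokens > b->burst)
			b->tokens = b->burst;
		/* Account only for the time that produced whole tokens. */
		b->last_ms += refill * 1000 / b->rate_per_sec;
	}
	if (b->tokens == 0)
		return false;
	b->tokens--;
	return true;
}
```

With a rate of 5 per second and a burst of 2, two probes are allowed
immediately, a third is refused, and the next one is allowed 200 ms later.
A single bulk session would rarely hit the limit, while a million idle
sessions would together be held to the configured rate.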

> I mean, what is the gain to be able to restart a frozen TCP session with
> a 1sec latency instead of 10s if it was blocked more than 60 seconds ?

I'm afraid it does a lot, especially in highly dynamic environments. You
don't have just the additional latency, you may actually miss the full
period where connectivity was there, and then just retransmit into the next
connectivity disrupted period.
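
A toy model of this effect, with deliberately contrived numbers: connectivity is up for 2 s out of every 30 s, and we compare probing every 1 s with probing every 10 s. The periodic link model and both function names are assumptions made only for this illustration.

```python
def link_is_up(t, period, up_len, offset=0):
    """Periodic connectivity: up for `up_len` seconds in every `period`."""
    return (t - offset) % period < up_len


def first_successful_probe(interval, period, up_len, offset, horizon):
    """Time of the first probe that lands in an up window, or None."""
    n = 1
    while n * interval <= horizon:
        t = n * interval
        if link_is_up(t, period, up_len, offset):
            return t
        n += 1
    return None


if __name__ == "__main__":
    # Up windows are [15, 17), [45, 47), [75, 77), ...
    print(first_successful_probe(1, 30, 2, 15, 900))
    print(first_successful_probe(10, 30, 2, 15, 900))
```

In this setup the 1 s prober gets through in the first window at t=15, while the 10 s prober lands at offsets 25, 5 and 15 within each cycle and misses every window for the whole 900 s.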

Best regards,
Arnd

^ permalink raw reply

* Re: how to distribute irqs of ixgbevf
From: J.Hwan Kim @ 2011-08-25 10:00 UTC (permalink / raw)
  To: netdev
In-Reply-To: <1314260481.2387.10.camel@edumazet-HP-Compaq-6005-Pro-SFF-PC>

On 2011-08-25 17:21, Eric Dumazet wrote:
> On Thursday 25 August 2011 at 17:07 +0900, J.Hwan Kim wrote:
>> Hi, everyone
>>
>> The interrupts of my ixgbevf driver occur only on core 0,
>> although the user-space "irqbalance" service is running.
>>
>> How can I distribute the RX interrupts of ixgbevf across all cores?
>>
>> cat /proc/interrupts | grep "isv"
>>   97:        8        0        0        0        0        0        0        0   PCI-MSI-edge   isv0-rx-0
>>   99:        7        0        0        0        0        0        0        0   PCI-MSI-edge   isv0:lsc
>>  103:     2059        0        0        0        0        0        0        0   PCI-MSI-edge   isv2-rx-0
>>  104:       14        0        0        0        0        0        0        0   PCI-MSI-edge   isv2-tx-0
>>  105:        1        0        0        0        0        0        0        0   PCI-MSI-edge   isv2:mbx
>>
>> "isv" is the netdevice name of my ixgbevf.
> Given that the load is very small, irqbalance chose to send interrupts to a
> single cpu.
When I measure the CPU load with "top", it shows around 99%.

^ permalink raw reply

* Re: [PATCH] tcp: bound RTO to minimum
From: Eric Dumazet @ 2011-08-25 10:02 UTC (permalink / raw)
  To: Arnd Hannemann
  Cc: Alexander Zimmermann, Yuchung Cheng, Hagen Paul Pfeifer, netdev,
	Lukowski Damian
In-Reply-To: <4E5619DA.6070902@arndnet.de>

On Thursday 25 August 2011 at 11:46 +0200, Arnd Hannemann wrote:
> Hi Eric,
> 
> On 25.08.2011 11:09, Eric Dumazet wrote:

> > Maybe we should refine the thing a bit, to not reverse backoff unless
> > rto is > some_threshold.
> > 
> > Say 10s being the value, that would give at most 92 tries.
> 
> I personally think that 10s would be too large and eliminate the benefit of the
> algorithm, so I would prefer a different solution.
> 
> In the case of a single bulk-data TCP session that was transmitting hundreds of packets/s
> before the connectivity disruption, that worst-case rate of 5 packets/s really
> seems conservative enough.
> 
> However, in the case of a lot of idle connections that were transmitting only
> a number of packets per minute, we might increase the rate drastically for
> a certain period until it throttles down. You are saying that we have a problem here,
> correct?
> 
> Do you think it would be possible without much hassle to use a kind of "global"
> rate limiting only for these probe packets of a TCP connection?
> 
> > I mean, what is the gain to be able to restart a frozen TCP session with
> > a 1sec latency instead of 10s if it was blocked more than 60 seconds ?
> 
> I'm afraid it does a lot, especially in highly dynamic environments. You
> don't have just the additional latency, you may actually miss the full
> period where connectivity was there, and then just retransmit into the next
> connectivity disrupted period.

The problem with this is that, with short and synchronized timers, all
sessions will flood at the same time, and then you'll get congestion
instead.

The reason for exponential backoff is also to smooth the restarts of
sessions, because timers are randomized.
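
A small simulation sketch of that point, under simplifying assumptions: in the first case every session has its RTO reset to 200 ms at the same instant; in the second, sessions sit at unrelated (uniformly random) phases of a timer backed off up to 120 s. The helper names, the uniform phase model and the 200 ms bucket size are all assumptions for illustration.

```python
import random
from collections import Counter


def peak_probes_per_slot(fire_times, slot=0.2):
    """Largest number of probes that fall into any one `slot`-second bucket."""
    buckets = Counter(int(t / slot) for t in fire_times)
    return max(buckets.values())


def synchronized_restart(sessions, rto=0.2):
    """All sessions had their RTO reset at t=0 and fire together."""
    return [rto] * sessions


def backed_off_restart(sessions, max_rto=120.0, seed=0):
    """Sessions sit at unrelated phases of a backed-off timer (assumed uniform)."""
    rng = random.Random(seed)
    return [rng.uniform(0.0, max_rto) for _ in range(sessions)]


if __name__ == "__main__":
    n = 1000
    print(peak_probes_per_slot(synchronized_restart(n)))
    print(peak_probes_per_slot(backed_off_restart(n)))
```

In the synchronized case all 1000 probes land in the same 200 ms slot; with spread-out timers the peak per slot is only a handful.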

^ permalink raw reply

