Netdev List


* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: David Miller @ 2009-10-19 13:14 UTC
  To: jarkao2
  Cc: tomoki.sekiyama.qu, linux-kernel, netdev, alan, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk
In-Reply-To: <20091019115713.GB6869@ff.dom.local>

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Mon, 19 Oct 2009 11:57:13 +0000

> Isn't the shutdown call expected to change sk_state to TCP_CLOSE?

No, because the send side is still up and operational, it's
only a half duplex close.
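
For illustration only (not part of the original thread): a minimal
user-space sketch of a half-duplex close.  It assumes Linux and uses a
connected AF_UNIX socketpair() rather than the listening socket under
discussion; the name send_after_read_shutdown() is made up for this
example.  After shutdown(SHUT_RD) on one end, that end should still be
able to send:

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Returns 1 if data can still be sent from a socket whose receive
 * side was shut down, 0 if not, and -1 if the setup itself failed.
 */
int send_after_read_shutdown(void)
{
	int sv[2];
	char buf[4] = { 0 };
	int ok;

	if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
		return -1;
	assert(sv[0] >= 0 && sv[1] >= 0);

	/* Half-duplex close: only the receive direction of sv[0] goes away. */
	if (shutdown(sv[0], SHUT_RD) < 0) {
		close(sv[0]);
		close(sv[1]);
		return -1;
	}

	/* The send direction of sv[0] is still operational. */
	ok = write(sv[0], "hi", 2) == 2 &&
	     read(sv[1], buf, sizeof(buf)) == 2 &&
	     memcmp(buf, "hi", 2) == 0;

	close(sv[0]);
	close(sv[1]);
	return ok;
}
```

A caller would simply check that the function returns 1; a full close
of both directions would need SHUT_RDWR instead.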


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 13:19 UTC
  To: Eric Dumazet
  Cc: Denys Fedoryschenko, netdev, linux-ppp, paulus, mostrows,
	Cyrill Gorcunov
In-Reply-To: <4ADC5D3B.8010006@gmail.com>

The entire scheme for managing net namespaces seems unsafe.  We depend
on synchronization via pn->hash_lock, but have no guarantee of the
existence of the "net" object -- hence no way to ensure the existence
of the lock itself.  This should be relatively easy to fix though as
we should be able to get/put the net namespace as we add/remove
objects to/from the pppoe hash.
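
A rough user-space sketch of that idea, for illustration only: every
entry added to the hash pins the namespace, and removing the entry
drops the pin.  All names here (struct ns, ns_get(), ns_put(),
hash_add(), hash_del()) are hypothetical stand-ins, not the pppoe code,
and the counter is a plain int; kernel code would use a proper atomic
reference count and hold the hash lock around the table accesses.

```c
#include <assert.h>
#include <stddef.h>

#define NS_HASH_SLOTS 16

/* Hypothetical stand-in for the namespace and its per-net pppoe state. */
struct ns {
	int refcnt;				/* namespace lives while > 0 */
	void *hash_table[NS_HASH_SLOTS];	/* guarded by a lock inside the ns */
};

static void ns_get(struct ns *ns)
{
	ns->refcnt++;
}

static void ns_put(struct ns *ns)
{
	assert(ns->refcnt > 0);
	ns->refcnt--;
}

/* Adding an entry pins the namespace, so the lock and table that
 * guard the entry cannot go away while the entry is hashed.
 */
int hash_add(struct ns *ns, unsigned int slot, void *item)
{
	if (slot >= NS_HASH_SLOTS || ns->hash_table[slot])
		return -1;
	ns_get(ns);
	ns->hash_table[slot] = item;
	return 0;
}

/* Removing an entry drops the pin taken in hash_add(). */
void *hash_del(struct ns *ns, unsigned int slot)
{
	void *item;

	if (slot >= NS_HASH_SLOTS)
		return NULL;
	item = ns->hash_table[slot];
	if (item) {
		ns->hash_table[slot] = NULL;
		ns_put(ns);
	}
	return item;
}
```

With this pairing, the reference count only returns to its base value
once the last hashed entry has been removed.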

Once you solve this existence issue, the flush_lock can be eliminated
altogether since all of the relevant code paths already depend on a
write_lock_bh(&pn->hash_lock), and that's the lock that should be used
to protect the pppoe_dev field.

Another patch to follow later...

--
Michal Ostrowski
mostrows@gmail.com



On Mon, Oct 19, 2009 at 7:36 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski wrote:
>> Here's my theory on this after an initial look...
>>
>> Looking at the oops report and disassembly of the actual module binary
>> that caused the oops, one can deduce that:
>>
>> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
>> pppox_sock *po", which is what we faulted on, executing "cmp %eax, 0x190(%ebx)".
>> %ebx value was 0xffffffff (hence we got "NULL pointer dereference at 0x18f").
>>
>> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a value
>> of 0xffffffff in pn->hash_table[i].
>>
>> From this I'd hypothesize that the combination of dev_put() and release_sock()
>> may have allowed us to free "pn".  At the bottom of the loop we already
>> recognize that since locks are dropped we're responsible for handling
>> invalidation of objects, and perhaps that should be extended to "pn" as well.
>> --
>> Michal Ostrowski
>> mostrows@gmail.com
>>
>>
>
> Looking at this stuff, I do believe flush_lock protection is not
> properly done.
>
> At the end of pppoe_connect() for example we can find :
>
> err_put:
>        if (po->pppoe_dev) {
>                dev_put(po->pppoe_dev);
>                po->pppoe_dev = NULL;
>        }
>
> This is done without any protection, and can therefore clash with
> pppoe_flush_dev() :
>
>        spin_lock(&flush_lock);
>        po->pppoe_dev = NULL; /* pppoe_dev can already be NULL before this point */
>        spin_unlock(&flush_lock);
>
>        dev_put(dev);    /* oops */
>
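
One common way to avoid this kind of double release, sketched in user
space for illustration only: detach the pointer while holding the lock,
and let only the path that actually obtained a non-NULL pointer drop
the reference.  The types and functions below (struct dev,
struct sock_model, detach_dev(), release_dev()) are hypothetical, and a
pthread mutex stands in for whichever kernel lock ends up protecting
pppoe_dev.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

struct dev {
	int refcnt;
};

struct sock_model {
	pthread_mutex_t lock;		/* stand-in for the protecting lock */
	struct dev *pppoe_dev;
};

static void dev_put_model(struct dev *d)
{
	assert(d->refcnt > 0);		/* a second put would trip here */
	d->refcnt--;
}

/* Read and clear the pointer in one critical section, so two racing
 * callers can never both see the same non-NULL device.
 */
static struct dev *detach_dev(struct sock_model *po)
{
	struct dev *d;

	pthread_mutex_lock(&po->lock);
	d = po->pppoe_dev;
	po->pppoe_dev = NULL;
	pthread_mutex_unlock(&po->lock);
	return d;
}

/* Safe to call from both the error path and the flush path. */
void release_dev(struct sock_model *po)
{
	struct dev *d = detach_dev(po);

	if (d)
		dev_put_model(d);
}
```

Calling release_dev() twice on the same socket then drops the device
reference only once.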


* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Andrew Gallatin @ 2009-10-19 13:30 UTC
  To: Ben Hutchings
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <1255957384.2782.2.camel@achroite>

Ben Hutchings wrote:
> On Mon, 2009-10-19 at 08:34 -0400, Andrew Gallatin wrote:
>> Ben Hutchings wrote:
>>
>>> Lying about link modes is not an improvement.
>> OK, so we're probably doing something wrong. I suspect we're not
>> alone.  At least we don't set SUPPORTED_TP for CX4, like I've
>> seen some NICs do.
>>
>> Can somebody suggest how we can tell ethtool that
>> the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
>> for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L}R)?
> 
> What's wrong with what you already do?  Customers expect to see
> something on the supported line?

Exactly.  One has complained because drivers for
other vendors' NICs show this, even if they are fibre NICs
or CX4 NICs, and don't actually support 10GbaseT.

I'm happy to back this part out, and resubmit the patch without
it. There is still some fairly valuable stuff in the patch
-- mainly updating the NIC detection logic for new NICs to
detect fibre vs copper.

Drew


* Re: [net-next PATCH 0/3] qlge: Size RX buffers based on MTU.
From: Ron Mercer @ 2009-10-19 13:29 UTC
  To: David Miller; +Cc: netdev@vger.kernel.org
In-Reply-To: <20091017.153751.193720868.davem@davemloft.net>

> 
> You should send them as patches that actually compile at each
> step of the way and therefore don't break bisection.

The changes are functionally a single patch which I will send shortly.
Thanks.


* [net-next PATCH 1/1] qlge: Size RX buffers based on MTU.
From: Ron Mercer @ 2009-10-19 13:32 UTC
  To: davem; +Cc: netdev, ron.mercer
In-Reply-To: <20091019132902.GB14919@linux-ox1b.qlogic.org>

Change RX large buffer size based on MTU. If pages are larger
than the MTU the page is divided up into multiple chunks and passed to
the hardware.  When pages are smaller than MTU each RX buffer can
be comprised of up to 2 pages.

Signed-off-by: Ron Mercer <ron.mercer@qlogic.com>
---
 drivers/net/qlge/qlge.h      |   15 ++-
 drivers/net/qlge/qlge_main.c |  273 +++++++++++++++++++++++++++++++-----------
 2 files changed, 214 insertions(+), 74 deletions(-)

diff --git a/drivers/net/qlge/qlge.h b/drivers/net/qlge/qlge.h
index fd47691..9cdf8ff 100644
--- a/drivers/net/qlge/qlge.h
+++ b/drivers/net/qlge/qlge.h
@@ -56,7 +56,8 @@
 		MAX_DB_PAGES_PER_BQ(NUM_LARGE_BUFFERS) * sizeof(u64))
 #define SMALL_BUFFER_SIZE 512
 #define SMALL_BUF_MAP_SIZE (SMALL_BUFFER_SIZE / 2)
-#define LARGE_BUFFER_SIZE	PAGE_SIZE
+#define LARGE_BUFFER_MAX_SIZE 8192
+#define LARGE_BUFFER_MIN_SIZE 2048
 #define MAX_SPLIT_SIZE 1023
 #define QLGE_SB_PAD 32
 
@@ -1201,9 +1202,17 @@ struct tx_ring_desc {
 	struct tx_ring_desc *next;
 };
 
+struct page_chunk {
+	struct page *page;	/* master page */
+	char *va;		/* virt addr for this chunk */
+	u64 map;		/* mapping for master */
+	unsigned int offset;	/* offset for this chunk */
+	unsigned int last_flag; /* flag set for last chunk in page */
+};
+
 struct bq_desc {
 	union {
-		struct page *lbq_page;
+		struct page_chunk pg_chunk;
 		struct sk_buff *skb;
 	} p;
 	__le64 *addr;
@@ -1272,6 +1281,7 @@ struct rx_ring {
 	dma_addr_t lbq_base_dma;
 	void *lbq_base_indirect;
 	dma_addr_t lbq_base_indirect_dma;
+	struct page_chunk pg_chunk; /* current page for chunks */
 	struct bq_desc *lbq;	/* array of control blocks */
 	void __iomem *lbq_prod_idx_db_reg;	/* PCI doorbell mem area + 0x18 */
 	u32 lbq_prod_idx;	/* current sw prod idx */
@@ -1526,6 +1536,7 @@ struct ql_adapter {
 
 	struct rx_ring rx_ring[MAX_RX_RINGS];
 	struct tx_ring tx_ring[MAX_TX_RINGS];
+	unsigned int lbq_buf_order;
 
 	int rx_csum;
 	u32 default_rx_queue;
diff --git a/drivers/net/qlge/qlge_main.c b/drivers/net/qlge/qlge_main.c
index 9eefb11..4935710 100644
--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -1025,6 +1025,11 @@ end:
 	return status;
 }
 
+static inline unsigned int ql_lbq_block_size(struct ql_adapter *qdev)
+{
+	return PAGE_SIZE << qdev->lbq_buf_order;
+}
+
 /* Get the next large buffer. */
 static struct bq_desc *ql_get_curr_lbuf(struct rx_ring *rx_ring)
 {
@@ -1036,6 +1041,28 @@ static struct bq_desc *ql_get_curr_lbuf(struct rx_ring *rx_ring)
 	return lbq_desc;
 }
 
+static struct bq_desc *ql_get_curr_lchunk(struct ql_adapter *qdev,
+		struct rx_ring *rx_ring)
+{
+	struct bq_desc *lbq_desc = ql_get_curr_lbuf(rx_ring);
+
+	pci_dma_sync_single_for_cpu(qdev->pdev,
+					pci_unmap_addr(lbq_desc, mapaddr),
+				    rx_ring->lbq_buf_size,
+					PCI_DMA_FROMDEVICE);
+
+	/* If it's the last chunk of our master page then
+	 * we unmap it.
+	 */
+	if ((lbq_desc->p.pg_chunk.offset + rx_ring->lbq_buf_size)
+					== ql_lbq_block_size(qdev))
+		pci_unmap_page(qdev->pdev,
+				lbq_desc->p.pg_chunk.map,
+				ql_lbq_block_size(qdev),
+				PCI_DMA_FROMDEVICE);
+	return lbq_desc;
+}
+
 /* Get the next small buffer. */
 static struct bq_desc *ql_get_curr_sbuf(struct rx_ring *rx_ring)
 {
@@ -1063,6 +1090,53 @@ static void ql_write_cq_idx(struct rx_ring *rx_ring)
 	ql_write_db_reg(rx_ring->cnsmr_idx, rx_ring->cnsmr_idx_db_reg);
 }
 
+static int ql_get_next_chunk(struct ql_adapter *qdev, struct rx_ring *rx_ring,
+						struct bq_desc *lbq_desc)
+{
+	if (!rx_ring->pg_chunk.page) {
+		u64 map;
+		rx_ring->pg_chunk.page = alloc_pages(__GFP_COLD | __GFP_COMP |
+						GFP_ATOMIC,
+						qdev->lbq_buf_order);
+		if (unlikely(!rx_ring->pg_chunk.page)) {
+			QPRINTK(qdev, DRV, ERR,
+				"page allocation failed.\n");
+			return -ENOMEM;
+		}
+		rx_ring->pg_chunk.offset = 0;
+		map = pci_map_page(qdev->pdev, rx_ring->pg_chunk.page,
+					0, ql_lbq_block_size(qdev),
+					PCI_DMA_FROMDEVICE);
+		if (pci_dma_mapping_error(qdev->pdev, map)) {
+			__free_pages(rx_ring->pg_chunk.page,
+					qdev->lbq_buf_order);
+			QPRINTK(qdev, DRV, ERR,
+				"PCI mapping failed.\n");
+			return -ENOMEM;
+		}
+		rx_ring->pg_chunk.map = map;
+		rx_ring->pg_chunk.va = page_address(rx_ring->pg_chunk.page);
+	}
+
+	/* Copy the current master pg_chunk info
+	 * to the current descriptor.
+	 */
+	lbq_desc->p.pg_chunk = rx_ring->pg_chunk;
+
+	/* Adjust the master page chunk for next
+	 * buffer get.
+	 */
+	rx_ring->pg_chunk.offset += rx_ring->lbq_buf_size;
+	if (rx_ring->pg_chunk.offset == ql_lbq_block_size(qdev)) {
+		rx_ring->pg_chunk.page = NULL;
+		lbq_desc->p.pg_chunk.last_flag = 1;
+	} else {
+		rx_ring->pg_chunk.va += rx_ring->lbq_buf_size;
+		get_page(rx_ring->pg_chunk.page);
+		lbq_desc->p.pg_chunk.last_flag = 0;
+	}
+	return 0;
+}
 /* Process (refill) a large buffer queue. */
 static void ql_update_lbq(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 {
@@ -1072,39 +1146,28 @@ static void ql_update_lbq(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 	u64 map;
 	int i;
 
-	while (rx_ring->lbq_free_cnt > 16) {
+	while (rx_ring->lbq_free_cnt > 32) {
 		for (i = 0; i < 16; i++) {
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"lbq: try cleaning clean_idx = %d.\n",
 				clean_idx);
 			lbq_desc = &rx_ring->lbq[clean_idx];
-			if (lbq_desc->p.lbq_page == NULL) {
-				QPRINTK(qdev, RX_STATUS, DEBUG,
-					"lbq: getting new page for index %d.\n",
-					lbq_desc->index);
-				lbq_desc->p.lbq_page = alloc_page(GFP_ATOMIC);
-				if (lbq_desc->p.lbq_page == NULL) {
-					rx_ring->lbq_clean_idx = clean_idx;
-					QPRINTK(qdev, RX_STATUS, ERR,
-						"Couldn't get a page.\n");
-					return;
-				}
-				map = pci_map_page(qdev->pdev,
-						   lbq_desc->p.lbq_page,
-						   0, PAGE_SIZE,
-						   PCI_DMA_FROMDEVICE);
-				if (pci_dma_mapping_error(qdev->pdev, map)) {
-					rx_ring->lbq_clean_idx = clean_idx;
-					put_page(lbq_desc->p.lbq_page);
-					lbq_desc->p.lbq_page = NULL;
-					QPRINTK(qdev, RX_STATUS, ERR,
-						"PCI mapping failed.\n");
+			if (ql_get_next_chunk(qdev, rx_ring, lbq_desc)) {
+				QPRINTK(qdev, IFUP, ERR,
+					"Could not get a page chunk.\n");
 					return;
 				}
+
+			map = lbq_desc->p.pg_chunk.map +
+				lbq_desc->p.pg_chunk.offset;
 				pci_unmap_addr_set(lbq_desc, mapaddr, map);
-				pci_unmap_len_set(lbq_desc, maplen, PAGE_SIZE);
+			pci_unmap_len_set(lbq_desc, maplen,
+					rx_ring->lbq_buf_size);
 				*lbq_desc->addr = cpu_to_le64(map);
-			}
+
+			pci_dma_sync_single_for_device(qdev->pdev, map,
+						rx_ring->lbq_buf_size,
+						PCI_DMA_FROMDEVICE);
 			clean_idx++;
 			if (clean_idx == rx_ring->lbq_len)
 				clean_idx = 0;
@@ -1480,27 +1543,24 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			 * chain it to the header buffer's skb and let
 			 * it rip.
 			 */
-			lbq_desc = ql_get_curr_lbuf(rx_ring);
-			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc,
-						      mapaddr),
-				       pci_unmap_len(lbq_desc, maplen),
-				       PCI_DMA_FROMDEVICE);
+			lbq_desc = ql_get_curr_lchunk(qdev, rx_ring);
 			QPRINTK(qdev, RX_STATUS, DEBUG,
-				"Chaining page to skb.\n");
-			skb_fill_page_desc(skb, 0, lbq_desc->p.lbq_page,
-					   0, length);
+				"Chaining page at offset = %d,"
+				"for %d bytes  to skb.\n",
+				lbq_desc->p.pg_chunk.offset, length);
+			skb_fill_page_desc(skb, 0, lbq_desc->p.pg_chunk.page,
+						lbq_desc->p.pg_chunk.offset,
+						length);
 			skb->len += length;
 			skb->data_len += length;
 			skb->truesize += length;
-			lbq_desc->p.lbq_page = NULL;
 		} else {
 			/*
 			 * The headers and data are in a single large buffer. We
 			 * copy it to a new skb and let it go. This can happen with
 			 * jumbo mtu on a non-TCP/UDP frame.
 			 */
-			lbq_desc = ql_get_curr_lbuf(rx_ring);
+			lbq_desc = ql_get_curr_lchunk(qdev, rx_ring);
 			skb = netdev_alloc_skb(qdev->ndev, length);
 			if (skb == NULL) {
 				QPRINTK(qdev, PROBE, DEBUG,
@@ -1515,13 +1575,14 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			skb_reserve(skb, NET_IP_ALIGN);
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"%d bytes of headers and data in large. Chain page to new skb and pull tail.\n", length);
-			skb_fill_page_desc(skb, 0, lbq_desc->p.lbq_page,
-					   0, length);
+			skb_fill_page_desc(skb, 0,
+						lbq_desc->p.pg_chunk.page,
+						lbq_desc->p.pg_chunk.offset,
+						length);
 			skb->len += length;
 			skb->data_len += length;
 			skb->truesize += length;
 			length -= length;
-			lbq_desc->p.lbq_page = NULL;
 			__pskb_pull_tail(skb,
 				(ib_mac_rsp->flags2 & IB_MAC_IOCB_RSP_V) ?
 				VLAN_ETH_HLEN : ETH_HLEN);
@@ -1538,8 +1599,7 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 		 *         frames.  If the MTU goes up we could
 		 *          eventually be in trouble.
 		 */
-		int size, offset, i = 0;
-		__le64 *bq, bq_array[8];
+		int size, i = 0;
 		sbq_desc = ql_get_curr_sbuf(rx_ring);
 		pci_unmap_single(qdev->pdev,
 				 pci_unmap_addr(sbq_desc, mapaddr),
@@ -1558,37 +1618,25 @@ static struct sk_buff *ql_build_rx_skb(struct ql_adapter *qdev,
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"%d bytes of headers & data in chain of large.\n", length);
 			skb = sbq_desc->p.skb;
-			bq = &bq_array[0];
-			memcpy(bq, skb->data, sizeof(bq_array));
 			sbq_desc->p.skb = NULL;
 			skb_reserve(skb, NET_IP_ALIGN);
-		} else {
-			QPRINTK(qdev, RX_STATUS, DEBUG,
-				"Headers in small, %d bytes of data in chain of large.\n", length);
-			bq = (__le64 *)sbq_desc->p.skb->data;
 		}
 		while (length > 0) {
-			lbq_desc = ql_get_curr_lbuf(rx_ring);
-			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc,
-						      mapaddr),
-				       pci_unmap_len(lbq_desc,
-						     maplen),
-				       PCI_DMA_FROMDEVICE);
-			size = (length < PAGE_SIZE) ? length : PAGE_SIZE;
-			offset = 0;
+			lbq_desc = ql_get_curr_lchunk(qdev, rx_ring);
+			size = (length < rx_ring->lbq_buf_size) ? length :
+				rx_ring->lbq_buf_size;
 
 			QPRINTK(qdev, RX_STATUS, DEBUG,
 				"Adding page %d to skb for %d bytes.\n",
 				i, size);
-			skb_fill_page_desc(skb, i, lbq_desc->p.lbq_page,
-					   offset, size);
+			skb_fill_page_desc(skb, i,
+						lbq_desc->p.pg_chunk.page,
+						lbq_desc->p.pg_chunk.offset,
+						size);
 			skb->len += size;
 			skb->data_len += size;
 			skb->truesize += size;
 			length -= size;
-			lbq_desc->p.lbq_page = NULL;
-			bq++;
 			i++;
 		}
 		__pskb_pull_tail(skb, (ib_mac_rsp->flags2 & IB_MAC_IOCB_RSP_V) ?
@@ -2304,20 +2352,29 @@ err:
 
 static void ql_free_lbq_buffers(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 {
-	int i;
 	struct bq_desc *lbq_desc;
 
-	for (i = 0; i < rx_ring->lbq_len; i++) {
-		lbq_desc = &rx_ring->lbq[i];
-		if (lbq_desc->p.lbq_page) {
+	uint32_t  curr_idx, clean_idx;
+
+	curr_idx = rx_ring->lbq_curr_idx;
+	clean_idx = rx_ring->lbq_clean_idx;
+	while (curr_idx != clean_idx) {
+		lbq_desc = &rx_ring->lbq[curr_idx];
+
+		if (lbq_desc->p.pg_chunk.last_flag) {
 			pci_unmap_page(qdev->pdev,
-				       pci_unmap_addr(lbq_desc, mapaddr),
-				       pci_unmap_len(lbq_desc, maplen),
+				lbq_desc->p.pg_chunk.map,
+				ql_lbq_block_size(qdev),
 				       PCI_DMA_FROMDEVICE);
-
-			put_page(lbq_desc->p.lbq_page);
-			lbq_desc->p.lbq_page = NULL;
+			lbq_desc->p.pg_chunk.last_flag = 0;
 		}
+
+		put_page(lbq_desc->p.pg_chunk.page);
+		lbq_desc->p.pg_chunk.page = NULL;
+
+		if (++curr_idx == rx_ring->lbq_len)
+			curr_idx = 0;
+
 	}
 }
 
@@ -2615,6 +2672,7 @@ static int ql_start_rx_ring(struct ql_adapter *qdev, struct rx_ring *rx_ring)
 	/* Set up the shadow registers for this ring. */
 	rx_ring->prod_idx_sh_reg = shadow_reg;
 	rx_ring->prod_idx_sh_reg_dma = shadow_reg_dma;
+	*rx_ring->prod_idx_sh_reg = 0;
 	shadow_reg += sizeof(u64);
 	shadow_reg_dma += sizeof(u64);
 	rx_ring->lbq_base_indirect = shadow_reg;
@@ -3495,6 +3553,10 @@ static int ql_configure_rings(struct ql_adapter *qdev)
 	struct rx_ring *rx_ring;
 	struct tx_ring *tx_ring;
 	int cpu_cnt = min(MAX_CPUS, (int)num_online_cpus());
+	unsigned int lbq_buf_len = (qdev->ndev->mtu > 1500) ?
+		LARGE_BUFFER_MAX_SIZE : LARGE_BUFFER_MIN_SIZE;
+
+	qdev->lbq_buf_order = get_order(lbq_buf_len);
 
 	/* In a perfect world we have one RSS ring for each CPU
 	 * and each has it's own vector.  To do that we ask for
@@ -3542,7 +3604,10 @@ static int ql_configure_rings(struct ql_adapter *qdev)
 			rx_ring->lbq_len = NUM_LARGE_BUFFERS;
 			rx_ring->lbq_size =
 			    rx_ring->lbq_len * sizeof(__le64);
-			rx_ring->lbq_buf_size = LARGE_BUFFER_SIZE;
+			rx_ring->lbq_buf_size = (u16)lbq_buf_len;
+			QPRINTK(qdev, IFUP, DEBUG,
+				"lbq_buf_size %d, order = %d\n",
+				rx_ring->lbq_buf_size, qdev->lbq_buf_order);
 			rx_ring->sbq_len = NUM_SMALL_BUFFERS;
 			rx_ring->sbq_size =
 			    rx_ring->sbq_len * sizeof(__le64);
@@ -3592,14 +3657,63 @@ error_up:
 	return err;
 }
 
+static int ql_change_rx_buffers(struct ql_adapter *qdev)
+{
+	struct rx_ring *rx_ring;
+	int i, status;
+	u32 lbq_buf_len;
+
+	/* Wait for an outstanding reset to complete. */
+	if (!test_bit(QL_ADAPTER_UP, &qdev->flags)) {
+		int i = 4;
+		while (--i && !test_bit(QL_ADAPTER_UP, &qdev->flags)) {
+			QPRINTK(qdev, IFUP, ERR,
+				 "Waiting for adapter UP...\n");
+			ssleep(1);
+		}
+
+		if (!i) {
+			QPRINTK(qdev, IFUP, ERR,
+			 "Timed out waiting for adapter UP\n");
+			return -ETIMEDOUT;
+		}
+	}
+
+	status = ql_adapter_down(qdev);
+	if (status)
+		goto error;
+
+	/* Get the new rx buffer size. */
+	lbq_buf_len = (qdev->ndev->mtu > 1500) ?
+		LARGE_BUFFER_MAX_SIZE : LARGE_BUFFER_MIN_SIZE;
+	qdev->lbq_buf_order = get_order(lbq_buf_len);
+
+	for (i = 0; i < qdev->rss_ring_count; i++) {
+		rx_ring = &qdev->rx_ring[i];
+		/* Set the new size. */
+		rx_ring->lbq_buf_size = lbq_buf_len;
+	}
+
+	status = ql_adapter_up(qdev);
+	if (status)
+		goto error;
+
+	return status;
+error:
+	QPRINTK(qdev, IFUP, ALERT,
+		"Driver up/down cycle failed, closing device.\n");
+	set_bit(QL_ADAPTER_UP, &qdev->flags);
+	dev_close(qdev->ndev);
+	return status;
+}
+
 static int qlge_change_mtu(struct net_device *ndev, int new_mtu)
 {
 	struct ql_adapter *qdev = netdev_priv(ndev);
+	int status;
 
 	if (ndev->mtu == 1500 && new_mtu == 9000) {
 		QPRINTK(qdev, IFUP, ERR, "Changing to jumbo MTU.\n");
-		queue_delayed_work(qdev->workqueue,
-				&qdev->mpi_port_cfg_work, 0);
 	} else if (ndev->mtu == 9000 && new_mtu == 1500) {
 		QPRINTK(qdev, IFUP, ERR, "Changing to normal MTU.\n");
 	} else if ((ndev->mtu == 1500 && new_mtu == 1500) ||
@@ -3607,8 +3721,23 @@ static int qlge_change_mtu(struct net_device *ndev, int new_mtu)
 		return 0;
 	} else
 		return -EINVAL;
+
+	queue_delayed_work(qdev->workqueue,
+			&qdev->mpi_port_cfg_work, 3*HZ);
+
+	if (!netif_running(qdev->ndev)) {
+		ndev->mtu = new_mtu;
+		return 0;
+	}
+
 	ndev->mtu = new_mtu;
-	return 0;
+	status = ql_change_rx_buffers(qdev);
+	if (status) {
+		QPRINTK(qdev, IFUP, ERR,
+			"Changing MTU failed.\n");
+	}
+
+	return status;
 }
 
 static struct net_device_stats *qlge_get_stats(struct net_device
-- 
1.6.0.2
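
A worked example of the sizing arithmetic in the patch above, written
as a user-space sketch for illustration only.  It assumes 4 KiB pages
(MODEL_PAGE_SIZE stands in for PAGE_SIZE) and reimplements get_order()
as the hypothetical helper order_for(); the other function names are
invented for this example as well.

```c
#include <assert.h>

#define MODEL_PAGE_SIZE		4096u	/* assumption: 4 KiB pages */
#define LARGE_BUFFER_MAX_SIZE	8192u
#define LARGE_BUFFER_MIN_SIZE	2048u

/* Smallest order such that (MODEL_PAGE_SIZE << order) >= size. */
static unsigned int order_for(unsigned int size)
{
	unsigned int order = 0;

	assert(size > 0);
	while ((MODEL_PAGE_SIZE << order) < size)
		order++;
	return order;
}

/* RX large buffer length chosen from the MTU, as in the patch. */
unsigned int lbq_buf_len_for_mtu(unsigned int mtu)
{
	return mtu > 1500 ? LARGE_BUFFER_MAX_SIZE : LARGE_BUFFER_MIN_SIZE;
}

/* Size of the page block that gets allocated and mapped once. */
unsigned int lbq_block_size(unsigned int mtu)
{
	return MODEL_PAGE_SIZE << order_for(lbq_buf_len_for_mtu(mtu));
}

/* Number of RX buffers (chunks) carved out of one block. */
unsigned int chunks_per_block(unsigned int mtu)
{
	return lbq_block_size(mtu) / lbq_buf_len_for_mtu(mtu);
}
```

Under these assumptions a 1500-byte MTU gives a 4096-byte block split
into two 2048-byte chunks, while a 9000-byte MTU gives one 8192-byte
buffer spanning two pages.  On systems with larger pages the chunk
count per block would be higher.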



* power management for zaurus
From: Oliver Neukum @ 2009-10-19 13:40 UTC
  To: pavel-+ZI9xUNit7I, David Brownell,
	linux-usb-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA

[-- Attachment #1: Type: text/plain, Size: 3202 bytes --]

Hi,

could somebody with a zaurus test this patch?
It introduces aggressive usb autosuspend for the devices.
It depends on the attached basic support for usbnet.

	Regards
		Oliver
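
For readers who want the gist before reading the diffs, here is a
small user-space model of how the optional manage_power hook changes
the open/stop paths in the patches below.  It is an illustration only:
struct model_dev, struct model_info, pm_usage and the model_*()
functions are hypothetical stand-ins, and pm_usage merely imitates the
interface's autopm usage count.

```c
#include <assert.h>
#include <stddef.h>

struct model_dev;

struct model_info {
	/* optional hook, mirrors driver_info.manage_power */
	int (*manage_power)(struct model_dev *dev, int on);
};

struct model_dev {
	const struct model_info *info;
	int needs_remote_wakeup;
	int pm_usage;	/* > 0 means autosuspend is blocked */
};

/* Same shape as the per-driver hooks: just toggle remote wakeup. */
int model_manage_power(struct model_dev *dev, int on)
{
	dev->needs_remote_wakeup = on;
	return 0;
}

/* Open takes a usage reference.  Drivers with the hook give it back
 * right away, so the device may autosuspend while the interface is up.
 */
int model_open(struct model_dev *dev)
{
	int ret = 0;

	dev->pm_usage++;
	if (dev->info->manage_power) {
		ret = dev->info->manage_power(dev, 1);
		dev->pm_usage--;	/* dropped on success and on error */
	}
	return ret;
}

/* Stop either clears remote wakeup or drops the reference from open. */
void model_stop(struct model_dev *dev)
{
	if (dev->info->manage_power) {
		dev->info->manage_power(dev, 0);
	} else {
		assert(dev->pm_usage > 0);
		dev->pm_usage--;
	}
}
```

In this model a driver without the hook keeps the usage count raised
for as long as the interface is open, which is the behaviour before
the patches.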

--

commit ce0be29fc149b0e178a47a9a2380ef2be52ea7c6
Author: Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>
Date:   Mon Oct 19 13:55:56 2009 +0200

    zaurus & rndis_host autosuspend

diff --git a/drivers/net/usb/rndis_host.c b/drivers/net/usb/rndis_host.c
index 0caa800..4630703 100644
--- a/drivers/net/usb/rndis_host.c
+++ b/drivers/net/usb/rndis_host.c
@@ -571,12 +571,19 @@ fill:
 }
 EXPORT_SYMBOL_GPL(rndis_tx_fixup);
 
+static int rndis_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 
 static const struct driver_info	rndis_info = {
 	.description =	"RNDIS device",
 	.flags =	FLAG_ETHER | FLAG_FRAMING_RN | FLAG_NO_SETINT,
 	.bind =		rndis_bind,
 	.unbind =	rndis_unbind,
+	.manage_power =	rndis_manage_power,
 	.status =	rndis_status,
 	.rx_fixup =	rndis_rx_fixup,
 	.tx_fixup =	rndis_tx_fixup,
@@ -609,6 +616,7 @@ static struct usb_driver rndis_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 static int __init rndis_init(void)
diff --git a/drivers/net/usb/zaurus.c b/drivers/net/usb/zaurus.c
index 04882c8..015c49d 100644
--- a/drivers/net/usb/zaurus.c
+++ b/drivers/net/usb/zaurus.c
@@ -94,6 +94,12 @@ static int zaurus_bind(struct usbnet *dev, struct usb_interface *intf)
 	return usbnet_generic_cdc_bind(dev, intf);
 }
 
+static int zaurus_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 /* PDA style devices are always connected if present */
 static int always_connected (struct usbnet *dev)
 {
@@ -106,6 +112,7 @@ static const struct driver_info	zaurus_sl5x00_info = {
 	.check_connect = always_connected,
 	.bind =		zaurus_bind,
 	.unbind =	usbnet_cdc_unbind,
+	.manage_power =	zaurus_manage_power,
 	.tx_fixup =	zaurus_tx_fixup,
 };
 #define	ZAURUS_STRONGARM_INFO	((unsigned long)&zaurus_sl5x00_info)
@@ -116,6 +123,7 @@ static const struct driver_info	zaurus_pxa_info = {
 	.check_connect = always_connected,
 	.bind =		zaurus_bind,
 	.unbind =	usbnet_cdc_unbind,
+	.manage_power =	zaurus_manage_power,
 	.tx_fixup =	zaurus_tx_fixup,
 };
 #define	ZAURUS_PXA_INFO		((unsigned long)&zaurus_pxa_info)
@@ -126,6 +134,7 @@ static const struct driver_info	olympus_mxl_info = {
 	.check_connect = always_connected,
 	.bind =		zaurus_bind,
 	.unbind =	usbnet_cdc_unbind,
+	.manage_power =	zaurus_manage_power,
 	.tx_fixup =	zaurus_tx_fixup,
 };
 #define	OLYMPUS_MXL_INFO	((unsigned long)&olympus_mxl_info)
@@ -262,6 +271,7 @@ static const struct driver_info	bogus_mdlm_info = {
 	.check_connect = always_connected,
 	.tx_fixup =	zaurus_tx_fixup,
 	.bind =		blan_mdlm_bind,
+	.manage_power = zaurus_manage_power
 };
 
 static const struct usb_device_id	products [] = {
@@ -370,6 +380,7 @@ static struct usb_driver zaurus_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 static int __init zaurus_init(void)


[-- Attachment #2: usbnet_tested_auto.diff --]
[-- Type: text/x-patch, Size: 9652 bytes --]

commit 956c214d266fc1764ceb931b039c7aadded4eb24
Author: Oliver Neukum <oliver-GvhC2dPhHPQdnm+yROfE0A@public.gmane.org>
Date:   Mon Oct 19 15:07:54 2009 +0200

    usb:usbnet&cdc-ether:full aggressive autosuspend
    
    autosuspend for cdc-ether devices while online if
    the device supports remote wakeup

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 4a6aff5..8ee5bd7 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -411,6 +411,12 @@ static int cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 	return 0;
 }
 
+static int cdc_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 static const struct driver_info	cdc_info = {
 	.description =	"CDC Ethernet Device",
 	.flags =	FLAG_ETHER,
@@ -418,6 +424,7 @@ static const struct driver_info	cdc_info = {
 	.bind =		cdc_bind,
 	.unbind =	usbnet_cdc_unbind,
 	.status =	cdc_status,
+	.manage_power =	cdc_manage_power,
 };
 
 /*-------------------------------------------------------------------------*/
@@ -570,6 +577,7 @@ static struct usb_driver cdc_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ca5ca5a..c9938d5 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -353,7 +353,8 @@ static void rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 
 	if (netif_running (dev->net)
 			&& netif_device_present (dev->net)
-			&& !test_bit (EVENT_RX_HALT, &dev->flags)) {
+			&& !test_bit (EVENT_RX_HALT, &dev->flags)
+			&& !test_bit (EVENT_DEV_ASLEEP, &dev->flags)) {
 		switch (retval = usb_submit_urb (urb, GFP_ATOMIC)) {
 		case -EPIPE:
 			usbnet_defer_kevent (dev, EVENT_RX_HALT);
@@ -611,15 +612,36 @@ EXPORT_SYMBOL_GPL(usbnet_unlink_rx_urbs);
 /*-------------------------------------------------------------------------*/
 
 // precondition: never called in_interrupt
+static void usbnet_terminate_urbs(struct usbnet *dev)
+{
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
+	DECLARE_WAITQUEUE (wait, current);
+	int temp;
+
+	/* ensure there are no more active urbs */
+	add_wait_queue(&unlink_wakeup, &wait);
+	dev->wait = &unlink_wakeup;
+	temp = unlink_urbs(dev, &dev->txq) +
+		unlink_urbs(dev, &dev->rxq);
+
+	/* maybe wait for deletions to finish. */
+	while (!skb_queue_empty(&dev->rxq)
+		&& !skb_queue_empty(&dev->txq)
+		&& !skb_queue_empty(&dev->done)) {
+			schedule_timeout(UNLINK_TIMEOUT_MS);
+			if (netif_msg_ifdown(dev))
+				devdbg(dev, "waited for %d urb completions",
+					temp);
+	}
+	dev->wait = NULL;
+	remove_wait_queue(&unlink_wakeup, &wait);
+}
 
 int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			temp;
 	int			retval;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
-	DECLARE_WAITQUEUE (wait, current);
 
 	netif_stop_queue (net);
 
@@ -641,25 +663,8 @@ int usbnet_stop (struct net_device *net)
 				info->description);
 	}
 
-	if (!(info->flags & FLAG_AVOID_UNLINK_URBS)) {
-		/* ensure there are no more active urbs */
-		add_wait_queue(&unlink_wakeup, &wait);
-		dev->wait = &unlink_wakeup;
-		temp = unlink_urbs(dev, &dev->txq) +
-			unlink_urbs(dev, &dev->rxq);
-
-		/* maybe wait for deletions to finish. */
-		while (!skb_queue_empty(&dev->rxq)
-				&& !skb_queue_empty(&dev->txq)
-				&& !skb_queue_empty(&dev->done)) {
-			msleep(UNLINK_TIMEOUT_MS);
-			if (netif_msg_ifdown(dev))
-				devdbg(dev, "waited for %d urb completions",
-					temp);
-		}
-		dev->wait = NULL;
-		remove_wait_queue(&unlink_wakeup, &wait);
-	}
+	if (!(info->flags & FLAG_AVOID_UNLINK_URBS))
+		usbnet_terminate_urbs(dev);
 
 	usb_kill_urb(dev->interrupt);
 
@@ -672,7 +677,10 @@ int usbnet_stop (struct net_device *net)
 	dev->flags = 0;
 	del_timer_sync (&dev->delay);
 	tasklet_kill (&dev->bh);
-	usb_autopm_put_interface(dev->intf);
+	if (info->manage_power)
+		info->manage_power(dev, 0);
+	else
+		usb_autopm_put_interface(dev->intf);
 
 	return 0;
 }
@@ -753,6 +761,12 @@ int usbnet_open (struct net_device *net)
 
 	// delay posting reads until we're fully open
 	tasklet_schedule (&dev->bh);
+	if (info->manage_power) {
+		retval = info->manage_power(dev, 1);
+		if (retval < 0)
+			goto done;
+		usb_autopm_put_interface(dev->intf);
+	}
 	return retval;
 done:
 	usb_autopm_put_interface(dev->intf);
@@ -882,6 +896,7 @@ kevent (struct work_struct *work)
 	if (test_bit (EVENT_TX_HALT, &dev->flags)) {
 		unlink_urbs (dev, &dev->txq);
 		status = usb_clear_halt (dev->udev, dev->out);
+		usb_autopm_put_interface(dev->intf);
 		if (status < 0
 				&& status != -EPIPE
 				&& status != -ESHUTDOWN) {
@@ -953,17 +968,20 @@ static void tx_complete (struct urb *urb)
 	if (urb->status == 0) {
 		dev->net->stats.tx_packets++;
 		dev->net->stats.tx_bytes += entry->length;
+		usb_autopm_put_interface_async(dev->intf);
 	} else {
 		dev->net->stats.tx_errors++;
 
 		switch (urb->status) {
 		case -EPIPE:
+			/* we do not allow autosuspension */
 			usbnet_defer_kevent (dev, EVENT_TX_HALT);
 			break;
 
 		/* software-driven interface shutdown */
 		case -ECONNRESET:		// async unlink
 		case -ESHUTDOWN:		// hardware gone
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 
 		// like rx, tx gets controller i/o faults during khubd delays
@@ -971,6 +989,7 @@ static void tx_complete (struct urb *urb)
 		case -EPROTO:
 		case -ETIME:
 		case -EILSEQ:
+			usb_mark_last_busy(dev->udev);
 			if (!timer_pending (&dev->delay)) {
 				mod_timer (&dev->delay,
 					jiffies + THROTTLE_JIFFIES);
@@ -979,8 +998,10 @@ static void tx_complete (struct urb *urb)
 							urb->status);
 			}
 			netif_stop_queue (dev->net);
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 		default:
+			usb_autopm_put_interface_async(dev->intf);
 			if (netif_msg_tx_err (dev))
 				devdbg (dev, "tx err %d", entry->urb->status);
 			break;
@@ -1058,6 +1079,23 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	}
 
 	spin_lock_irqsave (&dev->txq.lock, flags);
+	retval = usb_autopm_get_interface_async(dev->intf);
+	if (retval < 0) {
+		spin_unlock_irqrestore (&dev->txq.lock, flags);
+		goto drop;
+	}
+
+#ifdef CONFIG_PM
+	/* if this triggers the device is still asleep */
+	if (test_bit(EVENT_DEV_ASLEEP, &dev->flags)) {
+		/* transmission will be done in resume */
+		dev->deferred = urb;
+		/* no use to process more packets */
+		netif_stop_queue(net);
+		spin_unlock_irqrestore(&dev->txq.lock, flags);
+		goto deferred;
+	}
+#endif
 
 	switch ((retval = usb_submit_urb (urb, GFP_ATOMIC))) {
 	case -EPIPE:
@@ -1088,6 +1126,7 @@ drop:
 		devdbg (dev, "> tx, len %d, type 0x%x",
 			length, skb->protocol);
 	}
+deferred:
 	return NETDEV_TX_OK;
 }
 EXPORT_SYMBOL_GPL(usbnet_start_xmit);
@@ -1363,13 +1402,23 @@ int usbnet_suspend (struct usb_interface *intf, pm_message_t message)
 	struct usbnet		*dev = usb_get_intfdata(intf);
 
 	if (!dev->suspend_count++) {
+		spin_lock_irq(&dev->txq.lock);
+		/* don't autosuspend while transmitting */
+		if (dev->txq.qlen && (message.event & PM_EVENT_AUTO)) {
+			spin_unlock_irq(&dev->txq.lock);
+			return -EBUSY;
+		} else {
+			set_bit(EVENT_DEV_ASLEEP, &dev->flags);
+			spin_unlock_irq(&dev->txq.lock);
+		}
 		/*
 		 * accelerate emptying of the rx and queues, to avoid
 		 * having everything error out.
 		 */
 		netif_device_detach (dev->net);
-		(void) unlink_urbs (dev, &dev->rxq);
-		(void) unlink_urbs (dev, &dev->txq);
+		usbnet_terminate_urbs(dev);
+		usb_kill_urb(dev->interrupt);
+		
 		/*
 		 * reattach so runtime management can use and
 		 * wake the device
@@ -1383,10 +1432,32 @@ EXPORT_SYMBOL_GPL(usbnet_suspend);
 int usbnet_resume (struct usb_interface *intf)
 {
 	struct usbnet		*dev = usb_get_intfdata(intf);
-
-	if (!--dev->suspend_count)
+	struct sk_buff          *skb;
+	struct urb              *res;
+	int                     retval;
+	
+	if (!--dev->suspend_count) {
+		spin_lock_irq(&dev->txq.lock);
+		res = dev->deferred;
+		dev->deferred = NULL;
+		clear_bit(EVENT_DEV_ASLEEP, &dev->flags);
+		spin_unlock_irq(&dev->txq.lock);
+		if (res) {
+			retval = usb_submit_urb(res, GFP_NOIO);
+			if (retval < 0) {
+				usb_free_urb(res);
+				netif_start_queue(dev->net);
+				usb_autopm_put_interface_async(dev->intf);
+			} else {
+				skb = (struct sk_buff *)res->context;
+				dev->net->trans_start = jiffies;
+				__skb_queue_tail (&dev->txq, skb);
+				if (!(dev->txq.qlen >= TX_QLEN(dev)))
+					netif_start_queue(dev->net);
+			}
+		}
 		tasklet_schedule (&dev->bh);
-
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(usbnet_resume);
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index f814730..e418c0b 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -55,6 +55,7 @@ struct usbnet {
 	struct sk_buff_head	done;
 	struct sk_buff_head	rxq_pause;
 	struct urb		*interrupt;
+	struct urb		*deferred;
 	struct tasklet_struct	bh;
 
 	struct work_struct	kevent;
@@ -65,6 +66,8 @@ struct usbnet {
 #		define EVENT_STS_SPLIT	3
 #		define EVENT_LINK_RESET	4
 #		define EVENT_RX_PAUSED	5
+#             define EVENT_DEV_WAKING 6 
+#             define EVENT_DEV_ASLEEP 7 
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)
@@ -107,6 +110,9 @@ struct driver_info {
 	/* see if peer is connected ... can sleep */
 	int	(*check_connect)(struct usbnet *);
 
+	/* (dis)activate runtime power management */
+	int	(*manage_power)(struct usbnet *, int);
+
 	/* for status polling */
 	void	(*status)(struct usbnet *, struct urb *);
 

^ permalink raw reply related

* Re: [PATCH 3/9] pcmcia: use pcmcia_loop_config in misc pcmcia drivers
From: Jiri Kosina @ 2009-10-19 13:45 UTC (permalink / raw)
  To: Dominik Brodowski
  Cc: linux-pcmcia, David S. Miller, John W. Linville, David Sterba,
	netdev, linux-wireless
In-Reply-To: <1255907255-28297-3-git-send-email-linux@dominikbrodowski.net>

On Mon, 19 Oct 2009, Dominik Brodowski wrote:

> Use pcmcia_loop_config() in a few drivers missed during the first
> round. On fmvj18x_cs.c it -- strangely -- only requries us to set
> conf.ConfigIndex, which is done by the core, so include an empty
> loop function which returns 0 unconditionally.
> 
> CC: David S. Miller <davem@davemloft.net>
> CC: John W. Linville <linville@tuxdriver.com>
> CC: Jiri Kosina <jkosina@suse.cz>
> CC: David Sterba <dsterba@suse.cz>
> CC: netdev@vger.kernel.org
> CC: linux-wireless@vger.kernel.org
> Signed-off-by: Dominik Brodowski <linux@dominikbrodowski.net>
> ---
>  drivers/char/pcmcia/ipwireless/main.c |  103 +++++++--------------------------

For the ipwireless part

	Acked-by: Jiri Kosina <jkosina@suse.cz>

Thanks,

-- 
Jiri Kosina
SUSE Labs, Novell Inc.

^ permalink raw reply

* [rft]aggressive power management for cdc-ether
From: Oliver Neukum @ 2009-10-19 13:51 UTC (permalink / raw)
  To: David Brownell, linux-usb, netdev

Hi,

This implements USB autosuspend for online cdc-ether devices
that support remote wakeup. What do you think?
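
The deferral idea in the patch below (park one packet while the device is
still asleep, stop the queue, and submit the packet on resume) can be
sketched in plain userspace C. This is only a rough model under stated
assumptions: struct model_dev, model_xmit() and model_resume() are
hypothetical names, not usbnet API, and the model always restarts the queue
on resume, whereas the patch only does so when a deferred URB exists and
the tx queue has room.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace model of the state the patch adds to usbnet. */
struct model_dev {
	bool asleep;          /* stands in for EVENT_DEV_ASLEEP */
	bool queue_stopped;   /* stands in for netif_stop_queue()/netif_start_queue() */
	const void *deferred; /* stands in for dev->deferred */
	int submitted;        /* packets handed to the "hardware" */
};

enum xmit_result { XMIT_SUBMITTED, XMIT_DEFERRED };

/* Transmit path: submit directly, or park the packet if still asleep. */
static enum xmit_result model_xmit(struct model_dev *dev, const void *pkt)
{
	assert(dev != NULL && pkt != NULL);
	if (dev->asleep) {
		dev->deferred = pkt;       /* transmission happens in resume */
		dev->queue_stopped = true; /* no use accepting more packets */
		return XMIT_DEFERRED;
	}
	dev->submitted++;
	return XMIT_SUBMITTED;
}

/* Resume path: take the parked packet, mark awake, then submit it. */
static void model_resume(struct model_dev *dev)
{
	const void *pkt = dev->deferred;

	dev->deferred = NULL;
	dev->asleep = false;
	if (pkt != NULL)
		dev->submitted++;
	dev->queue_stopped = false; /* simplification, see the note above */
}
```

Only one packet is ever parked because the queue is stopped as soon as the
first one is deferred, which is why a single pointer is enough.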

	Regards
		Oliver

--

commit 956c214d266fc1764ceb931b039c7aadded4eb24
Author: Oliver Neukum <oliver@neukum.org>
Date:   Mon Oct 19 15:07:54 2009 +0200

    usb:usbnet&cdc-ether:full aggressive autosuspend
    
    autosuspend for cdc-ether devices while online if
    the device supports remote wakeup

diff --git a/drivers/net/usb/cdc_ether.c b/drivers/net/usb/cdc_ether.c
index 4a6aff5..8ee5bd7 100644
--- a/drivers/net/usb/cdc_ether.c
+++ b/drivers/net/usb/cdc_ether.c
@@ -411,6 +411,12 @@ static int cdc_bind(struct usbnet *dev, struct usb_interface *intf)
 	return 0;
 }
 
+static int cdc_manage_power(struct usbnet *dev, int on)
+{
+	dev->intf->needs_remote_wakeup = on;
+	return 0;
+}
+
 static const struct driver_info	cdc_info = {
 	.description =	"CDC Ethernet Device",
 	.flags =	FLAG_ETHER,
@@ -418,6 +424,7 @@ static const struct driver_info	cdc_info = {
 	.bind =		cdc_bind,
 	.unbind =	usbnet_cdc_unbind,
 	.status =	cdc_status,
+	.manage_power =	cdc_manage_power,
 };
 
 /*-------------------------------------------------------------------------*/
@@ -570,6 +577,7 @@ static struct usb_driver cdc_driver = {
 	.disconnect =	usbnet_disconnect,
 	.suspend =	usbnet_suspend,
 	.resume =	usbnet_resume,
+	.supports_autosuspend = 1,
 };
 
 
diff --git a/drivers/net/usb/usbnet.c b/drivers/net/usb/usbnet.c
index ca5ca5a..c9938d5 100644
--- a/drivers/net/usb/usbnet.c
+++ b/drivers/net/usb/usbnet.c
@@ -353,7 +353,8 @@ static void rx_submit (struct usbnet *dev, struct urb *urb, gfp_t flags)
 
 	if (netif_running (dev->net)
 			&& netif_device_present (dev->net)
-			&& !test_bit (EVENT_RX_HALT, &dev->flags)) {
+			&& !test_bit (EVENT_RX_HALT, &dev->flags)
+			&& !test_bit (EVENT_DEV_ASLEEP, &dev->flags)) {
 		switch (retval = usb_submit_urb (urb, GFP_ATOMIC)) {
 		case -EPIPE:
 			usbnet_defer_kevent (dev, EVENT_RX_HALT);
@@ -611,15 +612,36 @@ EXPORT_SYMBOL_GPL(usbnet_unlink_rx_urbs);
 /*-------------------------------------------------------------------------*/
 
 // precondition: never called in_interrupt
+static void usbnet_terminate_urbs(struct usbnet *dev)
+{
+	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
+	DECLARE_WAITQUEUE (wait, current);
+	int temp;
+
+	/* ensure there are no more active urbs */
+	add_wait_queue(&unlink_wakeup, &wait);
+	dev->wait = &unlink_wakeup;
+	temp = unlink_urbs(dev, &dev->txq) +
+		unlink_urbs(dev, &dev->rxq);
+
+	/* maybe wait for deletions to finish. */
+	while (!skb_queue_empty(&dev->rxq)
+		&& !skb_queue_empty(&dev->txq)
+		&& !skb_queue_empty(&dev->done)) {
+			schedule_timeout(UNLINK_TIMEOUT_MS);
+			if (netif_msg_ifdown(dev))
+				devdbg(dev, "waited for %d urb completions",
+					temp);
+	}
+	dev->wait = NULL;
+	remove_wait_queue(&unlink_wakeup, &wait);
+}
 
 int usbnet_stop (struct net_device *net)
 {
 	struct usbnet		*dev = netdev_priv(net);
 	struct driver_info	*info = dev->driver_info;
-	int			temp;
 	int			retval;
-	DECLARE_WAIT_QUEUE_HEAD_ONSTACK (unlink_wakeup);
-	DECLARE_WAITQUEUE (wait, current);
 
 	netif_stop_queue (net);
 
@@ -641,25 +663,8 @@ int usbnet_stop (struct net_device *net)
 				info->description);
 	}
 
-	if (!(info->flags & FLAG_AVOID_UNLINK_URBS)) {
-		/* ensure there are no more active urbs */
-		add_wait_queue(&unlink_wakeup, &wait);
-		dev->wait = &unlink_wakeup;
-		temp = unlink_urbs(dev, &dev->txq) +
-			unlink_urbs(dev, &dev->rxq);
-
-		/* maybe wait for deletions to finish. */
-		while (!skb_queue_empty(&dev->rxq)
-				&& !skb_queue_empty(&dev->txq)
-				&& !skb_queue_empty(&dev->done)) {
-			msleep(UNLINK_TIMEOUT_MS);
-			if (netif_msg_ifdown(dev))
-				devdbg(dev, "waited for %d urb completions",
-					temp);
-		}
-		dev->wait = NULL;
-		remove_wait_queue(&unlink_wakeup, &wait);
-	}
+	if (!(info->flags & FLAG_AVOID_UNLINK_URBS))
+		usbnet_terminate_urbs(dev);
 
 	usb_kill_urb(dev->interrupt);
 
@@ -672,7 +677,10 @@ int usbnet_stop (struct net_device *net)
 	dev->flags = 0;
 	del_timer_sync (&dev->delay);
 	tasklet_kill (&dev->bh);
-	usb_autopm_put_interface(dev->intf);
+	if (info->manage_power)
+		info->manage_power(dev, 0);
+	else
+		usb_autopm_put_interface(dev->intf);
 
 	return 0;
 }
@@ -753,6 +761,12 @@ int usbnet_open (struct net_device *net)
 
 	// delay posting reads until we're fully open
 	tasklet_schedule (&dev->bh);
+	if (info->manage_power) {
+		retval = info->manage_power(dev, 1);
+		if (retval < 0)
+			goto done;
+		usb_autopm_put_interface(dev->intf);
+	}
 	return retval;
 done:
 	usb_autopm_put_interface(dev->intf);
@@ -882,6 +896,7 @@ kevent (struct work_struct *work)
 	if (test_bit (EVENT_TX_HALT, &dev->flags)) {
 		unlink_urbs (dev, &dev->txq);
 		status = usb_clear_halt (dev->udev, dev->out);
+		usb_autopm_put_interface(dev->intf);
 		if (status < 0
 				&& status != -EPIPE
 				&& status != -ESHUTDOWN) {
@@ -953,17 +968,20 @@ static void tx_complete (struct urb *urb)
 	if (urb->status == 0) {
 		dev->net->stats.tx_packets++;
 		dev->net->stats.tx_bytes += entry->length;
+		usb_autopm_put_interface_async(dev->intf);
 	} else {
 		dev->net->stats.tx_errors++;
 
 		switch (urb->status) {
 		case -EPIPE:
+			/* we do not allow autosuspension */
 			usbnet_defer_kevent (dev, EVENT_TX_HALT);
 			break;
 
 		/* software-driven interface shutdown */
 		case -ECONNRESET:		// async unlink
 		case -ESHUTDOWN:		// hardware gone
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 
 		// like rx, tx gets controller i/o faults during khubd delays
@@ -971,6 +989,7 @@ static void tx_complete (struct urb *urb)
 		case -EPROTO:
 		case -ETIME:
 		case -EILSEQ:
+			usb_mark_last_busy(dev->udev);
 			if (!timer_pending (&dev->delay)) {
 				mod_timer (&dev->delay,
 					jiffies + THROTTLE_JIFFIES);
@@ -979,8 +998,10 @@ static void tx_complete (struct urb *urb)
 							urb->status);
 			}
 			netif_stop_queue (dev->net);
+			usb_autopm_put_interface_async(dev->intf);
 			break;
 		default:
+			usb_autopm_put_interface_async(dev->intf);
 			if (netif_msg_tx_err (dev))
 				devdbg (dev, "tx err %d", entry->urb->status);
 			break;
@@ -1058,6 +1079,23 @@ netdev_tx_t usbnet_start_xmit (struct sk_buff *skb,
 	}
 
 	spin_lock_irqsave (&dev->txq.lock, flags);
+	retval = usb_autopm_get_interface_async(dev->intf);
+	if (retval < 0) {
+		spin_unlock_irqrestore (&dev->txq.lock, flags);
+		goto drop;
+	}
+
+#ifdef CONFIG_PM
+	/* if this triggers the device is still asleep */
+	if (test_bit(EVENT_DEV_ASLEEP, &dev->flags)) {
+		/* transmission will be done in resume */
+		dev->deferred = urb;
+		/* no use to process more packets */
+		netif_stop_queue(net);
+		spin_unlock_irqrestore(&dev->txq.lock, flags);
+		goto deferred;
+	}
+#endif
 
 	switch ((retval = usb_submit_urb (urb, GFP_ATOMIC))) {
 	case -EPIPE:
@@ -1088,6 +1126,7 @@ drop:
 		devdbg (dev, "> tx, len %d, type 0x%x",
 			length, skb->protocol);
 	}
+deferred:
 	return NETDEV_TX_OK;
 }
 EXPORT_SYMBOL_GPL(usbnet_start_xmit);
@@ -1363,13 +1402,23 @@ int usbnet_suspend (struct usb_interface *intf, pm_message_t message)
 	struct usbnet		*dev = usb_get_intfdata(intf);
 
 	if (!dev->suspend_count++) {
+		spin_lock_irq(&dev->txq.lock);
+		/* don't autosuspend while transmitting */
+		if (dev->txq.qlen && (message.event & PM_EVENT_AUTO)) {
+			spin_unlock_irq(&dev->txq.lock);
+			return -EBUSY;
+		} else {
+			set_bit(EVENT_DEV_ASLEEP, &dev->flags);
+			spin_unlock_irq(&dev->txq.lock);
+		}
 		/*
 		 * accelerate emptying of the rx and queues, to avoid
 		 * having everything error out.
 		 */
 		netif_device_detach (dev->net);
-		(void) unlink_urbs (dev, &dev->rxq);
-		(void) unlink_urbs (dev, &dev->txq);
+		usbnet_terminate_urbs(dev);
+		usb_kill_urb(dev->interrupt);
+		
 		/*
 		 * reattach so runtime management can use and
 		 * wake the device
@@ -1383,10 +1432,32 @@ EXPORT_SYMBOL_GPL(usbnet_suspend);
 int usbnet_resume (struct usb_interface *intf)
 {
 	struct usbnet		*dev = usb_get_intfdata(intf);
-
-	if (!--dev->suspend_count)
+	struct sk_buff          *skb;
+	struct urb              *res;
+	int                     retval;
+	
+	if (!--dev->suspend_count) {
+		spin_lock_irq(&dev->txq.lock);
+		res = dev->deferred;
+		dev->deferred = NULL;
+		clear_bit(EVENT_DEV_ASLEEP, &dev->flags);
+		spin_unlock_irq(&dev->txq.lock);
+		if (res) {
+			retval = usb_submit_urb(res, GFP_NOIO);
+			if (retval < 0) {
+				usb_free_urb(res);
+				netif_start_queue(dev->net);
+				usb_autopm_put_interface_async(dev->intf);
+			} else {
+				skb = (struct sk_buff *)res->context;
+				dev->net->trans_start = jiffies;
+				__skb_queue_tail (&dev->txq, skb);
+				if (!(dev->txq.qlen >= TX_QLEN(dev)))
+					netif_start_queue(dev->net);
+			}
+		}
 		tasklet_schedule (&dev->bh);
-
+	}
 	return 0;
 }
 EXPORT_SYMBOL_GPL(usbnet_resume);
diff --git a/include/linux/usb/usbnet.h b/include/linux/usb/usbnet.h
index f814730..e418c0b 100644
--- a/include/linux/usb/usbnet.h
+++ b/include/linux/usb/usbnet.h
@@ -55,6 +55,7 @@ struct usbnet {
 	struct sk_buff_head	done;
 	struct sk_buff_head	rxq_pause;
 	struct urb		*interrupt;
+	struct urb		*deferred;
 	struct tasklet_struct	bh;
 
 	struct work_struct	kevent;
@@ -65,6 +66,8 @@ struct usbnet {
 #		define EVENT_STS_SPLIT	3
 #		define EVENT_LINK_RESET	4
 #		define EVENT_RX_PAUSED	5
+#             define EVENT_DEV_WAKING 6 
+#             define EVENT_DEV_ASLEEP 7 
 };
 
 static inline struct usb_driver *driver_of(struct usb_interface *intf)
@@ -107,6 +110,9 @@ struct driver_info {
 	/* see if peer is connected ... can sleep */
 	int	(*check_connect)(struct usbnet *);
 
+	/* (dis)activate runtime power management */
+	int	(*manage_power)(struct usbnet *, int);
+
 	/* for status polling */
 	void	(*status)(struct usbnet *, struct urb *);
 


^ permalink raw reply related

* Re: [PATCH 0/2] Reduce number of GFP_ATOMIC allocation failures
From: Mel Gorman @ 2009-10-19 14:13 UTC (permalink / raw)
  To: Karol Lewandowski
  Cc: Andrew Morton, stable, Rafael J. Wysocki, David Miller, Frans Pop,
	reinette chatre, Kalle Valo, John W. Linville, Pekka Enberg,
	Bartlomiej Zolnierkiewicz, netdev, linux-kernel,
	linux-mm@kvack.org
In-Reply-To: <20091017183421.GA3370@bizet.domek.prywatny>

On Sat, Oct 17, 2009 at 08:34:21PM +0200, Karol Lewandowski wrote:
> On Fri, Oct 16, 2009 at 11:37:24AM +0100, Mel Gorman wrote:
> > The following two patches against 2.6.32-rc4 should reduce allocation
> > failure reports for GFP_ATOMIC allocations that have being cropping up
> > since 2.6.31-rc1.
> ...
> > The patches should also help the following bugs as well and testing there
> > would be appreciated.
> > 
> > [Bug #14265] ifconfig: page allocation failure. order:5, mode:0x8020 w/ e100
> > 
> > It might also have helped the following bug
> 
> These patches actually made situation kind-of "worse" for this
> particular issue.
> 
> I've tried patches with post 2.6.32-rc4 kernel and after second
> suspend-resume cycle I got typical "order:5" failure.  However, this
> time when I manually tried to bring interface up ("ifup eth0") it
> failed for 4 consecutive times with "Can't allocate memory".  Before
> applying these patches this never occured -- kernel sometimes failed
> to allocate memory during resume, but it *never* failed afterwards.
> 

I'm hoping the patch + the revert which I asked for in another mail will
help. It's been clear for a while that more than one thing went wrong
during this cycle.

> I'll go now for another round of bisecting... and hopefully this time
> I'll be able to trigger this problem on different/faster computer with
> e100-based card.
> 
> 
> > although that driver has already been fixed by not making high-order
> > atomic allocations.
> 
> Driver has been fixed?  The one patch that I saw (by davem[1]) didn't
> fix this issue.  As of 2.6.32-rc5 I see no fixes to e100.c in
> mainline, has there been another than this[1] fix posted somewhere?
> 
> [1] http://lkml.org/lkml/2009/10/12/169
> 

The driver that was fixed was for the ipw2200, not the e100.

Thanks

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab


^ permalink raw reply

* Re: [PATCH] myri10ge: improve port type reporting in ethtool output
From: Ben Hutchings @ 2009-10-19 14:17 UTC (permalink / raw)
  To: Andrew Gallatin
  Cc: Brice Goglin, David S. Miller, Linux Network Development list
In-Reply-To: <4ADC69FF.5000109@myri.com>

On Mon, 2009-10-19 at 09:30 -0400, Andrew Gallatin wrote:
> Ben Hutchings wrote:
> > On Mon, 2009-10-19 at 08:34 -0400, Andrew Gallatin wrote:
> >> Ben Hutchings wrote:
> >>
> >>> Lying about link modes is not an improvement.
> >> OK, so we're probably doing something wrong. I suspect we're not
> >> alone.  At least we don't set SUPPORTED_TP for CX4, like I've
> >> seen some NICs do.
> >>
> >> Can somebody suggest how we can tell ethtool that
> >> the NIC supports 10Gb only (no autoneg down to 1Gb or lower)
> >> for copper (10Gbase-CX4)?   How about for fiber (10Gbase-{S,L})R?
> > 
> > What's wrong with what you already do?  Customers expect to see
> > something on the supported line?
> 
> Exactly.  One has complained because drivers for
> other vendors NICs show this, even if they are fibre NICs
> or CX4 NICs, and don't actually support 10GbaseT.

Let's fix the other drivers then.  Labelling these NICs as supporting
10GBASE-T is liable to confuse more people (and tools) in the long run.

> I'm happy to back this part out, and resubmit the patch without
> it. There is still some fairly valuable stuff in the patch
> -- mainly updating the NIC detection logic for new NICs to
> detect fibre vs copper.

Sure.

You should also set port = PORT_OTHER for CX4 or KX4.  Currently it
looks like you don't set port, so it appears as 0 == PORT_TP.
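
A minimal sketch of that point, assuming the port values from
linux/ethtool.h (PORT_TP is 0x00, PORT_FIBRE is 0x03, PORT_OTHER is 0xff),
redefined locally with a MODEL_ prefix so the snippet builds on its own.
The media enum and model_port_for_media() are hypothetical; the driver's
real detection logic is not shown in this thread, and mapping the fibre
variants to PORT_FIBRE is an assumption.

```c
#include <assert.h>

/* Local copies of the ethtool port values, prefixed to avoid clashes. */
enum {
	MODEL_PORT_TP    = 0x00,
	MODEL_PORT_FIBRE = 0x03,
	MODEL_PORT_OTHER = 0xff
};

/* Hypothetical media classification for a 10Gb-only NIC. */
enum model_media {
	MEDIA_CX4,
	MEDIA_KX4,
	MEDIA_FIBRE_SR,
	MEDIA_FIBRE_LR
};

/*
 * Pick the port type explicitly. Leaving the field unset would report
 * 0, which reads back as twisted pair.
 */
static int model_port_for_media(enum model_media media)
{
	switch (media) {
	case MEDIA_FIBRE_SR:
	case MEDIA_FIBRE_LR:
		return MODEL_PORT_FIBRE; /* assumption, not stated in the thread */
	case MEDIA_CX4:
	case MEDIA_KX4:
		return MODEL_PORT_OTHER;
	}
	return MODEL_PORT_OTHER;
}
```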

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-19 15:32 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <20091019121327.GA11423@redhat.com>



Once upon a time, like on 09-10-19 5:13 AM, Oleg Nesterov wrote:
> Hi Anirban,
> 
> On 10/18, Anirban Sinha wrote:
>>
>> I have a question for you. The queue_work() routine which is called from
>> schedule_work() does a put_cpu() which in turn does a enable_preempt(). Is
>> this an attempt to trigger the scheduler?
> 
> No. please note that queue_work() does get_cpu() + put_cpu() to protect
> against cpu_down() in between.

Grrr! Ah yes, my eyes failed me (or they saw what I wanted them to see :)). You do have a get_cpu() and a put_cpu() together in the same code path. I guess I will have to keep looking at inet_twdr_hangman().

>> Is is
>> it illegal to schedule a work function from within a timer callback?
> 
> Yes sure.

Hmm, maybe in that case that function needs to be rewritten.

> I'd suppose that this unbalance comes from inet_twdr_hangman() pathes.
> 
> Could you verify this?

I'll keep looking. Thanks for the help Oleg. 

Ani


^ permalink raw reply

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Oleg Nesterov @ 2009-10-19 15:36 UTC (permalink / raw)
  To: Anirban Sinha; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <4ADC8677.7000607@anirban.org>

On 10/19, Anirban Sinha wrote:
>
> Once upon a time, like on 09-10-19 5:13 AM, Oleg Nesterov wrote:
>
> >> Is is
> >> it illegal to schedule a work function from within a timer callback?
> >
> > Yes sure.
>
> hmm. may be in that case, that function needs to be re-written.

OOPS!!!! I misread your question, didn't notice "il" above...

I meant: yes sure it _is legal_ to schedule a work from within a timer
callback (in fact it is legal from any context).

Sorry for the confusion.

Oleg.


^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 15:50 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Eric Dumazet, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910190619t3e009e1by49cc8f7307eb7cdb@mail.gmail.com>

[Michal Ostrowski - Mon, Oct 19, 2009 at 08:19:23AM -0500]
|
| The entire scheme for managing net namespaces seems unsafe.  We depend
| on synchronization via pn->hash_lock, but have no guarantee of the
| existence of the "net" object -- hence no way to ensure the existence
| of the lock itself.  This should be relatively easy to fix though as
| we should be able to get/put the net namespace as we add remove
| objects to/from the pppoe hash.
|

Hmm... it seems not. The only possible scenario I see for such a nonexistent
namespace is one where it was cached via RCU and returned before the grace
period elapsed, so perhaps we need to call synchronize_net() somewhere.

| 
| Once you solve this existence issue, the flush_lock can be eliminated
| altogether since all of the relevant code paths already depend on a
| write_lock_bh(&pn->hash_lock), and that's the lock that should be use
| to protect the pppoe_dev field.
| 
| Another patch to follow later...
| 
| --
| Michal Ostrowski
| mostrows@gmail.com
| 
| 
| 
| On Mon, Oct 19, 2009 at 7:36 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
| > Michal Ostrowski a écrit :
| >> Here's my theory on this after an inital look...
| >>
| >> Looking at the oops report and disassembly of the actual module binary
| >> that caused the oops, one can deduce that:
| >>
| >> Execution was in pppoe_flush_dev().  %ebx contained the pointer "struct
| >> pppox_sock *po", which is what we faulted on, excuting "cmp %eax, 0x190(%ebx)".
| >> %ebx value was 0xffffffff (hence we got "NULL pointer dereference at 0x18f").
| >>
| >> At this point "i" (stored in %esi) is 15 (valid), meaning that we got a value
| >> of 0xffffffff in pn->hash_table[i].
| >>
| >> From this I'd hypothesize that the combination of dev_put() and release_sock()
| >> may have allowed us to free "pn".  At the bottom of the loop we alreayd
| >> recognize that since locks are dropped we're responsible for handling
| >> invalidation of objects, and perhaps that should be extended to "pn" as well.
| >> --
| >> Michal Ostrowski
| >> mostrows@gmail.com
| >>
| >>
| >
| > Looking at this stuff, I do believe flush_lock protection is not
| > properly done.
| >
| > At the end of pppoe_connect() for example we can find :
| >
| > err_put:
| >        if (po->pppoe_dev) {
| >                dev_put(po->pppoe_dev);
| >                po->pppoe_dev = NULL;
| >        }

Yep, this is unsafe, thanks!

| >
| > This is done without any protection, and can therefore clash with
| > pppoe_flush_dev() :
| >
| >        spin_lock(&flush_lock);
| >        po->pppoe_dev = NULL; /* ppoe_dev can already be NULL before this point */
| >        spin_unlock(&flush_lock);
| >
| >        dev_put(dev);    /* oops */
| >
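
The clash described above comes down to two paths that can each drop the
same device reference. A rough userspace sketch of the usual remedy (test,
drop and clear under one lock, so whichever path runs second sees NULL and
does nothing) might look like this. All model_ names are hypothetical, and
a pthread mutex stands in for the kernel spinlock.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct net_device and struct pppox_sock. */
struct model_netdev { int refs; };
struct model_sock { struct model_netdev *dev; };

static pthread_mutex_t model_lock = PTHREAD_MUTEX_INITIALIZER;

/*
 * Drop the socket's device reference at most once. The NULL check, the
 * reference drop and the pointer clear all happen under the same lock,
 * so two callers cannot both see a non-NULL pointer.
 */
static void model_detach(struct model_sock *po)
{
	assert(po != NULL);
	pthread_mutex_lock(&model_lock);
	if (po->dev != NULL) {
		po->dev->refs--; /* stands in for dev_put() */
		po->dev = NULL;
	}
	pthread_mutex_unlock(&model_lock);
}
```

Calling model_detach() from both the flush path and the error path is then
safe in this model, because the second call finds the pointer already
cleared.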
| 

Denys, could you check if the patch below helps?

	-- Cyrill
---
 drivers/net/pppoe.c |    4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6.git/drivers/net/pppoe.c
=====================================================================
--- linux-2.6.git.orig/drivers/net/pppoe.c
+++ linux-2.6.git/drivers/net/pppoe.c
@@ -312,9 +312,9 @@ static void pppoe_flush_dev(struct net_d
 			}
 			sk = sk_pppox(po);
 			spin_lock(&flush_lock);
+			dev_put(po->pppoe_dev);
 			po->pppoe_dev = NULL;
 			spin_unlock(&flush_lock);
-			dev_put(dev);
 
 			/* We always grab the socket lock, followed by the
 			 * hash_lock, in that order.  Since we should
@@ -708,10 +708,12 @@ end:
 	release_sock(sk);
 	return error;
 err_put:
+	spin_lock(&flush_lock);
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);
 		po->pppoe_dev = NULL;
 	}
+	spin_unlock(&flush_lock);
 	goto end;
 }
 

^ permalink raw reply

* Re: MAINTAINERS drivers/net cleanups?
From: David Dillow @ 2009-10-19 15:24 UTC (permalink / raw)
  To: Joe Perches; +Cc: netdev
In-Reply-To: <1255729936.2267.179.camel@Joe-Laptop.home>

On Fri, 2009-10-16 at 14:52 -0700, Joe Perches wrote:
> Ben Hutchings suggested adding a facility to
> scripts/get_maintainer.pl to print the role of
> each maintainer.  I added a bit more to print
> statistics about the maintainers and "-by:" lines
> from each git commit as well.
[snip script]
> There are many people described as MAINTAINERS for
> these files that have not had a single sign-off or
> commit in git history.

Uhm, I think your script is broken, as I've patched typhoon this year.
I've also ACKed a few patches that have been sent to me, and tried to
watch the list for patches that weren't.

> Should these individuals be removed from MAINTAINERS
> and added to CREDITS if not already there?

I think you should try to contact the ones listed and see if they are
still actively watching for their drivers. I've not had to do a lot to
typhoon, as there is no new hardware coming out, and while I've sent in
several Acked-by's, not all of them have made it to the final commit.

Just because there are few patches from the person listed as maintainer
does not mean they are not keeping an eye on things.

> ------> drivers/net/typhoon.c
> David Dillow <dave@thedillows.org> (maintainer)
> 
> 3CR990 NETWORK DRIVER
> M: David Dillow <dave@thedillows.org>
> L: netdev@vger.kernel.org
> S: Maintained
> F: drivers/net/typhoon*

^ permalink raw reply

* Re: Kernel oops when clearing bgp neighbor info with TCP MD5SUM enabled
From: Anirban Sinha @ 2009-10-19 16:01 UTC (permalink / raw)
  To: Oleg Nesterov; +Cc: linux-kernel, David Miller, netdev, Anirban Sinha
In-Reply-To: <20091019153618.GA20967@redhat.com>


> I meant: yes sure it _is legal_ to schedule a work from within a timer
> callback (in fact it is legal from any context).

ok, then that part of the function looks fine.

Ani

^ permalink raw reply

* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 16:05 UTC (permalink / raw)
  To: Cyrill Gorcunov
  Cc: Eric Dumazet, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <20091019155034.GA5233@lenovo>

Here's a bigger patch that just gets rid of flush_lock altogether.

We were seeing oopses due to net namespaces going away while we were using
them, which turns out to be simply because pppoe wasn't claiming ref
counts properly.

Fixing this requires incrementing and decrementing the ref count whenever
entries are added to or removed from the per-net hash table.  This also
allows us to get rid of the flush_lock since we can now depend on the
existence of "pn->hash_lock".

We also have to be careful when flushing devices that removal of a hash table
entry may bring the net namespace refcount to 0.
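
As a rough illustration of the refcounting rule described above, here is a
userspace sketch using C11 atomics. It assumes the "take a reference only
if the count is still non-zero" behaviour that maybe_get_net() relies on.
The model_ns type and both helpers are hypothetical names, not kernel API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted net namespace. */
struct model_ns {
	atomic_int refs;
};

/* Take a reference only if the object is still alive (count > 0). */
static bool model_ns_tryget(struct model_ns *ns)
{
	int old = atomic_load(&ns->refs);

	while (old > 0) {
		/* On failure 'old' is reloaded and the check is repeated. */
		if (atomic_compare_exchange_weak(&ns->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool model_ns_put(struct model_ns *ns)
{
	int old = atomic_fetch_sub(&ns->refs, 1);

	assert(old > 0);
	return old == 1;
}
```

In this model each hash table entry would hold one reference, so removing
the final entry can be the put that brings the count to zero, which is the
case the last paragraph above warns about.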

---
 drivers/net/pppoe.c |  152 ++++++++++++++++++++++++++++-----------------------
 1 files changed, 83 insertions(+), 69 deletions(-)

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..140a196 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -111,9 +111,6 @@ struct pppoe_net {
       rwlock_t hash_lock;
 };

-/* to eliminate a race btw pppoe_flush_dev and pppoe_release */
-static DEFINE_SPINLOCK(flush_lock);
-
 /*
 * PPPoE could be in the following stages:
 * 1) Discovery stage (to obtain remote MAC and Session ID)
@@ -292,61 +289,77 @@ static inline struct pppox_sock
*delete_item(struct pppoe_net *pn, __be16 sid,
 static void pppoe_flush_dev(struct net_device *dev)
 {
       struct pppoe_net *pn;
+        struct net * net;
       int i;

       BUG_ON(dev == NULL);

+        /* We have to drop and re-acquire locks.  So we'll grab a ref-count on
+         * the net namespace to ensure it is valid throughout this function.
+         */
+
+        net = maybe_get_net(dev_net(dev));
+        if (!net)
+                return;
+
       pn = pppoe_pernet(dev_net(dev));
-       if (!pn) /* already freed */
+       if (!pn) { /* already freed */
+                put_net(net);
               return;
+        }

       write_lock_bh(&pn->hash_lock);
       for (i = 0; i < PPPOE_HASH_SIZE; i++) {
               struct pppox_sock *po = pn->hash_table[i];
-
-               while (po != NULL) {
-                       struct sock *sk;
-                       if (po->pppoe_dev != dev) {
-                               po = po->next;
-                               continue;
-                       }
-                       sk = sk_pppox(po);
-                       spin_lock(&flush_lock);
-                       po->pppoe_dev = NULL;
-                       spin_unlock(&flush_lock);
-                       dev_put(dev);
-
-                       /* We always grab the socket lock, followed by the
-                        * hash_lock, in that order.  Since we should
-                        * hold the sock lock while doing any unbinding,
-                        * we need to release the lock we're holding.
-                        * Hold a reference to the sock so it doesn't disappear
-                        * as we're jumping between locks.
-                        */
-
-                       sock_hold(sk);
-
-                       write_unlock_bh(&pn->hash_lock);
-                       lock_sock(sk);
-
-                       if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
-                               pppox_unbind_sock(sk);
-                               sk->sk_state = PPPOX_ZOMBIE;
-                               sk->sk_state_change(sk);
-                       }
-
-                       release_sock(sk);
-                       sock_put(sk);
-
-                       /* Restart scan at the beginning of this hash chain.
-                        * While the lock was dropped the chain contents may
-                        * have changed.
-                        */
-                       write_lock_bh(&pn->hash_lock);
-                       po = pn->hash_table[i];
-               }
+                struct sock *sk;
+
+                while (po && po->pppoe_dev != dev) {
+                        po = po->next;
+                }
+
+                if (po == NULL) {
+                        continue;
+                }
+
+                sk = sk_pppox(po);
+
+                if (po->pppoe_dev) {
+                        dev_put(po->pppoe_dev);
+                        po->pppoe_dev = NULL;
+                }
+
+                /* We always grab the socket lock, followed by the
+                 * hash_lock, in that order.  Since we should
+                 * hold the sock lock while doing any unbinding,
+                 * we need to release the lock we're holding.
+                 * Hold a reference to the sock so it doesn't disappear
+                 * as we're jumping between locks.
+                 */
+
+                sock_hold(sk);
+
+                write_unlock_bh(&pn->hash_lock);
+                lock_sock(sk);
+
+                if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+                        pppox_unbind_sock(sk);
+                        sk->sk_state = PPPOX_ZOMBIE;
+                        sk->sk_state_change(sk);
+                }
+
+                release_sock(sk);
+                sock_put(sk);
+
+                /* Restart the process from the start of the current hash
+                 * chain. We dropped locks so the world may have changed from
+                 * underneath us, but we know "pn" is still good because we
+                 * grabbed a ref count on "net".
+                 */
+                write_lock_bh(&pn->hash_lock);
+                po = pn->hash_table[i];
       }
       write_unlock_bh(&pn->hash_lock);
+        put_net(net);
 }

 static int pppoe_device_event(struct notifier_block *this,
@@ -561,6 +574,7 @@ static int pppoe_release(struct socket *sock)
       struct sock *sk = sock->sk;
       struct pppox_sock *po;
       struct pppoe_net *pn;
+        struct net *net = NULL;

       if (!sk)
               return 0;
@@ -576,19 +590,8 @@ static int pppoe_release(struct socket *sock)
       /* Signal the death of the socket. */
       sk->sk_state = PPPOX_DEAD;

-       /*
-        * pppoe_flush_dev could lead to a race with
-        * this routine so we use flush_lock to eliminate
-        * such a case (we only need per-net specific data)
-        */
-       spin_lock(&flush_lock);
-       po = pppox_sk(sk);
-       if (!po->pppoe_dev) {
-               spin_unlock(&flush_lock);
-               goto out;
-       }
-       pn = pppoe_pernet(dev_net(po->pppoe_dev));
-       spin_unlock(&flush_lock);
+        net = sock_net(sk);
+       pn = pppoe_pernet(net);

       /*
        * protect "po" from concurrent updates
@@ -601,14 +604,14 @@ static int pppoe_release(struct socket *sock)
               __delete_item(pn, po->pppoe_pa.sid, po->pppoe_pa.remote,
                               po->pppoe_ifindex);

-       if (po->pppoe_dev) {
-               dev_put(po->pppoe_dev);
-               po->pppoe_dev = NULL;
-       }
+        if (po->pppoe_dev) {
+                dev_put(po->pppoe_dev);
+                po->pppoe_dev = NULL;
+        }

       write_unlock_bh(&pn->hash_lock);
+        put_net(net);

-out:
       sock_orphan(sk);
       sock->sk = NULL;

@@ -625,8 +628,9 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
       struct sock *sk = sock->sk;
       struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
       struct pppox_sock *po = pppox_sk(sk);
-       struct net_device *dev;
       struct pppoe_net *pn;
+       struct net_device *dev = NULL;
+        struct net *net = NULL;
       int error;

       lock_sock(sk);
@@ -653,10 +657,12 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
       if (stage_session(po->pppoe_pa.sid)) {
               pppox_unbind_sock(sk);
               if (po->pppoe_dev) {
-                       pn = pppoe_pernet(dev_net(po->pppoe_dev));
+                        struct net *old = dev_net(po->pppoe_dev);
+                       pn = pppoe_pernet(old);
                       delete_item(pn, po->pppoe_pa.sid,
                               po->pppoe_pa.remote, po->pppoe_ifindex);
                       dev_put(po->pppoe_dev);
+                        put_net(old);
               }
               memset(sk_pppox(po) + 1, 0,
                      sizeof(struct pppox_sock) - sizeof(struct sock));
@@ -666,13 +672,17 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
       /* Re-bind in session stage only */
       if (stage_session(sp->sa_addr.pppoe.sid)) {
               error = -ENODEV;
-               dev = dev_get_by_name(sock_net(sk), sp->sa_addr.pppoe.dev);
+                net = maybe_get_net(dev_net(dev));
+                if (!net)
+                        goto end;
+
+               dev = dev_get_by_name(net, sp->sa_addr.pppoe.dev);
               if (!dev)
-                       goto end;
+                       goto err_put_net;

               po->pppoe_dev = dev;
               po->pppoe_ifindex = dev->ifindex;
-               pn = pppoe_pernet(dev_net(dev));
+               pn = pppoe_pernet(net);
               write_lock_bh(&pn->hash_lock);
               if (!(dev->flags & IFF_UP)) {
                       write_unlock_bh(&pn->hash_lock);
@@ -707,6 +717,10 @@ static int pppoe_connect(struct socket *sock,
struct sockaddr *uservaddr,
 end:
       release_sock(sk);
       return error;
+err_put_net:
+        if (net) {
+                put_net(net);
+        }
 err_put:
       if (po->pppoe_dev) {
               dev_put(po->pppoe_dev);
--
1.6.3.3


* Re: [PATCHv2 2/4] Implement loss counting on TFRC-SP receiver
From: Leandro Sales @ 2009-10-19 16:04 UTC (permalink / raw)
  To: Gerrit Renker, Ivo Calado, dccp, netdev
In-Reply-To: <20091019052612.GE3366@gerrit.erg.abdn.ac.uk>

Hi Gerrit,

On Mon, Oct 19, 2009 at 2:26 AM, Gerrit Renker <gerrit@erg.abdn.ac.uk> wrote:
>
> | --- dccp_tree_work03.orig/net/dccp/ccids/lib/packet_history_sp.c      2009-10-08 22:58:21.418908270 -0300
> | +++ dccp_tree_work03/net/dccp/ccids/lib/packet_history_sp.c   2009-10-08 22:59:07.442411383 -0300
> | @@ -243,6 +243,7 @@
> |  {
> |       u64 s0 = tfrc_rx_hist_loss_prev(h)->tfrchrx_seqno,
> |           s1 = tfrc_rx_hist_entry(h, 1)->tfrchrx_seqno,
> | +         n1 = tfrc_rx_hist_entry(h, 1)->tfrchrx_ndp,
> |           s2 = tfrc_rx_hist_entry(h, 2)->tfrchrx_seqno,
> |           s3 = DCCP_SKB_CB(skb)->dccpd_seq;
> I have removed the old definition of n1, which was further below and which caused this warning.
>
> net/dccp/ccids/lib/packet_history_sp.c:276:7: warning: symbol 'n1' shadows an earlier
> net/dccp/ccids/lib/packet_history_sp.c:247:6: originally declared here
>
>

Well done!

> I thought again about the earlier suggestion to make 'num_losses' u64. Since li_losses sums the values
> stored in num_losses, it needs to have the same size (currently it is u32). But then another thought is
> that if there are so many losses that u32 overflows, then the performance is so bad anyway that it is
> better to turn off the receiver. Hence I have reverted it to u32, as per your original patch.
>

OK

> Please find attached a patch of the changes I made. As per posting, I have separated out the dccp.h part,
> since it is also useful in general.

OK, agreed!

Thank you,

BR,
Leandro.


* Re: PATCH: Network Device Naming mechanism and policy
From: Bryan Kadzban @ 2009-10-19 16:14 UTC (permalink / raw)
  To: Narendra_K
  Cc: dannf, bhutchings, netdev, linux-hotplug, Matt_Domsch,
	Jordan_Hargrave, Charles_Rose
In-Reply-To: <EDA0A4495861324DA2618B4C45DCB3EE5895AF@blrx3m08.blr.amer.dell.com>


Narendra_K@Dell.com wrote:
>>>>> And how would the regular file look like in terms of holding
>>>>>  ifindex of the interface, which can be passed to
>>>>> libnetdevname.
>>>> I can't think of anything we need to store in the regular file.
>>>> If we have the kernel name for the device, we can look up the
>>>> ifindex in /sys. Correct me if I'm wrong, but storing it
>>>> ourselves seems redundant.
>>> But the name of a netdev can change whereas its ifindex never
>>> does. Identifying netdevs by name would require additional work
>>> to update the links when a netdev is renamed and would still be
>>> prone to race conditions.  This is why Narendra and Matt were
>>> proposing to store the ifindex in the node all along...
>> Matt, Ben and I talked about a few other possibilities on IRC. The
>> one I like the most at the moment is an idea Ben had to creat dummy
>> files named after the ifindex. Then, use symlinks for the kernel
>> name and the various by-$property subdirectories. This means the
>> KOBJ events will need to expose the ifindex.
>> 
> 
> I suppose the KOBJ events already expose the ifindex of a network 
> interface. The file "/sys/class/net/ethN/uevent" contains
> INTERFACE=ethN and IFINDEX=n already. But it looks like udev doesn't
> use it in any way.

Right; it could simply do the equivalent of:

touch /dev/netdev/$env{IFINDEX}

instead of its normal mknod(2), and then do normal SYMLINK processing.
That last part is what would link /dev/netdev/by-name/$env{INTERFACE} to
that device, along with /dev/netdev/by-mac/*, /dev/netdev/by-path/*,
etc., etc., in as many different ways as people want to add rules.

(Or /dev/net/by-* instead of netdev; I'm mostly ambivalent about the
first-level directory under /dev.  Looks like libnetdevname requires
/dev/netdev though.)

> For example, with the kernel patch the "/sys/class/net/ethN/uevent" 
> contains in addition to the above details, MAJOR=M and MINOR=m which
> the udev knows how to make use of with a rule like
> 
> SUBSYSTEM=="net", KERNEL!="tun", NAME="netdev/%k", MODE="0600".

And if the only point is to get the ifindex via stat(2) on the resulting
symlinks, but people don't like device files, then why not get the
ifindex via readlink(2) (and a bit of string parsing, and a strtol(3) or
strtoul(3) call) instead?  :-)




* [PATCH] net: Fix IP_MULTICAST_IF
From: Eric Dumazet @ 2009-10-19 16:41 UTC (permalink / raw)
  To: David S. Miller; +Cc: Linux Netdev List

ipv4/ipv6 setsockopt(IP_MULTICAST_IF) have dubious __dev_get_by_index() calls.

This function should be called only with RTNL or dev_base_lock held, or a reader
could see a corrupt hash chain and eventually enter an endless loop.

The fix is to call dev_get_by_index()/dev_put() instead.

If this happens to be performance critical, we could define a new dev_exist_by_index()
function to avoid touching dev refcount.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 net/ipv4/ip_sockglue.c   |    7 +++----
 net/ipv6/ipv6_sockglue.c |    6 +++++-
 2 files changed, 8 insertions(+), 5 deletions(-)

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 0c0b6e3..e982b5c 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -634,17 +634,16 @@ static int do_ip_setsockopt(struct sock *sk, int level,
 				break;
 			}
 			dev = ip_dev_find(sock_net(sk), mreq.imr_address.s_addr);
-			if (dev) {
+			if (dev)
 				mreq.imr_ifindex = dev->ifindex;
-				dev_put(dev);
-			}
 		} else
-			dev = __dev_get_by_index(sock_net(sk), mreq.imr_ifindex);
+			dev = dev_get_by_index(sock_net(sk), mreq.imr_ifindex);
 
 
 		err = -EADDRNOTAVAIL;
 		if (!dev)
 			break;
+		dev_put(dev);
 
 		err = -EINVAL;
 		if (sk->sk_bound_dev_if &&
diff --git a/net/ipv6/ipv6_sockglue.c b/net/ipv6/ipv6_sockglue.c
index 14f54eb..4f7aaf6 100644
--- a/net/ipv6/ipv6_sockglue.c
+++ b/net/ipv6/ipv6_sockglue.c
@@ -496,13 +496,17 @@ done:
 			goto e_inval;
 
 		if (val) {
+			struct net_device *dev;
+
 			if (sk->sk_bound_dev_if && sk->sk_bound_dev_if != val)
 				goto e_inval;
 
-			if (__dev_get_by_index(net, val) == NULL) {
+			dev = dev_get_by_index(net, val);
+			if (!dev) {
 				retv = -ENODEV;
 				break;
 			}
+			dev_put(dev);
 		}
 		np->mcast_oif = val;
 		retv = 0;
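
For context, a minimal user-space sketch of how a caller reaches the IPv4
path touched above: setsockopt(IP_MULTICAST_IF) with a struct ip_mreqn whose
imr_ifindex is non-zero.  The helper name is made up for illustration and
error handling is left to the caller.

```c
#define _GNU_SOURCE		/* for struct ip_mreqn with strict -std= flags */
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>

/* Select the outgoing multicast interface of an IPv4 socket by ifindex.
 * Returns 0 on success, -1 on error with errno set by setsockopt(2). */
static int set_mcast_if_by_index(int fd, unsigned int ifindex)
{
	struct ip_mreqn mreq;

	memset(&mreq, 0, sizeof(mreq));	/* address fields stay INADDR_ANY */
	mreq.imr_ifindex = (int)ifindex;
	return setsockopt(fd, IPPROTO_IP, IP_MULTICAST_IF, &mreq, sizeof(mreq));
}
```

With a non-zero imr_ifindex the kernel looks the device up by index, which
is the lookup this patch moves from __dev_get_by_index() to
dev_get_by_index()/dev_put().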


* Re: [PATCH 1/4 v3] net: Introduce sk_tx_queue_mapping
From: Eric Dumazet @ 2009-10-19 16:45 UTC (permalink / raw)
  To: Krishna Kumar; +Cc: davem, netdev, herbert
In-Reply-To: <20091018130740.3960.96469.sendpatchset@localhost.localdomain>

Krishna Kumar wrote:
> From: Krishna Kumar <krkumar2@in.ibm.com>
> 
> Introduce sk_tx_queue_mapping; and functions that set, test and
> get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
> cache is set/reset, and in socket alloc. Setting txq to -1 and
> using valid txq=<0 to n-1> allows the tx path to use the value
> of sk_tx_queue_mapping directly instead of subtracting 1 on every
> tx.
> 
> Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
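
The encoding described in the changelog can be sketched in plain C as
follows.  The struct and helper names are hypothetical stand-ins, not the
ones introduced by the patch; the point is only that -1 means "no queue
recorded", so queue 0 is a valid value and the tx path can use the stored
number directly.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the socket field discussed above:
 * -1 means "nothing recorded", 0..n-1 is a valid tx queue index. */
struct demo_sock {
	int tx_queue_mapping;
};

/* Called on socket alloc and whenever the dst cache is set or reset. */
static void demo_tx_queue_clear(struct demo_sock *sk)
{
	sk->tx_queue_mapping = -1;
}

static void demo_tx_queue_set(struct demo_sock *sk, int txq)
{
	sk->tx_queue_mapping = txq;
}

static bool demo_tx_queue_recorded(const struct demo_sock *sk)
{
	return sk->tx_queue_mapping >= 0;
}

/* No "- 1" adjustment is needed on the transmit path. */
static int demo_tx_queue_get(const struct demo_sock *sk)
{
	return sk->tx_queue_mapping;
}
```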


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-19 17:12 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910190905x382bfc23w2987c84aa0837609@mail.gmail.com>

Michal Ostrowski wrote:
> Here's a bigger patch that just gets rid of flush_lock altogether.
> 
> We were seeing oopses due to net namespaces going away while we were using
> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
> counts properly.
> 
> Fixing this requires that adding and removing entries to the per-net hash-table
> requires incrementing and decrementing the ref count.  This also allows us to
> get rid of the flush_lock since we can now depend on the existence of
> "pn->hash_lock".
> 
> We also have to be careful when flushing devices that removal of a hash table
> entry may bring the net namespace refcount to 0.
> 

Your patch is mangled (tabs converted to spaces),

and I don't believe the namespace refcount can reach 0 inside pppoe_flush_dev();
that would be a bug in the core network code.




* Re: [PATCH] AF_UNIX: Fix deadlock on connecting to shutdown socket
From: Jarek Poplawski @ 2009-10-19 18:07 UTC (permalink / raw)
  To: David Miller
  Cc: tomoki.sekiyama.qu, linux-kernel, netdev, alan, satoshi.oshima.fk,
	hidehiro.kawai.ez, hideo.aoki.tk
In-Reply-To: <20091019.061459.193694892.davem@davemloft.net>

On Mon, Oct 19, 2009 at 06:14:59AM -0700, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Mon, 19 Oct 2009 11:57:13 +0000
> 
> > Isn't the shutdown call expected to change sk_state to TCP_CLOSE?
> 
> No, because the send side is still up and operational, it's
> only a half duplex close.

OK, thanks for the explanation,
Jarek P.


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Michal Ostrowski @ 2009-10-19 18:07 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADC9DE2.5010308@gmail.com>


On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Michal Ostrowski wrote:
>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>
>> We were seeing oopses due to net namespaces going away while we were using
>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>> counts properly.
>>
>> Fixing this requires that adding and removing entries to the per-net hash-table
>> requires incrementing and decrementing the ref count.  This also allows us to
>> get rid of the flush_lock since we can now depend on the existence of
>> "pn->hash_lock".
>>
>> We also have to be careful when flushing devices that removal of a hash table
>> entry may bring the net namespace refcount to 0.
>>
>
> Your patch is mangled (tabulation -> white spaces),

Patch mangling was due to mailer interactions, I'll attach a clean
version here, no more inlining.

>
> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
> it would be a bug from core network code.
>

From the original oops I was able to deduce that the namespace somehow
managed to get destroyed during the interval where we dropped locks.
If that's not due to the release_sock() call in pppoe_flush_dev()
triggering a cleanup, then I'd have to assume that it's due to a
secondary actor closing the socket in parallel, but that in turn would
point to issues with the flush_lock.  Having said that, the thrust of
this patch remains valid; it just means I don't need to inc the ref
count in pppoe_flush_dev().

Do you agree?

--
Michal Ostrowski
mostrows@gmail.com

[-- Attachment #2: 0001-PPPoE-Fix-ref-counts-on-net-namespaces.patch --]
[-- Type: application/octet-stream, Size: 5483 bytes --]

diff --git a/drivers/net/pppoe.c b/drivers/net/pppoe.c
index 7cbf6f9..60bbeb2 100644
--- a/drivers/net/pppoe.c
+++ b/drivers/net/pppoe.c
@@ -111,9 +111,6 @@ struct pppoe_net {
 	rwlock_t hash_lock;
 };
 
-/* to eliminate a race btw pppoe_flush_dev and pppoe_release */
-static DEFINE_SPINLOCK(flush_lock);
-
 /*
  * PPPoE could be in the following stages:
  * 1) Discovery stage (to obtain remote MAC and Session ID)
@@ -303,48 +300,49 @@ static void pppoe_flush_dev(struct net_device *dev)
 	write_lock_bh(&pn->hash_lock);
 	for (i = 0; i < PPPOE_HASH_SIZE; i++) {
 		struct pppox_sock *po = pn->hash_table[i];
+		struct sock *sk;
 
-		while (po != NULL) {
-			struct sock *sk;
-			if (po->pppoe_dev != dev) {
-				po = po->next;
-				continue;
-			}
-			sk = sk_pppox(po);
-			spin_lock(&flush_lock);
-			po->pppoe_dev = NULL;
-			spin_unlock(&flush_lock);
-			dev_put(dev);
-
-			/* We always grab the socket lock, followed by the
-			 * hash_lock, in that order.  Since we should
-			 * hold the sock lock while doing any unbinding,
-			 * we need to release the lock we're holding.
-			 * Hold a reference to the sock so it doesn't disappear
-			 * as we're jumping between locks.
-			 */
+		while (po && po->pppoe_dev != dev) {
+			po = po->next;
+		}
 
-			sock_hold(sk);
+		if (po == NULL) {
+			continue;
+		}
 
-			write_unlock_bh(&pn->hash_lock);
-			lock_sock(sk);
+		sk = sk_pppox(po);
 
-			if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
-				pppox_unbind_sock(sk);
-				sk->sk_state = PPPOX_ZOMBIE;
-				sk->sk_state_change(sk);
-			}
+		if (po->pppoe_dev) {
+			dev_put(po->pppoe_dev);
+			po->pppoe_dev = NULL;
+		}
 
-			release_sock(sk);
-			sock_put(sk);
+		/* We always grab the socket lock, followed by the hash_lock,
+		 * in that order.  Since we should hold the sock lock while
+		 * doing any unbinding, we need to release the lock we're
+		 * holding.  Hold a reference to the sock so it doesn't
+		 * disappear as we're jumping between locks.
+		 */
 
-			/* Restart scan at the beginning of this hash chain.
-			 * While the lock was dropped the chain contents may
-			 * have changed.
-			 */
-			write_lock_bh(&pn->hash_lock);
-			po = pn->hash_table[i];
+		sock_hold(sk);
+		write_unlock_bh(&pn->hash_lock);
+		lock_sock(sk);
+
+		if (sk->sk_state & (PPPOX_CONNECTED | PPPOX_BOUND)) {
+			pppox_unbind_sock(sk);
+			sk->sk_state = PPPOX_ZOMBIE;
+			sk->sk_state_change(sk);
 		}
+
+		release_sock(sk);
+		sock_put(sk);
+
+		/* Restart the process from the start of the current hash
+		 * chain. We dropped locks so the world may have changed from
+		 * underneath us.
+		 */
+		write_lock_bh(&pn->hash_lock);
+		po = pn->hash_table[i];
 	}
 	write_unlock_bh(&pn->hash_lock);
 }
@@ -561,6 +559,7 @@ static int pppoe_release(struct socket *sock)
 	struct sock *sk = sock->sk;
 	struct pppox_sock *po;
 	struct pppoe_net *pn;
+	struct net *net = NULL;
 
 	if (!sk)
 		return 0;
@@ -576,19 +575,8 @@ static int pppoe_release(struct socket *sock)
 	/* Signal the death of the socket. */
 	sk->sk_state = PPPOX_DEAD;
 
-	/*
-	 * pppoe_flush_dev could lead to a race with
-	 * this routine so we use flush_lock to eliminate
-	 * such a case (we only need per-net specific data)
-	 */
-	spin_lock(&flush_lock);
-	po = pppox_sk(sk);
-	if (!po->pppoe_dev) {
-		spin_unlock(&flush_lock);
-		goto out;
-	}
-	pn = pppoe_pernet(dev_net(po->pppoe_dev));
-	spin_unlock(&flush_lock);
+	net = sock_net(sk);
+	pn = pppoe_pernet(net);
 
 	/*
 	 * protect "po" from concurrent updates
@@ -607,8 +595,8 @@ static int pppoe_release(struct socket *sock)
 	}
 
 	write_unlock_bh(&pn->hash_lock);
+	put_net(net);
 
-out:
 	sock_orphan(sk);
 	sock->sk = NULL;
 
@@ -625,8 +613,9 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	struct sock *sk = sock->sk;
 	struct sockaddr_pppox *sp = (struct sockaddr_pppox *)uservaddr;
 	struct pppox_sock *po = pppox_sk(sk);
-	struct net_device *dev;
 	struct pppoe_net *pn;
+	struct net_device *dev = NULL;
+	struct net *net = NULL;
 	int error;
 
 	lock_sock(sk);
@@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			put_net(old);
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));
@@ -666,13 +657,17 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	/* Re-bind in session stage only */
 	if (stage_session(sp->sa_addr.pppoe.sid)) {
 		error = -ENODEV;
-		dev = dev_get_by_name(sock_net(sk), sp->sa_addr.pppoe.dev);
-		if (!dev)
+		net = maybe_get_net(dev_net(dev));
+		if (!net)
 			goto end;
 
+		dev = dev_get_by_name(net, sp->sa_addr.pppoe.dev);
+		if (!dev)
+			goto err_put_net;
+
 		po->pppoe_dev = dev;
 		po->pppoe_ifindex = dev->ifindex;
-		pn = pppoe_pernet(dev_net(dev));
+		pn = pppoe_pernet(net);
 		write_lock_bh(&pn->hash_lock);
 		if (!(dev->flags & IFF_UP)) {
 			write_unlock_bh(&pn->hash_lock);
@@ -707,6 +702,10 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 end:
 	release_sock(sk);
 	return error;
+err_put_net:
+	if (net)
+		put_net(net);
+
 err_put:
 	if (po->pppoe_dev) {
 		dev_put(po->pppoe_dev);


* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Eric Dumazet @ 2009-10-19 18:44 UTC (permalink / raw)
  To: Michal Ostrowski
  Cc: Cyrill Gorcunov, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <e6d1cecd0910191107h899a4ffs588f2413093dfb4b@mail.gmail.com>

Michal Ostrowski wrote:
> On Mon, Oct 19, 2009 at 12:12 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> Michal Ostrowski wrote:
>>> Here's a bigger patch that just gets rid of flush_lock altogether.
>>>
>>> We were seeing oopses due to net namespaces going away while we were using
>>> them, which turns out is simply due to the fact that pppoew wasn't claiming ref
>>> counts properly.
>>>
>>> Fixing this requires that adding and removing entries to the per-net hash-table
>>> requires incrementing and decrementing the ref count.  This also allows us to
>>> get rid of the flush_lock since we can now depend on the existence of
>>> "pn->hash_lock".
>>>
>>> We also have to be careful when flushing devices that removal of a hash table
>>> entry may bring the net namespace refcount to 0.
>>>
>> Your patch is mangled (tabulation -> white spaces),
> 
> Patch mangling was due to mailer interactions, I'll attach a clean
> version here, no more inlining.
> 
>> and I dont believe namespace refcount can reach 0 inside pppoe_flush_dev(),
>> it would be a bug from core network code.
>>
> 
> From the original oops I was able to deduce that the namespace somehow
> managed to get destroyed during the interval where we dropped locks.
> If that's not due to the release_sock() call in pppoe_flush_dev()
> triggering a cleanup then I'd have to assume that that it's due to a
> secondary actor closing the socket in parallel, but that in turn would
> point to issues with the flush_lock.  Having said that the thrust of
> this patch remains valid; it just means I don't need to inc the ref
> count in pppoe_flush_dev().
> 
> Do you agree?
> 

Not really :)

I don't believe you should care about the namespace, or mess with its refcount, at all.

Please don't use maybe_get_net(): this function should never be used in drivers/net.

You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if it
ever triggers, the cause is not pppoe.


 	lock_sock(sk);
@@ -653,10 +642,12 @@ static int pppoe_connect(struct socket *sock, struct sockaddr *uservaddr,
 	if (stage_session(po->pppoe_pa.sid)) {
 		pppox_unbind_sock(sk);
 		if (po->pppoe_dev) {
-			pn = pppoe_pernet(dev_net(po->pppoe_dev));
+			struct net *old = dev_net(po->pppoe_dev);
+			pn = pppoe_pernet(old);
 			delete_item(pn, po->pppoe_pa.sid,
 				po->pppoe_pa.remote, po->pppoe_ifindex);
 			dev_put(po->pppoe_dev);
+			put_net(old);
 		}
 		memset(sk_pppox(po) + 1, 0,
 		       sizeof(struct pppox_sock) - sizeof(struct sock));


There is still a race here, since you do a dev_put(po->pppoe_dev) without any lock held.

So pppoe_flush_dev() can run concurrently and call dev_put(po->pppoe_dev) at the same time.

In fact pppoe_flush_dev() can change po->pppoe_dev at any time, so you should check
every use of po->pppoe_dev in the code and verify that appropriate locking is done.

pppoe_rcv_core() is not safe
pppoe_ioctl() is not safe
pppoe_sendmsg() is not safe
__pppoe_xmit() is not safe



* Re: kernel panic in latest vanilla stable, while using nameif with "alive" pppoe interfaces
From: Cyrill Gorcunov @ 2009-10-19 19:29 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Michal Ostrowski, Denys Fedoryschenko, netdev, linux-ppp, paulus,
	mostrows
In-Reply-To: <4ADCB3A4.8060408@gmail.com>

[Eric Dumazet - Mon, Oct 19, 2009 at 08:44:52PM +0200]
...
| 
| Not really :)
| 
| I dont believe you should care of namespace, and/or mess with its refcount at all.
| 
| Please dont use maybe_get_net() : This function should not ever be used in drivers/net
| 
| You can add a BUG_ON(dev_net(xxxx)->count <= 0) if you really want, but if this
| assertion is false, this is not because of pppoe.
| 
...
| So pppoe_flush_dev() can run concurently and dev_put(po->ppoe_dev) at same time.
| 
| In fact pppoe_flush_dev() can change po->ppoe_dev anytime, so you should check
| all occurences of po->ppoe_dev use in the code and check if appropriate locking is done.
| 
| pppoe_rcv_core() is not safe
| pppoe_ioctl() is not safe
| pppoe_sendmsg() is not safe
| __pppoe_xmit() is not safe
| 

Sigh... it seems so (which is mostly my fault, not Michal's). Every time we touch
pppoe_dev we should dev_hold() it, and dev_put() it only once we are done with it.
The asynchronous nature of the notifier seems to be the key here.
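
As a user-space illustration of that rule (this is not pppoe code; every
name below is hypothetical, and a pthread mutex stands in for whichever lock
ends up protecting pppoe_dev): a reader takes its own reference while
holding the lock, and the flusher detaches the pointer under the same lock
before dropping the socket's reference.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>

struct demo_dev {
	atomic_int refcnt;
};

struct demo_sock {
	pthread_mutex_t lock;	/* protects ->dev */
	struct demo_dev *dev;	/* owns one reference while non-NULL */
};

static void demo_dev_hold(struct demo_dev *dev)
{
	atomic_fetch_add(&dev->refcnt, 1);
}

static void demo_dev_put(struct demo_dev *dev)
{
	atomic_fetch_sub(&dev->refcnt, 1);
}

/* Reader side: returns the device with an extra reference held, or NULL.
 * The caller must demo_dev_put() the result when it is done with it. */
static struct demo_dev *demo_sock_get_dev(struct demo_sock *po)
{
	struct demo_dev *dev;

	pthread_mutex_lock(&po->lock);
	dev = po->dev;
	if (dev)
		demo_dev_hold(dev);
	pthread_mutex_unlock(&po->lock);
	return dev;
}

/* Flush side: detach the pointer under the lock, then drop the
 * reference that the socket itself was holding. */
static void demo_sock_flush_dev(struct demo_sock *po)
{
	struct demo_dev *dev;

	pthread_mutex_lock(&po->lock);
	dev = po->dev;
	po->dev = NULL;
	pthread_mutex_unlock(&po->lock);
	if (dev)
		demo_dev_put(dev);
}
```

A reader that obtained the device before the flush keeps a valid reference
until its own put, so the flush cannot pull the device out from under it.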

	-- Cyrill

