From mboxrd@z Thu Jan 1 00:00:00 1970
From: Mike Galbraith
Subject: [patch] Re: qlge driver corrupting kernel memory
Date: Sun, 13 May 2012 12:10:39 +0200
Message-ID: <1336903839.7390.13.camel@marge.simpson.net>
References: <1336474818.21924.94.camel@marge.simpson.net> <20120508120748.GA3504@oc1711230544.ibm.com> <1336736301.7361.144.camel@marge.simpson.net>
Mime-Version: 1.0
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: 7bit
Cc: netdev
To: Thadeu Lima de Souza Cascardo
Return-path:
Received: from cantor2.suse.de ([195.135.220.15]:46755 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751684Ab2EMKKs (ORCPT ); Sun, 13 May 2012 06:10:48 -0400
In-Reply-To: <1336736301.7361.144.camel@marge.simpson.net>
Sender: netdev-owner@vger.kernel.org
List-ID:

On Fri, 2012-05-11 at 13:38 +0200, Mike Galbraith wrote:
> On Tue, 2012-05-08 at 09:07 -0300, Thadeu Lima de Souza Cascardo wrote:
> > On Tue, May 08, 2012 at 01:00:18PM +0200, Mike Galbraith wrote:
> > > Greetings network wizards,
> > >
> > > $subject is happening in an 2.6.32 enterprise kernel with the driver
> > > updated to what looks to me to be 2.6.38 or so.
> > >
> > > Allegedly, IFF boxen are running dual CNAs with storage and LAN sharing
> > > a port, $subject happens fairly regularly. Rummaging in crashdumps
> > > seems to show corruption happens because we somehow end up stuffing
> > > loads of frags into skb_shared_info, scribbling all over the place.
> > >
> > > Before I proceed, what I know about skbs can be found here..
> > >
> > > http://vger.kernel.org/~davem/skb_data.html
> > >
> > > ..and that's the sum and total ;-)
> > >
> > > I guess the first thing I should ask is whether anyone has seen such
> > > scribbling with this driver. Known issue would be a case of happiness,
> > > but I doubt that will be the case from searching, so onward.
> > >
> >
> > Hi, Mike.
> >
> > From what you describe, I suspect this is related to this fix:
> >
> > http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=782428535e0819b5b7c9825cd3faa2ad37032a70
> >
> > Please, apply and report if that works for you.
>
> Nope, box exploded. I haven't seen a dump yet, but expect it'll be more
> of the same scribbling.

Something else popped up meanwhile. Shortly after tx_ring->q order 5
allocation failure and ql_release_adapter_resources(), BUG: Bad page
state has now arrived twice to muddy the water.

[ 3537.150327] Node 0 DMA: 2*4kB 2*8kB 1*16kB 2*32kB 2*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 360kB
[ 3537.150345] Node 0 DMA32: 318*4kB 144*8kB 89*16kB 17*32kB 3*64kB 1*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 4712kB
[ 3537.150364] 5248 total pagecache pages
[ 3537.150367] 211 pages in swap cache
[ 3537.150372] Swap cache stats: add 1437, delete 1226, find 1641/1752
[ 3537.150377] Free swap = 67109880kB
[ 3537.150381] Total swap = 67111528kB
[ 3537.152314] 73723 pages RAM
[ 3537.152319] 13128 pages reserved
[ 3537.152322] 4910 pages shared
[ 3537.152326] 22795 pages non-shared
[ 3537.152333] qlge 0000:04:00.0: ql_alloc_mem_resources: TX resource allocation failed.
[ 3537.152343] qlge 0000:04:00.0: ql_get_adapter_resources: Unable to allocate memory.
[ 3537.152499] qlge 0000:04:00.0: ql_set_mac_addr_reg: Adding UNICAST address 00:c0:dd:1a:46:ac at index 0 in the CAM.
[ 3537.440237] BUG: Bad page state in process ifdown-dhcp pfn:10940
[ 3537.440244] page:ffffea00003a0600 flags:0020000000000000 count:-1 mapcount:0 mapping:(null) index:0
[ 3537.440249] Pid: 4317, comm: ifdown-dhcp Tainted: G X 2.6.32.54-0.3.1.4242.0.TEST-default #1
[ 3537.440253] Call Trace:
[ 3537.440265] [] dump_trace+0x6c/0x2d0
[ 3537.440271] [] dump_stack+0x69/0x73
[ 3537.440279] [] bad_page+0xe3/0x170
[ 3537.440284] [] prep_new_page+0xab/0x1b0
[ 3537.440289] [] get_page_from_freelist+0x304/0x720
[ 3537.440295] [] __alloc_pages_slowpath+0x11a/0x5f0
[ 3537.440300] [] __alloc_pages_nodemask+0x13a/0x140
[ 3537.440305] [] __get_free_pages+0x9/0x50
[ 3537.440314] [] dup_task_struct+0x42/0x150
[ 3537.440320] [] copy_process+0xb4/0xe50
[ 3537.440324] [] do_fork+0x8c/0x3c0
[ 3537.440331] [] stub_clone+0x13/0x20
[ 3537.441094] DWARF2 unwinder stuck at stub_clone+0x13/0x20
[ 3537.441097]
[ 3537.441098] Leftover inexact backtrace:
[ 3537.441099]
[ 3537.441103] [] ? system_call_fastpath+0x16/0x1b
[ 3537.441107] Disabling lock debugging due to kernel taint
[ 3537.899545] bonding: bond0 is being deleted..

qlge: Fix double pci_free_consistent() upon tx_ring->q allocation failure

Let ql_free_tx_resources() do its job. You are not helping.

Signed-off-by: Mike Galbraith
---
 drivers/net/qlge/qlge_main.c | 10 +++-------
 1 file changed, 3 insertions(+), 7 deletions(-)

--- a/drivers/net/qlge/qlge_main.c
+++ b/drivers/net/qlge/qlge_main.c
@@ -2664,11 +2664,8 @@ static int ql_alloc_tx_resources(struct
 	    pci_alloc_consistent(qdev->pdev, tx_ring->wq_size,
 				 &tx_ring->wq_base_dma);
 
-	if ((tx_ring->wq_base == NULL) ||
-	    tx_ring->wq_base_dma & WQ_ADDR_ALIGN) {
-		QPRINTK(qdev, IFUP, ERR, "tx_ring alloc failed.\n");
-		return -ENOMEM;
-	}
+	if ((tx_ring->wq_base == NULL) || tx_ring->wq_base_dma & WQ_ADDR_ALIGN)
+		goto err;
 	tx_ring->q =
 	    kmalloc(tx_ring->wq_len * sizeof(struct tx_ring_desc), GFP_KERNEL);
 	if (tx_ring->q == NULL)
@@ -2676,8 +2673,7 @@ static int ql_alloc_tx_resources(struct
 
 	return 0;
 err:
-	pci_free_consistent(qdev->pdev, tx_ring->wq_size,
-			    tx_ring->wq_base, tx_ring->wq_base_dma);
+	QPRINTK(qdev, IFUP, ERR, "tx_ring alloc failed.\n");
 	return -ENOMEM;
 }
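
The ownership rule behind the patch can be sketched in userspace C. This
is only an illustration, not the driver code: struct ring, ring_alloc(),
ring_free(), ring_setup() and the frees_seen counter are hypothetical
stand-ins, with malloc()/free() in place of pci_alloc_consistent()/
pci_free_consistent() and kmalloc()/kfree(). It assumes the caller's
cleanup routine frees whatever is still set and clears the pointers, so
the allocator's error path should only report the failure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for the driver's tx_ring; names are illustrative. */
struct ring {
	void *wq_base;	/* stands in for the pci_alloc_consistent() buffer */
	void *q;	/* stands in for the kmalloc()ed descriptor array */
};

static int frees_seen;	/* counts buffer releases so they can be checked */

/* Single owner of cleanup: free whatever is set, then clear the pointer. */
static void ring_free(struct ring *r)
{
	if (r->wq_base) {
		free(r->wq_base);
		r->wq_base = NULL;
		frees_seen++;
	}
	if (r->q) {
		free(r->q);
		r->q = NULL;
		frees_seen++;
	}
}

/* fail_q != 0 simulates the allocation failure for q. */
static int ring_alloc(struct ring *r, int fail_q)
{
	r->wq_base = malloc(64);
	if (r->wq_base == NULL)
		goto err;

	r->q = fail_q ? NULL : malloc(64);
	if (r->q == NULL)
		goto err;

	return 0;
err:
	/* Report only; the caller's ring_free() releases wq_base once. */
	return -1;
}

/* Caller mirrors the unwind path: on failure, hand everything to ring_free(). */
static int ring_setup(struct ring *r, int fail_q)
{
	int rc;

	assert(r != NULL);
	rc = ring_alloc(r, fail_q);
	if (rc)
		ring_free(r);
	return rc;
}
```

Calling ring_setup(&r, 1) on a zeroed struct ring simulates the
tx_ring->q failure: wq_base is released once by ring_free(), and a
second ring_free() does nothing because the pointers were cleared. Had
ring_alloc() also freed wq_base at err: without clearing it, the
caller's cleanup would free the same buffer a second time.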