2.6.10-bkcurr: major slab corruption preventing booting on ARM

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* 2.6.10-bkcurr: major slab corruption preventing booting on ARM
@ 2005-01-04 14:43 Russell King
  2005-01-04 16:10 ` Russell King
  0 siblings, 1 reply; 5+ messages in thread
From: Russell King @ 2005-01-04 14:43 UTC (permalink / raw)
  To: Linux Kernel List

Hi.

I've had a report from a fellow ARM hacker of their platform not
booting.  After they turned on slab debugging, they saw (pieced
together from a report on IRC):

Freeing init memory: 104K
run_init_process(/bin/bash)
Slab corruption: start=c0010934, len=160
Last user: [<c00adc54>](d_alloc+0x28/0x2d8)

I've just run up 2.6.10-bkcurr on a different ARM platform, and
encountered the following output.  It looks like there's serious
slab corruption issues in these kernels.

I'll dig a little further into the report below to see if there's
anything obvious.

Starting up networking
eth0: link down
eth0: link up, 10Mbps, half-duplex, lpa 0x0021
Starting network services
slab: Internal list corruption detected in cache 'buffer_head'(63), slabp c7912000(16). Hexdump:

000: 00 01 10 00 00 02 20 00 14 01 00 00 14 21 91 c7
010: 10 00 00 00 10 00 00 00 fe ff ff ff fe ff ff ff
020: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
030: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
040: fe ff ff ff fe ff ff ff fe ff ff ff fe ff ff ff
050: fe ff ff ff fe ff ff ff 11 00 00 00 12 00 00 00
060: 13 00 00 00 14 00 00 00 15 00 00 00 16 00 00 00
070: 17 00 00 00 0a 60 6b 6b 19 00 00 00 1a 00 00 00
080: 1b 00 00 00 1c 00 00 00 1d 00 00 00 1e 00 00 00
090: 1f 00 00 00 20 00 00 00 21 00 00 00 22 00 00 00
0a0: 23 00 00 00 24 00 00 00 25 00 00 00 26 00 00 00
0b0: 27 00 00 00 28 00 00 00 29 00 00 00 2a 00 00 00
0c0: 2b 00 00 00 2c 00 00 00 2d 00 00 00 2e 00 00 00
0d0: 2f 00 00 00 30 00 00 00 31 00 00 00 32 00 00 00
0e0: 33 00 00 00 34 00 00 00 35 00 00 00 36 00 00 00
0f0: 37 00 00 00 38 00 00 00 39 00 00 00 3a 00 00 00
kernel BUG at /home/rmk/build/linux-v2.6-local/mm/slab.c:1977!
Unable to handle kernel NULL pointer dereference at virtual address 00000000
pgd = c0004000
[00000000] *pgd=00000000
Internal error: Oops: 817 [#1]
Modules linked in:
CPU: 0
PC is at __bug+0x40/0x54
LR is at 0x1
pc : [<c00263f8>]    lr : [<00000001>]    Not tainted
sp : c03c5ee4  ip : 60000093  fp : c03c5ef4
r10: 00000007  r9 : 00000000  r8 : c7912018
r7 : 00000000  r6 : c039e8e0  r5 : c7912000  r4 : 00000000
r3 : 00000000  r2 : 00000000  r1 : 000012f3  r0 : 00000001
Flags: nZCv  IRQs off  FIQs on  Mode SVC_32  Segment kernel
Control: 5717F  Table: 07A44000  DAC: 00000017
Process events/0 (pid: 3, stack limit = 0xc03c4190)
Stack: (0xc03c5ee4 to 0xc03c6000)
5ee0:          c7912114 c03c5f24 c03c5ef8 c005e66c c00263c8 c0399a28 00000007
5f00: c0399a18 c0399a28 c039e8e0 c023ee68 00000000 c023ee78 c03c5f44 c03c5f28
5f20: c005ef64 c005e5e0 c039e8e0 00000000 c039e950 00000001 c03c5f70 c03c5f48
5f40: c005f020 c005eee4 c038fea8 c023ee88 80000013 c038fea0 00000000 c038fe98
5f60: c005ef88 c03c5fc8 c03c5f74 c0048c14 c005ef98 ffffffff ffffffff 00000001
5f80: 00000000 c0035efc 00010000 00000000 00000000 c038d7c0 c0035efc 00100100
5fa0: 00200200 c03c4000 c03b3f34 c038fe98 c0048a4c fffffffc 00000000 c03c5ff4
5fc0: c03c5fcc c004d148 c0048a5c ffffffff ffffffff 00000000 00000000 00000000
5fe0: 00000000 00000000 00000000 c03c5ff8 c003b7b8 c004d0d4 00000000 00000000
Backtrace:
[<c00263b8>] (__bug+0x0/0x54) from [<c005e66c>] (free_block+0x9c/0x18c)
 r4 = C7912114
[<c005e5d0>] (free_block+0x0/0x18c) from [<c005ef64>] (drain_array_locked+0x90/0xb4)
[<c005eed4>] (drain_array_locked+0x0/0xb4) from [<c005f020>] (cache_reap+0x98/0x208)
 r7 = 00000001  r6 = C039E950  r5 = 00000000  r4 = C039E8E0
[<c005ef88>] (cache_reap+0x0/0x208) from [<c0048c14>] (worker_thread+0x1c8/0x258)
[<c0048a4c>] (worker_thread+0x0/0x258) from [<c004d148>] (kthread+0x84/0xb0)
[<c004d0c4>] (kthread+0x0/0xb0) from [<c003b7b8>] (do_exit+0x0/0x408)
 r8 = 00000000  r7 = 00000000  r6 = 00000000  r5 = 00000000
 r4 = 00000000
Code: 1b004cba e59f0014 eb004cb8 e3a03000 (e5833000)

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: 2.6.10-bkcurr: major slab corruption preventing booting on ARM
  2005-01-04 14:43 2.6.10-bkcurr: major slab corruption preventing booting on ARM Russell King
@ 2005-01-04 16:10 ` Russell King
  2005-01-04 17:18   ` Ben Dooks
  2005-01-04 17:21   ` Russell King
  0 siblings, 2 replies; 5+ messages in thread
From: Russell King @ 2005-01-04 16:10 UTC (permalink / raw)
  To: Linux Kernel List

On Tue, Jan 04, 2005 at 02:43:50PM +0000, Russell King wrote:
> I've had a report from a fellow ARM hacker of their platform not
> booting.  After they turned on slab debugging, they saw (pieced
> together from a report on IRC):
> 
> Freeing init memory: 104K
> run_init_process(/bin/bash)
> Slab corruption: start=c0010934, len=160
> Last user: [<c00adc54>](d_alloc+0x28/0x2d8)
> 
> I've just run up 2.6.10-bkcurr on a different ARM platform, and
> encountered the following output.  It looks like there's serious
> slab corruption issues in these kernels.
> 
> I'll dig a little further into the report below to see if there's
> anything obvious.

Ok, reverting the pud_t patch fixes both these problems (the exact
patch can be found at: http://www.home.arm.linux.org.uk/~rmk/misc/bk4-bk5
Note that this is not a plain bk4-bk5 patch, but just the pud_t
changes brought forward to bk6 or there abouts.)

So, something in the 4 level page table patches is causing random
scribbling in kernel memory.

-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: 2.6.10-bkcurr: major slab corruption preventing booting on ARM
  2005-01-04 16:10 ` Russell King
@ 2005-01-04 17:18   ` Ben Dooks
  2005-01-04 17:21   ` Russell King
  1 sibling, 0 replies; 5+ messages in thread
From: Ben Dooks @ 2005-01-04 17:18 UTC (permalink / raw)
  To: Linux Kernel List

On Tue, Jan 04, 2005 at 04:10:49PM +0000, Russell King wrote:
> On Tue, Jan 04, 2005 at 02:43:50PM +0000, Russell King wrote:
> > I've had a report from a fellow ARM hacker of their platform not
> > booting.  After they turned on slab debugging, they saw (pieced
> > together from a report on IRC):
> > 
> > Freeing init memory: 104K
> > run_init_process(/bin/bash)
> > Slab corruption: start=c0010934, len=160
> > Last user: [<c00adc54>](d_alloc+0x28/0x2d8)
> > 
> > I've just run up 2.6.10-bkcurr on a different ARM platform, and
> > encountered the following output.  It looks like there's serious
> > slab corruption issues in these kernels.
> > 
> > I'll dig a little further into the report below to see if there's
> > anything obvious.
> 
> Ok, reverting the pud_t patch fixes both these problems (the exact
> patch can be found at: http://www.home.arm.linux.org.uk/~rmk/misc/bk4-bk5
> Note that this is not a plain bk4-bk5 patch, but just the pud_t
> changes brought forward to bk6 or there abouts.)
> 
> So, something in the 4 level page table patches is causing random
> scribbling in kernel memory.

I've tried that, and it fixes the problems for me on
the EB2410ITX (ARM9 2410) and the corruption of the initial-ramdisk.

-- 
Ben (ben@fluff.org, http://www.fluff.org/)

  'a smiley only costs 4 bytes'

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: 2.6.10-bkcurr: major slab corruption preventing booting on ARM
  2005-01-04 16:10 ` Russell King
  2005-01-04 17:18   ` Ben Dooks
@ 2005-01-04 17:21   ` Russell King
  2005-01-05  2:00     ` Nick Piggin
  1 sibling, 1 reply; 5+ messages in thread
From: Russell King @ 2005-01-04 17:21 UTC (permalink / raw)
  To: Linux Kernel List; +Cc: Nick Piggin

On Tue, Jan 04, 2005 at 04:10:49PM +0000, Russell King wrote:
> On Tue, Jan 04, 2005 at 02:43:50PM +0000, Russell King wrote:
> > I've had a report from a fellow ARM hacker of their platform not
> > booting.  After they turned on slab debugging, they saw (pieced
> > together from a report on IRC):
> > 
> > Freeing init memory: 104K
> > run_init_process(/bin/bash)
> > Slab corruption: start=c0010934, len=160
> > Last user: [<c00adc54>](d_alloc+0x28/0x2d8)
> > 
> > I've just run up 2.6.10-bkcurr on a different ARM platform, and
> > encountered the following output.  It looks like there's serious
> > slab corruption issues in these kernels.
> > 
> > I'll dig a little further into the report below to see if there's
> > anything obvious.
> 
> Ok, reverting the pud_t patch fixes both these problems (the exact
> patch can be found at: http://www.home.arm.linux.org.uk/~rmk/misc/bk4-bk5
> Note that this is not a plain bk4-bk5 patch, but just the pud_t
> changes brought forward to bk6 or there abouts.)
> 
> So, something in the 4 level page table patches is causing random
> scribbling in kernel memory.

Ok, I've narrowed the problem down to something in the following patch.
Andi Kleen suggests that maybe the ARM FIRST_USER_PGD_NR got broken in
by something here.  Nick, any ideas?

diff -urN linux-2.6.10-bk4/include/linux/mm.h linux-2.6.10-bk5/include/linux/mm.h
--- linux-2.6.10-bk4/include/linux/mm.h	2004-12-24 13:33:50.000000000 -0800
+++ linux-2.6.10-bk5/include/linux/mm.h	2005-01-02 04:55:30.285949371 -0800
@@ -566,7 +566,7 @@
 		struct vm_area_struct *start_vma, unsigned long start_addr,
 		unsigned long end_addr, unsigned long *nr_accounted,
 		struct zap_details *);
-void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr);
+void clear_page_range(struct mmu_gather *tlb, unsigned long addr, unsigned long end);
 int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
 			struct vm_area_struct *vma);
 int zeromap_page_range(struct vm_area_struct *vma, unsigned long from,
diff -urN linux-2.6.10-bk4/mm/memory.c linux-2.6.10-bk5/mm/memory.c
--- linux-2.6.10-bk4/mm/memory.c	2004-12-24 13:34:44.000000000 -0800
+++ linux-2.6.10-bk5/mm/memory.c	2005-01-02 04:55:31.265995181 -0800
@@ -34,6 +34,8 @@
  *
  * 16.07.99  -  Support of BIGMEM added by Gerhard Wichert, Siemens AG
  *		(Gerhard.Wichert@pdb.siemens.de)
+ *
+ * Aug/Sep 2004 Changed to four level page tables (Andi Kleen)
  */
 
 #include <linux/kernel_stat.h>
@@ -98,58 +100,107 @@
  * Note: this doesn't free the actual pages themselves. That
  * has been handled earlier when unmapping all the memory regions.
  */
-static inline void free_one_pmd(struct mmu_gather *tlb, pmd_t * dir)
+static inline void clear_pmd_range(struct mmu_gather *tlb, pmd_t *pmd, unsigned long start, unsigned long end)
 {
 	struct page *page;
 
-	if (pmd_none(*dir))
+	if (pmd_none(*pmd))
 		return;
-	if (unlikely(pmd_bad(*dir))) {
-		pmd_ERROR(*dir);
-		pmd_clear(dir);
+	if (unlikely(pmd_bad(*pmd))) {
+		pmd_ERROR(*pmd);
+		pmd_clear(pmd);
 		return;
 	}
-	page = pmd_page(*dir);
-	pmd_clear(dir);
-	dec_page_state(nr_page_table_pages);
-	tlb->mm->nr_ptes--;
-	pte_free_tlb(tlb, page);
+	if (!(start & ~PMD_MASK) && !(end & ~PMD_MASK)) {
+		page = pmd_page(*pmd);
+		pmd_clear(pmd);
+		dec_page_state(nr_page_table_pages);
+		tlb->mm->nr_ptes--;
+		pte_free_tlb(tlb, page);
+	}
 }
 
-static inline void free_one_pgd(struct mmu_gather *tlb, pgd_t * dir)
+static inline void clear_pud_range(struct mmu_gather *tlb, pud_t *pud, unsigned long start, unsigned long end)
 {
-	int j;
-	pmd_t * pmd;
+	unsigned long addr = start, next;
+	pmd_t *pmd, *__pmd;
 
-	if (pgd_none(*dir))
+	if (pud_none(*pud))
 		return;
-	if (unlikely(pgd_bad(*dir))) {
-		pgd_ERROR(*dir);
-		pgd_clear(dir);
+	if (unlikely(pud_bad(*pud))) {
+		pud_ERROR(*pud);
+		pud_clear(pud);
 		return;
 	}
-	pmd = pmd_offset(dir, 0);
-	pgd_clear(dir);
-	for (j = 0; j < PTRS_PER_PMD ; j++)
-		free_one_pmd(tlb, pmd+j);
-	pmd_free_tlb(tlb, pmd);
+
+	pmd = __pmd = pmd_offset(pud, start);
+	do {
+		next = (addr + PMD_SIZE) & PMD_MASK;
+		if (next > end || next <= addr)
+			next = end;
+		
+		clear_pmd_range(tlb, pmd, addr, next);
+		pmd++;
+		addr = next;
+	} while (addr && (addr < end));
+
+	if (!(start & ~PUD_MASK) && !(end & ~PUD_MASK)) {
+		pud_clear(pud);
+		pmd_free_tlb(tlb, __pmd);
+	}
+}
+
+
+static inline void clear_pgd_range(struct mmu_gather *tlb, pgd_t *pgd, unsigned long start, unsigned long end)
+{
+	unsigned long addr = start, next;
+	pud_t *pud, *__pud;
+
+	if (pgd_none(*pgd))
+		return;
+	if (unlikely(pgd_bad(*pgd))) {
+		pgd_ERROR(*pgd);
+		pgd_clear(pgd);
+		return;
+	}
+
+	pud = __pud = pud_offset(pgd, start);
+	do {
+		next = (addr + PUD_SIZE) & PUD_MASK;
+		if (next > end || next <= addr)
+			next = end;
+		
+		clear_pud_range(tlb, pud, addr, next);
+		pud++;
+		addr = next;
+	} while (addr && (addr < end));
+
+	if (!(start & ~PGDIR_MASK) && !(end & ~PGDIR_MASK)) {
+		pgd_clear(pgd);
+		pud_free_tlb(tlb, __pud);
+	}
 }
 
 /*
- * This function clears all user-level page tables of a process - this
- * is needed by execve(), so that old pages aren't in the way.
+ * This function clears user-level page tables of a process.
  *
  * Must be called with pagetable lock held.
  */
-void clear_page_tables(struct mmu_gather *tlb, unsigned long first, int nr)
+void clear_page_range(struct mmu_gather *tlb, unsigned long start, unsigned long end)
 {
-	pgd_t * page_dir = tlb->mm->pgd;
-
-	page_dir += first;
-	do {
-		free_one_pgd(tlb, page_dir);
-		page_dir++;
-	} while (--nr);
+	unsigned long addr = start, next;
+	unsigned long i, nr = pgd_index(end + PGDIR_SIZE-1) - pgd_index(start);
+	pgd_t * pgd = pgd_offset(tlb->mm, start);
+
+	for (i = 0; i < nr; i++) {
+		next = (addr + PGDIR_SIZE) & PGDIR_MASK;
+		if (next > end || next <= addr)
+			next = end;
+		
+		clear_pgd_range(tlb, pgd, addr, next);
+		pgd++;
+		addr = next;
+	}
 }
 
 pte_t fastcall * pte_alloc_map(struct mm_struct *mm, pmd_t *pmd, unsigned long address)
diff -urN linux-2.6.10-bk4/mm/mmap.c linux-2.6.10-bk5/mm/mmap.c
--- linux-2.6.10-bk4/mm/mmap.c	2004-12-24 13:35:00.000000000 -0800
+++ linux-2.6.10-bk5/mm/mmap.c	2005-01-02 04:55:31.385000743 -0800
@@ -1474,7 +1474,6 @@
 {
 	unsigned long first = start & PGDIR_MASK;
 	unsigned long last = end + PGDIR_SIZE - 1;
-	unsigned long start_index, end_index;
 	struct mm_struct *mm = tlb->mm;
 
 	if (!prev) {
@@ -1499,23 +1498,18 @@
 				last = next->vm_start;
 		}
 		if (prev->vm_end > first)
-			first = prev->vm_end + PGDIR_SIZE - 1;
+			first = prev->vm_end;
 		break;
 	}
 no_mmaps:
 	if (last < first)	/* for arches with discontiguous pgd indices */
 		return;
-	/*
-	 * If the PGD bits are not consecutive in the virtual address, the
-	 * old method of shifting the VA >> by PGDIR_SHIFT doesn't work.
-	 */
-	start_index = pgd_index(first);
-	if (start_index < FIRST_USER_PGD_NR)
-		start_index = FIRST_USER_PGD_NR;
-	end_index = pgd_index(last);
-	if (end_index > start_index) {
-		clear_page_tables(tlb, start_index, end_index - start_index);
-		flush_tlb_pgtables(mm, first & PGDIR_MASK, last & PGDIR_MASK);
+	if (first < FIRST_USER_PGD_NR * PGDIR_SIZE)
+		first = FIRST_USER_PGD_NR * PGDIR_SIZE;
+	/* No point trying to free anything if we're in the same pte page */
+	if ((first & PMD_MASK) < (last & PMD_MASK)) {
+		clear_page_range(tlb, first, last);
+		flush_tlb_pgtables(mm, first, last);
 	}
 }
 
@@ -1844,7 +1838,9 @@
 					~0UL, &nr_accounted, NULL);
 	vm_unacct_memory(nr_accounted);
 	BUG_ON(mm->map_count);	/* This is just debugging */
-	clear_page_tables(tlb, FIRST_USER_PGD_NR, USER_PTRS_PER_PGD);
+	clear_page_range(tlb, FIRST_USER_PGD_NR * PGDIR_SIZE,
+			(TASK_SIZE + PGDIR_SIZE - 1) & PGDIR_MASK);
+	
 	tlb_finish_mmu(tlb, 0, MM_VM_SIZE(mm));
 
 	vma = mm->mmap;


-- 
Russell King
 Linux kernel    2.6 ARM Linux   - http://www.arm.linux.org.uk/
 maintainer of:  2.6 PCMCIA      - http://pcmcia.arm.linux.org.uk/
                 2.6 Serial core

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: 2.6.10-bkcurr: major slab corruption preventing booting on ARM
  2005-01-04 17:21   ` Russell King
@ 2005-01-05  2:00     ` Nick Piggin
  0 siblings, 0 replies; 5+ messages in thread
From: Nick Piggin @ 2005-01-05  2:00 UTC (permalink / raw)
  To: Russell King; +Cc: Linux Kernel List

Russell King wrote:
> On Tue, Jan 04, 2005 at 04:10:49PM +0000, Russell King wrote:
> 
>>On Tue, Jan 04, 2005 at 02:43:50PM +0000, Russell King wrote:
>>
>>>I've had a report from a fellow ARM hacker of their platform not
>>>booting.  After they turned on slab debugging, they saw (pieced
>>>together from a report on IRC):
>>>
>>>Freeing init memory: 104K
>>>run_init_process(/bin/bash)
>>>Slab corruption: start=c0010934, len=160
>>>Last user: [<c00adc54>](d_alloc+0x28/0x2d8)
>>>
>>>I've just run up 2.6.10-bkcurr on a different ARM platform, and
>>>encountered the following output.  It looks like there's serious
>>>slab corruption issues in these kernels.
>>>
>>>I'll dig a little further into the report below to see if there's
>>>anything obvious.
>>
>>Ok, reverting the pud_t patch fixes both these problems (the exact
>>patch can be found at: http://www.home.arm.linux.org.uk/~rmk/misc/bk4-bk5
>>Note that this is not a plain bk4-bk5 patch, but just the pud_t
>>changes brought forward to bk6 or there abouts.)
>>
>>So, something in the 4 level page table patches is causing random
>>scribbling in kernel memory.
> 
> 
> Ok, I've narrowed the problem down to something in the following patch.
> Andi Kleen suggests that maybe the ARM FIRST_USER_PGD_NR got broken in
> by something here.  Nick, any ideas?
> 

I see you've had a fix commited to -bk? Yes that looks like it would
cause the problems you are seeing.

Thanks,
Nick

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2005-01-05  2:02 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2005-01-04 14:43 2.6.10-bkcurr: major slab corruption preventing booting on ARM Russell King
2005-01-04 16:10 ` Russell King
2005-01-04 17:18   ` Ben Dooks
2005-01-04 17:21   ` Russell King
2005-01-05  2:00     ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox