* Buglet in 16G page handling
From: David Gibson @ 2008-09-02  5:05 UTC
To: Jon Tollefson; +Cc: linuxppc-dev, libhugetlbfs-devel

When BenH and I were looking at the new code for handling 16G pages,
we noticed a small bug.  It doesn't actually break anything user
visible, but it's certainly not the way things are supposed to be.
The 16G patches didn't update the huge_pte_offset() and
huge_pte_alloc() functions, which means that the hugepte tables for
16G pages will be allocated much further down the page table tree than
they should be - allocating several levels of page table with a single
entry in them along the way.

The patch below is supposed to fix this, cleaning up the existing
handling of 64k vs 16M pages while it's at it.  However, it needs some
testing.

I've checked that it doesn't break existing 16M support, either with
4k or 64k base pages.  I haven't figured out how to test with 64k
pages yet, at least until the multisize support goes into
libhugetlbfs.  For 16G pages, I just don't have access to a machine
with enough memory to test.  Jon, presumably you must have found such
a machine when you did the 16G page support in the first place.  Do
you still have access, and can you test this patch?

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-09-02 13:39:52.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2008-09-02 14:08:56.000000000 +1000
@@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
 	return 0;
 }
 
-/* Base page size affects how we walk hugetlb page tables */
-#ifdef CONFIG_PPC_64K_PAGES
-#define hpmd_offset(pud, addr, h)	pmd_offset(pud, addr)
-#define hpmd_alloc(mm, pud, addr, h)	pmd_alloc(mm, pud, addr)
-#else
-static inline
-pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
+
+static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
+{
+	if (huge_page_shift(hstate) < PUD_SHIFT)
+		return pud_offset(pgd, addr);
+	else
+		return (pud_t *) pgd;
+}
+static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
+			 struct hstate *hstate)
 {
-	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) < PUD_SHIFT)
+		return pud_alloc(mm, pgd, addr);
+	else
+		return (pud_t *) pgd;
+}
+static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
+{
+	if (huge_page_shift(hstate) < PMD_SHIFT)
 		return pmd_offset(pud, addr);
 	else
 		return (pmd_t *) pud;
 }
-static inline
-pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-		  struct hstate *hstate)
+static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
+			 struct hstate *hstate)
 {
-	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) < PMD_SHIFT)
 		return pmd_alloc(mm, pud, addr);
 	else
 		return (pmd_t *) pud;
 }
-#endif
 
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
@@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct
 
 	pg = pgd_offset(mm, addr);
 	if (!pgd_none(*pg)) {
-		pu = pud_offset(pg, addr);
+		pu = hpud_offset(pg, addr, hstate);
 		if (!pud_none(*pu)) {
 			pm = hpmd_offset(pu, addr, hstate);
 			if (!pmd_none(*pm))
@@ -233,7 +241,7 @@ pte_t *huge_pte_alloc(struct mm_struct *
 	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
-	pu = pud_alloc(mm, pg, addr);
+	pu = hpud_alloc(mm, pg, addr, hstate);
 
 	if (pu) {
 		pm = hpmd_alloc(mm, pu, addr, hstate);

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
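The rule the patch implements is worth spelling out: a hugepte table
should hang off the lowest page-table level whose entries still span
more than one huge page, instead of being pushed down through levels
that would each hold a single entry.  The standalone sketch below
illustrates that rule.  It is illustration only, not kernel code, and
the shift values in it are invented for the example - the real
PMD_SHIFT and PUD_SHIFT depend on the base page size configuration.

#include <stdio.h>

/* Hypothetical level shifts, for illustration only; the real values
 * depend on the base page size configuration. */
#define PMD_SHIFT 24
#define PUD_SHIFT 33

/* Mirrors the decision made by the patch's hpud_ and hpmd_ helpers:
 * a level is only descended if one entry at that level covers more
 * address space than a single huge page; otherwise the hugepte table
 * hangs directly off the entry above. */
static const char *hugepte_home(unsigned int huge_shift)
{
	if (huge_shift < PMD_SHIFT)
		return "walk pgd -> pud -> pmd; hugepte table below the pmd";
	if (huge_shift < PUD_SHIFT)
		return "walk pgd -> pud only; hugepte table in the pmd slot";
	return "no descent; hugepte table directly in the pgd slot";
}

int main(void)
{
	printf("16M pages (shift 24): %s\n", hugepte_home(24));
	printf("16G pages (shift 34): %s\n", hugepte_home(34));
	return 0;
}

With these example shifts a 16M page's hugepte table lands at the pmd
slot while a 16G page's stops at the pgd entry - the short-circuit the
unpatched huge_pte_offset()/huge_pte_alloc() were missing for 16G.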
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Mel Gorman @ 2008-09-02 12:44 UTC
To: Jon Tollefson, libhugetlbfs-devel, linuxppc-dev, Benjamin Herrenschmidt

On (02/09/08 15:05), David Gibson didst pronounce:
> When BenH and I were looking at the new code for handling 16G pages,
> we noticed a small bug.  It doesn't actually break anything user
> visible, but it's certainly not the way things are supposed to be.
> The 16G patches didn't update the huge_pte_offset() and
> huge_pte_alloc() functions, which means that the hugepte tables for
> 16G pages will be allocated much further down the page table tree than
> they should be - allocating several levels of page table with a single
> entry in them along the way.
>
> The patch below is supposed to fix this, cleaning up the existing
> handling of 64k vs 16M pages while it's at it.  However, it needs some
> testing.

Actually, Jon has been hitting an occasional pagetable lock related
problem.  The last theory was that it might be some sort of race but
it's vaguely possible that this is the issue.  Jon?

> I've checked that it doesn't break existing 16M support, either with
> 4k or 64k base pages.  I haven't figured out how to test with 64k
> pages yet, at least until the multisize support goes into
> libhugetlbfs.

Mount a 64K mount point yourself and then set HUGETLB_PATH?

> For 16G pages, I just don't have access to a machine
> with enough memory to test.  Jon, presumably you must have found such
> a machine when you did the 16G page support in the first place.  Do
> you still have access, and can you test this patch?

<snip>

-- 
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Nishanth Aravamudan @ 2008-09-02 16:25 UTC
To: Mel Gorman; +Cc: linuxppc-dev, Jon Tollefson, libhugetlbfs-devel

On 02.09.2008 [13:44:42 +0100], Mel Gorman wrote:
> On (02/09/08 15:05), David Gibson didst pronounce:
<snip>
> > I've checked that it doesn't break existing 16M support, either with
> > 4k or 64k base pages.  I haven't figured out how to test with 64k
> > pages yet, at least until the multisize support goes into
> > libhugetlbfs.
>
> Mount a 64K mount point yourself and then set HUGETLB_PATH?

I don't think this will work, because we don't use fstatfs() to figure
out the pagesize, but instead assume meminfo and the fs are the same
hugepage size (but on power it will always be 16M in meminfo).

Thanks,
Nish
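For reference, hugetlbfs does expose the page size of a particular
mount: statfs()/fstatfs() on a hugetlbfs mount returns that mount's
huge page size in f_bsize.  A minimal sketch of the per-mount
detection Nish describes as missing from libhugetlbfs at the time
might look like this - the mount path below is a placeholder, not a
path assumed to exist:

#include <stdio.h>
#include <sys/vfs.h>

int main(void)
{
	struct statfs sfs;

	/* "/mnt/huge64k" stands in for wherever a hugetlbfs instance
	 * with pagesize=64K was mounted. */
	if (statfs("/mnt/huge64k", &sfs) < 0) {
		perror("statfs");
		return 1;
	}
	/* On hugetlbfs, f_bsize is the huge page size of the mount,
	 * independent of the default size shown in /proc/meminfo. */
	printf("huge page size on this mount: %ld bytes\n",
	       (long)sfs.f_bsize);
	return 0;
}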
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Benjamin Herrenschmidt @ 2008-09-02 21:05 UTC
To: Mel Gorman; +Cc: linuxppc-dev, Jon Tollefson, libhugetlbfs-devel

> Actually, Jon has been hitting an occasional pagetable lock related
> problem.  The last theory was that it might be some sort of race but
> it's vaguely possible that this is the issue.  Jon?

All hugetlbfs ops should be covered by the big PTL except walking...
Can we have more info about the problem ?

Cheers,
Ben.
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Jon Tollefson @ 2008-09-02 22:16 UTC
To: benh; +Cc: Mel Gorman, linuxppc-dev, libhugetlbfs-devel

Benjamin Herrenschmidt wrote:
>> Actually, Jon has been hitting an occasional pagetable lock related
>> problem.  The last theory was that it might be some sort of race but
>> it's vaguely possible that this is the issue.  Jon?
>
> All hugetlbfs ops should be covered by the big PTL except walking...
> Can we have more info about the problem ?
>
> Cheers,
> Ben.

I hit this when running the complete libhugetlbfs test suite (make
check) with base page at 4K and default huge page size at 16G.  It is
on the last test (shm-getraw) when it hits it.  Just running that test
alone has not caused it for me - only when I have run all the tests and
it gets to this one.  Also it doesn't happen every time.  I have tried
to reproduce as well with a 64K base page but haven't seen it happen
there.

BUG: spinlock bad magic on CPU#2, shm-getraw/10359
 lock: f00000000de6e158, .magic: 00000000, .owner: <none>/-1, .owner_cpu: 0
Call Trace:
[c000000285d9b420] [c0000000000110b0] .show_stack+0x78/0x190 (unreliable)
[c000000285d9b4d0] [c0000000000111e8] .dump_stack+0x20/0x34
[c000000285d9b550] [c000000000295d94] .spin_bug+0xb8/0xe0
[c000000285d9b5f0] [c0000000002962d8] ._raw_spin_lock+0x4c/0x1a0
[c000000285d9b690] [c000000000510c60] ._spin_lock+0x5c/0x7c
[c000000285d9b720] [c0000000000d809c] .handle_mm_fault+0x2f0/0x9ac
[c000000285d9b810] [c000000000513688] .do_page_fault+0x444/0x62c
[c000000285d9b950] [c000000000005230] handle_page_fault+0x20/0x5c
--- Exception: 301 at .__clear_user+0x38/0x7c
    LR = .read_zero+0xb0/0x1a8
[c000000285d9bc40] [c0000000002e19e0] .read_zero+0x80/0x1a8 (unreliable)
[c000000285d9bcf0] [c000000000102c00] .vfs_read+0xe0/0x1c8
[c000000285d9bd90] [c00000000010332c] .sys_read+0x54/0x98
[c000000285d9be30] [c0000000000086d4] syscall_exit+0x0/0x40
BUG: spinlock lockup on CPU#2, shm-getraw/10359, f00000000de6e158
Call Trace:
[c000000285d9b4c0] [c0000000000110b0] .show_stack+0x78/0x190 (unreliable)
[c000000285d9b570] [c0000000000111e8] .dump_stack+0x20/0x34
[c000000285d9b5f0] [c0000000002963ec] ._raw_spin_lock+0x160/0x1a0
[c000000285d9b690] [c000000000510c60] ._spin_lock+0x5c/0x7c
[c000000285d9b720] [c0000000000d809c] .handle_mm_fault+0x2f0/0x9ac
[c000000285d9b810] [c000000000513688] .do_page_fault+0x444/0x62c
[c000000285d9b950] [c000000000005230] handle_page_fault+0x20/0x5c
--- Exception: 301 at .__clear_user+0x38/0x7c
    LR = .read_zero+0xb0/0x1a8
[c000000285d9bc40] [c0000000002e19e0] .read_zero+0x80/0x1a8 (unreliable)
[c000000285d9bcf0] [c000000000102c00] .vfs_read+0xe0/0x1c8
[c000000285d9bd90] [c00000000010332c] .sys_read+0x54/0x98
[c000000285d9be30] [c0000000000086d4] syscall_exit+0x0/0x40
BUG: soft lockup - CPU#2 stuck for 61s! [shm-getraw:10359]
Modules linked in: autofs4 binfmt_misc dm_mirror dm_log dm_multipath parport ibmvscsic uhci_hcd ohci_hcd ehci_hcd
irq event stamp: 1423661
hardirqs last  enabled at (1423661): [<c00000000008d954>] .trace_hardirqs_on+0x1c/0x30
hardirqs last disabled at (1423660): [<c00000000008af60>] .trace_hardirqs_off+0x1c/0x30
softirqs last  enabled at (1422710): [<c000000000064f6c>] .__do_softirq+0x19c/0x1c4
softirqs last disabled at (1422705): [<c00000000002943c>] .call_do_softirq+0x14/0x24
NIP: c00000000002569c LR: c0000000002963ac CTR: 8000000000f7cdec
REGS: c000000285d9b330 TRAP: 0901   Not tainted  (2.6.27-rc4-pseries)
MSR: 8000000000009032 <EE,ME,IR,DR>  CR: 88000284  XER: 00000002
TASK = c000000285f18000[10359] 'shm-getraw' THREAD: c000000285d98000 CPU: 2
GPR00: 0000000080000002 c000000285d9b5b0 c0000000008924e0 0000000000000001
GPR04: c000000285f18000 0000000000000070 0000000000000000 0000000000000002
GPR08: 0000000000000000 0003c3c66e8adf66 0000000000000002 0000000000000010
GPR12: 00000000000b4cbd c0000000008d4700
NIP [c00000000002569c] .__delay+0x10/0x38
LR [c0000000002963ac] ._raw_spin_lock+0x120/0x1a0
Call Trace:
[c000000285d9b5b0] [c000000285d9b690] 0xc000000285d9b690 (unreliable)
[c000000285d9b5f0] [c000000000296378] ._raw_spin_lock+0xec/0x1a0
[c000000285d9b690] [c000000000510c60] ._spin_lock+0x5c/0x7c
[c000000285d9b720] [c0000000000d809c] .handle_mm_fault+0x2f0/0x9ac
[c000000285d9b810] [c000000000513688] .do_page_fault+0x444/0x62c
[c000000285d9b950] [c000000000005230] handle_page_fault+0x20/0x5c
--- Exception: 301 at .__clear_user+0x38/0x7c
    LR = .read_zero+0xb0/0x1a8
[c000000285d9bc40] [c0000000002e19e0] .read_zero+0x80/0x1a8 (unreliable)
[c000000285d9bcf0] [c000000000102c00] .vfs_read+0xe0/0x1c8
[c000000285d9bd90] [c00000000010332c] .sys_read+0x54/0x98
[c000000285d9be30] [c0000000000086d4] syscall_exit+0x0/0x40
Instruction dump:
eb41ffd0 eb61ffd8 eb81ffe0 7c0803a6 eba1ffe8 ebc1fff0 ebe1fff8 4e800020
fbe1fff8 f821ffc1 7c3f0b78 7d2c42e6 <48000008> 7c210b78 7c0c42e6 7c090050

[root]# addr2line c0000000000d809c -e /boot/vmlinux.rc4-pseries
/root/src/linux-2.6-rc4/mm/memory.c:2381
[root]# addr2line c000000000513688 -e /boot/vmlinux.rc4-pseries
/root/src/linux-2.6-rc4/arch/powerpc/mm/fault.c:313
[root]# addr2line c00000000010332c -e /boot/vmlinux.rc4-pseries
/root/src/linux-2.6-rc4/fs/read_write.c:334
[root]# addr2line c000000000102c00 -e /boot/vmlinux.rc4-pseries
/root/src/linux-2.6-rc4/fs/read_write.c:257

I have sometimes inserted an strace64 at the point where the test cases
are started and will then see output like the following when it hits
the above point.

...
open("/dev/full", O_RDONLY) = 3
shmget(0x2, 34359738368, IPC_CREAT|SHM_HUGETLB|0600) = 294912
shmat(294912, 0, SHM_RND) = 0x3f800000000
read(3, "\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 17179869184) = 2147479552
---
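Pieced together from the strace output above, a rough standalone
reproducer for this failure mode might look like the sketch below.  It
approximates what shm-getraw appears to be doing and is not the actual
libhugetlbfs test source; the 32G segment size assumes two 16G huge
pages, as in the trace.

#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/ipc.h>
#include <sys/shm.h>

#ifndef SHM_HUGETLB
#define SHM_HUGETLB 04000	/* from <linux/shm.h> */
#endif

#define SEGSIZE 34359738368UL	/* 32G: two 16G huge pages, as traced */

int main(void)
{
	int fd = open("/dev/full", O_RDONLY);	/* reads back zeros */
	int shmid = shmget(0x2, SEGSIZE, IPC_CREAT | SHM_HUGETLB | 0600);
	void *p;
	ssize_t n;

	if (fd < 0 || shmid < 0)
		return 1;
	p = shmat(shmid, NULL, SHM_RND);
	if (p == (void *)-1)
		return 1;
	/* Faulting the segment in while read() copies zeros into it is
	 * where the spinlock trouble above fires (read_zero ->
	 * __clear_user -> page fault).  The trace shows a 16G read
	 * that returns ~2GB, the per-call cap for read(). */
	n = read(fd, p, SEGSIZE / 2);
	printf("read %zd bytes\n", n);
	shmdt(p);
	shmctl(shmid, IPC_RMID, NULL);
	return 0;
}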
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Benjamin Herrenschmidt @ 2008-09-02 22:53 UTC
To: Jon Tollefson; +Cc: Mel Gorman, linuxppc-dev, libhugetlbfs-devel

On Tue, 2008-09-02 at 17:16 -0500, Jon Tollefson wrote:
<snip>
> I hit this when running the complete libhugetlbfs test suite (make
> check) with base page at 4K and default huge page size at 16G.  It is
> on the last test (shm-getraw) when it hits it.  Just running that test
> alone has not caused it for me - only when I have run all the tests and
> it gets to this one.  Also it doesn't happen every time.  I have tried
> to reproduce as well with a 64K base page but haven't seen it happen
> there.

I don't see anything huge pages related in the backtraces which is
interesting ...

Can you get us access to a machine with enough RAM to test the 16G
pages ?

Ben.

<snip>
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Jon Tollefson @ 2008-09-03 14:11 UTC
To: benh; +Cc: Mel Gorman, linuxppc-dev, libhugetlbfs-devel

Benjamin Herrenschmidt wrote:
> On Tue, 2008-09-02 at 17:16 -0500, Jon Tollefson wrote:
<snip>
> I don't see anything huge pages related in the backtraces which is
> interesting ...
>
> Can you get us access to a machine with enough RAM to test the 16G
> pages ?
>
> Ben.

You can use the machine I have been using.  I'll send you a note with
the details on it after I test David's patch today.

Jon

<snip>
* Re: Buglet in 16G page handling
From: Jon Tollefson @ 2008-09-02 17:12 UTC
To: libhugetlbfs-devel, linuxppc-dev, Benjamin Herrenschmidt

David Gibson wrote:
> When BenH and I were looking at the new code for handling 16G pages,
> we noticed a small bug.
<snip>
> For 16G pages, I just don't have access to a machine
> with enough memory to test.  Jon, presumably you must have found such
> a machine when you did the 16G page support in the first place.  Do
> you still have access, and can you test this patch?

I do have access to a machine to test it.  I applied the patch to -rc4
and used a pseries_defconfig.  I boot with default_hugepagesz=16G... in
order to test huge page sizes other than 16M at this point.

Running the libhugetlbfs test suite it gets as far as

    Readback (64):  PASS

before it hits the following program check.

kernel BUG at arch/powerpc/mm/hugetlbpage.c:98!
cpu 0x0: Vector: 700 (Program Check) at [c0000002843db580]
    pc: c000000000035ff4: .free_hugepte_range+0x2c/0x7c
    lr: c000000000036af0: .hugetlb_free_pgd_range+0x2c0/0x398
    sp: c0000002843db800
   msr: 8000000000029032
  current = 0xc00000028417a2a0
  paca    = 0xc0000000008d4300
    pid   = 3334, comm = readback
kernel BUG at arch/powerpc/mm/hugetlbpage.c:98!
enter ? for help
[c0000002843db880] c000000000036af0 .hugetlb_free_pgd_range+0x2c0/0x398
[c0000002843db980] c0000000000da224 .free_pgtables+0x98/0x140
[c0000002843dba40] c0000000000dc4d8 .exit_mmap+0x13c/0x22c
[c0000002843dbb00] c00000000005b218 .mmput+0x78/0x148
[c0000002843dbba0] c000000000060528 .exit_mm+0x164/0x18c
[c0000002843dbc50] c000000000062718 .do_exit+0x2e8/0x858
[c0000002843dbd10] c000000000062d24 .do_group_exit+0x9c/0xd0
[c0000002843dbdb0] c000000000062d74 .sys_exit_group+0x1c/0x30
[c0000002843dbe30] c0000000000086d4 syscall_exit+0x0/0x40
--- Exception: c00 (System Call) at 000000802db7a530
SP (fffffa6e290) is in userspace

Line 98 appears to be this BUG_ON:

static inline pte_t *hugepd_page(hugepd_t hpd)
{
	BUG_ON(!(hpd.pd & HUGEPD_OK));

Jon

<snip>
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: David Gibson @ 2008-09-03  0:20 UTC
To: Jon Tollefson; +Cc: linuxppc-dev, libhugetlbfs-devel

On Tue, Sep 02, 2008 at 12:12:27PM -0500, Jon Tollefson wrote:
> David Gibson wrote:
<snip>
> I do have access to a machine to test it.  I applied the patch to -rc4
> and used a pseries_defconfig.  I boot with default_hugepagesz=16G... in
> order to test huge page sizes other than 16M at this point.
>
> Running the libhugetlbfs test suite it gets as far as
>     Readback (64):  PASS
> before it hits the following program check.

Ah, yes, oops, forgot to fix up the pagetable freeing path in line
with the other changes.  Try the revised version below.
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2008-09-02 11:50:12.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2008-09-03 10:10:54.000000000 +1000
@@ -128,29 +128,37 @@ static int __hugepte_alloc(struct mm_str
 	return 0;
 }
 
-/* Base page size affects how we walk hugetlb page tables */
-#ifdef CONFIG_PPC_64K_PAGES
-#define hpmd_offset(pud, addr, h)	pmd_offset(pud, addr)
-#define hpmd_alloc(mm, pud, addr, h)	pmd_alloc(mm, pud, addr)
-#else
-static inline
-pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
+
+static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
+{
+	if (huge_page_shift(hstate) < PUD_SHIFT)
+		return pud_offset(pgd, addr);
+	else
+		return (pud_t *) pgd;
+}
+static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
+			 struct hstate *hstate)
 {
-	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) < PUD_SHIFT)
+		return pud_alloc(mm, pgd, addr);
+	else
+		return (pud_t *) pgd;
+}
+static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
+{
+	if (huge_page_shift(hstate) < PMD_SHIFT)
 		return pmd_offset(pud, addr);
 	else
 		return (pmd_t *) pud;
 }
-static inline
-pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-		  struct hstate *hstate)
+static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
+			 struct hstate *hstate)
 {
-	if (huge_page_shift(hstate) == PAGE_SHIFT_64K)
+	if (huge_page_shift(hstate) < PMD_SHIFT)
 		return pmd_alloc(mm, pud, addr);
 	else
 		return (pmd_t *) pud;
 }
-#endif
 
 /* Build list of addresses of gigantic pages.  This function is used in early
  * boot before the buddy or bootmem allocator is setup.
@@ -204,7 +212,7 @@ pte_t *huge_pte_offset(struct mm_struct
 
 	pg = pgd_offset(mm, addr);
 	if (!pgd_none(*pg)) {
-		pu = pud_offset(pg, addr);
+		pu = hpud_offset(pg, addr, hstate);
 		if (!pud_none(*pu)) {
 			pm = hpmd_offset(pu, addr, hstate);
 			if (!pmd_none(*pm))
@@ -233,7 +241,7 @@ pte_t *huge_pte_alloc(struct mm_struct *
 	addr &= hstate->mask;
 
 	pg = pgd_offset(mm, addr);
-	pu = pud_alloc(mm, pg, addr);
+	pu = hpud_alloc(mm, pg, addr, hstate);
 
 	if (pu) {
 		pm = hpmd_alloc(mm, pu, addr, hstate);
@@ -316,13 +324,7 @@ static void hugetlb_free_pud_range(struc
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-#ifdef CONFIG_PPC_64K_PAGES
-		if (pud_none_or_clear_bad(pud))
-			continue;
-		hugetlb_free_pmd_range(tlb, pud, addr, next, floor, ceiling,
-				       psize);
-#else
-		if (shift == PAGE_SHIFT_64K) {
+		if (shift < PMD_SHIFT) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
@@ -332,7 +334,6 @@ static void hugetlb_free_pud_range(struc
 				continue;
 			free_hugepte_range(tlb, (hugepd_t *)pud, psize);
 		}
-#endif
 	} while (pud++, addr = next, addr != end);
 
 	start &= PGDIR_MASK;
@@ -422,9 +423,15 @@ void hugetlb_free_pgd_range(struct mmu_g
 		psize = get_slice_psize(tlb->mm, addr);
 		BUG_ON(!mmu_huge_psizes[psize]);
 		next = pgd_addr_end(addr, end);
-		if (pgd_none_or_clear_bad(pgd))
-			continue;
-		hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+		if (mmu_psize_to_shift(psize) < PUD_SHIFT) {
+			if (pgd_none_or_clear_bad(pgd))
+				continue;
+			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
+		} else {
+			if (pgd_none(*pgd))
+				continue;
+			free_hugepte_range(tlb, (hugepd_t *)pgd, psize);
+		}
 	} while (pgd++, addr = next, addr != end);
 }

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Jon Tollefson @ 2008-09-03 22:19 UTC
To: libhugetlbfs-devel, linuxppc-dev, Benjamin Herrenschmidt

David Gibson wrote:
> On Tue, Sep 02, 2008 at 12:12:27PM -0500, Jon Tollefson wrote:
<snip>
>> Running the libhugetlbfs test suite it gets as far as
>>     Readback (64):  PASS
>> before it hits the following program check.
>
> Ah, yes, oops, forgot to fix up the pagetable freeing path in line
> with the other changes.  Try the revised version below.

I have run through the tests twice now with this new patch using a 4k
base page size (and 16G huge page size) and there are no program checks
or spin lock issues.  So it's looking good.

I will run it next a couple of times with 64K base pages.

Jon

<snip>
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: David Gibson @ 2008-09-04  6:22 UTC
To: Jon Tollefson; +Cc: linuxppc-dev, libhugetlbfs-devel

On Wed, Sep 03, 2008 at 05:19:27PM -0500, Jon Tollefson wrote:
> David Gibson wrote:
<snip>
> I have run through the tests twice now with this new patch using a 4k
> base page size (and 16G huge page size) and there are no program checks
> or spin lock issues.  So it's looking good.
>
> I will run it next a couple of times with 64K base pages.

Ok, and I've now run it with 64k hugepage size, so assuming this last
test of yours goes ok, I'll push the patch out.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: Jon Tollefson @ 2008-09-04 21:08 UTC
To: libhugetlbfs-devel, linuxppc-dev, Benjamin Herrenschmidt

Jon Tollefson wrote:
> David Gibson wrote:
<snip>
>> Ah, yes, oops, forgot to fix up the pagetable freeing path in line
>> with the other changes.  Try the revised version below.
>
> I have run through the tests twice now with this new patch using a 4k
> base page size (and 16G huge page size) and there are no program checks
> or spin lock issues.  So it's looking good.
>
> I will run it next a couple of times with 64K base pages.

I have run through the libhugetlbfs test suite 3 times each now with
both combinations (4k and 64K base page) and have not seen the spin
lock problem or any other problems.

Acked-by: Jon Tollefson <kniht@linux.vnet.ibm.com>

<snip>
* Re: [Libhugetlbfs-devel] Buglet in 16G page handling
From: David Gibson @ 2008-09-05  1:36 UTC
To: Jon Tollefson; +Cc: linuxppc-dev, libhugetlbfs-devel

On Thu, Sep 04, 2008 at 04:08:30PM -0500, Jon Tollefson wrote:
> Jon Tollefson wrote:
[snip]
> I have run through the libhugetlbfs test suite 3 times each now with
> both combinations (4k and 64K base page) and have not seen the spin
> lock problem or any other problems.

Excellent.  I'll push the patch.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson
Thread overview: 13 messages

2008-09-02  5:05 Buglet in 16G page handling - David Gibson
2008-09-02 12:44 ` [Libhugetlbfs-devel] Buglet in 16G page handling - Mel Gorman
2008-09-02 16:25   ` Nishanth Aravamudan
2008-09-02 21:05   ` Benjamin Herrenschmidt
2008-09-02 22:16     ` Jon Tollefson
2008-09-02 22:53       ` Benjamin Herrenschmidt
2008-09-03 14:11         ` Jon Tollefson
2008-09-02 17:12 ` Jon Tollefson
2008-09-03  0:20   ` David Gibson
2008-09-03 22:19     ` Jon Tollefson
2008-09-04  6:22       ` David Gibson
2008-09-04 21:08       ` Jon Tollefson
2008-09-05  1:36         ` David Gibson