linuxppc-dev.lists.ozlabs.org archive mirror
* Re: hugetlbfs for ppc440 - kernel BUG -- follow up
@ 2007-07-17 21:07 Satya
  2007-07-18  1:53 ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 4+ messages in thread
From: Satya @ 2007-07-17 21:07 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: kazutomo, edi, david

hello,

Upon investigating the issue below further, I found that
pte_alloc_map() calls kmap_atomic(). The allocated pte page must be
unmapped before invoking any function that might sleep.

In this case, clear_huge_page() is being called without first invoking
pte_unmap(). The 'normal' counterpart of hugetlb_no_page() (which is
do_no_page() in mm/memory.c) does call pte_unmap() before calling
alloc_page() (which might sleep).

So, I believe pte_unmap() must be invoked first in hugetlb_no_page().
The problem is that we then have no reference to the pmd with which to
map the pte again (using pte_offset_map()). do_no_page() does have a
pmd_t * parameter, so it can remap the pte when required.

For now, I worked around the problem by expanding the pte_alloc_map()
macro by hand and replacing kmap_atomic() with kmap(), although I doubt
that is the right fix.

Let me know whether this analysis helps you figure out the problem. Thanks!

--satya.

On 7/10/07, Satya <satyakiran@gmail.com> wrote:
> hello,
> I am trying to implement hugetlbfs on the IBM Blue Gene/L I/O node
> (ppc440), and I have a big problem as well as a few questions for
> the group. I patched a 2.6.21.6 Linux kernel (manually) with Edi
> Shmueli's hugetlbfs implementation (found here:
> http://patchwork.ozlabs.org/linuxppc/patch?id=8427) for this. I did
> have to make slight changes (described at the end) to make it work.
> My test program is a shortened version of the System V shared memory
> example described in Documentation/vm/hugetlbpage.txt.
>
> I get the following kernel BUG when a page fault occurs on a huge page address:
> BUG: scheduling while atomic: shmtest2/0x10000001/1291
> Call Trace:
> [CFF0BCE0] [C00084F4] show_stack+0x4c/0x194 (unreliable)
> [CFF0BD20] [C01A53C4] schedule+0x664/0x668
> [CFF0BD60] [C00175F8] __cond_resched+0x24/0x50
> [CFF0BD80] [C01A5A6C] cond_resched+0x50/0x58
> [CFF0BD90] [C005A31C] clear_huge_page+0x28/0x174
> [CFF0BDC0] [C005B360] hugetlb_no_page+0xb4/0x220
> [CFF0BE00] [C005B5BC] hugetlb_fault+0xf0/0xf4
> [CFF0BE30] [C0052AC0] __handle_mm_fault+0x3a8/0x3ac
> [CFF0BE70] [C00094A0] do_page_fault+0x118/0x428
> [CFF0BF40] [C0002360] handle_page_fault+0xc/0x80
> BUG: scheduling while atomic: shmtest2/0x10000001/1291
>
> Now for my questions:
>
> 1. Can the kernel really reschedule in a page fault handler context?
>
> 2. Just to test where this "scheduling while atomic" bug arises, I
> put schedule() calls at various places along the path of the stack
> trace shown above. I found that a call to pte_alloc_map() puts the
> kernel in a context where it cannot reschedule without triggering
> this BUG. Here is a trace of what's going on:
>
> __handle_mm_fault -> hugetlb_fault -> huge_pte_alloc() -> pte_alloc_map()
>
> Any call to schedule() before pte_alloc_map() does not trigger this
> error. This might be a flawed experiment, as I am no expert kernel
> hacker, but does it shed any light on the problem?
>
> Here are the modifications I made to Edi's patch:
>
> arch/ppc/mm/hugetlbpage.c
> struct page *
> follow_huge_addr(struct mm_struct *mm, unsigned long address, int write)
> {
>   pte_t *pte;
>   struct page *page;
> +  struct vm_area_struct *vma;
> +
> +  vma = find_vma(mm, address);
> +  if (!vma || !is_vm_hugetlb_page(vma))
> +    return ERR_PTR(-EINVAL);
>
>   pte = huge_pte_offset(mm, address);
>   page = pte_page(*pte);
>   return page;
> }
>
> +int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
> +{
> +        return 0;
> +}
>
> Here is my test program:
>
> #include <stdlib.h>
> #include <stdio.h>
> #include <sys/types.h>
> #include <sys/ipc.h>
> #include <sys/shm.h>
> #include <sys/mman.h>
>
> #ifndef SHM_HUGETLB
> #define SHM_HUGETLB 04000
> #endif
>
> #define LENGTH (16UL*1024*1024)
>
> #define dprintf(x)  printf(x)
>
> #define ADDR (void *)(0x0UL)
> #define SHMAT_FLAGS (0)
>
>
> int main(void)
> {
>         int shmid;
>         unsigned long i;
>         char *shmaddr;
>
>         if ((shmid = shmget(2, LENGTH,
>                             SHM_HUGETLB | IPC_CREAT | SHM_R | SHM_W)) < 0) {
>                 perror("shmget");
>                 exit(1);
>         }
>         printf("shmid: 0x%x\n", shmid);
>
>         shmaddr = shmat(shmid, ADDR, SHMAT_FLAGS);
>         if (shmaddr == (char *)-1) {
>                 perror("Shared memory attach failure");
>                 shmctl(shmid, IPC_RMID, NULL);
>                 exit(2);
>         }
>         printf("shmaddr: %p\n", shmaddr);
>         printf("touching a huge page..\n");
>
>         shmaddr[0]='a';
>         shmaddr[1]='b';
>
>         if (shmdt((const void *)shmaddr) != 0) {
>                 perror("Detach failure");
>                 shmctl(shmid, IPC_RMID, NULL);
>                 exit(3);
>         }
>
>         shmctl(shmid, IPC_RMID, NULL);
>
>         return 0;
> }
>
> thanks!
> Satya.
>


-- 
...what's remarkable, is that atoms have assembled into entities which
are somehow able to ponder their origins.
--
http://cs.uic.edu/~spopuri


* Re: hugetlbfs for ppc440 - kernel BUG -- follow up
  2007-07-17 21:07 hugetlbfs for ppc440 - kernel BUG -- follow up Satya
@ 2007-07-18  1:53 ` Benjamin Herrenschmidt
  2007-07-18  2:18   ` Satya
  0 siblings, 1 reply; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-18  1:53 UTC (permalink / raw)
  To: Satya; +Cc: kazutomo, linuxppc-dev, edi, david

On Tue, 2007-07-17 at 16:07 -0500, Satya wrote:
> hello,
> 
> Upon investigating the below issue further, I found that
> pte_alloc_map() calls kmap_atomic. The allocated pte page must be
> unmapped before invoking any function that might_sleep.
> 
> In this case clear_huge_page() is being called without invoking
> pte_unmap(). The 'normal' counterpart of hugetlb_no_page (which is
> do_no_page() in mm/memory.c) does call pte_unmap() before calling
> alloc_page() (which might sleep).
> 
> So, I believe pte_unmap() must be invoked first in hugetlb_no_page().
> But the problem here is, we do not have a reference to the pmd to map
> the pte again (using pte_offset_map()). The do_no_page() function does
> have a pmd_t* parameter, so it can remap the pte when required.
> 
> For now, I resolved the problem by expanding the pte_alloc_map() macro
> by hand and replacing kmap_atomic with kmap(), although I think it is
> not the right thing to do.
> 
> Let me know if my analysis is helping you figure out the problem here. Thanks!

Except that I don't see where pte_alloc_map() has been called
beforehand... hugetlb_no_page() is called by hugetlb_fault(), which is
called by __handle_mm_fault(), with no lock held.

Ben.


* Re: hugetlbfs for ppc440 - kernel BUG -- follow up
  2007-07-18  1:53 ` Benjamin Herrenschmidt
@ 2007-07-18  2:18   ` Satya
  2007-07-18  3:01     ` Benjamin Herrenschmidt
  0 siblings, 1 reply; 4+ messages in thread
From: Satya @ 2007-07-18  2:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: kazutomo, linuxppc-dev, edi, david

On 7/17/07, Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote:
> On Tue, 2007-07-17 at 16:07 -0500, Satya wrote:
> > hello,
> >
> > Upon investigating the below issue further, I found that
> > pte_alloc_map() calls kmap_atomic. The allocated pte page must be
> > unmapped before invoking any function that might_sleep.
> >
> > In this case clear_huge_page() is being called without invoking
> > pte_unmap(). The 'normal' counterpart of hugetlb_no_page (which is
> > do_no_page() in mm/memory.c) does call pte_unmap() before calling
> > alloc_page() (which might sleep).
> >
> > So, I believe pte_unmap() must be invoked first in hugetlb_no_page().
> > But the problem here is, we do not have a reference to the pmd to map
> > the pte again (using pte_offset_map()). The do_no_page() function does
> > have a pmd_t* parameter, so it can remap the pte when required.
> >
> > For now, I resolved the problem by expanding the pte_alloc_map() macro
> > by hand and replacing kmap_atomic with kmap(), although I think it is
> > not the right thing to do.
> >
> > Let me know if my analysis is helping you figure out the problem here. Thanks!
>
> Except that I don't see where pte_alloc_map() has been called before
> hand... hugetlb_no_page() is called by hugetlb_fault() which is called
> by __handle_mm_fault(), with no lock held.
>

The calling sequence is:

__handle_mm_fault -> hugetlb_fault -> huge_pte_alloc() -> pte_alloc_map()

where '->' stands for 'calls'.

hugetlb_fault() calls hugetlb_no_page() after returning from huge_pte_alloc().

[huge_pte_alloc() is an arch-specific callback implemented in the
patch referred to in my earlier posts.]

Satya.

> Ben.


-- 
...what's remarkable, is that atoms have assembled into entities which
are somehow able to ponder their origins.
--
http://cs.uic.edu/~spopuri


* Re: hugetlbfs for ppc440 - kernel BUG -- follow up
  2007-07-18  2:18   ` Satya
@ 2007-07-18  3:01     ` Benjamin Herrenschmidt
  0 siblings, 0 replies; 4+ messages in thread
From: Benjamin Herrenschmidt @ 2007-07-18  3:01 UTC (permalink / raw)
  To: Satya; +Cc: kazutomo, linuxppc-dev, edi, david

On Tue, 2007-07-17 at 21:18 -0500, Satya wrote:
> the calling sequence is :
> 
> __handle_mm_fault -> hugetlb_fault -> huge_pte_alloc() ->
> pte_alloc_map()
> 
> where -> stands for 'calls'.
> 
> hugetlb_fault() calls hugetlb_no_page() after returning from
> huge_pte_alloc().
> 
> [huge_pte_alloc() is an arch specific call back implemented in the
> patch referred to in my earlier posts]

Ok, so I think the problem might be there. If you look at other
implementations of hugetlbfs, such as x86, there is no need to do any
mapping in huge_pte_alloc(): only the PTE pages are ever
mapped/unmapped, and the huge pages are stored at the PMD level. You
may want to do something similar, and if you need a PTE level for huge
pages specifically, you could do your own allocations there that don't
require a mapping.
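
For comparison, the x86 huge_pte_alloc() of that era looks roughly
like the following C-like sketch (reconstructed from memory, not
compilable on its own; treat the exact names as approximate). Nothing
here goes anywhere near kmap_atomic(), because the "huge pte" it
returns is simply the pmd entry itself:

```c
pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr)
{
        pgd_t *pgd = pgd_offset(mm, addr);
        pud_t *pud = pud_alloc(mm, pgd, addr);
        pte_t *pte = NULL;

        /* The huge page is mapped by the pmd entry directly, so no
         * PTE page -- and hence no atomic kmap -- is ever needed. */
        if (pud)
                pte = (pte_t *)pmd_alloc(mm, pud, addr);
        return pte;
}
```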

Ben.

