* ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
@ 2007-06-27 18:12 Casey Leedom
  2007-06-27 18:39 ` Arjan van de Ven
  2007-06-27 19:02 ` Hugh Dickins
  0 siblings, 2 replies; 8+ messages in thread
From: Casey Leedom @ 2007-06-27 18:12 UTC (permalink / raw)
  To: linux-kernel

Hello,

  I'm working on a driver that does a get_user_pages() for a DMA write.  We
have a timeout on the DMA completion where we mark the pages as COW and return
to the application so it can potentially generate more data in order to
increase throughput, etc.  The problem is that when we traverse the
PGD/PUD/PMD/PTE hierarchy to mark the pages, we sometimes fault out when the PUD
covering [0x80000000, 0xc0000000) comes out as zero when that entire region is
covered by a single large malloc()'ed buffer.  This happens because
get_user_pages() returns a reference to empty_zero_page without passing through
the fault path when a page is backed by anonymous memory.  I put a check for this
in my driver using ZERO_PAGE() and everything is great.  Except that it doesn't
work on 32-bit i386 Redhat 4.4 kernels because the symbol for empty_zero_page
is not only not exported via EXPORT_SYMBOL() but has also somehow been stripped
from the kallsyms table.  This means that I can't check for the ZERO_PAGE() in
a module.

  Can someone suggest a better way of doing this?  Either by some subterfuge to
get empty_zero_page or perhaps by asking the kernel to instantiate the
PGD/PUD/PMD/PTE hierarchy for a page?  I'm sort of stuck here.  Thanks for any
help and/or advice you can offer.

Casey
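
For background, here is a minimal userspace sketch (not driver code) of the
behaviour that leads to the zero page.  It assumes a Linux-like system that
provides MAP_ANONYMOUS; the function name fresh_anon_reads_zero() is made up
for this illustration.  It maps fresh anonymous memory and only ever reads it.
Every byte reads as zero; whether the kernel serves those reads from a shared
zero page or from freshly cleared pages is an implementation detail that the
program cannot observe.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Returns 1 if every byte of a fresh private anonymous mapping reads as
 * zero, 0 if some byte does not, and -1 if mmap() fails.  The mapping is
 * read-only and nothing is ever written to it.
 */
int fresh_anon_reads_zero(size_t len)
{
	unsigned char *p;
	size_t i;
	int all_zero = 1;

	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return -1;

	for (i = 0; i < len; i++) {
		if (p[i] != 0) {
			all_zero = 0;
			break;
		}
	}

	munmap(p, len);
	return all_zero;
}
```

A caller would simply check that fresh_anon_reads_zero(1 << 20) returns 1.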


* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 18:12 ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels Casey Leedom
@ 2007-06-27 18:39 ` Arjan van de Ven
  2007-06-27 18:53   ` Casey Leedom
  2007-06-27 19:02 ` Hugh Dickins
  1 sibling, 1 reply; 8+ messages in thread
From: Arjan van de Ven @ 2007-06-27 18:39 UTC (permalink / raw)
  To: Casey Leedom; +Cc: linux-kernel

On Wed, 2007-06-27 at 11:12 -0700, Casey Leedom wrote:
> Hello,
> 
>   I'm working on a driver that does a get_user_pages() for a DMA write.  We
> have a timeout on the DMA completion where we mark the pages as COW and return
> to the application so it can potentially generate more data in order to
> increase throughput, etc.  The problem is that when we traverse the
> PGD/PUD/PMD/PTE hierarchy to mark the pages, we sometimes fault out when the PUD
> covering [0x80000000, 0xc0000000) comes out as zero when that entire region is
> covered by a single large malloc()'ed buffer.  

you forgot to attach your source code or provide a URL to it.....




* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 18:39 ` Arjan van de Ven
@ 2007-06-27 18:53   ` Casey Leedom
  2007-06-27 20:18     ` Arjan van de Ven
  0 siblings, 1 reply; 8+ messages in thread
From: Casey Leedom @ 2007-06-27 18:53 UTC (permalink / raw)
  To: linux-kernel

--- Arjan van de Ven <arjan@infradead.org> wrote:

> you forgot to attach your source code or provide a URL to it.....

  Sorry, my bad.  I'm just diving into Linux for the first time and wasn't
aware of the protocols.  Here's the code fragment I'm currently using:

	address = skb_vaddr(skb) & PAGE_MASK;
	end = PAGE_ALIGN(skb_vaddr(skb) + len);

	down_read(&current->mm->mmap_sem);
	vma = find_vma(current->mm, skb_vaddr(skb));

	spin_lock(&current->mm->page_table_lock);
	for (i = 0; i < skb_shinfo(skb)->nr_frags; i++, address += PAGE_SIZE) { 
		pgd_t *pgd;
		pud_t *pud;
		pmd_t *pmd;
		pte_t *ptep;
		skb_frag_t *frag = &skb_shinfo(skb)->frags[i];

		/*
		 * If we're looking at the ZERO_PAGE() then we don't have to
		 * do anything.  It's never going to be modified.  See also
		 * where we avoid the ZERO_PAGE() in skb_dma_complete().
		 */
		if (frag->page == ZERO_PAGE(address))
			continue;

		atomic_inc(&frag->page->_mapcount);

		/*
		 * If we're not dealing with the ZERO_PAGE() then it should
		 * not be possible to end up without a mapping to the page.
		 * But just in case we check here and issue a BUG() if the
		 * mapping is missing ...
		 */
		if (unlikely((pgd = pgd_offset(current->mm, address),
			      pgd_none(*pgd)) ||
			     (pud = pud_offset(pgd, address),
			      pud_none(*pud)) ||
			     (pmd = pmd_offset(pud, address),
			      pmd_none(*pmd)) ||
			     (ptep = pte_offset_map(pmd, address),
			      pte_none(*ptep)))) {
			printk(KERN_ERR "skb_dma_pending: missing %s page mapping"
			       " for vaddr=%#lx, page=%p\n",
			       pgd_none(*pgd) ? "pgd"
			       : pud_none(*pud) ? "pud"
			       : pmd_none(*pmd) ? "pmd"
			       : "pte",
			       address, frag->page);
			BUG();
		}

		ptep_set_wrprotect(current->mm, address, ptep);
		pte_unmap(ptep);
	}
	spin_unlock(&current->mm->page_table_lock);
	/* The loop advanced 'address', so flush from the original start. */
	flush_tlb_range(vma, skb_vaddr(skb) & PAGE_MASK, end);

I hope that isn't too large but I wanted to provide sufficient context.

Casey
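
The page rounding at the top of the fragment can be tried out in userspace.
This is a minimal sketch assuming 4 KiB pages, as on i386; the TOY_* macros
and toy_* functions are hypothetical stand-ins for the kernel's PAGE_MASK and
PAGE_ALIGN(), not the kernel definitions themselves.

```c
#include <stdint.h>

#define TOY_PAGE_SHIFT 12	/* assumption: 4 KiB pages */
#define TOY_PAGE_SIZE  ((uintptr_t)1 << TOY_PAGE_SHIFT)
#define TOY_PAGE_MASK  (~(TOY_PAGE_SIZE - 1))

/* Round an address down to the start of its page (vaddr & PAGE_MASK). */
uintptr_t toy_page_start(uintptr_t vaddr)
{
	return vaddr & TOY_PAGE_MASK;
}

/* Round an address up to the next page boundary, like PAGE_ALIGN(). */
uintptr_t toy_page_align(uintptr_t vaddr)
{
	return (vaddr + TOY_PAGE_SIZE - 1) & TOY_PAGE_MASK;
}

/* Number of pages touched by the byte range [vaddr, vaddr + len). */
unsigned long toy_pages_spanned(uintptr_t vaddr, uintptr_t len)
{
	uintptr_t start = toy_page_start(vaddr);
	uintptr_t end = toy_page_align(vaddr + len);

	return (unsigned long)((end - start) >> TOY_PAGE_SHIFT);
}
```

For example, a 0x20-byte buffer starting at 0x80000ff0 straddles a page
boundary, so toy_pages_spanned(0x80000ff0, 0x20) is 2.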


* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 18:12 ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels Casey Leedom
  2007-06-27 18:39 ` Arjan van de Ven
@ 2007-06-27 19:02 ` Hugh Dickins
  2007-06-27 19:13   ` Casey Leedom
  1 sibling, 1 reply; 8+ messages in thread
From: Hugh Dickins @ 2007-06-27 19:02 UTC (permalink / raw)
  To: Casey Leedom; +Cc: linux-kernel

On Wed, 27 Jun 2007, Casey Leedom wrote:
> 
>   I'm working on a driver that does a get_user_pages() for a DMA write.  We
> have a timeout on the DMA completion where we mark the pages as COW and return
> to the application so it can potentially generate more data in order to
> increase throughput, etc.  The problem is that when we traverse the
> PGD/PUD/PMD/PTE hierarchy to mark the pages, we sometimes fault out when the PUD
> covering [0x80000000, 0xc0000000) comes out as zero when that entire region is
> covered by a single large malloc()'ed buffer.  This happens because
> get_user_pages() returns a reference to empty_zero_page without passing through
> the fault path when a page is backed by anonymous memory.  I put a check for this
> in my driver using ZERO_PAGE() and everything is great.  Except that it doesn't
> work on 32-bit i386 Redhat 4.4 kernels because the symbol for empty_zero_page
> is not only not exported via EXPORT_SYMBOL() but has also somehow been stripped
> from the kallsyms table.  This means that I can't check for the ZERO_PAGE() in
> a module.
> 
>  Can someone suggest a better way of doing this?  Either by some subterfuge to
> get empty_zero_page or perhaps by asking the kernel to instantiate the
> PGD/PUD/PMD/PTE hierarchy for a page?  I'm sort of stuck here.  Thanks for any
> help and/or advice you can offer.

I can't speak for Red Hat 4.4; but in general, you should be passing the
write flag to get_user_pages if you're going to modify the content of
those pages, which will then allocate the hierarchy needed and break
COW where necessary.  If that doesn't suit you (e.g. it's supposed to
be a readonly area in userspace), then you probably shouldn't be using
get_user_pages at all, but letting userspace mmap your driver pages
into its space instead.  Definitely don't play empty_zero_page games.

Hugh


* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 19:02 ` Hugh Dickins
@ 2007-06-27 19:13   ` Casey Leedom
  0 siblings, 0 replies; 8+ messages in thread
From: Casey Leedom @ 2007-06-27 19:13 UTC (permalink / raw)
  To: linux-kernel

--- Hugh Dickins <hugh@veritas.com> wrote:

> I can't speak for Red Hat 4.4; but in general, you should be passing the
> write flag to get_user_pages if you're going to modify the content of
> those pages, which will then allocate the hierarchy needed and break
> COW where necessary.

  Yes, that would definitely be the case if I were doing a DMA read but in this
case I'm doing a DMA write.  I could force the kernel to instantiate zero'ed
pages for the entire range but that would significantly impact performance.

> If that doesn't suit you (e.g. it's supposed to
> be a readonly area in userspace), then you probably shouldn't be using
> get_user_pages at all, but letting userspace mmap your driver pages
> into its space instead.  Definitely don't play empty_zero_page games.

  Hhrrrmmm, I'm not sure I understand what you're saying here.  In our case the
user application just does a write() on a socket.  Our driver is doing a DMA
directly from user pages out to the interface.  We have fallback timeout code
if the DMA takes longer than a certain length of time where we mark the pages
that haven't yet been transferred as COW so we can return early to the user
application without violating Linux write() semantics.  It sounds like you're
advocating major changes to the application which wouldn't be useful.  My
apologies if I'm misinterpreting your comment.

Casey
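
The copy-on-write guarantee that the driver relies on can be seen from
userspace with fork(), which is the everyday case of the kernel
write-protecting pages.  The sketch below is only an analogy for the driver's
hand-rolled write protection, assuming a Linux-like system; the function name
child_keeps_snapshot() is invented for this example.  The child plays the role
of the in-flight DMA: it must still see the old data after the "application"
(the parent) has overwritten the buffer.

```c
#define _DEFAULT_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a forked child still sees the byte that was in the buffer at
 * fork() time after the parent has overwritten it, 0 if the child sees the
 * new byte, and -1 on any error.
 */
int child_keeps_snapshot(void)
{
	volatile char *buf;
	int pfd[2];
	int status;
	int result = -1;
	char token = 0;
	ssize_t n;
	pid_t pid;

	buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return -1;
	buf[0] = 'A';

	if (pipe(pfd) < 0)
		goto out_unmap;

	pid = fork();
	if (pid < 0) {
		close(pfd[0]);
		close(pfd[1]);
		goto out_unmap;
	}
	if (pid == 0) {
		/* Child: wait until the parent has written, then look. */
		close(pfd[1]);
		if (read(pfd[0], &token, 1) != 1)
			_exit(2);
		_exit(buf[0] == 'A' ? 0 : 1);
	}

	/* Parent: after fork() this write lands in a private copy. */
	close(pfd[0]);
	buf[0] = 'B';
	n = write(pfd[1], &token, 1);
	(void)n;	/* on failure the child sees EOF and exits with 2 */
	close(pfd[1]);

	if (waitpid(pid, &status, 0) == pid && WIFEXITED(status)) {
		if (WEXITSTATUS(status) == 0)
			result = 1;
		else if (WEXITSTATUS(status) == 1)
			result = 0;
	}
out_unmap:
	munmap((void *)buf, 4096);
	return result;
}
```

The pipe only orders the two processes, so the child reads the buffer strictly
after the parent's write.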


* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 18:53   ` Casey Leedom
@ 2007-06-27 20:18     ` Arjan van de Ven
  2007-06-27 21:40       ` Casey Leedom
  0 siblings, 1 reply; 8+ messages in thread
From: Arjan van de Ven @ 2007-06-27 20:18 UTC (permalink / raw)
  To: Casey Leedom; +Cc: linux-kernel

On Wed, 2007-06-27 at 11:53 -0700, Casey Leedom wrote:
> --- Arjan van de Ven <arjan@infradead.org> wrote:
> 
> > you forgot to attach your source code or provide a URL to it.....
> 
>   Sorry, my bad.  I'm just diving into Linux for the first time and wasn't
> aware of the protocols.  Here's the code fragment I'm currently using:

code fragments are of only very limited use because they don't
compile... use the power of open source and just post your entire source
code..  (and it's open source, right?)

you sort of gave too little, we can't see how this is being used for
example..



* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 20:18     ` Arjan van de Ven
@ 2007-06-27 21:40       ` Casey Leedom
  2007-06-27 22:43         ` Casey Leedom
  0 siblings, 1 reply; 8+ messages in thread
From: Casey Leedom @ 2007-06-27 21:40 UTC (permalink / raw)
  To: linux-kernel


--- Arjan van de Ven <arjan@infradead.org> wrote:

> On Wed, 2007-06-27 at 11:53 -0700, Casey Leedom wrote:
> >   Sorry, my bad.  I'm just diving into Linux for the first time and wasn't
> > aware of the protocols.  Here's the code fragment I'm currently using:
> 
> code fragments are of only very limited use because they don't
> compile... use the power of open source and just post your entire source
> code..  (and it's open source, right?)
> 
> you sort of gave too little, we can't see how this is being used for
> example..

  Yes, it's open source -- the company I work for, Chelsio Communications,
makes its money off of the hardware and gives away all of the driver source both
under GPL and BSD licenses.  I just thought that dumping the entire driver
source would bother people.  Basically the problem is that when you call
get_user_pages() and the "write" parameter is 0, get_user_pages() calls
follow_page() with FOLL_ANON in the foll_flags parameter.  This causes
follow_page() to return a reference to empty_zero_page if the page is an
unmodified /dev/zero mapping and any portion of the PGD/PUD/PMD/PTE hierarchy
is missing.  This is all fine, but later on in our driver when we want to
follow down the PGD/PUD/PMD/PTE hierarchy to mark pages COW we fail because of
the missing mappings.

  I've thought of several solutions to this problem:

 1. If the "zero copy" DMA path comes across any /dev/zero mapped page,
    fail the zero copy path and fall back to the normal copy path.  This
    would be a minor performance loss but even worse, it requires being
    able to recognize ZERO_PAGE() which is the problem I'm battling now.

 2. Force get_user_pages() not to pass FOLL_ANON in to follow_page().
    The simplest way to do this is to pass in a non-zero "write"
    parameter but I'm leery of this because of the potential for
    side effects (lots of pages getting marked dirty, etc.)

 3. In the DMA-timeout/COW/return-to-user optimization path, if we run
    across ZERO_PAGE() then just skip to the next page since ZERO_PAGE()
    doesn't need to be marked COW.  This is our current fix but we've
    run across this inability to determine if a page we're looking at is
    the ZERO_PAGE() in an i386 32-bit Redhat 4.4 kernel ... (sigh)  This
    does work on the x86_64 kernel since empty_zero_page is an
    EXPORT_SYMBOL() under that architecture.

 4. Somehow force the PGD/PUD/PMD/PTE hierarchy to instantiate for each
    page.  Not sure how to do this ...  (I'm very new to the Linux kernel
    and working hard to catch up with everything.)

Sorry for the long drawn out explanation -- most of you probably already know
all about these paths but I just found out most of the above so I figured I
should explain what I _think_ is going on in case I've got some misconceptions.

Casey
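
The read-versus-write lookup behaviour described above can be modelled in a
few lines of userspace C.  This is a toy model under stated assumptions, not
kernel code: struct toy_mm, toy_zero_page, toy_get_page() and toy_is_mapped()
are all hypothetical names, and a single flat array of slots stands in for the
whole PGD/PUD/PMD/PTE hierarchy.

```c
#include <stdlib.h>

#define TOY_NPAGES    8
#define TOY_PAGE_SIZE 4096

/* One shared, never-modified page of zeros (stand-in for ZERO_PAGE()). */
const unsigned char toy_zero_page[TOY_PAGE_SIZE];

struct toy_mm {
	unsigned char *slot[TOY_NPAGES];	/* NULL: nothing instantiated */
};

/*
 * Look up page 'idx'.  With write == 0 an uninstantiated slot is answered
 * with the shared zero page and the slot stays empty.  With write != 0 the
 * slot is populated with a private zeroed page first.  Returns NULL for an
 * out-of-range index or if allocation fails.
 */
const unsigned char *toy_get_page(struct toy_mm *mm, unsigned int idx,
				  int write)
{
	if (idx >= TOY_NPAGES)
		return NULL;

	if (mm->slot[idx] == NULL) {
		if (!write)
			return toy_zero_page;
		mm->slot[idx] = calloc(1, TOY_PAGE_SIZE);
	}
	return mm->slot[idx];
}

/* A later walk over the slots only finds pages that were instantiated. */
int toy_is_mapped(const struct toy_mm *mm, unsigned int idx)
{
	return idx < TOY_NPAGES && mm->slot[idx] != NULL;
}
```

In this model a read-only lookup hands back a valid page while leaving nothing
behind for a later walk to find, which is the situation the list above is
trying to work around.  A write lookup avoids that at the price of allocating
a page per slot.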


* Re: ZERO_PAGE() vs. loadable modules in Redhat 4.4 i386 kernels ...
  2007-06-27 21:40       ` Casey Leedom
@ 2007-06-27 22:43         ` Casey Leedom
  0 siblings, 0 replies; 8+ messages in thread
From: Casey Leedom @ 2007-06-27 22:43 UTC (permalink / raw)
  To: linux-kernel

  Never mind.  I realized I was being an idiot.  Sorry for the wasted bandwidth.

  Basically, if we have a page that get_user_pages() returned to us, the only
way any part of the PGD/PUD/PMD mapping hierarchy can be missing is if the page
is the ZERO_PAGE().  Thus I can use this to detect my ZERO_PAGE() failure mode.

  Thanks for your help!

Casey
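
The idea in this last message, treating a missing level as "this is the zero
page, skip it" instead of calling BUG(), can be sketched with a toy four-level
table in userspace.  Everything here is an assumption made for illustration:
the toy_* types, the 8-bit address split into four 2-bit indexes, and the
TOY_PRESENT/TOY_WRITE bits are invented and are not the kernel's data
structures.  The sketch also takes the claim above on trust and does not
verify it against any kernel.

```c
#include <stddef.h>

#define TOY_ENTRIES 4		/* entries per table; real tables are larger */
#define TOY_PRESENT 0x1
#define TOY_WRITE   0x2

struct toy_pte_table { unsigned char pte[TOY_ENTRIES]; };
struct toy_pmd_table { struct toy_pte_table *slot[TOY_ENTRIES]; };
struct toy_pud_table { struct toy_pmd_table *slot[TOY_ENTRIES]; };
struct toy_pgd_table { struct toy_pud_table *slot[TOY_ENTRIES]; };

/*
 * Walk the toy hierarchy for an 8-bit address.  Returns a pointer to the
 * PTE, or NULL if some level is empty; *missing then names the first empty
 * level, mirroring the pgd_none()/pud_none()/pmd_none()/pte_none() chain.
 */
unsigned char *toy_lookup(struct toy_pgd_table *pgd, unsigned int addr,
			  const char **missing)
{
	struct toy_pud_table *pud;
	struct toy_pmd_table *pmd;
	struct toy_pte_table *pt;
	unsigned char *pte;

	*missing = NULL;
	pud = pgd->slot[(addr >> 6) & 3];
	if (pud == NULL) { *missing = "pgd"; return NULL; }
	pmd = pud->slot[(addr >> 4) & 3];
	if (pmd == NULL) { *missing = "pud"; return NULL; }
	pt = pmd->slot[(addr >> 2) & 3];
	if (pt == NULL) { *missing = "pmd"; return NULL; }
	pte = &pt->pte[addr & 3];
	if (!(*pte & TOY_PRESENT)) { *missing = "pte"; return NULL; }
	return pte;
}

/*
 * Write-protect the page at 'addr'.  Returns 0 if the PTE was found and
 * its write bit cleared, 1 if a level was missing and the page was skipped
 * as a presumed zero page.
 */
int toy_wrprotect(struct toy_pgd_table *pgd, unsigned int addr)
{
	const char *missing;
	unsigned char *pte = toy_lookup(pgd, addr, &missing);

	if (pte == NULL)
		return 1;
	*pte = (unsigned char)(*pte & ~TOY_WRITE);
	return 0;
}
```

The caller never needs a pointer to the zero page itself, which is what makes
this approach usable from a module that cannot reference empty_zero_page.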
