Is get_user_pages() enough to prevent pages from being swapped out ?

* Is get_user_pages() enough to prevent pages from being swapped out ?
@ 2009-07-29  9:23 Laurent Pinchart
  2009-07-29 15:26 ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Laurent Pinchart @ 2009-07-29  9:23 UTC (permalink / raw)
  To: linux-kernel, v4l2_linux

Hi everybody,

I'm trying to debug a video acquisition device driver and found myself having 
to dive deep into the memory management subsystem.

The driver uses videobuf-dma-sg to manage video buffers. videobuf-dma-sg gets 
pointers to buffers from userspace and calls get_user_pages() to retrieve the 
list of pages underlying those buffers. The page list is used to build a 
scatter-gather list that is given to the hardware. The device then performs 
DMA directly to the memory.
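
As a rough illustration of the first step described above: the number of pages
behind a userspace buffer depends on its offset within the first page as well
as on its length. The sketch below is plain userspace C, not the
videobuf-dma-sg code. The 4096-byte page size and all `sketch_*` names are
assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed page size for this sketch; the real value is architecture-specific. */
#define SKETCH_PAGE_SIZE 4096UL

/* Number of pages touched by the buffer [addr, addr + len). */
static size_t sketch_pages_spanned(uintptr_t addr, size_t len)
{
	uintptr_t first, last;

	assert(len > 0);
	first = addr / SKETCH_PAGE_SIZE;
	last = (addr + len - 1) / SKETCH_PAGE_SIZE;
	return (size_t)(last - first + 1);
}

/*
 * Length of the scatter-gather segment that falls in page number idx
 * (0-based) of the buffer: the first and last pages may be partial.
 */
static size_t sketch_segment_len(uintptr_t addr, size_t len, size_t idx)
{
	size_t offset = addr % SKETCH_PAGE_SIZE;
	size_t first_len = SKETCH_PAGE_SIZE - offset;
	size_t consumed;

	assert(idx < sketch_pages_spanned(addr, len));
	if (first_len > len)
		first_len = len;
	if (idx == 0)
		return first_len;

	/* Bytes already covered by the pages before idx. */
	consumed = first_len + (idx - 1) * SKETCH_PAGE_SIZE;
	return (len - consumed < SKETCH_PAGE_SIZE) ? len - consumed
						   : SKETCH_PAGE_SIZE;
}
```

A buffer that is not page-aligned therefore needs one more page pinned than
its length alone suggests, and its first and last segments are shorter than a
page.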

Pages underlying the buffers must obviously not be swapped out during DMA. The 
get_user_pages() (mm/memory.c) documentation seems to imply that returned 
pages are pinned to memory (my understanding of "pinned" is that they will not 
be swapped out):

/**
 * get_user_pages() - pin user pages in memory
 * @tsk:        task_struct of target task
 * @mm:         mm_struct of target mm
 * @start:      starting user address
 * @len:        number of pages from start to pin
 * @write:      whether pages will be written to by the caller
 * @force:      whether to force write access even if user mapping is
 *              readonly. This will result in the page being COWed even
 *              in MAP_SHARED mappings. You do not want this.
 * @pages:      array that receives pointers to the pages pinned.
 *              Should be at least nr_pages long. Or NULL, if caller
 *              only intends to ensure the pages are faulted in.
 * @vmas:       array of pointers to vmas corresponding to each page.
 *              Or NULL if the caller does not require them.

However, all it seems to do for that purpose is increment the page
reference count using get_page().

I had a look through the memory management subsystem code and it seems to me 
that incrementing the reference count is not sufficient to make sure the page 
won't be swapped out. To ensure that, it should instead be marked as 
unevictable, either directly or by marking an associated VMA as VM_LOCKED. 
This is what the mlock() syscall does, in addition to calling 
get_user_pages().

The MM subsystem is quite complex and my understanding might not be correct,
so I'd appreciate it if someone could shed light on the issue. Does
get_user_pages() really pin pages to memory and prevent them from being
swapped out in all circumstances? If so, how does it do so? If not, what's
the proper way to make sure the pages won't disappear during DMA?

Please CC me on answers.

Regards,

Laurent Pinchart


* Re: Is get_user_pages() enough to prevent pages from being swapped out ?
  2009-07-29  9:23 Is get_user_pages() enough to prevent pages from being swapped out ? Laurent Pinchart
@ 2009-07-29 15:26 ` Hugh Dickins
  2009-07-29 15:41   ` Laurent Pinchart
  0 siblings, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2009-07-29 15:26 UTC (permalink / raw)
  To: Laurent Pinchart; +Cc: linux-kernel, v4l2_linux

On Wed, 29 Jul 2009, Laurent Pinchart wrote:
> 
> I'm trying to debug a video acquisition device driver and found myself having 
> to dive deep into the memory management subsystem.
> 
> The driver uses videobuf-dma-sg to manage video buffers. videobuf-dma-sg gets 
> pointers to buffers from userspace and calls get_user_pages() to retrieve the 
> list of pages underlying those buffers. The page list is used to build a 
> scatter-gather list that is given to the hardware. The device then performs 
> DMA directly to the memory.
> 
> Pages underlying the buffers must obviously not be swapped out during DMA.
> The get_user_pages() (mm/memory.c) documentation seems to imply that returned 
> pages are pinned to memory (my understanding of "pinned" is that they will
> not be swapped out):
...
> However, all it seems to do for that purpose is increment the page
> reference count using get_page().
> 
> I had a look through the memory management subsystem code and it seems to me 
> that incrementing the reference count is not sufficient to make sure the page 
> won't be swapped out. To ensure that, it should instead be marked as 
> unevictable, either directly or by marking an associated VMA as VM_LOCKED. 
> This is what the mlock() syscall does, in addition to calling 
> get_user_pages().
> 
> The MM subsystem is quite complex and my understanding might not be correct, 
> so I'd appreciate if someone could shed light on the issue. Does 
> get_user_pages() really pin pages to memory and prevent them from being 
> swapped out in all circumstances ? If so, how does it do so ? If not, what's 
> the proper way to make sure the pages won't disappear during DMA ?

You're right that get_user_pages() (called with a pagelist as you're
using) increments the page reference count.

And that is enough to pin the page in memory, in a sense that suits
the use of DMA.

I'm expressing it in that peculiar way, because:- On the one hand,
the page can only disappear from memory by memory hotremove, but
what you'll be worrying about is the page getting freed and reused
for another purpose while DMA is acting upon it - but raising the
reference count prevents that (and will prevent hotremove succeeding).

On the other hand, despite the raised reference count, under memory
pressure that page might get unmapped from the user pagetable, and
might even be written out to swap in its half-dirty state (though
is_page_cache_freeable() tries to avoid that); but it won't get
freed, and DMA will be to the physical address of the page (somebody
will correct me that it's actually the bus address or something else),
not to the userspace virtual address.  So it's irrelevant if that
vanishes for a while - when userspace accesses it again, the same
page (the one DMA occurs to) will be faulted back in there.

In contrast, mlock() is not enough to pin a page in memory in this
sense: from a userspace point of view, an mlock()ed page indeed is
locked into memory, but page migration (for NUMA balancing or for
memory hotremove) is still free to substitute an alternative page
there.  get_user_pages()'s raised reference count prevents that,
but mlock() does not.

There is one little problem with the get_user_pages() pinning,
hopefully one that can never affect you at all.  If the task that
did the get_user_pages() forks, and parent or child userspace tries
to write to one of the pages in question while it's pinned (and it's
an anonymous page, not a pagecache page shared with underlying file),
then the first to touch it will get a copy of the original DMA page
at that instant, thereafter losing contact with the original DMA page.

One answer to that is to madvise such an area with MADV_DONTFORK,
then fork won't duplicate that area, so no Copy-on-Write issues
will arise.  That satisfies many of us, but others look for a
way to eliminate this issue completely.
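
A minimal userspace sketch of the MADV_DONTFORK approach, assuming a Linux
system and an anonymous buffer obtained with mmap(). The function name
`sketch_alloc_dma_buffer` is hypothetical; a real application would hand the
returned pointer to the driver as its userspace buffer.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

/*
 * Allocate an anonymous, page-aligned buffer and mark it MADV_DONTFORK so
 * that a child created by fork() does not inherit the mapping.
 * Returns NULL on failure. The function name is hypothetical.
 */
static void *sketch_alloc_dma_buffer(size_t len)
{
	void *buf;

	assert(len > 0);
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (buf == MAP_FAILED)
		return NULL;

	/* Exclude the range from fork(), so no Copy-on-Write can hit it. */
	if (madvise(buf, len, MADV_DONTFORK) != 0) {
		munmap(buf, len);
		return NULL;
	}
	return buf;
}
```

The advice has to be applied before any fork() that could race with the DMA.
A child that needs the buffer would have to map its own.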

Hugh


* Re: Is get_user_pages() enough to prevent pages from being swapped out ?
  2009-07-29 15:26 ` Hugh Dickins
@ 2009-07-29 15:41   ` Laurent Pinchart
  2009-07-29 16:07     ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Laurent Pinchart @ 2009-07-29 15:41 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: linux-kernel, v4l2_linux

Hi Hugh,

First of all, thanks for your answer.

On Wednesday 29 July 2009 17:26:11 Hugh Dickins wrote:
> On Wed, 29 Jul 2009, Laurent Pinchart wrote:
> > I'm trying to debug a video acquisition device driver and found myself
> > having to dive deep into the memory management subsystem.
> >
> > The driver uses videobuf-dma-sg to manage video buffers. videobuf-dma-sg
> > gets pointers to buffers from userspace and calls get_user_pages() to
> > retrieve the list of pages underlying those buffers. The page list is
> > used to build a scatter-gather list that is given to the hardware. The
> > device then performs DMA directly to the memory.
> >
> > Pages underlying the buffers must obviously not be swapped out during
> > DMA. The get_user_pages() (mm/memory.c) documentation seems to imply that
> > returned pages are pinned to memory (my understanding of "pinned" is that
> > they will not be swapped out):
>
> ...
>
> > However, all it seems to do for that purpose is increment the page
> > reference count using get_page().
> >
> > I had a look through the memory management subsystem code and it seems to
> > me that incrementing the reference count is not sufficient to make sure
> > the page won't be swapped out. To ensure that, it should instead be
> > marked as unevictable, either directly or by marking an associated VMA as
> > VM_LOCKED. This is what the mlock() syscall does, in addition to calling
> > get_user_pages().
> >
> > The MM subsystem is quite complex and my understanding might not be
> > correct, so I'd appreciate if someone could shed light on the issue. Does
> > get_user_pages() really pin pages to memory and prevent them from being
> > swapped out in all circumstances ? If so, how does it do so ? If not,
> > what's the proper way to make sure the pages won't disappear during DMA ?
>
> You're right that get_user_pages() (called with a pagelist as you're
> using) increments the page reference count.
>
> And that is enough to pin the page in memory, in a sense that suits
> the use of DMA.
>
> I'm expressing it in that peculiar way, because:- On the one hand,
> the page can only disappear from memory by memory hotremove, but
> what you'll be worrying about is the page getting freed and reused
> for another purpose while DMA is acting upon it - but raising the
> reference count prevents that (and will prevent hotremove succeeding).

Sorry about that confusion. I'm not too familiar with memory management, so I
mixed up the proper terms.

> On the other hand, despite the raised reference count, under memory
> pressure that page might get unmapped from the user pagetable, and
> might even be written out to swap in its half-dirty state (though
> is_page_cache_freeable() tries to avoid that); but it won't get
> freed, and DMA will be to the physical address of the page (somebody
> will correct me that it's actually the bus address or something else),
> not to the userspace virtual address.  So it's irrelevant if that
> vanishes for a while - when userspace accesses it again, the same
> page (the one DMA occurs to) will be faulted back in there.

Just to make sure I understand things properly, the copy of the page written 
to swap will not be read back when the page is faulted back in by the kernel 
as a result of the userspace process accessing it, right ?

Why would the page be written out to swap if it's not going to be freed
anyway?

> In contrast, mlock() is not enough to pin a page in memory in this
> sense: from a userspace point of view, an mlock()ed page indeed is
> locked into memory, but page migration (for NUMA balancing or for
> memory hotremove) is still free to substitute an alternative page
> there.  get_user_pages()'s raised reference count prevents that,
> but mlock() does not.

Thanks for pointing that "detail" out.

> There is one little problem with the get_user_pages() pinning,
> hopefully one that can never affect you at all.  If the task that
> did the get_user_pages() forks, and parent or child userspace tries
> to write to one of the pages in question while it's pinned (and it's
> an anonymous page, not a pagecache page shared with underlying file),
> then the first to touch it will get a copy of the original DMA page
> at that instant, thereafter losing contact with the original DMA page.
>
> One answer to that is to madvise such an area with MADV_DONTFORK,
> then fork won't duplicate that area, so no Copy-on-Write issues
> will arise.  That satisfies many of us, but others look for a
> way to eliminate this issue completely.

I don't think the userspace processes we use here to access the video buffers 
fork, but I'll double check that and use madvise in that case.

Thank you very much for your help.

Best regards,

Laurent Pinchart


* Re: Is get_user_pages() enough to prevent pages from being swapped out ?
  2009-07-29 15:41   ` Laurent Pinchart
@ 2009-07-29 16:07     ` Hugh Dickins
  2009-07-30 11:39       ` Robin Holt
  0 siblings, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2009-07-29 16:07 UTC (permalink / raw)
  To: Laurent Pinchart; +Cc: linux-kernel, v4l2_linux

On Wed, 29 Jul 2009, Laurent Pinchart wrote:
> On Wednesday 29 July 2009 17:26:11 Hugh Dickins wrote:
> > On Wed, 29 Jul 2009, Laurent Pinchart wrote:
> > > what's the proper way to make sure the pages won't disappear during DMA ?
> >
> > You're right that get_user_pages() (called with a pagelist as you're
> > using) increments the page reference count.
> >
> > And that is enough to pin the page in memory, in a sense that suits
> > the use of DMA.
> >
> > I'm expressing it in that peculiar way, because:- On the one hand,
> > the page can only disappear from memory by memory hotremove, but
> > what you'll be worrying about is the page getting freed and reused
> > for another purpose while DMA is acting upon it - but raising the
> > reference count prevents that (and will prevent hotremove succeeding).
> 
> Sorry about that confusion. I'm not too familiar with memory management so I 
> mixed the proper terms.

Nothing to apologize for!  You expressed yourself in a vivid and
natural way, then I gave you a pedantic response.  It's right to be
pedantic here, to distinguish these cases, but you were not wrong.

> 
> > On the other hand, despite the raised reference count, under memory
> > pressure that page might get unmapped from the user pagetable, and
> > might even be written out to swap in its half-dirty state (though
> > is_page_cache_freeable() tries to avoid that); but it won't get
> > freed, and DMA will be to the physical address of the page (somebody
> > will correct me that it's actually the bus address or something else),
> > not to the userspace virtual address.  So it's irrelevant if that
> > vanishes for a while - when userspace accesses it again, the same
> > page (the one DMA occurs to) will be faulted back in there.
> 
> Just to make sure I understand things properly, the copy of the page written 
> to swap will not be read back when the page is faulted back in by the kernel 
> as a result of the userspace process accessing it, right ?

Absolutely right.  It's a waste of time and diskspace if we write
it at all, but to avoid the possibility of such writes completely
would involve a locking overhead we're better off without.

> 
> Why would the page be written out to swap if it's not going to be
> freed anyway ?

is_page_cache_freeable() tries to avoid such a write, by checking
that the reference count is what we'd expect it to be if nobody else
is interested in the page - at that instant.  But an instant later,
anything might take a new reference to the page: pageout() has the
page lock, but we really don't want to demand that everybody else
has to acquire the page lock just to increment page count.

We could scatter further is_page_cache_freeable() checks around;
but how ever many we added, it could always go up just after the
last check.
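
The check-then-act window described above can be pictured with a toy
userspace model. This is not the kernel's code: `struct toy_page`, the
expected count of 2, and the `pin_in_between` flag (which simulates another
context taking a reference at the worst moment) are all assumptions made for
the illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct page; only the fields this sketch needs. */
struct toy_page {
	int count;	/* reference count */
	bool written;	/* set when the toy writeout runs */
};

/* Assumed "nobody else is interested" count for this model. */
#define TOY_EXPECTED_COUNT 2

static bool toy_is_freeable(const struct toy_page *page)
{
	return page->count == TOY_EXPECTED_COUNT;
}

/* What a get_user_pages()-style pin boils down to in this model. */
static void toy_get_page(struct toy_page *page)
{
	assert(page->count > 0);
	page->count++;
}

/*
 * Toy pageout: check, then write. pin_in_between simulates another
 * context taking a reference after the check but before the write.
 * Returns true if the page was written out.
 */
static bool toy_pageout(struct toy_page *page, bool pin_in_between)
{
	if (!toy_is_freeable(page))
		return false;		/* extra reference seen: skip the write */
	if (pin_in_between)
		toy_get_page(page);	/* the check above is now stale */
	page->written = true;
	return true;
}
```

In the model, a page that is already pinned is skipped, while a pin that
arrives between the check and the write does not stop the write. The page is
still not freed, because its count is above the expected value afterwards.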

Hugh


* Re: Is get_user_pages() enough to prevent pages from being swapped out ?
  2009-07-29 16:07     ` Hugh Dickins
@ 2009-07-30 11:39       ` Robin Holt
  2009-07-30 11:48         ` Hugh Dickins
  0 siblings, 1 reply; 7+ messages in thread
From: Robin Holt @ 2009-07-30 11:39 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Laurent Pinchart, linux-kernel, v4l2_linux

> > On Wednesday 29 July 2009 17:26:11 Hugh Dickins wrote:
...
> > > On the other hand, despite the raised reference count, under memory
> > > pressure that page might get unmapped from the user pagetable, and
> > > might even be written out to swap in its half-dirty state (though

One thing you did not mention in the above description is that the page
is marked clean by the write-out to swap.  I am not sure I recall the
method of mapping involved here, but it is necessary to ensure the page
is marked dirty again before the driver releases it.  If the page is
not marked dirty as part of your method of releasing it, the changes
you have made between when the page was first written out and when you
are freeing it will get lost.
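
To make the sequence concrete, here is a toy userspace model of it. Nothing
in it is kernel code: `struct toy_dpage` and the `toy_*` helpers are
hypothetical. In a real driver the `mark_dirty` step would be done with a
kernel helper such as set_page_dirty_lock() before the reference is dropped;
whether a given driver does so has to be checked in its source.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define TOY_PAGE_BYTES 16

struct toy_dpage {
	bool dirty;
	char data[TOY_PAGE_BYTES];	/* contents in memory */
	char swap[TOY_PAGE_BYTES];	/* copy on the toy swap device */
};

/* Write-out copies the data to swap and marks the page clean. */
static void toy_writeout(struct toy_dpage *p)
{
	memcpy(p->swap, p->data, TOY_PAGE_BYTES);
	p->dirty = false;
}

/* DMA changes memory behind the MM's back: the dirty flag is untouched. */
static void toy_dma_write(struct toy_dpage *p, const char *src, size_t len)
{
	assert(len <= TOY_PAGE_BYTES);
	memcpy(p->data, src, len);
}

/* Driver release path; mark_dirty models re-dirtying before release. */
static void toy_release(struct toy_dpage *p, bool mark_dirty)
{
	if (mark_dirty)
		p->dirty = true;
}

/* Reclaim: a clean page is dropped without writing; returns what survives. */
static const char *toy_reclaim(struct toy_dpage *p)
{
	if (p->dirty)
		toy_writeout(p);
	return p->swap;
}
```

If the page is written out, then filled by DMA, then released without being
marked dirty, the model's reclaim keeps the stale swap copy. Marking it dirty
on release makes the DMA data survive.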

Thanks,
Robin


* Re: Is get_user_pages() enough to prevent pages from being swapped out ?
  2009-07-30 11:39       ` Robin Holt
@ 2009-07-30 11:48         ` Hugh Dickins
  2009-08-05 14:37           ` Laurent Pinchart
  0 siblings, 1 reply; 7+ messages in thread
From: Hugh Dickins @ 2009-07-30 11:48 UTC (permalink / raw)
  To: Robin Holt; +Cc: Laurent Pinchart, linux-kernel, v4l2_linux

On Thu, 30 Jul 2009, Robin Holt wrote:
> > > On Wednesday 29 July 2009 17:26:11 Hugh Dickins wrote:
> ...
> > > > On the other hand, despite the raised reference count, under memory
> > > > pressure that page might get unmapped from the user pagetable, and
> > > > might even be written out to swap in its half-dirty state (though
> 
> One thing you did not mention in the above description is that the page
> is marked clean by the write-out to swap.  I am not sure I recall the
> method of mapping involved here, but it is necessary to ensure the page
> is marked dirty again before the driver releases it.  If the page is
> not marked dirty as part of your method of releasing it, the changes
> you have made between when the page was first written out and when you
> are freeing it will get lost.

Yes indeed: thanks, Robin.

Hugh


* Re: Is get_user_pages() enough to prevent pages from being swapped out ?
  2009-07-30 11:48         ` Hugh Dickins
@ 2009-08-05 14:37           ` Laurent Pinchart
  0 siblings, 0 replies; 7+ messages in thread
From: Laurent Pinchart @ 2009-08-05 14:37 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Robin Holt, linux-kernel, v4l2_linux

Hi Hugh,

I've spent the last few days "playing" with get_user_pages() and mlock() and 
got some interesting results. It turned out that cache coherency comes into 
play at some point, making the overall problem more complex.

Here's my current setup:

- OMAP processor, based on an ARMv7 core
- MMU and IOMMU
- VIPT non-aliasing data cache
- video capture driver that transfers data to memory using DMA
- video capture application that passes userspace pointers to video buffers to
the driver

My goal is to make sure that, upon DMA completion, the correct data will be 
available to the userspace application.

The first problem was to pin pages to memory, to make sure they will not be
freed while DMA is in progress. videobuf-dma-sg uses get_user_pages() for
that, and you nicely explained to me why this is enough.

The second problem is to ensure cache coherency. As the userspace application 
will read data from the video buffers, those buffers will end up being cached 
in the processor's data cache. The driver does need to invalidate the cache 
before starting the DMA operation (userspace could in theory write to the 
buffers, but the data will be overwritten by DMA anyway, so there's no need to 
clean the cache).

As the cache is of the VIPT (Virtually Indexed, Physically Tagged) type, cache
invalidation can either be done globally (in which case the cache is flushed
instead of being invalidated) or based on virtual addresses. In the latter case
the processor will need to look physical addresses up, either in the TLB or
through a hardware table walk.
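
Invalidation by address operates on whole cache lines, so the buffer's address
range has to be widened to line boundaries first. The arithmetic can be
sketched in plain C as below. The 64-byte line size and the `sketch_*` names
are assumptions for the example; real code would query the line size from the
CPU and use the architecture's own cache maintenance routines.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed cache line size; the real value must be read from the CPU. */
#define SKETCH_CACHE_LINE 64UL

struct sketch_range {
	uintptr_t start;	/* first byte of the first line */
	uintptr_t end;		/* one past the last byte of the last line */
};

/* Widen [addr, addr + len) to cache line boundaries. */
static struct sketch_range sketch_line_range(uintptr_t addr, size_t len)
{
	struct sketch_range r;

	assert(len > 0);
	r.start = addr & ~(SKETCH_CACHE_LINE - 1);
	r.end = (addr + len + SKETCH_CACHE_LINE - 1) & ~(SKETCH_CACHE_LINE - 1);
	return r;
}

/* True if invalidating the range would also discard bytes outside the buffer. */
static int sketch_touches_neighbours(uintptr_t addr, size_t len)
{
	struct sketch_range r = sketch_line_range(addr, len);

	return r.start != addr || r.end != addr + len;
}
```

A buffer that does not start and end on line boundaries shares its first or
last line with unrelated data, which a plain invalidate would throw away.
This is one reason to keep DMA buffers cache-line aligned.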

I can see three solutions to the DMA/cache problem.

1. Flushing the whole data cache right before starting the DMA transfer.
There's no API for that in the ARM architecture, so a whole I+D cache flush is
required. This is quite costly (we're talking about around 30 flushes per
second), but it doesn't involve the MMU. That's the solution that I currently
use.

2. Invalidating only the cache lines that store video buffer data. This
requires a TLB lookup or a hardware table walk, so the userspace application
MM context needs to be available (no problem there as we're flushing in
userspace context) and all pages need to be mapped properly. This can be a
problem because, as you pointed out, pages can still be unmapped from the
userspace context after get_user_pages() returns. I have experienced one oops
due to a kernel paging request failure:

	Unable to handle kernel paging request at virtual address 44e12000
	pgd = c8698000
	[44e12000] *pgd=8a4fd031, *pte=8cfda1cd, *ppte=00000000
	Internal error: Oops: 817 [#1] PREEMPT
	PC is at v7_dma_inv_range+0x2c/0x44

Fixing this requires more investigation, and I'm not sure how to proceed to 
find out if the page fault is really caused by pages being unmapped from the 
userspace context. Help would be appreciated.

3. Marking the pages as non-cacheable. Depending on how the buffers are then used
by userspace, the additional cache misses might destroy any benefit I would 
get from not flushing the cache before DMA. I'm not sure how to mark a bunch 
of pages as non-cacheable though. What usually happens is that video drivers 
allocate DMA-coherent memory themselves, but in this case I need to deal with 
an arbitrary buffer allocated by userspace. If someone has any experience with 
this, it would be appreciated.

Regards,

Laurent Pinchart
