* Propagating GFP_NOFS inside __vmalloc()
@ 2010-11-10 20:42 Ricardo M. Correia
2010-11-10 21:35 ` Ricardo M. Correia
2010-11-11 20:06 ` Andrew Morton
0 siblings, 2 replies; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-10 20:42 UTC (permalink / raw)
To: linux-mm; +Cc: Brian Behlendorf, Andreas Dilger
Hi,
As part of Lustre filesystem development, we are running into a
situation where we (sporadically) need to call into __vmalloc() from a
thread that processes I/Os to disk (it's a long story).
In general, this would be fine as long as we pass GFP_NOFS to
__vmalloc(), but the problem is that even if we pass this flag, vmalloc
itself sometimes allocates memory with GFP_KERNEL.
This is not OK for us because the GFP_KERNEL allocations may go into the
synchronous reclaim path and try to write out data to disk (in order to
free memory for the allocation), which leads to a deadlock because those
reclaims may themselves depend on the thread that is doing the
allocation to make forward progress (which it can't, because it's
blocked trying to allocate the memory).
Andreas suggested that this may be a bug in __vmalloc(), in the sense
that it's not propagating the gfp_mask that the caller requested to all
allocations that happen inside it.
On the latest torvalds git tree, for x86-64, the path for these
GFP_KERNEL allocations goes something like this:
__vmalloc()
__vmalloc_node()
__vmalloc_area_node()
map_vm_area()
vmap_page_range()
vmap_pud_range()
vmap_pmd_range()
pmd_alloc()
__pmd_alloc()
pmd_alloc_one()
get_zeroed_page() <-- GFP_KERNEL
vmap_pte_range()
pte_alloc_kernel()
__pte_alloc_kernel()
pte_alloc_one_kernel()
get_free_page() <-- GFP_KERNEL
We've actually observed these deadlocks during testing (although in an
older kernel).
Andreas suggested that we should fix __vmalloc() to propagate the
caller-passed gfp_mask all the way to those allocating functions. This
may require fixing these interfaces for all architectures.
I also suggested that it would be nice to have a per-task
gfp_allowed_mask, similar to the existing gfp_allowed_mask /
set_gfp_allowed_mask() interface that exists in the kernel, but instead
of being global to the entire system, it would be stored in the thread's
task_struct and only apply in the context of the current thread.
This would allow us to call a function when our I/O threads are created,
say set_thread_gfp_allowed_mask(~__GFP_IO), to make sure that any kernel
allocations that happen in the context of those threads would have
__GFP_IO masked out.
I am willing to code and send out either of those two patches (the vmalloc
fix and/or the per-thread gfp mask), and I was wondering if this is
something you'd be willing to accept into the upstream kernel, or if you
have any other ideas as to how to prevent all __GFP_IO allocations from
the kernel itself in the context of threads that perform I/O.
(Please reply-to-all as we are not subscribed to linux-mm).
Thanks,
Ricardo
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-10 20:42 Propagating GFP_NOFS inside __vmalloc() Ricardo M. Correia
@ 2010-11-10 21:35 ` Ricardo M. Correia
2010-11-10 22:10 ` Dave Chinner
2010-11-11 20:06 ` Andrew Morton
1 sibling, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-10 21:35 UTC (permalink / raw)
To: linux-mm, linux-fsdevel; +Cc: Brian Behlendorf, Andreas Dilger
On Wed, 2010-11-10 at 21:42 +0100, Ricardo M. Correia wrote:
> Hi,
>
> As part of Lustre filesystem development, we are running into a
> situation where we (sporadically) need to call into __vmalloc() from a
> thread that processes I/Os to disk (it's a long story).
>
> In general, this would be fine as long as we pass GFP_NOFS to
> __vmalloc(), but the problem is that even if we pass this flag, vmalloc
> itself sometimes allocates memory with GFP_KERNEL.
By the way, it seems that existing users in Linus' tree may be
vulnerable to the same bug that we experienced:
In GFS2:
8 1253 fs/gfs2/dir.c <<gfs2_alloc_sort_buffer>>
ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);
The Ceph filesystem:
20 22 net/ceph/buffer.c <<ceph_buffer_new>>
b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
.. which can be called from:
3 560 fs/ceph/inode.c <<fill_inode>>
xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
In the MM code:
18 5184 mm/page_alloc.c <<alloc_large_system_hash>>
table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
All of these seem to be vulnerable to GFP_KERNEL allocations from within
__vmalloc(), at least on x86-64 (as I've detailed below).
Thanks,
Ricardo
>
> This is not OK for us because the GFP_KERNEL allocations may go into the
> synchronous reclaim path and try to write out data to disk (in order to
> free memory for the allocation), which leads to a deadlock because those
> reclaims may themselves depend on the thread that is doing the
> allocation to make forward progress (which it can't, because it's
> blocked trying to allocate the memory).
>
> Andreas suggested that this may be a bug in __vmalloc(), in the sense
> that it's not propagating the gfp_mask that the caller requested to all
> allocations that happen inside it.
>
> On the latest torvalds git tree, for x86-64, the path for these
> GFP_KERNEL allocations goes something like this:
>
> __vmalloc()
> __vmalloc_node()
> __vmalloc_area_node()
> map_vm_area()
> vmap_page_range()
> vmap_pud_range()
> vmap_pmd_range()
> pmd_alloc()
> __pmd_alloc()
> pmd_alloc_one()
> get_zeroed_page() <-- GFP_KERNEL
> vmap_pte_range()
> pte_alloc_kernel()
> __pte_alloc_kernel()
> pte_alloc_one_kernel()
> get_free_page() <-- GFP_KERNEL
>
> We've actually observed these deadlocks during testing (although in an
> older kernel).
>
> Andreas suggested that we should fix __vmalloc() to propagate the
> caller-passed gfp_mask all the way to those allocating functions. This
> may require fixing these interfaces for all architectures.
>
> I also suggested that it would be nice to have a per-task
> gfp_allowed_mask, similar to the existing gfp_allowed_mask /
> set_gfp_allowed_mask() interface that exists in the kernel, but instead
> of being global to the entire system, it would be stored in the thread's
> task_struct and only apply in the context of the current thread.
>
> This would allow us to call a function when our I/O threads are created,
> say set_thread_gfp_allowed_mask(~__GFP_IO), to make sure that any kernel
> allocations that happen in the context of those threads would have
> __GFP_IO masked out.
>
> I am willing to code and send out any of those 2 patches (the vmalloc
> fix and/or the per-thread gfp mask), and I was wondering if this is
> something you'd be willing to accept into the upstream kernel, or if you
> have any other ideas as to how to prevent all __GFP_IO allocations from
> the kernel itself in the context of threads that perform I/O.
>
> (Please reply-to-all as we are not subscribed to linux-mm).
>
> Thanks,
> Ricardo
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-10 21:35 ` Ricardo M. Correia
@ 2010-11-10 22:10 ` Dave Chinner
0 siblings, 0 replies; 22+ messages in thread
From: Dave Chinner @ 2010-11-10 22:10 UTC (permalink / raw)
To: Ricardo M. Correia
Cc: linux-mm, linux-fsdevel, Brian Behlendorf, Andreas Dilger
On Wed, Nov 10, 2010 at 10:35:55PM +0100, Ricardo M. Correia wrote:
> On Wed, 2010-11-10 at 21:42 +0100, Ricardo M. Correia wrote:
> > Hi,
> >
> > As part of Lustre filesystem development, we are running into a
> > situation where we (sporadically) need to call into __vmalloc() from a
> > thread that processes I/Os to disk (it's a long story).
> >
> > In general, this would be fine as long as we pass GFP_NOFS to
> > __vmalloc(), but the problem is that even if we pass this flag, vmalloc
> > itself sometimes allocates memory with GFP_KERNEL.
>
> By the way, it seems that existing users in Linus' tree may be
> vulnerable to the same bug that we experienced:
>
> In GFS2:
> 8 1253 fs/gfs2/dir.c <<gfs2_alloc_sort_buffer>>
> ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);
>
> The Ceph filesystem:
> 20 22 net/ceph/buffer.c <<ceph_buffer_new>>
> b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
> .. which can be called from:
> 3 560 fs/ceph/inode.c <<fill_inode>>
> xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
>
> In the MM code:
> 18 5184 mm/page_alloc.c <<alloc_large_system_hash>>
> table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
>
> All of these seem to be vulnerable to GFP_KERNEL allocations from within
> __vmalloc(), at least on x86-64 (as I've detailed below).
Hmmm. I'd say there's a definite possibility that vm_map_ram() as
called from in fs/xfs/linux-2.6/xfs_buf.c needs to use GFP_NOFS
allocation, too. Currently vm_map_ram() just uses GFP_KERNEL
internally, but is certainly being called in contexts where we don't
allow recursion (e.g. in a transaction) so probably should allow a
gfp mask to be passed in....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-10 20:42 Propagating GFP_NOFS inside __vmalloc() Ricardo M. Correia
2010-11-10 21:35 ` Ricardo M. Correia
@ 2010-11-11 20:06 ` Andrew Morton
2010-11-11 22:02 ` Ricardo M. Correia
1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2010-11-11 20:06 UTC (permalink / raw)
To: Ricardo M. Correia; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Wed, 10 Nov 2010 21:42:39 +0100
"Ricardo M. Correia" <ricardo.correia@oracle.com> wrote:
> Hi,
>
> As part of Lustre filesystem development, we are running into a
> situation where we (sporadically) need to call into __vmalloc() from a
> thread that processes I/Os to disk (it's a long story).
>
> In general, this would be fine as long as we pass GFP_NOFS to
> __vmalloc(), but the problem is that even if we pass this flag, vmalloc
> itself sometimes allocates memory with GFP_KERNEL.
>
> This is not OK for us because the GFP_KERNEL allocations may go into the
> synchronous reclaim path and try to write out data to disk (in order to
> free memory for the allocation), which leads to a deadlock because those
> reclaims may themselves depend on the thread that is doing the
> allocation to make forward progress (which it can't, because it's
> blocked trying to allocate the memory).
>
> Andreas suggested that this may be a bug in __vmalloc(), in the sense
> that it's not propagating the gfp_mask that the caller requested to all
> allocations that happen inside it.
>
> On the latest torvalds git tree, for x86-64, the path for these
> GFP_KERNEL allocations goes something like this:
>
> __vmalloc()
> __vmalloc_node()
> __vmalloc_area_node()
> map_vm_area()
> vmap_page_range()
> vmap_pud_range()
> vmap_pmd_range()
> pmd_alloc()
> __pmd_alloc()
> pmd_alloc_one()
> get_zeroed_page() <-- GFP_KERNEL
> vmap_pte_range()
> pte_alloc_kernel()
> __pte_alloc_kernel()
> pte_alloc_one_kernel()
> get_free_page() <-- GFP_KERNEL
>
> We've actually observed these deadlocks during testing (although in an
> older kernel).
Bug.
> Andreas suggested that we should fix __vmalloc() to propagate the
> caller-passed gfp_mask all the way to those allocating functions. This
> may require fixing these interfaces for all architectures.
>
> I also suggested that it would be nice to have a per-task
> gfp_allowed_mask, similar to the existing gfp_allowed_mask /
> set_gfp_allowed_mask() interface that exists in the kernel, but instead
> of being global to the entire system, it would be stored in the thread's
> task_struct and only apply in the context of the current thread.
Possibly we should have done pass-via-task_struct for the gfp mode
everywhere. Fifteen years ago... Sites which modify the mask should
do a save/restore on the stack, so there would be no stack savings, but
I suspect there would be some nice text size savings from all that
pass-it-on-to-the-next-guy stuff we do. Note that this approach could
perhaps be used to move PF_MEMALLOC, PF_KSWAPD and maybe a few other
things into task_struct.gfp_flags.
But that's history. Before embarking on that path (and introducing a
mixture of both forms of argument-passing) we should take a look at how
big and ugly it is to fix this bug via the normal passing convention,
so we can make a better-informed decision. Is that something which
you've looked into in any detail?
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 20:06 ` Andrew Morton
@ 2010-11-11 22:02 ` Ricardo M. Correia
2010-11-11 22:25 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-11 22:02 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Thu, 2010-11-11 at 12:06 -0800, Andrew Morton wrote:
> > I also suggested that it would be nice to have a per-task
> > gfp_allowed_mask, similar to the existing gfp_allowed_mask /
> > set_gfp_allowed_mask() interface that exists in the kernel, but instead
> > of being global to the entire system, it would be stored in the thread's
> > task_struct and only apply in the context of the current thread.
>
> Possibly we should have done pass-via-task_struct for the gfp mode
> everywhere. Fifteen years ago... Sites which modify the mask should
> do a save/restore on the stack, so there would be no stack savings, but
> I suspect there would be some nice text size savings from all that
> pass-it-on-to-the-next-guy stuff we do. Note that this approach could
> perhaps be used to move PF_MEMALLOC, PF_KSWAPD and maybe a few other
> things into task_struct.gfp_flags.
Yes.. makes sense to me.
> But that's history. Before embarking on that path (and introducing a
> mixture of both forms of argument-passing) we should take a look at how
> big and ugly it is to fix this bug via the normal passing convention,
> so we can make a better-informed decision. Is that something which
> you've looked into in any detail?
Ok, I took a more detailed look... it seems we have to change at least
these interfaces in order to make __vmalloc() propagate the gfp_mask:
Function/macro (dependency): references
map_vm_area (__vmalloc_area_node): 7
vmap_page_range (map_vm_area): 3
vmap_page_range_noflush (vmap_page_range): 3
vmap_pud_range (vmap_page_range_noflush): 2
pud_alloc (vmap_pud_range): 25
__pud_alloc (pud_alloc): 4
pud_alloc_one (__pud_alloc): 8
crst_table_alloc (pud_alloc_one, pmd_alloc_one): 6
vmap_pmd_range (vmap_pud_range): 2
pmd_alloc (vmap_pmd_range): 31
__pmd_alloc (pmd_alloc): 5
pmd_alloc_one (__pmd_alloc): 28
vmap_pte_range (vmap_pmd_range): 2
get_pointer_table (pmd_alloc_one): 4
srmmu_pmd_alloc_one (pmd_alloc_one): 2
sun4c_pmd_alloc_one (pmd_alloc_one): 2
pte_alloc_kernel (vmap_pte_range): 14
pte_alloc_one (pmd_alloc_one, pte_alloc_one_kernel): 38
__pte_alloc_kernel (pte_alloc_kernel): 3
pte_alloc_one_kernel (pte_alloc_one, __pte_alloc_kernel): 38
page_table_alloc (pte_alloc_one, pte_alloc_one_kernel): 5
srmmu_pte_alloc_one (pte_alloc_one): 2
sun4c_pte_alloc_one (pte_alloc_one): 2
srmmu_pte_alloc_one_kernel (pte_alloc_one_kernel): 3
sun4c_pte_alloc_one_kernel (pte_alloc_one_kernel, sun4c_pte_alloc_one): 3
By looking at the number of references, we can get a rough idea of the
number of LoC that needs to be changed, but this doesn't take into
account changing the implementation of the leaf allocating functions
themselves (e.g. pte_alloc_one_kernel, pmd_alloc_one, ..). Since these
functions have one implementation for each architecture, we're looking
at changing perhaps more than a hundred function implementations...
Also, it's entirely possible that I may have missed something, since I
looked at all this manually (well, with the help of cscope).
There was one relatively extensive call chain which I didn't look into
in much detail: pte_alloc_one_kernel() -> early_get_page() ->
alloc_bootmem_pages() / memblock_alloc_base() -> ....
The names seem to indicate that there are allocations going on there,
but from a quick glance I only saw a couple of them with GFP_NOWAIT (I
wouldn't be surprised if I missed others).
It's also interesting that some of the leaf allocating functions
sometimes take different flags on different architectures...
So do you think we should change all that?
Or do you prefer the per-task mask? Or maybe even both? :-)
Thanks,
Ricardo
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 22:02 ` Ricardo M. Correia
@ 2010-11-11 22:25 ` Andrew Morton
2010-11-11 22:45 ` Ricardo M. Correia
2010-11-15 17:01 ` Ricardo M. Correia
0 siblings, 2 replies; 22+ messages in thread
From: Andrew Morton @ 2010-11-11 22:25 UTC (permalink / raw)
To: Ricardo M. Correia; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Thu, 11 Nov 2010 23:02:04 +0100
"Ricardo M. Correia" <ricardo.correia@oracle.com> wrote:
> On Thu, 2010-11-11 at 12:06 -0800, Andrew Morton wrote:
> > > I also suggested that it would be nice to have a per-task
> > > gfp_allowed_mask, similar to the existing gfp_allowed_mask /
> > > set_gfp_allowed_mask() interface that exists in the kernel, but instead
> > > of being global to the entire system, it would be stored in the thread's
> > > task_struct and only apply in the context of the current thread.
> >
> > Possibly we should have done pass-via-task_struct for the gfp mode
> > everywhere. Fifteen years ago... Sites which modify the mask should
> > do a save/restore on the stack, so there would be no stack savings, but
> > I suspect there would be some nice text size savings from all that
> > pass-it-on-to-the-next-guy stuff we do. Note that this approach could
> > perhaps be used to move PF_MEMALLOC, PF_KSWAPD and maybe a few other
> > things into task_struct.gfp_flags.
>
> Yes.. makes sense to me.
>
> > But that's history. Before embarking on that path (and introducing a
> > mixture of both forms of argument-passing) we should take a look at how
> > big and ugly it is to fix this bug via the normal passing convention,
> > so we can make a better-informed decision. Is that something which
> > you've looked into in any detail?
>
> Ok, I took a more detailed look... it seems we have to change at least
> these interfaces in order to make __vmalloc() propagate the gfp_mask:
>
> Function/macro (dependency): references
>
> map_vm_area (__vmalloc_area_node): 7
> vmap_page_range (map_vm_area): 3
> vmap_page_range_noflush (vmap_page_range): 3
> vmap_pud_range (vmap_page_range_noflush): 2
> pud_alloc (vmap_pud_range): 25
> __pud_alloc (pud_alloc): 4
> pud_alloc_one (__pud_alloc): 8
> crst_table_alloc (pud_alloc_one, pmd_alloc_one): 6
> vmap_pmd_range (vmap_pud_range): 2
> pmd_alloc (vmap_pmd_range): 31
> __pmd_alloc (pmd_alloc): 5
> pmd_alloc_one (__pmd_alloc): 28
> vmap_pte_range (vmap_pmd_range): 2
> get_pointer_table (pmd_alloc_one): 4
> srmmu_pmd_alloc_one (pmd_alloc_one): 2
> sun4c_pmd_alloc_one (pmd_alloc_one): 2
> pte_alloc_kernel (vmap_pte_range): 14
> pte_alloc_one (pmd_alloc_one, pte_alloc_one_kernel): 38
> __pte_alloc_kernel (pte_alloc_kernel): 3
> pte_alloc_one_kernel (pte_alloc_one, __pte_alloc_kernel): 38
> page_table_alloc (pte_alloc_one, pte_alloc_one_kernel): 5
> srmmu_pte_alloc_one (pte_alloc_one): 2
> sun4c_pte_alloc_one (pte_alloc_one): 2
> srmmu_pte_alloc_one_kernel (pte_alloc_one_kernel): 3
> sun4c_pte_alloc_one_kernel (pte_alloc_one_kernel, sun4c_pte_alloc_one): 3
>
> By looking at the number of references, we can get a rough idea of the
> number of LoC that needs to be changed, but this doesn't take into
> account changing the implementation of the leaf allocating functions
> themselves (e.g. pte_alloc_one_kernel, pmd_alloc_one, ..). Since these
> functions have one implementation for each architecture, we're looking
> at changing perhaps more than a hundred function implementations...
>
> Also, it's entirely possible that I may have missed something, since I
> looked at all this manually (well, with the help of cscope).
>
> There was one relatively extensive call chain which I didn't look into
> with much detail: pte_alloc_one_kernel() -> early_get_page () ->
> alloc_bootmem_pages() / memblock_alloc_base() -> ....
>
> The names seem to indicate that there are allocations going on there,
> but from a quick glance I only saw a couple of them with GFP_NOWAIT (I
> wouldn't be surprised if I missed others).
>
> It's also interesting that some of the leaf allocating functions
> sometimes take different flags on different architectures...
>
> So do you think we should change all that?
Oh God, what have you done :(
No, I don't think we want to add a gfp_t to all of that code to fix one
stupid bug in vmalloc().
> Or do you prefer the per-task mask? Or maybe even both? :-)
Right now I'm thinking that the thing to do is to do the
pass-gfp_t-via-task_struct thing.
Which really commits us to doing that *everywhere*. We won't need to
change every kmalloc()/etc callsite, but the conversion should probably
be done at the as-soon-as-we-enter-core-mm boundary. Where "enter"
means "start running non-inlined code".
And then we can set current->gfp_mask to GFP_ATOMIC when we take an
interrupt, or take a spinlock.
And leave it at GFP_KERNEL when in process context.
And switch GFP_KERNEL to GFP_NOFS in the VM.
And switch to GFP_NOIO in the block layer.
So the allocation mode becomes implicit to the task state, so callers
usually don't need to track it.
So, ultimately, kmalloc(), alloc_pages() etc don't actually need a mode
arg at all. We'll need new, special functions which _do_ take the
gfp_t but they will be rarely-called specialised things.
And probably we'll need interfaces like

	gfp_t mm_set_alloc_mode(gfp_t flags);
	void mm_restore_alloc_mode(gfp_t flags);

	gfp_t flags;

	flags = mm_set_alloc_mode(GFP_NOIO);
	...
	mm_restore_alloc_mode(flags);
argh, someone save us.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 22:25 ` Andrew Morton
@ 2010-11-11 22:45 ` Ricardo M. Correia
2010-11-11 23:19 ` Ricardo M. Correia
2010-11-15 17:01 ` Ricardo M. Correia
1 sibling, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-11 22:45 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Thu, 2010-11-11 at 14:25 -0800, Andrew Morton wrote:
> And then we can set current->gfp_mask to GFP_ATOMIC when we take an
> interrupt, or take a spinlock.
>
> And leave it at GFP_KERNEL when in process context.
>
> And switch GFP_KERNEL to GFP_NOFS in the VM.
>
> And switch to GFP_NOIO in the block layer.
>
> So the allocation mode becomes implicit to the task state, so callers
> usually don't need to track it.
>
> So, ultimately, kmalloc(), alloc_pages() etc don't actually need a mode
> arg at all. We'll need new, special functions which _do_ take the
> gfp_t but they will be rarely-called specialised things.
>
> And probably we'll need interfaces like
>
> gfp_t mm_set_alloc_mode(gfp_t flags);
> void mm_restore_alloc_mode(gfp_t flags);
>
> gfp_t flags;
>
> flags = mm_set_alloc_mode(GFP_NOIO);
> ...
> mm_restore_alloc_mode(flags);
Actually, I think it may not be that simple...
Looking at some of the __GFP_* flags, it seems that some of them look
like allocation "options", i.e. something we may or may not want to do
on a certain allocation, while others look more like "capabilities",
i.e. something that we can or cannot do in a certain context.
For example, __GFP_ZERO, __GFP_REPEAT, __GFP_HIGHMEM, ... are things
that we'd probably want a caller to specify on each allocation, because
only the caller knows what it actually needs.
Others, like __GFP_FS, __GFP_IO, __GFP_WAIT, are things that we can or
cannot do depending on the context that we're in.
The latter seem worth tracking in the task_struct, but the former I
think we'd still want to pass to kmalloc() on each invocation.
Fortunately, if we put the latter ones in the task_struct, it removes
the need to propagate gfp_flags from function to function.
And contrary to what you said previously (which at the time sounded
correct to me), this can actually save a lot of stack space, especially
on register-starved architectures, because the only places where we
need to save the flags on the stack are when we enter/exit a certain
context, as opposed to always having to pass the gfp_mask down the
call stack like we do now.
> argh, someone save us.
:-)
Thanks,
Ricardo
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 22:45 ` Ricardo M. Correia
@ 2010-11-11 23:19 ` Ricardo M. Correia
2010-11-11 23:27 ` Andrew Morton
0 siblings, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-11 23:19 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Thu, 2010-11-11 at 23:45 +0100, Ricardo M. Correia wrote:
> On Thu, 2010-11-11 at 14:25 -0800, Andrew Morton wrote:
> > And then we can set current->gfp_mask to GFP_ATOMIC when we take an
> > interrupt, or take a spinlock.
Also, doesn't this mean that spin_lock() would now have to save
current->gfp_flags on the stack, so that we can restore the allocation
mode when we do spin_unlock()?
- Ricardo
> >
> > And leave it at GFP_KERNEL when in process context.
> >
> > And switch GFP_KERNEL to GFP_NOFS in the VM.
> >
> > And switch to GFP_NOIO in the block layer.
> >
> > So the allocation mode becomes implicit to the task state, so callers
> > usually don't need to track it.
> >
> > So, ultimately, kmalloc(), alloc_pages() etc don't actually need a mode
> > arg at all. We'll need new, special functions which _do_ take the
> > gfp_t but they will be rarely-called specialised things.
> >
> > And probably we'll need interfaces like
> >
> > gfp_t mm_set_alloc_mode(gfp_t flags);
> > void mm_restore_alloc_mode(gfp_t flags);
> >
> > gfp_t flags;
> >
> > flags = mm_set_alloc_mode(GFP_NOIO);
> > ...
> > mm_restore_alloc_mode(flags);
>
> Actually, I think it may not be that simple...
>
> Looking at some of the __GFP_* flags, it seems that some of them look
> like allocation "options", i.e. something we may want or may not want to
> do on a certain allocation, others look more like "capabilities", i.e.
> something that we can or cannot do in a certain context.
>
> For example, __GFP_ZERO, __GFP_REPEAT, __GFP_HIGHMEM, ... is something
> that we'd probably want a caller to specify on each allocation, because
> only he knows what he actually wants to do.
>
> Others, like __GFP_FS, __GFP_IO, __GFP_WAIT, are things that we either
> can or cannot do, depending on the context that we're in.
>
> The latter ones seem worth tracking in the task_struct, but the former
> ones I think we'd still want to pass to kmalloc() on each invocation.
>
> Fortunately, if we put the latter ones in the task_struct, it removes
> the need for having to propagate gfp_flags from function to function.
>
> And contrary to what you said previously (which at the time sounded
> correct to me), this can actually save a lot of stack space, especially
> on more register-starved architectures, because the only places where we
> need to save the flags on the stack is when we enter/exit a certain
> context, as opposed to always having to pass the gfp_mask down
> the call stack like we do now.
>
> > argh, someone save us.
>
> :-)
>
> Thanks,
> Ricardo
>
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 23:19 ` Ricardo M. Correia
@ 2010-11-11 23:27 ` Andrew Morton
2010-11-11 23:29 ` Ricardo M. Correia
0 siblings, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2010-11-11 23:27 UTC (permalink / raw)
To: Ricardo M. Correia; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Fri, 12 Nov 2010 00:19:54 +0100
"Ricardo M. Correia" <ricardo.correia@oracle.com> wrote:
> On Thu, 2010-11-11 at 23:45 +0100, Ricardo M. Correia wrote:
> > On Thu, 2010-11-11 at 14:25 -0800, Andrew Morton wrote:
> > > And then we can set current->gfp_mask to GFP_ATOMIC when we take an
> > > interrupt, or take a spinlock.
>
> Also, doesn't this mean that spin_lock() would now have to save
> current->gfp_flags in the stack?
>
> So that we can restore the allocation mode when we do spin_unlock()?
If we wanted to go that far, yes. Who's up for editing every spin_lock()
callsite in the kernel?
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 23:27 ` Andrew Morton
@ 2010-11-11 23:29 ` Ricardo M. Correia
0 siblings, 0 replies; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-11 23:29 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
On Thu, 2010-11-11 at 15:27 -0800, Andrew Morton wrote:
> On Fri, 12 Nov 2010 00:19:54 +0100
> "Ricardo M. Correia" <ricardo.correia@oracle.com> wrote:
>
> > On Thu, 2010-11-11 at 23:45 +0100, Ricardo M. Correia wrote:
> > > On Thu, 2010-11-11 at 14:25 -0800, Andrew Morton wrote:
> > > > And then we can set current->gfp_mask to GFP_ATOMIC when we take an
> > > > interrupt, or take a spinlock.
> >
> > Also, doesn't this mean that spin_lock() would now have to save
> > current->gfp_flags in the stack?
> >
> > So that we can restore the allocation mode when we do spin_unlock()?
>
> If we wanted to go that far, yes. Who's up for editing every spin_lock()
> callsite in the kernel?
Hmm... Coccinelle? ;) [1]
[1] http://coccinelle.lip6.fr/
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-11 22:25 ` Andrew Morton
2010-11-11 22:45 ` Ricardo M. Correia
@ 2010-11-15 17:01 ` Ricardo M. Correia
2010-11-15 21:28 ` David Rientjes
1 sibling, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-15 17:01 UTC (permalink / raw)
To: Andrew Morton; +Cc: linux-mm, Brian Behlendorf, Andreas Dilger
[-- Attachment #1: Type: text/plain, Size: 1258 bytes --]
Hi Andrew,
On Thu, 2010-11-11 at 14:25 -0800, Andrew Morton wrote:
> > So do you think we should change all that?
>
> Oh God, what have you done :(
>
> No, I don't think we want to add a gfp_t to all of that code to fix one
> stupid bug in vmalloc().
>
> > Or do you prefer the per-task mask? Or maybe even both? :-)
>
> Right now I'm thinking that the thing to do is to do the
> pass-gfp_t-via-task_struct thing.
I have attached my first attempt to fix this in the easiest way I could
think of.
Please note that the code is untested at the moment (I didn't even try
to compile it yet) :-)
I would like to test it, but I would also like to get your feedback
first to make sure that I'm going in the right direction, at least.
I'm not sure if at this point in time we want to do what we discussed
before, e.g., making sure that the entire kernel uses this per-thread
mask whenever it switches context, or whenever it crosses into the mm
code boundary.
For now, and at least for us, I think my patch would suffice to fix the
vmalloc problem and additionally, we can also use the new per-thread
gfp_mask API to have a much better guarantee that our I/O threads never
allocate memory with __GFP_IO.
Please let me know what you think.
Thanks,
Ricardo
[-- Attachment #2: 0001-Fix-__vmalloc-to-always-respect-the-gfp-flags-that-t.patch --]
[-- Type: text/x-patch, Size: 0 bytes --]
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-15 17:01 ` Ricardo M. Correia
@ 2010-11-15 21:28 ` David Rientjes
2010-11-15 22:19 ` Ricardo M. Correia
2010-11-16 22:11 ` Andrew Morton
0 siblings, 2 replies; 22+ messages in thread
From: David Rientjes @ 2010-11-15 21:28 UTC (permalink / raw)
To: Ricardo M. Correia
Cc: Andrew Morton, linux-mm, Brian Behlendorf, Andreas Dilger
On Mon, 15 Nov 2010, Ricardo M. Correia wrote:
> When __vmalloc() / __vmalloc_area_node() calls map_vm_area(), the latter can
> allocate pages with GFP_KERNEL despite the caller of __vmalloc having requested
> a more strict gfp mask.
>
> We fix this by introducing a per-thread gfp_mask, similar to gfp_allowed_mask
> but which only applies to the current thread. __vmalloc_area_node() will now
> temporarily restrict the per-thread gfp_mask when it calls map_vm_area().
>
> This new per-thread gfp mask may also be used for other useful purposes, for
> example, after thread creation, to make sure that certain threads
> (e.g. filesystem I/O threads) never allocate memory with certain flags (e.g.
> __GFP_FS or __GFP_IO).
I dislike this approach not only for its performance degradation in core
areas like the page and slab allocators, but also because it requires full
knowledge of the callchain to determine the gfp flags of the allocation.
This will become nasty very quickly.
This proposal essentially defines an entirely new method for passing gfp
flags to the page allocator when it isn't strictly needed. I think the
problem you're addressing can be done in one of two ways:
- create lower-level functions in each arch that pass a gfp argument to
the allocator rather than hard-coded GFP_KERNEL, or
- avoid doing anything other than GFP_KERNEL allocations for __vmalloc():
the only current users are gfs2, ntfs, and ceph (the page allocator
__vmalloc() can be discounted since it's done at boot and GFP_ATOMIC
here has almost no chance of failing since the size is determined based
on what is available).
The first option really addresses the bug that you're running into and can
be addressed in a relatively simple way by redefining current users of
pmd_alloc_one(), for instance, as a form of a new lower-level
__pmd_alloc_one():
static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm,
				     unsigned long addr, gfp_t flags)
{
	return (pmd_t *)get_zeroed_page(flags);
}

static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
{
	return __pmd_alloc_one(mm, addr, GFP_KERNEL|__GFP_REPEAT);
}
and then using __pmd_alloc_one() in the vmalloc path with the passed mask
rather than pmd_alloc_one(). This _will_ be slightly intrusive because it
will require fixing up some short callchains to pass the appropriate mask,
that will be limited to the vmalloc code and arch code that currently does
unconditional GFP_KERNEL allocations. Both are bugs that you'll be
addressing for each architecture, so the intrusiveness of that change has
merit (and be sure to cc linux-arch@vger.kernel.org on it as well).
I only mention the second option because passing GFP_NOFS to __vmalloc()
for sufficiently large sizes has a much higher probability of failing if
you're running into issues where GFP_KERNEL is causing synchronous
reclaim. We may not be able to do any better in the contexts in which
gfs2, ntfs, and ceph use it without some sort of preallocation at an
earlier time, but the likelihood of those allocations failing is much
higher than for the typical vmalloc() that tries really hard with __GFP_REPEAT
to allocate the memory required.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-15 21:28 ` David Rientjes
@ 2010-11-15 22:19 ` Ricardo M. Correia
2010-11-15 22:50 ` David Rientjes
2010-11-16 22:11 ` Andrew Morton
1 sibling, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-15 22:19 UTC (permalink / raw)
To: David Rientjes; +Cc: Andrew Morton, linux-mm, Brian Behlendorf, Andreas Dilger
On Mon, 2010-11-15 at 13:28 -0800, David Rientjes wrote:
> This proposal essentially defines an entirely new method for passing gfp
> flags to the page allocator when it isn't strictly needed.
I don't see it as a way of passing gfp flags to the page allocator, but
rather as a way of restricting which gfp flags can be used (on a
per-thread basis). I agree it's not strictly needed.
> I think the
> problem you're addressing can be done in one of two ways:
>
> - create lower-level functions in each arch that pass a gfp argument to
> the allocator rather than hard-coded GFP_KERNEL, or
>
> - avoid doing anything other than GFP_KERNEL allocations for __vmalloc():
> the only current users are gfs2, ntfs, and ceph (the page allocator
> __vmalloc() can be discounted since it's done at boot and GFP_ATOMIC
> here has almost no chance of failing since the size is determined based
> on what is available).
>
>
> The first option really addresses the bug that you're running into and can
> be addressed in a relatively simple way by redefining current users of
> pmd_alloc_one(), for instance, as a form of a new lower-level
> __pmd_alloc_one():
>
> static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm,
> 				     unsigned long addr, gfp_t flags)
> {
> 	return (pmd_t *)get_zeroed_page(flags);
> }
>
> static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
> {
> 	return __pmd_alloc_one(mm, addr, GFP_KERNEL|__GFP_REPEAT);
> }
>
> and then using __pmd_alloc_one() in the vmalloc path with the passed mask
> rather than pmd_alloc_one().
Ok, this sounds OK in theory (I'd have to change all these
implementations for all architectures, but oh well..). But I have a
question about your proposal.. the call chain seems to go:
vmap_pmd_range()
pmd_alloc()
__pmd_alloc()
pmd_alloc_one()
So you want to change this so that vmap_pmd_range() calls your new
__pmd_alloc_one() instead of pmd_alloc_one().
But this means that pmd_alloc() and __pmd_alloc() would need to accept
the flag as well, right?
Probably we should define an alias for pmd_alloc() so that we don't have
to change all its callers to pass a flag. But __pmd_alloc() is already
taken, so what would you suggest we call it? _pmd_alloc()? :-)
> Both are bugs that you'll be
> addressing for each architecture, so the intrusiveness of that change has
> merit (and be sure to cc linux-arch@vger.kernel.org on it as well).
Ok.
> I only mention the second option because passing GFP_NOFS to __vmalloc()
> for sufficiently large sizes has a much higher probability of failing if
> you're running into issues where GFP_KERNEL is causing synchronous
> reclaim. We may not be able to do any better in the contexts in which
> gfs2, ntfs, and ceph use it without some sort of preallocation at an
> earlier time,
Indeed... I don't think we can do any better in Lustre either. The
preallocation also doesn't guarantee anything for us, because the amount
of memory that we may need to allocate can be huge in the worst case,
and it would be prohibitive to preallocate it.
For our case, I'd think it's better to either handle failure or somehow
retry until the allocation succeeds (if we know for sure that it will,
eventually).
> but the likelihood of those allocations failing is much
> higher than for the typical vmalloc() that tries really hard with __GFP_REPEAT
> to allocate the memory required.
Not sure what you mean by this.. I don't see a typical vmalloc()
using __GFP_REPEAT anywhere (apart from functions such as
pmd_alloc_one(), which in the code above you suggested to keep passing
__GFP_REPEAT).. am I missing something?
Thanks,
Ricardo
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-15 22:19 ` Ricardo M. Correia
@ 2010-11-15 22:50 ` David Rientjes
2010-11-15 23:30 ` Ricardo M. Correia
0 siblings, 1 reply; 22+ messages in thread
From: David Rientjes @ 2010-11-15 22:50 UTC (permalink / raw)
To: Ricardo M. Correia
Cc: Andrew Morton, linux-mm, Brian Behlendorf, Andreas Dilger
On Mon, 15 Nov 2010, Ricardo M. Correia wrote:
> > This proposal essentially defines an entirely new method for passing gfp
> > flags to the page allocator when it isn't strictly needed.
>
> I don't see it as a way of passing gfp flags to the page allocator, but
> rather as a way of restricting which gfp flags can be used (on a
> per-thread basis). I agree it's not strictly needed.
>
It restricts the flags that can be used in the current context, but that's
the whole purpose of gfp arguments such as __GFP_FS, __GFP_IO, and
__GFP_WAIT that control the reclaim behavior of the page allocator in the
first place: we want to use all strategies at our disposal that are
allowed in the current context to allocate the memory. Using this
interface to restrict only to __GFP_WAIT, for example, is the same as
passing __GFP_WAIT.
> > The first option really addresses the bug that you're running into and can
> > be addressed in a relatively simple way by redefining current users of
> > pmd_alloc_one(), for instance, as a form of a new lower-level
> > __pmd_alloc_one():
> >
> > static inline pmd_t *__pmd_alloc_one(struct mm_struct *mm,
> > 				     unsigned long addr, gfp_t flags)
> > {
> > 	return (pmd_t *)get_zeroed_page(flags);
> > }
> >
> > static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
> > {
> > 	return __pmd_alloc_one(mm, addr, GFP_KERNEL|__GFP_REPEAT);
> > }
> >
> > and then using __pmd_alloc_one() in the vmalloc path with the passed mask
> > rather than pmd_alloc_one().
>
> Ok, this sounds OK in theory (I'd have to change all these
> implementations for all architectures, but oh well..). But I have a
> question about your proposal.. the call chain seems to go:
>
Yes, all architectures would need to be changed because they all
unconditionally do GFP_KERNEL allocations which is buggy in the vmalloc
case, so the breadth of the change is certainly warranted.
> vmap_pmd_range()
> pmd_alloc()
> __pmd_alloc()
> pmd_alloc_one()
>
> So you want to change this so that vmap_pmd_range() calls your new
> __pmd_alloc_one() instead of pmd_alloc_one().
>
Yes, with the gfp flags that were passed to __vmalloc() to restrict the
lower-level allocations that you cited in your original email
(pmd_alloc_one() and pte_alloc_one_kernel()).
> But this means that pmd_alloc() and __pmd_alloc() would need to accept
> the flag as well, right?
>
> Probably we should define an alias for pmd_alloc() so that we don't have
> to change all its callers to pass a flag. But __pmd_alloc() is already
> taken, so what would you suggest we call it? _pmd_alloc()? :-)
>
Yeah, that's why I said it's slightly more intrusive than the single
example I gave, you need to modify the callchain to pass the flags in
other functions as well. Instead of extending the __*() functions with
more underscores like other places in the kernel (see mm/slab.c, for
instance), I'd suggest just appending _gfp() to their name so
__pmd_alloc() uses a new __pmd_alloc_gfp().
> Indeed... I don't think we can do any better in Lustre either. The
> preallocation also doesn't guarantee anything for us, because the amount
> of memory that we may need to allocate can be huge in the worst case,
> and it would be prohibitive to preallocate it.
>
I think it really depends on how much memory you plan to allocate without
allowing direct reclaim (which also prevents the oom killer from being
used). The ntfs use, for instance, falls back to vmalloc if the size is
greater than PAGE_SIZE to avoid high-order allocations failing due to
fragmentation, and memory compaction can't be used for GFP_NOFS either.
The best the VM can do is allocate the order-0 pages with GFP_NOFS, which
has a much higher likelihood of failing, so it's really to gfs2, ntfs,
and ceph's detriment that they don't have a workaround.
> For our case, I'd think it's better to either handle failure or somehow
> retry until the allocation succeeds (if we know for sure that it will,
> eventually).
>
If your use-case is going to block until this memory is available, there's
a serious problem that you'll need to address because nothing is going to
guarantee that memory will be freed unless something else is trying to
allocate memory and pages get written back or something gets killed as a
result. Strictly relying on that behavior is concerning, but it's not
something that can be fixed in the VM.
> > but the likelihood of those allocations failing is much
> > higher than for the typical vmalloc() that tries really hard with __GFP_REPEAT
> > to allocate the memory required.
>
> Not sure what you mean by this.. I don't see a typical vmalloc()
> using __GFP_REPEAT anywhere (apart from functions such as
> pmd_alloc_one(), which in the code above you suggested to keep passing
> __GFP_REPEAT).. am I missing something?
>
__GFP_REPEAT will retry the allocation indefinitely until the needed
amount of memory is reclaimed without considering the order of the
allocation; all orders of interest in your case are order-0, so it will
loop indefinitely until a single page is reclaimed which won't happen with
GFP_NOFS. Thus, passing the flag is the equivalent of asking the
allocator to loop forever until memory is available rather than failing
and returning to your error handling.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-15 22:50 ` David Rientjes
@ 2010-11-15 23:30 ` Ricardo M. Correia
2010-11-15 23:55 ` David Rientjes
0 siblings, 1 reply; 22+ messages in thread
From: Ricardo M. Correia @ 2010-11-15 23:30 UTC (permalink / raw)
To: David Rientjes; +Cc: Andrew Morton, linux-mm, Brian Behlendorf, Andreas Dilger
On Mon, 2010-11-15 at 14:50 -0800, David Rientjes wrote:
> Instead of extending the __*() functions with
> more underscores like other places in the kernel (see mm/slab.c, for
> instance), I'd suggest just appending _gfp() to their name so
> __pmd_alloc() uses a new __pmd_alloc_gfp().
Sounds good to me.
> > For our case, I'd think it's better to either handle failure or somehow
> > retry until the allocation succeeds (if we know for sure that it will,
> > eventually).
> >
>
> If your use-case is going to block until this memory is available, there's
> a serious problem that you'll need to address because nothing is going to
> guarantee that memory will be freed unless something else is trying to
> allocate memory and pages get written back or something gets killed as a
> result.
In our use case, this code is only used on servers that are used for
serving a Lustre filesystem and nothing else, so we don't have to worry
about things like run-away memory hogs / user applications.
Currently we do block until this memory is available. I'd rather not go
much into this, but the amount of memory that can be allocated by this
method at any point in time is huge but it's bounded.
Also, we have a slab reclaim callback that signals a dedicated thread,
which asynchronously frees memory (it would free synchronously if
possible, but unfortunately it's not).
This thread is able to potentially free GBs of memory if necessary, and
therefore allow the vmalloc allocations in the I/O path to succeed
eventually. We know this because we limit the amount of memory that can
be allocated and nothing else can use a significant amount of memory on
our systems.
I know this is not how you'd typically do this, but we also have other
constraints (which again, I'd rather not go into) which makes this our
preferred solution.
> Strictly relying on that behavior is concerning, but it's not
> something that can be fixed in the VM.
>
> > Not sure what you mean by this.. I don't see a typical vmalloc()
> > using __GFP_REPEAT anywhere (apart from functions such as
> > pmd_alloc_one(), which in the code above you suggested to keep passing
> > __GFP_REPEAT).. am I missing something?
> >
>
> __GFP_REPEAT will retry the allocation indefinitely until the needed
> amount of memory is reclaimed without considering the order of the
> allocation; all orders of interest in your case are order-0, so it will
> loop indefinitely until a single page is reclaimed which won't happen with
> GFP_NOFS. Thus, passing the flag is the equivalent of asking the
> allocator to loop forever until memory is available rather than failing
> and returning to your error handling.
When you say loop forever, you don't mean in a busy loop, right?
Assuming we sleep in this loop (which AFAICS it does), then it's OK for
us because memory will be freed asynchronously.
If it didn't sleep then it'd be more concerning because all CPUs could
enter this loop and we'd deadlock..
Anyway, I will try the approach that you suggested and send out a new
patch.
Thanks!
- Ricardo
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-15 23:30 ` Ricardo M. Correia
@ 2010-11-15 23:55 ` David Rientjes
0 siblings, 0 replies; 22+ messages in thread
From: David Rientjes @ 2010-11-15 23:55 UTC (permalink / raw)
To: Ricardo M. Correia
Cc: Andrew Morton, linux-mm, Brian Behlendorf, Andreas Dilger
On Tue, 16 Nov 2010, Ricardo M. Correia wrote:
> > __GFP_REPEAT will retry the allocation indefinitely until the needed
> > amount of memory is reclaimed without considering the order of the
> > allocation; all orders of interest in your case are order-0, so it will
> > loop indefinitely until a single page is reclaimed which won't happen with
> > GFP_NOFS. Thus, passing the flag is the equivalent of asking the
> > allocator to loop forever until memory is available rather than failing
> > and returning to your error handling.
>
> When you say loop forever, you don't mean in a busy loop, right?
> Assuming we sleep in this loop (which AFAICS it does), then it's OK for
> us because memory will be freed asynchronously.
>
Yes, __GFP_REPEAT only takes effect if the allocation is blockable, so the
allocator will reschedule during the loop but will not return until the
allocation succeeds; that matters here because GFP_NOFS significantly
impacts the ability to reclaim memory.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-15 21:28 ` David Rientjes
2010-11-15 22:19 ` Ricardo M. Correia
@ 2010-11-16 22:11 ` Andrew Morton
2010-11-17 7:18 ` Andreas Dilger
1 sibling, 1 reply; 22+ messages in thread
From: Andrew Morton @ 2010-11-16 22:11 UTC (permalink / raw)
To: David Rientjes
Cc: Ricardo M. Correia, linux-mm, Brian Behlendorf, Andreas Dilger
On Mon, 15 Nov 2010 13:28:54 -0800 (PST)
David Rientjes <rientjes@google.com> wrote:
> - avoid doing anything other than GFP_KERNEL allocations for __vmalloc():
> the only current users are gfs2, ntfs, and ceph (the page allocator
> __vmalloc() can be discounted since it's done at boot and GFP_ATOMIC
> here has almost no chance of failing since the size is determined based
> on what is available).
^^ this
Using vmalloc anywhere is lame.
Using anything weaker than GFP_KERNEL is lame.
Stomping out vmalloc callsites and stomping out non-GFP_KERNEL callers
will result in a better kernel *regardless* of this bug.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-16 22:11 ` Andrew Morton
@ 2010-11-17 7:18 ` Andreas Dilger
2010-11-17 7:24 ` Andrew Morton
2010-11-17 7:37 ` David Rientjes
0 siblings, 2 replies; 22+ messages in thread
From: Andreas Dilger @ 2010-11-17 7:18 UTC (permalink / raw)
To: Andrew Morton
Cc: David Rientjes, Ricardo M. Correia, linux-mm, Brian Behlendorf
On 2010-11-16, at 16:11, Andrew Morton wrote:
> On Mon, 15 Nov 2010 13:28:54 -0800 (PST)
> David Rientjes <rientjes@google.com> wrote:
>
>> - avoid doing anything other than GFP_KERNEL allocations for __vmalloc():
>> the only current users are gfs2, ntfs, and ceph (the page allocator
>> __vmalloc() can be discounted since it's done at boot and GFP_ATOMIC
>> here has almost no chance of failing since the size is determined based
>> on what is available).
>
> ^^ this
>
> Using vmalloc anywhere is lame.
I agree. What we really want is 1MB kmalloc() to work... :-/
Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-17 7:18 ` Andreas Dilger
@ 2010-11-17 7:24 ` Andrew Morton
2010-11-17 7:37 ` David Rientjes
1 sibling, 0 replies; 22+ messages in thread
From: Andrew Morton @ 2010-11-17 7:24 UTC (permalink / raw)
To: Andreas Dilger
Cc: David Rientjes, Ricardo M. Correia, linux-mm, Brian Behlendorf
On Wed, 17 Nov 2010 01:18:27 -0600 Andreas Dilger <andreas.dilger@oracle.com> wrote:
> On 2010-11-16, at 16:11, Andrew Morton wrote:
> > On Mon, 15 Nov 2010 13:28:54 -0800 (PST)
> > David Rientjes <rientjes@google.com> wrote:
> >
> >> - avoid doing anything other than GFP_KERNEL allocations for __vmalloc():
> >> the only current users are gfs2, ntfs, and ceph (the page allocator
> >> __vmalloc() can be discounted since it's done at boot and GFP_ATOMIC
> >> here has almost no chance of failing since the size is determined based
> >> on what is available).
> >
> > ^^ this
> >
> > Using vmalloc anywhere is lame.
>
> I agree. What we really want is 1MB kmalloc() to work... :-/
meh. Thinking that you require 1MB of virtually contiguous memory in
kernel code is lame.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-17 7:18 ` Andreas Dilger
2010-11-17 7:24 ` Andrew Morton
@ 2010-11-17 7:37 ` David Rientjes
2010-11-17 9:04 ` Christoph Hellwig
1 sibling, 1 reply; 22+ messages in thread
From: David Rientjes @ 2010-11-17 7:37 UTC (permalink / raw)
To: Andreas Dilger
Cc: Andrew Morton, Ricardo M. Correia, linux-mm, Brian Behlendorf
On Wed, 17 Nov 2010, Andreas Dilger wrote:
> >> - avoid doing anything other than GFP_KERNEL allocations for __vmalloc():
> >> the only current users are gfs2, ntfs, and ceph (the page allocator
> >> __vmalloc() can be discounted since it's done at boot and GFP_ATOMIC
> >> here has almost no chance of failing since the size is determined based
> >> on what is available).
> >
> > ^^ this
> >
> > Using vmalloc anywhere is lame.
>
> I agree. What we really want is 1MB kmalloc() to work... :-/
>
Order-8 allocations already have a higher likelihood of succeeding
because of memory compaction, which was explicitly targeted to aid in
order-9 hugepage allocations. The problem is that it's useless for
GFP_NOFS.
I think removing gfp_t arguments from all of the public vmalloc interface
will inevitably be where we go with this and everything will assume
GFP_KERNEL | __GFP_HIGHMEM.
If you _really_ need 1MB of physically contiguous memory, then you'll need
to find a way to do it in a reclaimable context. If we actually can
remove the dependency that gfs2, ntfs, and ceph have in the kernel.org
kernel, then this support may be pulled out from under you; the worst-case
scenario for Lustre is that you'll have to modify the callchains like I
suggested in my original email to pass the gfp mask all the way down to
the pte allocators if you can't find a way to do it under GFP_KERNEL.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-17 7:37 ` David Rientjes
@ 2010-11-17 9:04 ` Christoph Hellwig
2010-11-17 21:24 ` David Rientjes
0 siblings, 1 reply; 22+ messages in thread
From: Christoph Hellwig @ 2010-11-17 9:04 UTC (permalink / raw)
To: David Rientjes
Cc: Andreas Dilger, Andrew Morton, Ricardo M. Correia, linux-mm,
Brian Behlendorf
On Tue, Nov 16, 2010 at 11:37:39PM -0800, David Rientjes wrote:
> If you _really_ need 1MB of physically contiguous memory, then you'll need
> to find a way to do it in a reclaimable context. If we actually can
> remove the dependency that gfs2, ntfs, and ceph have in the kernel.org
> kernel, then this support may be pulled out from under you; the worst-case
> scenario for Lustre is that you'll have to modify the callchains like I
> suggested in my original email to pass the gfp mask all the way down to
> the pte allocators if you can't find a way to do it under GFP_KERNEL.
As Dave mentioned XFS also needs GFP_NOFS allocations in the low-level
vmap machinery, which is shared with vmalloc.
* Re: Propagating GFP_NOFS inside __vmalloc()
2010-11-17 9:04 ` Christoph Hellwig
@ 2010-11-17 21:24 ` David Rientjes
0 siblings, 0 replies; 22+ messages in thread
From: David Rientjes @ 2010-11-17 21:24 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Andreas Dilger, Andrew Morton, Ricardo M. Correia, linux-mm,
Brian Behlendorf, Dave Chinner
On Wed, 17 Nov 2010, Christoph Hellwig wrote:
> As Dave mentioned XFS also needs GFP_NOFS allocations in the low-level
> vmap machinery, which is shared with vmalloc.
>
Ok, so vm_map_ram() probably needs to be modified to allow gfp_t to be
passed in after the pte wrappers are in place that can be used to avoid
the hard-wired GFP_KERNEL in arch code (Ricardo is working on that, I
believe?); once that's done, it's trivial to pass the gfp_t for xfs to
lower-level vmalloc code to allocate the necessary vmap_block, vmap_area,
and radix tree data structures from the slab allocator (they are all
order-0, at least).
I think the ultimate solution will be able to allow GFP_NOFS to be passed
into things like vm_map_ram() and __vmalloc() and then try to avoid new
additions and fix up the callers later, if possible, for the eventual
removal of all gfp_t formals from the vmalloc layer.
end of thread, other threads:[~2010-11-17 21:25 UTC | newest]
Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-10 20:42 Propagating GFP_NOFS inside __vmalloc() Ricardo M. Correia
2010-11-10 21:35 ` Ricardo M. Correia
2010-11-10 22:10 ` Dave Chinner
2010-11-11 20:06 ` Andrew Morton
2010-11-11 22:02 ` Ricardo M. Correia
2010-11-11 22:25 ` Andrew Morton
2010-11-11 22:45 ` Ricardo M. Correia
2010-11-11 23:19 ` Ricardo M. Correia
2010-11-11 23:27 ` Andrew Morton
2010-11-11 23:29 ` Ricardo M. Correia
2010-11-15 17:01 ` Ricardo M. Correia
2010-11-15 21:28 ` David Rientjes
2010-11-15 22:19 ` Ricardo M. Correia
2010-11-15 22:50 ` David Rientjes
2010-11-15 23:30 ` Ricardo M. Correia
2010-11-15 23:55 ` David Rientjes
2010-11-16 22:11 ` Andrew Morton
2010-11-17 7:18 ` Andreas Dilger
2010-11-17 7:24 ` Andrew Morton
2010-11-17 7:37 ` David Rientjes
2010-11-17 9:04 ` Christoph Hellwig
2010-11-17 21:24 ` David Rientjes
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).