linux-fsdevel.vger.kernel.org archive mirror
* Re: Propagating GFP_NOFS inside __vmalloc()
       [not found] <1289421759.11149.59.camel@oralap>
@ 2010-11-10 21:35 ` Ricardo M. Correia
  2010-11-10 22:10   ` Dave Chinner
  0 siblings, 1 reply; 2+ messages in thread
From: Ricardo M. Correia @ 2010-11-10 21:35 UTC (permalink / raw)
  To: linux-mm, linux-fsdevel; +Cc: Brian Behlendorf, Andreas Dilger

On Wed, 2010-11-10 at 21:42 +0100, Ricardo M. Correia wrote:
> Hi,
> 
> As part of Lustre filesystem development, we are running into a
> situation where we (sporadically) need to call into __vmalloc() from a
> thread that processes I/Os to disk (it's a long story).
> 
> In general, this would be fine as long as we pass GFP_NOFS to
> __vmalloc(), but the problem is that even if we pass this flag, vmalloc
> itself sometimes allocates memory with GFP_KERNEL.

By the way, it seems that existing users in Linus' tree may be
vulnerable to the same bug that we experienced:

In GFS:
    8   1253  fs/gfs2/dir.c <<gfs2_alloc_sort_buffer>>
             ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);

The Ceph filesystem:
  20     22  net/ceph/buffer.c <<ceph_buffer_new>>
             b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
... which can be called from:
   3    560  fs/ceph/inode.c <<fill_inode>>
             xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);

In the MM code:
  18   5184  mm/page_alloc.c <<alloc_large_system_hash>>
             table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);

All of these seem to be vulnerable to GFP_KERNEL allocations from within
__vmalloc(), at least on x86-64 (as I've detailed below).

Thanks,
Ricardo

> 
> This is not OK for us because the GFP_KERNEL allocations may go into the
> synchronous reclaim path and try to write out data to disk (in order to
> free memory for the allocation), which leads to a deadlock because those
> reclaims may themselves depend on the thread that is doing the
> allocation to make forward progress (which it can't, because it's
> blocked trying to allocate the memory).
> 
> Andreas suggested that this may be a bug in __vmalloc(), in the sense
> that it's not propagating the gfp_mask that the caller requested to all
> allocations that happen inside it.
> 
> On the latest torvalds git tree, for x86-64, the path for these
> GFP_KERNEL allocations goes something like this:
> 
> __vmalloc()
>   __vmalloc_node()
>     __vmalloc_area_node()
>       map_vm_area()
>         vmap_page_range()
>           vmap_pud_range()
>             vmap_pmd_range()
>               pmd_alloc()
>                 __pmd_alloc()
>                   pmd_alloc_one()
>                     get_zeroed_page() <-- GFP_KERNEL
>               vmap_pte_range()
>                 pte_alloc_kernel()
>                   __pte_alloc_kernel()
>                     pte_alloc_one_kernel()
>                       get_free_page() <-- GFP_KERNEL
> 
> We've actually observed these deadlocks during testing (although in an
> older kernel).
> 
> Andreas suggested that we should fix __vmalloc() to propagate the
> caller-passed gfp_mask all the way to those allocating functions. This
> may require fixing these interfaces for all architectures.
> 
> I also suggested that it would be nice to have a per-task
> gfp_allowed_mask, similar to the global gfp_allowed_mask /
> set_gfp_allowed_mask() interface that already exists in the kernel,
> but stored in the thread's task_struct and applied only in the
> context of the current thread.
> 
> This would allow us to call a function when our I/O threads are created,
> say set_thread_gfp_allowed_mask(~__GFP_IO), to make sure that any kernel
> allocations that happen in the context of those threads would have
> __GFP_IO masked out.
> 
> I am willing to code and send out either of those two patches (the
> vmalloc fix and/or the per-thread gfp mask), and I was wondering
> whether this is something you'd be willing to accept into the
> upstream kernel, or whether you have any other ideas as to how to
> prevent all __GFP_IO allocations from the kernel itself in the
> context of threads that perform I/O.
> 
> (Please reply-to-all as we are not subscribed to linux-mm).
> 
> Thanks,
> Ricardo




* Re: Propagating GFP_NOFS inside __vmalloc()
  2010-11-10 21:35 ` Propagating GFP_NOFS inside __vmalloc() Ricardo M. Correia
@ 2010-11-10 22:10   ` Dave Chinner
  0 siblings, 0 replies; 2+ messages in thread
From: Dave Chinner @ 2010-11-10 22:10 UTC (permalink / raw)
  To: Ricardo M. Correia
  Cc: linux-mm, linux-fsdevel, Brian Behlendorf, Andreas Dilger

On Wed, Nov 10, 2010 at 10:35:55PM +0100, Ricardo M. Correia wrote:
> On Wed, 2010-11-10 at 21:42 +0100, Ricardo M. Correia wrote:
> > Hi,
> > 
> > As part of Lustre filesystem development, we are running into a
> > situation where we (sporadically) need to call into __vmalloc() from a
> > thread that processes I/Os to disk (it's a long story).
> > 
> > In general, this would be fine as long as we pass GFP_NOFS to
> > __vmalloc(), but the problem is that even if we pass this flag, vmalloc
> > itself sometimes allocates memory with GFP_KERNEL.
> 
> By the way, it seems that existing users in Linus' tree may be
> vulnerable to the same bug that we experienced:
> 
> In GFS:
>     8   1253  fs/gfs2/dir.c <<gfs2_alloc_sort_buffer>>
>              ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);
> 
> The Ceph filesystem:
>   20     22  net/ceph/buffer.c <<ceph_buffer_new>>
>              b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
> ... which can be called from:
>    3    560  fs/ceph/inode.c <<fill_inode>>
>              xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
> 
> In the MM code:
>   18   5184  mm/page_alloc.c <<alloc_large_system_hash>>
>              table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
> 
> All of these seem to be vulnerable to GFP_KERNEL allocations from within
> __vmalloc(), at least on x86-64 (as I've detailed below).

Hmmm. I'd say there's a definite possibility that vm_map_ram(), as
called from fs/xfs/linux-2.6/xfs_buf.c, needs to use GFP_NOFS
allocation too. Currently vm_map_ram() just uses GFP_KERNEL
internally, but it is certainly being called in contexts where we
don't allow recursion (e.g. inside a transaction), so it probably
should allow a gfp mask to be passed in.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

