From: "Ricardo M. Correia" <ricardo.correia@oracle.com>
To: linux-mm@kvack.org, linux-fsdevel@vger.kernel.org
Cc: Brian Behlendorf <behlendorf1@llnl.gov>,
Andreas Dilger <andreas.dilger@oracle.com>
Subject: Re: Propagating GFP_NOFS inside __vmalloc()
Date: Wed, 10 Nov 2010 22:35:55 +0100
Message-ID: <1289424955.11149.73.camel@oralap>
In-Reply-To: <1289421759.11149.59.camel@oralap>
On Wed, 2010-11-10 at 21:42 +0100, Ricardo M. Correia wrote:
> Hi,
>
> As part of Lustre filesystem development, we are running into a
> situation where we (sporadically) need to call into __vmalloc() from a
> thread that processes I/Os to disk (it's a long story).
>
> In general, this would be fine as long as we pass GFP_NOFS to
> __vmalloc(), but the problem is that even if we pass this flag, vmalloc
> itself sometimes allocates memory with GFP_KERNEL.
By the way, it seems that existing users in Linus' tree may be
vulnerable to the same bug that we experienced:
In GFS2 (fs/gfs2/dir.c:1253, gfs2_alloc_sort_buffer()):

    ptr = __vmalloc(size, GFP_NOFS, PAGE_KERNEL);
In the Ceph filesystem (net/ceph/buffer.c:22, ceph_buffer_new()):

    b->vec.iov_base = __vmalloc(len, gfp, PAGE_KERNEL);
.. which can be called from fs/ceph/inode.c:560, fill_inode():

    xattr_blob = ceph_buffer_new(iinfo->xattr_len, GFP_NOFS);
In the MM code (mm/page_alloc.c:5184, alloc_large_system_hash()):

    table = __vmalloc(size, GFP_ATOMIC, PAGE_KERNEL);
All of these seem to be vulnerable to GFP_KERNEL allocations from within
__vmalloc(), at least on x86-64 (as I've detailed below).
Thanks,
Ricardo
>
> This is not OK for us because the GFP_KERNEL allocations may go into the
> synchronous reclaim path and try to write out data to disk (in order to
> free memory for the allocation), which leads to a deadlock because those
> reclaims may themselves depend on the thread that is doing the
> allocation to make forward progress (which it can't, because it's
> blocked trying to allocate the memory).
>
> Andreas suggested that this may be a bug in __vmalloc(), in the sense
> that it's not propagating the gfp_mask that the caller requested to all
> allocations that happen inside it.
>
> On the latest torvalds git tree, for x86-64, the path for these
> GFP_KERNEL allocations goes something like this:
>
> __vmalloc()
>   __vmalloc_node()
>     __vmalloc_area_node()
>       map_vm_area()
>         vmap_page_range()
>           vmap_pud_range()
>             vmap_pmd_range()
>               pmd_alloc()
>                 __pmd_alloc()
>                   pmd_alloc_one()
>                     get_zeroed_page()   <-- GFP_KERNEL
>               vmap_pte_range()
>                 pte_alloc_kernel()
>                   __pte_alloc_kernel()
>                     pte_alloc_one_kernel()
>                       __get_free_page() <-- GFP_KERNEL
>
> We've actually observed these deadlocks during testing (although in an
> older kernel).
>
> Andreas suggested that we should fix __vmalloc() to propagate the
> caller-passed gfp_mask all the way to those allocating functions. This
> may require fixing these interfaces for all architectures.
>
> I also suggested that it would be nice to have a per-task
> gfp_allowed_mask, similar to the existing gfp_allowed_mask /
> set_gfp_allowed_mask() interface that exists in the kernel, but instead
> of being global to the entire system, it would be stored in the thread's
> task_struct and only apply in the context of the current thread.
>
> This would allow us to call a function when our I/O threads are created,
> say set_thread_gfp_allowed_mask(~__GFP_IO), to make sure that any kernel
> allocations that happen in the context of those threads would have
> __GFP_IO masked out.
>
> I am willing to code and send out either of those two patches (the
> vmalloc fix and/or the per-thread gfp mask), and I was wondering if
> this is something you'd be willing to accept into the upstream kernel,
> or if you have any other ideas as to how to prevent all __GFP_IO
> allocations from the kernel itself in the context of threads that
> perform I/O.
>
> (Please reply-to-all as we are not subscribed to linux-mm).
>
> Thanks,
> Ricardo
Thread overview: 22+ messages
2010-11-10 20:42 Propagating GFP_NOFS inside __vmalloc() Ricardo M. Correia
2010-11-10 21:35 ` Ricardo M. Correia [this message]
2010-11-10 22:10 ` Dave Chinner
2010-11-11 20:06 ` Andrew Morton
2010-11-11 22:02 ` Ricardo M. Correia
2010-11-11 22:25 ` Andrew Morton
2010-11-11 22:45 ` Ricardo M. Correia
2010-11-11 23:19 ` Ricardo M. Correia
2010-11-11 23:27 ` Andrew Morton
2010-11-11 23:29 ` Ricardo M. Correia
2010-11-15 17:01 ` Ricardo M. Correia
2010-11-15 21:28 ` David Rientjes
2010-11-15 22:19 ` Ricardo M. Correia
2010-11-15 22:50 ` David Rientjes
2010-11-15 23:30 ` Ricardo M. Correia
2010-11-15 23:55 ` David Rientjes
2010-11-16 22:11 ` Andrew Morton
2010-11-17 7:18 ` Andreas Dilger
2010-11-17 7:24 ` Andrew Morton
2010-11-17 7:37 ` David Rientjes
2010-11-17 9:04 ` Christoph Hellwig
2010-11-17 21:24 ` David Rientjes