linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch] mm: pagecache gfp flags fix
@ 2008-12-12  4:41 Nick Piggin
  2008-12-15 23:21 ` Andrew Morton
  0 siblings, 1 reply; 3+ messages in thread
From: Nick Piggin @ 2008-12-12  4:41 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, Linux Memory Management List

This patch doesn't actually fix a regression, but a longer standing bug.
--

Frustratingly, gfp_t is really divided into two classes of flags. One are the
context dependent ones (can we sleep? can we enter filesystem? block subsystem?
should we use some extra reserves, etc.). The other ones are the type of memory
required and depend on how the algorithm is implemented rather than the point
at which the memory is allocated (highmem? dma memory? etc).

Some of functions which allocate a page and add it to page cache take a gfp_t,
but sometimes those functions or their callers aren't really doing the right
thing: when allocating pagecache page, the memory type should be
mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
should be kernel mapped (not highmem) memory. The gfp_t argument should only
really be needed for context dependent options.

This patch doesn't really solve that tangle in a nice way, but it does attempt
to fix a couple of bugs.

- find_or_create_page changes its radix-tree allocation to only include the
  main context dependent flags in order so the pagecache page may be allocated
  from arbitrary types of memory without affecting the radix-tree. In practice,
  slab allocations don't come from highmem anyway, and radix-tree only uses
  slab allocations. So there isn't a practical change (unless some fs uses
  GFP_DMA for pages).

- grab_cache_page_nowait() is changed to allocate radix-tree nodes with
  GFP_NOFS, because it is not supposed to reenter the filesystem. This bug
  could cause lock recursion if a filesystem is not expecting the function
  to reenter the fs (as-per documentation).

Filesystems should be careful about exactly what semantics they want and what
they get when fiddling with gfp_t masks to allocate pagecache. One should be
as liberal as possible with the type of memory that can be used, and same
for the the context specific flags.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -741,7 +741,14 @@ repeat:
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			return NULL;
-		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		/*
+		 * We want a regular kernel memory (not highmem or DMA etc)
+		 * allocation for the radix tree nodes, but we need to honour
+		 * the context-specific requirements the caller has asked for.
+		 * GFP_RECLAIM_MASK collects those requirements.
+		 */
+		err = add_to_page_cache_lru(page, mapping, index,
+			(gfp_mask & GFP_RECLAIM_MASK));
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -950,7 +957,7 @@ grab_cache_page_nowait(struct address_sp
 		return NULL;
 	}
 	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
+	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
 	}

^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [patch] mm: pagecache gfp flags fix
  2008-12-12  4:41 [patch] mm: pagecache gfp flags fix Nick Piggin
@ 2008-12-15 23:21 ` Andrew Morton
  2008-12-15 23:30   ` Nick Piggin
  0 siblings, 1 reply; 3+ messages in thread
From: Andrew Morton @ 2008-12-15 23:21 UTC (permalink / raw)
  To: Nick Piggin; +Cc: linux-fsdevel, linux-mm, stable

On Fri, 12 Dec 2008 05:41:20 +0100
Nick Piggin <npiggin@suse.de> wrote:

> This patch doesn't actually fix a regression, but a longer standing bug.
> --
> 
> Frustratingly, gfp_t is really divided into two classes of flags. One are the
> context dependent ones (can we sleep? can we enter filesystem? block subsystem?
> should we use some extra reserves, etc.). The other ones are the type of memory
> required and depend on how the algorithm is implemented rather than the point
> at which the memory is allocated (highmem? dma memory? etc).
> 
> Some of functions which allocate a page and add it to page cache take a gfp_t,
> but sometimes those functions or their callers aren't really doing the right
> thing: when allocating pagecache page, the memory type should be
> mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
> should be kernel mapped (not highmem) memory. The gfp_t argument should only
> really be needed for context dependent options.
> 
> This patch doesn't really solve that tangle in a nice way, but it does attempt
> to fix a couple of bugs.
> 
> - find_or_create_page changes its radix-tree allocation to only include the
>   main context dependent flags in order so the pagecache page may be allocated
>   from arbitrary types of memory without affecting the radix-tree. In practice,
>   slab allocations don't come from highmem anyway, and radix-tree only uses
>   slab allocations. So there isn't a practical change (unless some fs uses
>   GFP_DMA for pages).
> 
> - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
>   GFP_NOFS, because it is not supposed to reenter the filesystem. This bug
>   could cause lock recursion if a filesystem is not expecting the function
>   to reenter the fs (as-per documentation).
> 
> Filesystems should be careful about exactly what semantics they want and what
> they get when fiddling with gfp_t masks to allocate pagecache. One should be
> as liberal as possible with the type of memory that can be used, and same
> for the the context specific flags.

ug.  So at present page_symlink() can call write_begin() which will do
a GFP_KERNEL/GFP_USER allocation even though we hold fs locks?

In which calling context does this happen?

This is a pretty big ugly patch.  I'm thinking that we merge into
2.6.29 and backport into 2.6.28.x.


^ permalink raw reply	[flat|nested] 3+ messages in thread

* Re: [patch] mm: pagecache gfp flags fix
  2008-12-15 23:21 ` Andrew Morton
@ 2008-12-15 23:30   ` Nick Piggin
  0 siblings, 0 replies; 3+ messages in thread
From: Nick Piggin @ 2008-12-15 23:30 UTC (permalink / raw)
  To: Andrew Morton; +Cc: linux-fsdevel, linux-mm, stable

On Mon, Dec 15, 2008 at 03:21:44PM -0800, Andrew Morton wrote:
> On Fri, 12 Dec 2008 05:41:20 +0100
> Nick Piggin <npiggin@suse.de> wrote:
> 
> > This patch doesn't actually fix a regression, but a longer standing bug.
> > --
> > 
> > Frustratingly, gfp_t is really divided into two classes of flags. One are the
> > context dependent ones (can we sleep? can we enter filesystem? block subsystem?
> > should we use some extra reserves, etc.). The other ones are the type of memory
> > required and depend on how the algorithm is implemented rather than the point
> > at which the memory is allocated (highmem? dma memory? etc).
> > 
> > Some of functions which allocate a page and add it to page cache take a gfp_t,
> > but sometimes those functions or their callers aren't really doing the right
> > thing: when allocating pagecache page, the memory type should be
> > mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
> > should be kernel mapped (not highmem) memory. The gfp_t argument should only
> > really be needed for context dependent options.
> > 
> > This patch doesn't really solve that tangle in a nice way, but it does attempt
> > to fix a couple of bugs.
> > 
> > - find_or_create_page changes its radix-tree allocation to only include the
> >   main context dependent flags in order so the pagecache page may be allocated
> >   from arbitrary types of memory without affecting the radix-tree. In practice,
> >   slab allocations don't come from highmem anyway, and radix-tree only uses
> >   slab allocations. So there isn't a practical change (unless some fs uses
> >   GFP_DMA for pages).
> > 
> > - grab_cache_page_nowait() is changed to allocate radix-tree nodes with
> >   GFP_NOFS, because it is not supposed to reenter the filesystem. This bug
> >   could cause lock recursion if a filesystem is not expecting the function
> >   to reenter the fs (as-per documentation).
> > 
> > Filesystems should be careful about exactly what semantics they want and what
> > they get when fiddling with gfp_t masks to allocate pagecache. One should be
> > as liberal as possible with the type of memory that can be used, and same
> > for the the context specific flags.
> 
> ug.  So at present page_symlink() can call write_begin() which will do
> a GFP_KERNEL/GFP_USER allocation even though we hold fs locks?

Yes.

 
> In which calling context does this happen?

ext3/4 do it AFAIKS. I think some filesystems set GFP_NOFS in mapping_gfp_mask
too, but I don't think that's a very good idea because we don't always mask
allocations with mapping_gfp_mask (but still, they would be helped by this
patch too, if they were relying on it).

 
> This is a pretty big ugly patch.  I'm thinking that we merge into
> 2.6.29 and backport into 2.6.28.x.

Sounds good.


^ permalink raw reply	[flat|nested] 3+ messages in thread

end of thread, other threads:[~2008-12-15 23:30 UTC | newest]

Thread overview: 3+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-12-12  4:41 [patch] mm: pagecache gfp flags fix Nick Piggin
2008-12-15 23:21 ` Andrew Morton
2008-12-15 23:30   ` Nick Piggin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).