linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [patch 1/2] mm: pagecache allocation gfp fixes
@ 2008-11-27  9:34 Nick Piggin
  2008-11-27  9:35 ` [patch 2/2] fs: symlink write_begin allocation context fix Nick Piggin
  2008-11-27  9:52 ` [patch 1/2] mm: pagecache allocation gfp fixes Pekka Enberg
  0 siblings, 2 replies; 16+ messages in thread
From: Nick Piggin @ 2008-11-27  9:34 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, linux-mm; +Cc: npiggin


Frustratingly, gfp_t is really divided into two classes of flags. One are the
context dependent ones (can we sleep? can we enter filesystem? block subsystem?
should we use some extra reserves, etc.). The other ones are the type of memory
required and depend on how the algorithm is implemented rather than the point
at which the memory is allocated (highmem? dma memory? etc).

Some of functions which allocate a page and add it to page cache take a gfp_t,
but sometimes those functions or their callers aren't really doing the right
thing: when allocating pagecache page, the memory type should be
mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
should be kernel mapped (not highmem) memory. The gfp_t argument should only
really be needed for context dependent options.

This patch doesn't really solve that tangle in a nice way, but it does attempt
to fix a couple of bugs. find_or_create_page changes its radix-tree allocation
to only include the main context dependent flags in order so the pagecache
page may be allocated from arbitrary types of memory without affecting the
radix-tree. Then grab_cache_page_nowait() is changed to allocate radix-tree
nodes with GFP_NOFS, because it is not supposed to reenter the filesystem.

Filesystems should be careful about exactly what semantics they want and what
they get when fiddling with gfp_t masks to allocate pagecache. One should be
as liberal as possible with the type of memory that can be used, and same
for the the context specific flags.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -741,7 +741,8 @@ repeat:
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			return NULL;
-		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		err = add_to_page_cache_lru(page, mapping, index,
+			(gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -950,7 +951,7 @@ grab_cache_page_nowait(struct address_sp
 		return NULL;
 	}
 	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
+	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
 	}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [patch 2/2] fs: symlink write_begin allocation context fix
  2008-11-27  9:34 [patch 1/2] mm: pagecache allocation gfp fixes Nick Piggin
@ 2008-11-27  9:35 ` Nick Piggin
  2008-11-27 11:02   ` KOSAKI Motohiro
  2008-11-27  9:52 ` [patch 1/2] mm: pagecache allocation gfp fixes Pekka Enberg
  1 sibling, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2008-11-27  9:35 UTC (permalink / raw)
  To: Andrew Morton, linux-fsdevel, linux-mm


With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the allocations
happened. They are done in write_begin, which would always assume that the
filesystem can be entered from reclaim.

The funny thing with having a gfp_t mask there is that it doesn't really allow
the caller to arbitrarily tinker with the context in which it can be called.
It couldn't ever be GFP_ATOMIC, for example, because it needs to take the
page lock. The only thing any callers care about is __GFP_FS anyway, so turn
that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to accept
a nofs argument as well, to honour that flag (while we're there, change the
name to grab_cache_page_write_begin which is more instructive and does away
with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a filesystem
happens to want any extra allocations aside from the pagecache ones in ints
write_begin function, it may now use GFP_KERNEL for common case allocations
(eg. ocfs2_alloc_write_ctxt, for a random example).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2786,18 +2786,23 @@ void page_put_link(struct dentry *dentry
 	}
 }
 
-int __page_symlink(struct inode *inode, const char *symname, int len,
-		gfp_t gfp_mask)
+/*
+ * The nofs argument instructs pagecache_write_begin to pass AOP_FLAG_NOFS
+ */
+int __page_symlink(struct inode *inode, const char *symname, int len, int nofs)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	void *fsdata;
 	int err;
 	char *kaddr;
+	unsigned int flags = AOP_FLAG_UNINTERRUPTIBLE;
+	if (nofs)
+		flags |= AOP_FLAG_NOFS;
 
 retry:
 	err = pagecache_write_begin(NULL, mapping, 0, len-1,
-				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+				flags, &page, &fsdata);
 	if (err)
 		goto fail;
 
@@ -2820,8 +2825,7 @@ fail:
 
 int page_symlink(struct inode *inode, const char *symname, int len)
 {
-	return __page_symlink(inode, symname, len,
-			mapping_gfp_mask(inode->i_mapping));
+	return __page_symlink(inode, symname, len, 0);
 }
 
 const struct inode_operations page_symlink_inode_operations = {
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -413,6 +413,9 @@ enum positive_aop_returns {
 
 #define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
 #define AOP_FLAG_CONT_EXPAND		0x0002 /* called from cont_expand */
+#define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
+						* helper code (eg buffer layer)
+						* to clear GFP_FS from alloc */
 
 /*
  * oh the beauties of C type declarations.
@@ -2022,7 +2025,7 @@ extern int page_readlink(struct dentry *
 extern void *page_follow_link_light(struct dentry *, struct nameidata *);
 extern void page_put_link(struct dentry *, struct nameidata *, void *);
 extern int __page_symlink(struct inode *inode, const char *symname, int len,
-		gfp_t gfp_mask);
+		int nofs);
 extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern int generic_readlink(struct dentry *, char __user *, int);
Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c
+++ linux-2.6/fs/ext3/namei.c
@@ -2170,8 +2170,7 @@ retry:
 		 * We have a transaction open.  All is sweetness.  It also sets
 		 * i_size in generic_commit_write().
 		 */
-		err = __page_symlink(inode, symname, l,
-				mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+		err = page_symlink(inode, symname, l);
 		if (err) {
 			drop_nlink(inode);
 			ext3_mark_inode_dirty(handle, inode);
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c
+++ linux-2.6/fs/ext4/namei.c
@@ -2208,8 +2208,7 @@ retry:
 		 * We have a transaction open.  All is sweetness.  It also sets
 		 * i_size in generic_commit_write().
 		 */
-		err = __page_symlink(inode, symname, l,
-				mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+		err = page_symlink(inode, symname, l);
 		if (err) {
 			clear_nlink(inode);
 			ext4_mark_inode_dirty(handle, inode);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -241,7 +241,8 @@ unsigned find_get_pages_contig(struct ad
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
-struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index);
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+			pgoff_t index, int nofs);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2141,19 +2141,24 @@ EXPORT_SYMBOL(generic_file_direct_write)
  * Find or create a page at the given pagecache position. Return the locked
  * page. This function is specifically for buffered writes.
  */
-struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index)
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+					pgoff_t index, int nofs)
 {
 	int status;
 	struct page *page;
+	gfp_t gfp_notmask = 0;
+	if (nofs)
+		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (likely(page))
 		return page;
 
-	page = page_cache_alloc(mapping);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
 	if (!page)
 		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	status = add_to_page_cache_lru(page, mapping, index,
+						GFP_KERNEL & ~gfp_notmask);
 	if (unlikely(status)) {
 		page_cache_release(page);
 		if (status == -EEXIST)
@@ -2162,7 +2167,7 @@ repeat:
 	}
 	return page;
 }
-EXPORT_SYMBOL(__grab_cache_page);
+EXPORT_SYMBOL(grab_cache_page_write_begin);
 
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c
+++ linux-2.6/fs/affs/file.c
@@ -628,7 +628,7 @@ static int affs_write_begin_ofs(struct f
 	}
 
 	index = pos >> PAGE_CACHE_SHIFT;
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index, flags&AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/afs/write.c
===================================================================
--- linux-2.6.orig/fs/afs/write.c
+++ linux-2.6/fs/afs/write.c
@@ -144,7 +144,7 @@ int afs_write_begin(struct file *file, s
 	candidate->state = AFS_WBACK_PENDING;
 	init_waitqueue_head(&candidate->waitq);
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index, flags&AOP_FLAG_NOFS);
 	if (!page) {
 		kfree(candidate);
 		return -ENOMEM;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1987,7 +1987,8 @@ int block_write_begin(struct file *file,
 	page = *pagep;
 	if (page == NULL) {
 		ownpage = 1;
-		page = __grab_cache_page(mapping, index);
+		page = grab_cache_page_write_begin(mapping, index,
+						flags & AOP_FLAG_NOFS);
 		if (!page) {
 			status = -ENOMEM;
 			goto out;
@@ -2493,7 +2494,8 @@ int nobh_write_begin(struct file *file,
 	from = pos & (PAGE_CACHE_SIZE - 1);
 	to = from + len;
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -2065,7 +2065,8 @@ static int cifs_write_begin(struct file
 
 	cFYI(1, ("write_begin from %lld len %d", (long long)pos, len));
 
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 
Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -288,7 +288,8 @@ static int ecryptfs_write_begin(struct f
 	loff_t prev_page_end_size;
 	int rc = 0;
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1160,7 +1160,8 @@ static int ext3_write_begin(struct file
 	to = from + len;
 
 retry:
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1345,7 +1345,8 @@ retry:
 		goto out;
 	}
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page) {
 		ext4_journal_stop(handle);
 		ret = -ENOMEM;
@@ -2549,7 +2550,8 @@ retry:
 		goto out;
 	}
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page) {
 		ext4_journal_stop(handle);
 		ret = -ENOMEM;
Index: linux-2.6/fs/fuse/file.c
===================================================================
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -646,7 +646,8 @@ static int fuse_write_begin(struct file
 {
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -675,7 +675,8 @@ static int gfs2_write_begin(struct file
 		goto out_trans_fail;
 
 	error = -ENOMEM;
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	*pagep = page;
 	if (unlikely(!page))
 		goto out_endtrans;
Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -501,7 +501,8 @@ int hostfs_write_begin(struct file *file
 {
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6/fs/jffs2/file.c
===================================================================
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -132,7 +132,8 @@ static int jffs2_write_begin(struct file
 	uint32_t pageofs = index << PAGE_CACHE_SHIFT;
 	int ret = 0;
 
-	pg = __grab_cache_page(mapping, index);
+	pg = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!pg)
 		return -ENOMEM;
 	*pagep = pg;
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -360,7 +360,8 @@ int simple_write_begin(struct file *file
 	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 
Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -354,7 +354,8 @@ static int nfs_write_begin(struct file *
 		file->f_path.dentry->d_name.name,
 		mapping->host->i_ino, len, (long long) pos);
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c
+++ linux-2.6/fs/reiserfs/inode.c
@@ -2556,7 +2556,8 @@ static int reiserfs_write_begin(struct f
 	}
 
 	index = pos >> PAGE_CACHE_SHIFT;
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/smbfs/file.c
===================================================================
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -297,7 +297,8 @@ static int smb_write_begin(struct file *
 			struct page **pagep, void **fsdata)
 {
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6/fs/ubifs/file.c
===================================================================
--- linux-2.6.orig/fs/ubifs/file.c
+++ linux-2.6/fs/ubifs/file.c
@@ -247,7 +247,8 @@ static int write_begin_slow(struct addre
 	if (unlikely(err))
 		return err;
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (unlikely(!page)) {
 		ubifs_release_budget(c, &req);
 		return -ENOMEM;
@@ -438,7 +439,8 @@ static int ubifs_write_begin(struct file
 		return -EROFS;
 
 	/* Try out the fast-path part first */
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (unlikely(!page))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27  9:34 [patch 1/2] mm: pagecache allocation gfp fixes Nick Piggin
  2008-11-27  9:35 ` [patch 2/2] fs: symlink write_begin allocation context fix Nick Piggin
@ 2008-11-27  9:52 ` Pekka Enberg
  2008-11-27 10:01   ` Nick Piggin
  2008-11-27 10:18   ` Nick Piggin
  1 sibling, 2 replies; 16+ messages in thread
From: Pekka Enberg @ 2008-11-27  9:52 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, linux-mm

Hi Nick,

On Thu, Nov 27, 2008 at 11:34 AM, Nick Piggin <npiggin@suse.de> wrote:
>
> Frustratingly, gfp_t is really divided into two classes of flags. One are the
> context dependent ones (can we sleep? can we enter filesystem? block subsystem?
> should we use some extra reserves, etc.). The other ones are the type of memory
> required and depend on how the algorithm is implemented rather than the point
> at which the memory is allocated (highmem? dma memory? etc).
>
> Some of functions which allocate a page and add it to page cache take a gfp_t,
> but sometimes those functions or their callers aren't really doing the right
> thing: when allocating pagecache page, the memory type should be
> mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
> should be kernel mapped (not highmem) memory. The gfp_t argument should only
> really be needed for context dependent options.
>
> This patch doesn't really solve that tangle in a nice way, but it does attempt
> to fix a couple of bugs. find_or_create_page changes its radix-tree allocation
> to only include the main context dependent flags in order so the pagecache
> page may be allocated from arbitrary types of memory without affecting the
> radix-tree. Then grab_cache_page_nowait() is changed to allocate radix-tree
> nodes with GFP_NOFS, because it is not supposed to reenter the filesystem.
>
> Filesystems should be careful about exactly what semantics they want and what
> they get when fiddling with gfp_t masks to allocate pagecache. One should be
> as liberal as possible with the type of memory that can be used, and same
> for the the context specific flags.
>
> Signed-off-by: Nick Piggin <npiggin@suse.de>
> ---
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -741,7 +741,8 @@ repeat:
>                page = __page_cache_alloc(gfp_mask);
>                if (!page)
>                        return NULL;
> -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> +               err = add_to_page_cache_lru(page, mapping, index,
> +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));

Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
__GFP_NOFAIL, for example, down to radix_tree_preload() et al?

>                if (unlikely(err)) {
>                        page_cache_release(page);
>                        page = NULL;
> @@ -950,7 +951,7 @@ grab_cache_page_nowait(struct address_sp
>                return NULL;
>        }
>        page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
> -       if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
> +       if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
>                page_cache_release(page);
>                page = NULL;
>        }
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27  9:52 ` [patch 1/2] mm: pagecache allocation gfp fixes Pekka Enberg
@ 2008-11-27 10:01   ` Nick Piggin
  2008-11-27 10:18   ` Nick Piggin
  1 sibling, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2008-11-27 10:01 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Andrew Morton, linux-fsdevel, linux-mm

Hi Pekka,

On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> Hi Nick,
> 
> On Thu, Nov 27, 2008 at 11:34 AM, Nick Piggin <npiggin@suse.de> wrote:
> > Filesystems should be careful about exactly what semantics they want and what
> > they get when fiddling with gfp_t masks to allocate pagecache. One should be
> > as liberal as possible with the type of memory that can be used, and same
> > for the the context specific flags.
> >
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> > ---
> > Index: linux-2.6/mm/filemap.c
> > ===================================================================
> > --- linux-2.6.orig/mm/filemap.c
> > +++ linux-2.6/mm/filemap.c
> > @@ -741,7 +741,8 @@ repeat:
> >                page = __page_cache_alloc(gfp_mask);
> >                if (!page)
> >                        return NULL;
> > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > +               err = add_to_page_cache_lru(page, mapping, index,
> > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> 
> Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> __GFP_NOFAIL, for example, down to radix_tree_preload() et al?

Ah, yes I thought a #define would be handy for this, but obviously didn't
look hard enough. GFP_RECLAIM_MASK looks good (but God help any filesystem
passing __GFP_NOFAIL into here ;)).

Thanks,
Nick

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27  9:52 ` [patch 1/2] mm: pagecache allocation gfp fixes Pekka Enberg
  2008-11-27 10:01   ` Nick Piggin
@ 2008-11-27 10:18   ` Nick Piggin
  2008-11-27 10:28     ` Pekka Enberg
  2008-11-27 18:14     ` Hugh Dickins
  1 sibling, 2 replies; 16+ messages in thread
From: Nick Piggin @ 2008-11-27 10:18 UTC (permalink / raw)
  To: Pekka Enberg; +Cc: Andrew Morton, linux-fsdevel, linux-mm

On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > +               err = add_to_page_cache_lru(page, mapping, index,
> > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> 
> Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> __GFP_NOFAIL, for example, down to radix_tree_preload() et al?

Updated patch.

--
Frustratingly, gfp_t is really divided into two classes of flags. One are the
context dependent ones (can we sleep? can we enter filesystem? block subsystem?
should we use some extra reserves, etc.). The other ones are the type of memory
required and depend on how the algorithm is implemented rather than the point
at which the memory is allocated (highmem? dma memory? etc).

Some of functions which allocate a page and add it to page cache take a gfp_t,
but sometimes those functions or their callers aren't really doing the right
thing: when allocating pagecache page, the memory type should be
mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
should be kernel mapped (not highmem) memory. The gfp_t argument should only
really be needed for context dependent options.

This patch doesn't really solve that tangle in a nice way, but it does attempt
to fix a couple of bugs. find_or_create_page changes its radix-tree allocation
to only include the main context dependent flags in order so the pagecache
page may be allocated from arbitrary types of memory without affecting the
radix-tree. Then grab_cache_page_nowait() is changed to allocate radix-tree
nodes with GFP_NOFS, because it is not supposed to reenter the filesystem.

Filesystems should be careful about exactly what semantics they want and what
they get when fiddling with gfp_t masks to allocate pagecache. One should be
as liberal as possible with the type of memory that can be used, and same
for the the context specific flags.

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -741,7 +741,14 @@ repeat:
 		page = __page_cache_alloc(gfp_mask);
 		if (!page)
 			return NULL;
-		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
+		/*
+		 * We want a regular kernel memory (not highmem or DMA etc)
+		 * allocation for the radix tree nodes, but we need to honour
+		 * the context-specific requirements the caller has asked for.
+		 * GFP_RECLAIM_MASK collects those requirements.
+		 */
+		err = add_to_page_cache_lru(page, mapping, index,
+			(gfp_mask & GFP_RECLAIM_MASK));
 		if (unlikely(err)) {
 			page_cache_release(page);
 			page = NULL;
@@ -950,7 +957,7 @@ grab_cache_page_nowait(struct address_sp
 		return NULL;
 	}
 	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
+	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
 		page_cache_release(page);
 		page = NULL;
 	}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27 10:18   ` Nick Piggin
@ 2008-11-27 10:28     ` Pekka Enberg
  2008-11-27 10:40       ` KOSAKI Motohiro
  2008-11-27 18:14     ` Hugh Dickins
  1 sibling, 1 reply; 16+ messages in thread
From: Pekka Enberg @ 2008-11-27 10:28 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, linux-fsdevel, linux-mm

On Thu, 2008-11-27 at 11:18 +0100, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> > > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > > +               err = add_to_page_cache_lru(page, mapping, index,
> > > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> > 
> > Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> > __GFP_NOFAIL, for example, down to radix_tree_preload() et al?
> 
> Updated patch.
> 
> --
> Frustratingly, gfp_t is really divided into two classes of flags. One are the
> context dependent ones (can we sleep? can we enter filesystem? block subsystem?
> should we use some extra reserves, etc.). The other ones are the type of memory
> required and depend on how the algorithm is implemented rather than the point
> at which the memory is allocated (highmem? dma memory? etc).
> 
> Some of functions which allocate a page and add it to page cache take a gfp_t,
> but sometimes those functions or their callers aren't really doing the right
> thing: when allocating pagecache page, the memory type should be
> mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
> should be kernel mapped (not highmem) memory. The gfp_t argument should only
> really be needed for context dependent options.
> 
> This patch doesn't really solve that tangle in a nice way, but it does attempt
> to fix a couple of bugs. find_or_create_page changes its radix-tree allocation
> to only include the main context dependent flags in order so the pagecache
> page may be allocated from arbitrary types of memory without affecting the
> radix-tree. Then grab_cache_page_nowait() is changed to allocate radix-tree
> nodes with GFP_NOFS, because it is not supposed to reenter the filesystem.
> 
> Filesystems should be careful about exactly what semantics they want and what
> they get when fiddling with gfp_t masks to allocate pagecache. One should be
> as liberal as possible with the type of memory that can be used, and same
> for the the context specific flags.
> 
> Signed-off-by: Nick Piggin <npiggin@suse.de>

Looks good to me.

Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

> ---
> Index: linux-2.6/mm/filemap.c
> ===================================================================
> --- linux-2.6.orig/mm/filemap.c
> +++ linux-2.6/mm/filemap.c
> @@ -741,7 +741,14 @@ repeat:
>  		page = __page_cache_alloc(gfp_mask);
>  		if (!page)
>  			return NULL;
> -		err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> +		/*
> +		 * We want a regular kernel memory (not highmem or DMA etc)
> +		 * allocation for the radix tree nodes, but we need to honour
> +		 * the context-specific requirements the caller has asked for.
> +		 * GFP_RECLAIM_MASK collects those requirements.
> +		 */
> +		err = add_to_page_cache_lru(page, mapping, index,
> +			(gfp_mask & GFP_RECLAIM_MASK));
>  		if (unlikely(err)) {
>  			page_cache_release(page);
>  			page = NULL;
> @@ -950,7 +957,7 @@ grab_cache_page_nowait(struct address_sp
>  		return NULL;
>  	}
>  	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
> -	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
> +	if (page && add_to_page_cache_lru(page, mapping, index, GFP_NOFS)) {
>  		page_cache_release(page);
>  		page = NULL;
>  	}


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27 10:28     ` Pekka Enberg
@ 2008-11-27 10:40       ` KOSAKI Motohiro
  0 siblings, 0 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2008-11-27 10:40 UTC (permalink / raw)
  To: Pekka Enberg
  Cc: kosaki.motohiro, Nick Piggin, Andrew Morton, linux-fsdevel,
	linux-mm

> > Frustratingly, gfp_t is really divided into two classes of flags. One are the
> > context dependent ones (can we sleep? can we enter filesystem? block subsystem?
> > should we use some extra reserves, etc.). The other ones are the type of memory
> > required and depend on how the algorithm is implemented rather than the point
> > at which the memory is allocated (highmem? dma memory? etc).
> > 
> > Some of functions which allocate a page and add it to page cache take a gfp_t,
> > but sometimes those functions or their callers aren't really doing the right
> > thing: when allocating pagecache page, the memory type should be
> > mapping_gfp_mask(mapping). When allocating radix tree nodes, the memory type
> > should be kernel mapped (not highmem) memory. The gfp_t argument should only
> > really be needed for context dependent options.
> > 
> > This patch doesn't really solve that tangle in a nice way, but it does attempt
> > to fix a couple of bugs. find_or_create_page changes its radix-tree allocation
> > to only include the main context dependent flags in order so the pagecache
> > page may be allocated from arbitrary types of memory without affecting the
> > radix-tree. Then grab_cache_page_nowait() is changed to allocate radix-tree
> > nodes with GFP_NOFS, because it is not supposed to reenter the filesystem.
> > 
> > Filesystems should be careful about exactly what semantics they want and what
> > they get when fiddling with gfp_t masks to allocate pagecache. One should be
> > as liberal as possible with the type of memory that can be used, and same
> > for the the context specific flags.
> > 
> > Signed-off-by: Nick Piggin <npiggin@suse.de>
> 
> Looks good to me.
> 
> Reviewed-by: Pekka Enberg <penberg@cs.helsinki.fi>

me too.
	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>





^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 2/2] fs: symlink write_begin allocation context fix
  2008-11-27  9:35 ` [patch 2/2] fs: symlink write_begin allocation context fix Nick Piggin
@ 2008-11-27 11:02   ` KOSAKI Motohiro
  2008-11-27 11:14     ` Nick Piggin
  2008-11-28 14:37     ` Nick Piggin
  0 siblings, 2 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2008-11-27 11:02 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kosaki.motohiro, Andrew Morton, linux-fsdevel, linux-mm

> -int __page_symlink(struct inode *inode, const char *symname, int len,
> -		gfp_t gfp_mask)
> +/*
> + * The nofs argument instructs pagecache_write_begin to pass AOP_FLAG_NOFS
> + */
> +int __page_symlink(struct inode *inode, const char *symname, int len, int nofs)
>  {
>  	struct address_space *mapping = inode->i_mapping;
>  	struct page *page;
>  	void *fsdata;
>  	int err;
>  	char *kaddr;
> +	unsigned int flags = AOP_FLAG_UNINTERRUPTIBLE;
> +	if (nofs)
> +		flags |= AOP_FLAG_NOFS;
>  
>  retry:
>  	err = pagecache_write_begin(NULL, mapping, 0, len-1,
> -				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> +				flags, &page, &fsdata);
>  	if (err)
>  		goto fail;
>  
> @@ -2820,8 +2825,7 @@ fail:
>  
>  int page_symlink(struct inode *inode, const char *symname, int len)
>  {
> -	return __page_symlink(inode, symname, len,
> -			mapping_gfp_mask(inode->i_mapping));
> +	return __page_symlink(inode, symname, len, 0);
>  }

your patch always pass 0 into __page_symlink().
therefore it doesn't change any behavior.

right?




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 2/2] fs: symlink write_begin allocation context fix
  2008-11-27 11:02   ` KOSAKI Motohiro
@ 2008-11-27 11:14     ` Nick Piggin
  2008-11-28 14:37     ` Nick Piggin
  1 sibling, 0 replies; 16+ messages in thread
From: Nick Piggin @ 2008-11-27 11:14 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Andrew Morton, linux-fsdevel, linux-mm

On Thu, Nov 27, 2008 at 08:02:32PM +0900, KOSAKI Motohiro wrote:
> > -int __page_symlink(struct inode *inode, const char *symname, int len,
> > -		gfp_t gfp_mask)
> > +/*
> > + * The nofs argument instructs pagecache_write_begin to pass AOP_FLAG_NOFS
> > + */
> > +int __page_symlink(struct inode *inode, const char *symname, int len, int nofs)
> >  {
> >  	struct address_space *mapping = inode->i_mapping;
> >  	struct page *page;
> >  	void *fsdata;
> >  	int err;
> >  	char *kaddr;
> > +	unsigned int flags = AOP_FLAG_UNINTERRUPTIBLE;
> > +	if (nofs)
> > +		flags |= AOP_FLAG_NOFS;
> >  
> >  retry:
> >  	err = pagecache_write_begin(NULL, mapping, 0, len-1,
> > -				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
> > +				flags, &page, &fsdata);
> >  	if (err)
> >  		goto fail;
> >  
> > @@ -2820,8 +2825,7 @@ fail:
> >  
> >  int page_symlink(struct inode *inode, const char *symname, int len)
> >  {
> > -	return __page_symlink(inode, symname, len,
> > -			mapping_gfp_mask(inode->i_mapping));
> > +	return __page_symlink(inode, symname, len, 0);
> >  }
> 
> your patch always pass 0 into __page_symlink().
> therefore it doesn't change any behavior.
> 
> right?

Ah, you're right I think. I misread the code: most filesystems can
tolerate GFP_FS here, and its just ext3/4 and a few others which require
nofs here.

Annoyingly, some filesystems put GFP_NOFS into their mapping_gfp_mask.
This may or may not get honoured depending on what core functions get
used (radix tree allocations shouldn't and don't allocate pages
with the mapping_gfp_mask).

Anyway, I'll set nofs=1 in the case that mapping_gfp_mask has __GFP_FS
cleared, in an attempt to minimise the chance of breakage...

Good spotting, thanks.  

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27 10:18   ` Nick Piggin
  2008-11-27 10:28     ` Pekka Enberg
@ 2008-11-27 18:14     ` Hugh Dickins
  2008-11-28 12:04       ` Nick Piggin
  1 sibling, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2008-11-27 18:14 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Pekka Enberg, Andrew Morton, linux-fsdevel, linux-mm

On Thu, 27 Nov 2008, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> > > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > > +               err = add_to_page_cache_lru(page, mapping, index,
> > > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> > 
> > Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> > __GFP_NOFAIL, for example, down to radix_tree_preload() et al?

I certainly agree with Pekka's suggestion to use GFP_RECLAIM_MASK.

> 
> Updated patch.

I'm not sure about it.  I came here before 2.6.25, not yet got around
to submitting, I went in the opposite direction.  What drove me was an
irritation at the growing number of & ~__GFP_HIGHMEMs (after adding a
couple myself in shmem.c).  At the least, I think we ought to change
those to & GFP_RECLAIM_MASKs (it seems we don't have one, but can
imagine a block driver that wants GFP_DMA or GFP_DMA32 for pagecache:
there's no reason to alloc its kernel-internal data structures for DMA).

My patch went the opposite direction to yours, in that I was pushing
down the GFP_RECLAIM_MASKing into lib/radix-tree.c and mm/memcontrol.c
(but that now doesn't kmalloc for itself, so no longer needs the mask).

I'm not sure which way is the right way: you can say I'm inconsistent
not to push it down into slab/sleb/slib/slob/slub itself, but I've a
notion that someone might object to any extra intrns in their hotpaths.

My design principle, I think, was to minimize the line length of
the maximum number of source lines: you may believe in other
design principles of higher value.

Updating it quickly to 2.6.28-rc6, built but untested, here's mine.
I'm not saying it's the right approach and yours wrong, just please
consider it before deciding on which way to go.

I've left in the hunk from fs/buffer.c in case you can decipher it,
I think that's what held me up from submitting: I've never worked
out since whether that change is a profound insight into reality
here, or just a blind attempt to reduce the line length.

 fs/buffer.c      |    3 +--
 lib/radix-tree.c |    5 +++--
 mm/filemap.c     |   11 ++++++-----
 mm/shmem.c       |    4 ++--
 mm/swap_state.c  |    2 +-
 5 files changed, 13 insertions(+), 12 deletions(-)

--- 2.6.28-rc6/fs/buffer.c	2008-10-24 09:28:16.000000000 +0100
+++ linux/fs/buffer.c	2008-11-27 13:30:58.000000000 +0000
@@ -1031,8 +1031,7 @@ grow_dev_page(struct block_device *bdev,
 	struct page *page;
 	struct buffer_head *bh;
 
-	page = find_or_create_page(inode->i_mapping, index,
-		(mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS)|__GFP_MOVABLE);
+	page = find_or_create_page(inode->i_mapping, index, GFP_NOFS);
 	if (!page)
 		return NULL;
 
--- 2.6.28-rc6/lib/radix-tree.c	2008-10-09 23:13:53.000000000 +0100
+++ linux/lib/radix-tree.c	2008-11-27 13:30:58.000000000 +0000
@@ -85,7 +85,7 @@ DEFINE_PER_CPU(struct radix_tree_preload
 
 static inline gfp_t root_gfp_mask(struct radix_tree_root *root)
 {
-	return root->gfp_mask & __GFP_BITS_MASK;
+	return root->gfp_mask & GFP_RECLAIM_MASK;
 }
 
 static inline void tag_set(struct radix_tree_node *node, unsigned int tag,
@@ -211,7 +211,8 @@ int radix_tree_preload(gfp_t gfp_mask)
 	rtp = &__get_cpu_var(radix_tree_preloads);
 	while (rtp->nr < ARRAY_SIZE(rtp->nodes)) {
 		preempt_enable();
-		node = kmem_cache_alloc(radix_tree_node_cachep, gfp_mask);
+		node = kmem_cache_alloc(radix_tree_node_cachep,
+					gfp_mask & GFP_RECLAIM_MASK);
 		if (node == NULL)
 			goto out;
 		preempt_disable();
--- 2.6.28-rc6/mm/filemap.c	2008-11-02 23:17:56.000000000 +0000
+++ linux/mm/filemap.c	2008-11-27 13:30:58.000000000 +0000
@@ -459,12 +459,11 @@ int add_to_page_cache_locked(struct page
 
 	VM_BUG_ON(!PageLocked(page));
 
-	error = mem_cgroup_cache_charge(page, current->mm,
-					gfp_mask & ~__GFP_HIGHMEM);
+	error = mem_cgroup_cache_charge(page, current->mm, gfp_mask);
 	if (error)
 		goto out;
 
-	error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+	error = radix_tree_preload(gfp_mask);
 	if (error == 0) {
 		page_cache_get(page);
 		page->mapping = mapping;
@@ -942,6 +941,7 @@ struct page *
 grab_cache_page_nowait(struct address_space *mapping, pgoff_t index)
 {
 	struct page *page = find_get_page(mapping, index);
+	gfp_t gfp_mask;
 
 	if (page) {
 		if (trylock_page(page))
@@ -949,8 +949,9 @@ grab_cache_page_nowait(struct address_sp
 		page_cache_release(page);
 		return NULL;
 	}
-	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~__GFP_FS);
-	if (page && add_to_page_cache_lru(page, mapping, index, GFP_KERNEL)) {
+	gfp_mask = mapping_gfp_mask(mapping) & ~__GFP_FS;
+	page = __page_cache_alloc(gfp_mask);
+	if (page && add_to_page_cache_lru(page, mapping, index, gfp_mask)) {
 		page_cache_release(page);
 		page = NULL;
 	}
--- 2.6.28-rc6/mm/shmem.c	2008-11-02 23:17:56.000000000 +0000
+++ linux/mm/shmem.c	2008-11-27 13:30:58.000000000 +0000
@@ -1214,7 +1214,7 @@ repeat:
 		 * Try to preload while we can wait, to not make a habit of
 		 * draining atomic reserves; but don't latch on to this cpu.
 		 */
-		error = radix_tree_preload(gfp & ~__GFP_HIGHMEM);
+		error = radix_tree_preload(gfp);
 		if (error)
 			goto failed;
 		radix_tree_preload_end();
@@ -1371,7 +1371,7 @@ repeat:
 
 			/* Precharge page while we can wait, compensate after */
 			error = mem_cgroup_cache_charge(filepage, current->mm,
-							gfp & ~__GFP_HIGHMEM);
+							gfp);
 			if (error) {
 				page_cache_release(filepage);
 				shmem_unacct_blocks(info->flags, 1);
--- 2.6.28-rc6/mm/swap_state.c	2008-10-24 09:28:26.000000000 +0100
+++ linux/mm/swap_state.c	2008-11-27 13:30:58.000000000 +0000
@@ -305,7 +305,7 @@ struct page *read_swap_cache_async(swp_e
 		 */
 		__set_page_locked(new_page);
 		SetPageSwapBacked(new_page);
-		err = add_to_swap_cache(new_page, entry, gfp_mask & GFP_KERNEL);
+		err = add_to_swap_cache(new_page, entry, gfp_mask);
 		if (likely(!err)) {
 			/*
 			 * Initiate read into locked page and return.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-27 18:14     ` Hugh Dickins
@ 2008-11-28 12:04       ` Nick Piggin
  2008-12-01  1:18         ` Hugh Dickins
  0 siblings, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2008-11-28 12:04 UTC (permalink / raw)
  To: Hugh Dickins; +Cc: Pekka Enberg, Andrew Morton, linux-fsdevel, linux-mm

On Thu, Nov 27, 2008 at 06:14:18PM +0000, Hugh Dickins wrote:
> On Thu, 27 Nov 2008, Nick Piggin wrote:
> > On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> > > > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > > > +               err = add_to_page_cache_lru(page, mapping, index,
> > > > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> > > 
> > > Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> > > __GFP_NOFAIL, for example, down to radix_tree_preload() et al?
> 
> I certainly agree with Pekka's suggestion to use GFP_RECLAIM_MASK.
> 
> > 
> > Updated patch.
> 
> I'm not sure about it.  I came here before 2.6.25, not yet got around
> to submitting, I went in the opposite direction.  What drove me was an
> irritation at the growing number of & ~__GFP_HIGHMEMs (after adding a
> couple myself in shmem.c).  At the least, I think we ought to change
> those to & GFP_RECLAIM_MASKs (it seems we don't have one, but can
> imagine a block driver that wants GFP_DMA or GFP_DMA32 for pagecache:
> there's no reason to alloc its kernel-internal data structures for DMA).
> 
> My patch went the opposite direction to yours, in that I was pushing
> down the GFP_RECLAIM_MASKing into lib/radix-tree.c and mm/memcontrol.c
> (but that now doesn't kmalloc for itself, so no longer needs the mask).
> 
> I'm not sure which way is the right way: you can say I'm inconsistent
> not to push it down into slab/sleb/slib/slob/slub itself, but I've a
> notion that someone might object to any extra intrns in their hotpaths.
> 
> My design principle, I think, was to minimize the line length of
> the maximum number of source lines: you may believe in other
> design principles of higher value.

I think logically it doesn't belong in those files. The problem comes
purely from the pagecache layer because we're using gfp masks to ask
for two different (and not quite compatible things).

We're using it to specify the pagecache page's memory type, and also
the allocation context for both the pagecache page, and any other
supporting allocations required.

I think it's much more logical to go into the pagecache layer.

 
> Updating it quickly to 2.6.28-rc6, built but untested, here's mine.
> I'm not saying it's the right approach and yours wrong, just please
> consider it before deciding on which way to go.
> 
> I've left in the hunk from fs/buffer.c in case you can decipher it,
> I think that's what held me up from submitting: I've never worked
> out since whether that change is a profound insight into reality
> here, or just a blind attempt to reduce the line length.

I don't see why you drop the mapping_gfp_mask from there... if we ever
change the buffer layer to support HIGHMEM pages, we'd want to set the
inode's mapping_gfp_mask to GFP_HIGHUSER rather than GFP_USER, wouldn't
we?


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 2/2] fs: symlink write_begin allocation context fix
  2008-11-27 11:02   ` KOSAKI Motohiro
  2008-11-27 11:14     ` Nick Piggin
@ 2008-11-28 14:37     ` Nick Piggin
  2008-11-29  6:35       ` KOSAKI Motohiro
  1 sibling, 1 reply; 16+ messages in thread
From: Nick Piggin @ 2008-11-28 14:37 UTC (permalink / raw)
  To: KOSAKI Motohiro; +Cc: Andrew Morton, linux-fsdevel, linux-mm

On Thu, Nov 27, 2008 at 08:02:32PM +0900, KOSAKI Motohiro wrote:
> > @@ -2820,8 +2825,7 @@ fail:
> >  
> >  int page_symlink(struct inode *inode, const char *symname, int len)
> >  {
> > -	return __page_symlink(inode, symname, len,
> > -			mapping_gfp_mask(inode->i_mapping));
> > +	return __page_symlink(inode, symname, len, 0);
> >  }
> 
> your patch always pass 0 into __page_symlink().
> therefore it doesn't change any behavior.
> 
> right?

How about this patch?

--
With the write_begin/write_end aops, page_symlink was broken because it
could no longer pass a GFP_NOFS type mask into the point where the allocations
happened. They are done in write_begin, which would always assume that the
filesystem can be entered from reclaim.

The funny thing with having a gfp_t mask there is that it doesn't really allow
the caller to arbitrarily tinker with the context in which it can be called.
It couldn't ever be GFP_ATOMIC, for example, because it needs to take the
page lock. The only thing any callers care about is __GFP_FS anyway, so turn
that into a single flag.

Add a new flag for write_begin, AOP_FLAG_NOFS. Filesystems can now act on
this flag in their write_begin function. Change __grab_cache_page to accept
a nofs argument as well, to honour that flag (while we're there, change the
name to grab_cache_page_write_begin which is more instructive and does away
with random leading underscores).

This is really a more flexible way to go in the end anyway -- if a filesystem
happens to want any extra allocations aside from the pagecache ones in ints
write_begin function, it may now use GFP_KERNEL for common case allocations
(eg. ocfs2_alloc_write_ctxt, for a random example).

Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/fs/namei.c
===================================================================
--- linux-2.6.orig/fs/namei.c
+++ linux-2.6/fs/namei.c
@@ -2786,18 +2786,23 @@ void page_put_link(struct dentry *dentry
 	}
 }
 
-int __page_symlink(struct inode *inode, const char *symname, int len,
-		gfp_t gfp_mask)
+/*
+ * The nofs argument instructs pagecache_write_begin to pass AOP_FLAG_NOFS
+ */
+int __page_symlink(struct inode *inode, const char *symname, int len, int nofs)
 {
 	struct address_space *mapping = inode->i_mapping;
 	struct page *page;
 	void *fsdata;
 	int err;
 	char *kaddr;
+	unsigned int flags = AOP_FLAG_UNINTERRUPTIBLE;
+	if (nofs)
+		flags |= AOP_FLAG_NOFS;
 
 retry:
 	err = pagecache_write_begin(NULL, mapping, 0, len-1,
-				AOP_FLAG_UNINTERRUPTIBLE, &page, &fsdata);
+				flags, &page, &fsdata);
 	if (err)
 		goto fail;
 
@@ -2821,7 +2826,7 @@ fail:
 int page_symlink(struct inode *inode, const char *symname, int len)
 {
 	return __page_symlink(inode, symname, len,
-			mapping_gfp_mask(inode->i_mapping));
+			!(mapping_gfp_mask(inode->i_mapping) & __GFP_FS));
 }
 
 const struct inode_operations page_symlink_inode_operations = {
Index: linux-2.6/include/linux/fs.h
===================================================================
--- linux-2.6.orig/include/linux/fs.h
+++ linux-2.6/include/linux/fs.h
@@ -413,6 +413,9 @@ enum positive_aop_returns {
 
 #define AOP_FLAG_UNINTERRUPTIBLE	0x0001 /* will not do a short write */
 #define AOP_FLAG_CONT_EXPAND		0x0002 /* called from cont_expand */
+#define AOP_FLAG_NOFS			0x0004 /* used by filesystem to direct
+						* helper code (eg buffer layer)
+						* to clear GFP_FS from alloc */
 
 /*
  * oh the beauties of C type declarations.
@@ -2022,7 +2025,7 @@ extern int page_readlink(struct dentry *
 extern void *page_follow_link_light(struct dentry *, struct nameidata *);
 extern void page_put_link(struct dentry *, struct nameidata *, void *);
 extern int __page_symlink(struct inode *inode, const char *symname, int len,
-		gfp_t gfp_mask);
+		int nofs);
 extern int page_symlink(struct inode *inode, const char *symname, int len);
 extern const struct inode_operations page_symlink_inode_operations;
 extern int generic_readlink(struct dentry *, char __user *, int);
Index: linux-2.6/fs/ext3/namei.c
===================================================================
--- linux-2.6.orig/fs/ext3/namei.c
+++ linux-2.6/fs/ext3/namei.c
@@ -2170,8 +2170,7 @@ retry:
 		 * We have a transaction open.  All is sweetness.  It also sets
 		 * i_size in generic_commit_write().
 		 */
-		err = __page_symlink(inode, symname, l,
-				mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+		err = __page_symlink(inode, symname, l, 1);
 		if (err) {
 			drop_nlink(inode);
 			ext3_mark_inode_dirty(handle, inode);
Index: linux-2.6/fs/ext4/namei.c
===================================================================
--- linux-2.6.orig/fs/ext4/namei.c
+++ linux-2.6/fs/ext4/namei.c
@@ -2208,8 +2208,7 @@ retry:
 		 * We have a transaction open.  All is sweetness.  It also sets
 		 * i_size in generic_commit_write().
 		 */
-		err = __page_symlink(inode, symname, l,
-				mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS);
+		err = __page_symlink(inode, symname, l, 1);
 		if (err) {
 			clear_nlink(inode);
 			ext4_mark_inode_dirty(handle, inode);
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h
+++ linux-2.6/include/linux/pagemap.h
@@ -241,7 +241,8 @@ unsigned find_get_pages_contig(struct ad
 unsigned find_get_pages_tag(struct address_space *mapping, pgoff_t *index,
 			int tag, unsigned int nr_pages, struct page **pages);
 
-struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index);
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+			pgoff_t index, int nofs);
 
 /*
  * Returns locked page at given index in given cache, creating it if needed.
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c
+++ linux-2.6/mm/filemap.c
@@ -2147,19 +2147,24 @@ EXPORT_SYMBOL(generic_file_direct_write)
  * Find or create a page at the given pagecache position. Return the locked
  * page. This function is specifically for buffered writes.
  */
-struct page *__grab_cache_page(struct address_space *mapping, pgoff_t index)
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+					pgoff_t index, int nofs)
 {
 	int status;
 	struct page *page;
+	gfp_t gfp_notmask = 0;
+	if (nofs)
+		gfp_notmask = __GFP_FS;
 repeat:
 	page = find_lock_page(mapping, index);
 	if (likely(page))
 		return page;
 
-	page = page_cache_alloc(mapping);
+	page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
 	if (!page)
 		return NULL;
-	status = add_to_page_cache_lru(page, mapping, index, GFP_KERNEL);
+	status = add_to_page_cache_lru(page, mapping, index,
+						GFP_KERNEL & ~gfp_notmask);
 	if (unlikely(status)) {
 		page_cache_release(page);
 		if (status == -EEXIST)
@@ -2168,7 +2173,7 @@ repeat:
 	}
 	return page;
 }
-EXPORT_SYMBOL(__grab_cache_page);
+EXPORT_SYMBOL(grab_cache_page_write_begin);
 
 static ssize_t generic_perform_write(struct file *file,
 				struct iov_iter *i, loff_t pos)
Index: linux-2.6/fs/affs/file.c
===================================================================
--- linux-2.6.orig/fs/affs/file.c
+++ linux-2.6/fs/affs/file.c
@@ -628,7 +628,7 @@ static int affs_write_begin_ofs(struct f
 	}
 
 	index = pos >> PAGE_CACHE_SHIFT;
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index, flags&AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/afs/write.c
===================================================================
--- linux-2.6.orig/fs/afs/write.c
+++ linux-2.6/fs/afs/write.c
@@ -144,7 +144,7 @@ int afs_write_begin(struct file *file, s
 	candidate->state = AFS_WBACK_PENDING;
 	init_waitqueue_head(&candidate->waitq);
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index, flags&AOP_FLAG_NOFS);
 	if (!page) {
 		kfree(candidate);
 		return -ENOMEM;
Index: linux-2.6/fs/buffer.c
===================================================================
--- linux-2.6.orig/fs/buffer.c
+++ linux-2.6/fs/buffer.c
@@ -1987,7 +1987,8 @@ int block_write_begin(struct file *file,
 	page = *pagep;
 	if (page == NULL) {
 		ownpage = 1;
-		page = __grab_cache_page(mapping, index);
+		page = grab_cache_page_write_begin(mapping, index,
+						flags & AOP_FLAG_NOFS);
 		if (!page) {
 			status = -ENOMEM;
 			goto out;
@@ -2493,7 +2494,8 @@ int nobh_write_begin(struct file *file,
 	from = pos & (PAGE_CACHE_SIZE - 1);
 	to = from + len;
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/cifs/file.c
===================================================================
--- linux-2.6.orig/fs/cifs/file.c
+++ linux-2.6/fs/cifs/file.c
@@ -2065,7 +2065,8 @@ static int cifs_write_begin(struct file
 
 	cFYI(1, ("write_begin from %lld len %d", (long long)pos, len));
 
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 
Index: linux-2.6/fs/ecryptfs/mmap.c
===================================================================
--- linux-2.6.orig/fs/ecryptfs/mmap.c
+++ linux-2.6/fs/ecryptfs/mmap.c
@@ -288,7 +288,8 @@ static int ecryptfs_write_begin(struct f
 	loff_t prev_page_end_size;
 	int rc = 0;
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/ext3/inode.c
===================================================================
--- linux-2.6.orig/fs/ext3/inode.c
+++ linux-2.6/fs/ext3/inode.c
@@ -1160,7 +1160,8 @@ static int ext3_write_begin(struct file
 	to = from + len;
 
 retry:
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/ext4/inode.c
===================================================================
--- linux-2.6.orig/fs/ext4/inode.c
+++ linux-2.6/fs/ext4/inode.c
@@ -1345,7 +1345,8 @@ retry:
 		goto out;
 	}
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page) {
 		ext4_journal_stop(handle);
 		ret = -ENOMEM;
@@ -2549,7 +2550,8 @@ retry:
 		goto out;
 	}
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page) {
 		ext4_journal_stop(handle);
 		ret = -ENOMEM;
Index: linux-2.6/fs/fuse/file.c
===================================================================
--- linux-2.6.orig/fs/fuse/file.c
+++ linux-2.6/fs/fuse/file.c
@@ -646,7 +646,8 @@ static int fuse_write_begin(struct file
 {
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6/fs/gfs2/ops_address.c
===================================================================
--- linux-2.6.orig/fs/gfs2/ops_address.c
+++ linux-2.6/fs/gfs2/ops_address.c
@@ -675,7 +675,8 @@ static int gfs2_write_begin(struct file
 		goto out_trans_fail;
 
 	error = -ENOMEM;
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	*pagep = page;
 	if (unlikely(!page))
 		goto out_endtrans;
Index: linux-2.6/fs/hostfs/hostfs_kern.c
===================================================================
--- linux-2.6.orig/fs/hostfs/hostfs_kern.c
+++ linux-2.6/fs/hostfs/hostfs_kern.c
@@ -501,7 +501,8 @@ int hostfs_write_begin(struct file *file
 {
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
 
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6/fs/jffs2/file.c
===================================================================
--- linux-2.6.orig/fs/jffs2/file.c
+++ linux-2.6/fs/jffs2/file.c
@@ -132,7 +132,8 @@ static int jffs2_write_begin(struct file
 	uint32_t pageofs = index << PAGE_CACHE_SHIFT;
 	int ret = 0;
 
-	pg = __grab_cache_page(mapping, index);
+	pg = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!pg)
 		return -ENOMEM;
 	*pagep = pg;
Index: linux-2.6/fs/libfs.c
===================================================================
--- linux-2.6.orig/fs/libfs.c
+++ linux-2.6/fs/libfs.c
@@ -360,7 +360,8 @@ int simple_write_begin(struct file *file
 	index = pos >> PAGE_CACHE_SHIFT;
 	from = pos & (PAGE_CACHE_SIZE - 1);
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 
Index: linux-2.6/fs/nfs/file.c
===================================================================
--- linux-2.6.orig/fs/nfs/file.c
+++ linux-2.6/fs/nfs/file.c
@@ -354,7 +354,8 @@ static int nfs_write_begin(struct file *
 		file->f_path.dentry->d_name.name,
 		mapping->host->i_ino, len, (long long) pos);
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/reiserfs/inode.c
===================================================================
--- linux-2.6.orig/fs/reiserfs/inode.c
+++ linux-2.6/fs/reiserfs/inode.c
@@ -2556,7 +2556,8 @@ static int reiserfs_write_begin(struct f
 	}
 
 	index = pos >> PAGE_CACHE_SHIFT;
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!page)
 		return -ENOMEM;
 	*pagep = page;
Index: linux-2.6/fs/smbfs/file.c
===================================================================
--- linux-2.6.orig/fs/smbfs/file.c
+++ linux-2.6/fs/smbfs/file.c
@@ -297,7 +297,8 @@ static int smb_write_begin(struct file *
 			struct page **pagep, void **fsdata)
 {
 	pgoff_t index = pos >> PAGE_CACHE_SHIFT;
-	*pagep = __grab_cache_page(mapping, index);
+	*pagep = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (!*pagep)
 		return -ENOMEM;
 	return 0;
Index: linux-2.6/fs/ubifs/file.c
===================================================================
--- linux-2.6.orig/fs/ubifs/file.c
+++ linux-2.6/fs/ubifs/file.c
@@ -247,7 +247,8 @@ static int write_begin_slow(struct addre
 	if (unlikely(err))
 		return err;
 
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (unlikely(!page)) {
 		ubifs_release_budget(c, &req);
 		return -ENOMEM;
@@ -438,7 +439,8 @@ static int ubifs_write_begin(struct file
 		return -EROFS;
 
 	/* Try out the fast-path part first */
-	page = __grab_cache_page(mapping, index);
+	page = grab_cache_page_write_begin(mapping, index,
+					flags & AOP_FLAG_NOFS);
 	if (unlikely(!page))
 		return -ENOMEM;
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 2/2] fs: symlink write_begin allocation context fix
  2008-11-28 14:37     ` Nick Piggin
@ 2008-11-29  6:35       ` KOSAKI Motohiro
  0 siblings, 0 replies; 16+ messages in thread
From: KOSAKI Motohiro @ 2008-11-29  6:35 UTC (permalink / raw)
  To: Nick Piggin; +Cc: kosaki.motohiro, Andrew Morton, linux-fsdevel, linux-mm

> On Thu, Nov 27, 2008 at 08:02:32PM +0900, KOSAKI Motohiro wrote:
> > > @@ -2820,8 +2825,7 @@ fail:
> > >  
> > >  int page_symlink(struct inode *inode, const char *symname, int len)
> > >  {
> > > -	return __page_symlink(inode, symname, len,
> > > -			mapping_gfp_mask(inode->i_mapping));
> > > +	return __page_symlink(inode, symname, len, 0);
> > >  }
> > 
> > your patch always pass 0 into __page_symlink().
> > therefore it doesn't change any behavior.
> > 
> > right?
> 
> How about this patch?

looks good to me.
very thanks.

	Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>




^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-11-28 12:04       ` Nick Piggin
@ 2008-12-01  1:18         ` Hugh Dickins
  2008-12-01  1:50           ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 16+ messages in thread
From: Hugh Dickins @ 2008-12-01  1:18 UTC (permalink / raw)
  To: Nick Piggin
  Cc: Pekka Enberg, Andrew Morton, KAMEZAWA Hiroyuki, linux-fsdevel,
	linux-mm

On Fri, 28 Nov 2008, Nick Piggin wrote:
> On Thu, Nov 27, 2008 at 06:14:18PM +0000, Hugh Dickins wrote:
> > On Thu, 27 Nov 2008, Nick Piggin wrote:
> > > On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> > > > > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > > > > +               err = add_to_page_cache_lru(page, mapping, index,
> > > > > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> > > > 
> > > > Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> > > > __GFP_NOFAIL, for example, down to radix_tree_preload() et al?
> > 
> > I certainly agree with Pekka's suggestion to use GFP_RECLAIM_MASK.
> > 
> > > 
> > > Updated patch.
> > 
> > I'm not sure about it.  I came here before 2.6.25, not yet got around
> > to submitting, I went in the opposite direction.  What drove me was an
> > irritation at the growing number of & ~__GFP_HIGHMEMs (after adding a
> > couple myself in shmem.c).  At the least, I think we ought to change
> > those to & GFP_RECLAIM_MASKs (it seems we don't have one, but can
> > imagine a block driver that wants GFP_DMA or GFP_DMA32 for pagecache:
> > there's no reason to alloc its kernel-internal data structures for DMA).
> > 
> > My patch went the opposite direction to yours, in that I was pushing
> > down the GFP_RECLAIM_MASKing into lib/radix-tree.c and mm/memcontrol.c
> > (but that now doesn't kmalloc for itself, so no longer needs the mask).
> > 
> > I'm not sure which way is the right way: you can say I'm inconsistent
> > not to push it down into slab/sleb/slib/slob/slub itself, but I've a
> > notion that someone might object to any extra intrns in their hotpaths.
> > 
> > My design principle, I think, was to minimize the line length of
> > the maximum number of source lines: you may believe in other
> > design principles of higher value.
> 
> I think logically it doesn't belong in those files. The problem comes
> purely from the pagecache layer because we're using gfp masks to ask
> for two different (and not quite compatible things).
> 
> We're using it to specify the pagecache page's memory type, and also
> the allocation context for both the pagecache page, and any other
> supporting allocations required.
> 
> I think it's much more logical to go into the pagecache layer.

Fair enough, I can go along with that, and stomach the longer lines.

I do think that we ought to change those &~__GFP_HIGHMEMs to
&GFP_RECLAIM_MASKs, but that doesn't have to be part of your patch.

And it does make me think that Kamezawa-san's patch in mmotm
memcg-fix-gfp_mask-of-callers-of-charge.patch
which is changing the gfp arg to assorted mem_cgroup interfaces
from GFP_KERNEL to GFP_HIGHUSER_MOVABLE, is probably misguided:
that gfp arg is not telling them what zones of memory to use,
it's telling them the constraints if it reclaims memory.

I'd skip the churn and leave them as GFP_KERNEL - you could argue
that we should define another another name for that set of reclaim
constraints, but I think it would end up too much hair-splitting.

> > Updating it quickly to 2.6.28-rc6, built but untested, here's mine.
> > I'm not saying it's the right approach and yours wrong, just please
> > consider it before deciding on which way to go.
> > 
> > I've left in the hunk from fs/buffer.c in case you can decipher it,
> > I think that's what held me up from submitting: I've never worked
> > out since whether that change is a profound insight into reality
> > here, or just a blind attempt to reduce the line length.
> 
> I don't see why you drop the mapping_gfp_mask from there... if we ever
> change the buffer layer to support HIGHMEM pages, we'd want to set the
> inode's mapping_gfp_mask to GFP_HIGHUSER rather than GFP_USER, wouldn't
> we?

It was just a mysterious fragment, if it makes no sense to you either,
let's just forget it - thanks for looking.

Hugh

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-12-01  1:18         ` Hugh Dickins
@ 2008-12-01  1:50           ` KAMEZAWA Hiroyuki
  2008-12-01  7:39             ` KAMEZAWA Hiroyuki
  0 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01  1:50 UTC (permalink / raw)
  To: Hugh Dickins
  Cc: Nick Piggin, Pekka Enberg, Andrew Morton, linux-fsdevel, linux-mm

On Mon, 1 Dec 2008 01:18:09 +0000 (GMT)
Hugh Dickins <hugh@veritas.com> wrote:

> On Fri, 28 Nov 2008, Nick Piggin wrote:
> > On Thu, Nov 27, 2008 at 06:14:18PM +0000, Hugh Dickins wrote:
> > > On Thu, 27 Nov 2008, Nick Piggin wrote:
> > > > On Thu, Nov 27, 2008 at 11:52:40AM +0200, Pekka Enberg wrote:
> > > > > > -               err = add_to_page_cache_lru(page, mapping, index, gfp_mask);
> > > > > > +               err = add_to_page_cache_lru(page, mapping, index,
> > > > > > +                       (gfp_mask & (__GFP_FS|__GFP_IO|__GFP_WAIT|__GFP_HIGH)));
> > > > > 
> > > > > Can we use GFP_RECLAIM_MASK here? I mean, surely we need to pass
> > > > > __GFP_NOFAIL, for example, down to radix_tree_preload() et al?
> > > 
> > > I certainly agree with Pekka's suggestion to use GFP_RECLAIM_MASK.
> > > 
> > > > 
> > > > Updated patch.
> > > 
> > > I'm not sure about it.  I came here before 2.6.25, not yet got around
> > > to submitting, I went in the opposite direction.  What drove me was an
> > > irritation at the growing number of & ~__GFP_HIGHMEMs (after adding a
> > > couple myself in shmem.c).  At the least, I think we ought to change
> > > those to & GFP_RECLAIM_MASKs (it seems we don't have one, but can
> > > imagine a block driver that wants GFP_DMA or GFP_DMA32 for pagecache:
> > > there's no reason to alloc its kernel-internal data structures for DMA).
> > > 
> > > My patch went the opposite direction to yours, in that I was pushing
> > > down the GFP_RECLAIM_MASKing into lib/radix-tree.c and mm/memcontrol.c
> > > (but that now doesn't kmalloc for itself, so no longer needs the mask).
> > > 
> > > I'm not sure which way is the right way: you can say I'm inconsistent
> > > not to push it down into slab/sleb/slib/slob/slub itself, but I've a
> > > notion that someone might object to any extra intrns in their hotpaths.
> > > 
> > > My design principle, I think, was to minimize the line length of
> > > the maximum number of source lines: you may believe in other
> > > design principles of higher value.
> > 
> > I think logically it doesn't belong in those files. The problem comes
> > purely from the pagecache layer because we're using gfp masks to ask
> > for two different (and not quite compatible things).
> > 
> > We're using it to specify the pagecache page's memory type, and also
> > the allocation context for both the pagecache page, and any other
> > supporting allocations required.
> > 
> > I think it's much more logical to go into the pagecache layer.
> 
> Fair enough, I can go along with that, and stomach the longer lines.
> 
> I do think that we ought to change those &~__GFP_HIGHMEMs to
> &GFP_RECLAIM_MASKs, but that doesn't have to be part of your patch.
> 
> And it does make me think that Kamezawa-san's patch in mmotm
> memcg-fix-gfp_mask-of-callers-of-charge.patch
> which is changing the gfp arg to assorted mem_cgroup interfaces
> from GFP_KERNEL to GFP_HIGHUSER_MOVABLE, is probably misguided:
> that gfp arg is not telling them what zones of memory to use,
> it's telling them the constraints if it reclaims memory.
> 
It comes from the fact that memcg reclaims memory not because of memory shortage
but of memory limit.
"From which zone the memory should be reclaimed" is not problem. 
I used GFP_HIGHUSER_MOVABLE to show "reclaim from anyware" in explicit way.
too bad ?

Thanks,
-Kame



> I'd skip the churn and leave them as GFP_KERNEL - you could argue
> that we should define another another name for that set of reclaim
> constraints, but I think it would end up too much hair-splitting.
> 
> > > Updating it quickly to 2.6.28-rc6, built but untested, here's mine.
> > > I'm not saying it's the right approach and yours wrong, just please
> > > consider it before deciding on which way to go.
> > > 
> > > I've left in the hunk from fs/buffer.c in case you can decipher it,
> > > I think that's what held me up from submitting: I've never worked
> > > out since whether that change is a profound insight into reality
> > > here, or just a blind attempt to reduce the line length.
> > 
> > I don't see why you drop the mapping_gfp_mask from there... if we ever
> > change the buffer layer to support HIGHMEM pages, we'd want to set the
> > inode's mapping_gfp_mask to GFP_HIGHUSER rather than GFP_USER, wouldn't
> > we?
> 
> It was just a mysterious fragment, if it makes no sense to you either,
> let's just forget it - thanks for looking.
> 
> Hugh
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [patch 1/2] mm: pagecache allocation gfp fixes
  2008-12-01  1:50           ` KAMEZAWA Hiroyuki
@ 2008-12-01  7:39             ` KAMEZAWA Hiroyuki
  0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2008-12-01  7:39 UTC (permalink / raw)
  To: KAMEZAWA Hiroyuki
  Cc: Hugh Dickins, Nick Piggin, Pekka Enberg, Andrew Morton,
	linux-fsdevel, linux-mm

On Mon, 1 Dec 2008 10:50:38 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:

> On Mon, 1 Dec 2008 01:18:09 +0000 (GMT)
> Hugh Dickins <hugh@veritas.com> wrote:
>
> It comes from the fact that memcg reclaims memory not because of memory shortage
> but of memory limit.
> "From which zone the memory should be reclaimed" is not problem. 
> I used GFP_HIGHUSER_MOVABLE to show "reclaim from anyware" in explicit way.
> too bad ?
> 
Maybe I got your point..

Hmm...but...

mmotm-Nov29's following gfp_mask is buggy (mis leading).
==
int add_to_page_cache_locked(struct page *page, struct address_space *mapping,
                pgoff_t offset, gfp_t gfp_mask)
{
        int error;

        error = mem_cgroup_cache_charge(page, current->mm,
                                        gfp_mask & ~__GFP_HIGHMEM);
        if (error)
==

mem_cgroup_cache_charge() has to reclaim memory from HIGHMEM (if used.) 
to make room. (not to reclaim memory from some specified area.)

(Anyway) memcg's page reclaim code uses following
==
unsigned long try_to_free_mem_cgroup_pages(struct mem_cgroup *mem_cont,
                                                gfp_t gfp_mask,
                                           bool noswap)
{

        sc.gfp_mask = (gfp_mask & GFP_RECLAIM_MASK) |
                        (GFP_HIGHUSER_MOVABLE & ~GFP_RECLAIM_MASK);
....
}
==

And we don't see bug..

I'll try mmotm-Nov30 and find some way to do better explanation.
This gfp semantics of memcg is a bit different from other gfp's.

Thanks,
-Kame



^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2008-12-01  7:40 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2008-11-27  9:34 [patch 1/2] mm: pagecache allocation gfp fixes Nick Piggin
2008-11-27  9:35 ` [patch 2/2] fs: symlink write_begin allocation context fix Nick Piggin
2008-11-27 11:02   ` KOSAKI Motohiro
2008-11-27 11:14     ` Nick Piggin
2008-11-28 14:37     ` Nick Piggin
2008-11-29  6:35       ` KOSAKI Motohiro
2008-11-27  9:52 ` [patch 1/2] mm: pagecache allocation gfp fixes Pekka Enberg
2008-11-27 10:01   ` Nick Piggin
2008-11-27 10:18   ` Nick Piggin
2008-11-27 10:28     ` Pekka Enberg
2008-11-27 10:40       ` KOSAKI Motohiro
2008-11-27 18:14     ` Hugh Dickins
2008-11-28 12:04       ` Nick Piggin
2008-12-01  1:18         ` Hugh Dickins
2008-12-01  1:50           ` KAMEZAWA Hiroyuki
2008-12-01  7:39             ` KAMEZAWA Hiroyuki

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).