* Pagecache: find_or_create_page does not call a proper page allocator function
@ 2007-04-23 21:11 Christoph Lameter
2007-04-23 21:29 ` Andrew Morton
0 siblings, 1 reply; 31+ messages in thread
From: Christoph Lameter @ 2007-04-23 21:11 UTC (permalink / raw)
To: Nick Piggin; +Cc: linux-kernel, pj, akpm, Hugh Dickins
The find_or_create function calls alloc_page with a local gfp mask instead
of using page_cache_alloc. This means that the page allocation will not
obey cpuset memory spreading and page allocation will not properly use the
gfp flags in the address space. Highmem is not set correctly.
It turns out that there is no function to allocate a page for the page cache
with a gfp mask. So we create one ORing the context gfp flags with the gfp flags
from the mapping.
Signed-off-by: Christoph Lameter <clameter@sgi.com>
---
include/linux/pagemap.h | 10 ++++++++--
mm/filemap.c | 3 ++-
2 files changed, 10 insertions(+), 3 deletions(-)
Index: linux-2.6.21-rc7/include/linux/pagemap.h
===================================================================
--- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-23 13:52:20.000000000 -0700
+++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-23 14:01:09.000000000 -0700
@@ -60,14 +60,20 @@ static inline struct page *__page_cache_
}
#endif
+static inline struct page *page_cache_alloc_mask(struct address_space *x,
+ gfp_t gfp)
+{
+ return __page_cache_alloc(mapping_gfp_mask(x) | gfp);
+}
+
static inline struct page *page_cache_alloc(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x));
+ return page_cache_alloc_mask(x, 0);
}
static inline struct page *page_cache_alloc_cold(struct address_space *x)
{
- return __page_cache_alloc(mapping_gfp_mask(x)|__GFP_COLD);
+ return page_cache_alloc_mask(x, __GFP_COLD);
}
typedef int filler_t(void *, struct page *);
Index: linux-2.6.21-rc7/mm/filemap.c
===================================================================
--- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-23 13:50:24.000000000 -0700
+++ linux-2.6.21-rc7/mm/filemap.c 2007-04-23 13:52:14.000000000 -0700
@@ -670,7 +670,8 @@ repeat:
page = find_lock_page(mapping, index);
if (!page) {
if (!cached_page) {
- cached_page = alloc_page(gfp_mask);
+ cached_page =
+ page_cache_alloc_mask(mapping, gfp_mask);
if (!cached_page)
return NULL;
}
^ permalink raw reply [flat|nested] 31+ messages in thread* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-23 21:11 Pagecache: find_or_create_page does not call a proper page allocator function Christoph Lameter @ 2007-04-23 21:29 ` Andrew Morton 2007-04-23 21:37 ` Christoph Lameter 2007-04-23 22:33 ` Christoph Lameter 0 siblings, 2 replies; 31+ messages in thread From: Andrew Morton @ 2007-04-23 21:29 UTC (permalink / raw) To: Christoph Lameter; +Cc: Nick Piggin, linux-kernel, pj, Hugh Dickins On Mon, 23 Apr 2007 14:11:57 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > The find_or_create function calls alloc_page with a local gfp mask instead > of using page_cache_alloc. This means that the page allocation will not > obey cpuset memory spreading and page allocation will not properly use the > gfp flags in the address space. Highmem is not set correctly. > > It turns out that there is no function to allocate a page for the page cache > with a gfp mask. So we create one ORing the context gfp flags with the gfp flags > from the mapping. > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > > --- > include/linux/pagemap.h | 10 ++++++++-- > mm/filemap.c | 3 ++- > 2 files changed, 10 insertions(+), 3 deletions(-) > > Index: linux-2.6.21-rc7/include/linux/pagemap.h > =================================================================== > --- linux-2.6.21-rc7.orig/include/linux/pagemap.h 2007-04-23 13:52:20.000000000 -0700 > +++ linux-2.6.21-rc7/include/linux/pagemap.h 2007-04-23 14:01:09.000000000 -0700 > @@ -60,14 +60,20 @@ static inline struct page *__page_cache_ > } > #endif > > +static inline struct page *page_cache_alloc_mask(struct address_space *x, > + gfp_t gfp) > +{ > + return __page_cache_alloc(mapping_gfp_mask(x) | gfp); > +} Usually we use the term "mask" to imply an AND function, not an OR function. There are few calls to page_cache_alloc(). Would it not be simpler to just add the additional argument to page_cache_alloc() (called "extra_gfp", please) and to update all callers? And to remove page_cache_alloc_cold() and replace all it callers with page_cache_alloc(mapping, __GFP_COLD)? The way we actually get rid of an API call instead of adding another one. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-23 21:29 ` Andrew Morton @ 2007-04-23 21:37 ` Christoph Lameter 2007-04-23 22:33 ` Christoph Lameter 1 sibling, 0 replies; 31+ messages in thread From: Christoph Lameter @ 2007-04-23 21:37 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, pj, Hugh Dickins On Mon, 23 Apr 2007, Andrew Morton wrote: > > +static inline struct page *page_cache_alloc_mask(struct address_space *x, > > + gfp_t gfp) > > +{ > > + return __page_cache_alloc(mapping_gfp_mask(x) | gfp); > > +} > > Usually we use the term "mask" to imply an AND function, not an OR > function. Well but you pass an allocation mask.... Maybe call this page_cache_alloc_gfp? > There are few calls to page_cache_alloc(). Would it not be simpler to just > add the additional argument to page_cache_alloc() (called "extra_gfp", > please) and to update all callers? And to remove page_cache_alloc_cold() > and replace all it callers with page_cache_alloc(mapping, __GFP_COLD)? > > The way we actually get rid of an API call instead of adding another one. Ok. Give me an hour or so. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-23 21:29 ` Andrew Morton 2007-04-23 21:37 ` Christoph Lameter @ 2007-04-23 22:33 ` Christoph Lameter 2007-04-23 22:42 ` Christoph Lameter 2007-04-23 22:42 ` Andrew Morton 1 sibling, 2 replies; 31+ messages in thread From: Christoph Lameter @ 2007-04-23 22:33 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, pj, Hugh Dickins On Mon, 23 Apr 2007, Andrew Morton wrote: > There are few calls to page_cache_alloc(). Would it not be simpler to just > add the additional argument to page_cache_alloc() (called "extra_gfp", > please) and to update all callers? And to remove page_cache_alloc_cold() > and replace all it callers with page_cache_alloc(mapping, __GFP_COLD)? > > The way we actually get rid of an API call instead of adding another one. I am not quite sure about the ORing of the gfpmasks in find_or_create_page. There is a conflict between the needs for the allocation for the radix tree which has to be done with the specified mask and the allocation for the page in the mapping which has to be done with the gfp mask from the mapping (?). What a mess. Looking at the callers of find_or_create_page ..... They mostly specify the mapping_gfp_mask except for 1. grow_dev_page which wants GFP_NOFS but there is no way to switch off the proper flags if we go through page_cache_alloc(). Sigh. The mask is not usable since it does not switch on the mapping bits. It simply ORs the flags... 2. __page_symlink filters __GFP_FS from the mapping flags. Hmmm.. interesting. A way to fix it. So if we fix grow_dev_page to use the mapping flags like in __page_symlink uses then we have the right flags for the tree lock alloc and the page allocation. We do not need to determine the flags again in find_or_create_page. We have a bug here that needs fixing Fix page allocation flags in grow_dev_page() Grow dev page simply passes GFP_NOFS to find_or_create_page. This means the allocation of radix tree nodes is done with GFP_NOFS and the allocation of a new page is done using GFP_NOFS as well. The mapping has a flags field that contains the necessary allocation flags for the page cache allocation. These need to be consulted in order to get DMA and HIGHMEM allocations etc right. So fix grow_dev_page to do the same as __page_symlink calls. Retrieve the mask from the mapping and then remove __GFP_FS. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- fs/buffer.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) Index: linux-2.6.21-rc7/fs/buffer.c =================================================================== --- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 15:29:18.000000000 -0700 +++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 15:29:43.000000000 -0700 @@ -988,7 +988,8 @@ grow_dev_page(struct block_device *bdev, struct page *page; struct buffer_head *bh; - page = find_or_create_page(inode->i_mapping, index, GFP_NOFS); + page = find_or_create_page(inode->i_mapping, index, + mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS); if (!page) return NULL; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-23 22:33 ` Christoph Lameter @ 2007-04-23 22:42 ` Christoph Lameter 2007-04-23 22:42 ` Andrew Morton 1 sibling, 0 replies; 31+ messages in thread From: Christoph Lameter @ 2007-04-23 22:42 UTC (permalink / raw) To: Andrew Morton; +Cc: Nick Piggin, linux-kernel, pj, Hugh Dickins And the second fix (cleanup patch will follow) Pagecache: find_or_create_page does not spread memory. The find_or_create function calls alloc_page with the gfp_mask passed to it which is derived from the mappings gfp mask. So the allocation flags are right (assuming my bugfix to fs/buffer.c is applied). However, we call a alloc_page instead of page_cache_alloc in find_or_create_page. This means that the page cache allocation will not obey cpuset memory spreading. We really need to call __page_cache_alloc there. Also stick a comment in there explaining how the allocation mask passed to find_or_create_page needs to be handled. Signed-off-by: Christoph Lameter <clameter@sgi.com> --- mm/filemap.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) Index: linux-2.6.21-rc7/mm/filemap.c =================================================================== --- linux-2.6.21-rc7.orig/mm/filemap.c 2007-04-23 15:37:27.000000000 -0700 +++ linux-2.6.21-rc7/mm/filemap.c 2007-04-23 15:38:09.000000000 -0700 @@ -648,7 +648,8 @@ EXPORT_SYMBOL(find_lock_page); * find_or_create_page - locate or add a pagecache page * @mapping: the page's address_space * @index: the page's index into the mapping - * @gfp_mask: page allocation mode + * @gfp_mask: page allocation mode. This must be derived from the + * allocation flags of the mapping! * * Locates a page in the pagecache. If the page is not present, a new page * is allocated using @gfp_mask and is added to the pagecache and to the VM's @@ -670,7 +671,8 @@ repeat: page = find_lock_page(mapping, index); if (!page) { if (!cached_page) { - cached_page = alloc_page(gfp_mask); + cached_page = + __page_cache_alloc(gfp_mask); if (!cached_page) return NULL; } ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-23 22:33 ` Christoph Lameter 2007-04-23 22:42 ` Christoph Lameter @ 2007-04-23 22:42 ` Andrew Morton 2007-04-24 13:09 ` Hugh Dickins 1 sibling, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-23 22:42 UTC (permalink / raw) To: Christoph Lameter; +Cc: Nick Piggin, linux-kernel, pj, Hugh Dickins On Mon, 23 Apr 2007 15:33:07 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > Grow dev page simply passes GFP_NOFS to find_or_create_page. This means the > allocation of radix tree nodes is done with GFP_NOFS and the allocation > of a new page is done using GFP_NOFS as well. > > The mapping has a flags field that contains the necessary allocation flags for > the page cache allocation. These need to be consulted in order to get DMA > and HIGHMEM allocations etc right. > > So fix grow_dev_page to do the same as __page_symlink calls. Retrieve the > mask from the mapping and then remove __GFP_FS. > > Signed-off-by: Christoph Lameter <clameter@sgi.com> > > --- > fs/buffer.c | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > Index: linux-2.6.21-rc7/fs/buffer.c > =================================================================== > --- linux-2.6.21-rc7.orig/fs/buffer.c 2007-04-23 15:29:18.000000000 -0700 > +++ linux-2.6.21-rc7/fs/buffer.c 2007-04-23 15:29:43.000000000 -0700 > @@ -988,7 +988,8 @@ grow_dev_page(struct block_device *bdev, > struct page *page; > struct buffer_head *bh; > > - page = find_or_create_page(inode->i_mapping, index, GFP_NOFS); > + page = find_or_create_page(inode->i_mapping, index, > + mapping_gfp_mask(inode->i_mapping) & ~__GFP_FS); > if (!page) > return NULL; OK. I hope. the mapping_gfp_mask() here will have come from bdget()'s mapping_set_gfp_mask(&inode->i_data, GFP_USER); If anyone is accidentally setting __GFP_HIGHMEM on a blockdev address_space we'll cause ghastly explosions. Albeit ones which were well-deserved. We'll see how it goes. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-23 22:42 ` Andrew Morton @ 2007-04-24 13:09 ` Hugh Dickins 2007-04-24 17:11 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 31+ messages in thread From: Hugh Dickins @ 2007-04-24 13:09 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, Nick Piggin, linux-kernel, pj On Mon, 23 Apr 2007, Andrew Morton wrote: > > OK. I hope. the mapping_gfp_mask() here will have come from bdget()'s > mapping_set_gfp_mask(&inode->i_data, GFP_USER); If anyone is accidentally > setting __GFP_HIGHMEM on a blockdev address_space we'll cause ghastly > explosions. Albeit ones which were well-deserved. I've not yet looked at the patch under discussion, but this remark prompts me... a couple of days ago I got very worried by the various hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, and wondered how those would work out if someone has a blockdev mmap'ed. I tried to test it out before sending a patch, but found no problem at all: maybe I was too timid (fearing to corrupt my whole system), maybe I've forgotten how that stuff works and wasn't doing the right thing to reproduce it (I was mmap'ing /dev/sdb1 readonly, at the same time as having it mounted as ext2 - when I forced migration to random pages, then cp'ed /dev/zero to reuse the old pages, I was expecting ext2 to get very upset with its metadata; mmap'ing while mounted isn't very realistic, but my earlier sequence hadn't shown any problem either, so I thought the cache got invalidated in between). Here's the patch I'd suggest adding if you believe there really is a problem there: it's far from ideal (I can imagine mapping_gfp_mask being used to enforce other restrictions, but the __GFP_HIGHMEM issue seems to be the only one in practice; and it would be a shame to restrict all the architectures which have no concept of HIGHMEM). If there's no such problem, sorry for wasting your time. (If vma->vm_file is non-NULL, we can be sure vma->vm_file->f_mapping is non-NULL, can't we? Some common code assumes that, some does not: I've avoided cargo-cult safety below, but don't let me make it unsafe.) Is there a problem with page migration to HIGHMEM, if pages were mapped from a GFP_USER block device? I failed to demonstrate any problem, but here's a quick fix if needed. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- 2.6.21-rc7/include/linux/migrate.h 2007-03-07 13:08:59.000000000 +0000 +++ linux/include/linux/migrate.h 2007-04-24 13:18:31.000000000 +0100 @@ -2,6 +2,7 @@ #define _LINUX_MIGRATE_H #include <linux/mm.h> +#include <linux/pagemap.h> typedef struct page *new_page_t(struct page *, unsigned long private, int **); @@ -10,6 +11,13 @@ static inline int vma_migratable(struct { if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED)) return 0; +#ifdef CONFIG_HIGHMEM + if (vma->vm_file) { + struct address_space *mapping = vma->vm_file->f_mapping; + if (!(mapping_gfp_mask(mapping) & __GFP_HIGHMEM)) + return 0; + } +#endif return 1; } ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 13:09 ` Hugh Dickins @ 2007-04-24 17:11 ` Andrew Morton 2007-04-24 19:06 ` Hugh Dickins 2007-04-24 17:38 ` Christoph Lameter 2007-04-24 17:45 ` Christoph Lameter 2 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 17:11 UTC (permalink / raw) To: Hugh Dickins; +Cc: Christoph Lameter, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 14:09:33 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote: > On Mon, 23 Apr 2007, Andrew Morton wrote: > > > > OK. I hope. the mapping_gfp_mask() here will have come from bdget()'s > > mapping_set_gfp_mask(&inode->i_data, GFP_USER); If anyone is accidentally > > setting __GFP_HIGHMEM on a blockdev address_space we'll cause ghastly > > explosions. Albeit ones which were well-deserved. > > I've not yet looked at the patch under discussion, but this remark > prompts me... a couple of days ago I got very worried by the various > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, > and wondered how those would work out if someone has a blockdev mmap'ed. > > I tried to test it out before sending a patch, but found no problem at > all: maybe I was too timid (fearing to corrupt my whole system), maybe > I've forgotten how that stuff works and wasn't doing the right thing > to reproduce it (I was mmap'ing /dev/sdb1 readonly, at the same time > as having it mounted as ext2 - when I forced migration to random pages, > then cp'ed /dev/zero to reuse the old pages, I was expecting ext2 to > get very upset with its metadata; mmap'ing while mounted isn't very > realistic, but my earlier sequence hadn't shown any problem either, > so I thought the cache got invalidated in between). Yipes. > Here's the patch I'd suggest adding if you believe there really is > a problem there: it's far from ideal (I can imagine mapping_gfp_mask > being used to enforce other restrictions, but the __GFP_HIGHMEM issue > seems to be the only one in practice; and it would be a shame to > restrict all the architectures which have no concept of HIGHMEM). > If there's no such problem, sorry for wasting your time. Yes, I believe there is such a problem. We ignore the fact that the blockdev address_space doesn't implement ->migratepage and we cheerily call fallback_migrate_page(), which does the wrong thing. > (If vma->vm_file is non-NULL, we can be sure vma->vm_file->f_mapping > is non-NULL, can't we? Some common code assumes that, some does not: > I've avoided cargo-cult safety below, but don't let me make it unsafe.) > > > Is there a problem with page migration to HIGHMEM, if pages were > mapped from a GFP_USER block device? I failed to demonstrate any > problem, but here's a quick fix if needed. > > Signed-off-by: Hugh Dickins <hugh@veritas.com> > > --- 2.6.21-rc7/include/linux/migrate.h 2007-03-07 13:08:59.000000000 +0000 > +++ linux/include/linux/migrate.h 2007-04-24 13:18:31.000000000 +0100 > @@ -2,6 +2,7 @@ > #define _LINUX_MIGRATE_H > > #include <linux/mm.h> > +#include <linux/pagemap.h> > > typedef struct page *new_page_t(struct page *, unsigned long private, int **); > > @@ -10,6 +11,13 @@ static inline int vma_migratable(struct > { > if (vma->vm_flags & (VM_IO|VM_HUGETLB|VM_PFNMAP|VM_RESERVED)) > return 0; > +#ifdef CONFIG_HIGHMEM > + if (vma->vm_file) { > + struct address_space *mapping = vma->vm_file->f_mapping; > + if (!(mapping_gfp_mask(mapping) & __GFP_HIGHMEM)) > + return 0; > + } > +#endif > return 1; > } >From my reading it would be pretty simple to teach unmap_and_move() to pass mapping_gfp_mask(page_mapping(page)) down into (*get_new_page)() to get the correct type of page. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 17:11 ` Andrew Morton @ 2007-04-24 19:06 ` Hugh Dickins 2007-04-24 19:55 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Hugh Dickins @ 2007-04-24 19:06 UTC (permalink / raw) To: Andrew Morton; +Cc: Christoph Lameter, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > > From my reading it would be pretty simple to teach unmap_and_move() > to pass mapping_gfp_mask(page_mapping(page)) down into > (*get_new_page)() to get the correct type of page. Or even simpler, since they're already passed the source page, just get it from that. However, I'm much less able to test this patch in a hurry: it looks plausible, builds and runs okay in my non-NUMA testing; but whether it does what I intend it to do, without causing unexpected side-effects, I don't know. Whereas I did check the previous little vma_migratable() patch did the right thing, and think it more suitable for lastminute and -stable. This one would need real testing by real migrants with real NUMA. Or Christoph may prevail in persuading there's no such problem. Is there a problem with page migration to HIGHMEM, if pages were mapped from a GFP_USER block device? I failed to demonstrate any problem, but here's a quick fix if needed. Signed-off-by: Hugh Dickins <hugh@veritas.com> --- include/linux/migrate.h | 1 + mm/mempolicy.c | 4 ++-- mm/migrate.c | 12 +++++++++--- mm/swap_state.c | 1 + 4 files changed, 13 insertions(+), 5 deletions(-) --- 2.6.21-rc7/include/linux/migrate.h 2007-03-07 14:17:59.000000000 +0000 +++ linux/include/linux/migrate.h 2007-04-24 19:19:01.000000000 +0100 @@ -14,6 +14,7 @@ static inline int vma_migratable(struct } #ifdef CONFIG_MIGRATION +extern gfp_t page_gfp_mask(struct page *page); extern int isolate_lru_page(struct page *p, struct list_head *pagelist); extern int putback_lru_pages(struct list_head *l); extern int migrate_page(struct address_space *, --- 2.6.21-rc7/mm/mempolicy.c 2007-03-07 14:18:01.000000000 +0000 +++ linux/mm/mempolicy.c 2007-04-24 19:19:01.000000000 +0100 @@ -594,7 +594,7 @@ static void migrate_page_add(struct page static struct page *new_node_page(struct page *page, unsigned long node, int **x) { - return alloc_pages_node(node, GFP_HIGHUSER, 0); + return alloc_pages_node(node, page_gfp_mask(page), 0); } /* @@ -710,7 +710,7 @@ static struct page *new_vma_page(struct { struct vm_area_struct *vma = (struct vm_area_struct *)private; - return alloc_page_vma(GFP_HIGHUSER, vma, page_address_in_vma(page, vma)); + return alloc_page_vma(page_gfp_mask(page), vma, page_address_in_vma(page, vma)); } #else --- 2.6.21-rc7/mm/migrate.c 2007-03-07 14:18:01.000000000 +0000 +++ linux/mm/migrate.c 2007-04-24 19:19:01.000000000 +0100 @@ -735,12 +735,18 @@ struct page_to_node { int status; }; -static struct page *new_page_node(struct page *p, unsigned long private, +gfp_t page_gfp_mask(struct page *page) +{ + struct address_space *mapping = page_mapping(page); + return mapping? mapping_gfp_mask(mapping): GFP_HIGHUSER; +} + +static struct page *new_page_node(struct page *page, unsigned long private, int **result) { struct page_to_node *pm = (struct page_to_node *)private; - while (pm->node != MAX_NUMNODES && pm->page != p) + while (pm->node != MAX_NUMNODES && pm->page != page) pm++; if (pm->node == MAX_NUMNODES) @@ -748,7 +754,7 @@ static struct page *new_page_node(struct *result = &pm->status; - return alloc_pages_node(pm->node, GFP_HIGHUSER | GFP_THISNODE, 0); + return alloc_pages_node(pm->node, page_gfp_mask(page) | GFP_THISNODE, 0); } /* --- 2.6.21-rc7/mm/swap_state.c 2006-09-20 04:42:06.000000000 +0100 +++ linux/mm/swap_state.c 2007-04-24 19:19:01.000000000 +0100 @@ -40,6 +40,7 @@ struct address_space swapper_space = { .page_tree = RADIX_TREE_INIT(GFP_ATOMIC|__GFP_NOWARN), .tree_lock = __RW_LOCK_UNLOCKED(swapper_space.tree_lock), .a_ops = &swap_aops, + .flags = (__force unsigned long) GFP_HIGHUSER, .i_mmap_nonlinear = LIST_HEAD_INIT(swapper_space.i_mmap_nonlinear), .backing_dev_info = &swap_backing_dev_info, }; ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 19:06 ` Hugh Dickins @ 2007-04-24 19:55 ` Christoph Lameter 2007-04-24 20:16 ` Hugh Dickins 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 19:55 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Hugh Dickins wrote: > Or Christoph may prevail in persuading there's no such problem. This is pointless. NUMA allocations can only be controlled for the highest zone. If we switch to a lower zone then we allocate on a different zone than the user requested. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 19:55 ` Christoph Lameter @ 2007-04-24 20:16 ` Hugh Dickins 2007-04-24 20:30 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Hugh Dickins @ 2007-04-24 20:16 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Christoph Lameter wrote: > On Tue, 24 Apr 2007, Hugh Dickins wrote: > > > Or Christoph may prevail in persuading there's no such problem. > > This is pointless. NUMA allocations can only be controlled for the highest > zone. If we switch to a lower zone then we allocate on a different zone > than the user requested. Sorry, I'm not following you there. Perhaps your comment is saying that second patch is pointless, because it forces allocations elsewhere than the policy requires? Rather than commenting on the line you quote? (When I said I didn't know if the patch would have unwanted side-effects I was rather wondering what happens when you ask the allocator to do some impossible combination.) > If the system has both high memory and normal memory then only allocations > to highmemory are subject to memory policies etc etc. The block device > allocations would be in zone normal/dma and thus be exempt from NUMA > placement. Please just point us to the line where sys_move_pages enforces this. Hugh ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:16 ` Hugh Dickins @ 2007-04-24 20:30 ` Christoph Lameter 2007-04-24 20:42 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 20:30 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Hugh Dickins wrote: > On Tue, 24 Apr 2007, Christoph Lameter wrote: > > On Tue, 24 Apr 2007, Hugh Dickins wrote: > > > > > Or Christoph may prevail in persuading there's no such problem. > > > > This is pointless. NUMA allocations can only be controlled for the highest > > zone. If we switch to a lower zone then we allocate on a different zone > > than the user requested. > > Sorry, I'm not following you there. Perhaps your comment is saying > that second patch is pointless, because it forces allocations elsewhere > than the policy requires? Rather than commenting on the line you quote? NUMA allocation can only be controlled for the highest zone. If you switch to another zone then no control of the node is possible anymore. > (When I said I didn't know if the patch would have unwanted side-effects > I was rather wondering what happens when you ask the allocator to do > some impossible combination.) It falls back from HIGHMEM to NORMAL (on 32 bit numa that means node 0. The result of your patch is that someone requiring an alloc on node 4 will get memory on node 0). > > If the system has both high memory and normal memory then only allocations > > to highmemory are subject to memory policies etc etc. The block device > > allocations would be in zone normal/dma and thus be exempt from NUMA > > placement. > > Please just point us to the line where sys_move_pages enforces this. It does not. It will happily move the pages into highmem. The filesystem cannot expect the page to remain in the same zone without holding a reference count. Swap may also move pages between zones. There is nothing special in what page migration does here. I would say that the filesystem is broke if it has such expectations regardless of page migration. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:30 ` Christoph Lameter @ 2007-04-24 20:42 ` Andrew Morton 2007-04-24 20:44 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 20:42 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 13:30:33 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > > > If the system has both high memory and normal memory then only allocations > > > to highmemory are subject to memory policies etc etc. The block device > > > allocations would be in zone normal/dma and thus be exempt from NUMA > > > placement. > > > > Please just point us to the line where sys_move_pages enforces this. > > It does not. It will happily move the pages into highmem. The filesystem > cannot expect the page to remain in the same zone without holding a > reference count. Swap may also move pages between zones. There is nothing > special in what page migration does here. > > I would say that the filesystem is broke if it has such expectations > regardless of page migration. Others disagree ;) The filesystem has *told* the core kernel what its allocation constraints are by setting up mapping_gfp_mask(). If the core kernel stops honouring that request then it is core kernel which is broken. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:42 ` Andrew Morton @ 2007-04-24 20:44 ` Christoph Lameter 2007-04-24 20:53 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 20:44 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > > I would say that the filesystem is broke if it has such expectations > > regardless of page migration. > > Others disagree ;) > > The filesystem has *told* the core kernel what its allocation constraints > are by setting up mapping_gfp_mask(). If the core kernel stops honouring > that request then it is core kernel which is broken. Then I think we should disable page migration for allocations that do not allow access to the policy zone. That would fix it. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:44 ` Christoph Lameter @ 2007-04-24 20:53 ` Andrew Morton 2007-04-24 20:58 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 20:53 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 13:44:50 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Tue, 24 Apr 2007, Andrew Morton wrote: > > > > I would say that the filesystem is broke if it has such expectations > > > regardless of page migration. > > > > Others disagree ;) > > > > The filesystem has *told* the core kernel what its allocation constraints > > are by setting up mapping_gfp_mask(). If the core kernel stops honouring > > that request then it is core kernel which is broken. > > Then I think we should disable page migration for allocations that do not > allow access to the policy zone. That would fix it. Can't we use mapping_gfp_mask() when allocating the destination page? It would be better to do so, really. Who knows, mapping_gfp_mask() might be extented in the future to say "I want GFP_NOIO" or something. Or a filesystem might specify GFP_KERNEL for regular pagecache pages or whatever. Generally, the interface is "address_space tells core kernel how to allocate its pages", and to be nice we should honour that in all places where we allocate a page for an address_space. If we'd had any brains we would have implemented this function as an address_space_operations callback, but we don't so we didn't. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:53 ` Andrew Morton @ 2007-04-24 20:58 ` Christoph Lameter 2007-04-24 21:24 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 20:58 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > > Then I think we should disable page migration for allocations that do not > > allow access to the policy zone. That would fix it. > > Can't we use mapping_gfp_mask() when allocating the destination page? There is no point in migrating something if you cannot reach the destination. If the policy zone is not allowed by an allocation then the page cannot be migrated because (on 32 bit NUMA) there is only a single ZONE_NORMAL on the system. A different node requires a HIGHMEM allocation! > It would be better to do so, really. Who knows, mapping_gfp_mask() might > be extented in the future to say "I want GFP_NOIO" or something. Or a > filesystem might specify GFP_KERNEL for regular pagecache pages or > whatever. Hmmm..... How about a VM_DONTMIGRATE flag instead? That would be easy to check and could be set by a device that must have all pages of the address space conforming to the gfp mask. Or more general VM_STRICT_ALLOC? > Generally, the interface is "address_space tells core kernel how to > allocate its pages", and to be nice we should honour that in all places > where we allocate a page for an address_space. > > If we'd had any brains we would have implemented this function as an > address_space_operations callback, but we don't so we didn't. We have enough flags I think. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:58 ` Christoph Lameter @ 2007-04-24 21:24 ` Andrew Morton 2007-04-24 21:28 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 21:24 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 13:58:35 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Tue, 24 Apr 2007, Andrew Morton wrote: > > > > Then I think we should disable page migration for allocations that do not > > > allow access to the policy zone. That would fix it. > > > > Can't we use mapping_gfp_mask() when allocating the destination page? > > There is no point in migrating something if you cannot reach the > destination. If the policy zone is not allowed by an allocation then the > page cannot be migrated because (on 32 bit NUMA) there is only a single > ZONE_NORMAL on the system. A different node requires a HIGHMEM allocation! don't care really. If we're allocating a page for an address_space, we should be using its allocation mask. As I said, we might add more things in the future. If in so doing the user pointlessly wastes some CPU cycles in one weird case, it doesn't matter a lot. > > It would be better to do so, really. Who knows, mapping_gfp_mask() might > > be extented in the future to say "I want GFP_NOIO" or something. Or a > > filesystem might specify GFP_KERNEL for regular pagecache pages or > > whatever. > > Hmmm..... How about a VM_DONTMIGRATE flag instead? That would be easy to > check and could be set by a device that must have all pages of the address > space conforming to the gfp mask. > > Or more general > > VM_STRICT_ALLOC? ug. That's special-casing the blockdev pagecache peculiarity when someone mmaps it. It'd be better to stick with the existing interfaces and conventions: the address_space gets to dictate how is pages are allocated. > > Generally, the interface is "address_space tells core kernel how to > > allocate its pages", and to be nice we should honour that in all places > > where we allocate a page for an address_space. > > > > If we'd had any brains we would have implemented this function as an > > address_space_operations callback, but we don't so we didn't. > > We have enough flags I think. mapping_gfp_mask if a pretty foul thing. Adding struct page (*alloc_page)(struct address_space *mapping); to address_space_operations would be a quite nice cleanup. But I ain't taking cleanups at present. <goes back to working on a critical 2.6.21 bug> ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 21:24 ` Andrew Morton @ 2007-04-24 21:28 ` Christoph Lameter 0 siblings, 0 replies; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 21:28 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > mapping_gfp_mask if a pretty foul thing. Adding > > struct page (*alloc_page)(struct address_space *mapping); > > to address_space_operations would be a quite nice cleanup. Ummm... If things would be that simple... I think we need struct page (*alloc_page)(struct address_space *mapping, int order, int node, gfp_t flags) node = -1 means abitrary node otherwise try to alloc on indicated node. additional_flags are important to be able to specify GFP_THISNODE to force alloc on one particular node. Plus there are the cpuset context flages etc. If the allocation is atomic then the callback should be able to react to that appropriately. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 13:09 ` Hugh Dickins 2007-04-24 17:11 ` Andrew Morton @ 2007-04-24 17:38 ` Christoph Lameter 2007-04-24 17:45 ` Hugh Dickins 2007-04-24 17:45 ` Christoph Lameter 2 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 17:38 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Hugh Dickins wrote: > I've not yet looked at the patch under discussion, but this remark > prompts me... a couple of days ago I got very worried by the various > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, > and wondered how those would work out if someone has a blockdev mmap'ed. Hmmm.... These not that critical given that 32 bit NUMA systems are a bit rare. And if a page is in the wrong area then it can be bounced before I/O is performed on it. > (If vma->vm_file is non-NULL, we can be sure vma->vm_file->f_mapping > is non-NULL, can't we? Some common code assumes that, some does not: > I've avoided cargo-cult safety below, but don't let me make it unsafe.) > > > Is there a problem with page migration to HIGHMEM, if pages were > mapped from a GFP_USER block device? I failed to demonstrate any > problem, but here's a quick fix if needed. If a page is migrated into a different zone and then a write attempt is made on it then the page will be bounced into a zone from which the device can write from. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 17:38 ` Christoph Lameter @ 2007-04-24 17:45 ` Hugh Dickins 2007-04-24 17:48 ` Christoph Lameter 2007-04-24 17:51 ` Andrew Morton 0 siblings, 2 replies; 31+ messages in thread From: Hugh Dickins @ 2007-04-24 17:45 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Christoph Lameter wrote: > On Tue, 24 Apr 2007, Hugh Dickins wrote: > > > I've not yet looked at the patch under discussion, but this remark > > prompts me... a couple of days ago I got very worried by the various > > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, > > and wondered how those would work out if someone has a blockdev mmap'ed. > > Hmmm.... These not that critical given that 32 bit NUMA systems are a bit > rare. That's true. And everybody but the owners of those systems wish fervently that they didn't exist ;) > And if a page is in the wrong area then it can be bounced before I/O > is performed on it. I think that much is also true, but not where the problem lies. Isn't the problem that filesystems using these block devices expect their metadata to be accessible without kmap calls? Hugh ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 17:45 ` Hugh Dickins @ 2007-04-24 17:48 ` Christoph Lameter 2007-04-24 17:51 ` Andrew Morton 1 sibling, 0 replies; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 17:48 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Hugh Dickins wrote: > > And if a page is in the wrong area then it can be bounced before I/O > > is performed on it. > > I think that much is also true, but not where the problem lies. > Isn't the problem that filesystems using these block devices > expect their metadata to be accessible without kmap calls? Metadata is not movable nor subject to memory policies. It will never be mapped into a process space. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 17:45 ` Hugh Dickins 2007-04-24 17:48 ` Christoph Lameter @ 2007-04-24 17:51 ` Andrew Morton 2007-04-24 17:56 ` Christoph Lameter 1 sibling, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 17:51 UTC (permalink / raw) To: Hugh Dickins; +Cc: Christoph Lameter, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 18:45:03 +0100 (BST) Hugh Dickins <hugh@veritas.com> wrote: > On Tue, 24 Apr 2007, Christoph Lameter wrote: > > On Tue, 24 Apr 2007, Hugh Dickins wrote: > > > > > I've not yet looked at the patch under discussion, but this remark > > > prompts me... a couple of days ago I got very worried by the various > > > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, > > > and wondered how those would work out if someone has a blockdev mmap'ed. > > > > Hmmm.... These not that critical given that 32 bit NUMA systems are a bit > > rare. > > That's true. And everybody but the owners of those systems wish > fervently that they didn't exist ;) > > > And if a page is in the wrong area then it can be bounced before I/O > > is performed on it. > > I think that much is also true, but not where the problem lies. > Isn't the problem that filesystems using these block devices > expect their metadata to be accessible without kmap calls? > yup. wherever we dereference buffer_head.b_data we're touching page_address(buffer_head.b_page) without kmapping. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 17:51 ` Andrew Morton @ 2007-04-24 17:56 ` Christoph Lameter 0 siblings, 0 replies; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 17:56 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > > I think that much is also true, but not where the problem lies. > > Isn't the problem that filesystems using these block devices > > expect their metadata to be accessible without kmap calls? > > > > yup. wherever we dereference buffer_head.b_data we're touching > page_address(buffer_head.b_page) without kmapping. Yes but before we get there we will bounce pagecache pages into an area where we do not need kmap. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 13:09 ` Hugh Dickins 2007-04-24 17:11 ` Andrew Morton 2007-04-24 17:38 ` Christoph Lameter @ 2007-04-24 17:45 ` Christoph Lameter 2007-04-24 18:53 ` Hugh Dickins 2 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 17:45 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Hugh Dickins wrote: > I've not yet looked at the patch under discussion, but this remark > prompts me... a couple of days ago I got very worried by the various > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, > and wondered how those would work out if someone has a blockdev mmap'ed. I hope you are not confused by the fact that memory policies are only ever applied to one zone on a node. This is either HIGHMEM or NORMAL. There is no memory policy support for other than the highest zone. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 17:45 ` Christoph Lameter @ 2007-04-24 18:53 ` Hugh Dickins 2007-04-24 19:34 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Hugh Dickins @ 2007-04-24 18:53 UTC (permalink / raw) To: Christoph Lameter; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Christoph Lameter wrote: > On Tue, 24 Apr 2007, Hugh Dickins wrote: > > > I've not yet looked at the patch under discussion, but this remark > > prompts me... a couple of days ago I got very worried by the various > > hard-wired GFP_HIGHUSER allocations in mm/migrate.c and mm/mempolicy.c, > > and wondered how those would work out if someone has a blockdev mmap'ed. > > I hope you are not confused by the fact that memory policies are only > ever applied to one zone on a node. This is either HIGHMEM or NORMAL. > There is no memory policy support for other than the highest zone. I was certainly ignorant of that; but I'm not convinced it eliminates the potential issue. For a start, sys_move_pages seems not to involve mempolicies at all - I don't see what prevents it migrating blockdev pages away from the only node which has NORMAL memory. > Metadata is not movable nor subject to memory policies. > It will never be mapped into a process space. Not as metadata, no. But someone (let's hope only root, though I may be wrong on that) can map any part of the block device into userspace. > > yup. wherever we dereference buffer_head.b_data we're touching > > page_address(buffer_head.b_page) without kmapping. > > Yes but before we get there we will bounce pagecache pages into an area > where we do not need kmap. Again, I'm not convinced: bouncing gets done for the I/O, but where is it done to meet the filesystem's expectations? On the other hand, as I said, I've seen no problem myself in practice. However, if there is no problem, why do block devices demand GFP_USER? Hugh ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 18:53 ` Hugh Dickins @ 2007-04-24 19:34 ` Christoph Lameter 2007-04-24 19:49 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 19:34 UTC (permalink / raw) To: Hugh Dickins; +Cc: Andrew Morton, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Hugh Dickins wrote: > I was certainly ignorant of that; but I'm not convinced it eliminates > the potential issue. For a start, sys_move_pages seems not to involve > mempolicies at all - I don't see what prevents it migrating blockdev > pages away from the only node which has NORMAL memory. There is no need to. If the user does something stupid (and 32bit NUMA system are so stupid by default) as you say then this means that the pages would have to use bounce buffers to be written back. > > Metadata is not movable nor subject to memory policies. > > It will never be mapped into a process space. > > Not as metadata, no. But someone (let's hope only root, though I may > be wrong on that) can map any part of the block device into userspace. Concurrent access to a block device by a filesystem and the user? That cannot go over well. If one just reads then I would expect that a copy of the metadata becomes available to the user. Also you cannot migrate pages that have multiple references (which is the case here if the filesystem uses the page cache for the metadata) unless the user has special priviledges and uses special command options. A page that has references that cannot be accounted for by page migration is never migrated. I would assume that the filesystem at minimum takes a refcount on the page used for metadata. If the filesystem would not take a refcount then it would already be in trouble because the page may then be evicted at any time. > > Yes but before we get there we will bounce pagecache pages into an area > > where we do not need kmap. > > Again, I'm not convinced: bouncing gets done for the I/O, > but where is it done to meet the filesystem's expectations? Bouncing gets done for the block device and the block device determines the zones from which it can do I/O. Thus the block device determines the allocation restrictions. > On the other hand, as I said, I've seen no problem myself in practice. > However, if there is no problem, why do block devices demand GFP_USER? GFP_USER requires allocation in non highmem areas. These are typically reachable by device DMA. I think there is no problem if one would open a block device with GFP_HIGHMEM. The pages may be allocated in highmem which would require bounce buffers but otherwise we will be fine. A ramdisk device may just work fine with highmem. If the system has both high memory and normal memory then only allocations to highmemory are subject to memory policies etc etc. The block device allocations would be in zone normal/dma and thus be exempt from NUMA placement. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 19:34 ` Christoph Lameter @ 2007-04-24 19:49 ` Andrew Morton 2007-04-24 19:59 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 19:49 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 12:34:53 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > > Not as metadata, no. But someone (let's hope only root, though I may > > be wrong on that) can map any part of the block device into userspace. > > Concurrent access to a block device by a filesystem and the user? That > cannot go over well. If one just reads then I would expect that a copy > of the metadata becomes available to the user. Also you cannot migrate > pages that have multiple references (which is the case here if the > filesystem uses the page cache for the metadata) unless the user has > special priviledges and uses special command options. > > A page that has references that cannot be accounted for by page migration > is never migrated. I would assume that the filesystem at minimum takes a > refcount on the page used for metadata. > > If the filesystem would not take a refcount then it would already be in > trouble because the page may then be evicted at any time. No, think of the following scenario: - file I/O causes a read of an ext2 file's bitmap. The bitmap is brought into /dev/hda1's pagecache using !__GFP_HIGHMEM - references are released against that page and it's now just clean reclaimable pagecache - someone (say, an online filesystem checker or something) mmaps /dev/hda1 and reads that page. - migration comes alnog and migrates that page into highmem - file I/O causes a read of that bitmap again. We find it in /dev/hda's pagecache. Here's set_bh_page(). void set_bh_page(struct buffer_head *bh, struct page *page, unsigned long offset) { bh->b_page = page; BUG_ON(offset >= PAGE_SIZE); if (PageHighMem(page)) /* * This catches illegal uses and preserves the offset: */ bh->b_data = (char *)(0 + offset); else bh->b_data = page_address(page) + offset; } - ext2 now tries to access the bits in the bitmap via page->bh->b_data - game over ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 19:49 ` Andrew Morton @ 2007-04-24 19:59 ` Christoph Lameter 2007-04-24 20:10 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 19:59 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > No, think of the following scenario: > > - file I/O causes a read of an ext2 file's bitmap. The bitmap is > brought into /dev/hda1's pagecache using !__GFP_HIGHMEM > > - references are released against that page and it's now just clean > reclaimable pagecache > > - someone (say, an online filesystem checker or something) mmaps > /dev/hda1 and reads that page. > > - migration comes alnog and migrates that page into highmem > > - file I/O causes a read of that bitmap again. We find it in > /dev/hda's pagecache. Read of the bitmap? How would that work? Page cache lookup right? > Here's set_bh_page(). A highmem page can have buffers??? ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 19:59 ` Christoph Lameter @ 2007-04-24 20:10 ` Andrew Morton 2007-04-24 20:12 ` Christoph Lameter 0 siblings, 1 reply; 31+ messages in thread From: Andrew Morton @ 2007-04-24 20:10 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 12:59:17 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Tue, 24 Apr 2007, Andrew Morton wrote: > > > No, think of the following scenario: > > > > - file I/O causes a read of an ext2 file's bitmap. The bitmap is > > brought into /dev/hda1's pagecache using !__GFP_HIGHMEM > > > > - references are released against that page and it's now just clean > > reclaimable pagecache > > > > - someone (say, an online filesystem checker or something) mmaps > > /dev/hda1 and reads that page. > > > > - migration comes alnog and migrates that page into highmem > > > > - file I/O causes a read of that bitmap again. We find it in > > /dev/hda's pagecache. > > Read of the bitmap? How would that work? Page cache lookup right? yup. sb_bread ->__bread ->__getblk ->__find_get_block ->__find_get_block_slow ->find_get_page > > Here's set_bh_page(). > > A highmem page can have buffers??? yep. Take a 4k page which is stored in four discontiguous 1k disk blocks. The data at page_buffers(page) is the sole way in which we track which parts of the page belong to which blocks of the disk. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:10 ` Andrew Morton @ 2007-04-24 20:12 ` Christoph Lameter 2007-04-24 20:20 ` Andrew Morton 0 siblings, 1 reply; 31+ messages in thread From: Christoph Lameter @ 2007-04-24 20:12 UTC (permalink / raw) To: Andrew Morton; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007, Andrew Morton wrote: > > A highmem page can have buffers??? > > yep. Take a 4k page which is stored in four discontiguous 1k disk blocks. The > data at page_buffers(page) is the sole way in which we track which parts of > the page belong to which blocks of the disk. But I see no use of kmap for buffer access? The data is not accessible. ^ permalink raw reply [flat|nested] 31+ messages in thread
* Re: Pagecache: find_or_create_page does not call a proper page allocator function 2007-04-24 20:12 ` Christoph Lameter @ 2007-04-24 20:20 ` Andrew Morton 0 siblings, 0 replies; 31+ messages in thread From: Andrew Morton @ 2007-04-24 20:20 UTC (permalink / raw) To: Christoph Lameter; +Cc: Hugh Dickins, Nick Piggin, linux-kernel, pj On Tue, 24 Apr 2007 13:12:42 -0700 (PDT) Christoph Lameter <clameter@sgi.com> wrote: > On Tue, 24 Apr 2007, Andrew Morton wrote: > > > > A highmem page can have buffers??? > > > > yep. Take a 4k page which is stored in four discontiguous 1k disk blocks. The > > data at page_buffers(page) is the sole way in which we track which parts of > > the page belong to which blocks of the disk. > > But I see no use of kmap for buffer access? The data is not accessible. > The kernel rarely has a need to actually read or write the page's contents with the CPU. On those occasions where it does, it will use kmap. Search for zero_user_page() and kmap in rc7-mm1's fs/buffer.c But that's file pagecache, which can be in highmem. File metadata is accessed within the filesystems without kmapping. This: box:/usr/src/linux-2.6.21-rc7> grep '[-]>b_data' fs/*/*.c | wc -l 1017 explains why that never got fixed. ^ permalink raw reply [flat|nested] 31+ messages in thread
end of thread, other threads:[~2007-04-24 21:28 UTC | newest] Thread overview: 31+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2007-04-23 21:11 Pagecache: find_or_create_page does not call a proper page allocator function Christoph Lameter 2007-04-23 21:29 ` Andrew Morton 2007-04-23 21:37 ` Christoph Lameter 2007-04-23 22:33 ` Christoph Lameter 2007-04-23 22:42 ` Christoph Lameter 2007-04-23 22:42 ` Andrew Morton 2007-04-24 13:09 ` Hugh Dickins 2007-04-24 17:11 ` Andrew Morton 2007-04-24 19:06 ` Hugh Dickins 2007-04-24 19:55 ` Christoph Lameter 2007-04-24 20:16 ` Hugh Dickins 2007-04-24 20:30 ` Christoph Lameter 2007-04-24 20:42 ` Andrew Morton 2007-04-24 20:44 ` Christoph Lameter 2007-04-24 20:53 ` Andrew Morton 2007-04-24 20:58 ` Christoph Lameter 2007-04-24 21:24 ` Andrew Morton 2007-04-24 21:28 ` Christoph Lameter 2007-04-24 17:38 ` Christoph Lameter 2007-04-24 17:45 ` Hugh Dickins 2007-04-24 17:48 ` Christoph Lameter 2007-04-24 17:51 ` Andrew Morton 2007-04-24 17:56 ` Christoph Lameter 2007-04-24 17:45 ` Christoph Lameter 2007-04-24 18:53 ` Hugh Dickins 2007-04-24 19:34 ` Christoph Lameter 2007-04-24 19:49 ` Andrew Morton 2007-04-24 19:59 ` Christoph Lameter 2007-04-24 20:10 ` Andrew Morton 2007-04-24 20:12 ` Christoph Lameter 2007-04-24 20:20 ` Andrew Morton
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox