* [PATCH v4] mm: add replace_page_cache_page() function
@ 2011-01-18 11:18 Miklos Szeredi
2011-01-18 23:28 ` Andrew Morton
2011-01-19 1:17 ` Minchan Kim
0 siblings, 2 replies; 16+ messages in thread
From: Miklos Szeredi @ 2011-01-18 11:18 UTC (permalink / raw)
To: akpm
Cc: minchan.kim, kamezawa.hiroyu, nishimura, linux-fsdevel, linux-mm,
linux-kernel
Andrew,
Can you please apply this to -mm, for 2.6.39?
v4:
- updated to latest -git
- added acks
- updated changelog
Thanks,
Miklos
----
From: Miklos Szeredi <mszeredi@suse.cz>
Subject: mm: add replace_page_cache_page() function
This function basically does:
remove_from_page_cache(old);
page_cache_release(old);
add_to_page_cache_locked(new);
Except it does this atomically, so there's no possibility for the
"add" to fail because of a race.
If memory cgroups are enabled, then the memory cgroup charge is also
moved from the old page to the new.
This function is currently used by fuse to move pages into the page
cache on read, instead of copying the page contents.
Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
Acked-by: Rik van Riel <riel@redhat.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
fs/fuse/dev.c | 10 ++----
include/linux/memcontrol.h | 4 +-
include/linux/pagemap.h | 1
mm/filemap.c | 65 +++++++++++++++++++++++++++++++++++++++++++++
mm/memcontrol.c | 4 +-
mm/migrate.c | 2 -
6 files changed, 75 insertions(+), 11 deletions(-)
Index: linux-2.6/mm/filemap.c
===================================================================
--- linux-2.6.orig/mm/filemap.c 2011-01-17 09:33:44.000000000 +0100
+++ linux-2.6/mm/filemap.c 2011-01-18 11:50:08.000000000 +0100
@@ -387,6 +387,71 @@ int filemap_write_and_wait_range(struct
EXPORT_SYMBOL(filemap_write_and_wait_range);
/**
+ * replace_page_cache_page - replace a pagecache page with a new one
+ * @old: page to be replaced
+ * @new: page to replace with
+ * @gfp_mask: allocation mode
+ *
+ * This function replaces a page in the pagecache with a new one. On
+ * success it acquires the pagecache reference for the new page and
+ * drops it for the old page. Both the old and new pages must be
+ * locked. This function does not add the new page to the LRU, the
+ * caller must do that.
+ *
+ * The remove + add is atomic. The only way this function can fail is
+ * memory allocation failure.
+ */
+int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
+{
+ int error;
+ struct mem_cgroup *memcg = NULL;
+
+ VM_BUG_ON(!PageLocked(old));
+ VM_BUG_ON(!PageLocked(new));
+ VM_BUG_ON(new->mapping);
+
+ /*
+ * This is not page migration, but prepare_migration and
+ * end_migration does enough work for charge replacement.
+ *
+ * In the longer term we probably want a specialized function
+ * for moving the charge from old to new in a more efficient
+ * manner.
+ */
+ error = mem_cgroup_prepare_migration(old, new, &memcg, gfp_mask);
+ if (error)
+ return error;
+
+ error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
+ if (!error) {
+ struct address_space *mapping = old->mapping;
+ pgoff_t offset = old->index;
+
+ page_cache_get(new);
+ new->mapping = mapping;
+ new->index = offset;
+
+ spin_lock_irq(&mapping->tree_lock);
+ __remove_from_page_cache(old);
+ error = radix_tree_insert(&mapping->page_tree, offset, new);
+ BUG_ON(error);
+ mapping->nrpages++;
+ __inc_zone_page_state(new, NR_FILE_PAGES);
+ if (PageSwapBacked(new))
+ __inc_zone_page_state(new, NR_SHMEM);
+ spin_unlock_irq(&mapping->tree_lock);
+ radix_tree_preload_end();
+ page_cache_release(old);
+ mem_cgroup_end_migration(memcg, old, new, true);
+ } else {
+ mem_cgroup_end_migration(memcg, old, new, false);
+ }
+
+ return error;
+}
+EXPORT_SYMBOL_GPL(replace_page_cache_page);
+
+/**
* add_to_page_cache_locked - add a locked page to the pagecache
* @page: page to add
* @mapping: the page's address_space
Index: linux-2.6/include/linux/pagemap.h
===================================================================
--- linux-2.6.orig/include/linux/pagemap.h 2011-01-17 09:33:44.000000000 +0100
+++ linux-2.6/include/linux/pagemap.h 2011-01-18 11:50:08.000000000 +0100
@@ -457,6 +457,7 @@ int add_to_page_cache_lru(struct page *p
pgoff_t index, gfp_t gfp_mask);
extern void remove_from_page_cache(struct page *page);
extern void __remove_from_page_cache(struct page *page);
+int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask);
/*
* Like add_to_page_cache_locked, but used to add newly allocated pages:
Index: linux-2.6/fs/fuse/dev.c
===================================================================
--- linux-2.6.orig/fs/fuse/dev.c 2011-01-13 14:34:28.000000000 +0100
+++ linux-2.6/fs/fuse/dev.c 2011-01-18 11:50:08.000000000 +0100
@@ -737,14 +737,12 @@ static int fuse_try_move_page(struct fus
if (WARN_ON(PageMlocked(oldpage)))
goto out_fallback_unlock;
- remove_from_page_cache(oldpage);
- page_cache_release(oldpage);
-
- err = add_to_page_cache_locked(newpage, mapping, index, GFP_KERNEL);
+ err = replace_page_cache_page(oldpage, newpage, GFP_KERNEL);
if (err) {
- printk(KERN_WARNING "fuse_try_move_page: failed to add page");
- goto out_fallback_unlock;
+ unlock_page(newpage);
+ return err;
}
+
page_cache_get(newpage);
if (!(buf->flags & PIPE_BUF_FLAG_LRU))
Index: linux-2.6/include/linux/memcontrol.h
===================================================================
--- linux-2.6.orig/include/linux/memcontrol.h 2011-01-17 09:33:44.000000000 +0100
+++ linux-2.6/include/linux/memcontrol.h 2011-01-18 11:50:08.000000000 +0100
@@ -96,7 +96,7 @@ extern struct cgroup_subsys_state *mem_c
extern int
mem_cgroup_prepare_migration(struct page *page,
- struct page *newpage, struct mem_cgroup **ptr);
+ struct page *newpage, struct mem_cgroup **ptr, gfp_t gfp_mask);
extern void mem_cgroup_end_migration(struct mem_cgroup *mem,
struct page *oldpage, struct page *newpage, bool migration_ok);
@@ -245,7 +245,7 @@ static inline struct cgroup_subsys_state
static inline int
mem_cgroup_prepare_migration(struct page *page, struct page *newpage,
- struct mem_cgroup **ptr)
+ struct mem_cgroup **ptr, gfp_t gfp_mask)
{
return 0;
}
Index: linux-2.6/mm/memcontrol.c
===================================================================
--- linux-2.6.orig/mm/memcontrol.c 2011-01-17 09:33:44.000000000 +0100
+++ linux-2.6/mm/memcontrol.c 2011-01-18 11:50:08.000000000 +0100
@@ -2806,7 +2806,7 @@ static inline int mem_cgroup_move_swap_a
* page belongs to.
*/
int mem_cgroup_prepare_migration(struct page *page,
- struct page *newpage, struct mem_cgroup **ptr)
+ struct page *newpage, struct mem_cgroup **ptr, gfp_t gfp_mask)
{
struct page_cgroup *pc;
struct mem_cgroup *mem = NULL;
@@ -2863,7 +2863,7 @@ int mem_cgroup_prepare_migration(struct
return 0;
*ptr = mem;
- ret = __mem_cgroup_try_charge(NULL, GFP_KERNEL, ptr, false, PAGE_SIZE);
+ ret = __mem_cgroup_try_charge(NULL, gfp_mask, ptr, false, PAGE_SIZE);
css_put(&mem->css);/* drop extra refcnt */
if (ret || *ptr == NULL) {
if (PageAnon(page)) {
Index: linux-2.6/mm/migrate.c
===================================================================
--- linux-2.6.orig/mm/migrate.c 2011-01-17 09:33:44.000000000 +0100
+++ linux-2.6/mm/migrate.c 2011-01-18 11:50:08.000000000 +0100
@@ -678,7 +678,7 @@ static int unmap_and_move(new_page_t get
}
/* charge against new page */
- charge = mem_cgroup_prepare_migration(page, newpage, &mem);
+ charge = mem_cgroup_prepare_migration(page, newpage, &mem, GFP_KERNEL);
if (charge == -ENOMEM) {
rc = -ENOMEM;
goto unlock;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-18 11:18 [PATCH v4] mm: add replace_page_cache_page() function Miklos Szeredi
@ 2011-01-18 23:28 ` Andrew Morton
2011-01-19 0:27 ` Daisuke Nishimura
` (3 more replies)
2011-01-19 1:17 ` Minchan Kim
1 sibling, 4 replies; 16+ messages in thread
From: Andrew Morton @ 2011-01-18 23:28 UTC (permalink / raw)
To: Miklos Szeredi
Cc: minchan.kim, kamezawa.hiroyu, nishimura, linux-fsdevel, linux-mm,
linux-kernel
On Tue, 18 Jan 2011 12:18:11 +0100
Miklos Szeredi <miklos@szeredi.hu> wrote:
> +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> +{
> + int error;
> + struct mem_cgroup *memcg = NULL;
I'm suspecting that the unneeded initialisation was added to suppress a
warning?
I removed it, and didn't get a warning. I expected to.
Really, uninitialized_var() is better. It avoids adding extra code
and, unlike "= 0" it is self-documenting.
> + VM_BUG_ON(!PageLocked(old));
> + VM_BUG_ON(!PageLocked(new));
> + VM_BUG_ON(new->mapping);
> +
> + /*
> + * This is not page migration, but prepare_migration and
> + * end_migration does enough work for charge replacement.
> + *
> + * In the longer term we probably want a specialized function
> + * for moving the charge from old to new in a more efficient
> + * manner.
> + */
> + error = mem_cgroup_prepare_migration(old, new, &memcg, gfp_mask);
> + if (error)
> + return error;
> +
> + error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> + if (!error) {
> + struct address_space *mapping = old->mapping;
> + pgoff_t offset = old->index;
> +
> + page_cache_get(new);
> + new->mapping = mapping;
> + new->index = offset;
> +
> + spin_lock_irq(&mapping->tree_lock);
> + __remove_from_page_cache(old);
> + error = radix_tree_insert(&mapping->page_tree, offset, new);
> + BUG_ON(error);
> + mapping->nrpages++;
> + __inc_zone_page_state(new, NR_FILE_PAGES);
> + if (PageSwapBacked(new))
> + __inc_zone_page_state(new, NR_SHMEM);
> + spin_unlock_irq(&mapping->tree_lock);
> + radix_tree_preload_end();
> + page_cache_release(old);
> + mem_cgroup_end_migration(memcg, old, new, true);
This is all pretty ugly and inefficient.
We call __remove_from_page_cache() which does a radix-tree lookup and
then fiddles a bunch of accounting things.
Then we immediately do the same radix-tree lookup and then undo the
accounting changes which we just did. And we do it in an open-coded
fashion, thus giving the kernel yet another code site where various
operations need to be kept in sync.
Would it not be better to do a single radix_tree_lookup_slot(),
overwrite the pointer therein and just leave all the ancilliary
accounting unaltered?
> + } else {
> + mem_cgroup_end_migration(memcg, old, new, false);
> + }
> +
> + return error;
> +}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-18 23:28 ` Andrew Morton
@ 2011-01-19 0:27 ` Daisuke Nishimura
2011-01-19 0:41 ` Andrew Morton
2011-01-19 0:48 ` KAMEZAWA Hiroyuki
2011-01-19 0:33 ` KAMEZAWA Hiroyuki
` (2 subsequent siblings)
3 siblings, 2 replies; 16+ messages in thread
From: Daisuke Nishimura @ 2011-01-19 0:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Miklos Szeredi, minchan.kim, kamezawa.hiroyu, linux-fsdevel,
linux-mm, linux-kernel, Daisuke Nishimura
On Tue, 18 Jan 2011 15:28:44 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 18 Jan 2011 12:18:11 +0100
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > +{
> > + int error;
> > + struct mem_cgroup *memcg = NULL;
>
> I'm suspecting that the unneeded initialisation was added to suppress a
> warning?
>
No.
It's necessary for mem_cgroup_{prepare|end}_migration().
mem_cgroup_prepare_migration() will return without doing anything in
"if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
but mem_cgroup_end_migration() depends on the value of "memcg" to decide
whether prepare_migration has succeeded or not.
This may not be a good implementation, but IMHO I'd like to to initialize
valuable before using it in general.
Thanks,
Daisuke Nishimura.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 0:27 ` Daisuke Nishimura
@ 2011-01-19 0:41 ` Andrew Morton
2011-01-19 0:48 ` KAMEZAWA Hiroyuki
1 sibling, 0 replies; 16+ messages in thread
From: Andrew Morton @ 2011-01-19 0:41 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: Miklos Szeredi, minchan.kim, kamezawa.hiroyu, linux-fsdevel,
linux-mm, linux-kernel
On Wed, 19 Jan 2011 09:27:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Tue, 18 Jan 2011 15:28:44 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Tue, 18 Jan 2011 12:18:11 +0100
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > > +{
> > > + int error;
> > > + struct mem_cgroup *memcg = NULL;
> >
> > I'm suspecting that the unneeded initialisation was added to suppress a
> > warning?
> >
> No.
> It's necessary for mem_cgroup_{prepare|end}_migration().
> mem_cgroup_prepare_migration() will return without doing anything in
> "if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
> but mem_cgroup_end_migration() depends on the value of "memcg" to decide
> whether prepare_migration has succeeded or not.
> This may not be a good implementation,
It's a *surprising* implementation, in the context of kernel
conventions. At least, it fooled me :)
Which makes it a not-good implementation, really. Also, it's
undocumented.
> but IMHO I'd like to to initialize
> valuable before using it in general.
mem_cgroup_prepare_migration() performs the initialisation! We do this
in many places - it's the basic "function with more than one return
value" mechanism.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 0:27 ` Daisuke Nishimura
2011-01-19 0:41 ` Andrew Morton
@ 2011-01-19 0:48 ` KAMEZAWA Hiroyuki
2011-01-19 1:11 ` nishimura
1 sibling, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19 0:48 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: Andrew Morton, Miklos Szeredi, minchan.kim, linux-fsdevel,
linux-mm, linux-kernel
On Wed, 19 Jan 2011 09:27:33 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Tue, 18 Jan 2011 15:28:44 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Tue, 18 Jan 2011 12:18:11 +0100
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > > +{
> > > + int error;
> > > + struct mem_cgroup *memcg = NULL;
> >
> > I'm suspecting that the unneeded initialisation was added to suppress a
> > warning?
> >
> No.
> It's necessary for mem_cgroup_{prepare|end}_migration().
> mem_cgroup_prepare_migration() will return without doing anything in
> "if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
> but mem_cgroup_end_migration() depends on the value of "memcg" to decide
> whether prepare_migration has succeeded or not.
> This may not be a good implementation, but IMHO I'd like to to initialize
> valuable before using it in general.
>
I think it can be initlized in mem_cgroup_prepare_migration().
I'll send patch later.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 0:48 ` KAMEZAWA Hiroyuki
@ 2011-01-19 1:11 ` nishimura
2011-01-19 1:23 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 16+ messages in thread
From: nishimura @ 2011-01-19 1:11 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Miklos Szeredi, minchan.kim, linux-fsdevel,
linux-mm, linux-kernel, Daisuke Nishimura
> On Wed, 19 Jan 2011 09:27:33 +0900
> Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
>
>> On Tue, 18 Jan 2011 15:28:44 -0800
>> Andrew Morton <akpm@linux-foundation.org> wrote:
>>
>> > On Tue, 18 Jan 2011 12:18:11 +0100
>> > Miklos Szeredi <miklos@szeredi.hu> wrote:
>> >
>> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>> > > +{
>> > > + int error;
>> > > + struct mem_cgroup *memcg = NULL;
>> >
>> > I'm suspecting that the unneeded initialisation was added to suppress a
>> > warning?
>> >
>> No.
>> It's necessary for mem_cgroup_{prepare|end}_migration().
>> mem_cgroup_prepare_migration() will return without doing anything in
>> "if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
>> but mem_cgroup_end_migration() depends on the value of "memcg" to decide
>> whether prepare_migration has succeeded or not.
>> This may not be a good implementation, but IMHO I'd like to to initialize
>> valuable before using it in general.
>>
>
> I think it can be initlized in mem_cgroup_prepare_migration().
> I'll send patch later.
>
I see, thanks.
I think you know it, but just a note:
mem_cgroup_{try_charge|commit_charge}_swapin()
use the same logic, so try_charge_swapin() should also be changed
for consistency.
Thanks,
Daisuke Nishimura.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 1:11 ` nishimura
@ 2011-01-19 1:23 ` KAMEZAWA Hiroyuki
2011-01-21 5:52 ` Daisuke Nishimura
0 siblings, 1 reply; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19 1:23 UTC (permalink / raw)
To: nishimura
Cc: Andrew Morton, Miklos Szeredi, minchan.kim, linux-fsdevel,
linux-mm, linux-kernel
On Wed, 19 Jan 2011 10:11:12 +0900
nishimura@mxp.nes.nec.co.jp wrote:
> > On Wed, 19 Jan 2011 09:27:33 +0900
> > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> >
> >> On Tue, 18 Jan 2011 15:28:44 -0800
> >> Andrew Morton <akpm@linux-foundation.org> wrote:
> >>
> >> > On Tue, 18 Jan 2011 12:18:11 +0100
> >> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >> >
> >> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> >> > > +{
> >> > > + int error;
> >> > > + struct mem_cgroup *memcg = NULL;
> >> >
> >> > I'm suspecting that the unneeded initialisation was added to suppress a
> >> > warning?
> >> >
> >> No.
> >> It's necessary for mem_cgroup_{prepare|end}_migration().
> >> mem_cgroup_prepare_migration() will return without doing anything in
> >> "if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
> >> but mem_cgroup_end_migration() depends on the value of "memcg" to decide
> >> whether prepare_migration has succeeded or not.
> >> This may not be a good implementation, but IMHO I'd like to to initialize
> >> valuable before using it in general.
> >>
> >
> > I think it can be initlized in mem_cgroup_prepare_migration().
> > I'll send patch later.
> >
> I see, thanks.
>
> I think you know it, but just a note:
> mem_cgroup_{try_charge|commit_charge}_swapin()
> use the same logic, so try_charge_swapin() should also be changed
> for consistency.
>
Thank you for caution. But I think THP+memcg bugs should be fixed before
style fixes..
After my patch (yesterday), accounting information seems works well but
I saw very huge latency when we hit limits.
==
Jan 18 10:27:22 rhel6-test kernel: [56177.770922] sh used greatest stack depth: 3592 bytes l
eft
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] INFO: rcu_sched_state detected stall on CP
U 0 (t=60000 jiffies)
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] sending NMI to all CPUs:
...
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] NMI backtrace for cpu 0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] CPU 0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] Modules linked in: autofs4 sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 virtio_balloon virtio_net virtio_blk virtio_pci virtio_ring virtio [last unloaded: scsi_wait_scan]
Jan 18 10:28:29 rhel6-test kernel: [56245.286007]
...
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] <IRQ>
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8102a04e>] arch_trigger_all_cpu_backtrace+0x5e/0xa0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810bca09>] __rcu_pending+0x169/0x3b0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8108a250>] ? tick_sched_timer+0x0/0xc0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810bccbc>] rcu_check_callbacks+0x6c/0x120
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810689a8>] update_process_times+0x48/0x90
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8108a2b6>] tick_sched_timer+0x66/0xc0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8107ede0>] __run_hrtimer+0x90/0x1e0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff81032db9>] ? kvm_clock_get_cycles+0x9/0x10
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8107f1be>] hrtimer_interrupt+0xde/0x240
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8155268b>] smp_apic_timer_interrupt+0x6b/0x9b
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8100c9d3>] apic_timer_interrupt+0x13/0x20
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] <EOI>
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810a726a>] ? res_counter_charge+0xda/0x100
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff81145459>] __mem_cgroup_try_charge+0x199/0x5d0
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff811461c6>] mem_cgroup_charge_common+0x96/0x110
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff811463b5>] mem_cgroup_newpage_charge+0x45/0x50
Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8113dbd4>] khugepaged+0x924/0x1430
==
I guess we need to relax retry logic when page_size > PAGE_SIZE.
I need to stop test application with Ctrl-C.
(Test was make -j 16 under 200M limit.)
Thanks,
-Kame
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 1:23 ` KAMEZAWA Hiroyuki
@ 2011-01-21 5:52 ` Daisuke Nishimura
2011-01-21 6:17 ` KAMEZAWA Hiroyuki
0 siblings, 1 reply; 16+ messages in thread
From: Daisuke Nishimura @ 2011-01-21 5:52 UTC (permalink / raw)
To: KAMEZAWA Hiroyuki
Cc: Andrew Morton, Miklos Szeredi, minchan.kim, linux-fsdevel,
linux-mm, linux-kernel, Daisuke Nishimura
On Wed, 19 Jan 2011 10:23:48 +0900
KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> On Wed, 19 Jan 2011 10:11:12 +0900
> nishimura@mxp.nes.nec.co.jp wrote:
>
> > > On Wed, 19 Jan 2011 09:27:33 +0900
> > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > >
> > >> On Tue, 18 Jan 2011 15:28:44 -0800
> > >> Andrew Morton <akpm@linux-foundation.org> wrote:
> > >>
> > >> > On Tue, 18 Jan 2011 12:18:11 +0100
> > >> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > >> >
> > >> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > >> > > +{
> > >> > > + int error;
> > >> > > + struct mem_cgroup *memcg = NULL;
> > >> >
> > >> > I'm suspecting that the unneeded initialisation was added to suppress a
> > >> > warning?
> > >> >
> > >> No.
> > >> It's necessary for mem_cgroup_{prepare|end}_migration().
> > >> mem_cgroup_prepare_migration() will return without doing anything in
> > >> "if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
> > >> but mem_cgroup_end_migration() depends on the value of "memcg" to decide
> > >> whether prepare_migration has succeeded or not.
> > >> This may not be a good implementation, but IMHO I'd like to to initialize
> > >> valuable before using it in general.
> > >>
> > >
> > > I think it can be initlized in mem_cgroup_prepare_migration().
> > > I'll send patch later.
> > >
> > I see, thanks.
> >
> > I think you know it, but just a note:
> > mem_cgroup_{try_charge|commit_charge}_swapin()
> > use the same logic, so try_charge_swapin() should also be changed
> > for consistency.
> >
>
> Thank you for caution. But I think THP+memcg bugs should be fixed before
> style fixes..
>
I do agree.
> After my patch (yesterday), accounting information seems works well but
> I saw very huge latency when we hit limits.
> ==
> Jan 18 10:27:22 rhel6-test kernel: [56177.770922] sh used greatest stack depth: 3592 bytes l
> eft
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] INFO: rcu_sched_state detected stall on CP
> U 0 (t=60000 jiffies)
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] sending NMI to all CPUs:
> ...
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] NMI backtrace for cpu 0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] CPU 0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] Modules linked in: autofs4 sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 virtio_balloon virtio_net virtio_blk virtio_pci virtio_ring virtio [last unloaded: scsi_wait_scan]
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007]
> ...
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] <IRQ>
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8102a04e>] arch_trigger_all_cpu_backtrace+0x5e/0xa0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810bca09>] __rcu_pending+0x169/0x3b0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8108a250>] ? tick_sched_timer+0x0/0xc0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810bccbc>] rcu_check_callbacks+0x6c/0x120
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810689a8>] update_process_times+0x48/0x90
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8108a2b6>] tick_sched_timer+0x66/0xc0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8107ede0>] __run_hrtimer+0x90/0x1e0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff81032db9>] ? kvm_clock_get_cycles+0x9/0x10
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8107f1be>] hrtimer_interrupt+0xde/0x240
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8155268b>] smp_apic_timer_interrupt+0x6b/0x9b
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8100c9d3>] apic_timer_interrupt+0x13/0x20
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] <EOI>
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810a726a>] ? res_counter_charge+0xda/0x100
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff81145459>] __mem_cgroup_try_charge+0x199/0x5d0
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff811461c6>] mem_cgroup_charge_common+0x96/0x110
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff811463b5>] mem_cgroup_newpage_charge+0x45/0x50
> Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8113dbd4>] khugepaged+0x924/0x1430
> ==
>
> I guess we need to relax retry logic when page_size > PAGE_SIZE.
> I need to stop test application with Ctrl-C.
> (Test was make -j 16 under 200M limit.)
>
I think this is caused by a following scenario.
1. mem_cgroup_charge_common() try to charge a huge page(i.e. page_size != PAGE_SIZE).
2. mem_cgroup_do_charge() fails to charge, and return CHARGE_RETRY, because
"csize > PAGE_SIZE".
3. When mem_cgroup_do_charge() returns CHARGE_RETRY, mem_cgroup_charge_common()
changes 'csize' to 'page_size', which is bigger than PAGE_SIZE.
I think you're stuck inside a loop between 2 and 3.
Thanks,
Daisuke Nishimura.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-21 5:52 ` Daisuke Nishimura
@ 2011-01-21 6:17 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-21 6:17 UTC (permalink / raw)
To: Daisuke Nishimura
Cc: Andrew Morton, Miklos Szeredi, minchan.kim, linux-fsdevel,
linux-mm, linux-kernel
On Fri, 21 Jan 2011 14:52:22 +0900
Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> On Wed, 19 Jan 2011 10:23:48 +0900
> KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote:
>
> > On Wed, 19 Jan 2011 10:11:12 +0900
> > nishimura@mxp.nes.nec.co.jp wrote:
> >
> > > > On Wed, 19 Jan 2011 09:27:33 +0900
> > > > Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp> wrote:
> > > >
> > > >> On Tue, 18 Jan 2011 15:28:44 -0800
> > > >> Andrew Morton <akpm@linux-foundation.org> wrote:
> > > >>
> > > >> > On Tue, 18 Jan 2011 12:18:11 +0100
> > > >> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> > > >> >
> > > >> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > > >> > > +{
> > > >> > > + int error;
> > > >> > > + struct mem_cgroup *memcg = NULL;
> > > >> >
> > > >> > I'm suspecting that the unneeded initialisation was added to suppress a
> > > >> > warning?
> > > >> >
> > > >> No.
> > > >> It's necessary for mem_cgroup_{prepare|end}_migration().
> > > >> mem_cgroup_prepare_migration() will return without doing anything in
> > > >> "if (mem_cgroup_disabled()" case(iow, "memcg" is not overwritten),
> > > >> but mem_cgroup_end_migration() depends on the value of "memcg" to decide
> > > >> whether prepare_migration has succeeded or not.
> > > >> This may not be a good implementation, but IMHO I'd like to to initialize
> > > >> valuable before using it in general.
> > > >>
> > > >
> > > > I think it can be initlized in mem_cgroup_prepare_migration().
> > > > I'll send patch later.
> > > >
> > > I see, thanks.
> > >
> > > I think you know it, but just a note:
> > > mem_cgroup_{try_charge|commit_charge}_swapin()
> > > use the same logic, so try_charge_swapin() should also be changed
> > > for consistency.
> > >
> >
> > Thank you for caution. But I think THP+memcg bugs should be fixed before
> > style fixes..
> >
> I do agree.
>
> > After my patch (yesterday), accounting information seems works well but
> > I saw very huge latency when we hit limits.
> > ==
> > Jan 18 10:27:22 rhel6-test kernel: [56177.770922] sh used greatest stack depth: 3592 bytes l
> > eft
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] INFO: rcu_sched_state detected stall on CP
> > U 0 (t=60000 jiffies)
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] sending NMI to all CPUs:
> > ...
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] NMI backtrace for cpu 0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] CPU 0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] Modules linked in: autofs4 sunrpc ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 ip6table_filter ip6_tables ipv6 virtio_balloon virtio_net virtio_blk virtio_pci virtio_ring virtio [last unloaded: scsi_wait_scan]
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007]
> > ...
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] <IRQ>
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8102a04e>] arch_trigger_all_cpu_backtrace+0x5e/0xa0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810bca09>] __rcu_pending+0x169/0x3b0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8108a250>] ? tick_sched_timer+0x0/0xc0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810bccbc>] rcu_check_callbacks+0x6c/0x120
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810689a8>] update_process_times+0x48/0x90
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8108a2b6>] tick_sched_timer+0x66/0xc0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8107ede0>] __run_hrtimer+0x90/0x1e0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff81032db9>] ? kvm_clock_get_cycles+0x9/0x10
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8107f1be>] hrtimer_interrupt+0xde/0x240
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8155268b>] smp_apic_timer_interrupt+0x6b/0x9b
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8100c9d3>] apic_timer_interrupt+0x13/0x20
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] <EOI>
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff810a726a>] ? res_counter_charge+0xda/0x100
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff81145459>] __mem_cgroup_try_charge+0x199/0x5d0
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff811461c6>] mem_cgroup_charge_common+0x96/0x110
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff811463b5>] mem_cgroup_newpage_charge+0x45/0x50
> > Jan 18 10:28:29 rhel6-test kernel: [56245.286007] [<ffffffff8113dbd4>] khugepaged+0x924/0x1430
> > ==
> >
> > I guess we need to relax retry logic when page_size > PAGE_SIZE.
> > I need to stop test application with Ctrl-C.
> > (Test was make -j 16 under 200M limit.)
> >
> I think this is caused by a following scenario.
>
> 1. mem_cgroup_charge_common() try to charge a huge page(i.e. page_size != PAGE_SIZE).
> 2. mem_cgroup_do_charge() fails to charge, and return CHARGE_RETRY, because
> "csize > PAGE_SIZE".
> 3. When mem_cgroup_do_charge() returns CHARGE_RETRY, mem_cgroup_charge_common()
> changes 'csize' to 'page_size', which is bigger than PAGE_SIZE.
>
> I think you're stuck inside a loop between 2 and 3.
>
I'll post paches soon. It passed my test, right now ;)
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-18 23:28 ` Andrew Morton
2011-01-19 0:27 ` Daisuke Nishimura
@ 2011-01-19 0:33 ` KAMEZAWA Hiroyuki
2011-01-19 1:24 ` Minchan Kim
2011-09-08 23:52 ` Andrew Morton
3 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-01-19 0:33 UTC (permalink / raw)
To: Andrew Morton
Cc: Miklos Szeredi, minchan.kim, nishimura, linux-fsdevel, linux-mm,
linux-kernel
On Tue, 18 Jan 2011 15:28:44 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 18 Jan 2011 12:18:11 +0100
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > +{
> > + int error;
> > + struct mem_cgroup *memcg = NULL;
>
> I'm suspecting that the unneeded initialisation was added to suppress a
> warning?
>
> I removed it, and didn't get a warning. I expected to.
>
> Really, uninitialized_var() is better. It avoids adding extra code
> and, unlike "= 0" it is self-documenting.
>
> > + VM_BUG_ON(!PageLocked(old));
> > + VM_BUG_ON(!PageLocked(new));
> > + VM_BUG_ON(new->mapping);
> > +
> > + /*
> > + * This is not page migration, but prepare_migration and
> > + * end_migration does enough work for charge replacement.
> > + *
> > + * In the longer term we probably want a specialized function
> > + * for moving the charge from old to new in a more efficient
> > + * manner.
> > + */
> > + error = mem_cgroup_prepare_migration(old, new, &memcg, gfp_mask);
> > + if (error)
> > + return error;
> > +
> > + error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > + if (!error) {
> > + struct address_space *mapping = old->mapping;
> > + pgoff_t offset = old->index;
> > +
> > + page_cache_get(new);
> > + new->mapping = mapping;
> > + new->index = offset;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + __remove_from_page_cache(old);
> > + error = radix_tree_insert(&mapping->page_tree, offset, new);
> > + BUG_ON(error);
> > + mapping->nrpages++;
> > + __inc_zone_page_state(new, NR_FILE_PAGES);
> > + if (PageSwapBacked(new))
> > + __inc_zone_page_state(new, NR_SHMEM);
> > + spin_unlock_irq(&mapping->tree_lock);
> > + radix_tree_preload_end();
> > + page_cache_release(old);
> > + mem_cgroup_end_migration(memcg, old, new, true);
>
> This is all pretty ugly and inefficient.
>
> We call __remove_from_page_cache() which does a radix-tree lookup and
> then fiddles a bunch of accounting things.
>
> Then we immediately do the same radix-tree lookup and then undo the
> accounting changes which we just did. And we do it in an open-coded
> fashion, thus giving the kernel yet another code site where various
> operations need to be kept in sync.
>
> Would it not be better to do a single radix_tree_lookup_slot(),
> overwrite the pointer therein and just leave all the ancilliary
> accounting unaltered?
>
Yes, I think radix_tree_lookup_slot & replacement technique can be used
as used in page migration. (migrate_page -> migrate_page_move_mapping)
So, I guess reusing page migration code will be easy way.
Hmm, if pages are never mapped, move_to_new_page() can be used.
But it requires page_count(old) == 1 for replacement. I'm not sure
what page_count(old) is here.
About memcg codes:
radix_tree_lookup_slot() technique is used for page-migration and
mem_cgroup_prepare_migration/end_migration is a code used for that.
Honestly, this radix-tree replacement is special rathar than migration
because this handles only file caches. So, we may be able to add
a new function for handling 'replacement' of page for memcg.
But the only user is this FUSE and I don't use FUSE for usual tests
and will never notice level down if it happens. I guess FUSE maintainer
will not notice level-down in memcg. So, I recommended to use the same
code for migration it's now heavily tested because of compaction.
So, I think it's better to reuse memcg codes even if move_to_newpage()
is used here directly.
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-18 23:28 ` Andrew Morton
2011-01-19 0:27 ` Daisuke Nishimura
2011-01-19 0:33 ` KAMEZAWA Hiroyuki
@ 2011-01-19 1:24 ` Minchan Kim
2011-01-19 1:48 ` Andrew Morton
2011-09-08 23:52 ` Andrew Morton
3 siblings, 1 reply; 16+ messages in thread
From: Minchan Kim @ 2011-01-19 1:24 UTC (permalink / raw)
To: Andrew Morton
Cc: Miklos Szeredi, kamezawa.hiroyu, nishimura, linux-fsdevel,
linux-mm, linux-kernel
On Wed, Jan 19, 2011 at 8:28 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 18 Jan 2011 12:18:11 +0100
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
>> +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
>> +{
>> + int error;
>> + struct mem_cgroup *memcg = NULL;
>
> I'm suspecting that the unneeded initialisation was added to suppress a
> warning?
>
> I removed it, and didn't get a warning. I expected to.
>
> Really, uninitialized_var() is better. It avoids adding extra code
> and, unlike "= 0" it is self-documenting.
>
>> + VM_BUG_ON(!PageLocked(old));
>> + VM_BUG_ON(!PageLocked(new));
>> + VM_BUG_ON(new->mapping);
>> +
>> + /*
>> + * This is not page migration, but prepare_migration and
>> + * end_migration does enough work for charge replacement.
>> + *
>> + * In the longer term we probably want a specialized function
>> + * for moving the charge from old to new in a more efficient
>> + * manner.
>> + */
>> + error = mem_cgroup_prepare_migration(old, new, &memcg, gfp_mask);
>> + if (error)
>> + return error;
>> +
>> + error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
>> + if (!error) {
>> + struct address_space *mapping = old->mapping;
>> + pgoff_t offset = old->index;
>> +
>> + page_cache_get(new);
>> + new->mapping = mapping;
>> + new->index = offset;
>> +
>> + spin_lock_irq(&mapping->tree_lock);
>> + __remove_from_page_cache(old);
>> + error = radix_tree_insert(&mapping->page_tree, offset, new);
>> + BUG_ON(error);
>> + mapping->nrpages++;
>> + __inc_zone_page_state(new, NR_FILE_PAGES);
>> + if (PageSwapBacked(new))
>> + __inc_zone_page_state(new, NR_SHMEM);
>> + spin_unlock_irq(&mapping->tree_lock);
>> + radix_tree_preload_end();
>> + page_cache_release(old);
>> + mem_cgroup_end_migration(memcg, old, new, true);
>
> This is all pretty ugly and inefficient.
>
> We call __remove_from_page_cache() which does a radix-tree lookup and
> then fiddles a bunch of accounting things.
>
> Then we immediately do the same radix-tree lookup and then undo the
> accounting changes which we just did. And we do it in an open-coded
> fashion, thus giving the kernel yet another code site where various
> operations need to be kept in sync.
>
> Would it not be better to do a single radix_tree_lookup_slot(),
> overwrite the pointer therein and just leave all the ancilliary
> accounting unaltered?
I agree single radix_tree_lookup but accounting still is needed since
newpage could be on another zone. What we can remove is just only
mapping->nrpages.
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 1:24 ` Minchan Kim
@ 2011-01-19 1:48 ` Andrew Morton
2011-01-19 2:17 ` Minchan Kim
0 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2011-01-19 1:48 UTC (permalink / raw)
To: Minchan Kim
Cc: Miklos Szeredi, kamezawa.hiroyu, nishimura, linux-fsdevel,
linux-mm, linux-kernel
On Wed, 19 Jan 2011 10:24:09 +0900 Minchan Kim <minchan.kim@gmail.com> wrote:
> >
> > This is all pretty ugly and inefficient.
> >
> > We call __remove_from_page_cache() which does a radix-tree lookup and
> > then fiddles a bunch of accounting things.
> >
> > Then we immediately do the same radix-tree lookup and then undo the
> > accounting changes which we just did. __And we do it in an open-coded
> > fashion, thus giving the kernel yet another code site where various
> > operations need to be kept in sync.
> >
> > Would it not be better to do a single radix_tree_lookup_slot(),
> > overwrite the pointer therein and just leave all the ancilliary
> > accounting unaltered?
>
> I agree single radix_tree_lookup but accounting still is needed since
> newpage could be on another zone. What we can remove is just only
> mapping->nrpages.
Well. We only need to do inc/dec_zone_state if the zones are
different. Perhaps the zones-equal case is worth optimising for,
dunno.
Also, the radix_tree_preload() should be unneeded.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-19 1:48 ` Andrew Morton
@ 2011-01-19 2:17 ` Minchan Kim
0 siblings, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2011-01-19 2:17 UTC (permalink / raw)
To: Andrew Morton
Cc: Miklos Szeredi, kamezawa.hiroyu, nishimura, linux-fsdevel,
linux-mm, linux-kernel
On Wed, Jan 19, 2011 at 10:48 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Wed, 19 Jan 2011 10:24:09 +0900 Minchan Kim <minchan.kim@gmail.com> wrote:
>
>> >
>> > This is all pretty ugly and inefficient.
>> >
>> > We call __remove_from_page_cache() which does a radix-tree lookup and
>> > then fiddles a bunch of accounting things.
>> >
>> > Then we immediately do the same radix-tree lookup and then undo the
>> > accounting changes which we just did. __And we do it in an open-coded
>> > fashion, thus giving the kernel yet another code site where various
>> > operations need to be kept in sync.
>> >
>> > Would it not be better to do a single radix_tree_lookup_slot(),
>> > overwrite the pointer therein and just leave all the ancilliary
>> > accounting unaltered?
>>
>> I agree single radix_tree_lookup but accounting still is needed since
>> newpage could be on another zone. What we can remove is just only
>> mapping->nrpages.
>
> Well. We only need to do inc/dec_zone_state if the zones are
> different. Perhaps the zones-equal case is worth optimising for,
> dunno.
>
> Also, the radix_tree_preload() should be unneeded.
Agree.
In summary, optimization points are following as.
1) remove radix_tree_preload
2) single radix_tree_lookup_slot and replace radix tree slot
3) page accounting optimization if both pages are in same zone.
I hope we mm guys optimize the above things with TODO.
(Except freepage issue I mentioned.)
So I want Miklos resend the patch with solving freepage issue and NOTE
of TODO in description or comment.
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-18 23:28 ` Andrew Morton
` (2 preceding siblings ...)
2011-01-19 1:24 ` Minchan Kim
@ 2011-09-08 23:52 ` Andrew Morton
2011-09-09 1:43 ` KAMEZAWA Hiroyuki
3 siblings, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2011-09-08 23:52 UTC (permalink / raw)
To: Miklos Szeredi, minchan.kim, kamezawa.hiroyu, nishimura,
linux-fsdevel, linux-mm
On Tue, 18 Jan 2011 15:28:44 -0800
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 18 Jan 2011 12:18:11 +0100
> Miklos Szeredi <miklos@szeredi.hu> wrote:
>
> > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > +{
> > + int error;
> > + struct mem_cgroup *memcg = NULL;
>
> I'm suspecting that the unneeded initialisation was added to suppress a
> warning?
>
> I removed it, and didn't get a warning. I expected to.
>
> Really, uninitialized_var() is better. It avoids adding extra code
> and, unlike "= 0" it is self-documenting.
>
> > + VM_BUG_ON(!PageLocked(old));
> > + VM_BUG_ON(!PageLocked(new));
> > + VM_BUG_ON(new->mapping);
> > +
> > + /*
> > + * This is not page migration, but prepare_migration and
> > + * end_migration does enough work for charge replacement.
> > + *
> > + * In the longer term we probably want a specialized function
> > + * for moving the charge from old to new in a more efficient
> > + * manner.
> > + */
> > + error = mem_cgroup_prepare_migration(old, new, &memcg, gfp_mask);
> > + if (error)
> > + return error;
> > +
> > + error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > + if (!error) {
> > + struct address_space *mapping = old->mapping;
> > + pgoff_t offset = old->index;
> > +
> > + page_cache_get(new);
> > + new->mapping = mapping;
> > + new->index = offset;
> > +
> > + spin_lock_irq(&mapping->tree_lock);
> > + __remove_from_page_cache(old);
> > + error = radix_tree_insert(&mapping->page_tree, offset, new);
> > + BUG_ON(error);
> > + mapping->nrpages++;
> > + __inc_zone_page_state(new, NR_FILE_PAGES);
> > + if (PageSwapBacked(new))
> > + __inc_zone_page_state(new, NR_SHMEM);
> > + spin_unlock_irq(&mapping->tree_lock);
> > + radix_tree_preload_end();
> > + page_cache_release(old);
> > + mem_cgroup_end_migration(memcg, old, new, true);
>
> This is all pretty ugly and inefficient.
>
> We call __remove_from_page_cache() which does a radix-tree lookup and
> then fiddles a bunch of accounting things.
>
> Then we immediately do the same radix-tree lookup and then undo the
> accounting changes which we just did. And we do it in an open-coded
> fashion, thus giving the kernel yet another code site where various
> operations need to be kept in sync.
>
> Would it not be better to do a single radix_tree_lookup_slot(),
> overwrite the pointer therein and just leave all the ancilliary
> accounting unaltered?
>
Poke?
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-09-08 23:52 ` Andrew Morton
@ 2011-09-09 1:43 ` KAMEZAWA Hiroyuki
0 siblings, 0 replies; 16+ messages in thread
From: KAMEZAWA Hiroyuki @ 2011-09-09 1:43 UTC (permalink / raw)
To: Andrew Morton
Cc: Miklos Szeredi, minchan.kim, nishimura, linux-fsdevel, linux-mm,
linux-kernel
On Thu, 8 Sep 2011 16:52:22 -0700
Andrew Morton <akpm@linux-foundation.org> wrote:
> On Tue, 18 Jan 2011 15:28:44 -0800
> Andrew Morton <akpm@linux-foundation.org> wrote:
>
> > On Tue, 18 Jan 2011 12:18:11 +0100
> > Miklos Szeredi <miklos@szeredi.hu> wrote:
> >
> > > +int replace_page_cache_page(struct page *old, struct page *new, gfp_t gfp_mask)
> > > +{
> > > + int error;
> > > + struct mem_cgroup *memcg = NULL;
> >
> > I'm suspecting that the unneeded initialisation was added to suppress a
> > warning?
> >
> > I removed it, and didn't get a warning. I expected to.
> >
> > Really, uninitialized_var() is better. It avoids adding extra code
> > and, unlike "= 0" it is self-documenting.
> >
> > > + VM_BUG_ON(!PageLocked(old));
> > > + VM_BUG_ON(!PageLocked(new));
> > > + VM_BUG_ON(new->mapping);
> > > +
> > > + /*
> > > + * This is not page migration, but prepare_migration and
> > > + * end_migration does enough work for charge replacement.
> > > + *
> > > + * In the longer term we probably want a specialized function
> > > + * for moving the charge from old to new in a more efficient
> > > + * manner.
> > > + */
> > > + error = mem_cgroup_prepare_migration(old, new, &memcg, gfp_mask);
> > > + if (error)
> > > + return error;
> > > +
> > > + error = radix_tree_preload(gfp_mask & ~__GFP_HIGHMEM);
> > > + if (!error) {
> > > + struct address_space *mapping = old->mapping;
> > > + pgoff_t offset = old->index;
> > > +
> > > + page_cache_get(new);
> > > + new->mapping = mapping;
> > > + new->index = offset;
> > > +
> > > + spin_lock_irq(&mapping->tree_lock);
> > > + __remove_from_page_cache(old);
> > > + error = radix_tree_insert(&mapping->page_tree, offset, new);
> > > + BUG_ON(error);
> > > + mapping->nrpages++;
> > > + __inc_zone_page_state(new, NR_FILE_PAGES);
> > > + if (PageSwapBacked(new))
> > > + __inc_zone_page_state(new, NR_SHMEM);
> > > + spin_unlock_irq(&mapping->tree_lock);
> > > + radix_tree_preload_end();
> > > + page_cache_release(old);
> > > + mem_cgroup_end_migration(memcg, old, new, true);
> >
> > This is all pretty ugly and inefficient.
> >
> > We call __remove_from_page_cache() which does a radix-tree lookup and
> > then fiddles a bunch of accounting things.
> >
> > Then we immediately do the same radix-tree lookup and then undo the
> > accounting changes which we just did. And we do it in an open-coded
> > fashion, thus giving the kernel yet another code site where various
> > operations need to be kept in sync.
> >
> > Would it not be better to do a single radix_tree_lookup_slot(),
> > overwrite the pointer therein and just leave all the ancilliary
> > accounting unaltered?
> >
>
> Poke?
Sorry, I didn't read this mail.
The codes around __remove_from_page_cache and radix_tree_insert,
I agree you.
About counters, the page may be in different zone and related statistics
should be changed. About memcg, this function does page replacement.
Then, information in old page_cgroup should be moved to the new
page_cgroup. So, I advised to use migration code which is used
in many situation(now) rather than adding new something strange.
Hmm, in quick thinking, we can reuse migration function core
rather than using this new one ? Hmm..but page_count() check
may fail....
Thanks,
-Kame
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v4] mm: add replace_page_cache_page() function
2011-01-18 11:18 [PATCH v4] mm: add replace_page_cache_page() function Miklos Szeredi
2011-01-18 23:28 ` Andrew Morton
@ 2011-01-19 1:17 ` Minchan Kim
1 sibling, 0 replies; 16+ messages in thread
From: Minchan Kim @ 2011-01-19 1:17 UTC (permalink / raw)
To: Miklos Szeredi
Cc: akpm, kamezawa.hiroyu, nishimura, linux-fsdevel, linux-mm,
linux-kernel, Linus Torvalds
On Tue, Jan 18, 2011 at 8:18 PM, Miklos Szeredi <miklos@szeredi.hu> wrote:
> Andrew,
>
> Can you please apply this to -mm, for 2.6.39?
>
> v4:
> - updated to latest -git
> - added acks
> - updated changelog
>
> Thanks,
> Miklos
6072d13c added freepage callback of filesystem.
So, all callers of __remove_from_page_cache handles it now.
Although NFS only need it and only FUSE use replace_page_cache_page,
do we have to handle it in replace_page_cache_page, too?
Or we pass mapping to __remove_from_page_cache and handle it in
__remove_from_page_cache?
--
Kind regards,
Minchan Kim
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom policy in Canada: sign http://dissolvethecrtc.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 16+ messages in thread
end of thread, other threads:[~2011-09-09 1:43 UTC | newest]
Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-01-18 11:18 [PATCH v4] mm: add replace_page_cache_page() function Miklos Szeredi
2011-01-18 23:28 ` Andrew Morton
2011-01-19 0:27 ` Daisuke Nishimura
2011-01-19 0:41 ` Andrew Morton
2011-01-19 0:48 ` KAMEZAWA Hiroyuki
2011-01-19 1:11 ` nishimura
2011-01-19 1:23 ` KAMEZAWA Hiroyuki
2011-01-21 5:52 ` Daisuke Nishimura
2011-01-21 6:17 ` KAMEZAWA Hiroyuki
2011-01-19 0:33 ` KAMEZAWA Hiroyuki
2011-01-19 1:24 ` Minchan Kim
2011-01-19 1:48 ` Andrew Morton
2011-01-19 2:17 ` Minchan Kim
2011-09-08 23:52 ` Andrew Morton
2011-09-09 1:43 ` KAMEZAWA Hiroyuki
2011-01-19 1:17 ` Minchan Kim
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).