* [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> @ 2012-12-25 17:22 ` Sha Zhengju 2012-12-28 0:39 ` Kamezawa Hiroyuki 2013-01-02 9:08 ` Michal Hocko 0 siblings, 2 replies; 27+ messages in thread From: Sha Zhengju @ 2012-12-25 17:22 UTC (permalink / raw) To: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA Cc: dchinner-H+wXaHxf7aLQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers()) extracts TestSetPageDirty from __set_page_dirty and is far away from account_page_dirtied. But it's better to make the two operations in one single function to keep modular. So in order to avoid the potential race mentioned in commit a8e7d49a, we can hold private_lock until __set_page_dirty completes. There's no deadlock between ->private_lock and ->tree_lock after confirmation. It's a prepare patch for following memcg dirty page accounting patches. Here is some test numbers that before/after this patch: Test steps(Mem-4g, ext4): drop_cache; sync fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread) We test it for 10 times and get the average numbers: Before: write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt= 8279msec lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02 After: write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt= 8274.4msec lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32 Note that the impact is little(<1%). Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> --- fs/buffer.c | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index c017a2d..3b032b9 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); * If warn is true, then emit a warning if the page is not uptodate and has * not been truncated. */ -static void __set_page_dirty(struct page *page, +static int __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { + if (unlikely(!mapping)) + return !TestSetPageDirty(page); + + if (TestSetPageDirty(page)) + return 0; + spin_lock_irq(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? 
*/ WARN_ON_ONCE(warn && !PageUptodate(page)); @@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page, } spin_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); + + return 1; } /* @@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page) bh = bh->b_this_page; } while (bh != head); } - newly_dirty = !TestSetPageDirty(page); + newly_dirty = __set_page_dirty(page, mapping, 1); spin_unlock(&mapping->private_lock); - if (newly_dirty) - __set_page_dirty(page, mapping, 1); return newly_dirty; } EXPORT_SYMBOL(__set_page_dirty_buffers); @@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh) return; } - if (!test_set_buffer_dirty(bh)) { - struct page *page = bh->b_page; - if (!TestSetPageDirty(page)) { - struct address_space *mapping = page_mapping(page); - if (mapping) - __set_page_dirty(page, mapping, 0); - } - } + if (!test_set_buffer_dirty(bh)) + __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0); } EXPORT_SYMBOL(mark_buffer_dirty); -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
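For readers following the diff, here is a simplified sketch (not the exact kernel code) of how the two call sites end up after this patch: __set_page_dirty() now performs the TestSetPageDirty itself and returns whether the page was newly dirtied, so __set_page_dirty_buffers() keeps mapping->private_lock held across the whole operation and the resulting nesting is private_lock -> tree_lock, which is what closes the create_empty_buffers() race mentioned in commit a8e7d49a.

    /* simplified sketch of fs/buffer.c after this patch */
    int __set_page_dirty_buffers(struct page *page)
    {
            struct address_space *mapping = page_mapping(page);
            int newly_dirty;

            if (unlikely(!mapping))
                    return !TestSetPageDirty(page);

            spin_lock(&mapping->private_lock);
            /* ... attach buffers and mark them dirty, as before ... */
            newly_dirty = __set_page_dirty(page, mapping, 1); /* takes tree_lock inside */
            spin_unlock(&mapping->private_lock);

            return newly_dirty;
    }

    void mark_buffer_dirty(struct buffer_head *bh)
    {
            /* ... warn/fastpath checks, as before ... */
            if (!test_set_buffer_dirty(bh))
                    __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
    }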
* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func 2012-12-25 17:22 ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju @ 2012-12-28 0:39 ` Kamezawa Hiroyuki 2013-01-05 2:34 ` Sha Zhengju 2013-01-02 9:08 ` Michal Hocko 1 sibling, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2012-12-28 0:39 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko, akpm, gthelen, fengguang.wu, glommer, Sha Zhengju (2012/12/26 2:22), Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers()) > extracts TestSetPageDirty from __set_page_dirty and is far away from > account_page_dirtied. But it's better to make the two operations in one single > function to keep modular. So in order to avoid the potential race mentioned in > commit a8e7d49a, we can hold private_lock until __set_page_dirty completes. > There's no deadlock between ->private_lock and ->tree_lock after confirmation. > It's a prepare patch for following memcg dirty page accounting patches. > > > Here is some test numbers that before/after this patch: > Test steps(Mem-4g, ext4): > drop_cache; sync > fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread) > > We test it for 10 times and get the average numbers: > Before: > write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt= 8279msec > lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02 > After: > write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt= 8274.4msec > lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32 > > Note that the impact is little(<1%). > > > Signed-off-by: Sha Zhengju <handai.szj@taobao.com> > Reviewed-by: Michal Hocko <mhocko@suse.cz> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Hmm,..this change should be double-checked by vfs, I/O guys... increasing hold time of mapping->private_lock doesn't affect performance ? -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func 2012-12-28 0:39 ` Kamezawa Hiroyuki @ 2013-01-05 2:34 ` Sha Zhengju 0 siblings, 0 replies; 27+ messages in thread From: Sha Zhengju @ 2013-01-05 2:34 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, mhocko, akpm, gthelen, fengguang.wu, glommer, Sha Zhengju Hi Kame, Sorry for the late response, I'm just back from vacation. : ) On Fri, Dec 28, 2012 at 8:39 AM, Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > (2012/12/26 2:22), Sha Zhengju wrote: >> From: Sha Zhengju <handai.szj@taobao.com> >> >> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers()) >> extracts TestSetPageDirty from __set_page_dirty and is far away from >> account_page_dirtied. But it's better to make the two operations in one single >> function to keep modular. So in order to avoid the potential race mentioned in >> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes. >> There's no deadlock between ->private_lock and ->tree_lock after confirmation. >> It's a prepare patch for following memcg dirty page accounting patches. >> >> >> Here is some test numbers that before/after this patch: >> Test steps(Mem-4g, ext4): >> drop_cache; sync >> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread) >> >> We test it for 10 times and get the average numbers: >> Before: >> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt= 8279msec >> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02 >> After: >> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt= 8274.4msec >> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32 >> >> Note that the impact is little(<1%). >> >> >> Signed-off-by: Sha Zhengju <handai.szj@taobao.com> >> Reviewed-by: Michal Hocko <mhocko@suse.cz> > > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> > > Hmm,..this change should be double-checked by vfs, I/O guys... > Now it seems they haven't paid attention here... I'll push it soon for more review. > increasing hold time of mapping->private_lock doesn't affect performance ? > > Yes, as Fengguang pointed out in the previous round, mapping->private_lock and mapping->tree_lock are often contended locks; in a dd testcase they show the top #1 and #2 contention. So the numbers above try to measure the impact of lock contention with multiple threads (numjobs=2) writing to the same file in parallel, and the impact appears small (<1%). I'm not sure if the test case is enough; any advice is welcome! : ) ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func 2012-12-25 17:22 ` [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju 2012-12-28 0:39 ` Kamezawa Hiroyuki @ 2013-01-02 9:08 ` Michal Hocko [not found] ` <20130102090803.GB22160-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-01-02 9:08 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, dchinner, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, Sha Zhengju On Wed 26-12-12 01:22:36, Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers()) > extracts TestSetPageDirty from __set_page_dirty and is far away from > account_page_dirtied. But it's better to make the two operations in one single > function to keep modular. So in order to avoid the potential race mentioned in > commit a8e7d49a, we can hold private_lock until __set_page_dirty completes. > There's no deadlock between ->private_lock and ->tree_lock after confirmation. Could you be more specific here? E.g. quote mm/filemap.c comment I have mentioned during the first round of review? > It's a prepare patch for following memcg dirty page accounting patches. > > > Here is some test numbers that before/after this patch: > Test steps(Mem-4g, ext4): > drop_cache; sync > fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread) Could also add some rationale why you think this test is relevant? > We test it for 10 times and get the average numbers: > Before: > write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt= 8279msec > lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02 > After: > write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt= 8274.4msec > lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32 > > Note that the impact is little(<1%). > > > Signed-off-by: Sha Zhengju <handai.szj@taobao.com> > Reviewed-by: Michal Hocko <mhocko@suse.cz> > --- > fs/buffer.c | 24 ++++++++++++------------ > 1 file changed, 12 insertions(+), 12 deletions(-) > > diff --git a/fs/buffer.c b/fs/buffer.c > index c017a2d..3b032b9 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); > * If warn is true, then emit a warning if the page is not uptodate and has > * not been truncated. > */ > -static void __set_page_dirty(struct page *page, > +static int __set_page_dirty(struct page *page, > struct address_space *mapping, int warn) > { > + if (unlikely(!mapping)) > + return !TestSetPageDirty(page); > + > + if (TestSetPageDirty(page)) > + return 0; > + > spin_lock_irq(&mapping->tree_lock); > if (page->mapping) { /* Race with truncate? 
*/ > WARN_ON_ONCE(warn && !PageUptodate(page)); > @@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page, > } > spin_unlock_irq(&mapping->tree_lock); > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > + > + return 1; > } > > /* > @@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page) > bh = bh->b_this_page; > } while (bh != head); > } > - newly_dirty = !TestSetPageDirty(page); > + newly_dirty = __set_page_dirty(page, mapping, 1); > spin_unlock(&mapping->private_lock); > > - if (newly_dirty) > - __set_page_dirty(page, mapping, 1); > return newly_dirty; > } > EXPORT_SYMBOL(__set_page_dirty_buffers); > @@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh) > return; > } > > - if (!test_set_buffer_dirty(bh)) { > - struct page *page = bh->b_page; > - if (!TestSetPageDirty(page)) { > - struct address_space *mapping = page_mapping(page); > - if (mapping) > - __set_page_dirty(page, mapping, 0); > - } > - } > + if (!test_set_buffer_dirty(bh)) > + __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0); > } > EXPORT_SYMBOL(mark_buffer_dirty); > > -- > 1.7.9.5 > -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func [not found] ` <20130102090803.GB22160-2MMpYkNvuYDjFM9bn6wA6Q@public.gmane.org> @ 2013-01-05 2:49 ` Sha Zhengju [not found] ` <CAFj3OHUCQkqB2+ky9wxFpkNYcn2=6t9Qd7XFf3RBY0F4Wxyqcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Sha Zhengju @ 2013-01-05 2:49 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, dchinner-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju Hi Michal, Sorry for my late response, I'm just back from vocation. : ) On Wed, Jan 2, 2013 at 5:08 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote: > On Wed 26-12-12 01:22:36, Sha Zhengju wrote: >> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >> >> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers()) >> extracts TestSetPageDirty from __set_page_dirty and is far away from >> account_page_dirtied. But it's better to make the two operations in one single >> function to keep modular. So in order to avoid the potential race mentioned in >> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes. >> There's no deadlock between ->private_lock and ->tree_lock after confirmation. > > Could you be more specific here? E.g. quote mm/filemap.c comment I have > mentioned during the first round of review? > Okay, sorry for forgetting the comment. I'll add it next round. >> It's a prepare patch for following memcg dirty page accounting patches. >> >> >> Here is some test numbers that before/after this patch: >> Test steps(Mem-4g, ext4): >> drop_cache; sync >> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread) > > Could also add some rationale why you think this test is relevant? > The test is aiming at finding the impact of performance due to lock contention by writing parallel to the same file. I'll add the reason next version too. Thanks for reviewing! Regards, Sha >> We test it for 10 times and get the average numbers: >> Before: >> write: io=2048.0MB, bw=254117KB/s, iops=63528.9 , runt= 8279msec >> lat (usec): min=1 , max=742361 , avg=30.918, stdev=1601.02 >> After: >> write: io=2048.0MB, bw=254044KB/s, iops=63510.3 , runt= 8274.4msec >> lat (usec): min=1 , max=856333 , avg=31.043, stdev=1769.32 >> >> Note that the impact is little(<1%). >> >> >> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >> Reviewed-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> >> --- >> fs/buffer.c | 24 ++++++++++++------------ >> 1 file changed, 12 insertions(+), 12 deletions(-) >> >> diff --git a/fs/buffer.c b/fs/buffer.c >> index c017a2d..3b032b9 100644 >> --- a/fs/buffer.c >> +++ b/fs/buffer.c >> @@ -609,9 +609,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); >> * If warn is true, then emit a warning if the page is not uptodate and has >> * not been truncated. 
>> */ >> -static void __set_page_dirty(struct page *page, >> +static int __set_page_dirty(struct page *page, >> struct address_space *mapping, int warn) >> { >> + if (unlikely(!mapping)) >> + return !TestSetPageDirty(page); >> + >> + if (TestSetPageDirty(page)) >> + return 0; >> + >> spin_lock_irq(&mapping->tree_lock); >> if (page->mapping) { /* Race with truncate? */ >> WARN_ON_ONCE(warn && !PageUptodate(page)); >> @@ -621,6 +627,8 @@ static void __set_page_dirty(struct page *page, >> } >> spin_unlock_irq(&mapping->tree_lock); >> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); >> + >> + return 1; >> } >> >> /* >> @@ -666,11 +674,9 @@ int __set_page_dirty_buffers(struct page *page) >> bh = bh->b_this_page; >> } while (bh != head); >> } >> - newly_dirty = !TestSetPageDirty(page); >> + newly_dirty = __set_page_dirty(page, mapping, 1); >> spin_unlock(&mapping->private_lock); >> >> - if (newly_dirty) >> - __set_page_dirty(page, mapping, 1); >> return newly_dirty; >> } >> EXPORT_SYMBOL(__set_page_dirty_buffers); >> @@ -1125,14 +1131,8 @@ void mark_buffer_dirty(struct buffer_head *bh) >> return; >> } >> >> - if (!test_set_buffer_dirty(bh)) { >> - struct page *page = bh->b_page; >> - if (!TestSetPageDirty(page)) { >> - struct address_space *mapping = page_mapping(page); >> - if (mapping) >> - __set_page_dirty(page, mapping, 0); >> - } >> - } >> + if (!test_set_buffer_dirty(bh)) >> + __set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0); >> } >> EXPORT_SYMBOL(mark_buffer_dirty); >> >> -- >> 1.7.9.5 >> > > -- > Michal Hocko > SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 2/8] Make TestSetPageDirty and dirty page accounting in one func [not found] ` <CAFj3OHUCQkqB2+ky9wxFpkNYcn2=6t9Qd7XFf3RBY0F4Wxyqcg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-01-05 10:45 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-01-05 10:45 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, dchinner-H+wXaHxf7aLQT0dZR+AlfA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju On Sat 05-01-13 10:49:00, Sha Zhengju wrote: [...] > >> Here is some test numbers that before/after this patch: > >> Test steps(Mem-4g, ext4): > >> drop_cache; sync > >> fio (ioengine=sync/write/buffered/bs=4k/size=1g/numjobs=2/group_reporting/thread) > > > > Could also add some rationale why you think this test is relevant? > > > > The test is aiming at finding the impact of performance due to lock > contention by writing parallel > to the same file. I'll add the reason next version too. Please make sure to describe which locks are exercised during that test and how much. Thanks [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com> [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> @ 2012-12-25 17:24 ` Sha Zhengju [not found] ` <1356456261-14579-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> 2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju 2 siblings, 1 reply; 27+ messages in thread From: Sha Zhengju @ 2012-12-25 17:24 UTC (permalink / raw) To: linux-kernel, cgroups, linux-mm, linux-fsdevel, ceph-devel Cc: sage, dchinner, mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, Sha Zhengju From: Sha Zhengju <handai.szj@taobao.com> Following we will treat SetPageDirty and dirty page accounting as an integrated operation. Filesystems had better use vfs interface directly to avoid those details. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: Sage Weil <sage@inktank.com> --- fs/buffer.c | 3 ++- fs/ceph/addr.c | 20 ++------------------ include/linux/buffer_head.h | 2 ++ 3 files changed, 6 insertions(+), 19 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 3b032b9..762168a 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -609,7 +609,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); * If warn is true, then emit a warning if the page is not uptodate and has * not been truncated. */ -static int __set_page_dirty(struct page *page, +int __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { if (unlikely(!mapping)) @@ -630,6 +630,7 @@ static int __set_page_dirty(struct page *page, return 1; } +EXPORT_SYMBOL(__set_page_dirty); /* * Add a page to the dirty page list. diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 6690269..f2779b8 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -5,6 +5,7 @@ #include <linux/mm.h> #include <linux/pagemap.h> #include <linux/writeback.h> /* generic_writepages */ +#include <linux/buffer_head.h> #include <linux/slab.h> #include <linux/pagevec.h> #include <linux/task_io_accounting_ops.h> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page) int undo = 0; struct ceph_snap_context *snapc; - if (unlikely(!mapping)) - return !TestSetPageDirty(page); - - if (TestSetPageDirty(page)) { - dout("%p set_page_dirty %p idx %lu -- already dirty\n", - mapping->host, page, page->index); + if (!__set_page_dirty(page, mapping, 1)) return 0; - } inode = mapping->host; ci = ceph_inode(inode); @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page) snapc, snapc->seq, snapc->num_snaps); spin_unlock(&ci->i_ceph_lock); - /* now adjust page */ - spin_lock_irq(&mapping->tree_lock); if (page->mapping) { /* Race with truncate? */ - WARN_ON_ONCE(!PageUptodate(page)); - account_page_dirtied(page, page->mapping); - radix_tree_tag_set(&mapping->page_tree, - page_index(page), PAGECACHE_TAG_DIRTY); - /* * Reference snap context in page->private. Also set * PagePrivate so that we get invalidatepage callback. 
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page) undo = 1; } - spin_unlock_irq(&mapping->tree_lock); - if (undo) /* whoops, we failed to dirty the page */ ceph_put_wrbuffer_cap_refs(ci, 1, snapc); - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); - BUG_ON(!PageDirty(page)); return 1; } diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h index 458f497..0a331a8 100644 --- a/include/linux/buffer_head.h +++ b/include/linux/buffer_head.h @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh) } extern int __set_page_dirty_buffers(struct page *page); +extern int __set_page_dirty(struct page *page, + struct address_space *mapping, int warn); #else /* CONFIG_BLOCK */ -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
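For illustration of the intended usage (a hypothetical filesystem, not part of this series), once __set_page_dirty() is exported a filesystem's ->set_page_dirty() can delegate the flag test, the dirty accounting and the radix-tree tagging to the VFS, much as the ceph hunks above do:

    #include <linux/buffer_head.h>  /* for the exported __set_page_dirty() */

    /* "foo" is a made-up filesystem used only to show the calling convention */
    static int foo_set_page_dirty(struct page *page)
    {
            struct address_space *mapping = page->mapping;

            /* VFS does TestSetPageDirty, account_page_dirtied and tree tagging */
            if (!__set_page_dirty(page, mapping, 1))
                    return 0;       /* page was already dirty */

            /* ... filesystem-private bookkeeping would go here ... */
            return 1;
    }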
* Re: [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem [not found] ` <1356456261-14579-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> @ 2012-12-28 0:41 ` Kamezawa Hiroyuki 0 siblings, 0 replies; 27+ messages in thread From: Kamezawa Hiroyuki @ 2012-12-28 0:41 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, ceph-devel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, Sha Zhengju (2012/12/26 2:24), Sha Zhengju wrote: > From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> > > Following we will treat SetPageDirty and dirty page accounting as an integrated > operation. Filesystems had better use vfs interface directly to avoid those details. > > Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> > Acked-by: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> ^ permalink raw reply [flat|nested] 27+ messages in thread
* [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com> [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> 2012-12-25 17:24 ` [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju @ 2012-12-25 17:26 ` Sha Zhengju 2013-01-02 10:44 ` Michal Hocko 2013-01-06 20:07 ` Greg Thelen 2 siblings, 2 replies; 27+ messages in thread From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw) To: linux-kernel, cgroups, linux-mm, linux-fsdevel Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju From: Sha Zhengju <handai.szj@taobao.com> This patch adds memcg routines to count dirty pages, which allows memory controller to maintain an accurate view of the amount of its dirty memory and can provide some info for users while cgroup's direct reclaim is working. After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg has a feature to move a page from a cgroup to another one and may have race between "move" and "page stat accounting". So in order to avoid the race we have designed a bigger lock: mem_cgroup_begin_update_page_stat() modify page information -->(a) mem_cgroup_update_page_stat() -->(b) mem_cgroup_end_update_page_stat() It requires (a) and (b)(dirty pages accounting) can stay close enough. In the previous two prepare patches, we have reworked the vfs set page dirty routines and now the interfaces are more explicit: incrementing (2): __set_page_dirty __set_page_dirty_nobuffers decrementing (2): clear_page_dirty_for_io cancel_dirty_page To prevent AB/BA deadlock mentioned by Greg Thelen in previous version (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: ->private_lock --> mapping->tree_lock --> memcg->move_lock. So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, a prepare PageDirty() checking is added. Signed-off-by: Sha Zhengju <handai.szj@taobao.com> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com> Acked-by: Fengguang Wu <fengguang.wu@intel.com> --- fs/buffer.c | 14 +++++++++++++- include/linux/memcontrol.h | 1 + mm/filemap.c | 10 ++++++++++ mm/memcontrol.c | 29 ++++++++++++++++++++++------- mm/page-writeback.c | 39 ++++++++++++++++++++++++++++++++------- mm/truncate.c | 6 ++++++ 6 files changed, 84 insertions(+), 15 deletions(-) diff --git a/fs/buffer.c b/fs/buffer.c index 762168a..53402d2 100644 --- a/fs/buffer.c +++ b/fs/buffer.c @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); int __set_page_dirty(struct page *page, struct address_space *mapping, int warn) { + bool locked; + unsigned long flags; + if (unlikely(!mapping)) return !TestSetPageDirty(page); - if (TestSetPageDirty(page)) + if (PageDirty(page)) return 0; spin_lock_irq(&mapping->tree_lock); + mem_cgroup_begin_update_page_stat(page, &locked, &flags); + + if (TestSetPageDirty(page)) { + mem_cgroup_end_update_page_stat(page, &locked, &flags); + spin_unlock_irq(&mapping->tree_lock); + return 0; + } + if (page->mapping) { /* Race with truncate? 
*/ WARN_ON_ONCE(warn && !PageUptodate(page)); account_page_dirtied(page, mapping); radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + mem_cgroup_end_update_page_stat(page, &locked, &flags); spin_unlock_irq(&mapping->tree_lock); __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h index 5421b8a..2685d8a 100644 --- a/include/linux/memcontrol.h +++ b/include/linux/memcontrol.h @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index { MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ MEM_CGROUP_STAT_NSTATS, }; diff --git a/mm/filemap.c b/mm/filemap.c index 83efee7..b589be5 100644 --- a/mm/filemap.c +++ b/mm/filemap.c @@ -62,6 +62,11 @@ * ->swap_lock (exclusive_swap_page, others) * ->mapping->tree_lock * + * ->private_lock (__set_page_dirty_buffers) + * ->mapping->tree_lock + * ->memcg->move_lock (mem_cgroup_begin_update_page_stat-> + * move_lock_mem_cgroup) + * * ->i_mutex * ->i_mmap_mutex (truncate->unmap_mapping_range) * @@ -112,6 +117,8 @@ void __delete_from_page_cache(struct page *page) { struct address_space *mapping = page->mapping; + bool locked; + unsigned long flags; /* * if we're uptodate, flush out into the cleancache, otherwise @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page) * Fix it up by doing a final dirty accounting check after * having removed the page entirely. */ + mem_cgroup_begin_update_page_stat(page, &locked, &flags); if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); } + mem_cgroup_end_update_page_stat(page, &locked, &flags); } /** diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d450c04..c884640 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = { "rss", "mapped_file", "swap", + "dirty", }; enum mem_cgroup_events_index { @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head) } #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ +static inline +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, + struct mem_cgroup *to, + unsigned int nr_pages, + enum mem_cgroup_stat_index idx) +{ + /* Update stat data for mem_cgroup */ + preempt_disable(); + __this_cpu_add(from->stat->count[idx], -nr_pages); + __this_cpu_add(to->stat->count[idx], nr_pages); + preempt_enable(); +} + /** * mem_cgroup_move_account - move account of the page * @page: the page @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page, move_lock_mem_cgroup(from, &flags); - if (!anon && page_mapped(page)) { - /* Update mapped_file data for mem_cgroup */ - preempt_disable(); - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); - preempt_enable(); - } + if (!anon && page_mapped(page)) + mem_cgroup_move_account_page_stat(from, to, nr_pages, + MEM_CGROUP_STAT_FILE_MAPPED); + + if (PageDirty(page)) + mem_cgroup_move_account_page_stat(from, to, nr_pages, + MEM_CGROUP_STAT_FILE_DIRTY); + mem_cgroup_charge_statistics(from, anon, -nr_pages); /* caller should have done css_get */ diff --git a/mm/page-writeback.c b/mm/page-writeback.c index 0713bfb..526ddd7 100644 
--- a/mm/page-writeback.c +++ b/mm/page-writeback.c @@ -1978,11 +1978,17 @@ int __set_page_dirty_no_writeback(struct page *page) /* * Helper function for set_page_dirty family. + * + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock + * while modifying struct page state and accounting dirty pages. + * See __set_page_dirty for example. + * * NOTE: This relies on being atomic wrt interrupts. */ void account_page_dirtied(struct page *page, struct address_space *mapping) { if (mapping_cap_account_dirty(mapping)) { + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); __inc_zone_page_state(page, NR_FILE_DIRTY); __inc_zone_page_state(page, NR_DIRTIED); __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); @@ -2022,14 +2028,22 @@ EXPORT_SYMBOL(account_page_writeback); */ int __set_page_dirty_nobuffers(struct page *page) { + bool locked; + unsigned long flags; + struct address_space *mapping = page_mapping(page); + + if (PageDirty(page)) + return 0; + + if (unlikely(!mapping)) + return !TestSetPageDirty(page); + + spin_lock_irq(&mapping->tree_lock); + mem_cgroup_begin_update_page_stat(page, &locked, &flags); + if (!TestSetPageDirty(page)) { - struct address_space *mapping = page_mapping(page); struct address_space *mapping2; - if (!mapping) - return 1; - - spin_lock_irq(&mapping->tree_lock); mapping2 = page_mapping(page); if (mapping2) { /* Race with truncate? */ BUG_ON(mapping2 != mapping); @@ -2038,13 +2052,18 @@ int __set_page_dirty_nobuffers(struct page *page) radix_tree_tag_set(&mapping->page_tree, page_index(page), PAGECACHE_TAG_DIRTY); } + mem_cgroup_end_update_page_stat(page, &locked, &flags); spin_unlock_irq(&mapping->tree_lock); + if (mapping->host) { /* !PageAnon && !swapper_space */ __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); } return 1; } + + mem_cgroup_end_update_page_stat(page, &locked, &flags); + spin_unlock_irq(&mapping->tree_lock); return 0; } EXPORT_SYMBOL(__set_page_dirty_nobuffers); @@ -2160,6 +2179,9 @@ EXPORT_SYMBOL(set_page_dirty_lock); int clear_page_dirty_for_io(struct page *page) { struct address_space *mapping = page_mapping(page); + bool locked; + unsigned long flags; + int ret = 0; BUG_ON(!PageLocked(page)); @@ -2201,13 +2223,16 @@ int clear_page_dirty_for_io(struct page *page) * the desired exclusion. See mm/memory.c:do_wp_page() * for more comments. 
*/ + mem_cgroup_begin_update_page_stat(page, &locked, &flags); if (TestClearPageDirty(page)) { + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); - return 1; + ret = 1; } - return 0; + mem_cgroup_end_update_page_stat(page, &locked, &flags); + return ret; } return TestClearPageDirty(page); } diff --git a/mm/truncate.c b/mm/truncate.c index db1b216..c81e6c4 100644 --- a/mm/truncate.c +++ b/mm/truncate.c @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial) */ void cancel_dirty_page(struct page *page, unsigned int account_size) { + bool locked; + unsigned long flags; + + mem_cgroup_begin_update_page_stat(page, &locked, &flags); if (TestClearPageDirty(page)) { struct address_space *mapping = page->mapping; if (mapping && mapping_cap_account_dirty(mapping)) { + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); dec_zone_page_state(page, NR_FILE_DIRTY); dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size) task_io_account_cancelled_write(account_size); } } + mem_cgroup_end_update_page_stat(page, &locked, &flags); } EXPORT_SYMBOL(cancel_dirty_page); -- 1.7.9.5 ^ permalink raw reply related [flat|nested] 27+ messages in thread
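Condensed from the hunks above, the updater-side pattern this patch relies on looks roughly like the sketch below (the foo_clear_dirty() wrapper is made up; its body mirrors the clear_page_dirty_for_io()/cancel_dirty_page() changes). The point is that the page-flag transition (a) and the memcg stat update (b) both sit inside the begin/end section, so a charge move cannot slip in between them:

    /* sketch only: decrement side of the dirty accounting protocol */
    static int foo_clear_dirty(struct page *page, struct address_space *mapping)
    {
            bool locked;
            unsigned long flags;
            int ret = 0;

            mem_cgroup_begin_update_page_stat(page, &locked, &flags);
            if (TestClearPageDirty(page)) {         /* (a) modify page state */
                    mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); /* (b) */
                    dec_zone_page_state(page, NR_FILE_DIRTY);
                    dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
                    ret = 1;
            }
            mem_cgroup_end_update_page_stat(page, &locked, &flags);
            return ret;
    }

The increment side has the same shape, except that __set_page_dirty() and __set_page_dirty_nobuffers() additionally take mapping->tree_lock first, giving the lock order documented in the mm/filemap.c comment: private_lock -> tree_lock -> memcg move_lock.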
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju @ 2013-01-02 10:44 ` Michal Hocko 2013-01-05 4:48 ` Sha Zhengju 2013-05-03 9:11 ` Michal Hocko 2013-01-06 20:07 ` Greg Thelen 1 sibling, 2 replies; 27+ messages in thread From: Michal Hocko @ 2013-01-02 10:44 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju On Wed 26-12-12 01:26:07, Sha Zhengju wrote: > From: Sha Zhengju <handai.szj@taobao.com> > > This patch adds memcg routines to count dirty pages, which allows memory controller > to maintain an accurate view of the amount of its dirty memory and can provide some > info for users while cgroup's direct reclaim is working. I guess you meant targeted resp. (hard/soft) limit reclaim here, right? It is true that this is direct reclaim but it is not clear to me why the usefulnes should be limitted to the reclaim for users. I would understand this if the users was in fact in-kernel users. [...] > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: > ->private_lock --> mapping->tree_lock --> memcg->move_lock. > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() > and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, > a prepare PageDirty() checking is added. But there is another AA deadlock here I believe. page_remove_rmap mem_cgroup_begin_update_page_stat <<< 1 set_page_dirty __set_page_dirty_buffers __set_page_dirty mem_cgroup_begin_update_page_stat <<< 2 move_lock_mem_cgroup spin_lock_irqsave(&memcg->move_lock, *flags); mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS because we might race with the moving charges: CPU0 CPU1 page_remove_rmap mem_cgroup_can_attach mem_cgroup_begin_update_page_stat (1) rcu_read_lock mem_cgroup_start_move atomic_inc(&memcg_moving) atomic_inc(&memcg->moving_account) synchronize_rcu __mem_cgroup_begin_update_page_stat mem_cgroup_stolen <<< TRUE move_lock_mem_cgroup [...] mem_cgroup_begin_update_page_stat (2) __mem_cgroup_begin_update_page_stat mem_cgroup_stolen <<< still TRUE move_lock_mem_cgroup <<< DEADLOCK [...] mem_cgroup_end_update_page_stat rcu_unlock # wake up from synchronize_rcu [...] mem_cgroup_move_task mem_cgroup_move_charge walk_page_range mem_cgroup_move_account move_lock_mem_cgroup Maybe I have missed some other locking which would prevent this from happening but the locking relations are really complicated in this area so if mem_cgroup_{begin,end}_update_page_stat might be called recursively then we need a fat comment which justifies that. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
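To make the nesting easier to see, here is a rough sketch of the relevant part of page_remove_rmap() (simplified from mm/rmap.c of that time; the set_page_dirty() call comes from the s390 storage-key path):

    void page_remove_rmap(struct page *page)
    {
            bool anon = PageAnon(page);
            bool locked;
            unsigned long flags;

            if (!anon)
                    mem_cgroup_begin_update_page_stat(page, &locked, &flags);  /* (1) */

            /* page still mapped by someone else? */
            if (!atomic_add_negative(-1, &page->_mapcount))
                    goto out;

            /* s390: transfer dirty bit from storage key to struct page */
            if ((!anon || PageSwapCache(page)) &&
                page_test_and_clear_dirty(page_to_pfn(page), 1))
                    set_page_dirty(page);
                    /* -> __set_page_dirty_buffers() -> __set_page_dirty()
                     * -> mem_cgroup_begin_update_page_stat()  (2), which tries
                     * to take memcg->move_lock again while (1) already holds
                     * it: the AA deadlock described above. */

            /* ... stat updates, uncharge ... */
    out:
            if (!anon)
                    mem_cgroup_end_update_page_stat(page, &locked, &flags);
    }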
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2013-01-02 10:44 ` Michal Hocko @ 2013-01-05 4:48 ` Sha Zhengju 2013-01-06 20:02 ` Hugh Dickins [not found] ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 2013-05-03 9:11 ` Michal Hocko 1 sibling, 2 replies; 27+ messages in thread From: Sha Zhengju @ 2013-01-05 4:48 UTC (permalink / raw) To: Michal Hocko Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote: > On Wed 26-12-12 01:26:07, Sha Zhengju wrote: >> From: Sha Zhengju <handai.szj@taobao.com> >> >> This patch adds memcg routines to count dirty pages, which allows memory controller >> to maintain an accurate view of the amount of its dirty memory and can provide some >> info for users while cgroup's direct reclaim is working. > > I guess you meant targeted resp. (hard/soft) limit reclaim here, > right? It is true that this is direct reclaim but it is not clear to me Yes, I meant memcg hard/soft reclaim here which is triggered directly by allocation and is distinct from background kswapd reclaim (global). > why the usefulnes should be limitted to the reclaim for users. I would > understand this if the users was in fact in-kernel users. > One of the reasons I'm trying to accounting the dirty pages is to get a more board overall view of memory usages because memcg hard/soft reclaim may have effect on response time of user application. Yeah, the beneficiary can be application administrator or kernel users. :P > [...] >> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() >> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, >> a prepare PageDirty() checking is added. > > But there is another AA deadlock here I believe. > page_remove_rmap > mem_cgroup_begin_update_page_stat <<< 1 > set_page_dirty > __set_page_dirty_buffers > __set_page_dirty > mem_cgroup_begin_update_page_stat <<< 2 > move_lock_mem_cgroup > spin_lock_irqsave(&memcg->move_lock, *flags); > > mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS > because we might race with the moving charges: > CPU0 CPU1 > page_remove_rmap > mem_cgroup_can_attach > mem_cgroup_begin_update_page_stat (1) > rcu_read_lock > mem_cgroup_start_move > atomic_inc(&memcg_moving) > atomic_inc(&memcg->moving_account) > synchronize_rcu > __mem_cgroup_begin_update_page_stat > mem_cgroup_stolen <<< TRUE > move_lock_mem_cgroup > [...] > mem_cgroup_begin_update_page_stat (2) > __mem_cgroup_begin_update_page_stat > mem_cgroup_stolen <<< still TRUE > move_lock_mem_cgroup <<< DEADLOCK > [...] > mem_cgroup_end_update_page_stat > rcu_unlock > # wake up from synchronize_rcu > [...] > mem_cgroup_move_task > mem_cgroup_move_charge > walk_page_range > mem_cgroup_move_account > move_lock_mem_cgroup > > > Maybe I have missed some other locking which would prevent this from > happening but the locking relations are really complicated in this area > so if mem_cgroup_{begin,end}_update_page_stat might be called > recursively then we need a fat comment which justifies that. > Ohhh...good catching! 
I didn't notice there is a recursive call of mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). The mem_cgroup_{begin,end}_update_page_stat() design has depressed me a lot recently as the lock granularity is a little bigger than I thought. Not only the resource but also some code logic falls within the locked range, which may be deadlock prone. The problem will still exist if we try to add accounting of other memcg page stats later, so may I be so bold as to suggest that we dig into the lock again... But with regard to the current lock implementation, I doubt if we can account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just take move_lock once at the beginning. IMHO we can make mem_cgroup_{begin,end}_update_page_stat() recursion-aware, and what I'm thinking of now is changing memcg->move_lock from the original spinlock to a rw-spinlock: mem_cgroup_{begin,end}_update_page_stat() takes the read lock, which makes it re-enterable, and the memcg moving task side takes the write lock. Then the race may be as follows: CPU0 CPU1 page_remove_rmap mem_cgroup_can_attach mem_cgroup_begin_update_page_stat (1) rcu_read_lock mem_cgroup_start_move atomic_inc(&memcg_moving) atomic_inc(&memcg->moving_account) synchronize_rcu __mem_cgroup_begin_update_page_stat mem_cgroup_stolen <<< TRUE move_lock_mem_cgroup <<<< read-spinlock success [...] mem_cgroup_begin_update_page_stat (2) __mem_cgroup_begin_update_page_stat mem_cgroup_stolen <<< still TRUE move_lock_mem_cgroup <<<< read-spinlock success [...] mem_cgroup_end_update_page_stat <<< locked = true, unlock rcu_unlock # wake up from synchronize_rcu [...] mem_cgroup_move_task mem_cgroup_move_charge walk_page_range mem_cgroup_move_account move_lock_mem_cgroup <<< write-spinlock AFAICS, the deadlock seems to be avoided by both the rcu and rwlock. Is there anything I missed? Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
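A minimal sketch of the rw-spinlock idea described above (illustrative only, not a tested patch; it keeps the existing move_lock_mem_cgroup()/move_unlock_mem_cgroup() helper names and assumes memcg->move_lock is turned into an rwlock_t initialised with rwlock_init()):

    /* updater side, called from mem_cgroup_{begin,end}_update_page_stat() */
    static void move_lock_mem_cgroup(struct mem_cgroup *memcg,
                                     unsigned long *flags)
    {
            read_lock_irqsave(&memcg->move_lock, *flags);   /* nestable on the same CPU */
    }

    static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
                                       unsigned long *flags)
    {
            read_unlock_irqrestore(&memcg->move_lock, *flags);
    }

    /* mover side, inside mem_cgroup_move_account(): */
    write_lock_irqsave(&memcg->move_lock, flags);
    /* ... move the page and its per-memcg counters ... */
    write_unlock_irqrestore(&memcg->move_lock, flags);

The classic Linux rwlock is reader-biased, so a CPU that already holds the read lock can take it again even if a writer is waiting, which is what would make the nested begin/end sections safe in this scheme; the trade-off is that a steady stream of updaters can delay the charge mover.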
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2013-01-05 4:48 ` Sha Zhengju @ 2013-01-06 20:02 ` Hugh Dickins [not found] ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org> [not found] ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2013-01-06 20:02 UTC (permalink / raw) To: Sha Zhengju Cc: Michal Hocko, Johannes Weiner, linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju On Sat, 5 Jan 2013, Sha Zhengju wrote: > On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote: > > > > Maybe I have missed some other locking which would prevent this from > > happening but the locking relations are really complicated in this area > > so if mem_cgroup_{begin,end}_update_page_stat might be called > > recursively then we need a fat comment which justifies that. > > > > Ohhh...good catching! I didn't notice there is a recursive call of > mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). > The mem_cgroup_{begin,end}_update_page_stat() design has depressed > me a lot recently as the lock granularity is a little bigger than I thought. > Not only the resource but also some code logic is in the range of locking > which may be deadlock prone. The problem still exists if we are trying to > add stat account of other memcg page later, may I make bold to suggest > that we dig into the lock again... Forgive me, I must confess I'm no more than skimming this thread, and don't like dumping unsigned-off patches on people; but thought that on balance it might be more helpful than not if I offer you a patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). I too was getting depressed by the constraints imposed by mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san did to minimize them), and wanted to replace by something freer, more RCU-like. In the end it seemed more effort than it was worth to go as far as I wanted, but I do think that this is some improvement over what we currently have, and should deal with your recursion issue. But if this does appear useful to memcg people, then we really ought to get it checked over by locking/barrier experts before going further. I think myself that I've over-barriered it, and could use a little lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come to mind) will see more clearly, and may just hate the whole thing, as yet another peculiar lockdep-avoiding hand-crafted locking scheme. I've not wanted to waste their time on reviewing it, if it's not even going to be useful to memcg people. It may be easier to understand if you just apply the patch and look at the result in mm/memcontrol.c, where I tried to gather the pieces together in one place and describe them ("These functions mediate..."). 
Hugh include/linux/memcontrol.h | 39 +-- mm/memcontrol.c | 375 +++++++++++++++++++++-------------- mm/rmap.c | 20 - 3 files changed, 257 insertions(+), 177 deletions(-) --- 3.8-rc2/include/linux/memcontrol.h 2012-12-22 09:43:27.172015571 -0800 +++ linux/include/linux/memcontrol.h 2013-01-02 14:47:47.960394878 -0800 @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v return false; } -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, - unsigned long *flags); - +void __mem_cgroup_begin_update_page_stat(struct page *page); +void __mem_cgroup_end_update_page_stat(void); extern atomic_t memcg_moving; static inline void mem_cgroup_begin_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) + bool *clamped) { - if (mem_cgroup_disabled()) - return; - rcu_read_lock(); - *locked = false; - if (atomic_read(&memcg_moving)) - __mem_cgroup_begin_update_page_stat(page, locked, flags); + preempt_disable(); + *clamped = false; + if (unlikely(atomic_read(&memcg_moving))) { + __mem_cgroup_begin_update_page_stat(page); + *clamped = true; + } } -void __mem_cgroup_end_update_page_stat(struct page *page, - unsigned long *flags); static inline void mem_cgroup_end_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) + bool *clamped) { - if (mem_cgroup_disabled()) - return; - if (*locked) - __mem_cgroup_end_update_page_stat(page, flags); - rcu_read_unlock(); + /* We don't currently use the page arg, but keep it for symmetry */ + if (unlikely(*clamped)) + __mem_cgroup_end_update_page_stat(); + preempt_enable(); } void mem_cgroup_update_page_stat(struct page *page, @@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr } static inline void mem_cgroup_begin_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) + bool *clamped) { + /* It may be helpful to our callers if the stub behaves the same way */ + preempt_disable(); } static inline void mem_cgroup_end_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) + bool *clamped) { + preempt_enable(); } static inline void mem_cgroup_inc_page_stat(struct page *page, --- 3.8-rc2/mm/memcontrol.c 2012-12-22 09:43:27.628015582 -0800 +++ linux/mm/memcontrol.c 2013-01-02 14:55:36.268406008 -0800 @@ -321,12 +321,7 @@ struct mem_cgroup { * mem_cgroup ? And what type of charges should we move ? */ unsigned long move_charge_at_immigrate; - /* - * set > 0 if pages under this cgroup are moving to other cgroup. - */ - atomic_t moving_account; - /* taken only while moving_account > 0 */ - spinlock_t move_lock; + /* * percpu counter. */ @@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr } /* - * memcg->moving_account is used for checking possibility that some thread is - * calling move_account(). When a thread on CPU-A starts moving pages under - * a memcg, other threads should check memcg->moving_account under - * rcu_read_lock(), like this: - * - * CPU-A CPU-B - * rcu_read_lock() - * memcg->moving_account+1 if (memcg->mocing_account) - * take heavy locks. - * synchronize_rcu() update something. - * rcu_read_unlock() - * start move here. - */ - -/* for quick checking without looking up memcg */ -atomic_t memcg_moving __read_mostly; - -static void mem_cgroup_start_move(struct mem_cgroup *memcg) -{ - atomic_inc(&memcg_moving); - atomic_inc(&memcg->moving_account); - synchronize_rcu(); -} - -static void mem_cgroup_end_move(struct mem_cgroup *memcg) -{ - /* - * Now, mem_cgroup_clear_mc() may call this function with NULL. 
- * We check NULL in callee rather than caller. - */ - if (memcg) { - atomic_dec(&memcg_moving); - atomic_dec(&memcg->moving_account); - } -} - -/* - * 2 routines for checking "mem" is under move_account() or not. - * - * mem_cgroup_stolen() - checking whether a cgroup is mc.from or not. This - * is used for avoiding races in accounting. If true, - * pc->mem_cgroup may be overwritten. - * * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or * under hierarchy of moving cgroups. This is for - * waiting at hith-memory prressure caused by "move". + * waiting at high memory pressure caused by "move". */ - -static bool mem_cgroup_stolen(struct mem_cgroup *memcg) -{ - VM_BUG_ON(!rcu_read_lock_held()); - return atomic_read(&memcg->moving_account) > 0; -} - static bool mem_cgroup_under_move(struct mem_cgroup *memcg) { struct mem_cgroup *from; @@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st return false; } -/* - * Take this lock when - * - a code tries to modify page's memcg while it's USED. - * - a code tries to modify page state accounting in a memcg. - * see mem_cgroup_stolen(), too. - */ -static void move_lock_mem_cgroup(struct mem_cgroup *memcg, - unsigned long *flags) -{ - spin_lock_irqsave(&memcg->move_lock, *flags); -} - -static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, - unsigned long *flags) -{ - spin_unlock_irqrestore(&memcg->move_lock, *flags); -} - /** * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. * @memcg: The memory cgroup that went over limit @@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct } /* - * Currently used to update mapped file statistics, but the routine can be - * generalized to update other statistics as well. - * - * Notes: Race condition - * - * We usually use page_cgroup_lock() for accessing page_cgroup member but - * it tends to be costly. But considering some conditions, we doesn't need - * to do so _always_. - * - * Considering "charge", lock_page_cgroup() is not required because all - * file-stat operations happen after a page is attached to radix-tree. There - * are no race with "charge". - * - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even - * if there are race with "uncharge". Statistics itself is properly handled - * by flags. + * These functions mediate between the common case of updating memcg stats + * when a page transitions from one state to another, and the rare case of + * moving a page from one memcg to another. + * + * A simple example of the updater would be: + * mem_cgroup_begin_update_page_stat(page); + * if (TestClearPageFlag(page)) + * mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES); + * mem_cgroup_end_update_page_stat(page); + * + * An over-simplified example of the mover would be: + * mem_cgroup_begin_move(); + * for each page chosen from old_memcg { + * pc = lookup_page_cgroup(page); + * lock_page_cgroup(pc); + * if (trylock_memcg_move(page)) { + * if (PageFlag(page)) { + * mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES); + * pc->mem_cgroup = new_memcg; + * mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES); + * } + * unlock_memcg_move(); + * unlock_page_cgroup(pc); + * } + * cond_resched(); + * } + * mem_cgroup_end_move(); + * + * Without some kind of serialization between updater and mover, the mover + * cannot know whether or not to move one count from old to new memcg stats; + * but the serialization must be as lightweight as possible for the updater. 
+ * + * At present we use two layers of lock avoidance, then spinlock on memcg; + * but that already got into (easily avoided) lock hierarchy violation with + * the page_cgroup lock; and as dirty writeback stats are added, it gets + * into further difficulty with the page cache radix tree lock (and on s390 + * architecture, page_remove_rmap calls set_page_dirty within its critical + * section: perhaps that can be reordered, but if not, it requires nesting). + * + * We need a mechanism more like rcu_read_lock() for the updater, who then + * does not have to worry about lock ordering. The scheme below is not quite + * as light as that: rarely, the updater does have to spin waiting on a mover; + * and it is still best for updater to avoid taking page_cgroup lock in its + * critical section (though mover drops and retries if necessary, so there is + * no actual deadlock). Testing on 4-way suggests 5% heavier for the mover. + */ + +/* + * memcg_moving count is written in advance by movers, + * and read by updaters to see if they need to worry further. + */ +atomic_t memcg_moving __read_mostly; + +/* + * Keep it simple: allow only one page to move at a time. cgroup_mutex + * already serializes move_charge_at_immigrate movements, but not writes + * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir. * - * Considering "move", this is an only case we see a race. To make the race - * small, we check mm->moving_account and detect there are possibility of race - * If there is, we take a lock. + * memcg_moving_lock guards writes by movers to memcg_moving_page, + * which is read by updaters to see if they need to worry about their page. + */ +static DEFINE_SPINLOCK(memcg_moving_lock); +static struct page *memcg_moving_page; + +/* + * updating_page_stat is written per-cpu by updaters, + * and all cpus read by mover to check when safe to proceed with the move. */ +static DEFINE_PER_CPU(int, updating_page_stat) = 0; -void __mem_cgroup_begin_update_page_stat(struct page *page, - bool *locked, unsigned long *flags) +/* + * Mover calls mem_cgroup_begin_move() before starting on its pages; its + * synchronize_rcu() ensures that all updaters will see memcg_moving in time. + */ +static void mem_cgroup_begin_move(void) { - struct mem_cgroup *memcg; - struct page_cgroup *pc; + get_online_cpus(); + atomic_inc(&memcg_moving); + synchronize_rcu(); +} + +static void mem_cgroup_end_move(void) +{ + atomic_dec(&memcg_moving); + put_online_cpus(); +} + +/* + * Mover calls trylock_memcg_move(page) before moving stats and changing + * ownership of page. If it fails, mover should drop page_cgroup lock and + * any other spinlocks held, cond_resched then try the page again. This + * lets updaters take those locks if unavoidable, though preferably not. + */ +static bool trylock_memcg_move(struct page *page) +{ + static struct cpumask updating; + int try; + + cpumask_copy(&updating, cpu_online_mask); + spin_lock(&memcg_moving_lock); + memcg_moving_page = page; - pc = lookup_page_cgroup(page); -again: - memcg = pc->mem_cgroup; - if (unlikely(!memcg || !PageCgroupUsed(pc))) - return; /* - * If this memory cgroup is not under account moving, we don't - * need to take move_lock_mem_cgroup(). Because we already hold - * rcu_read_lock(), any calls to move_account will be delayed until - * rcu_read_unlock() if mem_cgroup_stolen() == true. + * Make sure that __mem_cgroup_begin_update_page_stat(page) can see + * our memcg_moving_page before it commits to updating_page_stat. 
*/ - if (!mem_cgroup_stolen(memcg)) - return; + smp_mb(); - move_lock_mem_cgroup(memcg, flags); - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { - move_unlock_mem_cgroup(memcg, flags); - goto again; + for (try = 0; try < 64; try++) { + int updaters = 0; + int cpu; + + for_each_cpu(cpu, &updating) { + if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu))) + updaters++; + else + cpumask_clear_cpu(cpu, &updating); + } + if (!updaters) + return true; } - *locked = true; + + memcg_moving_page = NULL; + spin_unlock(&memcg_moving_lock); + return false; } -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags) +static void unlock_memcg_move(void) { - struct page_cgroup *pc = lookup_page_cgroup(page); + memcg_moving_page = NULL; + spin_unlock(&memcg_moving_lock); +} - /* - * It's guaranteed that pc->mem_cgroup never changes while - * lock is held because a routine modifies pc->mem_cgroup - * should take move_lock_mem_cgroup(). - */ - move_unlock_mem_cgroup(pc->mem_cgroup, flags); +/* + * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page) + * (with preemption disabled) to indicate to the next mover that this cpu is + * updating a page, or to wait on the mover if it's already moving this page. + */ +void __mem_cgroup_begin_update_page_stat(struct page *page) +{ + static const int probing = 0x10000; + int updating; + + __this_cpu_add(updating_page_stat, probing); + + for (;;) { + /* + * Make sure that trylock_memcg_move(page) can see our + * updating_page_stat before we check memcg_moving_page. + * + * We use the special probing value at first so move sees it, + * but nesting and interrupts on this cpu can distinguish it. + */ + smp_mb(); + + if (likely(page != ACCESS_ONCE(memcg_moving_page))) + break; + + /* + * We may be nested, we may be serving an interrupt: do not + * hang here if the outer level already went beyond probing. 
+ */ + updating = __this_cpu_read(updating_page_stat); + if (updating & (probing - 1)) + break; + + __this_cpu_write(updating_page_stat, 0); + while (page == ACCESS_ONCE(memcg_moving_page)) + cpu_relax(); + __this_cpu_write(updating_page_stat, updating); + } + + /* Add one to count and remove temporary probing value */ + __this_cpu_sub(updating_page_stat, probing - 1); +} + +void __mem_cgroup_end_update_page_stat(void) +{ + __this_cpu_dec(updating_page_stat); +} + +/* + * Static inline interfaces to the above in include/linux/memcontrol.h: + * +static inline void mem_cgroup_begin_update_page_stat(struct page *page, + bool *clamped) +{ + preempt_disable(); + *clamped = false; + if (unlikely(atomic_read(&memcg_moving))) { + __mem_cgroup_begin_update_page_stat(page); + *clamped = true; + } } +static inline void mem_cgroup_end_update_page_stat(struct page *page, + bool *clamped) +{ + if (unlikely(*clamped)) + __mem_cgroup_end_update_page_stat(); + preempt_enable(); +} + */ + void mem_cgroup_update_page_stat(struct page *page, enum mem_cgroup_page_stat_item idx, int val) { struct mem_cgroup *memcg; struct page_cgroup *pc = lookup_page_cgroup(page); - unsigned long uninitialized_var(flags); if (mem_cgroup_disabled()) return; @@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct BUG(); } - this_cpu_add(memcg->stat->count[idx], val); + /* mem_cgroup_begin_update_page_stat() disabled preemption */ + __this_cpu_add(memcg->stat->count[idx], val); } /* @@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc struct mem_cgroup *from, struct mem_cgroup *to) { - unsigned long flags; int ret; bool anon = PageAnon(page); @@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc if (!PageCgroupUsed(pc) || pc->mem_cgroup != from) goto unlock; - move_lock_mem_cgroup(from, &flags); + ret = -EAGAIN; + if (!trylock_memcg_move(page)) + goto unlock; if (!anon && page_mapped(page)) { /* Update mapped_file data for mem_cgroup */ - preempt_disable(); __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); - preempt_enable(); } mem_cgroup_charge_statistics(from, anon, -nr_pages); /* caller should have done css_get */ pc->mem_cgroup = to; mem_cgroup_charge_statistics(to, anon, nr_pages); - move_unlock_mem_cgroup(from, &flags); + unlock_memcg_move(); ret = 0; unlock: unlock_page_cgroup(pc); @@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct */ if (!parent) parent = root_mem_cgroup; - +retry: if (nr_pages > 1) { VM_BUG_ON(!PageTransHuge(page)); flags = compound_lock_irqsave(page); } - ret = mem_cgroup_move_account(page, nr_pages, - pc, child, parent); - if (!ret) - __mem_cgroup_cancel_local_charge(child, nr_pages); + ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent); if (nr_pages > 1) compound_unlock_irqrestore(page, flags); + + if (ret == -EAGAIN) { + cond_resched(); + goto retry; + } + + if (!ret) + __mem_cgroup_cancel_local_charge(child, nr_pages); + putback_lru_page(page); put: put_page(page); @@ -4685,7 +4758,7 @@ static void mem_cgroup_reparent_charges( /* This is for making all *used* pages to be on LRU. 
*/ lru_add_drain_all(); drain_all_stock_sync(memcg); - mem_cgroup_start_move(memcg); + mem_cgroup_begin_move(); for_each_node_state(node, N_MEMORY) { for (zid = 0; zid < MAX_NR_ZONES; zid++) { enum lru_list lru; @@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges( } } } - mem_cgroup_end_move(memcg); + mem_cgroup_end_move(); memcg_oom_recover(memcg); cond_resched(); @@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont atomic_set(&memcg->refcnt, 1); memcg->move_charge_at_immigrate = 0; mutex_init(&memcg->thresholds_lock); - spin_lock_init(&memcg->move_lock); error = memcg_init_kmem(memcg, &mem_cgroup_subsys); if (error) { @@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void) mc.from = NULL; mc.to = NULL; spin_unlock(&mc.lock); - mem_cgroup_end_move(from); + if (from) + mem_cgroup_end_move(); } static int mem_cgroup_can_attach(struct cgroup *cgroup, @@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct VM_BUG_ON(mc.precharge); VM_BUG_ON(mc.moved_charge); VM_BUG_ON(mc.moved_swap); - mem_cgroup_start_move(from); + mem_cgroup_begin_move(); spin_lock(&mc.lock); mc.from = from; mc.to = memcg; @@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra unsigned long addr, unsigned long end, struct mm_walk *walk) { - int ret = 0; + int ret; struct vm_area_struct *vma = walk->private; pte_t *pte; spinlock_t *ptl; @@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra * to be unlocked in __split_huge_page_splitting(), where the main * part of thp split is not executed yet. */ +retry: + ret = 0; if (pmd_trans_huge_lock(pmd, vma) == 1) { if (mc.precharge < HPAGE_PMD_NR) { spin_unlock(&vma->vm_mm->page_table_lock); @@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra page = target.page; if (!isolate_lru_page(page)) { pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, - pc, mc.from, mc.to)) { + ret = mem_cgroup_move_account(page, + HPAGE_PMD_NR, pc, mc.from, mc.to); + if (!ret) { mc.precharge -= HPAGE_PMD_NR; mc.moved_charge += HPAGE_PMD_NR; } @@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra put_page(page); } spin_unlock(&vma->vm_mm->page_table_lock); + if (ret == -EAGAIN) + goto retry; return 0; } if (pmd_trans_unstable(pmd)) return 0; -retry: + pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); for (; addr != end; addr += PAGE_SIZE) { pte_t ptent = *(pte++); @@ -6632,8 +6710,9 @@ retry: if (isolate_lru_page(page)) goto put; pc = lookup_page_cgroup(page); - if (!mem_cgroup_move_account(page, 1, pc, - mc.from, mc.to)) { + ret = mem_cgroup_move_account(page, 1, pc, + mc.from, mc.to); + if (!ret) { mc.precharge--; /* we uncharge from mc.from later. */ mc.moved_charge++; @@ -6653,11 +6732,15 @@ put: /* get_mctgt_type() gets the page default: break; } + if (ret == -EAGAIN) + break; } pte_unmap_unlock(pte - 1, ptl); cond_resched(); if (addr != end) { + if (ret == -EAGAIN) + goto retry; /* * We have consumed all precharges we got in can_attach(). 
* We try charge one by one, but don't do any additional --- 3.8-rc2/mm/rmap.c 2012-12-22 09:43:27.656015582 -0800 +++ linux/mm/rmap.c 2013-01-02 15:03:46.100417650 -0800 @@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page */ void page_add_file_rmap(struct page *page) { - bool locked; - unsigned long flags; + bool clamped; - mem_cgroup_begin_update_page_stat(page, &locked, &flags); + mem_cgroup_begin_update_page_stat(page, &clamped); if (atomic_inc_and_test(&page->_mapcount)) { __inc_zone_page_state(page, NR_FILE_MAPPED); mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED); } - mem_cgroup_end_update_page_stat(page, &locked, &flags); + mem_cgroup_end_update_page_stat(page, &clamped); } /** @@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page) { struct address_space *mapping = page_mapping(page); bool anon = PageAnon(page); - bool locked; - unsigned long flags; + bool uninitialized_var(clamped); /* * The anon case has no mem_cgroup page_stat to update; but may - * uncharge_page() below, where the lock ordering can deadlock if - * we hold the lock against page_stat move: so avoid it on anon. + * uncharge_page() below, when holding page_cgroup lock might force + * a page_stat move to back off temporarily: so avoid it on anon. */ if (!anon) - mem_cgroup_begin_update_page_stat(page, &locked, &flags); + mem_cgroup_begin_update_page_stat(page, &clamped); /* page still mapped by someone else? */ if (!atomic_add_negative(-1, &page->_mapcount)) @@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page) } else { __dec_zone_page_state(page, NR_FILE_MAPPED); mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED); - mem_cgroup_end_update_page_stat(page, &locked, &flags); + mem_cgroup_end_update_page_stat(page, &clamped); } if (unlikely(PageMlocked(page))) clear_page_mlock(page); @@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page) return; out: if (!anon) - mem_cgroup_end_update_page_stat(page, &locked, &flags); + mem_cgroup_end_update_page_stat(page, &clamped); } /* -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
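The handshake described in the "These functions mediate..." comment block of the patch above can be modelled as a small userspace program. The sketch below is only an illustration under heavy simplification, not kernel code: threads stand in for CPUs, default (seq_cst) C11 atomics stand in for the smp_mb() pairing, a single mover thread stands in for the memcg_moving_lock serialization, and the memcg_moving / preempt_disable fast path is left out. Every identifier in it (begin_update, trylock_move, nr_dirty, ...) is invented for the sketch.

/* percpu_move_model.c - build: gcc -O2 -pthread percpu_move_model.c */
#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

#define NUPDATERS 4                        /* threads stand in for CPUs */

static _Atomic int updating[NUPDATERS];    /* role of per-cpu updating_page_stat */
static _Atomic(void *) moving_page;        /* role of memcg_moving_page */

static char the_page;                      /* identity of the one page being moved */
static _Atomic int owner;                  /* which of two fake memcgs owns the page */
static _Atomic bool dirty;                 /* role of PageDirty */
static _Atomic long nr_dirty[2];           /* per-"memcg" dirty counter */

static void begin_update(int self, void *pg)
{
	for (;;) {
		atomic_store(&updating[self], 1);    /* advertise first... */
		if (atomic_load(&moving_page) != pg) /* ...then look (seq_cst ~ smp_mb pairing) */
			return;
		atomic_store(&updating[self], 0);    /* back off while the mover works */
		while (atomic_load(&moving_page) == pg)
			sched_yield();
	}
}

static void end_update(int self) { atomic_store(&updating[self], 0); }

static bool trylock_move(void *pg)
{
	bool waitfor[NUPDATERS];

	atomic_store(&moving_page, pg);              /* publish the page first... */
	for (int t = 0; t < NUPDATERS; t++)          /* ...then sample active updaters */
		waitfor[t] = atomic_load(&updating[t]) != 0;
	for (int spin = 0; spin < 4096; spin++) {
		int busy = 0;
		for (int t = 0; t < NUPDATERS; t++) {
			if (waitfor[t] && atomic_load(&updating[t]))
				busy++;
			else
				waitfor[t] = false;  /* passed idle once, done with it */
		}
		if (!busy)                           /* later updaters back off by themselves */
			return true;
		sched_yield();
	}
	atomic_store(&moving_page, NULL);            /* give up; caller retries (-EAGAIN) */
	return false;
}

static void unlock_move(void) { atomic_store(&moving_page, NULL); }

static void *updater(void *arg)
{
	int self = (int)(long)arg;
	for (int i = 0; i < 200000; i++) {
		begin_update(self, &the_page);
		int memcg = atomic_load(&owner);     /* stable until end_update() */
		if (!atomic_exchange(&dirty, true))
			atomic_fetch_add(&nr_dirty[memcg], 1);   /* "TestSetPageDirty" */
		else if (atomic_exchange(&dirty, false))
			atomic_fetch_sub(&nr_dirty[memcg], 1);   /* "TestClearPageDirty" */
		end_update(self);
	}
	return NULL;
}

static void *mover(void *arg)
{
	for (int i = 0; i < 20000; i++) {
		while (!trylock_move(&the_page))
			sched_yield();
		int from = atomic_load(&owner), to = !from;
		if (atomic_load(&dirty)) {           /* move the stat with the page */
			atomic_fetch_sub(&nr_dirty[from], 1);
			atomic_fetch_add(&nr_dirty[to], 1);
		}
		atomic_store(&owner, to);
		unlock_move();
	}
	return arg;
}

int main(void)
{
	pthread_t u[NUPDATERS], m;
	for (long t = 0; t < NUPDATERS; t++)
		pthread_create(&u[t], NULL, updater, (void *)t);
	pthread_create(&m, NULL, mover, NULL);
	for (int t = 0; t < NUPDATERS; t++)
		pthread_join(u[t], NULL);
	pthread_join(m, NULL);
	printf("dirty=%d nr_dirty={%ld,%ld}: total should equal dirty\n",
	       (int)atomic_load(&dirty), atomic_load(&nr_dirty[0]),
	       atomic_load(&nr_dirty[1]));
	return 0;
}

The point to notice is in trylock_move(): the mover waits only for the updaters that were active when it published its page, and only until each of them has been seen idle once; it never waits for a globally quiet moment, which is what keeps the updater side cheap.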
[parent not found: <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org>]
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org> @ 2013-01-07 7:49 ` Kamezawa Hiroyuki 2013-01-09 5:15 ` Hugh Dickins 2013-01-09 14:35 ` Sha Zhengju 1 sibling, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-01-07 7:49 UTC (permalink / raw) To: Hugh Dickins Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju (2013/01/07 5:02), Hugh Dickins wrote: > On Sat, 5 Jan 2013, Sha Zhengju wrote: >> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote: >>> >>> Maybe I have missed some other locking which would prevent this from >>> happening but the locking relations are really complicated in this area >>> so if mem_cgroup_{begin,end}_update_page_stat might be called >>> recursively then we need a fat comment which justifies that. >>> >> >> Ohhh...good catching! I didn't notice there is a recursive call of >> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). >> The mem_cgroup_{begin,end}_update_page_stat() design has depressed >> me a lot recently as the lock granularity is a little bigger than I thought. >> Not only the resource but also some code logic is in the range of locking >> which may be deadlock prone. The problem still exists if we are trying to >> add stat account of other memcg page later, may I make bold to suggest >> that we dig into the lock again... > > Forgive me, I must confess I'm no more than skimming this thread, > and don't like dumping unsigned-off patches on people; but thought > that on balance it might be more helpful than not if I offer you a > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). > > I too was getting depressed by the constraints imposed by > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san > did to minimize them), and wanted to replace by something freer, more > RCU-like. In the end it seemed more effort than it was worth to go > as far as I wanted, but I do think that this is some improvement over > what we currently have, and should deal with your recursion issue. > In what case does this improve performance ? > But if this does appear useful to memcg people, then we really ought > to get it checked over by locking/barrier experts before going further. > I think myself that I've over-barriered it, and could use a little > lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come > to mind) will see more clearly, and may just hate the whole thing, > as yet another peculiar lockdep-avoiding hand-crafted locking scheme. > I've not wanted to waste their time on reviewing it, if it's not even > going to be useful to memcg people. > > It may be easier to understand if you just apply the patch and look > at the result in mm/memcontrol.c, where I tried to gather the pieces > together in one place and describe them ("These functions mediate..."). > > Hugh > Hi, this patch seems interesting but...doesn't this make move_account() very slow if the number of cpus increases because of scanning all cpus per a page ? And this looks like reader-can-block-writer percpu rwlock..it's too heavy to writers if there are many readers. 
Thanks, -Kame ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2013-01-07 7:49 ` Kamezawa Hiroyuki @ 2013-01-09 5:15 ` Hugh Dickins [not found] ` <alpine.LNX.2.00.1301082030100.5319-fupSdm12i1nKWymIFiNcPA@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Hugh Dickins @ 2013-01-09 5:15 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote: > (2013/01/07 5:02), Hugh Dickins wrote: > > > > Forgive me, I must confess I'm no more than skimming this thread, > > and don't like dumping unsigned-off patches on people; but thought > > that on balance it might be more helpful than not if I offer you a > > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). > > > > I too was getting depressed by the constraints imposed by > > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san > > did to minimize them), and wanted to replace by something freer, more > > RCU-like. In the end it seemed more effort than it was worth to go > > as far as I wanted, but I do think that this is some improvement over > > what we currently have, and should deal with your recursion issue. > > > In what case does this improve performance ? Perhaps none. I was aiming to not degrade performance at the stats update end, and make it more flexible, so new stats can be updated which would be problematic today (for lock ordering and recursion reasons). I've not done any performance measurement on it, and don't have enough cpus for an interesting report; but if someone thinks it might solve a problem for them, and has plenty of cpus to test with, please go ahead, we'd be glad to hear the results. > Hi, this patch seems interesting but...doesn't this make move_account() very > slow if the number of cpus increases because of scanning all cpus per a page > ? > And this looks like reader-can-block-writer percpu rwlock..it's too heavy to > writers if there are many readers. I was happy to make the relatively rare move_account end considerably heavier. I'll be disappointed if it turns out to be prohibitively heavy at that end - if we're going to make move_account impossible, there are much easier ways to achieve that! - but it is a possibility. Something you might have missed when considering many readers (stats updaters): the move_account end does not wait for a moment when there are no readers, that would indeed be a losing strategy; it just waits for each cpu that's updating page stats to leave that section, so every cpu is sure to notice and hold off if it then tries to update the page which is to be moved. (I may not be explaining that very well!) Hugh -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <alpine.LNX.2.00.1301082030100.5319-fupSdm12i1nKWymIFiNcPA@public.gmane.org>]
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <alpine.LNX.2.00.1301082030100.5319-fupSdm12i1nKWymIFiNcPA@public.gmane.org> @ 2013-01-09 7:24 ` Kamezawa Hiroyuki 0 siblings, 0 replies; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-01-09 7:24 UTC (permalink / raw) To: Hugh Dickins Cc: Sha Zhengju, Michal Hocko, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju (2013/01/09 14:15), Hugh Dickins wrote: > On Mon, 7 Jan 2013, Kamezawa Hiroyuki wrote: >> (2013/01/07 5:02), Hugh Dickins wrote: >>> >>> Forgive me, I must confess I'm no more than skimming this thread, >>> and don't like dumping unsigned-off patches on people; but thought >>> that on balance it might be more helpful than not if I offer you a >>> patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). >>> >>> I too was getting depressed by the constraints imposed by >>> mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san >>> did to minimize them), and wanted to replace by something freer, more >>> RCU-like. In the end it seemed more effort than it was worth to go >>> as far as I wanted, but I do think that this is some improvement over >>> what we currently have, and should deal with your recursion issue. >>> >> In what case does this improve performance ? > > Perhaps none. I was aiming to not degrade performance at the stats > update end, and make it more flexible, so new stats can be updated which > would be problematic today (for lock ordering and recursion reasons). > > I've not done any performance measurement on it, and don't have enough > cpus for an interesting report; but if someone thinks it might solve a > problem for them, and has plenty of cpus to test with, please go ahead, > we'd be glad to hear the results. > >> Hi, this patch seems interesting but...doesn't this make move_account() very >> slow if the number of cpus increases because of scanning all cpus per a page >> ? >> And this looks like reader-can-block-writer percpu rwlock..it's too heavy to >> writers if there are many readers. > > I was happy to make the relatively rare move_account end considerably > heavier. I'll be disappointed if it turns out to be prohibitively > heavy at that end - if we're going to make move_account impossible, > there are much easier ways to achieve that! - but it is a possibility. > move_account at task-move has been required feature for NEC and Nishimura-san did good job. I'd like to keep that available as much as possible. > Something you might have missed when considering many readers (stats > updaters): the move_account end does not wait for a moment when there > are no readers, that would indeed be a losing strategy; it just waits > for each cpu that's updating page stats to leave that section, so every > cpu is sure to notice and hold off if it then tries to update the page > which is to be moved. (I may not be explaining that very well!) > Hmm, yeah, maybe I miss somehing. BTW, if nesting, mem_cgroup_end_update_page_stat() seems to make counter minus. Thanks, -Kame ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <alpine.LNX.2.00.1301061135400.29149-fupSdm12i1nKWymIFiNcPA@public.gmane.org> 2013-01-07 7:49 ` Kamezawa Hiroyuki @ 2013-01-09 14:35 ` Sha Zhengju [not found] ` <CAFj3OHVUx0bZyEGQU_CApVbgz7SrX3BQ+0U5fRV=En800wv+cQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Sha Zhengju @ 2013-01-09 14:35 UTC (permalink / raw) To: Hugh Dickins Cc: Michal Hocko, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju Hi Hugh, On Mon, Jan 7, 2013 at 4:02 AM, Hugh Dickins <hughd-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > On Sat, 5 Jan 2013, Sha Zhengju wrote: >> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote: >> > >> > Maybe I have missed some other locking which would prevent this from >> > happening but the locking relations are really complicated in this area >> > so if mem_cgroup_{begin,end}_update_page_stat might be called >> > recursively then we need a fat comment which justifies that. >> > >> >> Ohhh...good catching! I didn't notice there is a recursive call of >> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). >> The mem_cgroup_{begin,end}_update_page_stat() design has depressed >> me a lot recently as the lock granularity is a little bigger than I thought. >> Not only the resource but also some code logic is in the range of locking >> which may be deadlock prone. The problem still exists if we are trying to >> add stat account of other memcg page later, may I make bold to suggest >> that we dig into the lock again... > > Forgive me, I must confess I'm no more than skimming this thread, > and don't like dumping unsigned-off patches on people; but thought > that on balance it might be more helpful than not if I offer you a > patch I worked on around 3.6-rc2 (but have updated to 3.8-rc2 below). Thanks for your interest in this matter! I really appreciate your work! > I too was getting depressed by the constraints imposed by > mem_cgroup_{begin,end}_update_page_stat (good job though Kamezawa-san > did to minimize them), and wanted to replace by something freer, more > RCU-like. In the end it seemed more effort than it was worth to go > as far as I wanted, but I do think that this is some improvement over > what we currently have, and should deal with your recursion issue. It takes me some time to understand the patch. yeah, it can solve my recursion issue and also reduce some locks(e.g. move_lock). But it did have some side effect on move end as it will become slower. To my knowledge, each task is forked in root memcg, and there's a moving while attaching it to a cgroup. So move_account is also a frequent behavior to some extent. Some comments are below. > But if this does appear useful to memcg people, then we really ought > to get it checked over by locking/barrier experts before going further. > I think myself that I've over-barriered it, and could use a little > lighter; but they (Paul McKenney, Peter Zijlstra, Oleg Nesterov come > to mind) will see more clearly, and may just hate the whole thing, > as yet another peculiar lockdep-avoiding hand-crafted locking scheme. 
> I've not wanted to waste their time on reviewing it, if it's not even > going to be useful to memcg people. > > It may be easier to understand if you just apply the patch and look > at the result in mm/memcontrol.c, where I tried to gather the pieces > together in one place and describe them ("These functions mediate..."). > > Hugh > > include/linux/memcontrol.h | 39 +-- > mm/memcontrol.c | 375 +++++++++++++++++++++-------------- > mm/rmap.c | 20 - > 3 files changed, 257 insertions(+), 177 deletions(-) > > --- 3.8-rc2/include/linux/memcontrol.h 2012-12-22 09:43:27.172015571 -0800 > +++ linux/include/linux/memcontrol.h 2013-01-02 14:47:47.960394878 -0800 > @@ -136,32 +136,28 @@ static inline bool mem_cgroup_disabled(v > return false; > } > > -void __mem_cgroup_begin_update_page_stat(struct page *page, bool *locked, > - unsigned long *flags); > - > +void __mem_cgroup_begin_update_page_stat(struct page *page); > +void __mem_cgroup_end_update_page_stat(void); > extern atomic_t memcg_moving; > > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > + bool *clamped) > { > - if (mem_cgroup_disabled()) > - return; > - rcu_read_lock(); > - *locked = false; > - if (atomic_read(&memcg_moving)) > - __mem_cgroup_begin_update_page_stat(page, locked, flags); > + preempt_disable(); Referring to synchronize_rcu in mem_cgroup_begin_move(), here rcu_read_lock() lost? > + *clamped = false; > + if (unlikely(atomic_read(&memcg_moving))) { > + __mem_cgroup_begin_update_page_stat(page); > + *clamped = true; > + } > } > > -void __mem_cgroup_end_update_page_stat(struct page *page, > - unsigned long *flags); > static inline void mem_cgroup_end_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > + bool *clamped) > { > - if (mem_cgroup_disabled()) > - return; > - if (*locked) > - __mem_cgroup_end_update_page_stat(page, flags); > - rcu_read_unlock(); > + /* We don't currently use the page arg, but keep it for symmetry */ > + if (unlikely(*clamped)) > + __mem_cgroup_end_update_page_stat(); Ditto. Need rcu_read_unlock()? > + preempt_enable(); > } > > void mem_cgroup_update_page_stat(struct page *page, > @@ -345,13 +341,16 @@ mem_cgroup_print_oom_info(struct mem_cgr > } > > static inline void mem_cgroup_begin_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > + bool *clamped) > { > + /* It may be helpful to our callers if the stub behaves the same way */ > + preempt_disable(); > } > > static inline void mem_cgroup_end_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > + bool *clamped) > { > + preempt_enable(); > } > > static inline void mem_cgroup_inc_page_stat(struct page *page, > --- 3.8-rc2/mm/memcontrol.c 2012-12-22 09:43:27.628015582 -0800 > +++ linux/mm/memcontrol.c 2013-01-02 14:55:36.268406008 -0800 > @@ -321,12 +321,7 @@ struct mem_cgroup { > * mem_cgroup ? And what type of charges should we move ? > */ > unsigned long move_charge_at_immigrate; > - /* > - * set > 0 if pages under this cgroup are moving to other cgroup. > - */ > - atomic_t moving_account; > - /* taken only while moving_account > 0 */ > - spinlock_t move_lock; > + > /* > * percpu counter. > */ > @@ -1414,60 +1409,10 @@ int mem_cgroup_swappiness(struct mem_cgr > } > > /* > - * memcg->moving_account is used for checking possibility that some thread is > - * calling move_account(). 
When a thread on CPU-A starts moving pages under > - * a memcg, other threads should check memcg->moving_account under > - * rcu_read_lock(), like this: > - * > - * CPU-A CPU-B > - * rcu_read_lock() > - * memcg->moving_account+1 if (memcg->mocing_account) > - * take heavy locks. > - * synchronize_rcu() update something. > - * rcu_read_unlock() > - * start move here. > - */ > - > -/* for quick checking without looking up memcg */ > -atomic_t memcg_moving __read_mostly; > - > -static void mem_cgroup_start_move(struct mem_cgroup *memcg) > -{ > - atomic_inc(&memcg_moving); > - atomic_inc(&memcg->moving_account); > - synchronize_rcu(); > -} > - > -static void mem_cgroup_end_move(struct mem_cgroup *memcg) > -{ > - /* > - * Now, mem_cgroup_clear_mc() may call this function with NULL. > - * We check NULL in callee rather than caller. > - */ > - if (memcg) { > - atomic_dec(&memcg_moving); > - atomic_dec(&memcg->moving_account); > - } > -} > - > -/* > - * 2 routines for checking "mem" is under move_account() or not. > - * > - * mem_cgroup_stolen() - checking whether a cgroup is mc.from or not. This > - * is used for avoiding races in accounting. If true, > - * pc->mem_cgroup may be overwritten. > - * > * mem_cgroup_under_move() - checking a cgroup is mc.from or mc.to or > * under hierarchy of moving cgroups. This is for > - * waiting at hith-memory prressure caused by "move". > + * waiting at high memory pressure caused by "move". > */ > - > -static bool mem_cgroup_stolen(struct mem_cgroup *memcg) > -{ > - VM_BUG_ON(!rcu_read_lock_held()); > - return atomic_read(&memcg->moving_account) > 0; > -} > - > static bool mem_cgroup_under_move(struct mem_cgroup *memcg) > { > struct mem_cgroup *from; > @@ -1506,24 +1451,6 @@ static bool mem_cgroup_wait_acct_move(st > return false; > } > > -/* > - * Take this lock when > - * - a code tries to modify page's memcg while it's USED. > - * - a code tries to modify page state accounting in a memcg. > - * see mem_cgroup_stolen(), too. > - */ > -static void move_lock_mem_cgroup(struct mem_cgroup *memcg, > - unsigned long *flags) > -{ > - spin_lock_irqsave(&memcg->move_lock, *flags); > -} > - > -static void move_unlock_mem_cgroup(struct mem_cgroup *memcg, > - unsigned long *flags) > -{ > - spin_unlock_irqrestore(&memcg->move_lock, *flags); > -} > - > /** > * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode. > * @memcg: The memory cgroup that went over limit > @@ -2096,75 +2023,215 @@ static bool mem_cgroup_handle_oom(struct > } > > /* > - * Currently used to update mapped file statistics, but the routine can be > - * generalized to update other statistics as well. > - * > - * Notes: Race condition > - * > - * We usually use page_cgroup_lock() for accessing page_cgroup member but > - * it tends to be costly. But considering some conditions, we doesn't need > - * to do so _always_. > - * > - * Considering "charge", lock_page_cgroup() is not required because all > - * file-stat operations happen after a page is attached to radix-tree. There > - * are no race with "charge". > - * > - * Considering "uncharge", we know that memcg doesn't clear pc->mem_cgroup > - * at "uncharge" intentionally. So, we always see valid pc->mem_cgroup even > - * if there are race with "uncharge". Statistics itself is properly handled > - * by flags. > + * These functions mediate between the common case of updating memcg stats > + * when a page transitions from one state to another, and the rare case of > + * moving a page from one memcg to another. 
> + * > + * A simple example of the updater would be: > + * mem_cgroup_begin_update_page_stat(page); > + * if (TestClearPageFlag(page)) > + * mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES); > + * mem_cgroup_end_update_page_stat(page); > + * > + * An over-simplified example of the mover would be: > + * mem_cgroup_begin_move(); > + * for each page chosen from old_memcg { > + * pc = lookup_page_cgroup(page); > + * lock_page_cgroup(pc); > + * if (trylock_memcg_move(page)) { > + * if (PageFlag(page)) { > + * mem_cgroup_dec_page_stat(page, NR_FLAG_PAGES); > + * pc->mem_cgroup = new_memcg; > + * mem_cgroup_inc_page_stat(page, NR_FLAG_PAGES); > + * } > + * unlock_memcg_move(); > + * unlock_page_cgroup(pc); > + * } > + * cond_resched(); > + * } > + * mem_cgroup_end_move(); > + * > + * Without some kind of serialization between updater and mover, the mover > + * cannot know whether or not to move one count from old to new memcg stats; > + * but the serialization must be as lightweight as possible for the updater. > + * > + * At present we use two layers of lock avoidance, then spinlock on memcg; > + * but that already got into (easily avoided) lock hierarchy violation with > + * the page_cgroup lock; and as dirty writeback stats are added, it gets > + * into further difficulty with the page cache radix tree lock (and on s390 > + * architecture, page_remove_rmap calls set_page_dirty within its critical > + * section: perhaps that can be reordered, but if not, it requires nesting). > + * > + * We need a mechanism more like rcu_read_lock() for the updater, who then > + * does not have to worry about lock ordering. The scheme below is not quite > + * as light as that: rarely, the updater does have to spin waiting on a mover; > + * and it is still best for updater to avoid taking page_cgroup lock in its > + * critical section (though mover drops and retries if necessary, so there is > + * no actual deadlock). Testing on 4-way suggests 5% heavier for the mover. > + */ > + > +/* > + * memcg_moving count is written in advance by movers, > + * and read by updaters to see if they need to worry further. > + */ > +atomic_t memcg_moving __read_mostly; > + > +/* > + * Keep it simple: allow only one page to move at a time. cgroup_mutex > + * already serializes move_charge_at_immigrate movements, but not writes > + * to memory.force_empty, nor move-pages-to-parent phase of cgroup rmdir. > * > - * Considering "move", this is an only case we see a race. To make the race > - * small, we check mm->moving_account and detect there are possibility of race > - * If there is, we take a lock. > + * memcg_moving_lock guards writes by movers to memcg_moving_page, > + * which is read by updaters to see if they need to worry about their page. > + */ > +static DEFINE_SPINLOCK(memcg_moving_lock); > +static struct page *memcg_moving_page; > + > +/* > + * updating_page_stat is written per-cpu by updaters, > + * and all cpus read by mover to check when safe to proceed with the move. > */ > +static DEFINE_PER_CPU(int, updating_page_stat) = 0; > > -void __mem_cgroup_begin_update_page_stat(struct page *page, > - bool *locked, unsigned long *flags) > +/* > + * Mover calls mem_cgroup_begin_move() before starting on its pages; its > + * synchronize_rcu() ensures that all updaters will see memcg_moving in time. 
> + */ > +static void mem_cgroup_begin_move(void) > { > - struct mem_cgroup *memcg; > - struct page_cgroup *pc; > + get_online_cpus(); > + atomic_inc(&memcg_moving); > + synchronize_rcu(); > +} > + > +static void mem_cgroup_end_move(void) > +{ > + atomic_dec(&memcg_moving); > + put_online_cpus(); > +} > + > +/* > + * Mover calls trylock_memcg_move(page) before moving stats and changing > + * ownership of page. If it fails, mover should drop page_cgroup lock and > + * any other spinlocks held, cond_resched then try the page again. This > + * lets updaters take those locks if unavoidable, though preferably not. > + */ > +static bool trylock_memcg_move(struct page *page) > +{ > + static struct cpumask updating; > + int try; > + > + cpumask_copy(&updating, cpu_online_mask); > + spin_lock(&memcg_moving_lock); While reaching here, we already do lock_page_cgroup() which is a much heavier lock. But memcg_moving_lock is a only one global spinlock that each memcg doing move_account will try to get, so it may spin many memcg to waiting here while holding their page_cgroup lock. This may aggravate some performance decrease IMHO. > + memcg_moving_page = page; > > - pc = lookup_page_cgroup(page); > -again: > - memcg = pc->mem_cgroup; > - if (unlikely(!memcg || !PageCgroupUsed(pc))) > - return; > /* > - * If this memory cgroup is not under account moving, we don't > - * need to take move_lock_mem_cgroup(). Because we already hold > - * rcu_read_lock(), any calls to move_account will be delayed until > - * rcu_read_unlock() if mem_cgroup_stolen() == true. > + * Make sure that __mem_cgroup_begin_update_page_stat(page) can see > + * our memcg_moving_page before it commits to updating_page_stat. > */ > - if (!mem_cgroup_stolen(memcg)) > - return; > + smp_mb(); > > - move_lock_mem_cgroup(memcg, flags); > - if (memcg != pc->mem_cgroup || !PageCgroupUsed(pc)) { > - move_unlock_mem_cgroup(memcg, flags); > - goto again; > + for (try = 0; try < 64; try++) { > + int updaters = 0; > + int cpu; > + > + for_each_cpu(cpu, &updating) { > + if (ACCESS_ONCE(per_cpu(updating_page_stat, cpu))) > + updaters++; > + else > + cpumask_clear_cpu(cpu, &updating); > + } > + if (!updaters) > + return true; > } > - *locked = true; > + > + memcg_moving_page = NULL; Also need a smp_mb()? > + spin_unlock(&memcg_moving_lock); > + return false; > } > > -void __mem_cgroup_end_update_page_stat(struct page *page, unsigned long *flags) > +static void unlock_memcg_move(void) > { > - struct page_cgroup *pc = lookup_page_cgroup(page); > + memcg_moving_page = NULL; > + spin_unlock(&memcg_moving_lock); > +} > > - /* > - * It's guaranteed that pc->mem_cgroup never changes while > - * lock is held because a routine modifies pc->mem_cgroup > - * should take move_lock_mem_cgroup(). > - */ > - move_unlock_mem_cgroup(pc->mem_cgroup, flags); > +/* > + * If memcg_moving, updater calls __mem_cgroup_begin_update_page_stat(page) > + * (with preemption disabled) to indicate to the next mover that this cpu is > + * updating a page, or to wait on the mover if it's already moving this page. > + */ > +void __mem_cgroup_begin_update_page_stat(struct page *page) > +{ > + static const int probing = 0x10000; > + int updating; > + > + __this_cpu_add(updating_page_stat, probing); > + > + for (;;) { > + /* > + * Make sure that trylock_memcg_move(page) can see our > + * updating_page_stat before we check memcg_moving_page. > + * > + * We use the special probing value at first so move sees it, > + * but nesting and interrupts on this cpu can distinguish it. 
> + */ > + smp_mb(); > + > + if (likely(page != ACCESS_ONCE(memcg_moving_page))) > + break; > + > + /* > + * We may be nested, we may be serving an interrupt: do not > + * hang here if the outer level already went beyond probing. > + */ > + updating = __this_cpu_read(updating_page_stat); > + if (updating & (probing - 1)) > + break; > + > + __this_cpu_write(updating_page_stat, 0); > + while (page == ACCESS_ONCE(memcg_moving_page)) > + cpu_relax(); > + __this_cpu_write(updating_page_stat, updating); > + } > + > + /* Add one to count and remove temporary probing value */ > + __this_cpu_sub(updating_page_stat, probing - 1); > +} > + > +void __mem_cgroup_end_update_page_stat(void) > +{ > + __this_cpu_dec(updating_page_stat); > +} > + > +/* > + * Static inline interfaces to the above in include/linux/memcontrol.h: > + * > +static inline void mem_cgroup_begin_update_page_stat(struct page *page, > + bool *clamped) > +{ > + preempt_disable(); > + *clamped = false; > + if (unlikely(atomic_read(&memcg_moving))) { > + __mem_cgroup_begin_update_page_stat(page); > + *clamped = true; > + } > } > > +static inline void mem_cgroup_end_update_page_stat(struct page *page, > + bool *clamped) > +{ > + if (unlikely(*clamped)) > + __mem_cgroup_end_update_page_stat(); > + preempt_enable(); > +} > + */ > + > void mem_cgroup_update_page_stat(struct page *page, > enum mem_cgroup_page_stat_item idx, int val) > { > struct mem_cgroup *memcg; > struct page_cgroup *pc = lookup_page_cgroup(page); > - unsigned long uninitialized_var(flags); > > if (mem_cgroup_disabled()) > return; > @@ -2181,7 +2248,8 @@ void mem_cgroup_update_page_stat(struct > BUG(); > } > > - this_cpu_add(memcg->stat->count[idx], val); > + /* mem_cgroup_begin_update_page_stat() disabled preemption */ > + __this_cpu_add(memcg->stat->count[idx], val); > } > > /* > @@ -3580,7 +3648,6 @@ static int mem_cgroup_move_account(struc > struct mem_cgroup *from, > struct mem_cgroup *to) > { > - unsigned long flags; > int ret; > bool anon = PageAnon(page); > > @@ -3602,21 +3669,21 @@ static int mem_cgroup_move_account(struc > if (!PageCgroupUsed(pc) || pc->mem_cgroup != from) > goto unlock; > > - move_lock_mem_cgroup(from, &flags); > + ret = -EAGAIN; > + if (!trylock_memcg_move(page)) > + goto unlock; > > if (!anon && page_mapped(page)) { > /* Update mapped_file data for mem_cgroup */ > - preempt_disable(); > __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); > __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); > - preempt_enable(); > } > mem_cgroup_charge_statistics(from, anon, -nr_pages); > > /* caller should have done css_get */ > pc->mem_cgroup = to; > mem_cgroup_charge_statistics(to, anon, nr_pages); > - move_unlock_mem_cgroup(from, &flags); > + unlock_memcg_move(); > ret = 0; > unlock: > unlock_page_cgroup(pc); > @@ -3675,19 +3742,25 @@ static int mem_cgroup_move_parent(struct > */ > if (!parent) > parent = root_mem_cgroup; > - > +retry: > if (nr_pages > 1) { > VM_BUG_ON(!PageTransHuge(page)); > flags = compound_lock_irqsave(page); > } > > - ret = mem_cgroup_move_account(page, nr_pages, > - pc, child, parent); > - if (!ret) > - __mem_cgroup_cancel_local_charge(child, nr_pages); > + ret = mem_cgroup_move_account(page, nr_pages, pc, child, parent); > > if (nr_pages > 1) > compound_unlock_irqrestore(page, flags); > + > + if (ret == -EAGAIN) { > + cond_resched(); > + goto retry; > + } > + > + if (!ret) > + __mem_cgroup_cancel_local_charge(child, nr_pages); > + > putback_lru_page(page); > put: > put_page(page); > @@ -4685,7 +4758,7 @@ static 
void mem_cgroup_reparent_charges( > /* This is for making all *used* pages to be on LRU. */ > lru_add_drain_all(); > drain_all_stock_sync(memcg); > - mem_cgroup_start_move(memcg); > + mem_cgroup_begin_move(); > for_each_node_state(node, N_MEMORY) { > for (zid = 0; zid < MAX_NR_ZONES; zid++) { > enum lru_list lru; > @@ -4695,7 +4768,7 @@ static void mem_cgroup_reparent_charges( > } > } > } > - mem_cgroup_end_move(memcg); > + mem_cgroup_end_move(); > memcg_oom_recover(memcg); > cond_resched(); > > @@ -6128,7 +6201,6 @@ mem_cgroup_css_alloc(struct cgroup *cont > atomic_set(&memcg->refcnt, 1); > memcg->move_charge_at_immigrate = 0; > mutex_init(&memcg->thresholds_lock); > - spin_lock_init(&memcg->move_lock); > > error = memcg_init_kmem(memcg, &mem_cgroup_subsys); > if (error) { > @@ -6521,7 +6593,8 @@ static void mem_cgroup_clear_mc(void) > mc.from = NULL; > mc.to = NULL; > spin_unlock(&mc.lock); > - mem_cgroup_end_move(from); > + if (from) > + mem_cgroup_end_move(); > } > > static int mem_cgroup_can_attach(struct cgroup *cgroup, > @@ -6547,7 +6620,7 @@ static int mem_cgroup_can_attach(struct > VM_BUG_ON(mc.precharge); > VM_BUG_ON(mc.moved_charge); > VM_BUG_ON(mc.moved_swap); > - mem_cgroup_start_move(from); > + mem_cgroup_begin_move(); > spin_lock(&mc.lock); > mc.from = from; > mc.to = memcg; > @@ -6573,7 +6646,7 @@ static int mem_cgroup_move_charge_pte_ra > unsigned long addr, unsigned long end, > struct mm_walk *walk) > { > - int ret = 0; > + int ret; > struct vm_area_struct *vma = walk->private; > pte_t *pte; > spinlock_t *ptl; > @@ -6592,6 +6665,8 @@ static int mem_cgroup_move_charge_pte_ra > * to be unlocked in __split_huge_page_splitting(), where the main > * part of thp split is not executed yet. > */ > +retry: > + ret = 0; > if (pmd_trans_huge_lock(pmd, vma) == 1) { > if (mc.precharge < HPAGE_PMD_NR) { > spin_unlock(&vma->vm_mm->page_table_lock); > @@ -6602,8 +6677,9 @@ static int mem_cgroup_move_charge_pte_ra > page = target.page; > if (!isolate_lru_page(page)) { > pc = lookup_page_cgroup(page); > - if (!mem_cgroup_move_account(page, HPAGE_PMD_NR, > - pc, mc.from, mc.to)) { > + ret = mem_cgroup_move_account(page, > + HPAGE_PMD_NR, pc, mc.from, mc.to); > + if (!ret) { > mc.precharge -= HPAGE_PMD_NR; > mc.moved_charge += HPAGE_PMD_NR; > } > @@ -6612,12 +6688,14 @@ static int mem_cgroup_move_charge_pte_ra > put_page(page); > } > spin_unlock(&vma->vm_mm->page_table_lock); > + if (ret == -EAGAIN) > + goto retry; > return 0; > } > > if (pmd_trans_unstable(pmd)) > return 0; > -retry: > + > pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl); > for (; addr != end; addr += PAGE_SIZE) { > pte_t ptent = *(pte++); > @@ -6632,8 +6710,9 @@ retry: > if (isolate_lru_page(page)) > goto put; > pc = lookup_page_cgroup(page); > - if (!mem_cgroup_move_account(page, 1, pc, > - mc.from, mc.to)) { > + ret = mem_cgroup_move_account(page, 1, pc, > + mc.from, mc.to); > + if (!ret) { > mc.precharge--; > /* we uncharge from mc.from later. */ > mc.moved_charge++; > @@ -6653,11 +6732,15 @@ put: /* get_mctgt_type() gets the page > default: > break; > } > + if (ret == -EAGAIN) > + break; > } > pte_unmap_unlock(pte - 1, ptl); > cond_resched(); > > if (addr != end) { > + if (ret == -EAGAIN) > + goto retry; > /* > * We have consumed all precharges we got in can_attach(). 
> * We try charge one by one, but don't do any additional > --- 3.8-rc2/mm/rmap.c 2012-12-22 09:43:27.656015582 -0800 > +++ linux/mm/rmap.c 2013-01-02 15:03:46.100417650 -0800 > @@ -1107,15 +1107,14 @@ void page_add_new_anon_rmap(struct page > */ > void page_add_file_rmap(struct page *page) > { > - bool locked; > - unsigned long flags; > + bool clamped; > > - mem_cgroup_begin_update_page_stat(page, &locked, &flags); > + mem_cgroup_begin_update_page_stat(page, &clamped); > if (atomic_inc_and_test(&page->_mapcount)) { > __inc_zone_page_state(page, NR_FILE_MAPPED); > mem_cgroup_inc_page_stat(page, MEMCG_NR_FILE_MAPPED); > } > - mem_cgroup_end_update_page_stat(page, &locked, &flags); > + mem_cgroup_end_update_page_stat(page, &clamped); > } > > /** > @@ -1128,16 +1127,15 @@ void page_remove_rmap(struct page *page) > { > struct address_space *mapping = page_mapping(page); > bool anon = PageAnon(page); > - bool locked; > - unsigned long flags; > + bool uninitialized_var(clamped); > > /* > * The anon case has no mem_cgroup page_stat to update; but may > - * uncharge_page() below, where the lock ordering can deadlock if > - * we hold the lock against page_stat move: so avoid it on anon. > + * uncharge_page() below, when holding page_cgroup lock might force > + * a page_stat move to back off temporarily: so avoid it on anon. > */ > if (!anon) > - mem_cgroup_begin_update_page_stat(page, &locked, &flags); > + mem_cgroup_begin_update_page_stat(page, &clamped); > > /* page still mapped by someone else? */ > if (!atomic_add_negative(-1, &page->_mapcount)) > @@ -1182,7 +1180,7 @@ void page_remove_rmap(struct page *page) > } else { > __dec_zone_page_state(page, NR_FILE_MAPPED); > mem_cgroup_dec_page_stat(page, MEMCG_NR_FILE_MAPPED); > - mem_cgroup_end_update_page_stat(page, &locked, &flags); > + mem_cgroup_end_update_page_stat(page, &clamped); > } > if (unlikely(PageMlocked(page))) > clear_page_mlock(page); > @@ -1198,7 +1196,7 @@ void page_remove_rmap(struct page *page) > return; > out: > if (!anon) > - mem_cgroup_end_update_page_stat(page, &locked, &flags); > + mem_cgroup_end_update_page_stat(page, &clamped); > } > > /* -- Thanks, Sha ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <CAFj3OHVUx0bZyEGQU_CApVbgz7SrX3BQ+0U5fRV=En800wv+cQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <CAFj3OHVUx0bZyEGQU_CApVbgz7SrX3BQ+0U5fRV=En800wv+cQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-01-09 14:47 ` Michal Hocko 0 siblings, 0 replies; 27+ messages in thread From: Michal Hocko @ 2013-01-09 14:47 UTC (permalink / raw) To: Sha Zhengju Cc: Hugh Dickins, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju On Wed 09-01-13 22:35:12, Sha Zhengju wrote: [...] > To my knowledge, each task is forked in root memcg, and there's a > moving while attaching it to a cgroup. So move_account is also a > frequent behavior to some extent. Not really. Every fork/exec is copies the current group (see cgroup_fork) so there is no moving on that path. [...] -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 27+ messages in thread
[parent not found: <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>]
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <CAFj3OHXKyMO3gwghiBAmbowvqko-JqLtKroX2kzin1rk=q9tZg-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-01-07 7:25 ` Kamezawa Hiroyuki [not found] ` <50EA7860.6030300-+CUm20s59erQFUHtdCDX3A@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-01-07 7:25 UTC (permalink / raw) To: Sha Zhengju Cc: Michal Hocko, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju (2013/01/05 13:48), Sha Zhengju wrote: > On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote: >> On Wed 26-12-12 01:26:07, Sha Zhengju wrote: >>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >>> >>> This patch adds memcg routines to count dirty pages, which allows memory controller >>> to maintain an accurate view of the amount of its dirty memory and can provide some >>> info for users while cgroup's direct reclaim is working. >> >> I guess you meant targeted resp. (hard/soft) limit reclaim here, >> right? It is true that this is direct reclaim but it is not clear to me > > Yes, I meant memcg hard/soft reclaim here which is triggered directly > by allocation and is distinct from background kswapd reclaim (global). > >> why the usefulnes should be limitted to the reclaim for users. I would >> understand this if the users was in fact in-kernel users. >> > > One of the reasons I'm trying to accounting the dirty pages is to get a > more board overall view of memory usages because memcg hard/soft > reclaim may have effect on response time of user application. > Yeah, the beneficiary can be application administrator or kernel users. :P > >> [...] >>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >>> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() >>> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, >>> a prepare PageDirty() checking is added. >> >> But there is another AA deadlock here I believe. >> page_remove_rmap >> mem_cgroup_begin_update_page_stat <<< 1 >> set_page_dirty >> __set_page_dirty_buffers >> __set_page_dirty >> mem_cgroup_begin_update_page_stat <<< 2 >> move_lock_mem_cgroup >> spin_lock_irqsave(&memcg->move_lock, *flags); >> >> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS >> because we might race with the moving charges: >> CPU0 CPU1 >> page_remove_rmap >> mem_cgroup_can_attach >> mem_cgroup_begin_update_page_stat (1) >> rcu_read_lock >> mem_cgroup_start_move >> atomic_inc(&memcg_moving) >> atomic_inc(&memcg->moving_account) >> synchronize_rcu >> __mem_cgroup_begin_update_page_stat >> mem_cgroup_stolen <<< TRUE >> move_lock_mem_cgroup >> [...] >> mem_cgroup_begin_update_page_stat (2) >> __mem_cgroup_begin_update_page_stat >> mem_cgroup_stolen <<< still TRUE >> move_lock_mem_cgroup <<< DEADLOCK >> [...] >> mem_cgroup_end_update_page_stat >> rcu_unlock >> # wake up from synchronize_rcu >> [...] 
>> mem_cgroup_move_task >> mem_cgroup_move_charge >> walk_page_range >> mem_cgroup_move_account >> move_lock_mem_cgroup >> >> >> Maybe I have missed some other locking which would prevent this from >> happening but the locking relations are really complicated in this area >> so if mem_cgroup_{begin,end}_update_page_stat might be called >> recursively then we need a fat comment which justifies that. >> > > Ohhh...good catching! I didn't notice there is a recursive call of > mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). > The mem_cgroup_{begin,end}_update_page_stat() design has depressed > me a lot recently as the lock granularity is a little bigger than I thought. > Not only the resource but also some code logic is in the range of locking > which may be deadlock prone. The problem still exists if we are trying to > add stat account of other memcg page later, may I make bold to suggest > that we dig into the lock again... > > But with regard to the current lock implementation, I doubt if we can we can > account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just > try to get move_lock once in the beginning. IMHO we can make > mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm > thinking now is changing memcg->move_lock to rw-spinlock from the > original spinlock: > mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which make it > reenterable and memcg moving task side try to get the write spinlock. > Then the race may be following: > > CPU0 CPU1 > page_remove_rmap > mem_cgroup_can_attach > mem_cgroup_begin_update_page_stat (1) > rcu_read_lock > mem_cgroup_start_move > atomic_inc(&memcg_moving) > > atomic_inc(&memcg->moving_account) > synchronize_rcu > __mem_cgroup_begin_update_page_stat > mem_cgroup_stolen <<< TRUE > move_lock_mem_cgroup <<<< read-spinlock success > [...] > mem_cgroup_begin_update_page_stat (2) > __mem_cgroup_begin_update_page_stat > mem_cgroup_stolen <<< still TRUE > move_lock_mem_cgroup <<<< read-spinlock success > > [...] > mem_cgroup_end_update_page_stat <<< locked = true, unlock > rcu_unlock > # wake up from synchronize_rcu > [...] > mem_cgroup_move_task > mem_cgroup_move_charge > walk_page_range > mem_cgroup_move_account > > move_lock_mem_cgroup <<< write-spinlock > > > AFAICS, the deadlock seems to be avoided by both the rcu and rwlock. > Is there anything I lost? > rwlock will work with the nest but it seems ugly do updates under read-lock. How about this straightforward ? == /* * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on * the memcg again for nesting calls */ static void move_lock_mem_cgroup(memcg, flags); { current->memcg_move_lock_nested += 1; if (current->memcg_move_lock_nested > 1) { VM_BUG_ON(current->move_locked_memcg != memcg); return; } spin_lock_irqsave(&memcg_move_lock, &flags); current->move_lockdev_memcg = memcg; } static void move_unlock_mem_cgroup(memcg, flags) { current->memcg_move_lock_nested -= 1; if (!current->memcg_move_lock_nested) { current->move_locked_memcg = NULL; spin_unlock_irqrestore(&memcg_move_lock,flags); } } == But, hmm, this kind of (ugly) hack may cause trouble as Hugh said. Thanks, -Kame ^ permalink raw reply [flat|nested] 27+ messages in thread
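The nesting-counter idea above translates fairly directly into code; below is a rough userspace rendering, offered only as a sketch: _Thread_local variables stand in for the two proposed task_struct fields, a pthread spinlock stands in for the irq-safe memcg_move_lock, and the flags argument is dropped ("move_lockdev_memcg" above looks like a typo for "move_locked_memcg").

/* nested_move_lock.c - build: gcc -O2 -pthread nested_move_lock.c */
#include <assert.h>
#include <pthread.h>

struct mem_cgroup { int id; };                  /* minimal stand-in */
static struct mem_cgroup a_memcg;

static pthread_spinlock_t memcg_move_lock;      /* the single global lock */

/* stand-ins for the two fields the proposal would add to task_struct */
static _Thread_local int memcg_move_lock_nested;
static _Thread_local struct mem_cgroup *move_locked_memcg;

static void move_lock_mem_cgroup(struct mem_cgroup *memcg)
{
	if (memcg_move_lock_nested++) {         /* nested call by the same task */
		assert(move_locked_memcg == memcg);   /* VM_BUG_ON in the original */
		return;
	}
	pthread_spin_lock(&memcg_move_lock);    /* kernel version would be irqsave */
	move_locked_memcg = memcg;
}

static void move_unlock_mem_cgroup(struct mem_cgroup *memcg)
{
	assert(move_locked_memcg == memcg);
	if (--memcg_move_lock_nested)
		return;                         /* outer level still holds it */
	move_locked_memcg = NULL;
	pthread_spin_unlock(&memcg_move_lock);
}

int main(void)
{
	pthread_spin_init(&memcg_move_lock, PTHREAD_PROCESS_PRIVATE);
	move_lock_mem_cgroup(&a_memcg);
	move_lock_mem_cgroup(&a_memcg);         /* re-entry does not self-deadlock */
	move_unlock_mem_cgroup(&a_memcg);
	move_unlock_mem_cgroup(&a_memcg);       /* outermost unlock drops the lock */
	return 0;
}

One limitation the sketch makes visible: if the same task ever has to nest on two different memcgs, the assert (the VM_BUG_ON) fires instead of the lock nesting.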
[parent not found: <50EA7860.6030300-+CUm20s59erQFUHtdCDX3A@public.gmane.org>]
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <50EA7860.6030300-+CUm20s59erQFUHtdCDX3A@public.gmane.org> @ 2013-01-09 15:02 ` Sha Zhengju [not found] ` <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Sha Zhengju @ 2013-01-09 15:02 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote: > (2013/01/05 13:48), Sha Zhengju wrote: >> >> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote: >>> >>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote: >>>> >>>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >>>> >>>> This patch adds memcg routines to count dirty pages, which allows memory >>>> controller >>>> to maintain an accurate view of the amount of its dirty memory and can >>>> provide some >>>> info for users while cgroup's direct reclaim is working. >>> >>> >>> I guess you meant targeted resp. (hard/soft) limit reclaim here, >>> right? It is true that this is direct reclaim but it is not clear to me >> >> >> Yes, I meant memcg hard/soft reclaim here which is triggered directly >> by allocation and is distinct from background kswapd reclaim (global). >> >>> why the usefulnes should be limitted to the reclaim for users. I would >>> understand this if the users was in fact in-kernel users. >>> >> >> One of the reasons I'm trying to accounting the dirty pages is to get a >> more board overall view of memory usages because memcg hard/soft >> reclaim may have effect on response time of user application. >> Yeah, the beneficiary can be application administrator or kernel users. >> :P >> >>> [...] >>>> >>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in >>>> __set_page_dirty() >>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless >>>> spinlock contention, >>>> a prepare PageDirty() checking is added. >>> >>> >>> But there is another AA deadlock here I believe. >>> page_remove_rmap >>> mem_cgroup_begin_update_page_stat <<< 1 >>> set_page_dirty >>> __set_page_dirty_buffers >>> __set_page_dirty >>> mem_cgroup_begin_update_page_stat <<< 2 >>> move_lock_mem_cgroup >>> spin_lock_irqsave(&memcg->move_lock, *flags); >>> >>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS >>> because we might race with the moving charges: >>> CPU0 CPU1 >>> page_remove_rmap >>> mem_cgroup_can_attach >>> mem_cgroup_begin_update_page_stat (1) >>> rcu_read_lock >>> mem_cgroup_start_move >>> >>> atomic_inc(&memcg_moving) >>> >>> atomic_inc(&memcg->moving_account) >>> synchronize_rcu >>> __mem_cgroup_begin_update_page_stat >>> mem_cgroup_stolen <<< TRUE >>> move_lock_mem_cgroup >>> [...] 
>>> mem_cgroup_begin_update_page_stat (2) >>> __mem_cgroup_begin_update_page_stat >>> mem_cgroup_stolen <<< still TRUE >>> move_lock_mem_cgroup <<< DEADLOCK >>> [...] >>> mem_cgroup_end_update_page_stat >>> rcu_unlock >>> # wake up from >>> synchronize_rcu >>> [...] >>> mem_cgroup_move_task >>> mem_cgroup_move_charge >>> walk_page_range >>> >>> mem_cgroup_move_account >>> >>> move_lock_mem_cgroup >>> >>> >>> Maybe I have missed some other locking which would prevent this from >>> happening but the locking relations are really complicated in this area >>> so if mem_cgroup_{begin,end}_update_page_stat might be called >>> recursively then we need a fat comment which justifies that. >>> >> >> Ohhh...good catching! I didn't notice there is a recursive call of >> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). >> The mem_cgroup_{begin,end}_update_page_stat() design has depressed >> me a lot recently as the lock granularity is a little bigger than I >> thought. >> Not only the resource but also some code logic is in the range of locking >> which may be deadlock prone. The problem still exists if we are trying to >> add stat account of other memcg page later, may I make bold to suggest >> that we dig into the lock again... >> >> But with regard to the current lock implementation, I doubt if we can we >> can >> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just >> try to get move_lock once in the beginning. IMHO we can make >> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm >> thinking now is changing memcg->move_lock to rw-spinlock from the >> original spinlock: >> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which >> make it >> reenterable and memcg moving task side try to get the write spinlock. >> Then the race may be following: >> >> CPU0 CPU1 >> page_remove_rmap >> mem_cgroup_can_attach >> mem_cgroup_begin_update_page_stat (1) >> rcu_read_lock >> mem_cgroup_start_move >> >> atomic_inc(&memcg_moving) >> >> atomic_inc(&memcg->moving_account) >> synchronize_rcu >> __mem_cgroup_begin_update_page_stat >> mem_cgroup_stolen <<< TRUE >> move_lock_mem_cgroup <<<< read-spinlock success >> [...] >> mem_cgroup_begin_update_page_stat (2) >> __mem_cgroup_begin_update_page_stat >> mem_cgroup_stolen <<< still TRUE >> move_lock_mem_cgroup <<<< read-spinlock success >> >> [...] >> mem_cgroup_end_update_page_stat <<< locked = true, unlock >> rcu_unlock >> # wake up from >> synchronize_rcu >> [...] >> mem_cgroup_move_task >> mem_cgroup_move_charge >> walk_page_range >> >> mem_cgroup_move_account >> >> move_lock_mem_cgroup <<< write-spinlock >> >> >> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock. >> Is there anything I lost? >> > > rwlock will work with the nest but it seems ugly do updates under read-lock. > > How about this straightforward ? 
> == > /* > * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on > * the memcg again for nesting calls > */ > static void move_lock_mem_cgroup(memcg, flags); > { > current->memcg_move_lock_nested += 1; > if (current->memcg_move_lock_nested > 1) { > VM_BUG_ON(current->move_locked_memcg != memcg); > return; > } > spin_lock_irqsave(&memcg_move_lock, &flags); > current->move_lockdev_memcg = memcg; > } > > static void move_unlock_mem_cgroup(memcg, flags) > { > current->memcg_move_lock_nested -= 1; > if (!current->memcg_move_lock_nested) { > current->move_locked_memcg = NULL; > spin_unlock_irqrestore(&memcg_move_lock,flags); > } > } > Does we need to add two fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct task'? Is it feasible? Now I'm thinking about another synchronization proposal for memcg page stat updater and move_account, which seems to deal with recursion issue and deadlock: CPU A CPU B move_lock_mem_cgroup old_memcg = pc->mem_cgroup TestSetPageDirty(page) move_unlock_mem_cgroup move_lock_mem_cgroup if (PageDirty) old_memcg->nr_dirty -- new_memcg->nr_dirty ++ pc->mem_cgroup = new_memcgy move_unlock_mem_cgroup old_memcg->nr_dirty ++ So nr_dirty of old_memcg may be minus in a very short period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon by CPU A. And the final figures of memcg->nr_dirty is correct. Meanwhile the move_lock only protect saving old_memcg and TestSetPageDirty in its critical section and without any irrelevant logic, so the lock order or deadlock can be handled easily. But I'm not sure whether I've lost some race conditions, any comments are welcomed. : ) -- Thanks, Sha ^ permalink raw reply [flat|nested] 27+ messages in thread
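[A minimal sketch of the ordering proposed in this message, dirtying side only. It leans on the existing mem_cgroup_begin/end_update_page_stat() helpers to stand in for move_lock/move_unlock, and mem_cgroup_add_dirty() is a hypothetical wrapper around the per-cpu counter, so the code would have to live in mm/memcontrol.c where struct mem_cgroup and MEM_CGROUP_STAT_FILE_DIRTY from patch 4/8 are visible. The point is only the ordering: record the owning memcg and set PG_dirty inside the small critical section, do the counter update after dropping it; the transient negative count on the old memcg is the trade-off discussed above.]

==
/* Hypothetical helper; MEM_CGROUP_STAT_FILE_DIRTY comes from patch 4/8. */
static void mem_cgroup_add_dirty(struct mem_cgroup *memcg, int val)
{
	this_cpu_add(memcg->stat->count[MEM_CGROUP_STAT_FILE_DIRTY], val);
}

static int set_page_dirty_memcg_sketch(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	struct mem_cgroup *memcg;
	bool locked;
	unsigned long flags;
	int newly_dirty;

	/* "CPU A" above: tiny critical section, no nested accounting calls. */
	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	memcg = pc->mem_cgroup;		/* owner recorded while a move is excluded */
	newly_dirty = !TestSetPageDirty(page);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);

	if (newly_dirty)
		mem_cgroup_add_dirty(memcg, 1);	/* deferred; may briefly trail a move */

	return newly_dirty;
}
==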
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <CAFj3OHXMgRG6u2YoM7y5WuPo2ZNA1yPmKRV29FYj9B6Wj_c6Lw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-01-10 2:16 ` Kamezawa Hiroyuki 2013-01-10 4:26 ` Sha Zhengju 0 siblings, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-01-10 2:16 UTC (permalink / raw) To: Sha Zhengju Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju (2013/01/10 0:02), Sha Zhengju wrote: > On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki > <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote: >> (2013/01/05 13:48), Sha Zhengju wrote: >>> >>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org> wrote: >>>> >>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote: >>>>> >>>>> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >>>>> >>>>> This patch adds memcg routines to count dirty pages, which allows memory >>>>> controller >>>>> to maintain an accurate view of the amount of its dirty memory and can >>>>> provide some >>>>> info for users while cgroup's direct reclaim is working. >>>> >>>> >>>> I guess you meant targeted resp. (hard/soft) limit reclaim here, >>>> right? It is true that this is direct reclaim but it is not clear to me >>> >>> >>> Yes, I meant memcg hard/soft reclaim here which is triggered directly >>> by allocation and is distinct from background kswapd reclaim (global). >>> >>>> why the usefulnes should be limitted to the reclaim for users. I would >>>> understand this if the users was in fact in-kernel users. >>>> >>> >>> One of the reasons I'm trying to accounting the dirty pages is to get a >>> more board overall view of memory usages because memcg hard/soft >>> reclaim may have effect on response time of user application. >>> Yeah, the beneficiary can be application administrator or kernel users. >>> :P >>> >>>> [...] >>>>> >>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in >>>>> __set_page_dirty() >>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless >>>>> spinlock contention, >>>>> a prepare PageDirty() checking is added. >>>> >>>> >>>> But there is another AA deadlock here I believe. >>>> page_remove_rmap >>>> mem_cgroup_begin_update_page_stat <<< 1 >>>> set_page_dirty >>>> __set_page_dirty_buffers >>>> __set_page_dirty >>>> mem_cgroup_begin_update_page_stat <<< 2 >>>> move_lock_mem_cgroup >>>> spin_lock_irqsave(&memcg->move_lock, *flags); >>>> >>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS >>>> because we might race with the moving charges: >>>> CPU0 CPU1 >>>> page_remove_rmap >>>> mem_cgroup_can_attach >>>> mem_cgroup_begin_update_page_stat (1) >>>> rcu_read_lock >>>> mem_cgroup_start_move >>>> >>>> atomic_inc(&memcg_moving) >>>> >>>> atomic_inc(&memcg->moving_account) >>>> synchronize_rcu >>>> __mem_cgroup_begin_update_page_stat >>>> mem_cgroup_stolen <<< TRUE >>>> move_lock_mem_cgroup >>>> [...] 
>>>> mem_cgroup_begin_update_page_stat (2) >>>> __mem_cgroup_begin_update_page_stat >>>> mem_cgroup_stolen <<< still TRUE >>>> move_lock_mem_cgroup <<< DEADLOCK >>>> [...] >>>> mem_cgroup_end_update_page_stat >>>> rcu_unlock >>>> # wake up from >>>> synchronize_rcu >>>> [...] >>>> mem_cgroup_move_task >>>> mem_cgroup_move_charge >>>> walk_page_range >>>> >>>> mem_cgroup_move_account >>>> >>>> move_lock_mem_cgroup >>>> >>>> >>>> Maybe I have missed some other locking which would prevent this from >>>> happening but the locking relations are really complicated in this area >>>> so if mem_cgroup_{begin,end}_update_page_stat might be called >>>> recursively then we need a fat comment which justifies that. >>>> >>> >>> Ohhh...good catching! I didn't notice there is a recursive call of >>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). >>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed >>> me a lot recently as the lock granularity is a little bigger than I >>> thought. >>> Not only the resource but also some code logic is in the range of locking >>> which may be deadlock prone. The problem still exists if we are trying to >>> add stat account of other memcg page later, may I make bold to suggest >>> that we dig into the lock again... >>> >>> But with regard to the current lock implementation, I doubt if we can we >>> can >>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just >>> try to get move_lock once in the beginning. IMHO we can make >>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what I'm >>> thinking now is changing memcg->move_lock to rw-spinlock from the >>> original spinlock: >>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which >>> make it >>> reenterable and memcg moving task side try to get the write spinlock. >>> Then the race may be following: >>> >>> CPU0 CPU1 >>> page_remove_rmap >>> mem_cgroup_can_attach >>> mem_cgroup_begin_update_page_stat (1) >>> rcu_read_lock >>> mem_cgroup_start_move >>> >>> atomic_inc(&memcg_moving) >>> >>> atomic_inc(&memcg->moving_account) >>> synchronize_rcu >>> __mem_cgroup_begin_update_page_stat >>> mem_cgroup_stolen <<< TRUE >>> move_lock_mem_cgroup <<<< read-spinlock success >>> [...] >>> mem_cgroup_begin_update_page_stat (2) >>> __mem_cgroup_begin_update_page_stat >>> mem_cgroup_stolen <<< still TRUE >>> move_lock_mem_cgroup <<<< read-spinlock success >>> >>> [...] >>> mem_cgroup_end_update_page_stat <<< locked = true, unlock >>> rcu_unlock >>> # wake up from >>> synchronize_rcu >>> [...] >>> mem_cgroup_move_task >>> mem_cgroup_move_charge >>> walk_page_range >>> >>> mem_cgroup_move_account >>> >>> move_lock_mem_cgroup <<< write-spinlock >>> >>> >>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock. >>> Is there anything I lost? >>> >> >> rwlock will work with the nest but it seems ugly do updates under read-lock. >> >> How about this straightforward ? 
>> == >> /* >> * Once a thread takes memcg_move_lock() on a memcg, it can take the lock on >> * the memcg again for nesting calls >> */ >> static void move_lock_mem_cgroup(memcg, flags); >> { >> current->memcg_move_lock_nested += 1; >> if (current->memcg_move_lock_nested > 1) { >> VM_BUG_ON(current->move_locked_memcg != memcg); >> return; >> } >> spin_lock_irqsave(&memcg_move_lock, &flags); >> current->move_lockdev_memcg = memcg; >> } >> >> static void move_unlock_mem_cgroup(memcg, flags) >> { >> current->memcg_move_lock_nested -= 1; >> if (!current->memcg_move_lock_nested) { >> current->move_locked_memcg = NULL; >> spin_unlock_irqrestore(&memcg_move_lock,flags); >> } >> } >> > Does we need to add two > fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct > task'? Is it feasible? > > Now I'm thinking about another synchronization proposal for memcg page > stat updater and move_account, which seems to deal with recursion > issue and deadlock: > > CPU A CPU B > > move_lock_mem_cgroup > old_memcg = pc->mem_cgroup > TestSetPageDirty(page) > move_unlock_mem_cgroup > move_lock_mem_cgroup > if (PageDirty) > > old_memcg->nr_dirty -- > > new_memcg->nr_dirty ++ > > pc->mem_cgroup = new_memcgy > move_unlock_mem_cgroup > > old_memcg->nr_dirty ++ > I'm sorry I couldn't catch why you call TestSetPageDirty()....and what CPUA/CPUB is doing ? CPUA calls move_account() and CPUB updates stat ? If so, why move_account() is allowed to set PG_dirty ?? > > So nr_dirty of old_memcg may be minus in a very short > period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon > by CPU A. And the final figures of memcg->nr_dirty is correct. It seems both of new_memcg and old_memcg has an account for a page. Is it correct ? > Meanwhile the move_lock only protect saving old_memcg and > TestSetPageDirty in its critical section and without any irrelevant > logic, so the lock order or deadlock can be handled easily. > > But I'm not sure whether I've lost some race conditions, any comments > are welcomed. : ) > Sorry I couldn't understand. Thanks, -Kame ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2013-01-10 2:16 ` Kamezawa Hiroyuki @ 2013-01-10 4:26 ` Sha Zhengju [not found] ` <CAFj3OHW=n22veXzR27qfc+10t-nETU=B78NULPXrEDT1S-KsOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Sha Zhengju @ 2013-01-10 4:26 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju On Thu, Jan 10, 2013 at 10:16 AM, Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> wrote: > (2013/01/10 0:02), Sha Zhengju wrote: >> >> On Mon, Jan 7, 2013 at 3:25 PM, Kamezawa Hiroyuki >> <kamezawa.hiroyu@jp.fujitsu.com> wrote: >>> >>> (2013/01/05 13:48), Sha Zhengju wrote: >>>> >>>> >>>> On Wed, Jan 2, 2013 at 6:44 PM, Michal Hocko <mhocko@suse.cz> wrote: >>>>> >>>>> >>>>> On Wed 26-12-12 01:26:07, Sha Zhengju wrote: >>>>>> >>>>>> >>>>>> From: Sha Zhengju <handai.szj@taobao.com> >>>>>> >>>>>> This patch adds memcg routines to count dirty pages, which allows >>>>>> memory >>>>>> controller >>>>>> to maintain an accurate view of the amount of its dirty memory and can >>>>>> provide some >>>>>> info for users while cgroup's direct reclaim is working. >>>>> >>>>> >>>>> >>>>> I guess you meant targeted resp. (hard/soft) limit reclaim here, >>>>> right? It is true that this is direct reclaim but it is not clear to me >>>> >>>> >>>> >>>> Yes, I meant memcg hard/soft reclaim here which is triggered directly >>>> by allocation and is distinct from background kswapd reclaim (global). >>>> >>>>> why the usefulnes should be limitted to the reclaim for users. I would >>>>> understand this if the users was in fact in-kernel users. >>>>> >>>> >>>> One of the reasons I'm trying to accounting the dirty pages is to get a >>>> more board overall view of memory usages because memcg hard/soft >>>> reclaim may have effect on response time of user application. >>>> Yeah, the beneficiary can be application administrator or kernel users. >>>> :P >>>> >>>>> [...] >>>>>> >>>>>> >>>>>> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >>>>>> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >>>>>> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >>>>>> So we need to make mapping->tree_lock ahead of TestSetPageDirty in >>>>>> __set_page_dirty() >>>>>> and __set_page_dirty_nobuffers(). But in order to avoiding useless >>>>>> spinlock contention, >>>>>> a prepare PageDirty() checking is added. >>>>> >>>>> >>>>> >>>>> But there is another AA deadlock here I believe. >>>>> page_remove_rmap >>>>> mem_cgroup_begin_update_page_stat <<< 1 >>>>> set_page_dirty >>>>> __set_page_dirty_buffers >>>>> __set_page_dirty >>>>> mem_cgroup_begin_update_page_stat <<< 2 >>>>> move_lock_mem_cgroup >>>>> spin_lock_irqsave(&memcg->move_lock, *flags); >>>>> >>>>> mem_cgroup_begin_update_page_stat is not recursive wrt. locking AFAICS >>>>> because we might race with the moving charges: >>>>> CPU0 CPU1 >>>>> page_remove_rmap >>>>> mem_cgroup_can_attach >>>>> mem_cgroup_begin_update_page_stat (1) >>>>> rcu_read_lock >>>>> >>>>> mem_cgroup_start_move >>>>> >>>>> atomic_inc(&memcg_moving) >>>>> >>>>> atomic_inc(&memcg->moving_account) >>>>> synchronize_rcu >>>>> __mem_cgroup_begin_update_page_stat >>>>> mem_cgroup_stolen <<< TRUE >>>>> move_lock_mem_cgroup >>>>> [...] 
>>>>> mem_cgroup_begin_update_page_stat (2) >>>>> __mem_cgroup_begin_update_page_stat >>>>> mem_cgroup_stolen <<< still TRUE >>>>> move_lock_mem_cgroup <<< DEADLOCK >>>>> [...] >>>>> mem_cgroup_end_update_page_stat >>>>> rcu_unlock >>>>> # wake up from >>>>> synchronize_rcu >>>>> [...] >>>>> mem_cgroup_move_task >>>>> >>>>> mem_cgroup_move_charge >>>>> walk_page_range >>>>> >>>>> mem_cgroup_move_account >>>>> >>>>> move_lock_mem_cgroup >>>>> >>>>> >>>>> Maybe I have missed some other locking which would prevent this from >>>>> happening but the locking relations are really complicated in this area >>>>> so if mem_cgroup_{begin,end}_update_page_stat might be called >>>>> recursively then we need a fat comment which justifies that. >>>>> >>>> >>>> Ohhh...good catching! I didn't notice there is a recursive call of >>>> mem_cgroup_{begin,end}_update_page_stat in page_remove_rmap(). >>>> The mem_cgroup_{begin,end}_update_page_stat() design has depressed >>>> me a lot recently as the lock granularity is a little bigger than I >>>> thought. >>>> Not only the resource but also some code logic is in the range of >>>> locking >>>> which may be deadlock prone. The problem still exists if we are trying >>>> to >>>> add stat account of other memcg page later, may I make bold to suggest >>>> that we dig into the lock again... >>>> >>>> But with regard to the current lock implementation, I doubt if we can we >>>> can >>>> account MEM_CGROUP_STAT_FILE_{MAPPED, DIRTY} in one breath and just >>>> try to get move_lock once in the beginning. IMHO we can make >>>> mem_cgroup_{begin,end}_update_page_stat() to recursive aware and what >>>> I'm >>>> thinking now is changing memcg->move_lock to rw-spinlock from the >>>> original spinlock: >>>> mem_cgroup_{begin,end}_update_page_stat() try to get the read lock which >>>> make it >>>> reenterable and memcg moving task side try to get the write spinlock. >>>> Then the race may be following: >>>> >>>> CPU0 CPU1 >>>> page_remove_rmap >>>> mem_cgroup_can_attach >>>> mem_cgroup_begin_update_page_stat (1) >>>> rcu_read_lock >>>> >>>> mem_cgroup_start_move >>>> >>>> atomic_inc(&memcg_moving) >>>> >>>> atomic_inc(&memcg->moving_account) >>>> synchronize_rcu >>>> __mem_cgroup_begin_update_page_stat >>>> mem_cgroup_stolen <<< TRUE >>>> move_lock_mem_cgroup <<<< read-spinlock success >>>> [...] >>>> mem_cgroup_begin_update_page_stat (2) >>>> __mem_cgroup_begin_update_page_stat >>>> mem_cgroup_stolen <<< still TRUE >>>> move_lock_mem_cgroup <<<< read-spinlock success >>>> >>>> [...] >>>> mem_cgroup_end_update_page_stat <<< locked = true, unlock >>>> rcu_unlock >>>> # wake up from >>>> synchronize_rcu >>>> [...] >>>> mem_cgroup_move_task >>>> >>>> mem_cgroup_move_charge >>>> walk_page_range >>>> >>>> mem_cgroup_move_account >>>> >>>> move_lock_mem_cgroup <<< write-spinlock >>>> >>>> >>>> AFAICS, the deadlock seems to be avoided by both the rcu and rwlock. >>>> Is there anything I lost? >>>> >>> >>> rwlock will work with the nest but it seems ugly do updates under >>> read-lock. >>> >>> How about this straightforward ? 
>>> == >>> /* >>> * Once a thread takes memcg_move_lock() on a memcg, it can take the >>> lock on >>> * the memcg again for nesting calls >>> */ >>> static void move_lock_mem_cgroup(memcg, flags); >>> { >>> current->memcg_move_lock_nested += 1; >>> if (current->memcg_move_lock_nested > 1) { >>> VM_BUG_ON(current->move_locked_memcg != memcg); >>> return; >>> } >>> spin_lock_irqsave(&memcg_move_lock, &flags); >>> current->move_lockdev_memcg = memcg; >>> } >>> >>> static void move_unlock_mem_cgroup(memcg, flags) >>> { >>> current->memcg_move_lock_nested -= 1; >>> if (!current->memcg_move_lock_nested) { >>> current->move_locked_memcg = NULL; >>> spin_unlock_irqrestore(&memcg_move_lock,flags); >>> } >>> } >>> >> Does we need to add two >> fields(current->memcg_move_lock_nested/move_locked_memcg) to 'struct >> task'? Is it feasible? >> >> Now I'm thinking about another synchronization proposal for memcg page >> stat updater and move_account, which seems to deal with recursion >> issue and deadlock: >> >> CPU A CPU B >> >> move_lock_mem_cgroup >> old_memcg = pc->mem_cgroup >> TestSetPageDirty(page) >> move_unlock_mem_cgroup >> >> move_lock_mem_cgroup >> if (PageDirty) >> >> old_memcg->nr_dirty -- >> >> new_memcg->nr_dirty ++ >> >> pc->mem_cgroup = new_memcgy >> >> move_unlock_mem_cgroup >> >> old_memcg->nr_dirty ++ >> > > I'm sorry I couldn't catch why you call TestSetPageDirty()....and what > CPUA/CPUB is > doing ? CPUA calls move_account() and CPUB updates stat ? If so, why > move_account() > is allowed to set PG_dirty ?? > Sorry, the layout above seems in a mess and is confusing... >From the beginning, after removing duplicated information like PCG_* flags in 'struct page_cgroup'(commit 2ff76f1193), there's a problem between "move" and "page stat accounting" : assume CPU-A does "page stat accounting" and CPU-B does "move" CPU-A CPU-B TestSet PG_dirty (delay) move_lock_mem_cgroup() if (PageDirty(page)) { old_memcg->nr_dirty -- new_memcg->nr_dirty++ } pc->mem_cgroup = new_memcg; move_unlock_mem_cgroup() move_lock_mem_cgroup() memcg = pc->mem_cgroup memcg->nr_dirty++ move_unlock_mem_cgroup() while accounting information of new_memcg may be double-counted. So we use a bigger lock to solve this problem: (commit: 89c06bd52f) move_lock_mem_cgroup() TestSetPageDirty(page) update page stats (without any checks) move_unlock_mem_cgroup() But this method also has its pros and cons(e.g. need lock nesting). So I doubt whether the following is able to deal with these issues all together: (CPU-A does "page stat accounting" and CPU-B does "move") CPU-A CPU-B move_lock_mem_cgroup() memcg = pc->mem_cgroup SetPageDirty(page) move_unlock_mem_cgroup() move_lock_mem_cgroup() if (PageDirty) { old_memcg->nr_dirty --; new_memcg->nr_dirty ++; } pc->mem_cgroup = new_memcg move_unlock_mem_cgroup() memcg->nr_dirty ++ For CPU-A, we save pc->mem_cgroup in a temporary variable just before SetPageDirty inside move_lock and then update stats if the page is set PG_dirty successfully. But CPU-B may do "moving" in advance that "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but soon CPU-A will do "memcg->nr_dirty ++" at the heels that amend the stats. However, there is a potential problem that old_memcg->nr_dirty may be minus in a very short period but not a big issue IMHO. I hope that is clear. : ) Thanks! > >> >> So nr_dirty of old_memcg may be minus in a very short >> period('old_memcg->nr_dirty --' by CPU B), but it will be revised soon >> by CPU A. And the final figures of memcg->nr_dirty is correct. 
> > > It seems both of new_memcg and old_memcg has an account for a page. Is it > correct ? > > > >> Meanwhile the move_lock only protect saving old_memcg and >> TestSetPageDirty in its critical section and without any irrelevant >> logic, so the lock order or deadlock can be handled easily. >> >> But I'm not sure whether I've lost some race conditions, any comments >> are welcomed. : ) >> > > Sorry I couldn't understand. > > Thanks, > -Kame > > -- Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
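[The move_account() side of the same idea ("CPU B" in the diagram earlier in this message), sketched against the mem_cgroup_move_account_page_stat() helper introduced by patch 4/8. The function name is made up and the surrounding charge/uncharge bookkeeping of mem_cgroup_move_account() is omitted; only the dirty transfer and the pc->mem_cgroup switch are shown.]

==
static void move_account_dirty_sketch(struct page *page,
				      struct page_cgroup *pc,
				      struct mem_cgroup *from,
				      struct mem_cgroup *to,
				      unsigned int nr_pages)
{
	unsigned long flags;

	move_lock_mem_cgroup(from, &flags);
	if (PageDirty(page)) {
		/*
		 * May transiently drive from's FILE_DIRTY negative; the
		 * dirtier's deferred increment fixes it up shortly after.
		 */
		mem_cgroup_move_account_page_stat(from, to, nr_pages,
						MEM_CGROUP_STAT_FILE_DIRTY);
	}
	pc->mem_cgroup = to;
	move_unlock_mem_cgroup(from, &flags);
}
==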
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <CAFj3OHW=n22veXzR27qfc+10t-nETU=B78NULPXrEDT1S-KsOw-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org> @ 2013-01-10 5:03 ` Kamezawa Hiroyuki [not found] ` <50EE4B84.5080205-+CUm20s59erQFUHtdCDX3A@public.gmane.org> 0 siblings, 1 reply; 27+ messages in thread From: Kamezawa Hiroyuki @ 2013-01-10 5:03 UTC (permalink / raw) To: Sha Zhengju Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju (2013/01/10 13:26), Sha Zhengju wrote: > But this method also has its pros and cons(e.g. need lock nesting). So > I doubt whether the following is able to deal with these issues all > together: > (CPU-A does "page stat accounting" and CPU-B does "move") > > CPU-A CPU-B > > move_lock_mem_cgroup() > memcg = pc->mem_cgroup > SetPageDirty(page) > move_unlock_mem_cgroup() > move_lock_mem_cgroup() > if (PageDirty) { > old_memcg->nr_dirty --; > new_memcg->nr_dirty ++; > } > pc->mem_cgroup = new_memcg > move_unlock_mem_cgroup() > > memcg->nr_dirty ++ > > > For CPU-A, we save pc->mem_cgroup in a temporary variable just before > SetPageDirty inside move_lock and then update stats if the page is set > PG_dirty successfully. But CPU-B may do "moving" in advance that > "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but > soon CPU-A will do "memcg->nr_dirty ++" at the heels that amend the > stats. > However, there is a potential problem that old_memcg->nr_dirty may be > minus in a very short period but not a big issue IMHO. > IMHO, this will work. Please take care of that the recorded memcg will not be invalid pointer when you update the nr_dirty later. (Maybe RCU will protect it.) _If_ this method can handle "nesting" problem clearer and make implementation simpler, please go ahead. To be honest, I'm not sure how the code will be until seeing the patch. Hmm, why you write SetPageDirty() here rather than TestSetPageDirty().... Thanks, -Kame ^ permalink raw reply [flat|nested] 27+ messages in thread
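[One way to address the stale-pointer concern raised here, sketched as a variant of the hypothetical dirtier-side function above: pin the recorded memcg with a css reference for the duration of the deferred update. Holding the RCU read section across both steps would be the lighter alternative, and whether css_tryget() is needed at all, given that the page's charge already keeps the memcg alive, is an open question of the proposal; this is not posted code.]

==
static int set_page_dirty_memcg_pinned_sketch(struct page *page)
{
	struct page_cgroup *pc = lookup_page_cgroup(page);
	struct mem_cgroup *memcg;
	bool locked;
	unsigned long flags;
	int newly_dirty;

	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
	memcg = pc->mem_cgroup;
	if (!css_tryget(&memcg->css))	/* cgroup being torn down */
		memcg = NULL;
	newly_dirty = !TestSetPageDirty(page);
	mem_cgroup_end_update_page_stat(page, &locked, &flags);

	if (memcg) {
		if (newly_dirty)
			mem_cgroup_add_dirty(memcg, 1);
		css_put(&memcg->css);
	}

	return newly_dirty;
}
==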
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <50EE4B84.5080205-+CUm20s59erQFUHtdCDX3A@public.gmane.org> @ 2013-01-10 8:28 ` Sha Zhengju 0 siblings, 0 replies; 27+ messages in thread From: Sha Zhengju @ 2013-01-10 8:28 UTC (permalink / raw) To: Kamezawa Hiroyuki Cc: Michal Hocko, Hugh Dickins, Johannes Weiner, linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, gthelen-hpIqsD4AKlfQT0dZR+AlfA, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju On Thu, Jan 10, 2013 at 1:03 PM, Kamezawa Hiroyuki <kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A@public.gmane.org> wrote: > (2013/01/10 13:26), Sha Zhengju wrote: > >> But this method also has its pros and cons(e.g. need lock nesting). So >> I doubt whether the following is able to deal with these issues all >> together: >> (CPU-A does "page stat accounting" and CPU-B does "move") >> >> CPU-A CPU-B >> >> move_lock_mem_cgroup() >> memcg = pc->mem_cgroup >> SetPageDirty(page) >> move_unlock_mem_cgroup() >> move_lock_mem_cgroup() >> if (PageDirty) { >> old_memcg->nr_dirty --; >> new_memcg->nr_dirty ++; >> } >> pc->mem_cgroup = new_memcg >> move_unlock_mem_cgroup() >> >> memcg->nr_dirty ++ >> >> >> For CPU-A, we save pc->mem_cgroup in a temporary variable just before >> SetPageDirty inside move_lock and then update stats if the page is set >> PG_dirty successfully. But CPU-B may do "moving" in advance that >> "old_memcg->nr_dirty --" will make old_memcg->nr_dirty incorrect but >> soon CPU-A will do "memcg->nr_dirty ++" at the heels that amend the >> stats. >> However, there is a potential problem that old_memcg->nr_dirty may be >> minus in a very short period but not a big issue IMHO. >> > > IMHO, this will work. Please take care of that the recorded memcg will not > be invalid pointer when you update the nr_dirty later. > (Maybe RCU will protect it.) > Yes, there're 3 places to change pc->mem_cgroup: charge & uncharge & move_account. "charge" has no race with stat updater and "uncharge" doesn't reset pc->mem_cgroup directly, also "move_account" is just the one we are handling, so they may do no harm here. Meanwhile, invalid pointer made by cgroup deletion may also be avoided by RCU. Yet it's a rough conclusion by quick look... > _If_ this method can handle "nesting" problem clearer and make > implementation > simpler, please go ahead. To be honest, I'm not sure how the code will be > until Okay, later I'll try to propose the patch. > seeing the patch. Hmm, why you write SetPageDirty() here rather than > TestSetPageDirty().... > No particular reason...TestSetPageDirty() may be more precise... : ) -- Thanks, Sha ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2013-01-02 10:44 ` Michal Hocko 2013-01-05 4:48 ` Sha Zhengju @ 2013-05-03 9:11 ` Michal Hocko 2013-05-03 9:59 ` Sha Zhengju 1 sibling, 1 reply; 27+ messages in thread From: Michal Hocko @ 2013-05-03 9:11 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel, cgroups, linux-mm, linux-fsdevel, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer, dchinner, Sha Zhengju On Wed 02-01-13 11:44:21, Michal Hocko wrote: > On Wed 26-12-12 01:26:07, Sha Zhengju wrote: > > From: Sha Zhengju <handai.szj@taobao.com> > > > > This patch adds memcg routines to count dirty pages, which allows memory controller > > to maintain an accurate view of the amount of its dirty memory and can provide some > > info for users while cgroup's direct reclaim is working. > > I guess you meant targeted resp. (hard/soft) limit reclaim here, > right? It is true that this is direct reclaim but it is not clear to me > why the usefulnes should be limitted to the reclaim for users. I would > understand this if the users was in fact in-kernel users. > > [...] > > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version > > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: > > ->private_lock --> mapping->tree_lock --> memcg->move_lock. > > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() > > and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, > > a prepare PageDirty() checking is added. > > But there is another AA deadlock here I believe. > page_remove_rmap > mem_cgroup_begin_update_page_stat <<< 1 > set_page_dirty > __set_page_dirty_buffers > __set_page_dirty > mem_cgroup_begin_update_page_stat <<< 2 > move_lock_mem_cgroup > spin_lock_irqsave(&memcg->move_lock, *flags); JFYI since abf09bed (s390/mm: implement software dirty bits) this is no longer possible. I haven't checked wheter there are other cases like this one and it should be better if mem_cgroup_begin_update_page_stat was recursive safe if that can be done without too many hacks. I will have a look at this (hopefully) sometimes next week. [...] -- Michal Hocko SUSE Labs -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2013-05-03 9:11 ` Michal Hocko @ 2013-05-03 9:59 ` Sha Zhengju 0 siblings, 0 replies; 27+ messages in thread From: Sha Zhengju @ 2013-05-03 9:59 UTC (permalink / raw) To: Michal Hocko Cc: LKML, Cgroups, linux-mm@kvack.org, linux-fsdevel, Andrew Morton, KAMEZAWA Hiroyuki, Greg Thelen, Wu Fengguang, Glauber Costa, Dave Chinner, Sha Zhengju On Fri, May 3, 2013 at 5:11 PM, Michal Hocko <mhocko@suse.cz> wrote: > On Wed 02-01-13 11:44:21, Michal Hocko wrote: >> On Wed 26-12-12 01:26:07, Sha Zhengju wrote: >> > From: Sha Zhengju <handai.szj@taobao.com> >> > >> > This patch adds memcg routines to count dirty pages, which allows memory controller >> > to maintain an accurate view of the amount of its dirty memory and can provide some >> > info for users while cgroup's direct reclaim is working. >> >> I guess you meant targeted resp. (hard/soft) limit reclaim here, >> right? It is true that this is direct reclaim but it is not clear to me >> why the usefulnes should be limitted to the reclaim for users. I would >> understand this if the users was in fact in-kernel users. >> >> [...] >> > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >> > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >> > ->private_lock --> mapping->tree_lock --> memcg->move_lock. >> > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() >> > and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, >> > a prepare PageDirty() checking is added. >> >> But there is another AA deadlock here I believe. >> page_remove_rmap >> mem_cgroup_begin_update_page_stat <<< 1 >> set_page_dirty >> __set_page_dirty_buffers >> __set_page_dirty >> mem_cgroup_begin_update_page_stat <<< 2 >> move_lock_mem_cgroup >> spin_lock_irqsave(&memcg->move_lock, *flags); > > JFYI since abf09bed (s390/mm: implement software dirty bits) this is no > longer possible. I haven't checked wheter there are other cases like > this one and it should be better if mem_cgroup_begin_update_page_stat > was recursive safe if that can be done without too many hacks. > I will have a look at this (hopefully) sometimes next week. > Hi Michal, I'm sorry for not being able to return to this problem immediately after LSF/MM. That is good news. IIRC, it's the only place we have encountered recursive problem in accounting memcg dirty pages. But I'll try to revive my previous work of simplifying mem_cgroup_begin_update_page_stat() lock. I'll back to it in next few days. -- Thanks, Sha -- To unsubscribe, send a message with 'unsubscribe linux-mm' in the body to majordomo@kvack.org. For more info on Linux MM, see: http://www.linux-mm.org/ . Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a> ^ permalink raw reply [flat|nested] 27+ messages in thread
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting 2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju 2013-01-02 10:44 ` Michal Hocko @ 2013-01-06 20:07 ` Greg Thelen [not found] ` <xr93obh2krcr.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org> 1 sibling, 1 reply; 27+ messages in thread From: Greg Thelen @ 2013-01-06 20:07 UTC (permalink / raw) To: Sha Zhengju Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju On Tue, Dec 25 2012, Sha Zhengju wrote: > From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> > > This patch adds memcg routines to count dirty pages, which allows memory controller > to maintain an accurate view of the amount of its dirty memory and can provide some > info for users while cgroup's direct reclaim is working. > > After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can > use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg > has a feature to move a page from a cgroup to another one and may have race between > "move" and "page stat accounting". So in order to avoid the race we have designed a > bigger lock: > > mem_cgroup_begin_update_page_stat() > modify page information -->(a) > mem_cgroup_update_page_stat() -->(b) > mem_cgroup_end_update_page_stat() > It requires (a) and (b)(dirty pages accounting) can stay close enough. > In the previous two prepare patches, we have reworked the vfs set page dirty routines > and now the interfaces are more explicit: > incrementing (2): > __set_page_dirty > __set_page_dirty_nobuffers > decrementing (2): > clear_page_dirty_for_io > cancel_dirty_page > > To prevent AB/BA deadlock mentioned by Greg Thelen in previous version > (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: > ->private_lock --> mapping->tree_lock --> memcg->move_lock. > So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() > and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, > a prepare PageDirty() checking is added. > > > Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> > Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-LdfC7J4mv27QFUHtdCDX3A@public.gmane.org> > Acked-by: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> > --- > fs/buffer.c | 14 +++++++++++++- > include/linux/memcontrol.h | 1 + > mm/filemap.c | 10 ++++++++++ > mm/memcontrol.c | 29 ++++++++++++++++++++++------- > mm/page-writeback.c | 39 ++++++++++++++++++++++++++++++++------- > mm/truncate.c | 6 ++++++ > 6 files changed, 84 insertions(+), 15 deletions(-) __nilfs_clear_page_dirty() clears PageDirty, does it need modification for this patch series? 
> diff --git a/fs/buffer.c b/fs/buffer.c > index 762168a..53402d2 100644 > --- a/fs/buffer.c > +++ b/fs/buffer.c > @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); > int __set_page_dirty(struct page *page, > struct address_space *mapping, int warn) > { > + bool locked; > + unsigned long flags; > + > if (unlikely(!mapping)) > return !TestSetPageDirty(page); > > - if (TestSetPageDirty(page)) > + if (PageDirty(page)) > return 0; > > spin_lock_irq(&mapping->tree_lock); > + mem_cgroup_begin_update_page_stat(page, &locked, &flags); > + > + if (TestSetPageDirty(page)) { > + mem_cgroup_end_update_page_stat(page, &locked, &flags); > + spin_unlock_irq(&mapping->tree_lock); > + return 0; > + } > + > if (page->mapping) { /* Race with truncate? */ > WARN_ON_ONCE(warn && !PageUptodate(page)); > account_page_dirtied(page, mapping); > radix_tree_tag_set(&mapping->page_tree, > page_index(page), PAGECACHE_TAG_DIRTY); > } > + mem_cgroup_end_update_page_stat(page, &locked, &flags); > spin_unlock_irq(&mapping->tree_lock); > __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); > > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h > index 5421b8a..2685d8a 100644 > --- a/include/linux/memcontrol.h > +++ b/include/linux/memcontrol.h > @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index { > MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ > MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ > MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ > + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ > MEM_CGROUP_STAT_NSTATS, > }; > > diff --git a/mm/filemap.c b/mm/filemap.c > index 83efee7..b589be5 100644 > --- a/mm/filemap.c > +++ b/mm/filemap.c > @@ -62,6 +62,11 @@ > * ->swap_lock (exclusive_swap_page, others) > * ->mapping->tree_lock > * > + * ->private_lock (__set_page_dirty_buffers) > + * ->mapping->tree_lock > + * ->memcg->move_lock (mem_cgroup_begin_update_page_stat-> > + * move_lock_mem_cgroup) > + * > * ->i_mutex > * ->i_mmap_mutex (truncate->unmap_mapping_range) > * > @@ -112,6 +117,8 @@ > void __delete_from_page_cache(struct page *page) > { > struct address_space *mapping = page->mapping; > + bool locked; > + unsigned long flags; > > /* > * if we're uptodate, flush out into the cleancache, otherwise > @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page) > * Fix it up by doing a final dirty accounting check after > * having removed the page entirely. 
> */ > + mem_cgroup_begin_update_page_stat(page, &locked, &flags); > if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { > + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); > dec_zone_page_state(page, NR_FILE_DIRTY); > dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); > } > + mem_cgroup_end_update_page_stat(page, &locked, &flags); > } > > /** > diff --git a/mm/memcontrol.c b/mm/memcontrol.c > index d450c04..c884640 100644 > --- a/mm/memcontrol.c > +++ b/mm/memcontrol.c > @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = { > "rss", > "mapped_file", > "swap", > + "dirty", > }; > > enum mem_cgroup_events_index { > @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head) > } > #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ > > +static inline > +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, > + struct mem_cgroup *to, > + unsigned int nr_pages, > + enum mem_cgroup_stat_index idx) > +{ > + /* Update stat data for mem_cgroup */ > + preempt_disable(); > + __this_cpu_add(from->stat->count[idx], -nr_pages); What you do think about adding a WARN_ON_ONCE() here to check for underflow? A check might help catch: a) unresolved races between move accounting vs setting/clearing dirtying. b) future modifications that mess with PageDirty/Writeback flags without considering memcg. > + __this_cpu_add(to->stat->count[idx], nr_pages); > + preempt_enable(); > +} > + > /** > * mem_cgroup_move_account - move account of the page > * @page: the page > @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page, > > move_lock_mem_cgroup(from, &flags); > > - if (!anon && page_mapped(page)) { > - /* Update mapped_file data for mem_cgroup */ > - preempt_disable(); > - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); > - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); > - preempt_enable(); > - } > + if (!anon && page_mapped(page)) > + mem_cgroup_move_account_page_stat(from, to, nr_pages, > + MEM_CGROUP_STAT_FILE_MAPPED); > + > + if (PageDirty(page)) Is (!anon && PageDirty(page)) better? If dirty anon pages are moved between memcg M1 and M2 I think that we'd mistakenly underflow M1 if it was not previously accounting for the dirty anon page. ^ permalink raw reply [flat|nested] 27+ messages in thread
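[The call-site change suggested in this review, as a hunk against mem_cgroup_move_account() from patch 4/8, assuming the PageDirty() branch guards the MEM_CGROUP_STAT_FILE_DIRTY transfer that is cut off in the quote above. The underflow WARN_ON_ONCE() idea would need more care, since individual per-cpu slots can legitimately go negative even when the summed counter is fine.]

==
 	if (!anon && page_mapped(page))
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 						MEM_CGROUP_STAT_FILE_MAPPED);
 
-	if (PageDirty(page))
+	if (!anon && PageDirty(page))
 		mem_cgroup_move_account_page_stat(from, to, nr_pages,
 						MEM_CGROUP_STAT_FILE_DIRTY);
==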
* Re: [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting [not found] ` <xr93obh2krcr.fsf-aSPv4SP+Du0KgorLzL7FmE7CuiCeIGUxQQ4Iyu8u01E@public.gmane.org> @ 2013-01-09 9:45 ` Sha Zhengju 0 siblings, 0 replies; 27+ messages in thread From: Sha Zhengju @ 2013-01-09 9:45 UTC (permalink / raw) To: Greg Thelen Cc: linux-kernel-u79uwXL29TY76Z2rM5mHXA, cgroups-u79uwXL29TY76Z2rM5mHXA, linux-mm-Bw31MaZKKs3YtjvyW6yDsg, linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, mhocko-AlSwsSmVLrQ, akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A, fengguang.wu-ral2JQCrhuEAvxtiuMwx3w, glommer-bzQdu9zFT3WakBO8gow8eQ, dchinner-H+wXaHxf7aLQT0dZR+AlfA, Sha Zhengju On Mon, Jan 7, 2013 at 4:07 AM, Greg Thelen <gthelen-hpIqsD4AKlfQT0dZR+AlfA@public.gmane.org> wrote: > On Tue, Dec 25 2012, Sha Zhengju wrote: > >> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >> >> This patch adds memcg routines to count dirty pages, which allows memory controller >> to maintain an accurate view of the amount of its dirty memory and can provide some >> info for users while cgroup's direct reclaim is working. >> >> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can >> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg >> has a feature to move a page from a cgroup to another one and may have race between >> "move" and "page stat accounting". So in order to avoid the race we have designed a >> bigger lock: >> >> mem_cgroup_begin_update_page_stat() >> modify page information -->(a) >> mem_cgroup_update_page_stat() -->(b) >> mem_cgroup_end_update_page_stat() >> It requires (a) and (b)(dirty pages accounting) can stay close enough. >> In the previous two prepare patches, we have reworked the vfs set page dirty routines >> and now the interfaces are more explicit: >> incrementing (2): >> __set_page_dirty >> __set_page_dirty_nobuffers >> decrementing (2): >> clear_page_dirty_for_io >> cancel_dirty_page >> >> To prevent AB/BA deadlock mentioned by Greg Thelen in previous version >> (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order: >> ->private_lock --> mapping->tree_lock --> memcg->move_lock. >> So we need to make mapping->tree_lock ahead of TestSetPageDirty in __set_page_dirty() >> and __set_page_dirty_nobuffers(). But in order to avoiding useless spinlock contention, >> a prepare PageDirty() checking is added. >> >> >> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org> >> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu-LdfC7J4mv27QFUHtdCDX3A@public.gmane.org> >> Acked-by: Fengguang Wu <fengguang.wu-ral2JQCrhuEAvxtiuMwx3w@public.gmane.org> >> --- >> fs/buffer.c | 14 +++++++++++++- >> include/linux/memcontrol.h | 1 + >> mm/filemap.c | 10 ++++++++++ >> mm/memcontrol.c | 29 ++++++++++++++++++++++------- >> mm/page-writeback.c | 39 ++++++++++++++++++++++++++++++++------- >> mm/truncate.c | 6 ++++++ >> 6 files changed, 84 insertions(+), 15 deletions(-) > > __nilfs_clear_page_dirty() clears PageDirty, does it need modification > for this patch series? It doesn't need to do so. mem_cgroup_dec/inc_page_stat() is accompany with dec/inc_zone_page_state() to account memcg page stat. IMHO we only have to do some modification while SetPageDirty and dec/inc_zone_page_state() occur together. __nilfs_clear_page_dirty() will call clear_page_dirty_for_io(page) later where the accounting is done. 
>> diff --git a/fs/buffer.c b/fs/buffer.c >> index 762168a..53402d2 100644 >> --- a/fs/buffer.c >> +++ b/fs/buffer.c >> @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode); >> int __set_page_dirty(struct page *page, >> struct address_space *mapping, int warn) >> { >> + bool locked; >> + unsigned long flags; >> + >> if (unlikely(!mapping)) >> return !TestSetPageDirty(page); >> >> - if (TestSetPageDirty(page)) >> + if (PageDirty(page)) >> return 0; >> >> spin_lock_irq(&mapping->tree_lock); >> + mem_cgroup_begin_update_page_stat(page, &locked, &flags); >> + >> + if (TestSetPageDirty(page)) { >> + mem_cgroup_end_update_page_stat(page, &locked, &flags); >> + spin_unlock_irq(&mapping->tree_lock); >> + return 0; >> + } >> + >> if (page->mapping) { /* Race with truncate? */ >> WARN_ON_ONCE(warn && !PageUptodate(page)); >> account_page_dirtied(page, mapping); >> radix_tree_tag_set(&mapping->page_tree, >> page_index(page), PAGECACHE_TAG_DIRTY); >> } >> + mem_cgroup_end_update_page_stat(page, &locked, &flags); >> spin_unlock_irq(&mapping->tree_lock); >> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES); >> >> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h >> index 5421b8a..2685d8a 100644 >> --- a/include/linux/memcontrol.h >> +++ b/include/linux/memcontrol.h >> @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index { >> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */ >> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */ >> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */ >> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */ >> MEM_CGROUP_STAT_NSTATS, >> }; >> >> diff --git a/mm/filemap.c b/mm/filemap.c >> index 83efee7..b589be5 100644 >> --- a/mm/filemap.c >> +++ b/mm/filemap.c >> @@ -62,6 +62,11 @@ >> * ->swap_lock (exclusive_swap_page, others) >> * ->mapping->tree_lock >> * >> + * ->private_lock (__set_page_dirty_buffers) >> + * ->mapping->tree_lock >> + * ->memcg->move_lock (mem_cgroup_begin_update_page_stat-> >> + * move_lock_mem_cgroup) >> + * >> * ->i_mutex >> * ->i_mmap_mutex (truncate->unmap_mapping_range) >> * >> @@ -112,6 +117,8 @@ >> void __delete_from_page_cache(struct page *page) >> { >> struct address_space *mapping = page->mapping; >> + bool locked; >> + unsigned long flags; >> >> /* >> * if we're uptodate, flush out into the cleancache, otherwise >> @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page) >> * Fix it up by doing a final dirty accounting check after >> * having removed the page entirely. 
>> */ >> + mem_cgroup_begin_update_page_stat(page, &locked, &flags); >> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) { >> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY); >> dec_zone_page_state(page, NR_FILE_DIRTY); >> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE); >> } >> + mem_cgroup_end_update_page_stat(page, &locked, &flags); >> } >> >> /** >> diff --git a/mm/memcontrol.c b/mm/memcontrol.c >> index d450c04..c884640 100644 >> --- a/mm/memcontrol.c >> +++ b/mm/memcontrol.c >> @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = { >> "rss", >> "mapped_file", >> "swap", >> + "dirty", >> }; >> >> enum mem_cgroup_events_index { >> @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head) >> } >> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */ >> >> +static inline >> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from, >> + struct mem_cgroup *to, >> + unsigned int nr_pages, >> + enum mem_cgroup_stat_index idx) >> +{ >> + /* Update stat data for mem_cgroup */ >> + preempt_disable(); >> + __this_cpu_add(from->stat->count[idx], -nr_pages); > > What you do think about adding a WARN_ON_ONCE() here to check for > underflow? A check might help catch: > a) unresolved races between move accounting vs setting/clearing > dirtying. > b) future modifications that mess with PageDirty/Writeback flags without > considering memcg. To prevent the current memcg deadlock and lock nesting, I'm thinking about another synchronization proposal for memcg page stat & move_account. The counter may be minus in a very short periods. Now I'm not sure whether it's okay... maybe I'll send it out in another thread later... > >> + __this_cpu_add(to->stat->count[idx], nr_pages); >> + preempt_enable(); >> +} >> + >> /** >> * mem_cgroup_move_account - move account of the page >> * @page: the page >> @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page, >> >> move_lock_mem_cgroup(from, &flags); >> >> - if (!anon && page_mapped(page)) { >> - /* Update mapped_file data for mem_cgroup */ >> - preempt_disable(); >> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); >> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]); >> - preempt_enable(); >> - } >> + if (!anon && page_mapped(page)) >> + mem_cgroup_move_account_page_stat(from, to, nr_pages, >> + MEM_CGROUP_STAT_FILE_MAPPED); >> + >> + if (PageDirty(page)) > > Is (!anon && PageDirty(page)) better? If dirty anon pages are moved > between memcg M1 and M2 I think that we'd mistakenly underflow M1 if it > was not previously accounting for the dirty anon page. Yeah... A page can be PageAnon and PageDirty simultaneously but we only account dirty file-page. Will be updated next version. -- Thanks, Sha ^ permalink raw reply [flat|nested] 27+ messages in thread