* [PATCH 1/7] memcg: update cgroup memory document
[not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-06-28 10:57 ` Sha Zhengju
2012-07-02 7:00 ` Kamezawa Hiroyuki
[not found] ` <1340881055-5511-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-06-28 11:03 ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
` (3 subsequent siblings)
4 siblings, 2 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-06-28 10:57 UTC (permalink / raw)
To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
Document cgroup dirty/writeback memory statistics.
The implementation of these new interface routines comes in the
following patches of this series.
Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
Documentation/cgroups/memory.txt | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index dd88540..24d7e3c 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -420,6 +420,8 @@ pgpgin - # of charging events to the memory cgroup. The charging
pgpgout - # of uncharging events to the memory cgroup. The uncharging
event happens each time a page is unaccounted from the cgroup.
swap - # of bytes of swap usage
+dirty - # of bytes that are waiting to get written back to the disk.
+writeback - # of bytes that are actively being written back to the disk.
inactive_anon - # of bytes of anonymous memory and swap cache memory on
LRU list.
active_anon - # of bytes of anonymous and swap cache memory on active
--
1.7.1
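[Editor's note: once the series is applied, the two new counters would appear in each cgroup's memory.stat alongside the existing fields. A quick way to pull them out; the sample file and byte values below are illustrative, not taken from this thread:

```shell
# Sample of what memory.stat would contain with this series applied;
# the byte values here are made up for illustration.
cat > /tmp/memory.stat.sample <<'EOF'
pgpgin 123456
pgpgout 120000
swap 0
dirty 1064960
writeback 262144
EOF

# Pick out the new dirty/writeback counters (reported in bytes).
awk '$1 == "dirty" || $1 == "writeback" { print $1 ": " $2 " bytes" }' \
    /tmp/memory.stat.sample
```

On a real system the file would be read from the cgroup mount, e.g. /sys/fs/cgroup/memory/<group>/memory.stat.]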
* Re: [PATCH 1/7] memcg: update cgroup memory document
2012-06-28 10:57 ` [PATCH 1/7] memcg: update cgroup memory document Sha Zhengju
@ 2012-07-02 7:00 ` Kamezawa Hiroyuki
[not found] ` <1340881055-5511-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
1 sibling, 0 replies; 54+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-02 7:00 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
Sha Zhengju
(2012/06/28 19:57), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> Document cgroup dirty/writeback memory statistics.
>
> The implementation for these new interface routines come in a series
> of following patches.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <dont@kvack.org>
* Re: [PATCH 1/7] memcg: update cgroup memory document
[not found] ` <1340881055-5511-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 12:47 ` Michal Hocko
2012-07-07 13:45 ` Fengguang Wu
1 sibling, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2012-07-04 12:47 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
On Thu 28-06-12 18:57:35, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>
> Document cgroup dirty/writeback memory statistics.
>
> The implementation for these new interface routines come in a series
> of following patches.
I would expect this one to be the last...
>
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
Acked-by: Michal Hocko <mhocko-AlSwsSmVLrQ@public.gmane.org>
> ---
> Documentation/cgroups/memory.txt | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
> index dd88540..24d7e3c 100644
> --- a/Documentation/cgroups/memory.txt
> +++ b/Documentation/cgroups/memory.txt
> @@ -420,6 +420,8 @@ pgpgin - # of charging events to the memory cgroup. The charging
> pgpgout - # of uncharging events to the memory cgroup. The uncharging
> event happens each time a page is unaccounted from the cgroup.
> swap - # of bytes of swap usage
> +dirty - # of bytes that are waiting to get written back to the disk.
> +writeback - # of bytes that are actively being written back to the disk.
> inactive_anon - # of bytes of anonymous memory and swap cache memory on
> LRU list.
> active_anon - # of bytes of anonymous and swap cache memory on active
> --
> 1.7.1
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* Re: [PATCH 1/7] memcg: update cgroup memory document
[not found] ` <1340881055-5511-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-04 12:47 ` Michal Hocko
@ 2012-07-07 13:45 ` Fengguang Wu
1 sibling, 0 replies; 54+ messages in thread
From: Fengguang Wu @ 2012-07-07 13:45 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
> +dirty - # of bytes that are waiting to get written back to the disk.
> +writeback - # of bytes that are actively being written back to the disk.
This should be a bit more clear to the user:
dirty - # of bytes of file cache that are not in sync with the disk copy
writeback - # of bytes of file cache that are queued for syncing to disk
Thanks,
Fengguang
> inactive_anon - # of bytes of anonymous memory and swap cache memory on
> LRU list.
> active_anon - # of bytes of anonymous and swap cache memory on active
* [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
[not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-06-28 10:57 ` [PATCH 1/7] memcg: update cgroup memory document Sha Zhengju
@ 2012-06-28 11:03 ` Sha Zhengju
[not found] ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-04 14:27 ` Michal Hocko
2012-06-28 11:04 ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Sha Zhengju
` (2 subsequent siblings)
4 siblings, 2 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:03 UTC (permalink / raw)
To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ,
ceph-devel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
In the following patches we will treat SetPageDirty and dirty page accounting
as one integrated operation. Filesystems should use the vfs interface directly
instead of open-coding those details.
Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
fs/buffer.c | 2 +-
fs/ceph/addr.c | 20 ++------------------
include/linux/buffer_head.h | 2 ++
3 files changed, 5 insertions(+), 19 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index e8d96b8..55522dd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
* If warn is true, then emit a warning if the page is not uptodate and has
* not been truncated.
*/
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
if (unlikely(!mapping))
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b67304..d028fbe 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/writeback.h> /* generic_writepages */
+#include <linux/buffer_head.h>
#include <linux/slab.h>
#include <linux/pagevec.h>
#include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
int undo = 0;
struct ceph_snap_context *snapc;
- if (unlikely(!mapping))
- return !TestSetPageDirty(page);
-
- if (TestSetPageDirty(page)) {
- dout("%p set_page_dirty %p idx %lu -- already dirty\n",
- mapping->host, page, page->index);
+ if (!__set_page_dirty(page, mapping, 1))
return 0;
- }
inode = mapping->host;
ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
snapc, snapc->seq, snapc->num_snaps);
spin_unlock(&ci->i_ceph_lock);
- /* now adjust page */
- spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
- WARN_ON_ONCE(!PageUptodate(page));
- account_page_dirtied(page, page->mapping);
- radix_tree_tag_set(&mapping->page_tree,
- page_index(page), PAGECACHE_TAG_DIRTY);
-
/*
* Reference snap context in page->private. Also set
* PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
undo = 1;
}
- spin_unlock_irq(&mapping->tree_lock);
-
if (undo)
/* whoops, we failed to dirty the page */
ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
- __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
BUG_ON(!PageDirty(page));
return 1;
}
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
}
extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+ struct address_space *mapping, int warn);
#else /* CONFIG_BLOCK */
--
1.7.1
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
[not found] ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-06-29 5:21 ` Sage Weil
2012-07-02 8:10 ` Sha Zhengju
0 siblings, 1 reply; 54+ messages in thread
From: Sage Weil @ 2012-06-29 5:21 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA,
torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ,
ceph-devel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
On Thu, 28 Jun 2012, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
>
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> ---
> fs/buffer.c | 2 +-
> fs/ceph/addr.c | 20 ++------------------
> include/linux/buffer_head.h | 2 ++
> 3 files changed, 5 insertions(+), 19 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> * If warn is true, then emit a warning if the page is not uptodate and has
> * not been truncated.
> */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
> struct address_space *mapping, int warn)
> {
> if (unlikely(!mapping))
This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
continue to build as a module.
With that fixed, the ceph bits are a welcome cleanup!
Acked-by: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>
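[Editor's note: the missing export Sage points out amounts to one extra line next to the function in fs/buffer.c. A sketch of the intended end state, with the unchanged body elided; this is not the actual follow-up patch:

```c
/* fs/buffer.c -- after dropping 'static', the symbol must also be
 * exported so that modular filesystems such as ceph can link to it. */
int __set_page_dirty(struct page *page,
		     struct address_space *mapping, int warn)
{
	if (unlikely(!mapping))
		return !TestSetPageDirty(page);
	/* ... existing body unchanged ... */
	return 1;
}
EXPORT_SYMBOL(__set_page_dirty);
```
]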
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> #include <linux/writeback.h> /* generic_writepages */
> +#include <linux/buffer_head.h>
> #include <linux/slab.h>
> #include <linux/pagevec.h>
> #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> int undo = 0;
> struct ceph_snap_context *snapc;
>
> - if (unlikely(!mapping))
> - return !TestSetPageDirty(page);
> -
> - if (TestSetPageDirty(page)) {
> - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> - mapping->host, page, page->index);
> + if (!__set_page_dirty(page, mapping, 1))
> return 0;
> - }
>
> inode = mapping->host;
> ci = ceph_inode(inode);
> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> snapc, snapc->seq, snapc->num_snaps);
> spin_unlock(&ci->i_ceph_lock);
>
> - /* now adjust page */
> - spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> - WARN_ON_ONCE(!PageUptodate(page));
> - account_page_dirtied(page, page->mapping);
> - radix_tree_tag_set(&mapping->page_tree,
> - page_index(page), PAGECACHE_TAG_DIRTY);
> -
> /*
> * Reference snap context in page->private. Also set
> * PagePrivate so that we get invalidatepage callback.
> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> undo = 1;
> }
>
> - spin_unlock_irq(&mapping->tree_lock);
> -
> if (undo)
> /* whoops, we failed to dirty the page */
> ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>
> - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
> BUG_ON(!PageDirty(page));
> return 1;
> }
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 458f497..0a331a8 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> }
>
> extern int __set_page_dirty_buffers(struct page *page);
> +extern int __set_page_dirty(struct page *page,
> + struct address_space *mapping, int warn);
>
> #else /* CONFIG_BLOCK */
>
> --
> 1.7.1
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
2012-06-29 5:21 ` Sage Weil
@ 2012-07-02 8:10 ` Sha Zhengju
2012-07-02 14:49 ` Sage Weil
0 siblings, 1 reply; 54+ messages in thread
From: Sha Zhengju @ 2012-07-02 8:10 UTC (permalink / raw)
To: Sage Weil
Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
ceph-devel, Sha Zhengju
On 06/29/2012 01:21 PM, Sage Weil wrote:
> On Thu, 28 Jun 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju<handai.szj@taobao.com>
>>
>> Following we will treat SetPageDirty and dirty page accounting as an integrated
>> operation. Filesystems had better use vfs interface directly to avoid those details.
>>
>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> ---
>> fs/buffer.c | 2 +-
>> fs/ceph/addr.c | 20 ++------------------
>> include/linux/buffer_head.h | 2 ++
>> 3 files changed, 5 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index e8d96b8..55522dd 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>> * If warn is true, then emit a warning if the page is not uptodate and has
>> * not been truncated.
>> */
>> -static int __set_page_dirty(struct page *page,
>> +int __set_page_dirty(struct page *page,
>> struct address_space *mapping, int warn)
>> {
>> if (unlikely(!mapping))
> This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> continue to build as a module.
>
> With that fixed, the ceph bits are a welcome cleanup!
>
> Acked-by: Sage Weil<sage@inktank.com>
Further, I checked the path again. May it be reworked as follows to
avoid the undo?

    Before:                             After:

    __set_page_dirty();                 __set_page_dirty();
    ceph operations;                    if (page->mapping)
    if (page->mapping)                          ceph operations;
            ;
    else
            undo = 1;
    if (undo)
            xxx;
Thanks,
Sha
>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 8b67304..d028fbe 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -5,6 +5,7 @@
>> #include<linux/mm.h>
>> #include<linux/pagemap.h>
>> #include<linux/writeback.h> /* generic_writepages */
>> +#include<linux/buffer_head.h>
>> #include<linux/slab.h>
>> #include<linux/pagevec.h>
>> #include<linux/task_io_accounting_ops.h>
>> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>> int undo = 0;
>> struct ceph_snap_context *snapc;
>>
>> - if (unlikely(!mapping))
>> - return !TestSetPageDirty(page);
>> -
>> - if (TestSetPageDirty(page)) {
>> - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>> - mapping->host, page, page->index);
>> + if (!__set_page_dirty(page, mapping, 1))
>> return 0;
>> - }
>>
>> inode = mapping->host;
>> ci = ceph_inode(inode);
>> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>> snapc, snapc->seq, snapc->num_snaps);
>> spin_unlock(&ci->i_ceph_lock);
>>
>> - /* now adjust page */
>> - spin_lock_irq(&mapping->tree_lock);
>> if (page->mapping) { /* Race with truncate? */
>> - WARN_ON_ONCE(!PageUptodate(page));
>> - account_page_dirtied(page, page->mapping);
>> - radix_tree_tag_set(&mapping->page_tree,
>> - page_index(page), PAGECACHE_TAG_DIRTY);
>> -
>> /*
>> * Reference snap context in page->private. Also set
>> * PagePrivate so that we get invalidatepage callback.
>> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>> undo = 1;
>> }
>>
>> - spin_unlock_irq(&mapping->tree_lock);
>> -
>> if (undo)
>> /* whoops, we failed to dirty the page */
>> ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>>
>> - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> -
>> BUG_ON(!PageDirty(page));
>> return 1;
>> }
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index 458f497..0a331a8 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>> }
>>
>> extern int __set_page_dirty_buffers(struct page *page);
>> +extern int __set_page_dirty(struct page *page,
>> + struct address_space *mapping, int warn);
>>
>> #else /* CONFIG_BLOCK */
>>
>> --
>> 1.7.1
>>
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
2012-07-02 8:10 ` Sha Zhengju
@ 2012-07-02 14:49 ` Sage Weil
2012-07-04 8:11 ` Sha Zhengju
0 siblings, 1 reply; 54+ messages in thread
From: Sage Weil @ 2012-07-02 14:49 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
ceph-devel, Sha Zhengju
On Mon, 2 Jul 2012, Sha Zhengju wrote:
> On 06/29/2012 01:21 PM, Sage Weil wrote:
> > On Thu, 28 Jun 2012, Sha Zhengju wrote:
> >
> > > From: Sha Zhengju<handai.szj@taobao.com>
> > >
> > > Following we will treat SetPageDirty and dirty page accounting as an
> > > integrated
> > > operation. Filesystems had better use vfs interface directly to avoid
> > > those details.
> > >
> > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
> > > ---
> > > fs/buffer.c | 2 +-
> > > fs/ceph/addr.c | 20 ++------------------
> > > include/linux/buffer_head.h | 2 ++
> > > 3 files changed, 5 insertions(+), 19 deletions(-)
> > >
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index e8d96b8..55522dd 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> > > * If warn is true, then emit a warning if the page is not uptodate and
> > > has
> > > * not been truncated.
> > > */
> > > -static int __set_page_dirty(struct page *page,
> > > +int __set_page_dirty(struct page *page,
> > > struct address_space *mapping, int warn)
> > > {
> > > if (unlikely(!mapping))
> > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> > continue to build as a module.
> >
> > With that fixed, the ceph bits are a welcome cleanup!
> >
> > Acked-by: Sage Weil<sage@inktank.com>
>
> Further, I checked the path again. May it be reworked as follows to
> avoid the undo?
>
>     Before:                             After:
>
>     __set_page_dirty();                 __set_page_dirty();
>     ceph operations;                    if (page->mapping)
>     if (page->mapping)                          ceph operations;
>             ;
>     else
>             undo = 1;
>     if (undo)
>             xxx;
Yep. Taking another look at the original code, though, I'm worried that
one reason the __set_page_dirty() actions were spread out the way they are
is because we wanted to ensure that the ceph operations were always
performed when PagePrivate was set.
It looks like invalidatepage won't get called if private isn't set, and
presumably it handles the truncate race with __set_page_dirty() properly
(right?). What about writeback? Do we need to worry about writepage[s]
getting called with a NULL page->private?
Thanks!
sage
>
>
>
> Thanks,
> Sha
>
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 8b67304..d028fbe 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -5,6 +5,7 @@
> > > #include<linux/mm.h>
> > > #include<linux/pagemap.h>
> > > #include<linux/writeback.h> /* generic_writepages */
> > > +#include<linux/buffer_head.h>
> > > #include<linux/slab.h>
> > > #include<linux/pagevec.h>
> > > #include<linux/task_io_accounting_ops.h>
> > > @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> > > int undo = 0;
> > > struct ceph_snap_context *snapc;
> > >
> > > - if (unlikely(!mapping))
> > > - return !TestSetPageDirty(page);
> > > -
> > > - if (TestSetPageDirty(page)) {
> > > - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> > > - mapping->host, page, page->index);
> > > + if (!__set_page_dirty(page, mapping, 1))
> > > return 0;
> > > - }
> > >
> > > inode = mapping->host;
> > > ci = ceph_inode(inode);
> > > @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> > > snapc, snapc->seq, snapc->num_snaps);
> > > spin_unlock(&ci->i_ceph_lock);
> > >
> > > - /* now adjust page */
> > > - spin_lock_irq(&mapping->tree_lock);
> > > if (page->mapping) { /* Race with truncate? */
> > > - WARN_ON_ONCE(!PageUptodate(page));
> > > - account_page_dirtied(page, page->mapping);
> > > - radix_tree_tag_set(&mapping->page_tree,
> > > - page_index(page), PAGECACHE_TAG_DIRTY);
> > > -
> > > /*
> > > * Reference snap context in page->private. Also set
> > > * PagePrivate so that we get invalidatepage callback.
> > > @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> > > undo = 1;
> > > }
> > >
> > > - spin_unlock_irq(&mapping->tree_lock);
> > > -
> > > if (undo)
> > > /* whoops, we failed to dirty the page */
> > > ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> > >
> > > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > > -
> > > BUG_ON(!PageDirty(page));
> > > return 1;
> > > }
> > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > > index 458f497..0a331a8 100644
> > > --- a/include/linux/buffer_head.h
> > > +++ b/include/linux/buffer_head.h
> > > @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> > > }
> > >
> > > extern int __set_page_dirty_buffers(struct page *page);
> > > +extern int __set_page_dirty(struct page *page,
> > > + struct address_space *mapping, int warn);
> > >
> > > #else /* CONFIG_BLOCK */
> > >
> > > --
> > > 1.7.1
> > >
>
>
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
2012-07-02 14:49 ` Sage Weil
@ 2012-07-04 8:11 ` Sha Zhengju
2012-07-05 15:20 ` Sage Weil
0 siblings, 1 reply; 54+ messages in thread
From: Sha Zhengju @ 2012-07-04 8:11 UTC (permalink / raw)
To: Sage Weil
Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
ceph-devel, Sha Zhengju
On 07/02/2012 10:49 PM, Sage Weil wrote:
> On Mon, 2 Jul 2012, Sha Zhengju wrote:
>> On 06/29/2012 01:21 PM, Sage Weil wrote:
>>> On Thu, 28 Jun 2012, Sha Zhengju wrote:
>>>
>>>> From: Sha Zhengju<handai.szj@taobao.com>
>>>>
>>>> Following we will treat SetPageDirty and dirty page accounting as an
>>>> integrated
>>>> operation. Filesystems had better use vfs interface directly to avoid
>>>> those details.
>>>>
>>>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>>>> ---
>>>> fs/buffer.c | 2 +-
>>>> fs/ceph/addr.c | 20 ++------------------
>>>> include/linux/buffer_head.h | 2 ++
>>>> 3 files changed, 5 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>>> index e8d96b8..55522dd 100644
>>>> --- a/fs/buffer.c
>>>> +++ b/fs/buffer.c
>>>> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>>> * If warn is true, then emit a warning if the page is not uptodate and
>>>> has
>>>> * not been truncated.
>>>> */
>>>> -static int __set_page_dirty(struct page *page,
>>>> +int __set_page_dirty(struct page *page,
>>>> struct address_space *mapping, int warn)
>>>> {
>>>> if (unlikely(!mapping))
>>> This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
>>> continue to build as a module.
>>>
>>> With that fixed, the ceph bits are a welcome cleanup!
>>>
>>> Acked-by: Sage Weil<sage@inktank.com>
>> Further, I checked the path again. May it be reworked as follows to
>> avoid the undo?
>>
>>     Before:                             After:
>>
>>     __set_page_dirty();                 __set_page_dirty();
>>     ceph operations;                    if (page->mapping)
>>     if (page->mapping)                          ceph operations;
>>             ;
>>     else
>>             undo = 1;
>>     if (undo)
>>             xxx;
> Yep. Taking another look at the original code, though, I'm worried that
> one reason the __set_page_dirty() actions were spread out the way they are
> is because we wanted to ensure that the ceph operations were always
> performed when PagePrivate was set.
>
Sorry, I left something out:

    Before:                             After:

    __set_page_dirty();                 __set_page_dirty();
    ceph operations;                    if (page->mapping) {
    if (page->mapping)                          SetPagePrivate;
            SetPagePrivate;                     ceph operations;
    else                                }
            undo = 1;
    if (undo)
            XXX;
I think this can ensure that ceph operations are performed together with
SetPagePrivate.
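[Editor's note: spelled out in the function itself, the ordering Sha proposes would look roughly like this. This is a sketch of the idea with the ceph-specific steps elided, not a compiled or tested change to ceph_set_page_dirty():

```c
static int ceph_set_page_dirty(struct page *page)
{
	struct address_space *mapping = page->mapping;

	if (!__set_page_dirty(page, mapping, 1))
		return 0;	/* page was already dirty */

	/* Only touch ceph state once we know the page still has a
	 * mapping, so there is nothing to undo on the truncate race. */
	if (page->mapping) {	/* race with truncate? */
		SetPagePrivate(page);
		/* ... reference snap context, bump i_wrbuffer_ref, ... */
	}

	BUG_ON(!PageDirty(page));
	return 1;
}
```
]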
> It looks like invalidatepage won't get called if private isn't set, and
> presumably it handles the truncate race with __set_page_dirty() properly
> (right?). What about writeback? Do we need to worry about writepage[s]
> getting called with a NULL page->private?
__set_page_dirty() does handle the racing condition with truncate. For
writeback, writepage[s] also take page->private into consideration,
which is done inside the specific filesystems; I notice that ceph
already handles this in ceph_writepage().
Sorry, I'm not a vfs expert and maybe I've not caught your point...
Thanks,
Sha
> Thanks!
> sage
>
>
>
>>
>>
>> Thanks,
>> Sha
>>
>>>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>> index 8b67304..d028fbe 100644
>>>> --- a/fs/ceph/addr.c
>>>> +++ b/fs/ceph/addr.c
>>>> @@ -5,6 +5,7 @@
>>>> #include<linux/mm.h>
>>>> #include<linux/pagemap.h>
>>>> #include<linux/writeback.h> /* generic_writepages */
>>>> +#include<linux/buffer_head.h>
>>>> #include<linux/slab.h>
>>>> #include<linux/pagevec.h>
>>>> #include<linux/task_io_accounting_ops.h>
>>>> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>>>> int undo = 0;
>>>> struct ceph_snap_context *snapc;
>>>>
>>>> - if (unlikely(!mapping))
>>>> - return !TestSetPageDirty(page);
>>>> -
>>>> - if (TestSetPageDirty(page)) {
>>>> - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>>>> - mapping->host, page, page->index);
>>>> + if (!__set_page_dirty(page, mapping, 1))
>>>> return 0;
>>>> - }
>>>>
>>>> inode = mapping->host;
>>>> ci = ceph_inode(inode);
>>>> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>>>> snapc, snapc->seq, snapc->num_snaps);
>>>> spin_unlock(&ci->i_ceph_lock);
>>>>
>>>> - /* now adjust page */
>>>> - spin_lock_irq(&mapping->tree_lock);
>>>> if (page->mapping) { /* Race with truncate? */
>>>> - WARN_ON_ONCE(!PageUptodate(page));
>>>> - account_page_dirtied(page, page->mapping);
>>>> - radix_tree_tag_set(&mapping->page_tree,
>>>> - page_index(page), PAGECACHE_TAG_DIRTY);
>>>> -
>>>> /*
>>>> * Reference snap context in page->private. Also set
>>>> * PagePrivate so that we get invalidatepage callback.
>>>> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>>>> undo = 1;
>>>> }
>>>>
>>>> - spin_unlock_irq(&mapping->tree_lock);
>>>> -
>>>> if (undo)
>>>> /* whoops, we failed to dirty the page */
>>>> ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>>>>
>>>> - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>>> -
>>>> BUG_ON(!PageDirty(page));
>>>> return 1;
>>>> }
>>>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>>>> index 458f497..0a331a8 100644
>>>> --- a/include/linux/buffer_head.h
>>>> +++ b/include/linux/buffer_head.h
>>>> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>>>> }
>>>>
>>>> extern int __set_page_dirty_buffers(struct page *page);
>>>> +extern int __set_page_dirty(struct page *page,
>>>> + struct address_space *mapping, int warn);
>>>>
>>>> #else /* CONFIG_BLOCK */
>>>>
>>>> --
>>>> 1.7.1
>>>>
>>
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
2012-07-04 8:11 ` Sha Zhengju
@ 2012-07-05 15:20 ` Sage Weil
2012-07-05 15:40 ` Sha Zhengju
0 siblings, 1 reply; 54+ messages in thread
From: Sage Weil @ 2012-07-05 15:20 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
ceph-devel, Sha Zhengju
On Wed, 4 Jul 2012, Sha Zhengju wrote:
> On 07/02/2012 10:49 PM, Sage Weil wrote:
> > On Mon, 2 Jul 2012, Sha Zhengju wrote:
> > > On 06/29/2012 01:21 PM, Sage Weil wrote:
> > > > On Thu, 28 Jun 2012, Sha Zhengju wrote:
> > > >
> > > > > From: Sha Zhengju<handai.szj@taobao.com>
> > > > >
> > > > > In the following patches we will treat SetPageDirty and dirty page
> > > > > accounting as one integrated operation. Filesystems are better off
> > > > > using the vfs interface directly to avoid those details.
> > > > >
> > > > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
> > > > > ---
> > > > > fs/buffer.c | 2 +-
> > > > > fs/ceph/addr.c | 20 ++------------------
> > > > > include/linux/buffer_head.h | 2 ++
> > > > > 3 files changed, 5 insertions(+), 19 deletions(-)
> > > > >
> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > > > index e8d96b8..55522dd 100644
> > > > > --- a/fs/buffer.c
> > > > > +++ b/fs/buffer.c
> > > > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> > > > > * If warn is true, then emit a warning if the page is not uptodate and has
> > > > > * not been truncated.
> > > > > */
> > > > > -static int __set_page_dirty(struct page *page,
> > > > > +int __set_page_dirty(struct page *page,
> > > > > struct address_space *mapping, int warn)
> > > > > {
> > > > > if (unlikely(!mapping))
> > > > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> > > > continue to build as a module.
> > > >
> > > > With that fixed, the ceph bits are a welcome cleanup!
> > > >
> > > > Acked-by: Sage Weil<sage@inktank.com>
> > > Further, I checked the patch again -- could it be reworked as follows to
> > > avoid the undo?
> > >
> > >     __set_page_dirty();                __set_page_dirty();
> > >     ceph operations;            ==>    if (page->mapping)
> > >     if (page->mapping)                         ceph operations;
> > >             ;
> > >     else
> > >             undo = 1;
> > >     if (undo)
> > >             xxx;
> > Yep. Taking another look at the original code, though, I'm worried that
> > one reason the __set_page_dirty() actions were spread out the way they are
> > is because we wanted to ensure that the ceph operations were always
> > performed when PagePrivate was set.
> >
>
> Sorry, I left something out:
>
>     __set_page_dirty();                __set_page_dirty();
>     ceph operations;
>     if (page->mapping)          ==>    if (page->mapping) {
>             SetPagePrivate;                    SetPagePrivate;
>     else                                       ceph operations;
>             undo = 1;                  }
>
>     if (undo)
>             XXX;
>
> I think this can ensure that ceph operations are performed together with
> SetPagePrivate.
Yeah, that looks right, as long as the ceph accounting operations happen
before SetPagePrivate. I think it's no more or less racy than before, at
least.
The patch doesn't apply without the previous ones in the series, it looks
like. Do you want to prepare a new version or should I?
Thanks!
sage
> > It looks like invalidatepage won't get called if private isn't set, and
> > presumably it handles the truncate race with __set_page_dirty() properly
> > (right?). What about writeback? Do we need to worry about writepage[s]
> > getting called with a NULL page->private?
>
> __set_page_dirty does handle the race with truncate. As for writeback,
> writepage[s] also take page->private into consideration, which is done
> inside the specific filesystems; I notice that ceph already handles
> this in ceph_writepage().
> Sorry, I'm not a vfs expert, so maybe I've missed your point...
>
>
>
> Thanks,
> Sha
>
> > Thanks!
> > sage
> >
> >
> >
> > >
> > >
> > > Thanks,
> > > Sha
> > >
> > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > > index 8b67304..d028fbe 100644
> > > > > --- a/fs/ceph/addr.c
> > > > > +++ b/fs/ceph/addr.c
> > > > > @@ -5,6 +5,7 @@
> > > > > #include<linux/mm.h>
> > > > > #include<linux/pagemap.h>
> > > > > #include<linux/writeback.h> /* generic_writepages */
> > > > > +#include<linux/buffer_head.h>
> > > > > #include<linux/slab.h>
> > > > > #include<linux/pagevec.h>
> > > > > #include<linux/task_io_accounting_ops.h>
> > > > > @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> > > > > int undo = 0;
> > > > > struct ceph_snap_context *snapc;
> > > > >
> > > > > - if (unlikely(!mapping))
> > > > > - return !TestSetPageDirty(page);
> > > > > -
> > > > > - if (TestSetPageDirty(page)) {
> > > > > - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> > > > > - mapping->host, page, page->index);
> > > > > + if (!__set_page_dirty(page, mapping, 1))
> > > > > return 0;
> > > > > - }
> > > > >
> > > > > inode = mapping->host;
> > > > > ci = ceph_inode(inode);
> > > > > @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> > > > > snapc, snapc->seq, snapc->num_snaps);
> > > > > spin_unlock(&ci->i_ceph_lock);
> > > > >
> > > > > - /* now adjust page */
> > > > > - spin_lock_irq(&mapping->tree_lock);
> > > > > if (page->mapping) { /* Race with truncate? */
> > > > > - WARN_ON_ONCE(!PageUptodate(page));
> > > > > - account_page_dirtied(page, page->mapping);
> > > > > - radix_tree_tag_set(&mapping->page_tree,
> > > > > page_index(page), PAGECACHE_TAG_DIRTY);
> > > > > -
> > > > > /*
> > > > > * Reference snap context in page->private. Also set
> > > > > * PagePrivate so that we get invalidatepage callback.
> > > > > @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> > > > > undo = 1;
> > > > > }
> > > > >
> > > > > - spin_unlock_irq(&mapping->tree_lock);
> > > > > -
> > > > > if (undo)
> > > > > /* whoops, we failed to dirty the page */
> > > > > ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> > > > >
> > > > > - __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > > > > -
> > > > > BUG_ON(!PageDirty(page));
> > > > > return 1;
> > > > > }
> > > > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > > > > index 458f497..0a331a8 100644
> > > > > --- a/include/linux/buffer_head.h
> > > > > +++ b/include/linux/buffer_head.h
> > > > > @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> > > > > }
> > > > >
> > > > > extern int __set_page_dirty_buffers(struct page *page);
> > > > > +extern int __set_page_dirty(struct page *page,
> > > > > + struct address_space *mapping, int warn);
> > > > >
> > > > > #else /* CONFIG_BLOCK */
> > > > >
> > > > > --
> > > > > 1.7.1
> > > > >
> > >
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
2012-07-05 15:20 ` Sage Weil
@ 2012-07-05 15:40 ` Sha Zhengju
0 siblings, 0 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-07-05 15:40 UTC (permalink / raw)
To: Sage Weil
Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
ceph-devel, Sha Zhengju
On Thu, Jul 5, 2012 at 11:20 PM, Sage Weil <sage@inktank.com> wrote:
> On Wed, 4 Jul 2012, Sha Zhengju wrote:
>> On 07/02/2012 10:49 PM, Sage Weil wrote:
>> > On Mon, 2 Jul 2012, Sha Zhengju wrote:
>> > > On 06/29/2012 01:21 PM, Sage Weil wrote:
>> > > > On Thu, 28 Jun 2012, Sha Zhengju wrote:
>> > > >
>> > > > > From: Sha Zhengju<handai.szj@taobao.com>
>> > > > >
>> > > > > In the following patches we will treat SetPageDirty and dirty page
>> > > > > accounting as one integrated operation. Filesystems are better off
>> > > > > using the vfs interface directly to avoid those details.
>> > > > >
>> > > > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> > > > > ---
>> > > > > fs/buffer.c | 2 +-
>> > > > > fs/ceph/addr.c | 20 ++------------------
>> > > > > include/linux/buffer_head.h | 2 ++
>> > > > > 3 files changed, 5 insertions(+), 19 deletions(-)
>> > > > >
>> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
>> > > > > index e8d96b8..55522dd 100644
>> > > > > --- a/fs/buffer.c
>> > > > > +++ b/fs/buffer.c
>> > > > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>> > > > > * If warn is true, then emit a warning if the page is not uptodate and has
>> > > > > * not been truncated.
>> > > > > */
>> > > > > -static int __set_page_dirty(struct page *page,
>> > > > > +int __set_page_dirty(struct page *page,
>> > > > > struct address_space *mapping, int warn)
>> > > > > {
>> > > > > if (unlikely(!mapping))
>> > > > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
>> > > > continue to build as a module.
>> > > >
>> > > > With that fixed, the ceph bits are a welcome cleanup!
>> > > >
>> > > > Acked-by: Sage Weil<sage@inktank.com>
>> > > Further, I checked the patch again -- could it be reworked as follows to
>> > > avoid the undo?
>> > >
>> > >     __set_page_dirty();                __set_page_dirty();
>> > >     ceph operations;            ==>    if (page->mapping)
>> > >     if (page->mapping)                         ceph operations;
>> > >             ;
>> > >     else
>> > >             undo = 1;
>> > >     if (undo)
>> > >             xxx;
>> > Yep. Taking another look at the original code, though, I'm worried that
>> > one reason the __set_page_dirty() actions were spread out the way they are
>> > is because we wanted to ensure that the ceph operations were always
>> > performed when PagePrivate was set.
>> >
>>
>> Sorry, I left something out:
>>
>>     __set_page_dirty();                __set_page_dirty();
>>     ceph operations;
>>     if (page->mapping)          ==>    if (page->mapping) {
>>             SetPagePrivate;                    SetPagePrivate;
>>     else                                       ceph operations;
>>             undo = 1;                  }
>>
>>     if (undo)
>>             XXX;
>>
>> I think this can ensure that ceph operations are performed together with
>> SetPagePrivate.
>
> Yeah, that looks right, as long as the ceph accounting operations happen
> before SetPagePrivate. I think it's no more or less racy than before, at
> least.
>
> The patch doesn't apply without the previous ones in the series, it looks
> like. Do you want to prepare a new version or should I?
>
Good. I'm doing some tests, then I'll send out a new version of the patchset;
please wait a bit. :)
Thanks,
Sha
* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
2012-06-28 11:03 ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
[not found] ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 14:27 ` Michal Hocko
1 sibling, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2012-07-04 14:27 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
linux-kernel, torvalds, viro, linux-fsdevel, sage, ceph-devel,
Sha Zhengju
On Thu 28-06-12 19:03:43, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> In the following patches we will treat SetPageDirty and dirty page accounting
> as one integrated operation. Filesystems are better off using the vfs
> interface directly to avoid those details.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
> fs/buffer.c | 2 +-
> fs/ceph/addr.c | 20 ++------------------
> include/linux/buffer_head.h | 2 ++
> 3 files changed, 5 insertions(+), 19 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> * If warn is true, then emit a warning if the page is not uptodate and has
> * not been truncated.
> */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
> struct address_space *mapping, int warn)
> {
> if (unlikely(!mapping))
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
> #include <linux/mm.h>
> #include <linux/pagemap.h>
> #include <linux/writeback.h> /* generic_writepages */
> +#include <linux/buffer_head.h>
> #include <linux/slab.h>
> #include <linux/pagevec.h>
> #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> int undo = 0;
> struct ceph_snap_context *snapc;
>
> - if (unlikely(!mapping))
> - return !TestSetPageDirty(page);
> -
> - if (TestSetPageDirty(page)) {
> - dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> - mapping->host, page, page->index);
I am not familiar with the code, but it looks like we lose information here
about something bad(?) going on?
> + if (!__set_page_dirty(page, mapping, 1))
> return 0;
> - }
>
> inode = mapping->host;
> ci = ceph_inode(inode);
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
* [PATCH 5/7] memcg: add per cgroup dirty pages accounting
[not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-06-28 10:57 ` [PATCH 1/7] memcg: update cgroup memory document Sha Zhengju
2012-06-28 11:03 ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
@ 2012-06-28 11:04 ` Sha Zhengju
2012-07-03 5:57 ` Kamezawa Hiroyuki
` (2 more replies)
2012-06-28 11:06 ` [PATCH 6/7] memcg: add per cgroup writeback " Sha Zhengju
2012-06-28 11:06 ` [PATCH 7/7] memcg: print more detailed info while memcg oom happening Sha Zhengju
4 siblings, 3 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:04 UTC (permalink / raw)
To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
This patch adds memcg routines to count dirty pages, which allows the memory
controller to maintain an accurate view of the amount of its dirty memory and
can provide some info for users while the group's direct reclaim is working.
After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
we can use 'struct page' flags to test page state instead of per-page_cgroup
flags. But memcg has a feature to move a page from one cgroup to another and
may race between "move" and "page stat accounting". So, to avoid the race, we
have designed a bigger lock:
mem_cgroup_begin_update_page_stat()
modify page information -->(a)
mem_cgroup_update_page_stat() -->(b)
mem_cgroup_end_update_page_stat()
It requires that (a) and (b) (the dirty page accounting) stay close enough together.
In the previous two preparation patches, we reworked the vfs set-page-dirty
routines, and now the interfaces are more explicit:
incrementing (2):
__set_page_dirty
__set_page_dirty_nobuffers
decrementing (2):
clear_page_dirty_for_io
cancel_dirty_page
Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
fs/buffer.c | 17 ++++++++++++++---
include/linux/memcontrol.h | 1 +
mm/filemap.c | 5 +++++
mm/memcontrol.c | 28 +++++++++++++++++++++-------
mm/page-writeback.c | 30 ++++++++++++++++++++++++------
mm/truncate.c | 6 ++++++
6 files changed, 71 insertions(+), 16 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 55522dd..d3714cc 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
int __set_page_dirty(struct page *page,
struct address_space *mapping, int warn)
{
+ bool locked;
+ unsigned long flags;
+ int ret = 0;
+
if (unlikely(!mapping))
return !TestSetPageDirty(page);
- if (TestSetPageDirty(page))
- return 0;
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
+ if (TestSetPageDirty(page)) {
+ ret = 0;
+ goto out;
+ }
spin_lock_irq(&mapping->tree_lock);
if (page->mapping) { /* Race with truncate? */
@@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
spin_unlock_irq(&mapping->tree_lock);
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
- return 1;
+ ret = 1;
+out:
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ return ret;
}
/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 20b0f2d..ad37b59 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
+ MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
MEM_CGROUP_STAT_NSTATS,
};
diff --git a/mm/filemap.c b/mm/filemap.c
index 1f19ec3..5159a49 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
* having removed the page entirely.
*/
if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+ /*
+ * Do not change page state, so no need to use mem_cgroup_
+ * {begin, end}_update_page_stat to get lock.
+ */
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index ebed1ca..90e2946 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
"rss",
"mapped_file",
"swap",
+ "dirty",
};
enum mem_cgroup_events_index {
@@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
}
#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
+static inline
+void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
+ struct mem_cgroup *to,
+ enum mem_cgroup_stat_index idx)
+{
+ /* Update stat data for mem_cgroup */
+ preempt_disable();
+ __this_cpu_dec(from->stat->count[idx]);
+ __this_cpu_inc(to->stat->count[idx]);
+ preempt_enable();
+}
+
/**
* mem_cgroup_move_account - move account of the page
* @page: the page
@@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
move_lock_mem_cgroup(from, &flags);
- if (!anon && page_mapped(page)) {
- /* Update mapped_file data for mem_cgroup */
- preempt_disable();
- __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
- preempt_enable();
- }
+ if (!anon && page_mapped(page))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_MAPPED);
+
+ if (PageDirty(page))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_DIRTY);
+
mem_cgroup_charge_statistics(from, anon, -nr_pages);
/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e5363f3..e79a2f7 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
void account_page_dirtied(struct page *page, struct address_space *mapping)
{
if (mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
__inc_zone_page_state(page, NR_FILE_DIRTY);
__inc_zone_page_state(page, NR_DIRTIED);
__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
@@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
*/
int __set_page_dirty_nobuffers(struct page *page)
{
+ bool locked;
+ unsigned long flags;
+ int ret = 0;
+
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
+
if (!TestSetPageDirty(page)) {
struct address_space *mapping = page_mapping(page);
struct address_space *mapping2;
- if (!mapping)
- return 1;
+ if (!mapping) {
+ ret = 1;
+ goto out;
+ }
spin_lock_irq(&mapping->tree_lock);
mapping2 = page_mapping(page);
@@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
/* !PageAnon && !swapper_space */
__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
}
- return 1;
+ ret = 1;
}
- return 0;
+
+out:
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ return ret;
}
EXPORT_SYMBOL(__set_page_dirty_nobuffers);
@@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
int clear_page_dirty_for_io(struct page *page)
{
struct address_space *mapping = page_mapping(page);
+ bool locked;
+ unsigned long flags;
+ int ret = 0;
BUG_ON(!PageLocked(page));
@@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
* the desired exclusion. See mm/memory.c:do_wp_page()
* for more comments.
*/
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (TestClearPageDirty(page)) {
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
- return 1;
+ ret = 1;
}
- return 0;
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
+ return ret;
}
return TestClearPageDirty(page);
}
diff --git a/mm/truncate.c b/mm/truncate.c
index 75801ac..052016a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
*/
void cancel_dirty_page(struct page *page, unsigned int account_size)
{
+ bool locked;
+ unsigned long flags;
+
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (TestClearPageDirty(page)) {
struct address_space *mapping = page->mapping;
if (mapping && mapping_cap_account_dirty(mapping)) {
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
dec_zone_page_state(page, NR_FILE_DIRTY);
dec_bdi_stat(mapping->backing_dev_info,
BDI_RECLAIMABLE);
@@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
task_io_account_cancelled_write(account_size);
}
}
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
}
EXPORT_SYMBOL(cancel_dirty_page);
--
1.7.1
* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
2012-06-28 11:04 ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Sha Zhengju
@ 2012-07-03 5:57 ` Kamezawa Hiroyuki
2012-07-08 14:45 ` Fengguang Wu
[not found] ` <1340881486-5770-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-09 21:02 ` Greg Thelen
2 siblings, 1 reply; 54+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-03 5:57 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
Sha Zhengju
(2012/06/28 20:04), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> > This patch adds memcg routines to count dirty pages, which allows the memory
> > controller to maintain an accurate view of the amount of its dirty memory and
> > can provide some info for users while the group's direct reclaim is working.
>
> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
> has a feature to move a page from a cgroup to another one and may have race between
> "move" and "page stat accounting". So in order to avoid the race we have designed a
> bigger lock:
>
> mem_cgroup_begin_update_page_stat()
> modify page information -->(a)
> mem_cgroup_update_page_stat() -->(b)
> mem_cgroup_end_update_page_stat()
>
> > It requires that (a) and (b) (the dirty page accounting) stay close enough together.
>
> > In the previous two preparation patches, we reworked the vfs set-page-dirty
> > routines, and now the interfaces are more explicit:
> incrementing (2):
> __set_page_dirty
> __set_page_dirty_nobuffers
> decrementing (2):
> clear_page_dirty_for_io
> cancel_dirty_page
>
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Thank you. This seems much cleaner than expected! Very good.
Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujtisu.com>
> ---
> fs/buffer.c | 17 ++++++++++++++---
> include/linux/memcontrol.h | 1 +
> mm/filemap.c | 5 +++++
> mm/memcontrol.c | 28 +++++++++++++++++++++-------
> mm/page-writeback.c | 30 ++++++++++++++++++++++++------
> mm/truncate.c | 6 ++++++
> 6 files changed, 71 insertions(+), 16 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> int __set_page_dirty(struct page *page,
> struct address_space *mapping, int warn)
> {
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
> +
> if (unlikely(!mapping))
> return !TestSetPageDirty(page);
>
> - if (TestSetPageDirty(page))
> - return 0;
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> + if (TestSetPageDirty(page)) {
> + ret = 0;
> + goto out;
> + }
>
> spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
> spin_unlock_irq(&mapping->tree_lock);
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> - return 1;
> + ret = 1;
> +out:
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
>
> /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> + /*
> + * Do not change page state, so no need to use mem_cgroup_
> + * {begin, end}_update_page_stat to get lock.
> + */
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
> "rss",
> "mapped_file",
> "swap",
> + "dirty",
> };
>
> enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> + struct mem_cgroup *to,
> + enum mem_cgroup_stat_index idx)
> +{
> + /* Update stat data for mem_cgroup */
> + preempt_disable();
> + __this_cpu_dec(from->stat->count[idx]);
> + __this_cpu_inc(to->stat->count[idx]);
> + preempt_enable();
> +}
> +
> /**
> * mem_cgroup_move_account - move account of the page
> * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>
> move_lock_mem_cgroup(from, &flags);
>
> - if (!anon && page_mapped(page)) {
> - /* Update mapped_file data for mem_cgroup */
> - preempt_disable();
> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - preempt_enable();
> - }
> + if (!anon && page_mapped(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_MAPPED);
> +
> + if (PageDirty(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_DIRTY);
> +
> mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
> /* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> void account_page_dirtied(struct page *page, struct address_space *mapping)
> {
> if (mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_DIRTIED);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
> */
> int __set_page_dirty_nobuffers(struct page *page)
> {
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
> +
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> if (!TestSetPageDirty(page)) {
> struct address_space *mapping = page_mapping(page);
> struct address_space *mapping2;
>
> - if (!mapping)
> - return 1;
> + if (!mapping) {
> + ret = 1;
> + goto out;
> + }
>
> spin_lock_irq(&mapping->tree_lock);
> mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
> /* !PageAnon && !swapper_space */
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> }
> - return 1;
> + ret = 1;
> }
> - return 0;
> +
> +out:
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
> EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> int clear_page_dirty_for_io(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
>
> BUG_ON(!PageLocked(page));
>
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
> * the desired exclusion. See mm/memory.c:do_wp_page()
> * for more comments.
> */
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (TestClearPageDirty(page)) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> - return 1;
> + ret = 1;
> }
> - return 0;
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
> return TestClearPageDirty(page);
> }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
> */
> void cancel_dirty_page(struct page *page, unsigned int account_size)
> {
> + bool locked;
> + unsigned long flags;
> +
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (TestClearPageDirty(page)) {
> struct address_space *mapping = page->mapping;
> if (mapping && mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
> task_io_account_cancelled_write(account_size);
> }
> }
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
> EXPORT_SYMBOL(cancel_dirty_page);
>
>
* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
2012-07-03 5:57 ` Kamezawa Hiroyuki
@ 2012-07-08 14:45 ` Fengguang Wu
0 siblings, 0 replies; 54+ messages in thread
From: Fengguang Wu @ 2012-07-08 14:45 UTC (permalink / raw)
To: Kamezawa Hiroyuki
Cc: Sha Zhengju, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
linux-kernel, Sha Zhengju
On Tue, Jul 03, 2012 at 02:57:08PM +0900, KAMEZAWA Hiroyuki wrote:
> (2012/06/28 20:04), Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> >
> > This patch adds memcg routines to count dirty pages, which allows the memory
> > controller to maintain an accurate view of the amount of its dirty memory and
> > can provide some info for users while a group's direct reclaim is working.
> >
> > After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
> > we can use a 'struct page' flag to test page state instead of a per-page_cgroup
> > flag. But memcg has a feature to move a page from one cgroup to another, which
> > may race between "move" and "page stat accounting". So in order to avoid this
> > race we have designed a bigger lock:
> >
> > mem_cgroup_begin_update_page_stat()
> > modify page information -->(a)
> > mem_cgroup_update_page_stat() -->(b)
> > mem_cgroup_end_update_page_stat()
> >
> > It requires that (a) and (b) (dirty page accounting) stay close enough together.
> >
> > In the previous two prepare patches, we have reworked the vfs set page dirty routines
> > and now the interfaces are more explicit:
> > incrementing (2):
> > __set_page_dirty
> > __set_page_dirty_nobuffers
> > decrementing (2):
> > clear_page_dirty_for_io
> > cancel_dirty_page
> >
> >
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>
> Thank you. This seems much cleaner than expected! Very good.
>
> Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
I have the same good feelings :)
Acked-by: Fengguang Wu <fengguang.wu@intel.com>
^ permalink raw reply [flat|nested] 54+ messages in thread
[parent not found: <1340881486-5770-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
[not found] ` <1340881486-5770-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 16:11 ` Michal Hocko
0 siblings, 0 replies; 54+ messages in thread
From: Michal Hocko @ 2012-07-04 16:11 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju, Alexander Viro,
linux-fsdevel-u79uwXL29TY76Z2rM5mHXA
I guess you should CC the VFS people
On Thu 28-06-12 19:04:46, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>
> This patch adds memcg routines to count dirty pages, which allows the memory
> controller to maintain an accurate view of the amount of its dirty memory and
> can provide some info for users while a group's direct reclaim is working.
>
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
> we can use a 'struct page' flag to test page state instead of a per-page_cgroup
> flag. But memcg has a feature to move a page from one cgroup to another, which
> may race between "move" and "page stat accounting". So in order to avoid this
> race we have designed a bigger lock:
>
> mem_cgroup_begin_update_page_stat()
> modify page information -->(a)
> mem_cgroup_update_page_stat() -->(b)
> mem_cgroup_end_update_page_stat()
>
> It requires that (a) and (b) (dirty page accounting) stay close enough together.
>
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
> incrementing (2):
> __set_page_dirty
> __set_page_dirty_nobuffers
> decrementing (2):
> clear_page_dirty_for_io
> cancel_dirty_page
The patch seems correct at first glance (I have to look closer, but I am
in a rush at the moment and will be back next week).
I was just thinking that memcg is enabled by most distributions these
days but not that many people actually use it. So it would probably be
good to think about how to reduce the overhead when !mem_cgroup_disabled
but no cgroups are in use, similar to what Glauber did for the kmem
accounting.
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> ---
> fs/buffer.c | 17 ++++++++++++++---
> include/linux/memcontrol.h | 1 +
> mm/filemap.c | 5 +++++
> mm/memcontrol.c | 28 +++++++++++++++++++++-------
> mm/page-writeback.c | 30 ++++++++++++++++++++++++------
> mm/truncate.c | 6 ++++++
> 6 files changed, 71 insertions(+), 16 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> int __set_page_dirty(struct page *page,
> struct address_space *mapping, int warn)
> {
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
> +
> if (unlikely(!mapping))
> return !TestSetPageDirty(page);
>
> - if (TestSetPageDirty(page))
> - return 0;
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> + if (TestSetPageDirty(page)) {
> + ret = 0;
> + goto out;
> + }
>
> spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
> spin_unlock_irq(&mapping->tree_lock);
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> - return 1;
> + ret = 1;
> +out:
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
>
> /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> + /*
> + * Do not change page state, so no need to use mem_cgroup_
> + * {begin, end}_update_page_stat to get lock.
> + */
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
> "rss",
> "mapped_file",
> "swap",
> + "dirty",
> };
>
> enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> + struct mem_cgroup *to,
> + enum mem_cgroup_stat_index idx)
> +{
> + /* Update stat data for mem_cgroup */
> + preempt_disable();
> + __this_cpu_dec(from->stat->count[idx]);
> + __this_cpu_inc(to->stat->count[idx]);
> + preempt_enable();
> +}
> +
> /**
> * mem_cgroup_move_account - move account of the page
> * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>
> move_lock_mem_cgroup(from, &flags);
>
> - if (!anon && page_mapped(page)) {
> - /* Update mapped_file data for mem_cgroup */
> - preempt_disable();
> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - preempt_enable();
> - }
> + if (!anon && page_mapped(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_MAPPED);
> +
> + if (PageDirty(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_DIRTY);
> +
> mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
> /* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> void account_page_dirtied(struct page *page, struct address_space *mapping)
> {
> if (mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_DIRTIED);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
> */
> int __set_page_dirty_nobuffers(struct page *page)
> {
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
> +
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> if (!TestSetPageDirty(page)) {
> struct address_space *mapping = page_mapping(page);
> struct address_space *mapping2;
>
> - if (!mapping)
> - return 1;
> + if (!mapping) {
> + ret = 1;
> + goto out;
> + }
>
> spin_lock_irq(&mapping->tree_lock);
> mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
> /* !PageAnon && !swapper_space */
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> }
> - return 1;
> + ret = 1;
> }
> - return 0;
> +
> +out:
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
> EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> int clear_page_dirty_for_io(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
>
> BUG_ON(!PageLocked(page));
>
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
> * the desired exclusion. See mm/memory.c:do_wp_page()
> * for more comments.
> */
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (TestClearPageDirty(page)) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> - return 1;
> + ret = 1;
> }
> - return 0;
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
> return TestClearPageDirty(page);
> }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
> */
> void cancel_dirty_page(struct page *page, unsigned int account_size)
> {
> + bool locked;
> + unsigned long flags;
> +
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (TestClearPageDirty(page)) {
> struct address_space *mapping = page->mapping;
> if (mapping && mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
> task_io_account_cancelled_write(account_size);
> }
> }
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
> EXPORT_SYMBOL(cancel_dirty_page);
>
> --
> 1.7.1
>
--
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9
Czech Republic
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
2012-06-28 11:04 ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Sha Zhengju
2012-07-03 5:57 ` Kamezawa Hiroyuki
[not found] ` <1340881486-5770-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-09 21:02 ` Greg Thelen
2012-07-11 9:32 ` Sha Zhengju
2 siblings, 1 reply; 54+ messages in thread
From: Greg Thelen @ 2012-07-09 21:02 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
linux-kernel, Sha Zhengju
On Thu, Jun 28 2012, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> This patch adds memcg routines to count dirty pages, which allows the memory
> controller to maintain an accurate view of the amount of its dirty memory and
> can provide some info for users while a group's direct reclaim is working.
>
> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
> we can use a 'struct page' flag to test page state instead of a per-page_cgroup
> flag. But memcg has a feature to move a page from one cgroup to another, which
> may race between "move" and "page stat accounting". So in order to avoid this
> race we have designed a bigger lock:
>
> mem_cgroup_begin_update_page_stat()
> modify page information -->(a)
> mem_cgroup_update_page_stat() -->(b)
> mem_cgroup_end_update_page_stat()
>
> It requires that (a) and (b) (dirty page accounting) stay close enough together.
>
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
> incrementing (2):
> __set_page_dirty
> __set_page_dirty_nobuffers
> decrementing (2):
> clear_page_dirty_for_io
> cancel_dirty_page
>
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
> fs/buffer.c | 17 ++++++++++++++---
> include/linux/memcontrol.h | 1 +
> mm/filemap.c | 5 +++++
> mm/memcontrol.c | 28 +++++++++++++++++++++-------
> mm/page-writeback.c | 30 ++++++++++++++++++++++++------
> mm/truncate.c | 6 ++++++
> 6 files changed, 71 insertions(+), 16 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> int __set_page_dirty(struct page *page,
> struct address_space *mapping, int warn)
> {
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
'= 0' and 'ret = 0' change (below) are redundant. My vote is to remove
'= 0' here.
> +
> if (unlikely(!mapping))
> return !TestSetPageDirty(page);
>
> - if (TestSetPageDirty(page))
> - return 0;
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> + if (TestSetPageDirty(page)) {
> + ret = 0;
> + goto out;
> + }
>
> spin_lock_irq(&mapping->tree_lock);
> if (page->mapping) { /* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
> spin_unlock_irq(&mapping->tree_lock);
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>
> - return 1;
> + ret = 1;
> +out:
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
>
> /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
> * having removed the page entirely.
> */
> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> + /*
> + * Do not change page state, so no need to use mem_cgroup_
> + * {begin, end}_update_page_stat to get lock.
> + */
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
I do not understand this comment. What serializes this function and
mem_cgroup_move_account()?
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> }
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
> "rss",
> "mapped_file",
> "swap",
> + "dirty",
> };
>
> enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
> }
> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> + struct mem_cgroup *to,
> + enum mem_cgroup_stat_index idx)
> +{
> + /* Update stat data for mem_cgroup */
> + preempt_disable();
> + __this_cpu_dec(from->stat->count[idx]);
> + __this_cpu_inc(to->stat->count[idx]);
> + preempt_enable();
> +}
> +
> /**
> * mem_cgroup_move_account - move account of the page
> * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>
> move_lock_mem_cgroup(from, &flags);
>
> - if (!anon && page_mapped(page)) {
> - /* Update mapped_file data for mem_cgroup */
> - preempt_disable();
> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> - preempt_enable();
> - }
> + if (!anon && page_mapped(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_MAPPED);
> +
> + if (PageDirty(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_DIRTY);
> +
> mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
> /* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
> void account_page_dirtied(struct page *page, struct address_space *mapping)
> {
> if (mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
It might be helpful to add a comment to account_page_dirtied()
indicating that caller must hold mem_cgroup_begin_update_page_stat()
lock. Extra credit for a new assertion added to
mem_cgroup_update_page_stat() confirming the needed lock is held.
> __inc_zone_page_state(page, NR_FILE_DIRTY);
> __inc_zone_page_state(page, NR_DIRTIED);
> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
> */
> int __set_page_dirty_nobuffers(struct page *page)
> {
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
> +
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
Is there a strict lock ordering which says that
mem_cgroup_begin_update_page_stat() must not be called while holding
tree_lock? If yes, then maybe we should update the 'Lock ordering'
comment in mm/filemap.c to describe the
mem_cgroup_begin_update_page_stat() lock.
> if (!TestSetPageDirty(page)) {
> struct address_space *mapping = page_mapping(page);
> struct address_space *mapping2;
>
> - if (!mapping)
> - return 1;
> + if (!mapping) {
> + ret = 1;
> + goto out;
> + }
The following seems even easier because it does not need your 'ret = 1'
change below.
+ ret = 1;
if (!mapping)
- return 1;
+ goto out;
>
> spin_lock_irq(&mapping->tree_lock);
> mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
> /* !PageAnon && !swapper_space */
> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> }
> - return 1;
> + ret = 1;
With the ret=1 change above, this can be changed to:
- return 1;
> }
> - return 0;
> +
> +out:
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
> EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
> int clear_page_dirty_for_io(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> + bool locked;
> + unsigned long flags;
> + int ret = 0;
>
> BUG_ON(!PageLocked(page));
>
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
> * the desired exclusion. See mm/memory.c:do_wp_page()
> * for more comments.
> */
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (TestClearPageDirty(page)) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> - return 1;
> + ret = 1;
> }
> - return 0;
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> + return ret;
> }
> return TestClearPageDirty(page);
> }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
> */
> void cancel_dirty_page(struct page *page, unsigned int account_size)
> {
> + bool locked;
> + unsigned long flags;
> +
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (TestClearPageDirty(page)) {
> struct address_space *mapping = page->mapping;
> if (mapping && mapping_cap_account_dirty(mapping)) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> dec_zone_page_state(page, NR_FILE_DIRTY);
> dec_bdi_stat(mapping->backing_dev_info,
> BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
> task_io_account_cancelled_write(account_size);
> }
> }
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> }
> EXPORT_SYMBOL(cancel_dirty_page);
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
2012-07-09 21:02 ` Greg Thelen
@ 2012-07-11 9:32 ` Sha Zhengju
[not found] ` <4FFD4822.4020300-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
0 siblings, 1 reply; 54+ messages in thread
From: Sha Zhengju @ 2012-07-11 9:32 UTC (permalink / raw)
To: Greg Thelen
Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
linux-kernel, Sha Zhengju
On 07/10/2012 05:02 AM, Greg Thelen wrote:
> On Thu, Jun 28 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju <handai.szj@taobao.com>
>>
>> This patch adds memcg routines to count dirty pages, which allows the memory
>> controller to maintain an accurate view of the amount of its dirty memory and
>> can provide some info for users while a group's direct reclaim is working.
>>
>> After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting),
>> we can use a 'struct page' flag to test page state instead of a per-page_cgroup
>> flag. But memcg has a feature to move a page from one cgroup to another, which
>> may race between "move" and "page stat accounting". So in order to avoid this
>> race we have designed a bigger lock:
>>
>> mem_cgroup_begin_update_page_stat()
>> modify page information -->(a)
>> mem_cgroup_update_page_stat() -->(b)
>> mem_cgroup_end_update_page_stat()
>>
>> It requires that (a) and (b) (dirty page accounting) stay close enough together.
>>
>> In the previous two prepare patches, we have reworked the vfs set page dirty routines
>> and now the interfaces are more explicit:
>> incrementing (2):
>> __set_page_dirty
>> __set_page_dirty_nobuffers
>> decrementing (2):
>> clear_page_dirty_for_io
>> cancel_dirty_page
>>
>>
>> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
>> ---
>> fs/buffer.c | 17 ++++++++++++++---
>> include/linux/memcontrol.h | 1 +
>> mm/filemap.c | 5 +++++
>> mm/memcontrol.c | 28 +++++++++++++++++++++-------
>> mm/page-writeback.c | 30 ++++++++++++++++++++++++------
>> mm/truncate.c | 6 ++++++
>> 6 files changed, 71 insertions(+), 16 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index 55522dd..d3714cc 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>> int __set_page_dirty(struct page *page,
>> struct address_space *mapping, int warn)
>> {
>> + bool locked;
>> + unsigned long flags;
>> + int ret = 0;
> '= 0' and 'ret = 0' change (below) are redundant. My vote is to remove
> '= 0' here.
>
Nice catch. :-)
>> +
>> if (unlikely(!mapping))
>> return !TestSetPageDirty(page);
>>
>> - if (TestSetPageDirty(page))
>> - return 0;
>> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
>> + if (TestSetPageDirty(page)) {
>> + ret = 0;
>> + goto out;
>> + }
>>
>> spin_lock_irq(&mapping->tree_lock);
>> if (page->mapping) { /* Race with truncate? */
>> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>> spin_unlock_irq(&mapping->tree_lock);
>> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>
>> - return 1;
>> + ret = 1;
>> +out:
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> + return ret;
>> }
>>
>> /*
>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>> index 20b0f2d..ad37b59 100644
>> --- a/include/linux/memcontrol.h
>> +++ b/include/linux/memcontrol.h
>> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>> MEM_CGROUP_STAT_RSS, /* # of pages charged as anon rss */
>> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
>> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
>> + MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
>> MEM_CGROUP_STAT_NSTATS,
>> };
>>
>> diff --git a/mm/filemap.c b/mm/filemap.c
>> index 1f19ec3..5159a49 100644
>> --- a/mm/filemap.c
>> +++ b/mm/filemap.c
>> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>> * having removed the page entirely.
>> */
>> if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
>> + /*
>> + * Do not change page state, so no need to use mem_cgroup_
>> + * {begin, end}_update_page_stat to get lock.
>> + */
>> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> I do not understand this comment. What serializes this function and
> mem_cgroup_move_account()?
>
The race exists just because the two competitors share one public
variable: one reads it while the other writes it. I thought that if both
sides (accounting and cgroup move) do not change the page flag, then
risks like double-counting (see below) would not happen:
        CPU-A                                CPU-B
Set PG_dirty
(delay)                              move_lock_mem_cgroup()
                                     if (PageDirty(page))
                                             new_memcg->nr_dirty++
                                     pc->mem_cgroup = new_memcg;
                                     move_unlock_mem_cgroup()
move_lock_mem_cgroup()
memcg = pc->mem_cgroup
new_memcg->nr_dirty++
But on second thought, it does have a problem without the lock:
        CPU-A                                CPU-B
if (PageDirty(page)) {
                                     move_lock_mem_cgroup()
                                     TestClearPageDirty(page)
                                     memcg = pc->mem_cgroup
                                     new_memcg->nr_dirty--
                                     move_unlock_mem_cgroup()
        memcg = pc->mem_cgroup
        new_memcg->nr_dirty--
}
So a race can occur against the clear-page-dirty operation.
So this time I think we need the lock again...
Kame, what about your opinion...
>> dec_zone_page_state(page, NR_FILE_DIRTY);
>> dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>> }
>> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
>> index ebed1ca..90e2946 100644
>> --- a/mm/memcontrol.c
>> +++ b/mm/memcontrol.c
>> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>> "rss",
>> "mapped_file",
>> "swap",
>> + "dirty",
>> };
>>
>> enum mem_cgroup_events_index {
>> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>> }
>> #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>>
>> +static inline
>> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
>> + struct mem_cgroup *to,
>> + enum mem_cgroup_stat_index idx)
>> +{
>> + /* Update stat data for mem_cgroup */
>> + preempt_disable();
>> + __this_cpu_dec(from->stat->count[idx]);
>> + __this_cpu_inc(to->stat->count[idx]);
>> + preempt_enable();
>> +}
>> +
>> /**
>> * mem_cgroup_move_account - move account of the page
>> * @page: the page
>> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>>
>> move_lock_mem_cgroup(from, &flags);
>>
>> - if (!anon && page_mapped(page)) {
>> - /* Update mapped_file data for mem_cgroup */
>> - preempt_disable();
>> - __this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> - __this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
>> - preempt_enable();
>> - }
>> + if (!anon && page_mapped(page))
>> + mem_cgroup_move_account_page_stat(from, to,
>> + MEM_CGROUP_STAT_FILE_MAPPED);
>> +
>> + if (PageDirty(page))
>> + mem_cgroup_move_account_page_stat(from, to,
>> + MEM_CGROUP_STAT_FILE_DIRTY);
>> +
>> mem_cgroup_charge_statistics(from, anon, -nr_pages);
>>
>> /* caller should have done css_get */
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index e5363f3..e79a2f7 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>> void account_page_dirtied(struct page *page, struct address_space *mapping)
>> {
>> if (mapping_cap_account_dirty(mapping)) {
>> + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
> It might be helpful to add a comment to account_page_dirtied()
> indicating that caller must hold mem_cgroup_begin_update_page_stat()
> lock. Extra credit for an new assertion added to
> mem_cgroup_update_page_stat() confirming the needed lock is held.
>
Got it! :-)
>> __inc_zone_page_state(page, NR_FILE_DIRTY);
>> __inc_zone_page_state(page, NR_DIRTIED);
>> __inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>> */
>> int __set_page_dirty_nobuffers(struct page *page)
>> {
>> + bool locked;
>> + unsigned long flags;
>> + int ret = 0;
>> +
>> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> +
> Is there a strict lock ordering which says that
> mem_cgroup_begin_update_page_stat() must not be called while holding
> tree_lock? If yes, then maybe we should update the 'Lock ordering'
> comment in mm/filemap.c to describe the
> mem_cgroup_begin_update_page_stat() lock.
>
I think so, yes; otherwise it may cause a deadlock. I'll update it later.
>> if (!TestSetPageDirty(page)) {
>> struct address_space *mapping = page_mapping(page);
>> struct address_space *mapping2;
>>
>> - if (!mapping)
>> - return 1;
>> + if (!mapping) {
>> + ret = 1;
>> + goto out;
>> + }
>
> The following seems even easier because it does not need your 'ret = 1'
> change below.
>
> + ret = 1;
> if (!mapping)
> - return 1;
> + goto out;
>
>
>>
>> spin_lock_irq(&mapping->tree_lock);
>> mapping2 = page_mapping(page);
>> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>> /* !PageAnon && !swapper_space */
>> __mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> }
>> - return 1;
>> + ret = 1;
> With the ret=1 change above, this can be changed to:
> - return 1;
>
Seems better.
>> }
>> - return 0;
>> +
>> +out:
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> + return ret;
>> }
>> EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>>
>> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>> int clear_page_dirty_for_io(struct page *page)
>> {
>> struct address_space *mapping = page_mapping(page);
>> + bool locked;
>> + unsigned long flags;
>> + int ret = 0;
>>
>> BUG_ON(!PageLocked(page));
>>
>> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>> * the desired exclusion. See mm/memory.c:do_wp_page()
>> * for more comments.
>> */
>> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> if (TestClearPageDirty(page)) {
>> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>> dec_zone_page_state(page, NR_FILE_DIRTY);
>> dec_bdi_stat(mapping->backing_dev_info,
>> BDI_RECLAIMABLE);
>> - return 1;
>> + ret = 1;
>> }
>> - return 0;
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> + return ret;
>> }
>> return TestClearPageDirty(page);
>> }
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 75801ac..052016a 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>> */
>> void cancel_dirty_page(struct page *page, unsigned int account_size)
>> {
>> + bool locked;
>> + unsigned long flags;
>> +
>> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> if (TestClearPageDirty(page)) {
>> struct address_space *mapping = page->mapping;
>> if (mapping && mapping_cap_account_dirty(mapping)) {
>> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>> dec_zone_page_state(page, NR_FILE_DIRTY);
>> dec_bdi_stat(mapping->backing_dev_info,
>> BDI_RECLAIMABLE);
>> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>> task_io_account_cancelled_write(account_size);
>> }
>> }
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> }
>> EXPORT_SYMBOL(cancel_dirty_page);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href="mailto:dont@kvack.org"> email@kvack.org </a>
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 6/7] memcg: add per cgroup writeback pages accounting
[not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
` (2 preceding siblings ...)
2012-06-28 11:04 ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Sha Zhengju
@ 2012-06-28 11:06 ` Sha Zhengju
[not found] ` <1340881562-5900-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-09 21:02 ` Greg Thelen
2012-06-28 11:06 ` [PATCH 7/7] memcg: print more detailed info while memcg oom happening Sha Zhengju
4 siblings, 2 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:06 UTC (permalink / raw)
To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
Similar to the dirty page accounting, we add per-cgroup writeback page accounting. The
locking rule is still:
mem_cgroup_begin_update_page_stat()
modify page WRITEBACK stat
mem_cgroup_update_page_stat()
mem_cgroup_end_update_page_stat()
There are two writeback interfaces to modify: test_clear_page_writeback() and test_set_page_writeback().
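As a rough userspace analogue of the locking rule above (the names and shapes here are hypothetical; the real mem_cgroup_begin/end_update_page_stat primitives are memcg-internal), every stat modification is bracketed by a begin/end pair, and begin() only takes the move lock on the slow path:

```c
#include <stdbool.h>

/* Hypothetical sketch of the rule: begin() takes the reader side of
 * the move lock only when an account move is in flight, the stat is
 * modified, and end() undoes exactly what begin() did. */
static bool move_in_flight;   /* set by the account-moving path */
static bool move_lock_held;   /* stand-in for the rwlock reader side */
static long nr_writeback;     /* the per-cgroup WRITEBACK stat */

static void begin_update_page_stat(bool *locked)
{
    *locked = false;
    if (move_in_flight) {     /* slow path: a move could race with us */
        move_lock_held = true;
        *locked = true;
    }
}

static void end_update_page_stat(const bool *locked)
{
    if (*locked)
        move_lock_held = false;
}

long set_page_writeback_sketch(void)
{
    bool locked;

    begin_update_page_stat(&locked);
    nr_writeback++;           /* "modify page WRITEBACK stat" */
    end_update_page_stat(&locked);
    return nr_writeback;
}
```

The locked flag returned by begin() is what lets end() stay cheap when no move was racing.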
Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
include/linux/memcontrol.h | 1 +
mm/memcontrol.c | 5 +++++
mm/page-writeback.c | 12 ++++++++++++
3 files changed, 18 insertions(+), 0 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index ad37b59..9193d93 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -39,6 +39,7 @@ enum mem_cgroup_stat_index {
MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
+ MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
MEM_CGROUP_STAT_NSTATS,
};
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 90e2946..8493119 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -83,6 +83,7 @@ static const char * const mem_cgroup_stat_names[] = {
"mapped_file",
"swap",
"dirty",
+ "writeback",
};
enum mem_cgroup_events_index {
@@ -2604,6 +2605,10 @@ static int mem_cgroup_move_account(struct page *page,
mem_cgroup_move_account_page_stat(from, to,
MEM_CGROUP_STAT_FILE_DIRTY);
+ if (PageWriteback(page))
+ mem_cgroup_move_account_page_stat(from, to,
+ MEM_CGROUP_STAT_FILE_WRITEBACK);
+
mem_cgroup_charge_statistics(from, anon, -nr_pages);
/* caller should have done css_get */
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index e79a2f7..7398836 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1981,6 +1981,7 @@ EXPORT_SYMBOL(account_page_dirtied);
*/
void account_page_writeback(struct page *page)
{
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
inc_zone_page_state(page, NR_WRITEBACK);
}
EXPORT_SYMBOL(account_page_writeback);
@@ -2214,7 +2215,10 @@ int test_clear_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
int ret;
+ bool locked;
+ unsigned long flags;
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (mapping) {
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
@@ -2235,9 +2239,12 @@ int test_clear_page_writeback(struct page *page)
ret = TestClearPageWriteback(page);
}
if (ret) {
+ mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
dec_zone_page_state(page, NR_WRITEBACK);
inc_zone_page_state(page, NR_WRITTEN);
}
+
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
return ret;
}
@@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
{
struct address_space *mapping = page_mapping(page);
int ret;
+ bool locked;
+ unsigned long flags;
+ mem_cgroup_begin_update_page_stat(page, &locked, &flags);
if (mapping) {
struct backing_dev_info *bdi = mapping->backing_dev_info;
unsigned long flags;
@@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
}
if (!ret)
account_page_writeback(page);
+
+ mem_cgroup_end_update_page_stat(page, &locked, &flags);
return ret;
}
--
1.7.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
[parent not found: <1340881562-5900-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
[not found] ` <1340881562-5900-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-08 14:53 ` Fengguang Wu
2012-07-09 3:36 ` Sha Zhengju
0 siblings, 1 reply; 54+ messages in thread
From: Fengguang Wu @ 2012-07-08 14:53 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> int ret;
> + bool locked;
> + unsigned long flags;
>
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (mapping) {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> unsigned long flags;
> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
> }
> if (!ret)
> account_page_writeback(page);
> +
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> return ret;
>
> }
Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
2012-07-08 14:53 ` Fengguang Wu
@ 2012-07-09 3:36 ` Sha Zhengju
[not found] ` <4FFA51AB.30203-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
0 siblings, 1 reply; 54+ messages in thread
From: Sha Zhengju @ 2012-07-09 3:36 UTC (permalink / raw)
To: Fengguang Wu
Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
On 07/08/2012 10:53 PM, Fengguang Wu wrote:
>> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
>> {
>> struct address_space *mapping = page_mapping(page);
>> int ret;
>> + bool locked;
>> + unsigned long flags;
>>
>> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>> if (mapping) {
>> struct backing_dev_info *bdi = mapping->backing_dev_info;
>> unsigned long flags;
>> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
>> }
>> if (!ret)
>> account_page_writeback(page);
>> +
>> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
>> return ret;
>>
>> }
> Where is the MEM_CGROUP_STAT_FILE_WRITEBACK increased?
>
It's in account_page_writeback().
void account_page_writeback(struct page *page)
{
+ mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
inc_zone_page_state(page, NR_WRITEBACK);
}
There isn't a unified interface for inc/dec writeback accounting, so
I just followed the existing convention.
Maybe we can rework account_page_writeback() to also handle the
decrement?
Thanks,
Sha
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 6/7] memcg: add per cgroup writeback pages accounting
2012-06-28 11:06 ` [PATCH 6/7] memcg: add per cgroup writeback " Sha Zhengju
[not found] ` <1340881562-5900-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-09 21:02 ` Greg Thelen
1 sibling, 0 replies; 54+ messages in thread
From: Greg Thelen @ 2012-07-09 21:02 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, kamezawa.hiroyu, yinghan, akpm, mhocko,
linux-kernel, Sha Zhengju
On Thu, Jun 28 2012, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> Similar to the dirty page accounting, we add per-cgroup writeback page accounting. The
> locking rule is still:
> mem_cgroup_begin_update_page_stat()
> modify page WRITEBACK stat
> mem_cgroup_update_page_stat()
> mem_cgroup_end_update_page_stat()
>
> There are two writeback interfaces to modify: test_clear_page_writeback() and test_set_page_writeback().
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
> include/linux/memcontrol.h | 1 +
> mm/memcontrol.c | 5 +++++
> mm/page-writeback.c | 12 ++++++++++++
> 3 files changed, 18 insertions(+), 0 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index ad37b59..9193d93 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -39,6 +39,7 @@ enum mem_cgroup_stat_index {
> MEM_CGROUP_STAT_FILE_MAPPED, /* # of pages charged as file rss */
> MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> MEM_CGROUP_STAT_FILE_DIRTY, /* # of dirty pages in page cache */
> + MEM_CGROUP_STAT_FILE_WRITEBACK, /* # of pages under writeback */
> MEM_CGROUP_STAT_NSTATS,
> };
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 90e2946..8493119 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -83,6 +83,7 @@ static const char * const mem_cgroup_stat_names[] = {
> "mapped_file",
> "swap",
> "dirty",
> + "writeback",
> };
>
> enum mem_cgroup_events_index {
> @@ -2604,6 +2605,10 @@ static int mem_cgroup_move_account(struct page *page,
> mem_cgroup_move_account_page_stat(from, to,
> MEM_CGROUP_STAT_FILE_DIRTY);
>
> + if (PageWriteback(page))
> + mem_cgroup_move_account_page_stat(from, to,
> + MEM_CGROUP_STAT_FILE_WRITEBACK);
> +
> mem_cgroup_charge_statistics(from, anon, -nr_pages);
>
> /* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e79a2f7..7398836 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1981,6 +1981,7 @@ EXPORT_SYMBOL(account_page_dirtied);
> */
> void account_page_writeback(struct page *page)
> {
> + mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
As already mentioned, I'd also like to see a comment added to
account_page_writeback() describing the new locking requirements
(specifically mem_cgroup_begin_update_page_stat being held by caller).
> inc_zone_page_state(page, NR_WRITEBACK);
> }
> EXPORT_SYMBOL(account_page_writeback);
> @@ -2214,7 +2215,10 @@ int test_clear_page_writeback(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> int ret;
> + bool locked;
> + unsigned long flags;
>
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (mapping) {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> unsigned long flags;
> @@ -2235,9 +2239,12 @@ int test_clear_page_writeback(struct page *page)
> ret = TestClearPageWriteback(page);
> }
> if (ret) {
> + mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_WRITEBACK);
> dec_zone_page_state(page, NR_WRITEBACK);
> inc_zone_page_state(page, NR_WRITTEN);
> }
> +
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> return ret;
> }
>
> @@ -2245,7 +2252,10 @@ int test_set_page_writeback(struct page *page)
> {
> struct address_space *mapping = page_mapping(page);
> int ret;
> + bool locked;
> + unsigned long flags;
>
> + mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> if (mapping) {
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> unsigned long flags;
> @@ -2272,6 +2282,8 @@ int test_set_page_writeback(struct page *page)
> }
> if (!ret)
> account_page_writeback(page);
> +
> + mem_cgroup_end_update_page_stat(page, &locked, &flags);
> return ret;
>
> }
^ permalink raw reply [flat|nested] 54+ messages in thread
* [PATCH 7/7] memcg: print more detailed info while memcg oom happening
[not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
` (3 preceding siblings ...)
2012-06-28 11:06 ` [PATCH 6/7] memcg: add per cgroup writeback " Sha Zhengju
@ 2012-06-28 11:06 ` Sha Zhengju
[not found] ` <1340881609-5935-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-04 8:29 ` Kamezawa Hiroyuki
4 siblings, 2 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:06 UTC (permalink / raw)
To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
When a memcg OOM happens, the dumped information is limited, so add this
to provide the memcg page statistics.
Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++++--------
1 files changed, 34 insertions(+), 8 deletions(-)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 8493119..3ed41e9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
"pgmajfault",
};
+static const char * const mem_cgroup_lru_names[] = {
+ "inactive_anon",
+ "active_anon",
+ "inactive_file",
+ "active_file",
+ "unevictable",
+};
+
/*
* Per memcg event counter is incremented at every pagein/pageout. With THP,
* it will be incremated by the number of pages. This counter is used for
@@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
spin_unlock_irqrestore(&memcg->move_lock, *flags);
}
+#define K(x) ((x) << (PAGE_SHIFT-10))
+static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
+{
+ int i;
+
+ printk(KERN_INFO "Memory cgroup stat:\n");
+ for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
+ if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
+ continue;
+ printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
+ K(mem_cgroup_read_stat(memcg, i)));
+ }
+
+ for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
+ printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
+ mem_cgroup_read_events(memcg, i));
+
+ for (i = 0; i < NR_LRU_LISTS; i++)
+ printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
+ K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
+ printk(KERN_CONT "\n");
+
+}
+
/**
* mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
* @memcg: The memory cgroup that went over limit
@@ -1422,6 +1454,8 @@ done:
res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
+
+ mem_cgroup_print_oom_stat(memcg);
}
/*
@@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
}
#endif /* CONFIG_NUMA */
-static const char * const mem_cgroup_lru_names[] = {
- "inactive_anon",
- "active_anon",
- "inactive_file",
- "active_file",
- "unevictable",
-};
-
static inline void mem_cgroup_lru_names_not_uptodate(void)
{
BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
--
1.7.1
^ permalink raw reply related [flat|nested] 54+ messages in thread
[parent not found: <1340881609-5935-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>]
* Re: [PATCH 7/7] memcg: print more detailed info while memcg oom happening
[not found] ` <1340881609-5935-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 8:25 ` Sha Zhengju
0 siblings, 0 replies; 54+ messages in thread
From: Sha Zhengju @ 2012-07-04 8:25 UTC (permalink / raw)
To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju
Hi, Kame
How about this bit? :-)
On 06/28/2012 07:06 PM, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
>
> When a memcg OOM happens, the dumped information is limited, so add this
> to provide the memcg page statistics.
>
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> ---
> mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++++--------
> 1 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8493119..3ed41e9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
> "pgmajfault",
> };
>
> +static const char * const mem_cgroup_lru_names[] = {
> + "inactive_anon",
> + "active_anon",
> + "inactive_file",
> + "active_file",
> + "unevictable",
> +};
> +
> /*
> * Per memcg event counter is incremented at every pagein/pageout. With THP,
> * it will be incremated by the number of pages. This counter is used for
> @@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> spin_unlock_irqrestore(&memcg->move_lock, *flags);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> + int i;
> +
> + printk(KERN_INFO "Memory cgroup stat:\n");
> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> + K(mem_cgroup_read_stat(memcg, i)));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> + mem_cgroup_read_events(memcg, i));
> +
> + for (i = 0; i < NR_LRU_LISTS; i++)
> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> + printk(KERN_CONT "\n");
> +
> +}
> +
> /**
> * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
> * @memcg: The memory cgroup that went over limit
> @@ -1422,6 +1454,8 @@ done:
> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> +
> + mem_cgroup_print_oom_stat(memcg);
> }
>
> /*
> @@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
> }
> #endif /* CONFIG_NUMA */
>
> -static const char * const mem_cgroup_lru_names[] = {
> - "inactive_anon",
> - "active_anon",
> - "inactive_file",
> - "active_file",
> - "unevictable",
> -};
> -
> static inline void mem_cgroup_lru_names_not_uptodate(void)
> {
> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
^ permalink raw reply [flat|nested] 54+ messages in thread
* Re: [PATCH 7/7] memcg: print more detailed info while memcg oom happening
2012-06-28 11:06 ` [PATCH 7/7] memcg: print more detailed info while memcg oom happening Sha Zhengju
[not found] ` <1340881609-5935-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 8:29 ` Kamezawa Hiroyuki
[not found] ` <4FF3FED6.9010700-+CUm20s59erQFUHtdCDX3A@public.gmane.org>
1 sibling, 1 reply; 54+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-04 8:29 UTC (permalink / raw)
To: Sha Zhengju
Cc: linux-mm, cgroups, gthelen, yinghan, akpm, mhocko, linux-kernel,
Sha Zhengju
(2012/06/28 20:06), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
>
> When a memcg OOM happens, the dumped information is limited, so add this
> to provide the memcg page statistics.
>
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
Could you split this into a separate series?
It seems good to me in general, but one concern is hierarchy handling.
IIUC, the passed 'memcg' is the root of the hierarchy that hit OOM.
So the LRU info, which is local to the root memcg, may not contain any useful
information. I think you should visit all memcgs under the tree.
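The concern amounts to: the OOM report should sum each stat over the whole subtree rather than read only the root's local counter. A minimal standalone sketch of that idea (hypothetical first-child/next-sibling structures, not the kernel's css iterator):

```c
#include <stddef.h>

/* Hypothetical memcg tree node in first-child/next-sibling form.
 * subtree_stat() returns a node's local stat plus that of all its
 * descendants -- the number the OOM dump would want to print. */
struct memcg {
    long nr_dirty;            /* local stat, e.g. FILE_DIRTY */
    struct memcg *child;      /* first child, or NULL */
    struct memcg *sibling;    /* next sibling, or NULL */
};

long subtree_stat(const struct memcg *root)
{
    long sum;

    if (!root)
        return 0;
    sum = root->nr_dirty;
    /* walk the children; each child's own subtree recurses */
    for (const struct memcg *c = root->child; c; c = c->sibling)
        sum += subtree_stat(c);
    return sum;
}
```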
Thanks,
-Kame
> ---
> mm/memcontrol.c | 42 ++++++++++++++++++++++++++++++++++--------
> 1 files changed, 34 insertions(+), 8 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 8493119..3ed41e9 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -101,6 +101,14 @@ static const char * const mem_cgroup_events_names[] = {
> "pgmajfault",
> };
>
> +static const char * const mem_cgroup_lru_names[] = {
> + "inactive_anon",
> + "active_anon",
> + "inactive_file",
> + "active_file",
> + "unevictable",
> +};
> +
> /*
> * Per memcg event counter is incremented at every pagein/pageout. With THP,
> * it will be incremated by the number of pages. This counter is used for
> @@ -1358,6 +1366,30 @@ static void move_unlock_mem_cgroup(struct mem_cgroup *memcg,
> spin_unlock_irqrestore(&memcg->move_lock, *flags);
> }
>
> +#define K(x) ((x) << (PAGE_SHIFT-10))
> +static void mem_cgroup_print_oom_stat(struct mem_cgroup *memcg)
> +{
> + int i;
> +
> + printk(KERN_INFO "Memory cgroup stat:\n");
> + for (i = 0; i < MEM_CGROUP_STAT_NSTATS; i++) {
> + if (i == MEM_CGROUP_STAT_SWAP && !do_swap_account)
> + continue;
> + printk(KERN_CONT "%s:%ldKB ", mem_cgroup_stat_names[i],
> + K(mem_cgroup_read_stat(memcg, i)));
> + }
> +
> + for (i = 0; i < MEM_CGROUP_EVENTS_NSTATS; i++)
> + printk(KERN_CONT "%s:%lu ", mem_cgroup_events_names[i],
> + mem_cgroup_read_events(memcg, i));
> +
> + for (i = 0; i < NR_LRU_LISTS; i++)
> + printk(KERN_CONT "%s:%luKB ", mem_cgroup_lru_names[i],
> + K(mem_cgroup_nr_lru_pages(memcg, BIT(i))));
> + printk(KERN_CONT "\n");
> +
> +}
> +
> /**
> * mem_cgroup_print_oom_info: Called from OOM with tasklist_lock held in read mode.
> * @memcg: The memory cgroup that went over limit
> @@ -1422,6 +1454,8 @@ done:
> res_counter_read_u64(&memcg->memsw, RES_USAGE) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_LIMIT) >> 10,
> res_counter_read_u64(&memcg->memsw, RES_FAILCNT));
> +
> + mem_cgroup_print_oom_stat(memcg);
> }
>
> /*
> @@ -4043,14 +4077,6 @@ static int mem_control_numa_stat_show(struct cgroup *cont, struct cftype *cft,
> }
> #endif /* CONFIG_NUMA */
>
> -static const char * const mem_cgroup_lru_names[] = {
> - "inactive_anon",
> - "active_anon",
> - "inactive_file",
> - "active_file",
> - "unevictable",
> -};
> -
> static inline void mem_cgroup_lru_names_not_uptodate(void)
> {
> BUILD_BUG_ON(ARRAY_SIZE(mem_cgroup_lru_names) != NR_LRU_LISTS);
>
^ permalink raw reply [flat|nested] 54+ messages in thread