[PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
       [not found] <1340880885-5427-1-git-send-email-handai.szj@taobao.com>
@ 2012-06-28 11:01 ` Sha Zhengju
       [not found]   ` <1340881275-5651-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
  2012-07-04 14:23   ` Michal Hocko
       [not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
       [not found] ` <1340881486-5770-1-git-send-email-handai.szj@taobao.com>
  2 siblings, 2 replies; 13+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:01 UTC (permalink / raw)
  To: linux-mm, cgroups
  Cc: kamezawa.hiroyu, gthelen, yinghan, akpm, mhocko, linux-kernel,
	torvalds, viro, linux-fsdevel, Sha Zhengju

From: Sha Zhengju <handai.szj@taobao.com>

Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
extracts TestSetPageDirty from __set_page_dirty and is far away from
account_page_dirtied.But it's better to make the two operations in one single
function to keep modular.So in order to avoid the potential race mentioned in
commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
I guess there's no deadlock between ->private_lock and ->tree_lock by quick look.

It's a prepare patch for following memcg dirty page accounting patches.

Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
---
 fs/buffer.c |   25 +++++++++++++------------
 1 files changed, 13 insertions(+), 12 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 838a9cf..e8d96b8 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,9 +610,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static void __set_page_dirty(struct page *page,
+static int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
+	if (unlikely(!mapping))
+		return !TestSetPageDirty(page);
+
+	if (TestSetPageDirty(page))
+		return 0;
+
 	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
 		WARN_ON_ONCE(warn && !PageUptodate(page));
@@ -622,6 +628,8 @@ static void __set_page_dirty(struct page *page,
 	}
 	spin_unlock_irq(&mapping->tree_lock);
 	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
+
+	return 1;
 }
 
 /*
@@ -667,11 +675,9 @@ int __set_page_dirty_buffers(struct page *page)
 			bh = bh->b_this_page;
 		} while (bh != head);
 	}
-	newly_dirty = !TestSetPageDirty(page);
+	newly_dirty = __set_page_dirty(page, mapping, 1);
 	spin_unlock(&mapping->private_lock);
 
-	if (newly_dirty)
-		__set_page_dirty(page, mapping, 1);
 	return newly_dirty;
 }
 EXPORT_SYMBOL(__set_page_dirty_buffers);
@@ -1115,14 +1121,9 @@ void mark_buffer_dirty(struct buffer_head *bh)
 			return;
 	}
 
-	if (!test_set_buffer_dirty(bh)) {
-		struct page *page = bh->b_page;
-		if (!TestSetPageDirty(page)) {
-			struct address_space *mapping = page_mapping(page);
-			if (mapping)
-				__set_page_dirty(page, mapping, 0);
-		}
-	}
+	if (!test_set_buffer_dirty(bh))
+		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
+
 }
 EXPORT_SYMBOL(mark_buffer_dirty);
 
-- 
1.7.1

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
       [not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-06-28 11:03   ` Sha Zhengju
       [not found]     ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
  2012-07-04 14:27     ` Michal Hocko
  0 siblings, 2 replies; 13+ messages in thread
From: Sha Zhengju @ 2012-06-28 11:03 UTC (permalink / raw)
  To: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA
  Cc: kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju

From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

Following we will treat SetPageDirty and dirty page accounting as an integrated
operation. Filesystems had better use vfs interface directly to avoid those details.

Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
---
 fs/buffer.c                 |    2 +-
 fs/ceph/addr.c              |   20 ++------------------
 include/linux/buffer_head.h |    2 ++
 3 files changed, 5 insertions(+), 19 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index e8d96b8..55522dd 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
  * If warn is true, then emit a warning if the page is not uptodate and has
  * not been truncated.
  */
-static int __set_page_dirty(struct page *page,
+int __set_page_dirty(struct page *page,
 		struct address_space *mapping, int warn)
 {
 	if (unlikely(!mapping))
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 8b67304..d028fbe 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -5,6 +5,7 @@
 #include <linux/mm.h>
 #include <linux/pagemap.h>
 #include <linux/writeback.h>	/* generic_writepages */
+#include <linux/buffer_head.h>
 #include <linux/slab.h>
 #include <linux/pagevec.h>
 #include <linux/task_io_accounting_ops.h>
@@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
 	int undo = 0;
 	struct ceph_snap_context *snapc;
 
-	if (unlikely(!mapping))
-		return !TestSetPageDirty(page);
-
-	if (TestSetPageDirty(page)) {
-		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
-		     mapping->host, page, page->index);
+	if (!__set_page_dirty(page, mapping, 1))
 		return 0;
-	}
 
 	inode = mapping->host;
 	ci = ceph_inode(inode);
@@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
 	     snapc, snapc->seq, snapc->num_snaps);
 	spin_unlock(&ci->i_ceph_lock);
 
-	/* now adjust page */
-	spin_lock_irq(&mapping->tree_lock);
 	if (page->mapping) {	/* Race with truncate? */
-		WARN_ON_ONCE(!PageUptodate(page));
-		account_page_dirtied(page, page->mapping);
-		radix_tree_tag_set(&mapping->page_tree,
-				page_index(page), PAGECACHE_TAG_DIRTY);
-
 		/*
 		 * Reference snap context in page->private.  Also set
 		 * PagePrivate so that we get invalidatepage callback.
@@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
 		undo = 1;
 	}
 
-	spin_unlock_irq(&mapping->tree_lock);
-
 	if (undo)
 		/* whoops, we failed to dirty the page */
 		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
 
-	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
-
 	BUG_ON(!PageDirty(page));
 	return 1;
 }
diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
index 458f497..0a331a8 100644
--- a/include/linux/buffer_head.h
+++ b/include/linux/buffer_head.h
@@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
 }
 
 extern int __set_page_dirty_buffers(struct page *page);
+extern int __set_page_dirty(struct page *page,
+		struct address_space *mapping, int warn);
 
 #else /* CONFIG_BLOCK */
 
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
       [not found]     ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-06-29  5:21       ` Sage Weil
  2012-07-02  8:10         ` Sha Zhengju
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-06-29  5:21 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, sage-BnTBU8nroG7k1uMJSBkQmQ,
	ceph-devel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju

On Thu, 28 Jun 2012, Sha Zhengju wrote:

> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/buffer.c                 |    2 +-
>  fs/ceph/addr.c              |   20 ++------------------
>  include/linux/buffer_head.h |    2 ++
>  3 files changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
>  	if (unlikely(!mapping))

This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to 
continue to build as a module.

With that fixed, the ceph bits are a welcome cleanup!

Acked-by: Sage Weil <sage-4GqslpFJ+cxBDgjK7y7TUQ@public.gmane.org>

> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/writeback.h>	/* generic_writepages */
> +#include <linux/buffer_head.h>
>  #include <linux/slab.h>
>  #include <linux/pagevec.h>
>  #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>  	int undo = 0;
>  	struct ceph_snap_context *snapc;
>  
> -	if (unlikely(!mapping))
> -		return !TestSetPageDirty(page);
> -
> -	if (TestSetPageDirty(page)) {
> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> -		     mapping->host, page, page->index);
> +	if (!__set_page_dirty(page, mapping, 1))
>  		return 0;
> -	}
>  
>  	inode = mapping->host;
>  	ci = ceph_inode(inode);
> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>  	     snapc, snapc->seq, snapc->num_snaps);
>  	spin_unlock(&ci->i_ceph_lock);
>  
> -	/* now adjust page */
> -	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
> -		WARN_ON_ONCE(!PageUptodate(page));
> -		account_page_dirtied(page, page->mapping);
> -		radix_tree_tag_set(&mapping->page_tree,
> -				page_index(page), PAGECACHE_TAG_DIRTY);
> -
>  		/*
>  		 * Reference snap context in page->private.  Also set
>  		 * PagePrivate so that we get invalidatepage callback.
> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>  		undo = 1;
>  	}
>  
> -	spin_unlock_irq(&mapping->tree_lock);
> -
>  	if (undo)
>  		/* whoops, we failed to dirty the page */
>  		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>  
> -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> -
>  	BUG_ON(!PageDirty(page));
>  	return 1;
>  }
> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> index 458f497..0a331a8 100644
> --- a/include/linux/buffer_head.h
> +++ b/include/linux/buffer_head.h
> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>  }
>  
>  extern int __set_page_dirty_buffers(struct page *page);
> +extern int __set_page_dirty(struct page *page,
> +		struct address_space *mapping, int warn);
>  
>  #else /* CONFIG_BLOCK */
>  
> -- 
> 1.7.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-06-29  5:21       ` Sage Weil
@ 2012-07-02  8:10         ` Sha Zhengju
  2012-07-02 14:49           ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Sha Zhengju @ 2012-07-02  8:10 UTC (permalink / raw)
  To: Sage Weil
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On 06/29/2012 01:21 PM, Sage Weil wrote:
> On Thu, 28 Jun 2012, Sha Zhengju wrote:
>
>> From: Sha Zhengju<handai.szj@taobao.com>
>>
>> Following we will treat SetPageDirty and dirty page accounting as an integrated
>> operation. Filesystems had better use vfs interface directly to avoid those details.
>>
>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> ---
>>   fs/buffer.c                 |    2 +-
>>   fs/ceph/addr.c              |   20 ++------------------
>>   include/linux/buffer_head.h |    2 ++
>>   3 files changed, 5 insertions(+), 19 deletions(-)
>>
>> diff --git a/fs/buffer.c b/fs/buffer.c
>> index e8d96b8..55522dd 100644
>> --- a/fs/buffer.c
>> +++ b/fs/buffer.c
>> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>    * If warn is true, then emit a warning if the page is not uptodate and has
>>    * not been truncated.
>>    */
>> -static int __set_page_dirty(struct page *page,
>> +int __set_page_dirty(struct page *page,
>>   		struct address_space *mapping, int warn)
>>   {
>>   	if (unlikely(!mapping))
> This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> continue to build as a module.
>
> With that fixed, the ceph bits are a welcome cleanup!
>
> Acked-by: Sage Weil<sage@inktank.com>

Further, I check the path again and may it be reworked as follows to 
avoid undo?

__set_page_dirty();                                            
__set_page_dirty();
ceph operations;                ==>                    if (page->mapping)
if (page->mapping)                                            ceph 
operations;
     ;
else
     undo = 1;
if (undo)
     xxx;



Thanks,
Sha

>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>> index 8b67304..d028fbe 100644
>> --- a/fs/ceph/addr.c
>> +++ b/fs/ceph/addr.c
>> @@ -5,6 +5,7 @@
>>   #include<linux/mm.h>
>>   #include<linux/pagemap.h>
>>   #include<linux/writeback.h>	/* generic_writepages */
>> +#include<linux/buffer_head.h>
>>   #include<linux/slab.h>
>>   #include<linux/pagevec.h>
>>   #include<linux/task_io_accounting_ops.h>
>> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>>   	int undo = 0;
>>   	struct ceph_snap_context *snapc;
>>
>> -	if (unlikely(!mapping))
>> -		return !TestSetPageDirty(page);
>> -
>> -	if (TestSetPageDirty(page)) {
>> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>> -		     mapping->host, page, page->index);
>> +	if (!__set_page_dirty(page, mapping, 1))
>>   		return 0;
>> -	}
>>
>>   	inode = mapping->host;
>>   	ci = ceph_inode(inode);
>> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>>   	     snapc, snapc->seq, snapc->num_snaps);
>>   	spin_unlock(&ci->i_ceph_lock);
>>
>> -	/* now adjust page */
>> -	spin_lock_irq(&mapping->tree_lock);
>>   	if (page->mapping) {	/* Race with truncate? */
>> -		WARN_ON_ONCE(!PageUptodate(page));
>> -		account_page_dirtied(page, page->mapping);
>> -		radix_tree_tag_set(&mapping->page_tree,
>> -				page_index(page), PAGECACHE_TAG_DIRTY);
>> -
>>   		/*
>>   		 * Reference snap context in page->private.  Also set
>>   		 * PagePrivate so that we get invalidatepage callback.
>> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>>   		undo = 1;
>>   	}
>>
>> -	spin_unlock_irq(&mapping->tree_lock);
>> -
>>   	if (undo)
>>   		/* whoops, we failed to dirty the page */
>>   		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>>
>> -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>> -
>>   	BUG_ON(!PageDirty(page));
>>   	return 1;
>>   }
>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>> index 458f497..0a331a8 100644
>> --- a/include/linux/buffer_head.h
>> +++ b/include/linux/buffer_head.h
>> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>>   }
>>
>>   extern int __set_page_dirty_buffers(struct page *page);
>> +extern int __set_page_dirty(struct page *page,
>> +		struct address_space *mapping, int warn);
>>
>>   #else /* CONFIG_BLOCK */
>>
>> -- 
>> 1.7.1
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
       [not found]   ` <1340881275-5651-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-02 11:14     ` Kamezawa Hiroyuki
  2012-07-07 14:42       ` Fengguang Wu
  0 siblings, 1 reply; 13+ messages in thread
From: Kamezawa Hiroyuki @ 2012-07-02 11:14 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b, mhocko-AlSwsSmVLrQ,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	torvalds-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	viro-RmSDqhL/yNMiFSDQTTA3OLVCufUGDwFn,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju

(2012/06/28 20:01), Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied.But it's better to make the two operations in one single
> function to keep modular.So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> I guess there's no deadlock between ->private_lock and ->tree_lock by quick look.
> 
> It's a prepare patch for following memcg dirty page accounting patches.
> 
> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>

I think there is no problem with the lock order.

My small concern is the impact on the performance. IIUC, lock contention here can be
seen if multiple threads write to the same file in parallel.
Do you have any numbers before/after the patch ?


Thanks,
-Kmae

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-02  8:10         ` Sha Zhengju
@ 2012-07-02 14:49           ` Sage Weil
  2012-07-04  8:11             ` Sha Zhengju
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-07-02 14:49 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Mon, 2 Jul 2012, Sha Zhengju wrote:
> On 06/29/2012 01:21 PM, Sage Weil wrote:
> > On Thu, 28 Jun 2012, Sha Zhengju wrote:
> > 
> > > From: Sha Zhengju<handai.szj@taobao.com>
> > > 
> > > Following we will treat SetPageDirty and dirty page accounting as an
> > > integrated
> > > operation. Filesystems had better use vfs interface directly to avoid
> > > those details.
> > > 
> > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
> > > ---
> > >   fs/buffer.c                 |    2 +-
> > >   fs/ceph/addr.c              |   20 ++------------------
> > >   include/linux/buffer_head.h |    2 ++
> > >   3 files changed, 5 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > index e8d96b8..55522dd 100644
> > > --- a/fs/buffer.c
> > > +++ b/fs/buffer.c
> > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> > >    * If warn is true, then emit a warning if the page is not uptodate and
> > > has
> > >    * not been truncated.
> > >    */
> > > -static int __set_page_dirty(struct page *page,
> > > +int __set_page_dirty(struct page *page,
> > >   		struct address_space *mapping, int warn)
> > >   {
> > >   	if (unlikely(!mapping))
> > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> > continue to build as a module.
> > 
> > With that fixed, the ceph bits are a welcome cleanup!
> > 
> > Acked-by: Sage Weil<sage@inktank.com>
> 
> Further, I check the path again and may it be reworked as follows to avoid
> undo?
> 
> __set_page_dirty();
> __set_page_dirty();
> ceph operations;                ==>                    if (page->mapping)
> if (page->mapping)                                            ceph operations;
>     ;
> else
>     undo = 1;
> if (undo)
>     xxx;

Yep.  Taking another look at the original code, though, I'm worried that 
one reason the __set_page_dirty() actions were spread out the way they are 
is because we wanted to ensure that the ceph operations were always 
performed when PagePrivate was set.

It looks like invalidatepage won't get called if private isn't set, and 
presumably it handles the truncate race with __set_page_dirty() properly 
(right?).  What about writeback?  Do we need to worry about writepage[s] 
getting called with a NULL page->private?

Thanks!
sage



> 
> 
> 
> Thanks,
> Sha
> 
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > index 8b67304..d028fbe 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -5,6 +5,7 @@
> > >   #include<linux/mm.h>
> > >   #include<linux/pagemap.h>
> > >   #include<linux/writeback.h>	/* generic_writepages */
> > > +#include<linux/buffer_head.h>
> > >   #include<linux/slab.h>
> > >   #include<linux/pagevec.h>
> > >   #include<linux/task_io_accounting_ops.h>
> > > @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> > >   	int undo = 0;
> > >   	struct ceph_snap_context *snapc;
> > > 
> > > -	if (unlikely(!mapping))
> > > -		return !TestSetPageDirty(page);
> > > -
> > > -	if (TestSetPageDirty(page)) {
> > > -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> > > -		     mapping->host, page, page->index);
> > > +	if (!__set_page_dirty(page, mapping, 1))
> > >   		return 0;
> > > -	}
> > > 
> > >   	inode = mapping->host;
> > >   	ci = ceph_inode(inode);
> > > @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> > >   	     snapc, snapc->seq, snapc->num_snaps);
> > >   	spin_unlock(&ci->i_ceph_lock);
> > > 
> > > -	/* now adjust page */
> > > -	spin_lock_irq(&mapping->tree_lock);
> > >   	if (page->mapping) {	/* Race with truncate? */
> > > -		WARN_ON_ONCE(!PageUptodate(page));
> > > -		account_page_dirtied(page, page->mapping);
> > > -		radix_tree_tag_set(&mapping->page_tree,
> > > -				page_index(page), PAGECACHE_TAG_DIRTY);
> > > -
> > >   		/*
> > >   		 * Reference snap context in page->private.  Also set
> > >   		 * PagePrivate so that we get invalidatepage callback.
> > > @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
> > >   		undo = 1;
> > >   	}
> > > 
> > > -	spin_unlock_irq(&mapping->tree_lock);
> > > -
> > >   	if (undo)
> > >   		/* whoops, we failed to dirty the page */
> > >   		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> > > 
> > > -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > > -
> > >   	BUG_ON(!PageDirty(page));
> > >   	return 1;
> > >   }
> > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > > index 458f497..0a331a8 100644
> > > --- a/include/linux/buffer_head.h
> > > +++ b/include/linux/buffer_head.h
> > > @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
> > >   }
> > > 
> > >   extern int __set_page_dirty_buffers(struct page *page);
> > > +extern int __set_page_dirty(struct page *page,
> > > +		struct address_space *mapping, int warn);
> > > 
> > >   #else /* CONFIG_BLOCK */
> > > 
> > > -- 
> > > 1.7.1
> > > 
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
> > > in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > 
> > > 
> 
> 

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-02 14:49           ` Sage Weil
@ 2012-07-04  8:11             ` Sha Zhengju
  2012-07-05 15:20               ` Sage Weil
  0 siblings, 1 reply; 13+ messages in thread
From: Sha Zhengju @ 2012-07-04  8:11 UTC (permalink / raw)
  To: Sage Weil
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On 07/02/2012 10:49 PM, Sage Weil wrote:
> On Mon, 2 Jul 2012, Sha Zhengju wrote:
>> On 06/29/2012 01:21 PM, Sage Weil wrote:
>>> On Thu, 28 Jun 2012, Sha Zhengju wrote:
>>>
>>>> From: Sha Zhengju<handai.szj@taobao.com>
>>>>
>>>> Following we will treat SetPageDirty and dirty page accounting as an
>>>> integrated
>>>> operation. Filesystems had better use vfs interface directly to avoid
>>>> those details.
>>>>
>>>> Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>>>> ---
>>>>    fs/buffer.c                 |    2 +-
>>>>    fs/ceph/addr.c              |   20 ++------------------
>>>>    include/linux/buffer_head.h |    2 ++
>>>>    3 files changed, 5 insertions(+), 19 deletions(-)
>>>>
>>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>>> index e8d96b8..55522dd 100644
>>>> --- a/fs/buffer.c
>>>> +++ b/fs/buffer.c
>>>> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>>>>     * If warn is true, then emit a warning if the page is not uptodate and
>>>> has
>>>>     * not been truncated.
>>>>     */
>>>> -static int __set_page_dirty(struct page *page,
>>>> +int __set_page_dirty(struct page *page,
>>>>    		struct address_space *mapping, int warn)
>>>>    {
>>>>    	if (unlikely(!mapping))
>>> This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
>>> continue to build as a module.
>>>
>>> With that fixed, the ceph bits are a welcome cleanup!
>>>
>>> Acked-by: Sage Weil<sage@inktank.com>
>> Further, I check the path again and may it be reworked as follows to avoid
>> undo?
>>
>> __set_page_dirty();
>> __set_page_dirty();
>> ceph operations;                ==>                     if (page->mapping)
>> if (page->mapping)                                            ceph operations;
>>      ;
>> else
>>      undo = 1;
>> if (undo)
>>      xxx;
> Yep.  Taking another look at the original code, though, I'm worried that
> one reason the __set_page_dirty() actions were spread out the way they are
> is because we wanted to ensure that the ceph operations were always
> performed when PagePrivate was set.
>

Sorry, I've lost something:

__set_page_dirty();                        __set_page_dirty();
ceph operations;
if(page->mapping)         ==>      if(page->mapping) {
        SetPagePrivate;                            SetPagePrivate;
else                                                      ceph operations;
     undo = 1;                                  }

if (undo)
     XXX;

I think this can ensure that ceph operations are performed together with
SetPagePrivate.

> It looks like invalidatepage won't get called if private isn't set, and
> presumably it handles the truncate race with __set_page_dirty() properly
> (right?).  What about writeback?  Do we need to worry about writepage[s]
> getting called with a NULL page->private?

__set_page_dirty does handle racing conditions with truncate and
writeback writepage[s] also take page->private into consideration
which is done inside specific filesystems. I notice that ceph has handled
this in ceph_writepage().
Sorry, not vfs expert and maybe I've not caught your point...


Thanks,
Sha

> Thanks!
> sage
>
>
>
>>
>>
>> Thanks,
>> Sha
>>
>>>> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>>>> index 8b67304..d028fbe 100644
>>>> --- a/fs/ceph/addr.c
>>>> +++ b/fs/ceph/addr.c
>>>> @@ -5,6 +5,7 @@
>>>>    #include<linux/mm.h>
>>>>    #include<linux/pagemap.h>
>>>>    #include<linux/writeback.h>	/* generic_writepages */
>>>> +#include<linux/buffer_head.h>
>>>>    #include<linux/slab.h>
>>>>    #include<linux/pagevec.h>
>>>>    #include<linux/task_io_accounting_ops.h>
>>>> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>>>>    	int undo = 0;
>>>>    	struct ceph_snap_context *snapc;
>>>>
>>>> -	if (unlikely(!mapping))
>>>> -		return !TestSetPageDirty(page);
>>>> -
>>>> -	if (TestSetPageDirty(page)) {
>>>> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
>>>> -		     mapping->host, page, page->index);
>>>> +	if (!__set_page_dirty(page, mapping, 1))
>>>>    		return 0;
>>>> -	}
>>>>
>>>>    	inode = mapping->host;
>>>>    	ci = ceph_inode(inode);
>>>> @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
>>>>    	     snapc, snapc->seq, snapc->num_snaps);
>>>>    	spin_unlock(&ci->i_ceph_lock);
>>>>
>>>> -	/* now adjust page */
>>>> -	spin_lock_irq(&mapping->tree_lock);
>>>>    	if (page->mapping) {	/* Race with truncate? */
>>>> -		WARN_ON_ONCE(!PageUptodate(page));
>>>> -		account_page_dirtied(page, page->mapping);
>>>> -		radix_tree_tag_set(&mapping->page_tree,
>>>> -				page_index(page), PAGECACHE_TAG_DIRTY);
>>>> -
>>>>    		/*
>>>>    		 * Reference snap context in page->private.  Also set
>>>>    		 * PagePrivate so that we get invalidatepage callback.
>>>> @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
>>>>    		undo = 1;
>>>>    	}
>>>>
>>>> -	spin_unlock_irq(&mapping->tree_lock);
>>>> -
>>>>    	if (undo)
>>>>    		/* whoops, we failed to dirty the page */
>>>>    		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
>>>>
>>>> -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>>>> -
>>>>    	BUG_ON(!PageDirty(page));
>>>>    	return 1;
>>>>    }
>>>> diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
>>>> index 458f497..0a331a8 100644
>>>> --- a/include/linux/buffer_head.h
>>>> +++ b/include/linux/buffer_head.h
>>>> @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
>>>>    }
>>>>
>>>>    extern int __set_page_dirty_buffers(struct page *page);
>>>> +extern int __set_page_dirty(struct page *page,
>>>> +		struct address_space *mapping, int warn);
>>>>
>>>>    #else /* CONFIG_BLOCK */
>>>>
>>>> -- 
>>>> 1.7.1
>>>>
>>>> --
>>>> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel"
>>>> in
>>>> the body of a message to majordomo@vger.kernel.org
>>>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>
>>>>
>>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
  2012-06-28 11:01 ` [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
       [not found]   ` <1340881275-5651-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 14:23   ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2012-07-04 14:23 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, torvalds, viro, linux-fsdevel, Sha Zhengju,
	Nick Piggin

[Add Nick to the CC]

On Thu 28-06-12 19:01:15, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> extracts TestSetPageDirty from __set_page_dirty and is far away from
> account_page_dirtied.But it's better to make the two operations in one single
> function to keep modular.

> So in order to avoid the potential race mentioned in
> commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.

So this is exactly what Nick suggested in the beginning
(https://lkml.org/lkml/2009/3/19/169)

> I guess there's no deadlock between ->private_lock and ->tree_lock by quick look.

Yes and mm/filemap.c says that explicitly:
 * Lock ordering:
 *
 *  ->i_mmap_mutex              (truncate_pagecache)
 *    ->private_lock            (__free_pte->__set_page_dirty_buffers)
 *      ->swap_lock             (exclusive_swap_page, others)
 *        ->mapping->tree_lock

> 
> It's a prepare patch for following memcg dirty page accounting patches.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>

The patch seems to be correct.
Reviewed-by: Michal Hocko <mhocko@suse.cz>

> ---
>  fs/buffer.c |   25 +++++++++++++------------
>  1 files changed, 13 insertions(+), 12 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 838a9cf..e8d96b8 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,9 +610,15 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static void __set_page_dirty(struct page *page,
> +static int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	if (unlikely(!mapping))
> +		return !TestSetPageDirty(page);
> +
> +	if (TestSetPageDirty(page))
> +		return 0;
> +
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
>  		WARN_ON_ONCE(warn && !PageUptodate(page));
> @@ -622,6 +628,8 @@ static void __set_page_dirty(struct page *page,
>  	}
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> +
> +	return 1;
>  }
>  
>  /*
> @@ -667,11 +675,9 @@ int __set_page_dirty_buffers(struct page *page)
>  			bh = bh->b_this_page;
>  		} while (bh != head);
>  	}
> -	newly_dirty = !TestSetPageDirty(page);
> +	newly_dirty = __set_page_dirty(page, mapping, 1);
>  	spin_unlock(&mapping->private_lock);
>  
> -	if (newly_dirty)
> -		__set_page_dirty(page, mapping, 1);
>  	return newly_dirty;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_buffers);
> @@ -1115,14 +1121,9 @@ void mark_buffer_dirty(struct buffer_head *bh)
>  			return;
>  	}
>  
> -	if (!test_set_buffer_dirty(bh)) {
> -		struct page *page = bh->b_page;
> -		if (!TestSetPageDirty(page)) {
> -			struct address_space *mapping = page_mapping(page);
> -			if (mapping)
> -				__set_page_dirty(page, mapping, 0);
> -		}
> -	}
> +	if (!test_set_buffer_dirty(bh))
> +		__set_page_dirty(bh->b_page, page_mapping(bh->b_page), 0);
> +
>  }
>  EXPORT_SYMBOL(mark_buffer_dirty);
>  
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-06-28 11:03   ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
       [not found]     ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 14:27     ` Michal Hocko
  1 sibling, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2012-07-04 14:27 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	linux-kernel, torvalds, viro, linux-fsdevel, sage, ceph-devel,
	Sha Zhengju

On Thu 28-06-12 19:03:43, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj@taobao.com>
> 
> Following we will treat SetPageDirty and dirty page accounting as an integrated
> operation. Filesystems had better use vfs interface directly to avoid those details.
> 
> Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> ---
>  fs/buffer.c                 |    2 +-
>  fs/ceph/addr.c              |   20 ++------------------
>  include/linux/buffer_head.h |    2 ++
>  3 files changed, 5 insertions(+), 19 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index e8d96b8..55522dd 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>   * If warn is true, then emit a warning if the page is not uptodate and has
>   * not been truncated.
>   */
> -static int __set_page_dirty(struct page *page,
> +int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
>  	if (unlikely(!mapping))
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index 8b67304..d028fbe 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -5,6 +5,7 @@
>  #include <linux/mm.h>
>  #include <linux/pagemap.h>
>  #include <linux/writeback.h>	/* generic_writepages */
> +#include <linux/buffer_head.h>
>  #include <linux/slab.h>
>  #include <linux/pagevec.h>
>  #include <linux/task_io_accounting_ops.h>
> @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
>  	int undo = 0;
>  	struct ceph_snap_context *snapc;
>  
> -	if (unlikely(!mapping))
> -		return !TestSetPageDirty(page);
> -
> -	if (TestSetPageDirty(page)) {
> -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
> -		     mapping->host, page, page->index);

I am not familiar with the code but this looks we loose an information
about something bad(?) is going on?

> +	if (!__set_page_dirty(page, mapping, 1))
>  		return 0;
> -	}
>  
>  	inode = mapping->host;
>  	ci = ceph_inode(inode);
-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 5/7] memcg: add per cgroup dirty pages accounting
       [not found]   ` <1340881486-5770-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
@ 2012-07-04 16:11     ` Michal Hocko
  0 siblings, 0 replies; 13+ messages in thread
From: Michal Hocko @ 2012-07-04 16:11 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm-Bw31MaZKKs3YtjvyW6yDsg, cgroups-u79uwXL29TY76Z2rM5mHXA,
	kamezawa.hiroyu-+CUm20s59erQFUHtdCDX3A,
	gthelen-hpIqsD4AKlfQT0dZR+AlfA, yinghan-hpIqsD4AKlfQT0dZR+AlfA,
	akpm-de/tnXTf+JLsfHDXvbKv3WD2FQJk+8+b,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA, Sha Zhengju, Alexander Viro,
	linux-fsdevel-u79uwXL29TY76Z2rM5mHXA

I guess you should CC vfs people

On Thu 28-06-12 19:04:46, Sha Zhengju wrote:
> From: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> 
> This patch adds memcg routines to count dirty pages, which allows memory controller
> to maintain an accurate view of the amount of its dirty memory and can provide some
> info for users while group's direct reclaim is working.
> 
> After Kame's commit 89c06bd5(memcg: use new logic for page stat accounting), we can
> use 'struct page' flag to test page state instead of per page_cgroup flag. But memcg
> has a feature to move a page from a cgroup to another one and may have race between
> "move" and "page stat accounting". So in order to avoid the race we have designed a
> bigger lock:
> 
>          mem_cgroup_begin_update_page_stat()
>          modify page information	-->(a)
>          mem_cgroup_update_page_stat()  -->(b)
>          mem_cgroup_end_update_page_stat()
> 
> It requires (a) and (b)(dirty pages accounting) can stay close enough.
> 
> In the previous two prepare patches, we have reworked the vfs set page dirty routines
> and now the interfaces are more explicit:
> 	incrementing (2):
> 		__set_page_dirty
> 		__set_page_dirty_nobuffers
> 	decrementing (2):
> 		clear_page_dirty_for_io
> 		cancel_dirty_page

The patch seems correct at first glance (I have to look closer but I am
in rush at the momemnt and will be back next week).

I was just thinking that memcg is enabled by most distributions these
days but not that many people use it. So it would be probably good to
think how to reduce an overhead if !mem_cgroup_disabled && no cgroups.

Something similar Glauber did for the kmem accounting.

> Signed-off-by: Sha Zhengju <handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
> ---
>  fs/buffer.c                |   17 ++++++++++++++---
>  include/linux/memcontrol.h |    1 +
>  mm/filemap.c               |    5 +++++
>  mm/memcontrol.c            |   28 +++++++++++++++++++++-------
>  mm/page-writeback.c        |   30 ++++++++++++++++++++++++------
>  mm/truncate.c              |    6 ++++++
>  6 files changed, 71 insertions(+), 16 deletions(-)
> 
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 55522dd..d3714cc 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -613,11 +613,19 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>  int __set_page_dirty(struct page *page,
>  		struct address_space *mapping, int warn)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
>  	if (unlikely(!mapping))
>  		return !TestSetPageDirty(page);
>  
> -	if (TestSetPageDirty(page))
> -		return 0;
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
> +	if (TestSetPageDirty(page)) {
> +		ret = 0;
> +		goto out;
> +	}
>  
>  	spin_lock_irq(&mapping->tree_lock);
>  	if (page->mapping) {	/* Race with truncate? */
> @@ -629,7 +637,10 @@ int __set_page_dirty(struct page *page,
>  	spin_unlock_irq(&mapping->tree_lock);
>  	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  
> -	return 1;
> +	ret = 1;
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  
>  /*
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 20b0f2d..ad37b59 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -38,6 +38,7 @@ enum mem_cgroup_stat_index {
>  	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
>  	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
>  	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
> +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
>  	MEM_CGROUP_STAT_NSTATS,
>  };
>  
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 1f19ec3..5159a49 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -140,6 +140,11 @@ void __delete_from_page_cache(struct page *page)
>  	 * having removed the page entirely.
>  	 */
>  	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
> +		/*
> +		 * Do not change page state, so no need to use mem_cgroup_
> +		 * {begin, end}_update_page_stat to get lock.
> +		 */
> +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		dec_zone_page_state(page, NR_FILE_DIRTY);
>  		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index ebed1ca..90e2946 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -82,6 +82,7 @@ static const char * const mem_cgroup_stat_names[] = {
>  	"rss",
>  	"mapped_file",
>  	"swap",
> +	"dirty",
>  };
>  
>  enum mem_cgroup_events_index {
> @@ -2538,6 +2539,18 @@ void mem_cgroup_split_huge_fixup(struct page *head)
>  }
>  #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
>  
> +static inline
> +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
> +					struct mem_cgroup *to,
> +					enum mem_cgroup_stat_index idx)
> +{
> +	/* Update stat data for mem_cgroup */
> +	preempt_disable();
> +	__this_cpu_dec(from->stat->count[idx]);
> +	__this_cpu_inc(to->stat->count[idx]);
> +	preempt_enable();
> +}
> +
>  /**
>   * mem_cgroup_move_account - move account of the page
>   * @page: the page
> @@ -2583,13 +2596,14 @@ static int mem_cgroup_move_account(struct page *page,
>  
>  	move_lock_mem_cgroup(from, &flags);
>  
> -	if (!anon && page_mapped(page)) {
> -		/* Update mapped_file data for mem_cgroup */
> -		preempt_disable();
> -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
> -		preempt_enable();
> -	}
> +	if (!anon && page_mapped(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_MAPPED);
> +
> +	if (PageDirty(page))
> +		mem_cgroup_move_account_page_stat(from, to,
> +				MEM_CGROUP_STAT_FILE_DIRTY);
> +
>  	mem_cgroup_charge_statistics(from, anon, -nr_pages);
>  
>  	/* caller should have done css_get */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index e5363f3..e79a2f7 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1962,6 +1962,7 @@ int __set_page_dirty_no_writeback(struct page *page)
>  void account_page_dirtied(struct page *page, struct address_space *mapping)
>  {
>  	if (mapping_cap_account_dirty(mapping)) {
> +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_FILE_DIRTY);
>  		__inc_zone_page_state(page, NR_DIRTIED);
>  		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
> @@ -2001,12 +2002,20 @@ EXPORT_SYMBOL(account_page_writeback);
>   */
>  int __set_page_dirty_nobuffers(struct page *page)
>  {
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
> +
>  	if (!TestSetPageDirty(page)) {
>  		struct address_space *mapping = page_mapping(page);
>  		struct address_space *mapping2;
>  
> -		if (!mapping)
> -			return 1;
> +		if (!mapping) {
> +			ret = 1;
> +			goto out;
> +		}
>  
>  		spin_lock_irq(&mapping->tree_lock);
>  		mapping2 = page_mapping(page);
> @@ -2022,9 +2031,12 @@ int __set_page_dirty_nobuffers(struct page *page)
>  			/* !PageAnon && !swapper_space */
>  			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
>  		}
> -		return 1;
> +		ret = 1;
>  	}
> -	return 0;
> +
> +out:
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +	return ret;
>  }
>  EXPORT_SYMBOL(__set_page_dirty_nobuffers);
>  
> @@ -2139,6 +2151,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
>  int clear_page_dirty_for_io(struct page *page)
>  {
>  	struct address_space *mapping = page_mapping(page);
> +	bool locked;
> +	unsigned long flags;
> +	int ret = 0;
>  
>  	BUG_ON(!PageLocked(page));
>  
> @@ -2180,13 +2195,16 @@ int clear_page_dirty_for_io(struct page *page)
>  		 * the desired exclusion. See mm/memory.c:do_wp_page()
>  		 * for more comments.
>  		 */
> +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  		if (TestClearPageDirty(page)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> -			return 1;
> +			ret = 1;
>  		}
> -		return 0;
> +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
> +		return ret;
>  	}
>  	return TestClearPageDirty(page);
>  }
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 75801ac..052016a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
>   */
>  void cancel_dirty_page(struct page *page, unsigned int account_size)
>  {
> +	bool locked;
> +	unsigned long flags;
> +
> +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
>  	if (TestClearPageDirty(page)) {
>  		struct address_space *mapping = page->mapping;
>  		if (mapping && mapping_cap_account_dirty(mapping)) {
> +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
>  			dec_zone_page_state(page, NR_FILE_DIRTY);
>  			dec_bdi_stat(mapping->backing_dev_info,
>  					BDI_RECLAIMABLE);
> @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
>  				task_io_account_cancelled_write(account_size);
>  		}
>  	}
> +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
>  }
>  EXPORT_SYMBOL(cancel_dirty_page);
>  
> -- 
> 1.7.1
> 

-- 
Michal Hocko
SUSE Labs
SUSE LINUX s.r.o.
Lihovarska 1060/12
190 00 Praha 9    
Czech Republic

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-04  8:11             ` Sha Zhengju
@ 2012-07-05 15:20               ` Sage Weil
  2012-07-05 15:40                 ` Sha Zhengju
  0 siblings, 1 reply; 13+ messages in thread
From: Sage Weil @ 2012-07-05 15:20 UTC (permalink / raw)
  To: Sha Zhengju
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Wed, 4 Jul 2012, Sha Zhengju wrote:
> On 07/02/2012 10:49 PM, Sage Weil wrote:
> > On Mon, 2 Jul 2012, Sha Zhengju wrote:
> > > On 06/29/2012 01:21 PM, Sage Weil wrote:
> > > > On Thu, 28 Jun 2012, Sha Zhengju wrote:
> > > > 
> > > > > From: Sha Zhengju<handai.szj@taobao.com>
> > > > > 
> > > > > Following we will treat SetPageDirty and dirty page accounting as an
> > > > > integrated
> > > > > operation. Filesystems had better use vfs interface directly to avoid
> > > > > those details.
> > > > > 
> > > > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
> > > > > ---
> > > > >    fs/buffer.c                 |    2 +-
> > > > >    fs/ceph/addr.c              |   20 ++------------------
> > > > >    include/linux/buffer_head.h |    2 ++
> > > > >    3 files changed, 5 insertions(+), 19 deletions(-)
> > > > > 
> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
> > > > > index e8d96b8..55522dd 100644
> > > > > --- a/fs/buffer.c
> > > > > +++ b/fs/buffer.c
> > > > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
> > > > >     * If warn is true, then emit a warning if the page is not uptodate
> > > > > and
> > > > > has
> > > > >     * not been truncated.
> > > > >     */
> > > > > -static int __set_page_dirty(struct page *page,
> > > > > +int __set_page_dirty(struct page *page,
> > > > >    		struct address_space *mapping, int warn)
> > > > >    {
> > > > >    	if (unlikely(!mapping))
> > > > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
> > > > continue to build as a module.
> > > > 
> > > > With that fixed, the ceph bits are a welcome cleanup!
> > > > 
> > > > Acked-by: Sage Weil<sage@inktank.com>
> > > Further, I check the path again and may it be reworked as follows to avoid
> > > undo?
> > > 
> > > __set_page_dirty();
> > > __set_page_dirty();
> > > ceph operations;                ==>                     if (page->mapping)
> > > if (page->mapping)                                            ceph
> > > operations;
> > >      ;
> > > else
> > >      undo = 1;
> > > if (undo)
> > >      xxx;
> > Yep.  Taking another look at the original code, though, I'm worried that
> > one reason the __set_page_dirty() actions were spread out the way they are
> > is because we wanted to ensure that the ceph operations were always
> > performed when PagePrivate was set.
> > 
> 
> Sorry, I've lost something:
> 
> __set_page_dirty();                        __set_page_dirty();
> ceph operations;
> if(page->mapping)         ==>      if(page->mapping) {
>        SetPagePrivate;                            SetPagePrivate;
> else                                                      ceph operations;
>     undo = 1;                                  }
> 
> if (undo)
>     XXX;
> 
> I think this can ensure that ceph operations are performed together with
> SetPagePrivate.

Yeah, that looks right, as long as the ceph accounting operations happen 
before SetPagePrivate.  I think it's no more or less racy than before, at 
least. 

The patch doesn't apply without the previous ones in the series, it looks 
like.  Do you want to prepare a new version or should I?

Thanks!
sage

 
> > It looks like invalidatepage won't get called if private isn't set, and
> > presumably it handles the truncate race with __set_page_dirty() properly
> > (right?).  What about writeback?  Do we need to worry about writepage[s]
> > getting called with a NULL page->private?
> 
> __set_page_dirty does handle racing conditions with truncate and
> writeback writepage[s] also take page->private into consideration
> which is done inside specific filesystems. I notice that ceph has handled
> this in ceph_writepage().
> Sorry, not vfs expert and maybe I've not caught your point...
>
> 
> 
> Thanks,
> Sha
> 
> > Thanks!
> > sage
> > 
> > 
> > 
> > > 
> > > 
> > > Thanks,
> > > Sha
> > > 
> > > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > > > > index 8b67304..d028fbe 100644
> > > > > --- a/fs/ceph/addr.c
> > > > > +++ b/fs/ceph/addr.c
> > > > > @@ -5,6 +5,7 @@
> > > > >    #include<linux/mm.h>
> > > > >    #include<linux/pagemap.h>
> > > > >    #include<linux/writeback.h>	/* generic_writepages */
> > > > > +#include<linux/buffer_head.h>
> > > > >    #include<linux/slab.h>
> > > > >    #include<linux/pagevec.h>
> > > > >    #include<linux/task_io_accounting_ops.h>
> > > > > @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
> > > > >    	int undo = 0;
> > > > >    	struct ceph_snap_context *snapc;
> > > > > 
> > > > > -	if (unlikely(!mapping))
> > > > > -		return !TestSetPageDirty(page);
> > > > > -
> > > > > -	if (TestSetPageDirty(page)) {
> > > > > -		dout("%p set_page_dirty %p idx %lu -- already
> > > > > dirty\n",
> > > > > -		     mapping->host, page, page->index);
> > > > > +	if (!__set_page_dirty(page, mapping, 1))
> > > > >    		return 0;
> > > > > -	}
> > > > > 
> > > > >    	inode = mapping->host;
> > > > >    	ci = ceph_inode(inode);
> > > > > @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
> > > > >    	     snapc, snapc->seq, snapc->num_snaps);
> > > > >    	spin_unlock(&ci->i_ceph_lock);
> > > > > 
> > > > > -	/* now adjust page */
> > > > > -	spin_lock_irq(&mapping->tree_lock);
> > > > >    	if (page->mapping) {	/* Race with truncate? */
> > > > > -		WARN_ON_ONCE(!PageUptodate(page));
> > > > > -		account_page_dirtied(page, page->mapping);
> > > > > -		radix_tree_tag_set(&mapping->page_tree,
> > > > > -				page_index(page),
> > > > > PAGECACHE_TAG_DIRTY);
> > > > > -
> > > > >    		/*
> > > > >    		 * Reference snap context in page->private.  Also set
> > > > >    		 * PagePrivate so that we get invalidatepage callback.
> > > > > @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page
> > > > > *page)
> > > > >    		undo = 1;
> > > > >    	}
> > > > > 
> > > > > -	spin_unlock_irq(&mapping->tree_lock);
> > > > > -
> > > > >    	if (undo)
> > > > >    		/* whoops, we failed to dirty the page */
> > > > >    		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
> > > > > 
> > > > > -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
> > > > > -
> > > > >    	BUG_ON(!PageDirty(page));
> > > > >    	return 1;
> > > > >    }
> > > > > diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
> > > > > index 458f497..0a331a8 100644
> > > > > --- a/include/linux/buffer_head.h
> > > > > +++ b/include/linux/buffer_head.h
> > > > > @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head
> > > > > *bh)
> > > > >    }
> > > > > 
> > > > >    extern int __set_page_dirty_buffers(struct page *page);
> > > > > +extern int __set_page_dirty(struct page *page,
> > > > > +		struct address_space *mapping, int warn);
> > > > > 
> > > > >    #else /* CONFIG_BLOCK */
> > > > > 
> > > > > -- 
> > > > > 1.7.1
> > > > > 
> > > > > --
> > > > > To unsubscribe from this list: send the line "unsubscribe
> > > > > linux-fsdevel"
> > > > > in
> > > > > the body of a message to majordomo@vger.kernel.org
> > > > > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > > > 
> > > > > 
> > > 
> 
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem
  2012-07-05 15:20               ` Sage Weil
@ 2012-07-05 15:40                 ` Sha Zhengju
  0 siblings, 0 replies; 13+ messages in thread
From: Sha Zhengju @ 2012-07-05 15:40 UTC (permalink / raw)
  To: Sage Weil
  Cc: linux-mm, cgroups, kamezawa.hiroyu, gthelen, yinghan, akpm,
	mhocko, linux-kernel, torvalds, viro, linux-fsdevel, sage,
	ceph-devel, Sha Zhengju

On Thu, Jul 5, 2012 at 11:20 PM, Sage Weil <sage@inktank.com> wrote:
> On Wed, 4 Jul 2012, Sha Zhengju wrote:
>> On 07/02/2012 10:49 PM, Sage Weil wrote:
>> > On Mon, 2 Jul 2012, Sha Zhengju wrote:
>> > > On 06/29/2012 01:21 PM, Sage Weil wrote:
>> > > > On Thu, 28 Jun 2012, Sha Zhengju wrote:
>> > > >
>> > > > > From: Sha Zhengju<handai.szj@taobao.com>
>> > > > >
>> > > > > Following we will treat SetPageDirty and dirty page accounting as an
>> > > > > integrated
>> > > > > operation. Filesystems had better use vfs interface directly to avoid
>> > > > > those details.
>> > > > >
>> > > > > Signed-off-by: Sha Zhengju<handai.szj@taobao.com>
>> > > > > ---
>> > > > >    fs/buffer.c                 |    2 +-
>> > > > >    fs/ceph/addr.c              |   20 ++------------------
>> > > > >    include/linux/buffer_head.h |    2 ++
>> > > > >    3 files changed, 5 insertions(+), 19 deletions(-)
>> > > > >
>> > > > > diff --git a/fs/buffer.c b/fs/buffer.c
>> > > > > index e8d96b8..55522dd 100644
>> > > > > --- a/fs/buffer.c
>> > > > > +++ b/fs/buffer.c
>> > > > > @@ -610,7 +610,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
>> > > > >     * If warn is true, then emit a warning if the page is not uptodate
>> > > > > and
>> > > > > has
>> > > > >     * not been truncated.
>> > > > >     */
>> > > > > -static int __set_page_dirty(struct page *page,
>> > > > > +int __set_page_dirty(struct page *page,
>> > > > >               struct address_space *mapping, int warn)
>> > > > >    {
>> > > > >       if (unlikely(!mapping))
>> > > > This also needs an EXPORT_SYMBOL(__set_page_dirty) to allow ceph to
>> > > > continue to build as a module.
>> > > >
>> > > > With that fixed, the ceph bits are a welcome cleanup!
>> > > >
>> > > > Acked-by: Sage Weil<sage@inktank.com>
>> > > Further, I check the path again and may it be reworked as follows to avoid
>> > > undo?
>> > >
>> > > __set_page_dirty();
>> > > __set_page_dirty();
>> > > ceph operations;                ==>                     if (page->mapping)
>> > > if (page->mapping)                                            ceph
>> > > operations;
>> > >      ;
>> > > else
>> > >      undo = 1;
>> > > if (undo)
>> > >      xxx;
>> > Yep.  Taking another look at the original code, though, I'm worried that
>> > one reason the __set_page_dirty() actions were spread out the way they are
>> > is because we wanted to ensure that the ceph operations were always
>> > performed when PagePrivate was set.
>> >
>>
>> Sorry, I've lost something:
>>
>> __set_page_dirty();                        __set_page_dirty();
>> ceph operations;
>> if(page->mapping)         ==>      if(page->mapping) {
>>        SetPagePrivate;                            SetPagePrivate;
>> else                                                      ceph operations;
>>     undo = 1;                                  }
>>
>> if (undo)
>>     XXX;
>>
>> I think this can ensure that ceph operations are performed together with
>> SetPagePrivate.
>
> Yeah, that looks right, as long as the ceph accounting operations happen
> before SetPagePrivate.  I think it's no more or less racy than before, at
> least.
>
> The patch doesn't apply without the previous ones in the series, it looks
> like.  Do you want to prepare a new version or should I?
>

Good. I'm doing some test then I'll send out a new version patchset, please
wait a bit. : )


Thanks,
Sha

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func
  2012-07-02 11:14     ` Kamezawa Hiroyuki
@ 2012-07-07 14:42       ` Fengguang Wu
  0 siblings, 0 replies; 13+ messages in thread
From: Fengguang Wu @ 2012-07-07 14:42 UTC (permalink / raw)
  To: Kamezawa Hiroyuki
  Cc: Sha Zhengju, linux-mm, cgroups, gthelen, yinghan, akpm, mhocko,
	linux-kernel, torvalds, viro, linux-fsdevel, Sha Zhengju

On Mon, Jul 02, 2012 at 08:14:02PM +0900, KAMEZAWA Hiroyuki wrote:
> (2012/06/28 20:01), Sha Zhengju wrote:
> > From: Sha Zhengju <handai.szj@taobao.com>
> > 
> > Commit a8e7d49a(Fix race in create_empty_buffers() vs __set_page_dirty_buffers())
> > extracts TestSetPageDirty from __set_page_dirty and is far away from
> > account_page_dirtied.But it's better to make the two operations in one single
> > function to keep modular.So in order to avoid the potential race mentioned in
> > commit a8e7d49a, we can hold private_lock until __set_page_dirty completes.
> > I guess there's no deadlock between ->private_lock and ->tree_lock by quick look.
> > 
> > It's a prepare patch for following memcg dirty page accounting patches.
> > 
> > Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
> 
> I think there is no problem with the lock order.

Me think so, too.

> My small concern is the impact on the performance. IIUC, lock contention here can be
> seen if multiple threads write to the same file in parallel.
> Do you have any numbers before/after the patch ?

That would be a worthwhile test. The patch moves ->tree_lock and
->i_lock into ->private_lock, these are often contented locks..

For example, in the below case of 12 hard disks, each running 1 dd
write, the ->tree_lock and ->private_lock have the top #1 and #2
contentions.

lkp-nex04/JBOD-12HDD-thresh=1000M/ext4-1dd-1-3.3.0/lock_stat
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
                              class name    con-bounces    contentions   waittime-min   waittime-max waittime-total    acq-bounces   acquisitions   holdtime-min   holdtime-max holdtime-total
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

           &(&mapping->tree_lock)->rlock:      18629034       19138284           0.09        1029.32    24353812.07       49650988      482883410           0.11         186.88   260706119.09
           -----------------------------
           &(&mapping->tree_lock)->rlock            783          [<ffffffff81109267>] tag_pages_for_writeback+0x2b/0x9d
           &(&mapping->tree_lock)->rlock        3195817          [<ffffffff81100d6c>] add_to_page_cache_locked+0xa3/0x119
           &(&mapping->tree_lock)->rlock        3863710          [<ffffffff81108df7>] test_set_page_writeback+0x63/0x140
           &(&mapping->tree_lock)->rlock        3311518          [<ffffffff81172ade>] __set_page_dirty+0x25/0xa5
           -----------------------------
           &(&mapping->tree_lock)->rlock        3450725          [<ffffffff81100d6c>] add_to_page_cache_locked+0xa3/0x119
           &(&mapping->tree_lock)->rlock        3225542          [<ffffffff81172ade>] __set_page_dirty+0x25/0xa5
           &(&mapping->tree_lock)->rlock        2241958          [<ffffffff81108df7>] test_set_page_writeback+0x63/0x140
           &(&mapping->tree_lock)->rlock        7339603          [<ffffffff8110ac33>] test_clear_page_writeback+0x64/0x155

...............................................................................................................................................................................................

        &(&mapping->private_lock)->rlock:       1165199        1191201           0.11        2843.25     1621608.38       13341420      152761848           0.10        3727.92    33559035.07
        --------------------------------
        &(&mapping->private_lock)->rlock              1          [<ffffffff81172913>] __find_get_block_slow+0x5a/0x135
        &(&mapping->private_lock)->rlock         385576          [<ffffffff811735d6>] create_empty_buffers+0x48/0xbf
        &(&mapping->private_lock)->rlock         805624          [<ffffffff8117346d>] try_to_free_buffers+0x57/0xaa
        --------------------------------
        &(&mapping->private_lock)->rlock              1          [<ffffffff811746dd>] __getblk+0x1b8/0x257
        &(&mapping->private_lock)->rlock         952718          [<ffffffff8117346d>] try_to_free_buffers+0x57/0xaa
        &(&mapping->private_lock)->rlock         238482          [<ffffffff811735d6>] create_empty_buffers+0x48/0xbf

Thanks,
Fengguang

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2012-07-07 14:42 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <1340880885-5427-1-git-send-email-handai.szj@taobao.com>
2012-06-28 11:01 ` [PATCH 3/7] Make TestSetPageDirty and dirty page accounting in one func Sha Zhengju
     [not found]   ` <1340881275-5651-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-02 11:14     ` Kamezawa Hiroyuki
2012-07-07 14:42       ` Fengguang Wu
2012-07-04 14:23   ` Michal Hocko
     [not found] ` <1340880885-5427-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-06-28 11:03   ` [PATCH 4/7] Use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
     [not found]     ` <1340881423-5703-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-06-29  5:21       ` Sage Weil
2012-07-02  8:10         ` Sha Zhengju
2012-07-02 14:49           ` Sage Weil
2012-07-04  8:11             ` Sha Zhengju
2012-07-05 15:20               ` Sage Weil
2012-07-05 15:40                 ` Sha Zhengju
2012-07-04 14:27     ` Michal Hocko
     [not found] ` <1340881486-5770-1-git-send-email-handai.szj@taobao.com>
     [not found]   ` <1340881486-5770-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
2012-07-04 16:11     ` [PATCH 5/7] memcg: add per cgroup dirty pages accounting Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).