linux-fsdevel.vger.kernel.org archive mirror
  • * [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem
           [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com>
           [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
    @ 2012-12-25 17:24 ` Sha Zhengju
           [not found]   ` <1356456261-14579-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
      2012-12-25 17:26 ` [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting Sha Zhengju
      2 siblings, 1 reply; 27+ messages in thread
    From: Sha Zhengju @ 2012-12-25 17:24 UTC (permalink / raw)
      To: linux-kernel, cgroups, linux-mm, linux-fsdevel, ceph-devel
      Cc: sage, dchinner, mhocko, akpm, kamezawa.hiroyu, gthelen,
    	fengguang.wu, glommer, Sha Zhengju
    
    From: Sha Zhengju <handai.szj@taobao.com>
    
    In the following patches, SetPageDirty and dirty page accounting will be treated as
    one integrated operation. Filesystems should use the VFS interface directly rather
    than open-coding those details.
    
    Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
    Acked-by: Sage Weil <sage@inktank.com>
    ---
     fs/buffer.c                 |    3 ++-
     fs/ceph/addr.c              |   20 ++------------------
     include/linux/buffer_head.h |    2 ++
     3 files changed, 6 insertions(+), 19 deletions(-)
    
    diff --git a/fs/buffer.c b/fs/buffer.c
    index 3b032b9..762168a 100644
    --- a/fs/buffer.c
    +++ b/fs/buffer.c
    @@ -609,7 +609,7 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
      * If warn is true, then emit a warning if the page is not uptodate and has
      * not been truncated.
      */
    -static int __set_page_dirty(struct page *page,
    +int __set_page_dirty(struct page *page,
     		struct address_space *mapping, int warn)
     {
     	if (unlikely(!mapping))
    @@ -630,6 +630,7 @@ static int __set_page_dirty(struct page *page,
     
     	return 1;
     }
    +EXPORT_SYMBOL(__set_page_dirty);
     
     /*
      * Add a page to the dirty page list.
    diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
    index 6690269..f2779b8 100644
    --- a/fs/ceph/addr.c
    +++ b/fs/ceph/addr.c
    @@ -5,6 +5,7 @@
     #include <linux/mm.h>
     #include <linux/pagemap.h>
     #include <linux/writeback.h>	/* generic_writepages */
    +#include <linux/buffer_head.h>
     #include <linux/slab.h>
     #include <linux/pagevec.h>
     #include <linux/task_io_accounting_ops.h>
    @@ -73,14 +74,8 @@ static int ceph_set_page_dirty(struct page *page)
     	int undo = 0;
     	struct ceph_snap_context *snapc;
     
    -	if (unlikely(!mapping))
    -		return !TestSetPageDirty(page);
    -
    -	if (TestSetPageDirty(page)) {
    -		dout("%p set_page_dirty %p idx %lu -- already dirty\n",
    -		     mapping->host, page, page->index);
    +	if (!__set_page_dirty(page, mapping, 1))
     		return 0;
    -	}
     
     	inode = mapping->host;
     	ci = ceph_inode(inode);
    @@ -107,14 +102,7 @@ static int ceph_set_page_dirty(struct page *page)
     	     snapc, snapc->seq, snapc->num_snaps);
     	spin_unlock(&ci->i_ceph_lock);
     
    -	/* now adjust page */
    -	spin_lock_irq(&mapping->tree_lock);
     	if (page->mapping) {	/* Race with truncate? */
    -		WARN_ON_ONCE(!PageUptodate(page));
    -		account_page_dirtied(page, page->mapping);
    -		radix_tree_tag_set(&mapping->page_tree,
    -				page_index(page), PAGECACHE_TAG_DIRTY);
    -
     		/*
     		 * Reference snap context in page->private.  Also set
     		 * PagePrivate so that we get invalidatepage callback.
    @@ -126,14 +114,10 @@ static int ceph_set_page_dirty(struct page *page)
     		undo = 1;
     	}
     
    -	spin_unlock_irq(&mapping->tree_lock);
    -
     	if (undo)
     		/* whoops, we failed to dirty the page */
     		ceph_put_wrbuffer_cap_refs(ci, 1, snapc);
     
    -	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
    -
     	BUG_ON(!PageDirty(page));
     	return 1;
     }
    diff --git a/include/linux/buffer_head.h b/include/linux/buffer_head.h
    index 458f497..0a331a8 100644
    --- a/include/linux/buffer_head.h
    +++ b/include/linux/buffer_head.h
    @@ -336,6 +336,8 @@ static inline void lock_buffer(struct buffer_head *bh)
     }
     
     extern int __set_page_dirty_buffers(struct page *page);
    +extern int __set_page_dirty(struct page *page,
    +		struct address_space *mapping, int warn);
     
     #else /* CONFIG_BLOCK */
     
    -- 
    1.7.9.5
    
    
  • * [PATCH V3 4/8] memcg: add per cgroup dirty pages accounting
           [not found] <1356455919-14445-1-git-send-email-handai.szj@taobao.com>
           [not found] ` <1356455919-14445-1-git-send-email-handai.szj-3b8fjiQLQpfQT0dZR+AlfA@public.gmane.org>
      2012-12-25 17:24 ` [PATCH V3 3/8] use vfs __set_page_dirty interface instead of doing it inside filesystem Sha Zhengju
    @ 2012-12-25 17:26 ` Sha Zhengju
      2013-01-02 10:44   ` Michal Hocko
      2013-01-06 20:07   ` Greg Thelen
      2 siblings, 2 replies; 27+ messages in thread
    From: Sha Zhengju @ 2012-12-25 17:26 UTC (permalink / raw)
      To: linux-kernel, cgroups, linux-mm, linux-fsdevel
      Cc: mhocko, akpm, kamezawa.hiroyu, gthelen, fengguang.wu, glommer,
    	dchinner, Sha Zhengju
    
    From: Sha Zhengju <handai.szj@taobao.com>
    
    This patch adds memcg routines to count dirty pages, which allows the memory
    controller to maintain an accurate view of the amount of its dirty memory, and
    which can provide useful information to users while a cgroup's direct reclaim
    is working.
    
    After Kame's commit 89c06bd5 (memcg: use new logic for page stat accounting), we can
    use 'struct page' flags to test page state instead of per-page_cgroup flags. But memcg
    can move a page from one cgroup to another, so there may be a race between "move" and
    "page stat accounting". In order to avoid that race we have designed a bigger lock:
    
             mem_cgroup_begin_update_page_stat()
             modify page information        -->(a)
             mem_cgroup_update_page_stat()  -->(b)
             mem_cgroup_end_update_page_stat()
    It requires that (a) and (b) (dirty page accounting) stay close enough together.
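    A rough userspace sketch of this protocol, with a pthread rwlock standing in for
    memcg->move_lock (the real kernel code is more subtle -- updaters only take the lock
    while a move is in flight -- so every name and simplification below is illustrative,
    not the kernel's actual API):

    ```c
    #include <pthread.h>
    #include <stdio.h>

    static pthread_rwlock_t move_lock = PTHREAD_RWLOCK_INITIALIZER;
    static long dirty_stat[2];   /* per-"cgroup" dirty counters */
    static int page_cgroup = 0;  /* which cgroup the page is charged to */
    static int page_dirty = 0;

    /* (a) and (b) run inside one begin/end section, so the mover
     * cannot slip in between modifying the page and accounting it. */
    static void set_page_dirty(void)
    {
        pthread_rwlock_rdlock(&move_lock);   /* begin_update_page_stat */
        if (!page_dirty) {
            page_dirty = 1;                  /* (a) modify page information */
            dirty_stat[page_cgroup]++;       /* (b) update_page_stat */
        }
        pthread_rwlock_unlock(&move_lock);   /* end_update_page_stat */
    }

    /* "move": re-charge the page to another cgroup and fix up the stats;
     * the write side excludes all concurrent stat updates. */
    static void move_page(int to)
    {
        pthread_rwlock_wrlock(&move_lock);
        if (page_dirty) {
            dirty_stat[page_cgroup]--;
            dirty_stat[to]++;
        }
        page_cgroup = to;
        pthread_rwlock_unlock(&move_lock);
    }

    int main(void)
    {
        set_page_dirty();
        move_page(1);
        /* the dirty count follows the page to its new cgroup */
        printf("%ld %ld\n", dirty_stat[0], dirty_stat[1]); /* prints "0 1" */
        return 0;
    }
    ```

    Because the updater holds the read side across both (a) and (b), the mover always
    observes either "not dirty, not accounted" or "dirty and accounted", never a
    half-done state.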
    In the two preceding preparation patches, we reworked the VFS set-page-dirty routines,
    and the interfaces are now more explicit:
            incrementing (2):
                    __set_page_dirty
                    __set_page_dirty_nobuffers
            decrementing (2):
                    clear_page_dirty_for_io
                    cancel_dirty_page
    
    To prevent the AB/BA deadlock pointed out by Greg Thelen in the previous version
    (https://lkml.org/lkml/2012/7/30/227), we adjust the lock order to:
    ->private_lock --> mapping->tree_lock --> memcg->move_lock.
    So mapping->tree_lock must now be taken before TestSetPageDirty in __set_page_dirty()
    and __set_page_dirty_nobuffers(). To avoid useless spinlock contention, a preliminary
    PageDirty() check is added.
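    Why a single consistent acquisition order rules out the AB/BA deadlock can be shown
    with a small userspace sketch (plain pthread mutexes stand in for tree_lock and
    move_lock; the names and scenario are illustrative assumptions, not kernel code):

    ```c
    #include <pthread.h>
    #include <stdio.h>

    /* Stand-ins for mapping->tree_lock and memcg->move_lock. */
    static pthread_mutex_t tree_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;
    static long nr_dirty;  /* stand-in for the per-memcg dirty counter */

    static void *dirty_one_page(void *arg)
    {
        /* Every path acquires in the same order: tree_lock, then
         * move_lock. With one global order, no thread can hold B
         * while waiting for A, so AB/BA deadlock is impossible. */
        pthread_mutex_lock(&tree_lock);
        pthread_mutex_lock(&move_lock);
        nr_dirty++;                      /* the "page stat accounting" step */
        pthread_mutex_unlock(&move_lock);
        pthread_mutex_unlock(&tree_lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[4];
        for (int i = 0; i < 4; i++)
            pthread_create(&t[i], NULL, dirty_one_page, NULL);
        for (int i = 0; i < 4; i++)
            pthread_join(t[i], NULL);
        printf("%ld\n", nr_dirty);       /* prints 4 */
        return 0;
    }
    ```

    The kernel change follows the same idea: once every dirtying path takes
    tree_lock before move_lock, the ordering Greg Thelen's report required is
    globally consistent.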
    
    
    Signed-off-by: Sha Zhengju <handai.szj@taobao.com>
    Acked-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Acked-by: Fengguang Wu <fengguang.wu@intel.com>
    ---
     fs/buffer.c                |   14 +++++++++++++-
     include/linux/memcontrol.h |    1 +
     mm/filemap.c               |   10 ++++++++++
     mm/memcontrol.c            |   29 ++++++++++++++++++++++-------
     mm/page-writeback.c        |   39 ++++++++++++++++++++++++++++++++-------
     mm/truncate.c              |    6 ++++++
     6 files changed, 84 insertions(+), 15 deletions(-)
    
    diff --git a/fs/buffer.c b/fs/buffer.c
    index 762168a..53402d2 100644
    --- a/fs/buffer.c
    +++ b/fs/buffer.c
    @@ -612,19 +612,31 @@ EXPORT_SYMBOL(mark_buffer_dirty_inode);
     int __set_page_dirty(struct page *page,
     		struct address_space *mapping, int warn)
     {
    +	bool locked;
    +	unsigned long flags;
    +
     	if (unlikely(!mapping))
     		return !TestSetPageDirty(page);
     
    -	if (TestSetPageDirty(page))
    +	if (PageDirty(page))
     		return 0;
     
     	spin_lock_irq(&mapping->tree_lock);
    +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    +
    +	if (TestSetPageDirty(page)) {
    +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
    +		spin_unlock_irq(&mapping->tree_lock);
    +		return 0;
    +	}
    +
     	if (page->mapping) {	/* Race with truncate? */
     		WARN_ON_ONCE(warn && !PageUptodate(page));
     		account_page_dirtied(page, mapping);
     		radix_tree_tag_set(&mapping->page_tree,
     				page_index(page), PAGECACHE_TAG_DIRTY);
     	}
    +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
     	spin_unlock_irq(&mapping->tree_lock);
     	__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
     
    diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
    index 5421b8a..2685d8a 100644
    --- a/include/linux/memcontrol.h
    +++ b/include/linux/memcontrol.h
    @@ -44,6 +44,7 @@ enum mem_cgroup_stat_index {
     	MEM_CGROUP_STAT_RSS,	   /* # of pages charged as anon rss */
     	MEM_CGROUP_STAT_FILE_MAPPED,  /* # of pages charged as file rss */
     	MEM_CGROUP_STAT_SWAP, /* # of pages, swapped out */
    +	MEM_CGROUP_STAT_FILE_DIRTY,  /* # of dirty pages in page cache */
     	MEM_CGROUP_STAT_NSTATS,
     };
     
    diff --git a/mm/filemap.c b/mm/filemap.c
    index 83efee7..b589be5 100644
    --- a/mm/filemap.c
    +++ b/mm/filemap.c
    @@ -62,6 +62,11 @@
      *      ->swap_lock		(exclusive_swap_page, others)
      *        ->mapping->tree_lock
      *
    + *    ->private_lock		(__set_page_dirty_buffers)
    + *      ->mapping->tree_lock
    + *        ->memcg->move_lock	(mem_cgroup_begin_update_page_stat->
    + *							move_lock_mem_cgroup)
    + *
      *  ->i_mutex
      *    ->i_mmap_mutex		(truncate->unmap_mapping_range)
      *
    @@ -112,6 +117,8 @@
     void __delete_from_page_cache(struct page *page)
     {
     	struct address_space *mapping = page->mapping;
    +	bool locked;
    +	unsigned long flags;
     
     	/*
     	 * if we're uptodate, flush out into the cleancache, otherwise
    @@ -139,10 +146,13 @@ void __delete_from_page_cache(struct page *page)
     	 * Fix it up by doing a final dirty accounting check after
     	 * having removed the page entirely.
     	 */
    +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
     	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
    +		mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
     		dec_zone_page_state(page, NR_FILE_DIRTY);
     		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
     	}
    +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
     }
     
     /**
    diff --git a/mm/memcontrol.c b/mm/memcontrol.c
    index d450c04..c884640 100644
    --- a/mm/memcontrol.c
    +++ b/mm/memcontrol.c
    @@ -95,6 +95,7 @@ static const char * const mem_cgroup_stat_names[] = {
     	"rss",
     	"mapped_file",
     	"swap",
    +	"dirty",
     };
     
     enum mem_cgroup_events_index {
    @@ -3609,6 +3610,19 @@ void mem_cgroup_split_huge_fixup(struct page *head)
     }
     #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
     
    +static inline
    +void mem_cgroup_move_account_page_stat(struct mem_cgroup *from,
    +					struct mem_cgroup *to,
    +					unsigned int nr_pages,
    +					enum mem_cgroup_stat_index idx)
    +{
    +	/* Update stat data for mem_cgroup */
    +	preempt_disable();
    +	__this_cpu_add(from->stat->count[idx], -nr_pages);
    +	__this_cpu_add(to->stat->count[idx], nr_pages);
    +	preempt_enable();
    +}
    +
     /**
      * mem_cgroup_move_account - move account of the page
      * @page: the page
    @@ -3654,13 +3668,14 @@ static int mem_cgroup_move_account(struct page *page,
     
     	move_lock_mem_cgroup(from, &flags);
     
    -	if (!anon && page_mapped(page)) {
    -		/* Update mapped_file data for mem_cgroup */
    -		preempt_disable();
    -		__this_cpu_dec(from->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
    -		__this_cpu_inc(to->stat->count[MEM_CGROUP_STAT_FILE_MAPPED]);
    -		preempt_enable();
    -	}
    +	if (!anon && page_mapped(page))
    +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
    +			MEM_CGROUP_STAT_FILE_MAPPED);
    +
    +	if (PageDirty(page))
    +		mem_cgroup_move_account_page_stat(from, to, nr_pages,
    +			MEM_CGROUP_STAT_FILE_DIRTY);
    +
     	mem_cgroup_charge_statistics(from, anon, -nr_pages);
     
     	/* caller should have done css_get */
    diff --git a/mm/page-writeback.c b/mm/page-writeback.c
    index 0713bfb..526ddd7 100644
    --- a/mm/page-writeback.c
    +++ b/mm/page-writeback.c
    @@ -1978,11 +1978,17 @@ int __set_page_dirty_no_writeback(struct page *page)
     
     /*
      * Helper function for set_page_dirty family.
    + *
    + * The caller must hold mem_cgroup_begin/end_update_page_stat() lock
    + * while modifying struct page state and accounting dirty pages.
    + * See __set_page_dirty for example.
    + *
      * NOTE: This relies on being atomic wrt interrupts.
      */
     void account_page_dirtied(struct page *page, struct address_space *mapping)
     {
     	if (mapping_cap_account_dirty(mapping)) {
    +		mem_cgroup_inc_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
     		__inc_zone_page_state(page, NR_FILE_DIRTY);
     		__inc_zone_page_state(page, NR_DIRTIED);
     		__inc_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
    @@ -2022,14 +2028,22 @@ EXPORT_SYMBOL(account_page_writeback);
      */
     int __set_page_dirty_nobuffers(struct page *page)
     {
    +	bool locked;
    +	unsigned long flags;
    +	struct address_space *mapping = page_mapping(page);
    +
    +	if (PageDirty(page))
    +		return 0;
    +
    +	if (unlikely(!mapping))
    +		return !TestSetPageDirty(page);
    +
    +	spin_lock_irq(&mapping->tree_lock);
    +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
    +
     	if (!TestSetPageDirty(page)) {
    -		struct address_space *mapping = page_mapping(page);
     		struct address_space *mapping2;
     
    -		if (!mapping)
    -			return 1;
    -
    -		spin_lock_irq(&mapping->tree_lock);
     		mapping2 = page_mapping(page);
     		if (mapping2) { /* Race with truncate? */
     			BUG_ON(mapping2 != mapping);
    @@ -2038,13 +2052,18 @@ int __set_page_dirty_nobuffers(struct page *page)
     			radix_tree_tag_set(&mapping->page_tree,
     				page_index(page), PAGECACHE_TAG_DIRTY);
     		}
    +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
     		spin_unlock_irq(&mapping->tree_lock);
    +
     		if (mapping->host) {
     			/* !PageAnon && !swapper_space */
     			__mark_inode_dirty(mapping->host, I_DIRTY_PAGES);
     		}
     		return 1;
     	}
    +
    +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
    +	spin_unlock_irq(&mapping->tree_lock);
     	return 0;
     }
     EXPORT_SYMBOL(__set_page_dirty_nobuffers);
    @@ -2160,6 +2179,9 @@ EXPORT_SYMBOL(set_page_dirty_lock);
     int clear_page_dirty_for_io(struct page *page)
     {
     	struct address_space *mapping = page_mapping(page);
    +	bool locked;
    +	unsigned long flags;
    +	int ret = 0;
     
     	BUG_ON(!PageLocked(page));
     
    @@ -2201,13 +2223,16 @@ int clear_page_dirty_for_io(struct page *page)
     		 * the desired exclusion. See mm/memory.c:do_wp_page()
     		 * for more comments.
     		 */
    +		mem_cgroup_begin_update_page_stat(page, &locked, &flags);
     		if (TestClearPageDirty(page)) {
    +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
     			dec_zone_page_state(page, NR_FILE_DIRTY);
     			dec_bdi_stat(mapping->backing_dev_info,
     					BDI_RECLAIMABLE);
    -			return 1;
    +			ret = 1;
     		}
    -		return 0;
    +		mem_cgroup_end_update_page_stat(page, &locked, &flags);
    +		return ret;
     	}
     	return TestClearPageDirty(page);
     }
    diff --git a/mm/truncate.c b/mm/truncate.c
    index db1b216..c81e6c4 100644
    --- a/mm/truncate.c
    +++ b/mm/truncate.c
    @@ -73,9 +73,14 @@ static inline void truncate_partial_page(struct page *page, unsigned partial)
      */
     void cancel_dirty_page(struct page *page, unsigned int account_size)
     {
    +	bool locked;
    +	unsigned long flags;
    +
    +	mem_cgroup_begin_update_page_stat(page, &locked, &flags);
     	if (TestClearPageDirty(page)) {
     		struct address_space *mapping = page->mapping;
     		if (mapping && mapping_cap_account_dirty(mapping)) {
    +			mem_cgroup_dec_page_stat(page, MEM_CGROUP_STAT_FILE_DIRTY);
     			dec_zone_page_state(page, NR_FILE_DIRTY);
     			dec_bdi_stat(mapping->backing_dev_info,
     					BDI_RECLAIMABLE);
    @@ -83,6 +88,7 @@ void cancel_dirty_page(struct page *page, unsigned int account_size)
     				task_io_account_cancelled_write(account_size);
     		}
     	}
    +	mem_cgroup_end_update_page_stat(page, &locked, &flags);
     }
     EXPORT_SYMBOL(cancel_dirty_page);
     
    -- 
    1.7.9.5
    
    
