Re: [RFC PATCH] asynchronous page fault.

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Minchan Kim <minchan.kim@gmail.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	cl@linux-foundation.org
Subject: Re: [RFC PATCH] asynchronous page fault.
Date: Sun, 27 Dec 2009 18:47:25 +0900	[thread overview]
Message-ID: <4B372D2D.60908@gmail.com> (raw)
In-Reply-To: <20091225105140.263180e8.kamezawa.hiroyu@jp.fujitsu.com>

Hi, Kame. 

KAMEZAWA Hiroyuki wrote:
> Speculative page fault v3.
> 
> This version is much simpler than old versions and doesn't use mm_accessor
> but use RCU. This is based on linux-2.6.33-rc2.
> 
> This patch is just my toy but shows...
>  - Once RB-tree is RCU-aware and no-lock in readside, we can avoid mmap_sem
>    in page fault. 
> So, what we need is not mm_accessor, but RCU-aware RB-tree, I think.
> 
> But yes, I may miss something critical ;)
> 
> After patch, statistics perf show is following. Test progam is attached.
>   
> # Samples: 1331231315119
> #
> # Overhead          Command             Shared Object  Symbol
> # ........  ...............  ........................  ......
> #
>     28.41%  multi-fault-all  [kernel]                  [k] clear_page_c
>             |
>             --- clear_page_c
>                 __alloc_pages_nodemask
>                 handle_mm_fault
>                 do_page_fault
>                 page_fault
>                 0x400950
>                |
>                 --100.00%-- (nil)
> 
>     21.69%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
>             |
>             --- _raw_spin_lock
>                |
>                |--81.85%-- free_pcppages_bulk
>                |          free_hot_cold_page
>                |          __pagevec_free
>                |          release_pages
>                |          free_pages_and_swap_cache
> 
> 
> I'll be almost offline in the next week. 
> 
> Minchan, in this version, I didn't add CONFIG and some others which was
> recommended just for my laziness. Sorry.

No problem :)
Thanks for X-mas present. 

> 
> =
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Asynchronous page fault.
> 
> This patch is for avoidng mmap_sem in usual page fault. At running highly
> multi-threaded programs, mm->mmap_sem can use much CPU because of false
> sharing when it causes page fault in parallel. (Run after fork() is a typical
> case, I think.)
> This patch uses a speculative vma lookup to reduce that cost.
> 
> Considering vma lookup, rb-tree lookup, the only operation we do is checking
> node->rb_left,rb_right. And there are no complicated operation.
> At page fault, there are no demands for accessing sorted-vma-list or access
> prev or next in many case. Except for stack-expansion, we always need a vma
> which contains page-fault address. Then, we can access vma's RB-tree in
> speculative way.
> Even if RB-tree rotation occurs while we walk tree for look-up, we just
> miss vma without oops. In other words, we can _try_ to find vma in lockless
> manner. If failed, retry is ok.... we take lock and access vma.
> 
> For lockess walking, this uses RCU and adds find_vma_speculative(). And
> per-vma wait-queue and reference count. This refcnt+wait_queue guarantees that
> there are no thread which access the vma when we call subsystem's unmap
> functions.
> 
> Test result on my tiny test program on 8core/2socket machine is here.
> This measures how many page fault can occur in 60sec in parallel.
> 
> [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> 
>  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> 
>        17481387  page-faults                ( +-   0.409% )
>       509914595  cache-misses               ( +-   0.239% )
> 
>    60.002277793  seconds time elapsed   ( +-   0.000% )
> 
> 
> [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> 
> 
>  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> 
>        35949073  page-faults                ( +-   0.364% )
>       473091100  cache-misses               ( +-   0.304% )
> 
>    60.005444117  seconds time elapsed   ( +-   0.004% )
> 
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

<snip>


> +/* called when vma is unlinked and wait for all racy access.*/
> +static void invalidate_vma_before_free(struct vm_area_struct *vma)
> +{
> +	atomic_dec(&vma->refcnt);
> +	wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
> +}

I think we have to make sure atomicity of both (atomic_dec and wait_event).

> +
>  /*
>   * Requires inode->i_mapping->i_mmap_lock
>   */
> @@ -238,7 +256,7 @@ static struct vm_area_struct *remove_vma
>  			removed_exe_file_vma(vma->vm_mm);
>  	}
>  	mpol_put(vma_policy(vma));
> -	kmem_cache_free(vm_area_cachep, vma);
> +	free_vma_rcu(vma);
>  	return next;
>  }
>  
> @@ -404,6 +422,8 @@ __vma_link_list(struct mm_struct *mm, st
>  void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
>  		struct rb_node **rb_link, struct rb_node *rb_parent)
>  {
> +	atomic_set(&vma->refcnt, 1);
> +	init_waitqueue_head(&vma->wait_queue);
>  	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
>  	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
>  }
> @@ -614,6 +634,7 @@ again:			remove_next = 1 + (end > next->
>  		 * us to remove next before dropping the locks.
>  		 */
>  		__vma_unlink(mm, next, vma);
> +		invalidate_vma_before_free(next);
>  		if (file)
>  			__remove_shared_vm_struct(next, file, mapping);
>  		if (next->anon_vma)
> @@ -640,7 +661,7 @@ again:			remove_next = 1 + (end > next->
>  		}
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
> -		kmem_cache_free(vm_area_cachep, next);
> +		free_vma_rcu(next);
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
>  		 * we must remove another next too. It would clutter
> @@ -1544,6 +1565,55 @@ out:
>  }
>  
>  /*
> + * Returns vma which contains given address. This scans rb-tree in speculative
> + * way and increment a reference count if found. Even if vma exists in rb-tree,
> + * this function may return NULL in racy case. So, this function cannot be used
> + * for checking whether given address is valid or not.
> + */
> +struct vm_area_struct *
> +find_vma_speculative(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct vm_area_struct *vma = NULL;
> +	struct vm_area_struct *vma_tmp;
> +	struct rb_node *rb_node;
> +
> +	if (unlikely(!mm))
> +		return NULL;;
> +
> +	rcu_read_lock();
> +	rb_node = rcu_dereference(mm->mm_rb.rb_node);
> +	vma = NULL;
> +	while (rb_node) {
> +		vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> +
> +		if (vma_tmp->vm_end > addr) {
> +			vma = vma_tmp;
> +			if (vma_tmp->vm_start <= addr)
> +				break;
> +			rb_node = rcu_dereference(rb_node->rb_left);
> +		} else
> +			rb_node = rcu_dereference(rb_node->rb_right);
> +	}
> +	if (vma) {
> +		if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
> +			if (!atomic_inc_not_zero(&vma->refcnt))
> +				vma = NULL;
> +		} else
> +			vma = NULL;
> +	}
> +	rcu_read_unlock();
> +	return vma;
> +}
> +
> +void vma_put(struct vm_area_struct *vma)
> +{
> +	if ((atomic_dec_return(&vma->refcnt) == 1) &&
> +	    waitqueue_active(&vma->wait_queue))
> +		wake_up(&vma->wait_queue);
> +	return;
> +}
> +

Let's consider following case. 

CPU 0					CPU 1

find_vma_speculative(refcnt = 2)
					do_unmap 
					invaliate_vma_before_free(refcount = 1)
					wait_event
vma_put
refcnt = 0
skip wakeup 

Hmm..

WARNING: multiple messages have this Message-ID (diff)

From: Minchan Kim <minchan.kim@gmail.com>
To: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	cl@linux-foundation.org
Subject: Re: [RFC PATCH] asynchronous page fault.
Date: Sun, 27 Dec 2009 18:47:25 +0900	[thread overview]
Message-ID: <4B372D2D.60908@gmail.com> (raw)
In-Reply-To: <20091225105140.263180e8.kamezawa.hiroyu@jp.fujitsu.com>

Hi, Kame. 

KAMEZAWA Hiroyuki wrote:
> Speculative page fault v3.
> 
> This version is much simpler than old versions and doesn't use mm_accessor
> but use RCU. This is based on linux-2.6.33-rc2.
> 
> This patch is just my toy but shows...
>  - Once RB-tree is RCU-aware and no-lock in readside, we can avoid mmap_sem
>    in page fault. 
> So, what we need is not mm_accessor, but RCU-aware RB-tree, I think.
> 
> But yes, I may miss something critical ;)
> 
> After patch, statistics perf show is following. Test progam is attached.
>   
> # Samples: 1331231315119
> #
> # Overhead          Command             Shared Object  Symbol
> # ........  ...............  ........................  ......
> #
>     28.41%  multi-fault-all  [kernel]                  [k] clear_page_c
>             |
>             --- clear_page_c
>                 __alloc_pages_nodemask
>                 handle_mm_fault
>                 do_page_fault
>                 page_fault
>                 0x400950
>                |
>                 --100.00%-- (nil)
> 
>     21.69%  multi-fault-all  [kernel]                  [k] _raw_spin_lock
>             |
>             --- _raw_spin_lock
>                |
>                |--81.85%-- free_pcppages_bulk
>                |          free_hot_cold_page
>                |          __pagevec_free
>                |          release_pages
>                |          free_pages_and_swap_cache
> 
> 
> I'll be almost offline in the next week. 
> 
> Minchan, in this version, I didn't add CONFIG and some others which was
> recommended just for my laziness. Sorry.

No problem :)
Thanks for X-mas present. 

> 
> =
> From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
> 
> Asynchronous page fault.
> 
> This patch is for avoidng mmap_sem in usual page fault. At running highly
> multi-threaded programs, mm->mmap_sem can use much CPU because of false
> sharing when it causes page fault in parallel. (Run after fork() is a typical
> case, I think.)
> This patch uses a speculative vma lookup to reduce that cost.
> 
> Considering vma lookup, rb-tree lookup, the only operation we do is checking
> node->rb_left,rb_right. And there are no complicated operation.
> At page fault, there are no demands for accessing sorted-vma-list or access
> prev or next in many case. Except for stack-expansion, we always need a vma
> which contains page-fault address. Then, we can access vma's RB-tree in
> speculative way.
> Even if RB-tree rotation occurs while we walk tree for look-up, we just
> miss vma without oops. In other words, we can _try_ to find vma in lockless
> manner. If failed, retry is ok.... we take lock and access vma.
> 
> For lockess walking, this uses RCU and adds find_vma_speculative(). And
> per-vma wait-queue and reference count. This refcnt+wait_queue guarantees that
> there are no thread which access the vma when we call subsystem's unmap
> functions.
> 
> Test result on my tiny test program on 8core/2socket machine is here.
> This measures how many page fault can occur in 60sec in parallel.
> 
> [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> 
>  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> 
>        17481387  page-faults                ( +-   0.409% )
>       509914595  cache-misses               ( +-   0.239% )
> 
>    60.002277793  seconds time elapsed   ( +-   0.000% )
> 
> 
> [root@bluextal memory]# /root/bin/perf stat -e page-faults,cache-misses --repeat 5 ./multi-fault-all-split 8
> 
> 
>  Performance counter stats for './multi-fault-all-split 8' (5 runs):
> 
>        35949073  page-faults                ( +-   0.364% )
>       473091100  cache-misses               ( +-   0.304% )
> 
>    60.005444117  seconds time elapsed   ( +-   0.004% )
> 
> 
> 
> Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

<snip>


> +/* called when vma is unlinked and wait for all racy access.*/
> +static void invalidate_vma_before_free(struct vm_area_struct *vma)
> +{
> +	atomic_dec(&vma->refcnt);
> +	wait_event(vma->wait_queue, !atomic_read(&vma->refcnt));
> +}

I think we have to make sure atomicity of both (atomic_dec and wait_event).

> +
>  /*
>   * Requires inode->i_mapping->i_mmap_lock
>   */
> @@ -238,7 +256,7 @@ static struct vm_area_struct *remove_vma
>  			removed_exe_file_vma(vma->vm_mm);
>  	}
>  	mpol_put(vma_policy(vma));
> -	kmem_cache_free(vm_area_cachep, vma);
> +	free_vma_rcu(vma);
>  	return next;
>  }
>  
> @@ -404,6 +422,8 @@ __vma_link_list(struct mm_struct *mm, st
>  void __vma_link_rb(struct mm_struct *mm, struct vm_area_struct *vma,
>  		struct rb_node **rb_link, struct rb_node *rb_parent)
>  {
> +	atomic_set(&vma->refcnt, 1);
> +	init_waitqueue_head(&vma->wait_queue);
>  	rb_link_node(&vma->vm_rb, rb_parent, rb_link);
>  	rb_insert_color(&vma->vm_rb, &mm->mm_rb);
>  }
> @@ -614,6 +634,7 @@ again:			remove_next = 1 + (end > next->
>  		 * us to remove next before dropping the locks.
>  		 */
>  		__vma_unlink(mm, next, vma);
> +		invalidate_vma_before_free(next);
>  		if (file)
>  			__remove_shared_vm_struct(next, file, mapping);
>  		if (next->anon_vma)
> @@ -640,7 +661,7 @@ again:			remove_next = 1 + (end > next->
>  		}
>  		mm->map_count--;
>  		mpol_put(vma_policy(next));
> -		kmem_cache_free(vm_area_cachep, next);
> +		free_vma_rcu(next);
>  		/*
>  		 * In mprotect's case 6 (see comments on vma_merge),
>  		 * we must remove another next too. It would clutter
> @@ -1544,6 +1565,55 @@ out:
>  }
>  
>  /*
> + * Returns vma which contains given address. This scans rb-tree in speculative
> + * way and increment a reference count if found. Even if vma exists in rb-tree,
> + * this function may return NULL in racy case. So, this function cannot be used
> + * for checking whether given address is valid or not.
> + */
> +struct vm_area_struct *
> +find_vma_speculative(struct mm_struct *mm, unsigned long addr)
> +{
> +	struct vm_area_struct *vma = NULL;
> +	struct vm_area_struct *vma_tmp;
> +	struct rb_node *rb_node;
> +
> +	if (unlikely(!mm))
> +		return NULL;;
> +
> +	rcu_read_lock();
> +	rb_node = rcu_dereference(mm->mm_rb.rb_node);
> +	vma = NULL;
> +	while (rb_node) {
> +		vma_tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
> +
> +		if (vma_tmp->vm_end > addr) {
> +			vma = vma_tmp;
> +			if (vma_tmp->vm_start <= addr)
> +				break;
> +			rb_node = rcu_dereference(rb_node->rb_left);
> +		} else
> +			rb_node = rcu_dereference(rb_node->rb_right);
> +	}
> +	if (vma) {
> +		if ((vma->vm_start <= addr) && (addr < vma->vm_end)) {
> +			if (!atomic_inc_not_zero(&vma->refcnt))
> +				vma = NULL;
> +		} else
> +			vma = NULL;
> +	}
> +	rcu_read_unlock();
> +	return vma;
> +}
> +
> +void vma_put(struct vm_area_struct *vma)
> +{
> +	if ((atomic_dec_return(&vma->refcnt) == 1) &&
> +	    waitqueue_active(&vma->wait_queue))
> +		wake_up(&vma->wait_queue);
> +	return;
> +}
> +

Let's consider following case. 

CPU 0					CPU 1

find_vma_speculative(refcnt = 2)
					do_unmap 
					invaliate_vma_before_free(refcount = 1)
					wait_event
vma_put
refcnt = 0
skip wakeup 

Hmm.. 


--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

next prev parent reply	other threads:[~2009-12-27  9:47 UTC|newest]

Thread overview: 63+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2009-12-25  1:51 [RFC PATCH] asynchronous page fault KAMEZAWA Hiroyuki
2009-12-27  9:47 ` Minchan Kim [this message]
2009-12-27  9:47   ` Minchan Kim
2009-12-27 23:59   ` KAMEZAWA Hiroyuki
2009-12-27 23:59     ` KAMEZAWA Hiroyuki
2009-12-27 11:19 ` Peter Zijlstra
2009-12-27 11:19   ` Peter Zijlstra
2009-12-28  0:00   ` KAMEZAWA Hiroyuki
2009-12-28  0:00     ` KAMEZAWA Hiroyuki
2009-12-28  0:57   ` Balbir Singh
2009-12-28  0:57     ` Balbir Singh
2009-12-28  1:05     ` KAMEZAWA Hiroyuki
2009-12-28  1:05       ` KAMEZAWA Hiroyuki
2009-12-28  2:58       ` Balbir Singh
2009-12-28  2:58         ` Balbir Singh
2009-12-28  3:13         ` KAMEZAWA Hiroyuki
2009-12-28  3:13           ` KAMEZAWA Hiroyuki
2009-12-28  8:34         ` Peter Zijlstra
2009-12-28  8:34           ` Peter Zijlstra
2009-12-28  8:32     ` Peter Zijlstra
2009-12-28  8:32       ` Peter Zijlstra
2009-12-29  9:54       ` Balbir Singh
2009-12-29  9:54         ` Balbir Singh
2009-12-27 12:03 ` Peter Zijlstra
2009-12-27 12:03   ` Peter Zijlstra
2009-12-28  0:36   ` KAMEZAWA Hiroyuki
2009-12-28  0:36     ` KAMEZAWA Hiroyuki
2009-12-28  1:19     ` KAMEZAWA Hiroyuki
2009-12-28  1:19       ` KAMEZAWA Hiroyuki
2009-12-28  8:30     ` Peter Zijlstra
2009-12-28  8:30       ` Peter Zijlstra
2009-12-28  9:58       ` KAMEZAWA Hiroyuki
2009-12-28  9:58         ` KAMEZAWA Hiroyuki
2009-12-28 10:30         ` Peter Zijlstra
2009-12-28 10:30           ` Peter Zijlstra
2009-12-28 10:40           ` Peter Zijlstra
2009-12-28 10:40             ` Peter Zijlstra
2010-01-02 16:14             ` Peter Zijlstra
2010-01-02 16:14               ` Peter Zijlstra
2010-01-04  3:02               ` Paul E. McKenney
2010-01-04  3:02                 ` Paul E. McKenney
2010-01-04  7:53                 ` Peter Zijlstra
2010-01-04  7:53                   ` Peter Zijlstra
2010-01-04 15:55                   ` Paul E. McKenney
2010-01-04 15:55                     ` Paul E. McKenney
2010-01-04 16:02                     ` Peter Zijlstra
2010-01-04 16:02                       ` Peter Zijlstra
2010-01-04 16:56                       ` Paul E. McKenney
2010-01-04 16:56                         ` Paul E. McKenney
2010-01-04 13:48               ` [RFC PATCH -v2] speculative " Peter Zijlstra
2010-01-04 13:48                 ` Peter Zijlstra
2009-12-28 10:57           ` [RFC PATCH] asynchronous " KAMEZAWA Hiroyuki
2009-12-28 10:57             ` KAMEZAWA Hiroyuki
2009-12-28 11:06             ` Peter Zijlstra
2009-12-28 11:06               ` Peter Zijlstra
2009-12-28  8:55     ` Peter Zijlstra
2009-12-28  8:55       ` Peter Zijlstra
2009-12-28 10:08       ` KAMEZAWA Hiroyuki
2009-12-28 10:08         ` KAMEZAWA Hiroyuki
2009-12-28 11:43     ` Peter Zijlstra
2009-12-28 11:43       ` Peter Zijlstra
2010-01-02 21:45 ` Benjamin Herrenschmidt
2010-01-02 21:45   ` Benjamin Herrenschmidt

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=4B372D2D.60908@gmail.com \
    --to=minchan.kim@gmail.com \
    --cc=cl@linux-foundation.org \
    --cc=kamezawa.hiroyu@jp.fujitsu.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.