Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
@ 2015-01-05  8:37 Hillf Danton
  2015-01-06  1:32 ` Joonsoo Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Hillf Danton @ 2015-01-05  8:37 UTC (permalink / raw)
  To: 'Joonsoo Kim'
  Cc: Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes', linux-kernel,
	linux-mm, Steven Rostedt, 'Jesper Dangaard Brouer'

> 
> We had to insert a preempt enable/disable in the fastpath a while ago
> in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> same cpu. It is the problem only for CONFIG_PREEMPT in which scheduler
> can move the process to other cpu during retrieving data.
> 
> Now, I reach the solution to remove preempt enable/disable in the fastpath.
> If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu
> are retrieved by separate this_cpu operation, it means that they are
> retrieved on the same cpu. If not matched, we just have to retry it.
> 
> With this guarantee, preemption enable/disable isn't need at all even if
> CONFIG_PREEMPT, so this patch removes it.
> 
> I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
> 
> Below is the result of Christoph's slab_test reported by
> Jesper Dangaard Brouer.
> 
> * Before
> 
>  Single thread testing
>  =====================
>  1. Kmalloc: Repeatedly allocate then free test
>  10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
>  10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
>  10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
>  10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
>  10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
>  10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
>  10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
>  10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
>  10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
>  10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
>  10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
>  10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
>  2. Kmalloc: alloc/free test
>  10000 times kmalloc(8)/kfree -> 68 cycles
>  10000 times kmalloc(16)/kfree -> 68 cycles
>  10000 times kmalloc(32)/kfree -> 69 cycles
>  10000 times kmalloc(64)/kfree -> 68 cycles
>  10000 times kmalloc(128)/kfree -> 68 cycles
>  10000 times kmalloc(256)/kfree -> 68 cycles
>  10000 times kmalloc(512)/kfree -> 74 cycles
>  10000 times kmalloc(1024)/kfree -> 75 cycles
>  10000 times kmalloc(2048)/kfree -> 74 cycles
>  10000 times kmalloc(4096)/kfree -> 74 cycles
>  10000 times kmalloc(8192)/kfree -> 75 cycles
>  10000 times kmalloc(16384)/kfree -> 510 cycles
> 
> * After
> 
>  Single thread testing
>  =====================
>  1. Kmalloc: Repeatedly allocate then free test
>  10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
>  10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
>  10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
>  10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
>  10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
>  10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
>  10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
>  10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
>  10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
>  10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
>  10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
>  10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
>  2. Kmalloc: alloc/free test
>  10000 times kmalloc(8)/kfree -> 65 cycles
>  10000 times kmalloc(16)/kfree -> 66 cycles
>  10000 times kmalloc(32)/kfree -> 65 cycles
>  10000 times kmalloc(64)/kfree -> 66 cycles
>  10000 times kmalloc(128)/kfree -> 66 cycles
>  10000 times kmalloc(256)/kfree -> 71 cycles
>  10000 times kmalloc(512)/kfree -> 72 cycles
>  10000 times kmalloc(1024)/kfree -> 71 cycles
>  10000 times kmalloc(2048)/kfree -> 71 cycles
>  10000 times kmalloc(4096)/kfree -> 71 cycles
>  10000 times kmalloc(8192)/kfree -> 65 cycles
>  10000 times kmalloc(16384)/kfree -> 511 cycles
> 
> Most of the results are better than before.
> 
> Note that this change slightly worses performance in !CONFIG_PREEMPT,
> roughly 0.3%. Implementing each case separately would help performance,
> but, since it's so marginal, I didn't do that. This would help
> maintanance since we have same code for all cases.
> 
> Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> ---
>  mm/slub.c |   26 +++++++++++++-------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/slub.c b/mm/slub.c
> index fe376fe..0624608 100644
> --- a/mm/slub.c
> +++ b/mm/slub.c
> @@ -2398,13 +2398,15 @@ redo:
>  	 * reading from one cpu area. That does not matter as long
>  	 * as we end up on the original cpu again when doing the cmpxchg.
>  	 *
> -	 * Preemption is disabled for the retrieval of the tid because that
> -	 * must occur from the current processor. We cannot allow rescheduling
> -	 * on a different processor between the determination of the pointer
> -	 * and the retrieval of the tid.
> +	 * We should guarantee that tid and kmem_cache are retrieved on
> +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> +	 * to check if it is matched or not.
>  	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	barrier();

Help maintenance more if barrier is documented in commit message.
> 
>  	/*
>  	 * The transaction ids are globally unique per cpu and per operation on
> @@ -2412,8 +2414,6 @@ redo:
>  	 * occurs on the right processor and that there was no operation on the
>  	 * linked list in between.
>  	 */
> -	tid = c->tid;
> -	preempt_enable();
> 
>  	object = c->freelist;
>  	page = c->page;
> @@ -2659,11 +2659,11 @@ redo:
>  	 * data is retrieved via this pointer. If we are on the same cpu
>  	 * during the cmpxchg then the free will succedd.
>  	 */
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> -
> -	tid = c->tid;
> -	preempt_enable();
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	barrier();
> 
ditto
>  	if (likely(page == c->page)) {
>  		set_freepointer(s, object, c->freelist);
> --
> 1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-05  8:37 [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Hillf Danton
@ 2015-01-06  1:32 ` Joonsoo Kim
  2015-01-06  2:25   ` Steven Rostedt
  0 siblings, 1 reply; 11+ messages in thread
From: Joonsoo Kim @ 2015-01-06  1:32 UTC (permalink / raw)
  To: Hillf Danton
  Cc: Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes', linux-kernel,
	linux-mm, Steven Rostedt, 'Jesper Dangaard Brouer'

On Mon, Jan 05, 2015 at 04:37:35PM +0800, Hillf Danton wrote:
> > 
> > We had to insert a preempt enable/disable in the fastpath a while ago
> > in order to guarantee that tid and kmem_cache_cpu are retrieved on the
> > same cpu. It is the problem only for CONFIG_PREEMPT in which scheduler
> > can move the process to other cpu during retrieving data.
> > 
> > Now, I reach the solution to remove preempt enable/disable in the fastpath.
> > If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu
> > are retrieved by separate this_cpu operation, it means that they are
> > retrieved on the same cpu. If not matched, we just have to retry it.
> > 
> > With this guarantee, preemption enable/disable isn't need at all even if
> > CONFIG_PREEMPT, so this patch removes it.
> > 
> > I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
> > in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)
> > 
> > Below is the result of Christoph's slab_test reported by
> > Jesper Dangaard Brouer.
> > 
> > * Before
> > 
> >  Single thread testing
> >  =====================
> >  1. Kmalloc: Repeatedly allocate then free test
> >  10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
> >  10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
> >  10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
> >  10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
> >  10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
> >  10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
> >  10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
> >  10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
> >  10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
> >  10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
> >  10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
> >  10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
> >  2. Kmalloc: alloc/free test
> >  10000 times kmalloc(8)/kfree -> 68 cycles
> >  10000 times kmalloc(16)/kfree -> 68 cycles
> >  10000 times kmalloc(32)/kfree -> 69 cycles
> >  10000 times kmalloc(64)/kfree -> 68 cycles
> >  10000 times kmalloc(128)/kfree -> 68 cycles
> >  10000 times kmalloc(256)/kfree -> 68 cycles
> >  10000 times kmalloc(512)/kfree -> 74 cycles
> >  10000 times kmalloc(1024)/kfree -> 75 cycles
> >  10000 times kmalloc(2048)/kfree -> 74 cycles
> >  10000 times kmalloc(4096)/kfree -> 74 cycles
> >  10000 times kmalloc(8192)/kfree -> 75 cycles
> >  10000 times kmalloc(16384)/kfree -> 510 cycles
> > 
> > * After
> > 
> >  Single thread testing
> >  =====================
> >  1. Kmalloc: Repeatedly allocate then free test
> >  10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
> >  10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
> >  10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
> >  10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
> >  10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
> >  10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
> >  10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
> >  10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
> >  10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
> >  10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
> >  10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
> >  10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
> >  2. Kmalloc: alloc/free test
> >  10000 times kmalloc(8)/kfree -> 65 cycles
> >  10000 times kmalloc(16)/kfree -> 66 cycles
> >  10000 times kmalloc(32)/kfree -> 65 cycles
> >  10000 times kmalloc(64)/kfree -> 66 cycles
> >  10000 times kmalloc(128)/kfree -> 66 cycles
> >  10000 times kmalloc(256)/kfree -> 71 cycles
> >  10000 times kmalloc(512)/kfree -> 72 cycles
> >  10000 times kmalloc(1024)/kfree -> 71 cycles
> >  10000 times kmalloc(2048)/kfree -> 71 cycles
> >  10000 times kmalloc(4096)/kfree -> 71 cycles
> >  10000 times kmalloc(8192)/kfree -> 65 cycles
> >  10000 times kmalloc(16384)/kfree -> 511 cycles
> > 
> > Most of the results are better than before.
> > 
> > Note that this change slightly worses performance in !CONFIG_PREEMPT,
> > roughly 0.3%. Implementing each case separately would help performance,
> > but, since it's so marginal, I didn't do that. This would help
> > maintanance since we have same code for all cases.
> > 
> > Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
> > Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
> > ---
> >  mm/slub.c |   26 +++++++++++++-------------
> >  1 file changed, 13 insertions(+), 13 deletions(-)
> > 
> > diff --git a/mm/slub.c b/mm/slub.c
> > index fe376fe..0624608 100644
> > --- a/mm/slub.c
> > +++ b/mm/slub.c
> > @@ -2398,13 +2398,15 @@ redo:
> >  	 * reading from one cpu area. That does not matter as long
> >  	 * as we end up on the original cpu again when doing the cmpxchg.
> >  	 *
> > -	 * Preemption is disabled for the retrieval of the tid because that
> > -	 * must occur from the current processor. We cannot allow rescheduling
> > -	 * on a different processor between the determination of the pointer
> > -	 * and the retrieval of the tid.
> > +	 * We should guarantee that tid and kmem_cache are retrieved on
> > +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> > +	 * to check if it is matched or not.
> >  	 */
> > -	preempt_disable();
> > -	c = this_cpu_ptr(s->cpu_slab);
> > +	do {
> > +		tid = this_cpu_read(s->cpu_slab->tid);
> > +		c = this_cpu_ptr(s->cpu_slab);
> > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	barrier();
> 
> Help maintenance more if barrier is documented in commit message.

Hello,

Okay. Will add some information about this barrier in commit message.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  1:32 ` Joonsoo Kim
@ 2015-01-06  2:25   ` Steven Rostedt
  2015-01-06  8:27     ` Joonsoo Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Steven Rostedt @ 2015-01-06  2:25 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Hillf Danton, Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes', linux-kernel,
	linux-mm, 'Jesper Dangaard Brouer'

On Tue, 6 Jan 2015 10:32:47 +0900
Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:


> > > +++ b/mm/slub.c
> > > @@ -2398,13 +2398,15 @@ redo:
> > >  	 * reading from one cpu area. That does not matter as long
> > >  	 * as we end up on the original cpu again when doing the cmpxchg.
> > >  	 *
> > > -	 * Preemption is disabled for the retrieval of the tid because that
> > > -	 * must occur from the current processor. We cannot allow rescheduling
> > > -	 * on a different processor between the determination of the pointer
> > > -	 * and the retrieval of the tid.
> > > +	 * We should guarantee that tid and kmem_cache are retrieved on
> > > +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> > > +	 * to check if it is matched or not.
> > >  	 */
> > > -	preempt_disable();
> > > -	c = this_cpu_ptr(s->cpu_slab);
> > > +	do {
> > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > +		c = this_cpu_ptr(s->cpu_slab);
> > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > +	barrier();
> > 
> > Help maintenance more if barrier is documented in commit message.
> 
> Hello,
> 
> Okay. Will add some information about this barrier in commit message.

A comment in the commit message is useless. Adding a small comment
above the barrier() call itself would be much more useful.

-- Steve

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  2:25   ` Steven Rostedt
@ 2015-01-06  8:27     ` Joonsoo Kim
  0 siblings, 0 replies; 11+ messages in thread
From: Joonsoo Kim @ 2015-01-06  8:27 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Hillf Danton, Andrew Morton, 'Christoph Lameter',
	'Pekka Enberg', 'David Rientjes', linux-kernel,
	linux-mm, 'Jesper Dangaard Brouer'

On Mon, Jan 05, 2015 at 09:25:02PM -0500, Steven Rostedt wrote:
> On Tue, 6 Jan 2015 10:32:47 +0900
> Joonsoo Kim <iamjoonsoo.kim@lge.com> wrote:
> 
> 
> > > > +++ b/mm/slub.c
> > > > @@ -2398,13 +2398,15 @@ redo:
> > > >  	 * reading from one cpu area. That does not matter as long
> > > >  	 * as we end up on the original cpu again when doing the cmpxchg.
> > > >  	 *
> > > > -	 * Preemption is disabled for the retrieval of the tid because that
> > > > -	 * must occur from the current processor. We cannot allow rescheduling
> > > > -	 * on a different processor between the determination of the pointer
> > > > -	 * and the retrieval of the tid.
> > > > +	 * We should guarantee that tid and kmem_cache are retrieved on
> > > > +	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
> > > > +	 * to check if it is matched or not.
> > > >  	 */
> > > > -	preempt_disable();
> > > > -	c = this_cpu_ptr(s->cpu_slab);
> > > > +	do {
> > > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > > +		c = this_cpu_ptr(s->cpu_slab);
> > > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > > +	barrier();
> > > 
> > > Help maintenance more if barrier is documented in commit message.
> > 
> > Hello,
> > 
> > Okay. Will add some information about this barrier in commit message.
> 
> A comment in the commit message is useless. Adding a small comment
> above the barrier() call itself would be much more useful.

Okay. Will do.

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
@ 2015-01-05  1:36 Joonsoo Kim
  2015-01-05 14:52 ` Christoph Lameter
  2015-01-06  3:03 ` Davidlohr Bueso
  0 siblings, 2 replies; 11+ messages in thread
From: Joonsoo Kim @ 2015-01-05  1:36 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Christoph Lameter, Pekka Enberg, David Rientjes, Joonsoo Kim,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

We had to insert a preempt enable/disable in the fastpath a while ago
in order to guarantee that tid and kmem_cache_cpu are retrieved on the
same cpu. It is the problem only for CONFIG_PREEMPT in which scheduler
can move the process to other cpu during retrieving data.

Now, I reach the solution to remove preempt enable/disable in the fastpath.
If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu
are retrieved by separate this_cpu operation, it means that they are
retrieved on the same cpu. If not matched, we just have to retry it.

With this guarantee, preemption enable/disable isn't need at all even if
CONFIG_PREEMPT, so this patch removes it.

I saw roughly 5% win in a fast-path loop over kmem_cache_alloc/free
in CONFIG_PREEMPT. (14.821 ns -> 14.049 ns)

Below is the result of Christoph's slab_test reported by
Jesper Dangaard Brouer.

* Before

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 49 cycles kfree -> 62 cycles
 10000 times kmalloc(16) -> 48 cycles kfree -> 64 cycles
 10000 times kmalloc(32) -> 53 cycles kfree -> 70 cycles
 10000 times kmalloc(64) -> 64 cycles kfree -> 77 cycles
 10000 times kmalloc(128) -> 74 cycles kfree -> 84 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 114 cycles
 10000 times kmalloc(512) -> 83 cycles kfree -> 116 cycles
 10000 times kmalloc(1024) -> 81 cycles kfree -> 120 cycles
 10000 times kmalloc(2048) -> 104 cycles kfree -> 136 cycles
 10000 times kmalloc(4096) -> 142 cycles kfree -> 165 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 226 cycles
 10000 times kmalloc(16384) -> 403 cycles kfree -> 264 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 68 cycles
 10000 times kmalloc(16)/kfree -> 68 cycles
 10000 times kmalloc(32)/kfree -> 69 cycles
 10000 times kmalloc(64)/kfree -> 68 cycles
 10000 times kmalloc(128)/kfree -> 68 cycles
 10000 times kmalloc(256)/kfree -> 68 cycles
 10000 times kmalloc(512)/kfree -> 74 cycles
 10000 times kmalloc(1024)/kfree -> 75 cycles
 10000 times kmalloc(2048)/kfree -> 74 cycles
 10000 times kmalloc(4096)/kfree -> 74 cycles
 10000 times kmalloc(8192)/kfree -> 75 cycles
 10000 times kmalloc(16384)/kfree -> 510 cycles

* After

 Single thread testing
 =====================
 1. Kmalloc: Repeatedly allocate then free test
 10000 times kmalloc(8) -> 46 cycles kfree -> 61 cycles
 10000 times kmalloc(16) -> 46 cycles kfree -> 63 cycles
 10000 times kmalloc(32) -> 49 cycles kfree -> 69 cycles
 10000 times kmalloc(64) -> 57 cycles kfree -> 76 cycles
 10000 times kmalloc(128) -> 66 cycles kfree -> 83 cycles
 10000 times kmalloc(256) -> 84 cycles kfree -> 110 cycles
 10000 times kmalloc(512) -> 77 cycles kfree -> 114 cycles
 10000 times kmalloc(1024) -> 80 cycles kfree -> 116 cycles
 10000 times kmalloc(2048) -> 102 cycles kfree -> 131 cycles
 10000 times kmalloc(4096) -> 135 cycles kfree -> 163 cycles
 10000 times kmalloc(8192) -> 238 cycles kfree -> 218 cycles
 10000 times kmalloc(16384) -> 399 cycles kfree -> 262 cycles
 2. Kmalloc: alloc/free test
 10000 times kmalloc(8)/kfree -> 65 cycles
 10000 times kmalloc(16)/kfree -> 66 cycles
 10000 times kmalloc(32)/kfree -> 65 cycles
 10000 times kmalloc(64)/kfree -> 66 cycles
 10000 times kmalloc(128)/kfree -> 66 cycles
 10000 times kmalloc(256)/kfree -> 71 cycles
 10000 times kmalloc(512)/kfree -> 72 cycles
 10000 times kmalloc(1024)/kfree -> 71 cycles
 10000 times kmalloc(2048)/kfree -> 71 cycles
 10000 times kmalloc(4096)/kfree -> 71 cycles
 10000 times kmalloc(8192)/kfree -> 65 cycles
 10000 times kmalloc(16384)/kfree -> 511 cycles

Most of the results are better than before.

Note that this change slightly worses performance in !CONFIG_PREEMPT,
roughly 0.3%. Implementing each case separately would help performance,
but, since it's so marginal, I didn't do that. This would help
maintanance since we have same code for all cases.

Tested-by: Jesper Dangaard Brouer <brouer@redhat.com>
Signed-off-by: Joonsoo Kim <iamjoonsoo.kim@lge.com>
---
 mm/slub.c |   26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/mm/slub.c b/mm/slub.c
index fe376fe..0624608 100644
--- a/mm/slub.c
+++ b/mm/slub.c
@@ -2398,13 +2398,15 @@ redo:
 	 * reading from one cpu area. That does not matter as long
 	 * as we end up on the original cpu again when doing the cmpxchg.
 	 *
-	 * Preemption is disabled for the retrieval of the tid because that
-	 * must occur from the current processor. We cannot allow rescheduling
-	 * on a different processor between the determination of the pointer
-	 * and the retrieval of the tid.
+	 * We should guarantee that tid and kmem_cache are retrieved on
+	 * the same cpu. It could be different if CONFIG_PREEMPT so we need
+	 * to check if it is matched or not.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = this_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	barrier();
 
 	/*
 	 * The transaction ids are globally unique per cpu and per operation on
@@ -2412,8 +2414,6 @@ redo:
 	 * occurs on the right processor and that there was no operation on the
 	 * linked list in between.
 	 */
-	tid = c->tid;
-	preempt_enable();
 
 	object = c->freelist;
 	page = c->page;
@@ -2659,11 +2659,11 @@ redo:
 	 * data is retrieved via this pointer. If we are on the same cpu
 	 * during the cmpxchg then the free will succedd.
 	 */
-	preempt_disable();
-	c = this_cpu_ptr(s->cpu_slab);
-
-	tid = c->tid;
-	preempt_enable();
+	do {
+		tid = this_cpu_read(s->cpu_slab->tid);
+		c = this_cpu_ptr(s->cpu_slab);
+	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
+	barrier();
 
 	if (likely(page == c->page)) {
 		set_freepointer(s, object, c->freelist);
-- 
1.7.9.5

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-05  1:36 Joonsoo Kim
@ 2015-01-05 14:52 ` Christoph Lameter
  2015-01-06  3:03 ` Davidlohr Bueso
  1 sibling, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2015-01-05 14:52 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Pekka Enberg, David Rientjes, linux-mm,
	linux-kernel, Jesper Dangaard Brouer, rostedt, Thomas Gleixner

On Mon, 5 Jan 2015, Joonsoo Kim wrote:

> Now, I reach the solution to remove preempt enable/disable in the fastpath.
> If tid is matched with kmem_cache_cpu's tid after tid and kmem_cache_cpu
> are retrieved by separate this_cpu operation, it means that they are
> retrieved on the same cpu. If not matched, we just have to retry it.

Acked-by: Christoph Lameter <cl@linux.com>

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-05  1:36 Joonsoo Kim
  2015-01-05 14:52 ` Christoph Lameter
@ 2015-01-06  3:03 ` Davidlohr Bueso
  2015-01-06  8:09   ` Joonsoo Kim
  1 sibling, 1 reply; 11+ messages in thread
From: Davidlohr Bueso @ 2015-01-06  3:03 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> -	preempt_disable();
> -	c = this_cpu_ptr(s->cpu_slab);
> +	do {
> +		tid = this_cpu_read(s->cpu_slab->tid);
> +		c = this_cpu_ptr(s->cpu_slab);
> +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> +	barrier();

I don't see the compiler reodering the object/page stores below, since c
is updated in the loop anyway. Is this really necessary (same goes for
slab_free)? The generated code by gcc 4.8 looks correct without it.
Additionally, the implied barriers for preemption control aren't really
the same semantics used here (if that is actually the reason why you are
using them).

Thanks,
Davidlohr

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  3:03 ` Davidlohr Bueso
@ 2015-01-06  8:09   ` Joonsoo Kim
  2015-01-06 17:02     ` Davidlohr Bueso
  0 siblings, 1 reply; 11+ messages in thread
From: Joonsoo Kim @ 2015-01-06  8:09 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Mon, Jan 05, 2015 at 07:03:12PM -0800, Davidlohr Bueso wrote:
> On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> > -	preempt_disable();
> > -	c = this_cpu_ptr(s->cpu_slab);
> > +	do {
> > +		tid = this_cpu_read(s->cpu_slab->tid);
> > +		c = this_cpu_ptr(s->cpu_slab);
> > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > +	barrier();
> 
> I don't see the compiler reodering the object/page stores below, since c
> is updated in the loop anyway. Is this really necessary (same goes for
> slab_free)? The generated code by gcc 4.8 looks correct without it.
> Additionally, the implied barriers for preemption control aren't really
> the same semantics used here (if that is actually the reason why you are
> using them).

Hello,

I'd like to use tid as a pivot so it should be fetched before fetching
anything on c. Is it impossible even if !CONFIG_PREEMPT without
barrier()?

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06  8:09   ` Joonsoo Kim
@ 2015-01-06 17:02     ` Davidlohr Bueso
  2015-01-08  7:44       ` Joonsoo Kim
  0 siblings, 1 reply; 11+ messages in thread
From: Davidlohr Bueso @ 2015-01-06 17:02 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Tue, 2015-01-06 at 17:09 +0900, Joonsoo Kim wrote:
> On Mon, Jan 05, 2015 at 07:03:12PM -0800, Davidlohr Bueso wrote:
> > On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> > > -	preempt_disable();
> > > -	c = this_cpu_ptr(s->cpu_slab);
> > > +	do {
> > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > +		c = this_cpu_ptr(s->cpu_slab);
> > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > +	barrier();
> > 
> > I don't see the compiler reodering the object/page stores below, since c
> > is updated in the loop anyway. Is this really necessary (same goes for
> > slab_free)? The generated code by gcc 4.8 looks correct without it.
> > Additionally, the implied barriers for preemption control aren't really
> > the same semantics used here (if that is actually the reason why you are
> > using them).
> 
> Hello,
> 
> I'd like to use tid as a pivot so it should be fetched before fetching
> anything on c. Is it impossible even if !CONFIG_PREEMPT without
> barrier()?

You'd need a smp_wmb() in between tid and c in the loop then, which
looks quite unpleasant. All in all disabling preemption isn't really
that expensive, and you should redo your performance number if you go
this way.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-06 17:02     ` Davidlohr Bueso
@ 2015-01-08  7:44       ` Joonsoo Kim
  2015-01-09  3:34         ` Christoph Lameter
  0 siblings, 1 reply; 11+ messages in thread
From: Joonsoo Kim @ 2015-01-08  7:44 UTC (permalink / raw)
  To: Davidlohr Bueso
  Cc: Andrew Morton, Christoph Lameter, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Tue, Jan 06, 2015 at 09:02:17AM -0800, Davidlohr Bueso wrote:
> On Tue, 2015-01-06 at 17:09 +0900, Joonsoo Kim wrote:
> > On Mon, Jan 05, 2015 at 07:03:12PM -0800, Davidlohr Bueso wrote:
> > > On Mon, 2015-01-05 at 10:36 +0900, Joonsoo Kim wrote:
> > > > -	preempt_disable();
> > > > -	c = this_cpu_ptr(s->cpu_slab);
> > > > +	do {
> > > > +		tid = this_cpu_read(s->cpu_slab->tid);
> > > > +		c = this_cpu_ptr(s->cpu_slab);
> > > > +	} while (IS_ENABLED(CONFIG_PREEMPT) && unlikely(tid != c->tid));
> > > > +	barrier();
> > > 
> > > I don't see the compiler reodering the object/page stores below, since c
> > > is updated in the loop anyway. Is this really necessary (same goes for
> > > slab_free)? The generated code by gcc 4.8 looks correct without it.
> > > Additionally, the implied barriers for preemption control aren't really
> > > the same semantics used here (if that is actually the reason why you are
> > > using them).
> > 
> > Hello,
> > 
> > I'd like to use tid as a pivot so it should be fetched before fetching
> > anything on c. Is it impossible even if !CONFIG_PREEMPT without
> > barrier()?
> 
> You'd need a smp_wmb() in between tid and c in the loop then, which
> looks quite unpleasant. All in all disabling preemption isn't really
> that expensive, and you should redo your performance number if you go
> this way.

This barrier() is not for read/write synchronization between cpus.
All read/write operation to cpu_slab would happen on correct cpu in
successful case. What I'd need to guarantee here is to prevent
reordering between fetching operation for correctness of algorithm. In
this case, barrier() seems enough to me. Am I wrong?

Thanks.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off
  2015-01-08  7:44       ` Joonsoo Kim
@ 2015-01-09  3:34         ` Christoph Lameter
  0 siblings, 0 replies; 11+ messages in thread
From: Christoph Lameter @ 2015-01-09  3:34 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Davidlohr Bueso, Andrew Morton, Pekka Enberg, David Rientjes,
	linux-mm, linux-kernel, Jesper Dangaard Brouer, rostedt,
	Thomas Gleixner

On Thu, 8 Jan 2015, Joonsoo Kim wrote:

> > You'd need a smp_wmb() in between tid and c in the loop then, which
> > looks quite unpleasant. All in all disabling preemption isn't really
> > that expensive, and you should redo your performance number if you go
> > this way.
>
> This barrier() is not for read/write synchronization between cpus.
> All read/write operation to cpu_slab would happen on correct cpu in
> successful case. What I'd need to guarantee here is to prevent
> reordering between fetching operation for correctness of algorithm. In
> this case, barrier() seems enough to me. Am I wrong?

You are right.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2015-01-09  3:34 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-01-05  8:37 [PATCH 1/2] mm/slub: optimize alloc/free fastpath by removing preemption on/off Hillf Danton
2015-01-06  1:32 ` Joonsoo Kim
2015-01-06  2:25   ` Steven Rostedt
2015-01-06  8:27     ` Joonsoo Kim
  -- strict thread matches above, loose matches on Subject: below --
2015-01-05  1:36 Joonsoo Kim
2015-01-05 14:52 ` Christoph Lameter
2015-01-06  3:03 ` Davidlohr Bueso
2015-01-06  8:09   ` Joonsoo Kim
2015-01-06 17:02     ` Davidlohr Bueso
2015-01-08  7:44       ` Joonsoo Kim
2015-01-09  3:34         ` Christoph Lameter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).