From: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
To: Andrew Morton <akpm@linux-foundation.org>,
Christoph Lameter <clameter@sgi.com>
Cc: linux-kernel@vger.kernel.org
Subject: Re: [patch 00/28] Add cmpxchg64_local and cmpxchg_local to each architecture
Date: Mon, 27 Aug 2007 13:31:51 -0400 [thread overview]
Message-ID: <20070827173151.GA28974@Krystal> (raw)
In-Reply-To: <20070827091726.a2e8fbb8.akpm@linux-foundation.org>
* Andrew Morton (akpm@linux-foundation.org) wrote:
> On Mon, 27 Aug 2007 11:52:34 -0400 Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca> wrote:
>
> > Here is the patch series for 2.6.23-rc3-mm1 that adds cmpxchg_local, and now
> > also cmpxchg64_local, to each architecture.
>
> How well tested are these on the various architectures?
>
I compile-tested the patchset on:
arm
i686
ia64
m68k
mips/mipsel
powerpc405
powerpc64
s390
sparc64
sparc
x86_64
With various config options (ALL yes, ALL no, all modules,
CONFIG_MODULES=yes/no...)
Since then, I added the trivial architecture-specific patches:
add-cmpxchg64-to-alpha.patch
add-cmpxchg64-to-mips.patch
add-cmpxchg64-to-powerpc.patch
add-cmpxchg64-to-x86_64.patch
I tested these before submitting; mips, powerpc405, powerpc64 and x86_64
build fine. (I have no cross-compiler for alpha, though.)
The rest of the changes since the last thorough architecture-wide compile
test either address architecture-specific comments I received on LKML or
have been tested since then because they are architecture-agnostic.
I must admit, though, that there are still a few architectures touched by
the cmpxchg_local patches for which I have no cross-compiler. But I think
the changes are trivial and repetitive enough across architectures to
minimize the risk of build breakage.
For runtime testing, I am a bit limited on hardware: I have i686 and
AMD64 machines myself, and rely on the LTTng community (and the kernel
community) to test the other architectures.
Christoph Lameter and I tested cmpxchg_local with slub on i686 and
x86_64.
> > When the architecture supports it, it also defines cmpxchg64, but it is not
> > defined for architectures that do not support atomic 64-bit updates.
> >
> > Following performance testing of the slub allocator with cmpxchg_local, these
> > patches should prove themselves useful in the near future.
>
> It would be useful if we could have (numerical) details on these benefits as
> part of this patch series description, please.
>
Sure, they follow at the end of this email (I will append them to the
add-cmpxchg-local-to-generic-for-up.patch description).
> Also, it would be good to get the slub patch in there at the same time so that
> the new code gets a bit of exercise.
>
Good point. Christoph, could you please prepare a slub cmpxchg_local
patch for the mm tree?
Mathieu
> Thanks.
Patch add-cmpxchg-local-to-generic-for-up.patch description addendum:
* Patch series comments
The performance improvement of the fast path ranges from a 66% speedup
on a Pentium 4 to a 14% speedup on an AMD64.
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Measurements on a Pentium 4, 3 GHz, with Hyper-Threading.
SLUB Performance testing
========================
1. Kmalloc: Repeatedly allocate then free test
* slub HEAD, test 1
kmalloc(8) = 201 cycles kfree = 351 cycles
kmalloc(16) = 198 cycles kfree = 359 cycles
kmalloc(32) = 200 cycles kfree = 381 cycles
kmalloc(64) = 224 cycles kfree = 394 cycles
kmalloc(128) = 285 cycles kfree = 424 cycles
kmalloc(256) = 411 cycles kfree = 546 cycles
kmalloc(512) = 480 cycles kfree = 619 cycles
kmalloc(1024) = 623 cycles kfree = 750 cycles
kmalloc(2048) = 686 cycles kfree = 811 cycles
kmalloc(4096) = 482 cycles kfree = 538 cycles
kmalloc(8192) = 680 cycles kfree = 734 cycles
kmalloc(16384) = 713 cycles kfree = 843 cycles
* slub HEAD, test 2
kmalloc(8) = 190 cycles kfree = 351 cycles
kmalloc(16) = 195 cycles kfree = 360 cycles
kmalloc(32) = 201 cycles kfree = 370 cycles
kmalloc(64) = 245 cycles kfree = 389 cycles
kmalloc(128) = 283 cycles kfree = 413 cycles
kmalloc(256) = 409 cycles kfree = 547 cycles
kmalloc(512) = 476 cycles kfree = 616 cycles
kmalloc(1024) = 628 cycles kfree = 753 cycles
kmalloc(2048) = 684 cycles kfree = 811 cycles
kmalloc(4096) = 480 cycles kfree = 539 cycles
kmalloc(8192) = 661 cycles kfree = 746 cycles
kmalloc(16384) = 741 cycles kfree = 856 cycles
* cmpxchg_local slub test
kmalloc(8) = 83 cycles kfree = 363 cycles
kmalloc(16) = 85 cycles kfree = 372 cycles
kmalloc(32) = 92 cycles kfree = 377 cycles
kmalloc(64) = 115 cycles kfree = 397 cycles
kmalloc(128) = 179 cycles kfree = 438 cycles
kmalloc(256) = 314 cycles kfree = 564 cycles
kmalloc(512) = 398 cycles kfree = 615 cycles
kmalloc(1024) = 573 cycles kfree = 745 cycles
kmalloc(2048) = 629 cycles kfree = 816 cycles
kmalloc(4096) = 473 cycles kfree = 548 cycles
kmalloc(8192) = 659 cycles kfree = 745 cycles
kmalloc(16384) = 724 cycles kfree = 843 cycles
2. Kmalloc: alloc/free test
* slub HEAD, test 1
kmalloc(8)/kfree = 322 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 325 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 678 cycles
kmalloc(8192)/kfree = 1013 cycles
kmalloc(16384)/kfree = 1157 cycles
* slub HEAD, test 2
kmalloc(8)/kfree = 323 cycles
kmalloc(16)/kfree = 318 cycles
kmalloc(32)/kfree = 318 cycles
kmalloc(64)/kfree = 318 cycles
kmalloc(128)/kfree = 318 cycles
kmalloc(256)/kfree = 328 cycles
kmalloc(512)/kfree = 328 cycles
kmalloc(1024)/kfree = 328 cycles
kmalloc(2048)/kfree = 328 cycles
kmalloc(4096)/kfree = 648 cycles
kmalloc(8192)/kfree = 1009 cycles
kmalloc(16384)/kfree = 1105 cycles
* cmpxchg_local slub test
kmalloc(8)/kfree = 112 cycles
kmalloc(16)/kfree = 103 cycles
kmalloc(32)/kfree = 103 cycles
kmalloc(64)/kfree = 103 cycles
kmalloc(128)/kfree = 112 cycles
kmalloc(256)/kfree = 111 cycles
kmalloc(512)/kfree = 111 cycles
kmalloc(1024)/kfree = 111 cycles
kmalloc(2048)/kfree = 121 cycles
kmalloc(4096)/kfree = 650 cycles
kmalloc(8192)/kfree = 1042 cycles
kmalloc(16384)/kfree = 1149 cycles
Tested-by: Mathieu Desnoyers <mathieu.desnoyers@polymtl.ca>
Measurements on an AMD64, 2.0 GHz, dual-core.
In this test, we seem to remove about 10 cycles from the kmalloc fast
path, which on small allocations gives a 14% performance increase. The
kfree fast path also seems to gain about 10 cycles.
1. Kmalloc: Repeatedly allocate then free test
* cmpxchg_local slub
kmalloc(8) = 63 cycles kfree = 126 cycles
kmalloc(16) = 66 cycles kfree = 129 cycles
kmalloc(32) = 76 cycles kfree = 138 cycles
kmalloc(64) = 100 cycles kfree = 288 cycles
kmalloc(128) = 128 cycles kfree = 309 cycles
kmalloc(256) = 170 cycles kfree = 315 cycles
kmalloc(512) = 221 cycles kfree = 357 cycles
kmalloc(1024) = 324 cycles kfree = 393 cycles
kmalloc(2048) = 354 cycles kfree = 440 cycles
kmalloc(4096) = 394 cycles kfree = 330 cycles
kmalloc(8192) = 523 cycles kfree = 481 cycles
kmalloc(16384) = 643 cycles kfree = 649 cycles
* Base
kmalloc(8) = 74 cycles kfree = 113 cycles
kmalloc(16) = 76 cycles kfree = 116 cycles
kmalloc(32) = 85 cycles kfree = 133 cycles
kmalloc(64) = 111 cycles kfree = 279 cycles
kmalloc(128) = 138 cycles kfree = 294 cycles
kmalloc(256) = 181 cycles kfree = 304 cycles
kmalloc(512) = 237 cycles kfree = 327 cycles
kmalloc(1024) = 340 cycles kfree = 379 cycles
kmalloc(2048) = 378 cycles kfree = 433 cycles
kmalloc(4096) = 399 cycles kfree = 329 cycles
kmalloc(8192) = 528 cycles kfree = 624 cycles
kmalloc(16384) = 651 cycles kfree = 737 cycles
2. Kmalloc: alloc/free test
* cmpxchg_local slub
kmalloc(8)/kfree = 96 cycles
kmalloc(16)/kfree = 97 cycles
kmalloc(32)/kfree = 97 cycles
kmalloc(64)/kfree = 97 cycles
kmalloc(128)/kfree = 97 cycles
kmalloc(256)/kfree = 105 cycles
kmalloc(512)/kfree = 108 cycles
kmalloc(1024)/kfree = 105 cycles
kmalloc(2048)/kfree = 107 cycles
kmalloc(4096)/kfree = 390 cycles
kmalloc(8192)/kfree = 626 cycles
kmalloc(16384)/kfree = 662 cycles
* Base
kmalloc(8)/kfree = 116 cycles
kmalloc(16)/kfree = 116 cycles
kmalloc(32)/kfree = 116 cycles
kmalloc(64)/kfree = 116 cycles
kmalloc(128)/kfree = 116 cycles
kmalloc(256)/kfree = 126 cycles
kmalloc(512)/kfree = 126 cycles
kmalloc(1024)/kfree = 126 cycles
kmalloc(2048)/kfree = 126 cycles
kmalloc(4096)/kfree = 384 cycles
kmalloc(8192)/kfree = 749 cycles
kmalloc(16384)/kfree = 786 cycles
Tested-by: Christoph Lameter <clameter@sgi.com>
I can confirm Mathieu's measurements now:
Athlon64:
regular NUMA/discontig
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 79 cycles kfree -> 92 cycles
10000 times kmalloc(16) -> 79 cycles kfree -> 93 cycles
10000 times kmalloc(32) -> 88 cycles kfree -> 95 cycles
10000 times kmalloc(64) -> 124 cycles kfree -> 132 cycles
10000 times kmalloc(128) -> 157 cycles kfree -> 247 cycles
10000 times kmalloc(256) -> 200 cycles kfree -> 257 cycles
10000 times kmalloc(512) -> 250 cycles kfree -> 277 cycles
10000 times kmalloc(1024) -> 337 cycles kfree -> 314 cycles
10000 times kmalloc(2048) -> 365 cycles kfree -> 330 cycles
10000 times kmalloc(4096) -> 352 cycles kfree -> 240 cycles
10000 times kmalloc(8192) -> 456 cycles kfree -> 340 cycles
10000 times kmalloc(16384) -> 646 cycles kfree -> 471 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 124 cycles
10000 times kmalloc(16)/kfree -> 124 cycles
10000 times kmalloc(32)/kfree -> 124 cycles
10000 times kmalloc(64)/kfree -> 124 cycles
10000 times kmalloc(128)/kfree -> 124 cycles
10000 times kmalloc(256)/kfree -> 132 cycles
10000 times kmalloc(512)/kfree -> 132 cycles
10000 times kmalloc(1024)/kfree -> 132 cycles
10000 times kmalloc(2048)/kfree -> 132 cycles
10000 times kmalloc(4096)/kfree -> 319 cycles
10000 times kmalloc(8192)/kfree -> 486 cycles
10000 times kmalloc(16384)/kfree -> 539 cycles
cmpxchg_local NUMA/discontig
1. Kmalloc: Repeatedly allocate then free test
10000 times kmalloc(8) -> 55 cycles kfree -> 90 cycles
10000 times kmalloc(16) -> 55 cycles kfree -> 92 cycles
10000 times kmalloc(32) -> 70 cycles kfree -> 91 cycles
10000 times kmalloc(64) -> 100 cycles kfree -> 141 cycles
10000 times kmalloc(128) -> 128 cycles kfree -> 233 cycles
10000 times kmalloc(256) -> 172 cycles kfree -> 251 cycles
10000 times kmalloc(512) -> 225 cycles kfree -> 275 cycles
10000 times kmalloc(1024) -> 325 cycles kfree -> 311 cycles
10000 times kmalloc(2048) -> 346 cycles kfree -> 330 cycles
10000 times kmalloc(4096) -> 351 cycles kfree -> 238 cycles
10000 times kmalloc(8192) -> 450 cycles kfree -> 342 cycles
10000 times kmalloc(16384) -> 630 cycles kfree -> 546 cycles
2. Kmalloc: alloc/free test
10000 times kmalloc(8)/kfree -> 81 cycles
10000 times kmalloc(16)/kfree -> 81 cycles
10000 times kmalloc(32)/kfree -> 81 cycles
10000 times kmalloc(64)/kfree -> 81 cycles
10000 times kmalloc(128)/kfree -> 81 cycles
10000 times kmalloc(256)/kfree -> 91 cycles
10000 times kmalloc(512)/kfree -> 90 cycles
10000 times kmalloc(1024)/kfree -> 91 cycles
10000 times kmalloc(2048)/kfree -> 90 cycles
10000 times kmalloc(4096)/kfree -> 318 cycles
10000 times kmalloc(8192)/kfree -> 483 cycles
10000 times kmalloc(16384)/kfree -> 536 cycles
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68