* Improving lock pages
@ 2013-01-15 17:38 Nathan Zimmer
2013-01-15 18:10 ` Nathan Zimmer
2013-02-06 16:31 ` Mel Gorman
0 siblings, 2 replies; 5+ messages in thread
From: Nathan Zimmer @ 2013-01-15 17:38 UTC (permalink / raw)
To: Mel Gorman; +Cc: holt, linux-mm
Hello Mel,
You helped some time ago with contention in lock_pages on very large boxes.
You worked with Jack Steiner on this. Currently I am tasked with improving this
area even more. So I am fishing for any more ideas that would be productive or
worth trying.
I have some numbers from a 512-way machine.
Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
0.166850
0.082339
0.248428
0.081197
0.127635
Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
0.151778
0.118343
0.135750
0.437019
0.120536
Nathan Zimmer
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
* Re: Improving lock pages
2013-01-15 17:38 Improving lock pages Nathan Zimmer
@ 2013-01-15 18:10 ` Nathan Zimmer
2013-02-06 16:31 ` Mel Gorman
1 sibling, 0 replies; 5+ messages in thread
From: Nathan Zimmer @ 2013-01-15 18:10 UTC (permalink / raw)
To: Nathan Zimmer; +Cc: Mel Gorman, holt, linux-mm
[-- Attachment #1: Type: text/plain, Size: 1443 bytes --]
On Tue, Jan 15, 2013 at 11:38:14AM -0600, Nathan Zimmer wrote:
>
> Hello Mel,
> You helped some time ago with contention in lock_pages on very large boxes.
> You worked with Jack Steiner on this. Currently I am tasked with improving this
> area even more. So I am fishing for any more ideas that would be productive or
> worth trying.
>
> I have some numbers from a 512 machine.
>
> Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
> 0.166850
> 0.082339
> 0.248428
> 0.081197
> 0.127635
>
> Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
> 0.151778
> 0.118343
> 0.135750
> 0.437019
> 0.120536
>
> Nathan Zimmer
I realized I forgot to attach the test.
The test is fairly basic: fork off a number of processes, each pinned to its
own cpu, have them all spin on a shared memory cell, and measure how long it
takes for them all to exit once the cell is set.
Usage is ./time_exit -p 3 512
(-p is the number of seconds to sleep before releasing the children; the last
argument is the number of children.)
The numbers I provided were from some runs on a 512 system. I tried for
a 4096 box but it was being fickle and was needed for some other testing.
[-- Attachment #2: time_exit.c --]
[-- Type: text/x-c++src, Size: 2092 bytes --]
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>

/* Shared between parent and children; each flag sits on its own cache line. */
struct time_exit {
	volatile int ready __attribute__((aligned(64)));
	volatile int quit __attribute__((aligned(64)));
};

/* x86 PAUSE instruction; the "memory" clobber also acts as a compiler barrier. */
#define cpu_relax() asm volatile ("rep;nop":::"memory")
#define MAXCPUS 4096

static int cpu_set_size;
static cpu_set_t *task_affinity;
static int delay;

/* Pin the calling process to a single cpu. Failures are silently ignored. */
static void pin(int cpu)
{
	cpu_set_t *affinity;

	if (cpu < 0 || cpu >= MAXCPUS)
		return;
	affinity = CPU_ALLOC(MAXCPUS);
	if (!affinity)
		return;
	CPU_ZERO_S(cpu_set_size, affinity);
	CPU_SET_S(cpu, cpu_set_size, affinity);
	(void)sched_setaffinity(0, cpu_set_size, affinity);
	CPU_FREE(affinity);
}

/* Announce readiness, spin until the parent sets quit, then exit. */
static void child(struct time_exit *sharep, int cpu)
{
	pin(cpu);
	__sync_fetch_and_add(&sharep->ready, 1);
	while (sharep->quit == 0)
		cpu_relax();
	exit(0);
}

int main(int argc, char **argv)
{
	int children, i;
	struct time_exit *sharep;
	struct timeval tv0, tv1;
	long secs, usecs;
	int opt;	/* getopt() returns int; a char breaks the -1 test where char is unsigned */

	while ((opt = getopt(argc, argv, "p:")) != -1) {
		switch (opt) {
		case 'p':
			delay = atoi(optarg);
			break;
		default:
			fprintf(stderr, "Usage: time_exit [-p delay] children\n");
			exit(-1);
		}
	}
	argv += optind - 1;
	argc -= optind - 1;
	if (argc != 2) {
		fprintf(stderr, "Usage: time_exit [-p delay] children\n");
		exit(-1);
	}
	children = atoi(argv[1]);

	cpu_set_size = CPU_ALLOC_SIZE(MAXCPUS);
	task_affinity = CPU_ALLOC(MAXCPUS);
	if (sched_getaffinity(0, cpu_set_size, task_affinity) < 0) {
		perror("Failed in sched_getaffinity");
		exit(-2);
	}

	sharep = mmap(0, sizeof(struct time_exit), PROT_READ | PROT_WRITE,
		      MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	if (sharep == MAP_FAILED) {
		perror("Failed in mmap");
		exit(-3);
	}

	for (i = 0; i < children; i++) {
		pid_t pid = fork();

		if (pid < 0) {
			perror("Failed in fork");
			sharep->quit = 1;	/* release the children already started */
			exit(-4);
		}
		if (pid == 0)
			child(sharep, i);
	}

	/* Wait until every child is pinned and spinning. */
	while (sharep->ready != children)
		cpu_relax();
	if (delay)
		sleep(delay);

	/* Time from releasing the children until all of them have been reaped. */
	gettimeofday(&tv0, NULL);
	sharep->quit = 1;
	while (wait(&i) > 0)
		cpu_relax();
	gettimeofday(&tv1, NULL);

	usecs = tv1.tv_usec - tv0.tv_usec;
	secs = tv1.tv_sec - tv0.tv_sec;
	if (usecs < 0) {
		secs--;
		usecs += 1000000;
	}
	printf("%7ld.%06ld\n", secs, usecs);
	return 0;
}
* Re: Improving lock pages
2013-01-15 17:38 Improving lock pages Nathan Zimmer
2013-01-15 18:10 ` Nathan Zimmer
@ 2013-02-06 16:31 ` Mel Gorman
2013-02-08 21:55 ` Nathan Zimmer
1 sibling, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2013-02-06 16:31 UTC (permalink / raw)
To: Nathan Zimmer; +Cc: holt, linux-mm
On Tue, Jan 15, 2013 at 11:38:14AM -0600, Nathan Zimmer wrote:
>
> Hello Mel,
Hi Nathan,
> You helped some time ago with contention in lock_pages on very large boxes.
It was Nick Piggin and Jack Steiner that helped the situation within SLES
and before my time. I inherited the relevant patches but made relatively
few contributions to the effort.
> You worked with Jack Steiner on this. Currently I am tasked with improving this
> area even more. So I am fishing for any more ideas that would be productive or
> worth trying.
>
> I have some numbers from a 512 machine.
>
> Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
> 0.166850
> 0.082339
> 0.248428
> 0.081197
> 0.127635
Ok, this looks like a SLES 11 SP2 kernel and so includes some unlock/lock
page optimisations.
> Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
> 0.151778
> 0.118343
> 0.135750
> 0.437019
> 0.120536
>
And this is a mainline-ish kernel which doesn't.
The main reason I never made a strong effort to push them upstream is
that the problems are barely observable on any machine I had access to.
The unlock page optimisation requires a page flag and while it helps
profiles a little, the effects are barely observable on smaller machines
(at least since I last checked). One machine it was reported to help
dramatically was a 768-way 128 node machine.
For the 512-way machine you're testing with the figures are marginal. The
time to exit is shorter but the amount of time is tiny and very close to
noise. I forward ported the relevant patches but on a 48-way machine the
results for the same test were well within the noise and the standard
deviation was higher.
I know you're tasked with improving this area more but what are you
using as your example workload? What's the minimum sized machine needed
for the optimisations to make a difference?
--
Mel Gorman
SUSE Labs
* Re: Improving lock pages
2013-02-06 16:31 ` Mel Gorman
@ 2013-02-08 21:55 ` Nathan Zimmer
2013-02-13 10:47 ` Mel Gorman
0 siblings, 1 reply; 5+ messages in thread
From: Nathan Zimmer @ 2013-02-08 21:55 UTC (permalink / raw)
To: Mel Gorman; +Cc: holt, linux-mm
On 02/06/2013 10:31 AM, Mel Gorman wrote:
> On Tue, Jan 15, 2013 at 11:38:14AM -0600, Nathan Zimmer wrote:
>> Hello Mel,
> Hi Nathan,
>
>> You helped some time ago with contention in lock_pages on very large boxes.
> It was Nick Piggin and Jack Steiner that helped the situation within SLES
> and before my time. I inherited the relevant patches but made relatively
> few contributions to the effort.
>
>> You worked with Jack Steiner on this. Currently I am tasked with improving this
>> area even more. So I am fishing for any more ideas that would be productive or
>> worth trying.
>>
>> I have some numbers from a 512 machine.
>>
>> Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
>> 0.166850
>> 0.082339
>> 0.248428
>> 0.081197
>> 0.127635
> Ok, this looks like a SLES 11 SP2 kernel and so includes some unlock/lock
> page optimisations.
>
>> Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
>> 0.151778
>> 0.118343
>> 0.135750
>> 0.437019
>> 0.120536
>>
> And this is a mainline-ish kernel which doesn't.
>
> The main reason I never made a strong effort to push them upstream is
> that the problems are barely observable on any machine I had access to.
> The unlock page optimisation requires a page flag and while it helps
> profiles a little, the effects are barely observable on smaller machines
> (at least since I last checked). One machine it was reported to help
> dramatically was a 768-way 128 node machine.
>
> For the 512-way machine you're testing with the figures are marginal. The
> time to exit is shorter but the amount of time is tiny and very close to
> noise. I forward ported the relevant patches but on a 48-way machine the
> results for the same test were well within the noise and the standard
> deviation was higher.
One thing I had noticed is that the performance curve on this issue is
worse than linear.
This has made it tough to measure/capture data on smaller boxes.
> I know you're tasked with improving this area more but what are you
> using as your example workload? What's the minimum sized machine needed
> for the optimisations to make a difference?
>
Right now I am just using the time_exit test I posted earlier.
I know it is a bit artificial and am open to suggestion.
One of the rough goals is to get under a second on a 4096 box.
Also here are some numbers from a larger box with 3.8-rc4...
nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 2048; }
0.762282
0.810356
0.777785
0.840679
0.743509
nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 4096; }
2.550571
2.374378
2.669021
2.703232
2.679028
* Re: Improving lock pages
2013-02-08 21:55 ` Nathan Zimmer
@ 2013-02-13 10:47 ` Mel Gorman
0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2013-02-13 10:47 UTC (permalink / raw)
To: Nathan Zimmer; +Cc: holt, linux-mm
On Fri, Feb 08, 2013 at 03:55:09PM -0600, Nathan Zimmer wrote:
> >The main reason I never made a strong effort to push them upstream is
> >that the problems are barely observable on any machine I had access to.
> >The unlock page optimisation requires a page flag and while it helps
> >profiles a little, the effects are barely observable on smaller machines
> >(at least since I last checked). One machine it was reported to help
> >dramatically was a 768-way 128 node machine.
> >
> >For the 512-way machine you're testing with the figures are marginal. The
> >time to exit is shorter but the amount of time is tiny and very close to
> >noise. I forward ported the relevant patches but on a 48-way machine the
> >results for the same test were well within the noise and the standard
> >deviation was higher.
>
> One thing I had noticed is that the performance curve on this issue is
> worse than linear.
> This has made it tough to measure/capture data on smaller boxes.
>
While this is true the figures you present are of marginal gain given the
complexity involved. I know the patches also affected boot-times quite
significantly but this was not a common task for the machines involved.
> >I know you're tasked with improving this area more but what are you
> >using as your example workload? What's the minimum sized machine needed
> >for the optimisations to make a difference?
> >
>
> Right now I am just using the time_exit test I posted earlier.
> I know it is a bit artificial and am open to suggestion.
>
I'm not currently aware of a workload that is dominated by lock_page
contention, and I was expecting that SGI was. There are plenty of times
where we stall on lock_page, but it's usually IO related and not because
processes trying to acquire the lock went to sleep too quickly.
> One of the rough goals is to get under a second on a 4096 box.
>
> Also here are some numbers from a larger box with 3.8-rc4...
> nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 2048; }
> 0.762282
> 0.810356
> 0.777785
> 0.840679
> 0.743509
>
> nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 4096; }
> 2.550571
> 2.374378
> 2.669021
> 2.703232
> 2.679028
>
I collapsed the patches, edited them a bit and pushed them to the
mm-lock-page-optimise-v1r1 branch in the git repository
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
The patches are rebased against 3.8-rc6 but I did not pay any special
attention to actually improving them. I did leave a few notes on what could
be done in the changelog. You could try them out as a starting point and
see if they can be reduced to the minimum you require. Unfortunately I
suspect that you'll need a more compelling test case than time_exit on a
4096-way machine to justify pushing them to mainline.
--
Mel Gorman
SUSE Labs