linux-mm.kvack.org archive mirror
* Improving lock pages
@ 2013-01-15 17:38 Nathan Zimmer
  2013-01-15 18:10 ` Nathan Zimmer
  2013-02-06 16:31 ` Mel Gorman
  0 siblings, 2 replies; 5+ messages in thread
From: Nathan Zimmer @ 2013-01-15 17:38 UTC (permalink / raw)
  To: Mel Gorman; +Cc: holt, linux-mm


Hello Mel,
    You helped some time ago with contention in lock_pages on very large boxes. 
You worked with Jack Steiner on this.  Currently I am tasked with improving this 
area even more.  So I am fishing for any more ideas that would be productive or 
worth trying. 

I have some numbers from a 512 machine.

Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
      0.166850
      0.082339
      0.248428
      0.081197
      0.127635

Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
      0.151778
      0.118343
      0.135750
      0.437019
      0.120536

Nathan Zimmer


* Re: Improving lock pages
  2013-01-15 17:38 Improving lock pages Nathan Zimmer
@ 2013-01-15 18:10 ` Nathan Zimmer
  2013-02-06 16:31 ` Mel Gorman
  1 sibling, 0 replies; 5+ messages in thread
From: Nathan Zimmer @ 2013-01-15 18:10 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: Mel Gorman, holt, linux-mm

[-- Attachment #1: Type: text/plain, Size: 1443 bytes --]

On Tue, Jan 15, 2013 at 11:38:14AM -0600, Nathan Zimmer wrote:
> 
> Hello Mel,
>     You helped some time ago with contention in lock_pages on very large boxes. 
> You worked with Jack Steiner on this.  Currently I am tasked with improving this 
> area even more.  So I am fishing for any more ideas that would be productive or 
> worth trying. 
> 
> I have some numbers from a 512 machine.
> 
> Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
>       0.166850
>       0.082339
>       0.248428
>       0.081197
>       0.127635
> 
> Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
>       0.151778
>       0.118343
>       0.135750
>       0.437019
>       0.120536
> 
> Nathan Zimmer
> 

I realized I forgot to attach the test.

The test is fairly basic.  Just fork off a number of children, pin each to its
own cpu, have them all wait on a shared cell, and measure how long it takes for
them all to exit.

Usage is ./time_exit -p 3 512

The numbers I have provided were from some runs on a 512 system.  I tried for
a 4096 box but it was being fickle and was needed for some other testing.
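
In case it is useful, this is roughly how I build and drive it (nothing
special about the compiler flags, any recent gcc should do; -p is just a
settle delay in seconds before the parent releases the children, and the
last argument is the number of children to fork):

gcc -O2 -o time_exit time_exit.c

# e.g. five runs with a 3 second settle delay and 512 children
for i in $(seq 1 5); do ./time_exit -p 3 512; done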


[-- Attachment #2: time_exit.c --]
[-- Type: text/x-c++src, Size: 2092 bytes --]

#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#include <sys/mman.h>
#include <sys/time.h>
#include <sys/wait.h>

/* State shared between the parent and its children; each field sits on its
 * own cache line so the ready counter and the quit flag do not bounce the
 * same line between cpus. */
struct time_exit {
	volatile int ready	__attribute__((aligned(64)));
	volatile int quit	__attribute__((aligned(64)));
};

#define cpu_relax()             asm volatile ("rep;nop":::"memory");

#define MAXCPUS 4096
static int cpu_set_size;
static cpu_set_t *task_affinity;
static int delay;

static void pin(int cpu)
{
	cpu_set_t *affinity;

	if (cpu < 0 || cpu >= MAXCPUS)
		return;

	affinity = CPU_ALLOC(MAXCPUS);
	CPU_ZERO_S(cpu_set_size, affinity);
	CPU_SET_S(cpu, cpu_set_size, affinity);
	(void)sched_setaffinity(0, cpu_set_size, affinity);
	CPU_FREE(affinity);
	return;
}

/* Child: pin to the given cpu, check in, then spin until told to quit. */
static void child(struct time_exit *sharep, int cpu)
{
	pin(cpu);
	__sync_fetch_and_add(&sharep->ready, 1);
	while (sharep->quit == 0)
		cpu_relax();
	exit(0);
}

int main(int argc, char **argv)
{
	int children, i;
	struct time_exit *sharep;
	struct timeval tv0, tv1;
	long secs, usecs;
	int opt;	/* getopt() returns an int */

	while ((opt = getopt(argc, argv, "p:")) != -1) {
		switch (opt) {
		case 'p':
			delay = atoi(optarg);
			break;
		default:
			fprintf(stderr, "Usage: %s [-p delay_seconds] nchildren\n", argv[0]);
		}
	}
	argv += optind - 1;
	argc -= optind - 1;
	if (argc != 2) {
		fprintf(stderr, "Expected a single child-count argument\n");
		exit(-1);
	}
	children = atoi(argv[1]);

	cpu_set_size = CPU_ALLOC_SIZE(MAXCPUS);
	task_affinity = CPU_ALLOC(MAXCPUS);
	if (sched_getaffinity(0, cpu_set_size, task_affinity) < 0) {
		perror("Failed in sched_getaffinity");
		exit(-2);
	}

	/* Anonymous shared mapping so both flags are visible to the children. */
	sharep = mmap(NULL, sizeof(struct time_exit), PROT_READ | PROT_WRITE,
			MAP_ANONYMOUS | MAP_SHARED, -1, 0);
	if (sharep == MAP_FAILED) {
		perror("mmap");
		exit(-3);
	}

	for (i = 0; i < children; i++)
		if (fork() == 0)
			child(sharep, i);

	/* Wait until every child has pinned itself and checked in. */
	while (sharep->ready != children)
		cpu_relax();

	if (delay)
		sleep(delay);

	/* Release the children and time how long they all take to exit. */
	gettimeofday(&tv0, NULL);
	sharep->quit = 1;
	while (wait(&i) > 0)
		cpu_relax();
	gettimeofday(&tv1, NULL);

	usecs = tv1.tv_usec - tv0.tv_usec;
	secs = tv1.tv_sec - tv0.tv_sec;
	if (usecs < 0) {
		secs--;
		usecs += 1000000;
	}
	printf("%7ld.%06ld\n", secs, usecs);

	return 0;
}


* Re: Improving lock pages
  2013-01-15 17:38 Improving lock pages Nathan Zimmer
  2013-01-15 18:10 ` Nathan Zimmer
@ 2013-02-06 16:31 ` Mel Gorman
  2013-02-08 21:55   ` Nathan Zimmer
  1 sibling, 1 reply; 5+ messages in thread
From: Mel Gorman @ 2013-02-06 16:31 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: holt, linux-mm

On Tue, Jan 15, 2013 at 11:38:14AM -0600, Nathan Zimmer wrote:
> 
> Hello Mel,

Hi Nathan,

>     You helped some time ago with contention in lock_pages on very large boxes. 

It was Nick Piggin and Jack Steiner that helped the situation within SLES
and before my time. I inherited the relevant patches but made relatively
few contributions to the effort.

> You worked with Jack Steiner on this.  Currently I am tasked with improving this 
> area even more.  So I am fishing for any more ideas that would be productive or 
> worth trying. 
> 
> I have some numbers from a 512 machine.
> 
> Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
>       0.166850
>       0.082339
>       0.248428
>       0.081197
>       0.127635

Ok, this looks like a SLES 11 SP2 kernel and so includes some unlock/lock
page optimisations.

> Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
>       0.151778
>       0.118343
>       0.135750
>       0.437019
>       0.120536
> 

And this is a mainline-ish kernel which doesn't.

The main reason I never made a strong effort to push them upstream is that
the problems are barely observable on any machine I had access to.
The unlock page optimisation requires a page flag and while it helps
profiles a little, the effects are barely observable on smaller machines
(at least since I last checked).  One machine it was reported to help
dramatically was a 768-way 128 node machine.

For the 512-way machine you're testing with, the figures are marginal. The
time to exit is shorter but the amount of time is tiny and very close to
noise. I forward ported the relevant patches but on a 48-way machine the
results for the same test were well within the noise and the standard
deviation was higher.

I know you're tasked with improving this area more but what are you
using as your example workload? What's the minimum sized machine needed
for the optimisations to make a difference?

-- 
Mel Gorman
SUSE Labs


* Re: Improving lock pages
  2013-02-06 16:31 ` Mel Gorman
@ 2013-02-08 21:55   ` Nathan Zimmer
  2013-02-13 10:47     ` Mel Gorman
  0 siblings, 1 reply; 5+ messages in thread
From: Nathan Zimmer @ 2013-02-08 21:55 UTC (permalink / raw)
  To: Mel Gorman; +Cc: holt, linux-mm

On 02/06/2013 10:31 AM, Mel Gorman wrote:
> On Tue, Jan 15, 2013 at 11:38:14AM -0600, Nathan Zimmer wrote:
>> Hello Mel,
> Hi Nathan,
>
>>      You helped some time ago with contention in lock_pages on very large boxes.
> It was Nick Piggin and Jack Steiner that helped the situation within SLES
> and before my time. I inherited the relevant patches but made relatively
> few contributions to the effort.
>
>> You worked with Jack Steiner on this.  Currently I am tasked with improving this
>> area even more.  So I am fishing for any more ideas that would be productive or
>> worth trying.
>>
>> I have some numbers from a 512 machine.
>>
>> Linux uvpsw1 3.0.51-0.7.9-default #1 SMP Thu Nov 29 22:12:17 UTC 2012 (f3be9d0) x86_64 x86_64 x86_64 GNU/Linux
>>        0.166850
>>        0.082339
>>        0.248428
>>        0.081197
>>        0.127635
> Ok, this looks like a SLES 11 SP2 kernel and so includes some unlock/lock
> page optimisations.
>
>> Linux uvpsw1 3.8.0-rc1-medusa_ntz_clean-dirty #32 SMP Tue Jan 8 16:01:04 CST 2013 x86_64 x86_64 x86_64 GNU/Linux
>>        0.151778
>>        0.118343
>>        0.135750
>>        0.437019
>>        0.120536
>>
> And this is a mainline-ish kernel which doesn't.
>
> The main reason I never made a strong effort to push them upstream is
> that the problems are barely observable on any machine I had access to.
> The unlock page optimisation requires a page flag and while it helps
> profiles a little, the effects are barely observable on smaller machines
> (at least since I last checked).  One machine it was reported to help
> dramatically was a 768-way 128 node machine.
>
> For the 512-way machine you're testing with, the figures are marginal. The
> time to exit is shorter but the amount of time is tiny and very close to
> noise. I forward ported the relevant patches but on a 48-way machine the
> results for the same test were well within the noise and the standard
> deviation was higher.
One thing I have noticed is that the performance curve on this issue is
worse than linear.
This has made it tough to measure/capture data on smaller boxes.

> I know you're tasked with improving this area more but what are you
> using as your example workload? What's the minimum sized machine needed
> for the optimisations to make a difference?
>
Right now I am just using the time_exit test I posted earlier.
I know it is a bit artificial and am open to suggestion.

One of the rough goals is to get under a second on a 4096 box.

Also here are some numbers from a larger box with 3.8-rc4...
nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 2048; }
       0.762282
       0.810356
       0.777785
       0.840679
       0.743509

nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 4096; }
       2.550571
       2.374378
       2.669021
       2.703232
       2.679028


* Re: Improving lock pages
  2013-02-08 21:55   ` Nathan Zimmer
@ 2013-02-13 10:47     ` Mel Gorman
  0 siblings, 0 replies; 5+ messages in thread
From: Mel Gorman @ 2013-02-13 10:47 UTC (permalink / raw)
  To: Nathan Zimmer; +Cc: holt, linux-mm

On Fri, Feb 08, 2013 at 03:55:09PM -0600, Nathan Zimmer wrote:
> >The main reason I never made a strong effort to push them upstream is
> >that the problems are barely observable on any machine I had access to.
> >The unlock page optimisation requires a page flag and while it helps
> >profiles a little, the effects are barely observable on smaller machines
> >(at least since I last checked).  One machine it was reported to help
> >dramatically was a 768-way 128 node machine.
> >
> >For the 512-way machine you're testing with, the figures are marginal. The
> >time to exit is shorter but the amount of time is tiny and very close to
> >noise. I forward ported the relevant patches but on a 48-way machine the
> >results for the same test were well within the noise and the standard
> >deviation was higher.
>
> One thing I have noticed is that the performance curve on this issue is
> worse than linear.
> This has made it tough to measure/capture data on smaller boxes.
> 

While this is true, the figures you present are of marginal gain given the
complexity involved.  I know the patches also affected boot times quite
significantly, but booting was not a common task for the machines involved.

> >I know you're tasked with improving this area more but what are you
> >using as your example workload? What's the minimum sized machine needed
> >for the optimisations to make a difference?
> >
>
> Right now I am just using the time_exit test I posted earlier.
> I know it is a bit artificial and am open to suggestion.
> 

I'm not currently aware of a workload that is dominated by lock_page
contention, and I was expecting SGI would be. There are plenty of times where
we stall on lock_page, but it's usually IO-related and not because processes
trying to acquire the lock went to sleep too quickly.

> One of the rough goals is to get under a second on a 4096 box.
> 
> Also here are some numbers from a larger box with 3.8-rc4...
> nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 2048; }
>       0.762282
>       0.810356
>       0.777785
>       0.840679
>       0.743509
> 
> nzimmer@uv48-sys:~/tests/time_exit> for I in $(seq 1 5); { ./time_exit -p 3 4096; }
>       2.550571
>       2.374378
>       2.669021
>       2.703232
>       2.679028
> 

I collapsed the patches, edited them a bit and pushed them to the
mm-lock-page-optimise-v1r1 branch in the git repository
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git 

The patches are rebased against 3.8-rc6 but I did not pay any special
attention to actually improving them. I did leave a few notes on what could
be done in the changelog. You could try them out as a starting point and
see if they can be reduced to the minimum you require. Unfortunately I
suspect that you'll need a more compelling test case than time_exit on a
4096-way machine to justify pushing them to mainline.
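
If it saves you a minute, something along these lines should pull the branch
into an existing tree (untested from this side; the URL and branch name are
as above):

git remote add mel git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
git fetch mel
git checkout -b mm-lock-page-optimise-v1r1 mel/mm-lock-page-optimise-v1r1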

-- 
Mel Gorman
SUSE Labs
