[RFC/PATCH] Optimize zone allocator synchronization

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

* [RFC/PATCH] Optimize zone allocator synchronization
@ 2007-11-04 19:52 Don Porter
  2007-11-06 10:08 ` Chris Snook
  0 siblings, 1 reply; 6+ messages in thread
From: Don Porter @ 2007-11-04 19:52 UTC (permalink / raw)
  To: linux-kernel

From: Donald E. Porter <porterde@cs.utexas.edu>

In the bulk page allocation/free routines in mm/page_alloc.c, the zone
lock is held across all iterations.  For certain parallel workloads, I
have found that releasing and reacquiring the lock for each iteration
yields better performance, especially at higher CPU counts.  For
instance, kernel compilation is sped up by 5% on an 8 CPU test
machine.  In most cases, there is no significant effect on performance
(although the effect tends to be slightly positive).  This seems quite
reasonable for the very small scope of the change.

My intuition is that this patch prevents smaller requests from waiting
on larger ones.  While grabbing and releasing the lock within the loop
adds a few instructions, it can lower the latency for a particular
thread's allocation which is often on the thread's critical path.
Lowering the average latency for allocation can increase system throughput.

More detailed information, including data from the tests I ran to
validate this change are available at
http://www.cs.utexas.edu/~porterde/kernel-patch.html .

Thanks in advance for your consideration and feedback.

Don

Signed-off-by: Donald E. Porter <porterde@cs.utexas.edu>

---

diff -uprN linux-2.6.23.1/mm/page_alloc.c linux-2.6.23.1-opt/mm/page_alloc.c
--- linux-2.6.23.1/mm/page_alloc.c	2007-10-12 11:43:44.000000000 -0500
+++ linux-2.6.23.1-opt/mm/page_alloc.c	2007-10-29 18:29:05.000000000 -0500
@@ -477,19 +477,19 @@ static inline int free_pages_check(struc
 static void free_pages_bulk(struct zone *zone, int count,
 					struct list_head *list, int order)
 {
-	spin_lock(&zone->lock);
 	zone->all_unreclaimable = 0;
 	zone->pages_scanned = 0;
 	while (count--) {
 		struct page *page;
+		spin_lock(&zone->lock);

 		VM_BUG_ON(list_empty(list));
 		page = list_entry(list->prev, struct page, lru);
 		/* have to delete it as __free_one_page list manipulates */
 		list_del(&page->lru);
 		__free_one_page(page, zone, order);
+		spin_unlock(&zone->lock);
 	}
-	spin_unlock(&zone->lock);
 }

 static void free_one_page(struct zone *zone, struct page *page, int order)
@@ -665,14 +665,17 @@ static int rmqueue_bulk(struct zone *zon
 {
 	int i;

-	spin_lock(&zone->lock);
 	for (i = 0; i < count; ++i) {
-		struct page *page = __rmqueue(zone, order);
+		struct page *page;
+		spin_lock(&zone->lock);
+
+		page = __rmqueue(zone, order);
 		if (unlikely(page == NULL))
 			break;
 		list_add_tail(&page->lru, list);
+		spin_unlock(&zone->lock);
 	}
-	spin_unlock(&zone->lock);
+
 	return i;
 }

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC/PATCH] Optimize zone allocator synchronization
  2007-11-04 19:52 [RFC/PATCH] Optimize zone allocator synchronization Don Porter
@ 2007-11-06 10:08 ` Chris Snook
  2007-11-07  6:19   ` Andrew Morton
  0 siblings, 1 reply; 6+ messages in thread
From: Chris Snook @ 2007-11-06 10:08 UTC (permalink / raw)
  To: Don Porter; +Cc: linux-kernel

Don Porter wrote:
> From: Donald E. Porter <porterde@cs.utexas.edu>
> 
> In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> lock is held across all iterations.  For certain parallel workloads, I
> have found that releasing and reacquiring the lock for each iteration
> yields better performance, especially at higher CPU counts.  For
> instance, kernel compilation is sped up by 5% on an 8 CPU test
> machine.  In most cases, there is no significant effect on performance
> (although the effect tends to be slightly positive).  This seems quite
> reasonable for the very small scope of the change.
> 
> My intuition is that this patch prevents smaller requests from waiting
> on larger ones.  While grabbing and releasing the lock within the loop
> adds a few instructions, it can lower the latency for a particular
> thread's allocation which is often on the thread's critical path.
> Lowering the average latency for allocation can increase system throughput.
> 
> More detailed information, including data from the tests I ran to
> validate this change are available at
> http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> 
> Thanks in advance for your consideration and feedback.

That's an interesting insight.  My intuition is that Nick Piggin's 
recently-posted ticket spinlocks patches[1] will reduce the need for this patch, 
though it may be useful to have both.  Can you benchmark again with only ticket 
spinlocks, and with ticket spinlocks + this patch?  You'll probably want to use 
2.6.24-rc1 as your baseline, due to the x86 architecture merge.

	-- Chris

[1] http://lkml.org/lkml/2007/11/1/123

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC/PATCH] Optimize zone allocator synchronization
  2007-11-06 10:08 ` Chris Snook
@ 2007-11-07  6:19   ` Andrew Morton
  2007-11-07  5:31     ` Nick Piggin
  0 siblings, 1 reply; 6+ messages in thread
From: Andrew Morton @ 2007-11-07  6:19 UTC (permalink / raw)
  To: Chris Snook; +Cc: porterde, linux-kernel, Nick Piggin

> On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <csnook@redhat.com> wrote:
> Don Porter wrote:
> > From: Donald E. Porter <porterde@cs.utexas.edu>
> > 
> > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > lock is held across all iterations.  For certain parallel workloads, I
> > have found that releasing and reacquiring the lock for each iteration
> > yields better performance, especially at higher CPU counts.  For
> > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > machine.  In most cases, there is no significant effect on performance
> > (although the effect tends to be slightly positive).  This seems quite
> > reasonable for the very small scope of the change.
> > 
> > My intuition is that this patch prevents smaller requests from waiting
> > on larger ones.  While grabbing and releasing the lock within the loop
> > adds a few instructions, it can lower the latency for a particular
> > thread's allocation which is often on the thread's critical path.
> > Lowering the average latency for allocation can increase system throughput.
> > 
> > More detailed information, including data from the tests I ran to
> > validate this change are available at
> > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > 
> > Thanks in advance for your consideration and feedback.
> 
> That's an interesting insight.  My intuition is that Nick Piggin's 
> recently-posted ticket spinlocks patches[1] will reduce the need for this patch, 
> though it may be useful to have both.  Can you benchmark again with only ticket 
> spinlocks, and with ticket spinlocks + this patch?  You'll probably want to use 
> 2.6.24-rc1 as your baseline, due to the x86 architecture merge.

The patch as-is would hurt low cpu-count workloads, and single-threaded
workloads: it is simply taking that lock a lot more times.  This will be
particuarly noticable on things like older P4 machines which have peculiarly
expensive locked operations.

A test to run would be, on ext2:

	time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)

(might need to increase /proc/sys/vm/dirty* to avoid any writeback)


I wonder if we can do something like:

	if (lock_is_contended(lock)) {
		spin_unlock(lock);
		spin_lock(lock);		/* To the back of the queue */
	}

(in conjunction with the ticket locks) so that we only do the expensive
buslocked operation when we actually have a need to do so.

(The above should be wrapped in some new spinlock interface function which
is probably a no-op on architectures which cannot implement it usefully)

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC/PATCH] Optimize zone allocator synchronization
  2007-11-07  6:19   ` Andrew Morton
@ 2007-11-07  5:31     ` Nick Piggin
  2007-11-18  5:36       ` Don Porter
  0 siblings, 1 reply; 6+ messages in thread
From: Nick Piggin @ 2007-11-07  5:31 UTC (permalink / raw)
  To: Andrew Morton, David Miller; +Cc: Chris Snook, porterde, linux-kernel

On Wednesday 07 November 2007 17:19, Andrew Morton wrote:
> > On Tue, 06 Nov 2007 05:08:07 -0500 Chris Snook <csnook@redhat.com> wrote:
> >
> > Don Porter wrote:
> > > From: Donald E. Porter <porterde@cs.utexas.edu>
> > >
> > > In the bulk page allocation/free routines in mm/page_alloc.c, the zone
> > > lock is held across all iterations.  For certain parallel workloads, I
> > > have found that releasing and reacquiring the lock for each iteration
> > > yields better performance, especially at higher CPU counts.  For
> > > instance, kernel compilation is sped up by 5% on an 8 CPU test
> > > machine.  In most cases, there is no significant effect on performance
> > > (although the effect tends to be slightly positive).  This seems quite
> > > reasonable for the very small scope of the change.
> > >
> > > My intuition is that this patch prevents smaller requests from waiting
> > > on larger ones.  While grabbing and releasing the lock within the loop
> > > adds a few instructions, it can lower the latency for a particular
> > > thread's allocation which is often on the thread's critical path.
> > > Lowering the average latency for allocation can increase system
> > > throughput.
> > >
> > > More detailed information, including data from the tests I ran to
> > > validate this change are available at
> > > http://www.cs.utexas.edu/~porterde/kernel-patch.html .
> > >
> > > Thanks in advance for your consideration and feedback.

I did see this initial post, and didn't quite know what to make of it.
I'll admit it is slightly unexpected :) Always good to research ideas
against common convention, though.

I don't know whether your reasoning is correct though: unless there is
a significant number of higher order allocations (which there should
not be, AFAIKS), all allocators will go through the per CPU lists which
batch the same number of objects on and off, so there is no such thing
as smaller or larger requests.

And there are a number of regressions as well in your tests. It would be
nice to get some more detailed profile numbers (preferably with an
upstream kernel) to try to work out what is going faster.

It's funny, Dave Miller and I were just talking about the possible
reappearance of zone->lock contention with massively multi core and
multi threaded CPUs. I think the right way to fix this in the long run
if it turns into a real problem, is something like having a lock per
MAX_ORDER block, and having CPUs prefer to allocate from different
blocks. Anti-frag makes this pretty interesting to implement, but it
will be possible.

> > That's an interesting insight.  My intuition is that Nick Piggin's
> > recently-posted ticket spinlocks patches[1] will reduce the need for this
> > patch, though it may be useful to have both.  Can you benchmark again
> > with only ticket spinlocks, and with ticket spinlocks + this patch? 
> > You'll probably want to use 2.6.24-rc1 as your baseline, due to the x86
> > architecture merge.
>
> The patch as-is would hurt low cpu-count workloads, and single-threaded
> workloads: it is simply taking that lock a lot more times.  This will be
> particuarly noticable on things like older P4 machines which have
> peculiarly expensive locked operations.

It's not even restricted to P4s -- another big cost is going to be the
cacheline pingpong. Actually it might be worth trying another test run
with zone->lock put into its own cacheline (as it stands, when the lock
gets contended, spinners will just sit there pushing useful fields out
of the holder's memory -- ticket locks will do better here, but they
still write to the lock once, then sit there loading it).

> A test to run would be, on ext2:
>
> 	time (dd if=/dev/zero of=foo bs=16k count=2048 ; rm foo)
>
> (might need to increase /proc/sys/vm/dirty* to avoid any writeback)
>
>
> I wonder if we can do something like:
>
> 	if (lock_is_contended(lock)) {
> 		spin_unlock(lock);
> 		spin_lock(lock);		/* To the back of the queue */
> 	}
>
> (in conjunction with the ticket locks) so that we only do the expensive
> buslocked operation when we actually have a need to do so.
>
> (The above should be wrapped in some new spinlock interface function which
> is probably a no-op on architectures which cannot implement it usefully)

We have the need_lockbreak stuff. Of course, that's often pretty useless
with regular spinlocks (when you consider that my tests show that a single
CPU can be allowed to retake the same lock several million times in a row
despite contention)...

Anyway, yeah we could do that. But I think we do actually want to batch
up allocations on a given CPU in the multithreaded case as well, rather
than interleave them. There are some benefits avoiding cacheline bouncing.

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC/PATCH] Optimize zone allocator synchronization
  2007-11-07  5:31     ` Nick Piggin
@ 2007-11-18  5:36       ` Don Porter
  2008-01-29 16:31         ` Don Porter
  0 siblings, 1 reply; 6+ messages in thread
From: Don Porter @ 2007-11-18  5:36 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, David Miller, Chris Snook, linux-kernel

Thank you all for your consideration and insightful responses to my
posting.  I apologize for not responding sooner---I have been under a
deadline.

It seems clear that further investigation will be needed to understand
these performance numbers better.

To summarize, I understand that the following experiments will be helpful:

1) Instrument the allocation code to determine the common size/order
of the allocations for these workloads.

2) Try to integrate these changes with ticket spinlocks

3) Try placing the zone lock in its own cacheline

4) Look for single-threaded regressions (dd benchmark).

I'll do these at my first opportunity, hopefully within the next week.
Please let me know if I misunderstood any of your comments.

My intuition about the cost of ping-ponging the lock's cache line
certainly matched yours, so I was very surprised to see these
performance numbers.  

On Wed, Nov 07, 2007 at 04:31:59PM +1100, Nick Piggin wrote:
> It's funny, Dave Miller and I were just talking about the possible
> reappearance of zone->lock contention with massively multi core and
> multi threaded CPUs. I think the right way to fix this in the long run
> if it turns into a real problem, is something like having a lock per
> MAX_ORDER block, and having CPUs prefer to allocate from different
> blocks. Anti-frag makes this pretty interesting to implement, but it
> will be possible.

As a bit of background, the zone lock is indeed one of the more
contended locks in my target workloads so it was no accident that I
was looking for ways to improve its scalability.  I am quite
interested in Nick's ideas about how to split up the zone allocator's
synchronization.

Of course, these contention levels may not meet your definition of
"real problem" (~.1% of the execution time).

Best regards,
Don

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [RFC/PATCH] Optimize zone allocator synchronization
  2007-11-18  5:36       ` Don Porter
@ 2008-01-29 16:31         ` Don Porter
  0 siblings, 0 replies; 6+ messages in thread
From: Don Porter @ 2008-01-29 16:31 UTC (permalink / raw)
  To: Nick Piggin; +Cc: Andrew Morton, David Miller, Chris Snook, linux-kernel

I apologize again for the long delay in responding with the requested
additional data needed to understand the performance of this patch.
The complete information is available at:

http://www.cs.utexas.edu/~porterde/kernel-patch.html#subsequent

I instrumented the kernel within simics to determine that the mean
value of the count is 7 and the order is 0 for both free_pages_bulk()
and rmqueue_bulk() for the simulation workloads.

I applied the change suggested by Andrew Morton to the 2.6.24-rc7
kernel patched with Nick Piggin's ticket spinlock patch.

These data indicate that there is a small performance penalty (1-2%)
incurred when adding ticket spinlocks alone, probably because they are
not used enough in the kernel to reap the performance benefits of the
implementation cost.  By allowing the lock to be released and
reacquired under contention in just these two places, the 1-2%
overhead of ticket spin performance is reclaimed in most benchmarks,
making overall performance comparable to the baseline kernel.

The only data inconsistent with these results are the kernel
compilation benchmarks.  These data indicate that the best performance
is from the baseline kernel.  It is not clear to me what property of
kernel compilation causes it to have this performance profile.

Placing the zone spin lock in its own cache line hurts performance.

The dd regression test actually shows a similar trend to the other
benchmarks; ticket spinlocks + my patch perform best.

Thanks for your comments and consideration,

Don Porter

On Sat, Nov 17, 2007 at 11:36:26PM -0600, Don Porter wrote:
> Thank you all for your consideration and insightful responses to my
> posting.  I apologize for not responding sooner---I have been under a
> deadline.
> 
> It seems clear that further investigation will be needed to understand
> these performance numbers better.
> 
> To summarize, I understand that the following experiments will be helpful:
> 
> 1) Instrument the allocation code to determine the common size/order
> of the allocations for these workloads.
> 
> 2) Try to integrate these changes with ticket spinlocks
> 
> 3) Try placing the zone lock in its own cacheline
> 
> 4) Look for single-threaded regressions (dd benchmark).
> 
> I'll do these at my first opportunity, hopefully within the next week.
> Please let me know if I misunderstood any of your comments.
> 
> My intuition about the cost of ping-ponging the lock's cache line
> certainly matched yours, so I was very surprised to see these
> performance numbers.  
> 
> On Wed, Nov 07, 2007 at 04:31:59PM +1100, Nick Piggin wrote:
> > It's funny, Dave Miller and I were just talking about the possible
> > reappearance of zone->lock contention with massively multi core and
> > multi threaded CPUs. I think the right way to fix this in the long run
> > if it turns into a real problem, is something like having a lock per
> > MAX_ORDER block, and having CPUs prefer to allocate from different
> > blocks. Anti-frag makes this pretty interesting to implement, but it
> > will be possible.
> 
> As a bit of background, the zone lock is indeed one of the more
> contended locks in my target workloads so it was no accident that I
> was looking for ways to improve its scalability.  I am quite
> interested in Nick's ideas about how to split up the zone allocator's
> synchronization.
> 
> Of course, these contention levels may not meet your definition of
> "real problem" (~.1% of the execution time).
> 
> Best regards,
> Don

^ permalink raw reply	[flat|nested] 6+ messages in thread

end of thread, other threads:[~2008-01-29 17:05 UTC | newest]

Thread overview: 6+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2007-11-04 19:52 [RFC/PATCH] Optimize zone allocator synchronization Don Porter
2007-11-06 10:08 ` Chris Snook
2007-11-07  6:19   ` Andrew Morton
2007-11-07  5:31     ` Nick Piggin
2007-11-18  5:36       ` Don Porter
2008-01-29 16:31         ` Don Porter

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox