* [PATCH] fix free swap cache latency
From: Hugh Dickins @ 2006-03-16 19:11 UTC
To: Andrew Morton; +Cc: Lee Revell, Ingo Molnar, Nick Piggin, linux-kernel
Lee Revell reported 28ms latency when a process with lots of swapped memory
exits.
2.6.15 introduced a latency regression when unmapping: in accounting the
zap_work latency breaker, pte_none counted 1, pte_present PAGE_SIZE, but
a swap entry counted nothing at all. We think of pages present as the
slow case, but Lee's trace shows that free_swap_and_cache's radix tree
lookup can add up to a lot of work - and we could have been doing it many
thousands of times without a latency break.
Move the zap_work update up to account swap entries like pages present.
This does account non-linear pte_file entries, and unmap_mapping_range
skipping over swap entries, by the same amount even though they're quick:
but neither of those cases deserves complicating the code (and they're
treated no worse than they were in 2.6.14).
Signed-off-by: Hugh Dickins <hugh@veritas.com>
Acked-by: Nick Piggin <npiggin@suse.de>
---
Andrew, I recommend this one for 2.6.16: but you may fairly disagree,
so I'm sending it to you, to pass on to Linus or not as you see fit.
Lee doesn't expect to be able to reproduce the testcase quickly, so
the fix has not been verified: but we consider it self-evidently good.
mm/memory.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
--- 2.6.16-rc6/mm/memory.c	2006-03-12 15:25:45.000000000 +0000
+++ linux/mm/memory.c	2006-03-15 07:32:36.000000000 +0000
@@ -623,11 +623,12 @@ static unsigned long zap_pte_range(struc
 				(*zap_work)--;
 				continue;
 			}
+
+			(*zap_work) -= PAGE_SIZE;
+
 			if (pte_present(ptent)) {
 				struct page *page;
 
-				(*zap_work) -= PAGE_SIZE;
-
 				page = vm_normal_page(vma, addr, ptent);
 				if (unlikely(details) && page) {
 					/*
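For reference, the accounting flow after this hunk can be sketched as a standalone simulation (a simplified model, not kernel code; `ZAP_BLOCK_SIZE` here is assumed to be the CONFIG_PREEMPT value, and the entry kinds are stand-ins for real pte tests):

```c
#include <stddef.h>

#define PAGE_SIZE 4096L
/* Per-batch latency budget; 8 * PAGE_SIZE is the CONFIG_PREEMPT value
 * (assumed here -- non-preempt kernels use a much larger budget). */
#define ZAP_BLOCK_SIZE (8 * PAGE_SIZE)

enum pte_kind { PTE_NONE, PTE_PRESENT, PTE_SWAP };

/*
 * Model of zap_pte_range()'s accounting after the patch: pte_none
 * still costs 1, but present ptes and swap entries both cost
 * PAGE_SIZE, so free_swap_and_cache() work is now bounded per batch.
 * Returns how many entries are handled before the latency break.
 */
static size_t entries_zapped(const enum pte_kind *ptes, size_t n)
{
	long zap_work = ZAP_BLOCK_SIZE;
	size_t i;

	for (i = 0; i < n && zap_work > 0; i++) {
		if (ptes[i] == PTE_NONE) {
			zap_work--;		/* empty slot: nearly free */
			continue;
		}
		zap_work -= PAGE_SIZE;		/* present page OR swap entry */
	}
	return i;
}
```

With this accounting a run of swap entries now trips the latency break after a handful of entries, instead of running unbounded as before the patch.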
* Re: [PATCH] fix free swap cache latency
From: Ingo Molnar @ 2006-03-16 22:07 UTC
To: Hugh Dickins; +Cc: Andrew Morton, Lee Revell, Nick Piggin, linux-kernel
* Hugh Dickins <hugh@veritas.com> wrote:
> Lee Revell reported 28ms latency when a process with lots of swapped memory
> exits.
>
> 2.6.15 introduced a latency regression when unmapping: in accounting the
> zap_work latency breaker, pte_none counted 1, pte_present PAGE_SIZE, but
> a swap entry counted nothing at all. We think of pages present as the
> slow case, but Lee's trace shows that free_swap_and_cache's radix tree
> lookup can add up to a lot of work - and we could have been doing it many
> thousands of times without a latency break.
>
> Move the zap_work update up to account swap entries like pages present.
> This does account non-linear pte_file entries, and unmap_mapping_range
> skipping over swap entries, by the same amount even though they're quick:
> but neither of those cases deserves complicating the code (and they're
> treated no worse than they were in 2.6.14).
>
> Signed-off-by: Hugh Dickins <hugh@veritas.com>
> Acked-by: Nick Piggin <npiggin@suse.de>
i've added this patch to the 2.6.16-rc6 based -rt kernel yesterday and
ran an overnight stresstest on an SMP box (which creates heavy
swapping too, amongst other things), which found no problems whatsoever.
(not that we would expect any, but it's still nice to know.)
Acked-by: Ingo Molnar <mingo@elte.hu>
Ingo
* Re: [PATCH] fix free swap cache latency
From: Andrew Morton @ 2006-03-17 1:38 UTC
To: Hugh Dickins; +Cc: rlrevell, mingo, nickpiggin, linux-kernel
Hugh Dickins <hugh@veritas.com> wrote:
>
> (*zap_work)--;
> continue;
> }
> +
> + (*zap_work) -= PAGE_SIZE;
Sometimes we subtract 1 from zap_work, sometimes PAGE_SIZE. It's in units
of bytes, so PAGE_SIZE is correct. Although it would make sense to
redefine it to be in units of PAGE_SIZE. What's up with that?
Even better, define it in units of "approximate number of touched
cachelines". After all, it is a sort-of-time-based thing.
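The rescaling Andrew suggests can be sketched like this (illustrative only; the page size is the common 4K, and the cacheline size is an assumption to be picked per-arch). Rounding up matters: in units of PAGE_SIZE, the pte_none cost of 1 byte and the present-page cost both collapse to 1 unit, which is why a finer unit such as a cacheline fits the suggestion better:

```c
#define PAGE_SIZE       4096L
#define CACHELINE_BYTES 128L	/* assumed line size; pick per-arch */

/* Convert a byte-denominated zap_work cost into coarser units,
 * rounding up so that no nonzero cost collapses to zero. */
static long cost_in_units(long bytes, long unit)
{
	return (bytes + unit - 1) / unit;
}
```

In page units, pte_none (1 byte) and a present page (PAGE_SIZE bytes) both cost 1, losing the distinction; in cacheline units the present page costs 32 and pte_none still costs 1, keeping a usable ratio.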
* Re: [PATCH] fix free swap cache latency
From: Nick Piggin @ 2006-03-17 2:52 UTC
To: Andrew Morton; +Cc: Hugh Dickins, rlrevell, mingo, linux-kernel
Andrew Morton wrote:
> Hugh Dickins <hugh@veritas.com> wrote:
>
>> (*zap_work)--;
>> continue;
>> }
>>+
>>+ (*zap_work) -= PAGE_SIZE;
>
>
> Sometimes we subtract 1 from zap_work, sometimes PAGE_SIZE. It's in units
> of bytes, so PAGE_SIZE is correct. Although it would make sense to
> redefine it to be in units of PAGE_SIZE. What's up with that?
>
Subtracting 1 if there is no work to do for that cacheline entry.
> Even better, define it in units of "approximate number of touched
> cachelines". After all, it is a sort-of-time-based thing.
>
Yeah that was the rough intention, but I never actually measured
to see whether the results were right.
The pte_none case touches about 1/32 of a cacheline on my P4
(add a bit for setup costs, subtract a bit for linear prefetchable
access over multiple lines).
pte_present touches the pte; the page once or twice -- first
directly, then from the mmu gather; the vma and the mapping (they'll
mostly be the same though, so the cost is small); a whole host of locks
and atomic operations (put_page_testzero, lru_lock, tree_lock,
page_remove_rmap, zone->lock); an IPI to other CPUs; tlb invalidates;
etc. Some things are batched, some not.
So it gets hard to count, especially if you have other CPUs contending
the same locks. I suspect the per-page cost is not really 128 cache
misses most of the time, but it was just a number I pulled out. Anyone
can feel free to turn the knob if they feel they have a better idea.
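Nick's figures can be cross-checked with a little arithmetic (assuming, as his 1/32 figure implies, 4-byte ptes and 128-byte P4 cachelines; both sizes are assumptions):

```c
#define PTE_BYTES       4L	/* 32-bit pte (assumption matching the 1/32 figure) */
#define CACHELINE_BYTES 128L	/* P4 line size (assumption) */
#define PAGE_SIZE       4096L

/* ptes per cacheline: a pte_none touch dirties 1/32 of a line. */
static long ptes_per_line(void)
{
	return CACHELINE_BYTES / PTE_BYTES;
}

/* If one pte_none (cost 1) is ~1/32 of a line, then a present page's
 * PAGE_SIZE cost corresponds to this many whole cachelines --
 * the "128 cache misses" figure Nick doubts above. */
static long present_cost_in_lines(void)
{
	return PAGE_SIZE / ptes_per_line();
}
```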
--
SUSE Labs, Novell Inc.
* Re: [PATCH] fix free swap cache latency
From: Hugh Dickins @ 2006-03-17 17:55 UTC
To: Nick Piggin; +Cc: Andrew Morton, rlrevell, mingo, linux-kernel
On Fri, 17 Mar 2006, Nick Piggin wrote:
> Andrew Morton wrote:
> > Hugh Dickins <hugh@veritas.com> wrote:
> >
> > > (*zap_work)--;
> > > continue;
> > > }
> > > +
> > > + (*zap_work) -= PAGE_SIZE;
> >
> >
> > Sometimes we subtract 1 from zap_work, sometimes PAGE_SIZE. It's in
> > units
> > of bytes, so PAGE_SIZE is correct. Although it would make sense to
> > redefine it to be in units of PAGE_SIZE. What's up with that?
>
> Subtracting 1 if there is no work to do for that cacheline entry.
>
> > Even better, define it in units of "approximate number of touched
> > cachelines". After all, it is a sort-of-time-based thing.
>
> Yeah that was the rough intention, but I never actually measured
> to see whether the results were right.
> ....
> So it gets hard to count, especially if you have other CPUs contending
> the same locks. I suspect the per-page cost is not really 128 cache
> misses most of the time, but it was just a number I pulled out. Anyone
> can feel free to turn the knob if they feel they have a better idea.
Accounting PAGE_SIZE for a present pte or swap entry preserves the
same latency allowance as Ingo made when he introduced ZAP_BLOCK_SIZE,
so it assures us of no regression on those.
I've always felt a pte_none entry probably ought to count for more than
1 on that scale, perhaps 16. But likewise I never did any calculation or
research, nor heard of any complaints - and large stretches of pte_none
should be quite a common case, after your copy_page_range optimization.
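The effect of such a reweighting is easy to quantify (ZAP_BLOCK_SIZE again assumed to be the CONFIG_PREEMPT value of 8 pages):

```c
#define PAGE_SIZE      4096L
#define ZAP_BLOCK_SIZE (8 * PAGE_SIZE)	/* assumed CONFIG_PREEMPT budget */

/* How many consecutive pte_none slots one latency batch may scan
 * before breaking, for a given per-entry weight. */
static long none_entries_per_batch(long weight)
{
	return ZAP_BLOCK_SIZE / weight;
}
```

At weight 1 a batch may scan 32768 empty entries before a break; Hugh's suggested 16 would cut that to 2048, still cheap but with a much tighter latency bound.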
Plus we've both regarded that code as just for the interim, while both
of us (I see today) pursue preemptible mmu_gather solutions (I'll look
at your patch and report, but not today).
Hugh