public inbox for linux-ia64@vger.kernel.org
 help / color / mirror / Atom feed
* [Patch 0/3] Page table quicklist fixups Rev 3.
@ 2005-03-03 21:23 Robin Holt
  2005-03-03 21:57 ` Luck, Tony
                   ` (10 more replies)
  0 siblings, 11 replies; 12+ messages in thread
From: Robin Holt @ 2005-03-03 21:23 UTC (permalink / raw)
  To: linux-ia64

Tony,

Here are the changes again.  I have not received a response from you
about the restructured shrink routine.  I have incorporated David
Mosberger's recommendations.

Thanks,
Robin Holt

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
@ 2005-03-03 21:57 ` Luck, Tony
  2005-03-04  8:03 ` Zou Nan hai
                   ` (9 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Luck, Tony @ 2005-03-03 21:57 UTC (permalink / raw)
  To: linux-ia64

>Here are the changes again.  I have not received a response from you
>about the restructured shrink routine.  I have incorporated David
>Mosberger's recommendations.

Your new shrink routine looks ok, but you didn't address the
double role played by NODE_FREE_PAGES_SHIFT ... you just dropped the
comment that explained one of its uses - which isn't what I'd hoped
for :-)

I'm also a bit uncomfortable with:

+		preempt_enable();
+		preempt_disable();

For a kernel with CONFIG_PREEMPT=n, this is a no-op ... so if there
is a ton of extra pages on the quicklist, we'll loop freeing 16 at
a time and re-computing how many to free, with no pause to take a
breath (or a clock tick).
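The loop shape Tony is describing can be sketched in plain userspace C.  This is an illustrative model only, not the actual patch: the batch size of 16 comes from the discussion, and `drain_quicklist` is a hypothetical stand-in that just counts how many back-to-back batches run when the preempt_enable()/preempt_disable() pair compiles away under CONFIG_PREEMPT=n:

```c
#define BATCH 16  /* illustrative: pages freed per trip around the loop */

/* Count how many consecutive batches the shrink loop runs before the
 * quicklist drops to its target size.  With the preempt_enable()/
 * preempt_disable() pair compiled out, nothing separates the batches,
 * so a large backlog is drained in one uninterrupted burst. */
static unsigned long drain_quicklist(unsigned long size, unsigned long target)
{
	unsigned long batches = 0;

	while (size > target) {
		unsigned long n = size - target;

		if (n > BATCH)
			n = BATCH;
		size -= n;	/* free up to BATCH pages */
		batches++;	/* ...and immediately recompute, no pause */
	}
	return batches;
}
```

A backlog of 10000 excess pages, for example, means 625 consecutive batches with no scheduling point in between, which is exactly the "no pause to take a breath" concern.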

Are there other places in the kernel where this idiom is being used?

-Tony


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
  2005-03-03 21:57 ` Luck, Tony
@ 2005-03-04  8:03 ` Zou Nan hai
  2005-03-04  8:46 ` Robin Holt
                   ` (8 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Zou Nan hai @ 2005-03-04  8:03 UTC (permalink / raw)
  To: linux-ia64

On Fri, 2005-03-04 at 16:46, Robin Holt wrote:
> What do you want me to do?  I don't see anywhere else in the kernel
> that these two lines are directly adjacent.  Most places that do
> the disable/enable are in a function which does one thing.  That is
> occasionally contained inside a larger loop.  We cannot do that since
> part of our outer loop control is based on the per_cpu variable we are
> expecting to not change.  I suppose I could eliminate the disable/enable
> entirely.  I haven't thought all the way through the possibilities,
> but I would guess we could free a couple extra pages, but who cares.
> You tell me what to do.
> 
> Robin

  I think there is no need to disable preemption in an idle thread,
since the idle thread has no chance to migrate to another CPU.

  I also think that the shrinking of pgtables is a bit like that in
shrink_slab in vmscan.c: shrink_slab shrinks slabs in batches and calls
cond_resched after every batch, so that it does not introduce too much
latency when there are many slabs to shrink.
On the other hand, shrink_slab is called under memory pressure while
check_pgt_cache is called from the idle thread, so maybe there is some
difference.

Zou Nan hai


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
  2005-03-03 21:57 ` Luck, Tony
  2005-03-04  8:03 ` Zou Nan hai
@ 2005-03-04  8:46 ` Robin Holt
  2005-03-04 14:55 ` Robin Holt
                   ` (7 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Robin Holt @ 2005-03-04  8:46 UTC (permalink / raw)
  To: linux-ia64

On Thu, Mar 03, 2005 at 01:57:49PM -0800, Luck, Tony wrote:
> >Here are the changes again.  I have not received a response from you
> >about the restructured shrink routine.  I have incorporated David
> >Mosberger's recommendations.
> 
> Your new shrink routine looks ok, but you didn't address the
> double role played by NODE_FREE_PAGES_SHIFT ... you just dropped the
> comment that explained one of its uses - which isn't what I'd hoped
> for :-)

The comment was left from the earlier set of patches where it
had a single role.  In that patch, we picked a number of pages to
free, freed that many, and went on with life expecting the next
pass to free a group.  Since the changes gave it a more complex
role and the code was not that complex, I figured I would remove
the comment and let anybody planning on changing it in the future
read the code to figure out the two roles.

> 
> I'm also a bit uncomfortable with:
> 
> +		preempt_enable();
> +		preempt_disable();
> 
> For a kernel with CONFIG_PREEMPT=n, this is a no-op ... so if there
> is a ton of extra pages on the quicklist, we'll loop freeing 16 at
> a time and re-computing how many to free, with no pause to take a
> breath (or a clock tick).

What do you want me to do?  I don't see anywhere else in the kernel
that these two lines are directly adjacent.  Most places that do
the disable/enable are in a function which does one thing.  That is
occasionally contained inside a larger loop.  We cannot do that since
part of our outer loop control is based on the per_cpu variable we are
expecting to not change.  I suppose I could eliminate the disable/enable
entirely.  I haven't thought all the way through the possibilities,
but I would guess we could free a couple extra pages, but who cares.
You tell me what to do.

Robin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (2 preceding siblings ...)
  2005-03-04  8:46 ` Robin Holt
@ 2005-03-04 14:55 ` Robin Holt
  2005-03-04 16:25 ` Robin Holt
                   ` (6 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Robin Holt @ 2005-03-04 14:55 UTC (permalink / raw)
  To: linux-ia64

On Fri, Mar 04, 2005 at 04:03:38PM +0800, Zou Nan hai wrote:
> On Fri, 2005-03-04 at 16:46, Robin Holt wrote:
> > What do you want me to do?  I don't see anywhere else in the kernel
> > that these two lines are directly adjacent.  Most places that do
> > the disable/enable are in a function which does one thing.  That is
> > occasionally contained inside a larger loop.  We cannot do that since
> > part of our outer loop control is based on the per_cpu variable we are
> > expecting to not change.  I suppose I could eliminate the disable/enable
> > entirely.  I haven't thought all the way through the possibilities,
> > but I would guess we could free a couple extra pages, but who cares.
> > You tell me what to do.
> > 
> > Robin
> 
>   I think there is no need to disable preemption in an idle thread,
> since the idle thread has no chance to migrate to another CPU.
> 
>   I also think that the shrinking of pgtables is a bit like that in
> shrink_slab in vmscan.c: shrink_slab shrinks slabs in batches and
> calls cond_resched after every batch, so that it does not introduce
> too much latency when there are many slabs to shrink.
> On the other hand, shrink_slab is called under memory pressure while
> check_pgt_cache is called from the idle thread, so maybe there is
> some difference.

I think that would be changing the behavior.  For the idle case, I don't
think calling cond_resched makes much of a difference.  For the case
where a process has completed an unmap and called tlb_finish_mmu, which
has in turn called check_pgt_cache(), we would be allowing another
process of the same priority to take the remainder of our time slice.
With the preempt_enable/disable, we are only allowing others to run when
we have either exhausted our time slice or they are a higher priority.

I am trying to figure out some timings that would be adversely affected
by the preempt_enable/disable pairing.  At this time, I do not see the
issue, but maybe I have missed something.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (3 preceding siblings ...)
  2005-03-04 14:55 ` Robin Holt
@ 2005-03-04 16:25 ` Robin Holt
  2005-03-04 19:58 ` Luck, Tony
                   ` (5 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Robin Holt @ 2005-03-04 16:25 UTC (permalink / raw)
  To: linux-ia64

On Thu, Mar 03, 2005 at 01:57:49PM -0800, Luck, Tony wrote:
> I'm also a bit uncomfortable with:
> 
> +		preempt_enable();
> +		preempt_disable();
> 
> For a kernel with CONFIG_PREEMPT=n, this is a no-op ... so if there
> is a ton of extra pages on the quicklist, we'll loop freeing 16 at
> a time and re-computing how many to free, with no pause to take a
> breath (or a clock tick).

Someone pointed out that your concern may actually be the "or a clock
tick" portion of this statement.  We are not disabling interrupts.
We are disabling and enabling preemption.  Was that your concern, or am
I still missing the picture?

If that is the concern, I will fall back on the "I am not making it
worse" defense, because the shrink used to go from the high water mark
to the low water mark.  This would result in thousands of pages being
freed.  That, of course, is based on the 2.4 kernel, since that is the
last series that actually used the quicklists.

Robin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (4 preceding siblings ...)
  2005-03-04 16:25 ` Robin Holt
@ 2005-03-04 19:58 ` Luck, Tony
  2005-03-04 22:33 ` Chen, Kenneth W
                   ` (4 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Luck, Tony @ 2005-03-04 19:58 UTC (permalink / raw)
  To: linux-ia64

Robin Holt wrote:
>The comment was left from the earlier set of patches where it
>had a single role.  In that patch, we picked a number of pages to
>free, freed that many, and went on with life expecting the next
>pass to free a group.  Since the changes gave it a more complex
>role and the code was not that complex, I figured I would remove
>the comment and let anybody planning on changing it in the future
>read the code to figure out the two roles.

Why not just have two defines?

#define MAX_QUICKLIST_BATCH_FREE	16

#define NODE_FREE_PAGES_SHIFT 4

The first may be obvious enough to not need a comment for people to
figure it out.  The second either needs a better name, or a comment
(or both).  Also ... is "4" the right value?  That implies that you
are expecting to only manage 16 user level data pages for each pte
(which sounds quite plausible for cat/ls/cp/mail sized applications,
but larger memory hungry number-crunching applications hopefully see
a better ratio of data pages to page-tables).  If you have data (I
only have guesses) then I'd like a snapshot ... but I'm not trying
to open a new can of worms here.  I'm OK with the "4" value; I just
want it properly named and commented.
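The two defines above can be read as one sizing rule.  The sketch below is a hypothetical userspace illustration of how they might combine (the `pages_to_free` helper and its surrounding usage are assumptions, not the actual ia64 code): the shift sets the target quicklist size as a fraction of the node's free pages, and the batch constant caps how many pages any single pass frees.

```c
#define MAX_QUICKLIST_BATCH_FREE	16  /* cap on pages freed per pass */
#define NODE_FREE_PAGES_SHIFT		4   /* target ~= 1/16 of node free pages */

/* Hypothetical: how many pages one shrink pass would free, given the
 * current quicklist size and the node's free page count. */
static unsigned long pages_to_free(unsigned long quicklist_size,
				   unsigned long node_free_pages)
{
	unsigned long target = node_free_pages >> NODE_FREE_PAGES_SHIFT;
	unsigned long excess;

	if (quicklist_size <= target)
		return 0;	/* already at or below the target size */
	excess = quicklist_size - target;
	return excess < MAX_QUICKLIST_BATCH_FREE ?
		excess : MAX_QUICKLIST_BATCH_FREE;
}
```

With 1600 free pages on the node the target is 100 cached entries, so a quicklist of 105 frees the 5 excess pages and a quicklist of 200 frees only one batch of 16.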

>> I'm also a bit uncomfortable with:
>> 
>> +		preempt_enable();
>> +		preempt_disable();
>> 
>> For a kernel with CONFIG_PREEMPT=n, this is a no-op ... so if there
>> is a ton of extra pages on the quicklist, we'll loop freeing 16 at
>> a time and re-computing how many to free, with no pause to take a
>> breath (or a clock tick).
>
>What do you want me to do?  I don't see anywhere else in the kernel
>that these two lines are directly adjacent.  Most places that do
>the disable/enable are in a function which does one thing.  That is
>occasionally contained inside a larger loop.  We cannot do that since
>part of our outer loop control is based on the per_cpu variable we are
>expecting to not change.  I suppose I could eliminate the
>disable/enable entirely.  I haven't thought all the way through the
>possibilities, but I would guess we could free a couple extra pages,
>but who cares.

The fact that this doesn't happen elsewhere might be a sign that
we are doing something strange ... if this was normal, then there
would be a prettier macro that did the enable/disable.  Getting
exactly the right number of pages freed isn't my goal, since there
is no way to know what the exactly right number is ... we just need
to do something approximately right, that doesn't have any truly
awful boundary cases.

>You tell me what to do.

At the moment I'm hoping somebody smarter will chime into this thread
(either to point out a great solution, or to tell me that I'm a dork
and this code is perfectly reasonable, so I should quit quibbling).

Overall the patch is great ... it's solving a real problem in an elegant
way.  It's just this little corner of how to shrink the quicklists that
I'm trying to get right.

-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (5 preceding siblings ...)
  2005-03-04 19:58 ` Luck, Tony
@ 2005-03-04 22:33 ` Chen, Kenneth W
  2005-03-07 12:57 ` Robin Holt
                   ` (3 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Chen, Kenneth W @ 2005-03-04 22:33 UTC (permalink / raw)
  To: linux-ia64

Luck, Tony wrote on Friday, March 04, 2005 11:58 AM
> >You tell me what to do.
>
> At the moment I'm hoping somebody smarter will chime into this thread
> (either to point out a great solution, or to tell me that I'm a dork
> and this code is perfectly reasonable, so I should quit quibbling).
>
> Overall the patch is great ... it's solving a real problem in an elegant
> way.  It's just this little corner of how to shrink the quicklists that
> I'm trying to get right.

One other possible solution I can think of is to use the
schedule_delayed_work API.  You can schedule one per node at a one
second interval and have the work function re-arm itself.  It has
several pros:

1. Addresses the concern that Tony has with SETI eating up all the idle
   ticks so that check_pgt_cache() may never get a chance to run.
2. Once work is scheduled, you don't need to dance with the batch count.
   Just keep on freeing in one while loop, since it is running in a
   kernel thread.
3. Potentially, the call to check_pgt_cache from tlb_finish_mmu() can be
   removed, making process exit faster.
4. And last, probably not important, it addresses my concern of a broken
   model (freeing memory in the idle loop).
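Point 2 above is the key behavioral difference: running in its own kernel-thread context, the work function can drain everything above the target in one while loop and then re-arm itself.  schedule_delayed_work itself is a real kernel API, but the model below is a hypothetical userspace stand-in for the work function only, not code from the patch:

```c
/* Userspace model of the suggested work function: drain the whole
 * backlog above the target in a single while loop (no batch count),
 * returning how many pages were freed.  In the kernel the function
 * would finish by re-arming itself, e.g.
 * schedule_delayed_work(&quicklist_work, HZ). */
static unsigned long quicklist_work_fn(unsigned long size,
				       unsigned long target)
{
	unsigned long freed = 0;

	while (size > target) {	/* free everything above target at once */
		size--;
		freed++;
	}
	return freed;
}
```

A quicklist of 300 entries with a target of 44 is drained in one invocation (256 pages freed); the next periodic invocation finds nothing to do.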

Just my 2 cents, only worth what's on the paper.

- Ken



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (6 preceding siblings ...)
  2005-03-04 22:33 ` Chen, Kenneth W
@ 2005-03-07 12:57 ` Robin Holt
  2005-03-08 23:56 ` Luck, Tony
                   ` (2 subsequent siblings)
  10 siblings, 0 replies; 12+ messages in thread
From: Robin Holt @ 2005-03-07 12:57 UTC (permalink / raw)
  To: linux-ia64

On Fri, Mar 04, 2005 at 02:33:56PM -0800, Chen, Kenneth W wrote:
> Luck, Tony wrote on Friday, March 04, 2005 11:58 AM
> > >You tell me what to do.
> >
> > At the moment I'm hoping somebody smarter will chime into this thread
> > (either to point out a great solution, or to tell me that I'm a dork
> > and this code is perfectly reasonable, so I should quit quibbling).
> >
> > Overall the patch is great ... it's solving a real problem in an elegant
> > way.  It's just this little corner of how to shrink the quicklists that
> > I'm trying to get right.
> 
> One other possible solution I can think of is to use schedule_delayed_work
> API.  You can schedule one per node every one second interval and have the
> work function re-arm itself.  It has several pros:
> 

We need to use schedule_delayed_work_on() instead since the quicklists are
maintained lockless per cpu.  Will this method address everybody's concerns?


> 1. Addresses the concern that Tony has with SETI eating up all the idle
>    ticks and check_pgt_cache() may never get a chance to run.

SETI is a bad example because it uses mmap() and munmap() constantly,
which will walk the code path in question.

> 2. once work is scheduled, you don't need to dance with batch count.
>    Just keep on freeing in one while loop since it is running in a kernel
>    thread.

I don't know if we are thinking the same thing here.  I would remove the
while loop and let the next pass complete the frees.

> 3. Potentially, call to check_pgt_cache from tlb_finish_mmu() can be
>    removed, making process exit faster.

I think this one is still useful as it will keep the quicklist closer to
the ideal size.  If we limit it to 256 quicklist entries per pass, that
should keep the quicklists reasonably sized.

> 4. And last, probably not important, addresses my concern of broken model
>    (freeing memory in idle loop).

Tony, is this going to be acceptable?  Should it be the work queue or
timer based?

Thanks,
Robin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Patch 0/3] Page table quicklist fixups Rev 3.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (7 preceding siblings ...)
  2005-03-07 12:57 ` Robin Holt
@ 2005-03-08 23:56 ` Luck, Tony
  2005-03-15 19:33 ` [Patch 0/3] Page table quicklist fixups Rev 4 Robin Holt
  2005-03-16  0:24 ` Luck, Tony
  10 siblings, 0 replies; 12+ messages in thread
From: Luck, Tony @ 2005-03-08 23:56 UTC (permalink / raw)
  To: linux-ia64

>Tony, is this going to be acceptable?  Should it be the work queue or
>timer based?

The only downsides I see to Ken's suggestion are:

1) Will need extra hooks for hot plug cpu to start/stop the new timer
when cpus are added/removed.  Some work is presumably needed here
anyway ... when a cpu is removed, its quicklist ought to be drained.

2) Opens a new question of whether 1 second is the right interval.
That may be a long time to wait for memory to be freed up.

-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* [Patch 0/3] Page table quicklist fixups Rev 4.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (8 preceding siblings ...)
  2005-03-08 23:56 ` Luck, Tony
@ 2005-03-15 19:33 ` Robin Holt
  2005-03-16  0:24 ` Luck, Tony
  10 siblings, 0 replies; 12+ messages in thread
From: Robin Holt @ 2005-03-15 19:33 UTC (permalink / raw)
  To: linux-ia64

Tony,

Yet another round.  This set has incorporated your suggestion about
separate #defines for the sizing and freeing.  This does make the code
more readable.

Thanks,
Robin

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [Patch 0/3] Page table quicklist fixups Rev 4.
  2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
                   ` (9 preceding siblings ...)
  2005-03-15 19:33 ` [Patch 0/3] Page table quicklist fixups Rev 4 Robin Holt
@ 2005-03-16  0:24 ` Luck, Tony
  10 siblings, 0 replies; 12+ messages in thread
From: Luck, Tony @ 2005-03-16  0:24 UTC (permalink / raw)
  To: linux-ia64

>Yet another round.  This set has incorporated your suggestion about
>separate #defines for the sizing and freeing.  This does make the code
>more readable.

Applied.  It is in the linux-ia64-test-2.6.12 tree.

-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2005-03-16  0:24 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2005-03-03 21:23 [Patch 0/3] Page table quicklist fixups Rev 3 Robin Holt
2005-03-03 21:57 ` Luck, Tony
2005-03-04  8:03 ` Zou Nan hai
2005-03-04  8:46 ` Robin Holt
2005-03-04 14:55 ` Robin Holt
2005-03-04 16:25 ` Robin Holt
2005-03-04 19:58 ` Luck, Tony
2005-03-04 22:33 ` Chen, Kenneth W
2005-03-07 12:57 ` Robin Holt
2005-03-08 23:56 ` Luck, Tony
2005-03-15 19:33 ` [Patch 0/3] Page table quicklist fixups Rev 4 Robin Holt
2005-03-16  0:24 ` Luck, Tony

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox