Re: PG_zero - Martin J. Bligh

From: "Martin J. Bligh" <mbligh@aracnet.com>
To: Andrea Arcangeli <andrea@novell.com>
Cc: Nick Piggin <nickpiggin@yahoo.com.au>,
	Andrew Morton <akpm@digeo.com>,
	linux-kernel@vger.kernel.org
Subject: Re: PG_zero
Date: Tue, 02 Nov 2004 14:41:12 -0800
Message-ID: <235520000.1099435272@flay>
In-Reply-To: <20041102195007.GP3571@dualathlon.random>

>> Exactly. The disadvantage of the single list is that cold allocs can steal 
>> hot pages, which we believe are precious, and as CPUs get faster, will only 
>> get more so (insert McKenney's bog roll here).
> 
> a cold page can become hot anytime soon (just when the I/O has
> completed and we map it into userspace), 

See my other email ... this makes no sense to me.

> plus you're throwing at the
> problem twice as much memory as I do to get the same boost.

Well .... depends on how we do it. Conceptually, I agree with you that
it's really one list. However, I'd like to stop the cold list stealing
from the hot, so I think it's easier to *implement* it as two lists,
because there's nothing in the standard list code to add a magic marker
on one element in the middle (though maybe you can think of a trick way
to do that). Moving things from one list to the other is pretty cheap
because we can use the splice operations.
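
To make the two-list idea concrete, here's a rough userspace sketch. It is
not the kernel code: struct list_node only loosely imitates struct
list_head, and struct pcp_lists, alloc_hot(), alloc_cold() and
demote_hot_to_cold() are names made up for illustration. The policy it
assumes is that hot allocations may fall back to the cold list, while cold
allocations never take from the hot one. Putting demoted pages at the front
of the cold list is also just an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, loosely modelled on the kernel's
 * struct list_head. Userspace illustration only. */
struct list_node {
    struct list_node *next, *prev;
};

void list_init(struct list_node *head)
{
    head->next = head;
    head->prev = head;
}

int list_is_empty(const struct list_node *head)
{
    return head->next == head;
}

/* Insert right after the head, so the most recently freed page is the
 * first one handed out again (LIFO). */
void list_push(struct list_node *node, struct list_node *head)
{
    assert(node != NULL && head != NULL);
    node->next = head->next;
    node->prev = head;
    head->next->prev = node;
    head->next = node;
}

/* Remove and return the first node, or NULL if the list is empty. */
struct list_node *list_pop(struct list_node *head)
{
    struct list_node *node = head->next;

    if (node == head)
        return NULL;
    head->next = node->next;
    node->next->prev = head;
    node->next = node->prev = node;
    return node;
}

/* Move every node of 'from' to the front of 'to' in O(1), leaving 'from'
 * empty. No per-element walk is needed, which is why it is cheap. */
void list_splice_all(struct list_node *from, struct list_node *to)
{
    struct list_node *first, *last, *at;

    if (list_is_empty(from))
        return;
    first = from->next;
    last = from->prev;
    at = to->next;
    first->prev = to;
    to->next = first;
    last->next = at;
    at->prev = last;
    list_init(from);
}

/* Hypothetical per-CPU pair of free-page lists. */
struct pcp_lists {
    struct list_node hot;
    struct list_node cold;
};

/* Assumed policy: hot allocations prefer hot pages, then fall back to cold. */
struct list_node *alloc_hot(struct pcp_lists *p)
{
    struct list_node *n = list_pop(&p->hot);

    return n ? n : list_pop(&p->cold);
}

/* Assumed policy: cold allocations never steal from the hot list. */
struct list_node *alloc_cold(struct pcp_lists *p)
{
    return list_pop(&p->cold);
}

/* Reclassify all hot pages as cold with a single splice. */
void demote_hot_to_cold(struct pcp_lists *p)
{
    list_splice_all(&p->hot, &p->cold);
}
```

With two list heads the "marker in the middle" is implicit: it is simply the
boundary between the two lists, and moving it costs one splice.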
 
>> Mmmm. Are you actually seeing lock contention on the buddy allocator on
>> a real benchmark? I haven't seen it since the advent of hot/cold pages.
>> Remember it's not global at all on the larger systems, since they're NUMA.
> 
> yes, the pgdat helps there. But certainly doing a single cli should be
> faster than a spinlock + buddy algorithms.

Yes, it's certainly faster to do that one operation - no dispute from me.
However, the question is whether it's worth trading off against the
cache-warmth of pages ... that's why we wrote it that way originally, to
preserve that.

> This is also to drop down the probability of having to call into the
> higher latency of the buddy allocator, this is why the per-cpu lists
> exists in UP too. so the more we can work inside the per-cpu lists the
> better. This is why the last thing I consider the per-cpu lists are
> for buffering. The buffering is really a "refill", because we were
> unlucky and no other free_page refilled it, so we've to enter the slow
> path.

yup, exactly.
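
A toy accounting model of that fast-path/slow-path split, in userspace C.
Nothing here is the real allocator: struct global_pool stands in for the
buddy allocator behind its lock, struct pcp_cache for one CPU's list, and
PCP_BATCH / PCP_HIGH are invented numbers. It only counts pages, so it
shows how often the slow path is entered, not which pages are warm.

```c
#include <assert.h>
#include <stdbool.h>

#define PCP_BATCH 4   /* hypothetical number of pages moved per slow-path call */
#define PCP_HIGH  8   /* hypothetical capacity of the per-CPU cache */

/* Stand-in for the buddy allocator: a count of free pages, plus a count of
 * how many times the expensive path (lock + buddy work) was taken. */
struct global_pool {
    long free_pages;
    long slow_path_calls;
};

/* Stand-in for one CPU's free-page cache; only the page count is modelled. */
struct pcp_cache {
    int count;
};

/* Slow path: pull up to PCP_BATCH pages from the pool into the cache.
 * Returns the number of pages actually moved. */
int pool_refill(struct global_pool *g, struct pcp_cache *c)
{
    int moved = 0;

    g->slow_path_calls++;
    while (moved < PCP_BATCH && c->count < PCP_HIGH && g->free_pages > 0) {
        g->free_pages--;
        c->count++;
        moved++;
    }
    return moved;
}

/* Slow path: give up to PCP_BATCH cached pages back to the pool. */
void pool_drain(struct global_pool *g, struct pcp_cache *c)
{
    int n = c->count < PCP_BATCH ? c->count : PCP_BATCH;

    g->slow_path_calls++;
    c->count -= n;
    g->free_pages += n;
}

/* Fast path unless the cache is empty, in which case we refill first.
 * Returns false only when both the cache and the pool are exhausted. */
bool pcp_alloc(struct global_pool *g, struct pcp_cache *c)
{
    assert(c->count >= 0 && c->count <= PCP_HIGH);
    if (c->count == 0 && pool_refill(g, c) == 0)
        return false;
    c->count--;
    return true;
}

/* Fast path unless the cache is full, in which case we drain a batch first. */
void pcp_free(struct global_pool *g, struct pcp_cache *c)
{
    assert(c->count >= 0 && c->count <= PCP_HIGH);
    if (c->count == PCP_HIGH)
        pool_drain(g, c);
    c->count++;
}
```

In this model a workload that alternates free and alloc never leaves the
per-CPU cache after the first refill; the pool is only touched when the
cache runs empty or overflows.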
 
>> Mmmm. I'm still very worried this'll exhaust ZONE_NORMAL on highmem systems,
> 
> See my patch, I've all the protection code on top of it. Andi as well
> was worried about that and he was right, but then I've fixed it.

OK, I'll go read it.
 
>> and keep remote pages around instead of returning them to their home node
>> on NUMA systems. Not sure that's really what we want, even if we gain a
> 
> in my early version of the code I had a:
> 
> +               for (i = 0; (z = zones[i]) != NULL; i++) {
> +#ifdef CONFIG_NUMA
> +                       /* discontigmem doesn't need this, only true numa needs this */
> +                       if (zones[0]->zone_pgdat != z->zone_pgdat)   
> +                               break;
> +#endif
> 
> I dropped it from the latest just to make it a bit simpler and because I
> started to suspect on the new numa it might be faster to use the hot
> cache on remote memory, than cold cache on local memory. So effectively
> I believe removing the above was going to optimize x86-64 which is the
> only numa arch I run anyways. The above code can be made conditional
> under an #ifdef CONFIG_SLOW_NUMA of course.

Probably easier to make it a per-arch option to define, but yes, if one
arch works better one way, and others the other, we can make it optional
fairly easily in lots of ways. <insert traditional disagreement with
"new" vs "old" connotations here ...>
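
A minimal sketch of what a per-arch switch could look like, as standalone
C rather than a patch. ARCH_PREFERS_LOCAL_NODE, the cut-down struct zone
and pick_cached_zone() are all hypothetical; node_id merely stands in for
the zone_pgdat comparison in the quoted fragment.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-arch knob: an arch where remote memory is expensive
 * would leave this at 1, an arch that prefers hot-but-remote pages would
 * define it to 0 before this point. */
#ifndef ARCH_PREFERS_LOCAL_NODE
#define ARCH_PREFERS_LOCAL_NODE 1
#endif

/* Cut-down stand-in for the kernel's struct zone. */
struct zone {
    int node_id;        /* stands in for zone->zone_pgdat */
    int cached_pages;   /* pages sitting in this zone's per-CPU cache */
};

/* Walk a NULL-terminated fallback list and return the first zone that has
 * a cached page. When the arch prefers local memory, stop at the first
 * zone that lives on a different node than zones[0]. A NULL return means
 * the caller has to fall back to the slow path. */
struct zone *pick_cached_zone(struct zone **zones)
{
    assert(zones != NULL);
    for (size_t i = 0; zones[i] != NULL; i++) {
        struct zone *z = zones[i];

#if ARCH_PREFERS_LOCAL_NODE
        if (z->node_id != zones[0]->node_id)
            break;
#endif
        if (z->cached_pages > 0)
            return z;
    }
    return NULL;
}
```

With the default setting, a fallback list of two empty local zones followed
by a remote zone holding cached pages yields NULL, so the remote cache is
left alone; with the knob set to 0 the remote zone would be returned.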
 
>> little in spinlock and immediate cache cost. Nor will it be easy to measure
>> since it'll only do anything under memory pressure, and the perf there is
>> notoriously unstable for measurement.
> 
> under mem pressure it's worthless to measure performance I agree.
> However the mainline code under mem pressure was very wrong and it could
> have generated early OOM kills by mistake on big boxes with lots of ram,
> since you prevented hot allocations to get ram from the cold quicklist
> (the breakage I noticed and fixed, and that you acknowledge at the top
> of the email) and that would lead to the VM freeing memory and the
> application going oom anyways.

Well, remember that the lists only contain a very small amount of memory
relative to the machine size anyway, so I'm not really worried about us
triggering early OOMs or whatever.

However one chooses to fix this, though, it'd require more specialization in
the allocator (knowing about hot/cold, or merging into one list), and I'm not
sure how concerned Andrew is about that (see his other email on the subject).

Thread overview: 31+ messages
2004-10-30 14:10 PG_zero Andrea Arcangeli
2004-10-30 21:07 ` PG_zero Andrew Morton
2004-10-30 22:45   ` PG_zero Andrea Arcangeli
2004-10-31 15:35     ` PG_zero Martin J. Bligh
2004-11-01 21:57       ` PG_zero Andrea Arcangeli
2004-11-01 22:05         ` PG_zero Martin J. Bligh
2004-11-02  3:41         ` PG_zero William Lee Irwin III
2004-10-31 15:17   ` PG_zero Martin J. Bligh
2004-11-02 13:53     ` PG_zero Andy Whitcroft
2004-11-02 19:39       ` PG_zero Andrea Arcangeli
2004-11-01 17:26 ` PG_zero Nick Piggin
2004-11-01 18:03   ` PG_zero Martin J. Bligh
2004-11-01 22:34     ` PG_zero Andrea Arcangeli
2004-11-01 23:47       ` PG_zero Martin J. Bligh
2004-11-02  1:47       ` PG_zero Nick Piggin
2004-11-02  2:21       ` PG_zero Andrea Arcangeli
2004-11-02  2:54         ` PG_zero Nick Piggin
2004-11-02 15:42         ` PG_zero Martin J. Bligh
2004-11-02 19:50           ` PG_zero Andrea Arcangeli
2004-11-02 22:41             ` Martin J. Bligh [this message]
2004-11-03  1:26               ` PG_zero Andrea Arcangeli
2004-11-02 21:09           ` PG_zero Andrew Morton
2004-11-02 21:56             ` PG_zero Andrea Arcangeli
2004-11-02 22:41               ` PG_zero Martin J. Bligh
2004-11-03  1:09                 ` PG_zero Andrea Arcangeli
2004-11-03  1:18                   ` PG_zero Martin J. Bligh
2004-11-03  1:23                   ` PG_zero Nick Piggin
2004-11-03  2:05                     ` PG_zero Andrea Arcangeli
2004-11-03 11:53                       ` PG_zero Andrea Arcangeli
2004-11-03 12:10       ` PG_zero Pavel Machek
2004-11-01 22:24   ` PG_zero Andrea Arcangeli
