From: Nick Piggin <npiggin@suse.de>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Hugh Dickins <hugh@veritas.com>,
Jared Hulbert <jaredeh@gmail.com>,
Carsten Otte <cotte@de.ibm.com>,
Martin Schwidefsky <schwidefsky@de.ibm.com>,
Heiko Carstens <heiko.carstens@de.ibm.com>,
linux-arch@vger.kernel.org, rmk@arm.linux.org.uk,
catalin.marinas@arm.com
Subject: Re: [rfc][patch 2/2] mm: introduce optional pte_special pte bit
Date: Wed, 16 Jan 2008 06:17:59 +0100 [thread overview]
Message-ID: <20080116051759.GA14049@wotan.suse.de> (raw)
In-Reply-To: <alpine.LFD.1.00.0801152040310.2806@woody.linux-foundation.org>
On Tue, Jan 15, 2008 at 08:48:42PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 16 Jan 2008, Nick Piggin wrote:
> >
> > Right, that's what I had hoped as well. But when I say pte_special
> > *usable* by all architectures, I mean it is usable by all that can
> > spare a bit in the pte. Apparently ARM can't because some some bug
> > in an Xscale CPU or something (the thread is on linux-arch).
>
> Hmm. Can you give a pointer to some browsable archive? I guess I should
> subscribe, but there's too much email, too little time. linux-arch is one
> of the lists that I probably should look at.
http://marc.info/?t=119968107900003&r=2&w=2
I've also cc'ed Russell and Catalin, who were involved in that one.
> That said: especially for PFNMAP and friends, it may be possible to simply
> re-use an existign hardware bit, like (for example) a "cacheable" bit.
>
> That doesn't mean that such an architecture would have a free bit for any
> *arbitrary* software use, but the "no <struct page> backing" is really a
> pretty special feature, and may well map fairly well 1:1 with something
> like a "cache disable" bit (which I do think ARM has).
>
> It's not like we necessarily would want /dev/mem to be mapped cacheable
> *anyway*, much less on some architecture with stupid virtual caches.
>
> > I remember that too. I guess some wires got crossed somewhere. s390
> > evidently does have free bits in their pte_present-type ptes.
>
> I think they had two types of PTE's - 32-bit and 64-bit. Maybe it's just
> the 32-bit one that was all used up (but see above - maybe cacheable bits
> are doable?)
Not sure, I'll let the s390 people chime in here. We'd still need a
pte_special bit for 32 and 64 bit ptes (if they are different)...
> I do have to say, one of the reasons I enjoyed PFNMAP was that so far
> we've basially been able to live without any SW-specified bits at all.
> Yeah, we use "software bits" on architectures to emulate dirty/accessed,
> but we have never really needed any "kernel internal bits". And I do think
> that's generally a good idea.
I agree in a way (on one hand it would be nice to simplify the whole
PFNMAP logic, on the other hand it isn't actually a problem to keep it,
and it is a nice way of avoiding the use of a pte bit, which might be
important in future even if not today...)
> So in that sense, I'd actually prefer the current setup if it's not a huge
> pain.
Well that is why I wanted to go the ifdef route (or rather, tidy it up to
be an if (CONSTANT) { } else { } or something).
We could go and hide the pte-check in s390 specific code just in the
VM_MIXEDMAP case, but again I think that is actually just making things
less clear (than having the 2 distinct cases in core code).
> I saw your 5% number, but I really wonder about that one. Was that
> perhaps with the much more expensive non-linear NUMA "pfn_to_page()"? THAT
> expense would drown out any vma->vm_flags costs.
The 5% case yes that was with SPARSEMEM. And yes we should get rid of
that pfn_valid test IMO (or put it under CONFIG_DEBUG_VM). I'm not aware
of it ever triggering, so I think it should go away (I think Hugh nacked
my attempt at this last time -- Hugh, what do you think?).
However, minor performance issues aside, I'd still hope to find a good
way to get the pte_special path in.
---
vm_normal_page has been seen to take nearly 5% kernel time in profiles, and
regularly appears in the top 10 or so functions in a profile. It is called
at every page fault, and for every pte in bulk copy or unmaps.
pfn_valid in particular can be quite expensive in some memory models.
Place that code under CONFIG_DEBUG_VM. I'm not aware of it catching any
problems.
Signed-off-by: Nick Piggin <npiggin@suse.de>
---
Index: linux-2.6/mm/memory.c
===================================================================
--- linux-2.6.orig/mm/memory.c
+++ linux-2.6/mm/memory.c
@@ -392,16 +392,13 @@ struct page *vm_normal_page(struct vm_ar
return NULL;
}
- /*
- * Add some anal sanity checks for now. Eventually,
- * we should just do "return pfn_to_page(pfn)", but
- * in the meantime we check that we get a valid pfn,
- * and that the resulting page looks ok.
- */
+#ifdef CONFIG_DEBUG_VM
+ /* Check that we get a valid pfn. */
if (unlikely(!pfn_valid(pfn))) {
print_bad_pte(vma, pte, addr);
return NULL;
}
+#endif
/*
* NOTE! We still have PageReserved() pages in the page
next prev parent reply other threads:[~2008-01-16 5:18 UTC|newest]
Thread overview: 24+ messages / expand[flat|nested] mbox.gz Atom feed top
2008-01-13 3:08 [rfc] changes to user memory mapping scheme Nick Piggin
2008-01-13 3:09 ` [rfc][patch 1/2] mm: introduce VM_MIXEDMAP Nick Piggin
2008-01-13 3:10 ` [rfc][patch 2/2] mm: introduce optional pte_special pte bit Nick Piggin
2008-01-13 3:41 ` Linus Torvalds
2008-01-13 4:39 ` Nick Piggin
2008-01-13 4:45 ` Linus Torvalds
2008-01-13 5:06 ` Nick Piggin
2008-01-13 16:50 ` Linus Torvalds
2008-01-13 20:46 ` Martin Schwidefsky
2008-01-14 21:04 ` Jared Hulbert
2008-01-15 9:18 ` Carsten Otte
2008-01-16 3:38 ` Nick Piggin
2008-01-16 4:04 ` Linus Torvalds
2008-01-16 4:37 ` Nick Piggin
2008-01-16 4:48 ` Linus Torvalds
2008-01-16 4:51 ` David Miller
2008-01-16 5:23 ` Linus Torvalds
2008-01-16 5:48 ` Nick Piggin
2008-01-16 9:52 ` Martin Schwidefsky
2008-01-16 5:17 ` Nick Piggin [this message]
2008-01-16 10:52 ` Catalin Marinas
2008-01-16 18:18 ` Russell King
2008-01-16 17:21 ` Linus Torvalds
2008-01-16 17:14 ` David Howells
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20080116051759.GA14049@wotan.suse.de \
--to=npiggin@suse.de \
--cc=catalin.marinas@arm.com \
--cc=cotte@de.ibm.com \
--cc=heiko.carstens@de.ibm.com \
--cc=hugh@veritas.com \
--cc=jaredeh@gmail.com \
--cc=linux-arch@vger.kernel.org \
--cc=rmk@arm.linux.org.uk \
--cc=schwidefsky@de.ibm.com \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).