LinuxPPC-Dev Archive on lore.kernel.org

* Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
From: Rex Feany @ 2009-09-25 21:18 UTC
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <1253847827.7103.504.camel@pasglop>

Thus spake Benjamin Herrenschmidt (benh@kernel.crashing.org):

> 
> > I think there's more finickiness to 8xx than we thought. I.e. that
> > tlbil_va might have more reasons to be there than what the comment
> > seems to advertise. Can you try to move it even higher up? I.e.
> > unconditionally at the beginning of set_pte_filter?
> > 
> > Also, if that doesn't help, can you try putting one in
> > set_access_flags_filter() just below ?
> 
> Ok, I got a refresher on the whole concept of "unpopulated TLB entries"
> on 8xx, and that's damn scary. I think what misled me initially is that
> the comment around the workaround is simply not properly describing the
> extent of the problem :-)

Oh boy, that sounds bad. Where is a good place to read about this?

> So I'm not going to make the 8xx TLB miss code sane, that's beyond what
> I'm prepared to do with it, but I suspect that this should fix it (on top
> of upstream). Let me know if that's enough or if we also need to put
> one of these in ptep_set_access_flags().
> 
> Please let me know if that works for you.

Putting the tlbil_va() at the top of set_pte_filter() doesn't work - it
hangs on boot before it even prints any messages to the console.

However, adding tlbil_va() to ptep_set_access_flags() as you suggested
makes everything happy. I need to test it some more, but it looks good
so far. Below is what I am testing now.

thanks!
/rex.


diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index 5304093..aef552a 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -176,18 +176,19 @@ static pte_t set_pte_filter(pte_t pte, unsigned long addr)
 		struct page *pg = maybe_pte_to_page(pte);
 		if (!pg)
 			return pte;
-		if (!test_bit(PG_arch_1, &pg->flags)) {
 #ifdef CONFIG_8xx
-			/* On 8xx, cache control instructions (particularly
-			 * "dcbst" from flush_dcache_icache) fault as write
-			 * operation if there is an unpopulated TLB entry
-			 * for the address in question. To workaround that,
-			 * we invalidate the TLB here, thus avoiding dcbst
-			 * misbehaviour.
-			 */
-			/* 8xx doesn't care about PID, size or ind args */
-			_tlbil_va(addr, 0, 0, 0);
+		/* On 8xx, cache control instructions (particularly
+		 * "dcbst" from flush_dcache_icache) fault as write
+		 * operation if there is an unpopulated TLB entry
+		 * for the address in question. To workaround that,
+		 * we invalidate the TLB here, thus avoiding dcbst
+		 * misbehaviour.
+		 */
+		/* 8xx doesn't care about PID, size or ind args */
+		_tlbil_va(addr, 0, 0, 0);
 #endif /* CONFIG_8xx */
+
+		if (!test_bit(PG_arch_1, &pg->flags)) {
 			flush_dcache_icache_page(pg);
 			set_bit(PG_arch_1, &pg->flags);
 		}
@@ -308,6 +309,12 @@ int ptep_set_access_flags(struct vm_area_struct *vma, unsigned long address,
 	int changed;
 	entry = set_access_flags_filter(entry, vma, dirty);
 	changed = !pte_same(*(ptep), entry);
+
+#ifdef CONFIG_8xx
+	/* 8xx doesn't care about PID, size or ind args */
+	_tlbil_va(address, 0, 0, 0);
+#endif /* CONFIG_8xx */
+
 	if (changed) {
 		if (!(vma->vm_flags & VM_HUGETLB))
 			assert_pte_locked(vma->vm_mm, address);
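
The behaviour described in the patch comment above can be pictured with a toy model. This is only a sketch under stated assumptions: `toy_tlb`, `toy_tlbil_va()`, `toy_dcbst_ok()` and `toy_flush_with_workaround()` are hypothetical names standing in for the 8xx TLB, `_tlbil_va()` and `dcbst`. None of it is kernel code, and it models nothing beyond what the comment says, namely that a cache operation faults while an unpopulated TLB entry exists for the address and that invalidating the entry first avoids this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical names, not the real 8xx MMU or kernel code. */
enum toy_tlb_state { TOY_TLB_EMPTY, TOY_TLB_UNPOPULATED, TOY_TLB_VALID };

#define TOY_TLB_SLOTS  8
#define TOY_PAGE_SHIFT 12

struct toy_tlb {
    enum toy_tlb_state slot[TOY_TLB_SLOTS];
};

/* Map an address to a slot in the toy TLB (direct-mapped for simplicity). */
static size_t toy_slot(unsigned long addr)
{
    return (addr >> TOY_PAGE_SHIFT) % TOY_TLB_SLOTS;
}

/* Stand-in for _tlbil_va(addr, 0, 0, 0): drop whatever entry covers addr. */
static void toy_tlbil_va(struct toy_tlb *tlb, unsigned long addr)
{
    assert(tlb != NULL);
    tlb->slot[toy_slot(addr)] = TOY_TLB_EMPTY;
}

/*
 * Stand-in for a cache operation such as dcbst. Returns false when the
 * operation would hit an unpopulated entry and fault as a write.
 */
static bool toy_dcbst_ok(const struct toy_tlb *tlb, unsigned long addr)
{
    return tlb->slot[toy_slot(addr)] != TOY_TLB_UNPOPULATED;
}

/* The workaround: invalidate first, then run the cache operation. */
static bool toy_flush_with_workaround(struct toy_tlb *tlb, unsigned long addr)
{
    toy_tlbil_va(tlb, addr);
    return toy_dcbst_ok(tlb, addr);
}
```

In this model the cache operation fails on an address with an unpopulated entry and succeeds once the entry has been invalidated, which is the ordering the patch keeps in set_pte_filter().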


* Re: [patch] powerpc: build modules outside the kernel tree fails, if it was built using O=
From: Benjamin Herrenschmidt @ 2009-09-25 21:57 UTC
  To: Sam Ravnborg; +Cc: rep.dot.nop, Yuri Frolov, linuxppc-dev, linux-kbuild
In-Reply-To: <20090925194557.GA3323@merkur.ravnborg.org>

On Fri, 2009-09-25 at 21:45 +0200, Sam Ravnborg wrote:
> On Thu, Sep 24, 2009 at 03:28:11PM +0400, Yuri Frolov wrote:
> > Hello,
> > 
> > here is a corresponding bug: http://bugzilla.kernel.org/show_bug.cgi?id=11143
> > This patch should correctly export crtsavres.o in order to make the O= option work.
> > Please consider applying it.
> 
> Hi Yuri.
> 
> I like the way you do the extra link in Makefile.modpost.
> But you need to redo some parts as per comments below.

> > 
> > Fix linking modules against crtsavres.o
> 
> Please elaborate more on what this commit does.

It's a support file that needs to be linked against every module
(i.e. libgcc-like) and thus needs to be built before any module.

Cheers,
Ben.


* Re: [patch] powerpc: build modules outside the kernel tree fails, if it was built using O=
From: Sam Ravnborg @ 2009-09-25 22:01 UTC
  To: Benjamin Herrenschmidt
  Cc: rep.dot.nop, Yuri Frolov, linuxppc-dev, linux-kbuild
In-Reply-To: <1253915852.7103.531.camel@pasglop>

On Sat, Sep 26, 2009 at 07:57:32AM +1000, Benjamin Herrenschmidt wrote:
> On Fri, 2009-09-25 at 21:45 +0200, Sam Ravnborg wrote:
> > On Thu, Sep 24, 2009 at 03:28:11PM +0400, Yuri Frolov wrote:
> > > Hello,
> > > 
> > > here is a corresponding bug: http://bugzilla.kernel.org/show_bug.cgi?id=11143
> > > This patch should correctly export crtsavres.o in order to make the O= option work.
> > > Please consider applying it.
> > 
> > Hi Yuri.
> > 
> > I like the way you do the extra link in Makefile.modpost.
> > But you need to redo some parts as per comments below.
> 
> > > 
> > > Fix linking modules against crtsavres.o
> > 
> > Please elaborate more on what this commit does.
> 
> It's a support file that needs to be linked against every module
> (i.e. libgcc-like) and thus needs to be built before any module.

Yes.

My point was that the next version of the changelog should
include this information - as well as kbuild.txt.

	Sam


* Re: [PATCH v3 0/3] cpu: pseries: Cpu offline states framework
From: Pavel Machek @ 2009-09-26  9:55 UTC
  To: Peter Zijlstra
  Cc: Gautham R Shenoy, Venkatesh Pallipadi, linux-kernel,
	Arun R Bharadwaj, linuxppc-dev, Darrick J. Wong
In-Reply-To: <1253016701.5506.73.camel@laptop>

On Tue 2009-09-15 14:11:41, Peter Zijlstra wrote:
> On Tue, 2009-09-15 at 17:36 +0530, Gautham R Shenoy wrote:
> > This patchset contains the offline state driver implemented for
> > pSeries. For pSeries, we define three available_hotplug_states. They are:
> > 
> >         online: The processor is online.
> > 
> >         offline: This is the default behaviour when the cpu is offlined
> >         even in the absence of this driver. The CPU would make an
> >         rtas_stop_self() call and hand the CPU back to the resource pool,
> >         thereby effectively deallocating that vCPU from the LPAR.
> >         NOTE: This would result in a configuration change to the LPAR
> >         which is visible to the outside world.
> > 
> >         inactive: This cedes the vCPU to the hypervisor with a cede latency
> >         specifier value 2.
> >         NOTE: This option does not result in a configuration change
> >         and the vCPU would still be entitled to the LPAR to which it
> >         earlier belonged.
> > 
> > Any feedback on the patchset will be immensely valuable.
> 
> I still think it's a layering violation... it's the hypervisor manager
> that should be bothered about what state an off-lined cpu is in.

Agreed. Proposed interface is ugly.
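
For reference, the three states quoted above can be summarised in a small sketch. The names `cpu_hotplug_state`, `parse_cpu_state()` and `changes_lpar_config()` are hypothetical and are not taken from the patchset. The sketch only encodes what the cover letter states: the state names, and that "offline" changes the LPAR configuration while "inactive" does not.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical names; the actual patchset may model this differently. */
enum cpu_hotplug_state {
    CPU_STATE_ONLINE,    /* the processor is online */
    CPU_STATE_OFFLINE,   /* vCPU handed back to the resource pool */
    CPU_STATE_INACTIVE,  /* vCPU ceded to the hypervisor, still entitled */
    CPU_STATE_INVALID    /* anything else */
};

/* Translate one of the state names from the cover letter into the enum. */
static enum cpu_hotplug_state parse_cpu_state(const char *name)
{
    assert(name != NULL);
    if (strcmp(name, "online") == 0)
        return CPU_STATE_ONLINE;
    if (strcmp(name, "offline") == 0)
        return CPU_STATE_OFFLINE;
    if (strcmp(name, "inactive") == 0)
        return CPU_STATE_INACTIVE;
    return CPU_STATE_INVALID;
}

/* Per the cover letter, only "offline" is visible as an LPAR config change. */
static bool changes_lpar_config(enum cpu_hotplug_state state)
{
    return state == CPU_STATE_OFFLINE;
}
```

The distinction the objection turns on is captured by `changes_lpar_config()`: whether the guest kernel or the hypervisor manager should be the one choosing between these outcomes.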

-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html


* Market research for new PowerPC system
From: Konstantinos Margaritis @ 2009-09-26 11:38 UTC
  To: linuxppc-dev, debian-powerpc, opensuse-ppc; +Cc: Konstantinos Margaritis

(Sorry for the cross-posting, please ignore if you are not interested  
in this, CC me as I'm not subscribed)
Hi,

First some introductions. I'm Konstantinos Margaritis, a long time  
Amiga/BeOS/Linux user/developer and a PowerPC fan, former Debian  
Developer, also a SIMD/AltiVec fanatic and the author of libfreevec.  
I've posted this on the following sites:

http://amigaworld.net/modules/newbb/viewtopic.php?mode=viewtopic&topic_id=29594&forum=33&start=0&viewmode=flat&order=0

http://www.amiga.org/forums/showthread.php?t=49424

http://www.morphzone.org/modules/newbb_plus/viewtopic.php?topic_id=6465&forum=11

http://aros-exec.org/modules/newbb/viewtopic.php?viewmode=flat&topic_id=3768&forum=4

http://www.haiku-os.org/community/forum/market_research_new_powerpc_system#comment-12604

To anyone who is not a PowerPC user, it might seem crazy, but
here it goes:

I'm considering funding the design & production of a new PowerPC
system (well, the motherboard, the rest is typical PC stuff and a
case). No, this is not a joke, I've been wanting to do this for a long
time, and perhaps the chance will be given to me now. But before I
spend any money on this, I want to do a little market research first.
I know the market is literally "dying" for a new powerpc motherboard,
but exactly how many people are there who want to buy one?

Ok, let's give some rough specs first. I'm considering 3 choices, not
in order of probability/importance:

1. MPC8640D-based. It will most likely be dual core at 1GHz; higher
frequencies are much more expensive and the cost of the final board
would be prohibitive.
2. MPC8610-based. Single core at 1GHz, slightly less expensive, and
includes a 2D DIU display unit: quite fast, but no 3D unfortunately.
3. QorIQ P1022-based. Again dual core at 1GHz (1055MHz to be precise).
Apart from the much lower chip price, this one includes dual gigabit
ethernet, dual SATA, USB 2.0 and a 2D DIU display unit (same as the
MPC8610). So this one would lower the cost of the board considerably.
Disadvantages: no AltiVec unit (it sucks, I know), though it includes
an SPE unit which is not that bad, and availability will be in Q3/Q4
2010, so that's a long wait.

Now, the end motherboard will probably be MicroATX (in the 8640D/8610  
case) or PicoITX (in the P1022 case), and it will definitely include:

* SATA connectors
* USB (possibly 2 back and 2 front, but that's open to discussion)
* Dual gigabit (at least one will be there, in the case of the  
MPC8640D we might even have 4!!!)
* Sound (of course, SPDIF support will definitely be there)
* 1 PCI-e slot 1x
* 1 PCI-e slot (4x in the P1022 case, 8x in the MPC86xx cases)

Ok, what I want to know is if people would really, really buy one of
these. End price is estimated to be around 350EUR for the P1022
board or ~500EUR for the other boards (definitely more in the case of
the 8640D). Besides being more expensive, the MPC86xx chips don't
include SATA or USB, and offer only one of ethernet/sound (quad-gige in
the MPC8640D case, or sound in the case of the MPC8610). I know this
sounds like a lot, but it's the reality: there is not enough funding to
build enormous amounts of units and bring the prices down substantially,
so we have to start low and build up from there. In case you are
wondering, yes, the boards will be designed/produced by bPlan and funded
by my company (Codex).

Support for OSes: Linux definitely, Haiku most probably and there is a  
possibility of supporting AmigaOS/MorphOS, which will depend on the  
actual feedback I get from those users.

I would like to make a list of everyone that is really interested in
such a system, so it would really help me make a decision sooner
rather than later if you would send me a few personal details to
markos@codex.gr with subject "PowerPC board":

* Name
* Country
* email (definitely, I'd have to reach you back!)
* Phone/Skype (optional, please include international prefix)
* Forum you saw this post (ok, Morphzone in this case)
* OS of preference
* board you would be most interested in (MPC8610/MPC8640D/P1022)
* preferred price (please keep in mind the estimated price quotes I
mentioned, it might be lower but that's not very probable)
* Other notes/comments

Also, I found out that I had to state my case on many forums to prove  
that this is not vapourware. Well, it will not be vapourware, if I get  
feedback. So far the feedback I got can be summarized here:

http://www.codex.gr/index.php?pageID=&blogItem=60

Thanks a lot for your time and I hope this system becomes a reality.

Konstantinos Margaritis

Codex


* Problem with futex call on 8xx
From: Frank Svendsbøe @ 2009-09-26 12:00 UTC
  To: linuxppc-dev

Hi
I'm having a problem with ~100% CPU load on MPC8xx when calling
pthread_cond_wait.
Running strace, I get:

futex(0x10040d5c, FUTEX_WAIT, 1, NULL)  = -1 ENOSYS (Function not implemented)

.. and this call is infinitely repeating, which explains the high load.

I'm running Linux 2.6.26-rc2 (not newer due to the slowdown problem
introduced by commit
8d30c14cab30d405a05f2aaceda1e9ad57800f36, as pointed out by Rex Feany
this week), and
glibc v2.6 (part of ELDK 4.2).

Is this a problem due to an error in the futex implementation in
glibc, or a kernel problem?
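
One way to separate the two possibilities is to issue the futex system call directly, bypassing glibc's pthread code. The following is a minimal sketch assuming a Linux target with the usual kernel headers; `futex_supported()` is a hypothetical helper name, while `syscall()`, `SYS_futex` and `FUTEX_WAKE` are the standard Linux interfaces. It uses FUTEX_WAKE on a word that nobody waits on, so it never blocks.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* A futex word is always 32 bits wide. */
static_assert(sizeof(int) == 4, "futex word must be 32 bits");

/*
 * Hypothetical helper: returns 1 if the running kernel implements futex,
 * 0 if the syscall fails with ENOSYS, and -1 on any other error.
 */
static int futex_supported(void)
{
    int word = 0;  /* no thread is waiting on this word */

    /* With no waiters, FUTEX_WAKE wakes nobody and returns 0 on success. */
    long ret = syscall(SYS_futex, &word, FUTEX_WAKE, 1, NULL, NULL, 0);

    if (ret >= 0)
        return 1;
    return errno == ENOSYS ? 0 : -1;
}
```

A return value of 0 would reproduce the ENOSYS seen in the strace output without involving glibc, which would point at the kernel rather than at the futex code in glibc.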

Best regards,
Frank


* Re: Market research for new PowerPC system
From: Stephen Rothwell @ 2009-09-26 12:18 UTC
  To: linuxppc-dev
In-Reply-To: <0046254A-432D-4AE3-9B9F-C0D30311B7D7@codex.gr>

Just in case anyone feels like flaming about this post, I allowed it
because I thought some of you may feel inclined to provide some technical
advice or enthusiasm for someone else building PowerPC systems.  If you
object, then just ignore the post ...

On Sat, 26 Sep 2009 14:38:38 +0300 Konstantinos Margaritis <markos@codex.gr> wrote:
>
> (Sorry for the cross-posting, please ignore if you are not interested  
> in this, CC me as I'm not subscribed)
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/


* Re: [patch] powerpc: build modules outside the kernel tree fails, if it was built using O=
From: Yuri Frolov @ 2009-09-26 12:45 UTC
  To: Sam Ravnborg; +Cc: rep.dot.nop, linuxppc-dev, linux-kbuild
In-Reply-To: <20090925194557.GA3323@merkur.ravnborg.org>

Hello, here is a fixed version.

Compile and export arch/powerpc/lib/crtsavres.o in order to fix the
"arch/powerpc/lib/crtsavres.o not found" error when the "O=" option
is used for external module compilation.
crtsavres.o is a support file containing save/restore code from gcc,
simplified for the needs of the powerpc architecture.
This file needs to be linked against every module and thus must be built
before any module.

Documentation/kbuild/kbuild.txt |    8 ++++++++
Makefile                        |    2 +-
arch/powerpc/Makefile           |    2 +-
scripts/Makefile.modpost        |   14 ++++++++++++--
4 files changed, 22 insertions(+), 4 deletions(-)

diff -urpN -X linux-2.6/Documentation/dontdiff linux-2.6/arch/powerpc/Makefile linux-2.6-powerpc-crtsavres/arch/powerpc/Makefile
--- linux-2.6/arch/powerpc/Makefile	2009-09-17 20:04:31.000000000 +0400
+++ linux-2.6-powerpc-crtsavres/arch/powerpc/Makefile	2009-09-26 13:35:32.000000000 +0400
@@ -93,7 +93,7 @@ else
 	KBUILD_CFLAGS += $(call cc-option,-mtune=power4)
 endif
 else
-LDFLAGS_MODULE	+= arch/powerpc/lib/crtsavres.o
+KBUILD_MODULE_LINK_SOURCE += arch/powerpc/lib/crtsavres.o
 endif
 
 ifeq ($(CONFIG_TUNE_CELL),y)
diff -urpN -X linux-2.6/Documentation/dontdiff linux-2.6/Documentation/kbuild/kbuild.txt linux-2.6-powerpc-crtsavres/Documentation/kbuild/kbuild.txt
--- linux-2.6/Documentation/kbuild/kbuild.txt	2009-09-17 20:04:30.000000000 +0400
+++ linux-2.6-powerpc-crtsavres/Documentation/kbuild/kbuild.txt	2009-09-26 16:03:50.000000000 +0400
@@ -132,3 +132,11 @@ For tags/TAGS/cscope targets, you can sp
 to be included in the databases, separated by blank space. E.g.:
 
     $ make ALLSOURCE_ARCHS="x86 mips arm" tags
+
+KBUILD_MODULE_LINK_SOURCE
+--------------------------------------------------
+Compile and export arch/powerpc/lib/crtsavres.o
+when "O=" option is employed for powerpc external module compilation.
+This file needs to be linked against every module and thus to be built
+before any module.
+
diff -urpN -X linux-2.6/Documentation/dontdiff linux-2.6/Makefile linux-2.6-powerpc-crtsavres/Makefile
--- linux-2.6/Makefile	2009-09-17 20:04:30.000000000 +0400
+++ linux-2.6-powerpc-crtsavres/Makefile	2009-09-26 14:23:27.000000000 +0400
@@ -354,7 +354,7 @@ KERNELVERSION = $(VERSION).$(PATCHLEVEL)
 export VERSION PATCHLEVEL SUBLEVEL KERNELRELEASE KERNELVERSION
 export ARCH SRCARCH CONFIG_SHELL HOSTCC HOSTCFLAGS CROSS_COMPILE AS LD CC
 export CPP AR NM STRIP OBJCOPY OBJDUMP MAKE AWK GENKSYMS PERL UTS_MACHINE
-export HOSTCXX HOSTCXXFLAGS LDFLAGS_MODULE CHECK CHECKFLAGS
+export HOSTCXX HOSTCXXFLAGS LDFLAGS_MODULE CHECK CHECKFLAGS KBUILD_MODULE_LINK_SOURCE
 
 export KBUILD_CPPFLAGS NOSTDINC_FLAGS LINUXINCLUDE OBJCOPYFLAGS LDFLAGS
 export KBUILD_CFLAGS CFLAGS_KERNEL CFLAGS_MODULE CFLAGS_GCOV
diff -urpN -X linux-2.6/Documentation/dontdiff linux-2.6/scripts/Makefile.modpost linux-2.6-powerpc-crtsavres/scripts/Makefile.modpost
--- linux-2.6/scripts/Makefile.modpost	2009-09-17 20:04:42.000000000 +0400
+++ linux-2.6-powerpc-crtsavres/scripts/Makefile.modpost	2009-09-26 14:34:28.000000000 +0400
@@ -122,14 +122,24 @@ quiet_cmd_cc_o_c = CC      $@
       cmd_cc_o_c = $(CC) $(c_flags) $(CFLAGS_MODULE)	\
 		   -c -o $@ $<
 
-$(modules:.ko=.mod.o): %.mod.o: %.mod.c FORCE
+quiet_cmd_as_o_S = AS $(quiet_modtag)  $@
+      cmd_as_o_S = $(CC) $(a_flags) $(AFLAGS_MODULE) -c -o $@ $<
+
+ifdef KBUILD_MODULE_LINK_SOURCE
+$(KBUILD_MODULE_LINK_SOURCE): %.o: %.S FORCE
+	$(Q)mkdir -p $(dir $@)
+	$(call if_changed_dep,as_o_S)
+endif
+
+$(modules:.ko=.mod.o): %.mod.o: %.mod.c $(KBUILD_MODULE_LINK_SOURCE) FORCE
 	$(call if_changed_dep,cc_o_c)
 
 targets += $(modules:.ko=.mod.o)
 
 # Step 6), final link of the modules
 quiet_cmd_ld_ko_o = LD [M]  $@
-      cmd_ld_ko_o = $(LD) -r $(LDFLAGS) $(LDFLAGS_MODULE) -o $@		\
+      cmd_ld_ko_o = $(LD) -r $(LDFLAGS) $(KBUILD_MODULE_LINK_SOURCE)	\
+			  $(LDFLAGS_MODULE) -o $@			\
 			  $(filter-out FORCE,$^)
 
 $(modules): %.ko :%.o %.mod.o FORCE

On 09/25/2009 11:45 PM, Sam Ravnborg wrote:
> On Thu, Sep 24, 2009 at 03:28:11PM +0400, Yuri Frolov wrote:
>> Hello,
>>
>> here is a corresponding bug: http://bugzilla.kernel.org/show_bug.cgi?id=11143
>> This patch should correctly export crtsavres.o in order to make the O= option work.
>> Please consider applying it.
> 
> Hi Yuri.
> 
> I like the way you do the extra link in Makefile.modpost.
> But you need to redo some parts as per comments below.
> 
>>
>> Fix linking modules against crtsavres.o
> 
> Please elaborate more on what this commit does.
> 
>> Previously we got
>>   CC      drivers/char/hw_random/rng-core.mod.o
>>   LD [M]  drivers/char/hw_random/rng-core.ko
>> /there/src/buildroot.git.ppc/build_powerpc_nofpu/staging_dir/usr/bin/powerpc-linux-uclibc-ld: arch/powerpc/lib/crtsavres.o: No such file: No such file or directory
> 
> Always good to include error messages.
> 
>> 	* Makefile (LDFLAGS_MODULE_PREREQ): New variable to hold prerequisite
>>           files for modules.
>> 	* arch/powerpc/Makefile: add crtsavres.o to LDFLAGS_MODULE_PREREQ.
>> 	* scripts/Makefile.modpost (cmd_as_o_S): Copy from Makefile.build.
>> 	  (cmd_ld_ko_o): Also link LDFLAGS_MODULE_PREREQ.
>> 	  Provide rule to build objects from assembler.
> But this GNUism can go - we do not use it in the kernel.
>  
>> Signed-off-by:  Bernhard Reutner-Fischer  <rep.dot.nop@gmail.com>
>> Signed-off-by:  Yuri Frolov <yfrolov@ru.mvista.com>
>>
>> Makefile                 |    2 ++
>> arch/powerpc/Makefile    |    2 +-
>> scripts/Makefile.modpost |   12 ++++++++++--
>> 3 files changed, 13 insertions(+), 3 deletions(-)
>>
>> diff -urpN -X linux-2.6/Documentation/dontdiff linux-2.6/arch/powerpc/Makefile linux-2.6-powerpc-crtsavres/arch/powerpc/Makefile
>> --- linux-2.6/arch/powerpc/Makefile	2009-09-17 20:04:31.000000000 +0400
>> +++ linux-2.6-powerpc-crtsavres/arch/powerpc/Makefile	2009-09-23 22:08:03.000000000 +0400
>> @@ -93,7 +93,7 @@ else
>>  	KBUILD_CFLAGS += $(call cc-option,-mtune=power4)
>>  endif
>>  else
>> -LDFLAGS_MODULE	+= arch/powerpc/lib/crtsavres.o
>> +LDFLAGS_MODULE_PREREQ += arch/powerpc/lib/crtsavres.o
>>  endif
> 
> The naming sucks.
> How about:
> 
> KBUILD_MODULE_LINK_SOURCE
> 
> This would tell the reader that this is source to be linked on a module.
> 
> And this is an arch specific thing so no need to preset it in top-level
> Makefile.
> But it is mandatory to include a description in Documentation/kbuild/kbuild.txt
> 
> 
>> --- linux-2.6/scripts/Makefile.modpost	2009-09-17 20:04:42.000000000 +0400
>> +++ linux-2.6-powerpc-crtsavres/scripts/Makefile.modpost	2009-09-23 22:15:00.000000000 +0400
>> @@ -122,14 +122,22 @@ quiet_cmd_cc_o_c = CC      $@
>>        cmd_cc_o_c = $(CC) $(c_flags) $(CFLAGS_MODULE)	\
>>  		   -c -o $@ $<
>>  
>> -$(modules:.ko=.mod.o): %.mod.o: %.mod.c FORCE
>> +quiet_cmd_as_o_S = AS $(quiet_modtag)  $@
>> +cmd_as_o_S       = $(CC) $(a_flags) $(AFLAGS_MODULE) -c -o $@ $<
> 
> Align this so cmd_as_o_S is under each other - as we do for cmd_cc_o_c
> 
> 
>> +
>> +$(LDFLAGS_MODULE_PREREQ): %.o: %.S FORCE
>> +	$(Q)mkdir -p $(dir $@)
>> +	$(call if_changed_dep,as_o_S)
> Good catch with the mkdir - needed for O= builds.
> I think we shall wrap this in
> ifdef KBUILD_MODULE_LINK_SOURCE
> ...
> endif
> 
> So we do not have an empty rule when it is not defined.
> 
> Please fix up these things and resubmit.
> 
> Thanks,
> 	Sam


* Re: Problem with futex call on 8xx
From: Frank Svendsbøe @ 2009-09-26 12:52 UTC
  To: linuxppc-dev; +Cc: Scott Wood, Detlev Zundel
In-Reply-To: <1ba63b520909260500k3b7a0ed1l2d75116053ae5522@mail.gmail.com>

I'll answer myself, since I just found the solution to the problem -
with help from 'kos_tom'
at #uclibc on freenode.

The problem is caused by a missing 'CONFIG_FUTEX=y' in the defconfig
for our target.

The system we're developing is based on the Adder port by Scott Wood,
which is also missing this, so a patch should be committed for this and
possibly other 8xx targets as well.
Best regards,
Frank

On Sat, Sep 26, 2009 at 2:00 PM, Frank Svendsbøe
<frank.svendsboe@gmail.com> wrote:
> Hi
> I'm having a problem with ~100% CPU load on MPC8xx when calling
> pthread_cond_wait.
> Running strace, I get:
>
> futex(0x10040d5c, FUTEX_WAIT, 1, NULL)  = -1 ENOSYS (Function not implemented)
>
> .. and this call is infinitely repeating, which explains the high load.
>
> I'm running Linux 2.6.26-rc2 (not newer due to the slowdown problem
> introduced by commit
> 8d30c14cab30d405a05f2aaceda1e9ad57800f36, as pointed out by Rex Feany
> this week), and
> glibc v2.6 (part of ELDK 4.2).
>
> Is this a problem due to an error in the futex implementation in
> glibc, or a kernel problem?
>
> Best regards,
> Frank
>


* Re: Market research for new PowerPC system
From: Guennadi Liakhovetski @ 2009-09-26 17:58 UTC
  To: Konstantinos Margaritis; +Cc: debian-powerpc, linuxppc-dev, opensuse-ppc
In-Reply-To: <0046254A-432D-4AE3-9B9F-C0D30311B7D7@codex.gr>

On Sat, 26 Sep 2009, Konstantinos Margaritis wrote:

> (Sorry for the cross-posting, please ignore if you are not interested in this,
> CC me as I'm not subscribed)
> Hi,
> 
> First some introductions. I'm Konstantinos Margaritis, a long time
> Amiga/BeOS/Linux user/developer and a PowerPC fan, former Debian Developer,
> also a SIMD/AltiVec fanatic and the author of libfreevec. I've posted this on
> the following sites:
> 
> http://amigaworld.net/modules/newbb/viewtopic.php?mode=viewtopic&topic_id=29594&forum=33&start=0&viewmode=flat&order=0
> 
> http://www.amiga.org/forums/showthread.php?t=49424
> 
> http://www.morphzone.org/modules/newbb_plus/viewtopic.php?topic_id=6465&forum=11
> 
> http://aros-exec.org/modules/newbb/viewtopic.php?viewmode=flat&topic_id=3768&forum=4
> 
> http://www.haiku-os.org/community/forum/market_research_new_powerpc_system#comment-12604
> 
> To anyone who is not a PowerPC user, it might seem crazy, but here it
> goes:
> 
> I'm considering funding the design & production of a new PowerPC system (well,
> the motherboard, the rest is typical PC stuff and a case). No, this is not a
> joke, I've been wanting to do this for a long time, and perhaps the chance
> will be given to me now. But before I spend any money on this, I want to do a
> little market research first. I know the market is literally "dying" for a new
> powerpc motherboard, but exactly how many people are there who want to buy one?

Ok, just a short comment. In principle I like diversity, competition, etc. 
And it was somewhat sad when Apple abandoned ppc. But honestly - why 
should I be buying a ppc desktop system? If we restrict our comparison to 
Linux, because that's what I'm using, what advantages would a ppc system
give me over a comparably priced ix86 system? This is not meant
negatively, I just have not followed recent ppc CPUs from the "desktop"
range, so this is a real honest question. Would such a system provide
more MIPS per Watt at the same price? Or more peripherals? Or some specific
hardware blocks unavailable or unsupported on ix86?

Thanks
Guennadi

> 
> Ok, let's give some rough specs first. I'm considering 3 choices, not in order
> of probability/importance:
> 
> 1. MPC8640D-based. It will most likely be dual core at 1GHz; higher
> frequencies are much more expensive and the cost of the final board would be
> prohibitive.
> 2. MPC8610-based. Single core at 1GHz, slightly less expensive, and includes a
> 2D DIU display unit: quite fast, but no 3D unfortunately.
> 3. QorIQ P1022-based. Again dual core at 1GHz (1055MHz to be precise). Apart
> from the much lower chip price, this one includes dual gigabit ethernet, dual
> SATA, USB 2.0 and a 2D DIU display unit (same as the MPC8610). So this one
> would lower the cost of the board considerably. Disadvantages: no AltiVec unit
> (it sucks, I know), though it includes an SPE unit which is not that bad, and
> availability will be in Q3/Q4 2010, so that's a long wait.
> 
> Now, the end motherboard will probably be MicroATX (in the 8640D/8610 case) or
> PicoITX (in the P1022 case), and it will definitely include:
> 
> * SATA connectors
> * USB (possibly 2 back and 2 front, but that's open to discussion)
> * Dual gigabit (at least one will be there, in the case of the MPC8640D we
> might even have 4!!!)
> * Sound (of course, SPDIF support will definitely be there)
> * 1 PCI-e slot 1x
> * 1 PCI-e slot (4x in the P1022 case, 8x in the MPC86xx cases)
> 
> Ok, what I want to know is if people would really, really buy one of these. End
> price is estimated to be around 350EUR for the P1022 board or ~500EUR for the
> other boards (definitely more in the case of the 8640D).
> Besides being more expensive, the MPC86xx chips don't include SATA or USB, and
> offer only one of ethernet/sound (quad-gige in the MPC8640D case, or sound in
> the case of the MPC8610). I know this sounds like a lot, but it's the reality:
> there is not enough funding to build enormous amounts of units and bring the
> prices down substantially, so we have to start low and build up from there. In
> case you are wondering, yes, the boards will be designed/produced by bPlan and
> funded by my company (Codex).
> 
> Support for OSes: Linux definitely, Haiku most probably and there is a
> possibility of supporting AmigaOS/MorphOS, which will depend on the actual
> feedback I get from those users.
> 
> I would like to make a list of everyone that is really interested in such a
> system, so it would really help me make a decision sooner rather than later if
> you would send me a few personal details to markos@codex.gr with subject
> "PowerPC board":
> 
> * Name
> * Country
> * email (definitely, I'd have to reach you back!)
> * Phone/Skype (optional, please include international prefix)
> * Forum you saw this post (ok, Morphzone in this case)
> * OS of preference
> * board you would be most interested in (MPC8610/MPC8640D/P1022)
> * preferred price (please keep in mind the estimated price quotes I mentioned,
> it might be lower but that's not very probable)
> * Other notes/comments
> 
> Also, I found out that I had to state my case on many forums to prove that
> this is not vapourware. Well, it will not be vapourware, if I get feedback. So
> far the feedback I got can be summarized here:
> 
> http://www.codex.gr/index.php?pageID=&blogItem=60
> 
> Thanks a lot for your time and I hope this system becomes a reality.
> 
> Konstantinos Margaritis
> 
> Codex

---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/


* Re: Market research for new PowerPC system
From: Leon Woestenberg @ 2009-09-26 18:15 UTC
  To: Konstantinos Margaritis; +Cc: debian-powerpc, linuxppc-dev, opensuse-ppc
In-Reply-To: <0046254A-432D-4AE3-9B9F-C0D30311B7D7@codex.gr>

Hello,

First off, I like your idea. This is my public reply, I'll give a
personal reply later.

On Sat, Sep 26, 2009 at 1:38 PM, Konstantinos Margaritis
<markos@codex.gr> wrote:
> I'm considering funding the design & production of a new PowerPC system
> (well, the motherboard, the rest are typical pc stuff and a case). No this
>
What makes the system stand out from, say, an Atom-based PC?

(I know, playing devil's advocate here)

> 1. MPC8640D-based. It will be dual core at 1Ghz -most likely, higher
> frequencies are much more expensive and the cost of the final board would be
> prohibitive.
> 2. MPC8610-based. Single core at 1Ghz, slightly less expensive, and includes
> a 2D DIU display unit -quite fast, but no 3D unfortunately.
> 3. QorIQ P1022-based. Again dual core at 1Ghz (1055Mhz to be precise). Apart
> from the much lower chip price, this one includes dual gigabit ethernet,
> dual SATA, USB 2.0 and a 2D DIU display unit (same as the MPC8610). So this
>

Go for QorIQ P1022, it's the ideal SoC for many applications.

Alternatively, its predecessor MPC8536E available now, ~same specs,
but higher power.  But not the two you mention please.

> End price is estimated to be ~around~ 350EUR for the P1022 board or ~500EUR
>
Pico P1022 or Pico MPC8536E pls.

and throw PCI Express (x4) in the party!

(hint: I haven't seen an Intel board with Atom and PCI Express yet).

Will the board be open hardware? I.e. an open sourced design?


Regards,
-- 
Leon


* Re: Market research for new PowerPC system
From: Konstantinos Margaritis @ 2009-09-26 19:08 UTC
  To: Leon Woestenberg; +Cc: debian-powerpc, linuxppc-dev
In-Reply-To: <c384c5ea0909261115t361e019auadf3b76a312ff2d1@mail.gmail.com>


> What makes the system stand out from, say, an Atom-based PC?

As you said, PCI Express, *actual* low power and probably higher speed:
all CPUs mentioned are faster than the Atom, at least in relative terms.
The SoC design means fewer components on the board, so smaller sizes
might be achieved.

> Go for QorIQ P1022, it's the ideal SoC for many applications.
>
I know, I like the chip very much, the thing is that it won't be  
available till Q4/2010, so it will take quite a bit more thinking.

> Alternatively, its predecessor MPC8536E available now, ~same specs,
> but higher power.  But not the two you mention please.

Why not? I mean seriously, why would you dislike them?

>> End price is estimated to be ~around~ 350EUR for the P1022 board or  
>> ~500EUR
>>
> Pico P1022 or Pico MPC8536E pls.

> and throw PCI Express (x4) in the party!

Definitely PCI Express for all CPUs anyway.

> (hint: I haven't seen an Intel board with Atom and PCI Express yet).
>
> Will the board be open hardware? I.e. an open sourced design?

I don't know yet. To be frank, I haven't thought about it. Though I've  
worked with open source software for years, I have no experience  
whatsoever with open source hardware; I don't know the dangers,  
pitfalls, advantages, etc.

Konstantinos

^ permalink raw reply

* Re: Market research for new PowerPC system
From: Konstantinos Margaritis @ 2009-09-26 19:15 UTC (permalink / raw)
  To: Guennadi Liakhovetski; +Cc: debian-powerpc, linuxppc-dev
In-Reply-To: <Pine.LNX.4.64.0909261949240.4273@axis700.grange>

On Sep 26, 2009, at 8:58 PM, Guennadi Liakhovetski wrote:
>
> Ok, just a short comment. In principle I like diversity, competition, etc.
> And it was somewhat sad when Apple abandoned ppc. But honestly - why
> should I be buying a ppc desktop system? If we restrict our comparison to
> Linux, because that's what I'm using, what advantages would a ppc system
> give me over a comparable in price ix86 system? This is not meant
> negatively, I just have not followed recent ppc CPUs from the "desktop"
> range, so, this is a real honest question. Would such a system provide
> more MIPS per Watt at the same price? Or more peripherals? Or some specific
> hardware blocks unavailable or unsupported on ix86?

Ok, I remember a few years back when we had Alpha, MIPS, x86, SPARC,
PowerPC, etc all viable platforms to use and work on. Now it's only x86.
I'm sorry, I just don't like it. I cannot answer your question, no more
than I can answer why a car lover buys an old Jaguar antique for the
price he could buy a new Audi S8 for example. Well, ok the analogy is
not exactly the same, but you get the point. If not, well, the ppc board
would just lessen the current gap between x86/ppc in favour of the
-admittedly very small- ppc desktop/hobbyist market. Nevertheless, I'm
pretty sure the system would find itself in many ppc developers' desks,
just because they can't really buy something *new* with those specs, at
this price range. Ok, perhaps I will fail and just add my name to the
list of failed hardware projects. Perhaps not. I really don't know if I
can convince you if you don't want to be convinced. Deliver a super ppc
system that beats all x86 systems at the same or better price? No, I'm
sorry I cannot do that, and I never implied I could. Only IBM/Freescale
could do that and even then the game would not be in their favour.

Regards

Konstantinos

^ permalink raw reply

* Re: Market research for new PowerPC system
From: Guennadi Liakhovetski @ 2009-09-26 20:49 UTC (permalink / raw)
  To: Konstantinos Margaritis; +Cc: debian-powerpc, linuxppc-dev
In-Reply-To: <95FC0E0B-F6FE-45FD-A84E-AC4AF477F786@codex.gr>

On Sat, 26 Sep 2009, Konstantinos Margaritis wrote:

> 
> On Sep 26, 2009, at 8:58 PM, Guennadi Liakhovetski wrote:
> > 
> > Ok, just a short comment. In principle I like diversity, competition, etc.
> > And it was somewhat sad when Apple abandoned ppc. But honestly - why
> > should I be buying a ppc desktop system? If we restrict our comparison to
> > Linux, because that's what I'm using, what advantages would a ppc system
> > give me over a comparable in price ix86 system? This is not meant
> > negatively, I just have not followed recent ppc CPUs from the "desktop"
> > range, so, this is a real honest question. Would such a system provide
> > more MIPS per Watt at the same price? Or more peripherals? Or some specific
> > hardware blocks unavailable or unsupported on ix86?
> 
> Ok, I remember a few years back when we had Alpha, MIPS, x86, SPARC,
> PowerPC, etc all viable platforms to use and work on. Now it's only x86. I'm sorry,
> I just don't like it. I cannot answer your question, no more than I can answer why
> a car lover buys an old Jaguar antique for the price he could buy a new Audi S8
> for example. Well, ok the analogy is not exactly the same, but you get the point.
> If not, well, the ppc board would just lessen the current gap between x86/ppc in
> favour of the -admittedly very small- ppc desktop/hobbyist market. Nevertheless,
> I'm pretty sure the system would find itself in many ppc developers' desks, just
> because they can't really buy something *new* with those specs, at this price range.
> Ok, perhaps I will fail and just add my name to the list of failed hardware projects.
> Perhaps not. I really don't know if I can convince you if you don't want to be convinced.
> Deliver a super ppc system that beats all x86 systems at the same or better price? No,
> I'm sorry I cannot do that, and I never implied I could. Only IBM/Freescale could do that
> and even then the game would not be in their favour.

Ok, fair enough, as I said, that wasn't meant as a pun. I'd really love to 
see non-x86 desktops _successful_ on the market, and I don't mean just 
ARM-based netbooks, nettops, tablets, etc.:-) So, good luck to you, and I 
really mean it! Interestingly, ppc competes on embedded, competes on 
servers, but is practically absent on desktops (apart from a couple of 
hackintosh manufacturers:-)), so, maybe indeed there's still something 
that ppc can offer us that x86 cannot - as a self-contained system, and 
not just a development platform for ppc professionals?

Thanks
Guennadi
---
Guennadi Liakhovetski, Ph.D.
Freelance Open-Source Software Developer
http://www.open-technology.de/

^ permalink raw reply

* Re: lite5200b kernel not booting
From: Albrecht Dreß @ 2009-09-27 12:22 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <9e4733910909250515p60e0b0e0v8c10063326fe882c@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1298 bytes --]

Am 25.09.09 14:15 schrieb(en) Jon Smirl:
> No one has tried the Macraigor USB wiggler on the mpc5200 and  
> reported back if it works.

I'm working on a self-designed, roughly Lite5200B based board, and  
first thought about buying this device.  However, I tried to contact  
their support several times (to get confirmation that it would work  
with our flash and board-specific design), but never got a reply.  So  
we ended up with the Abatron BDI3000, which is rather expensive, but  
works really flawlessly, and Abatron's (and their German distributor's)  
support is just *excellent* (fast and professional).

> uboot 1.2 is very old. That may be the cause of your problems. For  
> example old u-boots don't initialize the PCI hardware correctly on  
> systems that don't have PCI implemented.

U-Boot 2009.03 together with ELDK 4.2 work flawlessly on both the  
Lite5200B and on our design.

> In general the current powerpc kernel works fine on the mpc5200b. We  
> are running it on four different CPU boards but I don't have a  
> lite5200b.

I can only confirm that the stock (from kernel.org) kernels run  
flawlessly on our boards (the latest I tried was 2.6.30.3, though, but  
I don't expect any issues with 2.6.31.1).

Hope this helps,
Albrecht.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
From: Joakim Tjernlund @ 2009-09-27 13:22 UTC (permalink / raw)
  To: Rex Feany; +Cc: linuxppc-dev@ozlabs.org
In-Reply-To: <20090925211848.GA3371@compile2.chatsunix.int.mrv.com>

>
> Thus spake Benjamin Herrenschmidt (benh@kernel.crashing.org):
>
> >
> > > I think there's more finishyness to 8xx than we thought. IE. That
> > > tlbil_va might have more reasons to be there than what the comment
> > > seems to advertize. Can you try to move it even higher up ? IE.
> > > Unconditionally at the beginning of set_pte_filter ?
> > >
> > > Also, if that doesn't help, can you try putting one in
> > > set_access_flags_filter() just below ?
> >
> > Ok, I got a refresher on the whole concept of "unpopulated TLB entries"
> > on 8xx, and that's damn scary. I think what misled me initially is that
> > the comment around the workaround is simply not properly describing the
> > extent of the problem :-)
>
> Oh boy, that sounds bad. Where is a good place to read about this?
>
> > So I'm not going to make the 8xx TLB miss code sane, that's beyond what
> > I'm prepared to do with it, but I suspect that this should fix it (on top
> > of upstream). Let me know if that's enough or if we also need to put
> > one of these in ptep_set_access_flags().
> >
> > Please let me know if that works for you.
>
> Putting the tlbil_va() in the top of set_pte_filter() doesn't work - it
> hangs on boot before it even prints any messages to the console.
>
> However, adding tlbil_va() to ptep_set_access_flags() as you suggested
> makes everything happy. I need to test it some more, but it looks good
> so far. Below is what I am testing now.

8xx is getting very hacky, and I suspect that the only long-term fix is to
add code to trap the cache instructions in TLB error/miss and fix up the
exception in the page fault handler. This will also have the added benefit
of being able to use the cache instructions in both kernel and user space
like any other ppc arch.

   Jocke

^ permalink raw reply

* Re: [PATCH] i2c-mpc: Do not generate STOP after read.
From: Joakim Tjernlund @ 2009-09-27 22:26 UTC (permalink / raw)
  To: Jean Delvare; +Cc: linuxppc-dev, linux-i2c, Esben Haabendal
In-Reply-To: <4ABC94ED.6070508@grandegger.com>

Jean, I just noticed your pull request for i2c on LKML but I didn't see this
patch, nor have I got any feedback from you. What is your view?

   Jocke

Wolfgang Grandegger <wg@grandegger.com> wrote on 25/09/2009 12:01:17:
>
> Joakim Tjernlund wrote:
> > The driver always ends a read with a STOP condition which
> > breaks subsequent I2C reads/writes in the same transaction as
> > these expect to do a repeated START(ReSTART).
> >
> > This will also help I2C multimaster as the bus will not be released
> > after the first read, but when the whole transaction ends.
> >
> > Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
> Tested-by: Wolfgang Grandegger <wg@grandegger.com>
>
> on a MPC8548 board with an up-to-date kernel. I did not notice any
> problems.
>
> Wolfgang.
>

^ permalink raw reply

* Re: GPIO driver for MPC8313.
From: Johnny Hung @ 2009-09-28  2:43 UTC (permalink / raw)
  To: Peter Korsgaard; +Cc: linuxppc-dev, linux-embedded
In-Reply-To: <cb9ecdfa0909230820k6ebbd589o409e58e2d9b6740c@mail.gmail.com>

The kernel source I am using now is 2.6.23 with the MPC8313-erdb patches
applied. I think I should backport the 2.6.28 gpio support to 2.6.23. Is
that the usual approach, or should I instead port the MPC8313-erdb patches
forward to 2.6.28?

BRs, H. Johnny

2009/9/23 Johnny Hung <johnny.hacking@gmail.com>:
> Many thanks for your help. I will try it.
>
> 2009/9/23 Peter Korsgaard <jacmet@sunsite.dk>:
>>>>>>> "Johnny" == Johnny Hung <johnny.hacking@gmail.com> writes:
>>
>>  Johnny> Thanks, got it. BTW, how to trigger GPIO level in user space
>>  Johnny> application? I also found
>>  Johnny> arch/powerpc/platforms/52xx/mpc52xx_gpio.c is a good
>>  Johnny> example. Any reply is appreciate.
>>
>> Through sysfs. See 'Sysfs Interface for Userspace' section of
>> Documentation/gpio.txt
>>
>> --
>> Bye, Peter Korsgaard
>>
>

^ permalink raw reply

* Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
From: Benjamin Herrenschmidt @ 2009-09-28  3:21 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OFEC38184A.A376E71D-ONC125763E.0048F581-C125763E.00497FD3@transmode.se>

On Sun, 2009-09-27 at 15:22 +0200, Joakim Tjernlund wrote:
> > However, adding tlbil_va() to ptep_set_access_flags() as you suggested
> > makes everything happy. I need to test it some more, but it looks good
> > so far. Below is what I am testing now.
> 
> 8xx is getting very hacky, and I suspect that the only long-term fix is to
> add code to trap the cache instructions in TLB error/miss and fix up the
> exception in the page fault handler. This will also have the added benefit
> of being able to use the cache instructions in both kernel and user space
> like any other ppc arch.

First I'd like to understand exactly what's happening today, since
it makes little sense :-) I suppose I'll have to get myself some
8xx doco and understand how the bloody MMU works.

Then, I saw your old patch and it's -very- invasive. If we can get away
with a one liner just adding tlbil_va in the right place, I think I'm
happy to stick with it until somebody comes up with a real good reason
to do more :-) 8xx is on life support and has been around for long
enough without people feeling the need overall to work around that
problem so I'm tempted to keep the status-quo here.

Cheers,
Ben

^ permalink raw reply

* [0/5] Assorted hugepage cleanups (v3)
From: David Gibson @ 2009-09-28  4:39 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt

Currently, ordinary pages use one pagetable layout, and each different
hugepage size uses a slightly different variant layout.  A number of
places which need to walk the pagetable must first check the slice map
to see what the pagetable layout is, then handle the various different
forms.  New hardware, like Book3E, is liable to introduce more possible
variants.

This patch series, therefore, is designed to simplify the matter by
limiting knowledge of the pagetable layout to only the allocation
path.  With this patch, ordinary pages are handled as ever, with a
fixed 4 (or 3) level tree.  All other variants branch off from some
layer of that with a specially marked PGD/PUD/PMD pointer which also
contains enough information to interpret the directories below that
point.  This means that things walking the pagetables (without
allocating) don't need to look up the slice map, they can just step
down the tree in the usual way, branching off to the "non-standard
layout" path for hugepages, which uses the embedded information to
interpret the tree from that point on.
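
As a rough illustration of the "specially marked pointer" idea, here is a
minimal userspace sketch. The bit layout and every name in it (DIR_SPECIAL,
dir_make_special() and so on) are assumptions made up for this example and
are not the encoding used by the patches; the only point is that a walker
can tell from the entry itself whether it has left the standard tree and
how many entries the table below it holds.

```c
#include <stdint.h>

/*
 * Hypothetical encoding, for illustration only:
 *   bit 0      marks a "non-standard layout" directory entry
 *   bits 1..5  hold log2 of the number of entries in the table below
 * The table must be at least 64-byte aligned so the low 6 bits are free.
 */
#define DIR_SPECIAL	0x1UL
#define DIR_SHIFT_MASK	0x3eUL
#define DIR_INFO_MASK	(DIR_SPECIAL | DIR_SHIFT_MASK)

/* Build a marked entry pointing at a table of 2^shift entries */
static inline uintptr_t dir_make_special(void *table, unsigned shift)
{
	return (uintptr_t)table | ((uintptr_t)shift << 1) | DIR_SPECIAL;
}

/* True if the walker must branch off to the non-standard path */
static inline int dir_is_special(uintptr_t entry)
{
	return (entry & DIR_SPECIAL) != 0;
}

/* log2 of the number of entries in the table below this entry */
static inline unsigned dir_shift(uintptr_t entry)
{
	return (unsigned)((entry & DIR_SHIFT_MASK) >> 1);
}

/* The table pointer with the embedded information stripped */
static inline void *dir_table(uintptr_t entry)
{
	return (void *)(entry & ~DIR_INFO_MASK);
}

/* Index into the table below; page_shift is log2 of the mapped page size */
static inline unsigned long dir_index(uintptr_t entry, unsigned long addr,
				      unsigned page_shift)
{
	return (addr >> page_shift) & ((1UL << dir_shift(entry)) - 1);
}
```

A walker built this way never needs a side table: it tests
dir_is_special() on each entry and, if set, uses dir_table() and
dir_index() to finish the lookup.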

This reduces the source size in a number of places, and means that
newer variants on the pagetable layout to handle new hardware and new
features will need to alter the existing code in fewer places.

In addition we split out the hash / classic MMU specific code into a
separate hugetlbpage-hash64.c file.  This will make adding support for
other MMUs (like 440 and/or Book3E) easier.

I've used the libhugetlbfs testsuite to test these patches on a
Power5+ machine, but they could certainly do with more testing. In
particular, I don't have any suitable hardware to test 16G pages.

V2: Made the tweaks that BenH suggested to patch 2 of the original
series.  Some corresponding tweaks in patch 3 to match.

V3: Fix several small bugs.
	* We had a BUILD_BUG_ON() which is broken by the recent
problems with BUILD_BUG_ON().  Since it's not a hot path, use a
runtime BUG_ON() instead.

	* The ifdef logic was inverted for the test against
CONFIG_SPU_FS_64K_LS.

	* That in turn masked a compile bug (using a non-existent
constant) in the CONFIG_SPU_FS_64K_LS path.

-- 
David Gibson			| I'll have my music baroque, and my code
david AT gibson.dropbear.id.au	| minimalist, thank you.  NOT _the_ _other_
				| _way_ _around_!
http://www.ozlabs.org/~dgibson

^ permalink raw reply

* [2/5] Cleanup management of kmem_caches for pagetables
From: David Gibson @ 2009-09-28  4:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090928043902.GA6302@yookeroo.seuss>

Currently we have a fair bit of rather fiddly code to manage the
various kmem_caches used to store page tables of various levels.  We
generally have two caches holding some combination of PGD, PUD and PMD
tables, plus several more for the special hugepage pagetables.

This patch cleans this all up by taking a different approach.  Rather
than the caches being designated as for PUDs or for hugeptes for 16M
pages, the caches are simply allocated to be a specific size.  Thus
sharing of caches between different types/levels of pagetables happens
naturally.  The pagetable size, where needed, is passed around encoded
in the same way as {PGD,PUD,PMD}_INDEX_SIZE; that is n where the
pagetable contains 2^n pointers.
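
To make the size-indexed scheme concrete, here is a minimal userspace
sketch. It only mimics the bookkeeping: struct table_cache is a stand-in
for struct kmem_cache, and cache_add(), cache_for(), pack() and the
unpack helpers are names assumed for this example, not kernel API. The
decoding side follows what pte_free_rcu_callback() does in the diff
below, with an index size of 0 reserved for tables that came from the
page allocator.

```c
#include <stddef.h>
#include <stdint.h>

#define MAX_INDEX_SIZE	0xf	/* mirrors MAX_PGTABLE_INDEX_SIZE */

/* Stand-in for struct kmem_cache: records only the object size */
struct table_cache {
	size_t size;
	int in_use;
};

/* Slot n-1 serves every table that holds 2^n pointers */
static struct table_cache caches[MAX_INDEX_SIZE];

static struct table_cache *cache_for(unsigned shift)
{
	return &caches[shift - 1];
}

/*
 * Register a cache for tables of 2^shift pointers.
 * Returns 1 if a new cache was set up, 0 if one of that size already
 * existed, which is how different levels end up sharing a cache.
 */
static int cache_add(unsigned shift)
{
	struct table_cache *c = cache_for(shift);

	if (c->in_use)
		return 0;
	c->size = sizeof(void *) << shift;
	c->in_use = 1;
	return 1;
}

/*
 * Pack a table pointer and its index size into one word for deferred
 * freeing.  The table must be aligned to at least MAX_INDEX_SIZE + 1
 * bytes so that the low bits are free to hold the index size.
 */
static uintptr_t pack(void *table, unsigned shift)
{
	return (uintptr_t)table | shift;
}

static void *unpack_table(uintptr_t v)
{
	return (void *)(v & ~(uintptr_t)MAX_INDEX_SIZE);
}

static unsigned unpack_shift(uintptr_t v)
{
	return (unsigned)(v & MAX_INDEX_SIZE);
}
```

Calling cache_add() twice with the same index size creates only one
cache, so two pagetable levels with equal index sizes share it without
any special casing.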

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/pgalloc-64.h    |   51 ++++++++++++++---------------
 arch/powerpc/include/asm/pgalloc.h       |   30 ++---------------
 arch/powerpc/include/asm/pgtable-ppc64.h |    1 
 arch/powerpc/mm/hugetlbpage.c            |   45 ++++++-------------------
 arch/powerpc/mm/init_64.c                |   54 ++++++++++++++++++-------------
 arch/powerpc/mm/pgtable.c                |   25 +++++++++-----
 6 files changed, 92 insertions(+), 114 deletions(-)

Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-28 12:50:46.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-09-28 14:19:03.000000000 +1000
@@ -119,30 +119,42 @@ static void pmd_ctor(void *addr)
 	memset(addr, 0, PMD_TABLE_SIZE);
 }
 
-static const unsigned int pgtable_cache_size[2] = {
-	PGD_TABLE_SIZE, PMD_TABLE_SIZE
-};
-static const char *pgtable_cache_name[ARRAY_SIZE(pgtable_cache_size)] = {
-#ifdef CONFIG_PPC_64K_PAGES
-	"pgd_cache", "pmd_cache",
-#else
-	"pgd_cache", "pud_pmd_cache",
-#endif /* CONFIG_PPC_64K_PAGES */
-};
-
-#ifdef CONFIG_HUGETLB_PAGE
-/* Hugepages need an extra cache per hugepagesize, initialized in
- * hugetlbpage.c.  We can't put into the tables above, because HPAGE_SHIFT
- * is not compile time constant. */
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)+MMU_PAGE_COUNT];
-#else
-struct kmem_cache *pgtable_cache[ARRAY_SIZE(pgtable_cache_size)];
-#endif
+struct kmem_cache *pgtable_cache[MAX_PGTABLE_INDEX_SIZE];
+
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *))
+{
+	char *name;
+	unsigned long table_size = sizeof(void *) << shift;
+	unsigned long align = table_size;
+	/* When batching pgtable pointers for RCU freeing, we store
+	 * the index size in the low bits.  Table alignment must be
+	 * big enough to fit it */
+	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	struct kmem_cache *new;
+
+	/* It would be nice if this was a BUILD_BUG_ON(), but at the
+	 * moment, gcc doesn't seem to recognize is_power_of_2 as a
+	 * constant expression, so so much for that. */
+	BUG_ON(!is_power_of_2(minalign));
+	BUG_ON((shift < 1) || (shift > MAX_PGTABLE_INDEX_SIZE));
+
+	if (PGT_CACHE(shift))
+		return; /* Already have a cache of this size */
+	align = max_t(unsigned long, align, minalign);
+	name = kasprintf(GFP_KERNEL, "pgtable-2^%d", shift);
+	new = kmem_cache_create(name, table_size, align, 0, ctor);
+	PGT_CACHE(shift) = new;
+	pr_debug("Allocated pgtable cache for order %d\n", shift);
+}
+
 
 void pgtable_cache_init(void)
 {
-	pgtable_cache[0] = kmem_cache_create(pgtable_cache_name[0], PGD_TABLE_SIZE, PGD_TABLE_SIZE, SLAB_PANIC, pgd_ctor);
-	pgtable_cache[1] = kmem_cache_create(pgtable_cache_name[1], PMD_TABLE_SIZE, PMD_TABLE_SIZE, SLAB_PANIC, pmd_ctor);
+	pgtable_cache_add(PGD_INDEX_SIZE, pgd_ctor);
+	pgtable_cache_add(PMD_INDEX_SIZE, pmd_ctor);
+	if (!PGT_CACHE(PGD_INDEX_SIZE) || !PGT_CACHE(PMD_INDEX_SIZE))
+		panic("Couldn't allocate pgtable caches");
+	BUG_ON(PUD_INDEX_SIZE && !PGT_CACHE(PUD_INDEX_SIZE));
 }
 
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
Index: working-2.6/arch/powerpc/include/asm/pgalloc-64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc-64.h	2009-08-03 16:00:45.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc-64.h	2009-09-28 13:53:42.000000000 +1000
@@ -11,27 +11,30 @@
 #include <linux/cpumask.h>
 #include <linux/percpu.h>
 
+/*
+ * This needs to be big enough to allow any pagetable sizes we need,
+ * but small enough to fit in the low bits of any page table pointer.
+ * In other words all pagetables, even tiny ones, must be aligned to
+ * allow at least enough low 0 bits to contain this value.
+ */
+#define MAX_PGTABLE_INDEX_SIZE	0xf
+
 #ifndef CONFIG_PPC_SUBPAGE_PROT
 static inline void subpage_prot_free(pgd_t *pgd) {}
 #endif
 
 extern struct kmem_cache *pgtable_cache[];
-
-#define PGD_CACHE_NUM		0
-#define PUD_CACHE_NUM		1
-#define PMD_CACHE_NUM		1
-#define HUGEPTE_CACHE_NUM	2
-#define PTE_NONCACHE_NUM	7  /* from GFP rather than kmem_cache */
+#define PGT_CACHE(shift) (pgtable_cache[(shift)-1])
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-	return kmem_cache_alloc(pgtable_cache[PGD_CACHE_NUM], GFP_KERNEL);
+	return kmem_cache_alloc(PGT_CACHE(PGD_INDEX_SIZE), GFP_KERNEL);
 }
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
 	subpage_prot_free(pgd);
-	kmem_cache_free(pgtable_cache[PGD_CACHE_NUM], pgd);
+	kmem_cache_free(PGT_CACHE(PGD_INDEX_SIZE), pgd);
 }
 
 #ifndef CONFIG_PPC_64K_PAGES
@@ -40,13 +43,13 @@ static inline void pgd_free(struct mm_st
 
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PUD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PUD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pud_free(struct mm_struct *mm, pud_t *pud)
 {
-	kmem_cache_free(pgtable_cache[PUD_CACHE_NUM], pud);
+	kmem_cache_free(PGT_CACHE(PUD_INDEX_SIZE), pud);
 }
 
 static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
@@ -78,13 +81,13 @@ static inline void pmd_populate_kernel(s
 
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
 {
-	return kmem_cache_alloc(pgtable_cache[PMD_CACHE_NUM],
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
 				GFP_KERNEL|__GFP_REPEAT);
 }
 
 static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
-	kmem_cache_free(pgtable_cache[PMD_CACHE_NUM], pmd);
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
@@ -107,24 +110,22 @@ static inline pgtable_t pte_alloc_one(st
 	return page;
 }
 
-static inline void pgtable_free(pgtable_free_t pgf)
+static inline void pgtable_free(void *table, unsigned index_size)
 {
-	void *p = (void *)(pgf.val & ~PGF_CACHENUM_MASK);
-	int cachenum = pgf.val & PGF_CACHENUM_MASK;
-
-	if (cachenum == PTE_NONCACHE_NUM)
-		free_page((unsigned long)p);
-	else
-		kmem_cache_free(pgtable_cache[cachenum], p);
+	if (!index_size)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(index_size), table);
+	}
 }
 
-#define __pmd_free_tlb(tlb, pmd,addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pmd, \
-		PMD_CACHE_NUM, PMD_TABLE_SIZE-1))
+#define __pmd_free_tlb(tlb, pmd, addr)		      \
+	pgtable_free_tlb(tlb, pmd, PMD_INDEX_SIZE)
 #ifndef CONFIG_PPC_64K_PAGES
 #define __pud_free_tlb(tlb, pud, addr)		      \
-	pgtable_free_tlb(tlb, pgtable_free_cache(pud, \
-		PUD_CACHE_NUM, PUD_TABLE_SIZE-1))
+	pgtable_free_tlb(tlb, pud, PUD_INDEX_SIZE)
+
 #endif /* CONFIG_PPC_64K_PAGES */
 
 #define check_pgt_cache()	do { } while (0)
Index: working-2.6/arch/powerpc/include/asm/pgalloc.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgalloc.h	2009-08-14 16:07:54.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgalloc.h	2009-09-28 13:53:42.000000000 +1000
@@ -24,25 +24,6 @@ static inline void pte_free(struct mm_st
 	__free_page(ptepage);
 }
 
-typedef struct pgtable_free {
-	unsigned long val;
-} pgtable_free_t;
-
-/* This needs to be big enough to allow for MMU_PAGE_COUNT + 2 to be stored
- * and small enough to fit in the low bits of any naturally aligned page
- * table cache entry. Arbitrarily set to 0x1f, that should give us some
- * room to grow
- */
-#define PGF_CACHENUM_MASK	0x1f
-
-static inline pgtable_free_t pgtable_free_cache(void *p, int cachenum,
-						unsigned long mask)
-{
-	BUG_ON(cachenum > PGF_CACHENUM_MASK);
-
-	return (pgtable_free_t){.val = ((unsigned long) p & ~mask) | cachenum};
-}
-
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
@@ -50,12 +31,12 @@ static inline pgtable_free_t pgtable_fre
 #endif
 
 #ifdef CONFIG_SMP
-extern void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf);
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift);
 extern void pte_free_finish(void);
 #else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 static inline void pte_free_finish(void) { }
 #endif /* !CONFIG_SMP */
@@ -63,12 +44,9 @@ static inline void pte_free_finish(void)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
 				  unsigned long address)
 {
-	pgtable_free_t pgf = pgtable_free_cache(page_address(ptepage),
-						PTE_NONCACHE_NUM,
-						PTE_TABLE_SIZE-1);
 	tlb_flush_pgtable(tlb, address);
 	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, pgf);
+	pgtable_free_tlb(tlb, page_address(ptepage), 0);
 }
 
 #endif /* __KERNEL__ */
Index: working-2.6/arch/powerpc/mm/pgtable.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/pgtable.c	2009-08-28 13:46:31.000000000 +1000
+++ working-2.6/arch/powerpc/mm/pgtable.c	2009-09-28 13:53:42.000000000 +1000
@@ -47,12 +47,12 @@ struct pte_freelist_batch
 {
 	struct rcu_head	rcu;
 	unsigned int	index;
-	pgtable_free_t	tables[0];
+	unsigned long	tables[0];
 };
 
 #define PTE_FREELIST_SIZE \
 	((PAGE_SIZE - sizeof(struct pte_freelist_batch)) \
-	  / sizeof(pgtable_free_t))
+	  / sizeof(unsigned long))
 
 static void pte_free_smp_sync(void *arg)
 {
@@ -62,13 +62,13 @@ static void pte_free_smp_sync(void *arg)
 /* This is only called when we are critically out of memory
  * (and fail to get a page in pte_free_tlb).
  */
-static void pgtable_free_now(pgtable_free_t pgf)
+static void pgtable_free_now(void *table, unsigned shift)
 {
 	pte_freelist_forced_free++;
 
 	smp_call_function(pte_free_smp_sync, NULL, 1);
 
-	pgtable_free(pgf);
+	pgtable_free(table, shift);
 }
 
 static void pte_free_rcu_callback(struct rcu_head *head)
@@ -77,8 +77,12 @@ static void pte_free_rcu_callback(struct
 		container_of(head, struct pte_freelist_batch, rcu);
 	unsigned int i;
 
-	for (i = 0; i < batch->index; i++)
-		pgtable_free(batch->tables[i]);
+	for (i = 0; i < batch->index; i++) {
+		void *table = (void *)(batch->tables[i] & ~MAX_PGTABLE_INDEX_SIZE);
+		unsigned shift = batch->tables[i] & MAX_PGTABLE_INDEX_SIZE;
+
+		pgtable_free(table, shift);
+	}
 
 	free_page((unsigned long)batch);
 }
@@ -89,25 +93,28 @@ static void pte_free_submit(struct pte_f
 	call_rcu(&batch->rcu, pte_free_rcu_callback);
 }
 
-void pgtable_free_tlb(struct mmu_gather *tlb, pgtable_free_t pgf)
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
 {
 	/* This is safe since tlb_gather_mmu has disabled preemption */
 	struct pte_freelist_batch **batchp = &__get_cpu_var(pte_freelist_cur);
+	unsigned long pgf;
 
 	if (atomic_read(&tlb->mm->mm_users) < 2 ||
 	    cpumask_equal(mm_cpumask(tlb->mm), cpumask_of(smp_processor_id()))){
-		pgtable_free(pgf);
+		pgtable_free(table, shift);
 		return;
 	}
 
 	if (*batchp == NULL) {
 		*batchp = (struct pte_freelist_batch *)__get_free_page(GFP_ATOMIC);
 		if (*batchp == NULL) {
-			pgtable_free_now(pgf);
+			pgtable_free_now(table, shift);
 			return;
 		}
 		(*batchp)->index = 0;
 	}
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf = (unsigned long)table | shift;
 	(*batchp)->tables[(*batchp)->index++] = pgf;
 	if ((*batchp)->index == PTE_FREELIST_SIZE) {
 		pte_free_submit(*batchp);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-28 13:53:42.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-28 14:17:38.000000000 +1000
@@ -43,26 +43,14 @@ static unsigned nr_gpages;
 unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 #define hugepte_shift			mmu_huge_psizes
-#define PTRS_PER_HUGEPTE(psize)		(1 << hugepte_shift[psize])
-#define HUGEPTE_TABLE_SIZE(psize)	(sizeof(pte_t) << hugepte_shift[psize])
+#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
+#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
 
 #define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-						+ hugepte_shift[psize])
+					 + HUGEPTE_INDEX_SIZE(psize))
 #define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
 #define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
 
-/* Subtract one from array size because we don't need a cache for 4K since
- * is not a huge page size */
-#define HUGE_PGTABLE_INDEX(psize)	(HUGEPTE_CACHE_NUM + psize - 1)
-#define HUGEPTE_CACHE_NAME(psize)	(huge_pgtable_cache_name[psize])
-
-static const char *huge_pgtable_cache_name[MMU_PAGE_COUNT] = {
-	[MMU_PAGE_64K]	= "hugepte_cache_64K",
-	[MMU_PAGE_1M]	= "hugepte_cache_1M",
-	[MMU_PAGE_16M]	= "hugepte_cache_16M",
-	[MMU_PAGE_16G]	= "hugepte_cache_16G",
-};
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -114,15 +102,15 @@ static inline pte_t *hugepte_offset(huge
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
 			   unsigned long address, unsigned int psize)
 {
-	pte_t *new = kmem_cache_zalloc(pgtable_cache[HUGE_PGTABLE_INDEX(psize)],
-				      GFP_KERNEL|__GFP_REPEAT);
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+				       GFP_KERNEL|__GFP_REPEAT);
 
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(pgtable_cache[HUGE_PGTABLE_INDEX(psize)], new);
+		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
 	else
 		hpdp->pd = (unsigned long)new | HUGEPD_OK;
 	spin_unlock(&mm->page_table_lock);
@@ -271,9 +259,7 @@ static void free_hugepte_range(struct mm
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, pgtable_free_cache(hugepte,
-						 HUGEPTE_CACHE_NUM+psize-1,
-						 PGF_CACHENUM_MASK));
+	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
@@ -698,8 +684,6 @@ static void __init set_huge_psize(int ps
 		if (mmu_huge_psizes[psize] ||
 		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
 			return;
-		if (WARN_ON(HUGEPTE_CACHE_NAME(psize) == NULL))
-			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
 		switch (mmu_psize_defs[psize].shift) {
@@ -769,16 +753,11 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache[HUGE_PGTABLE_INDEX(psize)] =
-				kmem_cache_create(
-					HUGEPTE_CACHE_NAME(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					HUGEPTE_TABLE_SIZE(psize),
-					0,
-					NULL);
-			if (!pgtable_cache[HUGE_PGTABLE_INDEX(psize)])
-				panic("hugetlbpage_init(): could not create %s"\
-				      "\n", HUGEPTE_CACHE_NAME(psize));
+			pgtable_cache_add(hugepte_shift[psize], NULL);
+			if (!PGT_CACHE(hugepte_shift[psize]))
+				panic("hugetlbpage_init(): could not create "
+				      "pgtable cache for %d bit pagesize\n",
+				      mmu_psize_to_shift(psize));
 		}
 	}
 
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-08-28 13:46:31.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-28 14:17:38.000000000 +1000
@@ -354,6 +354,7 @@ static inline void __ptep_set_access_fla
 #define pgoff_to_pte(off)	((pte_t) {((off) << PTE_RPN_SHIFT)|_PAGE_FILE})
 #define PTE_FILE_MAX_BITS	(BITS_PER_LONG - PTE_RPN_SHIFT)
 
+void pgtable_cache_add(unsigned shift, void (*ctor)(void *));
 void pgtable_cache_init(void);
 
 /*


* [1/5] Make hpte_need_flush() correctly mask for multiple page sizes
From: David Gibson @ 2009-09-28  4:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090928043902.GA6302@yookeroo.seuss>

Currently, hpte_need_flush() only correctly flushes the given address
for normal pages.  Callers for hugepages are required to mask the
address themselves.

But hpte_need_flush() already looks up the page sizes for its own
reasons, so this is a rather silly imposition on the callers.  This
patch alters it to mask based on the pagesize it has looked up itself,
and removes the awkward masking code in the hugepage caller.
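
As a rough illustration of the masking described above, here is a
minimal userspace sketch (not the kernel code).  The helper name
mask_to_page() is made up for this example; the shift plays the role
of mmu_psize_defs[psize].shift.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: round addr down to the start of the page that
 * contains it, where the page size is (1 << shift) bytes. */
static inline uint64_t mask_to_page(uint64_t addr, unsigned int shift)
{
	assert(shift < 64);	/* a shift of 64 or more would be undefined */
	return addr & ~((UINT64_C(1) << shift) - 1);
}
```

With a 4K page (shift 12) only the low 12 bits are cleared, while a
16M hugepage (shift 24) clears the low 24 bits, which is why a fixed
PAGE_MASK is not enough for hugepage addresses.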

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/mm/hugetlbpage.c |    6 +-----
 arch/powerpc/mm/tlb_hash64.c  |    8 +++-----
 2 files changed, 4 insertions(+), 10 deletions(-)

Index: working-2.6/arch/powerpc/mm/tlb_hash64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/tlb_hash64.c	2009-09-04 14:36:12.000000000 +1000
@@ -53,11 +53,6 @@ void hpte_need_flush(struct mm_struct *m
 
 	i = batch->index;
 
-	/* We mask the address for the base page size. Huge pages will
-	 * have applied their own masking already
-	 */
-	addr &= PAGE_MASK;
-
 	/* Get page size (maybe move back to caller).
 	 *
 	 * NOTE: when using special 64K mappings in 4K environment like
@@ -75,6 +70,9 @@ void hpte_need_flush(struct mm_struct *m
 	} else
 		psize = pte_pagesize_index(mm, addr, pte);
 
+	/* Mask the address for the correct page size */
+	addr &= ~((1UL << mmu_psize_defs[psize].shift) - 1);
+
 	/* Build full vaddr */
 	if (!is_kernel_addr(addr)) {
 		ssize = user_segment_size(addr);
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:35:30.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-04 14:36:12.000000000 +1000
@@ -445,11 +445,7 @@ void set_huge_pte_at(struct mm_struct *m
 		 * necessary anymore if we make hpte_need_flush() get the
 		 * page size from the slices
 		 */
-		unsigned int psize = get_slice_psize(mm, addr);
-		unsigned int shift = mmu_psize_to_shift(psize);
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-		pte_update(mm, addr & hstate->mask, ptep, ~0UL, 1);
+		pte_update(mm, addr, ptep, ~0UL, 1);
 	}
 	*ptep = __pte(pte_val(pte) & ~_PAGE_HPTEFLAGS);
 }


* [3/5] Allow more flexible layouts for hugepage pagetables
From: David Gibson @ 2009-09-28  4:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090928043902.GA6302@yookeroo.seuss>

Currently each available hugepage size uses a slightly different
pagetable layout: that is, the bottom level table of pointers to
hugepages is a different size, and may branch off from the normal page
tables at a different level.  Every hugepage aware path that needs to
walk the pagetables must therefore look up the hugepage size from the
slice info first, and work out the correct way to walk the pagetables
accordingly.  Future hardware is likely to add more possible hugepage
sizes, more layout options and more mess.

This patch therefore reworks the handling of hugepage pagetables to
reduce this complexity.  In the new scheme, instead of having to
consult the slice mask, pagetable walking code can check a flag in the
PGD/PUD/PMD entries to see where to branch off to hugepage pagetables,
and the entry also contains the information (essentially hugepage
shift) necessary to then interpret that table without recourse to the
slice mask.  This scheme can be extended neatly to handle multiple
levels of self-describing "special" hugepage pagetables, although for
now we assume only one level exists.

This approach means that only the pagetable allocation path needs to
know how the pagetables should be set out.  All other (hugepage)
pagetable walking paths can just interpret the structure as they go.
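
To make "interpret the structure as they go" a little more concrete,
the sketch below shows the index arithmetic a walker might use once it
has found a hugepage directory entry.  It is a userspace illustration
only: hugepte_index() is a made-up name, pdshift stands for the address
shift of the directory level where the entry was found, and hpshift for
the hugepage shift recorded in that entry.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: index of the hugepage PTE for addr within a
 * hugepte table hanging off a directory entry that covers
 * (1 << pdshift) bytes, with hugepages of (1 << hpshift) bytes. */
static inline uint64_t hugepte_index(uint64_t addr, unsigned int pdshift,
				     unsigned int hpshift)
{
	assert(hpshift <= pdshift && pdshift < 64);
	/* Keep the offset within this directory entry's range, then
	 * count whole hugepages below it. */
	return (addr & ((UINT64_C(1) << pdshift) - 1)) >> hpshift;
}
```

The table then has 1 << (pdshift - hpshift) entries, so the walker
needs nothing beyond the level it is at and the shift stored in the
entry.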

There already was a flag bit in PGD/PUD/PMD entries for hugepage
directory pointers, but it was only used for debug.  We alter that
flag bit to instead be a 0 in the MSB to indicate a hugepage pagetable
pointer (normally it would be 1 since the pointer lies in the linear
mapping).  This means that asm pagetable walking can test for (and
punt on) hugepage pointers with the same test that checks for
unpopulated page directory entries (beq becomes bge), since hugepage
pointers will always be positive, and normal pointers always negative.
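
The encoding can be sketched in userspace C as below.  This is an
illustration under stated assumptions, not the patch itself: the
helpers hugepd_encode() and hugepd_table() and the constants TOP_BIT
and LINEAR_BASE are names invented for the example, and the table
address is assumed to lie in a linear mapping starting at
0xc000000000000000 and to be aligned so that its low six bits are zero.

```c
#include <assert.h>
#include <stdint.h>

#define HUGEPD_SHIFT_MASK 0x3f	/* low six bits carry the hugepage shift */
#define TOP_BIT     UINT64_C(0x8000000000000000)	/* cleared to mark a hugepd */
#define LINEAR_BASE UINT64_C(0xc000000000000000)	/* assumed linear map base */

typedef struct { int64_t pd; } hugepd_t;

/* Hypothetical encoder: clear the MSB and store the shift in the low bits. */
static inline hugepd_t hugepd_encode(uint64_t table, unsigned int pshift)
{
	hugepd_t hpd;

	assert(pshift <= HUGEPD_SHIFT_MASK);
	assert((table & HUGEPD_SHIFT_MASK) == 0);
	hpd.pd = (int64_t)((table & ~TOP_BIT) | pshift);
	return hpd;
}

/* A hugepd is populated and has MSB 0, so it reads as a positive value. */
static inline int hugepd_ok(hugepd_t hpd)
{
	return hpd.pd > 0;
}

static inline unsigned int hugepd_shift(hugepd_t hpd)
{
	return (unsigned int)(hpd.pd & HUGEPD_SHIFT_MASK);
}

/* Hypothetical decoder: strip the shift and restore the linear map bits. */
static inline uint64_t hugepd_table(hugepd_t hpd)
{
	return ((uint64_t)hpd.pd & ~(uint64_t)HUGEPD_SHIFT_MASK) | LINEAR_BASE;
}
```

An empty entry is zero and a normal directory pointer is negative when
read as a signed value, so a single signed comparison separates all
three cases.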

While we're at it, we get rid of the confusing (and grep defeating)
#defining of hugepte_shift to be the same thing as mmu_huge_psizes.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h       |   12 
 arch/powerpc/include/asm/mmu-hash64.h    |   14 
 arch/powerpc/include/asm/pgtable-ppc64.h |   13 
 arch/powerpc/kernel/perf_callchain.c     |   20 -
 arch/powerpc/mm/gup.c                    |  149 +--------
 arch/powerpc/mm/hash_utils_64.c          |   17 -
 arch/powerpc/mm/hugetlbpage.c            |  473 ++++++++++++++-----------------
 arch/powerpc/mm/init_64.c                |   10 
 8 files changed, 302 insertions(+), 406 deletions(-)

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-28 14:19:08.000000000 +1000
@@ -40,25 +40,11 @@ static unsigned nr_gpages;
 /* Array of valid huge page sizes - non-zero value(hugepte_shift) is
  * stored for the huge page sizes that are valid.
  */
-unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
-#define hugepte_shift			mmu_huge_psizes
-#define HUGEPTE_INDEX_SIZE(psize)	(mmu_huge_psizes[(psize)])
-#define PTRS_PER_HUGEPTE(psize)		(1 << mmu_huge_psizes[psize])
-
-#define HUGEPD_SHIFT(psize)		(mmu_psize_to_shift(psize) \
-					 + HUGEPTE_INDEX_SIZE(psize))
-#define HUGEPD_SIZE(psize)		(1UL << HUGEPD_SHIFT(psize))
-#define HUGEPD_MASK(psize)		(~(HUGEPD_SIZE(psize)-1))
+static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
 
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
-#define HUGEPD_OK	0x1
-
-typedef struct { unsigned long pd; } hugepd_t;
-
-#define hugepd_none(hpd)	((hpd).pd == 0)
 
 static inline int shift_to_mmu_psize(unsigned int shift)
 {
@@ -82,71 +68,126 @@ static inline unsigned int mmu_psize_to_
 	BUG();
 }
 
+#define hugepd_none(hpd)	((hpd).pd == 0)
+
 static inline pte_t *hugepd_page(hugepd_t hpd)
 {
-	BUG_ON(!(hpd.pd & HUGEPD_OK));
-	return (pte_t *)(hpd.pd & ~HUGEPD_OK);
+	BUG_ON(!hugepd_ok(hpd));
+	return (pte_t *)((hpd.pd & ~HUGEPD_SHIFT_MASK) | 0xc000000000000000);
 }
 
-static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr,
-				    struct hstate *hstate)
+static inline unsigned int hugepd_shift(hugepd_t hpd)
 {
-	unsigned int shift = huge_page_shift(hstate);
-	int psize = shift_to_mmu_psize(shift);
-	unsigned long idx = ((addr >> shift) & (PTRS_PER_HUGEPTE(psize)-1));
+	return hpd.pd & HUGEPD_SHIFT_MASK;
+}
+
+static inline pte_t *hugepte_offset(hugepd_t *hpdp, unsigned long addr, unsigned pdshift)
+{
+	unsigned long idx = (addr & ((1UL << pdshift) - 1)) >> hugepd_shift(*hpdp);
 	pte_t *dir = hugepd_page(*hpdp);
 
 	return dir + idx;
 }
 
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea, unsigned *shift)
+{
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pdshift = PGDIR_SHIFT;
+
+	if (shift)
+		*shift = 0;
+
+	pg = pgdir + pgd_index(ea);
+	if (is_hugepd(pg)) {
+		hpdp = (hugepd_t *)pg;
+	} else if (!pgd_none(*pg)) {
+		pdshift = PUD_SHIFT;
+		pu = pud_offset(pg, ea);
+		if (is_hugepd(pu))
+			hpdp = (hugepd_t *)pu;
+		else if (!pud_none(*pu)) {
+			pdshift = PMD_SHIFT;
+			pm = pmd_offset(pu, ea);
+			if (is_hugepd(pm))
+				hpdp = (hugepd_t *)pm;
+			else if (!pmd_none(*pm)) {
+				return pte_offset_map(pm, ea);
+			}
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	if (shift)
+		*shift = hugepd_shift(*hpdp);
+	return hugepte_offset(hpdp, ea, pdshift);
+}
+
+pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
+{
+	return find_linux_pte_or_hugepte(mm->pgd, addr, NULL);
+}
+
 static int __hugepte_alloc(struct mm_struct *mm, hugepd_t *hpdp,
-			   unsigned long address, unsigned int psize)
+			   unsigned long address, unsigned pdshift, unsigned pshift)
 {
-	pte_t *new = kmem_cache_zalloc(PGT_CACHE(hugepte_shift[psize]),
+	pte_t *new = kmem_cache_zalloc(PGT_CACHE(pdshift - pshift),
 				       GFP_KERNEL|__GFP_REPEAT);
 
+	BUG_ON(pshift > HUGEPD_SHIFT_MASK);
+	BUG_ON((unsigned long)new & HUGEPD_SHIFT_MASK);
+
 	if (! new)
 		return -ENOMEM;
 
 	spin_lock(&mm->page_table_lock);
 	if (!hugepd_none(*hpdp))
-		kmem_cache_free(PGT_CACHE(hugepte_shift[psize]), new);
+		kmem_cache_free(PGT_CACHE(pdshift - pshift), new);
 	else
-		hpdp->pd = (unsigned long)new | HUGEPD_OK;
+		hpdp->pd = ((unsigned long)new & ~0x8000000000000000) | pshift;
 	spin_unlock(&mm->page_table_lock);
 	return 0;
 }
 
-
-static pud_t *hpud_offset(pgd_t *pgd, unsigned long addr, struct hstate *hstate)
+pte_t *huge_pte_alloc(struct mm_struct *mm, unsigned long addr, unsigned long sz)
 {
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_offset(pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pud_t *hpud_alloc(struct mm_struct *mm, pgd_t *pgd, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PUD_SHIFT)
-		return pud_alloc(mm, pgd, addr);
-	else
-		return (pud_t *) pgd;
-}
-static pmd_t *hpmd_offset(pud_t *pud, unsigned long addr, struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_offset(pud, addr);
-	else
-		return (pmd_t *) pud;
-}
-static pmd_t *hpmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long addr,
-			 struct hstate *hstate)
-{
-	if (huge_page_shift(hstate) < PMD_SHIFT)
-		return pmd_alloc(mm, pud, addr);
-	else
-		return (pmd_t *) pud;
+	pgd_t *pg;
+	pud_t *pu;
+	pmd_t *pm;
+	hugepd_t *hpdp = NULL;
+	unsigned pshift = __ffs(sz);
+	unsigned pdshift = PGDIR_SHIFT;
+
+	addr &= ~(sz-1);
+
+	pg = pgd_offset(mm, addr);
+	if (pshift >= PUD_SHIFT) {
+		hpdp = (hugepd_t *)pg;
+	} else {
+		pdshift = PUD_SHIFT;
+		pu = pud_alloc(mm, pg, addr);
+		if (pshift >= PMD_SHIFT) {
+			hpdp = (hugepd_t *)pu;
+		} else {
+			pdshift = PMD_SHIFT;
+			pm = pmd_alloc(mm, pu, addr);
+			hpdp = (hugepd_t *)pm;
+		}
+	}
+
+	if (!hpdp)
+		return NULL;
+
+	BUG_ON(!hugepd_none(*hpdp) && !hugepd_ok(*hpdp));
+
+	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, pdshift, pshift))
+		return NULL;
+
+	return hugepte_offset(hpdp, addr, pdshift);
 }
 
 /* Build list of addresses of gigantic pages.  This function is used in early
@@ -180,92 +221,38 @@ int alloc_bootmem_huge_page(struct hstat
 	return 1;
 }
 
-
-/* Modelled after find_linux_pte() */
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long addr)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-
-	unsigned int psize;
-	unsigned int shift;
-	unsigned long sz;
-	struct hstate *hstate;
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_to_shift(psize);
-	sz = ((1UL) << shift);
-	hstate = size_to_hstate(sz);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	if (!pgd_none(*pg)) {
-		pu = hpud_offset(pg, addr, hstate);
-		if (!pud_none(*pu)) {
-			pm = hpmd_offset(pu, addr, hstate);
-			if (!pmd_none(*pm))
-				return hugepte_offset((hugepd_t *)pm, addr,
-						      hstate);
-		}
-	}
-
-	return NULL;
-}
-
-pte_t *huge_pte_alloc(struct mm_struct *mm,
-			unsigned long addr, unsigned long sz)
-{
-	pgd_t *pg;
-	pud_t *pu;
-	pmd_t *pm;
-	hugepd_t *hpdp = NULL;
-	struct hstate *hstate;
-	unsigned int psize;
-	hstate = size_to_hstate(sz);
-
-	psize = get_slice_psize(mm, addr);
-	BUG_ON(!mmu_huge_psizes[psize]);
-
-	addr &= hstate->mask;
-
-	pg = pgd_offset(mm, addr);
-	pu = hpud_alloc(mm, pg, addr, hstate);
-
-	if (pu) {
-		pm = hpmd_alloc(mm, pu, addr, hstate);
-		if (pm)
-			hpdp = (hugepd_t *)pm;
-	}
-
-	if (! hpdp)
-		return NULL;
-
-	if (hugepd_none(*hpdp) && __hugepte_alloc(mm, hpdp, addr, psize))
-		return NULL;
-
-	return hugepte_offset(hpdp, addr, hstate);
-}
-
 int huge_pmd_unshare(struct mm_struct *mm, unsigned long *addr, pte_t *ptep)
 {
 	return 0;
 }
 
-static void free_hugepte_range(struct mmu_gather *tlb, hugepd_t *hpdp,
-			       unsigned int psize)
+static void free_hugepd_range(struct mmu_gather *tlb, hugepd_t *hpdp, int pdshift,
+			      unsigned long start, unsigned long end,
+			      unsigned long floor, unsigned long ceiling)
 {
 	pte_t *hugepte = hugepd_page(*hpdp);
+	unsigned shift = hugepd_shift(*hpdp);
+	unsigned long pdmask = ~((1UL << pdshift) - 1);
+
+	start &= pdmask;
+	if (start < floor)
+		return;
+	if (ceiling) {
+		ceiling &= pdmask;
+		if (! ceiling)
+			return;
+	}
+	if (end - 1 > ceiling - 1)
+		return;
 
 	hpdp->pd = 0;
 	tlb->need_flush = 1;
-	pgtable_free_tlb(tlb, hugepte, hugepte_shift[psize]);
+	pgtable_free_tlb(tlb, hugepte, pdshift - shift);
 }
 
 static void hugetlb_free_pmd_range(struct mmu_gather *tlb, pud_t *pud,
 				   unsigned long addr, unsigned long end,
-				   unsigned long floor, unsigned long ceiling,
-				   unsigned int psize)
+				   unsigned long floor, unsigned long ceiling)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -277,7 +264,8 @@ static void hugetlb_free_pmd_range(struc
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(*pmd))
 			continue;
-		free_hugepte_range(tlb, (hugepd_t *)pmd, psize);
+		free_hugepd_range(tlb, (hugepd_t *)pmd, PMD_SHIFT,
+				  addr, next, floor, ceiling);
 	} while (pmd++, addr = next, addr != end);
 
 	start &= PUD_MASK;
@@ -303,23 +291,19 @@ static void hugetlb_free_pud_range(struc
 	pud_t *pud;
 	unsigned long next;
 	unsigned long start;
-	unsigned int shift;
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
-	shift = mmu_psize_to_shift(psize);
 
 	start = addr;
 	pud = pud_offset(pgd, addr);
 	do {
 		next = pud_addr_end(addr, end);
-		if (shift < PMD_SHIFT) {
+		if (!is_hugepd(pud)) {
 			if (pud_none_or_clear_bad(pud))
 				continue;
 			hugetlb_free_pmd_range(tlb, pud, addr, next, floor,
-					       ceiling, psize);
+					       ceiling);
 		} else {
-			if (pud_none(*pud))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pud, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pud, PUD_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pud++, addr = next, addr != end);
 
@@ -350,74 +334,34 @@ void hugetlb_free_pgd_range(struct mmu_g
 {
 	pgd_t *pgd;
 	unsigned long next;
-	unsigned long start;
 
 	/*
-	 * Comments below take from the normal free_pgd_range().  They
-	 * apply here too.  The tests against HUGEPD_MASK below are
-	 * essential, because we *don't* test for this at the bottom
-	 * level.  Without them we'll attempt to free a hugepte table
-	 * when we unmap just part of it, even if there are other
-	 * active mappings using it.
+	 * Because there are a number of different possible pagetable
+	 * layouts for hugepage ranges, we limit knowledge of how
+	 * things should be laid out to the allocation path
+	 * (huge_pte_alloc(), above).  Everything else works out the
+	 * structure as it goes from information in the hugepd
+	 * pointers.  That means that we can't here use the
+	 * optimization used in the normal page free_pgd_range(), of
+	 * checking whether we're actually covering a large enough
+	 * range to have to do anything at the top level of the walk
+	 * instead of at the bottom.
 	 *
-	 * The next few lines have given us lots of grief...
-	 *
-	 * Why are we testing HUGEPD* at this top level?  Because
-	 * often there will be no work to do at all, and we'd prefer
-	 * not to go all the way down to the bottom just to discover
-	 * that.
-	 *
-	 * Why all these "- 1"s?  Because 0 represents both the bottom
-	 * of the address space and the top of it (using -1 for the
-	 * top wouldn't help much: the masks would do the wrong thing).
-	 * The rule is that addr 0 and floor 0 refer to the bottom of
-	 * the address space, but end 0 and ceiling 0 refer to the top
-	 * Comparisons need to use "end - 1" and "ceiling - 1" (though
-	 * that end 0 case should be mythical).
-	 *
-	 * Wherever addr is brought up or ceiling brought down, we
-	 * must be careful to reject "the opposite 0" before it
-	 * confuses the subsequent tests.  But what about where end is
-	 * brought down by HUGEPD_SIZE below? no, end can't go down to
-	 * 0 there.
-	 *
-	 * Whereas we round start (addr) and ceiling down, by different
-	 * masks at different levels, in order to test whether a table
-	 * now has no other vmas using it, so can be freed, we don't
-	 * bother to round floor or end up - the tests don't need that.
+	 * To make sense of this, you should probably go read the big
+	 * block comment at the top of the normal free_pgd_range(),
+	 * too.
 	 */
-	unsigned int psize = get_slice_psize(tlb->mm, addr);
 
-	addr &= HUGEPD_MASK(psize);
-	if (addr < floor) {
-		addr += HUGEPD_SIZE(psize);
-		if (!addr)
-			return;
-	}
-	if (ceiling) {
-		ceiling &= HUGEPD_MASK(psize);
-		if (!ceiling)
-			return;
-	}
-	if (end - 1 > ceiling - 1)
-		end -= HUGEPD_SIZE(psize);
-	if (addr > end - 1)
-		return;
-
-	start = addr;
 	pgd = pgd_offset(tlb->mm, addr);
 	do {
-		psize = get_slice_psize(tlb->mm, addr);
-		BUG_ON(!mmu_huge_psizes[psize]);
 		next = pgd_addr_end(addr, end);
-		if (mmu_psize_to_shift(psize) < PUD_SHIFT) {
+		if (!is_hugepd(pgd)) {
 			if (pgd_none_or_clear_bad(pgd))
 				continue;
 			hugetlb_free_pud_range(tlb, pgd, addr, next, floor, ceiling);
 		} else {
-			if (pgd_none(*pgd))
-				continue;
-			free_hugepte_range(tlb, (hugepd_t *)pgd, psize);
+			free_hugepd_range(tlb, (hugepd_t *)pgd, PGDIR_SHIFT,
+					  addr, next, floor, ceiling);
 		}
 	} while (pgd++, addr = next, addr != end);
 }
@@ -448,19 +392,19 @@ follow_huge_addr(struct mm_struct *mm, u
 {
 	pte_t *ptep;
 	struct page *page;
-	unsigned int mmu_psize = get_slice_psize(mm, address);
+	unsigned shift;
+	unsigned long mask;
+
+	ptep = find_linux_pte_or_hugepte(mm->pgd, address, &shift);
 
 	/* Verify it is a huge page else bail. */
-	if (!mmu_huge_psizes[mmu_psize])
+	if (!ptep || !shift)
 		return ERR_PTR(-EINVAL);
 
-	ptep = huge_pte_offset(mm, address);
+	mask = (1UL << shift) - 1;
 	page = pte_page(*ptep);
-	if (page) {
-		unsigned int shift = mmu_psize_to_shift(mmu_psize);
-		unsigned long sz = ((1UL) << shift);
-		page += (address % sz) / PAGE_SIZE;
-	}
+	if (page)
+		page += (address & mask) / PAGE_SIZE;
 
 	return page;
 }
@@ -483,6 +427,73 @@ follow_huge_pmd(struct mm_struct *mm, un
 	return NULL;
 }
 
+static noinline int gup_hugepte(pte_t *ptep, unsigned long sz, unsigned long addr,
+		       unsigned long end, int write, struct page **pages, int *nr)
+{
+	unsigned long mask;
+	unsigned long pte_end;
+	struct page *head, *page;
+	pte_t pte;
+	int refs;
+
+	pte_end = (addr + sz) & ~(sz-1);
+	if (pte_end < end)
+		end = pte_end;
+
+	pte = *ptep;
+	mask = _PAGE_PRESENT | _PAGE_USER;
+	if (write)
+		mask |= _PAGE_RW;
+
+	if ((pte_val(pte) & mask) != mask)
+		return 0;
+
+	/* hugepages are never "special" */
+	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
+
+	refs = 0;
+	head = pte_page(pte);
+
+	page = head + ((addr & (sz-1)) >> PAGE_SHIFT);
+	do {
+		VM_BUG_ON(compound_head(page) != head);
+		pages[*nr] = page;
+		(*nr)++;
+		page++;
+		refs++;
+	} while (addr += PAGE_SIZE, addr != end);
+
+	if (!page_cache_add_speculative(head, refs)) {
+		*nr -= refs;
+		return 0;
+	}
+
+	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
+		/* Could be optimized better */
+		while (*nr) {
+			put_page(page);
+			(*nr)--;
+		}
+	}
+
+	return 1;
+}
+
+int gup_hugepd(hugepd_t *hugepd, unsigned pdshift,
+	       unsigned long addr, unsigned long end,
+	       int write, struct page **pages, int *nr)
+{
+	pte_t *ptep;
+	unsigned long sz = 1UL << hugepd_shift(*hugepd);
+
+	ptep = hugepte_offset(hugepd, addr, pdshift);
+	do {
+		if (!gup_hugepte(ptep, sz, addr, end, write, pages, nr))
+			return 0;
+	} while (ptep++, addr += sz, addr != end);
+
+	return 1;
+}
 
 unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
 					unsigned long len, unsigned long pgoff,
@@ -530,34 +541,20 @@ static unsigned int hash_huge_page_do_la
 	return rflags;
 }
 
-int hash_huge_page(struct mm_struct *mm, unsigned long access,
-		   unsigned long ea, unsigned long vsid, int local,
-		   unsigned long trap)
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize)
 {
-	pte_t *ptep;
 	unsigned long old_pte, new_pte;
 	unsigned long va, rflags, pa, sz;
 	long slot;
 	int err = 1;
-	int ssize = user_segment_size(ea);
-	unsigned int mmu_psize;
-	int shift;
-	mmu_psize = get_slice_psize(mm, ea);
 
-	if (!mmu_huge_psizes[mmu_psize])
-		goto out;
-	ptep = huge_pte_offset(mm, ea);
+	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
 
 	/* Search the Linux page table for a match with va */
 	va = hpt_va(ea, vsid, ssize);
 
-	/*
-	 * If no pte found or not present, send the problem up to
-	 * do_page_fault
-	 */
-	if (unlikely(!ptep || pte_none(*ptep)))
-		goto out;
-
 	/* 
 	 * Check the user's access rights to the page.  If access should be
 	 * prevented then send the problem up to do_page_fault.
@@ -588,7 +585,6 @@ int hash_huge_page(struct mm_struct *mm,
 	rflags = 0x2 | (!(new_pte & _PAGE_RW));
  	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
 	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	shift = mmu_psize_to_shift(mmu_psize);
 	sz = ((1UL) << shift);
 	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
 		/* No CPU has hugepages but lacks no execute, so we
@@ -672,6 +668,8 @@ repeat:
 
 static void __init set_huge_psize(int psize)
 {
+	unsigned pdshift;
+
 	/* Check that it is a page size supported by the hardware and
 	 * that it fits within pagetable limits. */
 	if (mmu_psize_defs[psize].shift &&
@@ -686,29 +684,14 @@ static void __init set_huge_psize(int ps
 			return;
 		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
 
-		switch (mmu_psize_defs[psize].shift) {
-		case PAGE_SHIFT_64K:
-		    /* We only allow 64k hpages with 4k base page,
-		     * which was checked above, and always put them
-		     * at the PMD */
-		    hugepte_shift[psize] = PMD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16M:
-		    /* 16M pages can be at two different levels
-		     * of pagestables based on base page size */
-		    if (PAGE_SHIFT == PAGE_SHIFT_64K)
-			    hugepte_shift[psize] = PMD_SHIFT;
-		    else /* 4k base page */
-			    hugepte_shift[psize] = PUD_SHIFT;
-		    break;
-		case PAGE_SHIFT_16G:
-		    /* 16G pages are always at PGD level */
-		    hugepte_shift[psize] = PGDIR_SHIFT;
-		    break;
-		}
-		hugepte_shift[psize] -= mmu_psize_defs[psize].shift;
-	} else
-		hugepte_shift[psize] = 0;
+		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
+	}
 }
 
 static int __init hugepage_setup_sz(char *str)
@@ -732,7 +715,7 @@ __setup("hugepagesz=", hugepage_setup_sz
 
 static int __init hugetlbpage_init(void)
 {
-	unsigned int psize;
+	int psize;
 
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
@@ -753,8 +736,8 @@ static int __init hugetlbpage_init(void)
 
 	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
 		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(hugepte_shift[psize], NULL);
-			if (!PGT_CACHE(hugepte_shift[psize]))
+			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
+			if (!PGT_CACHE(mmu_huge_psizes[psize]))
 				panic("hugetlbpage_init(): could not create "
 				      "pgtable cache for %d bit pagesize\n",
 				      mmu_psize_to_shift(psize));
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-09-28 14:19:08.000000000 +1000
@@ -3,6 +3,15 @@
 
 #include <asm/page.h>
 
+typedef struct { signed long pd; } hugepd_t;
+
+static inline int hugepd_ok(hugepd_t hpd)
+{
+	return (hpd.pd > 0);
+}
+
+#define is_hugepd(pdep)               (hugepd_ok(*((hugepd_t *)(pdep))))
+#define HUGEPD_SHIFT_MASK     0x3f
 
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
@@ -17,6 +26,9 @@ void set_huge_pte_at(struct mm_struct *m
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm, unsigned long addr,
 			      pte_t *ptep);
 
+int gup_hugepd(hugepd_t *hugepd, unsigned pdshift, unsigned long addr,
+	       unsigned long end, int write, struct page **pages, int *nr);
+
 /*
  * The version of vma_mmu_pagesize() in arch/powerpc/mm/hugetlbpage.c needs
  * to override the version in mm/hugetlb.c
Index: working-2.6/arch/powerpc/mm/init_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/init_64.c	2009-09-28 14:19:03.000000000 +1000
+++ working-2.6/arch/powerpc/mm/init_64.c	2009-09-28 14:19:08.000000000 +1000
@@ -41,6 +41,7 @@
 #include <linux/module.h>
 #include <linux/poison.h>
 #include <linux/lmb.h>
+#include <linux/hugetlb.h>
 
 #include <asm/pgalloc.h>
 #include <asm/page.h>
@@ -128,8 +129,13 @@ void pgtable_cache_add(unsigned shift, v
 	unsigned long align = table_size;
 	/* When batching pgtable pointers for RCU freeing, we store
 	 * the index size in the low bits.  Table alignment must be
-	 * big enough to fit it */
-	unsigned long minalign = MAX_PGTABLE_INDEX_SIZE + 1;
+	 * big enough to fit it.
+	 *
+	 * Likewise, hugepage pagetable pointers contain a (different)
+	 * shift value in the low bits.  All tables must be aligned so
+	 * as to leave enough 0 bits in the address to contain it. */
+	unsigned long minalign = max(MAX_PGTABLE_INDEX_SIZE + 1,
+				     HUGEPD_SHIFT_MASK + 1);
 	struct kmem_cache *new;
 
 	/* It would be nice if this was a BUILD_BUG_ON(), but at the
Index: working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/pgtable-ppc64.h	2009-09-28 14:19:08.000000000 +1000
@@ -379,7 +379,18 @@ void pgtable_cache_init(void);
 	return pt;
 }
 
-pte_t *huge_pte_offset(struct mm_struct *mm, unsigned long address);
+#ifdef CONFIG_HUGETLB_PAGE
+pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+				 unsigned *shift);
+#else
+static inline pte_t *find_linux_pte_or_hugepte(pgd_t *pgdir, unsigned long ea,
+					       unsigned *shift)
+{
+	if (shift)
+		*shift = 0;
+	return find_linux_pte(pgdir, ea);
+}
+#endif /* !CONFIG_HUGETLB_PAGE */
 
 #endif /* __ASSEMBLY__ */
 
Index: working-2.6/arch/powerpc/mm/gup.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/gup.c	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/mm/gup.c	2009-09-28 14:19:08.000000000 +1000
@@ -55,57 +55,6 @@ static noinline int gup_pte_range(pmd_t 
 	return 1;
 }
 
-#ifdef CONFIG_HUGETLB_PAGE
-static noinline int gup_huge_pte(pte_t *ptep, struct hstate *hstate,
-				 unsigned long *addr, unsigned long end,
-				 int write, struct page **pages, int *nr)
-{
-	unsigned long mask;
-	unsigned long pte_end;
-	struct page *head, *page;
-	pte_t pte;
-	int refs;
-
-	pte_end = (*addr + huge_page_size(hstate)) & huge_page_mask(hstate);
-	if (pte_end < end)
-		end = pte_end;
-
-	pte = *ptep;
-	mask = _PAGE_PRESENT|_PAGE_USER;
-	if (write)
-		mask |= _PAGE_RW;
-	if ((pte_val(pte) & mask) != mask)
-		return 0;
-	/* hugepages are never "special" */
-	VM_BUG_ON(!pfn_valid(pte_pfn(pte)));
-
-	refs = 0;
-	head = pte_page(pte);
-	page = head + ((*addr & ~huge_page_mask(hstate)) >> PAGE_SHIFT);
-	do {
-		VM_BUG_ON(compound_head(page) != head);
-		pages[*nr] = page;
-		(*nr)++;
-		page++;
-		refs++;
-	} while (*addr += PAGE_SIZE, *addr != end);
-
-	if (!page_cache_add_speculative(head, refs)) {
-		*nr -= refs;
-		return 0;
-	}
-	if (unlikely(pte_val(pte) != pte_val(*ptep))) {
-		/* Could be optimized better */
-		while (*nr) {
-			put_page(page);
-			(*nr)--;
-		}
-	}
-
-	return 1;
-}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 static int gup_pmd_range(pud_t pud, unsigned long addr, unsigned long end,
 		int write, struct page **pages, int *nr)
 {
@@ -119,7 +68,11 @@ static int gup_pmd_range(pud_t pud, unsi
 		next = pmd_addr_end(addr, end);
 		if (pmd_none(pmd))
 			return 0;
-		if (!gup_pte_range(pmd, addr, next, write, pages, nr))
+		if (is_hugepd(pmdp)) {
+			if (!gup_hugepd((hugepd_t *)pmdp, PMD_SHIFT,
+					addr, next, write, pages, nr))
+				return 0;
+		} else if (!gup_pte_range(pmd, addr, next, write, pages, nr))
 			return 0;
 	} while (pmdp++, addr = next, addr != end);
 
@@ -139,7 +92,11 @@ static int gup_pud_range(pgd_t pgd, unsi
 		next = pud_addr_end(addr, end);
 		if (pud_none(pud))
 			return 0;
-		if (!gup_pmd_range(pud, addr, next, write, pages, nr))
+		if (is_hugepd(pudp)) {
+			if (!gup_hugepd((hugepd_t *)pudp, PUD_SHIFT,
+					addr, next, write, pages, nr))
+				return 0;
+		} else if (!gup_pmd_range(pud, addr, next, write, pages, nr))
 			return 0;
 	} while (pudp++, addr = next, addr != end);
 
@@ -154,10 +111,6 @@ int get_user_pages_fast(unsigned long st
 	unsigned long next;
 	pgd_t *pgdp;
 	int nr = 0;
-#ifdef CONFIG_PPC64
-	unsigned int shift;
-	int psize;
-#endif
 
 	pr_devel("%s(%lx,%x,%s)\n", __func__, start, nr_pages, write ? "write" : "read");
 
@@ -172,25 +125,6 @@ int get_user_pages_fast(unsigned long st
 
 	pr_devel("  aligned: %lx .. %lx\n", start, end);
 
-#ifdef CONFIG_HUGETLB_PAGE
-	/* We bail out on slice boundary crossing when hugetlb is
-	 * enabled in order to not have to deal with two different
-	 * page table formats
-	 */
-	if (addr < SLICE_LOW_TOP) {
-		if (end > SLICE_LOW_TOP)
-			goto slow_irqon;
-
-		if (unlikely(GET_LOW_SLICE_INDEX(addr) !=
-			     GET_LOW_SLICE_INDEX(end - 1)))
-			goto slow_irqon;
-	} else {
-		if (unlikely(GET_HIGH_SLICE_INDEX(addr) !=
-			     GET_HIGH_SLICE_INDEX(end - 1)))
-			goto slow_irqon;
-	}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 	/*
 	 * XXX: batch / limit 'nr', to avoid large irq off latency
 	 * needs some instrumenting to determine the common sizes used by
@@ -210,54 +144,23 @@ int get_user_pages_fast(unsigned long st
 	 */
 	local_irq_disable();
 
-#ifdef CONFIG_PPC64
-	/* Those bits are related to hugetlbfs implementation and only exist
-	 * on 64-bit for now
-	 */
-	psize = get_slice_psize(mm, addr);
-	shift = mmu_psize_defs[psize].shift;
-#endif /* CONFIG_PPC64 */
-
-#ifdef CONFIG_HUGETLB_PAGE
-	if (unlikely(mmu_huge_psizes[psize])) {
-		pte_t *ptep;
-		unsigned long a = addr;
-		unsigned long sz = ((1UL) << shift);
-		struct hstate *hstate = size_to_hstate(sz);
-
-		BUG_ON(!hstate);
-		/*
-		 * XXX: could be optimized to avoid hstate
-		 * lookup entirely (just use shift)
-		 */
-
-		do {
-			VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, a)].shift);
-			ptep = huge_pte_offset(mm, a);
-			pr_devel(" %016lx: huge ptep %p\n", a, ptep);
-			if (!ptep || !gup_huge_pte(ptep, hstate, &a, end, write, pages,
-						   &nr))
-				goto slow;
-		} while (a != end);
-	} else
-#endif /* CONFIG_HUGETLB_PAGE */
-	{
-		pgdp = pgd_offset(mm, addr);
-		do {
-			pgd_t pgd = *pgdp;
-
-#ifdef CONFIG_PPC64
-			VM_BUG_ON(shift != mmu_psize_defs[get_slice_psize(mm, addr)].shift);
-#endif
-			pr_devel("  %016lx: normal pgd %p\n", addr,
-				 (void *)pgd_val(pgd));
-			next = pgd_addr_end(addr, end);
-			if (pgd_none(pgd))
-				goto slow;
-			if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+	pgdp = pgd_offset(mm, addr);
+	do {
+		pgd_t pgd = *pgdp;
+
+		pr_devel("  %016lx: normal pgd %p\n", addr,
+			 (void *)pgd_val(pgd));
+		next = pgd_addr_end(addr, end);
+		if (pgd_none(pgd))
+			goto slow;
+		if (is_hugepd(pgdp)) {
+			if (!gup_hugepd((hugepd_t *)pgdp, PGDIR_SHIFT,
+					addr, next, write, pages, &nr))
 				goto slow;
-		} while (pgdp++, addr = next, addr != end);
-	}
+		} else if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
+			goto slow;
+	} while (pgdp++, addr = next, addr != end);
+
 	local_irq_enable();
 
 	VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
Index: working-2.6/arch/powerpc/kernel/perf_callchain.c
===================================================================
--- working-2.6.orig/arch/powerpc/kernel/perf_callchain.c	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/kernel/perf_callchain.c	2009-09-28 14:19:08.000000000 +1000
@@ -119,13 +119,6 @@ static void perf_callchain_kernel(struct
 }
 
 #ifdef CONFIG_PPC64
-
-#ifdef CONFIG_HUGETLB_PAGE
-#define is_huge_psize(pagesize)	(HPAGE_SHIFT && mmu_huge_psizes[pagesize])
-#else
-#define is_huge_psize(pagesize)	0
-#endif
-
 /*
  * On 64-bit we don't want to invoke hash_page on user addresses from
  * interrupt context, so if the access faults, we read the page tables
@@ -135,7 +128,7 @@ static int read_user_stack_slow(void __u
 {
 	pgd_t *pgdir;
 	pte_t *ptep, pte;
-	int pagesize;
+	unsigned shift;
 	unsigned long addr = (unsigned long) ptr;
 	unsigned long offset;
 	unsigned long pfn;
@@ -145,17 +138,14 @@ static int read_user_stack_slow(void __u
 	if (!pgdir)
 		return -EFAULT;
 
-	pagesize = get_slice_psize(current->mm, addr);
+	ptep = find_linux_pte_or_hugepte(pgdir, addr, &shift);
+	if (!shift)
+		shift = PAGE_SHIFT;
 
 	/* align address to page boundary */
-	offset = addr & ((1ul << mmu_psize_defs[pagesize].shift) - 1);
+	offset = addr & ((1UL << shift) - 1);
 	addr -= offset;
 
-	if (is_huge_psize(pagesize))
-		ptep = huge_pte_offset(current->mm, addr);
-	else
-		ptep = find_linux_pte(pgdir, addr);
-
 	if (ptep == NULL)
 		return -EFAULT;
 	pte = *ptep;
Index: working-2.6/arch/powerpc/mm/hash_utils_64.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hash_utils_64.c	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hash_utils_64.c	2009-09-28 14:19:08.000000000 +1000
@@ -891,6 +891,7 @@ int hash_page(unsigned long ea, unsigned
 	unsigned long vsid;
 	struct mm_struct *mm;
 	pte_t *ptep;
+	unsigned hugeshift;
 	const struct cpumask *tmp;
 	int rc, user_region = 0, local = 0;
 	int psize, ssize;
@@ -943,14 +944,6 @@ int hash_page(unsigned long ea, unsigned
 	if (user_region && cpumask_equal(mm_cpumask(mm), tmp))
 		local = 1;
 
-#ifdef CONFIG_HUGETLB_PAGE
-	/* Handle hugepage regions */
-	if (HPAGE_SHIFT && mmu_huge_psizes[psize]) {
-		DBG_LOW(" -> huge page !\n");
-		return hash_huge_page(mm, access, ea, vsid, local, trap);
-	}
-#endif /* CONFIG_HUGETLB_PAGE */
-
 #ifndef CONFIG_PPC_64K_PAGES
 	/* If we use 4K pages and our psize is not 4K, then we are hitting
 	 * a special driver mapping, we need to align the address before
@@ -961,12 +954,18 @@ int hash_page(unsigned long ea, unsigned
 #endif /* CONFIG_PPC_64K_PAGES */
 
 	/* Get PTE and page size from page tables */
-	ptep = find_linux_pte(pgdir, ea);
+	ptep = find_linux_pte_or_hugepte(pgdir, ea, &hugeshift);
 	if (ptep == NULL || !pte_present(*ptep)) {
 		DBG_LOW(" no PTE !\n");
 		return 1;
 	}
 
+#ifdef CONFIG_HUGETLB_PAGE
+	if (hugeshift)
+		return __hash_page_huge(ea, access, vsid, ptep, trap, local,
+					ssize, hugeshift, psize);
+#endif /* CONFIG_HUGETLB_PAGE */
+
 #ifndef CONFIG_PPC_64K_PAGES
 	DBG_LOW(" i-pte: %016lx\n", pte_val(*ptep));
 #else
Index: working-2.6/arch/powerpc/include/asm/mmu-hash64.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/mmu-hash64.h	2009-09-28 14:17:38.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/mmu-hash64.h	2009-09-28 14:19:08.000000000 +1000
@@ -173,14 +173,6 @@ extern unsigned long tce_alloc_start, tc
  */
 extern int mmu_ci_restrictions;
 
-#ifdef CONFIG_HUGETLB_PAGE
-/*
- * The page size indexes of the huge pages for use by hugetlbfs
- */
-extern unsigned int mmu_huge_psizes[MMU_PAGE_COUNT];
-
-#endif /* CONFIG_HUGETLB_PAGE */
-
 /*
  * This function sets the AVPN and L fields of the HPTE  appropriately
  * for the page size
@@ -254,9 +246,9 @@ extern int __hash_page_64K(unsigned long
 			   unsigned int local, int ssize);
 struct mm_struct;
 extern int hash_page(unsigned long ea, unsigned long access, unsigned long trap);
-extern int hash_huge_page(struct mm_struct *mm, unsigned long access,
-			  unsigned long ea, unsigned long vsid, int local,
-			  unsigned long trap);
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize);
 
 extern int htab_bolt_mapping(unsigned long vstart, unsigned long vend,
 			     unsigned long pstart, unsigned long prot,

^ permalink raw reply

* [4/5] Cleanup initialization of hugepages on powerpc
From: David Gibson @ 2009-09-28  4:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090928043902.GA6302@yookeroo.seuss>

This patch simplifies the logic used to initialize hugepages on
powerpc.  The somewhat oddly named set_huge_psize() is renamed to
add_huge_page_size() and now does all the necessary verification of
whether it has been given a valid hugepage size (instead of just some
of it) and instantiates the generic hstate structure (but no more).

hugetlbpage_init() now steps through the available page sizes, checks
whether each is valid for hugepages by calling add_huge_page_size(),
and initializes the kmem_caches for the hugepage pagetables.  This
means we can now eliminate the mmu_huge_psizes array, since we no
longer need to pass the sizing information for the pagetable caches
from set_huge_psize() into hugetlbpage_init().

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
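For readers who want the new flow in isolation, here is a rough
userspace sketch of the two steps described above: validating a
candidate size, then choosing the pagetable level whose shift
determines the hugepte cache size.  Everything prefixed sketch_ or
SKETCH_ is hypothetical and not part of the patch; the shift
constants are illustrative values, since the real ones depend on the
kernel configuration, and the hardware size table stands in for
mmu_psize_defs[].

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Illustrative values only (assumed); the kernel derives these from config. */
#define SKETCH_PAGE_SHIFT	12
#define SKETCH_PMD_SHIFT	21
#define SKETCH_PUD_SHIFT	28
#define SKETCH_PGDIR_SHIFT	35
#define SKETCH_SLICE_HIGH_SHIFT	40

/* Stand-in for mmu_psize_defs[]: shifts the "hardware" supports (assumed). */
static const unsigned sketch_hw_shifts[] = { 12, 16, 24, 34 };

static bool sketch_is_power_of_2(unsigned long long n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Index of the lowest set bit; the caller must pass a non-zero size. */
static unsigned sketch_shift_of(unsigned long long size)
{
	unsigned shift = 0;

	assert(size != 0);
	while (!(size & 1)) {
		size >>= 1;
		shift++;
	}
	return shift;
}

/* Returns 0 if size is acceptable as a huge page size, -1 otherwise. */
static int sketch_add_huge_page_size(unsigned long long size)
{
	unsigned shift;
	size_t i;

	if (!sketch_is_power_of_2(size))
		return -1;

	shift = sketch_shift_of(size);
	/* Must fit within slice limits and be larger than the base page. */
	if (shift > SKETCH_SLICE_HIGH_SHIFT || shift <= SKETCH_PAGE_SHIFT)
		return -1;

	/* Must be a size the hardware table knows about. */
	for (i = 0; i < sizeof(sketch_hw_shifts) / sizeof(sketch_hw_shifts[0]); i++)
		if (sketch_hw_shifts[i] == shift)
			return 0;

	return -1;
}

/* Shift of the pagetable level that holds pointers to the hugepte table. */
static unsigned sketch_pdshift(unsigned shift)
{
	if (shift < SKETCH_PMD_SHIFT)
		return SKETCH_PMD_SHIFT;
	else if (shift < SKETCH_PUD_SHIFT)
		return SKETCH_PUD_SHIFT;
	return SKETCH_PGDIR_SHIFT;
}
```

With these assumed constants, a 16M page (shift 24) hangs off the PUD
level, so the cache argument pdshift - shift comes out as 4.  Unlike
the patch, the sketch tests for a power of two before computing the
shift, only so that the helper never sees a zero size.
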
 arch/powerpc/mm/hugetlbpage.c |  106 +++++++++++++++++++-----------------------
 1 file changed, 49 insertions(+), 57 deletions(-)

Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-28 13:53:42.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-28 14:02:34.000000000 +1000
@@ -37,11 +37,6 @@
 static unsigned long gpage_freearray[MAX_NUMBER_GPAGES];
 static unsigned nr_gpages;
 
-/* Array of valid huge page sizes - non-zero value(hugepte_shift) is
- * stored for the huge page sizes that are valid.
- */
-static unsigned int mmu_huge_psizes[MMU_PAGE_COUNT] = { }; /* initialize all to 0 */
-
 /* Flag to mark huge PD pointers.  This means pmd_bad() and pud_bad()
  * will choke on pointers to hugepte tables, which is handy for
  * catching screwups early. */
@@ -502,8 +497,6 @@ unsigned long hugetlb_get_unmapped_area(
 	struct hstate *hstate = hstate_file(file);
 	int mmu_psize = shift_to_mmu_psize(huge_page_shift(hstate));
 
-	if (!mmu_huge_psizes[mmu_psize])
-		return -EINVAL;
 	return slice_get_unmapped_area(addr, len, flags, mmu_psize, 1, 0);
 }
 
@@ -666,47 +659,46 @@ repeat:
 	return err;
 }
 
-static void __init set_huge_psize(int psize)
+static int __init add_huge_page_size(unsigned long long size)
 {
-	unsigned pdshift;
+	int shift = __ffs(size);
+	int mmu_psize;
 
 	/* Check that it is a page size supported by the hardware and
-	 * that it fits within pagetable limits. */
-	if (mmu_psize_defs[psize].shift &&
-		mmu_psize_defs[psize].shift < SID_SHIFT_1T &&
-		(mmu_psize_defs[psize].shift > MIN_HUGEPTE_SHIFT ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_64K ||
-		 mmu_psize_defs[psize].shift == PAGE_SHIFT_16G)) {
-		/* Return if huge page size has already been setup or is the
-		 * same as the base page size. */
-		if (mmu_huge_psizes[psize] ||
-		   mmu_psize_defs[psize].shift == PAGE_SHIFT)
-			return;
-		hugetlb_add_hstate(mmu_psize_defs[psize].shift - PAGE_SHIFT);
+	 * that it fits within pagetable and slice limits. */
+	if (!is_power_of_2(size)
+	    || (shift > SLICE_HIGH_SHIFT) || (shift <= PAGE_SHIFT))
+		return -EINVAL;
 
-		if (mmu_psize_defs[psize].shift < PMD_SHIFT)
-			pdshift = PMD_SHIFT;
-		else if (mmu_psize_defs[psize].shift < PUD_SHIFT)
-			pdshift = PUD_SHIFT;
-		else
-			pdshift = PGDIR_SHIFT;
-		mmu_huge_psizes[psize] = pdshift - mmu_psize_defs[psize].shift;
-	}
+	if ((mmu_psize = shift_to_mmu_psize(shift)) < 0)
+		return -EINVAL;
+
+#ifdef CONFIG_SPU_FS_64K_LS
+	/* Disable support for 64K huge pages when 64K SPU local store
+	 * support is enabled as the current implementation conflicts.
+	 */
+	if (shift == PAGE_SHIFT_64K)
+		return -EINVAL;
+#endif /* CONFIG_SPU_FS_64K_LS */
+
+	BUG_ON(mmu_psize_defs[mmu_psize].shift != shift);
+
+	/* Return if huge page size has already been setup */
+	if (size_to_hstate(size))
+		return 0;
+
+	hugetlb_add_hstate(shift - PAGE_SHIFT);
+
+	return 0;
 }
 
 static int __init hugepage_setup_sz(char *str)
 {
 	unsigned long long size;
-	int mmu_psize;
-	int shift;
 
 	size = memparse(str, &str);
 
-	shift = __ffs(size);
-	mmu_psize = shift_to_mmu_psize(shift);
-	if (mmu_psize >= 0 && mmu_psize_defs[mmu_psize].shift)
-		set_huge_psize(mmu_psize);
-	else
+	if (add_huge_page_size(size) != 0)
 		printk(KERN_WARNING "Invalid huge page size specified(%llu)\n", size);
 
 	return 1;
@@ -720,31 +712,31 @@ static int __init hugetlbpage_init(void)
 	if (!cpu_has_feature(CPU_FTR_16M_PAGE))
 		return -ENODEV;
 
-	/* Add supported huge page sizes.  Need to change HUGE_MAX_HSTATE
-	 * and adjust PTE_NONCACHE_NUM if the number of supported huge page
-	 * sizes changes.
-	 */
-	set_huge_psize(MMU_PAGE_16M);
-	set_huge_psize(MMU_PAGE_16G);
+	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
+		unsigned shift;
+		unsigned pdshift;
 
-	/* Temporarily disable support for 64K huge pages when 64K SPU local
-	 * store support is enabled as the current implementation conflicts.
-	 */
-#ifndef CONFIG_SPU_FS_64K_LS
-	set_huge_psize(MMU_PAGE_64K);
-#endif
+		if (!mmu_psize_defs[psize].shift)
+			continue;
 
-	for (psize = 0; psize < MMU_PAGE_COUNT; ++psize) {
-		if (mmu_huge_psizes[psize]) {
-			pgtable_cache_add(mmu_huge_psizes[psize], NULL);
-			if (!PGT_CACHE(mmu_huge_psizes[psize]))
-				panic("hugetlbpage_init(): could not create "
-				      "pgtable cache for %d bit pagesize\n",
-				      mmu_psize_to_shift(psize));
-		}
+		shift = mmu_psize_to_shift(psize);
+
+		if (add_huge_page_size(1ULL << shift) < 0)
+			continue;
+
+		if (shift < PMD_SHIFT)
+			pdshift = PMD_SHIFT;
+		else if (shift < PUD_SHIFT)
+			pdshift = PUD_SHIFT;
+		else
+			pdshift = PGDIR_SHIFT;
+
+		pgtable_cache_add(pdshift - shift, NULL);
+		if (!PGT_CACHE(pdshift - shift))
+			panic("hugetlbpage_init(): could not create "
+			      "pgtable cache for %d bit pagesize\n", shift);
 	}
 
 	return 0;
 }
-
 module_init(hugetlbpage_init);

^ permalink raw reply

* [5/5] Split hash MMU specific hugepage code into a new file
From: David Gibson @ 2009-09-28  4:41 UTC (permalink / raw)
  To: Benjamin Herrenschmidt, linuxppc-dev
In-Reply-To: <20090928043902.GA6302@yookeroo.seuss>

This patch separates the parts of hugetlbpage.c which are inherently
specific to the hash MMU into a new hugetlbpage-hash64.c file.

Signed-off-by: David Gibson <dwg@au1.ibm.com>

---
 arch/powerpc/include/asm/hugetlb.h   |    3 
 arch/powerpc/mm/Makefile             |    5 -
 arch/powerpc/mm/hugetlbpage-hash64.c |  167 ++++++++++++++++++++++++++++++++++
 arch/powerpc/mm/hugetlbpage.c        |  168 -----------------------------------
 4 files changed, 176 insertions(+), 167 deletions(-)

Index: working-2.6/arch/powerpc/mm/Makefile
===================================================================
--- working-2.6.orig/arch/powerpc/mm/Makefile	2009-09-28 13:51:57.000000000 +1000
+++ working-2.6/arch/powerpc/mm/Makefile	2009-09-28 13:53:21.000000000 +1000
@@ -28,7 +28,10 @@ obj-$(CONFIG_44x)		+= 44x_mmu.o
 obj-$(CONFIG_FSL_BOOKE)		+= fsl_booke_mmu.o
 obj-$(CONFIG_NEED_MULTIPLE_NODES) += numa.o
 obj-$(CONFIG_PPC_MM_SLICES)	+= slice.o
-obj-$(CONFIG_HUGETLB_PAGE)	+= hugetlbpage.o
+ifeq ($(CONFIG_HUGETLB_PAGE),y)
+obj-y				+= hugetlbpage.o
+obj-$(CONFIG_PPC_STD_MMU_64)	+= hugetlbpage-hash64.o
+endif
 obj-$(CONFIG_PPC_SUBPAGE_PROT)	+= subpage-prot.o
 obj-$(CONFIG_NOT_COHERENT_CACHE) += dma-noncoherent.o
 obj-$(CONFIG_HIGHMEM)		+= highmem.o
Index: working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c
===================================================================
--- /dev/null	1970-01-01 00:00:00.000000000 +0000
+++ working-2.6/arch/powerpc/mm/hugetlbpage-hash64.c	2009-09-28 13:53:21.000000000 +1000
@@ -0,0 +1,167 @@
+/*
+ * PPC64 Huge TLB Page Support for hash based MMUs (POWER4 and later)
+ *
+ * Copyright (C) 2003 David Gibson, IBM Corporation.
+ *
+ * Based on the IA-32 version:
+ * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
+ */
+
+#include <linux/mm.h>
+#include <linux/hugetlb.h>
+#include <asm/pgtable.h>
+#include <asm/pgalloc.h>
+#include <asm/cacheflush.h>
+#include <asm/machdep.h>
+
+/*
+ * Called by asm hashtable.S for doing lazy icache flush
+ */
+static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
+					pte_t pte, int trap, unsigned long sz)
+{
+	struct page *page;
+	int i;
+
+	if (!pfn_valid(pte_pfn(pte)))
+		return rflags;
+
+	page = pte_page(pte);
+
+	/* page is dirty */
+	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
+		if (trap == 0x400) {
+			for (i = 0; i < (sz / PAGE_SIZE); i++)
+				__flush_dcache_icache(page_address(page+i));
+			set_bit(PG_arch_1, &page->flags);
+		} else {
+			rflags |= HPTE_R_N;
+		}
+	}
+	return rflags;
+}
+
+int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
+		     pte_t *ptep, unsigned long trap, int local, int ssize,
+		     unsigned int shift, unsigned int mmu_psize)
+{
+	unsigned long old_pte, new_pte;
+	unsigned long va, rflags, pa, sz;
+	long slot;
+	int err = 1;
+
+	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
+
+	/* Search the Linux page table for a match with va */
+	va = hpt_va(ea, vsid, ssize);
+
+	/*
+	 * Check the user's access rights to the page.  If access should be
+	 * prevented then send the problem up to do_page_fault.
+	 */
+	if (unlikely(access & ~pte_val(*ptep)))
+		goto out;
+	/*
+	 * At this point, we have a pte (old_pte) which can be used to build
+	 * or update an HPTE. There are 2 cases:
+	 *
+	 * 1. There is a valid (present) pte with no associated HPTE (this is
+	 *	the most common case)
+	 * 2. There is a valid (present) pte with an associated HPTE. The
+	 *	current values of the pp bits in the HPTE prevent access
+	 *	because we are doing software DIRTY bit management and the
+	 *	page is currently not DIRTY.
+	 */
+
+
+	do {
+		old_pte = pte_val(*ptep);
+		if (old_pte & _PAGE_BUSY)
+			goto out;
+		new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
+	} while(old_pte != __cmpxchg_u64((unsigned long *)ptep,
+					 old_pte, new_pte));
+
+	rflags = 0x2 | (!(new_pte & _PAGE_RW));
+	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
+	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
+	sz = ((1UL) << shift);
+	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
+		/* No CPU has hugepages but lacks no execute, so we
+		 * don't need to worry about that case */
+		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
+						       trap, sz);
+
+	/* Check if pte already has an hpte (case 2) */
+	if (unlikely(old_pte & _PAGE_HASHPTE)) {
+		/* There MIGHT be an HPTE for this pte */
+		unsigned long hash, slot;
+
+		hash = hpt_hash(va, shift, ssize);
+		if (old_pte & _PAGE_F_SECOND)
+			hash = ~hash;
+		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+		slot += (old_pte & _PAGE_F_GIX) >> 12;
+
+		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
+					 ssize, local) == -1)
+			old_pte &= ~_PAGE_HPTEFLAGS;
+	}
+
+	if (likely(!(old_pte & _PAGE_HASHPTE))) {
+		unsigned long hash = hpt_hash(va, shift, ssize);
+		unsigned long hpte_group;
+
+		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
+
+repeat:
+		hpte_group = ((hash & htab_hash_mask) *
+			      HPTES_PER_GROUP) & ~0x7UL;
+
+		/* clear HPTE slot informations in new PTE */
+#ifdef CONFIG_PPC_64K_PAGES
+		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HPTE_SUB0;
+#else
+		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
+#endif
+		/* Add in WIMG bits */
+		rflags |= (new_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
+				      _PAGE_COHERENT | _PAGE_GUARDED));
+
+		/* Insert into the hash table, primary slot */
+		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
+					  mmu_psize, ssize);
+
+		/* Primary is full, try the secondary */
+		if (unlikely(slot == -1)) {
+			hpte_group = ((~hash & htab_hash_mask) *
+				      HPTES_PER_GROUP) & ~0x7UL;
+			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
+						  HPTE_V_SECONDARY,
+						  mmu_psize, ssize);
+			if (slot == -1) {
+				if (mftb() & 0x1)
+					hpte_group = ((hash & htab_hash_mask) *
+						      HPTES_PER_GROUP)&~0x7UL;
+
+				ppc_md.hpte_remove(hpte_group);
+				goto repeat;
+			}
+		}
+
+		if (unlikely(slot == -2))
+			panic("hash_huge_page: pte_insert failed\n");
+
+		new_pte |= (slot << 12) & (_PAGE_F_SECOND | _PAGE_F_GIX);
+	}
+
+	/*
+	 * No need to use ldarx/stdcx here
+	 */
+	*ptep = __pte(new_pte & ~_PAGE_BUSY);
+
+	err = 0;
+
+ out:
+	return err;
+}
Index: working-2.6/arch/powerpc/mm/hugetlbpage.c
===================================================================
--- working-2.6.orig/arch/powerpc/mm/hugetlbpage.c	2009-09-28 13:53:16.000000000 +1000
+++ working-2.6/arch/powerpc/mm/hugetlbpage.c	2009-09-28 13:53:21.000000000 +1000
@@ -7,29 +7,17 @@
  * Copyright (C) 2002, Rohit Seth <rohit.seth@intel.com>
  */
 
-#include <linux/init.h>
-#include <linux/fs.h>
 #include <linux/mm.h>
+#include <linux/io.h>
 #include <linux/hugetlb.h>
-#include <linux/pagemap.h>
-#include <linux/slab.h>
-#include <linux/err.h>
-#include <linux/sysctl.h>
-#include <asm/mman.h>
+#include <asm/pgtable.h>
 #include <asm/pgalloc.h>
 #include <asm/tlb.h>
-#include <asm/tlbflush.h>
-#include <asm/mmu_context.h>
-#include <asm/machdep.h>
-#include <asm/cputable.h>
-#include <asm/spu.h>
 
 #define PAGE_SHIFT_64K	16
 #define PAGE_SHIFT_16M	24
 #define PAGE_SHIFT_16G	34
 
-#define NUM_LOW_AREAS	(0x100000000UL >> SID_SHIFT)
-#define NUM_HIGH_AREAS	(PGTABLE_RANGE >> HTLB_AREA_SHIFT)
 #define MAX_NUMBER_GPAGES	1024
 
 /* Tracks the 16G pages after the device tree is scanned and before the
@@ -507,158 +495,6 @@ unsigned long vma_mmu_pagesize(struct vm
 	return 1UL << mmu_psize_to_shift(psize);
 }
 
-/*
- * Called by asm hashtable.S for doing lazy icache flush
- */
-static unsigned int hash_huge_page_do_lazy_icache(unsigned long rflags,
-					pte_t pte, int trap, unsigned long sz)
-{
-	struct page *page;
-	int i;
-
-	if (!pfn_valid(pte_pfn(pte)))
-		return rflags;
-
-	page = pte_page(pte);
-
-	/* page is dirty */
-	if (!test_bit(PG_arch_1, &page->flags) && !PageReserved(page)) {
-		if (trap == 0x400) {
-			for (i = 0; i < (sz / PAGE_SIZE); i++)
-				__flush_dcache_icache(page_address(page+i));
-			set_bit(PG_arch_1, &page->flags);
-		} else {
-			rflags |= HPTE_R_N;
-		}
-	}
-	return rflags;
-}
-
-int __hash_page_huge(unsigned long ea, unsigned long access, unsigned long vsid,
-		     pte_t *ptep, unsigned long trap, int local, int ssize,
-		     unsigned int shift, unsigned int mmu_psize)
-{
-	unsigned long old_pte, new_pte;
-	unsigned long va, rflags, pa, sz;
-	long slot;
-	int err = 1;
-
-	BUG_ON(shift != mmu_psize_defs[mmu_psize].shift);
-
-	/* Search the Linux page table for a match with va */
-	va = hpt_va(ea, vsid, ssize);
-
-	/* 
-	 * Check the user's access rights to the page.  If access should be
-	 * prevented then send the problem up to do_page_fault.
-	 */
-	if (unlikely(access & ~pte_val(*ptep)))
-		goto out;
-	/*
-	 * At this point, we have a pte (old_pte) which can be used to build
-	 * or update an HPTE. There are 2 cases:
-	 *
-	 * 1. There is a valid (present) pte with no associated HPTE (this is 
-	 *	the most common case)
-	 * 2. There is a valid (present) pte with an associated HPTE. The
-	 *	current values of the pp bits in the HPTE prevent access
-	 *	because we are doing software DIRTY bit management and the
-	 *	page is currently not DIRTY. 
-	 */
-
-
-	do {
-		old_pte = pte_val(*ptep);
-		if (old_pte & _PAGE_BUSY)
-			goto out;
-		new_pte = old_pte | _PAGE_BUSY | _PAGE_ACCESSED;
-	} while(old_pte != __cmpxchg_u64((unsigned long *)ptep,
-					 old_pte, new_pte));
-
-	rflags = 0x2 | (!(new_pte & _PAGE_RW));
- 	/* _PAGE_EXEC -> HW_NO_EXEC since it's inverted */
-	rflags |= ((new_pte & _PAGE_EXEC) ? 0 : HPTE_R_N);
-	sz = ((1UL) << shift);
-	if (!cpu_has_feature(CPU_FTR_COHERENT_ICACHE))
-		/* No CPU has hugepages but lacks no execute, so we
-		 * don't need to worry about that case */
-		rflags = hash_huge_page_do_lazy_icache(rflags, __pte(old_pte),
-						       trap, sz);
-
-	/* Check if pte already has an hpte (case 2) */
-	if (unlikely(old_pte & _PAGE_HASHPTE)) {
-		/* There MIGHT be an HPTE for this pte */
-		unsigned long hash, slot;
-
-		hash = hpt_hash(va, shift, ssize);
-		if (old_pte & _PAGE_F_SECOND)
-			hash = ~hash;
-		slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-		slot += (old_pte & _PAGE_F_GIX) >> 12;
-
-		if (ppc_md.hpte_updatepp(slot, rflags, va, mmu_psize,
-					 ssize, local) == -1)
-			old_pte &= ~_PAGE_HPTEFLAGS;
-	}
-
-	if (likely(!(old_pte & _PAGE_HASHPTE))) {
-		unsigned long hash = hpt_hash(va, shift, ssize);
-		unsigned long hpte_group;
-
-		pa = pte_pfn(__pte(old_pte)) << PAGE_SHIFT;
-
-repeat:
-		hpte_group = ((hash & htab_hash_mask) *
-			      HPTES_PER_GROUP) & ~0x7UL;
-
-		/* clear HPTE slot informations in new PTE */
-#ifdef CONFIG_PPC_64K_PAGES
-		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HPTE_SUB0;
-#else
-		new_pte = (new_pte & ~_PAGE_HPTEFLAGS) | _PAGE_HASHPTE;
-#endif
-		/* Add in WIMG bits */
-		rflags |= (new_pte & (_PAGE_WRITETHRU | _PAGE_NO_CACHE |
-				      _PAGE_COHERENT | _PAGE_GUARDED));
-
-		/* Insert into the hash table, primary slot */
-		slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags, 0,
-					  mmu_psize, ssize);
-
-		/* Primary is full, try the secondary */
-		if (unlikely(slot == -1)) {
-			hpte_group = ((~hash & htab_hash_mask) *
-				      HPTES_PER_GROUP) & ~0x7UL; 
-			slot = ppc_md.hpte_insert(hpte_group, va, pa, rflags,
-						  HPTE_V_SECONDARY,
-						  mmu_psize, ssize);
-			if (slot == -1) {
-				if (mftb() & 0x1)
-					hpte_group = ((hash & htab_hash_mask) *
-						      HPTES_PER_GROUP)&~0x7UL;
-
-				ppc_md.hpte_remove(hpte_group);
-				goto repeat;
-                        }
-		}
-
-		if (unlikely(slot == -2))
-			panic("hash_huge_page: pte_insert failed\n");
-
-		new_pte |= (slot << 12) & (_PAGE_F_SECOND | _PAGE_F_GIX);
-	}
-
-	/*
-	 * No need to use ldarx/stdcx here
-	 */
-	*ptep = __pte(new_pte & ~_PAGE_BUSY);
-
-	err = 0;
-
- out:
-	return err;
-}
-
 static int __init add_huge_page_size(unsigned long long size)
 {
 	int shift = __ffs(size);
Index: working-2.6/arch/powerpc/include/asm/hugetlb.h
===================================================================
--- working-2.6.orig/arch/powerpc/include/asm/hugetlb.h	2009-09-28 13:53:00.000000000 +1000
+++ working-2.6/arch/powerpc/include/asm/hugetlb.h	2009-09-28 13:53:21.000000000 +1000
@@ -13,6 +13,9 @@ static inline int hugepd_ok(hugepd_t hpd
 #define is_hugepd(pdep)               (hugepd_ok(*((hugepd_t *)(pdep))))
 #define HUGEPD_SHIFT_MASK     0x3f
 
+pte_t *huge_pte_offset_and_shift(struct mm_struct *mm,
+				 unsigned long addr, unsigned *shift);
+
 int is_hugepage_only_range(struct mm_struct *mm, unsigned long addr,
 			   unsigned long len);
 

^ permalink raw reply
