LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH] sound: Don't assume i2c device probing always succeeds
From: Johannes Berg @ 2009-09-30 15:05 UTC (permalink / raw)
  To: Jean Delvare
  Cc: Takashi Iwai, linuxppc-dev, alsa-devel, Tim Shepard,
	Jaroslav Kysela
In-Reply-To: <20090930170006.76e145ca@hyperion.delvare>

[-- Attachment #1: Type: text/plain, Size: 405 bytes --]

On Wed, 2009-09-30 at 17:00 +0200, Jean Delvare wrote:

> The NULL check of client->driver, if followed by a call to
> i2c_unregister_device(), would indeed be enough. But unlike the onyx
> driver which we know we sometimes load erroneously, the other drivers
> should never fail.

All of these drivers can be loaded manually and then fail though, or am
I misunderstanding something?

johannes

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 801 bytes --]

^ permalink raw reply

* Re: [PATCH] sound: Don't assume i2c device probing always succeeds
From: Jean Delvare @ 2009-09-30 15:57 UTC (permalink / raw)
  To: Johannes Berg
  Cc: Takashi Iwai, linuxppc-dev, alsa-devel, Tim Shepard,
	Jaroslav Kysela
In-Reply-To: <1254323111.3959.6.camel@johannes.local>

On Wed, 30 Sep 2009 17:05:11 +0200, Johannes Berg wrote:
> On Wed, 2009-09-30 at 17:00 +0200, Jean Delvare wrote:
> 
> > The NULL check of client->driver, if followed by a call to
> > i2c_unregister_device(), would indeed be enough. But unlike the onyx
> > driver which we know we sometimes load erroneously, the other drivers
> > should never fail.
> 
> All of these drivers can be loaded manually and then fail though, or am
> I misunderstanding something?

I don't think so. At least tas and keywest have checks before they
attempt to instantiate an i2c client, which I think are reliable. The
onyx case is different because apparently some machines have it but are
difficult to detect:

	/* if that didn't work, try desperate mode for older
	 * machines that have stuff missing from the device tree */

And then we have to attempt to instantiate i2c devices at a not
completely known address, and that may fail. I think this is the reason
why onyx has this extra client->driver NULL check and the other two
drivers do not.

I would really love if someone with good knowledge of the device tree
on mac would convert all these hacks to proper i2c device declarations.
All the infrastructure is available already, but I don't know enough
about open firmware and mac the device tree to do it myself.

-- 
Jean Delvare

^ permalink raw reply

* Re: [PATCH] sound: Don't assume i2c device probing always succeeds
From: Jean Delvare @ 2009-09-30 16:55 UTC (permalink / raw)
  To: Takashi Iwai
  Cc: linuxppc-dev, Johannes Berg, Tim Shepard, alsa-devel,
	Jaroslav Kysela
In-Reply-To: <s5h4oqkxy3e.wl%tiwai@suse.de>

On Wed, 30 Sep 2009 17:15:49 +0200, Takashi Iwai wrote:
> Yes, indeed I prefer NULL check because the user can know the error
> at the right place.  I share your concern about the code addition,
> though :)
> 
> I already made a patch below, but it's totally untested.
> It'd be helpful if someone can do review and build-test it.
> 
> 
> thanks,
> 
> Takashi
> 
> ---
> diff --git a/sound/aoa/codecs/tas.c b/sound/aoa/codecs/tas.c
> index f0ebc97..0f810c8 100644
> --- a/sound/aoa/codecs/tas.c
> +++ b/sound/aoa/codecs/tas.c
> @@ -897,6 +897,10 @@ static int tas_create(struct i2c_adapter *adapter,
>  	client = i2c_new_device(adapter, &info);
>  	if (!client)
>  		return -ENODEV;
> +	if (!client->driver) {
> +		i2c_unregister_device(client);
> +		return -ENODEV;
> +	}
>  
>  	/*
>  	 * Let i2c-core delete that device on driver removal.
> diff --git a/sound/ppc/keywest.c b/sound/ppc/keywest.c
> index 835fa19..473c5a6 100644
> --- a/sound/ppc/keywest.c
> +++ b/sound/ppc/keywest.c
> @@ -59,6 +59,13 @@ static int keywest_attach_adapter(struct i2c_adapter *adapter)
>  	strlcpy(info.type, "keywest", I2C_NAME_SIZE);
>  	info.addr = keywest_ctx->addr;
>  	keywest_ctx->client = i2c_new_device(adapter, &info);
> +	if (!keywest_ctx->client)
> +		return -ENODEV;
> +	if (!keywest_ctx->client->driver) {
> +		i2c_unregister_device(keywest_ctx->client);
> +		keywest_ctx->client = NULL;
> +		return -ENODEV;
> +	}
>  	
>  	/*
>  	 * Let i2c-core delete that device on driver removal.

This looks good to me. Please add the following comment before the
client->driver check in both drivers:

	/*
	 * We know the driver is already loaded, so the device should be
	 * already bound. If not it means binding failed, and then there
	 * is no point in keeping the device instantiated.
	 */

Otherwise it's a little difficult to understand why the check is there.

-- 
Jean Delvare

^ permalink raw reply

* Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
From: Joakim Tjernlund @ 2009-09-30 17:23 UTC (permalink / raw)
  Cc: linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OF46F87EEF.62BE7D78-ONC1257641.003D4172-C1257641.003E1445@transmode.se>

>
> Rex Feany <RFeany@mrv.com> wrote on 30/09/2009 11:00:02:
> >
> > Thus spake Joakim Tjernlund (joakim.tjernlund@transmode.se):
> >
> > > > Ok, I have made some minor tweaks and added debug code in
> > > > do_page_fault(). Would be great if you could try on both
> > > > .31 and top of tree.
> > > >
> > > >  Jocke
> > >
> > > OOPS, found a bug. Use this one instead:
> >
> > .31 - no change, it worked before your patch and it works after.
> > None of your debugging printks show up. I tried removing the tlbil_va()
> > from do_dcache_icache_coherency() and userspace goes back to be being
> > slow/non functional.
>
> Had a look at linus tree and there is something I don't understand.
> Your fix, e0908085fc2391c85b85fb814ae1df377c8e0dcb, fixes a problem
> that was introduced by 8d30c14cab30d405a05f2aaceda1e9ad57800f36 but
> 8d30c14cab30d405a05f2aaceda1e9ad57800f36 was included in .31 and .31
> works and top of tree does not, how can that be?
>
> To me it seems more likely that some mm change introduced between .31 and
> top of tree is the culprit.
> My patch addresses the problem described in the comment:
> /* On 8xx, cache control instructions (particularly
>  * "dcbst" from flush_dcache_icache) fault as write
>  * operation if there is an unpopulated TLB entry
>  * for the address in question. To workaround that,
>  * we invalidate the TLB here, thus avoiding dcbst
>  * misbehaviour.
>  */
> Now you are using this old fix to paper over some other bug or so I think.

There is something fishy with the TLB status, looking into the mpc862 manual I
don't see how it can work reliably. Need to dwell some more.
Anyhow, I have incorporated some of my findings into a new patch,
perhaps this will be the golden one?

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index 4dd38f1..b3f6687 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -709,6 +709,14 @@ ret_from_except:
 	SYNC			/* Some chip revs have problems here... */
 	MTMSRD(r10)		/* disable interrupts */

+#ifdef CONFIG_8xx
+	/* Tag DAR with a well know value.
+	 * This needs to match head_8xx.S and
+	 * do_page_fault()
+	 */
+	li	r3, 0x00f0
+	mtspr	SPRN_DAR, r3
+#endif
 	lwz	r3,_MSR(r1)	/* Returning to user mode? */
 	andi.	r0,r3,MSR_PR
 	beq	resume_kernel
diff --git a/arch/powerpc/kernel/head_8xx.S b/arch/powerpc/kernel/head_8xx.S
index 52ff8c5..de1fd58 100644
--- a/arch/powerpc/kernel/head_8xx.S
+++ b/arch/powerpc/kernel/head_8xx.S
@@ -39,6 +39,15 @@
 #else
 #define DO_8xx_CPU6(val, reg)
 #endif
+
+/* DAR needs to be tagged with a known value so that the
+ * DataTLB Miss/Error and do_page_fault() can recognize a
+ * buggy dcbx instruction and workaround the problem.
+ * dcbf, dcbi, dcbst, dcbz instructions do not update DAR
+ * when trapping into a Data TLB Miss/Error. See
+ * DataStoreTLBMiss and DataTLBError for details
+ */
+
 	__HEAD
 _ENTRY(_stext);
 _ENTRY(_start);
@@ -222,6 +231,7 @@ DataAccess:
 	stw	r10,_DSISR(r11)
 	mr	r5,r10
 	mfspr	r4,SPRN_DAR
+	stw	r4,_DAR(r11)
 	EXC_XFER_EE_LITE(0x300, handle_page_fault)

 /* Instruction access exception.
@@ -352,7 +362,7 @@ InstructionTLBMiss:
 	 * set.  All other Linux PTE bits control the behavior
 	 * of the MMU.
 	 */
-2:	li	r11, 0x00f0
+	li	r11, 0x00f0
 	rlwimi	r10, r11, 0, 24, 28	/* Set 24-27, clear 28 */
 	DO_8xx_CPU6(0x2d80, r3)
 	mtspr	SPRN_MI_RPN, r10	/* Update TLB entry */
@@ -365,6 +375,19 @@ InstructionTLBMiss:
 	lwz	r3, 8(r0)
 #endif
 	rfi
+2:
+	mfspr	r10, SPRN_SRR1
+	rlwinm	r10,r10,0,5,3 /* clear bit 4(0x08000000) */
+	mtspr	SPRN_SRR1, r10
+
+	mfspr	r10, SPRN_M_TW	/* Restore registers */
+	lwz	r11, 0(r0)
+	mtcr	r11
+	lwz	r11, 4(r0)
+#ifdef CONFIG_8xx_CPU6
+	lwz	r3, 8(r0)
+#endif
+	b	InstructionAccess

 	. = 0x1200
 DataStoreTLBMiss:
@@ -428,10 +451,11 @@ DataStoreTLBMiss:
 	 * set.  All other Linux PTE bits control the behavior
 	 * of the MMU.
 	 */
-2:	li	r11, 0x00f0
+	li	r11, 0x00f0
 	rlwimi	r10, r11, 0, 24, 28	/* Set 24-27, clear 28 */
 	DO_8xx_CPU6(0x3d80, r3)
 	mtspr	SPRN_MD_RPN, r10	/* Update TLB entry */
+	mtspr	SPRN_DAR, r11		/* Tag DAR */

 	mfspr	r10, SPRN_M_TW	/* Restore registers */
 	lwz	r11, 0(r0)
@@ -441,7 +465,18 @@ DataStoreTLBMiss:
 	lwz	r3, 8(r0)
 #endif
 	rfi
+2:
+	li	r10, 0
+	mtspr	SPRN_DSISR, r10 /* Clear all bits */

+	mfspr	r10, SPRN_M_TW	/* Restore registers */
+	lwz	r11, 0(r0)
+	mtcr	r11
+	lwz	r11, 4(r0)
+#ifdef CONFIG_8xx_CPU6
+	lwz	r3, 8(r0)
+#endif
+	b	DataAccess
 /* This is an instruction TLB error on the MPC8xx.  This could be due
  * to many reasons, such as executing guarded memory or illegal instruction
  * addresses.  There is nothing to do but handle a big time error fault.
@@ -492,6 +527,8 @@ DataTLBError:
 	 * assuming we only use the dcbi instruction on kernel addresses.
 	 */
 	mfspr	r10, SPRN_DAR
+	cmpwi	cr0, r10, 0x00f0	/* check if DAR holds a tag */
+	beq-	2f
 	rlwinm	r11, r10, 0, 0, 19
 	ori	r11, r11, MD_EVALID
 	mfspr	r10, SPRN_M_CASID
@@ -550,6 +587,7 @@ DataTLBError:
 	rlwimi	r10, r11, 0, 24, 28	/* Set 24-27, clear 28 */
 	DO_8xx_CPU6(0x3d80, r3)
 	mtspr	SPRN_MD_RPN, r10	/* Update TLB entry */
+	mtspr	SPRN_DAR, r11		/* Tag DAR */

 	mfspr	r10, SPRN_M_TW	/* Restore registers */
 	lwz	r11, 0(r0)

^ permalink raw reply related

* Re: [PATCH] powerpc/5200: add LocalPlus bus FIFO device driver
From: Albrecht Dreß @ 2009-09-30 18:34 UTC (permalink / raw)
  To: John Bonesio; +Cc: linuxppc-dev
In-Reply-To: <20090929204248.16225.22818.stgit@riker>

[-- Attachment #1: Type: text/plain, Size: 2832 bytes --]

Hi John:

Cool stuff - Grant's original one also helped me a lot to get Bestcomm  
running for me (I use it only for read dma)!

Am 29.09.09 22:43 schrieb(en) John Bonesio:
[snip]
> +		 * It may be worth experimenting with the ALARM value  
> to see if
> +		 * there is a performance impacit.  However, if it is  
> wrong there
> +		 * is a risk of DMA not transferring the last chunk of  
> data
> +		 */
> +		if (write) {
> +			out_be32(lpbfifo.regs + LPBFIFO_REG_FIFO_ALARM,  
> 0x1e4);
> +			out_8(lpbfifo.regs + LPBFIFO_REG_FIFO_CONTROL,  
> 7);
> +			lpbfifo.bcom_cur_task = lpbfifo.bcom_tx_task;
> +		} else {
> +			out_be32(lpbfifo.regs + LPBFIFO_REG_FIFO_ALARM,  
> 0x1ff);
> +			out_8(lpbfifo.regs + LPBFIFO_REG_FIFO_CONTROL,  
> 0);
> +			lpbfifo.bcom_cur_task = lpbfifo.bcom_rx_task;

Hmmm, the MPC5200B User's manual, sect. 9.7.3.3 LPC Rx/Tx FIFO Control  
Register (pg. 323) seems to imply that only 0 or 1 are valid settings  
here.  Where does the 7 come from?

[snip]
> +			 * In the DMA read case, the DMA doesn't  
> complete,
> +			 * possibly due to incorrect watermarks in the  
> ALARM
> +			 * and CONTROL regs. For now instead of trying  
> to
> +			 * determine the right watermarks that will  
> make this
> +			 * work, just increase the number of bytes the  
> FIFO is
> +			 * expecting.

My experience is that for reads I have to set a granularity of 0 (as  
you already do) *and* set the 'flush' bit in the SCLPC Control  
Register.  See the remarks for that bit in the MPC5200B User's manual,  
pg. 318.  With these settings and my driver, I made several tests  
reading blocks (in one Bestcomm operation) between 16 bytes and 2  
MBytes, and *never* had any problem, so your remark here makes me  
somewhat nervous!

In my dma read tests, the watermark (LPC Rx/Tx FIFO Alarm Register)  
seems to have only a (small) performance impact.  If I set the  
watermark to 0 (i.e. fill the fifo completely), I *never* saw any  
under- or overruns.  My assumption is that the fifo would stop reading  
the peripheral when it is full, and only resume when Bestcomm emptied  
it at least partly.  You you know if that is really the case?

[snip]
> +
> +	bit_fields = req->cs << 24 | 0x000008;
> +	if (!write)
> +		bit_fields |= 0x010000; /* read mode */

See above - should be "bit_fields |= 0x030000" IMHO...

[snip]
> +	/* Request the Bestcomm receive (fifo --> memory) task and IRQ  
> */
> +	lpbfifo.bcom_rx_task =
> +		bcom_gen_bd_rx_init(2, res.start +  
> LPBFIFO_REG_FIFO_DATA,
> +				    BCOM_INITIATOR_SCLPC,  
> BCOM_IPR_SCLPC,
> +				    16*1024*1024);

Is the limit really correct?  The manual (sect. 9.7.2.1 SCLPC Packet  
Size Register, pg. 316) says the maximum packet size is 16 M - 1 byte.

Cheers,
Albrecht.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* Re: [PATCH v3 2/3] cpu: Offline state Framework.
From: Randy Dunlap @ 2009-09-30 17:31 UTC (permalink / raw)
  To: Gautham R Shenoy
  Cc: Peter Zijlstra, linux-kernel, Venkatesh Pallipadi,
	Arun R Bharadwaj, linuxppc-dev, Darrick J. Wong
In-Reply-To: <20090915120706.20523.95464.stgit@sofia.in.ibm.com>

On Tue, 15 Sep 2009 17:37:06 +0530 Gautham R Shenoy wrote:

> Signed-off-by: Gautham R Shenoy <ego@in.ibm.com>
> ---
>  Documentation/cpu-hotplug.txt |   22 +++++
>  drivers/base/cpu.c            |  181 +++++++++++++++++++++++++++++++++++++++--
>  include/linux/cpu.h           |   10 ++
>  3 files changed, 204 insertions(+), 9 deletions(-)
> 
> diff --git a/Documentation/cpu-hotplug.txt b/Documentation/cpu-hotplug.txt
> index 9d620c1..dcec06d 100644
> --- a/Documentation/cpu-hotplug.txt
> +++ b/Documentation/cpu-hotplug.txt
> @@ -115,6 +115,28 @@ Just remember the critical section cannot call any
>  function that can sleep or schedule this process away. The preempt_disable()
>  will work as long as stop_machine_run() is used to take a cpu down.
>  
> +CPU-offline states
> +--------------------------------------
> +On architectures which allow the more than one valid state when

                                ^drop "the"

> +the CPU goes offline, the system administrator can decide
> +the state the CPU needs to go to when it is offlined.

s/needs to/should/

> +
> +If the architecture has implemented a cpu-offline driver exposing these
> +multiple offline states, the system administrator can use the following sysfs
> +interfaces to query the available hotplug states and also query and set the
> +current hotplug state for a given cpu:
> +
> +To query the hotplug states, on needs to perform a read on:

                                one

> +/sys/devices/system/cpu/cpu<number>/available_hotplug_states
> +
> +To query or set the current state for a particular CPU,
> +one needs to use the sysfs interface
> +
> +/sys/devices/system/cpu/cpu<number>/current_hotplug_state
> +
> +Writes to the "online" sysfs files are serialized against the writes to the
> +"current_hotplug_state" file.
> +
>  CPU Hotplug - Frequently Asked Questions.
>  
>  Q: How to enable my kernel to support CPU hotplug?


---
~Randy

^ permalink raw reply

* [PATCH] i2c-powermac: Reject unsupported I2C transactions
From: Jean Delvare @ 2009-09-30 20:14 UTC (permalink / raw)
  To: Linux I2C, linuxppc-dev; +Cc: Paul Mackerras

The i2c-powermac driver doesn't support arbitrary multi-message I2C
transactions, only SMBus ones. Make it clear by returning an error if
a multi-message I2C transaction is attempted. This is better than only
processing the first message, because most callers won't recover from
the short transaction. Anyone wishing to issue multi-message
transactions should use the SMBus API instead of the raw I2C API.

Signed-off-by: Jean Delvare <khali@linux-fr.org>
Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Cc: Paul Mackerras <paulus@samba.org>
---
 drivers/i2c/busses/i2c-powermac.c |    6 ++++++
 1 file changed, 6 insertions(+)

--- linux-2.6.32-rc1.orig/drivers/i2c/busses/i2c-powermac.c	2009-06-10 05:05:27.000000000 +0200
+++ linux-2.6.32-rc1/drivers/i2c/busses/i2c-powermac.c	2009-09-30 20:29:42.000000000 +0200
@@ -146,6 +146,12 @@ static int i2c_powermac_master_xfer(	str
 	int			read;
 	int			addrdir;

+	if (num != 1) {
+		dev_err(&adap->dev,
+			"Multi-message I2C transactions not supported\n");
+		return -EOPNOTSUPP;
+	}
+
 	if (msgs->flags & I2C_M_TEN)
 		return -EINVAL;
 	read = (msgs->flags & I2C_M_RD) != 0;

-- 
Jean Delvare

^ permalink raw reply

* Re: 64bit kernel is huge
From: Benjamin Herrenschmidt @ 2009-09-30 20:30 UTC (permalink / raw)
  To: Michael Neuling; +Cc: linuxppc-dev, Anton Blanchard
In-Reply-To: <5642.1254290487@neuling.org>

On Wed, 2009-09-30 at 16:01 +1000, Michael Neuling wrote:
> 2.6.30     9221595      1620728 1052812 11895135        b5815f
> 2.6.29     9104807      1569840  785292 11459939        aedd63 

The jump in BSS size is huge...  lockdep ?

Ben.

^ permalink raw reply

* Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
From: Benjamin Herrenschmidt @ 2009-09-30 22:35 UTC (permalink / raw)
  To: Joakim Tjernlund; +Cc: linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <OFC29CE070.42D58057-ONC1257641.005F19FD-C1257641.005F8ED3@transmode.se>

> > Had a look at linus tree and there is something I don't understand.
> > Your fix, e0908085fc2391c85b85fb814ae1df377c8e0dcb, fixes a problem
> > that was introduced by 8d30c14cab30d405a05f2aaceda1e9ad57800f36 but
> > 8d30c14cab30d405a05f2aaceda1e9ad57800f36 was included in .31 and .31
> > works and top of tree does not, how can that be?
> >
> > To me it seems more likely that some mm change introduced between .31 and
> > top of tree is the culprit.
> > My patch addresses the problem described in the comment:
> > /* On 8xx, cache control instructions (particularly
> >  * "dcbst" from flush_dcache_icache) fault as write
> >  * operation if there is an unpopulated TLB entry
> >  * for the address in question. To workaround that,
> >  * we invalidate the TLB here, thus avoiding dcbst
> >  * misbehaviour.
> >  */
> > Now you are using this old fix to paper over some other bug or so I think.
> 
> There is something fishy with the TLB status, looking into the mpc862 manual I
> don't see how it can work reliably. Need to dwell some more.
> Anyhow, I have incorporated some of my findings into a new patch,
> perhaps this will be the golden one?

Well, I still wonder about what whole unpopulated entry business.

>From what I can see, the TLB miss code will check _PAGE_PRESENT, and
when not set, it will -still- insert something into the TLB (unlike
all other CPU types that go straight to data access faults from there).

So we end up populating with those unpopulated entries that will then
cause us to take a DSI (or ISI) instead of a TLB miss the next time
around ... and so we need to remove them once we do that, no ? IE. Once
we have actually put a valid PTE in.

At least that's my understanding and why I believe we should always have
tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down
in putting it into the 2 "filter" functions in the new code.

Well.. actually, the ptep_set_access_flags() will already do a
flush_tlb_page_nohash() when the PTE is changed. So I suppose all we
really need here would be in set_pte_filter(), to do the same if we are
writing a PTE that has _PAGE_PRESENT, at least on 8xx.

But just unconditionally doing a tlbil_va() in both the filter functions
shouldn't harm and also fix the problem, though Rex seems to indicate
that is not the case.

Now, we -might- have something else broken on 8xx, hard to tell. You may
want to try to bisect, adding back the removed tlbil_va() as you go
backward, between .31 and upstream... 

Cheers,
Ben.

^ permalink raw reply

* Re: 64bit kernel is huge
From: Michael Ellerman @ 2009-10-01  3:36 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: Michael Neuling, linuxppc-dev
In-Reply-To: <20090928074503.GB16073@kryten>

[-- Attachment #1: Type: text/plain, Size: 881 bytes --]

On Mon, 2009-09-28 at 17:45 +1000, Anton Blanchard wrote:
> Hi,
> 
> I've found at least one machine that wont boot 2.6.31-rc* with a 
> pseries_defconfig. If I move real-base from 0xc00000 to 0xd00000 it
> boots fine.
> 
> # size vmlinux
>    text	   data	    bss	    dec	    hex	filename
> 9812942	1982496	1105228	12900666	 c4d93a	vmlinux
> 
> Looks like we blow right through the 12MB mark. It desperately needs to eat
> less and lose weight.
> 
> Here are some of the problem areas:
> 
> 131072  lppaca
> 65536   paca
> 
> I think we've attacked these before, not sure if there is anything left
> we can trim.

Why can't we dynamically allocate all but one paca? I seem to recall
Mikey tried it but it didn't work?

And on !ISERIES we should be able to allocate the lppacas too I think,
the HV doesn't know about them until register_vpa().

cheers

[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply

* Re: 64bit kernel is huge
From: Michael Neuling @ 2009-10-01  5:06 UTC (permalink / raw)
  To: Anton Blanchard, linuxppc-dev
In-Reply-To: <5642.1254290487@neuling.org>

> 2.6.28     8647080	1699460	 780472	11127012	a9c8e4
> 2.6.27     7461663	1505796	 774400	9741859		94a623

If you compile 28 with the 27 pseries_config, you lose most of this
1.2MB text bloat.

If we remove just FUNCTION_TRACER and STACK_TRACER from 28 we gain back
about 600K.

Mikey

^ permalink raw reply

* Re: 64bit kernel is huge
From: Michael Neuling @ 2009-10-01  5:09 UTC (permalink / raw)
  To: Anton Blanchard, linuxppc-dev
In-Reply-To: <14112.1254373570@neuling.org>

> > 2.6.28     8647080	1699460	 780472	11127012	a9c8e4
> > 2.6.27     7461663	1505796	 774400	9741859		94a623
> 
> If you compile 28 with the 27 pseries_config, you lose most of this
> 1.2MB text bloat.
> 
> If we remove just FUNCTION_TRACER and STACK_TRACER from 28 we gain back
> about 600K.

EXT4 also adds about 200K, between the 27 and 28 configs.

Mikey

^ permalink raw reply

* Re: [PATCH] sound: Don't assume i2c device probing always succeeds
From: Takashi Iwai @ 2009-10-01  6:52 UTC (permalink / raw)
  To: Jean Delvare
  Cc: linuxppc-dev, Johannes Berg, Tim Shepard, alsa-devel,
	Jaroslav Kysela
In-Reply-To: <20090930185505.1261e183@hyperion.delvare>

At Wed, 30 Sep 2009 18:55:05 +0200,
Jean Delvare wrote:
> 
> On Wed, 30 Sep 2009 17:15:49 +0200, Takashi Iwai wrote:
> > Yes, indeed I prefer NULL check because the user can know the error
> > at the right place.  I share your concern about the code addition,
> > though :)
> > 
> > I already made a patch below, but it's totally untested.
> > It'd be helpful if someone can do review and build-test it.
> > 
> > 
> > thanks,
> > 
> > Takashi
> > 
> > ---
> > diff --git a/sound/aoa/codecs/tas.c b/sound/aoa/codecs/tas.c
> > index f0ebc97..0f810c8 100644
> > --- a/sound/aoa/codecs/tas.c
> > +++ b/sound/aoa/codecs/tas.c
> > @@ -897,6 +897,10 @@ static int tas_create(struct i2c_adapter *adapter,
> >  	client = i2c_new_device(adapter, &info);
> >  	if (!client)
> >  		return -ENODEV;
> > +	if (!client->driver) {
> > +		i2c_unregister_device(client);
> > +		return -ENODEV;
> > +	}
> >  
> >  	/*
> >  	 * Let i2c-core delete that device on driver removal.
> > diff --git a/sound/ppc/keywest.c b/sound/ppc/keywest.c
> > index 835fa19..473c5a6 100644
> > --- a/sound/ppc/keywest.c
> > +++ b/sound/ppc/keywest.c
> > @@ -59,6 +59,13 @@ static int keywest_attach_adapter(struct i2c_adapter *adapter)
> >  	strlcpy(info.type, "keywest", I2C_NAME_SIZE);
> >  	info.addr = keywest_ctx->addr;
> >  	keywest_ctx->client = i2c_new_device(adapter, &info);
> > +	if (!keywest_ctx->client)
> > +		return -ENODEV;
> > +	if (!keywest_ctx->client->driver) {
> > +		i2c_unregister_device(keywest_ctx->client);
> > +		keywest_ctx->client = NULL;
> > +		return -ENODEV;
> > +	}
> >  	
> >  	/*
> >  	 * Let i2c-core delete that device on driver removal.
> 
> This looks good to me. Please add the following comment before the
> client->driver check in both drivers:
> 
> 	/*
> 	 * We know the driver is already loaded, so the device should be
> 	 * already bound. If not it means binding failed, and then there
> 	 * is no point in keeping the device instantiated.
> 	 */
> 
> Otherwise it's a little difficult to understand why the check is there.

Fair enough.  I applied the patch with the comment now.
Thanks!


Takashi

^ permalink raw reply

* Re: [PATCH] powerpc/8xx: fix regression introduced by cache coherency rewrite
From: Joakim Tjernlund @ 2009-10-01  7:05 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev@ozlabs.org, Rex Feany
In-Reply-To: <1254350159.5699.21.camel@pasglop>

Benjamin Herrenschmidt <benh@kernel.crashing.org> wrote on 01/10/2009 00:35:59:
>
>
> > > Had a look at linus tree and there is something I don't understand.
> > > Your fix, e0908085fc2391c85b85fb814ae1df377c8e0dcb, fixes a problem
> > > that was introduced by 8d30c14cab30d405a05f2aaceda1e9ad57800f36 but
> > > 8d30c14cab30d405a05f2aaceda1e9ad57800f36 was included in .31 and .31
> > > works and top of tree does not, how can that be?
> > >
> > > To me it seems more likely that some mm change introduced between .31 and
> > > top of tree is the culprit.
> > > My patch addresses the problem described in the comment:
> > > /* On 8xx, cache control instructions (particularly
> > >  * "dcbst" from flush_dcache_icache) fault as write
> > >  * operation if there is an unpopulated TLB entry
> > >  * for the address in question. To workaround that,
> > >  * we invalidate the TLB here, thus avoiding dcbst
> > >  * misbehaviour.
> > >  */
> > > Now you are using this old fix to paper over some other bug or so I think.
> >
> > There is something fishy with the TLB status, looking into the mpc862 manual I
> > don't see how it can work reliably. Need to dwell some more.
> > Anyhow, I have incorporated some of my findings into a new patch,
> > perhaps this will be the golden one?
>
> Well, I still wonder about what whole unpopulated entry business.

Yes, is a odd compared to the other ppc arch. I know why it is there
and I also know that there is alternative way to work around the problem
it is supposed to fix. I went back to the old way in my patch but it didn't
help. although I don't see why there is such a benefit to branch to DataAccess
directly, is it so expensive to to a DTLB error and then go to DataAccess?

>
> >From what I can see, the TLB miss code will check _PAGE_PRESENT, and

No, it will branch to before that happens.

> when not set, it will -still- insert something into the TLB (unlike
> all other CPU types that go straight to data access faults from there).

Yes, a non valid pte so that you will trap to DTLB Error.
The other arch also muck around to get load/store bit etc. and
accroding to the 8xx user manual there are no such info available, not even DAR:

Table 6-14. Register Settings after a Data TLB Miss Exception
Register  Setting
SRR0      Set to the EA of the instruction that caused the exception.
SRR1      1–4     0
          10–15 0
          Others Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI].
MSR       IP      No change
          ME      No change
          LE      Copied from the ILE setting of the interrupted process
          Other 0

However from the past we know that DAR is avaliable (but not for dcbX insn though)

For a TLB Error there are:
Table 6-16. Register Settings after a Data TLB Error Exception
Register Setting
SRR0     Set to the EA of the instruction that caused the exception.
SRR1     1–4     0
         10–15 0
         Others—Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI].
MSR      IP      No change
         ME      No change
         LE      Copied from the ILE setting of the interrupted process
         Other   0
DSISR    0       0
         1       Set if the translation of an attempted access is not found in the translation tables. Otherwise, cleared
         2–3     0
         4       Set if the memory access is not permitted by the protection mechanism; otherwise cleared
         5       0
         6       1 for a store operation; 0 for a load operation.
         7–31    0
DAR      Set to the EA of the data access that caused the exception.

For completeness here are ITLB Miss/Error too:
Table 6-13. Register Settings after an Instruction TLB Miss Exception
Register  Setting
SRR0      Set to the EA of the instruction that caused the exception.
SRR1      0–3      0
          4       1
          10      1
          11–15   0
          Others  Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI].
MSR       IP      No change
          ME      No change
          LE      Copied from the ILE setting of the interrupted process
          Other 0

Table 6-15. Register Settings after an Instruction TLB Error Exception
Register Setting
SRR0     Set to the EA of the instruction that caused the exception.
SRR1     Note: Only one of bits 1, 3, and 4 will be set.
         1       1 if the translation of an attempted access is not in the translation tables. Otherwise 0
         2       0
         3       1 if the fetch access was to guarded memory when MSR[IR] = 1. Otherwise 0
         4       1 if the access is not permitted by the protection mechanism; otherwise 0.
         11–15 0
         Others Loaded from MSR[16-31]. SRR1[30] is cleared only by loading a zero from MSR[RI].
MSR      IP      No change
         ME      No change
         LE      Copied from the ILE setting of the interrupted process
         Other   0

>
> So we end up populating with those unpopulated entries that will then
> cause us to take a DSI (or ISI) instead of a TLB miss the next time
> around ... and so we need to remove them once we do that, no ? IE. Once
> we have actually put a valid PTE in.
>
> At least that's my understanding and why I believe we should always have
> tlbil_va() in set_pte() and ptep_set_access_flags(), which boils down
> in putting it into the 2 "filter" functions in the new code.
>
> Well.. actually, the ptep_set_access_flags() will already do a
> flush_tlb_page_nohash() when the PTE is changed. So I suppose all we
> really need here would be in set_pte_filter(), to do the same if we are
> writing a PTE that has _PAGE_PRESENT, at least on 8xx.

I am pondering another idea. I suspect that this bit is set for non-valid ptes:
1       Set if the translation of an attempted access is not found in the translation tables. Otherwise, cleared
So perhaps we can test this bit in do_page_fault() and then do the
tlbil_va() directly?

>
> But just unconditionally doing a tlbil_va() in both the filter functions
> shouldn't harm and also fix the problem, though Rex seems to indicate
> that is not the case.
>
> Now, we -might- have something else broken on 8xx, hard to tell. You may
> want to try to bisect, adding back the removed tlbil_va() as you go
> backward, between .31 and upstream...
>
> Cheers,
> Ben.
>
>
>
>

^ permalink raw reply

* Re: [PATCH] perf_event, powerpc: Fix compilation after big perf_counter rename
From: Ingo Molnar @ 2009-10-01  7:42 UTC (permalink / raw)
  To: Stephen Rothwell
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Paul Mackerras,
	Linus Torvalds, David S. Miller
In-Reply-To: <20090925095841.8578073a.sfr@canb.auug.org.au>


* Stephen Rothwell <sfr@canb.auug.org.au> wrote:

> Hi Ingo,
> 
> On Thu, 24 Sep 2009 23:25:55 +1000 Michael Ellerman <michael@ellerman.id.au> wrote:
> >
> > Give me a day or two, I should be able to add a per-branch setting for
> > who to send mails to without too much trouble.
> 
> In the mean time I don't now if someone has pointed you at these today:
> 
> http://kisskb.ellerman.id.au/kisskb/branch/12/

That's an upstream warning.

-tip supports fail-on-build-warnings build mode (for the whole kernel) 
via the CONFIG_ALLOW_WARNINGS .config setting. So if you do allnoconfig 
builds, make sure you turn on CONFIG_ALLOW_WARNINGS=y to get the same 
build behavior as with Linus's tree.

	Ingo

^ permalink raw reply

* Re: [PATCH] perf_event, powerpc: Fix compilation after big perf_counter rename
From: Stephen Rothwell @ 2009-10-01 11:13 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, linux-kernel, linuxppc-dev, Paul Mackerras,
	Linus Torvalds, David S. Miller
In-Reply-To: <20091001074201.GA6300@elte.hu>

[-- Attachment #1: Type: text/plain, Size: 1372 bytes --]

Hi Ingo,

On Thu, 1 Oct 2009 09:42:01 +0200 Ingo Molnar <mingo@elte.hu> wrote:
>
> * Stephen Rothwell <sfr@canb.auug.org.au> wrote:
> 
> > On Thu, 24 Sep 2009 23:25:55 +1000 Michael Ellerman <michael@ellerman.id.au> wrote:
> > >
> > > Give me a day or two, I should be able to add a per-branch setting for
> > > who to send mails to without too much trouble.
> > 
> > In the mean time I don't now if someone has pointed you at these today:
> > 
> > http://kisskb.ellerman.id.au/kisskb/branch/12/
> 
> That's an upstream warning.

When I sent that to you, I was referring to these build results:

http://kisskb.ellerman.id.au/kisskb/buildresult/1307802/

which (as you say) was only a waning in the rest of the kernel but is
turned into an error by the -Werror we use when building arch/powerpc.
The warning came from commit 4765c1db84c73f775eb1822a009117cbae524e9e
("rcu-tiny: The Bloatwatch Edition, v6") in the -tip tree and fixed by
commit 5ef9b8e59c043624fb44e31439cecf7f8b4cd62a ("rcu: Clean up the
warning message about RCU not defined") the next day (in the -tip tree).
So no big problem.  Neither of these commits are upstream.

I was just filling in the role of notifying you of build problems until
Michael can automate it.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 198 bytes --]

^ permalink raw reply

* [PATCH] powerpc/kgdb: Fix build failure caused by "kgdb.c: unused variable 'acc'"
From: Anton Vorontsov @ 2009-10-01 18:38 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Jason Wessel

'acc' isn't used anywhere and thus triggers gcc warning, which causes
build error with CONFIG_PPC_DISABLE_WERROR=n (default):

  cc1: warnings being treated as errors
  arch/powerpc/kernel/kgdb.c: In function 'gdb_regs_to_pt_regs':
  arch/powerpc/kernel/kgdb.c:289: warning: unused variable 'acc'
  make[1]: *** [arch/powerpc/kernel/kgdb.o] Error 1

Signed-off-by: Anton Vorontsov <avorontsov@ru.mvista.com>
---
 arch/powerpc/kernel/kgdb.c |    6 ------
 1 files changed, 0 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/kgdb.c b/arch/powerpc/kernel/kgdb.c
index fe8f71d..641c74b 100644
--- a/arch/powerpc/kernel/kgdb.c
+++ b/arch/powerpc/kernel/kgdb.c
@@ -282,12 +282,6 @@ void gdb_regs_to_pt_regs(unsigned long *gdb_regs, struct pt_regs *regs)
 {
 	unsigned long *ptr = gdb_regs;
 	int reg;
-#ifdef CONFIG_SPE
-	union {
-		u32 v32[2];
-		u64 v64;
-	} acc;
-#endif
 
 	for (reg = 0; reg < 32; reg++)
 		UNPACK64(regs->gpr[reg], ptr);
-- 
1.6.3.3

^ permalink raw reply related

* Re: [PATCH] powerpc/5200: add LocalPlus bus FIFO device driver
From: John Bonesio @ 2009-10-01 18:45 UTC (permalink / raw)
  To: Albrecht Dreß; +Cc: linuxppc-dev
In-Reply-To: <1254335663.3681.0@antares>

Hi Albrecht,

Thanks for the comments. My replies are below.

- John

On Wed, 2009-09-30 at 20:34 +0200, Albrecht Dreß wrote:
> Hi John:
> 
> Cool stuff - Grant's original one also helped me a lot to get Bestcomm  
> running for me (I use it only for read dma)!
> 
> Am 29.09.09 22:43 schrieb(en) John Bonesio:
> [snip]
> > +		 * It may be worth experimenting with the ALARM value  
> > to see if
> > +		 * there is a performance impacit.  However, if it is  
> > wrong there
> > +		 * is a risk of DMA not transferring the last chunk of  
> > data
> > +		 */
> > +		if (write) {
> > +			out_be32(lpbfifo.regs + LPBFIFO_REG_FIFO_ALARM,  
> > 0x1e4);
> > +			out_8(lpbfifo.regs + LPBFIFO_REG_FIFO_CONTROL,  
> > 7);
> > +			lpbfifo.bcom_cur_task = lpbfifo.bcom_tx_task;
> > +		} else {
> > +			out_be32(lpbfifo.regs + LPBFIFO_REG_FIFO_ALARM,  
> > 0x1ff);
> > +			out_8(lpbfifo.regs + LPBFIFO_REG_FIFO_CONTROL,  
> > 0);
> > +			lpbfifo.bcom_cur_task = lpbfifo.bcom_rx_task;
> 
> Hmmm, the MPC5200B User's manual, sect. 9.7.3.3 LPC Rx/Tx FIFO Control  
> Register (pg. 323) seems to imply that only 0 or 1 are valid settings  
> here.  Where does the 7 come from?

When I read that section, I read that GR is a 3 bit field for high water
mark (the number of free bytes * 4). I didn't read that 0 and 1 are the
only possible values, but that those are two example values to help the
reader understand the meaning of the field.

Nowhere in the text does it say 0 and 1 are the only possible values.
And since GR is a 3 bit field instead of a 1 bit field, that implied
that to me that other values are allowed.

I can see how one could read the text as saying 0 and 1 are the only
values allowed. However, empirical the testing I've done, seems to
indicate that the '7' is a valid value, as it produced different
behavior than 1.

> 
> [snip]
> > +			 * In the DMA read case, the DMA doesn't  
> > complete,
> > +			 * possibly due to incorrect watermarks in the  
> > ALARM
> > +			 * and CONTROL regs. For now instead of trying  
> > to
> > +			 * determine the right watermarks that will  
> > make this
> > +			 * work, just increase the number of bytes the  
> > FIFO is
> > +			 * expecting.
> 
> My experience is that for reads I have to set a granularity of 0 (as  
> you already do) *and* set the 'flush' bit in the SCLPC Control  
> Register.  See the remarks for that bit in the MPC5200B User's manual,  
> pg. 318.  With these settings and my driver, I made several tests  
> reading blocks (in one Bestcomm operation) between 16 bytes and 2  
> MBytes, and *never* had any problem, so your remark here makes me  
> somewhat nervous!

I've put in the suggested change to set the 'flush' bit. It didn't seem
to help. I'm not sure what else might be different between my system and
yours.

I agree that this solution feels "hacky" and is probably not a good long
term solution. I think this solution may need to stay in temporarily
until we can discover a better one.

> 
> In my dma read tests, the watermark (LPC Rx/Tx FIFO Alarm Register)  
> seems to have only a (small) performance impact.  If I set the  
> watermark to 0 (i.e. fill the fifo completely), I *never* saw any  
> under- or overruns.  My assumption is that the fifo would stop reading  
> the peripheral when it is full, and only resume when Bestcomm emptied  
> it at least partly.  You you know if that is really the case?

We were testing using the Bestcomm on a framebuffer to refresh the
display. Without the watermarks, the DMA was getting clogged.

> 
> [snip]
> > +
> > +	bit_fields = req->cs << 24 | 0x000008;
> > +	if (!write)
> > +		bit_fields |= 0x010000; /* read mode */
> 
> See above - should be "bit_fields |= 0x030000" IMHO...
> 
> [snip]
> > +	/* Request the Bestcomm receive (fifo --> memory) task and IRQ  
> > */
> > +	lpbfifo.bcom_rx_task =
> > +		bcom_gen_bd_rx_init(2, res.start +  
> > LPBFIFO_REG_FIFO_DATA,
> > +				    BCOM_INITIATOR_SCLPC,  
> > BCOM_IPR_SCLPC,
> > +				    16*1024*1024);
> 
> Is the limit really correct?  The manual (sect. 9.7.2.1 SCLPC Packet  
> Size Register, pg. 316) says the maximum packet size is 16 M - 1 byte.

You are right. I've made the fix. A new patch will go out soon.

> 
> Cheers,
> Albrecht.
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* Re: [PATCH] powerpc/5200: add LocalPlus bus FIFO device driver
From: Albrecht Dreß @ 2009-10-01 19:31 UTC (permalink / raw)
  To: John Bonesio; +Cc: linuxppc-dev
In-Reply-To: <1254422721.689.0.camel@riker>

[-- Attachment #1: Type: text/plain, Size: 2527 bytes --]

Hi John:

Am 01.10.09 20:45 schrieb(en) John Bonesio:
> Nowhere in the text does it say 0 and 1 are the only possible values.  
> And since GR is a 3 bit field instead of a 1 bit field, that implied  
> that to me that other values are allowed.
> 
> I can see how one could read the text as saying 0 and 1 are the only  
> values allowed. However, empirical the testing I've done, seems to  
> indicate that the '7' is a valid value, as it produced different  
> behavior than 1.

Thanks for the clarification!

> I've put in the suggested change to set the 'flush' bit. It didn't  
> seem to help. I'm not sure what else might be different between my  
> system and yours.

My board is a self-designed one, roughly Icecube-based, with a 5200B.   
The peripheral (actually a second processor) is attached in 16-bit mode  
to the LPB.  The main data flow goes to the 5200, so I have only a task  
for reading data.  As the block transfer is only one part (but with a  
huge block size) in a transaction, I allocate only one block for the  
task, i.e. I call bcom_gen_bd_rx_init(1, ...).

The transaction size is always a multiple of 4 bytes, and I use the  
same value for the FIFO packet size and struct bcom_bd::status.  In the  
Control reg, I set CS, BPT=4, DAI, RWb and Flush.  I can isolate the  
code launching Bestcomm if you're interested.

I ran some further tests today, which are a little confusing...

When a Bestcomm IRQ arrives, the call to bcom_buffer_done() (in the  
isr) will always return 0.  If I just ignore that, and call  
bcom_retrieve_buffer() anyway, the status value will also be 0.   
However, the data block is transferred completely, and it is correct (I  
separately transmit a crc32, which matches, and the data block itself  
looks fine).  Actually, I'm not really sure what this means...

BTW, I noticed that the Bestcomm IRQ arrives *before* the FIFO IRQ,  
which IMHO is a little unlogical.  It may be a result of setting the  
granularity to 0 and the FIFO Alarm level to 4 (which should trigger  
Bestcomm only if one u32 is free in the FIFO - is that correct?  At  
least the drawing on Pg. 11 in Freescale's AN2604 implies that.).

> We were testing using the Bestcomm on a framebuffer to refresh the  
> display. Without the watermarks, the DMA was getting clogged.

I see.  In my case, only the /overall/ time and cpu load for the  
transaction are critical, and there I didn't notice significant  
differences.

Cheers,
Albrecht.

[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers
From: Robert Jennings @ 2009-10-01 19:53 UTC (permalink / raw)
  To: Mel Gorman
  Cc: linux-mm, linux-kernel, linuxppc-dev, Martin Schwidefsky,
	Badari Pulavarty, Brian King, Paul Mackerras, Ingo Molnar

Memory balloon drivers can allocate a large amount of memory which
is not movable but could be freed to accommodate memory hotplug remove.

Prior to calling the memory hotplug notifier chain the memory in the
pageblock is isolated.  If the migrate type is not MIGRATE_MOVABLE the
isolation will not proceed, causing the memory removal for that page
range to fail.

Rather than immediately failing pageblock isolation if the the
migrateteype is not MIGRATE_MOVABLE, this patch checks if all of the
pages in the pageblock are owned by a registered balloon driver using a
notifier chain.  If all of the non-movable pages are owned by a balloon,
they can be freed later through the memory notifier chain and the range
can still be isolated in set_migratetype_isolate().

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 drivers/base/memory.c  |   19 +++++++++++++++++++
 include/linux/memory.h |   22 ++++++++++++++++++++++
 mm/page_alloc.c        |   49 +++++++++++++++++++++++++++++++++++++++++--------
 3 files changed, 82 insertions(+), 8 deletions(-)

Index: b/drivers/base/memory.c
===================================================================
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n
 }
 EXPORT_SYMBOL(unregister_memory_notifier);
 
+static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain);
+
+int register_memory_isolate_notifier(struct notifier_block *nb)
+{
+	return blocking_notifier_chain_register(&memory_isolate_chain, nb);
+}
+EXPORT_SYMBOL(register_memory_isolate_notifier);
+
+void unregister_memory_isolate_notifier(struct notifier_block *nb)
+{
+	blocking_notifier_chain_unregister(&memory_isolate_chain, nb);
+}
+EXPORT_SYMBOL(unregister_memory_isolate_notifier);
+
 /*
  * register_memory - Setup a sysfs device for a memory block
  */
@@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi
 	return blocking_notifier_call_chain(&memory_chain, val, v);
 }
 
+int memory_isolate_notify(unsigned long val, void *v)
+{
+	return blocking_notifier_call_chain(&memory_isolate_chain, val, v);
+}
+
 /*
  * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
  * OK to have direct references to sparsemem variables in here.
Index: b/include/linux/memory.h
===================================================================
--- a/include/linux/memory.h
+++ b/include/linux/memory.h
@@ -50,6 +50,14 @@ struct memory_notify {
 	int status_change_nid;
 };
 
+#define MEM_ISOLATE_COUNT	(1<<0)
+
+struct memory_isolate_notify {
+	unsigned long start_addr;
+	unsigned int nr_pages;
+	unsigned int pages_found;
+};
+
 struct notifier_block;
 struct mem_section;
 
@@ -76,14 +84,28 @@ static inline int memory_notify(unsigned
 {
 	return 0;
 }
+static inline int register_memory_isolate_notifier(struct notifier_block *nb)
+{
+	return 0;
+}
+static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
+{
+}
+static inline int memory_isolate_notify(unsigned long val, void *v)
+{
+	return 0;
+}
 #else
 extern int register_memory_notifier(struct notifier_block *nb);
 extern void unregister_memory_notifier(struct notifier_block *nb);
+extern int register_memory_isolate_notifier(struct notifier_block *nb);
+extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
 extern int register_new_memory(int, struct mem_section *);
 extern int unregister_memory_section(struct mem_section *);
 extern int memory_dev_init(void);
 extern int remove_memory_block(unsigned long, struct mem_section *, int);
 extern int memory_notify(unsigned long val, void *v);
+extern int memory_isolate_notify(unsigned long val, void *v);
 extern struct memory_block *find_memory_block(struct mem_section *);
 #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
 enum mem_add_context { BOOT, HOTPLUG };
Index: b/mm/page_alloc.c
===================================================================
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -48,6 +48,7 @@
 #include <linux/page_cgroup.h>
 #include <linux/debugobjects.h>
 #include <linux/kmemleak.h>
+#include <linux/memory.h>
 #include <trace/events/kmem.h>
 
 #include <asm/tlbflush.h>
@@ -4985,23 +4986,55 @@ void set_pageblock_flags_group(struct pa
 int set_migratetype_isolate(struct page *page)
 {
 	struct zone *zone;
-	unsigned long flags;
+	unsigned long flags, pfn, iter;
+	long immobile = 0;
+	struct memory_isolate_notify arg;
+	int notifier_ret;
 	int ret = -EBUSY;
 	int zone_idx;
 
 	zone = page_zone(page);
 	zone_idx = zone_idx(zone);
+
+	pfn = page_to_pfn(page);
+	arg.start_addr = (unsigned long)page_address(page);
+	arg.nr_pages = pageblock_nr_pages;
+	arg.pages_found = 0;
+
 	spin_lock_irqsave(&zone->lock, flags);
 	/*
 	 * In future, more migrate types will be able to be isolation target.
 	 */
-	if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE &&
-	    zone_idx != ZONE_MOVABLE)
-		goto out;
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
-	move_freepages_block(zone, page, MIGRATE_ISOLATE);
-	ret = 0;
-out:
+	do {
+		if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE &&
+		    zone_idx == ZONE_MOVABLE) {
+			ret = 0;
+			break;
+		}
+
+		/*
+		 * If all of the pages in a zone are used by a balloon,
+		 * the range can be still be isolated.  The balloon will
+		 * free these pages from the memory notifier chain.
+		 */
+		notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
+		notifier_ret = notifier_to_errno(ret);
+		if (notifier_ret || !arg.pages_found)
+			break;
+
+		for (iter = pfn; iter < (pfn + pageblock_nr_pages); iter++)
+			if (page_count(pfn_to_page(iter)))
+				immobile++;
+
+		if (arg.pages_found == immobile)
+			ret = 0;
+	} while (0);
+
+	if (!ret) {
+		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
+		move_freepages_block(zone, page, MIGRATE_ISOLATE);
+	}
+
 	spin_unlock_irqrestore(&zone->lock, flags);
 	if (!ret)
 		drain_all_pages();

^ permalink raw reply

* [PATCH] powerpc/5200: make BestComm gen_bd microcode exchangeable
From: Albrecht Dreß @ 2009-10-01 19:55 UTC (permalink / raw)
  To: grant.likely, linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 3391 bytes --]

This patch adds a method for defining different microcodes than the  
pe-defined ones for the MPC52xx processor's BestComm General Buffer  
Descriptor (gen_db) tasks.  The default microcode is still the one from  
bcom_gen_bd_[rt]x_task, but it can be replaced by calling  
bcom_gen_bd_set_microcode() which is more efficient than explicitly  
loading it via bcom_load_image() after each bcom_gen_bd_[rt]x_reset().

Signed-off-by: Albrecht Dreß <albrecht.dress@arcor.de>


---


diff -uprN -X linux-2.6.30.3/Documentation/dontdiff  
linux-2.6.30.3.orig/arch/powerpc/sysdev/bestcomm/gen_bd.c  
linux-2.6.30.3/arch/powerpc/sysdev/bestcomm/gen_bd.c
--- linux-2.6.30.3.orig/arch/powerpc/sysdev/bestcomm/gen_bd.c	 
2009-07-24 23:47:51.000000000 +0200
+++ linux-2.6.30.3/arch/powerpc/sysdev/bestcomm/gen_bd.c	 
2009-10-01 14:26:33.000000000 +0200
@@ -78,6 +78,7 @@ struct bcom_gen_bd_priv {
  	int		initiator;
  	int		ipr;
  	int		maxbufsize;
+	u32    *microcode;
  };


@@ -104,6 +105,7 @@ bcom_gen_bd_rx_init(int queue_len, phys_
  	priv->initiator	= initiator;
  	priv->ipr	= ipr;
  	priv->maxbufsize = maxbufsize;
+	priv->microcode = bcom_gen_bd_rx_task;

  	if (bcom_gen_bd_rx_reset(tsk)) {
  		bcom_task_free(tsk);
@@ -128,7 +130,7 @@ bcom_gen_bd_rx_reset(struct bcom_task *t
  	var = (struct bcom_gen_bd_rx_var *) bcom_task_var(tsk->tasknum);
  	inc = (struct bcom_gen_bd_rx_inc *) bcom_task_inc(tsk->tasknum);

-	if (bcom_load_image(tsk->tasknum, bcom_gen_bd_rx_task))
+	if (bcom_load_image(tsk->tasknum, priv->microcode))
  		return -1;

  	var->enable	= bcom_eng->regs_base +
@@ -188,6 +190,7 @@ bcom_gen_bd_tx_init(int queue_len, phys_
  	priv->fifo	= fifo;
  	priv->initiator	= initiator;
  	priv->ipr	= ipr;
+	priv->microcode = bcom_gen_bd_tx_task;

  	if (bcom_gen_bd_tx_reset(tsk)) {
  		bcom_task_free(tsk);
@@ -212,7 +215,7 @@ bcom_gen_bd_tx_reset(struct bcom_task *t
  	var = (struct bcom_gen_bd_tx_var *) bcom_task_var(tsk->tasknum);
  	inc = (struct bcom_gen_bd_tx_inc *) bcom_task_inc(tsk->tasknum);

-	if (bcom_load_image(tsk->tasknum, bcom_gen_bd_tx_task))
+	if (bcom_load_image(tsk->tasknum, priv->microcode))
  		return -1;

  	var->enable	= bcom_eng->regs_base +
@@ -253,6 +256,16 @@ bcom_gen_bd_tx_release(struct bcom_task
  }
  EXPORT_SYMBOL_GPL(bcom_gen_bd_tx_release);

+void
+bcom_gen_bd_set_microcode(struct bcom_task *tsk, u32 *microcode)
+{
+	struct bcom_gen_bd_priv *priv = tsk->priv;
+
+	priv->microcode = microcode;
+}
+EXPORT_SYMBOL_GPL(bcom_gen_bd_set_microcode);
+
+
  /*  
---------------------------------------------------------------------
   * PSC support code
   */
diff -uprN -X linux-2.6.30.3/Documentation/dontdiff  
linux-2.6.30.3.orig/arch/powerpc/sysdev/bestcomm/gen_bd.h  
linux-2.6.30.3/arch/powerpc/sysdev/bestcomm/gen_bd.h
--- linux-2.6.30.3.orig/arch/powerpc/sysdev/bestcomm/gen_bd.h	 
2009-07-24 23:47:51.000000000 +0200
+++ linux-2.6.30.3/arch/powerpc/sysdev/bestcomm/gen_bd.h	 
2009-10-01 14:26:50.000000000 +0200
@@ -43,6 +43,9 @@ bcom_gen_bd_tx_reset(struct bcom_task *t
  extern void
  bcom_gen_bd_tx_release(struct bcom_task *tsk);

+extern void
+bcom_gen_bd_set_microcode(struct bcom_task *tsk, u32 *microcode);
+

  /* PSC support utility wrappers */
  struct bcom_task * bcom_psc_gen_bd_rx_init(unsigned psc_num, int  
queue_len,


[-- Attachment #2: Type: application/pgp-signature, Size: 189 bytes --]

^ permalink raw reply

* [PATCH 2/2] powerpc: Make the CMM memory hotplug aware
From: Robert Jennings @ 2009-10-01 19:56 UTC (permalink / raw)
  To: Mel Gorman, Ingo Molnar, Badari Pulavarty, Robert Jennings,
	Brian King, Benjamin Herrenschmidt, Paul Mackerras,
	Martin Schwidefsky, linux-kernel, linux-mm, linuxppc-dev
In-Reply-To: <20091001195311.GA16667@austin.ibm.com>

The Collaborative Memory Manager (CMM) module allocates individual pages
over time that are not migratable.  On a long running system this can
severely impact the ability to find enough pages to support a hotplug
memory remove operation.

This patch adds a memory isolation notifier and a memory hotplug notifier.
The memory isolation notifier will return the number of pages found in
the range specified.  This is used to determine if all of the pages in
a pageblock are owned by the balloon.  The hotplug notifier will free
pages in the range which is to be removed.  The priority of the hotplug
notifier is low so that it will be called near last, this helps avoids
removing loaned pages in operations that fail due to other handlers.

CMM activity will be halted when hotplug remove operations are active
and resume activity after a delay period to allow the hypervisor time
to adjust.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 arch/powerpc/platforms/pseries/cmm.c |  211 ++++++++++++++++++++++++++++++++++-
 1 file changed, 205 insertions(+), 6 deletions(-)

Index: b/arch/powerpc/platforms/pseries/cmm.c
===================================================================
--- a/arch/powerpc/platforms/pseries/cmm.c
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -38,19 +38,28 @@
 #include <asm/mmu.h>
 #include <asm/pgalloc.h>
 #include <asm/uaccess.h>
+#include <linux/memory.h>
 
 #include "plpar_wrappers.h"
 
 #define CMM_DRIVER_VERSION	"1.0.0"
 #define CMM_DEFAULT_DELAY	1
+#define CMM_HOTPLUG_DELAY	5
 #define CMM_DEBUG			0
 #define CMM_DISABLE		0
 #define CMM_OOM_KB		1024
 #define CMM_MIN_MEM_MB		256
 #define KB2PAGES(_p)		((_p)>>(PAGE_SHIFT-10))
 #define PAGES2KB(_p)		((_p)<<(PAGE_SHIFT-10))
+/*
+ * The priority level tries to ensure that this notifier is called as
+ * late as possible to reduce thrashing in the shared memory pool.
+ */
+#define CMM_MEM_HOTPLUG_PRI	1
+#define CMM_MEM_ISOLATE_PRI	15
 
 static unsigned int delay = CMM_DEFAULT_DELAY;
+static unsigned int hotplug_delay = CMM_HOTPLUG_DELAY;
 static unsigned int oom_kb = CMM_OOM_KB;
 static unsigned int cmm_debug = CMM_DEBUG;
 static unsigned int cmm_disabled = CMM_DISABLE;
@@ -65,6 +74,10 @@ MODULE_VERSION(CMM_DRIVER_VERSION);
 module_param_named(delay, delay, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(delay, "Delay (in seconds) between polls to query hypervisor paging requests. "
 		 "[Default=" __stringify(CMM_DEFAULT_DELAY) "]");
+module_param_named(hotplug_delay, hotplug_delay, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(delay, "Delay (in seconds) after memory hotplug remove "
+		 "before activity resumes. "
+		 "[Default=" __stringify(CMM_HOTPLUG_DELAY) "]");
 module_param_named(oom_kb, oom_kb, uint, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(oom_kb, "Amount of memory in kb to free on OOM. "
 		 "[Default=" __stringify(CMM_OOM_KB) "]");
@@ -88,6 +101,8 @@ struct cmm_page_array {
 static unsigned long loaned_pages;
 static unsigned long loaned_pages_target;
 static unsigned long oom_freed_pages;
+static atomic_t hotplug_active = ATOMIC_INIT(0);
+static atomic_t hotplug_occurred = ATOMIC_INIT(0);
 
 static struct cmm_page_array *cmm_page_list;
 static DEFINE_SPINLOCK(cmm_lock);
@@ -110,6 +125,9 @@ static long cmm_alloc_pages(long nr)
 	cmm_dbg("Begin request for %ld pages\n", nr);
 
 	while (nr) {
+		if (atomic_read(&hotplug_active))
+			break;
+
 		addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
 				       __GFP_NORETRY | __GFP_NOMEMALLOC);
 		if (!addr)
@@ -119,8 +137,10 @@ static long cmm_alloc_pages(long nr)
 		if (!pa || pa->index >= CMM_NR_PAGES) {
 			/* Need a new page for the page list. */
 			spin_unlock(&cmm_lock);
-			npa = (struct cmm_page_array *)__get_free_page(GFP_NOIO | __GFP_NOWARN |
-								       __GFP_NORETRY | __GFP_NOMEMALLOC);
+			npa = (struct cmm_page_array *)__get_free_page(
+					GFP_NOIO | __GFP_NOWARN |
+					__GFP_NORETRY | __GFP_NOMEMALLOC |
+					__GFP_MOVABLE);
 			if (!npa) {
 				pr_info("%s: Can not allocate new page list\n", __func__);
 				free_page(addr);
@@ -273,9 +293,23 @@ static int cmm_thread(void *dummy)
 	while (1) {
 		timeleft = msleep_interruptible(delay * 1000);
 
-		if (kthread_should_stop() || timeleft) {
-			loaned_pages_target = loaned_pages;
+		if (kthread_should_stop() || timeleft)
 			break;
+
+		if (atomic_read(&hotplug_active)) {
+			cmm_dbg("Hotplug operation in progress, activity "
+					"suspended\n");
+			continue;
+		}
+
+		if (atomic_dec_if_positive(&hotplug_occurred) >= 0) {
+			cmm_dbg("Hotplug operation has occurred, loaning "
+					"activity suspended for %d seconds.\n",
+					hotplug_delay);
+			timeleft = msleep_interruptible(hotplug_delay * 1000);
+			if (kthread_should_stop() || timeleft)
+				break;
+			continue;
 		}
 
 		cmm_get_mpp();
@@ -405,6 +439,163 @@ static struct notifier_block cmm_reboot_
 };
 
 /**
+ * cmm_count_pages - Count the number of pages loaned in a particular range.
+ *
+ * @arg: memory_isolate_notify structure with address range and count
+ *
+ * Return value:
+ *      0 on success
+ **/
+static unsigned long cmm_count_pages(void *arg)
+{
+	struct memory_isolate_notify *marg = arg;
+	struct cmm_page_array *pa;
+	unsigned long idx;
+	unsigned long start;
+	unsigned long end;
+
+	start = marg->start_addr;
+	end = start + ((unsigned long)marg->nr_pages << PAGE_SHIFT);
+
+	spin_lock(&cmm_lock);
+	pa = cmm_page_list;
+	while (pa) {
+		for (idx = 0; idx < pa->index; idx++)
+			if (pa->page[idx] >= start && pa->page[idx] < end)
+				marg->pages_found++;
+		pa = pa->next;
+
+	}
+	spin_unlock(&cmm_lock);
+	return 0;
+}
+
+/**
+ * cmm_memory_isolate_cb - Handle memory isolation notifier calls
+ * @self:	notifier block struct
+ * @action:	action to take
+ * @arg:	struct memory_isolate_notify data for handler
+ *
+ * Return value:
+ *	NOTIFY_OK or notifier error based on subfunction return value
+ **/
+static int cmm_memory_isolate_cb(struct notifier_block *self,
+				 unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	if (action == MEM_ISOLATE_COUNT)
+		ret = cmm_count_pages(arg);
+
+	if (ret)
+		ret = notifier_from_errno(ret);
+	else
+		ret = NOTIFY_OK;
+
+	return ret;
+}
+
+static struct notifier_block cmm_mem_isolate_nb = {
+	.notifier_call = cmm_memory_isolate_cb,
+	.priority = CMM_MEM_ISOLATE_PRI
+};
+
+/**
+ * cmm_mem_going_offline - Unloan pages where memory is to be removed
+ * @arg: memory_notify structure with page range to be offlined
+ *
+ * Return value:
+ *	0 on success
+ **/
+static int cmm_mem_going_offline(void *arg)
+{
+	struct memory_notify *marg = arg;
+	unsigned long start_page = (unsigned long)pfn_to_kaddr(marg->start_pfn);
+	unsigned long end_page = start_page + (marg->nr_pages << PAGE_SHIFT);
+	struct cmm_page_array *pa_curr, *pa_last;
+	unsigned long idx;
+	unsigned long freed = 0;
+
+	cmm_dbg("Memory going offline, searching 0x%lx (%ld pages).\n",
+			start_page, marg->nr_pages);
+	spin_lock(&cmm_lock);
+
+	pa_last = pa_curr = cmm_page_list;
+	while (pa_curr) {
+		for (idx = (pa_curr->index - 1); (idx + 1) > 0; idx--) {
+			if ((pa_curr->page[idx] < start_page) ||
+			    (pa_curr->page[idx] >= end_page))
+				continue;
+
+			plpar_page_set_active(__pa(pa_curr->page[idx]));
+			free_page(pa_curr->page[idx]);
+			freed++;
+			loaned_pages--;
+			totalram_pages++;
+			pa_curr->page[idx] = pa_last->page[--pa_last->index];
+			if (pa_last->index == 0) {
+				if (pa_curr == pa_last)
+					pa_curr = pa_last->next;
+				pa_last = pa_last->next;
+				free_page((unsigned long)cmm_page_list);
+				cmm_page_list = pa_last;
+				continue;
+			}
+		}
+		pa_curr = pa_curr->next;
+	}
+	atomic_set(&hotplug_occurred, 1);
+	spin_unlock(&cmm_lock);
+	cmm_dbg("Released %ld pages in the search range.\n", freed);
+
+	return 0;
+}
+
+/**
+ * cmm_memory_cb - Handle memory hotplug notifier calls
+ * @self:	notifier block struct
+ * @action:	action to take
+ * @arg:	struct memory_notify data for handler
+ *
+ * Return value:
+ *	NOTIFY_OK or notifier error based on subfunction return value
+ *
+ **/
+static int cmm_memory_cb(struct notifier_block *self,
+			unsigned long action, void *arg)
+{
+	int ret = 0;
+
+	switch (action) {
+	case MEM_GOING_OFFLINE:
+		atomic_set(&hotplug_active, 1);
+		ret = cmm_mem_going_offline(arg);
+		break;
+	case MEM_OFFLINE:
+	case MEM_CANCEL_OFFLINE:
+		atomic_set(&hotplug_active, 0);
+		cmm_dbg("Memory offline operation complete.\n");
+		break;
+	case MEM_GOING_ONLINE:
+	case MEM_ONLINE:
+	case MEM_CANCEL_ONLINE:
+		break;
+	}
+
+	if (ret)
+		ret = notifier_from_errno(ret);
+	else
+		ret = NOTIFY_OK;
+
+	return ret;
+}
+
+static struct notifier_block cmm_mem_nb = {
+	.notifier_call = cmm_memory_cb,
+	.priority = CMM_MEM_HOTPLUG_PRI
+};
+
+/**
  * cmm_init - Module initialization
  *
  * Return value:
@@ -426,18 +617,24 @@ static int cmm_init(void)
 	if ((rc = cmm_sysfs_register(&cmm_sysdev)))
 		goto out_reboot_notifier;
 
+	if (register_memory_notifier(&cmm_mem_nb) ||
+	    register_memory_isolate_notifier(&cmm_mem_isolate_nb))
+		goto out_unregister_notifier;
+
 	if (cmm_disabled)
 		return rc;
 
 	cmm_thread_ptr = kthread_run(cmm_thread, NULL, "cmmthread");
 	if (IS_ERR(cmm_thread_ptr)) {
 		rc = PTR_ERR(cmm_thread_ptr);
-		goto out_unregister_sysfs;
+		goto out_unregister_notifier;
 	}
 
 	return rc;
 
-out_unregister_sysfs:
+out_unregister_notifier:
+	unregister_memory_notifier(&cmm_mem_nb);
+	unregister_memory_isolate_notifier(&cmm_mem_isolate_nb);
 	cmm_unregister_sysfs(&cmm_sysdev);
 out_reboot_notifier:
 	unregister_reboot_notifier(&cmm_reboot_nb);
@@ -458,6 +655,8 @@ static void cmm_exit(void)
 		kthread_stop(cmm_thread_ptr);
 	unregister_oom_notifier(&cmm_oom_nb);
 	unregister_reboot_notifier(&cmm_reboot_nb);
+	unregister_memory_notifier(&cmm_mem_nb);
+	unregister_memory_isolate_notifier(&cmm_mem_isolate_nb);
 	cmm_free_pages(loaned_pages);
 	cmm_unregister_sysfs(&cmm_sysdev);
 }

^ permalink raw reply

* Re: [PATCH 1/2] mm: add notifier in pageblock isolation for balloon drivers
From: Nathan Fontenot @ 2009-10-01 21:35 UTC (permalink / raw)
  To: Mel Gorman, Ingo Molnar, Badari Pulavarty, Brian King,
	Benjamin Herrenschmidt, Paul Mackerras, Martin Schwidefsky,
	linux-kernel, linux-mm, linuxppc-dev
In-Reply-To: <20091001195311.GA16667@austin.ibm.com>

Robert Jennings wrote:
> Memory balloon drivers can allocate a large amount of memory which
> is not movable but could be freed to accommodate memory hotplug remove.
> 
> Prior to calling the memory hotplug notifier chain the memory in the
> pageblock is isolated.  If the migrate type is not MIGRATE_MOVABLE the
> isolation will not proceed, causing the memory removal for that page
> range to fail.
> 
> Rather than immediately failing pageblock isolation if the the
> migrateteype is not MIGRATE_MOVABLE, this patch checks if all of the
> pages in the pageblock are owned by a registered balloon driver using a
> notifier chain.  If all of the non-movable pages are owned by a balloon,
> they can be freed later through the memory notifier chain and the range
> can still be isolated in set_migratetype_isolate().
> 
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
> 
> ---
>  drivers/base/memory.c  |   19 +++++++++++++++++++
>  include/linux/memory.h |   22 ++++++++++++++++++++++
>  mm/page_alloc.c        |   49 +++++++++++++++++++++++++++++++++++++++++--------
>  3 files changed, 82 insertions(+), 8 deletions(-)
> 
> Index: b/drivers/base/memory.c
> ===================================================================
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -63,6 +63,20 @@ void unregister_memory_notifier(struct n
>  }
>  EXPORT_SYMBOL(unregister_memory_notifier);
>  
> +static BLOCKING_NOTIFIER_HEAD(memory_isolate_chain);
> +
> +int register_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +	return blocking_notifier_chain_register(&memory_isolate_chain, nb);
> +}
> +EXPORT_SYMBOL(register_memory_isolate_notifier);
> +
> +void unregister_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +	blocking_notifier_chain_unregister(&memory_isolate_chain, nb);
> +}
> +EXPORT_SYMBOL(unregister_memory_isolate_notifier);
> +
>  /*
>   * register_memory - Setup a sysfs device for a memory block
>   */
> @@ -157,6 +171,11 @@ int memory_notify(unsigned long val, voi
>  	return blocking_notifier_call_chain(&memory_chain, val, v);
>  }
>  
> +int memory_isolate_notify(unsigned long val, void *v)
> +{
> +	return blocking_notifier_call_chain(&memory_isolate_chain, val, v);
> +}
> +
>  /*
>   * MEMORY_HOTPLUG depends on SPARSEMEM in mm/Kconfig, so it is
>   * OK to have direct references to sparsemem variables in here.
> Index: b/include/linux/memory.h
> ===================================================================
> --- a/include/linux/memory.h
> +++ b/include/linux/memory.h
> @@ -50,6 +50,14 @@ struct memory_notify {
>  	int status_change_nid;
>  };
>  
> +#define MEM_ISOLATE_COUNT	(1<<0)
> +
> +struct memory_isolate_notify {
> +	unsigned long start_addr;
> +	unsigned int nr_pages;
> +	unsigned int pages_found;
> +};
> +
>  struct notifier_block;
>  struct mem_section;
>  
> @@ -76,14 +84,28 @@ static inline int memory_notify(unsigned
>  {
>  	return 0;
>  }
> +static inline int register_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +	return 0;
> +}
> +static inline void unregister_memory_isolate_notifier(struct notifier_block *nb)
> +{
> +}
> +static inline int memory_isolate_notify(unsigned long val, void *v)
> +{
> +	return 0;
> +}
>  #else
>  extern int register_memory_notifier(struct notifier_block *nb);
>  extern void unregister_memory_notifier(struct notifier_block *nb);
> +extern int register_memory_isolate_notifier(struct notifier_block *nb);
> +extern void unregister_memory_isolate_notifier(struct notifier_block *nb);
>  extern int register_new_memory(int, struct mem_section *);
>  extern int unregister_memory_section(struct mem_section *);
>  extern int memory_dev_init(void);
>  extern int remove_memory_block(unsigned long, struct mem_section *, int);
>  extern int memory_notify(unsigned long val, void *v);
> +extern int memory_isolate_notify(unsigned long val, void *v);
>  extern struct memory_block *find_memory_block(struct mem_section *);
>  #define CONFIG_MEM_BLOCK_SIZE	(PAGES_PER_SECTION<<PAGE_SHIFT)
>  enum mem_add_context { BOOT, HOTPLUG };
> Index: b/mm/page_alloc.c
> ===================================================================
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -48,6 +48,7 @@
>  #include <linux/page_cgroup.h>
>  #include <linux/debugobjects.h>
>  #include <linux/kmemleak.h>
> +#include <linux/memory.h>
>  #include <trace/events/kmem.h>
>  
>  #include <asm/tlbflush.h>
> @@ -4985,23 +4986,55 @@ void set_pageblock_flags_group(struct pa
>  int set_migratetype_isolate(struct page *page)
>  {
>  	struct zone *zone;
> -	unsigned long flags;
> +	unsigned long flags, pfn, iter;
> +	long immobile = 0;
> +	struct memory_isolate_notify arg;
> +	int notifier_ret;
>  	int ret = -EBUSY;
>  	int zone_idx;
>  
>  	zone = page_zone(page);
>  	zone_idx = zone_idx(zone);
> +
> +	pfn = page_to_pfn(page);
> +	arg.start_addr = (unsigned long)page_address(page);
> +	arg.nr_pages = pageblock_nr_pages;
> +	arg.pages_found = 0;
> +
>  	spin_lock_irqsave(&zone->lock, flags);
>  	/*
>  	 * In future, more migrate types will be able to be isolation target.
>  	 */
> -	if (get_pageblock_migratetype(page) != MIGRATE_MOVABLE &&
> -	    zone_idx != ZONE_MOVABLE)
> -		goto out;
> -	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> -	move_freepages_block(zone, page, MIGRATE_ISOLATE);
> -	ret = 0;
> -out:
> +	do {
> +		if (get_pageblock_migratetype(page) == MIGRATE_MOVABLE &&
> +		    zone_idx == ZONE_MOVABLE) {
> +			ret = 0;
> +			break;
> +		}
> +
> +		/*
> +		 * If all of the pages in a zone are used by a balloon,
> +		 * the range can be still be isolated.  The balloon will
> +		 * free these pages from the memory notifier chain.
> +		 */
> +		notifier_ret = memory_isolate_notify(MEM_ISOLATE_COUNT, &arg);
> +		notifier_ret = notifier_to_errno(ret);

Should this be

		notifier_ret = notifier_to_errno(notifier_ret);

-Nathan

> +		if (notifier_ret || !arg.pages_found)
> +			break;
> +
> +		for (iter = pfn; iter < (pfn + pageblock_nr_pages); iter++)
> +			if (page_count(pfn_to_page(iter)))
> +				immobile++;
> +
> +		if (arg.pages_found == immobile)
> +			ret = 0;
> +	} while (0);
> +
> +	if (!ret) {
> +		set_pageblock_migratetype(page, MIGRATE_ISOLATE);
> +		move_freepages_block(zone, page, MIGRATE_ISOLATE);
> +	}
> +
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  	if (!ret)
>  		drain_all_pages();
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev

^ permalink raw reply

* patch powerpc-fix-incorrect-setting-of-__have_arch_pte_special.patch added to 2.6.31-stable tree
From: gregkh @ 2009-10-01 22:46 UTC (permalink / raw)
  To: Bernhard.Weirich, benh, bernhard.weirich, gregkh, linuxppc-dev,
	stable
  Cc: stable, stable-commits
In-Reply-To: <1253776613.7103.433.camel@pasglop>


This is a note to let you know that we have just queued up the patch titled

    Subject: powerpc: Fix incorrect setting of __HAVE_ARCH_PTE_SPECIAL

to the 2.6.31-stable tree.  Its filename is

    powerpc-fix-incorrect-setting-of-__have_arch_pte_special.patch

A git repo of this tree can be found at 
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary


>From benh@kernel.crashing.org  Thu Oct  1 15:36:15 2009
From: Weirich, Bernhard <Bernhard.Weirich@riedel.net>
Date: Thu, 24 Sep 2009 17:16:53 +1000
Subject: powerpc: Fix incorrect setting of __HAVE_ARCH_PTE_SPECIAL
To: stable <stable@kernel.org>
Cc: linuxppc-dev list <linuxppc-dev@ozlabs.org>, bernhard.weirich@riedel.net, RFeany@mrv.com
Message-ID: <1253776613.7103.433.camel@pasglop>


From: Weirich, Bernhard <Bernhard.Weirich@riedel.net>

[I'm going to fix upstream differently, by having all CPU types
actually support _PAGE_SPECIAL, but I prefer the simple and obvious
fix for -stable. -- Ben]

The test that decides whether to define __HAVE_ARCH_PTE_SPECIAL on
powerpc is bogus and will end up always defining it, even when
_PAGE_SPECIAL is not supported (in which case it's 0) such as on
8xx or 40x processors.

Signed-off-by: Bernhard Weirich <bernhard.weirich@riedel.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


---
 arch/powerpc/include/asm/pte-common.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -176,7 +176,7 @@ extern unsigned long bad_call_to_PMD_PAG
 #define HAVE_PAGE_AGP
 
 /* Advertise support for _PAGE_SPECIAL */
-#ifdef _PAGE_SPECIAL
+#if _PAGE_SPECIAL != 0
 #define __HAVE_ARCH_PTE_SPECIAL
 #endif
 


Patches currently in stable-queue which might be from Bernhard.Weirich@riedel.net are

queue-2.6.31/powerpc-fix-incorrect-setting-of-__have_arch_pte_special.patch

^ permalink raw reply

* patch powerpc-fix-incorrect-setting-of-__have_arch_pte_special.patch added to 2.6.30-stable tree
From: gregkh @ 2009-10-01 22:51 UTC (permalink / raw)
  To: Bernhard.Weirich, benh, bernhard.weirich, gregkh, linuxppc-dev,
	stable
  Cc: stable, stable-commits
In-Reply-To: <1253776613.7103.433.camel@pasglop>


This is a note to let you know that we have just queued up the patch titled

    Subject: powerpc: Fix incorrect setting of __HAVE_ARCH_PTE_SPECIAL

to the 2.6.30-stable tree.  Its filename is

    powerpc-fix-incorrect-setting-of-__have_arch_pte_special.patch

A git repo of this tree can be found at 
    http://www.kernel.org/git/?p=linux/kernel/git/stable/stable-queue.git;a=summary


>From benh@kernel.crashing.org  Thu Oct  1 15:36:15 2009
From: Weirich, Bernhard <Bernhard.Weirich@riedel.net>
Date: Thu, 24 Sep 2009 17:16:53 +1000
Subject: powerpc: Fix incorrect setting of __HAVE_ARCH_PTE_SPECIAL
To: stable <stable@kernel.org>
Cc: linuxppc-dev list <linuxppc-dev@ozlabs.org>, bernhard.weirich@riedel.net, RFeany@mrv.com
Message-ID: <1253776613.7103.433.camel@pasglop>


From: Weirich, Bernhard <Bernhard.Weirich@riedel.net>

[I'm going to fix upstream differently, by having all CPU types
actually support _PAGE_SPECIAL, but I prefer the simple and obvious
fix for -stable. -- Ben]

The test that decides whether to define __HAVE_ARCH_PTE_SPECIAL on
powerpc is bogus and will end up always defining it, even when
_PAGE_SPECIAL is not supported (in which case it's 0) such as on
8xx or 40x processors.

Signed-off-by: Bernhard Weirich <bernhard.weirich@riedel.net>
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@suse.de>


---
 arch/powerpc/include/asm/pte-common.h |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/powerpc/include/asm/pte-common.h
+++ b/arch/powerpc/include/asm/pte-common.h
@@ -176,7 +176,7 @@ extern unsigned long bad_call_to_PMD_PAG
 #define HAVE_PAGE_AGP
 
 /* Advertise support for _PAGE_SPECIAL */
-#ifdef _PAGE_SPECIAL
+#if _PAGE_SPECIAL != 0
 #define __HAVE_ARCH_PTE_SPECIAL
 #endif
 


Patches currently in stable-queue which might be from Bernhard.Weirich@riedel.net are

queue-2.6.30/powerpc-fix-incorrect-setting-of-__have_arch_pte_special.patch

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox