LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 1/6] powerpc: Move branch instruction from ACCOUNT_CPU_USER_ENTRY to caller
From: Haren Myneni @ 2012-09-11  5:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulus, mikey, linuxppc-dev, anton
In-Reply-To: <1347235505.2385.134.camel@pasglop>

On 09/09/2012 05:05 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2012-09-09 at 04:36 -0700, Haren Myneni wrote:
>> The first instruction in ACCOUNT_CPU_USER_ENTRY is 'beq' which checkes for
>> exceptions coming from kernel mode. PPR value will be saved immediately after
>> ACCOUNT_CPU_USER_ENTRY and is also for user level exceptions. So moved this
>> branch instruction in the caller code.
> 
> grep fail ? ACCOUNT_CPU_USER_ENTRY is used in exception-64e.S as well,
> so that needs to be updated too.

Sorry, I will make this change.

> 
> Cheers,
> Ben.
> 
>> Signed-off-by: Haren Myneni <haren@us.ibm.com>
>>
>> ---
>>  arch/powerpc/include/asm/exception-64s.h |    3 ++-
>>  arch/powerpc/include/asm/ppc_asm.h       |    2 --
>>  arch/powerpc/kernel/entry_64.S           |    3 ++-
>>  3 files changed, 4 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
>> index a43c147..45702e0 100644
>> --- a/arch/powerpc/include/asm/exception-64s.h
>> +++ b/arch/powerpc/include/asm/exception-64s.h
>> @@ -176,8 +176,9 @@ do_kvm_##n:								\
>>  	std	r10,0(r1);		/* make stack chain pointer	*/ \
>>  	std	r0,GPR0(r1);		/* save r0 in stackframe	*/ \
>>  	std	r10,GPR1(r1);		/* save r1 in stackframe	*/ \
>> +	beq	4f;			/* if from kernel mode		*/ \
>>  	ACCOUNT_CPU_USER_ENTRY(r9, r10);				   \
>> -	std	r2,GPR2(r1);		/* save r2 in stackframe	*/ \
>> +4:	std	r2,GPR2(r1);		/* save r2 in stackframe	*/ \
>>  	SAVE_4GPRS(3, r1);		/* save r3 - r6 in stackframe	*/ \
>>  	SAVE_2GPRS(7, r1);		/* save r7, r8 in stackframe	*/ \
>>  	ld	r9,area+EX_R9(r13);	/* move r9, r10 to stackframe	*/ \
>> diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
>> index ea2a86e..376e36d 100644
>> --- a/arch/powerpc/include/asm/ppc_asm.h
>> +++ b/arch/powerpc/include/asm/ppc_asm.h
>> @@ -30,7 +30,6 @@
>>  #define ACCOUNT_STOLEN_TIME
>>  #else
>>  #define ACCOUNT_CPU_USER_ENTRY(ra, rb)					\
>> -	beq	2f;			/* if from kernel mode */	\
>>  	MFTB(ra);			/* get timebase */		\
>>  	ld	rb,PACA_STARTTIME_USER(r13);				\
>>  	std	ra,PACA_STARTTIME(r13);					\
>> @@ -38,7 +37,6 @@
>>  	ld	ra,PACA_USER_TIME(r13);					\
>>  	add	ra,ra,rb;		/* add on to user time */	\
>>  	std	ra,PACA_USER_TIME(r13);					\
>> -2:
>>  
>>  #define ACCOUNT_CPU_USER_EXIT(ra, rb)					\
>>  	MFTB(ra);			/* get timebase */		\
>> diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
>> index b40e0b4..8d21cc4 100644
>> --- a/arch/powerpc/kernel/entry_64.S
>> +++ b/arch/powerpc/kernel/entry_64.S
>> @@ -62,8 +62,9 @@ system_call_common:
>>  	std	r12,_MSR(r1)
>>  	std	r0,GPR0(r1)
>>  	std	r10,GPR1(r1)
>> +	beq	2f			/* if from kernel mode */
>>  	ACCOUNT_CPU_USER_ENTRY(r10, r11)
>> -	std	r2,GPR2(r1)
>> +2:	std	r2,GPR2(r1)
>>  	std	r3,GPR3(r1)
>>  	mfcr	r2
>>  	std	r4,GPR4(r1)
> 
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@lists.ozlabs.org
> https://lists.ozlabs.org/listinfo/linuxppc-dev
> 

^ permalink raw reply

* Re: [PATCH 5/6] powerpc: Macros for saving/restore PPR
From: Haren Myneni @ 2012-09-11  5:18 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, mikey, paulus, anton
In-Reply-To: <1347251044.2385.144.camel@pasglop>

On 09/09/2012 09:24 PM, Benjamin Herrenschmidt wrote:
> On Sun, 2012-09-09 at 04:43 -0700, Haren Myneni wrote:
>> Several macros are defined for saving and restore user defined PPR value.
>>
>> Signed-off-by: Haren Myneni <haren@us.ibm.com>
>> ---
>>  arch/powerpc/include/asm/exception-64s.h |   35 ++++++++++++++++++++++++++++++
>>  arch/powerpc/include/asm/reg.h           |    1 +
>>  2 files changed, 36 insertions(+)
>>
>> diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
>> index bfd3f1f..618fd18 100644
>> --- a/arch/powerpc/include/asm/exception-64s.h
>> +++ b/arch/powerpc/include/asm/exception-64s.h
>> @@ -62,6 +62,41 @@
>>  #define EXC_HV	H
>>  #define EXC_STD
>>  
>> +/*
>> + * PPR save/restore macros - Used on P7 or later processors
>> + */
>> +#define SAVE_PPR(area, ra, rb)						\
>> +BEGIN_FTR_SECTION_NESTED(940)						\
>> +	ld	ra,area+EX_PPR(r13);	/* Read PPR from paca */	\
>> +	clrrdi	rb,r1,THREAD_SHIFT;	/* thread_info struct */	\
>> +	std	ra,TI_PPR(rb);		/* Save PPR in thread_info */	\
>> +END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,940)
> 
> Why the thread info and not the thread struct ?

We can also use thread struct, but the saved ppr is needed as long as
the process is in kernel context. Once the process exits, we do not need
this value - means the kernel or any other command will not be reading
this value. Even the process can change this value later. Need to add
ppr in thread_struct which uses extra 8 bytes for each task.

Whereas thread_info is at the top of stack. No need to allocate extra
memory. Since the saved process PPR value is needed temporarily whenever
the process is in kernel, thought thread_info is proper. Also easy to
read - one less load instruction.

Saving in thread_info:

ld      r4,area+EX_PPR(r13);
clrrdi  r5,r1,THREAD_SHIFT;
std     r4,TI_PPR(r5);

Saving in thread_struct:

 ld      r4,PACACURRENT(r13)
 addi    r5,r4,THREAD
 ld      r4,paca+EX_PPR(r13);
 std     r4,THREAD_PPR(r5)

If you prefer thread_struct, I will change the patch. Please let me know.

Thanks
Haren

> 
> Cheers,
> Ben.
> 
>> +#define RESTORE_PPR(ra,rb)						\
>> +BEGIN_FTR_SECTION_NESTED(941)						\
>> +	clrrdi	ra,r1,THREAD_SHIFT;					\
>> +	ld	rb,TI_PPR(ra);		/* Read PPR from thread_info */	\
>> +	mtspr	SPRN_PPR,rb;		/* Restore PPR */		\
>> +END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,941)
>> +
>> +#define RESTORE_PPR_PACA(area,ra)					\
>> +BEGIN_FTR_SECTION_NESTED(942)						\
>> +	ld	ra,area+EX_PPR(r13);					\
>> +	mtspr	SPRN_PPR,ra;						\
>> +END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,942)
>> +
>> +#define HMT_MEDIUM_NO_PPR						\
>> +BEGIN_FTR_SECTION_NESTED(944)						\
>> +	HMT_MEDIUM;							\
>> +END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,0,944)  /*non P7*/		
>> +
>> +#define HMT_MEDIUM_HAS_PPR(area, ra)					\
>> +BEGIN_FTR_SECTION_NESTED(943)						\
>> +	mfspr	ra,SPRN_PPR;						\
>> +	std	ra,area+EX_PPR(r13);					\
>> +	HMT_MEDIUM;							\
>> +END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,943) /* P7 */
>> +
>>  #define __EXCEPTION_PROLOG_1(area, extra, vec)				\
>>  	GET_PACA(r13);							\
>>  	std	r9,area+EX_R9(r13);	/* save r9 - r12 */		\
>> diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
>> index 6386086..dff2f89 100644
>> --- a/arch/powerpc/include/asm/reg.h
>> +++ b/arch/powerpc/include/asm/reg.h
>> @@ -284,6 +284,7 @@
>>  #define SPRN_DBAT6U	0x23C	/* Data BAT 6 Upper Register */
>>  #define SPRN_DBAT7L	0x23F	/* Data BAT 7 Lower Register */
>>  #define SPRN_DBAT7U	0x23E	/* Data BAT 7 Upper Register */
>> +#define SPRN_PPR	0x380	/* SMT Thread status Register */
>>  
>>  #define SPRN_DEC	0x016		/* Decrement Register */
>>  #define SPRN_DER	0x095		/* Debug Enable Regsiter */
> 
> 

^ permalink raw reply

* Re: [RFC v8 PATCH 00/20] memory-hotplug: hot-remove physical memory
From: Jerry @ 2012-09-11  5:18 UTC (permalink / raw)
  To: Minchan Kim, Wen Congyang
  Cc: linux-s390, linux-ia64, len.brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Vasilis Liaskovitis, linux-mm,
	Yasuaki Ishimatsu, paulus, kosaki.motohiro, rientjes, sparclinux,
	Andrew Morton, linuxppc-dev, cl, liuj97
In-Reply-To: <20120911012345.GD14205@bbox>

[-- Attachment #1: Type: text/plain, Size: 4464 bytes --]

Hi Kim,

Thank you for your kindness. Let me clarify this:

On ARM architecture, there are 32 bits physical addresses space. However,
the addresses space is divided into 8 banks normally. Each bank
disabled/enabled by a chip selector signal. In my platform, bank0 connects
a DDR chip, and bank1 also connects another DDR chip. And each DDR chip
whose capability is 512MB is integrated into the main board. So, it could
not be removed by hand. We can disable/enable each bank by peripheral
device controller registers.

When system enter suspend state, if all the pages allocated could be
migrated to one bank, there are no valid data in the another bank. In this
time, I could disable the free bank. It isn't necessary to provided power
to this chip in the suspend state. When system resume, I just need to
enable it again.

Hi Wen,

I am sorry for that I doesn't know the "_PSx support" means. Maybe I
needn't it.

Thanks,
Jerry

2012/9/11 Minchan Kim <minchan@kernel.org>

> Hi Jerry,
>
> On Tue, Sep 11, 2012 at 08:27:40AM +0800, Jerry wrote:
> > Hi Wen,
> >
> > I have been arranged a job related memory hotplug on ARM architecture.
> > Maybe I know some new issues about memory hotplug on ARM architecture. I
> > just enabled it on ARM, and it works well in my Android tablet now.
> > However, I have not send out my patches. The real reason is that I don't
> > know how to do it. Maybe I need to read
> "Documentation/SubmittingPatches".
> >
> > Hi Andrew,
> > This is my first time to send you a e-mail. I am so nervous about if I
> have
> > some mistakes or not.
>
> Don't be afraid.
> If you might make a mistake, it's very natural to newbie.
> I am sure anyone doesn't blame you. :)
> If you have a good patch, please send out.
>
> >
> > Some peoples maybe think memory hotplug need to be supported by special
> > hardware. Maybe it means memory physical hotplug. Some times, we just
> need
> > to use memory logical hotplug, doesn't remove the memory in physical. It
> is
> > also usefully for power saving in my platform. Because I doesn't want
> > the offline memory is in *self-refresh* state.
>
> Just out of curiosity.
> What's the your scenario and gain?
> AFAIK, there were some effort about it in embedded side but gain isn't
> rather big
> IIRC.
>
> >
> > Any comments are appreciated.
> >
> > Thanks,
> > Jerry
> >
> > 2012/9/10 Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
> >
> > > Hi,
> > >
> > > On Mon, Sep 10, 2012 at 10:01:44AM +0800, Wen Congyang wrote:
> > > > At 09/10/2012 09:46 AM, Yasuaki Ishimatsu Wrote:
> > > > > Hi Wen,
> > > > >
> > > > > 2012/09/01 5:49, Andrew Morton wrote:
> > > > >> On Tue, 28 Aug 2012 18:00:07 +0800
> > > > >> wency@cn.fujitsu.com wrote:
> > > > >>
> > > > >>> This patch series aims to support physical memory hot-remove.
> > > > >>
> > > > >> I doubt if many people have hardware which permits physical memory
> > > > >> removal?  How would you suggest that people with regular hardware
> can
> > > > >> test these chagnes?
> > > > >
> > > > > How do you test the patch? As Andrew says, for hot-removing memory,
> > > > > we need a particular hardware. I think so too. So many people may
> want
> > > > > to know how to test the patch.
> > > > > If we apply following patch to kvm guest, can we hot-remove memory
> on
> > > > > kvm guest?
> > > > >
> > > > > http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
> > > >
> > > > Yes, if we apply this patchset, we can test hot-remove memory on kvm
> > > guest.
> > > > But that patchset doesn't implement _PS3, so there is some
> restriction.
> > >
> > > the following repos contain the patchset above, plus 2 more patches
> that
> > > add
> > > PS3 support to the dimm devices in qemu/seabios:
> > >
> > > https://github.com/vliaskov/seabios/commits/memhp-v2
> > > https://github.com/vliaskov/qemu-kvm/commits/memhp-v2
> > >
> > > I have not posted the PS3 patches yet in the qemu list, but will post
> them
> > > soon for v3 of the memory hotplug series. If you have issues testing,
> let
> > > me
> > > know.
> > >
> > > thanks,
> > >
> > > - Vasilis
> > >
> > > --
> > > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > > the body to majordomo@kvack.org.  For more info on Linux MM,
> > > see: http://www.linux-mm.org/ .
> > > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> > >
> >
> >
> >
> > --
> > I love linux!!!
>
> --
> Kind regards,
> Minchan Kim
>



-- 
I love linux!!!

[-- Attachment #2: Type: text/html, Size: 6648 bytes --]

^ permalink raw reply

* RE: [PATCH V10] powerpc/fsl-pci: Unify pci/pcie initialization code
From: Jia Hongtao-B38951 @ 2012-09-11  4:49 UTC (permalink / raw)
  To: Kumar Gala
  Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org, Li Yang-R58472
In-Reply-To: <412C8208B4A0464FA894C5F0C278CD5D01A745DC@039-SN1MPN1-003.039d.mgd.msft.net>

Hi Kumar,=0A=
=0A=
Do you or Ben has any comment on this?=0A=
Please let me know.=0A=
=0A=
Thanks.=0A=
=0A=
________________________________________=0A=
From: Jia Hongtao-B38951=0A=
Sent: Friday, August 31, 2012 10:25 AM=0A=
To: Kumar Gala=0A=
Cc: linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; Wood Scott-B07421=0A=
Subject: RE: [PATCH V10] powerpc/fsl-pci: Unify pci/pcie initialization cod=
e=0A=
=0A=
> -----Original Message-----=0A=
> From: Kumar Gala [mailto:galak@kernel.crashing.org]=0A=
> Sent: Friday, August 31, 2012 12:54 AM=0A=
> To: Jia Hongtao-B38951=0A=
> Cc: linuxppc-dev@lists.ozlabs.org; Li Yang-R58472; Wood Scott-B07421=0A=
> Subject: Re: [PATCH V10] powerpc/fsl-pci: Unify pci/pcie initialization=
=0A=
> code=0A=
>=0A=
>=0A=
> On Aug 28, 2012, at 2:44 AM, Jia Hongtao wrote:=0A=
>=0A=
> > We unified the Freescale pci/pcie initialization by changing the=0A=
> > fsl_pci to a platform driver. In previous PCI code architecture the=0A=
> > initialization routine is called at board_setup_arch stage. Now the=0A=
> > initialization is done in probe function which is architectural=0A=
> > better. Also It's convenient for adding PM support for PCI controller=
=0A=
> in later patch.=0A=
> >=0A=
> > Now we registered pci controllers as platform devices. So we combine=0A=
> > two initialization code as one platform driver.=0A=
> >=0A=
> > Signed-off-by: Jia Hongtao <B38951@freescale.com>=0A=
> > Signed-off-by: Li Yang <leoli@freescale.com>=0A=
> > Signed-off-by: Chunhe Lan <Chunhe.Lan@freescale.com>=0A=
> > ---=0A=
> > Changes for V10:=0A=
> > * Just update comments for mpc85xx_cds_pci_assign_primary=0A=
>=0A=
> What do we have to say w/regards to Ben's comment about breaking of=0A=
> pci_fixup_final by moving to a platform driver?=0A=
>=0A=
> - k=0A=
=0A=
If Ben mean the fixup order I have already gave some explanations as follow=
:=0A=
=0A=
I have already done some investigations and the sequence of fixup=0A=
(including early, header, final) will not be changed in platform driver.=0A=
=0A=
We register and init PCI controllers as platform devices at arch_initcall=
=0A=
stage and PCI scanning (pcibios_init) is at subsys_initcall stage in which=
=0A=
early and header fixup will be done in right sequence. The final fixup will=
=0A=
be start at rootfs_initcall stage which is later than early and header fixu=
p.=0A=
=0A=
- Hongtao.=0A=

^ permalink raw reply

* [PATCH][v4] sata_fsl: add workaround for data length mismatch on freescale V2 controller
From: Shaohui Xie @ 2012-09-11  2:48 UTC (permalink / raw)
  To: jgarzik, linux-ide; +Cc: linuxppc-dev, linux-kernel, Anju Bhartiya, Shaohui Xie

The freescale V2 SATA controller checks if the received data length matches
the programmed length 'ttl', if not, it assumes that this is an error.
In ATAPI, the 'ttl' is based on max allocation length and not the actual
data transfer length, controller will raise 'DLM' (Data length Mismatch)
error bit in Hstatus register. Along with 'DLM', DE (Device error) and
FE (fatal Error) bits are also set in Hstatus register, 'E' (Internal Error)
bit is set in Serror register and CE (Command Error) and DE (Device error)
registers have the corresponding bit set. In this condition, we need to
clear errors in following way: in the service routine, based on 'DLM' flag,
HCONTROL[27] operation clears Hstatus, CE and DE registers, clear Serror
register.

Signed-off-by: Shaohui Xie <Shaohui.Xie@freescale.com>
Signed-off-by: Anju Bhartiya <Anju.Bhartiya@freescale.com>
---
changes for v4:
1. put HCONTROL_CLEAR_ERROR to enum and rename it as CLEAR_ERROR;

changes for v3:
1. not using uppercase for variable names;
2. remove unnecessary parens;

changes for v2:
1. remove the using of quirk;
2. wrap errata codes in condition;


 drivers/ata/sata_fsl.c |   39 +++++++++++++++++++++++++++++++++++----
 1 files changed, 35 insertions(+), 4 deletions(-)

diff --git a/drivers/ata/sata_fsl.c b/drivers/ata/sata_fsl.c
index d6577b9..124b2c1 100644
--- a/drivers/ata/sata_fsl.c
+++ b/drivers/ata/sata_fsl.c
@@ -123,6 +123,7 @@ enum {
 	ONLINE = (1 << 31),
 	GOING_OFFLINE = (1 << 30),
 	BIST_ERR = (1 << 29),
+	CLEAR_ERROR = (1 << 27),
 
 	FATAL_ERR_HC_MASTER_ERR = (1 << 18),
 	FATAL_ERR_PARITY_ERR_TX = (1 << 17),
@@ -143,6 +144,7 @@ enum {
 	    FATAL_ERR_CRC_ERR_RX |
 	    FATAL_ERR_FIFO_OVRFL_TX | FATAL_ERR_FIFO_OVRFL_RX,
 
+	INT_ON_DATA_LENGTH_MISMATCH = (1 << 12),
 	INT_ON_FATAL_ERR = (1 << 5),
 	INT_ON_PHYRDY_CHG = (1 << 4),
 
@@ -1181,25 +1183,54 @@ static void sata_fsl_host_intr(struct ata_port *ap)
 	u32 hstatus, done_mask = 0;
 	struct ata_queued_cmd *qc;
 	u32 SError;
+	u32 tag;
+	u32 status_mask = INT_ON_ERROR;
 
 	hstatus = ioread32(hcr_base + HSTATUS);
 
 	sata_fsl_scr_read(&ap->link, SCR_ERROR, &SError);
 
+	/* Read command completed register */
+	done_mask = ioread32(hcr_base + CC);
+
+	/* Workaround for data length mismatch errata */
+	if (unlikely(hstatus & INT_ON_DATA_LENGTH_MISMATCH)) {
+		for (tag = 0; tag < ATA_MAX_QUEUE; tag++) {
+			qc = ata_qc_from_tag(ap, tag);
+			if (qc && ata_is_atapi(qc->tf.protocol)) {
+				u32 hcontrol;
+				/* Set HControl[27] to clear error registers */
+				hcontrol = ioread32(hcr_base + HCONTROL);
+				iowrite32(hcontrol | CLEAR_ERROR,
+						hcr_base + HCONTROL);
+
+				/* Clear HControl[27] */
+				iowrite32(hcontrol & ~CLEAR_ERROR,
+						hcr_base + HCONTROL);
+
+				/* Clear SError[E] bit */
+				sata_fsl_scr_write(&ap->link, SCR_ERROR,
+						SError);
+
+				/* Ignore fatal error and device error */
+				status_mask &= ~(INT_ON_SINGL_DEVICE_ERR
+						| INT_ON_FATAL_ERR);
+				break;
+			}
+		}
+	}
+
 	if (unlikely(SError & 0xFFFF0000)) {
 		DPRINTK("serror @host_intr : 0x%x\n", SError);
 		sata_fsl_error_intr(ap);
 	}
 
-	if (unlikely(hstatus & INT_ON_ERROR)) {
+	if (unlikely(hstatus & status_mask)) {
 		DPRINTK("error interrupt!!\n");
 		sata_fsl_error_intr(ap);
 		return;
 	}
 
-	/* Read command completed register */
-	done_mask = ioread32(hcr_base + CC);
-
 	VPRINTK("Status of all queues :\n");
 	VPRINTK("done_mask/CC = 0x%x, CA = 0x%x, CE=0x%x,CQ=0x%x,apqa=0x%x\n",
 		done_mask,
-- 
1.6.4

^ permalink raw reply related

* Re: [RFC v9 PATCH 05/21] memory-hotplug: check whether memory is present or not
From: Wen Congyang @ 2012-09-11  2:15 UTC (permalink / raw)
  To: isimatu.yasuaki
  Cc: linux-s390, linux-ia64, wency, linux-acpi, linux-sh, len.brown,
	x86, linux-kernel, cmetcalf, linux-mm, paulus, minchan.kim,
	kosaki.motohiro, rientjes, sparclinux, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <1346837155-534-6-git-send-email-wency@cn.fujitsu.com>

Hi, ishimatsu

At 09/05/2012 05:25 PM, wency@cn.fujitsu.com Wrote:
> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> 
> If system supports memory hot-remove, online_pages() may online removed pages.
> So online_pages() need to check whether onlining pages are present or not.

Because we use memory_block_change_state() to hotremoving memory, I think
this patch can be removed. What do you think?

Thanks
Wen Congyang

> 
> CC: David Rientjes <rientjes@google.com>
> CC: Jiang Liu <liuj97@gmail.com>
> CC: Len Brown <len.brown@intel.com>
> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> CC: Paul Mackerras <paulus@samba.org>
> CC: Christoph Lameter <cl@linux.com>
> Cc: Minchan Kim <minchan.kim@gmail.com>
> CC: Andrew Morton <akpm@linux-foundation.org>
> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
> CC: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> ---
>  include/linux/mmzone.h |   19 +++++++++++++++++++
>  mm/memory_hotplug.c    |   13 +++++++++++++
>  2 files changed, 32 insertions(+), 0 deletions(-)
> 
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 2daa54f..ac3ae30 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1180,6 +1180,25 @@ void sparse_init(void);
>  #define sparse_index_init(_sec, _nid)  do {} while (0)
>  #endif /* CONFIG_SPARSEMEM */
>  
> +#ifdef CONFIG_SPARSEMEM
> +static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
> +{
> +	int i;
> +	for (i = 0; i < nr_pages; i++) {
> +		if (pfn_present(pfn + i))
> +			continue;
> +		else
> +			return -EINVAL;
> +	}
> +	return 0;
> +}
> +#else
> +static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
> +{
> +	return 0;
> +}
> +#endif /* CONFIG_SPARSEMEM*/
> +
>  #ifdef CONFIG_NODES_SPAN_OTHER_NODES
>  bool early_pfn_in_nid(unsigned long pfn, int nid);
>  #else
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 49f7747..299747d 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -467,6 +467,19 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
>  	struct memory_notify arg;
>  
>  	lock_memory_hotplug();
> +	/*
> +	 * If system supports memory hot-remove, the memory may have been
> +	 * removed. So we check whether the memory has been removed or not.
> +	 *
> +	 * Note: When CONFIG_SPARSEMEM is defined, pfns_present() become
> +	 *       effective. If CONFIG_SPARSEMEM is not defined, pfns_present()
> +	 *       always returns 0.
> +	 */
> +	ret = pfns_present(pfn, nr_pages);
> +	if (ret) {
> +		unlock_memory_hotplug();
> +		return ret;
> +	}
>  	arg.start_pfn = pfn;
>  	arg.nr_pages = nr_pages;
>  	arg.status_change_nid = -1;

^ permalink raw reply

* Re: [RFC v9 PATCH 05/21] memory-hotplug: check whether memory is present or not
From: Wen Congyang @ 2012-09-11  2:46 UTC (permalink / raw)
  To: Yasuaki Ishimatsu
  Cc: linux-s390, linux-ia64, len.brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, linux-mm, paulus, minchan.kim,
	kosaki.motohiro, rientjes, sparclinux, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <504EA0F7.5090805@jp.fujitsu.com>

At 09/11/2012 10:24 AM, Yasuaki Ishimatsu Wrote:
> Hi Wen,
> 
> 2012/09/11 11:15, Wen Congyang wrote:
>> Hi, ishimatsu
>>
>> At 09/05/2012 05:25 PM, wency@cn.fujitsu.com Wrote:
>>> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>>
>>> If system supports memory hot-remove, online_pages() may online
>>> removed pages.
>>> So online_pages() need to check whether onlining pages are present or
>>> not.
>>
>> Because we use memory_block_change_state() to hotremoving memory, I think
>> this patch can be removed. What do you think?
> 
> Pleae teach me detals a little more. If we use memory_block_change_state(),
> does the conflict never occur? Why?

I misunderstand sth, please ignore it.

Wen Congyang

> 
> Thansk,
> Yasuaki Ishimatsu
> 
>> Thanks
>> Wen Congyang
>>
>>>
>>> CC: David Rientjes <rientjes@google.com>
>>> CC: Jiang Liu <liuj97@gmail.com>
>>> CC: Len Brown <len.brown@intel.com>
>>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>>> CC: Paul Mackerras <paulus@samba.org>
>>> CC: Christoph Lameter <cl@linux.com>
>>> Cc: Minchan Kim <minchan.kim@gmail.com>
>>> CC: Andrew Morton <akpm@linux-foundation.org>
>>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>>> CC: Wen Congyang <wency@cn.fujitsu.com>
>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>> ---
>>>   include/linux/mmzone.h |   19 +++++++++++++++++++
>>>   mm/memory_hotplug.c    |   13 +++++++++++++
>>>   2 files changed, 32 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>>> index 2daa54f..ac3ae30 100644
>>> --- a/include/linux/mmzone.h
>>> +++ b/include/linux/mmzone.h
>>> @@ -1180,6 +1180,25 @@ void sparse_init(void);
>>>   #define sparse_index_init(_sec, _nid)  do {} while (0)
>>>   #endif /* CONFIG_SPARSEMEM */
>>>
>>> +#ifdef CONFIG_SPARSEMEM
>>> +static inline int pfns_present(unsigned long pfn, unsigned long
>>> nr_pages)
>>> +{
>>> +    int i;
>>> +    for (i = 0; i < nr_pages; i++) {
>>> +        if (pfn_present(pfn + i))
>>> +            continue;
>>> +        else
>>> +            return -EINVAL;
>>> +    }
>>> +    return 0;
>>> +}
>>> +#else
>>> +static inline int pfns_present(unsigned long pfn, unsigned long
>>> nr_pages)
>>> +{
>>> +    return 0;
>>> +}
>>> +#endif /* CONFIG_SPARSEMEM*/
>>> +
>>>   #ifdef CONFIG_NODES_SPAN_OTHER_NODES
>>>   bool early_pfn_in_nid(unsigned long pfn, int nid);
>>>   #else
>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>> index 49f7747..299747d 100644
>>> --- a/mm/memory_hotplug.c
>>> +++ b/mm/memory_hotplug.c
>>> @@ -467,6 +467,19 @@ int __ref online_pages(unsigned long pfn,
>>> unsigned long nr_pages)
>>>       struct memory_notify arg;
>>>
>>>       lock_memory_hotplug();
>>> +    /*
>>> +     * If system supports memory hot-remove, the memory may have been
>>> +     * removed. So we check whether the memory has been removed or not.
>>> +     *
>>> +     * Note: When CONFIG_SPARSEMEM is defined, pfns_present() become
>>> +     *       effective. If CONFIG_SPARSEMEM is not defined,
>>> pfns_present()
>>> +     *       always returns 0.
>>> +     */
>>> +    ret = pfns_present(pfn, nr_pages);
>>> +    if (ret) {
>>> +        unlock_memory_hotplug();
>>> +        return ret;
>>> +    }
>>>       arg.start_pfn = pfn;
>>>       arg.nr_pages = nr_pages;
>>>       arg.status_change_nid = -1;
>>
> 
> 
> 

^ permalink raw reply

* Re: [RFC v9 PATCH 05/21] memory-hotplug: check whether memory is present or not
From: Yasuaki Ishimatsu @ 2012-09-11  2:24 UTC (permalink / raw)
  To: Wen Congyang
  Cc: linux-s390, linux-ia64, len.brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, linux-mm, paulus, minchan.kim,
	kosaki.motohiro, rientjes, sparclinux, cl, linuxppc-dev, akpm,
	liuj97
In-Reply-To: <504E9EBE.1040403@cn.fujitsu.com>

Hi Wen,

2012/09/11 11:15, Wen Congyang wrote:
> Hi, ishimatsu
>
> At 09/05/2012 05:25 PM, wency@cn.fujitsu.com Wrote:
>> From: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>>
>> If system supports memory hot-remove, online_pages() may online removed pages.
>> So online_pages() need to check whether onlining pages are present or not.
>
> Because we use memory_block_change_state() to hotremoving memory, I think
> this patch can be removed. What do you think?

Pleae teach me detals a little more. If we use memory_block_change_state(),
does the conflict never occur? Why?

Thansk,
Yasuaki Ishimatsu

> Thanks
> Wen Congyang
>
>>
>> CC: David Rientjes <rientjes@google.com>
>> CC: Jiang Liu <liuj97@gmail.com>
>> CC: Len Brown <len.brown@intel.com>
>> CC: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>> CC: Paul Mackerras <paulus@samba.org>
>> CC: Christoph Lameter <cl@linux.com>
>> Cc: Minchan Kim <minchan.kim@gmail.com>
>> CC: Andrew Morton <akpm@linux-foundation.org>
>> CC: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
>> CC: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> ---
>>   include/linux/mmzone.h |   19 +++++++++++++++++++
>>   mm/memory_hotplug.c    |   13 +++++++++++++
>>   2 files changed, 32 insertions(+), 0 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 2daa54f..ac3ae30 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1180,6 +1180,25 @@ void sparse_init(void);
>>   #define sparse_index_init(_sec, _nid)  do {} while (0)
>>   #endif /* CONFIG_SPARSEMEM */
>>
>> +#ifdef CONFIG_SPARSEMEM
>> +static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
>> +{
>> +	int i;
>> +	for (i = 0; i < nr_pages; i++) {
>> +		if (pfn_present(pfn + i))
>> +			continue;
>> +		else
>> +			return -EINVAL;
>> +	}
>> +	return 0;
>> +}
>> +#else
>> +static inline int pfns_present(unsigned long pfn, unsigned long nr_pages)
>> +{
>> +	return 0;
>> +}
>> +#endif /* CONFIG_SPARSEMEM*/
>> +
>>   #ifdef CONFIG_NODES_SPAN_OTHER_NODES
>>   bool early_pfn_in_nid(unsigned long pfn, int nid);
>>   #else
>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>> index 49f7747..299747d 100644
>> --- a/mm/memory_hotplug.c
>> +++ b/mm/memory_hotplug.c
>> @@ -467,6 +467,19 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages)
>>   	struct memory_notify arg;
>>
>>   	lock_memory_hotplug();
>> +	/*
>> +	 * If system supports memory hot-remove, the memory may have been
>> +	 * removed. So we check whether the memory has been removed or not.
>> +	 *
>> +	 * Note: When CONFIG_SPARSEMEM is defined, pfns_present() become
>> +	 *       effective. If CONFIG_SPARSEMEM is not defined, pfns_present()
>> +	 *       always returns 0.
>> +	 */
>> +	ret = pfns_present(pfn, nr_pages);
>> +	if (ret) {
>> +		unlock_memory_hotplug();
>> +		return ret;
>> +	}
>>   	arg.start_pfn = pfn;
>>   	arg.nr_pages = nr_pages;
>>   	arg.status_change_nid = -1;
>

^ permalink raw reply

* [v3][PATCH 3/3] ppc/kprobe: don't emulate store when kprobe stwu r1
From: Tiejun Chen @ 2012-09-11  2:20 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1347330053-27039-1-git-send-email-tiejun.chen@windriver.com>

We don't do the real store operation for kprobing 'stwu Rx,(y)R1'
since this may corrupt the exception frame, now we will do this
operation safely in exception return code after migrate current
exception frame below the kprobed function stack.

So we only update gpr[1] here and trigger a thread flag to mask
this.

Note we should make sure if we trigger kernel stack over flow.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/lib/sstep.c |   36 ++++++++++++++++++++++++++++++++++--
 1 file changed, 34 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/lib/sstep.c b/arch/powerpc/lib/sstep.c
index 9a52349..e15c521 100644
--- a/arch/powerpc/lib/sstep.c
+++ b/arch/powerpc/lib/sstep.c
@@ -566,7 +566,7 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 	unsigned long int ea;
 	unsigned int cr, mb, me, sh;
 	int err;
-	unsigned long old_ra;
+	unsigned long old_ra, val3;
 	long ival;
 
 	opcode = instr >> 26;
@@ -1486,11 +1486,43 @@ int __kprobes emulate_step(struct pt_regs *regs, unsigned int instr)
 		goto ldst_done;
 
 	case 36:	/* stw */
-	case 37:	/* stwu */
 		val = regs->gpr[rd];
 		err = write_mem(val, dform_ea(instr, regs), 4, regs);
 		goto ldst_done;
 
+	case 37:	/* stwu */
+		val = regs->gpr[rd];
+		val3 = dform_ea(instr, regs);
+		/*
+		 * For PPC32 we always use stwu to change stack point with r1. So
+		 * this emulated store may corrupt the exception frame, now we
+		 * have to provide the exception frame trampoline, which is pushed
+		 * below the kprobed function stack. So we only update gpr[1] but
+		 * don't emulate the real store operation. We will do real store
+		 * operation safely in exception return code by checking this flag.
+		 */
+		if ((ra == 1) && !(regs->msr & MSR_PR) \
+			&& (val3 >= (regs->gpr[1] - STACK_INT_FRAME_SIZE))) {
+			/*
+			 * Check if we will touch kernel sack overflow
+			 */
+			if (val3 - STACK_INT_FRAME_SIZE <= current->thread.ksp_limit) {
+				printk(KERN_CRIT "Can't kprobe this since Kernel stack overflow.\n");
+				err = -EINVAL;
+				break;
+			}
+
+			/*
+			 * Check if we already set since that means we'll
+			 * lose the previous value.
+			 */
+			WARN_ON(test_thread_flag(TIF_EMULATE_STACK_STORE));
+			set_thread_flag(TIF_EMULATE_STACK_STORE);
+			err = 0;
+		} else
+			err = write_mem(val, val3, 4, regs);
+		goto ldst_done;
+
 	case 38:	/* stb */
 	case 39:	/* stbu */
 		val = regs->gpr[rd];
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 1/3] powerpc/kprobe: introduce a new thread flag
From: Tiejun Chen @ 2012-09-11  2:20 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev

We need to add a new thread flag, TIF_EMULATE_STACK_STORE,
for emulating stack store operation while exiting exception.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
v3:
* rebase on next
* restore those clobbered registers
* add for 64-bit
* retest with kprobe do_fork()/show_interrupts()
	for fsl-p4080 and fsl-p5020, separately

 arch/powerpc/include/asm/thread_info.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
index e942203..8ceea14 100644
--- a/arch/powerpc/include/asm/thread_info.h
+++ b/arch/powerpc/include/asm/thread_info.h
@@ -104,6 +104,8 @@ static inline struct thread_info *current_thread_info(void)
 #define TIF_NOTIFY_RESUME	13	/* callback before returning to user */
 #define TIF_UPROBE		14	/* breakpointed or single-stepping */
 #define TIF_SYSCALL_TRACEPOINT	15	/* syscall tracepoint instrumentation */
+#define TIF_EMULATE_STACK_STORE	16	/* Is an instruction emulation
+						for stack store? */
 
 /* as above, but as bit values */
 #define _TIF_SYSCALL_TRACE	(1<<TIF_SYSCALL_TRACE)
@@ -121,6 +123,7 @@ static inline struct thread_info *current_thread_info(void)
 #define _TIF_NOTIFY_RESUME	(1<<TIF_NOTIFY_RESUME)
 #define _TIF_UPROBE		(1<<TIF_UPROBE)
 #define _TIF_SYSCALL_TRACEPOINT	(1<<TIF_SYSCALL_TRACEPOINT)
+#define _TIF_EMULATE_STACK_STORE	(1<<TIF_EMULATE_STACK_STORE)
 #define _TIF_SYSCALL_T_OR_A	(_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
 				 _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
 
-- 
1.7.9.5

^ permalink raw reply related

* [v3][PATCH 2/3] ppc/kprobe: complete kprobe and migrate exception frame
From: Tiejun Chen @ 2012-09-11  2:20 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev
In-Reply-To: <1347330053-27039-1-git-send-email-tiejun.chen@windriver.com>

We can't emulate stwu since that may corrupt current exception stack.
So we will have to do real store operation in the exception return code.

Firstly we'll allocate a trampoline exception frame below the kprobed
function stack and copy the current exception frame to the trampoline.
Then we can do this real store operation to implement 'stwu', and reroute
the trampoline frame to r1 to complete this exception migration.

Signed-off-by: Tiejun Chen <tiejun.chen@windriver.com>
---
 arch/powerpc/kernel/entry_32.S |   45 ++++++++++++++++++++++++++++++++++------
 arch/powerpc/kernel/entry_64.S |   32 ++++++++++++++++++++++++++++
 2 files changed, 71 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/entry_32.S b/arch/powerpc/kernel/entry_32.S
index ead5016..6cfe12f 100644
--- a/arch/powerpc/kernel/entry_32.S
+++ b/arch/powerpc/kernel/entry_32.S
@@ -831,19 +831,54 @@ restore_user:
 	bnel-	load_dbcr0
 #endif
 
-#ifdef CONFIG_PREEMPT
 	b	restore
 
 /* N.B. the only way to get here is from the beq following ret_from_except. */
 resume_kernel:
+	/* check current_thread_info, _TIF_EMULATE_STACK_STORE */
+	CURRENT_THREAD_INFO(r9, r1)
+	lwz	r0,TI_FLAGS(r9)
+	andis.	r0,r0,_TIF_EMULATE_STACK_STORE@h
+	beq+	1f
+
+	addi	r8,r1,INT_FRAME_SIZE	/* Get the kprobed function entry */
+
+	lwz	r3,GPR1(r1)
+	subi	r3,r3,INT_FRAME_SIZE	/* dst: Allocate a trampoline exception frame */
+	mr	r4,r1			/* src:  current exception frame */
+	li	r5,INT_FRAME_SIZE	/* size: INT_FRAME_SIZE */
+	mr	r1,r3			/* Reroute the trampoline frame to r1 */
+	bl	memcpy			/* Copy from the original to the trampoline */
+
+	CURRENT_THREAD_INFO(r9, r1)
+	lwz	r0,TI_FLAGS(r9)		/* Restore this clobbered r0 */
+
+	/* Do real store operation to complete stwu */
+	lwz	r5,GPR1(r1)
+	stw	r8,0(r5)
+
+	/* Clear _TIF_EMULATE_STACK_STORE flag */
+	CURRENT_THREAD_INFO(r9, r1)
+	lis	r11,_TIF_EMULATE_STACK_STORE@h
+	addi	r5,r9,TI_FLAGS
+0:	lwarx	r8,0,r5
+	andc	r8,r8,r11
+#ifdef CONFIG_IBM405_ERR77
+	dcbt	0,r5
+#endif
+	stwcx.	r8,0,r5
+	bne-	0b
+1:
+
+#ifdef CONFIG_PREEMPT
 	/* check current_thread_info->preempt_count */
 	CURRENT_THREAD_INFO(r9, r1)
-	lwz	r0,TI_PREEMPT(r9)
-	cmpwi	0,r0,0		/* if non-zero, just restore regs and return */
+	lwz	r8,TI_PREEMPT(r9)
+	cmpwi	0,r8,0		/* if non-zero, just restore regs and return */
 	bne	restore
-	lwz	r0,TI_FLAGS(r9)
 	andi.	r0,r0,_TIF_NEED_RESCHED
 	beq+	restore
+	lwz	r3,_MSR(r1)
 	andi.	r0,r3,MSR_EE	/* interrupts off? */
 	beq	restore		/* don't schedule if so */
 #ifdef CONFIG_TRACE_IRQFLAGS
@@ -864,8 +899,6 @@ resume_kernel:
 	 */
 	bl	trace_hardirqs_on
 #endif
-#else
-resume_kernel:
 #endif /* CONFIG_PREEMPT */
 
 	/* interrupts are hard-disabled at this point */
diff --git a/arch/powerpc/kernel/entry_64.S b/arch/powerpc/kernel/entry_64.S
index b40e0b4..b6d7483 100644
--- a/arch/powerpc/kernel/entry_64.S
+++ b/arch/powerpc/kernel/entry_64.S
@@ -593,6 +593,38 @@ _GLOBAL(ret_from_except_lite)
 	b	.ret_from_except
 
 resume_kernel:
+	/* check current_thread_info, _TIF_EMULATE_STACK_STORE */
+	CURRENT_THREAD_INFO(r9, r1)
+	ld	r0,TI_FLAGS(r9)
+	andis.	r0,r0,_TIF_EMULATE_STACK_STORE@h
+	beq+	1f
+
+	addi	r8,r1,INT_FRAME_SIZE	/* Get the kprobed function entry */
+
+	lwz	r3,GPR1(r1)
+	subi	r3,r3,INT_FRAME_SIZE	/* dst: Allocate a trampoline exception frame */
+	mr	r4,r1			/* src:  current exception frame */
+	li	r5,INT_FRAME_SIZE	/* size: INT_FRAME_SIZE */
+	mr	r1,r3			/* Reroute the trampoline frame to r1 */
+	bl	memcpy			/* Copy from the original to the trampoline */
+
+	CURRENT_THREAD_INFO(r9, r1)
+	ld	r4,TI_FLAGS(r9)		/* Restore this clobbered r4 */
+
+	/* Do real store operation to complete stwu */
+	lwz	r5,GPR1(r1)
+	std	r8,0(r5)
+
+	/* Clear _TIF_EMULATE_STACK_STORE flag */
+	CURRENT_THREAD_INFO(r9, r1)
+	lis	r11,_TIF_EMULATE_STACK_STORE@h
+	addi	r5,r9,TI_FLAGS
+0:	ldarx	r8,0,r5
+	andc	r8,r8,r11
+	stdcx.	r8,0,r5
+	bne-	0b
+1:
+
 #ifdef CONFIG_PREEMPT
 	/* Check if we need to preempt */
 	andi.	r0,r4,_TIF_NEED_RESCHED
-- 
1.7.9.5

^ permalink raw reply related

* [PATCH powerpc,trace] Avoid suspicious RCU usage reporting for some tracepoints
From: Li Zhong @ 2012-09-11  1:37 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Anton Blanchard, Paul Mackerras, Paul E. McKenney,
	PowerPC email list, Steven Rostedt
In-Reply-To: <1347315255.11820.45.camel@pasglop>

Thank you all, this is the updated patch for your review:

There are a few tracepoints in the interrupt code path, which is before
irq_enter(), or after irq_exit(), like
trace_irq_entry()/trace_irq_exit() in do_IRQ(),
trace_timer_interrupt_entry()/trace_timer_interrupt_exit() in
timer_interrupt(). 

If the interrupt is from idle(), and because tracepoint contains RCU
read-side critical section, we could see following suspicious RCU usage
reported:

[  145.127743] ===============================
[  145.127747] [ INFO: suspicious RCU usage. ]
[  145.127752] 3.6.0-rc3+ #1 Not tainted
[  145.127755] -------------------------------
[  145.127759] /root/.workdir/linux/arch/powerpc/include/asm/trace.h:33
suspicious rcu_dereference_check() usage!
[  145.127765] 
[  145.127765] other info that might help us debug this:
[  145.127765] 
[  145.127771] 
[  145.127771] RCU used illegally from idle CPU!
[  145.127771] rcu_scheduler_active = 1, debug_locks = 0
[  145.127777] RCU used illegally from extended quiescent state!
[  145.127781] no locks held by swapper/0/0.
[  145.127785] 
[  145.127785] stack backtrace:
[  145.127789] Call Trace:
[  145.127796] [c00000000108b530] [c000000000013c40] .show_stack
+0x70/0x1c0 (unreliable)
[  145.127806] [c00000000108b5e0]
[c0000000000f59d8] .lockdep_rcu_suspicious+0x118/0x150
[  145.127813] [c00000000108b680] [c00000000000fc58] .do_IRQ+0x498/0x500
[  145.127820] [c00000000108b750] [c000000000003950]
hardware_interrupt_common+0x150/0x180
[  145.127828] --- Exception: 501 at .plpar_hcall_norets+0x84/0xd4
[  145.127828]     LR = .check_and_cede_processor+0x38/0x70
[  145.127836] [c00000000108bab0] [c0000000000665dc] .shared_cede_loop
+0x5c/0x100
[  145.127844] [c00000000108bb70] [c000000000588ab0] .cpuidle_enter
+0x30/0x50
[  145.127850] [c00000000108bbe0]
[c000000000588b0c] .cpuidle_enter_state+0x3c/0xb0
[  145.127857] [c00000000108bc60] [c000000000589730] .cpuidle_idle_call
+0x150/0x6c0
[  145.127863] [c00000000108bd30] [c000000000058440] .pSeries_idle
+0x10/0x40
[  145.127870] [c00000000108bda0] [c00000000001683c] .cpu_idle
+0x18c/0x2d0
[  145.127876] [c00000000108be60] [c00000000000b434] .rest_init
+0x124/0x1b0
[  145.127884] [c00000000108bef0] [c0000000009d0d28] .start_kernel
+0x568/0x588
[  145.127890] [c00000000108bf90] [c000000000009660] .start_here_common
+0x20/0x40

This is because the RCU usage in interrupt context should be used in
area marked by rcu_irq_enter()/rcu_irq_exit(), called in
irq_enter()/irq_exit() respectively. 

Move them into the irq_enter()/irq_exit() area to avoid the reporting.

Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/irq.c  |    8 ++++----
 arch/powerpc/kernel/time.c |    8 ++++----
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 1f017bb..71413f4 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -489,10 +489,10 @@ void do_IRQ(struct pt_regs *regs)
 	struct pt_regs *old_regs = set_irq_regs(regs);
 	unsigned int irq;
 
-	trace_irq_entry(regs);
-
 	irq_enter();
 
+	trace_irq_entry(regs);
+
 	check_stack_overflow();
 
 	/*
@@ -511,10 +511,10 @@ void do_IRQ(struct pt_regs *regs)
 	else
 		__get_cpu_var(irq_stat).spurious_irqs++;
 
+	trace_irq_exit(regs);
+
 	irq_exit();
 	set_irq_regs(old_regs);
-
-	trace_irq_exit(regs);
 }
 
 void __init init_IRQ(void)
diff --git a/arch/powerpc/kernel/time.c b/arch/powerpc/kernel/time.c
index e49e931..bd693a1 100644
--- a/arch/powerpc/kernel/time.c
+++ b/arch/powerpc/kernel/time.c
@@ -493,8 +493,6 @@ void timer_interrupt(struct pt_regs * regs)
 	 */
 	may_hard_irq_enable();
 
-	trace_timer_interrupt_entry(regs);
-
 	__get_cpu_var(irq_stat).timer_irqs++;
 
 #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC)
@@ -505,6 +503,8 @@ void timer_interrupt(struct pt_regs * regs)
 	old_regs = set_irq_regs(regs);
 	irq_enter();
 
+	trace_timer_interrupt_entry(regs);
+
 	if (test_irq_work_pending()) {
 		clear_irq_work_pending();
 		irq_work_run();
@@ -529,10 +529,10 @@ void timer_interrupt(struct pt_regs * regs)
 	}
 #endif
 
+	trace_timer_interrupt_exit(regs);
+
 	irq_exit();
 	set_irq_regs(old_regs);
-
-	trace_timer_interrupt_exit(regs);
 }
 
 /*
-- 
1.7.1

^ permalink raw reply related

* Re: [RFC v8 PATCH 00/20] memory-hotplug: hot-remove physical memory
From: Minchan Kim @ 2012-09-11  1:23 UTC (permalink / raw)
  To: Jerry
  Cc: linux-s390, linux-ia64, Wen Congyang, linux-acpi, linux-sh,
	len.brown, x86, linux-kernel, cmetcalf, Vasilis Liaskovitis,
	linux-mm, Yasuaki Ishimatsu, paulus, kosaki.motohiro, rientjes,
	sparclinux, Andrew Morton, linuxppc-dev, cl, liuj97
In-Reply-To: <CAAV+Mu7YWRWnxt78F4ZDMrrUsWB=n-_qkYOcQT7WQ2HwP89Obw@mail.gmail.com>

Hi Jerry,

On Tue, Sep 11, 2012 at 08:27:40AM +0800, Jerry wrote:
> Hi Wen,
> 
> I have been arranged a job related memory hotplug on ARM architecture.
> Maybe I know some new issues about memory hotplug on ARM architecture. I
> just enabled it on ARM, and it works well in my Android tablet now.
> However, I have not send out my patches. The real reason is that I don't
> know how to do it. Maybe I need to read "Documentation/SubmittingPatches".
> 
> Hi Andrew,
> This is my first time to send you a e-mail. I am so nervous about if I have
> some mistakes or not.

Don't be afraid.
If you might make a mistake, it's very natural to newbie.
I am sure anyone doesn't blame you. :)
If you have a good patch, please send out.

> 
> Some peoples maybe think memory hotplug need to be supported by special
> hardware. Maybe it means memory physical hotplug. Some times, we just need
> to use memory logical hotplug, doesn't remove the memory in physical. It is
> also usefully for power saving in my platform. Because I doesn't want
> the offline memory is in *self-refresh* state.

Just out of curiosity.
What's the your scenario and gain?
AFAIK, there were some effort about it in embedded side but gain isn't rather big
IIRC.

> 
> Any comments are appreciated.
> 
> Thanks,
> Jerry
> 
> 2012/9/10 Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
> 
> > Hi,
> >
> > On Mon, Sep 10, 2012 at 10:01:44AM +0800, Wen Congyang wrote:
> > > At 09/10/2012 09:46 AM, Yasuaki Ishimatsu Wrote:
> > > > Hi Wen,
> > > >
> > > > 2012/09/01 5:49, Andrew Morton wrote:
> > > >> On Tue, 28 Aug 2012 18:00:07 +0800
> > > >> wency@cn.fujitsu.com wrote:
> > > >>
> > > >>> This patch series aims to support physical memory hot-remove.
> > > >>
> > > >> I doubt if many people have hardware which permits physical memory
> > > >> removal?  How would you suggest that people with regular hardware can
> > > >> test these chagnes?
> > > >
> > > > How do you test the patch? As Andrew says, for hot-removing memory,
> > > > we need a particular hardware. I think so too. So many people may want
> > > > to know how to test the patch.
> > > > If we apply following patch to kvm guest, can we hot-remove memory on
> > > > kvm guest?
> > > >
> > > > http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
> > >
> > > Yes, if we apply this patchset, we can test hot-remove memory on kvm
> > guest.
> > > But that patchset doesn't implement _PS3, so there is some restriction.
> >
> > the following repos contain the patchset above, plus 2 more patches that
> > add
> > PS3 support to the dimm devices in qemu/seabios:
> >
> > https://github.com/vliaskov/seabios/commits/memhp-v2
> > https://github.com/vliaskov/qemu-kvm/commits/memhp-v2
> >
> > I have not posted the PS3 patches yet in the qemu list, but will post them
> > soon for v3 of the memory hotplug series. If you have issues testing, let
> > me
> > know.
> >
> > thanks,
> >
> > - Vasilis
> >
> > --
> > To unsubscribe, send a message with 'unsubscribe linux-mm' in
> > the body to majordomo@kvack.org.  For more info on Linux MM,
> > see: http://www.linux-mm.org/ .
> > Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> >
> 
> 
> 
> -- 
> I love linux!!!

-- 
Kind regards,
Minchan Kim

^ permalink raw reply

* linux-next: build failure after merge of the pci tree
From: Stephen Rothwell @ 2012-09-11  1:29 UTC (permalink / raw)
  To: Bjorn Helgaas
  Cc: Gavin Shan, linux-kernel, linux-next, Paul Mackerras, Yijing Wang,
	linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 699 bytes --]

Hi Bjorn,

After merging the pci tree, today's linux-next build (powerpc
ppc64_defconfig) failed like this:

arch/powerpc/platforms/powernv/pci-ioda.c: In function 'pnv_pci_window_alignment':
arch/powerpc/platforms/powernv/pci-ioda.c:1163:13: error: 'struct pci_dev' has no member named 'pcie_type'

Caused by commit ac161fbfdaa2 ("powerpc/powernv: I/O and memory alignment
for P2P bridges").  pcie_type was removed in commit b2ef39be5744 ("PCI:
Remove unused field pcie_type from struct pci_dev") (also in the pci
tree).

I have used the pci tree from next-20120910 for today.
-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [RFC patch powerpc,trace] Avoid suspicious RCU usage reporting for some tracepoints
From: Anton Blanchard @ 2012-09-11  1:23 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Paul Mackerras, Paul E. McKenney, PowerPC email list,
	Steven Rostedt, Li Zhong
In-Reply-To: <1347253812.2385.148.camel@pasglop>

Hi Ben,

> > Or could we just move these tracepoints inside the
> > irq_enter()/irq_exit() area? (Seems not good for the timer_interrupt
> > case). 
> 
> I'd say just move them in. Anton, any objection ?

Sounds reasonable, no objections from me.

Anton

^ permalink raw reply

* Re: [RFC v8 PATCH 00/20] memory-hotplug: hot-remove physical memory
From: Wen Congyang @ 2012-09-11  1:25 UTC (permalink / raw)
  To: Jerry
  Cc: linux-s390, linux-ia64, len.brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Vasilis Liaskovitis, linux-mm,
	Yasuaki Ishimatsu, paulus, minchan.kim, kosaki.motohiro, rientjes,
	sparclinux, Andrew Morton, linuxppc-dev, cl, liuj97
In-Reply-To: <CAAV+Mu7YWRWnxt78F4ZDMrrUsWB=n-_qkYOcQT7WQ2HwP89Obw@mail.gmail.com>

At 09/11/2012 08:27 AM, Jerry Wrote:
> Hi Wen,
> 
> I have been arranged a job related memory hotplug on ARM architecture.
> Maybe I know some new issues about memory hotplug on ARM architecture. I
> just enabled it on ARM, and it works well in my Android tablet now.
> However, I have not send out my patches. The real reason is that I don't
> know how to do it. Maybe I need to read "Documentation/SubmittingPatches".
> 
> Hi Andrew,
> This is my first time to send you a e-mail. I am so nervous about if I have
> some mistakes or not.
> 
> Some peoples maybe think memory hotplug need to be supported by special
> hardware. Maybe it means memory physical hotplug. Some times, we just need
> to use memory logical hotplug, doesn't remove the memory in physical. It is
> also usefully for power saving in my platform. Because I doesn't want
> the offline memory is in *self-refresh* state.

Power saving? Do you need _PSx support?

Thanks
Wen Congyang

> 
> Any comments are appreciated.
> 
> Thanks,
> Jerry
> 
> 2012/9/10 Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>
> 
>> Hi,
>>
>> On Mon, Sep 10, 2012 at 10:01:44AM +0800, Wen Congyang wrote:
>>> At 09/10/2012 09:46 AM, Yasuaki Ishimatsu Wrote:
>>>> Hi Wen,
>>>>
>>>> 2012/09/01 5:49, Andrew Morton wrote:
>>>>> On Tue, 28 Aug 2012 18:00:07 +0800
>>>>> wency@cn.fujitsu.com wrote:
>>>>>
>>>>>> This patch series aims to support physical memory hot-remove.
>>>>>
>>>>> I doubt if many people have hardware which permits physical memory
>>>>> removal?  How would you suggest that people with regular hardware can
>>>>> test these chagnes?
>>>>
>>>> How do you test the patch? As Andrew says, for hot-removing memory,
>>>> we need a particular hardware. I think so too. So many people may want
>>>> to know how to test the patch.
>>>> If we apply following patch to kvm guest, can we hot-remove memory on
>>>> kvm guest?
>>>>
>>>> http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
>>>
>>> Yes, if we apply this patchset, we can test hot-remove memory on kvm
>> guest.
>>> But that patchset doesn't implement _PS3, so there is some restriction.
>>
>> the following repos contain the patchset above, plus 2 more patches that
>> add
>> PS3 support to the dimm devices in qemu/seabios:
>>
>> https://github.com/vliaskov/seabios/commits/memhp-v2
>> https://github.com/vliaskov/qemu-kvm/commits/memhp-v2
>>
>> I have not posted the PS3 patches yet in the qemu list, but will post them
>> soon for v3 of the memory hotplug series. If you have issues testing, let
>> me
>> know.
>>
>> thanks,
>>
>> - Vasilis
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>>
> 
> 
> 

^ permalink raw reply

* Re: [RFC v8 PATCH 00/20] memory-hotplug: hot-remove physical memory
From: Jerry @ 2012-09-11  0:27 UTC (permalink / raw)
  To: Wen Congyang, Andrew Morton
  Cc: linux-s390, linux-ia64, len.brown, linux-acpi, linux-sh, x86,
	linux-kernel, cmetcalf, Vasilis Liaskovitis, linux-mm,
	Yasuaki Ishimatsu, paulus, minchan.kim, kosaki.motohiro, rientjes,
	sparclinux, cl, linuxppc-dev, liuj97
In-Reply-To: <20120910135213.GA1550@dhcp-192-168-178-175.profitbricks.localdomain>

[-- Attachment #1: Type: text/plain, Size: 2696 bytes --]

Hi Wen,

I have been arranged a job related memory hotplug on ARM architecture.
Maybe I know some new issues about memory hotplug on ARM architecture. I
just enabled it on ARM, and it works well in my Android tablet now.
However, I have not send out my patches. The real reason is that I don't
know how to do it. Maybe I need to read "Documentation/SubmittingPatches".

Hi Andrew,
This is my first time to send you a e-mail. I am so nervous about if I have
some mistakes or not.

Some peoples maybe think memory hotplug need to be supported by special
hardware. Maybe it means memory physical hotplug. Some times, we just need
to use memory logical hotplug, doesn't remove the memory in physical. It is
also usefully for power saving in my platform. Because I doesn't want
the offline memory is in *self-refresh* state.

Any comments are appreciated.

Thanks,
Jerry

2012/9/10 Vasilis Liaskovitis <vasilis.liaskovitis@profitbricks.com>

> Hi,
>
> On Mon, Sep 10, 2012 at 10:01:44AM +0800, Wen Congyang wrote:
> > At 09/10/2012 09:46 AM, Yasuaki Ishimatsu Wrote:
> > > Hi Wen,
> > >
> > > 2012/09/01 5:49, Andrew Morton wrote:
> > >> On Tue, 28 Aug 2012 18:00:07 +0800
> > >> wency@cn.fujitsu.com wrote:
> > >>
> > >>> This patch series aims to support physical memory hot-remove.
> > >>
> > >> I doubt if many people have hardware which permits physical memory
> > >> removal?  How would you suggest that people with regular hardware can
> > >> test these chagnes?
> > >
> > > How do you test the patch? As Andrew says, for hot-removing memory,
> > > we need a particular hardware. I think so too. So many people may want
> > > to know how to test the patch.
> > > If we apply following patch to kvm guest, can we hot-remove memory on
> > > kvm guest?
> > >
> > > http://lists.gnu.org/archive/html/qemu-devel/2012-07/msg01389.html
> >
> > Yes, if we apply this patchset, we can test hot-remove memory on kvm
> guest.
> > But that patchset doesn't implement _PS3, so there is some restriction.
>
> the following repos contain the patchset above, plus 2 more patches that
> add
> PS3 support to the dimm devices in qemu/seabios:
>
> https://github.com/vliaskov/seabios/commits/memhp-v2
> https://github.com/vliaskov/qemu-kvm/commits/memhp-v2
>
> I have not posted the PS3 patches yet in the qemu list, but will post them
> soon for v3 of the memory hotplug series. If you have issues testing, let
> me
> know.
>
> thanks,
>
> - Vasilis
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>



-- 
I love linux!!!

[-- Attachment #2: Type: text/html, Size: 4071 bytes --]

^ permalink raw reply

* Re: [PATCH 2/2] powerpc/e6500: TLB miss handler with hardware tablewalk support
From: Scott Wood @ 2012-09-11  0:26 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1347061804.2385.58.camel@pasglop>

On 09/07/2012 06:50 PM, Benjamin Herrenschmidt wrote:
>>>>>  #endif /* CONFIG_PPC64 */
>>>>> @@ -377,7 +382,7 @@ void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address)
>>>>>  {
>>>>>  	int tsize = mmu_psize_defs[mmu_pte_psize].enc;
>>>>>  
>>>>> -	if (book3e_htw_enabled) {
>>>>> +	if (book3e_htw_mode) {
>>>>
>>>> Make it if (boot3e_htw_enabled != PPC_HTW_NONE)
>>
>> Seems a little verbose, but OK.
>>
>> Same with things like this, I guess:
>> 	book3e_htw_mode ? "enabled" : "not supported"
> 
> Well, it's no longer a boolean so ...

It's pretty common to use implicit boolean conversion when a zero value
means no/false/absent, even if there are multiple non-false
possibilities (e.g. pointers) -- but not a big deal to change it if you
prefer.

> BTW. On another note, can you pickup Ananth series for larger address
> space (minus the one patch that breaks the BookE build, it shouldn't
> matter) and see if there's any runtime issue on BookE 64 ? (And whether
> the larger address space actually works for you too, using something
> like high up mmap tests)

It booted OK for me in my initial testing with a ramdisk, but when I
tried to use the network (to load a high mmap test program) I got hangs
that didn't happen before that patchset.  I'll look into it more tomorrow.

-Scott

^ permalink raw reply

* RE: [PATCH] using get/put_user64 apis on 64bit machine
From: Benjamin Herrenschmidt @ 2012-09-10 22:25 UTC (permalink / raw)
  To: Bhushan Bharat-R65777
  Cc: Wood Scott-B07421, linuxppc-dev@lists.ozlabs.org, agraf@suse.de
In-Reply-To: <6A3DF150A5B70D4F9B66A25E3F7C888D0646913C@039-SN2MPN1-023.039d.mgd.msft.net>

On Mon, 2012-09-10 at 08:50 +0000, Bhushan Bharat-R65777 wrote:
> When I sent the patch I did not search the users, I agree that we can
> remove the __get_user64 and __put_user64 altogether.

Care to send a patch ?

Cheers,
Ben.

^ permalink raw reply

* Re: [RFC patch powerpc,trace] Avoid suspicious RCU usage reporting for some tracepoints
From: Benjamin Herrenschmidt @ 2012-09-10 22:14 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Anton Blanchard, Paul Mackerras, Paul E. McKenney,
	PowerPC email list, Li Zhong
In-Reply-To: <1347285742.10751.6.camel@gandalf.local.home>

On Mon, 2012-09-10 at 10:02 -0400, Steven Rostedt wrote:
> I agree too. I'm a bit concerned about the lack of coverage the
> irq_enter() and irq_exit() have in the timer interrupt:
> 
> 	may_hard_irq_enable();
> 
> 	trace_timer_interrupt_entry(regs);
> 
> 	__get_cpu_var(irq_stat).timer_irqs++;
> 
> #if defined(CONFIG_PPC32) && defined(CONFIG_PMAC)
> 	if (atomic_read(&ppc_n_lost_interrupts) != 0)
> 		do_IRQ(regs);
> #endif
> 
> 	old_regs = set_irq_regs(regs);
> 	irq_enter();
> 
> 
> I'm guessing that call to do_IRQ() has to do with lazy irq handling?

Not exactly. may_hard_irq_enable() does, it will hard-enable interrupts
(while keeping them soft-disabled) if there isn't a recorded "level"
interrupt pending already (in which case it's pointless, we'll just be
interrupted again and re-disable).

The do_IRQ() hack is an old hack for ancient 32-bit powermacs who could
lose interrupts, it's a sideband mechanism we have to re-emit them,
which nowadays could probably be replaced with something more modern,
it's just that nobody ever wants to touch that code.

> Anyway, there may be a reason to have the tracepoint before this call,
> but I'm not sure it really is that important. It should probably be best
> to move it after the irq_enter(). There's side-effects with calling
> things from interrupt context outside of irq_enter/exit().

Cheers,
Ben.

^ permalink raw reply

* Re: [PATCH] powerpc-powernv: align BARs to PAGE_SIZE on powernv platform
From: Alex Williamson @ 2012-09-10 17:06 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <5046E2B1.8010805@ozlabs.ru>

On Wed, 2012-09-05 at 15:27 +1000, Alexey Kardashevskiy wrote:
> On 05/09/12 15:17, Benjamin Herrenschmidt wrote:
> > On Tue, 2012-09-04 at 22:57 -0600, Alex Williamson wrote:
> >
> >> Do we need an extra region info field, or is it sufficient that we
> >> define a region to be mmap'able with getpagesize() pages when the MMAP
> >> flag is set and simply offset the region within the device fd?  ex.
> >
> > Alexey ? You mentioned you had ways to get at the offset with the
> > existing interfaces ?
> 
> 
> Yes, VFIO_DEVICE_GET_REGION_INFO ioctl of vfio-pci host driver, the "info" 
> struct has an "offset" field.
> I just do not have a place to use it in the QEMU right now as the guest 
> does the same allocation as the host does (by accident).

Yep, this is the offset into the device fd though.  We currently used a
fixed 40bit region for each BAR, but that's mostly a leftover from
before the API described the offset.  It's a little bit of an
optimization on the kernel side to convert offset->BAR w/o a lookup, but
we're hopefully mmap'ing as much as possible, modulo the page size
issues here.

> >> BAR0: 0x10000 /* no offset */
> >> BAR1: 0x21000 /* 4k offset */
> >> BAR2: 0x32000 /* 8k offset */
> >>
> >> A second level optimization might make these 0x10000, 0x11000, 0x12000.
> >>
> >> This will obviously require some arch hooks w/in vfio as we can't do
> >> this on x86 since we can't guarantee that whatever lives in the
> >> overflow/gaps is in the same group and power is going to need to make
> >> sure we don't accidentally allow msix table mapping... in fact hiding
> >> the msix table might be a lot more troublesome on 64k page hosts.
> >
> > Fortunately, our guests don't access the msix table directly anyway, at
> > least most of the time :-)
> 
> 
> Not at all in our case. It took me some time to push a QEMU patch which 
> changes msix table :)

vfio needs to be safe regardless of whether it's being used by qemu or
some other userspace driver though.

> > There's a paravirt API for it, and our iommu
> > makes sure that if for some reason the guest still accesses it and does
> > the wrong thing to it, the side effects will be contained to the guest.

If direct access to the MSIX table neither leaks information nor leads
to exploitable holes, then I have no problem allowing a platform hook to
make it mmap'able.  We should be looking at this for x86 too on
platforms where we have interrupt remapping capabilities.

> >>> Now the main problem here is going to be that the guest itself might
> >>> reallocate the BAR and move it around (well, it's version of the BAR
> >>> which isn't the real thing), and so we cannot create a direct MMU
> >>> mapping between -that- and the real BAR.
> >>>
> >>> IE. We can only allow that direct mapping if the guest BAR mapping has
> >>> the same "offset within page" as the host BAR mapping.
> >>
> >> Euw...
> >
> > Yeah sucks :-) Basically, let's say page size is 64K. Host side BAR
> > (real BAR) is at 0xf0001000.
> >
> > qemu maps 0xf0000000..0xf000ffff to a virtual address inside QEMU,
> > itself 64k aligned, let's say 0x80000000 and knows that the BAR is at
> > offset 0x1000 in there.
> >
> > However, the KVM "MR" API is such that we can only map PAGE_SIZE regions
> > into the guest as well, so if the guest assigns a value ADDR to the
> > guest BAR, let's say 0x40002000, all KVM can do is an MR that maps
> > 0x40000000 (guest physical) to 0x80000000 (qemu). Any access within that
> > 64K page will have the low bits transferred directly from guest to HW.
> >
> > So the guest will end up having that 0x2000 offset instead of the 0x1000
> > needed to actually access the BAR. FAIL.
> >
> > There are ways to fix that but all are nasty.
> >
> >   - In theory, we have the capability (and use it today) to restrict IO
> > mappings in the guest to 4K HW pages, so knowing that, KVM could use a
> > "special" MR that plays tricks here... but that would break all sort of
> > generic code both in qemu and kvm and generally be very nasty.
> >
> >   - The best approach is to rely on the fact that our guest kernels don't
> > do BAR assignment, they rely on FW to do it (ie not at all, unlike x86,
> > we can't even fixup because in the general case, the hypervisor won't
> > let us anyway). So we could move our guest BAR allocation code out of
> > our guest firmware (SLOF) back into qemu (where we had it very early
> > on), which allows us to make sure that the guest BAR values we assign
> > have the same "offset within the page" as the host side values. This
> > would also allow us to avoid messing up too many MRs (this can have a
> > performance impact with KVM) and eventually handle our "group" regions
> > instead of individual BARs for mappings. We might need to do that anyway
> > in the long run for hotplug as our hotplug hypervisor APIs also rely on
> > the "new" hotplugged devices to have the BARs pre-assigned when they get
> > handed out to the guest.

Ok, now it's making more sense how the original patch here was
beneficial.  If the physical BAR is 64k aligned we could expose the BAR
as being at least 64k and therefore everything would line up.  We have
the same issue on x86 where devices might have <4k BARs and we'd prefer
to mmap them.  So far we've successfully ignored them because they're
not "high performance" devices and because we can't always assume the
extra space is unused or even safe to access.  Obviously the problem
gets much worse at 64k.

You know that the MMIO space consumed by a group is 64k aligned, but
individual BARs are not.  Guest and host may end up with different BAR
offsets within a 64k page though, so that doesn't buy us much.  Blech.
Yeah, relying on the guest not changing mappings is pretty ugly, but may
be a solution.  For the general case though we'd really prefer each BAR
to be treated as a minimum 64k then exposed to the user (qemu) as the
same.  Doing that across a system is pretty wasteful and remapping is
not always possible.  You almost want a setting per PE to specify how
BARs are aligned (and x86 might want the same, but it's not as clear how
to define what devices to apply it to).  Hard problem.  Thanks,

Alex

^ permalink raw reply

* Re: [PATCH] usb: gadget: fsl_udc_core: do not immediatly prime STATUS for IN xfer
From: Felipe Balbi @ 2012-09-10 16:22 UTC (permalink / raw)
  To: Felipe Balbi
  Cc: Chen Peter-B29397, Li Yang-R58472, Enrico Scholz,
	linux-usb@vger.kernel.org, gregkh@linuxfoundation.org,
	linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20120906142739.GV29202@arwen.pp.htv.fi>

[-- Attachment #1: Type: text/plain, Size: 1890 bytes --]

On Thu, Sep 06, 2012 at 05:27:42PM +0300, Felipe Balbi wrote:
> Hi,
> 
> On Thu, Sep 06, 2012 at 04:27:12PM +0200, Enrico Scholz wrote:
> > Felipe Balbi <balbi@ti.com> writes:
> > 
> > >> > Because the fsl_udc_core driver shares one 'status_req' object for the
> > >> > complete ep0 control transfer, it is not possible to prime the final
> > >> > STATUS phase immediately after the IN transaction.  E.g. ch9getstatus()
> > >> > executed:
> > >> > 
> > >> > | req = udc->status_req;
> > >> > | ...
> > >> > | list_add_tail(&req->queue, &ep->queue);
> > >> > | if (ep0_prime_status(udc, EP_DIR_OUT))
> > >> > |       ....
> > >> > |       struct fsl_req *req = udc->status_req;
> > >> > |       list_add_tail(&req->queue, &ep->queue);
> > >> > 
> > >> > which corrupts the ep->queue list by inserting 'status_req' twice.  This
> > >> > causes a kernel oops e.g. when 'lsusb -v' is executed on the host.
> > >> > 
> > >> > Patch delays the final 'ep0_prime_status(udc, EP_DIR_OUT))' by moving it
> > >> > into the ep0 completion handler.
> > >> > 
> > >> Enrico, thanks for pointing this problem.
> > >> 
> > >> As "prime STATUS phase immediately after the IN transaction" is followed
> > >> USB 2.0 spec, to fix this problem, it is better to add data_req for ep0.
> > >> In fact, it is already at FSL i.mx internal code, just still not mainlined.
> > >
> > > so, do I get an Acked-by to this patch ? Does it need to go on v3.6-rc
> > > or can it wait until v3.7 merge window ?
> > 
> > Without this (or the mentioned data_req patch), I can crash a g_multi
> > gadget by executing 'lsusb -v' as root on the host.  Should not be
> > exploitable (only a BUG_ON() is triggered) but issue should be fixed
> > asap.
> 
> cool, so I'll apply to my fixes branch as soon as I get Acked-by or
> Tested-by from someone.

No Acks, no Tested-by ?

-- 
balbi

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply

* Re: [PATCH] vfio: enabled and supported on power (v7)
From: Alex Williamson @ 2012-09-10 16:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: Paul Mackerras, linuxppc-dev, David Gibson
In-Reply-To: <1346744035-31154-1-git-send-email-aik@ozlabs.ru>

On Tue, 2012-09-04 at 17:33 +1000, Alexey Kardashevskiy wrote:
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Paul Mackerras <paulus@samba.org>
> Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
> ---

Please at least cc kvm@vger as well since we list that as the devel list
for vfio.

>  arch/powerpc/include/asm/iommu.h    |    3 +

I'll need an ack from Ben or Paul for this change.

>  drivers/iommu/Kconfig               |    8 +
>  drivers/vfio/Kconfig                |    6 +
>  drivers/vfio/Makefile               |    1 +
>  drivers/vfio/vfio_iommu_spapr_tce.c |  440 +++++++++++++++++++++++++++++++++++
>  include/linux/vfio.h                |   29 +++
>  6 files changed, 487 insertions(+)
>  create mode 100644 drivers/vfio/vfio_iommu_spapr_tce.c
> 
> diff --git a/arch/powerpc/include/asm/iommu.h b/arch/powerpc/include/asm/iommu.h
> index 957a83f..c64bce7 100644
> --- a/arch/powerpc/include/asm/iommu.h
> +++ b/arch/powerpc/include/asm/iommu.h
> @@ -66,6 +66,9 @@ struct iommu_table {
>  	unsigned long  it_halfpoint; /* Breaking point for small/large allocs */
>  	spinlock_t     it_lock;      /* Protects it_map */
>  	unsigned long *it_map;       /* A simple allocation bitmap for now */
> +#ifdef CONFIG_IOMMU_API
> +	struct iommu_group *it_group;
> +#endif
>  };

This seems to only be valid when vfio_iommu_spapr_tce is loaded, which
is a bit misleading.

>  
>  struct scatterlist;
> diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig
> index 3bd9fff..19cf2d9 100644
> --- a/drivers/iommu/Kconfig
> +++ b/drivers/iommu/Kconfig
> @@ -162,4 +162,12 @@ config TEGRA_IOMMU_SMMU
>  	  space through the SMMU (System Memory Management Unit)
>  	  hardware included on Tegra SoCs.
>  
> +config SPAPR_TCE_IOMMU
> +	bool "sPAPR TCE IOMMU Support"
> +	depends on PPC_PSERIES
> +	select IOMMU_API
> +	help
> +	  Enables bits of IOMMU API required by VFIO. The iommu_ops is
> +	  still not implemented.
> +
>  endif # IOMMU_SUPPORT
> diff --git a/drivers/vfio/Kconfig b/drivers/vfio/Kconfig
> index 7cd5dec..b464687 100644
> --- a/drivers/vfio/Kconfig
> +++ b/drivers/vfio/Kconfig
> @@ -3,10 +3,16 @@ config VFIO_IOMMU_TYPE1
>  	depends on VFIO
>  	default n
>  
> +config VFIO_IOMMU_SPAPR_TCE
> +	tristate
> +	depends on VFIO && SPAPR_TCE_IOMMU
> +	default n
> +
>  menuconfig VFIO
>  	tristate "VFIO Non-Privileged userspace driver framework"
>  	depends on IOMMU_API
>  	select VFIO_IOMMU_TYPE1 if X86
> +	select VFIO_IOMMU_SPAPR_TCE if PPC_POWERNV
>  	help
>  	  VFIO provides a framework for secure userspace device drivers.
>  	  See Documentation/vfio.txt for more details.
> diff --git a/drivers/vfio/Makefile b/drivers/vfio/Makefile
> index 2398d4a..72bfabc 100644
> --- a/drivers/vfio/Makefile
> +++ b/drivers/vfio/Makefile
> @@ -1,3 +1,4 @@
>  obj-$(CONFIG_VFIO) += vfio.o
>  obj-$(CONFIG_VFIO_IOMMU_TYPE1) += vfio_iommu_type1.o
> +obj-$(CONFIG_VFIO_IOMMU_SPAPR_TCE) += vfio_iommu_spapr_tce.o
>  obj-$(CONFIG_VFIO_PCI) += pci/
> diff --git a/drivers/vfio/vfio_iommu_spapr_tce.c b/drivers/vfio/vfio_iommu_spapr_tce.c
> new file mode 100644
> index 0000000..21f1909
> --- /dev/null
> +++ b/drivers/vfio/vfio_iommu_spapr_tce.c
> @@ -0,0 +1,440 @@
> +/*
> + * VFIO: IOMMU DMA mapping support for TCE on POWER
> + *
> + * Copyright (C) 2012 IBM Corp.  All rights reserved.
> + *     Author: Alexey Kardashevskiy <aik@ozlabs.ru>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + *
> + * Derived from original vfio_iommu_x86.c:

Should this be _type1?  Only the mail archives are going to remember
there was a _x86, so the renamed version is probably a better reference.

> + * Copyright (C) 2012 Red Hat, Inc.  All rights reserved.
> + *     Author: Alex Williamson <alex.williamson@redhat.com>
> + */
> +
> +#include <linux/module.h>
> +#include <linux/pci.h>
> +#include <linux/slab.h>
> +#include <linux/uaccess.h>
> +#include <linux/err.h>
> +#include <linux/vfio.h>
> +#include <linux/spinlock.h>
> +#include <asm/iommu.h>
> +
> +#define DRIVER_VERSION  "0.1"
> +#define DRIVER_AUTHOR   "aik@ozlabs.ru"
> +#define DRIVER_DESC     "VFIO IOMMU SPAPR TCE"
> +
> +
> +/*
> + * SPAPR TCE API
> + */
> +static void tce_free(struct iommu_table *tbl, unsigned long entry,
> +		unsigned long tce)
> +{
> +	struct page *page = pfn_to_page(tce >> PAGE_SHIFT);
> +
> +	WARN_ON(!page);
> +	if (page) {
> +		if (tce & VFIO_SPAPR_TCE_WRITE)
> +			SetPageDirty(page);
> +		put_page(page);
> +	}
> +	ppc_md.tce_free(tbl, entry, 1);
> +}
> +
> +static long tce_put(struct iommu_table *tbl,
> +		unsigned long entry, uint64_t tce, uint32_t flags)
> +{
> +	int ret;
> +	unsigned long oldtce, kva, offset;
> +	struct page *page = NULL;
> +	enum dma_data_direction direction = DMA_NONE;
> +
> +	switch (flags & VFIO_SPAPR_TCE_PUT_MASK) {
> +	case VFIO_SPAPR_TCE_READ:
> +		direction = DMA_TO_DEVICE;
> +		break;
> +	case VFIO_SPAPR_TCE_WRITE:
> +		direction = DMA_FROM_DEVICE;
> +		break;
> +	case VFIO_SPAPR_TCE_BIDIRECTIONAL:
> +		direction = DMA_BIDIRECTIONAL;
> +		break;
> +	}
> +
> +	oldtce = ppc_md.tce_get(tbl, entry);
> +
> +	/* Free page if still allocated */
> +	if (oldtce & VFIO_SPAPR_TCE_PUT_MASK)
> +		tce_free(tbl, entry, oldtce);
> +
> +	/* Map new TCE */
> +	if (direction != DMA_NONE) {
> +		offset = (tce & IOMMU_PAGE_MASK) - (tce & PAGE_MASK);
> +		ret = get_user_pages_fast(tce & PAGE_MASK, 1,
> +				direction != DMA_TO_DEVICE, &page);
> +		BUG_ON(ret > 1);

Can this happen?

> +		if (ret < 1) {
> +			printk(KERN_ERR "tce_vfio: get_user_pages_fast failed "
> +					"tce=%llx ioba=%lx ret=%d\n",
> +					tce, entry << IOMMU_PAGE_SHIFT, ret);
> +			if (!ret)
> +				ret = -EFAULT;
> +			goto unlock_exit;
> +		}
> +
> +		kva = (unsigned long) page_address(page);
> +		kva += offset;
> +		BUG_ON(!kva);

Same here, can it happen?  If so, should it BUG or catch the below
EINVAL?

> +		if (WARN_ON(kva & ~IOMMU_PAGE_MASK))
> +			return -EINVAL;

Page leak?  Don't we want to do a put_page(), which means we probably
want a goto exit here.

> +
> +		/* Preserve access bits */
> +		kva |= flags & VFIO_SPAPR_TCE_PUT_MASK;
> +
> +		/* tce_build receives a virtual address */
> +		entry += tbl->it_offset;	/* Offset into real TCE table */
> +		ret = ppc_md.tce_build(tbl, entry, 1, kva, direction, NULL);
> +
> +		/* tce_build() only returns non-zero for transient errors */
> +		if (unlikely(ret)) {
> +			printk(KERN_ERR "tce_vfio: Failed to add TCE\n");
> +			ret = -EIO;
> +			goto unlock_exit;
> +		}
> +	}
> +	/* Flush/invalidate TLB caches if necessary */
> +	if (ppc_md.tce_flush)
> +		ppc_md.tce_flush(tbl);
> +
> +	/* Make sure updates are seen by hardware */
> +	mb();
> +
> +unlock_exit:

unlock seems wrong here, I had to go re-read the code looking for the
lock.

> +	if (ret && page)
> +		put_page(page);
> +
> +	if (ret)
> +		printk(KERN_ERR "tce_vfio: tce_put failed on tce=%llx "
> +				"ioba=%lx kva=%lx\n", tce,
> +				entry << IOMMU_PAGE_SHIFT, kva);
> +	return ret;
> +}
> +
> +/*
> + * VFIO IOMMU fd for SPAPR_TCE IOMMU implementation
> + */
> +
> +/*
> + * The container descriptor supports only a single group per container.
> + * Required by the API as the container is not supplied with the IOMMU group
> + * at the moment of initialization.
> + */
> +struct tce_container {
> +	struct iommu_table *tbl;
> +};
> +
> +static void *tce_iommu_open(unsigned long arg)
> +{
> +	struct tce_container *container;
> +
> +	if (arg != VFIO_SPAPR_TCE_IOMMU) {
> +		printk(KERN_ERR "tce_vfio: Wrong IOMMU type\n");
> +		return ERR_PTR(-EINVAL);
> +	}
> +
> +	container = kzalloc(sizeof(*container), GFP_KERNEL);
> +	if (!container)
> +		return ERR_PTR(-ENOMEM);
> +
> +	return container;
> +}
> +
> +static void tce_iommu_release(void *iommu_data)
> +{
> +	struct tce_container *container = iommu_data;
> +	struct iommu_table *tbl = container->tbl;
> +	unsigned long i, tce;
> +

This will segfault if releasing a container that never had an a device
attached.

> +	/* Unmap leftovers */
> +	spin_lock_irq(&tbl->it_lock);
> +	for (i = tbl->it_offset; i < tbl->it_offset + tbl->it_size; ++i) {
> +		tce = ppc_md.tce_get(tbl, i);
> +		if (tce & VFIO_SPAPR_TCE_PUT_MASK)
> +			tce_free(tbl, i, tce);
> +	}
> +	/* Flush/invalidate TLB caches if necessary */
> +	if (ppc_md.tce_flush)
> +		ppc_md.tce_flush(tbl);
> +
> +	/* Make sure updates are seen by hardware */
> +	mb();
> +
> +	spin_unlock_irq(&tbl->it_lock);
> +
> +	kfree(container);
> +}
> +
> +static long tce_iommu_ioctl(void *iommu_data,
> +				 unsigned int cmd, unsigned long arg)
> +{
> +	struct tce_container *container = iommu_data;
> +	unsigned long minsz;
> +	long ret;
> +
> +	switch (cmd) {
> +	case VFIO_CHECK_EXTENSION: {
> +		return (arg == VFIO_SPAPR_TCE_IOMMU) ? 1 : 0;
> +	}
> +	case VFIO_IOMMU_SPAPR_TCE_GET_INFO: {
> +		struct vfio_iommu_spapr_tce_info info;
> +		struct iommu_table *tbl = container->tbl;
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_info,
> +				dma64_window_size);
> +
> +		if (copy_from_user(&info, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (info.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (!tbl)
> +			return -ENXIO;

nit: why not check this earlier?

> +
> +		info.dma32_window_start = tbl->it_offset << IOMMU_PAGE_SHIFT;
> +		info.dma32_window_size = tbl->it_size << IOMMU_PAGE_SHIFT;
> +		info.dma64_window_start = 0;
> +		info.dma64_window_size = 0;
> +		info.flags = 0;
> +
> +		return copy_to_user((void __user *)arg, &info, minsz);
> +	}
> +	case VFIO_IOMMU_SPAPR_TCE_PUT: {
> +		struct vfio_iommu_spapr_tce_put par;
> +		struct iommu_table *tbl = container->tbl;
> +
> +		minsz = offsetofend(struct vfio_iommu_spapr_tce_put, tce);
> +
> +		if (copy_from_user(&par, (void __user *)arg, minsz))
> +			return -EFAULT;
> +
> +		if (par.argsz < minsz)
> +			return -EINVAL;
> +
> +		if (!tbl) {
> +			return -ENXIO;
> +		}

Same, plus drop the braces.

> +
> +		spin_lock_irq(&tbl->it_lock);
> +		ret = tce_put(tbl, par.ioba >> IOMMU_PAGE_SHIFT,
> +				par.tce, par.flags);
> +		spin_unlock_irq(&tbl->it_lock);
> +
> +		return ret;
> +	}

Is "PUT" really the name we want for this?

> +	default:
> +		printk(KERN_WARNING "tce_vfio: unexpected cmd %x\n", cmd);
> +	}
> +
> +	return -ENOTTY;
> +}
> +
> +static int tce_iommu_attach_group(void *iommu_data,
> +		struct iommu_group *iommu_group)
> +{
> +	struct tce_container *container = iommu_data;
> +	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> +
> +	printk(KERN_DEBUG "tce_vfio: Attaching group #%u to iommu %p\n",
> +			iommu_group_id(iommu_group), iommu_group);

Let's use pr_debug() and friends throughout.

> +	if (container->tbl) {
> +		printk(KERN_WARNING "tce_vfio: Only one group per IOMMU "
> +				"container is allowed, "
> +				"existing id=%d, attaching id=%d\n",
> +				iommu_group_id(container->tbl->it_group),
> +				iommu_group_id(iommu_group));
> +		return -EBUSY;
> +	}
> +

_type1 has a lock to avoid races here, I think you might need one too.

> +	container->tbl = tbl;
> +
> +	return 0;
> +}
> +
> +static void tce_iommu_detach_group(void *iommu_data,
> +		struct iommu_group *iommu_group)
> +{
> +	struct tce_container *container = iommu_data;
> +	struct iommu_table *tbl = iommu_group_get_iommudata(iommu_group);
> +
> +	BUG_ON(!tbl);

Needed?  If so, why is there no check on attach?

> +	if (tbl != container->tbl) {
> +		printk(KERN_WARNING "tce_vfio: detaching group #%u, expected "
> +				"group is #%u\n", iommu_group_id(iommu_group),
> +				iommu_group_id(tbl->it_group));
> +		return;
> +	}
> +	printk(KERN_DEBUG "tce_vfio: detaching group #%u from iommu %p\n",
> +			iommu_group_id(iommu_group), iommu_group);

container->tbl = NULL?

> +}
> +
> +const struct vfio_iommu_driver_ops tce_iommu_driver_ops = {
> +	.name		= "iommu-vfio-powerpc",
> +	.owner		= THIS_MODULE,
> +	.open		= tce_iommu_open,
> +	.release	= tce_iommu_release,
> +	.ioctl		= tce_iommu_ioctl,
> +	.attach_group	= tce_iommu_attach_group,
> +	.detach_group	= tce_iommu_detach_group,
> +};
> +
> +/*
> + * Add/delete devices support (hotplug, module_init, module_exit)
> + */
> +static int add_device(struct device *dev)
> +{
> +	struct iommu_table *tbl;
> +	int ret = 0;
> +
> +	if (dev->iommu_group) {
> +		printk(KERN_WARNING "tce_vfio: device %s is already in iommu "
> +				"group %d, skipping\n", dev->kobj.name,

Watch line wrapping on strings.

> +				iommu_group_id(dev->iommu_group));
> +		return -EBUSY;
> +	}
> +
> +	tbl = get_iommu_table_base(dev);
> +	if (!tbl) {
> +		printk(KERN_DEBUG "tce_vfio: skipping device %s with no tbl\n",
> +				dev->kobj.name);
> +		return 0;
> +	}
> +
> +	printk(KERN_DEBUG "tce_vfio: adding %s to iommu group %d\n",
> +			dev->kobj.name, iommu_group_id(tbl->it_group));
> +
> +	ret = iommu_group_add_device(tbl->it_group, dev);
> +	if (ret < 0)
> +		printk(KERN_ERR "tce_vfio: %s has not been added, ret=%d\n",
> +				dev->kobj.name, ret);
> +
> +	return ret;
> +}
> +
> +static void del_device(struct device *dev)
> +{
> +	iommu_group_remove_device(dev);
> +}
> +
> +static int iommu_bus_notifier(struct notifier_block *nb,
> +			      unsigned long action, void *data)
> +{
> +	struct device *dev = data;
> +
> +	switch (action) {
> +	case BUS_NOTIFY_ADD_DEVICE:
> +		return add_device(dev);
> +	case BUS_NOTIFY_DEL_DEVICE:
> +		del_device(dev);
> +		return 0;
> +	default:
> +		return 0;
> +	}
> +}
> +
> +static struct notifier_block tce_iommu_bus_nb = {
> +	.notifier_call = iommu_bus_notifier,
> +};
> +
> +void group_release(void *iommu_data)
> +{
> +	struct iommu_table *tbl = iommu_data;
> +	tbl->it_group = NULL;
> +}
> +
> +static int __init tce_iommu_init(void)
> +{
> +	struct pci_dev *pdev = NULL;
> +	struct iommu_table *tbl;
> +	struct iommu_group *grp;
> +
> +	/* If the current platform does not support tce_get
> +	   we are unable to clean TCE table properly and
> +	   therefore it is better not to touch it at all */
> +	if (!ppc_md.tce_get) {
> +		printk(KERN_ERR "tce_vfio: ppc_md.tce_get isn't implemented\n");
> +		return -EOPNOTSUPP;
> +	}
> +
> +	bus_register_notifier(&pci_bus_type, &tce_iommu_bus_nb);
> +
> +	/* Allocate and initialize VFIO groups */

s/VFIO groups/IOMMU groups/

> +	for_each_pci_dev(pdev) {
> +		tbl = get_iommu_table_base(&pdev->dev);
> +		if (!tbl)
> +			continue;
> +
> +		/* Skip already initialized */
> +		if (tbl->it_group)
> +			continue;
> +
> +		grp = iommu_group_alloc();
> +		if (IS_ERR(grp)) {
> +			printk(KERN_INFO "tce_vfio: cannot create "
> +					"new IOMMU group, ret=%ld\n",
> +					PTR_ERR(grp));
> +			return -EFAULT;
> +		}
> +		tbl->it_group = grp;
> +		iommu_group_set_iommudata(grp, tbl, group_release);
> +	}
> +
> +	/* Add PCI devices to VFIO groups */
> +	for_each_pci_dev(pdev)
> +		add_device(&pdev->dev);
> +
> +	return vfio_register_iommu_driver(&tce_iommu_driver_ops);
> +}
> +
> +static void __exit tce_iommu_cleanup(void)
> +{
> +	struct pci_dev *pdev = NULL;
> +	struct iommu_table *tbl;
> +	struct iommu_group *grp = NULL;
> +
> +	bus_unregister_notifier(&pci_bus_type, &tce_iommu_bus_nb);
> +
> +	/* Delete PCI devices from VFIO groups */
> +	for_each_pci_dev(pdev)
> +		del_device(&pdev->dev);
> +
> +	/* Release VFIO groups */
> +	for_each_pci_dev(pdev) {
> +		tbl = get_iommu_table_base(&pdev->dev);
> +		if (!tbl)
> +			continue;
> +		grp = tbl->it_group;
> +
> +		/* Skip (already) uninitialized */
> +		if (!grp)
> +			continue;
> +
> +		/* Do actual release, group_release() is expected to work */
> +		iommu_group_put(grp);
> +		BUG_ON(tbl->it_group);
> +	}
> +


It troubles me a bit that you're using the vfio driver to initialize and
tear down IOMMU groups on your platform.  VFIO makes use of IOMMU groups
and is the only user so far, but they're hopefully useful beyond this.
In fact, VFIO used to manage assembling all groups from data provided by
the IOMMU but David wanted to see IOMMU groups be a more universally
available feature, so it's odd to see POWER implementing it this way.

> +	vfio_unregister_iommu_driver(&tce_iommu_driver_ops);
> +}
> +
> +module_init(tce_iommu_init);
> +module_exit(tce_iommu_cleanup);
> +
> +MODULE_VERSION(DRIVER_VERSION);
> +MODULE_LICENSE("GPL v2");
> +MODULE_AUTHOR(DRIVER_AUTHOR);
> +MODULE_DESCRIPTION(DRIVER_DESC);
> +
> diff --git a/include/linux/vfio.h b/include/linux/vfio.h
> index 0a4f180..2c0a927 100644
> --- a/include/linux/vfio.h
> +++ b/include/linux/vfio.h
> @@ -99,6 +99,7 @@ extern void vfio_unregister_iommu_driver(
>  /* Extensions */
>  
>  #define VFIO_TYPE1_IOMMU		1
> +#define VFIO_SPAPR_TCE_IOMMU		2
>  
>  /*
>   * The IOCTL interface is designed for extensibility by embedding the
> @@ -442,4 +443,32 @@ struct vfio_iommu_type1_dma_unmap {
>  
>  #define VFIO_IOMMU_UNMAP_DMA _IO(VFIO_TYPE, VFIO_BASE + 14)
>  
> +/* -------- API for SPAPR TCE (Server POWERPC) IOMMU -------- */
> +
> +struct vfio_iommu_spapr_tce_info {
> +	__u32 argsz;
> +	__u32 flags;
> +	__u32 dma32_window_start;
> +	__u32 dma32_window_size;
> +	__u64 dma64_window_start;
> +	__u64 dma64_window_size;
> +};
> +
> +#define VFIO_IOMMU_SPAPR_TCE_GET_INFO	_IO(VFIO_TYPE, VFIO_BASE + 12)
> +
> +struct vfio_iommu_spapr_tce_put {
> +	__u32 argsz;
> +	__u32 flags;
> +#define VFIO_SPAPR_TCE_READ		1
> +#define VFIO_SPAPR_TCE_WRITE		2
> +#define VFIO_SPAPR_TCE_BIDIRECTIONAL	(VFIO_SPAPR_TCE_READ|VFIO_SPAPR_TCE_WRITE)
> +#define VFIO_SPAPR_TCE_PUT_MASK		VFIO_SPAPR_TCE_BIDIRECTIONAL
> +	__u64 ioba;
> +	__u64 tce;
> +};

Ok, so if READ & WRITE are both clear and ioba is set, that's an
"unmap"?  This is exactly why _type1 has a MAP and UNMAP, to make it
clear which fields are necessary for which call.  I think we should
probably do the same here.  Besides, _put makes me think there should be
a _get; do these have some unique meaning in POWER?

> +
> +#define VFIO_IOMMU_SPAPR_TCE_PUT	_IO(VFIO_TYPE, VFIO_BASE + 13)
> +

Please document what all of the above means.  Thanks,

Alex

> +/* ***************************************************************** */
> +
>  #endif /* VFIO_H */

^ permalink raw reply

* Re: [PATCH v3 0/7] mv643xx.c: Add basic device tree support.
From: Arnd Bergmann @ 2012-09-10 14:22 UTC (permalink / raw)
  To: Ian Molton
  Cc: thomas.petazzoni, andrew, netdev, devicetree-discuss, ben.dooks,
	linuxppc-dev, David Miller, linux-arm-kernel
In-Reply-To: <502D201E.9030304@codethink.co.uk>

On Thursday 16 August 2012, Ian Molton wrote:
> Ping :)
> 
> Can we get some consensus on the right approach here? I'm loathe to code
> this if its going to be rejected.
> 
> I'd prefer the driver to be properly split so we dont have the MDIO
> driver mapping the ethernet drivers address spaces, but if thats not
> going to be merged, I'm not feeling like doing the work for nothing.
> 
> If the driver is to use the overlapping-address mapped-by-the-mdio
> scheme, then so be it, but I could do with knowing.
> 
> Another point against the latter scheme is that the MDIO driver could
> sensibly be used (the block is identical) on the ArmadaXP, which has 4
> ethernet blocks rather than two, yet grouped in two pairs with a
> discontiguous address range.
> 
> I'd like to get this moved along as soon as possible though.

Following up on the old discussion, I talked briefly about this
issue with BenH at the kernel summit. The outcome basically is that
it's a bit sad to have incompatible bindings, but it's not the end
of the world,and it's more important to do it right this time.

Just make sure that you use different values for the 'compatible'
strings and then do what you need to get the ARM hardware working.

Ideally, the new binding should be written in a way that powerpc
machines can use the same one, but the existing ones all use
an version of Open Firmware that is not going to get updated
and it's also not too likely that we are going to see new
powerpc machines based on this chip.

	Arnd

^ permalink raw reply

* Re: [RFC patch powerpc,trace] Avoid suspicious RCU usage reporting for some tracepoints
From: Steven Rostedt @ 2012-09-10 14:02 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: Anton Blanchard, Paul Mackerras, Paul E. McKenney,
	PowerPC email list, Li Zhong
In-Reply-To: <1347253812.2385.148.camel@pasglop>

On Mon, 2012-09-10 at 15:10 +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2012-09-10 at 12:58 +0800, Li Zhong wrote:
> > There are a few tracepoints in the interrupt code path, which is before
> > irq_enter(), or after irq_exit(), like
> > trace_irq_entry()/trace_irq_exit() in do_IRQ(),
> > trace_timer_interrupt_entry()/trace_timer_interrupt_exit() in
> > timer_interrupt(). 
> > 
> > If the interrupt is from idle(), and because tracepoint contains RCU
> > read-side critical section, we could see following suspicious RCU usage
> > reported:
> 
>  .../...
> 
> > This is because the RCU usage in interrupt context should be used in
> > area marked by rcu_irq_enter()/rcu_irq_exit(), called in
> > irq_enter()/irq_exit() respectively. 
> > 
> > Could we add a new tracepoint trace_***_rcuirq, like trace_***_rcuidle
> > to avoid the report? like the code attached below.
> > 
> > Or could we just move these tracepoints inside the
> > irq_enter()/irq_exit() area? (Seems not good for the timer_interrupt
> > case). 
> 
> I'd say just move them in. Anton, any objection ?
> 

I agree too. I'm a bit concerned about the lack of coverage the
irq_enter() and irq_exit() have in the timer interrupt:

	may_hard_irq_enable();

	trace_timer_interrupt_entry(regs);

	__get_cpu_var(irq_stat).timer_irqs++;

#if defined(CONFIG_PPC32) && defined(CONFIG_PMAC)
	if (atomic_read(&ppc_n_lost_interrupts) != 0)
		do_IRQ(regs);
#endif

	old_regs = set_irq_regs(regs);
	irq_enter();


I'm guessing that call to do_IRQ() has to do with lazy irq handling?

Anyway, there may be a reason to have the tracepoint before this call,
but I'm not sure it really is that important. It should probably be best
to move it after the irq_enter(). There's side-effects with calling
things from interrupt context outside of irq_enter/exit().

-- Steve

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox