LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v5 00/45] CPU hotplug: stop_machine()-free CPU hotplug
From: Rusty Russell @ 2013-02-07  4:14 UTC (permalink / raw)
  To: Srivatsa S. Bhat, tglx, peterz, tj, oleg, paulmck, mingo
  Cc: linux-arch, linux, nikunj, linux-pm, fweisbec, linux-doc,
	linux-kernel, rostedt, xiaoguangrong, rjw, sbw, wangyun,
	Srivatsa S. Bhat, netdev, namhyung, akpm, walken, linuxppc-dev,
	linux-arm-kernel
In-Reply-To: <510FBC01.2030405@linux.vnet.ibm.com>

"Srivatsa S. Bhat" <srivatsa.bhat@linux.vnet.ibm.com> writes:
> On 01/22/2013 01:03 PM, Srivatsa S. Bhat wrote:
>                  Avg. latency of 1 CPU offline (ms) [stop-cpu/stop-m/c latency]
>
> # online CPUs    Mainline (with stop-m/c)       This patchset (no stop-m/c)
>
>       8                 17.04                          7.73
>
>      16                 18.05                          6.44
>
>      32                 17.31                          7.39
>
>      64                 32.40                          9.28
>
>     128                 98.23                          7.35

Nice!  I wonder how the ARM guys feel with their quad-cpu systems...

Thanks!
Rusty.

^ permalink raw reply

* Re: [PATCH 1/4] audit: Syscall rules are not applied to existing processes on non-x86
From: Anton Blanchard @ 2013-02-07  4:13 UTC (permalink / raw)
  To: eparis, viro, benh, paulus; +Cc: akpm, linuxppc-dev, linux-kernel
In-Reply-To: <20130109104617.74e995a5@kryten>


Hi,

Just following up on this. I've had a few people complaining about
audit being broken on ppc64 and it would be nice to fix.

Anton
--

On Wed, 9 Jan 2013 10:46:17 +1100
Anton Blanchard <anton@samba.org> wrote:

> 
> Commit b05d8447e782 (audit: inline audit_syscall_entry to reduce
> burden on archs) changed audit_syscall_entry to check for a dummy
> context before calling __audit_syscall_entry. Unfortunately the dummy
> context state is maintained in __audit_syscall_entry so once set it
> never gets cleared, even if the audit rules change.
> 
> As a result, if there are no auditing rules when a process starts
> then it will never be subject to any rules added later. x86 doesn't
> see this because it has an assembly fast path that calls directly into
> __audit_syscall_entry.
> 
> I noticed this issue when working on audit performance optimisations.
> I wrote a set of simple test cases available at:
> 
> http://ozlabs.org/~anton/junkcode/audit_tests.tar.gz
> 
> 02_new_rule.py fails without the patch and passes with it. The
> test case clears all rules, starts a process, adds a rule then
> verifies the process produces a syscall audit record.
> 
> Signed-off-by: Anton Blanchard <anton@samba.org>
> Cc: <stable@kernel.org> # 3.3+
> ---
> 
> Index: b/include/linux/audit.h
> ===================================================================
> --- a/include/linux/audit.h
> +++ b/include/linux/audit.h
> @@ -119,7 +119,7 @@ static inline void audit_syscall_entry(i
>  				       unsigned long a1, unsigned
> long a2, unsigned long a3)
>  {
> -	if (unlikely(!audit_dummy_context()))
> +	if (unlikely(current->audit_context))
>  		__audit_syscall_entry(arch, major, a0, a1, a2, a3);
>  }
>  static inline void audit_syscall_exit(void *pt_regs)

^ permalink raw reply

* [RFC][PATCH] powerpc: add Book E support to 64-bit hibernation
From: Wang Dongsheng @ 2013-02-07  2:25 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: scottwood, chenhui.zhao, Wang Dongsheng

Update the 64-bit hibernation code to support Book E CPUs.
Some registers and instructions are not defined for Book3e
(SDR reg, tlbia instruction).
SDR: Storage Description Register. Book3S and Book3E have different
address translation mode, we do not need HTABORG & HTABSIZE to
translate virtual address to real address.
More registers are saved in BookE-64bit.(TCR, SPRGx)

Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
---

Hopefully someone can give me some advice about cache flush. It
confused me that why only 1 MiB is flushed from KERNEL_START on
Book3S. Is there a need to flush L2 cache if I have already flushed
L1 cache? If yes, how about L3 cache? Cache levels are different
in cores, the instruction sets may also be different.

 1 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/swsusp_asm64.S b/arch/powerpc/kernel/swsusp_asm64.S
index 86ac1d9..608e4ceb 100644
--- a/arch/powerpc/kernel/swsusp_asm64.S
+++ b/arch/powerpc/kernel/swsusp_asm64.S
@@ -46,10 +46,29 @@
 #define SL_r29		0xe8
 #define SL_r30		0xf0
 #define SL_r31		0xf8
-#define SL_SIZE		SL_r31+8
+#define SL_SPRG0	0x100
+#define SL_SPRG1	0x108
+#define SL_SPRG2	0x110
+#define SL_SPRG3	0x118
+#define SL_SPRG4	0x120
+#define SL_SPRG5	0x128
+#define SL_SPRG6	0x130
+#define SL_SPRG7	0x138
+#define SL_TCR		0x140
+#define SL_SIZE		SL_TCR+8
 
 /* these macros rely on the save area being
  * pointed to by r11 */
+
+#define SAVE_SPR(register)		\
+	mfspr	r0,SPRN_##register	;\
+	std	r0,SL_##register(r11)
+#define RESTORE_SPR(register)		\
+	ld	r0,SL_##register(r11)	;\
+	mtspr	SPRN_##register,r0
+#define RESTORE_SPRG(n)			\
+	ld	r0,SL_SPRG##n(r11)	;\
+	mtsprg	n,r0
 #define SAVE_SPECIAL(special)		\
 	mf##special	r0		;\
 	std	r0, SL_##special(r11)
@@ -103,8 +122,21 @@ _GLOBAL(swsusp_arch_suspend)
 	SAVE_REGISTER(r30)
 	SAVE_REGISTER(r31)
 	SAVE_SPECIAL(MSR)
-	SAVE_SPECIAL(SDR1)
 	SAVE_SPECIAL(XER)
+#ifdef CONFIG_PPC_BOOK3S_64
+	SAVE_SPECIAL(SDR1)
+#else
+	SAVE_SPR(TCR)
+	/* Save SPRGs */
+	SAVE_SPR(SPRG0)
+	SAVE_SPR(SPRG1)
+	SAVE_SPR(SPRG2)
+	SAVE_SPR(SPRG3)
+	SAVE_SPR(SPRG4)
+	SAVE_SPR(SPRG5)
+	SAVE_SPR(SPRG6)
+	SAVE_SPR(SPRG7)
+#endif
 
 	/* we push the stack up 128 bytes but don't store the
 	 * stack pointer on the stack like a real stackframe */
@@ -151,6 +183,7 @@ copy_page_loop:
 	bne+	copyloop
 nothing_to_copy:
 
+#ifdef CONFIG_PPC_BOOK3S_64
 	/* flush caches */
 	lis	r3, 0x10
 	mtctr	r3
@@ -167,6 +200,7 @@ nothing_to_copy:
 	sync
 
 	tlbia
+#endif
 
 	ld	r11,swsusp_save_area_ptr@toc(r2)
 
@@ -208,16 +242,42 @@ nothing_to_copy:
 	RESTORE_REGISTER(r29)
 	RESTORE_REGISTER(r30)
 	RESTORE_REGISTER(r31)
+
+#ifdef CONFIG_PPC_BOOK3S_64
 	/* can't use RESTORE_SPECIAL(MSR) */
 	ld	r0, SL_MSR(r11)
 	mtmsrd	r0, 0
 	RESTORE_SPECIAL(SDR1)
+#else
+	/* Save SPRGs */
+	RESTORE_SPRG(0)
+	RESTORE_SPRG(1)
+	RESTORE_SPRG(2)
+	RESTORE_SPRG(3)
+	RESTORE_SPRG(4)
+	RESTORE_SPRG(5)
+	RESTORE_SPRG(6)
+	RESTORE_SPRG(7)
+
+	RESTORE_SPECIAL(MSR)
+
+	/* Restore TCR and clear any pending bits in TSR. */
+	RESTORE_SPR(TCR)
+	lis	r0, (TSR_ENW | TSR_WIS | TSR_DIS | TSR_FIS)@h
+	mtspr	SPRN_TSR,r0
+
+	/* Kick decrementer */
+	li	r0,1
+	mtdec	r0
+#endif
 	RESTORE_SPECIAL(XER)
 
 	sync
 
 	addi	r1,r1,-128
+#ifdef CONFIG_PPC_BOOK3S_64
 	bl	slb_flush_and_rebolt
+#endif
 	bl	do_after_copyback
 	addi	r1,r1,128
 
-- 
1.7.5.1

^ permalink raw reply related

* Re: [RFC PATCH 1/5] powerpc: Syscall hooks for context tracking subsystem
From: Frederic Weisbecker @ 2013-02-07  0:29 UTC (permalink / raw)
  To: Li Zhong; +Cc: paulmck, linuxppc-dev, linux-kernel, paulus
In-Reply-To: <1359714465-6297-2-git-send-email-zhong@linux.vnet.ibm.com>

2013/2/1 Li Zhong <zhong@linux.vnet.ibm.com>:
> This is the syscall slow path hooks for context tracking subsystem,
> corresponding to
> [PATCH] x86: Syscall hooks for userspace RCU extended QS
>   commit bf5a3c13b939813d28ce26c01425054c740d6731
>
> TIF_MEMDIE is moved to the second 16-bits (with value 17), as it seems there
> is no asm code using it. TIF_NOHZ is added to _TIF_SYCALL_T_OR_A, so it is
> better for it to be in the same 16 bits with others in the group, so in the
> asm code, andi. with this group could work.
>
> Signed-off-by: Li Zhong <zhong@linux.vnet.ibm.com>
> ---
>  arch/powerpc/include/asm/thread_info.h |    7 +++++--
>  arch/powerpc/kernel/ptrace.c           |    5 +++++
>  2 files changed, 10 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/thread_info.h b/arch/powerpc/include/asm/thread_info.h
> index 406b7b9..414a261 100644
> --- a/arch/powerpc/include/asm/thread_info.h
> +++ b/arch/powerpc/include/asm/thread_info.h
> @@ -97,7 +97,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define TIF_PERFMON_CTXSW      6       /* perfmon needs ctxsw calls */
>  #define TIF_SYSCALL_AUDIT      7       /* syscall auditing active */
>  #define TIF_SINGLESTEP         8       /* singlestepping active */
> -#define TIF_MEMDIE             9       /* is terminating due to OOM killer */
> +#define TIF_NOHZ               9       /* in adaptive nohz mode */
>  #define TIF_SECCOMP            10      /* secure computing */
>  #define TIF_RESTOREALL         11      /* Restore all regs (implies NOERROR) */
>  #define TIF_NOERROR            12      /* Force successful syscall return */
> @@ -106,6 +106,7 @@ static inline struct thread_info *current_thread_info(void)
>  #define TIF_SYSCALL_TRACEPOINT 15      /* syscall tracepoint instrumentation */
>  #define TIF_EMULATE_STACK_STORE        16      /* Is an instruction emulation
>                                                 for stack store? */
> +#define TIF_MEMDIE             17      /* is terminating due to OOM killer */
>
>  /* as above, but as bit values */
>  #define _TIF_SYSCALL_TRACE     (1<<TIF_SYSCALL_TRACE)
> @@ -124,8 +125,10 @@ static inline struct thread_info *current_thread_info(void)
>  #define _TIF_UPROBE            (1<<TIF_UPROBE)
>  #define _TIF_SYSCALL_TRACEPOINT        (1<<TIF_SYSCALL_TRACEPOINT)
>  #define _TIF_EMULATE_STACK_STORE       (1<<TIF_EMULATE_STACK_STORE)
> +#define _TIF_NOHZ              (1<<TIF_NOHZ)
>  #define _TIF_SYSCALL_T_OR_A    (_TIF_SYSCALL_TRACE | _TIF_SYSCALL_AUDIT | \
> -                                _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT)
> +                                _TIF_SECCOMP | _TIF_SYSCALL_TRACEPOINT | \
> +                                _TIF_NOHZ)
>
>  #define _TIF_USER_WORK_MASK    (_TIF_SIGPENDING | _TIF_NEED_RESCHED | \
>                                  _TIF_NOTIFY_RESUME | _TIF_UPROBE)
> diff --git a/arch/powerpc/kernel/ptrace.c b/arch/powerpc/kernel/ptrace.c
> index c497000..62238dd 100644
> --- a/arch/powerpc/kernel/ptrace.c
> +++ b/arch/powerpc/kernel/ptrace.c
> @@ -32,6 +32,7 @@
>  #include <trace/syscall.h>
>  #include <linux/hw_breakpoint.h>
>  #include <linux/perf_event.h>
> +#include <linux/context_tracking.h>
>
>  #include <asm/uaccess.h>
>  #include <asm/page.h>
> @@ -1745,6 +1746,8 @@ long do_syscall_trace_enter(struct pt_regs *regs)
>  {
>         long ret = 0;
>
> +       user_exit();
> +
>         secure_computing_strict(regs->gpr[0]);
>
>         if (test_thread_flag(TIF_SYSCALL_TRACE) &&
> @@ -1789,4 +1792,6 @@ void do_syscall_trace_leave(struct pt_regs *regs)
>         step = test_thread_flag(TIF_SINGLESTEP);
>         if (step || test_thread_flag(TIF_SYSCALL_TRACE))
>                 tracehook_report_syscall_exit(regs, step);
> +
> +       user_enter();

In x86-64, schedule_user() and do_notify_resume() can be called before
syscall_trace_leave(). As a result we may be entering
syscall_trace_leave() in user mode (from a context tracking POV). To
fix this I added a call to user_exit() on the very beginning of that
function.

You can find the details in 2c5594df344cd1ff0cc9bf007dea3235582b3acf
("rcu: Fix unrecovered RCU user mode in syscall_trace_leave()").

Could this problem happen in ppc as well?

Thanks.

^ permalink raw reply

* Re: ethtool occationally fails to communicate with with ucc_geth
From: Lennart Sorensen @ 2013-02-06 22:24 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: netdev, linuxppc-dev, linux-kernel
In-Reply-To: <1360184912.2659.27.camel@bwh-desktop.uk.solarflarecom.com>

On Wed, Feb 06, 2013 at 09:08:32PM +0000, Ben Hutchings wrote:
> This seems to be a workaround for a bug in phylib: phy_state_machine()
> calls netif_carrier_on() before adjust_link(), so the TX scheduler can
> start immediately even though the MAC has not been configured.
> 
> A better workaround would be to use netif_carrier_{off,on}() in
> ugeth_{quiesce,activate}() respectively instead of
> netif_device_{detach,attach}().  But I think phylib really ought to be
> fixed.

I am willing to try things, but this is certainly in parts of the network
stack I don't normally poke around in and hence don't know how works.

I just managed to track it down this far. :)

I can try the carrier_off/on in place of the detach/attach and see if
it works.

-- 
Len Sorensen

^ permalink raw reply

* Re: ethtool occationally fails to communicate with with ucc_geth
From: Ben Hutchings @ 2013-02-06 21:08 UTC (permalink / raw)
  To: Lennart Sorensen; +Cc: netdev, linuxppc-dev, linux-kernel
In-Reply-To: <20130206200504.GJ30788@csclub.uwaterloo.ca>

On Wed, 2013-02-06 at 15:05 -0500, Lennart Sorensen wrote:
> We are occationally seeing ethtool fail to communicate with ucc_geth.
> I think I have tracked down why it happens, but I don't see a good way
> to fix it.
> 
> When the phy state changes, adjust_link() checks if the state has changed
> and if the link is up.  If it is it does:
> 
>                 if (new_state) {
>                         /*
>                          * To change the MAC configuration we need to disable
>                          * the controller. To do so, we have to either grab
>                          * ugeth->lock, which is a bad idea since 'graceful
>                          * stop' commands might take quite a while, or we can
>                          * quiesce driver's activity.
>                          */
>                         ugeth_quiesce(ugeth);
>                         ugeth_disable(ugeth, COMM_DIR_RX_AND_TX);
> 
>                         out_be32(&ug_regs->maccfg2, tempval);
>                         out_be32(&uf_regs->upsmr, upsmr);
> 
>                         ugeth_enable(ugeth, COMM_DIR_RX_AND_TX);
>                         ugeth_activate(ugeth);
>                 }
> 
> The problem I believe is that ugeth_quiesce() does netif_device_detach
> which clears __LINK_STATE_PRESENT, and hence makes dev_ethtool fail
> due to:
> 
>         if (!dev || !netif_device_present(dev))
>                 return -ENODEV;
> 
> So if ethtool happens to be run between ugeth_quiesce() and
> ugeth_activate(), it fails as if the device simply doesn't exist, which
> is of course not true, it's just temporarily disabled.
[...]
> Any suggestions?

This seems to be a workaround for a bug in phylib: phy_state_machine()
calls netif_carrier_on() before adjust_link(), so the TX scheduler can
start immediately even though the MAC has not been configured.

A better workaround would be to use netif_carrier_{off,on}() in
ugeth_{quiesce,activate}() respectively instead of
netif_device_{detach,attach}().  But I think phylib really ought to be
fixed.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* ethtool occationally fails to communicate with with ucc_geth
From: Lennart Sorensen @ 2013-02-06 20:05 UTC (permalink / raw)
  To: Li Yang; +Cc: netdev, linuxppc-dev, linux-kernel, Len Sorensen

We are occationally seeing ethtool fail to communicate with ucc_geth.
I think I have tracked down why it happens, but I don't see a good way
to fix it.

When the phy state changes, adjust_link() checks if the state has changed
and if the link is up.  If it is it does:

                if (new_state) {
                        /*
                         * To change the MAC configuration we need to disable
                         * the controller. To do so, we have to either grab
                         * ugeth->lock, which is a bad idea since 'graceful
                         * stop' commands might take quite a while, or we can
                         * quiesce driver's activity.
                         */
                        ugeth_quiesce(ugeth);
                        ugeth_disable(ugeth, COMM_DIR_RX_AND_TX);

                        out_be32(&ug_regs->maccfg2, tempval);
                        out_be32(&uf_regs->upsmr, upsmr);

                        ugeth_enable(ugeth, COMM_DIR_RX_AND_TX);
                        ugeth_activate(ugeth);
                }

The problem I believe is that ugeth_quiesce() does netif_device_detach
which clears __LINK_STATE_PRESENT, and hence makes dev_ethtool fail
due to:

        if (!dev || !netif_device_present(dev))
                return -ENODEV;

So if ethtool happens to be run between ugeth_quiesce() and
ugeth_activate(), it fails as if the device simply doesn't exist, which
is of course not true, it's just temporarily disabled.

I don't see any obvious way to make the ethtool requests block while the
adjust_link does it's business.  It seems that that making the device
disappear is the wrong thing to do though.

I am able to make it happen if I do:

'while ethtool ifname; do :; done' while plugging and unplugging the
cable for a few minutes.

Any suggestions?

-- 
len Sorensen

^ permalink raw reply

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
From: Glauber Costa @ 2013-02-06 14:24 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, liuj97, len.brown, Miao Xie,
	Wen Congyang, cmetcalf, wujianguo, yinghai, KAMEZAWA Hiroyuki,
	laijs, linux-kernel, minchan.kim, akpm, linuxppc-dev
In-Reply-To: <51122C1D.5020002@cn.fujitsu.com>

On 02/06/2013 02:10 PM, Tang Chen wrote:
> On 02/06/2013 05:17 PM, Tang Chen wrote:
>> Hi all,
>>
>> On 02/06/2013 11:07 AM, Tang Chen wrote:
>>> Hi Glauber, all,
>>>
>>> An old thing I want to discuss with you. :)
>>>
>>> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>>>> For example: there is a memory device on node 1. The address range
>>>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>>>> memory10,
>>>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>>>
>>>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>>>> cgroup
>>>>>>> when we online pages. When we online memory8, the memory stored
>>>>>>> page cgroup
>>>>>>> is not provided by this memory device. But when we online memory9,
>>>>>>> the memory
>>>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>>>> memory8
>>>>>>> now. We should offline the memory in the reversed order.
>>>>>>>
>>>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>>>> provided
>>>>>>> by this memory device. But we don't know which memory is onlined
>>>>>>> first, so
>>>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>>>> the memory.
>>>>>>> 1st iterate: offline every non primary memory block.
>>>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>>>
>>>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>>>
>>>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>>>
>>>>>> Maybe there is something here that I am missing - I admit that I came
>>>>>> late to this one, but this really sounds like a very ugly hack, that
>>>>>> really has no place in here.
>>>>>>
>>>>>> Retrying, of course, may make sense, if we have reasonable belief
>>>>>> that
>>>>>> we may now succeed. If this is the case, you need to document - in
>>>>>> the
>>>>>> code - while is that.
>>>>>>
>>>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>>>> all page_cgroup allocations local to the node they are describing? If
>>>>>> memcg is the culprit here, we should fix it, and not retry. If
>>>>>> there is
>>>>>> still any benefit in retrying, then we retry being very specific
>>>>>> about why.
>>>>>
>>>>> We try to make all page_cgroup allocations local to the node they are
>>>>> describing
>>>>> now. If the memory is the first memory onlined in this node, we will
>>>>> allocate
>>>>> it from the other node.
>>>>>
>>>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>>>> to 11
>>>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>>>
>>>>> So we should offline memory block 9 first. But we don't know in which
>>>>> order
>>>>> the user online the memory block.
>>>>>
>>>>> I think we can modify memcg like this:
>>>>> allocate the memory from the memory block they are describing
>>>>>
>>>>> I am not sure it is OK to do so.
>>>>
>>>> I don't see a reason why not.
>>>>
>>>> You would have to tweak a bit the lookup function for page_cgroup, but
>>>> assuming you will always have the pfns and limits, it should be easy
>>>> to do.
>>>>
>>>> I think the only tricky part is that today we have a single
>>>> node_page_cgroup, and we would of course have to have one per memory
>>>> block. My assumption is that the number of memory blocks is limited and
>>>> likely not very big. So even a static array would do.
>>>>
>>>
>>> About the idea "allocate the memory from the memory block they are
>>> describing",
>>>
>>> online_pages()
>>> |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this
>>> section is not in buddy yet.
>>> |-->page_cgroup_callback()
>>> |-->online_page_cgroup()
>>> |-->init_section_page_cgroup()
>>> |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy
>>> system.
>>>
>>> When onlining pages, we allocate page_cgroup from buddy. And the being
>>> onlined pages are not in
>>> buddy yet. I think we can reserve some memory in the section for
>>> page_cgroup, and return all the
>>> rest to the buddy.
>>>
>>> But when the system is booting,
>>>
>>> start_kernel()
>>> |-->setup_arch()
>>> |-->mm_init()
>>> | |-->mem_init()
>>> | |-->numa_free_all_bootmem() -------------- all the pages are in buddy
>>> system.
>>> |-->page_cgroup_init()
>>> |-->init_section_page_cgroup()
>>> |-->alloc_page_cgroup() ------------------ I don't know how to reserve
>>> memory in each section.
>>>
>>> So any idea about how to deal with it when the system is booting please?
>>>
>>
>> How about this way.
>>
>> 1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and
>> MIX_SECTION_INFO.
>> 2) In sparse_init(), reserve some beginning pages of each section as
>> bootmem.
> 
> Hi all,
> 
> After digging into bootmem code, I met another problem.
> 
> memblock allocates memory from high address to low address, using
> memblock.current_limit
> to remember where the upper limit is. What I am doing will produce a lot
> of fragments,
> and the memory will be non-contiguous. So we need to modify memblock again.
> 
> I don't think it's a good idea. How do you think ?
> 
> Thanks. :)
> 
>> 3) In register_page_bootmem_info_section(), set these pages as
>> page->lru.next = PAGE_CGROUP_INFO;
>>
>> Then these pages will not go to buddy system.
>>
>> But I do worry about the fragment problem because part of each section
>> will
>> be used in the very beginning.
>>
>> Thanks. :)
>>
>>>
>>> And one more question, a memory section is 128MB in Linux. If we reserve
>>> part of the them for page_cgroup,
>>> then anyone who wants to allocate a contiguous memory larger than 128MB,
>>> it will fail, right ?
>>> Is it OK ?
No, it is not.

Another take on this: Can't we free all the page_cgroup structure before
we actually start removing the sections ? If we do this, we would be
basically left with no problem at all, since when your code starts
running we would no longer have any page_cgroup allocated.

All you have to guarantee is that it happens after the memory block is
already isolated and allocations no longer can reach it.

What do you think ?

^ permalink raw reply

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
From: Tang Chen @ 2013-02-06 10:10 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, liuj97, len.brown, Miao Xie,
	Wen Congyang, cmetcalf, wujianguo, yinghai, KAMEZAWA Hiroyuki,
	laijs, linux-kernel, minchan.kim, akpm, linuxppc-dev
In-Reply-To: <51121FB7.1070205@cn.fujitsu.com>

On 02/06/2013 05:17 PM, Tang Chen wrote:
> Hi all,
>
> On 02/06/2013 11:07 AM, Tang Chen wrote:
>> Hi Glauber, all,
>>
>> An old thing I want to discuss with you. :)
>>
>> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>>> For example: there is a memory device on node 1. The address range
>>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>>> memory10,
>>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>>
>>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>>> cgroup
>>>>>> when we online pages. When we online memory8, the memory stored
>>>>>> page cgroup
>>>>>> is not provided by this memory device. But when we online memory9,
>>>>>> the memory
>>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>>> memory8
>>>>>> now. We should offline the memory in the reversed order.
>>>>>>
>>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>>> provided
>>>>>> by this memory device. But we don't know which memory is onlined
>>>>>> first, so
>>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>>> the memory.
>>>>>> 1st iterate: offline every non primary memory block.
>>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>>
>>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>>
>>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>>
>>>>> Maybe there is something here that I am missing - I admit that I came
>>>>> late to this one, but this really sounds like a very ugly hack, that
>>>>> really has no place in here.
>>>>>
>>>>> Retrying, of course, may make sense, if we have reasonable belief that
>>>>> we may now succeed. If this is the case, you need to document - in the
>>>>> code - while is that.
>>>>>
>>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>>> all page_cgroup allocations local to the node they are describing? If
>>>>> memcg is the culprit here, we should fix it, and not retry. If
>>>>> there is
>>>>> still any benefit in retrying, then we retry being very specific
>>>>> about why.
>>>>
>>>> We try to make all page_cgroup allocations local to the node they are
>>>> describing
>>>> now. If the memory is the first memory onlined in this node, we will
>>>> allocate
>>>> it from the other node.
>>>>
>>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>>> to 11
>>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>>
>>>> So we should offline memory block 9 first. But we don't know in which
>>>> order
>>>> the user online the memory block.
>>>>
>>>> I think we can modify memcg like this:
>>>> allocate the memory from the memory block they are describing
>>>>
>>>> I am not sure it is OK to do so.
>>>
>>> I don't see a reason why not.
>>>
>>> You would have to tweak a bit the lookup function for page_cgroup, but
>>> assuming you will always have the pfns and limits, it should be easy
>>> to do.
>>>
>>> I think the only tricky part is that today we have a single
>>> node_page_cgroup, and we would of course have to have one per memory
>>> block. My assumption is that the number of memory blocks is limited and
>>> likely not very big. So even a static array would do.
>>>
>>
>> About the idea "allocate the memory from the memory block they are
>> describing",
>>
>> online_pages()
>> |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this
>> section is not in buddy yet.
>> |-->page_cgroup_callback()
>> |-->online_page_cgroup()
>> |-->init_section_page_cgroup()
>> |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy system.
>>
>> When onlining pages, we allocate page_cgroup from buddy. And the being
>> onlined pages are not in
>> buddy yet. I think we can reserve some memory in the section for
>> page_cgroup, and return all the
>> rest to the buddy.
>>
>> But when the system is booting,
>>
>> start_kernel()
>> |-->setup_arch()
>> |-->mm_init()
>> | |-->mem_init()
>> | |-->numa_free_all_bootmem() -------------- all the pages are in buddy
>> system.
>> |-->page_cgroup_init()
>> |-->init_section_page_cgroup()
>> |-->alloc_page_cgroup() ------------------ I don't know how to reserve
>> memory in each section.
>>
>> So any idea about how to deal with it when the system is booting please?
>>
>
> How about this way.
>
> 1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and MIX_SECTION_INFO.
> 2) In sparse_init(), reserve some beginning pages of each section as
> bootmem.

Hi all,

After digging into bootmem code, I met another problem.

memblock allocates memory from high address to low address, using 
memblock.current_limit
to remember where the upper limit is. What I am doing will produce a lot 
of fragments,
and the memory will be non-contiguous. So we need to modify memblock again.

I don't think it's a good idea. How do you think ?

Thanks. :)

> 3) In register_page_bootmem_info_section(), set these pages as
> page->lru.next = PAGE_CGROUP_INFO;
>
> Then these pages will not go to buddy system.
>
> But I do worry about the fragment problem because part of each section will
> be used in the very beginning.
>
> Thanks. :)
>
>>
>> And one more question, a memory section is 128MB in Linux. If we reserve
>> part of the them for page_cgroup,
>> then anyone who wants to allocate a contiguous memory larger than 128MB,
>> it will fail, right ?
>> Is it OK ?
>>
>> Thanks. :)
>>
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
From: Tang Chen @ 2013-02-06  9:17 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, liuj97, len.brown, Miao Xie,
	Wen Congyang, cmetcalf, wujianguo, yinghai, KAMEZAWA Hiroyuki,
	laijs, linux-kernel, minchan.kim, akpm, linuxppc-dev
In-Reply-To: <5111C8EB.6090805@cn.fujitsu.com>

Hi all,

On 02/06/2013 11:07 AM, Tang Chen wrote:
> Hi Glauber, all,
>
> An old thing I want to discuss with you. :)
>
> On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>>> For example: there is a memory device on node 1. The address range
>>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>>>> memory10,
>>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>>
>>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>>>> cgroup
>>>>> when we online pages. When we online memory8, the memory stored
>>>>> page cgroup
>>>>> is not provided by this memory device. But when we online memory9,
>>>>> the memory
>>>>> stored page cgroup may be provided by memory8. So we can't offline
>>>>> memory8
>>>>> now. We should offline the memory in the reversed order.
>>>>>
>>>>> When the memory device is hotremoved, we will auto offline memory
>>>>> provided
>>>>> by this memory device. But we don't know which memory is onlined
>>>>> first, so
>>>>> offlining memory may fail. In such case, iterate twice to offline
>>>>> the memory.
>>>>> 1st iterate: offline every non primary memory block.
>>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>>
>>>>> This idea is suggested by KOSAKI Motohiro.
>>>>>
>>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>>
>>>> Maybe there is something here that I am missing - I admit that I came
>>>> late to this one, but this really sounds like a very ugly hack, that
>>>> really has no place in here.
>>>>
>>>> Retrying, of course, may make sense, if we have reasonable belief that
>>>> we may now succeed. If this is the case, you need to document - in the
>>>> code - while is that.
>>>>
>>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>>> all page_cgroup allocations local to the node they are describing? If
>>>> memcg is the culprit here, we should fix it, and not retry. If there is
>>>> still any benefit in retrying, then we retry being very specific
>>>> about why.
>>>
>>> We try to make all page_cgroup allocations local to the node they are
>>> describing
>>> now. If the memory is the first memory onlined in this node, we will
>>> allocate
>>> it from the other node.
>>>
>>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8
>>> to 11
>>> 1. memory block 8, page_cgroup allocations are in the other nodes
>>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>>
>>> So we should offline memory block 9 first. But we don't know in which
>>> order
>>> the user online the memory block.
>>>
>>> I think we can modify memcg like this:
>>> allocate the memory from the memory block they are describing
>>>
>>> I am not sure it is OK to do so.
>>
>> I don't see a reason why not.
>>
>> You would have to tweak a bit the lookup function for page_cgroup, but
>> assuming you will always have the pfns and limits, it should be easy
>> to do.
>>
>> I think the only tricky part is that today we have a single
>> node_page_cgroup, and we would of course have to have one per memory
>> block. My assumption is that the number of memory blocks is limited and
>> likely not very big. So even a static array would do.
>>
>
> About the idea "allocate the memory from the memory block they are
> describing",
>
> online_pages()
> |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this
> section is not in buddy yet.
> |-->page_cgroup_callback()
> |-->online_page_cgroup()
> |-->init_section_page_cgroup()
> |-->alloc_page_cgroup() --------- allocate page_cgroup from buddy system.
>
> When onlining pages, we allocate page_cgroup from buddy. And the being
> onlined pages are not in
> buddy yet. I think we can reserve some memory in the section for
> page_cgroup, and return all the
> rest to the buddy.
>
> But when the system is booting,
>
> start_kernel()
> |-->setup_arch()
> |-->mm_init()
> | |-->mem_init()
> | |-->numa_free_all_bootmem() -------------- all the pages are in buddy
> system.
> |-->page_cgroup_init()
> |-->init_section_page_cgroup()
> |-->alloc_page_cgroup() ------------------ I don't know how to reserve
> memory in each section.
>
> So any idea about how to deal with it when the system is booting please?
>

How about this way.

1) Add a new flag PAGE_CGROUP_INFO, like SECTION_INFO and MIX_SECTION_INFO.
2) In sparse_init(), reserve some beginning pages of each section as 
bootmem.
3) In register_page_bootmem_info_section(), set these pages as
      page->lru.next = PAGE_CGROUP_INFO;

Then these pages will not go to buddy system.

But I do worry about the fragment problem because part of each section will
be used in the very beginning.

Thanks. :)

>
> And one more question, a memory section is 128MB in Linux. If we reserve
> part of the them for page_cgroup,
> then anyone who wants to allocate a contiguous memory larger than 128MB,
> it will fail, right ?
> Is it OK ?
>
> Thanks. :)
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH 1/3] powerpc/mpc512x: fix noderef sparse warnings
From: Kim Phillips @ 2013-02-06  3:11 UTC (permalink / raw)
  To: Anatolij Gustschin; +Cc: linuxppc-dev
In-Reply-To: <1360048818-3765-1-git-send-email-agust@denx.de>

On Tue, 5 Feb 2013 08:20:16 +0100
Anatolij Gustschin <agust@denx.de> wrote:

> Fix:
> warning: dereference of noderef expression
> 
> Signed-off-by: Anatolij Gustschin <agust@denx.de>
> ---
>  arch/powerpc/platforms/512x/clock.c |   18 +++++++++---------
>  1 files changed, 9 insertions(+), 9 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/512x/clock.c b/arch/powerpc/platforms/512x/clock.c
> index 7937361..8a784d4 100644
> --- a/arch/powerpc/platforms/512x/clock.c
> +++ b/arch/powerpc/platforms/512x/clock.c
> @@ -184,7 +184,7 @@ static unsigned long spmf_mult(void)
>  		36, 40, 44, 48,
>  		52, 56, 60, 64
>  	};
> -	int spmf = (clockctl->spmr >> 24) & 0xf;
> +	int spmf = (in_be32(&clockctl->spmr) >> 24) & 0xf;

power arch should start using the more portable i/o accessors
io{read,write}32be instead of the arch-centric {in,out}_be32.  The
io{read,write}32be functions have better sparse annotation, so an
endian check will complain if registers are not defined __be32.

Kim

^ permalink raw reply

* Re: [PATCH v5 01/14] memory-hotplug: try to offline the memory twice to avoid dependence
From: Tang Chen @ 2013-02-06  3:07 UTC (permalink / raw)
  To: Glauber Costa
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, liuj97, len.brown, Miao Xie,
	Wen Congyang, cmetcalf, wujianguo, yinghai, KAMEZAWA Hiroyuki,
	laijs, linux-kernel, minchan.kim, akpm, linuxppc-dev
In-Reply-To: <50ED8834.1090804@parallels.com>

Hi Glauber, all,

An old thing I want to discuss with you. :)

On 01/09/2013 11:09 PM, Glauber Costa wrote:
>>>> memory can't be offlined when CONFIG_MEMCG is selected.
>>>> For example: there is a memory device on node 1. The address range
>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>
>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>>> when we online pages. When we online memory8, the memory stored page cgroup
>>>> is not provided by this memory device. But when we online memory9, the memory
>>>> stored page cgroup may be provided by memory8. So we can't offline memory8
>>>> now. We should offline the memory in the reversed order.
>>>>
>>>> When the memory device is hotremoved, we will auto offline memory provided
>>>> by this memory device. But we don't know which memory is onlined first, so
>>>> offlining memory may fail. In such case, iterate twice to offline the memory.
>>>> 1st iterate: offline every non primary memory block.
>>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>>
>>>> This idea is suggested by KOSAKI Motohiro.
>>>>
>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>
>>> Maybe there is something here that I am missing - I admit that I came
>>> late to this one, but this really sounds like a very ugly hack, that
>>> really has no place in here.
>>>
>>> Retrying, of course, may make sense, if we have reasonable belief that
>>> we may now succeed. If this is the case, you need to document - in the
>>> code - while is that.
>>>
>>> The memcg argument, however, doesn't really cut it. Why can't we make
>>> all page_cgroup allocations local to the node they are describing? If
>>> memcg is the culprit here, we should fix it, and not retry. If there is
>>> still any benefit in retrying, then we retry being very specific about why.
>>
>> We try to make all page_cgroup allocations local to the node they are describing
>> now. If the memory is the first memory onlined in this node, we will allocate
>> it from the other node.
>>
>> For example, node1 has 4 memory blocks: 8-11, and we online it from 8 to 11
>> 1. memory block 8, page_cgroup allocations are in the other nodes
>> 2. memory block 9, page_cgroup allocations are in memory block 8
>>
>> So we should offline memory block 9 first. But we don't know in which order
>> the user online the memory block.
>>
>> I think we can modify memcg like this:
>> allocate the memory from the memory block they are describing
>>
>> I am not sure it is OK to do so.
>
> I don't see a reason why not.
>
> You would have to tweak a bit the lookup function for page_cgroup, but
> assuming you will always have the pfns and limits, it should be easy to do.
>
> I think the only tricky part is that today we have a single
> node_page_cgroup, and we would of course have to have one per memory
> block. My assumption is that the number of memory blocks is limited and
> likely not very big. So even a static array would do.
>

About the idea "allocate the memory from the memory block they are 
describing",

online_pages()
  |-->memory_notify(MEM_GOING_ONLINE, &arg) ----------- memory of this 
section is not in buddy yet.
       |-->page_cgroup_callback()
            |-->online_page_cgroup()
                 |-->init_section_page_cgroup()
                      |-->alloc_page_cgroup() --------- allocate 
page_cgroup from buddy system.

When onlining pages, we allocate page_cgroup from buddy. And the being 
onlined pages are not in
buddy yet. I think we can reserve some memory in the section for 
page_cgroup, and return all the
rest to the buddy.

But when the system is booting,

start_kernel()
  |-->setup_arch()
  |-->mm_init()
  |    |-->mem_init()
  |         |-->numa_free_all_bootmem() -------------- all the pages are 
in buddy system.
  |-->page_cgroup_init()
       |-->init_section_page_cgroup()
            |-->alloc_page_cgroup() ------------------ I don't know how 
to reserve memory in each section.

So any idea about how to deal with it when the system is booting please?


And one more question, a memory section is 128MB in Linux. If we reserve 
part of the them for page_cgroup,
then anyone who wants to allocate a contiguous memory larger than 128MB, 
it will fail, right ?
Is it OK ?

Thanks. :)

^ permalink raw reply

* [PATCH] perf/powerpc: Fix compile warnings
From: Sukadev Bhattiprolu @ 2013-02-05 23:19 UTC (permalink / raw)
  To: Arnaldo Carvalho de Melo; +Cc: linuxppc-dev, sfr, mingo, linux-kernel


From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Date: Tue, 5 Feb 2013 15:04:49 -0800
Subject: [PATCH] perf/powerpc: Fix compile warnings

Fix compile errors like those below:

  CC      arch/powerpc/perf/power7-pmu.o
/home/git/linux/arch/powerpc/perf/power7-pmu.c:397:2: error: initialization from
incompatible pointer type [-Werror]

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
---
 arch/powerpc/include/asm/perf_event_server.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/arch/powerpc/include/asm/perf_event_server.h b/arch/powerpc/include/asm/perf_event_server.h
index ee63205..92460bc 100644
--- a/arch/powerpc/include/asm/perf_event_server.h
+++ b/arch/powerpc/include/asm/perf_event_server.h
@@ -124,7 +124,7 @@ extern ssize_t power_events_sysfs_show(struct device *dev,
  * POWER CPU specification.
  */
 #define	EVENT_VAR(_id, _suffix)		event_attr_##_id##_suffix
-#define	EVENT_PTR(_id, _suffix)		&EVENT_VAR(_id, _suffix)
+#define	EVENT_PTR(_id, _suffix)		&EVENT_VAR(_id, _suffix).attr.attr
 
 #define	EVENT_ATTR(_name, _id, _suffix)					\
 	PMU_EVENT_ATTR(_name, EVENT_VAR(_id, _suffix), PME_PM_##_id,	\
-- 
1.7.1

^ permalink raw reply related

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Rafael J. Wysocki @ 2013-02-05 21:13 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, Toshi Kani, jiang.liu, wency, linux-acpi, yinghai,
	linux-kernel, linux-mm, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130205183948.GA19026@kroah.com>

On Tuesday, February 05, 2013 10:39:48 AM Greg KH wrote:
> On Tue, Feb 05, 2013 at 12:11:17PM +0100, Rafael J. Wysocki wrote:
> > On Monday, February 04, 2013 04:04:47 PM Greg KH wrote:
> > > On Tue, Feb 05, 2013 at 12:52:30AM +0100, Rafael J. Wysocki wrote:
> > > > You'd probably never try to hot-remove a disk before unmounting filesystems
> > > > mounted from it or failing it as a RAID component and nobody sane wants the
> > > > kernel to do things like that automatically when the user presses the eject
> > > > button.  In my opinion we should treat memory eject, or CPU package eject, or
> > > > PCI host bridge eject in exactly the same way: Don't eject if it is not
> > > > prepared for ejecting in the first place.
> > > 
> > > Bad example, we have disks hot-removed all the time without any
> > > filesystems being unmounted, and have supported this since the 2.2 days
> > > (although we didn't get it "right" until 2.6.)
> > 
> > I actually don't think it is really bad, because it exposes the problem nicely.
> > 
> > Namely, there are two arguments that can be made here.  The first one is the
> > usability argument: Users should always be allowed to do what they want,
> > because it is [explicit content] annoying if software pretends to know better
> > what to do than the user (it is a convenience argument too, because usually
> > it's *easier* to allow users to do what they want).  The second one is the
> > data integrity argument: Operations that may lead to data loss should never
> > be carried out, because it is [explicit content] disappointing to lose valuable
> > stuff by a stupid mistake if software allows that mistake to be made (that also
> > may be costly in terms of real money).
> > 
> > You seem to believe that we should always follow the usability argument, while
> > Toshi seems to be thinking that (at least in the case of the "system" devices),
> > the data integrity argument is more important.  They are both valid arguments,
> > however, and they are in conflict, so this is a matter of balance.
> > 
> > You're saying that in the case of disks we always follow the usability argument
> > entirely.  I'm fine with that, although I suspect that some people may not be
> > considering this as the right balance.
> > 
> > Toshi seems to be thinking that for the hotplug of memory/CPUs/host bridges we
> > should always follow the data integrity argument entirely, because the users of
> > that feature value their data so much that they pretty much don't care about
> > usability.  That very well may be the case, so I'm fine with that too, although
> > I'm sure there are people who'll argue that this is not the right balance
> > either.
> > 
> > Now, the point is that we *can* do what Toshi is arguing for and that doesn't
> > seem to be overly complicated, so my question is: Why don't we do that, at
> > least to start with?  If it turns out eventually that the users care about
> > usability too, after all, we can add a switch to adjust things more to their
> > liking.  Still, we can very well do that later.
> 
> Ok, I'd much rather deal with reviewing actual implementations than
> talking about theory at this point in time, so let's see what you all
> can come up with next and I'll be glad to review it.

Sure, thanks a lot for your comments so far!

Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Greg KH @ 2013-02-05 18:39 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: linux-s390, Toshi Kani, jiang.liu, wency, linux-acpi, yinghai,
	linux-kernel, linux-mm, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <4225828.6MQHJn7Yzr@vostro.rjw.lan>

On Tue, Feb 05, 2013 at 12:11:17PM +0100, Rafael J. Wysocki wrote:
> On Monday, February 04, 2013 04:04:47 PM Greg KH wrote:
> > On Tue, Feb 05, 2013 at 12:52:30AM +0100, Rafael J. Wysocki wrote:
> > > You'd probably never try to hot-remove a disk before unmounting filesystems
> > > mounted from it or failing it as a RAID component and nobody sane wants the
> > > kernel to do things like that automatically when the user presses the eject
> > > button.  In my opinion we should treat memory eject, or CPU package eject, or
> > > PCI host bridge eject in exactly the same way: Don't eject if it is not
> > > prepared for ejecting in the first place.
> > 
> > Bad example, we have disks hot-removed all the time without any
> > filesystems being unmounted, and have supported this since the 2.2 days
> > (although we didn't get it "right" until 2.6.)
> 
> I actually don't think it is really bad, because it exposes the problem nicely.
> 
> Namely, there are two arguments that can be made here.  The first one is the
> usability argument: Users should always be allowed to do what they want,
> because it is [explicit content] annoying if software pretends to know better
> what to do than the user (it is a convenience argument too, because usually
> it's *easier* to allow users to do what they want).  The second one is the
> data integrity argument: Operations that may lead to data loss should never
> be carried out, because it is [explicit content] disappointing to lose valuable
> stuff by a stupid mistake if software allows that mistake to be made (that also
> may be costly in terms of real money).
> 
> You seem to believe that we should always follow the usability argument, while
> Toshi seems to be thinking that (at least in the case of the "system" devices),
> the data integrity argument is more important.  They are both valid arguments,
> however, and they are in conflict, so this is a matter of balance.
> 
> You're saying that in the case of disks we always follow the usability argument
> entirely.  I'm fine with that, although I suspect that some people may not be
> considering this as the right balance.
> 
> Toshi seems to be thinking that for the hotplug of memory/CPUs/host bridges we
> should always follow the data integrity argument entirely, because the users of
> that feature value their data so much that they pretty much don't care about
> usability.  That very well may be the case, so I'm fine with that too, although
> I'm sure there are people who'll argue that this is not the right balance
> either.
> 
> Now, the point is that we *can* do what Toshi is arguing for and that doesn't
> seem to be overly complicated, so my question is: Why don't we do that, at
> least to start with?  If it turns out eventually that the users care about
> usability too, after all, we can add a switch to adjust things more to their
> liking.  Still, we can very well do that later.

Ok, I'd much rather deal with reviewing actual implementations than
talking about theory at this point in time, so let's see what you all
can come up with next and I'll be glad to review it.

thanks,

greg k-h

^ permalink raw reply

* [PATCH] powerpc: fix ics_rtas_init and start_secondary section mismatch
From: Daniel Borkmann @ 2013-02-05 15:07 UTC (permalink / raw)
  To: linuxppc-dev

It seems, we're fine with just annotating the two functions.
Thus, this fixes the following build warnings on ppc64:

WARNING: arch/powerpc/sysdev/xics/built-in.o(.text+0x1664):
The function .ics_rtas_init() references
the function __init .xics_register_ics().
This is often because .ics_rtas_init lacks a __init
annotation or the annotation of .xics_register_ics is wrong.

WARNING: arch/powerpc/sysdev/built-in.o(.text+0x6044):
The function .ics_rtas_init() references
the function __init .xics_register_ics().
This is often because .ics_rtas_init lacks a __init
annotation or the annotation of .xics_register_ics is wrong.

WARNING: arch/powerpc/kernel/built-in.o(.text+0x2db30):
The function .start_secondary() references
the function __cpuinit .vdso_getcpu_init().
This is often because .start_secondary lacks a __cpuinit
annotation or the annotation of .vdso_getcpu_init is wrong.

Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 Note: compile-tested only!

 arch/powerpc/kernel/smp.c           |    2 +-
 arch/powerpc/sysdev/xics/ics-rtas.c |    2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 793401e..76bd9da 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -610,7 +610,7 @@ static struct device_node *cpu_to_l2cache(int cpu)
 }
 
 /* Activate a secondary processor. */
-void start_secondary(void *unused)
+__cpuinit void start_secondary(void *unused)
 {
 	unsigned int cpu = smp_processor_id();
 	struct device_node *l2_cache;
diff --git a/arch/powerpc/sysdev/xics/ics-rtas.c b/arch/powerpc/sysdev/xics/ics-rtas.c
index c782f85..936575d 100644
--- a/arch/powerpc/sysdev/xics/ics-rtas.c
+++ b/arch/powerpc/sysdev/xics/ics-rtas.c
@@ -213,7 +213,7 @@ static int ics_rtas_host_match(struct ics *ics, struct device_node *node)
 	return !of_device_is_compatible(node, "chrp,iic");
 }
 
-int ics_rtas_init(void)
+__init int ics_rtas_init(void)
 {
 	ibm_get_xive = rtas_token("ibm,get-xive");
 	ibm_set_xive = rtas_token("ibm,set-xive");
-- 
1.7.1

^ permalink raw reply related

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Rafael J. Wysocki @ 2013-02-05 11:11 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, Toshi Kani, jiang.liu, wency, linux-acpi, yinghai,
	linux-kernel, linux-mm, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130205000447.GA21782@kroah.com>

On Monday, February 04, 2013 04:04:47 PM Greg KH wrote:
> On Tue, Feb 05, 2013 at 12:52:30AM +0100, Rafael J. Wysocki wrote:
> > You'd probably never try to hot-remove a disk before unmounting filesystems
> > mounted from it or failing it as a RAID component and nobody sane wants the
> > kernel to do things like that automatically when the user presses the eject
> > button.  In my opinion we should treat memory eject, or CPU package eject, or
> > PCI host bridge eject in exactly the same way: Don't eject if it is not
> > prepared for ejecting in the first place.
> 
> Bad example, we have disks hot-removed all the time without any
> filesystems being unmounted, and have supported this since the 2.2 days
> (although we didn't get it "right" until 2.6.)

I actually don't think it is really bad, because it exposes the problem nicely.

Namely, there are two arguments that can be made here.  The first one is the
usability argument: Users should always be allowed to do what they want,
because it is [explicit content] annoying if software pretends to know better
what to do than the user (it is a convenience argument too, because usually
it's *easier* to allow users to do what they want).  The second one is the
data integrity argument: Operations that may lead to data loss should never
be carried out, because it is [explicit content] disappointing to lose valuable
stuff by a stupid mistake if software allows that mistake to be made (that also
may be costly in terms of real money).

You seem to believe that we should always follow the usability argument, while
Toshi seems to be thinking that (at least in the case of the "system" devices),
the data integrity argument is more important.  They are both valid arguments,
however, and they are in conflict, so this is a matter of balance.

You're saying that in the case of disks we always follow the usability argument
entirely.  I'm fine with that, although I suspect that some people may not be
considering this as the right balance.

Toshi seems to be thinking that for the hotplug of memory/CPUs/host bridges we
should always follow the data integrity argument entirely, because the users of
that feature value their data so much that they pretty much don't care about
usability.  That very well may be the case, so I'm fine with that too, although
I'm sure there are people who'll argue that this is not the right balance
either.

Now, the point is that we *can* do what Toshi is arguing for and that doesn't
seem to be overly complicated, so my question is: Why don't we do that, at
least to start with?  If it turns out eventually that the users care about
usability too, after all, we can add a switch to adjust things more to their
liking.  Still, we can very well do that later.

Thanks,
Rafael


-- 
I speak only for myself.
Rafael J. Wysocki, Intel Open Source Technology Center.

^ permalink raw reply

* [PATCH 3/3] powerpc/mpc5xxx: fix sparse warning for non static symbol
From: Anatolij Gustschin @ 2013-02-05  7:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Anatolij Gustschin
In-Reply-To: <1360048818-3765-1-git-send-email-agust@denx.de>

Fix warning:
symbol 'mpc5xxx_get_bus_frequency' was not declared. Should it be static?

Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/sysdev/mpc5xxx_clocks.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/sysdev/mpc5xxx_clocks.c b/arch/powerpc/sysdev/mpc5xxx_clocks.c
index 96f815a..5492dc5 100644
--- a/arch/powerpc/sysdev/mpc5xxx_clocks.c
+++ b/arch/powerpc/sysdev/mpc5xxx_clocks.c
@@ -9,9 +9,9 @@
 #include <linux/kernel.h>
 #include <linux/of_platform.h>
 #include <linux/export.h>
+#include <asm/mpc5xxx.h>
 
-unsigned int
-mpc5xxx_get_bus_frequency(struct device_node *node)
+unsigned long mpc5xxx_get_bus_frequency(struct device_node *node)
 {
 	struct device_node *np;
 	const unsigned int *p_bus_freq = NULL;
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 2/3] powerpc/mpc512x: fix sparce warnings for non static symbols
From: Anatolij Gustschin @ 2013-02-05  7:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Anatolij Gustschin
In-Reply-To: <1360048818-3765-1-git-send-email-agust@denx.de>

Fix warnings:
symbol 'clockctl' was not declared. Should it be static?
symbol 'rate_clks' was not declared. Should it be static?
symbol 'dev_clks' was not declared. Should it be static?
symbol 'mpc5121_clk_init' was not declared. Should it be static?

Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/include/asm/mpc5121.h  |    1 +
 arch/powerpc/platforms/512x/clock.c |    7 ++++---
 2 files changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/include/asm/mpc5121.h b/arch/powerpc/include/asm/mpc5121.h
index e700c8b..1b0ce6d 100644
--- a/arch/powerpc/include/asm/mpc5121.h
+++ b/arch/powerpc/include/asm/mpc5121.h
@@ -147,5 +147,6 @@ struct mpc512x_axe_module {
 
 
 int mpc512x_cs_config(int cs, u32 val);
+int __init mpc5121_clk_init(void);
 
 #endif /* __ASM_POWERPC_MPC5121_H__ */
diff --git a/arch/powerpc/platforms/512x/clock.c b/arch/powerpc/platforms/512x/clock.c
index 8a784d4..52d57d2 100644
--- a/arch/powerpc/platforms/512x/clock.c
+++ b/arch/powerpc/platforms/512x/clock.c
@@ -26,6 +26,7 @@
 
 #include <linux/of_platform.h>
 #include <asm/mpc5xxx.h>
+#include <asm/mpc5121.h>
 #include <asm/clk_interface.h>
 
 #undef CLK_DEBUG
@@ -122,7 +123,7 @@ struct mpc512x_clockctl {
 	u32 dccr;		/* DIU Clk Cnfg Reg */
 };
 
-struct mpc512x_clockctl __iomem *clockctl;
+static struct mpc512x_clockctl __iomem *clockctl;
 
 static int mpc5121_clk_enable(struct clk *clk)
 {
@@ -551,7 +552,7 @@ static struct clk ac97_clk = {
 	.calc = ac97_clk_calc,
 };
 
-struct clk *rate_clks[] = {
+static struct clk *rate_clks[] = {
 	&ref_clk,
 	&sys_clk,
 	&diu_clk,
@@ -607,7 +608,7 @@ static void rate_clks_init(void)
  * There are two clk enable registers with 32 enable bits each
  * psc clocks and device clocks are all stored in dev_clks
  */
-struct clk dev_clks[2][32];
+static struct clk dev_clks[2][32];
 
 /*
  * Given a psc number return the dev_clk
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 1/3] powerpc/mpc512x: fix noderef sparse warnings
From: Anatolij Gustschin @ 2013-02-05  7:20 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Anatolij Gustschin

Fix:
warning: dereference of noderef expression

Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/platforms/512x/clock.c |   18 +++++++++---------
 1 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/platforms/512x/clock.c b/arch/powerpc/platforms/512x/clock.c
index 7937361..8a784d4 100644
--- a/arch/powerpc/platforms/512x/clock.c
+++ b/arch/powerpc/platforms/512x/clock.c
@@ -184,7 +184,7 @@ static unsigned long spmf_mult(void)
 		36, 40, 44, 48,
 		52, 56, 60, 64
 	};
-	int spmf = (clockctl->spmr >> 24) & 0xf;
+	int spmf = (in_be32(&clockctl->spmr) >> 24) & 0xf;
 	return spmf_to_mult[spmf];
 }
 
@@ -206,7 +206,7 @@ static unsigned long sysdiv_div_x_2(void)
 		52, 56, 58, 62,
 		60, 64, 66,
 	};
-	int sysdiv = (clockctl->scfr2 >> 26) & 0x3f;
+	int sysdiv = (in_be32(&clockctl->scfr2) >> 26) & 0x3f;
 	return sysdiv_to_div_x_2[sysdiv];
 }
 
@@ -230,7 +230,7 @@ static unsigned long sys_to_ref(unsigned long rate)
 
 static long ips_to_ref(unsigned long rate)
 {
-	int ips_div = (clockctl->scfr1 >> 23) & 0x7;
+	int ips_div = (in_be32(&clockctl->scfr1) >> 23) & 0x7;
 
 	rate *= ips_div;	/* csb_clk = ips_clk * ips_div */
 	rate *= 2;		/* sys_clk = csb_clk * 2 */
@@ -284,7 +284,7 @@ static struct clk sys_clk = {
 
 static void diu_clk_calc(struct clk *clk)
 {
-	int diudiv_x_2 = clockctl->scfr1 & 0xff;
+	int diudiv_x_2 = in_be32(&clockctl->scfr1) & 0xff;
 	unsigned long rate;
 
 	rate = sys_clk.rate;
@@ -311,7 +311,7 @@ static void half_clk_calc(struct clk *clk)
 
 static void generic_div_clk_calc(struct clk *clk)
 {
-	int div = (clockctl->scfr1 >> clk->div_shift) & 0x7;
+	int div = (in_be32(&clockctl->scfr1) >> clk->div_shift) & 0x7;
 
 	clk->rate = clk->parent->rate / div;
 }
@@ -329,7 +329,7 @@ static struct clk csb_clk = {
 
 static void e300_clk_calc(struct clk *clk)
 {
-	int spmf = (clockctl->spmr >> 16) & 0xf;
+	int spmf = (in_be32(&clockctl->spmr) >> 16) & 0xf;
 	int ratex2 = clk->parent->rate * spmf;
 
 	clk->rate = ratex2 / 2;
@@ -648,12 +648,12 @@ static void psc_calc_rate(struct clk *clk, int pscnum, struct device_node *np)
 	out_be32(&clockctl->pccr[pscnum], 0x00020000);
 	out_be32(&clockctl->pccr[pscnum], 0x00030000);
 
-	if (clockctl->pccr[pscnum] & 0x80) {
+	if (in_be32(&clockctl->pccr[pscnum]) & 0x80) {
 		clk->rate = spdif_rxclk.rate;
 		return;
 	}
 
-	switch ((clockctl->pccr[pscnum] >> 14) & 0x3) {
+	switch ((in_be32(&clockctl->pccr[pscnum]) >> 14) & 0x3) {
 	case 0:
 		mclk_src = sys_clk.rate;
 		break;
@@ -668,7 +668,7 @@ static void psc_calc_rate(struct clk *clk, int pscnum, struct device_node *np)
 		break;
 	}
 
-	mclk_div = ((clockctl->pccr[pscnum] >> 17) & 0x7fff) + 1;
+	mclk_div = ((in_be32(&clockctl->pccr[pscnum]) >> 17) & 0x7fff) + 1;
 	clk->rate = mclk_src / mclk_div;
 }
 
-- 
1.7.5.4

^ permalink raw reply related

* [PATCH 4/4] KVM: PPC: Book3S PR: Fix compilation on 32-bit machines
From: Paul Mackerras @ 2013-02-05  4:11 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt, Alexander Graf; +Cc: kvm-ppc
In-Reply-To: <20130205040902.GA20303@drongo>

Commit a413f474a0 ("powerpc: Disable relocation on exceptions whenever
PR KVM is active") added calls to pSeries_disable_reloc_on_exc() and
pSeries_enable_reloc_on_exc() to book3s_pr.c, and added declarations
of those functions to <asm/hvcall.h>, but didn't add an include of
<asm/hvcall.h> to book3s_pr.c.  64-bit kernels seem to get hvcall.h
included via some other path, but 32-bit kernels fail to compile with:

arch/powerpc/kvm/book3s_pr.c: In function ‘kvmppc_core_init_vm’:
arch/powerpc/kvm/book3s_pr.c:1300:4: error: implicit declaration of function ‘pSeries_disable_reloc_on_exc’ [-Werror=implicit-function-declaration]
arch/powerpc/kvm/book3s_pr.c: In function ‘kvmppc_core_destroy_vm’:
arch/powerpc/kvm/book3s_pr.c:1316:4: error: implicit declaration of function ‘pSeries_enable_reloc_on_exc’ [-Werror=implicit-function-declaration]
cc1: all warnings being treated as errors
make[2]: *** [arch/powerpc/kvm/book3s_pr.o] Error 1
make[1]: *** [arch/powerpc/kvm] Error 2
make: *** [sub-make] Error 2

This fixes it by adding an include of hvcall.h.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/kvm/book3s_pr.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 67e4708..6702442 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -35,6 +35,7 @@
 #include <asm/mmu_context.h>
 #include <asm/switch_to.h>
 #include <asm/firmware.h>
+#include <asm/hvcall.h>
 #include <linux/gfp.h>
 #include <linux/sched.h>
 #include <linux/vmalloc.h>
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/4] KVM: PPC: Book3S HV: Preserve guest CFAR register value
From: Paul Mackerras @ 2013-02-05  4:10 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt, Alexander Graf; +Cc: kvm-ppc
In-Reply-To: <20130205040902.GA20303@drongo>

The CFAR (Come-From Address Register) is a useful debugging aid that
exists on POWER7 processors.  Currently HV KVM doesn't save or restore
the CFAR register for guest vcpus, making the CFAR of limited use in
guests.

This adds the necessary code to capture the CFAR value saved in the
early exception entry code (it has to be saved before any branch is
executed), save it in the vcpu.arch struct, and restore it on entry
to the guest.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/exception-64s.h  |    8 ++++++--
 arch/powerpc/include/asm/kvm_book3s_asm.h |    3 +++
 arch/powerpc/include/asm/kvm_host.h       |    1 +
 arch/powerpc/kernel/asm-offsets.c         |    5 +++++
 arch/powerpc/kvm/book3s_hv_rmhandlers.S   |    9 +++++++++
 5 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 4dfc515..05e6d2e 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -199,10 +199,14 @@ END_FTR_SECTION_NESTED(ftr,ftr,943)
 
 #define __KVM_HANDLER(area, h, n)					\
 do_kvm_##n:								\
+	BEGIN_FTR_SECTION_NESTED(947)					\
+	ld	r10,area+EX_CFAR(r13);					\
+	std	r10,HSTATE_CFAR(r13);					\
+	END_FTR_SECTION_NESTED(CPU_FTR_CFAR,CPU_FTR_CFAR,947);		\
 	ld	r10,area+EX_R10(r13);					\
-	stw	r9,HSTATE_SCRATCH1(r13);			\
+	stw	r9,HSTATE_SCRATCH1(r13);				\
 	ld	r9,area+EX_R9(r13);					\
-	std	r12,HSTATE_SCRATCH0(r13);			\
+	std	r12,HSTATE_SCRATCH0(r13);				\
 	li	r12,n;							\
 	b	kvmppc_interrupt
 
diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h b/arch/powerpc/include/asm/kvm_book3s_asm.h
index 88609b2..cdc3d27 100644
--- a/arch/powerpc/include/asm/kvm_book3s_asm.h
+++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
@@ -93,6 +93,9 @@ struct kvmppc_host_state {
 	u64 host_dscr;
 	u64 dec_expires;
 #endif
+#ifdef CONFIG_PPC_BOOK3S_64
+	u64 cfar;
+#endif
 };
 
 struct kvmppc_book3s_shadow_vcpu {
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index ca9bf45..03d7bea 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -440,6 +440,7 @@ struct kvm_vcpu_arch {
 	ulong uamor;
 	u32 ctrl;
 	ulong dabr;
+	ulong cfar;
 #endif
 	u32 vrsave; /* also USPRG0 */
 	u32 mmucr;
diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c
index e39ca55..9a73fb0 100644
--- a/arch/powerpc/kernel/asm-offsets.c
+++ b/arch/powerpc/kernel/asm-offsets.c
@@ -475,6 +475,7 @@ int main(void)
 	DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, arch.last_inst));
 	DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap));
 	DEFINE(VCPU_PTID, offsetof(struct kvm_vcpu, arch.ptid));
+	DEFINE(VCPU_CFAR, offsetof(struct kvm_vcpu, arch.cfar));
 	DEFINE(VCORE_ENTRY_EXIT, offsetof(struct kvmppc_vcore, entry_exit_count));
 	DEFINE(VCORE_NAP_COUNT, offsetof(struct kvmppc_vcore, nap_count));
 	DEFINE(VCORE_IN_GUEST, offsetof(struct kvmppc_vcore, in_guest));
@@ -554,6 +555,10 @@ int main(void)
 	DEFINE(IPI_PRIORITY, IPI_PRIORITY);
 #endif /* CONFIG_KVM_BOOK3S_64_HV */
 
+#ifdef CONFIG_PPC_BOOK3S_64
+	HSTATE_FIELD(HSTATE_CFAR, cfar);
+#endif /* CONFIG_PPC_BOOK3S_64 */
+
 #else /* CONFIG_PPC_BOOK3S */
 	DEFINE(VCPU_CR, offsetof(struct kvm_vcpu, arch.cr));
 	DEFINE(VCPU_XER, offsetof(struct kvm_vcpu, arch.xer));
diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
index 10b6c35..e33d11f 100644
--- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
+++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
@@ -539,6 +539,11 @@ fast_guest_return:
 
 	/* Enter guest */
 
+BEGIN_FTR_SECTION
+	ld	r5, VCPU_CFAR(r4)
+	mtspr	SPRN_CFAR, r5
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
+
 	ld	r5, VCPU_LR(r4)
 	lwz	r6, VCPU_CR(r4)
 	mtlr	r5
@@ -604,6 +609,10 @@ kvmppc_interrupt:
 	lwz	r4, HSTATE_SCRATCH1(r13)
 	std	r3, VCPU_GPR(R12)(r9)
 	stw	r4, VCPU_CR(r9)
+BEGIN_FTR_SECTION
+	ld	r3, HSTATE_CFAR(r13)
+	std	r3, VCPU_CFAR(r9)
+END_FTR_SECTION_IFSET(CPU_FTR_CFAR)
 
 	/* Restore R1/R2 so we can handle faults */
 	ld	r1, HSTATE_HOST_R1(r13)
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/4] powerpc: Save CFAR before branching in interrupt entry paths
From: Paul Mackerras @ 2013-02-05  4:10 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt, Alexander Graf; +Cc: kvm-ppc
In-Reply-To: <20130205040902.GA20303@drongo>

Some of the interrupt vectors on 64-bit POWER server processors are
only 32 bytes long, which is not enough for the full first-level
interrupt handler.  For these we currently just have a branch to an
out-of-line handler.  However, this means that we corrupt the CFAR
(come-from address register) on POWER7 and later processors.

To fix this, we split the EXCEPTION_PROLOG_1 macro into two pieces:
EXCEPTION_PROLOG_0 contains the part up to the point where the CFAR
is saved in the PACA, and EXCEPTION_PROLOG_1 contains the rest.  We
then put EXCEPTION_PROLOG_0 in the short interrupt vectors before
we branch to the out-of-line handler, which contains the rest of the
first-level interrupt handler.  To facilitate this, we define new
_OOL (out of line) variants of STD_EXCEPTION_PSERIES, etc.

In order to get EXCEPTION_PROLOG_0 to be short enough, i.e., no more
than 6 instructions, it was necessary to move the stores that move
the PPR and CFAR values into the PACA into __EXCEPTION_PROLOG_1 and
to get rid of one of the two HMT_MEDIUM instructions.  Previously
there was a HMT_MEDIUM_PPR_DISCARD before the prolog, which was
nop'd out on processors with the PPR (POWER7 and later), and then
another HMT_MEDIUM inside the HMT_MEDIUM_PPR_SAVE macro call inside
__EXCEPTION_PROLOG_1, which was nop'd out on processors without PPR.
Now the HMT_MEDIUM inside EXCEPTION_PROLOG_0 is there unconditionally
and the HMT_MEDIUM_PPR_DISCARD is not strictly necessary, although
this leaves it in for the interrupt vectors where there is room for
it.

Previously we had a handler for hypervisor maintenance interrupts at
0xe50, which doesn't leave enough room for the vector for hypervisor
emulation assist interrupts at 0xe40, since we need 8 instructions.
The 0xe50 vector was only used on POWER6, as the HMI vector was moved
to 0xe60 on POWER7.  Since we don't support running in hypervisor mode
on POWER6, we just remove the handler at 0xe50.

This also changes denorm_exception_hv to use EXCEPTION_PROLOG_0
instead of open-coding it, and removes the HMT_MEDIUM_PPR_DISCARD
from the relocation-on vectors (since any CPU that supports
relocation-on interrupts also has the PPR).

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/include/asm/exception-64s.h |   84 +++++++++++++++++++++-----
 arch/powerpc/kernel/exceptions-64s.S     |   95 ++++++++++++++++++++----------
 2 files changed, 133 insertions(+), 46 deletions(-)

diff --git a/arch/powerpc/include/asm/exception-64s.h b/arch/powerpc/include/asm/exception-64s.h
index 370298a..4dfc515 100644
--- a/arch/powerpc/include/asm/exception-64s.h
+++ b/arch/powerpc/include/asm/exception-64s.h
@@ -50,7 +50,7 @@
 #define EX_PPR		88	/* SMT thread status register (priority) */
 
 #ifdef CONFIG_RELOCATABLE
-#define EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)			\
+#define __EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)			\
 	ld	r12,PACAKBASE(r13);	/* get high part of &label */	\
 	mfspr	r11,SPRN_##h##SRR0;	/* save SRR0 */			\
 	LOAD_HANDLER(r12,label);					\
@@ -61,13 +61,15 @@
 	blr;
 #else
 /* If not relocatable, we can jump directly -- and save messing with LR */
-#define EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)			\
+#define __EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)			\
 	mfspr	r11,SPRN_##h##SRR0;	/* save SRR0 */			\
 	mfspr	r12,SPRN_##h##SRR1;	/* and SRR1 */			\
 	li	r10,MSR_RI;						\
 	mtmsrd 	r10,1;			/* Set RI (EE=0) */		\
 	b	label;
 #endif
+#define EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)			\
+	__EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)			\
 
 /*
  * As EXCEPTION_PROLOG_PSERIES(), except we've already got relocation on
@@ -75,6 +77,7 @@
  * case EXCEPTION_RELON_PROLOG_PSERIES_1 will be using lr.
  */
 #define EXCEPTION_RELON_PROLOG_PSERIES(area, label, h, extra, vec)	\
+	EXCEPTION_PROLOG_0(area);					\
 	EXCEPTION_PROLOG_1(area, extra, vec);				\
 	EXCEPTION_RELON_PROLOG_PSERIES_1(label, h)
 
@@ -135,25 +138,32 @@ BEGIN_FTR_SECTION_NESTED(942)						\
 END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,0,942)  /*non P7*/		
 
 /*
- * Save PPR in paca whenever some register is available to use.
- * Then increase the priority.
+ * Get an SPR into a register if the CPU has the given feature
  */
-#define HMT_MEDIUM_PPR_SAVE(area, ra)					\
+#define OPT_GET_SPR(ra, spr, ftr)					\
 BEGIN_FTR_SECTION_NESTED(943)						\
-	mfspr	ra,SPRN_PPR;						\
-	std	ra,area+EX_PPR(r13);					\
-	HMT_MEDIUM;							\
-END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,943) 
+	mfspr	ra,spr;							\
+END_FTR_SECTION_NESTED(ftr,ftr,943)
 
-#define __EXCEPTION_PROLOG_1(area, extra, vec)				\
+/*
+ * Save a register to the PACA if the CPU has the given feature
+ */
+#define OPT_SAVE_REG_TO_PACA(offset, ra, ftr)				\
+BEGIN_FTR_SECTION_NESTED(943)						\
+	std	ra,offset(r13);						\
+END_FTR_SECTION_NESTED(ftr,ftr,943)
+
+#define EXCEPTION_PROLOG_0(area)					\
 	GET_PACA(r13);							\
 	std	r9,area+EX_R9(r13);	/* save r9 */			\
-	HMT_MEDIUM_PPR_SAVE(area, r9);					\
+	OPT_GET_SPR(r9, SPRN_PPR, CPU_FTR_HAS_PPR);			\
+	HMT_MEDIUM;							\
 	std	r10,area+EX_R10(r13);	/* save r10 - r12 */		\
-	BEGIN_FTR_SECTION_NESTED(66);					\
-	mfspr	r10,SPRN_CFAR;						\
-	std	r10,area+EX_CFAR(r13);					\
-	END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66);		\
+	OPT_GET_SPR(r10, SPRN_CFAR, CPU_FTR_CFAR)
+
+#define __EXCEPTION_PROLOG_1(area, extra, vec)				\
+	OPT_SAVE_REG_TO_PACA(area+EX_PPR, r9, CPU_FTR_HAS_PPR);		\
+	OPT_SAVE_REG_TO_PACA(area+EX_CFAR, r10, CPU_FTR_CFAR);		\
 	SAVE_LR(r10, area);						\
 	mfcr	r9;							\
 	extra(vec);							\
@@ -178,6 +188,7 @@ END_FTR_SECTION_NESTED(CPU_FTR_HAS_PPR,CPU_FTR_HAS_PPR,943)
 	__EXCEPTION_PROLOG_PSERIES_1(label, h)
 
 #define EXCEPTION_PROLOG_PSERIES(area, label, h, extra, vec)		\
+	EXCEPTION_PROLOG_0(area);					\
 	EXCEPTION_PROLOG_1(area, extra, vec);				\
 	EXCEPTION_PROLOG_PSERIES_1(label, h);
 
@@ -312,6 +323,13 @@ label##_pSeries:					\
 	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, label##_common,	\
 				 EXC_STD, KVMTEST_PR, vec)
 
+/* Version of above for when we have to branch out-of-line */
+#define STD_EXCEPTION_PSERIES_OOL(vec, label)			\
+	.globl label##_pSeries;					\
+label##_pSeries:						\
+	EXCEPTION_PROLOG_1(PACA_EXGEN, KVMTEST_PR, vec);	\
+	EXCEPTION_PROLOG_PSERIES_1(label##_common, EXC_STD)
+
 #define STD_EXCEPTION_HV(loc, vec, label)		\
 	. = loc;					\
 	.globl label##_hv;				\
@@ -321,6 +339,13 @@ label##_hv:						\
 	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, label##_common,	\
 				 EXC_HV, KVMTEST, vec)
 
+/* Version of above for when we have to branch out-of-line */
+#define STD_EXCEPTION_HV_OOL(vec, label)		\
+	.globl label##_hv;				\
+label##_hv:						\
+	EXCEPTION_PROLOG_1(PACA_EXGEN, KVMTEST, vec);	\
+	EXCEPTION_PROLOG_PSERIES_1(label##_common, EXC_HV)
+
 #define STD_RELON_EXCEPTION_PSERIES(loc, vec, label)	\
 	. = loc;					\
 	.globl label##_relon_pSeries;			\
@@ -331,6 +356,12 @@ label##_relon_pSeries:					\
 	EXCEPTION_RELON_PROLOG_PSERIES(PACA_EXGEN, label##_common, \
 				       EXC_STD, KVMTEST_PR, vec)
 
+#define STD_RELON_EXCEPTION_PSERIES_OOL(vec, label)		\
+	.globl label##_relon_pSeries;				\
+label##_relon_pSeries:						\
+	EXCEPTION_PROLOG_1(PACA_EXGEN, KVMTEST_PR, vec);	\
+	EXCEPTION_RELON_PROLOG_PSERIES_1(label##_common, EXC_STD)
+
 #define STD_RELON_EXCEPTION_HV(loc, vec, label)		\
 	. = loc;					\
 	.globl label##_relon_hv;			\
@@ -341,6 +372,12 @@ label##_relon_hv:					\
 	EXCEPTION_RELON_PROLOG_PSERIES(PACA_EXGEN, label##_common, \
 				       EXC_HV, KVMTEST, vec)
 
+#define STD_RELON_EXCEPTION_HV_OOL(vec, label)			\
+	.globl label##_relon_hv;				\
+label##_relon_hv:						\
+	EXCEPTION_PROLOG_1(PACA_EXGEN, KVMTEST, vec);		\
+	EXCEPTION_RELON_PROLOG_PSERIES_1(label##_common, EXC_HV)
+
 /* This associate vector numbers with bits in paca->irq_happened */
 #define SOFTEN_VALUE_0x500	PACA_IRQ_EE
 #define SOFTEN_VALUE_0x502	PACA_IRQ_EE
@@ -375,8 +412,10 @@ label##_relon_hv:					\
 #define __MASKABLE_EXCEPTION_PSERIES(vec, label, h, extra)		\
 	HMT_MEDIUM_PPR_DISCARD;						\
 	SET_SCRATCH0(r13);    /* save r13 */				\
-	__EXCEPTION_PROLOG_1(PACA_EXGEN, extra, vec);		\
+	EXCEPTION_PROLOG_0(PACA_EXGEN);					\
+	__EXCEPTION_PROLOG_1(PACA_EXGEN, extra, vec);			\
 	EXCEPTION_PROLOG_PSERIES_1(label##_common, h);
+
 #define _MASKABLE_EXCEPTION_PSERIES(vec, label, h, extra)		\
 	__MASKABLE_EXCEPTION_PSERIES(vec, label, h, extra)
 
@@ -394,9 +433,16 @@ label##_hv:								\
 	_MASKABLE_EXCEPTION_PSERIES(vec, label,				\
 				    EXC_HV, SOFTEN_TEST_HV)
 
+#define MASKABLE_EXCEPTION_HV_OOL(vec, label)				\
+	.globl label##_hv;						\
+label##_hv:								\
+	EXCEPTION_PROLOG_1(PACA_EXGEN, SOFTEN_TEST_HV, vec);		\
+	EXCEPTION_PROLOG_PSERIES_1(label##_common, EXC_HV);
+
 #define __MASKABLE_RELON_EXCEPTION_PSERIES(vec, label, h, extra)	\
 	HMT_MEDIUM_PPR_DISCARD;						\
 	SET_SCRATCH0(r13);    /* save r13 */				\
+	EXCEPTION_PROLOG_0(PACA_EXGEN);					\
 	__EXCEPTION_PROLOG_1(PACA_EXGEN, extra, vec);		\
 	EXCEPTION_RELON_PROLOG_PSERIES_1(label##_common, h);
 #define _MASKABLE_RELON_EXCEPTION_PSERIES(vec, label, h, extra)	\
@@ -416,6 +462,12 @@ label##_relon_hv:							\
 	_MASKABLE_RELON_EXCEPTION_PSERIES(vec, label,			\
 					  EXC_HV, SOFTEN_NOTEST_HV)
 
+#define MASKABLE_RELON_EXCEPTION_HV_OOL(vec, label)			\
+	.globl label##_relon_hv;					\
+label##_relon_hv:							\
+	EXCEPTION_PROLOG_1(PACA_EXGEN, SOFTEN_NOTEST_HV, vec);		\
+	EXCEPTION_PROLOG_PSERIES_1(label##_common, EXC_HV);
+
 /*
  * Our exception common code can be passed various "additions"
  * to specify the behaviour of interrupts, whether to kick the
diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index dc64165..b9bcf21 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -153,7 +153,10 @@ machine_check_pSeries_1:
 	 * some code path might still want to branch into the original
 	 * vector
 	 */
-	b	machine_check_pSeries
+	HMT_MEDIUM_PPR_DISCARD
+	SET_SCRATCH0(r13)		/* save r13 */
+	EXCEPTION_PROLOG_0(PACA_EXMC)
+	b	machine_check_pSeries_0
 
 	. = 0x300
 	.globl data_access_pSeries
@@ -172,6 +175,7 @@ END_MMU_FTR_SECTION_IFCLR(MMU_FTR_SLB)
 data_access_slb_pSeries:
 	HMT_MEDIUM_PPR_DISCARD
 	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXSLB)
 	EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST, 0x380)
 	std	r3,PACA_EXSLB+EX_R3(r13)
 	mfspr	r3,SPRN_DAR
@@ -203,6 +207,7 @@ data_access_slb_pSeries:
 instruction_access_slb_pSeries:
 	HMT_MEDIUM_PPR_DISCARD
 	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXSLB)
 	EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST_PR, 0x480)
 	std	r3,PACA_EXSLB+EX_R3(r13)
 	mfspr	r3,SPRN_SRR0		/* SRR0 is faulting address */
@@ -284,16 +289,28 @@ system_call_pSeries:
 	 */
 	. = 0xe00
 hv_exception_trampoline:
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	h_data_storage_hv
+
 	. = 0xe20
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	h_instr_storage_hv
+
 	. = 0xe40
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	emulation_assist_hv
-	. = 0xe50
-	b	hmi_exception_hv
+
 	. = 0xe60
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	hmi_exception_hv
+
 	. = 0xe80
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	h_doorbell_hv
 
 	/* We need to deal with the Altivec unavailable exception
@@ -303,14 +320,20 @@ hv_exception_trampoline:
 	 */
 performance_monitor_pSeries_1:
 	. = 0xf00
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	performance_monitor_pSeries
 
 altivec_unavailable_pSeries_1:
 	. = 0xf20
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	altivec_unavailable_pSeries
 
 vsx_unavailable_pSeries_1:
 	. = 0xf40
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	vsx_unavailable_pSeries
 
 #ifdef CONFIG_CBE_RAS
@@ -326,10 +349,7 @@ vsx_unavailable_pSeries_1:
 denorm_exception_hv:
 	HMT_MEDIUM_PPR_DISCARD
 	mtspr	SPRN_SPRG_HSCRATCH0,r13
-	mfspr	r13,SPRN_SPRG_HPACA
-	std	r9,PACA_EXGEN+EX_R9(r13)
-	HMT_MEDIUM_PPR_SAVE(PACA_EXGEN, r9)
-	std	r10,PACA_EXGEN+EX_R10(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	std	r11,PACA_EXGEN+EX_R11(r13)
 	std	r12,PACA_EXGEN+EX_R12(r13)
 	mfspr	r9,SPRN_SPRG_HSCRATCH0
@@ -372,8 +392,10 @@ machine_check_pSeries:
 machine_check_fwnmi:
 	HMT_MEDIUM_PPR_DISCARD
 	SET_SCRATCH0(r13)		/* save r13 */
-	EXCEPTION_PROLOG_PSERIES(PACA_EXMC, machine_check_common,
-				 EXC_STD, KVMTEST, 0x200)
+	EXCEPTION_PROLOG_0(PACA_EXMC)
+machine_check_pSeries_0:
+	EXCEPTION_PROLOG_1(PACA_EXMC, KVMTEST, 0x200)
+	EXCEPTION_PROLOG_PSERIES_1(machine_check_common, EXC_STD)
 	KVM_HANDLER_SKIP(PACA_EXMC, EXC_STD, 0x200)
 
 	/* moved from 0x300 */
@@ -510,23 +532,23 @@ ALT_FTR_SECTION_END_IFCLR(CPU_FTR_ARCH_206)
 
 	.align	7
 	/* moved from 0xe00 */
-	STD_EXCEPTION_HV(., 0xe02, h_data_storage)
+	STD_EXCEPTION_HV_OOL(0xe02, h_data_storage)
 	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0xe02)
-	STD_EXCEPTION_HV(., 0xe22, h_instr_storage)
+	STD_EXCEPTION_HV_OOL(0xe22, h_instr_storage)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe22)
-	STD_EXCEPTION_HV(., 0xe42, emulation_assist)
+	STD_EXCEPTION_HV_OOL(0xe42, emulation_assist)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe42)
-	STD_EXCEPTION_HV(., 0xe62, hmi_exception) /* need to flush cache ? */
+	STD_EXCEPTION_HV_OOL(0xe62, hmi_exception) /* need to flush cache ? */
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe62)
-	MASKABLE_EXCEPTION_HV(., 0xe82, h_doorbell)
+	MASKABLE_EXCEPTION_HV_OOL(0xe82, h_doorbell)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe82)
 
 	/* moved from 0xf00 */
-	STD_EXCEPTION_PSERIES(., 0xf00, performance_monitor)
+	STD_EXCEPTION_PSERIES_OOL(0xf00, performance_monitor)
 	KVM_HANDLER_PR(PACA_EXGEN, EXC_STD, 0xf00)
-	STD_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable)
+	STD_EXCEPTION_PSERIES_OOL(0xf20, altivec_unavailable)
 	KVM_HANDLER_PR(PACA_EXGEN, EXC_STD, 0xf20)
-	STD_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable)
+	STD_EXCEPTION_PSERIES_OOL(0xf40, vsx_unavailable)
 	KVM_HANDLER_PR(PACA_EXGEN, EXC_STD, 0xf40)
 
 /*
@@ -718,8 +740,8 @@ machine_check_common:
 	. = 0x4380
 	.globl data_access_slb_relon_pSeries
 data_access_slb_relon_pSeries:
-	HMT_MEDIUM_PPR_DISCARD
 	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXSLB)
 	EXCEPTION_PROLOG_1(PACA_EXSLB, NOTEST, 0x380)
 	std	r3,PACA_EXSLB+EX_R3(r13)
 	mfspr	r3,SPRN_DAR
@@ -743,8 +765,8 @@ data_access_slb_relon_pSeries:
 	. = 0x4480
 	.globl instruction_access_slb_relon_pSeries
 instruction_access_slb_relon_pSeries:
-	HMT_MEDIUM_PPR_DISCARD
 	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXSLB)
 	EXCEPTION_PROLOG_1(PACA_EXSLB, NOTEST, 0x480)
 	std	r3,PACA_EXSLB+EX_R3(r13)
 	mfspr	r3,SPRN_SRR0		/* SRR0 is faulting address */
@@ -788,33 +810,46 @@ system_call_relon_pSeries:
 	STD_RELON_EXCEPTION_PSERIES(0x4d00, 0xd00, single_step)
 
 	. = 0x4e00
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	h_data_storage_relon_hv
 
 	. = 0x4e20
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	h_instr_storage_relon_hv
 
 	. = 0x4e40
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	emulation_assist_relon_hv
 
-	. = 0x4e50
-	b	hmi_exception_relon_hv
-
 	. = 0x4e60
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	hmi_exception_relon_hv
 
 	. = 0x4e80
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	h_doorbell_relon_hv
 
 performance_monitor_relon_pSeries_1:
 	. = 0x4f00
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	performance_monitor_relon_pSeries
 
 altivec_unavailable_relon_pSeries_1:
 	. = 0x4f20
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	altivec_unavailable_relon_pSeries
 
 vsx_unavailable_relon_pSeries_1:
 	. = 0x4f40
+	SET_SCRATCH0(r13)
+	EXCEPTION_PROLOG_0(PACA_EXGEN)
 	b	vsx_unavailable_relon_pSeries
 
 	STD_RELON_EXCEPTION_PSERIES(0x5300, 0x1300, instruction_breakpoint)
@@ -1171,20 +1206,20 @@ END_FTR_SECTION_IFSET(CPU_FTR_VSX)
 __end_handlers:
 
 	/* Equivalents to the above handlers for relocation-on interrupt vectors */
-	STD_RELON_EXCEPTION_HV(., 0xe00, h_data_storage)
+	STD_RELON_EXCEPTION_HV_OOL(0xe00, h_data_storage)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe00)
-	STD_RELON_EXCEPTION_HV(., 0xe20, h_instr_storage)
+	STD_RELON_EXCEPTION_HV_OOL(0xe20, h_instr_storage)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe20)
-	STD_RELON_EXCEPTION_HV(., 0xe40, emulation_assist)
+	STD_RELON_EXCEPTION_HV_OOL(0xe40, emulation_assist)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe40)
-	STD_RELON_EXCEPTION_HV(., 0xe60, hmi_exception)
+	STD_RELON_EXCEPTION_HV_OOL(0xe60, hmi_exception)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe60)
-	MASKABLE_RELON_EXCEPTION_HV(., 0xe80, h_doorbell)
+	MASKABLE_RELON_EXCEPTION_HV_OOL(0xe80, h_doorbell)
 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0xe80)
 
-	STD_RELON_EXCEPTION_PSERIES(., 0xf00, performance_monitor)
-	STD_RELON_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable)
-	STD_RELON_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable)
+	STD_RELON_EXCEPTION_PSERIES_OOL(0xf00, performance_monitor)
+	STD_RELON_EXCEPTION_PSERIES_OOL(0xf20, altivec_unavailable)
+	STD_RELON_EXCEPTION_PSERIES_OOL(0xf40, vsx_unavailable)
 
 #if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 /*
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 1/4] powerpc: Remove Cell-specific relocation-on interrupt vector code
From: Paul Mackerras @ 2013-02-05  4:09 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt, Alexander Graf; +Cc: kvm-ppc
In-Reply-To: <20130205040902.GA20303@drongo>

The Cell processor doesn't support relocation-on interrupts, so we
don't need relocation-on versions of the interrupt vectors that are
purely Cell-specific.  This removes them.

Signed-off-by: Paul Mackerras <paulus@samba.org>
---
 arch/powerpc/kernel/exceptions-64s.S |   10 ----------
 1 file changed, 10 deletions(-)

diff --git a/arch/powerpc/kernel/exceptions-64s.S b/arch/powerpc/kernel/exceptions-64s.S
index 7a1c87c..dc64165 100644
--- a/arch/powerpc/kernel/exceptions-64s.S
+++ b/arch/powerpc/kernel/exceptions-64s.S
@@ -817,26 +817,16 @@ vsx_unavailable_relon_pSeries_1:
 	. = 0x4f40
 	b	vsx_unavailable_relon_pSeries
 
-#ifdef CONFIG_CBE_RAS
-	STD_RELON_EXCEPTION_HV(0x5200, 0x1202, cbe_system_error)
-#endif /* CONFIG_CBE_RAS */
 	STD_RELON_EXCEPTION_PSERIES(0x5300, 0x1300, instruction_breakpoint)
 #ifdef CONFIG_PPC_DENORMALISATION
 	. = 0x5500
 	b	denorm_exception_hv
 #endif
-#ifdef CONFIG_CBE_RAS
-	STD_RELON_EXCEPTION_HV(0x5600, 0x1602, cbe_maintenance)
-#else
 #ifdef CONFIG_HVC_SCOM
 	STD_RELON_EXCEPTION_HV(0x5600, 0x1600, maintence_interrupt)
 	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1600)
 #endif /* CONFIG_HVC_SCOM */
-#endif /* CONFIG_CBE_RAS */
 	STD_RELON_EXCEPTION_PSERIES(0x5700, 0x1700, altivec_assist)
-#ifdef CONFIG_CBE_RAS
-	STD_RELON_EXCEPTION_HV(0x5800, 0x1802, cbe_thermal)
-#endif /* CONFIG_CBE_RAS */
 
 	/* Other future vectors */
 	.align	7
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/4] Improve CFAR handling
From: Paul Mackerras @ 2013-02-05  4:09 UTC (permalink / raw)
  To: linuxppc-dev, Benjamin Herrenschmidt, Alexander Graf; +Cc: kvm-ppc

The CFAR (Come From Address Register) is useful for debugging; it
records the address of the most recent taken branch or rfid
instructions.  At present, KVM doesn't even try to context switch it,
and the first-level interrupt handlers for some interrupts have a
branch before it gets saved, which will corrupt it.

This series fixes the interrupt handlers to not corrupt the CFAR, and
makes KVM context-switch it.  The series is against Ben H.'s
next branch.  The last patch in the series corrects a compile error
for 32-bit PR KVM configs which was introduced by an earlier commit in
Ben's next branch.

I suggest this series should go via Ben's tree rather than the KVM
tree, since most of the changes are to core powerpc interrupt handling
code.  Alex, if you could ack patch 3/4 that would be helpful.

Paul.

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox