LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 0/13] Hypervisor-mode KVM on POWER7
From: Avi Kivity @ 2011-05-17 11:42 UTC (permalink / raw)
  To: Alexander Graf
  Cc: linuxppc-dev@ozlabs.org, Paul Mackerras, kvm@vger.kernel.org
In-Reply-To: <14CFAA70-1747-4902-8CC1-4BE924CAD031@suse.de>

On 05/17/2011 02:38 PM, Alexander Graf wrote:
> >
> >  What would be the path for these patches to get upstream?  Would this
> >  stuff normally go through Avi's tree?  There is a bit of a
> >  complication in that they are based on Ben's next branch.  Would Avi
> >  pull Ben's next branch, or would they go in via Ben's tree?
>
> Usually the ppc tree gets merged into Avi's tree and goes on from there. When we have interdependencies, we can certainly do it differently though. We can also shove them through Ben's tree this time around, as there are more dependencies on ppc code than KVM code.
>

Yes, both options are fine.  If it goes through kvm.git I can merge 
Ben's tree (provided it is append-only) and apply the kvm-ppc patches on 
top.

-- 
error compiling committee.c: too many arguments to function

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-17 12:42 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Eric Paris, Paul Mundt,
	Tejun Heo, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1305565422.5456.21.camel@gandalf.stny.rr.com>


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > > way multiple callbacks can be registered. How would:
> > > 
> > > 	err = event_x();
> > > 	if (err == -EACCESS) {
> > > 
> > > be handled? [...]
> > 
> > The default behavior would be something obvious: to trigger all callbacks and 
> > use the first non-zero return value.
> 
> But how do we know which callback that was from? There's no ordering of what 
> callbacks are called first.

We do not have to know that - nor do the calling sites care in general. Do you 
have some specific usecase in mind where the identity of the callback that 
generates a match matters?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-17 12:57 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	kees.cook, Serge E. Hallyn, Peter Zijlstra, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, linux-arm-kernel, Michal Marek,
	Michal Simek, linuxppc-dev, linux-kernel, Ralf Baechle,
	Paul Mundt, Martin Schwidefsky, linux390, Andrew Morton, agl,
	David S. Miller
In-Reply-To: <BANLkTimcYyTxUDN4QysyOitTJYJP9ZavZA@mail.gmail.com>


* Will Drewry <wad@chromium.org> wrote:

> > This is *far* more generic still yields the same short-term end result as 
> > far as your sandboxing is concerned.
> 
> Almost :/ [...]

Hey that's a pretty good result from a subsystem that was not written with your 
usecase in mind *at all* ;-)

> [...]  I still need to review the code you've pointed out, but, at present, 
> the ftrace hooks occur after the seccomp and syscall auditing hooks.  This 
> means that that code is exposed no matter what in this model.  To trim the 
> exposed surface to userspace, we really need those early hooks.  While I can 
> see both hacky and less hacky approaches around this, it stills strikes me 
> that the seccomp thread flag and early interception are good to reuse.  One 
> option might be to allow seccomp to be a secure-syscall event source, but I 
> suspect that lands more on the hack-y side of the fence :)

Agreed, there should be no security compromise imposed on your usecase, at all.

You could move the event callback sooner into the syscall-entry sequence to 
make sure it's the highest priority thing to process?

There's no semantic dependency on its current location so this can be changed 
AFAICS.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Steven Rostedt @ 2011-05-17 13:05 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Eric Paris, Paul Mundt,
	Tejun Heo, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110517124212.GB21441@elte.hu>

On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
> * Steven Rostedt <rostedt@goodmis.org> wrote:
> 
> > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > > 
> > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > > > way multiple callbacks can be registered. How would:
> > > > 
> > > > 	err = event_x();
> > > > 	if (err == -EACCESS) {
> > > > 
> > > > be handled? [...]
> > > 
> > > The default behavior would be something obvious: to trigger all callbacks and 
> > > use the first non-zero return value.
> > 
> > But how do we know which callback that was from? There's no ordering of what 
> > callbacks are called first.
> 
> We do not have to know that - nor do the calling sites care in general. Do you 
> have some specific usecase in mind where the identity of the callback that 
> generates a match matters?

Maybe I'm confused. I was thinking that these event_*() are what we
currently call trace_*(), but the event_*(), I assume, can return a
value if a call back returns one.

Thus, we now have the ability to dynamically attach function calls to
arbitrary points in the kernel that can have an affect on the code that
called it. Right now, we only have the ability to attach function calls
to these locations that have passive affects (tracing/profiling).

But you say, "nor do the calling sites care in general". Then what do
these calling sites do with the return code? Are we limiting these
actions to security only? Or can we have some other feature. I can
envision that we can make the Linux kernel quite dynamic here with "self
modifying code". That is, anywhere we have "hooks", perhaps we could
replace them with dynamic switches (jump labels). Maybe events would not
be the best use, but they could be a generic one.

Knowing what callback returned the result would be beneficial. Right
now, you are saying if the call back return anything, just abort the
call, not knowing what callback was called.

-- Steve

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-17 13:10 UTC (permalink / raw)
  To: James Morris
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, Paul Mackerras, Eric Paris,
	H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
	x86, Linus Torvalds, Ingo Molnar, kees.cook, Serge E. Hallyn,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	linux-arm-kernel, Michal Marek, Michal Simek, Will Drewry,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <alpine.LRH.2.00.1105171214330.31710@tundra.namei.org>

* James Morris <jmorris@namei.org> wrote:

> On Mon, 16 May 2011, Ingo Molnar wrote:
> 
> > > Not really.
> > > 
> > > Firstly, what is the security goal of these restrictions? [...]
> > 
> > To do what i described above? Namely:
> > 
> >  " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
> >    and /usr/lib/ "
> 
> These are access rules, they don't really describe a high-level security 
> goal. [...]

Restrictng sandboxed code to only open files within a given VFS namespace 
boundary sure sounds like a high-level security goal to me.

If implemented and set up correctly then it restricts sandboxed code to only be 
able to open files reachable via that VFS sub-namespace.

That is a rather meaningful high-level concept. What higher level concept do 
you want to argue?

> [...]  How do you know it's ok to open everything in these directories?

How do you know it's ok to open /etc/hosts? The sysadmin has configured the 
system that way.

How do you know that it's ok for sandboxed code to open files in 
/home/sandbox/? The sandbox developer has configured the system that way.

I'm not sure i get your point.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-17 13:19 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Eric Paris, Paul Mundt,
	Tejun Heo, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1305637528.5456.723.camel@gandalf.stny.rr.com>


* Steven Rostedt <rostedt@goodmis.org> wrote:

> On Tue, 2011-05-17 at 14:42 +0200, Ingo Molnar wrote:
> > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > 
> > > On Mon, 2011-05-16 at 18:52 +0200, Ingo Molnar wrote:
> > > > * Steven Rostedt <rostedt@goodmis.org> wrote:
> > > > 
> > > > > I'm a bit nervous about the 'active' role of (trace_)events, because of the 
> > > > > way multiple callbacks can be registered. How would:
> > > > > 
> > > > > 	err = event_x();
> > > > > 	if (err == -EACCESS) {
> > > > > 
> > > > > be handled? [...]
> > > > 
> > > > The default behavior would be something obvious: to trigger all callbacks and 
> > > > use the first non-zero return value.
> > > 
> > > But how do we know which callback that was from? There's no ordering of what 
> > > callbacks are called first.
> > 
> > We do not have to know that - nor do the calling sites care in general. Do you 
> > have some specific usecase in mind where the identity of the callback that 
> > generates a match matters?
> 
> Maybe I'm confused. I was thinking that these event_*() are what we
> currently call trace_*(), but the event_*(), I assume, can return a
> value if a call back returns one.

Yeah - and the call site can treat it as:

 - Ugh, if i get an error i need to abort whatever i was about to do

or (more advanced future use):

 - If i get a positive value i need to re-evaluate the parameters that were 
   passed in, they were changed

> Thus, we now have the ability to dynamically attach function calls to 
> arbitrary points in the kernel that can have an affect on the code that 
> called it. Right now, we only have the ability to attach function calls to 
> these locations that have passive affects (tracing/profiling).

Well, they can only have the effect that the calling site accepts and handles. 
So the 'effect' is not arbitrary and not defined by the callbacks, it is 
controlled and handled by the calling code.

We do not want invisible side-effects, opaque hooks, etc.

Instead of that we want (this is the getname() example i cited in the thread) 
explicit effects, like:

 if (event_vfs_getname(result))
	return ERR_PTR(-EPERM);

> But you say, "nor do the calling sites care in general". Then what do
> these calling sites do with the return code? Are we limiting these
> actions to security only? Or can we have some other feature. [...]

Yeah, not just security. One other example that came up recently is whether to 
panic the box on certain (bad) events such as NMI errors. This too could be 
made flexible via the event filter code: we already capture many events, so 
places that might conceivably do some policy could do so based on a filter 
condition.

> [...] I can envision that we can make the Linux kernel quite dynamic here 
> with "self modifying code". That is, anywhere we have "hooks", perhaps we 
> could replace them with dynamic switches (jump labels). Maybe events would 
> not be the best use, but they could be a generic one.

events and explicit function calls and explicit side-effects are pretty much 
the only thing that are acceptable. We do not want opaque hooks and arbitrary 
side-effects.

> Knowing what callback returned the result would be beneficial. Right now, you 
> are saying if the call back return anything, just abort the call, not knowing 
> what callback was called.

Yeah, and that's a feature: that way a number of conditions can be attached. 
Multiple security frameworks may have effect on a task or multiple tools might 
set policy action on a given event.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: James Morris @ 2011-05-17 13:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, Paul Mackerras, Eric Paris,
	H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
	x86, Linus Torvalds, Ingo Molnar, kees.cook, Serge E. Hallyn,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	linux-arm-kernel, Michal Marek, Michal Simek, Will Drewry,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110517131058.GE21441@elte.hu>

On Tue, 17 May 2011, Ingo Molnar wrote:

> I'm not sure i get your point.

Your example was not complete as described.  After an apparently simple 
specification, you've since added several qualifiers and assumptions, and 
I still doubt that it's complete.

A higher level goal would look like

"Allow a sandbox app access only to approved resources, to contain the 
effects of flaws in the app", or similar.

Note that this includes a threat model (remote attacker taking control of 
the app) and a general and fully stated strategy for dealing with it.

>From there, you can start to analyze how to implement the goal, at which 
point you'd start thinking about configuration, assumptions, filesystem 
access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared 
memory, invocation).

Anyway, this is getting off track from the main discussion, but you 
asked...

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* powerpc: mpc85xx regression since 2.6.39-rc2, one cpu core lame
From: Richard Cochran @ 2011-05-17 16:28 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, linux-kernel

Ben,

Recent 2.6.39-rc kernels behave strangely on the Freescale dual core
mpc8572 and p2020. There is a long pause (like 2 seconds) in the boot
sequence after "mpic: requesting IPIs..."

When the system comes up, only one core shows in /proc/cpuinfo. Later
on, lots of messages appear like the following:

   INFO: task ksoftirqd/1:9 blocked for more than 120 seconds.

I bisected [1] the problem to:

   commit c56e58537d504706954a06570b4034c04e5b7500
   Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
   Date:   Tue Mar 8 14:40:04 2011 +1100

       powerpc/smp: Create idle threads on demand and properly reset them

I don't see from that commit what had gone wrong. Perhaps you can
help resolve this?

Thanks,
Richard

1. I had to patch commit e5462d16 by hand when bisecting, which is a
   fixup for commit fa3f82c8 and not yet merged in c56e5853.

^ permalink raw reply

* RE: book to learn ppc assembly and architecture
From: s shaiju @ 2011-05-17 16:46 UTC (permalink / raw)
  To: mikey, benh; +Cc: linuxppc-dev
In-Reply-To: <18259.1305589415@neuling.org>

[-- Attachment #1: Type: text/plain, Size: 899 bytes --]


thanks.
 
regards,
sha
 
> From: mikey@neuling.org
> To: benh@kernel.crashing.org
> CC: sha_neb@hotmail.com; linuxppc-dev@lists.ozlabs.org
> Subject: Re: book to learn ppc assembly and architecture
> Date: Tue, 17 May 2011 09:43:35 +1000
> 
> In message <1305589123.2781.15.camel@pasglop> you wrote:
> > On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote:
> > > > what is the best book to learn assembly and architecture .
> > > 
> > > Reading the architecture books with a nice cup of tea.
> > > 
> > > http://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf
> > > 
> > 
> > A slightly less steep approach might be to get yourself the programmer
> > manuals of some older powerpc chips, like the freescale PEMs, and then
> > move on to the full ISA
> 
> True but a well aged Shiraz goes better with them than the cup of tea.
> 
> Mikey
 		 	   		  

[-- Attachment #2: Type: text/html, Size: 1294 bytes --]

^ permalink raw reply

* Re: [PATCH 08/13] kvm/powerpc: Move guest enter/exit down into subarch-specific code
From: Marcelo Tosatti @ 2011-05-17 18:05 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Alexander Graf, kvm
In-Reply-To: <20110511104331.GI2837@brick.ozlabs.ibm.com>

On Wed, May 11, 2011 at 08:43:31PM +1000, Paul Mackerras wrote:
> >From 964ee93b2d728e4fb16ae66eaceb6e912bf114ad Mon Sep 17 00:00:00 2001
> From: Paul Mackerras <paulus@samba.org>
> Date: Tue, 10 May 2011 22:23:18 +1000
> Subject: [PATCH 08/13] kvm/powerpc: Move guest enter/exit down into
>  subarch-specific code
> 
> Instead of doing the kvm_guest_enter/exit() and local_irq_dis/enable()
> calls in powerpc.c, this moves them down into the subarch-specific
> book3s_pr.c and booke.c.  This eliminates an extra local_irq_enable()
> call in book3s_pr.c, and will be needed for when we do SMT4 guest
> support in the book3s hypervisor mode code.
> 
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
>  arch/powerpc/include/asm/kvm_ppc.h   |    1 +
>  arch/powerpc/kvm/book3s_interrupts.S |    2 +-
>  arch/powerpc/kvm/book3s_pr.c         |   12 ++++++------
>  arch/powerpc/kvm/booke.c             |   13 +++++++++++++
>  arch/powerpc/kvm/powerpc.c           |    6 +-----
>  5 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
> index f3c218a..3210911 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -42,6 +42,7 @@ enum emulation_result {
>  	EMULATE_AGAIN,        /* something went wrong. go again */
>  };
>  
> +extern int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
>  extern int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
>  extern char kvmppc_handlers_start[];
>  extern unsigned long kvmppc_handler_len;
> diff --git a/arch/powerpc/kvm/book3s_interrupts.S b/arch/powerpc/kvm/book3s_interrupts.S
> index 2f0bc92..8c5e0e1 100644
> --- a/arch/powerpc/kvm/book3s_interrupts.S
> +++ b/arch/powerpc/kvm/book3s_interrupts.S
> @@ -85,7 +85,7 @@
>   *  r3: kvm_run pointer
>   *  r4: vcpu pointer
>   */
> -_GLOBAL(__kvmppc_vcpu_entry)
> +_GLOBAL(__kvmppc_vcpu_run)
>  
>  kvm_start_entry:
>  	/* Write correct stack frame */
> diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
> index 08cedf0..f769915 100644
> --- a/arch/powerpc/kvm/book3s_pr.c
> +++ b/arch/powerpc/kvm/book3s_pr.c
> @@ -891,8 +891,7 @@ void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
>  	vfree(vcpu_book3s);
>  }
>  
> -extern int __kvmppc_vcpu_entry(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu);
> -int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
> +int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>  {
>  	int ret;
>  	double fpr[32][TS_FPRWIDTH];
> @@ -944,14 +943,15 @@ int __kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
>  	/* Remember the MSR with disabled extensions */
>  	ext_msr = current->thread.regs->msr;
>  
> -	/* XXX we get called with irq disabled - change that! */
> -	local_irq_enable();
> -
>  	/* Preload FPU if it's enabled */
>  	if (vcpu->arch.shared->msr & MSR_FP)
>  		kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
>  
> -	ret = __kvmppc_vcpu_entry(kvm_run, vcpu);
> +	kvm_guest_enter();


kvm_guest_enter should run with interrupts disabled.

^ permalink raw reply

* Re: [PATCH 08/13] kvm/powerpc: Move guest enter/exit down into subarch-specific code
From: Marcelo Tosatti @ 2011-05-17 18:10 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, Alexander Graf, kvm
In-Reply-To: <20110517180511.GA9728@amt.cnet>

On Tue, May 17, 2011 at 03:05:12PM -0300, Marcelo Tosatti wrote:
> >  
> > -	ret = __kvmppc_vcpu_entry(kvm_run, vcpu);
> > +	kvm_guest_enter();
> 
> 
> kvm_guest_enter should run with interrupts disabled.

Its fine, please ignore message.

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-17 18:34 UTC (permalink / raw)
  To: James Morris
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, Paul Mackerras, Eric Paris,
	H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
	x86, Linus Torvalds, Ingo Molnar, kees.cook, Serge E. Hallyn,
	Peter Zijlstra, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	linux-arm-kernel, Michal Marek, Michal Simek, Will Drewry,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <alpine.LRH.2.00.1105172317060.5404@tundra.namei.org>

* James Morris <jmorris@namei.org> wrote:

> On Tue, 17 May 2011, Ingo Molnar wrote:
> 
> > I'm not sure i get your point.
> 
> Your example was not complete as described.  After an apparently simple 
> specification, you've since added several qualifiers and assumptions, [...]

I havent added any qualifiers really (i added examples/description), the opt-in 
method i mentioned in my first mail should be pretty robust:

 | Firstly, using the filter code i deny the various link creation syscalls so 
 | that sandboxed code cannot escape for example by creating a symlink to 
 | outside the permitted VFS namespace. (Note: we opt-in to syscalls, that way 
 | new syscalls added by new kernels are denied by defalt. The current symlink 
 | creation syscalls are not opted in to.)

> [...] and I still doubt that it's complete.

I could too claim that i doubt that the SELinux kernel implementation is 
secure!

So how about we both come up with specific examples about how it's not secure, 
instead of going down the fear-uncertainty-and-doubt road? ;-)

> A higher level goal would look like
> 
> "Allow a sandbox app access only to approved resources, to contain the 
> effects of flaws in the app", or similar.

I see what you mean.

I really think that "restricting sandboxed code to only open files within a 
given VFS namespace boundary" is the most useful highlevel description here - 
which is really a subset of a "allow a sandbox app access only to an easily 
approved set of files" highlevel concept.

There's no "to contain ..." bit here: *all* of the sandboxed app code is 
untrusted, so there's no 'remote attacker' and we do not limit our threat to 
flaws in the app. We want to contain apps to within a small subset of Linux 
functionality, and we want to do that within regular apps (without having to be 
superuser), full stop.

> Note that this includes a threat model (remote attacker taking control of the 
> app) and a general and fully stated strategy for dealing with it.

Attacker does not have to be remote - most sandboxing concepts protect against 
locally installed plugins/apps/applets. In sandboxing the whole app is 
considered untrusted - not just some flaw in it, abused remotely.

> From there, you can start to analyze how to implement the goal, at which 
> point you'd start thinking about configuration, assumptions, filesystem 
> access, namespaces, indirect access (e.g. via sockets, rpc, ipc, shared 
> memory, invocation).

Sandboxed code generally does not have access to anything fancy like that - if 
it is added then all possible side effects have to be examined.

Thanks,

	Ingo

^ permalink raw reply

* Re: book to learn ppc assembly and architecture
From: kevin diggs @ 2011-05-17 19:24 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Michael Neuling, linuxppc-dev, s shaiju
In-Reply-To: <1305589123.2781.15.camel@pasglop>

Hi,

On Mon, May 16, 2011 at 6:38 PM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote:
>> > what =A0is =A0the =A0best =A0book =A0to =A0learn =A0assembly =A0and =
=A0architecture .
>>

Assuming you have a powerpc compiler available you can use the -S
option to produce assembly listings.

^ permalink raw reply

* Re: powerpc: mpc85xx regression since 2.6.39-rc2, one cpu core lame
From: Benjamin Herrenschmidt @ 2011-05-17 21:40 UTC (permalink / raw)
  To: Richard Cochran, Kumar Gala; +Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20110517162827.GA16918@riccoc20.at.omicron.at>

On Tue, 2011-05-17 at 18:28 +0200, Richard Cochran wrote:
> Ben,
> 
> Recent 2.6.39-rc kernels behave strangely on the Freescale dual core
> mpc8572 and p2020. There is a long pause (like 2 seconds) in the boot
> sequence after "mpic: requesting IPIs..."
> 
> When the system comes up, only one core shows in /proc/cpuinfo. Later
> on, lots of messages appear like the following:
> 
>    INFO: task ksoftirqd/1:9 blocked for more than 120 seconds.
> 
> I bisected [1] the problem to:
> 
>    commit c56e58537d504706954a06570b4034c04e5b7500
>    Author: Benjamin Herrenschmidt <benh@kernel.crashing.org>
>    Date:   Tue Mar 8 14:40:04 2011 +1100
> 
>        powerpc/smp: Create idle threads on demand and properly reset them
> 
> I don't see from that commit what had gone wrong. Perhaps you can
> help resolve this?

Hrm, odd. Kumar, care to have a look ? That's what happens when you
don't get me HW to test with :-)

Cheers,
Ben.

> Thanks,
> Richard
> 
> 
> 1. I had to patch commit e5462d16 by hand when bisecting, which is a
>    fixup for commit fa3f82c8 and not yet merged in c56e5853.

^ permalink raw reply

* [PATCH 01/13] powerpc/e500: Save SPEFCSR in flush_spe_to_thread()
From: Scott Wood @ 2011-05-17 23:35 UTC (permalink / raw)
  To: agraf, galak; +Cc: linuxppc-dev, kvm-ppc

From: yu liu <yu.liu@freescale.com>

giveup_spe() saves the SPE state which is protected by MSR[SPE].
However, modifying SPEFSCR does not trap when MSR[SPE]=0.
And since SPEFSCR is already saved/restored in _switch(),
not all the callers want to save SPEFSCR again.
Thus, saving SPEFSCR should not belong to giveup_spe().

This patch moves SPEFSCR saving to flush_spe_to_thread(),
and cleans up the caller that needs to save SPEFSCR accordingly.

Signed-off-by: Liu Yu <yu.liu@freescale.com>
Signed-off-by: Scott Wood <scottwood@freescale.com>
---
This is a resending of http://patchwork.ozlabs.org/patch/88677/

Kumar, please ack to go via kvm.  This is holding up the rest of the SPE
patches, which in turn are holding up the MMU patches due to both
touching the MSR update code.

 arch/powerpc/kernel/head_fsl_booke.S |    2 --
 arch/powerpc/kernel/process.c        |    1 +
 arch/powerpc/kernel/traps.c          |    5 +----
 3 files changed, 2 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index 5ecf54c..aede4f8 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -792,8 +792,6 @@ _GLOBAL(giveup_spe)
 	evmwumiaa evr6, evr6, evr6	/* evr6 <- ACC = 0 * 0 + ACC */
 	li	r4,THREAD_ACC
 	evstddx	evr6, r4, r3		/* save off accumulator */
-	mfspr	r6,SPRN_SPEFSCR
-	stw	r6,THREAD_SPEFSCR(r3)	/* save spefscr register value */
 	beq	1f
 	lwz	r4,_MSR-STACK_FRAME_OVERHEAD(r5)
 	lis	r3,MSR_SPE@h
diff --git a/arch/powerpc/kernel/process.c b/arch/powerpc/kernel/process.c
index f74f355..138e7dd 100644
--- a/arch/powerpc/kernel/process.c
+++ b/arch/powerpc/kernel/process.c
@@ -213,6 +213,7 @@ void flush_spe_to_thread(struct task_struct *tsk)
 #ifdef CONFIG_SMP
 			BUG_ON(tsk != current);
 #endif
+			tsk->thread.spefscr = mfspr(SPRN_SPEFSCR);
 			giveup_spe(tsk);
 		}
 		preempt_enable();
diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
index 5ddb801..742a0fb 100644
--- a/arch/powerpc/kernel/traps.c
+++ b/arch/powerpc/kernel/traps.c
@@ -1356,10 +1356,7 @@ void SPEFloatingPointException(struct pt_regs *regs)
 	int code = 0;
 	int err;
 
-	preempt_disable();
-	if (regs->msr & MSR_SPE)
-		giveup_spe(current);
-	preempt_enable();
+	flush_spe_to_thread(current);
 
 	spefscr = current->thread.spefscr;
 	fpexc_mode = current->thread.fpexc_mode;
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 02/13] powerpc/e500: SPE register saving: take arbitrary struct offset
From: Scott Wood @ 2011-05-17 23:36 UTC (permalink / raw)
  To: agraf, galak; +Cc: linuxppc-dev, kvm-ppc
In-Reply-To: <20110517233551.GA3511@schlenkerla.am.freescale.net>

Previously, these macros hardcoded THREAD_EVR0 as the base of the save
area, relative to the base register passed.  This base offset is now
passed as a separate macro parameter, allowing reuse with other SPE
save areas, such as used by KVM.

Signed-off-by: Scott Wood <scottwood@freescale.com>
---
This is a resending of http://www.spinics.net/lists/kvm-ppc/msg02672.html

Kumar, please ack to go via kvm.

 arch/powerpc/include/asm/ppc_asm.h   |   28 ++++++++++++++++------------
 arch/powerpc/kernel/head_fsl_booke.S |    6 +++---
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/powerpc/include/asm/ppc_asm.h b/arch/powerpc/include/asm/ppc_asm.h
index 9821006..ba0cd33 100644
--- a/arch/powerpc/include/asm/ppc_asm.h
+++ b/arch/powerpc/include/asm/ppc_asm.h
@@ -150,18 +150,22 @@ END_FW_FTR_SECTION_IFSET(FW_FEATURE_SPLPAR)
 #define REST_16VSRSU(n,b,base)	REST_8VSRSU(n,b,base); REST_8VSRSU(n+8,b,base)
 #define REST_32VSRSU(n,b,base)	REST_16VSRSU(n,b,base); REST_16VSRSU(n+16,b,base)
 
-#define SAVE_EVR(n,s,base)	evmergehi s,s,n; stw s,THREAD_EVR0+4*(n)(base)
-#define SAVE_2EVRS(n,s,base)	SAVE_EVR(n,s,base); SAVE_EVR(n+1,s,base)
-#define SAVE_4EVRS(n,s,base)	SAVE_2EVRS(n,s,base); SAVE_2EVRS(n+2,s,base)
-#define SAVE_8EVRS(n,s,base)	SAVE_4EVRS(n,s,base); SAVE_4EVRS(n+4,s,base)
-#define SAVE_16EVRS(n,s,base)	SAVE_8EVRS(n,s,base); SAVE_8EVRS(n+8,s,base)
-#define SAVE_32EVRS(n,s,base)	SAVE_16EVRS(n,s,base); SAVE_16EVRS(n+16,s,base)
-#define REST_EVR(n,s,base)	lwz s,THREAD_EVR0+4*(n)(base); evmergelo n,s,n
-#define REST_2EVRS(n,s,base)	REST_EVR(n,s,base); REST_EVR(n+1,s,base)
-#define REST_4EVRS(n,s,base)	REST_2EVRS(n,s,base); REST_2EVRS(n+2,s,base)
-#define REST_8EVRS(n,s,base)	REST_4EVRS(n,s,base); REST_4EVRS(n+4,s,base)
-#define REST_16EVRS(n,s,base)	REST_8EVRS(n,s,base); REST_8EVRS(n+8,s,base)
-#define REST_32EVRS(n,s,base)	REST_16EVRS(n,s,base); REST_16EVRS(n+16,s,base)
+/*
+ * b = base register for addressing, o = base offset from register of 1st EVR
+ * n = first EVR, s = scratch
+ */
+#define SAVE_EVR(n,s,b,o)	evmergehi s,s,n; stw s,o+4*(n)(b)
+#define SAVE_2EVRS(n,s,b,o)	SAVE_EVR(n,s,b,o); SAVE_EVR(n+1,s,b,o)
+#define SAVE_4EVRS(n,s,b,o)	SAVE_2EVRS(n,s,b,o); SAVE_2EVRS(n+2,s,b,o)
+#define SAVE_8EVRS(n,s,b,o)	SAVE_4EVRS(n,s,b,o); SAVE_4EVRS(n+4,s,b,o)
+#define SAVE_16EVRS(n,s,b,o)	SAVE_8EVRS(n,s,b,o); SAVE_8EVRS(n+8,s,b,o)
+#define SAVE_32EVRS(n,s,b,o)	SAVE_16EVRS(n,s,b,o); SAVE_16EVRS(n+16,s,b,o)
+#define REST_EVR(n,s,b,o)	lwz s,o+4*(n)(b); evmergelo n,s,n
+#define REST_2EVRS(n,s,b,o)	REST_EVR(n,s,b,o); REST_EVR(n+1,s,b,o)
+#define REST_4EVRS(n,s,b,o)	REST_2EVRS(n,s,b,o); REST_2EVRS(n+2,s,b,o)
+#define REST_8EVRS(n,s,b,o)	REST_4EVRS(n,s,b,o); REST_4EVRS(n+4,s,b,o)
+#define REST_16EVRS(n,s,b,o)	REST_8EVRS(n,s,b,o); REST_8EVRS(n+8,s,b,o)
+#define REST_32EVRS(n,s,b,o)	REST_16EVRS(n,s,b,o); REST_16EVRS(n+16,s,b,o)
 
 /* Macros to adjust thread priority for hardware multithreading */
 #define HMT_VERY_LOW	or	31,31,31	# very low priority
diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S
index aede4f8..fe37dd0 100644
--- a/arch/powerpc/kernel/head_fsl_booke.S
+++ b/arch/powerpc/kernel/head_fsl_booke.S
@@ -656,7 +656,7 @@ load_up_spe:
 	cmpi	0,r4,0
 	beq	1f
 	addi	r4,r4,THREAD	/* want THREAD of last_task_used_spe */
-	SAVE_32EVRS(0,r10,r4)
+	SAVE_32EVRS(0,r10,r4,THREAD_EVR0)
 	evxor	evr10, evr10, evr10	/* clear out evr10 */
 	evmwumiaa evr10, evr10, evr10	/* evr10 <- ACC = 0 * 0 + ACC */
 	li	r5,THREAD_ACC
@@ -676,7 +676,7 @@ load_up_spe:
 	stw	r4,THREAD_USED_SPE(r5)
 	evlddx	evr4,r10,r5
 	evmra	evr4,evr4
-	REST_32EVRS(0,r10,r5)
+	REST_32EVRS(0,r10,r5,THREAD_EVR0)
 #ifndef CONFIG_SMP
 	subi	r4,r5,THREAD
 	stw	r4,last_task_used_spe@l(r3)
@@ -787,7 +787,7 @@ _GLOBAL(giveup_spe)
 	addi	r3,r3,THREAD		/* want THREAD of task */
 	lwz	r5,PT_REGS(r3)
 	cmpi	0,r5,0
-	SAVE_32EVRS(0, r4, r3)
+	SAVE_32EVRS(0, r4, r3, THREAD_EVR0)
 	evxor	evr6, evr6, evr6	/* clear out evr6 */
 	evmwumiaa evr6, evr6, evr6	/* evr6 <- ACC = 0 * 0 + ACC */
 	li	r4,THREAD_ACC
-- 
1.7.4.1

^ permalink raw reply related

* Re: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq is not atomic
From: James Bottomley @ 2011-05-18  4:23 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Prakash, Sathya, Desai, Kashyap, linux-scsi@vger.kernel.org,
	linuxppc-dev, paulus, Moore, Eric
In-Reply-To: <20110518041551.GL15227@parisc-linux.org>

On Tue, 2011-05-17 at 22:15 -0600, Matthew Wilcox wrote:
> On Wed, May 18, 2011 at 09:37:08AM +0530, Desai, Kashyap wrote:
> > On Wed, 2011-05-04 at 17:23 +0530, Kashyap, Desai wrote:
> > > The following code seems to be there in /usr/src/linux/arch/x86/include/asm/io.h.
> > > This is not going to work.
> > > 
> > > static inline void writeq(__u64 val, volatile void __iomem *addr)
> > > {
> > >         writel(val, addr);
> > >         writel(val >> 32, addr+4);
> > > }
> > > 
> > > So with this code turned on in the kernel, there is going to be race condition 
> > > where multiple cpus can be writing to the request descriptor at the same time.
> > > 
> > > Meaning this could happen:
> > > (A) CPU A doest 32bit write
> > > (B) CPU B does 32 bit write
> > > (C) CPU A does 32 bit write
> > > (D) CPU B does 32 bit write
> > > 
> > > We need the 64 bit completed in one access pci memory write, else spin lock is required.
> > > Since it's going to be difficult to know which writeq was implemented in the kernel, 
> > > the driver is going to have to always acquire a spin lock each time we do 64bit write.
> > > 
> > > Cc: stable@kernle.org
> > > Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
> > > ---
> > > diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.c b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > index efa0255..5778334 100644
> > > --- a/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > +++ b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > @@ -1558,7 +1558,6 @@ mpt2sas_base_free_smid(struct MPT2SAS_ADAPTER *ioc, u16 smid)
> > >   * care of 32 bit environment where its not quarenteed to send the entire word
> > >   * in one transfer.
> > >   */
> > > -#ifndef writeq
> > 
> > Why not make this #ifndef CONFIG_64BIT?  You know that all 64 bit
> > systems have writeq implemented correctly; you suspect 32 bit systems
> > don't.
> > 
> > James
> > 
> > James, This issue was observed on PPC64 system. So what you have suggested will not solve this issue.
> > If we are sure that writeq() is atomic across all architecture, we can use it safely. As we have seen issue on ppc64, we are not confident to use
> > "writeq" call.
> 
> So have you told the powerpc people that they have a broken writeq?

I'm just in the process of finding them now on IRC so I can demand an
explanation: this is a really serious API problem because writeq is
supposed to be atomic on 64 bit.

> And why do you obfuscate your report by talking about i386 when it's
> really about powerpc64?

James

^ permalink raw reply

* Re: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq is not atomic
From: Matthew Wilcox @ 2011-05-18  4:15 UTC (permalink / raw)
  To: Desai, Kashyap
  Cc: Prakash, Sathya, linux-scsi@vger.kernel.org, linuxppc-dev,
	James Bottomley, paulus, Moore, Eric
In-Reply-To: <B2FD678A64EAAD45B089B123FDFC3ED70157F7BCE5@inbmail01.lsi.com>

On Wed, May 18, 2011 at 09:37:08AM +0530, Desai, Kashyap wrote:
> On Wed, 2011-05-04 at 17:23 +0530, Kashyap, Desai wrote:
> > The following code seems to be there in /usr/src/linux/arch/x86/include/asm/io.h.
> > This is not going to work.
> > 
> > static inline void writeq(__u64 val, volatile void __iomem *addr)
> > {
> >         writel(val, addr);
> >         writel(val >> 32, addr+4);
> > }
> > 
> > So with this code turned on in the kernel, there is going to be race condition 
> > where multiple cpus can be writing to the request descriptor at the same time.
> > 
> > Meaning this could happen:
> > (A) CPU A doest 32bit write
> > (B) CPU B does 32 bit write
> > (C) CPU A does 32 bit write
> > (D) CPU B does 32 bit write
> > 
> > We need the 64 bit completed in one access pci memory write, else spin lock is required.
> > Since it's going to be difficult to know which writeq was implemented in the kernel, 
> > the driver is going to have to always acquire a spin lock each time we do 64bit write.
> > 
> > Cc: stable@kernle.org
> > Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
> > ---
> > diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.c b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > index efa0255..5778334 100644
> > --- a/drivers/scsi/mpt2sas/mpt2sas_base.c
> > +++ b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > @@ -1558,7 +1558,6 @@ mpt2sas_base_free_smid(struct MPT2SAS_ADAPTER *ioc, u16 smid)
> >   * care of 32 bit environment where its not quarenteed to send the entire word
> >   * in one transfer.
> >   */
> > -#ifndef writeq
> 
> Why not make this #ifndef CONFIG_64BIT?  You know that all 64 bit
> systems have writeq implemented correctly; you suspect 32 bit systems
> don't.
> 
> James
> 
> James, This issue was observed on PPC64 system. So what you have suggested will not solve this issue.
> If we are sure that writeq() is atomic across all architecture, we can use it safely. As we have seen issue on ppc64, we are not confident to use
> "writeq" call.

So have you told the powerpc people that they have a broken writeq?
And why do you obfuscate your report by talking about i386 when it's
really about powerpc64?

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply

* Re: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq is not atomic
From: Benjamin Herrenschmidt @ 2011-05-18  5:45 UTC (permalink / raw)
  To: Desai, Kashyap
  Cc: Prakash, Sathya, linux-scsi@vger.kernel.org, Matthew Wilcox,
	linuxppc-dev, James Bottomley, paulus, Moore, Eric
In-Reply-To: <20110518041551.GL15227@parisc-linux.org>


> > > Cc: stable@kernle.org
> > > Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
> > > ---
> > > diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.c b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > index efa0255..5778334 100644
> > > --- a/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > +++ b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > @@ -1558,7 +1558,6 @@ mpt2sas_base_free_smid(struct MPT2SAS_ADAPTER *ioc, u16 smid)
> > >   * care of 32 bit environment where its not quarenteed to send the entire word
> > >   * in one transfer.
> > >   */
> > > -#ifndef writeq
> > 
> > Why not make this #ifndef CONFIG_64BIT?  You know that all 64 bit
> > systems have writeq implemented correctly; you suspect 32 bit systems
> > don't.
> > 
> > James
> > 
> > James, This issue was observed on PPC64 system. So what you have
> suggested will not solve this issue.
> > If we are sure that writeq() is atomic across all architecture, we
> can use it safely. As we have seen issue on ppc64, we are not
> confident to use
> > "writeq" call.
> 
> So have you told the powerpc people that they have a broken writeq?
> And why do you obfuscate your report by talking about i386 when it's
> really about powerpc64?

Well, our writeq isn't supposed to be broken :-)

It's defined as an std instruction (and "ld" for readq) so that's
perfectly atomic ... provided your access is aligned. Is it ?

Cheers,
Be

^ permalink raw reply

* RE: book to learn ppc assembly and architecture
From: David Laight @ 2011-05-18  7:56 UTC (permalink / raw)
  To: kevin diggs, Benjamin Herrenschmidt
  Cc: Michael Neuling, linuxppc-dev, s shaiju
In-Reply-To: <BANLkTinuzAwxxMJ+BAYrrotH6an=wvZ3=Q@mail.gmail.com>

=20
> > On Mon, 2011-05-16 at 16:37 +1000, Michael Neuling wrote:
> >> > what =A0is =A0the =A0best =A0book =A0to =A0learn =A0assembly =
=A0and architecture .
>=20
> Assuming you have a powerpc compiler available you can use the -S
> option to produce assembly listings.

With gcc add -fverbose-asm for more info.

For a general background, look at something much simpler than ppc,
even if you don't write/run any code.

	David

^ permalink raw reply

* Re: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq is not atomic
From: Benjamin Herrenschmidt @ 2011-05-18  7:00 UTC (permalink / raw)
  To: James Bottomley
  Cc: Prakash, Sathya, Desai, Kashyap, linux-scsi@vger.kernel.org,
	Matthew Wilcox, linuxppc-dev, miltonm, paulus, Moore, Eric
In-Reply-To: <1305692584.2580.3.camel@mulgrave.site>

(Just adding Milton to the CC list, he suspects races in the driver
instead).

On Wed, 2011-05-18 at 08:23 +0400, James Bottomley wrote:
> On Tue, 2011-05-17 at 22:15 -0600, Matthew Wilcox wrote:
> > On Wed, May 18, 2011 at 09:37:08AM +0530, Desai, Kashyap wrote:
> > > On Wed, 2011-05-04 at 17:23 +0530, Kashyap, Desai wrote:
> > > > The following code seems to be there in /usr/src/linux/arch/x86/include/asm/io.h.
> > > > This is not going to work.
> > > > 
> > > > static inline void writeq(__u64 val, volatile void __iomem *addr)
> > > > {
> > > >         writel(val, addr);
> > > >         writel(val >> 32, addr+4);
> > > > }
> > > > 
> > > > So with this code turned on in the kernel, there is going to be race condition 
> > > > where multiple cpus can be writing to the request descriptor at the same time.
> > > > 
> > > > Meaning this could happen:
> > > > (A) CPU A doest 32bit write
> > > > (B) CPU B does 32 bit write
> > > > (C) CPU A does 32 bit write
> > > > (D) CPU B does 32 bit write
> > > > 
> > > > We need the 64 bit completed in one access pci memory write, else spin lock is required.
> > > > Since it's going to be difficult to know which writeq was implemented in the kernel, 
> > > > the driver is going to have to always acquire a spin lock each time we do 64bit write.
> > > > 
> > > > Cc: stable@kernle.org
> > > > Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
> > > > ---
> > > > diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.c b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > > index efa0255..5778334 100644
> > > > --- a/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > > +++ b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > > @@ -1558,7 +1558,6 @@ mpt2sas_base_free_smid(struct MPT2SAS_ADAPTER *ioc, u16 smid)
> > > >   * care of 32 bit environment where its not quarenteed to send the entire word
> > > >   * in one transfer.
> > > >   */
> > > > -#ifndef writeq
> > > 
> > > Why not make this #ifndef CONFIG_64BIT?  You know that all 64 bit
> > > systems have writeq implemented correctly; you suspect 32 bit systems
> > > don't.
> > > 
> > > James
> > > 
> > > James, This issue was observed on PPC64 system. So what you have suggested will not solve this issue.
> > > If we are sure that writeq() is atomic across all architecture, we can use it safely. As we have seen issue on ppc64, we are not confident to use
> > > "writeq" call.
> > 
> > So have you told the powerpc people that they have a broken writeq?
> 
> I'm just in the process of finding them now on IRC so I can demand an
> explanation: this is a really serious API problem because writeq is
> supposed to be atomic on 64 bit.
> 
> > And why do you obfuscate your report by talking about i386 when it's
> > really about powerpc64?
> 
> James
> 

^ permalink raw reply

* RE: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq isnot atomic
From: David Laight @ 2011-05-18  8:04 UTC (permalink / raw)
  To: James Bottomley, Matthew Wilcox
  Cc: Prakash, Sathya, Desai, Kashyap, linux-scsi, linuxppc-dev, paulus,
	Moore, Eric
In-Reply-To: <1305692584.2580.3.camel@mulgrave.site>

=20

> > > > static inline void writeq(__u64 val, volatile void __iomem
*addr)
> > > > {
> > > >         writel(val, addr);
> > > >         writel(val >> 32, addr+4);
> > > > }
...
> > > > We need the 64 bit completed in one access pci memory write,
else spin lock is required.
> > > > Since it's going to be difficult to know which writeq was
implemented in the kernel,=20
> > > > the driver is going to have to always acquire a spin lock each
time we do 64bit write.
...
> I'm just in the process of finding them now on IRC so I can demand an
> explanation: this is a really serious API problem because writeq is
> supposed to be atomic on 64 bit.

Most 32 bit systems don't have atomic 64bit writes.
I'd also have thought there would be code which wouldn't mind the
write being done as two cycles.

I'm not sure that some of the ppc soc systems are capable of
doing a 64bit data pci/pcie cycle except by dma.
So your driver is probably doomed to require a lock.

	David

^ permalink raw reply

* Re: [PATCH 1/3] mpt2sas: remove the use of writeq, since writeq is not atomic
From: Milton Miller @ 2011-05-18  8:23 UTC (permalink / raw)
  To: Desai, Kashyap, Moore, Eric, Prakash, Sathya
  Cc: linux-scsi@vger.kernel.org, Matthew Wilcox, James Bottomley,
	paulus, linuxppc-dev
In-Reply-To: <1305702010.2781.33.camel@pasglop>

On Wed, 18 May 2011 around 17:00:10 +1000, Benjamin Herrenschmidt wrote:
> (Just adding Milton to the CC list, he suspects races in the
> driver instead).
>
> On Wed, 2011-05-18 at 08:23 +0400, James Bottomley wrote:
> > On Tue, 2011-05-17 at 22:15 -0600, Matthew Wilcox wrote:
> > > On Wed, May 18, 2011 at 09:37:08AM +0530, Desai, Kashyap wrote:
> > > > On Wed, 2011-05-04 at 17:23 +0530, Kashyap, Desai wrote:
> > > > > The following code seems to be there in /usr/src/linux/arch/x86/include/asm/io.h.
> > > > > This is not going to work.
> > > > > 
> > > > > static inline void writeq(__u64 val, volatile void __iomem *addr)
> > > > > {
> > > > >         writel(val, addr);
> > > > >         writel(val >> 32, addr+4);
> > > > > }
> > > > > 
> > > > > So with this code turned on in the kernel, there is going to be race condition 
> > > > > where multiple cpus can be writing to the request descriptor at the same time.
> > > > > 
> > > > > Meaning this could happen:
> > > > > (A) CPU A doest 32bit write
> > > > > (B) CPU B does 32 bit write
> > > > > (C) CPU A does 32 bit write
> > > > > (D) CPU B does 32 bit write
> > > > > 
> > > > > We need the 64 bit completed in one access pci memory write, else spin lock is required.
> > > > > Since it's going to be difficult to know which writeq was implemented in the kernel, 
> > > > > the driver is going to have to always acquire a spin lock each time we do 64bit write.
> > > > > 
> > > > > Cc: stable@kernle.org
> > > > > Signed-off-by: Kashyap Desai <kashyap.desai@lsi.com>
> > > > > ---
> > > > > diff --git a/drivers/scsi/mpt2sas/mpt2sas_base.c b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > > > index efa0255..5778334 100644
> > > > > --- a/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > > > +++ b/drivers/scsi/mpt2sas/mpt2sas_base.c
> > > > > @@ -1558,7 +1558,6 @@ mpt2sas_base_free_smid(struct MPT2SAS_ADAPTER *ioc, u16 smid)
> > > > >   * care of 32 bit environment where its not quarenteed to send the entire word
> > > > >   * in one transfer.
> > > > >   */
> > > > > -#ifndef writeq
> > > > 
> > > > Why not make this #ifndef CONFIG_64BIT?  You know that all 64 bit
> > > > systems have writeq implemented correctly; you suspect 32 bit systems
> > > > don't.
> > > > 
> > > > James
> > > > 
> > > > James, This issue was observed on PPC64 system. So what you have suggested will not solve this issue.
> > > > If we are sure that writeq() is atomic across all architecture, we can use it safely. As we have seen issue on ppc64, we are not confident to use
> > > > "writeq" call.
> > > 
> > > So have you told the powerpc people that they have a broken writeq?
> > 
> > I'm just in the process of finding them now on IRC so I can demand an
> > explanation: this is a really serious API problem because writeq is
> > supposed to be atomic on 64 bit.
> > 
> > > And why do you obfuscate your report by talking about i386 when it's
> > > really about powerpc64?
> > 
> > James

I checked the assembly for my complied output and it ends up with
a single std (store doubleword aka 64 bits) instruction with offset
192 decimal (0xc0) from the base register obtained from the structure.

An aligned doubleword store is atomic on 64 bit powerpc.

So I would really like more details if you are blaming 64 bit
powerpc of a non-atomic store.

That said, the patch will affect the code by adding barriers.
Specifically, while powerpc has a sync before doing the store as part
of writeq, wrapping in a spinlock adds a sync before releasing the lock
whenever a writeq (or writex x=b,w,d,q) was issued inside the lock.

(sync orders all reads and all writes to both memory and devices from
that cpu).

But looking further at the code, I see such things as:

drivers/scsi/mpt2sas/mpt2sas_base.c  line 2944

        mpt2sas_base_put_smid_default(ioc, smid);
        init_completion(&ioc->base_cmds.done);
        timeleft = wait_for_completion_timeout(&ioc->base_cmds.done,

where mpt2sas_base_put_smid_default is a routine that has a call to
_base_writeq.  This will initiate io to the adapter, then initialize
the completion, then hope that the timeout is long enough to let the io
complete and be marked done but short enough to not be a problem when
the timeout occurs because we initialized the compeltion after the irq
came in.

The code then looks at a status flag, but there is no indication how
the access to that field is serialized between the interrupt handler
and the submission routine.  It may mostly work due to barriers in
the primitives but I don't see any statement of rules.

Also, while I see a few wmb before writel in _base_interrupt, I don't
see any rmb, which I would expect between establishing a element is
valid and reading other fields in that element.

So I'd really like to hear more about what your symptoms were and how
you determined writeq on 64 bit powerpc was not atomic.

milton

^ permalink raw reply

* [PATCH] PPC_47x SMP fix
From: Kerstin Jonsson @ 2011-05-18  9:51 UTC (permalink / raw)
  To: benh, linuxppc-dev, linux-kernel
  Cc: Kerstin Jonsson, Michael Neuling, Darren Hart, Paul Mackerras,
	Will Schmidt

 commit c56e58537d504706954a06570b4034c04e5b7500 breaks SMP support in PPC_47x chip.
 secondary_ti must be set to current thread info before callin kick_cpu or else
 start_secondary_47x will jump into void when trying to return to c-code.
 In the current setup secondary_ti is initialized before the CPU idle task is started
 and only the boot core will start. I am not sure this is the correct solution, but it
 makes SMP possible in my chip.
 Note! The HOTPLUG support probably need some fixing to, There is no trampoline code
 available in head_44x.S - start_secondary_resume?

Signed-off-by: Kerstin Jonsson <kerstin.jonsson@ericsson.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Darren Hart <dvhltc@us.ibm.com>
Cc: Will Schmidt <will_schmidt@vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index cbdbb14..f2dcab7 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -410,8 +410,6 @@ int __cpuinit __cpu_up(unsigned int cpu)
 {
 	int rc, c;

-	secondary_ti = current_set[cpu];
-
 	if (smp_ops == NULL ||
 	    (smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)))
 		return -EINVAL;
@@ -421,6 +419,8 @@ int __cpuinit __cpu_up(unsigned int cpu)
 	if (rc)
 		return rc;

+	secondary_ti = current_set[cpu];
+
 	/* Make sure callin-map entry is 0 (can be leftover a CPU
 	 * hotplug
 	 */
-- 
1.7.2.3

^ permalink raw reply related

* [PATCH] PPC_47x SMP fix
From: Kerstin Jonsson @ 2011-05-18  9:57 UTC (permalink / raw)
  To: benh, linuxppc-dev, linux-kernel
  Cc: Kerstin Jonsson, Michael Neuling, Paul Mackerras, Will Schmidt

 commit c56e58537d504706954a06570b4034c04e5b7500 breaks SMP support in PPC_47x chip.
 secondary_ti must be set to current thread info before callin kick_cpu or else
 start_secondary_47x will jump into void when trying to return to c-code.
 In the current setup secondary_ti is initialized before the CPU idle task is started
 and only the boot core will start. I am not sure this is the correct solution, but it
 makes SMP possible in my chip.
 Note! The HOTPLUG support probably need some fixing to, There is no trampoline code
 available in head_44x.S - start_secondary_resume?

Signed-off-by: Kerstin Jonsson <kerstin.jonsson@ericsson.com>
Cc: Paul Mackerras <paulus@samba.org>
Cc: Michael Neuling <mikey@neuling.org>
Cc: Will Schmidt <will_schmidt@vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index cbdbb14..f2dcab7 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -410,8 +410,6 @@ int __cpuinit __cpu_up(unsigned int cpu)
 {
 	int rc, c;

-	secondary_ti = current_set[cpu];
-
 	if (smp_ops == NULL ||
 	    (smp_ops->cpu_bootable && !smp_ops->cpu_bootable(cpu)))
 		return -EINVAL;
@@ -421,6 +419,8 @@ int __cpuinit __cpu_up(unsigned int cpu)
 	if (rc)
 		return rc;

+	secondary_ti = current_set[cpu];
+
 	/* Make sure callin-map entry is 0 (can be leftover a CPU
 	 * hotplug
 	 */
-- 
1.7.2.3

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox