LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-14  7:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-mips, linux-sh, Frederic Weisbecker, Heiko Carstens,
	Oleg Nesterov, David Howells, Paul Mackerras, Eric Paris,
	H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
	x86, James Morris, Linus Torvalds, Ingo Molnar, kees.cook,
	Serge E. Hallyn, Steven Rostedt, Tejun Heo, Thomas Gleixner,
	linux-arm-kernel, Michal Marek, Michal Simek, Will Drewry,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1305300443.2466.77.camel@twins>

* Peter Zijlstra <peterz@infradead.org> wrote:

> On Fri, 2011-05-13 at 16:57 +0200, Ingo Molnar wrote:
> > this is a security mechanism
> 
> Who says? [...]

Kernel developers/maintainers of the affected code.

We have security hooks all around the kernel, which can deny/accept execution 
at various key points, but we do not have 'execute arbitrary user-space defined 
(safe) scripts' callbacks in general.

But yes, if a particular callback point is defined widely enough to allow much 
bigger intervention into the flow of execution, then more is possible as well.

> [...] and why would you want to unify two separate concepts only to them 
> limit it to security that just doesn't make sense.

I don't limit them to security - the callbacks themselves are either for 
passive observation or, at most, for security accept/deny callbacks.

It's decided by the subsystem maintainers what kind of user-space control power 
(or observation power) they want to allow, not me.

I would just like to not stop the facility itself at the 'observe only' level, 
like you suggest.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-14  7:30 UTC (permalink / raw)
  To: Eric Paris
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
	x86, James Morris, Linus Torvalds, Ingo Molnar, kees.cook,
	Serge E. Hallyn, Peter Zijlstra, Steven Rostedt, Tejun Heo,
	Thomas Gleixner, linux-arm-kernel, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1305299455.2076.26.camel@localhost.localdomain>

* Eric Paris <eparis@redhat.com> wrote:

> [dropping microblaze and roland]
> 
> lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
> > * James Morris <jmorris@namei.org> wrote:
> 
> > It is a simple and sensible security feature, agreed? It allows most code to 
> > run well and link to countless libraries - but no access to other files is 
> > allowed.
> 
> It's simple enough and sounds reasonable, but you can read all the discussion 
> about AppArmour why many people don't really think it's the best. [...]

I have to say most of the participants of the AppArmour flamefests were dead 
wrong, and it wasnt the AppArmour folks who were wrong ...

The straight ASCII VFS namespace *makes sense*, and yes, the raw physical 
objects space that SELinux uses makes sense as well.

And no, i do not subscribe to the dogma that it is not possible to secure the 
ASCII VFS namespace: it evidently is possible, if you know and handle the 
ambiguitites. It is also obviously true that the ASCII VFS namespaces we use 
every day are a *lot* more intuitive than the labeled physical objects space 
...

What all the security flamewars missed is the simple fact that being intuitive 
matters a *lot* not just to not annoy users, but also to broaden the effective 
number of security-conscious developers ...

> > Unfortunately this audit callback cannot be used for my purposes, because 
> > the event is single-purpose for auditd and because it allows no feedback 
> > (no deny/accept discretion for the security policy).
> > 
> > But if had this simple event there:
> > 
> > 	err = event_vfs_getname(result);
> 
> Wow it sounds so easy.  Now lets keep extending your train of thought
> until we can actually provide the security provided by SELinux.  What do
> we end up with?  We end up with an event hook right next to every LSM
> hook.  You know, the LSM hooks were placed where they are for a reason.
> Because those were the locations inside the kernel where you actually
> have information about the task doing an operation and the objects
> (files, sockets, directories, other tasks, etc) they are doing an
> operation on.
> 
> Honestly all you are talking about it remaking the LSM with 2 sets of
> hooks instead if 1.  Why? [...]

Not at all. I am taking about using *one* set of events, to keep the intrusion 
at the lowest possible level.

LSM could make use of them as well.

Obviously for pragmatic reasons that might not be feasible initially.

> [...]  It seems much easier that if you want the language of the filter 
> engine you would just make a new LSM that uses the filter engine for it's 
> policy language rather than the language created by SELinux or SMACK or name 
> your LSM implementation.

Correct, that is what i suggested.

Note that performance is a primary concern, so if certain filters are very 
popular then in practice this would come with support for a couple of 'built 
in' (pre-optimized) filters that the kernel can accelerate directly, so that we 
do not incure the cost of executing the filter preds for really common-sense 
security policies that almost everyone is using.

I.e. in the end we'd *roughly* end up with the same performance and security as 
we are today (i mean, SELinux and the other LSMs did a nice job of collecting 
the things that apps should be careful about), but the crutial difference isnt 
just the advantages i menioned, but the fact that the *development model* of 
security modules would be a *lot* more extensible.

So security efforts could move to a whole different level: they could move into 
key apps and they could integrate with the general mind-set of developers.

At least Will as an application framework developer cares, so that hope is 
justified i think.

> >  - unprivileged:  application-definable, allowing the embedding of security 
> >                   policy in *apps* as well, not just the system
> > 
> >  - flexible:      can be added/removed runtime unprivileged, and cheaply so
> > 
> >  - transparent:   does not impact executing code that meets the policy
> > 
> >  - nestable:      it is inherited by child tasks and is fundamentally stackable,
> >                   multiple policies will have the combined effect and they
> >                   are transparent to each other. So if a child task within a
> >                   sandbox adds *more* checks then those add to the already
> >                   existing set of checks. We only narrow permissions, never
> >                   extend them.
> > 
> >  - generic:       allowing observation and (safe) control of security relevant
> >                   parameters not just at the system call boundary but at other
> >                   relevant places of kernel execution as well: which 
> >                   points/callbacks could also be used for other types of event 
> >                   extraction such as perf. It could even be shared with audit ...
> 
> I'm not arguing that any of these things are bad things.  What you describe 
> is a new LSM that uses a discretionary access control model but with the 
> granularity and flexibility that has traditionally only existed in the 
> mandatory access control security modules previously implemented in the 
> kernel.
> 
> I won't argue that's a bad idea, there's no reason in my mind that a process 
> shouldn't be allowed to control it's own access decisions in a more flexible 
> way than rwx bits.  Then again, I certainly don't see a reason that this 
> syscall hardening patch should be held up while a whole new concept in 
> computer security is contemplated...

Note, i'm not actually asking for the moon, a pony and more.

I fully submit that we are yet far away from being able to do a full LSM via 
this mechanism.

What i'm asking for is that because the syscall point steps taken by Will look 
very promising so it would be nice to do *that* in a slightly more flexible 
scheme that does not condemn it to be limited to the syscall boundary and such 
...

Also, to answer you, do you say that by my argument someone should have stood 
up and said 'no' to the LSM mess that was introduced a couple of years ago and 
which caused so many problems:

 - kernel inefficiencies and user-space overhead

 - stalled security efforts

 - infighting

 - friction, fragmentation, overmodularization

 - non-stackability

 - security annoyances on the Linux desktop

 - probably *less* Linux security

and should have asked them to do something better designed instead?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Arnd Bergmann @ 2011-05-13 19:35 UTC (permalink / raw)
  To: linux-arm-kernel
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, jmorris, Ingo Molnar, Ingo Molnar,
	Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
	Steven Rostedt, Tejun Heo, Thomas Gleixner, kees.cook,
	Roland McGrath, Michal Marek, Michal Simek, Will Drewry, agl,
	linux-kernel, Eric Paris, Paul Mundt, Martin Schwidefsky,
	linux390, Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <1305169376-2363-1-git-send-email-wad@chromium.org>

On Thursday 12 May 2011, Will Drewry wrote:
> This change adds a new seccomp mode based on the work by
> agl@chromium.org in [1]. This new mode, "filter mode", provides a hash
> table of seccomp_filter objects.  When in the new mode (2), all system
> calls are checked against the filters - first by system call number,
> then by a filter string.  If an entry exists for a given system call and
> all filter predicates evaluate to true, then the task may proceed.
> Otherwise, the task is killed (as per seccomp_mode == 1).

I've got a question about this: Do you expect the typical usage to disallow
ioctl()? Given that ioctl alone is responsible for a huge number of exploits
in various drivers, while certain ioctls are immensely useful (FIONREAD,
FIOASYNC, ...), do you expect to extend the mechanism to filter specific
ioctl commands in the future?

	Arnd

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Will Drewry @ 2011-05-14 20:57 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	kees.cook, Serge E. Hallyn, Peter Zijlstra, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, linux-arm-kernel, Michal Marek,
	Michal Simek, linuxppc-dev, linux-kernel, Ralf Baechle,
	Paul Mundt, Martin Schwidefsky, linux390, Andrew Morton, agl,
	David S. Miller
In-Reply-To: <20110514073015.GB9307@elte.hu>

On Sat, May 14, 2011 at 2:30 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Eric Paris <eparis@redhat.com> wrote:
>
>> [dropping microblaze and roland]
>>
>> lOn Fri, 2011-05-13 at 14:10 +0200, Ingo Molnar wrote:
>> > * James Morris <jmorris@namei.org> wrote:
>>
>> > It is a simple and sensible security feature, agreed? It allows most c=
ode to
>> > run well and link to countless libraries - but no access to other file=
s is
>> > allowed.
>>
>> It's simple enough and sounds reasonable, but you can read all the discu=
ssion
>> about AppArmour why many people don't really think it's the best. [...]
>
> I have to say most of the participants of the AppArmour flamefests were d=
ead
> wrong, and it wasnt the AppArmour folks who were wrong ...
>
> The straight ASCII VFS namespace *makes sense*, and yes, the raw physical
> objects space that SELinux uses makes sense as well.
>
> And no, i do not subscribe to the dogma that it is not possible to secure=
 the
> ASCII VFS namespace: it evidently is possible, if you know and handle the
> ambiguitites. It is also obviously true that the ASCII VFS namespaces we =
use
> every day are a *lot* more intuitive than the labeled physical objects sp=
ace
> ...
>
> What all the security flamewars missed is the simple fact that being intu=
itive
> matters a *lot* not just to not annoy users, but also to broaden the effe=
ctive
> number of security-conscious developers ...
>
>> > Unfortunately this audit callback cannot be used for my purposes, beca=
use
>> > the event is single-purpose for auditd and because it allows no feedba=
ck
>> > (no deny/accept discretion for the security policy).
>> >
>> > But if had this simple event there:
>> >
>> > =A0 =A0 err =3D event_vfs_getname(result);
>>
>> Wow it sounds so easy. =A0Now lets keep extending your train of thought
>> until we can actually provide the security provided by SELinux. =A0What =
do
>> we end up with? =A0We end up with an event hook right next to every LSM
>> hook. =A0You know, the LSM hooks were placed where they are for a reason=
.
>> Because those were the locations inside the kernel where you actually
>> have information about the task doing an operation and the objects
>> (files, sockets, directories, other tasks, etc) they are doing an
>> operation on.
>>
>> Honestly all you are talking about it remaking the LSM with 2 sets of
>> hooks instead if 1. =A0Why? [...]
>
> Not at all. I am taking about using *one* set of events, to keep the intr=
usion
> at the lowest possible level.
>
> LSM could make use of them as well.
>
> Obviously for pragmatic reasons that might not be feasible initially.
>
>> [...] =A0It seems much easier that if you want the language of the filte=
r
>> engine you would just make a new LSM that uses the filter engine for it'=
s
>> policy language rather than the language created by SELinux or SMACK or =
name
>> your LSM implementation.
>
> Correct, that is what i suggested.
>
> Note that performance is a primary concern, so if certain filters are ver=
y
> popular then in practice this would come with support for a couple of 'bu=
ilt
> in' (pre-optimized) filters that the kernel can accelerate directly, so t=
hat we
> do not incure the cost of executing the filter preds for really common-se=
nse
> security policies that almost everyone is using.
>
> I.e. in the end we'd *roughly* end up with the same performance and secur=
ity as
> we are today (i mean, SELinux and the other LSMs did a nice job of collec=
ting
> the things that apps should be careful about), but the crutial difference=
 isnt
> just the advantages i menioned, but the fact that the *development model*=
 of
> security modules would be a *lot* more extensible.
>
> So security efforts could move to a whole different level: they could mov=
e into
> key apps and they could integrate with the general mind-set of developers=
.
>
> At least Will as an application framework developer cares, so that hope i=
s
> justified i think.
>
>> > =A0- unprivileged: =A0application-definable, allowing the embedding of=
 security
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 policy in *apps* as well, not just=
 the system
>> >
>> > =A0- flexible: =A0 =A0 =A0can be added/removed runtime unprivileged, a=
nd cheaply so
>> >
>> > =A0- transparent: =A0 does not impact executing code that meets the po=
licy
>> >
>> > =A0- nestable: =A0 =A0 =A0it is inherited by child tasks and is fundam=
entally stackable,
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 multiple policies will have the co=
mbined effect and they
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 are transparent to each other. So =
if a child task within a
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 sandbox adds *more* checks then th=
ose add to the already
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 existing set of checks. We only na=
rrow permissions, never
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 extend them.
>> >
>> > =A0- generic: =A0 =A0 =A0 allowing observation and (safe) control of s=
ecurity relevant
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 parameters not just at the system =
call boundary but at other
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 relevant places of kernel executio=
n as well: which
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 points/callbacks could also be use=
d for other types of event
>> > =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 extraction such as perf. It could =
even be shared with audit ...
>>
>> I'm not arguing that any of these things are bad things. =A0What you des=
cribe
>> is a new LSM that uses a discretionary access control model but with the
>> granularity and flexibility that has traditionally only existed in the
>> mandatory access control security modules previously implemented in the
>> kernel.
>>
>> I won't argue that's a bad idea, there's no reason in my mind that a pro=
cess
>> shouldn't be allowed to control it's own access decisions in a more flex=
ible
>> way than rwx bits. =A0Then again, I certainly don't see a reason that th=
is
>> syscall hardening patch should be held up while a whole new concept in
>> computer security is contemplated...
>
> Note, i'm not actually asking for the moon, a pony and more.
>
> I fully submit that we are yet far away from being able to do a full LSM =
via
> this mechanism.
>
> What i'm asking for is that because the syscall point steps taken by Will=
 look
> very promising so it would be nice to do *that* in a slightly more flexib=
le
> scheme that does not condemn it to be limited to the syscall boundary and=
 such
> ...

What do you suggest here?

>From my brief exploration of the ftrace/perf (and seccomp) code, I
don't see a clean way of integrating over the existing interfaces to
the ftrace framework (e.g., the global perf event pump seems to be a
mismatch), but I may be missing something obvious.   In my view,
implementing this nestled between the seccomp/ftrace world provides a
stepping stone forward without being too restrictive.  No matter how
we change security events in the future, system calls will always be
the first line of attack surface reduction.  It could just mean that,
in the long term, accessing the "security event filtering" framework
is done through another new interface with seccomp providing only a
targeted syscall filtering featureset that may one day be deprecated
(if that day ever comes).

If there's a clear way to cleanly expand this interface that I'm
missing, I'd love to know - thanks!
will

> Also, to answer you, do you say that by my argument someone should have s=
tood
> up and said 'no' to the LSM mess that was introduced a couple of years ag=
o and
> which caused so many problems:
>
> =A0- kernel inefficiencies and user-space overhead
>
> =A0- stalled security efforts
>
> =A0- infighting
>
> =A0- friction, fragmentation, overmodularization
>
> =A0- non-stackability
>
> =A0- security annoyances on the Linux desktop
>
> =A0- probably *less* Linux security
>
> and should have asked them to do something better designed instead?
>
> Thanks,
>
> =A0 =A0 =A0 =A0Ingo
>

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Will Drewry @ 2011-05-14 20:58 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, jmorris, Ingo Molnar, Roland McGrath,
	Ingo Molnar, Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
	Steven Rostedt, Tejun Heo, Thomas Gleixner, kees.cook,
	linux-arm-kernel, Michal Marek, Michal Simek, agl, linux-kernel,
	Eric Paris, Paul Mundt, Martin Schwidefsky, linux390,
	Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <201105132135.34741.arnd@arndb.de>

On Fri, May 13, 2011 at 2:35 PM, Arnd Bergmann <arnd@arndb.de> wrote:
> On Thursday 12 May 2011, Will Drewry wrote:
>> This change adds a new seccomp mode based on the work by
>> agl@chromium.org in [1]. This new mode, "filter mode", provides a hash
>> table of seccomp_filter objects. =A0When in the new mode (2), all system
>> calls are checked against the filters - first by system call number,
>> then by a filter string. =A0If an entry exists for a given system call a=
nd
>> all filter predicates evaluate to true, then the task may proceed.
>> Otherwise, the task is killed (as per seccomp_mode =3D=3D 1).
>
> I've got a question about this: Do you expect the typical usage to disall=
ow
> ioctl()? Given that ioctl alone is responsible for a huge number of explo=
its
> in various drivers, while certain ioctls are immensely useful (FIONREAD,
> FIOASYNC, ...), do you expect to extend the mechanism to filter specific
> ioctl commands in the future?

In many cases, I do expect ioctl's to be dropped, but it's totally up
to whoever is setting the filters.

As is, it can already help out: [even though an LSM, if available,
would be appropriate to define a fine-grained policy]

ioctl() is hooked by the ftrace syscalls infrastructure (via SYSCALL_DEFINE=
3):

  SYSCALL_DEFINE3(ioctl, unsigned int, fd, unsigned int, cmd, unsigned
long, arg)

This means you can do:
  sprintf(filter, "cmd =3D=3D %u || cmd =3D=3D %u", FIOASYNC, FIONREAD);
  prctl(PR_SET_SECCOMP_FILTER, __NR_ioctl, filter);
  ...
  prctl(PR_SET_SECCOMP, 2, 0);
and then you'll be able to call ioctl on any fd with any argument but
limited to only the FIOASYNC and FIONREAD commands.

Depending on integration, it could even be limited to ioctl commands
that are appropriate to a known fd if the fd is opened prior to
entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
with a filter of "1" then narrowed through a later addition of
something like "(fd =3D=3D %u && (cmd =3D=3D %u || cmd =3D=3D %u))" or some=
thing
along those lines.

Does that make sense?

In general, this interface won't need specific extensions for most
system call oriented filtering events.  ftrace events may be expanded
(to include more system calls), but that's behind the scenes.  Only
arguments subject to time-of-check-time-of-use attacks (data living in
userspace passed in by pointer) are not safe to use via this
interface.  In theory, that limitation could also be lifted in the
implementation without changing the ABI.

Thanks!
will

^ permalink raw reply

* linux kernel reference to non-existent CONFIG_FSL_85XX_CACHE_SRAM
From: Robert P. J. Day @ 2011-05-14 23:04 UTC (permalink / raw)
  To: linuxppc-dev


  the current kernel source tree contains a Makefile reference to the
above Kconfig variable that doesn't appear to be defined anywhere.

rday

-- 

========================================================================
Robert P. J. Day                                 Ottawa, Ontario, CANADA
                        http://crashcourse.ca

Twitter:                                       http://twitter.com/rpjday
LinkedIn:                               http://ca.linkedin.com/in/rpjday
========================================================================

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Arnd Bergmann @ 2011-05-15  6:42 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, jmorris, Ingo Molnar, Roland McGrath,
	Ingo Molnar, Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
	Steven Rostedt, Tejun Heo, Thomas Gleixner, kees.cook,
	linux-arm-kernel, Michal Marek, Michal Simek, agl, linux-kernel,
	Eric Paris, Paul Mundt, Martin Schwidefsky, linux390,
	Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <BANLkTinukLesDoXzXKdtdRwgHtJkthXK0A@mail.gmail.com>

On Saturday 14 May 2011, Will Drewry wrote:
> Depending on integration, it could even be limited to ioctl commands
> that are appropriate to a known fd if the fd is opened prior to
> entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
> with a filter of "1" then narrowed through a later addition of
> something like "(fd == %u && (cmd == %u || cmd == %u))" or something
> along those lines.
> 
> Does that make sense?

Thanks for the explanation. This sounds like it's already doing all
we need.

	Arnd

^ permalink raw reply

* book  to  learn  ppc  assembly  and  architecture
From: s shaiju @ 2011-05-15 16:52 UTC (permalink / raw)
  To: linuxppc-dev

[-- Attachment #1: Type: text/plain, Size: 113 bytes --]


Hi,
 
what  is  the  best  book  to  learn  assembly  and  architecture .
 
 
 
regards,
sha 		 	   		  

[-- Attachment #2: Type: text/html, Size: 407 bytes --]

^ permalink raw reply

* Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode
From: Alexander Graf @ 2011-05-15 21:58 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: Linuxppc-dev, kvm-ppc, KVM list
In-Reply-To: <20110511104456.GK2837@brick.ozlabs.ibm.com>


On 11.05.2011, at 12:44, Paul Mackerras wrote:

> This adds support for KVM running on 64-bit Book 3S processors,
> specifically POWER7, in hypervisor mode.  Using hypervisor mode means
> that the guest can use the processor's supervisor mode.  That means
> that the guest can execute privileged instructions and access =
privileged
> registers itself without trapping to the host.  This gives excellent
> performance, but does mean that KVM cannot emulate a processor
> architecture other than the one that the hardware implements.
>=20
> This code assumes that the guest is running paravirtualized using the
> PAPR (Power Architecture Platform Requirements) interface, which is =
the
> interface that IBM's PowerVM hypervisor uses.  That means that =
existing
> Linux distributions that run on IBM pSeries machines will also run
> under KVM without modification.  In order to communicate the PAPR
> hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
> to include/linux/kvm.h.
>=20
> Currently the choice between book3s_hv support and book3s_pr support
> (i.e. the existing code, which runs the guest in user mode) has to be
> made at kernel configuration time, so a given kernel binary can only
> do one or the other.
>=20
> This new book3s_hv code doesn't support MMIO emulation at present.
> Since we are running paravirtualized guests, this isn't a serious
> restriction.
>=20
> With the guest running in supervisor mode, most exceptions go straight
> to the guest.  We will never get data or instruction storage or =
segment
> interrupts, alignment interrupts, decrementer interrupts, program
> interrupts, single-step interrupts, etc., coming to the hypervisor =
from
> the guest.  Therefore this introduces a new KVMTEST_NONHV macro for =
the
> exception entry path so that we don't have to do the KVM test on entry
> to those exception handlers.
>=20
> We do however get hypervisor decrementer, hypervisor data storage,
> hypervisor instruction storage, and hypervisor emulation assist
> interrupts, so we have to handle those.
>=20
> In hypervisor mode, real-mode accesses can access all of RAM, not just
> a limited amount.  Therefore we put all the guest state in the =
vcpu.arch
> and use the shadow_vcpu in the PACA only for temporary scratch space.
> We allocate the vcpu with kzalloc rather than vzalloc, and we don't =
use
> anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
>=20
> The POWER7 processor has a restriction that all threads in a core have
> to be in the same partition.  MMU-on kernel code counts as a partition
> (partition 0), so we have to do a partition switch on every entry to =
and
> exit from the guest.  At present we require the host and guest to run
> in single-thread mode because of this hardware restriction.
>=20
> This code allocates a hashed page table for the guest and initializes
> it with HPTEs for the guest's Virtual Real Memory Area (VRMA).  We
> require that the guest memory is allocated using 16MB huge pages, in
> order to simplify the low-level memory management.  This also means =
that
> we can get away without tracking paging activity in the host for now,
> since huge pages can't be paged or swapped.
>=20
> Signed-off-by: Paul Mackerras <paulus@samba.org>
> ---
> arch/powerpc/include/asm/exception-64s.h  |   27 +-
> arch/powerpc/include/asm/kvm_asm.h        |    4 +
> arch/powerpc/include/asm/kvm_book3s.h     |  148 ++++++-
> arch/powerpc/include/asm/kvm_book3s_asm.h |    4 +-
> arch/powerpc/include/asm/kvm_booke.h      |    4 +
> arch/powerpc/include/asm/kvm_host.h       |   60 +++-
> arch/powerpc/include/asm/kvm_ppc.h        |    6 +
> arch/powerpc/include/asm/mmu-hash64.h     |   10 +-
> arch/powerpc/include/asm/paca.h           |   10 +
> arch/powerpc/include/asm/reg.h            |    4 +
> arch/powerpc/kernel/asm-offsets.c         |   95 ++++-
> arch/powerpc/kernel/exceptions-64s.S      |   60 ++--
> arch/powerpc/kvm/Kconfig                  |   40 ++-
> arch/powerpc/kvm/Makefile                 |   16 +-
> arch/powerpc/kvm/book3s_64_mmu_hv.c       |  258 +++++++++++
> arch/powerpc/kvm/book3s_hv.c              |  413 ++++++++++++++++++
> arch/powerpc/kvm/book3s_hv_interrupts.S   |  326 ++++++++++++++
> arch/powerpc/kvm/book3s_hv_rmhandlers.S   |  663 =
+++++++++++++++++++++++++++++
> arch/powerpc/kvm/powerpc.c                |   23 +-
> arch/powerpc/kvm/trace.h                  |    2 +-
> include/linux/kvm.h                       |    6 +
> 21 files changed, 2094 insertions(+), 85 deletions(-)
> create mode 100644 arch/powerpc/kvm/book3s_64_mmu_hv.c
> create mode 100644 arch/powerpc/kvm/book3s_hv.c
> create mode 100644 arch/powerpc/kvm/book3s_hv_interrupts.S
> create mode 100644 arch/powerpc/kvm/book3s_hv_rmhandlers.S
>=20
> diff --git a/arch/powerpc/include/asm/exception-64s.h =
b/arch/powerpc/include/asm/exception-64s.h
> index 2a770d8..d32e1ef 100644
> --- a/arch/powerpc/include/asm/exception-64s.h
> +++ b/arch/powerpc/include/asm/exception-64s.h
> @@ -65,14 +65,14 @@
> 	GET_PACA(r13);							=
\
> 	std	r9,area+EX_R9(r13);	/* save r9 - r12 */		=
\
> 	std	r10,area+EX_R10(r13);					=
\
> -	mfcr	r9;							=
\
> -	extra(vec);							=
\
> -	std	r11,area+EX_R11(r13);					=
\
> -	std	r12,area+EX_R12(r13);					=
\
> 	BEGIN_FTR_SECTION_NESTED(66);					=
\
> 	mfspr	r10,SPRN_CFAR;						=
\
> 	std	r10,area+EX_CFAR(r13);					=
\
> 	END_FTR_SECTION_NESTED(CPU_FTR_CFAR, CPU_FTR_CFAR, 66);		=
\
> +	mfcr	r9;							=
\
> +	extra(vec);							=
\
> +	std	r11,area+EX_R11(r13);					=
\
> +	std	r12,area+EX_R12(r13);					=
\


I don't really understand that change :).

> 	GET_SCRATCH0(r10);						=
\
> 	std	r10,area+EX_R13(r13)
> #define EXCEPTION_PROLOG_1(area, extra, vec)				=
\
> @@ -134,6 +134,17 @@ do_kvm_##n:						=
		\
> #define KVM_HANDLER_SKIP(area, h, n)
> #endif
>=20
> +#ifdef CONFIG_KVM_BOOK3S_NONHV

I really liked how you called the .c file _pr - why call it NONHV now?

> +#define KVMTEST_NONHV(n)		__KVMTEST(n)
> +#define KVM_HANDLER_NONHV(area, h, n)	__KVM_HANDLER(area, h, =
n)
> +#define KVM_HANDLER_NONHV_SKIP(area, h, n)	__KVM_HANDLER_SKIP(area, =
h, n)
> +
> +#else
> +#define KVMTEST_NONHV(n)
> +#define KVM_HANDLER_NONHV(area, h, n)
> +#define KVM_HANDLER_NONHV_SKIP(area, h, n)
> +#endif
> +
> #define NOTEST(n)
>=20
> /*
> @@ -210,7 +221,7 @@ label##_pSeries:					=
\
> 	HMT_MEDIUM;					\
> 	SET_SCRATCH0(r13);		/* save r13 */		\
> 	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, label##_common,	\
> -				 EXC_STD, KVMTEST, vec)
> +				 EXC_STD, KVMTEST_NONHV, vec)
>=20
> #define STD_EXCEPTION_HV(loc, vec, label)		\
> 	. =3D loc;					\
> @@ -227,8 +238,8 @@ label##_hv:						=
\
> 	beq	masked_##h##interrupt
> #define _SOFTEN_TEST(h)	__SOFTEN_TEST(h)
>=20
> -#define SOFTEN_TEST(vec)						=
\
> -	KVMTEST(vec);							=
\
> +#define SOFTEN_TEST_NONHV(vec)						=
\
> +	KVMTEST_NONHV(vec);						=
\
> 	_SOFTEN_TEST(EXC_STD)
>=20
> #define SOFTEN_TEST_HV(vec)						=
\
> @@ -248,7 +259,7 @@ label##_hv:						=
\
> 	.globl label##_pSeries;						=
\
> label##_pSeries:							=
\
> 	_MASKABLE_EXCEPTION_PSERIES(vec, label,				=
\
> -				    EXC_STD, SOFTEN_TEST)
> +				    EXC_STD, SOFTEN_TEST_NONHV)
>=20
> #define MASKABLE_EXCEPTION_HV(loc, vec, label)				=
\
> 	. =3D loc;							=
\
> diff --git a/arch/powerpc/include/asm/kvm_asm.h =
b/arch/powerpc/include/asm/kvm_asm.h
> index 0951b17..7b1f0e0 100644
> --- a/arch/powerpc/include/asm/kvm_asm.h
> +++ b/arch/powerpc/include/asm/kvm_asm.h
> @@ -64,8 +64,12 @@
> #define BOOK3S_INTERRUPT_PROGRAM	0x700
> #define BOOK3S_INTERRUPT_FP_UNAVAIL	0x800
> #define BOOK3S_INTERRUPT_DECREMENTER	0x900
> +#define BOOK3S_INTERRUPT_HV_DECREMENTER	0x980
> #define BOOK3S_INTERRUPT_SYSCALL	0xc00
> #define BOOK3S_INTERRUPT_TRACE		0xd00
> +#define BOOK3S_INTERRUPT_H_DATA_STORAGE	0xe00
> +#define BOOK3S_INTERRUPT_H_INST_STORAGE	0xe20
> +#define BOOK3S_INTERRUPT_H_EMUL_ASSIST	0xe40
> #define BOOK3S_INTERRUPT_PERFMON	0xf00
> #define BOOK3S_INTERRUPT_ALTIVEC	0xf20
> #define BOOK3S_INTERRUPT_VSX		0xf40
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h =
b/arch/powerpc/include/asm/kvm_book3s.h
> index 12829bb..5b76073 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -117,6 +117,7 @@ extern void kvmppc_set_msr(struct kvm_vcpu *vcpu, =
u64 new_msr);
> extern void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr);
> extern void kvmppc_mmu_book3s_64_init(struct kvm_vcpu *vcpu);
> extern void kvmppc_mmu_book3s_32_init(struct kvm_vcpu *vcpu);
> +extern void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu);
> extern int kvmppc_mmu_map_page(struct kvm_vcpu *vcpu, struct =
kvmppc_pte *pte);
> extern int kvmppc_mmu_map_segment(struct kvm_vcpu *vcpu, ulong eaddr);
> extern void kvmppc_mmu_flush_segments(struct kvm_vcpu *vcpu);
> @@ -128,10 +129,12 @@ extern int kvmppc_mmu_hpte_init(struct kvm_vcpu =
*vcpu);
> extern void kvmppc_mmu_invalidate_pte(struct kvm_vcpu *vcpu, struct =
hpte_cache *pte);
> extern int kvmppc_mmu_hpte_sysinit(void);
> extern void kvmppc_mmu_hpte_sysexit(void);
> +extern int kvmppc_mmu_hv_init(void);
>=20
> extern int kvmppc_ld(struct kvm_vcpu *vcpu, ulong *eaddr, int size, =
void *ptr, bool data);
> extern int kvmppc_st(struct kvm_vcpu *vcpu, ulong *eaddr, int size, =
void *ptr, bool data);
> extern void kvmppc_book3s_queue_irqprio(struct kvm_vcpu *vcpu, =
unsigned int vec);
> +extern void kvmppc_inject_interrupt(struct kvm_vcpu *vcpu, int vec, =
u64 flags);
> extern void kvmppc_set_bat(struct kvm_vcpu *vcpu, struct kvmppc_bat =
*bat,
> 			   bool upper, u32 val);
> extern void kvmppc_giveup_ext(struct kvm_vcpu *vcpu, ulong msr);
> @@ -152,6 +155,19 @@ static inline struct kvmppc_vcpu_book3s =
*to_book3s(struct kvm_vcpu *vcpu)
> 	return container_of(vcpu, struct kvmppc_vcpu_book3s, vcpu);
> }
>=20
> +extern void kvm_return_point(void);
> +
> +/* Also add subarch specific defines */
> +
> +#ifdef CONFIG_KVM_BOOK3S_32_HANDLER
> +#include <asm/kvm_book3s_32.h>
> +#endif
> +#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
> +#include <asm/kvm_book3s_64.h>
> +#endif
> +
> +#ifdef CONFIG_KVM_BOOK3S_NONHV
> +
> #define vcpu_guest_state(vcpu)	((vcpu)->arch.shared)
>=20
> static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu =
*vcpu)
> @@ -168,16 +184,6 @@ static inline void =
kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
> 		vcpu_guest_state(vcpu)->int_pending =3D 0;
> }
>=20
> -static inline ulong dsisr(void)
> -{
> -	ulong r;
> -	asm ( "mfdsisr %0 " : "=3Dr" (r) );
> -	return r;
> -}
> -
> -extern void kvm_return_point(void);
> -static inline struct kvmppc_book3s_shadow_vcpu *to_svcpu(struct =
kvm_vcpu *vcpu);
> -
> static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, =
ulong val)
> {
> 	if ( num < 14 ) {
> @@ -265,6 +271,11 @@ static inline ulong kvmppc_get_fault_dar(struct =
kvm_vcpu *vcpu)
> 	return to_svcpu(vcpu)->fault_dar;
> }
>=20
> +static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.shared->msr;
> +}
> +
> static inline bool kvmppc_critical_section(struct kvm_vcpu *vcpu)
> {
> 	ulong crit_raw =3D vcpu_guest_state(vcpu)->critical;
> @@ -284,6 +295,115 @@ static inline bool =
kvmppc_critical_section(struct kvm_vcpu *vcpu)
>=20
> 	return crit;
> }
> +#else /* CONFIG_KVM_BOOK3S_NONHV */
> +
> +#define vcpu_guest_state(vcpu)	(&(vcpu)->arch)
> +
> +static inline unsigned long kvmppc_interrupt_offset(struct kvm_vcpu =
*vcpu)
> +{
> +	return 0;
> +}
> +
> +static inline void kvmppc_update_int_pending(struct kvm_vcpu *vcpu,
> +			unsigned long pending_now, unsigned long =
old_pending)
> +{
> +	/* Recalculate LPCR:MER based on the presence of
> +	 * a pending external interrupt
> +	 */
> +	if (test_bit(BOOK3S_IRQPRIO_EXTERNAL, =
&vcpu->arch.pending_exceptions) ||
> +	    test_bit(BOOK3S_IRQPRIO_EXTERNAL_LEVEL, =
&vcpu->arch.pending_exceptions))

Wasn't pending_now pending_exceptions?

> +		vcpu->arch.lpcr |=3D LPCR_MER;
> +	else
> +		vcpu->arch.lpcr &=3D ~((u64)LPCR_MER);
> +}
> +
> +static inline void kvmppc_set_gpr(struct kvm_vcpu *vcpu, int num, =
ulong val)
> +{
> +	vcpu->arch.gpr[num] =3D val;
> +}
> +
> +static inline ulong kvmppc_get_gpr(struct kvm_vcpu *vcpu, int num)
> +{
> +	return vcpu->arch.gpr[num];
> +}
> +
> +static inline void kvmppc_set_cr(struct kvm_vcpu *vcpu, u32 val)
> +{
> +	vcpu->arch.cr =3D val;
> +}
> +
> +static inline u32 kvmppc_get_cr(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.cr;
> +}
> +
> +static inline void kvmppc_set_xer(struct kvm_vcpu *vcpu, u32 val)
> +{
> +	vcpu->arch.xer =3D val;
> +}
> +
> +static inline u32 kvmppc_get_xer(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.xer;
> +}
> +
> +static inline void kvmppc_set_ctr(struct kvm_vcpu *vcpu, ulong val)
> +{
> +	vcpu->arch.ctr =3D val;
> +}
> +
> +static inline ulong kvmppc_get_ctr(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.ctr;
> +}
> +
> +static inline void kvmppc_set_lr(struct kvm_vcpu *vcpu, ulong val)
> +{
> +	vcpu->arch.lr =3D val;
> +}
> +
> +static inline ulong kvmppc_get_lr(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.lr;
> +}
> +
> +static inline void kvmppc_set_pc(struct kvm_vcpu *vcpu, ulong val)
> +{
> +	vcpu->arch.pc =3D val;
> +}
> +
> +static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.pc;
> +}
> +
> +static inline u32 kvmppc_get_last_inst(struct kvm_vcpu *vcpu)
> +{
> +	ulong pc =3D kvmppc_get_pc(vcpu);
> +
> +	/* Load the instruction manually if it failed to do so in the
> +	 * exit path */
> +	if (vcpu->arch.last_inst =3D=3D KVM_INST_FETCH_FAILED)
> +		kvmppc_ld(vcpu, &pc, sizeof(u32), &vcpu->arch.last_inst, =
false);
> +
> +	return vcpu->arch.last_inst;
> +}
> +
> +static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.fault_dar;
> +}
> +
> +static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.msr;
> +}
> +
> +static inline bool kvmppc_critical_section(struct kvm_vcpu *vcpu)
> +{
> +	return false;
> +}
> +#endif
>=20
> /* Magic register values loaded into r3 and r4 before the 'sc' =
assembly
>  * instruction for the OSI hypercalls */
> @@ -292,12 +412,4 @@ static inline bool kvmppc_critical_section(struct =
kvm_vcpu *vcpu)
>=20
> #define INS_DCBZ			0x7c0007ec
>=20
> -/* Also add subarch specific defines */
> -
> -#ifdef CONFIG_PPC_BOOK3S_32
> -#include <asm/kvm_book3s_32.h>
> -#else
> -#include <asm/kvm_book3s_64.h>
> -#endif
> -
> #endif /* __ASM_KVM_BOOK3S_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_book3s_asm.h =
b/arch/powerpc/include/asm/kvm_book3s_asm.h
> index d5a8a38..d7279f5 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_asm.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_asm.h
> @@ -61,6 +61,7 @@ kvmppc_resume_\intno:
> #else  /*__ASSEMBLY__ */
>=20
> struct kvmppc_book3s_shadow_vcpu {
> +#ifdef CONFIG_KVM_BOOK3S_NONHV

Do you really want any shadow_vcpu in the paca at all?

> 	ulong gpr[14];
> 	u32 cr;
> 	u32 xer;
> @@ -72,6 +73,7 @@ struct kvmppc_book3s_shadow_vcpu {
> 	ulong pc;
> 	ulong shadow_srr1;
> 	ulong fault_dar;
> +#endif
>=20
> 	ulong host_r1;
> 	ulong host_r2;
> @@ -84,7 +86,7 @@ struct kvmppc_book3s_shadow_vcpu {
> #ifdef CONFIG_PPC_BOOK3S_32
> 	u32     sr[16];			/* Guest SRs */
> #endif
> -#ifdef CONFIG_PPC_BOOK3S_64
> +#if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_KVM_BOOK3S_NONHV)
> 	u8 slb_max;			/* highest used guest slb entry =
*/
> 	struct  {
> 		u64     esid;
> diff --git a/arch/powerpc/include/asm/kvm_booke.h =
b/arch/powerpc/include/asm/kvm_booke.h
> index 9c9ba3d..a90e091 100644
> --- a/arch/powerpc/include/asm/kvm_booke.h
> +++ b/arch/powerpc/include/asm/kvm_booke.h
> @@ -93,4 +93,8 @@ static inline ulong kvmppc_get_fault_dar(struct =
kvm_vcpu *vcpu)
> 	return vcpu->arch.fault_dear;
> }
>=20
> +static inline ulong kvmppc_get_msr(struct kvm_vcpu *vcpu)
> +{
> +	return vcpu->arch.shared->msr;
> +}
> #endif /* __ASM_KVM_BOOKE_H__ */
> diff --git a/arch/powerpc/include/asm/kvm_host.h =
b/arch/powerpc/include/asm/kvm_host.h
> index 3ebe51b..ec62365 100644
> --- a/arch/powerpc/include/asm/kvm_host.h
> +++ b/arch/powerpc/include/asm/kvm_host.h
> @@ -33,7 +33,9 @@
> /* memory slots that does not exposed to userspace */
> #define KVM_PRIVATE_MEM_SLOTS 4
>=20
> +#ifdef CONFIG_KVM_MMIO
> #define KVM_COALESCED_MMIO_PAGE_OFFSET 1
> +#endif
>=20
> /* We don't currently support large pages. */
> #define KVM_HPAGE_GFN_SHIFT(x)	0
> @@ -133,7 +135,24 @@ struct kvmppc_exit_timing {
> 	};
> };
>=20
> +struct kvmppc_pginfo {
> +	unsigned long pfn;
> +	atomic_t refcnt;
> +};
> +
> struct kvm_arch {
> +	unsigned long hpt_virt;
> +	unsigned long ram_npages;
> +	unsigned long ram_psize;
> +	unsigned long ram_porder;
> +	struct kvmppc_pginfo *ram_pginfo;
> +	unsigned int lpid;
> +	unsigned int host_lpid;
> +	unsigned long host_lpcr;
> +	unsigned long sdr1;
> +	unsigned long host_sdr1;
> +	int tlbie_lock;
> +	unsigned short last_vcpu[NR_CPUS];

This should all be #ifdef CONFIG_KVM_BOOK3S_HV

> };
>=20
> struct kvmppc_pte {
> @@ -190,7 +209,7 @@ struct kvm_vcpu_arch {
> 	ulong rmcall;
> 	ulong host_paca_phys;
> 	struct kvmppc_slb slb[64];
> -	int slb_max;		/* # valid entries in slb[] */
> +	int slb_max;		/* 1 + index of last valid entry in =
slb[] */
> 	int slb_nr;		/* total number of entries in SLB */
> 	struct kvmppc_mmu mmu;
> #endif
> @@ -204,9 +223,10 @@ struct kvm_vcpu_arch {
> 	vector128 vr[32];
> 	vector128 vscr;
> #endif
> +	u32 vrsave;
>=20
> #ifdef CONFIG_VSX
> -	u64 vsr[32];
> +	u64 vsr[64];
> #endif
>=20
> #ifdef CONFIG_PPC_BOOK3S
> @@ -214,29 +234,45 @@ struct kvm_vcpu_arch {
> 	u32 qpr[32];
> #endif
>=20
> -#ifdef CONFIG_BOOKE
> -	ulong pc;
> 	ulong ctr;
> 	ulong lr;
>=20
> 	ulong xer;
> 	u32 cr;
> -#endif
> +
> +	ulong pc;
> +	ulong msr;
>=20
> #ifdef CONFIG_PPC_BOOK3S
> 	ulong shadow_msr;
> 	ulong hflags;
> 	ulong guest_owned_ext;
> +	ulong purr;
> +	ulong spurr;
> +	ulong lpcr;
> +	ulong dscr;
> +	ulong amr;
> +	ulong uamor;
> +	u32 ctrl;
> +	u32 dsisr;
> +	ulong dabr;
> #endif
> 	u32 mmucr;
> +	ulong sprg0;
> +	ulong sprg1;
> +	ulong sprg2;
> +	ulong sprg3;
> 	ulong sprg4;
> 	ulong sprg5;
> 	ulong sprg6;
> 	ulong sprg7;
> +	ulong srr0;
> +	ulong srr1;
> 	ulong csrr0;
> 	ulong csrr1;
> 	ulong dsrr0;
> 	ulong dsrr1;
> +	ulong dear;
> 	ulong esr;
> 	u32 dec;
> 	u32 decar;
> @@ -259,6 +295,9 @@ struct kvm_vcpu_arch {
> 	u32 dbcr1;
> 	u32 dbsr;
>=20
> +	u64 mmcr[3];
> +	u32 pmc[6];
> +
> #ifdef CONFIG_KVM_EXIT_TIMING
> 	struct kvmppc_exit_timing timing_exit;
> 	struct kvmppc_exit_timing timing_last_enter;
> @@ -272,8 +311,12 @@ struct kvm_vcpu_arch {
> 	struct dentry *debugfs_exit_timing;
> #endif
>=20
> +#ifdef CONFIG_PPC_BOOK3S
> +	ulong fault_dar;
> +	u32 fault_dsisr;
> +#endif
> +
> #ifdef CONFIG_BOOKE
> -	u32 last_inst;
> 	ulong fault_dear;
> 	ulong fault_esr;
> 	ulong queued_dear;
> @@ -288,13 +331,18 @@ struct kvm_vcpu_arch {
> 	u8 dcr_is_write;
> 	u8 osi_needed;
> 	u8 osi_enabled;
> +	u8 hcall_needed;
>=20
> 	u32 cpr0_cfgaddr; /* holds the last set cpr0_cfgaddr */
>=20
> 	struct hrtimer dec_timer;
> 	struct tasklet_struct tasklet;
> 	u64 dec_jiffies;
> +	u64 dec_expires;
> 	unsigned long pending_exceptions;
> +	u16 last_cpu;
> +	u32 last_inst;
> +	int trap;
> 	struct kvm_vcpu_arch_shared *shared;
> 	unsigned long magic_page_pa; /* phys addr to map the magic page =
to */
> 	unsigned long magic_page_ea; /* effect. addr to map the magic =
page to */
> diff --git a/arch/powerpc/include/asm/kvm_ppc.h =
b/arch/powerpc/include/asm/kvm_ppc.h
> index 3210911..cd9ad96 100644
> --- a/arch/powerpc/include/asm/kvm_ppc.h
> +++ b/arch/powerpc/include/asm/kvm_ppc.h
> @@ -110,6 +110,12 @@ extern void kvmppc_booke_exit(void);
> extern void kvmppc_core_destroy_mmu(struct kvm_vcpu *vcpu);
> extern int kvmppc_kvm_pv(struct kvm_vcpu *vcpu);
>=20
> +extern long kvmppc_alloc_hpt(struct kvm *kvm);
> +extern void kvmppc_free_hpt(struct kvm *kvm);
> +extern long kvmppc_prepare_vrma(struct kvm *kvm,
> +				struct kvm_userspace_memory_region =
*mem);
> +extern void kvmppc_map_vrma(struct kvm *kvm,
> +			    struct kvm_userspace_memory_region *mem);
> extern int kvmppc_core_init_vm(struct kvm *kvm);
> extern void kvmppc_core_destroy_vm(struct kvm *kvm);
> extern int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> diff --git a/arch/powerpc/include/asm/mmu-hash64.h =
b/arch/powerpc/include/asm/mmu-hash64.h
> index ae7b3ef..0bb3fc1 100644
> --- a/arch/powerpc/include/asm/mmu-hash64.h
> +++ b/arch/powerpc/include/asm/mmu-hash64.h
> @@ -90,13 +90,19 @@ extern char initial_stab[];
>=20
> #define HPTE_R_PP0		ASM_CONST(0x8000000000000000)
> #define HPTE_R_TS		ASM_CONST(0x4000000000000000)
> +#define HPTE_R_KEY_HI		ASM_CONST(0x3000000000000000)
> #define HPTE_R_RPN_SHIFT	12
> -#define HPTE_R_RPN		ASM_CONST(0x3ffffffffffff000)
> -#define HPTE_R_FLAGS		ASM_CONST(0x00000000000003ff)
> +#define HPTE_R_RPN		ASM_CONST(0x0ffffffffffff000)
> #define HPTE_R_PP		ASM_CONST(0x0000000000000003)
> #define HPTE_R_N		ASM_CONST(0x0000000000000004)
> +#define HPTE_R_G		ASM_CONST(0x0000000000000008)
> +#define HPTE_R_M		ASM_CONST(0x0000000000000010)
> +#define HPTE_R_I		ASM_CONST(0x0000000000000020)
> +#define HPTE_R_W		ASM_CONST(0x0000000000000040)
> +#define HPTE_R_WIMG		ASM_CONST(0x0000000000000078)
> #define HPTE_R_C		ASM_CONST(0x0000000000000080)
> #define HPTE_R_R		ASM_CONST(0x0000000000000100)
> +#define HPTE_R_KEY_LO		ASM_CONST(0x0000000000000e00)
>=20
> #define HPTE_V_1TB_SEG		ASM_CONST(0x4000000000000000)
> #define HPTE_V_VRMA_MASK	ASM_CONST(0x4001ffffff000000)
> diff --git a/arch/powerpc/include/asm/paca.h =
b/arch/powerpc/include/asm/paca.h
> index 7412676..8dba5f6 100644
> --- a/arch/powerpc/include/asm/paca.h
> +++ b/arch/powerpc/include/asm/paca.h
> @@ -149,6 +149,16 @@ struct paca_struct {
> #ifdef CONFIG_KVM_BOOK3S_HANDLER
> 	/* We use this to store guest state in */
> 	struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	struct kvm_vcpu *kvm_vcpu;
> +	u64 dabr;
> +	u64 host_mmcr[3];
> +	u32 host_pmc[6];
> +	u64 host_purr;
> +	u64 host_spurr;
> +	u64 host_dscr;
> +	u64 dec_expires;

Hrm. I'd say either push those into shadow_vcpu for HV mode or get rid =
of the shadow_vcpu reference. I'd probably prefer the former.

> +#endif
> #endif
> };
>=20
> diff --git a/arch/powerpc/include/asm/reg.h =
b/arch/powerpc/include/asm/reg.h
> index c07b7be..0036977 100644
> --- a/arch/powerpc/include/asm/reg.h
> +++ b/arch/powerpc/include/asm/reg.h
> @@ -189,6 +189,10 @@
> #define SPRN_CTR	0x009	/* Count Register */
> #define SPRN_DSCR	0x11
> #define SPRN_CFAR	0x1c	/* Come =46rom Address Register */
> +#define SPRN_AMR	0x1d	/* Authority Mask Register */
> +#define SPRN_UAMOR	0x9d	/* User Authority Mask Override Register =
*/
> +#define SPRN_AMOR	0x15d	/* Authority Mask Override Register */
> +#define SPRN_RWMR	885
> #define SPRN_CTRLF	0x088
> #define SPRN_CTRLT	0x098
> #define   CTRL_CT	0xc0000000	/* current thread */
> diff --git a/arch/powerpc/kernel/asm-offsets.c =
b/arch/powerpc/kernel/asm-offsets.c
> index 6887661..49e97fd 100644
> --- a/arch/powerpc/kernel/asm-offsets.c
> +++ b/arch/powerpc/kernel/asm-offsets.c
> @@ -187,6 +187,7 @@ int main(void)
> 	DEFINE(LPPACASRR1, offsetof(struct lppaca, saved_srr1));
> 	DEFINE(LPPACAANYINT, offsetof(struct lppaca, =
int_dword.any_int));
> 	DEFINE(LPPACADECRINT, offsetof(struct lppaca, =
int_dword.fields.decr_int));
> +	DEFINE(LPPACA_PMCINUSE, offsetof(struct lppaca, =
pmcregs_in_use));
> 	DEFINE(LPPACA_DTLIDX, offsetof(struct lppaca, dtl_idx));
> 	DEFINE(PACA_DTL_RIDX, offsetof(struct paca_struct, dtl_ridx));
> #endif /* CONFIG_PPC_STD_MMU_64 */
> @@ -200,9 +201,17 @@ int main(void)
> 	DEFINE(PACA_TRAP_SAVE, offsetof(struct paca_struct, trap_save));
> #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
> 	DEFINE(PACA_KVM_SVCPU, offsetof(struct paca_struct, =
shadow_vcpu));
> -	DEFINE(SVCPU_SLB, offsetof(struct kvmppc_book3s_shadow_vcpu, =
slb));
> -	DEFINE(SVCPU_SLB_MAX, offsetof(struct kvmppc_book3s_shadow_vcpu, =
slb_max));
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	DEFINE(PACA_KVM_VCPU, offsetof(struct paca_struct, kvm_vcpu));
> +	DEFINE(PACA_HOST_MMCR, offsetof(struct paca_struct, host_mmcr));
> +	DEFINE(PACA_HOST_PMC, offsetof(struct paca_struct, host_pmc));
> +	DEFINE(PACA_HOST_PURR, offsetof(struct paca_struct, host_purr));
> +	DEFINE(PACA_HOST_SPURR, offsetof(struct paca_struct, =
host_spurr));
> +	DEFINE(PACA_HOST_DSCR, offsetof(struct paca_struct, host_dscr));
> +	DEFINE(PACA_DABR, offsetof(struct paca_struct, dabr));
> +	DEFINE(PACA_KVM_DECEXP, offsetof(struct paca_struct, =
dec_expires));
> #endif
> +#endif /* CONFIG_KVM_BOOK3S_64_HANDLER */
> #endif /* CONFIG_PPC64 */
>=20
> 	/* RTAS */
> @@ -396,6 +405,28 @@ int main(void)
> 	DEFINE(VCPU_HOST_STACK, offsetof(struct kvm_vcpu, =
arch.host_stack));
> 	DEFINE(VCPU_HOST_PID, offsetof(struct kvm_vcpu, arch.host_pid));
> 	DEFINE(VCPU_GPRS, offsetof(struct kvm_vcpu, arch.gpr));
> +	DEFINE(VCPU_FPRS, offsetof(struct kvm_vcpu, arch.fpr));
> +	DEFINE(VCPU_FPSCR, offsetof(struct kvm_vcpu, arch.fpscr));
> +#ifdef CONFIG_ALTIVEC
> +	DEFINE(VCPU_VRS, offsetof(struct kvm_vcpu, arch.vr));
> +	DEFINE(VCPU_VSCR, offsetof(struct kvm_vcpu, arch.vscr));
> +#endif
> +#ifdef CONFIG_VSX
> +	DEFINE(VCPU_VSRS, offsetof(struct kvm_vcpu, arch.vsr));
> +#endif
> +	DEFINE(VCPU_VRSAVE, offsetof(struct kvm_vcpu, arch.vrsave));
> +	DEFINE(VCPU_XER, offsetof(struct kvm_vcpu, arch.xer));
> +	DEFINE(VCPU_CTR, offsetof(struct kvm_vcpu, arch.ctr));
> +	DEFINE(VCPU_LR, offsetof(struct kvm_vcpu, arch.lr));
> +	DEFINE(VCPU_CR, offsetof(struct kvm_vcpu, arch.cr));
> +	DEFINE(VCPU_PC, offsetof(struct kvm_vcpu, arch.pc));
> +	DEFINE(VCPU_MSR, offsetof(struct kvm_vcpu, arch.msr));
> +	DEFINE(VCPU_SRR0, offsetof(struct kvm_vcpu, arch.srr0));
> +	DEFINE(VCPU_SRR1, offsetof(struct kvm_vcpu, arch.srr1));
> +	DEFINE(VCPU_SPRG0, offsetof(struct kvm_vcpu, arch.sprg0));
> +	DEFINE(VCPU_SPRG1, offsetof(struct kvm_vcpu, arch.sprg1));
> +	DEFINE(VCPU_SPRG2, offsetof(struct kvm_vcpu, arch.sprg2));
> +	DEFINE(VCPU_SPRG3, offsetof(struct kvm_vcpu, arch.sprg3));
> 	DEFINE(VCPU_SPRG4, offsetof(struct kvm_vcpu, arch.sprg4));
> 	DEFINE(VCPU_SPRG5, offsetof(struct kvm_vcpu, arch.sprg5));
> 	DEFINE(VCPU_SPRG6, offsetof(struct kvm_vcpu, arch.sprg6));
> @@ -406,16 +437,65 @@ int main(void)
>=20
> 	/* book3s */
> #ifdef CONFIG_PPC_BOOK3S
> +	DEFINE(KVM_LPID, offsetof(struct kvm, arch.lpid));
> +	DEFINE(KVM_SDR1, offsetof(struct kvm, arch.sdr1));
> +	DEFINE(KVM_HOST_LPID, offsetof(struct kvm, arch.host_lpid));
> +	DEFINE(KVM_HOST_LPCR, offsetof(struct kvm, arch.host_lpcr));
> +	DEFINE(KVM_HOST_SDR1, offsetof(struct kvm, arch.host_sdr1));
> +	DEFINE(KVM_TLBIE_LOCK, offsetof(struct kvm, arch.tlbie_lock));
> +	DEFINE(KVM_ONLINE_CPUS, offsetof(struct kvm, =
online_vcpus.counter));
> +	DEFINE(KVM_LAST_VCPU, offsetof(struct kvm, arch.last_vcpu));
> +	DEFINE(VCPU_KVM, offsetof(struct kvm_vcpu, kvm));
> +	DEFINE(VCPU_VCPUID, offsetof(struct kvm_vcpu, vcpu_id));
> 	DEFINE(VCPU_HOST_RETIP, offsetof(struct kvm_vcpu, =
arch.host_retip));
> 	DEFINE(VCPU_HOST_MSR, offsetof(struct kvm_vcpu, arch.host_msr));
> 	DEFINE(VCPU_SHADOW_MSR, offsetof(struct kvm_vcpu, =
arch.shadow_msr));
> +	DEFINE(VCPU_PURR, offsetof(struct kvm_vcpu, arch.purr));
> +	DEFINE(VCPU_SPURR, offsetof(struct kvm_vcpu, arch.spurr));
> +	DEFINE(VCPU_DSCR, offsetof(struct kvm_vcpu, arch.dscr));
> +	DEFINE(VCPU_AMR, offsetof(struct kvm_vcpu, arch.amr));
> +	DEFINE(VCPU_UAMOR, offsetof(struct kvm_vcpu, arch.uamor));
> +	DEFINE(VCPU_CTRL, offsetof(struct kvm_vcpu, arch.ctrl));
> +	DEFINE(VCPU_DABR, offsetof(struct kvm_vcpu, arch.dabr));
> 	DEFINE(VCPU_TRAMPOLINE_LOWMEM, offsetof(struct kvm_vcpu, =
arch.trampoline_lowmem));
> 	DEFINE(VCPU_TRAMPOLINE_ENTER, offsetof(struct kvm_vcpu, =
arch.trampoline_enter));
> 	DEFINE(VCPU_HIGHMEM_HANDLER, offsetof(struct kvm_vcpu, =
arch.highmem_handler));
> 	DEFINE(VCPU_RMCALL, offsetof(struct kvm_vcpu, arch.rmcall));
> 	DEFINE(VCPU_HFLAGS, offsetof(struct kvm_vcpu, arch.hflags));
> +	DEFINE(VCPU_DSISR, offsetof(struct kvm_vcpu, arch.dsisr));
> +	DEFINE(VCPU_DAR, offsetof(struct kvm_vcpu, arch.dear));
> +	DEFINE(VCPU_DEC, offsetof(struct kvm_vcpu, arch.dec));
> +	DEFINE(VCPU_DEC_EXPIRES, offsetof(struct kvm_vcpu, =
arch.dec_expires));
> +	DEFINE(VCPU_LPCR, offsetof(struct kvm_vcpu, arch.lpcr));
> +	DEFINE(VCPU_MMCR, offsetof(struct kvm_vcpu, arch.mmcr));
> +	DEFINE(VCPU_PMC, offsetof(struct kvm_vcpu, arch.pmc));
> +	DEFINE(VCPU_SLB, offsetof(struct kvm_vcpu, arch.slb));
> +	DEFINE(VCPU_SLB_MAX, offsetof(struct kvm_vcpu, arch.slb_max));
> +	DEFINE(VCPU_LAST_CPU, offsetof(struct kvm_vcpu, arch.last_cpu));
> +	DEFINE(VCPU_FAULT_DSISR, offsetof(struct kvm_vcpu, =
arch.fault_dsisr));
> +	DEFINE(VCPU_FAULT_DAR, offsetof(struct kvm_vcpu, =
arch.fault_dar));
> +	DEFINE(VCPU_LAST_INST, offsetof(struct kvm_vcpu, =
arch.last_inst));
> +	DEFINE(VCPU_TRAP, offsetof(struct kvm_vcpu, arch.trap));
> 	DEFINE(VCPU_SVCPU, offsetof(struct kvmppc_vcpu_book3s, =
shadow_vcpu) -
> 			   offsetof(struct kvmppc_vcpu_book3s, vcpu));
> +	DEFINE(SVCPU_HOST_R1, offsetof(struct kvmppc_book3s_shadow_vcpu, =
host_r1));
> +	DEFINE(SVCPU_HOST_R2, offsetof(struct kvmppc_book3s_shadow_vcpu, =
host_r2));
> +	DEFINE(SVCPU_SCRATCH0, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> +					scratch0));
> +	DEFINE(SVCPU_SCRATCH1, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> +					scratch1));
> +	DEFINE(SVCPU_IN_GUEST, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> +					in_guest));
> +	DEFINE(VCPU_SLB_E, offsetof(struct kvmppc_slb, orige));
> +	DEFINE(VCPU_SLB_V, offsetof(struct kvmppc_slb, origv));
> +	DEFINE(VCPU_SLB_SIZE, sizeof(struct kvmppc_slb));
> +#ifdef CONFIG_KVM_BOOK3S_NONHV
> +#ifdef CONFIG_PPC64
> +	DEFINE(SVCPU_SLB, offsetof(struct kvmppc_book3s_shadow_vcpu, =
slb));
> +	DEFINE(SVCPU_SLB_MAX, offsetof(struct kvmppc_book3s_shadow_vcpu, =
slb_max));
> +#endif
> +	DEFINE(SVCPU_VMHANDLER, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> +					 vmhandler));
> 	DEFINE(SVCPU_CR, offsetof(struct kvmppc_book3s_shadow_vcpu, =
cr));
> 	DEFINE(SVCPU_XER, offsetof(struct kvmppc_book3s_shadow_vcpu, =
xer));
> 	DEFINE(SVCPU_CTR, offsetof(struct kvmppc_book3s_shadow_vcpu, =
ctr));
> @@ -435,16 +515,6 @@ int main(void)
> 	DEFINE(SVCPU_R11, offsetof(struct kvmppc_book3s_shadow_vcpu, =
gpr[11]));
> 	DEFINE(SVCPU_R12, offsetof(struct kvmppc_book3s_shadow_vcpu, =
gpr[12]));
> 	DEFINE(SVCPU_R13, offsetof(struct kvmppc_book3s_shadow_vcpu, =
gpr[13]));
> -	DEFINE(SVCPU_HOST_R1, offsetof(struct kvmppc_book3s_shadow_vcpu, =
host_r1));
> -	DEFINE(SVCPU_HOST_R2, offsetof(struct kvmppc_book3s_shadow_vcpu, =
host_r2));
> -	DEFINE(SVCPU_VMHANDLER, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> -					 vmhandler));
> -	DEFINE(SVCPU_SCRATCH0, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> -					scratch0));
> -	DEFINE(SVCPU_SCRATCH1, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> -					scratch1));
> -	DEFINE(SVCPU_IN_GUEST, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> -					in_guest));
> 	DEFINE(SVCPU_FAULT_DSISR, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> 					   fault_dsisr));
> 	DEFINE(SVCPU_FAULT_DAR, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> @@ -453,6 +523,7 @@ int main(void)
> 					 last_inst));
> 	DEFINE(SVCPU_SHADOW_SRR1, offsetof(struct =
kvmppc_book3s_shadow_vcpu,
> 					   shadow_srr1));
> +#endif /* CONFIG_KVM_BOOK3S_NONHV */
> #ifdef CONFIG_PPC_BOOK3S_32
> 	DEFINE(SVCPU_SR, offsetof(struct kvmppc_book3s_shadow_vcpu, =
sr));
> #endif
> diff --git a/arch/powerpc/kernel/exceptions-64s.S =
b/arch/powerpc/kernel/exceptions-64s.S
> index cbdf374..80c6456 100644
> --- a/arch/powerpc/kernel/exceptions-64s.S
> +++ b/arch/powerpc/kernel/exceptions-64s.S
> @@ -87,14 +87,14 @@ data_access_not_stab:
> END_MMU_FTR_SECTION_IFCLR(MMU_FTR_SLB)
> #endif
> 	EXCEPTION_PROLOG_PSERIES(PACA_EXGEN, data_access_common, =
EXC_STD,
> -				 KVMTEST, 0x300)
> +				 KVMTEST_NONHV, 0x300)
>=20
> 	. =3D 0x380
> 	.globl data_access_slb_pSeries
> data_access_slb_pSeries:
> 	HMT_MEDIUM
> 	SET_SCRATCH0(r13)
> -	EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST, 0x380)
> +	EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST_NONHV, 0x380)
> 	std	r3,PACA_EXSLB+EX_R3(r13)
> 	mfspr	r3,SPRN_DAR
> #ifdef __DISABLED__
> @@ -125,7 +125,7 @@ data_access_slb_pSeries:
> instruction_access_slb_pSeries:
> 	HMT_MEDIUM
> 	SET_SCRATCH0(r13)
> -	EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST, 0x480)
> +	EXCEPTION_PROLOG_1(PACA_EXSLB, KVMTEST_NONHV, 0x480)
> 	std	r3,PACA_EXSLB+EX_R3(r13)
> 	mfspr	r3,SPRN_SRR0		/* SRR0 is faulting address */
> #ifdef __DISABLED__
> @@ -153,32 +153,32 @@ instruction_access_slb_pSeries:
> hardware_interrupt_pSeries:
> hardware_interrupt_hv:
> 	BEGIN_FTR_SECTION
> -		_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt,
> -					    EXC_STD, SOFTEN_TEST)
> -		KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x500)
> -	FTR_SECTION_ELSE
> 		_MASKABLE_EXCEPTION_PSERIES(0x502, hardware_interrupt,
> 					    EXC_HV, SOFTEN_TEST_HV)
> 		KVM_HANDLER(PACA_EXGEN, EXC_HV, 0x502)
> -	ALT_FTR_SECTION_END_IFCLR(CPU_FTR_HVMODE_206)
> +	FTR_SECTION_ELSE
> +		_MASKABLE_EXCEPTION_PSERIES(0x500, hardware_interrupt,
> +					    EXC_STD, SOFTEN_TEST_NONHV)
> +		KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x500)
> +	ALT_FTR_SECTION_END_IFSET(CPU_FTR_HVMODE_206)
>=20
> 	STD_EXCEPTION_PSERIES(0x600, 0x600, alignment)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x600)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0x600)
>=20
> 	STD_EXCEPTION_PSERIES(0x700, 0x700, program_check)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x700)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0x700)
>=20
> 	STD_EXCEPTION_PSERIES(0x800, 0x800, fp_unavailable)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x800)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0x800)
>=20
> 	MASKABLE_EXCEPTION_PSERIES(0x900, 0x900, decrementer)
> 	MASKABLE_EXCEPTION_HV(0x980, 0x982, decrementer)
>=20
> 	STD_EXCEPTION_PSERIES(0xa00, 0xa00, trap_0a)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0xa00)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0xa00)
>=20
> 	STD_EXCEPTION_PSERIES(0xb00, 0xb00, trap_0b)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0xb00)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0xb00)
>=20
> 	. =3D 0xc00
> 	.globl	system_call_pSeries
> @@ -219,7 +219,7 @@ END_FTR_SECTION_IFSET(CPU_FTR_REAL_LE)
> 	b	.
>=20
> 	STD_EXCEPTION_PSERIES(0xd00, 0xd00, single_step)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0xd00)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0xd00)
>=20
> 	/* At 0xe??? we have a bunch of hypervisor exceptions, we branch
> 	 * out of line to handle them
> @@ -254,23 +254,23 @@ vsx_unavailable_pSeries_1:
>=20
> #ifdef CONFIG_CBE_RAS
> 	STD_EXCEPTION_HV(0x1200, 0x1202, cbe_system_error)
> -	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1202)
> +	KVM_HANDLER_NONHV_SKIP(PACA_EXGEN, EXC_HV, 0x1202)
> #endif /* CONFIG_CBE_RAS */
>=20
> 	STD_EXCEPTION_PSERIES(0x1300, 0x1300, instruction_breakpoint)
> -	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_STD, 0x1300)
> +	KVM_HANDLER_NONHV_SKIP(PACA_EXGEN, EXC_STD, 0x1300)
>=20
> #ifdef CONFIG_CBE_RAS
> 	STD_EXCEPTION_HV(0x1600, 0x1602, cbe_maintenance)
> -	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1602)
> +	KVM_HANDLER_NONHV_SKIP(PACA_EXGEN, EXC_HV, 0x1602)
> #endif /* CONFIG_CBE_RAS */
>=20
> 	STD_EXCEPTION_PSERIES(0x1700, 0x1700, altivec_assist)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x1700)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0x1700)
>=20
> #ifdef CONFIG_CBE_RAS
> 	STD_EXCEPTION_HV(0x1800, 0x1802, cbe_thermal)
> -	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_HV, 0x1802)
> +	KVM_HANDLER_NONHV_SKIP(PACA_EXGEN, EXC_HV, 0x1802)
> #endif /* CONFIG_CBE_RAS */
>=20
> 	. =3D 0x3000
> @@ -297,7 +297,7 @@ data_access_check_stab:
> 	mfspr	r9,SPRN_DSISR
> 	srdi	r10,r10,60
> 	rlwimi	r10,r9,16,0x20
> -#ifdef CONFIG_KVM_BOOK3S_64_HANDLER
> +#ifdef CONFIG_KVM_BOOK3S_NONHV
> 	lbz	r9,PACA_KVM_SVCPU+SVCPU_IN_GUEST(r13)
> 	rlwimi	r10,r9,8,0x300
> #endif
> @@ -316,11 +316,11 @@ do_stab_bolted_pSeries:
> 	EXCEPTION_PROLOG_PSERIES_1(.do_stab_bolted, EXC_STD)
> #endif /* CONFIG_POWER4_ONLY */
>=20
> -	KVM_HANDLER_SKIP(PACA_EXGEN, EXC_STD, 0x300)
> -	KVM_HANDLER_SKIP(PACA_EXSLB, EXC_STD, 0x380)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x400)
> -	KVM_HANDLER(PACA_EXSLB, EXC_STD, 0x480)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0x900)
> +	KVM_HANDLER_NONHV_SKIP(PACA_EXGEN, EXC_STD, 0x300)
> +	KVM_HANDLER_NONHV_SKIP(PACA_EXSLB, EXC_STD, 0x380)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0x400)
> +	KVM_HANDLER_NONHV(PACA_EXSLB, EXC_STD, 0x480)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0x900)
> 	KVM_HANDLER(PACA_EXGEN, EXC_HV, 0x982)
>=20
> 	.align	7
> @@ -337,11 +337,11 @@ do_stab_bolted_pSeries:
>=20
> 	/* moved from 0xf00 */
> 	STD_EXCEPTION_PSERIES(., 0xf00, performance_monitor)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0xf00)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0xf00)
> 	STD_EXCEPTION_PSERIES(., 0xf20, altivec_unavailable)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0xf20)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0xf20)
> 	STD_EXCEPTION_PSERIES(., 0xf40, vsx_unavailable)
> -	KVM_HANDLER(PACA_EXGEN, EXC_STD, 0xf40)
> +	KVM_HANDLER_NONHV(PACA_EXGEN, EXC_STD, 0xf40)
>=20
> /*
>  * An interrupt came in while soft-disabled; clear EE in SRR1,
> @@ -418,7 +418,11 @@ slb_miss_user_pseries:
> /* KVM's trampoline code needs to be close to the interrupt handlers =
*/
>=20
> #ifdef CONFIG_KVM_BOOK3S_64_HANDLER
> +#ifdef CONFIG_KVM_BOOK3S_NONHV
> #include "../kvm/book3s_rmhandlers.S"
> +#else
> +#include "../kvm/book3s_hv_rmhandlers.S"
> +#endif
> #endif
>=20
> 	.align	7
> diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
> index b7baff7..6ff191b 100644
> --- a/arch/powerpc/kvm/Kconfig
> +++ b/arch/powerpc/kvm/Kconfig
> @@ -20,7 +20,6 @@ config KVM
> 	bool
> 	select PREEMPT_NOTIFIERS
> 	select ANON_INODES
> -	select KVM_MMIO
>=20
> config KVM_BOOK3S_HANDLER
> 	bool
> @@ -28,16 +27,22 @@ config KVM_BOOK3S_HANDLER
> config KVM_BOOK3S_32_HANDLER
> 	bool
> 	select KVM_BOOK3S_HANDLER
> +	select KVM_MMIO
>=20
> config KVM_BOOK3S_64_HANDLER
> 	bool
> 	select KVM_BOOK3S_HANDLER
>=20
> +config KVM_BOOK3S_NONHV
> +	bool
> +	select KVM_MMIO
> +
> config KVM_BOOK3S_32
> 	tristate "KVM support for PowerPC book3s_32 processors"
> 	depends on EXPERIMENTAL && PPC_BOOK3S_32 && !SMP && !PTE_64BIT
> 	select KVM
> 	select KVM_BOOK3S_32_HANDLER
> +	select KVM_BOOK3S_NONHV
> 	---help---
> 	  Support running unmodified book3s_32 guest kernels
> 	  in virtual machines on book3s_32 host processors.
> @@ -48,10 +53,38 @@ config KVM_BOOK3S_32
> 	  If unsure, say N.
>=20
> config KVM_BOOK3S_64
> -	tristate "KVM support for PowerPC book3s_64 processors"
> +	bool
> +	select KVM_BOOK3S_64_HANDLER
> +
> +config KVM_BOOK3S_64_HV
> +	bool "KVM support for POWER7 using hypervisor mode in host"
> 	depends on EXPERIMENTAL && PPC_BOOK3S_64
> 	select KVM
> -	select KVM_BOOK3S_64_HANDLER
> +	select KVM_BOOK3S_64
> +	---help---
> +	  Support running unmodified book3s_64 guest kernels in
> +	  virtual machines on POWER7 processors that have hypervisor
> +	  mode available to the host.
> +
> +	  If you say Y here, KVM will use the hardware virtualization
> +	  facilities of POWER7 (and later) processors, meaning that
> +	  guest operating systems will run at full hardware speed
> +	  using supervisor and user modes.  However, this also means
> +	  that KVM is not usable under PowerVM (pHyp), is only usable
> +	  on POWER7 (or later) processors, and can only emulate
> +	  POWER5+, POWER6 and POWER7 processors.
> +
> +	  This module provides access to the hardware capabilities =
through
> +	  a character device node named /dev/kvm.
> +
> +	  If unsure, say N.
> +
> +config KVM_BOOK3S_64_NONHV
> +	tristate "KVM support for PowerPC book3s_64 processors"
> +	depends on EXPERIMENTAL && PPC_BOOK3S_64 && !KVM_BOOK3S_64_HV
> +	select KVM
> +	select KVM_BOOK3S_64
> +	select KVM_BOOK3S_NONHV
> 	---help---
> 	  Support running unmodified book3s_64 and book3s_32 guest =
kernels
> 	  in virtual machines on book3s_64 host processors.
> @@ -65,6 +98,7 @@ config KVM_440
> 	bool "KVM support for PowerPC 440 processors"
> 	depends on EXPERIMENTAL && 44x
> 	select KVM
> +	select KVM_MMIO

e500 should also select MMIO, no?

> 	---help---
> 	  Support running unmodified 440 guest kernels in virtual =
machines on
> 	  440 host processors.
> diff --git a/arch/powerpc/kvm/Makefile b/arch/powerpc/kvm/Makefile
> index bf9854f..37c1a60 100644
> --- a/arch/powerpc/kvm/Makefile
> +++ b/arch/powerpc/kvm/Makefile
> @@ -14,7 +14,7 @@ CFLAGS_emulate.o  :=3D -I.
>=20
> common-objs-y +=3D powerpc.o emulate.o
> obj-$(CONFIG_KVM_EXIT_TIMING) +=3D timing.o
> -obj-$(CONFIG_KVM_BOOK3S_HANDLER) +=3D book3s_exports.o
> +obj-$(CONFIG_KVM_BOOK3S_NONHV) +=3D book3s_exports.o
>=20
> AFLAGS_booke_interrupts.o :=3D -I$(obj)
>=20
> @@ -38,7 +38,7 @@ kvm-e500-objs :=3D \
> 	e500_emulate.o
> kvm-objs-$(CONFIG_KVM_E500) :=3D $(kvm-e500-objs)
>=20
> -kvm-book3s_64-objs :=3D \
> +kvm-book3s_64_nonhv-objs :=3D \
> 	$(common-objs-y) \
> 	fpu.o \
> 	book3s_paired_singles.o \
> @@ -50,7 +50,17 @@ kvm-book3s_64-objs :=3D \
> 	book3s_64_mmu_host.o \
> 	book3s_64_mmu.o \
> 	book3s_32_mmu.o
> -kvm-objs-$(CONFIG_KVM_BOOK3S_64) :=3D $(kvm-book3s_64-objs)
> +kvm-objs-$(CONFIG_KVM_BOOK3S_64_NONHV) :=3D =
$(kvm-book3s_64_nonhv-objs)
> +
> +kvm-book3s_64_hv-objs :=3D \
> +	../../../virt/kvm/kvm_main.o \
> +	powerpc.o \
> +	emulate.o \
> +	book3s.o \
> +	book3s_hv.o \
> +	book3s_hv_interrupts.o \
> +	book3s_64_mmu_hv.o
> +kvm-objs-$(CONFIG_KVM_BOOK3S_64_HV) :=3D $(kvm-book3s_64_hv-objs)
>=20
> kvm-book3s_32-objs :=3D \
> 	$(common-objs-y) \
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c =
b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> new file mode 100644
> index 0000000..52d1be1
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
> @@ -0,0 +1,258 @@
> +/*
> + * This program is free software; you can redistribute it and/or =
modify
> + * it under the terms of the GNU General Public License, version 2, =
as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  =
02110-1301, USA.
> + *
> + * Copyright 2010 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
> + */
> +
> +#include <linux/types.h>
> +#include <linux/string.h>
> +#include <linux/kvm.h>
> +#include <linux/kvm_host.h>
> +#include <linux/highmem.h>
> +#include <linux/gfp.h>
> +#include <linux/slab.h>
> +#include <linux/hugetlb.h>
> +
> +#include <asm/tlbflush.h>
> +#include <asm/kvm_ppc.h>
> +#include <asm/kvm_book3s.h>
> +#include <asm/mmu-hash64.h>
> +#include <asm/hvcall.h>
> +#include <asm/synch.h>
> +#include <asm/ppc-opcode.h>
> +#include <asm/cputable.h>
> +
> +/* For now use fixed-size 16MB page table */
> +#define HPT_ORDER	24
> +#define HPT_NPTEG	(1ul << (HPT_ORDER - 7))	/* 128B per pteg =
*/
> +#define HPT_HASH_MASK	(HPT_NPTEG - 1)
> +
> +/* Pages in the VRMA are 16MB pages */
> +#define VRMA_PAGE_ORDER	24
> +#define VRMA_VSID	0x1ffffffUL	/* 1TB VSID reserved for VRMA */
> +
> +#define NR_LPIDS	1024
> +unsigned long lpid_inuse[BITS_TO_LONGS(NR_LPIDS)];
> +
> +long kvmppc_alloc_hpt(struct kvm *kvm)
> +{
> +	unsigned long hpt;
> +	unsigned long lpid;
> +
> +	hpt =3D =
__get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
> +			       HPT_ORDER - PAGE_SHIFT);

This would end up failing quite often, no?

> +	if (!hpt) {
> +		pr_err("kvm_alloc_hpt: Couldn't alloc HPT\n");
> +		return -ENOMEM;
> +	}
> +	kvm->arch.hpt_virt =3D hpt;
> +
> +	do {
> +		lpid =3D find_first_zero_bit(lpid_inuse, NR_LPIDS);
> +		if (lpid >=3D NR_LPIDS) {
> +			pr_err("kvm_alloc_hpt: No LPIDs free\n");
> +			free_pages(hpt, HPT_ORDER - PAGE_SHIFT);
> +			return -ENOMEM;
> +		}
> +	} while (test_and_set_bit(lpid, lpid_inuse));
> +
> +	kvm->arch.sdr1 =3D __pa(hpt) | (HPT_ORDER - 18);
> +	kvm->arch.lpid =3D lpid;
> +	kvm->arch.host_sdr1 =3D mfspr(SPRN_SDR1);
> +	kvm->arch.host_lpid =3D mfspr(SPRN_LPID);
> +	kvm->arch.host_lpcr =3D mfspr(SPRN_LPCR);

How do these correlate with the guest's hv mmu? I'd like to keep the =
code clean enough so we can potentially use it for PR mode as well :).

> +=09
> +	pr_info("KVM guest htab at %lx, LPID %lx\n", hpt, lpid);
> +	return 0;
> +}
> +
> +void kvmppc_free_hpt(struct kvm *kvm)
> +{
> +	unsigned long i;
> +	struct kvmppc_pginfo *pginfo;
> +
> +	clear_bit(kvm->arch.lpid, lpid_inuse);
> +	free_pages(kvm->arch.hpt_virt, HPT_ORDER - PAGE_SHIFT);
> +
> +	if (kvm->arch.ram_pginfo) {
> +		pginfo =3D kvm->arch.ram_pginfo;
> +		kvm->arch.ram_pginfo =3D NULL;
> +		for (i =3D 0; i < kvm->arch.ram_npages; ++i)
> +			put_page(pfn_to_page(pginfo[i].pfn));
> +		kfree(pginfo);
> +	}
> +}
> +
> +static unsigned long user_page_size(unsigned long addr)
> +{
> +	struct vm_area_struct *vma;
> +	unsigned long size =3D PAGE_SIZE;
> +
> +	down_read(&current->mm->mmap_sem);
> +	vma =3D find_vma(current->mm, addr);
> +	if (vma)
> +		size =3D vma_kernel_pagesize(vma);
> +	up_read(&current->mm->mmap_sem);
> +	return size;
> +}
> +
> +static pfn_t hva_to_pfn(unsigned long addr)
> +{
> +	struct page *page[1];
> +	int npages;
> +
> +	might_sleep();
> +
> +	npages =3D get_user_pages_fast(addr, 1, 1, page);
> +
> +	if (unlikely(npages !=3D 1))
> +		return 0;
> +
> +	return page_to_pfn(page[0]);
> +}
> +
> +long kvmppc_prepare_vrma(struct kvm *kvm,
> +			 struct kvm_userspace_memory_region *mem)
> +{
> +	unsigned long psize, porder;
> +	unsigned long i, npages;
> +	struct kvmppc_pginfo *pginfo;
> +	pfn_t pfn;
> +	unsigned long hva;
> +
> +	/* First see what page size we have */
> +	psize =3D user_page_size(mem->userspace_addr);
> +	/* For now, only allow 16MB pages */

The reason to go for 16MB pages is because of the host mmu code, not the =
guest hv mmu. So please at least #ifdef the code to HV so we document =
that correlation.

> +	if (psize !=3D 1ul << VRMA_PAGE_ORDER || (mem->memory_size & =
(psize - 1))) {
> +		pr_err("bad psize=3D%lx memory_size=3D%llx @ %llx\n",
> +		       psize, mem->memory_size, mem->userspace_addr);
> +		return -EINVAL;
> +	}
> +	porder =3D __ilog2(psize);
> +
> +	npages =3D mem->memory_size >> porder;
> +	pginfo =3D kzalloc(npages * sizeof(struct kvmppc_pginfo), =
GFP_KERNEL);
> +	if (!pginfo) {
> +		pr_err("kvmppc_prepare_vrma: couldn't alloc %lu =
bytes\n",
> +		       npages * sizeof(struct kvmppc_pginfo));
> +		return -ENOMEM;
> +	}
> +
> +	for (i =3D 0; i < npages; ++i) {
> +		hva =3D mem->userspace_addr + (i << porder);
> +		if (user_page_size(hva) !=3D psize)
> +			goto err;
> +		pfn =3D hva_to_pfn(hva);
> +		if (pfn =3D=3D 0) {
> +			pr_err("oops, no pfn for hva %lx\n", hva);
> +			goto err;
> +		}
> +		if (pfn & ((1ul << (porder - PAGE_SHIFT)) - 1)) {
> +			pr_err("oops, unaligned pfn %llx\n", pfn);
> +			put_page(pfn_to_page(pfn));
> +			goto err;
> +		}
> +		pginfo[i].pfn =3D pfn;
> +	}
> +
> +	kvm->arch.ram_npages =3D npages;
> +	kvm->arch.ram_psize =3D psize;
> +	kvm->arch.ram_porder =3D porder;
> +	kvm->arch.ram_pginfo =3D pginfo;
> +
> +	return 0;
> +
> + err:
> +	kfree(pginfo);
> +	return -EINVAL;
> +}
> +
> +void kvmppc_map_vrma(struct kvm *kvm, struct =
kvm_userspace_memory_region *mem)
> +{
> +	unsigned long i;
> +	unsigned long npages =3D kvm->arch.ram_npages;
> +	unsigned long pfn;
> +	unsigned long *hpte;
> +	unsigned long hash;
> +	struct kvmppc_pginfo *pginfo =3D kvm->arch.ram_pginfo;
> +
> +	if (!pginfo)
> +		return;
> +
> +	/* VRMA can't be > 1TB */
> +	if (npages > 1ul << (40 - kvm->arch.ram_porder))
> +		npages =3D 1ul << (40 - kvm->arch.ram_porder);
> +	/* Can't use more than 1 HPTE per HPTEG */
> +	if (npages > HPT_NPTEG)
> +		npages =3D HPT_NPTEG;
> +
> +	for (i =3D 0; i < npages; ++i) {
> +		pfn =3D pginfo[i].pfn;
> +		/* can't use hpt_hash since va > 64 bits */
> +		hash =3D (i ^ (VRMA_VSID ^ (VRMA_VSID << 25))) & =
HPT_HASH_MASK;
> +		/*
> +		 * We assume that the hash table is empty and no
> +		 * vcpus are using it at this stage.  Since we create
> +		 * at most one HPTE per HPTEG, we just assume entry 7
> +		 * is available and use it.
> +		 */
> +		hpte =3D (unsigned long *) (kvm->arch.hpt_virt + (hash =
<< 7));
> +		hpte +=3D 7 * 2;
> +		/* HPTE low word - RPN, protection, etc. */
> +		hpte[1] =3D (pfn << PAGE_SHIFT) | HPTE_R_R | HPTE_R_C |
> +			HPTE_R_M | PP_RWXX;
> +		wmb();
> +		hpte[0] =3D HPTE_V_1TB_SEG | (VRMA_VSID << (40 - 16)) |
> +			(i << (VRMA_PAGE_ORDER - 16)) | HPTE_V_BOLTED |
> +			HPTE_V_LARGE | HPTE_V_VALID;
> +	}
> +}
> +
> +int kvmppc_mmu_hv_init(void)
> +{
> +	if (!cpu_has_feature(CPU_FTR_HVMODE_206))
> +		return 0;

Return 0 for "it doesn't work" might not be the right exit code ;).

> +	memset(lpid_inuse, 0, sizeof(lpid_inuse));
> +	set_bit(mfspr(SPRN_LPID), lpid_inuse);
> +	set_bit(NR_LPIDS - 1, lpid_inuse);
> +
> +	return 0;
> +}
> +
> +void kvmppc_mmu_destroy(struct kvm_vcpu *vcpu)
> +{
> +}
> +
> +static void kvmppc_mmu_book3s_64_hv_reset_msr(struct kvm_vcpu *vcpu)
> +{
> +	kvmppc_set_msr(vcpu, MSR_SF | MSR_ME);
> +}
> +
> +static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t =
eaddr,
> +				struct kvmppc_pte *gpte, bool data)
> +{
> +	return -ENOENT;

Can't you just call the normal book3s_64 mmu code here? Without xlate, =
doing ppc_ld doesn't work, which means that reading out the faulting =
guest instruction breaks. We'd also need it for PR mode :).

> +}
> +
> +void kvmppc_mmu_book3s_hv_init(struct kvm_vcpu *vcpu)
> +{
> +	struct kvmppc_mmu *mmu =3D &vcpu->arch.mmu;
> +
> +	vcpu->arch.slb_nr =3D 32;		/* Assume POWER7 for now =
*/
> +
> +	mmu->xlate =3D kvmppc_mmu_book3s_64_hv_xlate;
> +	mmu->reset_msr =3D kvmppc_mmu_book3s_64_hv_reset_msr;
> +
> +	vcpu->arch.hflags |=3D BOOK3S_HFLAG_SLB;
> +}
> diff --git a/arch/powerpc/kvm/book3s_hv.c =
b/arch/powerpc/kvm/book3s_hv.c
> new file mode 100644
> index 0000000..f6b7cd1
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -0,0 +1,413 @@
> +/*
> + * Copyright 2011 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
> + * Copyright (C) 2009. SUSE Linux Products GmbH. All rights reserved.
> + *
> + * Authors:
> + *    Paul Mackerras <paulus@au1.ibm.com>
> + *    Alexander Graf <agraf@suse.de>
> + *    Kevin Wolf <mail@kevin-wolf.de>
> + *
> + * Description: KVM functions specific to running on Book 3S
> + * processors in hypervisor mode (specifically POWER7 and later).
> + *
> + * This file is derived from arch/powerpc/kvm/book3s.c,
> + * by Alexander Graf <agraf@suse.de>.
> + *
> + * This program is free software; you can redistribute it and/or =
modify
> + * it under the terms of the GNU General Public License, version 2, =
as
> + * published by the Free Software Foundation.
> + */
> +
> +#include <linux/kvm_host.h>
> +#include <linux/err.h>
> +#include <linux/slab.h>
> +#include <linux/preempt.h>
> +#include <linux/sched.h>
> +#include <linux/delay.h>
> +#include <linux/fs.h>
> +#include <linux/anon_inodes.h>
> +
> +#include <asm/reg.h>
> +#include <asm/cputable.h>
> +#include <asm/cacheflush.h>
> +#include <asm/tlbflush.h>
> +#include <asm/uaccess.h>
> +#include <asm/io.h>
> +#include <asm/kvm_ppc.h>
> +#include <asm/kvm_book3s.h>
> +#include <asm/mmu_context.h>
> +#include <asm/lppaca.h>
> +#include <asm/processor.h>
> +#include <linux/gfp.h>
> +#include <linux/sched.h>
> +#include <linux/vmalloc.h>
> +#include <linux/highmem.h>
> +
> +/* #define EXIT_DEBUG */
> +/* #define EXIT_DEBUG_SIMPLE */
> +/* #define EXIT_DEBUG_INT */
> +
> +void kvmppc_core_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
> +{
> +	local_paca->kvm_vcpu =3D vcpu;
> +	vcpu->cpu =3D cpu;
> +}
> +
> +void kvmppc_core_vcpu_put(struct kvm_vcpu *vcpu)
> +{
> +	vcpu->cpu =3D -1;
> +}
> +
> +void kvmppc_vcpu_block(struct kvm_vcpu *vcpu)
> +{
> +	u64 now;
> +	unsigned long dec_nsec;
> +
> +	now =3D get_tb();
> +	if (now >=3D vcpu->arch.dec_expires && =
!kvmppc_core_pending_dec(vcpu))
> +		kvmppc_core_queue_dec(vcpu);
> +	if (vcpu->arch.pending_exceptions)
> +		return;
> +	if (vcpu->arch.dec_expires !=3D ~(u64)0) {
> +		dec_nsec =3D (vcpu->arch.dec_expires - now) * =
NSEC_PER_SEC /
> +			tb_ticks_per_sec;
> +		hrtimer_start(&vcpu->arch.dec_timer, ktime_set(0, =
dec_nsec),
> +			      HRTIMER_MODE_REL);
> +	}
> +
> +	kvm_vcpu_block(vcpu);
> +	vcpu->stat.halt_wakeup++;
> +
> +	if (vcpu->arch.dec_expires !=3D ~(u64)0)
> +		hrtimer_try_to_cancel(&vcpu->arch.dec_timer);
> +}
> +
> +void kvmppc_set_msr(struct kvm_vcpu *vcpu, u64 msr)
> +{
> +	vcpu->arch.msr =3D msr;
> +}
> +
> +void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
> +{
> +	vcpu->arch.pvr =3D pvr;
> +	kvmppc_mmu_book3s_hv_init(vcpu);

No need for you to do it depending on pvr. You can just do the mmu =
initialization on normal init :).

> +}
> +
> +void kvmppc_dump_regs(struct kvm_vcpu *vcpu)
> +{
> +	int r;
> +
> +	pr_err("vcpu %p (%d):\n", vcpu, vcpu->vcpu_id);
> +	pr_err("pc  =3D %.16lx  msr =3D %.16lx  trap =3D %x\n",
> +	       vcpu->arch.pc, vcpu->arch.msr, vcpu->arch.trap);
> +	for (r =3D 0; r < 16; ++r)
> +		pr_err("r%2d =3D %.16lx  r%d =3D %.16lx\n",
> +		       r, kvmppc_get_gpr(vcpu, r),
> +		       r+16, kvmppc_get_gpr(vcpu, r+16));
> +	pr_err("ctr =3D %.16lx  lr  =3D %.16lx\n",
> +	       vcpu->arch.ctr, vcpu->arch.lr);
> +	pr_err("srr0 =3D %.16lx srr1 =3D %.16lx\n",
> +	       vcpu->arch.srr0, vcpu->arch.srr1);
> +	pr_err("sprg0 =3D %.16lx sprg1 =3D %.16lx\n",
> +	       vcpu->arch.sprg0, vcpu->arch.sprg1);
> +	pr_err("sprg2 =3D %.16lx sprg3 =3D %.16lx\n",
> +	       vcpu->arch.sprg2, vcpu->arch.sprg3);
> +	pr_err("cr =3D %.8x  xer =3D %.16lx  dsisr =3D %.8x\n",
> +	       vcpu->arch.cr, vcpu->arch.xer, vcpu->arch.dsisr);
> +	pr_err("dar =3D %.16lx\n", vcpu->arch.dear);
> +	pr_err("fault dar =3D %.16lx dsisr =3D %.8x\n",
> +	       vcpu->arch.fault_dar, vcpu->arch.fault_dsisr);
> +	pr_err("SLB (%d entries):\n", vcpu->arch.slb_max);
> +	for (r =3D 0; r < vcpu->arch.slb_max; ++r)
> +		pr_err("  ESID =3D %.16llx VSID =3D %.16llx\n",
> +		       vcpu->arch.slb[r].orige, =
vcpu->arch.slb[r].origv);
> +	pr_err("lpcr =3D %.16lx sdr1 =3D %.16lx last_inst =3D %.8x\n",
> +	       vcpu->arch.lpcr, vcpu->kvm->arch.sdr1,
> +	       vcpu->arch.last_inst);
> +}
> +
> +static int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu =
*vcpu,
> +			      struct task_struct *tsk)
> +{
> +	int r =3D RESUME_HOST;
> +
> +	vcpu->stat.sum_exits++;
> +
> +	run->exit_reason =3D KVM_EXIT_UNKNOWN;
> +	run->ready_for_interrupt_injection =3D 1;
> +	switch (vcpu->arch.trap) {
> +	/* We're good on these - the host merely wanted to get our =
attention */
> +	case BOOK3S_INTERRUPT_HV_DECREMENTER:
> +		vcpu->stat.dec_exits++;
> +		r =3D RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_EXTERNAL:
> +		vcpu->stat.ext_intr_exits++;
> +		r =3D RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_PERFMON:
> +		r =3D RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_PROGRAM:
> +	{
> +		ulong flags;
> +		/*
> +		 * Normally program interrupts are delivered directly
> +		 * to the guest by the hardware, but we can get here
> +		 * as a result of a hypervisor emulation interrupt
> +		 * (e40) getting turned into a 700 by BML RTAS.

Not sure I fully understand what's going on there. Mind to explain? :)

> +		 */
> +		flags =3D vcpu->arch.msr & 0x1f0000ull;
> +		kvmppc_core_queue_program(vcpu, flags);
> +		r =3D RESUME_GUEST;
> +		break;
> +	}
> +	case BOOK3S_INTERRUPT_SYSCALL:
> +	{
> +		/* hcall - punt to userspace */
> +		int i;
> +

Do we really want to accept sc from pr=3D1? I'd just reject them =
straight away.

> +		run->papr_hcall.nr =3D kvmppc_get_gpr(vcpu, 3);
> +		for (i =3D 0; i < 9; ++i)
> +			run->papr_hcall.args[i] =3D kvmppc_get_gpr(vcpu, =
4 + i);
> +		run->exit_reason =3D KVM_EXIT_PAPR_HCALL;
> +		vcpu->arch.hcall_needed =3D 1;
> +		r =3D RESUME_HOST;
> +		break;
> +	}
> +	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
> +		vcpu->arch.dsisr =3D vcpu->arch.fault_dsisr;
> +		vcpu->arch.dear =3D vcpu->arch.fault_dar;
> +		kvmppc_inject_interrupt(vcpu, =
BOOK3S_INTERRUPT_DATA_STORAGE, 0);
> +		r =3D RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_H_INST_STORAGE:
> +		kvmppc_inject_interrupt(vcpu, =
BOOK3S_INTERRUPT_INST_STORAGE,
> +					0x08000000);

What do these do? I thought the guest gets DSI and ISI directly?

> +		r =3D RESUME_GUEST;
> +		break;
> +	case BOOK3S_INTERRUPT_H_EMUL_ASSIST:
> +		kvmppc_core_queue_program(vcpu, 0x80000);
> +		r =3D RESUME_GUEST;
> +		break;
> +	default:
> +		kvmppc_dump_regs(vcpu);
> +		printk(KERN_EMERG "trap=3D0x%x | pc=3D0x%lx | =
msr=3D0x%lx\n",
> +			vcpu->arch.trap, kvmppc_get_pc(vcpu), =
vcpu->arch.msr);
> +		r =3D RESUME_HOST;
> +		BUG();
> +		break;
> +	}
> +
> +
> +	if (!(r & RESUME_HOST)) {
> +		/* To avoid clobbering exit_reason, only check for =
signals if
> +		 * we aren't already exiting to userspace for some other
> +		 * reason. */
> +		if (signal_pending(tsk)) {
> +			vcpu->stat.signal_exits++;
> +			run->exit_reason =3D KVM_EXIT_INTR;
> +			r =3D -EINTR;
> +		} else {
> +			kvmppc_core_deliver_interrupts(vcpu);
> +		}
> +	}
> +
> +	return r;
> +}
> +
> +int kvm_arch_vcpu_ioctl_get_sregs(struct kvm_vcpu *vcpu,
> +                                  struct kvm_sregs *sregs)
> +{
> +	int i;
> +
> +	sregs->pvr =3D vcpu->arch.pvr;
> +
> +	memset(sregs, 0, sizeof(struct kvm_sregs));
> +	for (i =3D 0; i < vcpu->arch.slb_max; i++) {
> +		sregs->u.s.ppc64.slb[i].slbe =3D =
vcpu->arch.slb[i].orige;
> +		sregs->u.s.ppc64.slb[i].slbv =3D =
vcpu->arch.slb[i].origv;
> +	}
> +
> +	return 0;
> +}
> +
> +int kvm_arch_vcpu_ioctl_set_sregs(struct kvm_vcpu *vcpu,
> +                                  struct kvm_sregs *sregs)
> +{
> +	int i, j;
> +
> +	kvmppc_set_pvr(vcpu, sregs->pvr);
> +
> +	j =3D 0;
> +	for (i =3D 0; i < vcpu->arch.slb_nr; i++) {
> +		if (sregs->u.s.ppc64.slb[i].slbe & SLB_ESID_V) {
> +			vcpu->arch.slb[j].orige =3D =
sregs->u.s.ppc64.slb[i].slbe;
> +			vcpu->arch.slb[j].origv =3D =
sregs->u.s.ppc64.slb[i].slbv;
> +			++j;
> +		}
> +	}
> +	vcpu->arch.slb_max =3D j;
> +
> +	return 0;
> +}
> +
> +int kvmppc_core_check_processor_compat(void)
> +{
> +	if (cpu_has_feature(CPU_FTR_HVMODE_206))
> +		return 0;
> +	return -EIO;
> +}
> +
> +struct kvm_vcpu *kvmppc_core_vcpu_create(struct kvm *kvm, unsigned =
int id)
> +{
> +	struct kvm_vcpu *vcpu;
> +	int err =3D -ENOMEM;
> +	unsigned long lpcr;
> +
> +	vcpu =3D kzalloc(sizeof(struct kvm_vcpu), GFP_KERNEL);
> +	if (!vcpu)
> +		goto out;
> +
> +	err =3D kvm_vcpu_init(vcpu, kvm, id);
> +	if (err)
> +		goto free_vcpu;
> +
> +	vcpu->arch.last_cpu =3D -1;
> +	vcpu->arch.host_msr =3D mfmsr();
> +	vcpu->arch.mmcr[0] =3D MMCR0_FC;
> +	vcpu->arch.ctrl =3D CTRL_RUNLATCH;
> +	/* default to book3s_64 (power7) */
> +	vcpu->arch.pvr =3D 0x3f0200;

Wouldn't it make sense to simply default to the host pvr? Not sure - =
maybe it's not :).

> +	kvmppc_set_pvr(vcpu, vcpu->arch.pvr);
> +
> +	/* remember where some real-mode handlers are */
> +	vcpu->arch.trampoline_lowmem =3D kvmppc_trampoline_lowmem;
> +	vcpu->arch.trampoline_enter =3D kvmppc_trampoline_enter;
> +	vcpu->arch.highmem_handler =3D (ulong)kvmppc_handler_highmem;
> +	vcpu->arch.rmcall =3D *(ulong*)kvmppc_rmcall;
> +
> +	lpcr =3D kvm->arch.host_lpcr & (LPCR_PECE | LPCR_LPES);
> +	lpcr |=3D LPCR_VPM0 | LPCR_VRMA_L | (4UL << LPCR_DPFD_SH) | =
LPCR_HDICE;
> +	vcpu->arch.lpcr =3D lpcr;
> +
> +	return vcpu;
> +
> +free_vcpu:
> +	kfree(vcpu);
> +out:
> +	return ERR_PTR(err);
> +}
> +
> +void kvmppc_core_vcpu_free(struct kvm_vcpu *vcpu)
> +{
> +	kvm_vcpu_uninit(vcpu);
> +	kfree(vcpu);
> +}
> +
> +extern int __kvmppc_vcore_entry(struct kvm_run *kvm_run, struct =
kvm_vcpu *vcpu);
> +
> +int kvmppc_vcpu_run(struct kvm_run *run, struct kvm_vcpu *vcpu)
> +{
> +	u64 now;
> +
> +	if (signal_pending(current)) {
> +		run->exit_reason =3D KVM_EXIT_INTR;
> +		return -EINTR;
> +	}
> +
> +	flush_fp_to_thread(current);

Do those with fine with preemption enabled?

> +	flush_altivec_to_thread(current);
> +	flush_vsx_to_thread(current);
> +	preempt_disable();
> +
> +	kvm_guest_enter();
> +
> +	__kvmppc_vcore_entry(NULL, vcpu);
> +
> +	kvm_guest_exit();
> +
> +	preempt_enable();
> +	kvm_resched(vcpu);
> +
> +	now =3D get_tb();
> +	/* cancel pending dec exception if dec is positive */
> +	if (now < vcpu->arch.dec_expires && =
kvmppc_core_pending_dec(vcpu))
> +		kvmppc_core_dequeue_dec(vcpu);
> +
> +	return kvmppc_handle_exit(run, vcpu, current);
> +}
> +
> +int kvmppc_core_prepare_memory_region(struct kvm *kvm,
> +				struct kvm_userspace_memory_region *mem)
> +{
> +	if (mem->guest_phys_addr =3D=3D 0 && mem->memory_size !=3D 0)
> +		return kvmppc_prepare_vrma(kvm, mem);
> +	return 0;
> +}
> +
> +void kvmppc_core_commit_memory_region(struct kvm *kvm,
> +				struct kvm_userspace_memory_region *mem)
> +{
> +	if (mem->guest_phys_addr =3D=3D 0 && mem->memory_size !=3D 0)
> +		kvmppc_map_vrma(kvm, mem);
> +}
> +
> +int kvmppc_core_init_vm(struct kvm *kvm)
> +{
> +	long r;
> +
> +	/* Allocate hashed page table */
> +	r =3D kvmppc_alloc_hpt(kvm);
> +
> +	return r;
> +}
> +
> +void kvmppc_core_destroy_vm(struct kvm *kvm)
> +{
> +	kvmppc_free_hpt(kvm);
> +}
> +
> +/* These are stubs for now */
> +void kvmppc_mmu_pte_pflush(struct kvm_vcpu *vcpu, ulong pa_start, =
ulong pa_end)
> +{
> +}
> +
> +/* We don't need to emulate any privileged instructions or dcbz */
> +int kvmppc_core_emulate_op(struct kvm_run *run, struct kvm_vcpu =
*vcpu,
> +                           unsigned int inst, int *advance)
> +{
> +	return EMULATE_FAIL;
> +}
> +
> +int kvmppc_core_emulate_mtspr(struct kvm_vcpu *vcpu, int sprn, int =
rs)
> +{
> +	return EMULATE_FAIL;
> +}
> +
> +int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, int =
rt)
> +{
> +	return EMULATE_FAIL;
> +}
> +
> +static int kvmppc_book3s_hv_init(void)
> +{
> +	int r;
> +
> +	r =3D kvm_init(NULL, sizeof(struct kvm_vcpu), 0, THIS_MODULE);
> +
> +	if (r)
> +		return r;
> +
> +	r =3D kvmppc_mmu_hv_init();
> +
> +	return r;
> +}
> +
> +static void kvmppc_book3s_hv_exit(void)
> +{
> +	kvm_exit();
> +}
> +
> +module_init(kvmppc_book3s_hv_init);
> +module_exit(kvmppc_book3s_hv_exit);
> diff --git a/arch/powerpc/kvm/book3s_hv_interrupts.S =
b/arch/powerpc/kvm/book3s_hv_interrupts.S
> new file mode 100644
> index 0000000..19d152d
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_hv_interrupts.S
> @@ -0,0 +1,326 @@
> +/*
> + * This program is free software; you can redistribute it and/or =
modify
> + * it under the terms of the GNU General Public License, version 2, =
as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, 51 Franklin Street, Fifth Floor, Boston, MA  =
02110-1301, USA.
> + *
> + * Copyright 2011 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
> + *
> + * Derived from book3s_interrupts.S, which is:
> + * Copyright SUSE Linux Products GmbH 2009
> + *
> + * Authors: Alexander Graf <agraf@suse.de>
> + */
> +
> +#include <asm/ppc_asm.h>
> +#include <asm/kvm_asm.h>
> +#include <asm/reg.h>
> +#include <asm/page.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/exception-64s.h>
> +#include <asm/ppc-opcode.h>
> +
> +#define DISABLE_INTERRUPTS	\
> +	mfmsr   r0;		\
> +	rldicl  r0,r0,48,1;	\
> +	rotldi  r0,r0,16;	\
> +	mtmsrd  r0,1;		\
> +
> =
+/************************************************************************=
*****
> + *                                                                    =
       *
> + *     Guest entry / exit code that is in kernel module memory =
(vmalloc)     *
> + *                                                                    =
       *
> + =
**************************************************************************=
**/
> +
> +/* Registers:
> + *  r4: vcpu pointer
> + */
> +_GLOBAL(__kvmppc_vcore_entry)
> +
> +	/* Write correct stack frame */
> +	mflr	r0
> +	std	r0,PPC_LR_STKOFF(r1)
> +
> +	/* Save host state to the stack */
> +	stdu	r1, -SWITCH_FRAME_SIZE(r1)
> +
> +	/* Save non-volatile registers (r14 - r31) */
> +	SAVE_NVGPRS(r1)
> +
> +	/* Save host PMU registers and load guest PMU registers */
> +	/* R4 is live here (vcpu pointer) but not r3 or r5 */
> +	li	r3, 1
> +	sldi	r3, r3, 31		/* MMCR0_FC (freeze counters) =
bit */
> +	mfspr	r7, SPRN_MMCR0		/* save MMCR0 */
> +	mtspr	SPRN_MMCR0, r3		/* freeze all counters, disable =
interrupts */
> +	isync
> +	ld	r3, PACALPPACAPTR(r13)	/* is the host using the PMU? */
> +	lbz	r5, LPPACA_PMCINUSE(r3)
> +	cmpwi	r5, 0
> +	beq	31f			/* skip if not */
> +	mfspr	r5, SPRN_MMCR1
> +	mfspr	r6, SPRN_MMCRA
> +	std	r7, PACA_HOST_MMCR(r13)
> +	std	r5, PACA_HOST_MMCR + 8(r13)
> +	std	r6, PACA_HOST_MMCR + 16(r13)
> +	mfspr	r3, SPRN_PMC1
> +	mfspr	r5, SPRN_PMC2
> +	mfspr	r6, SPRN_PMC3
> +	mfspr	r7, SPRN_PMC4
> +	mfspr	r8, SPRN_PMC5
> +	mfspr	r9, SPRN_PMC6
> +	stw	r3, PACA_HOST_PMC(r13)
> +	stw	r5, PACA_HOST_PMC + 4(r13)
> +	stw	r6, PACA_HOST_PMC + 8(r13)
> +	stw	r7, PACA_HOST_PMC + 12(r13)
> +	stw	r8, PACA_HOST_PMC + 16(r13)
> +	stw	r9, PACA_HOST_PMC + 20(r13)
> +31:
> +
> +	/* Save host DSCR */
> +	mfspr	r3, SPRN_DSCR
> +	std	r3, PACA_HOST_DSCR(r13)
> +
> +	/* Save host DABR */
> +	mfspr	r3, SPRN_DABR
> +	std	r3, PACA_DABR(r13)
> +
> +	DISABLE_INTERRUPTS
> +
> +	/*
> +	 * Put whatever is in the decrementer into the
> +	 * hypervisor decrementer.
> +	 */
> +	mfspr	r8,SPRN_DEC
> +	mftb	r7
> +	mtspr	SPRN_HDEC,r8

Can't we just always use HDEC on the host? That's save us from all the =
swapping here.

> +	extsw	r8,r8
> +	add	r8,r8,r7
> +	std	r8,PACA_KVM_DECEXP(r13)

Why is dec+hdec on vcpu_run decexp? What exactly does this store?

> +
> +	ld	r5, VCPU_TRAMPOLINE_ENTER(r4)
> +	LOAD_REG_IMMEDIATE(r6, MSR_KERNEL & ~(MSR_IR | MSR_DR))
> +
> +	/* Jump to segment patching handler and into our guest */
> +	b	kvmppc_rmcall
> +
> +/*
> + * This is the handler in module memory. It gets jumped at from the
> + * lowmem trampoline code, so it's basically the guest exit code.
> + *
> + */
> +
> +.global kvmppc_handler_highmem
> +kvmppc_handler_highmem:
> +
> +	/*
> +	 * Register usage at this point:
> +	 *
> +	 * R1       =3D host R1
> +	 * R2       =3D host R2
> +	 * R12      =3D exit handler id
> +	 * R13      =3D PACA
> +	 * SVCPU.*  =3D guest *
> +	 *
> +	 */
> +
> +	/* R7 =3D vcpu */
> +	ld	r7, PACA_KVM_VCPU(r13)
> +
> +	/*
> +	 * Reload DEC.  HDEC interrupts were disabled when
> +	 * we reloaded the host's LPCR value.
> +	 */
> +	ld	r3, PACA_KVM_DECEXP(r13)
> +	mftb	r4
> +	subf	r4, r4, r3
> +	mtspr	SPRN_DEC, r4
> +
> +	ld	r3, PACALPPACAPTR(r13)	/* is the host using the PMU? */
> +	lbz	r4, LPPACA_PMCINUSE(r3)
> +	cmpwi	r4, 0
> +	beq	23f			/* skip if not */
> +	lwz	r3, PACA_HOST_PMC(r13)
> +	lwz	r4, PACA_HOST_PMC + 4(r13)
> +	lwz	r5, PACA_HOST_PMC + 4(r13)

copy&paste error?

> +	lwz	r6, PACA_HOST_PMC + 4(r13)
> +	lwz	r8, PACA_HOST_PMC + 4(r13)
> +	lwz	r9, PACA_HOST_PMC + 4(r13)
> +	mtspr	SPRN_PMC1, r3
> +	mtspr	SPRN_PMC2, r4
> +	mtspr	SPRN_PMC3, r5
> +	mtspr	SPRN_PMC4, r6
> +	mtspr	SPRN_PMC5, r8
> +	mtspr	SPRN_PMC6, r9
> +	ld	r3, PACA_HOST_MMCR(r13)
> +	ld	r4, PACA_HOST_MMCR + 8(r13)
> +	ld	r5, PACA_HOST_MMCR + 16(r13)
> +	mtspr	SPRN_MMCR1, r4
> +	mtspr	SPRN_MMCRA, r5
> +	mtspr	SPRN_MMCR0, r3
> +	isync
> +23:
> +
> +	/* Restore host msr -> SRR1 */
> +	ld	r4, VCPU_HOST_MSR(r7)
> +
> +	/*
> +	 * For some interrupts, we need to call the real Linux
> +	 * handler, so it can do work for us. This has to happen
> +	 * as if the interrupt arrived from the kernel though,
> +	 * so let's fake it here where most state is restored.
> +	 *
> +	 * Call Linux for hardware interrupts/decrementer
> +	 * r3 =3D address of interrupt handler (exit reason)
> +	 */
> +	/* Note: preemption is disabled at this point */
> +
> +	cmpwi	r12, BOOK3S_INTERRUPT_MACHINE_CHECK
> +	beq	1f
> +	cmpwi	r12, BOOK3S_INTERRUPT_EXTERNAL
> +	beq	1f
> +
> +	/* Back to EE=3D1 */
> +	mtmsr	r4
> +	sync
> +	b	kvm_return_point
> +
> +1:	bl	call_linux_handler
> +
> +.global kvm_return_point
> +kvm_return_point:
> +	/* Restore non-volatile host registers (r14 - r31) */
> +	REST_NVGPRS(r1)
> +
> +	addi    r1, r1, SWITCH_FRAME_SIZE
> +	ld	r0, PPC_LR_STKOFF(r1)
> +	mtlr	r0
> +	blr
> +
> +call_linux_handler:
> +	/* Restore host IP -> SRR0 */
> +	mflr	r3
> +	mtlr	r12
> +
> +	ld	r5, VCPU_TRAMPOLINE_LOWMEM(r7)
> +	LOAD_REG_IMMEDIATE(r6, MSR_KERNEL & ~(MSR_IR | MSR_DR))
> +	b	kvmppc_rmcall
> +
> +/*
> + * Save away FP, VMX and VSX registers.
> + * r3 =3D vcpu pointer
> +*/
> +_GLOBAL(kvmppc_save_fp)
> +	mfmsr	r9
> +	ori	r8,r9,MSR_FP
> +#ifdef CONFIG_ALTIVEC
> +#ifdef CONFIG_VSX
> +	oris	r8,r8,(MSR_VEC|MSR_VSX)@h
> +#else
> +	oris	r8,r8,MSR_VEC@h
> +#endif
> +#endif
> +	mtmsrd	r8
> +	isync
> +#ifdef CONFIG_VSX
> +BEGIN_FTR_SECTION
> +	reg =3D 0
> +	.rept	32
> +	li	r6,reg*16+VCPU_VSRS
> +	stxvd2x	reg,r6,r3
> +	reg =3D reg + 1
> +	.endr
> +FTR_SECTION_ELSE
> +#endif
> +	reg =3D 0
> +	.rept	32
> +	stfd	reg,reg*8+VCPU_FPRS(r3)
> +	reg =3D reg + 1
> +	.endr
> +#ifdef CONFIG_VSX
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_VSX)
> +#endif
> +	mffs	fr0
> +	stfd	fr0,VCPU_FPSCR(r3)
> +
> +#ifdef CONFIG_ALTIVEC
> +BEGIN_FTR_SECTION
> +	reg =3D 0
> +	.rept	32
> +	li	r6,reg*16+VCPU_VRS
> +	stvx	reg,r6,r3
> +	reg =3D reg + 1
> +	.endr
> +	mfvscr	vr0
> +	li	r6,VCPU_VSCR
> +	stvx	vr0,r6,r3
> +END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
> +#endif
> +	mfspr	r6,SPRN_VRSAVE
> +	stw	r6,VCPU_VRSAVE(r3)
> +	mtmsrd	r9
> +	isync
> +	blr
> +
> +/*
> + * Load up FP, VMX and VSX registers
> + * r4 =3D vcpu pointer
> + */
> +	.globl	kvmppc_load_fp
> +kvmppc_load_fp:
> +	mfmsr	r9
> +	ori	r8,r9,MSR_FP
> +#ifdef CONFIG_ALTIVEC
> +#ifdef CONFIG_VSX
> +	oris	r8,r8,(MSR_VEC|MSR_VSX)@h
> +#else
> +	oris	r8,r8,MSR_VEC@h
> +#endif
> +#endif
> +	mtmsrd	r8
> +	isync
> +	lfd	fr0,VCPU_FPSCR(r4)
> +	MTFSF_L(fr0)
> +#ifdef CONFIG_VSX
> +BEGIN_FTR_SECTION
> +	reg =3D 0
> +	.rept	32
> +	li	r7,reg*16+VCPU_VSRS
> +	lxvd2x	reg,r7,r4
> +	reg =3D reg + 1
> +	.endr
> +FTR_SECTION_ELSE
> +#endif
> +	reg =3D 0
> +	.rept	32
> +	lfd	reg,reg*8+VCPU_FPRS(r4)
> +	reg =3D reg + 1
> +	.endr
> +#ifdef CONFIG_VSX
> +ALT_FTR_SECTION_END_IFSET(CPU_FTR_VSX)
> +#endif
> +
> +#ifdef CONFIG_ALTIVEC
> +	li	r7,VCPU_VSCR
> +	lvx	vr0,r7,r4
> +	mtvscr	vr0
> +BEGIN_FTR_SECTION
> +	reg =3D 0
> +	.rept	32
> +	li	r7,reg*16+VCPU_VRS
> +	lvx	reg,r7,r4
> +	reg =3D reg + 1
> +	.endr
> +END_FTR_SECTION_IFSET(CPU_FTR_ALTIVEC)
> +#endif
> +	lwz	r7,VCPU_VRSAVE(r4)
> +	mtspr	SPRN_VRSAVE,r7
> +	blr
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S =
b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> new file mode 100644
> index 0000000..813b01c
> --- /dev/null
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -0,0 +1,663 @@
> +/*
> + * This program is free software; you can redistribute it and/or =
modify
> + * it under the terms of the GNU General Public License, version 2, =
as
> + * published by the Free Software Foundation.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
> + * GNU General Public License for more details.
> + *
> + * Copyright 2011 Paul Mackerras, IBM Corp. <paulus@au1.ibm.com>
> + *
> + * Derived from book3s_rmhandlers.S and other files, which are:
> + *
> + * Copyright SUSE Linux Products GmbH 2009
> + *
> + * Authors: Alexander Graf <agraf@suse.de>
> + */
> +
> +#include <asm/ppc_asm.h>
> +#include <asm/kvm_asm.h>
> +#include <asm/reg.h>
> +#include <asm/page.h>
> +#include <asm/asm-offsets.h>
> +#include <asm/exception-64s.h>
> +
> =
+/************************************************************************=
*****
> + *                                                                    =
       *
> + *        Real Mode handlers that need to be in the linear mapping    =
       *
> + *                                                                    =
       *
> + =
**************************************************************************=
**/
> +
> +#define SHADOW_VCPU_OFF		PACA_KVM_SVCPU
> +
> +	.globl	kvmppc_skip_interrupt
> +kvmppc_skip_interrupt:
> +	mfspr	r13,SPRN_SRR0
> +	addi	r13,r13,4
> +	mtspr	SPRN_SRR0,r13
> +	GET_SCRATCH0(r13)
> +	rfid
> +	b	.
> +
> +	.globl	kvmppc_skip_Hinterrupt
> +kvmppc_skip_Hinterrupt:
> +	mfspr	r13,SPRN_HSRR0
> +	addi	r13,r13,4
> +	mtspr	SPRN_HSRR0,r13
> +	GET_SCRATCH0(r13)
> +	hrfid
> +	b	.
> +
> +/*
> + * This trampoline brings us back to a real mode handler
> + *
> + * Input Registers:
> + *
> + * R5 =3D SRR0
> + * R6 =3D SRR1
> + * R12 =3D real-mode IP
> + * LR =3D real-mode IP
> + *
> + */
> +.global kvmppc_handler_lowmem_trampoline
> +kvmppc_handler_lowmem_trampoline:
> +	cmpwi	r12,0x500
> +	beq	1f
> +	cmpwi	r12,0x980
> +	beq	1f

What?

1) use the macros please
2) why?

> +	mtsrr0	r3
> +	mtsrr1	r4
> +	blr
> +1:	mtspr	SPRN_HSRR0,r3
> +	mtspr	SPRN_HSRR1,r4
> +	blr
> +
> +/*
> + * Call a function in real mode.
> + * Must be called with interrupts hard-disabled.
> + *
> + * Input Registers:
> + *
> + * R5 =3D function
> + * R6 =3D MSR
> + * R7 =3D scratch register
> + *
> + */
> +_GLOBAL(kvmppc_rmcall)
> +	mfmsr	r7
> +	li	r0,MSR_RI		/* clear RI in MSR */
> +	andc	r7,r7,r0
> +	mtmsrd	r7,1
> +	mtsrr0	r5
> +	mtsrr1	r6
> +	RFI
> +
> +.global kvmppc_trampoline_lowmem
> +kvmppc_trampoline_lowmem:
> +	PPC_LONG kvmppc_handler_lowmem_trampoline - _stext
> +
> +.global kvmppc_trampoline_enter
> +kvmppc_trampoline_enter:
> +	PPC_LONG kvmppc_handler_trampoline_enter - _stext
> +
> +#define ULONG_SIZE 		8
> +#define VCPU_GPR(n)		(VCPU_GPRS + (n * ULONG_SIZE))
> +
> =
+/************************************************************************=
******
> + *                                                                    =
        *
> + *                               Entry code                           =
        *
> + *                                                                    =
        *
> + =
**************************************************************************=
***/
> +
> +.global kvmppc_handler_trampoline_enter
> +kvmppc_handler_trampoline_enter:
> +
> +	/* Required state:
> +	 *
> +	 * R4 =3D vcpu pointer
> +	 * MSR =3D ~IR|DR
> +	 * R13 =3D PACA
> +	 * R1 =3D host R1
> +	 * all other volatile GPRS =3D free
> +	 */
> +	ld	r14, VCPU_GPR(r14)(r4)
> +	ld	r15, VCPU_GPR(r15)(r4)
> +	ld	r16, VCPU_GPR(r16)(r4)
> +	ld	r17, VCPU_GPR(r17)(r4)
> +	ld	r18, VCPU_GPR(r18)(r4)
> +	ld	r19, VCPU_GPR(r19)(r4)
> +	ld	r20, VCPU_GPR(r20)(r4)
> +	ld	r21, VCPU_GPR(r21)(r4)
> +	ld	r22, VCPU_GPR(r22)(r4)
> +	ld	r23, VCPU_GPR(r23)(r4)
> +	ld	r24, VCPU_GPR(r24)(r4)
> +	ld	r25, VCPU_GPR(r25)(r4)
> +	ld	r26, VCPU_GPR(r26)(r4)
> +	ld	r27, VCPU_GPR(r27)(r4)
> +	ld	r28, VCPU_GPR(r28)(r4)
> +	ld	r29, VCPU_GPR(r29)(r4)
> +	ld	r30, VCPU_GPR(r30)(r4)
> +	ld	r31, VCPU_GPR(r31)(r4)
> +
> +	/* Load guest PMU registers */
> +	/* R4 is live here (vcpu pointer) */
> +	li	r3, 1
> +	sldi	r3, r3, 31		/* MMCR0_FC (freeze counters) =
bit */
> +	mtspr	SPRN_MMCR0, r3		/* freeze all counters, disable =
ints */
> +	isync
> +	lwz	r3, VCPU_PMC(r4)	/* always load up guest PMU =
registers */
> +	lwz	r5, VCPU_PMC + 4(r4)	/* to prevent information leak =
*/
> +	lwz	r6, VCPU_PMC + 8(r4)
> +	lwz	r7, VCPU_PMC + 12(r4)
> +	lwz	r8, VCPU_PMC + 16(r4)
> +	lwz	r9, VCPU_PMC + 20(r4)
> +	mtspr	SPRN_PMC1, r3
> +	mtspr	SPRN_PMC2, r5
> +	mtspr	SPRN_PMC3, r6
> +	mtspr	SPRN_PMC4, r7
> +	mtspr	SPRN_PMC5, r8
> +	mtspr	SPRN_PMC6, r9
> +	ld	r3, VCPU_MMCR(r4)
> +	ld	r5, VCPU_MMCR + 8(r4)
> +	ld	r6, VCPU_MMCR + 16(r4)
> +	mtspr	SPRN_MMCR1, r5
> +	mtspr	SPRN_MMCRA, r6
> +	mtspr	SPRN_MMCR0, r3
> +	isync
> +
> +	/* Load up FP, VMX and VSX registers */
> +	bl	kvmppc_load_fp
> +
> +	/* Switch DSCR to guest value */
> +	ld	r5, VCPU_DSCR(r4)
> +	mtspr	SPRN_DSCR, r5
> +
> +	/*
> +	 * Set the decrementer to the guest decrementer.
> +	 */
> +	ld	r8,VCPU_DEC_EXPIRES(r4)
> +	mftb	r7
> +	subf	r3,r7,r8
> +	mtspr	SPRN_DEC,r3
> +	stw	r3,VCPU_DEC(r4)
> +
> +	ld	r5, VCPU_SPRG0(r4)
> +	ld	r6, VCPU_SPRG1(r4)
> +	ld	r7, VCPU_SPRG2(r4)
> +	ld	r8, VCPU_SPRG3(r4)
> +	mtspr	SPRN_SPRG0, r5
> +	mtspr	SPRN_SPRG1, r6
> +	mtspr	SPRN_SPRG2, r7
> +	mtspr	SPRN_SPRG3, r8
> +
> +	/* Save R1 in the PACA */
> +	std	r1, PACA_KVM_SVCPU + SVCPU_HOST_R1(r13)
> +
> +	/* Load up DAR and DSISR */
> +	ld	r5, VCPU_DAR(r4)
> +	lwz	r6, VCPU_DSISR(r4)
> +	mtspr	SPRN_DAR, r5
> +	mtspr	SPRN_DSISR, r6
> +
> +	/* Set partition DABR */
> +	li	r5,3
> +	ld	r6,VCPU_DABR(r4)
> +	mtspr	SPRN_DABRX,r5
> +	mtspr	SPRN_DABR,r6
> +
> +	/* Restore AMR and UAMOR, set AMOR to all 1s */
> +	ld	r5,VCPU_AMR(r4)
> +	ld	r6,VCPU_UAMOR(r4)
> +	li	r7,-1
> +	mtspr	SPRN_AMR,r5
> +	mtspr	SPRN_UAMOR,r6
> +	mtspr	SPRN_AMOR,r7
> +
> +	/* Clear out SLB */
> +	li	r6,0
> +	slbmte	r6,r6
> +	slbia
> +	ptesync
> +
> +	/* Switch to guest partition. */
> +	ld	r9,VCPU_KVM(r4)		/* pointer to struct kvm */
> +	ld	r6,KVM_SDR1(r9)
> +	lwz	r7,KVM_LPID(r9)
> +	li	r0,0x3ff		/* switch to reserved LPID */
> +	mtspr	SPRN_LPID,r0
> +	ptesync
> +	mtspr	SPRN_SDR1,r6		/* switch to partition page =
table */
> +	mtspr	SPRN_LPID,r7
> +	isync
> +	ld	r8,VCPU_LPCR(r4)
> +	mtspr	SPRN_LPCR,r8
> +	isync
> +
> +	/* Check if HDEC expires soon */
> +	mfspr	r3,SPRN_HDEC
> +	cmpwi	r3,10
> +	li	r12,0x980

define

> +	mr	r9,r4
> +	blt	hdec_soon

Is it faster to do the check than to save the check and take the odds? =
Also, maybe we should rather do the check in preemptible code that could =
just directly pass the time slice on.

> +
> +	/*
> +	 * Invalidate the TLB if we could possibly have stale TLB
> +	 * entries for this partition on this core due to the use
> +	 * of tlbiel.
> +	 */
> +	ld	r9,VCPU_KVM(r4)		/* pointer to struct kvm */
> +	lwz	r5,VCPU_VCPUID(r4)
> +	lhz	r6,PACAPACAINDEX(r13)
> +	lhz	r8,VCPU_LAST_CPU(r4)
> +	sldi	r7,r6,1			/* see if this is the same vcpu =
*/
> +	add	r7,r7,r9		/* as last ran on this pcpu */
> +	lhz	r0,KVM_LAST_VCPU(r7)
> +	cmpw	r6,r8			/* on the same cpu core as last =
time? */
> +	bne	3f
> +	cmpw	r0,r5			/* same vcpu as this core last =
ran? */
> +	beq	1f
> +3:	sth	r6,VCPU_LAST_CPU(r4)	/* if not, invalidate partition =
TLB */
> +	sth	r5,KVM_LAST_VCPU(r7)
> +	li	r6,128
> +	mtctr	r6
> +	li	r7,0x800		/* IS field =3D 0b10 */
> +	ptesync
> +2:	tlbiel	r7
> +	addi	r7,r7,0x1000
> +	bdnz	2b
> +	ptesync
> +1:
> +
> +	/* Save purr/spurr */
> +	mfspr	r5,SPRN_PURR
> +	mfspr	r6,SPRN_SPURR
> +	std	r5,PACA_HOST_PURR(r13)
> +	std	r6,PACA_HOST_SPURR(r13)
> +	ld	r7,VCPU_PURR(r4)
> +	ld	r8,VCPU_SPURR(r4)
> +	mtspr	SPRN_PURR,r7
> +	mtspr	SPRN_SPURR,r8
> +
> +	/* Load up guest SLB entries */
> +	lwz	r5,VCPU_SLB_MAX(r4)
> +	cmpwi	r5,0
> +	beq	9f
> +	mtctr	r5
> +	addi	r6,r4,VCPU_SLB
> +1:	ld	r8,VCPU_SLB_E(r6)
> +	ld	r9,VCPU_SLB_V(r6)
> +	slbmte	r9,r8
> +	addi	r6,r6,VCPU_SLB_SIZE
> +	bdnz	1b
> +9:
> +
> +	/* Restore state of CTRL run bit; assume 1 on entry */
> +	lwz	r5,VCPU_CTRL(r4)
> +	andi.	r5,r5,1
> +	bne	4f
> +	mfspr	r6,SPRN_CTRLF
> +	clrrdi	r6,r6,1
> +	mtspr	SPRN_CTRLT,r6
> +4:
> +	ld	r6, VCPU_CTR(r4)
> +	lwz	r7, VCPU_XER(r4)
> +
> +	mtctr	r6
> +	mtxer	r7
> +
> +	/* Move SRR0 and SRR1 into the respective regs */
> +	ld	r6, VCPU_SRR0(r4)
> +	ld	r7, VCPU_SRR1(r4)
> +	mtspr	SPRN_SRR0, r6
> +	mtspr	SPRN_SRR1, r7
> +
> +	ld	r10, VCPU_PC(r4)
> +
> +	ld	r11, VCPU_MSR(r4)	/* r10 =3D vcpu->arch.msr & =
~MSR_HV */
> +	rldicl	r11, r11, 63 - MSR_HV_LG, 1
> +	rotldi	r11, r11, 1 + MSR_HV_LG
> +	ori	r11, r11, MSR_ME
> +
> +fast_guest_return:
> +	mtspr	SPRN_HSRR0,r10
> +	mtspr	SPRN_HSRR1,r11
> +
> +	/* Activate guest mode, so faults get handled by KVM */
> +	li	r9, KVM_GUEST_MODE_GUEST
> +	stb	r9, (SHADOW_VCPU_OFF + SVCPU_IN_GUEST)(r13)
> +
> +	/* Enter guest */
> +
> +	ld	r5, VCPU_LR(r4)
> +	lwz	r6, VCPU_CR(r4)
> +	mtlr	r5
> +	mtcr	r6
> +
> +	ld	r0, VCPU_GPR(r0)(r4)
> +	ld	r1, VCPU_GPR(r1)(r4)
> +	ld	r2, VCPU_GPR(r2)(r4)
> +	ld	r3, VCPU_GPR(r3)(r4)
> +	ld	r5, VCPU_GPR(r5)(r4)
> +	ld	r6, VCPU_GPR(r6)(r4)
> +	ld	r7, VCPU_GPR(r7)(r4)
> +	ld	r8, VCPU_GPR(r8)(r4)
> +	ld	r9, VCPU_GPR(r9)(r4)
> +	ld	r10, VCPU_GPR(r10)(r4)
> +	ld	r11, VCPU_GPR(r11)(r4)
> +	ld	r12, VCPU_GPR(r12)(r4)
> +	ld	r13, VCPU_GPR(r13)(r4)
> +
> +	ld	r4, VCPU_GPR(r4)(r4)
> +
> +	hrfid
> +	b	.
> +kvmppc_handler_trampoline_enter_end:
> +
> +
> =
+/************************************************************************=
******
> + *                                                                    =
        *
> + *                               Exit code                            =
        *
> + *                                                                    =
        *
> + =
**************************************************************************=
***/
> +
> +/*
> + * We come here from the first-level interrupt handlers.
> + */
> +	.globl	kvmppc_interrupt
> +kvmppc_interrupt:
> +	/*
> +	 * Register contents:
> +	 * R12		=3D interrupt vector
> +	 * R13		=3D PACA
> +	 * guest CR, R12 saved in shadow VCPU SCRATCH1/0
> +	 * guest R13 saved in SPRN_SCRATCH0
> +	 */
> +	/* abuse host_r2 as third scratch area; we get r2 from =
PACATOC(r13) */
> +	std	r9, (SHADOW_VCPU_OFF + SVCPU_HOST_R2)(r13)
> +	ld	r9, PACA_KVM_VCPU(r13)
> +
> +	/* Save registers */
> +
> +	std	r0, VCPU_GPR(r0)(r9)
> +	std	r1, VCPU_GPR(r1)(r9)
> +	std	r2, VCPU_GPR(r2)(r9)
> +	std	r3, VCPU_GPR(r3)(r9)
> +	std	r4, VCPU_GPR(r4)(r9)
> +	std	r5, VCPU_GPR(r5)(r9)
> +	std	r6, VCPU_GPR(r6)(r9)
> +	std	r7, VCPU_GPR(r7)(r9)
> +	std	r8, VCPU_GPR(r8)(r9)
> +	ld	r0, (SHADOW_VCPU_OFF + SVCPU_HOST_R2)(r13)
> +	std	r0, VCPU_GPR(r9)(r9)
> +	std	r10, VCPU_GPR(r10)(r9)
> +	std	r11, VCPU_GPR(r11)(r9)
> +	ld	r3, (SHADOW_VCPU_OFF + SVCPU_SCRATCH0)(r13)
> +	lwz	r4, (SHADOW_VCPU_OFF + SVCPU_SCRATCH1)(r13)
> +	std	r3, VCPU_GPR(r12)(r9)
> +	stw	r4, VCPU_CR(r9)
> +
> +	/* Restore R1/R2 so we can handle faults */
> +	ld	r1, (SHADOW_VCPU_OFF + SVCPU_HOST_R1)(r13)
> +	ld	r2, PACATOC(r13)
> +
> +	mfspr	r10, SPRN_SRR0
> +	mfspr	r11, SPRN_SRR1
> +	std	r10, VCPU_SRR0(r9)
> +	std	r11, VCPU_SRR1(r9)
> +	andi.	r0, r12, 2		/* need to read HSRR0/1? */
> +	beq	1f
> +	mfspr	r10, SPRN_HSRR0
> +	mfspr	r11, SPRN_HSRR1
> +	clrrdi	r12, r12, 2
> +1:	std	r10, VCPU_PC(r9)
> +	std	r11, VCPU_MSR(r9)
> +
> +	GET_SCRATCH0(r3)
> +	mflr	r4
> +	std	r3, VCPU_GPR(r13)(r9)
> +	std	r4, VCPU_LR(r9)
> +
> +	/* Unset guest mode */
> +	li	r0, KVM_GUEST_MODE_NONE
> +	stb	r0, (SHADOW_VCPU_OFF + SVCPU_IN_GUEST)(r13)
> +
> +	stw	r12,VCPU_TRAP(r9)
> +
> +	/* See if this is a leftover HDEC interrupt */
> +	cmpwi	r12,0x980

define

> +	bne	2f
> +	mfspr	r3,SPRN_HDEC
> +	cmpwi	r3,0
> +	bge	ignore_hdec
> +2:
> +
> +	/* Check for mediated interrupts (could be done earlier really =
...) */
> +	cmpwi	r12,0x500

define

> +	bne+	1f
> +	ld	r5,VCPU_LPCR(r9)
> +	andi.	r0,r11,MSR_EE
> +	beq	1f
> +	andi.	r0,r5,LPCR_MER
> +	bne	bounce_ext_interrupt

So there's no need for the external check that directly goes into the =
Linux handler code on full-fledged exits?

> +1:
> +
> +	/* Save DEC */
> +	mfspr	r5,SPRN_DEC
> +	mftb	r6
> +	extsw	r5,r5
> +	add	r5,r5,r6
> +	std	r5,VCPU_DEC_EXPIRES(r9)
> +
> +	/* Save HEIR (in last_inst) if this is a HEI (e40) */
> +	li	r3,-1
> +	cmpwi	r12,0xe40
> +	bne	11f
> +	mfspr	r3,SPRN_HEIR
> +11:	stw	r3,VCPU_LAST_INST(r9)
> +=09
> +	/* Save more register state  */
> +	mfxer	r5
> +	mfdar	r6
> +	mfdsisr	r7
> +	mfctr	r8
> +
> +	stw	r5, VCPU_XER(r9)
> +	std	r6, VCPU_DAR(r9)
> +	stw	r7, VCPU_DSISR(r9)
> +	std	r8, VCPU_CTR(r9)
> +	cmpwi	r12,0xe00		/* grab HDAR & HDSISR if HDSI */
> +	beq	6f
> +7:	std	r6, VCPU_FAULT_DAR(r9)
> +	stw	r7, VCPU_FAULT_DSISR(r9)
> +
> +	/* Save guest CTRL register, set runlatch to 1 */
> +	mfspr	r6,SPRN_CTRLF
> +	stw	r6,VCPU_CTRL(r9)
> +	andi.	r0,r6,1
> +	bne	4f
> +	ori	r6,r6,1
> +	mtspr	SPRN_CTRLT,r6
> +4:
> +	/* Read the guest SLB and save it away */
> +	li	r6,0
> +	addi	r7,r9,VCPU_SLB
> +	li	r5,0
> +1:	slbmfee	r8,r6
> +	andis.	r0,r8,SLB_ESID_V@h
> +	beq	2f
> +	add	r8,r8,r6		/* put index in */
> +	slbmfev	r3,r6
> +	std	r8,VCPU_SLB_E(r7)
> +	std	r3,VCPU_SLB_V(r7)
> +	addi	r7,r7,VCPU_SLB_SIZE
> +	addi	r5,r5,1
> +2:	addi	r6,r6,1
> +	cmpwi	r6,32

I don't like how the 32 is hardcoded here. Better create a define for it =
and use the same in the init code.

> +	blt	1b
> +	stw	r5,VCPU_SLB_MAX(r9)
> +
> +	/*
> +	 * Save the guest PURR/SPURR
> +	 */
> +	mfspr	r5,SPRN_PURR
> +	mfspr	r6,SPRN_SPURR
> +	ld	r7,VCPU_PURR(r9)
> +	ld	r8,VCPU_SPURR(r9)
> +	std	r5,VCPU_PURR(r9)
> +	std	r6,VCPU_SPURR(r9)
> +	subf	r5,r7,r5
> +	subf	r6,r8,r6
> +
> +	/*
> +	 * Restore host PURR/SPURR and add guest times
> +	 * so that the time in the guest gets accounted.
> +	 */
> +	ld	r3,PACA_HOST_PURR(r13)
> +	ld	r4,PACA_HOST_SPURR(r13)
> +	add	r3,r3,r5
> +	add	r4,r4,r6
> +	mtspr	SPRN_PURR,r3
> +	mtspr	SPRN_SPURR,r4
> +
> +	/* Clear out SLB */
> +	li	r5,0
> +	slbmte	r5,r5
> +	slbia
> +	ptesync
> +
> +hdec_soon:
> +	/* Switch back to host partition */
> +	ld	r4,VCPU_KVM(r9)		/* pointer to struct kvm */
> +	ld	r6,KVM_HOST_SDR1(r4)
> +	lwz	r7,KVM_HOST_LPID(r4)
> +	li	r8,0x3ff		/* switch to reserved LPID */

is it reserved by ISA? Either way, hard-coding the constant without a =
name is not nice :).

> +	mtspr	SPRN_LPID,r8
> +	ptesync
> +	mtspr	SPRN_SDR1,r6		/* switch to partition page =
table */
> +	mtspr	SPRN_LPID,r7
> +	isync
> +	lis	r8,0x7fff

INT_MAX@h might be more readable.

> +	mtspr	SPRN_HDEC,r8
> +
> +	ld	r8,KVM_HOST_LPCR(r4)
> +	mtspr	SPRN_LPCR,r8
> +	isync
> +
> +	/* load host SLB entries */
> +	ld	r8,PACA_SLBSHADOWPTR(r13)
> +
> +	.rept	SLB_NUM_BOLTED
> +	ld	r5,SLBSHADOW_SAVEAREA(r8)
> +	ld	r6,SLBSHADOW_SAVEAREA+8(r8)
> +	andis.	r7,r5,SLB_ESID_V@h
> +	beq	1f
> +	slbmte	r6,r5
> +1:	addi	r8,r8,16
> +	.endr
> +
> +	/* Save and reset AMR and UAMOR before turning on the MMU */
> +	mfspr	r5,SPRN_AMR
> +	mfspr	r6,SPRN_UAMOR
> +	std	r5,VCPU_AMR(r9)
> +	std	r6,VCPU_UAMOR(r9)
> +	li	r6,0
> +	mtspr	SPRN_AMR,r6
> +
> +	/* Restore host DABR and DABRX */
> +	ld	r5,PACA_DABR(r13)
> +	li	r6,7
> +	mtspr	SPRN_DABR,r5
> +	mtspr	SPRN_DABRX,r6
> +
> +	/* Switch DSCR back to host value */
> +	mfspr	r8, SPRN_DSCR
> +	ld	r7, PACA_HOST_DSCR(r13)
> +	std	r8, VCPU_DSCR(r7)
> +	mtspr	SPRN_DSCR, r7
> +
> +	/* Save non-volatile GPRs */
> +	std	r14, VCPU_GPR(r14)(r9)
> +	std	r15, VCPU_GPR(r15)(r9)
> +	std	r16, VCPU_GPR(r16)(r9)
> +	std	r17, VCPU_GPR(r17)(r9)
> +	std	r18, VCPU_GPR(r18)(r9)
> +	std	r19, VCPU_GPR(r19)(r9)
> +	std	r20, VCPU_GPR(r20)(r9)
> +	std	r21, VCPU_GPR(r21)(r9)
> +	std	r22, VCPU_GPR(r22)(r9)
> +	std	r23, VCPU_GPR(r23)(r9)
> +	std	r24, VCPU_GPR(r24)(r9)
> +	std	r25, VCPU_GPR(r25)(r9)
> +	std	r26, VCPU_GPR(r26)(r9)
> +	std	r27, VCPU_GPR(r27)(r9)
> +	std	r28, VCPU_GPR(r28)(r9)
> +	std	r29, VCPU_GPR(r29)(r9)
> +	std	r30, VCPU_GPR(r30)(r9)
> +	std	r31, VCPU_GPR(r31)(r9)
> +
> +	/* Save SPRGs */
> +	mfspr	r3, SPRN_SPRG0
> +	mfspr	r4, SPRN_SPRG1
> +	mfspr	r5, SPRN_SPRG2
> +	mfspr	r6, SPRN_SPRG3
> +	std	r3, VCPU_SPRG0(r9)
> +	std	r4, VCPU_SPRG1(r9)
> +	std	r5, VCPU_SPRG2(r9)
> +	std	r6, VCPU_SPRG3(r9)
> +
> +	/* Save PMU registers */
> +	li	r3, 1
> +	sldi	r3, r3, 31		/* MMCR0_FC (freeze counters) =
bit */
> +	mfspr	r4, SPRN_MMCR0		/* save MMCR0 */
> +	mtspr	SPRN_MMCR0, r3		/* freeze all counters, disable =
ints */
> +	isync
> +	mfspr	r5, SPRN_MMCR1
> +	mfspr	r6, SPRN_MMCRA
> +	std	r4, VCPU_MMCR(r9)
> +	std	r5, VCPU_MMCR + 8(r9)
> +	std	r6, VCPU_MMCR + 16(r9)
> +	mfspr	r3, SPRN_PMC1
> +	mfspr	r4, SPRN_PMC2
> +	mfspr	r5, SPRN_PMC3
> +	mfspr	r6, SPRN_PMC4
> +	mfspr	r7, SPRN_PMC5
> +	mfspr	r8, SPRN_PMC6
> +	stw	r3, VCPU_PMC(r9)
> +	stw	r4, VCPU_PMC + 4(r9)
> +	stw	r5, VCPU_PMC + 8(r9)
> +	stw	r6, VCPU_PMC + 12(r9)
> +	stw	r7, VCPU_PMC + 16(r9)
> +	stw	r8, VCPU_PMC + 20(r9)
> +22:
> +	/* save FP state */
> +	mr	r3, r9
> +	bl	.kvmppc_save_fp
> +
> +	/* RFI into the highmem handler */
> +	mfmsr	r7
> +	ori	r7, r7, MSR_IR|MSR_DR|MSR_RI|MSR_ME	/* Enable paging =
*/
> +	mtsrr1	r7
> +	/* Load highmem handler address */
> +	ld	r8, VCPU_HIGHMEM_HANDLER(r3)
> +	mtsrr0	r8
> +
> +	RFI
> +kvmppc_handler_trampoline_exit_end:
> +
> +6:	mfspr	r6,SPRN_HDAR
> +	mfspr	r7,SPRN_HDSISR
> +	b	7b
> +
> +ignore_hdec:
> +	mr	r4,r9
> +	b	fast_guest_return
> +
> +bounce_ext_interrupt:
> +	mr	r4,r9
> +	mtspr	SPRN_SRR0,r10
> +	mtspr	SPRN_SRR1,r11
> +	li	r10,0x500
> +	LOAD_REG_IMMEDIATE(r11,MSR_SF | MSR_ME);
> +	b	fast_guest_return
> diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> index 4b54148..e5e3f92 100644
> --- a/arch/powerpc/kvm/powerpc.c
> +++ b/arch/powerpc/kvm/powerpc.c
> @@ -38,8 +38,12 @@
>=20
> int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
> {
> -	return !(v->arch.shared->msr & MSR_WE) ||
> +#ifndef CONFIG_KVM_BOOK3S_64_HV
> +	return !(kvmppc_get_msr(v) & MSR_WE) ||
> 	       !!(v->arch.pending_exceptions);
> +#else
> +	return 1;

So what happens if the guest sets MSR_WE? It just stays in guest context =
idling? That'd be pretty bad for scheduling on the host.

> +#endif
> }
>=20
> int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
> @@ -52,7 +56,7 @@ int kvmppc_kvm_pv(struct kvm_vcpu *vcpu)
> 	unsigned long __maybe_unused param4 =3D kvmppc_get_gpr(vcpu, 6);
> 	unsigned long r2 =3D 0;
>=20
> -	if (!(vcpu->arch.shared->msr & MSR_SF)) {
> +	if (!(kvmppc_get_msr(vcpu) & MSR_SF)) {
> 		/* 32 bit mode */
> 		param1 &=3D 0xffffffff;
> 		param2 &=3D 0xffffffff;
> @@ -184,12 +188,14 @@ int kvm_dev_ioctl_check_extension(long ext)
> 	case KVM_CAP_PPC_IRQ_LEVEL:
> 	case KVM_CAP_ENABLE_CAP:
> 	case KVM_CAP_PPC_OSI:
> +#ifndef CONFIG_KVM_BOOK3S_64_HV

You also don't do OSI on HV :).

> 	case KVM_CAP_PPC_GET_PVINFO:
> 		r =3D 1;
> 		break;
> 	case KVM_CAP_COALESCED_MMIO:
> 		r =3D KVM_COALESCED_MMIO_PAGE_OFFSET;
> 		break;
> +#endif
> 	default:
> 		r =3D 0;
> 		break;
> @@ -286,6 +292,7 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)
> 	hrtimer_init(&vcpu->arch.dec_timer, CLOCK_REALTIME, =
HRTIMER_MODE_ABS);
> 	tasklet_init(&vcpu->arch.tasklet, kvmppc_decrementer_func, =
(ulong)vcpu);
> 	vcpu->arch.dec_timer.function =3D kvmppc_decrementer_wakeup;
> +	vcpu->arch.dec_expires =3D ~(u64)0;
>=20
> 	return 0;
> }
> @@ -474,6 +481,13 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu =
*vcpu, struct kvm_run *run)
> 		for (i =3D 0; i < 32; i++)
> 			kvmppc_set_gpr(vcpu, i, gprs[i]);
> 		vcpu->arch.osi_needed =3D 0;
> +	} else if (vcpu->arch.hcall_needed) {
> +		int i;
> +
> +		kvmppc_set_gpr(vcpu, 3, run->papr_hcall.ret);
> +		for (i =3D 0; i < 6; ++i)
> +			kvmppc_set_gpr(vcpu, 4 + i, =
run->papr_hcall.args[i]);
> +		vcpu->arch.hcall_needed =3D 0;
> 	}
>=20
> 	kvmppc_core_deliver_interrupts(vcpu);
> @@ -496,8 +510,11 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu =
*vcpu, struct kvm_interrupt *irq)
> 	if (waitqueue_active(&vcpu->wq)) {
> 		wake_up_interruptible(&vcpu->wq);
> 		vcpu->stat.halt_wakeup++;
> +#ifdef CONFIG_KVM_BOOK3S_64_HV
> +	} else if (vcpu->cpu !=3D -1) {
> +		smp_send_reschedule(vcpu->cpu);

Shouldn't this be done for non-HV too? The only reason we don't do it =
yet is because we don't do SMP, no?

> +#endif
> 	}
> -

eh... :)

> 	return 0;
> }
>=20
> diff --git a/arch/powerpc/kvm/trace.h b/arch/powerpc/kvm/trace.h
> index d62a14b..e5c99b8 100644
> --- a/arch/powerpc/kvm/trace.h
> +++ b/arch/powerpc/kvm/trace.h
> @@ -103,7 +103,7 @@ TRACE_EVENT(kvm_gtlb_write,
>  *                         Book3S trace points                         =
  *
>  =
*************************************************************************/=

>=20
> -#ifdef CONFIG_PPC_BOOK3S
> +#ifdef CONFIG_KVM_BOOK3S_NONHV
>=20
> TRACE_EVENT(kvm_book3s_exit,
> 	TP_PROTO(unsigned int exit_nr, struct kvm_vcpu *vcpu),
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index ea2dc1a..a4447ce 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -161,6 +161,7 @@ struct kvm_pit_config {
> #define KVM_EXIT_NMI              16
> #define KVM_EXIT_INTERNAL_ERROR   17
> #define KVM_EXIT_OSI              18
> +#define KVM_EXIT_PAPR_HCALL	  19
>=20
> /* For KVM_EXIT_INTERNAL_ERROR */
> #define KVM_INTERNAL_ERROR_EMULATION 1
> @@ -264,6 +265,11 @@ struct kvm_run {
> 		struct {
> 			__u64 gprs[32];
> 		} osi;
> +		struct {
> +			__u64 nr;
> +			__u64 ret;
> +			__u64 args[9];
> +		} papr_hcall;

This needs some information in the documentation.


Alex

PS: I CC'ed kvm-ppc@vger. Please make sure to CC that mailing list, so =
people interested in kvm on ppc get your patches as well :).

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: James Morris @ 2011-05-16  0:36 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, Linus Torvalds, Ingo Molnar, kees.cook,
	Serge E. Hallyn, Peter Zijlstra, Steven Rostedt, Tejun Heo,
	Thomas Gleixner, linux-arm-kernel, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110513121034.GG21022@elte.hu>

On Fri, 13 May 2011, Ingo Molnar wrote:

> Say i'm a user-space sandbox developer who wants to enforce that sandboxed code 
> should only be allowed to open files in /home/sandbox/, /lib/ and /usr/lib/.
> 
> It is a simple and sensible security feature, agreed? It allows most code to 
> run well and link to countless libraries - but no access to other files is 
> allowed.

Not really.

Firstly, what is the security goal of these restrictions?  Then, are the 
restrictions complete and unbypassable?

How do you reason about the behavior of the system as a whole?

> I argue that this is the LSM and audit subsystems designed right: in the long 
> run it could allow everything that LSM does at the moment - and so much more 
> ...

Now you're proposing a redesign of the security subsystem.  That's a 
significant undertaking.

In the meantime, we have a simple, well-defined enhancement to seccomp 
which will be very useful to current users in reducing their kernel attack 
surface.

We should merge that, and the security subsystem discussion can carry on 
separately.

- James
-- 
James Morris
<jmorris@namei.org>

^ permalink raw reply

* Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode
From: Paul Mackerras @ 2011-05-16  1:07 UTC (permalink / raw)
  To: Avi Kivity; +Cc: linuxppc-dev, Alexander Graf, kvm
In-Reply-To: <4DCBA345.6090006@redhat.com>

On Thu, May 12, 2011 at 12:07:17PM +0300, Avi Kivity wrote:
> On 05/11/2011 01:44 PM, Paul Mackerras wrote:

> >--- a/include/linux/kvm.h
> >+++ b/include/linux/kvm.h
> >@@ -161,6 +161,7 @@ struct kvm_pit_config {
> >  #define KVM_EXIT_NMI              16
> >  #define KVM_EXIT_INTERNAL_ERROR   17
> >  #define KVM_EXIT_OSI              18
> >+#define KVM_EXIT_PAPR_HCALL	  19
> >
> >  /* For KVM_EXIT_INTERNAL_ERROR */
> >  #define KVM_INTERNAL_ERROR_EMULATION 1
> >@@ -264,6 +265,11 @@ struct kvm_run {
> >  		struct {
> >  			__u64 gprs[32];
> >  		} osi;
> >+		struct {
> >+			__u64 nr;
> >+			__u64 ret;
> >+			__u64 args[9];
> >+		} papr_hcall;
> >  		/* Fix the size of the union. */
> >  		char padding[256];
> >  	};
> 
> Please document this in Documentation/kvm/api.txt.

I'll add a description of the basic calling convention in the next
version of the patches.  The full description of all the possible
hypercalls is in the PAPR version 2.4 document (826 pages) on the
www.power.org website.  You have to become a power.org member to
download it, but membership is free for individual developers.

Paul.

^ permalink raw reply

* Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode
From: Paul Mackerras @ 2011-05-16  5:58 UTC (permalink / raw)
  To: Alexander Graf; +Cc: Linuxppc-dev, kvm-ppc, KVM list
In-Reply-To: <BDD550B2-02CA-4B1A-A05B-6CB3EF89B7B3@suse.de>

On Sun, May 15, 2011 at 11:58:12PM +0200, Alexander Graf wrote:
> 
> On 11.05.2011, at 12:44, Paul Mackerras wrote:

> > +#ifdef CONFIG_KVM_BOOK3S_NONHV
> 
> I really liked how you called the .c file _pr - why call it NONHV now?

I agree, CONFIG_KVM_BOOK3S_PR would be better, I'll change it.

> > diff --git a/arch/powerpc/include/asm/paca.h b/arch/powerpc/include/asm/paca.h
> > index 7412676..8dba5f6 100644
> > --- a/arch/powerpc/include/asm/paca.h
> > +++ b/arch/powerpc/include/asm/paca.h
> > @@ -149,6 +149,16 @@ struct paca_struct {
> > #ifdef CONFIG_KVM_BOOK3S_HANDLER
> > 	/* We use this to store guest state in */
> > 	struct kvmppc_book3s_shadow_vcpu shadow_vcpu;
> > +#ifdef CONFIG_KVM_BOOK3S_64_HV
> > +	struct kvm_vcpu *kvm_vcpu;
> > +	u64 dabr;
> > +	u64 host_mmcr[3];
> > +	u32 host_pmc[6];
> > +	u64 host_purr;
> > +	u64 host_spurr;
> > +	u64 host_dscr;
> > +	u64 dec_expires;
> 
> Hrm. I'd say either push those into shadow_vcpu for HV mode or get
> rid of the shadow_vcpu reference. I'd probably prefer the former.

These are fields that are pieces of host state that we need to save
and restore across execution of a guest; they don't apply to any
specific guest or vcpu.  That's why I didn't put them in shadow_vcpu,
which is specifically for one vcpu in one guest.  

Given that book3s_pr copies the shadow_vcpu into and out of the paca,
I thought it best not to add fields to it that are only live while we
are in the guest.  True, these fields only exist for book3s_hv, but if
we later on make it possible to select book3s_pr vs. book3s_hv at
runtime, we won't want to be copying these fields into and out of the
paca when book3s_pr is active.

Maybe we need another struct, kvm_host_state or something like that,
to save this sort of state.

> > @@ -65,6 +98,7 @@ config KVM_440
> > 	bool "KVM support for PowerPC 440 processors"
> > 	depends on EXPERIMENTAL && 44x
> > 	select KVM
> > +	select KVM_MMIO
> 
> e500 should also select MMIO, no?

Good point, I'll fix that.

> > +long kvmppc_alloc_hpt(struct kvm *kvm)
> > +{
> > +	unsigned long hpt;
> > +	unsigned long lpid;
> > +
> > +	hpt = __get_free_pages(GFP_KERNEL|__GFP_ZERO|__GFP_REPEAT|__GFP_NOWARN,
> > +			       HPT_ORDER - PAGE_SHIFT);
> 
> This would end up failing quite often, no?

In practice it seems to be OK, possibly because the machines we're
testing this on have plenty of memory.  Maybe we should get qemu to
allocate the HPT using hugetlbfs so the memory will come from the
reserved page pool.  It does need to be physically contiguous and
aligned on a multiple of its size -- that's a hardware requirement.

> > +	kvm->arch.sdr1 = __pa(hpt) | (HPT_ORDER - 18);
> > +	kvm->arch.lpid = lpid;
> > +	kvm->arch.host_sdr1 = mfspr(SPRN_SDR1);
> > +	kvm->arch.host_lpid = mfspr(SPRN_LPID);
> > +	kvm->arch.host_lpcr = mfspr(SPRN_LPCR);
> 
> How do these correlate with the guest's hv mmu? I'd like to keep the
> code clean enough so we can potentially use it for PR mode as well :). 

The host SDR1 and LPID are different from the guest's.  That is, the
guest has its own HPT which is quite separate from the host's.  The
host values could be saved in global variables, though; there's no
real need for each VM to have its own copy, except that doing it this
way simplifies the low-level assembly code a little.

> > +	/* First see what page size we have */
> > +	psize = user_page_size(mem->userspace_addr);
> > +	/* For now, only allow 16MB pages */
> 
> The reason to go for 16MB pages is because of the host mmu code, not
> the guest hv mmu. So please at least #ifdef the code to HV so we
> document that correlation. 

I'm not sure what you mean by that.  The reason for going for 16MB
pages initially is for performance (this way the guest can use 16MB
pages for its linear mapping) and to avoid having to deal with the
pages being paged or swapped on the host side.  Could you explain the
"because of the host mmu code" part of your comment further?

What would adding #ifdef CONFIG_KVM_BOOK3S_64_HV achieve in a file
whose name ends in _hv.c and which only gets compiled when
CONFIG_KVM_BOOK3S_64_HV=y?

> > +int kvmppc_mmu_hv_init(void)
> > +{
> > +	if (!cpu_has_feature(CPU_FTR_HVMODE_206))
> > +		return 0;
> 
> Return 0 for "it doesn't work" might not be the right exit code ;).

Good point, I'll make it -ENXIO or something.

> > +static int kvmppc_mmu_book3s_64_hv_xlate(struct kvm_vcpu *vcpu, gva_t eaddr,
> > +				struct kvmppc_pte *gpte, bool data)
> > +{
> > +	return -ENOENT;
> 
> Can't you just call the normal book3s_64 mmu code here? Without
> xlate, doing ppc_ld doesn't work, which means that reading out the
> faulting guest instruction breaks. We'd also need it for PR mode :). 

With book3s_hv we currently have no situations where we need to read
out the faulting guest instruction.  

We could use the normal code here if we had the guest HPT mapped into
the qemu address space, which it currently isn't -- but should be.  It
hasn't been a priority to fix because with book3s_hv we currently have
no situations where we need to read out the faulting guest
instruction.

> > +void kvmppc_set_pvr(struct kvm_vcpu *vcpu, u32 pvr)
> > +{
> > +	vcpu->arch.pvr = pvr;
> > +	kvmppc_mmu_book3s_hv_init(vcpu);
> 
> No need for you to do it depending on pvr. You can just do the mmu
> initialization on normal init :).

OK, I'll do that.

> > +	case BOOK3S_INTERRUPT_PROGRAM:
> > +	{
> > +		ulong flags;
> > +		/*
> > +		 * Normally program interrupts are delivered directly
> > +		 * to the guest by the hardware, but we can get here
> > +		 * as a result of a hypervisor emulation interrupt
> > +		 * (e40) getting turned into a 700 by BML RTAS.
> 
> Not sure I fully understand what's going on there. Mind to explain? :)

Recent versions of the architecture don't actually deliver a 0x700
interrupt to the OS on an illegal instruction; instead they generate
an 0xe40 interrupt to the hypervisor in case the hypervisor wants to
emulate the instruction.  If the hypervisor doesn't want to do that
it's supposed to synthesize a 0x700 interrupt to the OS.

When we're running this stuff under our BML (Bare Metal Linux)
framework in the lab, we use a small RTAS implementation that gets
loaded by the BML loader, and one of the things that this RTAS does is
to patch the 0xe40 vector to make the 0xe40 interrupt come in via the
0x700 vector, in case the kernel you're running under BML hasn't been
updated to have an 0xe40 handler.

I could just remove that case, in fact.

> > +	case BOOK3S_INTERRUPT_SYSCALL:
> > +	{
> > +		/* hcall - punt to userspace */
> > +		int i;
> > +
> 
> Do we really want to accept sc from pr=1? I'd just reject them straight away.

Good idea, I'll do that.

> > +	case BOOK3S_INTERRUPT_H_DATA_STORAGE:
> > +		vcpu->arch.dsisr = vcpu->arch.fault_dsisr;
> > +		vcpu->arch.dear = vcpu->arch.fault_dar;
> > +		kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_DATA_STORAGE, 0);
> > +		r = RESUME_GUEST;
> > +		break;
> > +	case BOOK3S_INTERRUPT_H_INST_STORAGE:
> > +		kvmppc_inject_interrupt(vcpu, BOOK3S_INTERRUPT_INST_STORAGE,
> > +					0x08000000);
> 
> What do these do? I thought the guest gets DSI and ISI directly?

It does for accesses with relocation (IR/DR) on, but because we have
enabled VRMA mode (Virtualized Real Mode Area) in the LPCR, we get
these interrupts to the hypervisor if the guest does a bad real-mode
access.  If we also turned on Virtualized Partition Memory (VPM) mode
in the LPCR, then all ISI/DSI in the guest come through to the
hypervisor using these vectors in case the hypervisor wants to do any
paging/swapping of guest memory.  I plan to do that later when we
support using 4k/64k pages for guest memory.

> > +	/* default to book3s_64 (power7) */
> > +	vcpu->arch.pvr = 0x3f0200;
> 
> Wouldn't it make sense to simply default to the host pvr? Not sure -
> maybe it's not :).

Sounds sensible, actually.  In fact the hypervisor can't spoof the PVR
for the guest, that is, the guest can read the real PVR value and
there's no way the hypervisor can stop it.

> > +	flush_fp_to_thread(current);
> 
> Do those with fine with preemption enabled?

Yes.  If a preemption happens, it will flush the FP/VMX/VSX registers
out to the thread_struct, and then any of these explicit calls that
happen after the preemption will do nothing.

> > +	/*
> > +	 * Put whatever is in the decrementer into the
> > +	 * hypervisor decrementer.
> > +	 */
> > +	mfspr	r8,SPRN_DEC
> > +	mftb	r7
> > +	mtspr	SPRN_HDEC,r8
> 
> Can't we just always use HDEC on the host? That's save us from all
> the swapping here.

The problem is that there is only one HDEC per core, so using it would
become complicated when the host is running in SMT4 or SMT2 mode.

> > +	extsw	r8,r8
> > +	add	r8,r8,r7
> > +	std	r8,PACA_KVM_DECEXP(r13)
> 
> Why is dec+hdec on vcpu_run decexp? What exactly does this store?

R7 here is the current (or very recent, anyway) timebase value, so
this stores the timebase value at which the host decrementer would get
to zero.

> > +	lwz	r3, PACA_HOST_PMC(r13)
> > +	lwz	r4, PACA_HOST_PMC + 4(r13)
> > +	lwz	r5, PACA_HOST_PMC + 4(r13)
> 
> copy&paste error?

Yes, thanks.

> > +.global kvmppc_handler_lowmem_trampoline
> > +kvmppc_handler_lowmem_trampoline:
> > +	cmpwi	r12,0x500
> > +	beq	1f
> > +	cmpwi	r12,0x980
> > +	beq	1f
> 
> What?
> 
> 1) use the macros please

OK

> 2) why?

The external interrupt and hypervisor decrementer interrupt handlers
expect the interrupted PC and MSR to be in HSRR0/1 rather than
SRR0/1.  I could remove the case for 0x980 since we don't call the
linux handler for HDEC interrupts any more.

> > +	/* Check if HDEC expires soon */
> > +	mfspr	r3,SPRN_HDEC
> > +	cmpwi	r3,10
> > +	li	r12,0x980
> 
> define

OK.

> > +	mr	r9,r4
> > +	blt	hdec_soon
> 
> Is it faster to do the check than to save the check and take the
> odds? Also, maybe we should rather do the check in preemptible
> code that could just directly pass the time slice on.

I do the check there because I was having problems where, if the HDEC
goes negative before we do the partition switch, we would occasionally
not get the HDEC interrupt at all until the next time HDEC went
negative, ~ 8.4 seconds later.

> > +	/* See if this is a leftover HDEC interrupt */
> > +	cmpwi	r12,0x980
> 
> define

OK.

> > +	bne	2f
> > +	mfspr	r3,SPRN_HDEC
> > +	cmpwi	r3,0
> > +	bge	ignore_hdec
> > +2:
> > +
> > +	/* Check for mediated interrupts (could be done earlier really ...) */
> > +	cmpwi	r12,0x500
> 
> define

OK.

> > +	bne+	1f
> > +	ld	r5,VCPU_LPCR(r9)
> > +	andi.	r0,r11,MSR_EE
> > +	beq	1f
> > +	andi.	r0,r5,LPCR_MER
> > +	bne	bounce_ext_interrupt
> 
> So there's no need for the external check that directly goes into
> the Linux handler code on full-fledged exits?

No, we still need that; ordinary external interrupts come to the
hypervisor regardless of the guest's MSR.EE setting.

The MER bit (Mediated External Request) is a way for the hypervisor to
know when the guest sets its MSR.EE bit.  If an event happens that
means the host wants to give a guest vcpu an external interrupt, but
the guest vcpu has MSR.EE = 0, then the host can't deliver the
simulated external interrupt.  Instead it sets LPCR.MER, which has the
effect that the hardware will deliver an external interrupt (to the
hypervisor) when running in the guest and it has MSR.EE = 1.

So, when we get an external interrupt, LPCR.MER = 1 and MSR.EE = 1,
we need to synthesize an external interrupt in the guest.  Doing it
here means that we don't need to do the full partition switch out to
the host and back again.

> > +	/* Read the guest SLB and save it away */
> > +	li	r6,0
> > +	addi	r7,r9,VCPU_SLB
> > +	li	r5,0
> > +1:	slbmfee	r8,r6
> > +	andis.	r0,r8,SLB_ESID_V@h
> > +	beq	2f
> > +	add	r8,r8,r6		/* put index in */
> > +	slbmfev	r3,r6
> > +	std	r8,VCPU_SLB_E(r7)
> > +	std	r3,VCPU_SLB_V(r7)
> > +	addi	r7,r7,VCPU_SLB_SIZE
> > +	addi	r5,r5,1
> > +2:	addi	r6,r6,1
> > +	cmpwi	r6,32
> 
> I don't like how the 32 is hardcoded here. Better create a define
> for it and use the same in the init code.

Sure.  In fact I probably should use vcpu->arch.slb_nr here.

> > +hdec_soon:
> > +	/* Switch back to host partition */
> > +	ld	r4,VCPU_KVM(r9)		/* pointer to struct kvm */
> > +	ld	r6,KVM_HOST_SDR1(r4)
> > +	lwz	r7,KVM_HOST_LPID(r4)
> > +	li	r8,0x3ff		/* switch to reserved LPID */
> 
> is it reserved by ISA? Either way, hard-coding the constant without
> a name is not nice :).

Actually, it just has to be an LPID value that isn't ever used for
running a real guest (or the host).  I'll make a name for it.

> > +	lis	r8,0x7fff
> 
> INT_MAX@h might be more readable.

OK.

> > int kvm_arch_vcpu_runnable(struct kvm_vcpu *v)
> > {
> > -	return !(v->arch.shared->msr & MSR_WE) ||
> > +#ifndef CONFIG_KVM_BOOK3S_64_HV
> > +	return !(kvmppc_get_msr(v) & MSR_WE) ||
> > 	       !!(v->arch.pending_exceptions);
> > +#else
> > +	return 1;
> 
> So what happens if the guest sets MSR_WE? It just stays in guest
> context idling? That'd be pretty bad for scheduling on the host.

The MSR_WE bit doesn't exist on POWER7 (or any of POWER[4567], in
fact).  If the guest wants to idle it calls the H_CEDE hcall.

> > @@ -184,12 +188,14 @@ int kvm_dev_ioctl_check_extension(long ext)
> > 	case KVM_CAP_PPC_IRQ_LEVEL:
> > 	case KVM_CAP_ENABLE_CAP:
> > 	case KVM_CAP_PPC_OSI:
> > +#ifndef CONFIG_KVM_BOOK3S_64_HV
> 
> You also don't do OSI on HV :).

Good point, I'll fix that.

> > @@ -496,8 +510,11 @@ int kvm_vcpu_ioctl_interrupt(struct kvm_vcpu *vcpu, struct kvm_interrupt *irq)
> > 	if (waitqueue_active(&vcpu->wq)) {
> > 		wake_up_interruptible(&vcpu->wq);
> > 		vcpu->stat.halt_wakeup++;
> > +#ifdef CONFIG_KVM_BOOK3S_64_HV
> > +	} else if (vcpu->cpu != -1) {
> > +		smp_send_reschedule(vcpu->cpu);
> 
> Shouldn't this be done for non-HV too? The only reason we don't do
> it yet is because we don't do SMP, no?

I didn't know why you didn't do it for non-HV, so I didn't change it
for that case.  If you say it's OK, I'll change it (we'll need to set
vcpu->cpu in the vcpu_load/unload code for book3s_pr too, then).

> > +#endif
> > 	}
> > -
> 
> eh... :)

OK. :)

> > +		struct {
> > +			__u64 nr;
> > +			__u64 ret;
> > +			__u64 args[9];
> > +		} papr_hcall;
> 
> This needs some information in the documentation.

Yes, Avi commented on that too.

> PS: I CC'ed kvm-ppc@vger. Please make sure to CC that mailing list,
> so people interested in kvm on ppc get your patches as well :).

Sure, will do.

Thanks,
Paul.

^ permalink raw reply

* Re: book to learn ppc assembly and architecture
From: Michael Neuling @ 2011-05-16  6:37 UTC (permalink / raw)
  To: s shaiju; +Cc: linuxppc-dev
In-Reply-To: <BLU156-w53FEA9E58A7EB19B0F239E838A0@phx.gbl>

> what  is  the  best  book  to  learn  assembly  and  architecture .

Reading the architecture books with a nice cup of tea.

http://www.power.org/resources/downloads/PowerISA_V2.06B_V2_PUBLIC.pdf

Good luck!

Mikey

^ permalink raw reply

* IPSec L2tpv3 throughput low using Netkey kernel stack
From: Vasanth Ragavendran @ 2011-05-16  8:55 UTC (permalink / raw)
  To: linuxppc-dev

Hi all,

My setup is as below I've two MPC8315E board from freescale connected as
below

PC1 ---- mpc board1 --- mpc board 2 ----- PC2.

I've installed openswan 2.6.18 on the mpc boards for encrypting the data
sent over the link between mpc boards using IPSEC. I've l2tpv3 bridge
between the mpc boards so that both PC's can be in the same subnet. Now
using l2tpv3 and Ipsec the thruput is only 14.4Mbps tested using iperf and
in tcp mode. I am using the NETKEY kernel stack during Ipsec configuration,
using a PSK method and AES256-SHA1 cipher. I've complied the AES and SHA
crypto API into the kernel running the mpc boards. the kernel version being
2.6.37.6. And it looks like the hardware accelerator is not getting used or
how do I check if the hardware accelerator is getting used? Why is the thru
put so low? With just the l2tpv3 and no IPSec betn the boards i am able to
get a thruput of 40Mbps with the rest of the settings being the same. I read
somewhere that OCF cannot be used with NETKEY and would the thruput be more
if i use KLIPS? if so how do i patch up the kernel with KLIPS. Also I had
patched up the SEC driver from the freescale website 

http://www.freescale.com/webapp/sps/site/prod_summary.jsp?code=MPC8315E&nodeId=018rH3Jk191439&fpsp=1&tab=Design_Tools_Tab

but when the Sec driver is configured as module into the kernel the module
doesn;t get inserted gives segmentation fault and if the Sec driver is
configured as non-module type the kernel doesn't bootsup at all! How do i
solve this situation? I'm desperately in need of some help as i've been
working on it to increase the thruput for a long time. Kindly help. Thanks a
lot in advance.
-- 
View this message in context: http://old.nabble.com/IPSec-L2tpv3-throughput-low-using-Netkey-kernel-stack-tp31627334p31627334.html
Sent from the linuxppc-dev mailing list archive at Nabble.com.

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-16 12:00 UTC (permalink / raw)
  To: Arnd Bergmann
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, jmorris, Ingo Molnar, Roland McGrath,
	kees.cook, Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
	Steven Rostedt, Tejun Heo, Thomas Gleixner, linux-arm-kernel,
	Michal Marek, Michal Simek, Will Drewry, agl, linux-kernel,
	Eric Paris, Paul Mundt, Martin Schwidefsky, linux390,
	Andrew Morton, linuxppc-dev, David S. Miller
In-Reply-To: <201105150842.07816.arnd@arndb.de>


* Arnd Bergmann <arnd@arndb.de> wrote:

> On Saturday 14 May 2011, Will Drewry wrote:
> > Depending on integration, it could even be limited to ioctl commands
> > that are appropriate to a known fd if the fd is opened prior to
> > entering seccomp mode 2. Alternatively, __NR__ioctl could be allowed
> > with a filter of "1" then narrowed through a later addition of
> > something like "(fd == %u && (cmd == %u || cmd == %u))" or something
> > along those lines.
> > 
> > Does that make sense?
> 
> Thanks for the explanation. This sounds like it's already doing all
> we need.

One thing we could do more clearly here is to help keep the filter expressions 
symbolic - i.e. help resolve the various ioctl variants as well, not just the 
raw syscall parameter numbers.

But yes, access to the raw syscall parameters and the ability to filter them 
already gives us the ability to exclude/include specific ioctls in a rather 
flexible way.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system callfiltering
From: Ingo Molnar @ 2011-05-16 12:03 UTC (permalink / raw)
  To: David Laight
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, linux-kernel, David Howells, Paul Mackerras,
	Ralf Baechle, H. PeterAnvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Ingo Molnar, kees.cook,
	Serge E. Hallyn, Steven Rostedt, Martin Schwidefsky,
	Thomas Gleixner, agl, linux-arm-kernel, Michal Marek,
	Michal Simek, Will Drewry, linuxppc-dev, Oleg Nesterov,
	Eric Paris, Paul Mundt, Tejun Heo, linux390, Andrew Morton,
	Linus Torvalds, David S. Miller
In-Reply-To: <AE90C24D6B3A694183C094C60CF0A2F6D8AD37@saturn3.aculab.com>


* David Laight <David.Laight@ACULAB.COM> wrote:

> [...] unfortunately it worked by looking at the user-space buffers on system 
> call entry - and a multithreaded program can easily arrange to update them 
> after the initial check! [...]

Such problems of reliability/persistency of security checks is exactly one of 
my arguments why this should not be limited to the syscall boundary, if you 
read the example i have provided in this discussion.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-16 12:43 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	kees.cook, Serge E. Hallyn, Peter Zijlstra, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, linux-arm-kernel, Michal Marek,
	Michal Simek, linuxppc-dev, linux-kernel, Ralf Baechle,
	Paul Mundt, Martin Schwidefsky, linux390, Andrew Morton, agl,
	David S. Miller
In-Reply-To: <BANLkTi=1CBKevjg3+fYFZF9zWtQVw-W9hQ@mail.gmail.com>

* Will Drewry <wad@chromium.org> wrote:

> > Note, i'm not actually asking for the moon, a pony and more.
> >
> > I fully submit that we are yet far away from being able to do a full LSM 
> > via this mechanism.
> >
> > What i'm asking for is that because the syscall point steps taken by Will 
> > look very promising so it would be nice to do *that* in a slightly more 
> > flexible scheme that does not condemn it to be limited to the syscall 
> > boundary and such ...
> 
> What do you suggest here?
> 
> From my brief exploration of the ftrace/perf (and seccomp) code, I don't see 
> a clean way of integrating over the existing interfaces to the ftrace 
> framework (e.g., the global perf event pump seems to be a mismatch), but I 
> may be missing something obvious. [...]

Well, there's no global perf event pump. Here we'd obviously want to use 
buffer-less events that do no output (pumping) whatsoever - i.e. just counting 
events with filters attached.

What i suggest is to:

 - share syscall events        # you are fine with that as your patch makes use of them
 - share the scripting engine  # you are fine with that as your patch makes use of it

 - share *any* other event to do_exit() a process at syscall exit time

 - share any other active event that kernel developers specifically enable 
   for active use to impact security-relevant execution even sooner than 
   syscall exit time - not just system calls

 - share the actual facility that manages (sets/gets) filters

So right now you have this structure for your new feature:

 Documentation/trace/seccomp_filter.txt |   75 +++++
 arch/x86/kernel/ldt.c                  |    5 
 arch/x86/kernel/tls.c                  |    7 
 fs/proc/array.c                        |   21 +
 fs/proc/base.c                         |   25 +
 include/linux/ftrace_event.h           |    9 
 include/linux/seccomp.h                |   98 +++++++
 include/linux/syscalls.h               |   54 ++--
 include/trace/syscall.h                |    6 
 kernel/Makefile                        |    1 
 kernel/fork.c                          |    8 
 kernel/perf_event.c                    |    7 
 kernel/seccomp.c                       |  144 ++++++++++-
 kernel/seccomp_filter.c                |  428 +++++++++++++++++++++++++++++++++
 kernel/sys.c                           |   11 
 kernel/trace/Kconfig                   |   10 
 kernel/trace/trace_events_filter.c     |   60 ++--
 kernel/trace/trace_syscalls.c          |   96 ++++++-
 18 files changed, 986 insertions(+), 79 deletions(-)

Firstly, one specific problem i can see is that kernel/seccomp_filter.c 
hardcodes to the system call boundary. Which is fine to a prototype 
implementation (and obviously fine for something that builds upon seccomp) but 
not fine in terms of a flexible Linux security feature :-)

You have hardcoded these syscall assumptions via:

 struct seccomp_filter {
        struct list_head list;
        struct rcu_head rcu;
        int syscall_nr;
        struct syscall_metadata *data;
        struct event_filter *event_filter;
 };

Which comes just because you chose to enumerate only syscall events - instead 
of enumerating all events.

Instead of that please bind to all events instead - syscalls are just one of 
the many events we have. Type 'perf list' and see how many event types we have, 
and quite a few could be enabled for 'active feedback' sandboxing as well.

Secondly, and this is really a variant of the first problem you have, the way 
you process event filter 'failures' is pretty limited. You utilize the regular 
seccomp method which works by calling into __secure_computing() and silently 
accepting syscalls or generating a hard do_exit() on even the slightest of 
filter failures.

Instead of that what we'd want to have is to have regular syscall events 
registered, *and* enabled for such active filtering purposes. The moment the 
filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag and 
some other attribute in the task and the kernel would do a do_exit() at the 
earliest of opportunities before calling the syscall or at the 
return-from-syscall point latest.

Note that no seccomp specific code would have to execute here, we can already 
generate events both at syscall entry and at syscall exit points, the only new 
bit we'd need is for the 'kill the task' [or abort the syscall] policy action.

This is *far* more generic still yields the same short-term end result as far 
as your sandboxing is concerned.

What we'd need for this is a way to mark existing TRACE_EVENT()s as 'active'. 
We'd mark all syscall events as 'active' straight away.

[ Detail: these trace events return a return code, which the calling code can 
  use, that way event return values could be used sooner than syscall exit 
  points. IRQ code could make use of it as well, so for example this way we 
  could filter based early packet inspection, still in the IRQ code. ]

Then what your feature would do is to simply open up the events you are 
interested in (just as counting events, without any buffers), and set 'active' 
filters in them them to 'deny/allow' specific combinations of parameters only.

Thirdly, you have added a separate facility to set/query filters via /proc, 
this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity would 
go away altogether: it should be possible to query existing events and filters 
even outside of the seccomp facility, so if you want something explicit here 
beyond the current event ioctl() to set filters please improve the generic 
facility - which right now is poorer than what you have implemented so it's in 
good need to be improved! :-)

All in one such a solution would enable us to reuse code in a lot more deeper 
fashion while still providing all the functionality you are interested in. 
These are all short-term concerns, not pie-in-the-sky future ideal LSM 
concerns.

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-16 12:55 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Peter Zijlstra,
	microblaze-uclinux, Steven Rostedt, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt, Tejun Heo,
	linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <BANLkTin74OSAi94jbcYf_tj04L247O4GOg@mail.gmail.com>

* Will Drewry <wad@chromium.org> wrote:

> I agree with you on many of these points!  However, I don't think that the 
> views around LSMs, perf/ftrace infrastructure, or the current seccomp 
> filtering implementation are necessarily in conflict.  Here is my 
> understanding of how the different worlds fit together and where I see this 
> patchset living, along with where I could see future work going.  Perhaps I'm 
> being a trifle naive, but here goes anyway:
> 
> 1. LSMs provide a global mechanism for hooking "security relevant"
> events at a point where all the incoming user-sourced data has been
> preprocessed and moved into userspace.  The hooks are called every
> time one of those boundaries are crossed.

> 2. Perf and the ftrace infrastructure provide global function tracing
> and system call hooks with direct access to the caller's registers
> (and memory).

No, perf events are not just global but per task as well. Nor are events 
limited to 'tracing' (generating a flow of events into a trace buffer) - they 
can just be themselves as well and count and generate callbacks.

The generic NMI watchdog uses that kind of event model for example, see 
kernel/watchdog.c and how it makes use of struct perf_event abstractions to do 
per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it for per 
task events and integrates it with the ptrace hw-breakpoints code.

Ideally Peter's one particular suggestion is right IMO and we'd want to be able 
for a perf_event to just be a list of callbacks, attached to a task and barely 
more than a discoverable, named notifier chain in its slimmest form.

In practice it's fatter than that right now, but we should definitely factor 
out that aspect of it more clearly, both code-wise and API-wise. 
kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring out is 
possible and desirable.

> 3. seccomp (as it exists today) provides a global system call entry
> hook point with a binary per-process decision about whether to provide
> "secure computing" behavior.
> 
> When I boil that down to abstractions, I see:
> A. Globally scoped: LSMs, ftrace/perf
> B. Locally/process scoped: seccomp

Ok, i see where you got the idea that you needed to cut your surface of 
abstraction at the filter engine / syscall enumeration level - i think you were 
thinking of it in the ftrace model of tracepoints, not in the perf model of 
events.

No, events are generic and as such per task as well, not just global.

I've replied to your other mail with more specific suggestions of how we could 
provide your feature using abstractions that share code more widely. Talking 
specifics will i hope help move the discussion forward! :-)

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Will Drewry @ 2011-05-16 14:42 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	linux-arm-kernel, kees.cook, Serge E. Hallyn, Peter Zijlstra,
	microblaze-uclinux, Steven Rostedt, Martin Schwidefsky,
	Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
	linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt, Tejun Heo,
	linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <20110516125505.GE7128@elte.hu>

On Mon, May 16, 2011 at 7:55 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
>> I agree with you on many of these points! =A0However, I don't think that=
 the
>> views around LSMs, perf/ftrace infrastructure, or the current seccomp
>> filtering implementation are necessarily in conflict. =A0Here is my
>> understanding of how the different worlds fit together and where I see t=
his
>> patchset living, along with where I could see future work going. =A0Perh=
aps I'm
>> being a trifle naive, but here goes anyway:
>>
>> 1. LSMs provide a global mechanism for hooking "security relevant"
>> events at a point where all the incoming user-sourced data has been
>> preprocessed and moved into userspace. =A0The hooks are called every
>> time one of those boundaries are crossed.
>
>> 2. Perf and the ftrace infrastructure provide global function tracing
>> and system call hooks with direct access to the caller's registers
>> (and memory).
>
> No, perf events are not just global but per task as well. Nor are events
> limited to 'tracing' (generating a flow of events into a trace buffer) - =
they
> can just be themselves as well and count and generate callbacks.

I was looking at the perf_sysenter_enable() call, but clearly there is
more going on :)

> The generic NMI watchdog uses that kind of event model for example, see
> kernel/watchdog.c and how it makes use of struct perf_event abstractions =
to do
> per CPU events (with no buffrs), or how kernel/hw_breakpoint.c uses it fo=
r per
> task events and integrates it with the ptrace hw-breakpoints code.
>
> Ideally Peter's one particular suggestion is right IMO and we'd want to b=
e able
> for a perf_event to just be a list of callbacks, attached to a task and b=
arely
> more than a discoverable, named notifier chain in its slimmest form.
>
> In practice it's fatter than that right now, but we should definitely fac=
tor
> out that aspect of it more clearly, both code-wise and API-wise.
> kernel/watchdog.c and kernel/hw_breakpoint.c shows that such factoring ou=
t is
> possible and desirable.
>
>> 3. seccomp (as it exists today) provides a global system call entry
>> hook point with a binary per-process decision about whether to provide
>> "secure computing" behavior.
>>
>> When I boil that down to abstractions, I see:
>> A. Globally scoped: LSMs, ftrace/perf
>> B. Locally/process scoped: seccomp
>
> Ok, i see where you got the idea that you needed to cut your surface of
> abstraction at the filter engine / syscall enumeration level - i think yo=
u were
> thinking of it in the ftrace model of tracepoints, not in the perf model =
of
> events.
>
> No, events are generic and as such per task as well, not just global.
>
> I've replied to your other mail with more specific suggestions of how we =
could
> provide your feature using abstractions that share code more widely. Talk=
ing
> specifics will i hope help move the discussion forward! :-)

Agreed.  I'll digest both the watchdog code as well as your other
comments and follow up when I have a fuller picture in my head.

(I have a few initial comments I'll post in response to your other mail.)

Thanks!
will

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-16 15:08 UTC (permalink / raw)
  To: James Morris
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, Linus Torvalds, Ingo Molnar, kees.cook,
	Serge E. Hallyn, Peter Zijlstra, Steven Rostedt, Tejun Heo,
	Thomas Gleixner, linux-arm-kernel, Michal Marek, Michal Simek,
	Will Drewry, linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt,
	Martin Schwidefsky, linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <alpine.LRH.2.00.1105161006340.21749@tundra.namei.org>

* James Morris <jmorris@namei.org> wrote:

> On Fri, 13 May 2011, Ingo Molnar wrote:
> 
> > Say i'm a user-space sandbox developer who wants to enforce that sandboxed 
> > code should only be allowed to open files in /home/sandbox/, /lib/ and 
> > /usr/lib/.
> > 
> > It is a simple and sensible security feature, agreed? It allows most code 
> > to run well and link to countless libraries - but no access to other files 
> > is allowed.
> 
> Not really.
> 
> Firstly, what is the security goal of these restrictions? [...]

To do what i described above? Namely:

 " Sandboxed code should only be allowed to open files in /home/sandbox/, /lib/
   and /usr/lib/ "

> [...]  Then, are the restrictions complete and unbypassable?

If only the system calls i mentioned are allowed, and if the sandboxed VFS 
namespace itself is isolated from the rest of the system (no bind mounts, no 
hard links outside the sandbox, etc.) then its goal is to not be bypassable - 
what use is a sandbox if the sandbox can be bypassed by the sandboxed code?

There's a few ways how to alter (and thus bypass) VFS namespace lookups: 
symlinks, chdir, chroot, rename, etc., which (as i mentioned) have to be 
excluded by default or filtered as well.

> How do you reason about the behavior of the system as a whole?

For some usecases i mainly want to reason about what the sandboxed code can do 
and can not do, within a fairly static and limited VFS namespace environment.

I might not want to have a full-blown 'physical barrier' for all objects 
labeled as inaccessible to sandboxed code (or labeled as accessible to 
sandboxed code).

Especially as manipulating file labels is not also slow (affects all files) but 
is also often an exclusively privileged operation even for owned files, for no 
good reason. For things like /lib/ and /usr/lib/ it also *has* to be a 
privileged operation.

> > I argue that this is the LSM and audit subsystems designed right: in the 
> > long run it could allow everything that LSM does at the moment - and so 
> > much more ...
> 
> Now you're proposing a redesign of the security subsystem.  That's a 
> significant undertaking.

It certainly is.

> In the meantime, we have a simple, well-defined enhancement to seccomp which 
> will be very useful to current users in reducing their kernel attack surface.
> 
> We should merge that, and the security subsystem discussion can carry on 
> separately.

Is that the development and merge process along which the LSM subsystem got 
into its current state?

Thanks,

	Ingo

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Will Drewry @ 2011-05-16 15:28 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, jmorris, Ingo Molnar, linux-arm-kernel,
	Ingo Molnar, Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
	Martin Schwidefsky, Thomas Gleixner, kees.cook, Roland McGrath,
	Michal Marek, Michal Simek, linuxppc-dev, linux-kernel,
	Eric Paris, Paul Mundt, Tejun Heo, linux390, Andrew Morton, agl,
	David S. Miller
In-Reply-To: <1305559607.5456.11.camel@gandalf.stny.rr.com>

On Mon, May 16, 2011 at 10:26 AM, Steven Rostedt <rostedt@goodmis.org> wrot=
e:
> Sorry to be absent from this thread so far, I just got back from my
> travels and I'm now catching up on email.
>
>
> On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:
>
>> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
>> index 377a7a5..22e1668 100644
>> --- a/arch/arm/Kconfig
>> +++ b/arch/arm/Kconfig
>> @@ -1664,6 +1664,16 @@ config SECCOMP
>> =A0 =A0 =A0 =A0 and the task is only allowed to execute a few safe sysca=
lls
>> =A0 =A0 =A0 =A0 defined by each seccomp mode.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0config CC_STACKPROTECTOR
>> =A0 =A0 =A0 bool "Enable -fstack-protector buffer overflow detection (EX=
PERIMENTAL)"
>> =A0 =A0 =A0 depends on EXPERIMENTAL
>> diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
>> index eccdefe..7641ee9 100644
>> --- a/arch/microblaze/Kconfig
>> +++ b/arch/microblaze/Kconfig
>> @@ -129,6 +129,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0endmenu
>>
>> =A0menu "Advanced setup"
>> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
>> index 8e256cc..fe4cbda 100644
>> --- a/arch/mips/Kconfig
>> +++ b/arch/mips/Kconfig
>> @@ -2245,6 +2245,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0config USE_OF
>> =A0 =A0 =A0 bool "Flattened Device Tree support"
>> =A0 =A0 =A0 select OF
>> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
>> index 8f4d50b..83499e4 100644
>> --- a/arch/powerpc/Kconfig
>> +++ b/arch/powerpc/Kconfig
>> @@ -605,6 +605,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0endmenu
>>
>> =A0config ISA_DMA_API
>> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
>> index 2508a6f..2777515 100644
>> --- a/arch/s390/Kconfig
>> +++ b/arch/s390/Kconfig
>> @@ -614,6 +614,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say Y.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0endmenu
>>
>> =A0menu "Power Management"
>> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
>> index 4b89da2..00c1521 100644
>> --- a/arch/sh/Kconfig
>> +++ b/arch/sh/Kconfig
>> @@ -676,6 +676,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say N.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0config SMP
>> =A0 =A0 =A0 bool "Symmetric multi-processing support"
>> =A0 =A0 =A0 depends on SYS_SUPPORTS_SMP
>> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
>> index e560d10..5b42255 100644
>> --- a/arch/sparc/Kconfig
>> +++ b/arch/sparc/Kconfig
>> @@ -270,6 +270,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0config HOTPLUG_CPU
>> =A0 =A0 =A0 bool "Support for hot-pluggable CPUs"
>> =A0 =A0 =A0 depends on SPARC64 && SMP
>> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
>> index cc6c53a..d6d44d9 100644
>> --- a/arch/x86/Kconfig
>> +++ b/arch/x86/Kconfig
>> @@ -1485,6 +1485,16 @@ config SECCOMP
>>
>> =A0 =A0 =A0 =A0 If unsure, say Y. Only embedded should say N here.
>>
>> +config SECCOMP_FILTER
>> + =A0 =A0 bool "Enable seccomp-based system call filtering"
>> + =A0 =A0 depends on SECCOMP && EXPERIMENTAL
>> + =A0 =A0 help
>> + =A0 =A0 =A0 Per-process, inherited system call filtering using shared =
code
>> + =A0 =A0 =A0 across seccomp and ftrace_syscalls. =A0If CONFIG_FTRACE_SY=
SCALLS
>> + =A0 =A0 =A0 is not available, enhanced filters will not be available.
>> +
>> + =A0 =A0 =A0 See Documentation/prctl/seccomp_filter.txt for more detail=
.
>> +
>> =A0config CC_STACKPROTECTOR
>> =A0 =A0 =A0 bool "Enable -fstack-protector buffer overflow detection (EX=
PERIMENTAL)"
>> =A0 =A0 =A0 ---help---
>
> You just cut-and-pasted 8 copies of a config selection. The proper way
> to do that is to add the Kconfig selection in a core kernel Kconfig,
> have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
> configs, simply add in the arch Kconfig:
>
> config <ARCH>
> =A0 =A0 =A0 =A0[...]
> =A0 =A0 =A0 =A0select HAVE_SECCOMP_FILTER
>
>
> That way you don't need to duplicate the config option all over the
> place.

Thanks!  I had seen the HAVE_* format but failed to acknowledge why!

I'll fix it in the next rev (assuming it still fits there)!
will

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Will Drewry @ 2011-05-16 15:29 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, James Morris, Linus Torvalds, Ingo Molnar,
	kees.cook, Serge E. Hallyn, Peter Zijlstra, Steven Rostedt,
	Tejun Heo, Thomas Gleixner, linux-arm-kernel, Michal Marek,
	Michal Simek, linuxppc-dev, linux-kernel, Ralf Baechle,
	Paul Mundt, Martin Schwidefsky, linux390, Andrew Morton, agl,
	David S. Miller
In-Reply-To: <20110516124304.GC7128@elte.hu>

On Mon, May 16, 2011 at 7:43 AM, Ingo Molnar <mingo@elte.hu> wrote:
>
> * Will Drewry <wad@chromium.org> wrote:
>
>> > Note, i'm not actually asking for the moon, a pony and more.
>> >
>> > I fully submit that we are yet far away from being able to do a full L=
SM
>> > via this mechanism.
>> >
>> > What i'm asking for is that because the syscall point steps taken by W=
ill
>> > look very promising so it would be nice to do *that* in a slightly mor=
e
>> > flexible scheme that does not condemn it to be limited to the syscall
>> > boundary and such ...
>>
>> What do you suggest here?
>>
>> From my brief exploration of the ftrace/perf (and seccomp) code, I don't=
 see
>> a clean way of integrating over the existing interfaces to the ftrace
>> framework (e.g., the global perf event pump seems to be a mismatch), but=
 I
>> may be missing something obvious. [...]
>
> Well, there's no global perf event pump. Here we'd obviously want to use
> buffer-less events that do no output (pumping) whatsoever - i.e. just cou=
nting
> events with filters attached.

Cool - I missed these entirely.  I'll get reading :)

> What i suggest is to:
>
> =A0- share syscall events =A0 =A0 =A0 =A0# you are fine with that as your=
 patch makes use of them
> =A0- share the scripting engine =A0# you are fine with that as your patch=
 makes use of it
>
> =A0- share *any* other event to do_exit() a process at syscall exit time
>
> =A0- share any other active event that kernel developers specifically ena=
ble
> =A0 for active use to impact security-relevant execution even sooner than
> =A0 syscall exit time - not just system calls
>
> =A0- share the actual facility that manages (sets/gets) filters

These make sense to me at a high level.  I'll throw in a few initial
comments, but I'll be back for a round-two once I catch up on the rest
of the perf code.

> So right now you have this structure for your new feature:
>
> =A0Documentation/trace/seccomp_filter.txt | =A0 75 +++++
> =A0arch/x86/kernel/ldt.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 =A05
> =A0arch/x86/kernel/tls.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 =A07
> =A0fs/proc/array.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 2=
1 +
> =A0fs/proc/base.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 2=
5 +
> =A0include/linux/ftrace_event.h =A0 =A0 =A0 =A0 =A0 | =A0 =A09
> =A0include/linux/seccomp.h =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 98 ++++++=
+
> =A0include/linux/syscalls.h =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 54 ++--
> =A0include/trace/syscall.h =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 =A06
> =A0kernel/Makefile =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 =
=A01
> =A0kernel/fork.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0=
 =A08
> =A0kernel/perf_event.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0 =A07
> =A0kernel/seccomp.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0144 =
++++++++++-
> =A0kernel/seccomp_filter.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0| =A0428 ++++++=
+++++++++++++++++++++++++++
> =A0kernel/sys.c =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0=
 11
> =A0kernel/trace/Kconfig =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 =A0 | =A0 10
> =A0kernel/trace/trace_events_filter.c =A0 =A0 | =A0 60 ++--
> =A0kernel/trace/trace_syscalls.c =A0 =A0 =A0 =A0 =A0| =A0 96 ++++++-
> =A018 files changed, 986 insertions(+), 79 deletions(-)
>
> Firstly, one specific problem i can see is that kernel/seccomp_filter.c
> hardcodes to the system call boundary. Which is fine to a prototype
> implementation (and obviously fine for something that builds upon seccomp=
) but
> not fine in terms of a flexible Linux security feature :-)
>
> You have hardcoded these syscall assumptions via:
>
> =A0struct seccomp_filter {
> =A0 =A0 =A0 =A0struct list_head list;
> =A0 =A0 =A0 =A0struct rcu_head rcu;
> =A0 =A0 =A0 =A0int syscall_nr;
> =A0 =A0 =A0 =A0struct syscall_metadata *data;
> =A0 =A0 =A0 =A0struct event_filter *event_filter;
> =A0};

(This structure is a bit different in the new rev of the patch, but
nothing relevant to this specific part of the discussion :)

> Which comes just because you chose to enumerate only syscall events - ins=
tead
> of enumerating all events.

While I'm willing to live with the tradeoff, using ftrace event
numbers from FTRACE_SYSCALLS means that the filter will be unable to
hook a fair number of syscalls: execve, clone, etc  (all the ptregs
fixup syscalls on x86) and anything that returns int instead of long
(on x86).  Though the last two patches in the initial patch series
provided a proposed clean up for the latter :/

The current revision of the seccomp filter patch can function in a
bitmask-like state when CONFIG_FTRACE is unset or
CONFIG_FTRACE_SYSCALLS is unset.  This also means any platform with
CONFIG_SECCOMP support can start using this right away, but I realize
that is more of a short-term gain rather than a long-term one.

> Instead of that please bind to all events instead - syscalls are just one=
 of
> the many events we have. Type 'perf list' and see how many event types we=
 have,
> and quite a few could be enabled for 'active feedback' sandboxing as well=
.
>
> Secondly, and this is really a variant of the first problem you have, the=
 way
> you process event filter 'failures' is pretty limited. You utilize the re=
gular
> seccomp method which works by calling into __secure_computing() and silen=
tly
> accepting syscalls or generating a hard do_exit() on even the slightest o=
f
> filter failures.
>
> Instead of that what we'd want to have is to have regular syscall events
> registered, *and* enabled for such active filtering purposes. The moment =
the
> filter hits such an 'active' event would set the TIF_NOTIFY_RESUME flag a=
nd
> some other attribute in the task and the kernel would do a do_exit() at t=
he
> earliest of opportunities before calling the syscall or at the
> return-from-syscall point latest.
>
> Note that no seccomp specific code would have to execute here, we can alr=
eady
> generate events both at syscall entry and at syscall exit points, the onl=
y new
> bit we'd need is for the 'kill the task' [or abort the syscall] policy ac=
tion.
>
> This is *far* more generic still yields the same short-term end result as=
 far
> as your sandboxing is concerned.

Almost :/  I still need to review the code you've pointed out, but, at
present, the ftrace hooks occur after the seccomp and syscall auditing
hooks.  This means that that code is exposed no matter what in this
model.  To trim the exposed surface to userspace, we really need those
early hooks.  While I can see both hacky and less hacky approaches
around this, it stills strikes me that the seccomp thread flag and
early interception are good to reuse.  One option might be to allow
seccomp to be a secure-syscall event source, but I suspect that lands
more on the hack-y side of the fence :)

Anyway, I'll read up on the other filters and try to understand
exactly how per-task, counting perf events are being handled.

> What we'd need for this is a way to mark existing TRACE_EVENT()s as 'acti=
ve'.
> We'd mark all syscall events as 'active' straight away.

It seems like we'll still end up special casing system calls no matter
what as they are the userland vector into the kernel.  Marking
syscalls active right away sounds roughly equivalent to calling
prctl(PR_SET_SECCOMP, 2).  This could easily be done following the
model of syscall_nr_to_meta() since that enumerates every system call
meta data entry which yields both entry and exit event numbers. Of
course, I'd need to understand how that ties in with the per-task,
counting code.

> [ Detail: these trace events return a return code, which the calling code=
 can
> =A0use, that way event return values could be used sooner than syscall ex=
it
> =A0points. IRQ code could make use of it as well, so for example this way=
 we
> =A0could filter based early packet inspection, still in the IRQ code. ]
>
> Then what your feature would do is to simply open up the events you are
> interested in (just as counting events, without any buffers), and set 'ac=
tive'
> filters in them them to 'deny/allow' specific combinations of parameters =
only.
>
> Thirdly, you have added a separate facility to set/query filters via /pro=
c,
> this lives mostly in /proc/<pid>/seccomp_filter. This seccomp specificity=
 would
> go away altogether: it should be possible to query existing events and fi=
lters
> even outside of the seccomp facility, so if you want something explicit h=
ere
> beyond the current event ioctl() to set filters please improve the generi=
c
> facility - which right now is poorer than what you have implemented so it=
's in
> good need to be improved! :-)

I haven't looked at the other interface, so I will.  I know about the
existence of the perf_event_open syscall, but that's about it.  Would
it make sense to use the same syscall in a unified-interface world?
Even if the right way, semantically it seems awkward.

> All in one such a solution would enable us to reuse code in a lot more de=
eper
> fashion while still providing all the functionality you are interested in=
.
> These are all short-term concerns, not pie-in-the-sky future ideal LSM
> concerns.

Thanks for the concrete feedback! I'll follow up again once I better
understand the code you're citing as well as the existing userland
interface to it.

cheers!
will

^ permalink raw reply

* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Steven Rostedt @ 2011-05-16 15:26 UTC (permalink / raw)
  To: Will Drewry
  Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
	Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
	Ralf Baechle, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
	Russell King, x86, jmorris, Ingo Molnar, linux-arm-kernel,
	Ingo Molnar, Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
	Martin Schwidefsky, Thomas Gleixner, kees.cook, Roland McGrath,
	Michal Marek, Michal Simek, linuxppc-dev, linux-kernel,
	Eric Paris, Paul Mundt, Tejun Heo, linux390, Andrew Morton, agl,
	David S. Miller
In-Reply-To: <1305169376-2363-1-git-send-email-wad@chromium.org>

Sorry to be absent from this thread so far, I just got back from my
travels and I'm now catching up on email.


On Wed, 2011-05-11 at 22:02 -0500, Will Drewry wrote:

> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index 377a7a5..22e1668 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig
> @@ -1664,6 +1664,16 @@ config SECCOMP
>  	  and the task is only allowed to execute a few safe syscalls
>  	  defined by each seccomp mode.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config CC_STACKPROTECTOR
>  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>  	depends on EXPERIMENTAL
> diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
> index eccdefe..7641ee9 100644
> --- a/arch/microblaze/Kconfig
> +++ b/arch/microblaze/Kconfig
> @@ -129,6 +129,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>  
>  menu "Advanced setup"
> diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
> index 8e256cc..fe4cbda 100644
> --- a/arch/mips/Kconfig
> +++ b/arch/mips/Kconfig
> @@ -2245,6 +2245,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config USE_OF
>  	bool "Flattened Device Tree support"
>  	select OF
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 8f4d50b..83499e4 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -605,6 +605,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>  
>  config ISA_DMA_API
> diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
> index 2508a6f..2777515 100644
> --- a/arch/s390/Kconfig
> +++ b/arch/s390/Kconfig
> @@ -614,6 +614,16 @@ config SECCOMP
>  
>  	  If unsure, say Y.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  endmenu
>  
>  menu "Power Management"
> diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
> index 4b89da2..00c1521 100644
> --- a/arch/sh/Kconfig
> +++ b/arch/sh/Kconfig
> @@ -676,6 +676,16 @@ config SECCOMP
>  
>  	  If unsure, say N.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config SMP
>  	bool "Symmetric multi-processing support"
>  	depends on SYS_SUPPORTS_SMP
> diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
> index e560d10..5b42255 100644
> --- a/arch/sparc/Kconfig
> +++ b/arch/sparc/Kconfig
> @@ -270,6 +270,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config HOTPLUG_CPU
>  	bool "Support for hot-pluggable CPUs"
>  	depends on SPARC64 && SMP
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index cc6c53a..d6d44d9 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1485,6 +1485,16 @@ config SECCOMP
>  
>  	  If unsure, say Y. Only embedded should say N here.
>  
> +config SECCOMP_FILTER
> +	bool "Enable seccomp-based system call filtering"
> +	depends on SECCOMP && EXPERIMENTAL
> +	help
> +	  Per-process, inherited system call filtering using shared code
> +	  across seccomp and ftrace_syscalls.  If CONFIG_FTRACE_SYSCALLS
> +	  is not available, enhanced filters will not be available.
> +
> +	  See Documentation/prctl/seccomp_filter.txt for more detail.
> +
>  config CC_STACKPROTECTOR
>  	bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
>  	---help---

You just cut-and-pasted 8 copies of a config selection. The proper way
to do that is to add the Kconfig selection in a core kernel Kconfig,
have it depend on "HAVE_SECCOMP_FILTER" and then in each of these
configs, simply add in the arch Kconfig:

config <ARCH>
	[...]
	select HAVE_SECCOMP_FILTER


That way you don't need to duplicate the config option all over the
place.

-- Steve

^ permalink raw reply

* [PATCH 1/2] powerpc/5200: mpc5200b.dtsi: add spi node address- and size-cells properties
From: Anatolij Gustschin @ 2011-05-16 16:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: devicetree-discuss

Both, #address-cells and #size-cells properties are required
for spi bus node, so add them.

Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/boot/dts/mpc5200b.dtsi |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/boot/dts/mpc5200b.dtsi b/arch/powerpc/boot/dts/mpc5200b.dtsi
index bc27548..7ab286a 100644
--- a/arch/powerpc/boot/dts/mpc5200b.dtsi
+++ b/arch/powerpc/boot/dts/mpc5200b.dtsi
@@ -147,6 +147,8 @@
 		};
 
 		spi@f00 {
+			#address-cells = <1>;
+			#size-cells = <0>;
 			compatible = "fsl,mpc5200b-spi","fsl,mpc5200-spi";
 			reg = <0xf00 0x20>;
 			interrupts = <2 13 0 2 14 0>;
-- 
1.7.1

^ permalink raw reply related

* [PATCH 2/2] powerpc/5200: dts: digsy_mtc.dts: update to add can, pci, serial and spi
From: Anatolij Gustschin @ 2011-05-16 16:02 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: devicetree-discuss, Heiko Schocher
In-Reply-To: <1305561764-5942-1-git-send-email-agust@denx.de>

Add new nodes to describe more hardware the board is
equipped with:
 - two can nodes for SJA1000 on localbus
 - pci node to support Coral-PA graphics controller
 - serial node for SC28L92 DUART on localbus
 - spi node for MSP430 device

Also correct i2c eeprom node name.

Signed-off-by: Heiko Schocher <hs@denx.de>
Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/boot/dts/digsy_mtc.dts |   50 ++++++++++++++++++++++++++++++++--
 1 files changed, 47 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/boot/dts/digsy_mtc.dts b/arch/powerpc/boot/dts/digsy_mtc.dts
index 27bd267..e205d17 100644
--- a/arch/powerpc/boot/dts/digsy_mtc.dts
+++ b/arch/powerpc/boot/dts/digsy_mtc.dts
@@ -38,6 +38,14 @@
 			status = "disabled";
 		};
 
+		spi@f00 {
+			msp430@0 {
+				compatible = "spidev";
+				spi-max-frequency = <32000>;
+				reg = <0>;
+			};
+		};
+
 		psc@2000 {		// PSC1
 			status = "disabled";
 		};
@@ -73,11 +81,16 @@
 		};
 
 		i2c@3d00 {
-			rtc@50 {
+			eeprom@50 {
 				compatible = "at,24c08";
 				reg = <0x50>;
 			};
 
+			rtc@56 {
+				compatible = "mc,rv3029c2";
+				reg = <0x56>;
+			};
+
 			rtc@68 {
 				compatible = "dallas,ds1339";
 				reg = <0x68>;
@@ -90,11 +103,22 @@
 	};
 
 	pci@f0000d00 {
-		status = "disabled";
+		interrupt-map-mask = <0xf800 0 0 7>;
+		interrupt-map = <0xc000 0 0 1 &mpc5200_pic 0 0 3
+				 0xc000 0 0 2 &mpc5200_pic 0 0 3
+				 0xc000 0 0 3 &mpc5200_pic 0 0 3
+				 0xc000 0 0 4 &mpc5200_pic 0 0 3>;
+		clock-frequency = <0>; // From boot loader
+		interrupts = <2 8 0 2 9 0 2 10 0>;
+		bus-range = <0 0>;
+		ranges = <0x42000000 0 0x80000000 0x80000000 0 0x10000000
+			  0x02000000 0 0x90000000 0x90000000 0 0x10000000
+			  0x01000000 0 0x00000000 0xa0000000 0 0x01000000>;
 	};
 
 	localbus {
-		ranges = <0 0 0xff000000 0x1000000>;
+		ranges = <0 0 0xff000000 0x1000000
+			  4 0 0x60000000 0x0001000>;
 
 		// 16-bit flash device at LocalPlus Bus CS0
 		flash@0,0 {
@@ -122,5 +146,25 @@
 				reg = <0x00f00000 0x100000>;
 			};
 		};
+
+		can@4,0 {
+			compatible = "nxp,sja1000";
+			reg = <4 0x000 0x80>;
+			nxp,external-clock-frequency = <24000000>;
+			interrupts = <1 2 3>; // Level-low
+		};
+
+		can@4,100 {
+			compatible = "nxp,sja1000";
+			reg = <4 0x100 0x80>;
+			nxp,external-clock-frequency = <24000000>;
+			interrupts = <1 2 3>;  // Level-low
+		};
+
+		serial@4,200 {
+			compatible = "nxp,sc28l92";
+			reg = <4 0x200 0x10>;
+			interrupts = <1 3 3>;
+		};
 	};
 };
-- 
1.7.1

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox