* [PATCH] seqlock: don't smp_rmb in seqlock reader spin loop
From: Milton Miller @ 2011-05-12 9:13 UTC (permalink / raw)
To: Andrew Morton, Nick Piggin, Benjamin Herrenschmidt,
Anton Blanchard, Thomas Gleixner, Eric Dumazet
Cc: Linus Torvalds, Ingo Molnar, Andi Kleen, linuxppc-dev,
linux-kernel
In-Reply-To: <alpine.LFD.2.02.1105091036000.2895@ionos>
Move the smp_rmb after the cpu_relax loop in read_seqbegin and add
ACCESS_ONCE to make sure the test and return are consistent.
A multi-threaded core in the lab didn't like the update
from 2.6.35 to 2.6.36, to the point it would hang during
boot when multiple threads were active. Bisection showed
af5ab277ded04bd9bc6b048c5a2f0e7d70ef0867 (clockevents:
Remove the per cpu tick skew) as the culprit and it is
supported with stack traces showing xtime_lock waits including
tick_do_update_jiffies64 and/or update_vsyscall.
Experimentation showed the combination of cpu_relax and smp_rmb
was significantly slowing the progress of other threads sharing
the core, and this patch is effective in avoiding the hang.
A theory is the rmb is affecting the whole core while the
cpu_relax is causing a resource rebalance flush, together they
cause an interference cadence that is unbroken when the seqlock
reader has interrupts disabled.
At first I was confused why the refactor in
3c22cd5709e8143444a6d08682a87f4c57902df3 (kernel: optimise
seqlock) didn't affect this patch application, but after some
study I found that it affected seqcount, not seqlock. The new seqcount
was not factored back into the seqlock. I defer that to the future.
While the removal of the timer interrupt offset created
contention for the xtime lock while a cpu does the
additional work to update the system clock, the seqlock
implementation with the tight rmb spin loop goes back much
further, and is just waiting for the right trigger.
Cc: <stable@vger.kernel.org>
Signed-off-by: Milton Miller <miltonm@bga.com>
---
To the readers of [RFC] time: xtime_lock is held too long:
I initially thought x86 would not see this because rmb would
be a nop, but upon closer inspection X86_PPRO_FENCE will add
a lfence for rmb.
milton
Index: common/include/linux/seqlock.h
===================================================================
--- common.orig/include/linux/seqlock.h 2011-04-06 03:27:02.000000000 -0500
+++ common/include/linux/seqlock.h 2011-04-06 03:35:02.000000000 -0500
@@ -88,12 +88,12 @@ static __always_inline unsigned read_seq
 	unsigned ret;
 repeat:
-	ret = sl->sequence;
-	smp_rmb();
+	ret = ACCESS_ONCE(sl->sequence);
 	if (unlikely(ret & 1)) {
 		cpu_relax();
 		goto repeat;
 	}
+	smp_rmb();
 	return ret;
 }
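For readers who want to experiment outside the kernel tree, the reordered reader can be modelled in portable C11. This is only a sketch under stated assumptions, not the kernel code: the names seq_sketch, sketch_read_begin, sketch_read_retry and sketch_write are made up for illustration, a relaxed atomic load stands in for ACCESS_ONCE, an acquire fence stands in for smp_rmb, and the cpu_relax hint is left as a comment.

```c
#include <stdatomic.h>

/* Hypothetical userspace model of a sequence lock; not the kernel API. */
struct seq_sketch {
	atomic_uint sequence;	/* odd while a writer is mid-update */
	atomic_int a, b;	/* the protected data */
};

/* Spin on a relaxed load; issue the read barrier only after an even count is seen. */
static unsigned sketch_read_begin(struct seq_sketch *sl)
{
	unsigned ret;

	for (;;) {
		ret = atomic_load_explicit(&sl->sequence, memory_order_relaxed);
		if (!(ret & 1))
			break;
		/* the kernel would call cpu_relax() here */
	}
	atomic_thread_fence(memory_order_acquire);	/* plays the role of smp_rmb() */
	return ret;
}

/* Nonzero if a writer ran since sketch_read_begin() returned start. */
static int sketch_read_retry(struct seq_sketch *sl, unsigned start)
{
	atomic_thread_fence(memory_order_acquire);
	return atomic_load_explicit(&sl->sequence, memory_order_relaxed) != start;
}

/* Single-writer update: count goes odd, data changes, count goes even again. */
static void sketch_write(struct seq_sketch *sl, int a, int b)
{
	unsigned s = atomic_load_explicit(&sl->sequence, memory_order_relaxed);

	atomic_store_explicit(&sl->sequence, s + 1, memory_order_relaxed);
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&sl->a, a, memory_order_relaxed);
	atomic_store_explicit(&sl->b, b, memory_order_relaxed);
	atomic_store_explicit(&sl->sequence, s + 2, memory_order_release);
}
```

A reader would call sketch_read_begin(), copy a and b, and loop again if sketch_read_retry() returns nonzero. The point of the model is only that the fence sits after the spin loop rather than inside it.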
^ permalink raw reply
* Re: [PATCH 10/13] kvm/powerpc: Add support for Book3S processors in hypervisor mode
From: Avi Kivity @ 2011-05-12 9:07 UTC (permalink / raw)
To: Paul Mackerras; +Cc: linuxppc-dev, Alexander Graf, kvm
In-Reply-To: <20110511104456.GK2837@brick.ozlabs.ibm.com>
On 05/11/2011 01:44 PM, Paul Mackerras wrote:
> This adds support for KVM running on 64-bit Book 3S processors,
> specifically POWER7, in hypervisor mode. Using hypervisor mode means
> that the guest can use the processor's supervisor mode. That means
> that the guest can execute privileged instructions and access privileged
> registers itself without trapping to the host. This gives excellent
> performance, but does mean that KVM cannot emulate a processor
> architecture other than the one that the hardware implements.
>
> This code assumes that the guest is running paravirtualized using the
> PAPR (Power Architecture Platform Requirements) interface, which is the
> interface that IBM's PowerVM hypervisor uses. That means that existing
> Linux distributions that run on IBM pSeries machines will also run
> under KVM without modification. In order to communicate the PAPR
> hypercalls to qemu, this adds a new KVM_EXIT_PAPR_HCALL exit code
> to include/linux/kvm.h.
>
> Currently the choice between book3s_hv support and book3s_pr support
> (i.e. the existing code, which runs the guest in user mode) has to be
> made at kernel configuration time, so a given kernel binary can only
> do one or the other.
>
> This new book3s_hv code doesn't support MMIO emulation at present.
> Since we are running paravirtualized guests, this isn't a serious
> restriction.
>
> With the guest running in supervisor mode, most exceptions go straight
> to the guest. We will never get data or instruction storage or segment
> interrupts, alignment interrupts, decrementer interrupts, program
> interrupts, single-step interrupts, etc., coming to the hypervisor from
> the guest. Therefore this introduces a new KVMTEST_NONHV macro for the
> exception entry path so that we don't have to do the KVM test on entry
> to those exception handlers.
>
> We do however get hypervisor decrementer, hypervisor data storage,
> hypervisor instruction storage, and hypervisor emulation assist
> interrupts, so we have to handle those.
>
> In hypervisor mode, real-mode accesses can access all of RAM, not just
> a limited amount. Therefore we put all the guest state in the vcpu.arch
> and use the shadow_vcpu in the PACA only for temporary scratch space.
> We allocate the vcpu with kzalloc rather than vzalloc, and we don't use
> anything in the kvmppc_vcpu_book3s struct, so we don't allocate it.
>
> The POWER7 processor has a restriction that all threads in a core have
> to be in the same partition. MMU-on kernel code counts as a partition
> (partition 0), so we have to do a partition switch on every entry to and
> exit from the guest. At present we require the host and guest to run
> in single-thread mode because of this hardware restriction.
>
> This code allocates a hashed page table for the guest and initializes
> it with HPTEs for the guest's Virtual Real Memory Area (VRMA). We
> require that the guest memory is allocated using 16MB huge pages, in
> order to simplify the low-level memory management. This also means that
> we can get away without tracking paging activity in the host for now,
> since huge pages can't be paged or swapped.
>
> diff --git a/include/linux/kvm.h b/include/linux/kvm.h
> index ea2dc1a..a4447ce 100644
> --- a/include/linux/kvm.h
> +++ b/include/linux/kvm.h
> @@ -161,6 +161,7 @@ struct kvm_pit_config {
> #define KVM_EXIT_NMI 16
> #define KVM_EXIT_INTERNAL_ERROR 17
> #define KVM_EXIT_OSI 18
> +#define KVM_EXIT_PAPR_HCALL 19
>
> /* For KVM_EXIT_INTERNAL_ERROR */
> #define KVM_INTERNAL_ERROR_EMULATION 1
> @@ -264,6 +265,11 @@ struct kvm_run {
> struct {
> __u64 gprs[32];
> } osi;
> + struct {
> + __u64 nr;
> + __u64 ret;
> + __u64 args[9];
> + } papr_hcall;
> /* Fix the size of the union. */
> char padding[256];
> };
Please document this in Documentation/kvm/api.txt.
--
error compiling committee.c: too many arguments to function
^ permalink raw reply
* Re: fsl_udc_core: BUG: scheduling while atomic
From: Sergej.Stepanov @ 2011-05-12 8:37 UTC (permalink / raw)
To: mlcreech; +Cc: linuxppc-dev
In-Reply-To: <BANLkTikp5yVgrCe7tbc4cvG5wvy7XJp2fQ@mail.gmail.com>
Hi Matthew,
You can also get this kind of oops with SPI.
For such problems it helps to compile your kernel with a different
preemption model:
 - preempt
 - standard
 - !!! but not voluntary preemption !!!
The other possibility: check your board; it may have some memory
problems.
Regards
Sergej.
On Wednesday, 11.05.2011 at 17:37 -0400, Matthew L. Creech wrote:
> Hi,
>
> My MPC8313-based board, running a 2.6.37 kernel, is occasionally
> hitting this bug while doing RNDIS-based communication:
>
> BUG: scheduling while atomic: lighttpd/1145/0x10000200
> Call Trace:
> [c6a8b910] [c00086c0] show_stack+0x7c/0x194 (unreliable)
> [c6a8b950] [c0019e28] __schedule_bug+0x54/0x68
> [c6a8b960] [c02b04e8] schedule+0xa4/0x408
> [c6a8ba50] [c02b0988] _cond_resched+0x38/0x64
> [c6a8ba60] [c0080e8c] dma_pool_alloc+0x5c/0x2a4
> [c6a8bac0] [c01c57b0] fsl_req_to_dtd+0x68/0x24c
> [c6a8bb00] [c01c5b68] fsl_ep_queue+0x1d4/0x264
> [c6a8bb20] [c01c7eec] eth_start_xmit+0x278/0x344
> [c6a8bb50] [c01fdbc8] dev_hard_start_xmit+0x520/0x680
> [c6a8bba0] [c02122a4] sch_direct_xmit+0x68/0x1e0
> [c6a8bbc0] [c01fdf20] dev_queue_xmit+0x1f8/0x3c4
> [c6a8bbe0] [c022d684] ip_finish_output+0x2d4/0x328
> [c6a8bc10] [c022db08] ip_local_out+0x38/0x4c
> [c6a8bc20] [c022e3cc] ip_queue_xmit+0x2cc/0x360
> [c6a8bca0] [c0241844] tcp_transmit_skb+0x7cc/0x838
> [c6a8bd00] [c0244434] tcp_write_xmit+0x8c4/0xa34
> [c6a8bd60] [c0237618] tcp_sendmsg+0x900/0xbd4
> [c6a8bdd0] [c0256088] inet_sendmsg+0x74/0x8c
> [c6a8bdf0] [c01ea498] sock_aio_write+0x130/0x14c
> [c6a8be50] [c00855fc] do_sync_write+0xb0/0x110
> [c6a8bef0] [c0086294] vfs_write+0xdc/0x17c
> [c6a8bf10] [c008642c] sys_write+0x54/0x9c
> [c6a8bf40] [c000f2cc] ret_from_syscall+0x0/0x38
>
> This seems similar to a bug from 2010:
>
> http://www.spinics.net/lists/linux-usb/msg31354.html
>
> which concludes that the fsl_udc_core driver is wrongly using
> GFP_KERNEL in fsl_build_dtd().  However I'm not sure what an
> appropriate fix is, since just replacing it with GFP_ATOMIC causes
> allocation failures.  Any helpful tips?
>
> Thanks
>
> --
> Matthew L. Creech
^ permalink raw reply
* Re: [PATCH 37/37] powerpc: make IRQ_NOREQUEST last to clear, first to set
From: Milton Miller @ 2011-05-12 8:31 UTC (permalink / raw)
To: Grant Likely; +Cc: Thomas Gleixner, linuxppc-dev
In-Reply-To: <BANLkTi=03Te6kqs8G48FJr6zmQVaP3N2Vw@mail.gmail.com>
On Wed, 11 May 2011 about 21:18:11 +0200, Grant Likely wrote:
> On Wed, May 11, 2011 at 7:30 AM, Milton Miller <miltonm@bga.com> wrote:
> > When allocating irqs, wait to clear the IRQ_NOREQUEST flag until the
> > host map hook has been called.
> >
> > When freeing irqs, set the IRQ_NOREQUEST flag before calling the host
> > unmap hook.
>
> A description describing why this change is being made would be
> appreciated here.
>
> g.
You are right. #insert <late addition to series but made cut>
When creating an irq, don't allow a concurrent driver request until
we have called map, which will likely call set_chip_and_handler to
change the irq_chip and its operations.
Similarly, when tearing down an IRQ, make sure no new uses come
along while we change the irq back to the nop chip and then reset
the descriptor to freed status.
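The ordering argument above can be illustrated with a small userspace model. This is a sketch under assumptions, not the powerpc irq code: struct sketch_irq, sketch_request, sketch_setup and sketch_teardown are hypothetical names, a bool stands in for the IRQ_NOREQUEST flag, and a NULL chip pointer stands in for the nop chip. It is single-threaded and leaves out the locking and barriers the real code depends on.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for an irq descriptor; none of this is the kernel API. */
struct sketch_chip {
	const char *name;
};

struct sketch_irq {
	bool norequest;			/* models IRQ_NOREQUEST */
	const struct sketch_chip *chip;	/* NULL models the nop chip */
};

static const struct sketch_chip sketch_real_chip = { "real" };

/* Models a driver request: refused while norequest is set. */
static int sketch_request(const struct sketch_irq *irq)
{
	if (irq->norequest)
		return -1;
	/* -2 would mean a request slipped in on a half-built irq */
	return irq->chip ? 0 : -2;
}

/* Setup: install the chip (the map hook's job) before allowing requests. */
static void sketch_setup(struct sketch_irq *irq)
{
	irq->chip = &sketch_real_chip;
	irq->norequest = false;		/* last to clear */
}

/* Teardown: block new requests first, then dismantle. */
static void sketch_teardown(struct sketch_irq *irq)
{
	irq->norequest = true;		/* first to set */
	irq->chip = NULL;
}
```

With this ordering there is no point at which sketch_request() can succeed against an irq whose chip is missing, which is the property the patch is after.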
If this is acceptable I'll let Ben update the changelog unless
he asks me to resend.
milton
>
> >
> > Signed-off-by: Milton Miller <miltonm@bga.com>
> > ---
> > arch/powerpc/kernel/irq.c | 14 +++++++-------
> > 1 files changed, 7 insertions(+), 7 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> > index 4368b5e..a24d37d 100644
> > --- a/arch/powerpc/kernel/irq.c
> > +++ b/arch/powerpc/kernel/irq.c
> > @@ -586,14 +586,14 @@ struct irq_host *irq_alloc_host(struct device_node *of_node,
> > irq_map[i].host = host;
> > smp_wmb();
> >
> > - /* Clear norequest flags */
> > - irq_clear_status_flags(i, IRQ_NOREQUEST);
> > -
> > /* Legacy flags are left to default at this point,
> > * one can then use irq_create_mapping() to
> > * explicitly change them
> > */
> > ops->map(host, i, i);
> > +
> > + /* Clear norequest flags */
> > + irq_clear_status_flags(i, IRQ_NOREQUEST);
> > }
> > break;
> > case IRQ_HOST_MAP_LINEAR:
> > @@ -664,8 +664,6 @@ static int irq_setup_virq(struct irq_host *host, unsigned int virq,
> > goto error;
> > }
> >
> > - irq_clear_status_flags(virq, IRQ_NOREQUEST);
> > -
> > /* map it */
> > smp_wmb();
> > irq_map[virq].hwirq = hwirq;
> > @@ -676,6 +674,8 @@ static int irq_setup_virq(struct irq_host *host, unsigned int virq,
> > goto errdesc;
> > }
> >
> > + irq_clear_status_flags(virq, IRQ_NOREQUEST);
> > +
> > return 0;
> >
> > errdesc:
> > @@ -819,6 +819,8 @@ void irq_dispose_mapping(unsigned int virq)
> > if (host->revmap_type == IRQ_HOST_MAP_LEGACY)
> > return;
> >
> > + irq_set_status_flags(virq, IRQ_NOREQUEST);
> > +
> > /* remove chip and handler */
> > irq_set_chip_and_handler(virq, NULL, NULL);
> >
> > @@ -848,8 +850,6 @@ void irq_dispose_mapping(unsigned int virq)
> > smp_mb();
> > irq_map[virq].hwirq = host->inval_irq;
> >
> > - irq_set_status_flags(virq, IRQ_NOREQUEST);
> > -
> > irq_free_descs(virq, 1);
> > /* Free it */
> > irq_free_virt(virq, 1);
> > --
> > 1.7.0.4
> >
> >
^ permalink raw reply
* Re: powerpc: Make early memory scan more resilient to out of order nodes
From: Milton Miller @ 2011-05-12 8:09 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1305183498.29820.95.camel@pasglop>
On Wed, 11 May 2011 about 20:58:18 -0000, Benjamin Herrenschmidt wrote:
> We keep track of the size of the lowest block of memory and call
> setup_initial_memory_limit() only after we've parsed them all
>
Good, we lose our sensitivity to device node ordering.
> diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
> index 584b398..27475c6 100644
> --- a/arch/powerpc/kernel/prom.c
> +++ b/arch/powerpc/kernel/prom.c
> @@ -70,6 +70,7 @@ int __initdata iommu_force_on;
> unsigned long tce_alloc_start, tce_alloc_end;
> u64 ppc64_rma_size;
> #endif
> +static phys_addr_t first_memblock_size;
__initdata
(it's only referenced by 2 __init functions)
Acked-by: Milton Miller <miltonm@bga.com>
> static int __init early_parse_mem(char *p)
> {
..
> @@ -507,11 +508,14 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
..
> @@ -708,6 +712,7 @@ void __init early_init_devtree(void *params)
^ permalink raw reply
* Re: [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Ingo Molnar @ 2011-05-12 7:48 UTC (permalink / raw)
To: Will Drewry
Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
Heiko Carstens, Oleg Nesterov, David Howells, Paul Mackerras,
Eric Paris, H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390,
Russell King, x86, jmorris, Linus Torvalds, Ingo Molnar,
linux-arm-kernel, kees.cook, Serge E. Hallyn, Peter Zijlstra,
microblaze-uclinux, Steven Rostedt, Martin Schwidefsky,
Thomas Gleixner, Roland McGrath, Michal Marek, Michal Simek,
linuxppc-dev, linux-kernel, Ralf Baechle, Paul Mundt, Tejun Heo,
linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1305169376-2363-1-git-send-email-wad@chromium.org>
Ok, i like the direction here, but i think the ABI should be done differently.
In this patch the ftrace event filter mechanism is used:
* Will Drewry <wad@chromium.org> wrote:
> +static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
> + const char *filter_string)
> +{
> + int err = -ENOMEM;
> + struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
> + GFP_KERNEL);
> + if (!filter)
> + goto fail;
> +
> + INIT_HLIST_NODE(&filter->node);
> + filter->syscall_nr = syscall_nr;
> + filter->data = syscall_nr_to_meta(syscall_nr);
> +
> + /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
> + * using a predicate at all.
> + */
> + if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
> + goto out;
> +
> + /* Argument-based filtering only works on ftrace-hooked syscalls. */
> + if (!filter->data) {
> + err = -ENOSYS;
> + goto fail;
> + }
> +
> +#ifdef CONFIG_FTRACE_SYSCALLS
> + err = ftrace_parse_filter(&filter->event_filter,
> + filter->data->enter_event->event.type,
> + filter_string);
> + if (err)
> + goto fail;
> +#endif
> +
> +out:
> + return filter;
> +
> +fail:
> + kfree(filter);
> + return ERR_PTR(err);
> +}
Via a prctl() ABI:
> --- a/kernel/sys.c
> +++ b/kernel/sys.c
> @@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
> case PR_SET_ENDIAN:
> error = SET_ENDIAN(me, arg2);
> break;
> -
> case PR_GET_SECCOMP:
> error = prctl_get_seccomp();
> break;
> case PR_SET_SECCOMP:
> - error = prctl_set_seccomp(arg2);
> + error = prctl_set_seccomp(arg2, arg3);
> + break;
> + case PR_SET_SECCOMP_FILTER:
> + error = prctl_set_seccomp_filter(arg2,
> + (char __user *) arg3);
> + break;
> + case PR_CLEAR_SECCOMP_FILTER:
> + error = prctl_clear_seccomp_filter(arg2);
> + break;
> + case PR_GET_SECCOMP_FILTER:
> + error = prctl_get_seccomp_filter(arg2,
> + (char __user *) arg3,
> + arg4);
To restrict execution to system calls.
Two observations:
1) We already have a specific ABI for this: you can set filters for events via
an event fd.
Why not extend that mechanism instead and improve *both* your sandboxing
bits and the events code? This new seccomp code has a lot more
to do with trace event filters than the minimal old seccomp code ...
kernel/trace/trace_event_filter.c is 2000 lines of tricky code that
interprets the ASCII filter expressions. kernel/seccomp.c is 86 lines of
mostly trivial code.
2) Why should this concept not be made available wider, to allow the
restriction of not just system calls but other security relevant components
of the kernel as well?
This too, if you approach the problem via the events code, will be a natural
end result, while if you approach it from the seccomp prctl angle it will be
a limited hack only.
Note, the end result will be the same - just using a different ABI.
So i really think the ABI itself should be closer related to the event code.
What this "seccomp" code does is that it uses specific syscall events to
restrict execution of certain event generating codepaths, such as system calls.
Thanks,
Ingo
^ permalink raw reply
* Re: [PATCH 7/8] powerpc: use the newly added get_required_mask dma_map_ops hook
From: Milton Miller @ 2011-05-12 7:32 UTC (permalink / raw)
To: Geert Uytterhoeven
Cc: cbe-oss-dev, FUJITA Tomonori, Greg Kroah-Hartman, Arnd Bergmann,
Geoff Levand, Sean MacLennan, linux-kernel, Paul Mackerras,
Will Schmidt, H. Peter Anvin, Nishanth Aravamudan, Andrew Morton,
linuxppc-dev, David S. Miller
In-Reply-To: <BANLkTi=eL9uwvNSdEnY4S=k1kjSgH4Q5xg@mail.gmail.com>
> On Thu, May 12, 2011 at 00:25, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> > diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
> > index 23083c3..688141c 100644
> > --- a/arch/powerpc/platforms/ps3/system-bus.c
> > +++ b/arch/powerpc/platforms/ps3/system-bus.c
> > @@ -695,12 +695,18 @@ static int ps3_dma_supported(struct device *_dev, u64 mask)
> > return mask >= DMA_BIT_MASK(32);
> > }
> >
> > +static u64 ps3_dma_get_required_mask(struct device *_dev)
> > +{
> > + return DMA_BIT_MASK(32);
>
> Why 32 and not 64?
I based it on the return of ps3_dma_supported, which you can see just
above says anything at or above a 32 bit mask is ok.
I don't really know the platform, but digging a bit deeper, it looks
like this goes to ps3_map_dma in ps3/mm.c. It looks like that translates
the virt to phys to lpar (similar to absolute in iseries), and then
maps it to a bus address by a linear mapping. But nowhere do I see
mention of a device dma mask, neither in mm.c nor in system-bus.c (except
for the ps3_dma_supported local), so I assume that 32 bits is sufficient
for any device. It appears to me the code establishes a 1:1 mapping
of all possible memory with no provision for allocating blocks or
checking that a bus address belongs to another memory segment.
Feel free to point out any errors in the above analysis, otherwise
I assume the required mask matches the dma_supported op.
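To make the relationship between the two ops concrete, here is a userspace sketch. SKETCH_DMA_BIT_MASK is written to have the same shape as the kernel's DMA_BIT_MASK macro; sketch_dma_supported and sketch_get_required_mask are hypothetical names that mirror the quoted ps3 functions without the struct device argument.

```c
#include <stdint.h>

/* Same shape as the kernel's DMA_BIT_MASK(); redefined so the sketch builds in userspace. */
#define SKETCH_DMA_BIT_MASK(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

/* Mirrors the quoted ps3_dma_supported(): any mask covering 32 bits is accepted. */
static int sketch_dma_supported(uint64_t mask)
{
	return mask >= SKETCH_DMA_BIT_MASK(32);
}

/* Mirrors the proposed hook: report the smallest mask the check above accepts. */
static uint64_t sketch_get_required_mask(void)
{
	return SKETCH_DMA_BIT_MASK(32);
}
```

Under this model a 64-bit mask is also accepted by the supported check, so returning 32 bits from the required-mask hook is the minimal answer rather than the only valid one.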
Does the lv1 hypervisor offer more than 4G of memory to the lpar?
milton
^ permalink raw reply
* [PATCH] powerpc: Make early memory scan more resilient to out of order nodes
From: Benjamin Herrenschmidt @ 2011-05-12 6:58 UTC (permalink / raw)
To: linuxppc-dev
We keep track of the size of the lowest block of memory and call
setup_initial_memory_limit() only after we've parsed them all
Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
---
arch/powerpc/kernel/prom.c | 15 ++++++++++-----
1 files changed, 10 insertions(+), 5 deletions(-)
diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 584b398..27475c6 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -70,6 +70,7 @@ int __initdata iommu_force_on;
unsigned long tce_alloc_start, tce_alloc_end;
u64 ppc64_rma_size;
#endif
+static phys_addr_t first_memblock_size;
static int __init early_parse_mem(char *p)
{
@@ -507,11 +508,14 @@ void __init early_init_dt_add_memory_arch(u64 base, u64 size)
size = 0x80000000ul - base;
}
#endif
-
- /* First MEMBLOCK added, do some special initializations */
- if (memstart_addr == ~(phys_addr_t)0)
- setup_initial_memory_limit(base, size);
- memstart_addr = min((u64)memstart_addr, base);
+ /* Keep track of the beginning of memory -and- the size of
+ * the very first block in the device-tree as it represents
+ * the RMA on ppc64 server
+ */
+ if (base < memstart_addr) {
+ memstart_addr = base;
+ first_memblock_size = size;
+ }
/* Add the chunk to the MEMBLOCK list */
memblock_add(base, size);
@@ -708,6 +712,7 @@ void __init early_init_devtree(void *params)
of_scan_flat_dt(early_init_dt_scan_root, NULL);
of_scan_flat_dt(early_init_dt_scan_memory_ppc, NULL);
+ setup_initial_memory_limit(memstart_addr, first_memblock_size);
/* Save command line for /proc/cmdline and then parse parameters */
strlcpy(boot_command_line, cmd_line, COMMAND_LINE_SIZE);
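The bookkeeping in the hunk above can be tried in isolation. The following is a sketch, not the kernel code: sketch_phys_addr_t, sketch_memstart, sketch_first_size and sketch_add_memory are hypothetical names, and the all-ones initial value is an assumption taken from the comparison the patch removes.

```c
#include <stdint.h>

typedef uint64_t sketch_phys_addr_t;	/* stand-in for phys_addr_t */

/* Assumed to start at all-ones so the first block seen always compares lower. */
static sketch_phys_addr_t sketch_memstart = ~(sketch_phys_addr_t)0;
static sketch_phys_addr_t sketch_first_size;

/* Called once per memory node, in whatever order the device tree lists them. */
static void sketch_add_memory(uint64_t base, uint64_t size)
{
	/* Remember the lowest base seen so far together with that block's size. */
	if (base < sketch_memstart) {
		sketch_memstart = base;
		sketch_first_size = size;
	}
}
```

Once every node has been fed through sketch_add_memory(), the pair (sketch_memstart, sketch_first_size) describes the lowest block regardless of node order, which is what gets handed to setup_initial_memory_limit() after the scan.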
^ permalink raw reply related
* RE: [linuxppc-release] [PATCH 1/2] powerpc, e5500: add networking to defconfig
From: Li Yang-R58472 @ 2011-05-12 6:11 UTC (permalink / raw)
To: Wood Scott-B07421, galak@kernel.crashing.org
Cc: linuxppc-dev@lists.ozlabs.org
In-Reply-To: <20110510180147.GA18443@schlenkerla.am.freescale.net>
>Subject: [linuxppc-release] [PATCH 1/2] powerpc, e5500: add networking to
>defconfig
>
>Even though support for the p5020's on-chip ethernet is not yet upstream,
>it is not appropriate to disable all networking support (including
>loopback, unix domain sockets, external ethernet devices, etc) in the
>defconfig. The networking settings are taken from mpc85xx_smp_defconfig,
>minus the drivers for ethernet devices not found on any current e5500 chip.
>
>The other changes are the result of running "make savedefconfig".
>
>Signed-off-by: Scott Wood <scottwood@freescale.com>
>---
> arch/powerpc/configs/e55xx_smp_defconfig | 38 ++++++++++++++++++++++-------
> 1 files changed, 29 insertions(+), 9 deletions(-)
>
>diff --git a/arch/powerpc/configs/e55xx_smp_defconfig b/arch/powerpc/configs/e55xx_smp_defconfig
>index 9fa1613..f4c5780 100644
>--- a/arch/powerpc/configs/e55xx_smp_defconfig
>+++ b/arch/powerpc/configs/e55xx_smp_defconfig
>@@ -6,10 +6,10 @@ CONFIG_NR_CPUS=2
> CONFIG_EXPERIMENTAL=y
> CONFIG_SYSVIPC=y
> CONFIG_BSD_PROCESS_ACCT=y
>+CONFIG_SPARSE_IRQ=y
Hi Scott,
I remember in previous testing that this option has a negative effect on
performance. Do we really need it to be enabled?
- Leo
^ permalink raw reply
* Re: [PATCH 7/8] powerpc: use the newly added get_required_mask dma_map_ops hook
From: Geert Uytterhoeven @ 2011-05-12 5:51 UTC (permalink / raw)
To: Nishanth Aravamudan, Milton Miller
Cc: cbe-oss-dev, FUJITA Tomonori, Greg Kroah-Hartman, Arnd Bergmann,
Geoff Levand, Sean MacLennan, linux-kernel, Paul Mackerras,
H. Peter Anvin, Will Schmidt, Andrew Morton, linuxppc-dev,
David S. Miller
In-Reply-To: <1305152704-4864-8-git-send-email-nacc@us.ibm.com>
On Thu, May 12, 2011 at 00:25, Nishanth Aravamudan <nacc@us.ibm.com> wrote:
> diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
> index 23083c3..688141c 100644
> --- a/arch/powerpc/platforms/ps3/system-bus.c
> +++ b/arch/powerpc/platforms/ps3/system-bus.c
> @@ -695,12 +695,18 @@ static int ps3_dma_supported(struct device *_dev, u64 mask)
>        return mask >= DMA_BIT_MASK(32);
>  }
>
> +static u64 ps3_dma_get_required_mask(struct device *_dev)
> +{
> +       return DMA_BIT_MASK(32);
Why 32 and not 64?
> +}
> +
>  static struct dma_map_ops ps3_sb_dma_ops = {
>        .alloc_coherent = ps3_alloc_coherent,
>        .free_coherent = ps3_free_coherent,
>        .map_sg = ps3_sb_map_sg,
>        .unmap_sg = ps3_sb_unmap_sg,
>        .dma_supported = ps3_dma_supported,
> +       .get_required_mask = ps3_dma_get_required_mask,
>        .map_page = ps3_sb_map_page,
>        .unmap_page = ps3_unmap_page,
>  };
> @@ -711,6 +717,7 @@ static struct dma_map_ops ps3_ioc0_dma_ops = {
>        .map_sg = ps3_ioc0_map_sg,
>        .unmap_sg = ps3_ioc0_unmap_sg,
>        .dma_supported = ps3_dma_supported,
> +       .get_required_mask = ps3_dma_get_required_mask,
>        .map_page = ps3_ioc0_map_page,
>        .unmap_page = ps3_unmap_page,
>  };
Gr{oetje,eeting}s,
                        Geert
--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org
In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds
^ permalink raw reply
* [PATCH 3/5] v2 seccomp_filters: Enable ftrace-based system call filtering
From: Will Drewry @ 2011-05-12 3:02 UTC (permalink / raw)
To: linux-kernel
Cc: linux-mips, linux-sh, Peter Zijlstra, Frederic Weisbecker,
Heiko Carstens, David Howells, Paul Mackerras, Eric Paris,
H. Peter Anvin, sparclinux, Jiri Slaby, linux-s390, Russell King,
x86, jmorris, Ingo Molnar, linux-arm-kernel, Ingo Molnar,
Serge E. Hallyn, Peter Zijlstra, microblaze-uclinux,
Steven Rostedt, Martin Schwidefsky, Thomas Gleixner, kees.cook,
Roland McGrath, Michal Marek, Michal Simek, Will Drewry,
linuxppc-dev, Oleg Nesterov, Ralf Baechle, Paul Mundt, Tejun Heo,
linux390, Andrew Morton, agl, David S. Miller
In-Reply-To: <1304017638.18763.205.camel@gandalf.stny.rr.com>
This change adds a new seccomp mode based on the work by
agl@chromium.org in [1]. This new mode, "filter mode", provides a hash
table of seccomp_filter objects. When in the new mode (2), all system
calls are checked against the filters - first by system call number,
then by a filter string. If an entry exists for a given system call and
all filter predicates evaluate to true, then the task may proceed.
Otherwise, the task is killed (as per seccomp_mode == 1).
Filter string parsing and evaluation is handled by the ftrace filter
engine. Related patches tweak the perf filter parse and free code so that
the calls can be shared. Filters inherit their understanding of types and
arguments for each system call from the CONFIG_FTRACE_SYSCALLS subsystem
which already predefines this information in syscall_metadata associated
enter_event (and exit_event) structures. If CONFIG_FTRACE and
CONFIG_FTRACE_SYSCALLS are not compiled in, only "1" filter strings will
be allowed.
The net result is a process may have its system calls filtered using the
ftrace filter engine's inherent understanding of system calls. A
logical ruleset for a process that only needs stdin/stdout may be:
sys_read: fd == 0
sys_write: fd == 1 || fd == 2
sys_exit: 1
The set of filters is specified through the PR_SET_SECCOMP path in prctl().
For example:
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 0");
prctl(PR_SET_SECCOMP_FILTER, __NR_write, "fd == 1 || fd == 2");
prctl(PR_SET_SECCOMP_FILTER, __NR_exit, "1");
prctl(PR_SET_SECCOMP, 2, 0);
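The lookup described above (match by system call number first, then evaluate the filter) can be modelled in a few lines of userspace C. This is a rough sketch under assumptions, not the patch's implementation: all sketch_ names are hypothetical, compiled C predicates stand in for parsed ftrace filter strings, a NULL predicate stands in for the wildcard filter "1", the 16 buckets follow the "4 bit hash" mentioned in the v2 notes, and only the first entry per system call is consulted.

```c
#include <stddef.h>

#define SKETCH_HASH_BITS 4
#define SKETCH_BUCKETS (1 << SKETCH_HASH_BITS)

/* Hypothetical model of one filter entry. */
struct sketch_filter {
	int syscall_nr;
	int (*allow)(const long *args);	/* NULL models the wildcard filter "1" */
	struct sketch_filter *next;
};

static struct sketch_filter *sketch_table[SKETCH_BUCKETS];

/* Push a filter onto the bucket selected by the low bits of its syscall number. */
static void sketch_add_filter(struct sketch_filter *f)
{
	unsigned h = (unsigned)f->syscall_nr & (SKETCH_BUCKETS - 1);

	f->next = sketch_table[h];
	sketch_table[h] = f;
}

/* Returns 1 if the call may proceed, 0 if the task would be killed. */
static int sketch_check(int syscall_nr, const long *args)
{
	unsigned h = (unsigned)syscall_nr & (SKETCH_BUCKETS - 1);
	const struct sketch_filter *f;

	for (f = sketch_table[h]; f; f = f->next) {
		if (f->syscall_nr != syscall_nr)
			continue;
		return !f->allow || f->allow(args);
	}
	return 0;	/* no entry for this system call */
}

/* Predicates equivalent to "fd == 0" and "fd == 1 || fd == 2". */
static int sketch_fd_is_0(const long *args)
{
	return args[0] == 0;
}

static int sketch_fd_is_1_or_2(const long *args)
{
	return args[0] == 1 || args[0] == 2;
}
```

Registering sketch_fd_is_0 for read, sketch_fd_is_1_or_2 for write and a NULL predicate for exit reproduces the stdin/stdout ruleset from the example: any other system call, or a read on a different fd, is rejected.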
v2: - changed to use the existing syscall number ABI.
- prctl changes to minimize parsing in the kernel:
prctl(PR_SET_SECCOMP, {0 | 1 | 2 }, { 0 | ON_EXEC });
prctl(PR_SET_SECCOMP_FILTER, __NR_read, "fd == 5");
prctl(PR_CLEAR_SECCOMP_FILTER, __NR_read);
prctl(PR_GET_SECCOMP_FILTER, __NR_read, buf, bufsize);
- defined PR_SECCOMP_MODE_STRICT and ..._FILTER
- added flags
- provide a default fail syscall_nr_to_meta in ftrace
- provides fallback for unhooked system calls
- use -ENOSYS and ERR_PTR(-ENOSYS) for stubbed functionality
- added kernel/seccomp.h to share seccomp.c/seccomp_filter.c
- moved to a hlist and 4 bit hash of linked lists
- added support to operate without CONFIG_FTRACE_SYSCALLS
- moved Kconfig support next to SECCOMP
(should this be done in per-platform patches?)
- made Kconfig entries dependent on EXPERIMENTAL
- added macros to avoid ifdefs from kernel/fork.c
- added compat task/filter matching
- drop seccomp.h inclusion in sched.h and drop seccomp_t
- added Filtering to "show" output
- added on_exec state dup'ing when enabling after a fast-path accept.
Signed-off-by: Will Drewry <wad@chromium.org>
---
arch/arm/Kconfig | 10 +
arch/microblaze/Kconfig | 10 +
arch/mips/Kconfig | 10 +
arch/powerpc/Kconfig | 10 +
arch/s390/Kconfig | 10 +
arch/sh/Kconfig | 10 +
arch/sparc/Kconfig | 10 +
arch/x86/Kconfig | 10 +
include/linux/prctl.h | 9 +
include/linux/sched.h | 5 +-
include/linux/seccomp.h | 116 +++++++++-
include/trace/syscall.h | 7 +
kernel/Makefile | 3 +
kernel/fork.c | 3 +
kernel/seccomp.c | 228 +++++++++++++++++-
kernel/seccomp.h | 74 ++++++
kernel/seccomp_filter.c | 581 +++++++++++++++++++++++++++++++++++++++++++++++
kernel/sys.c | 15 +-
18 files changed, 1100 insertions(+), 21 deletions(-)
create mode 100644 kernel/seccomp.h
create mode 100644 kernel/seccomp_filter.c
diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 377a7a5..22e1668 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1664,6 +1664,16 @@ config SECCOMP
and the task is only allowed to execute a few safe syscalls
defined by each seccomp mode.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
depends on EXPERIMENTAL
diff --git a/arch/microblaze/Kconfig b/arch/microblaze/Kconfig
index eccdefe..7641ee9 100644
--- a/arch/microblaze/Kconfig
+++ b/arch/microblaze/Kconfig
@@ -129,6 +129,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
menu "Advanced setup"
diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 8e256cc..fe4cbda 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2245,6 +2245,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config USE_OF
bool "Flattened Device Tree support"
select OF
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 8f4d50b..83499e4 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -605,6 +605,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
config ISA_DMA_API
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2508a6f..2777515 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -614,6 +614,16 @@ config SECCOMP
If unsure, say Y.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
endmenu
menu "Power Management"
diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 4b89da2..00c1521 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -676,6 +676,16 @@ config SECCOMP
If unsure, say N.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config SMP
bool "Symmetric multi-processing support"
depends on SYS_SUPPORTS_SMP
diff --git a/arch/sparc/Kconfig b/arch/sparc/Kconfig
index e560d10..5b42255 100644
--- a/arch/sparc/Kconfig
+++ b/arch/sparc/Kconfig
@@ -270,6 +270,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config HOTPLUG_CPU
bool "Support for hot-pluggable CPUs"
depends on SPARC64 && SMP
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index cc6c53a..d6d44d9 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1485,6 +1485,16 @@ config SECCOMP
If unsure, say Y. Only embedded should say N here.
+config SECCOMP_FILTER
+ bool "Enable seccomp-based system call filtering"
+ depends on SECCOMP && EXPERIMENTAL
+ help
+ Per-process, inherited system call filtering using shared code
+ across seccomp and ftrace_syscalls. If CONFIG_FTRACE_SYSCALLS
+ is not available, enhanced filters will not be available.
+
+ See Documentation/prctl/seccomp_filter.txt for more detail.
+
config CC_STACKPROTECTOR
bool "Enable -fstack-protector buffer overflow detection (EXPERIMENTAL)"
---help---
diff --git a/include/linux/prctl.h b/include/linux/prctl.h
index a3baeb2..379b391 100644
--- a/include/linux/prctl.h
+++ b/include/linux/prctl.h
@@ -63,6 +63,15 @@
/* Get/set process seccomp mode */
#define PR_GET_SECCOMP 21
#define PR_SET_SECCOMP 22
+# define PR_SECCOMP_MODE_NONE 0
+# define PR_SECCOMP_MODE_STRICT 1
+# define PR_SECCOMP_MODE_FILTER 2
+# define PR_SECCOMP_FLAG_FILTER_ON_EXEC (1 << 1)
+
+/* Get/set process seccomp filters */
+#define PR_GET_SECCOMP_FILTER 35
+#define PR_SET_SECCOMP_FILTER 36
+#define PR_CLEAR_SECCOMP_FILTER 37
/* Get/set the capability bounding set (as per security/commoncap.c) */
#define PR_CAPBSET_READ 23
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18d63ce..27eacf9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -77,7 +77,6 @@ struct sched_param {
#include <linux/percpu.h>
#include <linux/topology.h>
#include <linux/proportions.h>
-#include <linux/seccomp.h>
#include <linux/rcupdate.h>
#include <linux/rculist.h>
#include <linux/rtmutex.h>
@@ -1190,6 +1189,8 @@ enum perf_event_task_context {
perf_nr_task_contexts,
};
+struct seccomp_state;
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1374,7 +1375,7 @@ struct task_struct {
uid_t loginuid;
unsigned int sessionid;
#endif
- seccomp_t seccomp;
+ struct seccomp_state *seccomp;
/* Thread group tracking */
u32 parent_exec_id;
diff --git a/include/linux/seccomp.h b/include/linux/seccomp.h
index 167c333..289c836 100644
--- a/include/linux/seccomp.h
+++ b/include/linux/seccomp.h
@@ -2,12 +2,34 @@
#define _LINUX_SECCOMP_H
+/* Forward declare for proc interface */
+struct seq_file;
+
#ifdef CONFIG_SECCOMP
+#include <linux/errno.h>
+#include <linux/list.h>
#include <linux/thread_info.h>
+#include <linux/types.h>
#include <asm/seccomp.h>
-typedef struct { int mode; } seccomp_t;
+/**
+ * struct seccomp_state - the state of a seccomp'ed process
+ *
+ * @mode:
+ * if this is 1, the process is under standard seccomp rules;
+ * if this is 2, the process is only allowed to make system calls where
+ * associated filters evaluate successfully.
+ * @usage: number of references to the current instance.
+ * @flags: a bitmask of behavior altering flags.
+ * @filters: Hash table of filters if using CONFIG_SECCOMP_FILTER.
+ */
+struct seccomp_state {
+ uint16_t mode;
+ atomic_t usage;
+ long flags;
+ struct seccomp_filter_table *filters;
+};
extern void __secure_computing(int);
static inline void secure_computing(int this_syscall)
@@ -16,27 +38,113 @@ static inline void secure_computing(int this_syscall)
__secure_computing(this_syscall);
}
+extern struct seccomp_state *seccomp_state_new(void);
+extern struct seccomp_state *seccomp_state_dup(const struct seccomp_state *);
+extern struct seccomp_state *get_seccomp_state(struct seccomp_state *);
+extern void put_seccomp_state(struct seccomp_state *);
+
+extern long prctl_set_seccomp(unsigned long, unsigned long);
extern long prctl_get_seccomp(void);
-extern long prctl_set_seccomp(unsigned long);
+
+extern long prctl_set_seccomp_filter(unsigned long, char __user *);
+extern long prctl_get_seccomp_filter(unsigned long, char __user *,
+ unsigned long);
+extern long prctl_clear_seccomp_filter(unsigned long);
+
+#define inherit_tsk_seccomp_state(_child, _orig) \
+ _child->seccomp = get_seccomp_state(_orig->seccomp);
+#define put_tsk_seccomp_state(_tsk) put_seccomp_state(_tsk->seccomp)
#else /* CONFIG_SECCOMP */
#include <linux/errno.h>
-typedef struct { } seccomp_t;
+struct seccomp_state { };
#define secure_computing(x) do { } while (0)
+#define inherit_tsk_seccomp_state(_child, _orig) do { } while (0)
+#define put_tsk_seccomp_state(_tsk) do { } while (0)
static inline long prctl_get_seccomp(void)
{
return -EINVAL;
}
-static inline long prctl_set_seccomp(unsigned long arg2)
+static inline long prctl_set_seccomp(unsigned long a2, unsigned long a3)
{
return -EINVAL;
}
+static inline long prctl_set_seccomp_filter(unsigned long a2, char __user *a3)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_clear_seccomp_filter(unsigned long a2)
+{
+ return -ENOSYS;
+}
+
+static inline long prctl_get_seccomp_filter(unsigned long a2, char __user *a3,
+ unsigned long a4)
+{
+ return -ENOSYS;
+}
+
+static inline struct seccomp_state *seccomp_state_new(void)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *seccomp_state_dup(
+ const struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline struct seccomp_state *get_seccomp_state(
+ struct seccomp_state *state)
+{
+ return NULL;
+}
+
+static inline void put_seccomp_state(struct seccomp_state *state)
+{
+}
+
#endif /* CONFIG_SECCOMP */
+#ifdef CONFIG_SECCOMP_FILTER
+
+extern int seccomp_show_filters(struct seccomp_state *, struct seq_file *);
+extern long seccomp_set_filter(int, char *);
+extern long seccomp_clear_filter(int);
+extern long seccomp_get_filter(int, char *, unsigned long);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline int seccomp_show_filters(struct seccomp_state *state,
+ struct seq_file *m)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_clear_filter(int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline long seccomp_get_filter(int syscall_nr,
+ char *buf, unsigned long available)
+{
+ return -ENOSYS;
+}
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
#endif /* _LINUX_SECCOMP_H */
diff --git a/include/trace/syscall.h b/include/trace/syscall.h
index 242ae04..e061ad0 100644
--- a/include/trace/syscall.h
+++ b/include/trace/syscall.h
@@ -35,6 +35,8 @@ struct syscall_metadata {
extern unsigned long arch_syscall_addr(int nr);
extern int init_syscall_trace(struct ftrace_event_call *call);
+extern struct syscall_metadata *syscall_nr_to_meta(int);
+
extern int reg_event_syscall_enter(struct ftrace_event_call *call);
extern void unreg_event_syscall_enter(struct ftrace_event_call *call);
extern int reg_event_syscall_exit(struct ftrace_event_call *call);
@@ -49,6 +51,11 @@ enum print_line_t print_syscall_enter(struct trace_iterator *iter, int flags,
struct trace_event *event);
enum print_line_t print_syscall_exit(struct trace_iterator *iter, int flags,
struct trace_event *event);
+#else
+static inline struct syscall_metadata *syscall_nr_to_meta(int nr)
+{
+ return NULL;
+}
#endif
#ifdef CONFIG_PERF_EVENTS
diff --git a/kernel/Makefile b/kernel/Makefile
index 85cbfb3..84e7dfb 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -81,6 +81,9 @@ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o
obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o
obj-$(CONFIG_GENERIC_HARDIRQS) += irq/
obj-$(CONFIG_SECCOMP) += seccomp.o
+ifeq ($(CONFIG_SECCOMP_FILTER),y)
+obj-$(CONFIG_SECCOMP) += seccomp_filter.o
+endif
obj-$(CONFIG_RCU_TORTURE_TEST) += rcutorture.o
obj-$(CONFIG_TREE_RCU) += rcutree.o
obj-$(CONFIG_TREE_PREEMPT_RCU) += rcutree.o
diff --git a/kernel/fork.c b/kernel/fork.c
index e7548de..46987d4 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -34,6 +34,7 @@
#include <linux/cgroup.h>
#include <linux/security.h>
#include <linux/hugetlb.h>
+#include <linux/seccomp.h>
#include <linux/swap.h>
#include <linux/syscalls.h>
#include <linux/jiffies.h>
@@ -169,6 +170,7 @@ void free_task(struct task_struct *tsk)
free_thread_info(tsk->stack);
rt_mutex_debug_task_free(tsk);
ftrace_graph_exit_task(tsk);
+ put_tsk_seccomp_state(tsk);
free_task_struct(tsk);
}
EXPORT_SYMBOL(free_task);
@@ -280,6 +282,7 @@ static struct task_struct *dup_task_struct(struct task_struct *orig)
if (err)
goto out;
+ inherit_tsk_seccomp_state(tsk, orig);
setup_thread_stack(tsk, orig);
clear_user_return_notifier(tsk);
clear_tsk_need_resched(tsk);
diff --git a/kernel/seccomp.c b/kernel/seccomp.c
index 57d4b13..502ba04 100644
--- a/kernel/seccomp.c
+++ b/kernel/seccomp.c
@@ -6,12 +6,15 @@
* This defines a simple but solid secure-computing mode.
*/
+#include <linux/err.h>
+#include <linux/prctl.h>
#include <linux/seccomp.h>
#include <linux/sched.h>
+#include <linux/slab.h>
#include <linux/compat.h>
+#include <linux/unistd.h>
-/* #define SECCOMP_DEBUG 1 */
-#define NR_SECCOMP_MODES 1
+#include "seccomp.h"
/*
* Secure computing mode 1 allows only read/write/exit/sigreturn.
@@ -32,11 +35,13 @@ static int mode1_syscalls_32[] = {
void __secure_computing(int this_syscall)
{
- int mode = current->seccomp.mode;
+ int mode = -1;
+ long ret = 0;
int * syscall;
-
+ if (current->seccomp)
+ mode = current->seccomp->mode;
switch (mode) {
- case 1:
+ case PR_SECCOMP_MODE_STRICT:
syscall = mode1_syscalls;
#ifdef CONFIG_COMPAT
if (is_compat_task())
@@ -47,6 +52,20 @@ void __secure_computing(int this_syscall)
return;
} while (*++syscall);
break;
+ case PR_SECCOMP_MODE_FILTER:
+ if (this_syscall >= NR_syscalls || this_syscall < 0)
+ break;
+ ret = seccomp_test_filters(current->seccomp, this_syscall);
+ if (!ret)
+ return;
+ /* Only check for an override if an access failure occurred. */
+ if (ret != -EACCES)
+ break;
+ ret = seccomp_maybe_apply_filters(current, this_syscall);
+ if (!ret)
+ return;
+ seccomp_filter_log_failure(this_syscall);
+ break;
default:
BUG();
}
@@ -57,30 +76,213 @@ void __secure_computing(int this_syscall)
do_exit(SIGKILL);
}
+/* seccomp_state_new - allocate a new state object. */
+struct seccomp_state *seccomp_state_new(void)
+{
+ struct seccomp_state *new = kzalloc(sizeof(struct seccomp_state),
+ GFP_KERNEL);
+ if (!new)
+ return NULL;
+
+ new->flags = 0;
+#ifdef CONFIG_COMPAT
+ /* Annotate if this filterset is being created by a compat task. */
+ if (is_compat_task())
+ new->flags |= SECCOMP_FLAG_COMPAT;
+#endif
+
+ atomic_set(&new->usage, 1);
+ new->filters = seccomp_filter_table_new();
+ /* Not supported errors are fine, others are a problem. */
+ if (IS_ERR(new->filters) && PTR_ERR(new->filters) != -ENOSYS) {
+ kfree(new);
+ new = NULL;
+ }
+ return new;
+}
+
+/* seccomp_state_dup - copies an existing state object. */
+struct seccomp_state *seccomp_state_dup(const struct seccomp_state *orig)
+{
+ int err;
+ struct seccomp_state *new_state = seccomp_state_new();
+
+ err = -ENOMEM;
+ if (!new_state)
+ goto fail;
+ new_state->mode = orig->mode;
+ /* Flag copying will hide if the new process is a compat task. However,
+ * if the rule was compat/non-compat and the process is the opposite,
+ * enforcement will terminate it.
+ */
+ new_state->flags = orig->flags;
+ err = seccomp_copy_all_filters(new_state->filters,
+ orig->filters);
+ if (err)
+ goto fail;
+
+ return new_state;
+fail:
+ put_seccomp_state(new_state);
+ return NULL;
+}
+
+/* get_seccomp_state - increments the reference count of @orig */
+struct seccomp_state *get_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return NULL;
+ atomic_inc(&orig->usage);
+ return orig;
+}
+
+static void __put_seccomp_state(struct seccomp_state *orig)
+{
+ WARN_ON(atomic_read(&orig->usage));
+ seccomp_drop_all_filters(orig);
+ kfree(orig);
+}
+
+/* put_seccomp_state - decrements the reference count of @orig and may free. */
+void put_seccomp_state(struct seccomp_state *orig)
+{
+ if (!orig)
+ return;
+
+ if (atomic_dec_and_test(&orig->usage))
+ __put_seccomp_state(orig);
+}
+
long prctl_get_seccomp(void)
{
- return current->seccomp.mode;
+ if (!current->seccomp)
+ return 0;
+ return current->seccomp->mode;
}
-long prctl_set_seccomp(unsigned long seccomp_mode)
+long prctl_set_seccomp(unsigned long seccomp_mode, unsigned long flags)
{
long ret;
+ struct seccomp_state *state, *cur_state;
+ cur_state = get_seccomp_state(current->seccomp);
/* can set it only once to be even more secure */
ret = -EPERM;
- if (unlikely(current->seccomp.mode))
+ if (cur_state && unlikely(cur_state->mode))
goto out;
ret = -EINVAL;
- if (seccomp_mode && seccomp_mode <= NR_SECCOMP_MODES) {
- current->seccomp.mode = seccomp_mode;
- set_thread_flag(TIF_SECCOMP);
+ if (seccomp_mode <= 0 || seccomp_mode > NR_SECCOMP_MODES)
+ goto out;
+
+ ret = -ENOMEM;
+ state = (cur_state ? seccomp_state_dup(cur_state) :
+ seccomp_state_new());
+ if (!state)
+ goto out;
+
+ if (seccomp_mode == PR_SECCOMP_MODE_STRICT) {
#ifdef TIF_NOTSC
disable_TSC();
#endif
- ret = 0;
}
- out:
+ rcu_assign_pointer(current->seccomp, state);
+ synchronize_rcu();
+ put_seccomp_state(cur_state); /* For the task */
+
+ /* Convert the ABI flag to the internal flag value. */
+ if (seccomp_mode == PR_SECCOMP_MODE_FILTER &&
+ (flags & PR_SECCOMP_FLAG_FILTER_ON_EXEC))
+ state->flags |= SECCOMP_FLAG_ON_EXEC;
+ /* Encourage flag values to stay synchronized explicitly. */
+ BUILD_BUG_ON(PR_SECCOMP_FLAG_FILTER_ON_EXEC != SECCOMP_FLAG_ON_EXEC);
+
+ /* Only set the thread flag once after the new state is in place. */
+ state->mode = seccomp_mode;
+ set_thread_flag(TIF_SECCOMP);
+ ret = 0;
+
+out:
+ put_seccomp_state(cur_state); /* for the get */
+ return ret;
+}
+
+long prctl_set_seccomp_filter(unsigned long syscall_nr,
+ char __user *user_filter)
+{
+ int nr;
+ long ret;
+ char filter[SECCOMP_MAX_FILTER_LENGTH];
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+
+ ret = -EFAULT;
+ if (!user_filter ||
+ strncpy_from_user(filter, user_filter,
+ sizeof(filter) - 1) < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_set_filter(nr, filter);
+
+out:
+ return ret;
+}
+
+long prctl_clear_seccomp_filter(unsigned long syscall_nr)
+{
+ int nr = -1;
+ long ret;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+
+ nr = (int) syscall_nr;
+ ret = seccomp_clear_filter(nr);
+
+out:
+ return ret;
+}
+
+long prctl_get_seccomp_filter(unsigned long syscall_nr, char __user *dst,
+ unsigned long available)
+{
+ int ret, nr;
+ unsigned long copied;
+ char *buf = NULL;
+ ret = -EINVAL;
+ if (!available)
+ goto out;
+ /* Ignore extra buffer space. */
+ if (available > SECCOMP_MAX_FILTER_LENGTH)
+ available = SECCOMP_MAX_FILTER_LENGTH;
+
+ ret = -EINVAL;
+ if (syscall_nr >= NR_syscalls || syscall_nr < 0)
+ goto out;
+ nr = (int) syscall_nr;
+
+ ret = -ENOMEM;
+ buf = kmalloc(available, GFP_KERNEL);
+ if (!buf)
+ goto out;
+
+ ret = seccomp_get_filter(nr, buf, available);
+ if (ret < 0)
+ goto out;
+
+ /* Include the NUL byte in the copy. */
+ copied = copy_to_user(dst, buf, ret + 1);
+ ret = -EFAULT;
+ if (copied)
+ goto out;
+
+ ret = 0;
+out:
+ kfree(buf);
return ret;
}
diff --git a/kernel/seccomp.h b/kernel/seccomp.h
new file mode 100644
index 0000000..5abd219
--- /dev/null
+++ b/kernel/seccomp.h
@@ -0,0 +1,74 @@
+/*
+ * seccomp/seccomp_filter shared internal prototypes and state.
+ *
+ * Copyright (C) 2011 Chromium OS Authors.
+ */
+
+#ifndef __KERNEL_SECCOMP_H
+#define __KERNEL_SECCOMP_H
+
+#include <linux/ftrace_event.h>
+#include <linux/seccomp.h>
+
+/* #define SECCOMP_DEBUG 1 */
+#define NR_SECCOMP_MODES 2
+
+/* Inherit the max filter length from the filtering engine. */
+#define SECCOMP_MAX_FILTER_LENGTH MAX_FILTER_STR_VAL
+
+/* Presently, flags only affect SECCOMP_FILTER. */
+#define _SECCOMP_FLAG_COMPAT 0
+#define _SECCOMP_FLAG_ON_EXEC 1
+
+#define SECCOMP_FLAG_COMPAT (1 << (_SECCOMP_FLAG_COMPAT))
+#define SECCOMP_FLAG_ON_EXEC (1 << (_SECCOMP_FLAG_ON_EXEC))
+
+
+#ifdef CONFIG_SECCOMP_FILTER
+
+#define SECCOMP_FILTER_HASH_BITS 4
+#define SECCOMP_FILTER_HASH_SIZE (1 << SECCOMP_FILTER_HASH_BITS)
+
+struct seccomp_filter_table;
+extern struct seccomp_filter_table *seccomp_filter_table_new(void);
+extern int seccomp_copy_all_filters(struct seccomp_filter_table *,
+ const struct seccomp_filter_table *);
+extern void seccomp_drop_all_filters(struct seccomp_state *);
+
+extern int seccomp_test_filters(struct seccomp_state *, int);
+extern int seccomp_maybe_apply_filters(struct task_struct *, int);
+extern void seccomp_filter_log_failure(int);
+
+#else /* CONFIG_SECCOMP_FILTER */
+
+static inline void seccomp_filter_log_failure(int syscall)
+{
+}
+
+static inline int seccomp_maybe_apply_filters(struct task_struct *tsk,
+ int syscall_nr)
+{
+ return -ENOSYS;
+}
+
+static inline struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+ return ERR_PTR(-ENOSYS);
+}
+
+static inline int seccomp_test_filters(struct seccomp_state *state, int nr)
+{
+ return -ENOSYS;
+}
+
+static inline int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+ const struct seccomp_filter_table *src)
+{
+ return 0;
+}
+
+static inline void seccomp_drop_all_filters(struct seccomp_state *state) { }
+
+#endif /* CONFIG_SECCOMP_FILTER */
+
+#endif /* __KERNEL_SECCOMP_H */
diff --git a/kernel/seccomp_filter.c b/kernel/seccomp_filter.c
new file mode 100644
index 0000000..ff4e055
--- /dev/null
+++ b/kernel/seccomp_filter.c
@@ -0,0 +1,581 @@
+/* filter engine-based seccomp system call filtering.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) 2011 The Chromium OS Authors <chromium-os-dev@chromium.org>
+ */
+
+#include <linux/compat.h>
+#include <linux/errno.h>
+#include <linux/hash.h>
+#include <linux/prctl.h>
+#include <linux/seccomp.h>
+#include <linux/seq_file.h>
+#include <linux/sched.h>
+#include <linux/slab.h>
+#include <linux/uaccess.h>
+
+#include <asm/syscall.h>
+#include <trace/syscall.h>
+
+#include "seccomp.h"
+
+#define SECCOMP_MAX_FILTER_COUNT 512
+#define SECCOMP_WILDCARD_FILTER "1"
+
+struct seccomp_filter {
+ struct hlist_node node;
+ int syscall_nr;
+ struct syscall_metadata *data;
+ struct event_filter *event_filter;
+};
+
+struct seccomp_filter_table {
+ struct hlist_head heads[SECCOMP_FILTER_HASH_SIZE];
+ int count;
+};
+
+struct seccomp_filter_table *seccomp_filter_table_new(void)
+{
+ struct seccomp_filter_table *t =
+ kzalloc(sizeof(struct seccomp_filter_table), GFP_KERNEL);
+ if (!t)
+ return ERR_PTR(-ENOMEM);
+ return t;
+}
+
+static inline u32 syscall_hash(int syscall_nr)
+{
+ return hash_32(syscall_nr, SECCOMP_FILTER_HASH_BITS);
+}
+
+static const char *get_filter_string(struct seccomp_filter *f)
+{
+ const char *str = SECCOMP_WILDCARD_FILTER;
+ if (!f)
+ return NULL;
+
+ /* Missing event filters qualify as wildcard matches. */
+ if (!f->event_filter)
+ return str;
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ str = ftrace_get_filter_string(f->event_filter);
+#endif
+ return str;
+}
+
+static struct seccomp_filter *alloc_seccomp_filter(int syscall_nr,
+ const char *filter_string)
+{
+ int err = -ENOMEM;
+ struct seccomp_filter *filter = kzalloc(sizeof(struct seccomp_filter),
+ GFP_KERNEL);
+ if (!filter)
+ goto fail;
+
+ INIT_HLIST_NODE(&filter->node);
+ filter->syscall_nr = syscall_nr;
+ filter->data = syscall_nr_to_meta(syscall_nr);
+
+ /* Treat a filter of SECCOMP_WILDCARD_FILTER as a wildcard and skip
+ * using a predicate at all.
+ */
+ if (!strcmp(SECCOMP_WILDCARD_FILTER, filter_string))
+ goto out;
+
+ /* Argument-based filtering only works on ftrace-hooked syscalls. */
+ if (!filter->data) {
+ err = -ENOSYS;
+ goto fail;
+ }
+
+#ifdef CONFIG_FTRACE_SYSCALLS
+ err = ftrace_parse_filter(&filter->event_filter,
+ filter->data->enter_event->event.type,
+ filter_string);
+ if (err)
+ goto fail;
+#endif
+
+out:
+ return filter;
+
+fail:
+ kfree(filter);
+ return ERR_PTR(err);
+}
+
+static void free_seccomp_filter(struct seccomp_filter *filter)
+{
+#ifdef CONFIG_FTRACE_SYSCALLS
+ ftrace_free_filter(filter->event_filter);
+#endif
+ kfree(filter);
+}
+
+static struct seccomp_filter *copy_seccomp_filter(struct seccomp_filter *orig)
+{
+ return alloc_seccomp_filter(orig->syscall_nr, get_filter_string(orig));
+}
+
+/* Returns the matching filter or NULL */
+static struct seccomp_filter *find_filter(struct seccomp_state *state,
+ int syscall)
+{
+ struct hlist_node *this, *pos;
+ struct seccomp_filter *filter = NULL;
+
+ u32 head = syscall_hash(syscall);
+ if (head >= SECCOMP_FILTER_HASH_SIZE)
+ goto out;
+
+ hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+ filter = hlist_entry(this, struct seccomp_filter, node);
+ if (filter->syscall_nr == syscall)
+ goto out;
+ }
+
+ filter = NULL;
+
+out:
+ return filter;
+}
+
+/* Safely drops all filters for a given syscall. This should only be called
+ * on unattached seccomp_state objects.
+ */
+static void drop_filter(struct seccomp_state *state, int syscall_nr)
+{
+ struct seccomp_filter *filter = find_filter(state, syscall_nr);
+ if (!filter)
+ return;
+
+ WARN_ON(state->filters->count == 0);
+ state->filters->count--;
+ hlist_del(&filter->node);
+ free_seccomp_filter(filter);
+}
+
+/* This should only be called on unattached seccomp_state objects. */
+static int add_filter(struct seccomp_state *state, int syscall_nr,
+ char *filter_string)
+{
+ struct seccomp_filter *filter;
+ struct hlist_head *head;
+ char merged[SECCOMP_MAX_FILTER_LENGTH];
+ int ret;
+ u32 hash = syscall_hash(syscall_nr);
+
+ ret = -EINVAL;
+ if (state->filters->count == SECCOMP_MAX_FILTER_COUNT - 1)
+ goto out;
+
+ filter_string = strstrip(filter_string);
+
+ /* Disallow empty strings. */
+ if (filter_string[0] == 0)
+ goto out;
+
+ /* Get the right list head. */
+ head = &state->filters->heads[hash];
+
+ /* Find out if there is an existing entry to append to and
+ * build the resultant filter string. The original filter can be
+ * destroyed here since the caller should be operating on a copy.
+ */
+ filter = find_filter(state, syscall_nr);
+ if (filter) {
+ int expected = snprintf(merged, sizeof(merged), "(%s) && %s",
+ get_filter_string(filter),
+ filter_string);
+ ret = -E2BIG;
+ if (expected >= sizeof(merged) || expected < 0)
+ goto out;
+ filter_string = merged;
+ hlist_del(&filter->node);
+ free_seccomp_filter(filter);
+ }
+
+ /* When in seccomp filtering mode, only allow additions. */
+ ret = -EACCES;
+ if (filter == NULL && state->mode == PR_SECCOMP_MODE_FILTER)
+ goto out;
+
+ ret = 0;
+ filter = alloc_seccomp_filter(syscall_nr, filter_string);
+ if (IS_ERR(filter)) {
+ ret = PTR_ERR(filter);
+ goto out;
+ }
+
+ state->filters->count++;
+ hlist_add_head(&filter->node, head);
+out:
+ return ret;
+}
+
+/* Wrap optional ftrace syscall support. Returns 1 on match or if ftrace is not
+ * supported.
+ */
+static int do_ftrace_syscall_match(struct event_filter *event_filter)
+{
+ int err = 1;
+#ifdef CONFIG_FTRACE_SYSCALLS
+ uint8_t syscall_state[64];
+
+ memset(syscall_state, 0, sizeof(syscall_state));
+
+ /* The generic tracing entry can remain zeroed. */
+ err = ftrace_syscall_enter_state(syscall_state, sizeof(syscall_state),
+ NULL);
+ if (err)
+ return 0;
+
+ err = filter_match_preds(event_filter, syscall_state);
+#endif
+ return err;
+}
+
+/* 1 on match, 0 otherwise. */
+static int filter_match_current(struct seccomp_filter *filter)
+{
+ /* If no event filter exists, we assume a wildcard match. */
+ if (!filter->event_filter)
+ return 1;
+
+ return do_ftrace_syscall_match(filter->event_filter);
+}
+
+#ifndef KSTK_EIP
+#define KSTK_EIP(x) 0L
+#endif
+
+static const char *syscall_nr_to_name(int syscall)
+{
+ const char *syscall_name = "unknown";
+ struct syscall_metadata *data = syscall_nr_to_meta(syscall);
+ if (data)
+ syscall_name = data->name;
+ return syscall_name;
+}
+
+void seccomp_filter_log_failure(int syscall)
+{
+ printk(KERN_INFO
+ "%s[%d]: system call %d (%s) blocked at ip:%lx\n",
+ current->comm, task_pid_nr(current), syscall,
+ syscall_nr_to_name(syscall), KSTK_EIP(current));
+}
+
+/**
+ * seccomp_drop_all_filters - cleans up the filter list and frees the table
+ * @state: the seccomp_state to destroy the filters in.
+ */
+void seccomp_drop_all_filters(struct seccomp_state *state)
+{
+ struct hlist_node *this, *pos;
+ int head;
+ if (!state->filters)
+ return;
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ hlist_for_each_safe(this, pos, &state->filters->heads[head]) {
+ struct seccomp_filter *f = hlist_entry(this,
+ struct seccomp_filter, node);
+ WARN_ON(state->filters->count == 0);
+ hlist_del(this);
+ free_seccomp_filter(f);
+ state->filters->count--;
+ }
+ }
+ kfree(state->filters);
+}
+
+/**
+ * seccomp_copy_all_filters - copies all filters from src to dst.
+ *
+ * @dst: seccomp_filter_table to populate.
+ * @src: table to read from.
+ * Returns non-zero on failure.
+ * Both the source and the destination should have no simultaneous
+ * writers, and dst should be exclusive to the caller.
+ */
+int seccomp_copy_all_filters(struct seccomp_filter_table *dst,
+ const struct seccomp_filter_table *src)
+{
+ struct seccomp_filter *filter;
+ int head, ret = 0;
+ BUG_ON(!dst || !src);
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ struct hlist_node *pos;
+ hlist_for_each_entry(filter, pos, &src->heads[head], node) {
+ struct seccomp_filter *new_filter =
+ copy_seccomp_filter(filter);
+ if (IS_ERR(new_filter)) {
+ ret = PTR_ERR(new_filter);
+ goto done;
+ }
+ hlist_add_head(&new_filter->node,
+ &dst->heads[head]);
+ dst->count++;
+ }
+ }
+
+done:
+ return ret;
+}
+
+/**
+ * seccomp_show_filters - prints the filter state to a seq_file
+ * @state: the seccomp_state to enumerate the filter and bitmask of
+ * @m: the prepared seq_file to receive the data
+ *
+ * Returns 0 on a successful write.
+ */
+int seccomp_show_filters(struct seccomp_state *state, struct seq_file *m)
+{
+ int head;
+ struct hlist_node *pos;
+ struct seccomp_filter *filter;
+ int filtering = 0;
+ if (!state)
+ return 0;
+ if (!state->filters)
+ return 0;
+
+ filtering = (state->mode == PR_SECCOMP_MODE_FILTER);
+ filtering &= !(state->flags & SECCOMP_FLAG_ON_EXEC);
+ seq_printf(m, "Filtering: %d\n", filtering);
+ seq_printf(m, "FilterCount: %d\n", state->filters->count);
+ for (head = 0; head < SECCOMP_FILTER_HASH_SIZE; ++head) {
+ hlist_for_each_entry(filter, pos, &state->filters->heads[head],
+ node) {
+ seq_printf(m, "SystemCall: %d (%s)\n",
+ filter->syscall_nr,
+ syscall_nr_to_name(filter->syscall_nr));
+ seq_printf(m, "Filter: %s\n",
+ get_filter_string(filter));
+ }
+ }
+ return 0;
+}
+EXPORT_SYMBOL_GPL(seccomp_show_filters);
+
+/**
+ * seccomp_maybe_apply_filters - conditionally applies seccomp filters
+ * @tsk: task to update
+ * @syscall_nr: current system call in progress
+ * tsk must already be in seccomp filter mode.
+ *
+ * Returns 0 if the call should be allowed or state has been updated.
+ * This call is only reached if no filters matched the current system call.
+ * In some cases, such as when the ON_EXEC flag is set, failure should
+ * not be terminal.
+ */
+int seccomp_maybe_apply_filters(struct task_struct *tsk, int syscall_nr)
+{
+ struct seccomp_state *state, *new_state = NULL;
+ int ret = -EACCES;
+
+ /* There's no question of application if ON_EXEC is not set. */
+ state = get_seccomp_state(tsk->seccomp);
+ if ((state->flags & SECCOMP_FLAG_ON_EXEC) == 0)
+ goto out;
+
+ ret = 0;
+ if (syscall_nr != __NR_execve)
+ goto out;
+
+ new_state = seccomp_state_dup(state);
+ ret = -ENOMEM;
+ if (!new_state)
+ goto out;
+
+ ret = 0;
+ new_state->flags &= ~(SECCOMP_FLAG_ON_EXEC);
+ rcu_assign_pointer(tsk->seccomp, new_state);
+ synchronize_rcu();
+ put_seccomp_state(state); /* for the task */
+
+out:
+ put_seccomp_state(state); /* for the get */
+ return ret;
+}
+
+/**
+ * seccomp_test_filters - tests 'current' against the given syscall
+ * @state: seccomp_state of current to use.
+ * @syscall: number of the system call to test
+ *
+ * Returns 0 on ok and non-zero on error/failure.
+ */
+int seccomp_test_filters(struct seccomp_state *state, int syscall)
+{
+ struct seccomp_filter *filter = NULL;
+ int ret;
+
+#ifdef CONFIG_COMPAT
+ ret = -EPERM;
+ if (!!is_compat_task() != !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+ printk(KERN_INFO "%s[%d]: seccomp filter compat() mismatch.\n",
+ current->comm, task_pid_nr(current));
+ goto out;
+ }
+#endif
+
+ ret = 0;
+ filter = find_filter(state, syscall);
+ if (filter && filter_match_current(filter))
+ goto out;
+
+ ret = -EACCES;
+out:
+ return ret;
+}
+
+/**
+ * seccomp_get_filter - copies the filter_string into "buf"
+ * @syscall_nr: system call number to look up
+ * @buf: destination buffer
+ * @bufsize: available space in the buffer.
+ *
+ * Looks up the filter for the given system call number on current. If found,
+ * the string length of the NUL-terminated buffer is returned and < 0 is
+ * returned on error. The NUL byte is not included in the length.
+ */
+long seccomp_get_filter(int syscall_nr, char *buf, unsigned long bufsize)
+{
+ struct seccomp_state *state;
+ struct seccomp_filter *filter;
+ long ret = -ENOENT;
+
+ if (bufsize > SECCOMP_MAX_FILTER_LENGTH)
+ bufsize = SECCOMP_MAX_FILTER_LENGTH;
+
+ state = get_seccomp_state(current->seccomp);
+ if (!state)
+ goto out;
+
+ filter = find_filter(state, syscall_nr);
+ if (!filter)
+ goto out;
+
+ ret = strlcpy(buf, get_filter_string(filter), bufsize);
+ if (ret >= bufsize) {
+ ret = -ENOSPC;
+ goto out;
+ }
+ /* Zero out any remaining buffer, just in case. */
+ memset(buf + ret, 0, bufsize - ret);
+out:
+ put_seccomp_state(state);
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_get_filter);
+
+/**
+ * seccomp_clear_filter - clears the seccomp filter for a syscall.
+ * @syscall_nr: the system call number to clear filters for.
+ *
+ * (acts as a frontend for seccomp_set_filter. All restrictions
+ * apply)
+ *
+ * Returns 0 on success.
+ */
+long seccomp_clear_filter(int syscall_nr)
+{
+ return seccomp_set_filter(syscall_nr, NULL);
+}
+EXPORT_SYMBOL_GPL(seccomp_clear_filter);
+
+/**
+ * seccomp_set_filter - Adds/extends a seccomp filter for a syscall.
+ * @syscall_nr: system call number to apply the filter to.
+ * @filter: ftrace filter string to apply.
+ *
+ * Context: User context only. This function may sleep on allocation and
+ * operates on current. current must be attempting a system call
+ * when this is called.
+ *
+ * New filters may be added for system calls when the current task is
+ * not in a secure computing mode (seccomp). Otherwise, filters may only
+ * be added to already filtered system call entries. Any additions will
+ * be &&'d with the existing filter string to ensure no expansion of privileges
+ * will be possible.
+ *
+ * Returns 0 on success or an errno on failure.
+ */
+long seccomp_set_filter(int syscall_nr, char *filter)
+{
+ struct seccomp_state *state, *orig_state;
+ long ret = -EINVAL;
+
+ orig_state = get_seccomp_state(current->seccomp);
+
+ /* Prior to mutating the state, create a duplicate to avoid modifying
+ * the behavior of other instances sharing the state and ensure
+ * consistency.
+ */
+ state = (orig_state ? seccomp_state_dup(orig_state) :
+ seccomp_state_new());
+ if (!state) {
+ ret = -ENOMEM;
+ goto out;
+ }
+
+ /* A NULL filter doubles as a drop value, but the exposed prctl
+ * interface requires a trip through seccomp_clear_filter().
+ * Filter dropping is allowed across the is_compat_task() barrier.
+ */
+ ret = 0;
+ if (filter == NULL) {
+ drop_filter(state, syscall_nr);
+ goto assign;
+ }
+
+ /* Avoid ambiguous filters which may have been inherited from a parent
+ * with different syscall numbers for the logically same calls.
+ */
+#ifdef CONFIG_COMPAT
+ ret = -EACCES;
+ if (is_compat_task() != !!(state->flags & SECCOMP_FLAG_COMPAT)) {
+ if (state->filters->count)
+ goto free_state;
+ /* It's safe to add if there are no existing ambiguous rules. */
+ if (is_compat_task())
+ state->flags |= SECCOMP_FLAG_COMPAT;
+ else
+ state->flags &= ~(SECCOMP_FLAG_COMPAT);
+ }
+#endif
+
+ ret = add_filter(state, syscall_nr, filter);
+ if (ret)
+ goto free_state;
+
+assign:
+ rcu_assign_pointer(current->seccomp, state);
+ synchronize_rcu();
+ put_seccomp_state(orig_state); /* for the task */
+out:
+ put_seccomp_state(orig_state); /* for the get */
+ return ret;
+
+free_state:
+ put_seccomp_state(orig_state); /* for the get */
+ put_seccomp_state(state); /* drop the dup/new */
+ return ret;
+}
+EXPORT_SYMBOL_GPL(seccomp_set_filter);
diff --git a/kernel/sys.c b/kernel/sys.c
index af468ed..d29003a 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -1698,12 +1698,23 @@ SYSCALL_DEFINE5(prctl, int, option, unsigned long, arg2, unsigned long, arg3,
case PR_SET_ENDIAN:
error = SET_ENDIAN(me, arg2);
break;
-
case PR_GET_SECCOMP:
error = prctl_get_seccomp();
break;
case PR_SET_SECCOMP:
- error = prctl_set_seccomp(arg2);
+ error = prctl_set_seccomp(arg2, arg3);
+ break;
+ case PR_SET_SECCOMP_FILTER:
+ error = prctl_set_seccomp_filter(arg2,
+ (char __user *) arg3);
+ break;
+ case PR_CLEAR_SECCOMP_FILTER:
+ error = prctl_clear_seccomp_filter(arg2);
+ break;
+ case PR_GET_SECCOMP_FILTER:
+ error = prctl_get_seccomp_filter(arg2,
+ (char __user *) arg3,
+ arg4);
break;
case PR_GET_TSC:
error = GET_TSC_CTL(arg2);
--
1.7.0.4
^ permalink raw reply related
* Re: [RFC] powerpc: respect how command line nr_cpus is set
From: Milton Miller @ 2011-05-12 0:26 UTC (permalink / raw)
To: Kumar Gala, Benjamin Herrenschmidt; +Cc: linuxppc-dev
In-Reply-To: <1304540257-19831-1-git-send-email-galak@kernel.crashing.org>
On Wed, 04 May 2011 around 10:17:37 -0000, Kumar Gala wrote:
> We should utilize nr_cpus as the max # of CPUs that we can have present
> instead of NR_CPUS. This way we actually respect how nr_cpus is set on
> the command line rather than ignoring it.
>
> Signed-off-by: Kumar Gala <galak@kernel.crashing.org>
>
> ---
> I think this is what we should be doing, but would like someone else to take
> a look.
>
> - k
>
> arch/powerpc/kernel/setup-common.c | 10 +++++-----
> 1 files changed, 5 insertions(+), 5 deletions(-)
>
This looks very similar to my patch at
http://patchwork.ozlabs.org/patch/95080/ except I also updated the
comment. Also, the variable is nr_cpu_ids while the parameter
is nr_cpus=; the first instance in the changelog is referring to
the variable while the second is the parameter.
Sorry it took me so long to get that part of my series tested and posted.
milton
^ permalink raw reply
* [PATCH 8/8] powerpc: tidy up dma_map_ops after adding new hook
From: Nishanth Aravamudan @ 2011-05-11 22:25 UTC (permalink / raw)
To: Milton Miller
Cc: Greg Kroah-Hartman, linux-kernel, FUJITA Tomonori, Paul Mackerras,
Sean MacLennan, Andrew Morton, linuxppc-dev
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
The new get_required_mask hook name is longer than many, but not all,
of the prior ops. Tidy the struct initializers to align the equal signs
using the local whitespace.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/kernel/dma-iommu.c | 14 +++++++-------
arch/powerpc/kernel/dma.c | 16 ++++++++--------
arch/powerpc/kernel/ibmebus.c | 14 +++++++-------
arch/powerpc/kernel/vio.c | 14 +++++++-------
4 files changed, 29 insertions(+), 29 deletions(-)
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index c1ad9db..6f04b9c 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -104,13 +104,13 @@ static u64 dma_iommu_get_required_mask(struct device *dev)
}
struct dma_map_ops dma_iommu_ops = {
- .alloc_coherent = dma_iommu_alloc_coherent,
- .free_coherent = dma_iommu_free_coherent,
- .map_sg = dma_iommu_map_sg,
- .unmap_sg = dma_iommu_unmap_sg,
- .dma_supported = dma_iommu_dma_supported,
- .map_page = dma_iommu_map_page,
- .unmap_page = dma_iommu_unmap_page,
+ .alloc_coherent = dma_iommu_alloc_coherent,
+ .free_coherent = dma_iommu_free_coherent,
+ .map_sg = dma_iommu_map_sg,
+ .unmap_sg = dma_iommu_unmap_sg,
+ .dma_supported = dma_iommu_dma_supported,
+ .map_page = dma_iommu_map_page,
+ .unmap_page = dma_iommu_unmap_page,
.get_required_mask = dma_iommu_get_required_mask,
};
EXPORT_SYMBOL(dma_iommu_ops);
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index df142d1..f94df52 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -149,14 +149,14 @@ static inline void dma_direct_sync_single(struct device *dev,
#endif
struct dma_map_ops dma_direct_ops = {
- .alloc_coherent = dma_direct_alloc_coherent,
- .free_coherent = dma_direct_free_coherent,
- .map_sg = dma_direct_map_sg,
- .unmap_sg = dma_direct_unmap_sg,
- .dma_supported = dma_direct_dma_supported,
- .map_page = dma_direct_map_page,
- .unmap_page = dma_direct_unmap_page,
- .get_required_mask = dma_direct_get_required_mask,
+ .alloc_coherent = dma_direct_alloc_coherent,
+ .free_coherent = dma_direct_free_coherent,
+ .map_sg = dma_direct_map_sg,
+ .unmap_sg = dma_direct_unmap_sg,
+ .dma_supported = dma_direct_dma_supported,
+ .map_page = dma_direct_map_page,
+ .unmap_page = dma_direct_unmap_page,
+ .get_required_mask = dma_direct_get_required_mask,
#ifdef CONFIG_NOT_COHERENT_CACHE
.sync_single_for_cpu = dma_direct_sync_single,
.sync_single_for_device = dma_direct_sync_single,
diff --git a/arch/powerpc/kernel/ibmebus.c b/arch/powerpc/kernel/ibmebus.c
index 90ef2a4..73110fb 100644
--- a/arch/powerpc/kernel/ibmebus.c
+++ b/arch/powerpc/kernel/ibmebus.c
@@ -134,14 +134,14 @@ static u64 ibmebus_dma_get_required_mask(struct device *dev)
}
static struct dma_map_ops ibmebus_dma_ops = {
- .alloc_coherent = ibmebus_alloc_coherent,
- .free_coherent = ibmebus_free_coherent,
- .map_sg = ibmebus_map_sg,
- .unmap_sg = ibmebus_unmap_sg,
- .dma_supported = ibmebus_dma_supported,
+ .alloc_coherent = ibmebus_alloc_coherent,
+ .free_coherent = ibmebus_free_coherent,
+ .map_sg = ibmebus_map_sg,
+ .unmap_sg = ibmebus_unmap_sg,
+ .dma_supported = ibmebus_dma_supported,
.get_required_mask = ibmebus_dma_get_required_mask,
- .map_page = ibmebus_map_page,
- .unmap_page = ibmebus_unmap_page,
+ .map_page = ibmebus_map_page,
+ .unmap_page = ibmebus_unmap_page,
};
static int ibmebus_match_path(struct device *dev, void *data)
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index c049325..34d291d 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -611,13 +611,13 @@ static u64 vio_dma_get_required_mask(struct device *dev)
}
struct dma_map_ops vio_dma_mapping_ops = {
- .alloc_coherent = vio_dma_iommu_alloc_coherent,
- .free_coherent = vio_dma_iommu_free_coherent,
- .map_sg = vio_dma_iommu_map_sg,
- .unmap_sg = vio_dma_iommu_unmap_sg,
- .map_page = vio_dma_iommu_map_page,
- .unmap_page = vio_dma_iommu_unmap_page,
- .dma_supported = vio_dma_iommu_dma_supported,
+ .alloc_coherent = vio_dma_iommu_alloc_coherent,
+ .free_coherent = vio_dma_iommu_free_coherent,
+ .map_sg = vio_dma_iommu_map_sg,
+ .unmap_sg = vio_dma_iommu_unmap_sg,
+ .map_page = vio_dma_iommu_map_page,
+ .unmap_page = vio_dma_iommu_unmap_page,
+ .dma_supported = vio_dma_iommu_dma_supported,
.get_required_mask = vio_dma_get_required_mask,
};
--
1.7.4.1
^ permalink raw reply related
* [PATCH 2/8] pseries/iommu: remove ddw property when destroying window
From: Nishanth Aravamudan @ 2011-05-11 22:24 UTC (permalink / raw)
To: Milton Miller; +Cc: linux-kernel, Paul Mackerras, Will Schmidt, linuxppc-dev
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
If we destroy the window, we need to remove the property recording that
we set up the window. Otherwise the next kernel we kexec will be
confused.
Also we should remove the property even if we don't find the
ibm,ddw-applicable window or if one of the property sizes is unexpected;
presumably these came from a prior kernel via kexec, and we will not be
maintaining the window with respect to memory hotplug.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 12 ++++++++++--
1 files changed, 10 insertions(+), 2 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 05c101e..a0421ac 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -665,9 +665,12 @@ static void remove_ddw(struct device_node *np)
ddr_avail = of_get_property(np, "ibm,ddw-applicable", &len);
win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
- if (!win64 || !ddr_avail || len < 3 * sizeof(u32))
+ if (!win64)
return;
+ if (!ddr_avail || len < 3 * sizeof(u32) || win64->length < sizeof(*dwp))
+ goto delprop;
+
dwp = win64->value;
liobn = (u64)be32_to_cpu(dwp->liobn);
@@ -690,8 +693,13 @@ static void remove_ddw(struct device_node *np)
pr_debug("%s: successfully removed direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
np->full_name, ret, ddr_avail[2], liobn);
-}
+delprop:
+ ret = of_remove_property(np, win64);
+ if (ret)
+ pr_warning("%s: failed to remove direct window property: %d\n",
+ np->full_name, ret);
+}
static u64 dupe_ddw_if_already_created(struct pci_dev *dev, struct device_node *pdn)
{
--
1.7.4.1
^ permalink raw reply related
* [PATCH 1/8] pseries/iommu: add additional checks when changing iommu mask
From: Nishanth Aravamudan @ 2011-05-11 22:24 UTC (permalink / raw)
To: Milton Miller; +Cc: linux-kernel, Paul Mackerras, Will Schmidt, linuxppc-dev
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
Do not check dma supported until we have chosen the right dma ops.
Check that the device is pci before treating it as such.
Check the mask is supported by the selected dma ops before
committing it.
We only need to set iommu ops if it is not the current ops; this
avoids searching the tree for the iommu table unnecessarily.
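The ordering described above can be modelled in userspace as follows. This is only a sketch under stated assumptions: every name ending in _sketch, the has_direct_window flag, and the two "supported" rules are hypothetical stand-ins, not the pseries code. It shows the ops being selected first and the mask being validated against the selected ops before it is committed.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dev_sketch;

struct ops_sketch {
	bool (*supported)(const struct dev_sketch *dev, uint64_t mask);
};

struct dev_sketch {
	uint64_t *dma_mask;		/* NULL: device cannot do DMA */
	bool is_pci;
	bool has_direct_window;		/* stand-in for a usable 64-bit window */
	const struct ops_sketch *ops;	/* currently selected ops */
};

/* Assumed rule: the iommu table sits below 4 GiB, so 32 bits suffice. */
static bool iommu_supported_sketch(const struct dev_sketch *dev, uint64_t mask)
{
	(void)dev;
	return mask >= 0xffffffffULL;
}

/* Assumed rule: the direct window needs a full 64-bit mask. */
static bool direct_supported_sketch(const struct dev_sketch *dev, uint64_t mask)
{
	(void)dev;
	return mask == ~0ULL;
}

static const struct ops_sketch iommu_ops_sketch = { iommu_supported_sketch };
static const struct ops_sketch direct_ops_sketch = { direct_supported_sketch };

static int set_mask_sketch(struct dev_sketch *dev, uint64_t mask)
{
	if (!dev->dma_mask)
		return -EIO;

	/* Only PCI devices take part in ops selection. */
	if (dev->is_pci) {
		if (mask == ~0ULL && dev->has_direct_window)
			dev->ops = &direct_ops_sketch;
		else if (dev->ops != &iommu_ops_sketch)
			dev->ops = &iommu_ops_sketch;	/* switch only when needed */
	}

	/* Validate against the ops chosen above, then commit the mask. */
	if (!dev->ops || !dev->ops->supported(dev, mask))
		return -EIO;

	*dev->dma_mask = mask;
	return 0;
}
```

In this model a device that is currently on the direct ops and asks for a 32-bit mask is first moved back to the iommu ops and only then checked, so the request succeeds; checking before the switch would have tested the mask against the wrong ops.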
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 44d47ac..05c101e 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1026,9 +1026,12 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
const void *dma_window = NULL;
u64 dma_offset;
- if (!dev->dma_mask || !dma_supported(dev, dma_mask))
+ if (!dev->dma_mask)
return -EIO;
+ if (!dev_is_pci(dev))
+ goto check_mask;
+
pdev = to_pci_dev(dev);
/* only attempt to use a new window if 64-bit DMA is requested */
@@ -1059,13 +1062,17 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
}
}
- /* fall-through to iommu ops */
- if (!ddw_enabled) {
- dev_info(dev, "Using 32-bit DMA via iommu\n");
+ /* fall back on iommu ops, restore table pointer with ops */
+ if (!ddw_enabled && get_dma_ops(dev) != &dma_iommu_ops) {
+ dev_info(dev, "Restoring 32-bit DMA via iommu\n");
set_dma_ops(dev, &dma_iommu_ops);
pci_dma_dev_setup_pSeriesLP(pdev);
}
+check_mask:
+ if (!dma_supported(dev, dma_mask))
+ return -EIO;
+
*dev->dma_mask = dma_mask;
return 0;
}
--
1.7.4.1
^ permalink raw reply related
* [PATCH 0/8] pseries/iommu: bug-fixes and cleanups for dynamic dma windows
From: Nishanth Aravamudan @ 2011-05-11 22:24 UTC (permalink / raw)
To: Milton Miller; +Cc: linuxppc-dev, linux-kernel
This series of patches attempts to clean up and fix some bugs related to
dynamic dma windows and kexec/kdump. They build on the three patches I
have submitted most recently:
powerpc: fix kexec with dynamic dma windows
http://patchwork.ozlabs.org/patch/94445/
pseries/iommu: restore iommu table pointer when restoring iommu ops
http://patchwork.ozlabs.org/patch/94909/
pseries/iommu: use correct return type in dupe_ddw_if_already_created
http://patchwork.ozlabs.org/patch/95184/
The full series has been successfully tested with kdump/kexec on a
pseries machine.
Milton Miller (8):
pseries/iommu: add additional checks when changing iommu mask
pseries/iommu: remove ddw property when destroying window
pseries/iommu: find windows after kexec during boot
pseries/iommu: cleanup ddw naming
powerpc: override dma_get_required_mask by platform hook and ops
dma-mapping: add get_required_mask if arch overrides default
powerpc: use the newly added get_required_mask dma_map_ops hook
powerpc: tidy up dma_map_ops after adding new hook
arch/powerpc/include/asm/device.h | 2 +
arch/powerpc/include/asm/machdep.h | 3 +-
arch/powerpc/kernel/dma-iommu.c | 28 +++++--
arch/powerpc/kernel/dma-swiotlb.c | 16 ++++
arch/powerpc/kernel/dma.c | 44 ++++++++--
arch/powerpc/kernel/ibmebus.c | 22 +++--
arch/powerpc/kernel/vio.c | 21 +++--
arch/powerpc/platforms/cell/iommu.c | 21 +++++
arch/powerpc/platforms/ps3/system-bus.c | 7 ++
arch/powerpc/platforms/pseries/iommu.c | 142 +++++++++++++++++++------------
include/linux/dma-mapping.h | 3 +
11 files changed, 224 insertions(+), 85 deletions(-)
--
1.7.4.1
^ permalink raw reply
* [PATCH 7/8] powerpc: use the newly added get_required_mask dma_map_ops hook
From: Nishanth Aravamudan @ 2011-05-11 22:25 UTC (permalink / raw)
To: Milton Miller
Cc: cbe-oss-dev, FUJITA Tomonori, Greg Kroah-Hartman, Arnd Bergmann,
Geoff Levand, Sean MacLennan, linux-kernel, Paul Mackerras,
H. Peter Anvin, Will Schmidt, Andrew Morton, linuxppc-dev,
David S. Miller
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
Now that the generic code has dma_map_ops set, push the computation
into the dma ops instead of having a messy ifdef & if block in the
base dma_get_required_mask hook.
If the ops does not set the get_required_mask hook, default to the
width of dma_addr_t.
This also corrects the ibmebus ibmebus_dma_supported to require a 64
bit mask. I doubt anything is checking or setting the dma mask on
that bus.
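The lookup order this describes (platform hook first, then the per-ops hook, then a default as wide as dma_addr_t) can be sketched in userspace as below. All names ending in _sketch are hypothetical stand-ins for the kernel structures, and a 64-bit dma_addr_t is assumed.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t dma_addr_sketch_t;	/* assumption: 64-bit dma_addr_t */

/* Same shape as the kernel's DMA_BIT_MASK(): n low bits set. */
#define DMA_BIT_MASK_SKETCH(n) (((n) == 64) ? ~0ULL : ((1ULL << (n)) - 1))

struct device_sketch;

struct dma_ops_sketch {
	/* Optional: may be NULL when the ops do not provide the hook. */
	uint64_t (*get_required_mask)(struct device_sketch *dev);
};

struct device_sketch {
	const struct dma_ops_sketch *ops;
};

/* Stand-in for ppc_md.dma_get_required_mask; NULL when unset. */
static uint64_t (*platform_hook_sketch)(struct device_sketch *dev);

static uint64_t get_required_mask_sketch(struct device_sketch *dev)
{
	/* 1. A platform override wins over everything else. */
	if (platform_hook_sketch)
		return platform_hook_sketch(dev);

	/* 2. No ops at all: nothing can be required. */
	if (!dev->ops)
		return 0;

	/* 3. Let the selected ops compute the answer. */
	if (dev->ops->get_required_mask)
		return dev->ops->get_required_mask(dev);

	/* 4. Fall back to the full width of dma_addr_t. */
	return DMA_BIT_MASK_SKETCH(8 * sizeof(dma_addr_sketch_t));
}

/* Example hook in the spirit of the ps3 one: always 32 bits. */
static uint64_t fixed_32bit_mask_sketch(struct device_sketch *dev)
{
	(void)dev;
	return DMA_BIT_MASK_SKETCH(32);
}
```

Keeping the fallback in one place means a new set of ops that forgets the hook still gets a safe, if pessimistic, answer.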
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/include/asm/device.h | 2 +
arch/powerpc/include/asm/dma-mapping.h | 3 --
arch/powerpc/kernel/dma-iommu.c | 3 +-
arch/powerpc/kernel/dma-swiotlb.c | 16 ++++++++++++
arch/powerpc/kernel/dma.c | 41 ++++++++++++-------------------
arch/powerpc/kernel/ibmebus.c | 8 +++++-
arch/powerpc/kernel/vio.c | 7 ++++-
arch/powerpc/platforms/cell/iommu.c | 13 ++++++++-
arch/powerpc/platforms/ps3/system-bus.c | 7 +++++
arch/powerpc/platforms/pseries/iommu.c | 2 +-
10 files changed, 68 insertions(+), 34 deletions(-)
diff --git a/arch/powerpc/include/asm/device.h b/arch/powerpc/include/asm/device.h
index 16d25c0..d57c08a 100644
--- a/arch/powerpc/include/asm/device.h
+++ b/arch/powerpc/include/asm/device.h
@@ -37,4 +37,6 @@ struct pdev_archdata {
u64 dma_mask;
};
+#define ARCH_HAS_DMA_GET_REQUIRED_MASK
+
#endif /* _ASM_POWERPC_DEVICE_H */
diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index 8135e66..dd70fac 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -20,8 +20,6 @@
#define DMA_ERROR_CODE (~(dma_addr_t)0x0)
-#define ARCH_HAS_DMA_GET_REQUIRED_MASK
-
/* Some dma direct funcs must be visible for use in other dma_ops */
extern void *dma_direct_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flag);
@@ -71,7 +69,6 @@ static inline unsigned long device_to_mask(struct device *dev)
*/
#ifdef CONFIG_PPC64
extern struct dma_map_ops dma_iommu_ops;
-extern u64 dma_iommu_get_required_mask(struct device *dev);
#endif
extern struct dma_map_ops dma_direct_ops;
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 1f2a711..c1ad9db 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -90,7 +90,7 @@ static int dma_iommu_dma_supported(struct device *dev, u64 mask)
return 1;
}
-u64 dma_iommu_get_required_mask(struct device *dev)
+static u64 dma_iommu_get_required_mask(struct device *dev)
{
struct iommu_table *tbl = get_iommu_table_base(dev);
u64 mask;
@@ -111,5 +111,6 @@ struct dma_map_ops dma_iommu_ops = {
.dma_supported = dma_iommu_dma_supported,
.map_page = dma_iommu_map_page,
.unmap_page = dma_iommu_unmap_page,
+ .get_required_mask = dma_iommu_get_required_mask,
};
EXPORT_SYMBOL(dma_iommu_ops);
diff --git a/arch/powerpc/kernel/dma-swiotlb.c b/arch/powerpc/kernel/dma-swiotlb.c
index 4295e0b..1ebc918 100644
--- a/arch/powerpc/kernel/dma-swiotlb.c
+++ b/arch/powerpc/kernel/dma-swiotlb.c
@@ -24,6 +24,21 @@
unsigned int ppc_swiotlb_enable;
+static u64 swiotlb_powerpc_get_required(struct device *dev)
+{
+ u64 end, mask, max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
+
+ end = memblock_end_of_DRAM();
+ if (max_direct_dma_addr && end > max_direct_dma_addr)
+ end = max_direct_dma_addr;
+ end += get_dma_offset(dev);
+
+ mask = 1ULL << (fls64(end) - 1);
+ mask += mask - 1;
+
+ return mask;
+}
+
/*
* At the moment, all platforms that use this code only require
* swiotlb to be used if we're operating on HIGHMEM. Since
@@ -44,6 +59,7 @@ struct dma_map_ops swiotlb_dma_ops = {
.sync_sg_for_cpu = swiotlb_sync_sg_for_cpu,
.sync_sg_for_device = swiotlb_sync_sg_for_device,
.mapping_error = swiotlb_dma_mapping_error,
+ .get_required_mask = swiotlb_powerpc_get_required,
};
void pci_dma_dev_setup_swiotlb(struct pci_dev *pdev)
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index 97fe867..df142d1 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -96,6 +96,18 @@ static int dma_direct_dma_supported(struct device *dev, u64 mask)
#endif
}
+static u64 dma_direct_get_required_mask(struct device *dev)
+{
+ u64 end, mask;
+
+ end = memblock_end_of_DRAM() + get_dma_offset(dev);
+
+ mask = 1ULL << (fls64(end) - 1);
+ mask += mask - 1;
+
+ return mask;
+}
+
static inline dma_addr_t dma_direct_map_page(struct device *dev,
struct page *page,
unsigned long offset,
@@ -144,6 +156,7 @@ struct dma_map_ops dma_direct_ops = {
.dma_supported = dma_direct_dma_supported,
.map_page = dma_direct_map_page,
.unmap_page = dma_direct_unmap_page,
+ .get_required_mask = dma_direct_get_required_mask,
#ifdef CONFIG_NOT_COHERENT_CACHE
.sync_single_for_cpu = dma_direct_sync_single,
.sync_single_for_device = dma_direct_sync_single,
@@ -175,7 +188,6 @@ EXPORT_SYMBOL(dma_set_mask);
u64 dma_get_required_mask(struct device *dev)
{
struct dma_map_ops *dma_ops = get_dma_ops(dev);
- u64 mask, end = 0;
if (ppc_md.dma_get_required_mask)
return ppc_md.dma_get_required_mask(dev);
@@ -183,31 +195,10 @@ u64 dma_get_required_mask(struct device *dev)
if (unlikely(dma_ops == NULL))
return 0;
-#ifdef CONFIG_PPC64
- else if (dma_ops == &dma_iommu_ops)
- return dma_iommu_get_required_mask(dev);
-#endif
-#ifdef CONFIG_SWIOTLB
- else if (dma_ops == &swiotlb_dma_ops) {
- u64 max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
-
- end = memblock_end_of_DRAM();
- if (max_direct_dma_addr && end > max_direct_dma_addr)
- end = max_direct_dma_addr;
- end += get_dma_offset(dev);
- }
-#endif
- else if (dma_ops == &dma_direct_ops)
- end = memblock_end_of_DRAM() + get_dma_offset(dev);
- else {
- WARN_ONCE(1, "%s: unknown ops %p\n", __func__, dma_ops);
- end = memblock_end_of_DRAM();
- }
+ if (dma_ops->get_required_mask)
+ return dma_ops->get_required_mask(dev);
- mask = 1ULL << (fls64(end) - 1);
- mask += mask - 1;
-
- return mask;
+ return DMA_BIT_MASK(8 * sizeof(dma_addr_t));
}
EXPORT_SYMBOL_GPL(dma_get_required_mask);
diff --git a/arch/powerpc/kernel/ibmebus.c b/arch/powerpc/kernel/ibmebus.c
index 28581f1..90ef2a4 100644
--- a/arch/powerpc/kernel/ibmebus.c
+++ b/arch/powerpc/kernel/ibmebus.c
@@ -125,7 +125,12 @@ static void ibmebus_unmap_sg(struct device *dev,
static int ibmebus_dma_supported(struct device *dev, u64 mask)
{
- return 1;
+ return mask == DMA_BIT_MASK(64);
+}
+
+static u64 ibmebus_dma_get_required_mask(struct device *dev)
+{
+ return DMA_BIT_MASK(64);
}
static struct dma_map_ops ibmebus_dma_ops = {
@@ -134,6 +139,7 @@ static struct dma_map_ops ibmebus_dma_ops = {
.map_sg = ibmebus_map_sg,
.unmap_sg = ibmebus_unmap_sg,
.dma_supported = ibmebus_dma_supported,
+ .get_required_mask = ibmebus_dma_get_required_mask,
.map_page = ibmebus_map_page,
.unmap_page = ibmebus_unmap_page,
};
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index 1b695fd..c049325 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -605,6 +605,11 @@ static int vio_dma_iommu_dma_supported(struct device *dev, u64 mask)
return dma_iommu_ops.dma_supported(dev, mask);
}
+static u64 vio_dma_get_required_mask(struct device *dev)
+{
+ return dma_iommu_ops.get_required_mask(dev);
+}
+
struct dma_map_ops vio_dma_mapping_ops = {
.alloc_coherent = vio_dma_iommu_alloc_coherent,
.free_coherent = vio_dma_iommu_free_coherent,
@@ -613,7 +618,7 @@ struct dma_map_ops vio_dma_mapping_ops = {
.map_page = vio_dma_iommu_map_page,
.unmap_page = vio_dma_iommu_unmap_page,
.dma_supported = vio_dma_iommu_dma_supported,
-
+ .get_required_mask = vio_dma_get_required_mask,
};
/**
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 5ef55f3..fc46fca 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -1161,11 +1161,20 @@ __setup("iommu_fixed=", setup_iommu_fixed);
static u64 cell_dma_get_required_mask(struct device *dev)
{
+ struct dma_map_ops *dma_ops;
+
if (!dev->dma_mask)
return 0;
- if (iommu_fixed_disabled && get_dma_ops(dev) == &dma_iommu_ops)
- return dma_iommu_get_required_mask(dev);
+ if (!iommu_fixed_disabled &&
+ cell_iommu_get_fixed_address(dev) != OF_BAD_ADDR)
+ return DMA_BIT_MASK(64);
+
+ dma_ops = get_dma_ops(dev);
+ if (dma_ops->get_required_mask)
+ return dma_ops->get_required_mask(dev);
+
+ WARN_ONCE(1, "no get_required_mask in %p ops", dma_ops);
return DMA_BIT_MASK(64);
}
diff --git a/arch/powerpc/platforms/ps3/system-bus.c b/arch/powerpc/platforms/ps3/system-bus.c
index 23083c3..688141c 100644
--- a/arch/powerpc/platforms/ps3/system-bus.c
+++ b/arch/powerpc/platforms/ps3/system-bus.c
@@ -695,12 +695,18 @@ static int ps3_dma_supported(struct device *_dev, u64 mask)
return mask >= DMA_BIT_MASK(32);
}
+static u64 ps3_dma_get_required_mask(struct device *_dev)
+{
+ return DMA_BIT_MASK(32);
+}
+
static struct dma_map_ops ps3_sb_dma_ops = {
.alloc_coherent = ps3_alloc_coherent,
.free_coherent = ps3_free_coherent,
.map_sg = ps3_sb_map_sg,
.unmap_sg = ps3_sb_unmap_sg,
.dma_supported = ps3_dma_supported,
+ .get_required_mask = ps3_dma_get_required_mask,
.map_page = ps3_sb_map_page,
.unmap_page = ps3_unmap_page,
};
@@ -711,6 +717,7 @@ static struct dma_map_ops ps3_ioc0_dma_ops = {
.map_sg = ps3_ioc0_map_sg,
.unmap_sg = ps3_ioc0_unmap_sg,
.dma_supported = ps3_dma_supported,
+ .get_required_mask = ps3_dma_get_required_mask,
.map_page = ps3_ioc0_map_page,
.unmap_page = ps3_unmap_page,
};
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index fe5eded..9f121a3 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1099,7 +1099,7 @@ static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
return DMA_BIT_MASK(64);
}
- return dma_iommu_get_required_mask(dev);
+ return dma_iommu_ops.get_required_mask(dev);
}
#else /* CONFIG_PCI */
--
1.7.4.1
^ permalink raw reply related
* [PATCH 5/8] powerpc: override dma_get_required_mask by platform hook and ops
From: Nishanth Aravamudan @ 2011-05-11 22:25 UTC (permalink / raw)
To: Milton Miller
Cc: cbe-oss-dev, FUJITA Tomonori, Arnd Bergmann, Sonny Rao,
devicetree-discuss, linux-kernel, Paul Mackerras, Anton Blanchard,
Will Schmidt, Andrew Morton, linuxppc-dev
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
The hook dma_get_required_mask is supposed to return the mask required
by the platform to operate efficiently. The generic version of
dma_get_required_mask in drivers/base/platform.c returns a mask based
only on max_pfn. However, this is likely too big for iommu systems
and could be too small for platforms that require a dma offset or have
a secondary window at a high offset.
Override the default, provide a hook in ppc_md used by pseries lpar and
cell, and provide the default answer based on memblock_end_of_DRAM(),
with hooks for get_dma_offset, and provide an implementation for iommu
that looks at the defined table size. Converting from the end address
to the required bit mask is based on the generic implementation.
The need for this was discovered when the qla2xxx driver switched to
64 bit dma then reverted to 32 bit when dma_get_required_mask said
32 bits was sufficient.
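The conversion from an end address to a bit mask used throughout this patch can be tried out in userspace with the sketch below. fls64_sketch is a hypothetical stand-in for the kernel's fls64(), and the end == 0 guard is an assumption added here; the patch itself does not handle that case.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's fls64(): 1-based index of the
 * highest set bit, or 0 when no bit is set.
 */
static int fls64_sketch(uint64_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Smallest all-ones mask that covers the address 'end'. */
static uint64_t required_mask_from_end(uint64_t end)
{
	uint64_t mask;

	if (!end)
		return 0;	/* assumption: not handled by the patch */

	mask = 1ULL << (fls64_sketch(end) - 1);	/* highest set bit of end */
	mask += mask - 1;			/* set every bit below it */
	return mask;
}
```

With this formula an end address of 0x80000000 gives a 32-bit mask, while an end address of exactly 0x100000000 gives a 33-bit mask because bit 32 of the end address itself is set.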
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/include/asm/dma-mapping.h | 3 ++
arch/powerpc/include/asm/machdep.h | 3 +-
arch/powerpc/kernel/dma-iommu.c | 13 ++++++++++
arch/powerpc/kernel/dma.c | 39 ++++++++++++++++++++++++++++++++
arch/powerpc/platforms/cell/iommu.c | 12 +++++++++
arch/powerpc/platforms/pseries/iommu.c | 27 ++++++++++++++++++++++
6 files changed, 96 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/include/asm/dma-mapping.h b/arch/powerpc/include/asm/dma-mapping.h
index dd70fac..8135e66 100644
--- a/arch/powerpc/include/asm/dma-mapping.h
+++ b/arch/powerpc/include/asm/dma-mapping.h
@@ -20,6 +20,8 @@
#define DMA_ERROR_CODE (~(dma_addr_t)0x0)
+#define ARCH_HAS_DMA_GET_REQUIRED_MASK
+
/* Some dma direct funcs must be visible for use in other dma_ops */
extern void *dma_direct_alloc_coherent(struct device *dev, size_t size,
dma_addr_t *dma_handle, gfp_t flag);
@@ -69,6 +71,7 @@ static inline unsigned long device_to_mask(struct device *dev)
*/
#ifdef CONFIG_PPC64
extern struct dma_map_ops dma_iommu_ops;
+extern u64 dma_iommu_get_required_mask(struct device *dev);
#endif
extern struct dma_map_ops dma_direct_ops;
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index e4f0191..5837881 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -100,8 +100,9 @@ struct machdep_calls {
void (*pci_dma_dev_setup)(struct pci_dev *dev);
void (*pci_dma_bus_setup)(struct pci_bus *bus);
- /* Platform set_dma_mask override */
+ /* Platform set_dma_mask and dma_get_required_mask overrides */
int (*dma_set_mask)(struct device *dev, u64 dma_mask);
+ u64 (*dma_get_required_mask)(struct device *dev);
int (*probe)(void);
void (*setup_arch)(void); /* Optional, may be NULL */
diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index e755415..1f2a711 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -90,6 +90,19 @@ static int dma_iommu_dma_supported(struct device *dev, u64 mask)
return 1;
}
+u64 dma_iommu_get_required_mask(struct device *dev)
+{
+ struct iommu_table *tbl = get_iommu_table_base(dev);
+ u64 mask;
+ if (!tbl)
+ return 0;
+
+ mask = 1ULL << (fls_long(tbl->it_offset + tbl->it_size) - 1);
+ mask += mask - 1;
+
+ return mask;
+}
+
struct dma_map_ops dma_iommu_ops = {
.alloc_coherent = dma_iommu_alloc_coherent,
.free_coherent = dma_iommu_free_coherent,
diff --git a/arch/powerpc/kernel/dma.c b/arch/powerpc/kernel/dma.c
index d238c08..97fe867 100644
--- a/arch/powerpc/kernel/dma.c
+++ b/arch/powerpc/kernel/dma.c
@@ -172,6 +172,45 @@ int dma_set_mask(struct device *dev, u64 dma_mask)
}
EXPORT_SYMBOL(dma_set_mask);
+u64 dma_get_required_mask(struct device *dev)
+{
+ struct dma_map_ops *dma_ops = get_dma_ops(dev);
+ u64 mask, end = 0;
+
+ if (ppc_md.dma_get_required_mask)
+ return ppc_md.dma_get_required_mask(dev);
+
+ if (unlikely(dma_ops == NULL))
+ return 0;
+
+#ifdef CONFIG_PPC64
+ else if (dma_ops == &dma_iommu_ops)
+ return dma_iommu_get_required_mask(dev);
+#endif
+#ifdef CONFIG_SWIOTLB
+ else if (dma_ops == &swiotlb_dma_ops) {
+ u64 max_direct_dma_addr = dev->archdata.max_direct_dma_addr;
+
+ end = memblock_end_of_DRAM();
+ if (max_direct_dma_addr && end > max_direct_dma_addr)
+ end = max_direct_dma_addr;
+ end += get_dma_offset(dev);
+ }
+#endif
+ else if (dma_ops == &dma_direct_ops)
+ end = memblock_end_of_DRAM() + get_dma_offset(dev);
+ else {
+ WARN_ONCE(1, "%s: unknown ops %p\n", __func__, dma_ops);
+ end = memblock_end_of_DRAM();
+ }
+
+ mask = 1ULL << (fls64(end) - 1);
+ mask += mask - 1;
+
+ return mask;
+}
+EXPORT_SYMBOL_GPL(dma_get_required_mask);
+
static int __init dma_init(void)
{
dma_debug_init(PREALLOC_DMA_DEBUG_ENTRIES);
diff --git a/arch/powerpc/platforms/cell/iommu.c b/arch/powerpc/platforms/cell/iommu.c
index 26a0671..5ef55f3 100644
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -1159,6 +1159,17 @@ static int __init setup_iommu_fixed(char *str)
}
__setup("iommu_fixed=", setup_iommu_fixed);
+static u64 cell_dma_get_required_mask(struct device *dev)
+{
+ if (!dev->dma_mask)
+ return 0;
+
+ if (iommu_fixed_disabled && get_dma_ops(dev) == &dma_iommu_ops)
+ return dma_iommu_get_required_mask(dev);
+
+ return DMA_BIT_MASK(64);
+}
+
static int __init cell_iommu_init(void)
{
struct device_node *np;
@@ -1175,6 +1186,7 @@ static int __init cell_iommu_init(void)
/* Setup various ppc_md. callbacks */
ppc_md.pci_dma_dev_setup = cell_pci_dma_dev_setup;
+ ppc_md.dma_get_required_mask = cell_dma_get_required_mask;
ppc_md.tce_build = tce_build_cell;
ppc_md.tce_free = tce_free_cell;
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 01faab9..fe5eded 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -1077,12 +1077,38 @@ check_mask:
return 0;
}
+static u64 dma_get_required_mask_pSeriesLP(struct device *dev)
+{
+ if (!dev->dma_mask)
+ return 0;
+
+ if (!disable_ddw && dev_is_pci(dev)) {
+ struct pci_dev *pdev = to_pci_dev(dev);
+ struct device_node *dn;
+
+ dn = pci_device_to_OF_node(pdev);
+
+ /* search upwards for ibm,dma-window */
+ for (; dn && PCI_DN(dn) && !PCI_DN(dn)->iommu_table;
+ dn = dn->parent)
+ if (of_get_property(dn, "ibm,dma-window", NULL))
+ break;
+ /* if there is a ibm,ddw-applicable property require 64 bits */
+ if (dn && PCI_DN(dn) &&
+ of_get_property(dn, "ibm,ddw-applicable", NULL))
+ return DMA_BIT_MASK(64);
+ }
+
+ return dma_iommu_get_required_mask(dev);
+}
+
#else /* CONFIG_PCI */
#define pci_dma_bus_setup_pSeries NULL
#define pci_dma_dev_setup_pSeries NULL
#define pci_dma_bus_setup_pSeriesLP NULL
#define pci_dma_dev_setup_pSeriesLP NULL
#define dma_set_mask_pSeriesLP NULL
+#define dma_get_required_mask_pSeriesLP NULL
#endif /* !CONFIG_PCI */
static int iommu_mem_notifier(struct notifier_block *nb, unsigned long action,
@@ -1186,6 +1212,7 @@ void iommu_init_early_pSeries(void)
ppc_md.pci_dma_bus_setup = pci_dma_bus_setup_pSeriesLP;
ppc_md.pci_dma_dev_setup = pci_dma_dev_setup_pSeriesLP;
ppc_md.dma_set_mask = dma_set_mask_pSeriesLP;
+ ppc_md.dma_get_required_mask = dma_get_required_mask_pSeriesLP;
} else {
ppc_md.tce_build = tce_build_pSeries;
ppc_md.tce_free = tce_free_pSeries;
--
1.7.4.1
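For readers following the mask arithmetic in dma_get_required_mask(), here
is a minimal userspace sketch of the two-step computation (highest set bit
of the end address, then all lower bits filled in). The names
fls64_sketch() and required_mask_sketch() are hypothetical stand-ins, and
the guard for end == 0 is an assumption of this sketch, not something the
patch contains.

```c
#include <stdint.h>

/*
 * Userspace stand-in for the kernel's fls64(): 1-based index of the
 * most significant set bit, or 0 when no bit is set.
 */
static int fls64_sketch(uint64_t x)
{
	int bit = 0;

	while (x) {
		bit++;
		x >>= 1;
	}
	return bit;
}

/* Smallest all-ones mask that covers the address 'end'. */
static uint64_t required_mask_sketch(uint64_t end)
{
	uint64_t mask;

	if (end == 0)		/* assumption: avoid shifting by -1 */
		return 0;

	mask = 1ULL << (fls64_sketch(end) - 1);	/* highest set bit of end */
	mask += mask - 1;			/* fill in every lower bit */
	return mask;
}
```

With an end address of exactly 4 GiB (0x100000000) the highest set bit is
bit 32, so the result is a 33-bit mask rather than DMA_BIT_MASK(32).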
* [PATCH 4/8] pseries/iommu: cleanup ddw naming
From: Nishanth Aravamudan @ 2011-05-11 22:25 UTC (permalink / raw)
To: Milton Miller
Cc: devicetree-discuss, linux-kernel, Paul Mackerras, Will Schmidt,
linuxppc-dev
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
When using a property referring to the availability of dynamic dma windows
call it ddw_avail not ddr_avail.
dupe_ddw_if_already_created does not duplicate anything, it only finds
and reuses the windows we already created, so rename it to
find_existing_ddw. Also, it does not need the pci device node, so
remove that argument.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 42 ++++++++++++++-----------------
1 files changed, 19 insertions(+), 23 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a48f126..01faab9 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -659,16 +659,16 @@ static void remove_ddw(struct device_node *np)
{
struct dynamic_dma_window_prop *dwp;
struct property *win64;
- const u32 *ddr_avail;
+ const u32 *ddw_avail;
u64 liobn;
int len, ret;
- ddr_avail = of_get_property(np, "ibm,ddw-applicable", &len);
+ ddw_avail = of_get_property(np, "ibm,ddw-applicable", &len);
win64 = of_find_property(np, DIRECT64_PROPNAME, NULL);
if (!win64)
return;
- if (!ddr_avail || len < 3 * sizeof(u32) || win64->length < sizeof(*dwp))
+ if (!ddw_avail || len < 3 * sizeof(u32) || win64->length < sizeof(*dwp))
goto delprop;
dwp = win64->value;
@@ -684,15 +684,15 @@ static void remove_ddw(struct device_node *np)
pr_debug("%s successfully cleared tces in window.\n",
np->full_name);
- ret = rtas_call(ddr_avail[2], 1, 1, NULL, liobn);
+ ret = rtas_call(ddw_avail[2], 1, 1, NULL, liobn);
if (ret)
pr_warning("%s: failed to remove direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
- np->full_name, ret, ddr_avail[2], liobn);
+ np->full_name, ret, ddw_avail[2], liobn);
else
pr_debug("%s: successfully removed direct window: rtas returned "
"%d to ibm,remove-pe-dma-window(%x) %llx\n",
- np->full_name, ret, ddr_avail[2], liobn);
+ np->full_name, ret, ddw_avail[2], liobn);
delprop:
ret = prom_remove_property(np, win64);
@@ -701,16 +701,12 @@ delprop:
np->full_name, ret);
}
-static u64 dupe_ddw_if_already_created(struct pci_dev *dev, struct device_node *pdn)
+static u64 find_existing_ddw(struct device_node *pdn)
{
- struct device_node *dn;
- struct pci_dn *pcidn;
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
u64 dma_addr = 0;
- dn = pci_device_to_OF_node(dev);
- pcidn = PCI_DN(dn);
spin_lock(&direct_window_list_lock);
/* check if we already created a window and dupe that config if so */
list_for_each_entry(window, &direct_window_list, list) {
@@ -758,7 +754,7 @@ static int find_existing_ddw_windows(void)
}
machine_arch_initcall(pseries, find_existing_ddw_windows);
-static int query_ddw(struct pci_dev *dev, const u32 *ddr_avail,
+static int query_ddw(struct pci_dev *dev, const u32 *ddw_avail,
struct ddw_query_response *query)
{
struct device_node *dn;
@@ -779,15 +775,15 @@ static int query_ddw(struct pci_dev *dev, const u32 *ddr_avail,
if (pcidn->eeh_pe_config_addr)
cfg_addr = pcidn->eeh_pe_config_addr;
buid = pcidn->phb->buid;
- ret = rtas_call(ddr_avail[0], 3, 5, (u32 *)query,
+ ret = rtas_call(ddw_avail[0], 3, 5, (u32 *)query,
cfg_addr, BUID_HI(buid), BUID_LO(buid));
dev_info(&dev->dev, "ibm,query-pe-dma-windows(%x) %x %x %x"
- " returned %d\n", ddr_avail[0], cfg_addr, BUID_HI(buid),
+ " returned %d\n", ddw_avail[0], cfg_addr, BUID_HI(buid),
BUID_LO(buid), ret);
return ret;
}
-static int create_ddw(struct pci_dev *dev, const u32 *ddr_avail,
+static int create_ddw(struct pci_dev *dev, const u32 *ddw_avail,
struct ddw_create_response *create, int page_shift,
int window_shift)
{
@@ -812,12 +808,12 @@ static int create_ddw(struct pci_dev *dev, const u32 *ddr_avail,
do {
/* extra outputs are LIOBN and dma-addr (hi, lo) */
- ret = rtas_call(ddr_avail[1], 5, 4, (u32 *)create, cfg_addr,
+ ret = rtas_call(ddw_avail[1], 5, 4, (u32 *)create, cfg_addr,
BUID_HI(buid), BUID_LO(buid), page_shift, window_shift);
} while (rtas_busy_delay(ret));
dev_info(&dev->dev,
"ibm,create-pe-dma-window(%x) %x %x %x %x %x returned %d "
- "(liobn = 0x%x starting addr = %x %x)\n", ddr_avail[1],
+ "(liobn = 0x%x starting addr = %x %x)\n", ddw_avail[1],
cfg_addr, BUID_HI(buid), BUID_LO(buid), page_shift,
window_shift, ret, create->liobn, create->addr_hi, create->addr_lo);
@@ -843,14 +839,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
int page_shift;
u64 dma_addr, max_addr;
struct device_node *dn;
- const u32 *uninitialized_var(ddr_avail);
+ const u32 *uninitialized_var(ddw_avail);
struct direct_window *window;
struct property *win64;
struct dynamic_dma_window_prop *ddwprop;
mutex_lock(&direct_window_init_mutex);
- dma_addr = dupe_ddw_if_already_created(dev, pdn);
+ dma_addr = find_existing_ddw(pdn);
if (dma_addr != 0)
goto out_unlock;
@@ -862,8 +858,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
* for the given node in that order.
* the property is actually in the parent, not the PE
*/
- ddr_avail = of_get_property(pdn, "ibm,ddw-applicable", &len);
- if (!ddr_avail || len < 3 * sizeof(u32))
+ ddw_avail = of_get_property(pdn, "ibm,ddw-applicable", &len);
+ if (!ddw_avail || len < 3 * sizeof(u32))
goto out_unlock;
/*
@@ -873,7 +869,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
* of page sizes: supported and supported for migrate-dma.
*/
dn = pci_device_to_OF_node(dev);
- ret = query_ddw(dev, ddr_avail, &query);
+ ret = query_ddw(dev, ddw_avail, &query);
if (ret != 0)
goto out_unlock;
@@ -922,7 +918,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
goto out_free_prop;
}
- ret = create_ddw(dev, ddr_avail, &create, page_shift, len);
+ ret = create_ddw(dev, ddw_avail, &create, page_shift, len);
if (ret != 0)
goto out_free_prop;
--
1.7.4.1
* [PATCH 3/8] pseries/iommu: find windows after kexec during boot
From: Nishanth Aravamudan @ 2011-05-11 22:24 UTC (permalink / raw)
To: Milton Miller
Cc: devicetree-discuss, linux-kernel, Paul Mackerras, Will Schmidt,
linuxppc-dev
In-Reply-To: <1305152704-4864-1-git-send-email-nacc@us.ibm.com>
From: Milton Miller <miltonm@bga.com>
Move the discovery of previously set up windows from the point where
the pci driver calls set_dma_mask to an arch_initcall.
When kexecing into a kernel with dynamic dma windows allocated, we need
to find the windows early so that memory hot remove will be able to
delete the tces mapping the memory being removed, and memory hotplug add
will map the new memory into the window. We should not wait for the
driver to be loaded and the device to be probed. The iommu init hooks
run before kmalloc is set up, so defer to arch_initcall.
Signed-off-by: Milton Miller <miltonm@bga.com>
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
---
arch/powerpc/platforms/pseries/iommu.c | 52 ++++++++++++++-----------------
1 files changed, 24 insertions(+), 28 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index a0421ac..a48f126 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -695,9 +695,9 @@ static void remove_ddw(struct device_node *np)
np->full_name, ret, ddr_avail[2], liobn);
delprop:
- ret = of_remove_property(np, win64);
+ ret = prom_remove_property(np, win64);
if (ret)
- pr_warning("%s: failed to remove direct window property: %d\n"
+ pr_warning("%s: failed to remove direct window property: %d\n",
np->full_name, ret);
}
@@ -725,38 +725,38 @@ static u64 dupe_ddw_if_already_created(struct pci_dev *dev, struct device_node *
return dma_addr;
}
-static u64 dupe_ddw_if_kexec(struct pci_dev *dev, struct device_node *pdn)
+static int find_existing_ddw_windows(void)
{
- struct device_node *dn;
- struct pci_dn *pcidn;
int len;
+ struct device_node *pdn;
struct direct_window *window;
const struct dynamic_dma_window_prop *direct64;
- u64 dma_addr = 0;
- dn = pci_device_to_OF_node(dev);
- pcidn = PCI_DN(dn);
- direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
- if (direct64) {
- if (len < sizeof(struct dynamic_dma_window_prop)) {
+ if (!firmware_has_feature(FW_FEATURE_LPAR))
+ return 0;
+
+ for_each_node_with_property(pdn, DIRECT64_PROPNAME) {
+ direct64 = of_get_property(pdn, DIRECT64_PROPNAME, &len);
+ if (!direct64)
+ continue;
+
+ window = kzalloc(sizeof(*window), GFP_KERNEL);
+ if (!window || len < sizeof(struct dynamic_dma_window_prop)) {
+ kfree(window);
remove_ddw(pdn);
- } else {
- window = kzalloc(sizeof(*window), GFP_KERNEL);
- if (!window) {
- remove_ddw(pdn);
- } else {
- window->device = pdn;
- window->prop = direct64;
- spin_lock(&direct_window_list_lock);
- list_add(&window->list, &direct_window_list);
- spin_unlock(&direct_window_list_lock);
- dma_addr = direct64->dma_base;
- }
+ continue;
}
+
+ window->device = pdn;
+ window->prop = direct64;
+ spin_lock(&direct_window_list_lock);
+ list_add(&window->list, &direct_window_list);
+ spin_unlock(&direct_window_list_lock);
}
- return dma_addr;
+ return 0;
}
+machine_arch_initcall(pseries, find_existing_ddw_windows);
static int query_ddw(struct pci_dev *dev, const u32 *ddr_avail,
struct ddw_query_response *query)
@@ -854,10 +854,6 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
if (dma_addr != 0)
goto out_unlock;
- dma_addr = dupe_ddw_if_kexec(dev, pdn);
- if (dma_addr != 0)
- goto out_unlock;
-
/*
* the ibm,ddw-applicable property holds the tokens for:
* ibm,query-pe-dma-window
--
1.7.4.1
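The rewritten loop in find_existing_ddw_windows() folds the "allocation
failed" and "property too short" cases into one test and frees
unconditionally, which works because kfree(NULL) is a no-op. A minimal
userspace sketch of that pattern follows, using calloc()/free() in place
of kzalloc()/kfree(). The struct and function names are hypothetical, and
the 'rejected' flag stands in for the remove_ddw() call in the patch.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct direct_window. */
struct window_sketch {
	const void *prop;
	struct window_sketch *next;
};

/*
 * Record one property in the list. Returns the new list head, or the
 * unchanged head when the property is too short or allocation fails.
 */
static struct window_sketch *record_window(struct window_sketch *head,
					   const void *prop, size_t len,
					   size_t min_len, int *rejected)
{
	struct window_sketch *w = calloc(1, sizeof(*w));

	if (!w || len < min_len) {
		free(w);	/* free(NULL) is a no-op, like kfree(NULL) */
		*rejected = 1;	/* the patch calls remove_ddw() here */
		return head;
	}

	w->prop = prop;
	w->next = head;
	*rejected = 0;
	return w;
}
```

The cost of this shape is one allocation that is immediately thrown away
when the property is too short; in exchange the two error paths share a
single exit.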
* fsl_udc_core: BUG: scheduling while atomic
From: Matthew L. Creech @ 2011-05-11 21:37 UTC (permalink / raw)
To: linuxppc-dev
Hi,
My MPC8313-based board, running a 2.6.37 kernel, is occasionally
hitting this bug while doing RNDIS-based communication:
BUG: scheduling while atomic: lighttpd/1145/0x10000200
Call Trace:
[c6a8b910] [c00086c0] show_stack+0x7c/0x194 (unreliable)
[c6a8b950] [c0019e28] __schedule_bug+0x54/0x68
[c6a8b960] [c02b04e8] schedule+0xa4/0x408
[c6a8ba50] [c02b0988] _cond_resched+0x38/0x64
[c6a8ba60] [c0080e8c] dma_pool_alloc+0x5c/0x2a4
[c6a8bac0] [c01c57b0] fsl_req_to_dtd+0x68/0x24c
[c6a8bb00] [c01c5b68] fsl_ep_queue+0x1d4/0x264
[c6a8bb20] [c01c7eec] eth_start_xmit+0x278/0x344
[c6a8bb50] [c01fdbc8] dev_hard_start_xmit+0x520/0x680
[c6a8bba0] [c02122a4] sch_direct_xmit+0x68/0x1e0
[c6a8bbc0] [c01fdf20] dev_queue_xmit+0x1f8/0x3c4
[c6a8bbe0] [c022d684] ip_finish_output+0x2d4/0x328
[c6a8bc10] [c022db08] ip_local_out+0x38/0x4c
[c6a8bc20] [c022e3cc] ip_queue_xmit+0x2cc/0x360
[c6a8bca0] [c0241844] tcp_transmit_skb+0x7cc/0x838
[c6a8bd00] [c0244434] tcp_write_xmit+0x8c4/0xa34
[c6a8bd60] [c0237618] tcp_sendmsg+0x900/0xbd4
[c6a8bdd0] [c0256088] inet_sendmsg+0x74/0x8c
[c6a8bdf0] [c01ea498] sock_aio_write+0x130/0x14c
[c6a8be50] [c00855fc] do_sync_write+0xb0/0x110
[c6a8bef0] [c0086294] vfs_write+0xdc/0x17c
[c6a8bf10] [c008642c] sys_write+0x54/0x9c
[c6a8bf40] [c000f2cc] ret_from_syscall+0x0/0x38
This seems similar to a bug from 2010:
http://www.spinics.net/lists/linux-usb/msg31354.html
which concludes that the fsl_udc_core driver is wrongly using
GFP_KERNEL in fsl_build_dtd(). However I'm not sure what an
appropriate fix is, since just replacing it with GFP_ATOMIC causes
allocation failures. Any helpful tips?
Thanks
--
Matthew L. Creech
* Re: [PATCH 13/13] kvm/powerpc: Allow book3s_hv guests to use SMT processor modes
From: Paul Mackerras @ 2011-05-11 21:17 UTC (permalink / raw)
To: Christoph Hellwig; +Cc: linuxppc-dev, Alexander Graf, kvm
In-Reply-To: <20110511134415.GA1134@lst.de>
On Wed, May 11, 2011 at 03:44:15PM +0200, Christoph Hellwig wrote:
> On Wed, May 11, 2011 at 08:46:56PM +1000, Paul Mackerras wrote:
> > arch/powerpc/sysdev/xics/icp-native.c.
>
> What kernel tree do I need to actually have that file?
The "next" branch of
git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc.git
Paul.
* [PATCH] pseries/iommu: use correct return type in dupe_ddw_if_already_created
From: Nishanth Aravamudan @ 2011-05-11 21:07 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, Anton Blanchard, Milton Miller
The function returns a u64 dma address; with an int return type the
value is silently truncated.
Signed-off-by: Nishanth Aravamudan <nacc@us.ibm.com>
Cc: Anton Blanchard <anton@samba.org>
Cc: Milton Miller <miltonm@bga.com>
Cc: linuxppc-dev@ozlabs.org
---
arch/powerpc/platforms/pseries/iommu.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d5412a..31e2ac4 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -693,7 +693,7 @@ static void remove_ddw(struct device_node *np)
}
-static int dupe_ddw_if_already_created(struct pci_dev *dev, struct device_node *pdn)
+static u64 dupe_ddw_if_already_created(struct pci_dev *dev, struct device_node *pdn)
{
struct device_node *dn;
struct pci_dn *pcidn;
--
1.7.4.1
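To illustrate the truncation this patch fixes, here is a minimal userspace
sketch. The function names and the example address are assumptions made
for illustration. The sketch narrows through uint32_t so the result is
well defined; converting an out-of-range u64 to int, as the old prototype
did, is implementation-defined in C.

```c
#include <stdint.h>

/*
 * Hypothetical stand-ins, not the kernel functions: the same lookup
 * result returned through a 32-bit type and through a 64-bit type.
 */
static uint32_t lookup_narrow(uint64_t dma_base)
{
	return (uint32_t)dma_base;	/* keeps only the low 32 bits */
}

static uint64_t lookup_wide(uint64_t dma_base)
{
	return dma_base;
}

/* Mirrors the caller's "if (dma_addr != 0)" test in enable_ddw(). */
static int window_found(uint64_t dma_addr)
{
	return dma_addr != 0;
}
```

For an address whose low 32 bits are all zero, such as
0x0800000000000000, the narrow version returns 0, and the caller's
non-zero test would then treat an existing window as not found.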
* Re: [PATCH] RapidIO: Fix default routing initialization
From: Andrew Morton @ 2011-05-11 20:00 UTC (permalink / raw)
To: Alexandre Bounine; +Cc: linux-kernel, Thomas Moll, linuxppc-dev
In-Reply-To: <1305125163-14590-1-git-send-email-alexandre.bounine@idt.com>
On Wed, 11 May 2011 10:46:03 -0400
Alexandre Bounine <alexandre.bounine@idt.com> wrote:
> Fix switch initialization to ensure that all switches have default
> routing disabled. This guarantees that no unexpected RapidIO packets
> arrive to the default port set by reset and there is no default routing
> destination until it is properly configured by software.
>
> This update also unifies handling of unmapped destinations by tsi57x,
> IDT Gen1 and IDT Gen2 switches.
The changelog doesn't permit me to determine the importance of this fix,
so I don't know whether to schedule it for 2.6.39 or for -stable.
* Re: [PATCH 00/37] fix paca memory usage and NR_CPU loops, factor ipi and simplify irq code
From: Grant Likely @ 2011-05-11 19:40 UTC (permalink / raw)
To: Milton Miller; +Cc: Thomas Gleixner, linuxppc-dev
In-Reply-To: <cover.1305092637.git.miltonm@bga.com>
On Wed, May 11, 2011 at 7:43 AM, Milton Miller <miltonm@bga.com> wrote:
[...]
> After doing a bunch of grep's on arch/sh, I think the interesting
> parts of super8 interrupt handling that Thomas referred to are in
> drivers/sh/intc.  For some reason they populate the radix tree first
> with a descriptor of their common interrupt controller abstraction
> then walk the tree, replacing the tagged elements with the pointer
> to their equivalent to irq_map.  I do not yet understand the purpose
> of this two phase allocation.
BTW, you can also write a patch now that eliminates irq_map entirely.
I just talked to tglx today and we agree in principle to move the
hwirq value and irq_domain pointer directly into irq_data (and hence
directly accessible from the irq_desc). The final form of irq_domain
is still up in the air, but you could do an initial patch that leaves
it as platform defined, and it can be followed up with a patch series
that creates a common irq_domain definition.
g.