* Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
From: Mathieu Desnoyers @ 2021-08-26 18:42 UTC (permalink / raw)
To: Sean Christopherson
Cc: KVM list, Peter Zijlstra, Catalin Marinas, linux-kernel,
Will Deacon, Guo Ren, linux-kselftest, Ben Gardon, shuah,
Paul Mackerras, linux-s390, gor, Russell King, ARM Linux,
linux-csky, Christian Borntraeger, Ingo Molnar, dvhart,
linux-mips, Boqun Feng, paulmck, Heiko Carstens, rostedt,
Shakeel Butt, Andy Lutomirski, Thomas Gleixner, Peter Foley,
linux-arm-kernel, Thomas Bogendoerfer, Oleg Nesterov,
Paolo Bonzini, linuxppc-dev
In-Reply-To: <YSblqrrpKcORzilX@google.com>
----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
> On Mon, Aug 23, 2021, Mathieu Desnoyers wrote:
>> [ re-send to Darren Hart ]
>>
>> ----- On Aug 23, 2021, at 11:18 AM, Mathieu Desnoyers
>> mathieu.desnoyers@efficios.com wrote:
>>
>> > ----- On Aug 20, 2021, at 6:50 PM, Sean Christopherson seanjc@google.com wrote:
>> >
>> >> Add a test to verify an rseq's CPU ID is updated correctly if the task is
>> >> migrated while the kernel is handling KVM_RUN. This is a regression test
>> >> for a bug introduced by commit 72c3c0fe54a3 ("x86/kvm: Use generic xfer
>> >> to guest work function"), where TIF_NOTIFY_RESUME would be cleared by KVM
>> >> without updating rseq, leading to a stale CPU ID and other badness.
>> >>
>> >
>> > [...]
>> >
>> > +#define RSEQ_SIG 0xdeadbeef
>> >
>> > Is there any reason for defining a custom signature rather than including
>> > tools/testing/selftests/rseq/rseq.h ? This should take care of including
>> > the proper architecture header which will define the appropriate signature.
>> >
>> > Arguably you don't define rseq critical sections in this test per se, but
>> > I'm wondering why the custom signature here.
>
> Partly to avoid taking a dependency on rseq.h, and partly to try to call out
> that
> the test doesn't actually do any rseq critical sections.
It might be good to add a comment near this define stating this then, so nobody
attempts to copy this as an example.
>
>> > [...]
>> >
>> >> +
>> >> +static void *migration_worker(void *ign)
>> >> +{
>> >> + cpu_set_t allowed_mask;
>> >> + int r, i, nr_cpus, cpu;
>> >> +
>> >> + CPU_ZERO(&allowed_mask);
>> >> +
>> >> + nr_cpus = CPU_COUNT(&possible_mask);
>> >> +
>> >> + for (i = 0; i < 20000; i++) {
>> >> + cpu = i % nr_cpus;
>> >> + if (!CPU_ISSET(cpu, &possible_mask))
>> >> + continue;
>> >> +
>> >> + CPU_SET(cpu, &allowed_mask);
>> >> +
>> >> + /*
>> >> + * Bump the sequence count twice to allow the reader to detect
>> >> + * that a migration may have occurred in between rseq and sched
>> >> + * CPU ID reads. An odd sequence count indicates a migration
>> >> + * is in-progress, while a completely different count indicates
>> >> + * a migration occurred since the count was last read.
>> >> + */
>> >> + atomic_inc(&seq_cnt);
>> >
>> > So technically this atomic_inc contains the required barriers because the
>> > selftests implementation uses "__sync_add_and_fetch(&addr->val, 1)". But
>> > it's rather odd that the semantic differs from the kernel implementation in
>> > terms of memory barriers: the kernel implementation of atomic_inc
>> > guarantees no memory barriers, but this one happens to provide full
>> > barriers pretty much by accident (selftests futex/include/atomic.h
>> > documents no such guarantee).
>
> Yeah, I got quite lost trying to figure out what atomics the test would actually
> end up with.
At the very least, until things are clarified in the selftests atomic header, I would
recommend adding a comment stating which memory barriers are required alongside each
use of atomic_inc here. I would even prefer that we add extra (currently unneeded)
write barriers to make extra sure that this stays documented. Performance of the write-side
does not matter much here.
>
>> > If this full barrier guarantee is indeed provided by the selftests atomic.h
>> > header, I would really like a comment stating that in the atomic.h header
>> > so the carpet is not pulled from under our feet by a future optimization.
>> >
>> >
>> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
>> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
>> >> + errno, strerror(errno));
>> >> + atomic_inc(&seq_cnt);
>> >> +
>> >> + CPU_CLR(cpu, &allowed_mask);
>> >> +
>> >> + /*
>> >> + * Let the read-side get back into KVM_RUN to improve the odds
>> >> + * of task migration coinciding with KVM's run loop.
>> >
>> > This comment should be about increasing the odds of letting the seqlock
>> > read-side complete. Otherwise, the delay between the two back-to-back
>> > atomic_inc is so small that the seqlock read-side may never have time to
>> > complete the reading the rseq cpu id and the sched_getcpu() call, and can
>> > retry forever.
>
> Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't
> possible (though that syscall would have to be screaming fast),
I don't think we have the same understanding of the livelock scenario. AFAIU the livelock
would be caused by a too-small delay between the two consecutive atomic_inc() from one
loop iteration to the next compared to the time it takes to perform a read-side critical
section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I
doubt that the sched_getcpu implementation have good odds to be fast enough to complete
in that narrow window, leading to lots of read seqlock retry.
> but the primary
> motivation is very much to allow the read-side enough time to get back into KVM
> proper.
I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we
have:
migration thread KVM_RUN/read-side thread
-----------------------------------------------------------------------------------
- ioctl(KVM_RUN)
- atomic_inc_seq_cst(&seqcnt)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
- a = atomic_load(&seqcnt) & ~1
- smp_rmb()
- b = LOAD_ONCE(__rseq_abi->cpu_id);
- sched_getcpu()
- smp_rmb()
- re-load seqcnt/compare (succeeds)
- Can only succeed if entire read-side happens while the seqcnt
is in an even numbered state.
- if (a != b) abort()
/* no delay. Even counter state is very
short. */
- atomic_inc_seq_cst(&seqcnt)
/* Let's suppose the lack of delay causes the
setaffinity to complete too early compared
with KVM_RUN ioctl */
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
/* no delay. Even counter state is very
short. */
- atomic_inc_seq_cst(&seqcnt)
/* Then a setaffinity from a following
migration thread loop will run
concurrently with KVM_RUN */
- ioctl(KVM_RUN)
- sched_setaffinity
- atomic_inc_seq_cst(&seqcnt)
As pointed out here, if the first setaffinity runs too early compared with KVM_RUN,
a following setaffinity will run concurrently with it. However, the fact that
the even counter state is very short may very well hurt progress of the read seqlock.
>
> To encounter the bug, TIF_NOTIFY_RESUME has to be recognized by KVM in its run
> loop, i.e. sched_setaffinity() must induce task migration after the read-side
> has
> invoked ioctl(KVM_RUN).
No argument here.
>
>> > I'm wondering if 1 microsecond is sufficient on other architectures as
>> > well.
>
> I'm definitely wondering that as well :-)
>
>> > One alternative way to make this depend less on the architecture's
>> > implementation of sched_getcpu (whether it's a vDSO, or goes through a
>> > syscall) would be to read the rseq cpu id and call sched_getcpu a few times
>> > (e.g. 3 times) in the migration thread rather than use usleep, and throw
>> > away the value read. This would ensure the delay is appropriate on all
>> > architectures.
>
> As above, I think an arbitrary delay is required regardless of how fast
> sched_getcpu() can execute. One thought would be to do sched_getcpu() _and_
> usleep() to account for sched_getcpu() overhead and to satisfy the KVM_RUN part,
> but I don't know that that adds meaningful value.
>
> The real test is if someone could see if the bug repros on non-x86 hardware...
As I explain in the scenario above, I don't think we agree on the reason why the
delay is required. One way to confirm this would be to run the code without delay,
and count how many seqcnt read-side succeed vs fail. We can then compare that with
a run with a 1us delay between the migration thread loops. This data should help us
come to a common understanding of the delay's role.
Thanks,
Mathieu
--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com
^ permalink raw reply
* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Nathan Chancellor @ 2021-08-26 18:54 UTC (permalink / raw)
To: Michael Ellerman
Cc: llvm, linux-kernel, clang-built-linux, Paul Mackerras,
linuxppc-dev
In-Reply-To: <87h7fcc2m4.fsf@mpe.ellerman.id.au>
On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
> Nathan Chancellor <nathan@kernel.org> writes:
> > On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
> >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
> >> flexibility to GCC.
> ...
> >
> > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
> > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
> > klist_add_tail to trigger over and over on boot when compiling with
> > clang:
> >
> > [ 2.177416][ T1] WARNING: CPU: 0 PID: 1 at lib/klist.c:62 .klist_add_tail+0x3c/0x110
> > [ 2.177456][ T1] Modules linked in:
> > [ 2.177481][ T1] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.14.0-rc7-next-20210825 #1
> > [ 2.177520][ T1] NIP: c0000000007ff81c LR: c00000000090a038 CTR: 0000000000000000
> > [ 2.177557][ T1] REGS: c0000000073c32a0 TRAP: 0700 Tainted: G W (5.14.0-rc7-next-20210825)
> > [ 2.177593][ T1] MSR: 8000000002029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 22000a40 XER: 00000000
> > [ 2.177667][ T1] CFAR: c00000000090a034 IRQMASK: 0
> > [ 2.177667][ T1] GPR00: c00000000090a038 c0000000073c3540 c000000001be3200 0000000000000001
> > [ 2.177667][ T1] GPR04: c0000000072d65c0 0000000000000000 c0000000091ba798 c0000000091bb0a0
> > [ 2.177667][ T1] GPR08: 0000000000000001 0000000000000000 c000000008581918 fffffffffffffc00
> > [ 2.177667][ T1] GPR12: 0000000044000240 c000000001dd0000 c000000000012300 0000000000000000
> > [ 2.177667][ T1] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [ 2.177667][ T1] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > [ 2.177667][ T1] GPR24: 0000000000000000 c0000000017e3200 0000000000000000 c000000001a0e778
> > [ 2.177667][ T1] GPR28: c0000000072d65b0 c0000000072d65a8 c000000007de72c8 c0000000073c35d0
> > [ 2.178019][ T1] NIP [c0000000007ff81c] .klist_add_tail+0x3c/0x110
> > [ 2.178058][ T1] LR [c00000000090a038] .bus_add_driver+0x148/0x290
> > [ 2.178088][ T1] Call Trace:
> > [ 2.178105][ T1] [c0000000073c3540] [c0000000073c35d0] 0xc0000000073c35d0 (unreliable)
> > [ 2.178150][ T1] [c0000000073c35d0] [c00000000090a038] .bus_add_driver+0x148/0x290
> > [ 2.178190][ T1] [c0000000073c3670] [c00000000090fae8] .driver_register+0xb8/0x190
> > [ 2.178234][ T1] [c0000000073c3700] [c000000000be55c0] .__hid_register_driver+0x70/0xd0
> > [ 2.178275][ T1] [c0000000073c37a0] [c00000000116955c] .redragon_driver_init+0x34/0x58
> > [ 2.178314][ T1] [c0000000073c3820] [c000000000011ae0] .do_one_initcall+0x130/0x3b0
> > [ 2.178357][ T1] [c0000000073c3bb0] [c0000000011065e0] .do_initcall_level+0xd8/0x188
> > [ 2.178403][ T1] [c0000000073c3c50] [c0000000011064a8] .do_initcalls+0x7c/0xdc
> > [ 2.178445][ T1] [c0000000073c3ce0] [c000000001106238] .kernel_init_freeable+0x178/0x21c
> > [ 2.178491][ T1] [c0000000073c3d90] [c000000000012334] .kernel_init+0x34/0x220
> > [ 2.178530][ T1] [c0000000073c3e10] [c00000000000cf50] .ret_from_kernel_thread+0x58/0x60
> > [ 2.178569][ T1] Instruction dump:
> > [ 2.178592][ T1] fba10078 7c7d1b78 38600001 fb810070 3b9d0008 fbc10080 7c9e2378 389d0018
> > [ 2.178662][ T1] fb9d0008 fb9d0010 90640000 fbdd0000 <0b1e0000> e87e0018 28230000 41820024
> > [ 2.178728][ T1] ---[ end trace 52ed3431f58f1847 ]---
> >
> > Is this a bug with clang or is there something wrong with the patch? The
> > vmlinux image is available at [1] if you want to inspect it and our QEMU
> > command and the warning at boot can be viewed at [2]. If there is any
> > other information I can provide, please let me know.
> >
> > [1] https://builds.tuxbuild.com/1xDcmp3Tvno0TTGxDVPedRKIKM2/
> > [2] https://github.com/ClangBuiltLinux/continuous-integration2/commit/cee159b66a58eb57fa2359e7888074b9da24126c/checks/3422232736/logs
>
> Thanks.
>
> This is the generated assembly:
>
> c0000000007ff600 <.klist_add_tail>:
> c0000000007ff600: 7c 08 02 a6 mflr r0
> c0000000007ff604: f8 01 00 10 std r0,16(r1)
> c0000000007ff608: f8 21 ff 71 stdu r1,-144(r1) ^ prolog
> c0000000007ff60c: fb a1 00 78 std r29,120(r1) save r29 to stack
> c0000000007ff610: 7c 7d 1b 78 mr r29,r3 r29 = struct klist_node *n
> c0000000007ff614: 38 60 00 01 li r3,1 r3 = 1
> c0000000007ff618: fb 81 00 70 std r28,112(r1) save r28 to stack
> c0000000007ff61c: 3b 9d 00 08 addi r28,r29,8 r28 = &n->n_node
> c0000000007ff620: fb c1 00 80 std r30,128(r1) save r30 to stack
> c0000000007ff624: 7c 9e 23 78 mr r30,r4 r30 = struct klist *k
> c0000000007ff628: 38 9d 00 18 addi r4,r29,24 r4 = &n->n_ref
> c0000000007ff62c: fb 9d 00 08 std r28,8(r29) n->n_node.next = &n->n_node INIT_LIST_HEAD
> c0000000007ff630: fb 9d 00 10 std r28,16(r29) n->n_node.prev = &n->n_node
> c0000000007ff634: 90 64 00 00 stw r3,0(r4) kref_init(&n->n_ref)
> c0000000007ff638: fb dd 00 00 std r30,0(r29) n->n_klist = k
> c0000000007ff63c: 0b 1e 00 00 tdnei r30,0 trap if r30 (k) is not zero
>
>
> From:
>
> static void knode_set_klist(struct klist_node *knode, struct klist *klist)
> {
> knode->n_klist = klist;
> /* no knode deserves to start its life dead */
> WARN_ON(knode_dead(knode));
> ^^^^^^^^^^^^^^^^^
> }
>
> Which expands to:
>
> static void knode_set_klist(struct klist_node *knode, struct klist *klist)
> {
> knode->n_klist = klist;
>
> ({
> bool __ret_warn_on = false;
> do {
> ...
> } else {
> __label__ __label_warn_on;
> do {
> asm goto(
> "1: "
> "tdnei"
> "
> " " % 4,
> 0 " "\n " ".section __ex_table,\"a\";"
> " "
> ".balign 4;"
> " "
> ".long (1b) - . ;"
> " "
> ".long (%l[__label_warn_on]) - . ;"
> " "
> ".previous"
> " "
> ".section __bug_table,\"aw\"\n"
> "2:\t.4byte 1b - 2b, %0 - 2b\n"
> "\t.short %1, %2\n"
> ".org 2b+%3\n"
> ".previous\n"
> :
> : "i"("lib/klist.c"), "i"(62),
> "i"((1 << 0) | ((9) << 8)),
> "i"(sizeof(struct bug_entry)),
> "r"(knode_dead(knode))
> ^^^^^^^^^^^^^^^^^^^^^
>
> :
> : __label_warn_on);
> asm("");
> } while (0);
> break;
> __label_warn_on:
> __ret_warn_on = true;
> }
> } while (0);
> __builtin_expect(!!(__ret_warn_on), 0);
> });
> }
>
> And knode_dead is:
>
> #define KNODE_DEAD 1LU
>
> static bool knode_dead(struct klist_node *knode)
> {
> return (unsigned long)knode->n_klist & KNODE_DEAD;
> }
>
>
> So it's meant to warn if (n_klist & KNODE_DEAD) is not equal to zero.
>
> But in the asm:
>
> c0000000007ff600 <.klist_add_tail>:
> ...
> c0000000007ff624: 7c 9e 23 78 mr r30,r4 r30 = struct klist *k
> ...
> c0000000007ff638: fb dd 00 00 std r30,0(r29) n->n_klist = k
> c0000000007ff63c: 0b 1e 00 00 tdnei r30,0 trap if r30 (k) is not zero
>
>
> It's just warning if n_klist is not equal to zero. ie. we lost the "& KNODE_DEAD".
>
> In the GCC output you can see it:
>
> c0000000008c8a30 <.klist_node_init>:
> c0000000008c8a30: 39 24 00 08 addi r9,r4,8
> c0000000008c8a34: 39 40 00 01 li r10,1
> c0000000008c8a38: f9 24 00 08 std r9,8(r4)
> c0000000008c8a3c: f9 24 00 10 std r9,16(r4)
> c0000000008c8a40: 91 44 00 18 stw r10,24(r4)
> c0000000008c8a44: f8 64 00 00 std r3,0(r4)
> c0000000008c8a48: 54 69 07 fe clrlwi r9,r3,31
> c0000000008c8a4c: 0b 09 00 00 tdnei r9,0
>
> ie. the clrlwi is "clear left (word) immediate", and zeroes everything
> except bit 0, which is equivalent to "& KNODE_DEAD".
>
>
> So there seems to be some misunderstanding with clang, it doesn't like
> us passing an expression to the inline asm.
>
> AFAIK it is legal to pass expressions as inputs to inline asm, ie. it
> doesn't have to just be a variable name.
>
> This patch seems to fix it. Not sure if that's just papering over it though.
>
> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index 1ee0f22313ee..75fcb4370d96 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -119,7 +119,7 @@ __label_warn_on: \
> \
> WARN_ENTRY(PPC_TLNEI " %4, 0", \
> BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \
> - __label_warn_on, "r" (x)); \
> + __label_warn_on, "r" (!!(x))); \
> break; \
> __label_warn_on: \
> __ret_warn_on = true; \
>
>
> Generates:
>
> c0000000008e2ac0 <.klist_add_tail>:
> c0000000008e2ac0: 7c 08 02 a6 mflr r0
> c0000000008e2ac4: f8 01 00 10 std r0,16(r1)
> c0000000008e2ac8: f8 21 ff 71 stdu r1,-144(r1)
> c0000000008e2acc: fb a1 00 78 std r29,120(r1)
> c0000000008e2ad0: 7c 7d 1b 78 mr r29,r3
> c0000000008e2ad4: 38 60 00 01 li r3,1
> c0000000008e2ad8: fb c1 00 80 std r30,128(r1)
> c0000000008e2adc: 7c 9e 23 78 mr r30,r4
> c0000000008e2ae0: 38 9d 00 18 addi r4,r29,24
> c0000000008e2ae4: 57 c5 07 fe clrlwi r5,r30,31 <-
> c0000000008e2ae8: fb 81 00 70 std r28,112(r1)
> c0000000008e2aec: 3b 9d 00 08 addi r28,r29,8
> c0000000008e2af0: fb 9d 00 08 std r28,8(r29)
> c0000000008e2af4: fb 9d 00 10 std r28,16(r29)
> c0000000008e2af8: 90 64 00 00 stw r3,0(r4)
> c0000000008e2afc: fb dd 00 00 std r30,0(r29)
> c0000000008e2b00: 0b 05 00 00 tdnei r5,0 <-
Thank you for the in-depth explanation and triage! I have gone ahead and
created a standalone reproducer that shows this based on the
preprocessed file and opened https://bugs.llvm.org/show_bug.cgi?id=51634
so the LLVM developers can take a look.
Cheers,
Nathan
^ permalink raw reply
* Re: [PATCH v2 4/5] KVM: selftests: Add a test for KVM_RUN+rseq to detect task migration bugs
From: Sean Christopherson @ 2021-08-26 23:54 UTC (permalink / raw)
To: Mathieu Desnoyers
Cc: KVM list, Peter Zijlstra, Catalin Marinas, linux-kernel,
Will Deacon, Guo Ren, linux-kselftest, Ben Gardon, shuah,
Paul Mackerras, linux-s390, gor, Russell King, ARM Linux,
linux-csky, Christian Borntraeger, Ingo Molnar, dvhart,
linux-mips, Boqun Feng, paulmck, Heiko Carstens, rostedt,
Shakeel Butt, Andy Lutomirski, Thomas Gleixner, Peter Foley,
linux-arm-kernel, Thomas Bogendoerfer, Oleg Nesterov,
Paolo Bonzini, linuxppc-dev
In-Reply-To: <1700758714.29394.1630003332081.JavaMail.zimbra@efficios.com>
On Thu, Aug 26, 2021, Mathieu Desnoyers wrote:
> ----- On Aug 25, 2021, at 8:51 PM, Sean Christopherson seanjc@google.com wrote:
> >> >> + r = sched_setaffinity(0, sizeof(allowed_mask), &allowed_mask);
> >> >> + TEST_ASSERT(!r, "sched_setaffinity failed, errno = %d (%s)",
> >> >> + errno, strerror(errno));
> >> >> + atomic_inc(&seq_cnt);
> >> >> +
> >> >> + CPU_CLR(cpu, &allowed_mask);
> >> >> +
> >> >> + /*
> >> >> + * Let the read-side get back into KVM_RUN to improve the odds
> >> >> + * of task migration coinciding with KVM's run loop.
> >> >
> >> > This comment should be about increasing the odds of letting the seqlock
> >> > read-side complete. Otherwise, the delay between the two back-to-back
> >> > atomic_inc is so small that the seqlock read-side may never have time to
> >> > complete the reading the rseq cpu id and the sched_getcpu() call, and can
> >> > retry forever.
> >
> > Hmm, but that's not why there's a delay. I'm not arguing that a livelock isn't
> > possible (though that syscall would have to be screaming fast),
>
> I don't think we have the same understanding of the livelock scenario. AFAIU the livelock
> would be caused by a too-small delay between the two consecutive atomic_inc() from one
> loop iteration to the next compared to the time it takes to perform a read-side critical
> section of the seqlock. Back-to-back atomic_inc can be performed very quickly, so I
> doubt that the sched_getcpu implementation have good odds to be fast enough to complete
> in that narrow window, leading to lots of read seqlock retry.
Ooooh, yeah, brain fart on my side. I was thinking of the two atomic_inc() in the
same loop iteration and completely ignoring the next iteration. Yes, I 100% agree
that a delay to ensure forward progress is needed. An assertion in main() that the
reader complete at least some reasonable number of KVM_RUNs is also probably a good
idea, e.g. to rule out a false pass due to the reader never making forward progress.
FWIW, the do-while loop does make forward progress without a delay, but at ~50%
throughput, give or take.
> > but the primary motivation is very much to allow the read-side enough time
> > to get back into KVM proper.
>
> I'm puzzled by your statement. AFAIU, let's say we don't have the delay, then we
> have:
>
> migration thread KVM_RUN/read-side thread
> -----------------------------------------------------------------------------------
> - ioctl(KVM_RUN)
> - atomic_inc_seq_cst(&seqcnt)
> - sched_setaffinity
> - atomic_inc_seq_cst(&seqcnt)
> - a = atomic_load(&seqcnt) & ~1
> - smp_rmb()
> - b = LOAD_ONCE(__rseq_abi->cpu_id);
> - sched_getcpu()
> - smp_rmb()
> - re-load seqcnt/compare (succeeds)
> - Can only succeed if entire read-side happens while the seqcnt
> is in an even numbered state.
> - if (a != b) abort()
> /* no delay. Even counter state is very
> short. */
> - atomic_inc_seq_cst(&seqcnt)
> /* Let's suppose the lack of delay causes the
> setaffinity to complete too early compared
> with KVM_RUN ioctl */
> - sched_setaffinity
> - atomic_inc_seq_cst(&seqcnt)
>
> /* no delay. Even counter state is very
> short. */
> - atomic_inc_seq_cst(&seqcnt)
> /* Then a setaffinity from a following
> migration thread loop will run
> concurrently with KVM_RUN */
> - ioctl(KVM_RUN)
> - sched_setaffinity
> - atomic_inc_seq_cst(&seqcnt)
>
> As pointed out here, if the first setaffinity runs too early compared with KVM_RUN,
> a following setaffinity will run concurrently with it. However, the fact that
> the even counter state is very short may very well hurt progress of the read seqlock.
*sigh*
Several hours later, I think I finally have my head wrapped around everything.
Due to the way the test is written and because of how KVM's run loop works,
TIF_NOTIFY_RESUME or TIF_NEED_RESCHED effectively has to be set before KVM actually
enters the guest, otherwise KVM will exit to userspace without touching the flag,
i.e. it will be handled by the normal exit_to_user_mode_loop().
Where I got lost was trying to figure out why I couldn't make the bug reproduce by
causing the guest to exit to KVM, but not userspace, in which case KVM should
easily trigger the problematic flow as the window for sched_getcpu() to collide
with KVM would be enormous. The reason I didn't go down this route for the
"official" test is that, unless there's something clever I'm overlooking, it
requires arch specific guest code, and ialso I don't know that forcing an exit
would even be necessary/sufficient on other architectures.
Anyways, I was trying to confirm that the bug was being hit without a delay, while
still retaining the sequence retry in the test. The test doesn't fail because the
back-to-back atomic_inc() changes seqcnt too fast. The read-side makes forward
progress, but it never observes failure because the do-while loop only ever
completes after another sched_setaffinity(), never after the one that collides
with KVM because it takes too long to get out of ioctl(KVM_RUN) and back to the
test. I.e. the atomic_inc() in the next loop iteration (makes seq_cnt odd) always
completes before the check, and so the check ends up spinning until another
migration, which correctly updates rseq. This was expected and didn't confuse me.
What confused me is that I was trying to confirm the bug was being hit from within
the kernel by confirming KVM observed TIF_NOTIFY_RESUME, but I misunderstood when
TIF_NOTIFY_RESUME would get set. KVM can observe TIF_NOTIFY_RESUME directly, but
it's rare, and I suspect happens iff sched_setaffinity() hits the small window where
it collides with KVM_RUN before KVM enters the guest.
More commonly, the bug occurs when KVM sees TIF_NEED_RESCHED. In that case, KVM
calls xfer_to_guest_mode_work(), which does schedule() and _that_ sets
TIF_NOTIFY_RESUME. xfer_to_guest_mode_work() then mishandles TIF_NOTIFY_RESUME
and the bug is hit, but my confirmation logic in KVM never fired.
So there are effectively three reasons we want a delay:
1. To allow sched_setaffinity() to coincide with ioctl(KVM_RUN) before KVM can
enter the guest so that the guest doesn't need an arch-specific VM-Exit source.
2. To let ioctl(KVM_RUN) make its way back to the test before the next round
of migration.
3. To ensure the read-side can make forward progress, e.g. if sched_getcpu()
involves a syscall.
After looking at KVM for arm64 and s390, #1 is a bit tenuous because x86 is the
only arch that currently uses xfer_to_guest_mode_work(), i.e. the test could be
tweaked to be overtly x86-specific. But since a delay is needed for #2 and #3,
I'd prefer to rely on it for #1 as well in the hopes that this test provides
coverage for arm64 and/or s390 if they're ever converted to use the common
xfer_to_guest_mode_work().
^ permalink raw reply
* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Nathan Chancellor @ 2021-08-26 23:55 UTC (permalink / raw)
To: Michael Ellerman
Cc: llvm, Nick Desaulniers, linux-kernel, clang-built-linux,
Paul Mackerras, linuxppc-dev
In-Reply-To: <YSfjWfGbZbpYBn+w@Ryzen-9-3900X.localdomain>
On Thu, Aug 26, 2021 at 11:54:17AM -0700, Nathan Chancellor wrote:
> On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
> > Nathan Chancellor <nathan@kernel.org> writes:
> > > On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
> > >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
> > >> flexibility to GCC.
> > ...
> > >
> > > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
> > > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
> > > klist_add_tail to trigger over and over on boot when compiling with
> > > clang:
> > >
> > > [ 2.177416][ T1] WARNING: CPU: 0 PID: 1 at lib/klist.c:62 .klist_add_tail+0x3c/0x110
> > > [ 2.177456][ T1] Modules linked in:
> > > [ 2.177481][ T1] CPU: 0 PID: 1 Comm: swapper/0 Tainted: G W 5.14.0-rc7-next-20210825 #1
> > > [ 2.177520][ T1] NIP: c0000000007ff81c LR: c00000000090a038 CTR: 0000000000000000
> > > [ 2.177557][ T1] REGS: c0000000073c32a0 TRAP: 0700 Tainted: G W (5.14.0-rc7-next-20210825)
> > > [ 2.177593][ T1] MSR: 8000000002029032 <SF,VEC,EE,ME,IR,DR,RI> CR: 22000a40 XER: 00000000
> > > [ 2.177667][ T1] CFAR: c00000000090a034 IRQMASK: 0
> > > [ 2.177667][ T1] GPR00: c00000000090a038 c0000000073c3540 c000000001be3200 0000000000000001
> > > [ 2.177667][ T1] GPR04: c0000000072d65c0 0000000000000000 c0000000091ba798 c0000000091bb0a0
> > > [ 2.177667][ T1] GPR08: 0000000000000001 0000000000000000 c000000008581918 fffffffffffffc00
> > > [ 2.177667][ T1] GPR12: 0000000044000240 c000000001dd0000 c000000000012300 0000000000000000
> > > [ 2.177667][ T1] GPR16: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > [ 2.177667][ T1] GPR20: 0000000000000000 0000000000000000 0000000000000000 0000000000000000
> > > [ 2.177667][ T1] GPR24: 0000000000000000 c0000000017e3200 0000000000000000 c000000001a0e778
> > > [ 2.177667][ T1] GPR28: c0000000072d65b0 c0000000072d65a8 c000000007de72c8 c0000000073c35d0
> > > [ 2.178019][ T1] NIP [c0000000007ff81c] .klist_add_tail+0x3c/0x110
> > > [ 2.178058][ T1] LR [c00000000090a038] .bus_add_driver+0x148/0x290
> > > [ 2.178088][ T1] Call Trace:
> > > [ 2.178105][ T1] [c0000000073c3540] [c0000000073c35d0] 0xc0000000073c35d0 (unreliable)
> > > [ 2.178150][ T1] [c0000000073c35d0] [c00000000090a038] .bus_add_driver+0x148/0x290
> > > [ 2.178190][ T1] [c0000000073c3670] [c00000000090fae8] .driver_register+0xb8/0x190
> > > [ 2.178234][ T1] [c0000000073c3700] [c000000000be55c0] .__hid_register_driver+0x70/0xd0
> > > [ 2.178275][ T1] [c0000000073c37a0] [c00000000116955c] .redragon_driver_init+0x34/0x58
> > > [ 2.178314][ T1] [c0000000073c3820] [c000000000011ae0] .do_one_initcall+0x130/0x3b0
> > > [ 2.178357][ T1] [c0000000073c3bb0] [c0000000011065e0] .do_initcall_level+0xd8/0x188
> > > [ 2.178403][ T1] [c0000000073c3c50] [c0000000011064a8] .do_initcalls+0x7c/0xdc
> > > [ 2.178445][ T1] [c0000000073c3ce0] [c000000001106238] .kernel_init_freeable+0x178/0x21c
> > > [ 2.178491][ T1] [c0000000073c3d90] [c000000000012334] .kernel_init+0x34/0x220
> > > [ 2.178530][ T1] [c0000000073c3e10] [c00000000000cf50] .ret_from_kernel_thread+0x58/0x60
> > > [ 2.178569][ T1] Instruction dump:
> > > [ 2.178592][ T1] fba10078 7c7d1b78 38600001 fb810070 3b9d0008 fbc10080 7c9e2378 389d0018
> > > [ 2.178662][ T1] fb9d0008 fb9d0010 90640000 fbdd0000 <0b1e0000> e87e0018 28230000 41820024
> > > [ 2.178728][ T1] ---[ end trace 52ed3431f58f1847 ]---
> > >
> > > Is this a bug with clang or is there something wrong with the patch? The
> > > vmlinux image is available at [1] if you want to inspect it and our QEMU
> > > command and the warning at boot can be viewed at [2]. If there is any
> > > other information I can provide, please let me know.
> > >
> > > [1] https://builds.tuxbuild.com/1xDcmp3Tvno0TTGxDVPedRKIKM2/
> > > [2] https://github.com/ClangBuiltLinux/continuous-integration2/commit/cee159b66a58eb57fa2359e7888074b9da24126c/checks/3422232736/logs
> >
> > Thanks.
> >
> > This is the generated assembly:
> >
> > c0000000007ff600 <.klist_add_tail>:
> > c0000000007ff600: 7c 08 02 a6 mflr r0
> > c0000000007ff604: f8 01 00 10 std r0,16(r1)
> > c0000000007ff608: f8 21 ff 71 stdu r1,-144(r1) ^ prolog
> > c0000000007ff60c: fb a1 00 78 std r29,120(r1) save r29 to stack
> > c0000000007ff610: 7c 7d 1b 78 mr r29,r3 r29 = struct klist_node *n
> > c0000000007ff614: 38 60 00 01 li r3,1 r3 = 1
> > c0000000007ff618: fb 81 00 70 std r28,112(r1) save r28 to stack
> > c0000000007ff61c: 3b 9d 00 08 addi r28,r29,8 r28 = &n->n_node
> > c0000000007ff620: fb c1 00 80 std r30,128(r1) save r30 to stack
> > c0000000007ff624: 7c 9e 23 78 mr r30,r4 r30 = struct klist *k
> > c0000000007ff628: 38 9d 00 18 addi r4,r29,24 r4 = &n->n_ref
> > c0000000007ff62c: fb 9d 00 08 std r28,8(r29) n->n_node.next = &n->n_node INIT_LIST_HEAD
> > c0000000007ff630: fb 9d 00 10 std r28,16(r29) n->n_node.prev = &n->n_node
> > c0000000007ff634: 90 64 00 00 stw r3,0(r4) kref_init(&n->n_ref)
> > c0000000007ff638: fb dd 00 00 std r30,0(r29) n->n_klist = k
> > c0000000007ff63c: 0b 1e 00 00 tdnei r30,0 trap if r30 (k) is not zero
> >
> >
> > From:
> >
> > static void knode_set_klist(struct klist_node *knode, struct klist *klist)
> > {
> > knode->n_klist = klist;
> > /* no knode deserves to start its life dead */
> > WARN_ON(knode_dead(knode));
> > ^^^^^^^^^^^^^^^^^
> > }
> >
> > Which expands to:
> >
> > static void knode_set_klist(struct klist_node *knode, struct klist *klist)
> > {
> > knode->n_klist = klist;
> >
> > ({
> > bool __ret_warn_on = false;
> > do {
> > ...
> > } else {
> > __label__ __label_warn_on;
> > do {
> > asm goto(
> > "1: "
> > "tdnei"
> > "
> > " " % 4,
> > 0 " "\n " ".section __ex_table,\"a\";"
> > " "
> > ".balign 4;"
> > " "
> > ".long (1b) - . ;"
> > " "
> > ".long (%l[__label_warn_on]) - . ;"
> > " "
> > ".previous"
> > " "
> > ".section __bug_table,\"aw\"\n"
> > "2:\t.4byte 1b - 2b, %0 - 2b\n"
> > "\t.short %1, %2\n"
> > ".org 2b+%3\n"
> > ".previous\n"
> > :
> > : "i"("lib/klist.c"), "i"(62),
> > "i"((1 << 0) | ((9) << 8)),
> > "i"(sizeof(struct bug_entry)),
> > "r"(knode_dead(knode))
> > ^^^^^^^^^^^^^^^^^^^^^
> >
> > :
> > : __label_warn_on);
> > asm("");
> > } while (0);
> > break;
> > __label_warn_on:
> > __ret_warn_on = true;
> > }
> > } while (0);
> > __builtin_expect(!!(__ret_warn_on), 0);
> > });
> > }
> >
> > And knode_dead is:
> >
> > #define KNODE_DEAD 1LU
> >
> > static bool knode_dead(struct klist_node *knode)
> > {
> > return (unsigned long)knode->n_klist & KNODE_DEAD;
> > }
> >
> >
> > So it's meant to warn if (n_klist & KNODE_DEAD) is not equal to zero.
> >
> > But in the asm:
> >
> > c0000000007ff600 <.klist_add_tail>:
> > ...
> > c0000000007ff624: 7c 9e 23 78 mr r30,r4 r30 = struct klist *k
> > ...
> > c0000000007ff638: fb dd 00 00 std r30,0(r29) n->n_klist = k
> > c0000000007ff63c: 0b 1e 00 00 tdnei r30,0 trap if r30 (k) is not zero
> >
> >
> > It's just warning if n_klist is not equal to zero. ie. we lost the "& KNODE_DEAD".
> >
> > In the GCC output you can see it:
> >
> > c0000000008c8a30 <.klist_node_init>:
> > c0000000008c8a30: 39 24 00 08 addi r9,r4,8
> > c0000000008c8a34: 39 40 00 01 li r10,1
> > c0000000008c8a38: f9 24 00 08 std r9,8(r4)
> > c0000000008c8a3c: f9 24 00 10 std r9,16(r4)
> > c0000000008c8a40: 91 44 00 18 stw r10,24(r4)
> > c0000000008c8a44: f8 64 00 00 std r3,0(r4)
> > c0000000008c8a48: 54 69 07 fe clrlwi r9,r3,31
> > c0000000008c8a4c: 0b 09 00 00 tdnei r9,0
> >
> > ie. the clrlwi is "clear left (word) immediate", and zeroes everything
> > except bit 0, which is equivalent to "& KNODE_DEAD".
> >
> >
> > So there seems to be some misunderstanding with clang, it doesn't like
> > us passing an expression to the inline asm.
> >
> > AFAIK it is legal to pass expressions as inputs to inline asm, ie. it
> > doesn't have to just be a variable name.
> >
> > This patch seems to fix it. Not sure if that's just papering over it though.
> >
> > diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> > index 1ee0f22313ee..75fcb4370d96 100644
> > --- a/arch/powerpc/include/asm/bug.h
> > +++ b/arch/powerpc/include/asm/bug.h
> > @@ -119,7 +119,7 @@ __label_warn_on: \
> > \
> > WARN_ENTRY(PPC_TLNEI " %4, 0", \
> > BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \
> > - __label_warn_on, "r" (x)); \
> > + __label_warn_on, "r" (!!(x))); \
> > break; \
> > __label_warn_on: \
> > __ret_warn_on = true; \
> >
> >
> > Generates:
> >
> > c0000000008e2ac0 <.klist_add_tail>:
> > c0000000008e2ac0: 7c 08 02 a6 mflr r0
> > c0000000008e2ac4: f8 01 00 10 std r0,16(r1)
> > c0000000008e2ac8: f8 21 ff 71 stdu r1,-144(r1)
> > c0000000008e2acc: fb a1 00 78 std r29,120(r1)
> > c0000000008e2ad0: 7c 7d 1b 78 mr r29,r3
> > c0000000008e2ad4: 38 60 00 01 li r3,1
> > c0000000008e2ad8: fb c1 00 80 std r30,128(r1)
> > c0000000008e2adc: 7c 9e 23 78 mr r30,r4
> > c0000000008e2ae0: 38 9d 00 18 addi r4,r29,24
> > c0000000008e2ae4: 57 c5 07 fe clrlwi r5,r30,31 <-
> > c0000000008e2ae8: fb 81 00 70 std r28,112(r1)
> > c0000000008e2aec: 3b 9d 00 08 addi r28,r29,8
> > c0000000008e2af0: fb 9d 00 08 std r28,8(r29)
> > c0000000008e2af4: fb 9d 00 10 std r28,16(r29)
> > c0000000008e2af8: 90 64 00 00 stw r3,0(r4)
> > c0000000008e2afc: fb dd 00 00 std r30,0(r29)
> > c0000000008e2b00: 0b 05 00 00 tdnei r5,0 <-
>
> Thank you for the in-depth explanation and triage! I have gone ahead and
> created a standalone reproducer that shows this based on the
> preprocessed file and opened https://bugs.llvm.org/show_bug.cgi?id=51634
> so the LLVM developers can take a look.
Based on Eli Friedman's comment in the bug, would something like this
work and not regress GCC? I noticed that the BUG_ON macro does a cast as
well. Nick pointed out to me privately that we have run into what seems
like a similar issue before in commit 6c58f25e6938 ("riscv/atomic: Fix
sign extension for RV64I").
Cheers,
Nathan
diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
index 1ee0f22313ee..35022667f57d 100644
--- a/arch/powerpc/include/asm/bug.h
+++ b/arch/powerpc/include/asm/bug.h
@@ -119,7 +119,7 @@ __label_warn_on: \
\
WARN_ENTRY(PPC_TLNEI " %4, 0", \
BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \
- __label_warn_on, "r" (x)); \
+ __label_warn_on, "r" ((__force long)(x))); \
break; \
__label_warn_on: \
__ret_warn_on = true; \
^ permalink raw reply related
* Re: [PATCH v2 2/3] selftests/powerpc: Add test for real address error handling
From: Ganesh @ 2021-08-26 12:57 UTC (permalink / raw)
To: Michael Ellerman, linuxppc-dev; +Cc: mikey, mahesh, npiggin
In-Reply-To: <87eeagc2c1.fsf@mpe.ellerman.id.au>
[-- Attachment #1: Type: text/plain, Size: 1767 bytes --]
On 8/26/21 8:57 AM, Michael Ellerman wrote:
> Ganesh <ganeshgr@linux.ibm.com> writes:
>> On 8/24/21 6:18 PM, Michael Ellerman wrote:
>>
>>> Ganesh Goudar <ganeshgr@linux.ibm.com> writes:
>>>> Add test for real address or control memory address access
>>>> error handling, using NX-GZIP engine.
>>>>
>>>> The error is injected by accessing the control memory address
>>>> using illegal instruction, on successful handling the process
>>>> attempting to access control memory address using illegal
>>>> instruction receives SIGBUS.
>>> ...
>>>
>>>> diff --git a/tools/testing/selftests/powerpc/mce/inject-ra-err.sh b/tools/testing/selftests/powerpc/mce/inject-ra-err.sh
>>>> new file mode 100755
>>>> index 000000000000..3633cdc651a1
>>>> --- /dev/null
>>>> +++ b/tools/testing/selftests/powerpc/mce/inject-ra-err.sh
>>>> @@ -0,0 +1,18 @@
>>>> +#!/bin/bash
>>>> +# SPDX-License-Identifier: GPL-2.0-or-later
>>>> +
>>>> +if [[ ! -w /dev/crypto/nx-gzip ]]; then
>>>> + echo "WARN: Can't access /dev/crypto/nx-gzip, skipping"
>>>> + exit 0
>>>> +fi
>>>> +
>>>> +timeout 5 ./inject-ra-err
>>>> +
>>>> +# 128 + 7 (SIGBUS) = 135, 128 is a exit code with special meaning.
>>>> +if [ $? -ne 135 ]; then
>>>> + echo "FAILED: Real address or Control memory access error not handled"
>>>> + exit $?
>>>> +fi
>>>> +
>>>> +echo "OK: Real address or Control memory access error is handled"
>>>> +exit 0
>>> I don't think we really need the shell script, we should be able to do
>>> all that in the C code.
>>>
>>> Can you try this?
>> it works!, We need to set timeout, with 120 sec timeout we may flood the dmesg.
> Hmm. Does it keep faulting? The regs->nip += 4 is meant to avoid that.
Yes, it keeps faulting, if we fail to handle and not send SIGBUS to the process.
>
> cheers
[-- Attachment #2: Type: text/html, Size: 2865 bytes --]
^ permalink raw reply
* Re: [PATCH v2 RESEND] powerpc/audit: Convert powerpc to AUDIT_ARCH_COMPAT_GENERIC
From: Paul Moore @ 2021-08-27 0:52 UTC (permalink / raw)
To: Michael Ellerman, rgb
Cc: linux-kernel, Eric Paris, linux-audit, Paul Mackerras,
linuxppc-dev
In-Reply-To: <87tujc9srr.fsf@mpe.ellerman.id.au>
On Thu, Aug 26, 2021 at 10:37 AM Michael Ellerman <mpe@ellerman.id.au> wrote:
> Paul Moore <paul@paul-moore.com> writes:
> > On Tue, Aug 24, 2021 at 1:11 PM Christophe Leroy
> > <christophe.leroy@csgroup.eu> wrote:
> >> Le 24/08/2021 à 16:47, Paul Moore a écrit :
> >> > On Tue, Aug 24, 2021 at 9:36 AM Christophe Leroy
> >> > <christophe.leroy@csgroup.eu> wrote:
> >> >>
> >> >> Commit e65e1fc2d24b ("[PATCH] syscall class hookup for all normal
> >> >> targets") added generic support for AUDIT but that didn't include
> >> >> support for bi-arch like powerpc.
> >> >>
> >> >> Commit 4b58841149dc ("audit: Add generic compat syscall support")
> >> >> added generic support for bi-arch.
> >> >>
> >> >> Convert powerpc to that bi-arch generic audit support.
> >> >>
> >> >> Cc: Paul Moore <paul@paul-moore.com>
> >> >> Cc: Eric Paris <eparis@redhat.com>
> >> >> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
> >> >> ---
> >> >> Resending v2 with Audit people in Cc
> >> >>
> >> >> v2:
> >> >> - Missing 'git add' for arch/powerpc/include/asm/unistd32.h
> >> >> - Finalised commit description
> >> >> ---
> >> >> arch/powerpc/Kconfig | 5 +-
> >> >> arch/powerpc/include/asm/unistd32.h | 7 +++
> >> >> arch/powerpc/kernel/Makefile | 3 --
> >> >> arch/powerpc/kernel/audit.c | 84 -----------------------------
> >> >> arch/powerpc/kernel/compat_audit.c | 44 ---------------
> >> >> 5 files changed, 8 insertions(+), 135 deletions(-)
> >> >> create mode 100644 arch/powerpc/include/asm/unistd32.h
> >> >> delete mode 100644 arch/powerpc/kernel/audit.c
> >> >> delete mode 100644 arch/powerpc/kernel/compat_audit.c
> >> >
> >> > Can you explain, in detail please, the testing you have done to verify
> >> > this patch?
> >> >
> >>
> >> I built ppc64_defconfig and checked that the generated code is functionnaly equivalent.
> >>
> >> ppc32_classify_syscall() is exactly the same as audit_classify_compat_syscall() except that the
> >> later takes the syscall as second argument (ie in r4) whereas the former takes it as first argument
> >> (ie in r3).
> >>
> >> audit_classify_arch() and powerpc audit_classify_syscall() are slightly different between the
> >> powerpc version and the generic version because the powerpc version checks whether it is
> >> AUDIT_ARCH_PPC or not (ie value 20), while the generic one checks whether it has bit
> >> __AUDIT_ARCH_64BIT set or not (__AUDIT_ARCH_64BIT is the sign bit of a word), but taking into
> >> account that the abi is either AUDIT_ARCH_PPC, AUDIT_ARCH_PPC64 or AUDIT_ARCH_PPC64LE, the result is
> >> the same.
> >>
> >> If you are asking I guess you saw something wrong ?
> >
> > I was asking because I didn't see any mention of testing, and when you
> > are enabling something significant like this it is nice to see that it
> > has been verified to work :)
> >
> > While binary dumps and comparisons are nice, it is always good to see
> > verification from a test suite. I don't have access to the necessary
> > hardware to test this, but could you verify that the audit-testsuite
> > passes on your test system with your patches applied?
> >
> > * https://github.com/linux-audit/audit-testsuite
>
> I tested on ppc64le. Both before and after the patch I get the result
> below.
>
> So I guess the patch is OK, but maybe we have some existing issue.
>
> I had a bit of a look at the test code, but my perl is limited. I think
> it was running the command below, and it returned "<no matches>", but
> not really sure what that means.
If it makes you feel any better, my perl is *very* limited; thankfully
this isn't my first time looking at that test :)
It's a little odd, but after some basic sanity tests at the top, the
test sets a watch on a file, /tmp/<rando_string>, and tells the kernel
to generate audit records for any syscall that operates on that file.
It then creates, and removes, a series of exclude audit filters to
check if the exclude filtering is working as expected, e.g. when
syscall filtering is excluded there should be no syscall records in
the audit log.
In the case you describe, it looks like it looks like the audit
exclude filter is removed (that's what line 147 does), the
/tmp/<rando_string> file is removed (line 152), and then we check to
see if any syscall records exist (line 164, and yes, there should be
*something* there for the unlink/rm).
It may be of little consolation, but this test works just fine on
recent kernels running on both x86_64 and aarch64. I don't have
access to a powerpc system of any vintage, but I added Richard to the
To line above in case he has easier access to a test system (I suspect
the RH/IBM linkage should help in this regard). Otherwise I would
suggest starting to debug this by simply doing some basic tests using
auditctl to create rules and exclude rules to see what is working, and
what isn't; that might provide some clues.
Sorry I'm not much more help at this point :/
> $ sudo ausearch -i -m SYSCALL -p 216440 -ui 0 -gi 0 -ul 0 -su unconfined _u:unconfined_r:unconfined_t:s0-s0:c0.c1023 -ts recent
> <no matches>
>
> cheers
>
>
>
> Running as user root
> with context unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023
> on system Fedora
>
> backlog_wait_time_actual_reset/test .. ok
> exec_execve/test ..................... ok
> exec_name/test ....................... ok
> file_create/test ..................... ok
> file_delete/test ..................... ok
> file_rename/test ..................... ok
> filter_exclude/test .................. 1/21
> # Test 20 got: "256" (filter_exclude/test at line 167)
> # Expected: "0"
> # filter_exclude/test line 167 is: ok( $result, 0 );
> # Test 21 got: "0" (filter_exclude/test at line 179)
> # Expected: "1"
> # filter_exclude/test line 179 is: ok( $found_msg, 1 );
> filter_exclude/test .................. Failed 2/21 subtests
> filter_saddr_fam/test ................ ok
> filter_sessionid/test ................ ok
> login_tty/test ....................... ok
> lost_reset/test ...................... ok
> netfilter_pkt/test ................... ok
> syscalls_file/test ................... ok
> syscall_module/test .................. ok
> time_change/test ..................... ok
> user_msg/test ........................ ok
> fanotify/test ........................ ok
> bpf/test ............................. ok
>
> Test Summary Report
> -------------------
> filter_exclude/test (Wstat: 0 Tests: 21 Failed: 2)
> Failed tests: 20-21
> Files=18, Tests=202, 45 wallclock secs ( 0.18 usr 0.03 sys + 20.15 cusr 0.92 csys = 21.28 CPU)
> Result: FAIL
> Failed 1/18 test programs. 2/202 subtests failed.
--
paul moore
www.paul-moore.com
^ permalink raw reply
* Re: [PATCH v2 1/2] powerpc/bug: Remove specific powerpc BUG_ON() and WARN_ON() on PPC32
From: Nicholas Piggin @ 2021-08-27 1:28 UTC (permalink / raw)
To: Segher Boessenkool; +Cc: linux-kernel, Paul Mackerras, linuxppc-dev
In-Reply-To: <20210826153048.GD1583@gate.crashing.org>
Excerpts from Segher Boessenkool's message of August 27, 2021 1:30 am:
> On Fri, Aug 27, 2021 at 01:04:36AM +1000, Nicholas Piggin wrote:
>> Excerpts from Segher Boessenkool's message of August 27, 2021 12:37 am:
>> >> No, they are all dispatched and issue to the BRU for execution. It's
>> >> trivial to construct a test of a lot of not taken branches in a row
>> >> and time a loop of it to see it executes at 1 cycle per branch.
>> >
>> > (s/dispatched/issued/)
>>
>> ?
>
> Dispatch is from decode to the issue queues. Issue is from there to
> execution units. Dispatch is in-order, issue is not.
I know what those mean, I wonder what your s/dispatched/issued means.
I was saying they are dispatched in response to you saying they never
hit the issue queue.
>
>> >> How could it validate prediction without issuing? It wouldn't know when
>> >> sources are ready.
>> >
>> > In the backend. But that is just how it worked on older cores :-/
>>
>> Okay. I don't know about older cores than POWER9. Backend would normally
>> include execution though.
>> Only other place you could do it if you don't
>> issue/exec would be after it goes back in order, like completion.
>
> You do not have to do the verification in-order: the insn cannot finish
> until it is no longer speculative, that takes care of all ordering
> needed.
Branches *can* finish out of order and speculative as they do in P9 and
P10. Are you talking about these CPUs or something else which can
verify branches without issuing them?
>
>> But that would be horrible for mispredict penalty.
>
> See the previous point. Also, any insn known to be mispredicted can be
> flushed immediately anyway.
The point is it has to know sources (CR) to verify (aka execute) the
branch prediction was right, and if it needs sources then it needs to
either issue and execute in the out of order part, or it needs to wait
until completion which would seem to be prohibitively expensive. I am
interested to know how it works.
>
>> >> >> The first problem seems like the show stopper though. AFAIKS it would
>> >> >> need a special builtin support that does something to create the table
>> >> >> entry, or a guarantee that we could put an inline asm right after the
>> >> >> builtin as a recognized pattern and that would give us the instruction
>> >> >> following the trap.
>> >> >
>> >> > I'm not quite sure what this means. Can't you always just put a
>> >> >
>> >> > bla: asm("");
>> >> >
>> >> > in there, and use the address of "bla"?
>> >>
>> >> Not AFAIKS. Put it where?
>> >
>> > After wherever you want to know the address after. You will have to
>> > make sure they stay together somehow.
>>
>> I still don't follow.
>
> some_thing_you_want_to_know_the_address_after_let_us_call_it_A;
> empty_asm_that_we_can_take_the_address_of_known_as_B;
>
> You have to make sure the compiler keeps A and B together, does not
> insert anything between them, does put them in the assembler output in
> the same fragment, etc.
How does all this help our problem of putting the address of the trap
into the table?
>
>> If you could give a built in that put a label at the address of the trap
>> instruction that could be used later by inline asm then that could work
>> too:
>>
>> __builtin_labeled_trap("1:");
>> asm (" .section __bug_table,\"aw\" \n\t"
>> "2: .4byte 1b - 2b \n\t"
>> " .previous");
>
> How could a compiler do anything like that?!
How could it add a label at the trap instruction it generates? It didn't
seem like an outlandish thing to do, but I'm not a compiler writer. It was
just a handwaving idea to show what we want to be able to do.
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH v2 3/4] powerpc/64s/interrupt: Don't enable MSR[EE] in irq handlers unless perf is in use
From: Nicholas Piggin @ 2021-08-27 1:33 UTC (permalink / raw)
To: linuxppc-dev, kernel test robot; +Cc: kbuild-all
In-Reply-To: <202108262232.PzC05uqr-lkp@intel.com>
Excerpts from kernel test robot's message of August 27, 2021 1:04 am:
> Hi Nicholas,
>
> I love your patch! Yet something to improve:
>
> [auto build test ERROR on powerpc/next]
> [also build test ERROR on v5.14-rc7 next-20210826]
> [cannot apply to scottwood/next]
> [If your patch is applied to the wrong git tree, kindly drop us a note.
> And when submitting patch, we suggest to use '--base' as documented in
> https://git-scm.com/docs/git-format-patch]
>
> url: https://github.com/0day-ci/linux/commits/Nicholas-Piggin/powerpc-64s-interrupt-speedups/20210825-204209
> base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
> config: powerpc-allnoconfig (attached as .config)
Fix for 32-bit:
diff --git a/arch/powerpc/include/asm/hw_irq.h b/arch/powerpc/include/asm/hw_irq.h
index 8c78c40c006e..55e3fa44f280 100644
--- a/arch/powerpc/include/asm/hw_irq.h
+++ b/arch/powerpc/include/asm/hw_irq.h
@@ -437,7 +437,7 @@ static inline bool arch_irq_disabled_regs(struct pt_regs *regs)
return !(regs->msr & MSR_EE);
}
-static inline bool may_hard_irq_enable(void)
+static inline bool should_hard_irq_enable(void)
{
return false;
}
^ permalink raw reply related
* Re: [PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization
From: Claire Chang @ 2021-08-27 3:50 UTC (permalink / raw)
To: Guenter Roeck
Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
Marek Szyprowski, Stefano Stabellini, Saravana Kannan,
Joerg Roedel, Rafael J . Wysocki, Christoph Hellwig,
Bartosz Golaszewski, bskeggs, linux-pci, xen-devel,
Thierry Reding, intel-gfx, matthew.auld, linux-devicetree,
Jianxiong Gao, Daniel Vetter, Will Deacon, Konrad Rzeszutek Wilk,
maarten.lankhorst, airlied, Dan Williams, linuxppc-dev,
jani.nikula, Rob Herring, rodrigo.vivi, Bjorn Helgaas,
boris.ostrovsky, Andy Shevchenko, jgross, Nicolas Boichat,
Greg KH, Randy Dunlap, Qian Cai, lkml, Tomasz Figa,
list@263.net:IOMMU DRIVERS, Jim Quinlan, xypron.glpk,
Tom Lendacky, Robin Murphy, bauerman
In-Reply-To: <20210824142601.GA3393158@roeck-us.net>
On Tue, Aug 24, 2021 at 10:26 PM Guenter Roeck <linux@roeck-us.net> wrote:
>
> Hi Claire,
>
> On Thu, Jun 24, 2021 at 11:55:24PM +0800, Claire Chang wrote:
> > Add the initialization function to create restricted DMA pools from
> > matching reserved-memory nodes.
> >
> > Regardless of swiotlb setting, the restricted DMA pool is preferred if
> > available.
> >
> > The restricted DMA pools provide a basic level of protection against the
> > DMA overwriting buffer contents at unexpected times. However, to protect
> > against general data leakage and system memory corruption, the system
> > needs to provide a way to lock down the memory access, e.g., MPU.
> >
> > Signed-off-by: Claire Chang <tientzu@chromium.org>
> > Reviewed-by: Christoph Hellwig <hch@lst.de>
> > Tested-by: Stefano Stabellini <sstabellini@kernel.org>
> > Tested-by: Will Deacon <will@kernel.org>
> > ---
> > include/linux/swiotlb.h | 3 +-
> > kernel/dma/Kconfig | 14 ++++++++
> > kernel/dma/swiotlb.c | 76 +++++++++++++++++++++++++++++++++++++++++
> > 3 files changed, 92 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h
> > index 3b9454d1e498..39284ff2a6cd 100644
> > --- a/include/linux/swiotlb.h
> > +++ b/include/linux/swiotlb.h
> > @@ -73,7 +73,8 @@ extern enum swiotlb_force swiotlb_force;
> > * range check to see if the memory was in fact allocated by this
> > * API.
> > * @nslabs: The number of IO TLB blocks (in groups of 64) between @start and
> > - * @end. This is command line adjustable via setup_io_tlb_npages.
> > + * @end. For default swiotlb, this is command line adjustable via
> > + * setup_io_tlb_npages.
> > * @used: The number of used IO TLB block.
> > * @list: The free list describing the number of free entries available
> > * from each index.
> > diff --git a/kernel/dma/Kconfig b/kernel/dma/Kconfig
> > index 77b405508743..3e961dc39634 100644
> > --- a/kernel/dma/Kconfig
> > +++ b/kernel/dma/Kconfig
> > @@ -80,6 +80,20 @@ config SWIOTLB
> > bool
> > select NEED_DMA_MAP_STATE
> >
> > +config DMA_RESTRICTED_POOL
> > + bool "DMA Restricted Pool"
> > + depends on OF && OF_RESERVED_MEM
> > + select SWIOTLB
>
> This makes SWIOTLB user configurable, which in turn results in
>
> mips64-linux-ld: arch/mips/kernel/setup.o: in function `arch_mem_init':
> setup.c:(.init.text+0x19c8): undefined reference to `plat_swiotlb_setup'
> make[1]: *** [Makefile:1280: vmlinux] Error 1
>
> when building mips:allmodconfig.
>
> Should this possibly be "depends on SWIOTLB" ?
Patch is sent here: https://lkml.org/lkml/2021/8/26/932
>
> Thanks,
> Guenter
Thanks,
Claire
^ permalink raw reply
* [PATCH kernel] KVM: PPC: Fix clearing never mapped TCEs in realmode
From: Alexey Kardashevskiy @ 2021-08-27 4:07 UTC (permalink / raw)
To: linuxppc-dev; +Cc: Alexey Kardashevskiy, Leonardo Bras, kvm-ppc
Since e1a1ef84cd07, pages for TCE tables for KVM guests are allocated
only when needed. This allows skipping any update when clearing TCEs.
This works mostly fine as TCE updates are handled when MMU is enabled.
The realmode handlers fail with H_TOO_HARD when pages are not yet
allocated except when clearing a TCE in which case KVM prints a warning
but proceeds to dereference a NULL pointer which crashes the host OS.
This has not been caught so far as the change is reasonably new,
POWER9 runs mostly radix which does not use realmode handlers.
With hash, the default TCE table is memset() by QEMU the machine reset
which triggers page faults and the KVM TCE device's kvm_spapr_tce_fault()
handles those with MMU on. And the huge DMA windows are not cleared
by VMs whicn instead successfully create a DMA window big enough to map
the VM memory 1:1 and then VMs just map everything without clearing.
This started crashing now as upcoming sriov-under-powervm support added
a mode when a dymanic DMA window not big enough to map the VM memory 1:1
but it is used anyway, and the VM now is the first (i.e. not QEMU) to
clear a just created table. Note that the upstream QEMU needs to be
modified to trigger the VM to trigger the host OS crash.
This replaces WARN_ON_ONCE_RM() with a check and return.
This adds another warning if TCE is not being cleared.
Cc: Leonardo Bras <leobras.c@gmail.com>
Fixes: e1a1ef84cd07 ("KVM: PPC: Book3S: Allocate guest TCEs on demand too")
Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
With recent changes in the printk() department, calling pr_err() when MMU
off causes lockdep lockups which I did not dig any further so we should
start getting rid of the realmode's WARN_ON_ONCE_RM().
---
arch/powerpc/kvm/book3s_64_vio_hv.c | 9 ++++++---
1 file changed, 6 insertions(+), 3 deletions(-)
diff --git a/arch/powerpc/kvm/book3s_64_vio_hv.c b/arch/powerpc/kvm/book3s_64_vio_hv.c
index 083a4e037718..e5ba96c41f3f 100644
--- a/arch/powerpc/kvm/book3s_64_vio_hv.c
+++ b/arch/powerpc/kvm/book3s_64_vio_hv.c
@@ -173,10 +173,13 @@ static void kvmppc_rm_tce_put(struct kvmppc_spapr_tce_table *stt,
idx -= stt->offset;
page = stt->pages[idx / TCES_PER_PAGE];
/*
- * page must not be NULL in real mode,
- * kvmppc_rm_ioba_validate() must have taken care of this.
+ * kvmppc_rm_ioba_validate() allows pages not be allocated if TCE is
+ * being cleared, otherwise it returns H_TOO_HARD and we skip this.
*/
- WARN_ON_ONCE_RM(!page);
+ if (!page) {
+ WARN_ON_ONCE_RM(tce != 0);
+ return;
+ }
tbl = kvmppc_page_address(page);
tbl[idx % TCES_PER_PAGE] = tce;
--
2.30.2
^ permalink raw reply related
* Re: [PATCH 1/2] ASoC: fsl_rpmsg: add soc specific data structure
From: Shengjiu Wang @ 2021-08-27 6:14 UTC (permalink / raw)
To: Fabio Estevam
Cc: Linux-ALSA, Timur Tabi, Xiubo Li, Shengjiu Wang, linux-kernel,
Takashi Iwai, Nicolin Chen, Mark Brown, linuxppc-dev
In-Reply-To: <CAOMZO5BCsTMjJJPtLN6_seVcWb24A2ms11FP3HzR0i7t3GLSuA@mail.gmail.com>
On Thu, Aug 26, 2021 at 7:54 PM Fabio Estevam <festevam@gmail.com> wrote:
>
> Hi Shengjiu,
>
> On Thu, Aug 26, 2021 at 8:19 AM Shengjiu Wang <shengjiu.wang@nxp.com> wrote:
>
> > + rpmsg->soc_data = of_device_get_match_data(&pdev->dev);
> > + if (rpmsg->soc_data) {
>
> This check is not necessary, because rpmsg->soc_data is always non-NULL.
>
> Other than that:
>
> Reviewed-by: Fabio Estevam <festevam@gmail.com>
Thanks, I will update in v2.
Best regards
wang shengjiu
^ permalink raw reply
* [PATCH v2] ASoC: fsl_rpmsg: add soc specific data structure
From: Shengjiu Wang @ 2021-08-27 6:00 UTC (permalink / raw)
To: timur, nicoleotsuka, Xiubo.Lee, festevam, broonie, perex, tiwai,
alsa-devel
Cc: linuxppc-dev, linux-kernel
Each platform has different supported rates and
formats, so add soc specific data for each platform.
This soc specific data is attached with compatible string.
Signed-off-by: Shengjiu Wang <shengjiu.wang@nxp.com>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
---
changes in v2:
- remove checking rpmsg->soc_data is NULL
- add Reviewed-by Fabio
sound/soc/fsl/fsl_rpmsg.c | 46 +++++++++++++++++++++++++++++++++++----
sound/soc/fsl/fsl_rpmsg.h | 12 ++++++++++
2 files changed, 54 insertions(+), 4 deletions(-)
diff --git a/sound/soc/fsl/fsl_rpmsg.c b/sound/soc/fsl/fsl_rpmsg.c
index d60f4dac6c1b..07abad7fe372 100644
--- a/sound/soc/fsl/fsl_rpmsg.c
+++ b/sound/soc/fsl/fsl_rpmsg.c
@@ -138,11 +138,42 @@ static const struct snd_soc_component_driver fsl_component = {
.name = "fsl-rpmsg",
};
+static const struct fsl_rpmsg_soc_data imx7ulp_data = {
+ .rates = SNDRV_PCM_RATE_8000 | SNDRV_PCM_RATE_16000 |
+ SNDRV_PCM_RATE_48000,
+ .formats = SNDRV_PCM_FMTBIT_S16_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mm_data = {
+ .rates = SNDRV_PCM_RATE_KNOT,
+ .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+ SNDRV_PCM_FMTBIT_S32_LE | SNDRV_PCM_FMTBIT_DSD_U8 |
+ SNDRV_PCM_FMTBIT_DSD_U16_LE | SNDRV_PCM_FMTBIT_DSD_U32_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mn_data = {
+ .rates = SNDRV_PCM_RATE_32000 | SNDRV_PCM_RATE_44100 |
+ SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_88200 |
+ SNDRV_PCM_RATE_96000 | SNDRV_PCM_RATE_176400 |
+ SNDRV_PCM_RATE_192000,
+ .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+ SNDRV_PCM_FMTBIT_S32_LE,
+};
+
+static const struct fsl_rpmsg_soc_data imx8mp_data = {
+ .rates = SNDRV_PCM_RATE_32000 | SNDRV_PCM_RATE_44100 |
+ SNDRV_PCM_RATE_48000 | SNDRV_PCM_RATE_88200 |
+ SNDRV_PCM_RATE_96000 | SNDRV_PCM_RATE_176400 |
+ SNDRV_PCM_RATE_192000,
+ .formats = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_S24_LE |
+ SNDRV_PCM_FMTBIT_S32_LE,
+};
+
static const struct of_device_id fsl_rpmsg_ids[] = {
- { .compatible = "fsl,imx7ulp-rpmsg-audio"},
- { .compatible = "fsl,imx8mm-rpmsg-audio"},
- { .compatible = "fsl,imx8mn-rpmsg-audio"},
- { .compatible = "fsl,imx8mp-rpmsg-audio"},
+ { .compatible = "fsl,imx7ulp-rpmsg-audio", .data = &imx7ulp_data},
+ { .compatible = "fsl,imx8mm-rpmsg-audio", .data = &imx8mm_data},
+ { .compatible = "fsl,imx8mn-rpmsg-audio", .data = &imx8mn_data},
+ { .compatible = "fsl,imx8mp-rpmsg-audio", .data = &imx8mp_data},
{ /* sentinel */ }
};
MODULE_DEVICE_TABLE(of, fsl_rpmsg_ids);
@@ -157,6 +188,13 @@ static int fsl_rpmsg_probe(struct platform_device *pdev)
if (!rpmsg)
return -ENOMEM;
+ rpmsg->soc_data = of_device_get_match_data(&pdev->dev);
+
+ fsl_rpmsg_dai.playback.rates = rpmsg->soc_data->rates;
+ fsl_rpmsg_dai.capture.rates = rpmsg->soc_data->rates;
+ fsl_rpmsg_dai.playback.formats = rpmsg->soc_data->formats;
+ fsl_rpmsg_dai.capture.formats = rpmsg->soc_data->formats;
+
if (of_property_read_bool(np, "fsl,enable-lpa")) {
rpmsg->enable_lpa = 1;
rpmsg->buffer_size = LPA_LARGE_BUFFER_SIZE;
diff --git a/sound/soc/fsl/fsl_rpmsg.h b/sound/soc/fsl/fsl_rpmsg.h
index 4f5b49eb18d8..b04086fbf828 100644
--- a/sound/soc/fsl/fsl_rpmsg.h
+++ b/sound/soc/fsl/fsl_rpmsg.h
@@ -6,6 +6,16 @@
#ifndef __FSL_RPMSG_H
#define __FSL_RPMSG_H
+/*
+ * struct fsl_rpmsg_soc_data
+ * @rates: supported rates
+ * @formats: supported formats
+ */
+struct fsl_rpmsg_soc_data {
+ int rates;
+ u64 formats;
+};
+
/*
* struct fsl_rpmsg - rpmsg private data
*
@@ -15,6 +25,7 @@
* @pll8k: parent clock for multiple of 8kHz frequency
* @pll11k: parent clock for multiple of 11kHz frequency
* @card_pdev: Platform_device pointer to register a sound card
+ * @soc_data: soc specific data
* @mclk_streams: Active streams that are using baudclk
* @force_lpa: force enable low power audio routine if condition satisfy
* @enable_lpa: enable low power audio routine according to dts setting
@@ -27,6 +38,7 @@ struct fsl_rpmsg {
struct clk *pll8k;
struct clk *pll11k;
struct platform_device *card_pdev;
+ const struct fsl_rpmsg_soc_data *soc_data;
unsigned int mclk_streams;
int force_lpa;
int enable_lpa;
--
2.17.1
^ permalink raw reply related
* Re: [PATCH v15 10/12] swiotlb: Add restricted DMA pool initialization
From: Andy Shevchenko @ 2021-08-27 6:58 UTC (permalink / raw)
To: Claire Chang
Cc: heikki.krogerus, thomas.hellstrom, peterz, joonas.lahtinen,
dri-devel, chris, grant.likely, paulus, Frank Rowand, mingo,
Marek Szyprowski, sstabellini, Saravana Kannan, Joerg Roedel,
Rafael J . Wysocki, Christoph Hellwig, Bartosz Golaszewski,
bskeggs, linux-pci, xen-devel, Thierry Reding, intel-gfx,
matthew.auld, linux-devicetree, jxgao, daniel, Will Deacon,
Konrad Rzeszutek Wilk, maarten.lankhorst, airlied, Dan Williams,
linuxppc-dev, jani.nikula, Rob Herring, rodrigo.vivi, bhelgaas,
Boris Ostrovsky, Andy Shevchenko, Juergen Gross, Nicolas Boichat,
Greg KH, Randy Dunlap, quic_qiancai, lkml, tfiga,
list@263.net:IOMMU DRIVERS, Jim Quinlan, xypron.glpk,
thomas.lendacky, Robin Murphy, bauerman
In-Reply-To: <20210624155526.2775863-11-tientzu@chromium.org>
On Thu, Jun 24, 2021 at 6:59 PM Claire Chang <tientzu@chromium.org> wrote:
>
> Add the initialization function to create restricted DMA pools from
> matching reserved-memory nodes.
>
> Regardless of swiotlb setting, the restricted DMA pool is preferred if
> available.
>
> The restricted DMA pools provide a basic level of protection against the
> DMA overwriting buffer contents at unexpected times. However, to protect
> against general data leakage and system memory corruption, the system
> needs to provide a way to lock down the memory access, e.g., MPU.
> +static int rmem_swiotlb_device_init(struct reserved_mem *rmem,
> + struct device *dev)
> +{
> + struct io_tlb_mem *mem = rmem->priv;
> + unsigned long nslabs = rmem->size >> IO_TLB_SHIFT;
> +
> + /*
> + * Since multiple devices can share the same pool, the private data,
> + * io_tlb_mem struct, will be initialized by the first device attached
> + * to it.
> + */
> + if (!mem) {
Can it be rather
if (mem)
goto out_assign;
or so?
> + mem = kzalloc(struct_size(mem, slots, nslabs), GFP_KERNEL);
> + if (!mem)
> + return -ENOMEM;
> +
> + set_memory_decrypted((unsigned long)phys_to_virt(rmem->base),
> + rmem->size >> PAGE_SHIFT);
Below you are using a macro from pfn.h, but not here, I think it's PFN_DOWN().
> + swiotlb_init_io_tlb_mem(mem, rmem->base, nslabs, false);
> + mem->force_bounce = true;
> + mem->for_alloc = true;
> +
> + rmem->priv = mem;
> +
> + if (IS_ENABLED(CONFIG_DEBUG_FS)) {
> + mem->debugfs =
> + debugfs_create_dir(rmem->name, debugfs_dir);
> + swiotlb_create_debugfs_files(mem);
> + }
> + }
> +
> + dev->dma_io_tlb_mem = mem;
> +
> + return 0;
> +}
> +
> +static void rmem_swiotlb_device_release(struct reserved_mem *rmem,
> + struct device *dev)
> +{
> + dev->dma_io_tlb_mem = io_tlb_default_mem;
> +}
> +
> +static const struct reserved_mem_ops rmem_swiotlb_ops = {
> + .device_init = rmem_swiotlb_device_init,
> + .device_release = rmem_swiotlb_device_release,
> +};
> +
> +static int __init rmem_swiotlb_setup(struct reserved_mem *rmem)
> +{
> + unsigned long node = rmem->fdt_node;
> +
> + if (of_get_flat_dt_prop(node, "reusable", NULL) ||
> + of_get_flat_dt_prop(node, "linux,cma-default", NULL) ||
> + of_get_flat_dt_prop(node, "linux,dma-default", NULL) ||
> + of_get_flat_dt_prop(node, "no-map", NULL))
> + return -EINVAL;
> +
> + if (PageHighMem(pfn_to_page(PHYS_PFN(rmem->base)))) {
> + pr_err("Restricted DMA pool must be accessible within the linear mapping.");
> + return -EINVAL;
> + }
> +
> + rmem->ops = &rmem_swiotlb_ops;
> + pr_info("Reserved memory: created restricted DMA pool at %pa, size %ld MiB\n",
> + &rmem->base, (unsigned long)rmem->size / SZ_1M);
Oh là là, besides explicit casting that I believe can be avoided, %ld
!= unsigned long. Can you check the printk-formats.rst document?
> + return 0;
> +}
> +
> +RESERVEDMEM_OF_DECLARE(dma, "restricted-dma-pool", rmem_swiotlb_setup);
> #endif /* CONFIG_DMA_RESTRICTED_POOL */
--
With Best Regards,
Andy Shevchenko
^ permalink raw reply
* Re: [PATCH v2 1/4] powerpc/64: handle MSR EE and RI in interrupt entry wrapper
From: Daniel Axtens @ 2021-08-27 7:31 UTC (permalink / raw)
To: Nicholas Piggin, linuxppc-dev; +Cc: Nicholas Piggin
In-Reply-To: <20210825123714.706201-2-npiggin@gmail.com>
Hi,
> Similarly to the system call change in the previous patch, the mtmsrd to
I don't actually know what patch this was - I assume it's from a series
that has since been applied?
> enable RI can be combined with the mtmsrd to enable EE for interrupts
> which enable the latter, which tends to be the important synchronous
> interrupts (i.e., page faults).
>
> Do this by enabling EE and RI together at the beginning of the entry
> wrapper if PACA_IRQ_HARD_DIS is clear, and just enabling RI if it is set
> (which means something wanted EE=0).
> diff --git a/arch/powerpc/include/asm/interrupt.h b/arch/powerpc/include/asm/interrupt.h
> index 6b800d3e2681..e3228a911b35 100644
> --- a/arch/powerpc/include/asm/interrupt.h
> +++ b/arch/powerpc/include/asm/interrupt.h
> @@ -148,9 +148,21 @@ static inline void interrupt_enter_prepare(struct pt_regs *regs, struct interrup
> #endif
>
> #ifdef CONFIG_PPC64
> - if (irq_soft_mask_set_return(IRQS_ALL_DISABLED) == IRQS_ENABLED)
> + bool trace_enable = false;
> +
> + if (IS_ENABLED(CONFIG_TRACE_IRQFLAGS)) {
> + if (irq_soft_mask_set_return(IRQS_DISABLED) == IRQS_ENABLED)
In the previous code we had IRQS_ALL_DISABLED, now we just have
IRQS_DISABLED. Is that intentional?
> + trace_enable = true;
> + } else {
> + irq_soft_mask_set(IRQS_DISABLED);
> + }
> + /* If the interrupt was taken with HARD_DIS set, don't enable MSR[EE] */
> + if (local_paca->irq_happened & PACA_IRQ_HARD_DIS)
> + __hard_RI_enable();
> + else
> + __hard_irq_enable();
> + if (trace_enable)
> trace_hardirqs_off();
> - local_paca->irq_happened |= PACA_IRQ_HARD_DIS;
>
> if (user_mode(regs)) {
> CT_WARN_ON(ct_state() != CONTEXT_USER);
> @@ -901,11 +892,6 @@ INT_DEFINE_BEGIN(system_reset)
> IVEC=0x100
> IAREA=PACA_EXNMI
> IVIRT=0 /* no virt entry point */
> - /*
> - * MSR_RI is not enabled, because PACA_EXNMI and nmi stack is
> - * being used, so a nested NMI exception would corrupt it.
> - */
> - ISET_RI=0
> ISTACK=0
> IKVM_REAL=1
> INT_DEFINE_END(system_reset)
> @@ -986,8 +972,6 @@ EXC_COMMON_BEGIN(system_reset_common)
Right before this change, there's a comment that reads:
/*
* Increment paca->in_nmi then enable MSR_RI. SLB or MCE will be able
* to recover, but nested NMI will notice in_nmi and not recover
...
You've taken out the bit that enables MSR_RI, which means the comment is
no longer accurate.
Beyond that, I'm still trying to follow all the various changes in code
flows. It seems to make at least vague sense: we move the setting of
MSR_RI from the early asm to interrupt*enter_prepare. I'm just
struggling to convince myself that when we hit the underlying handler
that the RI states all line up.
Kind regards,
Daniel
^ permalink raw reply
* Re: [PATCH v2 2/2] powerpc/bug: Provide better flexibility to WARN_ON/__WARN_FLAGS() with asm goto
From: Michael Ellerman @ 2021-08-27 7:53 UTC (permalink / raw)
To: Nathan Chancellor
Cc: llvm, Nick Desaulniers, linux-kernel, clang-built-linux,
Paul Mackerras, linuxppc-dev
In-Reply-To: <YSgp7HNGXbzrfvFq@Ryzen-9-3900X.localdomain>
Nathan Chancellor <nathan@kernel.org> writes:
> On Thu, Aug 26, 2021 at 11:54:17AM -0700, Nathan Chancellor wrote:
>> On Thu, Aug 26, 2021 at 01:21:39PM +1000, Michael Ellerman wrote:
>> > Nathan Chancellor <nathan@kernel.org> writes:
>> > > On Tue, Apr 13, 2021 at 04:38:10PM +0000, Christophe Leroy wrote:
>> > >> Using asm goto in __WARN_FLAGS() and WARN_ON() allows more
>> > >> flexibility to GCC.
>> > ...
>> > >
>> > > This patch as commit 1e688dd2a3d6 ("powerpc/bug: Provide better
>> > > flexibility to WARN_ON/__WARN_FLAGS() with asm goto") cause a WARN_ON in
>> > > klist_add_tail to trigger over and over on boot when compiling with
>> > > clang:
>> > >
...
>> >
>> > This patch seems to fix it. Not sure if that's just papering over it though.
>> >
>> > diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
>> > index 1ee0f22313ee..75fcb4370d96 100644
>> > --- a/arch/powerpc/include/asm/bug.h
>> > +++ b/arch/powerpc/include/asm/bug.h
>> > @@ -119,7 +119,7 @@ __label_warn_on: \
>> > \
>> > WARN_ENTRY(PPC_TLNEI " %4, 0", \
>> > BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \
>> > - __label_warn_on, "r" (x)); \
>> > + __label_warn_on, "r" (!!(x))); \
>> > break; \
>> > __label_warn_on: \
>> > __ret_warn_on = true; \
>> >
>>
>> Thank you for the in-depth explanation and triage! I have gone ahead and
>> created a standalone reproducer that shows this based on the
>> preprocessed file and opened https://bugs.llvm.org/show_bug.cgi?id=51634
>> so the LLVM developers can take a look.
>
> Based on Eli Friedman's comment in the bug, would something like this
> work and not regress GCC? I noticed that the BUG_ON macro does a cast as
> well. Nick pointed out to me privately that we have run into what seems
> like a similar issue before in commit 6c58f25e6938 ("riscv/atomic: Fix
> sign extension for RV64I").
Yes, I read that comment this morning, and then landed at the same fix
via digging through the history of our bug code.
We in fact fixed a similar bug 16 years ago :}
32818c2eb6b8 ("[PATCH] ppc64: Fix issue with gcc 4.0 compiled kernels")
Which is when we first started adding the cast to long.
> diff --git a/arch/powerpc/include/asm/bug.h b/arch/powerpc/include/asm/bug.h
> index 1ee0f22313ee..35022667f57d 100644
> --- a/arch/powerpc/include/asm/bug.h
> +++ b/arch/powerpc/include/asm/bug.h
> @@ -119,7 +119,7 @@ __label_warn_on: \
> \
> WARN_ENTRY(PPC_TLNEI " %4, 0", \
> BUGFLAG_WARNING | BUGFLAG_TAINT(TAINT_WARN), \
> - __label_warn_on, "r" (x)); \
> + __label_warn_on, "r" ((__force long)(x))); \
> break; \
> __label_warn_on: \
> __ret_warn_on = true; \
Yeah that fixes the clang build for me.
For GCC it seems to generate the same code in the simple cases:
void test_warn_on_ulong(unsigned long b)
{
WARN_ON(b);
}
void test_warn_on_int(int b)
{
WARN_ON(b);
}
I get:
0000000000000020 <.test_warn_on_ulong>:
20: 0b 03 00 00 tdnei r3,0
24: 4e 80 00 20 blr
28: 60 00 00 00 nop
2c: 60 00 00 00 nop
0000000000000030 <.test_warn_on_int>:
30: 0b 03 00 00 tdnei r3,0
34: 4e 80 00 20 blr
Both before and after the change. So that's good.
For:
void test_warn_on_int_addition(int b)
{
WARN_ON(b+1);
}
Before I get:
0000000000000040 <.test_warn_on_int_addition>:
40: 38 63 00 01 addi r3,r3,1
44: 0b 03 00 00 tdnei r3,0
48: 4e 80 00 20 blr
vs after:
0000000000000040 <.test_warn_on_int_addition>:
40: 38 63 00 01 addi r3,r3,1
44: 7c 63 07 b4 extsw r3,r3
48: 0b 03 00 00 tdnei r3,0
4c: 4e 80 00 20 blr
So there's an extra sign extension we don't need, but I guess we can
probably live with that.
cheers
^ permalink raw reply
* [RFC PATCH] powerpc: Investigate static_call concept
From: Christophe Leroy @ 2021-08-27 9:45 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: Peter Zijlstra, linux-kernel, Steven Rostedt, Jason Baron,
Josh Poimboeuf, linuxppc-dev, Ard Biesheuvel
This RFC is to validate the concept of static_call on powerpc.
Highly copied from x86.
It replaces ppc_md.get_irq() which is called at every IRQ, by
a static call.
When updating the call, we just replace the instruction at the
trampoline address by a relative jump to the function.
For the time being, the case of out-of-range functions is not handled.
Note that even if we don't immediately use static calls in powerpc
code, supporting static calls would immediately benefit to core
functionnalities using it, like tracing.
Tested on powerpc 8xx.
With the patch:
00000000 <__SCT__ppc_md_get_irq>:
0: 4e 80 00 20 blr <== Replaced by 'b <ppc_md.get_irq>' at runtime
...
00000038 <__do_irq>:
38: 94 21 ff f0 stwu r1,-16(r1)
3c: 7c 08 02 a6 mflr r0
40: 90 01 00 14 stw r0,20(r1)
44: 48 00 00 01 bl 44 <__do_irq+0xc>
44: R_PPC_REL24 __SCT__ppc_md_get_irq
...
Before the patch:
00000038 <__do_irq>:
38: 3d 20 00 00 lis r9,0
3a: R_PPC_ADDR16_HA ppc_md+0x20
3c: 94 21 ff f0 stwu r1,-16(r1)
40: 81 29 00 00 lwz r9,0(r9)
42: R_PPC_ADDR16_LO ppc_md+0x20
44: 7c 08 02 a6 mflr r0
48: 90 01 00 14 stw r0,20(r1)
4c: 7d 29 03 a6 mtctr r9
50: 4e 80 04 21 bctrl
...
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/Kconfig | 1 +
arch/powerpc/include/asm/machdep.h | 3 +++
arch/powerpc/include/asm/static_call.h | 25 +++++++++++++++++++++++++
arch/powerpc/kernel/Makefile | 2 +-
arch/powerpc/kernel/irq.c | 2 +-
arch/powerpc/kernel/setup-common.c | 4 ++++
arch/powerpc/kernel/static_call.c | 16 ++++++++++++++++
7 files changed, 51 insertions(+), 2 deletions(-)
create mode 100644 arch/powerpc/include/asm/static_call.h
create mode 100644 arch/powerpc/kernel/static_call.c
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 36b72d972568..c3930ea63e59 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -247,6 +247,7 @@ config PPC
select HAVE_SOFTIRQ_ON_OWN_STACK
select HAVE_STACKPROTECTOR if PPC32 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r2)
select HAVE_STACKPROTECTOR if PPC64 && $(cc-option,-mstack-protector-guard=tls -mstack-protector-guard-reg=r13)
+ select HAVE_STATIC_CALL
select HAVE_SYSCALL_TRACEPOINTS
select HAVE_VIRT_CPU_ACCOUNTING
select HUGETLB_PAGE_SIZE_VARIABLE if PPC_BOOK3S_64 && HUGETLB_PAGE
diff --git a/arch/powerpc/include/asm/machdep.h b/arch/powerpc/include/asm/machdep.h
index 764f2732a821..ac9712312b76 100644
--- a/arch/powerpc/include/asm/machdep.h
+++ b/arch/powerpc/include/asm/machdep.h
@@ -7,6 +7,7 @@
#include <linux/init.h>
#include <linux/dma-mapping.h>
#include <linux/export.h>
+#include <linux/static_call_types.h>
#include <asm/setup.h>
@@ -295,5 +296,7 @@ static inline void log_error(char *buf, unsigned int err_type, int fatal)
#define machine_late_initcall(mach, fn) __define_machine_initcall(mach, fn, 7)
#define machine_late_initcall_sync(mach, fn) __define_machine_initcall(mach, fn, 7s)
+DECLARE_STATIC_CALL(ppc_md_get_irq, *ppc_md.get_irq);
+
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_MACHDEP_H */
diff --git a/arch/powerpc/include/asm/static_call.h b/arch/powerpc/include/asm/static_call.h
new file mode 100644
index 000000000000..335ee4ceaef9
--- /dev/null
+++ b/arch/powerpc/include/asm/static_call.h
@@ -0,0 +1,25 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _ASM_POWERPC_STATIC_CALL_H
+#define _ASM_POWERPC_STATIC_CALL_H
+
+#define ARCH_DEFINE_STATIC_CALL_TRAMP(name, func) \
+ asm(".pushsection .text, \"ax\" \n" \
+ ".align 4 \n" \
+ ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
+ STATIC_CALL_TRAMP_STR(name) ": \n" \
+ " b " #func " \n" \
+ ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n" \
+ ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
+ ".popsection \n")
+
+#define ARCH_DEFINE_STATIC_CALL_NULL_TRAMP(name) \
+ asm(".pushsection .text, \"ax\" \n" \
+ ".align 4 \n" \
+ ".globl " STATIC_CALL_TRAMP_STR(name) " \n" \
+ STATIC_CALL_TRAMP_STR(name) ": \n" \
+ " blr \n" \
+ ".type " STATIC_CALL_TRAMP_STR(name) ", @function \n" \
+ ".size " STATIC_CALL_TRAMP_STR(name) ", . - " STATIC_CALL_TRAMP_STR(name) " \n" \
+ ".popsection \n")
+
+#endif /* _ASM_POWERPC_STATIC_CALL_H */
diff --git a/arch/powerpc/kernel/Makefile b/arch/powerpc/kernel/Makefile
index 7be36c1e1db6..08877252dff8 100644
--- a/arch/powerpc/kernel/Makefile
+++ b/arch/powerpc/kernel/Makefile
@@ -47,7 +47,7 @@ obj-y := cputable.o syscalls.o \
udbg.o misc.o io.o misc_$(BITS).o \
of_platform.o prom_parse.o firmware.o \
hw_breakpoint_constraints.o interrupt.o \
- kdebugfs.o
+ kdebugfs.o static_call.o
obj-y += ptrace/
obj-$(CONFIG_PPC64) += setup_64.o \
paca.o nvram_64.o note.o
diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
index 551b653228c4..872f46e20754 100644
--- a/arch/powerpc/kernel/irq.c
+++ b/arch/powerpc/kernel/irq.c
@@ -736,7 +736,7 @@ void __do_irq(struct pt_regs *regs)
*
* This will typically lower the interrupt line to the CPU
*/
- irq = ppc_md.get_irq();
+ irq = static_call(ppc_md_get_irq)();
/* We can hard enable interrupts now to allow perf interrupts */
may_hard_irq_enable();
diff --git a/arch/powerpc/kernel/setup-common.c b/arch/powerpc/kernel/setup-common.c
index b1e43b69a559..57d06c163b7b 100644
--- a/arch/powerpc/kernel/setup-common.c
+++ b/arch/powerpc/kernel/setup-common.c
@@ -33,6 +33,7 @@
#include <linux/of_platform.h>
#include <linux/hugetlb.h>
#include <linux/pgtable.h>
+#include <linux/static_call.h>
#include <asm/io.h>
#include <asm/paca.h>
#include <asm/prom.h>
@@ -81,6 +82,8 @@ EXPORT_SYMBOL(ppc_md);
struct machdep_calls *machine_id;
EXPORT_SYMBOL(machine_id);
+DEFINE_STATIC_CALL_NULL(ppc_md_get_irq, *ppc_md.get_irq);
+
int boot_cpuid = -1;
EXPORT_SYMBOL_GPL(boot_cpuid);
@@ -613,6 +616,7 @@ void probe_machine(void)
machine_id++) {
DBG(" %s ...", machine_id->name);
memcpy(&ppc_md, machine_id, sizeof(struct machdep_calls));
+ static_call_update(ppc_md_get_irq, ppc_md.get_irq);
if (ppc_md.probe()) {
DBG(" match !\n");
break;
diff --git a/arch/powerpc/kernel/static_call.c b/arch/powerpc/kernel/static_call.c
new file mode 100644
index 000000000000..a281f1759c39
--- /dev/null
+++ b/arch/powerpc/kernel/static_call.c
@@ -0,0 +1,16 @@
+// SPDX-License-Identifier: GPL-2.0
+#include <linux/memory.h>
+#include <linux/static_call.h>
+
+#include <asm/code-patching.h>
+
+void arch_static_call_transform(void *site, void *tramp, void *func, bool tail)
+{
+ mutex_lock(&text_mutex);
+
+ if (tramp)
+ patch_branch(tramp, (unsigned long)func, 0);
+
+ mutex_unlock(&text_mutex);
+}
+EXPORT_SYMBOL_GPL(arch_static_call_transform);
--
2.25.0
^ permalink raw reply related
* Re: [PATCH v4 6/6] sched/fair: Consider SMT in ASYM_PACKING load balance
From: Vincent Guittot @ 2021-08-27 10:13 UTC (permalink / raw)
To: Ricardo Neri
Cc: Juri Lelli, Aubrey Li, Srikar Dronamraju, Ravi V. Shankar,
Peter Zijlstra (Intel), Ricardo Neri, Ben Segall,
Srinivas Pandruvada, Joel Fernandes (Google), Ingo Molnar,
Rafael J. Wysocki, Steven Rostedt, Mel Gorman, Len Brown,
Nicholas Piggin, Aubrey Li, Dietmar Eggemann, Tim Chen,
Quentin Perret, Daniel Bristot de Oliveira, linux-kernel,
linuxppc-dev
In-Reply-To: <20210810144145.18776-7-ricardo.neri-calderon@linux.intel.com>
On Tue, 10 Aug 2021 at 16:41, Ricardo Neri
<ricardo.neri-calderon@linux.intel.com> wrote:
>
> When deciding to pull tasks in ASYM_PACKING, it is necessary not only to
> check for the idle state of the destination CPU, dst_cpu, but also of
> its SMT siblings.
>
> If dst_cpu is idle but its SMT siblings are busy, performance suffers
> if it pulls tasks from a medium priority CPU that does not have SMT
> siblings.
>
> Implement asym_smt_can_pull_tasks() to inspect the state of the SMT
> siblings of both dst_cpu and the CPUs in the candidate busiest group.
>
> Cc: Aubrey Li <aubrey.li@intel.com>
> Cc: Ben Segall <bsegall@google.com>
> Cc: Daniel Bristot de Oliveira <bristot@redhat.com>
> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Cc: Mel Gorman <mgorman@suse.de>
> Cc: Quentin Perret <qperret@google.com>
> Cc: Rafael J. Wysocki <rafael.j.wysocki@intel.com>
> Cc: Srinivas Pandruvada <srinivas.pandruvada@linux.intel.com>
> Cc: Steven Rostedt <rostedt@goodmis.org>
> Cc: Tim Chen <tim.c.chen@linux.intel.com>
> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>
> Reviewed-by: Len Brown <len.brown@intel.com>
> Signed-off-by: Ricardo Neri <ricardo.neri-calderon@linux.intel.com>
> ---
> Changes since v3:
> * Removed the arch_asym_check_smt_siblings() hook. Discussions with the
> powerpc folks showed that this patch should not impact them. Also, more
> recent powerpc processor no longer use asym_packing. (PeterZ)
> * Removed unnecessary local variable in asym_can_pull_tasks(). (Dietmar)
> * Removed unnecessary check for local CPUs when the local group has zero
> utilization. (Joel)
> * Renamed asym_can_pull_tasks() as asym_smt_can_pull_tasks() to reflect
> the fact that it deals with SMT cases.
> * Made asym_smt_can_pull_tasks() return false for !CONFIG_SCHED_SMT so
> that callers can deal with non-SMT cases.
>
> Changes since v2:
> * Reworded the commit message to reflect updates in code.
> * Corrected misrepresentation of dst_cpu as the CPU doing the load
> balancing. (PeterZ)
> * Removed call to arch_asym_check_smt_siblings() as it is now called in
> sched_asym().
>
> Changes since v1:
> * Don't bailout in update_sd_pick_busiest() if dst_cpu cannot pull
> tasks. Instead, reclassify the candidate busiest group, as it
> may still be selected. (PeterZ)
> * Avoid an expensive and unnecessary call to cpumask_weight() when
> determining if a sched_group is comprised of SMT siblings.
> (PeterZ).
> ---
> kernel/sched/fair.c | 95 +++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 95 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index dd411cefb63f..8a1a2a43732c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8531,10 +8531,99 @@ group_type group_classify(unsigned int imbalance_pct,
> return group_has_spare;
> }
>
> +/**
> + * asym_smt_can_pull_tasks - Check whether the load balancing CPU can pull tasks
> + * @dst_cpu: Destination CPU of the load balancing
> + * @sds: Load-balancing data with statistics of the local group
> + * @sgs: Load-balancing statistics of the candidate busiest group
> + * @sg: The candidate busiet group
> + *
> + * Check the state of the SMT siblings of both @sds::local and @sg and decide
> + * if @dst_cpu can pull tasks. If @dst_cpu does not have SMT siblings, it can
> + * pull tasks if two or more of the SMT siblings of @sg are busy. If only one
> + * CPU in @sg is busy, pull tasks only if @dst_cpu has higher priority.
> + *
> + * If both @dst_cpu and @sg have SMT siblings, even the number of idle CPUs
> + * between @sds::local and @sg. Thus, pull tasks from @sg if the difference
> + * between the number of busy CPUs is 2 or more. If the difference is of 1,
> + * only pull if @dst_cpu has higher priority. If @sg does not have SMT siblings
> + * only pull tasks if all of the SMT siblings of @dst_cpu are idle and @sg
> + * has lower priority.
> + */
> +static bool asym_smt_can_pull_tasks(int dst_cpu, struct sd_lb_stats *sds,
> + struct sg_lb_stats *sgs,
> + struct sched_group *sg)
> +{
> +#ifdef CONFIG_SCHED_SMT
> + bool local_is_smt, sg_is_smt;
> + int sg_busy_cpus;
> +
> + local_is_smt = sds->local->flags & SD_SHARE_CPUCAPACITY;
> + sg_is_smt = sg->flags & SD_SHARE_CPUCAPACITY;
> +
> + sg_busy_cpus = sgs->group_weight - sgs->idle_cpus;
> +
> + if (!local_is_smt) {
> + /*
> + * If we are here, @dst_cpu is idle and does not have SMT
> + * siblings. Pull tasks if candidate group has two or more
> + * busy CPUs.
> + */
> + if (sg_is_smt && sg_busy_cpus >= 2)
> + return true;
> +
> + /*
> + * @dst_cpu does not have SMT siblings. @sg may have SMT
> + * siblings and only one is busy. In such case, @dst_cpu
> + * can help if it has higher priority and is idle.
> + */
> + return !sds->local_stat.group_util &&
sds->local_stat.group_util can't be used to decide if a CPU or group
of CPUs is idle. util_avg is usually not null when a CPU becomes idle
and you can have to wait more than 300ms before it becomes Null
At the opposite, the utilization of a CPU can be null but a task with
null utilization has just woken up on it.
Utilization is used to reflect the average work of the CPU or group of
CPUs but not the current state
> + sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> + }
> +
> + /* @dst_cpu has SMT siblings. */
> +
> + if (sg_is_smt) {
> + int local_busy_cpus = sds->local->group_weight -
> + sds->local_stat.idle_cpus;
> + int busy_cpus_delta = sg_busy_cpus - local_busy_cpus;
> +
> + /* Local can always help to even the number busy CPUs. */
default behavior of the load balance already tries to even the number
of idle CPUs.
> + if (busy_cpus_delta >= 2)
> + return true;
> +
> + if (busy_cpus_delta == 1)
> + return sched_asym_prefer(dst_cpu,
> + sg->asym_prefer_cpu);
> +
> + return false;
> + }
> +
> + /*
> + * @sg does not have SMT siblings. Ensure that @sds::local does not end
> + * up with more than one busy SMT sibling and only pull tasks if there
> + * are not busy CPUs. As CPUs move in and out of idle state frequently,
> + * also check the group utilization to smoother the decision.
> + */
> + if (!sds->local_stat.group_util)
same comment as above about the meaning of group_util == 0
> + return sched_asym_prefer(dst_cpu, sg->asym_prefer_cpu);
> +
> + return false;
> +#else
> + /* Always return false so that callers deal with non-SMT cases. */
> + return false;
> +#endif
> +}
> +
> static inline bool
> sched_asym(struct lb_env *env, struct sd_lb_stats *sds, struct sg_lb_stats *sgs,
> struct sched_group *group)
> {
> + /* Only do SMT checks if either local or candidate have SMT siblings */
> + if ((sds->local->flags & SD_SHARE_CPUCAPACITY) ||
> + (group->flags & SD_SHARE_CPUCAPACITY))
> + return asym_smt_can_pull_tasks(env->dst_cpu, sds, sgs, group);
> +
> return sched_asym_prefer(env->dst_cpu, group->asym_prefer_cpu);
> }
>
> @@ -9540,6 +9629,12 @@ static struct rq *find_busiest_queue(struct lb_env *env,
> nr_running == 1)
> continue;
>
> + /* Make sure we only pull tasks from a CPU of lower priority */
> + if ((env->sd->flags & SD_ASYM_PACKING) &&
> + sched_asym_prefer(i, env->dst_cpu) &&
> + nr_running == 1)
> + continue;
This really looks similar to the test above for SD_ASYM_CPUCAPACITY.
More generally speaking SD_ASYM_PACKING and SD_ASYM_CPUCAPACITY share
a lot of common policy and I wonder if at some point we could not
merge their behavior in LB
> +
> switch (env->migration_type) {
> case migrate_load:
> /*
> --
> 2.17.1
>
^ permalink raw reply
* Re: [PATCH] powerpc/doc: Fix htmldocs errors
From: Michael Ellerman @ 2021-08-27 13:16 UTC (permalink / raw)
To: linuxppc-dev, mpe, Aneesh Kumar K.V; +Cc: Stephen Rothwell, Jonathan Corbet
In-Reply-To: <20210825042447.106219-1-aneesh.kumar@linux.ibm.com>
On Wed, 25 Aug 2021 09:54:47 +0530, Aneesh Kumar K.V wrote:
> Fix make htmldocs related errors with the newly added associativity.rst
> doc file.
>
>
>
>
Applied to powerpc/next.
[1/1] powerpc/doc: Fix htmldocs errors
https://git.kernel.org/powerpc/c/f50da6edbf1ebf35dd8070847bfab5cb988d472b
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/32: Remove unneccessary calculations in load_up_{fpu/altivec}
From: Michael Ellerman @ 2021-08-27 13:15 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <c04cce578b97a76a9e69a096698b1d89f721768a.1629276437.git.christophe.leroy@csgroup.eu>
On Wed, 18 Aug 2021 08:47:28 +0000 (UTC), Christophe Leroy wrote:
> No need to re-read SPRN_THREAD, we can calculate thread address
> from current (r2).
>
> And remove a reload of value 1 into r4 as r4 is already 1.
>
>
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/32: Remove unneccessary calculations in load_up_{fpu/altivec}
https://git.kernel.org/powerpc/c/51ed00e71f0130e0f3534b8e7d78facd16829426
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments()
From: Michael Ellerman @ 2021-08-27 13:15 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <ba49cdd574558a0363300c3f6b5b062b397cb071.1629451483.git.christophe.leroy@csgroup.eu>
On Fri, 20 Aug 2021 09:28:19 +0000 (UTC), Christophe Leroy wrote:
> Use is_32bit_task() which already handles CONFIG_COMPAT.
>
>
>
>
Applied to powerpc/next.
[1/1] powerpc/audit: Avoid unneccessary #ifdef in syscall_get_arguments()
https://git.kernel.org/powerpc/c/898a1ef06ad4a2a8e3c9490ce62624fcd1a7b1f8
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/audit: Simplify syscall_get_arch()
From: Michael Ellerman @ 2021-08-27 13:15 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <4be53b9187a4d8c163968f4d224267e41a7fcc33.1629451479.git.christophe.leroy@csgroup.eu>
On Fri, 20 Aug 2021 09:39:14 +0000 (UTC), Christophe Leroy wrote:
> Make use of is_32bit_task() and CONFIG_CPU_LITTLE_ENDIAN
> to simplify syscall_get_arch().
>
>
>
>
Applied to powerpc/next.
[1/1] powerpc/audit: Simplify syscall_get_arch()
https://git.kernel.org/powerpc/c/770cec16cdc9d15555e67896dd2214a4fec9a3fa
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/32: indirect function call use bctrl rather than blrl in ret_from_kernel_thread
From: Michael Ellerman @ 2021-08-27 13:15 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <91b1d242525307ceceec7ef6e832bfbacdd4501b.1629436472.git.christophe.leroy@csgroup.eu>
On Fri, 20 Aug 2021 05:16:05 +0000 (UTC), Christophe Leroy wrote:
> Copied from commit 89bbe4c798bc ("powerpc/64: indirect function call
> use bctrl rather than blrl in ret_from_kernel_thread")
>
> blrl is not recommended to use as an indirect function call, as it may
> corrupt the link stack predictor.
>
> This is not a performance critical path but this should be fixed for
> consistency.
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/32: indirect function call use bctrl rather than blrl in ret_from_kernel_thread
https://git.kernel.org/powerpc/c/113ec9ccc8049c3772f0eab46b62c5d6654c09f7
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/ptrace: Make user_mode() common to PPC32 and PPC64
From: Michael Ellerman @ 2021-08-27 13:16 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <000a28c51808bbd802b505af42d2cb316c2be7d3.1629216000.git.christophe.leroy@csgroup.eu>
On Tue, 17 Aug 2021 16:00:14 +0000 (UTC), Christophe Leroy wrote:
> Today we have:
>
> #ifdef __powerpc64__
> #define user_mode(regs) ((((regs)->msr) >> MSR_PR_LG) & 0x1)
> #else
> #define user_mode(regs) (((regs)->msr & MSR_PR) != 0)
> #endif
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/ptrace: Make user_mode() common to PPC32 and PPC64
https://git.kernel.org/powerpc/c/19e932eb6ea47f4f37513eb2ae0daee19117765c
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/syscalls: Remove __NR__exit
From: Michael Ellerman @ 2021-08-27 13:15 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <6457eb4f327313323ed1f70e540bbb4ddc9178fa.1629701106.git.christophe.leroy@csgroup.eu>
On Mon, 23 Aug 2021 06:45:20 +0000 (UTC), Christophe Leroy wrote:
> __NR_exit is nowhere used. On most architectures it was removed by
> commit 135ab6ec8fda ("[PATCH] remove remaining errno and
> __KERNEL_SYSCALLS__ references") but not on powerpc.
>
> powerpc removed __KERNEL_SYSCALLS__ in commit 3db03b4afb3e ("[PATCH]
> rename the provided execve functions to kernel_execve"), but __NR_exit
> was left over.
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/syscalls: Remove __NR__exit
https://git.kernel.org/powerpc/c/a00ea5b6f2bbef8b004b0b7228c61680a50c7c3f
cheers
^ permalink raw reply
* Re: [PATCH] powerpc/syscalls: Simplify do_mmap2()
From: Michael Ellerman @ 2021-08-27 13:16 UTC (permalink / raw)
To: Michael Ellerman, Christophe Leroy, Paul Mackerras,
Benjamin Herrenschmidt
Cc: linuxppc-dev, linux-kernel
In-Reply-To: <373ec500f386374bc5735007df3d3869eac47be1.1624618701.git.christophe.leroy@csgroup.eu>
On Fri, 25 Jun 2021 10:58:33 +0000 (UTC), Christophe Leroy wrote:
> When shift is nul, operations remain valid so no test needed.
>
> And 'ret' is unnecessary.
>
> And use IS_ALIGNED() to check alignment, that's more clear.
>
>
> [...]
Applied to powerpc/next.
[1/1] powerpc/syscalls: Simplify do_mmap2()
https://git.kernel.org/powerpc/c/316389e904f968d24d44cd96a6d171ee1ef269cf
cheers
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox