LinuxPPC-Dev Archive on lore.kernel.org


* Re: [PATCH 4/5] dma-mapping: add a dma_ops_bypass flag to struct device
From: Christoph Hellwig @ 2020-07-14  7:07 UTC (permalink / raw)
  To: Alexey Kardashevskiy
  Cc: Björn Töpel, Daniel Borkmann, Greg Kroah-Hartman,
	Joerg Roedel, Robin Murphy, linux-kernel, iommu,
	Jesper Dangaard Brouer, linuxppc-dev, Christoph Hellwig, Lu Baolu
In-Reply-To: <9bff7460-e6fa-f765-dcb4-cc96eb86d92c@ozlabs.ru>

On Mon, Jul 13, 2020 at 02:59:39PM +1000, Alexey Kardashevskiy wrote:
> 
> 
> On 09/07/2020 01:24, Christoph Hellwig wrote:
> > Several IOMMU drivers have a bypass mode where they can use a direct
> > mapping if the devices DMA mask is large enough.  Add generic support
> > to the core dma-mapping code to do that to switch those drivers to
> > a common solution.
> > 
> > Signed-off-by: Christoph Hellwig <hch@lst.de>
> > ---
> >  include/linux/device.h |  8 +++++
> >  kernel/dma/Kconfig     |  8 +++++
> >  kernel/dma/mapping.c   | 74 +++++++++++++++++++++++++++++-------------
> >  3 files changed, 68 insertions(+), 22 deletions(-)
> > 
> > diff --git a/include/linux/device.h b/include/linux/device.h
> > index 4c4af98321ebd6..1f71acf37f78d7 100644
> > --- a/include/linux/device.h
> > +++ b/include/linux/device.h
> > @@ -523,6 +523,11 @@ struct dev_links_info {
> >   *		  sync_state() callback.
> >   * @dma_coherent: this particular device is dma coherent, even if the
> >   *		architecture supports non-coherent devices.
> > + * @dma_ops_bypass: If set to %true then the dma_ops are bypassed for the
> > + *		streaming DMA operations (->map_* / ->unmap_* / ->sync_*),
> > + *		and optionall (if the coherent mask is large enough) also
> 
> 
> s/optionall/optional/g
> 
> Otherwise the series looks good and works well on powernv and pseries.
> Thanks,

Can you give a formal ACK?


* Re: /sys/kernel/debug/kmemleak empty despite kmemleak reports
From: Paul Menzel @ 2020-07-14  6:59 UTC (permalink / raw)
  To: Catalin Marinas; +Cc: linuxppc-dev
In-Reply-To: <20200713182735.GH15829@gaia>

Dear Catalin,


Am 13.07.20 um 20:27 schrieb Catalin Marinas:
> On Thu, Jul 09, 2020 at 11:08:52PM +0200, Paul Menzel wrote:
>> Am 09.07.20 um 19:57 schrieb Catalin Marinas:
>>> On Thu, Jul 09, 2020 at 04:37:10PM +0200, Paul Menzel wrote:
>>>> Despite Linux 5.8-rc4 reporting memory leaks on the IBM POWER 8 S822LC, the
>>>> file does not contain more information.
>>>>
>>>>> $ dmesg
>>>>> […]
>>>>> [48662.953323] perf: interrupt took too long (2570 > 2500), lowering kernel.perf_event_max_sample_rate to 77750
>>>>> [48854.810636] perf: interrupt took too long (3216 > 3212), lowering kernel.perf_event_max_sample_rate to 62000
>>>>> [52300.044518] perf: interrupt took too long (4244 > 4020), lowering kernel.perf_event_max_sample_rate to 47000
>>>>> [52751.373083] perf: interrupt took too long (5373 > 5305), lowering kernel.perf_event_max_sample_rate to 37000
>>>>> [53354.000363] perf: interrupt took too long (6793 > 6716), lowering kernel.perf_event_max_sample_rate to 29250
>>>>> [53850.215606] perf: interrupt took too long (8672 > 8491), lowering kernel.perf_event_max_sample_rate to 23000
>>>>> [57542.266099] perf: interrupt took too long (10940 > 10840), lowering kernel.perf_event_max_sample_rate to 18250
>>>>> [57559.645404] perf: interrupt took too long (13714 > 13675), lowering kernel.perf_event_max_sample_rate to 14500
>>>>> [61608.697728] Can't find PMC that caused IRQ
>>>>> [71774.463111] kmemleak: 12 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>>>>> [92372.044785] process '@/usr/bin/gnatmake-5' started with executable stack
>>>>> [92849.380672] FS-Cache: Loaded
>>>>> [92849.417269] FS-Cache: Netfs 'nfs' registered for caching
>>>>> [92849.595974] NFS: Registering the id_resolver key type
>>>>> [92849.596000] Key type id_resolver registered
>>>>> [92849.596000] Key type id_legacy registered
>>>>> [101808.079143] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>>>>> [106904.323471] Can't find PMC that caused IRQ
>>>>> [129416.391456] kmemleak: 1 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>>>>> [158171.604221] kmemleak: 34 new suspected memory leaks (see /sys/kernel/debug/kmemleak)
>>>>> $ sudo cat /sys/kernel/debug/kmemleak
>>>
>>> When they are no longer present, they are most likely false positives.
>>
>> How can this be? Shouldn’t the false positive also be logged in
>> `/sys/kernel/debug/kmemleak`?
> 
> Sorry, I wasn't clear. It can be a transient false positive. At a
> subsequent scan, kmemleak found pointers referring to the previously
> reported objects and no longer shows them.

Interesting. Is it possible to print a message in that case to avoid 
confusion?

>>> Was this triggered during boot? Or under some workload?
>>
>>  From the timestamps it looks like under some load.
> 
> Was it during boot? I put a delay of 60s to avoid this but, depending on
> the platform, it can still trigger.

No, it happened after several hours of runtime.


Kind regards,

Paul


* Re: [PATCH v3 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
From: Leonardo Bras @ 2020-07-14  6:46 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Thiago Jung Bauermann, Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <eb357d42f5605a2b0234c04de477e171134c24f5.camel@gmail.com>

In fact, the changes over the last patch are more complex than the
current patch.
Just for reference, this is what the enable_ddw() patch currently looks like:

@@ -1087,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
        struct device_node *dn;
        u32 ddw_avail[DDW_APPLICABLE_SIZE];
        struct direct_window *window;
-       struct property *win64;
+       struct property *win64, *default_win = NULL;
        struct dynamic_dma_window_prop *ddwprop;
        struct failed_ddw_pdn *fpdn;
 
@@ -1133,14 +1165,38 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
        if (ret != 0)
                goto out_failed;
 
+       /*
+        * If there is no window available, remove the default DMA window,
+        * if it's present. This will make all the resources available to the
+        * new DDW window.
+        * If anything fails after this, we need to restore it, so also check
+        * for extensions presence.
+        */
        if (query.windows_available == 0) {
-               /*
-                * no additional windows are available for this device.
-                * We might be able to reallocate the existing window,
-                * trading in for a larger page size.
-                */
-               dev_dbg(&dev->dev, "no free dynamic windows");
-               goto out_failed;
+               int reset_win_ext;
+
+               default_win = of_find_property(pdn, "ibm,dma-window", NULL);
+               if (!default_win)
+                       goto out_failed;
+
+               reset_win_ext = ddw_read_ext(pdn, DDW_EXT_RESET_DMA_WIN, NULL);
+               if (reset_win_ext) {
+                       default_win = NULL;
+                       goto out_failed;
+               }
+
+               remove_dma_window(pdn, ddw_avail, default_win);
+
+               /* Query again, to check if the window is available */
+               ret = query_ddw(dev, ddw_avail, &query, pdn);
+               if (ret != 0)
+                       goto out_failed;
+
+               if (query.windows_available == 0) {
+                       /* no windows are available for this device. */
+                       dev_dbg(&dev->dev, "no free dynamic windows");
+                       goto out_failed;
+               }
        }
        if (query.page_size & 4) {
                page_shift = 24; /* 16MB */
@@ -1231,6 +1287,8 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
        kfree(win64);
 
 out_failed:
+       if (default_win)
+               reset_dma_window(dev, pdn);
 
        fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
        if (!fpdn)



* Re: [PATCH 1/3] module: Rename module_alloc() to text_alloc() and move to kernel proper
From: Ard Biesheuvel @ 2020-07-14  6:35 UTC (permalink / raw)
  To: Steven Rostedt
  Cc: Catalin Marinas, Kefeng Wang, Paul Mackerras, Zong Li, Andi Kleen,
	Paul Burton, Vincent Whitchurch, Petr Mladek, Brian Gerst,
	Andy Lutomirski, Yonghong Song, Thomas Gleixner, Jiri Kosina,
	Anup Patel, Linux Kernel Mailing List, Philipp Rudo, Torsten Duwe,
	Masami Hiramatsu, Andrew Morton, Mark Rutland,
	James E.J. Bottomley, Vincent Chen, open list:S390, Joe Lawrence,
	Helge Deller, John Fastabend, Anil S Keshavamurthy,
	Andrey Ryabinin, Iurii Zaikin, Andrii Nakryiko, Vasily Gorbik,
	moderated list:ARM PORT, Daniel Axtens, Damien Le Moal,
	Peter Oberparleiter, Sean Christopherson, Martin KaFai Lau,
	Song Liu, Paul Walmsley, Heiko Carstens, Alexei Starovoitov,
	Jarkko Sakkinen, Atish Patra, Will Deacon, Daniel Borkmann,
	Masahiro Yamada, Nayna Jain, Ley Foon Tan, Christian Borntraeger,
	Dmitry Vyukov, Sami Tolvanen, Naveen N. Rao, Mao Han, Marco Elver,
	Babu Moger, Borislav Petkov, Greentime Hu, Ben Dooks, Guan Xuetao,
	Thomas Bogendoerfer, open list:PARISC ARCHITECTURE, Jessica Yu,
	open list:BPF JIT for MIPS (32-BIT AND 64-BIT), David S. Miller,
	Thiago Jung Bauermann, Peter Zijlstra,
	open list:SPARC + UltraSPARC (sparc/sparc64), Sandipan Das,
	H. Peter Anvin, Amit Daniel Kachhap, Tiezhu Yang, Miroslav Benes,
	Jiri Olsa, open list:RISC-V ARCHITECTURE, Vincenzo Frascino,
	Anders Roxell, Sven Schnelle,
	maintainer:X86 ARCHITECTURE (32-BIT AND 64-BIT), Russell King,
	Mike Rapoport, Ingo Molnar, Albert Ou, Paul E. McKenney,
	Josh Poimboeuf, KP Singh, Gerald Schaefer, Nick Hu,
	open list:BPF JIT for MIPS (32-BIT AND 64-BIT), open list:MIPS,
	Sergey Senozhatsky, Palmer Dabbelt,
	open list:LINUX FOR POWERPC (32-BIT AND 64-BIT)
In-Reply-To: <20200713220436.2f21d366@oasis.local.home>

On Tue, 14 Jul 2020 at 05:04, Steven Rostedt <rostedt@goodmis.org> wrote:
>
> On Mon, 13 Jul 2020 22:49:48 +0300
> Ard Biesheuvel <ardb@kernel.org> wrote:
>
> > On arm64, we no longer use module_alloc for bpf or kprobes, to avoid
> > wasting va space on code that does not need to be loaded close to the
> > kernel. Also, module_alloc() allocates kasan shadow, which is
> > unnecessary for kprobes or bpf programs, which don't have kasan
> > instrumentation.
> >
> > This patch suggests that there are other reasons why conflating
> > allocation of module space and allocating  text pages for other uses
> > is a bad idea, but switching all users to text_alloc() is a step in
> > the wrong direction. It would be better to stop using module_alloc()
> > in core code except in the module loader, and have a generic
> > text_alloc() that can be overridden by the arch if necessary. Note
> > that x86  and s390 are the only architectures that use module_alloc()
> > in ftrace code.
> >
> > Please have a look at alloc_insn_page() or bpf_jit_alloc_exec() in the
> > arm64 tree to see what I mean.
>
> Hmm, so you have another method for allocating memory for trampolines?
> (I haven't looked at those functions you pointed out, out of sheer
> laziness ;-)
>
> It would be nice to implement the trampoline optimization in arm, which
> x86 has (see arch_ftrace_update_trampoline() and
> arch_ftrace_trampoline_func()).
>
> It helps when you have two different callbacks for different functions
> (like having live patching enabled and function tracing enabled, or
> kprobes using ftrace). Each callback will get its own allocated
> trampoline to jump to instead of jumping to the a trampoline that calls
> a looping function that tests to see which callback wants to be called
> by the traced function.
>

So in what sense are ftrace trampolines like kernel modules, apart
from the fact that they are executable pages that live in the vmalloc
space?


* Re: [PATCH v2] powerpc/pseries: detect secure and trusted boot state of the system.
From: Daniel Axtens @ 2020-07-14  6:38 UTC (permalink / raw)
  To: Nayna Jain, linuxppc-dev; +Cc: Nayna Jain, linux-kernel, Mimi Zohar
In-Reply-To: <1594434329-31219-1-git-send-email-nayna@linux.ibm.com>

Hi Nayna,

Thanks! Would you be able to fold in some of the information from my
reply to v1 into the changelog? Until we have a public PAPR release with
it, that information is the extent of the public documentation. It would
be good to get it into the git log rather than just floating around in
the mail archives!

A couple of small nits:

> +	if (enabled)
> +		goto out;
> +
> +	if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
> +		if (secureboot)
> +			enabled = (secureboot > 1) ? true : false;

Your tests double up here - you don't need both the 'if' statement and
the 'secureboot > 1' ternary operator.

Just

+	if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
+		enabled = (secureboot > 1) ? true : false;

or even

+	if (!of_property_read_u32(of_root, "ibm,secure-boot", &secureboot)) {
+		enabled = (secureboot > 1);

would work.

> +	if (!of_property_read_u32(of_root, "ibm,trusted-boot", &trustedboot)) {
> +		if (trustedboot)
> +			enabled = (trustedboot > 0) ? true : false;

Likewise for trusted boot.
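
To make the point concrete, here is a minimal standalone sketch (not the
patch itself) comparing the nested form with the bare comparison. The
function names are made up for illustration, and the property read is
replaced by a plain argument:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical standalone copy of the original shape: outer 'if' plus ternary. */
bool secure_boot_nested(uint32_t secureboot)
{
	bool enabled = false;

	if (secureboot)
		enabled = (secureboot > 1) ? true : false;
	return enabled;
}

/* Suggested shape: the comparison alone already evaluates to a bool. */
bool secure_boot_simple(uint32_t secureboot)
{
	return secureboot > 1;
}

/* Same idea for trusted boot, where any nonzero value counts as enabled. */
bool trusted_boot_simple(uint32_t trustedboot)
{
	return trustedboot > 0;
}
```

Both secure boot helpers agree for every input, since a value of 0 fails
the 'secureboot > 1' test on its own.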

Regards,
Daniel

P.S. please could you add me to the cc: list for future revisions?

> +	}
> +
> +out:
>  	pr_info("Trusted boot mode %s\n", enabled ? "enabled" : "disabled");
>  
>  	return enabled;
> -- 
> 2.26.2


* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
From: Nicholas Piggin @ 2020-07-14  6:31 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, X86 ML, LKML, Linux-MM,
	Mathieu Desnoyers, Andy Lutomirski, linuxppc-dev
In-Reply-To: <1594701900.gcgdq8p13l.astroid@bobo.none>

Excerpts from Nicholas Piggin's message of July 14, 2020 3:04 pm:
> Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
>> 
>>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>>> 
>>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>>>> 
>>>>> On big systems, the mm refcount can become highly contented when doing
>>>>> a lot of context switching with threaded applications (particularly
>>>>> switching between the idle thread and an application thread).
>>>>> 
>>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>>> user->idle->user cases, so so instead implement a non-refcounted scheme
>>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>>> any remaining lazy ones.
>>>>> 
>>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>>> with as many software threads as CPUs (so each switch will go in and
>>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>>> switches per second. After this patch it goes up to 118 million.
>>>>> 
>>>> 
>>>> I read the patch a couple of times, and I have a suggestion that could
>>>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>>>> refcount.  You're saying "hey, this mm has no more references, but it
>>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>>> those references too."  I'm wondering whether you actually need the
>>>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>>>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>>>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>>> you just removed the last bit from mm_cpumask and potentially free the
>>>> mm.
>>>> 
>>>> Getting the locking right here could be a bit tricky -- you need to
>>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>>> should free the mm, and you also need to avoid an mm with mm_users
>>>> hitting zero concurrently with the last remote CPU using it lazily
>>>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>>>> == 1 mean "mm_cpumask() is might contain bits and, if so, it owns the
>>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>>> 
>>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>>> only ever take the lock after mm_users goes to zero.
>>> 
>>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>>> 
>>> I haven't seen much problem here that made me too concerned about IPIs 
>>> yet, so I think the simple patch may be good enough to start with
>>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>>> unlazying with the exit TLB flush without doing anything fancy with
>>> ref counting, but we'll see.
>> 
>> I would be cautious with benchmarking here. I would expect that the
>> nasty cases may affect power consumption more than performance — the 
>> specific issue is IPIs hitting idle cores, and the main effects are to 
>> slow down exit() a bit but also to kick the idle core out of idle. 
>> Although, if the idle core is in a deep sleep, that IPI could be 
>> *very* slow.
> 
> It will tend to be self-limiting to some degree (deeper idle cores
> would tend to have less chance of IPI) but we have bigger issues on
> powerpc with that, like broadcast IPIs to the mm cpumask for THP
> management. Power hasn't really shown up as an issue but powerpc
> CPUs may have their own requirements and issues there, shall we say.
> 
>> So I think it’s worth at least giving this a try.
> 
> To be clear it's not a complete solution itself. The problem is of 
> course that mm cpumask gives you false negatives, so the bits
> won't always clean up after themselves as CPUs switch away from their
> lazy tlb mms.

^^

False positives: CPU is in the mm_cpumask, but is not using the mm
as a lazy tlb. So there can be bits left and never freed.

If you closed the false positives, you're back to a shared mm cache
line on lazy mm context switches.

Thanks,
Nick


* Re: [PATCH v3 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
From: Leonardo Bras @ 2020-07-14  6:30 UTC (permalink / raw)
  To: Alexey Kardashevskiy, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Thiago Jung Bauermann, Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <18fd94d2-4365-16d1-7c85-af07d5c9a0f3@ozlabs.ru>

On Tue, 2020-07-14 at 14:52 +1000, Alexey Kardashevskiy wrote:
> 
> On 14/07/2020 12:40, Leonardo Bras wrote:
> > Thank you for this feedback Alexey!
> > 
> > On Mon, 2020-07-13 at 17:33 +1000, Alexey Kardashevskiy wrote:
> > > [...]
> > > > -	int len, ret;
> > > > +	int len, ret, reset_win_ext;
> > > 
> > > Make it "reset_token".
> > 
> > Oh, it's not a token here, it just checks if the reset_win extension
> > exists. The token would be returned in *value, but since we did not
> > need it here, it's not copied.
> 
> ah right, so it is a bool actually.

In fact I made it an int, as it's the return value of ddw_read_ext(),
which can return 0 on success and -error otherwise.

> > > > [...]
> > > > -out_failed:
> > > > +out_restore_defwin:
> > > > +	if (default_win && reset_win_ext == 0)
> > > 
> > > reset_win_ext potentially may be uninitialized here. Yeah I know it is
> > > tied to default_win but still.
> > 
> > I can't see it being used uninitialized here, as you said it's tied to
> > default_win. 
> 
> Where it is declared - it is not initialized so in theory it can skip
> "if (query.windows_available == 0)".

Humm, I thought doing if (default_win && reset_win_ext == 0) would
guarantee default_win to be tested before reset_win_ext is ever tested,
so I could control it using default_win. 

> 
> 
> > Could you please tell me how it can be used uninitialized here, or what
> > is bad by doing this way?
> > 
> > > After looking at this function for a few minutes, it could use some
> > > refactoring (way too many gotos)  such as:
> > 
> > Yes, I agree.
> > 
> > > 1. move (query.page_size & xx) checks before "if
> > > (query.windows_available == 0)"
> > 
> > Moving 'page_size selection' above 'checking windows available' will
> > need us to duplicate the 'page_size selection' after the new query,
> > inside the if.
> 
> page_size selection is not going to change, why?

In theory, a query after freeing the default DMA window could have a
different (bigger) page size, so we should test again.

> 
> 
> > I mean, as query will be done again, it will need to get the (new) page
> > size.
> > 
> > > 2. move "win64 = kzalloc(sizeof(struct property), GFP_KERNEL)" before
> > > "if (query.windows_available == 0)"
> > > 3. call "reset_dma_window(dev, pdn)" inside the "if
> > > (query.windows_available == 0)" branch.
> > > Then you can drop all "goto out_restore_defwin" and move default_win and
> > > reset_win_ext inside "if (query.windows_available == 0)".
> > 
> > I did all changes suggested locally and did some analysis in the
> > result:
> > 
> > I did not see a way to put default_win and reset_win_ext inside 
> > "if (query.windows_available == 0)", because if we still need a way to
> > know if the default window was removed, and if so, restore in case
> > anything ever fails ahead (like creating the node property). 
> 
> Ah, I missed that new out_restore_defwin label is between other exit
> labels. Sorry :-/
> 
> 
> >                 reset_win_ext = ddw_read_ext(pdn,
> > DDW_EXT_RESET_DMA_WIN, NULL);
> > -               if (reset_win_ext)
> > +               if (reset_win_ext){
> > +                       default_win = NULL;
> >                         goto out_failed;
> > +               }
> 
> This says "if we can reset, then we fail", no?

Here ddw_read_ext() should return 0 if extension was found, and 
(-EINVAL, -ENODATA or -EOVERFLOW) otherwise.
So it should return nonzero if we can't find the extension, in which
case we should fail.

> 
> >                 remove_dma_window(pdn, ddw_avail, default_win);
> 
> I think you can do "default_win=NULL" here and later at
> out_restore_defwin check if it is NULL - then call reset.

Currently I initialize 'default_win = NULL', and it only changes when I
read the default DMA window. If reset is not available I restore it to
NULL, so it will be not-NULL only when we have removed the default DMA
window.

If I make it NULL here, we either never reset the default DMA window
(as it is now "if (default_win)" ) or we may always reset it (in case
 "if (default_win == NULL)"). 

If you think it's better, I can create a bool variable like
"default_win_removed", initialized with 'false', which can be assigned
here with 'true' and test in the end if(default_win_removed) reset();

This would allow moving default_win inside this 'if block'.

What do you think?
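
A rough sketch of the bool variant, only to show the control flow. This is
not the real enable_ddw(): struct fake_fw and its fields are hypothetical
stand-ins for the firmware calls, so the flow can be exercised outside the
kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for firmware state; each field selects an outcome. */
struct fake_fw {
	int windows_available;     /* result of the first query */
	bool has_default_win;      /* "ibm,dma-window" property found */
	int reset_ext_ret;         /* 0 if the reset extension exists, -errno otherwise */
	int windows_after_remove;  /* result of the second query */
	bool create_fails;         /* some later step fails */
	bool default_win_present;  /* does the default window exist right now? */
};

/* Returns 0 on success, -1 on failure. */
int enable_ddw_sketch(struct fake_fw *fw)
{
	bool default_win_removed = false;
	int avail = fw->windows_available;

	if (avail == 0) {
		if (!fw->has_default_win)
			goto out_failed;
		/* Nonzero means no reset extension, so we could not restore later. */
		if (fw->reset_ext_ret)
			goto out_failed;

		fw->default_win_present = false;   /* stands in for remove_dma_window() */
		default_win_removed = true;

		avail = fw->windows_after_remove;  /* stands in for the second query_ddw() */
		if (avail == 0)
			goto out_failed;
	}

	if (fw->create_fails)
		goto out_failed;
	return 0;

out_failed:
	if (default_win_removed)
		fw->default_win_present = true;    /* stands in for reset_dma_window() */
	return -1;
}
```

With the flag, the restore at out_failed depends only on whether the
removal actually happened, and default_win itself no longer has to be
visible outside the 'if block'.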

> > [...]
> >  
> > -out_restore_defwin:
> > -       if (default_win && reset_win_ext == 0)
> > +out_failed:
> > +       if (default_win)
> >                 reset_dma_window(dev, pdn);
> >  
> > -out_failed:
> >         fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
> >         if (!fpdn)
> >                 goto out_unlock;
> > 
> > #####
> > 
> > What do you think?
> > 
> > 
> > 
> > > The rest of the series is good as it is,
> > 
> > Thank you :)
> > 
> > >  however it may conflict with
> > > https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200713062348.100552-1-aik@ozlabs.ru/
> > > and the patchset it is made on top of -
> > > https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=188385 .
> > 
> > <From the message of the first link>
> > > (do not rush, let me finish reviewing this first) 
> > 
> > Ok, I have no problem rebasing on top of those patchsets, but what
> > would you suggest to be done?
> 
> Polish this patch one more time and if by the time when you reposted it
> the other patchset is not in upstream, I'll ask Michael to take yours first.

Ok :)

> 
> > Would it be ok doing a big multi-author patchset, so we guarantee it
> > being applied in the correct order?
> > > (You probably want me to rebase my patchset on top of Hellwig + yours,
> > right?) 
> 
> Nah, at least not yet.

Thank you!



* Re: [PATCH 05/11] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-07-14  6:30 UTC (permalink / raw)
  To: Oliver O'Halloran
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <CAOSf1CGmHuyiW_s6DgaNbBEzUhq0qsuQ0ODPYvH+X9je3VWxwA@mail.gmail.com>

* Oliver O'Halloran <oohall@gmail.com> [2020-07-14 15:40:09]:

> On Tue, Jul 14, 2020 at 2:45 PM Srikar Dronamraju
> <srikar@linux.vnet.ibm.com> wrote:
> >
> > Current code assumes that cpumask of cpus sharing a l2-cache mask will
> > always be a superset of cpu_sibling_mask.
> >
> > Lets stop that assumption.
> 
> It's been a while since I looked, but I'm pretty sure the scheduler
> requires child domains to be subsets of their parents. Why is this
> necessary or a good idea?

Thanks for looking into the patches.

Yes the scheduler requires the child domains to be subsets of their parents.

Current code assumes that the l2_cache mask is always a superset of the
sibling mask. However, there may be processors in future whose l2-cache
mask may not be a superset of the sibling mask.

Let's say, for example, we have a chip with 16 threads where 8 threads share
an l2-cache, i.e. 8 threads act like a small core and 16 threads act like a
big core. Then the assumption that the l2-cache mask is a superset of the
cpu_sibling mask would be wrong.
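
As a small illustration of that example, with plain 32-bit words standing
in for struct cpumask (the mask values and helper name below are made up
for this sketch; the kernel's own helper for real cpumasks is
cpumask_subset()):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical big core: threads 0-15 are all siblings of thread 0. */
#define SIBLING_MASK_EXAMPLE  0xFFFFu
/* Hypothetical L2 sharing: only threads 0-7 share a cache with thread 0. */
#define L2_CACHE_MASK_EXAMPLE 0x00FFu

/* True if every CPU set in 'child' is also set in 'parent'. */
bool mask_is_subset(uint32_t child, uint32_t parent)
{
	return (child & ~parent) == 0;
}
```

Here mask_is_subset(SIBLING_MASK_EXAMPLE, L2_CACHE_MASK_EXAMPLE) is false,
so the l2-cache mask is not a superset of the sibling mask, while the
reverse check is true.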

> 
> > Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> > Cc: Michael Ellerman <michaele@au1.ibm.com>
> > Cc: Nick Piggin <npiggin@au1.ibm.com>
> > Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> > Cc: Nathan Lynch <nathanl@linux.ibm.com>
> > Cc: Michael Neuling <mikey@linux.ibm.com>
> > Cc: Anton Blanchard <anton@au1.ibm.com>
> > Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> > Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> > Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> > ---
> >  arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
> >  1 file changed, 15 insertions(+), 13 deletions(-)
> >
> > diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> > index 7d430fc536cc..875f57e41355 100644
> > --- a/arch/powerpc/kernel/smp.c
> > +++ b/arch/powerpc/kernel/smp.c
> > @@ -1198,6 +1198,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
> >         struct device_node *l2_cache, *np;
> >         int i;
> >
> > +       cpumask_set_cpu(cpu, mask_fn(cpu));
> 
> ?

At the time the cpumasks are updated, the cpu is not yet part of the
cpu_online_mask. So when we online/offline the cpus, a cpu's mask can end
up not containing the cpu itself, which causes the scheduler to bork.

Previously (as the removed code below shows), we did this as part of
updating all cpus that were part of the cpu_sibling_mask before
calling update_mask_by_l2.

> 
> >         l2_cache = cpu_to_l2cache(cpu);
> >         if (!l2_cache)
> >                 return false;
> > @@ -1284,29 +1285,30 @@ static void add_cpu_to_masks(int cpu)
> >          * add it to it's own thread sibling mask.
> >          */
> >         cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> > +       cpumask_set_cpu(cpu, cpu_core_mask(cpu));
> >
> >         for (i = first_thread; i < first_thread + threads_per_core; i++)
> >                 if (cpu_online(i))
> >                         set_cpus_related(i, cpu, cpu_sibling_mask);
> >
> >         add_cpu_to_smallcore_masks(cpu);
> > -       /*
> > -        * Copy the thread sibling mask into the cache sibling mask
> > -        * and mark any CPUs that share an L2 with this CPU.
> > -        */
> > -       for_each_cpu(i, cpu_sibling_mask(cpu))
> > -               set_cpus_related(cpu, i, cpu_l2_cache_mask);

I am referring to this code above. It would have set the cpu itself in its
own cpumask. The rest of the cpus in the cpu_sibling_mask get updated
correctly in update_mask_by_l2.

> >         update_mask_by_l2(cpu, cpu_l2_cache_mask);
> >
> > -       /*
> > -        * Copy the cache sibling mask into core sibling mask and mark
> > -        * any CPUs on the same chip as this CPU.
> > -        */

-- 
Thanks and Regards
Srikar Dronamraju


* Re: [PATCH v3] powerpc/perf: Use SIER_USER_MASK while updating SPRN_SIER for EBB events
From: Michael Ellerman @ 2020-07-14  6:08 UTC (permalink / raw)
  To: Athira Rajeev; +Cc: maddy, linuxppc-dev
In-Reply-To: <8C50DF8B-1CBB-4365-B068-C8DA5B7D1148@linux.vnet.ibm.com>

Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
>> On 19-Mar-2020, at 4:22 PM, Michael Ellerman <mpe@ellerman.id.au> wrote:
>> 
>> Hi Athira,
>> 
>> Athira Rajeev <atrajeev@linux.vnet.ibm.com> writes:
>>> Sampled Instruction Event Register (SIER), is a PMU register,
>>                                                               ^
>>                                                               that
>>> captures architecture state for a given sample. And sier_user_mask
>>           ^                                          ^
>>           don't think we need "architecture"         SIER_USER_MASK
>> 
>>> defined in commit 330a1eb7775b ("powerpc/perf: Core EBB support for 64-bit
>>> book3s") defines the architected bits that needs to be saved from the SPR.
>> 
>> Not quite, it defines the bits that are visible to userspace.
>> 
>> And I think it's true that for EBB events the bits we need/want to save
>> are only the user visible bits.
>> 
>>> Currently all of the bits from SIER are saved for EBB events. Patch fixes
>>> this by ANDing the "sier_user_mask" to data from SIER in ebb_switch_out().
>>> This will force save only architected bits from the SIER.
>> 
>> s/architected/user visible/
>> 
>> 
>> But, why does it matter? The kernel saves the user visible bits, as well
>> as the kernel-only bits into the thread struct. And then later the
>> kernel restores that value into the hardware before returning to
>> userspace.
>> 
>> But the hardware enforces the visibility of the bits, so userspace can't
>> observe any bits that it shouldn't.
>> 
>> Or is there some other mechanism whereby userspace can see those bits? ;)
>> 
>> If there was, what would the security implications of that be?
>
> Hi Michael,
>
> Thanks for your comments. 
>
> In ebb_switch_in, we set PMCC bit [MMCR0 44:45 ] to 10 which means
> SIER ( Group B ) register is readable in problem state. Hence the
> intention of the patch was to make sure we are not exposing the bits
> which the userspace shouldn't be reading.
>
> But following your comment about "hardware enforcing the visibility of
> bits", I did try an "ebb" experiment which showed that reading
> SPRN_SIER didn't expose any bits other than the user visible bits.
> Sorry for the confusion here.

That's OK. Thanks for following my trail of clues :)

> In that case, Can we drop the existing definition of SIER_USER_MASK if
> it is no more needed ?

I think it is still needed, and I think this change to use it is good, because
SIER is visible via ptrace.

What we need to do, is look at what information in SIER we are currently
exposing to userspace via ptrace, and what the security implications (if
any) of that are.
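
For reference, the masking under discussion boils down to a single AND. A
standalone sketch follows; the mask value is an assumption for
illustration (check the definition in the kernel's reg.h for the real one)
and the helper name is made up:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value, for illustration only; not taken from this thread. */
#define SIER_USER_MASK_SKETCH 0x7fffffULL

/* Keep only the bits treated as user visible, drop everything else. */
uint64_t sier_user_visible(uint64_t sier)
{
	return sier & SIER_USER_MASK_SKETCH;
}
```

Whatever value is saved through a helper like this cannot carry
kernel-only bits, regardless of which path later exposes the saved copy.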

cheers


* Re: [PATCH 03/15] powerpc/powernv/pci: Add explicit tracking of the DMA setup state
From: Oliver O'Halloran @ 2020-07-14  5:58 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev
In-Reply-To: <ee5a00db-badd-12fe-1c46-eaba5afc8dea@ozlabs.ru>

On Tue, Jul 14, 2020 at 3:37 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> On 10/07/2020 15:23, Oliver O'Halloran wrote:
> > There's an optimisation in the PE setup which skips performing DMA
> > setup for a PE if we only have bridges in a PE. The assumption being
> > that only "real" devices will DMA to system memory, which is probably
> > fair. However, if we start off with only bridge devices in a PE then
> > add a non-bridge device the new device won't be able to use DMA  because
> > we never configured it.
> >
> > Fix this (admittedly pretty weird) edge case by tracking whether we've done
> > the DMA setup for the PE or not. If a non-bridge device is added to the PE
> > (via rescan or hotplug, or whatever) we can set up DMA on demand.
>
> So hotplug does not work on powernv then, right? I thought you tested it
> a while ago, or this patch is the result of that attempt? If it is, then

It mostly works. The exception is the really niche case of hot plugging
a bridge and then later hot plugging a device into the same bus, which
wouldn't work.
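
The niche case can be modelled in a few lines. This is only a sketch with
made-up names (example_pe, example_add_device and so on), not the powernv
code: it shows a per-PE flag that lets DMA setup happen on demand when the
first non-bridge device shows up, and lets teardown skip PEs that were
never set up.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified PE state. */
struct example_pe {
	bool dma_setup_done;	/* has DMA been configured for this PE? */
	int  device_count;
	int  setup_calls;	/* how many times setup actually ran */
};

static void example_setup_dma(struct example_pe *pe)
{
	pe->setup_calls++;
	pe->dma_setup_done = true;
}

/* Called for every device added to the PE (boot, rescan or hotplug). */
static void example_add_device(struct example_pe *pe, bool is_bridge)
{
	assert(pe != NULL);
	/* Bridges are assumed not to DMA, so they never trigger setup. */
	if (!pe->dma_setup_done && !is_bridge)
		example_setup_dma(pe);
	pe->device_count++;
}

/* Only tear down what was actually set up; returns true if it did work. */
static bool example_release_dma(struct example_pe *pe)
{
	assert(pe != NULL);
	if (!pe->dma_setup_done)
		return false;
	pe->dma_setup_done = false;
	return true;
}
```

Adding a bridge first and a device later ends with DMA configured exactly
once, which is the behaviour the flag is meant to guarantee.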

> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
>
> > This also means the only remaining user of the old "DMA Weight" code is
> > the IODA1 DMA setup code that it was originally added for, which is good.
>
>
> Is ditching IODA1 in the plan? :)

That or separating out the pci_controller_ops for IODA1 and IODA2 so
we can stop any IODA2 specific changes from breaking it. For the most
part keeping around IODA1 support isn't hurting anyone, but I wanted
to re-work how the BDFN->PE assignment works so that we'd delay
assigning a BDFN to a PE until the device is probed. Right now when
we're configuring the PE for a bus we map all 255 devfn's to that PE.
This is mostly fine, but if you do a bus rescan and there's no device
present we'll get a spurious EEH on that PE since the PHB sees that
there's no device responding to the CFG cycle. We stop the spurious
EEH freeze today by only allowing config cycles if we can find a
pci_dn for that bdfn, but I want to get rid of pci_dn.

Mapping each BDFN to a PE after the device is probed is easy enough to
do on PHB3 and above since the mapping is handled by an in-memory
table which is indexed by the BDFN. Earlier PHBs (i.e. IODA1) use a
table of base & mask values which match on the BDFN, so assigning a
whole bus at once is easy, but adding individual BDFNs is hard. It's
still possible to do in the HW, but the way the OPAL API works makes
it impossible.
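
To make the difference between the two lookup styles concrete, here is a
small software model. It is a sketch under assumptions: the names, the
16-bit BDFN encoding (bus in the high byte, devfn in the low byte) and the
"no PE" marker are chosen for illustration and do not describe the actual
PHB tables or the OPAL API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_NO_PE 0xffffu

/* Scheme A: one table entry per BDFN, so single devices are easy. */
static uint16_t example_rid_table[1 << 16];

static void example_table_init(void)
{
	for (size_t i = 0; i < (1u << 16); i++)
		example_rid_table[i] = EXAMPLE_NO_PE;
}

static void example_table_map_one(uint16_t bdfn, uint16_t pe)
{
	example_rid_table[bdfn] = pe;
}

/* Scheme B: a comparator that matches when (bdfn & mask) == base. */
struct example_match {
	uint16_t base;
	uint16_t mask;
	uint16_t pe;
};

static uint16_t example_match_lookup(const struct example_match *m,
				     size_t n, uint16_t bdfn)
{
	assert(m != NULL || n == 0);
	for (size_t i = 0; i < n; i++)
		if ((bdfn & m[i].mask) == m[i].base)
			return m[i].pe;
	return EXAMPLE_NO_PE;
}
```

With scheme B a single entry such as base 0x0500, mask 0xff00 covers a
whole bus, whereas covering one arbitrary BDFN at a time would consume one
comparator per device.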

> >
> > Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> > ---
> > Alexey, do we need to have the IOMMU API stuff set/clear this flag?
>
>
> I'd say no as that API only cares if a device is in a PE and for those
> the PE DMA setup  optimization is skipped. Thanks,

Ok cool.

^ permalink raw reply

* Re: [PATCH 05/11] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Oliver O'Halloran @ 2020-07-14  5:40 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-6-srikar@linux.vnet.ibm.com>

On Tue, Jul 14, 2020 at 2:45 PM Srikar Dronamraju
<srikar@linux.vnet.ibm.com> wrote:
>
> Current code assumes that cpumask of cpus sharing a l2-cache mask will
> always be a superset of cpu_sibling_mask.
>
> Lets stop that assumption.

It's been a while since I looked, but I'm pretty sure the scheduler
requires child domains to be subsets of their parents. Why is this
necessary or a good idea?
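
For reference, the property in question is easy to state with plain
bitmasks. The following is only an illustration with hypothetical names
and a 64-CPU limit; it does not use the kernel cpumask API and makes no
claim about where, or whether, the scheduler enforces this.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One bit per CPU; enough for a 64-CPU illustration. */
typedef uint64_t example_cpumask;

/* True when every CPU in child is also in parent. */
static bool example_mask_subset(example_cpumask child, example_cpumask parent)
{
	return (child & ~parent) == 0;
}

/* Check a chain of domain spans, ordered smallest (SMT) to largest. */
static bool example_levels_nested(const example_cpumask *levels, int n)
{
	assert(n >= 0);
	for (int i = 0; i + 1 < n; i++)
		if (!example_mask_subset(levels[i], levels[i + 1]))
			return false;
	return true;
}
```

A chain such as { 0x0f, 0xff, 0xffff } nests properly, while
{ 0x0f, 0x03, 0xffff } does not, because the second span is missing CPUs
that the first one contains.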

> Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
> Cc: Michael Ellerman <michaele@au1.ibm.com>
> Cc: Nick Piggin <npiggin@au1.ibm.com>
> Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
> Cc: Nathan Lynch <nathanl@linux.ibm.com>
> Cc: Michael Neuling <mikey@linux.ibm.com>
> Cc: Anton Blanchard <anton@au1.ibm.com>
> Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
> Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
> Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
>  1 file changed, 15 insertions(+), 13 deletions(-)
>
> diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
> index 7d430fc536cc..875f57e41355 100644
> --- a/arch/powerpc/kernel/smp.c
> +++ b/arch/powerpc/kernel/smp.c
> @@ -1198,6 +1198,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
>         struct device_node *l2_cache, *np;
>         int i;
>
> +       cpumask_set_cpu(cpu, mask_fn(cpu));

?

>         l2_cache = cpu_to_l2cache(cpu);
>         if (!l2_cache)
>                 return false;
> @@ -1284,29 +1285,30 @@ static void add_cpu_to_masks(int cpu)
>          * add it to it's own thread sibling mask.
>          */
>         cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
> +       cpumask_set_cpu(cpu, cpu_core_mask(cpu));
>
>         for (i = first_thread; i < first_thread + threads_per_core; i++)
>                 if (cpu_online(i))
>                         set_cpus_related(i, cpu, cpu_sibling_mask);
>
>         add_cpu_to_smallcore_masks(cpu);
> -       /*
> -        * Copy the thread sibling mask into the cache sibling mask
> -        * and mark any CPUs that share an L2 with this CPU.
> -        */
> -       for_each_cpu(i, cpu_sibling_mask(cpu))
> -               set_cpus_related(cpu, i, cpu_l2_cache_mask);
>         update_mask_by_l2(cpu, cpu_l2_cache_mask);
>
> -       /*
> -        * Copy the cache sibling mask into core sibling mask and mark
> -        * any CPUs on the same chip as this CPU.
> -        */
> -       for_each_cpu(i, cpu_l2_cache_mask(cpu))
> -               set_cpus_related(cpu, i, cpu_core_mask);
> +       if (pkg_id == -1) {
> +               struct cpumask *(*mask)(int) = cpu_sibling_mask;
> +
> +               /*
> +                * Copy the sibling mask into core sibling mask and
> +                * mark any CPUs on the same chip as this CPU.
> +                */
> +               if (shared_caches)
> +                       mask = cpu_l2_cache_mask;
> +
> +               for_each_cpu(i, mask(cpu))
> +                       set_cpus_related(cpu, i, cpu_core_mask);
>
> -       if (pkg_id == -1)
>                 return;
> +       }
>
>         for_each_cpu(i, cpu_online_mask)
>                 if (get_physical_package_id(i) == pkg_id)
> --
> 2.17.1
>

^ permalink raw reply

* Re: [PATCH 03/15] powerpc/powernv/pci: Add explicit tracking of the DMA setup state
From: Alexey Kardashevskiy @ 2020-07-14  5:37 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200710052340.737567-4-oohall@gmail.com>



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> There's an optimisation in the PE setup which skips performing DMA
> setup for a PE if we only have bridges in a PE. The assumption being
> that only "real" devices will DMA to system memory, which is probably
> fair. However, if we start off with only bridge devices in a PE then
> add a non-bridge device the new device won't be able to use DMA  because
> we never configured it.
> 
> Fix this (admittedly pretty weird) edge case by tracking whether we've done
> the DMA setup for the PE or not. If a non-bridge device is added to the PE
> (via rescan or hotplug, or whatever) we can set up DMA on demand.

So hotplug does not work on powernv then, right? I thought you tested it
a while ago, or this patch is the result of that attempt? If it is, then

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> This also means the only remaining user of the old "DMA Weight" code is
> the IODA1 DMA setup code that it was originally added for, which is good.


Is ditching IODA1 in the plan? :)

> 
> Cc: Alexey Kardashevskiy <aik@ozlabs.ru>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> Alexey, do we need to have the IOMMU API stuff set/clear this flag?


I'd say no as that API only cares if a device is in a PE and for those
the PE DMA setup  optimization is skipped. Thanks,




> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 48 ++++++++++++++---------
>  arch/powerpc/platforms/powernv/pci.h      |  7 ++++
>  2 files changed, 36 insertions(+), 19 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index bfb40607aa0e..bb9c1cc60c33 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -141,6 +141,7 @@ static struct pnv_ioda_pe *pnv_ioda_init_pe(struct pnv_phb *phb, int pe_no)
>  
>  	phb->ioda.pe_array[pe_no].phb = phb;
>  	phb->ioda.pe_array[pe_no].pe_number = pe_no;
> +	phb->ioda.pe_array[pe_no].dma_setup_done = false;
>  
>  	/*
>  	 * Clear the PE frozen state as it might be put into frozen state
> @@ -1685,6 +1686,12 @@ static int pnv_pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  }
>  #endif /* CONFIG_PCI_IOV */
>  
> +static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
> +				       struct pnv_ioda_pe *pe);
> +
> +static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
> +				       struct pnv_ioda_pe *pe);
> +
>  static void pnv_pci_ioda_dma_dev_setup(struct pci_dev *pdev)
>  {
>  	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
> @@ -1713,6 +1720,24 @@ static void pnv_pci_ioda_dma_dev_setup(struct pci_dev *pdev)
>  		pci_info(pdev, "Added to existing PE#%x\n", pe->pe_number);
>  	}
>  
> +	/*
> +	 * We assume that bridges *probably* don't need to do any DMA so we can
> +	 * skip allocating a TCE table, etc unless we get a non-bridge device.
> +	 */
> +	if (!pe->dma_setup_done && !pci_is_bridge(pdev)) {
> +		switch (phb->type) {
> +		case PNV_PHB_IODA1:
> +			pnv_pci_ioda1_setup_dma_pe(phb, pe);
> +			break;
> +		case PNV_PHB_IODA2:
> +			pnv_pci_ioda2_setup_dma_pe(phb, pe);
> +			break;
> +		default:
> +			pr_warn("%s: No DMA for PHB#%x (type %d)\n",
> +				__func__, phb->hose->global_number, phb->type);
> +		}
> +	}
> +
>  	if (pdn)
>  		pdn->pe_number = pe->pe_number;
>  	pe->device_count++;
> @@ -2222,6 +2247,7 @@ static void pnv_pci_ioda1_setup_dma_pe(struct pnv_phb *phb,
>  	pe->table_group.tce32_size = tbl->it_size << tbl->it_page_shift;
>  	iommu_init_table(tbl, phb->hose->node, 0, 0);
>  
> +	pe->dma_setup_done = true;
>  	return;
>   fail:
>  	/* XXX Failure: Try to fallback to 64-bit only ? */
> @@ -2536,9 +2562,6 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  {
>  	int64_t rc;
>  
> -	if (!pnv_pci_ioda_pe_dma_weight(pe))
> -		return;
> -
>  	/* TVE #1 is selected by PCI address bit 59 */
>  	pe->tce_bypass_base = 1ull << 59;
>  
> @@ -2563,6 +2586,7 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  	iommu_register_group(&pe->table_group, phb->hose->global_number,
>  			     pe->pe_number);
>  #endif
> +	pe->dma_setup_done = true;
>  }
>  
>  int64_t pnv_opal_pci_msi_eoi(struct irq_chip *chip, unsigned int hw_irq)
> @@ -3136,7 +3160,6 @@ static void pnv_pci_fixup_bridge_resources(struct pci_bus *bus,
>  
>  static void pnv_pci_configure_bus(struct pci_bus *bus)
>  {
> -	struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
>  	struct pci_dev *bridge = bus->self;
>  	struct pnv_ioda_pe *pe;
>  	bool all = (bridge && pci_pcie_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE);
> @@ -3160,17 +3183,6 @@ static void pnv_pci_configure_bus(struct pci_bus *bus)
>  		return;
>  
>  	pnv_ioda_setup_pe_seg(pe);
> -	switch (phb->type) {
> -	case PNV_PHB_IODA1:
> -		pnv_pci_ioda1_setup_dma_pe(phb, pe);
> -		break;
> -	case PNV_PHB_IODA2:
> -		pnv_pci_ioda2_setup_dma_pe(phb, pe);
> -		break;
> -	default:
> -		pr_warn("%s: No DMA for PHB#%x (type %d)\n",
> -			__func__, phb->hose->global_number, phb->type);
> -	}
>  }
>  
>  static resource_size_t pnv_pci_default_alignment(void)
> @@ -3289,11 +3301,10 @@ static long pnv_pci_ioda1_unset_window(struct iommu_table_group *table_group,
>  
>  static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
>  {
> -	unsigned int weight = pnv_pci_ioda_pe_dma_weight(pe);
>  	struct iommu_table *tbl = pe->table_group.tables[0];
>  	int64_t rc;
>  
> -	if (!weight)
> +	if (!pe->dma_setup_done)
>  		return;
>  
>  	rc = pnv_pci_ioda1_unset_window(&pe->table_group, 0);
> @@ -3313,10 +3324,9 @@ static void pnv_pci_ioda1_release_pe_dma(struct pnv_ioda_pe *pe)
>  static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
>  {
>  	struct iommu_table *tbl = pe->table_group.tables[0];
> -	unsigned int weight = pnv_pci_ioda_pe_dma_weight(pe);
>  	int64_t rc;
>  
> -	if (!weight)
> +	if (!pe->dma_setup_done)
>  		return;
>  
>  	rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 0727dec9a0d1..6aa6aefb637d 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -87,6 +87,13 @@ struct pnv_ioda_pe {
>  	bool			tce_bypass_enabled;
>  	uint64_t		tce_bypass_base;
>  
> +	/*
> +	 * Used to track whether we've done DMA setup for this PE or not. We
> +	 * want to defer allocating TCE tables, etc until we've added a
> +	 * non-bridge device to the PE.
> +	 */
> +	bool			dma_setup_done;
> +
>  	/* MSIs. MVE index is identical for for 32 and 64 bit MSI
>  	 * and -1 if not supported. (It's actually identical to the
>  	 * PE number)
> 

-- 
Alexey

^ permalink raw reply

* Re: [RFC PATCH v0 2/2] KVM: PPC: Book3S HV: Use H_RPT_INVALIDATE in nested KVM
From: Bharata B Rao @ 2020-07-14  5:31 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: aneesh.kumar, linuxppc-dev, npiggin, kvm-ppc
In-Reply-To: <20200709100711.GA2961345@thinks.paulus.ozlabs.org>

On Thu, Jul 09, 2020 at 08:07:11PM +1000, Paul Mackerras wrote:
> On Thu, Jul 09, 2020 at 02:38:51PM +0530, Bharata B Rao wrote:
> > On Thu, Jul 09, 2020 at 03:18:03PM +1000, Paul Mackerras wrote:
> > > On Fri, Jul 03, 2020 at 04:14:20PM +0530, Bharata B Rao wrote:
> > > > In the nested KVM case, replace H_TLB_INVALIDATE by the new hcall
> > > > H_RPT_INVALIDATE if available. The availability of this hcall
> > > > is determined from "hcall-rpt-invalidate" string in ibm,hypertas-functions
> > > > DT property.
> > > 
> > > What are we going to use when nested KVM supports HPT guests at L2?
> > > L1 will need to do partition-scoped tlbies with R=0 via a hypercall,
> > > but H_RPT_INVALIDATE says in its name that it only handles radix
> > > page tables (i.e. R=1).
> > 
> > For L2 HPT guests, the old hcall is expected to work after it adds
> > support for R=0 case?
> 
> That was the plan.
> 
> > The new hcall should be advertised via ibm,hypertas-functions only
> > for radix guests I suppose.
> 
> Well, the L1 hypervisor is a radix guest of L0, so it would have
> H_RPT_INVALIDATE available to it?
> 
> I guess the question is whether H_RPT_INVALIDATE is supposed to do
> everything, that is, radix process-scoped invalidations, radix
> partition-scoped invalidations, and HPT partition-scoped
> invalidations.  If that is the plan then we should call it something
> different.

Guess we are a bit late now to rename it and include HPT in the scope.

> 
> This patchset seems to imply that H_RPT_INVALIDATE is at least going
> to be used for radix partition-scoped invalidations as well as radix
> process-scoped invalidations.  If you are thinking that in future when
> we need HPT partition-scoped invalidations for a radix L1 hypervisor
> running a HPT L2 guest, we are going to define a new hypercall for
> that, I suppose that is OK, though it doesn't really seem necessary.

Guess a new hcall would be the way forward to cover the HPT L2 guest
requirements.

Thanks for pointing this out.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH 00/11] Support for grouping cores
From: Srikar Dronamraju @ 2020-07-14  5:06 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Michael Ellerman, Anton Blanchard, linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

* Srikar Dronamraju <srikar@linux.vnet.ibm.com> [2020-07-14 10:06:13]:

> 
> On Power 9 (with device-tree enablement to show coregroups).
> -----------------------------------------------------------

With Gautham's help, I tried to kexec into a newer kernel with a modified
dtb. However, when passing the dtb option, kexec would get stuck.

As an alternative, I hardcoded some assumptions to test the same thing.
These hunks apply on top of all the patches.

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index 8ec7ff05ae47..762aa573c313 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -56,6 +56,12 @@ static int n_mem_addr_cells, n_mem_size_cells;
 static int form1_affinity;
 
 #define MAX_DISTANCE_REF_POINTS 4
+/*
+ *  HAck2: On Power9, there are 12 SMT8 cores per chip.
+ *  Hard code this for now.
+ */
+#define CORES_PER_CHIP 12
+#define CORES_PER_COREGROUP (CORES_PER_CHIP / 2)
 static int distance_ref_points_depth;
 static const __be32 *distance_ref_points;
 static int distance_lookup_table[MAX_NUMNODES][MAX_DISTANCE_REF_POINTS];
@@ -326,6 +332,13 @@ static int __init find_min_common_depth(void)
 
 	if (form1_affinity) {
 		depth = of_read_number(distance_ref_points, 1);
+		/*
+		 * Hack: Hack: Hack:
+		 * For now only on Phyp machines.
+		 */
+		if (!firmware_has_feature(FW_FEATURE_OPAL))
+			depth--;
 	} else {
 		if (distance_ref_points_depth < 2) {
 			printk(KERN_WARNING "NUMA: "
@@ -1247,8 +1260,8 @@ int cpu_to_coregroup_id(int cpu)
 		goto out;
 
 	index = of_read_number(associativity, 1);
-	if ((index > min_common_depth + 1) && coregroup_enabled)
-		return of_read_number(&associativity[index - 1], 1);
+	if ((index > min_common_depth + 1) && coregroup_enabled && has_big_cores)
+		return of_read_number(&associativity[index], 1) / CORES_PER_COREGROUP;
 
 out:
 	return cpu_to_core_id(cpu);
-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply related

* Re: [RFC PATCH 7/7] lazy tlb: shoot lazies, a non-refcounting lazy tlb option
From: Nicholas Piggin @ 2020-07-14  5:04 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-arch, Arnd Bergmann, Peter Zijlstra, X86 ML, LKML, Linux-MM,
	Mathieu Desnoyers, Andy Lutomirski, linuxppc-dev
In-Reply-To: <010054C3-7FFF-4FB5-BDA8-D2B80F7B1A5D@amacapital.net>

Excerpts from Andy Lutomirski's message of July 14, 2020 4:18 am:
> 
>> On Jul 13, 2020, at 9:48 AM, Nicholas Piggin <npiggin@gmail.com> wrote:
>> 
>> Excerpts from Andy Lutomirski's message of July 14, 2020 1:59 am:
>>>> On Thu, Jul 9, 2020 at 6:57 PM Nicholas Piggin <npiggin@gmail.com> wrote:
>>>> 
>>>> On big systems, the mm refcount can become highly contended when doing
>>>> a lot of context switching with threaded applications (particularly
>>>> switching between the idle thread and an application thread).
>>>> 
>>>> Abandoning lazy tlb slows switching down quite a bit in the important
>>>> user->idle->user cases, so instead implement a non-refcounted scheme
>>>> that causes __mmdrop() to IPI all CPUs in the mm_cpumask and shoot down
>>>> any remaining lazy ones.
>>>> 
>>>> On a 16-socket 192-core POWER8 system, a context switching benchmark
>>>> with as many software threads as CPUs (so each switch will go in and
>>>> out of idle), upstream can achieve a rate of about 1 million context
>>>> switches per second. After this patch it goes up to 118 million.
>>>> 
>>> 
>>> I read the patch a couple of times, and I have a suggestion that could
>>> be nonsense.  You are, effectively, using mm_cpumask() as a sort of
>>> refcount.  You're saying "hey, this mm has no more references, but it
>>> still has nonempty mm_cpumask(), so let's send an IPI and shoot down
>>> those references too."  I'm wondering whether you actually need the
>>> IPI.  What if, instead, you actually treated mm_cpumask as a refcount
>>> for real?  Roughly, in __mmdrop(), you would only free the page tables
>>> if mm_cpumask() is empty.  And, in the code that removes a CPU from
>>> mm_cpumask(), you would check if mm_users == 0 and, if so, check if
>>> you just removed the last bit from mm_cpumask and potentially free the
>>> mm.
>>> 
>>> Getting the locking right here could be a bit tricky -- you need to
>>> avoid two CPUs simultaneously exiting lazy TLB and thinking they
>>> should free the mm, and you also need to avoid an mm with mm_users
>>> hitting zero concurrently with the last remote CPU using it lazily
>>> exiting lazy TLB.  Perhaps this could be resolved by having mm_count
>>> == 1 mean "mm_cpumask() might contain bits and, if so, it owns the
>>> mm" and mm_count == 0 meaning "now it's dead" and using some careful
>>> cmpxchg or dec_return to make sure that only one CPU frees it.
>>> 
>>> Or maybe you'd need a lock or RCU for this, but the idea would be to
>>> only ever take the lock after mm_users goes to zero.
>> 
>> I don't think it's nonsense, it could be a good way to avoid IPIs.
>> 
>> I haven't seen much problem here that made me too concerned about IPIs 
>> yet, so I think the simple patch may be good enough to start with
>> for powerpc. I'm looking at avoiding/reducing the IPIs by combining the
>> unlazying with the exit TLB flush without doing anything fancy with
>> ref counting, but we'll see.
> 
> I would be cautious with benchmarking here. I would expect that the
> nasty cases may affect power consumption more than performance — the 
> specific issue is IPIs hitting idle cores, and the main effects are to 
> slow down exit() a bit but also to kick the idle core out of idle. 
> Although, if the idle core is in a deep sleep, that IPI could be 
> *very* slow.

It will tend to be self-limiting to some degree (deeper idle cores
would tend to have less chance of IPI) but we have bigger issues on
powerpc with that, like broadcast IPIs to the mm cpumask for THP
management. Power hasn't really shown up as an issue but powerpc
CPUs may have their own requirements and issues there, shall we say.

> So I think it’s worth at least giving this a try.

To be clear it's not a complete solution itself. The problem is of 
course that mm cpumask gives you false negatives, so the bits
won't always clean up after themselves as CPUs switch away from their
lazy tlb mms.

I would suspect it _may_ help with garbage collecting some remainders
nicely after exit, but only with somewhat of a different accounting
system than powerpc uses -- we tie mm_cpumask to TLB valids, so it can
become spread over CPUs that don't use (and have never even used) that mm
as a lazy mm. I don't know that the self-culling trick would help
a great deal within that scheme.

So powerpc needs a bit more work on that side of things too, hence
looking at doing more of this in the final TLB shootdown.

There's actually a lot of other things we can do as well to reduce
IPIs, batching being a simple hammer, some kind of quiescing, testing
the remote CPU to check what active mm it is using, doing the un-lazy
at certain defined points etc, so I'm actually not worried about IPIs
suddenly popping up and rendering the whole concept unworkable. At
some point (unless we go for something pretty complex like an SRCU type
thing, or adding extra locking, e.g. to use_mm()), at least
sometimes an IPI will be required, so I think it's reasonable to
start here and introduce complexity more slowly if it's justified.
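
A toy model of one of those IPI-reduction ideas (checking which mm the
remote CPU is actually using before interrupting it) might look like the
sketch below. Everything here is hypothetical: it is single-threaded, has
no real IPIs, locking or memory ordering, and the names are invented for
the example rather than taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define EXAMPLE_NR_CPUS 8

struct example_mm {
	uint32_t cpumask;	/* CPUs that may still reference this mm */
};

/* Per-CPU pointer to the mm currently loaded (possibly lazily). */
static struct example_mm *example_active_mm[EXAMPLE_NR_CPUS];
static struct example_mm example_init_mm;

/*
 * On final teardown, visit the CPUs in the mm's cpumask. A CPU whose
 * active mm is something else needs no interrupt; only CPUs still
 * lazily holding the mm are switched away. Returns the "IPI" count.
 */
static int example_shoot_lazies(struct example_mm *mm)
{
	int ipis = 0;

	assert(mm != NULL);
	for (int cpu = 0; cpu < EXAMPLE_NR_CPUS; cpu++) {
		if (!(mm->cpumask & (1u << cpu)))
			continue;
		if (example_active_mm[cpu] == mm) {
			example_active_mm[cpu] = &example_init_mm;
			ipis++;
		}
		mm->cpumask &= ~(1u << cpu);
	}
	return ipis;
}
```

In the model, a stale bit in the cpumask costs a check but not an
interrupt, which is the kind of saving being described.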

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH v3 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
From: Alexey Kardashevskiy @ 2020-07-14  4:52 UTC (permalink / raw)
  To: Leonardo Bras, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Thiago Jung Bauermann, Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <8c29be499e8741e7d77d53ca005034a2ca0179ac.camel@gmail.com>



On 14/07/2020 12:40, Leonardo Bras wrote:
> Thank you for this feedback Alexey!
> 
> On Mon, 2020-07-13 at 17:33 +1000, Alexey Kardashevskiy wrote:
>> [...]
>>> -	int len, ret;
>>> +	int len, ret, reset_win_ext;
>>
>> Make it "reset_token".
> 
> Oh, it's not a token here, it just checks if the reset_win extension
> exists. The token would be returned in *value, but since we did not
> need it here, it's not copied.

ah right, so it is a bool actually.


> 
>>> [...]
>>> -out_failed:
>>> +out_restore_defwin:
>>> +	if (default_win && reset_win_ext == 0)
>>
>> reset_win_ext potentially may be uninitialized here. Yeah I know it is
>> tied to default_win but still.
> 
> I can't see it being used uninitialized here, as you said it's tied to
> default_win. 

It is not initialized where it is declared, so in theory execution can
skip "if (query.windows_available == 0)" and still reach the check.


> Could you please tell me how it can be used uninitialized here, or what
> is bad by doing this way?
> 
>> After looking at this function for a few minutes, it could use some
>> refactoring (way too many gotos)  such as:
> 
> Yes, I agree.
> 
>> 1. move (query.page_size & xx) checks before "if
>> (query.windows_available == 0)"
> 
> Moving 'page_size selection' above 'checking windows available' will
> need us to duplicate the 'page_size selection' after the new query,
> inside the if.

page_size selection is not going to change, why?


> I mean, as query will be done again, it will need to get the (new) page
> size.
> 
>> 2. move "win64 = kzalloc(sizeof(struct property), GFP_KERNEL)" before
>> "if (query.windows_available == 0)"
> 
>> 3. call "reset_dma_window(dev, pdn)" inside the "if
>> (query.windows_available == 0)" branch.
> 
>> Then you can drop all "goto out_restore_defwin" and move default_win and
>> reset_win_ext inside "if (query.windows_available == 0)".
> 
> I did all changes suggested locally and did some analysis in the
> result:
> 
> I did not see a way to put default_win and reset_win_ext inside 
> "if (query.windows_available == 0)", because if we still need a way to
> know if the default window was removed, and if so, restore in case
> anything ever fails ahead (like creating the node property). 

Ah, I missed that new out_restore_defwin label is between other exit
labels. Sorry :-/


> But from that analysis I noted it's possible to remove all the new
> "goto out_restore_defwin", if we do default_win = NULL if
> ddw_read_ext() fails. 
> 
> So testing only default_win should always be enough to say if the
> default window was deleted, and reset_win_ext could be moved inside "if
> (query.windows_available == 0)".
> Also, it would avoid reset_win_ext being 'used uninitialized' and
> "out_restore_defwin:" would not be needed.
> 
> Against the current patch, we would have something like this:
> 
> #####
> 
>  static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  {
> -       int len, ret, reset_win_ext;
> +       int len, ret;
>         struct ddw_query_response query;
>         struct ddw_create_response create;
>         int page_shift;
> @@ -1173,25 +1173,28 @@ static u64 enable_ddw(struct pci_dev *dev,
> struct device_node *pdn)
>          * for extensions presence.
>          */
>         if (query.windows_available == 0) {
> +               int reset_win_ext;
>                 default_win = of_find_property(pdn, "ibm,dma-window",
> NULL);
>                 if (!default_win)
>                         goto out_failed;
>  
>                 reset_win_ext = ddw_read_ext(pdn,
> DDW_EXT_RESET_DMA_WIN, NULL);
> -               if (reset_win_ext)
> +               if (reset_win_ext){
> +                       default_win = NULL;
>                         goto out_failed;
> +               }


This says "if we can reset, then we fail", no?

>                 remove_dma_window(pdn, ddw_avail, default_win);


I think you can do "default_win=NULL" here and later at
out_restore_defwin check if it is NULL - then call reset.
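
One way to make the cleanup depend on a single piece of state that is
always initialised is sketched below. This is not the pseries code: the
struct, the failure injection flag and the function name are assumptions
for the example, and the firmware calls are replaced by simple field
updates.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device state standing in for firmware-managed windows. */
struct example_dev {
	bool default_window_present;
	int  windows_available;
	bool fail_create;	/* inject a failure after the removal step */
};

/* Returns 0 on success, -1 on failure (default window restored). */
static int example_enable_ddw(struct example_dev *dev)
{
	bool default_removed = false;	/* initialised where declared */

	assert(dev != NULL);
	if (dev->windows_available == 0) {
		if (!dev->default_window_present)
			goto out_failed;
		/* Free the default window to make room for the big one. */
		dev->default_window_present = false;
		dev->windows_available++;
		default_removed = true;
	}

	if (dev->fail_create)
		goto out_failed;

	dev->windows_available--;	/* the new window now uses the slot */
	return 0;

out_failed:
	if (default_removed) {
		/* Undo the removal so the device keeps a usable window. */
		dev->default_window_present = true;
		dev->windows_available--;
	}
	return -1;
}
```

Because the flag is only set at the point where the removal really
happened, every failure path can jump to the same label without caring
how far the function got.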

>                 /* Query again, to check if the window is available */
>                 ret = query_ddw(dev, ddw_avail, &query, pdn);
>                 if (ret != 0)
> -                       goto out_restore_defwin;
> +                       goto out_failed;
>  
>                 if (query.windows_available == 0) {
>                         /* no windows are available for this device. */
>                         dev_dbg(&dev->dev, "no free dynamic windows");
> -                       goto out_restore_defwin;
> +                       goto out_failed;
>                 }
>         }
>         if (query.page_size & 4) {
> @@ -1203,7 +1206,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct
> device_node *pdn)
>         } else {
>                 dev_dbg(&dev->dev, "no supported direct page size in
> mask %x",
>                           query.page_size);
> -               goto out_restore_defwin;
> +               goto out_failed;
>         }
>         /* verify the window * number of ptes will map the partition */
>         /* check largest block * page size > max memory hotplug addr */
> @@ -1212,14 +1215,14 @@ static u64 enable_ddw(struct pci_dev *dev,
> struct device_node *pdn)
>                 dev_dbg(&dev->dev, "can't map partition max 0x%llx with
> %llu "
>                           "%llu-sized pages\n",
> max_addr,  query.largest_available_block,
>                           1ULL << page_shift);
> -               goto out_restore_defwin;
> +               goto out_failed;
>         }
>         len = order_base_2(max_addr);
>         win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>         if (!win64) {
>                 dev_info(&dev->dev,
>                         "couldn't allocate property for 64bit dma
> window\n");
> -               goto out_restore_defwin;
> +               goto out_failed;
>         }
>         win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
>         win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> @@ -1282,11 +1285,10 @@ static u64 enable_ddw(struct pci_dev *dev,
> struct device_node *pdn)
>         kfree(win64->value);
>         kfree(win64);
>  
> -out_restore_defwin:
> -       if (default_win && reset_win_ext == 0)
> +out_failed:
> +       if (default_win)
>                 reset_dma_window(dev, pdn);
>  
> -out_failed:
>         fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
>         if (!fpdn)
>                 goto out_unlock;
> 
> #####
> 
> What do you think?
> 
> 
> 
>> The rest of the series is good as it is,
> 
> Thank you :)
> 
>>  however it may conflict with
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200713062348.100552-1-aik@ozlabs.ru/
>> and the patchset it is made on top of -
>> https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=188385 .
> 
> <From the message of the first link>
>> (do not rush, let me finish reviewing this first) 
> 
> Ok, I have no problem rebasing on top of those patchsets, but what
> would you suggest to be done?

Polish this patch one more time, and if the other patchset is not
upstream by the time you repost it, I'll ask Michael to take yours first.


> Would it be ok doing a big multi-author patchset, so we guarantee it
> being applied in the correct order?
>> (You probably want me to rebase my patchset on top of Hellwig + yours,
> right?) 

Nah, at least not yet.


-- 
Alexey

^ permalink raw reply

* [PATCH 02/11] powerpc/smp: Merge Power9 topology with Power topology
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

A new sched_domain_topology_level was added just for Power9. However, the
same can be achieved by merging powerpc_topology with power9_topology,
which makes the code simpler, especially when adding a new sched
domain.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 33 ++++++++++-----------------------
 1 file changed, 10 insertions(+), 23 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 680c0edcc59d..069ea4b21c6d 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1315,7 +1315,7 @@ int setup_profiling_timer(unsigned int multiplier)
 }
 
 #ifdef CONFIG_SCHED_SMT
-/* cpumask of CPUs with asymetric SMT dependancy */
+/* cpumask of CPUs with asymmetric SMT dependency */
 static int powerpc_smt_flags(void)
 {
 	int flags = SD_SHARE_CPUCAPACITY | SD_SHARE_PKG_RESOURCES;
@@ -1328,14 +1328,6 @@ static int powerpc_smt_flags(void)
 }
 #endif
 
-static struct sched_domain_topology_level powerpc_topology[] = {
-#ifdef CONFIG_SCHED_SMT
-	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
-#endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
-	{ NULL, },
-};
-
 /*
  * P9 has a slightly odd architecture where pairs of cores share an L2 cache.
  * This topology makes it *much* cheaper to migrate tasks between adjacent cores
@@ -1353,7 +1345,13 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	return cpu_l2_cache_mask(cpu);
+	if (shared_caches)
+		return cpu_l2_cache_mask(cpu);
+
+	if (has_big_cores)
+		return cpu_smallcore_mask(cpu);
+
+	return cpu_smt_mask(cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -1363,7 +1361,7 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
-static struct sched_domain_topology_level power9_topology[] = {
+static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
@@ -1388,21 +1386,10 @@ void __init smp_cpus_done(unsigned int max_cpus)
 #ifdef CONFIG_SCHED_SMT
 	if (has_big_cores) {
 		pr_info("Big cores detected but using small core scheduling\n");
-		power9_topology[0].mask = smallcore_smt_mask;
 		powerpc_topology[0].mask = smallcore_smt_mask;
 	}
 #endif
-	/*
-	 * If any CPU detects that it's sharing a cache with another CPU then
-	 * use the deeper topology that is aware of this sharing.
-	 */
-	if (shared_caches) {
-		pr_info("Using shared cache scheduler topology\n");
-		set_sched_topology(power9_topology);
-	} else {
-		pr_info("Using standard scheduler topology\n");
-		set_sched_topology(powerpc_topology);
-	}
+	set_sched_topology(powerpc_topology);
 }
 
 #ifdef CONFIG_HOTPLUG_CPU
-- 
2.17.1



* [PATCH 11/11] powerpc/smp: Provide an ability to disable coregroup
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

If the user wants to enable the coregroup sched_domain, they can boot with
the kernel parameter "coregroup_support=on". Without the parameter, the
coregroup domain stays disabled.
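
As an illustration only, the parameter handling and the gating can be
mirrored in userspace as below. param_enables_coregroup() and the
two-argument has_coregroup_support() are hypothetical helpers, not the
kernel functions; the sketch keeps the substring matching that strstr()
gives in the patch, which means a value such as "only" would also enable
the feature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace mirror of early_coregroup(): substring match on "on" or "1". */
static bool param_enables_coregroup(const char *p)
{
	if (!p)
		return false;
	return strstr(p, "on") != NULL || strstr(p, "1") != NULL;
}

/* Both the firmware detection and the boot parameter must agree. */
static bool has_coregroup_support(bool coregroup_enabled,
				  bool coregroup_support)
{
	return coregroup_enabled && coregroup_support;
}

/* Small self-check showing the intended and the incidental matches. */
static void coregroup_param_demo(void)
{
	assert(param_enables_coregroup("on"));
	assert(param_enables_coregroup("1"));
	assert(!param_enables_coregroup("off"));
	assert(param_enables_coregroup("only"));  /* substring match */
	assert(!has_coregroup_support(true, false));
}
```

A stricter parser would compare the whole string; that would be a design
change relative to the patch and is not shown here.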

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index bb25c13bbb79..c43909e6e8e9 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -118,6 +118,23 @@ struct smp_ops_t *smp_ops;
 volatile unsigned int cpu_callin_map[NR_CPUS];
 
 int smt_enabled_at_boot = 1;
+int coregroup_support;
+
+static int __init early_coregroup(char *p)
+{
+	if (!p)
+		return 0;
+
+	if (strstr(p, "on"))
+		coregroup_support = 1;
+
+	if (strstr(p, "1"))
+		coregroup_support = 1;
+
+	return 0;
+}
+
+early_param("coregroup_support", early_coregroup);
 
 /*
  * Returns 1 if the specified cpu should be brought up during boot.
@@ -878,7 +895,7 @@ static struct cpumask *cpu_coregroup_mask(int cpu)
 
 static bool has_coregroup_support(void)
 {
-	return coregroup_enabled;
+	return coregroup_enabled && coregroup_support;
 }
 
 static const struct cpumask *cpu_mc_mask(int cpu)
-- 
2.17.1



* [PATCH 10/11] powerpc/smp: Implement cpu_to_coregroup_id
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

Look up the coregroup id from the associativity array.

If unable to detect the coregroup id, fall back on the core id.
This ensures that the sched_domain degenerates and an extra sched domain
is not created.

Ideally this function should have been implemented in
arch/powerpc/kernel/smp.c. However, if it is implemented in mm/numa.c, we
don't need to find the primary domain again.

If the device-tree mentions more than one coregroup, then the kernel
implements only the last or the smallest coregroup, which currently
corresponds to the penultimate domain in the device-tree.
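
The lookup described above can be sketched in userspace as follows. This
is an illustration under assumptions: the array is a host-endian copy
whose first element holds the number of entries that follow (the kernel
reads big-endian cells with of_read_number()), coregroup_id_from_assoc()
is a hypothetical name, and the bounds guard is an addition of this
sketch rather than something the patch does.

```c
#include <assert.h>
#include <stddef.h>

/*
 * assoc[0] holds the number of entries that follow; the last entry is
 * the core and the one before it is taken as the coregroup.
 */
static int coregroup_id_from_assoc(const unsigned int *assoc, size_t len,
				   unsigned int min_common_depth,
				   int coregroup_enabled, int core_id)
{
	unsigned int index;

	assert(assoc != NULL && len > 0);

	index = assoc[0];
	/* Sketch-only guard: ignore a length word that exceeds the buffer. */
	if (index >= len)
		return core_id;

	if (index > min_common_depth + 1 && coregroup_enabled)
		return (int)assoc[index - 1];

	/* No separate coregroup level: fall back on the core id. */
	return core_id;
}
```

When the fallback is taken, every coregroup mask matches the core-level
mask, which is what lets the extra sched domain degenerate.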

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/numa.c | 17 +++++++++++++++++
 1 file changed, 17 insertions(+)

diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index d9ab9da85eab..4e85564ef62a 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1697,6 +1697,23 @@ static const struct proc_ops topology_proc_ops = {
 
 int cpu_to_coregroup_id(int cpu)
 {
+	__be32 associativity[VPHN_ASSOC_BUFSIZE] = {0};
+	int index;
+
+	if (cpu < 0 || cpu > nr_cpu_ids)
+		return -1;
+
+	if (!firmware_has_feature(FW_FEATURE_VPHN))
+		goto out;
+
+	if (vphn_get_associativity(cpu, associativity))
+		goto out;
+
+	index = of_read_number(associativity, 1);
+	if ((index > min_common_depth + 1) && coregroup_enabled)
+		return of_read_number(&associativity[index - 1], 1);
+
+out:
 	return cpu_to_core_id(cpu);
 }
 
-- 
2.17.1



* [PATCH 09/11] Powerpc/smp: Create coregroup domain
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

Add percpu coregroup maps and masks to create the coregroup domain.
If a coregroup doesn't exist, the coregroup domain will degenerate
in favour of the SMT/CACHE domain.
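
A rough userspace sketch of how the per-CPU coregroup masks get populated
as CPUs come online is given below. The 64-bit masks, NR_CPUS_SKETCH, the
ids array and the function names are hypothetical stand-ins for
cpumask_var_t, cpu_to_coregroup_id() and set_cpus_related().

```c
#include <assert.h>
#include <stdint.h>

#define NR_CPUS_SKETCH 8  /* hypothetical small system */

/* One mask per CPU, one bit per CPU. */
static uint64_t coregroup_map[NR_CPUS_SKETCH];

/* Mark two CPUs as related in each other's mask. */
static void set_related(int a, int b)
{
	coregroup_map[a] |= UINT64_C(1) << b;
	coregroup_map[b] |= UINT64_C(1) << a;
}

/*
 * Add @cpu to the coregroup mask of every online CPU that reports the
 * same coregroup id. @ids and @online are inputs supplied by the caller.
 */
static void add_cpu_to_coregroup_masks(int cpu, const int *ids,
				       uint64_t online)
{
	assert(cpu >= 0 && cpu < NR_CPUS_SKETCH);

	coregroup_map[cpu] |= UINT64_C(1) << cpu;
	for (int i = 0; i < NR_CPUS_SKETCH; i++) {
		if (!(online & (UINT64_C(1) << i)))
			continue;
		if (ids[cpu] == ids[i])
			set_related(cpu, i);
	}
}
```

Calling add_cpu_to_coregroup_masks() once per CPU, in bring-up order,
leaves each mask holding exactly the CPUs that share its coregroup id.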

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/topology.h | 10 ++++++++
 arch/powerpc/kernel/smp.c           | 37 +++++++++++++++++++++++++++++
 arch/powerpc/mm/numa.c              |  5 ++++
 3 files changed, 52 insertions(+)

diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
index 2db7ba789720..34812c35018e 100644
--- a/arch/powerpc/include/asm/topology.h
+++ b/arch/powerpc/include/asm/topology.h
@@ -98,6 +98,7 @@ extern int stop_topology_update(void);
 extern int prrn_is_enabled(void);
 extern int find_and_online_cpu_nid(int cpu);
 extern int timed_topology_update(int nsecs);
+extern int cpu_to_coregroup_id(int cpu);
 #else
 static inline int start_topology_update(void)
 {
@@ -120,6 +121,15 @@ static inline int timed_topology_update(int nsecs)
 	return 0;
 }
 
+static inline int cpu_to_coregroup_id(int cpu)
+{
+#ifdef CONFIG_SMP
+	return cpu_to_core_id(cpu);
+#else
+	return 0;
+#endif
+}
+
 #endif /* CONFIG_NUMA && CONFIG_PPC_SPLPAR */
 
 #include <asm-generic/topology.h>
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index ef19eeccd21e..bb25c13bbb79 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -80,6 +80,7 @@ DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_l2_cache_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_core_map);
+DEFINE_PER_CPU(cpumask_var_t, cpu_coregroup_map);
 
 EXPORT_PER_CPU_SYMBOL(cpu_sibling_map);
 EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
@@ -91,6 +92,7 @@ enum {
 	smt_idx,
 #endif
 	bigcore_idx,
+	mc_idx,
 	die_idx,
 };
 
@@ -869,6 +871,21 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static struct cpumask *cpu_coregroup_mask(int cpu)
+{
+	return per_cpu(cpu_coregroup_map, cpu);
+}
+
+static bool has_coregroup_support(void)
+{
+	return coregroup_enabled;
+}
+
+static const struct cpumask *cpu_mc_mask(int cpu)
+{
+	return cpu_coregroup_mask(cpu);
+}
+
 static const struct cpumask *cpu_bigcore_mask(int cpu)
 {
 	return cpu_core_mask(cpu);
@@ -879,6 +896,7 @@ static struct sched_domain_topology_level powerpc_topology[] = {
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
 	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
+	{ cpu_mc_mask, SD_INIT_NAME(MC) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -933,6 +951,10 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 					GFP_KERNEL, node);
 		zalloc_cpumask_var_node(&per_cpu(cpu_core_map, cpu),
 					GFP_KERNEL, node);
+		if (has_coregroup_support())
+			zalloc_cpumask_var_node(&per_cpu(cpu_coregroup_map, cpu),
+						GFP_KERNEL, node);
+
 #ifdef CONFIG_NEED_MULTIPLE_NODES
 		/*
 		 * numa_node_id() works after this.
@@ -950,6 +972,11 @@ void __init smp_prepare_cpus(unsigned int max_cpus)
 	cpumask_set_cpu(boot_cpuid, cpu_l2_cache_mask(boot_cpuid));
 	cpumask_set_cpu(boot_cpuid, cpu_core_mask(boot_cpuid));
 
+	if (has_coregroup_support())
+		cpumask_set_cpu(boot_cpuid, cpu_coregroup_mask(boot_cpuid));
+	else
+		powerpc_topology[mc_idx].mask = cpu_bigcore_mask;
+
 	init_big_cores();
 	if (has_big_cores) {
 		cpumask_set_cpu(boot_cpuid,
@@ -1241,6 +1268,8 @@ static void remove_cpu_from_masks(int cpu)
 		set_cpus_unrelated(cpu, i, cpu_sibling_mask);
 		if (has_big_cores)
 			set_cpus_unrelated(cpu, i, cpu_smallcore_mask);
+		if (has_coregroup_support())
+			set_cpus_unrelated(cpu, i, cpu_coregroup_mask);
 	}
 }
 #endif
@@ -1301,6 +1330,14 @@ static void add_cpu_to_masks(int cpu)
 	add_cpu_to_smallcore_masks(cpu);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
+	if (has_coregroup_support()) {
+		cpumask_set_cpu(cpu, cpu_coregroup_mask(cpu));
+		for_each_cpu(i, cpu_online_mask) {
+			if (cpu_to_coregroup_id(cpu) == cpu_to_coregroup_id(i))
+				set_cpus_related(cpu, i, cpu_coregroup_mask);
+		}
+	}
+
 	if (pkg_id == -1) {
 		struct cpumask *(*mask)(int) = cpu_sibling_mask;
 
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index a43eab455be4..d9ab9da85eab 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -1695,6 +1695,11 @@ static const struct proc_ops topology_proc_ops = {
 	.proc_release	= single_release,
 };
 
+int cpu_to_coregroup_id(int cpu)
+{
+	return cpu_to_core_id(cpu);
+}
+
 static int topology_update_init(void)
 {
 	start_topology_update();
-- 
2.17.1



* [PATCH 08/11] powerpc/smp: Allocate cpumask only after searching thread group
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

If the cpumask is allocated before the search and the search fails, then
the cpumask needs to be freed. Instead, allocate cpu_l1_cache_map only
after the thread group search has succeeded.
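
The ordering argument is generic and can be shown outside the kernel.
In the sketch below, find_group_start() and init_cache_map() are
hypothetical helpers, and the grouping rule (groups of four CPUs) is
made up purely so the example has something to search for.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical search step: returns -1 when no thread group is found. */
static int find_group_start(int cpu)
{
	return cpu >= 0 ? (cpu & ~3) : -1;
}

/*
 * Run the fallible lookup first and allocate only once it has succeeded,
 * so the early error path has nothing to free.
 */
static int init_cache_map(int cpu, unsigned long **map_out)
{
	int start = find_group_start(cpu);

	assert(map_out != NULL);
	*map_out = NULL;

	if (start == -1)
		return -1;  /* nothing allocated yet, nothing to undo */

	*map_out = calloc(1, sizeof(**map_out));
	if (!*map_out)
		return -1;

	/* Record this CPU's position within its group. */
	**map_out |= 1UL << (cpu - start);
	return 0;
}
```

The caller owns the returned buffer and frees it; on failure the output
pointer is left as NULL.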

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 96e47450d9b3..ef19eeccd21e 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -797,10 +797,6 @@ static int init_cpu_l1_cache_map(int cpu)
 	if (err)
 		goto out;
 
-	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
-				GFP_KERNEL,
-				cpu_to_node(cpu));
-
 	cpu_group_start = get_cpu_thread_group_start(cpu, &tg);
 
 	if (unlikely(cpu_group_start == -1)) {
@@ -809,6 +805,9 @@ static int init_cpu_l1_cache_map(int cpu)
 		goto out;
 	}
 
+	zalloc_cpumask_var_node(&per_cpu(cpu_l1_cache_map, cpu),
+				GFP_KERNEL, cpu_to_node(cpu));
+
 	for (i = first_thread; i < first_thread + threads_per_core; i++) {
 		int i_group_start = get_cpu_thread_group_start(i, &tg);
 
-- 
2.17.1



* [PATCH 07/11] Powerpc/numa: Detect support for coregroup
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

Add support for grouping cores based on the device-tree classification.
- The last domain in the associativity domains always refers to the
core.
- If the primary reference domain happens to be the penultimate domain in
the associativity domains device-tree property, then there are no
coregroups. However, if it is not the penultimate domain, then there are
coregroups. There can be more than one coregroup. For now we are
interested only in the last, or smallest, coregroup.
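
The detection rule in the two points above reduces to a comparison on
the number of cells in the property. A minimal sketch, assuming the
property's first cell is a length word so that a property with no
coregroup level has exactly min_common_depth + 2 cells; detect_coregroup()
is a hypothetical name for the check the patch performs inline.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * @nr_cells: number of 32-bit cells in the associativity-domains
 *            property (prop_length / sizeof(int) in the patch).
 * @min_common_depth: index of the primary reference domain.
 */
static bool detect_coregroup(size_t nr_cells, size_t min_common_depth)
{
	assert(nr_cells > 0);

	/* Extra levels between the primary domain and the core exist. */
	return nr_cells > min_common_depth + 2;
}
```

For example, with min_common_depth equal to 2, a four-cell property has
no coregroup level while a five-cell property does.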

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/smp.h |  1 +
 arch/powerpc/kernel/smp.c      |  1 +
 arch/powerpc/mm/numa.c         | 34 +++++++++++++++++++++-------------
 3 files changed, 23 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index 49a25e2400f2..5bdc17a7049f 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -28,6 +28,7 @@
 extern int boot_cpuid;
 extern int spinning_secondaries;
 extern u32 *cpu_to_phys_id;
+extern bool coregroup_enabled;
 
 extern void cpu_die(void);
 extern int cpu_to_chip_id(int cpu);
diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index f8faf75135af..96e47450d9b3 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -74,6 +74,7 @@ static DEFINE_PER_CPU(int, cpu_state) = { 0 };
 
 struct task_struct *secondary_current;
 bool has_big_cores;
+bool coregroup_enabled;
 
 DEFINE_PER_CPU(cpumask_var_t, cpu_sibling_map);
 DEFINE_PER_CPU(cpumask_var_t, cpu_smallcore_map);
diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
index fc7b0505bdd8..a43eab455be4 100644
--- a/arch/powerpc/mm/numa.c
+++ b/arch/powerpc/mm/numa.c
@@ -887,7 +887,9 @@ static void __init setup_node_data(int nid, u64 start_pfn, u64 end_pfn)
 static void __init find_possible_nodes(void)
 {
 	struct device_node *rtas;
-	u32 numnodes, i;
+	const __be32 *domains;
+	int prop_length, max_nodes;
+	u32 i;
 
 	if (!numa_enabled)
 		return;
@@ -896,25 +898,31 @@ static void __init find_possible_nodes(void)
 	if (!rtas)
 		return;
 
-	if (of_property_read_u32_index(rtas, "ibm,current-associativity-domains",
-				min_common_depth, &numnodes)) {
-		/*
-		 * ibm,current-associativity-domains is a fairly recent
-		 * property. If it doesn't exist, then fallback on
-		 * ibm,max-associativity-domains. Current denotes what the
-		 * platform can support compared to max which denotes what the
-		 * Hypervisor can support.
-		 */
-		if (of_property_read_u32_index(rtas, "ibm,max-associativity-domains",
-				min_common_depth, &numnodes))
+	/*
+	 * ibm,current-associativity-domains is a fairly recent property. If
+	 * it doesn't exist, then fallback on ibm,max-associativity-domains.
+	 * Current denotes what the platform can support compared to max
+	 * which denotes what the Hypervisor can support.
+	 */
+	domains = of_get_property(rtas, "ibm,current-associativity-domains",
+					&prop_length);
+	if (!domains) {
+		domains = of_get_property(rtas, "ibm,max-associativity-domains",
+					&prop_length);
+		if (!domains)
 			goto out;
 	}
 
-	for (i = 0; i < numnodes; i++) {
+	max_nodes = of_read_number(&domains[min_common_depth], 1);
+	for (i = 0; i < max_nodes; i++) {
 		if (!node_possible(i))
 			node_set(i, node_possible_map);
 	}
 
+	prop_length /= sizeof(int);
+	if (prop_length > min_common_depth + 2)
+		coregroup_enabled = 1;
+
 out:
 	of_node_put(rtas);
 }
-- 
2.17.1



* [PATCH 06/11] powerpc/smp: Generalize 2nd sched domain
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

Currently the "CACHE" domain happens to be the 2nd sched domain as per
powerpc_topology. This domain will collapse if the cpumask of the l2-cache
is the same as the SMT domain. However, we could generalize this domain
such that it could be either a "CACHE" domain or a "BIGCORE" domain.

While setting up the "CACHE" domain, check if shared_caches is already
set.
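
The idea of a generic second slot that is specialised late can be
sketched as below. struct topo_level is a simplified, hypothetical
stand-in for struct sched_domain_topology_level, the flag value is a
placeholder, and the sketch assumes SMT is always configured (the patch
makes smt_idx conditional on CONFIG_SCHED_SMT).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Simplified stand-in for one scheduler topology level. */
struct topo_level {
	const char *name;
	int (*flags)(void);
};

/* Named indices replace the hard-coded [0] used before this patch. */
enum { smt_idx, bigcore_idx, die_idx };

static int shared_cache_flags(void)
{
	return 1;  /* placeholder flag value for the sketch */
}

static struct topo_level topology[] = {
	[smt_idx]     = { "SMT", NULL },
	[bigcore_idx] = { "BIGCORE", NULL },
	[die_idx]     = { "DIE", NULL },
	{ NULL, NULL },  /* terminator */
};

/* Repurpose the second level as CACHE once shared caches are known. */
static void finalize_topology(bool shared_caches)
{
	if (shared_caches) {
		topology[bigcore_idx].name = "CACHE";
		topology[bigcore_idx].flags = shared_cache_flags;
	}
	/* The remaining levels are left untouched. */
	assert(strcmp(topology[die_idx].name, "DIE") == 0);
}
```

Using the enum keeps later insertions, such as an extra level between
bigcore_idx and die_idx, from silently shifting hard-coded indices.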

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 48 +++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 875f57e41355..f8faf75135af 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -85,6 +85,14 @@ EXPORT_PER_CPU_SYMBOL(cpu_l2_cache_map);
 EXPORT_PER_CPU_SYMBOL(cpu_core_map);
 EXPORT_SYMBOL_GPL(has_big_cores);
 
+enum {
+#ifdef CONFIG_SCHED_SMT
+	smt_idx,
+#endif
+	bigcore_idx,
+	die_idx,
+};
+
 #define MAX_THREAD_LIST_SIZE	8
 #define THREAD_GROUP_SHARE_L1   1
 struct thread_groups {
@@ -851,13 +859,7 @@ static int powerpc_shared_cache_flags(void)
  */
 static const struct cpumask *shared_cache_mask(int cpu)
 {
-	if (shared_caches)
-		return cpu_l2_cache_mask(cpu);
-
-	if (has_big_cores)
-		return cpu_smallcore_mask(cpu);
-
-	return cpu_smt_mask(cpu);
+	return per_cpu(cpu_l2_cache_map, cpu);
 }
 
 #ifdef CONFIG_SCHED_SMT
@@ -867,11 +869,16 @@ static const struct cpumask *smallcore_smt_mask(int cpu)
 }
 #endif
 
+static const struct cpumask *cpu_bigcore_mask(int cpu)
+{
+	return cpu_core_mask(cpu);
+}
+
 static struct sched_domain_topology_level powerpc_topology[] = {
 #ifdef CONFIG_SCHED_SMT
 	{ cpu_smt_mask, powerpc_smt_flags, SD_INIT_NAME(SMT) },
 #endif
-	{ shared_cache_mask, powerpc_shared_cache_flags, SD_INIT_NAME(CACHE) },
+	{ cpu_bigcore_mask, SD_INIT_NAME(BIGCORE) },
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
@@ -895,7 +902,7 @@ static int init_big_cores(void)
 
 #ifdef CONFIG_SCHED_SMT
 	pr_info("Big cores detected. Using small core scheduling\n");
-	powerpc_topology[0].mask = smallcore_smt_mask;
+	powerpc_topology[smt_idx].mask = smallcore_smt_mask;
 #endif
 
 	return 0;
@@ -1319,7 +1326,6 @@ static void add_cpu_to_masks(int cpu)
 void start_secondary(void *unused)
 {
 	unsigned int cpu = smp_processor_id();
-	struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
 
 	mmgrab(&init_mm);
 	current->active_mm = &init_mm;
@@ -1345,14 +1351,20 @@ void start_secondary(void *unused)
 	/* Update topology CPU masks */
 	add_cpu_to_masks(cpu);
 
-	if (has_big_cores)
-		sibling_mask = cpu_smallcore_mask;
 	/*
 	 * Check for any shared caches. Note that this must be done on a
 	 * per-core basis because one core in the pair might be disabled.
 	 */
-	if (!cpumask_equal(cpu_l2_cache_mask(cpu), sibling_mask(cpu)))
-		shared_caches = true;
+	if (!shared_caches) {
+		struct cpumask *(*sibling_mask)(int) = cpu_sibling_mask;
+		struct cpumask *mask = cpu_l2_cache_mask(cpu);
+
+		if (has_big_cores)
+			sibling_mask = cpu_smallcore_mask;
+
+		if (cpumask_weight(mask) > cpumask_weight(sibling_mask(cpu)))
+			shared_caches = true;
+	}
 
 	set_numa_node(numa_cpu_lookup_table[cpu]);
 	set_numa_mem(local_memory_node(numa_cpu_lookup_table[cpu]));
@@ -1390,6 +1402,14 @@ void __init smp_cpus_done(unsigned int max_cpus)
 		smp_ops->bringup_done();
 
 	dump_numa_cpu_topology();
+	if (shared_caches) {
+		pr_info("Using shared cache scheduler topology\n");
+		powerpc_topology[bigcore_idx].mask = shared_cache_mask;
+#ifdef CONFIG_SCHED_DEBUG
+		powerpc_topology[bigcore_idx].name = "CACHE";
+#endif
+		powerpc_topology[bigcore_idx].sd_flags = powerpc_shared_cache_flags;
+	}
 
 	set_sched_topology(powerpc_topology);
 }
-- 
2.17.1



* [PATCH 05/11] powerpc/smp: Dont assume l2-cache to be superset of sibling
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

Current code assumes that the cpumask of cpus sharing an l2-cache will
always be a superset of cpu_sibling_mask.

Let's drop that assumption.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 28 +++++++++++++++-------------
 1 file changed, 15 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 7d430fc536cc..875f57e41355 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -1198,6 +1198,7 @@ static bool update_mask_by_l2(int cpu, struct cpumask *(*mask_fn)(int))
 	struct device_node *l2_cache, *np;
 	int i;
 
+	cpumask_set_cpu(cpu, mask_fn(cpu));
 	l2_cache = cpu_to_l2cache(cpu);
 	if (!l2_cache)
 		return false;
@@ -1284,29 +1285,30 @@ static void add_cpu_to_masks(int cpu)
 	 * add it to it's own thread sibling mask.
 	 */
 	cpumask_set_cpu(cpu, cpu_sibling_mask(cpu));
+	cpumask_set_cpu(cpu, cpu_core_mask(cpu));
 
 	for (i = first_thread; i < first_thread + threads_per_core; i++)
 		if (cpu_online(i))
 			set_cpus_related(i, cpu, cpu_sibling_mask);
 
 	add_cpu_to_smallcore_masks(cpu);
-	/*
-	 * Copy the thread sibling mask into the cache sibling mask
-	 * and mark any CPUs that share an L2 with this CPU.
-	 */
-	for_each_cpu(i, cpu_sibling_mask(cpu))
-		set_cpus_related(cpu, i, cpu_l2_cache_mask);
 	update_mask_by_l2(cpu, cpu_l2_cache_mask);
 
-	/*
-	 * Copy the cache sibling mask into core sibling mask and mark
-	 * any CPUs on the same chip as this CPU.
-	 */
-	for_each_cpu(i, cpu_l2_cache_mask(cpu))
-		set_cpus_related(cpu, i, cpu_core_mask);
+	if (pkg_id == -1) {
+		struct cpumask *(*mask)(int) = cpu_sibling_mask;
+
+		/*
+		 * Copy the sibling mask into core sibling mask and
+		 * mark any CPUs on the same chip as this CPU.
+		 */
+		if (shared_caches)
+			mask = cpu_l2_cache_mask;
+
+		for_each_cpu(i, mask(cpu))
+			set_cpus_related(cpu, i, cpu_core_mask);
 
-	if (pkg_id == -1)
 		return;
+	}
 
 	for_each_cpu(i, cpu_online_mask)
 		if (get_physical_package_id(i) == pkg_id)
-- 
2.17.1



* [PATCH 04/11] powerpc/smp: Enable small core scheduling sooner
From: Srikar Dronamraju @ 2020-07-14  4:36 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Nathan Lynch, Gautham R Shenoy, Oliver OHalloran, Michael Neuling,
	Srikar Dronamraju, Michael Ellerman, Anton Blanchard,
	linuxppc-dev, Nick Piggin
In-Reply-To: <20200714043624.5648-1-srikar@linux.vnet.ibm.com>

Enable small core scheduling as soon as we detect that we are in a
system that supports thread groups. Doing so avoids a redundant
check.

Cc: linuxppc-dev <linuxppc-dev@lists.ozlabs.org>
Cc: Michael Ellerman <michaele@au1.ibm.com>
Cc: Nick Piggin <npiggin@au1.ibm.com>
Cc: Oliver OHalloran <oliveroh@au1.ibm.com>
Cc: Nathan Lynch <nathanl@linux.ibm.com>
Cc: Michael Neuling <mikey@linux.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Gautham R Shenoy <ego@linux.vnet.ibm.com>
Cc: Vaidyanathan Srinivasan <svaidy@linux.ibm.com>
Signed-off-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/smp.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/smp.c b/arch/powerpc/kernel/smp.c
index 24529f6134aa..7d430fc536cc 100644
--- a/arch/powerpc/kernel/smp.c
+++ b/arch/powerpc/kernel/smp.c
@@ -892,6 +892,12 @@ static int init_big_cores(void)
 	}
 
 	has_big_cores = true;
+
+#ifdef CONFIG_SCHED_SMT
+	pr_info("Big cores detected. Using small core scheduling\n");
+	powerpc_topology[0].mask = smallcore_smt_mask;
+#endif
+
 	return 0;
 }
 
@@ -1383,12 +1389,6 @@ void __init smp_cpus_done(unsigned int max_cpus)
 
 	dump_numa_cpu_topology();
 
-#ifdef CONFIG_SCHED_SMT
-	if (has_big_cores) {
-		pr_info("Big cores detected but using small core scheduling\n");
-		powerpc_topology[0].mask = smallcore_smt_mask;
-	}
-#endif
 	set_sched_topology(powerpc_topology);
 }
 
-- 
2.17.1


