LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH v2 15/16] powerpc/powernv/sriov: Make single PE mode a per-BAR setting
From: Michael Ellerman @ 2020-08-02 13:12 UTC (permalink / raw)
  To: Nathan Chancellor, Oliver O'Halloran; +Cc: clang-built-linux, linuxppc-dev
In-Reply-To: <20200801061823.GA1203340@ubuntu-n2-xlarge-x86>

Nathan Chancellor <natechancellor@gmail.com> writes:
> On Wed, Jul 22, 2020 at 04:57:14PM +1000, Oliver O'Halloran wrote:
>> Using single PE BARs to map an SR-IOV BAR is really a choice about what
>> strategy to use when mapping a BAR. It doesn't make much sense for this to
>> be a global setting since a device might have one large BAR which needs to
>> be mapped with single PE windows and another smaller BAR that can be mapped
>> with a regular segmented window. Make the segmented vs single decision a
>> per-BAR setting and clean up the logic that decides which mode to use.
>> 
>> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
>> ---
>> v2: Dropped unused total_vfs variables in pnv_pci_ioda_fixup_iov_resources()
>>     Dropped bar_no from pnv_pci_iov_resource_alignment()
>>     Minor re-wording of comments.
>> ---
>>  arch/powerpc/platforms/powernv/pci-sriov.c | 131 ++++++++++-----------
>>  arch/powerpc/platforms/powernv/pci.h       |  11 +-
>>  2 files changed, 73 insertions(+), 69 deletions(-)
>> 
>> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
>> index ce8ad6851d73..76215d01405b 100644
>> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
>> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
>> @@ -260,42 +256,40 @@ void pnv_pci_ioda_fixup_iov(struct pci_dev *pdev)
>>  resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
>>  						      int resno)
>>  {
>> -	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>>  	struct pnv_iov_data *iov = pnv_iov_get(pdev);
>>  	resource_size_t align;
>>  
>> +	/*
>> +	 * iov can be null if we have an SR-IOV device with IOV BAR that can't
>> +	 * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
>> +	 * In that case we don't allow VFs to be enabled since one of their
>> +	 * BARs would not be placed in the correct PE.
>> +	 */
>> +	if (!iov)
>> +		return align;
>> +	if (!iov->vfs_expanded)
>> +		return align;
>> +
>> +	align = pci_iov_resource_size(pdev, resno);

That's, oof.

> I am not sure if it has been reported yet but clang points out that
> align is initialized after its use:
>
> arch/powerpc/platforms/powernv/pci-sriov.c:267:10: warning: variable 'align' is uninitialized when used here [-Wuninitialized]
>                 return align;
>                        ^~~~~
> arch/powerpc/platforms/powernv/pci-sriov.c:258:23: note: initialize the variable 'align' to silence this warning
>         resource_size_t align;
>                              ^
>                               = 0
> 1 warning generated.

But I can't get gcc to warn about it?

It produces some code, so it's not like the whole function has been
elided or something. I'm confused.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Michael Ellerman @ 2020-08-02 12:42 UTC (permalink / raw)
  To: Nathan Lynch, Laurent Dufour; +Cc: tyreld, cheloha, linuxppc-dev
In-Reply-To: <87365723m0.fsf@linux.ibm.com>

Nathan Lynch <nathanl@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>>> Michael Ellerman <mpe@ellerman.id.au> writes:
>>>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>>>>> Laurent Dufour <ldufour@linux.ibm.com> writes:
>>>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>>>> they explicitly yield.
>>>>>>> 
>>>>>>> Rather than placing cond_resched() calls within various
>>>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>>>> expression of the loop macro itself so users can't omit it.
>>>>>>
>>>>>> Is that not too much to call cond_resched() on every LMB?
>>>>>>
>>>>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>>>>
>>>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>>>> this context.
>>>>
>>>> Hmm, mostly.
>>>>
>>>> But there are quite a few cases like drmem_update_dt_v1():
>>>>
>>>> 	for_each_drmem_lmb(lmb) {
>>>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>>>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>>>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>>>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>>>
>>>> 		dr_cell++;
>>>> 	}
>>>>
>>>> Which will compile to a pretty tight loop at the moment.
>>>>
>>>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>>>
>>>> And although the actual TIF check is cheap the function call to do it is
>>>> not free.
>>>>
>>>> So I worry this is going to make some of those long loops take even
>>>> longer.
>>>
>>> That's fair, and I was wrong - some of the loop bodies are relatively
>>> simple, not doing allocations or taking locks, etc.
>>>
>>> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
>>> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().
>>
>> If we did that, how many call-sites would need converting?
>> Is it ~2 or ~20 or ~200?
>
> At a glance I would convert 15-20 out of the 24 users in the tree I'm
> looking at. Let me know if I should do a v2 with that approach.

OK, that's a bunch of churn then, if we're planning to rework the code
significantly in the near future.

One thought, which I possibly should not put in writing, is that we
could use the alignment of the pointer as a poor man's substitute for a
counter, eg:

+static inline struct drmem_lmb *drmem_lmb_next(struct drmem_lmb *lmb)
+{
+	if (lmb % PAGE_SIZE == 0)
+		cond_resched();
+
+	return ++lmb;
+}

I think the lmbs are allocated in a block, so I think that will work.
Maybe PAGE_SIZE is not the right size to use, but you get the idea.

Gross I know, but might be OK as short term solution?

cheers

^ permalink raw reply

* [merge] Build failure selftest/powerpc/mm/pkey_exec_prot
From: Sachin Sant @ 2020-08-02 11:15 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: sandipan

pkey_exec_prot test from linuxppc merge branch (3f68564f1f5a) fails to
build due to following error:

gcc -std=gnu99 -O2 -Wall -Werror -DGIT_VERSION='"v5.8-rc7-1276-g3f68564f1f5a"' -I/home/sachin/linux/tools/testing/selftests/powerpc/include  -m64    pkey_exec_prot.c /home/sachin/linux/tools/testing/selftests/kselftest_harness.h /home/sachin/linux/tools/testing/selftests/kselftest.h ../harness.c ../utils.c  -o /home/sachin/linux/tools/testing/selftests/powerpc/mm/pkey_exec_prot
In file included from pkey_exec_prot.c:18:
/home/sachin/linux/tools/testing/selftests/powerpc/include/pkeys.h:34: error: "SYS_pkey_mprotect" redefined [-Werror]
 #define SYS_pkey_mprotect 386
 
In file included from /usr/include/sys/syscall.h:31,
                 from /home/sachin/linux/tools/testing/selftests/powerpc/include/utils.h:47,
                 from /home/sachin/linux/tools/testing/selftests/powerpc/include/pkeys.h:12,
                 from pkey_exec_prot.c:18:
/usr/include/bits/syscall.h:1583: note: this is the location of the previous definition
 # define SYS_pkey_mprotect __NR_pkey_mprotect

commit 128d3d021007 introduced this error.
selftests/powerpc: Move pkey helpers to headers

Possibly the # defines for sys calls can be retained in pkey_exec_prot.c or


Thanks
-Sachin

^ permalink raw reply

* powerpc: build failures in Linus' tree
From: Stephen Rothwell @ 2020-08-02 10:48 UTC (permalink / raw)
  To: Michael Ellerman, Linux-kernel Mailing List
  Cc: Linus Torvalds, PowerPC, Willy Tarreau

[-- Attachment #1: Type: text/plain, Size: 819 bytes --]

Hi all,

We are getting build failures in some PowerPC configs for Linus' tree.
See e.g. http://kisskb.ellerman.id.au/kisskb/buildresult/14306515/

In file included from /kisskb/src/arch/powerpc/include/asm/paca.h:18,
                 from /kisskb/src/arch/powerpc/include/asm/percpu.h:13,
                 from /kisskb/src/include/linux/random.h:14,
                 from /kisskb/src/include/linux/net.h:18,
                 from /kisskb/src/net/ipv6/ip6_fib.c:20:
/kisskb/src/arch/powerpc/include/asm/mmu.h:139:22: error: unknown type name 'next_tlbcam_idx'
  139 | DECLARE_PER_CPU(int, next_tlbcam_idx);

I assume this is caused by commit

  1c9df907da83 ("random: fix circular include dependency on arm64 after addition of percpu.h")

But I can't see how, sorry.

-- 
Cheers,
Stephen Rothwell

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* Re: [PATCH v2] ASoC: fsl-asoc-card: Remove fsl_asoc_card_set_bias_level function
From: Nicolin Chen @ 2020-08-02  6:43 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: Linux-ALSA, Timur Tabi, Xiubo Li, Fabio Estevam, Shengjiu Wang,
	Takashi Iwai, Liam Girdwood, Mark Brown, linuxppc-dev,
	linux-kernel
In-Reply-To: <CAA+D8AMM90bt_WbPCny6C=R=dv6gXXh49p59yng2vH7DDuD2PQ@mail.gmail.com>

On Sun, Aug 02, 2020 at 10:22:35AM +0800, Shengjiu Wang wrote:

> > > +     /* Specific configuration for PLL */
> > > +     if (codec_priv->pll_id && codec_priv->fll_id) {
> > > +             if (priv->sample_format == SNDRV_PCM_FORMAT_S24_LE)
> > > +                     pll_out = priv->sample_rate * 384;
> > > +             else
> > > +                     pll_out = priv->sample_rate * 256;
> > > +
> > > +             ret = snd_soc_dai_set_pll(asoc_rtd_to_codec(rtd, 0),
> > > +                                       codec_priv->pll_id,
> > > +                                       codec_priv->mclk_id,
> > > +                                       codec_priv->mclk_freq, pll_out);
> > > +             if (ret) {
> > > +                     dev_err(dev, "failed to start FLL: %d\n", ret);
> > > +                     goto out;
> > > +             }
> > > +
> > > +             ret = snd_soc_dai_set_sysclk(asoc_rtd_to_codec(rtd, 0),
> > > +                                          codec_priv->fll_id,
> > > +                                          pll_out, SND_SOC_CLOCK_IN);
> >
> > Just came into my mind: do we need some protection here to prevent
> > PLL/SYSCLK reconfiguration if TX/RX end up with different values?
> >
> Sorry,  not really catching your point. could you please elaborate?
> Why do TX/RX end up with different values?

If TX and RX run concurrently but in different sample rates or
sample formats, pll_out would be overwritten to PLL/SYSCLK?

I remember imx-wm8962 uses SSI, having symmetric flags for rates/
channels/samplebits, but fsl-asoc-card might have (or will have)
other use case.

If all existing combinations don't have any problem, we can add
a protection later when we need.

^ permalink raw reply

* Re: [PATCH v2] ASoC: fsl-asoc-card: Remove fsl_asoc_card_set_bias_level function
From: Shengjiu Wang @ 2020-08-02  2:22 UTC (permalink / raw)
  To: Nicolin Chen
  Cc: Linux-ALSA, Timur Tabi, Xiubo Li, Fabio Estevam, Shengjiu Wang,
	Takashi Iwai, Liam Girdwood, Mark Brown, linuxppc-dev,
	linux-kernel
In-Reply-To: <20200801075954.GA19629@Asurada-Nvidia>

On Sat, Aug 1, 2020 at 4:01 PM Nicolin Chen <nicoleotsuka@gmail.com> wrote:
>
> Hi,
>
> Having two nits and one question, inline:
>
> On Thu, Jul 30, 2020 at 05:47:02PM +0800, Shengjiu Wang wrote:
> > @@ -182,6 +180,69 @@ static int fsl_asoc_card_hw_params(struct snd_pcm_substream *substream,
> >                                              cpu_priv->slot_width);
> >               if (ret && ret != -ENOTSUPP) {
> >                       dev_err(dev, "failed to set TDM slot for cpu dai\n");
> > +                     goto out;
> > +             }
> > +     }
> > +
> > +     /* Specific configuration for PLL */
> > +     if (codec_priv->pll_id && codec_priv->fll_id) {
> > +             if (priv->sample_format == SNDRV_PCM_FORMAT_S24_LE)
> > +                     pll_out = priv->sample_rate * 384;
> > +             else
> > +                     pll_out = priv->sample_rate * 256;
> > +
> > +             ret = snd_soc_dai_set_pll(asoc_rtd_to_codec(rtd, 0),
> > +                                       codec_priv->pll_id,
> > +                                       codec_priv->mclk_id,
> > +                                       codec_priv->mclk_freq, pll_out);
> > +             if (ret) {
> > +                     dev_err(dev, "failed to start FLL: %d\n", ret);
> > +                     goto out;
> > +             }
> > +
> > +             ret = snd_soc_dai_set_sysclk(asoc_rtd_to_codec(rtd, 0),
> > +                                          codec_priv->fll_id,
> > +                                          pll_out, SND_SOC_CLOCK_IN);
>
> Just came into my mind: do we need some protection here to prevent
> PLL/SYSCLK reconfiguration if TX/RX end up with different values?
>
Sorry,  not really catching your point. could you please elaborate?
Why do TX/RX end up with different values?

best regards
wang shengiu
> > +     return 0;
> > +
> > +out:
> > +     priv->streams &= ~BIT(substream->stream);
> > +     return ret;
>
> Rather than "out:" which doesn't explicitly indicate an error-out,
> "fail:" would be better, following what we used in probe().
>
> > +static int fsl_asoc_card_hw_free(struct snd_pcm_substream *substream)
> > +{
> > +     struct snd_soc_pcm_runtime *rtd = substream->private_data;
> > +     struct fsl_asoc_card_priv *priv = snd_soc_card_get_drvdata(rtd->card);
> > +     struct codec_priv *codec_priv = &priv->codec_priv;
> > +     struct device *dev = rtd->card->dev;
> > +     int ret;
> > +
> > +     priv->streams &= ~BIT(substream->stream);
> > +
>
> > +     if (!priv->streams && codec_priv->pll_id &&
> > +         codec_priv->fll_id) {
>
> This now can fit into single line :)

^ permalink raw reply

* Re: ASMedia ASM2142 USB host controller tries to DMA to address zero when doing bulk reads from multiple devices
From: Forest Crossman @ 2020-08-01 21:57 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: Alexey Kardashevskiy, linux-usb, linuxppc-dev
In-Reply-To: <CAOSf1CEkHLamLXK3HOAZ+w0K=2hTOjn=x5KpDdmRZ4BXVy+P2A@mail.gmail.com>

On Wed, Jul 29, 2020 at 8:22 AM Oliver O'Halloran <oohall@gmail.com> wrote:
>
> On Tue, Jul 21, 2020 at 3:51 PM Forest Crossman <cyrozap@gmail.com> wrote:
> >
> > Hello, again!
> >
> > After fixing the issue in my previous thread using this patch[1], I
> > decided to do some stress-testing of the controller to make sure it
> > could handle my intended workloads and that there were no further DMA
> > address issues that would need to be fixed. Unfortunately, it looks
> > like there's still more work to be done: when I try to do long bulk
> > reads from multiple devices simultaneously, eventually the host
> > controller sends a DMA write to address zero, which then triggers EEH
> > in my POWER9 system, causing the controller card to get hotplug-reset,
> > which of course kills the disk-reading processes. For more details on
> > the EEH errors, you can see my kernel's EEH message log[2].
>
> Take the logged address with a grain of salt. If an error occurs while
> translating the DMA address the PHB logs all zeros as the "DMA
> Address" because it only keeps around the bits that it needs to fetch
> the next level of the TCE table. The EEH dump says the error is due to
> a TCE permission mis-match so odds the ASmedia controller is writing
> to an address that's already been DMA unmapped, hence the logged
> address being zeros.

Interesting, that's good to know. I saw that the RXE_TCE_FESR had the
"TCE Response Page Access Error" bit set, and had originally assumed
that just meant the DMA to address zero was triggering that error
since it wasn't in a mapped page, but after reading that bit's
description again I think I understand it now.

> Sorry, I probably should have mentioned that quirk in the last mail.
>
> > The results of the various tests I performed are listed below.
> >
> > Test results (all failures are due to DMA writes to address zero, all
> > hubs are USB 3.0/3.1 Gen1 only, and all disks are accessed via the
> > usb-storage driver):
> > - Reading simultaneously from two or more disks behind a hub connected
> > to one port on the host controller:
> >   - FAIL after 20-50 GB of data transferred for each device.
> > - Reading simultaneously from two disks, each connected directly to
> > one port on the host controller:
> >   - FAIL after about 800 GB of data transferred for each device.
> > - Reading from one disk behind a hub connected to one port on the host
> > controller:
> >   - OK for at least 2.7 TB of data transferred (I didn't test the
> > whole 8 TB disk).
> > - Writing simultaneously to two FL2000 dongles (using osmo-fl2k's
> > "fl2k_test"), each connected directly to one port on the host
> > controller:
> >   - OK, was able to write several dozen terabytes to each device over
> > the course of a little over 21 hours.
> >
> > Seeing how simultaneous writes to multiple devices and reads from
> > single devices both seem to work fine, I assume that means this is
> > being caused by some race condition in the host controller firmware
> > when it responds to multiple read requests.
>
> Most likely. It's possible it's a platform specific race with DMA
> map/unmap too, but I think we would be seeing similar issues with
> other devices if it was.

Yeah, I have several other xHCI controllers connected to this system,
and I've never experienced this issue with any of them. If the problem
was a POWER-specific DMA map/unmap race I would expect to be having
problems with those controllers as well.

> > I also assume we're not
> > going to be able to convince ASMedia to both fix the bug in their
> > firmware and release the details on how to flash it from Linux, so I
> > guess we'll just have to figure out how to make the driver talk to the
> > controller in a way that avoids triggering the bad DMA write. As
> > before, I decided to try a little kernel hacking of my own before
> > sending this email, and tried separately enabling the
> > XHCI_BROKEN_STREAMS and XHCI_ASMEDIA_MODIFY_FLOWCONTROL quirks in an
> > attempt to fix this. As you might expect since you're reading this
> > message, neither of those quirks fixed the issue, nor did they even
> > make the transfers last any longer before failing.
> >
> > So now I've reached the limits of my understanding, and I need some
> > help devising a fix. If anyone has any comments to that effect, or any
> > questions about my hardware configuration, testing methodology, etc.,
> > please don't hesitate to air them. Also, if anyone needs me to perform
> > additional tests, or collect more log information, I'd be happy to do
> > that as well.
>
> I started writing a tool a while ago to use the internal trace bus to
> log incoming TLPs. Something like that might allow you to get a better
> idea what the faulting access pattern is, but you would still need to
> find a way to mitigate the issue. I'm not all that familiar with USB3
> so I'm not much help on that front.

Oh, interesting, I remember seeing the trace registers in the PHB4
spec, but I wasn't sure how to access them without writing a kernel
driver. I'd love to be able to log and dissect TLPs in Wireshark the
same way I do for USB packets, since it makes reverse engineering
protocols and debugging drivers so much easier. This would also be
especially helpful because I haven't yet figured out how to get Qemu
to intercept certain types of PCIe accesses (I forget if it was DMA or
PIO or something, it was a quite while ago), and PCIe protocol
analyzer hardware is prohibitively expensive (when it's even available
for purchase). So if you've uploaded your code anywhere, I'd be really
interested in seeing it, even if it's incomplete, since even with
incomplete code I could use that as a reference for a Wireshark plugin
or something. But if it's not online or if you'd prefer not to share
it in its current state, I'll understand.


Thanks again for your help,

Forest

^ permalink raw reply

* Scott Wood fix for SysRQ Crash over Serial Console on Linux PowerPC 2.6.32 kernel?
From: thefirst ECS @ 2020-08-01 21:41 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: scottwood
In-Reply-To: <11049609.25021.1596315320093.JavaMail.Dan@DanHP>

Hello,

I've been fighting with SysRq over serial for a while now and can't win, regardless of where I call emergency_restart or __handle_sysrq('b', NULL, 0); I can't seem to get it to reboot to U-Boot but instead I get a hung/crashed system:

SysRq : Resetting
Kernel BUG at c00bbe80 [verbose debug info unavailable]
Oops: Exception in kernel mode, sig: 5 [#1]
...

The NIP is always c00bbe80. 

The strange part is that echo b > /proc/sysrq-trigger works perfectly fine and quickly soft-reboots system.

Does writing to procfs somehow FIX some issue created when sending a BREAK signal over serial console? 

Because I don't get it, I see that the same exact code is called inside drivers/char/sysrq.c ie by __handle_sysrq(key, NULL, 0); (which is executed upon write_sysrq_trigger via echo b > /proc/sysrq-trigger) as when I also manually call it via __handle_sysrq('b', NULL, 0); 

And I hard coded to 'b' when I called it myself from drivers/serial/8250.c or sysrq.c due to another issue with normal way somehow not waiting or giving the supposed "5 seconds" to enter a letter after the BREAK signal. So it was almost impossible to trigger BREAK via Ctrl-A B by gnuscreen.

I can't figure out why the same code is called but giving completely different outcomes, other than maybe sending BREAK signal over serial console changes something with some interrupts in a way that's then magically fixed/undone by userland writing to procfs /proc/sysrq-trigger?

Or maybe some sort of race condition since I'm triggering the actual "b" reboot more or less at same time as the actual "BREAK" signal? (Because otherwise with original kernel code I was only once able to trigger SysRq reboot and that crashed same way, I triggered it by pressing b very quick after Ctrl-B i think. Otherwise usually I would just get SyRq: HELP menu and the 'b' would be printed to my terminal...)

I stumbled upon an old post at:
https://www.mail-archive.com/linuxppc-dev@lists.ozlabs.org/msg75979.html

And in there Scott Wood said he had done some fix for SysRq over serial console it seems for powerpc but apparently my kernel doesn't have his fix since seems our dts doesn't have "compatible = "fsl,ns16550", "ns16550";"  but only has "ns16550" and my kernel (2.6.32) config doesn't have any CONFIG_SERIAL_8250_FSL. I was wondering if anyone knew if my issue would be helped by Scott's fix and where I could find it and perhaps try to port it to our kernel.

So again, echo m or c or b etc to /proc/sysrq-trigger works fine but triggering the same __handle_sysrq('b', NULL, 0) separately, crashes/hangs system and requires a hard reboot, power-cycle regardless of where I call it from after detecting a "BREAK" signal over serial console.

Below is an example of a successful reboot via echo b > /proc/sysrq-trigger 

SysRq : Resetting
BUG: sleeping function called from invalid context at mm/slub.c:1719
pcnt: 1 0 in_atomic(): 1, irqs_disabled(): 1, pid: 2471, name: bash
Call Trace:
[ec8d7d10] [c0007ec0] show_stack+0x54/0x16c (unreliable)
[ec8d7d40] [c0037c48] __might_sleep+0xe8/0x10c
[ec8d7d50] [c00c6e00] kmem_cache_alloc+0xa4/0xe0
[ec8d7d80] [c00bbecc] __get_vm_area_node+0x84/0x10c
[ec8d7db0] [c00148e8] __ioremap_caller+0x114/0x124
[ec8d7de0] [c039ab68] ourcustom_spi_init+0x24/0xa0
[ec8d7e20] [c039af3c] ourcustom_spi_reset+0x14/0xac
[ec8d7e40] [c0018f90] fsl_rstcr_restart+0x14/0x18
[ec8d7e50] [c0010114] machine_restart+0x30/0x4c
[ec8d7e60] [c005cd6c] emergency_restart+0x30/0x6c
[ec8d7e70] [c03211dc] sysrq_handle_reboot+0x14/0x24
[ec8d7e80] [c0321418] __handle_sysrq+0xac/0x15c
[ec8d7eb0] [c032153c] write_sysrq_trigger+0x74/0x80
[ec8d7ec0] [c0122704] proc_reg_write+0x80/0xb4
[ec8d7ef0] [c00d3b7c] vfs_write+0xb4/0x184
[ec8d7f10] [c00d3d34] sys_write+0x4c/0x90
[ec8d7f40] [c0011180] ret_from_syscall+0x0/0x3c

COMPARED to an example of when I try to trigger via sending BREAK signal (Ctrl-A b with screen):

NIP [c00bbe80] __get_vm_area_node+0x38/0x10c
LR [c00148e8] __ioremap_caller+0x114/0x124
Call Trace:
[c0667c60] [c004eabc] irq_exit+0x68/0xa4 (unreliable)
[c0667c90] [c00148e8] __ioremap_caller+0x114/0x124
[c0667cc0] [c039abdc] ourcustom_spi_init+0x24/0xa0
[c0667d00] [c039afb0] ourcustom_spi_reset+0x14/0xac
[c0667d20] [c0018f90] fsl_rstcr_restart+0x14/0x18
[c0667d30] [c0010114] machine_restart+0x30/0x4c
[c0667d40] [c005cd6c] emergency_restart+0x30/0x6c
[c0667d50] [c03211dc] sysrq_handle_reboot+0x14/0x24
[c0667d60] [c032143c] __handle_sysrq+0xd0/0x1d0
[c0667d90] [c03286e4] receive_chars+0x298/0x3e4
[c0667de0] [c0329650] serial8250_handle_port+0x84/0xdc
[c0667e00] [c0329958] serial8250_interrupt+0x90/0x128
[c0667e30] [c008e6fc] handle_irq_action+0x8c/0xa0
[c0667e50] [c008e774] handle_IRQ_event+0x64/0x13c
[c0667e80] [c00910e4] handle_fasteoi_irq+0x94/0x17c
[c0667ea0] [c00051b4] do_IRQ+0xc4/0xec
[c0667ec0] [c00118ec] ret_from_except+0x0/0x18
[c0667f80] [c0009310] cpu_idle+0x94/0x184
[c0667fb0] [c00023cc] rest_init+0xa0/0xb4
[c0667fc0] [c06089ec] start_kernel+0x33c/0x350
[c0667ff0] [c00003f4] skpinv+0x30c/0x348
Instruction dump:
bee1000c 542b0024 90010034 7c9f2378 7cbe2b78 7cd93378 800b000c 7cfa3b78 
7d1b4378 7d3c4b78 7d585378 5400016e <0f000000> 70a00001 41820018 7c600034 
Kernel panic - not syncing: Fatal exception in interrupt
Rebooting in 5 seconds..
------------[ cut here ]------------
Kernel BUG at c00bbe80 [verbose debug info unavailable]
Oops: Exception in kernel mode, sig: 5 [#2]
...

Thanks,
     -Dan

^ permalink raw reply

* [powerpc:merge] BUILD SUCCESS 3f68564f1f5aca55654fda237fc01495bf050ce9
From: kernel test robot @ 2020-08-01 14:34 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev

tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git  merge
branch HEAD: 3f68564f1f5aca55654fda237fc01495bf050ce9  Automatic merge of 'master', 'next' and 'fixes' (2020-07-31 22:52)

elapsed time: 1494m

configs tested: 69
configs skipped: 2

The following configs have been built successfully.
More configs may be tested in the coming days.

arm                                 defconfig
arm64                            allyesconfig
arm64                               defconfig
arm                              allyesconfig
arm                              allmodconfig
xtensa                    xip_kc705_defconfig
arm                       versatile_defconfig
arm                             ezx_defconfig
ia64                             allmodconfig
ia64                                defconfig
ia64                             allyesconfig
m68k                             allmodconfig
m68k                                defconfig
m68k                             allyesconfig
nios2                               defconfig
nds32                               defconfig
nios2                            allyesconfig
csky                                defconfig
alpha                               defconfig
alpha                            allyesconfig
xtensa                           allyesconfig
h8300                            allyesconfig
arc                                 defconfig
sh                               allmodconfig
parisc                              defconfig
s390                             allyesconfig
parisc                           allyesconfig
s390                                defconfig
arc                              allyesconfig
nds32                             allnoconfig
c6x                              allyesconfig
i386                             allyesconfig
sparc                            allyesconfig
sparc                               defconfig
i386                                defconfig
mips                             allyesconfig
mips                             allmodconfig
powerpc                             defconfig
powerpc                          allyesconfig
powerpc                          allmodconfig
powerpc                           allnoconfig
i386                 randconfig-a005-20200731
i386                 randconfig-a004-20200731
i386                 randconfig-a006-20200731
i386                 randconfig-a002-20200731
i386                 randconfig-a001-20200731
i386                 randconfig-a003-20200731
x86_64               randconfig-a015-20200731
x86_64               randconfig-a014-20200731
x86_64               randconfig-a016-20200731
x86_64               randconfig-a012-20200731
x86_64               randconfig-a013-20200731
x86_64               randconfig-a011-20200731
i386                 randconfig-a016-20200731
i386                 randconfig-a012-20200731
i386                 randconfig-a014-20200731
i386                 randconfig-a015-20200731
i386                 randconfig-a011-20200731
i386                 randconfig-a013-20200731
riscv                            allyesconfig
riscv                             allnoconfig
riscv                               defconfig
riscv                            allmodconfig
x86_64                                   rhel
x86_64                           allyesconfig
x86_64                    rhel-7.6-kselftests
x86_64                              defconfig
x86_64                               rhel-8.3
x86_64                                  kexec

---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org

^ permalink raw reply

* Re: [PATCH 06/15] powerpc: fadamp: simplify fadump_reserve_crash_area()
From: Hari Bathini @ 2020-08-01 10:53 UTC (permalink / raw)
  To: Mike Rapoport, Michael Ellerman
  Cc: linux-sh, Peter Zijlstra, Catalin Marinas, Dave Hansen,
	Hari Bathini, linux-kernel, Max Filippov, Paul Mackerras,
	sparclinux, linux-riscv, Will Deacon, Thomas Gleixner,
	Marek Szyprowski, linux-s390, linux-c6x-dev, Yoshinori Sato, x86,
	Russell King, Mike Rapoport, clang-built-linux, Ingo Molnar,
	Christoph Hellwig, uclinux-h8-devel, linux-xtensa, openrisc,
	Borislav Petkov, Andy Lutomirski, Paul Walmsley, Stafford Horne,
	linux-arm-kernel, Michal Simek, linux-mm, linux-mips, iommu,
	Palmer Dabbelt, Andrew Morton, linuxppc-dev
In-Reply-To: <20200801101854.GD534153@kernel.org>



On 01/08/20 3:48 pm, Mike Rapoport wrote:
> On Thu, Jul 30, 2020 at 10:15:13PM +1000, Michael Ellerman wrote:
>> Mike Rapoport <rppt@kernel.org> writes:
>>> From: Mike Rapoport <rppt@linux.ibm.com>
>>>
>>> fadump_reserve_crash_area() reserves memory from a specified base address
>>> till the end of the RAM.
>>>
>>> Replace iteration through the memblock.memory with a single call to
>>> memblock_reserve() with appropriate  that will take care of proper memory
>>                                       ^
>>                                       parameters?
>>> reservation.
>>>
>>> Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
>>> ---
>>>   arch/powerpc/kernel/fadump.c | 20 +-------------------
>>>   1 file changed, 1 insertion(+), 19 deletions(-)
>>
>> I think this looks OK to me, but I don't have a setup to test it easily.
>> I've added Hari to Cc who might be able to.
>>
>> But I'll give you an ack in the hope that it works :)
> 
> Actually, I did some digging in the git log and the traversal was added
> there on purpose by the commit b71a693d3db3 ("powerpc/fadump: exclude
> memory holes while reserving memory in second kernel")

I was about to comment on the same :)
memblock_reserve() was being used until we ran into the issue talked 
about in the above commit...

> Presuming this is still reqruired I'm going to drop this patch and will

Yeah, it is still required..

> simply replace for_each_memblock() with for_each_mem_range() in v2.

Sounds right.

>   
>> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
>>
>>
>>> diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
>>> index 78ab9a6ee6ac..2446a61e3c25 100644
>>> --- a/arch/powerpc/kernel/fadump.c
>>> +++ b/arch/powerpc/kernel/fadump.c
>>> @@ -1658,25 +1658,7 @@ int __init fadump_reserve_mem(void)
>>>   /* Preserve everything above the base address */
>>>   static void __init fadump_reserve_crash_area(u64 base)
>>>   {
>>> -	struct memblock_region *reg;
>>> -	u64 mstart, msize;
>>> -
>>> -	for_each_memblock(memory, reg) {
>>> -		mstart = reg->base;
>>> -		msize  = reg->size;
>>> -
>>> -		if ((mstart + msize) < base)
>>> -			continue;
>>> -
>>> -		if (mstart < base) {
>>> -			msize -= (base - mstart);
>>> -			mstart = base;
>>> -		}
>>> -
>>> -		pr_info("Reserving %lluMB of memory at %#016llx for preserving crash data",
>>> -			(msize >> 20), mstart);
>>> -		memblock_reserve(mstart, msize);
>>> -	}
>>> +	memblock_reserve(base, memblock_end_of_DRAM() - base);
>>>   }
>>>   
>>>   unsigned long __init arch_reserved_kernel_pages(void)
>>> -- 
>>> 2.26.2
> 

Thanks
Hari

^ permalink raw reply

* Re: [PATCH 06/15] powerpc: fadamp: simplify fadump_reserve_crash_area()
From: Mike Rapoport @ 2020-08-01 10:18 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linux-sh, Peter Zijlstra, Dave Hansen, Hari Bathini, linux-mips,
	Max Filippov, Paul Mackerras, sparclinux, linux-riscv,
	Will Deacon, Stafford Horne, Marek Szyprowski, linux-s390,
	linux-c6x-dev, Yoshinori Sato, x86, Russell King, Mike Rapoport,
	clang-built-linux, Ingo Molnar, Catalin Marinas, uclinux-h8-devel,
	linux-xtensa, openrisc, Borislav Petkov, Andy Lutomirski,
	Paul Walmsley, Thomas Gleixner, linux-arm-kernel, Michal Simek,
	linux-mm, linuxppc-dev, linux-kernel, iommu, Palmer Dabbelt,
	Andrew Morton, Christoph Hellwig
In-Reply-To: <87d04d5hda.fsf@mpe.ellerman.id.au>

On Thu, Jul 30, 2020 at 10:15:13PM +1000, Michael Ellerman wrote:
> Mike Rapoport <rppt@kernel.org> writes:
> > From: Mike Rapoport <rppt@linux.ibm.com>
> >
> > fadump_reserve_crash_area() reserves memory from a specified base address
> > till the end of the RAM.
> >
> > Replace iteration through the memblock.memory with a single call to
> > memblock_reserve() with appropriate  that will take care of proper memory
>                                      ^
>                                      parameters?
> > reservation.
> >
> > Signed-off-by: Mike Rapoport <rppt@linux.ibm.com>
> > ---
> >  arch/powerpc/kernel/fadump.c | 20 +-------------------
> >  1 file changed, 1 insertion(+), 19 deletions(-)
> 
> I think this looks OK to me, but I don't have a setup to test it easily.
> I've added Hari to Cc who might be able to.
> 
> But I'll give you an ack in the hope that it works :)

Actually, I did some digging in the git log and the traversal was added
there on purpose by the commit b71a693d3db3 ("powerpc/fadump: exclude
memory holes while reserving memory in second kernel")
Presuming this is still reqruired I'm going to drop this patch and will
simply replace for_each_memblock() with for_each_mem_range() in v2.
 
> Acked-by: Michael Ellerman <mpe@ellerman.id.au>
> 
> 
> > diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
> > index 78ab9a6ee6ac..2446a61e3c25 100644
> > --- a/arch/powerpc/kernel/fadump.c
> > +++ b/arch/powerpc/kernel/fadump.c
> > @@ -1658,25 +1658,7 @@ int __init fadump_reserve_mem(void)
> >  /* Preserve everything above the base address */
> >  static void __init fadump_reserve_crash_area(u64 base)
> >  {
> > -	struct memblock_region *reg;
> > -	u64 mstart, msize;
> > -
> > -	for_each_memblock(memory, reg) {
> > -		mstart = reg->base;
> > -		msize  = reg->size;
> > -
> > -		if ((mstart + msize) < base)
> > -			continue;
> > -
> > -		if (mstart < base) {
> > -			msize -= (base - mstart);
> > -			mstart = base;
> > -		}
> > -
> > -		pr_info("Reserving %lluMB of memory at %#016llx for preserving crash data",
> > -			(msize >> 20), mstart);
> > -		memblock_reserve(mstart, msize);
> > -	}
> > +	memblock_reserve(base, memblock_end_of_DRAM() - base);
> >  }
> >  
> >  unsigned long __init arch_reserved_kernel_pages(void)
> > -- 
> > 2.26.2

-- 
Sincerely yours,
Mike.

^ permalink raw reply

* Re: [PATCH v2] ASoC: fsl-asoc-card: Remove fsl_asoc_card_set_bias_level function
From: Nicolin Chen @ 2020-08-01  7:59 UTC (permalink / raw)
  To: Shengjiu Wang
  Cc: alsa-devel, timur, Xiubo.Lee, linuxppc-dev, tiwai, lgirdwood,
	perex, broonie, festevam, linux-kernel
In-Reply-To: <1596102422-14010-1-git-send-email-shengjiu.wang@nxp.com>

Hi,

Having two nits and one question, inline:

On Thu, Jul 30, 2020 at 05:47:02PM +0800, Shengjiu Wang wrote:
> @@ -182,6 +180,69 @@ static int fsl_asoc_card_hw_params(struct snd_pcm_substream *substream,
>  					       cpu_priv->slot_width);
>  		if (ret && ret != -ENOTSUPP) {
>  			dev_err(dev, "failed to set TDM slot for cpu dai\n");
> +			goto out;
> +		}
> +	}
> +
> +	/* Specific configuration for PLL */
> +	if (codec_priv->pll_id && codec_priv->fll_id) {
> +		if (priv->sample_format == SNDRV_PCM_FORMAT_S24_LE)
> +			pll_out = priv->sample_rate * 384;
> +		else
> +			pll_out = priv->sample_rate * 256;
> +
> +		ret = snd_soc_dai_set_pll(asoc_rtd_to_codec(rtd, 0),
> +					  codec_priv->pll_id,
> +					  codec_priv->mclk_id,
> +					  codec_priv->mclk_freq, pll_out);
> +		if (ret) {
> +			dev_err(dev, "failed to start FLL: %d\n", ret);
> +			goto out;
> +		}
> +
> +		ret = snd_soc_dai_set_sysclk(asoc_rtd_to_codec(rtd, 0),
> +					     codec_priv->fll_id,
> +					     pll_out, SND_SOC_CLOCK_IN);

Just came into my mind: do we need some protection here to prevent
PLL/SYSCLK reconfiguration if TX/RX end up with different values?

> +	return 0;
> +
> +out:
> +	priv->streams &= ~BIT(substream->stream);
> +	return ret;

Rather than "out:" which doesn't explicitly indicate an error-out,
"fail:" would be better, following what we used in probe().

> +static int fsl_asoc_card_hw_free(struct snd_pcm_substream *substream)
> +{
> +	struct snd_soc_pcm_runtime *rtd = substream->private_data;
> +	struct fsl_asoc_card_priv *priv = snd_soc_card_get_drvdata(rtd->card);
> +	struct codec_priv *codec_priv = &priv->codec_priv;
> +	struct device *dev = rtd->card->dev;
> +	int ret;
> +
> +	priv->streams &= ~BIT(substream->stream);
> +

> +	if (!priv->streams && codec_priv->pll_id &&
> +	    codec_priv->fll_id) {

This now can fit into single line :)

^ permalink raw reply

* Re: [PATCH v2 15/16] powerpc/powernv/sriov: Make single PE mode a per-BAR setting
From: Nathan Chancellor @ 2020-08-01  6:18 UTC (permalink / raw)
  To: Oliver O'Halloran, Michael Ellerman; +Cc: clang-built-linux, linuxppc-dev
In-Reply-To: <20200722065715.1432738-15-oohall@gmail.com>

On Wed, Jul 22, 2020 at 04:57:14PM +1000, Oliver O'Halloran wrote:
> Using single PE BARs to map an SR-IOV BAR is really a choice about what
> strategy to use when mapping a BAR. It doesn't make much sense for this to
> be a global setting since a device might have one large BAR which needs to
> be mapped with single PE windows and another smaller BAR that can be mapped
> with a regular segmented window. Make the segmented vs single decision a
> per-BAR setting and clean up the logic that decides which mode to use.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
> v2: Dropped unused total_vfs variables in pnv_pci_ioda_fixup_iov_resources()
>     Dropped bar_no from pnv_pci_iov_resource_alignment()
>     Minor re-wording of comments.
> ---
>  arch/powerpc/platforms/powernv/pci-sriov.c | 131 ++++++++++-----------
>  arch/powerpc/platforms/powernv/pci.h       |  11 +-
>  2 files changed, 73 insertions(+), 69 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-sriov.c b/arch/powerpc/platforms/powernv/pci-sriov.c
> index ce8ad6851d73..76215d01405b 100644
> --- a/arch/powerpc/platforms/powernv/pci-sriov.c
> +++ b/arch/powerpc/platforms/powernv/pci-sriov.c
> @@ -146,21 +146,17 @@
>  static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  {
>  	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
> -	const resource_size_t gate = phb->ioda.m64_segsize >> 2;
>  	struct resource *res;
>  	int i;
> -	resource_size_t size, total_vf_bar_sz;
> +	resource_size_t vf_bar_sz;
>  	struct pnv_iov_data *iov;
> -	int mul, total_vfs;
> +	int mul;
>  
>  	iov = kzalloc(sizeof(*iov), GFP_KERNEL);
>  	if (!iov)
>  		goto disable_iov;
>  	pdev->dev.archdata.iov_data = iov;
> -
> -	total_vfs = pci_sriov_get_totalvfs(pdev);
>  	mul = phb->ioda.total_pe_num;
> -	total_vf_bar_sz = 0;
>  
>  	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
>  		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> @@ -173,50 +169,50 @@ static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  			goto disable_iov;
>  		}
>  
> -		total_vf_bar_sz += pci_iov_resource_size(pdev,
> -				i + PCI_IOV_RESOURCES);
> +		vf_bar_sz = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>  
>  		/*
> -		 * If bigger than quarter of M64 segment size, just round up
> -		 * power of two.
> +		 * Generally, one segmented M64 BAR maps one IOV BAR. However,
> +		 * if a VF BAR is too large we end up wasting a lot of space.
> +		 * If each VF needs more than 1/4 of the default m64 segment
> +		 * then each VF BAR should be mapped in single-PE mode to reduce
> +		 * the amount of space required. This does however limit the
> +		 * number of VFs we can support.
>  		 *
> -		 * Generally, one M64 BAR maps one IOV BAR. To avoid conflict
> -		 * with other devices, IOV BAR size is expanded to be
> -		 * (total_pe * VF_BAR_size).  When VF_BAR_size is half of M64
> -		 * segment size , the expanded size would equal to half of the
> -		 * whole M64 space size, which will exhaust the M64 Space and
> -		 * limit the system flexibility.  This is a design decision to
> -		 * set the boundary to quarter of the M64 segment size.
> +		 * The 1/4 limit is arbitrary and can be tweaked.
>  		 */
> -		if (total_vf_bar_sz > gate) {
> -			mul = roundup_pow_of_two(total_vfs);
> -			dev_info(&pdev->dev,
> -				"VF BAR Total IOV size %llx > %llx, roundup to %d VFs\n",
> -				total_vf_bar_sz, gate, mul);
> -			iov->m64_single_mode = true;
> -			break;
> -		}
> -	}
> +		if (vf_bar_sz > (phb->ioda.m64_segsize >> 2)) {
> +			/*
> +			 * On PHB3, the minimum size alignment of M64 BAR in
> +			 * single mode is 32MB. If this VF BAR is smaller than
> +			 * 32MB, but still too large for a segmented window
> +			 * then we can't map it and need to disable SR-IOV for
> +			 * this device.
> +			 */
> +			if (vf_bar_sz < SZ_32M) {
> +				pci_err(pdev, "VF BAR%d: %pR can't be mapped in single PE mode\n",
> +					i, res);
> +				goto disable_iov;
> +			}
>  
> -	for (i = 0; i < PCI_SRIOV_NUM_BARS; i++) {
> -		res = &pdev->resource[i + PCI_IOV_RESOURCES];
> -		if (!res->flags || res->parent)
> +			iov->m64_single_mode[i] = true;
>  			continue;
> +		}
>  
> -		size = pci_iov_resource_size(pdev, i + PCI_IOV_RESOURCES);
>  		/*
> -		 * On PHB3, the minimum size alignment of M64 BAR in single
> -		 * mode is 32MB.
> +		 * This BAR can be mapped with one segmented window, so adjust
> +		 * te resource size to accommodate.
>  		 */
> -		if (iov->m64_single_mode && (size < SZ_32M))
> -			goto disable_iov;
> +		pci_dbg(pdev, " Fixing VF BAR%d: %pR to\n", i, res);
> +		res->end = res->start + vf_bar_sz * mul - 1;
> +		pci_dbg(pdev, "                       %pR\n", res);
>  
> -		dev_dbg(&pdev->dev, " Fixing VF BAR%d: %pR to\n", i, res);
> -		res->end = res->start + size * mul - 1;
> -		dev_dbg(&pdev->dev, "                       %pR\n", res);
> -		dev_info(&pdev->dev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
> +		pci_info(pdev, "VF BAR%d: %pR (expanded to %d VFs for PE alignment)",
>  			 i, res, mul);
> +
> +		iov->need_shift = true;
>  	}
> +
>  	iov->vfs_expanded = mul;
>  
>  	return;
> @@ -260,42 +256,40 @@ void pnv_pci_ioda_fixup_iov(struct pci_dev *pdev)
>  resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
>  						      int resno)
>  {
> -	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct pnv_iov_data *iov = pnv_iov_get(pdev);
>  	resource_size_t align;
>  
> +	/*
> +	 * iov can be null if we have an SR-IOV device with IOV BAR that can't
> +	 * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
> +	 * In that case we don't allow VFs to be enabled since one of their
> +	 * BARs would not be placed in the correct PE.
> +	 */
> +	if (!iov)
> +		return align;
> +	if (!iov->vfs_expanded)
> +		return align;
> +
> +	align = pci_iov_resource_size(pdev, resno);

I am not sure if it has been reported yet but clang points out that
align is initialized after its use:

arch/powerpc/platforms/powernv/pci-sriov.c:267:10: warning: variable 'align' is uninitialized when used here [-Wuninitialized]
                return align;
                       ^~~~~
arch/powerpc/platforms/powernv/pci-sriov.c:258:23: note: initialize the variable 'align' to silence this warning
        resource_size_t align;
                             ^
                              = 0
1 warning generated.

> +	/*
> +	 * If we're using single mode then we can just use the native VF BAR
> +	 * alignment. We validated that it's possible to use a single PE
> +	 * window above when we did the fixup.
> +	 */
> +	if (iov->m64_single_mode[resno - PCI_IOV_RESOURCES])
> +		return align;
> +
>  	/*
>  	 * On PowerNV platform, IOV BAR is mapped by M64 BAR to enable the
>  	 * SR-IOV. While from hardware perspective, the range mapped by M64
>  	 * BAR should be size aligned.
>  	 *
> -	 * When IOV BAR is mapped with M64 BAR in Single PE mode, the extra
> -	 * powernv-specific hardware restriction is gone. But if just use the
> -	 * VF BAR size as the alignment, PF BAR / VF BAR may be allocated with
> -	 * in one segment of M64 #15, which introduces the PE conflict between
> -	 * PF and VF. Based on this, the minimum alignment of an IOV BAR is
> -	 * m64_segsize.
> -	 *
>  	 * This function returns the total IOV BAR size if M64 BAR is in
>  	 * Shared PE mode or just VF BAR size if not.
>  	 * If the M64 BAR is in Single PE mode, return the VF BAR size or
>  	 * M64 segment size if IOV BAR size is less.
>  	 */
> -	align = pci_iov_resource_size(pdev, resno);
> -
> -	/*
> -	 * iov can be null if we have an SR-IOV device with IOV BAR that can't
> -	 * be placed in the m64 space (i.e. The BAR is 32bit or non-prefetch).
> -	 * In that case we don't allow VFs to be enabled so just return the
> -	 * default alignment.
> -	 */
> -	if (!iov)
> -		return align;
> -	if (!iov->vfs_expanded)
> -		return align;
> -	if (iov->m64_single_mode)
> -		return max(align, (resource_size_t)phb->ioda.m64_segsize);
> -
>  	return iov->vfs_expanded * align;
>  }
>  
> @@ -450,7 +444,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
>  			continue;
>  
>  		/* don't need single mode? map everything in one go! */
> -		if (!iov->m64_single_mode) {
> +		if (!iov->m64_single_mode[i]) {
>  			win = pnv_pci_alloc_m64_bar(phb, iov);
>  			if (win < 0)
>  				goto m64_failed;
> @@ -543,6 +537,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
>  		if (!res->flags || !res->parent)
>  			continue;
> +		if (iov->m64_single_mode[i])
> +			continue;
>  
>  		/*
>  		 * The actual IOV BAR range is determined by the start address
> @@ -574,6 +570,8 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  		res = &dev->resource[i + PCI_IOV_RESOURCES];
>  		if (!res->flags || !res->parent)
>  			continue;
> +		if (iov->m64_single_mode[i])
> +			continue;
>  
>  		size = pci_iov_resource_size(dev, i + PCI_IOV_RESOURCES);
>  		res2 = *res;
> @@ -619,8 +617,8 @@ static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>  	/* Release VF PEs */
>  	pnv_ioda_release_vf_PE(pdev);
>  
> -	/* Un-shift the IOV BAR resources */
> -	if (!iov->m64_single_mode)
> +	/* Un-shift the IOV BARs if we need to */
> +	if (iov->need_shift)
>  		pnv_pci_vf_resource_shift(pdev, -base_pe);
>  
>  	/* Release M64 windows */
> @@ -738,9 +736,8 @@ static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  	 * the IOV BAR according to the PE# allocated to the VFs.
>  	 * Otherwise, the PE# for the VF will conflict with others.
>  	 */
> -	if (!iov->m64_single_mode) {
> -		ret = pnv_pci_vf_resource_shift(pdev,
> -						base_pe->pe_number);
> +	if (iov->need_shift) {
> +		ret = pnv_pci_vf_resource_shift(pdev, base_pe->pe_number);
>  		if (ret)
>  			goto shift_failed;
>  	}
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 41a6f4e938e4..902e928c7c22 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -243,8 +243,15 @@ struct pnv_iov_data {
>  	/* pointer to the array of VF PEs. num_vfs long*/
>  	struct pnv_ioda_pe *vf_pe_arr;
>  
> -	/* Did we map the VF BARs with single-PE IODA BARs? */
> -	bool    m64_single_mode;
> +	/* Did we map the VF BAR with single-PE IODA BARs? */
> +	bool    m64_single_mode[PCI_SRIOV_NUM_BARS];
> +
> +	/*
> +	 * True if we're using any segmented windows. In that case we need
> +	 * shift the start of the IOV resource the segment corresponding to
> +	 * the allocated PE.
> +	 */
> +	bool    need_shift;
>  
>  	/*
>  	 * Bit mask used to track which m64 windows are used to map the
> -- 
> 2.26.2
> 

^ permalink raw reply

* Re: [RFC PATCH 1/2] powerpc/numa: Introduce logical numa id
From: Srikar Dronamraju @ 2020-08-01  5:20 UTC (permalink / raw)
  To: Aneesh Kumar K.V; +Cc: Nathan Lynch, linuxppc-dev
In-Reply-To: <20200731111916.243569-1-aneesh.kumar@linux.ibm.com>

* Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> [2020-07-31 16:49:14]:

> We use ibm,associativity and ibm,associativity-lookup-arrays to derive the numa
> node numbers. These device tree properties are firmware indicated grouping of
> resources based on their hierarchy in the platform. These numbers (group id) are
> not sequential and hypervisor/firmware can follow different numbering schemes.
> For ex: on powernv platforms, we group them in the below order.
> 
>  *     - CCM node ID
>  *     - HW card ID
>  *     - HW module ID
>  *     - Chip ID
>  *     - Core ID
> 
> Based on ibm,associativity-reference-points we use one of the above group ids as
> Linux NUMA node id. (On PowerNV platform Chip ID is used). This results
> in Linux reporting non-linear NUMA node id and which also results in Linux
> reporting empty node 0 NUMA nodes.
> 

If its just to eliminate node 0, then we have 2 other probably better
solutions.
1. Dont mark node 0 as spl (currently still in mm-tree and a result in
linux-next)
2. powerpc specific: explicitly clear node 0 during numa bringup.

> This can  be resolved by mapping the firmware provided group id to a logical Linux
> NUMA id. In this patch, we do this only for pseries platforms considering the

On PowerVM, as you would know the nid is already a logical or a flattened
chip-id and not the actual hardware chip-id.

> firmware group id is a virtualized entity and users would not have drawn any
> conclusion based on the Linux Numa Node id.
> 
> On PowerNV platform since we have historically mapped Chip ID as Linux NUMA node
> id, we keep the existing Linux NUMA node id numbering.
> 
> Before Fix:
>  # numactl -H
> available: 2 nodes (0-1)
> node 0 cpus:
> node 0 size: 0 MB
> node 0 free: 0 MB
> node 1 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 1 size: 50912 MB
> node 1 free: 45248 MB
> node distances:
> node   0   1
>   0:  10  40
>   1:  40  10
> 
> after fix
>  # numactl  -H
> available: 1 nodes (0)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71
> node 0 size: 50912 MB
> node 0 free: 49724 MB
> node distances:
> node   0
>   0:  10
> 
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> ---
>  arch/powerpc/include/asm/topology.h |  1 +
>  arch/powerpc/mm/numa.c              | 49 ++++++++++++++++++++++-------
>  2 files changed, 39 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/topology.h b/arch/powerpc/include/asm/topology.h
> index f0b6300e7dd3..15b0424a27a8 100644
> --- a/arch/powerpc/include/asm/topology.h
> +++ b/arch/powerpc/include/asm/topology.h
> @@ -118,5 +118,6 @@ int get_physical_package_id(int cpu);
>  #endif
>  #endif
> 
> +int firmware_group_id_to_nid(int firmware_gid);
>  #endif /* __KERNEL__ */
>  #endif	/* _ASM_POWERPC_TOPOLOGY_H */
> diff --git a/arch/powerpc/mm/numa.c b/arch/powerpc/mm/numa.c
> index e437a9ac4956..6c659aada55b 100644
> --- a/arch/powerpc/mm/numa.c
> +++ b/arch/powerpc/mm/numa.c
> @@ -221,25 +221,51 @@ static void initialize_distance_lookup_table(int nid,
>  	}
>  }
> 
> +static u32 nid_map[MAX_NUMNODES] = {[0 ... MAX_NUMNODES - 1] =  NUMA_NO_NODE};
> +
> +int firmware_group_id_to_nid(int firmware_gid)
> +{
> +	static int last_nid = 0;
> +
> +	/*
> +	 * For PowerNV we don't change the node id. This helps to avoid
> +	 * confusion w.r.t the expected node ids. On pseries, node numbers
> +	 * are virtualized. Hence do logical node id for pseries.
> +	 */
> +	if (!firmware_has_feature(FW_FEATURE_LPAR))
> +		return firmware_gid;
> +
> +	if (firmware_gid ==  -1)
> +		return NUMA_NO_NODE;
> +
> +	if (nid_map[firmware_gid] == NUMA_NO_NODE)
> +		nid_map[firmware_gid] = last_nid++;

How do we ensure 2 simultaneous firmware_group_id_to_nid() calls dont end up
at this place in parallel?

> +
> +	return nid_map[firmware_gid];
> +}
> +
>  /* Returns nid in the range [0..MAX_NUMNODES-1], or -1 if no useful numa
>   * info is found.
>   */
>  static int associativity_to_nid(const __be32 *associativity)
>  {
>  	int nid = NUMA_NO_NODE;
> +	int firmware_gid = -1;
> 
>  	if (!numa_enabled)
>  		goto out;
> 
>  	if (of_read_number(associativity, 1) >= min_common_depth)
> -		nid = of_read_number(&associativity[min_common_depth], 1);
> +		firmware_gid = of_read_number(&associativity[min_common_depth], 1);
> 
>  	/* POWER4 LPAR uses 0xffff as invalid node */
> -	if (nid == 0xffff || nid >= MAX_NUMNODES)
> -		nid = NUMA_NO_NODE;
> +	if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> +		firmware_gid = -1;

Lets assume two or more invocations of associativity_to_nid for the same
associativity, end up with -1, In each case aren't giving different
nids?


> +
> +	nid = firmware_group_id_to_nid(firmware_gid);
> 
>  	if (nid > 0 &&
> -		of_read_number(associativity, 1) >= distance_ref_points_depth) {
> +	    of_read_number(associativity, 1) >= distance_ref_points_depth) {
>  		/*
>  		 * Skip the length field and send start of associativity array
>  		 */
> @@ -432,24 +458,25 @@ static int of_get_assoc_arrays(struct assoc_arrays *aa)
>  static int of_drconf_to_nid_single(struct drmem_lmb *lmb)
>  {
>  	struct assoc_arrays aa = { .arrays = NULL };
> -	int default_nid = NUMA_NO_NODE;
> -	int nid = default_nid;
> +	int nid = NUMA_NO_NODE, firmware_gid;
>  	int rc, index;
> 
>  	if ((min_common_depth < 0) || !numa_enabled)
> -		return default_nid;
> +		return NUMA_NO_NODE;
> 
>  	rc = of_get_assoc_arrays(&aa);
>  	if (rc)
> -		return default_nid;
> +		return NUMA_NO_NODE;

https://lore.kernel.org/linuxppc-dev/87lfjc1b5f.fsf@linux.ibm.com/t/#u

> 
>  	if (min_common_depth <= aa.array_sz &&
>  	    !(lmb->flags & DRCONF_MEM_AI_INVALID) && lmb->aa_index < aa.n_arrays) {
>  		index = lmb->aa_index * aa.array_sz + min_common_depth - 1;
> -		nid = of_read_number(&aa.arrays[index], 1);
> +		firmware_gid = of_read_number(&aa.arrays[index], 1);
> 
> -		if (nid == 0xffff || nid >= MAX_NUMNODES)
> -			nid = default_nid;
> +		if (firmware_gid == 0xffff || firmware_gid >= MAX_NUMNODES)
> +			firmware_gid = -1;

Same case as above, How do we ensure that we return unique nid for a
similar assoc_array?

> +
> +		nid = firmware_group_id_to_nid(firmware_gid);
> 
>  		if (nid > 0) {
>  			index = lmb->aa_index * aa.array_sz;
> -- 
> 2.26.2
> 

-- 
Thanks and Regards
Srikar Dronamraju

^ permalink raw reply

* Re: [PATCH] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
From: Mark Brown @ 2020-07-31 18:54 UTC (permalink / raw)
  To: alsa-devel, tiwai, nicoleotsuka, linux-kernel, festevam,
	Shengjiu Wang, linuxppc-dev, Xiubo.Lee, lgirdwood, perex, timur
In-Reply-To: <1596176895-28724-1-git-send-email-shengjiu.wang@nxp.com>

On Fri, 31 Jul 2020 14:28:15 +0800, Shengjiu Wang wrote:
> The fifo_depth is 64 on i.MX8QM/i.MX8QXP, 128 on i.MX8MQ, 16 on
> i.MX7ULP.
> 
> Original FSL_SAI_CR1_RFW_MASK value 0x1F is not suitable for
> these platform, the FIFO watermark mask should be updated
> according to the fifo_depth.

Applied to

   https://git.kernel.org/pub/scm/linux/kernel/git/broonie/sound.git for-next

Thanks!

[1/1] ASoC: fsl_sai: Fix value of FSL_SAI_CR1_RFW_MASK
      commit: 5aef1ff2397d021f93d874b57dff032fdfac73de

All being well this means that it will be integrated into the linux-next
tree (usually sometime in the next 24 hours) and sent to Linus during
the next merge window (or sooner if it is a bug fix), however if
problems are discovered then the patch may be dropped or reverted.

You may get further e-mails resulting from automated or manual testing
and review of the tree, please engage with people reporting problems and
send followup patches addressing any issues that are reported if needed.

If any updates are required or you are submitting further changes they
should be sent as incremental updates against current git, existing
patches will not be replaced.

Please add any relevant lists and maintainers to the CCs when replying
to this mail.

Thanks,
Mark

^ permalink raw reply

* Re: [PATCH V5 0/4] powerpc/perf: Add support for perf extended regs in powerpc
From: Athira Rajeev @ 2020-07-31 17:34 UTC (permalink / raw)
  To: Jiri Olsa
  Cc: Ravi Bangoria, Michael Neuling, maddy, kjain,
	Arnaldo Carvalho de Melo, Jiri Olsa, linuxppc-dev
In-Reply-To: <20200730195048.GA1484375@krava>



> On 31-Jul-2020, at 1:20 AM, Jiri Olsa <jolsa@redhat.com> wrote:
> 
> On Thu, Jul 30, 2020 at 01:24:40PM +0530, Athira Rajeev wrote:
>> 
>> 
>>> On 27-Jul-2020, at 10:46 PM, Athira Rajeev <atrajeev@linux.vnet.ibm.com> wrote:
>>> 
>>> Patch set to add support for perf extended register capability in
>>> powerpc. The capability flag PERF_PMU_CAP_EXTENDED_REGS, is used to
>>> indicate the PMU which support extended registers. The generic code
>>> define the mask of extended registers as 0 for non supported architectures.
>>> 
>>> Patches 1 and 2 are the kernel side changes needed to include
>>> base support for extended regs in powerpc and in power10.
>>> Patches 3 and 4 are the perf tools side changes needed to support the
>>> extended registers.
>>> 
>> 
>> Hi Arnaldo, Jiri
>> 
>> please let me know if you have any comments/suggestions on this patch series to add support for perf extended regs.
> 
> hi,
> can't really tell for powerpc, but in general
> perf tool changes look ok
> 

Hi Jiri,
Thanks for checking the patchset.

Athira.

> jirka


^ permalink raw reply

* Re: [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-8 tag
From: pr-tracker-bot @ 2020-07-31 16:45 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linuxppc-dev, Linus Torvalds, linux-kernel, npiggin,
	Aneesh Kumar K.V
In-Reply-To: <87ime34yya.fsf@mpe.ellerman.id.au>

The pull request you sent on Fri, 31 Jul 2020 23:05:17 +1000:

> https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-8

has been merged into torvalds/linux.git:
https://git.kernel.org/torvalds/c/deacdb3e3979979016fcd0ffd518c320a62ad166

Thank you!

-- 
Deet-doot-dot, I am a bot.
https://korg.wiki.kernel.org/userdoc/prtracker

^ permalink raw reply

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Nathan Lynch @ 2020-07-31 13:52 UTC (permalink / raw)
  To: Michael Ellerman, Laurent Dufour; +Cc: tyreld, cheloha, linuxppc-dev
In-Reply-To: <87ft974yf7.fsf@mpe.ellerman.id.au>

Michael Ellerman <mpe@ellerman.id.au> writes:

> Nathan Lynch <nathanl@linux.ibm.com> writes:
>> Michael Ellerman <mpe@ellerman.id.au> writes:
>>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>>>> Laurent Dufour <ldufour@linux.ibm.com> writes:
>>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>>> they explicitly yield.
>>>>>> 
>>>>>> Rather than placing cond_resched() calls within various
>>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>>> expression of the loop macro itself so users can't omit it.
>>>>>
>>>>> Is that not too much to call cond_resched() on every LMB?
>>>>>
>>>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>>>
>>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>>> this context.
>>>
>>> Hmm, mostly.
>>>
>>> But there are quite a few cases like drmem_update_dt_v1():
>>>
>>> 	for_each_drmem_lmb(lmb) {
>>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>>
>>> 		dr_cell++;
>>> 	}
>>>
>>> Which will compile to a pretty tight loop at the moment.
>>>
>>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>>
>>> And although the actual TIF check is cheap the function call to do it is
>>> not free.
>>>
>>> So I worry this is going to make some of those long loops take even
>>> longer.
>>
>> That's fair, and I was wrong - some of the loop bodies are relatively
>> simple, not doing allocations or taking locks, etc.
>>
>> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
>> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().
>
> If we did that, how many call-sites would need converting?
> Is it ~2 or ~20 or ~200?

At a glance I would convert 15-20 out of the 24 users in the tree I'm
looking at. Let me know if I should do a v2 with that approach.

^ permalink raw reply

* Re: [PATCH] powerpc/pseries: explicitly reschedule during drmem_lmb list traversal
From: Michael Ellerman @ 2020-07-31 13:16 UTC (permalink / raw)
  To: Nathan Lynch, Laurent Dufour; +Cc: tyreld, cheloha, linuxppc-dev
In-Reply-To: <875za511z6.fsf@linux.ibm.com>

Nathan Lynch <nathanl@linux.ibm.com> writes:
> Michael Ellerman <mpe@ellerman.id.au> writes:
>> Nathan Lynch <nathanl@linux.ibm.com> writes:
>>> Laurent Dufour <ldufour@linux.ibm.com> writes:
>>>> Le 28/07/2020 à 19:37, Nathan Lynch a écrit :
>>>>> The drmem lmb list can have hundreds of thousands of entries, and
>>>>> unfortunately lookups take the form of linear searches. As long as
>>>>> this is the case, traversals have the potential to monopolize the CPU
>>>>> and provoke lockup reports, workqueue stalls, and the like unless
>>>>> they explicitly yield.
>>>>> 
>>>>> Rather than placing cond_resched() calls within various
>>>>> for_each_drmem_lmb() loop blocks in the code, put it in the iteration
>>>>> expression of the loop macro itself so users can't omit it.
>>>>
>>>> Is that not too much to call cond_resched() on every LMB?
>>>>
>>>> Could that be less frequent, every 10, or 100, I don't really know ?
>>>
>>> Everything done within for_each_drmem_lmb is relatively heavyweight
>>> already. E.g. calling dlpar_remove_lmb()/dlpar_add_lmb() can take dozens
>>> of milliseconds. I don't think cond_resched() is an expensive check in
>>> this context.
>>
>> Hmm, mostly.
>>
>> But there are quite a few cases like drmem_update_dt_v1():
>>
>> 	for_each_drmem_lmb(lmb) {
>> 		dr_cell->base_addr = cpu_to_be64(lmb->base_addr);
>> 		dr_cell->drc_index = cpu_to_be32(lmb->drc_index);
>> 		dr_cell->aa_index = cpu_to_be32(lmb->aa_index);
>> 		dr_cell->flags = cpu_to_be32(drmem_lmb_flags(lmb));
>>
>> 		dr_cell++;
>> 	}
>>
>> Which will compile to a pretty tight loop at the moment.
>>
>> Or drmem_update_dt_v2() which has two loops over all lmbs.
>>
>> And although the actual TIF check is cheap the function call to do it is
>> not free.
>>
>> So I worry this is going to make some of those long loops take even
>> longer.
>
> That's fair, and I was wrong - some of the loop bodies are relatively
> simple, not doing allocations or taking locks, etc.
>
> One way to deal is to keep for_each_drmem_lmb() as-is and add a new
> iterator that can reschedule, e.g. for_each_drmem_lmb_slow().

If we did that, how many call-sites would need converting?
Is it ~2 or ~20 or ~200?

> On the other hand... it's probably not too strong to say that the
> drmem/hotplug code is in crisis with respect to correctness and
> algorithmic complexity, so those are my overriding concerns right
> now. Yes, this change will pessimize loops that are reinitializing the
> entire drmem_lmb array on every DLPAR operation, but:
>
> 1. it doesn't make any user of for_each_drmem_lmb() less correct;
> 2. why is this code doing that in the first place, other than to
>    accommodate a poor data structure choice?
>
> The duration of the system calls where this code runs are measured in
> minutes or hours on large configurations because of all the behaviors
> that are at best O(n) with the amount of memory assigned to the
> partition. For simplicity's sake I'd rather defer lower-level
> performance considerations like this until the drmem data structures'
> awful lookup properties are fixed -- hopefully in the 5.10 timeframe.

Yeah fair enough.

cheers

^ permalink raw reply

* [GIT PULL] Please pull powerpc/linux.git powerpc-5.8-8 tag
From: Michael Ellerman @ 2020-07-31 13:05 UTC (permalink / raw)
  To: Linus Torvalds; +Cc: Aneesh Kumar K.V, linuxppc-dev, linux-kernel, npiggin

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

Hi Linus,

Please pull one more powerpc fix for 5.8:

The following changes since commit f0479c4bcbd92d1a457d4a43bcab79f29d11334a:

  selftests/powerpc: Use proper error code to check fault address (2020-07-15 23:10:17 +1000)

are available in the git repository at:

  https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git tags/powerpc-5.8-8

for you to fetch changes up to 909adfc66b9a1db21b5e8733e9ebfa6cd5135d74:

  powerpc/64s/hash: Fix hash_preload running with interrupts enabled (2020-07-27 17:02:09 +1000)

- ------------------------------------------------------------------
powerpc fixes for 5.8 #8

Fix a bug introduced by the changes we made to lockless page table walking this
cycle. When using the hash MMU, and perf with callchain recording, we can
deadlock if the PMI interrupts a hash fault, and the callchain recording then
takes a hash fault on the same page.

Thanks to:
  Nicholas Piggin, Aneesh Kumar K.V, Anton Blanchard, Athira Rajeev.

- ------------------------------------------------------------------
Nicholas Piggin (1):
      powerpc/64s/hash: Fix hash_preload running with interrupts enabled


 arch/powerpc/kernel/exceptions-64s.S  | 14 ++++++++---
 arch/powerpc/mm/book3s64/hash_utils.c | 25 ++++++++++++++++++++
 arch/powerpc/perf/core-book3s.c       |  6 +++++
 3 files changed, 42 insertions(+), 3 deletions(-)
-----BEGIN PGP SIGNATURE-----

iQIzBAEBCAAdFiEEJFGtCPCthwEv2Y/bUevqPMjhpYAFAl8kFsQACgkQUevqPMjh
pYAkbg/9HzJB8KbjAVJpoXHOf9+aHylhZKU2pcVtdoYE07VV6JabWnlI5flvTSEY
3Dvr3h/X0k2Z3n52+s1kYlW27ErU+o9Hbz1j4O41Ardiy14GmQ02FSi5iXjhKxuk
AjX2MN76n1Te+1ee2Ib/Ubu8KOrbxLIya223tJ0TNu7RWuYVaCO/lNXPPTwm2JWI
s9lryRAHNc4qkPMytsWi/gzNml3IP6cqnGVt1H5fmnOwQXUYbSjdmhBa7AwjCsA1
xJBMkxBxWOnUSknRCxaCCUTQ3sD+sJMzp6verlwEIb+HRh08iJyVzuWCpT9m0ZOG
5F3CvhIxwLatDsJ9N5EeGI2V4qBstBlHEUJJAjkwKgZxzmEm7j9H+ItAc421eo2A
t/SYJTK4JltllCDyB22jCxdEgJh+opv8rki7U+INI01I8gHzqsr/0CHleGtMN5Cq
fUBf25LOMUCpjeAO207j4DzeY6rTXz9H07XWrR4su//4S9bNkYUYpFfeE/3Kwwdq
vZCxjm+fxxLPyAW+E2D2EGsyzIJootrAzly5T7gun8qwgCSgoCxI+YV9/DLmYaGf
OjcG6+X58sfn1FTc8cD6udKCIMKh0JmSccZRqw1Ijp4EiGMnDZUbP9jg/aSciD8a
+oco9QGvvxYQTMut4XUkXtqqZqaTISgWFD/QB9f4KGJ1Rs8tW5Y=
=oaTB
-----END PGP SIGNATURE-----

^ permalink raw reply

* [PATCH v4 12/12] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed O. Bolarinwa @ 2020-07-31 11:43 UTC (permalink / raw)
  To: helgaas
  Cc: Greg Kroah-Hartman, linux-wireless, QCA ath9k Development,
	Oliver O'Halloran, linux-acpi, linux-rdma, Jason Gunthorpe,
	Doug Ledford, linux-pci, Jakub Kicinski, linux-kernel-mentees,
	Len Brown, Arnd Bergmann, skhan, bjorn, Kalle Valo,
	Mike Marciniszyn, Sam Bobroff, Saheed O. Bolarinwa,
	Dennis Dalessandro, Rafael J. Wysocki, linux-kernel, netdev,
	linuxppc-dev, David S. Miller
In-Reply-To: <20200731114329.100848-1-refactormyself@gmail.com>

There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour if further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
	 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*()
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of  pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if
dev->error_state = pci_channel_io_perm_failure (i.e.
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Remove the reset of *val to 0 when pci_read_config_*() fails.

Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
---
 drivers/pci/access.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/drivers/pci/access.c b/drivers/pci/access.c
index 79c4a2ef269a..ec95edbb1ac8 100644
--- a/drivers/pci/access.c
+++ b/drivers/pci/access.c
@@ -413,13 +413,6 @@ int pcie_capability_read_word(struct pci_dev *dev, int pos, u16 *val)
 
 	if (pcie_capability_reg_implemented(dev, pos)) {
 		ret = pci_read_config_word(dev, pci_pcie_cap(dev) + pos, val);
-		/*
-		 * Reset *val to 0 if pci_read_config_word() fails, it may
-		 * have been written as 0xFFFF if hardware error happens
-		 * during pci_read_config_word().
-		 */
-		if (ret)
-			*val = 0;
 		return ret;
 	}
 
@@ -448,13 +441,6 @@ int pcie_capability_read_dword(struct pci_dev *dev, int pos, u32 *val)
 
 	if (pcie_capability_reg_implemented(dev, pos)) {
 		ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
-		/*
-		 * Reset *val to 0 if pci_read_config_dword() fails, it may
-		 * have been written as 0xFFFFFFFF if hardware error happens
-		 * during pci_read_config_dword().
-		 */
-		if (ret)
-			*val = 0;
 		return ret;
 	}
 
-- 
2.18.4


^ permalink raw reply related

* [PATCH v4 10/12] PCI/AER: Check if pcie_capability_read_*() reads ~0
From: Saheed O. Bolarinwa @ 2020-07-31 11:43 UTC (permalink / raw)
  To: helgaas, Russell Currey, Sam Bobroff, Oliver O'Halloran
  Cc: Saheed O. Bolarinwa, skhan, linux-kernel, linux-pci, bjorn,
	linuxppc-dev, linux-kernel-mentees
In-Reply-To: <20200731114329.100848-1-refactormyself@gmail.com>

On failure pcie_capability_read_*() sets it's last parameter, val
to 0. However, with Patch 12/12, it is possible that val is set
to ~0 on failure. This would introduce a bug because
(x & x) == (~0 & x).

Since ~0 is an invalid value in here,

Add extra check for ~0 to the if condition to confirm failure.

Suggested-by: Bjorn Helgaas <bjorn@helgaas.com>
Signed-off-by: Saheed O. Bolarinwa <refactormyself@gmail.com>
---
 drivers/pci/pcie/aer.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 3acf56683915..dbeabc370efc 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -829,7 +829,7 @@ static bool is_error_source(struct pci_dev *dev, struct aer_err_info *e_info)
 
 	/* Check if AER is enabled */
 	pcie_capability_read_word(dev, PCI_EXP_DEVCTL, &reg16);
-	if (!(reg16 & PCI_EXP_AER_FLAGS))
+	if ((reg16 == (u16)~0) || !(reg16 & PCI_EXP_AER_FLAGS))
 		return false;
 
 	if (!aer)
-- 
2.18.4


^ permalink raw reply related

* Re: [PATCH v4 06/10] powerpc/smp: Generalize 2nd sched domain
From: Michael Ellerman @ 2020-07-31 12:22 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200731092901.GH14603@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:45:37]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > Currently "CACHE" domain happens to be the 2nd sched domain as per
>> > powerpc_topology. This domain will collapse if cpumask of l2-cache is
>> > same as SMT domain. However we could generalize this domain such that it
>> > could mean either be a "CACHE" domain or a "BIGCORE" domain.
>> >
>> > While setting up the "CACHE" domain, check if shared_cache is already
>> > set.
>> 
>> PeterZ asked for some overview of what you're doing and why, you
>> responded to his mail, but I was expecting to see that text incorporated
>> here somewhere.
>> 
>
> Okay, do you want that as part of the code or documentation dir or the
> changelog?

I guess a comment is best, as that's most likely to be seen by people
looking at the code in future.

A little bit of overview in the change log is also good.

>> He also asked for some comments, which I would also like to see.
>> 
>> 
>> I'm also not clear why we want to rename it to "bigcore", that's not a
>> commonly understood term, I don't think it's clear to new readers what
>> it means.
>> 
>> Leaving it as the shared cache domain, and having a comment mentioning
>> that "bigcores" share a cache, would be clearer I think.
>> 
>
> Today, Shared cache is equal to Big Core. However in not too distant future,
> Shared cache domain and Big Core may not be the same. For example lets
> assume that L2 cache were to Shrink per small core with the firmware
> exposing the core as a bigcore. Then with the current design, we have a SMT
> == SHARED CACHE, and a DIE. We would not have any domain at the publicised 8
> thread level. Keeping the Bigcore as a domain and mapping the shared
> cache, (I am resetting the domain name as CACHE if BIGCORE==SHARED_CACHE),
> helps us in this scenario.

Yeah OK.

In that scenario it's not really clear what the 8 thread level domain
expresses, if the two "halves" of the bigcore have separate L2s. But
presumably there is still some benefit to exposing it.

cheers

^ permalink raw reply

* Re: [PATCH v4 08/10] powerpc/smp: Allocate cpumask only after searching thread group
From: Michael Ellerman @ 2020-07-31 12:14 UTC (permalink / raw)
  To: Srikar Dronamraju
  Cc: Nathan Lynch, Gautham R Shenoy, Michael Neuling, Peter Zijlstra,
	Jordan Niethe, LKML, Nicholas Piggin, Valentin Schneider,
	Oliver O'Halloran, linuxppc-dev, Ingo Molnar
In-Reply-To: <20200731094938.GA18776@linux.vnet.ibm.com>

Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
> * Michael Ellerman <mpe@ellerman.id.au> [2020-07-31 17:52:15]:
>
>> Srikar Dronamraju <srikar@linux.vnet.ibm.com> writes:
>> > If allocated earlier and the search fails, then cpumask need to be
>> > freed. However cpu_l1_cache_map can be allocated after we search thread
>> > group.
>> 
>> It's not freed anywhere AFAICS?
>
> Yes, its never freed. Infact we are never checking if
> zalloc_cpumask_var_node fails. Its not just this cpumask, but historically
> all the other existing cpumasks in arch/powerpc/kernel/smp.c are never
> freed/checked. I did dig into this a bit and it appears that ..
> (Please do correct me if I am wrong!! )

That's correct.

> Powerpc using cpumask_var_t for all of the percpu variables. And it dont seem
> to enable CONFIG_CPUMASK_OFFSTACK even from the MAXSMP config.

I remember Rusty adding that code, but I don't know if we ever
considered enabling CPUMASK_OFFSTACK.

Probably we meant to but never got around to doing it.

> So from include/linux/cpumask.h
>
> typedef struct cpumask cpumask_var_t[1];
> and
> zalloc_cpumask_var_node ends up being cpumask_clear
>
> So I think we are historically we seem to assume we are always
> !CPUMASK_OFFSTACK and hence we dont need to check for return as well as
> free..

Right.

> I would look forward to your comments on how we should handle this going
> forward. But I would keep this the same for this patchset.

Agreed, just clarify in the change log that it's not freed at the moment
because of CPU_MASK_OFFSTACK=n

> One of the questions that I have is if we most likely are to be in
> !CONFIG_CPUMASK_OFFSTACK, then should be migrate to cpumask_t for percpu
> variables. 

I don't think so, cpumask_t is semi-deprecated AIUI.
  
> The reason being we end up using NR_CPU cpumask for each percpu cpumask
> variable instead of using NR_CPU cpumask_t pointer.

Our current defconfigs have NR_CPUS=2048, which is probably just small
enough to continue using OFFSTACK=n.

But we allow configuring NR_CPUS up to 8192, which surely would need
OFFSTACK=y in order to work.

So I think we need to stick with cpumask_var_t, but we should test with
OFFSTACK=y, and should probably be a bit more careful with checking the
allocations succeed.

And then we should select OFFSTACK=y for NR_CPUS above some threshold.

cheers

^ permalink raw reply

* [PATCH v4 00/12] PCI: Remove '*val = 0' from pcie_capability_read_*()
From: Saheed O. Bolarinwa @ 2020-07-31 11:02 UTC (permalink / raw)
  To: helgaas, Mike Marciniszyn, Dennis Dalessandro, Doug Ledford,
	Jason Gunthorpe, Arnd Bergmann, Greg Kroah-Hartman,
	David S. Miller, Kalle Valo, Jakub Kicinski, Rafael J. Wysocki,
	Len Brown, Russell Currey, Sam Bobroff, Oliver O'Halloran
  Cc: linuxppc-dev, linux-rdma, Saheed O. Bolarinwa, skhan,
	linux-wireless, linux-kernel, QCA ath9k Development, linux-acpi,
	netdev, linux-pci, bjorn, linux-kernel-mentees

v4 CHANGES:
- Drop uses of pcie_capability_read_*() return value. This related to
  [1] which is pointing towards making the accessors return void.
- Remove patches found to be unnecessary
- Reword some commit messages

v3 CHANGES:
- Split previous PATCH 6/13 into two : PATCH 6/14 and PATCH 7/14
- Fix commit message of PATCH 5/14
- Update Patch numbering and Commit messages
- Add 'Acked by Greg KH' to PATCH 2/14
- Add PATCH version

v2 CHANGES:
- Fix missing comma, causing the email cc error
- Fix typos and numbering errors in commit messages
- Add commit message to 13/13
- Add two more patches: PATCH 3/13 and PATCH 4/13

MERGING:
- Patch 6/12 depends on Patch 5/12. However Patch 5/12 has no dependency.
  Please, merge PATCH 6/12 only after Patch 5/12.
- Patch 12/12 depends on all preceding patches. Please merge Patch 12/12
  only after other patches in this series have been merged.
- All other patches have no dependencies besides those mentioned above and
  can be merge individually.

PATCH 5/12:
Set the default case in the switch-statement to set status
to "Power On".

PATCH 1/12 to 11/12:
Use the value read by pcie_capability_read_*() to determine success or
failure. This is done by checking if it is ~0, while maintaining the
functions' behaviour. This ensures that the changes in PATCH 12/12 does
not introduce any bug.

PATCH 12/12:
There are several reasons why a PCI capability read may fail whether the
device is present or not. If this happens, pcie_capability_read_*() will
return -EINVAL/PCIBIOS_BAD_REGISTER_NUMBER or PCIBIOS_DEVICE_NOT_FOUND
and *val is set to 0.

This behaviour if further ensured by this code inside
pcie_capability_read_*()

 ret = pci_read_config_dword(dev, pci_pcie_cap(dev) + pos, val);
 /*
  * Reset *val to 0 if pci_read_config_dword() fails, it may
  * have been written as 0xFFFFFFFF if hardware error happens
  * during pci_read_config_dword().
  */
 if (ret)
	 *val = 0;
 return ret;

a) Since all pci_generic_config_read() does is read a register value,
it may return success after reading a ~0 which *may* have been fabricated
by the PCI host bridge due to a read timeout. Hence pci_read_config_*() 
will return success with a fabricated ~0 in *val, indicating a problem.
In this case, the assumed behaviour of  pcie_capability_read_*() will be
wrong. To avoid error slipping through, more checks are necessary.

b) pci_read_config_*() will return PCIBIOS_DEVICE_NOT_FOUND only if 
dev->error_state = pci_channel_io_perm_failure (i.e. 
pci_dev_is_disconnected()) or if pci_generic_config_read() can't find the
device. In both cases *val is initially set to ~0 but as shown in the code
above pcie_capability_read_*() resets it back to 0. Even with this effort,
drivers still have to perform validation checks more so if 0 is a valid
value.

Most drivers only consider the case (b) and in some cases, there is the 
expectation that on timeout *val has a fabricated value of ~0, which *may*
not always be true as explained in (a).

In any case, checks need to be done to validate the value read and maybe
confirm which error has occurred. It is better left to the drivers to do.

Check the return value of pcie_capability_read_dword() to ensure success
and avoid bug as a result of Patch 14/14.
Remove the reset of *val to 0 when pci_read_config_*() fails.

[1] https://lore.kernel.org/linux-pci/20200714234625.GA428442@bjorn-Precision-5520/


Saheed O. Bolarinwa (12):
  IB/hfi1: Check if pcie_capability_read_*() reads ~0
  misc: rtsx: Check if pcie_capability_read_*() reads ~0
  ath9k: Check if pcie_capability_read_*() reads ~0
  iwlegacy: Check if pcie_capability_read_*() reads ~0
  PCI: pciehp: Set "Power On" as the default get_power_status
  PCI: pciehp: Check if pcie_capability_read_*() reads ~0
  PCI/ACPI: Check if pcie_capability_read_*() reads ~0
  PCI: Check if pcie_capability_read_*() reads ~0
  PCI/PM: Check if pcie_capability_read_*() reads ~0
  PCI/AER: Check if pcie_capability_read_*() reads ~0
  PCI/ASPM: Check if pcie_capability_read_*() reads ~0
  PCI: Remove '*val = 0' from pcie_capability_read_*()

 drivers/infiniband/hw/hfi1/aspm.c            |  6 ++--
 drivers/misc/cardreader/rts5227.c            |  2 +-
 drivers/misc/cardreader/rts5249.c            |  2 +-
 drivers/misc/cardreader/rts5260.c            |  2 +-
 drivers/misc/cardreader/rts5261.c            |  2 +-
 drivers/net/wireless/ath/ath9k/pci.c         |  3 +-
 drivers/net/wireless/intel/iwlegacy/common.c |  2 +-
 drivers/pci/access.c                         | 14 --------
 drivers/pci/hotplug/pciehp_hpc.c             | 13 +++++---
 drivers/pci/pci-acpi.c                       |  4 +--
 drivers/pci/pci.c                            | 34 ++++++++++++++------
 drivers/pci/pcie/aer.c                       |  2 +-
 drivers/pci/pcie/aspm.c                      | 10 +++---
 drivers/pci/probe.c                          | 12 +++----
 14 files changed, 56 insertions(+), 52 deletions(-)

-- 
2.18.4


^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox