* Re: [PATCH net RFC] ibmvnic: device remove has higher precedence over reset
From: Lijun Pan @ 2021-01-19 20:08 UTC (permalink / raw)
To: Uwe Kleine-König
Cc: gregkh, julietk, netdev, Jakub Kicinski, Lijun Pan, kernel, drt,
paulus, sukadev, linuxppc-dev, davem
In-Reply-To: <20210119194959.a67nlfbngx4drvyz@pengutronix.de>
On Tue, Jan 19, 2021 at 1:56 PM Uwe Kleine-König
<u.kleine-koenig@pengutronix.de> wrote:
>
> On Tue, Jan 19, 2021 at 01:33:13PM -0600, Lijun Pan wrote:
> > Returning -EBUSY in ibmvnic_remove() does not actually hold the
> > removal procedure since driver core doesn't care for the return
> > value (see __device_release_driver() in drivers/base/dd.c
> > calling dev->bus->remove()) though vio_bus_remove
> > (in arch/powerpc/platforms/pseries/vio.c) records the
> > return value and passes it on. [1]
> >
> > During the device removal precedure, we should not schedule
> > any new reset, and rely on the flush_work and flush_delayed_work
> > to complete the pending resets, and specifically we need to
> > let __ibmvnic_reset() keep running while in REMOVING state since
> > flush_work and flush_delayed_work shall call __ibmvnic_reset finally.
> >
> > [1] https://lore.kernel.org/linuxppc-dev/20210117101242.dpwayq6wdgfdzirl@pengutronix.de/T/#m48f5befd96bc9842ece2a3ad14f4c27747206a53
> > Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> > Fixes: 7d7195a026ba ("ibmvnic: Do not process device remove during device reset")
> > Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
> > ---
> > drivers/net/ethernet/ibm/ibmvnic.c | 8 +-------
> > 1 file changed, 1 insertion(+), 7 deletions(-)
> >
> > diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> > index aed985e08e8a..11f28fd03057 100644
> > --- a/drivers/net/ethernet/ibm/ibmvnic.c
> > +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> > @@ -2235,8 +2235,7 @@ static void __ibmvnic_reset(struct work_struct *work)
> > while (rwi) {
> > spin_lock_irqsave(&adapter->state_lock, flags);
> >
> > - if (adapter->state == VNIC_REMOVING ||
> > - adapter->state == VNIC_REMOVED) {
> > + if (adapter->state == VNIC_REMOVED) {
>
> I think you need to keep the check for VNIC_REMOVING. Otherwise you
> don't prevent that a new reset being queued after ibmvnic_remove() set
> the state to VNIC_REMOVING. Am I missing something?
I leave the checking for REMOVING there in ibmvnic_reset, which is the
function to schedule/queue up the resets.
Here I delete the REMOVING in __ibmvnic_reset to let it run and finish.
Otherwise flush_work will not do anything if __ibmvnic_reset() just return
doing nothing.
I can explain it in the commit message.
>
> > spin_unlock_irqrestore(&adapter->state_lock, flags);
> > kfree(rwi);
> > rc = EBUSY;
> > @@ -5372,11 +5371,6 @@ static int ibmvnic_remove(struct vio_dev *dev)
> > unsigned long flags;
> >
> > spin_lock_irqsave(&adapter->state_lock, flags);
> > - if (test_bit(0, &adapter->resetting)) {
> > - spin_unlock_irqrestore(&adapter->state_lock, flags);
> > - return -EBUSY;
> > - }
> > -
> > adapter->state = VNIC_REMOVING;
> > spin_unlock_irqrestore(&adapter->state_lock, flags);
>
> Best regards
> Uwe
>
> --
> Pengutronix e.K. | Uwe Kleine-König |
> Industrial Linux Solutions | https://www.pengutronix.de/ |
^ permalink raw reply
* Re: [PATCH net RFC] ibmvnic: device remove has higher precedence over reset
From: Uwe Kleine-König @ 2021-01-19 19:49 UTC (permalink / raw)
To: Lijun Pan
Cc: gregkh, julietk, netdev, paulus, kernel, drt, kuba, sukadev,
linuxppc-dev, davem
In-Reply-To: <20210119193313.43791-1-ljp@linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 2569 bytes --]
On Tue, Jan 19, 2021 at 01:33:13PM -0600, Lijun Pan wrote:
> Returning -EBUSY in ibmvnic_remove() does not actually hold the
> removal procedure since driver core doesn't care for the return
> value (see __device_release_driver() in drivers/base/dd.c
> calling dev->bus->remove()) though vio_bus_remove
> (in arch/powerpc/platforms/pseries/vio.c) records the
> return value and passes it on. [1]
>
> During the device removal precedure, we should not schedule
> any new reset, and rely on the flush_work and flush_delayed_work
> to complete the pending resets, and specifically we need to
> let __ibmvnic_reset() keep running while in REMOVING state since
> flush_work and flush_delayed_work shall call __ibmvnic_reset finally.
>
> [1] https://lore.kernel.org/linuxppc-dev/20210117101242.dpwayq6wdgfdzirl@pengutronix.de/T/#m48f5befd96bc9842ece2a3ad14f4c27747206a53
> Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
> Fixes: 7d7195a026ba ("ibmvnic: Do not process device remove during device reset")
> Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
> ---
> drivers/net/ethernet/ibm/ibmvnic.c | 8 +-------
> 1 file changed, 1 insertion(+), 7 deletions(-)
>
> diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
> index aed985e08e8a..11f28fd03057 100644
> --- a/drivers/net/ethernet/ibm/ibmvnic.c
> +++ b/drivers/net/ethernet/ibm/ibmvnic.c
> @@ -2235,8 +2235,7 @@ static void __ibmvnic_reset(struct work_struct *work)
> while (rwi) {
> spin_lock_irqsave(&adapter->state_lock, flags);
>
> - if (adapter->state == VNIC_REMOVING ||
> - adapter->state == VNIC_REMOVED) {
> + if (adapter->state == VNIC_REMOVED) {
I think you need to keep the check for VNIC_REMOVING. Otherwise you
don't prevent that a new reset being queued after ibmvnic_remove() set
the state to VNIC_REMOVING. Am I missing something?
> spin_unlock_irqrestore(&adapter->state_lock, flags);
> kfree(rwi);
> rc = EBUSY;
> @@ -5372,11 +5371,6 @@ static int ibmvnic_remove(struct vio_dev *dev)
> unsigned long flags;
>
> spin_lock_irqsave(&adapter->state_lock, flags);
> - if (test_bit(0, &adapter->resetting)) {
> - spin_unlock_irqrestore(&adapter->state_lock, flags);
> - return -EBUSY;
> - }
> -
> adapter->state = VNIC_REMOVING;
> spin_unlock_irqrestore(&adapter->state_lock, flags);
Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | https://www.pengutronix.de/ |
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* [PATCH net RFC] ibmvnic: device remove has higher precedence over reset
From: Lijun Pan @ 2021-01-19 19:33 UTC (permalink / raw)
To: netdev
Cc: Lijun Pan, gregkh, julietk, Uwe Kleine-König, paulus, kernel,
drt, kuba, sukadev, linuxppc-dev, davem
Returning -EBUSY in ibmvnic_remove() does not actually hold the
removal procedure since driver core doesn't care for the return
value (see __device_release_driver() in drivers/base/dd.c
calling dev->bus->remove()) though vio_bus_remove
(in arch/powerpc/platforms/pseries/vio.c) records the
return value and passes it on. [1]
During the device removal precedure, we should not schedule
any new reset, and rely on the flush_work and flush_delayed_work
to complete the pending resets, and specifically we need to
let __ibmvnic_reset() keep running while in REMOVING state since
flush_work and flush_delayed_work shall call __ibmvnic_reset finally.
[1] https://lore.kernel.org/linuxppc-dev/20210117101242.dpwayq6wdgfdzirl@pengutronix.de/T/#m48f5befd96bc9842ece2a3ad14f4c27747206a53
Reported-by: Uwe Kleine-König <u.kleine-koenig@pengutronix.de>
Fixes: 7d7195a026ba ("ibmvnic: Do not process device remove during device reset")
Signed-off-by: Lijun Pan <ljp@linux.ibm.com>
---
drivers/net/ethernet/ibm/ibmvnic.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/drivers/net/ethernet/ibm/ibmvnic.c b/drivers/net/ethernet/ibm/ibmvnic.c
index aed985e08e8a..11f28fd03057 100644
--- a/drivers/net/ethernet/ibm/ibmvnic.c
+++ b/drivers/net/ethernet/ibm/ibmvnic.c
@@ -2235,8 +2235,7 @@ static void __ibmvnic_reset(struct work_struct *work)
while (rwi) {
spin_lock_irqsave(&adapter->state_lock, flags);
- if (adapter->state == VNIC_REMOVING ||
- adapter->state == VNIC_REMOVED) {
+ if (adapter->state == VNIC_REMOVED) {
spin_unlock_irqrestore(&adapter->state_lock, flags);
kfree(rwi);
rc = EBUSY;
@@ -5372,11 +5371,6 @@ static int ibmvnic_remove(struct vio_dev *dev)
unsigned long flags;
spin_lock_irqsave(&adapter->state_lock, flags);
- if (test_bit(0, &adapter->resetting)) {
- spin_unlock_irqrestore(&adapter->state_lock, flags);
- return -EBUSY;
- }
-
adapter->state = VNIC_REMOVING;
spin_unlock_irqrestore(&adapter->state_lock, flags);
--
2.22.0
^ permalink raw reply related
* Re: ibmvnic: Race condition in remove callback
From: Uwe Kleine-König @ 2021-01-19 19:18 UTC (permalink / raw)
To: Dany Madden
Cc: Juliet Kim, Greg Kroah-Hartman, Lijun Pan, kernel, netdev,
Jakub Kicinski, Sukadev Bhattiprolu, linuxppc-dev,
David S. Miller, Paul Mackerras
In-Reply-To: <b725079b34031595887b019d1d2f6fc7@imap.linux.ibm.com>
[-- Attachment #1: Type: text/plain, Size: 3517 bytes --]
Hello Dany,
On Tue, Jan 19, 2021 at 10:14:25AM -0800, Dany Madden wrote:
> On 2021-01-17 02:12, Uwe Kleine-König wrote:
> > while working on some cleanup I stumbled over a problem in the ibmvnic's
> > remove callback. Since commit
> >
> > 7d7195a026ba ("ibmvnic: Do not process device remove during
> > device reset")
> >
> > there is the following code in the remove callback:
> >
> > static int ibmvnic_remove(struct vio_dev *dev)
> > {
> > ...
> > spin_lock_irqsave(&adapter->state_lock, flags);
> > if (test_bit(0, &adapter->resetting)) {
> > spin_unlock_irqrestore(&adapter->state_lock,
> > flags);
> > return -EBUSY;
> > }
> >
> > adapter->state = VNIC_REMOVING;
> > spin_unlock_irqrestore(&adapter->state_lock, flags);
> >
> > flush_work(&adapter->ibmvnic_reset);
> > flush_delayed_work(&adapter->ibmvnic_delayed_reset);
> > ...
> > }
> >
> > Unfortunately returning -EBUSY doesn't work as intended. That's because
> > the return value of this function is ignored[1] and the device is
> > considered unbound by the device core (shortly) after ibmvnic_remove()
> > returns.
>
> Oh! I was not aware of this.
There are not very many people who are aware. That's why I'm working on
making all these remove callbacks return void. (And that's how I stubled
over this driver's expectations when returning -EBUSY.)
> In our code review, a question on whether or not device reset should
> have a higher precedence over device remove was raised before. So, now
> it is clear that this driver has to take care of remove over reset.
Yes, either you have to sleep in .remove until the reset is completed,
or you cancel the request.
> > While looking into fixing that I noticed a worse problem:
> >
> > If ibmvnic_reset() (e.g. called by the tx_timeout callback) calls
> > schedule_work(&adapter->ibmvnic_reset); just after the work queue is
> > flushed above the problem that 7d7195a026ba intends to fix will trigger
> > resulting in a use-after-free.
>
> It was proposed that when coming into ibmvnic_remove() we lock down the
> workqueue to prevent future access, flush, cleanup, then unregister the
> device. Your thought on this?
This is already done a bit, as ibmvnic_reset() checks for adapter->state
== VNIC_REMOVING. The problem is just that this check is racy.
> > Also ibmvnic_reset() checks for adapter->state without holding the lock
> > which might be racy, too.
>
> Suka started addressing consistent locking with this patch series:
> https://lists.openwall.net/netdev/2021/01/08/89
>
> He is reworking this.
Please understand that I won't look into this. This is not in my area of
expertise and interest and I'd like to consider this problem already
done for me with my report. I'm glad you're acting on it. Depending on
how quick this is fixed I plan to submit a patch that changes
return -EBUSY;
to
return 0;
to prepare changing the prototype of the remove callback to return void.
> Thank you for taking the time to review this driver, Uwe. This is very
> helpful for us.
You're welcome.
Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-König |
Industrial Linux Solutions | https://www.pengutronix.de/ |
[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]
^ permalink raw reply
* Re: [PATCH v3 1/8] powerpc/uaccess: Add unsafe_copy_from_user
From: Christophe Leroy @ 2021-01-19 17:27 UTC (permalink / raw)
To: Christopher M. Riedl, Michael Ellerman, linuxppc-dev
In-Reply-To: <C8NAORBNJH4S.KKQFN1HWO8XH@geist>
Le 19/01/2021 à 18:02, Christopher M. Riedl a écrit :
> On Tue Jan 19, 2021 at 6:33 AM CST, Christophe Leroy wrote:
>>
>>
>> Le 19/01/2021 à 03:11, Michael Ellerman a écrit :
>>> "Christopher M. Riedl" <cmr@codefail.de> writes:
>>>> On Mon Jan 11, 2021 at 7:22 AM CST, Christophe Leroy wrote:
>>>>> Le 09/01/2021 à 04:25, Christopher M. Riedl a écrit :
>>>>>> Implement raw_copy_from_user_allowed() which assumes that userspace read
>>>>>> access is open. Use this new function to implement raw_copy_from_user().
>>>>>> Finally, wrap the new function to follow the usual "unsafe_" convention
>>>>>> of taking a label argument.
>>>>>
>>>>> I think there is no point implementing raw_copy_from_user_allowed(), see
>>>>> https://github.com/linuxppc/linux/commit/4b842e4e25b1 and
>>>>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8c74fc9ce8131cabb10b3e95dc0e430f396ee83e.1610369143.git.christophe.leroy@csgroup.eu/
>>>>>
>>>>> You should simply do:
>>>>>
>>>>> #define unsafe_copy_from_user(d, s, l, e) \
>>>>> unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
>>>>>
>>>>
>>>> I gave this a try and the signal ops decreased by ~8K. Now, to be
>>>> honest, I am not sure what an "acceptable" benchmark number here
>>>> actually is - so maybe this is ok? Same loss with both radix and hash:
>>>>
>>>> | | hash | radix |
>>>> | ------------------------------------ | ------ | ------ |
>>>> | linuxppc/next | 118693 | 133296 |
>>>> | linuxppc/next w/o KUAP+KUEP | 228911 | 228654 |
>>>> | unsafe-signal64 | 200480 | 234067 |
>>>> | unsafe-signal64 (__copy_tofrom_user) | 192467 | 225119 |
>>>>
>>>> To put this into perspective, prior to KUAP and uaccess flush, signal
>>>> performance in this benchmark was ~290K on hash.
>>>
>>> If I'm doing the math right 8K is ~4% of the best number.
>>>
>>> It seems like 4% is worth a few lines of code to handle these constant
>>> sizes. It's not like we have performance to throw away.
>>>
>>> Or, we should chase down where the call sites are that are doing small
>>> constant copies with copy_to/from_user() and change them to use
>>> get/put_user().
>>>
>>
>> Christopher, when you say you gave it a try, is I my series or only the
>> following ?
>>
>> #define unsafe_copy_from_user(d, s, l, e) \
>> unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
>>
>
> I only used the above to replace this patch in my series (so none of my
> changes implementing raw_copy_from_user_allowed() are included).
Then I see no reason why the performance would be different, because you only call
unsafe_copy_from_user() with non trivial lengthes.
>
>>
>> Because I see no use of unsafe_copy_from_user() that would explain that.
>>
>> Christophe
^ permalink raw reply
* Re: [PATCH v6 04/49] soc: fsl: qe: introduce qe_io{read,write}* wrappers
From: Christophe Leroy @ 2021-01-19 17:33 UTC (permalink / raw)
To: Rasmus Villemoes
Cc: Timur Tabi, linux-kernel, Li Yang, Scott Wood, linuxppc-dev,
linux-arm-kernel, Qiang Zhao
In-Reply-To: <20191128145554.1297-5-linux@rasmusvillemoes.dk>
Hi Rasmus,
Le 28/11/2019 à 15:55, Rasmus Villemoes a écrit :
> The QUICC engine drivers use the powerpc-specific out_be32() etc. In
> order to allow those drivers to build for other architectures, those
> must be replaced by iowrite32be(). However, on powerpc, out_be32() is
> a simple inline function while iowrite32be() is out-of-line. So in
> order not to introduce a performance regression on powerpc when making
> the drivers work on other architectures, introduce qe_io* helpers.
>
> Also define the qe_{clr,set,clrset}bits* helpers in terms of these new
> macros.
Since commit https://github.com/linuxppc/linux/commit/894fa235eb4ca0bfa692dbe4932c2f940cdc8c1e
ioread/iowrite wrappers are also inlined on PPC32, so this commit can now be reverted.
Christophe
>
> Reviewed-by: Timur Tabi <timur@kernel.org>
> Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> ---
> include/soc/fsl/qe/qe.h | 34 +++++++++++++++++++++++++---------
> 1 file changed, 25 insertions(+), 9 deletions(-)
>
> diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
> index a1aa4eb28f0c..9cac04c692fd 100644
> --- a/include/soc/fsl/qe/qe.h
> +++ b/include/soc/fsl/qe/qe.h
> @@ -241,21 +241,37 @@ static inline int qe_alive_during_sleep(void)
> #define qe_muram_offset cpm_muram_offset
> #define qe_muram_dma cpm_muram_dma
>
> -#define qe_setbits_be32(_addr, _v) iowrite32be(ioread32be(_addr) | (_v), (_addr))
> -#define qe_clrbits_be32(_addr, _v) iowrite32be(ioread32be(_addr) & ~(_v), (_addr))
> +#ifdef CONFIG_PPC32
> +#define qe_iowrite8(val, addr) out_8(addr, val)
> +#define qe_iowrite16be(val, addr) out_be16(addr, val)
> +#define qe_iowrite32be(val, addr) out_be32(addr, val)
> +#define qe_ioread8(addr) in_8(addr)
> +#define qe_ioread16be(addr) in_be16(addr)
> +#define qe_ioread32be(addr) in_be32(addr)
> +#else
> +#define qe_iowrite8(val, addr) iowrite8(val, addr)
> +#define qe_iowrite16be(val, addr) iowrite16be(val, addr)
> +#define qe_iowrite32be(val, addr) iowrite32be(val, addr)
> +#define qe_ioread8(addr) ioread8(addr)
> +#define qe_ioread16be(addr) ioread16be(addr)
> +#define qe_ioread32be(addr) ioread32be(addr)
> +#endif
> +
> +#define qe_setbits_be32(_addr, _v) qe_iowrite32be(qe_ioread32be(_addr) | (_v), (_addr))
> +#define qe_clrbits_be32(_addr, _v) qe_iowrite32be(qe_ioread32be(_addr) & ~(_v), (_addr))
>
> -#define qe_setbits_be16(_addr, _v) iowrite16be(ioread16be(_addr) | (_v), (_addr))
> -#define qe_clrbits_be16(_addr, _v) iowrite16be(ioread16be(_addr) & ~(_v), (_addr))
> +#define qe_setbits_be16(_addr, _v) qe_iowrite16be(qe_ioread16be(_addr) | (_v), (_addr))
> +#define qe_clrbits_be16(_addr, _v) qe_iowrite16be(qe_ioread16be(_addr) & ~(_v), (_addr))
>
> -#define qe_setbits_8(_addr, _v) iowrite8(ioread8(_addr) | (_v), (_addr))
> -#define qe_clrbits_8(_addr, _v) iowrite8(ioread8(_addr) & ~(_v), (_addr))
> +#define qe_setbits_8(_addr, _v) qe_iowrite8(qe_ioread8(_addr) | (_v), (_addr))
> +#define qe_clrbits_8(_addr, _v) qe_iowrite8(qe_ioread8(_addr) & ~(_v), (_addr))
>
> #define qe_clrsetbits_be32(addr, clear, set) \
> - iowrite32be((ioread32be(addr) & ~(clear)) | (set), (addr))
> + qe_iowrite32be((qe_ioread32be(addr) & ~(clear)) | (set), (addr))
> #define qe_clrsetbits_be16(addr, clear, set) \
> - iowrite16be((ioread16be(addr) & ~(clear)) | (set), (addr))
> + qe_iowrite16be((qe_ioread16be(addr) & ~(clear)) | (set), (addr))
> #define qe_clrsetbits_8(addr, clear, set) \
> - iowrite8((ioread8(addr) & ~(clear)) | (set), (addr))
> + qe_iowrite8((qe_ioread8(addr) & ~(clear)) | (set), (addr))
>
> /* Structure that defines QE firmware binary files.
> *
>
^ permalink raw reply
* Re: [PATCH v3 1/8] powerpc/uaccess: Add unsafe_copy_from_user
From: Christopher M. Riedl @ 2021-01-19 17:02 UTC (permalink / raw)
To: Christophe Leroy, Michael Ellerman, linuxppc-dev
In-Reply-To: <148f85e2-a49e-062a-6627-cb46bf6eab14@csgroup.eu>
On Tue Jan 19, 2021 at 6:33 AM CST, Christophe Leroy wrote:
>
>
> Le 19/01/2021 à 03:11, Michael Ellerman a écrit :
> > "Christopher M. Riedl" <cmr@codefail.de> writes:
> >> On Mon Jan 11, 2021 at 7:22 AM CST, Christophe Leroy wrote:
> >>> Le 09/01/2021 à 04:25, Christopher M. Riedl a écrit :
> >>>> Implement raw_copy_from_user_allowed() which assumes that userspace read
> >>>> access is open. Use this new function to implement raw_copy_from_user().
> >>>> Finally, wrap the new function to follow the usual "unsafe_" convention
> >>>> of taking a label argument.
> >>>
> >>> I think there is no point implementing raw_copy_from_user_allowed(), see
> >>> https://github.com/linuxppc/linux/commit/4b842e4e25b1 and
> >>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8c74fc9ce8131cabb10b3e95dc0e430f396ee83e.1610369143.git.christophe.leroy@csgroup.eu/
> >>>
> >>> You should simply do:
> >>>
> >>> #define unsafe_copy_from_user(d, s, l, e) \
> >>> unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
> >>>
> >>
> >> I gave this a try and the signal ops decreased by ~8K. Now, to be
> >> honest, I am not sure what an "acceptable" benchmark number here
> >> actually is - so maybe this is ok? Same loss with both radix and hash:
> >>
> >> | | hash | radix |
> >> | ------------------------------------ | ------ | ------ |
> >> | linuxppc/next | 118693 | 133296 |
> >> | linuxppc/next w/o KUAP+KUEP | 228911 | 228654 |
> >> | unsafe-signal64 | 200480 | 234067 |
> >> | unsafe-signal64 (__copy_tofrom_user) | 192467 | 225119 |
> >>
> >> To put this into perspective, prior to KUAP and uaccess flush, signal
> >> performance in this benchmark was ~290K on hash.
> >
> > If I'm doing the math right 8K is ~4% of the best number.
> >
> > It seems like 4% is worth a few lines of code to handle these constant
> > sizes. It's not like we have performance to throw away.
> >
> > Or, we should chase down where the call sites are that are doing small
> > constant copies with copy_to/from_user() and change them to use
> > get/put_user().
> >
>
> Christopher, when you say you gave it a try, is I my series or only the
> following ?
>
> #define unsafe_copy_from_user(d, s, l, e) \
> unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
>
I only used the above to replace this patch in my series (so none of my
changes implementing raw_copy_from_user_allowed() are included).
>
> Because I see no use of unsafe_copy_from_user() that would explain that.
>
> Christophe
^ permalink raw reply
* Re: [PATCH v6 04/49] soc: fsl: qe: introduce qe_io{read, write}* wrappers
From: Li Yang @ 2021-01-19 17:46 UTC (permalink / raw)
To: Christophe Leroy
Cc: Timur Tabi, Rasmus Villemoes, lkml, Scott Wood, linuxppc-dev,
moderated list:ARM/FREESCALE IMX / MXC ARM ARCHITECTURE,
Qiang Zhao
In-Reply-To: <a80b9c70-f9f2-0f76-15d0-d7a1c926f485@csgroup.eu>
On Tue, Jan 19, 2021 at 11:35 AM Christophe Leroy
<christophe.leroy@csgroup.eu> wrote:
>
> Hi Rasmus,
>
> Le 28/11/2019 à 15:55, Rasmus Villemoes a écrit :
> > The QUICC engine drivers use the powerpc-specific out_be32() etc. In
> > order to allow those drivers to build for other architectures, those
> > must be replaced by iowrite32be(). However, on powerpc, out_be32() is
> > a simple inline function while iowrite32be() is out-of-line. So in
> > order not to introduce a performance regression on powerpc when making
> > the drivers work on other architectures, introduce qe_io* helpers.
> >
> > Also define the qe_{clr,set,clrset}bits* helpers in terms of these new
> > macros.
>
> Since commit https://github.com/linuxppc/linux/commit/894fa235eb4ca0bfa692dbe4932c2f940cdc8c1e
> ioread/iowrite wrappers are also inlined on PPC32, so this commit can now be reverted.
Yes. That will be great.
>
> Christophe
>
> >
> > Reviewed-by: Timur Tabi <timur@kernel.org>
> > Signed-off-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
> > ---
> > include/soc/fsl/qe/qe.h | 34 +++++++++++++++++++++++++---------
> > 1 file changed, 25 insertions(+), 9 deletions(-)
> >
> > diff --git a/include/soc/fsl/qe/qe.h b/include/soc/fsl/qe/qe.h
> > index a1aa4eb28f0c..9cac04c692fd 100644
> > --- a/include/soc/fsl/qe/qe.h
> > +++ b/include/soc/fsl/qe/qe.h
> > @@ -241,21 +241,37 @@ static inline int qe_alive_during_sleep(void)
> > #define qe_muram_offset cpm_muram_offset
> > #define qe_muram_dma cpm_muram_dma
> >
> > -#define qe_setbits_be32(_addr, _v) iowrite32be(ioread32be(_addr) | (_v), (_addr))
> > -#define qe_clrbits_be32(_addr, _v) iowrite32be(ioread32be(_addr) & ~(_v), (_addr))
> > +#ifdef CONFIG_PPC32
> > +#define qe_iowrite8(val, addr) out_8(addr, val)
> > +#define qe_iowrite16be(val, addr) out_be16(addr, val)
> > +#define qe_iowrite32be(val, addr) out_be32(addr, val)
> > +#define qe_ioread8(addr) in_8(addr)
> > +#define qe_ioread16be(addr) in_be16(addr)
> > +#define qe_ioread32be(addr) in_be32(addr)
> > +#else
> > +#define qe_iowrite8(val, addr) iowrite8(val, addr)
> > +#define qe_iowrite16be(val, addr) iowrite16be(val, addr)
> > +#define qe_iowrite32be(val, addr) iowrite32be(val, addr)
> > +#define qe_ioread8(addr) ioread8(addr)
> > +#define qe_ioread16be(addr) ioread16be(addr)
> > +#define qe_ioread32be(addr) ioread32be(addr)
> > +#endif
> > +
> > +#define qe_setbits_be32(_addr, _v) qe_iowrite32be(qe_ioread32be(_addr) | (_v), (_addr))
> > +#define qe_clrbits_be32(_addr, _v) qe_iowrite32be(qe_ioread32be(_addr) & ~(_v), (_addr))
> >
> > -#define qe_setbits_be16(_addr, _v) iowrite16be(ioread16be(_addr) | (_v), (_addr))
> > -#define qe_clrbits_be16(_addr, _v) iowrite16be(ioread16be(_addr) & ~(_v), (_addr))
> > +#define qe_setbits_be16(_addr, _v) qe_iowrite16be(qe_ioread16be(_addr) | (_v), (_addr))
> > +#define qe_clrbits_be16(_addr, _v) qe_iowrite16be(qe_ioread16be(_addr) & ~(_v), (_addr))
> >
> > -#define qe_setbits_8(_addr, _v) iowrite8(ioread8(_addr) | (_v), (_addr))
> > -#define qe_clrbits_8(_addr, _v) iowrite8(ioread8(_addr) & ~(_v), (_addr))
> > +#define qe_setbits_8(_addr, _v) qe_iowrite8(qe_ioread8(_addr) | (_v), (_addr))
> > +#define qe_clrbits_8(_addr, _v) qe_iowrite8(qe_ioread8(_addr) & ~(_v), (_addr))
> >
> > #define qe_clrsetbits_be32(addr, clear, set) \
> > - iowrite32be((ioread32be(addr) & ~(clear)) | (set), (addr))
> > + qe_iowrite32be((qe_ioread32be(addr) & ~(clear)) | (set), (addr))
> > #define qe_clrsetbits_be16(addr, clear, set) \
> > - iowrite16be((ioread16be(addr) & ~(clear)) | (set), (addr))
> > + qe_iowrite16be((qe_ioread16be(addr) & ~(clear)) | (set), (addr))
> > #define qe_clrsetbits_8(addr, clear, set) \
> > - iowrite8((ioread8(addr) & ~(clear)) | (set), (addr))
> > + qe_iowrite8((qe_ioread8(addr) & ~(clear)) | (set), (addr))
> >
> > /* Structure that defines QE firmware binary files.
> > *
> >
^ permalink raw reply
* Re: [PATCH] powerpc/uprobes: Don't allow probe on suffix of prefixed instruction
From: Oleg Nesterov @ 2021-01-19 17:26 UTC (permalink / raw)
To: Ravi Bangoria
Cc: linux-kernel, rostedt, paulus, sandipan, jniethe5, naveen.n.rao,
linuxppc-dev
In-Reply-To: <20210119091234.76317-1-ravi.bangoria@linux.ibm.com>
On 01/19, Ravi Bangoria wrote:
>
> Probe on 2nd word of a prefixed instruction is invalid scenario and
> should be restricted.
I don't understand this ppc-specific problem, but...
> +#ifdef CONFIG_PPC64
> +int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> + uprobe_opcode_t opcode)
> +{
> + uprobe_opcode_t prefix;
> + void *kaddr;
> + struct ppc_inst inst;
> +
> + /* Don't check if vaddr is pointing to the beginning of page */
> + if (!(vaddr & ~PAGE_MASK))
> + return 0;
So the fix is incomplete? Or insn at the start of page can't be prefixed?
> +int __weak arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
> + uprobe_opcode_t opcode)
> +{
> + return 0;
> +}
> +
> static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
> {
> uprobe_opcode_t old_opcode;
> @@ -275,6 +281,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
> if (is_swbp_insn(new_opcode)) {
> if (is_swbp) /* register: already installed? */
> return 0;
> + if (arch_uprobe_verify_opcode(page, vaddr, old_opcode))
> + return -EINVAL;
Well, this doesn't look good...
To me it would be better to change the prepare_uprobe() path to copy
the potential prefix into uprobe->arch and check ppc_inst_prefixed()
in arch_uprobe_analyze_insn(). What do you think?
Oleg.
^ permalink raw reply
* Re: ibmvnic: Race condition in remove callback
From: Dany Madden @ 2021-01-19 18:14 UTC (permalink / raw)
To: Uwe Kleine-König
Cc: Greg Kroah-Hartman, Juliet Kim, Jakub Kicinski, netdev, Lijun Pan,
kernel, Paul Mackerras, Sukadev Bhattiprolu, linuxppc-dev,
David S. Miller
In-Reply-To: <20210117101242.dpwayq6wdgfdzirl@pengutronix.de>
On 2021-01-17 02:12, Uwe Kleine-König wrote:
> Hello,
>
> while working on some cleanup I stumbled over a problem in the
> ibmvnic's
> remove callback. Since commit
>
> 7d7195a026ba ("ibmvnic: Do not process device remove during
> device reset")
>
> there is the following code in the remove callback:
>
> static int ibmvnic_remove(struct vio_dev *dev)
> {
> ...
> spin_lock_irqsave(&adapter->state_lock, flags);
> if (test_bit(0, &adapter->resetting)) {
> spin_unlock_irqrestore(&adapter->state_lock,
> flags);
> return -EBUSY;
> }
>
> adapter->state = VNIC_REMOVING;
> spin_unlock_irqrestore(&adapter->state_lock, flags);
>
> flush_work(&adapter->ibmvnic_reset);
> flush_delayed_work(&adapter->ibmvnic_delayed_reset);
> ...
> }
>
> Unfortunately returning -EBUSY doesn't work as intended. That's because
> the return value of this function is ignored[1] and the device is
> considered unbound by the device core (shortly) after ibmvnic_remove()
> returns.
Oh! I was not aware of this. In our code review, a question on whether
or not device reset should have a higher precedence over device remove
was raised before. So, now it is clear that this driver has to take care
of remove over reset.
>
> While looking into fixing that I noticed a worse problem:
>
> If ibmvnic_reset() (e.g. called by the tx_timeout callback) calls
> schedule_work(&adapter->ibmvnic_reset); just after the work queue is
> flushed above the problem that 7d7195a026ba intends to fix will trigger
> resulting in a use-after-free.
It was proposed that when coming into ibmvnic_remove() we lock down the
workqueue to prevent future access, flush, cleanup, then unregister the
device. Your thought on this?
>
> Also ibmvnic_reset() checks for adapter->state without holding the lock
> which might be racy, too.
>
Suka started addressing consistent locking with this patch series:
https://lists.openwall.net/netdev/2021/01/08/89
He is reworking this.
> Best regards
> Uwe
Thank you for taking the time to review this driver, Uwe. This is very
helpful for us.
Best Regards,
Dany
>
> [1] vio_bus_remove (in arch/powerpc/platforms/pseries/vio.c) records
> the
> return value and passes it on. But the driver core doesn't care for
> the return value (see __device_release_driver() in
> drivers/base/dd.c
> calling dev->bus->remove()).
^ permalink raw reply
* Re: [PATCH v3 1/8] powerpc/uaccess: Add unsafe_copy_from_user
From: Christophe Leroy @ 2021-01-19 12:33 UTC (permalink / raw)
To: Michael Ellerman, Christopher M. Riedl, linuxppc-dev
In-Reply-To: <87pn21r7yr.fsf@mpe.ellerman.id.au>
Le 19/01/2021 à 03:11, Michael Ellerman a écrit :
> "Christopher M. Riedl" <cmr@codefail.de> writes:
>> On Mon Jan 11, 2021 at 7:22 AM CST, Christophe Leroy wrote:
>>> Le 09/01/2021 à 04:25, Christopher M. Riedl a écrit :
>>>> Implement raw_copy_from_user_allowed() which assumes that userspace read
>>>> access is open. Use this new function to implement raw_copy_from_user().
>>>> Finally, wrap the new function to follow the usual "unsafe_" convention
>>>> of taking a label argument.
>>>
>>> I think there is no point implementing raw_copy_from_user_allowed(), see
>>> https://github.com/linuxppc/linux/commit/4b842e4e25b1 and
>>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8c74fc9ce8131cabb10b3e95dc0e430f396ee83e.1610369143.git.christophe.leroy@csgroup.eu/
>>>
>>> You should simply do:
>>>
>>> #define unsafe_copy_from_user(d, s, l, e) \
>>> unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
>>>
>>
>> I gave this a try and the signal ops decreased by ~8K. Now, to be
>> honest, I am not sure what an "acceptable" benchmark number here
>> actually is - so maybe this is ok? Same loss with both radix and hash:
>>
>> | | hash | radix |
>> | ------------------------------------ | ------ | ------ |
>> | linuxppc/next | 118693 | 133296 |
>> | linuxppc/next w/o KUAP+KUEP | 228911 | 228654 |
>> | unsafe-signal64 | 200480 | 234067 |
>> | unsafe-signal64 (__copy_tofrom_user) | 192467 | 225119 |
>>
>> To put this into perspective, prior to KUAP and uaccess flush, signal
>> performance in this benchmark was ~290K on hash.
>
> If I'm doing the math right 8K is ~4% of the best number.
>
> It seems like 4% is worth a few lines of code to handle these constant
> sizes. It's not like we have performance to throw away.
>
> Or, we should chase down where the call sites are that are doing small
> constant copies with copy_to/from_user() and change them to use
> get/put_user().
>
Christopher, when you say you gave it a try, is I my series or only the following ?
#define unsafe_copy_from_user(d, s, l, e) \
unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
Because I see no use of unsafe_copy_from_user() that would explain that.
Christophe
^ permalink raw reply
* Re: [PATCH v6 14/39] powerpc/perf: move perf irq/nmi handling details into traps.c
From: Athira Rajeev @ 2021-01-19 10:24 UTC (permalink / raw)
To: Nicholas Piggin; +Cc: linuxppc-dev
In-Reply-To: <20210115165012.1260253-15-npiggin@gmail.com>
> On 15-Jan-2021, at 10:19 PM, Nicholas Piggin <npiggin@gmail.com> wrote:
>
> This is required in order to allow more significant differences between
> NMI type interrupt handlers and regular asynchronous handlers.
>
> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
> arch/powerpc/kernel/traps.c | 31 +++++++++++++++++++++++++++-
> arch/powerpc/perf/core-book3s.c | 35 ++------------------------------
> arch/powerpc/perf/core-fsl-emb.c | 25 -----------------------
> 3 files changed, 32 insertions(+), 59 deletions(-)
Hi Nick,
Reviewed this perf patch which moves the nmi_enter/irq_enter to traps.c and code-wise changes
for perf looks fine to me. Further, I was trying to test this by picking the whole Patch series on top
of 5.11.0-rc3 kernel and using below test scenario:
Intention of testcase is to check whether the perf nmi and asynchronous interrupts are getting
captured as expected. My test kernel module below tries to create one of performance monitor
counter ( PMC6 ) overflow between local_irq_save/local_irq_restore.
[ Here interrupts are disabled and has IRQS_DISABLED as regs->softe ].
I am expecting the PMI to come as an NMI in this case. I am also configuring ftrace so that I
can confirm whether it comes as an NMI or a replayed interrupt from the trace.
Environment :One CPU online
prerequisite for ftrace:
# cd /sys/kernel/debug/tracing
# echo 100 > buffer_percent
# echo 200000 > buffer_size_kb
# echo ppc-tb > trace_clock
# echo function > current_tracer
Part of sample kernel test module to trigger a PMI between
local_irq_save and local_irq_restore:
<<>>
static ulong delay = 1;
static void busy_wait(ulong time)
{
udelay(delay);
}
static __always_inline void irq_test(void)
{
unsigned long flags;
local_irq_save(flags);
trace_printk("IN IRQ TEST\n");
mtspr(SPRN_MMCR0, 0x80000000);
mtspr(SPRN_PMC6, 0x80000000 - 100);
mtspr(SPRN_MMCR0, 0x6004000);
busy_wait(delay);
trace_printk("IN IRQ TEST DONE\n");
local_irq_restore(flags);
mtspr(SPRN_MMCR0, 0x80000000);
mtspr(SPRN_PMC6, 0);
}
<<>>
But this resulted in soft lockup, Adding a snippet of call-trace below:
[ 883.900762] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[ 883.901381] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G OE 5.11.0-rc3+ #34
--
[ 883.901999] NIP [c0000000000168d0] replay_soft_interrupts+0x70/0x2f0
[ 883.902032] LR [c00000000003b2b8] interrupt_exit_kernel_prepare+0x1e8/0x240
[ 883.902063] Call Trace:
[ 883.902085] [c000000001c96f50] [c00000000003b2b8] interrupt_exit_kernel_prepare+0x1e8/0x240 (unreliable)
[ 883.902139] [c000000001c96fb0] [c00000000000fd88] interrupt_return+0x158/0x200
[ 883.902185] --- interrupt: ea0 at __rb_reserve_next+0xc0/0x5b0
[ 883.902224] NIP: c0000000002d8980 LR: c0000000002d897c CTR: c0000000001aad90
[ 883.902262] REGS: c000000001c97020 TRAP: 0ea0 Tainted: G OE (5.11.0-rc3+)
[ 883.902301] MSR: 9000000000009033 <SF,HV,EE,ME,IR,DR,RI,LE> CR: 28000484 XER: 20040000
[ 883.902387] CFAR: c00000000000fe00 IRQMASK: 0
--
[ 883.902757] NIP [c0000000002d8980] __rb_reserve_next+0xc0/0x5b0
[ 883.902786] LR [c0000000002d897c] __rb_reserve_next+0xbc/0x5b0
[ 883.902824] --- interrupt: ea0
[ 883.902848] [c000000001c97360] [c0000000002d8fcc] ring_buffer_lock_reserve+0x15c/0x580
[ 883.902894] [c000000001c973f0] [c0000000002e82fc] trace_function+0x4c/0x1c0
[ 883.902930] [c000000001c97440] [c0000000002f6f50] function_trace_call+0x140/0x190
[ 883.902976] [c000000001c97470] [c00000000007d6f8] ftrace_call+0x4/0x44
[ 883.903021] [c000000001c97660] [c000000000dcf70c] __do_softirq+0x15c/0x3d4
[ 883.903066] [c000000001c97750] [c00000000015fc68] irq_exit+0x198/0x1b0
[ 883.903102] [c000000001c97780] [c000000000dc1790] timer_interrupt+0x170/0x3b0
[ 883.903148] [c000000001c977e0] [c000000000016994] replay_soft_interrupts+0x134/0x2f0
[ 883.903193] [c000000001c979d0] [c00000000003b2b8] interrupt_exit_kernel_prepare+0x1e8/0x240
[ 883.903240] [c000000001c97a30] [c00000000000fd88] interrupt_return+0x158/0x200
[ 883.903276] --- interrupt: ea0 at arch_local_irq_restore+0x70/0xc0
Thanks
Athira
>
> diff --git a/arch/powerpc/kernel/traps.c b/arch/powerpc/kernel/traps.c
> index 738370519937..bd55f201115b 100644
> --- a/arch/powerpc/kernel/traps.c
> +++ b/arch/powerpc/kernel/traps.c
> @@ -1892,11 +1892,40 @@ void vsx_unavailable_tm(struct pt_regs *regs)
> }
> #endif /* CONFIG_PPC_TRANSACTIONAL_MEM */
>
> -void performance_monitor_exception(struct pt_regs *regs)
> +static void performance_monitor_exception_nmi(struct pt_regs *regs)
> +{
> + nmi_enter();
> +
> + __this_cpu_inc(irq_stat.pmu_irqs);
> +
> + perf_irq(regs);
> +
> + nmi_exit();
> +}
> +
> +static void performance_monitor_exception_async(struct pt_regs *regs)
> {
> + irq_enter();
> +
> __this_cpu_inc(irq_stat.pmu_irqs);
>
> perf_irq(regs);
> +
> + irq_exit();
> +}
> +
> +void performance_monitor_exception(struct pt_regs *regs)
> +{
> + /*
> + * On 64-bit, if perf interrupts hit in a local_irq_disable
> + * (soft-masked) region, we consider them as NMIs. This is required to
> + * prevent hash faults on user addresses when reading callchains (and
> + * looks better from an irq tracing perspective).
> + */
> + if (IS_ENABLED(CONFIG_PPC64) && unlikely(arch_irq_disabled_regs(regs)))
> + performance_monitor_exception_nmi(regs);
> + else
> + performance_monitor_exception_async(regs);
> }
>
> #ifdef CONFIG_PPC_ADV_DEBUG_REGS
> diff --git a/arch/powerpc/perf/core-book3s.c b/arch/powerpc/perf/core-book3s.c
> index 28206b1fe172..9fd06010e8b6 100644
> --- a/arch/powerpc/perf/core-book3s.c
> +++ b/arch/powerpc/perf/core-book3s.c
> @@ -110,10 +110,6 @@ static inline void perf_read_regs(struct pt_regs *regs)
> {
> regs->result = 0;
> }
> -static inline int perf_intr_is_nmi(struct pt_regs *regs)
> -{
> - return 0;
> -}
>
> static inline int siar_valid(struct pt_regs *regs)
> {
> @@ -353,15 +349,6 @@ static inline void perf_read_regs(struct pt_regs *regs)
> regs->result = use_siar;
> }
>
> -/*
> - * If interrupts were soft-disabled when a PMU interrupt occurs, treat
> - * it as an NMI.
> - */
> -static inline int perf_intr_is_nmi(struct pt_regs *regs)
> -{
> - return (regs->softe & IRQS_DISABLED);
> -}
> -
> /*
> * On processors like P7+ that have the SIAR-Valid bit, marked instructions
> * must be sampled only if the SIAR-valid bit is set.
> @@ -2279,7 +2266,6 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> struct perf_event *event;
> unsigned long val[8];
> int found, active;
> - int nmi;
>
> if (cpuhw->n_limited)
> freeze_limited_counters(cpuhw, mfspr(SPRN_PMC5),
> @@ -2287,18 +2273,6 @@ static void __perf_event_interrupt(struct pt_regs *regs)
>
> perf_read_regs(regs);
>
> - /*
> - * If perf interrupts hit in a local_irq_disable (soft-masked) region,
> - * we consider them as NMIs. This is required to prevent hash faults on
> - * user addresses when reading callchains. See the NMI test in
> - * do_hash_page.
> - */
> - nmi = perf_intr_is_nmi(regs);
> - if (nmi)
> - nmi_enter();
> - else
> - irq_enter();
> -
> /* Read all the PMCs since we'll need them a bunch of times */
> for (i = 0; i < ppmu->n_counter; ++i)
> val[i] = read_pmc(i + 1);
> @@ -2344,8 +2318,8 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> }
> }
> }
> - if (!found && !nmi && printk_ratelimit())
> - printk(KERN_WARNING "Can't find PMC that caused IRQ\n");
> + if (unlikely(!found) && !arch_irq_disabled_regs(regs))
> + printk_ratelimited(KERN_WARNING "Can't find PMC that caused IRQ\n");
>
> /*
> * Reset MMCR0 to its normal value. This will set PMXE and
> @@ -2355,11 +2329,6 @@ static void __perf_event_interrupt(struct pt_regs *regs)
> * we get back out of this interrupt.
> */
> write_mmcr0(cpuhw, cpuhw->mmcr.mmcr0);
> -
> - if (nmi)
> - nmi_exit();
> - else
> - irq_exit();
> }
>
> static void perf_event_interrupt(struct pt_regs *regs)
> diff --git a/arch/powerpc/perf/core-fsl-emb.c b/arch/powerpc/perf/core-fsl-emb.c
> index e0e7e276bfd2..ee721f420a7b 100644
> --- a/arch/powerpc/perf/core-fsl-emb.c
> +++ b/arch/powerpc/perf/core-fsl-emb.c
> @@ -31,19 +31,6 @@ static atomic_t num_events;
> /* Used to avoid races in calling reserve/release_pmc_hardware */
> static DEFINE_MUTEX(pmc_reserve_mutex);
>
> -/*
> - * If interrupts were soft-disabled when a PMU interrupt occurs, treat
> - * it as an NMI.
> - */
> -static inline int perf_intr_is_nmi(struct pt_regs *regs)
> -{
> -#ifdef __powerpc64__
> - return (regs->softe & IRQS_DISABLED);
> -#else
> - return 0;
> -#endif
> -}
> -
> static void perf_event_interrupt(struct pt_regs *regs);
>
> /*
> @@ -659,13 +646,6 @@ static void perf_event_interrupt(struct pt_regs *regs)
> struct perf_event *event;
> unsigned long val;
> int found = 0;
> - int nmi;
> -
> - nmi = perf_intr_is_nmi(regs);
> - if (nmi)
> - nmi_enter();
> - else
> - irq_enter();
>
> for (i = 0; i < ppmu->n_counter; ++i) {
> event = cpuhw->event[i];
> @@ -690,11 +670,6 @@ static void perf_event_interrupt(struct pt_regs *regs)
> mtmsr(mfmsr() | MSR_PMM);
> mtpmr(PMRN_PMGC0, PMGC0_PMIE | PMGC0_FCECE);
> isync();
> -
> - if (nmi)
> - nmi_exit();
> - else
> - irq_exit();
> }
>
> void hw_perf_event_setup(int cpu)
> --
> 2.23.0
>
^ permalink raw reply
* [PATCH] powerpc/47x: Disable 256k page size
From: Christophe Leroy @ 2021-01-19 7:00 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: linuxppc-dev, linux-kernel
PPC47x_TLBE_SIZE isn't defined for 256k pages, so
this size of page shall not be selected for 47x.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: e7f75ad01d59 ("powerpc/47x: Base ppc476 support")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/Kconfig | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 107bb4319e0e..a685e42d3993 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -772,7 +772,7 @@ config PPC_64K_PAGES
config PPC_256K_PAGES
bool "256k page size"
- depends on 44x && !STDBINUTILS
+ depends on 44x && !STDBINUTILS && !PPC_47x
help
Make the page size 256k.
--
2.25.0
^ permalink raw reply related
* [PATCH] powerpc/uprobes: Don't allow probe on suffix of prefixed instruction
From: Ravi Bangoria @ 2021-01-19 9:12 UTC (permalink / raw)
To: mpe
Cc: ravi.bangoria, jniethe5, oleg, rostedt, linux-kernel, paulus,
sandipan, naveen.n.rao, linuxppc-dev
Probe on 2nd word of a prefixed instruction is invalid scenario and
should be restricted.
There are two ways probed instruction is changed in mapped pages.
First, when Uprobe is activated, it searches for all the relevant
pages and replace instruction in them. In this case, if we notice
that probe is on the 2nd word of prefixed instruction, error out
directly. Second, when Uprobe is already active and user maps a
relevant page via mmap(), instruction is replaced via mmap() code
path. But because Uprobe is invalid, entire mmap() operation can
not be stopped. In this case just print an error and continue.
Signed-off-by: Ravi Bangoria <ravi.bangoria@linux.ibm.com>
---
arch/powerpc/kernel/uprobes.c | 28 ++++++++++++++++++++++++++++
include/linux/uprobes.h | 1 +
kernel/events/uprobes.c | 8 ++++++++
3 files changed, 37 insertions(+)
diff --git a/arch/powerpc/kernel/uprobes.c b/arch/powerpc/kernel/uprobes.c
index e8a63713e655..c73d5a397164 100644
--- a/arch/powerpc/kernel/uprobes.c
+++ b/arch/powerpc/kernel/uprobes.c
@@ -7,6 +7,7 @@
* Adapted from the x86 port by Ananth N Mavinakayanahalli <ananth@in.ibm.com>
*/
#include <linux/kernel.h>
+#include <linux/highmem.h>
#include <linux/sched.h>
#include <linux/ptrace.h>
#include <linux/uprobes.h>
@@ -44,6 +45,33 @@ int arch_uprobe_analyze_insn(struct arch_uprobe *auprobe,
return 0;
}
+#ifdef CONFIG_PPC64
+int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+ uprobe_opcode_t opcode)
+{
+ uprobe_opcode_t prefix;
+ void *kaddr;
+ struct ppc_inst inst;
+
+ /* Don't check if vaddr is pointing to the beginning of page */
+ if (!(vaddr & ~PAGE_MASK))
+ return 0;
+
+ kaddr = kmap_atomic(page);
+ memcpy(&prefix, kaddr + ((vaddr - 4) & ~PAGE_MASK), UPROBE_SWBP_INSN_SIZE);
+ kunmap_atomic(kaddr);
+
+ inst = ppc_inst_prefix(prefix, opcode);
+
+ if (ppc_inst_prefixed(inst)) {
+ printk_ratelimited("Cannot register a uprobe on the second "
+ "word of prefixed instruction\n");
+ return -1;
+ }
+ return 0;
+}
+#endif
+
/*
* arch_uprobe_pre_xol - prepare to execute out of line.
* @auprobe: the probepoint information.
diff --git a/include/linux/uprobes.h b/include/linux/uprobes.h
index f46e0ca0169c..5a3b45878e13 100644
--- a/include/linux/uprobes.h
+++ b/include/linux/uprobes.h
@@ -128,6 +128,7 @@ extern bool uprobe_deny_signal(void);
extern bool arch_uprobe_skip_sstep(struct arch_uprobe *aup, struct pt_regs *regs);
extern void uprobe_clear_state(struct mm_struct *mm);
extern int arch_uprobe_analyze_insn(struct arch_uprobe *aup, struct mm_struct *mm, unsigned long addr);
+int arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t opcode);
extern int arch_uprobe_pre_xol(struct arch_uprobe *aup, struct pt_regs *regs);
extern int arch_uprobe_post_xol(struct arch_uprobe *aup, struct pt_regs *regs);
extern bool arch_uprobe_xol_was_trapped(struct task_struct *tsk);
diff --git a/kernel/events/uprobes.c b/kernel/events/uprobes.c
index bf9edd8d75be..be02e6c26e3f 100644
--- a/kernel/events/uprobes.c
+++ b/kernel/events/uprobes.c
@@ -255,6 +255,12 @@ static void copy_to_page(struct page *page, unsigned long vaddr, const void *src
kunmap_atomic(kaddr);
}
+int __weak arch_uprobe_verify_opcode(struct page *page, unsigned long vaddr,
+ uprobe_opcode_t opcode)
+{
+ return 0;
+}
+
static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t *new_opcode)
{
uprobe_opcode_t old_opcode;
@@ -275,6 +281,8 @@ static int verify_opcode(struct page *page, unsigned long vaddr, uprobe_opcode_t
if (is_swbp_insn(new_opcode)) {
if (is_swbp) /* register: already installed? */
return 0;
+ if (arch_uprobe_verify_opcode(page, vaddr, old_opcode))
+ return -EINVAL;
} else {
if (!is_swbp) /* unregister: was it changed by us? */
return 0;
--
2.26.2
^ permalink raw reply related
* [PATCH] powerpc/kvm: Force selection of CONFIG_PPC_FPU
From: Christophe Leroy @ 2021-01-19 6:36 UTC (permalink / raw)
To: Benjamin Herrenschmidt, Paul Mackerras, Michael Ellerman
Cc: linuxppc-dev, linux-kernel, kvm-ppc
book3s/32 kvm is designed with the assumption that
an FPU is always present.
Force selection of FPU support in the kernel when
build KVM.
Reported-by: kernel test robot <lkp@intel.com>
Fixes: 7d68c8916950 ("powerpc/32s: Allow deselecting CONFIG_PPC_FPU on mpc832x")
Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu>
---
arch/powerpc/kvm/Kconfig | 1 +
1 file changed, 1 insertion(+)
diff --git a/arch/powerpc/kvm/Kconfig b/arch/powerpc/kvm/Kconfig
index 549591d9aaa2..e45644657d49 100644
--- a/arch/powerpc/kvm/Kconfig
+++ b/arch/powerpc/kvm/Kconfig
@@ -54,6 +54,7 @@ config KVM_BOOK3S_32
select KVM
select KVM_BOOK3S_32_HANDLER
select KVM_BOOK3S_PR_POSSIBLE
+ select PPC_FPU
help
Support running unmodified book3s_32 guest kernels
in virtual machines on book3s_32 host processors.
--
2.25.0
^ permalink raw reply related
* Re: [PATCH 6/6] powerpc/rtas: constrain user region allocation to RMA
From: Michael Ellerman @ 2021-01-19 9:00 UTC (permalink / raw)
To: Nathan Lynch, linuxppc-dev; +Cc: aik, tyreld, brking, ajd, aneesh.kumar
In-Reply-To: <20210114220004.1138993-7-nathanl@linux.ibm.com>
Nathan Lynch <nathanl@linux.ibm.com> writes:
> Memory locations passed as arguments from the OS to RTAS usually need
> to be addressable in 32-bit mode and must reside in the Real Mode
> Area. On PAPR guests, the RMA starts at logical address 0 and is the
> first logical memory block reported in the LPAR’s device tree.
>
> On powerpc targets with RTAS, Linux makes available to user space a
> region of memory suitable for arguments to be passed to RTAS via
> sys_rtas(). This region (rtas_rmo_buf) is allocated via the memblock
> API during boot in order to ensure that it satisfies the requirements
> described above.
>
> With radix MMU, the upper limit supplied to the memblock allocation
> can exceed the bounds of the first logical memory block, since
> ppc64_rma_size is ULONG_MAX and RTAS_INSTANTIATE_MAX is 1GB.
Why does the size of the first memory block matter for radix?
The 1GB limit is sufficient to make it accessible by 32-bit code.
> (512MB is a common size of the first memory block according to a small sample of
> LPARs I have checked.)
That's the minimum we request, see prom_init.c:
/* option vector 2: Open Firmware options supported */
.vec2 = {
.byte1 = OV2_REAL_MODE,
.reserved = 0,
.real_base = cpu_to_be32(0xffffffff),
.real_size = cpu_to_be32(0xffffffff),
.virt_base = cpu_to_be32(0xffffffff),
.virt_size = cpu_to_be32(0xffffffff),
.load_base = cpu_to_be32(0xffffffff),
.min_rma = cpu_to_be32(512), /* 512MB min RMA */
Since v4.12 in 2017.
cheers
^ permalink raw reply
* Re: [PATCH v3 1/8] powerpc/uaccess: Add unsafe_copy_from_user
From: Christophe Leroy @ 2021-01-19 7:29 UTC (permalink / raw)
To: Christopher M. Riedl, linuxppc-dev
In-Reply-To: <C8LLSM3UC598.L4E2VMGJOI06@geist>
Le 17/01/2021 à 18:19, Christopher M. Riedl a écrit :
> On Mon Jan 11, 2021 at 7:22 AM CST, Christophe Leroy wrote:
>>
>>
>> Le 09/01/2021 à 04:25, Christopher M. Riedl a écrit :
>>> Implement raw_copy_from_user_allowed() which assumes that userspace read
>>> access is open. Use this new function to implement raw_copy_from_user().
>>> Finally, wrap the new function to follow the usual "unsafe_" convention
>>> of taking a label argument.
>>
>> I think there is no point implementing raw_copy_from_user_allowed(), see
>> https://github.com/linuxppc/linux/commit/4b842e4e25b1 and
>> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/8c74fc9ce8131cabb10b3e95dc0e430f396ee83e.1610369143.git.christophe.leroy@csgroup.eu/
>>
>> You should simply do:
>>
>> #define unsafe_copy_from_user(d, s, l, e) \
>> unsafe_op_wrap(__copy_tofrom_user((__force void __user *)d, s, l), e)
>>
>
> I gave this a try and the signal ops decreased by ~8K. Now, to be
> honest, I am not sure what an "acceptable" benchmark number here
> actually is - so maybe this is ok? Same loss with both radix and hash:
I don't think this is ok, but it probably means that you are using unsafe_copy_from_user() to copy
small constant size data that should be copied with unsafe_get_user() instead.
>
> | | hash | radix |
> | ------------------------------------ | ------ | ------ |
> | linuxppc/next | 118693 | 133296 |
> | linuxppc/next w/o KUAP+KUEP | 228911 | 228654 |
> | unsafe-signal64 | 200480 | 234067 |
> | unsafe-signal64 (__copy_tofrom_user) | 192467 | 225119 |
>
> To put this into perspective, prior to KUAP and uaccess flush, signal
> performance in this benchmark was ~290K on hash.
>
>>
>> Christophe
>>
>>>
>>> The new raw_copy_from_user_allowed() calls non-inline __copy_tofrom_user()
>>> internally. This is still safe to call inside user access blocks formed
>>> with user_*_access_begin()/user_*_access_end() since asm functions are not
>>> instrumented for tracing.
>>>
>>> Signed-off-by: Christopher M. Riedl <cmr@codefail.de>
>>> ---
>>> arch/powerpc/include/asm/uaccess.h | 28 +++++++++++++++++++---------
>>> 1 file changed, 19 insertions(+), 9 deletions(-)
>>>
>>> diff --git a/arch/powerpc/include/asm/uaccess.h b/arch/powerpc/include/asm/uaccess.h
>>> index 501c9a79038c..698f3a6d6ae5 100644
>>> --- a/arch/powerpc/include/asm/uaccess.h
>>> +++ b/arch/powerpc/include/asm/uaccess.h
>>> @@ -403,38 +403,45 @@ raw_copy_in_user(void __user *to, const void __user *from, unsigned long n)
>>> }
>>> #endif /* __powerpc64__ */
>>>
>>> -static inline unsigned long raw_copy_from_user(void *to,
>>> - const void __user *from, unsigned long n)
>>> +static inline unsigned long
>>> +raw_copy_from_user_allowed(void *to, const void __user *from, unsigned long n)
>>> {
>>> - unsigned long ret;
>>> if (__builtin_constant_p(n) && (n <= 8)) {
>>> - ret = 1;
>>> + unsigned long ret = 1;
>>>
>>> switch (n) {
>>> case 1:
>>> barrier_nospec();
>>> - __get_user_size(*(u8 *)to, from, 1, ret);
>>> + __get_user_size_allowed(*(u8 *)to, from, 1, ret);
>>> break;
>>> case 2:
>>> barrier_nospec();
>>> - __get_user_size(*(u16 *)to, from, 2, ret);
>>> + __get_user_size_allowed(*(u16 *)to, from, 2, ret);
>>> break;
>>> case 4:
>>> barrier_nospec();
>>> - __get_user_size(*(u32 *)to, from, 4, ret);
>>> + __get_user_size_allowed(*(u32 *)to, from, 4, ret);
>>> break;
>>> case 8:
>>> barrier_nospec();
>>> - __get_user_size(*(u64 *)to, from, 8, ret);
>>> + __get_user_size_allowed(*(u64 *)to, from, 8, ret);
>>> break;
>>> }
>>> if (ret == 0)
>>> return 0;
>>> }
>>>
>>> + return __copy_tofrom_user((__force void __user *)to, from, n);
>>> +}
>>> +
>>> +static inline unsigned long
>>> +raw_copy_from_user(void *to, const void __user *from, unsigned long n)
>>> +{
>>> + unsigned long ret;
>>> +
>>> barrier_nospec();
>>> allow_read_from_user(from, n);
>>> - ret = __copy_tofrom_user((__force void __user *)to, from, n);
>>> + ret = raw_copy_from_user_allowed(to, from, n);
>>> prevent_read_from_user(from, n);
>>> return ret;
>>> }
>>> @@ -542,6 +549,9 @@ user_write_access_begin(const void __user *ptr, size_t len)
>>> #define unsafe_get_user(x, p, e) unsafe_op_wrap(__get_user_allowed(x, p), e)
>>> #define unsafe_put_user(x, p, e) __put_user_goto(x, p, e)
>>>
>>> +#define unsafe_copy_from_user(d, s, l, e) \
>>> + unsafe_op_wrap(raw_copy_from_user_allowed(d, s, l), e)
>>> +
>>> #define unsafe_copy_to_user(d, s, l, e) \
>>> do { \
>>> u8 __user *_dst = (u8 __user *)(d); \
>>>
^ permalink raw reply
* [PATCH 0/2] powerpc/cacheinfo: Add "ibm,thread-groups" awareness
From: Gautham R. Shenoy @ 2021-01-19 7:36 UTC (permalink / raw)
To: Michael Ellerman, Nathan Lynch, Srikar Dronamraju
Cc: Gautham R. Shenoy, linuxppc-dev, linux-kernel
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
Hi,
Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is device-tree node
corresponding to that cache. On some of the POWER server platforms
thread-groups within the core share different sets of caches (Eg: On
SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms, there is a single device-tree node corresponding to that
cache and the cache-configuration within the threads of the core is
indicated via "ibm,thread-groups" device-tree property.
Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementoined systems, cacheinfo code still treats
all the threads in the core to be sharing the cache because of the
single device-tree node (In the earlier example, the cacheinfo code
would says CPUs 0-7 share L1 cache).
In this patch series, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We indexe the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPUX, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
of cache which are not represented in "ibm,thread-groups" property,
the thread-group id is -1.
We can now remove the helper function get_shared_cpu_map() and
index_dir_to_cpu() since the cache->shared_cpu_map contains the
correct satate of the thread-siblings sharing the cache.
This has been tested on a SMT8 POWER9 system where L1 cache is split
between groups of threads in the core and on an SMT8 POWER10 system
where L1 and L2 caches are split between groups of threads in a core.
With this patch series, on POWER10 SMT8 system, we see the following
reported via sysfs:
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15
$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11
$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9
$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8
Gautham R. Shenoy (2):
powerpc/cacheinfo: Lookup cache by dt node and thread-group id
powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()
arch/powerpc/include/asm/smp.h | 3 +
arch/powerpc/kernel/cacheinfo.c | 121 ++++++++++++++++++++--------------------
2 files changed, 62 insertions(+), 62 deletions(-)
--
1.9.4
^ permalink raw reply
* [PATCH 2/2] powerpc/cacheinfo: Remove the redundant get_shared_cpu_map()
From: Gautham R. Shenoy @ 2021-01-19 7:36 UTC (permalink / raw)
To: Michael Ellerman, Nathan Lynch, Srikar Dronamraju
Cc: Gautham R. Shenoy, linuxppc-dev, linux-kernel
In-Reply-To: <1611041780-8640-1-git-send-email-ego@linux.vnet.ibm.com>
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
The helper function get_shared_cpu_map() was added in
'commit 500fe5f550ec ("powerpc/cacheinfo: Report the correct
shared_cpu_map on big-cores")'
and subsequently expanded upon in
'commit 0be47634db0b ("powerpc/cacheinfo: Print correct cache-sibling
map/list for L2 cache")'
in order to help report the correct groups of threads sharing these caches
on big-core systems where groups of threads within a core can share
different sets of caches.
Now that powerpc/cacheinfo is aware of "ibm,thread-groups" property,
cache->shared_cpu_map contains the correct set of thread-siblings
sharing the cache. Hence we no longer need the functions
get_shared_cpu_map(). This patch removes this function. We also remove
the helper function index_dir_to_cpu() which was only called by
get_shared_cpu_map().
With these functions removed, we can still see the correct
cache-sibling map/list for L1 and L2 caches on systems with L1 and L2
caches distributed among groups of threads in a core.
With this patch, on a SMT8 POWER10 system where the L1 and L2 caches
are split between the two groups of threads in a core, for CPUs 8,9,
the L1-Data, L1-Instruction, L2, L3 cache CPU sibling list is as
follows:
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10,12,14
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-15
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11,13,15
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-15
$ ppc64_cpu --smt=4
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8,10
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-11
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9,11
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-11
$ ppc64_cpu --smt=2
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8-9
/sys/devices/system/cpu/cpu9/cache/index0/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index1/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index2/shared_cpu_list:9
/sys/devices/system/cpu/cpu9/cache/index3/shared_cpu_list:8-9
$ ppc64_cpu --smt=1
$ grep . /sys/devices/system/cpu/cpu[89]/cache/index[0123]/shared_cpu_list
/sys/devices/system/cpu/cpu8/cache/index0/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index1/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index2/shared_cpu_list:8
/sys/devices/system/cpu/cpu8/cache/index3/shared_cpu_list:8
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
arch/powerpc/kernel/cacheinfo.c | 41 +----------------------------------------
1 file changed, 1 insertion(+), 40 deletions(-)
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 5a6925d..20d9169 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -675,45 +675,6 @@ static ssize_t level_show(struct kobject *k, struct kobj_attribute *attr, char *
static struct kobj_attribute cache_level_attr =
__ATTR(level, 0444, level_show, NULL);
-static unsigned int index_dir_to_cpu(struct cache_index_dir *index)
-{
- struct kobject *index_dir_kobj = &index->kobj;
- struct kobject *cache_dir_kobj = index_dir_kobj->parent;
- struct kobject *cpu_dev_kobj = cache_dir_kobj->parent;
- struct device *dev = kobj_to_dev(cpu_dev_kobj);
-
- return dev->id;
-}
-
-/*
- * On big-core systems, each core has two groups of CPUs each of which
- * has its own L1-cache. The thread-siblings which share l1-cache with
- * @cpu can be obtained via cpu_smallcore_mask().
- *
- * On some big-core systems, the L2 cache is shared only between some
- * groups of siblings. This is already parsed and encoded in
- * cpu_l2_cache_mask().
- *
- * TODO: cache_lookup_or_instantiate() needs to be made aware of the
- * "ibm,thread-groups" property so that cache->shared_cpu_map
- * reflects the correct siblings on platforms that have this
- * device-tree property. This helper function is only a stop-gap
- * solution so that we report the correct siblings to the
- * userspace via sysfs.
- */
-static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, struct cache *cache)
-{
- if (has_big_cores) {
- int cpu = index_dir_to_cpu(index);
- if (cache->level == 1)
- return cpu_smallcore_mask(cpu);
- if (cache->level == 2 && thread_group_shares_l2)
- return cpu_l2_cache_mask(cpu);
- }
-
- return &cache->shared_cpu_map;
-}
-
static ssize_t
show_shared_cpumap(struct kobject *k, struct kobj_attribute *attr, char *buf, bool list)
{
@@ -724,7 +685,7 @@ static const struct cpumask *get_shared_cpu_map(struct cache_index_dir *index, s
index = kobj_to_cache_index_dir(k);
cache = index->cache;
- mask = get_shared_cpu_map(index, cache);
+ mask = &cache->shared_cpu_map;
return cpumap_print_to_pagebuf(list, buf, mask);
}
--
1.9.4
^ permalink raw reply related
* [PATCH 1/2] powerpc/cacheinfo: Lookup cache by dt node and thread-group id
From: Gautham R. Shenoy @ 2021-01-19 7:36 UTC (permalink / raw)
To: Michael Ellerman, Nathan Lynch, Srikar Dronamraju
Cc: Gautham R. Shenoy, linuxppc-dev, linux-kernel
In-Reply-To: <1611041780-8640-1-git-send-email-ego@linux.vnet.ibm.com>
From: "Gautham R. Shenoy" <ego@linux.vnet.ibm.com>
Currently the cacheinfo code on powerpc indexes the "cache" objects
(modelling the L1/L2/L3 caches) where the key is device-tree node
corresponding to that cache. On some of the POWER server platforms
thread-groups within the core share different sets of caches (Eg: On
SMT8 POWER9 systems, threads 0,2,4,6 of a core share L1 cache and
threads 1,3,5,7 of the same core share another L1 cache). On such
platforms, there is a single device-tree node corresponding to that
cache and the cache-configuration within the threads of the core is
indicated via "ibm,thread-groups" device-tree property.
Since the current code is not aware of the "ibm,thread-groups"
property, on the aforementoined systems, cacheinfo code still treats
all the threads in the core to be sharing the cache because of the
single device-tree node (In the earlier example, the cacheinfo code
would says CPUs 0-7 share L1 cache).
In this patch, we make the powerpc cacheinfo code aware of the
"ibm,thread-groups" property. We indexe the "cache" objects by the
key-pair (device-tree node, thread-group id). For any CPUX, for a
given level of cache, the thread-group id is defined to be the first
CPU in the "ibm,thread-groups" cache-group containing CPUX. For levels
of cache which are not represented in "ibm,thread-groups" property,
the thread-group id is -1.
Signed-off-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
---
arch/powerpc/include/asm/smp.h | 3 ++
arch/powerpc/kernel/cacheinfo.c | 80 +++++++++++++++++++++++++++++------------
2 files changed, 61 insertions(+), 22 deletions(-)
diff --git a/arch/powerpc/include/asm/smp.h b/arch/powerpc/include/asm/smp.h
index c4e2d53..39de24b 100644
--- a/arch/powerpc/include/asm/smp.h
+++ b/arch/powerpc/include/asm/smp.h
@@ -32,6 +32,9 @@
extern int cpu_to_chip_id(int cpu);
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l1_cache_map);
+DECLARE_PER_CPU(cpumask_var_t, thread_group_l2_cache_map);
+
#ifdef CONFIG_SMP
struct smp_ops_t {
diff --git a/arch/powerpc/kernel/cacheinfo.c b/arch/powerpc/kernel/cacheinfo.c
index 6f903e9a..5a6925d 100644
--- a/arch/powerpc/kernel/cacheinfo.c
+++ b/arch/powerpc/kernel/cacheinfo.c
@@ -120,6 +120,7 @@ struct cache {
struct cpumask shared_cpu_map; /* online CPUs using this cache */
int type; /* split cache disambiguation */
int level; /* level not explicit in device tree */
+ int group_id; /* id of the group of threads that share this cache */
struct list_head list; /* global list of cache objects */
struct cache *next_local; /* next cache of >= level */
};
@@ -142,22 +143,24 @@ static const char *cache_type_string(const struct cache *cache)
}
static void cache_init(struct cache *cache, int type, int level,
- struct device_node *ofnode)
+ struct device_node *ofnode, int group_id)
{
cache->type = type;
cache->level = level;
cache->ofnode = of_node_get(ofnode);
+ cache->group_id = group_id;
INIT_LIST_HEAD(&cache->list);
list_add(&cache->list, &cache_list);
}
-static struct cache *new_cache(int type, int level, struct device_node *ofnode)
+static struct cache *new_cache(int type, int level,
+ struct device_node *ofnode, int group_id)
{
struct cache *cache;
cache = kzalloc(sizeof(*cache), GFP_KERNEL);
if (cache)
- cache_init(cache, type, level, ofnode);
+ cache_init(cache, type, level, ofnode, group_id);
return cache;
}
@@ -309,20 +312,24 @@ static struct cache *cache_find_first_sibling(struct cache *cache)
return cache;
list_for_each_entry(iter, &cache_list, list)
- if (iter->ofnode == cache->ofnode && iter->next_local == cache)
+ if (iter->ofnode == cache->ofnode &&
+ iter->group_id == cache->group_id &&
+ iter->next_local == cache)
return iter;
return cache;
}
-/* return the first cache on a local list matching node */
-static struct cache *cache_lookup_by_node(const struct device_node *node)
+/* return the first cache on a local list matching node and thread-group id */
+static struct cache *cache_lookup_by_node_group(const struct device_node *node,
+ int group_id)
{
struct cache *cache = NULL;
struct cache *iter;
list_for_each_entry(iter, &cache_list, list) {
- if (iter->ofnode != node)
+ if (iter->ofnode != node ||
+ iter->group_id != group_id)
continue;
cache = cache_find_first_sibling(iter);
break;
@@ -352,14 +359,15 @@ static int cache_is_unified_d(const struct device_node *np)
CACHE_TYPE_UNIFIED_D : CACHE_TYPE_UNIFIED;
}
-static struct cache *cache_do_one_devnode_unified(struct device_node *node, int level)
+static struct cache *cache_do_one_devnode_unified(struct device_node *node, int group_id,
+ int level)
{
pr_debug("creating L%d ucache for %pOFP\n", level, node);
- return new_cache(cache_is_unified_d(node), level, node);
+ return new_cache(cache_is_unified_d(node), level, node, group_id);
}
-static struct cache *cache_do_one_devnode_split(struct device_node *node,
+static struct cache *cache_do_one_devnode_split(struct device_node *node, int group_id,
int level)
{
struct cache *dcache, *icache;
@@ -367,8 +375,8 @@ static struct cache *cache_do_one_devnode_split(struct device_node *node,
pr_debug("creating L%d dcache and icache for %pOFP\n", level,
node);
- dcache = new_cache(CACHE_TYPE_DATA, level, node);
- icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node);
+ dcache = new_cache(CACHE_TYPE_DATA, level, node, group_id);
+ icache = new_cache(CACHE_TYPE_INSTRUCTION, level, node, group_id);
if (!dcache || !icache)
goto err;
@@ -382,31 +390,32 @@ static struct cache *cache_do_one_devnode_split(struct device_node *node,
return NULL;
}
-static struct cache *cache_do_one_devnode(struct device_node *node, int level)
+static struct cache *cache_do_one_devnode(struct device_node *node, int group_id, int level)
{
struct cache *cache;
if (cache_node_is_unified(node))
- cache = cache_do_one_devnode_unified(node, level);
+ cache = cache_do_one_devnode_unified(node, group_id, level);
else
- cache = cache_do_one_devnode_split(node, level);
+ cache = cache_do_one_devnode_split(node, group_id, level);
return cache;
}
static struct cache *cache_lookup_or_instantiate(struct device_node *node,
+ int group_id,
int level)
{
struct cache *cache;
- cache = cache_lookup_by_node(node);
+ cache = cache_lookup_by_node_group(node, group_id);
WARN_ONCE(cache && cache->level != level,
"cache level mismatch on lookup (got %d, expected %d)\n",
cache->level, level);
if (!cache)
- cache = cache_do_one_devnode(node, level);
+ cache = cache_do_one_devnode(node, group_id, level);
return cache;
}
@@ -443,7 +452,27 @@ static void do_subsidiary_caches_debugcheck(struct cache *cache)
of_node_get_device_type(cache->ofnode));
}
-static void do_subsidiary_caches(struct cache *cache)
+/*
+ * If sub-groups of threads in a core containing @cpu_id share the
+ * L@level-cache (information obtained via "ibm,thread-groups"
+ * device-tree property), then we identify the group by the first
+ * thread-sibling in the group. We define this to be the group-id.
+ *
+ * In the absence of any thread-group information for L@level-cache,
+ * this function returns -1.
+ */
+static int get_group_id(unsigned int cpu_id, int level)
+{
+ if (has_big_cores && level == 1)
+ return cpumask_first(per_cpu(thread_group_l1_cache_map,
+ cpu_id));
+ else if (thread_group_shares_l2 && level == 2)
+ return cpumask_first(per_cpu(thread_group_l2_cache_map,
+ cpu_id));
+ return -1;
+}
+
+static void do_subsidiary_caches(struct cache *cache, unsigned int cpu_id)
{
struct device_node *subcache_node;
int level = cache->level;
@@ -452,9 +481,11 @@ static void do_subsidiary_caches(struct cache *cache)
while ((subcache_node = of_find_next_cache_node(cache->ofnode))) {
struct cache *subcache;
+ int group_id;
level++;
- subcache = cache_lookup_or_instantiate(subcache_node, level);
+ group_id = get_group_id(cpu_id, level);
+ subcache = cache_lookup_or_instantiate(subcache_node, group_id, level);
of_node_put(subcache_node);
if (!subcache)
break;
@@ -468,6 +499,7 @@ static struct cache *cache_chain_instantiate(unsigned int cpu_id)
{
struct device_node *cpu_node;
struct cache *cpu_cache = NULL;
+ int group_id;
pr_debug("creating cache object(s) for CPU %i\n", cpu_id);
@@ -476,11 +508,13 @@ static struct cache *cache_chain_instantiate(unsigned int cpu_id)
if (!cpu_node)
goto out;
- cpu_cache = cache_lookup_or_instantiate(cpu_node, 1);
+ group_id = get_group_id(cpu_id, 1);
+
+ cpu_cache = cache_lookup_or_instantiate(cpu_node, group_id, 1);
if (!cpu_cache)
goto out;
- do_subsidiary_caches(cpu_cache);
+ do_subsidiary_caches(cpu_cache, cpu_id);
cache_cpu_set(cpu_cache, cpu_id);
out:
@@ -848,13 +882,15 @@ static struct cache *cache_lookup_by_cpu(unsigned int cpu_id)
{
struct device_node *cpu_node;
struct cache *cache;
+ int group_id;
cpu_node = of_get_cpu_node(cpu_id, NULL);
WARN_ONCE(!cpu_node, "no OF node found for CPU %i\n", cpu_id);
if (!cpu_node)
return NULL;
- cache = cache_lookup_by_node(cpu_node);
+ group_id = get_group_id(cpu_id, 1);
+ cache = cache_lookup_by_node_group(cpu_node, group_id);
of_node_put(cpu_node);
return cache;
--
1.9.4
^ permalink raw reply related
* Re: [RFC Qemu PATCH v2 1/2] spapr: drc: Add support for async hcalls at the drc level
From: Shivaprasad G Bhat @ 2021-01-19 7:10 UTC (permalink / raw)
To: David Gibson, Greg Kurz
Cc: xiaoguangrong.eric, linux-nvdimm, aneesh.kumar, mst, qemu-devel,
kvm-ppc, shivaprasadbhat, qemu-ppc, bharata, imammedo,
linuxppc-dev
In-Reply-To: <20201228083800.GN6952@yekko.fritz.box>
Thanks for the comments!
On 12/28/20 2:08 PM, David Gibson wrote:
> On Mon, Dec 21, 2020 at 01:08:53PM +0100, Greg Kurz wrote:
...
>> The overall idea looks good but I think you should consider using
>> a thread pool to implement it. See below.
> I am not convinced, however. Specifically, attaching this to the DRC
> doesn't make sense to me. We're adding exactly one DRC related async
> hcall, and I can't really see much call for another one. We could
> have other async hcalls - indeed we already have one for HPT resizing
> - but attaching this to DRCs doesn't help for those.
The semantics of the hcall made me think, if this is going to be
re-usable for future if implemented at DRC level. Other option
is to move the async-hcall-state/list into the NVDIMMState structure
in include/hw/mem/nvdimm.h and handle it with machine->nvdimms_state
at a global level.
Hope you are okay with using the pool based approach that Greg
suggested.
Please let me know.
Thanks,
Shivaprasad
^ permalink raw reply
* Re: [PATCH] selftests/powerpc: Fix exit status of pkey tests
From: Aneesh Kumar K.V @ 2021-01-19 4:35 UTC (permalink / raw)
To: Sandipan Das, mpe; +Cc: harish, efuller, linuxppc-dev
In-Reply-To: <20210118093145.10134-1-sandipan@linux.ibm.com>
Sandipan Das <sandipan@linux.ibm.com> writes:
> Since main() does not return a value explicitly, the
> return values from FAIL_IF() conditions are ignored
> and the tests can still pass irrespective of failures.
> This makes sure that we always explicitly return the
> correct test exit status.
>
Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Reported-by: Eirik Fuller <efuller@redhat.com>
> Fixes: 1addb6444791 ("selftests/powerpc: Add test for execute-disabled pkeys")
> Fixes: c27f2fd1705a ("selftests/powerpc: Add test for pkey siginfo verification")
> Signed-off-by: Sandipan Das <sandipan@linux.ibm.com>
> ---
> tools/testing/selftests/powerpc/mm/pkey_exec_prot.c | 2 +-
> tools/testing/selftests/powerpc/mm/pkey_siginfo.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c b/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
> index 9e5c7f3f498a..0af4f02669a1 100644
> --- a/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
> +++ b/tools/testing/selftests/powerpc/mm/pkey_exec_prot.c
> @@ -290,5 +290,5 @@ static int test(void)
>
> int main(void)
> {
> - test_harness(test, "pkey_exec_prot");
> + return test_harness(test, "pkey_exec_prot");
> }
> diff --git a/tools/testing/selftests/powerpc/mm/pkey_siginfo.c b/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
> index 4f815d7c1214..2db76e56d4cb 100644
> --- a/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
> +++ b/tools/testing/selftests/powerpc/mm/pkey_siginfo.c
> @@ -329,5 +329,5 @@ static int test(void)
>
> int main(void)
> {
> - test_harness(test, "pkey_siginfo");
> + return test_harness(test, "pkey_siginfo");
> }
> --
> 2.25.1
^ permalink raw reply
* [PATCH] selftests/powerpc: Only test lwm/stmw on big endian
From: Michael Ellerman @ 2021-01-19 4:18 UTC (permalink / raw)
To: linuxppc-dev; +Cc: msuchanek, lpechacek
Newer binutils (>= 2.36) refuse to assemble lmw/stmw when building in
little endian mode. That breaks compilation of our alignment handler
test:
/tmp/cco4l14N.s: Assembler messages:
/tmp/cco4l14N.s:1440: Error: `lmw' invalid when little-endian
/tmp/cco4l14N.s:1814: Error: `stmw' invalid when little-endian
make[2]: *** [../../lib.mk:139: /output/kselftest/powerpc/alignment/alignment_handler] Error 1
These tests do pass on little endian machines, as the kernel will
still emulate those instructions even when running little
endian (which is arguably a kernel bug).
But we don't really need to test that case, so ifdef those
instructions out to get the alignment test building again.
Reported-by: Libor Pechacek <lpechacek@suse.com>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
---
.../testing/selftests/powerpc/alignment/alignment_handler.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/tools/testing/selftests/powerpc/alignment/alignment_handler.c b/tools/testing/selftests/powerpc/alignment/alignment_handler.c
index cb53a8b777e6..c25cf7cd45e9 100644
--- a/tools/testing/selftests/powerpc/alignment/alignment_handler.c
+++ b/tools/testing/selftests/powerpc/alignment/alignment_handler.c
@@ -443,7 +443,6 @@ int test_alignment_handler_integer(void)
LOAD_DFORM_TEST(ldu);
LOAD_XFORM_TEST(ldx);
LOAD_XFORM_TEST(ldux);
- LOAD_DFORM_TEST(lmw);
STORE_DFORM_TEST(stb);
STORE_XFORM_TEST(stbx);
STORE_DFORM_TEST(stbu);
@@ -462,7 +461,11 @@ int test_alignment_handler_integer(void)
STORE_XFORM_TEST(stdx);
STORE_DFORM_TEST(stdu);
STORE_XFORM_TEST(stdux);
+
+#ifdef __BIG_ENDIAN__
+ LOAD_DFORM_TEST(lmw);
STORE_DFORM_TEST(stmw);
+#endif
return rc;
}
--
2.25.1
^ permalink raw reply related
* Re: [PATCH v3] powerpc/mce: Remove per cpu variables from MCE handlers
From: Nicholas Piggin @ 2021-01-19 3:58 UTC (permalink / raw)
To: Ganesh Goudar, linuxppc-dev, mpe; +Cc: mahesh
In-Reply-To: <20210115125845.28224-1-ganeshgr@linux.ibm.com>
Excerpts from Ganesh Goudar's message of January 15, 2021 10:58 pm:
> Access to per-cpu variables requires translation to be enabled on
> pseries machine running in hash mmu mode, Since part of MCE handler
> runs in realmode and part of MCE handling code is shared between ppc
> architectures pseries and powernv, it becomes difficult to manage
> these variables differently on different architectures, So have
> these variables in paca instead of having them as per-cpu variables
> to avoid complications.
Seems okay.
>
> Maximum recursive depth of MCE is 4, Considering the maximum depth
> allowed reduce the size of event to 10 from 100.
Could you make this a separate patch, with memory saving numbers?
"Delayed" MCEs are not necessarily the same as recursive (several
sequential MCEs can occur before the first event is processed).
But I agree 100 is pretty overboard (as is 4 recursive MCEs really).
>
> Signed-off-by: Ganesh Goudar <ganeshgr@linux.ibm.com>
> ---
> v2: Dynamically allocate memory for machine check event info
>
> v3: Remove check for hash mmu lpar, use memblock_alloc_try_nid
> to allocate memory.
> ---
> arch/powerpc/include/asm/mce.h | 21 ++++++++-
> arch/powerpc/include/asm/paca.h | 4 ++
> arch/powerpc/kernel/mce.c | 76 +++++++++++++++++-------------
> arch/powerpc/kernel/setup-common.c | 2 +-
> 4 files changed, 69 insertions(+), 34 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/mce.h b/arch/powerpc/include/asm/mce.h
> index e6c27ae843dc..8d6e3a7a9f37 100644
> --- a/arch/powerpc/include/asm/mce.h
> +++ b/arch/powerpc/include/asm/mce.h
> @@ -204,7 +204,18 @@ struct mce_error_info {
> bool ignore_event;
> };
>
> -#define MAX_MC_EVT 100
> +#define MAX_MC_EVT 10
> +
> +struct mce_info {
> + int mce_nest_count;
> + struct machine_check_event mce_event[MAX_MC_EVT];
> + /* Queue for delayed MCE events. */
> + int mce_queue_count;
> + struct machine_check_event mce_event_queue[MAX_MC_EVT];
> + /* Queue for delayed MCE UE events. */
> + int mce_ue_count;
> + struct machine_check_event mce_ue_event_queue[MAX_MC_EVT];
> +};
>
> /* Release flags for get_mce_event() */
> #define MCE_EVENT_RELEASE true
> @@ -233,5 +244,13 @@ long __machine_check_early_realmode_p7(struct pt_regs *regs);
> long __machine_check_early_realmode_p8(struct pt_regs *regs);
> long __machine_check_early_realmode_p9(struct pt_regs *regs);
> long __machine_check_early_realmode_p10(struct pt_regs *regs);
> +#define get_mce_info() local_paca->mce_info
I don't think this adds anything. Could you open code it?
Thanks,
Nick
^ permalink raw reply
* Re: [PATCH 1/4] KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
From: Nicholas Piggin @ 2021-01-19 3:27 UTC (permalink / raw)
To: Fabiano Rosas, kvm-ppc; +Cc: linuxppc-dev
In-Reply-To: <87czy1bsvz.fsf@linux.ibm.com>
Excerpts from Fabiano Rosas's message of January 19, 2021 11:46 am:
> Resending because the previous got spam-filtered:
>
> Nicholas Piggin <npiggin@gmail.com> writes:
>
>> This reverts much of commit c01015091a770 ("KVM: PPC: Book3S HV: Run HPT
>> guests on POWER9 radix hosts"), which was required to run HPT guests on
>> RPT hosts on early POWER9 CPUs without support for "mixed mode", which
>> meant the host could not run with MMU on while guests were running.
>>
>> This code has some corner case bugs, e.g., when the guest hits a machine
>> check or HMI the primary locks up waiting for secondaries to switch LPCR
>> to host, which they never do. This could all be fixed in software, but
>> most CPUs in production have mixed mode support, and those that don't
>> are believed to be all in installations that don't use this capability.
>> So simplify things and remove support.
>
> With this patch in a DD2.1 machine + indep_threads_mode=N +
> disable_radix, QEMU aborts and dumps registers, is that intended?
Yes. That configuration is hanging handling MCEs in the guest with some
threads waiting forever to synchronize. Paul suggested it was never a
supported configuration so we might just remove it.
> Could we use the 'no_mixing_hpt_and_radix' logic in check_extension to
> advertise only KVM_CAP_PPC_MMU_RADIX to the guest via OV5 so it doesn't
> try to run hash?
>
> For instance, if I hack QEMU's 'spapr_dt_ov5_platform_support' from
> OV5_MMU_BOTH to OV5_MMU_RADIX_300 then it boots succesfuly, but the
> guest turns into radix, due to this code in prom_init:
>
> prom_parse_mmu_model:
>
> case OV5_FEAT(OV5_MMU_RADIX): /* Only Radix */
> prom_debug("MMU - radix only\n");
> if (prom_radix_disable) {
> /*
> * If we __have__ to do radix, we're better off ignoring
> * the command line rather than not booting.
> */
> prom_printf("WARNING: Ignoring cmdline option disable_radix\n");
> }
> support->radix_mmu = true;
> break;
>
> It seems we could explicitly say that the host does not support hash and
> that would align with the above code.
I'm not sure, sounds like you could, on the other hand these aborts seem
like the prefered failure mode for these kinds of configuration issues,
I don't know what the policy is, is reverting back to radix acceptable?
Thanks,
Nick
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox