LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH 18/20] Documentation: security/keys: eliminate duplicated word
From: Mimi Zohar @ 2020-07-13 12:24 UTC (permalink / raw)
  To: Randy Dunlap, linux-kernel
  Cc: kvm, linux-doc, David Airlie, kgdb-bugreport, linux-fpga,
	Liviu Dudau, dri-devel, linux-mips, Paul Cercueil, keyrings,
	Paul Mackerras, linux-i2c, Pavel Machek, Srinivas Pandruvada,
	Mihail Atanassov, linux-leds, linux-s390, Daniel Thompson,
	linux-scsi, Jonathan Corbet, Masahiro Yamada, Matthew Wilcox,
	Halil Pasic, Jarkko Sakkinen, James Wang, linux-input,
	Mali DP Maintainers, Derek Kiernan, Dragan Cvetic, Wu Hao,
	Tony Krowiak, linux-kbuild, James E.J. Bottomley, Jiri Kosina,
	Hannes Reinecke, linux-block, Thomas Bogendoerfer,
	Jacek Anaszewski, linux-mm, Dan Williams, Andrew Morton,
	Jens Axboe, Michal Marek, Martin K. Petersen, Pierre Morel,
	Douglas Anderson, Wolfram Sang, Daniel Vetter, Jason Wessel,
	Paolo Bonzini, linux-integrity, linuxppc-dev, Mike Rapoport,
	Dan Murphy
In-Reply-To: <20200707180414.10467-19-rdunlap@infradead.org>

On Tue, 2020-07-07 at 11:04 -0700, Randy Dunlap wrote:
> Drop the doubled word "in".
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>

Reviewed-by: Mimi Zohar <zohar@linux.ibm.com>

^ permalink raw reply

* Re: [PATCH v2 0/3] Power10 basic energy management
From: Gautham R Shenoy @ 2020-07-13 10:48 UTC (permalink / raw)
  To: Nicholas Piggin
  Cc: ego, mikey, pratik.r.sampat, linux-kernel, Pratik Rajesh Sampat,
	paulus, linuxppc-dev, ravi.bangoria
In-Reply-To: <1594617564.57k8bsyfd0.astroid@bobo.none>

On Mon, Jul 13, 2020 at 03:23:21PM +1000, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
> > Changelog v1 --> v2:
> > 1. Save-restore DAWR and DAWRX unconditionally as they are lost in
> > shallow idle states too
> > 2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to
> > correct naming terminology
> > 
> > Pratik Rajesh Sampat (3):
> >   powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
> >   powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
> >   powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
> > 
> >  arch/powerpc/platforms/powernv/idle.c | 34 +++++++++++++++++----------
> >  1 file changed, 22 insertions(+), 12 deletions(-)
> 
> These look okay to me, but the CPU_FTR_ARCH_300 test for 
> pnv_power9_idle_init() is actually wrong, it should be a PVR test 
> because idle is not completely architected (not even shallow stop 
> states, unfortunately).
> 
> It doesn't look like we support POWER10 idle correctly yet, and on older
> kernels it wouldn't work even if we fixed newer, so ideally the PVR 
> check would be backported as a fix in the front of the series.
> 
> Sadly, we have no OPAL idle driver yet. Hopefully we will before the
> next processor shows up :P

Abhishek posted a version recently :
https://patchwork.ozlabs.org/project/skiboot/patch/20200706043533.76539-1-huntbag@linux.vnet.ibm.com/


> 
> Thanks,
> Nick

--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [PATCH 06/14] powerpc/eeh: Remove VF config space restoration
From: Alexey Kardashevskiy @ 2020-07-13 11:39 UTC (permalink / raw)
  To: Oliver O'Halloran; +Cc: linuxppc-dev, Mahesh J Salgaonkar
In-Reply-To: <CAOSf1CFsC8PeLW3Deh=vf8pJQbo7Gg7oDTSOk0T0Da1TBptwGg@mail.gmail.com>



On 13/07/2020 20:55, Oliver O'Halloran wrote:
> On Mon, Jul 13, 2020 at 8:32 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>>
>>>  #ifdef CONFIG_PCI_IOV
>>>  int pseries_send_allow_unfreeze(struct pci_dn *pdn,
>>>                               u16 *vf_pe_array, int cur_vfs)
>>> @@ -848,7 +824,7 @@ static struct eeh_ops pseries_eeh_ops = {
>>>       .read_config            = pseries_eeh_read_config,
>>>       .write_config           = pseries_eeh_write_config,
>>>       .next_error             = NULL,
>>> -     .restore_config         = pseries_eeh_restore_config,
>>> +     .restore_config         = NULL, /* NB: configure_bridge() does this */
>>
>>
>> configure_bridge() calls rtas_call(ibm_configure_pe, 3, 1, NULL...)
>> which reconfigures the PE and which is quite different from what
>> pseries_eeh_restore_config() used to do although the comment suggests it
>> is about the same thing. I am pretty sure the new code produces a better
>> result so I suggest ditching this comment and adding a note to the
>> commit log may be. Thanks,
> 
> I put the comment there largely because the EEH core seems to think
> that restore_config() is what should be called to reset the device's
> config space to the defaults set be firmware. On PowerNV it does
> actually do that and configure_bridge is this:
> 
> static int pnv_eeh_configure_bridge(struct eeh_pe *pe)
> {
>         return 0;
> }
> 
> So... there's definitely something strange going on there. I don't
> remember the exact details, but I think the generic EEH code calls
> into RTAS to collect debug data and apparently that requires the
> device to be accessible via MMIO (i.e BARs need to be restored) which
> is why the pseries .configure_bridge() calls configure_pe. 


ah ok, makes more now, cool. thanks,


> It might
> work out better, but having something called "restore_config" that
> doesn't actually restore the config is uh... modern. It's something
> that probably needs a rework at some point. Anyway, I think the
> comment is more helpful than it is misleading. Especially if you
> consider the PowerNV behaviour.
> 
>>>  #ifdef CONFIG_PCI_IOV
>>>       .notify_resume          = pseries_notify_resume
>>>  #endif
>>>
>>
>> --
>> Alexey

-- 
Alexey

^ permalink raw reply

* Re: [PATCH 05/14] powerpc/eeh: Kill off eeh_ops->get_pe_addr()
From: Oliver O'Halloran @ 2020-07-13 11:11 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Mahesh J Salgaonkar
In-Reply-To: <d227871b-efc7-0864-efc4-a92b99a2ff04@ozlabs.ru>

On Mon, Jul 13, 2020 at 7:54 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>
>
> On 06/07/2020 11:36, Oliver O'Halloran wrote:
> > This is used in precisely one place which is in pseries specific platform
> > code.  There's no need to have the callback in eeh_ops since the platform
> > chooses the EEH PE addresses anyway. The PowerNV implementation has always
> > been a stub too so remove it.
> >
> > Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> > ---
> >  arch/powerpc/include/asm/eeh.h               |  1 -
> >  arch/powerpc/platforms/powernv/eeh-powernv.c | 13 ------------
> >  arch/powerpc/platforms/pseries/eeh_pseries.c | 22 ++++++++++----------
> >  3 files changed, 11 insertions(+), 25 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> > index 3d648e042835..1bddc0dfe099 100644
> > --- a/arch/powerpc/include/asm/eeh.h
> > +++ b/arch/powerpc/include/asm/eeh.h
> > @@ -220,7 +220,6 @@ struct eeh_ops {
> >       int (*init)(void);
> >       struct eeh_dev *(*probe)(struct pci_dev *pdev);
> >       int (*set_option)(struct eeh_pe *pe, int option);
> > -     int (*get_pe_addr)(struct eeh_pe *pe);
> >       int (*get_state)(struct eeh_pe *pe, int *delay);
> >       int (*reset)(struct eeh_pe *pe, int option);
> >       int (*get_log)(struct eeh_pe *pe, int severity, char *drv_log, unsigned long len);
> > diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > index 79409e005fcd..bcd0515d8f79 100644
> > --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> > +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> > @@ -535,18 +535,6 @@ static int pnv_eeh_set_option(struct eeh_pe *pe, int option)
> >       return 0;
> >  }
> >
> > -/**
> > - * pnv_eeh_get_pe_addr - Retrieve PE address
> > - * @pe: EEH PE
> > - *
> > - * Retrieve the PE address according to the given tranditional
> > - * PCI BDF (Bus/Device/Function) address.
> > - */
> > -static int pnv_eeh_get_pe_addr(struct eeh_pe *pe)
> > -{
> > -     return pe->addr;
> > -}
> > -
> >  static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
> >  {
> >       struct pnv_phb *phb = pe->phb->private_data;
> > @@ -1670,7 +1658,6 @@ static struct eeh_ops pnv_eeh_ops = {
> >       .init                   = pnv_eeh_init,
> >       .probe                  = pnv_eeh_probe,
> >       .set_option             = pnv_eeh_set_option,
> > -     .get_pe_addr            = pnv_eeh_get_pe_addr,
> >       .get_state              = pnv_eeh_get_state,
> >       .reset                  = pnv_eeh_reset,
> >       .get_log                = pnv_eeh_get_log,
> > diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > index 18a2522b9b5e..088771fa38be 100644
> > --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> > +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> > @@ -32,6 +32,8 @@
> >  #include <asm/ppc-pci.h>
> >  #include <asm/rtas.h>
> >
> > +static int pseries_eeh_get_pe_addr(struct pci_dn *pdn);
> > +
> >  /* RTAS tokens */
> >  static int ibm_set_eeh_option;
> >  static int ibm_set_slot_reset;
> > @@ -301,7 +303,7 @@ void pseries_eeh_init_edev(struct pci_dn *pdn)
> >               eeh_edev_dbg(edev, "EEH failed to enable on device (code %d)\n", ret);
> >       } else {
> >               /* Retrieve PE address */
> > -             edev->pe_config_addr = eeh_ops->get_pe_addr(&pe);
> > +             edev->pe_config_addr = pseries_eeh_get_pe_addr(pdn);
> >               pe.addr = edev->pe_config_addr;
> >
> >               /* Some older systems (Power4) allow the ibm,set-eeh-option
> > @@ -431,8 +433,10 @@ static int pseries_eeh_set_option(struct eeh_pe *pe, int option)
> >   * It's notable that zero'ed return value means invalid PE config
> >   * address.
> >   */
> > -static int pseries_eeh_get_pe_addr(struct eeh_pe *pe)
> > +static int pseries_eeh_get_pe_addr(struct pci_dn *pdn)
> >  {
> > +     int config_addr = rtas_config_addr(pdn->busno, pdn->devfn, 0);
>
> Why not use pe->config_addr

I wanted to get rid of the PE argument. The only caller
(pseries_eeh_init_edev()) doesn't even pass a real PE, just the "fake"
PE which only has one initialised field which is... sketch IMO. The
other reason is for Wen's post-kdump pseries PHB reset patch. In that
situation we want the reset to be done before we've done any PCI setup
so there won't be any eeh_pe structures available. We will however
have pci_dn's since they're set up before we start scanning PHBs. I
also think some of the "fake pe" stuff in pseries_eeh_init_edev() is
broken so the fewer users of that we have the better.

> (and why we have two addresses in eeh_pe anyway)?

I don't know :(

It's one of those things I've been meaning to look at but haven't
found the will to jump down that particular rabbit hole.

I did take a cursory look and there's some comments about pe->addr
being zero in some cases so EEH falls back to matching on
pe->config_addr when searching for a PE. IIRC when I looked I couldn't
work out why pe->config_addr would ever be zero. On PowerNV zero is a
valid PE address and we set the EEH_VALID_PE_ZERO flag to disable that
fallback logic so the reason is probably some weird pseries thing.


> Ah, I guess I just trust you with this one :)
>
>
> Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>
>
>
>
>
> > +     int buid = pdn->phb->buid;
> >       int ret = 0;
> >       int rets[3];
> >
> > @@ -443,18 +447,16 @@ static int pseries_eeh_get_pe_addr(struct eeh_pe *pe)
> >                * meaningless.
> >                */
> >               ret = rtas_call(ibm_get_config_addr_info2, 4, 2, rets,
> > -                             pe->config_addr, BUID_HI(pe->phb->buid),
> > -                             BUID_LO(pe->phb->buid), 1);
> > +                             config_addr, BUID_HI(buid), BUID_LO(buid), 1);
> >               if (ret || (rets[0] == 0))
> >                       return 0;
> >
> >               /* Retrieve the associated PE config address */
> >               ret = rtas_call(ibm_get_config_addr_info2, 4, 2, rets,
> > -                             pe->config_addr, BUID_HI(pe->phb->buid),
> > -                             BUID_LO(pe->phb->buid), 0);
> > +                             config_addr, BUID_HI(buid), BUID_LO(buid), 0);
> >               if (ret) {
> >                       pr_warn("%s: Failed to get address for PHB#%x-PE#%x\n",
> > -                             __func__, pe->phb->global_number, pe->config_addr);
> > +                             __func__, pdn->phb->global_number, config_addr);
> >                       return 0;
> >               }
> >
> > @@ -463,11 +465,10 @@ static int pseries_eeh_get_pe_addr(struct eeh_pe *pe)
> >
> >       if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
> >               ret = rtas_call(ibm_get_config_addr_info, 4, 2, rets,
> > -                             pe->config_addr, BUID_HI(pe->phb->buid),
> > -                             BUID_LO(pe->phb->buid), 0);
> > +                             config_addr, BUID_HI(buid), BUID_LO(buid), 0);
> >               if (ret) {
> >                       pr_warn("%s: Failed to get address for PHB#%x-PE#%x\n",
> > -                             __func__, pe->phb->global_number, pe->config_addr);
> > +                             __func__, pdn->phb->global_number, config_addr);
> >                       return 0;
> >               }
> >
> > @@ -839,7 +840,6 @@ static struct eeh_ops pseries_eeh_ops = {
> >       .init                   = pseries_eeh_init,
> >       .probe                  = pseries_eeh_probe,
> >       .set_option             = pseries_eeh_set_option,
> > -     .get_pe_addr            = pseries_eeh_get_pe_addr,
> >       .get_state              = pseries_eeh_get_state,
> >       .reset                  = pseries_eeh_reset,
> >       .get_log                = pseries_eeh_get_log,
> >
>
> --
> Alexey

^ permalink raw reply

* Re: [PATCH 06/14] powerpc/eeh: Remove VF config space restoration
From: Oliver O'Halloran @ 2020-07-13 10:55 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Mahesh J Salgaonkar
In-Reply-To: <c808c6d8-b5ed-3256-5396-4300be9fa308@ozlabs.ru>

On Mon, Jul 13, 2020 at 8:32 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
> >  #ifdef CONFIG_PCI_IOV
> >  int pseries_send_allow_unfreeze(struct pci_dn *pdn,
> >                               u16 *vf_pe_array, int cur_vfs)
> > @@ -848,7 +824,7 @@ static struct eeh_ops pseries_eeh_ops = {
> >       .read_config            = pseries_eeh_read_config,
> >       .write_config           = pseries_eeh_write_config,
> >       .next_error             = NULL,
> > -     .restore_config         = pseries_eeh_restore_config,
> > +     .restore_config         = NULL, /* NB: configure_bridge() does this */
>
>
> configure_bridge() calls rtas_call(ibm_configure_pe, 3, 1, NULL...)
> which reconfigures the PE and which is quite different from what
> pseries_eeh_restore_config() used to do although the comment suggests it
> is about the same thing. I am pretty sure the new code produces a better
> result so I suggest ditching this comment and adding a note to the
> commit log may be. Thanks,

I put the comment there largely because the EEH core seems to think
that restore_config() is what should be called to reset the device's
config space to the defaults set be firmware. On PowerNV it does
actually do that and configure_bridge is this:

static int pnv_eeh_configure_bridge(struct eeh_pe *pe)
{
        return 0;
}

So... there's definitely something strange going on there. I don't
remember the exact details, but I think the generic EEH code calls
into RTAS to collect debug data and apparently that requires the
device to be accessible via MMIO (i.e BARs need to be restored) which
is why the pseries .configure_bridge() calls configure_pe. It might
work out better, but having something called "restore_config" that
doesn't actually restore the config is uh... modern. It's something
that probably needs a rework at some point. Anyway, I think the
comment is more helpful than it is misleading. Especially if you
consider the PowerNV behaviour.

> >  #ifdef CONFIG_PCI_IOV
> >       .notify_resume          = pseries_notify_resume
> >  #endif
> >
>
> --
> Alexey

^ permalink raw reply

* Re: [PATCH 06/14] powerpc/eeh: Remove VF config space restoration
From: Alexey Kardashevskiy @ 2020-07-13 10:32 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: mahesh
In-Reply-To: <20200706013619.459420-7-oohall@gmail.com>



On 06/07/2020 11:36, Oliver O'Halloran wrote:
> There's a bunch of strange things about this code. First up is that none of
> the fields being written to are functional for a VF. The SR-IOV
> specification lists then as "Reserved, but OS should preserve" so writing
> new values to them doesn't do anything and is clearly wrong from a
> correctness perspective.
> 
> However, since VFs are designed to be managed by the OS there is an
> argument to be made that we should be saving and restoring some parts of
> config space. We already sort of do that by saving the first 64 bytes of
> config space in the eeh_dev (see eeh_dev->config_space[]). This is
> inadequate since it doesn't even consider saving and restoring the PCI
> capability structures. However, this is a problem with EEH in general and
> that needs to be fixed for non-VF devices too.
> 
> There's no real reason to keep around this around so delete it.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>



> ---
>  arch/powerpc/include/asm/eeh.h               |  1 -
>  arch/powerpc/kernel/eeh.c                    | 59 --------------------
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 20 ++-----
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 26 +--------
>  4 files changed, 7 insertions(+), 99 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 1bddc0dfe099..046c5a2fe411 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -314,7 +314,6 @@ int eeh_pe_reset(struct eeh_pe *pe, int option, bool include_passed);
>  int eeh_pe_configure(struct eeh_pe *pe);
>  int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
>  		      unsigned long addr, unsigned long mask);
> -int eeh_restore_vf_config(struct pci_dn *pdn);
>  
>  /**
>   * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 859f76020256..a4df6f6de0bd 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -742,65 +742,6 @@ static void eeh_restore_dev_state(struct eeh_dev *edev, void *userdata)
>  		pci_restore_state(pdev);
>  }
>  
> -int eeh_restore_vf_config(struct pci_dn *pdn)
> -{
> -	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
> -	u32 devctl, cmd, cap2, aer_capctl;
> -	int old_mps;
> -
> -	if (edev->pcie_cap) {
> -		/* Restore MPS */
> -		old_mps = (ffs(pdn->mps) - 8) << 5;
> -		eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
> -				     2, &devctl);
> -		devctl &= ~PCI_EXP_DEVCTL_PAYLOAD;
> -		devctl |= old_mps;
> -		eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
> -				      2, devctl);
> -
> -		/* Disable Completion Timeout if possible */
> -		eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCAP2,
> -				     4, &cap2);
> -		if (cap2 & PCI_EXP_DEVCAP2_COMP_TMOUT_DIS) {
> -			eeh_ops->read_config(pdn,
> -					     edev->pcie_cap + PCI_EXP_DEVCTL2,
> -					     4, &cap2);
> -			cap2 |= PCI_EXP_DEVCTL2_COMP_TMOUT_DIS;
> -			eeh_ops->write_config(pdn,
> -					      edev->pcie_cap + PCI_EXP_DEVCTL2,
> -					      4, cap2);
> -		}
> -	}
> -
> -	/* Enable SERR and parity checking */
> -	eeh_ops->read_config(pdn, PCI_COMMAND, 2, &cmd);
> -	cmd |= (PCI_COMMAND_PARITY | PCI_COMMAND_SERR);
> -	eeh_ops->write_config(pdn, PCI_COMMAND, 2, cmd);
> -
> -	/* Enable report various errors */
> -	if (edev->pcie_cap) {
> -		eeh_ops->read_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
> -				     2, &devctl);
> -		devctl &= ~PCI_EXP_DEVCTL_CERE;
> -		devctl |= (PCI_EXP_DEVCTL_NFERE |
> -			   PCI_EXP_DEVCTL_FERE |
> -			   PCI_EXP_DEVCTL_URRE);
> -		eeh_ops->write_config(pdn, edev->pcie_cap + PCI_EXP_DEVCTL,
> -				      2, devctl);
> -	}
> -
> -	/* Enable ECRC generation and check */
> -	if (edev->pcie_cap && edev->aer_cap) {
> -		eeh_ops->read_config(pdn, edev->aer_cap + PCI_ERR_CAP,
> -				     4, &aer_capctl);
> -		aer_capctl |= (PCI_ERR_CAP_ECRC_GENE | PCI_ERR_CAP_ECRC_CHKE);
> -		eeh_ops->write_config(pdn, edev->aer_cap + PCI_ERR_CAP,
> -				      4, aer_capctl);
> -	}
> -
> -	return 0;
> -}
> -
>  /**
>   * pcibios_set_pcie_reset_state - Set PCI-E reset state
>   * @dev: pci device struct
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index bcd0515d8f79..8f3a7611efc1 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -1629,20 +1629,12 @@ static int pnv_eeh_restore_config(struct pci_dn *pdn)
>  	if (!edev)
>  		return -EEXIST;
>  
> -	/*
> -	 * We have to restore the PCI config space after reset since the
> -	 * firmware can't see SRIOV VFs.
> -	 *
> -	 * FIXME: The MPS, error routing rules, timeout setting are worthy
> -	 * to be exported by firmware in extendible way.
> -	 */
> -	if (edev->physfn) {
> -		ret = eeh_restore_vf_config(pdn);
> -	} else {
> -		phb = pdn->phb->private_data;
> -		ret = opal_pci_reinit(phb->opal_id,
> -				      OPAL_REINIT_PCI_DEV, config_addr);
> -	}
> +	if (edev->physfn)
> +		return 0;
> +
> +	phb = edev->controller->private_data;
> +	ret = opal_pci_reinit(phb->opal_id,
> +			      OPAL_REINIT_PCI_DEV, edev->bdfn);
>  
>  	if (ret) {
>  		pr_warn("%s: Can't reinit PCI dev 0x%x (%lld)\n",
> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> index 088771fa38be..83122bf65a8c 100644
> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> @@ -718,30 +718,6 @@ static int pseries_eeh_write_config(struct pci_dn *pdn, int where, int size, u32
>  	return rtas_write_config(pdn, where, size, val);
>  }
>  
> -static int pseries_eeh_restore_config(struct pci_dn *pdn)
> -{
> -	struct eeh_dev *edev = pdn_to_eeh_dev(pdn);
> -	s64 ret = 0;
> -
> -	if (!edev)
> -		return -EEXIST;
> -
> -	/*
> -	 * FIXME: The MPS, error routing rules, timeout setting are worthy
> -	 * to be exported by firmware in extendible way.
> -	 */
> -	if (edev->physfn)
> -		ret = eeh_restore_vf_config(pdn);
> -
> -	if (ret) {
> -		pr_warn("%s: Can't reinit PCI dev 0x%x (%lld)\n",
> -			__func__, edev->pe_config_addr, ret);
> -		return -EIO;
> -	}
> -
> -	return ret;
> -}
> -
>  #ifdef CONFIG_PCI_IOV
>  int pseries_send_allow_unfreeze(struct pci_dn *pdn,
>  				u16 *vf_pe_array, int cur_vfs)
> @@ -848,7 +824,7 @@ static struct eeh_ops pseries_eeh_ops = {
>  	.read_config		= pseries_eeh_read_config,
>  	.write_config		= pseries_eeh_write_config,
>  	.next_error		= NULL,
> -	.restore_config		= pseries_eeh_restore_config,
> +	.restore_config		= NULL, /* NB: configure_bridge() does this */


configure_bridge() calls rtas_call(ibm_configure_pe, 3, 1, NULL...)
which reconfigures the PE and which is quite different from what
pseries_eeh_restore_config() used to do although the comment suggests it
is about the same thing. I am pretty sure the new code produces a better
result so I suggest ditching this comment and adding a note to the
commit log may be. Thanks,



>  #ifdef CONFIG_PCI_IOV
>  	.notify_resume		= pseries_notify_resume
>  #endif
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH v2 0/3] Power10 basic energy management
From: Pratik Sampat @ 2020-07-13 10:02 UTC (permalink / raw)
  To: Nicholas Piggin, benh, ego, linux-kernel, linuxppc-dev, mikey,
	mpe, paulus, pratik.r.sampat, ravi.bangoria, svaidy
In-Reply-To: <1594617564.57k8bsyfd0.astroid@bobo.none>

Thank you for your comments,

On 13/07/20 10:53 am, Nicholas Piggin wrote:
> Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
>> Changelog v1 --> v2:
>> 1. Save-restore DAWR and DAWRX unconditionally as they are lost in
>> shallow idle states too
>> 2. Rename pnv_first_spr_loss_level to pnv_first_fullstate_loss_level to
>> correct naming terminology
>>
>> Pratik Rajesh Sampat (3):
>>    powerpc/powernv/idle: Exclude mfspr on HID1,4,5 on P9 and above
>>    powerpc/powernv/idle: save-restore DAWR0,DAWRX0 for P10
>>    powerpc/powernv/idle: Rename pnv_first_spr_loss_level variable
>>
>>   arch/powerpc/platforms/powernv/idle.c | 34 +++++++++++++++++----------
>>   1 file changed, 22 insertions(+), 12 deletions(-)
> These look okay to me, but the CPU_FTR_ARCH_300 test for
> pnv_power9_idle_init() is actually wrong, it should be a PVR test
> because idle is not completely architected (not even shallow stop
> states, unfortunately).
>
> It doesn't look like we support POWER10 idle correctly yet, and on older
> kernels it wouldn't work even if we fixed newer, so ideally the PVR
> check would be backported as a fix in the front of the series.
>
> Sadly, we have no OPAL idle driver yet. Hopefully we will before the
> next processor shows up :P
>
> Thanks,
> Nick

So if I understand this correctly, in powernv/idle.c where we check for
CPU_FTR_ARCH_300, we should rather be making a pvr_version_is(PVR_POWER9)
check instead?

Of course, the P10 PVR and its relevant checks will have to be added then too.

Thanks
Pratik

  


^ permalink raw reply

* Re: [PATCH 05/14] powerpc/eeh: Kill off eeh_ops->get_pe_addr()
From: Alexey Kardashevskiy @ 2020-07-13  9:54 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: mahesh
In-Reply-To: <20200706013619.459420-6-oohall@gmail.com>



On 06/07/2020 11:36, Oliver O'Halloran wrote:
> This is used in precisely one place which is in pseries specific platform
> code.  There's no need to have the callback in eeh_ops since the platform
> chooses the EEH PE addresses anyway. The PowerNV implementation has always
> been a stub too so remove it.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
>  arch/powerpc/include/asm/eeh.h               |  1 -
>  arch/powerpc/platforms/powernv/eeh-powernv.c | 13 ------------
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 22 ++++++++++----------
>  3 files changed, 11 insertions(+), 25 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index 3d648e042835..1bddc0dfe099 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -220,7 +220,6 @@ struct eeh_ops {
>  	int (*init)(void);
>  	struct eeh_dev *(*probe)(struct pci_dev *pdev);
>  	int (*set_option)(struct eeh_pe *pe, int option);
> -	int (*get_pe_addr)(struct eeh_pe *pe);
>  	int (*get_state)(struct eeh_pe *pe, int *delay);
>  	int (*reset)(struct eeh_pe *pe, int option);
>  	int (*get_log)(struct eeh_pe *pe, int severity, char *drv_log, unsigned long len);
> diff --git a/arch/powerpc/platforms/powernv/eeh-powernv.c b/arch/powerpc/platforms/powernv/eeh-powernv.c
> index 79409e005fcd..bcd0515d8f79 100644
> --- a/arch/powerpc/platforms/powernv/eeh-powernv.c
> +++ b/arch/powerpc/platforms/powernv/eeh-powernv.c
> @@ -535,18 +535,6 @@ static int pnv_eeh_set_option(struct eeh_pe *pe, int option)
>  	return 0;
>  }
>  
> -/**
> - * pnv_eeh_get_pe_addr - Retrieve PE address
> - * @pe: EEH PE
> - *
> - * Retrieve the PE address according to the given tranditional
> - * PCI BDF (Bus/Device/Function) address.
> - */
> -static int pnv_eeh_get_pe_addr(struct eeh_pe *pe)
> -{
> -	return pe->addr;
> -}
> -
>  static void pnv_eeh_get_phb_diag(struct eeh_pe *pe)
>  {
>  	struct pnv_phb *phb = pe->phb->private_data;
> @@ -1670,7 +1658,6 @@ static struct eeh_ops pnv_eeh_ops = {
>  	.init                   = pnv_eeh_init,
>  	.probe			= pnv_eeh_probe,
>  	.set_option             = pnv_eeh_set_option,
> -	.get_pe_addr            = pnv_eeh_get_pe_addr,
>  	.get_state              = pnv_eeh_get_state,
>  	.reset                  = pnv_eeh_reset,
>  	.get_log                = pnv_eeh_get_log,
> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> index 18a2522b9b5e..088771fa38be 100644
> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> @@ -32,6 +32,8 @@
>  #include <asm/ppc-pci.h>
>  #include <asm/rtas.h>
>  
> +static int pseries_eeh_get_pe_addr(struct pci_dn *pdn);
> +
>  /* RTAS tokens */
>  static int ibm_set_eeh_option;
>  static int ibm_set_slot_reset;
> @@ -301,7 +303,7 @@ void pseries_eeh_init_edev(struct pci_dn *pdn)
>  		eeh_edev_dbg(edev, "EEH failed to enable on device (code %d)\n", ret);
>  	} else {
>  		/* Retrieve PE address */
> -		edev->pe_config_addr = eeh_ops->get_pe_addr(&pe);
> +		edev->pe_config_addr = pseries_eeh_get_pe_addr(pdn);
>  		pe.addr = edev->pe_config_addr;
>  
>  		/* Some older systems (Power4) allow the ibm,set-eeh-option
> @@ -431,8 +433,10 @@ static int pseries_eeh_set_option(struct eeh_pe *pe, int option)
>   * It's notable that zero'ed return value means invalid PE config
>   * address.
>   */
> -static int pseries_eeh_get_pe_addr(struct eeh_pe *pe)
> +static int pseries_eeh_get_pe_addr(struct pci_dn *pdn)
>  {
> +	int config_addr = rtas_config_addr(pdn->busno, pdn->devfn, 0);

Why not use pe->config_addr (and why we have two addresses in eeh_pe
anyway)?


Ah, I guess I just trust you with this one :)


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>




> +	int buid = pdn->phb->buid;
>  	int ret = 0;
>  	int rets[3];
>  
> @@ -443,18 +447,16 @@ static int pseries_eeh_get_pe_addr(struct eeh_pe *pe)
>  		 * meaningless.
>  		 */
>  		ret = rtas_call(ibm_get_config_addr_info2, 4, 2, rets,
> -				pe->config_addr, BUID_HI(pe->phb->buid),
> -				BUID_LO(pe->phb->buid), 1);
> +				config_addr, BUID_HI(buid), BUID_LO(buid), 1);
>  		if (ret || (rets[0] == 0))
>  			return 0;
>  
>  		/* Retrieve the associated PE config address */
>  		ret = rtas_call(ibm_get_config_addr_info2, 4, 2, rets,
> -				pe->config_addr, BUID_HI(pe->phb->buid),
> -				BUID_LO(pe->phb->buid), 0);
> +				config_addr, BUID_HI(buid), BUID_LO(buid), 0);
>  		if (ret) {
>  			pr_warn("%s: Failed to get address for PHB#%x-PE#%x\n",
> -				__func__, pe->phb->global_number, pe->config_addr);
> +				__func__, pdn->phb->global_number, config_addr);
>  			return 0;
>  		}
>  
> @@ -463,11 +465,10 @@ static int pseries_eeh_get_pe_addr(struct eeh_pe *pe)
>  
>  	if (ibm_get_config_addr_info != RTAS_UNKNOWN_SERVICE) {
>  		ret = rtas_call(ibm_get_config_addr_info, 4, 2, rets,
> -				pe->config_addr, BUID_HI(pe->phb->buid),
> -				BUID_LO(pe->phb->buid), 0);
> +				config_addr, BUID_HI(buid), BUID_LO(buid), 0);
>  		if (ret) {
>  			pr_warn("%s: Failed to get address for PHB#%x-PE#%x\n",
> -				__func__, pe->phb->global_number, pe->config_addr);
> +				__func__, pdn->phb->global_number, config_addr);
>  			return 0;
>  		}
>  
> @@ -839,7 +840,6 @@ static struct eeh_ops pseries_eeh_ops = {
>  	.init			= pseries_eeh_init,
>  	.probe			= pseries_eeh_probe,
>  	.set_option		= pseries_eeh_set_option,
> -	.get_pe_addr		= pseries_eeh_get_pe_addr,
>  	.get_state		= pseries_eeh_get_state,
>  	.reset			= pseries_eeh_reset,
>  	.get_log		= pseries_eeh_get_log,
> 

-- 
Alexey

^ permalink raw reply

* Re: [v3 4/5] KVM: PPC: Book3S HV: retry page migration before erroring-out H_SVM_PAGE_IN
From: Bharata B Rao @ 2020-07-13  9:50 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1594458827-31866-5-git-send-email-linuxram@us.ibm.com>

On Sat, Jul 11, 2020 at 02:13:46AM -0700, Ram Pai wrote:
> The page requested for page-in; sometimes, can have transient
> references, and hence cannot migrate immediately. Retry a few times
> before returning error.

As I noted in the previous patch, we need to understand what are these
transient errors and they occur on what type of pages?

The previous patch also introduced a bit of retry logic in the
page-in path. Can you consolidate the retry logic into a separate
patch?

> 
> H_SVM_PAGE_IN interface is enhanced to return H_BUSY if the page is
> not in a migratable state.
> 
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Claudio Carvalho <cclaudio@linux.ibm.com>
> Cc: kvm-ppc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  Documentation/powerpc/ultravisor.rst |  1 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c   | 54 +++++++++++++++++++++---------------
>  2 files changed, 33 insertions(+), 22 deletions(-)
> 
> diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
> index d98fc85..638d1a7 100644
> --- a/Documentation/powerpc/ultravisor.rst
> +++ b/Documentation/powerpc/ultravisor.rst
> @@ -1034,6 +1034,7 @@ Return values
>  	* H_PARAMETER	if ``guest_pa`` is invalid.
>  	* H_P2		if ``flags`` is invalid.
>  	* H_P3		if ``order`` of page is invalid.
> +	* H_BUSY	if ``page`` is not in a state to pagein
>  
>  Description
>  ~~~~~~~~~~~
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 12ed52a..c9bdef6 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -843,7 +843,7 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
>  	struct vm_area_struct *vma;
>  	int srcu_idx;
>  	unsigned long gfn = gpa >> page_shift;
> -	int ret;
> +	int ret, repeat_count = REPEAT_COUNT;
>  
>  	if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
>  		return H_UNSUPPORTED;
> @@ -857,34 +857,44 @@ unsigned long kvmppc_h_svm_page_in(struct kvm *kvm, unsigned long gpa,
>  	if (flags & H_PAGE_IN_SHARED)
>  		return kvmppc_share_page(kvm, gpa, page_shift);
>  
> -	ret = H_PARAMETER;
>  	srcu_idx = srcu_read_lock(&kvm->srcu);
> -	down_write(&kvm->mm->mmap_sem);
>  
> -	start = gfn_to_hva(kvm, gfn);
> -	if (kvm_is_error_hva(start))
> -		goto out;
> -
> -	mutex_lock(&kvm->arch.uvmem_lock);
>  	/* Fail the page-in request of an already paged-in page */
> -	if (kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL))
> -		goto out_unlock;
> +	mutex_lock(&kvm->arch.uvmem_lock);
> +	ret = kvmppc_gfn_is_uvmem_pfn(gfn, kvm, NULL);
> +	mutex_unlock(&kvm->arch.uvmem_lock);
> +	if (ret) {
> +		srcu_read_unlock(&kvm->srcu, srcu_idx);
> +		return H_PARAMETER;
> +	}
>  
> -	end = start + (1UL << page_shift);
> -	vma = find_vma_intersection(kvm->mm, start, end);
> -	if (!vma || vma->vm_start > start || vma->vm_end < end)
> -		goto out_unlock;
> +	do {
> +		ret = H_PARAMETER;
> +		down_write(&kvm->mm->mmap_sem);

Again with ksm_madvise() moved to init time, check if you still need
write mmap_sem here.

Regards,
Bharata.

^ permalink raw reply

* Re: [v3 3/5] KVM: PPC: Book3S HV: migrate remaining normal-GFNs to secure-GFNs in H_SVM_INIT_DONE
From: Bharata B Rao @ 2020-07-13  9:45 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1594458827-31866-4-git-send-email-linuxram@us.ibm.com>

On Sat, Jul 11, 2020 at 02:13:45AM -0700, Ram Pai wrote:
> The Ultravisor is expected to explicitly call H_SVM_PAGE_IN for all the pages
> of the SVM before calling H_SVM_INIT_DONE. This causes a huge delay in
> tranistioning the VM to SVM. The Ultravisor is interested in the pages that
> contain the kernel, initrd and other important data structures. The rest of the
> pages contain throw-away content. Hence requesting just the necessary and
> sufficient pages from the Hypervisor is sufficient.
> 
> However if not all pages are requested by the Ultravisor, the Hypervisor
> continues to consider the GFNs corresponding to the non-requested pages as
> normal GFNs. This can lead to data-corruption and undefined behavior.
> 
> Move all the PFNs associated with the SVM's GFNs to secure-PFNs, in
> H_SVM_INIT_DONE. Skip the GFNs that are already Paged-in or Shared or
> Paged-in followed by a Paged-out.
> 
> Cc: Paul Mackerras <paulus@ozlabs.org>
> Cc: Benjamin Herrenschmidt <benh@kernel.crashing.org>
> Cc: Michael Ellerman <mpe@ellerman.id.au>
> Cc: Bharata B Rao <bharata@linux.ibm.com>
> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
> Cc: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
> Cc: Laurent Dufour <ldufour@linux.ibm.com>
> Cc: Thiago Jung Bauermann <bauerman@linux.ibm.com>
> Cc: David Gibson <david@gibson.dropbear.id.au>
> Cc: Claudio Carvalho <cclaudio@linux.ibm.com>
> Cc: kvm-ppc@vger.kernel.org
> Cc: linuxppc-dev@lists.ozlabs.org
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  Documentation/powerpc/ultravisor.rst        |   2 +
>  arch/powerpc/include/asm/kvm_book3s_uvmem.h |   2 +
>  arch/powerpc/kvm/book3s_hv_uvmem.c          | 166 +++++++++++++++++++++++++---
>  3 files changed, 156 insertions(+), 14 deletions(-)
> 
> diff --git a/Documentation/powerpc/ultravisor.rst b/Documentation/powerpc/ultravisor.rst
> index df136c8..d98fc85 100644
> --- a/Documentation/powerpc/ultravisor.rst
> +++ b/Documentation/powerpc/ultravisor.rst
> @@ -933,6 +933,8 @@ Return values
>  	* H_UNSUPPORTED		if called from the wrong context (e.g.
>  				from an SVM or before an H_SVM_INIT_START
>  				hypercall).
> +	* H_STATE		if the hypervisor could not successfully
> +                                transition the VM to Secure VM.
>  
>  Description
>  ~~~~~~~~~~~
> diff --git a/arch/powerpc/include/asm/kvm_book3s_uvmem.h b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> index 9cb7d8b..f229ab5 100644
> --- a/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> +++ b/arch/powerpc/include/asm/kvm_book3s_uvmem.h
> @@ -23,6 +23,8 @@ unsigned long kvmppc_h_svm_page_out(struct kvm *kvm,
>  unsigned long kvmppc_h_svm_init_abort(struct kvm *kvm);
>  void kvmppc_uvmem_drop_pages(const struct kvm_memory_slot *free,
>  			     struct kvm *kvm, bool skip_page_out);
> +int kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
> +			const struct kvm_memory_slot *memslot);
>  #else
>  static inline int kvmppc_uvmem_init(void)
>  {
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 6d6c256..12ed52a 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -93,6 +93,7 @@
>  #include <asm/ultravisor.h>
>  #include <asm/mman.h>
>  #include <asm/kvm_ppc.h>
> +#include <asm/kvm_book3s_uvmem.h>
>  
>  static struct dev_pagemap kvmppc_uvmem_pgmap;
>  static unsigned long *kvmppc_uvmem_bitmap;
> @@ -348,6 +349,43 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
>  	return false;
>  }
>  
> +/*
> + * starting from *gfn search for the next available GFN that is not yet
> + * transitioned to a secure GFN.  return the value of that GFN in *gfn.  If a
> + * GFN is found, return true, else return false
> + */
> +static bool kvmppc_next_nontransitioned_gfn(const struct kvm_memory_slot *memslot,
> +		struct kvm *kvm, unsigned long *gfn)
> +{
> +	struct kvmppc_uvmem_slot *p;
> +	bool ret = false;
> +	unsigned long i;
> +
> +	mutex_lock(&kvm->arch.uvmem_lock);
> +
> +	list_for_each_entry(p, &kvm->arch.uvmem_pfns, list)
> +		if (*gfn >= p->base_pfn && *gfn < p->base_pfn + p->nr_pfns)
> +			break;
> +	if (!p)
> +		goto out;
> +	/*
> +	 * The code below assumes, one to one correspondence between
> +	 * kvmppc_uvmem_slot and memslot.
> +	 */
> +	for (i = *gfn; i < p->base_pfn + p->nr_pfns; i++) {
> +		unsigned long index = i - p->base_pfn;
> +
> +		if (!(p->pfns[index] & KVMPPC_GFN_FLAG_MASK)) {
> +			*gfn = i;
> +			ret = true;
> +			break;
> +		}
> +	}
> +out:
> +	mutex_unlock(&kvm->arch.uvmem_lock);
> +	return ret;
> +}
> +
>  static int kvmppc_memslot_page_merge(struct kvm *kvm,
>  		struct kvm_memory_slot *memslot, bool merge)
>  {
> @@ -461,12 +499,31 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  
>  unsigned long kvmppc_h_svm_init_done(struct kvm *kvm)
>  {
> +	struct kvm_memslots *slots;
> +	struct kvm_memory_slot *memslot;
> +	int srcu_idx;
> +	long ret = H_SUCCESS;
> +
>  	if (!(kvm->arch.secure_guest & KVMPPC_SECURE_INIT_START))
>  		return H_UNSUPPORTED;
>  
> +	/* migrate any unmoved normal pfn to device pfns*/
> +	srcu_idx = srcu_read_lock(&kvm->srcu);
> +	slots = kvm_memslots(kvm);
> +	kvm_for_each_memslot(memslot, slots) {
> +		ret = kvmppc_uv_migrate_mem_slot(kvm, memslot);
> +		if (ret) {
> +			ret = H_STATE;
> +			goto out;
> +		}
> +	}
> +
>  	kvm->arch.secure_guest |= KVMPPC_SECURE_INIT_DONE;
>  	pr_info("LPID %d went secure\n", kvm->arch.lpid);
> -	return H_SUCCESS;
> +
> +out:
> +	srcu_read_unlock(&kvm->srcu, srcu_idx);
> +	return ret;
>  }
>  
>  /*
> @@ -587,12 +644,14 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>  }
>  
>  /*
> - * Alloc a PFN from private device memory pool and copy page from normal
> - * memory to secure memory using UV_PAGE_IN uvcall.
> + * Alloc a PFN from private device memory pool. If @pagein is true,
> + * copy page from normal memory to secure memory using UV_PAGE_IN uvcall.
>   */
> -static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
> -		   unsigned long end, unsigned long gpa, struct kvm *kvm,
> -		   unsigned long page_shift)
> +static int kvmppc_svm_migrate_page(struct vm_area_struct *vma,
> +		unsigned long start,
> +		unsigned long end, unsigned long gpa, struct kvm *kvm,
> +		unsigned long page_shift,
> +		bool pagein)
>  {
>  	unsigned long src_pfn, dst_pfn = 0;
>  	struct migrate_vma mig;
> @@ -613,7 +672,7 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
>  		return ret;
>  
>  	if (!(*mig.src & MIGRATE_PFN_MIGRATE)) {
> -		ret = -1;
> +		ret = -2;

migrate_vma_setup() has marked that this pfn can't be migrated. What
transient errors are you observing which will disappear within 10
retries?

Also till now when UV used to pull in all the pages, we never seemed to
have hit these transient errors. But now when HV is pushing the same
pages, we see these errors which are disappearing after 10 retries.
Can you explain this more please? What sort of pages are these?

>  		goto out_finalize;
>  	}
>  
> @@ -623,11 +682,16 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
>  		goto out_finalize;
>  	}
>  
> -	pfn = *mig.src >> MIGRATE_PFN_SHIFT;
> -	spage = migrate_pfn_to_page(*mig.src);
> -	if (spage)
> -		uv_page_in(kvm->arch.lpid, pfn << page_shift, gpa, 0,
> -			   page_shift);
> +	if (pagein) {
> +		pfn = *mig.src >> MIGRATE_PFN_SHIFT;
> +		spage = migrate_pfn_to_page(*mig.src);
> +		if (spage) {
> +			ret = uv_page_in(kvm->arch.lpid, pfn << page_shift,
> +					gpa, 0, page_shift);
> +			if (ret)
> +				goto out_finalize;
> +		}
> +	}
>  
>  	*mig.dst = migrate_pfn(page_to_pfn(dpage)) | MIGRATE_PFN_LOCKED;
>  	migrate_vma_pages(&mig);
> @@ -637,6 +701,77 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
>  }
>  
>  /*
> + * return 1, if some page migration failed because of transient error,
> + * while the remaining pages migrated successfully.
> + * The caller can use this as a hint to retry.
> + *
> + * return 0 otherwise. *ret indicates the success status
> + * of this call.
> + */
> +static int __kvmppc_uv_migrate_mem_slot(struct kvm *kvm,
> +		const struct kvm_memory_slot *memslot, int *ret)
> +{
> +	unsigned long gfn = memslot->base_gfn;
> +	struct vm_area_struct *vma;
> +	unsigned long start, end;
> +	bool retry = 0;
> +
> +	*ret = 0;
> +	while (kvmppc_next_nontransitioned_gfn(memslot, kvm, &gfn)) {
> +
> +		down_write(&kvm->mm->mmap_sem);

Acquiring and releasing mmap_sem in a loop? Any reason?

Now that you have moved ksm_madvise() calls to init time, any specific
reason to take write mmap_sem here?

> +		start = gfn_to_hva(kvm, gfn);
> +		if (kvm_is_error_hva(start)) {
> +			*ret = H_STATE;
> +			goto next;
> +		}
> +
> +		end = start + (1UL << PAGE_SHIFT);
> +		vma = find_vma_intersection(kvm->mm, start, end);
> +		if (!vma || vma->vm_start > start || vma->vm_end < end) {
> +			*ret = H_STATE;
> +			goto next;
> +		}
> +
> +		mutex_lock(&kvm->arch.uvmem_lock);
> +		*ret = kvmppc_svm_migrate_page(vma, start, end,
> +				(gfn << PAGE_SHIFT), kvm, PAGE_SHIFT, false);
> +		mutex_unlock(&kvm->arch.uvmem_lock);
> +
> +next:
> +		up_write(&kvm->mm->mmap_sem);
> +
> +		if (*ret == -2) {
> +			retry = 1;

Using true or false assignment makes it easy to understand the intention
of this 'retry' variable.

> +			continue;
> +		}
> +
> +		if (*ret)
> +			return 0;
> +	}
> +	return retry;

The function is supposed to return an int, but you are returning a bool.

Regards,
Bharata.

^ permalink raw reply

* Re: [PATCH 13/20] Documentation: mips/ingenic-tcu: eliminate duplicated word
From: Thomas Bogendoerfer @ 2020-07-13  9:28 UTC (permalink / raw)
  To: Randy Dunlap
  Cc: kvm, linux-doc, David Airlie, kgdb-bugreport, linux-fpga,
	Liviu Dudau, dri-devel, Douglas Anderson, Paul Cercueil, keyrings,
	Paul Mackerras, linux-i2c, Pavel Machek, Srinivas Pandruvada,
	Mihail Atanassov, linux-leds, linux-s390, Daniel Thompson,
	linux-scsi, Jonathan Corbet, Masahiro Yamada, Matthew Wilcox,
	Halil Pasic, Jarkko Sakkinen, James Wang, linux-input,
	Mali DP Maintainers, Derek Kiernan, linux-mips, Dragan Cvetic,
	Wu Hao, Tony Krowiak, linux-kbuild, James E.J. Bottomley,
	Jiri Kosina, Hannes Reinecke, linux-block, Jacek Anaszewski,
	linux-mm, Dan Williams, Andrew Morton, Mimi Zohar, Jens Axboe,
	Michal Marek, Martin K. Petersen, Pierre Morel, linux-kernel,
	Wolfram Sang, Daniel Vetter, Jason Wessel, Paolo Bonzini,
	linux-integrity, linuxppc-dev, Mike Rapoport, Dan Murphy
In-Reply-To: <20200707180414.10467-14-rdunlap@infradead.org>

On Tue, Jul 07, 2020 at 11:04:07AM -0700, Randy Dunlap wrote:
> Drop the doubled word "to".
> 
> Signed-off-by: Randy Dunlap <rdunlap@infradead.org>
> Cc: Jonathan Corbet <corbet@lwn.net>
> Cc: linux-doc@vger.kernel.org
> Cc: Paul Cercueil <paul@crapouillou.net>
> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
> Cc: linux-mips@vger.kernel.org
> ---
>  Documentation/mips/ingenic-tcu.rst |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Acked-by: Thomas Bogendoerfer <tsbogend@alpha.franken.de>

Thomas.

-- 
Crap can work. Given enough thrust pigs will fly, but it's not necessarily a
good idea.                                                [ RFC1925, 2.3 ]

^ permalink raw reply

* Re: [PATCH 03/14] powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev
From: Oliver O'Halloran @ 2020-07-13  9:02 UTC (permalink / raw)
  To: Alexey Kardashevskiy; +Cc: linuxppc-dev, Mahesh J Salgaonkar
In-Reply-To: <7853bbc2-715b-110a-2d96-8d32e6141261@ozlabs.ru>

On Mon, Jul 13, 2020 at 6:56 PM Alexey Kardashevskiy <aik@ozlabs.ru> wrote:
>
>
>
> On 06/07/2020 11:36, Oliver O'Halloran wrote:
> > Drivers that do not support the PCI error handling callbacks are handled by
> > tearing down the device and re-probing them. If the device to be removed is
> > a virtual function we need to know the index of the index of the VF so that
>
> Too many indexes in "the index of the index of "?

I'll index you!

(yes)

^ permalink raw reply

* Re: [PATCH 04/14] powerpc/pseries: Stop using pdn->pe_number
From: Alexey Kardashevskiy @ 2020-07-13  8:59 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: mahesh
In-Reply-To: <20200706013619.459420-5-oohall@gmail.com>



On 06/07/2020 11:36, Oliver O'Halloran wrote:
> The pci_dn->pe_number field is mainly used to track the IODA PE number of a
> device on PowerNV. At some point it grew a user in the pseries SR-IOV
> support which muddies the waters a bit, so remove it.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>


Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>



> ---
>  arch/powerpc/platforms/pseries/eeh_pseries.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/eeh_pseries.c b/arch/powerpc/platforms/pseries/eeh_pseries.c
> index ace117f99d94..18a2522b9b5e 100644
> --- a/arch/powerpc/platforms/pseries/eeh_pseries.c
> +++ b/arch/powerpc/platforms/pseries/eeh_pseries.c
> @@ -52,8 +52,6 @@ void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
>  	dev_dbg(&pdev->dev, "EEH: Setting up device\n");
>  #ifdef CONFIG_PCI_IOV
>  	if (pdev->is_virtfn) {
> -		struct pci_dn *physfn_pdn;
> -
>  		pdn->device_id  =  pdev->device;
>  		pdn->vendor_id  =  pdev->vendor;
>  		pdn->class_code =  pdev->class;
> @@ -63,8 +61,6 @@ void pseries_pcibios_bus_add_device(struct pci_dev *pdev)
>  		 * completion from platform.
>  		 */
>  		pdn->last_allow_rc =  0;
> -		physfn_pdn      =  pci_get_pdn(pdev->physfn);
> -		pdn->pe_number  =  physfn_pdn->pe_num_map[pdn->vf_index];
>  	}
>  #endif
>  	pseries_eeh_init_edev(pdn);
> @@ -772,8 +768,8 @@ int pseries_send_allow_unfreeze(struct pci_dn *pdn,
>  
>  static int pseries_call_allow_unfreeze(struct eeh_dev *edev)
>  {
> +	int cur_vfs = 0, rc = 0, vf_index, bus, devfn, vf_pe_num;
>  	struct pci_dn *pdn, *tmp, *parent, *physfn_pdn;
> -	int cur_vfs = 0, rc = 0, vf_index, bus, devfn;
>  	u16 *vf_pe_array;
>  
>  	vf_pe_array = kzalloc(RTAS_DATA_BUF_SIZE, GFP_KERNEL);
> @@ -806,8 +802,10 @@ static int pseries_call_allow_unfreeze(struct eeh_dev *edev)
>  			}
>  		} else {
>  			pdn = pci_get_pdn(edev->pdev);
> -			vf_pe_array[0] = cpu_to_be16(pdn->pe_number);
>  			physfn_pdn = pci_get_pdn(edev->physfn);
> +
> +			vf_pe_num = physfn_pdn->pe_num_map[edev->vf_index];
> +			vf_pe_array[0] = cpu_to_be16(vf_pe_num);
>  			rc = pseries_send_allow_unfreeze(physfn_pdn,
>  							 vf_pe_array, 1);
>  			pdn->last_allow_rc = rc;
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH 03/14] powerpc/eeh: Move vf_index out of pci_dn and into eeh_dev
From: Alexey Kardashevskiy @ 2020-07-13  8:55 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: mahesh
In-Reply-To: <20200706013619.459420-4-oohall@gmail.com>



On 06/07/2020 11:36, Oliver O'Halloran wrote:
> Drivers that do not support the PCI error handling callbacks are handled by
> tearing down the device and re-probing them. If the device to be removed is
> a virtual function we need to know the index of the index of the VF so that

Too many indexes in "the index of the index of "?


> we can remove it with the pci_iov_{add|remove}_virtfn() API.
> 
> Currently this is handled by looking up the pci_dn, and using the vf_index
> that was stashed there when the pci_dn for the VF was created in
> pcibios_sriov_enable(). We would like to eliminate the use of pci_dn
> outside of pseries though so we need to provide the generic EEH code with
> some other way to find the vf_index.
> 
> The easiest thing to do here is move the vf_index field out of pci_dn and
> into eeh_dev.  Currently pci_dn and eeh_dev are allocated and initialized
> together so this is a fairly minimal change in preparation for splitting
> pci_dn and eeh_dev in the future.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>



Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
>  arch/powerpc/include/asm/eeh.h        | 3 +++
>  arch/powerpc/include/asm/pci-bridge.h | 1 -
>  arch/powerpc/kernel/eeh_driver.c      | 6 ++----
>  arch/powerpc/kernel/pci_dn.c          | 7 ++++---
>  4 files changed, 9 insertions(+), 8 deletions(-)
> 
> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
> index e22881a0c415..3d648e042835 100644
> --- a/arch/powerpc/include/asm/eeh.h
> +++ b/arch/powerpc/include/asm/eeh.h
> @@ -148,7 +148,10 @@ struct eeh_dev {
>  	struct pci_dn *pdn;		/* Associated PCI device node	*/
>  	struct pci_dev *pdev;		/* Associated PCI device	*/
>  	bool in_error;			/* Error flag for edev		*/
> +
> +	/* VF specific properties */
>  	struct pci_dev *physfn;		/* Associated SRIOV PF		*/
> +	int vf_index;			/* Index of this VF 		*/
>  };
>  
>  /* "fmt" must be a simple literal string */
> diff --git a/arch/powerpc/include/asm/pci-bridge.h b/arch/powerpc/include/asm/pci-bridge.h
> index b92e81b256e5..d2a2a14e56f9 100644
> --- a/arch/powerpc/include/asm/pci-bridge.h
> +++ b/arch/powerpc/include/asm/pci-bridge.h
> @@ -202,7 +202,6 @@ struct pci_dn {
>  #define IODA_INVALID_PE		0xFFFFFFFF
>  	unsigned int pe_number;
>  #ifdef CONFIG_PCI_IOV
> -	int     vf_index;		/* VF index in the PF */
>  	u16     vfs_expanded;		/* number of VFs IOV BAR expanded */
>  	u16     num_vfs;		/* number of VFs enabled*/
>  	unsigned int *pe_num_map;	/* PE# for the first VF PE or array */
> diff --git a/arch/powerpc/kernel/eeh_driver.c b/arch/powerpc/kernel/eeh_driver.c
> index 7b048cee767c..b70b9273f45a 100644
> --- a/arch/powerpc/kernel/eeh_driver.c
> +++ b/arch/powerpc/kernel/eeh_driver.c
> @@ -477,7 +477,7 @@ static void *eeh_add_virt_device(struct eeh_dev *edev)
>  	}
>  
>  #ifdef CONFIG_PCI_IOV
> -	pci_iov_add_virtfn(edev->physfn, eeh_dev_to_pdn(edev)->vf_index);
> +	pci_iov_add_virtfn(edev->physfn, edev->vf_index);
>  #endif
>  	return NULL;
>  }
> @@ -521,9 +521,7 @@ static void eeh_rmv_device(struct eeh_dev *edev, void *userdata)
>  
>  	if (edev->physfn) {
>  #ifdef CONFIG_PCI_IOV
> -		struct pci_dn *pdn = eeh_dev_to_pdn(edev);
> -
> -		pci_iov_remove_virtfn(edev->physfn, pdn->vf_index);
> +		pci_iov_remove_virtfn(edev->physfn, edev->vf_index);
>  		edev->pdev = NULL;
>  #endif
>  		if (rmv_data)
> diff --git a/arch/powerpc/kernel/pci_dn.c b/arch/powerpc/kernel/pci_dn.c
> index f790a8d06f50..bf11ac8427ac 100644
> --- a/arch/powerpc/kernel/pci_dn.c
> +++ b/arch/powerpc/kernel/pci_dn.c
> @@ -146,7 +146,6 @@ static struct eeh_dev *eeh_dev_init(struct pci_dn *pdn)
>  
>  #ifdef CONFIG_PCI_IOV
>  static struct pci_dn *add_one_sriov_vf_pdn(struct pci_dn *parent,
> -					   int vf_index,
>  					   int busno, int devfn)
>  {
>  	struct pci_dn *pdn;
> @@ -163,7 +162,6 @@ static struct pci_dn *add_one_sriov_vf_pdn(struct pci_dn *parent,
>  	pdn->parent = parent;
>  	pdn->busno = busno;
>  	pdn->devfn = devfn;
> -	pdn->vf_index = vf_index;
>  	pdn->pe_number = IODA_INVALID_PE;
>  	INIT_LIST_HEAD(&pdn->child_list);
>  	INIT_LIST_HEAD(&pdn->list);
> @@ -194,7 +192,7 @@ struct pci_dn *add_sriov_vf_pdns(struct pci_dev *pdev)
>  	for (i = 0; i < pci_sriov_get_totalvfs(pdev); i++) {
>  		struct eeh_dev *edev __maybe_unused;
>  
> -		pdn = add_one_sriov_vf_pdn(parent, i,
> +		pdn = add_one_sriov_vf_pdn(parent,
>  					   pci_iov_virtfn_bus(pdev, i),
>  					   pci_iov_virtfn_devfn(pdev, i));
>  		if (!pdn) {
> @@ -207,7 +205,10 @@ struct pci_dn *add_sriov_vf_pdns(struct pci_dev *pdev)
>  		/* Create the EEH device for the VF */
>  		edev = eeh_dev_init(pdn);
>  		BUG_ON(!edev);
> +
> +		/* FIXME: these should probably be populated by the EEH probe */
>  		edev->physfn = pdev;
> +		edev->vf_index = i;
>  #endif /* CONFIG_EEH */
>  	}
>  	return pci_get_pdn(pdev);
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH 2/2] powerpc/powernv: Move pnv_ioda_setup_bus_dma under CONFIG_IOMMU_API
From: Alexey Kardashevskiy @ 2020-07-13  8:39 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev; +Cc: kernel test robot
In-Reply-To: <20200705133557.443607-2-oohall@gmail.com>



On 05/07/2020 23:35, Oliver O'Halloran wrote:
> pnv_ioda_setup_bus_dma() is only used when a passed through PE is
> returned to the host. If the kernel is built without IOMMU support
> this is dead code.

True.

Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>




> Move it under the #ifdef with the rest of the
> IOMMU API support.
> 
> Reported-by: kernel test robot <lkp@intel.com>
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 26 +++++++++++------------
>  1 file changed, 13 insertions(+), 13 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index c2d46d28114b..31c3e6d58c41 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1885,19 +1885,6 @@ static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
>  	return false;
>  }
>  
> -static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
> -{
> -	struct pci_dev *dev;
> -
> -	list_for_each_entry(dev, &bus->devices, bus_list) {
> -		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
> -		dev->dev.archdata.dma_offset = pe->tce_bypass_base;
> -
> -		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> -			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
> -	}
> -}
> -
>  static inline __be64 __iomem *pnv_ioda_get_inval_reg(struct pnv_phb *phb,
>  						     bool real_mode)
>  {
> @@ -2547,6 +2534,19 @@ static long pnv_pci_ioda2_create_table_userspace(
>  	return ret;
>  }
>  
> +static void pnv_ioda_setup_bus_dma(struct pnv_ioda_pe *pe, struct pci_bus *bus)
> +{
> +	struct pci_dev *dev;
> +
> +	list_for_each_entry(dev, &bus->devices, bus_list) {
> +		set_iommu_table_base(&dev->dev, pe->table_group.tables[0]);
> +		dev->dev.archdata.dma_offset = pe->tce_bypass_base;
> +
> +		if ((pe->flags & PNV_IODA_PE_BUS_ALL) && dev->subordinate)
> +			pnv_ioda_setup_bus_dma(pe, dev->subordinate);
> +	}
> +}
> +
>  static void pnv_ioda2_take_ownership(struct iommu_table_group *table_group)
>  {
>  	struct pnv_ioda_pe *pe = container_of(table_group, struct pnv_ioda_pe,
> 

-- 
Alexey

^ permalink raw reply

* [PATCH] powerpc: Add cputime_to_nsecs()
From: Anton Blanchard @ 2020-07-13  8:36 UTC (permalink / raw)
  To: benh, paulus, mpe, npiggin; +Cc: linuxppc-dev

Generic code has a wrapper to implement cputime_to_nsecs() on top of
cputime_to_usecs() but we can easily return the full nanosecond
resolution directly.

Signed-off-by: Anton Blanchard <anton@ozlabs.org>
---
 arch/powerpc/include/asm/cputime.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/cputime.h b/arch/powerpc/include/asm/cputime.h
index 0fccd5ea1e9a..9335b93924b4 100644
--- a/arch/powerpc/include/asm/cputime.h
+++ b/arch/powerpc/include/asm/cputime.h
@@ -36,6 +36,8 @@ static inline unsigned long cputime_to_usecs(const cputime_t ct)
 	return mulhdu((__force u64) ct, __cputime_usec_factor);
 }
 
+#define cputime_to_nsecs(cputime) tb_to_ns((__force u64)cputime)
+
 /*
  * PPC64 uses PACA which is task independent for storing accounting data while
  * PPC32 uses struct thread_info, therefore at task switch the accounting data
-- 
2.26.2


^ permalink raw reply related

* Re: [PATCH 02/15] powerpc/powernv/pci: Always tear down DMA windows on PE release
From: Alexey Kardashevskiy @ 2020-07-13  8:30 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200710052340.737567-3-oohall@gmail.com>



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Currently we have these two functions:
> 
> 	pnv_pci_ioda2_release_dma_pe(), and
> 	pnv_pci_ioda2_release_pe_dma()
> 
> The first is used when tearing down VF PEs and the other is used for normal
> devices. There's very little difference between the two though. The latter
> (non-VF) will skip a call to pnv_pci_ioda2_unset_window() unless
> CONFIG_IOMMU_API=y is set. There's no real point in doing this so fold the
> two together.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>



Reviewed-by: Alexey Kardashevskiy <aik@ozlabs.ru>


> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 30 +++--------------------
>  1 file changed, 3 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 687919db0347..bfb40607aa0e 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -1422,26 +1422,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
>  	return -EBUSY;
>  }
>  
> -static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
> -		int num);
> -
> -static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe *pe)
> -{
> -	struct iommu_table    *tbl;
> -	int64_t               rc;
> -
> -	tbl = pe->table_group.tables[0];
> -	rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
> -	if (rc)
> -		pe_warn(pe, "OPAL error %lld release DMA window\n", rc);
> -
> -	pnv_pci_ioda2_set_bypass(pe, false);
> -	if (pe->table_group.group) {
> -		iommu_group_put(pe->table_group.group);
> -		BUG_ON(pe->table_group.group);
> -	}
> -	iommu_tce_table_put(tbl);
> -}
> +static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe);
>  
>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
>  {
> @@ -1455,11 +1436,12 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
>  	if (!pdev->is_physfn)
>  		return;
>  
> +	/* FIXME: Use pnv_ioda_release_pe()? */
>  	list_for_each_entry_safe(pe, pe_n, &phb->ioda.pe_list, list) {
>  		if (pe->parent_dev != pdev)
>  			continue;
>  
> -		pnv_pci_ioda2_release_dma_pe(pdev, pe);
> +		pnv_pci_ioda2_release_pe_dma(pe);
>  
>  		/* Remove from list */
>  		mutex_lock(&phb->ioda.pe_list_mutex);
> @@ -2429,7 +2411,6 @@ static long pnv_pci_ioda2_setup_default_config(struct pnv_ioda_pe *pe)
>  	return 0;
>  }
>  
> -#if defined(CONFIG_IOMMU_API) || defined(CONFIG_PCI_IOV)
>  static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
>  		int num)
>  {
> @@ -2453,7 +2434,6 @@ static long pnv_pci_ioda2_unset_window(struct iommu_table_group *table_group,
>  
>  	return ret;
>  }
> -#endif
>  
>  #ifdef CONFIG_IOMMU_API
>  unsigned long pnv_pci_ioda2_get_table_size(__u32 page_shift,
> @@ -3334,18 +3314,14 @@ static void pnv_pci_ioda2_release_pe_dma(struct pnv_ioda_pe *pe)
>  {
>  	struct iommu_table *tbl = pe->table_group.tables[0];
>  	unsigned int weight = pnv_pci_ioda_pe_dma_weight(pe);
> -#ifdef CONFIG_IOMMU_API
>  	int64_t rc;
> -#endif
>  
>  	if (!weight)
>  		return;
>  
> -#ifdef CONFIG_IOMMU_API
>  	rc = pnv_pci_ioda2_unset_window(&pe->table_group, 0);
>  	if (rc)
>  		pe_warn(pe, "OPAL error %lld release DMA window\n", rc);
> -#endif
>  
>  	pnv_pci_ioda2_set_bypass(pe, false);
>  	if (pe->table_group.group) {
> 

-- 
Alexey

^ permalink raw reply

* Re: [PATCH 01/15] powernv/pci: Add pci_bus_to_pnvhb() helper
From: Alexey Kardashevskiy @ 2020-07-13  8:28 UTC (permalink / raw)
  To: Oliver O'Halloran, linuxppc-dev
In-Reply-To: <20200710052340.737567-2-oohall@gmail.com>



On 10/07/2020 15:23, Oliver O'Halloran wrote:
> Add a helper to go from a pci_bus structure to the pnv_phb that hosts that
> bus. There's a lot of instances of the following pattern:
> 
> 	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> 	struct pnv_phb *phb = hose->private_data;
> 
> Without any other uses of the pci_controller inside the function. This is
> hard to read since it requires you to memorise the contents of the
> private data fields and kind of error prone since it involves blindly
> assigning a void pointer. Add a helper to make it more concise and
> explicit.
> 
> Signed-off-by: Oliver O'Halloran <oohall@gmail.com>
> ---
>  arch/powerpc/platforms/powernv/pci-ioda.c | 88 +++++++----------------
>  arch/powerpc/platforms/powernv/pci.c      | 14 ++--
>  arch/powerpc/platforms/powernv/pci.h      | 10 +++
>  3 files changed, 38 insertions(+), 74 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/pci-ioda.c b/arch/powerpc/platforms/powernv/pci-ioda.c
> index 31c3e6d58c41..687919db0347 100644
> --- a/arch/powerpc/platforms/powernv/pci-ioda.c
> +++ b/arch/powerpc/platforms/powernv/pci-ioda.c
> @@ -252,8 +252,7 @@ static int pnv_ioda2_init_m64(struct pnv_phb *phb)
>  static void pnv_ioda_reserve_dev_m64_pe(struct pci_dev *pdev,
>  					 unsigned long *pe_bitmap)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct resource *r;
>  	resource_size_t base, sgsz, start, end;
>  	int segno, i;
> @@ -351,8 +350,7 @@ static void pnv_ioda_reserve_m64_pe(struct pci_bus *bus,
>  
>  static struct pnv_ioda_pe *pnv_ioda_pick_m64_pe(struct pci_bus *bus, bool all)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
>  	struct pnv_ioda_pe *master_pe, *pe;
>  	unsigned long size, *pe_alloc;
>  	int i;
> @@ -673,8 +671,7 @@ struct pnv_ioda_pe *pnv_pci_bdfn_to_pe(struct pnv_phb *phb, u16 bdfn)
>  
>  struct pnv_ioda_pe *pnv_ioda_get_pe(struct pci_dev *dev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(dev);
>  
>  	if (!pdn)
> @@ -1069,8 +1066,7 @@ static int pnv_pci_vf_resource_shift(struct pci_dev *dev, int offset)
>  
>  static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(dev);
>  	struct pnv_ioda_pe *pe;
>  
> @@ -1129,8 +1125,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_dev_PE(struct pci_dev *dev)
>   */
>  static struct pnv_ioda_pe *pnv_ioda_setup_bus_PE(struct pci_bus *bus, bool all)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
>  	struct pnv_ioda_pe *pe = NULL;
>  	unsigned int pe_num;
>  
> @@ -1196,8 +1191,7 @@ static struct pnv_ioda_pe *pnv_ioda_setup_npu_PE(struct pci_dev *npu_pdev)
>  	struct pnv_ioda_pe *pe;
>  	struct pci_dev *gpu_pdev;
>  	struct pci_dn *npu_pdn;
> -	struct pci_controller *hose = pci_bus_to_host(npu_pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(npu_pdev->bus);
>  
>  	/*
>  	 * Intentionally leak a reference on the npu device (for
> @@ -1300,16 +1294,12 @@ static void pnv_pci_ioda_setup_nvlink(void)
>  #ifdef CONFIG_PCI_IOV
>  static int pnv_pci_vf_release_m64(struct pci_dev *pdev, u16 num_vfs)
>  {
> -	struct pci_bus        *bus;
> -	struct pci_controller *hose;
>  	struct pnv_phb        *phb;
>  	struct pci_dn         *pdn;
>  	int                    i, j;
>  	int                    m64_bars;
>  
> -	bus = pdev->bus;
> -	hose = pci_bus_to_host(bus);
> -	phb = hose->private_data;
> +	phb = pci_bus_to_pnvhb(pdev->bus);
>  	pdn = pci_get_pdn(pdev);
>  
>  	if (pdn->m64_single_mode)
> @@ -1333,8 +1323,6 @@ static int pnv_pci_vf_release_m64(struct pci_dev *pdev, u16 num_vfs)
>  
>  static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
>  {
> -	struct pci_bus        *bus;
> -	struct pci_controller *hose;
>  	struct pnv_phb        *phb;
>  	struct pci_dn         *pdn;
>  	unsigned int           win;
> @@ -1346,9 +1334,7 @@ static int pnv_pci_vf_assign_m64(struct pci_dev *pdev, u16 num_vfs)
>  	int                    pe_num;
>  	int                    m64_bars;
>  
> -	bus = pdev->bus;
> -	hose = pci_bus_to_host(bus);
> -	phb = hose->private_data;
> +	phb = pci_bus_to_pnvhb(pdev->bus);
>  	pdn = pci_get_pdn(pdev);
>  	total_vfs = pci_sriov_get_totalvfs(pdev);
>  
> @@ -1459,15 +1445,11 @@ static void pnv_pci_ioda2_release_dma_pe(struct pci_dev *dev, struct pnv_ioda_pe
>  
>  static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
>  {
> -	struct pci_bus        *bus;
> -	struct pci_controller *hose;
>  	struct pnv_phb        *phb;
>  	struct pnv_ioda_pe    *pe, *pe_n;
>  	struct pci_dn         *pdn;
>  
> -	bus = pdev->bus;
> -	hose = pci_bus_to_host(bus);
> -	phb = hose->private_data;
> +	phb = pci_bus_to_pnvhb(pdev->bus);
>  	pdn = pci_get_pdn(pdev);
>  
>  	if (!pdev->is_physfn)
> @@ -1492,16 +1474,12 @@ static void pnv_ioda_release_vf_PE(struct pci_dev *pdev)
>  
>  static void pnv_pci_sriov_disable(struct pci_dev *pdev)
>  {
> -	struct pci_bus        *bus;
> -	struct pci_controller *hose;
>  	struct pnv_phb        *phb;
>  	struct pnv_ioda_pe    *pe;
>  	struct pci_dn         *pdn;
>  	u16                    num_vfs, i;
>  
> -	bus = pdev->bus;
> -	hose = pci_bus_to_host(bus);
> -	phb = hose->private_data;
> +	phb = pci_bus_to_pnvhb(pdev->bus);
>  	pdn = pci_get_pdn(pdev);
>  	num_vfs = pdn->num_vfs;
>  
> @@ -1535,17 +1513,13 @@ static void pnv_pci_ioda2_setup_dma_pe(struct pnv_phb *phb,
>  				       struct pnv_ioda_pe *pe);
>  static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  {
> -	struct pci_bus        *bus;
> -	struct pci_controller *hose;
>  	struct pnv_phb        *phb;
>  	struct pnv_ioda_pe    *pe;
>  	int                    pe_num;
>  	u16                    vf_index;
>  	struct pci_dn         *pdn;
>  
> -	bus = pdev->bus;
> -	hose = pci_bus_to_host(bus);
> -	phb = hose->private_data;
> +	phb = pci_bus_to_pnvhb(pdev->bus);
>  	pdn = pci_get_pdn(pdev);
>  
>  	if (!pdev->is_physfn)
> @@ -1572,7 +1546,7 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  		pe->rid = (vf_bus << 8) | vf_devfn;
>  
>  		pe_info(pe, "VF %04d:%02d:%02d.%d associated with PE#%x\n",
> -			hose->global_number, pdev->bus->number,
> +			pci_domain_nr(pdev->bus), pdev->bus->number,
>  			PCI_SLOT(vf_devfn), PCI_FUNC(vf_devfn), pe_num);
>  
>  		if (pnv_ioda_configure_pe(phb, pe)) {
> @@ -1602,17 +1576,13 @@ static void pnv_ioda_setup_vf_PE(struct pci_dev *pdev, u16 num_vfs)
>  
>  static int pnv_pci_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  {
> -	struct pci_bus        *bus;
> -	struct pci_controller *hose;
>  	struct pnv_phb        *phb;
>  	struct pnv_ioda_pe    *pe;
>  	struct pci_dn         *pdn;
>  	int                    ret;
>  	u16                    i;
>  
> -	bus = pdev->bus;
> -	hose = pci_bus_to_host(bus);
> -	phb = hose->private_data;
> +	phb = pci_bus_to_pnvhb(pdev->bus);
>  	pdn = pci_get_pdn(pdev);
>  
>  	if (phb->type == PNV_PHB_IODA2) {
> @@ -1735,8 +1705,7 @@ static int pnv_pcibios_sriov_enable(struct pci_dev *pdev, u16 num_vfs)
>  
>  static void pnv_pci_ioda_dma_dev_setup(struct pci_dev *pdev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(pdev);
>  	struct pnv_ioda_pe *pe;
>  
> @@ -1847,8 +1816,7 @@ static int pnv_pci_ioda_dma_64bit_bypass(struct pnv_ioda_pe *pe)
>  static bool pnv_pci_ioda_iommu_bypass_supported(struct pci_dev *pdev,
>  		u64 dma_mask)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(pdev);
>  	struct pnv_ioda_pe *pe;
>  
> @@ -2766,8 +2734,7 @@ static void pnv_pci_init_ioda_msis(struct pnv_phb *phb)
>  #ifdef CONFIG_PCI_IOV
>  static void pnv_pci_ioda_fixup_iov_resources(struct pci_dev *pdev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	const resource_size_t gate = phb->ioda.m64_segsize >> 2;
>  	struct resource *res;
>  	int i;
> @@ -3101,10 +3068,9 @@ static void pnv_pci_ioda_fixup(void)
>  static resource_size_t pnv_pci_window_alignment(struct pci_bus *bus,
>  						unsigned long type)
>  {
> -	struct pci_dev *bridge;
> -	struct pci_controller *hose = pci_bus_to_host(bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
>  	int num_pci_bridges = 0;
> +	struct pci_dev *bridge;
>  
>  	bridge = bus->self;
>  	while (bridge) {
> @@ -3190,8 +3156,7 @@ static void pnv_pci_fixup_bridge_resources(struct pci_bus *bus,
>  
>  static void pnv_pci_configure_bus(struct pci_bus *bus)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
>  	struct pci_dev *bridge = bus->self;
>  	struct pnv_ioda_pe *pe;
>  	bool all = (bridge && pci_pcie_type(bridge) == PCI_EXP_TYPE_PCI_BRIDGE);
> @@ -3237,8 +3202,7 @@ static resource_size_t pnv_pci_default_alignment(void)
>  static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
>  						      int resno)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(pdev);
>  	resource_size_t align;
>  
> @@ -3274,8 +3238,7 @@ static resource_size_t pnv_pci_iov_resource_alignment(struct pci_dev *pdev,
>   */
>  static bool pnv_pci_enable_device_hook(struct pci_dev *dev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
>  	struct pci_dn *pdn;
>  
>  	/* The function is probably called while the PEs have
> @@ -3488,8 +3451,7 @@ static void pnv_ioda_release_pe(struct pnv_ioda_pe *pe)
>  
>  static void pnv_pci_release_device(struct pci_dev *pdev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct pci_dn *pdn = pci_get_pdn(pdev);
>  	struct pnv_ioda_pe *pe;
>  
> @@ -3534,8 +3496,7 @@ static void pnv_pci_ioda_shutdown(struct pci_controller *hose)
>  
>  static void pnv_pci_ioda_dma_bus_setup(struct pci_bus *bus)
>  {
> -	struct pci_controller *hose = bus->sysdata;
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(bus);
>  	struct pnv_ioda_pe *pe;
>  
>  	list_for_each_entry(pe, &phb->ioda.pe_list, list) {
> @@ -3873,8 +3834,7 @@ void __init pnv_pci_init_npu2_opencapi_phb(struct device_node *np)
>  
>  static void pnv_npu2_opencapi_cfg_size_fixup(struct pci_dev *dev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(dev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
>  
>  	if (!machine_is(powernv))
>  		return;
> diff --git a/arch/powerpc/platforms/powernv/pci.c b/arch/powerpc/platforms/powernv/pci.c
> index 091fe1cf386b..9b9bca169275 100644
> --- a/arch/powerpc/platforms/powernv/pci.c
> +++ b/arch/powerpc/platforms/powernv/pci.c
> @@ -162,8 +162,7 @@ EXPORT_SYMBOL_GPL(pnv_pci_set_power_state);
>  
>  int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct msi_desc *entry;
>  	struct msi_msg msg;
>  	int hwirq;
> @@ -211,8 +210,7 @@ int pnv_setup_msi_irqs(struct pci_dev *pdev, int nvec, int type)
>  
>  void pnv_teardown_msi_irqs(struct pci_dev *pdev)
>  {
> -	struct pci_controller *hose = pci_bus_to_host(pdev->bus);
> -	struct pnv_phb *phb = hose->private_data;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(pdev->bus);
>  	struct msi_desc *entry;
>  	irq_hw_number_t hwirq;
>  
> @@ -824,10 +822,9 @@ EXPORT_SYMBOL(pnv_pci_get_phb_node);
>  
>  int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, int enable)
>  {
> -	__be64 val;
> -	struct pci_controller *hose;
> -	struct pnv_phb *phb;
> +	struct pnv_phb *phb = pci_bus_to_pnvhb(dev->bus);
>  	u64 tunnel_bar;
> +	__be64 val;
>  	int rc;
>  
>  	if (!opal_check_token(OPAL_PCI_GET_PBCQ_TUNNEL_BAR))
> @@ -835,9 +832,6 @@ int pnv_pci_set_tunnel_bar(struct pci_dev *dev, u64 addr, int enable)
>  	if (!opal_check_token(OPAL_PCI_SET_PBCQ_TUNNEL_BAR))
>  		return -ENXIO;
>  
> -	hose = pci_bus_to_host(dev->bus);
> -	phb = hose->private_data;
> -
>  	mutex_lock(&tunnel_mutex);
>  	rc = opal_pci_get_pbcq_tunnel_bar(phb->opal_id, &val);
>  	if (rc != OPAL_SUCCESS) {
> diff --git a/arch/powerpc/platforms/powernv/pci.h b/arch/powerpc/platforms/powernv/pci.h
> index 51c254f2f3cb..0727dec9a0d1 100644
> --- a/arch/powerpc/platforms/powernv/pci.h
> +++ b/arch/powerpc/platforms/powernv/pci.h
> @@ -260,4 +260,14 @@ extern void pnv_pci_setup_iommu_table(struct iommu_table *tbl,
>  
>  extern unsigned long pnv_ioda_parse_tce_sizes(struct pnv_phb *phb);
>  
> +static inline struct pnv_phb *pci_bus_to_pnvhb(struct pci_bus *bus)
> +{
> +	struct pci_controller *hose = bus->sysdata;
> +
> +	if (hose)
> +		return hose->private_data;


Since it is powernv, private_data should not ever be NULL so we want
BUG_ON here may be?


> +
> +	return NULL;
> +}
> +
>  #endif /* __POWERNV_PCI_H */
> 

-- 
Alexey

^ permalink raw reply

* [PATCH] powerpc/boot: add DTB to 'targets'
From: Masahiro Yamada @ 2020-07-13  7:56 UTC (permalink / raw)
  To: linux-kbuild
  Cc: Christophe Leroy, Arnd Bergmann, Masahiro Yamada, Michal Simek,
	linux-kernel, Paul Mackerras, linuxppc-dev

PowerPC always re-builds DTB even if nothing has been changed.

As for other architectures, arch/*/boot/dts/Makefile builds DTB by
using the dtb-y syntax.

In contrast, arch/powerpc/boot/dts/(fsl/)Makefile does nothing unless
CONFIG_OF_ALL_DTBS is defined. Instead, arch/powerpc/boot/Makefile
builds DTB on demand. You need to add DTB to 'targets' explicitly
so .*.cmd files are included.

Signed-off-by: Masahiro Yamada <masahiroy@kernel.org>
---

I want to apply this to kbuild tree because this is needed
to fix the build error caused by another kbuild patch:

https://lkml.org/lkml/2020/7/7/134

 arch/powerpc/boot/Makefile | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 63d7456b9518..8792323707fd 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -366,6 +366,8 @@ initrd-y := $(patsubst zImage%, zImage.initrd%, \
 		$(patsubst treeImage%, treeImage.initrd%, $(image-y)))))
 initrd-y := $(filter-out $(image-y), $(initrd-y))
 targets	+= $(image-y) $(initrd-y)
+targets += $(foreach x, dtbImage uImage cuImage simpleImage treeImage, \
+		$(patsubst $(x).%, dts/%.dtb, $(filter $(x).%, $(image-y))))

 $(addprefix $(obj)/, $(initrd-y)): $(obj)/ramdisk.image.gz

-- 
2.25.1

^ permalink raw reply related

* Re: [PATCH v3 4/6] powerpc/pseries/iommu: Remove default DMA window before creating DDW
From: Alexey Kardashevskiy @ 2020-07-13  7:33 UTC (permalink / raw)
  To: Leonardo Bras, Michael Ellerman, Benjamin Herrenschmidt,
	Paul Mackerras, Thiago Jung Bauermann, Ram Pai
  Cc: linuxppc-dev, linux-kernel
In-Reply-To: <20200703061844.111865-5-leobras.c@gmail.com>



On 03/07/2020 16:18, Leonardo Bras wrote:
> On LoPAR "DMA Window Manipulation Calls", it's recommended to remove the
> default DMA window for the device, before attempting to configure a DDW,
> in order to make the maximum resources available for the next DDW to be
> created.
> 
> This is a requirement for using DDW on devices in which hypervisor
> allows only one DMA window.
> 
> If setting up a new DDW fails anywhere after the removal of this
> default DMA window, it's needed to restore the default DMA window.
> For this, an implementation of ibm,reset-pe-dma-windows rtas call is
> needed:
> 
> Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> ibm,ddw-extensions. The first extension available (index 2) carries the
> token for ibm,reset-pe-dma-windows rtas call, which is used to restore
> the default DMA window for a device, if it has been deleted.
> 
> It does so by resetting the TCE table allocation for the PE to it's
> boot time value, available in "ibm,dma-window" device tree node.
> 
> Signed-off-by: Leonardo Bras <leobras.c@gmail.com>
> ---
>  arch/powerpc/platforms/pseries/iommu.c | 83 +++++++++++++++++++++-----
>  1 file changed, 69 insertions(+), 14 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
> index 4e33147825cc..5b520ac354c6 100644
> --- a/arch/powerpc/platforms/pseries/iommu.c
> +++ b/arch/powerpc/platforms/pseries/iommu.c
> @@ -1066,6 +1066,38 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>  	return max_addr;
>  }
>  
> +/*
> + * Platforms supporting the DDW option starting with LoPAR level 2.7 implement
> + * ibm,ddw-extensions, which carries the rtas token for
> + * ibm,reset-pe-dma-windows.
> + * That rtas-call can be used to restore the default DMA window for the device.
> + */
> +static void reset_dma_window(struct pci_dev *dev, struct device_node *par_dn)
> +{
> +	int ret;
> +	u32 cfg_addr, reset_dma_win;
> +	u64 buid;
> +	struct device_node *dn;
> +	struct pci_dn *pdn;
> +
> +	ret = ddw_read_ext(par_dn, DDW_EXT_RESET_DMA_WIN, &reset_dma_win);
> +	if (ret)
> +		return;
> +
> +	dn = pci_device_to_OF_node(dev);
> +	pdn = PCI_DN(dn);
> +	buid = pdn->phb->buid;
> +	cfg_addr = ((pdn->busno << 16) | (pdn->devfn << 8));
> +
> +	ret = rtas_call(reset_dma_win, 3, 1, NULL, cfg_addr, BUID_HI(buid),
> +			BUID_LO(buid));
> +	if (ret)
> +		dev_info(&dev->dev,
> +			 "ibm,reset-pe-dma-windows(%x) %x %x %x returned %d ",
> +			 reset_dma_win, cfg_addr, BUID_HI(buid), BUID_LO(buid),
> +			 ret);
> +}
> +
>  /*
>   * If the PE supports dynamic dma windows, and there is space for a table
>   * that can map all pages in a linear offset, then setup such a table,
> @@ -1079,7 +1111,7 @@ static phys_addr_t ddw_memory_hotplug_max(void)
>   */
>  static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  {
> -	int len, ret;
> +	int len, ret, reset_win_ext;

Make it "reset_token".

>  	struct ddw_query_response query;
>  	struct ddw_create_response create;
>  	int page_shift;
> @@ -1087,7 +1119,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	struct device_node *dn;
>  	u32 ddw_avail[DDW_APPLICABLE_SIZE];
>  	struct direct_window *window;
> -	struct property *win64;
> +	struct property *win64, *default_win = NULL;
>  	struct dynamic_dma_window_prop *ddwprop;
>  	struct failed_ddw_pdn *fpdn;
>  
> @@ -1122,7 +1154,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	if (ret)
>  		goto out_failed;
>  
> -       /*
> +	/*
>  	 * Query if there is a second window of size to map the
>  	 * whole partition.  Query returns number of windows, largest
>  	 * block assigned to PE (partition endpoint), and two bitmasks
> @@ -1133,14 +1165,34 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	if (ret != 0)
>  		goto out_failed;
>  
> +	/*
> +	 * If there is no window available, remove the default DMA window,
> +	 * if it's present. This will make all the resources available to the
> +	 * new DDW window.
> +	 * If anything fails after this, we need to restore it, so also check
> +	 * for extensions presence.
> +	 */
>  	if (query.windows_available == 0) {
> -		/*
> -		 * no additional windows are available for this device.
> -		 * We might be able to reallocate the existing window,
> -		 * trading in for a larger page size.
> -		 */
> -		dev_dbg(&dev->dev, "no free dynamic windows");
> -		goto out_failed;
> +		default_win = of_find_property(pdn, "ibm,dma-window", NULL);
> +		if (!default_win)
> +			goto out_failed;
> +
> +		reset_win_ext = ddw_read_ext(pdn, DDW_EXT_RESET_DMA_WIN, NULL);
> +		if (reset_win_ext)
> +			goto out_failed;
> +
> +		remove_dma_window(pdn, ddw_avail, default_win);
> +
> +		/* Query again, to check if the window is available */
> +		ret = query_ddw(dev, ddw_avail, &query, pdn);
> +		if (ret != 0)
> +			goto out_restore_defwin;
> +
> +		if (query.windows_available == 0) {
> +			/* no windows are available for this device. */
> +			dev_dbg(&dev->dev, "no free dynamic windows");
> +			goto out_restore_defwin;
> +		}
>  	}
>  	if (query.page_size & 4) {
>  		page_shift = 24; /* 16MB */
> @@ -1151,7 +1203,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	} else {
>  		dev_dbg(&dev->dev, "no supported direct page size in mask %x",
>  			  query.page_size);
> -		goto out_failed;
> +		goto out_restore_defwin;
>  	}
>  	/* verify the window * number of ptes will map the partition */
>  	/* check largest block * page size > max memory hotplug addr */
> @@ -1160,14 +1212,14 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %llu "
>  			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
>  			  1ULL << page_shift);
> -		goto out_failed;
> +		goto out_restore_defwin;
>  	}
>  	len = order_base_2(max_addr);
>  	win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
>  	if (!win64) {
>  		dev_info(&dev->dev,
>  			"couldn't allocate property for 64bit dma window\n");
> -		goto out_failed;
> +		goto out_restore_defwin;
>  	}
>  	win64->name = kstrdup(DIRECT64_PROPNAME, GFP_KERNEL);
>  	win64->value = ddwprop = kmalloc(sizeof(*ddwprop), GFP_KERNEL);
> @@ -1230,8 +1282,11 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
>  	kfree(win64->value);
>  	kfree(win64);
>  
> -out_failed:
> +out_restore_defwin:
> +	if (default_win && reset_win_ext == 0)


reset_win_ext potentially may be uninitialized here. Yeah I know it is
tied to default_win but still.

After looking at this function for a few minutes, it could use some
refactoring (way too many gotos)  such as:

1. move (query.page_size & xx) checks before "if
(query.windows_available == 0)"

2. move "win64 = kzalloc(sizeof(struct property), GFP_KERNEL)" before
"if (query.windows_available == 0)"

3. call "reset_dma_window(dev, pdn)" inside the "if
(query.windows_available == 0)" branch.

Then you can drop all "goto out_restore_defwin" and move default_win and
reset_win_ext inside "if (query.windows_available == 0)".

The rest of the series is good as it is, however it may conflict with
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20200713062348.100552-1-aik@ozlabs.ru/
and the patchset it is made on top of -
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=188385 .
thanks,


> +		reset_dma_window(dev, pdn);
>  
> +out_failed:
>  	fpdn = kzalloc(sizeof(*fpdn), GFP_KERNEL);
>  	if (!fpdn)
>  		goto out_unlock;
> 

-- 
Alexey

^ permalink raw reply

* [PATCH kernel] powerpc/dma: Fallback to dma_ops when persistent memory present
From: Alexey Kardashevskiy @ 2020-07-13  6:23 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Leonardo Bras, Brian J King, Alexey Kardashevskiy,
	Aneesh Kumar K . V, Christoph Hellwig, Wen Xiong

So far we have been using huge DMA windows to map all the RAM available.
The RAM is normally mapped to the VM address space contiguously, and
there is always a reasonable upper limit for possible future hot plugged
RAM which makes it easy to map all RAM via IOMMU.

Now there is persistent memory ("ibm,pmemory" in the FDT) which (unlike
normal RAM) can map anywhere in the VM space beyond the maximum RAM size
and since it can be used for DMA, it requires extending the huge window
up to MAX_PHYSMEM_BITS which requires hypervisor support for:
1. huge TCE tables;
2. multilevel TCE tables;
3. huge IOMMU pages.

Certain hypervisors cannot do either so the only option left is
restricting the huge DMA window to include only RAM and fallback to
the default DMA window for persistent memory.

This checks if the system has persistent memory. If it does not,
the DMA bypass mode is selected, i.e.
* dev->bus_dma_limit = 0
* dev->dma_ops_bypass = true <- this avoid calling dma_ops for mapping.

If there is such memory, this creates identity mapping only for RAM and
disables the DMA bypass mode which makes generic DMA code use indirect
dma_ops which may have performance impact:
* dev->bus_dma_limit = bus_offset + max_ram_size
  for example 0x0800.0000.8000.0000 for a 2GB VM
* dev->dma_ops_bypass = false <- this forces indirect calls to dma_ops for
  every mapping which then directs these to small or huge window.

This should not change the existing behaviour when no persistent memory.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---

This is based on
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=188385

Leonardo, this makes
https://patchwork.ozlabs.org/project/linuxppc-dev/list/?series=187348
not apply but the conflict should be quite trivial to resolve
(do not rush, let me finish reviewing this first). Cheers,



---
 arch/powerpc/kernel/dma-iommu.c        | 68 +++++++++++++++++++++++++-
 arch/powerpc/platforms/pseries/iommu.c | 41 +++++++++++++---
 2 files changed, 99 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/kernel/dma-iommu.c b/arch/powerpc/kernel/dma-iommu.c
index 569fecd7b5b2..9fe5f0aefa9d 100644
--- a/arch/powerpc/kernel/dma-iommu.c
+++ b/arch/powerpc/kernel/dma-iommu.c
@@ -10,6 +10,16 @@
 #include <linux/pci.h>
 #include <asm/iommu.h>
 
+static inline bool can_map_direct(struct device *dev, phys_addr_t addr)
+{
+	return dev->bus_dma_limit >= phys_to_dma(dev, addr);
+}
+
+static inline bool dma_handle_direct(struct device *dev, dma_addr_t dma_handle)
+{
+	return dma_handle >= dev->archdata.dma_offset;
+}
+
 /*
  * Generic iommu implementation
  */
@@ -44,6 +54,12 @@ static dma_addr_t dma_iommu_map_page(struct device *dev, struct page *page,
 				     enum dma_data_direction direction,
 				     unsigned long attrs)
 {
+	if (dev->bus_dma_limit &&
+	    can_map_direct(dev, (phys_addr_t) page_to_phys(page) +
+			   offset + size))
+		return dma_direct_map_page(dev, page, offset, size, direction,
+					   attrs);
+
 	return iommu_map_page(dev, get_iommu_table_base(dev), page, offset,
 			      size, dma_get_mask(dev), direction, attrs);
 }
@@ -53,6 +69,12 @@ static void dma_iommu_unmap_page(struct device *dev, dma_addr_t dma_handle,
 				 size_t size, enum dma_data_direction direction,
 				 unsigned long attrs)
 {
+	if (dev->bus_dma_limit &&
+	    dma_handle_direct(dev, dma_handle + size)) {
+		dma_direct_unmap_page(dev, dma_handle, size, direction, attrs);
+		return;
+	}
+
 	iommu_unmap_page(get_iommu_table_base(dev), dma_handle, size, direction,
 			 attrs);
 }
@@ -62,6 +84,22 @@ static int dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
 			    int nelems, enum dma_data_direction direction,
 			    unsigned long attrs)
 {
+	if (dev->bus_dma_limit) {
+		struct scatterlist *s;
+		bool direct = true;
+		int i;
+
+		for_each_sg(sglist, s, nelems, i) {
+			direct = can_map_direct(dev,
+					sg_phys(s) + s->offset + s->length);
+			if (!direct)
+				break;
+		}
+		if (direct)
+			return dma_direct_map_sg(dev, sglist, nelems, direction,
+						 attrs);
+	}
+
 	return ppc_iommu_map_sg(dev, get_iommu_table_base(dev), sglist, nelems,
 				dma_get_mask(dev), direction, attrs);
 }
@@ -70,6 +108,24 @@ static void dma_iommu_unmap_sg(struct device *dev, struct scatterlist *sglist,
 		int nelems, enum dma_data_direction direction,
 		unsigned long attrs)
 {
+	if (dev->bus_dma_limit) {
+		struct scatterlist *s;
+		bool direct = true;
+		int i;
+
+		for_each_sg(sglist, s, nelems, i) {
+			direct = dma_handle_direct(dev,
+						   s->dma_address + s->length);
+			if (!direct)
+				break;
+		}
+		if (direct) {
+			dma_direct_unmap_sg(dev, sglist, nelems, direction,
+					    attrs);
+			return;
+		}
+	}
+
 	ppc_iommu_unmap_sg(get_iommu_table_base(dev), sglist, nelems,
 			   direction, attrs);
 }
@@ -90,8 +146,16 @@ int dma_iommu_dma_supported(struct device *dev, u64 mask)
 	struct iommu_table *tbl = get_iommu_table_base(dev);
 
 	if (dev_is_pci(dev) && dma_iommu_bypass_supported(dev, mask)) {
-		dev->dma_ops_bypass = true;
-		dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
+		/*
+		 * dma_iommu_bypass_supported() sets dma_max when there is
+		 * 1:1 mapping but it is somehow limited.
+		 * ibm,pmemory is one example.
+		 */
+		dev->dma_ops_bypass = dev->bus_dma_limit == 0;
+		if (!dev->dma_ops_bypass)
+			dev_warn(dev, "iommu: 64-bit OK but using default ops\n");
+		else
+			dev_dbg(dev, "iommu: 64-bit OK, using fixed ops\n");
 		return 1;
 	}
 
diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 6d47b4a3ce39..1996f83021fe 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -816,7 +816,7 @@ static void remove_ddw(struct device_node *np, bool remove_prop)
 			np, ret);
 }
 
-static u64 find_existing_ddw(struct device_node *pdn)
+static u64 find_existing_ddw(struct device_node *pdn, int *window_shift)
 {
 	struct direct_window *window;
 	const struct dynamic_dma_window_prop *direct64;
@@ -828,6 +828,7 @@ static u64 find_existing_ddw(struct device_node *pdn)
 		if (window->device == pdn) {
 			direct64 = window->prop;
 			dma_addr = be64_to_cpu(direct64->dma_base);
+			*window_shift = be32_to_cpu(direct64->window_shift);
 			break;
 		}
 	}
@@ -990,11 +991,13 @@ static phys_addr_t ddw_memory_hotplug_max(void)
  */
 static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 {
-	int len, ret;
+	int len = 0, ret;
+	bool pmem_present = of_find_node_by_type(NULL, "ibm,pmemory") != NULL;
+	int max_ram_len = order_base_2(ddw_memory_hotplug_max());
 	struct ddw_query_response query;
 	struct ddw_create_response create;
 	int page_shift;
-	u64 dma_addr, max_addr;
+	u64 dma_addr;
 	struct device_node *dn;
 	u32 ddw_avail[3];
 	struct direct_window *window;
@@ -1004,7 +1007,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 	mutex_lock(&direct_window_init_mutex);
 
-	dma_addr = find_existing_ddw(pdn);
+	dma_addr = find_existing_ddw(pdn, &len);
 	if (dma_addr != 0)
 		goto out_unlock;
 
@@ -1066,14 +1069,27 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	}
 	/* verify the window * number of ptes will map the partition */
 	/* check largest block * page size > max memory hotplug addr */
-	max_addr = ddw_memory_hotplug_max();
-	if (query.largest_available_block < (max_addr >> page_shift)) {
+	/*
+	 * The "ibm,pmemory" can appear anywhere in the address space.
+	 * Assuming it is still backed by page structs, try MAX_PHYSMEM_BITS
+	 * for the upper limit and fallback to max RAM otherwise but this
+	 * disables device::dma_ops_bypass.
+	 */
+	len = max_ram_len;
+	if (pmem_present) {
+		if (query.largest_available_block >=
+		    (1ULL << (MAX_PHYSMEM_BITS - page_shift)))
+			len = MAX_PHYSMEM_BITS - page_shift;
+		else
+			dev_info(&dev->dev, "Skipping ibm,pmemory");
+	}
+
+	if (query.largest_available_block < (1ULL << (len - page_shift))) {
 		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
-			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
+			  "%llu-sized pages\n", 1ULL << len, query.largest_available_block,
 			  1ULL << page_shift);
 		goto out_failed;
 	}
-	len = order_base_2(max_addr);
 	win64 = kzalloc(sizeof(struct property), GFP_KERNEL);
 	if (!win64) {
 		dev_info(&dev->dev,
@@ -1151,6 +1167,15 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 
 out_unlock:
 	mutex_unlock(&direct_window_init_mutex);
+
+	/*
+	 * If we have persistent memory and the window size is only as big
+	 * as RAM, then we failed to create a window to cover persistent
+	 * memory and need to set the DMA limit.
+	 */
+	if (pmem_present && dma_addr && (len == max_ram_len))
+		dev->dev.bus_dma_limit = dma_addr + (1ULL << len);
+
 	return dma_addr;
 }
 
-- 
2.17.1


^ permalink raw reply related

* Re: [PATCH v2 1/3] powerpc/powernv/idle: Exclude mfspr on HID1, 4, 5 on P9 and above
From: Nicholas Piggin @ 2020-07-13  5:53 UTC (permalink / raw)
  To: benh, ego, linux-kernel, linuxppc-dev, mikey, mpe, paulus,
	pratik.r.sampat, Pratik Rajesh Sampat, ravi.bangoria, svaidy
In-Reply-To: <20200710052207.12003-2-psampat@linux.ibm.com>

Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
> POWER9 onwards the support for the registers HID1, HID4, HID5 has been
> receded.
> Although mfspr on the above registers worked in Power9, In Power10
> simulator is unrecognized. Moving their assignment under the
> check for machines lower than Power9

Seems like a good fix.

Thanks,
Nick

> 
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
> Reviewed-by: Gautham R. Shenoy <ego@linux.vnet.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/idle.c | 6 +++---
>  1 file changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 2dd467383a88..19d94d021357 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -73,9 +73,6 @@ static int pnv_save_sprs_for_deep_states(void)
>  	 */
>  	uint64_t lpcr_val	= mfspr(SPRN_LPCR);
>  	uint64_t hid0_val	= mfspr(SPRN_HID0);
> -	uint64_t hid1_val	= mfspr(SPRN_HID1);
> -	uint64_t hid4_val	= mfspr(SPRN_HID4);
> -	uint64_t hid5_val	= mfspr(SPRN_HID5);
>  	uint64_t hmeer_val	= mfspr(SPRN_HMEER);
>  	uint64_t msr_val = MSR_IDLE;
>  	uint64_t psscr_val = pnv_deepest_stop_psscr_val;
> @@ -117,6 +114,9 @@ static int pnv_save_sprs_for_deep_states(void)
>  
>  			/* Only p8 needs to set extra HID regiters */
>  			if (!cpu_has_feature(CPU_FTR_ARCH_300)) {
> +				uint64_t hid1_val = mfspr(SPRN_HID1);
> +				uint64_t hid4_val = mfspr(SPRN_HID4);
> +				uint64_t hid5_val = mfspr(SPRN_HID5);
>  
>  				rc = opal_slw_set_reg(pir, SPRN_HID1, hid1_val);
>  				if (rc != 0)
> -- 
> 2.25.4
> 
> 

^ permalink raw reply

* Re: [PATCH v2 2/3] powerpc/powernv/idle: save-restore DAWR0, DAWRX0 for P10
From: Nicholas Piggin @ 2020-07-13  5:52 UTC (permalink / raw)
  To: benh, ego, linux-kernel, linuxppc-dev, mikey, mpe, paulus,
	pratik.r.sampat, Pratik Rajesh Sampat, ravi.bangoria, svaidy
In-Reply-To: <20200710052207.12003-3-psampat@linux.ibm.com>

Excerpts from Pratik Rajesh Sampat's message of July 10, 2020 3:22 pm:
> Additional registers DAWR0, DAWRX0 may be lost on Power 10 for
> stop levels < 4.
> Therefore save the values of these SPRs before entering a  "stop"
> state and restore their values on wakeup.

Hmm, where do you get this from? Documentation I see says DAWR is lost
on POWER9 but not P10.

Does idle thread even need to save DAWR, or does it get switched when
going to a thread that has a watchpoint set?

Thanks,
Nick

> 
> Signed-off-by: Pratik Rajesh Sampat <psampat@linux.ibm.com>
> ---
>  arch/powerpc/platforms/powernv/idle.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 19d94d021357..f2e2a6a4c274 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -600,6 +600,8 @@ struct p9_sprs {
>  	u64 iamr;
>  	u64 amor;
>  	u64 uamor;
> +	u64 dawr0;
> +	u64 dawrx0;
>  };
>  
>  static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> @@ -687,6 +689,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>  	sprs.iamr	= mfspr(SPRN_IAMR);
>  	sprs.amor	= mfspr(SPRN_AMOR);
>  	sprs.uamor	= mfspr(SPRN_UAMOR);
> +	if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> +		sprs.dawr0 = mfspr(SPRN_DAWR0);
> +		sprs.dawrx0 = mfspr(SPRN_DAWRX0);
> +	}
>  
>  	srr1 = isa300_idle_stop_mayloss(psscr);		/* go idle */
>  
> @@ -710,6 +716,10 @@ static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
>  		mtspr(SPRN_IAMR,	sprs.iamr);
>  		mtspr(SPRN_AMOR,	sprs.amor);
>  		mtspr(SPRN_UAMOR,	sprs.uamor);
> +		if (cpu_has_feature(CPU_FTR_ARCH_31)) {
> +			mtspr(SPRN_DAWR0, sprs.dawr0);
> +			mtspr(SPRN_DAWRX0, sprs.dawrx0);
> +		}
>  
>  		/*
>  		 * Workaround for POWER9 DD2.0, if we lost resources, the ERAT
> -- 
> 2.25.4
> 
> 

^ permalink raw reply

* [PATCH v6] powerpc/fadump: fix race between pstore write and fadump crash trigger
From: Sourabh Jain @ 2020-07-13  5:24 UTC (permalink / raw)
  To: mpe; +Cc: mahesh, linux-kernel, hbathini, linuxppc-dev

When we enter into fadump crash path via system reset we fail to update
the pstore.

On the system reset path we first update the pstore then we go for fadump
crash. But the problem here is when all the CPUs try to get the pstore
lock to initiate the pstore write, only one CPUs will acquire the lock
and proceed with the pstore write. Since it in NMI context CPUs that fail
to get lock do not wait for their turn to write to the pstore and simply
proceed with the next operation which is fadump crash. One of the CPU who
proceeded with fadump crash path triggers the crash and does not wait for
the CPU who gets the pstore lock to complete the pstore update.

Timeline diagram to depicts the sequence of events that leads to an
unsuccessful pstore update when we hit fadump crash path via system reset.

                 1    2     3    ...      n   CPU Threads
                 |    |     |             |
                 |    |     |             |
 Reached to   -->|--->|---->| ----------->|
 system reset    |    |     |             |
 path            |    |     |             |
                 |    |     |             |
 Try to       -->|--->|---->|------------>|
 acquire the     |    |     |             |
 pstore lock     |    |     |             |
                 |    |     |             |
                 |    |     |             |
 Got the      -->| +->|     |             |<-+
 pstore lock     | |  |     |             |  |-->  Didn't get the
                 | --------------------------+     lock and moving
                 |    |     |             |        ahead on fadump
                 |    |     |             |        crash path
                 |    |     |             |
  Begins the  -->|    |     |             |
  process to     |    |     |             |<-- Got the chance to
  update the     |    |     |             |    trigger the crash
  pstore         | -> |     |    ... <-   |
                 | |  |     |         |   |
                 | |  |     |         |   |<-- Triggers the
                 | |  |     |         |   |    crash
                 | |  |     |         |   |      ^
                 | |  |     |         |   |      |
  Writing to  -->| |  |     |         |   |      |
  pstore         | |  |     |         |   |      |
                   |                  |          |
       ^           |__________________|          |
       |               CPU Relax                 |
       |                                         |
       +-----------------------------------------+
                          |
                          v
            Race: crash triggered before pstore
                  update completes

To avoid this race condition a barrier is added on crash_fadump path, it
prevents the CPU to trigger the crash until all the online CPUs completes
their task.

A barrier is added to make sure all the secondary CPUs hit the
crash_fadump function before we initiates the crash. A timeout is kept to
ensure the primary CPU (one who initiates the crash) do not wait for
secondary CPUs indefinitely.

Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com>
---
 arch/powerpc/kernel/fadump.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

---
Chanagelog:

v1 -> v3:
   - https://lists.ozlabs.org/pipermail/linuxppc-dev/2020-April/208267.html

v3 -> v4:

   - Now the primary CPU (one who triggers dump) waits for all secondary
     CPUs to enter and then initiates the crash.

v4 -> v5:
    - Fixed a build failure reported by kernel test robot <lkp at intel.com>
      Now the cpus_in_crash variable is defined outside CONFIG_CMA
      config option.

v5 -> v6
    - Changed a variable name cpus_in_crash -> cpus_in_fadump.
---

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index 78ab9a6ee6ac..1858896d6809 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -32,11 +32,20 @@
 #include <asm/fadump-internal.h>
 #include <asm/setup.h>
 
+/*
+ * The CPU who acquired the lock to trigger the fadump crash should
+ * wait for other CPUs to enter.
+ *
+ * The timeout is in milliseconds.
+ */
+#define CRASH_TIMEOUT		500
+
 static struct fw_dump fw_dump;
 
 static void __init fadump_reserve_crash_area(u64 base);
 
 struct kobject *fadump_kobj;
+static atomic_t cpus_in_fadump;
 
 #ifndef CONFIG_PRESERVE_FA_DUMP
 static DEFINE_MUTEX(fadump_mutex);
@@ -668,8 +677,11 @@ early_param("fadump_reserve_mem", early_fadump_reserve_mem);
 
 void crash_fadump(struct pt_regs *regs, const char *str)
 {
+	unsigned int msecs;
 	struct fadump_crash_info_header *fdh = NULL;
 	int old_cpu, this_cpu;
+	/* Do not include first CPU */
+	unsigned int ncpus = num_online_cpus() - 1;
 
 	if (!should_fadump_crash())
 		return;
@@ -685,6 +697,8 @@ void crash_fadump(struct pt_regs *regs, const char *str)
 	old_cpu = cmpxchg(&crashing_cpu, -1, this_cpu);
 
 	if (old_cpu != -1) {
+		atomic_inc(&cpus_in_fadump);
+
 		/*
 		 * We can't loop here indefinitely. Wait as long as fadump
 		 * is in force. If we race with fadump un-registration this
@@ -708,6 +722,16 @@ void crash_fadump(struct pt_regs *regs, const char *str)
 
 	fdh->online_mask = *cpu_online_mask;
 
+	/*
+	 * If we came in via system reset, wait a while for the secondary
+	 * CPUs to enter.
+	 */
+	if (TRAP(&(fdh->regs)) == 0x100) {
+		msecs = CRASH_TIMEOUT;
+		while ((atomic_read(&cpus_in_fadump) < ncpus) && (--msecs > 0))
+			mdelay(1);
+	}
+
 	fw_dump.ops->fadump_trigger(fdh, str);
 }
 
-- 
2.25.4


^ permalink raw reply related

* Re: [v3 1/5] KVM: PPC: Book3S HV: Disable page merging in H_SVM_INIT_START
From: Bharata B Rao @ 2020-07-13  5:29 UTC (permalink / raw)
  To: Ram Pai
  Cc: ldufour, cclaudio, kvm-ppc, sathnaga, aneesh.kumar, sukadev,
	linuxppc-dev, bauerman, david
In-Reply-To: <1594458827-31866-2-git-send-email-linuxram@us.ibm.com>

On Sat, Jul 11, 2020 at 02:13:43AM -0700, Ram Pai wrote:
> Merging of pages associated with each memslot of a SVM is
> disabled the page is migrated in H_SVM_PAGE_IN handler.
> 
> This operation should have been done much earlier; the moment the VM
> is initiated for secure-transition. Delaying this operation, increases
> the probability for those pages to acquire new references , making it
> impossible to migrate those pages in H_SVM_PAGE_IN handler.
> 
> Disable page-migration in H_SVM_INIT_START handling.

While it is a good idea to disable KSM merging for all VMAs during
H_SVM_INIT_START, I am curious if you did observe an actual case of
ksm_madvise() failing which resulted in subsequent H_SVM_PAGE_IN
failing to migrate?

> 
> Signed-off-by: Ram Pai <linuxram@us.ibm.com>
> ---
>  arch/powerpc/kvm/book3s_hv_uvmem.c | 96 +++++++++++++++++++++++++++++---------
>  1 file changed, 74 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
> index 3d987b1..bfc3841 100644
> --- a/arch/powerpc/kvm/book3s_hv_uvmem.c
> +++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
> @@ -211,6 +211,65 @@ static bool kvmppc_gfn_is_uvmem_pfn(unsigned long gfn, struct kvm *kvm,
>  	return false;
>  }
>  
> +static int kvmppc_memslot_page_merge(struct kvm *kvm,
> +		struct kvm_memory_slot *memslot, bool merge)
> +{
> +	unsigned long gfn = memslot->base_gfn;
> +	unsigned long end, start = gfn_to_hva(kvm, gfn);
> +	int ret = 0;
> +	struct vm_area_struct *vma;
> +	int merge_flag = (merge) ? MADV_MERGEABLE : MADV_UNMERGEABLE;
> +
> +	if (kvm_is_error_hva(start))
> +		return H_STATE;

This and other cases below seem to be a new return value from
H_SVM_INIT_START. May be update the documentation too along with
this patch?

> +
> +	end = start + (memslot->npages << PAGE_SHIFT);
> +
> +	down_write(&kvm->mm->mmap_sem);

When you rebase the patches against latest upstream you may want to
replace the above and other instances by mmap_write/read_lock().

> +	do {
> +		vma = find_vma_intersection(kvm->mm, start, end);
> +		if (!vma) {
> +			ret = H_STATE;
> +			break;
> +		}
> +		ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> +			  merge_flag, &vma->vm_flags);
> +		if (ret) {
> +			ret = H_STATE;
> +			break;
> +		}
> +		start = vma->vm_end + 1;
> +	} while (end > vma->vm_end);
> +
> +	up_write(&kvm->mm->mmap_sem);
> +	return ret;
> +}
> +
> +static int __kvmppc_page_merge(struct kvm *kvm, bool merge)
> +{
> +	struct kvm_memslots *slots;
> +	struct kvm_memory_slot *memslot;
> +	int ret = 0;
> +
> +	slots = kvm_memslots(kvm);
> +	kvm_for_each_memslot(memslot, slots) {
> +		ret = kvmppc_memslot_page_merge(kvm, memslot, merge);
> +		if (ret)
> +			break;
> +	}
> +	return ret;
> +}
> +
> +static inline int kvmppc_disable_page_merge(struct kvm *kvm)
> +{
> +	return __kvmppc_page_merge(kvm, false);
> +}
> +
> +static inline int kvmppc_enable_page_merge(struct kvm *kvm)
> +{
> +	return __kvmppc_page_merge(kvm, true);
> +}
> +
>  unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  {
>  	struct kvm_memslots *slots;
> @@ -232,11 +291,18 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  		return H_AUTHORITY;
>  
>  	srcu_idx = srcu_read_lock(&kvm->srcu);
> +
> +	/* disable page-merging for all memslot */
> +	ret = kvmppc_disable_page_merge(kvm);
> +	if (ret)
> +		goto out;
> +
> +	/* register the memslot */
>  	slots = kvm_memslots(kvm);
>  	kvm_for_each_memslot(memslot, slots) {
>  		if (kvmppc_uvmem_slot_init(kvm, memslot)) {
>  			ret = H_PARAMETER;
> -			goto out;
> +			break;
>  		}
>  		ret = uv_register_mem_slot(kvm->arch.lpid,
>  					   memslot->base_gfn << PAGE_SHIFT,
> @@ -245,9 +311,12 @@ unsigned long kvmppc_h_svm_init_start(struct kvm *kvm)
>  		if (ret < 0) {
>  			kvmppc_uvmem_slot_free(kvm, memslot);
>  			ret = H_PARAMETER;
> -			goto out;
> +			break;
>  		}
>  	}
> +
> +	if (ret)
> +		kvmppc_enable_page_merge(kvm);

Is there any use of enabling KSM merging in the failure path here?
Won't UV terminate the VM if H_SVM_INIT_START fails? If there is no need,
you can do away with some extra routines above.

>  out:
>  	srcu_read_unlock(&kvm->srcu, srcu_idx);
>  	return ret;
> @@ -384,7 +453,7 @@ static struct page *kvmppc_uvmem_get_page(unsigned long gpa, struct kvm *kvm)
>   */
>  static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
>  		   unsigned long end, unsigned long gpa, struct kvm *kvm,
> -		   unsigned long page_shift, bool *downgrade)
> +		   unsigned long page_shift)
>  {
>  	unsigned long src_pfn, dst_pfn = 0;
>  	struct migrate_vma mig;
> @@ -400,18 +469,6 @@ static int kvmppc_svm_page_in(struct vm_area_struct *vma, unsigned long start,
>  	mig.src = &src_pfn;
>  	mig.dst = &dst_pfn;
>  
> -	/*
> -	 * We come here with mmap_sem write lock held just for
> -	 * ksm_madvise(), otherwise we only need read mmap_sem.
> -	 * Hence downgrade to read lock once ksm_madvise() is done.
> -	 */
> -	ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
> -			  MADV_UNMERGEABLE, &vma->vm_flags);

I haven't seen the subsequent patches yet, but guess you are
taking care of disabling KSM mering for hot-plugged memory too.

Regards,
Bharata.

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox