LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Michael Ellerman @ 2013-10-01  7:35 UTC (permalink / raw)
  To: Tejun Heo
  Cc: Joerg Roedel, x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-ide@vger.kernel.org, Alexander Gordeev, Jan Beulich,
	linux-pci@vger.kernel.org, Bjorn Helgaas, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20130918142231.GA21650@mtj.dyndns.org>

On Wed, Sep 18, 2013 at 09:22:31AM -0500, Tejun Heo wrote:
> Hello,
> 
> On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> > On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > > How about no?
> > > 
> > > We have a small number of MSIs available, limited by hardware &
> > > firmware, if we don't impose a quota then the first device that probes
> > > will get most/all of the MSIs and other devices miss out.
> > 
> > Out of curiosity - how pSeries has had done it without quotas before
> > 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> Hmmm... do we need to treat this any differently?  If the platform
> can't allocate full range of requested MSIs, just failing should be
> enough regardless of why such allocation can't be met, no?
> 
> > > Anyway I don't see what problem you're trying to solve? I agree the
> > > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > > world.
> > 
> > Well, the interface recently has been re-classified from "ugly" to
> > "unnecessarily complex and actively harmful" in Tejun's words ;)
> 
> LOL. :)
> 
> > Indeed, I checked most of the drivers and it is incredible how people
> > are creative in misusing the interface: from innocent pci_disable_msix()
> > calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> > if pci_enable_msix() returned a positive value (apparently untested).
> > 
> > Roughly third of the drivers just do not care and bail out once
> > pci_enable_msix() has not succeeded. Not sure how many of these are
> > mandated by the hardware.
> 
> Yeah, I mean, this type of interface is a trap.  People have to
> actively resist to avoid doing silly stuff which is a lot to ask.

I really think you're overstating the complexity here.

Functions typically return a boolean   -> nothing to see here
This function returns a tristate value -> brain explosion!


> > 	/*
> > 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> > 	 */
> > 
> > 	rc = pci_get_msix_limit(pdev);
> > 	if (rc < 0)
> > 		return rc;
> > 
> > 	/*
> > 	 * nvec = min(rc, nvec);
> > 	 */
> > 
> > 	for (i = 0; i < nvec; i++)
> > 		msix_entry[i].entry = i;
> > 
> > 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> > 	if (rc)
> > 		return rc;
> 
> I really think what we should do is
> 
> * Determine the number of MSIs the controller wants.  Don't worry
>   about quotas or limits or anything.  Just determine the number
>   necessary to enable enhanced interrupt handling.
> 	
> * Try allocating that number of MSIs.  If it fails, then just revert
>   to single interrupt mode.  It's not the end of the world and mostly
>   guaranteed to work.  Let's please not even try to do partial
>   multiple interrupts.  I really don't think it's worth the risk or
>   complexity.

It will potentially break existing setups on our hardware.

Can I make that any clearer?

cheers

^ permalink raw reply

* Re: [PATCH V2] powerpc/kernel/sysfs: disable writing to purr in non-powernv
From: Benjamin Herrenschmidt @ 2013-10-01  7:50 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: Madhavan Srinivasan, linuxppc-dev
In-Reply-To: <20131001063129.GF17966@concordia>

On Tue, 2013-10-01 at 16:31 +1000, Michael Ellerman wrote:

> > 1)Changed the test for to hypervisor mode instead of platform
> 
> I think Ben's wrong about that.
> 
> Almost all existing code uses FW_FEATURE_LPAR to differentiate
> hypervisor vs guest mode, so I think we should do the same here.

I didn't object to using the FW test, it's a reasonable way to do it, I
objected to using the platform as an indication (powernv vs. pseries)

Ben.

> So it would be:
> 
> > +	if (cpu_has_feature(CPU_FTR_PURR)) {
> > +		if (!firmware_has_feature(FW_FEATURE_LPAR))
> > +			add_write_permission_dev_attr((void *)&dev_attr_purr);
> >  		device_create_file(s, &dev_attr_purr);
> > +	}
> 
> 
> > +static void add_write_permission_dev_attr(void *ptr)
> > +{
> > +	struct device_attribute *attr = (struct device_attribute *)ptr;
> > +
> > +	attr->attr.mode |= (unsigned short) 0200;
> > +}
> 
> Why does it take a void *, which then requires a cast at the call site?
> 
> And do you need the cast to short? If so shouldn't you use umode_t
> directly?
> 
> cheers

^ permalink raw reply

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Michael Ellerman @ 2013-10-01  7:51 UTC (permalink / raw)
  To: Alexander Gordeev
  Cc: linuxppc-dev, Joerg Roedel, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org,
	Jan Beulich, linux-pci@vger.kernel.org, Tejun Heo, Bjorn Helgaas,
	Ingo Molnar
In-Reply-To: <20130918094759.GA2353@dhcp-26-207.brq.redhat.com>

On Wed, Sep 18, 2013 at 11:48:00AM +0200, Alexander Gordeev wrote:
> On Wed, Sep 18, 2013 at 12:30:23AM +1000, Michael Ellerman wrote:
> > How about no?
> > 
> > We have a small number of MSIs available, limited by hardware &
> > firmware, if we don't impose a quota then the first device that probes
> > will get most/all of the MSIs and other devices miss out.
> 
> Out of curiosity - how pSeries has had done it without quotas before
> 448e2ca ("powerpc/pseries: Implement a quota system for MSIs")?
> 
> > Anyway I don't see what problem you're trying to solve? I agree the
> > -ve/0/+ve return value pattern is ugly, but it's hardly the end of the
> > world.
> 
> Well, the interface recently has been re-classified from "ugly" to
> "unnecessarily complex and actively harmful" in Tejun's words ;)
> 
> Indeed, I checked most of the drivers and it is incredible how people
> are creative in misusing the interface: from innocent pci_disable_msix()
> calls when if pci_enable_msix() failed to assuming MSI-Xs were enabled
> if pci_enable_msix() returned a positive value (apparently untested).

OK, but we have the source to the drivers, we could just fix them.

We could even add:

  pci_enable_msix_i_am_stupid()

Which swallows the positive return and just gives back -ve/0.

> Roughly third of the drivers just do not care and bail out once
> pci_enable_msix() has not succeeded. Not sure how many of these are
> mandated by the hardware.

Sure, that's fine if those drivers do that, it's up to the drivers after
all.

> Another quite common pattern is a call to pci_enable_msix() to figure out
> the number of MSI-Xs available and a repeated call of pci_enable_msix()
> to enable those MSI-Xs, this time.

Also fine, though as the documentation suggests a loop is the best
construct rather than two explicit calls.

> The recommended practice would be:
> 
> 	/*
> 	 * Retrieving 'nvec' by means other than pci_msix_table_size()
> 	 */
> 
> 	rc = pci_get_msix_limit(pdev);
> 	if (rc < 0)
> 		return rc;
> 
> 	/*
> 	 * nvec = min(rc, nvec);
> 	 */
> 
> 	for (i = 0; i < nvec; i++)
> 		msix_entry[i].entry = i;
> 
> 	rc = pci_enable_msix(pdev, msix_entry, nvec);
> 	if (rc)
> 		return rc;
> 
> Thoughts?

We could probably make that work.

The disadvantage is that any restriction imposed on us above the quota
can only be reported as an error from pci_enable_msix().

The quota code, called from pci_get_msix_limit(), can only do so much to
interogate firmware about the limitations. The ultimate way to check if
firmware will give us enough MSIs is to try and allocate them. But we
can't do that from pci_get_msix_limit() because the driver is not asking
us to enable MSIs, just query them.

You'll also need to add another arch hook, for the quota check, and
we'll have to add it to our per-platform indirection as well.

All a lot of bother for no real gain IMHO.

cheers

^ permalink raw reply

* Re: [PATCH] Revert "powerpc: 52xx: provide a default in mpc52xx_irqhost_map()"
From: Sebastian Andrzej Siewior @ 2013-10-01  7:54 UTC (permalink / raw)
  To: Wolfram Sang; +Cc: linuxppc-dev, Anatolij Gustschin, linux-rt-users
In-Reply-To: <1380612366-13504-1-git-send-email-wsa@the-dreams.de>

On 10/01/2013 09:26 AM, Wolfram Sang wrote:
> This reverts commit 6391f697d4892a6f233501beea553e13f7745a23. The
> compiler warning it wants to fix does not appear with my gcc 4.6.2. IMO
> we don't need superfluous (and here even misleading) code to make old
> compilers happy. Fixing the printout was bogus, too. We want to know
> WHICH critical irq failed, not which level it had.

According to minimal Doc*/Changes minimal gcc is 3.2. Mine was 4.3.5.
Why miss leading code? Default here does the same as unhandled and crit
where it does nothing. Any why do you want to see l2irq since it was
not in the case statement? l2 holds the number, l1 the level.

> Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
> Cc: Anatolij Gustschin <agust@denx.de>
> ---
> 
> Have I been on CC when the original patch was sent?

You were but your email bounced. I wasn't aware of this new email
address you are using now.

> 
>  arch/powerpc/platforms/52xx/mpc52xx_pic.c |    3 +--
>  1 file changed, 1 insertion(+), 2 deletions(-)
> 
> diff --git a/arch/powerpc/platforms/52xx/mpc52xx_pic.c b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
> index b69221b..b89ef65 100644
> --- a/arch/powerpc/platforms/52xx/mpc52xx_pic.c
> +++ b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
> @@ -373,9 +373,8 @@ static int mpc52xx_irqhost_map(struct irq_domain *h, unsigned int virq,
>  	case MPC52xx_IRQ_L1_PERP: irqchip = &mpc52xx_periph_irqchip; break;
>  	case MPC52xx_IRQ_L1_SDMA: irqchip = &mpc52xx_sdma_irqchip; break;
>  	case MPC52xx_IRQ_L1_CRIT:
> -	default:
>  		pr_warn("%s: Critical IRQ #%d is unsupported! Nopping it.\n",
> -			__func__, l1irq);
> +			__func__, l2irq);
>  		irq_set_chip(virq, &no_irq_chip);
>  		return 0;
>  	}
> 

Sebastian

^ permalink raw reply

* Re: [PATCH V2] powerpc/kernel/sysfs: disable writing to purr in non-powernv
From: Madhavan Srinivasan @ 2013-10-01  8:09 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: linuxppc-dev
In-Reply-To: <20131001063129.GF17966@concordia>

On Tuesday 01 October 2013 12:01 PM, Michael Ellerman wrote:
> Hi Maddy,
> 
> On Fri, Sep 27, 2013 at 05:03:54PM +0530, Madhavan Srinivasan wrote:
>> powerpc/kernel/sysfs.c exports purr with write permission.
> 
> PURR
> 
>> This is only valid for kernel in hypervisor mode.
>> But writing to the file in PowerVM lpar causes crash.
> 
> In the kernel history/source we refer to it as "phyp". However in this

Will follow the convention.

> case it's not an issue with phyp, it's simply that you are not in
> hypervisor mode, ie. the same crash would occur under KVM.
> 
> So you should just say "writing to the file in guest mode .."
> 

Sure. Will make the changes.

>> # echo 0 > purr
>> cpu 0x0: Vector: 700 (Program Check) at [c000000000d072b0]
>>     pc: c00000000001770c: .write_purr+0x1c/0x40
>>     lr: c000000000017708: .write_purr+0x18/0x40
>>     sp: c000000000d07530
>>    msr: 8000000000049032
>>   current = 0xc000000000c53de0
>>   paca    = 0xc00000000ec70000	 softe: 0	 irq_happened: 0x01
>>     pid   = 0, comm = swapper/0
>> enter ? for help
>> [c000000000d075b0] c0000000000fba64
>> .generic_smp_call_function_single_interrupt+0x104/0x190
>> [c000000000d07650] c000000000037748 .smp_ipi_demux+0xa8/0xf0
>> [c000000000d076e0] c000000000035314 .doorbell_exception+0x74/0xb0
>> [c000000000d07760] c000000000002950 doorbell_super_common+0x150/0x180
>> --- Exception: a01 (Doorbell) at c000000000060904
>> .plpar_hcall_norets+0x84/0xd4
>> [link register   ] c00000000006dbd4 .check_and_cede_processor+0x24/0x40
>> [c000000000d07a50] c000000001002558 (unreliable)
>> [c000000000d07ac0] c00000000006dd0c .shared_cede_loop+0x2c/0x70
>> [c000000000d07b40] c0000000006ae954 .cpuidle_enter_state+0x64/0x150
>> [c000000000d07c00] c0000000006aeb30 .cpuidle_idle_call+0xf0/0x300
>> [c000000000d07cb0] c000000000062fa0 .pseries_lpar_idle+0x10/0x50
>> [c000000000d07d20] c000000000016d14 .arch_cpu_idle+0x64/0x150
>> [c000000000d07da0] c0000000000e0060 .cpu_startup_entry+0x1a0/0x2c0
>> [c000000000d07e80] c00000000000bca4 .rest_init+0x94/0xb0
>> [c000000000d07ef0] c000000000b54530 .start_kernel+0x478/0x494
>> [c000000000d07f90] c000000000009be0 .start_here_common+0x20/0x40
>> 0:mon>
>>
>> Changes:
>>
>> 1)Changed the test for to hypervisor mode instead of platform
> 
> I think Ben's wrong about that.
> 
> Almost all existing code uses FW_FEATURE_LPAR to differentiate
> hypervisor vs guest mode, so I think we should do the same here.
> 
> So it would be:
> 
>> +	if (cpu_has_feature(CPU_FTR_PURR)) {
>> +		if (!firmware_has_feature(FW_FEATURE_LPAR))
>> +			add_write_permission_dev_attr((void *)&dev_attr_purr);
>>  		device_create_file(s, &dev_attr_purr);
>> +	}
>

Will modify the check.

> 
>> +static void add_write_permission_dev_attr(void *ptr)
>> +{
>> +	struct device_attribute *attr = (struct device_attribute *)ptr;
>> +
>> +	attr->attr.mode |= (unsigned short) 0200;
>> +}
> 
> Why does it take a void *, which then requires a cast at the call site?
>

just prefered to send the address as void.

> And do you need the cast to short? If so shouldn't you use umode_t
> directly?

No, not really.
Will make the changes and will resend the patch.

> 
> cheers
> 
Thanks for feedback.
Maddy

^ permalink raw reply

* Re: [PATCH 2/3] hwrng: Add a driver for the hwrng found in power7+ systems
From: Michael Ellerman @ 2013-10-01  8:25 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tytso, kvm, gleb, linuxppc-dev, linux-kernel, kvm-ppc, agraf,
	herbert, Paul Mackerras, mpm, pbonzini
In-Reply-To: <1380182487.28561.22.camel@pasglop>

On Thu, Sep 26, 2013 at 06:01:27PM +1000, Benjamin Herrenschmidt wrote:
> On Thu, 2013-09-26 at 16:31 +1000, Michael Ellerman wrote:
> 
> > +       pr_info("registered powernv hwrng.\n");
> 
> First letter of a line should get a capital :-) Also since
> it's per-device, at least indicate the OF path or the chip number or
> something ...

It's not the first letter, it looks like this:

  powernv-rng: registered powernv hwrng.

Which looks nice to me, but I can make it this if you prefer:

  powernv-rng: Registered powernv hwrng.

And it's not per-device due to the trick in the probe routine. It's a
bit sneaky but we need to make it a proper driver for module loading to
work, but we only want to register one hwrng device.

cheers

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Michael Ellerman @ 2013-10-01  8:34 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: tytso, herbert, gleb, linux-kernel, kvm-ppc, agraf, linuxppc-dev,
	Paul Mackerras, kvm, mpm
In-Reply-To: <5243F933.7000907@redhat.com>

On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
> Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> > Some powernv systems include a hwrng. Guests can access it via the
> > H_RANDOM hcall.
> 
> Is there any reason to do this in the kernel?  

It's less code, and it's faster :)

> It does not have to be a particularly fast path;

Sure, but do we have to make it slow on purpose?

> on x86, we are simply forwarding /dev/hwrng or
> /dev/random data to the guest.  You can simply use virtio-rng.

Not all guests support virtio-rng.

> If you really want to have the hypercall, implementing it in QEMU means
> that you can support it on all systems, in fact even when running
> without KVM.  

Sure, I can add a fallback to /dev/hwrng for full emulation.

> The QEMU command line would be something like "-object
> rng-random,filename=/dev/random,id=rng0 -device spapr-rng,rng=rng0".

We can't use /dev/random like that. The PAPR specification, which is
what we're implementing, implies that H_RANDOM provides data from a
hardware source.

cheers

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Michael Ellerman @ 2013-10-01  8:36 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: tytso, herbert, gleb, agraf, kvm-ppc, linux-kernel, linuxppc-dev,
	Paul Mackerras, kvm, mpm, pbonzini
In-Reply-To: <52459311.4010104@linux.vnet.ibm.com>

On Fri, Sep 27, 2013 at 07:45:45PM +0530, Anshuman Khandual wrote:
> On 09/26/2013 12:01 PM, Michael Ellerman wrote:
> > +int powernv_hwrng_present(void)
> > +{
> > +	return __raw_get_cpu_var(powernv_rng) != NULL;
> > +}
> > +
> >  static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
> >  {
> >  	unsigned long parity;
> > @@ -42,6 +48,17 @@ static unsigned long rng_whiten(struct powernv_rng *rng, unsigned long val)
> >  	return val;
> >  }
> > 
> > +int powernv_get_random_real_mode(unsigned long *v)
> > +{
> > +	struct powernv_rng *rng;
> > +
> > +	rng = __raw_get_cpu_var(powernv_rng);
> > +
> > +	*v = rng_whiten(rng, in_rm64(rng->regs_real));
> > +
> 
> Will it be in_be64() instead of in_rm64() ? Its failing the build here. Except this
> all individual patches build correctly.

No it's definitely not in_be64() - that will checkstop your machine :)

I added in_rm64() in a previous patch, "Add real mode cache inhibited
IO accessors" - I just didn't want to spam the KVM guys with those
patches as well.

Thanks for the review & testing.

cheers

^ permalink raw reply

* Re: [PATCH] powerpc/powernv: Reduce panic timeout from 180s to 10s
From: Michael Ellerman @ 2013-10-01  8:39 UTC (permalink / raw)
  To: Anton Blanchard; +Cc: paulus, linuxppc-dev
In-Reply-To: <20130926211719.7b99740a@kryten>

On Thu, Sep 26, 2013 at 09:17:19PM +1000, Anton Blanchard wrote:
> 
> We made this change to pseries in 2011 and I think it makes
> sense to do the same on powernv.

I'd vote we set it to 10s for all 64-bit machines in
arch/powerpc/kernel/setup_64.c.

cheers

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Gleb Natapov @ 2013-10-01  8:39 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: tytso, kvm, linuxppc-dev, agraf, kvm-ppc, linux-kernel, herbert,
	Paul Mackerras, mpm, Paolo Bonzini
In-Reply-To: <20131001083426.GB27484@concordia>

On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
> On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
> > Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> > > Some powernv systems include a hwrng. Guests can access it via the
> > > H_RANDOM hcall.
> > 
> > Is there any reason to do this in the kernel?  
> 
> It's less code, and it's faster :)
> 
> > It does not have to be a particularly fast path;
> 
> Sure, but do we have to make it slow on purpose?
> 
We do not put non performance critical devices into the kernel.

--
			Gleb.

^ permalink raw reply

* Re: [PATCH V2] powerpc/kernel/sysfs: disable writing to purr in non-powernv
From: Michael Ellerman @ 2013-10-01  8:43 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: Madhavan Srinivasan, linuxppc-dev
In-Reply-To: <1380613830.645.18.camel@pasglop>

On Tue, Oct 01, 2013 at 05:50:30PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2013-10-01 at 16:31 +1000, Michael Ellerman wrote:
> 
> > > 1)Changed the test for to hypervisor mode instead of platform
> > 
> > I think Ben's wrong about that.
> > 
> > Almost all existing code uses FW_FEATURE_LPAR to differentiate
> > hypervisor vs guest mode, so I think we should do the same here.
> 
> I didn't object to using the FW test, it's a reasonable way to do it, I
> objected to using the platform as an indication (powernv vs. pseries)

Yeah sorry I wasn't clear. You are right that it shouldn't use the
platform. So I think we agree FW_FEATURE is the right way to go.

cheers

^ permalink raw reply

* Re: [PATCH] Revert "powerpc: 52xx: provide a default in mpc52xx_irqhost_map()"
From: Wolfram Sang @ 2013-10-01  9:11 UTC (permalink / raw)
  To: Sebastian Andrzej Siewior
  Cc: linuxppc-dev, Anatolij Gustschin, linux-rt-users
In-Reply-To: <524A7FCB.3020406@linutronix.de>

[-- Attachment #1: Type: text/plain, Size: 2136 bytes --]

Hi,

On Tue, Oct 01, 2013 at 09:54:51AM +0200, Sebastian Andrzej Siewior wrote:
> On 10/01/2013 09:26 AM, Wolfram Sang wrote:
> > This reverts commit 6391f697d4892a6f233501beea553e13f7745a23. The
> > compiler warning it wants to fix does not appear with my gcc 4.6.2. IMO
> > we don't need superfluous (and here even misleading) code to make old
> > compilers happy. Fixing the printout was bogus, too. We want to know
> > WHICH critical irq failed, not which level it had.
> 
> According to minimal Doc*/Changes minimal gcc is 3.2. Mine was 4.3.5.

Well, if you insist, I'd prefer the following patch.

From: Wolfram Sang <wsa@the-dreams.de>
Subject: [PATCH] ppc: mpc52xx: silence false positive from old GCC

So people can compile with -Werror (RT patchset).

Signed-off-by: Wolfram Sang <wsa@the-dreams.de>
---
 arch/powerpc/platforms/52xx/mpc52xx_pic.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/52xx/mpc52xx_pic.c b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
index b89ef65..ad3c9b0 100644
--- a/arch/powerpc/platforms/52xx/mpc52xx_pic.c
+++ b/arch/powerpc/platforms/52xx/mpc52xx_pic.c
@@ -340,7 +340,7 @@ static int mpc52xx_irqhost_map(struct irq_domain *h, unsigned int virq,
 {
 	int l1irq;
 	int l2irq;
-	struct irq_chip *irqchip;
+	struct irq_chip *irqchip = NULL; /* pet old compilers */
 	void *hndlr;
 	int type;
 	u32 reg;

> Why miss leading code? Default here does the same as unhandled and crit
> where it does nothing.

People not realizing 'default' is a no-op might wonder why unknown
levels are mapped to critical.

> Any why do you want to see l2irq since it was
> not in the case statement? l2 holds the number, l1 the level.

We know which level it was, since the printout is only for that level.
We probably want to know which requested IRQ was causing this, so we can
fix the assorted driver. Otherwise we only know that some critical IRQ
was requested somewhere.

> You were but your email bounced. I wasn't aware of this new email
> address you are using now.

Ah, I see, pity.

Thanks,

   Wolfram

[-- Attachment #2: Digital signature --]
[-- Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply related

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Paul Mackerras @ 2013-10-01  9:23 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: tytso, kvm, linuxppc-dev, agraf, kvm-ppc, linux-kernel, herbert,
	mpm, Paolo Bonzini
In-Reply-To: <20131001083908.GA17294@redhat.com>

On Tue, Oct 01, 2013 at 11:39:08AM +0300, Gleb Natapov wrote:
> On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
> > On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
> > > Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> > > > Some powernv systems include a hwrng. Guests can access it via the
> > > > H_RANDOM hcall.
> > > 
> > > Is there any reason to do this in the kernel?  
> > 
> > It's less code, and it's faster :)
> > 
> > > It does not have to be a particularly fast path;
> > 
> > Sure, but do we have to make it slow on purpose?
> > 
> We do not put non performance critical devices into the kernel.

It's not a device, it's a single hypercall, specified by PAPR, which
is the moral equivalent of x86's RDRAND.

Paul.

^ permalink raw reply

* [PATCH v2] powerpc/kernel/sysfs: cleanup set up macros for PMC/non-PMC sprs
From: Madhavan Srinivasan @ 2013-10-01  9:23 UTC (permalink / raw)
  To: benh; +Cc: linuxppc-dev, Madhavan Srinivasan

Currently PMC (Performance Monitor Counter) setup macros are used
for other sprs. Since not all sprs are PMC related, this patch
modifies the exisiting macro and uses it to setup both PMC and
non PMC sprs accordingly.

V2 changes:

1) Modified SYSFS_PMCSETUP to a generic macro with additional parameter
2) Added PMC and SPR macro to call the generic macro
3) Changes in the comment to explain better.

Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com>
---
 arch/powerpc/kernel/sysfs.c |   73 +++++++++++++++++++++++--------------------
 1 file changed, 39 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/kernel/sysfs.c b/arch/powerpc/kernel/sysfs.c
index 27a90b9..cb971c4 100644
--- a/arch/powerpc/kernel/sysfs.c
+++ b/arch/powerpc/kernel/sysfs.c
@@ -107,14 +107,14 @@ void ppc_enable_pmcs(void)
 }
 EXPORT_SYMBOL(ppc_enable_pmcs);

-#define SYSFS_PMCSETUP(NAME, ADDRESS) \
+#define _SYSFS_SPRSETUP(NAME, ADDRESS, EXTRA) \
 static void read_##NAME(void *val) \
 { \
 	*(unsigned long *)val = mfspr(ADDRESS);	\
 } \
 static void write_##NAME(void *val) \
 { \
-	ppc_enable_pmcs(); \
+	EXTRA; \
 	mtspr(ADDRESS, *(unsigned long *)val);	\
 } \
 static ssize_t show_##NAME(struct device *dev, \
@@ -139,6 +139,11 @@ static ssize_t __used \
 	return count; \
 }

+#define SYSFS_EMPTY
+#define SYSFS_PMCSETUP(NAME, ADDRESS)	\
+	_SYSFS_SPRSETUP(NAME, ADDRESS, ppc_enable_pmcs())
+#define SYSFS_SPRSETUP(NAME, ADDRESS)	\
+	_SYSFS_SPRSETUP(NAME, ADDRESS, SYSFS_EMPTY)

 /* Let's define all possible registers, we'll only hook up the ones
  * that are implemented on the current processor
@@ -174,10 +179,10 @@ SYSFS_PMCSETUP(pmc7, SPRN_PMC7);
 SYSFS_PMCSETUP(pmc8, SPRN_PMC8);

 SYSFS_PMCSETUP(mmcra, SPRN_MMCRA);
-SYSFS_PMCSETUP(purr, SPRN_PURR);
-SYSFS_PMCSETUP(spurr, SPRN_SPURR);
-SYSFS_PMCSETUP(dscr, SPRN_DSCR);
-SYSFS_PMCSETUP(pir, SPRN_PIR);
+SYSFS_SPRSETUP(purr, SPRN_PURR);
+SYSFS_SPRSETUP(spurr, SPRN_SPURR);
+SYSFS_SPRSETUP(dscr, SPRN_DSCR);
+SYSFS_SPRSETUP(pir, SPRN_PIR);

 static DEVICE_ATTR(mmcra, 0600, show_mmcra, store_mmcra);
 static DEVICE_ATTR(spurr, 0400, show_spurr, NULL);
@@ -238,34 +243,34 @@ SYSFS_PMCSETUP(pa6t_pmc3, SPRN_PA6T_PMC3);
 SYSFS_PMCSETUP(pa6t_pmc4, SPRN_PA6T_PMC4);
 SYSFS_PMCSETUP(pa6t_pmc5, SPRN_PA6T_PMC5);
 #ifdef CONFIG_DEBUG_KERNEL
-SYSFS_PMCSETUP(hid0, SPRN_HID0);
-SYSFS_PMCSETUP(hid1, SPRN_HID1);
-SYSFS_PMCSETUP(hid4, SPRN_HID4);
-SYSFS_PMCSETUP(hid5, SPRN_HID5);
-SYSFS_PMCSETUP(ima0, SPRN_PA6T_IMA0);
-SYSFS_PMCSETUP(ima1, SPRN_PA6T_IMA1);
-SYSFS_PMCSETUP(ima2, SPRN_PA6T_IMA2);
-SYSFS_PMCSETUP(ima3, SPRN_PA6T_IMA3);
-SYSFS_PMCSETUP(ima4, SPRN_PA6T_IMA4);
-SYSFS_PMCSETUP(ima5, SPRN_PA6T_IMA5);
-SYSFS_PMCSETUP(ima6, SPRN_PA6T_IMA6);
-SYSFS_PMCSETUP(ima7, SPRN_PA6T_IMA7);
-SYSFS_PMCSETUP(ima8, SPRN_PA6T_IMA8);
-SYSFS_PMCSETUP(ima9, SPRN_PA6T_IMA9);
-SYSFS_PMCSETUP(imaat, SPRN_PA6T_IMAAT);
-SYSFS_PMCSETUP(btcr, SPRN_PA6T_BTCR);
-SYSFS_PMCSETUP(pccr, SPRN_PA6T_PCCR);
-SYSFS_PMCSETUP(rpccr, SPRN_PA6T_RPCCR);
-SYSFS_PMCSETUP(der, SPRN_PA6T_DER);
-SYSFS_PMCSETUP(mer, SPRN_PA6T_MER);
-SYSFS_PMCSETUP(ber, SPRN_PA6T_BER);
-SYSFS_PMCSETUP(ier, SPRN_PA6T_IER);
-SYSFS_PMCSETUP(sier, SPRN_PA6T_SIER);
-SYSFS_PMCSETUP(siar, SPRN_PA6T_SIAR);
-SYSFS_PMCSETUP(tsr0, SPRN_PA6T_TSR0);
-SYSFS_PMCSETUP(tsr1, SPRN_PA6T_TSR1);
-SYSFS_PMCSETUP(tsr2, SPRN_PA6T_TSR2);
-SYSFS_PMCSETUP(tsr3, SPRN_PA6T_TSR3);
+SYSFS_SPRSETUP(hid0, SPRN_HID0);
+SYSFS_SPRSETUP(hid1, SPRN_HID1);
+SYSFS_SPRSETUP(hid4, SPRN_HID4);
+SYSFS_SPRSETUP(hid5, SPRN_HID5);
+SYSFS_SPRSETUP(ima0, SPRN_PA6T_IMA0);
+SYSFS_SPRSETUP(ima1, SPRN_PA6T_IMA1);
+SYSFS_SPRSETUP(ima2, SPRN_PA6T_IMA2);
+SYSFS_SPRSETUP(ima3, SPRN_PA6T_IMA3);
+SYSFS_SPRSETUP(ima4, SPRN_PA6T_IMA4);
+SYSFS_SPRSETUP(ima5, SPRN_PA6T_IMA5);
+SYSFS_SPRSETUP(ima6, SPRN_PA6T_IMA6);
+SYSFS_SPRSETUP(ima7, SPRN_PA6T_IMA7);
+SYSFS_SPRSETUP(ima8, SPRN_PA6T_IMA8);
+SYSFS_SPRSETUP(ima9, SPRN_PA6T_IMA9);
+SYSFS_SPRSETUP(imaat, SPRN_PA6T_IMAAT);
+SYSFS_SPRSETUP(btcr, SPRN_PA6T_BTCR);
+SYSFS_SPRSETUP(pccr, SPRN_PA6T_PCCR);
+SYSFS_SPRSETUP(rpccr, SPRN_PA6T_RPCCR);
+SYSFS_SPRSETUP(der, SPRN_PA6T_DER);
+SYSFS_SPRSETUP(mer, SPRN_PA6T_MER);
+SYSFS_SPRSETUP(ber, SPRN_PA6T_BER);
+SYSFS_SPRSETUP(ier, SPRN_PA6T_IER);
+SYSFS_SPRSETUP(sier, SPRN_PA6T_SIER);
+SYSFS_SPRSETUP(siar, SPRN_PA6T_SIAR);
+SYSFS_SPRSETUP(tsr0, SPRN_PA6T_TSR0);
+SYSFS_SPRSETUP(tsr1, SPRN_PA6T_TSR1);
+SYSFS_SPRSETUP(tsr2, SPRN_PA6T_TSR2);
+SYSFS_SPRSETUP(tsr3, SPRN_PA6T_TSR3);
 #endif /* CONFIG_DEBUG_KERNEL */
 #endif /* HAS_PPC_PMC_PA6T */

--
1.7.10.4

^ permalink raw reply related

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Benjamin Herrenschmidt @ 2013-10-01  9:38 UTC (permalink / raw)
  To: Gleb Natapov
  Cc: tytso, kvm, linuxppc-dev, linux-kernel, kvm-ppc, agraf, herbert,
	Paul Mackerras, mpm, Paolo Bonzini
In-Reply-To: <20131001083908.GA17294@redhat.com>

On Tue, 2013-10-01 at 11:39 +0300, Gleb Natapov wrote:
> On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
> > On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
> > > Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> > > > Some powernv systems include a hwrng. Guests can access it via the
> > > > H_RANDOM hcall.
> > > 
> > > Is there any reason to do this in the kernel?  
> > 
> > It's less code, and it's faster :)
> > 
> > > It does not have to be a particularly fast path;
> > 
> > Sure, but do we have to make it slow on purpose?
> > 
> We do not put non performance critical devices into the kernel.

So for the sake of that dogma you are going to make us do something that
is about 100 times slower ? (and possibly involves more lines of code)

It's not just speed ... H_RANDOM is going to be called by the guest
kernel. A round trip to qemu is going to introduce a kernel jitter
(complete stop of operations of the kernel on that virtual processor) of
a full exit + round trip to qemu + back to the kernel to get to some
source of random number ...  this is going to be in the dozens of ns at
least.

This makes no sense.

Ben.

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Paolo Bonzini @ 2013-10-01  9:58 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: tytso, herbert, gleb, linux-kernel, kvm-ppc, agraf, linuxppc-dev,
	Paul Mackerras, kvm, mpm
In-Reply-To: <20131001083426.GB27484@concordia>

Il 01/10/2013 10:34, Michael Ellerman ha scritto:
>> If you really want to have the hypercall, implementing it in QEMU means
>> that you can support it on all systems, in fact even when running
>> without KVM.  
> 
> Sure, I can add a fallback to /dev/hwrng for full emulation.
> 
>> The QEMU command line would be something like "-object
>> rng-random,filename=/dev/random,id=rng0 -device spapr-rng,rng=rng0".
> 
> We can't use /dev/random like that. The PAPR specification, which is
> what we're implementing, implies that H_RANDOM provides data from a
> hardware source.

Then use /dev/hwrng.

I don't have POWER machines, but I still want to be able to test as much
as possible using emulation.

Paolo

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Gleb Natapov @ 2013-10-01  9:57 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: tytso, kvm, linuxppc-dev, agraf, kvm-ppc, linux-kernel, herbert,
	mpm, Paolo Bonzini
In-Reply-To: <20131001092320.GA2000@iris.ozlabs.ibm.com>

On Tue, Oct 01, 2013 at 07:23:20PM +1000, Paul Mackerras wrote:
> On Tue, Oct 01, 2013 at 11:39:08AM +0300, Gleb Natapov wrote:
> > On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
> > > On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
> > > > Il 26/09/2013 08:31, Michael Ellerman ha scritto:
> > > > > Some powernv systems include a hwrng. Guests can access it via the
> > > > > H_RANDOM hcall.
> > > > 
> > > > Is there any reason to do this in the kernel?  
> > > 
> > > It's less code, and it's faster :)
> > > 
> > > > It does not have to be a particularly fast path;
> > > 
> > > Sure, but do we have to make it slow on purpose?
> > > 
> > We do not put non performance critical devices into the kernel.
> 
> It's not a device, it's a single hypercall, specified by PAPR, which
> is the moral equivalent of x86's RDRAND.
> 
OK. A couple of general questions. How guest knows when this hypercall
is available? Is QEMU enable it when KVM_CAP_PPC_HWRNG is available (I
haven't seen userspace patch that uses KVM_CAP_PPC_HWRNG)? What about
QEMU TCG does it implement this hypercall or it emulates hwrng directly?

--
			Gleb.

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Alexander Graf @ 2013-10-01 10:00 UTC (permalink / raw)
  To: Paul Mackerras
  Cc: tytso, kvm, Gleb Natapov, linuxppc-dev, linux-kernel, kvm-ppc,
	herbert, mpm, Paolo Bonzini
In-Reply-To: <20131001092320.GA2000@iris.ozlabs.ibm.com>

On 10/01/2013 11:23 AM, Paul Mackerras wrote:
> On Tue, Oct 01, 2013 at 11:39:08AM +0300, Gleb Natapov wrote:
>> On Tue, Oct 01, 2013 at 06:34:26PM +1000, Michael Ellerman wrote:
>>> On Thu, Sep 26, 2013 at 11:06:59AM +0200, Paolo Bonzini wrote:
>>>> Il 26/09/2013 08:31, Michael Ellerman ha scritto:
>>>>> Some powernv systems include a hwrng. Guests can access it via the
>>>>> H_RANDOM hcall.
>>>> Is there any reason to do this in the kernel?
>>> It's less code, and it's faster :)
>>>
>>>> It does not have to be a particularly fast path;
>>> Sure, but do we have to make it slow on purpose?
>>>
>> We do not put non performance critical devices into the kernel.
> It's not a device, it's a single hypercall, specified by PAPR, which
> is the moral equivalent of x86's RDRAND.

Yes, and hypercalls should be handled in user space unless impossible 
otherwise (like MMU hypercalls which modify state that user space has no 
priviledge to access).

I think the most reasonable way forward would be to implement the path 
that jumps through hoops and goes through user space, then add a new 
device in kvm that registers on this hcall inside of kvm.

That way we ensure consistency (user space knows what to put into device 
tree, can disable it if it wants to, can run with TCG, etc) and you can 
prove that your user space interface works along the way.

Alex

^ permalink raw reply

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Alexander Gordeev @ 2013-10-01 10:35 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: linuxppc-dev, Joerg Roedel, x86@kernel.org,
	linux-kernel@vger.kernel.org, linux-ide@vger.kernel.org,
	Jan Beulich, linux-pci@vger.kernel.org, Tejun Heo, Bjorn Helgaas,
	Ingo Molnar
In-Reply-To: <20131001075133.GJ17966@concordia>

On Tue, Oct 01, 2013 at 05:51:33PM +1000, Michael Ellerman wrote:
> The disadvantage is that any restriction imposed on us above the quota
> can only be reported as an error from pci_enable_msix().
> 
> The quota code, called from pci_get_msix_limit(), can only do so much to
> interogate firmware about the limitations. The ultimate way to check if
> firmware will give us enough MSIs is to try and allocate them. But we
> can't do that from pci_get_msix_limit() because the driver is not asking
> us to enable MSIs, just query them.

If things are this way then pci_enable_msix() already exposed to this
problem internally on pSeries.

I see that even successful quota checks in rtas_msi_check_device() and
rtas_setup_msi_irqs() do not guarantee (as you say) that firmware will
give enough MSIs. Hence, pci_enable_msix() might fail even though the
its quota checks succeeded.

Therefore, nothing will really change if we make pci_get_msix_limit() check
quota and hope the follow-up call to pci_enable_msix() succeeded.

(Of course, we could allocate-deallocate MSIs at check time, but I think it
is an overkill).

> You'll also need to add another arch hook, for the quota check, and
> we'll have to add it to our per-platform indirection as well.

Already, in a branch, hidden from Bjorn & Tejun eyes ;)

> All a lot of bother for no real gain IMHO.

Well, I do not have a strong opinion here. I leave it to the ones who have :)
But few drivers have became clearer as result of this change (and messy ones
are still messy).

> cheers

-- 
Regards,
Alexander Gordeev
agordeev@redhat.com

^ permalink raw reply

* Re: [PATCH 3/3] KVM: PPC: Book3S: Add support for hwrng found on some powernv systems
From: Paolo Bonzini @ 2013-10-01 11:19 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: tytso, kvm, Gleb Natapov, linuxppc-dev, linux-kernel, kvm-ppc,
	agraf, herbert, Paul Mackerras, mpm
In-Reply-To: <1380620338.645.22.camel@pasglop>

Il 01/10/2013 11:38, Benjamin Herrenschmidt ha scritto:
> So for the sake of that dogma you are going to make us do something that
> is about 100 times slower ? (and possibly involves more lines of code)

If it's 100 times slower there is something else that's wrong.  It's
most likely not 100 times slower, and this makes me wonder if you or
Michael actually timed the code at all.

> It's not just speed ... H_RANDOM is going to be called by the guest
> kernel. A round trip to qemu is going to introduce a kernel jitter
> (complete stop of operations of the kernel on that virtual processor) of
> a full exit + round trip to qemu + back to the kernel to get to some
> source of random number ...  this is going to be in the dozens of ns at
> least.

I guess you mean dozens of *micro*seconds, which is somewhat exaggerated
but not too much.  On x86 some reasonable timings are:

  100 cycles            bare metal rdrand
  2000 cycles           guest->hypervisor->guest
  15000 cycles          guest->userspace->guest

(100 cycles = 40 ns = 200 MB/sec; 2000 cycles = ~1 microseconds; 15000
cycles = ~7.5 microseconds).  Even on 5 year old hardware, a userspace
roundtrip is around a dozen microseconds.

Anyhow, I would like to know more about this hwrng and hypercall.

Does the hwrng return random numbers (like rdrand) or real entropy (like
rdseed that Intel will add in Broadwell)?  What about the hypercall?
For example virtio-rng is specified to return actual entropy, it doesn't
matter if it is from hardware or software.

In either case, the patches have problems.

1) If the hwrng returns random numbers, the whitening you're doing is
totally insufficient and patch 2 is forging entropy that doesn't exist.

2) If the hwrng returns entropy, a read from the hwrng is going to even
more expensive than an x86 rdrand (perhaps ~2000 cycles).  Hence, doing
the emulation in the kernel is even less necessary.  Also, if the hwrng
returns entropy patch 1 is unnecessary: you do not need to waste
precious entropy bits by passing them to arch_get_random_long; just run
rngd in the host as that will put the entropy to much better use.

3) If the hypercall returns random numbers, then it is a pretty
braindead interface since returning 8 bytes at a time limits the
throughput to a handful of MB/s (compare to 200 MB/sec for x86 rdrand).
 But more important: in this case drivers/char/hw_random/pseries-rng.c
is completely broken and insecure, just like patch 2 in case (1) above.

4) If the hypercall returns entropy (same as virtio-rng), the same
considerations on speed apply.  If you can only produce entropy at say 1
MB/s (so reading 8 bytes take 8 microseconds---which is actually very
fast), it doesn't matter that much to spend 7 microseconds on a
userspace roundtrip.  It's going to be only half the speed of bare
metal, not 100 times slower.

Also, you will need _anyway_ extra code that is not present here to
either disable the rng based on userspace command-line, or to emulate
the rng from userspace.  It is absolutely _not_ acceptable to have a
hypercall disappear across migration.  You're repeatedly ignoring these
issues, but rest assured that they will come back and bite you
spectacularly.

Based on all this, I would simply ignore the part of the spec where they
say "the hypercall should return numbers from a hardware source".  All
that matters in virtualization is to have a good source of _entropy_.
Then you can run rngd without randomness checks, which will more than
recover the cost of userspace roundtrips.

In any case, deciding where to get that entropy from is definitely
outside the scope of KVM, and in fact QEMU already has a configurable
mechanism for that.

Paolo

^ permalink raw reply

* Re: [RFC PATCH 00/11 Allow PR and HV KVM to coexist in one kernel
From: Aneesh Kumar K.V @ 2013-10-01 11:26 UTC (permalink / raw)
  To: Alexander Graf
  Cc: <kvm@vger.kernel.org> list, Gleb Natapov, kvm-ppc,
	Paul Mackerras, Paolo Bonzini, linuxppc-dev
In-Reply-To: <524990A9.60704@suse.de>

Alexander Graf <agraf@suse.de> writes:

> On 09/30/2013 03:09 PM, Aneesh Kumar K.V wrote:
>> Alexander Graf<agraf@suse.de>  writes:
>>
>>> On 27.09.2013, at 12:52, Aneesh Kumar K.V wrote:
>>>
>>>> "Aneesh Kumar K.V"<aneesh.kumar@linux.vnet.ibm.com>  writes:
>>>>
>>>>> Hi All,
>>>>>
>>>>> This patch series support enabling HV and PR KVM together in the same kernel. We
>>>>> extend machine property with new property "kvm_type". A value of 1 will force HV
>>>>> KVM and 2 PR KVM. The default value is 0 which will select the fastest KVM mode.
>>>>> ie, HV if that is supported otherwise PR.
>>>>>
>>>>> With Qemu command line having
>>>>>
>>>>> -machine pseries,accel=kvm,kvm_type=1
>>>>>
>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>> failed to initialize KVM: Invalid argument
>>>>> [root@llmp24l02 qemu]# modprobe kvm-pr
>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>> failed to initialize KVM: Invalid argument
>>>>> [root@llmp24l02 qemu]# modprobe  kvm-hv
>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>
>>>>> now with
>>>>>
>>>>> -machine pseries,accel=kvm,kvm_type=2
>>>>>
>>>>> [root@llmp24l02 qemu]# rmmod kvm-pr
>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>> failed to initialize KVM: Invalid argument
>>>>> [root@llmp24l02 qemu]#
>>>>> [root@llmp24l02 qemu]# modprobe kvm-pr
>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>
>>>>> if don't specify kvm_type machine property, it will take a default value 0,
>>>>> which means fastest supported.
>>>> Related qemu patch
>>>>
>>>> commit 8d139053177d48a70cb710b211ea4c2843eccdfb
>>>> Author: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
>>>> Date:   Mon Sep 23 12:28:37 2013 +0530
>>>>
>>>>     kvm: Add a new machine property kvm_type
>>>>
>>>>     Targets like ppc64 support different type of KVM, one which use
>>>>     hypervisor mode and the other which doesn't. Add a new machine
>>>>     property kvm_type that helps in selecting the respective ones
>>>>
>>>>     Signed-off-by: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
>>> This really is too early, as we can't possibly run in HV mode for
>>> non-pseries machines, so the interpretation (or at least sanity
>>> checking) of what values are reasonable should occur in the
>>> machine. That's why it's a variable in the "machine opts".
>> With the current code CREATE_VM will fail, because we won't have
>> kvm-hv.ko loaded and trying to create a vm with type 1 will fail.
>> Now the challenge related to moving that to machine_init or later is, we
>> depend on HV or PR callback early in CREATE_VM. With the changes we have
>>
>> int kvmppc_core_init_vm(struct kvm *kvm)
>> {
>>
>> #ifdef CONFIG_PPC64
>> 	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
>> 	INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
>> #endif
>>
>> 	return kvm->arch.kvm_ops->init_vm(kvm);
>> }
>>
>> Also the mmu notifier callback do end up calling kvm_unmap_hva etc which
>> are all HV/PR dependent.
>
> Yes, so we should verify in the machine models that we're runnable with 
> the currently selected type at least, to give the user a sensible error 
> message.

Something like the below

diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
index 004184d..7d59ac1 100644
--- a/hw/ppc/spapr.c
+++ b/hw/ppc/spapr.c
@@ -1337,6 +1337,21 @@ static void ppc_spapr_init(QEMUMachineInitArgs *args)
     assert(spapr->fdt_skel != NULL);
 }
 
+static int spapr_get_vm_type(const char *vm_type)
+{
+    if (!vm_type)
+        return 0;
+
+    if (!strcmp(vm_type, "HV"))
+        return 1;
+
+    if (!strcmp(vm_type, "PR"))
+        return 2;
+
+    hw_error("qemu: unknown kvm_type specified '%s'", vm_type);
+    exit(1);
+}
+
 static QEMUMachine spapr_machine = {
     .name = "pseries",
     .desc = "pSeries Logical Partition (PAPR compliant)",
@@ -1347,6 +1362,7 @@ static QEMUMachine spapr_machine = {
     .max_cpus = MAX_CPUS,
     .no_parallel = 1,
     .default_boot_order = NULL,
+    .get_vm_type = spapr_get_vm_type,
 };
 
 static void spapr_machine_init(void)
diff --git a/include/hw/boards.h b/include/hw/boards.h
index 5a7ae9f..2130488 100644
--- a/include/hw/boards.h
+++ b/include/hw/boards.h
@@ -21,6 +21,8 @@ typedef void QEMUMachineResetFunc(void);
 
 typedef void QEMUMachineHotAddCPUFunc(const int64_t id, Error **errp);
 
+typedef int QEMUMachineGetVmTypeFunc(const char *arg);
+
 typedef struct QEMUMachine {
     const char *name;
     const char *alias;
@@ -28,6 +30,7 @@ typedef struct QEMUMachine {
     QEMUMachineInitFunc *init;
     QEMUMachineResetFunc *reset;
     QEMUMachineHotAddCPUFunc *hot_add_cpu;
+    QEMUMachineGetVmTypeFunc *get_vm_type;
     BlockInterfaceType block_default_type;
     int max_cpus;
     unsigned int no_serial:1,
diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
index e1f88bf..acc3d74 100644
--- a/include/hw/xen/xen.h
+++ b/include/hw/xen/xen.h
@@ -36,7 +36,8 @@ void xen_cmos_set_s3_resume(void *opaque, int irq, int level);
 
 qemu_irq *xen_interrupt_controller_init(void);
 
-int xen_init(void);
+typedef struct QEMUMachine QEMUMachine;
+int xen_init(QEMUMachine *machine);
 int xen_hvm_init(MemoryRegion **ram_memory);
 void xenstore_store_pv_console_info(int i, struct CharDriverState *chr);
 
diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
index 9bbe3db..f25caec 100644
--- a/include/sysemu/kvm.h
+++ b/include/sysemu/kvm.h
@@ -142,8 +142,8 @@ typedef struct KVMState KVMState;
 extern KVMState *kvm_state;
 
 /* external API */
-
-int kvm_init(void);
+typedef struct QEMUMachine QEMUMachine;
+int kvm_init(QEMUMachine *machine);
 
 int kvm_has_sync_mmu(void);
 int kvm_has_vcpu_events(void);
diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
index 9a0c6b3..d71343d 100644
--- a/include/sysemu/qtest.h
+++ b/include/sysemu/qtest.h
@@ -31,7 +31,8 @@ static inline int qtest_available(void)
     return 1;
 }
 
-int qtest_init(void);
+typedef struct QEMUMachine QEMUMachine;
+int qtest_init(QEMUMachine *machine);
 #else
 static inline bool qtest_enabled(void)
 {
@@ -43,7 +44,7 @@ static inline int qtest_available(void)
     return 0;
 }
 
-static inline int qtest_init(void)
+static inline int qtest_init(QEMUMachine *machine)
 {
     return 0;
 }
diff --git a/kvm-all.c b/kvm-all.c
index b87215c..3863abd 100644
--- a/kvm-all.c
+++ b/kvm-all.c
@@ -35,6 +35,8 @@
 #include "qemu/event_notifier.h"
 #include "trace.h"
 
+#include "hw/boards.h"
+
 /* This check must be after config-host.h is included */
 #ifdef CONFIG_EVENTFD
 #include <sys/eventfd.h>
@@ -1342,7 +1344,7 @@ static int kvm_max_vcpus(KVMState *s)
     return 4;
 }
 
-int kvm_init(void)
+int kvm_init(QEMUMachine *machine)
 {
     static const char upgrade_note[] =
         "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n"
@@ -1350,7 +1352,7 @@ int kvm_init(void)
     KVMState *s;
     const KVMCapabilityInfo *missing_cap;
     int ret;
-    int i;
+    int i, kvm_type = 0;
     int max_vcpus;
 
     s = g_malloc0(sizeof(KVMState));
@@ -1407,7 +1409,11 @@ int kvm_init(void)
         goto err;
     }
 
-    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
+    if (machine->get_vm_type) {
+        kvm_type = machine->get_vm_type(qemu_opt_get(qemu_get_machine_opts(),
+                                                     "kvm_type"));
+    }
+    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, kvm_type);
     if (s->vmfd < 0) {
 #ifdef TARGET_S390X
         fprintf(stderr, "Please add the 'switch_amode' kernel parameter to "
diff --git a/kvm-stub.c b/kvm-stub.c
index 548f471..ccb7b8c 100644
--- a/kvm-stub.c
+++ b/kvm-stub.c
@@ -19,6 +19,8 @@
 #include "hw/pci/msi.h"
 #endif
 
+#include "hw/boards.h"
+
 KVMState *kvm_state;
 bool kvm_kernel_irqchip;
 bool kvm_async_interrupts_allowed;
@@ -33,7 +35,7 @@ int kvm_init_vcpu(CPUState *cpu)
     return -ENOSYS;
 }
 
-int kvm_init(void)
+int kvm_init(QEMUMachine *machine)
 {
     return -ENOSYS;
 }
diff --git a/qtest.c b/qtest.c
index 584c707..ef3c473 100644
--- a/qtest.c
+++ b/qtest.c
@@ -502,7 +502,7 @@ static void qtest_event(void *opaque, int event)
     }
 }
 
-int qtest_init(void)
+int qtest_init(QEMUMachine *machine)
 {
     CharDriverState *chr;
 
diff --git a/vl.c b/vl.c
index 4e709d5..7ecc581 100644
--- a/vl.c
+++ b/vl.c
@@ -427,7 +427,12 @@ static QemuOptsList qemu_machine_opts = {
             .name = "usb",
             .type = QEMU_OPT_BOOL,
             .help = "Set on/off to enable/disable usb",
+        },{
+            .name = "kvm_type",
+            .type = QEMU_OPT_STRING,
+            .help = "Set to kvm type to be used in create vm ioctl",
         },
+
         { /* End of list */ }
     },
 };
@@ -2608,7 +2613,7 @@ static QEMUMachine *machine_parse(const char *name)
     exit(!name || !is_help_option(name));
 }
 
-static int tcg_init(void)
+static int tcg_init(QEMUMachine *machine)
 {
     tcg_exec_init(tcg_tb_size * 1024 * 1024);
     return 0;
@@ -2618,7 +2623,7 @@ static struct {
     const char *opt_name;
     const char *name;
     int (*available)(void);
-    int (*init)(void);
+    int (*init)(QEMUMachine *);
     bool *allowed;
 } accel_list[] = {
     { "tcg", "tcg", tcg_available, tcg_init, &tcg_allowed },
@@ -2627,7 +2632,7 @@ static struct {
     { "qtest", "QTest", qtest_available, qtest_init, &qtest_allowed },
 };
 
-static int configure_accelerator(void)
+static int configure_accelerator(QEMUMachine *machine)
 {
     const char *p;
     char buf[10];
@@ -2654,7 +2659,7 @@ static int configure_accelerator(void)
                     continue;
                 }
                 *(accel_list[i].allowed) = true;
-                ret = accel_list[i].init();
+                ret = accel_list[i].init(machine);
                 if (ret < 0) {
                     init_failed = true;
                     fprintf(stderr, "failed to initialize %s: %s\n",
@@ -4037,10 +4042,10 @@ int main(int argc, char **argv, char **envp)
         exit(0);
     }
 
-    configure_accelerator();
+    configure_accelerator(machine);
 
     if (!qtest_enabled() && qtest_chrdev) {
-        qtest_init();
+        qtest_init(machine);
     }
 
     machine_opts = qemu_get_machine_opts();
diff --git a/xen-all.c b/xen-all.c
index 839f14f..ac3654b 100644
--- a/xen-all.c
+++ b/xen-all.c
@@ -1000,7 +1000,7 @@ static void xen_exit_notifier(Notifier *n, void *data)
     xs_daemon_close(state->xenstore);
 }
 
-int xen_init(void)
+int xen_init(QEMUMachine *machine)
 {
     xen_xc = xen_xc_interface_open(0, 0, 0);
     if (xen_xc == XC_HANDLER_INITIAL_VALUE) {
diff --git a/xen-stub.c b/xen-stub.c
index ad189a6..59927cb 100644
--- a/xen-stub.c
+++ b/xen-stub.c
@@ -47,7 +47,7 @@ qemu_irq *xen_interrupt_controller_init(void)
     return NULL;
 }
 
-int xen_init(void)
+int xen_init(QEMUMachine *machine)
 {
     return -ENOSYS;
 }

>
>>
>>
>>
>>> Also, users don't want to say type=0. They want to say type=PR or
>>> type=HV or type=HV,PR. In fact, can't you make this a property of
>>> -accel? Then it's truly accel specific and everything should be well.
>> If we are doing this as machine property, we can't specify string,
>> because "HV"/"PR" are all powerpc dependent, so parsing that is not
>> possible in kvm_init in qemu. But, yes ideally it would be nice to be
>
> Well, we could do the "name to integer" conversion in an arch specific 
> function, no?
>
>> able to speicy the type using string. I thought accel is a machine
>> property, hence was not sure whether I can have additional properties
>> against that. I was using it as below.
>>
>>   -machine pseries,accel=kvm,kvm_type=1

Can we really specific -accel ? I check and I am finding that as machine
property. 

-aneesh

^ permalink raw reply related

* Re: [RFC PATCH 00/11 Allow PR and HV KVM to coexist in one kernel
From: Alexander Graf @ 2013-10-01 11:36 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: <kvm@vger.kernel.org> list, Gleb Natapov, kvm-ppc,
	Paul Mackerras, Paolo Bonzini, linuxppc-dev, Andreas Färber
In-Reply-To: <87ob79cjvm.fsf@linux.vnet.ibm.com>

On 10/01/2013 01:26 PM, Aneesh Kumar K.V wrote:
> Alexander Graf<agraf@suse.de>  writes:
>
>> On 09/30/2013 03:09 PM, Aneesh Kumar K.V wrote:
>>> Alexander Graf<agraf@suse.de>   writes:
>>>
>>>> On 27.09.2013, at 12:52, Aneesh Kumar K.V wrote:
>>>>
>>>>> "Aneesh Kumar K.V"<aneesh.kumar@linux.vnet.ibm.com>   writes:
>>>>>
>>>>>> Hi All,
>>>>>>
>>>>>> This patch series support enabling HV and PR KVM together in the same kernel. We
>>>>>> extend machine property with new property "kvm_type". A value of 1 will force HV
>>>>>> KVM and 2 PR KVM. The default value is 0 which will select the fastest KVM mode.
>>>>>> ie, HV if that is supported otherwise PR.
>>>>>>
>>>>>> With Qemu command line having
>>>>>>
>>>>>> -machine pseries,accel=kvm,kvm_type=1
>>>>>>
>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>> failed to initialize KVM: Invalid argument
>>>>>> [root@llmp24l02 qemu]# modprobe kvm-pr
>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>> failed to initialize KVM: Invalid argument
>>>>>> [root@llmp24l02 qemu]# modprobe  kvm-hv
>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>
>>>>>> now with
>>>>>>
>>>>>> -machine pseries,accel=kvm,kvm_type=2
>>>>>>
>>>>>> [root@llmp24l02 qemu]# rmmod kvm-pr
>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>> failed to initialize KVM: Invalid argument
>>>>>> [root@llmp24l02 qemu]#
>>>>>> [root@llmp24l02 qemu]# modprobe kvm-pr
>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>
>>>>>> if don't specify kvm_type machine property, it will take a default value 0,
>>>>>> which means fastest supported.
>>>>> Related qemu patch
>>>>>
>>>>> commit 8d139053177d48a70cb710b211ea4c2843eccdfb
>>>>> Author: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
>>>>> Date:   Mon Sep 23 12:28:37 2013 +0530
>>>>>
>>>>>      kvm: Add a new machine property kvm_type
>>>>>
>>>>>      Targets like ppc64 support different type of KVM, one which use
>>>>>      hypervisor mode and the other which doesn't. Add a new machine
>>>>>      property kvm_type that helps in selecting the respective ones
>>>>>
>>>>>      Signed-off-by: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
>>>> This really is too early, as we can't possibly run in HV mode for
>>>> non-pseries machines, so the interpretation (or at least sanity
>>>> checking) of what values are reasonable should occur in the
>>>> machine. That's why it's a variable in the "machine opts".
>>> With the current code CREATE_VM will fail, because we won't have
>>> kvm-hv.ko loaded and trying to create a vm with type 1 will fail.
>>> Now the challenge related to moving that to machine_init or later is, we
>>> depend on HV or PR callback early in CREATE_VM. With the changes we have
>>>
>>> int kvmppc_core_init_vm(struct kvm *kvm)
>>> {
>>>
>>> #ifdef CONFIG_PPC64
>>> 	INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
>>> 	INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
>>> #endif
>>>
>>> 	return kvm->arch.kvm_ops->init_vm(kvm);
>>> }
>>>
>>> Also the mmu notifier callback do end up calling kvm_unmap_hva etc which
>>> are all HV/PR dependent.
>> Yes, so we should verify in the machine models that we're runnable with
>> the currently selected type at least, to give the user a sensible error
>> message.
> Something like the below

I like that one a lot. Andreas, Paolo, what do you think?


Alex

>
> diff --git a/hw/ppc/spapr.c b/hw/ppc/spapr.c
> index 004184d..7d59ac1 100644
> --- a/hw/ppc/spapr.c
> +++ b/hw/ppc/spapr.c
> @@ -1337,6 +1337,21 @@ static void ppc_spapr_init(QEMUMachineInitArgs *args)
>       assert(spapr->fdt_skel != NULL);
>   }
>
> +static int spapr_get_vm_type(const char *vm_type)
> +{
> +    if (!vm_type)
> +        return 0;
> +
> +    if (!strcmp(vm_type, "HV"))
> +        return 1;
> +
> +    if (!strcmp(vm_type, "PR"))
> +        return 2;
> +
> +    hw_error("qemu: unknown kvm_type specified '%s'", vm_type);
> +    exit(1);
> +}
> +
>   static QEMUMachine spapr_machine = {
>       .name = "pseries",
>       .desc = "pSeries Logical Partition (PAPR compliant)",
> @@ -1347,6 +1362,7 @@ static QEMUMachine spapr_machine = {
>       .max_cpus = MAX_CPUS,
>       .no_parallel = 1,
>       .default_boot_order = NULL,
> +    .get_vm_type = spapr_get_vm_type,
>   };
>
>   static void spapr_machine_init(void)
> diff --git a/include/hw/boards.h b/include/hw/boards.h
> index 5a7ae9f..2130488 100644
> --- a/include/hw/boards.h
> +++ b/include/hw/boards.h
> @@ -21,6 +21,8 @@ typedef void QEMUMachineResetFunc(void);
>
>   typedef void QEMUMachineHotAddCPUFunc(const int64_t id, Error **errp);
>
> +typedef int QEMUMachineGetVmTypeFunc(const char *arg);
> +
>   typedef struct QEMUMachine {
>       const char *name;
>       const char *alias;
> @@ -28,6 +30,7 @@ typedef struct QEMUMachine {
>       QEMUMachineInitFunc *init;
>       QEMUMachineResetFunc *reset;
>       QEMUMachineHotAddCPUFunc *hot_add_cpu;
> +    QEMUMachineGetVmTypeFunc *get_vm_type;
>       BlockInterfaceType block_default_type;
>       int max_cpus;
>       unsigned int no_serial:1,
> diff --git a/include/hw/xen/xen.h b/include/hw/xen/xen.h
> index e1f88bf..acc3d74 100644
> --- a/include/hw/xen/xen.h
> +++ b/include/hw/xen/xen.h
> @@ -36,7 +36,8 @@ void xen_cmos_set_s3_resume(void *opaque, int irq, int level);
>
>   qemu_irq *xen_interrupt_controller_init(void);
>
> -int xen_init(void);
> +typedef struct QEMUMachine QEMUMachine;
> +int xen_init(QEMUMachine *machine);
>   int xen_hvm_init(MemoryRegion **ram_memory);
>   void xenstore_store_pv_console_info(int i, struct CharDriverState *chr);
>
> diff --git a/include/sysemu/kvm.h b/include/sysemu/kvm.h
> index 9bbe3db..f25caec 100644
> --- a/include/sysemu/kvm.h
> +++ b/include/sysemu/kvm.h
> @@ -142,8 +142,8 @@ typedef struct KVMState KVMState;
>   extern KVMState *kvm_state;
>
>   /* external API */
> -
> -int kvm_init(void);
> +typedef struct QEMUMachine QEMUMachine;
> +int kvm_init(QEMUMachine *machine);
>
>   int kvm_has_sync_mmu(void);
>   int kvm_has_vcpu_events(void);
> diff --git a/include/sysemu/qtest.h b/include/sysemu/qtest.h
> index 9a0c6b3..d71343d 100644
> --- a/include/sysemu/qtest.h
> +++ b/include/sysemu/qtest.h
> @@ -31,7 +31,8 @@ static inline int qtest_available(void)
>       return 1;
>   }
>
> -int qtest_init(void);
> +typedef struct QEMUMachine QEMUMachine;
> +int qtest_init(QEMUMachine *machine);
>   #else
>   static inline bool qtest_enabled(void)
>   {
> @@ -43,7 +44,7 @@ static inline int qtest_available(void)
>       return 0;
>   }
>
> -static inline int qtest_init(void)
> +static inline int qtest_init(QEMUMachine *machine)
>   {
>       return 0;
>   }
> diff --git a/kvm-all.c b/kvm-all.c
> index b87215c..3863abd 100644
> --- a/kvm-all.c
> +++ b/kvm-all.c
> @@ -35,6 +35,8 @@
>   #include "qemu/event_notifier.h"
>   #include "trace.h"
>
> +#include "hw/boards.h"
> +
>   /* This check must be after config-host.h is included */
>   #ifdef CONFIG_EVENTFD
>   #include<sys/eventfd.h>
> @@ -1342,7 +1344,7 @@ static int kvm_max_vcpus(KVMState *s)
>       return 4;
>   }
>
> -int kvm_init(void)
> +int kvm_init(QEMUMachine *machine)
>   {
>       static const char upgrade_note[] =
>           "Please upgrade to at least kernel 2.6.29 or recent kvm-kmod\n"
> @@ -1350,7 +1352,7 @@ int kvm_init(void)
>       KVMState *s;
>       const KVMCapabilityInfo *missing_cap;
>       int ret;
> -    int i;
> +    int i, kvm_type = 0;
>       int max_vcpus;
>
>       s = g_malloc0(sizeof(KVMState));
> @@ -1407,7 +1409,11 @@ int kvm_init(void)
>           goto err;
>       }
>
> -    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, 0);
> +    if (machine->get_vm_type) {
> +        kvm_type = machine->get_vm_type(qemu_opt_get(qemu_get_machine_opts(),
> +                                                     "kvm_type"));
> +    }
> +    s->vmfd = kvm_ioctl(s, KVM_CREATE_VM, kvm_type);
>       if (s->vmfd<  0) {
>   #ifdef TARGET_S390X
>           fprintf(stderr, "Please add the 'switch_amode' kernel parameter to "
> diff --git a/kvm-stub.c b/kvm-stub.c
> index 548f471..ccb7b8c 100644
> --- a/kvm-stub.c
> +++ b/kvm-stub.c
> @@ -19,6 +19,8 @@
>   #include "hw/pci/msi.h"
>   #endif
>
> +#include "hw/boards.h"
> +
>   KVMState *kvm_state;
>   bool kvm_kernel_irqchip;
>   bool kvm_async_interrupts_allowed;
> @@ -33,7 +35,7 @@ int kvm_init_vcpu(CPUState *cpu)
>       return -ENOSYS;
>   }
>
> -int kvm_init(void)
> +int kvm_init(QEMUMachine *machine)
>   {
>       return -ENOSYS;
>   }
> diff --git a/qtest.c b/qtest.c
> index 584c707..ef3c473 100644
> --- a/qtest.c
> +++ b/qtest.c
> @@ -502,7 +502,7 @@ static void qtest_event(void *opaque, int event)
>       }
>   }
>
> -int qtest_init(void)
> +int qtest_init(QEMUMachine *machine)
>   {
>       CharDriverState *chr;
>
> diff --git a/vl.c b/vl.c
> index 4e709d5..7ecc581 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -427,7 +427,12 @@ static QemuOptsList qemu_machine_opts = {
>               .name = "usb",
>               .type = QEMU_OPT_BOOL,
>               .help = "Set on/off to enable/disable usb",
> +        },{
> +            .name = "kvm_type",
> +            .type = QEMU_OPT_STRING,
> +            .help = "Set to kvm type to be used in create vm ioctl",
>           },
> +
>           { /* End of list */ }
>       },
>   };
> @@ -2608,7 +2613,7 @@ static QEMUMachine *machine_parse(const char *name)
>       exit(!name || !is_help_option(name));
>   }
>
> -static int tcg_init(void)
> +static int tcg_init(QEMUMachine *machine)
>   {
>       tcg_exec_init(tcg_tb_size * 1024 * 1024);
>       return 0;
> @@ -2618,7 +2623,7 @@ static struct {
>       const char *opt_name;
>       const char *name;
>       int (*available)(void);
> -    int (*init)(void);
> +    int (*init)(QEMUMachine *);
>       bool *allowed;
>   } accel_list[] = {
>       { "tcg", "tcg", tcg_available, tcg_init,&tcg_allowed },
> @@ -2627,7 +2632,7 @@ static struct {
>       { "qtest", "QTest", qtest_available, qtest_init,&qtest_allowed },
>   };
>
> -static int configure_accelerator(void)
> +static int configure_accelerator(QEMUMachine *machine)
>   {
>       const char *p;
>       char buf[10];
> @@ -2654,7 +2659,7 @@ static int configure_accelerator(void)
>                       continue;
>                   }
>                   *(accel_list[i].allowed) = true;
> -                ret = accel_list[i].init();
> +                ret = accel_list[i].init(machine);
>                   if (ret<  0) {
>                       init_failed = true;
>                       fprintf(stderr, "failed to initialize %s: %s\n",
> @@ -4037,10 +4042,10 @@ int main(int argc, char **argv, char **envp)
>           exit(0);
>       }
>
> -    configure_accelerator();
> +    configure_accelerator(machine);
>
>       if (!qtest_enabled()&&  qtest_chrdev) {
> -        qtest_init();
> +        qtest_init(machine);
>       }
>
>       machine_opts = qemu_get_machine_opts();
> diff --git a/xen-all.c b/xen-all.c
> index 839f14f..ac3654b 100644
> --- a/xen-all.c
> +++ b/xen-all.c
> @@ -1000,7 +1000,7 @@ static void xen_exit_notifier(Notifier *n, void *data)
>       xs_daemon_close(state->xenstore);
>   }
>
> -int xen_init(void)
> +int xen_init(QEMUMachine *machine)
>   {
>       xen_xc = xen_xc_interface_open(0, 0, 0);
>       if (xen_xc == XC_HANDLER_INITIAL_VALUE) {
> diff --git a/xen-stub.c b/xen-stub.c
> index ad189a6..59927cb 100644
> --- a/xen-stub.c
> +++ b/xen-stub.c
> @@ -47,7 +47,7 @@ qemu_irq *xen_interrupt_controller_init(void)
>       return NULL;
>   }
>
> -int xen_init(void)
> +int xen_init(QEMUMachine *machine)
>   {
>       return -ENOSYS;
>   }
>
>>>
>>>
>>>> Also, users don't want to say type=0. They want to say type=PR or
>>>> type=HV or type=HV,PR. In fact, can't you make this a property of
>>>> -accel? Then it's truly accel specific and everything should be well.
>>> If we are doing this as machine property, we can't specify string,
>>> because "HV"/"PR" are all powerpc dependent, so parsing that is not
>>> possible in kvm_init in qemu. But, yes ideally it would be nice to be
>> Well, we could do the "name to integer" conversion in an arch specific
>> function, no?
>>
>>> able to speicy the type using string. I thought accel is a machine
>>> property, hence was not sure whether I can have additional properties
>>> against that. I was using it as below.
>>>
>>>    -machine pseries,accel=kvm,kvm_type=1
> Can we really specific -accel ? I check and I am finding that as machine
> property.
>
> -aneesh
>

^ permalink raw reply

* Re: [RFC PATCH 00/11 Allow PR and HV KVM to coexist in one kernel
From: Paolo Bonzini @ 2013-10-01 11:41 UTC (permalink / raw)
  To: Alexander Graf
  Cc: Gleb Natapov, <kvm@vger.kernel.org> list, kvm-ppc,
	Paul Mackerras, Aneesh Kumar K.V, linuxppc-dev,
	Andreas Färber
In-Reply-To: <524AB3AE.5000503@suse.de>

Il 01/10/2013 13:36, Alexander Graf ha scritto:
>>>>
>>> Yes, so we should verify in the machine models that we're runnable with
>>> the currently selected type at least, to give the user a sensible error
>>> message.
>> Something like the below
> 
> I like that one a lot. Andreas, Paolo, what do you think?

Yes, it's fine.

Paolo

^ permalink raw reply

* Re: [RFC PATCH 00/11 Allow PR and HV KVM to coexist in one kernel
From: Alexander Graf @ 2013-10-01 11:43 UTC (permalink / raw)
  To: Aneesh Kumar K.V
  Cc: <kvm@vger.kernel.org> list, Gleb Natapov, kvm-ppc,
	Paul Mackerras, Paolo Bonzini, linuxppc-dev, Andreas Färber
In-Reply-To: <524AB3AE.5000503@suse.de>

On 10/01/2013 01:36 PM, Alexander Graf wrote:
> On 10/01/2013 01:26 PM, Aneesh Kumar K.V wrote:
>> Alexander Graf<agraf@suse.de>  writes:
>>
>>> On 09/30/2013 03:09 PM, Aneesh Kumar K.V wrote:
>>>> Alexander Graf<agraf@suse.de>   writes:
>>>>
>>>>> On 27.09.2013, at 12:52, Aneesh Kumar K.V wrote:
>>>>>
>>>>>> "Aneesh Kumar K.V"<aneesh.kumar@linux.vnet.ibm.com>   writes:
>>>>>>
>>>>>>> Hi All,
>>>>>>>
>>>>>>> This patch series support enabling HV and PR KVM together in the 
>>>>>>> same kernel. We
>>>>>>> extend machine property with new property "kvm_type". A value of 
>>>>>>> 1 will force HV
>>>>>>> KVM and 2 PR KVM. The default value is 0 which will select the 
>>>>>>> fastest KVM mode.
>>>>>>> ie, HV if that is supported otherwise PR.
>>>>>>>
>>>>>>> With Qemu command line having
>>>>>>>
>>>>>>> -machine pseries,accel=kvm,kvm_type=1
>>>>>>>
>>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>> failed to initialize KVM: Invalid argument
>>>>>>> [root@llmp24l02 qemu]# modprobe kvm-pr
>>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>> failed to initialize KVM: Invalid argument
>>>>>>> [root@llmp24l02 qemu]# modprobe  kvm-hv
>>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>>
>>>>>>> now with
>>>>>>>
>>>>>>> -machine pseries,accel=kvm,kvm_type=2
>>>>>>>
>>>>>>> [root@llmp24l02 qemu]# rmmod kvm-pr
>>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>> failed to initialize KVM: Invalid argument
>>>>>>> [root@llmp24l02 qemu]#
>>>>>>> [root@llmp24l02 qemu]# modprobe kvm-pr
>>>>>>> [root@llmp24l02 qemu]# bash ../qemu
>>>>>>>
>>>>>>> if don't specify kvm_type machine property, it will take a 
>>>>>>> default value 0,
>>>>>>> which means fastest supported.
>>>>>> Related qemu patch
>>>>>>
>>>>>> commit 8d139053177d48a70cb710b211ea4c2843eccdfb
>>>>>> Author: Aneesh Kumar K.V<aneesh.kumar@linux.vnet.ibm.com>
>>>>>> Date:   Mon Sep 23 12:28:37 2013 +0530
>>>>>>
>>>>>>      kvm: Add a new machine property kvm_type
>>>>>>
>>>>>>      Targets like ppc64 support different type of KVM, one which use
>>>>>>      hypervisor mode and the other which doesn't. Add a new machine
>>>>>>      property kvm_type that helps in selecting the respective ones
>>>>>>
>>>>>>      Signed-off-by: Aneesh Kumar 
>>>>>> K.V<aneesh.kumar@linux.vnet.ibm.com>
>>>>> This really is too early, as we can't possibly run in HV mode for
>>>>> non-pseries machines, so the interpretation (or at least sanity
>>>>> checking) of what values are reasonable should occur in the
>>>>> machine. That's why it's a variable in the "machine opts".
>>>> With the current code CREATE_VM will fail, because we won't have
>>>> kvm-hv.ko loaded and trying to create a vm with type 1 will fail.
>>>> Now the challenge related to moving that to machine_init or later 
>>>> is, we
>>>> depend on HV or PR callback early in CREATE_VM. With the changes we 
>>>> have
>>>>
>>>> int kvmppc_core_init_vm(struct kvm *kvm)
>>>> {
>>>>
>>>> #ifdef CONFIG_PPC64
>>>>     INIT_LIST_HEAD(&kvm->arch.spapr_tce_tables);
>>>>     INIT_LIST_HEAD(&kvm->arch.rtas_tokens);
>>>> #endif
>>>>
>>>>     return kvm->arch.kvm_ops->init_vm(kvm);
>>>> }
>>>>
>>>> Also the mmu notifier callback do end up calling kvm_unmap_hva etc 
>>>> which
>>>> are all HV/PR dependent.
>>> Yes, so we should verify in the machine models that we're runnable with
>>> the currently selected type at least, to give the user a sensible error
>>> message.
>> Something like the below
>
> I like that one a lot. Andreas, Paolo, what do you think?

To clarify this a bit more, this patch is missing a few critical pieces:

   1) The default implementation for the kvm_type callback needs to 
error out when kvm_type is set to anything
   2) All non-papr-non-e500-PPC machines should force "PR only" mode and 
bail out on anything else
   3) We're missing sensible error messages for the user still when the 
VM could not get created

but I like the overall concept :).


Alex

^ permalink raw reply

* Re: [PATCH v2 2/6] PCI/MSI: Factor out pci_get_msi_cap() interface
From: Tejun Heo @ 2013-10-01 11:55 UTC (permalink / raw)
  To: Michael Ellerman
  Cc: Joerg Roedel, x86@kernel.org, linux-kernel@vger.kernel.org,
	linux-ide@vger.kernel.org, Alexander Gordeev, Jan Beulich,
	linux-pci@vger.kernel.org, Bjorn Helgaas, linuxppc-dev,
	Ingo Molnar
In-Reply-To: <20131001073548.GI17966@concordia>

Hello,

On Tue, Oct 01, 2013 at 05:35:48PM +1000, Michael Ellerman wrote:
> > > Roughly third of the drivers just do not care and bail out once
> > > pci_enable_msix() has not succeeded. Not sure how many of these are
> > > mandated by the hardware.
> > 
> > Yeah, I mean, this type of interface is a trap.  People have to
> > actively resist to avoid doing silly stuff which is a lot to ask.
> 
> I really think you're overstating the complexity here.
> 
> Functions typically return a boolean   -> nothing to see here
> This function returns a tristate value -> brain explosion!

It is an interface which forces the driver writers to write
complicated fallback code which won't usually be excercised.  The same
goes for the hardware.  In isolation, it doesn't look like much but
things like this are bound to lead to subtle bugs which are diffiuclt
to trigger.

> > * Determine the number of MSIs the controller wants.  Don't worry
> >   about quotas or limits or anything.  Just determine the number
> >   necessary to enable enhanced interrupt handling.
> > 	
> > * Try allocating that number of MSIs.  If it fails, then just revert
> >   to single interrupt mode.  It's not the end of the world and mostly
> >   guaranteed to work.  Let's please not even try to do partial
> >   multiple interrupts.  I really don't think it's worth the risk or
> >   complexity.
> 
> It will potentially break existing setups on our hardware.

I think it'd be much better to have a separate interface for the
drivers which actually require it *in practice* rather than forcing
everyone to go "oh this interface supports that, I don't know if I
need it but let's implement fallback logic which I won't and have no
means of testing".  Are we talking about some limited number of device
drivers here?  Also, is the quota still necessary for machines in
production today?

Thanks.

-- 
tejun

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox