* Enable more than 255 VCPU support without irq remapping function in the guest @ 2016-04-26 16:14 Lan, Tianyu 2016-04-26 16:17 ` Jan Kiszka 0 siblings, 1 reply; 25+ messages in thread From: Lan, Tianyu @ 2016-04-26 16:14 UTC (permalink / raw) To: pbonzini, kvm, yang.zhang.wz, tglx, gleb, mst, jan.kiszka, x86 Hi All: Recently I have been working on extending the maximum vCPU count beyond 256 on both KVM and Xen. Some HPC cases need that many vCPUs. This requires using x2APIC in the guest, which supports 32-bit APIC IDs. The Linux kernel requires the irq remapping function when enabling x2APIC if the maximum APIC ID is greater than 255 (for details see try_to_enable_x2apic()). The irq remapping function is what allows irqs to be delivered to CPUs beyond 255; the IOAPIC only has an 8-bit target APIC ID field and can only deliver irqs to CPUs 0~255. So far neither KVM nor Xen enables the irq remapping function. Enabling it looks like a huge job: it needs rework of the IO-APIC, local APIC and MSI parts, plus virtual VT-d support added to KVM. A quicker way to enable more than 256 vCPUs is to eliminate the dependency between irq remapping and x2APIC in the guest Linux kernel. So far I can boot the guest after removing the dependency. The side effect I see is that irqs can only be delivered to vCPUs 0~255, but 256 vCPUs seem enough to balance irq requests in the guest; in most cases there are few devices in the guest. I wonder whether this is feasible. There may be other side effects I didn't think of. Your comments would be very much appreciated. Tianyu Lan Best regards. ^ permalink raw reply [flat|nested] 25+ messages in thread
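The dependency described above can be summarized with a small toy model: without interrupt remapping, the guest kernel refuses x2APIC as soon as any APIC ID exceeds 255, because the IOAPIC/MSI destination field is only 8 bits wide. The C sketch below is purely illustrative; it is not the kernel's try_to_enable_x2apic(), and its identifiers are made up.

/* Toy model of the guest-side x2APIC decision described above.
 * Illustrative only; not the kernel's try_to_enable_x2apic(). */
#include <stdbool.h>
#include <stdio.h>

/* Without IR, IOAPIC/MSI can only address 8-bit APIC IDs (0..255). */
static bool x2apic_usable(bool irq_remapping, unsigned int max_apic_id)
{
        if (irq_remapping)
                return true;            /* IR can reach any 32-bit APIC ID */
        return max_apic_id <= 255;      /* otherwise stay within 8-bit IDs */
}

int main(void)
{
        /* A 288-vCPU guest has APIC IDs up to at least 287. */
        printf("288 vCPUs, no IR: x2APIC %s\n",
               x2apic_usable(false, 287) ? "usable" : "refused");
        printf("288 vCPUs, IR:    x2APIC %s\n",
               x2apic_usable(true, 287) ? "usable" : "refused");
        return 0;
}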
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-26 16:14 Enable more than 255 VCPU support without irq remapping function in the guest Lan, Tianyu @ 2016-04-26 16:17 ` Jan Kiszka 2016-04-26 16:49 ` Radim Krčmář 2016-04-27 5:15 ` Lan Tianyu 0 siblings, 2 replies; 25+ messages in thread From: Jan Kiszka @ 2016-04-26 16:17 UTC (permalink / raw) To: Lan, Tianyu, pbonzini, kvm, yang.zhang.wz, tglx, gleb, mst, x86, Radim Krčmář, Peter Xu On 2016-04-26 18:14, Lan, Tianyu wrote: > Hi All: > > Recently I am working on extending max vcpu to more than 256 on the both > KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to > use X2APIC in the guest which supports 32-bit APIC id. Linux kernel > requires irq remapping function during enabling X2APIC when max APIC id > is more than 255(More detail please see try_to_enable_x2apic()). > > The irq remapping function helps to deliver irq to cpu 255~. IOAPIC just > supports 8-bit target APIC id field and only can deliver irq to > cpu 0~255. > > So far both KVM/Xen doesn't enable irq remapping function. If enable the > function, it seems a huge job which need to rework IO-APIC, local APIC, > MSI parts and add virtual VTD support in the KVM. > > Other quick way to enable more than 256 VCPUs is to eliminate the > dependency between irq remapping and X2APIC in the guest linux kernel. > So far I can boot the guest after removing the dependency. > The side effect I thought is that irq only can deliver to 0~255 vcpus > but 256 vcpus seem enough to balance irq requests in the guest. In the > most cases, there are fewer devices in the guest. > > I wonder whether it's feasible. There maybe some other side effects I > didn't think of. Very appreciate for your comments. Radim is working on the KVM side already, Peter is currently driving the VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) Jan PS: Please no PV mess, at least without good reasons. -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-26 16:17 ` Jan Kiszka @ 2016-04-26 16:49 ` Radim Krčmář 2016-04-27 4:10 ` Yang Zhang 2016-04-27 5:39 ` Lan Tianyu 2016-04-27 5:15 ` Lan Tianyu 1 sibling, 2 replies; 25+ messages in thread From: Radim Krčmář @ 2016-04-26 16:49 UTC (permalink / raw) To: Jan Kiszka Cc: Lan, Tianyu, pbonzini, kvm, yang.zhang.wz, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov 2016-04-26 18:17+0200, Jan Kiszka: > On 2016-04-26 18:14, Lan, Tianyu wrote: >> Hi All: >> >> Recently I am working on extending max vcpu to more than 256 on the both >> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >> requires irq remapping function during enabling X2APIC when max APIC id >> is more than 255(More detail please see try_to_enable_x2apic()). Our of curiosity, how many VCPUs are you aiming at? >> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC just >> supports 8-bit target APIC id field and only can deliver irq to >> cpu 0~255. >> >> So far both KVM/Xen doesn't enable irq remapping function. If enable the >> function, it seems a huge job which need to rework IO-APIC, local APIC, >> MSI parts and add virtual VTD support in the KVM. >> >> Other quick way to enable more than 256 VCPUs is to eliminate the >> dependency between irq remapping and X2APIC in the guest linux kernel. >> So far I can boot the guest after removing the dependency. >> The side effect I thought is that irq only can deliver to 0~255 vcpus >> but 256 vcpus seem enough to balance irq requests in the guest. In the >> most cases, there are fewer devices in the guest. >> >> I wonder whether it's feasible. There maybe some other side effects I >> didn't think of. Very appreciate for your comments. > > Radim is working on the KVM side already, Peter is currently driving the > VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) + Igor extends QEMU to support more than 255 in internal structures and ACPI. What remains mostly untracked is Seabios/OVMF. > PS: Please no PV mess, at least without good reasons. Seconded. (If we designed all related devices as virtware, then it would not be that bad, but slightly modifying and putting hardware drivers into situations that cannot happen in hardware, not even in the spec, and then juggling the KVM side to make them work, is a road to hell.) ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-26 16:49 ` Radim Krčmář @ 2016-04-27 4:10 ` Yang Zhang 2016-04-27 5:24 ` Jan Kiszka 2016-04-27 5:39 ` Lan Tianyu 1 sibling, 1 reply; 25+ messages in thread From: Yang Zhang @ 2016-04-27 4:10 UTC (permalink / raw) To: Radim Krčmář, Jan Kiszka Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016/4/27 0:49, Radim Krčmář wrote: > 2016-04-26 18:17+0200, Jan Kiszka: >> On 2016-04-26 18:14, Lan, Tianyu wrote: >>> Hi All: >>> >>> Recently I am working on extending max vcpu to more than 256 on the both >>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>> requires irq remapping function during enabling X2APIC when max APIC id >>> is more than 255(More detail please see try_to_enable_x2apic()). > > Our of curiosity, how many VCPUs are you aiming at? > >>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC just >>> supports 8-bit target APIC id field and only can deliver irq to >>> cpu 0~255. >>> >>> So far both KVM/Xen doesn't enable irq remapping function. If enable the >>> function, it seems a huge job which need to rework IO-APIC, local APIC, >>> MSI parts and add virtual VTD support in the KVM. >>> >>> Other quick way to enable more than 256 VCPUs is to eliminate the >>> dependency between irq remapping and X2APIC in the guest linux kernel. >>> So far I can boot the guest after removing the dependency. >>> The side effect I thought is that irq only can deliver to 0~255 vcpus >>> but 256 vcpus seem enough to balance irq requests in the guest. In the >>> most cases, there are fewer devices in the guest. >>> >>> I wonder whether it's feasible. There maybe some other side effects I >>> didn't think of. Very appreciate for your comments. >> >> Radim is working on the KVM side already, Peter is currently driving the >> VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) > > + Igor extends QEMU to support more than 255 in internal structures and > ACPI. What remains mostly untracked is Seabios/OVMF. If we don't want the interrupt from internal device delivers to CPU >255, do we still need the VT-d interrupt remapping emulation? I think firmware is able to send IPI to wakeup APs even without IR and OS is able to do it too. So basically, only KVM and Qemu's support is enough. > >> PS: Please no PV mess, at least without good reasons. > > Seconded. > > (If we designed all related devices as virtware, then it would not be > that bad, but slightly modifying and putting hardware drivers into > situations that cannot happen in hardware, not even in the spec, and > then juggling the KVM side to make them work, is a road to hell.) > -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 4:10 ` Yang Zhang @ 2016-04-27 5:24 ` Jan Kiszka 2016-04-27 6:24 ` Lan Tianyu 2016-04-27 9:39 ` Yang Zhang 0 siblings, 2 replies; 25+ messages in thread From: Jan Kiszka @ 2016-04-27 5:24 UTC (permalink / raw) To: Yang Zhang, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016-04-27 06:10, Yang Zhang wrote: > On 2016/4/27 0:49, Radim Krčmář wrote: >> 2016-04-26 18:17+0200, Jan Kiszka: >>> On 2016-04-26 18:14, Lan, Tianyu wrote: >>>> Hi All: >>>> >>>> Recently I am working on extending max vcpu to more than 256 on the >>>> both >>>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>>> requires irq remapping function during enabling X2APIC when max APIC id >>>> is more than 255(More detail please see try_to_enable_x2apic()). >> >> Our of curiosity, how many VCPUs are you aiming at? >> >>>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC >>>> just >>>> supports 8-bit target APIC id field and only can deliver irq to >>>> cpu 0~255. >>>> >>>> So far both KVM/Xen doesn't enable irq remapping function. If enable >>>> the >>>> function, it seems a huge job which need to rework IO-APIC, local APIC, >>>> MSI parts and add virtual VTD support in the KVM. >>>> >>>> Other quick way to enable more than 256 VCPUs is to eliminate the >>>> dependency between irq remapping and X2APIC in the guest linux kernel. >>>> So far I can boot the guest after removing the dependency. >>>> The side effect I thought is that irq only can deliver to 0~255 vcpus >>>> but 256 vcpus seem enough to balance irq requests in the guest. In the >>>> most cases, there are fewer devices in the guest. >>>> >>>> I wonder whether it's feasible. There maybe some other side effects I >>>> didn't think of. Very appreciate for your comments. >>> >>> Radim is working on the KVM side already, Peter is currently driving the >>> VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) >> >> + Igor extends QEMU to support more than 255 in internal structures and >> ACPI. What remains mostly untracked is Seabios/OVMF. > > If we don't want the interrupt from internal device delivers to CPU >>255, do we still need the VT-d interrupt remapping emulation? I think > firmware is able to send IPI to wakeup APs even without IR and OS is > able to do it too. So basically, only KVM and Qemu's support is enough. What are "internal devices" for you? And which OS do you know that would handle such artificial setups without prio massive patching? We do need VT-d IR emulation in order to present our guest a well specified and support architecture for running > 255 CPUs. Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 5:24 ` Jan Kiszka @ 2016-04-27 6:24 ` Lan Tianyu 2016-04-27 6:56 ` Jan Kiszka 0 siblings, 1 reply; 25+ messages in thread From: Lan Tianyu @ 2016-04-27 6:24 UTC (permalink / raw) To: Jan Kiszka, Yang Zhang, Radim Krčmář Cc: pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016年04月27日 13:24, Jan Kiszka wrote: >> > If we don't want the interrupt from internal device delivers to CPU >>> >>255, do we still need the VT-d interrupt remapping emulation? I think >> > firmware is able to send IPI to wakeup APs even without IR and OS is >> > able to do it too. So basically, only KVM and Qemu's support is enough. Yes, just starting more than 255 APs doesn't need IR. > What are "internal devices" for you? And which OS do you know that would > handle such artificial setups without prio massive patching? > > We do need VT-d IR emulation in order to present our guest a well > specified and support architecture for running > 255 CPUs. Changing the guest kernel would be a big concern. I found that commit ce69a784 already optimized the guest to use X2APIC without IR when APIC IDs are below 256, so I proposed my idea to get everyone's feedback on whether it's possible to relax the IR requirement when APIC IDs exceed 255 in the guest. commit ce69a784504222c3ab6f1b3c357d09ec5772127a Author: Gleb Natapov <gleb@redhat.com> Date: Mon Jul 20 15:24:17 2009 +0300 x86/apic: Enable x2APIC without interrupt remapping under KVM KVM would like to provide x2APIC interface to a guest without emulating interrupt remapping device. The reason KVM prefers guest to use x2APIC is that x2APIC interface is better virtualizable and provides better performance than mmio xAPIC interface: - msr exits are faster than mmio (no page table walk, emulation) - no need to read back ICR to look at the busy bit - one 64 bit ICR write instead of two 32 bit writes - shared code with the Hyper-V paravirt interface Included patch changes x2APIC enabling logic to enable it even if IR initialization failed, but kernel runs under KVM and no apic id is greater than 255 (if there is one spec requires BIOS to move to x2apic mode before starting an OS). It's great to know Peter is already working on the IR. -- Best regards Tianyu Lan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 6:24 ` Lan Tianyu @ 2016-04-27 6:56 ` Jan Kiszka 0 siblings, 0 replies; 25+ messages in thread From: Jan Kiszka @ 2016-04-27 6:56 UTC (permalink / raw) To: Lan Tianyu, Yang Zhang, Radim Krčmář Cc: pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016-04-27 08:24, Lan Tianyu wrote: > On 2016年04月27日 13:24, Jan Kiszka wrote: >>>> If we don't want the interrupt from internal device delivers to CPU >>>>>> 255, do we still need the VT-d interrupt remapping emulation? I think >>>> firmware is able to send IPI to wakeup APs even without IR and OS is >>>> able to do it too. So basically, only KVM and Qemu's support is enough. > > Yes, just starting more than 255 APs doesn't need IR. > >> What are "internal devices" for you? And which OS do you know that would >> handle such artificial setups without prio massive patching? >> >> We do need VT-d IR emulation in order to present our guest a well >> specified and support architecture for running > 255 CPUs. > > Changing guest kernel will be big concern. I found commit ce69a784 did > optimization to use X2APIC without IR in the guest when APIC id is less > than 256 and so I proposed my idea to see everyone's feedback. Whether > it's possible to relax the IR requirement when APIC id > 255 in the guest. You can't do that easily because you can't address those additional CPUs from *any* device then, only via IPIs. That means, Linux would have to be changed to only set up IRQ affinity masks in the 0-254 range. I suppose you would even have to patch tools like irqbalanced to not issue mask changes via /proc that include larger CPU IDs. Practically not feasible, already on Linux. Not to speak of other guest OSes. Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 5:24 ` Jan Kiszka 2016-04-27 6:24 ` Lan Tianyu @ 2016-04-27 9:39 ` Yang Zhang 2016-04-27 9:45 ` Jan Kiszka 1 sibling, 1 reply; 25+ messages in thread From: Yang Zhang @ 2016-04-27 9:39 UTC (permalink / raw) To: Jan Kiszka, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016/4/27 13:24, Jan Kiszka wrote: > On 2016-04-27 06:10, Yang Zhang wrote: >> On 2016/4/27 0:49, Radim Krčmář wrote: >>> 2016-04-26 18:17+0200, Jan Kiszka: >>>> On 2016-04-26 18:14, Lan, Tianyu wrote: >>>>> Hi All: >>>>> >>>>> Recently I am working on extending max vcpu to more than 256 on the >>>>> both >>>>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>>>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>>>> requires irq remapping function during enabling X2APIC when max APIC id >>>>> is more than 255(More detail please see try_to_enable_x2apic()). >>> >>> Our of curiosity, how many VCPUs are you aiming at? >>> >>>>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC >>>>> just >>>>> supports 8-bit target APIC id field and only can deliver irq to >>>>> cpu 0~255. >>>>> >>>>> So far both KVM/Xen doesn't enable irq remapping function. If enable >>>>> the >>>>> function, it seems a huge job which need to rework IO-APIC, local APIC, >>>>> MSI parts and add virtual VTD support in the KVM. >>>>> >>>>> Other quick way to enable more than 256 VCPUs is to eliminate the >>>>> dependency between irq remapping and X2APIC in the guest linux kernel. >>>>> So far I can boot the guest after removing the dependency. >>>>> The side effect I thought is that irq only can deliver to 0~255 vcpus >>>>> but 256 vcpus seem enough to balance irq requests in the guest. In the >>>>> most cases, there are fewer devices in the guest. >>>>> >>>>> I wonder whether it's feasible. There maybe some other side effects I >>>>> didn't think of. Very appreciate for your comments. >>>> >>>> Radim is working on the KVM side already, Peter is currently driving the >>>> VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) >>> >>> + Igor extends QEMU to support more than 255 in internal structures and >>> ACPI. What remains mostly untracked is Seabios/OVMF. >> >> If we don't want the interrupt from internal device delivers to CPU >>> 255, do we still need the VT-d interrupt remapping emulation? I think >> firmware is able to send IPI to wakeup APs even without IR and OS is >> able to do it too. So basically, only KVM and Qemu's support is enough. > > What are "internal devices" for you? And which OS do you know that would > handle such artificial setups without prio massive patching? Sorry, a typo. I mean the external devices of IOAPIC/MSI/MSIX. Doesn't current Linux use x2apic without IR in VM? > > We do need VT-d IR emulation in order to present our guest a well > specified and support architecture for running > 255 CPUs. I mean in Tianyu's case, if he doesn't care about to deliver external interrupt to CPU >255, IR is not required. -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 9:39 ` Yang Zhang @ 2016-04-27 9:45 ` Jan Kiszka 2016-04-28 1:11 ` Yang Zhang 0 siblings, 1 reply; 25+ messages in thread From: Jan Kiszka @ 2016-04-27 9:45 UTC (permalink / raw) To: Yang Zhang, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016-04-27 11:39, Yang Zhang wrote: > On 2016/4/27 13:24, Jan Kiszka wrote: >> On 2016-04-27 06:10, Yang Zhang wrote: >>> On 2016/4/27 0:49, Radim Krčmář wrote: >>>> 2016-04-26 18:17+0200, Jan Kiszka: >>>>> On 2016-04-26 18:14, Lan, Tianyu wrote: >>>>>> Hi All: >>>>>> >>>>>> Recently I am working on extending max vcpu to more than 256 on the >>>>>> both >>>>>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>>>>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>>>>> requires irq remapping function during enabling X2APIC when max >>>>>> APIC id >>>>>> is more than 255(More detail please see try_to_enable_x2apic()). >>>> >>>> Our of curiosity, how many VCPUs are you aiming at? >>>> >>>>>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC >>>>>> just >>>>>> supports 8-bit target APIC id field and only can deliver irq to >>>>>> cpu 0~255. >>>>>> >>>>>> So far both KVM/Xen doesn't enable irq remapping function. If enable >>>>>> the >>>>>> function, it seems a huge job which need to rework IO-APIC, local >>>>>> APIC, >>>>>> MSI parts and add virtual VTD support in the KVM. >>>>>> >>>>>> Other quick way to enable more than 256 VCPUs is to eliminate the >>>>>> dependency between irq remapping and X2APIC in the guest linux >>>>>> kernel. >>>>>> So far I can boot the guest after removing the dependency. >>>>>> The side effect I thought is that irq only can deliver to 0~255 vcpus >>>>>> but 256 vcpus seem enough to balance irq requests in the guest. In >>>>>> the >>>>>> most cases, there are fewer devices in the guest. >>>>>> >>>>>> I wonder whether it's feasible. There maybe some other side effects I >>>>>> didn't think of. Very appreciate for your comments. >>>>> >>>>> Radim is working on the KVM side already, Peter is currently >>>>> driving the >>>>> VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) >>>> >>>> + Igor extends QEMU to support more than 255 in internal structures and >>>> ACPI. What remains mostly untracked is Seabios/OVMF. >>> >>> If we don't want the interrupt from internal device delivers to CPU >>>> 255, do we still need the VT-d interrupt remapping emulation? I think >>> firmware is able to send IPI to wakeup APs even without IR and OS is >>> able to do it too. So basically, only KVM and Qemu's support is enough. >> >> What are "internal devices" for you? And which OS do you know that would >> handle such artificial setups without prio massive patching? > > Sorry, a typo. I mean the external devices of IOAPIC/MSI/MSIX. Doesn't > current Linux use x2apic without IR in VM? If and only if there only need to be 254 CPUs to be addressed. > >> >> We do need VT-d IR emulation in order to present our guest a well >> specified and support architecture for running > 255 CPUs. > > I mean in Tianyu's case, if he doesn't care about to deliver external > interrupt to CPU >255, IR is not required. What matters is the guest OS. See my other reply on this why this doesn't work, even for Linux. 
Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 9:45 ` Jan Kiszka @ 2016-04-28 1:11 ` Yang Zhang 2016-04-28 6:54 ` Jan Kiszka 0 siblings, 1 reply; 25+ messages in thread From: Yang Zhang @ 2016-04-28 1:11 UTC (permalink / raw) To: Jan Kiszka, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016/4/27 17:45, Jan Kiszka wrote: > On 2016-04-27 11:39, Yang Zhang wrote: >> On 2016/4/27 13:24, Jan Kiszka wrote: >>> On 2016-04-27 06:10, Yang Zhang wrote: >>>> On 2016/4/27 0:49, Radim Krčmář wrote: >>>>> 2016-04-26 18:17+0200, Jan Kiszka: >>>>>> On 2016-04-26 18:14, Lan, Tianyu wrote: >>>>>>> Hi All: >>>>>>> >>>>>>> Recently I am working on extending max vcpu to more than 256 on the >>>>>>> both >>>>>>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>>>>>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>>>>>> requires irq remapping function during enabling X2APIC when max >>>>>>> APIC id >>>>>>> is more than 255(More detail please see try_to_enable_x2apic()). >>>>> >>>>> Our of curiosity, how many VCPUs are you aiming at? >>>>> >>>>>>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC >>>>>>> just >>>>>>> supports 8-bit target APIC id field and only can deliver irq to >>>>>>> cpu 0~255. >>>>>>> >>>>>>> So far both KVM/Xen doesn't enable irq remapping function. If enable >>>>>>> the >>>>>>> function, it seems a huge job which need to rework IO-APIC, local >>>>>>> APIC, >>>>>>> MSI parts and add virtual VTD support in the KVM. >>>>>>> >>>>>>> Other quick way to enable more than 256 VCPUs is to eliminate the >>>>>>> dependency between irq remapping and X2APIC in the guest linux >>>>>>> kernel. >>>>>>> So far I can boot the guest after removing the dependency. >>>>>>> The side effect I thought is that irq only can deliver to 0~255 vcpus >>>>>>> but 256 vcpus seem enough to balance irq requests in the guest. In >>>>>>> the >>>>>>> most cases, there are fewer devices in the guest. >>>>>>> >>>>>>> I wonder whether it's feasible. There maybe some other side effects I >>>>>>> didn't think of. Very appreciate for your comments. >>>>>> >>>>>> Radim is working on the KVM side already, Peter is currently >>>>>> driving the >>>>>> VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) >>>>> >>>>> + Igor extends QEMU to support more than 255 in internal structures and >>>>> ACPI. What remains mostly untracked is Seabios/OVMF. >>>> >>>> If we don't want the interrupt from internal device delivers to CPU >>>>> 255, do we still need the VT-d interrupt remapping emulation? I think >>>> firmware is able to send IPI to wakeup APs even without IR and OS is >>>> able to do it too. So basically, only KVM and Qemu's support is enough. >>> >>> What are "internal devices" for you? And which OS do you know that would >>> handle such artificial setups without prio massive patching? >> >> Sorry, a typo. I mean the external devices of IOAPIC/MSI/MSIX. Doesn't >> current Linux use x2apic without IR in VM? > > If and only if there only need to be 254 CPUs to be addressed. > >> >>> >>> We do need VT-d IR emulation in order to present our guest a well >>> specified and support architecture for running > 255 CPUs. >> >> I mean in Tianyu's case, if he doesn't care about to deliver external >> interrupt to CPU >255, IR is not required. > > What matters is the guest OS. See my other reply on this why this > doesn't work, even for Linux. 
Since there are only a few devices in his case, setting the irq affinity manually is enough. -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
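"Setting the irq affinity manually" here means writing a CPU mask to /proc/irq/<N>/smp_affinity in the guest. A minimal sketch follows; the IRQ number and mask are placeholders, and, as Jan notes above, without interrupt remapping such masks would have to stay within CPUs 0-254.

/* Minimal sketch of steering one guest IRQ by hand: write a hex CPU
 * bitmask to /proc/irq/<N>/smp_affinity. IRQ 24 and the mask "f" are
 * placeholders; needs root inside the guest. */
#include <stdio.h>

int main(void)
{
        FILE *f = fopen("/proc/irq/24/smp_affinity", "w");

        if (!f) {
                perror("fopen /proc/irq/24/smp_affinity");
                return 1;
        }
        fputs("f\n", f);        /* hex mask 0xf: allow CPUs 0-3 */
        fclose(f);
        return 0;
}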
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-28 1:11 ` Yang Zhang @ 2016-04-28 6:54 ` Jan Kiszka 2016-04-28 15:32 ` Radim Krčmář 0 siblings, 1 reply; 25+ messages in thread From: Jan Kiszka @ 2016-04-28 6:54 UTC (permalink / raw) To: Yang Zhang, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016-04-28 03:11, Yang Zhang wrote: > On 2016/4/27 17:45, Jan Kiszka wrote: >> On 2016-04-27 11:39, Yang Zhang wrote: >>> On 2016/4/27 13:24, Jan Kiszka wrote: >>>> On 2016-04-27 06:10, Yang Zhang wrote: >>>>> On 2016/4/27 0:49, Radim Krčmář wrote: >>>>>> 2016-04-26 18:17+0200, Jan Kiszka: >>>>>>> On 2016-04-26 18:14, Lan, Tianyu wrote: >>>>>>>> Hi All: >>>>>>>> >>>>>>>> Recently I am working on extending max vcpu to more than 256 on the >>>>>>>> both >>>>>>>> KVM/Xen. For some HPC cases, it needs many vcpus. The job >>>>>>>> requires to >>>>>>>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>>>>>>> requires irq remapping function during enabling X2APIC when max >>>>>>>> APIC id >>>>>>>> is more than 255(More detail please see try_to_enable_x2apic()). >>>>>> >>>>>> Our of curiosity, how many VCPUs are you aiming at? >>>>>> >>>>>>>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC >>>>>>>> just >>>>>>>> supports 8-bit target APIC id field and only can deliver irq to >>>>>>>> cpu 0~255. >>>>>>>> >>>>>>>> So far both KVM/Xen doesn't enable irq remapping function. If >>>>>>>> enable >>>>>>>> the >>>>>>>> function, it seems a huge job which need to rework IO-APIC, local >>>>>>>> APIC, >>>>>>>> MSI parts and add virtual VTD support in the KVM. >>>>>>>> >>>>>>>> Other quick way to enable more than 256 VCPUs is to eliminate the >>>>>>>> dependency between irq remapping and X2APIC in the guest linux >>>>>>>> kernel. >>>>>>>> So far I can boot the guest after removing the dependency. >>>>>>>> The side effect I thought is that irq only can deliver to 0~255 >>>>>>>> vcpus >>>>>>>> but 256 vcpus seem enough to balance irq requests in the guest. In >>>>>>>> the >>>>>>>> most cases, there are fewer devices in the guest. >>>>>>>> >>>>>>>> I wonder whether it's feasible. There maybe some other side >>>>>>>> effects I >>>>>>>> didn't think of. Very appreciate for your comments. >>>>>>> >>>>>>> Radim is working on the KVM side already, Peter is currently >>>>>>> driving the >>>>>>> VT-d interrupt emulation topic in QEMU. It's in reach, I would >>>>>>> say. :) >>>>>> >>>>>> + Igor extends QEMU to support more than 255 in internal >>>>>> structures and >>>>>> ACPI. What remains mostly untracked is Seabios/OVMF. >>>>> >>>>> If we don't want the interrupt from internal device delivers to CPU >>>>>> 255, do we still need the VT-d interrupt remapping emulation? I think >>>>> firmware is able to send IPI to wakeup APs even without IR and OS is >>>>> able to do it too. So basically, only KVM and Qemu's support is >>>>> enough. >>>> >>>> What are "internal devices" for you? And which OS do you know that >>>> would >>>> handle such artificial setups without prio massive patching? >>> >>> Sorry, a typo. I mean the external devices of IOAPIC/MSI/MSIX. Doesn't >>> current Linux use x2apic without IR in VM? >> >> If and only if there only need to be 254 CPUs to be addressed. >> >>> >>>> >>>> We do need VT-d IR emulation in order to present our guest a well >>>> specified and support architecture for running > 255 CPUs. 
>>> >>> I mean in Tianyu's case, if he doesn't care about to deliver external >>> interrupt to CPU >255, IR is not required. >> >> What matters is the guest OS. See my other reply on this why this >> doesn't work, even for Linux. > > Since there only few devices in his case, set the irq affinity manually > is enough. Ah, wait - are we talking about emulating the Xeon Phi architecture in QEMU, accelerated by KVM? Then maybe you can point to a more detailed description of its interrupt architecture than that rather vague "Xeon Phi Coprocessor System Software Developers Guide" I was just looking at provides. While the Phi may not have VT-d internally, it still has a need to translate incoming MSI/MSI-X messages (via that PEG port?) to something that can address more than 255 APIC IDs, no? Possibly, you only need an extended KVM kernel interface for the Phi that allows injecting APIC interrupts to more than 255 CPUs. That interface has to be designed anyway, for normal x86 systems, and is what Radim was talking about. Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-28 6:54 ` Jan Kiszka @ 2016-04-28 15:32 ` Radim Krčmář 2016-04-29 2:09 ` Yang Zhang 0 siblings, 1 reply; 25+ messages in thread From: Radim Krčmář @ 2016-04-28 15:32 UTC (permalink / raw) To: Jan Kiszka Cc: Yang Zhang, Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov 2016-04-28 08:54+0200, Jan Kiszka: > On 2016-04-28 03:11, Yang Zhang wrote: >> On 2016/4/27 17:45, Jan Kiszka wrote: >>> On 2016-04-27 11:39, Yang Zhang wrote: >>>> I mean in Tianyu's case, if he doesn't care about to deliver external >>>> interrupt to CPU >255, IR is not required. >>> >>> What matters is the guest OS. See my other reply on this why this >>> doesn't work, even for Linux. >> >> Since there only few devices in his case, set the irq affinity manually >> is enough. You could configure non-IPIs to work, but we want to create options that are hard to break. > Ah, wait - are we talking about emulating the Xeon Phi architecture in > QEMU, accelerated by KVM? Knights Landing will also be manufactured as a CPU, hopefully without many peculiarities. I think we are talking about extending KVM's IR-less x2APIC, when standard x2APIC is the future. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-28 15:32 ` Radim Krčmář @ 2016-04-29 2:09 ` Yang Zhang 2016-04-29 3:01 ` Nadav Amit 2016-04-29 4:59 ` Jan Kiszka 0 siblings, 2 replies; 25+ messages in thread From: Yang Zhang @ 2016-04-29 2:09 UTC (permalink / raw) To: Radim Krčmář, Jan Kiszka Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016/4/28 23:32, Radim Krčmář wrote: > 2016-04-28 08:54+0200, Jan Kiszka: >> On 2016-04-28 03:11, Yang Zhang wrote: >>> On 2016/4/27 17:45, Jan Kiszka wrote: >>>> On 2016-04-27 11:39, Yang Zhang wrote: >>>>> I mean in Tianyu's case, if he doesn't care about to deliver external >>>>> interrupt to CPU >255, IR is not required. >>>> >>>> What matters is the guest OS. See my other reply on this why this >>>> doesn't work, even for Linux. >>> >>> Since there only few devices in his case, set the irq affinity manually >>> is enough. > > You could configure non-IPIs to work, but we want to create options that > are hard to break. > >> Ah, wait - are we talking about emulating the Xeon Phi architecture in >> QEMU, accelerated by KVM? > > Knights Landing will also be manufactured as a CPU, hopefully without > many peculiarities. > > I think we are talking about extending KVM's IR-less x2APIC, when > standard x2APIC is the future. Yes, Since IR is only useful for the external device, and 255 CPUs is enough to handle the interrupts from external devices. Besides, i think virtual VT-d will bring extra performance impaction for devices, so if IR-less X2APIC also works well with more than 255 VCPUs, maybe extending KVM with IR-less x2apic is not a bad idea. -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-29 2:09 ` Yang Zhang @ 2016-04-29 3:01 ` Nadav Amit 2016-05-03 1:34 ` Yang Zhang 2016-04-29 4:59 ` Jan Kiszka 1 sibling, 1 reply; 25+ messages in thread From: Nadav Amit @ 2016-04-29 3:01 UTC (permalink / raw) To: Yang Zhang Cc: Radim Krčmář, Jan Kiszka, Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov Yang Zhang <yang.zhang.wz@gmail.com> wrote: > On 2016/4/28 23:32, Radim Krčmář wrote: >> I think we are talking about extending KVM's IR-less x2APIC, when >> standard x2APIC is the future. > > Yes, Since IR is only useful for the external device, and 255 CPUs is enough to handle the interrupts from external devices. Besides, i think virtual VT-d will bring extra performance impaction for devices, so if IR-less X2APIC also works well with more than 255 VCPUs, maybe extending KVM with IR-less x2apic is not a bad idea. So will you use x2APIC physical mode in this system? Try not to send a multicast IPI to 400 cores in the VM... Nadav ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-29 3:01 ` Nadav Amit @ 2016-05-03 1:34 ` Yang Zhang 0 siblings, 0 replies; 25+ messages in thread From: Yang Zhang @ 2016-05-03 1:34 UTC (permalink / raw) To: Nadav Amit Cc: Radim Krčmář, Jan Kiszka, Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016/4/29 11:01, Nadav Amit wrote: > Yang Zhang <yang.zhang.wz@gmail.com> wrote: > >> On 2016/4/28 23:32, Radim Krčmář wrote: >>> I think we are talking about extending KVM's IR-less x2APIC, when >>> standard x2APIC is the future. >> >> Yes, Since IR is only useful for the external device, and 255 CPUs is enough to handle the interrupts from external devices. Besides, i think virtual VT-d will bring extra performance impaction for devices, so if IR-less X2APIC also works well with more than 255 VCPUs, maybe extending KVM with IR-less x2apic is not a bad idea. > > So will you use x2APIC physical mode in this system? Probably, cluster mode is the better choice. > Try not to send a multicast IPI to 400 cores in the VM... Yes, a multicast IPI to so many cores is a disaster in VM, like flush_tlb_others(). -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
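As background for the physical-vs-cluster question: in x2APIC cluster mode the 32-bit logical ID carries a 16-bit cluster number plus a one-hot bitmap of at most 16 CPUs, so a single "multicast" ICR write only reaches CPUs inside one cluster, which is why an IPI to hundreds of cores turns into many writes. A small sketch of the architectural mapping (the helper name is made up):

/* x2APIC cluster-mode logical ID: bits 31:16 = cluster (APIC ID / 16),
 * bits 15:0 = one-hot position within the cluster (APIC ID % 16).
 * Helper name is illustrative. */
#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t x2apic_logical_id(uint32_t apic_id)
{
        return ((apic_id >> 4) << 16) | (1u << (apic_id & 0xf));
}

int main(void)
{
        /* APIC ID 287 (the 288th CPU if IDs are contiguous) -> cluster 17, bit 15. */
        printf("APIC ID 287 -> logical ID 0x%08" PRIx32 "\n",
               x2apic_logical_id(287));
        return 0;
}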
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-29 2:09 ` Yang Zhang 2016-04-29 3:01 ` Nadav Amit @ 2016-04-29 4:59 ` Jan Kiszka 2016-05-03 1:52 ` Yang Zhang 1 sibling, 1 reply; 25+ messages in thread From: Jan Kiszka @ 2016-04-29 4:59 UTC (permalink / raw) To: Yang Zhang, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016-04-29 04:09, Yang Zhang wrote: > On 2016/4/28 23:32, Radim Krčmář wrote: >> 2016-04-28 08:54+0200, Jan Kiszka: >>> On 2016-04-28 03:11, Yang Zhang wrote: >>>> On 2016/4/27 17:45, Jan Kiszka wrote: >>>>> On 2016-04-27 11:39, Yang Zhang wrote: >>>>>> I mean in Tianyu's case, if he doesn't care about to deliver external >>>>>> interrupt to CPU >255, IR is not required. >>>>> >>>>> What matters is the guest OS. See my other reply on this why this >>>>> doesn't work, even for Linux. >>>> >>>> Since there only few devices in his case, set the irq affinity manually >>>> is enough. >> >> You could configure non-IPIs to work, but we want to create options that >> are hard to break. >> >>> Ah, wait - are we talking about emulating the Xeon Phi architecture in >>> QEMU, accelerated by KVM? >> >> Knights Landing will also be manufactured as a CPU, hopefully without >> many peculiarities. >> >> I think we are talking about extending KVM's IR-less x2APIC, when >> standard x2APIC is the future. > > Yes, Since IR is only useful for the external device, and 255 CPUs is > enough to handle the interrupts from external devices. Besides, i think > virtual VT-d will bring extra performance impaction for devices, so if > IR-less X2APIC also works well with more than 255 VCPUs, maybe extending > KVM with IR-less x2apic is not a bad idea. IR-less x2APIC for guest architectures that are expected to provide IR remains a bad idea, at least until we have hard numbers what this suspected performance impact actually is. Unless you update IRQ affinities an insane rates, the impact should not be relevant because remapping results are cached (for the irqfd hot-path) or you are already taking the long way (userspace device emulation). Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-29 4:59 ` Jan Kiszka @ 2016-05-03 1:52 ` Yang Zhang 2016-05-03 2:03 ` Nadav Amit 0 siblings, 1 reply; 25+ messages in thread From: Yang Zhang @ 2016-05-03 1:52 UTC (permalink / raw) To: Jan Kiszka, Radim Krčmář Cc: Lan, Tianyu, pbonzini, kvm, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov On 2016/4/29 12:59, Jan Kiszka wrote: > On 2016-04-29 04:09, Yang Zhang wrote: >> On 2016/4/28 23:32, Radim Krčmář wrote: >>> 2016-04-28 08:54+0200, Jan Kiszka: >>>> On 2016-04-28 03:11, Yang Zhang wrote: >>>>> On 2016/4/27 17:45, Jan Kiszka wrote: >>>>>> On 2016-04-27 11:39, Yang Zhang wrote: >>>>>>> I mean in Tianyu's case, if he doesn't care about to deliver external >>>>>>> interrupt to CPU >255, IR is not required. >>>>>> >>>>>> What matters is the guest OS. See my other reply on this why this >>>>>> doesn't work, even for Linux. >>>>> >>>>> Since there only few devices in his case, set the irq affinity manually >>>>> is enough. >>> >>> You could configure non-IPIs to work, but we want to create options that >>> are hard to break. >>> >>>> Ah, wait - are we talking about emulating the Xeon Phi architecture in >>>> QEMU, accelerated by KVM? >>> >>> Knights Landing will also be manufactured as a CPU, hopefully without >>> many peculiarities. >>> >>> I think we are talking about extending KVM's IR-less x2APIC, when >>> standard x2APIC is the future. >> >> Yes, Since IR is only useful for the external device, and 255 CPUs is >> enough to handle the interrupts from external devices. Besides, i think >> virtual VT-d will bring extra performance impaction for devices, so if >> IR-less X2APIC also works well with more than 255 VCPUs, maybe extending >> KVM with IR-less x2apic is not a bad idea. > > IR-less x2APIC for guest architectures that are expected to provide IR > remains a bad idea, at least until we have hard numbers what this > suspected performance impact actually is. Unless you update IRQ > affinities an insane rates, the impact should not be relevant because > remapping results are cached (for the irqfd hot-path) or you are already > taking the long way (userspace device emulation). I think it is not only interrupt. There must have the DMAR emulation and the cost for DMA is heavy in VM(DMA operations are very frequently). I cannot remember whether there are strong dependency in hardware between DMAR and IR(I know IR is relying on QI). Even hardware dependency is ok, is it ok for OS running in hardware with IR but without DMAR? -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-05-03 1:52 ` Yang Zhang @ 2016-05-03 2:03 ` Nadav Amit 2016-05-03 4:55 ` Jan Kiszka 0 siblings, 1 reply; 25+ messages in thread From: Nadav Amit @ 2016-05-03 2:03 UTC (permalink / raw) To: Yang Zhang Cc: Jan Kiszka, Radim Krčmář, Lan, Tianyu, Paolo Bonzini, kvm, Thomas Gleixner, mst, x86, Peter Xu, Igor Mammedov Yang Zhang <yang.zhang.wz@gmail.com> wrote: > I think it is not only interrupt. There must have the DMAR emulation and > the cost for DMA is heavy in VM(DMA operations are very frequently). I > cannot remember whether there are strong dependency in hardware between > DMAR and IR(I know IR is relying on QI). Even hardware dependency is ok, > is it ok for OS running in hardware with IR but without DMAR? Do you know a way for the IOMMU to report that DMAR is disabled, while IR is enabled? Anyhow, the VM can use IOMMU passthrough mode to avoid most IOMMU overhead. Regardless, a recent patch-set should improve DMAR performance considerably [1]. Regards, Nadav [1] https://www.mail-archive.com/iommu@lists.linux-foundation.org/msg12386.html ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-05-03 2:03 ` Nadav Amit @ 2016-05-03 4:55 ` Jan Kiszka 2016-05-04 1:46 ` Yang Zhang 0 siblings, 1 reply; 25+ messages in thread From: Jan Kiszka @ 2016-05-03 4:55 UTC (permalink / raw) To: Nadav Amit, Yang Zhang Cc: Radim Krčmář, Lan, Tianyu, Paolo Bonzini, kvm, Thomas Gleixner, mst, x86, Peter Xu, Igor Mammedov On 2016-05-03 04:03, Nadav Amit wrote: > Yang Zhang <yang.zhang.wz@gmail.com> wrote: > >> I think it is not only interrupt. There must have the DMAR emulation and >> the cost for DMA is heavy in VM(DMA operations are very frequently). I >> cannot remember whether there are strong dependency in hardware between >> DMAR and IR(I know IR is relying on QI). Even hardware dependency is ok, >> is it ok for OS running in hardware with IR but without DMAR? > > Do you know a way for the IOMMU to report that DMAR is disabled, while IR > is enabled? The hardware cannot decide about disabling this, but the guest can, of course. In fact, you can even configure Linux to have DMAR off by default until you pass "intel_iommu=on" on the command line (I think distros still do this - at least they used to). No idea about other OSes, though. > > Anyhow, the VM can use IOMMU passthrough mode to avoid most IOMMU overhead. > Regardless, a recent patch-set should improve DMAR performance > considerably [1]. The bottleneck with emulated DMAR is rather in QEMU. But DMAR can be almost as cheap as IR once we get it running for VFIO and vhost: both need proper caching because they do not work with QEMU in the loop for each and every DMA transfer. Still no need to deviate from physical hardware. Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-05-03 4:55 ` Jan Kiszka @ 2016-05-04 1:46 ` Yang Zhang 2016-05-04 1:56 ` Nadav Amit 2016-05-04 5:38 ` Jan Kiszka 0 siblings, 2 replies; 25+ messages in thread From: Yang Zhang @ 2016-05-04 1:46 UTC (permalink / raw) To: Jan Kiszka, Nadav Amit Cc: Radim Krčmář, Lan, Tianyu, Paolo Bonzini, kvm, Thomas Gleixner, mst, x86, Peter Xu, Igor Mammedov On 2016/5/3 12:55, Jan Kiszka wrote: > On 2016-05-03 04:03, Nadav Amit wrote: >> Yang Zhang <yang.zhang.wz@gmail.com> wrote: >> >>> I think it is not only interrupt. There must have the DMAR emulation and >>> the cost for DMA is heavy in VM(DMA operations are very frequently). I >>> cannot remember whether there are strong dependency in hardware between >>> DMAR and IR(I know IR is relying on QI). Even hardware dependency is ok, >>> is it ok for OS running in hardware with IR but without DMAR? >> >> Do you know a way for the IOMMU to report that DMAR is disabled, while IR >> is enabled? > > The hardware cannot decide about disabling this, but the guest can, of > course. In fact, you can even configure Linux to have DMAR off by > default until you pass "intel_iommu=on" on the command line (I think > distros still do this - at least they used to). No idea about other > OSes, though. If we can disable DMAR in guest, it should be enough. > >> >> Anyhow, the VM can use IOMMU passthrough mode to avoid most IOMMU overhead. >> Regardless, a recent patch-set should improve DMAR performance >> considerably [1]. > > The bottleneck with emulated DMAR is rather in QEMU. But DMAR can be > almost as cheap as IR once we get it running for VFIO and vhost: both > need proper caching because they do not work with QEMU in the loop for > each and every DMA transfer. Still no need to deviate from physical > hardware. Sorry, i don't know detail about how VFIO and vhost work with IR. But it seems hard to do proper caching since DMA allocations are very frequently in Linux unless we move the whole iommu emulation to kernel. Another idea is using two iommus: one for Qemu and one for device in kernel like vfio and vhost. I did the similar thing in Xen in several years ago which uses two iommus solution and it works well in my experiment environment. Besides, this solution is easy for nested device pass-through. The Page 32 of [1] has more detail. [1] http://docplayer.net/10559370-Nested-virtualization.html -- best regards yang ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-05-04 1:46 ` Yang Zhang @ 2016-05-04 1:56 ` Nadav Amit 2016-05-04 5:38 ` Jan Kiszka 1 sibling, 0 replies; 25+ messages in thread From: Nadav Amit @ 2016-05-04 1:56 UTC (permalink / raw) To: Yang Zhang Cc: Jan Kiszka, Radim Krčmář, Lan, Tianyu, Paolo Bonzini, kvm, Thomas Gleixner, mst, x86, Peter Xu, Igor Mammedov Yang Zhang <yang.zhang.wz@gmail.com> wrote: > On 2016/5/3 12:55, Jan Kiszka wrote: >> On 2016-05-03 04:03, Nadav Amit wrote: >> >> The bottleneck with emulated DMAR is rather in QEMU. But DMAR can be >> almost as cheap as IR once we get it running for VFIO and vhost: both >> need proper caching because they do not work with QEMU in the loop for >> each and every DMA transfer. Still no need to deviate from physical >> hardware. > > Sorry, i don't know detail about how VFIO and vhost work with IR. But it seems hard to do proper caching since DMA allocations are very frequently in Linux unless we move the whole iommu emulation to kernel. Another idea is using two iommus: one for Qemu and one for device in kernel like vfio and vhost. I did the similar thing in Xen in several years ago which uses two iommus solution and it works well in my experiment environment. Besides, this solution is easy for nested device pass-through. The Page 32 of [1] has more detail. > > [1] http://docplayer.net/10559370-Nested-virtualization.html I did a similar work as well several years ago [2], and achieved similar results. The problem with these results is that you don’t show the CPU utilization. Sure, for 1GBE netperf it might be fine, but I am not sure it would be useful for more demanding tasks. Isn’t it possible to use the PASID as a sort of virtual function? Regards, Nadav [2] https://www.usenix.org/legacy/event/atc11/tech/final_files/Amit.pdf ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-05-04 1:46 ` Yang Zhang 2016-05-04 1:56 ` Nadav Amit @ 2016-05-04 5:38 ` Jan Kiszka 1 sibling, 0 replies; 25+ messages in thread From: Jan Kiszka @ 2016-05-04 5:38 UTC (permalink / raw) To: Yang Zhang, Nadav Amit Cc: Radim Krčmář, Lan, Tianyu, Paolo Bonzini, kvm, Thomas Gleixner, mst, x86, Peter Xu, Igor Mammedov On 2016-05-04 03:46, Yang Zhang wrote: > On 2016/5/3 12:55, Jan Kiszka wrote: >> On 2016-05-03 04:03, Nadav Amit wrote: >>> >>> Anyhow, the VM can use IOMMU passthrough mode to avoid most IOMMU >>> overhead. >>> Regardless, a recent patch-set should improve DMAR performance >>> considerably [1]. >> >> The bottleneck with emulated DMAR is rather in QEMU. But DMAR can be >> almost as cheap as IR once we get it running for VFIO and vhost: both >> need proper caching because they do not work with QEMU in the loop for >> each and every DMA transfer. Still no need to deviate from physical >> hardware. > > Sorry, i don't know detail about how VFIO and vhost work with IR. But it > seems hard to do proper caching since DMA allocations are very > frequently in Linux unless we move the whole iommu emulation to kernel. There is technically no reason for Linux to reprogram the DMAR units unless it changes partitioning (or really wants to enforce strict DMA containment for each device). You can surely tune this to no updates at all for the guest Linux under normal operations. Jan -- Siemens AG, Corporate Technology, CT RDA ITP SES-DE Corporate Competence Center Embedded Linux ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-26 16:49 ` Radim Krčmář 2016-04-27 4:10 ` Yang Zhang @ 2016-04-27 5:39 ` Lan Tianyu 2016-04-27 14:38 ` Radim Krčmář 1 sibling, 1 reply; 25+ messages in thread From: Lan Tianyu @ 2016-04-27 5:39 UTC (permalink / raw) To: Radim Krčmář, Jan Kiszka Cc: pbonzini, kvm, yang.zhang.wz, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov Hi Radim: On 2016年04月27日 00:49, Radim Krčmář wrote: > 2016-04-26 18:17+0200, Jan Kiszka: >> On 2016-04-26 18:14, Lan, Tianyu wrote: >>> Hi All: >>> >>> Recently I am working on extending max vcpu to more than 256 on the both >>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>> requires irq remapping function during enabling X2APIC when max APIC id >>> is more than 255(More detail please see try_to_enable_x2apic()). > > Our of curiosity, how many VCPUs are you aiming at? I think it's 1024. In the short term, we hope the hypervisor supports at least 288 vCPUs, because the Xeon Phi chip already supports 288 logical CPUs. As hardware develops there will be more logical CPUs, and we hope one guest can use all of the CPU resources on the chip to meet HPC requirements. > >>> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC just >>> supports 8-bit target APIC id field and only can deliver irq to >>> cpu 0~255. >>> >>> So far both KVM/Xen doesn't enable irq remapping function. If enable the >>> function, it seems a huge job which need to rework IO-APIC, local APIC, >>> MSI parts and add virtual VTD support in the KVM. >>> >>> Other quick way to enable more than 256 VCPUs is to eliminate the >>> dependency between irq remapping and X2APIC in the guest linux kernel. >>> So far I can boot the guest after removing the dependency. >>> The side effect I thought is that irq only can deliver to 0~255 vcpus >>> but 256 vcpus seem enough to balance irq requests in the guest. In the >>> most cases, there are fewer devices in the guest. >>> >>> I wonder whether it's feasible. There maybe some other side effects I >>> didn't think of. Very appreciate for your comments. >> >> Radim is working on the KVM side already, Peter is currently driving the >> VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) > > + Igor extends QEMU to support more than 255 in internal structures and > ACPI. What remains mostly untracked is Seabios/OVMF. Thanks for your information. How about the KVM X2APIC part? Do you have a patch to extend KVM X2APIC to support 32-bit APIC IDs? > >> PS: Please no PV mess, at least without good reasons. > > Seconded. > > (If we designed all related devices as virtware, then it would not be > that bad, but slightly modifying and putting hardware drivers into > situations that cannot happen in hardware, not even in the spec, and > then juggling the KVM side to make them work, is a road to hell.) > -- Best regards Tianyu Lan ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-27 5:39 ` Lan Tianyu @ 2016-04-27 14:38 ` Radim Krčmář 0 siblings, 0 replies; 25+ messages in thread From: Radim Krčmář @ 2016-04-27 14:38 UTC (permalink / raw) To: Lan Tianyu Cc: Jan Kiszka, pbonzini, kvm, yang.zhang.wz, tglx, gleb, mst, x86, Peter Xu, Igor Mammedov 2016-04-27 13:39+0800, Lan Tianyu: > On 2016年04月27日 00:49, Radim Krčmář wrote: >> 2016-04-26 18:17+0200, Jan Kiszka: >>> On 2016-04-26 18:14, Lan, Tianyu wrote: >>>> Hi All: >>>> >>>> Recently I am working on extending max vcpu to more than 256 on the both >>>> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >>>> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >>>> requires irq remapping function during enabling X2APIC when max APIC id >>>> is more than 255(More detail please see try_to_enable_x2apic()). >> >> Our of curiosity, how many VCPUs are you aiming at? > > I think it's 1024. > > In the short term, we hope hypervisor at least supports 288 vcpus > because Xeon phi chip already supports 288 logical cpus. As hardware > development, there will be more logical cpus and we hope one guest can > totally uses all cpu resources on the chip to meet HPC requirement. Thanks, I think KVM will start by bumping the hard VCPU limit to 512 or 1024, with recommended maximum being 288. You'll be able to raise the hard limit just by configuing and recompiling. > How about KVM X2APIC part? Do you have patch > to extend KVM X2APIC to support 32-bit APIC ID? I do, in limbo, as QEMU cannot create VCPUs with higher APIC IDs, yet. ^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: Enable more than 255 VCPU support without irq remapping function in the guest 2016-04-26 16:17 ` Jan Kiszka 2016-04-26 16:49 ` Radim Krčmář @ 2016-04-27 5:15 ` Lan Tianyu 1 sibling, 0 replies; 25+ messages in thread From: Lan Tianyu @ 2016-04-27 5:15 UTC (permalink / raw) To: Jan Kiszka, pbonzini, kvm, yang.zhang.wz, tglx, gleb, mst, x86, Radim Krčmář, Peter Xu On 2016年04月27日 00:17, Jan Kiszka wrote: > On 2016-04-26 18:14, Lan, Tianyu wrote: >> Hi All: >> >> Recently I am working on extending max vcpu to more than 256 on the both >> KVM/Xen. For some HPC cases, it needs many vcpus. The job requires to >> use X2APIC in the guest which supports 32-bit APIC id. Linux kernel >> requires irq remapping function during enabling X2APIC when max APIC id >> is more than 255(More detail please see try_to_enable_x2apic()). >> >> The irq remapping function helps to deliver irq to cpu 255~. IOAPIC just >> supports 8-bit target APIC id field and only can deliver irq to >> cpu 0~255. >> >> So far both KVM/Xen doesn't enable irq remapping function. If enable the >> function, it seems a huge job which need to rework IO-APIC, local APIC, >> MSI parts and add virtual VTD support in the KVM. >> >> Other quick way to enable more than 256 VCPUs is to eliminate the >> dependency between irq remapping and X2APIC in the guest linux kernel. >> So far I can boot the guest after removing the dependency. >> The side effect I thought is that irq only can deliver to 0~255 vcpus >> but 256 vcpus seem enough to balance irq requests in the guest. In the >> most cases, there are fewer devices in the guest. >> >> I wonder whether it's feasible. There maybe some other side effects I >> didn't think of. Very appreciate for your comments. > > Radim is working on the KVM side already, Peter is currently driving the > VT-d interrupt emulation topic in QEMU. It's in reach, I would say. :) Oh. Thanks for your information. Very helpful :) > > Jan > > PS: Please no PV mess, at least without good reasons. > -- Best regards Tianyu Lan ^ permalink raw reply [flat|nested] 25+ messages in thread
end of thread
Thread overview: 25+ messages
2016-04-26 16:14 Enable more than 255 VCPU support without irq remapping function in the guest Lan, Tianyu
2016-04-26 16:17 ` Jan Kiszka
2016-04-26 16:49 ` Radim Krčmář
2016-04-27 4:10 ` Yang Zhang
2016-04-27 5:24 ` Jan Kiszka
2016-04-27 6:24 ` Lan Tianyu
2016-04-27 6:56 ` Jan Kiszka
2016-04-27 9:39 ` Yang Zhang
2016-04-27 9:45 ` Jan Kiszka
2016-04-28 1:11 ` Yang Zhang
2016-04-28 6:54 ` Jan Kiszka
2016-04-28 15:32 ` Radim Krčmář
2016-04-29 2:09 ` Yang Zhang
2016-04-29 3:01 ` Nadav Amit
2016-05-03 1:34 ` Yang Zhang
2016-04-29 4:59 ` Jan Kiszka
2016-05-03 1:52 ` Yang Zhang
2016-05-03 2:03 ` Nadav Amit
2016-05-03 4:55 ` Jan Kiszka
2016-05-04 1:46 ` Yang Zhang
2016-05-04 1:56 ` Nadav Amit
2016-05-04 5:38 ` Jan Kiszka
2016-04-27 5:39 ` Lan Tianyu
2016-04-27 14:38 ` Radim Krčmář
2016-04-27 5:15 ` Lan Tianyu