* [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
@ 2024-08-21 9:51 Kunkun Jiang
2024-08-21 10:59 ` Marc Zyngier
0 siblings, 1 reply; 9+ messages in thread
From: Kunkun Jiang @ 2024-08-21 9:51 UTC (permalink / raw)
To: Thomas Gleixner, Marc Zyngier, Oliver Upton, James Morse,
Suzuki K Poulose, Zenghui Yu
Cc: open list:IRQ SUBSYSTEM, moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
Hi all,
Recently I discovered a problem with GICv4.1. The scenario is as follows:
1. Enable GICv4.1.
2. Create multiple VMs, for example 50 VMs (4U8G).
3. The workload running in the VMs performs frequent MMIO accesses that
need to exit to qemu for processing.
4. Or modify the kvm code so that wfi always traps to kvm.
5. The utilization of the pcpu where each vcpu is located then reaches
100%, basically all in sys.
6. This problem does not exist in GICv3.
According to my analysis, this problem is due to the execution of vgic_v4_load:

vcpu_load or kvm_sched_in
  kvm_arch_vcpu_load
    ...
    vgic_v4_load
      irq_set_affinity
        ...
        irq_do_set_affinity
          raw_spin_lock(&tmp_mask_lock)
          chip->irq_set_affinity
            ...
            its_vpe_set_affinity

The tmp_mask_lock is the key. It is a global lock, and I don't quite
understand why it is needed here. I see two possible solutions:
1. Remove this tmp_mask_lock.
2. Modify the gicv4 driver so that VMOVP is not performed via
irq_set_affinity.
Everyone is welcome to discuss.
Thanks,
Kunkun Jiang
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Marc Zyngier @ 2024-08-21 10:59 UTC (permalink / raw)
To: Kunkun Jiang
Cc: Thomas Gleixner, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, open list:IRQ SUBSYSTEM,
moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
On Wed, 21 Aug 2024 10:51:27 +0100,
Kunkun Jiang <jiangkunkun@huawei.com> wrote:
>
> Hi all,
>
> Recently I discovered a problem about GICv4.1, the scenario is as follows:
> 1. Enable GICv4.1
> 2. Create multiple VMs, for example 50 VMs (4U8G)
I don't know what 4U8G means. On how many physical CPUs are you
running 50 VMs? Direct injection of interrupts and over-subscription
are fundamentally incompatible.
> 3. The workload running in the VMs performs frequent MMIO accesses that
> need to exit to qemu for processing.
> 4. Or modify the kvm code so that wfi always traps to kvm.
> 5. The utilization of the pcpu where each vcpu is located then reaches
> 100%, basically all in sys.
What did you expect? If you trap all the time, your performance will
suck. Don't do that.
> 6. This problem does not exist in GICv3.
Because GICv3 doesn't have the same constraints.
>
> According to analysis, this problem is due to the execution of vgic_v4_load.
> vcpu_load or kvm_sched_in
> kvm_arch_vcpu_load
> ...
> vgic_v4_load
> irq_set_affinity
> ...
> irq_do_set_affinity
> raw_spin_lock(&tmp_mask_lock)
> chip->irq_set_affinity
> ...
> its_vpe_set_affinity
>
> The tmp_mask_lock is the key. This is a global lock. I don't quite
> understand why tmp_mask_lock is needed here. I think there are two
> possible solutions here:
> 1. Remove this tmp_mask_lock
Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
would allow you to understand the nature of the problem.
This can probably be replaced with a per-CPU cpumask, which would
avoid the locking, but potentially result in a larger memory usage.
> 2. Modify the gicv4 driver so that VMOVP is not performed via
> irq_set_affinity.
Sure. You could also not use KVM at all if you don't care about interrupts
being delivered to your VM. We do not send a VMOVP just for fun. We
send it because your vcpu has moved to a different CPU, and the ITS
needs to know about that.
You seem to be misunderstanding the use case for GICv4: a partitioned
system, without any over-subscription, no vcpu migration between CPUs.
If that's not your setup, then GICv4 will always be a net loss
compared to SW injection with GICv3 (additional HW interaction,
doorbell interrupts).
M.
--
Without deviation from the norm, progress is not possible.
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Kunkun Jiang @ 2024-08-21 18:23 UTC (permalink / raw)
To: Marc Zyngier
Cc: Thomas Gleixner, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, open list:IRQ SUBSYSTEM,
moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
Hi Marc,
On 2024/8/21 18:59, Marc Zyngier wrote:
> On Wed, 21 Aug 2024 10:51:27 +0100,
> Kunkun Jiang <jiangkunkun@huawei.com> wrote:
>>
>> Hi all,
>>
>> Recently I discovered a problem about GICv4.1, the scenario is as follows:
>> 1. Enable GICv4.1
>> 2. Create multiple VMs, for example 50 VMs (4U8G)
s/4U8G/8U16G/, sorry..
> I don't know what 4U8G means. On how many physical CPUs are you
> running 50 VMs? Direct injection of interrupts and over-subscription
> are fundamentally incompatible.
Each VM is configured with 8 vcpus and 16G memory. The number of
physical CPUs is 320.
>
>> 3. The workload running in the VMs performs frequent MMIO accesses that
>> need to exit to qemu for processing.
>> 4. Or modify the kvm code so that wfi always traps to kvm.
>> 5. The utilization of the pcpu where each vcpu is located then reaches
>> 100%, basically all in sys.
>
> What did you expect? If you trap all the time, your performance will
> suck. Don't do that.
>
>> 6. This problem does not exist in GICv3.
>
> Because GICv3 doesn't have the same constraints.
>
>>
>> According to analysis, this problem is due to the execution of vgic_v4_load.
>> vcpu_load or kvm_sched_in
>> kvm_arch_vcpu_load
>> ...
>> vgic_v4_load
>> irq_set_affinity
>> ...
>> irq_do_set_affinity
>> raw_spin_lock(&tmp_mask_lock)
>> chip->irq_set_affinity
>> ...
>> its_vpe_set_affinity
>>
>> The tmp_mask_lock is the key. This is a global lock. I don't quite
>> understand why tmp_mask_lock is needed here. I think there are two
>> possible solutions here:
>> 1. Remove this tmp_mask_lock
>
> Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
> would allow you to understand the nature of the problem.
>
> This can probably be replaced with a per-CPU cpumask, which would
> avoid the locking, but potentially result in a larger memory usage.
Thanks, I will try it.
>> 2. Modify the gicv4 driver so that VMOVP is not performed via
>> irq_set_affinity.
>
> Sure. You could also not use KVM at all if you don't care about interrupts
> being delivered to your VM. We do not send a VMOVP just for fun. We
> send it because your vcpu has moved to a different CPU, and the ITS
> needs to know about that.
When a vcpu is moved to a different CPU, of course a VMOVP has to be sent.
What I mean is: is it possible to call its_vpe_set_affinity() to send the
VMOVP by other means (instead of through the irq_set_affinity() API), so
that we can bypass this tmp_mask_lock?
>
> You seem to be misunderstanding the use case for GICv4: a partitioned
> system, without any over-subscription, no vcpu migration between CPUs.
> If that's not your setup, then GICv4 will always be a net loss
> compared to SW injection with GICv3 (additional HW interaction,
> doorbell interrupts).
Thanks for the explanation. The key to the problem is not vcpu migration
between CPUs, but that many vcpus execute vgic_v4_load() at the same
time. Even when no vcpu is migrated to another CPU, a large number of
vcpus may execute vgic_v4_load() simultaneously. For example, the
services running in the VMs perform a large number of MMIO accesses and
need to return to userspace for emulation. Due to contention on
tmp_mask_lock, performance deteriorates.
When the target CPU is the same as the one the vcpu last ran on, there
seems to be no need to call irq_set_affinity() at all. I did a test and
this indeed alleviated the problem described above.
I still feel it might be better to remove tmp_mask_lock or to call
its_vpe_set_affinity() in another way, which is why I proposed the two
ideas above.
Thanks,
Kunkun Jiang
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Marc Zyngier @ 2024-08-22 8:26 UTC (permalink / raw)
To: Kunkun Jiang
Cc: Thomas Gleixner, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, open list:IRQ SUBSYSTEM,
moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
On Wed, 21 Aug 2024 19:23:30 +0100,
Kunkun Jiang <jiangkunkun@huawei.com> wrote:
>
> Hi Marc,
>
> On 2024/8/21 18:59, Marc Zyngier wrote:
> > On Wed, 21 Aug 2024 10:51:27 +0100,
> > Kunkun Jiang <jiangkunkun@huawei.com> wrote:
> >>
> >> Hi all,
> >>
> >> Recently I discovered a problem about GICv4.1, the scenario is as follows:
> >> 1. Enable GICv4.1
> >> 2. Create multiple VMs, for example 50 VMs (4U8G)
>
> s/4U8G/8U16G/, sorry..
>
> > I don't know what 4U8G means. On how many physical CPUs are you
> > running 50 VMs? Direct injection of interrupts and over-subscription
> > are fundamentally incompatible.
>
> Each VM is configured with 8 vcpus and 16G memory. The number of
> physical CPUs is 320.
So you spawn 400 vcpus in one go. Fun.
>
> >
> >> 3. The workload running in the VMs performs frequent MMIO accesses that
> >> need to exit to qemu for processing.
> >> 4. Or modify the kvm code so that wfi always traps to kvm.
> >> 5. The utilization of the pcpu where each vcpu is located then reaches
> >> 100%, basically all in sys.
> >
> > What did you expect? If you trap all the time, your performance will
> > suck. Don't do that.
> >
> >> 6. This problem does not exist in GICv3.
> >
> > Because GICv3 doesn't have the same constraints.
> >
> >>
> >> According to analysis, this problem is due to the execution of vgic_v4_load.
> >> vcpu_load or kvm_sched_in
> >> kvm_arch_vcpu_load
> >> ...
> >> vgic_v4_load
> >> irq_set_affinity
> >> ...
> >> irq_do_set_affinity
> >> raw_spin_lock(&tmp_mask_lock)
> >> chip->irq_set_affinity
> >> ...
> >> its_vpe_set_affinity
> >>
> >> The tmp_mask_lock is the key. This is a global lock. I don't quite
> >> understand why tmp_mask_lock is needed here. I think there are two
> >> possible solutions here:
> >> 1. Remove this tmp_mask_lock
> >
> > Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
> > would allow you to understand the nature of the problem.
> >
> > This can probably be replaced with a per-CPU cpumask, which would
> > avoid the locking, but potentially result in a larger memory usage.
>
> Thanks, I will try it.
A simple alternative would be this:
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..0d11b74af38c 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,12 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask *prog_mask;
+	struct cpumask tmp_mask = {};
 	int ret;
 
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
-
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	raw_spin_lock(&tmp_mask_lock);
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -280,8 +277,6 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	else
 		ret = -EINVAL;
 
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
but that will eat a significant portion of your stack if your kernel is
configured for a large number of CPUs.
>
> >> 2. Modify the gicv4 driver so that VMOVP is not performed via
> >> irq_set_affinity.
> >
> > Sure. You could also not use KVM at all if you don't care about interrupts
> > being delivered to your VM. We do not send a VMOVP just for fun. We
> > send it because your vcpu has moved to a different CPU, and the ITS
> > needs to know about that.
>
> When a vcpu is moved to a different CPU, of course a VMOVP has to be sent.
> What I mean is: is it possible to call its_vpe_set_affinity() to send the
> VMOVP by other means (instead of through the irq_set_affinity() API), so
> that we can bypass this tmp_mask_lock?
The whole point of this infrastructure is that the VPE doorbell is the
control point for the VPE. If the VPE moves, then the change of
affinity *must* be done using irq_set_affinity(). All the locking is
constructed around that. Please read the abundant documentation that
exists in both the GIC code and KVM describing why this is done like
that.
>
> >
> > You seem to be misunderstanding the use case for GICv4: a partitioned
> > system, without any over-subscription, no vcpu migration between CPUs.
> > If that's not your setup, then GICv4 will always be a net loss
> > compared to SW injection with GICv3 (additional HW interaction,
> > doorbell interrupts).
>
> Thanks for the explanation. The key to the problem is not vcpu migration
> between CPUs, but that many vcpus execute vgic_v4_load() at the same
> time. Even when no vcpu is migrated to another CPU, a large number of
> vcpus may execute vgic_v4_load() simultaneously. For example, the
> services running in the VMs perform a large number of MMIO accesses and
> need to return to userspace for emulation. Due to contention on
> tmp_mask_lock, performance deteriorates.
That's only a symptom. And that doesn't affect only pathological VM
workloads, but all interrupts being moved around for any reason.
>
> When the target CPU is the same as the one the vcpu last ran on, there
> seems to be no need to call irq_set_affinity() at all. I did a test and
> this indeed alleviated the problem described above.
The premise is that irq_set_affinity() should be cheap when there
isn't much to do, and you are papering over the problem.
>
> I still feel it might be better to remove tmp_mask_lock or to call
> its_vpe_set_affinity() in another way, which is why I proposed the two
> ideas above.
The removal of this global lock is the only option in my opinion.
Either the cpumask becomes a stack variable, or it becomes a static
per-CPU variable. Both have drawbacks, but they are not a bottleneck
anymore.
Thanks,
M.
--
Without deviation from the norm, progress is not possible.
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Kunkun Jiang @ 2024-08-22 10:59 UTC (permalink / raw)
To: Marc Zyngier
Cc: Thomas Gleixner, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, open list:IRQ SUBSYSTEM,
moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
Hi Marc,
On 2024/8/22 16:26, Marc Zyngier wrote:
>>>> According to analysis, this problem is due to the execution of vgic_v4_load.
>>>> vcpu_load or kvm_sched_in
>>>> kvm_arch_vcpu_load
>>>> ...
>>>> vgic_v4_load
>>>> irq_set_affinity
>>>> ...
>>>> irq_do_set_affinity
>>>> raw_spin_lock(&tmp_mask_lock)
>>>> chip->irq_set_affinity
>>>> ...
>>>> its_vpe_set_affinity
>>>>
>>>> The tmp_mask_lock is the key. This is a global lock. I don't quite
>>>> understand why tmp_mask_lock is needed here. I think there are two
>>>> possible solutions here:
>>>> 1. Remove this tmp_mask_lock
>>>
>>> Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
>>> would allow you to understand the nature of the problem.
>>>
>>> This can probably be replaced with a per-CPU cpumask, which would
>>> avoid the locking, but potentially result in a larger memory usage.
>>
>> Thanks, I will try it.
>
> A simple alternative would be this:
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index dd53298ef1a5..0d11b74af38c 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -224,15 +224,12 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> struct irq_desc *desc = irq_data_to_desc(data);
> struct irq_chip *chip = irq_data_get_irq_chip(data);
> const struct cpumask *prog_mask;
> + struct cpumask tmp_mask = {};
> int ret;
>
> - static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
> - static struct cpumask tmp_mask;
> -
> if (!chip || !chip->irq_set_affinity)
> return -EINVAL;
>
> - raw_spin_lock(&tmp_mask_lock);
> /*
> * If this is a managed interrupt and housekeeping is enabled on
> * it check whether the requested affinity mask intersects with
> @@ -280,8 +277,6 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> else
> ret = -EINVAL;
>
> - raw_spin_unlock(&tmp_mask_lock);
> -
> switch (ret) {
> case IRQ_SET_MASK_OK:
> case IRQ_SET_MASK_OK_DONE:
>
> but that will eat a significant portion of your stack if your kernel is
> configured for a large number of CPUs.
>
Currently CONFIG_NR_CPUS=4096, so each `struct cpumask` occupies 512 bytes.
>>
>>>> 2. Modify the gicv4 driver so that VMOVP is not performed via
>>>> irq_set_affinity.
>>>
>>> Sure. You could also not use KVM at all if you don't care about interrupts
>>> being delivered to your VM. We do not send a VMOVP just for fun. We
>>> send it because your vcpu has moved to a different CPU, and the ITS
>>> needs to know about that.
>>
>> When a vcpu is moved to a different CPU, of course a VMOVP has to be sent.
>> What I mean is: is it possible to call its_vpe_set_affinity() to send the
>> VMOVP by other means (instead of through the irq_set_affinity() API), so
>> that we can bypass this tmp_mask_lock?
>
> The whole point of this infrastructure is that the VPE doorbell is the
> control point for the VPE. If the VPE moves, then the change of
> affinity *must* be done using irq_set_affinity(). All the locking is
> constructed around that. Please read the abundant documentation that
> exists in both the GIC code and KVM describing why this is done like
> that.
>
OK. Thank you for your guidance.
>>
>>>
>>> You seem to be misunderstanding the use case for GICv4: a partitioned
>>> system, without any over-subscription, no vcpu migration between CPUs.
>>> If that's not your setup, then GICv4 will always be a net loss
>>> compared to SW injection with GICv3 (additional HW interaction,
>>> doorbell interrupts).
>>
>> Thanks for the explanation. The key to the problem is not vcpu migration
>> between CPUs, but that many vcpus execute vgic_v4_load() at the same
>> time. Even when no vcpu is migrated to another CPU, a large number of
>> vcpus may execute vgic_v4_load() simultaneously. For example, the
>> services running in the VMs perform a large number of MMIO accesses and
>> need to return to userspace for emulation. Due to contention on
>> tmp_mask_lock, performance deteriorates.
>
> That's only a symptom. And that doesn't affect only pathological VM
> workloads, but all interrupts being moved around for any reason.
>
Yes.
>>
>> When the target CPU is the same as the one the vcpu last ran on, there
>> seems to be no need to call irq_set_affinity() at all. I did a test and
>> this indeed alleviated the problem described above.
>
> The premise is that irq_set_affinity() should be cheap when there
> isn't much to do, and you are papering over the problem.
>
>>
>> I still feel it might be better to remove tmp_mask_lock or to call
>> its_vpe_set_affinity() in another way, which is why I proposed the two
>> ideas above.
>
> The removal of this global lock is the only option in my opinion.
> Either the cpumask becomes a stack variable, or it becomes a static
> per-CPU variable. Both have drawbacks, but they are not a bottleneck
> anymore.
I also prefer to remove the global lock. Which variable do you think is
better?
Thanks,
Kunkun Jiang
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Marc Zyngier @ 2024-08-22 12:47 UTC (permalink / raw)
To: Kunkun Jiang
Cc: Thomas Gleixner, Oliver Upton, James Morse, Suzuki K Poulose,
Zenghui Yu, open list:IRQ SUBSYSTEM,
moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
On Thu, 22 Aug 2024 11:59:50 +0100,
Kunkun Jiang <jiangkunkun@huawei.com> wrote:
>
> Hi Marc,
>
> On 2024/8/22 16:26, Marc Zyngier wrote:
> >>>> According to analysis, this problem is due to the execution of vgic_v4_load.
> >>>> vcpu_load or kvm_sched_in
> >>>> kvm_arch_vcpu_load
> >>>> ...
> >>>> vgic_v4_load
> >>>> irq_set_affinity
> >>>> ...
> >>>> irq_do_set_affinity
> >>>> raw_spin_lock(&tmp_mask_lock)
> >>>> chip->irq_set_affinity
> >>>> ...
> >>>> its_vpe_set_affinity
> >>>>
> >>>> The tmp_mask_lock is the key. This is a global lock. I don't quite
> >>>> understand why tmp_mask_lock is needed here. I think there are two
> >>>> possible solutions here:
> >>>> 1. Remove this tmp_mask_lock
> >>>
> >>> Maybe you could have a look at 33de0aa4bae98 (and 11ea68f553e24)? It
> >>> would allow you to understand the nature of the problem.
> >>>
> >>> This can probably be replaced with a per-CPU cpumask, which would
> >>> avoid the locking, but potentially result in a larger memory usage.
> >>
> >> Thanks, I will try it.
> >
> > A simple alternative would be this:
> >
> > diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> > index dd53298ef1a5..0d11b74af38c 100644
> > --- a/kernel/irq/manage.c
> > +++ b/kernel/irq/manage.c
> > @@ -224,15 +224,12 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> > struct irq_desc *desc = irq_data_to_desc(data);
> > struct irq_chip *chip = irq_data_get_irq_chip(data);
> > const struct cpumask *prog_mask;
> > + struct cpumask tmp_mask = {};
> > int ret;
> > - static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
> > - static struct cpumask tmp_mask;
> > -
> > if (!chip || !chip->irq_set_affinity)
> > return -EINVAL;
> > - raw_spin_lock(&tmp_mask_lock);
> > /*
> > * If this is a managed interrupt and housekeeping is enabled on
> > * it check whether the requested affinity mask intersects with
> > @@ -280,8 +277,6 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> > else
> > ret = -EINVAL;
> > - raw_spin_unlock(&tmp_mask_lock);
> > -
> > switch (ret) {
> > case IRQ_SET_MASK_OK:
> > case IRQ_SET_MASK_OK_DONE:
> >
> > but that will eat a significant portion of your stack if your kernel is
> > configured for a large number of CPUs.
> >
>
> Currently CONFIG_NR_CPUS=4096, so each `struct cpumask` occupies 512 bytes.
This seems crazy. Why would you build a kernel with something *that*
big, especially considering that you have a lot less than 1k CPUs?
[...]
> > The removal of this global lock is the only option in my opinion.
> > Either the cpumask becomes a stack variable, or it becomes a static
> > per-CPU variable. Both have drawbacks, but they are not a bottleneck
> > anymore.
>
> I also prefer to remove the global lock. Which variable do you think is
> better?
Given the number of CPUs your system is configured for, there is no
good answer. An on-stack variable is dangerously large, and a per-CPU
cpumask results in 2MB being allocated, which I find insane.
You'll have to pick your own poison and convince Thomas of the
validity of your approach.
M.
--
Without deviation from the norm, progress is not possible.
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Thomas Gleixner @ 2024-08-22 21:20 UTC (permalink / raw)
To: Marc Zyngier, Kunkun Jiang
Cc: Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
open list:IRQ SUBSYSTEM, moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
On Thu, Aug 22 2024 at 13:47, Marc Zyngier wrote:
> On Thu, 22 Aug 2024 11:59:50 +0100,
> Kunkun Jiang <jiangkunkun@huawei.com> wrote:
>> > but that will eat a significant portion of your stack if your kernel is
>> > configured for a large number of CPUs.
>> >
>>
>> Currently CONFIG_NR_CPUS=4096, so each `struct cpumask` occupies 512 bytes.
>
> This seems crazy. Why would you build a kernel with something *that*
> big, especially considering that you have a lot less than 1k CPUs?
That's why CONFIG_CPUMASK_OFFSTACK exists, but that does not help in
that context. :)
>> > The removal of this global lock is the only option in my opinion.
>> > Either the cpumask becomes a stack variable, or it becomes a static
>> > per-CPU variable. Both have drawbacks, but they are not a bottleneck
>> > anymore.
>>
>> I also prefer to remove the global lock. Which variable do you think is
>> better?
>
> Given the number of CPUs your system is configured for, there is no
> good answer. An on-stack variable is dangerously large, and a per-CPU
> cpumask results in 2MB being allocated, which I find insane.
Only if there are actually 4096 CPUs enumerated. The per CPU magic is
smart enough to limit the damage to the actual number of possible CPUs
which are enumerated at boot time. It still will over-allocate due to
NR_CPUS being insanely large but on a 4 CPU machine this boils down to
2k of memory waste unless Aaarg64 is stupid enough to allocate for
NR_CPUS instead of num_possible_cpus()...
That said, on a real 4k CPU system 2M of memory should be the least of
your worries.
> You'll have to pick your own poison and convince Thomas of the
> validity of your approach.
As this is an operation which is really not suitable for on demand
or large stack allocations the per CPU approach makes sense.
Thanks,
tglx
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Marc Zyngier @ 2024-08-23 8:49 UTC (permalink / raw)
To: Thomas Gleixner, Kunkun Jiang
Cc: Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
open list:IRQ SUBSYSTEM, moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
On Thu, 22 Aug 2024 22:20:43 +0100,
Thomas Gleixner <tglx@linutronix.de> wrote:
>
> On Thu, Aug 22 2024 at 13:47, Marc Zyngier wrote:
> > On Thu, 22 Aug 2024 11:59:50 +0100,
> > Kunkun Jiang <jiangkunkun@huawei.com> wrote:
> >> > but that will eat a significant portion of your stack if your kernel is
> >> > configured for a large number of CPUs.
> >> >
> >>
> >> Currently CONFIG_NR_CPUS=4096, so each `struct cpumask` occupies 512 bytes.
> >
> > This seems crazy. Why would you build a kernel with something *that*
> > big, especially considering that you have a lot less than 1k CPUs?
>
> That's why CONFIG_CPUMASK_OFFSTACK exists, but that does not help in
> that context. :)
>
> >> > The removal of this global lock is the only option in my opinion.
> >> > Either the cpumask becomes a stack variable, or it becomes a static
> >> > per-CPU variable. Both have drawbacks, but they are not a bottleneck
> >> > anymore.
> >>
> >> I also prefer to remove the global lock. Which variable do you think is
> >> better?
> >
> > Given the number of CPUs your system is configured for, there is no
> > good answer. An on-stack variable is dangerously large, and a per-CPU
> > cpumask results in 2MB being allocated, which I find insane.
>
> Only if there are actually 4096 CPUs enumerated. The per CPU magic is
> smart enough to limit the damage to the actual number of possible CPUs
> which are enumerated at boot time. It still will over-allocate due to
> NR_CPUS being insanely large but on a 4 CPU machine this boils down to
> 2k of memory waste unless Aaarg64 is stupid enough to allocate for
> NR_CPUS instead of num_possible_cpus()...
No difference between arm64 and xyz85.999 here.
>
> That said, on a real 4k CPU system 2M of memory should be the least of
> your worries.
Don't underestimate the general level of insanity!
>
> > You'll have to pick your own poison and convince Thomas of the
> > validity of your approach.
>
> As this is an operation which is really not suitable for on demand
> or large stack allocations the per CPU approach makes sense.
Right, so let's shoot for that. Kunkun, can you please give the
following hack a go with your workload?
Thanks,
M.
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index dd53298ef1a5..b6aa259ac749 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -224,15 +224,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	struct irq_desc *desc = irq_data_to_desc(data);
 	struct irq_chip *chip = irq_data_get_irq_chip(data);
 	const struct cpumask *prog_mask;
+	struct cpumask *tmp_mask;
 	int ret;
 
-	static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
-	static struct cpumask tmp_mask;
+	static DEFINE_PER_CPU(struct cpumask, __tmp_mask);
 
 	if (!chip || !chip->irq_set_affinity)
 		return -EINVAL;
 
-	raw_spin_lock(&tmp_mask_lock);
+	tmp_mask = this_cpu_ptr(&__tmp_mask);
+
 	/*
 	 * If this is a managed interrupt and housekeeping is enabled on
 	 * it check whether the requested affinity mask intersects with
@@ -258,11 +259,11 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 		hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
 
-		cpumask_and(&tmp_mask, mask, hk_mask);
-		if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
+		cpumask_and(tmp_mask, mask, hk_mask);
+		if (!cpumask_intersects(tmp_mask, cpu_online_mask))
 			prog_mask = mask;
 		else
-			prog_mask = &tmp_mask;
+			prog_mask = tmp_mask;
 	} else {
 		prog_mask = mask;
 	}
@@ -272,16 +273,14 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
 	 * unless we are being asked to force the affinity (in which
 	 * case we do as we are told).
 	 */
-	cpumask_and(&tmp_mask, prog_mask, cpu_online_mask);
-	if (!force && !cpumask_empty(&tmp_mask))
-		ret = chip->irq_set_affinity(data, &tmp_mask, force);
+	cpumask_and(tmp_mask, prog_mask, cpu_online_mask);
+	if (!force && !cpumask_empty(tmp_mask))
+		ret = chip->irq_set_affinity(data, tmp_mask, force);
 	else if (force)
 		ret = chip->irq_set_affinity(data, mask, force);
 	else
 		ret = -EINVAL;
 
-	raw_spin_unlock(&tmp_mask_lock);
-
 	switch (ret) {
 	case IRQ_SET_MASK_OK:
 	case IRQ_SET_MASK_OK_DONE:
--
Without deviation from the norm, progress is not possible.
* Re: [bug report] GICv4.1: multiple vcpus executing vgic_v4_load at the same time greatly increases the time consumption
From: Kunkun Jiang @ 2024-08-26 3:10 UTC (permalink / raw)
To: Marc Zyngier, Thomas Gleixner
Cc: Oliver Upton, James Morse, Suzuki K Poulose, Zenghui Yu,
open list:IRQ SUBSYSTEM, moderated list:ARM SMMU DRIVERS, kvmarm,
wanghaibin.wang@huawei.com, nizhiqiang1, tangnianyao@huawei.com,
wangzhou1
Hi Marc,
On 2024/8/23 16:49, Marc Zyngier wrote:
> On Thu, 22 Aug 2024 22:20:43 +0100,
> Thomas Gleixner <tglx@linutronix.de> wrote:
>>
>> On Thu, Aug 22 2024 at 13:47, Marc Zyngier wrote:
>>> On Thu, 22 Aug 2024 11:59:50 +0100,
>>> Kunkun Jiang <jiangkunkun@huawei.com> wrote:
>>>>> but that will eat a significant portion of your stack if your kernel is
>>>>> configured for a large number of CPUs.
>>>>>
>>>>
>>>> Currently CONFIG_NR_CPUS=4096, so each `struct cpumask` occupies 512 bytes.
>>>
>>> This seems crazy. Why would you build a kernel with something *that*
>>> big, specially considering that you have a lot less than 1k CPUs?
>>
>> That's why CONFIG_CPUMASK_OFFSTACK exists, but that does not help in
>> that context. :)
>>
>>>>> The removal of this global lock is the only option in my opinion.
>>>>> Either the cpumask becomes a stack variable, or it becomes a static
>>>>> per-CPU variable. Both have drawbacks, but they are not a bottleneck
>>>>> anymore.
>>>>
>>>> I also prefer to remove the global lock. Which variable do you think is
>>>> better?
>>>
>>> Given the number of CPUs your system is configured for, there is no
>>> good answer. An on-stack variable is dangerously large, and a per-CPU
>>> cpumask results in 2MB being allocated, which I find insane.
>>
>> Only if there are actually 4096 CPUs enumerated. The per CPU magic is
>> smart enough to limit the damage to the actual number of possible CPUs
>> which are enumerated at boot time. It still will over-allocate due to
>> NR_CPUS being insanely large but on a 4 CPU machine this boils down to
>> 2k of memory waste unless Aaarg64 is stupid enough to allocate for
>> NR_CPUS instead of num_possible_cpus()...
>
> No difference between arm64 and xyz85.999 here.
>
>>
>> That said, on a real 4k CPU system 2M of memory should be the least of
>> your worries.
>
> Don't underestimate the general level of insanity!
>
>>
>>> You'll have to pick your own poison and convince Thomas of the
>>> validity of your approach.
>>
>> As this is an operation which is really not suitable for on-demand
>> or large stack allocations, the per-CPU approach makes sense.
>
> Right, so let's shoot for that. Kunkun, can you please give the
> following hack a go with your workload?
I tested my workload based on the patch below. It solved my problem.
Thank you very much.
Thanks,
Kunkun Jiang
>
> Thanks,
>
> M.
>
> diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
> index dd53298ef1a5..b6aa259ac749 100644
> --- a/kernel/irq/manage.c
> +++ b/kernel/irq/manage.c
> @@ -224,15 +224,16 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> struct irq_desc *desc = irq_data_to_desc(data);
> struct irq_chip *chip = irq_data_get_irq_chip(data);
> const struct cpumask *prog_mask;
> + struct cpumask *tmp_mask;
> int ret;
>
> - static DEFINE_RAW_SPINLOCK(tmp_mask_lock);
> - static struct cpumask tmp_mask;
> + static DEFINE_PER_CPU(struct cpumask, __tmp_mask);
>
> if (!chip || !chip->irq_set_affinity)
> return -EINVAL;
>
> - raw_spin_lock(&tmp_mask_lock);
> + tmp_mask = this_cpu_ptr(&__tmp_mask);
> +
> /*
> * If this is a managed interrupt and housekeeping is enabled on
> * it check whether the requested affinity mask intersects with
> @@ -258,11 +259,11 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
>
> hk_mask = housekeeping_cpumask(HK_TYPE_MANAGED_IRQ);
>
> - cpumask_and(&tmp_mask, mask, hk_mask);
> - if (!cpumask_intersects(&tmp_mask, cpu_online_mask))
> + cpumask_and(tmp_mask, mask, hk_mask);
> + if (!cpumask_intersects(tmp_mask, cpu_online_mask))
> prog_mask = mask;
> else
> - prog_mask = &tmp_mask;
> + prog_mask = tmp_mask;
> } else {
> prog_mask = mask;
> }
> @@ -272,16 +273,14 @@ int irq_do_set_affinity(struct irq_data *data, const struct cpumask *mask,
> * unless we are being asked to force the affinity (in which
> * case we do as we are told).
> */
> - cpumask_and(&tmp_mask, prog_mask, cpu_online_mask);
> - if (!force && !cpumask_empty(&tmp_mask))
> - ret = chip->irq_set_affinity(data, &tmp_mask, force);
> + cpumask_and(tmp_mask, prog_mask, cpu_online_mask);
> + if (!force && !cpumask_empty(tmp_mask))
> + ret = chip->irq_set_affinity(data, tmp_mask, force);
> else if (force)
> ret = chip->irq_set_affinity(data, mask, force);
> else
> ret = -EINVAL;
>
> - raw_spin_unlock(&tmp_mask_lock);
> -
> switch (ret) {
> case IRQ_SET_MASK_OK:
> case IRQ_SET_MASK_OK_DONE:
>
Thread overview: 9+ messages
2024-08-21 9:51 [bug report] GICv4.1: multiple vpus execute vgic_v4_load at the same time will greatly increase the time consumption Kunkun Jiang
2024-08-21 10:59 ` Marc Zyngier
2024-08-21 18:23 ` Kunkun Jiang
2024-08-22 8:26 ` Marc Zyngier
2024-08-22 10:59 ` Kunkun Jiang
2024-08-22 12:47 ` Marc Zyngier
2024-08-22 21:20 ` Thomas Gleixner
2024-08-23 8:49 ` Marc Zyngier
2024-08-26 3:10 ` Kunkun Jiang