* [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Barry Song @ 2022-02-18 21:55 UTC
To: maz, tglx, will, linux-kernel
Cc: linux-arm-kernel, linuxarm, Barry Song

dsb(ishst) should be enough here as we only need to guarantee the
visibility of data to other CPUs in smp inner domain before we
send the ipi.

Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
---
 drivers/irqchip/irq-gic-v3.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index 5e935d97207d..0efe1a9a9f3b 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
 	 * Ensure that stores to Normal memory are visible to the
 	 * other CPUs before issuing the IPI.
 	 */
-	wmb();
+	dsb(ishst);
 
 	for_each_cpu(cpu, mask) {
 		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
-- 
2.25.1
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Marc Zyngier @ 2022-02-19  9:56 UTC
To: Barry Song
Cc: tglx, will, linux-kernel, linux-arm-kernel, linuxarm, Barry Song

On 2022-02-18 21:55, Barry Song wrote:
> dsb(ishst) should be enough here as we only need to guarantee the
> visibility of data to other CPUs in smp inner domain before we
> send the ipi.
>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 5e935d97207d..0efe1a9a9f3b 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
>  	 * Ensure that stores to Normal memory are visible to the
>  	 * other CPUs before issuing the IPI.
>  	 */
> -	wmb();
> +	dsb(ishst);
>
>  	for_each_cpu(cpu, mask) {
>  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));

I'm not opposed to that change, but I'm pretty curious whether this makes
any visible difference in practice. Could you measure the effect of this
change for any sort of IPI heavy workload?

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Barry Song @ 2022-02-19 23:46 UTC
To: maz
Cc: 21cnbao, linux-arm-kernel, linux-kernel, linuxarm, song.bao.hua, tglx, will

>> +	dsb(ishst);
>>
>>  	for_each_cpu(cpu, mask) {
>>  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this makes
> any visible difference in practice. Could you measure the effect of this
> change for any sort of IPI heavy workload?
>
> Thanks,
>
> M.

In practice, at least I don't see much difference on the hardware I am
using, so the result probably depends on the implementation of the real
hardware.

I wrote a micro-benchmark to measure the latency with and without the patch
on kunpeng920 with 96 cores (2 sockets; each socket has 2 dies, each die has
24 cores; cpu0-cpu47 belong to socket0, cpu48-cpu95 belong to socket1) by
sending IPIs to cpu0-cpu95 1000 times from a specified CPU:

#include <linux/module.h>
#include <linux/smp.h>
#include <linux/timekeeping.h>

static void ipi_latency_func(void *val)
{
}

static int __init ipi_latency_init(void)
{
	ktime_t stime, etime, delta;
	int cpu, i;
	int start = smp_processor_id();

	stime = ktime_get();
	for (i = 0; i < 1000; i++)
		for (cpu = 0; cpu < 96; cpu++)
			smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
	etime = ktime_get();

	delta = ktime_sub(etime, stime);

	printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
	       __func__, start, delta);

	return 0;
}
module_init(ipi_latency_init);

static void ipi_latency_exit(void)
{
}
module_exit(ipi_latency_exit);

MODULE_DESCRIPTION("IPI benchmark");
MODULE_LICENSE("GPL");

I did the below 10 times:
# taskset -c 0 insmod test.ko
# rmmod test

and the below 10 times:
# taskset -c 48 insmod test.ko
# rmmod test

With taskset -c, I can change the source CPU sending the IPIs.
The result is as below:

vanilla kernel:
[ 103.391684] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122237009
[ 103.537256] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121364329
[ 103.681276] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121420160
[ 103.826254] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122392403
[ 103.970209] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122371262
[ 104.113879] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122041254
[ 104.257444] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121594453
[ 104.402432] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122592556
[ 104.561434] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121601214
[ 104.705561] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121732767
[ 124.592944] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147048939
[ 124.779280] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147467842
[ 124.958162] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146448676
[ 125.129253] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:141537482
[ 125.298848] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147161504
[ 125.471531] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147833787
[ 125.643133] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147438445
[ 125.814530] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146806172
[ 125.989677] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145971002
[ 126.159497] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147780655

patched kernel:
[ 428.828167] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122195849
[ 428.970822] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122361042
[ 429.111058] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122528494
[ 429.257704] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121155045
[ 429.410186] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121608565
[ 429.570171] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121613673
[ 429.718181] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121593737
[ 429.862615] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:121953875
[ 430.002796] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122102741
[ 430.142741] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122005473
[ 516.642812] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145610926
[ 516.817002] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145878266
[ 517.004665] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145602966
[ 517.188758] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145658672
[ 517.372409] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:141329497
[ 517.557313] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146323829
[ 517.733107] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146015196
[ 517.921491] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146439231
[ 518.093129] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146106916
[ 518.264162] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145097868

So there is not much difference between the vanilla and the patched kernel.
What really worries me about my hardware is that an IPI sent from the second
socket always shows worse performance than one sent from the first socket.
This seems to be a problem worth investigating.

Thanks
Barry
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Barry Song @ 2022-02-20  1:33 UTC
To: 21cnbao
Cc: linux-arm-kernel, linux-kernel, linuxarm, maz, song.bao.hua, tglx, will

> So there is not much difference between the vanilla and the patched kernel.

Sorry, let me correct that. I realize I should write some data before sending
the IPI, so I have changed the module as below:

#include <linux/cache.h>
#include <linux/module.h>
#include <linux/smp.h>
#include <linux/timekeeping.h>

volatile int data0 ____cacheline_aligned;
volatile int data1 ____cacheline_aligned;
volatile int data2 ____cacheline_aligned;
volatile int data3 ____cacheline_aligned;
volatile int data4 ____cacheline_aligned;
volatile int data5 ____cacheline_aligned;
volatile int data6 ____cacheline_aligned;

static void ipi_latency_func(void *val)
{
}

static int __init ipi_latency_init(void)
{
	ktime_t stime, etime, delta;
	int cpu, i;
	int start = smp_processor_id();

	stime = ktime_get();
	for (i = 0; i < 1000; i++)
		for (cpu = 0; cpu < 96; cpu++) {
			data0 = data1 = data2 = data3 = data4 = data5 = data6 = cpu;
			smp_call_function_single(cpu, ipi_latency_func, NULL, 1);
		}
	etime = ktime_get();

	delta = ktime_sub(etime, stime);

	printk("%s ipi from cpu%d to cpu0-95 delta of 1000times:%lld\n",
	       __func__, start, delta);

	return 0;
}
module_init(ipi_latency_init);

static void ipi_latency_exit(void)
{
}
module_exit(ipi_latency_exit);

MODULE_DESCRIPTION("IPI benchmark");
MODULE_LICENSE("GPL");

After that, I can see a ~1% difference between the patched kernel and vanilla:

vanilla:
[ 375.220131] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126757449
[ 375.382596] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126784249
[ 375.537975] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126177703
[ 375.686823] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127022281
[ 375.849967] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126184883
[ 375.999173] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127374585
[ 376.149565] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:125778089
[ 376.298743] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126974441
[ 376.451125] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:127357625
[ 376.606006] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:126228184
[ 371.405378] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151851181
[ 371.591642] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151568608
[ 371.767906] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151853441
[ 371.944031] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:152065453
[ 372.114085] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:146122093
[ 372.291345] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151379636
[ 372.459812] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151854411
[ 372.629708] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:145750720
[ 372.807574] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151629448
[ 372.994979] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:151050253

patched kernel:
[ 105.598815] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124467401
[ 105.748368] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123474209
[ 105.900400] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123558497
[ 106.043890] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122993951
[ 106.191845] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:122984223
[ 106.348215] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123323609
[ 106.501448] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:124507583
[ 106.656358] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123386963
[ 106.804367] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123340664
[ 106.956331] ipi_latency_init ipi from cpu0 to cpu0-95 delta of 1000times:123285324
[ 108.930802] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:143616067
[ 109.094750] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148969821
[ 109.267428] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149648418
[ 109.443274] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149448903
[ 109.621760] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:147882917
[ 109.794611] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148700282
[ 109.975197] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149050595
[ 110.141543] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:143566604
[ 110.315213] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:149202898
[ 110.491008] ipi_latency_init ipi from cpu48 to cpu0-95 delta of 1000times:148958261

As you can see, when cpu0 is the source, vanilla takes 125xxxxxx-127xxxxxx ns
while the patched kernel takes 122xxxxxx-124xxxxxx ns.

Thanks
Barry
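(For scale, a rough back-of-the-envelope reading of the numbers above, not
part of the original message: each run issues 1000 iterations x 96 CPUs =
96,000 synchronous IPIs, so a total of ~123,000,000 ns from cpu0 corresponds
to roughly 1.28 us per smp_call_function_single() round trip, ~148,000,000 ns
from cpu48 to roughly 1.54 us, and the ~1% delta to something on the order of
10-15 ns per IPI.)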
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Ard Biesheuvel @ 2022-02-20 13:30 UTC
To: Marc Zyngier
Cc: Barry Song, Thomas Gleixner, Will Deacon, Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
>
> On 2022-02-18 21:55, Barry Song wrote:
> > dsb(ishst) should be enough here as we only need to guarantee the
> > visibility of data to other CPUs in smp inner domain before we
> > send the ipi.
> >
> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > ---
> >  drivers/irqchip/irq-gic-v3.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> > index 5e935d97207d..0efe1a9a9f3b 100644
> > --- a/drivers/irqchip/irq-gic-v3.c
> > +++ b/drivers/irqchip/irq-gic-v3.c
> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> >  	 * Ensure that stores to Normal memory are visible to the
> >  	 * other CPUs before issuing the IPI.
> >  	 */
> > -	wmb();
> > +	dsb(ishst);
> >
> >  	for_each_cpu(cpu, mask) {
> >  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>
> I'm not opposed to that change, but I'm pretty curious whether this makes
> any visible difference in practice. Could you measure the effect of this
> change for any sort of IPI heavy workload?
>

Does this have to be a DSB ?
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Marc Zyngier @ 2022-02-20 15:04 UTC
To: Ard Biesheuvel
Cc: Barry Song, Thomas Gleixner, Will Deacon, Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On 2022-02-20 13:30, Ard Biesheuvel wrote:
> On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
>>
>> On 2022-02-18 21:55, Barry Song wrote:
>> > dsb(ishst) should be enough here as we only need to guarantee the
>> > visibility of data to other CPUs in smp inner domain before we
>> > send the ipi.
>> >
>> > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> > ---
>> >  drivers/irqchip/irq-gic-v3.c | 2 +-
>> >  1 file changed, 1 insertion(+), 1 deletion(-)
>> >
>> > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
>> > index 5e935d97207d..0efe1a9a9f3b 100644
>> > --- a/drivers/irqchip/irq-gic-v3.c
>> > +++ b/drivers/irqchip/irq-gic-v3.c
>> > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
>> >  	 * Ensure that stores to Normal memory are visible to the
>> >  	 * other CPUs before issuing the IPI.
>> >  	 */
>> > -	wmb();
>> > +	dsb(ishst);
>> >
>> >  	for_each_cpu(cpu, mask) {
>> >  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
>>
>> I'm not opposed to that change, but I'm pretty curious whether this makes
>> any visible difference in practice. Could you measure the effect of this
>> change for any sort of IPI heavy workload?
>>
>
> Does this have to be a DSB ?

We can use a DMB ISHST for the other interrupt controllers that use an
MMIO access to signal the IPI, as we need to order the previous writes
with the MMIO access (and DMB fits that bill).

For GICv3 when ICC_SRE_EL1.SRE==1, we need to order a set of writes with
a system register access with side effects, and the only barrier that
allows us to do that is DSB.

It is a bit unfortunate that the relatively lightweight nature of the
sysreg CPU interface is encumbered by fairly heavy barriers (the most
horrible one being the DSB SY on each read of IAR to be able to
synchronise the CPU interface and the redistributor).

Thanks,

M.
-- 
Jazz is not dead. It just smells funny...
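(To make the MMIO-vs-sysreg distinction above concrete, here is a minimal
sketch; it is not taken from any driver. The function names and the
shared/doorbell parameters are made up for illustration, while dmb()/dsb(),
writel_relaxed() and gic_write_sgi1r() are, as far as I can tell, the existing
arm64 kernel primitives being discussed.)

#include <linux/compiler.h>
#include <linux/io.h>
#include <asm/barrier.h>
#include <asm/arch_gicv3.h>

/*
 * MMIO-signalled IPI: the "doorbell" is itself a store to Device memory,
 * so an inner-shareable DMB (what smp_wmb() expands to on arm64) is enough
 * to order the Normal-memory store before it.
 */
static void ipi_via_mmio_doorbell(void __iomem *doorbell, int *shared)
{
	WRITE_ONCE(*shared, 1);		/* payload the target CPU will read */
	dmb(ishst);			/* store-vs-store ordering suffices here */
	writel_relaxed(1, doorbell);	/* memory-mapped doorbell write */
}

/*
 * GICv3 sysreg-signalled IPI: the "doorbell" is a write to ICC_SGI1R_EL1, a
 * system register with side effects, not a memory access. A DMB does not
 * order memory accesses against it, hence the DSB in gic_ipi_send_mask();
 * the patch only narrows its scope from full-system to inner shareable.
 */
static void ipi_via_sgi_register(u64 sgi1r_val, int *shared)
{
	WRITE_ONCE(*shared, 1);		/* payload the target CPU will read */
	dsb(ishst);			/* complete the store before the sysreg write */
	gic_write_sgi1r(sgi1r_val);	/* writes ICC_SGI1R_EL1 */
}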
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Russell King (Oracle) @ 2022-02-20 15:05 UTC
To: Ard Biesheuvel
Cc: Marc Zyngier, Barry Song, Thomas Gleixner, Will Deacon, Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Sun, Feb 20, 2022 at 02:30:24PM +0100, Ard Biesheuvel wrote:
> On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
> >
> > On 2022-02-18 21:55, Barry Song wrote:
> > > dsb(ishst) should be enough here as we only need to guarantee the
> > > visibility of data to other CPUs in smp inner domain before we
> > > send the ipi.
> > >
> > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > ---
> > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > >
> > > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> > > index 5e935d97207d..0efe1a9a9f3b 100644
> > > --- a/drivers/irqchip/irq-gic-v3.c
> > > +++ b/drivers/irqchip/irq-gic-v3.c
> > > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > >  	 * Ensure that stores to Normal memory are visible to the
> > >  	 * other CPUs before issuing the IPI.
> > >  	 */
> > > -	wmb();
> > > +	dsb(ishst);
> > >
> > >  	for_each_cpu(cpu, mask) {
> > >  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
> >
> > I'm not opposed to that change, but I'm pretty curious whether this makes
> > any visible difference in practice. Could you measure the effect of this
> > change for any sort of IPI heavy workload?
> >
>
> Does this have to be a DSB ?

Are you suggesting that smp_wmb() may suffice (which is a dmb(ishst)) ?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Barry Song @ 2022-02-20 20:09 UTC
To: Russell King (Oracle)
Cc: Ard Biesheuvel, Marc Zyngier, Thomas Gleixner, Will Deacon, Linux Kernel Mailing List, Linux ARM, Linuxarm, Barry Song

On Mon, Feb 21, 2022 at 4:05 AM Russell King (Oracle)
<linux@armlinux.org.uk> wrote:
>
> On Sun, Feb 20, 2022 at 02:30:24PM +0100, Ard Biesheuvel wrote:
> > On Sat, 19 Feb 2022 at 10:57, Marc Zyngier <maz@kernel.org> wrote:
> > >
> > > On 2022-02-18 21:55, Barry Song wrote:
> > > > dsb(ishst) should be enough here as we only need to guarantee the
> > > > visibility of data to other CPUs in smp inner domain before we
> > > > send the ipi.
> > > >
> > > > Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> > > > ---
> > > >  drivers/irqchip/irq-gic-v3.c | 2 +-
> > > >  1 file changed, 1 insertion(+), 1 deletion(-)
> > > >
> > > > diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> > > > index 5e935d97207d..0efe1a9a9f3b 100644
> > > > --- a/drivers/irqchip/irq-gic-v3.c
> > > > +++ b/drivers/irqchip/irq-gic-v3.c
> > > > @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> > > >  	 * Ensure that stores to Normal memory are visible to the
> > > >  	 * other CPUs before issuing the IPI.
> > > >  	 */
> > > > -	wmb();
> > > > +	dsb(ishst);
> > > >
> > > >  	for_each_cpu(cpu, mask) {
> > > >  		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));
> > >
> > > I'm not opposed to that change, but I'm pretty curious whether this makes
> > > any visible difference in practice. Could you measure the effect of this
> > > change for any sort of IPI heavy workload?
> > >
> >
> > Does this have to be a DSB ?
>
> Are you suggesting that smp_wmb() may suffice (which is a dmb(ishst)) ?

Usually smp_wmb() is OK; for example, drivers/irqchip/irq-bcm2836.c:

static void bcm2836_arm_irqchip_ipi_send_mask(struct irq_data *d,
					      const struct cpumask *mask)
{
	int cpu;
	void __iomem *mailbox0_base = intc.base + LOCAL_MAILBOX0_SET0;

	/*
	 * Ensure that stores to normal memory are visible to the
	 * other CPUs before issuing the IPI.
	 */
	smp_wmb();

	for_each_cpu(cpu, mask)
		writel_relaxed(BIT(d->hwirq), mailbox0_base + 16 * cpu);
}

But the instruction that sends the IPI in irq-gic-v3.c isn't a store, so this
driver was changed by the commit below from dmb(ishst) to dsb(st):

commit 21ec30c0ef5234fb1039cc7c7737d885bf875a9e
Author: Shanker Donthineni <shankerd@codeaurora.org>

    irqchip/gic-v3: Use wmb() instead of smb_wmb() in gic_raise_softirq()

    A DMB instruction can be used to ensure the relative order of only
    memory accesses before and after the barrier. Since writes to system
    registers are not memory operations, barrier DMB is not sufficient
    for observability of memory accesses that occur before ICC_SGI1R_EL1
    writes.

    A DSB instruction ensures that no instructions that appear in program
    order after the DSB instruction, can execute until the DSB instruction
    has completed.

...
diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
index d71be9a1f9d2..d99cc07903ec 100644
--- a/drivers/irqchip/irq-gic-v3.c
+++ b/drivers/irqchip/irq-gic-v3.c
@@ -688,7 +688,7 @@ static void gic_raise_softirq(const struct cpumask *mask, unsigned int irq)
 	 * Ensure that stores to Normal memory are visible to the
 	 * other CPUs before issuing the IPI.
 	 */
-	smp_wmb();
+	wmb();
 
 	for_each_cpu(cpu, mask) {
 		u64 cluster_id = MPIDR_TO_SGI_CLUSTER_ID(cpu_logical_map(cpu));

My understanding is that the full-system dsb(st) is too strong, as in this
case we only need the data to be observed by the other CPUs in the SMP inner
shareable domain, just as smp_wmb() achieves in other drivers with
dmb(ishst); we are not requiring the data to be observed by everyone in the
system.

>
> --
> RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
> FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!

Thanks
Barry
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Russell King (Oracle) @ 2022-02-20 15:04 UTC
To: Barry Song
Cc: maz, tglx, will, linux-kernel, linux-arm-kernel, linuxarm, Barry Song

On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> dsb(ishst) should be enough here as we only need to guarantee the
> visibility of data to other CPUs in smp inner domain before we
> send the ipi.
>
> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> ---
>  drivers/irqchip/irq-gic-v3.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> index 5e935d97207d..0efe1a9a9f3b 100644
> --- a/drivers/irqchip/irq-gic-v3.c
> +++ b/drivers/irqchip/irq-gic-v3.c
> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
>  	 * Ensure that stores to Normal memory are visible to the
>  	 * other CPUs before issuing the IPI.
>  	 */
> -	wmb();
> +	dsb(ishst);

On ARM, wmb() is a dsb(st) followed by other operations which may
include a sync operation at the L2 cache, and SoC specific barriers
for the bus. Hopefully, nothing will break if the sledge hammer is
replaced by the tack hammer.

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 40Mbps down 10Mbps up. Decent connectivity at last!
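(For readers unfamiliar with the 32-bit ARM detail Russell refers to, this is
roughly what wmb() expands to there; it is abridged and reproduced from memory
of arch/arm/include/asm/barrier.h and arch/arm/mm/flush.c around that time, so
treat it as a sketch rather than an authoritative quote.)

/* arch/arm/include/asm/barrier.h (abridged, from memory) */
#ifdef CONFIG_ARM_HEAVY_MB
extern void arm_heavy_mb(void);
#define __arm_heavy_mb(x...)	do { dsb(x); arm_heavy_mb(); } while (0)
#else
#define __arm_heavy_mb(x...)	dsb(x)
#endif

#define wmb()	__arm_heavy_mb(st)	/* dsb(st) plus optional L2/SoC work */

/* arch/arm/mm/flush.c (abridged, from memory): the extra work behind the dsb */
#ifdef CONFIG_ARM_HEAVY_MB
void (*soc_mb)(void);

void arm_heavy_mb(void)
{
#ifdef CONFIG_OUTER_CACHE_SYNC
	if (outer_cache.sync)
		outer_cache.sync();	/* e.g. PL310 outer-cache sync */
#endif
	if (soc_mb)
		soc_mb();		/* SoC-specific bus barrier */
}
#endif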
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Marc Zyngier @ 2022-02-20 15:21 UTC
To: Russell King (Oracle)
Cc: Barry Song, tglx, will, linux-kernel, linux-arm-kernel, linuxarm, Barry Song

On 2022-02-20 15:04, Russell King (Oracle) wrote:
> On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
>> dsb(ishst) should be enough here as we only need to guarantee the
>> visibility of data to other CPUs in smp inner domain before we
>> send the ipi.
>>
>> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
>> ---
>>  drivers/irqchip/irq-gic-v3.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
>> index 5e935d97207d..0efe1a9a9f3b 100644
>> --- a/drivers/irqchip/irq-gic-v3.c
>> +++ b/drivers/irqchip/irq-gic-v3.c
>> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
>>  	 * Ensure that stores to Normal memory are visible to the
>>  	 * other CPUs before issuing the IPI.
>>  	 */
>> -	wmb();
>> +	dsb(ishst);
>
> On ARM, wmb() is a dsb(st) followed by other operations which may
> include a sync operation at the L2 cache, and SoC specific barriers
> for the bus. Hopefully, nothing will break if the sledge hammer is
> replaced by the tack hammer.

The saving grace is that ARMv8 forbids (as per D4.4.11) these SW-visible,
non-architected caches (something like PL310 simply isn't allowed). Given
that GICv3 requires ARMv8 in the first place, we should be OK.

As for SoC-specific bus barriers, I don't know of any that would affect
an ARMv8-based SoC. But I'm always prepared to be badly surprised...

M.
-- 
Jazz is not dead. It just smells funny...
* Re: [PATCH] irqchip/gic-v3: use dsb(ishst) to synchronize data to smp before issuing ipi
From: Barry Song @ 2022-02-20 20:20 UTC
To: Marc Zyngier
Cc: Russell King (Oracle), Thomas Gleixner, Will Deacon, LKML, LAK, Linuxarm, Barry Song

On Mon, Feb 21, 2022 at 4:21 AM Marc Zyngier <maz@kernel.org> wrote:
>
> On 2022-02-20 15:04, Russell King (Oracle) wrote:
> > On Sat, Feb 19, 2022 at 05:55:49AM +0800, Barry Song wrote:
> >> dsb(ishst) should be enough here as we only need to guarantee the
> >> visibility of data to other CPUs in smp inner domain before we
> >> send the ipi.
> >>
> >> Signed-off-by: Barry Song <song.bao.hua@hisilicon.com>
> >> ---
> >>  drivers/irqchip/irq-gic-v3.c | 2 +-
> >>  1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/irqchip/irq-gic-v3.c b/drivers/irqchip/irq-gic-v3.c
> >> index 5e935d97207d..0efe1a9a9f3b 100644
> >> --- a/drivers/irqchip/irq-gic-v3.c
> >> +++ b/drivers/irqchip/irq-gic-v3.c
> >> @@ -1211,7 +1211,7 @@ static void gic_ipi_send_mask(struct irq_data *d, const struct cpumask *mask)
> >>  	 * Ensure that stores to Normal memory are visible to the
> >>  	 * other CPUs before issuing the IPI.
> >>  	 */
> >> -	wmb();
> >> +	dsb(ishst);
> >
> > On ARM, wmb() is a dsb(st) followed by other operations which may
> > include a sync operation at the L2 cache, and SoC specific barriers
> > for the bus. Hopefully, nothing will break if the sledge hammer is
> > replaced by the tack hammer.
>
> The saving grace is that ARMv8 forbids (as per D4.4.11) these SW-visible,
> non-architected caches (something like PL310 simply isn't allowed). Given
> that GICv3 requires ARMv8 in the first place, we should be OK.
>
> As for SoC-specific bus barriers, I don't know of any that would affect
> an ARMv8-based SoC. But I'm always prepared to be badly surprised...

I assume we have been safe, since dsb(ish) has been widely used for SMP
data visibility on arm64:

arch/arm64/include/asm/assembler.h:	dsb	ish
arch/arm64/include/asm/cacheflush.h:	dsb(ish);
arch/arm64/include/asm/pgtable.h:	dsb(ishst);
arch/arm64/include/asm/pgtable.h:	dsb(ishst);
arch/arm64/include/asm/pgtable.h:	dsb(ishst);
arch/arm64/include/asm/pgtable.h:	dsb(ishst);
arch/arm64/include/asm/pgtable.h:	 * is doing a dsb(ishst), but that penalises the fastpath.
arch/arm64/include/asm/smp.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	"dsb ish\n tlbi " #op,				\
arch/arm64/include/asm/tlbflush.h:	"dsb ish\n tlbi " #op ", %0",			\
arch/arm64/include/asm/tlbflush.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	dsb(ish);
arch/arm64/include/asm/tlbflush.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	dsb(ish);
arch/arm64/include/asm/tlbflush.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	dsb(ish);
arch/arm64/include/asm/tlbflush.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	dsb(ish);
arch/arm64/include/asm/tlbflush.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	dsb(ish);
arch/arm64/include/asm/tlbflush.h:	dsb(ishst);
arch/arm64/include/asm/tlbflush.h:	dsb(ish);
arch/arm64/kernel/alternative.c:	dsb(ish);
arch/arm64/kernel/entry.S:	dsb	ish
arch/arm64/kernel/head.S:	dsb	ishst	// Make zero page visible to PTW
arch/arm64/kernel/hibernate-asm.S:	dsb	ish	/* wait for PoU cleaning to finish */
arch/arm64/kernel/hibernate-asm.S:	dsb	ish
arch/arm64/kernel/mte.c:	dsb(ish);
arch/arm64/kernel/process.c:	dsb(ish);
arch/arm64/kernel/sys_compat.c:	dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ishst);
arch/arm64/kvm/hyp/nvhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ish);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ishst);
arch/arm64/kvm/hyp/pgtable.c:	dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ish);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ishst);
arch/arm64/kvm/hyp/vhe/tlb.c:	dsb(ish);
arch/arm64/mm/cache.S:	dsb	ishst
arch/arm64/mm/cache.S:	dsb	ishst
arch/arm64/mm/kasan_init.c:	dsb(ishst);
arch/arm64/mm/mmu.c:	 * We need dsb(ishst) here to ensure the page-table-walker sees
arch/arm64/mm/mmu.c:	dsb(ishst);
arch/arm64/mm/proc.S:	dsb	ish
drivers/irqchip/irq-gic-v3-its.c:	dsb(ishst);
drivers/irqchip/irq-gic-v3-its.c:	dsb(ishst);

Plus, is this even a problem for 32-bit arm? arm only applies the SoC-level
barrier to the system-level (mandatory) memory barriers, not to the SMP-level
ones:

#if defined(CONFIG_ARM_DMA_MEM_BUFFERABLE) || defined(CONFIG_SMP)
#define mb()		__arm_heavy_mb()
#define rmb()		dsb()
#define wmb()		__arm_heavy_mb(st)
#define dma_rmb()	dmb(osh)
#define dma_wmb()	dmb(oshst)
#else
#define mb()		barrier()
#define rmb()		barrier()
#define wmb()		barrier()
#define dma_rmb()	barrier()
#define dma_wmb()	barrier()
#endif

#define __smp_mb()	dmb(ish)
#define __smp_rmb()	__smp_mb()
#define __smp_wmb()	dmb(ishst)

I don't see arm requiring a SoC-level barrier for the SMP-level barriers,
even though the mandatory barriers use __arm_heavy_mb(), which includes the
SoC-level barrier. In this particular case we are asking for SMP-level data
visibility, not system-level, so I am not quite sure Russell's concern is
relevant here.

> M.
> --
> Jazz is not dead. It just smells funny...

Thanks
Barry
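(Barry quotes the 32-bit ARM definitions above. For the arm64 GICv3 case the
patch actually targets, the relevant expansions, listed here from memory of
arch/arm64/include/asm/barrier.h in kernels of that era and worth
double-checking against the tree you care about, are approximately:)

/* arm64 barrier expansions relevant to this thread (abridged, from memory) */
#define mb()		dsb(sy)		/* full system, loads and stores */
#define wmb()		dsb(st)		/* full system, stores only: what gic_ipi_send_mask() used before */
#define __smp_mb()	dmb(ish)	/* inner shareable, loads and stores */
#define __smp_wmb()	dmb(ishst)	/* inner shareable, stores only: cannot order the sysreg write */
/*
 * The patch keeps a DSB (needed to order the ICC_SGI1R_EL1 system register
 * write, per the discussion above) but limits its scope to the inner
 * shareable domain: dsb(ishst).
 */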