From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============1637574556238539969==" MIME-Version: 1.0 From: Xing Zhengjun To: lkp@lists.01.org Subject: Re: [mm] 09bc0443e9: will-it-scale.per_thread_ops -7.2% regression Date: Tue, 11 May 2021 16:43:27 +0800 Message-ID: <4111dbd3-164a-2c67-04f2-cee2aa950b04@linux.intel.com> In-Reply-To: List-Id: --===============1637574556238539969== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable Hi Andrea, On 5/11/2021 10:35 AM, Andrea Arcangeli wrote: > Hi Xing, > = > On Tue, May 11, 2021 at 09:34:32AM +0800, Xing Zhengjun wrote: >> Hi Andrea, >> >> On 5/9/2021 7:12 AM, Andrea Arcangeli wrote: >>> On Sat, May 08, 2021 at 10:58:34AM +0800, Xing Zhengjun wrote: >>>> Hi Andrea, >>>> >>>> =C2=A0=C2=A0=C2=A0 I re-test it more than 10 times, the regression = still exists. >>> >>> Probably I didn't measure it because for me they may have always >>> landed in the same cacheline even before the patch was applied. >>> >>> Ouch btw looking at flush_tlb_pending I also noticed I forgot to revert >>> 6ce64428d62026a10cb5d80138ff2f90cc21d367, that has to be added to the >>> pile of SMP unscalable upstream code to be reverted on my tree to >>> improve performance even more to the wp fault with uffd-wp. >>> >>>> =C2=A0=C2=A0=C2=A0 0d2285c62210(mm: gup: pack has_pinned in MMF_HAS= _PINNED) removed >>>> "atomic_t has_pinned" from "struct mm_struct", it changed the "struct >>>> mm_struct" cache line layout (please see in the attached pahole dumped >>>> results), in its parent commit, "spinlock_t page_table_lock" and "stru= ct >>>> rw_semaphore mmap_lock" in two different cache line, but in >>>> 0d2285c62210, they may in the same cache line, "mm_struct" is widely >>>> used, it may have potential contention issue, so I apply the following >>>> patch change the layout, retest the regression is gone (very small, -0= .9%). >>> >>> Agreed, the accidental cacheline alignment change is the only thing >>> that could explain a performance difference, except it didn't make a >>> difference for me. >>> >>> I applied this, this should fix your issue and it is going to work >>> without depending on .config options (see all fields before mmap_lock, >>> they vary depending on config/arch). >>> >>> I'm not really sure the problem is the page_table_lock even, it may be >>> just false sharing on the fields at the top of the vma. >>> >>> The best I recommend to try also in addition of the below is to move >>> this line: >>> >>> + struct rw_semaphore mmap_lock ____cacheline_aligned_in_smp; >>> >>> at the very end. so it gets 64 bytes alone and reduces all false >>> sharing to zero. That cacheline is heavily heavily contended during >>> the mmap test. so it may be worth wasting 64bytes per vma even... >>> >>> The below wastes in average not more than 32 bytes. >>> >>> >From 99a2452759bb5cfb2a77405d7db42968bf7e7509 Mon Sep 17 00:00:00 2001 >>> From: Andrea Arcangeli >>> Date: Sat, 8 May 2021 18:54:34 -0400 >>> Subject: [PATCH 1/1] mm: cacheline alignment for page_table_lock and mm= ap_lock >>> >>> It's important to keep the two fields in two different cachelines and >>> this will enforce it. >>> >>> The kernel test robot reported a per_thread_ops -7.2% regression for >>> the will-it-scale mmap2 threaded testcase when the two fields >>> accidentally landed in the same cacheline. >>> >>> Reported-by: kernel test robot >>> Signed-off-by: Andrea Arcangeli >>> --- >>> include/linux/mm_types.h | 2 +- >>> 1 file changed, 1 insertion(+), 1 deletion(-) >>> >>> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h >>> index 7fc7a3320ad9..714b4c3e2a7c 100644 >>> --- a/include/linux/mm_types.h >>> +++ b/include/linux/mm_types.h >>> @@ -456,7 +456,7 @@ struct mm_struct { >>> spinlock_t page_table_lock; /* Protects page tables and some >>> * counters >>> */ >>> - struct rw_semaphore mmap_lock; >>> + struct rw_semaphore mmap_lock ____cacheline_aligned_in_smp; >>> = >>> struct list_head mmlist; /* List of maybe swapped mm's. These >>> * are globally strung together off >>> >> >> I test the patch, the regression is gone and it gets +2.1% improvement. > = > That's nice, thanks for running the tests so fast! Very interesting. > = > So if we call above the test patch 1), I'd be now curious to know what > happens with patch 2) and patch 3) below if applied on top of patch > 1. (or on top of current aa.git main branch which already includes > patch 1 since yesterday) > = > this would be test patch 2) > = > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 714b4c3e2a7c..f463ca223cad 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -392,7 +392,8 @@ struct core_state { > struct kioctx_table; > struct mm_struct { > struct { > - struct vm_area_struct *mmap; /* list of VMAs */ > + struct rw_semaphore mmap_lock; > + struct vm_area_struct *mmap ____cacheline_aligned_in_smp; /* list of = VMAs */ > struct rb_root mm_rb; > u64 vmacache_seqnum; /* per-thread vmacache */ > #ifdef CONFIG_MMU > @@ -456,7 +457,6 @@ struct mm_struct { > spinlock_t page_table_lock; /* Protects page tables and some > * counters > */ > - struct rw_semaphore mmap_lock ____cacheline_aligned_in_smp; > = > struct list_head mmlist; /* List of maybe swapped mm's. These > * are globally strung together off > = > = > = > this would be test patch 3) > = > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > index 714b4c3e2a7c..dcafca22374e 100644 > --- a/include/linux/mm_types.h > +++ b/include/linux/mm_types.h > @@ -392,7 +392,10 @@ struct core_state { > struct kioctx_table; > struct mm_struct { > struct { > - struct vm_area_struct *mmap; /* list of VMAs */ > + spinlock_t page_table_lock; /* Protects page tables and some > + * counters > + */ > + struct vm_area_struct *mmap ____cacheline_aligned_in_smp; /* list of = VMAs */ > struct rb_root mm_rb; > u64 vmacache_seqnum; /* per-thread vmacache */ > #ifdef CONFIG_MMU > @@ -453,10 +456,7 @@ struct mm_struct { > #endif > int map_count; /* number of VMAs */ > = > - spinlock_t page_table_lock; /* Protects page tables and some > - * counters > - */ > - struct rw_semaphore mmap_lock ____cacheline_aligned_in_smp; > + struct rw_semaphore mmap_lock; > = > struct list_head mmlist; /* List of maybe swapped mm's. These > * are globally strung together off > = > = > Test patch 3 is just in case, to verify the relation with the > page_table_lock. I don't expect a significant benefit, maybe the > regression will return in fact since mmap_lock will land back in the > previous cacheline as when the regression started happening. > = > My guess is the fastest will be test patch 2, but it costs in average > 32 bytes more than patch 1. > = > Unless patch 2 will be significantly better than patch 1, it's not > worth it and we can keep patch 1 which is quite nice separating the > two locks. Separating the two locks is the most important in general. > = > This is a pretty useful benchmark: in some important enterprise > production workloads the mmap_lock is the ultimate contention point, > because even when it's a read lock the cacheline keeps bouncing around. > = > Thanks! > Andrea > = I test the patches, patch 1 gets 2.1% improvement, patch 2 gets 3.6% = improvement, patch 3 gets -7.0% regression. =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_gover= nor/ucode/debug-setup: = lkp-csl-2sp9/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/= gcc-9/100%/thread/mmap2/performance/0x5003006/test commit: cf379a4e0005ff9a4e734901aab9eeafe28c4eb6 0d2285c622103f0314ced7485c3b5b43f870c2d3 4e1291d70e101e7e1064ca814288a003171d75c7 (patch1) 3d5791b79590c6d57b9a66be7d4ec33bfa5027fc (patch2) 5f65daa0dc701eeae20d29b0521dd11becd50239 (patch3) cf379a4e0005ff9a 0d2285c622103f0314ced7485c3 4e1291d70e101e7e1064ca81428 = 3d5791b79590c6d57b9a66be7d4 5f65daa0dc701eeae20d29b0521 ---------------- --------------------------- --------------------------- = --------------------------- --------------------------- %stddev %change %stddev %change = %stddev %change %stddev %change %stddev \ | \ | \ = | \ | \ 204106 =C2=B1 2% -5.6% 192733 +2.1% 208489 = +3.6% 211498 -7.0% 189797 =C2=B1 2% = will-it-scale.88.threads 3.13 =C2=B1 2% +25.1% 3.91 =C2=B1 2% -1.3% 3.= 08 =C2=B1 = 3% -1.8% 3.07 +25.9% 3.94 =C2=B1 2% = will-it-scale.88.threads_idle 2319 =C2=B1 2% -5.6% 2189 +2.1% 2368 = +3.6% 2403 -7.0% 2156 =C2=B1 2% = will-it-scale.per_thread_ops 204106 =C2=B1 2% -5.6% 192733 +2.1% 208489 = +3.6% 211498 -7.0% 189797 =C2=B1 2% = will-it-scale.workload >> >> =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D= =3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D=3D >> tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_go= vernor/ucode/debug-setup: >> = >> lkp-csl-2sp9/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8= .3/gcc-9/100%/thread/mmap2/performance/0x5003006/test >> >> commit: >> cf379a4e0005ff9a4e734901aab9eeafe28c4eb6 >> 0d2285c622103f0314ced7485c3b5b43f870c2d3 >> 4e1291d70e101e7e1064ca814288a003171d75c7 (test patch) >> >> cf379a4e0005ff9a 0d2285c622103f0314ced7485c3 4e1291d70e101e7e1064ca81428 >> ---------------- --------------------------- --------------------------- >> %stddev %change %stddev %change %stdd= ev >> \ | \ | \ >> 204106 =C2=B1 2% -5.6% 192733 +2.1% 208489 >> will-it-scale.88.threads >> 3.13 =C2=B1 2% +25.1% 3.91 =C2=B1 2% -1.3% = 3.08 =C2=B1 >> 3% will-it-scale.88.threads_idle >> 2319 =C2=B1 2% -5.6% 2189 +2.1% 2368 >> will-it-scale.per_thread_ops >> 204106 =C2=B1 2% -5.6% 192733 +2.1% 208489 >> will-it-scale.workload >> >> >> >> -- = >> Zhengjun Xing >> > = -- = Zhengjun Xing --===============1637574556238539969==--