From mboxrd@z Thu Jan 1 00:00:00 1970 Content-Type: multipart/mixed; boundary="===============2808864947957115746==" MIME-Version: 1.0 From: Andrea Arcangeli To: lkp@lists.01.org Subject: Re: [mm] 09bc0443e9: will-it-scale.per_thread_ops -7.2% regression Date: Sat, 08 May 2021 19:12:14 -0400 Message-ID: In-Reply-To: <82458eba-7ecf-018a-7e73-671199c8794a@linux.intel.com> List-Id: --===============2808864947957115746== Content-Type: text/plain; charset="utf-8" MIME-Version: 1.0 Content-Transfer-Encoding: quoted-printable On Sat, May 08, 2021 at 10:58:34AM +0800, Xing Zhengjun wrote: > Hi Andrea, > = > =C2=A0=C2=A0=C2=A0 I re-test it more than 10 times, the regression still= exists. Probably I didn't measure it because for me they may have always landed in the same cacheline even before the patch was applied. Ouch btw looking at flush_tlb_pending I also noticed I forgot to revert 6ce64428d62026a10cb5d80138ff2f90cc21d367, that has to be added to the pile of SMP unscalable upstream code to be reverted on my tree to improve performance even more to the wp fault with uffd-wp. > =C2=A0=C2=A0=C2=A0 0d2285c62210(mm: gup: pack has_pinned in MMF_HAS_PINN= ED) removed = > "atomic_t has_pinned" from "struct mm_struct", it changed the "struct = > mm_struct" cache line layout (please see in the attached pahole dumped = > results), in its parent commit, "spinlock_t page_table_lock" and "struct = > rw_semaphore mmap_lock" in two different cache line, but in = > 0d2285c62210, they may in the same cache line, "mm_struct" is widely = > used, it may have potential contention issue, so I apply the following = > patch change the layout, retest the regression is gone (very small, -0.9%= ). Agreed, the accidental cacheline alignment change is the only thing that could explain a performance difference, except it didn't make a difference for me. I applied this, this should fix your issue and it is going to work without depending on .config options (see all fields before mmap_lock, they vary depending on config/arch). I'm not really sure the problem is the page_table_lock even, it may be just false sharing on the fields at the top of the vma. The best I recommend to try also in addition of the below is to move this line: + struct rw_semaphore mmap_lock ____cacheline_aligned_in_smp; at the very end. so it gets 64 bytes alone and reduces all false sharing to zero. That cacheline is heavily heavily contended during the mmap test. so it may be worth wasting 64bytes per vma even... The below wastes in average not more than 32 bytes. >>From 99a2452759bb5cfb2a77405d7db42968bf7e7509 Mon Sep 17 00:00:00 2001 From: Andrea Arcangeli Date: Sat, 8 May 2021 18:54:34 -0400 Subject: [PATCH 1/1] mm: cacheline alignment for page_table_lock and mmap_l= ock It's important to keep the two fields in two different cachelines and this will enforce it. The kernel test robot reported a per_thread_ops -7.2% regression for the will-it-scale mmap2 threaded testcase when the two fields accidentally landed in the same cacheline. Reported-by: kernel test robot Signed-off-by: Andrea Arcangeli --- include/linux/mm_types.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 7fc7a3320ad9..714b4c3e2a7c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -456,7 +456,7 @@ struct mm_struct { spinlock_t page_table_lock; /* Protects page tables and some * counters */ - struct rw_semaphore mmap_lock; + struct rw_semaphore mmap_lock ____cacheline_aligned_in_smp; = struct list_head mmlist; /* List of maybe swapped mm's. These * are globally strung together off --===============2808864947957115746==--