From: John Hubbard <jhubbard@nvidia.com>
To: Feng Tang <feng.tang@intel.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Jason Gunthorpe <jgg@nvidia.com>
Cc: kernel test robot <oliver.sang@intel.com>,
Jan Kara <jack@suse.cz>, "Peter Xu" <peterx@redhat.com>,
Andrea Arcangeli <aarcange@redhat.com>,
"Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>,
Christoph Hellwig <hch@lst.de>, "Hugh Dickins" <hughd@google.com>,
Jann Horn <jannh@google.com>,
Kirill Shutemov <kirill@shutemov.name>,
Kirill Tkhai <ktkhai@virtuozzo.com>,
Leon Romanovsky <leonro@nvidia.com>,
Michal Hocko <mhocko@suse.com>, Oleg Nesterov <oleg@redhat.com>,
Andrew Morton <akpm@linux-foundation.org>,
LKML <linux-kernel@vger.kernel.org>, <lkp@lists.01.org>,
kernel test robot <lkp@intel.com>,
"Huang, Ying" <ying.huang@intel.com>, <zhengjun.xing@intel.com>
Subject: Re: [mm/gup] 57efa1fe59: will-it-scale.per_thread_ops -9.2% regression
Date: Fri, 4 Jun 2021 10:58:14 -0700
Message-ID: <f35ae155-d73c-eca4-b950-58589a76addc@nvidia.com>
In-Reply-To: <20210604075220.GA40621@shbuild999.sh.intel.com>
On 6/4/21 12:52 AM, Feng Tang wrote:
...
>>> The perf data doesn't even mention any of the GUP paths, and on the
>>> pure fork path the biggest impact would be:
>>>
>>> (a) maybe "struct mm_struct" changed in size or had a different cache layout
>>
>> Yes, this seems to be the cause of the regression.
>>
>> In the test case, many threads are doing map/unmap at the same time,
>> so the process's rw_semaphore 'mmap_lock' is highly contended.
>>
>> Before the patch (with 0day's kconfig), the mmap_lock is split
>> across 2 cachelines: the 'count' is in one line, and the other members
>> sit in the next line, so it luckily avoids some cache bouncing. After
Wow! That's quite a fortunate layout to land on by accident. Almost
makes me wonder if mmap_lock should be designed to do that, but it's
probably even better to just keep working on having a less contended
mmap_lock.
I *suppose* it's worth trying to keep this fragile layout in place,
but it is a landmine for anyone who touches mm_struct. And the struct
is so large already that I'm not sure a comment warning would even
be noticed. Anyway...
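
If someone did want to guard the layout, one option is a compile-time check rather than a comment. Below is a minimal userspace sketch of that idea, not kernel code: `demo_rwsem`, `demo_mm_before`, `demo_mm_after`, `demo_line_of()` and `demo_count_is_isolated()` are all hypothetical names, the structs are miniatures that only reproduce the offsets 120 and 128 from the pahole output quoted in this thread, and it assumes an LP64 target with 64-byte cache lines.

```c
#include <stddef.h>

/* Assumed line size: the quoted pahole output puts cacheline boundary 2 at 128 bytes. */
#define DEMO_CACHELINE 64

/* Hypothetical stand-in for struct rw_semaphore: 40 bytes on LP64, 'count' first. */
struct demo_rwsem {
	long  count;         /* offset  0, the heavily contended word */
	long  owner;         /* offset  8 */
	int   osq;           /* offset 16 */
	int   wait_lock;     /* offset 20 */
	void *wait_list[2];  /* offset 24 */
};

/* Miniature of the "before" layout: the lock starts at offset 120. */
struct demo_mm_before {
	char pad[116];
	int page_table_lock;          /* offset 116 */
	struct demo_rwsem mmap_lock;  /* offset 120 */
};

/* Miniature of the "after" layout: the lock starts at offset 128. */
struct demo_mm_after {
	char pad[124];
	int page_table_lock;          /* offset 124 */
	struct demo_rwsem mmap_lock;  /* offset 128 */
};

/* Index of the cache line that contains a given byte offset. */
static inline size_t demo_line_of(size_t offset)
{
	return offset / DEMO_CACHELINE;
}

/* Returns 1 when 'count' and 'owner' land on different cache lines. */
static inline int demo_count_is_isolated(size_t lock_offset)
{
	size_t count_off = lock_offset + offsetof(struct demo_rwsem, count);
	size_t owner_off = lock_offset + offsetof(struct demo_rwsem, owner);

	return demo_line_of(count_off) != demo_line_of(owner_off);
}

/* The guard: break the build if 'count' stops being the last word of its line. */
_Static_assert(offsetof(struct demo_mm_before, mmap_lock) % DEMO_CACHELINE
	       == DEMO_CACHELINE - sizeof(long),
	       "demo: mmap_lock.count is expected to end a cache line");
```

With these miniatures, `demo_count_is_isolated()` reports a split for the lock at offset 120 and no split at offset 128. Whether a hard build failure is the right response in the real struct is a separate question, since the offsets depend on the kconfig.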
>> the patch, the 'mmap_lock' is pushed into one cacheline, which may
>> cause the regression.
>>
>> Below is the pahole info:
>>
>> - before the patch
>>
>> spinlock_t page_table_lock; /* 116 4 */
>> struct rw_semaphore mmap_lock; /* 120 40 */
>> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
>> struct list_head mmlist; /* 160 16 */
>> long unsigned int hiwater_rss; /* 176 8 */
>>
>> - after the patch
>>
>> spinlock_t page_table_lock; /* 124 4 */
>> /* --- cacheline 2 boundary (128 bytes) --- */
>> struct rw_semaphore mmap_lock; /* 128 40 */
>> struct list_head mmlist; /* 168 16 */
>> long unsigned int hiwater_rss; /* 184 8 */
>>
>> perf c2c log can also confirm this.
>
> We've tried a patch that recovers the regression. The newly added
> member 'write_protect_seq' is 4 bytes long, and putting it into an
> existing 4-byte hole recovers the regression while not affecting
> most other members' alignment. Please review the following patch,
> thanks!
>
So, this is a neat little solution, if we agree that it's worth "fixing".
I'm definitely on the fence, but leaning toward, "go for it", because
I like the "no cache effect" result of using up the hole.
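
To make the "no cache effect" point concrete, here is a small userspace sketch, again assuming LP64. The three structs are hypothetical miniatures and not the real mm_struct: an `int` stands in for a 4-byte spinlock_t and an `unsigned int` for a 4-byte seqcount_t (the CONFIG_DEBUG_LOCK_ALLOC=n case mentioned in the patch).

```c
#include <stddef.h>

/* Hypothetical miniature with a 4-byte hole after arg_lock. */
struct demo_with_hole {
	unsigned long def_flags;   /* offset  0 */
	int           arg_lock;    /* offset  8 */
	/* 4-byte hole: start_code needs 8-byte alignment */
	unsigned long start_code;  /* offset 16 */
};

/* Same region with the new 4-byte member placed so that the hole is used up. */
struct demo_hole_filled {
	unsigned long def_flags;          /* offset  0 */
	unsigned int  write_protect_seq;  /* offset  8 */
	int           arg_lock;           /* offset 12 */
	unsigned long start_code;         /* offset 16, unchanged */
};

/* New member added earlier instead: everything after it moves by 8 bytes. */
struct demo_inserted_early {
	unsigned int  write_protect_seq;  /* offset  0, followed by 4 bytes of padding */
	unsigned long def_flags;          /* offset  8 */
	int           arg_lock;           /* offset 16 */
	unsigned long start_code;         /* offset 24 */
};

/* Filling the hole leaves later offsets and the total size alone. */
_Static_assert(offsetof(struct demo_hole_filled, start_code) ==
	       offsetof(struct demo_with_hole, start_code),
	       "demo: start_code should not move when the hole is filled");
_Static_assert(sizeof(struct demo_hole_filled) == sizeof(struct demo_with_hole),
	       "demo: struct size should not change when the hole is filled");

/* Adding the member elsewhere pushes later members down. */
_Static_assert(offsetof(struct demo_inserted_early, start_code) ==
	       offsetof(struct demo_with_hole, start_code) + 8,
	       "demo: start_code moves by 8 bytes when the member is added up front");
```

Only `arg_lock` itself moves (by 4 bytes) in the hole-filling variant, which matches the patch's wording that most other members keep their alignment.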
Reviewed-by: John Hubbard <jhubbard@nvidia.com>
thanks,
--
John Hubbard
NVIDIA
> - Feng
>
> From 85ddc2c3d0f2bdcbad4edc5c392c7bc90bb1667e Mon Sep 17 00:00:00 2001
> From: Feng Tang <feng.tang@intel.com>
> Date: Fri, 4 Jun 2021 15:20:57 +0800
> Subject: [PATCH RFC] mm: relocate 'write_protect_seq' in struct mm_struct
>
> Before commit 57efa1fe5957 ("mm/gup: prevent gup_fast from
> racing with COW during fork"), on 64-bit systems, the hot member
> rw_semaphore 'mmap_lock' of 'mm_struct' could be split across
> 2 cachelines, so that its member 'count' sits in one cacheline while
> all other members sit in the next cacheline. This naturally reduces
> some cache bouncing. With the commit, the 'mmap_lock' is pushed
> into one cacheline, as shown in the pahole info:
>
> - before the commit
>
> spinlock_t page_table_lock; /* 116 4 */
> struct rw_semaphore mmap_lock; /* 120 40 */
> /* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
> struct list_head mmlist; /* 160 16 */
> long unsigned int hiwater_rss; /* 176 8 */
>
> - after the commit
>
> spinlock_t page_table_lock; /* 124 4 */
> /* --- cacheline 2 boundary (128 bytes) --- */
> struct rw_semaphore mmap_lock; /* 128 40 */
> struct list_head mmlist; /* 168 16 */
> long unsigned int hiwater_rss; /* 184 8 */
>
> and it causes a 9.2% regression for the 'mmap1' case of the will-it-scale
> benchmark[1], as in that case 'mmap_lock' is highly contended (occupies
> 90%+ cpu cycles).
>
> Re-laying out a structure can be a double-edged sword, as it
> helps some cases but may hurt others. One solution: as the
> newly added 'seqcount_t' is 4 bytes long (when CONFIG_DEBUG_LOCK_ALLOC=n),
> placing it into an existing 4-byte hole in 'mm_struct' will not
> affect most other members' alignment, while recovering the
> regression.
>
> [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> ---
> include/linux/mm_types.h | 15 ++++++++-------
> 1 file changed, 8 insertions(+), 7 deletions(-)
>
> diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
> index 5aacc1c..5b55f88 100644
> --- a/include/linux/mm_types.h
> +++ b/include/linux/mm_types.h
> @@ -445,13 +445,6 @@ struct mm_struct {
> */
> atomic_t has_pinned;
>
> - /**
> - * @write_protect_seq: Locked when any thread is write
> - * protecting pages mapped by this mm to enforce a later COW,
> - * for instance during page table copying for fork().
> - */
> - seqcount_t write_protect_seq;
> -
> #ifdef CONFIG_MMU
> atomic_long_t pgtables_bytes; /* PTE page table pages */
> #endif
> @@ -480,7 +473,15 @@ struct mm_struct {
> unsigned long stack_vm; /* VM_STACK */
> unsigned long def_flags;
>
> + /**
> + * @write_protect_seq: Locked when any thread is write
> + * protecting pages mapped by this mm to enforce a later COW,
> + * for instance during page table copying for fork().
> + */
> + seqcount_t write_protect_seq;
> +
> spinlock_t arg_lock; /* protect the below fields */
> +
> unsigned long start_code, end_code, start_data, end_data;
> unsigned long start_brk, brk, start_stack;
> unsigned long arg_start, arg_end, env_start, env_end;
>