linux-mm.kvack.org archive mirror
* Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct
       [not found] <1623376482-92265-1-git-send-email-feng.tang@intel.com>
@ 2021-06-11 17:09 ` Jason Gunthorpe
  2021-06-14  3:27   ` Feng Tang
  0 siblings, 1 reply; 5+ messages in thread
From: Jason Gunthorpe @ 2021-06-11 17:09 UTC (permalink / raw)
  To: Feng Tang
  Cc: linux-mm, Andrew Morton, Linus Torvalds, kernel test robot,
	John Hubbard, linux-kernel, lkp, Peter Xu

On Fri, Jun 11, 2021 at 09:54:42AM +0800, Feng Tang wrote:
> 0day robot reported a 9.2% regression for will-it-scale mmap1 test
> case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
> from racing with COW during fork").
> 
> Further debugging shows that the regression comes from the commit
> changing the offset of the hot field 'mmap_lock' inside 'struct
> mm_struct', and thus its cache alignment.
> 
> From the perf data, the contention on 'mmap_lock' is very severe and
> takes around 95% of the cpu cycles; it is a rw_semaphore:
> 
>         struct rw_semaphore {
>                 atomic_long_t count;	/* 8 bytes */
>                 atomic_long_t owner;	/* 8 bytes */
>                 struct optimistic_spin_queue osq; /* spinner MCS lock */
>                 ...
> 
> Before commit 57efa1fe5957 added 'write_protect_seq', the structure
> happened to have a very favorable cache alignment layout, as Linus
> explained:
> 
>  "and before the addition of the 'write_protect_seq' field, the
>   mmap_sem was at offset 120 in 'struct mm_struct'.
> 
>   Which meant that count and owner were in two different cachelines,
>   and then when you have contention and spend time in
>   rwsem_down_write_slowpath(), this is probably *exactly* the kind
>   of layout you want.
> 
>   Because first the rwsem_write_trylock() will do a cmpxchg on the
>   first cacheline (for the optimistic fast-path), and then in the
>   case of contention, rwsem_down_write_slowpath() will just access
>   the second cacheline.
> 
>   Which is probably just optimal for a load that spends a lot of
>   time contended - new waiters touch that first cacheline, and then
>   they queue themselves up on the second cacheline."
> 
> After the commit, the rw_semaphore is at offset 128, which means
> the 'count' and 'owner' fields are now in the same cacheline,
> and causes more cache bouncing.
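> 
>  A rough illustration (assuming 64-byte cachelines; exact offsets
>  depend on the kernel config):
> 
>    mmap_lock at offset 120 (before):
>        atomic_long_t count;   /* 120  8 */
>        /* --- cacheline 2 boundary (128 bytes) --- */
>        atomic_long_t owner;   /* 128  8 */   <- different cacheline
> 
>    mmap_lock at offset 128 (after):
>        /* --- cacheline 2 boundary (128 bytes) --- */
>        atomic_long_t count;   /* 128  8 */
>        atomic_long_t owner;   /* 136  8 */   <- same cacheline as count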
> 
> Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which
> will affect its offset:
> 
>   CONFIG_MMU
>   CONFIG_MEMBARRIER
>   CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
> 
> The layout above is for a 64-bit system with 0day's default kernel
> config (similar to RHEL-8.3's config), in which all three options
> are 'y'. The layout can vary with different kernel configs.
> 
> Re-laying out a structure is usually a double-edged sword, as it can
> help one case but hurt others. For this case, one solution is to place
> the newly added 'write_protect_seq', a 4-byte seqcount_t (when
> CONFIG_DEBUG_LOCK_ALLOC=n), into an existing 4-byte hole in
> 'mm_struct', which does not change the other fields' alignment while
> resolving the regression.
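> 
>  A rough sketch of the idea (illustrative; for instance the 4-byte
>  padding slot next to 'arg_lock', assuming CONFIG_DEBUG_SPINLOCK=n,
>  could host it; see the actual diff for the placement chosen):
> 
>         struct mm_struct {
>                 ...
>                 struct rw_semaphore mmap_lock;   /* back at its old offset */
>                 ...
>                 unsigned long def_flags;
>                 seqcount_t write_protect_seq;    /* 4 bytes, fills a hole */
>                 spinlock_t arg_lock;             /* 4 bytes */
>                 unsigned long start_code, end_code, start_data, end_data;
>                 ...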
> 
> [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> Reported-by: kernel test robot <oliver.sang@intel.com>
> Signed-off-by: Feng Tang <feng.tang@intel.com>
> Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> Cc: Jason Gunthorpe <jgg@nvidia.com>
> Cc: Linus Torvalds <torvalds@linux-foundation.org>
> Cc: Peter Xu <peterx@redhat.com>
> ---
>  include/linux/mm_types.h | 27 ++++++++++++++++++++-------
>  1 file changed, 20 insertions(+), 7 deletions(-)

It seems OK to me, but didn't we earlier add the has_pinned, which
would have changed the layout too? Are we chasing performance deltas
nobody cares about?

Still it is mechanically fine, so:

Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Jason



* Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct
  2021-06-11 17:09 ` [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct Jason Gunthorpe
@ 2021-06-14  3:27   ` Feng Tang
  2021-06-15  1:11     ` Feng Tang
  0 siblings, 1 reply; 5+ messages in thread
From: Feng Tang @ 2021-06-14  3:27 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: linux-mm, Andrew Morton, Linus Torvalds, kernel test robot,
	John Hubbard, linux-kernel, lkp, Peter Xu

Hi Jason,

On Fri, Jun 11, 2021 at 02:09:17PM -0300, Jason Gunthorpe wrote:
> On Fri, Jun 11, 2021 at 09:54:42AM +0800, Feng Tang wrote:
> > 0day robot reported a 9.2% regression for will-it-scale mmap1 test
> > case[1], caused by commit 57efa1fe5957 ("mm/gup: prevent gup_fast
> > from racing with COW during fork").
> > 
> > Further debugging shows that the regression comes from the commit
> > changing the offset of the hot field 'mmap_lock' inside 'struct
> > mm_struct', and thus its cache alignment.
> > 
> > From the perf data, the contention on 'mmap_lock' is very severe and
> > takes around 95% of the cpu cycles; it is a rw_semaphore:
> > 
> >         struct rw_semaphore {
> >                 atomic_long_t count;	/* 8 bytes */
> >                 atomic_long_t owner;	/* 8 bytes */
> >                 struct optimistic_spin_queue osq; /* spinner MCS lock */
> >                 ...
> > 
> > Before commit 57efa1fe5957 added 'write_protect_seq', the structure
> > happened to have a very favorable cache alignment layout, as Linus
> > explained:
> > 
> >  "and before the addition of the 'write_protect_seq' field, the
> >   mmap_sem was at offset 120 in 'struct mm_struct'.
> > 
> >   Which meant that count and owner were in two different cachelines,
> >   and then when you have contention and spend time in
> >   rwsem_down_write_slowpath(), this is probably *exactly* the kind
> >   of layout you want.
> > 
> >   Because first the rwsem_write_trylock() will do a cmpxchg on the
> >   first cacheline (for the optimistic fast-path), and then in the
> >   case of contention, rwsem_down_write_slowpath() will just access
> >   the second cacheline.
> > 
> >   Which is probably just optimal for a load that spends a lot of
> >   time contended - new waiters touch that first cacheline, and then
> >   they queue themselves up on the second cacheline."
> > 
> > After the commit, the rw_semaphore is at offset 128, which means
> > the 'count' and 'owner' fields are now in the same cacheline,
> > and causes more cache bouncing.
> > 
> > Currently there are 3 "#ifdef CONFIG_XXX" before 'mmap_lock' which
> > will affect its offset:
> > 
> >   CONFIG_MMU
> >   CONFIG_MEMBARRIER
> >   CONFIG_HAVE_ARCH_COMPAT_MMAP_BASES
> > 
> > The layout above is for a 64-bit system with 0day's default kernel
> > config (similar to RHEL-8.3's config), in which all three options
> > are 'y'. And the layout can vary with different kernel configs.
> > 
> > Re-laying out a structure is usually a double-edged sword, as it can
> > help one case but hurt others. For this case, one solution is to place
> > the newly added 'write_protect_seq', a 4-byte seqcount_t (when
> > CONFIG_DEBUG_LOCK_ALLOC=n), into an existing 4-byte hole in
> > 'mm_struct', which does not change the other fields' alignment while
> > resolving the regression.
> > 
> > [1]. https://lore.kernel.org/lkml/20210525031636.GB7744@xsang-OptiPlex-9020/
> > Reported-by: kernel test robot <oliver.sang@intel.com>
> > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > Reviewed-by: John Hubbard <jhubbard@nvidia.com>
> > Cc: Jason Gunthorpe <jgg@nvidia.com>
> > Cc: Linus Torvalds <torvalds@linux-foundation.org>
> > Cc: Peter Xu <peterx@redhat.com>
> > ---
> >  include/linux/mm_types.h | 27 ++++++++++++++++++++-------
> >  1 file changed, 20 insertions(+), 7 deletions(-)
> 
> It seems OK to me, but didn't we earlier add the has_pinned, which
> would have changed the layout too? Are we chasing performance deltas
> nobody cares about?

Good point! I checked my email folder for 0day's reports, and haven't
found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
mm_struct.has_pinned"), which adds the 'has_pinned' field.

Will run the same test for it and report back.

> Still it is mechanically fine, so:
> 
> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com>

Thanks for the review!

- Feng

> Jason



* Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct
  2021-06-14  3:27   ` Feng Tang
@ 2021-06-15  1:11     ` Feng Tang
  2021-06-15 18:52       ` Peter Xu
  0 siblings, 1 reply; 5+ messages in thread
From: Feng Tang @ 2021-06-15  1:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: linux-mm, Andrew Morton, Linus Torvalds, kernel test robot,
	John Hubbard, linux-kernel, lkp, Peter Xu

On Mon, Jun 14, 2021 at 11:27:39AM +0800, Feng Tang wrote:
> > 
> > It seems OK to me, but didn't we earlier add the has_pinned, which
> > would have changed the layout too? Are we chasing performance deltas
> > nobody cares about?
> 
> Good point! I checked my email folder for 0day's reports, and haven't
> found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
> mm_struct.has_pinned"), which adds the 'has_pinned' field.
> 
> Will run the same test for it and report back.

I ran the same will-it-scale/mmap1 case for Peter's commit 008cfe4418b3
and its parent commit, and there is no obvious performance difference:

a1bffa48745afbb5 008cfe4418b3dbda2ff820cdd7b 
---------------- --------------------------- 

    344353            -0.4%     342929        will-it-scale.48.threads
      7173            -0.4%       7144        will-it-scale.per_thread_ops

And from the pahole info for the two kernels, the 'has_pinned' field
added by Peter's commit is put into an existing 4-byte hole, so all the
following fields keep their alignment unchanged. Peter may have done it
purposely with the alignment in mind, so no performance change is expected.
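
For reference, layouts like the ones below can be dumped with something
like (assuming a vmlinux built with debug info):

	pahole -C mm_struct vmlinux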

Pahole info for kernel before 008cfe4418b3:

struct mm_struct {
	...
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int  task_size;            /*    64     8 */
	long unsigned int  highest_vm_end;       /*    72     8 */
	pgd_t *            pgd;                  /*    80     8 */
	atomic_t           membarrier_state;     /*    88     4 */
	atomic_t           mm_users;             /*    92     4 */
	atomic_t           mm_count;             /*    96     4 */

	/* XXX 4 bytes hole, try to pack */

	atomic_long_t      pgtables_bytes;       /*   104     8 */
	int                map_count;            /*   112     4 */
	spinlock_t         page_table_lock;      /*   116     4 */
	struct rw_semaphore mmap_lock;           /*   120    40 */
	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */

Pahole info with 008cfe4418b3:

struct mm_struct {
	...
	/* --- cacheline 1 boundary (64 bytes) --- */
	long unsigned int  task_size;            /*    64     8 */
	long unsigned int  highest_vm_end;       /*    72     8 */
	pgd_t *            pgd;                  /*    80     8 */
	atomic_t           membarrier_state;     /*    88     4 */
	atomic_t           mm_users;             /*    92     4 */
	atomic_t           mm_count;             /*    96     4 */
	atomic_t           has_pinned;           /*   100     4 */
	atomic_long_t      pgtables_bytes;       /*   104     8 */
	int                map_count;            /*   112     4 */
	spinlock_t         page_table_lock;      /*   116     4 */
	struct rw_semaphore mmap_lock;           /*   120    40 */
	/* --- cacheline 2 boundary (128 bytes) was 32 bytes ago --- */
	
Thanks,
Feng

 




* Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct
  2021-06-15  1:11     ` Feng Tang
@ 2021-06-15 18:52       ` Peter Xu
  2021-06-16  1:51         ` Feng Tang
  0 siblings, 1 reply; 5+ messages in thread
From: Peter Xu @ 2021-06-15 18:52 UTC (permalink / raw)
  To: Feng Tang
  Cc: Jason Gunthorpe, linux-mm, Andrew Morton, Linus Torvalds,
	kernel test robot, John Hubbard, linux-kernel, lkp

On Tue, Jun 15, 2021 at 09:11:03AM +0800, Feng Tang wrote:
> On Mon, Jun 14, 2021 at 11:27:39AM +0800, Feng Tang wrote:
> > > 
> > > It seems OK to me, but didn't we earlier add the has_pinned, which
> > > would have changed the layout too? Are we chasing performance deltas
> > > nobody cares about?
> > 
> > Good point! I checked my email folder for 0day's reports, and haven't
> > found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
> > mm_struct.has_pinned"), which adds the 'has_pinned' field.
> > 
> > Will run the same test for it and report back.
> 
> I ran the same will-it-scale/mmap1 case for Peter's commit 008cfe4418b3
> and its parent commit, and there is no obvious performance difference:
> 
> a1bffa48745afbb5 008cfe4418b3dbda2ff820cdd7b 
> ---------------- --------------------------- 
> 
>     344353            -0.4%     342929        will-it-scale.48.threads
>       7173            -0.4%       7144        will-it-scale.per_thread_ops
> 
> And from the pahole info for the two kernels, the 'has_pinned' field
> added by Peter's commit is put into an existing 4-byte hole, so all the
> following fields keep their alignment unchanged. Peter may have done it
> purposely with the alignment in mind, so no performance change is expected.

Thanks for verifying this.  I didn't do it on purpose, at least not for the
initial version, but I do remember some comment about filling up that hole,
so it may have gotten moved around.

Also note that if nothing goes wrong, has_pinned will be gone in the next
release with commit 3c0c4cda6d48 ("mm: gup: pack has_pinned in MMF_HAS_PINNED",
2021-05-26); it's in -mm-next but has not reached the main branch yet.  So I
think the 4-byte hole should come back to us again, with no perf loss either.

What I'm thinking is whether we should move some important (and especially
CONFIG_*-independent) fields to the top of the whole struct definition, make
sure they're laid out optimally for the common workload, and make them static.
Then there'll be less or no possibility that some new field regresses a common
workload by accident.  Not sure whether it makes sense to do so.
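
Just to sketch the idea (not a concrete proposal; the field choice and
the ____cacheline_aligned_in_smp annotation are only for illustration):

	struct mm_struct {
		/* hot, CONFIG_*-independent fields first, at fixed offsets */
		struct {
			atomic_t mm_users;
			atomic_t mm_count;
			struct rw_semaphore mmap_lock;
			...
		} ____cacheline_aligned_in_smp;

		/* colder and CONFIG_*-dependent fields below */
		...
	};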

Thanks,

-- 
Peter Xu




* Re: [PATCH] mm: relocate 'write_protect_seq' in struct mm_struct
  2021-06-15 18:52       ` Peter Xu
@ 2021-06-16  1:51         ` Feng Tang
  0 siblings, 0 replies; 5+ messages in thread
From: Feng Tang @ 2021-06-16  1:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: Jason Gunthorpe, linux-mm, Andrew Morton, Linus Torvalds,
	kernel test robot, John Hubbard, linux-kernel, lkp

Hi Peter,

On Tue, Jun 15, 2021 at 02:52:49PM -0400, Peter Xu wrote:
> On Tue, Jun 15, 2021 at 09:11:03AM +0800, Feng Tang wrote:
> > On Mon, Jun 14, 2021 at 11:27:39AM +0800, Feng Tang wrote:
> > > > 
> > > > It seems OK to me, but didn't we earlier add the has_pinned, which
> > > > would have changed the layout too? Are we chasing performance deltas
> > > > nobody cares about?
> > > 
> > > Good point! I checked my email folder for 0day's reports, and haven't
> > > found a report related to Peter's commit 008cfe4418b3 ("mm: Introduce
> > > mm_struct.has_pinned"), which adds the 'has_pinned' field.
> > > 
> > > Will run the same test for it and report back.
> > 
> > I ran the same will-it-scale/mmap1 case for Peter's commit 008cfe4418b3
> > and its parent commit, and there is no obvious performance difference:
> > 
> > a1bffa48745afbb5 008cfe4418b3dbda2ff820cdd7b 
> > ---------------- --------------------------- 
> > 
> >     344353            -0.4%     342929        will-it-scale.48.threads
> >       7173            -0.4%       7144        will-it-scale.per_thread_ops
> > 
> > And from the pahole info for the two kernels, the 'has_pinned' field
> > added by Peter's commit is put into an existing 4-byte hole, so all the
> > following fields keep their alignment unchanged. Peter may have done it
> > purposely with the alignment in mind, so no performance change is expected.
> 
> Thanks for verifying this.  I didn't do it on purpose, at least not for the
> initial version, but I do remember some comment about filling up that hole,
> so it may have gotten moved around.
> 
> Also note that if nothing goes wrong, has_pinned will be gone in the next
> release with commit 3c0c4cda6d48 ("mm: gup: pack has_pinned in MMF_HAS_PINNED",
> 2021-05-26); it's in -mm-next but has not reached the main branch yet.  So I
> think the 4-byte hole should come back to us again, with no perf loss either.
 
Thanks for the heads up.

> What I'm thinking is whether we should move some important (and especially
> CONFIG_*-independent) fields to the top of the whole struct definition, make
> sure they're laid out optimally for the common workload, and make them static.
> Then there'll be less or no possibility that some new field regresses a common
> workload by accident.  Not sure whether it makes sense to do so.

Yep, it makes sense to me, as it makes the alignment more predictable and
controllable.  But usually we dare not move the fields around, as it could
improve some cases and regress others, given that different benchmarks can
see different hotspots.  And most of our patches changing a data structure's
layout are regression driven :)

Thanks,
Feng

> Thanks,
> 
> -- 
> Peter Xu


