From: Mel Gorman <mgorman@suse.de>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
Oleg Nesterov <oleg@redhat.com>,
Johannes Weiner <hannes@cmpxchg.org>,
Vlastimil Babka <vbabka@suse.cz>, Jan Kara <jack@suse.cz>,
Michal Hocko <mhocko@suse.cz>, Hugh Dickins <hughd@google.com>,
Dave Hansen <dave.hansen@intel.com>,
Linux Kernel <linux-kernel@vger.kernel.org>,
Linux-MM <linux-mm@kvack.org>,
Linux-FSDevel <linux-fsdevel@vger.kernel.org>,
Paul McKenney <paulmck@linux.vnet.ibm.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
David Howells <dhowells@redhat.com>
Subject: Re: [PATCH] mm: filemap: Avoid unnecessary barriers and waitqueue lookups in unlock_page fastpath v5
Date: Thu, 22 May 2014 00:35:28 +0100
Message-ID: <20140521233528.GZ23991@suse.de>
In-Reply-To: <20140521142622.049d0b3af5fc94912d5a1472@linux-foundation.org>
On Wed, May 21, 2014 at 02:26:22PM -0700, Andrew Morton wrote:
> On Wed, 21 May 2014 13:15:01 +0100 Mel Gorman <mgorman@suse.de> wrote:
>
> > Andrew had suggested dropping v4 of the patch entirely as the numbers were
> > marginal and the complexity was high. However, even on a relatively small
> > machine running simple workloads the overhead of page_waitqueue and wakeup
> > functions is around 5% of system CPU time. That's quite high for basic
> > operations so I felt it was worth another shot. The performance figures
> > are better with this version than they were for v4 and overall the patch
> > should be more comprehensible.
> >
> > Changelog since v4
> > o Remove dependency on io_schedule_timeout
> > o Push waiting logic down into waitqueue
> >
> > This patch introduces a new page flag for 64-bit capable machines,
> > PG_waiters, to signal there are processes waiting on PG_locked and uses it to
> > avoid memory barriers and waitqueue hash lookup in the unlock_page fastpath.
> >
> > This adds a few branches to the fast path but avoids bouncing a dirty
> > cache line between CPUs. 32-bit machines always take the slow path but the
> > primary motivation for this patch is large machines so I do not think that
> > is a concern.
> >
> > The test case used to evaluate this is a simple dd of a large file done
> > multiple times with the file deleted on each iteration. The size of
> > the file is 1/10th physical memory to avoid dirty page balancing. In the
> > async case it is possible that the workload completes without even
> > hitting the disk; the results are variable but they highlight the impact
> > of mark_page_accessed for async IO. The sync results are expected to be
> > more stable. The exception is tmpfs where the normal case is for the "IO"
> > to not hit the disk.
> >
> > The test machine was single socket and UMA to avoid any scheduling or
> > NUMA artifacts. Throughput and wall times are presented for sync IO; only
> > wall times are shown for async as the granularity reported by dd and the
> > variability are unsuitable for comparison. As async results were variable
> > due to writeback timings, I'm only reporting the maximum figures. The sync
> > results were stable enough to make the mean and stddev uninteresting.
> >
> > The performance results are reported based on a run with no profiling.
> > Profile data is based on a separate run with oprofile running. The
> > kernels being compared are "accessed-v2" which is the patch series up
> > to this patch whereas lockpage-v2 includes this patch.
> >
> > ...
> >
> > --- a/include/linux/wait.h
> > +++ b/include/linux/wait.h
> > @@ -147,8 +147,13 @@ void __wake_up_sync_key(wait_queue_head_t *q, unsigned int mode, int nr, void *k
> > void __wake_up_locked(wait_queue_head_t *q, unsigned int mode, int nr);
> > void __wake_up_sync(wait_queue_head_t *q, unsigned int mode, int nr);
> > void __wake_up_bit(wait_queue_head_t *, void *, int);
> > +void __wake_up_page_bit(wait_queue_head_t *, struct page *page, void *, int);
>
> You're going to need to forward-declare struct page in wait.h. The
> good thing about this is that fewer people will notice that we've gone
> and mentioned struct page in wait.h :(
>
Will add the forward-declare.
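Something like this near the top of include/linux/wait.h should be enough:

	struct page;	/* forward declaration, avoids dragging in mm_types.h */
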
> > int __wait_on_bit(wait_queue_head_t *, struct wait_bit_queue *, int (*)(void *), unsigned);
> > +int __wait_on_page_bit(wait_queue_head_t *, struct wait_bit_queue *,
> >
> > ...
> >
> > --- a/kernel/sched/wait.c
> > +++ b/kernel/sched/wait.c
> > @@ -167,31 +167,39 @@ EXPORT_SYMBOL_GPL(__wake_up_sync); /* For internal use only */
> > * stops them from bleeding out - it would still allow subsequent
> > * loads to move into the critical region).
> > */
> > -void
> > -prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait, int state)
> > +static inline void
> > +__prepare_to_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > + struct page *page, int state, bool exclusive)
>
> Putting MM stuff into core waitqueue code is rather bad. I really
> don't know how I'm going to explain this to my family.
>
The alternative to updating wait.h and wait.c was open-coding the waitqueue
modifications in filemap.c, but that is just as ugly. The wait queue code
is complex and there was motivation to keep it in one place even if we
are special-casing struct page handling.
FWIW, I cannot explain anything I do in work to my family. It gets blank
looks no matter what.
> > {
> > unsigned long flags;
> >
> > - wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > spin_lock_irqsave(&q->lock, flags);
> > - if (list_empty(&wait->task_list))
> > - __add_wait_queue(q, wait);
> > + if (page && !PageWaiters(page))
> > + SetPageWaiters(page);
>
> And this isn't racy because we're assuming that all users of `page' are
> using the same waitqueue. ie, assuming all callers use
> page_waitqueue()? Subtle, unobvious, worth documenting.
>
All users of the page will get the same waitqueue as page_waitqueue()
hashes on the page pointer value.
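For reference, the lookup is roughly the following (paraphrasing the
current mm/filemap.c from memory):

	static wait_queue_head_t *page_waitqueue(struct page *page)
	{
		const struct zone *zone = page_zone(page);

		return &zone->wait_table[hash_ptr(page, zone->wait_table_bits)];
	}

Two unrelated pages can hash to the same wait_queue_head_t but a given page
always maps to exactly one, which is what the PG_waiters handling relies on.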
> > + if (list_empty(&wait->task_list)) {
> > + if (exclusive) {
> > + wait->flags |= WQ_FLAG_EXCLUSIVE;
> > + __add_wait_queue_tail(q, wait);
> > + } else {
> > + wait->flags &= ~WQ_FLAG_EXCLUSIVE;
> > + __add_wait_queue(q, wait);
> > + }
> > + }
> > set_current_state(state);
> > spin_unlock_irqrestore(&q->lock, flags);
> > }
> >
> > ...
> >
> > @@ -228,7 +236,8 @@ EXPORT_SYMBOL(prepare_to_wait_event);
> > * the wait descriptor from the given waitqueue if still
> > * queued.
> > */
> > -void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +static inline void __finish_wait(wait_queue_head_t *q, wait_queue_t *wait,
> > + struct page *page)
>
> Thusly does kerneldoc bitrot.
>
Now, I am become Rot, Destroyer of Kerneldoc.
Kerneldoc comment moved to correct location.
> > {
> > unsigned long flags;
> >
> > @@ -249,9 +258,16 @@ void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > if (!list_empty_careful(&wait->task_list)) {
> > spin_lock_irqsave(&q->lock, flags);
> > list_del_init(&wait->task_list);
> > + if (page && !waitqueue_active(q))
> > + ClearPageWaiters(page);
>
> And again, the assumption that all users of this page use the same
> waitqueue avoids the races?
>
Yes. If there are waitqueue collisions, there is no guarantee that the bit
will be cleared when the last waiter on that specific page goes away;
detecting that accurately would require ref counts.
Will stick in a comment.
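Something along these lines, unless you want it worded differently (untested):

	/*
	 * page_waitqueue() is hashed, so unrelated pages can share this
	 * queue and there is no cheap way to tell whether the remaining
	 * waiters are for this page. Only clear PG_waiters when the queue
	 * has drained completely; a bit left set by a collision is harmless
	 * as it merely sends one unlock_page() down the slow path.
	 */
	if (page && !waitqueue_active(q))
		ClearPageWaiters(page);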
> > spin_unlock_irqrestore(&q->lock, flags);
> > }
> > }
> > +
> > +void finish_wait(wait_queue_head_t *q, wait_queue_t *wait)
> > +{
> > + return __finish_wait(q, wait, NULL);
> > +}
> > EXPORT_SYMBOL(finish_wait);
> >
> > /**
> > @@ -331,6 +347,22 @@ __wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > finish_wait(wq, &q->wait);
> > return ret;
> > }
> > +
> > +int __sched
> > +__wait_on_page_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > + struct page *page,
> > + int (*action)(void *), unsigned mode)
>
> Comment over __wait_on_bit needs updating.
>
> > +{
> > + int ret = 0;
> > +
> > + do {
> > + __prepare_to_wait(wq, &q->wait, page, mode, false);
> > + if (test_bit(q->key.bit_nr, q->key.flags))
> > + ret = (*action)(q->key.flags);
> > + } while (test_bit(q->key.bit_nr, q->key.flags) && !ret);
> > + __finish_wait(wq, &q->wait, page);
> > + return ret;
> > +}
>
> __wait_on_bit() can now become a wrapper which calls this with page==NULL?
>
Yep. Early in development this would have looked ugly, but not in the
final version; I simply failed to fix it up.
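i.e. something like this in the next version:

	int __sched
	__wait_on_bit(wait_queue_head_t *wq, struct wait_bit_queue *q,
		      int (*action)(void *), unsigned mode)
	{
		return __wait_on_page_bit(wq, q, NULL, action, mode);
	}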
> > EXPORT_SYMBOL(__wait_on_bit);
>
> This export is now misplaced.
>
> > int __sched out_of_line_wait_on_bit(void *word, int bit,
> > @@ -344,6 +376,27 @@ int __sched out_of_line_wait_on_bit(void *word, int bit,
> > EXPORT_SYMBOL(out_of_line_wait_on_bit);
> >
> > int __sched
> > +__wait_on_page_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > + struct page *page,
> > + int (*action)(void *), unsigned mode)
> > +{
> > + do {
> > + int ret;
> > +
> > + __prepare_to_wait(wq, &q->wait, page, mode, true);
> > + if (!test_bit(q->key.bit_nr, q->key.flags))
> > + continue;
> > + ret = action(q->key.flags);
> > + if (!ret)
> > + continue;
> > + abort_exclusive_wait(wq, &q->wait, mode, &q->key);
> > + return ret;
> > + } while (test_and_set_bit(q->key.bit_nr, q->key.flags));
> > + __finish_wait(wq, &q->wait, page);
> > + return 0;
> > +}
>
> You are in a maze of twisty little functions, all alike. Perhaps some
> rudimentary documentation here? Like what on earth does
> __wait_on_page_bit_lock() actually do? And `mode'.
>
Most of the useful commentary on what this does is in the kerneldoc comment
for wait_on_bit, including the meaning of "mode".
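That said, a short comment above __wait_on_page_bit_lock would not hurt.
Roughly, wording to be refined:

	/*
	 * Like __wait_on_bit_lock() but takes the page whose flag bit is
	 * being waited on so that PG_waiters can be maintained. @mode is
	 * the task state to sleep in, e.g. TASK_UNINTERRUPTIBLE from
	 * lock_page() or TASK_KILLABLE from lock_page_killable().
	 */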
>
> > +int __sched
> > __wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
> > int (*action)(void *), unsigned mode)
>
> Perhaps __wait_on_bit_lock() can become a wrapper around
> __wait_on_page_bit_lock().
>
Yes.
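Same pattern as the other wrapper:

	int __sched
	__wait_on_bit_lock(wait_queue_head_t *wq, struct wait_bit_queue *q,
			   int (*action)(void *), unsigned mode)
	{
		return __wait_on_page_bit_lock(wq, q, NULL, action, mode);
	}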
> >
> > ...
> >
> > --- a/mm/swap.c
> > +++ b/mm/swap.c
> > @@ -67,6 +67,10 @@ static void __page_cache_release(struct page *page)
> > static void __put_single_page(struct page *page)
> > {
> > __page_cache_release(page);
> > +
> > + /* Clear dangling waiters from collisions on page_waitqueue */
> > + __ClearPageWaiters(page);
>
> What's this collisions thing?
>
I'll expand the comment.
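Along the lines of (wording to be refined):

	/*
	 * PG_waiters may be left set: the waitqueue is hashed and shared,
	 * so the last waiter for this page cannot clear the bit while
	 * waiters for other pages still sit on the same queue. The page
	 * is being freed here, so clear it non-atomically.
	 */
	__ClearPageWaiters(page);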
> > free_hot_cold_page(page, false);
> > }
> >
> >
> > ...
> >
> > --- a/mm/vmscan.c
> > +++ b/mm/vmscan.c
> > @@ -1096,6 +1096,7 @@ static unsigned long shrink_page_list(struct list_head *page_list,
> > * waiting on the page lock, because there are no references.
> > */
> > __clear_page_locked(page);
> > + __ClearPageWaiters(page);
>
> We're freeing the page - if someone is still waiting on it then we have
> a huge bug? It's the mysterious collision thing again I hope?
>
Yes. Freeing a page with active waiters would also mean freeing a page with
an elevated ref count, and the page allocator should catch that and scream.
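From memory, free_pages_check() does roughly the following, so a non-zero
_count at free time is reported loudly via bad_page():

	if (unlikely(page_mapcount(page) |
		     (page->mapping != NULL) |
		     (atomic_read(&page->_count) != 0) |
		     (page->flags & PAGE_FLAGS_CHECK_AT_FREE)))
		bad_page(page);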
--
Mel Gorman
SUSE Labs