All of lore.kernel.org
 help / color / mirror / Atom feed
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Miklos Szeredi <miklos@szeredi.hu>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] remove throttle_vm_writeout()
Date: Fri, 5 Oct 2007 20:30:28 +0800	[thread overview]
Message-ID: <391587432.00784@ustc.edu.cn> (raw)
Message-ID: <20071005123028.GA10372@mail.ustc.edu.cn> (raw)
In-Reply-To: <20071004104650.d158121f.akpm@linux-foundation.org>

On Thu, Oct 04, 2007 at 10:46:50AM -0700, Andrew Morton wrote:
> On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > > > > > But that said, there might be better ways to do that.
> > > > > 
> > > > > Sure, if we do need to globally limit the number of under-writeback
> > > > > pages, then I think we need to do it independently of the dirty
> > > > > accounting.
> > > > 
> > > > It need not be global, it could be per BDI as well, but yes.
> > > 
> > > For per-bdi limits we have the queue length.
> > 
> > Agreed, except for:
> > 
> > static int may_write_to_queue(struct backing_dev_info *bdi)
> > {
> > 	if (current->flags & PF_SWAPWRITE)
> > 		return 1;
> > 	if (!bdi_write_congested(bdi))
> > 		return 1;
> > 	if (bdi == current->backing_dev_info)
> > 		return 1;
> > 	return 0;
> > }
> > 
> > Which will write to congested queues. Anybody know why?
> 
> 
> commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> Author: akpm <akpm>
> Date:   Sun Dec 22 01:07:33 2002 +0000
> 
>     [PATCH] Give kswapd writeback higher priority than pdflush
>     
>     The `low latency page reclaim' design works by preventing page
>     allocators from blocking on request queues (and by preventing them from
>     blocking against writeback of individual pages, but that is immaterial
>     here).
>     
>     This has a problem under some situations.  pdflush (or a write(2)
>     caller) could be saturating the queue with highmem pages.  This
>     prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
>     enormous amounts of scenning.

Sorry, I cannot not understand it. We now have balanced aging between
zones. So the page allocations are expected to distribute proportionally
between ZONE_HIGHMEM and ZONE_NORMAL?

>     A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
>     then kill the mmapping applications.  The machine instantly goes from
>     0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
>     the least-recently-dirtied pages, which are all highmem.  The queue is
>     congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
>     50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
>     efficiency (pages_reclaimed/pages_scanned) falls to 2%.
>     
>     So this patch changes the policy for kswapd.  kswapd may use all of a
>     request queue, and is prepared to block on request queues.
>     
>     What will now happen in the above scenario is:
>     
>     1: The page alloctor scans some pages, fails to reclaim enough
>        memory and takes a nap in blk_congetion_wait().
>     
>     2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
>        back pages.  (These pages will be rotated to the tail of the
>        inactive list at IO-completion interrupt time).
>     
>        This writeback will saturate the queue with ZONE_NORMAL pages.
>        Conveniently, pdflush will avoid the congested queues.  So we end up
>        writing the correct pages.
>     
>     In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
>     efficiency rises from 2% to 40% and things are generally a lot happier.

We may see the same problem and improvement in the absent of 'all
writeback goes to one zone' assumption.

The problem could be:
- dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
  data and quickly _congests_ the queue;
- dirty pages are slowly but continuously turned into clean pages by
  balance_dirty_pages(), but they still stay in the same place in LRU;
- the zones are mostly dirty/writeback pages, kswapd has a hard time
  finding the randomly distributed clean pages;
- kswapd cannot do the writeout because the queue is congested!

The improvement could be:
- kswapd is now explicitly preferred to do the writeout;
- the pages written by kswapd will be rotated and easy for kswapd to reclaim;
- it becomes possible for kswapd to wait for the congested queue,
  instead of doing the vmscan like mad.

The congestion wait looks like a pretty natural way to throttle the kswapd.
Instead of doing the vmscan at 1000MB/s and actually freeing pages at
60MB/s(about the write throughput), kswapd will be relaxed to do vmscan at
maybe 150MB/s.

Fengguang
---
>     The downside is that kswapd may now do a lot less page reclaim,
>     increasing page allocation latency, causing more direct reclaim,
>     increasing lock contention in the VM, etc.  But I have not been able to
>     demonstrate that in testing.
>     
>     
>     The other problem is that there is only one kswapd, and there are lots
>     of disks.  That is a generic problem - without being able to co-opt
>     user processes we don't have enough threads to keep lots of disks saturated.
>     
>     One fix for this would be to add an additional "really congested"
>     threshold in the request queues, so kswapd can still perform
>     nonblocking writeout.  This gives kswapd priority over pdflush while
>     allowing kswapd to feed many disk queues.  I doubt if this will be
>     called for.
>     
>     BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c635f39..9ab0209 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -7,6 +7,7 @@ #include <linux/kdev_t.h>
>  #include <linux/linkage.h>
>  #include <linux/mmzone.h>
>  #include <linux/list.h>
> +#include <linux/sched.h>
>  #include <asm/atomic.h>
>  #include <asm/page.h>
>  
> @@ -14,6 +15,11 @@ #define SWAP_FLAG_PREFER	0x8000	/* set i
>  #define SWAP_FLAG_PRIO_MASK	0x7fff
>  #define SWAP_FLAG_PRIO_SHIFT	0
>  
> +static inline int current_is_kswapd(void)
> +{
> +	return current->flags & PF_KSWAPD;
> +}
> +
>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aeab1e3..a8b9d2c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -204,6 +204,19 @@ static inline int is_page_cache_freeable
>  	return page_count(page) - !!PagePrivate(page) == 2;
>  }
>  
> +static int may_write_to_queue(struct backing_dev_info *bdi)
> +{
> +	if (current_is_kswapd())
> +		return 1;
> +	if (current_is_pdflush())	/* This is unlikely, but why not... */
> +		return 1;
> +	if (!bdi_write_congested(bdi))
> +		return 1;
> +	if (bdi == current->backing_dev_info)
> +		return 1;
> +	return 0;
> +}
> +
>  /*
>   * shrink_list returns the number of reclaimed pages
>   */
> @@ -303,8 +316,6 @@ #endif /* CONFIG_SWAP */
>  		 * See swapfile.c:page_queue_congested().
>  		 */
>  		if (PageDirty(page)) {
> -			struct backing_dev_info *bdi;
> -
>  			if (!is_page_cache_freeable(page))
>  				goto keep_locked;
>  			if (!mapping)
> @@ -313,9 +324,7 @@ #endif /* CONFIG_SWAP */
>  				goto activate_locked;
>  			if (!may_enter_fs)
>  				goto keep_locked;
> -			bdi = mapping->backing_dev_info;
> -			if (bdi != current->backing_dev_info &&
> -					bdi_write_congested(bdi))
> +			if (!may_write_to_queue(mapping->backing_dev_info))
>  				goto keep_locked;
>  			write_lock(&mapping->page_lock);
>  			if (test_clear_page_dirty(page)) {
> @@ -424,7 +433,7 @@ keep:
>  	if (pagevec_count(&freed_pvec))
>  		__pagevec_release_nonlru(&freed_pvec);
>  	mod_page_state(pgsteal, ret);
> -	if (current->flags & PF_KSWAPD)
> +	if (current_is_kswapd())
>  		mod_page_state(kswapd_steal, ret);
>  	mod_page_state(pgactivate, pgactivate);
>  	return ret;
> 


WARNING: multiple messages have this Message-ID (diff)
From: Fengguang Wu <wfg@mail.ustc.edu.cn>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>,
	Miklos Szeredi <miklos@szeredi.hu>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] remove throttle_vm_writeout()
Date: Fri, 5 Oct 2007 20:30:28 +0800	[thread overview]
Message-ID: <391587432.00784@ustc.edu.cn> (raw)
Message-ID: <20071005123028.GA10372@mail.ustc.edu.cn> (raw)
In-Reply-To: <20071004104650.d158121f.akpm@linux-foundation.org>

On Thu, Oct 04, 2007 at 10:46:50AM -0700, Andrew Morton wrote:
> On Thu, 04 Oct 2007 18:47:07 +0200 Peter Zijlstra <a.p.zijlstra@chello.nl> wrote:
> 
> > > > > > But that said, there might be better ways to do that.
> > > > > 
> > > > > Sure, if we do need to globally limit the number of under-writeback
> > > > > pages, then I think we need to do it independently of the dirty
> > > > > accounting.
> > > > 
> > > > It need not be global, it could be per BDI as well, but yes.
> > > 
> > > For per-bdi limits we have the queue length.
> > 
> > Agreed, except for:
> > 
> > static int may_write_to_queue(struct backing_dev_info *bdi)
> > {
> > 	if (current->flags & PF_SWAPWRITE)
> > 		return 1;
> > 	if (!bdi_write_congested(bdi))
> > 		return 1;
> > 	if (bdi == current->backing_dev_info)
> > 		return 1;
> > 	return 0;
> > }
> > 
> > Which will write to congested queues. Anybody know why?
> 
> 
> commit c4e2d7ddde9693a4c05da7afd485db02c27a7a09
> Author: akpm <akpm>
> Date:   Sun Dec 22 01:07:33 2002 +0000
> 
>     [PATCH] Give kswapd writeback higher priority than pdflush
>     
>     The `low latency page reclaim' design works by preventing page
>     allocators from blocking on request queues (and by preventing them from
>     blocking against writeback of individual pages, but that is immaterial
>     here).
>     
>     This has a problem under some situations.  pdflush (or a write(2)
>     caller) could be saturating the queue with highmem pages.  This
>     prevents anyone from writing back ZONE_NORMAL pages.  We end up doing
>     enormous amounts of scenning.

Sorry, I cannot not understand it. We now have balanced aging between
zones. So the page allocations are expected to distribute proportionally
between ZONE_HIGHMEM and ZONE_NORMAL?

>     A test case is to mmap(MAP_SHARED) almost all of a 4G machine's memory,
>     then kill the mmapping applications.  The machine instantly goes from
>     0% of memory dirty to 95% or more.  pdflush kicks in and starts writing
>     the least-recently-dirtied pages, which are all highmem.  The queue is
>     congested so nobody will write back ZONE_NORMAL pages.  kswapd chews
>     50% of the CPU scanning past dirty ZONE_NORMAL pages and page reclaim
>     efficiency (pages_reclaimed/pages_scanned) falls to 2%.
>     
>     So this patch changes the policy for kswapd.  kswapd may use all of a
>     request queue, and is prepared to block on request queues.
>     
>     What will now happen in the above scenario is:
>     
>     1: The page alloctor scans some pages, fails to reclaim enough
>        memory and takes a nap in blk_congetion_wait().
>     
>     2: kswapd() will scan the ZONE_NORMAL LRU and will start writing
>        back pages.  (These pages will be rotated to the tail of the
>        inactive list at IO-completion interrupt time).
>     
>        This writeback will saturate the queue with ZONE_NORMAL pages.
>        Conveniently, pdflush will avoid the congested queues.  So we end up
>        writing the correct pages.
>     
>     In this test, kswapd CPU utilisation falls from 50% to 2%, page reclaim
>     efficiency rises from 2% to 40% and things are generally a lot happier.

We may see the same problem and improvement in the absent of 'all
writeback goes to one zone' assumption.

The problem could be:
- dirty_thresh is exceeded, so balance_dirty_pages() starts syncing
  data and quickly _congests_ the queue;
- dirty pages are slowly but continuously turned into clean pages by
  balance_dirty_pages(), but they still stay in the same place in LRU;
- the zones are mostly dirty/writeback pages, kswapd has a hard time
  finding the randomly distributed clean pages;
- kswapd cannot do the writeout because the queue is congested!

The improvement could be:
- kswapd is now explicitly preferred to do the writeout;
- the pages written by kswapd will be rotated and easy for kswapd to reclaim;
- it becomes possible for kswapd to wait for the congested queue,
  instead of doing the vmscan like mad.

The congestion wait looks like a pretty natural way to throttle the kswapd.
Instead of doing the vmscan at 1000MB/s and actually freeing pages at
60MB/s(about the write throughput), kswapd will be relaxed to do vmscan at
maybe 150MB/s.

Fengguang
---
>     The downside is that kswapd may now do a lot less page reclaim,
>     increasing page allocation latency, causing more direct reclaim,
>     increasing lock contention in the VM, etc.  But I have not been able to
>     demonstrate that in testing.
>     
>     
>     The other problem is that there is only one kswapd, and there are lots
>     of disks.  That is a generic problem - without being able to co-opt
>     user processes we don't have enough threads to keep lots of disks saturated.
>     
>     One fix for this would be to add an additional "really congested"
>     threshold in the request queues, so kswapd can still perform
>     nonblocking writeout.  This gives kswapd priority over pdflush while
>     allowing kswapd to feed many disk queues.  I doubt if this will be
>     called for.
>     
>     BKrev: 3e051055aitHp3bZBPSqmq21KGs5aQ
> 
> diff --git a/include/linux/swap.h b/include/linux/swap.h
> index c635f39..9ab0209 100644
> --- a/include/linux/swap.h
> +++ b/include/linux/swap.h
> @@ -7,6 +7,7 @@ #include <linux/kdev_t.h>
>  #include <linux/linkage.h>
>  #include <linux/mmzone.h>
>  #include <linux/list.h>
> +#include <linux/sched.h>
>  #include <asm/atomic.h>
>  #include <asm/page.h>
>  
> @@ -14,6 +15,11 @@ #define SWAP_FLAG_PREFER	0x8000	/* set i
>  #define SWAP_FLAG_PRIO_MASK	0x7fff
>  #define SWAP_FLAG_PRIO_SHIFT	0
>  
> +static inline int current_is_kswapd(void)
> +{
> +	return current->flags & PF_KSWAPD;
> +}
> +
>  /*
>   * MAX_SWAPFILES defines the maximum number of swaptypes: things which can
>   * be swapped to.  The swap type and the offset into that swap type are
> diff --git a/mm/vmscan.c b/mm/vmscan.c
> index aeab1e3..a8b9d2c 100644
> --- a/mm/vmscan.c
> +++ b/mm/vmscan.c
> @@ -204,6 +204,19 @@ static inline int is_page_cache_freeable
>  	return page_count(page) - !!PagePrivate(page) == 2;
>  }
>  
> +static int may_write_to_queue(struct backing_dev_info *bdi)
> +{
> +	if (current_is_kswapd())
> +		return 1;
> +	if (current_is_pdflush())	/* This is unlikely, but why not... */
> +		return 1;
> +	if (!bdi_write_congested(bdi))
> +		return 1;
> +	if (bdi == current->backing_dev_info)
> +		return 1;
> +	return 0;
> +}
> +
>  /*
>   * shrink_list returns the number of reclaimed pages
>   */
> @@ -303,8 +316,6 @@ #endif /* CONFIG_SWAP */
>  		 * See swapfile.c:page_queue_congested().
>  		 */
>  		if (PageDirty(page)) {
> -			struct backing_dev_info *bdi;
> -
>  			if (!is_page_cache_freeable(page))
>  				goto keep_locked;
>  			if (!mapping)
> @@ -313,9 +324,7 @@ #endif /* CONFIG_SWAP */
>  				goto activate_locked;
>  			if (!may_enter_fs)
>  				goto keep_locked;
> -			bdi = mapping->backing_dev_info;
> -			if (bdi != current->backing_dev_info &&
> -					bdi_write_congested(bdi))
> +			if (!may_write_to_queue(mapping->backing_dev_info))
>  				goto keep_locked;
>  			write_lock(&mapping->page_lock);
>  			if (test_clear_page_dirty(page)) {
> @@ -424,7 +433,7 @@ keep:
>  	if (pagevec_count(&freed_pvec))
>  		__pagevec_release_nonlru(&freed_pvec);
>  	mod_page_state(pgsteal, ret);
> -	if (current->flags & PF_KSWAPD)
> +	if (current_is_kswapd())
>  		mod_page_state(kswapd_steal, ret);
>  	mod_page_state(pgactivate, pgactivate);
>  	return ret;
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

  parent reply	other threads:[~2007-10-05 12:30 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-10-04 12:25 [PATCH] remove throttle_vm_writeout() Miklos Szeredi
2007-10-04 12:25 ` Miklos Szeredi
2007-10-04 12:40 ` Peter Zijlstra
2007-10-04 13:00   ` Miklos Szeredi
2007-10-04 13:00     ` Miklos Szeredi
2007-10-04 13:23     ` Peter Zijlstra
2007-10-04 13:49       ` Miklos Szeredi
2007-10-04 13:49         ` Miklos Szeredi
2007-10-04 16:47         ` Peter Zijlstra
2007-10-04 16:47           ` Peter Zijlstra
2007-10-04 17:46           ` Andrew Morton
2007-10-04 17:46             ` Andrew Morton
2007-10-04 18:10             ` Peter Zijlstra
2007-10-04 18:10               ` Peter Zijlstra
2007-10-04 18:54               ` Andrew Morton
2007-10-04 18:54                 ` Andrew Morton
2007-10-05 12:30             ` Fengguang Wu [this message]
2007-10-05 12:30               ` Fengguang Wu
2007-10-05 12:30                 ` Fengguang Wu
2007-10-05 17:20                 ` Andrew Morton
2007-10-05 17:20                   ` Andrew Morton
2007-10-06  2:32                   ` Fengguang Wu
2007-10-06  2:32                     ` Fengguang Wu
2007-10-06  2:32                       ` Fengguang Wu
2007-10-07 23:54               ` David Chinner
2007-10-07 23:54                 ` David Chinner
2007-10-08  0:33                 ` Fengguang Wu
2007-10-08  0:33                   ` Fengguang Wu
2007-10-08  0:33                     ` Fengguang Wu
2007-10-04 21:07           ` Miklos Szeredi
2007-10-04 21:07             ` Miklos Szeredi
2007-10-04 21:56 ` Andrew Morton
2007-10-04 21:56   ` Andrew Morton
2007-10-04 22:39   ` Miklos Szeredi
2007-10-04 22:39     ` Miklos Szeredi
2007-10-04 23:09     ` Andrew Morton
2007-10-04 23:09       ` Andrew Morton
2007-10-04 23:26       ` Miklos Szeredi
2007-10-04 23:26         ` Miklos Szeredi
2007-10-04 23:48         ` Andrew Morton
2007-10-04 23:48           ` Andrew Morton
2007-10-05  0:12           ` Miklos Szeredi
2007-10-05  0:12             ` Miklos Szeredi
2007-10-05  0:48             ` Andrew Morton
2007-10-05  0:48               ` Andrew Morton
2007-10-05  8:22               ` Peter Zijlstra
2007-10-05  9:22                 ` Miklos Szeredi
2007-10-05  9:22                   ` Miklos Szeredi
2007-10-05  9:47                   ` Peter Zijlstra
2007-10-05 10:27                     ` Miklos Szeredi
2007-10-05 10:27                       ` Miklos Szeredi
2007-10-05 10:32                       ` Miklos Szeredi
2007-10-05 10:32                         ` Miklos Szeredi
2007-10-05 15:43                         ` John Stoffel
2007-10-05 15:43                           ` John Stoffel
2007-10-05 10:57                       ` Peter Zijlstra
2007-10-05 11:27                         ` Miklos Szeredi
2007-10-05 11:27                           ` Miklos Szeredi
2007-10-05 17:50                         ` Trond Myklebust
2007-10-05 17:50                           ` Trond Myklebust
2007-10-05 18:32                           ` Peter Zijlstra
2007-10-05 18:32                             ` Peter Zijlstra
2007-10-05 19:20                             ` Trond Myklebust
2007-10-05 19:20                               ` Trond Myklebust
2007-10-05 19:23                               ` Trond Myklebust
2007-10-05 19:23                                 ` Trond Myklebust
2007-10-05 21:07                                 ` Peter Zijlstra
2007-10-05 21:07                                   ` Peter Zijlstra
2007-10-06  0:40                             ` Fengguang Wu
2007-10-06  0:40                               ` Fengguang Wu
2007-10-06  0:40                                 ` Fengguang Wu
2007-10-05  7:32       ` Peter Zijlstra
2007-10-05 19:54         ` Rik van Riel
2007-10-05 19:54           ` Rik van Riel

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=391587432.00784@ustc.edu.cn \
    --to=wfg@mail.ustc.edu.cn \
    --cc=a.p.zijlstra@chello.nl \
    --cc=akpm@linux-foundation.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=miklos@szeredi.hu \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.