Re: [LSF/MM TOPIC] memcg topics.

From: Wu Fengguang <fengguang.wu@intel.com>
To: Greg Thelen <gthelen@google.com>
Cc: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>,
	lsf-pc@lists.linux-foundation.org, linux-mm@kvack.org,
	"hannes@cmpxchg.org" <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@suse.cz>,
	"bsingharora@gmail.com" <bsingharora@gmail.com>,
	Hugh Dickins <hughd@google.com>, Ying Han <yinghan@google.com>,
	Mel Gorman <mgorman@suse.de>
Subject: Re: [LSF/MM TOPIC] memcg topics.
Date: Thu, 2 Feb 2012 15:52:34 +0800
Message-ID: <20120202075234.GA3039@localhost>
In-Reply-To: <20120202063345.GA15124@localhost>

On Thu, Feb 02, 2012 at 02:33:45PM +0800, Wu Fengguang wrote:
> Hi Greg,
> 
> On Wed, Feb 01, 2012 at 12:24:25PM -0800, Greg Thelen wrote:
> > On Tue, Jan 31, 2012 at 4:55 PM, KAMEZAWA Hiroyuki
> > <kamezawa.hiroyu@jp.fujitsu.com> wrote:
> > > 4. dirty ratio
> > >    In the last year, patches were posted but not merged. I'd like to hear
> > >    works on this area.
> > 
> > I would like to attend to discuss this topic.  I have not had much time to work
> > on this recently, but should be able to focus more on this soon.  The
> > IO less writeback changes require some redesign and may allow for a
> > simpler implementation of mem_cgroup_balance_dirty_pages().
> > Maintaining a per container dirty page counts, ratios, and limits is
> > fairly easy, but integration with writeback is the challenge.  My big
> > questions are for writeback people:
> > 1. how to compute per-container pause based on bdi bandwidth, cgroup
> > dirty page usage.
> > 2. how to ensure that writeback will engage even if system and bdi are
> > below respective background dirty ratios, yet a memcg is above its bg
> > dirty limit.
> 
> The solution to (1,2) would be something like this:
> 
> --- linux-next.orig/mm/page-writeback.c	2012-02-02 14:13:45.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2012-02-02 14:24:11.000000000 +0800
> @@ -654,6 +654,17 @@ static unsigned long bdi_position_ratio(
>  	pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>  	pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
>  
> +	if (memcg) {
> +		long long f;
> +		x = div_s64((memcg_setpoint - memcg_dirty) << RATELIMIT_CALC_SHIFT,
> +			    memcg_limit - memcg_setpoint + 1);
> +		f = x;
> +		f = f * x >> RATELIMIT_CALC_SHIFT;
> +		f = f * x >> RATELIMIT_CALC_SHIFT;
> +		f += 1 << RATELIMIT_CALC_SHIFT;
> +		pos_ratio = pos_ratio * f >> RATELIMIT_CALC_SHIFT;
> +	}
> +
>  	/*
>  	 * We have computed basic pos_ratio above based on global situation. If
>  	 * the bdi is over/under its share of dirty pages, we want to scale
> @@ -1202,6 +1213,8 @@ static void balance_dirty_pages(struct a
>  		freerun = dirty_freerun_ceiling(dirty_thresh,
>  						background_thresh);
>  		if (nr_dirty <= freerun) {
> +			if (memcg && memcg_dirty > memcg_freerun)
> +				goto start_writeback;
>  			current->dirty_paused_when = now;
>  			current->nr_dirtied = 0;
>  			current->nr_dirtied_pause =
> @@ -1209,6 +1222,7 @@ static void balance_dirty_pages(struct a
>  			break;
>  		}
>  
> +start_writeback:
>  		if (unlikely(!writeback_in_progress(bdi)))
>  			bdi_start_background_writeback(bdi);
>  
> 
> That makes the minimal change to enforce per-memcg dirty ratio.
> It could result in a less stable control system, but should still
> be able to balance things out.
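
To make the first hunk easier to reason about, here is a rough userspace
sketch of the memcg factor it multiplies into pos_ratio: a cubic curve
that equals 1.0 at the memcg setpoint and falls to about 0 near the memcg
limit. The helper name memcg_pos_factor() and the CALC_* macros are made
up for illustration, and CALC_SHIFT is assumed to match
RATELIMIT_CALC_SHIFT (10) in mm/page-writeback.c. It uses signed division
where the patch uses div_s64() and shifts, so rounding of negative
intermediates may differ slightly from the kernel arithmetic.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to mirror RATELIMIT_CALC_SHIFT (10) in mm/page-writeback.c. */
#define CALC_SHIFT 10
#define CALC_ONE   ((int64_t)1 << CALC_SHIFT)

/*
 * Hypothetical userspace model of the memcg factor in the patch.
 * Returns 1 + x^3 in fixed point (CALC_ONE == 1.0), where
 *   x = (setpoint - dirty) / (limit - setpoint + 1).
 * All three arguments are page counts.
 */
static int64_t memcg_pos_factor(int64_t setpoint, int64_t dirty, int64_t limit)
{
	int64_t x, f;

	assert(limit >= setpoint);	/* keeps the divisor positive */

	/* Division truncates toward zero; the kernel's shifts round down. */
	x = (setpoint - dirty) * CALC_ONE / (limit - setpoint + 1);
	f = x;
	f = f * x / CALC_ONE;		/* x^2 */
	f = f * x / CALC_ONE;		/* x^3 */
	return f + CALC_ONE;		/* 1 + x^3 */
}
```

With setpoint=100 and limit=199 pages, this yields 2048 (2.0) for an
empty memcg, 1024 (1.0) at the setpoint, 896 (0.875) at 150 dirty pages
and 0 at 200 dirty pages. Nothing in this sketch clamps the factor, so it
goes negative once dirty pages run further past the limit.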

Unfortunately, memcg partitioning could fundamentally make dirty
throttling bumpier.

Imagine 10 memcgs each with

- memcg_dirty_limit=50MB
- 1 dd dirty task

The flusher thread will be working on 10 inodes in turn, each time
grabbing the next inode and taking ~0.5s to write ~50MB of its dirty
pages to the disk. So each inode will be flushed once every ~5s
(10 inodes x ~0.5s each).

Without a memcg dirty ratio, the dd tasks will be throttled quite
smoothly.  With it, however, each memcg will be limited to 50MB of
dirty pages, and its dirty count will drop quickly from 50MB to 0
once every 5 seconds.

As a result, the small partitions of dirty pages will translate the
flusher's bumpy writeout (which is necessary for performance) into
bumpy progress for the dd tasks. The dd tasks will be blocked for
seconds from time to time.
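
The numbers in the example above can be worked through with a small
sketch. The struct and helper names are hypothetical, and the model is
deliberately naive: it assumes a strict round-robin flusher and a dd
task that refills its memcg to the limit almost instantly after each
flush.

```c
#include <assert.h>

/* Hypothetical model of the 10-memcg example; not kernel code. */
struct memcg_flush_model {
	unsigned int nr_memcgs;		 /* one dd task and inode per memcg */
	unsigned int dirty_limit_mb;	 /* per-memcg dirty limit, in MB */
	unsigned int flush_ms_per_inode; /* time the flusher spends per inode */
};

/* Time for the flusher to come back to the same inode. */
static unsigned int flush_period_ms(const struct memcg_flush_model *m)
{
	return m->nr_memcgs * m->flush_ms_per_inode;
}

/* Disk bandwidth implied by writing one full limit per visit. */
static unsigned int disk_bandwidth_mbps(const struct memcg_flush_model *m)
{
	return m->dirty_limit_mb * 1000 / m->flush_ms_per_inode;
}

/* Long-run write rate each dd task can sustain. */
static unsigned int per_task_rate_mbps(const struct memcg_flush_model *m)
{
	return m->dirty_limit_mb * 1000 / flush_period_ms(m);
}

/*
 * Time a task sits at its limit while the flusher serves the other
 * inodes, assuming it refilled its memcg right after the last flush.
 */
static unsigned int stall_ms(const struct memcg_flush_model *m)
{
	return (m->nr_memcgs - 1) * m->flush_ms_per_inode;
}
```

For the case above (10 memcgs, 50MB limit, 500ms per inode) the model
gives a 5000ms flush period, an implied 100MB/s disk, 10MB/s per dd
task, and stalls of roughly 4500ms, which matches "blocked for seconds".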

So I cannot help thinking: can the problem be eliminated at the root?
The basic scheme could be: when reclaiming from a memcg zone, if any
PG_writeback/PG_dirty pages are encountered, set PG_reclaim on them,
move them to the global zone and de-account them from the memcg.

In this way, we can avoid dirty/writeback pages hurting the (possibly
small) memcg zones. The aggressive dirtier tasks will be throttled by
the global 20% limit and memcg page reclaim can proceed smoothly.
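
As a toy illustration of that scheme (a userspace model only, not
kernel code), the scan below treats clean pages as freed on the spot
and hands dirty or writeback pages over to a "global" owner with a
reclaim flag set. Every name here (toy_page, toy_memcg, the TOY_*
constants, toy_memcg_reclaim_scan) is invented for the sketch, and real
LRU handling, locking and charge accounting are ignored.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins for PG_dirty, PG_writeback and PG_reclaim. */
enum {
	TOY_PG_DIRTY	 = 1 << 0,
	TOY_PG_WRITEBACK = 1 << 1,
	TOY_PG_RECLAIM	 = 1 << 2,
};

#define TOY_OWNER_GLOBAL (-1)	/* page sits on the global list */
#define TOY_OWNER_FREE	 (-2)	/* page has been reclaimed */

struct toy_page {
	unsigned int flags;
	int owner;		/* memcg id, or one of TOY_OWNER_* */
};

struct toy_memcg {
	int id;
	size_t nr_charged;	/* pages currently charged to this memcg */
};

/*
 * Scan the pages charged to @cg.  Clean pages are freed at once; dirty
 * or writeback pages are flagged for reclaim, moved to the global owner
 * and uncharged.  Returns the number of pages moved and stores the
 * number of pages freed in @nr_freed.
 */
static size_t toy_memcg_reclaim_scan(struct toy_memcg *cg,
				     struct toy_page *pages, size_t n,
				     size_t *nr_freed)
{
	size_t i, moved = 0;

	*nr_freed = 0;
	for (i = 0; i < n; i++) {
		struct toy_page *p = &pages[i];

		if (p->owner != cg->id)
			continue;
		if (p->flags & (TOY_PG_DIRTY | TOY_PG_WRITEBACK)) {
			p->flags |= TOY_PG_RECLAIM;
			p->owner = TOY_OWNER_GLOBAL;
			moved++;
		} else {
			p->owner = TOY_OWNER_FREE;
			(*nr_freed)++;
		}
		cg->nr_charged--;	/* de-account in both cases */
	}
	return moved;
}
```

In this model the memcg never waits on writeback: its charge drops as
soon as the scan passes, and the moved pages become a global concern.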

Thanks,
Fengguang

