All of lore.kernel.org
 help / color / mirror / Atom feed
From: Jan Kara <jack@suse.cz>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>, Namjae Jeon <linkinjeon@gmail.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-kernel@vger.kernel.org,
	Namjae Jeon <namjae.jeon@samsung.com>,
	Vivek Trivedi <t.vivek@samsung.com>,
	Linux Memory Management List <linux-mm@kvack.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
Date: Wed, 26 Sep 2012 22:22:47 +0200	[thread overview]
Message-ID: <20120926202247.GA20920@quack.suse.cz> (raw)
In-Reply-To: <20120926165602.GA24672@localhost>

On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > > From: Namjae Jeon <namjae.jeon@samsung.com>
> > > > 
> > > > This patch is based on suggestion by Wu Fengguang:
> > > > https://lkml.org/lkml/2011/8/19/19
> > > > 
> > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > > 
> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > > writecache to control per bdi dirty limits and task throttling.
> > > > 
> > > > However, there might be a usecase where user wants a per bdi writeback tuning
> > > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > > especially at NFS server.
> > > > 
> > > > dirty_background_centisecs provides an interface where user can tune
> > > > background writeback start threshold using
> > > > /sys/block/sda/bdi/dirty_background_centisecs
> > > > 
> > > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > > estimation to start background writeback.
> >   The functionality you describe, i.e. start flushing bdi when there's
> > reasonable amount of dirty data on it, looks sensible and useful. However
> > I'm not so sure whether the interface you propose is the right one.
> > Traditionally, we allow user to set amount of dirty data (either in bytes
> > or percentage of memory) when background writeback should start. You
> > propose setting the amount of data in centisecs-to-write. Why that
> > difference? Also this interface ties our throughput estimation code (which
> > is an implementation detail of current dirty throttling) with the userspace
> > API. So we'd have to maintain the estimation code forever, possibly also
> > face problems when we change the estimation code (and thus estimates in
> > some cases) and users will complain that the values they set originally no
> > longer work as they used to.
> 
> Yes, that bandwidth estimation is not all that (and in theory cannot
> be made) reliable which may be a surprise to the user. Which make the
> interface flaky.
> 
> > Also, as with each knob, there's a problem how to properly set its value?
> > Most admins won't know about the knob and so won't touch it. Others might
> > know about the knob but will have hard time figuring out what value should
> > they set. So if there's a new knob, it should have a sensible initial
> > value. And since this feature looks like a useful one, it shouldn't be
> > zero.
> 
> Agreed in principle. There seems be no reasonable defaults for the
> centisecs-to-write interface, mainly due to its inaccurate nature,
> especially the initial value may be wildly wrong on fresh system
> bootup. This is also true for your proposed interfaces, see below.
> 
> > So my personal preference would be to have bdi->dirty_background_ratio and
> > bdi->dirty_background_bytes and start background writeback whenever
> > one of global background limit and per-bdi background limit is exceeded. I
> > think this interface will do the job as well and it's easier to maintain in
> > future.
> 
> bdi->dirty_background_ratio, if I understand its semantics right, is
> unfortunately flaky in the same principle as centisecs-to-write,
> because it relies on the (implicitly estimation of) writeout
> proportions. The writeout proportions for each bdi starts with 0,
> which is even worse than the 100MB/s initial value for
> bdi->write_bandwidth and will trigger background writeback on the
> first write.
  Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
estimates at all. Limit would be
  dirtiable_memory * bdi->dirty_backround_ratio.

After all we want to start writeout to bdi when we have enough pages to
reasonably load the device for a while which has nothing to do with how
much is written to this device as compared to other devices.
 
OTOH I'm not particularly attached to this interface. Especially since on a
lot of today's machines, 1% is rather big so people might often end up
using dirty_background_bytes anyway.

> bdi->dirty_background_bytes is, however, reliable, and gives users
> total control. If we export this interface alone, I'd imagine users
> who want to control centisecs-to-write could run a simple script to
> periodically get the write bandwith value out of the existing bdi
> interface and echo it into bdi->dirty_background_bytes. Which makes
> simple yet good enough centisecs-to-write controlling.
> 
> So what do you think about exporting a really dumb
> bdi->dirty_background_bytes, which will effectively give smart users
> the freedom to do smart control over per-bdi background writeback
> threshold? The users are offered the freedom to do his own bandwidth
> estimation and choose not to rely on the kernel estimation, which will
> free us from the burden of maintaining a flaky interface as well. :)
  That's fine with me. Just it would be nice if we gave
bdi->dirty_background_bytes some useful initial value. Maybe like
dirtiable_memory * dirty_background_ratio?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

WARNING: multiple messages have this Message-ID (diff)
From: Jan Kara <jack@suse.cz>
To: Fengguang Wu <fengguang.wu@intel.com>
Cc: Jan Kara <jack@suse.cz>, Namjae Jeon <linkinjeon@gmail.com>,
	Dave Chinner <david@fromorbit.com>,
	linux-kernel@vger.kernel.org,
	Namjae Jeon <namjae.jeon@samsung.com>,
	Vivek Trivedi <t.vivek@samsung.com>,
	Linux Memory Management List <linux-mm@kvack.org>,
	linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable
Date: Wed, 26 Sep 2012 22:22:47 +0200	[thread overview]
Message-ID: <20120926202247.GA20920@quack.suse.cz> (raw)
In-Reply-To: <20120926165602.GA24672@localhost>

On Thu 27-09-12 00:56:02, Wu Fengguang wrote:
> On Tue, Sep 25, 2012 at 12:23:06AM +0200, Jan Kara wrote:
> > On Thu 20-09-12 16:44:22, Wu Fengguang wrote:
> > > On Sun, Sep 16, 2012 at 08:25:42AM -0400, Namjae Jeon wrote:
> > > > From: Namjae Jeon <namjae.jeon@samsung.com>
> > > > 
> > > > This patch is based on suggestion by Wu Fengguang:
> > > > https://lkml.org/lkml/2011/8/19/19
> > > > 
> > > > kernel has mechanism to do writeback as per dirty_ratio and dirty_background
> > > > ratio. It also maintains per task dirty rate limit to keep balance of
> > > > dirty pages at any given instance by doing bdi bandwidth estimation.
> > > > 
> > > > Kernel also has max_ratio/min_ratio tunables to specify percentage of
> > > > writecache to control per bdi dirty limits and task throttling.
> > > > 
> > > > However, there might be a usecase where user wants a per bdi writeback tuning
> > > > parameter to flush dirty data once per bdi dirty data reach a threshold
> > > > especially at NFS server.
> > > > 
> > > > dirty_background_centisecs provides an interface where user can tune
> > > > background writeback start threshold using
> > > > /sys/block/sda/bdi/dirty_background_centisecs
> > > > 
> > > > dirty_background_centisecs is used alongwith average bdi write bandwidth
> > > > estimation to start background writeback.
> >   The functionality you describe, i.e. start flushing bdi when there's
> > reasonable amount of dirty data on it, looks sensible and useful. However
> > I'm not so sure whether the interface you propose is the right one.
> > Traditionally, we allow user to set amount of dirty data (either in bytes
> > or percentage of memory) when background writeback should start. You
> > propose setting the amount of data in centisecs-to-write. Why that
> > difference? Also this interface ties our throughput estimation code (which
> > is an implementation detail of current dirty throttling) with the userspace
> > API. So we'd have to maintain the estimation code forever, possibly also
> > face problems when we change the estimation code (and thus estimates in
> > some cases) and users will complain that the values they set originally no
> > longer work as they used to.
> 
> Yes, that bandwidth estimation is not all that (and in theory cannot
> be made) reliable which may be a surprise to the user. Which make the
> interface flaky.
> 
> > Also, as with each knob, there's a problem how to properly set its value?
> > Most admins won't know about the knob and so won't touch it. Others might
> > know about the knob but will have hard time figuring out what value should
> > they set. So if there's a new knob, it should have a sensible initial
> > value. And since this feature looks like a useful one, it shouldn't be
> > zero.
> 
> Agreed in principle. There seems be no reasonable defaults for the
> centisecs-to-write interface, mainly due to its inaccurate nature,
> especially the initial value may be wildly wrong on fresh system
> bootup. This is also true for your proposed interfaces, see below.
> 
> > So my personal preference would be to have bdi->dirty_background_ratio and
> > bdi->dirty_background_bytes and start background writeback whenever
> > one of global background limit and per-bdi background limit is exceeded. I
> > think this interface will do the job as well and it's easier to maintain in
> > future.
> 
> bdi->dirty_background_ratio, if I understand its semantics right, is
> unfortunately flaky in the same principle as centisecs-to-write,
> because it relies on the (implicitly estimation of) writeout
> proportions. The writeout proportions for each bdi starts with 0,
> which is even worse than the 100MB/s initial value for
> bdi->write_bandwidth and will trigger background writeback on the
> first write.
  Well, I meant bdi->dirty_backround_ratio wouldn't use writeout proportion
estimates at all. Limit would be
  dirtiable_memory * bdi->dirty_backround_ratio.

After all we want to start writeout to bdi when we have enough pages to
reasonably load the device for a while which has nothing to do with how
much is written to this device as compared to other devices.
 
OTOH I'm not particularly attached to this interface. Especially since on a
lot of today's machines, 1% is rather big so people might often end up
using dirty_background_bytes anyway.

> bdi->dirty_background_bytes is, however, reliable, and gives users
> total control. If we export this interface alone, I'd imagine users
> who want to control centisecs-to-write could run a simple script to
> periodically get the write bandwith value out of the existing bdi
> interface and echo it into bdi->dirty_background_bytes. Which makes
> simple yet good enough centisecs-to-write controlling.
> 
> So what do you think about exporting a really dumb
> bdi->dirty_background_bytes, which will effectively give smart users
> the freedom to do smart control over per-bdi background writeback
> threshold? The users are offered the freedom to do his own bandwidth
> estimation and choose not to rely on the kernel estimation, which will
> free us from the burden of maintaining a flaky interface as well. :)
  That's fine with me. Just it would be nice if we gave
bdi->dirty_background_bytes some useful initial value. Maybe like
dirtiable_memory * dirty_background_ratio?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

  reply	other threads:[~2012-09-26 20:22 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-09-16 12:25 [PATCH v3 1/2] writeback: add dirty_background_centisecs per bdi variable Namjae Jeon
2012-09-20  8:44 ` Fengguang Wu
2012-09-20  8:44   ` Fengguang Wu
2012-09-24 22:23   ` Jan Kara
2012-09-24 22:23     ` Jan Kara
2012-09-25  6:50     ` Namjae Jeon
2012-09-25  6:50       ` Namjae Jeon
2012-09-26 16:56     ` Fengguang Wu
2012-09-26 16:56       ` Fengguang Wu
2012-09-26 20:22       ` Jan Kara [this message]
2012-09-26 20:22         ` Jan Kara
2012-09-27  6:00         ` Namjae Jeon
2012-09-27  6:00           ` Namjae Jeon
2012-09-27 11:52           ` Jan Kara
2012-09-27 11:52             ` Jan Kara
2012-09-28  1:59             ` Namjae Jeon
2012-09-28  1:59               ` Namjae Jeon
2012-09-25  1:36   ` Dave Chinner
2012-09-25  1:36     ` Dave Chinner
2012-09-25  6:46     ` Namjae Jeon
2012-09-25  6:46       ` Namjae Jeon
2012-09-25  6:54       ` Namjae Jeon
2012-09-25  6:54         ` Namjae Jeon
2012-10-19  7:51       ` Namjae Jeon
2012-10-19  7:51         ` Namjae Jeon
2012-10-22  1:25         ` Dave Chinner
2012-10-22  1:25           ` Dave Chinner
2012-11-19 23:18           ` Namjae Jeon
2012-11-19 23:18             ` Namjae Jeon
2012-12-05  3:51             ` Wanpeng Li
2012-12-05  3:51             ` Wanpeng Li
     [not found]             ` <50bec4af.0aad2a0a.2fc9.6fe5SMTPIN_ADDED_BROKEN@mx.google.com>
2012-12-05 10:19               ` Namjae Jeon
2012-12-05 10:19                 ` Namjae Jeon

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120926202247.GA20920@quack.suse.cz \
    --to=jack@suse.cz \
    --cc=david@fromorbit.com \
    --cc=fengguang.wu@intel.com \
    --cc=linkinjeon@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=namjae.jeon@samsung.com \
    --cc=t.vivek@samsung.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.