writeback fixes for 3.2-rc5

linux-fsdevel.vger.kernel.org archive mirror

* writeback fixes for 3.2-rc5
@ 2011-12-05  6:22 Wu Fengguang
  2011-12-05  6:22 ` [PATCH 1/5] writeback: Fix issue on make htmldocs Wu Fengguang
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-05  6:22 UTC (permalink / raw)
  To: linux-fsdevel

Hi,

I'd like to push these patches to Linus later this week.  Please review.

- abort write(2) on SIGKILL
- 2 patches to keep system responsive on stalled NFS mount
- 2 comment patches

[PATCH 1/5] writeback: Fix issue on make htmldocs
[PATCH 2/5] fs: Make write(2) interruptible by a fatal signal
[PATCH 3/5] writeback: comment on the bdi dirty threshold
[PATCH 4/5] writeback: permit through good bdi even when global dirty exceeded
[PATCH 5/5] writeback: set max_pause to lowest value on zero bdi_dirty

Thanks,
Fengguang

* [PATCH 1/5] writeback: Fix issue on make htmldocs
  2011-12-05  6:22 writeback fixes for 3.2-rc5 Wu Fengguang
@ 2011-12-05  6:22 ` Wu Fengguang
  2011-12-05  6:22 ` [PATCH 2/5] fs: Make write(2) interruptible by a fatal signal Wu Fengguang
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-05  6:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Marcos Paulo de Souza, Wu Fengguang

From: Marcos Paulo de Souza <marcos.mage@gmail.com>

Document the @reason parameter to make "make htmldocs" happy.

Acked-by: Randy Dunlap <rdunlap@xenotime.net>
Signed-off-by: Marcos Paulo de Souza <marcos.mage@gmail.com>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 fs/fs-writeback.c |    5 +++++
 1 files changed, 5 insertions(+), 0 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 73c3992..ac86f8b 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -156,6 +156,7 @@ __bdi_start_writeback(struct backing_dev_info *bdi, long nr_pages,
  * bdi_start_writeback - start writeback
  * @bdi: the backing device to write from
  * @nr_pages: the number of pages to write
+ * @reason: reason why some writeback work was initiated
  *
  * Description:
  *   This does WB_SYNC_NONE opportunistic writeback. The IO is only
@@ -1223,6 +1224,7 @@ static void wait_sb_inodes(struct super_block *sb)
  * writeback_inodes_sb_nr -	writeback dirty inodes from given super_block
  * @sb: the superblock
  * @nr: the number of pages to write
+ * @reason: reason why some writeback work was initiated
  *
  * Start writeback on some inodes on this super_block. No guarantees are made
  * on how many (if any) will be written, and this function does not wait
@@ -1251,6 +1253,7 @@ EXPORT_SYMBOL(writeback_inodes_sb_nr);
 /**
  * writeback_inodes_sb	-	writeback dirty inodes from given super_block
  * @sb: the superblock
+ * @reason: reason why some writeback work was initiated
  *
  * Start writeback on some inodes on this super_block. No guarantees are made
  * on how many (if any) will be written, and this function does not wait
@@ -1265,6 +1268,7 @@ EXPORT_SYMBOL(writeback_inodes_sb);
 /**
  * writeback_inodes_sb_if_idle	-	start writeback if none underway
  * @sb: the superblock
+ * @reason: reason why some writeback work was initiated
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
@@ -1285,6 +1289,7 @@ EXPORT_SYMBOL(writeback_inodes_sb_if_idle);
  * writeback_inodes_sb_if_idle	-	start writeback if none underway
  * @sb: the superblock
  * @nr: the number of pages to write
+ * @reason: reason why some writeback work was initiated
  *
  * Invoke writeback_inodes_sb if no writeback is currently underway.
  * Returns 1 if writeback was started, 0 if not.
-- 
1.7.7.1


* [PATCH 2/5] fs: Make write(2) interruptible by a fatal signal
  2011-12-05  6:22 writeback fixes for 3.2-rc5 Wu Fengguang
  2011-12-05  6:22 ` [PATCH 1/5] writeback: Fix issue on make htmldocs Wu Fengguang
@ 2011-12-05  6:22 ` Wu Fengguang
  2011-12-05  6:22 ` [PATCH 3/5] writeback: comment on the bdi dirty threshold Wu Fengguang
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-05  6:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Jan Kara, Wu Fengguang

From: Jan Kara <jack@suse.cz>

Currently write(2) to a file is not interruptible by any signal.
Sometimes interruptibility is desirable, e.g. when you want to quickly
kill a process hogging your disk. Also, with commit 499d05ecf990 ("mm:
Make task in balance_dirty_pages() killable"), it's necessary to abort
the current write accordingly, so that it does not quickly dirty lots
more pages at an unthrottled rate.

This patch makes write interruptible by SIGKILL. We do not allow write
to be interruptible by any other signal because that has a larger
potential to break some badly written applications.

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/filemap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..c106d3b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
-
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -2463,7 +2462,10 @@ again:
 		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
-
+		if (fatal_signal_pending(current)) {
+			status = -EINTR;
+			break;
+		}
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
-- 
1.7.7.1

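The loop change in the patch above can be modelled in userspace. The
following is a minimal sketch, not the kernel code: sim_task,
sim_fatal_signal_pending() and sim_perform_write() are hypothetical names,
and the "signal" is just a counter that trips after a chosen number of
chunks. It only illustrates the control flow of the added check together
with the "return written ? written : status" convention shown in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/types.h>

/* Hypothetical stand-in for the task state behind fatal_signal_pending(). */
struct sim_task {
	size_t chunks_done;
	size_t kill_after_chunks;	/* (size_t)-1 means "never killed" */
};

static bool sim_fatal_signal_pending(const struct sim_task *t)
{
	return t->chunks_done >= t->kill_after_chunks;
}

/*
 * Model of the loop shape only: "copy" count bytes in chunk-sized pieces
 * and check for a fatal signal after each piece.
 */
static ssize_t sim_perform_write(struct sim_task *t, size_t count, size_t chunk)
{
	ssize_t written = 0;
	ssize_t status = 0;

	assert(chunk > 0);
	while (count > 0) {
		size_t copied = count < chunk ? count : chunk;

		written += (ssize_t)copied;
		count -= copied;
		t->chunks_done++;

		/* Mirrors the check added by the patch. */
		if (sim_fatal_signal_pending(t)) {
			status = -EINTR;
			break;
		}
	}
	/* Same return convention as in the diff. */
	return written ? written : status;
}
```

In this model the check runs after at least one chunk has been counted, so
an interrupted call reports the bytes already written (a short write)
rather than -EINTR.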

* [PATCH 3/5] writeback: comment on the bdi dirty threshold
  2011-12-05  6:22 writeback fixes for 3.2-rc5 Wu Fengguang
  2011-12-05  6:22 ` [PATCH 1/5] writeback: Fix issue on make htmldocs Wu Fengguang
  2011-12-05  6:22 ` [PATCH 2/5] fs: Make write(2) interruptible by a fatal signal Wu Fengguang
@ 2011-12-05  6:22 ` Wu Fengguang
  2011-12-05  6:22 ` [PATCH 4/5] writeback: permit through good bdi even when global dirty exceeded Wu Fengguang
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-05  6:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang

We do "floating proportions" to let active devices grow their target
share of dirty pages and to let stalled/inactive devices decrease their
target share over time.

It works well except in the case of "an inactive disk suddenly goes
busy", where the initial target share may be too small. To mitigate
this, bdi_position_ratio() has the line below to raise a small
bdi_thresh when it's safe to do so, so that the disk is fed with enough
dirty pages for efficient IO and, in turn, a fast ramp-up of bdi_thresh:

        bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);

balance_dirty_pages() normally does negative feedback control which
adjusts ratelimit to balance the bdi dirty pages around the target.
In some extreme cases when that is not enough, it will have to block
the tasks completely until the bdi dirty pages drop below bdi_thresh.

Acked-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   16 ++++++++++++++--
 1 files changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 7125248..155efca 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -411,8 +411,13 @@ void global_dirty_limits(unsigned long *pbackground, unsigned long *pdirty)
  *
  * Returns @bdi's dirty limit in pages. The term "dirty" in the context of
  * dirty balancing includes all PG_dirty, PG_writeback and NFS unstable pages.
- * And the "limit" in the name is not seriously taken as hard limit in
- * balance_dirty_pages().
+ *
+ * Note that balance_dirty_pages() will only seriously take it as a hard limit
+ * when sleeping max_pause per page is not enough to keep the dirty pages under
+ * control. For example, when the device is completely stalled due to some error
+ * conditions, or when there are 1000 dd tasks writing to a slow 10MB/s USB key.
+ * In other, normal situations, it acts more gently by throttling the tasks
+ * more (rather than completely blocking them) when the bdi dirty pages go high.
  *
  * It allocates high/low dirty limits to fast/slow devices, in order to prevent
  * - starving fast devices
@@ -594,6 +599,13 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
 	 */
 	if (unlikely(bdi_thresh > thresh))
 		bdi_thresh = thresh;
+	/*
+	 * It's very possible that bdi_thresh is close to 0 not because the
+	 * device is slow, but because it has remained inactive for a long time.
+	 * Grant such devices a reasonably good (hopefully IO efficient)
+	 * threshold, so that the occasional writes won't be blocked and active
+	 * writes can ramp up the threshold quickly.
+	 */
 	bdi_thresh = max(bdi_thresh, (limit - dirty) / 8);
 	/*
 	 * scale global setpoint to bdi's:
-- 
1.7.7.1

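The effect of the "bdi_thresh = max(bdi_thresh, (limit - dirty) / 8)" line
discussed above can be worked through with a small standalone helper. This
is only a sketch: raise_small_bdi_thresh() is a hypothetical name, all
values are in pages, and the model assumes dirty <= limit so that the
unsigned subtraction cannot wrap.

```c
#include <assert.h>

/*
 * Model of the single line from bdi_position_ratio(): a device whose
 * threshold has decayed towards 0 is granted 1/8 of the remaining global
 * headroom (limit - dirty) instead.
 */
static unsigned long raise_small_bdi_thresh(unsigned long bdi_thresh,
					    unsigned long limit,
					    unsigned long dirty)
{
	unsigned long floor;

	assert(dirty <= limit);	/* assumption of this model */
	floor = (limit - dirty) / 8;
	return bdi_thresh > floor ? bdi_thresh : floor;
}
```

With illustrative numbers, a long-idle device at bdi_thresh = 10 with
limit = 100000 and dirty = 20000 is raised to (100000 - 20000) / 8 = 10000
pages, while a device already at 50000 is left unchanged. As the global
dirty count approaches the limit, the headroom and therefore the floor
shrink towards 0.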

* [PATCH 4/5] writeback: permit through good bdi even when global dirty exceeded
  2011-12-05  6:22 writeback fixes for 3.2-rc5 Wu Fengguang
                   ` (2 preceding siblings ...)
  2011-12-05  6:22 ` [PATCH 3/5] writeback: comment on the bdi dirty threshold Wu Fengguang
@ 2011-12-05  6:22 ` Wu Fengguang
  2011-12-05  6:22 ` [PATCH 5/5] writeback: set max_pause to lowest value on zero bdi_dirty Wu Fengguang
       [not found] ` <20111212102947.GA6731@localhost>
  5 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-05  6:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang

On a system with 1 local mount and 1 NFS mount, if the NFS server
stops responding while dd is writing to the NFS mount, the NFS dirty
pages may exceed the global dirty limit and _every_ task that writes
will be blocked. The whole system appears unresponsive.

The workaround is to let through the bdi's that have only a small
number of dirty pages. The number chosen (bdi_stat_error pages) is not
enough to let the local disk run at optimal throughput, but it is
enough to keep the system responsive on a broken NFS mount. The user can
then kill the dirtiers on the NFS mount and increase the global dirty
limit to bring up the local disk's throughput.

It risks allowing dirty pages to grow much larger than the global dirty
limit when there are 1000+ mounts; however, that's very unlikely to
happen, especially in low memory profiles.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++++++++
 1 files changed, 13 insertions(+), 0 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 155efca..17403e3 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1148,6 +1148,19 @@ pause:
 		if (task_ratelimit)
 			break;

+		/*
+		 * In the case of an unresponsive NFS server where the NFS dirty
+		 * pages exceed dirty_thresh, give the other good bdi's a pipe
+		 * to go through, so that tasks on them still remain responsive.
+		 *
+		 * In theory 1 page is enough to keep the consumer-producer
+		 * pipe going: the flusher cleans 1 page => the task dirties 1
+		 * more page. However bdi_dirty has accounting errors.  So use
+		 * the larger and more IO friendly bdi_stat_error.
+		 */
+		if (bdi_dirty <= bdi_stat_error(bdi))
+			break;
+
 		if (fatal_signal_pending(current))
 			break;
 	}
-- 
1.7.7.1

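The exit conditions of the throttling loop after this patch can be
summarised as a pure function. This is a sketch under assumptions:
may_leave_throttle_loop() is a hypothetical name, stat_error stands in for
the value of bdi_stat_error(bdi), and fatal_signal stands in for
fatal_signal_pending(current). It mirrors only the three checks visible in
the hunk above, in the same order.

```c
#include <assert.h>
#include <stdbool.h>

/* Returns true when a throttled task may break out of the pause loop. */
static bool may_leave_throttle_loop(unsigned long task_ratelimit,
				    unsigned long bdi_dirty,
				    unsigned long stat_error,
				    bool fatal_signal)
{
	/* Assumption of this model: the error bound is at least 1 page. */
	assert(stat_error >= 1);

	if (task_ratelimit)
		return true;
	/* Check added by this patch: a nearly clean bdi is let through. */
	if (bdi_dirty <= stat_error)
		return true;
	return fatal_signal;
}
```

With stat_error = 1 (the single-CPU value mentioned later in this thread),
a task writing to a bdi with 0 or 1 dirty pages gets through even when its
ratelimit is 0, whereas a task on a bdi with thousands of dirty pages keeps
waiting unless it is being killed.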

* [PATCH 5/5] writeback: set max_pause to lowest value on zero bdi_dirty
  2011-12-05  6:22 writeback fixes for 3.2-rc5 Wu Fengguang
                   ` (3 preceding siblings ...)
  2011-12-05  6:22 ` [PATCH 4/5] writeback: permit through good bdi even when global dirty exceeded Wu Fengguang
@ 2011-12-05  6:22 ` Wu Fengguang
       [not found] ` <20111212102947.GA6731@localhost>
  5 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-05  6:22 UTC (permalink / raw)
  To: linux-fsdevel; +Cc: Wu Fengguang

Some traces show lots of bdi_dirty=0 lines where the real value would be
some small number, were it not for the accounting errors in the per-cpu
bdi stats.

In this case the max pause time should really be set to the smallest
(non-zero) value to avoid IO queue underrun and improve throughput.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 17403e3..50f0824 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -989,8 +989,7 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
 	 *
 	 * 8 serves as the safety ratio.
 	 */
-	if (bdi_dirty)
-		t = min(t, bdi_dirty * HZ / (8 * bw + 1));
+	t = min(t, bdi_dirty * HZ / (8 * bw + 1));
 
 	/*
 	 * The pause time will be settled within range (max_pause/4, max_pause).
-- 
1.7.7.1

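The difference between the old and new form of the capping line can be
checked with a small standalone model. This is a sketch, not the kernel
function: SIM_HZ, min_ul(), dirty_pool_cap(), cap_pause_old() and
cap_pause_new() are hypothetical names, t is taken to be in ticks, and bw
is assumed to be in pages per second. The later step that settles the pause
into a non-zero range is not part of the diff and is not modelled.

```c
#include <assert.h>
#include <limits.h>

#define SIM_HZ 1000UL	/* hypothetical tick rate (ticks per second) */

static unsigned long min_ul(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Ticks needed to drain bdi_dirty pages at bw, divided by the safety ratio 8. */
static unsigned long dirty_pool_cap(unsigned long bdi_dirty, unsigned long bw)
{
	assert(bdi_dirty <= ULONG_MAX / SIM_HZ);	/* keep the product from wrapping */
	return bdi_dirty * SIM_HZ / (8 * bw + 1);
}

/* Before the patch: the cap is skipped entirely when bdi_dirty reads 0. */
static unsigned long cap_pause_old(unsigned long t, unsigned long bdi_dirty,
				   unsigned long bw)
{
	if (bdi_dirty)
		t = min_ul(t, dirty_pool_cap(bdi_dirty, bw));
	return t;
}

/* After the patch: the cap always applies, so bdi_dirty == 0 yields 0. */
static unsigned long cap_pause_new(unsigned long t, unsigned long bdi_dirty,
				   unsigned long bw)
{
	return min_ul(t, dirty_pool_cap(bdi_dirty, bw));
}
```

With t = 20 and bw = 25600, a reading of bdi_dirty = 0 leaves the old form
at the full 20 ticks, while the new form returns 0, which is what lets the
pause fall to the smallest allowed value. For any non-zero bdi_dirty the
two forms agree.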

[parent not found: <20111212102947.GA6731@localhost>]

[parent not found: <20111212204634.GB5214@quack.suse.cz>]

* Re: writeback fixes for 3.2-rc5
       [not found]   ` <20111212204634.GB5214@quack.suse.cz>
@ 2011-12-13  1:44     ` Wu Fengguang
  2011-12-13 10:57       ` Jan Kara
  0 siblings, 1 reply; 9+ messages in thread
From: Wu Fengguang @ 2011-12-13  1:44 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel

On Tue, Dec 13, 2011 at 04:46:34AM +0800, Jan Kara wrote:
>   Hi Fengguang,
> 
> On Mon 12-12-11 18:29:48, Wu Fengguang wrote:
> > May I ask if you see any problems pushing these patches to Linus?
>   Sorry, this somehow escaped my attention. Patches 1, 2, 3, and 5 are
> fine. I'm not sure about patch 4 - I'm not against it but e.g. on
> single-cpu machine, bdi_stat_error() is 1 so there the patch won't help

As the comment said, actually 1 is enough to let the tasks go through.
It may sound terrible to write 1 page at a time, however it's already
much better than being blocked there forever. The user experience is
totally different according to my tests, because most tasks are not IO
intensive at all, they are blocked simply on writing some small file.

> much. Encoding fixed constant like you had in the first version of the
> patch doesn't look nice either. But I don't have a better solution...

My typical need on the global exceeded case is "please at least let me
ssh in and kill some task or raise the dirty limit to break out of the
error condition". IMHO the patch is good enough for that need.

Thanks,
Fengguang

> > 
> > On Mon, Dec 05, 2011 at 02:22:13PM +0800, Wu Fengguang wrote:
> > > Hi,
> > > 
> > > I'd like to push these patches to Linus later this week.  Please review.
> > > 
> > > - abort write(2) on SIGKILL
> > > - 2 patches to keep system responsive on stalled NFS mount
> > > - 2 comment patches
> > > 
> > > [PATCH 1/5] writeback: Fix issue on make htmldocs
> > > [PATCH 2/5] fs: Make write(2) interruptible by a fatal signal
> > > [PATCH 3/5] writeback: comment on the bdi dirty threshold
> > > [PATCH 4/5] writeback: permit through good bdi even when global
> > > [PATCH 5/5] writeback: set max_pause to lowest value on zero
> > > 
> > > Thanks,
> > > Fengguang
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

* Re: writeback fixes for 3.2-rc5
  2011-12-13  1:44     ` writeback fixes for 3.2-rc5 Wu Fengguang
@ 2011-12-13 10:57       ` Jan Kara
  2011-12-13 11:24         ` Wu Fengguang
  0 siblings, 1 reply; 9+ messages in thread
From: Jan Kara @ 2011-12-13 10:57 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Jan Kara, linux-fsdevel

On Tue 13-12-11 09:44:56, Wu Fengguang wrote:
> On Tue, Dec 13, 2011 at 04:46:34AM +0800, Jan Kara wrote:
> >   Hi Fengguang,
> > 
> > On Mon 12-12-11 18:29:48, Wu Fengguang wrote:
> > > May I ask if you see any problems pushing these patches to Linus?
> >   Sorry, this somehow escaped my attention. Patches 1, 2, 3, and 5 are
> > fine. I'm not sure about patch 4 - I'm not against it but e.g. on
> > single-cpu machine, bdi_stat_error() is 1 so there the patch won't help
> 
> As the comment said, actually 1 is enough to let the tasks go through.
> It may sound terrible to write 1 page at a time, however it's already
> much better than being blocked there forever. The user experience is
> totally different according to my tests, because most tasks are not IO
> intensive at all, they are blocked simply on writing some small file.
> 
> > much. Encoding fixed constant like you had in the first version of the
> > patch doesn't look nice either. But I don't have a better solution...
> 
> My typical need on the global exceeded case is "please at least let me
> ssh in and kill some task or raise the dirty limit to break out of the
> error condition". IMHO the patch is good enough for that need.
  Yes, I understand it's better than nothing. That's why I think it's an
acceptable solution at least for now. I just don't feel completely
satisfied with it :). So please go ahead and merge the fixes with Linus.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

* Re: writeback fixes for 3.2-rc5
  2011-12-13 10:57       ` Jan Kara
@ 2011-12-13 11:24         ` Wu Fengguang
  0 siblings, 0 replies; 9+ messages in thread
From: Wu Fengguang @ 2011-12-13 11:24 UTC (permalink / raw)
  To: Jan Kara; +Cc: linux-fsdevel@vger.kernel.org

On Tue, Dec 13, 2011 at 06:57:11PM +0800, Jan Kara wrote:
> On Tue 13-12-11 09:44:56, Wu Fengguang wrote:
> > On Tue, Dec 13, 2011 at 04:46:34AM +0800, Jan Kara wrote:
> > >   Hi Fengguang,
> > > 
> > > On Mon 12-12-11 18:29:48, Wu Fengguang wrote:
> > > > May I ask if you see any problems pushing these patches to Linus?
> > >   Sorry, this somehow escaped my attention. Patches 1, 2, 3, and 5 are
> > > fine. I'm not sure about patch 4 - I'm not against it but e.g. on
> > > single-cpu machine, bdi_stat_error() is 1 so there the patch won't help
> > 
> > As the comment said, actually 1 is enough to let the tasks go through.
> > It may sound terrible to write 1 page at a time, however it's already
> > much better than being blocked there forever. The user experience is
> > totally different according to my tests, because most tasks are not IO
> > intensive at all, they are blocked simply on writing some small file.
> > 
> > > much. Encoding fixed constant like you had in the first version of the
> > > patch doesn't look nice either. But I don't have a better solution...
> > 
> > My typical need on the global exceeded case is "please at least let me
> > ssh in and kill some task or raise the dirty limit to break out of the
> > error condition". IMHO the patch is good enough for that need.
>   Yes, I understand it's better than nothing. That's why I think it's an
> acceptable solution at least for now. I just don't feel completely
> satisfied with it :). So please go ahead and merge the fixes with Linus.

OK, thanks! :)

Fengguang
