[PATCH] fs: Make write(2) interruptible by a fatal signal

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] fs: Make write(2) interruptible by a fatal signal
@ 2011-12-01 10:27 Jan Kara
  2011-12-01 12:24 ` Wu Fengguang
  0 siblings, 1 reply; 11+ messages in thread
From: Jan Kara @ 2011-12-01 10:27 UTC (permalink / raw)
  To: LKML; +Cc: linux-fsdevel, Andrew Morton, Linus Torvalds, Wu Fengguang,
	Jan Kara

Currently write(2) to a file is not interruptible by any signal. Sometimes this
is desirable (e.g. when you want to quickly kill a process hogging your disk or
when some process gets blocked in balance_dirty_pages() indefinitely due to a
filesystem being in an error condition). This patch makes write interruptible
by SIGKILL. We do not allow write to be interruptible by any other signal
because that has larger potential of screwing some badly written applications.

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

  Resending this patch to allow people outside linux-fsdevel to take part
in the discussion...

diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..c106d3b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));

 again:
-
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -2463,7 +2462,10 @@ again:
 		written += copied;

 		balance_dirty_pages_ratelimited(mapping);
-
+		if (fatal_signal_pending(current)) {
+			status = -EINTR;
+			break;
+		}
 	} while (iov_iter_count(i));

 	return written ? written : status;
-- 
1.7.1

^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Make write(2) interruptible by a fatal signal
  2011-12-01 10:27 [PATCH] fs: Make write(2) interruptible by a fatal signal Jan Kara
@ 2011-12-01 12:24 ` Wu Fengguang
  2011-12-01 14:27   ` Matthew Wilcox
  0 siblings, 1 reply; 11+ messages in thread
From: Wu Fengguang @ 2011-12-01 12:24 UTC (permalink / raw)
  To: Jan Kara
  Cc: LKML, linux-fsdevel@vger.kernel.org, Andrew Morton,
	Linus Torvalds, Theodore Ts'o, Matthew Wilcox,
	Christoph Hellwig

Hi,

How are we going to do with this patch?

> This patch makes write interruptible by SIGKILL.

Let me try to summarize the objective impacts of (not) merging this
patch, and would like to hear more opinions from experienced users.

- w/o patch

BEHAVIOR:
write(2) insists to complete even when the user really wants to stop it.

IMPACT:
It could be annoying to experience slow responses to "kill -9" when
it's a large write to a slow device, for example,

        dd if=/dev/zero of=/mnt/nokia/zero bs=100M

- w/ patch

BEHAVIOR:
write(2) aborts quickly with possible partial write on SIGKILL

IMPACT:
The partial write might lead to data corruption somewhere, sometime
(the possibility is low but real) and bring trouble to some users.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Make write(2) interruptible by a fatal signal
  2011-12-01 12:24 ` Wu Fengguang
@ 2011-12-01 14:27   ` Matthew Wilcox
  2011-12-01 16:10     ` Linus Torvalds
                       ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Matthew Wilcox @ 2011-12-01 14:27 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, LKML, linux-fsdevel@vger.kernel.org, Andrew Morton,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig

On Thu, Dec 01, 2011 at 08:24:25PM +0800, Wu Fengguang wrote:
> > This patch makes write interruptible by SIGKILL.
> 
> Let me try to summarize the objective impacts of (not) merging this
> patch, and would like to hear more opinions from experienced users.
> 
> - w/o patch
> 
> BEHAVIOR:
> write(2) insists to complete even when the user really wants to stop it.
> 
> IMPACT:
> It could be annoying to experience slow responses to "kill -9" when
> it's a large write to a slow device, for example,
> 
>         dd if=/dev/zero of=/mnt/nokia/zero bs=100M

Another problem scenario is an NFS mounted file going away while the
user is writing to it.  The user should be able to kill the stuck process
without rebooting their machine.

> - w/ patch
> 
> BEHAVIOR:
> write(2) aborts quickly with possible partial write on SIGKILL
> 
> IMPACT:
> The partial write might lead to data corruption somewhere, sometime
> (the possibility is low but real) and bring trouble to some users.

Let's examine these cases.  We've already written at least some of the
data into the page cache (and updated i_size for extending writes in the
call to ->write_end).  It's just not hit the backing store yet.  That means
that this state of affairs is already *visible* to another process on the
same machine, it's just not *durable* (eg in the event of power failure).

I think in the worst case, we've simply extended the window of opportunity
for another process to see the partial write.

So, please add

Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Make write(2) interruptible by a fatal signal
  2011-12-01 14:27   ` Matthew Wilcox
@ 2011-12-01 16:10     ` Linus Torvalds
  2011-12-02 11:58       ` Janne Blomqvist
  2011-12-02  2:05     ` Wu Fengguang
  2011-12-02  6:36     ` [PATCH] writeback: permit through good bdi even when global dirty exceeded Wu Fengguang
  2 siblings, 1 reply; 11+ messages in thread
From: Linus Torvalds @ 2011-12-01 16:10 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Wu Fengguang, Jan Kara, LKML, linux-fsdevel@vger.kernel.org,
	Andrew Morton, Theodore Ts'o, Christoph Hellwig

On Thu, Dec 1, 2011 at 6:27 AM, Matthew Wilcox <matthew@wil.cx> wrote:
>
> Another problem scenario is an NFS mounted file going away while the
> user is writing to it.  The user should be able to kill the stuck process
> without rebooting their machine.

Well, NFS has always had the 'intr' mount option.

The problem with NFS and 'intr' is that it's very user-visible, and
makes even nonfatal signals cause EINTR. So the "abort writes on fatal
signals" is a lot less noticeable to any programs: sure, it does
introduce potentially visible semantic changes, but now they are
visible only across different processes, not within one process.

And anybody who relies on "all or nothing" even in the face of SIGKILL
is just broken. That said, like so many changes, I think this falls
solidly on the "we should do this, but if somebody reports a
regression..."

I suspect the likelihood of regressions is basically zero. No
application that actually cares about its data would ever rely on the
"writes will complete entirely or not at all" semantics. Not only has
it never been true on NFS, but it's not true in the face of crashes
either, so any careful app will already depend on other things like
fsync+rename etc.

                           Linus

                          Linus

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Make write(2) interruptible by a fatal signal
  2011-12-01 14:27   ` Matthew Wilcox
  2011-12-01 16:10     ` Linus Torvalds
@ 2011-12-02  2:05     ` Wu Fengguang
  2011-12-02  6:36     ` [PATCH] writeback: permit through good bdi even when global dirty exceeded Wu Fengguang
  2 siblings, 0 replies; 11+ messages in thread
From: Wu Fengguang @ 2011-12-02  2:05 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jan Kara, LKML, linux-fsdevel@vger.kernel.org, Andrew Morton,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig,
	Trond Myklebust, Jeremy Allison

On Thu, Dec 01, 2011 at 10:27:05PM +0800, Matthew Wilcox wrote:
> On Thu, Dec 01, 2011 at 08:24:25PM +0800, Wu Fengguang wrote:
> > > This patch makes write interruptible by SIGKILL.
> > 
> > Let me try to summarize the objective impacts of (not) merging this
> > patch, and would like to hear more opinions from experienced users.
> > 
> > - w/o patch
> > 
> > BEHAVIOR:
> > write(2) insists to complete even when the user really wants to stop it.
> > 
> > IMPACT:
> > It could be annoying to experience slow responses to "kill -9" when
> > it's a large write to a slow device, for example,
> > 
> >         dd if=/dev/zero of=/mnt/nokia/zero bs=100M
> 
> Another problem scenario is an NFS mounted file going away while the
> user is writing to it.  The user should be able to kill the stuck process
> without rebooting their machine.

It turns out to eventually block on close().

I just experimented writing to a default mounted NFS:

dd if=/dev/zero of=/fs/zero bs=100M

snb:/nfs/ on /fs type nfs (rw,relatime,vers=3,rsize=262144,wsize=262144,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.1.61,mountvers=3,mountport=42149,mountproto=udp,local_lock=none,addr=192.168.1.61)

At some time stop the NFS server, and do "kill -9 dd" in the client.
Then NFS tries to flush all dirty pages and wait for all writeback
pages on close(), which blocks dd hard:

[80786.103371] dd              D 0000000000000001  3712  4680   4445 0x00000004
[80786.103878]  ffff8800aade5948 0000000000000046 ffffffff81985509 ffffffff81099bb5
[80786.104589]  ffff8800aade4000 00000000001d3280 00000000001d3280 ffff8800b2020000
[80786.105301]  00000000001d3280 ffff8800aade5fd8 00000000001d3280 ffff8800aade5fd8
[80786.106011] Call Trace:
[80786.106265]  [<ffffffff81985509>] ? __schedule+0x313/0x937
[80786.109674]  [<ffffffff81099bb5>] ? local_clock+0x41/0x5a
[80786.110041]  [<ffffffff81094afd>] ? prepare_to_wait+0x6c/0x79
[80786.110421]  [<ffffffff81099bb5>] ? local_clock+0x41/0x5a
[80786.110788]  [<ffffffff810a490c>] ? lock_release_holdtime+0xa3/0xac
[80786.111188]  [<ffffffff81094afd>] ? prepare_to_wait+0x6c/0x79
[80786.111568]  [<ffffffff8103bd68>] ? read_tsc+0x9/0x1b
[80786.111922]  [<ffffffff811003bc>] ? __lock_page+0x6d/0x6d
[80786.112289]  [<ffffffff81985deb>] schedule+0x5a/0x5c
[80786.112639]  [<ffffffff81985e79>] io_schedule+0x8c/0xcf
[80786.113000]  [<ffffffff811003ca>] sleep_on_page+0xe/0x12
[80786.113362]  [<ffffffff81986562>] __wait_on_bit+0x48/0x7b
[80786.113729]  [<ffffffff81100074>] ? find_get_pages_tag+0x133/0x16e
[80786.114127]  [<ffffffff810fff41>] ? generic_file_readonly_mmap+0x22/0x22
[80786.114543]  [<ffffffff811005be>] wait_on_page_bit+0x72/0x79
[80786.114921]  [<ffffffff810948a7>] ? autoremove_wake_function+0x3d/0x3d
[80786.115331]  [<ffffffff8110b1c9>] ? pagevec_lookup_tag+0x25/0x2e
[80786.115722]  [<ffffffff81100bd2>] filemap_fdatawait_range+0x9c/0x163
[80786.116127]  [<ffffffff8110100c>] filemap_write_and_wait_range+0x46/0x59
[80786.116544]  [<ffffffff81246ca1>] nfs_file_fsync+0x61/0xea
[80786.116915]  [<ffffffff81173617>] vfs_fsync_range+0x23/0x25
[80786.117288]  [<ffffffff81173635>] vfs_fsync+0x1c/0x1e
[80786.117641]  [<ffffffff812467f6>] nfs_file_flush+0x67/0x6c
[80786.118012]  [<ffffffff8114bbc1>] filp_close+0x49/0x7e
[80786.118370]  [<ffffffff81077821>] put_files_struct+0xb0/0x142
[80786.118750]  [<ffffffff81077798>] ? put_files_struct+0x27/0x142
[80786.119137]  [<ffffffff81077950>] exit_files+0x4b/0x54
[80786.119495]  [<ffffffff81077ea1>] do_exit+0x27d/0x780
[80786.119847]  [<ffffffff81099bb5>] ? local_clock+0x41/0x5a
[80786.120214]  [<ffffffff810a490c>] ? lock_release_holdtime+0xa3/0xac
[80786.120614]  [<ffffffff81086ab6>] ? get_signal_to_deliver+0x47a/0x50f
[80786.121022]  [<ffffffff8107863b>] do_group_exit+0x88/0xb6
[80786.121389]  [<ffffffff81086b29>] get_signal_to_deliver+0x4ed/0x50f
[80786.121789]  [<ffffffff810a490c>] ? lock_release_holdtime+0xa3/0xac
[80786.122191]  [<ffffffff81035e6e>] do_signal+0x3e/0x641
[80786.122549]  [<ffffffff810364b6>] do_notify_resume+0x2c/0x6e
[80786.122926]  [<ffffffff8140110e>] ? trace_hardirqs_on_thunk+0x3a/0x3f
[80786.123333]  [<ffffffff8198fe13>] int_signal+0x12/0x17

> > - w/ patch
> > 
> > BEHAVIOR:
> > write(2) aborts quickly with possible partial write on SIGKILL
> > 
> > IMPACT:
> > The partial write might lead to data corruption somewhere, sometime
> > (the possibility is low but real) and bring trouble to some users.
> 
> Let's examine these cases.  We've already written at least some of the
> data into the page cache (and updated i_size for extending writes in the
> call to ->write_end).  It's just not hit the backing store yet.  That means
> that this state of affairs is already *visible* to another process on the
> same machine, it's just not *durable* (eg in the event of power failure).
> 
> I think in the worst case, we've simply extended the window of opportunity
> for another process to see the partial write.
> 
> So, please add
> 
> Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com>

OK. Let's try this. I pushed it to linux-next after updating the
changelog on the balance_dirty_pages() part:

commit a50527b19c62c808a7fca022816fff88a50b948d
Author: Jan Kara <jack@suse.cz>
Date:   Fri Dec 2 09:17:02 2011 +0800

    fs: Make write(2) interruptible by a fatal signal
    
    Currently write(2) to a file is not interruptible by any signal.
    Sometimes this is desirable, e.g. when you want to quickly kill a
    process hogging your disk. Also, with commit 499d05ecf990 ("mm: Make
    task in balance_dirty_pages() killable"), it's necessary to abort the
    current write accordingly to avoid it quickly dirtying lots more pages
    at unthrottled rate.
    
    This patch makes write interruptible by SIGKILL. We do not allow write
    to be interruptible by any other signal because that has larger
    potential of screwing some badly written applications.
    
    Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
    Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
    Acked-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH] writeback: permit through good bdi even when global dirty exceeded
  2011-12-01 14:27   ` Matthew Wilcox
  2011-12-01 16:10     ` Linus Torvalds
  2011-12-02  2:05     ` Wu Fengguang
@ 2011-12-02  6:36     ` Wu Fengguang
  2011-12-02  7:03       ` Andrew Morton
  2 siblings, 1 reply; 11+ messages in thread
From: Wu Fengguang @ 2011-12-02  6:36 UTC (permalink / raw)
  To: Matthew Wilcox, Jan Kara
  Cc: LKML, linux-fsdevel@vger.kernel.org, Andrew Morton,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig

On Thu, Dec 01, 2011 at 10:27:05PM +0800, Matthew Wilcox wrote:
> On Thu, Dec 01, 2011 at 08:24:25PM +0800, Wu Fengguang wrote:
> > > This patch makes write interruptible by SIGKILL.
> > 
> > Let me try to summarize the objective impacts of (not) merging this
> > patch, and would like to hear more opinions from experienced users.
> > 
> > - w/o patch
> > 
> > BEHAVIOR:
> > write(2) insists to complete even when the user really wants to stop it.
> > 
> > IMPACT:
> > It could be annoying to experience slow responses to "kill -9" when
> > it's a large write to a slow device, for example,
> > 
> >         dd if=/dev/zero of=/mnt/nokia/zero bs=100M
> 
> Another problem scenario is an NFS mounted file going away while the
> user is writing to it.  The user should be able to kill the stuck process
> without rebooting their machine.

Hmm I find more serious problem: the whole (NFS client) test box may
(not always) go unresponsive when stopping the NFS server. I don't
even have the opportunity to kill that dd, nor able to establish a new
ssh login.

Tests show that the below patch is enough to make the system stay
responsive on a broken NFS mount.

Thanks,
Fengguang
--

Subject: writeback: permit through good bdi even when global dirty exceeded
Date: Fri Dec 02 10:21:33 CST 2011

On a system with 1 local mount and 1 NFS mount, if the NFS server
becomes not responding when dd to the NFS mount, the NFS dirty pages may
exceed the global dirty limit and _every_ task involving writing will be
blocked. The whole system appears unresponsive.

The workaround is to permit through the bdi's that only has a small
number of dirty pages. The number chosen (8 pages) is not enough to
enable the local disk to run in optimal throughput, but is enough to
make the system responsive on a broken NFS mount. The user can then
kill the dirtiers on the NFS mount and increase the global dirty limit
to bring up the local disk's throughput.

It risks allowing dirty pages to grow beyond 80000 (320MB) when there
are 10000 mounts, however that's very unlikely.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    8 ++++++++
 1 file changed, 8 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-12-02 10:16:21.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-02 14:28:44.000000000 +0800
@@ -1182,6 +1182,14 @@ pause:
 		if (task_ratelimit)
 			break;

+		/*
+		 * In the case of an unresponding NFS server and the NFS dirty
+		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
+		 * to go through, so that tasks on them still remain responsive.
+		 */
+		if (bdi_dirty < 8)
+			break;
+
 		if (fatal_signal_pending(current))
 			break;
 	}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] writeback: permit through good bdi even when global dirty exceeded
  2011-12-02  6:36     ` [PATCH] writeback: permit through good bdi even when global dirty exceeded Wu Fengguang
@ 2011-12-02  7:03       ` Andrew Morton
  2011-12-02  8:29         ` Wu Fengguang
  0 siblings, 1 reply; 11+ messages in thread
From: Andrew Morton @ 2011-12-02  7:03 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Matthew Wilcox, Jan Kara, LKML, linux-fsdevel@vger.kernel.org,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig

On Fri, 2 Dec 2011 14:36:03 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> --- linux-next.orig/mm/page-writeback.c	2011-12-02 10:16:21.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-12-02 14:28:44.000000000 +0800
> @@ -1182,6 +1182,14 @@ pause:
>  		if (task_ratelimit)
>  			break;
>  
> +		/*
> +		 * In the case of an unresponding NFS server and the NFS dirty
> +		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
> +		 * to go through, so that tasks on them still remain responsive.
> +		 */
> +		if (bdi_dirty < 8)
> +			break;

What happens if the local disk has nine dirty pages?

Also: please, no more magic numbers.  We have too many in there already.

What to do instead?  Perhaps arrange for devices which can block in
this fashion to be identified as such in their backing_device and then
prevent the kernel from ever permitting such devices to fully consume
the dirty-page pool.

If someone later comes along and decreases the dirty limits mid-flight,
I guess the same problem occurs.  This can perhaps be handled by not
permitting to limit to be set that low at that time.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] writeback: permit through good bdi even when global dirty exceeded
  2011-12-02  7:03       ` Andrew Morton
@ 2011-12-02  8:29         ` Wu Fengguang
  2011-12-02 10:16           ` Wu Fengguang
  0 siblings, 1 reply; 11+ messages in thread
From: Wu Fengguang @ 2011-12-02  8:29 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Jan Kara, LKML, linux-fsdevel@vger.kernel.org,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig

On Fri, Dec 02, 2011 at 03:03:59PM +0800, Andrew Morton wrote:
> On Fri, 2 Dec 2011 14:36:03 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > --- linux-next.orig/mm/page-writeback.c	2011-12-02 10:16:21.000000000 +0800
> > +++ linux-next/mm/page-writeback.c	2011-12-02 14:28:44.000000000 +0800
> > @@ -1182,6 +1182,14 @@ pause:
> >  		if (task_ratelimit)
> >  			break;
> >  
> > +		/*
> > +		 * In the case of an unresponding NFS server and the NFS dirty
> > +		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
> > +		 * to go through, so that tasks on them still remain responsive.
> > +		 */
> > +		if (bdi_dirty < 8)
> > +			break;
> 
> What happens if the local disk has nine dirty pages?

The 9 dirty pages will be cleaned by the flusher (likely in one shot),
so after a while the dirtier task can dirty 8 pages more. This
consumer-producer work flow can keep going on as long as the magic
number chosen is >= 1.

> Also: please, no more magic numbers.  We have too many in there already.

Good point. Let's add some comment on the number chosen?

> What to do instead?  Perhaps arrange for devices which can block in
> this fashion to be identified as such in their backing_device and then
> prevent the kernel from ever permitting such devices to fully consume
> the dirty-page pool.

Yeah, that's considered too, unfortunately it's not as simple and
elegant than the proposed patch. For example, if giving all NFS mounts
the same "lowered" limit, there is still the problem that when one NFS
mount goes broken, the other NFS mounts are all impacted.

> If someone later comes along and decreases the dirty limits mid-flight,
> I guess the same problem occurs.  This can perhaps be handled by not
> permitting to limit to be set that low at that time.

Yes! Not long ago we introduced @global_dirty_limit and
update_dirty_limit() exactly for fixing that case. The comment says:

/*
 * The global dirtyable memory and dirty threshold could be suddenly knocked
 * down by a large amount (eg. on the startup of KVM in a swapless system).
 * This may throw the system into deep dirty exceeded state and throttle
 * heavy/light dirtiers alike. To retain good responsiveness, maintain
 * global_dirty_limit for tracking slowly down to the knocked down dirty
 * threshold.
 */
static void update_dirty_limit(unsigned long thresh, unsigned long dirty)
{       
...

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] writeback: permit through good bdi even when global dirty exceeded
  2011-12-02  8:29         ` Wu Fengguang
@ 2011-12-02 10:16           ` Wu Fengguang
  2011-12-02 10:28             ` Wu Fengguang
  0 siblings, 1 reply; 11+ messages in thread
From: Wu Fengguang @ 2011-12-02 10:16 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Jan Kara, LKML, linux-fsdevel@vger.kernel.org,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig

On Fri, Dec 02, 2011 at 04:29:50PM +0800, Wu Fengguang wrote:
> On Fri, Dec 02, 2011 at 03:03:59PM +0800, Andrew Morton wrote:
> > On Fri, 2 Dec 2011 14:36:03 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > --- linux-next.orig/mm/page-writeback.c	2011-12-02 10:16:21.000000000 +0800
> > > +++ linux-next/mm/page-writeback.c	2011-12-02 14:28:44.000000000 +0800
> > > @@ -1182,6 +1182,14 @@ pause:
> > >  		if (task_ratelimit)
> > >  			break;
> > >  
> > > +		/*
> > > +		 * In the case of an unresponding NFS server and the NFS dirty
> > > +		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
> > > +		 * to go through, so that tasks on them still remain responsive.
> > > +		 */
> > > +		if (bdi_dirty < 8)
> > > +			break;
> > 
> > What happens if the local disk has nine dirty pages?
> 
> The 9 dirty pages will be cleaned by the flusher (likely in one shot),
> so after a while the dirtier task can dirty 8 pages more. This
> consumer-producer work flow can keep going on as long as the magic
> number chosen is >= 1.
> 
> > Also: please, no more magic numbers.  We have too many in there already.
> 
> Good point. Let's add some comment on the number chosen?

I did a dd test to the local disk (when w/ a stalled NFS mount) and
find that it always idle for several seconds before making a little
progress. It can be confirmed from the trace that the bdi_dirty
remains 8 even when the flusher has done its work.

So the number is lifted to bdi_stat_error to cover the errors in
bdi_dirty. Here goes the updated patch.

---
Subject: writeback: permit through good bdi even when global dirty exceeded
Date: Fri Dec 02 10:21:33 CST 2011

On a system with 1 local mount and 1 NFS mount, if the NFS server
becomes not responding when dd to the NFS mount, the NFS dirty pages may
exceed the global dirty limit and _every_ task involving writing will be
blocked. The whole system appears unresponsive.

The workaround is to permit through the bdi's that only has a small
number of dirty pages. The number chosen (bdi_stat_error pages) is not
enough to enable the local disk to run in optimal throughput, but is
enough to make the system responsive on a broken NFS mount. The user can
then kill the dirtiers on the NFS mount and increase the global dirty
limit to bring up the local disk's throughput.

It risks allowing dirty pages to grow much larger than the global dirty
limit when there are 1000+ mounts, however that's very unlikely to happen,
especially in low memory profiles.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |   13 +++++++++++++
 1 file changed, 13 insertions(+)

--- linux-next.orig/mm/page-writeback.c	2011-12-02 17:01:09.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-02 18:07:01.000000000 +0800
@@ -1182,6 +1182,19 @@ pause:
 		if (task_ratelimit)
 			break;

+		/*
+		 * In the case of an unresponding NFS server and the NFS dirty
+		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
+		 * to go through, so that tasks on them still remain responsive.
+		 *
+		 * In theory 1 page is enough to keep the consumer-producer
+		 * pipe going: the flusher cleans 1 page => the task dirties 1
+		 * more page. However bdi_dirty has accounting errors.  So use
+		 * the larger and more IO friendly bdi_stat_error.
+		 */
+		if (bdi_dirty < bdi_stat_error(bdi))
+			break;
+
 		if (fatal_signal_pending(current))
 			break;
 	}

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] writeback: permit through good bdi even when global dirty exceeded
  2011-12-02 10:16           ` Wu Fengguang
@ 2011-12-02 10:28             ` Wu Fengguang
  0 siblings, 0 replies; 11+ messages in thread
From: Wu Fengguang @ 2011-12-02 10:28 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Matthew Wilcox, Jan Kara, LKML, linux-fsdevel@vger.kernel.org,
	Linus Torvalds, Theodore Ts'o, Christoph Hellwig

On Fri, Dec 02, 2011 at 06:16:06PM +0800, Wu Fengguang wrote:
> On Fri, Dec 02, 2011 at 04:29:50PM +0800, Wu Fengguang wrote:
> > On Fri, Dec 02, 2011 at 03:03:59PM +0800, Andrew Morton wrote:
> > > On Fri, 2 Dec 2011 14:36:03 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > --- linux-next.orig/mm/page-writeback.c	2011-12-02 10:16:21.000000000 +0800
> > > > +++ linux-next/mm/page-writeback.c	2011-12-02 14:28:44.000000000 +0800
> > > > @@ -1182,6 +1182,14 @@ pause:
> > > >  		if (task_ratelimit)
> > > >  			break;
> > > >  
> > > > +		/*
> > > > +		 * In the case of an unresponding NFS server and the NFS dirty
> > > > +		 * pages exceeds dirty_thresh, give the other good bdi's a pipe
> > > > +		 * to go through, so that tasks on them still remain responsive.
> > > > +		 */
> > > > +		if (bdi_dirty < 8)
> > > > +			break;
> > > 
> > > What happens if the local disk has nine dirty pages?
> > 
> > The 9 dirty pages will be cleaned by the flusher (likely in one shot),
> > so after a while the dirtier task can dirty 8 pages more. This
> > consumer-producer work flow can keep going on as long as the magic
> > number chosen is >= 1.
> > 
> > > Also: please, no more magic numbers.  We have too many in there already.
> > 
> > Good point. Let's add some comment on the number chosen?
> 
> I did a dd test to the local disk (when w/ a stalled NFS mount) and
> find that it always idle for several seconds before making a little
> progress. It can be confirmed from the trace that the bdi_dirty
> remains 8 even when the flusher has done its work.
> 
> So the number is lifted to bdi_stat_error to cover the errors in
> bdi_dirty. Here goes the updated patch.

The new trace now shows bdi_dirty=0 in the _majority_ lines. But in
fact it's some small value. In this case the max pause time should
really be set to the smallest non-zero value to avoid IO queue underrun
and improve throughput.

So here comes one more fix.

--- linux-next.orig/mm/page-writeback.c	2011-12-02 18:20:14.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-12-02 18:20:27.000000000 +0800
@@ -989,18 +989,17 @@ static unsigned long bdi_max_pause(struc
 
 	/*
 	 * Limit pause time for small memory systems. If sleeping for too long
 	 * time, a small pool of dirty/writeback pages may go empty and disk go
 	 * idle.
 	 *
 	 * 8 serves as the safety ratio.
 	 */
-	if (bdi_dirty)
-		t = min(t, bdi_dirty * HZ / (8 * bw + 1));
+	t = min(t, bdi_dirty * HZ / (8 * bw + 1));
 
 	/*
 	 * The pause time will be settled within range (max_pause/4, max_pause).
 	 * Apply a minimal value of 4 to get a non-zero max_pause/4.
 	 */
 	return clamp_val(t, 4, MAX_PAUSE);
 }
 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH] fs: Make write(2) interruptible by a fatal signal
  2011-12-01 16:10     ` Linus Torvalds
@ 2011-12-02 11:58       ` Janne Blomqvist
  0 siblings, 0 replies; 11+ messages in thread
From: Janne Blomqvist @ 2011-12-02 11:58 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Matthew Wilcox, Wu Fengguang, Jan Kara, LKML,
	linux-fsdevel@vger.kernel.org, Andrew Morton, Theodore Ts'o

On Thu, Dec 1, 2011 at 18:10, Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Thu, Dec 1, 2011 at 6:27 AM, Matthew Wilcox <matthew@wil.cx> wrote:
>>
>> Another problem scenario is an NFS mounted file going away while the
>> user is writing to it.  The user should be able to kill the stuck process
>> without rebooting their machine.
>
> Well, NFS has always had the 'intr' mount option.

For some values of always; My nfs(5) manpage says:

"The intr / nointr mount option is deprecated after kernel 2.6.25.
Only SIGKILL can interrupt a pending NFS operation on these  kernels,
and
if specified, this mount option is ignored to provide backwards
compatibility with older kernels."

(Apparently this was related to the introduction of TASK_KILLABLE.)

Isn't this pretty much the same "common sense" semantics that Jan's
patch is introducing?

Wu's testing in this thread suggests that at some point this
TASK_KILLABLE for nfs writer thing was broken, or didn't work very
robustly to begin with? Anyway, awesome if it's getting fixed, not
only for NFS but for all regular files!

-- 
Janne Blomqvist

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2011-12-02 11:58 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-12-01 10:27 [PATCH] fs: Make write(2) interruptible by a fatal signal Jan Kara
2011-12-01 12:24 ` Wu Fengguang
2011-12-01 14:27   ` Matthew Wilcox
2011-12-01 16:10     ` Linus Torvalds
2011-12-02 11:58       ` Janne Blomqvist
2011-12-02  2:05     ` Wu Fengguang
2011-12-02  6:36     ` [PATCH] writeback: permit through good bdi even when global dirty exceeded Wu Fengguang
2011-12-02  7:03       ` Andrew Morton
2011-12-02  8:29         ` Wu Fengguang
2011-12-02 10:16           ` Wu Fengguang
2011-12-02 10:28             ` Wu Fengguang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).