[PATCH 0/2 v3] Make task in balance_dirty

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/2 v3] Make task in balance_dirty_pages() killable
@ 2011-11-16 11:12 Jan Kara
  2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
                   ` (2 more replies)
  0 siblings, 3 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-16 11:12 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel

  Hello,

  since the discussion has calmed down, here is a third iteration of the patches.
Fengguang, can you put them into your tree and merge them with Linus? Thanks.

Changes since v2:
* removed return value of balance_dirty_pages() since it didn't seem that
  useful after all.
* kept fatal_signal_pending() check in generic_perform_write() since I don't think
  possible partial writes when app receives SIGKILL are a problem.

Changes since v1:
* slightly moved the check in balance_dirty_pages() as Fengguang requested
* made balance_dirty_pages() return EINTR if fatal signal was detected
* changed check for signal to check for fatal signal in generic_perform_write()
  to avoid unexpected results for userspace applications.

									Honza

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
@ 2011-11-16 11:12 ` Jan Kara
  2011-11-16 11:28   ` Wu Fengguang
  2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
  2011-11-16 11:23 ` [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Wu Fengguang
  2 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-16 11:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel,
	Jan Kara

There is no reason why task in balance_dirty_pages() shouldn't be killable
and it helps in recovering from some error conditions (like when filesystem
goes in error state and cannot accept writeback anymore but we still want to
kill processes using it to be able to unmount it).

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0360d1b..39f5805 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1133,7 +1133,7 @@ pause:
 					  pages_dirtied,
 					  pause,
 					  start_time);
-		__set_current_state(TASK_UNINTERRUPTIBLE);
+		__set_current_state(TASK_KILLABLE);
 		io_schedule_timeout(pause);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
@@ -1145,6 +1145,9 @@ pause:
 		 */
 		if (nr_dirty < dirty_thresh)
 			break;
+
+		if (fatal_signal_pending(current))
+			break;
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
@ 2011-11-16 11:28   ` Wu Fengguang
  2011-11-16 12:58     ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Wu Fengguang @ 2011-11-16 11:28 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Neil Brown, KOSAKI Motohiro,
	Anton Altaparmakov

On Wed, Nov 16, 2011 at 07:12:14PM +0800, Jan Kara wrote:
> There is no reason why task in balance_dirty_pages() shouldn't be killable
> and it helps in recovering from some error conditions (like when filesystem
> goes in error state and cannot accept writeback anymore but we still want to
> kill processes using it to be able to unmount it).

Here I'd like to add a paragraph from my original patch:

The follow up patches will further abort the generic_perform_write() and
other filesystem write loops, to avoid large write + SIGKILL combination
exceeding the dirty limit and possibly strange OOM.

Thanks,
Fengguang

> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  mm/page-writeback.c |    5 ++++-
>  1 files changed, 4 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0360d1b..39f5805 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1133,7 +1133,7 @@ pause:
>  					  pages_dirtied,
>  					  pause,
>  					  start_time);
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		__set_current_state(TASK_KILLABLE);
>  		io_schedule_timeout(pause);
>  
>  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> @@ -1145,6 +1145,9 @@ pause:
>  		 */
>  		if (nr_dirty < dirty_thresh)
>  			break;
> +
> +		if (fatal_signal_pending(current))
> +			break;
>  	}
>  
>  	if (!dirty_exceeded && bdi->dirty_exceeded)
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-16 11:28   ` Wu Fengguang
@ 2011-11-16 12:58     ` Jan Kara
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-16 12:58 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Neil Brown, KOSAKI Motohiro,
	Anton Altaparmakov

On Wed 16-11-11 19:28:19, Wu Fengguang wrote:
> On Wed, Nov 16, 2011 at 07:12:14PM +0800, Jan Kara wrote:
> > There is no reason why task in balance_dirty_pages() shouldn't be killable
> > and it helps in recovering from some error conditions (like when filesystem
> > goes in error state and cannot accept writeback anymore but we still want to
> > kill processes using it to be able to unmount it).
> 
> Here I'd like to add a paragraph from my original patch:
> 
> The follow up patches will further abort the generic_perform_write() and
> other filesystem write loops, to avoid large write + SIGKILL combination
> exceeding the dirty limit and possibly strange OOM.
  Sure, feel free to add the paragraph to the changelog.

									Honza

> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  mm/page-writeback.c |    5 ++++-
> >  1 files changed, 4 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 0360d1b..39f5805 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1133,7 +1133,7 @@ pause:
> >  					  pages_dirtied,
> >  					  pause,
> >  					  start_time);
> > -		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		__set_current_state(TASK_KILLABLE);
> >  		io_schedule_timeout(pause);
> >  
> >  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> > @@ -1145,6 +1145,9 @@ pause:
> >  		 */
> >  		if (nr_dirty < dirty_thresh)
> >  			break;
> > +
> > +		if (fatal_signal_pending(current))
> > +			break;
> >  	}
> >  
> >  	if (!dirty_exceeded && bdi->dirty_exceeded)
> > -- 
> > 1.7.1
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
  2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
@ 2011-11-16 11:12 ` Jan Kara
  2011-11-16 11:44   ` Wu Fengguang
  2011-11-16 11:23 ` [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Wu Fengguang
  2 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-16 11:12 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel,
	Jan Kara

Currently write(2) to a file is not interruptible by a signal. Sometimes this
is desirable (e.g. when you want to quickly kill a process hogging your disk or
when some process gets blocked in balance_dirty_pages() indefinitely due to a
filesystem being in an error condition).

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/filemap.c |    6 ++++--
 1 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..c106d3b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
 						iov_iter_count(i));
 
 again:
-
 		/*
 		 * Bring in the user page that we will copy from _first_.
 		 * Otherwise there's a nasty deadlock on copying from the
@@ -2463,7 +2462,10 @@ again:
 		written += copied;
 
 		balance_dirty_pages_ratelimited(mapping);
-
+		if (fatal_signal_pending(current)) {
+			status = -EINTR;
+			break;
+		}
 	} while (iov_iter_count(i));
 
 	return written ? written : status;
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
@ 2011-11-16 11:44   ` Wu Fengguang
  2011-11-16 12:54     ` Jan Kara
  2011-11-22 22:28     ` Andrew Morton
  0 siblings, 2 replies; 34+ messages in thread
From: Wu Fengguang @ 2011-11-16 11:44 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

Jan,

Due to the (very low) possibility of data loss by partial writes, IMHO
it would safer to test this patch in linux-next until next merge window,
would you agree?

Pushing the first patch will address the main problem, anyway.

Thanks,
Fengguang

On Wed, Nov 16, 2011 at 07:12:15PM +0800, Jan Kara wrote:
> Currently write(2) to a file is not interruptible by a signal. Sometimes this
> is desirable (e.g. when you want to quickly kill a process hogging your disk or
> when some process gets blocked in balance_dirty_pages() indefinitely due to a
> filesystem being in an error condition).
> 
> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  mm/filemap.c |    6 ++++--
>  1 files changed, 4 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c0018f2..c106d3b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
>  						iov_iter_count(i));
>  
>  again:
> -
>  		/*
>  		 * Bring in the user page that we will copy from _first_.
>  		 * Otherwise there's a nasty deadlock on copying from the
> @@ -2463,7 +2462,10 @@ again:
>  		written += copied;
>  
>  		balance_dirty_pages_ratelimited(mapping);
> -
> +		if (fatal_signal_pending(current)) {
> +			status = -EINTR;
> +			break;
> +		}
>  	} while (iov_iter_count(i));
>  
>  	return written ? written : status;
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-16 11:44   ` Wu Fengguang
@ 2011-11-16 12:54     ` Jan Kara
  2011-11-16 13:11       ` Wu Fengguang
  2011-11-22 22:28     ` Andrew Morton
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-16 12:54 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed 16-11-11 19:44:21, Wu Fengguang wrote:
> Jan,
> 
> Due to the (very low) possibility of data loss by partial writes, IMHO
> it would safer to test this patch in linux-next until next merge window,
> would you agree?
  Fine with me. Thanks.

> Pushing the first patch will address the main problem, anyway.
  Hopefully, yes.

									Honza

> On Wed, Nov 16, 2011 at 07:12:15PM +0800, Jan Kara wrote:
> > Currently write(2) to a file is not interruptible by a signal. Sometimes this
> > is desirable (e.g. when you want to quickly kill a process hogging your disk or
> > when some process gets blocked in balance_dirty_pages() indefinitely due to a
> > filesystem being in an error condition).
> > 
> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  mm/filemap.c |    6 ++++--
> >  1 files changed, 4 insertions(+), 2 deletions(-)
> > 
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c0018f2..c106d3b 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> >  						iov_iter_count(i));
> >  
> >  again:
> > -
> >  		/*
> >  		 * Bring in the user page that we will copy from _first_.
> >  		 * Otherwise there's a nasty deadlock on copying from the
> > @@ -2463,7 +2462,10 @@ again:
> >  		written += copied;
> >  
> >  		balance_dirty_pages_ratelimited(mapping);
> > -
> > +		if (fatal_signal_pending(current)) {
> > +			status = -EINTR;
> > +			break;
> > +		}
> >  	} while (iov_iter_count(i));
> >  
> >  	return written ? written : status;
> > -- 
> > 1.7.1
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-16 12:54     ` Jan Kara
@ 2011-11-16 13:11       ` Wu Fengguang
  0 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2011-11-16 13:11 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Kazuya Mio

On Wed, Nov 16, 2011 at 08:54:45PM +0800, Jan Kara wrote:
> On Wed 16-11-11 19:44:21, Wu Fengguang wrote:
> > Jan,
> > 
> > Due to the (very low) possibility of data loss by partial writes, IMHO
> > it would safer to test this patch in linux-next until next merge window,
> > would you agree?
>   Fine with me. Thanks.

Great. The two patches are now in:

http://git.kernel.org/?p=linux/kernel/git/wfg/linux.git;a=shortlog;h=refs/heads/writeback-for-next

When applying the patch I changed the title "... by a signal" to
"...  by SIGKILL" to reflect the updated patch content. Hopefully
this is also what's in your mind.

> > Pushing the first patch will address the main problem, anyway.
>   Hopefully, yes.

I looked over Kazuya's posts in linux-ext4 and think the first patch
alone does have good chance address the problem of busy looping in
balance_dirty_pages() due to dirty pages never drop under fs error
conditions. 

Thanks,
Fengguang

> > On Wed, Nov 16, 2011 at 07:12:15PM +0800, Jan Kara wrote:
> > > Currently write(2) to a file is not interruptible by a signal. Sometimes this
> > > is desirable (e.g. when you want to quickly kill a process hogging your disk or
> > > when some process gets blocked in balance_dirty_pages() indefinitely due to a
> > > filesystem being in an error condition).
> > > 
> > > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > >  mm/filemap.c |    6 ++++--
> > >  1 files changed, 4 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index c0018f2..c106d3b 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> > >  						iov_iter_count(i));
> > >  
> > >  again:
> > > -
> > >  		/*
> > >  		 * Bring in the user page that we will copy from _first_.
> > >  		 * Otherwise there's a nasty deadlock on copying from the
> > > @@ -2463,7 +2462,10 @@ again:
> > >  		written += copied;
> > >  
> > >  		balance_dirty_pages_ratelimited(mapping);
> > > -
> > > +		if (fatal_signal_pending(current)) {
> > > +			status = -EINTR;
> > > +			break;
> > > +		}
> > >  	} while (iov_iter_count(i));
> > >  
> > >  	return written ? written : status;
> > > -- 
> > > 1.7.1
> -- 
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-16 11:44   ` Wu Fengguang
  2011-11-16 12:54     ` Jan Kara
@ 2011-11-22 22:28     ` Andrew Morton
  2011-11-23  9:05       ` Wu Fengguang
  1 sibling, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2011-11-22 22:28 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed, 16 Nov 2011 19:44:21 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:

> Due to the (very low) possibility of data loss by partial writes, IMHO
> it would safer to test this patch in linux-next until next merge window,

Any such bugs will not be discovered in linux-next testing.

The only way to find these things in a reasonable period of time is to
go in and find them.  For example, intensive fsx-linux testing with
concurrent heavy memory pressure on various filesystems with various
block sizes.  And of course concurrent signalling.  If you're talking
about O_DIRECT then iirc I hacked support for that into fsx-linux.  I
think.

Anyway, what _are_ the scenarios in which we think data can be lost?

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-22 22:28     ` Andrew Morton
@ 2011-11-23  9:05       ` Wu Fengguang
  2011-11-23  9:50         ` Andrew Morton
  2011-11-23 13:08         ` [PATCH 2/2] fs: " Jan Kara
  0 siblings, 2 replies; 34+ messages in thread
From: Wu Fengguang @ 2011-11-23  9:05 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> On Wed, 16 Nov 2011 19:44:21 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
> 
> > Due to the (very low) possibility of data loss by partial writes, IMHO
> > it would safer to test this patch in linux-next until next merge window,
> 
> Any such bugs will not be discovered in linux-next testing.

Yup, I'm afraid.

> The only way to find these things in a reasonable period of time is to
> go in and find them.  For example, intensive fsx-linux testing with
> concurrent heavy memory pressure on various filesystems with various
> block sizes.  And of course concurrent signalling.  If you're talking
> about O_DIRECT then iirc I hacked support for that into fsx-linux.  I
> think.

How are we going to measure the success/failure? Check if it
eventually resulted in filesystem corruption or whatever?

When received SIGKILL, fsx-linux itself will just die.

> Anyway, what _are_ the scenarios in which we think data can be lost?

It's the vision that there may be partial writes on SIGKILL. Before
patch, the write will either succeed as a whole or not started at
all, depending on the timing of write/SIGKILL. This is kind of atomic
operation. However now the write can be half done.

If the application really cares about atomic behavior, it can do
create-and-rename. However there are always the possibility of broken
applications.

Maybe this is not that big problem as SIGKILL is considered be to
destructive already.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-23  9:05       ` Wu Fengguang
@ 2011-11-23  9:50         ` Andrew Morton
  2011-11-23 12:27           ` [PATCH 2/2] " Theodore Tso
  2011-11-23 13:08         ` [PATCH 2/2] fs: " Jan Kara
  1 sibling, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2011-11-23  9:50 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed, 23 Nov 2011 17:05:33 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:

> On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> > On Wed, 16 Nov 2011 19:44:21 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Due to the (very low) possibility of data loss by partial writes, IMHO
> > > it would safer to test this patch in linux-next until next merge window,
> > 
> > Any such bugs will not be discovered in linux-next testing.
> 
> Yup, I'm afraid.
> 
> > The only way to find these things in a reasonable period of time is to
> > go in and find them.  For example, intensive fsx-linux testing with
> > concurrent heavy memory pressure on various filesystems with various
> > block sizes.  And of course concurrent signalling.  If you're talking
> > about O_DIRECT then iirc I hacked support for that into fsx-linux.  I
> > think.
> 
> How are we going to measure the success/failure? Check if it
> eventually resulted in filesystem corruption or whatever?

yup.

> When received SIGKILL, fsx-linux itself will just die.

Well, there are ways of simulating its effect.  For example, bale out
of the write() every seventh time if current->comm=="fsx-linux".  Or
set a rearming timer which triggers a baled-out write.  You'll work it
out ;)

> > Anyway, what _are_ the scenarios in which we think data can be lost?
> 
> It's the vision that there may be partial writes on SIGKILL. Before
> patch, the write will either succeed as a whole or not started at
> all, depending on the timing of write/SIGKILL. This is kind of atomic
> operation. However now the write can be half done.
> 
> If the application really cares about atomic behavior, it can do
> create-and-rename. However there are always the possibility of broken
> applications.
> 
> Maybe this is not that big problem as SIGKILL is considered be to
> destructive already.

Yeah, I have dim dark memories that there are subtle problems with
interrupting write().  Linus might remember.

Others might remember too, but we're only talking to fsdevel which
nobody reads :(

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] Make write(2) interruptible by a signal
  2011-11-23  9:50         ` Andrew Morton
@ 2011-11-23 12:27           ` Theodore Tso
  2011-11-23 20:29             ` Andrew Morton
  0 siblings, 1 reply; 34+ messages in thread
From: Theodore Tso @ 2011-11-23 12:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Nov 23, 2011, at 4:50 AM, Andrew Morton wrote:

>> 
>> If the application really cares about atomic behavior, it can do
>> create-and-rename. However there are always the possibility of broken
>> applications.

Heh.   Application programmers whine a lot about not doing create-and-rename already (and we had to introduce a work-around since so many programs assume close() implies fsync(). 

>> 
>> Maybe this is not that big problem as SIGKILL is considered be to
>> destructive already.
> 
> Yeah, I have dim dark memories that there are subtle problems with
> interrupting write().  Linus might remember.

The big one is that you're lucky if application programmers check the return values of write(2), and if they do, it's likely they will only check for error returns and not necessarily for partial writes --- since no other Unix-like or Linux-like system has ever returned partial reads or writes for files except in error conditions.    We've barely gotten them trained to check for partial writes and reads with TCP connections and character devices, but I wouldn't bet on application programmers getting things right for files.

Still, if it's ***only*** for SIGKILL, we'll probably be OK, since for that one case there's no chance userspace can intercept the signal, so it can't do any recovery anyway.   (I could imagine some HPC program doing a massive 2GB write, and some user of that program depending on the fact that he can kill it at a predefined place by sending a SIGKILL and knowing that the file would be written up to that 2GB chunk --- but that's clearly an edge situation, as opposed to something that would effect most GNOME and KDE apps.)   We just need to make sure we never try to do this for any other signal that could be caught, such as SIGINT or SIGQUIT or (worse yet) SIGTSTP.

-- Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] Make write(2) interruptible by a signal
  2011-11-23 12:27           ` [PATCH 2/2] " Theodore Tso
@ 2011-11-23 20:29             ` Andrew Morton
  2011-11-24 19:27               ` Matthew Wilcox
  0 siblings, 1 reply; 34+ messages in thread
From: Andrew Morton @ 2011-11-23 20:29 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed, 23 Nov 2011 07:27:43 -0500
Theodore Tso <tytso@mit.edu> wrote:

> >> 
> >> Maybe this is not that big problem as SIGKILL is considered be to
> >> destructive already.
> > 
> > Yeah, I have dim dark memories that there are subtle problems with
> > interrupting write().  Linus might remember.

(err, you're sending 600-column emails)

> The big one is that you're lucky if application programmers check the
> return values of write(2), and if they do, it's likely they will only
> check for error returns and not necessarily for partial writes ---
> since no other Unix-like or Linux-like system has ever returned partial
> reads or writes for files except in error conditions.  We've barely
> gotten them trained to check for partial writes and reads with TCP
> connections and character devices, but I wouldn't bet on application
> programmers getting things right for files.
> 
> Still, if it's ***only*** for SIGKILL, we'll probably be OK, since
> for that one case there's no chance userspace can intercept the signal,
> so it can't do any recovery anyway.  (I could imagine some HPC program
> doing a massive 2GB write, and some user of that program depending on
> the fact that he can kill it at a predefined place by sending a SIGKILL
> and knowing that the file would be written up to that 2GB chunk --- but
> that's clearly an edge situation, as opposed to something that would
> effect most GNOME and KDE apps.) We just need to make sure we never try
> to do this for any other signal that could be caught, such as SIGINT or
> SIGQUIT or (worse yet) SIGTSTP.

That it is a fatal SIGKILL means that the *current* application doesn't
care.  But other processes will sometimes notice this change. 
Previously if an app did write(file, 128k) and was hit with SIGKILL, it
would write either 0 bytes or 128k bytes.  Now, it can write 36k bytes,
yes?  If the target file consisted of a stream of 128k records then the
user will claim, with some justification, that Linux corrupted it.

Dunno.  People do lots of weird and flakey things.  I have a suspicion
that we'll be hearing back from them about this change.

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] Make write(2) interruptible by a signal
  2011-11-23 20:29             ` Andrew Morton
@ 2011-11-24 19:27               ` Matthew Wilcox
  2011-11-24 20:53                 ` Ted Ts'o
  2011-11-24 20:53                 ` Jan Kara
  0 siblings, 2 replies; 34+ messages in thread
From: Matthew Wilcox @ 2011-11-24 19:27 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Theodore Tso, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed, Nov 23, 2011 at 12:29:48PM -0800, Andrew Morton wrote:
> > Still, if it's ***only*** for SIGKILL, we'll probably be OK, since
> > for that one case there's no chance userspace can intercept the signal,
> > so it can't do any recovery anyway.  (I could imagine some HPC program
> > doing a massive 2GB write, and some user of that program depending on
> > the fact that he can kill it at a predefined place by sending a SIGKILL
> > and knowing that the file would be written up to that 2GB chunk --- but
> > that's clearly an edge situation, as opposed to something that would
> > effect most GNOME and KDE apps.) We just need to make sure we never try
> > to do this for any other signal that could be caught, such as SIGINT or
> > SIGQUIT or (worse yet) SIGTSTP.
> 
> That it is a fatal SIGKILL means that the *current* application doesn't
> care.  But other processes will sometimes notice this change. 
> Previously if an app did write(file, 128k) and was hit with SIGKILL, it
> would write either 0 bytes or 128k bytes.  Now, it can write 36k bytes,
> yes?  If the target file consisted of a stream of 128k records then the
> user will claim, with some justification, that Linux corrupted it.

On the other hand, if there was a crash mid-write, they might also get a
36k write that actually hit media (right?  Or do we guarantee that on
reboot you see a multiple of 128k?)

We could put in some nice code that rewinds i_size to where it used to
be if the write was interrupted.  Or do we need to?  Presumably write_end
would not have been called, so i_size would not have been updated.

> Dunno.  People do lots of weird and flakey things.  I have a suspicion
> that we'll be hearing back from them about this change.

The problem is that we may not hear from them for 6 years ... or whenever
they decide to move off RHEL 3.

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] Make write(2) interruptible by a signal
  2011-11-24 19:27               ` Matthew Wilcox
@ 2011-11-24 20:53                 ` Ted Ts'o
  2011-11-25  0:10                   ` Matthew Wilcox
  2011-11-24 20:53                 ` Jan Kara
  1 sibling, 1 reply; 34+ messages in thread
From: Ted Ts'o @ 2011-11-24 20:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Thu, Nov 24, 2011 at 12:27:11PM -0700, Matthew Wilcox wrote:
> 
> On the other hand, if there was a crash mid-write, they might also get a
> 36k write that actually hit media (right?  Or do we guarantee that on
> reboot you see a multiple of 128k?)

Sure, but in the case the crash we expect things to be in a wonky
state.  The problem is if people assume atomic writes to files in a
non-crash case, which has been a traditional Unix/Linux "feature".
It's guaranteed by the standards as much a "close() implies fsync()",
but once application programmers start coding to such assumptions,
they refuse to admit they were wrong, and blame the kernel
programmers.

> > Dunno.  People do lots of weird and flakey things.  I have a suspicion
> > that we'll be hearing back from them about this change.
> 
> The problem is that we may not hear from them for 6 years ... or whenever
> they decide to move off RHEL 3.

Funny you mention RHEL 3.  I once had to fly on site to a customer
because their application programs depended on the order of addresses
handed back from mmap(), and that changed between 2.4 and 2.6, and
hence between RHEL 3 and RHEL 4 to debug the problem --- which turned
out to be caused in a change of an undocumented problem in mmap(2)...

(Well, it didn't break per se, but it turned an algorithm hidden
inside their app that had been O(n) to O(n**2), and that caused it to
slow down to the point that it was stalling long enough to make other
programs time out.  All the customer could tell us was "Og had
program.  Program worked on RHEL 3.  Broke on RHEL 4.  Og need
support." :-)

Still, I'm not too worried about those folks.  Those are the customers
who keep people who do advanced support at Red Hat and IBM employed. :-)
I'm worried about the problems that break thousands of users using
open source code --- and so long as we it only happens on SIGKILL, and
not on any other signal, it's _probably_ going to be ok.

       	   	 	      		       - Ted

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] Make write(2) interruptible by a signal
  2011-11-24 20:53                 ` Ted Ts'o
@ 2011-11-25  0:10                   ` Matthew Wilcox
  0 siblings, 0 replies; 34+ messages in thread
From: Matthew Wilcox @ 2011-11-25  0:10 UTC (permalink / raw)
  To: Ted Ts'o
  Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Thu, Nov 24, 2011 at 03:53:43PM -0500, Ted Ts'o wrote:
> On Thu, Nov 24, 2011 at 12:27:11PM -0700, Matthew Wilcox wrote:
> > On the other hand, if there was a crash mid-write, they might also get a
> > 36k write that actually hit media (right?  Or do we guarantee that on
> > reboot you see a multiple of 128k?)
> 
> Sure, but in the case the crash we expect things to be in a wonky
> state.  The problem is if people assume atomic writes to files in a
> non-crash case, which has been a traditional Unix/Linux "feature".
> It's guaranteed by the standards as much a "close() implies fsync()",
> but once application programmers start coding to such assumptions,
> they refuse to admit they were wrong, and blame the kernel
> programmers.

Sure, but resorting to kill -9 is almost the same as pushing the BRS.
Nobody's arguing in favour of non-fatal signals interrupting write()
[well, Honza was earlier, but we all talked him out of it].

-- 
Matthew Wilcox				Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours.  We can't possibly take such
a retrograde step."

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] Make write(2) interruptible by a signal
  2011-11-24 19:27               ` Matthew Wilcox
  2011-11-24 20:53                 ` Ted Ts'o
@ 2011-11-24 20:53                 ` Jan Kara
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-24 20:53 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Andrew Morton, Theodore Tso, Wu Fengguang, Jan Kara,
	Christoph Hellwig, Al Viro, linux-fsdevel@vger.kernel.org

On Thu 24-11-11 12:27:11, Matthew Wilcox wrote:
> On Wed, Nov 23, 2011 at 12:29:48PM -0800, Andrew Morton wrote:
> > > Still, if it's ***only*** for SIGKILL, we'll probably be OK, since
> > > for that one case there's no chance userspace can intercept the signal,
> > > so it can't do any recovery anyway.  (I could imagine some HPC program
> > > doing a massive 2GB write, and some user of that program depending on
> > > the fact that he can kill it at a predefined place by sending a SIGKILL
> > > and knowing that the file would be written up to that 2GB chunk --- but
> > > that's clearly an edge situation, as opposed to something that would
> > > effect most GNOME and KDE apps.) We just need to make sure we never try
> > > to do this for any other signal that could be caught, such as SIGINT or
> > > SIGQUIT or (worse yet) SIGTSTP.
> > 
> > That it is a fatal SIGKILL means that the *current* application doesn't
> > care.  But other processes will sometimes notice this change. 
> > Previously if an app did write(file, 128k) and was hit with SIGKILL, it
> > would write either 0 bytes or 128k bytes.  Now, it can write 36k bytes,
> > yes?  If the target file consisted of a stream of 128k records then the
> > user will claim, with some justification, that Linux corrupted it.
> 
> On the other hand, if there was a crash mid-write, they might also get a
> 36k write that actually hit media (right?  Or do we guarantee that on
> reboot you see a multiple of 128k?)
  They might see only 36k written for all filesystems I know...

> We could put in some nice code that rewinds i_size to where it used to
> be if the write was interrupted.  Or do we need to?  Presumably write_end
> would not have been called, so i_size would not have been updated.
  But that would solve only extending writes and not overwrites. So I'm not
sure it would be worth it...

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-23  9:05       ` Wu Fengguang
  2011-11-23  9:50         ` Andrew Morton
@ 2011-11-23 13:08         ` Jan Kara
  2011-11-23 13:27           ` Wu Fengguang
  1 sibling, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-23 13:08 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed 23-11-11 17:05:33, Wu Fengguang wrote:
> On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> > On Wed, 16 Nov 2011 19:44:21 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > 
> > > Due to the (very low) possibility of data loss by partial writes, IMHO
> > > it would safer to test this patch in linux-next until next merge window,
> > 
> > Any such bugs will not be discovered in linux-next testing.
> 
> Yup, I'm afraid.
> 
> > The only way to find these things in a reasonable period of time is to
> > go in and find them.  For example, intensive fsx-linux testing with
> > concurrent heavy memory pressure on various filesystems with various
> > block sizes.  And of course concurrent signalling.  If you're talking
> > about O_DIRECT then iirc I hacked support for that into fsx-linux.  I
> > think.
> 
> How are we going to measure the success/failure? Check if it
> eventually resulted in filesystem corruption or whatever?
  There are a few different questions:
  1) Checking for filesystem corruption via fsck - I find such corruption
caused by stopping write early extremely unlikely.
  2) Checking that we do not expose uninitialized data after a partial
(possibly DIRECT_IO) write - I did not find a place where that could happen
but this would be worth testing. I think I can write a test for this if
people are afraid of data exposure problems.
  3) Is it acceptable for write(2) to be interrupted by SIGKILL in the
middle? That obviously does happen with my patches so there's no reason
to test that. The question is whether someone cares or not and that can be
tested only by reality check :). Since the signal is SIGKILL, the process
itself cannot notice the interrupted write but someone else can. But as I
already said earlier, partial writes can already be observed when the
machine crashes, filesystem is close to ENOSPC or so. Arguably these are
more severe error conditions than application catching SIGKILL so my
patch lowers the bar for observing partial writes. But I wouldn't like to
throw away a sensible thing - allow SIGKILL to interrupt a system call -
just because of fear of possibility some broken app could rely on this.
Sure if the reality check shows there are such broken apps and users who
care enough to report, then I have nothing against biting the bullet
and reverting the change... Opinions?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-23 13:08         ` [PATCH 2/2] fs: " Jan Kara
@ 2011-11-23 13:27           ` Wu Fengguang
  2011-11-23 15:06             ` Theodore Tso
  0 siblings, 1 reply; 34+ messages in thread
From: Wu Fengguang @ 2011-11-23 13:27 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Theodore Ts'o

On Wed, Nov 23, 2011 at 09:08:03PM +0800, Jan Kara wrote:
> On Wed 23-11-11 17:05:33, Wu Fengguang wrote:
> > On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> > > On Wed, 16 Nov 2011 19:44:21 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > > 
> > > > Due to the (very low) possibility of data loss by partial writes, IMHO
> > > > it would safer to test this patch in linux-next until next merge window,
> > > 
> > > Any such bugs will not be discovered in linux-next testing.
> > 
> > Yup, I'm afraid.
> > 
> > > The only way to find these things in a reasonable period of time is to
> > > go in and find them.  For example, intensive fsx-linux testing with
> > > concurrent heavy memory pressure on various filesystems with various
> > > block sizes.  And of course concurrent signalling.  If you're talking
> > > about O_DIRECT then iirc I hacked support for that into fsx-linux.  I
> > > think.
> > 
> > How are we going to measure the success/failure? Check if it
> > eventually resulted in filesystem corruption or whatever?
>   There are a few different questions:
>   1) Checking for filesystem corruption via fsck - I find such corruption
> caused by stopping write early extremely unlikely.

Agreed.

>   2) Checking that we do not expose uninitialized data after a partial
> (possibly DIRECT_IO) write - I did not find a place where that could happen
> but this would be worth testing. I think I can write a test for this if
> people are afraid of data exposure problems.

Do we already have such kind of tests in xfstests? If not, it sounds
like a good gap to fill :-)

>   3) Is it acceptable for write(2) to be interrupted by SIGKILL in the
> middle? That obviously does happen with my patches so there's no reason
> to test that. The question is whether someone cares or not and that can be
> tested only by reality check :). Since the signal is SIGKILL, the process
> itself cannot notice the interrupted write but someone else can. But as I
> already said earlier, partial writes can already be observed when the
> machine crashes, filesystem is close to ENOSPC or so. Arguably these are
> more severe error conditions than application catching SIGKILL so my
> patch lowers the bar for observing partial writes. But I wouldn't like to
> throw away a sensible thing - allow SIGKILL to interrupt a system call -
> just because of fear of possibility some broken app could rely on this.
> Sure if the reality check shows there are such broken apps and users who
> care enough to report, then I have nothing against biting the bullet
> and reverting the change... Opinions?

Reading Ted's information feed, I tend to disregard the partial write
issue: since the "broken" applications will already fail and get
punished in various other cases, I don't care adding one more penalty
case to them :-P

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-23 13:27           ` Wu Fengguang
@ 2011-11-23 15:06             ` Theodore Tso
  2011-11-28  3:08               ` Wu Fengguang
  0 siblings, 1 reply; 34+ messages in thread
From: Theodore Tso @ 2011-11-23 15:06 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org


On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:

> 
> Reading Ted's information feed, I tend to disregard the partial write
> issue: since the "broken" applications will already fail and get
> punished in various other cases, I don't care adding one more penalty
> case to them :-P

Just wait until you have a bunch of rabid application programmers, questioning your judgement, your morality, and even your paternity.  :-)

-- Ted


^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-23 15:06             ` Theodore Tso
@ 2011-11-28  3:08               ` Wu Fengguang
  2011-11-28  3:33                 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
  2011-11-29 14:16                 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
  0 siblings, 2 replies; 34+ messages in thread
From: Wu Fengguang @ 2011-11-28  3:08 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
	Matthew Wilcox

On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> 
> On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> 
> > 
> > Reading Ted's information feed, I tend to disregard the partial write
> > issue: since the "broken" applications will already fail and get
> > punished in various other cases, I don't care adding one more penalty
> > case to them :-P
> 
> Just wait until you have a bunch of rabid application programmers,
> questioning your judgement, your morality, and even your paternity.
> :-)

Ah OK, that sounds frightening. Hmm, till now every one have
acknowledged the possibility of data corruption, only that
people have different perceptions of the severeness.

Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
I'm afraid not. Considering the origin of this patch

[BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
http://www.spinics.net/lists/linux-ext4/msg28464.html

I *think* Jan's first patch is already enough for fixing the bug. IWO
the patch we worried/discussed so much is really an optional one. I
would imagine the easy and safe solution is to just drop it. Any
objections?

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH] writeback: add a safety limit to the SIGKILL abort
  2011-11-28  3:08               ` Wu Fengguang
@ 2011-11-28  3:33                 ` Wu Fengguang
  2011-11-29 14:18                   ` Jan Kara
  2011-11-29 14:16                 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
  1 sibling, 1 reply; 34+ messages in thread
From: Wu Fengguang @ 2011-11-28  3:33 UTC (permalink / raw)
  To: Theodore Tso
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
	Matthew Wilcox

On Mon, Nov 28, 2011 at 11:08:42AM +0800, Wu Fengguang wrote:
> On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> > 
> > On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> > 
> > > 
> > > Reading Ted's information feed, I tend to disregard the partial write
> > > issue: since the "broken" applications will already fail and get
> > > punished in various other cases, I don't care adding one more penalty
> > > case to them :-P
> > 
> > Just wait until you have a bunch of rabid application programmers,
> > questioning your judgement, your morality, and even your paternity.
> > :-)
> 
> Ah OK, that sounds frightening. Hmm, till now every one have
> acknowledged the possibility of data corruption, only that
> people have different perceptions of the severeness.
> 
> Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
> I'm afraid not. Considering the origin of this patch
> 
> [BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
> http://www.spinics.net/lists/linux-ext4/msg28464.html
> 
> I *think* Jan's first patch is already enough for fixing the bug. IWO
> the patch we worried/discussed so much is really an optional one. I
> would imagine the easy and safe solution is to just drop it. Any
> objections?

Here is the replacement patch.

---
Subject: writeback: add a safety limit to the SIGKILL abort
Date: Mon Nov 28 11:16:54 CST 2011

This adds a safety limit to the SIGKILL abort in commit 499d05ecf990
("mm: Make task in balance_dirty_pages() killable"). This will avoid
dirty pages rushing arbitrarily high in the case some task receives
SIGKILL and hence become *unthrottled* when doing a huge sized write.

The alternative way is to check SIGKILL and return partial write in
generic_perform_write(). However it will lead to data corruption as
put by Andrew Morton:

 Previously if an app did write(file, 128k) and was hit with SIGKILL, it
 would write either 0 bytes or 128k bytes.  Now, it can write 36k bytes,
 yes?  If the target file consisted of a stream of 128k records then the
 user will claim, with some justification, that Linux corrupted it.

Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
 mm/page-writeback.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

--- linux-next.orig/mm/page-writeback.c	2011-11-28 11:13:58.000000000 +0800
+++ linux-next/mm/page-writeback.c	2011-11-28 11:16:52.000000000 +0800
@@ -1136,7 +1136,8 @@ pause:
 		if (task_ratelimit)
 			break;
 
-		if (fatal_signal_pending(current))
+		if (fatal_signal_pending(current) &&
+		    nr_dirty <= dirty_thresh + dirty_thresh / 2)
 			break;
 	}
 

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH] writeback: add a safety limit to the SIGKILL abort
  2011-11-28  3:33                 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
@ 2011-11-29 14:18                   ` Jan Kara
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-29 14:18 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
	Matthew Wilcox

On Mon 28-11-11 11:33:40, Wu Fengguang wrote:
> On Mon, Nov 28, 2011 at 11:08:42AM +0800, Wu Fengguang wrote:
> > On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> > > 
> > > On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> > > 
> > > > 
> > > > Reading Ted's information feed, I tend to disregard the partial write
> > > > issue: since the "broken" applications will already fail and get
> > > > punished in various other cases, I don't care adding one more penalty
> > > > case to them :-P
> > > 
> > > Just wait until you have a bunch of rabid application programmers,
> > > questioning your judgement, your morality, and even your paternity.
> > > :-)
> > 
> > Ah OK, that sounds frightening. Hmm, till now every one have
> > acknowledged the possibility of data corruption, only that
> > people have different perceptions of the severeness.
> > 
> > Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
> > I'm afraid not. Considering the origin of this patch
> > 
> > [BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
> > http://www.spinics.net/lists/linux-ext4/msg28464.html
> > 
> > I *think* Jan's first patch is already enough for fixing the bug. IWO
> > the patch we worried/discussed so much is really an optional one. I
> > would imagine the easy and safe solution is to just drop it. Any
> > objections?
> 
> Here is the replacement patch.
> 
> ---
> Subject: writeback: add a safety limit to the SIGKILL abort
> Date: Mon Nov 28 11:16:54 CST 2011
> 
> This adds a safety limit to the SIGKILL abort in commit 499d05ecf990
> ("mm: Make task in balance_dirty_pages() killable"). This will avoid
> dirty pages rushing arbitrarily high in the case some task receives
> SIGKILL and hence become *unthrottled* when doing a huge sized write.
> 
> The alternative way is to check SIGKILL and return partial write in
> generic_perform_write(). However it will lead to data corruption as
> put by Andrew Morton:
> 
>  Previously if an app did write(file, 128k) and was hit with SIGKILL, it
>  would write either 0 bytes or 128k bytes.  Now, it can write 36k bytes,
>  yes?  If the target file consisted of a stream of 128k records then the
>  user will claim, with some justification, that Linux corrupted it.
> 
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
  The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>

									Honza

> ---
>  mm/page-writeback.c |    3 ++-
>  1 file changed, 2 insertions(+), 1 deletion(-)
> 
> --- linux-next.orig/mm/page-writeback.c	2011-11-28 11:13:58.000000000 +0800
> +++ linux-next/mm/page-writeback.c	2011-11-28 11:16:52.000000000 +0800
> @@ -1136,7 +1136,8 @@ pause:
>  		if (task_ratelimit)
>  			break;
>  
> -		if (fatal_signal_pending(current))
> +		if (fatal_signal_pending(current) &&
> +		    nr_dirty <= dirty_thresh + dirty_thresh / 2)
>  			break;
>  	}
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
  2011-11-28  3:08               ` Wu Fengguang
  2011-11-28  3:33                 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
@ 2011-11-29 14:16                 ` Jan Kara
  1 sibling, 0 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-29 14:16 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Theodore Tso, Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
	Matthew Wilcox

On Mon 28-11-11 11:08:42, Wu Fengguang wrote:
> On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> > 
> > On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> > 
> > > 
> > > Reading Ted's information feed, I tend to disregard the partial write
> > > issue: since the "broken" applications will already fail and get
> > > punished in various other cases, I don't care adding one more penalty
> > > case to them :-P
> > 
> > Just wait until you have a bunch of rabid application programmers,
> > questioning your judgement, your morality, and even your paternity.
> > :-)
> 
> Ah OK, that sounds frightening. Hmm, till now every one have
> acknowledged the possibility of data corruption, only that
> people have different perceptions of the severeness.
> 
> Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
> I'm afraid not. Considering the origin of this patch
> 
> [BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
> http://www.spinics.net/lists/linux-ext4/msg28464.html
> 
> I *think* Jan's first patch is already enough for fixing the bug. IWO
> the patch we worried/discussed so much is really an optional one. I
> would imagine the easy and safe solution is to just drop it. Any
> objections?
  I still think aborting write is a cleaner solution. But I don't want to
argue to death about this so if you still don't think it's a good idea,
I'll respect your decision.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 0/2 v3] Make task in balance_dirty_pages() killable
  2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
  2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
  2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
@ 2011-11-16 11:23 ` Wu Fengguang
  2 siblings, 0 replies; 34+ messages in thread
From: Wu Fengguang @ 2011-11-16 11:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Wed, Nov 16, 2011 at 07:12:13PM +0800, Jan Kara wrote:
> 
>   Hello,
> 
>   since the discussion has calmed down, here is a third iteration of the patches.
> Fengguang, can you put them into your tree and merge them with Linus? Thanks.

Sure, thanks for pushing this forward.
I'll push them to linux-next and then to Linus.

Thanks,
Fengguang

> Changes since v2:
> * removed return value of balance_dirty_pages() since it didn't seem that
>   useful after all.
> * kept fatal_signal_pending() check in generic_perform_write() since I don't think
>   possible partial writes when app receives SIGKILL are a problem.
> 
> Changes since v1:
> * slightly moved the check in balance_dirty_pages() as Fengguang requested
> * made balance_dirty_pages() return EINTR if fatal signal was detected
> * changed check for signal to check for fatal signal in generic_perform_write()
>   to avoid unexpected results for userspace applications.
> 
> 									Honza

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 0/2 v2] Make task in balance_dirty_pages() killable
@ 2011-11-14 16:15 Jan Kara
  2011-11-14 16:15 ` [PATCH 1/2] mm: " Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-14 16:15 UTC (permalink / raw)
  To: Wu Fengguang; +Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel

  Hello,

  here is a second iteration of the patches. Changes since v1:
* slightly moved the check in balance_dirty_pages() as Fengguang requested
* made balance_dirty_pages() return EINTR if fatal signal was detected
* changed check for signal to check for fatal signal in generic_perform_write()
  to avoid unexpected results for userspace applications.

									Honza

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-14 16:15 [PATCH 0/2 v2] " Jan Kara
@ 2011-11-14 16:15 ` Jan Kara
  2011-11-14 16:23   ` Christoph Hellwig
  2011-11-15 11:48   ` Wu Fengguang
  0 siblings, 2 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-14 16:15 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel,
	Jan Kara

There is no reason why task in balance_dirty_pages() shouldn't be killable
and it helps in recovering from some error conditions (like when filesystem
goes in error state and cannot accept writeback anymore but we still want to
kill processes using it to be able to unmount it).

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 include/linux/writeback.h |    8 ++++----
 mm/page-writeback.c       |   27 +++++++++++++++++++--------
 2 files changed, 23 insertions(+), 12 deletions(-)

diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index a378c29..eb3ecca 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -170,13 +170,13 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
 			    unsigned long start_time);
 
 void page_writeback_init(void);
-void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
-					unsigned long nr_pages_dirtied);
+int balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
+				       unsigned long nr_pages_dirtied);
 
-static inline void
+static inline int
 balance_dirty_pages_ratelimited(struct address_space *mapping)
 {
-	balance_dirty_pages_ratelimited_nr(mapping, 1);
+	return balance_dirty_pages_ratelimited_nr(mapping, 1);
 }
 
 typedef int (*writepage_t)(struct page *page, struct writeback_control *wbc,
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0360d1b..380279c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1000,8 +1000,11 @@ static unsigned long bdi_max_pause(struct backing_dev_info *bdi,
  * the caller to wait once crossing the (background_thresh + dirty_thresh) / 2.
  * If we're over `background_thresh' then the writeback threads are woken to
  * perform some writeout.
+ *
+ * We return 0 if nothing special happened, EINTR if our wait has been
+ * interrupted by a fatal signal.
  */
-static void balance_dirty_pages(struct address_space *mapping,
+static int balance_dirty_pages(struct address_space *mapping,
 				unsigned long pages_dirtied)
 {
 	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
@@ -1020,6 +1023,7 @@ static void balance_dirty_pages(struct address_space *mapping,
 	unsigned long pos_ratio;
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	unsigned long start_time = jiffies;
+	int err = 0;
 
 	for (;;) {
 		/*
@@ -1133,7 +1137,7 @@ pause:
 					  pages_dirtied,
 					  pause,
 					  start_time);
-		__set_current_state(TASK_UNINTERRUPTIBLE);
+		__set_current_state(TASK_KILLABLE);
 		io_schedule_timeout(pause);
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
@@ -1145,6 +1149,11 @@ pause:
 		 */
 		if (nr_dirty < dirty_thresh)
 			break;
+
+		if (fatal_signal_pending(current)) {
+			err = -EINTR;
+			break;
+		}
 	}
 
 	if (!dirty_exceeded && bdi->dirty_exceeded)
@@ -1168,7 +1177,7 @@ pause:
 	}
 
 	if (writeback_in_progress(bdi))
-		return;
+		return err;
 
 	/*
 	 * In laptop mode, we wait until hitting the higher threshold before
@@ -1179,10 +1188,11 @@ pause:
 	 * background_thresh, to keep the amount of dirty memory low.
 	 */
 	if (laptop_mode)
-		return;
+		return err;
 
 	if (nr_reclaimable > background_thresh)
 		bdi_start_background_writeback(bdi);
+	return err;
 }
 
 void set_page_dirty_balance(struct page *page, int page_mkwrite)
@@ -1211,15 +1221,15 @@ static DEFINE_PER_CPU(int, bdp_ratelimits);
  * limit we decrease the ratelimiting by a lot, to prevent individual processes
  * from overshooting the limit by (ratelimit_pages) each.
  */
-void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
-					unsigned long nr_pages_dirtied)
+int balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
+				       unsigned long nr_pages_dirtied)
 {
 	struct backing_dev_info *bdi = mapping->backing_dev_info;
 	int ratelimit;
 	int *p;
 
 	if (!bdi_cap_account_dirty(bdi))
-		return;
+		return 0;
 
 	ratelimit = current->nr_dirtied_pause;
 	if (bdi->dirty_exceeded)
@@ -1247,7 +1257,8 @@ void balance_dirty_pages_ratelimited_nr(struct address_space *mapping,
 	preempt_enable();
 
 	if (unlikely(current->nr_dirtied >= ratelimit))
-		balance_dirty_pages(mapping, current->nr_dirtied);
+		return balance_dirty_pages(mapping, current->nr_dirtied);
+	return 0;
 }
 EXPORT_SYMBOL(balance_dirty_pages_ratelimited_nr);
 
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-14 16:15 ` [PATCH 1/2] mm: " Jan Kara
@ 2011-11-14 16:23   ` Christoph Hellwig
  2011-11-15 11:48   ` Wu Fengguang
  1 sibling, 0 replies; 34+ messages in thread
From: Christoph Hellwig @ 2011-11-14 16:23 UTC (permalink / raw)
  To: Jan Kara
  Cc: Wu Fengguang, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel

On Mon, Nov 14, 2011 at 05:15:23PM +0100, Jan Kara wrote:
> There is no reason why task in balance_dirty_pages() shouldn't be killable
> and it helps in recovering from some error conditions (like when filesystem
> goes in error state and cannot accept writeback anymore but we still want to
> kill processes using it to be able to unmount it).

Sorry for noticing this so late, but what about killing the
balance_dirty_pages_ratelimited wrapper, and only exporting a version
that takes a number of pages now that we have to change the signature
anyway?  I'd recommend to call the new one simply
balance_dirty_pages_ratelimited, that will give a compile breakge to
anyone using one of the old versions and thus cause them to actually
adapt and hopefull check the return value.

(and not for this series but soon later - we probably really should call
into balance_dirty_pages in larger batch sizes in most places).

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-14 16:15 ` [PATCH 1/2] mm: " Jan Kara
  2011-11-14 16:23   ` Christoph Hellwig
@ 2011-11-15 11:48   ` Wu Fengguang
  2011-11-15 13:41     ` Jan Kara
  1 sibling, 1 reply; 34+ messages in thread
From: Wu Fengguang @ 2011-11-15 11:48 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

> +static int balance_dirty_pages(struct address_space *mapping,
>  				unsigned long pages_dirtied)
>  {
>  	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
> @@ -1020,6 +1023,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>  	unsigned long pos_ratio;
>  	struct backing_dev_info *bdi = mapping->backing_dev_info;
>  	unsigned long start_time = jiffies;
> +	int err = 0;
>  
>  	for (;;) {
>  		/*
> @@ -1133,7 +1137,7 @@ pause:
>  					  pages_dirtied,
>  					  pause,
>  					  start_time);
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		__set_current_state(TASK_KILLABLE);
>  		io_schedule_timeout(pause);
>  
>  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> @@ -1145,6 +1149,11 @@ pause:
>  		 */
>  		if (nr_dirty < dirty_thresh)
>  			break;
> +
> +		if (fatal_signal_pending(current)) {
> +			err = -EINTR;
> +			break;
> +		}
>  	}
  
The other alternative is to raise the limit on fatal_signal_pending:

                if (fatal_signal_pending(current) && 
                    nr_dirty < dirty_thresh + dirty_thresh / 2)
                        break;

That should work well enough in practice and avoids touching the fs code.

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-15 11:48   ` Wu Fengguang
@ 2011-11-15 13:41     ` Jan Kara
  2011-11-15 14:15       ` Wu Fengguang
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-15 13:41 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Tue 15-11-11 19:48:44, Wu Fengguang wrote:
> > +static int balance_dirty_pages(struct address_space *mapping,
> >  				unsigned long pages_dirtied)
> >  {
> >  	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
> > @@ -1020,6 +1023,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> >  	unsigned long pos_ratio;
> >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> >  	unsigned long start_time = jiffies;
> > +	int err = 0;
> >  
> >  	for (;;) {
> >  		/*
> > @@ -1133,7 +1137,7 @@ pause:
> >  					  pages_dirtied,
> >  					  pause,
> >  					  start_time);
> > -		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		__set_current_state(TASK_KILLABLE);
> >  		io_schedule_timeout(pause);
> >  
> >  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> > @@ -1145,6 +1149,11 @@ pause:
> >  		 */
> >  		if (nr_dirty < dirty_thresh)
> >  			break;
> > +
> > +		if (fatal_signal_pending(current)) {
> > +			err = -EINTR;
> > +			break;
> > +		}
> >  	}
>   
> The other alternative is to raise the limit on fatal_signal_pending:
> 
>                 if (fatal_signal_pending(current) && 
>                     nr_dirty < dirty_thresh + dirty_thresh / 2)
>                         break;
> 
> That should work well enough in practice and avoids touching the fs code.
  Sorry, but I fail to see what would this bring us... Can you elaborate a
bit please?

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-15 13:41     ` Jan Kara
@ 2011-11-15 14:15       ` Wu Fengguang
  2011-11-15 14:44         ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Wu Fengguang @ 2011-11-15 14:15 UTC (permalink / raw)
  To: Jan Kara
  Cc: Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Tue, Nov 15, 2011 at 09:41:27PM +0800, Jan Kara wrote:
> On Tue 15-11-11 19:48:44, Wu Fengguang wrote:
> > > +static int balance_dirty_pages(struct address_space *mapping,
> > >  				unsigned long pages_dirtied)
> > >  {
> > >  	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
> > > @@ -1020,6 +1023,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> > >  	unsigned long pos_ratio;
> > >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > >  	unsigned long start_time = jiffies;
> > > +	int err = 0;
> > >  
> > >  	for (;;) {
> > >  		/*
> > > @@ -1133,7 +1137,7 @@ pause:
> > >  					  pages_dirtied,
> > >  					  pause,
> > >  					  start_time);
> > > -		__set_current_state(TASK_UNINTERRUPTIBLE);
> > > +		__set_current_state(TASK_KILLABLE);
> > >  		io_schedule_timeout(pause);
> > >  
> > >  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> > > @@ -1145,6 +1149,11 @@ pause:
> > >  		 */
> > >  		if (nr_dirty < dirty_thresh)
> > >  			break;
> > > +
> > > +		if (fatal_signal_pending(current)) {
> > > +			err = -EINTR;
> > > +			break;
> > > +		}
> > >  	}
> >   
> > The other alternative is to raise the limit on fatal_signal_pending:
> > 
> >                 if (fatal_signal_pending(current) && 
> >                     nr_dirty < dirty_thresh + dirty_thresh / 2)
> >                         break;
> > 
> > That should work well enough in practice and avoids touching the fs code.
>   Sorry, but I fail to see what would this bring us... Can you elaborate a
> bit please?

It's not bringing us something, but allows us to get rid of patch 2
as well as adding fatal_signal_pending() tests to all the other fs.

So no more worries about partial writes :)

Sure it leaves ->write_begin() blocks unhandled, however that should
be much less a problem than the blocks inside balance_dirty_pages().

Thanks,
Fengguang

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-15 14:15       ` Wu Fengguang
@ 2011-11-15 14:44         ` Jan Kara
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-15 14:44 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
	linux-fsdevel@vger.kernel.org

On Tue 15-11-11 22:15:28, Wu Fengguang wrote:
> On Tue, Nov 15, 2011 at 09:41:27PM +0800, Jan Kara wrote:
> > On Tue 15-11-11 19:48:44, Wu Fengguang wrote:
> > > > +static int balance_dirty_pages(struct address_space *mapping,
> > > >  				unsigned long pages_dirtied)
> > > >  {
> > > >  	unsigned long nr_reclaimable;	/* = file_dirty + unstable_nfs */
> > > > @@ -1020,6 +1023,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> > > >  	unsigned long pos_ratio;
> > > >  	struct backing_dev_info *bdi = mapping->backing_dev_info;
> > > >  	unsigned long start_time = jiffies;
> > > > +	int err = 0;
> > > >  
> > > >  	for (;;) {
> > > >  		/*
> > > > @@ -1133,7 +1137,7 @@ pause:
> > > >  					  pages_dirtied,
> > > >  					  pause,
> > > >  					  start_time);
> > > > -		__set_current_state(TASK_UNINTERRUPTIBLE);
> > > > +		__set_current_state(TASK_KILLABLE);
> > > >  		io_schedule_timeout(pause);
> > > >  
> > > >  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> > > > @@ -1145,6 +1149,11 @@ pause:
> > > >  		 */
> > > >  		if (nr_dirty < dirty_thresh)
> > > >  			break;
> > > > +
> > > > +		if (fatal_signal_pending(current)) {
> > > > +			err = -EINTR;
> > > > +			break;
> > > > +		}
> > > >  	}
> > >   
> > > The other alternative is to raise the limit on fatal_signal_pending:
> > > 
> > >                 if (fatal_signal_pending(current) && 
> > >                     nr_dirty < dirty_thresh + dirty_thresh / 2)
> > >                         break;
> > > 
> > > That should work well enough in practice and avoids touching the fs code.
> >   Sorry, but I fail to see what would this bring us... Can you elaborate a
> > bit please?
> 
> It's not bringing us something, but allows us to get rid of patch 2
> as well as adding fatal_signal_pending() tests to all the other fs.
> 
> So no more worries about partial writes :)
> 
> Sure it leaves ->write_begin() blocks unhandled, however that should
> be much less a problem than the blocks inside balance_dirty_pages().
  Ah, OK, I see. Frankly, I'd rather keep the loop break in
generic_perform_write() and have the code straightforward. I don't see any
security (data exposure) problem there and if someone complains on the
grounds that the behavior has changed, we can revert the change in the
worst case.

								Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 0/2] Make task doing heavy writing killable
@ 2011-11-14 11:10 Jan Kara
  2011-11-14 11:10 ` [PATCH 1/2] mm: Make task in balance_dirty_pages() killable Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-14 11:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wu Fengguang, Al Viro, k-mio, Andrew Morton, Christoph Hellwig

  Hello,

  these two patches aim at making task waiting in balance_dirty_pages()
killable.  This is desirable because otherwise if filesystem stops accepting
writes (e.g. if device has been removed or other serious error condidion) we
have a task stuck in D state forever.

  I'm not sure who should merge these two patches... Al, Fengguang?

									Honza

^ permalink raw reply	[flat|nested] 34+ messages in thread

* [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-14 11:10 [PATCH 0/2] Make task doing heavy writing killable Jan Kara
@ 2011-11-14 11:10 ` Jan Kara
  2011-11-14 12:12   ` Wu Fengguang
  0 siblings, 1 reply; 34+ messages in thread
From: Jan Kara @ 2011-11-14 11:10 UTC (permalink / raw)
  To: linux-fsdevel
  Cc: Wu Fengguang, Al Viro, k-mio, Andrew Morton, Christoph Hellwig,
	Jan Kara

There is no reason why task in balance_dirty_pages() shouldn't be killable
and it helps in recovering from some error conditions (like when filesystem
goes in error state and cannot accept writeback anymore but we still want to
kill processes using it to be able to unmount it).

Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
 mm/page-writeback.c |    4 +++-
 1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0360d1b..e83c286 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1133,8 +1133,10 @@ pause:
 					  pages_dirtied,
 					  pause,
 					  start_time);
-		__set_current_state(TASK_UNINTERRUPTIBLE);
+		__set_current_state(TASK_KILLABLE);
 		io_schedule_timeout(pause);
+		if (fatal_signal_pending(current))
+			break;
 
 		dirty_thresh = hard_dirty_limit(dirty_thresh);
 		/*
-- 
1.7.1


^ permalink raw reply related	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-14 11:10 ` [PATCH 1/2] mm: Make task in balance_dirty_pages() killable Jan Kara
@ 2011-11-14 12:12   ` Wu Fengguang
  2011-11-14 12:37     ` Jan Kara
  0 siblings, 1 reply; 34+ messages in thread
From: Wu Fengguang @ 2011-11-14 12:12 UTC (permalink / raw)
  To: Jan Kara
  Cc: linux-fsdevel@vger.kernel.org, Al Viro, k-mio@sx.jp.nec.com,
	Andrew Morton, Christoph Hellwig

On Mon, Nov 14, 2011 at 07:10:29PM +0800, Jan Kara wrote:
> There is no reason why task in balance_dirty_pages() shouldn't be killable
> and it helps in recovering from some error conditions (like when filesystem
> goes in error state and cannot accept writeback anymore but we still want to
> kill processes using it to be able to unmount it).
> 
> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
>  mm/page-writeback.c |    4 +++-
>  1 files changed, 3 insertions(+), 1 deletions(-)
> 
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0360d1b..e83c286 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1133,8 +1133,10 @@ pause:
>  					  pages_dirtied,
>  					  pause,
>  					  start_time);
> -		__set_current_state(TASK_UNINTERRUPTIBLE);
> +		__set_current_state(TASK_KILLABLE);

Dumb question: will the task continue to show up as 'D' in ps?

>  		io_schedule_timeout(pause);
> +		if (fatal_signal_pending(current))
> +			break;

I'd like to move that several lines below, to the end of the loop, 
so that the condition is not tested at all in normal executions.

>  		dirty_thresh = hard_dirty_limit(dirty_thresh);
>  		/*
> -- 
> 1.7.1

^ permalink raw reply	[flat|nested] 34+ messages in thread

* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
  2011-11-14 12:12   ` Wu Fengguang
@ 2011-11-14 12:37     ` Jan Kara
  0 siblings, 0 replies; 34+ messages in thread
From: Jan Kara @ 2011-11-14 12:37 UTC (permalink / raw)
  To: Wu Fengguang
  Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Al Viro,
	k-mio@sx.jp.nec.com, Andrew Morton, Christoph Hellwig

On Mon 14-11-11 20:12:45, Wu Fengguang wrote:
> On Mon, Nov 14, 2011 at 07:10:29PM +0800, Jan Kara wrote:
> > There is no reason why task in balance_dirty_pages() shouldn't be killable
> > and it helps in recovering from some error conditions (like when filesystem
> > goes in error state and cannot accept writeback anymore but we still want to
> > kill processes using it to be able to unmount it).
> > 
> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> >  mm/page-writeback.c |    4 +++-
> >  1 files changed, 3 insertions(+), 1 deletions(-)
> > 
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 0360d1b..e83c286 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1133,8 +1133,10 @@ pause:
> >  					  pages_dirtied,
> >  					  pause,
> >  					  start_time);
> > -		__set_current_state(TASK_UNINTERRUPTIBLE);
> > +		__set_current_state(TASK_KILLABLE);
> 
> Dumb question: will the task continue to show up as 'D' in ps?
  Yes, it will. The only difference is that the task will be woken when
SIGKILL arrives.

> >  		io_schedule_timeout(pause);
> > +		if (fatal_signal_pending(current))
> > +			break;
> 
> I'd like to move that several lines below, to the end of the loop, 
> so that the condition is not tested at all in normal executions.
  OK, will do that.

> >  		dirty_thresh = hard_dirty_limit(dirty_thresh);
> >  		/*

									Honza
-- 
Jan Kara <jack@suse.cz>
SUSE Labs, CR

^ permalink raw reply	[flat|nested] 34+ messages in thread

end of thread, other threads:[~2011-11-29 14:18 UTC | newest]

Thread overview: 34+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
2011-11-16 11:28   ` Wu Fengguang
2011-11-16 12:58     ` Jan Kara
2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-16 11:44   ` Wu Fengguang
2011-11-16 12:54     ` Jan Kara
2011-11-16 13:11       ` Wu Fengguang
2011-11-22 22:28     ` Andrew Morton
2011-11-23  9:05       ` Wu Fengguang
2011-11-23  9:50         ` Andrew Morton
2011-11-23 12:27           ` [PATCH 2/2] " Theodore Tso
2011-11-23 20:29             ` Andrew Morton
2011-11-24 19:27               ` Matthew Wilcox
2011-11-24 20:53                 ` Ted Ts'o
2011-11-25  0:10                   ` Matthew Wilcox
2011-11-24 20:53                 ` Jan Kara
2011-11-23 13:08         ` [PATCH 2/2] fs: " Jan Kara
2011-11-23 13:27           ` Wu Fengguang
2011-11-23 15:06             ` Theodore Tso
2011-11-28  3:08               ` Wu Fengguang
2011-11-28  3:33                 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
2011-11-29 14:18                   ` Jan Kara
2011-11-29 14:16                 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-16 11:23 ` [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Wu Fengguang
  -- strict thread matches above, loose matches on Subject: below --
2011-11-14 16:15 [PATCH 0/2 v2] " Jan Kara
2011-11-14 16:15 ` [PATCH 1/2] mm: " Jan Kara
2011-11-14 16:23   ` Christoph Hellwig
2011-11-15 11:48   ` Wu Fengguang
2011-11-15 13:41     ` Jan Kara
2011-11-15 14:15       ` Wu Fengguang
2011-11-15 14:44         ` Jan Kara
2011-11-14 11:10 [PATCH 0/2] Make task doing heavy writing killable Jan Kara
2011-11-14 11:10 ` [PATCH 1/2] mm: Make task in balance_dirty_pages() killable Jan Kara
2011-11-14 12:12   ` Wu Fengguang
2011-11-14 12:37     ` Jan Kara

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).