* [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 11:10 [PATCH 0/2] Make task doing heavy writing killable Jan Kara
@ 2011-11-14 11:10 ` Jan Kara
2011-11-14 12:12 ` Matthew Wilcox
2011-11-14 12:15 ` Wu Fengguang
0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-14 11:10 UTC (permalink / raw)
To: linux-fsdevel
Cc: Wu Fengguang, Al Viro, k-mio, Andrew Morton, Christoph Hellwig,
Jan Kara
Currently write(2) to a file is not interruptible by a signal. Sometimes this
is desirable (e.g. when you want to quickly kill a process hogging your disk or
when some process gets blocked in balance_dirty_pages() indefinitely due to a
filesystem being in an error condition).
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/filemap.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..6b01d2f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,6 +2407,10 @@ static ssize_t generic_perform_write(struct file *file,
iov_iter_count(i));
again:
+ if (signal_pending(current)) {
+ status = -EINTR;
+ break;
+ }
/*
* Bring in the user page that we will copy from _first_.
--
1.7.1
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 11:10 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
@ 2011-11-14 12:12 ` Matthew Wilcox
2011-11-14 12:15 ` Wu Fengguang
1 sibling, 0 replies; 38+ messages in thread
From: Matthew Wilcox @ 2011-11-14 12:12 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel, Wu Fengguang, Al Viro, k-mio, Andrew Morton,
Christoph Hellwig
On Mon, Nov 14, 2011 at 12:10:30PM +0100, Jan Kara wrote:
> again:
> + if (signal_pending(current)) {
Uhh ... surely 'fatal_signal_pending', no?
> + status = -EINTR;
> + break;
> + }
>
> /*
> * Bring in the user page that we will copy from _first_.
> --
> 1.7.1
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 11:10 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-14 12:12 ` Matthew Wilcox
@ 2011-11-14 12:15 ` Wu Fengguang
2011-11-14 12:34 ` Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Wu Fengguang @ 2011-11-14 12:15 UTC (permalink / raw)
To: Jan Kara
Cc: linux-fsdevel@vger.kernel.org, Al Viro, k-mio@sx.jp.nec.com,
Andrew Morton, Christoph Hellwig
> @@ -2407,6 +2407,10 @@ static ssize_t generic_perform_write(struct file *file,
> iov_iter_count(i));
>
> again:
> + if (signal_pending(current)) {
signal_pending looks more useful than fatal_signal_pending in that it
covers normal signals too. However it's exactly the broader coverage
that makes it an interface change -- will this possibly break casually
written applications?
> + status = -EINTR;
> + break;
> + }
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 12:15 ` Wu Fengguang
@ 2011-11-14 12:34 ` Jan Kara
2011-11-14 14:16 ` Matthew Wilcox
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-14 12:34 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, linux-fsdevel@vger.kernel.org, Al Viro,
k-mio@sx.jp.nec.com, Andrew Morton, Christoph Hellwig
On Mon 14-11-11 20:15:56, Wu Fengguang wrote:
> > @@ -2407,6 +2407,10 @@ static ssize_t generic_perform_write(struct file *file,
> > iov_iter_count(i));
> >
> > again:
> > + if (signal_pending(current)) {
>
> signal_pending looks more useful than fatal_signal_pending in that it
> covers normal signals too. However it's exactly the broader coverage
> that makes it an interface change -- will this possibly break casually
> written applications?
Yeah, this is up for discussion. Historically, write() (or any other system
call) could have returned EINTR. In fact, write() to a socket can return
EINTR even now. But you are right that we didn't return EINTR from write()
to a regular file. So if you prefer to never return EINTR from a write to a
regular file, I can change the check since I'm also slightly worried that
some badly written app can notice.
Honza
>
> > + status = -EINTR;
> > + break;
> > + }
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 12:34 ` Jan Kara
@ 2011-11-14 14:16 ` Matthew Wilcox
2011-11-14 15:30 ` Jan Kara
0 siblings, 1 reply; 38+ messages in thread
From: Matthew Wilcox @ 2011-11-14 14:16 UTC (permalink / raw)
To: Jan Kara
Cc: Wu Fengguang, linux-fsdevel@vger.kernel.org, Al Viro,
k-mio@sx.jp.nec.com, Andrew Morton, Christoph Hellwig
On Mon, Nov 14, 2011 at 01:34:46PM +0100, Jan Kara wrote:
> On Mon 14-11-11 20:15:56, Wu Fengguang wrote:
> > > @@ -2407,6 +2407,10 @@ static ssize_t generic_perform_write(struct file *file,
> > > iov_iter_count(i));
> > >
> > > again:
> > > + if (signal_pending(current)) {
> >
> > signal_pending looks more useful than fatal_signal_pending in that it
> > covers normal signals too. However it's exactly the broader coverage
> > that makes it an interface change -- will this possibly break casually
> > written applications?
> Yeah, this is up for discussion. Historically, write() (or any other system
> call) could have returned EINTR. In fact, write() to a socket can return
> EINTR even now. But you are right that we didn't return EINTR from write()
> to a regular file. So if you prefer to never return EINTR from a write to a
> regular file, I can change the check since I'm also slightly worried that
> some badly written app can notice.
No, this is not up for discussion. You can't return short writes (or
reads). This is why the 'fatal_signal_pending' API exists -- if the
signal is fatal, the task is never returned to, so its bug (not checking
the return from read/write) is not exposed.
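The argument above can be modelled in a few lines of userspace C. This is only
a sketch, not kernel code: the enum, the function name and the
check_fatal_only flag are made up for illustration. It shows that breaking out
of the write loop only on a fatal signal never lets the application observe a
short write, while breaking on any pending signal does.

```c
#include <stdbool.h>

/* Hypothetical model of what kind of signal is pending for the writer. */
enum pending_kind { PENDING_NONE, PENDING_HANDLED, PENDING_FATAL };

/*
 * Returns true if the application can observe a short write.
 * check_fatal_only == true models a fatal_signal_pending()-style check,
 * check_fatal_only == false models a signal_pending()-style check.
 */
bool userspace_sees_short_write(enum pending_kind pending, bool check_fatal_only)
{
	bool loop_breaks = check_fatal_only ? (pending == PENDING_FATAL)
					    : (pending != PENDING_NONE);
	/* A task killed by a fatal signal never returns to userspace. */
	bool task_survives = (pending != PENDING_FATAL);

	return loop_breaks && task_survives;
}
```

In this model the only combination that exposes a short write is a handled
(non-fatal) signal together with the broader signal_pending()-style check.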
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 14:16 ` Matthew Wilcox
@ 2011-11-14 15:30 ` Jan Kara
2011-11-14 18:44 ` Jeremy Allison
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-14 15:30 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Jan Kara, Wu Fengguang, linux-fsdevel@vger.kernel.org, Al Viro,
k-mio@sx.jp.nec.com, Andrew Morton, Christoph Hellwig
On Mon 14-11-11 07:16:26, Matthew Wilcox wrote:
> On Mon, Nov 14, 2011 at 01:34:46PM +0100, Jan Kara wrote:
> > On Mon 14-11-11 20:15:56, Wu Fengguang wrote:
> > > > @@ -2407,6 +2407,10 @@ static ssize_t generic_perform_write(struct file *file,
> > > > iov_iter_count(i));
> > > >
> > > > again:
> > > > + if (signal_pending(current)) {
> > >
> > > signal_pending looks more useful than fatal_signal_pending in that it
> > > covers normal signals too. However it's exactly the broader coverage
> > > that makes it an interface change -- will this possibly break casually
> > > written applications?
> > Yeah, this is up for discussion. Historically, write() (or any other system
> > call) could have returned EINTR. In fact, write() to a socket can return
> > EINTR even now. But you are right that we didn't return EINTR from write()
> > to a regular file. So if you prefer to never return EINTR from a write to a
> > regular file, I can change the check since I'm also slightly worried that
> > some badly written app can notice.
>
> No, this is not up for discussion. You can't return short writes (or
> reads). This is why the 'fatal_signal_pending' API exists -- if the
> signal is fatal, the task is never returned to, so its bug (not checking
> the return from read/write) is not exposed.
By "can't return", do you mean that userspace may not be expecting it and so
we shouldn't break it, or is there some standard which forbids it? Just
curious...
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 16:15 [PATCH 0/2 v2] " Jan Kara
@ 2011-11-14 16:15 ` Jan Kara
2011-11-14 16:26 ` Christoph Hellwig
2011-11-14 22:19 ` Andrew Morton
0 siblings, 2 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-14 16:15 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel,
Jan Kara
Currently write(2) to a file is not interruptible by a signal. Sometimes this
is desirable (e.g. when you want to quickly kill a process hogging your disk or
when some process gets blocked in balance_dirty_pages() indefinitely due to a
filesystem being in an error condition).
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/filemap.c | 11 +++++++++--
1 files changed, 9 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..166b30e 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
iov_iter_count(i));
again:
-
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -2463,7 +2462,15 @@ again:
written += copied;
balance_dirty_pages_ratelimited(mapping);
-
+ /*
+ * We check the signal independently of balance_dirty_pages()
+ * because we need not wait and check for signal there although
+ * this loop could have taken significant amount of time...
+ */
+ if (fatal_signal_pending(current)) {
+ status = -EINTR;
+ break;
+ }
} while (iov_iter_count(i));
return written ? written : status;
--
1.7.1
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 16:15 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
@ 2011-11-14 16:26 ` Christoph Hellwig
2011-11-14 16:46 ` Jan Kara
2011-11-14 22:19 ` Andrew Morton
1 sibling, 1 reply; 38+ messages in thread
From: Christoph Hellwig @ 2011-11-14 16:26 UTC (permalink / raw)
To: Jan Kara
Cc: Wu Fengguang, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel
On Mon, Nov 14, 2011 at 05:15:24PM +0100, Jan Kara wrote:
> Currently write(2) to a file is not interruptible by a signal. Sometimes this
> is desirable (e.g. when you want to quickly kill a process hogging your disk or
> when some process gets blocked in balance_dirty_pages() indefinitely due to a
> filesystem being in an error condition).
>
> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> mm/filemap.c | 11 +++++++++--
> 1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c0018f2..166b30e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> iov_iter_count(i));
>
> again:
> -
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> @@ -2463,7 +2462,15 @@ again:
> written += copied;
>
> balance_dirty_pages_ratelimited(mapping);
> -
> + /*
> + * We check the signal independently of balance_dirty_pages()
> + * because we need not wait and check for signal there although
> + * this loop could have taken significant amount of time...
> + */
> + if (fatal_signal_pending(current)) {
> + status = -EINTR;
> + break;
> + }
Hmm. If we need to check again every time adding the return value to
balance_dirty_pages is rather pointless.
I have a bit of a problem parsing the comment - does it try to say that
we might spend too much time after the fatal_signal_pending in the
balance_dirty_pages code so that we have to check it again? Why not
repeat the check at the end of balance_dirty_pages_ratelimited and thus
avoid having to duplicate the thing in all callers?
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 16:26 ` Christoph Hellwig
@ 2011-11-14 16:46 ` Jan Kara
2011-11-14 20:13 ` Christoph Hellwig
0 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-14 16:46 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Jan Kara, Wu Fengguang, Andrew Morton, Al Viro, linux-fsdevel
On Mon 14-11-11 11:26:00, Christoph Hellwig wrote:
> On Mon, Nov 14, 2011 at 05:15:24PM +0100, Jan Kara wrote:
> > Currently write(2) to a file is not interruptible by a signal. Sometimes this
> > is desirable (e.g. when you want to quickly kill a process hogging your disk or
> > when some process gets blocked in balance_dirty_pages() indefinitely due to a
> > filesystem being in an error condition).
> >
> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > mm/filemap.c | 11 +++++++++--
> > 1 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c0018f2..166b30e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> > iov_iter_count(i));
> >
> > again:
> > -
> > /*
> > * Bring in the user page that we will copy from _first_.
> > * Otherwise there's a nasty deadlock on copying from the
> > @@ -2463,7 +2462,15 @@ again:
> > written += copied;
> >
> > balance_dirty_pages_ratelimited(mapping);
> > -
> > + /*
> > + * We check the signal independently of balance_dirty_pages()
> > + * because we need not wait and check for signal there although
> > + * this loop could have taken significant amount of time...
> > + */
> > + if (fatal_signal_pending(current)) {
> > + status = -EINTR;
> > + break;
> > + }
>
> Hmm. If we need to check again every time adding the return value to
> balance_dirty_pages is rather pointless.
>
> I have a bit of a problem parsing the comment - does it try to say that
> we might spend too much time after the fatal_signal_pending in the
> balance_dirty_pages code so that we have to check it again? Why not
> repeat the check at the end of balance_dirty_pages_ratelimited and thus
> avoid having to duplicate the thing in all callers?
The point I was trying to get across is that all:
iov_iter_fault_in_readable()
->write_begin()
...
->write_end()
can take a rather long time. Considering that
balance_dirty_pages_ratelimited() needn't call balance_dirty_pages() or
balance_dirty_pages() can just return without doing anything because there
aren't many dirty pages, the end result is that
balance_dirty_pages_ratelimited() won't return EINTR for a rather long time
although there is a signal pending. That's why I want to check for signal
pending directly in the loop in generic_perform_write().
There can be loops where the only blocking point is in
balance_dirty_pages() and then using the return value makes sense but
I think such loops would be in minority so maybe the return value doesn't
make much sense after all.
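The effect described above can be sketched with a toy userspace model. This
is not the kernel code: struct write_model, both function names and the
RATELIMIT_PAGES value are assumptions chosen only for illustration. The
ratelimited helper reaches its slow path (where a signal would be noticed)
only every RATELIMIT_PAGES iterations, so a check placed directly in the loop
reacts much sooner.

```c
#include <stdbool.h>
#include <stddef.h>

#define RATELIMIT_PAGES 32	/* assumed value, for illustration only */

struct write_model {
	bool fatal_pending;	/* a fatal signal has arrived */
	size_t since_balance;	/* pages dirtied since the last slow path */
};

/* Returns true only when the slow path ran and noticed the signal. */
static bool balance_ratelimited_model(struct write_model *m)
{
	if (++m->since_balance < RATELIMIT_PAGES)
		return false;
	m->since_balance = 0;
	return m->fatal_pending;
}

/*
 * Writes up to `pages` pages; the signal arrives just before page index
 * `signal_at` is written. Returns how many pages were written before
 * the loop stopped.
 */
size_t perform_write_model(struct write_model *m, size_t pages,
			   size_t signal_at, bool check_in_loop)
{
	size_t written = 0;

	while (written < pages) {
		if (written == signal_at)
			m->fatal_pending = true;
		written++;	/* stands in for write_begin/copy/write_end */

		bool seen = balance_ratelimited_model(m);

		if (check_in_loop ? m->fatal_pending : seen)
			break;
	}
	return written;
}
```

With a signal arriving at page index 5, the in-loop check stops after 6 pages,
while relying only on the ratelimited path stops at page 32.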
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 15:30 ` Jan Kara
@ 2011-11-14 18:44 ` Jeremy Allison
0 siblings, 0 replies; 38+ messages in thread
From: Jeremy Allison @ 2011-11-14 18:44 UTC (permalink / raw)
To: Jan Kara
Cc: Matthew Wilcox, Wu Fengguang, linux-fsdevel@vger.kernel.org,
Al Viro, k-mio@sx.jp.nec.com, Andrew Morton, Christoph Hellwig
On Mon, Nov 14, 2011 at 04:30:47PM +0100, Jan Kara wrote:
> On Mon 14-11-11 07:16:26, Matthew Wilcox wrote:
> > On Mon, Nov 14, 2011 at 01:34:46PM +0100, Jan Kara wrote:
> > > On Mon 14-11-11 20:15:56, Wu Fengguang wrote:
> > > > > @@ -2407,6 +2407,10 @@ static ssize_t generic_perform_write(struct file *file,
> > > > > iov_iter_count(i));
> > > > >
> > > > > again:
> > > > > + if (signal_pending(current)) {
> > > >
> > > > signal_pending looks more useful than fatal_signal_pending in that it
> > > > covers normal signals too. However it's exactly the broader coverage
> > > > that makes it an interface change -- will this possibly break casually
> > > > written applications?
> > > Yeah, this is up for discussion. Historically, write() (or any other system
> > > call) could have returned EINTR. In fact, write() to a socket can return
> > > EINTR even now. But you are right that we didn't return EINTR from write()
> > > to a regular file. So if you prefer to never return EINTR from a write to a
> > > regular file, I can change the check since I'm also slightly worried that
> > > some badly written app can notice.
> >
> > No, this is not up for discussion. You can't return short writes (or
> > reads). This is why the 'fatal_signal_pending' API exists -- if the
> > signal is fatal, the task is never returned to, so its bug (not checking
> > the return from read/write) is not exposed.
> By "can't return", do you mean that userspace may not be expecting it and so
> we shouldn't break it, or is there some standard which forbids it? Just
> curious...
This *WILL* break userspace code if you return short writes on a
filesystem fd. Guaranteed. I originally wrote code in Samba to
take care of it back before I learned the difference between
"slow" and "fast" interruptible system calls (see W. R. Stevens
for details).
Don't return short writes or reads on a filesystem fd.
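For reference, defensive userspace code of the kind mentioned above usually
looks something like the following. This is a minimal sketch, not Samba's
actual code; the helper name write_all is made up here. It retries on EINTR
and keeps writing after a short write until everything is out or a real error
occurs.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write all `len` bytes of `buf` to `fd`.
 * Returns len on success, -1 on error with errno set.
 */
ssize_t write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;
	size_t left = len;

	while (left > 0) {
		ssize_t n = write(fd, p, left);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* interrupted before any data: retry */
			return -1;
		}
		if (n == 0) {
			/* No progress; bail out rather than spin forever. */
			errno = EIO;
			return -1;
		}
		p += n;			/* short write: continue with the rest */
		left -= (size_t)n;
	}
	return (ssize_t)len;
}
```

Applications that call write() once and ignore the return value are exactly
the ones that would break if short writes started appearing on regular files.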
Jeremy.
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 16:46 ` Jan Kara
@ 2011-11-14 20:13 ` Christoph Hellwig
0 siblings, 0 replies; 38+ messages in thread
From: Christoph Hellwig @ 2011-11-14 20:13 UTC (permalink / raw)
To: Jan Kara
Cc: Christoph Hellwig, Wu Fengguang, Andrew Morton, Al Viro,
linux-fsdevel
On Mon, Nov 14, 2011 at 05:46:47PM +0100, Jan Kara wrote:
> There can be loops where the only blocking point is in
> balance_dirty_pages() and then using the return value makes sense but
> I think such loops would be in minority so maybe the return value doesn't
> make much sense after all.
Yes, maybe the return value isn't that smart. Sorry for all the noise.
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 16:15 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-14 16:26 ` Christoph Hellwig
@ 2011-11-14 22:19 ` Andrew Morton
2011-11-15 11:23 ` Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2011-11-14 22:19 UTC (permalink / raw)
To: Jan Kara; +Cc: Wu Fengguang, Christoph Hellwig, Al Viro, linux-fsdevel
On Mon, 14 Nov 2011 17:15:24 +0100
Jan Kara <jack@suse.cz> wrote:
> Currently write(2) to a file is not interruptible by a signal. Sometimes this
> is desirable (e.g. when you want to quickly kill a process hogging your disk or
> when some process gets blocked in balance_dirty_pages() indefinitely due to a
> filesystem being in an error condition).
>
> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> mm/filemap.c | 11 +++++++++--
> 1 files changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c0018f2..166b30e 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> iov_iter_count(i));
>
> again:
> -
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> @@ -2463,7 +2462,15 @@ again:
> written += copied;
>
> balance_dirty_pages_ratelimited(mapping);
> -
> + /*
> + * We check the signal independently of balance_dirty_pages()
> + * because we need not wait and check for signal there although
> + * this loop could have taken significant amount of time...
> + */
> + if (fatal_signal_pending(current)) {
> + status = -EINTR;
> + break;
> + }
> } while (iov_iter_count(i));
>
> return written ? written : status;
Will this permit the interruption of things like fsync() or sync()? If
so, considerable pondering is needed.
Also I worry about stuff like the use of buffered write to finish off
loose ends in direct-IO writing. Sometimes these writes MUST complete,
to prevent exposing unwritten disk blocks to a subsequent read. Will a
well-timed ^C break this? If "no" then does this change introduce risk
that we'll later accidentally introduce such a security hole?
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-14 22:19 ` Andrew Morton
@ 2011-11-15 11:23 ` Jan Kara
0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-15 11:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Wu Fengguang, Christoph Hellwig, Al Viro, linux-fsdevel
On Mon 14-11-11 14:19:54, Andrew Morton wrote:
> On Mon, 14 Nov 2011 17:15:24 +0100
> Jan Kara <jack@suse.cz> wrote:
>
> > Currently write(2) to a file is not interruptible by a signal. Sometimes this
> > is desirable (e.g. when you want to quickly kill a process hogging your disk or
> > when some process gets blocked in balance_dirty_pages() indefinitely due to a
> > filesystem being in an error condition).
> >
> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > mm/filemap.c | 11 +++++++++--
> > 1 files changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c0018f2..166b30e 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> > iov_iter_count(i));
> >
> > again:
> > -
> > /*
> > * Bring in the user page that we will copy from _first_.
> > * Otherwise there's a nasty deadlock on copying from the
> > @@ -2463,7 +2462,15 @@ again:
> > written += copied;
> >
> > balance_dirty_pages_ratelimited(mapping);
> > -
> > + /*
> > + * We check the signal independently of balance_dirty_pages()
> > + * because we need not wait and check for signal there although
> > + * this loop could have taken significant amount of time...
> > + */
> > + if (fatal_signal_pending(current)) {
> > + status = -EINTR;
> > + break;
> > + }
> > } while (iov_iter_count(i));
> >
> > return written ? written : status;
>
> Will this permit the interruption of things like fsync() or sync()? If
> so, considerable pondering is needed.
No, fsync() or sync() are not influenced.
> Also I worry about stuff like the use of buffered write to finish off
> loose ends in direct-IO writing. Sometimes these writes MUST complete,
> to prevent exposing unwritten disk blocks to a subsequent read. Will a
> well-timed ^C break this? If "no" then does this change introduce risk
> that we'll later accidentally introduce such a security hole?
That's a good question! A well-timed SIGKILL can stop a buffered write
from completing the remaining bits of a direct IO write. But looking into the
code I fail to see where we rely on the buffered write being completed. I would
consider it close to a filesystem bug to return from direct IO with a potential
for stale data exposure...
For writes beyond i_size, we increase i_size only by the amount which was
successfully written so there is no risk of exposure. Otherwise, direct IO
would have to fill a hole to allocate a block - ext? filesystems, reiserfs,
ocfs2, and other simple filesystems avoid doing that. XFS fills a hole but
takes care to init everything. So I don't think anyone really expects
buffered write to cover his back.
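The i_size argument can be illustrated with a toy model in userspace C. This
is a sketch under stated assumptions, not the kernel implementation: struct
inode_model and finish_write_model are hypothetical names. The file size only
ever grows to cover bytes that were actually copied, so an interrupted
extending write leaves nothing unwritten below the end of file.

```c
#include <stddef.h>

/* Toy stand-in for an inode; only the size matters here. */
struct inode_model {
	size_t i_size;
};

/*
 * Account for a write at offset `pos` of which only `copied` bytes were
 * actually written (for example because the writer was killed).
 * The size is extended to pos + copied at most and never shrinks.
 * Returns the resulting file size.
 */
size_t finish_write_model(struct inode_model *inode, size_t pos, size_t copied)
{
	size_t end = pos + copied;

	if (end > inode->i_size)
		inode->i_size = end;
	return inode->i_size;
}
```

For example, a write at offset 100 that was interrupted after 20 bytes leaves
the size at 120, regardless of how many bytes were originally requested.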
To sum up, the only result of an interrupted buffered write after a direct IO
write seems to be that some other application could see only part of the
write completed. That is possible in case of multipage buffered writes as
well and e.g. with ext3/4 such situation is possible even without these
patches (due to conditions like ENOSPC or due to a system crash -
admittedly more serious conditions than someone sending SIGKILL).
Thanks for your comments.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* [PATCH 0/2 v3] Make task in balance_dirty_pages() killable
@ 2011-11-16 11:12 Jan Kara
2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
` (2 more replies)
0 siblings, 3 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-16 11:12 UTC (permalink / raw)
To: Wu Fengguang; +Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel
Hello,
since the discussion has calmed down, here is a third iteration of the patches.
Fengguang, can you put them into your tree and merge them with Linus? Thanks.
Changes since v2:
* removed return value of balance_dirty_pages() since it didn't seem that
useful after all.
* kept fatal_signal_pending() check in generic_perform_write() since I don't think
possible partial writes when app receives SIGKILL are a problem.
Changes since v1:
* slightly moved the check in balance_dirty_pages() as Fengguang requested
* made balance_dirty_pages() return EINTR if fatal signal was detected
* changed check for signal to check for fatal signal in generic_perform_write()
to avoid unexpected results for userspace applications.
Honza
* [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
@ 2011-11-16 11:12 ` Jan Kara
2011-11-16 11:28 ` Wu Fengguang
2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-16 11:23 ` [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Wu Fengguang
2 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-16 11:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel,
Jan Kara
There is no reason why a task in balance_dirty_pages() shouldn't be killable,
and killing it helps in recovering from some error conditions (like when a
filesystem goes into an error state and cannot accept writeback anymore but we
still want to kill processes using it to be able to unmount it).
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/page-writeback.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 0360d1b..39f5805 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -1133,7 +1133,7 @@ pause:
pages_dirtied,
pause,
start_time);
- __set_current_state(TASK_UNINTERRUPTIBLE);
+ __set_current_state(TASK_KILLABLE);
io_schedule_timeout(pause);
dirty_thresh = hard_dirty_limit(dirty_thresh);
@@ -1145,6 +1145,9 @@ pause:
*/
if (nr_dirty < dirty_thresh)
break;
+
+ if (fatal_signal_pending(current))
+ break;
}
if (!dirty_exceeded && bdi->dirty_exceeded)
--
1.7.1
* [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
@ 2011-11-16 11:12 ` Jan Kara
2011-11-16 11:44 ` Wu Fengguang
2011-11-16 11:23 ` [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Wu Fengguang
2 siblings, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-16 11:12 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Christoph Hellwig, Al Viro, linux-fsdevel,
Jan Kara
Currently write(2) to a file is not interruptible by a signal. Sometimes this
is desirable (e.g. when you want to quickly kill a process hogging your disk or
when some process gets blocked in balance_dirty_pages() indefinitely due to a
filesystem being in an error condition).
Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
Signed-off-by: Jan Kara <jack@suse.cz>
---
mm/filemap.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c0018f2..c106d3b 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
iov_iter_count(i));
again:
-
/*
* Bring in the user page that we will copy from _first_.
* Otherwise there's a nasty deadlock on copying from the
@@ -2463,7 +2462,10 @@ again:
written += copied;
balance_dirty_pages_ratelimited(mapping);
-
+ if (fatal_signal_pending(current)) {
+ status = -EINTR;
+ break;
+ }
} while (iov_iter_count(i));
return written ? written : status;
--
1.7.1
* Re: [PATCH 0/2 v3] Make task in balance_dirty_pages() killable
2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
@ 2011-11-16 11:23 ` Wu Fengguang
2 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2011-11-16 11:23 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed, Nov 16, 2011 at 07:12:13PM +0800, Jan Kara wrote:
>
> Hello,
>
> since the discussion has calmed down, here is a third iteration of the patches.
> Fengguang, can you put them into your tree and merge them with Linus? Thanks.
Sure, thanks for pushing this forward.
I'll push them to linux-next and then to Linus.
Thanks,
Fengguang
> Changes since v2:
> * removed return value of balance_dirty_pages() since it didn't seem that
> useful after all.
> * kept fatal_signal_pending() check in generic_perform_write() since I don't think
> possible partial writes when app receives SIGKILL are a problem.
>
> Changes since v1:
> * slightly moved the check in balance_dirty_pages() as Fengguang requested
> * made balance_dirty_pages() return EINTR if fatal signal was detected
> * changed check for signal to check for fatal signal in generic_perform_write()
> to avoid unexpected results for userspace applications.
>
> Honza
* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
@ 2011-11-16 11:28 ` Wu Fengguang
2011-11-16 12:58 ` Jan Kara
0 siblings, 1 reply; 38+ messages in thread
From: Wu Fengguang @ 2011-11-16 11:28 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Neil Brown, KOSAKI Motohiro,
Anton Altaparmakov
On Wed, Nov 16, 2011 at 07:12:14PM +0800, Jan Kara wrote:
> There is no reason why a task in balance_dirty_pages() shouldn't be killable,
> and killing it helps in recovering from some error conditions (like when a
> filesystem goes into an error state and cannot accept writeback anymore but we
> still want to kill processes using it to be able to unmount it).
Here I'd like to add a paragraph from my original patch:
The follow up patches will further abort the generic_perform_write() and
other filesystem write loops, to avoid large write + SIGKILL combination
exceeding the dirty limit and possibly strange OOM.
Thanks,
Fengguang
> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> mm/page-writeback.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 0360d1b..39f5805 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -1133,7 +1133,7 @@ pause:
> pages_dirtied,
> pause,
> start_time);
> - __set_current_state(TASK_UNINTERRUPTIBLE);
> + __set_current_state(TASK_KILLABLE);
> io_schedule_timeout(pause);
>
> dirty_thresh = hard_dirty_limit(dirty_thresh);
> @@ -1145,6 +1145,9 @@ pause:
> */
> if (nr_dirty < dirty_thresh)
> break;
> +
> + if (fatal_signal_pending(current))
> + break;
> }
>
> if (!dirty_exceeded && bdi->dirty_exceeded)
> --
> 1.7.1
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
@ 2011-11-16 11:44 ` Wu Fengguang
2011-11-16 12:54 ` Jan Kara
2011-11-22 22:28 ` Andrew Morton
0 siblings, 2 replies; 38+ messages in thread
From: Wu Fengguang @ 2011-11-16 11:44 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
Jan,
Due to the (very low) possibility of data loss by partial writes, IMHO
it would be safer to test this patch in linux-next until the next merge window,
would you agree?
Pushing the first patch will address the main problem, anyway.
Thanks,
Fengguang
On Wed, Nov 16, 2011 at 07:12:15PM +0800, Jan Kara wrote:
> Currently write(2) to a file is not interruptible by a signal. Sometimes this
> is desirable (e.g. when you want to quickly kill a process hogging your disk or
> when some process gets blocked in balance_dirty_pages() indefinitely due to a
> filesystem being in an error condition).
>
> Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> Signed-off-by: Jan Kara <jack@suse.cz>
> ---
> mm/filemap.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c0018f2..c106d3b 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> iov_iter_count(i));
>
> again:
> -
> /*
> * Bring in the user page that we will copy from _first_.
> * Otherwise there's a nasty deadlock on copying from the
> @@ -2463,7 +2462,10 @@ again:
> written += copied;
>
> balance_dirty_pages_ratelimited(mapping);
> -
> + if (fatal_signal_pending(current)) {
> + status = -EINTR;
> + break;
> + }
> } while (iov_iter_count(i));
>
> return written ? written : status;
> --
> 1.7.1
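To make the resulting return-value semantics concrete, here is a small userspace model of the patched loop. It is only a sketch, not kernel code: model_perform_write(), fatal_pending, kill_after_chunks and CHUNK are invented names standing in for generic_perform_write(), fatal_signal_pending(current), the arrival of SIGKILL, and one page-sized copy step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>
#include <sys/types.h>

#define CHUNK 4096 /* stands in for one page-sized copy step */

/* Hypothetical stand-in for fatal_signal_pending(current). */
static int fatal_pending;

/* Hypothetical hook: "deliver SIGKILL" once this many chunks are copied. */
static size_t kill_after_chunks = (size_t)-1;

/*
 * Model of the patched loop: copy chunk by chunk, check for a fatal
 * signal after each chunk, and return the bytes written if any progress
 * was made, otherwise the error status.
 */
static ssize_t model_perform_write(char *dst, const char *src, size_t count)
{
	ssize_t written = 0;
	ssize_t status = 0;
	size_t chunks = 0;

	assert(dst != NULL && src != NULL);

	while (count > 0) {
		size_t n = count < CHUNK ? count : CHUNK;

		memcpy(dst + written, src + written, n);
		written += n;
		count -= n;

		if (++chunks == kill_after_chunks)
			fatal_pending = 1;

		/* As in the patch, the check comes after the copy. */
		if (fatal_pending) {
			status = -EINTR;
			break;
		}
	}
	return written ? written : status;
}
```

In this model a 3 * CHUNK write that is "killed" after two chunks returns 2 * CHUNK rather than an error, which is exactly the partial write under discussion. Because the check follows the copy, at least one chunk is always written before the loop stops.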
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-16 11:44 ` Wu Fengguang
@ 2011-11-16 12:54 ` Jan Kara
2011-11-16 13:11 ` Wu Fengguang
2011-11-22 22:28 ` Andrew Morton
1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-16 12:54 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed 16-11-11 19:44:21, Wu Fengguang wrote:
> Jan,
>
> Due to the (very low) possibility of data loss by partial writes, IMHO
> it would be safer to test this patch in linux-next until the next merge window,
> would you agree?
Fine with me. Thanks.
> Pushing the first patch will address the main problem, anyway.
Hopefully, yes.
Honza
> On Wed, Nov 16, 2011 at 07:12:15PM +0800, Jan Kara wrote:
> > Currently write(2) to a file is not interruptible by a signal. Sometimes this
> > is desirable (e.g. when you want to quickly kill a process hogging your disk or
> > when some process gets blocked in balance_dirty_pages() indefinitely due to a
> > filesystem being in an error condition).
> >
> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > mm/filemap.c | 6 ++++--
> > 1 files changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index c0018f2..c106d3b 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> > iov_iter_count(i));
> >
> > again:
> > -
> > /*
> > * Bring in the user page that we will copy from _first_.
> > * Otherwise there's a nasty deadlock on copying from the
> > @@ -2463,7 +2462,10 @@ again:
> > written += copied;
> >
> > balance_dirty_pages_ratelimited(mapping);
> > -
> > + if (fatal_signal_pending(current)) {
> > + status = -EINTR;
> > + break;
> > + }
> > } while (iov_iter_count(i));
> >
> > return written ? written : status;
> > --
> > 1.7.1
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 1/2] mm: Make task in balance_dirty_pages() killable
2011-11-16 11:28 ` Wu Fengguang
@ 2011-11-16 12:58 ` Jan Kara
0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-16 12:58 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Neil Brown, KOSAKI Motohiro,
Anton Altaparmakov
On Wed 16-11-11 19:28:19, Wu Fengguang wrote:
> On Wed, Nov 16, 2011 at 07:12:14PM +0800, Jan Kara wrote:
> > There is no reason why task in balance_dirty_pages() shouldn't be killable
> > and it helps in recovering from some error conditions (like when filesystem
> > goes in error state and cannot accept writeback anymore but we still want to
> > kill processes using it to be able to unmount it).
>
> Here I'd like to add a paragraph from my original patch:
>
> The follow up patches will further abort the generic_perform_write() and
> other filesystem write loops, to avoid large write + SIGKILL combination
> exceeding the dirty limit and possibly strange OOM.
Sure, feel free to add the paragraph to the changelog.
Honza
> > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > Signed-off-by: Jan Kara <jack@suse.cz>
> > ---
> > mm/page-writeback.c | 5 ++++-
> > 1 files changed, 4 insertions(+), 1 deletions(-)
> >
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 0360d1b..39f5805 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -1133,7 +1133,7 @@ pause:
> > pages_dirtied,
> > pause,
> > start_time);
> > - __set_current_state(TASK_UNINTERRUPTIBLE);
> > + __set_current_state(TASK_KILLABLE);
> > io_schedule_timeout(pause);
> >
> > dirty_thresh = hard_dirty_limit(dirty_thresh);
> > @@ -1145,6 +1145,9 @@ pause:
> > */
> > if (nr_dirty < dirty_thresh)
> > break;
> > +
> > + if (fatal_signal_pending(current))
> > + break;
> > }
> >
> > if (!dirty_exceeded && bdi->dirty_exceeded)
> > --
> > 1.7.1
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-16 12:54 ` Jan Kara
@ 2011-11-16 13:11 ` Wu Fengguang
0 siblings, 0 replies; 38+ messages in thread
From: Wu Fengguang @ 2011-11-16 13:11 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Kazuya Mio
On Wed, Nov 16, 2011 at 08:54:45PM +0800, Jan Kara wrote:
> On Wed 16-11-11 19:44:21, Wu Fengguang wrote:
> > Jan,
> >
> > Due to the (very low) possibility of data loss by partial writes, IMHO
> > it would be safer to test this patch in linux-next until the next merge window,
> > would you agree?
> Fine with me. Thanks.
Great. The two patches are now in:
http://git.kernel.org/?p=linux/kernel/git/wfg/linux.git;a=shortlog;h=refs/heads/writeback-for-next
When applying the patch I changed the title "... by a signal" to
"... by SIGKILL" to reflect the updated patch content. Hopefully
this is also what you had in mind.
> > Pushing the first patch will address the main problem, anyway.
> Hopefully, yes.
I looked over Kazuya's posts in linux-ext4 and think the first patch
alone has a good chance of addressing the problem of busy looping in
balance_dirty_pages() caused by dirty pages never dropping under fs error
conditions.
Thanks,
Fengguang
> > On Wed, Nov 16, 2011 at 07:12:15PM +0800, Jan Kara wrote:
> > > Currently write(2) to a file is not interruptible by a signal. Sometimes this
> > > is desirable (e.g. when you want to quickly kill a process hogging your disk or
> > > when some process gets blocked in balance_dirty_pages() indefinitely due to a
> > > filesystem being in an error condition).
> > >
> > > Reported-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > > Tested-by: Kazuya Mio <k-mio@sx.jp.nec.com>
> > > Signed-off-by: Jan Kara <jack@suse.cz>
> > > ---
> > > mm/filemap.c | 6 ++++--
> > > 1 files changed, 4 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/mm/filemap.c b/mm/filemap.c
> > > index c0018f2..c106d3b 100644
> > > --- a/mm/filemap.c
> > > +++ b/mm/filemap.c
> > > @@ -2407,7 +2407,6 @@ static ssize_t generic_perform_write(struct file *file,
> > > iov_iter_count(i));
> > >
> > > again:
> > > -
> > > /*
> > > * Bring in the user page that we will copy from _first_.
> > > * Otherwise there's a nasty deadlock on copying from the
> > > @@ -2463,7 +2462,10 @@ again:
> > > written += copied;
> > >
> > > balance_dirty_pages_ratelimited(mapping);
> > > -
> > > + if (fatal_signal_pending(current)) {
> > > + status = -EINTR;
> > > + break;
> > > + }
> > > } while (iov_iter_count(i));
> > >
> > > return written ? written : status;
> > > --
> > > 1.7.1
> --
> Jan Kara <jack@suse.cz>
> SUSE Labs, CR
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-16 11:44 ` Wu Fengguang
2011-11-16 12:54 ` Jan Kara
@ 2011-11-22 22:28 ` Andrew Morton
2011-11-23 9:05 ` Wu Fengguang
1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2011-11-22 22:28 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed, 16 Nov 2011 19:44:21 +0800
Wu Fengguang <fengguang.wu@intel.com> wrote:
> Due to the (very low) possibility of data loss by partial writes, IMHO
> it would be safer to test this patch in linux-next until the next merge window,
Any such bugs will not be discovered in linux-next testing.
The only way to find these things in a reasonable period of time is to
go in and find them. For example, intensive fsx-linux testing with
concurrent heavy memory pressure on various filesystems with various
block sizes. And of course concurrent signalling. If you're talking
about O_DIRECT then iirc I hacked support for that into fsx-linux. I
think.
Anyway, what _are_ the scenarios in which we think data can be lost?
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-22 22:28 ` Andrew Morton
@ 2011-11-23 9:05 ` Wu Fengguang
2011-11-23 9:50 ` Andrew Morton
2011-11-23 13:08 ` [PATCH 2/2] fs: " Jan Kara
0 siblings, 2 replies; 38+ messages in thread
From: Wu Fengguang @ 2011-11-23 9:05 UTC (permalink / raw)
To: Andrew Morton
Cc: Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> On Wed, 16 Nov 2011 19:44:21 +0800
> Wu Fengguang <fengguang.wu@intel.com> wrote:
>
> > Due to the (very low) possibility of data loss by partial writes, IMHO
> > it would be safer to test this patch in linux-next until the next merge window,
>
> Any such bugs will not be discovered in linux-next testing.
Yup, I'm afraid.
> The only way to find these things in a reasonable period of time is to
> go in and find them. For example, intensive fsx-linux testing with
> concurrent heavy memory pressure on various filesystems with various
> block sizes. And of course concurrent signalling. If you're talking
> about O_DIRECT then iirc I hacked support for that into fsx-linux. I
> think.
How are we going to measure the success/failure? Check if it
eventually resulted in filesystem corruption or whatever?
On receiving SIGKILL, fsx-linux itself will just die.
> Anyway, what _are_ the scenarios in which we think data can be lost?
The concern is that there may be partial writes on SIGKILL. Before the
patch, the write would either succeed as a whole or not start at
all, depending on the timing of write/SIGKILL. This is a kind of atomic
operation. However, now the write can be half done.
If the application really cares about atomic behavior, it can do
create-and-rename. However, there is always the possibility of broken
applications.
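For reference, create-and-rename looks roughly like this in userspace. This is a minimal sketch under stated assumptions: replace_file() is a made-up helper name, the temporary file is assumed to live on the same filesystem as the target, and fsync of the containing directory is left out for brevity.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/*
 * Replace `path` with `len` bytes from `buf` so that readers see either
 * the old contents or the new ones, never a mix. `tmp` must be on the
 * same filesystem as `path`. Returns 0 on success, -1 on failure.
 */
static int replace_file(const char *path, const char *tmp,
			const char *buf, size_t len)
{
	size_t done = 0;
	int fd;

	assert(path != NULL && tmp != NULL && buf != NULL);

	fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
	if (fd < 0)
		return -1;

	/* Write everything to the temporary file, retrying short writes. */
	while (done < len) {
		ssize_t n = write(fd, buf + done, len - done);

		if (n <= 0) {
			if (n < 0 && errno == EINTR)
				continue;
			goto fail;
		}
		done += (size_t)n;
	}

	/* Flush the data before the rename makes the new file visible. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		unlink(tmp);
		return -1;
	}

	/* rename(2) atomically replaces path with the temporary file. */
	if (rename(tmp, path) < 0) {
		unlink(tmp);
		return -1;
	}
	return 0;

fail:
	close(fd);
	unlink(tmp);
	return -1;
}
```

If the writer is killed part-way through, the worst outcome is a stale temporary file; the target itself still holds the old contents.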
Maybe this is not that big a problem, as SIGKILL is already considered
destructive.
Thanks,
Fengguang
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-23 9:05 ` Wu Fengguang
@ 2011-11-23 9:50 ` Andrew Morton
2011-11-23 12:27 ` [PATCH 2/2] " Theodore Tso
2011-11-23 13:08 ` [PATCH 2/2] fs: " Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2011-11-23 9:50 UTC (permalink / raw)
To: Wu Fengguang
Cc: Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed, 23 Nov 2011 17:05:33 +0800 Wu Fengguang <fengguang.wu@intel.com> wrote:
> On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> > On Wed, 16 Nov 2011 19:44:21 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Due to the (very low) possibility of data loss by partial writes, IMHO
> > > it would be safer to test this patch in linux-next until the next merge window,
> >
> > Any such bugs will not be discovered in linux-next testing.
>
> Yup, I'm afraid.
>
> > The only way to find these things in a reasonable period of time is to
> > go in and find them. For example, intensive fsx-linux testing with
> > concurrent heavy memory pressure on various filesystems with various
> > block sizes. And of course concurrent signalling. If you're talking
> > about O_DIRECT then iirc I hacked support for that into fsx-linux. I
> > think.
>
> How are we going to measure the success/failure? Check if it
> eventually resulted in filesystem corruption or whatever?
yup.
> When received SIGKILL, fsx-linux itself will just die.
Well, there are ways of simulating its effect. For example, bale out
of the write() every seventh time if current->comm=="fsx-linux". Or
set a rearming timer which triggers a baled-out write. You'll work it
out ;)
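The same idea can be approximated from userspace without patching the kernel, assuming the test program routes its writes through a wrapper. The sketch below is hypothetical: flaky_write() and FAIL_EVERY are made-up names, and fsx-linux does not ship anything like this as far as this thread shows.

```c
#include <assert.h>
#include <sys/types.h>
#include <unistd.h>

#define FAIL_EVERY 7 /* hypothetical: cut short every seventh call */

static unsigned long flaky_calls;

/*
 * Wrapper around write(2) that turns every FAIL_EVERY-th call into a
 * short write, imitating a write that was cut off part-way through.
 */
static ssize_t flaky_write(int fd, const void *buf, size_t count)
{
	assert(buf != NULL);

	if (++flaky_calls % FAIL_EVERY == 0 && count > 1)
		count /= 2; /* only the first half reaches the file */

	return write(fd, buf, count);
}
```

Unlike a real SIGKILL the caller survives and sees the short return value, so this exercises the file-content side of the problem rather than process death.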
> > Anyway, what _are_ the scenarios in which we think data can be lost?
>
> The concern is that there may be partial writes on SIGKILL. Before the
> patch, the write would either succeed as a whole or not start at
> all, depending on the timing of write/SIGKILL. This is a kind of atomic
> operation. However, now the write can be half done.
>
> If the application really cares about atomic behavior, it can do
> create-and-rename. However, there is always the possibility of broken
> applications.
>
> Maybe this is not that big a problem, as SIGKILL is already considered
> destructive.
Yeah, I have dim dark memories that there are subtle problems with
interrupting write(). Linus might remember.
Others might remember too, but we're only talking to fsdevel which
nobody reads :(
* Re: [PATCH 2/2] Make write(2) interruptible by a signal
2011-11-23 9:50 ` Andrew Morton
@ 2011-11-23 12:27 ` Theodore Tso
2011-11-23 20:29 ` Andrew Morton
0 siblings, 1 reply; 38+ messages in thread
From: Theodore Tso @ 2011-11-23 12:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Nov 23, 2011, at 4:50 AM, Andrew Morton wrote:
>>
>> If the application really cares about atomic behavior, it can do
>> create-and-rename. However, there is always the possibility of broken
>> applications.
Heh. Application programmers whine a lot about not doing create-and-rename already (and we had to introduce a work-around since so many programs assume close() implies fsync()).
>>
>> Maybe this is not that big a problem, as SIGKILL is already considered
>> destructive.
>
> Yeah, I have dim dark memories that there are subtle problems with
> interrupting write(). Linus might remember.
The big one is that you're lucky if application programmers check the return values of write(2), and if they do, it's likely they will only check for error returns and not necessarily for partial writes --- since no other Unix-like or Linux-like system has ever returned partial reads or writes for files except in error conditions. We've barely gotten them trained to check for partial writes and reads with TCP connections and character devices, but I wouldn't bet on application programmers getting things right for files.
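For concreteness, "checking for partial writes" means something like the loop below. It is a sketch rather than a recommendation for any particular program; write_all() is a name chosen here for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write all `count` bytes of `buf` to `fd`, retrying after short writes
 * and EINTR. Returns 0 on success, -1 with errno set on failure.
 */
static int write_all(int fd, const void *buf, size_t count)
{
	const char *p = buf;

	assert(buf != NULL || count == 0);

	while (count > 0) {
		ssize_t n = write(fd, p, count);

		if (n < 0) {
			if (errno == EINTR)
				continue; /* interrupted before any data moved */
			return -1;
		}
		if (n == 0) {
			errno = EIO; /* no progress: give up instead of spinning */
			return -1;
		}
		/* Short write: advance past what was accepted and retry. */
		p += n;
		count -= (size_t)n;
	}
	return 0;
}
```

A program that only tests write() for -1 skips the short-write branch entirely, which is the habit described above.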
Still, if it's ***only*** for SIGKILL, we'll probably be OK, since for that one case there's no chance userspace can intercept the signal, so it can't do any recovery anyway. (I could imagine some HPC program doing a massive 2GB write, and some user of that program depending on the fact that he can kill it at a predefined place by sending a SIGKILL and knowing that the file would be written up to that 2GB chunk --- but that's clearly an edge situation, as opposed to something that would affect most GNOME and KDE apps.) We just need to make sure we never try to do this for any other signal that could be caught, such as SIGINT or SIGQUIT or (worse yet) SIGTSTP.
-- Ted
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-23 9:05 ` Wu Fengguang
2011-11-23 9:50 ` Andrew Morton
@ 2011-11-23 13:08 ` Jan Kara
2011-11-23 13:27 ` Wu Fengguang
1 sibling, 1 reply; 38+ messages in thread
From: Jan Kara @ 2011-11-23 13:08 UTC (permalink / raw)
To: Wu Fengguang
Cc: Andrew Morton, Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed 23-11-11 17:05:33, Wu Fengguang wrote:
> On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> > On Wed, 16 Nov 2011 19:44:21 +0800
> > Wu Fengguang <fengguang.wu@intel.com> wrote:
> >
> > > Due to the (very low) possibility of data loss by partial writes, IMHO
> > > it would be safer to test this patch in linux-next until the next merge window,
> >
> > Any such bugs will not be discovered in linux-next testing.
>
> Yup, I'm afraid.
>
> > The only way to find these things in a reasonable period of time is to
> > go in and find them. For example, intensive fsx-linux testing with
> > concurrent heavy memory pressure on various filesystems with various
> > block sizes. And of course concurrent signalling. If you're talking
> > about O_DIRECT then iirc I hacked support for that into fsx-linux. I
> > think.
>
> How are we going to measure the success/failure? Check if it
> eventually resulted in filesystem corruption or whatever?
There are a few different questions:
1) Checking for filesystem corruption via fsck - I find such corruption
caused by stopping write early extremely unlikely.
2) Checking that we do not expose uninitialized data after a partial
(possibly DIRECT_IO) write - I did not find a place where that could happen
but this would be worth testing. I think I can write a test for this if
people are afraid of data exposure problems.
3) Is it acceptable for write(2) to be interrupted by SIGKILL in the
middle? That obviously does happen with my patches so there's no reason
to test that. The question is whether someone cares or not and that can be
tested only by reality check :). Since the signal is SIGKILL, the process
itself cannot notice the interrupted write but someone else can. But as I
already said earlier, partial writes can already be observed when the
machine crashes, the filesystem is close to ENOSPC, and so on. Arguably these are
more severe error conditions than an application receiving SIGKILL, so my
patch lowers the bar for observing partial writes. But I wouldn't like to
throw away a sensible thing - allowing SIGKILL to interrupt a system call -
just out of fear that some broken app might rely on the old behavior.
Sure if the reality check shows there are such broken apps and users who
care enough to report, then I have nothing against biting the bullet
and reverting the change... Opinions?
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-23 13:08 ` [PATCH 2/2] fs: " Jan Kara
@ 2011-11-23 13:27 ` Wu Fengguang
2011-11-23 15:06 ` Theodore Tso
0 siblings, 1 reply; 38+ messages in thread
From: Wu Fengguang @ 2011-11-23 13:27 UTC (permalink / raw)
To: Jan Kara
Cc: Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Theodore Ts'o
On Wed, Nov 23, 2011 at 09:08:03PM +0800, Jan Kara wrote:
> On Wed 23-11-11 17:05:33, Wu Fengguang wrote:
> > On Wed, Nov 23, 2011 at 06:28:05AM +0800, Andrew Morton wrote:
> > > On Wed, 16 Nov 2011 19:44:21 +0800
> > > Wu Fengguang <fengguang.wu@intel.com> wrote:
> > >
> > > > Due to the (very low) possibility of data loss by partial writes, IMHO
> > > > it would be safer to test this patch in linux-next until the next merge window,
> > >
> > > Any such bugs will not be discovered in linux-next testing.
> >
> > Yup, I'm afraid.
> >
> > > The only way to find these things in a reasonable period of time is to
> > > go in and find them. For example, intensive fsx-linux testing with
> > > concurrent heavy memory pressure on various filesystems with various
> > > block sizes. And of course concurrent signalling. If you're talking
> > > about O_DIRECT then iirc I hacked support for that into fsx-linux. I
> > > think.
> >
> > How are we going to measure the success/failure? Check if it
> > eventually resulted in filesystem corruption or whatever?
> There are a few different questions:
> 1) Checking for filesystem corruption via fsck - I find such corruption
> caused by stopping write early extremely unlikely.
Agreed.
> 2) Checking that we do not expose uninitialized data after a partial
> (possibly DIRECT_IO) write - I did not find a place where that could happen
> but this would be worth testing. I think I can write a test for this if
> people are afraid of data exposure problems.
Do we already have this kind of test in xfstests? If not, it sounds
like a good gap to fill :-)
> 3) Is it acceptable for write(2) to be interrupted by SIGKILL in the
> middle? That obviously does happen with my patches so there's no reason
> to test that. The question is whether someone cares or not and that can be
> tested only by reality check :). Since the signal is SIGKILL, the process
> itself cannot notice the interrupted write but someone else can. But as I
> already said earlier, partial writes can already be observed when the
> machine crashes, filesystem is close to ENOSPC or so. Arguably these are
> more severe error conditions than application catching SIGKILL so my
> patch lowers the bar for observing partial writes. But I wouldn't like to
> throw away a sensible thing - allow SIGKILL to interrupt a system call -
> just because of fear of possibility some broken app could rely on this.
> Sure if the reality check shows there are such broken apps and users who
> care enough to report, then I have nothing against biting the bullet
> and reverting the change... Opinions?
Reading Ted's information feed, I tend to disregard the partial write
issue: since the "broken" applications will already fail and get
punished in various other cases, I don't mind adding one more penalty
case for them :-P
Thanks,
Fengguang
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-23 13:27 ` Wu Fengguang
@ 2011-11-23 15:06 ` Theodore Tso
2011-11-28 3:08 ` Wu Fengguang
0 siblings, 1 reply; 38+ messages in thread
From: Theodore Tso @ 2011-11-23 15:06 UTC (permalink / raw)
To: Wu Fengguang
Cc: Theodore Tso, Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
>
> Reading Ted's information feed, I tend to disregard the partial write
> issue: since the "broken" applications will already fail and get
> punished in various other cases, I don't care adding one more penalty
> case to them :-P
Just wait until you have a bunch of rabid application programmers, questioning your judgement, your morality, and even your paternity. :-)
-- Ted
* Re: [PATCH 2/2] Make write(2) interruptible by a signal
2011-11-23 12:27 ` [PATCH 2/2] " Theodore Tso
@ 2011-11-23 20:29 ` Andrew Morton
2011-11-24 19:27 ` Matthew Wilcox
0 siblings, 1 reply; 38+ messages in thread
From: Andrew Morton @ 2011-11-23 20:29 UTC (permalink / raw)
To: Theodore Tso
Cc: Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed, 23 Nov 2011 07:27:43 -0500
Theodore Tso <tytso@mit.edu> wrote:
> >>
> >> Maybe this is not that big a problem, as SIGKILL is already considered
> >> destructive.
> >
> > Yeah, I have dim dark memories that there are subtle problems with
> > interrupting write(). Linus might remember.
(err, you're sending 600-column emails)
> The big one is that you're lucky if application programmers check the
> return values of write(2), and if they do, it's likely they will only
> check for error returns and not necessarily for partial writes ---
> since no other Unix-like or Linux-like system has ever returned partial
> reads or writes for files except in error conditions. We've barely
> gotten them trained to check for partial writes and reads with TCP
> connections and character devices, but I wouldn't bet on application
> programmers getting things right for files.
>
> Still, if it's ***only*** for SIGKILL, we'll probably be OK, since
> for that one case there's no chance userspace can intercept the signal,
> so it can't do any recovery anyway. (I could imagine some HPC program
> doing a massive 2GB write, and some user of that program depending on
> the fact that he can kill it at a predefined place by sending a SIGKILL
> and knowing that the file would be written up to that 2GB chunk --- but
> that's clearly an edge situation, as opposed to something that would
> affect most GNOME and KDE apps.) We just need to make sure we never try
> to do this for any other signal that could be caught, such as SIGINT or
> SIGQUIT or (worse yet) SIGTSTP.
That it is a fatal SIGKILL means that the *current* application doesn't
care. But other processes will sometimes notice this change.
Previously if an app did write(file, 128k) and was hit with SIGKILL, it
would write either 0 bytes or 128k bytes. Now, it can write 36k bytes,
yes? If the target file consisted of a stream of 128k records then the
user will claim, with some justification, that Linux corrupted it.
Dunno. People do lots of weird and flakey things. I have a suspicion
that we'll be hearing back from them about this change.
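To illustrate the record-stream case: a reader of such a file could at least detect a torn trailing record from the file size. This is a sketch only; count_records() is a hypothetical helper, and it cannot detect a torn overwrite in the middle of the file, only a partial record at the end.

```c
#include <assert.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Count the complete fixed-size records in the file at `path`.
 * `*torn_bytes` receives the size of a trailing partial record, or 0.
 * Returns the number of complete records, or -1 on error.
 */
static long long count_records(const char *path, off_t record_size,
			       off_t *torn_bytes)
{
	struct stat st;
	int fd, ret;

	assert(path != NULL && record_size > 0 && torn_bytes != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;
	ret = fstat(fd, &st);
	close(fd);
	if (ret < 0)
		return -1;

	/* Anything past the last full record is a torn tail. */
	*torn_bytes = st.st_size % record_size;
	return (long long)(st.st_size / record_size);
}
```

With 128k records and a write cut off after 36k, this reports the complete records written earlier plus 36k torn bytes at the end.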
* Re: [PATCH 2/2] Make write(2) interruptible by a signal
2011-11-23 20:29 ` Andrew Morton
@ 2011-11-24 19:27 ` Matthew Wilcox
2011-11-24 20:53 ` Ted Ts'o
2011-11-24 20:53 ` Jan Kara
0 siblings, 2 replies; 38+ messages in thread
From: Matthew Wilcox @ 2011-11-24 19:27 UTC (permalink / raw)
To: Andrew Morton
Cc: Theodore Tso, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Wed, Nov 23, 2011 at 12:29:48PM -0800, Andrew Morton wrote:
> > Still, if it's ***only*** for SIGKILL, we'll probably be OK, since
> > for that one case there's no chance userspace can intercept the signal,
> > so it can't do any recovery anyway. (I could imagine some HPC program
> > doing a massive 2GB write, and some user of that program depending on
> > the fact that he can kill it at a predefined place by sending a SIGKILL
> > and knowing that the file would be written up to that 2GB chunk --- but
> > that's clearly an edge situation, as opposed to something that would
> > affect most GNOME and KDE apps.) We just need to make sure we never try
> > to do this for any other signal that could be caught, such as SIGINT or
> > SIGQUIT or (worse yet) SIGTSTP.
>
> That it is a fatal SIGKILL means that the *current* application doesn't
> care. But other processes will sometimes notice this change.
> Previously if an app did write(file, 128k) and was hit with SIGKILL, it
> would write either 0 bytes or 128k bytes. Now, it can write 36k bytes,
> yes? If the target file consisted of a stream of 128k records then the
> user will claim, with some justification, that Linux corrupted it.
On the other hand, if there was a crash mid-write, they might also get a
36k write that actually hit media (right? Or do we guarantee that on
reboot you see a multiple of 128k?)
We could put in some nice code that rewinds i_size to where it used to
be if the write was interrupted. Or do we need to? Presumably write_end
would not have been called, so i_size would not have been updated.
> Dunno. People do lots of weird and flakey things. I have a suspicion
> that we'll be hearing back from them about this change.
The problem is that we may not hear from them for 6 years ... or whenever
they decide to move off RHEL 3.
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
* Re: [PATCH 2/2] Make write(2) interruptible by a signal
2011-11-24 19:27 ` Matthew Wilcox
@ 2011-11-24 20:53 ` Ted Ts'o
2011-11-25 0:10 ` Matthew Wilcox
2011-11-24 20:53 ` Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Ted Ts'o @ 2011-11-24 20:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Thu, Nov 24, 2011 at 12:27:11PM -0700, Matthew Wilcox wrote:
>
> On the other hand, if there was a crash mid-write, they might also get a
> 36k write that actually hit media (right? Or do we guarantee that on
> reboot you see a multiple of 128k?)
Sure, but in the case of a crash we expect things to be in a wonky
state. The problem is if people assume atomic writes to files in a
non-crash case, which has been a traditional Unix/Linux "feature".
It's guaranteed by the standards as much as "close() implies fsync()",
but once application programmers start coding to such assumptions,
they refuse to admit they were wrong, and blame the kernel
programmers.
> > Dunno. People do lots of weird and flakey things. I have a suspicion
> > that we'll be hearing back from them about this change.
>
> The problem is that we may not hear from them for 6 years ... or whenever
> they decide to move off RHEL 3.
Funny you mention RHEL 3. I once had to fly on site to a customer
to debug a problem: their application programs depended on the order of
addresses handed back from mmap(), and that changed between 2.4 and 2.6,
and hence between RHEL 3 and RHEL 4 --- which turned out to be caused by
a change in an undocumented behavior of mmap(2)...
(Well, it didn't break per se, but it turned an algorithm hidden
inside their app that had been O(n) to O(n**2), and that caused it to
slow down to the point that it was stalling long enough to make other
programs time out. All the customer could tell us was "Og had
program. Program worked on RHEL 3. Broke on RHEL 4. Og need
support." :-)
Still, I'm not too worried about those folks. Those are the customers
who keep people who do advanced support at Red Hat and IBM employed. :-)
I'm worried about the problems that break thousands of users using
open source code --- and so long as it only happens on SIGKILL, and
not on any other signal, it's _probably_ going to be ok.
- Ted
* Re: [PATCH 2/2] Make write(2) interruptible by a signal
2011-11-24 19:27 ` Matthew Wilcox
2011-11-24 20:53 ` Ted Ts'o
@ 2011-11-24 20:53 ` Jan Kara
1 sibling, 0 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-24 20:53 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Andrew Morton, Theodore Tso, Wu Fengguang, Jan Kara,
Christoph Hellwig, Al Viro, linux-fsdevel@vger.kernel.org
On Thu 24-11-11 12:27:11, Matthew Wilcox wrote:
> On Wed, Nov 23, 2011 at 12:29:48PM -0800, Andrew Morton wrote:
> > > Still, if it's ***only*** for SIGKILL, we'll probably be OK, since
> > > for that one case there's no chance userspace can intercept the signal,
> > > so it can't do any recovery anyway. (I could imagine some HPC program
> > > doing a massive 2GB write, and some user of that program depending on
> > > the fact that he can kill it at a predefined place by sending a SIGKILL
> > > and knowing that the file would be written up to that 2GB chunk --- but
> > > that's clearly an edge situation, as opposed to something that would
> > > affect most GNOME and KDE apps.) We just need to make sure we never try
> > > to do this for any other signal that could be caught, such as SIGINT or
> > > SIGQUIT or (worse yet) SIGTSTP.
> >
> > That it is a fatal SIGKILL means that the *current* application doesn't
> > care. But other processes will sometimes notice this change.
> > Previously if an app did write(file, 128k) and was hit with SIGKILL, it
> > would write either 0 bytes or 128k bytes. Now, it can write 36k bytes,
> > yes? If the target file consisted of a stream of 128k records then the
> > user will claim, with some justification, that Linux corrupted it.
>
> On the other hand, if there was a crash mid-write, they might also get a
> 36k write that actually hit media (right? Or do we guarantee that on
> reboot you see a multiple of 128k?)
They might see only 36k written on all filesystems I know of...
> We could put in some nice code that rewinds i_size to where it used to
> be if the write was interrupted. Or do we need to? Presumably write_end
> would not have been called, so i_size would not have been updated.
But that would solve only extending writes and not overwrites. So I'm not
sure it would be worth it...
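
To make the extending-write case concrete, here is a minimal userspace
sketch, assuming an append-only file of fixed-size records as in Andrew's
128k example. The helper names torn_tail_bytes() and trim_torn_tail() are
hypothetical and not part of any patch in this thread. It shows why
trimming back to a record boundary can repair a torn append, and by the
same token why it does nothing for a partially completed overwrite, where
the file size never changes.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: number of bytes that lie past the last complete
 * fixed-size record in a file of the given size.
 */
static off_t torn_tail_bytes(off_t file_size, off_t record_size)
{
	assert(record_size > 0);
	return file_size % record_size;
}

/*
 * Hypothetical helper: drop a torn trailing record from an append-only
 * record file. Returns the number of bytes removed, or -1 on error
 * (errno is set by fstat() or ftruncate()).
 *
 * This only helps when the interrupted write was extending the file.
 * A torn overwrite leaves the size unchanged, so there is no tail here
 * to detect.
 */
static off_t trim_torn_tail(int fd, off_t record_size)
{
	struct stat st;
	off_t tail;

	if (fstat(fd, &st) < 0)
		return -1;

	tail = torn_tail_bytes(st.st_size, record_size);
	if (tail != 0 && ftruncate(fd, st.st_size - tail) < 0)
		return -1;
	return tail;
}
```

With 128k records, a file that ends in a 36k fragment has its last 36k
removed; a file whose size is already a multiple of 128k is left alone.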
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] Make write(2) interruptible by a signal
2011-11-24 20:53 ` Ted Ts'o
@ 2011-11-25 0:10 ` Matthew Wilcox
0 siblings, 0 replies; 38+ messages in thread
From: Matthew Wilcox @ 2011-11-25 0:10 UTC (permalink / raw)
To: Ted Ts'o
Cc: Andrew Morton, Wu Fengguang, Jan Kara, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org
On Thu, Nov 24, 2011 at 03:53:43PM -0500, Ted Ts'o wrote:
> On Thu, Nov 24, 2011 at 12:27:11PM -0700, Matthew Wilcox wrote:
> > On the other hand, if there was a crash mid-write, they might also get a
> > 36k write that actually hit media (right? Or do we guarantee that on
> > reboot you see a multiple of 128k?)
>
> Sure, but in the case of a crash we expect things to be in a wonky
> state. The problem is if people assume atomic writes to files in a
> non-crash case, which has been a traditional Unix/Linux "feature".
> It's guaranteed by the standards as much as "close() implies fsync()",
> but once application programmers start coding to such assumptions,
> they refuse to admit they were wrong, and blame the kernel
> programmers.
Sure, but resorting to kill -9 is almost the same as pushing the BRS.
Nobody's arguing in favour of non-fatal signals interrupting write()
[well, Honza was earlier, but we all talked him out of it].
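
For contrast, an application that does not lean on the all-or-nothing
assumption usually wraps write(2) in a retry loop. The following is a
minimal sketch of that common idiom; the name write_all() is illustrative
and not something proposed in this thread.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Illustrative helper: keep calling write(2) until all len bytes are
 * written. Short writes are continued from where they stopped, and
 * EINTR (interrupted before any byte was transferred) is retried.
 * Returns 0 on success, -1 on any other error with errno set.
 */
static int write_all(int fd, const void *buf, size_t len)
{
	const char *p = buf;

	assert(buf != NULL || len == 0);

	while (len > 0) {
		ssize_t n = write(fd, p, len);

		if (n < 0) {
			if (errno == EINTR)
				continue;	/* nothing written yet, retry */
			return -1;
		}
		p += n;			/* advance past the bytes written */
		len -= (size_t)n;
	}
	return 0;
}
```

Such a loop copes with short writes in general, but it cannot help with
SIGKILL: the process is gone before it can retry, which is why the
on-disk result is what matters in that case.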
--
Matthew Wilcox Intel Open Source Technology Centre
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-23 15:06 ` Theodore Tso
@ 2011-11-28 3:08 ` Wu Fengguang
2011-11-28 3:33 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
2011-11-29 14:16 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
0 siblings, 2 replies; 38+ messages in thread
From: Wu Fengguang @ 2011-11-28 3:08 UTC (permalink / raw)
To: Theodore Tso
Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
Matthew Wilcox
On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
>
> On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
>
> >
> > Reading Ted's information feed, I tend to disregard the partial write
> > issue: since the "broken" applications will already fail and get
> > punished in various other cases, I don't care adding one more penalty
> > case to them :-P
>
> Just wait until you have a bunch of rabid application programmers,
> questioning your judgement, your morality, and even your paternity.
> :-)
Ah OK, that sounds frightening. Hmm, until now everyone has
acknowledged the possibility of data corruption; people only
differ in their perception of its severity.
Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
I'm afraid not. Considering the origin of this patch
[BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
http://www.spinics.net/lists/linux-ext4/msg28464.html
I *think* Jan's first patch is already enough for fixing the bug. IOW
the patch we worried about and discussed so much is really an optional one. I
would imagine the easy and safe solution is to just drop it. Any
objections?
Thanks,
Fengguang
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH] writeback: add a safety limit to the SIGKILL abort
2011-11-28 3:08 ` Wu Fengguang
@ 2011-11-28 3:33 ` Wu Fengguang
2011-11-29 14:18 ` Jan Kara
2011-11-29 14:16 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
1 sibling, 1 reply; 38+ messages in thread
From: Wu Fengguang @ 2011-11-28 3:33 UTC (permalink / raw)
To: Theodore Tso
Cc: Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
Matthew Wilcox
On Mon, Nov 28, 2011 at 11:08:42AM +0800, Wu Fengguang wrote:
> On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> >
> > On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> >
> > >
> > > Reading Ted's information feed, I tend to disregard the partial write
> > > issue: since the "broken" applications will already fail and get
> > > punished in various other cases, I don't care adding one more penalty
> > > case to them :-P
> >
> > Just wait until you have a bunch of rabid application programmers,
> > questioning your judgement, your morality, and even your paternity.
> > :-)
>
> Ah OK, that sounds frightening. Hmm, until now everyone has
> acknowledged the possibility of data corruption; people only
> differ in their perception of its severity.
>
> Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
> I'm afraid not. Considering the origin of this patch
>
> [BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
> http://www.spinics.net/lists/linux-ext4/msg28464.html
>
> I *think* Jan's first patch is already enough for fixing the bug. IOW
> the patch we worried about and discussed so much is really an optional one. I
> would imagine the easy and safe solution is to just drop it. Any
> objections?
Here is the replacement patch.
---
Subject: writeback: add a safety limit to the SIGKILL abort
Date: Mon Nov 28 11:16:54 CST 2011
This adds a safety limit to the SIGKILL abort in commit 499d05ecf990
("mm: Make task in balance_dirty_pages() killable"). This will avoid
dirty pages rushing arbitrarily high in the case some task receives
SIGKILL and hence becomes *unthrottled* when doing a huge write.
The alternative is to check for SIGKILL and return a partial write in
generic_perform_write(). However, that would lead to data corruption, as
Andrew Morton put it:
Previously if an app did write(file, 128k) and was hit with SIGKILL, it
would write either 0 bytes or 128k bytes. Now, it can write 36k bytes,
yes? If the target file consisted of a stream of 128k records then the
user will claim, with some justification, that Linux corrupted it.
Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
---
mm/page-writeback.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
--- linux-next.orig/mm/page-writeback.c 2011-11-28 11:13:58.000000000 +0800
+++ linux-next/mm/page-writeback.c 2011-11-28 11:16:52.000000000 +0800
@@ -1136,7 +1136,8 @@ pause:
if (task_ratelimit)
break;
- if (fatal_signal_pending(current))
+ if (fatal_signal_pending(current) &&
+ nr_dirty <= dirty_thresh + dirty_thresh / 2)
break;
}
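
The effect of the new condition can be modelled in userspace. This is a
sketch only, not kernel code: may_leave_throttle_loop() is a hypothetical
name, and its fatal_pending argument stands in for
fatal_signal_pending(current). A killed task escapes the throttling loop
only while the dirty page count is at or below 1.5 times dirty_thresh.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/*
 * Userspace model of the patched check in the pause loop.
 * Returns true when a task with a fatal signal pending would be
 * allowed to break out of throttling.
 */
static bool may_leave_throttle_loop(bool fatal_pending,
				    unsigned long nr_dirty,
				    unsigned long dirty_thresh)
{
	unsigned long limit;

	/* Keep the arithmetic of this model from wrapping around. */
	assert(dirty_thresh <= ULONG_MAX / 2);

	/* Safety limit from the patch: 1.5 * dirty_thresh, integer math. */
	limit = dirty_thresh + dirty_thresh / 2;

	return fatal_pending && nr_dirty <= limit;
}
```

For dirty_thresh = 100, a killed task leaves the loop at nr_dirty = 150
but stays throttled at nr_dirty = 151, and a task without a fatal signal
pending never leaves through this branch.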
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 2/2] fs: Make write(2) interruptible by a signal
2011-11-28 3:08 ` Wu Fengguang
2011-11-28 3:33 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
@ 2011-11-29 14:16 ` Jan Kara
1 sibling, 0 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-29 14:16 UTC (permalink / raw)
To: Wu Fengguang
Cc: Theodore Tso, Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
Matthew Wilcox
On Mon 28-11-11 11:08:42, Wu Fengguang wrote:
> On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> >
> > On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> >
> > >
> > > Reading Ted's information feed, I tend to disregard the partial write
> > > issue: since the "broken" applications will already fail and get
> > > punished in various other cases, I don't care adding one more penalty
> > > case to them :-P
> >
> > Just wait until you have a bunch of rabid application programmers,
> > questioning your judgement, your morality, and even your paternity.
> > :-)
>
> Ah OK, that sounds frightening. Hmm, until now everyone has
> acknowledged the possibility of data corruption; people only
> differ in their perception of its severity.
>
> Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
> I'm afraid not. Considering the origin of this patch
>
> [BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
> http://www.spinics.net/lists/linux-ext4/msg28464.html
>
> I *think* Jan's first patch is already enough for fixing the bug. IOW
> the patch we worried about and discussed so much is really an optional one. I
> would imagine the easy and safe solution is to just drop it. Any
> objections?
I still think aborting the write is a cleaner solution. But I don't want to
argue to death about this, so if you still don't think it's a good idea,
I'll respect your decision.
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH] writeback: add a safety limit to the SIGKILL abort
2011-11-28 3:33 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
@ 2011-11-29 14:18 ` Jan Kara
0 siblings, 0 replies; 38+ messages in thread
From: Jan Kara @ 2011-11-29 14:18 UTC (permalink / raw)
To: Wu Fengguang
Cc: Theodore Tso, Jan Kara, Andrew Morton, Christoph Hellwig, Al Viro,
linux-fsdevel@vger.kernel.org, Kazuya Mio, Dmitry Monakhov,
Matthew Wilcox
On Mon 28-11-11 11:33:40, Wu Fengguang wrote:
> On Mon, Nov 28, 2011 at 11:08:42AM +0800, Wu Fengguang wrote:
> > On Wed, Nov 23, 2011 at 11:06:56PM +0800, Theodore Ts'o wrote:
> > >
> > > On Nov 23, 2011, at 8:27 AM, Wu Fengguang wrote:
> > >
> > > >
> > > > Reading Ted's information feed, I tend to disregard the partial write
> > > > issue: since the "broken" applications will already fail and get
> > > > punished in various other cases, I don't care adding one more penalty
> > > > case to them :-P
> > >
> > > Just wait until you have a bunch of rabid application programmers,
> > > questioning your judgement, your morality, and even your paternity.
> > > :-)
> >
> > Ah OK, that sounds frightening. Hmm, until now everyone has
> > acknowledged the possibility of data corruption; people only
> > differ in their perception of its severity.
> >
> > Let's rethink things this way: "Is it a _worthwhile_ risk at all?"
> > I'm afraid not. Considering the origin of this patch
> >
> > [BUG] aborted ext4 leads to inifinity loop in balance_dirty_pages
> > http://www.spinics.net/lists/linux-ext4/msg28464.html
> >
> > I *think* Jan's first patch is already enough for fixing the bug. IOW
> > the patch we worried about and discussed so much is really an optional one. I
> > would imagine the easy and safe solution is to just drop it. Any
> > objections?
>
> Here is the replacement patch.
>
> ---
> Subject: writeback: add a safety limit to the SIGKILL abort
> Date: Mon Nov 28 11:16:54 CST 2011
>
> This adds a safety limit to the SIGKILL abort in commit 499d05ecf990
> ("mm: Make task in balance_dirty_pages() killable"). This will avoid
> dirty pages rushing arbitrarily high in the case some task receives
> SIGKILL and hence becomes *unthrottled* when doing a huge write.
>
> The alternative is to check for SIGKILL and return a partial write in
> generic_perform_write(). However, that would lead to data corruption, as
> Andrew Morton put it:
>
> Previously if an app did write(file, 128k) and was hit with SIGKILL, it
> would write either 0 bytes or 128k bytes. Now, it can write 36k bytes,
> yes? If the target file consisted of a stream of 128k records then the
> user will claim, with some justification, that Linux corrupted it.
>
> Signed-off-by: Wu Fengguang <fengguang.wu@intel.com>
The patch looks good. You can add:
Reviewed-by: Jan Kara <jack@suse.cz>
Honza
> ---
> mm/page-writeback.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> --- linux-next.orig/mm/page-writeback.c 2011-11-28 11:13:58.000000000 +0800
> +++ linux-next/mm/page-writeback.c 2011-11-28 11:16:52.000000000 +0800
> @@ -1136,7 +1136,8 @@ pause:
> if (task_ratelimit)
> break;
>
> - if (fatal_signal_pending(current))
> + if (fatal_signal_pending(current) &&
> + nr_dirty <= dirty_thresh + dirty_thresh / 2)
> break;
> }
>
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR
^ permalink raw reply [flat|nested] 38+ messages in thread
end of thread, other threads:[~2011-11-29 14:18 UTC | newest]
Thread overview: 38+ messages
2011-11-16 11:12 [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Jan Kara
2011-11-16 11:12 ` [PATCH 1/2] mm: " Jan Kara
2011-11-16 11:28 ` Wu Fengguang
2011-11-16 12:58 ` Jan Kara
2011-11-16 11:12 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-16 11:44 ` Wu Fengguang
2011-11-16 12:54 ` Jan Kara
2011-11-16 13:11 ` Wu Fengguang
2011-11-22 22:28 ` Andrew Morton
2011-11-23 9:05 ` Wu Fengguang
2011-11-23 9:50 ` Andrew Morton
2011-11-23 12:27 ` [PATCH 2/2] " Theodore Tso
2011-11-23 20:29 ` Andrew Morton
2011-11-24 19:27 ` Matthew Wilcox
2011-11-24 20:53 ` Ted Ts'o
2011-11-25 0:10 ` Matthew Wilcox
2011-11-24 20:53 ` Jan Kara
2011-11-23 13:08 ` [PATCH 2/2] fs: " Jan Kara
2011-11-23 13:27 ` Wu Fengguang
2011-11-23 15:06 ` Theodore Tso
2011-11-28 3:08 ` Wu Fengguang
2011-11-28 3:33 ` [PATCH] writeback: add a safety limit to the SIGKILL abort Wu Fengguang
2011-11-29 14:18 ` Jan Kara
2011-11-29 14:16 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-16 11:23 ` [PATCH 0/2 v3] Make task in balance_dirty_pages() killable Wu Fengguang
-- strict thread matches above, loose matches on Subject: below --
2011-11-14 16:15 [PATCH 0/2 v2] " Jan Kara
2011-11-14 16:15 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-14 16:26 ` Christoph Hellwig
2011-11-14 16:46 ` Jan Kara
2011-11-14 20:13 ` Christoph Hellwig
2011-11-14 22:19 ` Andrew Morton
2011-11-15 11:23 ` Jan Kara
2011-11-14 11:10 [PATCH 0/2] Make task doing heavy writing killable Jan Kara
2011-11-14 11:10 ` [PATCH 2/2] fs: Make write(2) interruptible by a signal Jan Kara
2011-11-14 12:12 ` Matthew Wilcox
2011-11-14 12:15 ` Wu Fengguang
2011-11-14 12:34 ` Jan Kara
2011-11-14 14:16 ` Matthew Wilcox
2011-11-14 15:30 ` Jan Kara
2011-11-14 18:44 ` Jeremy Allison