Question: errors=continue behaviour for failed external journal device

public inbox for linux-ext4@vger.kernel.org

* Question: errors=continue behaviour for failed external journal device
@ 2014-07-26 23:07 Vlad Dobrotescu
  2014-07-27  0:07 ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Vlad Dobrotescu @ 2014-07-26 23:07 UTC (permalink / raw)
  To: linux-ext4

If this isn't the proper place for this question, please point me in 
the right direction.

I couldn't find any description of ext4's behaviour when mounted
with errors=continue and an external journal if the journal block device
is unavailable at mount time (or becomes unavailable at some point).

I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and
(probably) full data journaling on an SSD. Can someone help?

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-26 23:07 Question: errors=continue behaviour for failed external journal device Vlad Dobrotescu
@ 2014-07-27  0:07 ` Theodore Ts'o
  2014-07-27  0:34   ` Vlad Dobrotescu
  2014-07-28  9:11   ` Lukáš Czerner
  0 siblings, 2 replies; 10+ messages in thread
From: Theodore Ts'o @ 2014-07-27  0:07 UTC (permalink / raw)
  To: Vlad Dobrotescu; +Cc: linux-ext4

On Sat, Jul 26, 2014 at 11:07:59PM +0000, Vlad Dobrotescu wrote:
> If this isn't the proper place for this question, please point me in 
> the right direction.
> 
> I couldn't find any description on Ext4's behaviour when mounted 
> with errors=continue and external journal if the journal block device 
> is unavailable at mount time (or becomes unavailable at some point).
> 
> I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and 
> (probably) full data journaling on a SSD. Can someone help?

So there are two different questions.

If you use errors=continue, there is a chance that the file system
inconsistencies that are discovered could cause further file system
damage, which might lead to the loss or corruption of data files
written earlier.  So it's not really recommended for most purposes,
unless you have some scheme where you are monitoring dmesg and have
some strategy to deal with detected file system errors, or when the
system absolutely, positively must continue running, and this is more
important than potential data loss.
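
As a rough illustration of the kind of dmesg monitoring mentioned above,
the following Python sketch counts "EXT4-fs error" lines per device in a
chunk of kernel log text.  The function name and the sample log are
assumptions made up for this example, and only the "EXT4-fs error
(device ...)" prefix is relied upon, since the rest of the message
varies between kernel versions.

```python
import re

# Matches the prefix of kernel log lines such as:
#   EXT4-fs error (device sdb1): <function>:<line>: <message>
# Only this prefix is assumed; the rest of the line is ignored.
EXT4_ERROR_RE = re.compile(r"EXT4-fs error \(device (?P<dev>[^)]+)\)")


def find_ext4_errors(log_text):
    """Return a dict mapping device name -> number of EXT4-fs error lines."""
    counts = {}
    for line in log_text.splitlines():
        match = EXT4_ERROR_RE.search(line)
        if match:
            dev = match.group("dev")
            counts[dev] = counts.get(dev, 0) + 1
    return counts


# Illustrative input only; these log lines are invented for the example.
SAMPLE_LOG = """\
[  101.5] EXT4-fs (sdb1): mounted filesystem with journalled data mode
[  202.7] EXT4-fs error (device sdb1): example_function:123: example message
[  303.9] EXT4-fs error (device sdb1): example_function:456: example message
"""

if __name__ == "__main__":
    print(find_ext4_errors(SAMPLE_LOG))
```

In practice the text would come from the kernel log (for instance the
output of dmesg), and a non-empty result would be the trigger for
whatever strategy is in place, such as scheduling an fsck.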

If the journal block device is not present then the file system can't
be mounted, and if the system was uncleanly shut down you won't be
able to recover from the unclean shutdown by replaying the journal.

If the journal block device is *gone*, it is possible to remove the
external journal block device, and then force a file system repair,
but if this happens after an unclean shutdown, you may very well lose
data.
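
One possible shape of that last-resort procedure, sketched in Python so
that the commands are only built and printed, never executed.  The
helper name and the /dev/sdX1 placeholder are assumptions for this
example.  The tune2fs and e2fsck flags are the documented ones, but
depending on the e2fsprogs version and on whether the file system is
flagged as needing recovery, tune2fs may refuse or require further
forcing, so the man pages should be checked first.

```python
def journal_removal_commands(fs_device):
    """Build (but do not run) the commands for dropping a lost external journal.

    fs_device is the block device holding the file system itself,
    not the missing journal device.
    """
    return [
        # -f forces tune2fs to proceed although the journal device is absent;
        # -O ^has_journal clears the journal feature flag.
        ["tune2fs", "-f", "-O", "^has_journal", fs_device],
        # -f forces a full check even if the file system is marked clean.
        ["e2fsck", "-f", fs_device],
    ]


if __name__ == "__main__":
    # /dev/sdX1 is a placeholder, not a real device name.
    for cmd in journal_removal_commands("/dev/sdX1"):
        print(" ".join(cmd))
```

Printing rather than running the commands is deliberate: after an
unclean shutdown this sequence discards whatever was still in the
journal, so it should be a conscious manual step.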

Cheers,

						- Ted

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-27  0:07 ` Theodore Ts'o
@ 2014-07-27  0:34   ` Vlad Dobrotescu
  2014-07-27  1:07     ` Theodore Ts'o
  2014-07-28  9:11   ` Lukáš Czerner
  1 sibling, 1 reply; 10+ messages in thread
From: Vlad Dobrotescu @ 2014-07-27  0:34 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: linux-ext4

On 26/07/2014 20:07, Theodore Ts'o wrote:
> On Sat, Jul 26, 2014 at 11:07:59PM +0000, Vlad Dobrotescu wrote:
>> If this isn't the proper place for this question, please point me in
>> the right direction.
>>
>> I couldn't find any description on Ext4's behaviour when mounted
>> with errors=continue and external journal if the journal block device
>> is unavailable at mount time (or becomes unavailable at some point).
>>
>> I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and
>> (probably) full data journaling on a SSD. Can someone help?
> So there are two different questions.
>
> If you use errors=continue, there is the chance that the file system
> inconsistencies that are discovered could cause further file system
> damage, which might lead to the loss or corruption of data files
> written earlier.  So it's not really recommended for most purposes,
> unless you have some scheme where you are monitoring dmesgs and having
> some strategy to deal with detected file system errors, or when the
> system absolutely, positively must continue running, and this is more
> important than potential data loss.
>
> If the journal block device is not present then the file system can't
> be mounted, and if the system was uncleanly shut down you won't be
> able to recover from the unclean shutdown by replaying the journal.
>
> If the journal block device is *gone*, it is possible to remove the
> external journal block device, and then force a file system repair,
> but if this happens after an unclean shutdown, you may very well lose
> data.
>
> Cheers,
>
> 						- Ted

Sorry if this is a duplicate, but the "Followup" didn't seem to work for me.

Thanks for the quick and detailed answer. If I understand it correctly,
the errors= option has nothing to do with journaling, but only with FS
consistency issues (which can be caused by a vanished journal, but also
by other events), while the mounting itself fails in the absence of the
device specified for external journaling, with no fall-back alternative.
Right?

Vlad


* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-27  0:34   ` Vlad Dobrotescu
@ 2014-07-27  1:07     ` Theodore Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2014-07-27  1:07 UTC (permalink / raw)
  To: Vlad Dobrotescu; +Cc: linux-ext4

On Sat, Jul 26, 2014 at 08:34:45PM -0400, Vlad Dobrotescu wrote:
> 
> Thanks for the quick and detailed answer. If I understand it correctly,
> the errors= option has nothing to do with journaling, but only with FS
> consistency issues (which can be caused by a vanished journal,

The errors= option has to do with how the system will react when it
discovers a file system inconsistency (for example, while deleting a
file, it discovers that the blocks it is trying to free are already
freed, etc.).  errors=continue is the "don't worry, be happy" option.
This can sometimes work out, but it's much like ignoring a late
mortgage payment notice from the bank.  Most of the time, sooner or
later, it catches up to you.  :-)
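
The three values that ext4 accepts for errors= can be summarised with a
toy model like the one below.  It is only an illustration of the policy
choice described above, written in Python with invented names; it does
not mirror the kernel's actual code paths.

```python
from enum import Enum


class ErrorsPolicy(Enum):
    """The values accepted by the ext4 errors= mount option."""
    CONTINUE = "continue"
    REMOUNT_RO = "remount-ro"
    PANIC = "panic"


def react_to_inconsistency(policy):
    """Describe what happens after an inconsistency is detected.

    Toy model for illustration only.
    """
    if policy is ErrorsPolicy.CONTINUE:
        return "log the error and keep the file system writable"
    if policy is ErrorsPolicy.REMOUNT_RO:
        return "log the error and remount the file system read-only"
    if policy is ErrorsPolicy.PANIC:
        return "log the error and panic the kernel"
    raise ValueError(f"unknown policy: {policy!r}")


if __name__ == "__main__":
    for policy in ErrorsPolicy:
        print(f"errors={policy.value}: {react_to_inconsistency(policy)}")
```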

> by other events), while the mounting itself fails in the absence of the
> device specified for external journaling, with no fall-back alternative.

Your question about what happens if the journal is missing is much
like the question, "suppose I have a RAID 0 setup, and I'm missing
one of the disks: what can we do?"  Basically, nothing.  In a
desperation scenario, there are ways you can forcibly tell the system
to pretend that there is no journal, just like you can pretend that
the system should ignore 20% of a missing RAID 0 array and have the
LVM replace the missing disk with zero blocks, but the results are
very likely to lead to data loss.

Cheers,

					- Ted

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-27  0:07 ` Theodore Ts'o
  2014-07-27  0:34   ` Vlad Dobrotescu
@ 2014-07-28  9:11   ` Lukáš Czerner
  2014-07-28 13:17     ` Theodore Ts'o
  1 sibling, 1 reply; 10+ messages in thread
From: Lukáš Czerner @ 2014-07-28  9:11 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Vlad Dobrotescu, linux-ext4

On Sat, 26 Jul 2014, Theodore Ts'o wrote:

> Date: Sat, 26 Jul 2014 20:07:33 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Vlad Dobrotescu <vlad@dobrotescu.ca>
> Cc: linux-ext4@vger.kernel.org
> Subject: Re: Question: errors=continue behaviour for failed external journal
>     device
> 
> On Sat, Jul 26, 2014 at 11:07:59PM +0000, Vlad Dobrotescu wrote:
> > If this isn't the proper place for this question, please point me in 
> > the right direction.
> > 
> > I couldn't find any description on Ext4's behaviour when mounted 
> > with errors=continue and external journal if the journal block device 
> > is unavailable at mount time (or becomes unavailable at some point).
> > 
> > I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and 
> > (probably) full data journaling on a SSD. Can someone help?
> 
> So there are two different questions.
> 
> If you use errors=continue, there is the chance that the file system
> inconsistencies that are discovered could cause further file system
> damage, which might lead to the loss or corruption of data files
> written earlier.  So it's not really recommended for most purposes,

I very much agree with that, which is why I was quite surprised to
find out recently that this is the default. I was living under the
delusion that the default was ERRORS_RO for as long as I can remember.
So my question is, should we change it? This really does not seem
like a sane default.

Thanks!
-Lukas

> unless you have some scheme where you are monitoring dmesgs and having
> some strategy to deal with detected file system errors, or when the
> system absolutely, positively must continue running, and this is more
> important than potential data loss. 
> 
> If the journal block device is not present then the file system can't
> be mounted, and if the system was uncleanly shut down you won't be
> able to recover from the unclean shutdown by replaying the journal.
> 
> If the journal block device is *gone*, it is possible to remove the
> external journal block device, and then force a file system repair,
> but if this happens after an unclean shutdown, you may very well lose
> data.
> 
> Cheers,
> 
> 						- Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-28  9:11   ` Lukáš Czerner
@ 2014-07-28 13:17     ` Theodore Ts'o
  2014-07-28 13:25       ` Lukáš Czerner
  2014-07-28 16:09       ` Darrick J. Wong
  0 siblings, 2 replies; 10+ messages in thread
From: Theodore Ts'o @ 2014-07-28 13:17 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Vlad Dobrotescu, linux-ext4

On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote:
> 
> I very much agree with that, that's why I was quite surprised that I
> found out recently that this is the default. I was living in the
> delusion that the default was ERRORS_RO for as long as I can remember.
> So my question is, should we change it ? This really does not seem
> like a sane default.

Yeah, I've been thinking that this would be a good thing to change for
1.43.

errors=continue is the default only for historical reasons.  I could
imagine some system administrators being surprised when all of a
sudden their production systems start getting lots of EROFS errors
reported by applications.  So I could potentially imagine some Help
Desks / Support folks at distributions not being enthusiastic about
such a change.
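
For anyone wanting to check which behaviour a given file system has
stored in its superblock, the "Errors behavior" line printed by
tune2fs -l is the place to look.  The Python sketch below extracts that
field from already-captured output; the function name and the abridged
sample text are assumptions for this example.

```python
def errors_behavior(tune2fs_output):
    """Return the 'Errors behavior' value from `tune2fs -l` output, or None."""
    for line in tune2fs_output.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "errors behavior":
            return value.strip()
    return None


# Abridged, made-up sample in the layout that `tune2fs -l` prints.
SAMPLE = """\
Filesystem volume name:   <none>
Filesystem state:         clean
Errors behavior:          Continue
Filesystem OS type:       Linux
"""

if __name__ == "__main__":
    print(errors_behavior(SAMPLE))
```

The stored value can be changed with tune2fs -e, and an explicit
errors= mount option takes precedence over it for that mount.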

Hmm.... we are starting to have some errors where we can allow the
system to stagger on, even if we need to disallow new allocations in
some block groups.  I wonder if it is worthwhile to have a "continue
for correctable errors" mode.  The danger, of course, is that some
errors, even if they are correctable (such as freeing a block which is
already freed), could be a hint that there are other fs corruptions,
not yet detected, that might lead to data loss unless we reboot and
fsck, or remount readonly, right away.  So the question is: while
there is some value, is it worth the added complexity to add an
"errors=continue-correctable" option?

							- Ted

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-28 13:17     ` Theodore Ts'o
@ 2014-07-28 13:25       ` Lukáš Czerner
  2014-07-28 13:31         ` Vlad Dobrotescu
  2014-07-28 16:09       ` Darrick J. Wong
  1 sibling, 1 reply; 10+ messages in thread
From: Lukáš Czerner @ 2014-07-28 13:25 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Vlad Dobrotescu, linux-ext4

On Mon, 28 Jul 2014, Theodore Ts'o wrote:

> Date: Mon, 28 Jul 2014 09:17:42 -0400
> From: Theodore Ts'o <tytso@mit.edu>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: Vlad Dobrotescu <vlad@dobrotescu.ca>, linux-ext4@vger.kernel.org
> Subject: Re: Question: errors=continue behaviour for failed external journal
>     device
> 
> On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote:
> > 
> > I very much agree with that, that's why I was quite surprised that I
> > found out recently that this is the default. I was living in the
> > delusion that the default was ERRORS_RO for as long as I can remember.
> > So my question is, should we change it ? This really does not seem
> > like a sane default.
> 
> Yeah, I've been thinking that this would be a good thing to change for
> 1.43.
> 
> The only reason that errors=continue was the default was for
> historical reasons.  I could imagine some system administrators being
> surprised when all of a sudden their production systems start getting
> lots of EROFS errors getting reported by applications.  So I could
> potentially imagine some Help Desks / Support folks at distributions
> not being enthusiastic about such a change.
> 
> Hmm.... we are starting to have some errors where we can allow the
> system to stagger on, even if we need to disallow new allocations in
> some block groups.  I wonder if it is worthwhile to have a "continue
> for correctable errors".  The danger, of course, is that some errors,
> even if they are correctable, (such as freeing a block which is
> already freed), could be a hint that there are other fs corruptions,
> not yet detected, that might lead to data loss if we reboot and fsck,
> or remount readonly right away.  So the question is while there is
> some value, is it worth the added complexity to add an
> "errors=continue-correctable" option?

Right,

I like the idea of the new errors option, even though the name is a
bit long (maybe "auto"?).  It would try its best to continue, but
would be allowed to remount read-only if we cannot recover from the
error.

This would, however, need some work to make it reliable and, most
importantly, a fair amount of testing, though I think it's worth the
effort.
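
To make the proposal concrete, here is a minimal Python sketch of the
decision such an option would make.  Everything in it is hypothetical:
"continue-correctable" is only the name floated in this thread, and the
idea that an error arrives already labelled as correctable or not is an
assumption of the example, not something ext4 provides.

```python
def react(policy, correctable):
    """Decide the reaction to a detected error under a given policy.

    "continue-correctable" is hypothetical, as is the `correctable` flag.
    """
    if policy == "continue":
        return "continue"
    if policy == "remount-ro":
        return "remount-ro"
    if policy == "continue-correctable":
        # Keep going only when the error is believed to be recoverable;
        # otherwise fall back to remounting read-only.
        return "continue" if correctable else "remount-ro"
    raise ValueError(f"unsupported policy: {policy}")


if __name__ == "__main__":
    for correctable in (True, False):
        print(correctable, react("continue-correctable", correctable))
```

The hard part, as noted in the thread, is not this branch but deciding
reliably which errors are safe to treat as correctable.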

-Lukas

> 
> 							- Ted

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-28 13:25       ` Lukáš Czerner
@ 2014-07-28 13:31         ` Vlad Dobrotescu
  2014-07-28 15:00           ` Theodore Ts'o
  0 siblings, 1 reply; 10+ messages in thread
From: Vlad Dobrotescu @ 2014-07-28 13:31 UTC (permalink / raw)
  To: Lukáš Czerner; +Cc: Theodore Ts'o, linux-ext4

If you are talking about changes, wouldn't "read-only" be a better
fall-back alternative for a failed or missing external journal?

Vlad

On 28/07/2014 09:25, Lukáš Czerner wrote:
> On Mon, 28 Jul 2014, Theodore Ts'o wrote:
>
>> Date: Mon, 28 Jul 2014 09:17:42 -0400
>> From: Theodore Ts'o<tytso@mit.edu>
>> To: Lukáš Czerner<lczerner@redhat.com>
>> Cc: Vlad Dobrotescu<vlad@dobrotescu.ca>, linux-ext4@vger.kernel.org
>> Subject: Re: Question: errors=continue behaviour for failed external journal
>>      device
>>
>> On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote:
>>> I very much agree with that, that's why I was quite surprised that I
>>> found out recently that this is the default. I was living in the
>>> delusion that the default was ERRORS_RO for as long as I can remember.
>>> So my question is, should we change it ? This really does not seem
>>> like a sane default.
>> Yeah, I've been thinking that this would be a good thing to change for
>> 1.43.
>>
>> The only reason that errors=continue was the default was for
>> historical reasons.  I could imagine some system administrators being
>> surprised when all of a sudden their production systems start getting
>> lots of EROFS errors getting reported by applications.  So I could
>> potentially imagine some Help Desks / Support folks at distributions
>> not being enthusiastic about such a change.
>>
>> Hmm.... we are starting to have some errors where we can allow the
>> system to stagger on, even if we need to disallow new allocations in
>> some block groups.  I wonder if it is worthwhile to have a "continue
>> for correctable errors".  The danger, of course, is that some errors,
>> even if they are correctable, (such as freeing a block which is
>> already freed), could be a hint that there are other fs corruptions,
>> not yet detected, that might lead to data loss if we reboot and fsck,
>> or remount readonly right away.  So the question is while there is
>> some value, is it worth the added complexity to add an
>> "errors=continue-correctable" option?
> Right,
>
> I like the idea of the new errors option, even though the name is a
> bit long (maybe "auto") which will try the best to continue, but is
> allowed to remount read only if we can not recover from that error.
>
> This would however need some work to make it work reliably and most
> importantly a fair amount of testing. Though I think it's worth the
> work.
>
> -Lukas
>
>> 							- Ted

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-28 13:31         ` Vlad Dobrotescu
@ 2014-07-28 15:00           ` Theodore Ts'o
  0 siblings, 0 replies; 10+ messages in thread
From: Theodore Ts'o @ 2014-07-28 15:00 UTC (permalink / raw)
  To: Vlad Dobrotescu; +Cc: Lukáš Czerner, linux-ext4

On Mon, Jul 28, 2014 at 09:31:05AM -0400, Vlad Dobrotescu wrote:
> If you are talking about changes, wouldn't "read-only" be a better fall-back
> alternative for a failed or missing external journal?

For a missing external journal, we simply wouldn't allow the mount to
succeed at all.

The discussion here was about what to do if we detect some kind of
inconsistency in a mounted file system.

Cheers,

					- Ted

* Re: Question: errors=continue behaviour for failed external journal device
  2014-07-28 13:17     ` Theodore Ts'o
  2014-07-28 13:25       ` Lukáš Czerner
@ 2014-07-28 16:09       ` Darrick J. Wong
  1 sibling, 0 replies; 10+ messages in thread
From: Darrick J. Wong @ 2014-07-28 16:09 UTC (permalink / raw)
  To: Theodore Ts'o; +Cc: Lukáš Czerner, Vlad Dobrotescu, linux-ext4

On Mon, Jul 28, 2014 at 09:17:42AM -0400, Theodore Ts'o wrote:
> On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote:
> > 
> > I very much agree with that, that's why I was quite surprised that I
> > found out recently that this is the default. I was living in the
> > delusion that the default was ERRORS_RO for as long as I can remember.
> > So my question is, should we change it ? This really does not seem
> > like a sane default.
> 
> Yeah, I've been thinking that this would be a good thing to change for
> 1.43.
> 
> The only reason that errors=continue was the default was for
> historical reasons.  I could imagine some system administrators being
> surprised when all of a sudden their production systems start getting
> lots of EROFS errors getting reported by applications.  So I could
> potentially imagine some Help Desks / Support folks at distributions
> not being enthusiastic about such a change.
> 
> Hmm.... we are starting to have some errors where we can allow the
> system to stagger on, even if we need to disallow new allocations in
> some block groups.  I wonder if it is worthwhile to have a "continue
> for correctable errors".  The danger, of course, is that some errors,
> even if they are correctable, (such as freeing a block which is
> already freed), could be a hint that there are other fs corruptions,
> not yet detected, that might lead to data loss if we reboot and fsck,
> or remount readonly right away.  So the question is while there is
> some value, is it worth the added complexity to add an
> "errors=continue-correctable" option?

Back in the earlier 3.15 days when I was trying to figure out what was going on
with that corruption bug that Eric Whitney found, it was useful for the kernel
to be able to stumble on with the non-broken block groups long enough to save
the logs of what had happened.  (Laptops don't seem to have serial consoles...)

In general I think it's worth the effort.

(I'd shovel crash reports into pstore if I wasn't afraid of bricking UEFI.)

--D
> 
> 							- Ted

end of thread, other threads:[~2014-07-28 16:09 UTC | newest]

Thread overview: 10+ messages
2014-07-26 23:07 Question: errors=continue behaviour for failed external journal device Vlad Dobrotescu
2014-07-27  0:07 ` Theodore Ts'o
2014-07-27  0:34   ` Vlad Dobrotescu
2014-07-27  1:07     ` Theodore Ts'o
2014-07-28  9:11   ` Lukáš Czerner
2014-07-28 13:17     ` Theodore Ts'o
2014-07-28 13:25       ` Lukáš Czerner
2014-07-28 13:31         ` Vlad Dobrotescu
2014-07-28 15:00           ` Theodore Ts'o
2014-07-28 16:09       ` Darrick J. Wong
