* Question: errors=continue behaviour for failed external journal device
From: Vlad Dobrotescu @ 2014-07-26 23:07 UTC
To: linux-ext4

If this isn't the proper place for this question, please point me in
the right direction.

I couldn't find any description on Ext4's behaviour when mounted
with errors=continue and external journal if the journal block device
is unavailable at mount time (or becomes unavailable at some point).

I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and
(probably) full data journaling on a SSD. Can someone help?
* Re: Question: errors=continue behaviour for failed external journal device
From: Theodore Ts'o @ 2014-07-27 0:07 UTC
To: Vlad Dobrotescu; +Cc: linux-ext4

On Sat, Jul 26, 2014 at 11:07:59PM +0000, Vlad Dobrotescu wrote:
> If this isn't the proper place for this question, please point me in
> the right direction.
>
> I couldn't find any description on Ext4's behaviour when mounted
> with errors=continue and external journal if the journal block device
> is unavailable at mount time (or becomes unavailable at some point).
>
> I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and
> (probably) full data journaling on a SSD. Can someone help?

So there are two different questions.

If you use errors=continue, there is a chance that the file system
inconsistencies that are discovered could cause further file system
damage, which might lead to the loss or corruption of data files
written earlier.  So it's not really recommended for most purposes,
unless you have some scheme where you are monitoring dmesg and have
some strategy to deal with detected file system errors, or when the
system absolutely, positively must continue running, and this is more
important than potential data loss.

If the journal block device is not present then the file system can't
be mounted, and if the system was uncleanly shut down you won't be
able to recover from the unclean shutdown by replaying the journal.

If the journal block device is *gone*, it is possible to remove the
external journal block device, and then force a file system repair,
but if this happens after an unclean shutdown, you may very well lose
data.

Cheers,

                                        - Ted
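The "monitoring dmesg" scheme mentioned above could be sketched roughly as follows. This is a minimal illustration, not a monitoring tool: the function name is hypothetical, and the regular expression assumes that ext4 error lines in the kernel log carry an "EXT4-fs error (device <name>)" prefix, which should be checked against the kernel actually in use.

```python
import re

# Assumed shape of an ext4 error line in the kernel log; verify on your kernel.
EXT4_ERROR = re.compile(r"EXT4-fs error \(device (?P<dev>[^)]+)\)")


def ext4_error_devices(log_lines):
    """Return the devices named in ext4 error messages, in first-seen order."""
    seen = []
    for line in log_lines:
        match = EXT4_ERROR.search(line)
        if match and match.group("dev") not in seen:
            seen.append(match.group("dev"))
    return seen
```

The input could be the lines of `dmesg` output collected periodically; what to do when the returned list is non-empty (alert, schedule an fsck, remount read-only) is the "strategy" part and is left to the administrator.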
* Re: Question: errors=continue behaviour for failed external journal device 2014-07-27 0:07 ` Theodore Ts'o @ 2014-07-27 0:34 ` Vlad Dobrotescu 2014-07-27 1:07 ` Theodore Ts'o 2014-07-28 9:11 ` Lukáš Czerner 1 sibling, 1 reply; 10+ messages in thread From: Vlad Dobrotescu @ 2014-07-27 0:34 UTC (permalink / raw) To: Theodore Ts'o; +Cc: linux-ext4 On 26/07/2014 20:07, Theodore Ts'o wrote: > On Sat, Jul 26, 2014 at 11:07:59PM +0000, Vlad Dobrotescu wrote: >> If this isn't the proper place for this question, please point me in >> the right direction. >> >> I couldn't find any description on Ext4's behaviour when mounted >> with errors=continue and external journal if the journal block device >> is unavailable at mount time (or becomes unavailable at some point). >> >> I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and >> (probably) full data journaling on a SSD. Can someone help? > So there are two different questions. > > If you use errors=continue, there is the chance that the file system > inconsistencies that discovered could cause further file system > damage, which might lead to the loss or corruption of data files > written earlier. So it's not really recommended for most purposes, > unless you have some scheme where you are monitoring dmesgs and having > some strategy to deal with detected file system errors, or when the > system absolutely, positively must continue running, and this is more > important than potential data loss. > > If the journal block device is not present then the file system can't > be mounted, and if the system was uncleanly shut down you won't be > able to recover from the unclean shutdown by replaying the journal. > > If the journal block device is *gone*, it is possible to remove the > external journal block device, and then force a file system repair, > but if this happens after an unclean shutdown, you may very well lose > data. 
> > Cheers, > > - Ted Sorry if this is a duplicate, but the "Followup" didn't seem to work for me Thanks for the quick and detailed answer. If I understand it correctly, the errors= option has nothing to do with journaling, but only with FS consistency issues (which can be caused by a vanished journal, but also by other events), while the mounting itself fails in the absence of the device specified for external journaling, with no fall-back alternative. Right? Vlad ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Question: errors=continue behaviour for failed external journal device
From: Theodore Ts'o @ 2014-07-27 1:07 UTC
To: Vlad Dobrotescu; +Cc: linux-ext4

On Sat, Jul 26, 2014 at 08:34:45PM -0400, Vlad Dobrotescu wrote:
>
> Thanks for the quick and detailed answer. If I understand it correctly,
> the errors= option has nothing to do with journaling, but only with FS
> consistency issues (which can be caused by a vanished journal,

The errors= option has to do with how the system will react when it
discovers a file system inconsistency (for example, while deleting a
file, it discovers that the blocks it is trying to free are already
freed, etc.)

errors=continue is the "don't worry, be happy" option.  This can
sometimes work out, but it's much like ignoring a late mortgage payment
notice from the bank.  Most of the time, sooner or later, it catches up
to you.  :-)

> by other events), while the mounting itself fails in the absence of the
> device specified for external journaling, with no fall-back alternative.

Your question about what happens if the journal is missing is much
like the question, "suppose I have a RAID 0 setup, and I'm missing
one of the disks; what can we do?"  Basically, nothing.

In a desperation scenario, there are ways you can forcibly tell the
system to pretend that there is no journal, just like you can pretend
that the system should ignore a missing 20% of a RAID 0 array and have
the LVM replace the missing disk with zero blocks, but the results are
very likely to lead to data loss.

Cheers,

                                        - Ted
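For the desperation scenario described above, one possible sequence is sketched below as a dry-run helper that only builds the commands and never runs them. The function name and the placeholder device are hypothetical. The commands rely on `tune2fs -f -O ^has_journal` and `e2fsck -f` as described in the e2fsprogs man pages; some versions may require `-f` to be given twice when the file system appears to need journal replay, so the local man page should be consulted first. As noted in the message, doing this after an unclean shutdown may well lose data.

```python
import shlex


def journal_removal_plan(fs_device):
    """Return the commands, in order, as argv lists. Nothing is executed."""
    return [
        # Drop the has_journal feature even though the external journal is absent.
        ["tune2fs", "-f", "-O", "^has_journal", fs_device],
        # Force a full check, since no journal replay was possible.
        ["e2fsck", "-f", fs_device],
    ]


if __name__ == "__main__":
    # "/dev/sdX1" is a placeholder, not a real device name.
    for argv in journal_removal_plan("/dev/sdX1"):
        print(shlex.join(argv))
```

Keeping the plan as data rather than executing it is deliberate: each step is destructive enough that it should be reviewed, and ideally tried on an image copy of the device, before being run by hand.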
* Re: Question: errors=continue behaviour for failed external journal device 2014-07-27 0:07 ` Theodore Ts'o 2014-07-27 0:34 ` Vlad Dobrotescu @ 2014-07-28 9:11 ` Lukáš Czerner 2014-07-28 13:17 ` Theodore Ts'o 1 sibling, 1 reply; 10+ messages in thread From: Lukáš Czerner @ 2014-07-28 9:11 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Vlad Dobrotescu, linux-ext4 On Sat, 26 Jul 2014, Theodore Ts'o wrote: > Date: Sat, 26 Jul 2014 20:07:33 -0400 > From: Theodore Ts'o <tytso@mit.edu> > To: Vlad Dobrotescu <vlad@dobrotescu.ca> > Cc: linux-ext4@vger.kernel.org > Subject: Re: Question: errors=continue behaviour for failed external journal > device > > On Sat, Jul 26, 2014 at 11:07:59PM +0000, Vlad Dobrotescu wrote: > > If this isn't the proper place for this question, please point me in > > the right direction. > > > > I couldn't find any description on Ext4's behaviour when mounted > > with errors=continue and external journal if the journal block device > > is unavailable at mount time (or becomes unavailable at some point). > > > > I would be using CentOS 7 (kernel 3.10.0-123.4.4.el7 x86_64) and > > (probably) full data journaling on a SSD. Can someone help? > > So there are two different questions. > > If you use errors=continue, there is the chance that the file system > inconsistencies that discovered could cause further file system > damage, which might lead to the loss or corruption of data files > written earlier. So it's not really recommended for most purposes, I very much agree with that, that's why I was quite surprised that I found out recently that this is the default. I was living in the delusion that the default was ERRORS_RO for as long as I can remember. So my question is, should we change it ? This really does not seem like a sane default. Thanks! 
-Lukas > unless you have some scheme where you are monitoring dmesgs and having > some strategy to deal with detected file system errors, or when the > system absolutely, positively must continue running, and this is more > important than potential data loss. > > If the journal block device is not present then the file system can't > be mounted, and if the system was uncleanly shut down you won't be > able to recover from the unclean shutdown by replaying the journal. > > If the journal block device is *gone*, it is possible to remove the > external journal block device, and then force a file system repair, > but if this happens after an unclean shutdown, you may very well lose > data. > > Cheers, > > - Ted > -- > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Question: errors=continue behaviour for failed external journal device
From: Theodore Ts'o @ 2014-07-28 13:17 UTC
To: Lukáš Czerner; +Cc: Vlad Dobrotescu, linux-ext4

On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote:
>
> I very much agree with that, that's why I was quite surprised that I
> found out recently that this is the default. I was living in the
> delusion that the default was ERRORS_RO for as long as I can remember.
> So my question is, should we change it ? This really does not seem
> like a sane default.

Yeah, I've been thinking that this would be a good thing to change for
1.43.

The only reason that errors=continue was the default was for
historical reasons.  I could imagine some system administrators being
surprised when all of a sudden their production systems start seeing
lots of EROFS errors reported by applications.  So I could
potentially imagine some Help Desks / Support folks at distributions
not being enthusiastic about such a change.

Hmm.... we are starting to have some errors where we can allow the
system to stagger on, even if we need to disallow new allocations in
some block groups.  I wonder if it is worthwhile to have a "continue
for correctable errors".  The danger, of course, is that some errors,
even if they are correctable (such as freeing a block which is
already freed), could be a hint that there are other fs corruptions,
not yet detected, that might lead to data loss if we don't reboot and
fsck, or remount readonly, right away.  So the question is: while
there is some value, is it worth the added complexity to add an
"errors=continue-correctable" option?

                                        - Ted
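To see which error behaviour a given file system currently has stored, the superblock listing can be inspected. The sketch below parses text in the style of `tune2fs -l` output; the function name is hypothetical, and it assumes the listing contains an "Errors behavior:" line, so the parsing should be adjusted if the local e2fsprogs prints something different.

```python
def errors_behavior(listing_text):
    """Return the value of the 'Errors behavior:' line, or None if absent."""
    for line in listing_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "Errors behavior":
            return value.strip()
    return None
```

The stored setting can be changed with `tune2fs -e remount-ro <device>`, and a single mount can override it with `-o errors=remount-ro`, so a site that prefers read-only behaviour does not have to wait for any change of default.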
* Re: Question: errors=continue behaviour for failed external journal device 2014-07-28 13:17 ` Theodore Ts'o @ 2014-07-28 13:25 ` Lukáš Czerner 2014-07-28 13:31 ` Vlad Dobrotescu 2014-07-28 16:09 ` Darrick J. Wong 1 sibling, 1 reply; 10+ messages in thread From: Lukáš Czerner @ 2014-07-28 13:25 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Vlad Dobrotescu, linux-ext4 [-- Attachment #1: Type: TEXT/PLAIN, Size: 2239 bytes --] On Mon, 28 Jul 2014, Theodore Ts'o wrote: > Date: Mon, 28 Jul 2014 09:17:42 -0400 > From: Theodore Ts'o <tytso@mit.edu> > To: Lukáš Czerner <lczerner@redhat.com> > Cc: Vlad Dobrotescu <vlad@dobrotescu.ca>, linux-ext4@vger.kernel.org > Subject: Re: Question: errors=continue behaviour for failed external journal > device > > On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote: > > > > I very much agree with that, that's why I was quite surprised that I > > found out recently that this is the default. I was living in the > > delusion that the default was ERRORS_RO for as long as I can remember. > > So my question is, should we change it ? This really does not seem > > like a sane default. > > Yeah, I've been thinking that this would be a good thing to change for > 1.43. > > The only reason that errors=continue was the default was for > historical reasons. I could imagine some system administrators being > surprised when all of a sudden their production systems start getting > lots of EROFS errors getting reported by applications. So I could > potentially imagine some Help Desks / Support folks at distributions > not being enthusiastic about such a change. > > Hmm.... we are starting to have some errors where we can allow the > system to stagger on, even if we need to disallow new allocations in > some block groups. I wonder if it is worthwhile to have a "continue > for correctable errors". 
The danger, of course, is that some errors, > even if they are correctable, (such as freeing a block which is > already freed), could be a hint that there are other fs corruptions, > not yet detected, that might lead to data loss if we reboot and fsck, > or remount readonly right away. So the question is while there is > some value, is it worth the added complexity to add an > "errors=continue-correctable" option? Right, I like the idea of the new errors option, even though the name is a bit long (maybe "auto") which will try the best to continue, but is allowed to remount read only if we can not recover from that error. This would however need some work to make it work reliably and most importantly a fair amount of testing. Though I think it's worth the work. -Lukas > > - Ted ^ permalink raw reply [flat|nested] 10+ messages in thread
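The option being discussed here can be pictured as a small decision table. The sketch below is purely illustrative: "auto" (or "continue-correctable") is only a proposal in this thread, not an existing mount option, and the `correctable` flag stands in for whatever classification the kernel would have to make. The names `Action` and `react` are hypothetical.

```python
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"
    REMOUNT_RO = "remount-ro"
    PANIC = "panic"


def react(policy, correctable):
    """Model what each errors= policy would do when an inconsistency is found."""
    if policy == "continue":
        return Action.CONTINUE
    if policy == "remount-ro":
        return Action.REMOUNT_RO
    if policy == "panic":
        return Action.PANIC
    if policy == "auto":
        # Hypothetical policy from this thread: keep going only when the
        # error is judged correctable, otherwise fall back to read-only.
        return Action.CONTINUE if correctable else Action.REMOUNT_RO
    raise ValueError(f"unknown errors= policy: {policy!r}")
```

The hard part, as both messages point out, is not this table but deciding reliably which errors count as correctable, since an apparently harmless error may be a symptom of corruption that has not been detected yet.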
* Re: Question: errors=continue behaviour for failed external journal device 2014-07-28 13:25 ` Lukáš Czerner @ 2014-07-28 13:31 ` Vlad Dobrotescu 2014-07-28 15:00 ` Theodore Ts'o 0 siblings, 1 reply; 10+ messages in thread From: Vlad Dobrotescu @ 2014-07-28 13:31 UTC (permalink / raw) To: Lukáš Czerner; +Cc: Theodore Ts'o, linux-ext4 If you are talking about changes, wouldn't "read-only" be a better fall-back alternative for a failed or missing external journal? Vlad On 28/07/2014 09:25, Lukáš Czerner wrote: > On Mon, 28 Jul 2014, Theodore Ts'o wrote: > >> Date: Mon, 28 Jul 2014 09:17:42 -0400 >> From: Theodore Ts'o<tytso@mit.edu> >> To: Lukáš Czerner<lczerner@redhat.com> >> Cc: Vlad Dobrotescu<vlad@dobrotescu.ca>, linux-ext4@vger.kernel.org >> Subject: Re: Question: errors=continue behaviour for failed external journal >> device >> >> On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote: >>> I very much agree with that, that's why I was quite surprised that I >>> found out recently that this is the default. I was living in the >>> delusion that the default was ERRORS_RO for as long as I can remember. >>> So my question is, should we change it ? This really does not seem >>> like a sane default. >> Yeah, I've been thinking that this would be a good thing to change for >> 1.43. >> >> The only reason that errors=continue was the default was for >> historical reasons. I could imagine some system administrators being >> surprised when all of a sudden their production systems start getting >> lots of EROFS errors getting reported by applications. So I could >> potentially imagine some Help Desks / Support folks at distributions >> not being enthusiastic about such a change. >> >> Hmm.... we are starting to have some errors where we can allow the >> system to stagger on, even if we need to disallow new allocations in >> some block groups. I wonder if it is worthwhile to have a "continue >> for correctable errors". 
The danger, of course, is that some errors, >> even if they are correctable, (such as freeing a block which is >> already freed), could be a hint that there are other fs corruptions, >> not yet detected, that might lead to data loss if we reboot and fsck, >> or remount readonly right away. So the question is while there is >> some value, is it worth the added complexity to add an >> "errors=continue-correctable" option? > Right, > > I like the idea of the new errors option, even though the name is a > bit long (maybe "auto") which will try the best to continue, but is > allowed to remount read only if we can not recover from that error. > > This would however need some work to make it work reliably and most > importantly a fair amount of testing. Though I think it's worth the > work. > > -Lukas > >> - Ted -- To unsubscribe from this list: send the line "unsubscribe linux-ext4" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 10+ messages in thread
* Re: Question: errors=continue behaviour for failed external journal device
From: Theodore Ts'o @ 2014-07-28 15:00 UTC
To: Vlad Dobrotescu; +Cc: Lukáš Czerner, linux-ext4

On Mon, Jul 28, 2014 at 09:31:05AM -0400, Vlad Dobrotescu wrote:
> If you are talking about changes, wouldn't "read-only" be a better fall-back
> alternative for a failed or missing external journal?

For a missing external journal, we simply wouldn't allow the mount to
succeed at all.  The discussion here was about what to do if we detect
some kind of inconsistency in a mounted file system.

Cheers,

                                        - Ted
* Re: Question: errors=continue behaviour for failed external journal device 2014-07-28 13:17 ` Theodore Ts'o 2014-07-28 13:25 ` Lukáš Czerner @ 2014-07-28 16:09 ` Darrick J. Wong 1 sibling, 0 replies; 10+ messages in thread From: Darrick J. Wong @ 2014-07-28 16:09 UTC (permalink / raw) To: Theodore Ts'o; +Cc: Lukáš Czerner, Vlad Dobrotescu, linux-ext4 On Mon, Jul 28, 2014 at 09:17:42AM -0400, Theodore Ts'o wrote: > On Mon, Jul 28, 2014 at 11:11:45AM +0200, Lukáš Czerner wrote: > > > > I very much agree with that, that's why I was quite surprised that I > > found out recently that this is the default. I was living in the > > delusion that the default was ERRORS_RO for as long as I can remember. > > So my question is, should we change it ? This really does not seem > > like a sane default. > > Yeah, I've been thinking that this would be a good thing to change for > 1.43. > > The only reason that errors=continue was the default was for > historical reasons. I could imagine some system administrators being > surprised when all of a sudden their production systems start getting > lots of EROFS errors getting reported by applications. So I could > potentially imagine some Help Desks / Support folks at distributions > not being enthusiastic about such a change. > > Hmm.... we are starting to have some errors where we can allow the > system to stagger on, even if we need to disallow new allocations in > some block groups. I wonder if it is worthwhile to have a "continue > for correctable errors". The danger, of course, is that some errors, > even if they are correctable, (such as freeing a block which is > already freed), could be a hint that there are other fs corruptions, > not yet detected, that might lead to data loss if we reboot and fsck, > or remount readonly right away. So the question is while there is > some value, is it worth the added complexity to add an > "errors=continue-correctable" option? 

Back in the earlier 3.15 days when I was trying to figure out what was
going on with that corruption bug that Eric Whitney found, it was useful
for the kernel to be able to stumble on with the non-broken block groups
long enough to save the logs of what had happened.  (Laptops don't seem
to have serial consoles...)

In general I think it's worth the effort.

(I'd shovel crash reports into pstore if I wasn't afraid of bricking UEFI.)

--D

>
>                                         - Ted