* [Ocfs2-devel] How can ecc be corrected?
@ 2011-06-17 15:55 Goldwyn Rodrigues
  2011-06-17 16:53 ` Sunil Mushran
  0 siblings, 1 reply; 9+ messages in thread
From: Goldwyn Rodrigues @ 2011-06-17 15:55 UTC (permalink / raw)
  To: ocfs2-devel

Hi,

I am not able to understand the use of metaecc or the ECC in the
metadata. All the metadata contain the ecc to check if the data
written to the block is sane, but what happens in case the ecc does
not match? All it does is fail in case it does not match. There does
not seem to be a way to correct it.

fsck simply fails in ocfs2_read_inode, (or in some cases such as
superblock inode (2) does not even check) if the ecc does not match.
What is the best way to correct ecc errors? I understand that an
incorrect ECC means the data might be corrupt, but what if we want to
recover? or is it not meant to be corrected at all?

Regards,

-- 
Goldwyn


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-17 15:55 [Ocfs2-devel] How can ecc be corrected? Goldwyn Rodrigues
@ 2011-06-17 16:53 ` Sunil Mushran
  2011-06-17 18:50   ` Goldwyn Rodrigues
  0 siblings, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2011-06-17 16:53 UTC (permalink / raw)
  To: ocfs2-devel

On 06/17/2011 08:55 AM, Goldwyn Rodrigues wrote:
> I am not able to understand the use of metaecc or the ECC in the
> metadata. All the metadata contain the ecc to check if the data
> written to the block is sane, but what happens in case the ecc does
> not match? All it does is fail in case it does not match. There does
> not seem to be a way to correct it.
>
> fsck simply fails in ocfs2_read_inode, (or in some cases such as
> superblock inode (2) does not even check) if the ecc does not match.
> What is the best way to correct ecc errors? I understand that an
> incorrect ECC means the data might be corrupt, but what if we want to
> recover? or is it not meant to be corrected at all?

I think originally our thought was that bad checksum means bad block. But
we are wiser now. As in, while that works in the fs, we could do a better
job in the tools. And that's the reason it is not yet enabled by default.

If you have ideas, do share.


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-17 16:53 ` Sunil Mushran
@ 2011-06-17 18:50   ` Goldwyn Rodrigues
  2011-06-17 19:14     ` Sunil Mushran
  0 siblings, 1 reply; 9+ messages in thread
From: Goldwyn Rodrigues @ 2011-06-17 18:50 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Jun 17, 2011 at 11:53 AM, Sunil Mushran
<sunil.mushran@oracle.com> wrote:
> On 06/17/2011 08:55 AM, Goldwyn Rodrigues wrote:
>>
>> I am not able to understand the use of metaecc or the ECC in the
>> metadata. All the metadata contain the ecc to check if the data
>> written to the block is sane, but what happens in case the ecc does
>> not match? All it does is fail in case it does not match. There does
>> not seem to be a way to correct it.
>>
>> fsck simply fails in ocfs2_read_inode, (or in some cases such as
>> superblock inode (2) does not even check) if the ecc does not match.
>> What is the best way to correct ecc errors? I understand that an
>> incorrect ECC means the data might be corrupt, but what if we want to
>> recover? or is it not meant to be corrected at all?
>
> I think originally our thought was that bad checksum means bad block. But
> we are wiser now. As in, while that works in the fs, we could do a better
> job in the tools. And that's the reason it is not yet enabled by default.
>

So, what is the plan in the future? Do you intend to put it as a
default option or let things be as is?

In any case, I agree we should modify tools to correct the filesystem
(fsck) if the filesystem fails due to metaecc error or else we could
end up having an unusable filesystem. It sure is a good debugging tool
for development purposes though.

> If you have ideas, do share.

No ideas as such. I raised this question because a customer was facing
this issue with the superblock and no way to fix it. Fortunately, he
can still use the filesystem. It is debugfs.ocfs2 which is failing. I
guess I will have to work on a patch to fix this.

-- 
Goldwyn


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-17 18:50   ` Goldwyn Rodrigues
@ 2011-06-17 19:14     ` Sunil Mushran
  2011-06-17 23:16       ` Joel Becker
  2011-06-19  4:13       ` Goldwyn Rodrigues
  0 siblings, 2 replies; 9+ messages in thread
From: Sunil Mushran @ 2011-06-17 19:14 UTC (permalink / raw)
  To: ocfs2-devel

On 06/17/2011 11:50 AM, Goldwyn Rodrigues wrote:
> On Fri, Jun 17, 2011 at 11:53 AM, Sunil Mushran
> <sunil.mushran@oracle.com>  wrote:
>> On 06/17/2011 08:55 AM, Goldwyn Rodrigues wrote:
>>> I am not able to understand the use of metaecc or the ECC in the
>>> metadata. All the metadata contain the ecc to check if the data
>>> written to the block is sane, but what happens in case the ecc does
>>> not match? All it does is fail in case it does not match. There does
>>> not seem to be a way to correct it.
>>>
>>> fsck simply fails in ocfs2_read_inode, (or in some cases such as
>>> superblock inode (2) does not even check) if the ecc does not match.
>>> What is the best way to correct ecc errors? I understand that an
>>> incorrect ECC means the data might be corrupt, but what if we want to
>>> recover? or is it not meant to be corrected at all?
>> I think originally our thought was that bad checksum means bad block. But
>> we are wiser now. As in, while that works in the fs, we could do a better
>> job in the tools. And that's the reason it is not yet enabled by default.
>>
> So, what is the plan in the future? Do you intend to put it as a
> default option or let things be as is?
>
> In any case, I agree we should modify tools to correct the filesystem
> (fsck) if the filesystem fails due to metaecc error or else we could
> end up having an unusable filesystem. It sure is a good debugging tool
> for development purposes though.

Oh absolutely it will be made a default. But we have to address this
shortcoming first.

>> If you have ideas, do share.
> No ideas as such. I raised this question because a customer was facing
> this issue with the superblock and no way to fix it. Fortunately, he
> can still use the filesystem. It is debugfs.ocfs2 which is failing. I
> guess I will have to work on a patch to fix this.

So I remember we had a bug in tunefs that changed the superblock
without recomputing the checksum. It has been fixed since.

How can he still use the fs?

One solution is to disable it... manually. And then re-enable it using
the latest tunefs.
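
The tunefs bug described above (a block changed without its checksum
being recomputed) can be illustrated with a minimal sketch. This is not
the tunefs.ocfs2 code: the block layout (a 4-byte CRC32 in front of the
payload) and the names seal, is_valid and update are assumptions made
up for the illustration.

```python
import struct
import zlib

# Hypothetical layout: 4-byte little-endian CRC32 followed by the payload.
CRC_SIZE = 4


def seal(payload: bytes) -> bytes:
    """Return a block whose leading CRC32 matches the payload."""
    return struct.pack("<I", zlib.crc32(payload)) + payload


def is_valid(block: bytes) -> bool:
    """Check the stored CRC32 against the payload that follows it."""
    (stored,) = struct.unpack("<I", block[:CRC_SIZE])
    return stored == zlib.crc32(block[CRC_SIZE:])


def update(block: bytes, new_payload: bytes, recompute: bool = True) -> bytes:
    """Replace the payload of a block.

    With recompute=False the old checksum is kept, which reproduces the
    failure mode: the data is fine but every later check fails.
    """
    if recompute:
        return seal(new_payload)
    return block[:CRC_SIZE] + new_payload
```

A block written with recompute=False fails is_valid even though nothing
is wrong with the payload itself, which is why disabling the feature and
re-enabling it with a fixed tool (so that the checksums are regenerated)
is one way out.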


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-17 19:14     ` Sunil Mushran
@ 2011-06-17 23:16       ` Joel Becker
  2011-06-20 16:22         ` Sunil Mushran
  2011-06-19  4:13       ` Goldwyn Rodrigues
  1 sibling, 1 reply; 9+ messages in thread
From: Joel Becker @ 2011-06-17 23:16 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Jun 17, 2011 at 12:14:36PM -0700, Sunil Mushran wrote:
> >> If you have ideas, do share.
> > No ideas as such. I raised this question because a customer was facing
> > this issue with the superblock and no way to fix it. Fortunately, he
> > can still use the filesystem. It is debugfs.ocfs2 which is failing. I
> > guess I will have to work on a patch to fix this.
> 
> So I remember we had a bug in tunefs that changed the superblock
> without recomputing the checksum. It has been fixed since.
> 
> How can he still use the fs?
> 
> One solution is to disable it... manually. And then re-enable it using
> the latest tunefs.

	I thought we were going to patch fsck.ocfs2 to run in an
ignore-metaecc mode?

Joel


-- 

"Hey mister if you're gonna walk on water,
 Could you drop a line my way?"

			http://www.jlbec.org/
			jlbec at evilplan.org


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-17 19:14     ` Sunil Mushran
  2011-06-17 23:16       ` Joel Becker
@ 2011-06-19  4:13       ` Goldwyn Rodrigues
  2011-06-20 16:32         ` Sunil Mushran
  1 sibling, 1 reply; 9+ messages in thread
From: Goldwyn Rodrigues @ 2011-06-19  4:13 UTC (permalink / raw)
  To: ocfs2-devel

On Fri, Jun 17, 2011 at 2:14 PM, Sunil Mushran <sunil.mushran@oracle.com> wrote:
> On 06/17/2011 11:50 AM, Goldwyn Rodrigues wrote:
>>
>> On Fri, Jun 17, 2011 at 11:53 AM, Sunil Mushran
>> <sunil.mushran@oracle.com> wrote:
>>>
>>> On 06/17/2011 08:55 AM, Goldwyn Rodrigues wrote:
>>>>
>>>> I am not able to understand the use of metaecc or the ECC in the
>>>> metadata. All the metadata contain the ecc to check if the data
>>>> written to the block is sane, but what happens in case the ecc does
>>>> not match? All it does is fail in case it does not match. There does
>>>> not seem to be a way to correct it.
>>>>
>>>> fsck simply fails in ocfs2_read_inode, (or in some cases such as
>>>> superblock inode (2) does not even check) if the ecc does not match.

Oh, I was wrong about this. I patched fswreck to mess up the
superblock ECC values really badly, and neither mount nor fsck worked.
But an error within correctable limits is ignored, and block_check
remains the same. In this state, there is no way to revive the fs.

Like Joel mentioned, we need an ignore-metaecc mode for fsck to correct it.
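
To make "within correctable limits" concrete, here is a rough sketch of
a check that verifies a CRC32 first and falls back to a single-bit
correction code only when the CRC fails. It is not the ocfs2 blockcheck
code: a simple XOR of set-bit positions stands in for a real Hamming
code, and make_check, position_xor and validate are hypothetical names.

```python
import zlib


def position_xor(data: bytes) -> int:
    """XOR together the 1-based positions of every set bit in data."""
    acc = 0
    for byte_index, byte in enumerate(data):
        for bit in range(8):
            if byte & (1 << bit):
                acc ^= byte_index * 8 + bit + 1
    return acc


def make_check(data: bytes) -> tuple[int, int]:
    """Compute the (crc, ecc) pair that would be stored with the block."""
    return zlib.crc32(data), position_xor(data)


def validate(data: bytes, crc: int, ecc: int) -> tuple[str, bytes]:
    """Return ("ok" | "corrected" | "failed", data as it should be used)."""
    if zlib.crc32(data) == crc:
        return "ok", data
    # A single flipped bit changes the position XOR by exactly its own
    # 1-based position, so the syndrome points at the bit to flip back.
    syndrome = position_xor(data) ^ ecc
    pos = syndrome - 1
    if 0 <= pos < len(data) * 8:
        fixed = bytearray(data)
        fixed[pos // 8] ^= 1 << (pos % 8)
        # Only trust the repair if the CRC now matches.
        if zlib.crc32(bytes(fixed)) == crc:
            return "corrected", bytes(fixed)
    return "failed", data
```

In this sketch a single flipped bit comes back as "corrected", but only
the in-memory copy is repaired; nothing rewrites the stored block or its
check values. Anything worse than one bit ends up as "failed", and a
reader that refuses to continue at that point has no way to repair the
block.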

>>>> What is the best way to correct ecc errors? I understand that an
>>>> incorrect ECC means the data might be corrupt, but what if we want to
>>>> recover? or is it not meant to be corrected at all?
>>>
>>> I think originally our thought was that bad checksum means bad block. But
>>> we are wiser now. As in, while that works in the fs, we could do a better
>>> job in the tools. And that's the reason it is not yet enabled by default.
>>>
>> So, what is the plan in the future? Do you intend to put it as a
>> default option or let things be as is?
>>
>> In any case, I agree we should modify tools to correct the filesystem
>> (fsck) if the filesystem fails due to metaecc error or else we could
>> end up having an unusable filesystem. It sure is a good debugging tool
>> for development purposes though.
>
> Oh absolutely it will be made a default. But we have to address this
> shortcoming first.
>
>>> If you have ideas, do share.
>>
>> No ideas as such. I raised this question because a customer was facing
>> this issue with the superblock and no way to fix it. Fortunately, he
>> can still use the filesystem. It is debugfs.ocfs2 which is failing. I
>> guess I will have to work on a patch to fix this.
>
> So I remember we had a bug in tunefs that changed the superblock
> without recomputing the checksum. It has been fixed since.
>
> How can he still use the fs?
>

I suppose it is still in the correctable limits. By failing I meant a
"stat" output in debugfs gives a "FAILED CHECKSUM" error.

On reading more, I found we are not writing the superblock anywhere in
the kernel module, which is perhaps the reason the block_check values
remain unchanged. Please correct me if I am wrong.

This brings me to the next question: Why don't we use mnt_count? The
fact that it is distributed makes life complicated, but still...

> One solution is to disable it... manually. And then re-enable it using
> the latest tunefs.
>


-- 
Goldwyn


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-17 23:16       ` Joel Becker
@ 2011-06-20 16:22         ` Sunil Mushran
  2011-06-20 17:34           ` Goldwyn Rodrigues
  0 siblings, 1 reply; 9+ messages in thread
From: Sunil Mushran @ 2011-06-20 16:22 UTC (permalink / raw)
  To: ocfs2-devel

On 06/17/2011 04:16 PM, Joel Becker wrote:
> On Fri, Jun 17, 2011 at 12:14:36PM -0700, Sunil Mushran wrote:
>>>> If you have ideas, do share.
>>> No ideas as such. I raised this question because a customer was facing
>>> this issue with the superblock and no way to fix it. Fortunately, he
>>> can still use the filesystem. It is debugfs.ocfs2 which is failing. I
>>> guess I will have to work on a patch to fix this.
>> So I remember we had a bug in tunefs that changed the superblock
>> without recomputing the checksum. It has been fixed since.
>>
>> How can he still use the fs?
>>
>> One solution is to disable it... manually. And then re-enable it using
>> the latest tunefs.
> 	I thought we were going to patch fsck.ocfs2 to run in an
> ignore-metaecc mode?


Oh I did not know we had decided on that. Though that appears to be the
best solution. fsck and debugfs would always run in ignore-metaecc mode.
fsck will need fixup code for that.
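
A rough sketch of what "ignore-metaecc mode plus fixup" could look like,
purely as an illustration. This is not the fsck.ocfs2 implementation or
the patch set mentioned in this thread: Block, ChecksumError, read_block
and fsck_fixup are hypothetical names, and a plain CRC32 stands in for
the real block check.

```python
import zlib
from dataclasses import dataclass
from typing import Callable


class ChecksumError(Exception):
    """Raised when a block's stored checksum does not match its payload."""


@dataclass
class Block:
    payload: bytes
    stored_crc: int


def read_block(block: Block, ignore_check: bool = False) -> bytes:
    """Return the payload; refuse on a bad checksum unless told to ignore it."""
    if not ignore_check and zlib.crc32(block.payload) != block.stored_crc:
        raise ChecksumError("FAILED CHECKSUM")
    return block.payload


def fsck_fixup(block: Block, looks_sane: Callable[[bytes], bool]) -> bool:
    """Recompute a stale checksum if the payload passes structural checks.

    Returns True when the stored checksum was rewritten, False when it
    was already correct. The looks_sane callback stands in for whatever
    structural validation the checker does on the block contents.
    """
    payload = read_block(block, ignore_check=True)
    if zlib.crc32(payload) == block.stored_crc:
        return False
    if not looks_sane(payload):
        raise ValueError("payload fails structural checks; not fixing up")
    block.stored_crc = zlib.crc32(payload)
    return True
```

The point of the split is that the strict reader keeps refusing bad
blocks, while the repair path can still get at the contents, decide
whether they are believable, and only then regenerate the checksum.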


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-19  4:13       ` Goldwyn Rodrigues
@ 2011-06-20 16:32         ` Sunil Mushran
  0 siblings, 0 replies; 9+ messages in thread
From: Sunil Mushran @ 2011-06-20 16:32 UTC (permalink / raw)
  To: ocfs2-devel

On 06/18/2011 09:13 PM, Goldwyn Rodrigues wrote:
> I suppose it is still in the correctable limits. By failing I meant a
> "stat" output in debugfs gives a "FAILED CHECKSUM" error.
>
> On reading more, I found we are not writing the superblock anywhere in
> the kernel module, which is perhaps the reason the block_check values
> remain unchanged. Please correct me if I am wrong.
>
> This brings me to the next question: Why don't we use mnt_count? The
> fact that it is distributed makes life complicated, but still...

Yeah.. Mark had added the failed checksum check in debugfs.
Without that we were running blind. Hard to compute it in the head. ;)

mnt count was originally added in extN to force fsck after N mounts.
That has never worked for us because fsck is an offline process. And it
could take time. It is prudent to let users control when it's run.

FWIW, extN has also changed its default behaviour to ignore mnt count.


* [Ocfs2-devel] How can ecc be corrected?
  2011-06-20 16:22         ` Sunil Mushran
@ 2011-06-20 17:34           ` Goldwyn Rodrigues
  0 siblings, 0 replies; 9+ messages in thread
From: Goldwyn Rodrigues @ 2011-06-20 17:34 UTC (permalink / raw)
  To: ocfs2-devel

On Mon, Jun 20, 2011 at 11:22 AM, Sunil Mushran
<sunil.mushran@oracle.com> wrote:
> On 06/17/2011 04:16 PM, Joel Becker wrote:
>>
>> On Fri, Jun 17, 2011 at 12:14:36PM -0700, Sunil Mushran wrote:
>>>>>
>>>>> If you have ideas, do share.
>>>>
>>>> No ideas as such. I raised this question because a customer was facing
>>>> this issue with the superblock and no way to fix it. Fortunately, he
>>>> can still use the filesystem. It is debugfs.ocfs2 which is failing. I
>>>> guess I will have to work on a patch to fix this.
>>>
>>> So I remember we had a bug in tunefs that changed the superblock
>>> without recomputing the checksum. It has been fixed since.
>>>
>>> How can he still use the fs?
>>>
>>> One solution is to disable it... manually. And then re-enable it using
>>> the latest tunefs.
>>
>> 	I thought we were going to patch fsck.ocfs2 to run in an
>> ignore-metaecc mode?
>
>
> Oh I did not know we had decided on that. Though that appears to be the
> best solution. fsck and debugfs would always run in ignore-metaecc mode.
> fsck will need fixup code for that.
>

Cool. I have sent a set of 3 patches on the tools mailing list. Let me
know if it works for you.

-- 
Goldwyn
