Re: md devices: Suggestion for in place time and checksum within the RAID

From: Keld Simonsen <keld@keldix.com>
To: Joachim Otahal <Jou@gmx.net>
Cc: Bill Davidsen <davidsen@tmr.com>, linux-raid@vger.kernel.org
Subject: Re: md devices: Suggestion for in place time and checksum within the RAID
Date: Sun, 14 Mar 2010 14:03:48 +0100
Message-ID: <20100314130348.GA14141@light.rap.dk>
In-Reply-To: <4B9CCF7A.4010809@gmx.net>

On Sun, Mar 14, 2010 at 12:58:50PM +0100, Joachim Otahal wrote:
> Keld Simonsen wrote:
> >On Sun, Mar 14, 2010 at 02:25:38AM +0100, Joachim Otahal wrote:
> >>>>Question:
> >>>>Will RAID4/5/6 in the future use the parity upon read too? Currently
> >>>>it would not detect wrong data reads from the parity chunk, resulting
> >>>>in a disaster when it is actually needed.
> >>>>
> >>>>Do those plans already exist and my post was completely useless?
> >>>>
> >>>>Sorry that I cannot give patches, my last kernel patch + compile was
> >>>>2.2.26, since then I never compiled a kernel.
> >>>>
> >>>>Joachim Otahal
> >>>>         
> >Hmm, would that not be detected by a check - initiated by cron?
> >   
> Debian schedules a monthly check (first Sunday, 00:57), IMHO the best
> possible time and frequency: less is dangerous, more is useless. I added
> a cronjob that checks every 15 minutes for changes in /proc/mdstat and
> in the SMART info (reallocated sector count and drive-internal
> error list only) and emails me if something changed since the previous check.
> I use the script because /etc/mdadm/mdadm.conf only takes ONE email
> address and requires a local MTA installed; I always uninstall the
> local MTA if the machine is not going to be a mail server.

Interesting! I would like to see your scripts....
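
A minimal sketch of the change-detection part of such a cron job, in Python.
This is not Joachim's script, which is not shown in this thread. The function
name snapshot_changed and the state-file layout are assumptions made for
illustration. It stores a digest of the last snapshot and reports whether the
new one differs.

```python
import hashlib
import os
import tempfile


def snapshot_changed(current: str, state_path: str) -> bool:
    """Return True if `current` differs from the snapshot recorded earlier.

    Only a SHA-256 digest of the snapshot is kept in `state_path`.
    The first run (no state file yet) records a baseline and returns False.
    """
    new_digest = hashlib.sha256(current.encode("utf-8")).hexdigest()

    try:
        with open(state_path, "r", encoding="utf-8") as f:
            old_digest = f.read().strip()
    except FileNotFoundError:
        old_digest = None  # first run: nothing to compare against

    if old_digest == new_digest:
        return False

    # Write the new digest to a temp file and rename it into place, so a
    # crash in the middle never leaves a half-written state file behind.
    state_dir = os.path.dirname(os.path.abspath(state_path))
    fd, tmp_path = tempfile.mkstemp(dir=state_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(new_digest + "\n")
    os.replace(tmp_path, state_path)

    return old_digest is not None
```

A cron job could pass in the contents of /proc/mdstat (and, separately, the
SMART attributes of interest) and send a mail through a remote SMTP server,
for example with Python's smtplib, whenever the function returns True. That
avoids needing a local MTA. Note that /proc/mdstat also changes while a
resync or check is running, so a real script would probably filter out the
progress lines first.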

> But why not check parity during normal read operations? Was that a
> performance decision?

I don't know, but I do think it would hurt performance considerably.


> It is not _that_ bad that it is not done during normal
> operation, since the good distros schedule a regular check, but could it
> be made controllable by something like echo "1" >
> /proc/sys/dev/raid/always_read_parity ?

Well, I think making an optional check would be fine.
I don't know if it could be done in a way that does not hurt performance,
such as being delayed or running at a lower I/O priority.
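
To illustrate where the cost comes from, here is a rough Python sketch of a
single-parity (RAID4/5 style) consistency check on one stripe. It is not the
md code; the function names xor_chunks and stripe_is_consistent are made up
for the example. The point is that verifying even one chunk means reading
every data chunk of the stripe plus the parity chunk.

```python
def xor_chunks(chunks):
    """XOR a list of equal-length byte strings together."""
    if not chunks:
        raise ValueError("need at least one chunk")
    length = len(chunks[0])
    if any(len(c) != length for c in chunks):
        raise ValueError("all chunks must have the same length")

    result = bytearray(length)
    for chunk in chunks:
        for i, byte in enumerate(chunk):
            result[i] ^= byte
    return bytes(result)


def stripe_is_consistent(data_chunks, parity_chunk):
    """True if the stored parity equals the XOR of all data chunks.

    Every chunk of the stripe has to be read for this, which is why a
    check on each normal read multiplies the I/O per request.
    """
    return xor_chunks(data_chunks) == parity_chunk
```

With single parity a mismatch only says that something in the stripe is
wrong, not which chunk is wrong, which leads to the question of which data
to believe.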

> >Which data to believe could then be determined according to a number
> >of techniques, like for a 3 copy array the best 2 out of 3,
> >investigating the error log of the drives, and relaying the error
> >information to the file system layer for manual inspection and repair.
> >   
> That is a matter of "believe" and "best guess" and not "knowing" which 
> contains the correct data in redundant array levels, hence the 
> suggestion from before to include a timer + ECC (or better) at the raid 
> level, so we actually _know_ which is the newest, and we _know_ which 
> stripe does have consistent data, no guessing needed, we can apply 
> crystal clear rules.
> My ruleset would be:
> first use: newest time and correct ECC
> second use: newest time and correctable ECC
> third use: any time and correct ECC (hint possible filesystem error to
> the next layer)
> fourth use: any time and correctable ECC (hint possible filesystem error
> to the next layer)
> fifth use: Current implementation, use the data from the active drive
> ordering according to the list in the superblock + hint possible
> filesystem error to the next layer.
> A raid aware filesystem would be perfect (compare with ZFS on Solaris) 
> eliminating the write hole problem, doing the checksum at raid level 
> makes it more flexible.

Interesting ideas
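
The five rules quoted above could be sketched roughly like this in Python.
All names here (Ecc, Copy, choose_copy) are hypothetical, and md stores no
such per-block timestamp or ECC today. Picking the newest copy when several
older ones match a rule is an extra assumption that the quoted ruleset does
not spell out.

```python
from dataclasses import dataclass
from enum import Enum


class Ecc(Enum):
    CORRECT = "correct"
    CORRECTABLE = "correctable"
    BAD = "bad"


@dataclass
class Copy:
    device: str      # e.g. "sda"; purely illustrative
    timestamp: int   # hypothetical per-block write time
    ecc: Ecc         # hypothetical per-block ECC state


def choose_copy(copies):
    """Pick a copy from `copies`, given in superblock device order.

    Returns (copy, hint_fs_error); the flag is True when a possible
    filesystem error should be reported to the next layer.
    """
    if not copies:
        raise ValueError("no copies to choose from")
    newest = max(c.timestamp for c in copies)

    rules = [
        # (predicate, hint to next layer)
        (lambda c: c.timestamp == newest and c.ecc is Ecc.CORRECT, False),
        (lambda c: c.timestamp == newest and c.ecc is Ecc.CORRECTABLE, False),
        (lambda c: c.ecc is Ecc.CORRECT, True),
        (lambda c: c.ecc is Ecc.CORRECTABLE, True),
    ]
    for predicate, hint in rules:
        matches = [c for c in copies if predicate(c)]
        if matches:
            # Assumption: among several matches, prefer the newest one.
            return max(matches, key=lambda c: c.timestamp), hint

    # Fifth rule: fall back to the first device in superblock order.
    return copies[0], True
```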

> >I would expect this is not something that occurs frequently, so maybe
> >once a year for the unlucky or systems with many disks.
> >   
> If you are paranoid about corrupting really important data, once in 5
> years is too much. Implementing the checksum + timestamp would lift Linux
> software RAID to the next level, closer to enterprise where such
> techniques are actually in use. At its current level it is very good
> and solid, so it is time to get to the next level for long-term archiving.

I was not trying to say this is not important, but rather that error
correction could be done by manual intervention, given that it is not so
frequent. Or at least that manual correction should be one of the
implemented ways of addressing it.

best regards
keld
