linux-raid.vger.kernel.org archive mirror
From: Neil Brown <neilb@suse.de>
To: Alberto Bertogli <albertito@blitiri.com.ar>
Cc: Goswin von Brederlow <goswin-v-b@web.de>,
	linux-kernel@vger.kernel.org, dm-devel@redhat.com,
	linux-raid@vger.kernel.org, agk@redhat.com
Subject: Re: [RFC PATCH] dm-csum: A new device mapper target that checks data integrity
Date: Sun, 28 Jun 2009 10:34:17 +1000
Message-ID: <19014.47753.69063.510164@notabene.brown>
In-Reply-To: message from Alberto Bertogli on Tuesday May 26

On Tuesday May 26, albertito@blitiri.com.ar wrote:
> On Tue, May 26, 2009 at 12:33:01PM +0200, Goswin von Brederlow wrote:
> > Alberto Bertogli <albertito@blitiri.com.ar> writes:
> > > On Mon, May 25, 2009 at 02:22:23PM +0200, Goswin von Brederlow wrote:
> > >> Alberto Bertogli <albertito@blitiri.com.ar> writes:
> > >> > I'm writing this device mapper target that stores checksums on writes and
> > >> > verifies them on reads.
> > >> 
> > >> How does that behave on crashes? Will checksums be out of sync with data?
> > >> Will pending blocks recalculate their checksum?
> > >
> > >    To guarantee consistency, two imd sectors (named M1 and M2) are kept for
> > >    every 62 data sectors, and the following procedure is used to update them
> > >    when a write to a given sector is required:
> > >
> > >     - Read both M1 and M2.
> > >     - Find out (using information stored in their headers) which one is newer.
> > >       Let's assume M1 is newer than M2.
> > >     - Update the M2 buffer to mark it's newer, and update the new data's CRC.
> > >     - Submit the write to M2, and then the write to the data, using a barrier
> > >       to make sure the metadata is updated _after_ the data.
> > 
> > Consider that the disk writes the data and then the system
> > crashes. Now you have the old checksum but the new data. The checksum
> > is out of sync.
> > 
> > Don't you mean that M2 is written _before_ the data? That way you have
> > the old checksum in M1 and the new in M2. One of them will match
> > depending on whether the data gets written before a crash or not. That
> > would be more consistent with your read operation below.
> 
> Yes, the comment is wrong, thanks for noticing. That is how it's implemented.
> 
> 
> > >    Accordingly, the read operations are handled as follows:
> > >
> > >     - Read both the data, M1 and M2.
> > >     - Find out which one is newer. Let's assume M1 is newer than M2.
> > >     - Calculate the data's CRC, and compare it to the one found in M1. If they
> > >       match, the reading is successful. If not, compare it to the one found in
> > >       M2. If they match, the reading is successful; otherwise, fail. If
> > >       the read involves multiple sectors, it is possible that some of the
> > >       correct CRCs are in M1 and some in M2.
> > >
> > >
> > > The barrier will be (it's not done yet) replaced with serialized writes for
> > > cases where the underlying block device does not support them, or when the
> > > integrity metadata resides on a different block device than the data.
> > >
> > >
> > > This scheme assumes writes to a single sector are atomic in the presence of
> > > normal crashes, which I'm not sure is a sane thing to assume in
> > > practice. If it's not, then the scheme can be modified to cope with that.
> > 
> > What happens if you have multiple writes to the same sector? (assuming
> > you meant "before" above)
> > 
> > - user writes to sector
> > - queue up write for M1 and data1
> > - M1 writes
> > - user writes to sector
> > - queue up writes for M2 and data2
> > - data1 is thrown away as data2 overwrites it
> > - M2 writes
> > - system crashes
> > 
> > Now both M1 and M2 have a different checksum than the old data left on
> > disk.
> > 
> > Can this happen?
> 
> No, parallel writes that affect the same metadata sectors will not be allowed.
> At the moment there is a rough lock which does not allow simultaneous updates
> at all, I plan to make that more fine-grained in the future.

Can I suggest a variation on the above which, I think, can cause a
problem.

 - user writes data-A' to sector-A (which currently contains data-A)
 - queue up write for M1 and data-A'
 - M1 is written correctly.
 - power fails (before data-A' is written)
reboot
 - read sector-A, find data-A which matches checksum on M2, so
   success.

So everything is working perfectly so far...

 - write sector-B (in same 62-sector range as sector-A).
 - queue up write for M2 and data-B
 - those writes complete
 - read sector-A.  find data-A, which doesn't match M1 (which holds the
   checksum of data-A') and doesn't match M2 (which is now a copy of M1
   plus the new checksum for data-B), so the read fails.


i.e. you get a situation where writing one sector can cause another
sector to spontaneously fail.
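
The sequence above can be reproduced in a small user-space model. This
is only a sketch in Python, not dm-csum code: the Group class, its
write() and read() methods, the crash_before_data flag and the sector
names are all made up for illustration, and it assumes (from the
description quoted above) that the older metadata copy is rebuilt from
the newer one before the new CRC is recorded, and that metadata is
written before data.

```python
import zlib


class Group:
    """Hypothetical model of one group: data sectors plus metadata M1/M2."""

    def __init__(self, initial):
        self.data = dict(initial)
        crcs = {s: zlib.crc32(d) for s, d in initial.items()}
        # Each metadata copy holds a generation counter and one CRC per
        # sector.  M2 starts out newer, so the first write goes to M1.
        self.meta = {"M1": {"gen": 0, "crc": dict(crcs)},
                     "M2": {"gen": 1, "crc": dict(crcs)}}

    def _order(self):
        names = sorted(self.meta, key=lambda n: self.meta[n]["gen"])
        return names[0], names[1]  # (older, newer)

    def write(self, sector, payload, crash_before_data=False):
        older, newer = self._order()
        # Assumption: the older copy is rebuilt from the newer one, then
        # the CRC of the sector being written is replaced.
        crc = dict(self.meta[newer]["crc"])
        crc[sector] = zlib.crc32(payload)
        self.meta[older] = {"gen": self.meta[newer]["gen"] + 1, "crc": crc}
        if crash_before_data:
            return older  # metadata reached the disk, the data did not
        self.data[sector] = payload
        return older  # name of the metadata copy that was updated

    def read(self, sector):
        # A read succeeds if the data's CRC matches either metadata copy.
        actual = zlib.crc32(self.data[sector])
        return any(m["crc"][sector] == actual for m in self.meta.values())


g = Group({"A": b"data-A-old", "B": b"data-B-old"})
first = g.write("A", b"data-A-new", crash_before_data=True)
ok_after_crash = g.read("A")   # old data still matches M2
second = g.write("B", b"data-B-new")
ok_after_b = g.read("A")       # neither copy holds the old CRC any more
print(first, ok_after_crash, second, ok_after_b)
# prints: M1 True M2 False
```

In this model the read of sector-A succeeds right after the crash and
fails once the unrelated write to sector-B has completed, because that
write replaces the only remaining copy of the old checksum.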

NeilBrown


