From: Chris Mason <chris.mason@oracle.com>
To: Nick Piggin <npiggin@suse.de>
Cc: "Martin K. Petersen" <martin.petersen@oracle.com>,
	James Bottomley <James.Bottomley@suse.de>,
	Matthew Wilcox <matthew@wil.cx>,
	Christof Schmitt <christof.schmitt@de.ibm.com>,
	Boaz Harrosh <bharrosh@panasas.com>,
	linux-scsi@vger.kernel.org, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org
Subject: Re: Wrong DIF guard tag on ext2 write
Date: Thu, 3 Jun 2010 11:46:34 -0400
Message-ID: <20100603154634.GC8980@think>
In-Reply-To: <20100602134121.GD6152@laptop>

On Wed, Jun 02, 2010 at 11:41:21PM +1000, Nick Piggin wrote:
> On Wed, Jun 02, 2010 at 09:17:56AM -0400, Martin K. Petersen wrote:
> > >>>>> "Nick" == Nick Piggin <npiggin@suse.de> writes:
> > 
> > >> 1) filesystem changed it
> > >> 2) corruption on the wire or in the raid controller
> > >> 3) the page was corrupted while the IO layer was doing the IO.
> > >> 
> > >> 1 and 2 are easy, we bounce, retry and everyone continues on with
> > >> their lives.  With #3, we'll recrc and send the IO down again
> > >> thinking the data is correct when really we're writing garbage.
> > >> 
> > >> How can we tell these three cases apart?
> > 
> > Nick> Do we really need to handle #3? It could have happened before the
> > Nick> checksum was calculated.
> > 
> > Reason #3 is one of the main reasons for having the checksum in the
> > first place.  The whole premise of the data integrity extensions is that
> > the checksum is calculated in close temporal proximity to the data
> > creation.  I.e. eventually in userland.
> > 
> > Filesystems will inevitably have to be integrity-aware for that to work.
> > And it will be their job to keep the data pages stable during DMA.
> 
> Let's just think hard about what windows can actually be closed versus
> how much effort goes in to closing them. I also prefer not to accept
> half-solutions in the kernel because they don't want to implement real
> solutions in hardware (it's pretty hard to checksum and protect all
> kernel data structures by hand).
> 
> For "normal" writes into pagecache, the data can get corrupted anywhere
> from after it is generated in userspace, during the copy, while it is
> dirty in cache, and while it is being written out.

This is why the DIF/DIX spec has the idea of a crc generated in userland
when the data is generated.  At any rate the basic idea is to crc early
but not often...recalculating the crc after we hand our precious memory
to the evil device driver does weaken the end-to-end integrity checks.

What I don't want to do is weaken the basic DIF/DIX structure by letting
the lower layers recrc stuff as they find faults.  It would be fine if we
had some definitive way to say: the FS raced, just recrc, but we really
don't.

> 
> Closing the while it is dirty, while it is being written back window
> still leaves a pretty big window. Also, how do you handle mmap writes?
> Write protect and checksum the destination page after every store? Or
> leave some window between when the pagecache is dirtied and when it is
> written back? So I don't know whether it's worth putting a lot of effort
> into this case.

So, changing gears from the generic idea of DIF/DIX to how we protect
filesystem page cache pages: btrfs crcs just before writing, which does
leave a pretty big window for the page to get corrupted.  The storage
layer shouldn't care or know about that though.  We hand it a crc and it
makes sure data matching that crc goes to the media.
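
To make the contract concrete, here is a minimal sketch of "hand it a crc
and it makes sure data matching that crc goes to the media", and of why a
blind recrc hides a changed page.  It assumes the T10 DIF guard tag CRC
(16 bits, polynomial 0x8BB7, initial value 0).  storage_write() is a
hypothetical stand-in for everything below the filesystem, not a real
kernel or driver interface.

```python
def crc16_t10dif(data: bytes) -> int:
    """Bitwise CRC-16 with the T10 DIF polynomial 0x8BB7, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x8BB7) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def storage_write(sector: bytes, guard: int) -> bool:
    """Hypothetical storage layer: accept the write only if data matches guard."""
    return crc16_t10dif(sector) == guard


# crc early: the guard is computed once, when the data is handed over.
page = bytearray(b"A" * 512)
guard = crc16_t10dif(bytes(page))

# The page changes while the IO is in flight (FS race or memory corruption;
# the storage layer cannot tell which).
page[100] ^= 0x01

# With the original guard the mismatch is caught.
rejected = not storage_write(bytes(page), guard)

# A lower layer that simply recrcs makes the write "succeed", whatever
# the reason for the change was.
accepted_after_recrc = storage_write(bytes(page), crc16_t10dif(bytes(page)))
```

Here rejected and accepted_after_recrc both end up True: the original
guard catches the change, and the recalculated one lets the same bytes
through without telling an FS race apart from garbage.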

> 
> If you had an interface for userspace to insert checksums to direct IO
> requests or pagecache ranges, then not only can you close the entire gap
> between userspace data generation, and writeback. But you also can
> handle mmap writes and anything else just fine: userspace knows about
> the concurrency details, so it can add the right checksum (and
> potentially fsync etc) when it's ready.

Yeah, I do agree here.

-chris
