From: Chris Mason <chris.mason@oracle.com>
To: Jan Kara <jack@suse.cz>
Cc: Boaz Harrosh <bharrosh@panasas.com>, djwong <djwong@us.ibm.com>,
Jens Axboe <axboe@kernel.dk>,
linux-kernel <linux-kernel@vger.kernel.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Mingming Cao <mcao@us.ibm.com>,
linux-scsi <linux-scsi@vger.kernel.org>
Subject: Re: [RFC] block integrity: Fix write after checksum calculation problem
Date: Tue, 22 Feb 2011 08:02:10 -0500
Message-ID: <1298379714-sup-3015@think>
In-Reply-To: <20110222114222.GA24728@quack.suse.cz>
[ resend sorry if you get this twice ]
Excerpts from Jan Kara's message of 2011-02-22 06:42:22 -0500:
> Hi Boaz,
>
> On Mon 21-02-11 21:45:51, Boaz Harrosh wrote:
> > On 02/21/2011 06:00 PM, Darrick J. Wong wrote:
> > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> > > discussion about how to deal with the situation where one program tells the
> > > kernel to write a block to disk, the kernel computes the checksum of that data,
> > > and then a second program begins writing to that same block before the disk HBA
> > > can DMA the memory block, thereby causing the disk to complain about being sent
> > > invalid checksums.
> >
> > The brokenness is in ext2/3. If you use btrfs, xfs, or I think late versions
> > of ext4, it should work much better. (If you still have problems, please report
> > them; those FSs advertise stable-page write-out.)
> Do they? I've just checked ext4 and xfs and they don't seem to enforce
> stable pages. They do lock the page (which implicitly happens in mm code
> for any filesystem BTW) but this is not enough. You have to wait for
> PageWriteback to get cleared and only btrfs does that.
>
> > This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
> > and syncing with write-out, for example by taking the page-lock. Currently each
> > FS handles this itself, because doing it in VFS would force the behaviour on
> > FSs for which it does not make sense.
> Yes, it's easy to fix but at a performance cost for any application doing
> frequent rewrites, regardless of whether integrity features are used or not.
> And I don't think that's a good thing. I even remember someone measured the
> hit last time this came up and it was rather noticeable.
Do you remember which workload this was? I do remember someone
mentioning a specific workload, but can't recall which one now. fsx is
definitely slower when we wait for writeback, but that's because it's
all evil inside.
>
> > Note that the proper solution does not copy any data, just forces the app to
> > wait before changing write-out pages.
> I think that's up for discussion. In fact what is going to be faster
> depends pretty much on your system config. If you have enough CPU/RAM
> bandwidth compared to storage speed, you're better off doing the copy. If
> you can barely saturate storage with your CPU/RAM, waiting is probably
> better for you.
>
> Moreover if you do data copyout, you push the performance cost only on
> users of the integrity feature which is nice. But on the other hand users
> of integrity take the cost even if they are not doing rewrites.
>
> A solution which is technically plausible and penalizing only rewrites
> of data-integrity protected pages would be a use of shadow pages as Darrick
> describes below. So I'd lean towards that long term. But for now I think
> Darrick's solution is OK to make the integrity feature actually useful and
> later someone can try something more clever.
Rewrites in flight should be very rare though, and I think the bouncing
is going to have a big impact on the intended workloads. It's not just
the cost of the copy, it's also the increased time as we beat on the
page allocator.
We're working on adding stable pages to ext3/4 and the other filesystems
that lack them. When the work is done we can benchmark and decide on the
tradeoffs.
-chris
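
For readers following along, the race and the two candidate fixes discussed
in this message (waiting on writeback versus copying to a bounce buffer) can
be sketched as a toy userspace model. This is not kernel code: Page, submit,
rewrite, complete and run are hypothetical names invented for illustration,
zlib.crc32 merely stands in for the DIF guard tag, and "waiting" is modeled
by deferring the second write until writeback completes.

```python
import zlib


class Page:
    """Toy stand-in for a page-cache page; not a kernel structure."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.writeback = False      # models the PageWriteback flag
        self.deferred = []          # writes that had to wait for writeback


def submit(page, bounce=False):
    """Start writeback: pick the I/O buffer and compute its guard checksum."""
    page.writeback = True
    # With bouncing, the device sees a private copy; otherwise the live page.
    buf = bytearray(page.data) if bounce else page.data
    return buf, zlib.crc32(buf)


def rewrite(page, offset, new, stable=False):
    """A second writer dirties the page while I/O may be in flight."""
    if stable and page.writeback:
        # Stable pages: the writer "waits" until writeback has finished.
        page.deferred.append((offset, new))
        return
    page.data[offset:offset + len(new)] = new


def complete(page, buf, guard):
    """The device DMAs the buffer and verifies the checksum it was given."""
    ok = zlib.crc32(buf) == guard
    page.writeback = False
    for offset, new in page.deferred:   # waiters proceed now
        page.data[offset:offset + len(new)] = new
    page.deferred.clear()
    return ok


def run(stable=False, bounce=False):
    """Return (checksum verified?, first four bytes of the page afterwards)."""
    page = Page(b"A" * 4096)
    buf, guard = submit(page, bounce=bounce)
    rewrite(page, 0, b"BBBB", stable=stable)  # lands between checksum and DMA
    ok = complete(page, buf, guard)
    return ok, bytes(page.data[:4])
```

In this model run() reports a failed checksum, while run(stable=True) and
run(bounce=True) both verify and still end with the rewritten data in the
page. The model says nothing about cost: the copy in submit() is paid on
every bounced write, whereas the deferral in rewrite() is paid only when a
rewrite actually hits a page under writeback, which is the tradeoff being
debated here.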