From: "Darrick J. Wong" <djwong@us.ibm.com>
To: Greg Freemyer <greg.freemyer@gmail.com>
Cc: Andreas Dilger <adilger.kernel@dilger.ca>,
Theodore Tso <tytso@mit.edu>,
Sunil Mushran <sunil.mushran@oracle.com>,
Amir Goldstein <amir73il@gmail.com>,
linux-kernel <linux-kernel@vger.kernel.org>,
Andi Kleen <andi@firstfloor.org>, Mingming Cao <cmm@us.ibm.com>,
Joel Becker <jlbec@evilplan.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
linux-ext4@vger.kernel.org, Coly Li <colyli@gmail.com>
Subject: Re: [PATCH v1 00/16] ext4: Add metadata checksumming
Date: Fri, 2 Sep 2011 11:22:14 -0700 [thread overview]
Message-ID: <20110902182214.GC12086@tux1.beaverton.ibm.com> (raw)
In-Reply-To: <CAGpXXZ+Cq9_aS4BPC5BKNj6aWHzzL7+e8ZQKO3=C7UV+xcnqWA@mail.gmail.com>
On Fri, Sep 02, 2011 at 10:15:27AM -0400, Greg Freemyer wrote:
> On Wed, Aug 31, 2011 at 8:30 PM, Darrick J. Wong <djwong@us.ibm.com> wrote:
> > Hi all,
> >
> > This patchset adds crc32c checksums to most of the ext4 metadata objects. A
> > full design document is on the ext4 wiki[1] but I will summarize that document here.
> >
> > As much as we wish our storage hardware was totally reliable, it is still
> > quite possible for data to be corrupted on disk, corrupted during transfer over
> > a wire, or written to the wrong places. To protect against this sort of
> > non-hostile corruption, it is desirable to store checksums of metadata objects
> > on the filesystem to prevent broken metadata from shredding the filesystem.
> >
> > The crc32c polynomial was chosen for its improved error detection capabilities
> > over crc32 and crc16, and because of its hardware acceleration on current and
> > upcoming Intel and Sparc chips.
> >
> > Each type of metadata object has been retrofitted to store a checksum as follows:
> >
> > - The superblock stores a crc32c of itself.
> > - Each inode stores crc32c(fs_uuid + inode_num + inode + slack_space_after_inode)
> > - Block and inode bitmaps each get their own crc32c(fs_uuid + group_num +
> > bitmap), stored in the block group descriptor.
> > - Each extent tree block stores a crc32c(fs_uuid + inode_num + extent_entries)
> > in unused space at the end of the block.
> > - Each directory leaf block has an unused-looking directory entry big enough to
> > store a crc32c(fs_uuid + inode_num + block) at the end of the block.
> > - Each directory htree block is shortened to contain a crc32c(fs_uuid +
> > inode_num + block) at the end of the block.
> > - Extended attribute blocks store crc32c(fs_uuid + block_no + ea_block) in the
> > header.
> > - Journal commit blocks can be converted to use crc32c to checksum all blocks
> > in the transaction, if journal_checksum is given.
> >
> > The first four patches in the kernel patchset fix existing bugs in ext4 that
> > cause incorrect checkums to be written. I think Ted already took them, but
> > with recent instability I'm resending them to be cautious. The subsequent 12
> > patches add the necessary code to support checksumming in ext4 and jbd2.
> >
> > I also have a set of three patches that provide a faster crc32c implementation
> > based on Bob Pearson's earlier crc32 patchset. This will be sent under
> > separate cover to the crypto list and to lkml/linux-ext4.
> >
> > The patchset for e2fsprogs will be sent under separate cover only to linux-ext4
> > as it is quite lengthy (~36 patches).
> >
> > As far as performance impact goes, I see nearly no change with a standard mail
> > server ffsb simulation. On a test that involves only file creation and
> > deletion and extent tree modifications, I see a drop of about 50 percent with
> > the current kernel crc32c implementation; this improves to a drop of about 20
> > percent with the enclosed crc32c implementation. However, given that metadata
> > is usually a small fraction of total IO, it doesn't seem like the cost of
> > enabling this feature is unreasonable.
> >
> > There are of course unresolved issues:
> >
> > - What to do when the block group descriptor isn't big enough to hold 2 crc32s
> > (which is the case with 32-bit ext4 filesystems, sadly). I'm not quite
> > convinced that truncating a 32-bit checksum to 16-bits is a safe idea. Right
> > now, one has to enable the 64bit feature to enable any bitmap checksums.
> > I'm not sure how effective crc16 is at checksumming 32768-bit bitmaps.
> >
> > - Using the journal commit hooks to delay crc32c calculation until dirty
> > buffers are actually being written to disk.
> >
> > - Can we get away with using a (hw accelerated) LE crc32c for jbd2, which
> > stores its data in BE order?
> >
> > - Interaction with online resize code. Yongqiang seems to be in the process of
> > rewriting this, so I haven't looked at it very closely yet.
> >
> > - If block group descriptors can now exceed 32 bytes (when 64bit filesystem
> > support is enabled), should we use crc32c instead of crc16? From what I've
> > read of the literature, crc16 is not very effective on datasets exceeding 256
> > bytes.
> >
> > Please have a look at the design document and patches, and please feel free to
> > suggest any changes. I will be at LPC next week if anyone wishes to discuss,
> > debate, or protest.
> >
> > --D
> >
> > [1] https://ext4.wiki.kernel.org/index.php/Ext4_Metadata_Checksums
>
> Derrick,
>
> Brainstorming only:
>
> Another thing you might consider is to somehow tie into the data
> integrity patches that went into the kernel a couple years ago. Those
> are tied to specialized storage devices (typically scsi) that can
> actually have the checksum live on the disk, but not in the normal
> data area. ie. in the sector header / footer or some other out of
> band area.
>
> At a minimum, it may make sense to use the same CRC which that API
> does. Then you could calculate the CRC once and put it both in-band
> in the inode and out-of-band in the dedicated integrity area of
> supporting storage devices.
If you have the necessary DIF/DIX hardware and kernel support then every block
in the FS is already checksummed and you don't need metadata_csum at all. This
patchset was really intended for setups where we don't have DIF/DIX.
Furthermore, the nice thing about the in-filesystem checksum is that we bake in
other things like the FS UUID and the inode number, which gives you a somewhat
better assurance that the data block belongs to the fs and the file that the
code think it belongs to. The DIX interface allows for a 32-bit block number
and a 16-bit application tag ... which is unfortunately small given 64-bit
block numbers and 32-bit inode numbers.
I guess there's also an argument that from a layering perspective it's
desirable to have a FS image whose integrity features remain intact even if you
copy the image to a different device that doesn't support the hardware feature.
As a side note, the crc-t10dif implementation is quite slow -- the hardware
accelerated crc32c is 15x faster, and the sw implementation is usually 3-6x
faster. I suspect somebody will want to fix that before DIF becomes more
widespread...
> That if the data is corrupted on the wire as an example, the
> controller itself may be able to verify its a bad crc and ask for a
> re-read without even involving the kernel.
>
> I believe supporting hardware is rare, but if the kernel is going to
> have a data integrity API to support it at all, then code like this is
> exactly the kind of code that should layer on top of it.
The good news is that if you're really worried about integrity, metadata_csum
and DIF/DIX aren't mutually exclusive features. Rejecting corrupted write
commands at write time seems like a useful feature. :)
--D
--
To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2011-09-02 18:22 UTC|newest]
Thread overview: 48+ messages / expand[flat|nested] mbox.gz Atom feed top
2011-09-01 0:30 [PATCH v1 00/16] ext4: Add metadata checksumming Darrick J. Wong
2011-09-01 0:30 ` [PATCH 01/16] ext4: ext4_dx_add_entry should dirty directory metadata with the directory inode Darrick J. Wong
2011-09-01 0:30 ` [PATCH 02/16] ext4: ext4_rename should dirty dir_bh with the correct directory Darrick J. Wong
2011-09-01 0:30 ` [PATCH 03/16] ext4: ext4_mkdir should dirty dir_block with the parent inode Darrick J. Wong
2011-09-01 0:30 ` [PATCH 04/16] ext4: Create a new BH_Verified flag to avoid unnecessary metadata validation Darrick J. Wong
2011-09-01 0:31 ` [PATCH 05/16] ext4: Create a rocompat flag for extended metadata checksumming Darrick J. Wong
2011-09-01 0:31 ` [PATCH 06/16] ext4: Calculate and verify inode checksums Darrick J. Wong
2011-09-01 2:30 ` Andreas Dilger
2011-09-02 19:32 ` Darrick J. Wong
2011-09-02 22:02 ` Andreas Dilger
2011-09-05 17:57 ` Darrick J. Wong
2011-09-01 0:31 ` [PATCH 07/16] ext4: Create bitmap checksum helper functions Darrick J. Wong
2011-09-01 0:31 ` [PATCH 08/16] ext4: Calculate and verify checksums for inode bitmaps Darrick J. Wong
2011-09-01 4:49 ` Andreas Dilger
2011-09-02 19:18 ` Darrick J. Wong
2011-09-02 21:27 ` Andreas Dilger
2011-09-05 18:22 ` Darrick J. Wong
2011-09-05 18:27 ` Andi Kleen
2011-09-05 19:45 ` James Bottomley
2011-09-05 22:12 ` Darrick J. Wong
2011-09-01 0:31 ` [PATCH 09/16] ext4: Calculate and verify block bitmap checksum Darrick J. Wong
[not found] ` <2492E720-3316-4561-8C9C-BBC6E8670EAD@dilger.ca>
2011-09-02 19:08 ` Darrick J. Wong
2011-09-02 21:06 ` Andreas Dilger
2011-09-01 0:31 ` [PATCH 10/16] ext4: Verify and calculate checksums for extent tree blocks Darrick J. Wong
[not found] ` <26584BE9-B716-40D5-B3B4-8C5912869648@dilger.ca>
2011-09-02 19:02 ` Darrick J. Wong
2011-09-01 0:31 ` [PATCH 11/16] ext4: Calculate and verify checksums for htree nodes Darrick J. Wong
2011-09-01 0:31 ` [PATCH 12/16] ext4: Calculate and verify checksums of directory leaf blocks Darrick J. Wong
[not found] ` <4D5D8FB2-0D8A-4D30-B6D8-51158395C1C9@dilger.ca>
2011-09-02 18:57 ` Darrick J. Wong
2011-09-02 20:52 ` Andreas Dilger
2011-09-05 18:30 ` Darrick J. Wong
2011-09-01 0:32 ` [PATCH 13/16] ext4: Calculate and verify checksums of extended attribute blocks Darrick J. Wong
[not found] ` <F4A42DCE-F91C-419B-9153-65A7EA91D241@dilger.ca>
2011-09-02 18:43 ` Darrick J. Wong
2011-09-01 0:32 ` [PATCH 14/16] ext4: Make inode checksum cover empty space Darrick J. Wong
[not found] ` <4540D5DB-53D9-4C33-BC6B-868870D42AF3@dilger.ca>
2011-09-02 18:42 ` Darrick J. Wong
2011-09-01 0:32 ` [PATCH 15/16] ext4: Calculate and verify superblock checksum Darrick J. Wong
[not found] ` <2882FBB2-797C-4D27-8569-B6826DD34F68@dilger.ca>
2011-09-02 18:40 ` Darrick J. Wong
2011-09-01 0:32 ` [PATCH 16/16] jbd2: Support CRC32C transaction checksums Darrick J. Wong
[not found] ` <99019900-30B1-450A-9620-E94371A30CC6@dilger.ca>
2011-09-02 18:31 ` Darrick J. Wong
2011-09-02 14:15 ` [PATCH v1 00/16] ext4: Add metadata checksumming Greg Freemyer
2011-09-02 18:22 ` Darrick J. Wong [this message]
2011-09-04 11:41 ` Martin K. Petersen
2011-09-04 16:54 ` Andi Kleen
2011-09-04 17:17 ` Martin K. Petersen
2011-09-04 17:44 ` Andi Kleen
2011-09-04 22:19 ` Martin K. Petersen
2011-09-05 18:55 ` Darrick J. Wong
2011-09-05 18:45 ` Darrick J. Wong
2011-09-06 12:59 ` Martin K. Petersen
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20110902182214.GC12086@tux1.beaverton.ibm.com \
--to=djwong@us.ibm.com \
--cc=adilger.kernel@dilger.ca \
--cc=amir73il@gmail.com \
--cc=andi@firstfloor.org \
--cc=cmm@us.ibm.com \
--cc=colyli@gmail.com \
--cc=greg.freemyer@gmail.com \
--cc=jlbec@evilplan.org \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=sunil.mushran@oracle.com \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).