From: "Lukáš Czerner" <lczerner@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
Subject: Re: [PATCH 16/37] libext2fs: support allocating uninit blocks in bmap2()
Date: Wed, 7 May 2014 12:02:30 +0200 (CEST) [thread overview]
Message-ID: <alpine.LFD.2.00.1405071045090.2128@localhost.localdomain> (raw)
In-Reply-To: <20140506195938.GD25745@birch.djwong.org>
[-- Attachment #1: Type: TEXT/PLAIN, Size: 8066 bytes --]
On Tue, 6 May 2014, Darrick J. Wong wrote:
> Date: Tue, 6 May 2014 12:59:38 -0700
> From: Darrick J. Wong <darrick.wong@oracle.com>
> To: Lukáš Czerner <lczerner@redhat.com>
> Cc: tytso@mit.edu, linux-ext4@vger.kernel.org
> Subject: Re: [PATCH 16/37] libext2fs: support allocating uninit blocks in
> bmap2()
>
> On Tue, May 06, 2014 at 05:45:01PM +0200, Lukáš Czerner wrote:
> > On Thu, 1 May 2014, Darrick J. Wong wrote:
> >
> > > Date: Thu, 01 May 2014 16:14:07 -0700
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > > To: tytso@mit.edu, darrick.wong@oracle.com
> > > Cc: linux-ext4@vger.kernel.org
> > > Subject: [PATCH 16/37] libext2fs: support allocating uninit blocks in bmap2()
> > >
> > > In order to support fallocate, we need to be able to have
> > > ext2fs_bmap2() allocate blocks and put them into uninitialized
> > > extents. There's a flag to do this in the extent code, but it's not
> > > exposed to the bmap2 interface, so plumb that in. Eventually fuse2fs
> > > or somebody will use it.
> > >
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> > > ---
> > > lib/ext2fs/bmap.c | 24 ++++++++++++++++++++++--
> > > lib/ext2fs/ext2fs.h | 1 +
> > > lib/ext2fs/mkjournal.c | 17 +++++++++++++++++
> > > 3 files changed, 40 insertions(+), 2 deletions(-)
> > >
> > >
> > > diff --git a/lib/ext2fs/bmap.c b/lib/ext2fs/bmap.c
> > > index c1d0e6f..a4dc8ef 100644
> > > --- a/lib/ext2fs/bmap.c
> > > +++ b/lib/ext2fs/bmap.c
> > > @@ -72,6 +72,11 @@ static _BMAP_INLINE_ errcode_t block_ind_bmap(ext2_filsys fs, int flags,
> > > block_buf + fs->blocksize, &b);
> > > if (retval)
> > > return retval;
> > > + if (flags & BMAP_UNINIT) {
> > > + retval = ext2fs_zero_blocks2(fs, b, 1, NULL, NULL);
> > > + if (retval)
> > > + return retval;
> > > + }
> > >
> > > #ifdef WORDS_BIGENDIAN
> > > ((blk_t *) block_buf)[nr] = ext2fs_swab32(b);
> > > @@ -214,10 +219,13 @@ static errcode_t extent_bmap(ext2_filsys fs, ext2_ino_t ino,
> > > errcode_t retval = 0;
> > > blk64_t blk64 = 0;
> > > int alloc = 0;
> > > + int set_flags;
> > > +
> > > + set_flags = bmap_flags & BMAP_UNINIT ? EXT2_EXTENT_SET_BMAP_UNINIT : 0;
> > >
> > > if (bmap_flags & BMAP_SET) {
> > > retval = ext2fs_extent_set_bmap(handle, block,
> > > - *phys_blk, 0);
> > > + *phys_blk, set_flags);
> > > return retval;
> > > }
> > > retval = ext2fs_extent_goto(handle, block);
> > > @@ -254,7 +262,7 @@ got_block:
> > > alloc++;
> > > set_extent:
> > > retval = ext2fs_extent_set_bmap(handle, block,
> > > - blk64, 0);
> > > + blk64, set_flags);
> > > if (retval) {
> > > ext2fs_block_alloc_stats2(fs, blk64, -1);
> > > return retval;
> > > @@ -345,6 +353,12 @@ errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
> > > goto done;
> > > }
> > >
> > > + if ((bmap_flags & BMAP_SET) && (bmap_flags & BMAP_UNINIT)) {
> > > + retval = ext2fs_zero_blocks2(fs, *phys_blk, 1, NULL, NULL);
> > > + if (retval)
> > > + goto done;
> > > + }
> > > +
> > > if (block < EXT2_NDIR_BLOCKS) {
> > > if (bmap_flags & BMAP_SET) {
> > > b = *phys_blk;
> > > @@ -360,6 +374,12 @@ errcode_t ext2fs_bmap2(ext2_filsys fs, ext2_ino_t ino, struct ext2_inode *inode,
> > > retval = ext2fs_alloc_block(fs, b, block_buf, &b);
> > > if (retval)
> > > goto done;
> > > + if (bmap_flags & BMAP_UNINIT) {
> > > + retval = ext2fs_zero_blocks2(fs, b, 1, NULL,
> > > + NULL);
> > > + if (retval)
> > > + goto done;
> > > + }
> > > inode_bmap(inode, block) = b;
> > > blocks_alloc++;
> > > *phys_blk = b;
> > > diff --git a/lib/ext2fs/ext2fs.h b/lib/ext2fs/ext2fs.h
> > > index 599c972..819a14a 100644
> > > --- a/lib/ext2fs/ext2fs.h
> > > +++ b/lib/ext2fs/ext2fs.h
> > > @@ -527,6 +527,7 @@ typedef struct ext2_icount *ext2_icount_t;
> > > */
> > > #define BMAP_ALLOC 0x0001
> > > #define BMAP_SET 0x0002
> > > +#define BMAP_UNINIT 0x0004
> > >
> > > /*
> > > * Returned flags from ext2fs_bmap
> > > diff --git a/lib/ext2fs/mkjournal.c b/lib/ext2fs/mkjournal.c
> > > index 884d9c0..ecc3912 100644
> > > --- a/lib/ext2fs/mkjournal.c
> > > +++ b/lib/ext2fs/mkjournal.c
> > > @@ -174,6 +174,23 @@ errcode_t ext2fs_zero_blocks2(ext2_filsys fs, blk64_t blk, int num,
> > > return ENOMEM;
> > > memset(buf, 0, fs->blocksize * STRIDE_LENGTH);
> > > }
> > > +
> > > + /* Try discard, if it zeroes data... */
> > > + if (io_channel_discard_zeroes_data(fs->io)) {
> > > + memset(buf + fs->blocksize, 0, fs->blocksize);
> > > + retval = io_channel_discard(fs->io, blk, num);
> > > + if (retval)
> > > + goto skip_discard;
> > > + retval = io_channel_read_blk64(fs->io, blk, 1, buf);
> > > + if (retval)
> > > + goto skip_discard;
> > > + if (memcmp(buf, buf + fs->blocksize, fs->blocksize) == 0)
> > > + return 0;
> > > + /* Hah! Discard doesn't zero! */
> > > + fs->io->flags &= ~CHANNEL_FLAGS_DISCARD_ZEROES;
> > > + }
> > > +skip_discard:
> >
> > You did not mention that in the description, but this is actually a
> > problem. The reason is that discard might not be reliable on some
> > devices. This has been discussed several times and I am not the only
> > one who've seen that even if the device itself says that it will
> > return zeroes from discarded regions sometimes it might return data.
>
> I agree that the storage not living up to the interface it advertises is a
> problem, hence the verification step that will unset the io channel flag if it
> finds that the device is lying.
>
> On the other hand, I wonder if this ought to be abstracted away in an
> io_channel_zero() call that takes care of figuring out if it can do a zeroing
> discard or if it has to write a block of zeroes.
>
> Or, are you worried that a discard and immediate re-read will appear to work,
> but that a later re-read will return non-zero data?
Yes I am, because we know that it sometimes behaves unpredictably
and this is one of the things that might just happen. Even though I
have not seen this exact case I've seen the opposite where right
after discard I've read non zero values but later it actually
returned zeroes.
So I would much rather not rely on discard here because you might
expose stale data on indirect files and there is no way to turn this
optimization off.
>
> > I would rather avoid this kind of optimization. However if the
> > underlying "device" is a loop device then it will be reliable if
> > it's supported. Also if then underlying "device" is a image then we
> > can just simply use punch hole.
>
> But static whitelisting is also problematic -- what if the storage device is an
> AHCI (or virtio-scsi) disk in QEMU that's ultimately backed by a file that we
> can punch_hole? How do we distinguish that from an SSD hooked up to SATA
> hardware?
We do not. We can only do that if we know we're sitting on a file.
It is really unfortunate, but I think that there is a limitation in
how we can use discard.
However we could use write same which should help on devices which
supports it and on the fs images because QEMU will convert that to
zero range (at least on xfs since ext4 implementation is quite new).
However I have no idea what is the interface to do that.
-Lukas
>
> In the qemu emulated AHCI case we ought to be able to zeroing discard, if
> advertised. I thought it was a reasonable compromise to trust that it works
> and verify the results afterward.
>
> --D
> >
> > Thanks!
> > -Lukas
> >
> > > +
> > > /* OK, do the write loop */
> > > j=0;
> > > while (j < num) {
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > > the body of a message to majordomo@vger.kernel.org
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
next prev parent reply other threads:[~2014-05-07 10:02 UTC|newest]
Thread overview: 91+ messages / expand[flat|nested] mbox.gz Atom feed top
2014-05-01 23:12 [PATCH 00/37] e2fsprogs patchbomb 5/14 Darrick J. Wong
2014-05-01 23:12 ` [PATCH 01/37] misc: create better-packaged static analysis reports Darrick J. Wong
2014-05-11 22:33 ` Theodore Ts'o
2014-05-01 23:12 ` [PATCH 02/37] misc: coverity fixes Darrick J. Wong
2014-05-02 11:17 ` Lukáš Czerner
2014-05-05 20:04 ` Darrick J. Wong
2014-05-11 22:40 ` Theodore Ts'o
2014-05-01 23:12 ` [PATCH 03/37] libext2fs: create sockets when populating filesystem Darrick J. Wong
2014-05-02 11:22 ` Lukáš Czerner
2014-05-05 20:08 ` Darrick J. Wong
2014-05-11 22:44 ` Theodore Ts'o
2014-05-01 23:12 ` [PATCH 04/37] mke2fs: always warn if 128-byte inode and inline_data Darrick J. Wong
2014-05-02 11:27 ` Lukáš Czerner
2014-05-05 20:10 ` Darrick J. Wong
2014-05-12 0:26 ` Theodore Ts'o
2014-05-01 23:12 ` [PATCH 05/37] debugfs: teach logdump to deal with 64bit revoke tables Darrick J. Wong
2014-05-02 11:38 ` Lukáš Czerner
2014-05-05 22:23 ` Darrick J. Wong
2014-05-06 11:35 ` Lukáš Czerner
2014-05-12 1:20 ` Theodore Ts'o
2014-05-01 23:13 ` [PATCH 06/37] debugfs: force logdump to display (old) journal contents Darrick J. Wong
2014-05-02 11:49 ` Lukáš Czerner
2014-05-06 0:24 ` Darrick J. Wong
2014-05-12 1:41 ` Theodore Ts'o
2014-05-12 3:31 ` Theodore Ts'o
2014-05-14 0:05 ` Darrick J. Wong
2014-05-01 23:13 ` [PATCH 07/37] resize2fs: fix check for collision between old GDT and superblock on sparse_super2 fs Darrick J. Wong
2014-05-12 3:35 ` Theodore Ts'o
2014-05-01 23:13 ` [PATCH 08/37] mke2fs: set gdt csum when creating packed fs Darrick J. Wong
2014-05-02 11:55 ` Lukáš Czerner
2014-05-12 4:22 ` Theodore Ts'o
2014-05-01 23:13 ` [PATCH 09/37] mke2fs: set error behavior at initialization time Darrick J. Wong
2014-05-02 12:13 ` Lukáš Czerner
2014-05-01 23:13 ` [PATCH 10/37] e2fsck: verify checksums after checking everything else Darrick J. Wong
2014-05-02 12:32 ` Lukáš Czerner
2014-05-05 22:56 ` Darrick J. Wong
2014-05-06 11:32 ` Lukáš Czerner
2014-05-08 0:05 ` Darrick J. Wong
2014-05-01 23:13 ` [PATCH 11/37] e2fsck: fix the extended attribute checksum error message Darrick J. Wong
2014-05-02 12:46 ` Lukáš Czerner
2014-05-05 23:08 ` Darrick J. Wong
2014-05-06 10:12 ` Lukáš Czerner
2014-05-01 23:13 ` [PATCH 12/37] e2fsck: insert a missing dirent tail for checksums if possible Darrick J. Wong
2014-05-02 12:54 ` Lukáš Czerner
2014-05-05 23:16 ` Darrick J. Wong
2014-05-01 23:13 ` [PATCH 13/37] e2fsck: write dir blocks after new inode when reconstructing root/lost+found Darrick J. Wong
2014-05-05 17:13 ` Lukáš Czerner
2014-05-01 23:13 ` [PATCH 14/37] dumpe2fs: add switch to disable checksum verification Darrick J. Wong
2014-05-05 17:20 ` Lukáš Czerner
2014-05-01 23:14 ` [PATCH 15/37] mke2fs: set block_validity as a default mount option Darrick J. Wong
2014-05-05 17:24 ` Lukáš Czerner
2014-05-01 23:14 ` [PATCH 16/37] libext2fs: support allocating uninit blocks in bmap2() Darrick J. Wong
2014-05-06 15:45 ` Lukáš Czerner
2014-05-06 19:59 ` Darrick J. Wong
2014-05-07 10:02 ` Lukáš Czerner [this message]
2014-05-07 21:37 ` Darrick J. Wong
2014-05-08 0:13 ` [PATCH 1/2] libext2fs: support BLKZEROOUT/FALLOC_FL_ZERO_RANGE in ext2fs_zero_blocks Darrick J. Wong
2014-05-13 11:11 ` Lukáš Czerner
2014-05-08 0:14 ` [PATCH 2/2] libext2fs: support allocating uninit blocks in bmap2() Darrick J. Wong
2014-05-27 16:28 ` Lukáš Czerner
2014-05-28 19:48 ` Darrick J. Wong
2014-05-01 23:14 ` [PATCH 17/37] libext2fs: file IO routines should handle uninit blocks Darrick J. Wong
2014-05-01 23:14 ` [PATCH 18/37] resize2fs: convert fs to and from 64bit mode Darrick J. Wong
2014-05-01 23:14 ` [PATCH 19/37] resize2fs: when toggling 64bit, don't free in-use bg data clusters Darrick J. Wong
2014-05-01 23:14 ` [PATCH 20/37] resize2fs: adjust reserved_gdt_blocks when changing group descriptor size Darrick J. Wong
2014-05-01 23:14 ` [PATCH 21/37] libext2fs: have UNIX IO manager use pread/pwrite Darrick J. Wong
2014-08-02 23:16 ` Theodore Ts'o
2014-05-01 23:14 ` [PATCH 22/37] ext2fs: add readahead method to improve scanning Darrick J. Wong
2014-05-01 23:14 ` [PATCH 23/37] e2fsck: provide routines to read-ahead metadata Darrick J. Wong
2014-05-01 23:14 ` [PATCH 24/37] e2fsck: read-ahead metadata during passes 1, 2, and 4 Darrick J. Wong
2014-07-28 22:25 ` Darrick J. Wong
2014-05-01 23:15 ` [PATCH 25/37] libext2fs: when appending to a file, don't split an index block in equal halves Darrick J. Wong
2014-08-02 23:43 ` Theodore Ts'o
2014-05-01 23:15 ` [PATCH 26/37] libext2fs: find inode goal when allocating blocks Darrick J. Wong
2014-05-01 23:15 ` [PATCH 27/37] libext2fs: find a range of empty blocks Darrick J. Wong
2014-05-01 23:15 ` [PATCH 28/37] libext2fs: provide a function to set inode size Darrick J. Wong
2014-07-26 18:37 ` Theodore Ts'o
2014-05-01 23:15 ` [PATCH 29/37] libext2fs: implement fallocate Darrick J. Wong
2014-05-01 23:15 ` [PATCH 31/37] fuse2fs: translate ACL structures Darrick J. Wong
2014-05-01 23:15 ` [PATCH 32/37] fuse2fs: handle 64-bit dates correctly Darrick J. Wong
2014-05-01 23:16 ` [PATCH 33/37] fuse2fs: implement fallocate Darrick J. Wong
2014-05-01 23:16 ` [PATCH 35/37] tests: enable using fuse2fs with metadata checksum test Darrick J. Wong
2014-05-01 23:16 ` [PATCH 36/37] tests: test date handling Darrick J. Wong
2014-05-01 23:16 ` [PATCH 37/37] ext5: define new subtype to add features and reduce testing complexity Darrick J. Wong
2014-05-02 9:45 ` Lukáš Czerner
2014-05-02 14:04 ` Theodore Ts'o
2014-05-06 1:59 ` Darrick J. Wong
2014-05-06 1:33 ` Darrick J. Wong
2014-05-06 12:50 ` Lukáš Czerner
2014-05-06 15:21 ` Theodore Ts'o
2014-05-06 15:30 ` Lukáš Czerner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=alpine.LFD.2.00.1405071045090.2128@localhost.localdomain \
--to=lczerner@redhat.com \
--cc=darrick.wong@oracle.com \
--cc=linux-ext4@vger.kernel.org \
--cc=tytso@mit.edu \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).