From: Fredrick <fjohnber@zoho.com>
To: Theodore Ts'o <tytso@mit.edu>
Cc: linux-ext4@vger.kernel.org, Andreas Dilger <adilger@dilger.ca>,
wenqing.lz@taobao.com
Subject: Re: ext4_fallocate
Date: Mon, 25 Jun 2012 18:23:29 -0700
Message-ID: <4FE90F11.4040801@zoho.com>
In-Reply-To: <20120625191744.GB9688@thunk.org>
On 06/25/2012 12:17 PM, Theodore Ts'o wrote:
> On Mon, Jun 25, 2012 at 04:51:59PM +0800, Zheng Liu wrote:
>>
>> Actually I wanted to send you a URL from the linux mailing list archive, but
>> I cannot find it. After applying this patch, you can call ioctl(2) to
>> enable the expose_stale_data flag, and then when you call fallocate(2), ext4
>> creates initialized extents for you. This patch cannot be merged into the
>> upstream kernel because it opens a huge security hole.
>
> This is what we're using internally inside Google.... this allows the
> security exposure to be restricted to those programs running with a
> specific group id (which is better than giving programs access to
> CAP_SYS_RAWIO). We also require the use of a specific fallocate flag
> so that programs have to explicitly ask for this feature.
>
> Also note that I restrict the combination of NO_HIDE_STALE &&
> KEEP_SIZE since it causes e2fsck to complain --- and if you're trying
> to avoid fs metadata I/O, you want to avoid the extra i_size update
> anyway, so it's not worth trying to make this work w/o causing e2fsck
> complaints.
>
> This patch is versus the v3.3 kernel (as it happens, I was just in the
> middle of rebasing this patch from 2.6.34 :-)
>
> - Ted
>
> P.S. It just occurred to me that there are some patches being
> discussed that assign new fallocate flags for volatile data handling.
> So it would probably be a good idea to move the fallocate flag
> codepoint assignment up out of the way to avoid future conflicts.
>
> commit 5f12f1bc2b0fb0866d52763a611b022780780f05
> Author: Theodore Ts'o <tytso@google.com>
> Date: Fri Jun 22 17:19:53 2012 -0400
>
> ext4: add an fallocate flag to mark newly allocated extents initialized
>
> This commit adds a new flag to ext4's fallocate that allows new,
> uninitialized extents to be marked as initialized. This flag,
> FALLOC_FL_NO_HIDE_STALE requires that the nohide_stale_gid=<gid> mount
> option be used when the file system is mounted, and that the user is
> in the group <gid>.
>
> The benefit is to a program that fallocates a large space but then writes
> to that space in small increments. This option prevents ext4 from
> having to split the uninitialized extent and merge the newly initialized
> extent with the extent to its left. Even though this usually happens
> in-memory, this option is useful for tight memory situations and for
> ext4 on flash. Note: This allows an application in the nohide_stale
> group to see stale data on the filesystem.
>
> Tested: Updated xfstests g002 to test a case where
> fallocate:no-hide-stale is not allowed. The existing tests now pass
> because I added a remount with a group that user root is in.
> Rebase-Tested-v3.3: same
>
> Effort: fs/nohide-stale
> Origin-2.6.34-SHA1: c3099bf61be1baf94bc91c481995bb0d77f05786
> Origin-2.6.34-SHA1: 004dd33b9ebc5d860781c3435526658cc8aa8ccb
> Change-Id: I0d2a7f2a4cf34443269acbcedb7b7074e0055e69
>
> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
> index aaaece6..ac7aa42 100644
> --- a/fs/ext4/ext4.h
> +++ b/fs/ext4/ext4.h
> @@ -1240,6 +1240,9 @@ struct ext4_sb_info {
> unsigned long s_mb_last_group;
> unsigned long s_mb_last_start;
>
> + /* gid that's allowed to see stale data via falloc flag. */
> + gid_t no_hide_stale_gid;
> +
> /* stats for buddy allocator */
> atomic_t s_bal_reqs; /* number of reqs with len > 1 */
> atomic_t s_bal_success; /* we found long enough chunks */
> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
> index cb99346..cc57c85 100644
> --- a/fs/ext4/extents.c
> +++ b/fs/ext4/extents.c
> @@ -4375,6 +4375,7 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> int retries = 0;
> int flags;
> struct ext4_map_blocks map;
> + struct ext4_sb_info *sbi;
> unsigned int credits, blkbits = inode->i_blkbits;
>
> /*
> @@ -4385,12 +4386,28 @@ long ext4_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return -EOPNOTSUPP;
>
> /* Return error if mode is not supported */
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
> + FALLOC_FL_NO_HIDE_STALE))
> + return -EOPNOTSUPP;
> +
> + /* The combination of NO_HIDE_STALE and KEEP_SIZE is not supported */
> + if ((mode & FALLOC_FL_NO_HIDE_STALE) &&
> + (mode & FALLOC_FL_KEEP_SIZE))
> return -EOPNOTSUPP;
>
> if (mode & FALLOC_FL_PUNCH_HOLE)
> return ext4_punch_hole(file, offset, len);
>
> + sbi = EXT4_SB(inode->i_sb);
> + /* Must be in the nohide_stale group to see stale data. */
> + if ((mode & FALLOC_FL_NO_HIDE_STALE) &&
> + !in_egroup_p(sbi->no_hide_stale_gid))
> + return -EACCES;
> +
> + /* preallocation to directories is currently not supported */
> + if (S_ISDIR(inode->i_mode))
> + return -ENODEV;
> +
> trace_ext4_fallocate_enter(inode, offset, len, mode);
> map.m_lblk = offset >> blkbits;
> /*
> @@ -4429,6 +4446,8 @@ retry:
> ret = PTR_ERR(handle);
> break;
> }
> + if (mode & FALLOC_FL_NO_HIDE_STALE)
> + flags &= ~EXT4_GET_BLOCKS_UNINIT_EXT;
> ret = ext4_map_blocks(handle, inode, &map, flags);
> if (ret <= 0) {
> #ifdef EXT4FS_DEBUG
> diff --git a/fs/ext4/super.c b/fs/ext4/super.c
> index 5b443a8..d976ec1 100644
> --- a/fs/ext4/super.c
> +++ b/fs/ext4/super.c
> @@ -1175,6 +1175,8 @@ static int ext4_show_options(struct seq_file *seq, struct dentry *root)
> if (test_opt2(sb, BIG_EXT))
> seq_puts(seq, ",big_extent");
> #endif
> + if (sbi->no_hide_stale_gid != -1)
> + seq_printf(seq, ",nohide_stale_gid=%u", sbi->no_hide_stale_gid);
>
> ext4_show_quota_options(seq, sb);
>
> @@ -1353,6 +1355,7 @@ enum {
> #ifdef CONFIG_EXT4_BIG_EXTENT
> Opt_big_extent, Opt_nobig_extent,
> #endif
> + Opt_nohide_stale_gid,
> };
>
> static const match_table_t tokens = {
> @@ -1432,6 +1435,7 @@ static const match_table_t tokens = {
> {Opt_big_extent, "big_extent"},
> {Opt_nobig_extent, "nobig_extent"},
> #endif
> + {Opt_nohide_stale_gid, "nohide_stale_gid=%u"},
> {Opt_err, NULL},
> };
>
> @@ -1931,6 +1935,12 @@ set_qf_format:
> return 0;
> sbi->s_li_wait_mult = option;
> break;
> + case Opt_nohide_stale_gid:
> + if (match_int(&args[0], &option))
> + return 0;
> + /* -1 for disabled, otherwise it's valid. */
> + sbi->no_hide_stale_gid = option;
> + break;
> case Opt_noinit_itable:
> clear_opt(sb, INIT_INODE_TABLE);
> break;
> @@ -3274,6 +3284,8 @@ static int ext4_fill_super(struct super_block *sb, void *data, int silent)
> #ifdef CONFIG_EXT4_BIG_EXTENT
> sbi->s_min_big_ext_size = EXT4_DEFAULT_MIN_BIG_EXT_SIZE;
> #endif
> + /* Default to having no-hide-stale disabled. */
> + sbi->no_hide_stale_gid = -1;
>
> if ((def_mount_opts & EXT4_DEFM_NOBARRIER) == 0)
> set_opt(sb, BARRIER);
> diff --git a/fs/open.c b/fs/open.c
> index 201431a..4edc0cd 100644
> --- a/fs/open.c
> +++ b/fs/open.c
> @@ -224,7 +224,9 @@ int do_fallocate(struct file *file, int mode, loff_t offset, loff_t len)
> return -EINVAL;
>
> /* Return error if mode is not supported */
> - if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
> + if (mode & ~(FALLOC_FL_KEEP_SIZE |
> + FALLOC_FL_PUNCH_HOLE |
> + FALLOC_FL_NO_HIDE_STALE))
> return -EOPNOTSUPP;
>
> /* Punch hole must have keep size set */
> diff --git a/include/linux/falloc.h b/include/linux/falloc.h
> index 73e0b62..a2489ac 100644
> --- a/include/linux/falloc.h
> +++ b/include/linux/falloc.h
> @@ -3,6 +3,7 @@
>
> #define FALLOC_FL_KEEP_SIZE 0x01 /* default is extend size */
> #define FALLOC_FL_PUNCH_HOLE 0x02 /* de-allocates range */
> +#define FALLOC_FL_NO_HIDE_STALE 0x04 /* default is hide stale data */
>
> #ifdef __KERNEL__
>
>
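For anyone following along, the order of the mode checks that the quoted patch adds to ext4_fallocate() can be modeled in userspace roughly as below. This is only a sketch, not kernel code: check_mode(), its in_group parameter (a stand-in for the in_egroup_p() test) and check_mode_examples() are hypothetical names, and the flag values are copied from the falloc.h hunk above.

```c
#include <assert.h>
#include <errno.h>

/* Flag values copied from the falloc.h hunk of the quoted patch. */
#ifndef FALLOC_FL_KEEP_SIZE
#define FALLOC_FL_KEEP_SIZE     0x01
#endif
#ifndef FALLOC_FL_PUNCH_HOLE
#define FALLOC_FL_PUNCH_HOLE    0x02
#endif
#ifndef FALLOC_FL_NO_HIDE_STALE
#define FALLOC_FL_NO_HIDE_STALE 0x04
#endif

/*
 * Hypothetical model of the checks in the patched ext4_fallocate().
 * in_group stands in for in_egroup_p(sbi->no_hide_stale_gid).
 * Returns 0 if the request would proceed, else a negative errno.
 */
int check_mode(int mode, int in_group)
{
	/* Unknown flags are rejected first. */
	if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE |
		     FALLOC_FL_NO_HIDE_STALE))
		return -EOPNOTSUPP;

	/* NO_HIDE_STALE together with KEEP_SIZE is not supported. */
	if ((mode & FALLOC_FL_NO_HIDE_STALE) && (mode & FALLOC_FL_KEEP_SIZE))
		return -EOPNOTSUPP;

	/* The patch hands off to ext4_punch_hole() before the gid check. */
	if (mode & FALLOC_FL_PUNCH_HOLE)
		return 0;

	/* Only members of the configured group may skip stale-data hiding. */
	if ((mode & FALLOC_FL_NO_HIDE_STALE) && !in_group)
		return -EACCES;

	return 0;
}

/* A few cases that follow directly from the checks above. */
void check_mode_examples(void)
{
	assert(check_mode(FALLOC_FL_NO_HIDE_STALE, 1) == 0);
	assert(check_mode(FALLOC_FL_NO_HIDE_STALE, 0) == -EACCES);
	assert(check_mode(FALLOC_FL_NO_HIDE_STALE | FALLOC_FL_KEEP_SIZE, 1)
	       == -EOPNOTSUPP);
	assert(check_mode(0x08, 1) == -EOPNOTSUPP);
}
```

The model leaves out the directory check and the earlier extent-format test, since neither depends on the new flag.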
Thanks Ted. This patch is very nice
and addresses Andreas's comment
about using a mount option.
-Fredrick
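
From the application side, a caller might use the flag roughly as follows. This is a minimal sketch under assumptions: prealloc_no_hide_stale() and demo_prealloc_size() are hypothetical helpers, the flag value 0x04 is the one from the patch, and the fallback to posix_fallocate() is my own choice for kernels or mounts that reject the flag (EOPNOTSUPP) or callers outside the configured group (EACCES). The fallback is always safe but gives up the benefit of initialized extents.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <unistd.h>

#ifndef FALLOC_FL_NO_HIDE_STALE
#define FALLOC_FL_NO_HIDE_STALE 0x04 /* value taken from the quoted patch */
#endif

/*
 * Hypothetical helper: try to preallocate len bytes with initialized
 * extents, falling back to an ordinary preallocation if the flag is
 * rejected. Returns 0 on success, -1 with errno set on failure.
 */
int prealloc_no_hide_stale(int fd, off_t len)
{
	int err;

	assert(len > 0);

	/* The patch rejects KEEP_SIZE with this flag, so pass it alone. */
	if (fallocate(fd, FALLOC_FL_NO_HIDE_STALE, 0, len) == 0)
		return 0;
	if (errno != EOPNOTSUPP && errno != EACCES)
		return -1;

	/* posix_fallocate() returns an error number instead of setting errno. */
	err = posix_fallocate(fd, 0, len);
	if (err != 0) {
		errno = err;
		return -1;
	}
	return 0;
}

/* Preallocate 1 MiB in a temporary file and report the resulting size. */
long demo_prealloc_size(void)
{
	char path[] = "/tmp/nohide_demo_XXXXXX";
	struct stat st;
	long size = -1;
	int fd = mkstemp(path);

	if (fd < 0)
		return -1;
	if (prealloc_no_hide_stale(fd, 1 << 20) == 0 && fstat(fd, &st) == 0)
		size = (long)st.st_size;
	close(fd);
	unlink(path);
	return size;
}
```

Because KEEP_SIZE is not passed, i_size is extended to cover the preallocated range in both the fast path and the fallback.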