linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Eric Sandeen <sandeen@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-ext4@vger.kernel.org, Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate
Date: Wed, 18 Apr 2012 12:08:42 +0800	[thread overview]
Message-ID: <20120418040842.GA11684@gmail.com> (raw)
In-Reply-To: <4F8DAAFE.40500@redhat.com>

On Tue, Apr 17, 2012 at 12:40:14PM -0500, Eric Sandeen wrote:
> On 4/17/12 11:53 AM, Zheng Liu wrote:
> > Hi list,
> > 
> > fallocate is a useful system call because it can preallocate some disk blocks
> > for a file and keep blocks contiguous.  However, it has a defect that file
> > system will convert an uninitialized extent to be an initialized when the user
> > wants to write some data to this file, because file system create an
> > unititalized extent while it preallocates some blocks in fallocate (e.g. ext4).
> 
> That's a security-driven design feature, not a defect.  :)
> 
> > Especially, it causes a severe degradation when the user tries to do some
> > random write operations, which frequently modifies the metadata of this file.
> > We meet this problem in our product system at Taobao.  Last month, in ext4
> > workshop, we discussed this problem and the Google faces the same problem.  So
> > a new flag, FALLOC_FL_NO_HIDE_STALE, is added in order to solve this problem.
> 
> Which then opens up severe security problems.
> 
> > When this flag is set, file system will create an inititalized extent for this
> > file.  So it avoids the conversion from uninitialized to initialized.  If users
> > want to use this flag, they must guarantee that file has been initialized by
> > themselves before it is read at the same offset.  This flag is added in vfs so
> > that other file systems can also support this flag to improve the performance.
> > 
> > I try to make ext4 support this new flag, and run a simple test in my own
> > desktop to verify it.  The machine has a Intel(R) Core(TM)2 Duo CPU E8400, 4G
> > memory and a WDC WD1600AAJS-75M0A0 160G SATA disk.  I use the following script
> > to tset the performance.
> > 
> > #/bin/sh
> > mkfs.ext4 ${DEVICE}
> > mount -t ext4 ${DEVICE} ${TARGET}
> > fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
> 
> That's only 26MB, but the below loop writes to a max offset of around
> 256M.

Yes, you are right.  I preallocate a file that is 256MB.

> 
> > time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> > 	conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> > 	2>/dev/null; done
> 
> You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ?  I presume
> those should be the same file.

Yes, it is the same file.

> 
> So the testcase as shown above seems fairly wrong, no?  Is that what you
> used for the numbers below?
> 
> > * I write a wrapper program to call fallocate(2) with FALLOC_FL_NO_HIDE_STALE
> >   flag because the userspace tool doesn't support the new flag.
> > 
> > The result:
> > 	w/o 		w/
> > real	1m16.043s	0m17.946s	-76.4%
> > user	0m0.195s	0m0.192s	-1.54%
> > sys	0m0.468s	0m0.462s	-1.28%
> 
> I think that the missing information here is some profiling to show where
> the time was spent in the "w/o" case.  What, exactly, in ext4 extent
> management is so darned slow that we have to poke security holes in the
> filesytem to get decent performance?
> 
> However,, the above also seems like an alarmingly large difference, and
> one that I can't attribute to unwritten extent conversion overhead.
> 
> If I test the seeky dd to a prewritten file (to eliminate extent
> conversion):
> 
> # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> # sync
> 
> vs. to a fallocated file (which requires extent conversion):
> 
> # fallocate -l 256m /mnt/scratch/test
> 
> and then do your seeky dd test after each of the above:
> 
> # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> 	conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> 	2>/dev/null; done
> 
> On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
> 
> On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
> 
> I don't see anywhere near the slowdown you show above, certainly
> nothing bad enough to warrant opening the big security hole.
> Am I missing something?

I will run more detailed benchmarks to trace this issue.  If I have a
lastest result, I will send a new mail to let you know. :)

I fully understand that this flag will bring a security hole, and I
totally agree that we should dig *root cause* in ext4.  But, IMHO, a
proper interface, which is limited by a proper capabilities, might be
useful for the user.  

Regards,
Zheng

> 
> The ext4 delta is a bit larger, though, so it seems worth investigating
> the *root cause* of the extra overhead if it's problematic in your
> workloads...
> 
> -Eric
> 
> 
> > Obviously, this flag will bring an secure issue because the malicious user
> > could use this flag to get other user's data if (s)he doesn't do a
> > initialization before reading this file.  Thus, a sysctl parameter
> > 'fs.falloc_no_hide_stale' is defined in order to let administrator to determine
> > whether or not this flag is enabled.  Currently, this flag is disabled by
> > default.  I am not sure whether this is enough or not.  Another option is that
> > a new Kconfig entry is created to remove this flag during the kernel is
> > complied.  So any suggestions or comments are appreciated.
> > 
> > Regards,
> > Zheng
> > 
> > Zheng Liu (3):
> >       vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> >       vfs: add security check for _NO_HIDE_STALE flag
> >       ext4: add FALLOC_FL_NO_HIDE_STALE support
> > 
> >  fs/ext4/extents.c      |    7 +++++--
> >  fs/open.c              |   12 +++++++++++-
> >  include/linux/falloc.h |    5 +++++
> >  include/linux/sysctl.h |    1 +
> >  kernel/sysctl.c        |   10 ++++++++++
> >  5 files changed, 32 insertions(+), 3 deletions(-)
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

  reply	other threads:[~2012-04-18  4:08 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-17 16:53 [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate Zheng Liu
2012-04-17 16:53 ` [RFC][PATCH 1/3] vfs: " Zheng Liu
2012-04-17 16:53 ` [RFC][PATCH 2/3] vfs: add security check for _NO_HIDE_STALE flag Zheng Liu
2012-04-17 16:53 ` [RFC][PATCH 3/3] ext4: add FALLOC_FL_NO_HIDE_STALE support Zheng Liu
2012-04-17 17:40 ` [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate Eric Sandeen
2012-04-18  4:08   ` Zheng Liu [this message]
2012-04-18  7:48     ` Lukas Czerner
2012-04-18 12:03       ` Zheng Liu
2012-04-18 12:07         ` Lukas Czerner
2012-04-20  9:52           ` Zheng Liu
2012-04-18  4:59   ` Andreas Dilger
2012-04-18  8:19     ` Lukas Czerner
2012-04-18 12:48       ` Zheng Liu
2012-04-18 15:09         ` Andreas Dilger
2012-04-20  9:59           ` Zheng Liu
2012-04-18 11:38     ` Zheng Liu
2012-04-18 11:39       ` Lukas Czerner
2012-04-18 12:06         ` Zheng Liu
2012-04-18 14:57     ` Eric Sandeen
2012-04-17 17:59 ` Ric Wheeler
2012-04-17 18:43   ` Ted Ts'o
2012-04-17 18:52     ` Ric Wheeler
2012-04-17 18:53     ` Eric Sandeen
2012-04-17 19:04       ` Ted Ts'o
2012-04-18  3:02       ` Dave Chinner
2012-04-18 16:07         ` Ted Ts'o
2012-04-18 23:37           ` Dave Chinner
2012-04-18  8:04     ` Lukas Czerner
  -- strict thread matches above, loose matches on Subject: below --
2012-04-23  1:55 Szabolcs Szakacsits

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120418040842.GA11684@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sandeen@redhat.com \
    --cc=wenqing.lz@taobao.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).