linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zheng Liu <gnehzuil.liu@gmail.com>
To: Andreas Dilger <adilger@whamcloud.com>
Cc: Eric Sandeen <sandeen@redhat.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-fsdevel@vger.kernel.org" <linux-fsdevel@vger.kernel.org>,
	"linux-ext4@vger.kernel.org" <linux-ext4@vger.kernel.org>,
	Zheng Liu <wenqing.lz@taobao.com>
Subject: Re: [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate
Date: Wed, 18 Apr 2012 19:38:10 +0800	[thread overview]
Message-ID: <20120418113810.GA3447@gmail.com> (raw)
In-Reply-To: <D225F088-A53F-4962-97AD-13C850BBA722@whamcloud.com>

On Tue, Apr 17, 2012 at 09:59:37PM -0700, Andreas Dilger wrote:
> On 2012-04-17, at 10:40, Eric Sandeen <sandeen@redhat.com> wrote:
> > On 4/17/12 11:53 AM, Zheng Liu wrote:
> >> 
> >> fallocate is a useful system call because it can preallocate some disk blocks
> >> for a file and keep blocks contiguous.  However, it has a defect that file
> >> system will convert an uninitialized extent to be an initialized when the user
> >> wants to write some data to this file, because file system create an
> >> unititalized extent while it preallocates some blocks in fallocate (e.g. ext4).
> > 
> > That's a security-driven design feature, not a defect.  :)
> > 
> >> I try to make ext4 support this new flag, and run a simple test in my own
> >> desktop to verify it.  The machine has a Intel(R) Core(TM)2 Duo CPU E8400, 4G
> >> memory and a WDC WD1600AAJS-75M0A0 160G SATA disk.  I use the following script
> >> to tset the performance.
> >> 
> >> #/bin/sh
> >> mkfs.ext4 ${DEVICE}
> >> mount -t ext4 ${DEVICE} ${TARGET}
> >> fallocate -l 27262976 ${TARGET}/test # the size of the file is 256M (*)
> > 
> > That's only 26MB, but the below loop writes to a max offset of around
> > 256M.
> > 
> >> time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/sda1/test_256M \
> >>    conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> >>    2>/dev/null; done
> > 
> > You fallocate ${TARGET}/test but dd to /mnt/sda1/test_256M ?  I presume
> > those should be the same file.
> > 
> > So the testcase as shown above seems fairly wrong, no?  Is that what you
> > used for the numbers below?
> 
> I auspect the real problem may be the lazy inode table zeroing being done in the kernel. If the test is run immediately after formatting the filesystem, then the kernel thread is busy writing to the disk in the background and interfering with your benchmark.
> 
> Please format the filesystem with "-E lazy-itable-init=0", to avoid this behavior.  

I format the filesystem with this option, but it seems that it is
useless.

> 
> Secondly, your test program is not doing random writes to disk, but rather doing writes at 64kB intervals. There is logic in the uninitialized extent handling that will write zeros to an entire extent, rather than create many fragmented uninitialized extents.  It may be possible that you are zeroing out the entire file, and writing 16x as much data as you expect.

No, I don't see any entire extents that is written zeros.  The extent is
splitted to many fragmented initialized and uninitialized extents.

I have run a more detailed benchmark.  Later I will post it in mailing
list.  IMHO, I guess that the journal might be the root cause.

Regards,
Zheng

> 
> Cheers, Andreas
> 
> >> * I write a wrapper program to call fallocate(2) with FALLOC_FL_NO_HIDE_STALE
> >>  flag because the userspace tool doesn't support the new flag.
> >> 
> >> The result:
> >>    w/o        w/
> >> real    1m16.043s    0m17.946s    -76.4%
> >> user    0m0.195s    0m0.192s    -1.54%
> >> sys    0m0.468s    0m0.462s    -1.28%
> > 
> > I think that the missing information here is some profiling to show where
> > the time was spent in the "w/o" case.  What, exactly, in ext4 extent
> > management is so darned slow that we have to poke security holes in the
> > filesytem to get decent performance?
> > 
> > However,, the above also seems like an alarmingly large difference, and
> > one that I can't attribute to unwritten extent conversion overhead.
> > 
> > If I test the seeky dd to a prewritten file (to eliminate extent
> > conversion):
> > 
> > # dd if=/dev/zero of=/mnt/scratch/test bs=1M count=256
> > # sync
> > 
> > vs. to a fallocated file (which requires extent conversion):
> > 
> > # fallocate -l 256m /mnt/scratch/test
> > 
> > and then do your seeky dd test after each of the above:
> > 
> > # time for((i=0;i<2000;i++)); do dd if=/dev/zero of=/mnt/scratch/test \
> >    conv=notrunc bs=4k count=1 seek=`expr $i \* 16` oflag=sync,direct \
> >    2>/dev/null; done
> > 
> > On ext4, I get about 59.9 seconds in the pre-written case, 65.2 seconds in the fallocated case.
> > 
> > On xfs, I get about 52.5 seconds in the pre-written case, 52.7 seconds in the fallocated case.
> > 
> > I don't see anywhere near the slowdown you show above, certainly
> > nothing bad enough to warrant opening the big security hole.
> > Am I missing something?
> > 
> > The ext4 delta is a bit larger, though, so it seems worth investigating
> > the *root cause* of the extra overhead if it's problematic in your
> > workloads...
> > 
> > -Eric
> > 
> > 
> >> Obviously, this flag will bring an secure issue because the malicious user
> >> could use this flag to get other user's data if (s)he doesn't do a
> >> initialization before reading this file.  Thus, a sysctl parameter
> >> 'fs.falloc_no_hide_stale' is defined in order to let administrator to determine
> >> whether or not this flag is enabled.  Currently, this flag is disabled by
> >> default.  I am not sure whether this is enough or not.  Another option is that
> >> a new Kconfig entry is created to remove this flag during the kernel is
> >> complied.  So any suggestions or comments are appreciated.
> >> 
> >> Regards,
> >> Zheng
> >> 
> >> Zheng Liu (3):
> >>      vfs: add FALLOC_FL_NO_HIDE_STALE flag in fallocate
> >>      vfs: add security check for _NO_HIDE_STALE flag
> >>      ext4: add FALLOC_FL_NO_HIDE_STALE support
> >> 
> >> fs/ext4/extents.c      |    7 +++++--
> >> fs/open.c              |   12 +++++++++++-
> >> include/linux/falloc.h |    5 +++++
> >> include/linux/sysctl.h |    1 +
> >> kernel/sysctl.c        |   10 ++++++++++
> >> 5 files changed, 32 insertions(+), 3 deletions(-)
> >> --
> >> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> >> the body of a message to majordomo@vger.kernel.org
> >> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> > 
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html

  parent reply	other threads:[~2012-04-18 11:38 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-04-17 16:53 [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate Zheng Liu
2012-04-17 16:53 ` [RFC][PATCH 1/3] vfs: " Zheng Liu
2012-04-17 16:53 ` [RFC][PATCH 2/3] vfs: add security check for _NO_HIDE_STALE flag Zheng Liu
2012-04-17 16:53 ` [RFC][PATCH 3/3] ext4: add FALLOC_FL_NO_HIDE_STALE support Zheng Liu
2012-04-17 17:40 ` [RFC][PATCH 0/3] add FALLOC_FL_NO_HIDE_STALE flag in fallocate Eric Sandeen
2012-04-18  4:08   ` Zheng Liu
2012-04-18  7:48     ` Lukas Czerner
2012-04-18 12:03       ` Zheng Liu
2012-04-18 12:07         ` Lukas Czerner
2012-04-20  9:52           ` Zheng Liu
2012-04-18  4:59   ` Andreas Dilger
2012-04-18  8:19     ` Lukas Czerner
2012-04-18 12:48       ` Zheng Liu
2012-04-18 15:09         ` Andreas Dilger
2012-04-20  9:59           ` Zheng Liu
2012-04-18 11:38     ` Zheng Liu [this message]
2012-04-18 11:39       ` Lukas Czerner
2012-04-18 12:06         ` Zheng Liu
2012-04-18 14:57     ` Eric Sandeen
2012-04-17 17:59 ` Ric Wheeler
2012-04-17 18:43   ` Ted Ts'o
2012-04-17 18:52     ` Ric Wheeler
2012-04-17 18:53     ` Eric Sandeen
2012-04-17 19:04       ` Ted Ts'o
2012-04-18  3:02       ` Dave Chinner
2012-04-18 16:07         ` Ted Ts'o
2012-04-18 23:37           ` Dave Chinner
2012-04-18  8:04     ` Lukas Czerner
  -- strict thread matches above, loose matches on Subject: below --
2012-04-23  1:55 Szabolcs Szakacsits

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120418113810.GA3447@gmail.com \
    --to=gnehzuil.liu@gmail.com \
    --cc=adilger@whamcloud.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=sandeen@redhat.com \
    --cc=wenqing.lz@taobao.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).