public inbox for linux-fsdevel@vger.kernel.org
From: Steven Whitehouse <swhiteho@redhat.com>
To: "Theodore Ts'o" <tytso@mit.edu>
Cc: Dave Chinner <david@fromorbit.com>,
	Chris Mason <chris.mason@fusionio.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ric Wheeler <rwheeler@redhat.com>, Ingo Molnar <mingo@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Martin Steigerwald <Martin@lichtvoll.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
Date: Tue, 11 Dec 2012 12:16:03 +0000
Message-ID: <1355228163.2721.32.camel@menhir>
In-Reply-To: <20121210182053.GC7516@thunk.org>

Hi,

On Mon, 2012-12-10 at 13:20 -0500, Theodore Ts'o wrote:
> A sentence or two got chopped out during an editing pass.  Let me try
> that again so it's a bit clearer what I was trying to say....
> 
> Sure, but if the block device supports WRITE_SAME or persistent
> discard, then presumably fallocate() should do this automatically all
> the time, and not require a flag to request this behavior.  The only
> reason why you might not is if the WRITE_SAME is more costly.  That is
> when a seek plus writing 1MB does take more time than the amount of
> disk time fraction that it consumes if you compare it to a seek plus
> writing 4k or 32k.
> 
Well there are two cases here I think....

One is the GFS2 type case where the metadata doesn't support "these
blocks are allocated but zero" so that we must, for all fallocate
requests, zero out the blocks at fallocate time to avoid exposing stale
data to userspace.

The advantage over dd from userspace in this case is firstly that there
is no copy from userspace, so it should be faster. Also, the use of
sb_issue_zeroout means that block devices which don't need an explicit
block of zeros to be written should be able to do this faster still.
How that is done is a matter for the block layer; the fs shouldn't need
to care about how it is implemented. In the case of GFS2, we implemented
fallocate because it was useful to be able to allocate beyond the end of
file without changing the file size. This helped us fix a bug in our fs
grow code, so performance was a secondary (but welcome!) consideration.

The other case is the ext4/XFS type case where the metadata does support
"these blocks are allocated but zero" which means that the metadata
needs to be changed twice. Once to "these blocks are allocated but zero"
at fallocate time and again to "these blocks have valid content" at
write time. As I understand the issue, the problem is that this second
metadata change is what is causing the performance issue.
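To make the "changed twice" point concrete, here is a toy sketch in
Python. It is not how ext4 or XFS actually store extents; the Extent
class and both function names are made up for illustration, and the
return values are just counts of extent records touched. It shows that
a small write into a preallocated range can turn one unwritten extent
into three records.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Extent:
    start: int     # first block of the extent
    length: int    # number of blocks
    written: bool  # False means "allocated but zero" (unwritten)


def fallocate(extents: List[Extent], start: int, length: int) -> int:
    """First metadata change: record the range as allocated but unwritten."""
    extents.append(Extent(start, length, written=False))
    return 1  # one extent record created


def write(extents: List[Extent], start: int, length: int) -> int:
    """Second metadata change: mark the written range as valid content.

    Assumes the write falls entirely inside a single unwritten extent.
    Returns the number of extent records that replace the original one.
    Merging with neighbouring written extents is deliberately left out.
    """
    for i, ext in enumerate(extents):
        end = ext.start + ext.length
        if not ext.written and ext.start <= start and start + length <= end:
            pieces = []
            if start > ext.start:
                # Unwritten remainder in front of the write.
                pieces.append(Extent(ext.start, start - ext.start, False))
            pieces.append(Extent(start, length, True))
            if start + length < end:
                # Unwritten remainder behind the write.
                pieces.append(Extent(start + length, end - (start + length), False))
            extents[i:i + 1] = pieces
            return len(pieces)
    return 0  # not inside an unwritten extent: nothing to convert
```

Preallocating 256 blocks and then writing a single block in the middle
leaves three records in this model, which is the splitting cost under
discussion.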

> Ext4 currently uses a threshold of 32k for this break point (below
> that, we will use sb_issue_zeroout; above that, we will break apart an
> uninitialized extent when writing into a preallocated region).  It may
> be that 32k is too low, especially for certain types of devices (i.e.,
> SSD's versus RAID 5, where it should be aligned on a RAID strip,
> etc.).  More of an issue might be that there will be some disagreement
> about whether people want the system to automatically tune for
> average throughput vs 99.9 percentile latency.
> 
> Regardless, this is actually something which I think the file system
> should try to do automatically if at all possible, via some kind of
> auto-tuning heuristic, instead of using an explicit fallocate(2) flag.
> (See, I don't propose using a new fallocate flag for everything.  :-)
> 
>       	      	      	      - Ted
> 

It sounds like it might well be worth experimenting with the thresholds
as you suggest; 32k is really pretty small. I guess that the real
question here is what the cost of the metadata change is (to record what
is written and what remains unwritten) vs. simply zeroing out the
unwritten blocks in the extent when the write occurs.
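As a back-of-the-envelope way to frame that comparison, the sketch
below (Python, purely illustrative) treats the metadata change as a
fixed cost and the zeroing as a bandwidth-bound cost. The function
names and every number fed into them are assumptions, not measurements,
and seeks, journalling and device-side zeroing offloads are ignored.

```python
def zeroing_cost_ms(extent_bytes: int, bandwidth_mb_s: float) -> float:
    """Time to write zeros over the extent, ignoring seek time."""
    return extent_bytes / (bandwidth_mb_s * 1_000_000) * 1000.0


def should_zero_out(extent_bytes: int, bandwidth_mb_s: float,
                    metadata_update_ms: float) -> bool:
    """True if zeroing the extent is no slower than the metadata update."""
    return zeroing_cost_ms(extent_bytes, bandwidth_mb_s) <= metadata_update_ms


def break_even_bytes(bandwidth_mb_s: float, metadata_update_ms: float) -> int:
    """Extent size at which both approaches cost the same in this model."""
    return int(bandwidth_mb_s * 1_000_000 * metadata_update_ms / 1000.0)
```

With a hypothetical 100 MB/s device and a hypothetical 1 ms metadata
update, the break-even point in this model is 100000 bytes. Different
assumed numbers move it a long way in either direction, so the model
only says that the threshold depends on the device.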

There are likely to be a number of factors affecting that, and the
answer doesn't appear straightforward.

Steve.
