linux-kernel.vger.kernel.org archive mirror
From: Ric Wheeler <rwheeler@redhat.com>
To: Howard Chu <hyc@symas.com>
Cc: Dave Chinner <david@fromorbit.com>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Steven Rostedt <rostedt@goodmis.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Ingo Molnar <mingo@kernel.org>,
	Christoph Hellwig <hch@infradead.org>,
	Martin Steigerwald <Martin@lichtvoll.de>,
	Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH, 3.7-rc7, RESEND] fs: revert commit bbdd6808 to fallocate UAPI
Date: Sat, 08 Dec 2012 09:02:40 -0500
Message-ID: <50C34880.3020304@redhat.com>
In-Reply-To: <50C3461E.7030801@symas.com>

On 12/08/2012 08:52 AM, Howard Chu wrote:
> Dave Chinner wrote:
>> On Fri, Dec 07, 2012 at 03:25:53PM -0800, Howard Chu wrote:
>>> I have to agree that, if this is going to be an ext4-specific
>>> feature, then it can just be implemented via an ext4-specific ioctl
>>> and be done with it. But I'm not convinced this should be an
>>> ext4-specific feature.
>>>
>>> As for "fix the problem properly" - you're fixing the wrong problem.
>>> This type of feature is important to me, not just because of the
>>> performance issue. As has already been pointed out, the performance
>>> difference may even be negligible.
>>>
>>> But on SSDs, the issue is write endurance. The whole point of
>>> preallocating a file is to avoid doing incremental metadata updates.
>>> Particularly when each of those 1-bit status updates costs entire
>>> blocks, and gratuitously shortens the life of the media. The fact
>>> that avoiding the unnecessary wear and tear may also yield a
>>> performance boost is just icing on the cake. (And if the perf boost
>>> is over a factor of 2:1 that's some pretty damn good icing.)
>>
>> That's a filesystem implementation specific problem, not a generic
>> fallocate() or unwritten extent conversion problem.
>>
>> Besides, ext4 doesn't write back every metadata modification that is
>> made - they are aggregated in memory and only written when the
>> journal is full or the metadata ages out. Hence unwritten extent
>> conversion has very little impact on the amount of writes that are
>> done to the flash because it is vastly dominated by the data writes.
>>
>> Similarly, in XFS you might see a few thousand or tens of thousands
>> of metadata blocks get written once every 30s under such a random
>> write workload, but each metadata block might have gone through a
>> million changes in memory since the last time it was written.
>> Indeed, in that 30s, there would have been a few million random data
>> writes so the metadata writes are well and truly lost in the
>> noise...
>
> That's only true if write caching is allowed. If you have a transactional 
> database running, it's syncing every transaction to media.
>

The math just does not add up - no device sustains millions of random IOs per
second.

Each class of device has a fixed budget of IOs it can do per second. S-ATA
disks do roughly 40-50 IOPS, SAS maybe twice that, enterprise arrays around
10k IOPS and PCI-E cards around 100k IOPS.
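
To make that arithmetic concrete, here is a minimal sketch that turns the
rough figures above into an upper bound on random IOs per 30 second window.
The numbers are the ballpark ones from this mail, "SAS maybe twice that" is
read as about 100 IOPS, and the names DEVICE_IOPS and ios_in_window are
made up for illustration.

```python
# Rough random-IOPS figures per device class, taken from the text above.
# "SAS maybe twice that" is read here as about 100 IOPS (an assumption).
DEVICE_IOPS = {
    "S-ATA disk": 50,
    "SAS disk": 100,
    "enterprise array": 10_000,
    "PCI-E card": 100_000,
}


def ios_in_window(iops: int, seconds: int) -> int:
    """Upper bound on random IOs a device can complete in `seconds`."""
    return iops * seconds


if __name__ == "__main__":
    for name, iops in DEVICE_IOPS.items():
        print(f"{name:>16}: {ios_in_window(iops, 30):>9,} IOs in 30 s")
```

Even the fastest class listed stays well below a million random IOs in any
single second.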

Transactional databases accumulate multiple updates in memory and commit to disk 
*in transactions* to pack as much as possible into the IOPS that the device has. 
Batching is a core principle of database performance (just like we do in the 
guts of XFS or ext4 in our file system transactions).
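
As a rough illustration of that batching, the sketch below writes a set of
records and issues one fsync() per batch instead of one per record. It is
not how any particular database does group commit; commit_batched and the
record sizes are hypothetical, and it assumes os.write() on a regular file
writes the whole buffer.

```python
import os
import tempfile


def commit_batched(fd: int, records: list[bytes], batch_size: int) -> int:
    """Write records, calling fsync once per batch; return the fsync count."""
    syncs = 0
    for start in range(0, len(records), batch_size):
        batch = records[start:start + batch_size]
        os.write(fd, b"".join(batch))  # one write for the whole batch
        os.fsync(fd)                   # one durable commit per batch
        syncs += 1
    return syncs


if __name__ == "__main__":
    records = [b"x" * 64 for _ in range(10)]
    with tempfile.TemporaryFile() as f:
        unbatched = commit_batched(f.fileno(), records, batch_size=1)
        batched = commit_batched(f.fileno(), records, batch_size=4)
    print(f"fsync calls: {unbatched} unbatched vs {batched} batched")
```

Each fsync() costs at least one IO against the device's budget, so fewer,
larger commits leave more of that budget for data.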

If you use a transactional DB, you should pre-allocate the table space (and
probably the log file). Since DBs are long-lived items, it would be wise to
zero the blocks as well when you pre-allocate. Pay once for the
initialization and off you go. That is more expensive for really large tables
on slow devices, of course.
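
A minimal sketch of "pre-allocate and zero", assuming the application does
it itself by writing zeros once at creation time. A plain fallocate() would
reserve the space but leave the extents unwritten on ext4 or XFS, which is
the conversion cost discussed in this thread. The name preallocate_zeroed
and the 1 MiB chunk size are arbitrary choices for illustration.

```python
import os
import tempfile


def preallocate_zeroed(path: str, size: int, chunk: int = 1 << 20) -> None:
    """Create `path` holding `size` bytes of explicitly written zeros."""
    zeros = bytes(chunk)
    with open(path, "wb") as f:
        remaining = size
        while remaining > 0:
            n = min(chunk, remaining)
            f.write(zeros[:n])
            remaining -= n
        f.flush()
        os.fsync(f.fileno())  # pay the initialization cost once, up front


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "tablespace.dat")
        preallocate_zeroed(path, 3 * (1 << 20) + 123)
        print(os.path.getsize(path), "bytes initialized")
```

Later random writes into such a file overwrite blocks that are already
allocated and initialized.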

Again, we are back to the core need here - we want to improve the performance
of a random, small IO workload. This workload was historically *so* painful
(remember those 40-50 IOPS for an S-ATA disk :)) that pretty much every sane
application avoided random IO like the plague.

With newer devices such as SSDs, random IO becomes more reasonable, and we
need to fix the performance to accommodate workloads that were not normal
before.

Ric



