From: "Theodore Ts'o" <tytso@mit.edu>
To: Andreas Dilger <adilger@dilger.ca>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
Dave Chinner <david@fromorbit.com>,
Linus Torvalds <torvalds@linux-foundation.org>,
Ric Wheeler <rwheeler@redhat.com>,
Andy Lutomirski <luto@amacapital.net>,
One Thousand Gnomes <gnomes@lxorguk.ukuu.org.uk>,
Gregory Farnum <greg@gregs42.com>,
Martin Petersen <martin.petersen@oracle.com>,
Christoph Hellwig <hch@infradead.org>,
Jens Axboe <axboe@kernel.dk>,
Andrew Morton <akpm@linux-foundation.org>,
Linux API <linux-api@vger.kernel.org>,
Linux Kernel Mailing List <linux-kernel@vger.kernel.org>,
shane.seymour@hpe.com, Bruce Fields <bfields@fieldses.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>,
Jeff Layton <jlayton@poochiereds.net>,
Eric Sandeen <esandeen@redhat.com>
Subject: Re: [PATCH 2/2] block: create ioctl to discard-or-zeroout a range of blocks
Date: Wed, 16 Mar 2016 20:15:02 -0400
Message-ID: <20160317001502.GF23593@thunk.org>
In-Reply-To: <7674C689-C07E-4D38-85EB-4FD9B55CBB35@dilger.ca>
On Wed, Mar 16, 2016 at 03:45:49PM -0600, Andreas Dilger wrote:
> > Clearly, the performance hit of unwritten extent conversion is large
> > enough to tempt people to ask for no-hide-stale. But I'd rather hear
> > that directly from a developer, Ceph or otherwise.
>
> I suspect that this gets significantly worse if you are running with
> random writes instead of sequential overwrites. With sequential overwrites
> there is only a single boundary between init and uninit extents, so at
> most one extra extent in the tree. The above performance deltas will also
> be much larger when real disks are involved and seek latency is a factor.
It will vary a lot depending on your use case. If you are running
with data=ordered and with journalling enabled, then even if only a
single extent is modified, the fact that a journal transaction is
involved, with a forced data block flush to avoid revealing stale
data, is certainly going to be measurable.
The other thing is if you are worried about tail latency, which is a
major concern at Google[1], and you are running your disks close to
flat out, the fact that you have to do an extra seek to update the
extent tree is a seek that you can't be using for useful work --- and
worse, could delay a low-latency read from completing within your SLO.
[1] https://research.google.com/pubs/pub44830.html
Part of what's challenging with giving numbers is that it's trivially
easy to give some worst-case scenario where the numbers are really
terrible. A random 4k write benchmark into an fallocated file,
even with XFS, would have pretty bad numbers, but of course people
wouldn't say that it's very realistic. But those are the easiest to
get.
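For illustration only, the kind of worst-case microbenchmark described
above could be sketched roughly as follows. This is not the benchmark
anyone in this thread actually ran; the function name
random_write_bench, its parameters, and the choice of buffered writes
followed by fdatasync() (rather than O_DIRECT or fio) are all
assumptions made for the sketch. It times random 4k writes into a file
that was either preallocated with posix_fallocate() or filled with real
zeros beforehand, so the two cases can be compared on the same
filesystem.

```python
import os
import random
import time

BLOCK = 4096  # 4k write size, as in the scenario described above


def random_write_bench(path, file_size, n_writes, preallocate, seed=0):
    """Return the seconds taken by n_writes random 4k writes + fdatasync.

    Hypothetical helper for illustration. With preallocate=True the file
    is reserved via posix_fallocate(); on extent-based filesystems such
    as ext4 or XFS those blocks are typically marked unwritten, so the
    first write to each block needs a metadata update. With
    preallocate=False the file is filled with zeros first, so the timed
    writes are plain overwrites.
    """
    rng = random.Random(seed)
    payload = os.urandom(BLOCK)
    n_blocks = file_size // BLOCK

    fd = os.open(path, os.O_RDWR | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        if preallocate:
            os.posix_fallocate(fd, 0, file_size)
        else:
            zeros = bytes(BLOCK)
            for i in range(n_blocks):
                os.pwrite(fd, zeros, i * BLOCK)
        os.fsync(fd)  # make sure setup I/O is not counted in the timing

        start = time.perf_counter()
        for _ in range(n_writes):
            offset = rng.randrange(n_blocks) * BLOCK
            os.pwrite(fd, payload, offset)
            os.fdatasync(fd)  # force each write out so its cost is visible
        return time.perf_counter() - start
    finally:
        os.close(fd)
```

Calling it twice on the same path, once with preallocate=True and once
with preallocate=False, gives two timings to compare. The results depend
heavily on the filesystem, mount options, and storage device, and a
block that is hit more than once only pays any conversion cost the
first time, so small runs on fast devices may show little difference.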
The most realistic numbers are going to be a lot harder to get, and
wouldn't necessarily make a lot of sense without revealing a lot of
proprietary information. I will say that Google does have a fairly
large number of disks[2] and so even a small fractional percentage
gain multiplied by gazillions of disks starts turning into a dollar
number with enough zeros that people really sit up and take notice.
I'll also note that map reduce can be quite nasty as far as random I/O
is concerned[3], and while map reduce jobs are often not high priority
jobs, they can interfere with low-latency reads from important
applications (e.g., web search, user-visible gmail operations, etc.)
[2] https://what-if.xkcd.com/63/
[3] https://pdfs.semanticscholar.org/6238/e5f0fd807f634f5999701c7aa6a09d88dfc8.pdf
So I'm not sure what numbers I can really give that would satisfy
people. Doing a random write fio job is not hard, and will result in
fairly impressive numbers. If that's enough, then either I can do
this, or Chris Mason can reproduce his experiment using XFS (which
would presumably eliminate the excuse that it's because ext4 sucks at
extent operations). But if that's not going to convince people, then
I'd much rather not waste my time.
Besides, at Google it's easy enough for me to maintain the patch
out-of-tree. It's the Ceph folks who would need, at the very least,
to have such a patch ship in Red Hat Enterprise Linux. So it's probably
better for them to justify it, if numbers are really necessary.
- Ted