From: Howard Chu <hyc@symas.com>
To: ceph-devel@vger.kernel.org
Subject: Re: newstore direction
Date: Fri, 23 Oct 2015 06:21:43 +0000 (UTC)
Message-ID: <loom.20151023T081245-920@post.gmane.org>
In-Reply-To: 5627981B.2040409@redhat.com
Ric Wheeler <rwheeler <at> redhat.com> writes:
>
> On 10/21/2015 09:32 AM, Sage Weil wrote:
> > On Tue, 20 Oct 2015, Ric Wheeler wrote:
> >>> Now:
> >>> 1 io to write a new file
> >>> 1-2 ios to sync the fs journal (commit the inode, alloc change)
> >>> (I see 2 journal IOs on XFS and only 1 on ext4...)
> >>> 1 io to commit the rocksdb journal (currently 3, but will drop to
> >>> 1 with xfs fix and my rocksdb change)
> >> I think that might be too pessimistic - the number of discrete IOs sent down
> >> to a spinning disk has much less impact on performance than the number of
> >> fsync()s, since the IOs all land in the write cache. Some newer spinning
> >> drives have a non-volatile write cache, so even an fsync() might not end up
> >> doing the expensive data transfer to the platter.
> > True, but in XFS's case at least the file data and journal are not
> > colocated, so it's 2 seeks for the new file write+fdatasync and another for
> > the rocksdb journal commit. Of course, with a deep queue, we're doing
> > lots of these so there'd be fewer journal commits on both counts, but the
> > lower bound on latency of a single write is still 3 seeks, and that bound
> > is pretty critical when you also have network round trips and replication
> > (worst out of 2) on top.
>
> What are the performance goals we are looking for?
>
> Small, synchronous writes/second?
>
> File creates/second?
>
> I suspect that looking at things like seeks/write is probably looking at the
> wrong level of performance challenges. Again, when you write to a modern drive,
> you write to its write cache and it decides internally when/how to destage to
> the platter.
>
> If you look at the performance of XFS with streaming workloads, it will tend to
> max out the bandwidth of the underlying storage.
>
> If we need IOPS/file writes, etc., we should be clear on what we are aiming at.
>
> >
> >> It would be interesting to get the timings on the IOs you see to measure the
> >> actual impact.
> > I observed this with the journaling workload for rocksdb, but I assume the
> > journaling behavior is the same regardless of what is being journaled.
> > For a 4KB append to a file + fdatasync, I saw ~30ms latency for XFS, and
> > blktrace showed an IO to the file, and 2 IOs to the journal. I believe
> > the first one is the record for the inode update, and the second is the
> > journal 'commit' record (though I forget how I decided that). My guess is
> > that XFS is being extremely careful about journal integrity here and not
> > writing the commit record until it knows that the preceding records landed
> > on stable storage. For ext4, the latency was ~20ms, and blktrace
> > showed the IO to the file and then a single journal IO. When I made the
> > rocksdb change to overwrite an existing, prewritten file, the latency
> > dropped to ~10ms on ext4, and blktrace showed a single IO as expected.
> > (XFS still showed the 2 journal commit IOs, but Dave just posted the fix
> > for that on the XFS list today.)
> Normally, best practice is to use batching to avoid paying worst case latency
> when you do a synchronous IO. Write a batch of files or appends without fsync,
> then go back and fsync and you will pay that latency once (not per file/op).
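For reference, a minimal sketch of the batching pattern quoted above, assuming a
POSIX system and Python's os module. The names write_all and write_batch are
made up for illustration and do not come from Ceph or rocksdb. All writes are
issued first, and only then does a second pass call fsync on each file, so the
expensive flush tends to be paid near the start of the second pass rather than
after every write:

```python
import os
import tempfile


def write_all(fd, data):
    """Call os.write until every byte of data has been handed to the kernel."""
    view = memoryview(data)
    while len(view) > 0:
        written = os.write(fd, view)
        view = view[written:]


def write_batch(directory, payloads):
    """Write every payload without syncing, then fsync them in a second pass.

    payloads maps a file name to the bytes to store. Returns the file count.
    """
    fds = []
    try:
        # Pass 1: issue all writes; they land in the page cache, no flush yet.
        for name, data in payloads.items():
            fd = os.open(os.path.join(directory, name),
                         os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
            fds.append(fd)
            write_all(fd, data)
        # Pass 2: go back and make each file durable.
        for fd in fds:
            os.fsync(fd)
    finally:
        for fd in fds:
            os.close(fd)
    # Sync the directory too, so the new directory entries are durable.
    dir_fd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
    return len(fds)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        count = write_batch(tmp, {"obj.0": b"a" * 4096, "obj.1": b"b" * 4096})
        print("synced", count, "files")
```

How much latency this saves depends on the filesystem and the drive; the sketch
only shows the call pattern, not a measurement.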
If filesystems supported ordered writes, you wouldn't need to fsync at
all. Just spit out a stream of writes and declare that batch N must be
written before batch N+1. (Note that this is not identical to "write
barriers", which imposed the same latencies as fsync by blocking all I/Os at
a barrier boundary. Ordered writes may be freely interleaved with un-ordered
writes, so normal I/O traffic can proceed unhindered. Their ordering is only
enforced with respect to other ordered writes.)
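To make the distinction concrete, here is a toy model in Python. It is purely
illustrative: no such filesystem API exists today, and the names Write,
ok_with_ordered_writes and ok_with_barriers are hypothetical. Each check takes
the order in which writes reached stable storage and says whether that order
would be allowed under the given rule:

```python
from typing import NamedTuple, Optional, Sequence


class Write(NamedTuple):
    name: str
    batch: Optional[int]  # ordered-write batch number; None means un-ordered
    epoch: int            # how many barriers were issued before this write


def ok_with_ordered_writes(completions: Sequence[Write]) -> bool:
    """Only ordered writes are constrained: batch N must land before N+1."""
    last_batch = -1
    for w in completions:
        if w.batch is None:
            continue  # un-ordered writes may complete at any point
        if w.batch < last_batch:
            return False
        last_batch = w.batch
    return True


def ok_with_barriers(completions: Sequence[Write]) -> bool:
    """Every write issued before a barrier must land before any write after it."""
    last_epoch = 0
    for w in completions:
        if w.epoch < last_epoch:
            return False
        last_epoch = w.epoch
    return True


# Issue order: A (batch 1), U1, <barrier>, B (batch 2), U2.
a = Write("A", batch=1, epoch=0)
u1 = Write("U1", batch=None, epoch=0)
b = Write("B", batch=2, epoch=1)
u2 = Write("U2", batch=None, epoch=1)

# The device completes the un-ordered U2 early and U1 late.
completion_order = [u2, a, b, u1]

if __name__ == "__main__":
    print(ok_with_ordered_writes(completion_order))  # A still precedes B
    print(ok_with_barriers(completion_order))        # U2 crossed the barrier
```

In this model the completion order is acceptable for ordered writes, because A
still lands before B, but it violates the barrier rule, because U2 was issued
after the barrier and completed before writes issued ahead of it.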
It's a bit of a shame that Linux's SCSI drivers support ordering attributes but
nothing above that layer makes use of them.
--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/