From: Dave Chinner <david@fromorbit.com>
To: "Jörn Engel" <joern@logfs.org>
Cc: linux-fsdevel@vger.kernel.org
Subject: Re: Filesystem benchmarks on reasonably fast hardware
Date: Mon, 18 Jul 2011 20:57:49 +1000
Message-ID: <20110718105749.GE30254@dastard>
In-Reply-To: <20110718075339.GB1437@logfs.org>
On Mon, Jul 18, 2011 at 09:53:39AM +0200, Jörn Engel wrote:
> On Mon, 18 July 2011 09:32:52 +1000, Dave Chinner wrote:
> > On Sun, Jul 17, 2011 at 06:05:01PM +0200, Jörn Engel wrote:
> > >
> > > Numbers below were created with sysbench, using directIO. Each block
> > > is a matrix with results for blocksizes from 512B to 16384B and thread
> > > count from 1 to 128. Four blocks for reads and writes, both
> > > sequential and random.
> >
> > What's the command line/script used to generate the result matrix?
> > And what kernel are you running on?
>
> Script is attached. Kernel is git from July 13th (51414d41).
Ok, thanks.
> > > xfs:
> > > ====
> > > seqrd      1      2      4      8     16     32     64    128
> > > 16384   4698   4424   4397   4402   4394   4398   4642   4679
> > >  8192   6234   5827   5797   5801   5795   6114   5793   5812
> > >  4096   9100   8835   8882   8896   8874   8890   8910   8906
> > >  2048  14922  14391  14259  14248  14264  14264  14269  14273
> > >  1024  23853  22690  22329  22362  22338  22277  22240  22301
> > >   512  37353  33990  33292  33332  33306  33296  33224  33271
> >
> > Something is single threading completely there - something is very
> > wrong. Someone want to send me a nice fast pci-e SSD - my disks
> > don't spin that fast... :/
>
> I wish I could just go down the shop and pick one from the
> manufacturing line. :/
Heh. At this point any old pci-e ssd would be an improvement ;)
> > > rndwr       1       2       4       8      16      32      64     128
> > > 16384   38447   38153   38145   38140   38156   38199   38208   38236
> > >  8192   78001   76965   76908   76945   77023   77174   77166   77106
> > >  4096  160721  156000  157196  157084  157078  157123  156978  157149
> > >  2048  325395  317148  317858  318442  318750  318981  319798  320393
> > >  1024  434084  649814  650176  651820  653928  654223  655650  655818
> > >   512  501067  876555 1290292 1217671 1244399 1267729 1285469 1298522
> >
> > I'm assuming that if the h/w can do 650MB/s then the numbers are in
> > IOPS? From 4 threads up all results equate to 650MB/s.
>
> Correct. Writes are spread automatically across all chips. They are
> further cached, so until every chip is busy writing, their effective
> latency is pretty much 0. Makes for a pretty flat graph, I agree.
>
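(Rough arithmetic backs that up: 157k IOPS x 4k is ~640MB/s, 320k x 2k is
~655MB/s and 38k x 16k is ~625MB/s, so everything from 4 threads up sits at
roughly the device's 650MB/s limit.)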
> > > Sequential reads are pretty horrible. Sequential writes are hitting a
> > > hot lock again.
> >
> > lockstat output?
>
> Attached for the bottom right case each of seqrd and seqwr. I hope
> the filenames are descriptive enough.
Looks like you attached the seqrd lockstat twice.
> Lockstat itself hurts
> performance. Writes were at 32245 IO/s from 298013, reads at 22458
> IO/s from 33271. In a way we are measuring oranges to figure out why
> our apples are so small.
Yeah, but at least it points out the lock in question - the iolock.
We grab it exclusively for a very short period of time on each
direct IO read to check the page cache state, then demote it to
shared. I can see that when IO times are very short, this will, in
fact, serialise multiple readers to a single file.
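In rough terms, the current read path does this (a simplified sketch
condensed from the code the patch below removes; error handling and the
page cache flush call are elided):

	if (unlikely(ioflags & IO_ISDIRECT)) {
		/*
		 * Exclusive IO lock: has to wait for every shared holder
		 * to drop the lock, i.e. for all direct IO already in
		 * flight on this inode to complete.
		 */
		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
		if (inode->i_mapping->nrpages) {
			/* blow away cached pages over the read range */
		}
		/* demote to shared for the actual read */
		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
	} else
		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);

So every direct IO read pays the exclusive acquisition, even when there is
nothing in the page cache to invalidate.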
A single thread shows this locking pattern:
sysbench-3087 [000] 2192558.643146: xfs_ilock: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3087 [000] 2192558.643147: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
sysbench-3087 [000] 2192558.643150: xfs_ilock: dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
sysbench-3087 [001] 2192558.643877: xfs_ilock: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3087 [001] 2192558.643879: xfs_ilock_demote: dev 253:0 ino 0x83 flags IOLOCK_EXCL caller T.1428
sysbench-3087 [007] 2192558.643881: xfs_ilock: dev 253:0 ino 0x83 flags ILOCK_SHARED caller xfs_ilock_map_shared
Two threads show this:
sysbench-3096 [005] 2192697.678308: xfs_ilock: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3096 [005] 2192697.678314: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
sysbench-3096 [005] 2192697.678335: xfs_ilock: dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
sysbench-3097 [006] 2192697.678556: xfs_ilock: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3097 [006] 2192697.678556: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
sysbench-3097 [006] 2192697.678577: xfs_ilock: dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
sysbench-3096 [007] 2192697.678976: xfs_ilock: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller xfs_rw_ilock
sysbench-3096 [007] 2192697.678978: xfs_ilock_demote: dev 253:0 ino 0x1c02c2 flags IOLOCK_EXCL caller T.1428
sysbench-3096 [007] 2192697.679000: xfs_ilock: dev 253:0 ino 0x1c02c2 flags ILOCK_SHARED caller xfs_ilock_map_shared
Which shows the exclusive lock taken for each new read serialising against
the IO already in progress. Oops, that's not good.
Ok, here's what the patch below does to the numbers on my test setup at a
16k IO size:

seqrd        1     2     4     8    16
vanilla   3603  2798  2563  not tested...
patched   3707  5746 10304 12875 11016

So those numbers look a lot healthier. The patch is below.
> --
> Fancy algorithms are slow when n is small, and n is usually small.
> Fancy algorithms have big constants. Until you know that n is
> frequently going to be big, don't get fancy.
> -- Rob Pike
Heh. XFS always assumes n will be big. Because where XFS is used, it
just is.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
xfs: don't serialise direct IO reads on page cache checks
From: Dave Chinner <dchinner@redhat.com>
There is no need to grab the i_mutex or the IO lock in exclusive
mode if we don't need to invalidate the page cache. Taking these
locks on every direct IO effectively serialises them, as taking the IO
lock in exclusive mode has to wait for all shared holders to drop
the lock. That only happens when IO is complete, so effectively it
prevents dispatch of concurrent direct IO reads to the same inode.
Fix this by taking the IO lock shared to check the page cache state,
and only then drop it and take the IO lock exclusively if there is
work to be done. Hence for the normal direct IO case, no exclusive
locking will occur.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
fs/xfs/linux-2.6/xfs_file.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)
diff --git a/fs/xfs/linux-2.6/xfs_file.c b/fs/xfs/linux-2.6/xfs_file.c
index 1e641e6..16a4bf0 100644
--- a/fs/xfs/linux-2.6/xfs_file.c
+++ b/fs/xfs/linux-2.6/xfs_file.c
@@ -321,7 +321,19 @@ xfs_file_aio_read(
 	if (XFS_FORCED_SHUTDOWN(mp))
 		return -EIO;
 
-	if (unlikely(ioflags & IO_ISDIRECT)) {
+	/*
+	 * Locking is a bit tricky here. If we take an exclusive lock
+	 * for direct IO, we effectively serialise all new concurrent
+	 * read IO to this file and block it behind IO that is currently in
+	 * progress because IO in progress holds the IO lock shared. We only
+	 * need to hold the lock exclusive to blow away the page cache, so
+	 * only take lock exclusively if the page cache needs invalidation.
+	 * This allows the normal direct IO case of no page cache pages to
+	 * proceeed concurrently without serialisation.
+	 */
+	xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	if ((ioflags & IO_ISDIRECT) && inode->i_mapping->nrpages) {
+		xfs_rw_iunlock(ip, XFS_IOLOCK_SHARED);
 		xfs_rw_ilock(ip, XFS_IOLOCK_EXCL);
 		if (inode->i_mapping->nrpages) {
 			ret = -xfs_flushinval_pages(ip,
@@ -334,8 +346,7 @@ xfs_file_aio_read(
 			}
 		}
 		xfs_rw_ilock_demote(ip, XFS_IOLOCK_EXCL);
-	} else
-		xfs_rw_ilock(ip, XFS_IOLOCK_SHARED);
+	}
 
 	trace_xfs_file_read(ip, size, iocb->ki_pos, ioflags);
--