From: Brian Foster <bfoster@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: Jan Kara <jack@suse.cz>,
linux-fsdevel@vger.kernel.org, dchinner@redhat.com,
jbacik@fb.com
Subject: Re: [PATCH v6 1/2] sb: add a new writeback list for sync
Date: Thu, 21 Jan 2016 10:22:57 -0500
Message-ID: <20160121152256.GB19272@bfoster.bfoster>
In-Reply-To: <20160120201159.GW6033@dastard>
On Thu, Jan 21, 2016 at 07:11:59AM +1100, Dave Chinner wrote:
> On Wed, Jan 20, 2016 at 02:26:26PM +0100, Jan Kara wrote:
> > On Tue 19-01-16 12:59:12, Brian Foster wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > wait_sb_inodes() currently does a walk of all inodes in the
> > > filesystem to find dirty ones to wait on during sync. This is highly
> > > inefficient and wastes a lot of CPU when there are lots of clean
> > > cached inodes that we don't need to wait on.
> > >
> > > To avoid this "all inode" walk, we need to track inodes that are
> > > currently under writeback that we need to wait for. We do this by
> > > adding inodes to a writeback list on the sb when the mapping is
> > > first tagged as having pages under writeback. wait_sb_inodes() can
> > > then walk this list of "inodes under IO" and wait specifically just
> > > for the inodes that the current sync(2) needs to wait for.
> > >
> > > Define a couple helpers to add/remove an inode from the writeback
> > > list and call them when the overall mapping is tagged for or cleared
> > > from writeback. Update wait_sb_inodes() to walk only the inodes
> > > under writeback due to the sync.
> >
Hi Jan, Dave,
> > The patch looks good. Just one comment: This grows struct inode by two
> > longs. Such a growth should be justified by measuring the improvements. So
> > can you measure some numbers showing how much the patch helped? I think it
> > would be interesting to see:
> >
Thanks. Indeed, I had run some simple tests that demonstrate the
effectiveness of the change, and I reran them recently against the latest
version. Some results are appended to this mail.

Note that I don't have anything at the moment that demonstrates a notable
improvement with RCU over the original spinlock approach. I can play with
that a bit more, but that's not really the crux of the patch.
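For anyone reading along without the patch in front of them, the core of
the change looks roughly like the sketch below. This is a simplified
illustration only; the helper and field names (sb_mark_inode_writeback(),
i_wb_list, s_inodes_wb, s_inode_wblist_lock) are how I'd describe it here
and may not match the patch exactly.

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/spinlock.h>

/*
 * Called when the mapping first gains the PAGECACHE_TAG_WRITEBACK tag,
 * i.e. when the inode goes from "no pages under writeback" to "some
 * pages under writeback".
 */
static void sb_mark_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	/* Unlocked fast path: inode is already on the per-sb list. */
	if (list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		/* Recheck under the lock to avoid a double insertion. */
		if (list_empty(&inode->i_wb_list))
			list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}

/*
 * Called when writeback of the last page completes and the
 * PAGECACHE_TAG_WRITEBACK tag is cleared for the whole mapping.
 */
static void sb_clear_inode_writeback(struct inode *inode)
{
	struct super_block *sb = inode->i_sb;
	unsigned long flags;

	if (!list_empty(&inode->i_wb_list)) {
		spin_lock_irqsave(&sb->s_inode_wblist_lock, flags);
		list_del_init(&inode->i_wb_list);
		spin_unlock_irqrestore(&sb->s_inode_wblist_lock, flags);
	}
}

The unlocked list_empty() check with a recheck under the lock keeps the
common completion path cheap when the inode is already on (or off) the
list, so only the state transitions pay for the lock.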
> > a) How much sync(2) speed has improved if there's not much to wait for.
>
> Depends on the size of the inode cache when sync is run. If it's
> empty it's not noticeable. When you have tens of millions of cached,
> clean inodes the inode list traversal can take tens of seconds.
> This is the sort of problem Josef reported that FB were having...
>
FWIW, Ceph has indicated this is a pain point for them as well. The
results at [0] below show the difference in sync time with a largely
populated inode cache before and after this patch.
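To make the sync-side difference concrete, the wait loop roughly becomes
the following. Again, this is a simplified sketch, not verbatim from the
patch: inode refcounting and i_state checks are elided, and the names are
illustrative.

#include <linux/fs.h>
#include <linux/pagemap.h>
#include <linux/sched.h>

/*
 * Sync-side walk: instead of scanning every inode on sb->s_inodes,
 * only visit inodes that were put on the writeback list above.
 */
static void wait_sb_inodes_sketch(struct super_block *sb)
{
	LIST_HEAD(sync_list);

	spin_lock_irq(&sb->s_inode_wblist_lock);
	/* Snapshot the list; writeback completion keeps pruning s_inodes_wb. */
	list_splice_init(&sb->s_inodes_wb, &sync_list);

	while (!list_empty(&sync_list)) {
		struct inode *inode = list_first_entry(&sync_list, struct inode,
						       i_wb_list);
		struct address_space *mapping = inode->i_mapping;

		/* Put it back; it may still be under writeback afterwards. */
		list_move_tail(&inode->i_wb_list, &sb->s_inodes_wb);

		/* Writeback finished since the splice: nothing to wait for. */
		if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
			continue;

		spin_unlock_irq(&sb->s_inode_wblist_lock);

		/*
		 * The real code grabs an inode reference and rechecks
		 * i_state (I_FREEING etc.) before waiting; elided here.
		 */
		filemap_fdatawait_keep_errors(mapping);
		cond_resched();

		spin_lock_irq(&sb->s_inode_wblist_lock);
	}
	spin_unlock_irq(&sb->s_inode_wblist_lock);
}

Splicing onto a private list means sync(2) only waits for writeback that
was already in flight when it started rather than chasing I/O submitted
afterwards, and clean inodes never appear on the list at all, which is
where the big win over the full s_inodes walk comes from.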
> > b) See whether parallel heavy stat(2) load which is rotating lots of inodes
> > in inode cache sees some improvement when it doesn't have to contend with
> > sync(2) on s_inode_list_lock. I believe Dave Chinner had some loads where
> > the contention on s_inode_list_lock due to sync and rotation of inodes was
> > pretty heavy.
>
> Just my usual fsmark workloads - they have parallel find and
> parallel ls -lR traversals over the created fileset. Even just
> running sync during creation (because there are millions of cached
> inodes, and ~250,000 inodes being instantiated and reclaimed every
> second) causes lock contention problems....
>
I ran a similar parallel (16x) fs_mark workload using '-S 4', which
incorporates a sync() per pass. Without this patch, throughput degrades
steadily as the inode cache grows. Results are at [1].
Brian
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
16-CPU, 32GB RAM x86-64 server
Storage is LVM volumes on hardware raid0.
[0] -- sync test w/ ~10m clean inode cache
- 10TB pre-populated XFS fs, cache populated via parallel find/stat
workload
--- 4.4.0+
# cat /proc/slabinfo | grep xfs
xfs_dqtrx 0 0 528 62 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_dquot 0 0 656 49 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_buf 496293 496893 640 51 8 : tunables 0 0 0 : slabdata 9743 9743 0
xfs_icr 0 0 144 56 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_inode 10528071 10529150 1728 18 8 : tunables 0 0 0 : slabdata 584999 584999 0
xfs_efd_item 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
xfs_da_state 544 544 480 34 4 : tunables 0 0 0 : slabdata 16 16 0
xfs_btree_cur 0 0 208 39 2 : tunables 0 0 0 : slabdata 0 0 0
# time sync
real 0m7.322s
user 0m0.000s
sys 0m7.314s
# time sync
real 0m7.299s
user 0m0.000s
sys 0m7.296s
--- 4.4.0+ w/ sync patch
# cat /proc/slabinfo | grep xfs
xfs_dqtrx 0 0 528 62 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_dquot 0 0 656 49 8 : tunables 0 0 0 : slabdata 0 0 0
xfs_buf 428214 428514 640 51 8 : tunables 0 0 0 : slabdata 8719 8719 0
xfs_icr 0 0 144 56 2 : tunables 0 0 0 : slabdata 0 0 0
xfs_inode 11054375 11054438 1728 18 8 : tunables 0 0 0 : slabdata 721323 721323 0
xfs_efd_item 0 0 400 40 4 : tunables 0 0 0 : slabdata 0 0 0
xfs_da_state 544 544 480 34 4 : tunables 0 0 0 : slabdata 16 16 0
xfs_btree_cur 0 0 208 39 2 : tunables 0 0 0 : slabdata 0 0 0
# time sync
real 0m0.040s
user 0m0.001s
sys 0m0.003s
# time sync
real 0m0.002s
user 0m0.001s
sys 0m0.002s
[1] -- fs_mark -D 1000 -S4 -n 1000 -d /mnt/0 ... -d /mnt/15 -L 32
- 1TB XFS fs
--- 4.4.0+
FSUse% Count Size Files/sec App Overhead
2 16000 51200 3313.3 822514
2 32000 51200 3353.6 310268
2 48000 51200 3475.2 289941
2 64000 51200 3104.6 289993
2 80000 51200 2944.9 292124
2 96000 51200 3010.4 288042
3 112000 51200 2756.4 289761
3 128000 51200 2753.2 288096
3 144000 51200 2474.4 290797
3 160000 51200 2657.9 290898
3 176000 51200 2498.0 288247
3 192000 51200 2415.5 287329
3 208000 51200 2336.1 291113
3 224000 51200 2352.9 290103
3 240000 51200 2309.6 289580
3 256000 51200 2344.3 289828
3 272000 51200 2293.0 291282
3 288000 51200 2295.5 286538
4 304000 51200 2119.0 288906
4 320000 51200 2059.6 293605
4 336000 51200 2129.1 289825
4 352000 51200 1929.8 288186
4 368000 51200 1987.5 294596
4 384000 51200 1929.1 293528
4 400000 51200 1934.8 288138
4 416000 51200 1823.6 292318
4 432000 51200 1838.7 290890
4 448000 51200 1797.5 288816
4 464000 51200 1823.2 287190
4 480000 51200 1738.7 295745
4 496000 51200 1716.4 293821
5 512000 51200 1726.7 290445
--- 4.4.0+ w/ sync patch
FSUse% Count Size Files/sec App Overhead
2 16000 51200 3409.7 999579
2 32000 51200 3481.3 286877
2 48000 51200 3447.3 282743
2 64000 51200 3522.3 283400
2 80000 51200 3427.0 286360
2 96000 51200 3360.2 307219
3 112000 51200 3377.7 286625
3 128000 51200 3363.7 285929
3 144000 51200 3345.7 283138
3 160000 51200 3384.9 291081
3 176000 51200 3084.1 285265
3 192000 51200 3388.4 291439
3 208000 51200 3242.8 286332
3 224000 51200 3337.9 285006
3 240000 51200 3442.8 292109
3 256000 51200 3230.3 283432
3 272000 51200 3358.3 286996
3 288000 51200 3309.0 288058
4 304000 51200 3293.4 284309
4 320000 51200 3221.4 284476
4 336000 51200 3241.5 283968
4 352000 51200 3228.3 284354
4 368000 51200 3255.7 286072
4 384000 51200 3094.6 290240
4 400000 51200 3385.6 288158
4 416000 51200 3265.2 284387
4 432000 51200 3315.2 289656
4 448000 51200 3275.1 284562
4 464000 51200 3238.4 294976
4 480000 51200 3060.0 290088
4 496000 51200 3359.5 286949
5 512000 51200 3156.2 288126