From: Jan Kara <jack@suse.cz>
To: Josef Bacik <jbacik@fb.com>
Cc: linux-fsdevel@vger.kernel.org, david@fromorbit.com,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-kernel@vger.kernel.org, Dave Chinner <dchinner@redhat.com>
Subject: Re: [PATCH 6/9] bdi: add a new writeback list for sync
Date: Mon, 16 Mar 2015 11:14:43 +0100
Message-ID: <20150316101443.GF4934@quack.suse.cz>
In-Reply-To: <1426016724-23912-7-git-send-email-jbacik@fb.com>

On Tue 10-03-15 15:45:21, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> wait_sb_inodes() currently does a walk of all inodes in the filesystem
> to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> To avoid needing all the related locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
>
> This means that a rare sync(2) call will have some work to do
> skipping clean inodes. However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
>
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
The code looks mostly fine. Two comments below.
...
> /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
> */
> void inode_wait_for_writeback(struct inode *inode)
> {
> + BUG_ON(!(inode->i_state & I_FREEING));
> +
> spin_lock(&inode->i_lock);
> __inode_wait_for_writeback(inode);
> spin_unlock(&inode->i_lock);
> +
> + /*
> + * bd_inode's will have already had the bdev free'd so don't bother
> + * doing the bdi_clear_inode_writeback.
> + */
> + if (!sb_is_blkdev_sb(inode->i_sb))
> + bdi_clear_inode_writeback(inode_to_bdi(inode), inode);
Umm, I don't get this even though I've read the comment several times.
Why don't we want to remove the bdev inode from the writeback list of
blockdev_super? Is the comment suggesting that a bdev inode cannot be on
the writeback list?
> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - mutex_lock(&sb->s_sync_lock);
> - spin_lock(&sb->s_inode_list_lock);
> -
> /*
> - * Data integrity sync. Must wait for all pages under writeback,
> - * because there may have been pages dirtied before our sync
> - * call, but which had writeout started before we write it out.
> - * In which case, the inode may not be on the dirty list, but
> - * we still have to wait for that writeout.
> + * Data integrity sync. Must wait for all pages under writeback, because
> + * there may have been pages dirtied before our sync call, but which had
> + * writeout started before we write it out. In which case, the inode
> + * may not be on the dirty list, but we still have to wait for that
> + * writeout.
> + *
> + * To avoid syncing inodes put under IO after we have started here,
> + * splice the io list to a temporary list head and walk that. Newly
> + * dirtied inodes will go onto the primary list so we won't wait for
> + * them. This is safe to do as we are serialised by the s_sync_lock,
> + * so we'll complete processing the complete list before the next
> + * sync operation repeats the splice-and-walk process.
> + *
> + * Inodes that have pages under writeback after we've finished the wait
> + * may or may not be on the primary list. They had pages put under IO
> + * after we started our wait, so we need to make sure the next sync or
> + * IO completion treats them correctly. Move them back to the primary
> + * list and restart the walk.
> + *
> + * Inodes that are clean after we have waited for them don't belong on
> + * any list, so simply remove them from the sync list and move onwards.
> */
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + mutex_lock(&sb->s_sync_lock);
> + spin_lock(&bdi->wb.list_lock);
> + list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> + while (!list_empty(&sync_list)) {
> + struct inode *inode = list_first_entry(&sync_list, struct inode,
> + i_wb_list);
> struct address_space *mapping = inode->i_mapping;
>
> + /*
> + * We are lazy on IO completion and don't remove inodes from the
> + * list when they are clean. Detect that immediately and skip
> + * inodes we don't have to wait on.
> + */
> + if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + list_del_init(&inode->i_wb_list);
> + cond_resched_lock(&bdi->wb.list_lock);
> + continue;
> + }
> +
> spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping->nrpages == 0)) {
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + list_move(&inode->i_wb_list, &bdi->wb.b_writeback);
Indenting looks damaged here...
> spin_unlock(&inode->i_lock);
> + cond_resched_lock(&bdi->wb.list_lock);
> continue;
Why do we put freeing inodes back on the b_writeback list? For I_FREEING and
I_WILL_FREE we could just as well delete them. For I_NEW we would start
waiting for them once I_NEW gets cleared, but I_NEW inodes under
writeback still look somewhat worrying (the flusher thread skips them, so the
semantics aren't clear there). Anyway, let's probably just keep the code as is
for I_NEW, but moving I_FREEING | I_WILL_FREE inodes back to b_writeback looks
just dumb.
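
To make the suggestion concrete, here is a minimal userspace sketch of the
splice-and-walk loop with freeing inodes dropped instead of requeued. It is
not kernel code and not the patch itself: struct toy_inode, the TOY_* flags,
the under_writeback boolean and the list helpers are hypothetical stand-ins,
no locking is modelled, and "waiting" is reduced to clearing a flag. It
assumes, as the changelog says, that eviction takes care of list removal for
inodes being freed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for I_FREEING, I_WILL_FREE and I_NEW. */
enum { TOY_FREEING = 1 << 0, TOY_WILL_FREE = 1 << 1, TOY_NEW = 1 << 2 };

/* Minimal circular doubly-linked list, loosely modelled on list_head. */
struct list_node { struct list_node *prev, *next; };

static void list_init(struct list_node *n) { n->prev = n->next = n; }

static bool list_is_empty(const struct list_node *h) { return h->next == h; }

/* Unlink n and leave it self-linked, so it reads as "on no list". */
static void list_del_init_(struct list_node *n)
{
    n->prev->next = n->next;
    n->next->prev = n->prev;
    list_init(n);
}

static void list_add_tail_(struct list_node *n, struct list_node *h)
{
    n->prev = h->prev;
    n->next = h;
    h->prev->next = n;
    h->prev = n;
}

/* Move every entry of 'from' to the tail of 'to', leaving 'from' empty. */
static void list_splice_all(struct list_node *from, struct list_node *to)
{
    while (!list_is_empty(from)) {
        struct list_node *n = from->next;
        list_del_init_(n);
        list_add_tail_(n, to);
    }
}

struct toy_inode {
    struct list_node node;   /* must stay first: the walk casts node -> inode */
    unsigned state;          /* TOY_* flags */
    bool under_writeback;    /* stand-in for the mapping's writeback tag */
};

/* Put an inode on the (toy) writeback list when IO is first started. */
static void toy_track(struct toy_inode *inode, struct list_node *b_writeback)
{
    list_init(&inode->node);
    list_add_tail_(&inode->node, b_writeback);
}

/* Walk a private copy of the list; returns how many inodes were waited on. */
static int toy_wait_sb_inodes(struct list_node *b_writeback)
{
    struct list_node sync_list;
    int waited = 0;

    assert(b_writeback != NULL);
    list_init(&sync_list);
    list_splice_all(b_writeback, &sync_list);

    while (!list_is_empty(&sync_list)) {
        struct toy_inode *inode = (struct toy_inode *)sync_list.next;

        /* Every branch takes the inode off sync_list first. */
        list_del_init_(&inode->node);

        /* Lazy removal: IO already completed, nothing to wait for. */
        if (!inode->under_writeback)
            continue;

        /* Suggested change: freeing inodes are simply dropped. */
        if (inode->state & (TOY_FREEING | TOY_WILL_FREE))
            continue;

        /* Kept as in the patch: new inodes go back to the primary list. */
        if (inode->state & TOY_NEW) {
            list_add_tail_(&inode->node, b_writeback);
            continue;
        }

        /* Stand-in for waiting on writeback; the inode is clean afterwards. */
        inode->under_writeback = false;
        waited++;
    }
    return waited;
}
```

With four tracked inodes (one under writeback, one already clean, one
TOY_FREEING and one TOY_NEW, the last two still under writeback), the walk
waits on one inode and leaves only the TOY_NEW one on b_writeback.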
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR