From: Jan Kara <jack@suse.cz>
To: Josef Bacik <jbacik@fb.com>
Cc: linux-fsdevel@vger.kernel.org, david@fromorbit.com,
viro@zeniv.linux.org.uk, jack@suse.cz,
linux-kernel@vger.kernel.org, Dave Chinner <dchinner@redhat.com>
Subject: Re: [PATCH 6/9] bdi: add a new writeback list for sync
Date: Mon, 16 Mar 2015 11:14:43 +0100
Message-ID: <20150316101443.GF4934@quack.suse.cz>
In-Reply-To: <1426016724-23912-7-git-send-email-jbacik@fb.com>

On Tue 10-03-15 15:45:21, Josef Bacik wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> wait_sb_inodes() currently does a walk of all inodes in the filesystem
> to find dirty ones to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the bdi when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> To avoid needing all the related locking to be safe against
> interrupts, Jan Kara suggested that we be lazy about removal from
> the writeback list. That is, we don't remove inodes from the
> writeback list on IO completion, but do it directly during a
> wait_sb_inodes() walk.
>
> This means that a rare sync(2) call will have some work to do
> skipping clean inodes. However, in the current problem case of
> concurrent sync workloads, concurrent wait_sb_inodes() calls only
> walk the very recently dispatched inodes and hence should have very
> little work to do.
>
> This also means that we have to remove the inodes from the writeback
> list during eviction. Do this in inode_wait_for_writeback() once
> all writeback on the inode is complete.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

The code looks mostly fine. Two comments below.

...

> /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
> */
> void inode_wait_for_writeback(struct inode *inode)
> {
> + BUG_ON(!(inode->i_state & I_FREEING));
> +
> spin_lock(&inode->i_lock);
> __inode_wait_for_writeback(inode);
> spin_unlock(&inode->i_lock);
> +
> + /*
> + * bd_inode's will have already had the bdev free'd so don't bother
> + * doing the bdi_clear_inode_writeback.
> + */
> + if (!sb_is_blkdev_sb(inode->i_sb))
> + bdi_clear_inode_writeback(inode_to_bdi(inode), inode);

Umm, I don't get this even though I've read the comment several times.
Why don't we want to remove the bdev inode from the writeback list of
blockdev_super? Is the comment suggesting that a bdev inode cannot be on
the writeback list?

> @@ -1307,28 +1347,59 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> - mutex_lock(&sb->s_sync_lock);
> - spin_lock(&sb->s_inode_list_lock);
> -
> /*
> - * Data integrity sync. Must wait for all pages under writeback,
> - * because there may have been pages dirtied before our sync
> - * call, but which had writeout started before we write it out.
> - * In which case, the inode may not be on the dirty list, but
> - * we still have to wait for that writeout.
> + * Data integrity sync. Must wait for all pages under writeback, because
> + * there may have been pages dirtied before our sync call, but which had
> + * writeout started before we write it out. In which case, the inode
> + * may not be on the dirty list, but we still have to wait for that
> + * writeout.
> + *
> + * To avoid syncing inodes put under IO after we have started here,
> + * splice the io list to a temporary list head and walk that. Newly
> + * dirtied inodes will go onto the primary list so we won't wait for
> + * them. This is safe to do as we are serialised by the s_sync_lock,
> + * so we'll complete processing the complete list before the next
> + * sync operation repeats the splice-and-walk process.
> + *
> + * Inodes that have pages under writeback after we've finished the wait
> + * may or may not be on the primary list. They had pages put under IO
> + * after we started our wait, so we need to make sure the next sync or
> + * IO completion treats them correctly. Move them back to the primary
> + * list and restart the walk.
> + *
> + * Inodes that are clean after we have waited for them don't belong on
> + * any list, so simply remove them from the sync list and move onwards.
> */
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + mutex_lock(&sb->s_sync_lock);
> + spin_lock(&bdi->wb.list_lock);
> + list_splice_init(&bdi->wb.b_writeback, &sync_list);
> +
> + while (!list_empty(&sync_list)) {
> + struct inode *inode = list_first_entry(&sync_list, struct inode,
> + i_wb_list);
> struct address_space *mapping = inode->i_mapping;
>
> + /*
> + * We are lazy on IO completion and don't remove inodes from the
> + * list when they are clean. Detect that immediately and skip
> + * inodes we don't have to wait on.
> + */
> + if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + list_del_init(&inode->i_wb_list);
> + cond_resched_lock(&bdi->wb.list_lock);
> + continue;
> + }
> +
> spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping->nrpages == 0)) {
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> + list_move(&inode->i_wb_list, &bdi->wb.b_writeback);

Indenting looks damaged here...

> spin_unlock(&inode->i_lock);
> + cond_resched_lock(&bdi->wb.list_lock);
> continue;

Why do we put freeing inodes back on the b_writeback list? For I_FREEING
and I_WILL_FREE we could just as well delete them. For I_NEW we would
start waiting for them once I_NEW gets cleared, but I_NEW inodes under
writeback still look somewhat worrying (the flusher thread skips them, so
the semantics aren't clear there). Anyway, let's probably just keep the
code as is for I_NEW, but moving I_FREEING | I_WILL_FREE inodes back to
b_writeback looks just dumb.
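
For illustration only, the handling suggested above can be sketched as a
small userspace model. This is not kernel code and not the patch author's
implementation: struct node, struct model_inode, the M_* flags and
model_wait_sb_inodes() are hypothetical stand-ins for struct list_head,
struct inode, the I_* state bits and wait_sb_inodes(). Locking is omitted,
"waiting" is modelled as writeback completing at once, and entries moved
back are appended at the tail, whereas list_move() inserts at the head.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag values; the kernel's real I_* bits differ. */
#define M_NEW       (1u << 0)
#define M_WILL_FREE (1u << 1)
#define M_FREEING   (1u << 2)

/* Minimal circular doubly-linked list, standing in for struct list_head. */
struct node { struct node *prev, *next; };

struct model_inode {
	struct node wb;        /* stands in for i_wb_list */
	unsigned int state;    /* stands in for i_state */
	bool under_writeback;  /* stands in for the PAGECACHE_TAG_WRITEBACK check */
};

static void node_init(struct node *n) { n->prev = n->next = n; }

static bool node_empty(const struct node *head) { return head->next == head; }

/* Unlink n and leave it self-linked, like list_del_init(). */
static void node_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	node_init(n);
}

/* Append n at the tail of head, like list_add_tail(). */
static void node_add_tail(struct node *n, struct node *head)
{
	n->prev = head->prev;
	n->next = head;
	head->prev->next = n;
	head->prev = n;
}

/* Move every entry from src to dst in order and leave src empty. */
static void node_splice_init(struct node *src, struct node *dst)
{
	while (!node_empty(src)) {
		struct node *n = src->next;

		node_del_init(n);
		node_add_tail(n, dst);
	}
}

/* Returns how many inodes the model actually "waited" on. */
static int model_wait_sb_inodes(struct node *b_writeback)
{
	struct node sync_list;
	int waited = 0;

	assert(b_writeback != NULL);
	node_init(&sync_list);
	node_splice_init(b_writeback, &sync_list);

	while (!node_empty(&sync_list)) {
		struct model_inode *inode = (struct model_inode *)
			((char *)sync_list.next - offsetof(struct model_inode, wb));

		/* Lazy removal: an inode with no writeback left is just dropped. */
		if (!inode->under_writeback) {
			node_del_init(&inode->wb);
			continue;
		}
		/* Suggested change: freeing inodes are deleted, not moved back. */
		if (inode->state & (M_FREEING | M_WILL_FREE)) {
			node_del_init(&inode->wb);
			continue;
		}
		/* As in the patch: I_NEW inodes go back to the primary list. */
		if (inode->state & M_NEW) {
			node_del_init(&inode->wb);
			node_add_tail(&inode->wb, b_writeback);
			continue;
		}
		/* Model the wait: writeback finishes and the inode leaves the list. */
		inode->under_writeback = false;
		node_del_init(&inode->wb);
		waited++;
	}
	return waited;
}
```

In this model, an inode flagged M_FREEING or M_WILL_FREE ends up on no list
after the walk, while an M_NEW inode still under writeback is left on the
primary list for a later sync to pick up.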
Honza
--
Jan Kara <jack@suse.cz>
SUSE Labs, CR