From: Brian Foster <bfoster@redhat.com>
To: linux-fsdevel@vger.kernel.org
Cc: dchinner@redhat.com, jbacik@fb.com, jack@suse.cz
Subject: Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync
Date: Tue, 12 Jan 2016 14:24:53 -0500 [thread overview]
Message-ID: <20160112192453.GA31860@bfoster.bfoster> (raw)
In-Reply-To: <1452623739-32542-2-git-send-email-bfoster@redhat.com>
On Tue, Jan 12, 2016 at 01:35:38PM -0500, Brian Foster wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> wait_sb_inodes() currently does a walk of all inodes in the
> filesystem to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the sb when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> Define a couple helpers to add/remove an inode from the writeback
> list and call them when the overall mapping is tagged for or cleared
> from writeback. Update wait_sb_inodes() to walk only the inodes
> under writeback due to the sync.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/block_dev.c | 1 +
> fs/fs-writeback.c | 139 +++++++++++++++++++++++++++++++++++++---------
> fs/inode.c | 1 +
> fs/super.c | 2 +
> include/linux/fs.h | 4 ++
> include/linux/writeback.h | 3 +
> mm/page-writeback.c | 19 +++++++
> 7 files changed, 144 insertions(+), 25 deletions(-)
>
...
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 023f6a1..6821347 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
...
> @@ -1118,13 +1149,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
> }
>
> /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
> */
> void inode_wait_for_writeback(struct inode *inode)
> {
> + BUG_ON(!(inode->i_state & I_FREEING));
> +
> spin_lock(&inode->i_lock);
> __inode_wait_for_writeback(inode);
> spin_unlock(&inode->i_lock);
> +
> + /*
> + * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> + * the inode and then deref bdev->bd_disk, which at this point has been
> + * set to NULL, so we would panic. At the point we are dropping our
> + * bd_inode we won't have any pages under writeback on the device so
> + * this is safe. But just in case we'll assert to make sure we don't
> + * screw this up.
> + */
While the comment above is no longer relevant now that this doesn't
touch the bdi, I'm not totally sure whether the following few lines are
still necessary. I think this was associated with the lazy list removal
(to remove the inode on eviction), but I'd have to dig into that a bit
more to be sure...
Brian
> + if (!sb_is_blkdev_sb(inode->i_sb))
> + sb_clear_inode_writeback(inode);
> + BUG_ON(!list_empty(&inode->i_wb_list));
> }
>
> /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> */
> static void wait_sb_inodes(struct super_block *sb)
> {
> - struct inode *inode, *old_inode = NULL;
> + LIST_HEAD(sync_list);
>
> /*
> * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> + /*
> + * Data integrity sync. Must wait for all pages under writeback, because
> + * there may have been pages dirtied before our sync call, but which had
> + * writeout started before we write it out. In which case, the inode
> + * may not be on the dirty list, but we still have to wait for that
> + * writeout.
> + *
> + * To avoid syncing inodes put under IO after we have started here,
> + * splice the io list to a temporary list head and walk that. Newly
> + * dirtied inodes will go onto the primary list so we won't wait for
> + * them. This is safe to do as we are serialised by the s_sync_lock,
> + * so we'll complete processing the complete list before the next
> + * sync operation repeats the splice-and-walk process.
> + *
> + * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> + * acquired inside of the mapping lock by __test_set_page_writeback().
> + * We cannot acquire i_lock while the wblist lock is held without
> + * introducing irq inversion issues. Since s_inodes_wb is a subset of
> + * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> + * until we have a reference. Note that s_inode_wblist_lock protects the
> + * local sync_list as well because inodes can be dropped from either
> + * list by writeback completion.
> + */
> mutex_lock(&sb->s_sync_lock);
> +
> spin_lock(&sb->s_inode_list_lock);
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> + list_splice_init(&sb->s_inodes_wb, &sync_list);
>
> - /*
> - * Data integrity sync. Must wait for all pages under writeback,
> - * because there may have been pages dirtied before our sync
> - * call, but which had writeout started before we write it out.
> - * In which case, the inode may not be on the dirty list, but
> - * we still have to wait for that writeout.
> - */
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + while (!list_empty(&sync_list)) {
> + struct inode *inode = list_first_entry(&sync_list, struct inode,
> + i_wb_list);
> struct address_space *mapping = inode->i_mapping;
>
> + list_del_init(&inode->i_wb_list);
> +
> + /*
> + * The mapping can be cleaned before we wait on the inode since
> + * we do not have the mapping lock.
> + */
> + if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> + continue;
> +
> + spin_unlock_irq(&sb->s_inode_wblist_lock);
> +
> spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping->nrpages == 0)) {
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> spin_unlock(&inode->i_lock);
> +
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> continue;
> }
> __iget(inode);
> @@ -2140,29 +2219,39 @@ static void wait_sb_inodes(struct super_block *sb)
> spin_unlock(&sb->s_inode_list_lock);
>
> /*
> - * We hold a reference to 'inode' so it couldn't have been
> - * removed from s_inodes list while we dropped the
> - * s_inode_list_lock. We cannot iput the inode now as we can
> - * be holding the last reference and we cannot iput it under
> - * s_inode_list_lock. So we keep the reference and iput it
> - * later.
> - */
> - iput(old_inode);
> - old_inode = inode;
> -
> - /*
> * We keep the error status of individual mapping so that
> * applications can catch the writeback error using fsync(2).
> * See filemap_fdatawait_keep_errors() for details.
> */
> filemap_fdatawait_keep_errors(mapping);
> -
> cond_resched();
>
> + /*
> + * The inode has been written back. Check whether we still have
> + * pages under I/O and move the inode back to the primary list
> + * if necessary. sb_mark_inode_writeback() might not have done
> + * anything if the writeback tag hadn't been cleared from the
> + * mapping by the time more wb had started.
> + */
> + spin_lock_irq(&mapping->tree_lock);
> + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + spin_lock(&sb->s_inode_wblist_lock);
> + if (list_empty(&inode->i_wb_list))
> + list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> + spin_unlock(&sb->s_inode_wblist_lock);
> + } else
> + WARN_ON(!list_empty(&inode->i_wb_list));
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + iput(inode);
> +
> spin_lock(&sb->s_inode_list_lock);
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> }
> +
> + spin_unlock_irq(&sb->s_inode_wblist_lock);
> spin_unlock(&sb->s_inode_list_lock);
> - iput(old_inode);
> +
> mutex_unlock(&sb->s_sync_lock);
> }
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 5bb85a0..c7a9581 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -358,6 +358,7 @@ void inode_init_once(struct inode *inode)
> INIT_HLIST_NODE(&inode->i_hash);
> INIT_LIST_HEAD(&inode->i_devices);
> INIT_LIST_HEAD(&inode->i_io_list);
> + INIT_LIST_HEAD(&inode->i_wb_list);
> INIT_LIST_HEAD(&inode->i_lru);
> address_space_init_once(&inode->i_data);
> i_size_ordered_init(inode);
> diff --git a/fs/super.c b/fs/super.c
> index 954aeb8..86822b8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -206,6 +206,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
> mutex_init(&s->s_sync_lock);
> INIT_LIST_HEAD(&s->s_inodes);
> spin_lock_init(&s->s_inode_list_lock);
> + INIT_LIST_HEAD(&s->s_inodes_wb);
> + spin_lock_init(&s->s_inode_wblist_lock);
>
> if (list_lru_init_memcg(&s->s_dentry_lru))
> goto fail;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ef3cd36..aef01c7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -648,6 +648,7 @@ struct inode {
> #endif
> struct list_head i_lru; /* inode LRU list */
> struct list_head i_sb_list;
> + struct list_head i_wb_list; /* backing dev writeback list */
> union {
> struct hlist_head i_dentry;
> struct rcu_head i_rcu;
> @@ -1374,6 +1375,9 @@ struct super_block {
> /* s_inode_list_lock protects s_inodes */
> spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
> struct list_head s_inodes; /* all inodes */
> +
> + spinlock_t s_inode_wblist_lock;
> + struct list_head s_inodes_wb; /* writeback inodes */
> };
>
> extern struct timespec current_fs_time(struct super_block *sb);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index b333c94..90a380c 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -379,4 +379,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
>
> void account_page_redirty(struct page *page);
>
> +void sb_mark_inode_writeback(struct inode *inode);
> +void sb_clear_inode_writeback(struct inode *inode);
> +
> #endif /* WRITEBACK_H */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d15d88c..dcd4e2b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2735,6 +2735,11 @@ int test_clear_page_writeback(struct page *page)
> __wb_writeout_inc(wb);
> }
> }
> +
> + if (mapping->host && !mapping_tagged(mapping,
> + PAGECACHE_TAG_WRITEBACK))
> + sb_clear_inode_writeback(mapping->host);
> +
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> } else {
> ret = TestClearPageWriteback(page);
> @@ -2763,11 +2768,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> spin_lock_irqsave(&mapping->tree_lock, flags);
> ret = TestSetPageWriteback(page);
> if (!ret) {
> + bool on_wblist;
> +
> + on_wblist = mapping_tagged(mapping,
> + PAGECACHE_TAG_WRITEBACK);
> +
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi))
> __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
> +
> + /*
> + * we can come through here when swapping anonymous
> + * pages, so we don't necessarily have an inode to
> + * track for sync. Inodes are remove lazily, so there is
> + * no equivalent in test_clear_page_writeback().
> + */
> + if (!on_wblist && mapping->host)
> + sb_mark_inode_writeback(mapping->host);
> }
> if (!PageDirty(page))
> radix_tree_tag_clear(&mapping->page_tree,
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2016-01-12 19:24 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-12 18:35 [RFC PATCH v5 0/2] improve sync efficiency with sb inode wb list Brian Foster
2016-01-12 18:35 ` [RFC PATCH v5 1/2] sb: add a new writeback list for sync Brian Foster
2016-01-12 19:24 ` Brian Foster [this message]
2016-01-13 10:43 ` Jan Kara
2016-01-13 14:33 ` Brian Foster
2016-01-15 12:16 ` Jan Kara
2016-01-12 18:35 ` [RFC PATCH v5 2/2] wb: inode writeback list tracking tracepoints Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160112192453.GA31860@bfoster.bfoster \
--to=bfoster@redhat.com \
--cc=dchinner@redhat.com \
--cc=jack@suse.cz \
--cc=jbacik@fb.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).