From: Brian Foster <bfoster@redhat.com>
To: linux-fsdevel@vger.kernel.org
Cc: dchinner@redhat.com, jbacik@fb.com, jack@suse.cz
Subject: Re: [RFC PATCH v5 1/2] sb: add a new writeback list for sync
Date: Tue, 12 Jan 2016 14:24:53 -0500 [thread overview]
Message-ID: <20160112192453.GA31860@bfoster.bfoster> (raw)
In-Reply-To: <1452623739-32542-2-git-send-email-bfoster@redhat.com>
On Tue, Jan 12, 2016 at 01:35:38PM -0500, Brian Foster wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> wait_sb_inodes() currently does a walk of all inodes in the
> filesystem to find dirty one to wait on during sync. This is highly
> inefficient and wastes a lot of CPU when there are lots of clean
> cached inodes that we don't need to wait on.
>
> To avoid this "all inode" walk, we need to track inodes that are
> currently under writeback that we need to wait for. We do this by
> adding inodes to a writeback list on the sb when the mapping is
> first tagged as having pages under writeback. wait_sb_inodes() can
> then walk this list of "inodes under IO" and wait specifically just
> for the inodes that the current sync(2) needs to wait for.
>
> Define a couple helpers to add/remove an inode from the writeback
> list and call them when the overall mapping is tagged for or cleared
> from writeback. Update wait_sb_inodes() to walk only the inodes
> under writeback due to the sync.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> Signed-off-by: Josef Bacik <jbacik@fb.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
> fs/block_dev.c | 1 +
> fs/fs-writeback.c | 139 +++++++++++++++++++++++++++++++++++++---------
> fs/inode.c | 1 +
> fs/super.c | 2 +
> include/linux/fs.h | 4 ++
> include/linux/writeback.h | 3 +
> mm/page-writeback.c | 19 +++++++
> 7 files changed, 144 insertions(+), 25 deletions(-)
>
...
> diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
> index 023f6a1..6821347 100644
> --- a/fs/fs-writeback.c
> +++ b/fs/fs-writeback.c
...
> @@ -1118,13 +1149,28 @@ static void __inode_wait_for_writeback(struct inode *inode)
> }
>
> /*
> - * Wait for writeback on an inode to complete. Caller must have inode pinned.
> + * Wait for writeback on an inode to complete during eviction. Caller must have
> + * inode pinned.
> */
> void inode_wait_for_writeback(struct inode *inode)
> {
> + BUG_ON(!(inode->i_state & I_FREEING));
> +
> spin_lock(&inode->i_lock);
> __inode_wait_for_writeback(inode);
> spin_unlock(&inode->i_lock);
> +
> + /*
> + * For a bd_inode when we do inode_to_bdi we'll want to get the bdev for
> + * the inode and then deref bdev->bd_disk, which at this point has been
> + * set to NULL, so we would panic. At the point we are dropping our
> + * bd_inode we won't have any pages under writeback on the device so
> + * this is safe. But just in case we'll assert to make sure we don't
> + * screw this up.
> + */
While the comment above is no longer relevant now that this doesn't
touch the bdi, I'm not totally sure whether the following few lines are
still necessary. I think this was associated with the lazy list removal
(to remove the inode on eviction), but I'd have to dig into that a bit
more to be sure...
Brian
> + if (!sb_is_blkdev_sb(inode->i_sb))
> + sb_clear_inode_writeback(inode);
> + BUG_ON(!list_empty(&inode->i_wb_list));
> }
>
> /*
> @@ -2108,7 +2154,7 @@ EXPORT_SYMBOL(__mark_inode_dirty);
> */
> static void wait_sb_inodes(struct super_block *sb)
> {
> - struct inode *inode, *old_inode = NULL;
> + LIST_HEAD(sync_list);
>
> /*
> * We need to be protected against the filesystem going from
> @@ -2116,23 +2162,56 @@ static void wait_sb_inodes(struct super_block *sb)
> */
> WARN_ON(!rwsem_is_locked(&sb->s_umount));
>
> + /*
> + * Data integrity sync. Must wait for all pages under writeback, because
> + * there may have been pages dirtied before our sync call, but which had
> + * writeout started before we write it out. In which case, the inode
> + * may not be on the dirty list, but we still have to wait for that
> + * writeout.
> + *
> + * To avoid syncing inodes put under IO after we have started here,
> + * splice the io list to a temporary list head and walk that. Newly
> + * dirtied inodes will go onto the primary list so we won't wait for
> + * them. This is safe to do as we are serialised by the s_sync_lock,
> + * so we'll complete processing the complete list before the next
> + * sync operation repeats the splice-and-walk process.
> + *
> + * s_inode_wblist_lock protects the wb list and is irq-safe as it is
> + * acquired inside of the mapping lock by __test_set_page_writeback().
> + * We cannot acquire i_lock while the wblist lock is held without
> + * introducing irq inversion issues. Since s_inodes_wb is a subset of
> + * s_inodes, use s_inode_list_lock to prevent inodes from disappearing
> + * until we have a reference. Note that s_inode_wblist_lock protects the
> + * local sync_list as well because inodes can be dropped from either
> + * list by writeback completion.
> + */
> mutex_lock(&sb->s_sync_lock);
> +
> spin_lock(&sb->s_inode_list_lock);
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> + list_splice_init(&sb->s_inodes_wb, &sync_list);
>
> - /*
> - * Data integrity sync. Must wait for all pages under writeback,
> - * because there may have been pages dirtied before our sync
> - * call, but which had writeout started before we write it out.
> - * In which case, the inode may not be on the dirty list, but
> - * we still have to wait for that writeout.
> - */
> - list_for_each_entry(inode, &sb->s_inodes, i_sb_list) {
> + while (!list_empty(&sync_list)) {
> + struct inode *inode = list_first_entry(&sync_list, struct inode,
> + i_wb_list);
> struct address_space *mapping = inode->i_mapping;
>
> + list_del_init(&inode->i_wb_list);
> +
> + /*
> + * The mapping can be cleaned before we wait on the inode since
> + * we do not have the mapping lock.
> + */
> + if (!mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK))
> + continue;
> +
> + spin_unlock_irq(&sb->s_inode_wblist_lock);
> +
> spin_lock(&inode->i_lock);
> - if ((inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) ||
> - (mapping->nrpages == 0)) {
> + if (inode->i_state & (I_FREEING|I_WILL_FREE|I_NEW)) {
> spin_unlock(&inode->i_lock);
> +
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> continue;
> }
> __iget(inode);
> @@ -2140,29 +2219,39 @@ static void wait_sb_inodes(struct super_block *sb)
> spin_unlock(&sb->s_inode_list_lock);
>
> /*
> - * We hold a reference to 'inode' so it couldn't have been
> - * removed from s_inodes list while we dropped the
> - * s_inode_list_lock. We cannot iput the inode now as we can
> - * be holding the last reference and we cannot iput it under
> - * s_inode_list_lock. So we keep the reference and iput it
> - * later.
> - */
> - iput(old_inode);
> - old_inode = inode;
> -
> - /*
> * We keep the error status of individual mapping so that
> * applications can catch the writeback error using fsync(2).
> * See filemap_fdatawait_keep_errors() for details.
> */
> filemap_fdatawait_keep_errors(mapping);
> -
> cond_resched();
>
> + /*
> + * The inode has been written back. Check whether we still have
> + * pages under I/O and move the inode back to the primary list
> + * if necessary. sb_mark_inode_writeback() might not have done
> + * anything if the writeback tag hadn't been cleared from the
> + * mapping by the time more wb had started.
> + */
> + spin_lock_irq(&mapping->tree_lock);
> + if (mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK)) {
> + spin_lock(&sb->s_inode_wblist_lock);
> + if (list_empty(&inode->i_wb_list))
> + list_add_tail(&inode->i_wb_list, &sb->s_inodes_wb);
> + spin_unlock(&sb->s_inode_wblist_lock);
> + } else
> + WARN_ON(!list_empty(&inode->i_wb_list));
> + spin_unlock_irq(&mapping->tree_lock);
> +
> + iput(inode);
> +
> spin_lock(&sb->s_inode_list_lock);
> + spin_lock_irq(&sb->s_inode_wblist_lock);
> }
> +
> + spin_unlock_irq(&sb->s_inode_wblist_lock);
> spin_unlock(&sb->s_inode_list_lock);
> - iput(old_inode);
> +
> mutex_unlock(&sb->s_sync_lock);
> }
>
> diff --git a/fs/inode.c b/fs/inode.c
> index 5bb85a0..c7a9581 100644
> --- a/fs/inode.c
> +++ b/fs/inode.c
> @@ -358,6 +358,7 @@ void inode_init_once(struct inode *inode)
> INIT_HLIST_NODE(&inode->i_hash);
> INIT_LIST_HEAD(&inode->i_devices);
> INIT_LIST_HEAD(&inode->i_io_list);
> + INIT_LIST_HEAD(&inode->i_wb_list);
> INIT_LIST_HEAD(&inode->i_lru);
> address_space_init_once(&inode->i_data);
> i_size_ordered_init(inode);
> diff --git a/fs/super.c b/fs/super.c
> index 954aeb8..86822b8 100644
> --- a/fs/super.c
> +++ b/fs/super.c
> @@ -206,6 +206,8 @@ static struct super_block *alloc_super(struct file_system_type *type, int flags)
> mutex_init(&s->s_sync_lock);
> INIT_LIST_HEAD(&s->s_inodes);
> spin_lock_init(&s->s_inode_list_lock);
> + INIT_LIST_HEAD(&s->s_inodes_wb);
> + spin_lock_init(&s->s_inode_wblist_lock);
>
> if (list_lru_init_memcg(&s->s_dentry_lru))
> goto fail;
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index ef3cd36..aef01c7 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -648,6 +648,7 @@ struct inode {
> #endif
> struct list_head i_lru; /* inode LRU list */
> struct list_head i_sb_list;
> + struct list_head i_wb_list; /* backing dev writeback list */
> union {
> struct hlist_head i_dentry;
> struct rcu_head i_rcu;
> @@ -1374,6 +1375,9 @@ struct super_block {
> /* s_inode_list_lock protects s_inodes */
> spinlock_t s_inode_list_lock ____cacheline_aligned_in_smp;
> struct list_head s_inodes; /* all inodes */
> +
> + spinlock_t s_inode_wblist_lock;
> + struct list_head s_inodes_wb; /* writeback inodes */
> };
>
> extern struct timespec current_fs_time(struct super_block *sb);
> diff --git a/include/linux/writeback.h b/include/linux/writeback.h
> index b333c94..90a380c 100644
> --- a/include/linux/writeback.h
> +++ b/include/linux/writeback.h
> @@ -379,4 +379,7 @@ void tag_pages_for_writeback(struct address_space *mapping,
>
> void account_page_redirty(struct page *page);
>
> +void sb_mark_inode_writeback(struct inode *inode);
> +void sb_clear_inode_writeback(struct inode *inode);
> +
> #endif /* WRITEBACK_H */
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index d15d88c..dcd4e2b 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2735,6 +2735,11 @@ int test_clear_page_writeback(struct page *page)
> __wb_writeout_inc(wb);
> }
> }
> +
> + if (mapping->host && !mapping_tagged(mapping,
> + PAGECACHE_TAG_WRITEBACK))
> + sb_clear_inode_writeback(mapping->host);
> +
> spin_unlock_irqrestore(&mapping->tree_lock, flags);
> } else {
> ret = TestClearPageWriteback(page);
> @@ -2763,11 +2768,25 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> spin_lock_irqsave(&mapping->tree_lock, flags);
> ret = TestSetPageWriteback(page);
> if (!ret) {
> + bool on_wblist;
> +
> + on_wblist = mapping_tagged(mapping,
> + PAGECACHE_TAG_WRITEBACK);
> +
> radix_tree_tag_set(&mapping->page_tree,
> page_index(page),
> PAGECACHE_TAG_WRITEBACK);
> if (bdi_cap_account_writeback(bdi))
> __inc_wb_stat(inode_to_wb(inode), WB_WRITEBACK);
> +
> + /*
> + * we can come through here when swapping anonymous
> + * pages, so we don't necessarily have an inode to
> + * track for sync. Inodes are remove lazily, so there is
> + * no equivalent in test_clear_page_writeback().
> + */
> + if (!on_wblist && mapping->host)
> + sb_mark_inode_writeback(mapping->host);
> }
> if (!PageDirty(page))
> radix_tree_tag_clear(&mapping->page_tree,
> --
> 2.4.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
next prev parent reply other threads:[~2016-01-12 19:24 UTC|newest]
Thread overview: 7+ messages / expand[flat|nested] mbox.gz Atom feed top
2016-01-12 18:35 [RFC PATCH v5 0/2] improve sync efficiency with sb inode wb list Brian Foster
2016-01-12 18:35 ` [RFC PATCH v5 1/2] sb: add a new writeback list for sync Brian Foster
2016-01-12 19:24 ` Brian Foster [this message]
2016-01-13 10:43 ` Jan Kara
2016-01-13 14:33 ` Brian Foster
2016-01-15 12:16 ` Jan Kara
2016-01-12 18:35 ` [RFC PATCH v5 2/2] wb: inode writeback list tracking tracepoints Brian Foster
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20160112192453.GA31860@bfoster.bfoster \
--to=bfoster@redhat.com \
--cc=dchinner@redhat.com \
--cc=jack@suse.cz \
--cc=jbacik@fb.com \
--cc=linux-fsdevel@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.