From: Brian Foster <bfoster@redhat.com>
To: "Darrick J. Wong" <darrick.wong@oracle.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH v2 1/3] xfs: implement per-inode writeback completion queues
Date: Mon, 15 Apr 2019 14:13:53 -0400
Message-ID: <20190415181353.GG4222@bfoster>
In-Reply-To: <20190415161336.GB4752@magnolia>
On Mon, Apr 15, 2019 at 09:13:36AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <darrick.wong@oracle.com>
>
> When scheduling writeback of dirty file data in the page cache, XFS uses
> IO completion workqueue items to ensure that filesystem metadata only
> updates after the write completes successfully. This is essential for
> converting unwritten extents to real extents at the right time and
> performing COW remappings.
>
> Unfortunately, XFS queues each IO completion work item to an unbounded
> workqueue, which means that the kernel can spawn dozens of threads to
> try to handle the items quickly. These threads need to take the ILOCK
> to update file metadata, which leads to heavy ILOCK contention and
> wasted scheduler effort when many of the work items target a single
> file.
>
> Worse yet, the writeback completion threads get stuck waiting for the
> ILOCK while holding transaction reservations, which can use up all
> available log reservation space. When that happens, metadata updates to
> other parts of the filesystem grind to a halt, even though those
> updates could otherwise have proceeded.
>
> Even worse, if one of the stalled threads happens to be in the middle
> of a defer-ops finish, holding the same ILOCK while trying to obtain
> more log reservation after exhausting its permanent reservation, we
> now have an ABBA deadlock: writeback holds a transaction reservation
> and wants the ILOCK, and someone else holds the ILOCK and wants a
> transaction reservation.
>
> Therefore, we create a per-inode writeback io completion queue + work
> item. When writeback finishes, it can add the ioend to the per-inode
> queue and let the single worker item process that queue. This
> dramatically cuts down on the number of kworkers and ILOCK contention in
> the system, and seems to have eliminated an occasional deadlock I was
> seeing while running generic/476.
>
> Testing with a program that simulates a heavy random-write workload to a
> single file demonstrates that the number of kworkers drops from
> approximately 120 threads per file to 1, without dramatically changing
> write bandwidth or pagecache access latency.
>
> Note that we leave the xfs-conv workqueue's max_active alone because we
> still want to be able to run ioend processing for as many inodes as the
> system can handle.
>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> ---
> v2: rename i_iodone_* -> i_ioend_* to avoid naming collision
> ---
Looks good, thanks!
Reviewed-by: Brian Foster <bfoster@redhat.com>
> fs/xfs/xfs_aops.c | 48 +++++++++++++++++++++++++++++++++++++-----------
> fs/xfs/xfs_aops.h | 1 -
> fs/xfs/xfs_icache.c | 3 +++
> fs/xfs/xfs_inode.h | 7 +++++++
> 4 files changed, 47 insertions(+), 12 deletions(-)
>
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 3619e9e8d359..0dd249115f3c 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -234,11 +234,9 @@ xfs_setfilesize_ioend(
> * IO write completion.
> */
> STATIC void
> -xfs_end_io(
> - struct work_struct *work)
> +xfs_end_ioend(
> + struct xfs_ioend *ioend)
> {
> - struct xfs_ioend *ioend =
> - container_of(work, struct xfs_ioend, io_work);
> struct xfs_inode *ip = XFS_I(ioend->io_inode);
> xfs_off_t offset = ioend->io_offset;
> size_t size = ioend->io_size;
> @@ -278,19 +276,48 @@ xfs_end_io(
> xfs_destroy_ioend(ioend, error);
> }
>
> +/* Finish all pending io completions. */
> +void
> +xfs_end_io(
> + struct work_struct *work)
> +{
> + struct xfs_inode *ip;
> + struct xfs_ioend *ioend;
> + struct list_head completion_list;
> + unsigned long flags;
> +
> + ip = container_of(work, struct xfs_inode, i_ioend_work);
> +
> + spin_lock_irqsave(&ip->i_ioend_lock, flags);
> + list_replace_init(&ip->i_ioend_list, &completion_list);
> + spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> +
> + while (!list_empty(&completion_list)) {
> + ioend = list_first_entry(&completion_list, struct xfs_ioend,
> + io_list);
> + list_del_init(&ioend->io_list);
> + xfs_end_ioend(ioend);
> + }
> +}
> +
> STATIC void
> xfs_end_bio(
> struct bio *bio)
> {
> struct xfs_ioend *ioend = bio->bi_private;
> - struct xfs_mount *mp = XFS_I(ioend->io_inode)->i_mount;
> + struct xfs_inode *ip = XFS_I(ioend->io_inode);
> + struct xfs_mount *mp = ip->i_mount;
> + unsigned long flags;
>
> if (ioend->io_fork == XFS_COW_FORK ||
> - ioend->io_state == XFS_EXT_UNWRITTEN)
> - queue_work(mp->m_unwritten_workqueue, &ioend->io_work);
> - else if (ioend->io_append_trans)
> - queue_work(mp->m_data_workqueue, &ioend->io_work);
> - else
> + ioend->io_state == XFS_EXT_UNWRITTEN ||
> + ioend->io_append_trans != NULL) {
> + spin_lock_irqsave(&ip->i_ioend_lock, flags);
> + if (list_empty(&ip->i_ioend_list))
> + queue_work(mp->m_unwritten_workqueue, &ip->i_ioend_work);
> + list_add_tail(&ioend->io_list, &ip->i_ioend_list);
> + spin_unlock_irqrestore(&ip->i_ioend_lock, flags);
> + } else
> xfs_destroy_ioend(ioend, blk_status_to_errno(bio->bi_status));
> }
>
> @@ -594,7 +621,6 @@ xfs_alloc_ioend(
> ioend->io_inode = inode;
> ioend->io_size = 0;
> ioend->io_offset = offset;
> - INIT_WORK(&ioend->io_work, xfs_end_io);
> ioend->io_append_trans = NULL;
> ioend->io_bio = bio;
> return ioend;
> diff --git a/fs/xfs/xfs_aops.h b/fs/xfs/xfs_aops.h
> index 6c2615b83c5d..f62b03186c62 100644
> --- a/fs/xfs/xfs_aops.h
> +++ b/fs/xfs/xfs_aops.h
> @@ -18,7 +18,6 @@ struct xfs_ioend {
> struct inode *io_inode; /* file being written to */
> size_t io_size; /* size of the extent */
> xfs_off_t io_offset; /* offset in the file */
> - struct work_struct io_work; /* xfsdatad work queue */
> struct xfs_trans *io_append_trans;/* xact. for size update */
> struct bio *io_bio; /* bio being built */
> struct bio io_inline_bio; /* MUST BE LAST! */
> diff --git a/fs/xfs/xfs_icache.c b/fs/xfs/xfs_icache.c
> index 245483cc282b..1237b775d7f7 100644
> --- a/fs/xfs/xfs_icache.c
> +++ b/fs/xfs/xfs_icache.c
> @@ -70,6 +70,9 @@ xfs_inode_alloc(
> ip->i_flags = 0;
> ip->i_delayed_blks = 0;
> memset(&ip->i_d, 0, sizeof(ip->i_d));
> + INIT_WORK(&ip->i_ioend_work, xfs_end_io);
> + INIT_LIST_HEAD(&ip->i_ioend_list);
> + spin_lock_init(&ip->i_ioend_lock);
>
> return ip;
> }
> diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
> index e62074a5257c..b9ee3dfc104a 100644
> --- a/fs/xfs/xfs_inode.h
> +++ b/fs/xfs/xfs_inode.h
> @@ -57,6 +57,11 @@ typedef struct xfs_inode {
>
> /* VFS inode */
> struct inode i_vnode; /* embedded VFS inode */
> +
> + /* pending io completions */
> + spinlock_t i_ioend_lock;
> + struct work_struct i_ioend_work;
> + struct list_head i_ioend_list;
> } xfs_inode_t;
>
> /* Convert from vfs inode to xfs inode */
> @@ -503,4 +508,6 @@ bool xfs_inode_verify_forks(struct xfs_inode *ip);
> int xfs_iunlink_init(struct xfs_perag *pag);
> void xfs_iunlink_destroy(struct xfs_perag *pag);
>
> +void xfs_end_io(struct work_struct *work);
> +
> #endif /* __XFS_INODE_H__ */