* [PATCH v2] xfs: replace global xfslogd wq with per-mount xfs-iodone wq
@ 2014-11-11 16:13 Brian Foster
2014-11-11 23:12 ` Dave Chinner
0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2014-11-11 16:13 UTC (permalink / raw)
To: xfs
The xfslogd workqueue is a global, single-job workqueue for buffer ioend
processing. This means we allow for a single work item at a time for all
possible XFS mounts on a system. fsstress testing in loopback XFS over
XFS configurations has reproduced xfslogd deadlocks due to the single
threaded nature of the queue and dependencies introduced between the
separate XFS instances by online discard (-o discard).
Discard over a loopback device converts the discard request to a hole
punch (fallocate) on the underlying file. Online discard requests are
issued synchronously and from xfslogd context in XFS, hence the xfslogd
workqueue is blocked in the upper fs waiting on a hole punch request to
be serviced in the lower fs. If the lower fs issues I/O that depends on
xfslogd to complete, both filesystems end up hung indefinitely. This is
reproduced reliably by generic/013 on XFS->loop->XFS test devices with
the '-o discard' mount option.
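Schematically, the dependency chain is:

    upper fs: buffer ioend work runs on xfslogd
        -> issues a synchronous discard
        -> loop converts the discard to a hole punch (fallocate) on the lower fs
    lower fs: the hole punch issues I/O whose completion also needs xfslogd
        -> xfslogd is still busy in the upper fs, so both filesystems hang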
Further, docker implementations appear to use this kind of configuration
for container instance filesystems by default (container fs->dm->
loop->base fs) and therefore are subject to this deadlock when running
on XFS.
Replace the global xfslogd workqueue with a per-mount variant. This
guarantees each mount access to a single worker and prevents deadlocks
due to inter-fs dependencies introduced by discard. Since the queue is
only responsible for iodone processing at this point in time, rename
xfslogd to xfs-iodone.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
I've left the wq in xfs_mount rather than moved to the buftarg in this
version due to the questions expressed here:
http://oss.sgi.com/archives/xfs/2014-11/msg00117.html
... particularly around the potential creation of multiple (of what is
now) max_active=1 queues per-fs.
Brian
v2:
- Rename xfslogd workqueue to xfs-iodone.
v1: http://oss.sgi.com/archives/xfs/2014-10/msg00539.html
fs/xfs/xfs_buf.c | 13 ++-----------
fs/xfs/xfs_mount.h | 1 +
fs/xfs/xfs_super.c | 11 ++++++++++-
3 files changed, 13 insertions(+), 12 deletions(-)
diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
index 24b4ebe..b7af140 100644
--- a/fs/xfs/xfs_buf.c
+++ b/fs/xfs/xfs_buf.c
@@ -44,8 +44,6 @@
static kmem_zone_t *xfs_buf_zone;
-static struct workqueue_struct *xfslogd_workqueue;
-
#ifdef XFS_BUF_LOCK_TRACKING
# define XB_SET_OWNER(bp) ((bp)->b_last_holder = current->pid)
# define XB_CLEAR_OWNER(bp) ((bp)->b_last_holder = -1)
@@ -1053,7 +1051,8 @@ xfs_buf_ioend_async(
struct xfs_buf *bp)
{
INIT_WORK(&bp->b_iodone_work, xfs_buf_ioend_work);
- queue_work(xfslogd_workqueue, &bp->b_iodone_work);
+ queue_work(bp->b_target->bt_mount->m_iodone_workqueue,
+ &bp->b_iodone_work);
}
void
@@ -1882,15 +1881,8 @@ xfs_buf_init(void)
if (!xfs_buf_zone)
goto out;
- xfslogd_workqueue = alloc_workqueue("xfslogd",
- WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_FREEZABLE, 1);
- if (!xfslogd_workqueue)
- goto out_free_buf_zone;
-
return 0;
- out_free_buf_zone:
- kmem_zone_destroy(xfs_buf_zone);
out:
return -ENOMEM;
}
@@ -1898,6 +1890,5 @@ xfs_buf_init(void)
void
xfs_buf_terminate(void)
{
- destroy_workqueue(xfslogd_workqueue);
kmem_zone_destroy(xfs_buf_zone);
}
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index b0447c8..6de1283 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -168,6 +168,7 @@ typedef struct xfs_mount {
/* low free space thresholds */
struct xfs_kobj m_kobj;
+ struct workqueue_struct *m_iodone_workqueue;
struct workqueue_struct *m_data_workqueue;
struct workqueue_struct *m_unwritten_workqueue;
struct workqueue_struct *m_cil_workqueue;
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 9f622fe..ef55264 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -842,10 +842,16 @@ STATIC int
xfs_init_mount_workqueues(
struct xfs_mount *mp)
{
+ mp->m_iodone_workqueue = alloc_workqueue("xfs-iodone/%s",
+ WQ_MEM_RECLAIM|WQ_HIGHPRI|WQ_FREEZABLE, 1,
+ mp->m_fsname);
+ if (!mp->m_iodone_workqueue)
+ goto out;
+
mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
if (!mp->m_data_workqueue)
- goto out;
+ goto out_destroy_iodone;
mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
@@ -884,6 +890,8 @@ out_destroy_unwritten:
destroy_workqueue(mp->m_unwritten_workqueue);
out_destroy_data_iodone_queue:
destroy_workqueue(mp->m_data_workqueue);
+out_destroy_iodone:
+ destroy_workqueue(mp->m_iodone_workqueue);
out:
return -ENOMEM;
}
@@ -898,6 +906,7 @@ xfs_destroy_mount_workqueues(
destroy_workqueue(mp->m_cil_workqueue);
destroy_workqueue(mp->m_data_workqueue);
destroy_workqueue(mp->m_unwritten_workqueue);
+ destroy_workqueue(mp->m_iodone_workqueue);
}
/*
--
1.8.3.1
* Re: [PATCH v2] xfs: replace global xfslogd wq with per-mount xfs-iodone wq
2014-11-11 16:13 [PATCH v2] xfs: replace global xfslogd wq with per-mount xfs-iodone wq Brian Foster
@ 2014-11-11 23:12 ` Dave Chinner
2014-11-12 19:37 ` Brian Foster
0 siblings, 1 reply; 4+ messages in thread
From: Dave Chinner @ 2014-11-11 23:12 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Tue, Nov 11, 2014 at 11:13:13AM -0500, Brian Foster wrote:
> The xfslogd workqueue is a global, single-job workqueue for buffer ioend
> processing. This means we allow for a single work item at a time for all
> possible XFS mounts on a system. fsstress testing in loopback XFS over
> XFS configurations has reproduced xfslogd deadlocks due to the single
> threaded nature of the queue and dependencies introduced between the
> separate XFS instances by online discard (-o discard).
>
> Discard over a loopback device converts the discard request to a hole
> punch (fallocate) on the underlying file. Online discard requests are
> issued synchronously and from xfslogd context in XFS, hence the xfslogd
> workqueue is blocked in the upper fs waiting on a hole punch request to
> be serviced in the lower fs. If the lower fs issues I/O that depends on
> xfslogd to complete, both filesystems end up hung indefinitely. This is
> reproduced reliably by generic/013 on XFS->loop->XFS test devices with
> the '-o discard' mount option.
>
> Further, docker implementations appear to use this kind of configuration
> for container instance filesystems by default (container fs->dm->
> loop->base fs) and therefore are subject to this deadlock when running
> on XFS.
>
> Replace the global xfslogd workqueue with a per-mount variant. This
> guarantees each mount access to a single worker and prevents deadlocks
> due to inter-fs dependencies introduced by discard. Since the queue is
> only responsible for iodone processing at this point in time, rename
> xfslogd to xfs-iodone.
>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>
> I've left the wq in xfs_mount rather than moved to the buftarg in this
> version due to the questions expressed here:
>
> http://oss.sgi.com/archives/xfs/2014-11/msg00117.html
<sigh>
Another email from you that hasn't reached my inbox. That's two in a
week now, I think.
> ... particularly around the potential creation of multiple (of what is
> now) max_active=1 queues per-fs.
So concern #1 is that it splits log buffer versus metadata buffer
processing to different work queues causing concurrent processing.
I see no problem there - the io completions have different iodone
processing functions that mostly don't intersect. As it is, the
"max-active=1" means 1 work item being processed per CPU, not "only
one queue" (you need to use alloc_ordered_workqueue() to only
get one queue). So there is already concurrency in the processing of
io completions and hence I don't see any problem with separating
the log iodone completions from the metadata iodone completions.
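For illustration, a minimal sketch of that distinction using the stock
workqueue API (all names here are made up for the example, not from the
patch):

	#include <linux/init.h>
	#include <linux/workqueue.h>

	/*
	 * max_active = 1: at most one work item executes at a time per
	 * CPU, so items queued from different CPUs can still run
	 * concurrently.
	 */
	struct workqueue_struct *limited_wq;

	/*
	 * An ordered workqueue executes at most one work item at a time
	 * system-wide, in queueing order -- the "only one queue" case.
	 */
	struct workqueue_struct *ordered_wq;

	static int __init wq_example_init(void)
	{
		limited_wq = alloc_workqueue("example-limited",
					     WQ_MEM_RECLAIM, 1);
		ordered_wq = alloc_ordered_workqueue("example-ordered",
						     WQ_MEM_RECLAIM);
		if (!limited_wq || !ordered_wq)
			return -ENOMEM;
		return 0;
	}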
Further, it might be worth considering pushing the log buffer
completions to the m_log_workqueue rather than using a buftarg based
workqueue so that they are always separated from the rest of the
metadata completions regardless of whether we have an internal or
external log device.
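Roughly, if completions were routed that way, xfs_buf_ioend_async() might
look like the sketch below; bp_is_log_buffer() is a hypothetical stand-in
for whatever test (e.g. a flag set when log buffers are allocated) would
actually identify a log buffer:

	/* Sketch only -- bp_is_log_buffer() is hypothetical. */
	static void
	xfs_buf_ioend_async(
		struct xfs_buf		*bp)
	{
		struct xfs_mount	*mp = bp->b_target->bt_mount;

		INIT_WORK(&bp->b_iodone_work, xfs_buf_ioend_work);

		if (bp_is_log_buffer(bp))
			queue_work(mp->m_log_workqueue, &bp->b_iodone_work);
		else
			queue_work(mp->m_iodone_workqueue, &bp->b_iodone_work);
	}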
The xfslogd workqueue is tagged with WQ_HIGHPRI only to expedite the
log buffer io completions over XFS data io completions that may get
stuck waiting on log forces. i.e. the xfslogd_workqueue needs
higher priority than m_data_workqueue and m_unwritten_workqueue as
they can require log forces to complete their work. Hence if we
separate out the log buffer io completion processing from the
metadata IO completion processing we don't need to process all the
metadata buffer IO completions as high priority work anymore.
Concern #2 is about the reason for max_active=1 and being unclear as
to why we only want a single completion active at a time on a CPU.
The reason for this is that most metadata and log buffer IO
completion work does not sleep - they
only ever take spinlocks and so there are no built in schedule
points during work processing. Hence it is rare to need a second
worker thread to process the queue because the first is blocked
on a sleeping lock and so max-active=1 makes sense. In comparison,
the data/unwritten io completion processing is very
different due to needing to take sleeping inode locks, buffer locks,
etc., and hence they use the wq default for max_active (512).
So, really, a log workqueue with WQ_HIGHPRI, max_active = 1 and a
buffer IO completion workqueue with just max_active = 1 would
probably be fine.
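In xfs_init_mount_workqueues() that scheme might look roughly like the
following (a sketch only: the m_buf_workqueue name anticipates the
suggestion below, the flags on the existing xfs-log queue are an
assumption, and the error labels are illustrative):

	/* Log buffer IO completion: high priority, one active item. */
	mp->m_log_workqueue = alloc_workqueue("xfs-log/%s",
			WQ_MEM_RECLAIM | WQ_HIGHPRI | WQ_FREEZABLE, 1,
			mp->m_fsname);
	if (!mp->m_log_workqueue)
		goto out;

	/* Metadata buffer IO completion: no high priority needed. */
	mp->m_buf_workqueue = alloc_workqueue("xfs-buf/%s",
			WQ_MEM_RECLAIM | WQ_FREEZABLE, 1,
			mp->m_fsname);
	if (!mp->m_buf_workqueue)
		goto out_destroy_log;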
....
> diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> index 9f622fe..ef55264 100644
> --- a/fs/xfs/xfs_super.c
> +++ b/fs/xfs/xfs_super.c
> @@ -842,10 +842,16 @@ STATIC int
> xfs_init_mount_workqueues(
> struct xfs_mount *mp)
> {
> + mp->m_iodone_workqueue = alloc_workqueue("xfs-iodone/%s",
> + WQ_MEM_RECLAIM|WQ_HIGHPRI|WQ_FREEZABLE, 1,
> + mp->m_fsname);
> + if (!mp->m_iodone_workqueue)
> + goto out;
m_buf_workqueue would be better, because...
> +
> mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
That's also an "iodone" workqueue for data IO, and ...
> WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
> if (!mp->m_data_workqueue)
> - goto out;
> + goto out_destroy_iodone;
>
> mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
That's another "iodone" workqueue for data IO that needs unwritten
extent conversion....
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH v2] xfs: replace global xfslogd wq with per-mount xfs-iodone wq
2014-11-11 23:12 ` Dave Chinner
@ 2014-11-12 19:37 ` Brian Foster
2014-11-12 21:38 ` Dave Chinner
0 siblings, 1 reply; 4+ messages in thread
From: Brian Foster @ 2014-11-12 19:37 UTC (permalink / raw)
To: Dave Chinner; +Cc: xfs
On Wed, Nov 12, 2014 at 10:12:46AM +1100, Dave Chinner wrote:
> On Tue, Nov 11, 2014 at 11:13:13AM -0500, Brian Foster wrote:
> > The xfslogd workqueue is a global, single-job workqueue for buffer ioend
> > processing. This means we allow for a single work item at a time for all
> > possible XFS mounts on a system. fsstress testing in loopback XFS over
> > XFS configurations has reproduced xfslogd deadlocks due to the single
> > threaded nature of the queue and dependencies introduced between the
> > separate XFS instances by online discard (-o discard).
> >
> > Discard over a loopback device converts the discard request to a hole
> > punch (fallocate) on the underlying file. Online discard requests are
> > issued synchronously and from xfslogd context in XFS, hence the xfslogd
> > workqueue is blocked in the upper fs waiting on a hole punch request to
> > be serviced in the lower fs. If the lower fs issues I/O that depends on
> > xfslogd to complete, both filesystems end up hung indefinitely. This is
> > reproduced reliably by generic/013 on XFS->loop->XFS test devices with
> > the '-o discard' mount option.
> >
> > Further, docker implementations appear to use this kind of configuration
> > for container instance filesystems by default (container fs->dm->
> > loop->base fs) and therefore are subject to this deadlock when running
> > on XFS.
> >
> > Replace the global xfslogd workqueue with a per-mount variant. This
> > guarantees each mount access to a single worker and prevents deadlocks
> > due to inter-fs dependencies introduced by discard. Since the queue is
> > only responsible for iodone processing at this point in time, rename
> > xfslogd to xfs-iodone.
> >
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >
> > I've left the wq in xfs_mount rather than moved to the buftarg in this
> > version due to the questions expressed here:
> >
> > http://oss.sgi.com/archives/xfs/2014-11/msg00117.html
>
> <sigh>
>
> Another email from you that hasn't reached my inbox. That's two in a
> week now, I think.
>
> > ... particularly around the potential creation of multiple (of what is
> > now) max_active=1 queues per-fs.
>
> So concern #1 is that it splits log buffer versus metadata buffer
> processing to different work queues causing concurrent processing.
>
> I see no problem there - the io completions have different iodone
> processing functions that mostly don't intersect. As it is, the
> "max-active=1" means 1 work item being processed per CPU, not "only
> one queue" (you need to use alloc_ordered_workqueue() to only
> get one queue). So there is already concurrency in the processing of
> io completions and hence I don't see any problem with separating
> the log iodone completions from the metadata iodone completions.
>
Ah, I see. I misinterpreted the meaning of max_active. It's not a pure
concurrency limiter as such... more a limit on how many work items can be
active on a particular CPU. The workqueue.txt doc explains it. That
addresses the concern that this change introduces new concurrency.
> Further, it might be work considering pushing the log buffer
> completions to the m_log_workqueue rather than using a buftarg based
> workqueue so that they are always separated from the rest of the
> metadata completions regardless of whether we have an internal or
> external log device.
>
Seems reasonable at first glance. I do like the side effect of
consistent behavior. I'll look into it.
> The xfslogd workqueue is tagged with WQ_HIGHPRI only to expedite the
> log buffer io completions over XFS data io completions that may get
> stuck waiting on log forces. i.e. the xfslogd_workqueue needs
> higher priority than m_data_workqueue and m_unwritten_workqueue as
> they can require log forces to complete their work. Hence if we
> separate out the log buffer io completion processing from the
> metadata IO completion processing we don't need to process all the
> metadata buffer IO completion as high priority work anymore.
>
Ok, thanks. I didn't notice an explicit relationship between either of
those queues and xfslogd. Is the dependency implicit in that those
queues do transaction reservations, and thus can push on the log via the
AIL (and if so, why wouldn't the CIL queue be higher priority as well)?
> Concern #2 is about the reason for max_active=1 and being unclear as
> to why we only want a single completion active at a time on a CPU.
> The reason for this is that most metadata and log buffer IO
> completion work does not sleep - they
> only ever take spinlocks and so there are no built in schedule
> points during work processing. Hence it is rare to need a second
> worker thread to process the queue because the first is blocked
> on a sleeping lock and so max-active=1 makes sense. In comparison,
> the data/unwritten io completion processing is very
> different due to needing to take sleeping inode locks, buffer locks,
> etc) and hence they use the wq default for max active (512).
>
Ok, max_active sounds more like a hint to the workqueue infrastructure
in this usage. E.g., there's no hard rule against activation of more
than one item, it's just of questionable value.
> So, really, a log workqueue with WQ_HIGHPRI, max_active = 1 and a
> buffer IO completion workqueue with just max_active = 1 would
> probably be fine.
>
Thanks for the explanation.
Brian
> ....
>
> > diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
> > index 9f622fe..ef55264 100644
> > --- a/fs/xfs/xfs_super.c
> > +++ b/fs/xfs/xfs_super.c
> > @@ -842,10 +842,16 @@ STATIC int
> > xfs_init_mount_workqueues(
> > struct xfs_mount *mp)
> > {
> > + mp->m_iodone_workqueue = alloc_workqueue("xfs-iodone/%s",
> > + WQ_MEM_RECLAIM|WQ_HIGHPRI|WQ_FREEZABLE, 1,
> > + mp->m_fsname);
> > + if (!mp->m_iodone_workqueue)
> > + goto out;
>
> m_buf_workqueue would be better, because...
>
> > +
> > mp->m_data_workqueue = alloc_workqueue("xfs-data/%s",
>
> That's also an "iodone" workqueue for data IO, and ...
>
> > WQ_MEM_RECLAIM|WQ_FREEZABLE, 0, mp->m_fsname);
> > if (!mp->m_data_workqueue)
> > - goto out;
> > + goto out_destroy_iodone;
> >
> > mp->m_unwritten_workqueue = alloc_workqueue("xfs-conv/%s",
>
> That's another "iodone" workqueue for data IO that needs unwritten
> extent conversion....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
* Re: [PATCH v2] xfs: replace global xfslogd wq with per-mount xfs-iodone wq
2014-11-12 19:37 ` Brian Foster
@ 2014-11-12 21:38 ` Dave Chinner
0 siblings, 0 replies; 4+ messages in thread
From: Dave Chinner @ 2014-11-12 21:38 UTC (permalink / raw)
To: Brian Foster; +Cc: xfs
On Wed, Nov 12, 2014 at 02:37:51PM -0500, Brian Foster wrote:
> On Wed, Nov 12, 2014 at 10:12:46AM +1100, Dave Chinner wrote:
> > On Tue, Nov 11, 2014 at 11:13:13AM -0500, Brian Foster wrote:
> > > The xfslogd workqueue is a global, single-job workqueue for buffer ioend
> > > processing. This means we allow for a single work item at a time for all
> > > possible XFS mounts on a system. fsstress testing in loopback XFS over
> > > XFS configurations has reproduced xfslogd deadlocks due to the single
> > > threaded nature of the queue and dependencies introduced between the
> > > separate XFS instances by online discard (-o discard).
....
> > > I've left the wq in xfs_mount rather than moved to the buftarg in this
> > > version due to the questions expressed here:
> > >
> > > http://oss.sgi.com/archives/xfs/2014-11/msg00117.html
> >
> > <sigh>
> >
> > Another email from you that hasn't reached my inbox. That's two in a
> > week now, I think.
> >
> > > ... particularly around the potential creation of multiple (of what is
> > > now) max_active=1 queues per-fs.
> >
> > So concern #1 is that it splits log buffer versus metadata buffer
> > processing to different work queues causing concurrent processing.
...
> > The xfslogd workqueue is tagged with WQ_HIGHPRI only to expedite the
> > log buffer io completions over XFS data io completions that may get
> > stuck waiting on log forces. i.e. the xfslogd_workqueue needs
> > higher priority than m_data_workqueue and m_unwritten_workqueue as
> > they can require log forces to complete their work. Hence if we
> > separate out the log buffer io completion processing from the
> > metadata IO completion processing we don't need to process all the
> > metadata buffer IO completion as high priority work anymore.
> >
>
> Ok, thanks. I didn't notice an explicit relationship between either of
> those queues and xfslogd. Is the dependency implicit in that those
> queues do transaction reservations, and thus can push on the log via the
> AIL (and if so, why wouldn't the cil queue be higher priority as well)?
IIRC, the main dependency problem we found had to do with data IO
completion on a loop device getting stuck waiting on log IO completion
on the backing device, which was stuck in a dispatch queue behind
more blocked completions on the loop device. Using WQ_HIGHPRI meant
they didn't get stuck in dispatch queues behind other queued work -
they got dispatched immediately....
> > Concern #2 is about the reason for max_active=1 and being
> > unclear as to why we only want a single completion active at a
> > time on a CPU. The reason for this is that most metadata and
> > log buffer IO completion work does not sleep - they only ever
> > take spinlocks and so there are no built in schedule points
> > during work processing. Hence it is rare to need a second worker
> > thread to process the queue because the first is blocked on a
> > sleeping lock and so max-active=1 makes sense. In comparison,
> > the data/unwritten io completion processing is very different
> > due to needing to take sleeping inode locks, buffer locks, etc)
> > and hence they use the wq default for max active (512).
> >
>
> Ok, max_active sounds more like a hint to the workqueue
> infrastructure in this usage. E.g., there's no hard rule against
> activation of more than one item, it's just of questionable value.
Right. ISTR that there were worse lock contention problems on the
AIL and iclog locks when more concurrency was introduced, so it was
just kept down to the minimum required.
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com