[PATCH RFC 0/3] xfs: rework quotaoff to avoid log deadlock

public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH RFC 0/3] xfs: rework quotaoff to avoid log deadlock
@ 2020-09-29 14:12 Brian Foster
  2020-09-29 14:12 ` [PATCH RFC 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Brian Foster @ 2020-09-29 14:12 UTC (permalink / raw)
  To: linux-xfs

Hi all,

Here's an RFC for the quotaoff rework (based on Dave's suggestion) to
incorporate a transaction subsystem quiesce to provide dquot log
ordering guarantees without creating a log deadlock vector. This is RFC
mainly due to patch 2, which was a quick hack to freeze the transaction
subsystem because I wanted to focus on the core approach/algorithm
first. If the general approach is acceptable, I'll go back and implement
something that doesn't abuse an external mechanism for transaction
freeze (perhaps just using a similar, internal percpu rwsem). Patch 1 is
a dependent bug fix to avoid logging dquots for inactive quota modes and
patch 3 reworks quotaoff as described. Thoughts, reviews, flames
appreciated.

Brian

Brian Foster (3):
  xfs: skip dquot reservations if quota is inactive
  xfs: temporary transaction subsystem freeze hack
  xfs: rework quotaoff logging to avoid log deadlock on active fs

 fs/xfs/xfs_qm_syscalls.c | 36 ++++++++++++++++--------------------
 fs/xfs/xfs_trans.c       | 16 ++++++++++++++++
 fs/xfs/xfs_trans.h       |  2 ++
 fs/xfs/xfs_trans_dquot.c | 22 +++++++++++-----------
 4 files changed, 45 insertions(+), 31 deletions(-)

-- 
2.25.4

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH RFC 1/3] xfs: skip dquot reservations if quota is inactive
  2020-09-29 14:12 [PATCH RFC 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
@ 2020-09-29 14:12 ` Brian Foster
  2020-09-29 14:12 ` [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack Brian Foster
  2020-09-29 14:12 ` [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
  2 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2020-09-29 14:12 UTC (permalink / raw)
  To: linux-xfs

The dquot reservation helper currently performs the associated
reservation for any provided dquots. The dquots could have been
acquired from inode references or explicit dquot allocation
requests. Some reservation callers may have already checked that the
associated quota subsystem is active (xfs_qm_dqget() returns an
error otherwise), while others might not have checked at all
(xfs_trans_reserve_quota_nblks() passes the inode references).
Further, subsequent dquot modifications do actually check that the
associated quota is active before making transactional changes
(xfs_trans_mod_dquot_byino()).

Given all of that, the behavior to unconditionally perform
reservation on any provided dquots is somewhat ad hoc. While it is
currently harmless, it is not without side effect. If the quota is
inactive by the time a transaction attempts a quota reservation, the
dquot will be attached to the transaction and subsequently logged,
even though no dquot modifications are ultimately made.

This is a problem for upcoming quotaoff changes that intend to
implement a strict transactional barrier for logging dquots during a
quotaoff operation. If a dquot is logged after the subsystem
deactivated and the barrier released, a subsequent log recovery can
incorrectly replay dquot changes into the filesystem.

Therefore, update the dquot reservation path to also check that a
particular quota mode is active before associating a dquot with a
transaction. This should have no noticeable impact on the current
code that already accommodates checking active quota state at points
before and after quota reservations are made.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans_dquot.c | 20 +++++++++-----------
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 133fc6fc3edd..547ba824542e 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -39,14 +39,12 @@ xfs_trans_dqjoin(
 }

 /*
- * This is called to mark the dquot as needing
- * to be logged when the transaction is committed.  The dquot must
- * already be associated with the given transaction.
- * Note that it marks the entire transaction as dirty. In the ordinary
- * case, this gets called via xfs_trans_commit, after the transaction
- * is already dirty. However, there's nothing stop this from getting
- * called directly, as done by xfs_qm_scall_setqlim. Hence, the TRANS_DIRTY
- * flag.
+ * This is called to mark the dquot as needing to be logged when the transaction
+ * is committed. The dquot must already be associated with the given
+ * transaction. Note that it marks the entire transaction as dirty. In the
+ * ordinary case, this gets called via xfs_trans_commit, after the transaction
+ * is already dirty. However, there's nothing stop this from getting called
+ * directly, as done by xfs_qm_scall_setqlim. Hence, the TRANS_DIRTY flag.
  */
 void
 xfs_trans_log_dquot(
@@ -770,19 +768,19 @@ xfs_trans_reserve_quota_bydquots(

 	ASSERT(flags & XFS_QMOPT_RESBLK_MASK);

-	if (udqp) {
+	if (XFS_IS_UQUOTA_ON(mp) && udqp) {
 		error = xfs_trans_dqresv(tp, mp, udqp, nblks, ninos, flags);
 		if (error)
 			return error;
 	}

-	if (gdqp) {
+	if (XFS_IS_GQUOTA_ON(mp) && gdqp) {
 		error = xfs_trans_dqresv(tp, mp, gdqp, nblks, ninos, flags);
 		if (error)
 			goto unwind_usr;
 	}

-	if (pdqp) {
+	if (XFS_IS_PQUOTA_ON(mp) && pdqp) {
 		error = xfs_trans_dqresv(tp, mp, pdqp, nblks, ninos, flags);
 		if (error)
 			goto unwind_grp;
-- 
2.25.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack
  2020-09-29 14:12 [PATCH RFC 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
  2020-09-29 14:12 ` [PATCH RFC 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
@ 2020-09-29 14:12 ` Brian Foster
  2020-09-29 20:50   ` Dave Chinner
  2020-09-29 14:12 ` [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
  2 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2020-09-29 14:12 UTC (permalink / raw)
  To: linux-xfs

Implement a quick hack to abuse the superblock freeze mechanism to
freeze the XFS transaction subsystem.

XXX: to be replaced

Not-Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_trans.c | 16 ++++++++++++++++
 fs/xfs/xfs_trans.h |  2 ++
 2 files changed, 18 insertions(+)

diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c
index ca18a040336a..5cb9cb44e963 100644
--- a/fs/xfs/xfs_trans.c
+++ b/fs/xfs/xfs_trans.c
@@ -1024,3 +1024,19 @@ xfs_trans_roll(
 	tres.tr_logflags = XFS_TRANS_PERM_LOG_RES;
 	return xfs_trans_reserve(*tpp, &tres, 0, 0);
 }
+
+void
+xfs_trans_freeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	percpu_down_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE-1);
+}
+
+void
+xfs_trans_unfreeze(
+	struct xfs_mount	*mp)
+{
+	struct super_block	*sb = mp->m_super;
+	percpu_up_write(sb->s_writers.rw_sem + SB_FREEZE_WRITE-1);
+}
diff --git a/fs/xfs/xfs_trans.h b/fs/xfs/xfs_trans.h
index f46534b75236..ec1d5a5c2610 100644
--- a/fs/xfs/xfs_trans.h
+++ b/fs/xfs/xfs_trans.h
@@ -233,6 +233,8 @@ int		xfs_trans_commit(struct xfs_trans *);
 int		xfs_trans_roll(struct xfs_trans **);
 int		xfs_trans_roll_inode(struct xfs_trans **, struct xfs_inode *);
 void		xfs_trans_cancel(xfs_trans_t *);
+void		xfs_trans_freeze(struct xfs_mount *);
+void		xfs_trans_unfreeze(struct xfs_mount *);
 int		xfs_trans_ail_init(struct xfs_mount *);
 void		xfs_trans_ail_destroy(struct xfs_mount *);
 
-- 
2.25.4


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-09-29 14:12 [PATCH RFC 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
  2020-09-29 14:12 ` [PATCH RFC 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
  2020-09-29 14:12 ` [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack Brian Foster
@ 2020-09-29 14:12 ` Brian Foster
  2020-09-29 20:48   ` Dave Chinner
  2 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2020-09-29 14:12 UTC (permalink / raw)
  To: linux-xfs

The quotaoff operation logs two log items. The start item is
committed first, followed by the bulk of the in-core quotaoff
processing, and then the quotaoff end item is committed to release
the start item from the log. The problem with this mechanism is that
quite a bit of processing can be required to release dquots from all
in-core inodes and subsequently flush/purge all dquots in the
system. This processing work doesn't generally generate much log
traffic itself, but the start item pins the tail of the log. If an
external workload consumes the remaining log space before the
transaction for the end item is allocated, a log deadlock can occur.

The purpose of the separate start and end log items is primarily to
ensure that log recovery does not incorrectly recover dquot data
after an fs crash where a quotaoff was in progress. If we only
logged a single quotaoff item, for example, it could fall behind the
tail of the log before the last dquot modification was made and
incorrectly replay dquot changes that might have occurred after the
start item committed but before quotaoff deactivated the quota.

With that context, we can make some small changes to the quotaoff
algorithm to provide the same general log ordering guarantee without
such a large window to create a log deadlock vector. Rather than
place a start item in the log for the duration of quotaoff
processing, we can quiesce the transaction subsystem up front to
guarantee that no further dquots are logged from that point forward.
IOW, we freeze the transaction subsystem, commit the start item in a
synchronous transaction that forces the log, deactivate the
associated quota such that subsequent transactions no longer modify
associated dquots, unfreeze the transaction subsystem and finally
commit the quotaoff end item. The transaction freeze is somewhat of
a heavy weight operation, but quotaoff is already a rare, slow and
performance disruptive operation.

Altogether, this means that the dquot rele/purge sequence occurs
after the quotaoff end item has committed and thus can technically
fall off the active range of the log. This is safe because the
remaining processing is all in-core work that doesn't involve
logging dquots and we've guaranteed that no further dquots can be
modified by external transactions. This allows quotaoff to complete
without risking log deadlock regardless of how much dquot processing
is required.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/xfs/xfs_qm_syscalls.c | 36 ++++++++++++++++--------------------
 fs/xfs/xfs_trans_dquot.c |  2 ++
 2 files changed, 18 insertions(+), 20 deletions(-)

diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
index ca1b57d291dc..97844f33f70f 100644
--- a/fs/xfs/xfs_qm_syscalls.c
+++ b/fs/xfs/xfs_qm_syscalls.c
@@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
 	int			error;
 	struct xfs_qoff_logitem	*qoffi;

-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
+				XFS_TRANS_NO_WRITECOUNT, &tp);
 	if (error)
 		goto out;

@@ -169,14 +170,18 @@ xfs_qm_scall_quotaoff(
 	if ((mp->m_qflags & flags) == 0)
 		goto out_unlock;

+	xfs_trans_freeze(mp);
+
 	/*
 	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
 	 * and synchronously. If we fail to write, we should abort the
 	 * operation as it cannot be recovered safely if we crash.
 	 */
 	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
-	if (error)
+	if (error) {
+		xfs_trans_unfreeze(mp);
 		goto out_unlock;
+	}

 	/*
 	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
@@ -191,6 +196,15 @@ xfs_qm_scall_quotaoff(
 	 */
 	mp->m_qflags &= ~inactivate_flags;

+	xfs_trans_unfreeze(mp);
+
+	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
+	if (error) {
+		/* We're screwed now. Shutdown is the only option. */
+		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
+		goto out_unlock;
+	}
+
 	/*
 	 * Give back all the dquot reference(s) held by inodes.
 	 * Here we go thru every single incore inode in this file system, and
@@ -216,24 +230,6 @@ xfs_qm_scall_quotaoff(
 	 */
 	xfs_qm_dqpurge_all(mp, dqtype);

-	/*
-	 * Transactions that had started before ACTIVE state bit was cleared
-	 * could have logged many dquots, so they'd have higher LSNs than
-	 * the first QUOTAOFF log record does. If we happen to crash when
-	 * the tail of the log has gone past the QUOTAOFF record, but
-	 * before the last dquot modification, those dquots __will__
-	 * recover, and that's not good.
-	 *
-	 * So, we have QUOTAOFF start and end logitems; the start
-	 * logitem won't get overwritten until the end logitem appears...
-	 */
-	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
-	if (error) {
-		/* We're screwed now. Shutdown is the only option. */
-		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
-		goto out_unlock;
-	}
-
 	/*
 	 * If all quotas are completely turned off, close shop.
 	 */
diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c
index 547ba824542e..9839b83e732a 100644
--- a/fs/xfs/xfs_trans_dquot.c
+++ b/fs/xfs/xfs_trans_dquot.c
@@ -52,6 +52,8 @@ xfs_trans_log_dquot(
 	struct xfs_dquot	*dqp)
 {
 	ASSERT(XFS_DQ_IS_LOCKED(dqp));
+	/* quotaoff expects no dquots logged after deactivation */
+	ASSERT(xfs_this_quota_on(tp->t_mountp, xfs_dquot_type(dqp)));

 	/* Upgrade the dquot to bigtime format if possible. */
 	if (dqp->q_id != 0 &&
-- 
2.25.4

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-09-29 14:12 ` [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
@ 2020-09-29 20:48   ` Dave Chinner
  2020-09-30 13:30     ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2020-09-29 20:48 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Sep 29, 2020 at 10:12:28AM -0400, Brian Foster wrote:
> The quotaoff operation logs two log items. The start item is
> committed first, followed by the bulk of the in-core quotaoff
> processing, and then the quotaoff end item is committed to release
> the start item from the log. The problem with this mechanism is that
> quite a bit of processing can be required to release dquots from all
> in-core inodes and subsequently flush/purge all dquots in the
> system. This processing work doesn't generally generate much log
> traffic itself, but the start item pins the tail of the log. If an
> external workload consumes the remaining log space before the
> transaction for the end item is allocated, a log deadlock can occur.
> 
> The purpose of the separate start and end log items is primarily to
> ensure that log recovery does not incorrectly recover dquot data
> after an fs crash where a quotaoff was in progress. If we only
> logged a single quotaoff item, for example, it could fall behind the
> tail of the log before the last dquot modification was made and
> incorrectly replay dquot changes that might have occurred after the
> start item committed but before quotaoff deactivated the quota.
> 
> With that context, we can make some small changes to the quotaoff
> algorithm to provide the same general log ordering guarantee without
> such a large window to create a log deadlock vector. Rather than
> place a start item in the log for the duration of quotaoff
> processing, we can quiesce the transaction subsystem up front to
> guarantee that no further dquots are logged from that point forward.
> IOW, we freeze the transaction subsystem, commit the start item in a
> synchronous transaction that forces the log, deactivate the
> associated quota such that subsequent transactions no longer modify
> associated dquots, unfreeze the transaction subsystem and finally
> commit the quotaoff end item. The transaction freeze is somewhat of
> a heavy weight operation, but quotaoff is already a rare, slow and
> performance disruptive operation.
> 
> Altogether, this means that the dquot rele/purge sequence occurs
> after the quotaoff end item has committed and thus can technically
> fall off the active range of the log. This is safe because the
> remaining processing is all in-core work that doesn't involve
> logging dquots and we've guaranteed that no further dquots can be
> modified by external transactions. This allows quotaoff to complete
> without risking log deadlock regardless of how much dquot processing
> is required.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Brian Foster <bfoster@redhat.com>
> ---
>  fs/xfs/xfs_qm_syscalls.c | 36 ++++++++++++++++--------------------
>  fs/xfs/xfs_trans_dquot.c |  2 ++
>  2 files changed, 18 insertions(+), 20 deletions(-)
> 
> diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> index ca1b57d291dc..97844f33f70f 100644
> --- a/fs/xfs/xfs_qm_syscalls.c
> +++ b/fs/xfs/xfs_qm_syscalls.c
> @@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
>  	int			error;
>  	struct xfs_qoff_logitem	*qoffi;
>  
> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> +				XFS_TRANS_NO_WRITECOUNT, &tp);
>  	if (error)
>  		goto out;
>  
> @@ -169,14 +170,18 @@ xfs_qm_scall_quotaoff(
>  	if ((mp->m_qflags & flags) == 0)
>  		goto out_unlock;
>  
> +	xfs_trans_freeze(mp);
> +
>  	/*
>  	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
>  	 * and synchronously. If we fail to write, we should abort the
>  	 * operation as it cannot be recovered safely if we crash.
>  	 */
>  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> -	if (error)
> +	if (error) {
> +		xfs_trans_unfreeze(mp);
>  		goto out_unlock;
> +	}
>  
>  	/*
>  	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> @@ -191,6 +196,15 @@ xfs_qm_scall_quotaoff(
>  	 */
>  	mp->m_qflags &= ~inactivate_flags;
>  
> +	xfs_trans_unfreeze(mp);
> +
> +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> +	if (error) {
> +		/* We're screwed now. Shutdown is the only option. */
> +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> +		goto out_unlock;
> +	}
> +

The m_qflags update is racy. There's no memory barriers or locks
here to order the mp->m_qflags update with other CPUs, so there's no
guarantee that an incoming transaction will see the m_qflags change
that disables quotas.

Also, we can now race with other transaction reservations to log the
quota off end item, which means that if there are enough pending
transactions reservations queued before the quotaoff_end transaction
is started, the quotaoff item can pin the tail of the log and we
deadlock.

That was the reason why I logged both quota off items in the same
transaction in the prototype code I send out - we have to log both
quota-off items while the transaction subsystem is quiesced
otherwise we don't fix the original problem (that the quotaoff
item can pin the tail of the log and deadlock).....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack
  2020-09-29 14:12 ` [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack Brian Foster
@ 2020-09-29 20:50   ` Dave Chinner
  2020-09-30 13:30     ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Dave Chinner @ 2020-09-29 20:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-xfs

On Tue, Sep 29, 2020 at 10:12:27AM -0400, Brian Foster wrote:
> Implement a quick hack to abuse the superblock freeze mechanism to
> freeze the XFS transaction subsystem.
> 
> XXX: to be replaced

What was wrong with the per-cpu counter that I used in the prototype
I sent? Why re-invent the wheel?

Also, can we call this a pause/resume operation so it doesn't get
confused with filesystem freezing? Freezing as operation name is way
too overloaded already...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-09-29 20:48   ` Dave Chinner
@ 2020-09-30 13:30     ` Brian Foster
  2020-09-30 16:53       ` Brian Foster
  0 siblings, 1 reply; 9+ messages in thread
From: Brian Foster @ 2020-09-30 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 30, 2020 at 06:48:08AM +1000, Dave Chinner wrote:
> On Tue, Sep 29, 2020 at 10:12:28AM -0400, Brian Foster wrote:
> > The quotaoff operation logs two log items. The start item is
> > committed first, followed by the bulk of the in-core quotaoff
> > processing, and then the quotaoff end item is committed to release
> > the start item from the log. The problem with this mechanism is that
> > quite a bit of processing can be required to release dquots from all
> > in-core inodes and subsequently flush/purge all dquots in the
> > system. This processing work doesn't generally generate much log
> > traffic itself, but the start item pins the tail of the log. If an
> > external workload consumes the remaining log space before the
> > transaction for the end item is allocated, a log deadlock can occur.
> > 
> > The purpose of the separate start and end log items is primarily to
> > ensure that log recovery does not incorrectly recover dquot data
> > after an fs crash where a quotaoff was in progress. If we only
> > logged a single quotaoff item, for example, it could fall behind the
> > tail of the log before the last dquot modification was made and
> > incorrectly replay dquot changes that might have occurred after the
> > start item committed but before quotaoff deactivated the quota.
> > 
> > With that context, we can make some small changes to the quotaoff
> > algorithm to provide the same general log ordering guarantee without
> > such a large window to create a log deadlock vector. Rather than
> > place a start item in the log for the duration of quotaoff
> > processing, we can quiesce the transaction subsystem up front to
> > guarantee that no further dquots are logged from that point forward.
> > IOW, we freeze the transaction subsystem, commit the start item in a
> > synchronous transaction that forces the log, deactivate the
> > associated quota such that subsequent transactions no longer modify
> > associated dquots, unfreeze the transaction subsystem and finally
> > commit the quotaoff end item. The transaction freeze is somewhat of
> > a heavy weight operation, but quotaoff is already a rare, slow and
> > performance disruptive operation.
> > 
> > Altogether, this means that the dquot rele/purge sequence occurs
> > after the quotaoff end item has committed and thus can technically
> > fall off the active range of the log. This is safe because the
> > remaining processing is all in-core work that doesn't involve
> > logging dquots and we've guaranteed that no further dquots can be
> > modified by external transactions. This allows quotaoff to complete
> > without risking log deadlock regardless of how much dquot processing
> > is required.
> > 
> > Suggested-by: Dave Chinner <david@fromorbit.com>
> > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > ---
> >  fs/xfs/xfs_qm_syscalls.c | 36 ++++++++++++++++--------------------
> >  fs/xfs/xfs_trans_dquot.c |  2 ++
> >  2 files changed, 18 insertions(+), 20 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > index ca1b57d291dc..97844f33f70f 100644
> > --- a/fs/xfs/xfs_qm_syscalls.c
> > +++ b/fs/xfs/xfs_qm_syscalls.c
> > @@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
> >  	int			error;
> >  	struct xfs_qoff_logitem	*qoffi;
> >  
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > +				XFS_TRANS_NO_WRITECOUNT, &tp);
> >  	if (error)
> >  		goto out;
> >  
> > @@ -169,14 +170,18 @@ xfs_qm_scall_quotaoff(
> >  	if ((mp->m_qflags & flags) == 0)
> >  		goto out_unlock;
> >  
> > +	xfs_trans_freeze(mp);
> > +
> >  	/*
> >  	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> >  	 * and synchronously. If we fail to write, we should abort the
> >  	 * operation as it cannot be recovered safely if we crash.
> >  	 */
> >  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > -	if (error)
> > +	if (error) {
> > +		xfs_trans_unfreeze(mp);
> >  		goto out_unlock;
> > +	}
> >  
> >  	/*
> >  	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> > @@ -191,6 +196,15 @@ xfs_qm_scall_quotaoff(
> >  	 */
> >  	mp->m_qflags &= ~inactivate_flags;
> >  
> > +	xfs_trans_unfreeze(mp);
> > +
> > +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > +	if (error) {
> > +		/* We're screwed now. Shutdown is the only option. */
> > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > +		goto out_unlock;
> > +	}
> > +
> 
> The m_qflags update is racy. There's no memory barriers or locks
> here to order the mp->m_qflags update with other CPUs, so there's no
> guarantee that an incoming transaction will see the m_qflags change
> that disables quotas.
> 

The transaction freeze is based on a percpu rwsem (I've replaced the
abuse of the freeze lock with our own lock in patch 2). That documents a
memory barrier on the read side, but that appears to be down in
__percpu_down_read_trylock() (or percpu_rwsem_wait()). What isn't clear
to me is whether the higher level RCU fast path has similar semantics,
particularly in the case where a racing transaction allocates just after
the transaction unfreeze...

Hmm... it seems not, but the rcu sync state managed by the lock appears
to be reentrant. I'm wondering if we could do something like grab the
write lock in the quotaoff path, re-enter the rcu sync state explicitly,
do the quotaoff bits, unlock, then hold the extra rcu sync entry until
the dqrele scan completes. That means transactions would temporarily hit
the memory barrier via the read lock until we can fall back to the
current ilock ordering scheme. OTOH, I don't see any precedent for
something like that elsewhere in the kernel, though cgroups does
something very similar by forcing the slow path unconditionally via
rcu_sync_enter_start(). This would presumably just do that
dynamically/temporarily. Thoughts?

For reference, the current quotaoff implementation appears to rely on
the order of the flag update before the dqrele scan as it assumes racing
transactions check quota active state under the ilock:

        /*
         * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
         * to take care of the race between dqget and quotaoff. We don't take
         * any special locks to reset these bits. All processes need to check
         * these bits *after* taking inode lock(s) to see if the particular
         * quota type is in the process of being turned off. If *ACTIVE, it is
         * guaranteed that all dquot structures and all quotainode ptrs will all
         * stay valid as long as that inode is kept locked.
         *
         * There is no turning back after this.
         */
        mp->m_qflags &= ~inactivate_flags;

IOW, since quotaoff scans and cycles every ilock, the dqrele scan
essentially serves as the barrier for racing transactions checking quota
active state (under ilock).

> Also, we can now race with other transaction reservations to log the
> quota off end item, which means that if there are enough pending
> transactions reservations queued before the quotaoff_end transaction
> is started, the quotaoff item can pin the tail of the log and we
> deadlock.
> 

That thought crossed my mind, but I didn't think it was possible in
practice. It's easy enough to commit the end transaction with
transactions paused as well so I'll make that change.

> That was the reason why I logged both quota off items in the same
> transaction in the prototype code I send out - we have to log both
> quota-off items while the transaction subsystem is quiesced
> otherwise we don't fix the original problem (that the quotaoff
> item can pin the tail of the log and deadlock).....
> 

The prototype code logged both transactions within the quota transaction
quiesce window, not both items in the same transaction. IIRC, we
discussed eventually doing that along with some other potential
cleanups, but I'm not looking to do any of that before the fundamental
algorithm rework settles...

Brian

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack
  2020-09-29 20:50   ` Dave Chinner
@ 2020-09-30 13:30     ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2020-09-30 13:30 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 30, 2020 at 06:50:11AM +1000, Dave Chinner wrote:
> On Tue, Sep 29, 2020 at 10:12:27AM -0400, Brian Foster wrote:
> > Implement a quick hack to abuse the superblock freeze mechanism to
> > freeze the XFS transaction subsystem.
> > 
> > XXX: to be replaced
> 
> What was wrong with the per-cpu counter that I used in the prototype
> I sent? Why re-invent the wheel?
> 

We discussed this in the original thread. See [1] (the tail end of my
mail is where we switch from general relogging discussion to the
quotaoff prototype) and your immediate reply for reference. The synopsis
is that I think a percpu rwsem around transaction allocation (what I've
replaced this patch with) is far more straightforward to audit, test and
maintain than annotating quota modifying transactions purely for the
purpose of quotaoff.

> Also, can we call this a pause/resume operation so it doesn't get
> confused with filesystem freezing? Freezing as operation name is way
> too overloaded already...
> 

Sure, pause/resume seems fine to me.

Brian

[1] https://lore.kernel.org/linux-xfs/20200702185209.GA58137@bfoster/

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs
  2020-09-30 13:30     ` Brian Foster
@ 2020-09-30 16:53       ` Brian Foster
  0 siblings, 0 replies; 9+ messages in thread
From: Brian Foster @ 2020-09-30 16:53 UTC (permalink / raw)
  To: Dave Chinner; +Cc: linux-xfs

On Wed, Sep 30, 2020 at 09:30:30AM -0400, Brian Foster wrote:
> On Wed, Sep 30, 2020 at 06:48:08AM +1000, Dave Chinner wrote:
> > On Tue, Sep 29, 2020 at 10:12:28AM -0400, Brian Foster wrote:
> > > The quotaoff operation logs two log items. The start item is
> > > committed first, followed by the bulk of the in-core quotaoff
> > > processing, and then the quotaoff end item is committed to release
> > > the start item from the log. The problem with this mechanism is that
> > > quite a bit of processing can be required to release dquots from all
> > > in-core inodes and subsequently flush/purge all dquots in the
> > > system. This processing work doesn't generally generate much log
> > > traffic itself, but the start item pins the tail of the log. If an
> > > external workload consumes the remaining log space before the
> > > transaction for the end item is allocated, a log deadlock can occur.
> > > 
> > > The purpose of the separate start and end log items is primarily to
> > > ensure that log recovery does not incorrectly recover dquot data
> > > after an fs crash where a quotaoff was in progress. If we only
> > > logged a single quotaoff item, for example, it could fall behind the
> > > tail of the log before the last dquot modification was made and
> > > incorrectly replay dquot changes that might have occurred after the
> > > start item committed but before quotaoff deactivated the quota.
> > > 
> > > With that context, we can make some small changes to the quotaoff
> > > algorithm to provide the same general log ordering guarantee without
> > > such a large window to create a log deadlock vector. Rather than
> > > place a start item in the log for the duration of quotaoff
> > > processing, we can quiesce the transaction subsystem up front to
> > > guarantee that no further dquots are logged from that point forward.
> > > IOW, we freeze the transaction subsystem, commit the start item in a
> > > synchronous transaction that forces the log, deactivate the
> > > associated quota such that subsequent transactions no longer modify
> > > associated dquots, unfreeze the transaction subsystem and finally
> > > commit the quotaoff end item. The transaction freeze is somewhat of
> > > a heavy weight operation, but quotaoff is already a rare, slow and
> > > performance disruptive operation.
> > > 
> > > Altogether, this means that the dquot rele/purge sequence occurs
> > > after the quotaoff end item has committed and thus can technically
> > > fall off the active range of the log. This is safe because the
> > > remaining processing is all in-core work that doesn't involve
> > > logging dquots and we've guaranteed that no further dquots can be
> > > modified by external transactions. This allows quotaoff to complete
> > > without risking log deadlock regardless of how much dquot processing
> > > is required.
> > > 
> > > Suggested-by: Dave Chinner <david@fromorbit.com>
> > > Signed-off-by: Brian Foster <bfoster@redhat.com>
> > > ---
> > >  fs/xfs/xfs_qm_syscalls.c | 36 ++++++++++++++++--------------------
> > >  fs/xfs/xfs_trans_dquot.c |  2 ++
> > >  2 files changed, 18 insertions(+), 20 deletions(-)
> > > 
> > > diff --git a/fs/xfs/xfs_qm_syscalls.c b/fs/xfs/xfs_qm_syscalls.c
> > > index ca1b57d291dc..97844f33f70f 100644
> > > --- a/fs/xfs/xfs_qm_syscalls.c
> > > +++ b/fs/xfs/xfs_qm_syscalls.c
> > > @@ -29,7 +29,8 @@ xfs_qm_log_quotaoff(
> > >  	int			error;
> > >  	struct xfs_qoff_logitem	*qoffi;
> > >  
> > > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0, 0, &tp);
> > > +	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_qm_quotaoff, 0, 0,
> > > +				XFS_TRANS_NO_WRITECOUNT, &tp);
> > >  	if (error)
> > >  		goto out;
> > >  
> > > @@ -169,14 +170,18 @@ xfs_qm_scall_quotaoff(
> > >  	if ((mp->m_qflags & flags) == 0)
> > >  		goto out_unlock;
> > >  
> > > +	xfs_trans_freeze(mp);
> > > +
> > >  	/*
> > >  	 * Write the LI_QUOTAOFF log record, and do SB changes atomically,
> > >  	 * and synchronously. If we fail to write, we should abort the
> > >  	 * operation as it cannot be recovered safely if we crash.
> > >  	 */
> > >  	error = xfs_qm_log_quotaoff(mp, &qoffstart, flags);
> > > -	if (error)
> > > +	if (error) {
> > > +		xfs_trans_unfreeze(mp);
> > >  		goto out_unlock;
> > > +	}
> > >  
> > >  	/*
> > >  	 * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
> > > @@ -191,6 +196,15 @@ xfs_qm_scall_quotaoff(
> > >  	 */
> > >  	mp->m_qflags &= ~inactivate_flags;
> > >  
> > > +	xfs_trans_unfreeze(mp);
> > > +
> > > +	error = xfs_qm_log_quotaoff_end(mp, &qoffstart, flags);
> > > +	if (error) {
> > > +		/* We're screwed now. Shutdown is the only option. */
> > > +		xfs_force_shutdown(mp, SHUTDOWN_CORRUPT_INCORE);
> > > +		goto out_unlock;
> > > +	}
> > > +
> > 
> > The m_qflags update is racy. There's no memory barriers or locks
> > here to order the mp->m_qflags update with other CPUs, so there's no
> > guarantee that an incoming transaction will see the m_qflags change
> > that disables quotas.
> > 
> 
> The transaction freeze is based on a percpu rwsem (I've replaced the
> abuse of the freeze lock with our own lock in patch 2). That documents a
> memory barrier on the read side, but that appears to be down in
> __percpu_down_read_trylock() (or percpu_rwsem_wait()). What isn't clear
> to me is whether the higher level RCU fast path has similar semantics,
> particularly in the case where a racing transaction allocates just after
> the transaction unfreeze...
> 
> Hmm... it seems not, but the rcu sync state managed by the lock appears
> to be reentrant. I'm wondering if we could do something like grab the
> write lock in the quotaoff path, re-enter the rcu sync state explicitly,
> do the quotaoff bits, unlock, then hold the extra rcu sync entry until
> the dqrele scan completes. That means transactions would temporarily hit
> the memory barrier via the read lock until we can fall back to the
> current ilock ordering scheme. OTOH, I don't see any precedent for
> something like that elsewhere in the kernel, though cgroups does
> something very similar by forcing the slow path unconditionally via
> rcu_sync_enter_start(). This would presumably just do that
> dynamically/temporarily. Thoughts?
> 

After digging further into it, this doesn't appear to be a viable option
because the associated rcu_sync_*() symbols aren't exported..

Brian

> For reference, the current quotaoff implementation appears to rely on
> the order of the flag update before the dqrele scan as it assumes racing
> transactions check quota active state under the ilock:
> 
>         /*
>          * Next we clear the XFS_MOUNT_*DQ_ACTIVE bit(s) in the mount struct
>          * to take care of the race between dqget and quotaoff. We don't take
>          * any special locks to reset these bits. All processes need to check
>          * these bits *after* taking inode lock(s) to see if the particular
>          * quota type is in the process of being turned off. If *ACTIVE, it is
>          * guaranteed that all dquot structures and all quotainode ptrs will all
>          * stay valid as long as that inode is kept locked.
>          *
>          * There is no turning back after this.
>          */
>         mp->m_qflags &= ~inactivate_flags;
> 
> IOW, since quotaoff scans and cycles every ilock, the dqrele scan
> essentially serves as the barrier for racing transactions checking quota
> active state (under ilock).
> 
> > Also, we can now race with other transaction reservations to log the
> > quota off end item, which means that if there are enough pending
> > transactions reservations queued before the quotaoff_end transaction
> > is started, the quotaoff item can pin the tail of the log and we
> > deadlock.
> > 
> 
> That thought crossed my mind, but I didn't think it was possible in
> practice. It's easy enough to commit the end transaction with
> transactions paused as well so I'll make that change.
> 
> > That was the reason why I logged both quota off items in the same
> > transaction in the prototype code I send out - we have to log both
> > quota-off items while the transaction subsystem is quiesced
> > otherwise we don't fix the original problem (that the quotaoff
> > item can pin the tail of the log and deadlock).....
> > 
> 
> The prototype code logged both transactions within the quota transaction
> quiesce window, not both items in the same transaction. IIRC, we
> discussed eventually doing that along with some other potential
> cleanups, but I'm not looking to do any of that before the fundamental
> algorithm rework settles...
> 
> Brian
> 
> > Cheers,
> > 
> > Dave.
> > -- 
> > Dave Chinner
> > david@fromorbit.com
> > 
> 


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2020-09-30 16:53 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2020-09-29 14:12 [PATCH RFC 0/3] xfs: rework quotaoff to avoid log deadlock Brian Foster
2020-09-29 14:12 ` [PATCH RFC 1/3] xfs: skip dquot reservations if quota is inactive Brian Foster
2020-09-29 14:12 ` [PATCH RFC 2/3] xfs: temporary transaction subsystem freeze hack Brian Foster
2020-09-29 20:50   ` Dave Chinner
2020-09-30 13:30     ` Brian Foster
2020-09-29 14:12 ` [PATCH RFC 3/3] xfs: rework quotaoff logging to avoid log deadlock on active fs Brian Foster
2020-09-29 20:48   ` Dave Chinner
2020-09-30 13:30     ` Brian Foster
2020-09-30 16:53       ` Brian Foster

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox