public inbox for linux-xfs@vger.kernel.org
 help / color / mirror / Atom feed
* [PATCH] xfs: don't free EFIs before the EFDs are committed
@ 2013-04-03  3:09 Dave Chinner
  2013-04-03 19:12 ` Mark Tinguely
                   ` (2 more replies)
  0 siblings, 3 replies; 10+ messages in thread
From: Dave Chinner @ 2013-04-03  3:09 UTC (permalink / raw)
  To: xfs

From: Dave Chinner <dchinner@redhat.com>

Filesystems are occasionally being shut down with this error:

xfs_trans_ail_delete_bulk: attempting to delete a log item that is
not in the AIL.

It was diagnosed to be related to the EFI/EFD commit order when the
EFI and EFD are in different checkpoints and the EFD is committed
before the EFI here:

http://oss.sgi.com/archives/xfs/2013-01/msg00082.html

The real problem is that a single bit cannot fully describe the
states that the EFI/EFD processing can be in. These completion
states are:

EFI			EFI in AIL	EFD		Result
committed/unpinned	Yes		committed	OK
committed/pinned	No		committed	Shutdown
uncommitted		No		committed	Shutdown


Note that the "result" field is what should happen, not what does
happen. The current logic is broken and handles the first two cases
correctly by luck.  That is, the code will free the EFI if the
XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
inverted logic "works" because if both EFI and EFD are committed,
then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
bit, and the second frees the EFI item. Hence as long as
xfs_efi_item_committed() has been called, everything appears to be
fine.

It is the third case where the logic fails - where
xfs_efd_item_committed() is called before xfs_efi_item_committed(),
and that results in the EFI being freed before it has been
committed. That is the bug that triggered the shutdown, and hence
keeping track of whether the EFI has been committed or not is
insufficient to correctly order the EFI/EFD operations w.r.t. the
AIL.

What we really want is this: the EFI is always placed into the
AIL before the last reference goes away. The only way to guarantee
that is that the EFI is not freed until after it has been unpinned
*and* the EFD has been committed. That is, restructure the logic so
that the only case that can occur is the first case.

This can be done easily by replacing the XFS_EFI_COMMITTED with an
EFI reference count. The EFI is initialised with it's own count, and
that is not released until it is unpinned. However, there is a
complication to this method - the high level EFI/EFD code in
xfs_bmap_finish() does not hold direct references to the EFI
structure, and runs a transaction commit between the EFI and EFD
processing. Hence the EFI can be freed even before the EFD is
created using such a method.

Further, log recovery uses the AIL for tracking EFI/EFDs that need
to be recovered, but it uses the AIL *differently* to the EFI
transaction commit. Hence log recovery never pins or unpins EFIs, so
we can't drop the EFI reference count indirectly to free the EFI.

However, this doesn't prevent us from using a reference count here.
There is a 1:1 relationship between EFIs and EFDs, so when we
initialise the EFI we can take a reference count for the EFD as
well. This solves the xfs_bmap_finish() issue - the EFI will never
be freed until the EFD is processed. In terms of log recovery,
during the committing of the EFD we can look for the
XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
thereby ensuring everything works correctly there as well.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_extfree_item.c |   27 +++++++++++++--------------
 fs/xfs/xfs_extfree_item.h |   14 +++++++++-----
 fs/xfs/xfs_log_recover.c  |    1 +
 3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/fs/xfs/xfs_extfree_item.c b/fs/xfs/xfs_extfree_item.c
index feb36d7..c0f3750 100644
--- a/fs/xfs/xfs_extfree_item.c
+++ b/fs/xfs/xfs_extfree_item.c
@@ -50,9 +50,8 @@ xfs_efi_item_free(
  * Freeing the efi requires that we remove it from the AIL if it has already
  * been placed there. However, the EFI may not yet have been placed in the AIL
  * when called by xfs_efi_release() from EFD processing due to the ordering of
- * committed vs unpin operations in bulk insert operations. Hence the
- * test_and_clear_bit(XFS_EFI_COMMITTED) to ensure only the last caller frees
- * the EFI.
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the EFI.
  */
 STATIC void
 __xfs_efi_release(
@@ -60,7 +59,7 @@ __xfs_efi_release(
 {
 	struct xfs_ail		*ailp = efip->efi_item.li_ailp;
 
-	if (!test_and_clear_bit(XFS_EFI_COMMITTED, &efip->efi_flags)) {
+	if (atomic_dec_and_test(&efip->efi_refcount)) {
 		spin_lock(&ailp->xa_lock);
 		/* xfs_trans_ail_delete() drops the AIL lock. */
 		xfs_trans_ail_delete(ailp, &efip->efi_item,
@@ -126,8 +125,8 @@ xfs_efi_item_pin(
  * which the EFI is manipulated during a transaction.  If we are being asked to
  * remove the EFI it's because the transaction has been cancelled and by
  * definition that means the EFI cannot be in the AIL so remove it from the
- * transaction and free it.  Otherwise coordinate with xfs_efi_release() (via
- * XFS_EFI_COMMITTED) to determine who gets to free the EFI.
+ * transaction and free it.  Otherwise coordinate with xfs_efi_release()
+ * to determine who gets to free the EFI.
  */
 STATIC void
 xfs_efi_item_unpin(
@@ -171,19 +170,13 @@ xfs_efi_item_unlock(
 
 /*
  * The EFI is logged only once and cannot be moved in the log, so simply return
- * the lsn at which it's been logged.  For bulk transaction committed
- * processing, the EFI may be processed but not yet unpinned prior to the EFD
- * being processed. Set the XFS_EFI_COMMITTED flag so this case can be detected
- * when processing the EFD.
+ * the lsn at which it's been logged.
  */
 STATIC xfs_lsn_t
 xfs_efi_item_committed(
 	struct xfs_log_item	*lip,
 	xfs_lsn_t		lsn)
 {
-	struct xfs_efi_log_item	*efip = EFI_ITEM(lip);
-
-	set_bit(XFS_EFI_COMMITTED, &efip->efi_flags);
 	return lsn;
 }
 
@@ -241,6 +234,7 @@ xfs_efi_init(
 	efip->efi_format.efi_nextents = nextents;
 	efip->efi_format.efi_id = (__psint_t)(void*)efip;
 	atomic_set(&efip->efi_next_extent, 0);
+	atomic_set(&efip->efi_refcount, 2);
 
 	return efip;
 }
@@ -310,8 +304,13 @@ xfs_efi_release(xfs_efi_log_item_t	*efip,
 		uint			nextents)
 {
 	ASSERT(atomic_read(&efip->efi_next_extent) >= nextents);
-	if (atomic_sub_and_test(nextents, &efip->efi_next_extent))
+	if (atomic_sub_and_test(nextents, &efip->efi_next_extent)) {
 		__xfs_efi_release(efip);
+
+		/* recovery needs us to drop the EFI reference, too */
+		if (test_bit(XFS_EFI_RECOVERED, &efip->efi_flags))
+			__xfs_efi_release(efip);
+	}
 }
 
 static inline struct xfs_efd_log_item *EFD_ITEM(struct xfs_log_item *lip)
diff --git a/fs/xfs/xfs_extfree_item.h b/fs/xfs/xfs_extfree_item.h
index 375f68e..4322224 100644
--- a/fs/xfs/xfs_extfree_item.h
+++ b/fs/xfs/xfs_extfree_item.h
@@ -114,16 +114,20 @@ typedef struct xfs_efd_log_format_64 {
  * Define EFI flag bits. Manipulated by set/clear/test_bit operators.
  */
 #define	XFS_EFI_RECOVERED	1
-#define	XFS_EFI_COMMITTED	2
 
 /*
- * This is the "extent free intention" log item.  It is used
- * to log the fact that some extents need to be free.  It is
- * used in conjunction with the "extent free done" log item
- * described below.
+ * This is the "extent free intention" log item.  It is used to log the fact
+ * that some extents need to be free.  It is used in conjunction with the
+ * "extent free done" log item described below.
+ *
+ * The EFI is reference counted so that it is not freed prior to both the EFI
+ * and EFD being committed and unpinned. This ensures that when the last
+ * reference goes away the EFI will always be in the AIL as it has been
+ * unpinned, regardless of whether the EFD is processed before or after the EFI.
  */
 typedef struct xfs_efi_log_item {
 	xfs_log_item_t		efi_item;
+	atomic_t		efi_refcount;
 	atomic_t		efi_next_extent;
 	unsigned long		efi_flags;	/* misc flags */
 	xfs_efi_log_format_t	efi_format;
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index d1dba7c..3ca3380 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -2948,6 +2948,7 @@ xlog_recover_process_efi(
 			 * This will pull the EFI from the AIL and
 			 * free the memory associated with it.
 			 */
+			set_bit(XFS_EFI_RECOVERED, &efip->efi_flags);
 			xfs_efi_release(efip, efip->efi_format.efi_nextents);
 			return XFS_ERROR(EIO);
 		}
-- 
1.7.10.4

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03  3:09 [PATCH] xfs: don't free EFIs before the EFDs are committed Dave Chinner
@ 2013-04-03 19:12 ` Mark Tinguely
  2013-04-03 19:46   ` Eric Sandeen
  2013-04-04  1:14   ` Dave Chinner
  2013-04-04 22:06 ` Mark Tinguely
  2013-04-05 18:31 ` Ben Myers
  2 siblings, 2 replies; 10+ messages in thread
From: Mark Tinguely @ 2013-04-03 19:12 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 04/02/13 22:09, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Filesystems are occasionally being shut down with this error:
>
> xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> not in the AIL.
>
> It was diagnosed to be related to the EFI/EFD commit order when the
> EFI and EFD are in different checkpoints and the EFD is committed
> before the EFI here:
>
> http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
>
> The real problem is that a single bit cannot fully describe the
> states that the EFI/EFD processing can be in. These completion
> states are:
>
> EFI			EFI in AIL	EFD		Result
> committed/unpinned	Yes		committed	OK
> committed/pinned	No		committed	Shutdown
> uncommitted		No		committed	Shutdown
>
>
> Note that the "result" field is what should happen, not what does
> happen. The current logic is broken and handles the first two cases
> correctly by luck.  That is, the code will free the EFI if the
> XFS_EFI_COMMITTED bit is*not*  set, rather than if it is set. The
> inverted logic "works" because if both EFI and EFD are committed,
> then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> bit, and the second frees the EFI item. Hence as long as
> xfs_efi_item_committed() has been called, everything appears to be
> fine.
>
> It is the third case where the logic fails - where
> xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> and that results in the EFI being freed before it has been
> committed. That is the bug that triggered the shutdown, d hence
> keeping track of whether the EFI has been committed or not is
> insufficient to correctly order the EFI/EFD operations w.r.t. the
> AIL.
>

I think the exist xfs_efi routines are working correctly.

the xfs_trans_committed_bulk() for the efi does the IOP_COMMITTED (sets 
the XFS_EFI_COMMITTED bit), add the entry to the AIL and then an 
IOP_UNPIN() which clears the XFS_EFI_COMMITTED since it is set, it 
should not release the efi and remove the entry from the AIL. It also 
correctly detects if the EFD is committed before the EFI by shutting 
down the filesystem.

This patch seems to paper over why the EFD will come before the EFI.

The CIL commits are in the correct order - protected by the xlog_wait() 
but the callbacks are out of order.

My previous patch fixed the callback from happening out of sequence.

--Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03 19:12 ` Mark Tinguely
@ 2013-04-03 19:46   ` Eric Sandeen
  2013-04-03 21:02     ` Eric Sandeen
  2013-04-04  1:14   ` Dave Chinner
  1 sibling, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2013-04-03 19:46 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On 4/3/13 2:12 PM, Mark Tinguely wrote:
> On 04/02/13 22:09, Dave Chinner wrote:
>> From: Dave Chinner<dchinner@redhat.com>
>>
>> Filesystems are occasionally being shut down with this error:
>>
>> xfs_trans_ail_delete_bulk: attempting to delete a log item that is
>> not in the AIL.
>>
>> It was diagnosed to be related to the EFI/EFD commit order when the
>> EFI and EFD are in different checkpoints and the EFD is committed
>> before the EFI here:
>>
>> http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
>>
>> The real problem is that a single bit cannot fully describe the
>> states that the EFI/EFD processing can be in. These completion
>> states are:
>>
>> EFI            EFI in AIL    EFD        Result
>> committed/unpinned    Yes        committed    OK
>> committed/pinned    No        committed    Shutdown
>> uncommitted        No        committed    Shutdown
>>
>>
>> Note that the "result" field is what should happen, not what does
>> happen. The current logic is broken and handles the first two cases
>> correctly by luck.  That is, the code will free the EFI if the
>> XFS_EFI_COMMITTED bit is*not*  set, rather than if it is set. The
>> inverted logic "works" because if both EFI and EFD are committed,
>> then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
>> bit, and the second frees the EFI item. Hence as long as
>> xfs_efi_item_committed() has been called, everything appears to be
>> fine.
>>
>> It is the third case where the logic fails - where
>> xfs_efd_item_committed() is called before xfs_efi_item_committed(),
>> and that results in the EFI being freed before it has been
>> committed. That is the bug that triggered the shutdown, d hence
>> keeping track of whether the EFI has been committed or not is
>> insufficient to correctly order the EFI/EFD operations w.r.t. the
>> AIL.
>>
> 
> I think the exist xfs_efi routines are working correctly.
> 
> the xfs_trans_committed_bulk() for the efi does the IOP_COMMITTED
> (sets the XFS_EFI_COMMITTED bit), add the entry to the AIL and then
> an IOP_UNPIN() which clears the XFS_EFI_COMMITTED since it is set, it
> should not release the efi and remove the entry from the AIL. It also
> correctly detects if the EFD is committed before the EFI by shutting
> down the filesystem.

Well hang on, not everything can be cool in EFI-land here, if nothing
else this:

__xfs_efi_release(
        struct xfs_efi_log_item *efip)
{
        struct xfs_ail          *ailp = efip->efi_item.li_ailp;

        if (!test_and_clear_bit(XFS_EFI_COMMITTED, &efip->efi_flags)) {
                spin_lock(&ailp->xa_lock);
                /* xfs_trans_ail_delete() drops the AIL lock. */
                xfs_trans_ail_delete(ailp, &efip->efi_item,
                                     SHUTDOWN_LOG_IO_ERROR);
                xfs_efi_item_free(efip);
        }
}

seems obviously broken.

 * test_and_clear_bit - Clear a bit and return its old value

so it is saying if XFS_EFI_COMMITED was *not* set, free the efi.
Which I think we must admit is broken.

I have to get up to speed on this code, but it seems like at least
this much above is quite obviously broken, and the previously
proposed patch doesn't address it.  Am I missing something?

-Eric

> This patch seems to paper over why the EFD will come before the EFI.
> 
> The CIL commits are in the correct order - protected by the
> xlog_wait() but the callbacks are out of order.
> 
> My previous patch fixed the callback from happening out of sequence.
> 
> --Mark.
> 
> _______________________________________________
> xfs mailing list
> xfs@oss.sgi.com
> http://oss.sgi.com/mailman/listinfo/xfs
> 

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03 19:46   ` Eric Sandeen
@ 2013-04-03 21:02     ` Eric Sandeen
  2013-04-03 21:45       ` Mark Tinguely
  0 siblings, 1 reply; 10+ messages in thread
From: Eric Sandeen @ 2013-04-03 21:02 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On 4/3/13 2:46 PM, Eric Sandeen wrote:
> On 4/3/13 2:12 PM, Mark Tinguely wrote:
>> On 04/02/13 22:09, Dave Chinner wrote:
>>> From: Dave Chinner<dchinner@redhat.com>
>>>
>>> Filesystems are occasionally being shut down with this error:
>>>
>>> xfs_trans_ail_delete_bulk: attempting to delete a log item that is
>>> not in the AIL.
>>>
>>> It was diagnosed to be related to the EFI/EFD commit order when the
>>> EFI and EFD are in different checkpoints and the EFD is committed
>>> before the EFI here:
>>>
>>> http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
>>>
>>> The real problem is that a single bit cannot fully describe the
>>> states that the EFI/EFD processing can be in. These completion
>>> states are:
>>>
>>> EFI            EFI in AIL    EFD        Result
>>> committed/unpinned    Yes        committed    OK
>>> committed/pinned    No        committed    Shutdown
>>> uncommitted        No        committed    Shutdown
>>>
>>>
>>> Note that the "result" field is what should happen, not what does
>>> happen. The current logic is broken and handles the first two cases
>>> correctly by luck.  That is, the code will free the EFI if the
>>> XFS_EFI_COMMITTED bit is*not*  set, rather than if it is set. The
>>> inverted logic "works" because if both EFI and EFD are committed,
>>> then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
>>> bit, and the second frees the EFI item. Hence as long as
>>> xfs_efi_item_committed() has been called, everything appears to be
>>> fine.
>>>
>>> It is the third case where the logic fails - where
>>> xfs_efd_item_committed() is called before xfs_efi_item_committed(),
>>> and that results in the EFI being freed before it has been
>>> committed. That is the bug that triggered the shutdown, d hence
>>> keeping track of whether the EFI has been committed or not is
>>> insufficient to correctly order the EFI/EFD operations w.r.t. the
>>> AIL.
>>>
>>
>> I think the exist xfs_efi routines are working correctly.
>>
>> the xfs_trans_committed_bulk() for the efi does the IOP_COMMITTED
>> (sets the XFS_EFI_COMMITTED bit), add the entry to the AIL and then
>> an IOP_UNPIN() which clears the XFS_EFI_COMMITTED since it is set, it
>> should not release the efi and remove the entry from the AIL. It also
>> correctly detects if the EFD is committed before the EFI by shutting
>> down the filesystem.
> 
> Well hang on, not everything can be cool in EFI-land here, if nothing
> else this:
> 
> __xfs_efi_release(
>         struct xfs_efi_log_item *efip)
> {
>         struct xfs_ail          *ailp = efip->efi_item.li_ailp;
> 
>         if (!test_and_clear_bit(XFS_EFI_COMMITTED, &efip->efi_flags)) {
>                 spin_lock(&ailp->xa_lock);
>                 /* xfs_trans_ail_delete() drops the AIL lock. */
>                 xfs_trans_ail_delete(ailp, &efip->efi_item,
>                                      SHUTDOWN_LOG_IO_ERROR);
>                 xfs_efi_item_free(efip);
>         }
> }
> 
> seems obviously broken.
> 
>  * test_and_clear_bit - Clear a bit and return its old value
> 
> so it is saying if XFS_EFI_COMMITED was *not* set, free the efi.
> Which I think we must admit is broken.
> 
> I have to get up to speed on this code, but it seems like at least
> this much above is quite obviously broken, and the previously
> proposed patch doesn't address it.  Am I missing something?

Or are you saying that it's working as designed, so that the first
caller finds it set, clears it and does nothing, and the 2nd caller
finds it unset, and frees it?  help me out here ;)

-Eric

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03 21:02     ` Eric Sandeen
@ 2013-04-03 21:45       ` Mark Tinguely
  2013-04-04  1:31         ` Dave Chinner
  0 siblings, 1 reply; 10+ messages in thread
From: Mark Tinguely @ 2013-04-03 21:45 UTC (permalink / raw)
  To: Eric Sandeen; +Cc: xfs

On 04/03/13 16:02, Eric Sandeen wrote:
> On 4/3/13 2:46 PM, Eric Sandeen wrote:
>> On 4/3/13 2:12 PM, Mark Tinguely wrote:
>>> On 04/02/13 22:09, Dave Chinner wrote:
>>>> From: Dave Chinner<dchinner@redhat.com>
>>>>
>>>> Filesystems are occasionally being shut down with this error:
>>>>
>>>> xfs_trans_ail_delete_bulk: attempting to delete a log item that is
>>>> not in the AIL.
>>>>
>>>> It was diagnosed to be related to the EFI/EFD commit order when the
>>>> EFI and EFD are in different checkpoints and the EFD is committed
>>>> before the EFI here:
>>>>
>>>> http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
>>>>
>>>> The real problem is that a single bit cannot fully describe the
>>>> states that the EFI/EFD processing can be in. These completion
>>>> states are:
>>>>
>>>> EFI            EFI in AIL    EFD        Result
>>>> committed/unpinned    Yes        committed    OK
>>>> committed/pinned    No        committed    Shutdown
>>>> uncommitted        No        committed    Shutdown
>>>>
>>>>
>>>> Note that the "result" field is what should happen, not what does
>>>> happen. The current logic is broken and handles the first two cases
>>>> correctly by luck.  That is, the code will free the EFI if the
>>>> XFS_EFI_COMMITTED bit is*not*  set, rather than if it is set. The
>>>> inverted logic "works" because if both EFI and EFD are committed,
>>>> then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
>>>> bit, and the second frees the EFI item. Hence as long as
>>>> xfs_efi_item_committed() has been called, everything appears to be
>>>> fine.
>>>>
>>>> It is the third case where the logic fails - where
>>>> xfs_efd_item_committed() is called before xfs_efi_item_committed(),
>>>> and that results in the EFI being freed before it has been
>>>> committed. That is the bug that triggered the shutdown, d hence
>>>> keeping track of whether the EFI has been committed or not is
>>>> insufficient to correctly order the EFI/EFD operations w.r.t. the
>>>> AIL.
>>>>
>>>
>>> I think the exist xfs_efi routines are working correctly.
>>>
>>> the xfs_trans_committed_bulk() for the efi does the IOP_COMMITTED
>>> (sets the XFS_EFI_COMMITTED bit), add the entry to the AIL and then
>>> an IOP_UNPIN() which clears the XFS_EFI_COMMITTED since it is set, it
>>> should not release the efi and remove the entry from the AIL. It also
>>> correctly detects if the EFD is committed before the EFI by shutting
>>> down the filesystem.
>>
>> Well hang on, not everything can be cool in EFI-land here, if nothing
>> else this:
>>
>> __xfs_efi_release(
>>          struct xfs_efi_log_item *efip)
>> {
>>          struct xfs_ail          *ailp = efip->efi_item.li_ailp;
>>
>>          if (!test_and_clear_bit(XFS_EFI_COMMITTED,&efip->efi_flags)) {
>>                  spin_lock(&ailp->xa_lock);
>>                  /* xfs_trans_ail_delete() drops the AIL lock. */
>>                  xfs_trans_ail_delete(ailp,&efip->efi_item,
>>                                       SHUTDOWN_LOG_IO_ERROR);
>>                  xfs_efi_item_free(efip);
>>          }
>> }
>>
>> seems obviously broken.
>>
>>   * test_and_clear_bit - Clear a bit and return its old value
>>
>> so it is saying if XFS_EFI_COMMITED was *not* set, free the efi.
>> Which I think we must admit is broken.
>>
>> I have to get up to speed on this code, but it seems like at least
>> this much above is quite obviously broken, and the previously
>> proposed patch doesn't address it.  Am I missing something?
>
> Or are you saying that it's working as designed, so that the first
> caller finds it set, clears it and does nothing, and the 2nd caller
> finds it unset, and frees it?  help me out here ;)
>
> -Eric
>

Yes. The order is IOP_COMMITTED and then IOP_UNPIN:

The EFI IOP_COMMITTED:

xfs_efi_item_committed(
         struct xfs_log_item     *lip,
         xfs_lsn_t               lsn)
{
         struct xfs_efi_log_item *efip = EFI_ITEM(lip);

         set_bit(XFS_EFI_COMMITTED, &efip->efi_flags);
         return lsn;
}

then

The EFI IOP_UNPIN() ->

STATIC void
xfs_efi_item_unpin(
         struct xfs_log_item     *lip,
         int                     remove)
{
         struct xfs_efi_log_item *efip = EFI_ITEM(lip);

         if (remove) {
                 ASSERT(!(lip->li_flags & XFS_LI_IN_AIL));
                 if (lip->li_desc)
                         xfs_trans_del_item(lip);
                 xfs_efi_item_free(efip);
                 return;
         }
         __xfs_efi_release(efip);
}


and the the EFD's IOP_COMMITED:

STATIC xfs_lsn_t
xfs_efi_item_committed(
         struct xfs_log_item     *lip,
         xfs_lsn_t               lsn)
{
         struct xfs_efi_log_item *efip = EFI_ITEM(lip);

         set_bit(XFS_EFI_COMMITTED, &efip->efi_flags);
         return lsn;
}

The code makes sure the sequence EFI and then EFD are done in order.

--Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03 19:12 ` Mark Tinguely
  2013-04-03 19:46   ` Eric Sandeen
@ 2013-04-04  1:14   ` Dave Chinner
  1 sibling, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2013-04-04  1:14 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Wed, Apr 03, 2013 at 02:12:09PM -0500, Mark Tinguely wrote:
> On 04/02/13 22:09, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@redhat.com>
> >
> >Filesystems are occasionally being shut down with this error:
> >
> >xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> >not in the AIL.
> >
> >It was diagnosed to be related to the EFI/EFD commit order when the
> >EFI and EFD are in different checkpoints and the EFD is committed
> >before the EFI here:
> >
> >http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
> >
> >The real problem is that a single bit cannot fully describe the
> >states that the EFI/EFD processing can be in. These completion
> >states are:
> >
> >EFI			EFI in AIL	EFD		Result
> >committed/unpinned	Yes		committed	OK
> >committed/pinned	No		committed	Shutdown
> >uncommitted		No		committed	Shutdown
> >
> >
> >Note that the "result" field is what should happen, not what does
> >happen. The current logic is broken and handles the first two cases
> >correctly by luck.  That is, the code will free the EFI if the
> >XFS_EFI_COMMITTED bit is*not*  set, rather than if it is set. The
> >inverted logic "works" because if both EFI and EFD are committed,
> >then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> >bit, and the second frees the EFI item. Hence as long as
> >xfs_efi_item_committed() has been called, everything appears to be
> >fine.
> >
> >It is the third case where the logic fails - where
> >xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> >and that results in the EFI being freed before it has been
> >committed. That is the bug that triggered the shutdown, d hence
> >keeping track of whether the EFI has been committed or not is
> >insufficient to correctly order the EFI/EFD operations w.r.t. the
> >AIL.
> >
> 
> I think the exist xfs_efi routines are working correctly.

Speaking as the person whose designed and wrote the current EFI/EFD
code in question here, I think it is broken and only working by
chance.

Hint: there is no comment anywhere (code or commit message) that is
works by a tricky piece of logic that first clears the committed bit
and then frees the object when it is not set the second time
through.

I *always* write comments about tricky logic like this when it is
intentional, so that when I look at it a couple of years later I
know exactly why the logic is correct even though it appears wrong.
No comments therefore means this logic was not the intended
behaviour and therefore only working by chance.

> the xfs_trans_committed_bulk() for the efi does the IOP_COMMITTED
> (sets the XFS_EFI_COMMITTED bit), add the entry to the AIL and then
> an IOP_UNPIN() which clears the XFS_EFI_COMMITTED since it is set,
> it should not release the efi and remove the entry from the AIL. It
> also correctly detects if the EFD is committed before the EFI by
> shutting down the filesystem.

But that behaviour is wrong - seeing the EFD before the EFI is not a
bug, just an indication that we have interleaved checkpoints being
committed concurrently. The fact that we've seen the EFD means when
we can simply free the EFI once it has been committed. The AIL is
where the ordering of journalled objects matter; the order in
which iclog commit callbacks are run does not affect that.

> This patch seems to paper over why the EFD will come before the EFI.
> 
> The CIL commits are in the correct order - protected by the
> xlog_wait() but the callbacks are out of order.

The design is such that what is physically on disk is correctly
ordered, but we don't care about the callback order as the
AIL is where the *in-memory object order* is correctly established.

> My previous patch fixed the callback from happening out of sequence.

It may have, but it serialised all CIL checkpoints, not just the
commit record order. That approach to fixing the bug is simply a
non-starter.  Why are you raising this now when you were silent
about it 3 months ago when this bug was first being discussed and
the EFI/EFD handling bug was initially raised?

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03 21:45       ` Mark Tinguely
@ 2013-04-04  1:31         ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2013-04-04  1:31 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: Eric Sandeen, xfs

On Wed, Apr 03, 2013 at 04:45:18PM -0500, Mark Tinguely wrote:
> On 04/03/13 16:02, Eric Sandeen wrote:
> >On 4/3/13 2:46 PM, Eric Sandeen wrote:
> >>On 4/3/13 2:12 PM, Mark Tinguely wrote:
> >>>On 04/02/13 22:09, Dave Chinner wrote:
> >>>>From: Dave Chinner<dchinner@redhat.com>
> >>>>
> >>>>Filesystems are occasionally being shut down with this error:
> >>>>
> >>>>xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> >>>>not in the AIL.
> >>>>
> >>>>It was diagnosed to be related to the EFI/EFD commit order when the
> >>>>EFI and EFD are in different checkpoints and the EFD is committed
> >>>>before the EFI here:
> >>>>
> >>>>http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
> >>>>
> >>>>The real problem is that a single bit cannot fully describe the
> >>>>states that the EFI/EFD processing can be in. These completion
> >>>>states are:
> >>>>
> >>>>EFI            EFI in AIL    EFD        Result
> >>>>committed/unpinned    Yes        committed    OK
> >>>>committed/pinned    No        committed    Shutdown
> >>>>uncommitted        No        committed    Shutdown
> >>>>
> >>>>
> >>>>Note that the "result" field is what should happen, not what does
> >>>>happen. The current logic is broken and handles the first two cases
> >>>>correctly by luck.  That is, the code will free the EFI if the
> >>>>XFS_EFI_COMMITTED bit is*not*  set, rather than if it is set. The
> >>>>inverted logic "works" because if both EFI and EFD are committed,
> >>>>then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> >>>>bit, and the second frees the EFI item. Hence as long as
> >>>>xfs_efi_item_committed() has been called, everything appears to be
> >>>>fine.
> >>>>
> >>>>It is the third case where the logic fails - where
> >>>>xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> >>>>and that results in the EFI being freed before it has been
> >>>>committed. That is the bug that triggered the shutdown, d hence
> >>>>keeping track of whether the EFI has been committed or not is
> >>>>insufficient to correctly order the EFI/EFD operations w.r.t. the
> >>>>AIL.
> >>>>
> >>>
> >>>I think the exist xfs_efi routines are working correctly.
> >>>
> >>>the xfs_trans_committed_bulk() for the efi does the IOP_COMMITTED
> >>>(sets the XFS_EFI_COMMITTED bit), add the entry to the AIL and then
> >>>an IOP_UNPIN() which clears the XFS_EFI_COMMITTED since it is set, it
> >>>should not release the efi and remove the entry from the AIL. It also
> >>>correctly detects if the EFD is committed before the EFI by shutting
> >>>down the filesystem.
> >>
> >>Well hang on, not everything can be cool in EFI-land here, if nothing
> >>else this:
> >>
> >>__xfs_efi_release(
> >>         struct xfs_efi_log_item *efip)
> >>{
> >>         struct xfs_ail          *ailp = efip->efi_item.li_ailp;
> >>
> >>         if (!test_and_clear_bit(XFS_EFI_COMMITTED,&efip->efi_flags)) {
> >>                 spin_lock(&ailp->xa_lock);
> >>                 /* xfs_trans_ail_delete() drops the AIL lock. */
> >>                 xfs_trans_ail_delete(ailp,&efip->efi_item,
> >>                                      SHUTDOWN_LOG_IO_ERROR);
> >>                 xfs_efi_item_free(efip);
> >>         }
> >>}
> >>
> >>seems obviously broken.
> >>
> >>  * test_and_clear_bit - Clear a bit and return its old value
> >>
> >>so it is saying if XFS_EFI_COMMITED was *not* set, free the efi.
> >>Which I think we must admit is broken.
> >>
> >>I have to get up to speed on this code, but it seems like at least
> >>this much above is quite obviously broken, and the previously
> >>proposed patch doesn't address it.  Am I missing something?
> >
> >Or are you saying that it's working as designed, so that the first
> >caller finds it set, clears it and does nothing, and the 2nd caller
> >finds it unset, and frees it?  help me out here ;)
> >
> >-Eric
> >
> 
> Yes. The order is IOP_COMMITTED and then IOP_UNPIN:

Batching means you can get:

	IOP_COMMITTED(a)
	IOP_COMMITTED(b)
	IOP_COMMITTED(c)
	.....
	IOP_COMMITTED(6)

	AIL_INSERT(a)
	AIL_INSERT(b)
	AIL_INSERT(c)
	.....
	AIL_INSERT(6)

	IOP_UNPIN(a)
	IOP_UNPIN(b)
	IOP_UNPIN(c)
	.....
	IOP_UNPIN(6)

So that means at minimum we have to handle the EFI committed/pinned
case when the EFD is committed as well. This is what the commit
b199c8a (xfs: Pull EFI/EFD handling out from under the AIL lock)
did - it moved the COMMITTED bit from the EFI IOP_UNPIN() callback
to the IOP_COMMITTED() callback.

But then the IOP_UNPIN() callback has to handle the same case by
freeing the EFI as well, and hence the XFS_EFI_COMMITTED() bit
became a proxy reference count.

Which doesn't handle the third case, which is the EFD being
processed before the EFI, and hence we get a shutdown because the
EFD is trying to remove the EFI from the AIL before it is inserted.

> The code makes sure the sequence EFI and then EFD are done in order.

No, it doesn't ensure they are done in order - it *works* when they
are done in order. If it ensured they were done in order, then the
EFD commit would not be allowed to proceed until the EFI wsa
committed. The code does not do that, not does it need to. But it
needs to handle the EFD being processed before the EFI...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03  3:09 [PATCH] xfs: don't free EFIs before the EFDs are committed Dave Chinner
  2013-04-03 19:12 ` Mark Tinguely
@ 2013-04-04 22:06 ` Mark Tinguely
  2013-04-05  0:45   ` Dave Chinner
  2013-04-05 18:31 ` Ben Myers
  2 siblings, 1 reply; 10+ messages in thread
From: Mark Tinguely @ 2013-04-04 22:06 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On 04/02/13 22:09, Dave Chinner wrote:
> From: Dave Chinner<dchinner@redhat.com>
>
> Filesystems are occasionally being shut down with this error:
>
> xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> not in the AIL.
>
> It was diagnosed to be related to the EFI/EFD commit order when the
> EFI and EFD are in different checkpoints and the EFD is committed
> before the EFI here:
>
> http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
>
> The real problem is that a single bit cannot fully describe the
> states that the EFI/EFD processing can be in. These completion
> states are:
>
> EFI			EFI in AIL	EFD		Result
> committed/unpinned	Yes		committed	OK
> committed/pinned	No		committed	Shutdown
> uncommitted		No		committed	Shutdown
>
>
> Note that the "result" field is what should happen, not what does
> happen. The current logic is broken and handles the first two cases
> correctly by luck.  That is, the code will free the EFI if the
> XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
> inverted logic "works" because if both EFI and EFD are committed,
> then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> bit, and the second frees the EFI item. Hence as long as
> xfs_efi_item_committed() has been called, everything appears to be
> fine.
>
> It is the third case where the logic fails - where
> xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> and that results in the EFI being freed before it has been
> committed. That is the bug that triggered the shutdown, and hence
> keeping track of whether the EFI has been committed or not is
> insufficient to correctly order the EFI/EFD operations w.r.t. the
> AIL.
>
> What we really want is this: the EFI is always placed into the
> AIL before the last reference goes away. The only way to guarantee
> that is that the EFI is not freed until after it has been unpinned
> *and* the EFD has been committed. That is, restructure the logic so
> that the only case that can occur is the first case.
>
> This can be done easily by replacing the XFS_EFI_COMMITTED with an
> EFI reference count. The EFI is initialised with it's own count, and
> that is not released until it is unpinned. However, there is a
> complication to this method - the high level EFI/EFD code in
> xfs_bmap_finish() does not hold direct references to the EFI
> structure, and runs a transaction commit between the EFI and EFD
> processing. Hence the EFI can be freed even before the EFD is
> created using such a method.
>
> Further, log recovery uses the AIL for tracking EFI/EFDs that need
> to be recovered, but it uses the AIL *differently* to the EFI
> transaction commit. Hence log recovery never pins or unpins EFIs, so
> we can't drop the EFI reference count indirectly to free the EFI.
>
> However, this doesn't prevent us from using a reference count here.
> There is a 1:1 relationship between EFIs and EFDs, so when we
> initialise the EFI we can take a reference count for the EFD as
> well. This solves the xfs_bmap_finish() issue - the EFI will never
> be freed until the EFD is processed. In terms of log recovery,
> during the committing of the EFD we can look for the
> XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
> thereby ensuring everything works correctly there as well.

Duh me, beside what our problem where the cil push callbacks are called 
out of order, there can be several xfs_trans_committed_bulk() running at 
the same time.

Yep, the counter allows efi and efd to happen in any order and make sure 
the efi is entered into the AIL.

Okay, the pass2 of recovery places the efi item into the AIL, and the 
second decrement on the counter when the XFS_EFI_RECOVERED bit gets the 
counter to zero in the recovery case because there is no unpin.

Reviewed-by: Mark Tinguely <tinguely@sgi.com>

This patch also prevents the only chance of detecting of out of sequence 
ctx callbacks. The out of sequence ctx callbacks *does* matter because 
the li_lsn is set by the ctx->start_lsn and the lowest li_lsn is used to 
move the tail. When the callbacks are out of order, it is possible for 
the larger lsn of the greater ctx sequence to move the log tail beyond a 
lower ctx sequence entries before the entries for the lower sequence ctx 
are moved into the AIL.

--Mark.

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-04 22:06 ` Mark Tinguely
@ 2013-04-05  0:45   ` Dave Chinner
  0 siblings, 0 replies; 10+ messages in thread
From: Dave Chinner @ 2013-04-05  0:45 UTC (permalink / raw)
  To: Mark Tinguely; +Cc: xfs

On Thu, Apr 04, 2013 at 05:06:31PM -0500, Mark Tinguely wrote:
> On 04/02/13 22:09, Dave Chinner wrote:
> >From: Dave Chinner<dchinner@redhat.com>
> >
> >Filesystems are occasionally being shut down with this error:
> >
> >xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> >not in the AIL.
> >
> >It was diagnosed to be related to the EFI/EFD commit order when the
> >EFI and EFD are in different checkpoints and the EFD is committed
> >before the EFI here:
> >
> >http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
> >
> >The real problem is that a single bit cannot fully describe the
> >states that the EFI/EFD processing can be in. These completion
> >states are:
> >
> >EFI			EFI in AIL	EFD		Result
> >committed/unpinned	Yes		committed	OK
> >committed/pinned	No		committed	Shutdown
> >uncommitted		No		committed	Shutdown
> >
> >
> >Note that the "result" field is what should happen, not what does
> >happen. The current logic is broken and handles the first two cases
> >correctly by luck.  That is, the code will free the EFI if the
> >XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
> >inverted logic "works" because if both EFI and EFD are committed,
> >then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> >bit, and the second frees the EFI item. Hence as long as
> >xfs_efi_item_committed() has been called, everything appears to be
> >fine.
> >
> >It is the third case where the logic fails - where
> >xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> >and that results in the EFI being freed before it has been
> >committed. That is the bug that triggered the shutdown, and hence
> >keeping track of whether the EFI has been committed or not is
> >insufficient to correctly order the EFI/EFD operations w.r.t. the
> >AIL.
> >
> >What we really want is this: the EFI is always placed into the
> >AIL before the last reference goes away. The only way to guarantee
> >that is that the EFI is not freed until after it has been unpinned
> >*and* the EFD has been committed. That is, restructure the logic so
> >that the only case that can occur is the first case.
> >
> >This can be done easily by replacing the XFS_EFI_COMMITTED with an
> >EFI reference count. The EFI is initialised with it's own count, and
> >that is not released until it is unpinned. However, there is a
> >complication to this method - the high level EFI/EFD code in
> >xfs_bmap_finish() does not hold direct references to the EFI
> >structure, and runs a transaction commit between the EFI and EFD
> >processing. Hence the EFI can be freed even before the EFD is
> >created using such a method.
> >
> >Further, log recovery uses the AIL for tracking EFI/EFDs that need
> >to be recovered, but it uses the AIL *differently* to the EFI
> >transaction commit. Hence log recovery never pins or unpins EFIs, so
> >we can't drop the EFI reference count indirectly to free the EFI.
> >
> >However, this doesn't prevent us from using a reference count here.
> >There is a 1:1 relationship between EFIs and EFDs, so when we
> >initialise the EFI we can take a reference count for the EFD as
> >well. This solves the xfs_bmap_finish() issue - the EFI will never
> >be freed until the EFD is processed. In terms of log recovery,
> >during the committing of the EFD we can look for the
> >XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
> >thereby ensuring everything works correctly there as well.
> 
> Duh me, beside what our problem where the cil push callbacks are
> called out of order, there can be several xfs_trans_committed_bulk()
> running at the same time.

I'm not sure that can actually happen - callback processing is
supposed to be serialised by xlog_state_do_callback()....

> Yep, the counter allows efi and efd to happen in any order and make
> sure the efi is entered into the AIL.
> 
> Okay, the pass2 of recovery places the efi item into the AIL, and
> the second decrement on the counter when the XFS_EFI_RECOVERED bit
> gets the counter to zero in the recovery case because there is no
> unpin.
> 
> Reviewed-by: Mark Tinguely <tinguely@sgi.com>

Thank you.

> This patch also prevents the only chance of detecting of out of
> sequence ctx callbacks.

I don't think the callbacks are being attached out of order.  Like I
said previously: if that were happening then the commit records are
being written to the log in the wrong order and that is a *major*
issue.

As such, I've attached the informal proof that I just ran through to
verify that the current code is ordering commit records and hence
callbacks on iclogs correctly for multiple outstanding checkpoints.
I don't see any obvious hole in the logic as the callbacks are
attached to the iclog the commit record is written into and the
commit records are strictly serialised and ordered...

> The out of sequence ctx callbacks *does*
> matter because the li_lsn is set by the ctx->start_lsn and the
> lowest li_lsn is used to move the tail. When the callbacks are out
> of order, it is possible for the larger lsn of the greater ctx
> sequence to move the log tail beyond a lower ctx sequence entries
> before the entries for the lower sequence ctx are moved into the
> AIL.

Assuming we are getting callbacks running out of order, this is only
a problem if the AIL is empty at the time the out of order callbacks
are run. If the AIL is not empty, then the tail of the log is
derived directly from the LSN of the first item in the AIL and that
LSN will be from an item committed from a previous checkpoint, not
the current one that is runing out of order. Hence the tail of the
log is not affected in this case.

If the AIL is empty, then the tail LSN is determined by the lowest
LSN of the iclogs being processed concurrently. This code in
xlog_state_do_callback() is supposed to serialise and order the
completion of multiple concurrent iclog completions:

		lowest_lsn = xlog_get_lowest_lsn(log);
		if (lowest_lsn &&
		    XFS_LSN_CMP(lowest_lsn,
				be64_to_cpu(iclog->ic_header.h_lsn)) < 0) {
			iclog = iclog->ic_next;
			continue; /* Leave this iclog for
				   * another thread */
		}

IOWs, if the current iclog being checked for IO completion
processing has a higher LSN than the lowest in all the iclogs that
are not clean, then skip over it and check the next iclog. So, even
if we have multiple concurrent iclogs in
XLOG_STATE_DONE_SYNC/XLOG_STATE_DO_CALLBACK state, only the first
that comes across the iclog with the lowest LSN in these states is
supposed to run callbacks. And an iclog is considered in the
xlog_get_lowest_lsn() calculation until it becomes dirty/active,
which is a state transition that occurs *after* callbacks are run
on the iclog.

Further, once an iclog has been selected for callback processing, if
it has callbacks on it, we do this:

		if (iclog->ic_callback)
			atomic64_set(&log->l_last_sync_lsn,
				be64_to_cpu(iclog->ic_header.h_lsn));

to set the tail of the log to match that of the current LSN
of the iclog that is being processed. This is an "advance" update of
the log tail that will be used until items are inserted into
the AIL by callback processing (see xlog_assign_tail_lsn()). If
there are not items to be placed into the AIL (i.e. it remains
empty) then this ensures that the next log write still gets the tail
LSN correct.

And by this analysis, we should not be seeing concurrent execution
of callbacks on different iclogs, nor should we be seeing out of
order execution of callbacks on different iclogs. And because
callbacks are ordered correctly by the CIL push code we shouldn't be
seeing callbacks on a single iclog being run out of order
either.

IOWs, the tail LSN of the log should not be jumping around due to
out of order iclog and/or callback processing.  Unless there
is a problem in xlog_state_do_callback() with serialising and
ordering iclog IO completion processing, there shouldn't be a
problem with out of order callback processing, either.

Note that I haven't been through the iclog state machine in fine
detail since 2007, so I'm going to need to spend a bit of time
looking at it again to confirm this. I don't see any obvious problem
at the moment, though....

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com

commit CIL operation:

	down_write(ctx lock)
	increment sequence
	switch ctxs
	spin_lock(cil lock)
	add old ctx to commit list
	spin_unlock(cil lock)
	up_write(ctx lock)

	write checkpoint to log

	spin_lock(cil lock)
	walk commiting list and wait for prior sequences to commit
	spin_unlock(cil lock)

	xfs_log_done()		-> writes commit record, returns iclog, commit_lsn
	xfs_log_notify()	-> attaches callback to iclog tail

	spin_lock(cil lock)
	ctx->commit_lsn = commit_lsn
	wake waiters
	spin_unlock(cil lock)
	release iclog

	<iclog IO started>
	<iclog IO complete>
	<iclog run callbacks>

	spin_lock(cil lock)
	remove from committing list.
	spin_unlock(cil lock)


Informal proof of sequence commit order correctness by induction:

The initial context switching segment runs under an exclusive lock, and it means
that S2 can not be started (and hence committed) until S1 is already in the
committing list. So when S2 commits, if S1 has not yet written a commit record
we have the initial condition of:

	commiting list:
		S2, commit_lsn = 0
		S1, commit_lsn = 0

Now, if S2 writes it's checkpoint into the iclogs before S1 completes (e.g. S1
is a 10MB checkpoint, S2 is single inode core from a fsync) S2 enters the
committing list check and sees:

528 restart:
529         spin_lock(&cil->xc_cil_lock);
530         list_for_each_entry(new_ctx, &cil->xc_committing, committing) {

First entry is S2

531                 /*
532                  * Higher sequences will wait for this one so skip them.
533                  * Don't wait for own own sequence, either.
534                  */
535                 if (new_ctx->sequence >= ctx->sequence)
536                         continue;

new_ctx->sequence = S2, ctx->sequence = S2, so continue;

Second entry is S1, so we see:

new_ctx->sequence = S1, ctx->sequence = S2, so we  now check the commit lsn:

537                 if (!new_ctx->commit_lsn) {

new_ctx->commit_lsn = 0, so we:

538                         /*
539                          * It is still being pushed! Wait for the push to
540                          * complete, then start again from the beginning.
541                          */
542                         xlog_wait(&cil->xc_commit_wait, &cil->xc_cil_lock);
543                         goto restart;

wait and restart once woken.

544                 }
545         }
546         spin_unlock(&cil->xc_cil_lock);

Eventually, S1 finishes writing the checkpoint, and enters the commit record loop:

528 restart:
529         spin_lock(&cil->xc_cil_lock);
530         list_for_each_entry(new_ctx, &cil->xc_committing, committing) {

First entry is S2

531                 /*
532                  * Higher sequences will wait for this one so skip them.
533                  * Don't wait for own own sequence, either.
534                  */
535                 if (new_ctx->sequence >= ctx->sequence)
536                         continue;

new_ctx->sequence = S2, ctx->sequence = S1, so we continue.

Second entry is S1, so again we continue, hit the end of list and drop out of
the loop. At this point S1 writes the commit record, then attaches the callbacks
to the *tail* of the iclog callback list. In doing this, it gets the commit_lsn
of the commit record, so it does:

	ctx->commit_lsn = commit_lsn
	wakeup_all(&cil->xc_commit_wait);

Which wakes up S2. S2 now does:

_
528 restart:
529         spin_lock(&cil->xc_cil_lock);
530         list_for_each_entry(new_ctx, &cil->xc_committing, committing) {

First entry is S2

531                 /*
532                  * Higher sequences will wait for this one so skip them.
533                  * Don't wait for own own sequence, either.
534                  */
535                 if (new_ctx->sequence >= ctx->sequence)
536                         continue;

new_ctx->sequence = S2, ctx->sequence = S2, so we continue.

Second entry is S1, so we now check the commit lsn:

537                 if (!new_ctx->commit_lsn) {

new_ctx->commit_lsn != 0, so again we continue, hit the end of list and fall
through to the commit record write. This attaches the callback to the *tail* of
the iclog callback list, so it ordered *after* the callbacks for S1.

Hence the commit records appear to be correctly ordered by this code, as are the
callbacks being attached to iclogs. So, for the case of S1, S2 the code is safe.

For the case of S1, S2, S3, we have the same initial conditions:

	- S2 is not made active until S1 is on the commit list
	- S3 is not made active until S2 is on the commit list

Hence if we have S1 still committing while S2 and S3 have completed their
commits, the wait cycle looks like:

S2:
	Sees S2, continue;
	Sees S1, sees commit_lsn == 0, waits
S3:
	Sees S3, continue;
	Sees S2, sees commit_lsn == 0, waits

S1: completes commit,
	Sees S3, continue;
	Sees S2, continue;
	Sees S1, continue;
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

S2:						S3:
	wake, restart				wake, restart
						[win wakeup race on cil lock]
						Sees S3, continue
						Sees S2, sees commit_lsn == 0, wait

	Sees S3, continue
	Sees S2, continue
	Sees S1, sees commit_lsn != 0, continue
						[lost wakeup race, spins on cil lock]
						[not woken until S2 already through]
						Sees S3, continue
						Sees S2, sees commit_lsn == 0, wait

	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

S3:
	Sees S3, continue
	Sees S2, sees commit_lsn != 0, continue
	Sees S1, sees commit_lsn != 0, continue
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

And so the ordering of commit records and multiple callbacks on the same iclog
are correct for S1, S2 and S3.

So, for and arbitrary set of sequences {S1...Sn, S(n+1)}, the initial conditions are:

	- S2 is not made active until S1 is on the commit list
	....
	- S(n+1) is not made active until Sn is on the commit list

Hence if we have S1 still committing while Sn and S(n+1) have completed their
commits, the wait cycle looks like:

Sn:
	Sees Sn, continue;
	.....
	Sees S1, sees commit_lsn == 0, waits
S(n+1_:
	Sees S(n+1), continue;
	Sees Sn, sees commit_lsn == 0, waits

S1: completes commit,
	Sees S(n+1), continue;
	Sees Sn, continue;
	.....
	Sees S1, continue;
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

Sn:						S(n+1):
	wake, restart				wake, restart
						[win wakeup race on cil lock]
						Sees S(n+1), continue
						Sees Sn, sees commit_lsn == 0, wait

	Sees S(n+1), continue
	Sees Sn, continue
	.....
	Sees S1, sees commit_lsn != 0, continue
						[lost wakeup race, spins on cil lock]
						[not woken until Sn already through]
						Sees S(n+1), continue
						Sees Sn, sees commit_lsn == 0, wait

	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

S3:
	Sees S(n+1), continue
	Sees Sn, sees commit_lsn != 0, continue
	.....
	Sees S1, sees commit_lsn != 0, continue
	Writes commit record
	Attaches callbacks
	sets commit_lsn
	wakeup_all()

Hence by induction the ordering of commit records and multiple callbacks on the
same iclog are correct for {S1...Sn, S(n+1)}

QED

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH] xfs: don't free EFIs before the EFDs are committed
  2013-04-03  3:09 [PATCH] xfs: don't free EFIs before the EFDs are committed Dave Chinner
  2013-04-03 19:12 ` Mark Tinguely
  2013-04-04 22:06 ` Mark Tinguely
@ 2013-04-05 18:31 ` Ben Myers
  2 siblings, 0 replies; 10+ messages in thread
From: Ben Myers @ 2013-04-05 18:31 UTC (permalink / raw)
  To: Dave Chinner; +Cc: xfs

On Wed, Apr 03, 2013 at 02:09:21PM +1100, Dave Chinner wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Filesystems are occasionally being shut down with this error:
> 
> xfs_trans_ail_delete_bulk: attempting to delete a log item that is
> not in the AIL.
> 
> It was diagnosed to be related to the EFI/EFD commit order when the
> EFI and EFD are in different checkpoints and the EFD is committed
> before the EFI here:
> 
> http://oss.sgi.com/archives/xfs/2013-01/msg00082.html
> 
> The real problem is that a single bit cannot fully describe the
> states that the EFI/EFD processing can be in. These completion
> states are:
> 
> EFI			EFI in AIL	EFD		Result
> committed/unpinned	Yes		committed	OK
> committed/pinned	No		committed	Shutdown
> uncommitted		No		committed	Shutdown
> 
> 
> Note that the "result" field is what should happen, not what does
> happen. The current logic is broken and handles the first two cases
> correctly by luck.  That is, the code will free the EFI if the
> XFS_EFI_COMMITTED bit is *not* set, rather than if it is set. The
> inverted logic "works" because if both EFI and EFD are committed,
> then the first __xfs_efi_release() call clears the XFS_EFI_COMMITTED
> bit, and the second frees the EFI item. Hence as long as
> xfs_efi_item_committed() has been called, everything appears to be
> fine.
> 
> It is the third case where the logic fails - where
> xfs_efd_item_committed() is called before xfs_efi_item_committed(),
> and that results in the EFI being freed before it has been
> committed. That is the bug that triggered the shutdown, and hence
> keeping track of whether the EFI has been committed or not is
> insufficient to correctly order the EFI/EFD operations w.r.t. the
> AIL.
> 
> What we really want is this: the EFI is always placed into the
> AIL before the last reference goes away. The only way to guarantee
> that is that the EFI is not freed until after it has been unpinned
> *and* the EFD has been committed. That is, restructure the logic so
> that the only case that can occur is the first case.
> 
> This can be done easily by replacing the XFS_EFI_COMMITTED with an
> EFI reference count. The EFI is initialised with it's own count, and
> that is not released until it is unpinned. However, there is a
> complication to this method - the high level EFI/EFD code in
> xfs_bmap_finish() does not hold direct references to the EFI
> structure, and runs a transaction commit between the EFI and EFD
> processing. Hence the EFI can be freed even before the EFD is
> created using such a method.
> 
> Further, log recovery uses the AIL for tracking EFI/EFDs that need
> to be recovered, but it uses the AIL *differently* to the EFI
> transaction commit. Hence log recovery never pins or unpins EFIs, so
> we can't drop the EFI reference count indirectly to free the EFI.
> 
> However, this doesn't prevent us from using a reference count here.
> There is a 1:1 relationship between EFIs and EFDs, so when we
> initialise the EFI we can take a reference count for the EFD as
> well. This solves the xfs_bmap_finish() issue - the EFI will never
> be freed until the EFD is processed. In terms of log recovery,
> during the committing of the EFD we can look for the
> XFS_EFI_RECOVERED bit being set and drop the EFI reference as well,
> thereby ensuring everything works correctly there as well.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>

Applied.

Regards,
	Ben

_______________________________________________
xfs mailing list
xfs@oss.sgi.com
http://oss.sgi.com/mailman/listinfo/xfs

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2013-04-05 18:31 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-03  3:09 [PATCH] xfs: don't free EFIs before the EFDs are committed Dave Chinner
2013-04-03 19:12 ` Mark Tinguely
2013-04-03 19:46   ` Eric Sandeen
2013-04-03 21:02     ` Eric Sandeen
2013-04-03 21:45       ` Mark Tinguely
2013-04-04  1:31         ` Dave Chinner
2013-04-04  1:14   ` Dave Chinner
2013-04-04 22:06 ` Mark Tinguely
2013-04-05  0:45   ` Dave Chinner
2013-04-05 18:31 ` Ben Myers

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox