cluster-devel.redhat.com archive mirror
* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Bob Peterson @ 2007-08-08 21:52 UTC (permalink / raw)
  To: cluster-devel.redhat.com

This is for bugzilla bug #248176: GFS2: invalid metadata block

Patches 1 through 3 were accepted upstream, but there were problems
with 4 and 5.  Those issues have been resolved and the recovery
tests now pass without errors.  This code went through
41 * 3 successful gfs2 recovery tests before hitting an
unrelated (openais) problem.

This is a complete rewrite of patch 4 for bug #248176.

Part of the problem was that inodes were being recycled
before their buffers were flushed to the journal logs.
Another problem was that the clone bitmaps were being
searched for deleted inodes to recycle, but only the
"real" bitmaps should be searched for that purpose.
--
Signed-off-by: Bob Peterson <rpeterso@redhat.com> 
--
 fs/gfs2/rgrp.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletions(-)

diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
index b93ac45..2d7f7ea 100644
--- a/fs/gfs2/rgrp.c
+++ b/fs/gfs2/rgrp.c
@@ -865,12 +865,15 @@ static struct inode *try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked)
 	struct inode *inode;
 	u32 goal = 0, block;
 	u64 no_addr;
+	struct gfs2_sbd *sdp = rgd->rd_sbd;
 
 	for(;;) {
 		if (goal >= rgd->rd_data)
 			break;
+		down_write(&sdp->sd_log_flush_lock);
 		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
 				     GFS2_BLKST_UNLINKED);
+		up_write(&sdp->sd_log_flush_lock);
 		if (block == BFITNOENT)
 			break;
 		/* rgblk_search can return a block < goal, so we need to
@@ -1295,7 +1298,9 @@ static u32 rgblk_search(struct gfs2_rgrpd *rgd, u32 goal,
 	   allocatable block anywhere else, we want to be able wrap around and
 	   search in the first part of our first-searched bit block.  */
 	for (x = 0; x <= length; x++) {
-		if (bi->bi_clone)
+		/* The GFS2_BLKST_UNLINKED state doesn't apply to the clone
+		   bitmaps, so we must search the originals for that. */
+		if (old_state != GFS2_BLKST_UNLINKED && bi->bi_clone)
 			blk = gfs2_bitfit(rgd, bi->bi_clone + bi->bi_offset,
 					  bi->bi_len, goal, old_state);
 		else





* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Wendy Cheng @ 2007-08-09 13:46 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Bob Peterson wrote:
> Part of the problem was that inodes were being recycled
> before their buffers were flushed to the journal logs.
>
>   
Set aside "after this patch, the problem goes away" thing ...

I haven't checked previous three patches yet so I may not have the 
overall picture ... but why adding the journal flush spin lock here 
could prevent the new inode to get re-used before its associated buffer 
are flushed to the logs ? Could you elaborate more ?

> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
> index b93ac45..2d7f7ea 100644
> --- a/fs/gfs2/rgrp.c
> +++ b/fs/gfs2/rgrp.c
> @@ -865,12 +865,15 @@ static struct inode *try_rgrp_unlink(struct gfs2_rgrpd *rgd, u64 *last_unlinked)
>  	struct inode *inode;
>  	u32 goal = 0, block;
>  	u64 no_addr;
> +	struct gfs2_sbd *sdp = rgd->rd_sbd;
>  
>  	for(;;) {
>  		if (goal >= rgd->rd_data)
>  			break;
> +		down_write(&sdp->sd_log_flush_lock);
>  		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
>  				     GFS2_BLKST_UNLINKED);
> +		up_write(&sdp->sd_log_flush_lock);
>  		if (block == BFITNOENT)
>  			break;
>  		
My concern is that GFS2's usage of sd_log_flush_lock has been very 
abused lately. The journal logic is gradually becoming difficult to 
understand and maintain. With this change, we move a local spin lock 
(that belongs to log.c) into another sub-component (rgrp). Intuitively, 
this is not right.

-- Wendy




* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Wendy Cheng @ 2007-08-09 13:51 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Sorry ... hand and head do not coordinate well. Somehow I wrote "spin 
lock" but it is actually a semaphore.... Wendy
> Bob Peterson wrote:
>> Part of the problem was that inodes were being recycled
>> before their buffers were flushed to the journal logs.
>>
>>   
> Set aside "after this patch, the problem goes away" thing ...
>
> I haven't checked previous three patches yet so I may not have the 
> overall picture ... but why adding the journal flush spin lock here 
> could prevent the new inode to get re-used before its associated 
> buffer are flushed to the logs ? Could you elaborate more ?
>
>> diff --git a/fs/gfs2/rgrp.c b/fs/gfs2/rgrp.c
>> index b93ac45..2d7f7ea 100644
>> --- a/fs/gfs2/rgrp.c
>> +++ b/fs/gfs2/rgrp.c
>> @@ -865,12 +865,15 @@ static struct inode *try_rgrp_unlink(struct 
>> gfs2_rgrpd *rgd, u64 *last_unlinked)
>>      struct inode *inode;
>>      u32 goal = 0, block;
>>      u64 no_addr;
>> +    struct gfs2_sbd *sdp = rgd->rd_sbd;
>>  
>>      for(;;) {
>>          if (goal >= rgd->rd_data)
>>              break;
>> +        down_write(&sdp->sd_log_flush_lock);
>>          block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
>>                       GFS2_BLKST_UNLINKED);
>> +        up_write(&sdp->sd_log_flush_lock);
>>          if (block == BFITNOENT)
>>              break;
>>         
> My concern is that GFS2's usage of sd_log_flush_lock has been very 
> abused lately. The journal logic is gradually becoming difficult to 
> understand and maintain. With this change, we move a local spin lock 
> (that belongs to log.c) into another sub-component (rgrp). 
> Intuitively, this is not right.
>
> -- Wendy
>
>




* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Bob Peterson @ 2007-08-09 15:37 UTC (permalink / raw)
  To: cluster-devel.redhat.com

On Thu, 2007-08-09 at 09:46 -0400, Wendy Cheng wrote:
> Set aside "after this patch, the problem goes away" thing ...
> 
> I haven't checked previous three patches yet so I may not have the 
> overall picture ... but why adding the journal flush spin lock here 
> could prevent the new inode to get re-used before its associated buffer 
> are flushed to the logs ? Could you elaborate more ?
> 
> > +		down_write(&sdp->sd_log_flush_lock);
> >  		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
> >  				     GFS2_BLKST_UNLINKED);
> > +		up_write(&sdp->sd_log_flush_lock);

IIRC, if we don't protect rgblk_search from finding GFS2_BLKST_UNLINKED
blocks, a "deleted" inode may be returned to function
gfs2_inplace_reserve_i which will do an iput on the inode,
which may reference buffers that are being flushed to disk.
If almost all blocks in that bitmap are allocated, I think the
deleted block may sometimes be reused and the buffer 
associated with the reused block may be changed before it's
actually written out to disk.
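
Very roughly, the interleaving I have in mind looks like the sketch
below (simplified and hypothetical ordering; not the exact call chain):

    /* allocator                                journal flush
     *
     * try_rgrp_unlink()
     *   rgblk_search(..., GFS2_BLKST_UNLINKED)
     *     finds a "deleted" inode              gfs2_log_flush() starts
     * gfs2_inplace_reserve_i()                 writing that inode's
     *   iput(inode) -> its blocks are freed    buffers toward disk ...
     * another allocation reuses the block and
     * modifies its buffer before the flush has
     * actually written the old contents out
     */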

> My concern is that GFS2's usage of sd_log_flush_lock has been very 
> abused lately. The journal logic is gradually becoming difficult to 
> understand and maintain. With this change, we move a local spin lock 
> (that belongs to log.c) into another sub-component (rgrp). Intuitively, 
> this is not right.
> 
> -- Wendy

I agree that the journal logic is overly complex and difficult to
understand.  It's always been that way, but it is getting worse.
I'd like to do some work on that section of code to simplify it,
but only after we've stabilized the code for rhel5.1.

Regards,

Bob Peterson
Red Hat Cluster Suite





* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Wendy Cheng @ 2007-08-09 18:21 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Bob Peterson wrote:
> On Thu, 2007-08-09 at 09:46 -0400, Wendy Cheng wrote:
>   
>> Set aside "after this patch, the problem goes away" thing ...
>>
>> I haven't checked previous three patches yet so I may not have the 
>> overall picture ... but why adding the journal flush spin lock here 
>> could prevent the new inode to get re-used before its associated buffer 
>> are flushed to the logs ? Could you elaborate more ?
>>
>>     
>>> +		down_write(&sdp->sd_log_flush_lock);
>>>  		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
>>>  				     GFS2_BLKST_UNLINKED);
>>> +		up_write(&sdp->sd_log_flush_lock);
>>>       
>
> IIRC, if we don't protect rgblk_search from finding GFS2_BLKST_UNLINKED
> blocks, a "deleted" inode may be returned to function
> gfs2_inplace_reserve_i which will do an iput on the inode,
> which may reference buffers that are being flushed to disk.
> If almost all blocks in that bitmap are allocated, I think the
> deleted block may sometimes be reused and the buffer 
> associated with the reused block may be changed before it's
> actually written out to disk.
>   

Log flushing is an asynchronous event. I still don't see how this can 
*protect* the condition you just described (i.e., prevents the block 
being assigned to someone else before log flush occurs).  Or do I 
understand your statement right (i.e., the log flushing must occur 
before the block is used by someone else) ? It may *reduce* the 
possibility (if log flushing happens at the same time as this 
assignment) but I don't see how it can *stop* the condition.

You may "reduce" the (rare) possibility but the real issue is still 
hanging there ? If this is true, then I don't agree we have to pay the 
price of moving a journal flushing lock into resource handling code.

-- Wendy




* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Steven Whitehouse @ 2007-08-10  8:26 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Thu, 2007-08-09 at 14:21 -0400, Wendy Cheng wrote:
> Bob Peterson wrote:
> > On Thu, 2007-08-09 at 09:46 -0400, Wendy Cheng wrote:
> >   
> >> Set aside "after this patch, the problem goes away" thing ...
> >>
> >> I haven't checked previous three patches yet so I may not have the 
> >> overall picture ... but why adding the journal flush spin lock here 
> >> could prevent the new inode to get re-used before its associated buffer 
> >> are flushed to the logs ? Could you elaborate more ?
> >>
> >>     
> >>> +		down_write(&sdp->sd_log_flush_lock);
> >>>  		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
> >>>  				     GFS2_BLKST_UNLINKED);
> >>> +		up_write(&sdp->sd_log_flush_lock);
> >>>       
> >
> > IIRC, if we don't protect rgblk_search from finding GFS2_BLKST_UNLINKED
> > blocks, a "deleted" inode may be returned to function
> > gfs2_inplace_reserve_i which will do an iput on the inode,
> > which may reference buffers that are being flushed to disk.
> > If almost all blocks in that bitmap are allocated, I think the
> > deleted block may sometimes be reused and the buffer 
> > associated with the reused block may be changed before it's
> > actually written out to disk.
> >   
> 
> Log flushing is an asynchronous event. I still don't see how this can 
> *protect* the condition you just described (i.e., prevents the block 
> being assigned to someone else before log flush occurs).  Or do I 
> understand your statement right (i.e., the log flushing must occur 
> before the block is used by someone else) ? It may *reduce* the 
> possibility (if log flushing happens at the same time as this 
> assignment) but I don't see how it can *stop* the condition.
> 
> You may "reduce" the (rare) possibility but the real issue is still 
> hanging there ? If this is true, then I don't agree we have to pay the 
> price of moving a journal flushing lock into resource handling code.
> 
> -- Wendy
> 

Due to the way in which the locking is defined, the journal lock is also
used to keep other processes out of the rgrp bitmaps. This prevents the
state of the rgrp bitmaps changing while we are scanning them in case a
journal flush might occur.

The sd_log_flush_lock is an rwsem which is held in read mode by each and
every transaction and in write mode when flushing the journal. Log
flushing ought to be an asynchronous event, but due to the design of the
journaling, it unfortunately isn't in GFS2. It is something that we need
to review in the future,
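
In outline it works something like this (illustrative sketch only;
error handling and the real call sites are omitted):

    /* every transaction holds the semaphore shared ... */
    down_read(&sdp->sd_log_flush_lock);    /* transaction start */
    /* ... build and commit the transaction ... */
    up_read(&sdp->sd_log_flush_lock);      /* transaction end   */

    /* ... and the log flush holds it exclusive, so taking it for
     * write (as the patch does around rgblk_search) also keeps all
     * in-flight transactions, and hence the bitmaps, stable: */
    down_write(&sdp->sd_log_flush_lock);   /* gfs2_log_flush() */
    /* ... write out the pending log ... */
    up_write(&sdp->sd_log_flush_lock);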

Steve.





* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Steven Whitehouse @ 2007-08-10 13:04 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Hi,

On Fri, 2007-08-10 at 09:12 -0400, Wendy Cheng wrote:
> Steven Whitehouse wrote:
> 
> >Hi,
> >
> >On Thu, 2007-08-09 at 14:21 -0400, Wendy Cheng wrote:
> >  
> >
> >>Bob Peterson wrote:
> >>    
> >>
> >>>On Thu, 2007-08-09 at 09:46 -0400, Wendy Cheng wrote:
> >>>  
> >>>      
> >>>
> >>>>Set aside "after this patch, the problem goes away" thing ...
> >>>>
> >>>>I haven't checked previous three patches yet so I may not have the 
> >>>>overall picture ... but why adding the journal flush spin lock here 
> >>>>could prevent the new inode to get re-used before its associated buffer 
> >>>>are flushed to the logs ? Could you elaborate more ?
> >>>>
> >>>>    
> >>>>        
> >>>>
> >>>>>+		down_write(&sdp->sd_log_flush_lock);
> >>>>> 		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
> >>>>> 				     GFS2_BLKST_UNLINKED);
> >>>>>+		up_write(&sdp->sd_log_flush_lock);
> >>>>>      
> >>>>>          
> >>>>>
> >>>IIRC, if we don't protect rgblk_search from finding GFS2_BLKST_UNLINKED
> >>>blocks, a "deleted" inode may be returned to function
> >>>gfs2_inplace_reserve_i which will do an iput on the inode,
> >>>which may reference buffers that are being flushed to disk.
> >>>If almost all blocks in that bitmap are allocated, I think the
> >>>deleted block may sometimes be reused and the buffer 
> >>>associated with the reused block may be changed before it's
> >>>actually written out to disk.
> >>>  
> >>>      
> >>>
> >>Log flushing is an asynchronous event. I still don't see how this can 
> >>*protect* the condition you just described (i.e., prevents the block 
> >>being assigned to someone else before log flush occurs).  Or do I 
> >>understand your statement right (i.e., the log flushing must occur 
> >>before the block is used by someone else) ? It may *reduce* the 
> >>possibility (if log flushing happens at the same time as this 
> >>assignment) but I don't see how it can *stop* the condition.
> >>
> >>You may "reduce" the (rare) possibility but the real issue is still 
> >>hanging there ? If this is true, then I don't agree we have to pay the 
> >>price of moving a journal flushing lock into resource handling code.
> >>
> >>-- Wendy
> >>
> >>    
> >>
> >
> >Due to the way in which the locking is defined, the journal lock is also
> >used to keep other processes out of the rgrp bitmaps. This prevents the
> >state of the rgrp bitmaps changing while we are scanning them in case a
> >journal flush might occur.
> >
> >The sd_log_flush_lock is an rwsem which is held in read mode by each and
> >every transaction and in write mode when flushing the journal. Log
> >flushing ought to be an asynchronous event, but due to the design of the
> >journaling, it unfortunately isn't in GFS2. It is something that we need
> >to review in the future,
> >
> >
> >  
> >
> 
> It is still not clear what exactly this lock protects. The unlinked 
> rgrp bitmap itself, or the buffers associated with these disk blocks? If 
> it is the latter (the buffers, as Bob said), it implies that for every block 
> GFS2 takes from this unlinked bitmap, a journal flush *has* to happen 
> before it can be used? Could you elaborate more?
> 
> -- Wendy
> 
A journal flush is required in order for blocks which have been freed
during the current transaction to become visible to the rest of the
filesystem again. We have two sets of bitmaps, the "normal" set and the
"clone" set. The "normal" set is what we read off disk and what we use
to allocate blocks from.

The "clone" set are created as an exact copy of the "normal" set if (and
only if) we try to deallocate some blocks. In that case the allocation
operation occurs in both bitmaps while the clone exists. When the
journal is flushed, the clone bitmap is copied back into the normal
bitmap for the rgrp, thus making the freed blocks available to the
filesystem for allocation in the following transactions.
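
To tie that back to the code in the patch, the two sets roughly
correspond to the per-bitmap fields that rgblk_search() uses (a sketch
only; the real definition in the GFS2 headers has more fields):

    struct gfs2_bitmap {
            struct buffer_head *bi_bh; /* "normal" bitmap: read from disk,
                                          used for ordinary allocation     */
            char *bi_clone;            /* "clone": exists only after blocks
                                          are deallocated in a transaction */
            u32 bi_offset;             /* offset of bitmap data in the block */
            u32 bi_len;                /* length of the bitmap data          */
    };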

When we are looking for unlinked, but not yet deallocated inodes to
free, we need to check the clone bitmap since that's where we mark the
inode free. If we don't do that, we might try to free the inode twice
(bug #1, which this patch solves). The other problem is that the locking
governing when the clone bitmap is written back into the normal bitmap
is the journal flush lock (as per the last email), and we have to hold it
to prevent a journal flush from changing the bitmap as we are scanning it.

Steve.





* [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED
From: Wendy Cheng @ 2007-08-10 13:12 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Steven Whitehouse wrote:

>Hi,
>
>On Thu, 2007-08-09 at 14:21 -0400, Wendy Cheng wrote:
>  
>
>>Bob Peterson wrote:
>>    
>>
>>>On Thu, 2007-08-09 at 09:46 -0400, Wendy Cheng wrote:
>>>  
>>>      
>>>
>>>>Set aside "after this patch, the problem goes away" thing ...
>>>>
>>>>I haven't checked previous three patches yet so I may not have the 
>>>>overall picture ... but why adding the journal flush spin lock here 
>>>>could prevent the new inode to get re-used before its associated buffer 
>>>>are flushed to the logs ? Could you elaborate more ?
>>>>
>>>>    
>>>>        
>>>>
>>>>>+		down_write(&sdp->sd_log_flush_lock);
>>>>> 		block = rgblk_search(rgd, goal, GFS2_BLKST_UNLINKED,
>>>>> 				     GFS2_BLKST_UNLINKED);
>>>>>+		up_write(&sdp->sd_log_flush_lock);
>>>>>      
>>>>>          
>>>>>
>>>IIRC, if we don't protect rgblk_search from finding GFS2_BLKST_UNLINKED
>>>blocks, a "deleted" inode may be returned to function
>>>gfs2_inplace_reserve_i which will do an iput on the inode,
>>>which may reference buffers that are being flushed to disk.
>>>If almost all blocks in that bitmap are allocated, I think the
>>>deleted block may sometimes be reused and the buffer 
>>>associated with the reused block may be changed before it's
>>>actually written out to disk.
>>>  
>>>      
>>>
>>Log flushing is an asynchronous event. I still don't see how this can 
>>*protect* the condition you just described (i.e., prevents the block 
>>being assigned to someone else before log flush occurs).  Or do I 
>>understand your statement right (i.e., the log flushing must occur 
>>before the block is used by someone else) ? It may *reduce* the 
>>possibility (if log flushing happens at the same time as this 
>>assignment) but I don't see how it can *stop* the condition.
>>
>>You may "reduce" the (rare) possibility but the real issue is still 
>>hanging there ? If this is true, then I don't agree we have to pay the 
>>price of moving a journal flushing lock into resource handling code.
>>
>>-- Wendy
>>
>>    
>>
>
>Due to the way in which the locking is defined, the journal lock is also
>used to keep other processes out of the rgrp bitmaps. This prevents the
>state of the rgrp bitmaps changing while we are scanning them in case a
>journal flush might occur.
>
>The sd_log_flush_lock is an rwsem which is held in read mode by each and
>every transaction and in write mode when flushing the journal. Log
>flushing ought to be an asynchronous event, but due to the design of the
>journaling, it unfortunately isn't in GFS2. It is something that we need
>to review in the future,
>
>
>  
>

It is still not clear what exactly this lock protects. The unlinked 
rgrp bitmap itself, or the buffers associated with these disk blocks? If 
it is the latter (the buffers, as Bob said), it implies that for every block 
GFS2 takes from this unlinked bitmap, a journal flush *has* to happen 
before it can be used? Could you elaborate more?

-- Wendy




Thread overview: 8 messages
2007-08-08 21:52 [Cluster-devel] [PATCH 4 of 5] Bz #248176: GFS2: invalid metadata block - REVISED Bob Peterson
2007-08-09 13:46 ` Wendy Cheng
2007-08-09 13:51   ` Wendy Cheng
2007-08-09 15:37   ` Bob Peterson
2007-08-09 18:21     ` Wendy Cheng
2007-08-10  8:26       ` Steven Whitehouse
2007-08-10 13:12         ` Wendy Cheng
2007-08-10 13:04           ` Steven Whitehouse
