Apparent race condition in do_journal_begin

All of lore.kernel.org
 help / color / mirror / Atom feed

* Apparent race condition in do_journal_begin_r() with UL kernel
@ 2003-09-16 15:12 Martin Wilck
  2003-09-16 15:15 ` Chris Mason
  2003-09-16 15:36 ` Oleg Drokin
  0 siblings, 2 replies; 7+ messages in thread
From: Martin Wilck @ 2003-09-16 15:12 UTC (permalink / raw)
  To: Reiser FS Mailing List

Hi,

this problem may or may not be related to the last message I sent to
this list ("is_tree_inode ..").

The problem occurs on the same machine, with the same test scenario.
I applied Oleg's patch and successfully got rid of the
"is_tree_inode..." messages. 

However, after ~20h of high IO stress I run into the following
situation:

- 1 of the involved Reiser filesystems cannot be accessed anymore.
  a number of processes in "D" state tries to access files on that file
  system:

STAT PID   TIME     START    WCHAN         COMMAND/ARGS
D    15846 00:00:00 11:33:36 107f93 down   cp /hd2/a55M2 /hd1/cp7
D    15879 00:00:00 11:33:37 107f93 down   cp /hd1/a50M1 /hd2/cp2
D    15915 00:00:29 11:33:37 107f93 down   cp /hd2/a117M2 /hd3/cp71
D    16160 00:00:00 11:33:38 107f93 down   cp /hd2/a40M2 /hd1/cp5
D    16289 00:00:00 11:33:40 107f93 down   cp /hd1/src1/f.64mb /hd2/dest1
D    16895 00:00:00 11:33:46 107f93 down   cp /hd1/src1/refdat.7 /hd2/dest1
D    16938 00:00:00 11:33:46 107f93 down   cp /hd2/dest1/refdat.6 /hd1/src1/refdat.6
D    16985 00:00:00 11:33:46 107f93 down   cp /hd2/dest1/refdat.5 /hd1/src1/refdat.5
D    17031 00:00:00 11:33:46 107f93 down   cp /hd2/dest1/refdat.4 /hd1/src1/refdat.4
D    17082 00:00:00 11:33:47 107f93 down   cp /hd1/src1/f.16mb /hd2/dest1
D    17087 00:00:00 11:33:47 107f93 down   cp /hd2/a10M1 /hd1/cp3
D    17100 00:00:00 11:33:47 107f93 down   cp /hd2/dest1/f.8mb /hd1/src1/f.8mb
D    17138 00:00:22 11:33:48 107f93 down   cp /hd1/src1/refdat.3 /hd2/dest1
D    17163 00:00:00 11:33:48 107f93 down   cp /hd2/dest1/refdat.1 /hd1/src1/refdat.1
D    17194 00:00:00 11:33:49 107f93 down   cp /hd2/dest1/refdat.2 /hd1/src1/refdat.2
D    17458 00:01:38 11:33:54 107f93 down   rm /hd2/src2/f.32mb
D    17463 00:00:00 11:33:54 107f93 down   rm /hd2/src2/f.8mb
D    17468 00:00:00 11:33:54 107f93 down   rm /hd2/src2/f.16mb
D    17473 00:00:00 11:33:54 107f93 down   rm /hd2/src2/f.64mb
D    17480 00:00:00 11:33:55 107f93 down   rm /hd2/dest1/f.32mb
D    17772 00:00:00 11:33:56 107f93 down   cp /hd2/a35M1 /hd1/cp6

  Note that all processes were started approximately at the same time, which was
  ~1h ago at the time I produced this output.
  Newly started processes accessing the filesystem will also end up in 
  uninterruptible sleep.

- A closer look reveals that some of these processes (at least those with
  TIME > 0 above) are not stalled in 'D', but keep running. Their WCHAN changes
  from "down" to "wait_on_buffer" from time to time.

- iostat reports a continuous activity of ~80 requests/s, 950kB/s 
  on the respective device.

- The partition itself (raw) can be accessed and read.

- debugreiserfs -p reports "BROKEN BLOCK HEAD" for some blocks. The
  number of blocks with "broken block head" varies between 0 and 4
  in subsequent debugreiserfs runs, the reported indices of broken
  blocks vary, too (of course).

- My analysis of call traces of the offending processes shows that 
  all 'cp' processes and one 'rm' seem to be stuck in the 
  lock_journal() call in do_journal_begin_r().
  The "cp" processes all end up in journal_begin() through
  __mark_inode_dirty() (both in read() and write() system calls).
  3 of the 'rm' processes are stuck in the down() call in sys_unlink().

So, AFAICT, this look like a race of at least 3 processes for some
ReiserFS locks / semaphores.

Any hints and ideas for further debugging are greatly appreciated.

PS: A very similar observation was the actual reason that we started
this debugging procedure in the first place. We happened to have the
"is_tree_inode ..." messages, too, though, and I thought the apparent
race was  a follow-up error of those. However, since the race also comes
to pass with Oleg's patch, it seems to be an unrelated problem.

Regards,
Martin

-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Apparent race condition in do_journal_begin_r() with UL kernel
  2003-09-16 15:12 Apparent race condition in do_journal_begin_r() with UL kernel Martin Wilck
@ 2003-09-16 15:15 ` Chris Mason
  2003-09-16 15:49   ` Oleg Drokin
  2003-09-16 15:36 ` Oleg Drokin
  1 sibling, 1 reply; 7+ messages in thread
From: Chris Mason @ 2003-09-16 15:15 UTC (permalink / raw)
  To: Martin Wilck; +Cc: Reiser FS Mailing List

On Tue, 2003-09-16 at 11:12, Martin Wilck wrote:
> Hi,
> 
> this problem may or may not be related to the last message I sent to
> this list ("is_tree_inode ..").
> 
> The problem occurs on the same machine, with the same test scenario.
> I applied Oleg's patch and successfully got rid of the
> "is_tree_inode..." messages. 

Ok, I've been tracking the is_tree_node messages for about a week now,
but only on the latest beta UL kernels.  Which UL kernel exactly have
you been using?

The patch of mine that Oleg forwarded shouldn't be needed.  There's
actually a smaller patch from vs@namesys.com that I'm trying here.

-chris



^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Apparent race condition in do_journal_begin_r() with UL kernel
  2003-09-16 15:15 ` Chris Mason
@ 2003-09-16 15:49   ` Oleg Drokin
  2003-09-17  7:44     ` Martin Wilck
  0 siblings, 1 reply; 7+ messages in thread
From: Oleg Drokin @ 2003-09-16 15:49 UTC (permalink / raw)
  To: Chris Mason; +Cc: Martin Wilck, Reiser FS Mailing List

Hello!

On Tue, Sep 16, 2003 at 11:15:40AM -0400, Chris Mason wrote:

> Ok, I've been tracking the is_tree_node messages for about a week now,
> but only on the latest beta UL kernels.  Which UL kernel exactly have
> you been using?

He is not using UL kernels, as I see.
here's the quote from previous mail:
"The kernel in use is SuSE 8.1 errata kernel 2.4.19, release 340."

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Apparent race condition in do_journal_begin_r() with UL kernel
  2003-09-16 15:49   ` Oleg Drokin
@ 2003-09-17  7:44     ` Martin Wilck
  0 siblings, 0 replies; 7+ messages in thread
From: Martin Wilck @ 2003-09-17  7:44 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Reiser FS Mailing List


> He is not using UL kernels, as I see.
> here's the quote from previous mail:
> "The kernel in use is SuSE 8.1 errata kernel 2.4.19, release 340."

That kernel is identical to UL 2.4.19 -340 AFAICT.

Martin
-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Apparent race condition in do_journal_begin_r() with UL kernel
  2003-09-16 15:12 Apparent race condition in do_journal_begin_r() with UL kernel Martin Wilck
  2003-09-16 15:15 ` Chris Mason
@ 2003-09-16 15:36 ` Oleg Drokin
  2003-09-19  9:32   ` Martin Wilck
  1 sibling, 1 reply; 7+ messages in thread
From: Oleg Drokin @ 2003-09-16 15:36 UTC (permalink / raw)
  To: Martin Wilck; +Cc: Reiser FS Mailing List, mason

[-- Attachment #1: Type: text/plain, Size: 2362 bytes --]

Hello!

On Tue, Sep 16, 2003 at 05:12:58PM +0200, Martin Wilck wrote:

> - 1 of the involved Reiser filesystems cannot be accessed anymore.
>   a number of processes in "D" state tries to access files on that file
>   system:

What if you press 'reset' and then boot and try to mount the fs?
Will you see ~ 2k uncommited transactions replayed?
If so, we recently saw similar thing. The fix is not ready yet, though.

> - My analysis of call traces of the offending processes shows that 
>   all 'cp' processes and one 'rm' seem to be stuck in the 
>   lock_journal() call in do_journal_begin_r().
>   The "cp" processes all end up in journal_begin() through
>   __mark_inode_dirty() (both in read() and write() system calls).
>   3 of the 'rm' processes are stuck in the down() call in sys_unlink().

Yes, this looks a lot like what we already saw.

> So, AFAICT, this look like a race of at least 3 processes for some
> ReiserFS locks / semaphores.

No, the problem is different. We have easier reproduction scenario for it.

> PS: A very similar observation was the actual reason that we started
> this debugging procedure in the first place. We happened to have the
> "is_tree_inode ..." messages, too, though, and I thought the apparent
> race was  a follow-up error of those. However, since the race also comes
> to pass with Oleg's patch, it seems to be an unrelated problem.

The patch is not mine, in fact, it was made by Chris Mason.

Below you can find the patch that may help you if I am right about roots of your second problem.
This is not proper fix but merely a quick plug. (I verified and it passes my testcase)
Chris promised to send proper patch yesterday, but seems he is busy right now.

Also find attached another fix for is_tree_node from Vladimir Saveliev that
Chris said is superiour to his.

Bye,
    Oleg

===== fs/reiserfs/journal.c 1.33 vs edited =====
--- 1.33/fs/reiserfs/journal.c	Thu Aug 28 18:28:11 2003
+++ edited/fs/reiserfs/journal.c	Tue Sep 16 19:23:33 2003
@@ -2839,6 +2839,10 @@
   }
   now = CURRENT_TIME;
 
+  if ( nblocks > SB_JOURNAL_MAX_BATCH(p_s_sb) - 5 ) {
+	nblocks = SB_JOURNAL_MAX_BATCH(p_s_sb) - 5;
+  } 
+
   /* if there is no room in the journal OR
   ** if this transaction is too old, and we weren't called joinable, wait for it to finish before beginning 
   ** we don't sleep if there aren't other writers

[-- Attachment #2: istreelevel_fix-2 --]
[-- Type: text/plain, Size: 5747 bytes --]

Hello

Chris Mason wrote:

>On Mon, 2003-03-03 at 14:05, Chris Mason wrote:
>  
>
>>On Mon, 2003-03-03 at 13:42, Oleg Drokin wrote:
>>    
>>
>>>Hello!
>>>
>>>On Mon, Mar 03, 2003 at 01:06:54PM -0500, Chris Mason wrote:
>>>      
>>>
>>>>Ok, in search_by_key, what happens if a balance collapses one tree level
>>>>while we schedule, and the buffer we just found moves to a different
>>>>level?
>>>>B_IS_IN_TREE will return 1, and it is possible the key we searched for
>>>>is still in the buffer.
>>>>Won't that just lead to vs-5150 messages when we should have just
>>>>detected a tree change and started the search over?
>>>>        
>>>>
>>>For me it looks like there is definitely a problem.
>>>Unfortunatelly Vladimir is in Italy, so he cannot confirm
>>>or deny this yet.
>>>      
>>>
>>I haven't been able to trigger this on ia32, but I have triggered it on
>>my x86_64 box just by running 4 copies of fsx-linux at once.  I thought
>>the key_in_buffer checks were supposed to find this kind of thing
>>though.
>>    
>>
Yes. And it looks like it does it. Both get_rkey and get_lkey check 
whole path. However, it looks like the problem you describe can appear 
for root block.
Could you please try to reproduce the problem with the attached patch?

>This patch adds a few additional checks to make sure each buffer in the
>path is in the tree and the path is rooted correctly.  It also changes
>search_by_key to drop the path and research when the tree height
>changes.
>
>Perhaps someone can find a better check for key_in_buffer that provides
>these checks as well.  I actually wanted to just drop the key_in_buffer
>checks completely, and have search_by_key reserch any time the tree
>changed, but that lead to vs-3050 deadlocks under load.
>  
>
Dropping call to key_in_buffer must not lead to vs-3050. If it does - we 
probably have problems somewhere.

>-chris
>
>--- linux.ul.1/fs/reiserfs/stree.c	2003-03-06 09:16:36.000000000 -0500
>+++ linux.ul/fs/reiserfs/stree.c	2003-03-06 09:15:44.000000000 -0500
>@@ -338,7 +338,27 @@
>   return  &MAX_KEY;
> }
> 
>+/* check each buffer in the path to make sure it is still in the tree */
>+int path_in_tree (const struct path *path,
>+		  const struct super_block  * s) 
>+{
>+    int n_path_offset = path->path_length;
>+    struct buffer_head *bh;
>+
>+    if (PATH_OFFSET_PBUFFER(path, FIRST_PATH_ELEMENT_OFFSET)->b_blocknr !=
>+         SB_ROOT_BLOCK(s)) 
>+    {
>+        return 0;
>+    }	 
>+    while (n_path_offset-- > FIRST_PATH_ELEMENT_OFFSET) {
>+        bh = PATH_OFFSET_PBUFFER(path, n_path_offset);
>+	if (!B_IS_IN_TREE(bh)) {
>+	    return 0;
>+	}
>+    }
>+    return 1;
> 
>+}
> /* Get delimiting key of the buffer at the path and its right neighbor. */
> inline	const struct  key * get_rkey  (
> 	                const struct path         * p_s_chk_path,
>@@ -403,7 +423,7 @@
>   if ( COMP_KEYS(get_rkey(p_s_chk_path, p_s_sb), p_s_key) != 1 )
>     /* p_s_key must be less than right delimitiing key */
>     return 0;
>-  return 1;
>+  return path_in_tree(p_s_chk_path, p_s_sb);
> }
> 
> 
>@@ -664,6 +684,7 @@
>     int				n_node_level, n_retval;
>     int 			right_neighbor_of_leaf_node;
>     int				fs_gen;
>+    int				tree_height;
>     struct buffer_head *reada_bh[SEARCH_BY_KEY_READA];
>     unsigned long      reada_blocks[SEARCH_BY_KEY_READA];
>     int reada_count = 0;
>@@ -699,6 +720,7 @@
> 	/* prep path to have another element added to it. */
> 	p_s_last_element = PATH_OFFSET_PELEMENT(p_s_search_path, ++p_s_search_path->path_length);
> 	fs_gen = get_generation (p_s_sb);
>+	tree_height = SB_TREE_HEIGHT(p_s_sb);
> 	expected_level --;
> 
> 	/* schedule read of right neighbors */
>@@ -724,7 +746,8 @@
> 	   to search is still in the tree rooted from the current buffer. If
> 	   not then repeat search from the root. */
> 	if ( fs_changed (fs_gen, p_s_sb) && 
>-	     (!B_IS_IN_TREE (p_s_bh) || !key_in_buffer(p_s_search_path, p_s_key, p_s_sb)) ) {
>+	     (tree_height != SB_TREE_HEIGHT(p_s_sb) ||
>+	     !B_IS_IN_TREE (p_s_bh) || !key_in_buffer(p_s_search_path, p_s_key, p_s_sb)) ) {
>  	    PROC_INFO_INC( p_s_sb, search_by_key_restarted );
> 	    PROC_INFO_INC( p_s_sb, sbk_restarted[ expected_level - 1 ] );
> 	    decrement_counters_in_path(p_s_search_path);
>
>
>
>
>
>
>
>
>  
>

===== fs/reiserfs/stree.c 1.35 vs edited =====
--- 1.35/fs/reiserfs/stree.c	Mon Feb 24 21:10:17 2003
+++ edited/fs/reiserfs/stree.c	Mon Mar 10 13:36:47 2003
@@ -653,7 +653,7 @@
                                        DISK_LEAF_NODE_LEVEL */
     ) {
     int  n_block_number = SB_ROOT_BLOCK (p_s_sb),
-      expected_level = SB_TREE_HEIGHT (p_s_sb);
+      expected_level = -1;
     struct buffer_head  *       p_s_bh;
     struct path_element *       p_s_last_element;
     int				n_node_level, n_retval;
@@ -690,7 +690,6 @@
 	/* prep path to have another element added to it. */
 	p_s_last_element = PATH_OFFSET_PELEMENT(p_s_search_path, ++p_s_search_path->path_length);
 	fs_gen = get_generation (p_s_sb);
-	expected_level --;
 
 #ifdef SEARCH_BY_KEY_READA
 	/* schedule read of right neighbor */
@@ -706,6 +705,10 @@
 	    return IO_ERROR;
 	}
 
+	if (expected_level == -1)
+		expected_level = SB_TREE_HEIGHT (p_s_sb);
+	expected_level --;
+
  	if( fs_changed (fs_gen, p_s_sb) ) {
  		PROC_INFO_INC( p_s_sb, search_by_key_fs_changed );
  		PROC_INFO_INC( p_s_sb, sbk_fs_changed[ expected_level - 1 ] );
@@ -723,7 +726,7 @@
 	    /* Get the root block number so that we can repeat the search
                starting from the root. */
 	    n_block_number = SB_ROOT_BLOCK (p_s_sb);
-	    expected_level = SB_TREE_HEIGHT (p_s_sb);
+	    expected_level = -1;
 	    right_neighbor_of_leaf_node = 0;
 	    
 	    /* repeat search from the root */



Thanks,
vs





^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Apparent race condition in do_journal_begin_r() with UL kernel
  2003-09-16 15:36 ` Oleg Drokin
@ 2003-09-19  9:32   ` Martin Wilck
  2003-09-19  9:35     ` Oleg Drokin
  0 siblings, 1 reply; 7+ messages in thread
From: Martin Wilck @ 2003-09-19  9:32 UTC (permalink / raw)
  To: Oleg Drokin; +Cc: Reiser FS Mailing List, mason

Am Die, 2003-09-16 um 17.36 schrieb Oleg Drokin:

> What if you press 'reset' and then boot and try to mount the fs?
> Will you see ~ 2k uncommited transactions replayed?

Yes, exactly.

> If so, we recently saw similar thing. The fix is not ready yet, though.

> Below you can find the patch that may help you if I am right about roots of your second problem.
> This is not proper fix but merely a quick plug. (I verified and it passes my testcase)
> Chris promised to send proper patch yesterday, but seems he is busy right now.

Yes, the results with this patch (and the other one you sent which has
to be slightly modified for my UL kernel) look promising. ~40h IO
activity now and no problems. If our machines are still doing fine after
the weekend we'll conclude these patches solve the problem (actually,
both problems) for us. I'll give you another feedback then.

Thanks for the help,
Martin


-- 
Martin Wilck                Phone: +49 5251 8 15113
Fujitsu Siemens Computers   Fax:   +49 5251 8 20409
Heinz-Nixdorf-Ring 1	    mailto:Martin.Wilck@Fujitsu-Siemens.com
D-33106 Paderborn           http://www.fujitsu-siemens.com/primergy






^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: Apparent race condition in do_journal_begin_r() with UL kernel
  2003-09-19  9:32   ` Martin Wilck
@ 2003-09-19  9:35     ` Oleg Drokin
  0 siblings, 0 replies; 7+ messages in thread
From: Oleg Drokin @ 2003-09-19  9:35 UTC (permalink / raw)
  To: Martin Wilck; +Cc: Reiser FS Mailing List, mason

Hello!

On Fri, Sep 19, 2003 at 11:32:48AM +0200, Martin Wilck wrote:

> > Below you can find the patch that may help you if I am right about roots of your second problem.
> > This is not proper fix but merely a quick plug. (I verified and it passes my testcase)
> > Chris promised to send proper patch yesterday, but seems he is busy right now.
> Yes, the results with this patch (and the other one you sent which has
> to be slightly modified for my UL kernel) look promising. ~40h IO
> activity now and no problems. If our machines are still doing fine after
> the weekend we'll conclude these patches solve the problem (actually,
> both problems) for us. I'll give you another feedback then.

Ok. Thank you for verifying our patches.

Bye,
    Oleg

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2003-09-19  9:35 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2003-09-16 15:12 Apparent race condition in do_journal_begin_r() with UL kernel Martin Wilck
2003-09-16 15:15 ` Chris Mason
2003-09-16 15:49   ` Oleg Drokin
2003-09-17  7:44     ` Martin Wilck
2003-09-16 15:36 ` Oleg Drokin
2003-09-19  9:32   ` Martin Wilck
2003-09-19  9:35     ` Oleg Drokin

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.