* [Ocfs2-devel] Bug in error handling
@ 2004-03-09 12:40 John L. Villalovos
  2004-03-09 15:13 ` Mark Fasheh
  0 siblings, 1 reply; 6+ messages in thread
From: John L. Villalovos @ 2004-03-09 12:40 UTC (permalink / raw)
  To: ocfs2-devel

I have encountered on my system a bug when OCFS2 tries to do a journal_wipe.

At the time that it does the call it gets back an error of -22.

The problem is that it seems to leave stuff in an inconsistent state when it exits out of the functions that have called it.  So later on bad things happen :(

This diff simulates the error that I received.  I am trying to figure out what has been partially initialized by the time this gets called, but I am having a bit of difficulty tracking it all down :(

John


Index: journal.c
===================================================================
--- journal.c	(revision 766)
+++ journal.c	(working copy)
@@ -1261,8 +1261,11 @@
  	if (!journal)
  		BUG();

-	status = journal_wipe(journal->k_journal, full);
+// FIXME: Simulate BUG
+//	status = journal_wipe(journal->k_journal, full);
+	status = -22;

+
  	LOG_EXIT_STATUS(status);
  	return(status);
  }
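
The symptom described above, an error return that leaves earlier setup half-done, is commonly handled in kernel-style C by unwinding in reverse order through goto labels. Below is a minimal user-space sketch of that pattern. All names in it (`demo_volume`, `demo_mount`, `demo_journal_wipe`) are hypothetical stand-ins and are not taken from the OCFS2 source; the only fact carried over from the report is the error value, since -22 is -EINVAL on Linux.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for per-volume state; not the real OCFS2 structure. */
struct demo_volume {
	int *hash_table;	/* allocated in step 1 */
	int *journal;		/* allocated in step 2 */
	int mounted;
};

/* Simulated journal wipe: fails with -EINVAL (-22) when force_fail is set. */
static int demo_journal_wipe(int force_fail)
{
	return force_fail ? -EINVAL : 0;
}

/*
 * Mount-like setup. On any failure, everything set up so far is released
 * in reverse order and the pointers are reset, so later code never sees
 * partially initialized state.
 */
static int demo_mount(struct demo_volume *vol, int force_fail)
{
	int status;

	vol->hash_table = calloc(16, sizeof(*vol->hash_table));
	if (!vol->hash_table) {
		status = -ENOMEM;
		goto bail;
	}

	vol->journal = calloc(1, sizeof(*vol->journal));
	if (!vol->journal) {
		status = -ENOMEM;
		goto free_hash;
	}

	status = demo_journal_wipe(force_fail);
	if (status < 0)
		goto free_journal;

	vol->mounted = 1;
	return 0;

free_journal:
	free(vol->journal);
	vol->journal = NULL;
free_hash:
	free(vol->hash_table);
	vol->hash_table = NULL;
bail:
	vol->mounted = 0;
	return status;
}
```

Calling `demo_mount(&vol, 1)` plays the same role as the `status = -22` change in the diff: it forces the failure so the cleanup path can be exercised on demand.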


* [Ocfs2-devel] Bug in error handling
  2004-03-09 12:40 [Ocfs2-devel] Bug in error handling John L. Villalovos
@ 2004-03-09 15:13 ` Mark Fasheh
  0 siblings, 0 replies; 6+ messages in thread
From: Mark Fasheh @ 2004-03-09 15:13 UTC (permalink / raw)
  To: ocfs2-devel

At what point did you first see the error? Was it a 1st mount of a fresh
file system or just a normal mount? I assume the file system failed to mount
because of this error... Can you be more specific as to what Bad Things (TM)
were happening?  :) Did it crash or what?
	--Mark

On Tue, Mar 09, 2004 at 10:39:52AM -0800, John L. Villalovos wrote:
> I have encountered on my system a bug when OCFS2 tries to do a journal_wipe.
> 
> At the time that it does the call it gets back an error of -22.
> 
> The problem is that it seems to leave stuff in an inconsistent state when 
> it exits out of the functions that have called it.  So later on bad things 
> happen :(
> 
> This diff simulates the error that I received.  I am trying to figure out 
> what has been partially initialized by the time this gets called, but I am 
> having a bit of difficulty tracking it all down :(
> 
> John
> 
> 
> Index: journal.c
> ===================================================================
> --- journal.c	(revision 766)
> +++ journal.c	(working copy)
> @@ -1261,8 +1261,11 @@
>  	if (!journal)
>  		BUG();
> 
> -	status = journal_wipe(journal->k_journal, full);
> +// FIXME: Simulate BUG
> +//	status = journal_wipe(journal->k_journal, full);
> +	status = -22;
> 
> +
>  	LOG_EXIT_STATUS(status);
>  	return(status);
>  }
--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com


* [Ocfs2-devel] Bug in error handling
@ 2004-03-09 15:18 Villalovos, John L
  2004-03-09 15:43 ` Mark Fasheh
  2004-03-09 17:00 ` Mark Fasheh
  0 siblings, 2 replies; 6+ messages in thread
From: Villalovos, John L @ 2004-03-09 15:18 UTC (permalink / raw)
  To: ocfs2-devel

> At what point did you first see the error? Was it a 1st mount of a fresh
> file system or just a normal mount? I assume the file system failed to mount
> because of this error... Can you be more specific as to what Bad Things (TM)
> were happening?  :) Did it crash or what?

It was NOT a 1st mount.  It was a disk that had been previously used.

It appears that the mount fails but then some globals are probably in a
partially set state.

Here is what I saw after it happened:

# mount -t ocfs2 /dev/sda1 /ocfs2
JBD: no valid journal superblock found
(1856) ERROR: status = -22, /root/ocfs/ocfs2/src/osb.c, 424
(1856) ERROR: status = -22, /root/ocfs/ocfs2/src/super.c, 1047
mount: wrong fs type, bad option, bad superblock on /dev/sda1,
       or too many mounted file systems
[root@linuxjohn2 load_ocfs]# Unable to handle kernel NULL pointer dereference at virtual address 00000000
 printing eip:
d10bb369
*pde = 0e9b5067
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[<d10bb369>]    Tainted: GF
EFLAGS: 00010286
EIP is at ocfs_bh_sem_lookup+0x29/0x650 [ocfs2]
eax: 00000000   ebx: cbc57984   ecx: 000000f9   edx: 000007f9
esi: cbc57984   edi: cbc57984   ebp: 00000800   esp: cc1d5eac
ds: 007b   es: 007b   ss: 0068
Process ocfs2nm-0 (pid: 1857, threadinfo=cc1d4000 task=ccfbd660)
Stack: cbc6a374 cc1d5ed0 0000001f cba16e90 00000010 00000010 ce654a00 cfa0a200
       cfa0a200 00000000 00000000 00000000 00000000 ccfbea40 00000000 cbc57984
       00000000 cc1d5f3c c035cd80 ccdc87b4 00000010 ce6549f0 00011c00 cfa0a200
Call Trace:
 [<d10bb9a1>] ocfs_bh_sem_lock+0x11/0x60 [ocfs2]
 [<d10c6267>] ocfs_read_bhs+0x227/0x930 [ocfs2]
 [<d10bbd6a>] ocfs_bh_sem_hash_prune+0x19a/0x390 [ocfs2]
 [<d10d5d6e>] ocfs_volume_thread+0x29e/0x930 [ocfs2]
 [<d10d5ad0>] ocfs_volume_thread+0x0/0x930 [ocfs2]
 [<c0109295>] kernel_thread_helper+0x5/0x10

Code: 8b 00 89 c3 d3 e3 8d 4d f6 d3 e0 31 c3 88 d1 89 5c 24 34 8b



After this point I couldn't unload OCFS2 anymore.

John


* [Ocfs2-devel] Bug in error handling
  2004-03-09 15:18 Villalovos, John L
@ 2004-03-09 15:43 ` Mark Fasheh
  2004-03-09 17:00 ` Mark Fasheh
  1 sibling, 0 replies; 6+ messages in thread
From: Mark Fasheh @ 2004-03-09 15:43 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Mar 09, 2004 at 01:18:11PM -0800, Villalovos, John L wrote:
> It was NOT a 1st mount.  It was a disk that had been previously used.
> 
> It appears that the mount fails but then some globals are probably in a
> partially set state.
> 
> Here is what I saw after it happened:
> 
> # mount -t ocfs2 /dev/sda1 /ocfs2
> JBD: no valid journal superblock found
> (1856) ERROR: status = -22, /root/ocfs/ocfs2/src/osb.c, 424
> (1856) ERROR: status = -22, /root/ocfs/ocfs2/src/super.c, 1047
> mount: wrong fs type, bad option, bad superblock on /dev/sda1,
>        or too many mounted file systems
> [root@linuxjohn2 load_ocfs]# Unable to handle kernel NULL pointer
> dereference at virtual address 00000000
<snip>
Alright, we should definitely *not* be doing that :) I'll add your previous
patch and check it out.
	--Mark

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com


* [Ocfs2-devel] Bug in error handling
  2004-03-09 15:18 Villalovos, John L
  2004-03-09 15:43 ` Mark Fasheh
@ 2004-03-09 17:00 ` Mark Fasheh
  1 sibling, 0 replies; 6+ messages in thread
From: Mark Fasheh @ 2004-03-09 17:00 UTC (permalink / raw)
  To: ocfs2-devel

On Tue, Mar 09, 2004 at 01:18:11PM -0800, Villalovos, John L wrote:
> It was NOT a 1st mount.  It was a disk that had been previously used.
> 
> It appears that the mount fails but then some globals are probably in a
> partially set state.
Ok, could you update from latest SVN and let me know if that fixed it?
I wasn't getting the NULL pointer error in ocfs_bh_sem_lookup like you, but
I was definitely seeing one in ocfs_inode_hash_prune_all where we were
assuming that an inode existed on the inum when in fact it didn't :) The fix,
of course, was to check for its existence before acting on it!

Alternatively, if you don't want to update from SVN, you can apply this
patch.
	--Mark

--
Mark Fasheh
Software Developer, Oracle Corp
mark.fasheh@oracle.com

Index: hash.c
===================================================================
--- hash.c	(revision 766)
+++ hash.c	(working copy)
@@ -1267,18 +1267,19 @@ static int ocfs_inode_hash_prune_all(ocf
 		inum = list_entry(iter, ocfs_inode_num, i_list);
 		list_del(&inum->i_list);
 
-		/* this log_error_args is mainly for debugging */
-		if (atomic_read(&inum->i_inode->i_count) > 2)
-			LOG_ERROR_ARGS("inode (%lu) with i_count = %u left in "
-				       "system, (voteoff = %u.%u, "
-				       "fileoff = %u.%u)\n", 
-				       inum->i_inode->i_ino,
-				       atomic_read(&inum->i_inode->i_count),
-				       HILO(inum->i_voteoff), 
-				       HILO(inum->i_feoff));
+		if (inum->i_inode) {
+			/* this log_error_args is mainly for debugging */
+			if (atomic_read(&inum->i_inode->i_count) > 2)
+				LOG_ERROR_ARGS("inode (%lu) with i_count = %u "
+					  "left in system, (voteoff = "
+					  "%u.%u, fileoff = %u.%u)\n", 
+					  inum->i_inode->i_ino,
+					  atomic_read(&inum->i_inode->i_count),
+					  HILO(inum->i_voteoff), 
+					  HILO(inum->i_feoff));
 
-		if (inum->i_inode)
 			iput(inum->i_inode);
+		}
 		ocfs_free_inode_num(inum);
 	}
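
The shape of the change above is that every dereference of an optional pointer, the logging as well as the release, moves under a single NULL check. A reduced, user-space sketch of that pattern follows. The types and the function `demo_prune_one` are hypothetical illustrations, and the reference-count decrement is only a stand-in for `iput()`; none of this is the actual OCFS2 code.

```c
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the structures in the patch. */
struct demo_inode {
	unsigned long ino;
	int count;		/* plays the role of i_count */
};

struct demo_inum {
	struct demo_inode *inode;	/* may legitimately be NULL */
};

/*
 * Drop the reference held by one entry, if it holds one.
 * Returns 1 when a reference was dropped, 0 when no inode was attached.
 */
static int demo_prune_one(struct demo_inum *inum)
{
	if (!inum->inode)
		return 0;	/* nothing to log, nothing to release */

	/* Debug-only report, reached only when the pointer is valid. */
	if (inum->inode->count > 2)
		fprintf(stderr, "inode (%lu) with count = %d left in system\n",
			inum->inode->ino, inum->inode->count);

	inum->inode->count--;	/* stand-in for iput() */
	inum->inode = NULL;
	return 1;
}
```

Keeping the check at the top of the function means a later edit cannot accidentally add a dereference outside the guarded region, which appears to be what went wrong in the pre-patch code, where only the `iput()` call was guarded.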
 


* [Ocfs2-devel] Bug in error handling
@ 2004-03-09 19:14 Villalovos, John L
  0 siblings, 0 replies; 6+ messages in thread
From: Villalovos, John L @ 2004-03-09 19:14 UTC (permalink / raw)
  To: ocfs2-devel

> Ok, could you update from latest SVN and let me know if that fixed it?
> I wasn't getting the NULL pointer error in ocfs_bh_sem_lookup like you, but
> I was definitely seeing one in ocfs_inode_hash_prune_all where we were
> assuming that an inode existed on the inum when in fact it didn't :) The fix,
> of course, was to check for its existence before acting on it!
> 
> Alternatively, if you don't want to update from SVN, you can apply this
> patch.

I will give that a try, though I reformatted my partition so I
may not be able to reproduce it.

Just a note.  I am doing this on a 2.6.3 kernel.

It was crashing on the marked line here:

ocfs_bh_sem * ocfs_bh_sem_lookup(struct buffer_head *bh)
{
        int depth, bucket;
        struct list_head *head, *iter = NULL;
        ocfs_bh_sem *sem = NULL, *newsem = NULL;

        bucket = ocfs_bh_sem_hash_fn(bh);  <<<<<<<<<----



#define ocfs_bh_sem_hash_fn(_b)   \
        (_hashfn((unsigned int)BH_GET_DEVICE((_b)), (_b)->b_blocknr) & \
         ocfs_bh_hash_shift)


This macro is where the NULL reference occurs:

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,6,0)
#define BH_GET_DEVICE(bh) ((bh->b_bdev)->bd_dev)  <<<<-------------
#else
#define BH_GET_DEVICE(bh) (bh->b_dev)
#endif
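
For illustration, the 2.6 variant of the macro follows two pointers in a row (`bh`, then `bh->b_bdev`) with no check on either. A defensive version could look like the user-space sketch below. The struct definitions are reduced stand-ins that keep only the field names seen in the macros, and `demo_bh_get_device` is a hypothetical helper, not something from the OCFS2 tree. Which of the two pointers was actually NULL in the oops is not established in this thread.

```c
#include <stddef.h>

/* Reduced stand-ins; only the fields used by the macros are modeled. */
struct demo_bdev {
	unsigned int bd_dev;
};

struct demo_bh {
	struct demo_bdev *b_bdev;
	unsigned long b_blocknr;
};

/*
 * Checked replacement for the 2.6 BH_GET_DEVICE() lookup.
 * Returns 0 and stores the device number in *dev on success,
 * or -1 when either pointer on the path is NULL.
 */
static int demo_bh_get_device(const struct demo_bh *bh, unsigned int *dev)
{
	if (!bh || !bh->b_bdev)
		return -1;

	*dev = bh->b_bdev->bd_dev;
	return 0;
}
```

A function rather than a macro is used here so the failure can be reported to the caller. A check like this would only turn the oops into an error return; the volume thread still running after a failed mount would remain to be explained.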

