Re: xfslogd-spinlock bug?

From: "Haar János" <djani22@netcenter.hu>
To: David Chinner <dgc@sgi.com>
Cc: linux-xfs@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: xfslogd-spinlock bug?
Date: Wed, 27 Dec 2006 13:58:06 +0100
Message-ID: <041601c729b6$f81e4af0$0400a8c0@dcccs>
In-Reply-To: 20061219044700.GW33919298@melbourne.sgi.com

Hello,

----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "David Chinner" <dgc@sgi.com>
Cc: "Haar János" <djani22@netcenter.hu>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Tuesday, December 19, 2006 5:47 AM
Subject: Re: xfslogd-spinlock bug?


> On Tue, Dec 19, 2006 at 01:52:29PM +1100, David Chinner wrote:
> > On Tue, Dec 19, 2006 at 12:39:46AM +0100, Haar János wrote:
> > > From: "David Chinner" <dgc@sgi.com>
> > > > #define POISON_FREE 0x6b
> > > >
> > > > Can you confirm that you are running with CONFIG_DEBUG_SLAB=y?
> > >
> > > Yes, I built with this option enabled.
>
> ......
>
> > FWIW, I've run XFSQA twice now on a scsi disk with slab debugging turned
> > on and I haven't seen this problem. I'm not sure how to track down
> > the source of the problem without a test case, but as a quick test, can
> > you try the following patch?
>
> Third try and I got a crash on a poisoned object:
>
> [1]kdb> md8c40 e00000300d7d5100
> 0xe00000300d7d5100 000000005a2cf071 0000000000000000   q.,Z............
> 0xe00000300d7d5110 000000005a2cf071 6b6b6b6b6b6b6b6b   q.,Z....kkkkkkkk
> 0xe00000300d7d5120 e0000039eb7b6320 6b6b6b6b6b6b6b6b    c{.9...kkkkkkkk
> 0xe00000300d7d5130 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5140 6b6b6b6f6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkokkkkkkkkkkk
> 0xe00000300d7d5150 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5160 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5170 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5180 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d5190 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51a0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51b0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51c0 6b6b6b6b6b6b6b6b 6b6b6b6b6b6b6b6b   kkkkkkkkkkkkkkkk
> 0xe00000300d7d51d0 6b6b6b6b6b6b6b6b a56b6b6b6b6b6b6b   kkkkkkkkkkkkkkk.
> 0xe00000300d7d51e0 000000005a2cf071 a000000100468c30   q.,Z....0.F.....
> [1]kdb> mds 0xe00000300d7d51e0
> 0xe00000300d7d51e0 5a2cf071   q.,Z....
> 0xe00000300d7d51e8 a000000100468c30 xfs_inode_item_destroy+0x30
>
> So the use-after-free here is on an inode item. You're tripping
> over a buffer item.
>
> Unfortunately, it is not the same problem - the problem I've just
> hit is to do with a QA test that does a forced shutdown on an active
> filesystem, and:
>
> [1]kdb> xmount 0xe00000304393e238
> .....
> flags 0x440010 <FSSHUTDOWN IDELETE COMPAT_IOSIZE >
>
> The filesystem was being shut down so xfs_inode_item_destroy() just
> frees the inode log item without removing it from the AIL. I'll fix that,
> and see if I have any luck....
>
> So I'd still try that patch I sent in the previous email...

I am still running with the patch, but it has not printed any messages so far.

I've had 3 crashes/reboots, but 2 of them were caused by nbd disconnects, and
then this one:

Dec 27 13:41:29 dy-base BUG: warning at kernel/mutex.c:220/__mutex_unlock_common_slowpath()
Dec 27 13:41:29 dy-base Unable to handle kernel paging request at 0000000066604480 RIP:
Dec 27 13:41:29 dy-base  [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base PGD 115246067 PUD 0
Dec 27 13:41:29 dy-base Oops: 0000 [1] SMP
Dec 27 13:41:29 dy-base CPU 1
Dec 27 13:41:29 dy-base Modules linked in: nbd rd netconsole e1000 video
Dec 27 13:41:29 dy-base Pid: 4069, comm: httpd Not tainted 2.6.19 #3
Dec 27 13:41:29 dy-base RIP: 0010:[<ffffffff80222c64>]  [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base RSP: 0018:ffff810105c01b78  EFLAGS: 00010083
Dec 27 13:41:29 dy-base RAX: ffffffff807d5800 RBX: 00001749fd97c214 RCX: ffff81001cbd0000
Dec 27 13:41:29 dy-base RDX: 000000001cbd0048 RSI: ffff810005834068 RDI: ffff8101047bf040
Dec 27 13:41:29 dy-base RBP: ffff810105c01b78 R08: 0000000000000001 R09: 0000000000000000
Dec 27 13:41:29 dy-base R10: 0000000000000057 R11: ffff81000583cd80 R12: ffff810116693140
Dec 27 13:41:29 dy-base R13: 0000000000000001 R14: 0000000000000000 R15: 0000000000000000
Dec 27 13:41:29 dy-base FS:  00002ba3c1ad07d0(0000) GS:ffff81011fc769c8(0000) knlGS:0000000000000000
Dec 27 13:41:29 dy-base CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec 27 13:41:29 dy-base CR2: 0000000066604480 CR3: 0000000118196000 CR4: 00000000000006e0
Dec 27 13:41:29 dy-base Process httpd (pid: 4069, threadinfo ffff810105c00000, task ffff8101166e1040)
Dec 27 13:41:29 dy-base Stack:  ffff810105c01bf8 ffffffff80223f37 ffff810002996a00 0000000002cba600
Dec 27 13:41:29 dy-base  000000000000000f 0000000000000001 ffff810005833700 0000000100000000
Dec 27 13:41:29 dy-base  0000000000000005 0000000000000296 ffff810105c01bd8 ffff810117fef690
Dec 27 13:41:29 dy-base Call Trace:
Dec 27 13:41:29 dy-base  [<ffffffff80223f37>] try_to_wake_up+0x3a7/0x3dc
Dec 27 13:41:29 dy-base  [<ffffffff80223f98>] wake_up_process+0x10/0x12
Dec 27 13:41:29 dy-base  [<ffffffff803c9186>] xfsbufd_wakeup+0x34/0x61
Dec 27 13:41:29 dy-base  [<ffffffff8025cdf9>] shrink_slab+0x64/0x163
Dec 27 13:41:29 dy-base  [<ffffffff8025d913>] try_to_free_pages+0x19c/0x289
Dec 27 13:41:29 dy-base  [<ffffffff80258b62>] __alloc_pages+0x1b8/0x2c0
Dec 27 13:41:29 dy-base  [<ffffffff80267f94>] anon_vma_prepare+0x29/0xf1
Dec 27 13:41:29 dy-base  [<ffffffff80260d52>] __handle_mm_fault+0x496/0x9e3
Dec 27 13:41:29 dy-base  [<ffffffff805e7dfd>] _spin_unlock+0x9/0xb
Dec 27 13:41:29 dy-base  [<ffffffff8021a1b6>] do_page_fault+0x418/0x7b6
Dec 27 13:41:29 dy-base  [<ffffffff802082ed>] __switch_to+0x280/0x28f
Dec 27 13:41:29 dy-base  [<ffffffff805e7efb>] _spin_unlock_irq+0x9/0xc
Dec 27 13:41:29 dy-base  [<ffffffff805e5ca8>] thread_return+0x5e/0xf7
Dec 27 13:41:29 dy-base  [<ffffffff805e809d>] error_exit+0x0/0x84
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base Code: 48 8b 14 d5 40 42 78 80 48 03 42 08 8b 00 85 c0 7e 0a 0f 0b
Dec 27 13:41:29 dy-base RIP  [<ffffffff80222c64>] resched_task+0x12/0x64
Dec 27 13:41:29 dy-base  RSP <ffff810105c01b78>
Dec 27 13:41:29 dy-base CR2: 0000000066604480
Dec 27 13:41:29 dy-base  <0>Kernel panic - not syncing: Fatal exception
Dec 27 13:41:29 dy-base
Dec 27 13:41:29 dy-base Rebooting in 5 seconds..
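
The register dump in this oops looks consistent with a wild array index. This
is only a hedged reading, not something stated in the thread: assuming the
Code: bytes start at the faulting instruction, 48 8b 14 d5 40 42 78 80 decodes
as a 64-bit load from [RDX*8 + disp32], with disp32 = 0x80784240 sign-extended
to 64 bits. The sketch below only redoes that address arithmetic with the RDX
value from the dump; le32 and scaled_index_address are hypothetical helper
names, not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

/* Assemble a little-endian 32-bit value from four instruction bytes. */
uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Effective address of an x86-64 operand of the form [index*scale + disp32]
 * (SIB byte with no base register). The displacement is sign-extended to
 * 64 bits and the sum wraps modulo 2^64, as the hardware computes it.
 */
uint64_t scaled_index_address(uint32_t disp32, uint64_t index, unsigned scale)
{
    uint64_t disp = disp32;

    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    if (disp & 0x80000000u)
        disp |= 0xffffffff00000000ull; /* sign-extend disp32 */
    return disp + index * scale;
}
```

With the displacement bytes 40 42 78 80 and RDX = 0x1cbd0048, the sum wraps to
0x0000000066604480, which matches CR2. If the decoding assumption holds, the
faulting access would be a table lookup with a corrupted index rather than a
bad table pointer.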

I found a bug in my Apache config, and I think fixing it has changed the
test case. :-(
Before the config was fixed, some users could stress readahead on the XFS
source device and periodically overload the system that way.
I think the original bug only shows up on a heavily overloaded system, and is
related to readahead plus buffering/caching.

Thanks,
Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group
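
The kdb dump quoted in this message can be checked mechanically. The following
is a minimal sketch, not kernel code. It assumes the freed object is the 200
bytes from 0xe00000300d7d5118 to 0xe00000300d7d51df (my reading of the dump:
the poison lies between the two 0x5a2cf071 red-zone words), that every byte
except the last should be POISON_FREE (0x6b, as quoted in the thread), and
that the last byte is the end marker 0xa5 used by the slab debugging code.
The names find_poison_damage and struct poison_hit are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POISON_FREE 0x6b /* fill byte for freed objects (quoted in the thread) */
#define POISON_END  0xa5 /* assumed end-of-object marker byte */

/* Hypothetical record of one byte that lost its poison value. */
struct poison_hit {
    size_t  offset;   /* byte offset from the start of the object */
    uint8_t found;    /* value actually present */
    uint8_t expected; /* POISON_FREE, or POISON_END for the last byte */
};

/*
 * Scan a freed object of `len` bytes and record up to `max_hits` bytes that
 * differ from the expected poison pattern. Returns the total number of
 * damaged bytes, which may exceed `max_hits`.
 */
size_t find_poison_damage(const uint8_t *obj, size_t len,
                          struct poison_hit *hits, size_t max_hits)
{
    size_t damaged = 0;

    assert(obj != NULL && len > 0);
    for (size_t i = 0; i < len; i++) {
        uint8_t expected = (i == len - 1) ? POISON_END : POISON_FREE;

        if (obj[i] == expected)
            continue;
        if (hits != NULL && damaged < max_hits) {
            hits[damaged].offset = i;
            hits[damaged].found = obj[i];
            hits[damaged].expected = expected;
        }
        damaged++;
    }
    return damaged;
}
```

Run on the bytes in the dump, this would flag the eight bytes at offsets 8 to
15 (the pointer e0000039eb7b6320 stored at 0xe00000300d7d5120) and the single
byte 0x6f at offset 44 (0xe00000300d7d5144), which differs from 0x6b in one
bit.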
