Re: xfslogd-spinlock bug?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Haar János" <djani22@netcenter.hu>
To: David Chinner <dgc@sgi.com>
Cc: linux-xfs@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: xfslogd-spinlock bug?
Date: Mon, 18 Dec 2006 00:56:41 +0100	[thread overview]
Message-ID: <026501c72237$0464f7a0$0400a8c0@dcccs> (raw)
In-Reply-To: 20061217224457.GN33919298@melbourne.sgi.com


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar JÃ¡nos" <djani22@netcenter.hu>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Sunday, December 17, 2006 11:44 PM
Subject: Re: xfslogd-spinlock bug?


> On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar JÃ¡nos wrote:
> > Hi
> >
> > I have some news.
> >
> > I dont know there is a context between 2 messages, but i can see, the
> > spinlock bug comes always on cpu #3.
> >
> > Somebody have any idea?
>
> Your disk interrupts are directed to CPU 3, and so log I/O completion
> occurs on that CPU.

           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0    3878668   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          3   IO-APIC-edge      i8042
 14:    3072118          0          0        181   IO-APIC-edge      ide0
 16:          0          0          0          0   IO-APIC-fasteoi
uhci_hcd:usb2
 18:          0          0          0          0   IO-APIC-fasteoi
uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi
uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi
ehci_hcd:usb1
 52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
NMI:      42271      43187      42234      43168
LOC:    4584247    4584219    4584215    4584198
ERR:          0

Maybe....
I have 3 XFS on this system, with 3 source.

1. 200G one ide hdd.
2. 2x200G mirror on 1 ide + 1 sata hdd.
3. 4x3.3TB strip on NBD.

The NBD serves through eth1, and it is on the CPU3, but the ide0 is on the
CPU0.


>
> > Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
> > Dec 16 12:08:36 dy-base general protection fault: 0000 [1]
> > Dec 16 12:08:36 dy-base SMP
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base CPU 3
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base Modules linked in:
> > Dec 16 12:08:36 dy-base  nbd
>
> Are you using XFS on a NBD?

Yes, on the 3. source.
I used it about 1.5 years.

(The nbd deadlock is fixed on my system, thanks to Herbert Xu on 2.6.14.)

>
> > Dec 16 12:08:36 dy-base  rd
> > Dec 16 12:08:36 dy-base  netconsole
> > Dec 16 12:08:36 dy-base  e1000
> > Dec 16 12:08:36 dy-base  video
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
> > Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>]
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
> > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
>                                                      ^^^^^^^^^^^^^^^^
> Anyone recognise that pattern?

I think i have one idea.
This issue can stops sometimes the 5sec automatic restart on crash, and this
shows possible memory corruption, and if the bug occurs in the IRQ
handling.... :-)
I have a lot of logs about this issue, and the RAX, RBX always the same.

>
> > Dec 16 12:08:36 dy-base Call Trace:
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
> > Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>]
_spin_lock_irqsave+0x11/0x18
> > Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
> > Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
> > Dec 16 12:08:36 dy-base  [<ffffffff803970a4>]
xfs_buf_item_unpin+0x2e/0xa6
>
> This implies a spinlock inside a wait_queue_head_t is corrupt.
>
> What are you type of system do you have, and what sort of
> workload are you running?

OS: Fedora 5, 64bit.
HW: dual xeon, with HT, ram 4GB.
(the min_free_kbytes limit is set to 128000, because sometimes the e1000
driver run out the reserved memory during irq handling.)

Workload:

I use this system for free web storage.
(2x apache 2.0.xx,   12x pure-ftpd, 2x mysql but sql only use the source #2
fs.)

The normal system load is ~20-40, but currently i have a little problem with
apache, because it sometimes starts to read a lot from the big XFS device,
and eats all memory, the load is rising to 700-800.
At this point i use httpd restart, and everithing go back to normal, but if
i offline.....

Thanks a lot!

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

WARNING: multiple messages have this Message-ID (diff)

From: "Haar János" <djani22@netcenter.hu>
To: "David Chinner" <dgc@sgi.com>
Cc: <linux-xfs@oss.sgi.com>, <linux-kernel@vger.kernel.org>
Subject: Re: xfslogd-spinlock bug?
Date: Mon, 18 Dec 2006 00:56:41 +0100	[thread overview]
Message-ID: <026501c72237$0464f7a0$0400a8c0@dcccs> (raw)
In-Reply-To: 20061217224457.GN33919298@melbourne.sgi.com


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar JÃ¡nos" <djani22@netcenter.hu>
Cc: <linux-xfs@oss.sgi.com>; <linux-kernel@vger.kernel.org>
Sent: Sunday, December 17, 2006 11:44 PM
Subject: Re: xfslogd-spinlock bug?


> On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar JÃ¡nos wrote:
> > Hi
> >
> > I have some news.
> >
> > I dont know there is a context between 2 messages, but i can see, the
> > spinlock bug comes always on cpu #3.
> >
> > Somebody have any idea?
>
> Your disk interrupts are directed to CPU 3, and so log I/O completion
> occurs on that CPU.

           CPU0       CPU1       CPU2       CPU3
  0:        100          0          0    4583704   IO-APIC-edge      timer
  1:          0          0          0          2   IO-APIC-edge      i8042
  4:          0          0          0    3878668   IO-APIC-edge      serial
  8:          0          0          0          0   IO-APIC-edge      rtc
  9:          0          0          0          0   IO-APIC-fasteoi   acpi
 12:          0          0          0          3   IO-APIC-edge      i8042
 14:    3072118          0          0        181   IO-APIC-edge      ide0
 16:          0          0          0          0   IO-APIC-fasteoi
uhci_hcd:usb2
 18:          0          0          0          0   IO-APIC-fasteoi
uhci_hcd:usb4
 19:          0          0          0          0   IO-APIC-fasteoi
uhci_hcd:usb3
 23:          0          0          0          0   IO-APIC-fasteoi
ehci_hcd:usb1
 52:          0          0          0  213052723   IO-APIC-fasteoi   eth1
 53:          0          0          0   91913759   IO-APIC-fasteoi   eth2
100:          0          0          0   16776910   IO-APIC-fasteoi   eth0
NMI:      42271      43187      42234      43168
LOC:    4584247    4584219    4584215    4584198
ERR:          0

Maybe....
I have 3 XFS on this system, with 3 source.

1. 200G one ide hdd.
2. 2x200G mirror on 1 ide + 1 sata hdd.
3. 4x3.3TB strip on NBD.

The NBD serves through eth1, and it is on the CPU3, but the ide0 is on the
CPU0.


>
> > Dec 16 12:08:36 dy-base BUG: spinlock bad magic on CPU#3, xfslogd/3/317
> > Dec 16 12:08:36 dy-base general protection fault: 0000 [1]
> > Dec 16 12:08:36 dy-base SMP
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base CPU 3
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base Modules linked in:
> > Dec 16 12:08:36 dy-base  nbd
>
> Are you using XFS on a NBD?

Yes, on the 3. source.
I used it about 1.5 years.

(The nbd deadlock is fixed on my system, thanks to Herbert Xu on 2.6.14.)

>
> > Dec 16 12:08:36 dy-base  rd
> > Dec 16 12:08:36 dy-base  netconsole
> > Dec 16 12:08:36 dy-base  e1000
> > Dec 16 12:08:36 dy-base  video
> > Dec 16 12:08:36 dy-base
> > Dec 16 12:08:36 dy-base Pid: 317, comm: xfslogd/3 Not tainted 2.6.19 #1
> > Dec 16 12:08:36 dy-base RIP: 0010:[<ffffffff803f3aba>]
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3aba>] spin_bug+0x69/0xdf
> > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b RCX:
>                                                      ^^^^^^^^^^^^^^^^
> Anyone recognise that pattern?

I think i have one idea.
This issue can stops sometimes the 5sec automatic restart on crash, and this
shows possible memory corruption, and if the bug occurs in the IRQ
handling.... :-)
I have a lot of logs about this issue, and the RAX, RBX always the same.

>
> > Dec 16 12:08:36 dy-base Call Trace:
> > Dec 16 12:08:36 dy-base  [<ffffffff803f3bdc>] _raw_spin_lock+0x23/0xf1
> > Dec 16 12:08:36 dy-base  [<ffffffff805e7f2b>]
_spin_lock_irqsave+0x11/0x18
> > Dec 16 12:08:36 dy-base  [<ffffffff80222aab>] __wake_up+0x22/0x50
> > Dec 16 12:08:36 dy-base  [<ffffffff803c97f9>] xfs_buf_unpin+0x21/0x23
> > Dec 16 12:08:36 dy-base  [<ffffffff803970a4>]
xfs_buf_item_unpin+0x2e/0xa6
>
> This implies a spinlock inside a wait_queue_head_t is corrupt.
>
> What are you type of system do you have, and what sort of
> workload are you running?

OS: Fedora 5, 64bit.
HW: dual xeon, with HT, ram 4GB.
(the min_free_kbytes limit is set to 128000, because sometimes the e1000
driver run out the reserved memory during irq handling.)

Workload:

I use this system for free web storage.
(2x apache 2.0.xx,   12x pure-ftpd, 2x mysql but sql only use the source #2
fs.)

The normal system load is ~20-40, but currently i have a little problem with
apache, because it sometimes starts to read a lot from the big XFS device,
and eats all memory, the load is rising to 700-800.
At this point i use httpd restart, and everithing go back to normal, but if
i offline.....

Thanks a lot!

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

next prev parent reply	other threads:[~2006-12-17 23:58 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-12-11 23:00 xfslogd-spinlock bug? Haar János
2006-12-11 23:00 ` Haar János
2006-12-12 14:32 ` Justin Piszcz
2006-12-13  1:11   ` Haar János
2006-12-16 11:19     ` Haar János
2006-12-16 11:19       ` Haar János
2006-12-17 22:44       ` David Chinner
2006-12-17 23:56         ` Haar János [this message]
2006-12-17 23:56           ` Haar János
2006-12-18  6:24           ` David Chinner
2006-12-18  8:17             ` Haar János
2006-12-18  8:17               ` Haar János
2006-12-18 22:36               ` David Chinner
2006-12-18 23:39                 ` Haar János
2006-12-18 23:39                   ` Haar János
2006-12-19  2:52                   ` David Chinner
2006-12-19  4:47                     ` David Chinner
2006-12-27 12:58                       ` Haar János
2006-12-27 12:58                         ` Haar János
2007-01-07 23:14                         ` David Chinner
2007-01-10 17:18                           ` Janos Haar
2007-01-10 17:18                             ` Janos Haar
2007-01-11  3:34                             ` David Chinner
2007-01-11 20:15                               ` Janos Haar
2007-01-11 20:15                                 ` Janos Haar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='026501c72237$0464f7a0$0400a8c0@dcccs' \
    --to=djani22@netcenter.hu \
    --cc=dgc@sgi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.