Re: xfslogd-spinlock bug?

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Haar János" <djani22@netcenter.hu>
To: David Chinner <dgc@sgi.com>
Cc: linux-xfs@oss.sgi.com, linux-kernel@vger.kernel.org
Subject: Re: xfslogd-spinlock bug?
Date: Mon, 18 Dec 2006 09:17:50 +0100	[thread overview]
Message-ID: <027b01c7227d$0e26d1f0$0400a8c0@dcccs> (raw)
In-Reply-To: 20061218062444.GH44411608@melbourne.sgi.com


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?


> On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar JÃ¡nos wrote:
> > > > I dont know there is a context between 2 messages, but i can see,
the
> > > > spinlock bug comes always on cpu #3.
> > > >
> > > > Somebody have any idea?
> > >
> > > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > > occurs on that CPU.
> >
> >            CPU0       CPU1       CPU2       CPU3
> >   0:        100          0          0    4583704   IO-APIC-edge
timer
> >   1:          0          0          0          2   IO-APIC-edge
i8042
> >   4:          0          0          0    3878668   IO-APIC-edge
serial
> .....
> >  14:    3072118          0          0        181   IO-APIC-edge
ide0
> .....
> >  52:          0          0          0  213052723   IO-APIC-fasteoi
eth1
> >  53:          0          0          0   91913759   IO-APIC-fasteoi
eth2
> > 100:          0          0          0   16776910   IO-APIC-fasteoi
eth0
> ....
> >
> > Maybe....
> > I have 3 XFS on this system, with 3 source.
> >
> > 1. 200G one ide hdd.
> > 2. 2x200G mirror on 1 ide + 1 sata hdd.
> > 3. 4x3.3TB strip on NBD.
> >
> > The NBD serves through eth1, and it is on the CPU3, but the ide0 is on
the
> > CPU0.
>
> I'd say your NBD based XFS filesystem is having trouble.
>
> > > Are you using XFS on a NBD?
> >
> > Yes, on the 3. source.
>
> Ok, I've never heard of a problem like this before and you are doing
> something that very few ppl are doing (i.e. XFS on NBD). I'd start
> Hence  I'd start by suspecting a bug in the NBD driver.

Ok, if you have right, this also can be in context with the following issue:

http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)


>
> > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b
RCX:
> > >                                                      ^^^^^^^^^^^^^^^^
> > > Anyone recognise that pattern?
> >
> > I think i have one idea.
> > This issue can stops sometimes the 5sec automatic restart on crash, and
this
> > shows possible memory corruption, and if the bug occurs in the IRQ
> > handling.... :-)
> > I have a lot of logs about this issue, and the RAX, RBX always the same.
>
> And is this the only place where you see the problem? Or are there
> other stack traces that you see this in as well?

I have used the 2.6.16.18 for a long time, and it was stable, except this
issue. (~20 dump with xfslogd)
And i try the new releases, and now i have more. :-)

What do you think exactly?
I can see in the logs, but search for what?
The RAX, RBX thing, or the xfslogd-spinlock problem or the old nbd-deadlock
+ mem corruption?

[root@NetCenter netlog]# grep "0000000000000033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 12 18:16:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 13 11:40:05 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 14 22:25:32 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 15 06:24:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  3 19:49:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  7 01:11:17 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.13:Sep 21 15:35:31 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  3 16:13:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  5 21:00:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 00:10:47 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 14:07:01 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 10 04:44:48 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Nov 30 10:59:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Dec  2 00:54:23 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 13 10:44:49 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:14:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:37:07 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 15 01:39:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  6 14:48:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  7 04:36:13 dy-base RAX: 0000000000000033 RBX:
ffff8100057d2080 RCX: ffff810050d638f8
messages.6:Nov  7 04:36:13 dy-base RDX: 0000000000000008 RSI:
0000000000012cff RDI: 0000000000000033
messages.6:Nov  7 11:12:06 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 03:20:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:02:16 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:27:12 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 10 15:29:43 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 11 20:44:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 18 15:31:02 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 19 13:53:24 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000


>
> > > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > >
> > > What are you type of system do you have, and what sort of
> > > workload are you running?
> >
> > OS: Fedora 5, 64bit.
> > HW: dual xeon, with HT, ram 4GB.
> > (the min_free_kbytes limit is set to 128000, because sometimes the e1000
> > driver run out the reserved memory during irq handling.)
>
> That does not sound good. What happens when it does run out of memory?



This is an old problem, on 2.6.16.18 .
The default min_free_kbytes is 38xx , and the GIGE controller easily can be
overflow this little place.
If this happens, the system freez, and i can only use the serial console +
sysreq to dump stack:

download.netcenter.hu/bughunt/20060530/261618-good.txt
download.netcenter.hu/bughunt/20060530/dmesg.txt
download.netcenter.hu/bughunt/20060530/dump.txt

This problem is already fixed with set the min_free_kbytes to 128M.

> Is that when you start to see the above corruptions?

I think no, but i am not 100% sure.

Cheers,

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

WARNING: multiple messages have this Message-ID (diff)

From: "Haar János" <djani22@netcenter.hu>
To: "David Chinner" <dgc@sgi.com>
Cc: <dgc@sgi.com>, <linux-xfs@oss.sgi.com>, <linux-kernel@vger.kernel.org>
Subject: Re: xfslogd-spinlock bug?
Date: Mon, 18 Dec 2006 09:17:50 +0100	[thread overview]
Message-ID: <027b01c7227d$0e26d1f0$0400a8c0@dcccs> (raw)
In-Reply-To: 20061218062444.GH44411608@melbourne.sgi.com


----- Original Message ----- 
From: "David Chinner" <dgc@sgi.com>
To: "Haar János" <djani22@netcenter.hu>
Cc: "David Chinner" <dgc@sgi.com>; <linux-xfs@oss.sgi.com>;
<linux-kernel@vger.kernel.org>
Sent: Monday, December 18, 2006 7:24 AM
Subject: Re: xfslogd-spinlock bug?


> On Mon, Dec 18, 2006 at 12:56:41AM +0100, Haar János wrote:
> > > On Sat, Dec 16, 2006 at 12:19:45PM +0100, Haar JÃ¡nos wrote:
> > > > I dont know there is a context between 2 messages, but i can see,
the
> > > > spinlock bug comes always on cpu #3.
> > > >
> > > > Somebody have any idea?
> > >
> > > Your disk interrupts are directed to CPU 3, and so log I/O completion
> > > occurs on that CPU.
> >
> >            CPU0       CPU1       CPU2       CPU3
> >   0:        100          0          0    4583704   IO-APIC-edge
timer
> >   1:          0          0          0          2   IO-APIC-edge
i8042
> >   4:          0          0          0    3878668   IO-APIC-edge
serial
> .....
> >  14:    3072118          0          0        181   IO-APIC-edge
ide0
> .....
> >  52:          0          0          0  213052723   IO-APIC-fasteoi
eth1
> >  53:          0          0          0   91913759   IO-APIC-fasteoi
eth2
> > 100:          0          0          0   16776910   IO-APIC-fasteoi
eth0
> ....
> >
> > Maybe....
> > I have 3 XFS on this system, with 3 source.
> >
> > 1. 200G one ide hdd.
> > 2. 2x200G mirror on 1 ide + 1 sata hdd.
> > 3. 4x3.3TB strip on NBD.
> >
> > The NBD serves through eth1, and it is on the CPU3, but the ide0 is on
the
> > CPU0.
>
> I'd say your NBD based XFS filesystem is having trouble.
>
> > > Are you using XFS on a NBD?
> >
> > Yes, on the 3. source.
>
> Ok, I've never heard of a problem like this before and you are doing
> something that very few ppl are doing (i.e. XFS on NBD). I'd start
> Hence  I'd start by suspecting a bug in the NBD driver.

Ok, if you have right, this also can be in context with the following issue:

http://download.netcenter.hu/bughunt/20061217/messages.txt   (10KB)


>
> > > > Dec 16 12:08:36 dy-base RSP: 0018:ffff81011fdedbc0  EFLAGS: 00010002
> > > > Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX: 6b6b6b6b6b6b6b6b
RCX:
> > >                                                      ^^^^^^^^^^^^^^^^
> > > Anyone recognise that pattern?
> >
> > I think i have one idea.
> > This issue can stops sometimes the 5sec automatic restart on crash, and
this
> > shows possible memory corruption, and if the bug occurs in the IRQ
> > handling.... :-)
> > I have a lot of logs about this issue, and the RAX, RBX always the same.
>
> And is this the only place where you see the problem? Or are there
> other stack traces that you see this in as well?

I have used the 2.6.16.18 for a long time, and it was stable, except this
issue. (~20 dump with xfslogd)
And i try the new releases, and now i have more. :-)

What do you think exactly?
I can see in the logs, but search for what?
The RAX, RBX thing, or the xfslogd-spinlock problem or the old nbd-deadlock
+ mem corruption?

[root@NetCenter netlog]# grep "0000000000000033" messages*
messages.1:Dec 11 22:47:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 12 18:16:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 13 11:40:05 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 14 22:25:32 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 15 06:24:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.1:Dec 16 12:08:36 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  3 19:49:44 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.11:Oct  7 01:11:17 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.13:Sep 21 15:35:31 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  3 16:13:35 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.15:Sep  5 21:00:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 00:10:47 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec  9 14:07:01 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.2:Dec 10 04:44:48 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Nov 30 10:59:21 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.3:Dec  2 00:54:23 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 13 10:44:49 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:14:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 14 03:37:07 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.5:Nov 15 01:39:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  6 14:48:54 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  7 04:36:13 dy-base RAX: 0000000000000033 RBX:
ffff8100057d2080 RCX: ffff810050d638f8
messages.6:Nov  7 04:36:13 dy-base RDX: 0000000000000008 RSI:
0000000000012cff RDI: 0000000000000033
messages.6:Nov  7 11:12:06 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 03:20:38 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:02:16 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov  8 15:27:12 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 10 15:29:43 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.6:Nov 11 20:44:14 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 18 15:31:02 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000
messages.9:Oct 19 13:53:24 dy-base RAX: 0000000000000033 RBX:
6b6b6b6b6b6b6b6b RCX: 0000000000000000


>
> > > This implies a spinlock inside a wait_queue_head_t is corrupt.
> > >
> > > What are you type of system do you have, and what sort of
> > > workload are you running?
> >
> > OS: Fedora 5, 64bit.
> > HW: dual xeon, with HT, ram 4GB.
> > (the min_free_kbytes limit is set to 128000, because sometimes the e1000
> > driver run out the reserved memory during irq handling.)
>
> That does not sound good. What happens when it does run out of memory?



This is an old problem, on 2.6.16.18 .
The default min_free_kbytes is 38xx , and the GIGE controller easily can be
overflow this little place.
If this happens, the system freez, and i can only use the serial console +
sysreq to dump stack:

download.netcenter.hu/bughunt/20060530/261618-good.txt
download.netcenter.hu/bughunt/20060530/dmesg.txt
download.netcenter.hu/bughunt/20060530/dump.txt

This problem is already fixed with set the min_free_kbytes to 128M.

> Is that when you start to see the above corruptions?

I think no, but i am not 100% sure.

Cheers,

Janos

>
> Cheers,
>
> Dave.
> -- 
> Dave Chinner
> Principal Engineer
> SGI Australian Software Group

next prev parent reply	other threads:[~2006-12-18  8:19 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2006-12-11 23:00 xfslogd-spinlock bug? Haar János
2006-12-11 23:00 ` Haar János
2006-12-12 14:32 ` Justin Piszcz
2006-12-13  1:11   ` Haar János
2006-12-16 11:19     ` Haar János
2006-12-16 11:19       ` Haar János
2006-12-17 22:44       ` David Chinner
2006-12-17 23:56         ` Haar János
2006-12-17 23:56           ` Haar János
2006-12-18  6:24           ` David Chinner
2006-12-18  8:17             ` Haar János [this message]
2006-12-18  8:17               ` Haar János
2006-12-18 22:36               ` David Chinner
2006-12-18 23:39                 ` Haar János
2006-12-18 23:39                   ` Haar János
2006-12-19  2:52                   ` David Chinner
2006-12-19  4:47                     ` David Chinner
2006-12-27 12:58                       ` Haar János
2006-12-27 12:58                         ` Haar János
2007-01-07 23:14                         ` David Chinner
2007-01-10 17:18                           ` Janos Haar
2007-01-10 17:18                             ` Janos Haar
2007-01-11  3:34                             ` David Chinner
2007-01-11 20:15                               ` Janos Haar
2007-01-11 20:15                                 ` Janos Haar

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='027b01c7227d$0e26d1f0$0400a8c0@dcccs' \
    --to=djani22@netcenter.hu \
    --cc=dgc@sgi.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@oss.sgi.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.