* 2.6.6 lockup
@ 2004-05-26 21:11 Garrick Staples
  2004-05-26 21:37 ` 2.6.6 lockupy J. Bruce Fields
  0 siblings, 1 reply; 6+ messages in thread
From: Garrick Staples @ 2004-05-26 21:11 UTC (permalink / raw)
  To: nfs

Hi all again,
   After fixing up the failover issues, I got my pair of Itaniums into
production with 2.6.5, and as soon as the real-world load went up, the machines
started freezing.  No network response, no console; only the SysRq keys work.

   I updated to 2.6.6 and it doesn't freeze as often, but it's still really
bad: at least a few times a day.  Unfortunately, I can't figure out how
to get a decent kernel trace.  Apparently SysRq-Crash doesn't work on ia64,
and the NMI watchdog doesn't work on ia64 either.  And I can't find any info
on a hardware watchdog on the mobo!

I do have some other info from SysRq on the CPU registers, memory, and
processes if anyone would find that interesting.

The workload that freezes the machines is heavy streaming writes from about
100 processes on at least 60 clients.  A combined load of about 80GB/hour is
enough to freeze a machine pretty regularly.

I tried Trond's 2.6.6 patches at his website, but those broke things
considerably.  Since I don't have any actual Oops messages, does anyone have
any experimental deadlock-fixing patches they want me to test? :)

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

* Re: 2.6.6 lockupy
  2004-05-26 21:11 2.6.6 lockup Garrick Staples
@ 2004-05-26 21:37 ` J. Bruce Fields
  2004-05-26 21:44   ` Garrick Staples
  0 siblings, 1 reply; 6+ messages in thread
From: J. Bruce Fields @ 2004-05-26 21:37 UTC (permalink / raw)
  To: nfs

On Wed, May 26, 2004 at 02:11:16PM -0700, Garrick Staples wrote:
> I tried Trond's 2.6.6 patches at his website, but those broke things
> considerably.  Since I don't have any actual Oops messages, anyone have any
> experimental deadlock-fixing patches they want me to test? :)

Note that Trond's patches are client-side only, so they shouldn't affect your
servers one way or another (unless I'm misunderstanding your setup).

--b.


-------------------------------------------------------
This SF.Net email is sponsored by: Oracle 10g
Get certified on the hottest thing ever to hit the market... Oracle 10g. 
Take an Oracle 10g class now, and we'll give you the exam FREE.
http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
_______________________________________________
NFS maillist  -  NFS@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nfs

* Re: 2.6.6 lockupy
  2004-05-26 21:37 ` 2.6.6 lockupy J. Bruce Fields
@ 2004-05-26 21:44   ` Garrick Staples
  2004-05-26 21:55     ` J. Bruce Fields
  0 siblings, 1 reply; 6+ messages in thread
From: Garrick Staples @ 2004-05-26 21:44 UTC (permalink / raw)
  To: nfs

On Wed, May 26, 2004 at 05:37:32PM -0400, J. Bruce Fields alleged:
> On Wed, May 26, 2004 at 02:11:16PM -0700, Garrick Staples wrote:
> > I tried Trond's 2.6.6 patches at his website, but those broke things
> > considerably.  Since I don't have any actual Oops messages, anyone have any
> > experimental deadlock-fixing patches they want me to test? :)
> 
> Note that Trond's patches are client-side only, so shouldn't affect your
> servers one way or another (unless I'm misunderstanding your
> setup).
> 
> --b.

You didn't misunderstand... but I was at a complete loss with production
machines suddenly dropping like flies.  I'm now at the stage where I just try
random things =P


-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

* Re: 2.6.6 lockupy
  2004-05-26 21:44   ` Garrick Staples
@ 2004-05-26 21:55     ` J. Bruce Fields
  2004-05-26 22:26       ` Garrick Staples
  2004-05-27  4:39       ` Garrick Staples
  0 siblings, 2 replies; 6+ messages in thread
From: J. Bruce Fields @ 2004-05-26 21:55 UTC (permalink / raw)
  To: nfs

On Wed, May 26, 2004 at 02:44:36PM -0700, Garrick Staples wrote:
> You didn't misunderstand... but I was at a complete loss with production
> machines suddenly dropping like flies.  I'm now at the stage where I just try
> random things =P

OK.  May as well send us what information you have on the lockups, and
maybe your .config while you're at it....

(Also, you mention 2.6.6 in your subject line but it sounds like this
happens to you under earlier kernels as well?  Are there any kernels
that are OK?)

--Bruce Fields


* Re: 2.6.6 lockupy
  2004-05-26 21:55     ` J. Bruce Fields
@ 2004-05-26 22:26       ` Garrick Staples
  2004-05-27  4:39       ` Garrick Staples
  1 sibling, 0 replies; 6+ messages in thread
From: Garrick Staples @ 2004-05-26 22:26 UTC (permalink / raw)
  To: nfs

On Wed, May 26, 2004 at 05:55:56PM -0400, J. Bruce Fields alleged:
> On Wed, May 26, 2004 at 02:44:36PM -0700, Garrick Staples wrote:
> > You didn't misunderstand... but I was at a complete loss with production
> > machines suddenly dropping like flies.  I'm now at the stage where I just try
> > random things =P
> 
> OK.  May as well send us what information you have on the lockups, and
> maybe your .config while you're at it....

It's more info than I feel like posting to the list, but I've got it at:
http://www-rds.usc.edu/~garrick/nfsprobs/

 
> (Also, you mention 2.6.6 in your subject line but it sounds like this
> happens to you under earlier kernels as well?  Are there any kernels
> that are OK?)

2.6.6 locks up less often than 2.6.5.  I also ran 2.6.3, but it had other SCSI
driver issues.  2.6.3 never saw this kind of load, but I can try it if you
want.

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California

* Re: 2.6.6 lockupy
  2004-05-26 21:55     ` J. Bruce Fields
  2004-05-26 22:26       ` Garrick Staples
@ 2004-05-27  4:39       ` Garrick Staples
  1 sibling, 0 replies; 6+ messages in thread
From: Garrick Staples @ 2004-05-27  4:39 UTC (permalink / raw)
  To: nfs

On Wed, May 26, 2004 at 05:55:56PM -0400, J. Bruce Fields alleged:
> On Wed, May 26, 2004 at 02:44:36PM -0700, Garrick Staples wrote:
> > You didn't misunderstand... but I was at a complete loss with production
> > machines suddenly dropping like flies.  I'm now at the stage where I just try
> > random things =P
> 
> OK.  May as well send us what information you have on the lockups, and
> maybe your .config while you're at it....

Filesystem corruption from the repeated lockups finally showed up today, so I
managed to get those two machines out of production for now.  I should be able
to figure out how to trigger the problem and hopefully get you some better
info tomorrow.

Btw, the failover capabilities of 2.6 have been very well tested over the last
few days :)  Nearly 15TB of data was swapped back and forth during heavy
writes.  Good job on that, guys!

Btw, reiserfs+lvm2 is very resilient!

(does anyone have advice on triggering a kernel trace on ia64?)

-- 
Garrick Staples, Linux/HPCC Administrator
University of Southern California
