* [parisc-linux] Machine hanging during high-traffic NFS
@ 2005-07-19 21:02 Kurt Fitzner
  2005-07-19 23:36 ` Michael S. Zick
  2005-07-20  1:04 ` Kyle McMartin
  0 siblings, 2 replies; 14+ messages in thread
From: Kurt Fitzner @ 2005-07-19 21:02 UTC (permalink / raw)
  To: parisc-linux

I've been using NFS to try to save backup images from my B132L
(2.6.12-pa2) with a simple:
  dd if=/dev/sda of=/mnt/bulk/sda-image bs=512

Every time the machine hangs solid - the heartbeat LED even stops.
Usually it hangs after around 1 to 2 gigs have been transferred.  There
are no log entries at the time of the hang.  It just... stops.

I'm using a 3c905 PCI ethernet card rather than the stock 10 megabit
LASI on board.

I'm wondering if this might be an issue with the ethernet driver when
compiled for PARISC.  I've tried very large ftp transfers and can't
reproduce the problem that way.

I've also tried NFS over TCP and tried reducing the rsize/wsize below
1500 bytes to prevent IP fragmentation.  Neither seems to help.

Are there any known NFS issues right now?  Any ideas?  Suggestions?

	Kurt.
_______________________________________________
parisc-linux mailing list
parisc-linux@lists.parisc-linux.org
http://lists.parisc-linux.org/mailman/listinfo/parisc-linux


* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-19 21:02 [parisc-linux] Machine hanging during high-traffic NFS Kurt Fitzner
@ 2005-07-19 23:36 ` Michael S. Zick
  2005-07-20  1:04 ` Kyle McMartin
  1 sibling, 0 replies; 14+ messages in thread
From: Michael S. Zick @ 2005-07-19 23:36 UTC (permalink / raw)
  To: parisc-linux

On Tue July 19 2005 16:02, Kurt Fitzner wrote:
> I've been using NFS to try to save backup images from my B132L
> (2.6.12-pa2) with a simple:
>   dd if=/dev/sda of=/mnt/bulk/sda-image bs=512
> 
> Every time the machine hangs solid - the heartbeat LED even stops.
> Usually it hangs after around 1 to 2 gigs have been transferred.  There
> are no log entries at the time of the hang.  It just... stops.
> 
> I'm using a 3c905 PCI ethernet card rather than the stock 10 megabit
> LASI on board.
> 
> I'm wondering if this might be an issue with the ethernet driver when
> compiled for PARISC.  I've tried very large ftp transfers and can't
> reproduce the problem that way.
> 
> I've also tried NFS over TCP and tried reducing the rsize/wsize below
> 1500 bytes to prevent IP fragmentation.  Neither seems to help.
> 
> Are there any known NFS issues right now?  Any ideas?  Suggestions?
> 
Questions/Suggestions only.

Any hints in the log of the receiving (nfs server) side?

Any portion of /dev/sda mounted somewhere?

Is the /mnt/bulk/sda-image mount point on /dev/sda* ?
That is, is there a drive in common with '/', '/mnt', '/dev'
and the entire device '/dev/sda' ?

Can you achieve your goal with a file copy rather than
a disk image?  Have you tried running rsync?

Can you successfully transfer (dd) a single file larger than
your trouble point size when trying to transfer the entire device?

Have you tried a block size other than 512 with the dd command?
Perhaps one that divides the packet payload size evenly, so that the
network stack does not have to split dd blocks across packets.

Mike
> 	Kurt.

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-19 21:02 [parisc-linux] Machine hanging during high-traffic NFS Kurt Fitzner
  2005-07-19 23:36 ` Michael S. Zick
@ 2005-07-20  1:04 ` Kyle McMartin
  2005-07-20  3:31   ` John David Anglin
  2005-07-20  6:59   ` Kurt Fitzner
  1 sibling, 2 replies; 14+ messages in thread
From: Kyle McMartin @ 2005-07-20  1:04 UTC (permalink / raw)
  To: Kurt Fitzner; +Cc: parisc-linux

On Tue, Jul 19, 2005 at 03:02:55PM -0600, Kurt Fitzner wrote:
> I'm using a 3c905 PCI ethernet card rather than the stock 10 megabit
> LASI on board.
> 
> I'm wondering if this might be an issue with the ethernet driver when
> compiled for PARISC.  I've tried very large ftp transfers and can't
> reproduce the problem that way.

TOC dump, IIR, IAOQ/IASQ locations? Come on people... It's really hard
to even begin to figure out what's wrong if no debugging information
has been provided...

http://www.parisc-linux.org/faq/kernelbug-howto.html

-- 
Kyle McMartin

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-20  3:31   ` John David Anglin
@ 2005-07-20  2:57     ` Thibaut VARENE
  2005-07-20 14:56       ` Matthew Wilcox
  0 siblings, 1 reply; 14+ messages in thread
From: Thibaut VARENE @ 2005-07-20  2:57 UTC (permalink / raw)
  To: John David Anglin; +Cc: Kyle McMartin, parisc-linux

JDA wrote:
> It might help to have a "stable" branch that is maintained longer
> than is current practice.  At a minimum, the current tree needs to
> be slushed until the main problems are resolved.

I definitely concur on that. That's something I already suggested on
IRC a while ago, and I believe that we probably all agree that there's
such a need. The questions are how to do it properly (maintaining a
separate "stable" branch is not absolutely trivial), and also *who* is
going to maintain it.

In any case, we really want to take time to fix our bugs. I don't know
if we need to somehow "freeze" our tree to do that. I believe we sort
of need that, since we keep injecting new bugs on top of mostly
unknown existing ones involving and/or impacting many different
subsystems.

Maybe some sort of "puffinfest" would help clean up our kernel
before the situation gets out of control.

I'd really like us to talk about that at OLS, with the guys who
are attending it ;)

my 2c

T-Bone




-- 
Thibaut VARENE
http://www.parisc-linux.org/~varenet/

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-20  1:04 ` Kyle McMartin
@ 2005-07-20  3:31   ` John David Anglin
  2005-07-20  2:57     ` Thibaut VARENE
  2005-07-20  6:59   ` Kurt Fitzner
  1 sibling, 1 reply; 14+ messages in thread
From: John David Anglin @ 2005-07-20  3:31 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: parisc-linux

> TOC dump, IIR, IAOQ/IASQ locations? Come on people... It's really hard
> to even begin to figure out what's wrong if no debugging information
> has been provided...

2.6.8.1-pa11 is quite stable (12 days up and numerous GCC builds).
As Joel has indicated, this can be pushed a bit further.  However,
2.6.10 and later are not stable.  Randolph and James resolved one
of the major bugs (fp register bug).  However, this wasn't sufficient
to stabilize 2.6.12 under load.  I spent considerable time trying
to isolate the change(s) that introduced the instability but this
is difficult and time consuming.

It might help to have a "stable" branch that is maintained longer
than is current practice.  At a minimum, the current tree needs to
be slushed until the main problems are resolved.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

PS: TOC on Linux gsyprf11.external.hp.com 2.6.11-pa4 #5 SMP Sat May 21 19:09:19 PDT 2005 parisc64 GNU/Linux

Proc 0 r2 -> return from call to __brelse in journal_put_journal_head
Proc 0 IIA Offset -> location in __brelse

Proc 1 IIA Offset -> to final loop in panic.

-----------------  Processor 0 TOC Information -------------------

General Registers 0 - 31
00-03  0000000000000000  000000001055f4c0  000000001023b3f4  0000000008108a24
04-07  0000000010552cc0  0000000204228918  0000000010517a40  000000013a7a0868
08-11  0000000000000001  0000000204228918  00000000ff85fc00  0000000000001000
12-15  0000000000000001  0000000013566ee8  0000000000000000  0000000204228918
16-19  0000000000000001  0000000000080000  0000000010553cc0  0000000000000000
20-23  0000000010517a40  0000000008108a24  000000000800000f  0000000008108a24
24-27  0000000000000001  00000000ffee4400  0000000204228930  0000000010552cc0
28-31  000000013a7a0868  00000000f8688a80  00000000f8688b30  0000000000000040


Control Registers 0 - 31
00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
08-11  0000000000010c02  0000000000000000  00000000000000c0  000000000000003f
12-15  0000000000000000  0000000000000000  0000000000106000  fff0000000000000
16-19  0002b96338e536de  0000000000000000  00000000101b68dc  0000000008000240
20-23  0000000000000000  0000000000000000  00000000080c000e  e200000000000000
24-27  00000000004cc000  00000000a4152000  0000000000041020  0000000041198b80
28-31  5555555555555555  5555555555555555  00000000f8688000  000000001051c000

Space Registers 0 - 7
00-03  04300800          04300800          00000000          04300800
04-07  00000000          00000000          00000000          00000000

IIA Space (back entry)       = 0x0000000000000000
IIA Offset (back entry)      = 0x00000000101b68d4
CPU State                    = 0x9e000001



-----------------  Processor 1 HPMC Information - PDC Version: 42.09  ------

   * * * No valid timestamp * * *


	  No HPMC chassis codes logged

	  General Registers 0 - 31
	  00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000


	  Control Registers 0 - 31
	  00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000

	  Space Registers 0 - 7
	  00-03  00000000          00000000          00000000          00000000
	  04-07  00000000          00000000          00000000          00000000


	  IIA Space (back entry)       = 0x0000000000000000
	  IIA Offset (back entry)      = 0x0000000000000000
	  Check Type                   = 0x00000000
	  CPU State                    = 0x00000000
	  Cache Check                  = 0x00000000
	  TLB Check                    = 0x00000000
	  Bus Check                    = 0x00000000
	  Assists Check                = 0x00000000
	  Assist State                 = 0x00000000
	  Path Info                    = 0x00000000
	  System Responder Address     = 0x0000000000000000
	  System Requestor Address     = 0x0000000000000000


	  Floating Point Registers 0 - 31
	  00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  08-11  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  12-15  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  16-19  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  20-23  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  24-27  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  28-31  0000000000000000  0000000000000000  0000000000000000  0000000000000000


	  Check Summary                = 0x0000000000000000
	  Available Memory             = 0x0000000000000000
	  CPU Diagnose Register 2      = 0x0000000000000000
	  CPU Status Register 0        = 0x0000000000000000
	  CPU Status Register 1        = 0x0000000000000000
	  SADD LOG                     = 0x0000000000000000
	  Read Short LOG               = 0x0000000000000000



	  -----------------  Processor 1 LPMC Information ------------------

	  Check Type                   = 0x00000000
	  IC Parity Info               = 0x00000000
	  Cache Check                  = 0x00000000
	  TLB Check                    = 0x00000000
	  Bus Check                    = 0x00000000
	  Assists Check                = 0x00000000
	  Assist State                 = 0x00000000
	  Path Info                    = 0x00000000
	  System Responder Address     = 0x0000000000000000
	  System Requestor Address     = 0x0000000000000000



	  -----------------  Processor 1 TOC Information -------------------

	  General Registers 0 - 31
	  00-03  0000000000000000  00000000103c5ca0  000000001014c628  00000000fe38c620
	  04-07  0000000010552cc0  00000000046db5d4  00000000105cc360  00000000105ccfb0
	  08-11  0000000000000018  0000000010424050  0000000000000001  00000000bd893300
	  12-15  0000000000000000  00000000ff85fc00  0000000000000008  00000000fe38c288
	  16-19  00000000fe38c620  00000000ff85b400  0000000000000000  0000000000024089
	  20-23  0002b96357cd9ae6  000000000009eb10  000000000000ff00  fffffff0f0430ed8
	  24-27  0000000000000520  0000000000000000  00000000046db5d4  0000000010552cc0
	  28-31  0000000000000000  00000000fe38cc20  00000000fe38cc50  0203010200802004


	  Control Registers 0 - 31
	  00-03  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  04-07  0000000000000000  0000000000000000  0000000000000000  0000000000000000
	  08-11  000000000000e486  0000000000000000  00000000000000c0  0000000000000038
	  12-15  0000000000000000  0000000000000000  0000000000106000  0000000000000000
	  16-19  0002b96357d799de  0000000000000000  000000001014c67c  00000000020008b3
	  20-23  00000000103403b8  00000000e338cba8  000000ff0804fd0f  8140000000000000
	  24-27  00000000004cc000  00000000d153a000  0000000000041020  000000f0f0165650
	  28-31  000000f0f0165650  5555555555555555  00000000fe38c000  0000000000008020

	  Space Registers 0 - 7
	  00-03  03921800          00000000          00000000          03921800
	  04-07  00000000          00000000          00000000          00000000

	  IIA Space (back entry)       = 0x0000000000000000
	  IIA Offset (back entry)      = 0x000000001014c680
	  CPU State                    = 0x9e000001


	  --------------  Memory Error Log Information  --------------

	  Bus 0 Log Information


	     No errors logged for this bus


	     ------------  I/O Module Error Log Information  ------------


		No I/O module errors logged


		Service Menu: Enter command >

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-20  1:04 ` Kyle McMartin
  2005-07-20  3:31   ` John David Anglin
@ 2005-07-20  6:59   ` Kurt Fitzner
  2005-07-20 16:40     ` Grant Grundler
  1 sibling, 1 reply; 14+ messages in thread
From: Kurt Fitzner @ 2005-07-20  6:59 UTC (permalink / raw)
  To: Kyle McMartin; +Cc: parisc-linux

Kyle McMartin wrote:

> TOC dump, IIR, IAOQ/IASQ locations? Come on people... It's really hard
> to even begin to figure out what's wrong if no debugging information
> has been provided...
> 
> http://www.parisc-linux.org/faq/kernelbug-howto.html

I apologize - I had not seen that page before.  I should have done more
research myself before reporting the issue.

There is no console output prior to the hang, and no kernel fault to
obtain the IAOQ/IASQ information from.  I did perform a TOC.  The data
from that is below.

If there is any further information that might help, please let me know.

	Kurt.


Information about the machine/kernel:
- Kernel 2.6.12-pa2
- Compiled with gcc 3.3.5 (Debian 1:3.3.5-13), Binutils 2.15-6
- B132L w/ 3COM 3c905 ethernet card
- System map at http://www.excelcia.org/~kfitzner/System.map-2.6.12
- Kernel config at http://www.excelcia.org/~kfitzner/config-2.6.12

Output of "ser pim toc":
General Registers 0 - 31
 0 -  3  0x00000000  0x10000000  0x101e3910  0x00000000
 4 -  7  0x1389a14c  0x1389a034  0x105fcf60  0x1389a108
 8 - 11  0x00000000  0x1389a14c  0x00000200  0x15273720
12 - 15  0x00000200  0x00000200  0x00000200  0x00000000
16 - 19  0x14502640  0x10428768  0x00000001  0x000280ca
20 - 23  0x17468122  0x000280ca  0x00000015  0x00000000
24 - 27  0x0000010f  0x1389a0f8  0x17468122  0x10412010
28 - 31  0x00000000  0x03980700  0x13980940  0x1014b700

Control Registers 0 - 31
 0 -  3  0x00000000  0x00000000  0x00000000  0x00000000
 4 -  7  0x00000000  0x00000000  0x00000000  0x00000000
 8 - 11  0x00002632  0x00000000  0x000000c0  0x00000010
12 - 15  0x00000000  0x00000000  0x0010b800  0xf1000000
16 - 19  0x4c31b913  0x00000000  0x1010c1b0  0x001f0e60
20 - 23  0x00000000  0x1010c19c  0x0004ff00  0x01000000
24 - 27  0x004a0000  0x0342a000  0xffffffff  0x40e5fb80
28 - 31  0xaaaaaaaa  0x11111111  0x13980000  0x104ac000

Space Registers 0 - 7
 0 -  3  0x00000000  0x00000000  0x00000000  0x00001319
 4 -  7  0x00000000  0x00000000  0x00000000  0x00000000

IIA Space                    = 0x00000000
IIA Offset                   = 0x1010c1b0
CPU State                    = 0x9e000001
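
For anyone who wants to pull individual values out of a dump like the one above by script, here is a minimal Python sketch. It is not a tool used in this thread. It assumes the 32-bit "ser pim toc" layout shown here (0x-prefixed values, four per row), so the unprefixed 64-bit dumps posted elsewhere would need a different row pattern. The function name parse_pim is hypothetical.

```python
import re

# Matches register rows such as " 0 -  3  0x00000000  0x10000000 ..."
ROW = re.compile(r"^\s*(\d+)\s*-\s*(\d+)\s+((?:0x[0-9a-fA-F]+\s*)+)$")
# Matches section headers such as "General Registers 0 - 31"
HEADER = re.compile(r"^\s*(\w+) Registers \d+ - \d+")


def parse_pim(text):
    """Return (registers, fields) parsed from a 32-bit 'ser pim toc' dump."""
    registers = {}   # section name -> {register number: value}
    fields = {}      # e.g. "IIA Offset" -> value
    section = None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            section = header.group(1)
            registers[section] = {}
            continue
        row = ROW.match(line)
        if row and section is not None:
            first = int(row.group(1))
            for i, value in enumerate(row.group(3).split()):
                registers[section][first + i] = int(value, 16)
            continue
        if "=" in line:
            key, _, value = line.partition("=")
            try:
                fields[key.strip()] = int(value.strip(), 16)
            except ValueError:
                pass  # not a hex value; ignore the line
    return registers, fields


# Excerpt copied from the dump above.
SAMPLE = """\
General Registers 0 - 31
 0 -  3  0x00000000  0x10000000  0x101e3910  0x00000000
 4 -  7  0x1389a14c  0x1389a034  0x105fcf60  0x1389a108

IIA Space                    = 0x00000000
IIA Offset                   = 0x1010c1b0
CPU State                    = 0x9e000001
"""

registers, fields = parse_pim(SAMPLE)
print(hex(registers["General"][2]))   # general register 2 (return pointer)
print(hex(fields["IIA Offset"]))      # interrupted instruction address
```
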

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-20  2:57     ` Thibaut VARENE
@ 2005-07-20 14:56       ` Matthew Wilcox
  0 siblings, 0 replies; 14+ messages in thread
From: Matthew Wilcox @ 2005-07-20 14:56 UTC (permalink / raw)
  To: Thibaut VARENE; +Cc: Kyle McMartin, John David Anglin, parisc-linux

On Wed, Jul 20, 2005 at 05:57:19AM +0300, Thibaut VARENE wrote:
> I definitely concur on that. That's something I already suggested on
> IRC a while ago, and I believe that we probably all agree that there's
> such a need. The questions are how to do it properly (maintaining a
> separate "stable" branch is not absolutely trivial), and also *who* is
> going to maintain it.

I believe I said at the time that you were more than welcome to maintain
such a thing.  If you're just volunteering me to do more work ... sorry,
not interested.

> In any case, we really want to take time to fix our bugs. I don't know
> if we need to somehow "freeze" our tree to do that. I believe we sort
> of need that, since we keep injecting new bugs on top of mostly
> unknown existing ones involving and/or impacting many different
> subsystems.

Sounds good to me

> Maybe some sort of "puffinfest" would help clean up our kernel
> before the situation gets out of control.
> 
> I'd really like us to talk about that at OLS, with the guys who
> are attending it ;)

We can certainly get together at some point ... this week's pretty busy though!

-- 
"Next the statesmen will invent cheap lies, putting the blame upon 
the nation that is attacked, and every man will be glad of those
conscience-soothing falsities, and will diligently study them, and refuse
to examine any refutations of them; and thus he will by and by convince 
himself that the war is just, and will thank God for the better sleep 
he enjoys after this process of grotesque self-deception." -- Mark Twain

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-20  6:59   ` Kurt Fitzner
@ 2005-07-20 16:40     ` Grant Grundler
  2005-07-21  7:42       ` Kurt Fitzner
  0 siblings, 1 reply; 14+ messages in thread
From: Grant Grundler @ 2005-07-20 16:40 UTC (permalink / raw)
  To: Kurt Fitzner; +Cc: Kyle McMartin, parisc-linux

On Wed, Jul 20, 2005 at 12:59:42AM -0600, Kurt Fitzner wrote:
> I apologize - I had not seen that page before.  I should have done more
> research myself before reporting the issue.

FAQ has a reference to it. But thanks for reporting the bug.

> There is no console output prior to the hang, and no kernel fault to
> obtain the IAOQ/IASQ information from.  I did perform a TOC.  The data
> from that is below.
> 
> If there is any further information that might help, please let me know.

Thanks. The key bits to start with are GR02 and IAOQ:

GR02 0x101e3910 nfs_mark_request_dirty+24
IAOQ 0x1010c1b0 intr_restore+11c

Sounds like either an interrupt storm from the card or a deadlock
in nfs code. Unfortunately TOC doesn't provide more stack trace
information. And I'm not able to chase NFS issues at the moment.
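
The address-to-symbol lookup used above (find the nearest symbol at or below the address, then report the offset) can be sketched in Python as follows. This is only an illustration, not the procedure actually used here. The two sample map lines are assumptions: their addresses were obtained by subtracting the quoted offsets (read as hexadecimal) from the register values, and the symbol type letters are placeholders. The real System.map linked earlier in the thread was not consulted.

```python
import bisect


def load_system_map(lines):
    """Parse 'address type name' lines into a sorted list of (address, name)."""
    symbols = []
    for line in lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip anything that is not a plain System.map entry
        address, _kind, name = parts
        symbols.append((int(address, 16), name))
    symbols.sort()
    return symbols


def resolve(symbols, address):
    """Return 'symbol+offset' (hex offset) for the nearest symbol at or below address."""
    addresses = [a for a, _ in symbols]
    i = bisect.bisect_right(addresses, address) - 1
    if i < 0:
        return None  # address lies below the first symbol
    base, name = symbols[i]
    return f"{name}+{address - base:x}"


# Illustrative entries only (derived, not copied from the real System.map).
SAMPLE_MAP = [
    "1010c094 t intr_restore",
    "101e38ec T nfs_mark_request_dirty",
]

symbols = load_system_map(SAMPLE_MAP)
print(resolve(symbols, 0x101e3910))   # GR02 from the TOC dump
print(resolve(symbols, 0x1010c1b0))   # IIA Offset from the TOC dump
```

With a real System.map, load_system_map(open(path)) would replace the sample list.
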

grant

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-20 16:40     ` Grant Grundler
@ 2005-07-21  7:42       ` Kurt Fitzner
  2005-07-21 12:36         ` Grant Grundler
  2005-07-21 16:04         ` Kyle McMartin
  0 siblings, 2 replies; 14+ messages in thread
From: Kurt Fitzner @ 2005-07-21  7:42 UTC (permalink / raw)
  To: parisc-linux

John David Anglin wrote:
> 2.6.8.1-pa11 is quite stable (12 days up and numerous GCC builds).

I have switched to that version and now cannot reproduce the hang
problem.  Thank you for the suggestion.

> It might help to have a "stable" branch that is maintained longer
> than is current practice.  At a minimum, the current tree needs to
> be slushed until the main problems are resolved.

I am used to the old classic 'stable' line where each successive kernel
release under the stable tree was (theoretically) more stable than the
previous one.  Perhaps, as a suggestion, a compromise can be reached by
relabelling kernels.  When one is found to be quite stable, label it
2.6.N-paX.  Otherwise, call it 2.6.N-paX-test.

It shouldn't require too much in the way of maintenance and it might
keep naive users (like me) from using unstable kernels before they are
ready to give meaningful bug reports and feedback on problems in them.

Grant Grundler wrote:
> Sounds like either an interrupt storm from the card or a deadlock
> in nfs code. Unfortunately TOC doesn't provide more stack trace
> information. And I'm not able to chase NFS issues at the moment.

I 'downgraded' to 2.6.8.1-pa11 as Mr. Anglin suggested and I am not able
to reproduce the hang.  Would it be helpful if I were to identify the
exact kernel version where the hang first begins to occur?

	Kurt.

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-21  7:42       ` Kurt Fitzner
@ 2005-07-21 12:36         ` Grant Grundler
  2005-07-21 23:28           ` John David Anglin
  2005-07-21 16:04         ` Kyle McMartin
  1 sibling, 1 reply; 14+ messages in thread
From: Grant Grundler @ 2005-07-21 12:36 UTC (permalink / raw)
  To: Kurt Fitzner; +Cc: parisc-linux

On Thu, Jul 21, 2005 at 01:42:12AM -0600, Kurt Fitzner wrote:
> I 'downgraded' to 2.6.8.1-pa11 as Mr. Anglin suggested and I am not able
> to reproduce the hang.  Would it be helpful if I were to identify the
> exact kernel version where the hang first begins to occur?

Definitely. Be warned that this can be very time consuming.
If you can narrow the window to a major kernel release,
that would already be very helpful. The exact -paX version
would of course be perfect.

thanks,
grant

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-21  7:42       ` Kurt Fitzner
  2005-07-21 12:36         ` Grant Grundler
@ 2005-07-21 16:04         ` Kyle McMartin
  1 sibling, 0 replies; 14+ messages in thread
From: Kyle McMartin @ 2005-07-21 16:04 UTC (permalink / raw)
  To: Kurt Fitzner; +Cc: parisc-linux

On Thu, Jul 21, 2005 at 01:42:12AM -0600, Kurt Fitzner wrote:
> I 'downgraded' to 2.6.8.1-pa11 as Mr. Anglin suggested and I am not able
> to reproduce the hang.  Would it be helpful if I were to identify the
> exact kernel version where the hang first begins to occur?
> 

Binary searching from 2.6.8.1-pa11 onwards would be helpful.

[Pick the version midway between 2.6.8.1-pa11 and current. If it's
 broken, next try the midpoint between 2.6.8.1-pa11 and that version;
 otherwise try the midpoint between that version and current. Continue
 until you have narrowed the window.]
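
The halving procedure described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not a script anyone in this thread used. The version list contains only kernels named in the thread and is not a complete release list. The STAND_IN_BAD set is a placeholder for "boot the kernel and run the NFS transfer" and is not a test result.

```python
def first_bad(versions, is_bad):
    """Binary search for the first bad version.

    versions is ordered oldest to newest; versions[0] is assumed known good
    and versions[-1] is assumed known bad. Returns (first bad version,
    list of versions that had to be tested).
    """
    good, bad = 0, len(versions) - 1
    tested = []
    while bad - good > 1:
        mid = (good + bad) // 2
        tested.append(versions[mid])
        if is_bad(versions[mid]):
            bad = mid    # breakage is at or before mid
        else:
            good = mid   # breakage is after mid
    return versions[bad], tested


# Kernels mentioned in this thread, oldest first (incomplete on purpose).
VERSIONS = ["2.6.8.1-pa11", "2.6.10-pa11", "2.6.11-pa1",
            "2.6.11-pa4", "2.6.12-pa2"]

# Placeholder outcome, standing in for real boot-and-transfer tests.
STAND_IN_BAD = {"2.6.11-pa4", "2.6.12-pa2"}

culprit, tested = first_bad(VERSIONS, lambda v: v in STAND_IN_BAD)
print(culprit, tested)
```

Each round halves the window, so N candidate kernels need roughly log2(N) test boots rather than N.
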

Cheers,
-- 
Kyle McMartin

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-21 12:36         ` Grant Grundler
@ 2005-07-21 23:28           ` John David Anglin
  2005-07-22  0:29             ` Kurt Fitzner
  0 siblings, 1 reply; 14+ messages in thread
From: John David Anglin @ 2005-07-21 23:28 UTC (permalink / raw)
  To: Grant Grundler; +Cc: parisc-linux

> On Thu, Jul 21, 2005 at 01:42:12AM -0600, Kurt Fitzner wrote:
> > I 'downgraded' to 2.6.8.1-pa11 as Mr. Anglin suggested and I am not able
> > to reproduce the hang.  Would it be helpful if I were to identify the
> > exact kernel version where the hang first begins to occur?
> 
> Definitely. Be warned that this can be very time consuming.
> If you can narrow the window to a major kernel release,
> that would already be very helpful. The exact -paX version
> would of course be perfect.

I know that 32-bit 2.6.10 isn't stable on my c3k.  There is a known
bug with kernel memcpy and fpregs.  Either James' fix needs to be
backported, or builds need to be done with gcc-4.0.0 (or 4.0.1) using
the -mfixed-range option as discussed previously on the list.  I haven't
had time to try this.

Dave
-- 
J. David Anglin                                  dave.anglin@nrc-cnrc.gc.ca
National Research Council of Canada              (613) 990-0752 (FAX: 952-6602)

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-21 23:28           ` John David Anglin
@ 2005-07-22  0:29             ` Kurt Fitzner
  2005-07-22  3:55               ` Grant Grundler
  0 siblings, 1 reply; 14+ messages in thread
From: Kurt Fitzner @ 2005-07-22  0:29 UTC (permalink / raw)
  To: parisc-linux

John David Anglin wrote:

> I know that 32-bit 2.6.10 isn't stable on my c3k.  There is a known
> bug with kernel memcpy and fpregs.

Well, as far as the bug I am reporting goes, so far I have narrowed it
down to a kernel later than 2.6.10-pa11 and before 2.6.11-pa4.  It
appears that whatever went into 2.6.10 isn't to blame.

It looks like the interrupt storm theory is best.  The functions I get
from the TOC data this time are:

GR02 0x101060e0 handle_interruption+6c
IAOQ 0x101120dc handle_unaligned+2c0

I am curious, though.  This time when it hung the heartbeat didn't stop.
Does this mean that it didn't hang as solid as the other times?  Are
interrupts still being handled at some kernel level if the heartbeat LED
is flashing normally?  If this is the case, then this means the TOC data
may be useless this time around, right?

	Kurt.

* Re: [parisc-linux] Machine hanging during high-traffic NFS
  2005-07-22  0:29             ` Kurt Fitzner
@ 2005-07-22  3:55               ` Grant Grundler
  0 siblings, 0 replies; 14+ messages in thread
From: Grant Grundler @ 2005-07-22  3:55 UTC (permalink / raw)
  To: Kurt Fitzner; +Cc: parisc-linux

On Thu, Jul 21, 2005 at 06:29:46PM -0600, Kurt Fitzner wrote:
> John David Anglin wrote:
> 
> > I know that 32-bit 2.6.10 isn't stable on my c3k.  There is a known
> > bug with kernel memcpy and fpregs.
> 
> Well, as far as the bug I am reporting goes, so far I have narrowed it
> down to a kernel later than 2.6.10-pa11 and before 2.6.11-pa4.  It
> appears that whatever went into 2.6.10 isn't to blame.

Ok...If you were to try one more kernel, could it be 2.6.11-pa1?

> It looks like the interrupt storm theory is best.  The functions I get
> from the TOC data this time are:
> 
> GR02 0x101060e0 handle_interruption+6c
> IAOQ 0x101120dc handle_unaligned+2c0

Yes, that seems likely to me too.

> I am curious, though.  This time when it hung the heartbeat didn't stop.
> Does this mean that it didn't hang as solid as the other times?

that would be my guess too.

> Are
> interrupts still being handled at some kernel level if the heartbeat LED
> is flashing normally?

yes

> If this is the case, then this means the TOC data
> may be useless this time around, right?

Not necessarily. The TOC data may still be useful
for register state.

grant
