very strange issue with sata,<4G Ram, and ext3

public inbox for linux-kernel@vger.kernel.org

* very strange issue with sata,<4G Ram, and ext3
@ 2005-04-28 16:16 Rick Warner
  2005-04-28 17:32 ` Rick Warner
  2005-04-28 22:48 ` Alan Cox
  0 siblings, 2 replies; 8+ messages in thread
From: Rick Warner @ 2005-04-28 16:16 UTC
  To: linux-kernel

Hello,
 We are having a very strange issue on some 64bit systems.  We have a 32 node 
cluster of EM64T's (supermicro boards).  We are using our node restore 
software to propagate a linux install onto them.  We do a pxe boot to a 
kernel and initrd image.  The initrd has some config info, a basic root 
filesystem, and a restore script.  The kernel is passed init=/restore  (the 
restore script itself).  The script runs dhcp, gets an ip, then nfs mounts 
the master node of the cluster.  The backup image is stored on the master 
node's nfs mount.  The script then applies a backed up partition table and 
then mkfs's the partitions, mounts them, untars a backup tar to the drive, 
and then makes it bootable with grub.
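
A minimal sketch of that restore sequence, written as a dry-run planner in Python rather than as the real restore script (which is not shown in this thread). The `/mnt` and `/drive` mount points and the `grub --batch --no-floppy` call appear in the console log later in the thread; the NFS export, partition-table file, and tarball path are hypothetical placeholders, and the exact commands the real script runs are an assumption.

```python
def restore_plan(disk, table_file, filesystems, nfs_export, image):
    """Return the restore steps as shell command strings, in order.

    Nothing is executed; this only illustrates the ordering described above.
    `filesystems` is a list of (device, mountpoint) pairs, e.g. ("/dev/sda5", "/").
    """
    steps = [
        "dhclient",                           # get an IP address
        f"mount -t nfs {nfs_export} /mnt",    # mount the master node's export
        f"sfdisk {disk} < {table_file}",      # apply the saved partition table
    ]
    # Create an ext3 filesystem (ext2 plus journal) on every data partition.
    steps += [f"mke2fs -j {dev}" for dev, _ in filesystems]
    # Mount parents before children: "/" sorts before "/boot", "/usr", ...
    for dev, mountpoint in sorted(filesystems, key=lambda fs: fs[1]):
        target = f"/drive{mountpoint}".rstrip("/")
        steps.append(f"mount {dev} {target}")
    steps.append(f"tar -xpf {image} -C /drive")            # unpack the backup
    steps.append("chroot /drive grub --batch --no-floppy")  # make it bootable
    return steps


if __name__ == "__main__":
    plan = restore_plan(
        disk="/dev/sda",
        table_file="/mnt/sda.table",        # hypothetical path
        filesystems=[("/dev/sda1", "/boot"), ("/dev/sda5", "/")],
        nfs_export="master:/backup",        # hypothetical export
        image="/mnt/node.tar",              # hypothetical path
    )
    print("\n".join(plan))
```

Listing the steps this way makes it easier to say at which stage the errors first appear (mkfs with 512M, untar with 2G).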

 On these systems, we are getting ext2 errors from the initrd during the 
untarring.  Soon after, we start getting seg faults on random things (looks 
like stuff caused by the still running dhcp client), and then a continuous 
stream of segfaults on the restore script itself (restore[1]).

 The systems being restored are dual em64t's with 2G of ram and 200G sata 
drives.  If we up the memory to 4G, the restores complete without error. If 
we reduce down to 512M, the segfaults start at the mkfs stage instead of the 
untar stage. We've tried different sata drives and controllers without 
change.  Switching to ide drives works.  Switching to reiserfs instead of 
ext3 for the destination drives works too.  We've tried enabling the scsi 
debug stuff as well as the jbd debug stuff for ext3 without getting any more 
info.  We also enabled the kernel debug options.  We've also tried using 
the deprecated ide based sata drivers instead of the scsi based ones without 
success.  We have tried restoring to Intel's Jarrell EM64T systems as well as 
an Arima HDAMA opteron with the same errors.  We've also tried adding swap 
space ASAP in the initrd image.  

 This problem is really baffling us and we're not quite sure what to check 
into next.  Any ideas?

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517

* Re: very strange issue with sata,<4G Ram, and ext3
  2005-04-28 16:16 very strange issue with sata,<4G Ram, and ext3 Rick Warner
@ 2005-04-28 17:32 ` Rick Warner
  2005-04-28 22:48 ` Alan Cox
  1 sibling, 0 replies; 8+ messages in thread
From: Rick Warner @ 2005-04-28 17:32 UTC
  To: linux-kernel

I forgot to mention the kernels that have been tried: 2.6.8.1, 2.6.11.7, 
2.6.12-rc3, and a Red Hat 2.6.9.


On Thursday 28 April 2005 12:16 pm, Rick Warner wrote:
> Hello,
>  We are having a very strange issue on some 64bit systems.  We have a 32
> node cluster of EM64T's (supermicro boards).  We are using our node restore
> software to propagate a linux install onto them.  We do a pxe boot to a
> kernel and initrd image.  The initrd has some config info, a basic root
> filesystem, and a restore script.  The kernel is passed init=/restore  (the
> restore script itself).  The script runs dhcp, gets an ip, then nfs mounts
> the master node of the cluster.  The backup image is stored on the master
> node's nfs mount.  The script then applies a backed up partition table and
> then mkfs's the partitions, mounts them, untars a backup tar to the drive,
> and then makes it bootable with grub.
>
>  On these systems, we are getting ext2 errors from the initrd during the
> untarring.  Soon after, we start getting seg faults on random things (looks
> like stuff caused by the still running dhcp client), and then a continuous
> stream of segfaults on the restore script itself (restore[1]).
>
>  The systems being restored are dual em64t's with 2G of ram and 200G sata
> drives.  If we up the memory to 4G, the restores complete without error. If
> we reduce down to 512M, the segfaults start at the mkfs stage instead of
> the untar stage. We've tried different sata drives and controllers without
> change.  Switching to ide drives works.  Switching to reiserfs instead of
> ext3 for the destination drives works too.  We've tried enabling the scsi
> debug stuff as well as the jbd debug stuff for ext3 without getting any
> more info.  We also enabled the kernel debug options.  We've also tried
> using the deprecated ide based sata drivers instead of the scsi based ones
> without success.  We have tried restoring to Intel's Jarrell EM64T systems
> as well as an Arima HDAMA opteron with the same errors.  We've also tried
> adding swap space ASAP in the initrd image.
>
>  This problem is really baffling us and we're not quite sure what to check
> into next.  Any ideas?

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517

* Re: very strange issue with sata,<4G Ram, and ext3
  2005-04-28 16:16 very strange issue with sata,<4G Ram, and ext3 Rick Warner
  2005-04-28 17:32 ` Rick Warner
@ 2005-04-28 22:48 ` Alan Cox
  2005-04-29 14:45   ` Rick Warner
  1 sibling, 1 reply; 8+ messages in thread
From: Alan Cox @ 2005-04-28 22:48 UTC
  To: Rick Warner; +Cc: Linux Kernel Mailing List

On Thu, 2005-04-28 at 17:16, Rick Warner wrote:
>  On these systems, we are getting ext2 errors from the initrd during the 
> untarring.  Soon after, we start getting seg faults on random things (looks 
> like stuff caused by the still running dhcp client), and then a continuous 
> stream of segfaults on the restore script itself (restore[1]).

This sounds almost like the pxe/boot code is still using ram that the
kernel has now used (e.g. the PXE layer or pxe booter forgot to close the
client and it's still DMAing happily into the kernel).



* Re: very strange issue with sata,<4G Ram, and ext3
  2005-04-28 22:48 ` Alan Cox
@ 2005-04-29 14:45   ` Rick Warner
  2005-05-04 19:29     ` Rick Warner
  2005-05-05 21:37     ` Krzysztof Halasa
  0 siblings, 2 replies; 8+ messages in thread
From: Rick Warner @ 2005-04-29 14:45 UTC
  To: Alan Cox; +Cc: Linux Kernel Mailing List

On Thursday 28 April 2005 06:48 pm, Alan Cox wrote:
> On Thu, 2005-04-28 at 17:16, Rick Warner wrote:
> >  On these systems, we are getting ext2 errors from the initrd during the
> > untarring.  Soon after, we start getting seg faults on random things
> > (looks like stuff caused by the still running dhcp client), and then a
> > continuous stream of segfaults on the restore script itself (restore[1]).
>
> This sounds almost like the pxe/boot code is still using ram that the
> kernel has now used (e.g. the PXE layer or pxe booter forgot to close the
> client and it's still DMAing happily into the kernel).
This morning, we tried updating to a newer pxelinux (3.07) and had the same 
results.  We then tried using etherboot with a mknbi tagged image and also 
had the same results.   Since we are getting the same problem on 3 different 
motherboards with 2 different network adapters, I have not looked into 
updating the boot rom on the nics.  Should I?

What should I look into next?  I have attached a serial console log of the 
system and errors.  The slashes and pipes you see are from a spinning bar 
thing.  If you want output that is cleaned up without that, I can provide it.
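
For anyone who prefers to clean the log locally, here is a rough Python sketch that strips the spinner frames. It assumes each frame is a run of tabs followed by one of `|`, `/`, `-` or `\`, which is how they appear in the attached log; the function names are made up for illustration, and a legitimate line that happens to contain a tab followed by one of those characters would be clipped too.

```python
import re

# One or more spinner frames: a run of tabs, then a single spinner character.
SPINNER_RE = re.compile(r"(?:\t+[|/\\-])+")


def strip_spinner(line):
    """Remove spinner frames from a single log line."""
    return SPINNER_RE.sub("", line)


def clean_log(text):
    """Strip spinner frames and drop lines that consisted only of spinner output."""
    kept = []
    for line in text.splitlines():
        stripped = strip_spinner(line)
        # Keep lines that still have content, and lines that were blank to begin with.
        if stripped.strip() or not line.strip():
            kept.append(stripped)
    return "\n".join(kept)


if __name__ == "__main__":
    sample = "Restoring drive....\n\t\t\t|\t\t\t/\n\t\t\t-EXT2-fs error (device ram0)"
    print(clean_log(sample))
```

Kernel messages that were printed in the middle of a spinner line end up at the start of their own line, which makes the EXT2-fs and segfault entries easier to grep.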

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517

[-- Attachment #2: new-pxelinux.txt --]
[-- Type: text/plain, Size: 19985 bytes --]

Bootdata ok (command line is initrd=initrd.img.gz ramdisk_size=46080 rw root=/dev/ram0 devfs=nomount init=/restore console=tty0 console=ttyS0,115200 BOOT_IMAGE=vmlinuz )
Linux version 2.6.12-rc3-em64t-mcms (root@master.cl.usgs.gov) (gcc version 3.4.3 20041212 (Red Hat 3.4.3-9.EL4)) #6 Thu Apr 28 10:11:16 EDT 2005
BIOS-provided physical RAM map:
 BIOS-e820: 0000000000000000 - 000000000009b400 (usable)
 BIOS-e820: 000000000009b400 - 00000000000a0000 (reserved)
 BIOS-e820: 00000000000e4000 - 0000000000100000 (reserved)
 BIOS-e820: 0000000000100000 - 000000007ff70000 (usable)
 BIOS-e820: 000000007ff70000 - 000000007ff78000 (ACPI data)
 BIOS-e820: 000000007ff78000 - 000000007ff80000 (ACPI NVS)
 BIOS-e820: 000000007ff80000 - 0000000080000000 (reserved)
 BIOS-e820: 00000000e0000000 - 00000000f0000000 (reserved)
 BIOS-e820: 00000000fec00000 - 00000000fec10000 (reserved)
 BIOS-e820: 00000000fee00000 - 00000000fee01000 (reserved)
 BIOS-e820: 00000000ff800000 - 00000000ffc00000 (reserved)
 BIOS-e820: 00000000fffffc00 - 0000000100000000 (reserved)
Intel MultiProcessor Specification v1.4
    Virtual Wire compatibility mode.
OEM ID: INTEL    <6>Product ID: Lindenhurst  <6>APIC at: 0xFEE00000
Processor #0 15:4 APIC version 20
Processor #6 15:4 APIC version 20
WARNING: NR_CPUS limit of 1 reached. Processor ignored.
I/O APIC #2 Version 32 at 0xFEC00000.
I/O APIC #3 Version 32 at 0xFEC80000.
I/O APIC #4 Version 32 at 0xFEC80400.
I/O APIC #5 Version 32 at 0xFEC84000.
I/O APIC #8 Version 32 at 0xFEC84400.
Setting APIC routing to flat
Processors: 1
Allocating PCI resources starting at 80000000 (gap: 80000000:60000000)
Checking aperture...
Built 1 zonelists
Kernel command line: initrd=initrd.img.gz ramdisk_size=46080 rw root=/dev/ram0 devfs=nomount init=/restore console=tty0 console=ttyS0,115200 BOOT_IMAGE=vmlinuz 
Initializing CPU#0
PID hash table entries: 4096 (order: 12, 131072 bytes)
time.c: Using 1.193182 MHz PIT timer.
time.c: Detected 3000.254 MHz processor.
time.c: Using PIT/TSC based timekeeping.
Console: colour VGA+ 80x25
Dentry cache hash table entries: 524288 (order: 10, 4194304 bytes)
Inode-cache hash table entries: 262144 (order: 9, 2097152 bytes)
Placing software IO TLB between 0x3243000 - 0x5243000
Memory: 2007900k/2096576k available (2932k kernel code, 87916k reserved, 1263k data, 140k init)
Mount-cache hash table entries: 256
CPU: Trace cache: 12K uops, L1 D cache: 16K
CPU: L2 cache: 1024K
using mwait in idle threads.
CPU0: Thermal monitoring enabled (TM1)
CPU:                   Intel(R) Xeon(TM) CPU 3.00GHz stepping 01
Using IO APIC NMI watchdog
Using IO-APIC 2
Using IO-APIC 3
Using IO-APIC 4
Using IO-APIC 5
Using IO-APIC 8
activating NMI Watchdog ... done.
testing NMI watchdog ... OK.
Using local APIC timer interrupts.
Detected 12.500 MHz APIC timer.
checking if image is initramfs...it isn't (no cpio magic); looks like an initrd
NET: Registered protocol family 16
PCI: Using configuration type 1
mtrr: v2.0 (20020519)
SCSI subsystem initialized
PCI: Probing PCI hardware
PCI: Probing PCI hardware (bus 00)
PCI: Transparent bridge - 0000:00:1e.0
PCI: Using IRQ router PIIX/ICH [8086/24d0] at 0000:00:1f.0
PCI->APIC IRQ transform: 0000:00:02.0[A] -> IRQ 16
PCI->APIC IRQ transform: 0000:00:04.0[A] -> IRQ 16
PCI->APIC IRQ transform: 0000:00:06.0[A] -> IRQ 16
PCI->APIC IRQ transform: 0000:00:1d.0[A] -> IRQ 16
PCI->APIC IRQ transform: 0000:00:1d.1[B] -> IRQ 19
PCI->APIC IRQ transform: 0000:00:1d.2[C] -> IRQ 18
PCI->APIC IRQ transform: 0000:00:1d.3[A] -> IRQ 16
PCI->APIC IRQ transform: 0000:00:1d.7[D] -> IRQ 23
PCI->APIC IRQ transform: 0000:00:1f.2[A] -> IRQ 18
PCI->APIC IRQ transform: 0000:00:1f.3[B] -> IRQ 17
PCI->APIC IRQ transform: 0000:03:02.0[A] -> IRQ 54
PCI->APIC IRQ transform: 0000:03:02.1[B] -> IRQ 55
PCI->APIC IRQ transform: 0000:08:01.0[A] -> IRQ 17
PCI-DMA: Using software bounce buffering for IO (SWIOTLB)
IA32 emulation $Id: sys_ia32.c,v 1.32 2002/03/24 13:02:28 ak Exp $
Total HugeTLB memory allocated, 0
JFS: nTxBlock = 8192, nTxLock = 65536
SGI XFS with large block/inode numbers, no debug enabled
Linux agpgart interface v0.101 (c) Dave Jones
Hangcheck: starting hangcheck timer 0.5.0 (tick is 180 seconds, margin is 60 seconds).
serio: i8042 AUX port at 0x60,0x64 irq 12
serio: i8042 KBD port at 0x60,0x64 irq 1
Serial: 8250/16550 driver $Revision: 1.90 $ 8 ports, IRQ sharing disabled
ttyS0 at I/O 0x3f8 (irq = 4) is a 16550A
ttyS1 at I/O 0x2f8 (irq = 3) is a 16550A
io scheduler noop registered
io scheduler anticipatory registered
io scheduler deadline registered
io scheduler cfq registered
Floppy drive(s): fd0 is 1.44M
FDC 0 is a National Semiconductor PC87306
RAMDISK driver initialized: 8 RAM disks of 46080K size 1024 blocksize
Uniform Multi-Platform E-IDE driver Revision: 7.00alpha2
ide: Assuming 33MHz system bus speed for PIO modes; override with idebus=xx
ata1: SATA max UDMA/133 cmd 0x14E8 ctl 0x14DE bmdma 0x14B0 irq 18
ata2: SATA max UDMA/133 cmd 0x14E0 ctl 0x14DA bmdma 0x14B8 irq 18
ata1: dev 0 ATA, max UDMA/133, 398297088 sectors: lba48
ata1: dev 0 configured for UDMA/133
scsi0 : ata_piix
ata2: SATA port has no device.
scsi1 : ata_piix
  Vendor: ATA       Model: Maxtor 6B200M0    Rev: BANC
  Type:   Direct-Access                      ANSI SCSI revision: 05
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sda: drive cache: write back
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda4 < sda5 sda6 sda7 sda8 sda9 sda10 >
Attached scsi disk sda at scsi0, channel 0, id 0, lun 0
Attached scsi generic sg0 at scsi0, channel 0, id 0, lun 0,  type 0
mice: PS/2 mouse device common for all mice
NET: Registered protocol family 2
IP: routing cache hash table of 2048 buckets, 112Kbytes
TCP established hash table entries: 524288 (order: 10, 4194304 bytes)
TCP bind hash table entries: 65536 (order: 9, 3670016 bytes)
TCP: Hash tables configured (established 524288 bind 65536)
NET: Registered protocol family 1
NET: Registered protocol family 17
RAMDISK: Compressed image found at block 0
VFS: Mounted root (ext2 filesystem).
Freeing unused kernel memory: 140k freed
Turning on debugging options.
Intel(R) PRO/1000 Network Driver - version 5.7.6-k2
Copyright (c) 1999-2004 Intel Corporation.
e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
Internet Software Consortium DHCP Client V3.0.1rc13
Copyright 1995-2002 Internet Software Consortium.
All rights reserved.
For info, please visit http://www.isc.org/products/DHCP

e1000: eth0: e1000_watchdog_task: NIC Link is Up 1000 Mbps Full Duplex
Listening on LPF/eth1/00:30:48:74:a5:71
Sending on   LPF/eth1/00:30:48:74:a5:71
Listening on LPF/eth0/00:30:48:74:a5:70
Sending on   LPF/eth0/00:30:48:74:a5:70
Listening on LPF/lo/
Sending on   LPF/lo/
Sending on   Socket/fallback
DHCPDISCOVER on eth0 to 255.255.255.255 port 67 interval 3
DHCPDISCOVER on lo to 255.255.255.255 port 67 interval 2
DHCPOFFER from 10.0.0.1
DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 3
DHCPDISCOVER on lo to 255.255.255.255 port 67 interval 3
DHCPDISCOVER on eth1 to 255.255.255.255 port 67 interval 4
DHCPREQUEST on eth0 to 255.255.255.255 port 67
DHCPDISCOVER on lo to 255.255.255.255 port 67 interval 5
DHCPACK from 10.0.0.1
bound to 10.0.0.100 -- renewal in 1455 seconds.
Brought up network devices.
Setting date to match master
Permission denied.
Fri Apr 29 01:51:44 EDT 2005
nfs warning: mount version older than kernel
Checking that noSCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
-one is using thSCSI device sda: drive cache: write back
is disk right no sda:w ...
 sda1 sda4 < sda5 sda6 sda7 sda8 sda9 sda10 >
OK

Disk /dev/sda: 24792 cylinders, 255 heads, 63 sectors/track
Old situation:
Units = cylinders of 8225280 bytes, blocks of 1024 bytes, counting from 0

   Device Boot Start     End   #cyls    #blocks   Id  System
/dev/sda1          0+      9      10-     80293+  83  Linux
/dev/sda2          0       -       0          0    0  Empty
/dev/sda3          0       -       0          0    0  Empty
/dev/sda4         10   24791   24782  199061415    5  Extended
/dev/sda5         10+    253     244-   1959898+  83  Linux
/dev/sda6        254+   1470    1217-   9775521   83  Linux
/dev/sda7       1471+   1836     366-   2939863+  83  Linux
/dev/sda8       1837+   2202     366-   2939863+  83  Linux
/dev/sda9       2203+   2689     487-   3911796   82  Linux swap
/dev/sda10      2690+  24791   22102- 177534283+  83  Linux
New situation:
Units = sectors of 512 bytes, counting from 0

   Device Boot    Start       End   #sectors  Id  System
/dev/sda1            63    160649     160587  83  Linux
/dev/sda2             0         -          0   0  Empty
/dev/sda3             0         -          0   0  Empty
/dev/sda4        160650 398283479  398122830   5  Extended
/dev/sda5        160713   4080509    3919797  83  Linux
/dev/sda6       4080573  23631614   19551042  83  Linux
/dev/sda7      23631678  29511404    5879727  83  Linux
/dev/sda8      29511468  35391194    5879727  83  Linux
/dev/sda9      35391258  43214849    7823592  82  Linux swap
/dev/sda10     43214913 398283479  355068567  83  Linux
Warning: no primary partition is marked bootable (active)
This does not matter for LILO, but the DOS MBR will not boot this disk.
Successfully wrote the new partition table

Re-reading the partition table ...
SCSI device sda: 398297088 512-byte hdwr sectors (203928 MB)
SCSI device sda: drive cache: write back
 sda: sda1 sda4 < sda5 sda6 sda7 sda8 sda9 sda10 >

If you created or changed a DOS partition, /dev/foo7, say, then use dd(1)
to zero the first 512 bytes:  dd if=/dev/zero of=/dev/foo7 bs=512 count=1
(See fdisk(8).)
grep: /mnt/raidtab: No such file or directory
Created ext2/3 filesystem on /dev/sda1
Created ext2/3 filesystem on /dev/sda5
Created ext2/3 filesystem on /dev/sda6
Created ext2/3 filesystem on /dev/sda7
Created ext2/3 filesystem on /dev/sda8
Created ext2/3 filesystem on /dev/sda10
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda5, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Mounted /dev/sda5 at /drive
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda1, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Mounted /dev/sda1 at /drive/boot
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda6, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Mounted /dev/sda6 at /drive/usr
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda7, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Mounted /dev/sda7 at /drive/var
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda8, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Mounted /dev/sda8 at /drive/tmp
kjournald starting.  Commit interval 5 seconds
EXT3 FS on sda10, internal journal
EXT3-fs: mounted filesystem with ordered data mode.
Mounted /dev/sda10 at /drive/home
Adding 3911788k swap on /dev/sda9.  Priority:-1 extents:1
Swapspace /dev/sda9 initialized and added
Restoring drive....
						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=24576, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=28672, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=32768, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=36864, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=40960, inode=0, rec_len=0, name_len=0
						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=12288, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=16384, inode=0, rec_len=0, name_len=0
uname[1129]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffefe920 error 4
sed[1133]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffb99d90 error 4
sed[1136]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffcbdf90 error 4
						\						|						/						-						\						|						/						-						\						|uname[1150]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffac5a40 error 4
						/sed[1155]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffb24b70 error 4
sed[1158]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffa44d50 error 4
						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=12288, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=16384, inode=0, rec_len=0, name_len=0
uname[1231]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007ffffff0d1f0 error 4
sed[1235]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffd9aea0 error 4
sed[1238]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffbb5710 error 4
						-uname[1242]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffcc1300 error 4
sed[1246]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007ffffffe9910 error 4
sed[1249]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffef7ce0 error 4
						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=20480, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=24576, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=28672, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=32768, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=36864, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=40960, inode=0, rec_len=0, name_len=0
uname[1307]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffbce170 error 4
sed[1311]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffff96b900 error 4
sed[1314]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffff9557d0 error 4
						|						/						-						\						|						/						-						\						|						/						-						\uname[1330]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffbb4930 error 4
sed[1334]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffff9200d0 error 4
sed[1337]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007ffffff780f0 error 4
						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-uname[1395]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffa83250 error 4
sed[1399]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007ffffffa8380 error 4
sed[1402]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffbc0690 error 4
						\						|						/						-						\						|						/						-						\						|uname[1419]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007ffffff07bf0 error 4
sed[1423]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffff82cb00 error 4
sed[1426]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffa072b0 error 4
						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-						\						|						/						-EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=12288, inode=0, rec_len=0, name_len=0
EXT2-fs error (device ram0): ext2_check_page: bad entry in directory #3345: rec_len is smaller than minimal - offset=16384, inode=0, rec_len=0, name_len=0
uname[1482]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffc3af00 error 4
						\sed[1487]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffd3c3c0 error 4
sed[1490]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffbe9b10 error 4
						|						/						-						\						|						/						-						\						|						/						-						\uname[1505]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffeda150 error 4
sed[1509]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007ffffff6b310 error 4
sed[1512]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffff9bed80 error 4
						|
mkdir[1514]: segfault at 0000000000000008 rip 00002aaaaaab1dff rsp 00007fffff815d60 error 4
/restore: line 1mkdir[1515]: segfault at 0000000000000008 rip 00002aaaaaab1dff rsp 00007fffff994c70 error 4
72:  1514 Segmenchmod[1516]: segfault at 0000000000000008 rip 00002aaaaaab3ce7 rsp 00007fffffa8f3b0 error 4
tation fault    restore[1517]: segfault at 0000000000000004 rip 00000000004322a2 rsp 00007fffffca6e58 error 6
  mkdir /drive/drive
File Restoration complete.Kernel panic - not syncing: Attempted to kill init!

Ensuring /medi a/floppy and /media/cdrom have been created
/restore: line 177:  1515 Segmentation fault      mkdir -p /drive/media/floppy /drive/media/cdrom /drive/media/dvd
Ensuring correct permissions on tmp
/restore: line 180:  1516 Segmentation fault      chmod 1777 /drive/tmp
/restore: line 190:  1517 Segmentation fault      chroot /drive $GRUB --batch --no-floppy  >&/dev/null <<EOF
device (hd0) ${device_save[0]}
root (hd0,0)
setup (hd0)
EOF

Unable to run grub on /dev/sda
/restore: line 195:  1518 Segmentation fault      bash
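
The segfault lines in the log above can be summarized with a short script. This is a sketch only: the regular expression is written against the line format seen in this log, the helper names are invented for illustration, and the decoding uses the low three bits of the x86 page-fault error code (bit 0: fault on a present page, bit 1: write access, bit 2: user mode).

```python
import re
from collections import Counter

# Matches e.g. "sed[1155]: segfault at 0000000000000008 rip ... rsp ... error 4"
SEGFAULT_RE = re.compile(
    r"(?P<comm>[A-Za-z_][\w.-]*)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"rip (?P<rip>[0-9a-f]+) rsp (?P<rsp>[0-9a-f]+) error (?P<err>[0-9a-f]+)"
)


def decode_error(code):
    """Decode the low bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # clear: the page was not mapped
        "write": bool(code & 2),         # clear: the access was a read
        "user_mode": bool(code & 4),     # set: the fault came from user space
    }


def parse_segfaults(log_text):
    """Return one dict per segfault line found in the log text."""
    faults = []
    for m in SEGFAULT_RE.finditer(log_text):
        err = int(m.group("err"), 16)
        entry = {
            "comm": m.group("comm"),
            "pid": int(m.group("pid")),
            "addr": int(m.group("addr"), 16),
            "rip": int(m.group("rip"), 16),
            "error": err,
        }
        entry.update(decode_error(err))
        faults.append(entry)
    return faults


def rip_histogram(faults):
    """Count how often each faulting instruction pointer occurs."""
    return Counter(hex(f["rip"]) for f in faults)
```

In this log, "error 4" decodes to a user-mode read of an unmapped page and "error 6" to a user-mode write of an unmapped page, and every fault address is 0x8 or 0x4, i.e. a near-NULL dereference. Most entries share a single rip, which may point at one damaged piece of shared code rather than many independent bugs, though the log alone does not prove that.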

* Re: very strange issue with sata,<4G Ram, and ext3
  2005-04-29 14:45   ` Rick Warner
@ 2005-05-04 19:29     ` Rick Warner
  2005-05-05 15:00       ` Alan Cox
  2005-05-05 21:37     ` Krzysztof Halasa
  1 sibling, 1 reply; 8+ messages in thread
From: Rick Warner @ 2005-05-04 19:29 UTC
  To: linux-kernel

Just sending out a ping on this.. anyone have any ideas?

On Friday 29 April 2005 10:45 am, you wrote:
> On Thursday 28 April 2005 06:48 pm, Alan Cox wrote:
> > On Thu, 2005-04-28 at 17:16, Rick Warner wrote:
> > >  On these systems, we are getting ext2 errors from the initrd during
> > > the untarring.  Soon after, we start getting seg faults on random
> > > things (looks like stuff caused by the still running dhcp client), and
> > > then a continuous stream of segfaults on the restore script itself
> > > (restore[1]).
> >
> > This sounds almost like the pxe/boot code is still using ram that the
> > kernel has now used (e.g. the PXE layer or pxe booter forgot to close the
> > client and it's still DMAing happily into the kernel).
>
> This morning, we tried updating to a newer pxelinux (3.07) and had the same
> results.  We then tried using etherboot with a mknbi tagged image and also
> had the same results.   Since we are getting the same problem on 3
> different motherboards with 2 different network adapters, I have not looked
> into updating the boot rom on the nics.  Should I?
>
> What should I look into next?  I have attached a serial console log of the
> system and errors.  The slashes and pipes you see are from a spinning bar
> thing.  If you want output that is cleaned up without that, I can provide
> it.

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517

* Re: very strange issue with sata,<4G Ram, and ext3
  2005-05-04 19:29     ` Rick Warner
@ 2005-05-05 15:00       ` Alan Cox
  0 siblings, 0 replies; 8+ messages in thread
From: Alan Cox @ 2005-05-05 15:00 UTC
  To: Rick Warner; +Cc: Linux Kernel Mailing List

On Wed, 2005-05-04 at 20:29, Rick Warner wrote:
> Just sending out a ping on this.. anyone have any ideas?

The best I can think of right now going forward is to check:
	32 v 64 bit kernel
	32bit Highmem aware kernel v 32bit non highmem (1GB limit) kernel
	PATA boot v SATA boot v Network boot

just to try and find any patterns.
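
One way to keep track of those comparisons is to enumerate them as a grid and fill in results as each run completes. The sketch below is only one reading of the list above (every kernel flavour crossed with every boot path); the labels and the `result` field are illustrative, not anything from the restore software.

```python
from itertools import product

# Kernel flavours and boot paths taken from the list above.
KERNELS = ["64-bit", "32-bit highmem", "32-bit non-highmem (1GB limit)"]
BOOT_PATHS = ["PATA boot", "SATA boot", "network boot"]


def build_matrix():
    """Return one record per kernel/boot-path combination, result not yet known."""
    return [
        {"kernel": kernel, "boot": boot, "result": None}
        for kernel, boot in product(KERNELS, BOOT_PATHS)
    ]


def failing(matrix):
    """Return the combinations that have been marked as failing."""
    return [(run["kernel"], run["boot"]) for run in matrix if run["result"] == "fail"]


if __name__ == "__main__":
    for run in build_matrix():
        print(f'{run["kernel"]:32} {run["boot"]}')
```

If the failures cluster along one axis only (for example, every network-boot row fails regardless of kernel), that narrows down where to look.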


* Re: very strange issue with sata,<4G Ram, and ext3
  2005-04-29 14:45   ` Rick Warner
  2005-05-04 19:29     ` Rick Warner
@ 2005-05-05 21:37     ` Krzysztof Halasa
  2005-05-06 13:39       ` Rick Warner
  1 sibling, 1 reply; 8+ messages in thread
From: Krzysztof Halasa @ 2005-05-05 21:37 UTC
  To: Rick Warner; +Cc: Alan Cox, Linux Kernel Mailing List

Rick Warner <rick@microway.com> writes:

> This morning, we tried updating to a newer pxelinux (3.07) and had the same 
> results.  We then tried using etherboot with a mknbi tagged image and also 
> had the same results.   Since we are getting the same problem on 3 different 
> motherboards with 2 different network adapters, I have not looked into 
> updating the boot rom on the nics.  Should I?

I remember I had memory corruption problems with an old version of
Etherboot a few years ago. The machines were mostly AMD K6 based,
network cards were SMC EPIC100 (Etherpower II) and/or RTL 8139.

Memtest86 (downloaded with Etherboot) complained about random errors.
I think Linux didn't show any such illness.
This was Etherboot 4.something. Upgrading to 5.something fixed the
problem.

I suspect you're using Etherboot newer than 4.x though. I'd probably
give memtest86 loaded from network a try.
-- 
Krzysztof Halasa

* Re: very strange issue with sata,<4G Ram, and ext3
  2005-05-05 21:37     ` Krzysztof Halasa
@ 2005-05-06 13:39       ` Rick Warner
  0 siblings, 0 replies; 8+ messages in thread
From: Rick Warner @ 2005-05-06 13:39 UTC
  To: Krzysztof Halasa; +Cc: Alan Cox, Linux Kernel Mailing List

On Thursday 05 May 2005 05:37 pm, Krzysztof Halasa wrote:
> Rick Warner <rick@microway.com> writes:
> > This morning, we tried updating to a newer pxelinux (3.07) and had the
> > same results.  We then tried using etherboot with a mknbi tagged image
> > and also had the same results.   Since we are getting the same problem on
> > 3 different motherboards with 2 different network adapters, I have not
> > looked into updating the boot rom on the nics.  Should I?
>
> I remember I had memory corruption problems with an old version of
> Etherboot a few years ago. The machines were mostly AMD K6 based,
> network cards were SMC EPIC100 (Etherpower II) and/or RTL 8139.
>
> Memtest86 (downloaded with Etherboot) complained about random errors.
> I think Linux didn't show any such illness.
> This was Etherboot 4.something. Upgrading to 5.something fixed the
> problem.
>
> I suspect you're using Etherboot newer than 4.x though. I'd probably
> give memtest86 loaded from network a try.

We actually run memtest86 from the network regularly.  This cluster had run 
dozens of passes of memtest booted over the network before doing any of this.  
We also did an md5sum of our initrd from the network boot server, and then 
had the initrd do an md5sum of itself on the network boot.  They matched.  
Thanks for the advice though!  I appreciate it.
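
The checksum comparison described above can be sketched in Python as follows. The real check used md5sum on the boot server and inside the initrd; the function names here are made up for illustration.

```python
import hashlib
import io


def md5_of_stream(stream, chunk_size=1 << 20):
    """Return the hex MD5 digest of a binary stream, read in chunks."""
    digest = hashlib.md5()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path):
    """Return the hex MD5 digest of the file at `path`."""
    with open(path, "rb") as f:
        return md5_of_stream(f)


def images_match(server_digest, node_digest):
    """Compare two hex digests, ignoring case and surrounding whitespace."""
    return server_digest.strip().lower() == node_digest.strip().lower()


if __name__ == "__main__":
    # In-memory stand-in for the initrd image; a real check would call
    # md5_of_file() on the server copy and on the copy seen by the booted node.
    server_copy = io.BytesIO(b"initrd stand-in")
    node_copy = io.BytesIO(b"initrd stand-in")
    print(images_match(md5_of_stream(server_copy), md5_of_stream(node_copy)))
```

A match only shows that the image was intact when the checksum was taken; it says nothing about memory that gets overwritten later, during the restore.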

-- 
Richard Warner
Lead Systems Integrator
Microway, Inc
(508)732-5517
