[2.4] NMI WD detected lockup during page alloc

public inbox for linux-kernel@vger.kernel.org

* [2.4] NMI WD detected lockup during page alloc
@ 2004-04-04 12:17 Oleg Drokin
  2004-04-05 20:43 ` Marcelo Tosatti
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Drokin @ 2004-04-04 12:17 UTC
  To: linux-kernel

Hello!

   One of my servers started to experience mysterious hangs after an upgrade to
   dual P4 Xeon (before that it was running a UP kernel) (HT enabled now).
   So I enabled the NMI watchdog, and it finally triggered recently.
   The kernel is 2.4.25+ (pulled from the 2.4 bitkeeper tree on XX/XX, but
   it seems the related files in mm/ have not changed since at least January 2004
   anyway).
   So the HW is a dual P4 Xeon on some Intel-branded server (E7501-based or
   something), 2G ECC RAM (highmem enabled).

   That's what I got on the serial console:
NMI Watchdog detected LOCKUP on CPU2, eip c013b527, registers:
CPU:    2
EIP:    0010:[<c013b527>]    Not tainted
EFLAGS: 00000086
eax: 00000000   ebx: c02dca38   ecx: 000048ce   edx: c02dca38
esi: c02dca74   edi: 00000000   ebp: d34b1e5c   esp: d34b1e30
ds: 0018   es: 0018   ss: 0018
Process mrtg (pid: 14663, stackpage=d34b1000)
Stack: 00038000 00000282 00000000 00015006 00015006 00000286 00000000 c02dca38
       c02dca38 c02dcb38 00000002 d34b1ea0 c013adfa c0139395 d34b1ea0 00000202
       c02dcaec 32353530 d34b1e7c c02dca38 c02dca38 c02dcb34 00000000 000001d2
Call Trace:    [<c013adfa>] [<c0139395>] [<c012dc0d>] [<c012e6d7>] [<c0119330>]
  [<c014bca5>] [<c0159301>] [<c014ee56>] [<c014bd1b>] [<c0153b3b>] [<c0118f70>]
  [<c01076b0>]
Code: f3 90 7e f9 e9 11 f4 ff ff 80 3f 00 f3 90 7e f9 e9 8e fd ff
>>EIP; c013b527 <.text.lock.page_alloc+f/28>   <=====
Trace; c013adfa <__alloc_pages+6a/270>
Trace; c0139395 <lru_cache_del+15/20>
Trace; c012dc0d <do_wp_page+6d/2e0>
Trace; c012e6d7 <handle_mm_fault+f7/110>
Trace; c0119330 <do_page_fault+3c0/586>
Trace; c014bca5 <cp_new_stat64+e5/110>
Trace; c0159301 <dput+31/190>
Trace; c014ee56 <path_release+16/40>
Trace; c014bd1b <sys_stat64+4b/80>
Trace; c0153b3b <sys_fcntl64+5b/c0>
Trace; c0118f70 <do_page_fault+0/586>
Trace; c01076b0 <error_code+34/3c>

So it seems it was stuck trying to take zone->lock in
mm/page_alloc.c::rmqueue().
The actual call trace seems to be the following (there seem to be lots of stale
entries on the actual stack):

rmqueue
__alloc_pages+6a
do_wp_page+6d
handle_mm_fault+f7 (this is in fact handle_pte_fault())
do_page_fault+3c0
error_code+34
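
A note for readers decoding the Code: bytes above: f3 90 is "rep; nop" (the
spin-wait hint) and 7e f9 is a short backward "jle", which is the shape of a
spin-wait loop and fits an EIP inside .text.lock.page_alloc. Below is a minimal
user-space sketch in C11 of a decrement-and-spin lock of that shape. It is not
the kernel's code: the sketch_* names are hypothetical, and the convention
"1 = free, <= 0 = held" is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical stand-in for a spinlock: 1 = free, <= 0 = held (assumed convention). */
typedef struct {
    atomic_int slock;
} sketch_spinlock_t;

void sketch_spin_init(sketch_spinlock_t *l)
{
    atomic_init(&l->slock, 1);
}

/* Spin-wait hint; on x86 this is "rep; nop", which encodes as f3 90. */
static inline void sketch_cpu_relax(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop" ::: "memory");
#endif
}

void sketch_spin_lock(sketch_spinlock_t *l)
{
    for (;;) {
        /* Fast path: atomic decrement; an old value of 1 means we own the lock. */
        if (atomic_fetch_sub_explicit(&l->slock, 1, memory_order_acquire) == 1)
            return;
        /* Slow path: read-only spin until the holder stores 1 again. */
        while (atomic_load_explicit(&l->slock, memory_order_relaxed) <= 0)
            sketch_cpu_relax();
    }
}

/* Returns 1 if the lock was taken, 0 if it was already held. */
int sketch_spin_trylock(sketch_spinlock_t *l)
{
    return atomic_exchange_explicit(&l->slock, 0, memory_order_acquire) > 0;
}

void sketch_spin_unlock(sketch_spinlock_t *l)
{
    atomic_store_explicit(&l->slock, 1, memory_order_release);
}
```

In this sketch, if the holder never calls sketch_spin_unlock(), or the lock word
is overwritten with a value <= 0, every later caller of sketch_spin_lock() stays
in the slow path forever. With interrupts disabled, that is the kind of state an
NMI watchdog would report as a lockup.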

I fail to see a path where we can take the lock on the same zone twice on the same
CPU, so maybe the zone structure was somehow corrupted (I do not have
spinlock debugging enabled yet). I do not think there are problems with the
memory in that box that might explain this, either.
The probability of hangs varies over time: I got the first one the day after the
upgrade (I am not even sure it was the same as this one, since I had no traces
from it), but this second one happened after 2-3 weeks of uptime.

Maybe this will help someone find out what is happening.

Bye,
    Oleg


* Re: [2.4] NMI WD detected lockup during page alloc
  2004-04-04 12:17 [2.4] NMI WD detected lockup during page alloc Oleg Drokin
@ 2004-04-05 20:43 ` Marcelo Tosatti
  2004-04-05 21:27   ` Oleg Drokin
  0 siblings, 1 reply; 5+ messages in thread
From: Marcelo Tosatti @ 2004-04-05 20:43 UTC
  To: Oleg Drokin; +Cc: linux-kernel, andrea, akpm

On Sun, Apr 04, 2004 at 03:17:56PM +0300, Oleg Drokin wrote:
> Hello!
> 
>    One of my servers started to experience mysterious hangs after an upgrade to
>    dual P4 Xeon (before that it was running a UP kernel) (HT enabled now).
>    So I enabled the NMI watchdog, and it finally triggered recently.
>    The kernel is 2.4.25+ (pulled from the 2.4 bitkeeper tree on XX/XX, but
>    it seems the related files in mm/ have not changed since at least January 2004
>    anyway).
>    So the HW is a dual P4 Xeon on some Intel-branded server (E7501-based or
>    something), 2G ECC RAM (highmem enabled).
> 
>    That's what I got on the serial console:
> NMI Watchdog detected LOCKUP on CPU2, eip c013b527, registers:
> CPU:    2
> EIP:    0010:[<c013b527>]    Not tainted
> EFLAGS: 00000086
> eax: 00000000   ebx: c02dca38   ecx: 000048ce   edx: c02dca38
> esi: c02dca74   edi: 00000000   ebp: d34b1e5c   esp: d34b1e30
> ds: 0018   es: 0018   ss: 0018
> Process mrtg (pid: 14663, stackpage=d34b1000)
> Stack: 00038000 00000282 00000000 00015006 00015006 00000286 00000000 c02dca38
>        c02dca38 c02dcb38 00000002 d34b1ea0 c013adfa c0139395 d34b1ea0 00000202
>        c02dcaec 32353530 d34b1e7c c02dca38 c02dca38 c02dcb34 00000000 000001d2
> Call Trace:    [<c013adfa>] [<c0139395>] [<c012dc0d>] [<c012e6d7>] [<c0119330>]
>   [<c014bca5>] [<c0159301>] [<c014ee56>] [<c014bd1b>] [<c0153b3b>] [<c0118f70>]
>   [<c01076b0>]
> Code: f3 90 7e f9 e9 11 f4 ff ff 80 3f 00 f3 90 7e f9 e9 8e fd ff
> >>EIP; c013b527 <.text.lock.page_alloc+f/28>   <=====
> Trace; c013adfa <__alloc_pages+6a/270>
> Trace; c0139395 <lru_cache_del+15/20>
> Trace; c012dc0d <do_wp_page+6d/2e0>
> Trace; c012e6d7 <handle_mm_fault+f7/110>
> Trace; c0119330 <do_page_fault+3c0/586>
> Trace; c014bca5 <cp_new_stat64+e5/110>
> Trace; c0159301 <dput+31/190>
> Trace; c014ee56 <path_release+16/40>
> Trace; c014bd1b <sys_stat64+4b/80>
> Trace; c0153b3b <sys_fcntl64+5b/c0>
> Trace; c0118f70 <do_page_fault+0/586>
> Trace; c01076b0 <error_code+34/3c>
> 
> 
> So it seems it was stuck trying to take zone->lock in
> mm/page_alloc.c::rmqueue().
> The actual call trace seems to be the following (there seem to be lots of stale
> entries on the actual stack):
> 
> rmqueue
> __alloc_pages+6a
> do_wp_page+6d
> handle_mm_fault+f7 (this is in fact handle_pte_fault())
> do_page_fault+3c0
> error_code+34
> 
> I fail to see a path where we can take the lock on the same zone twice on the same
> CPU, so maybe the zone structure was somehow corrupted (I do not have
> spinlock debugging enabled yet). I do not think there are problems with the
> memory in that box that might explain this, either.

I also fail to see how zone->lock could be left locked. The only users of it are
rmqueue() and __free_pages_ok(), and the code paths that take it are not prone to
problems.
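
One way to test the "same lock taken twice on the same CPU" hypothesis is to
have the lock remember its holder. The following is a hypothetical user-space
sketch in C11, not kernel code: the owner_lock_* names and the explicit cpu
argument are assumptions made for illustration only.

```c
#include <assert.h>
#include <stdatomic.h>

#define NO_OWNER (-1)

/* Hypothetical lock that records which CPU holds it. */
typedef struct {
    atomic_int owner;   /* CPU id of the holder, or NO_OWNER when free */
} owner_lock_t;

enum acquire_result { ACQUIRED, SELF_DEADLOCK };

void owner_lock_init(owner_lock_t *l)
{
    atomic_init(&l->owner, NO_OWNER);
}

/*
 * Spins until the lock is free, but reports a recursive acquisition by the
 * same CPU instead of spinning forever.
 */
enum acquire_result owner_lock_acquire(owner_lock_t *l, int cpu)
{
    for (;;) {
        int expected = NO_OWNER;
        if (atomic_compare_exchange_weak_explicit(&l->owner, &expected, cpu,
                                                  memory_order_acquire,
                                                  memory_order_relaxed))
            return ACQUIRED;
        if (expected == cpu)
            return SELF_DEADLOCK;   /* we already hold it: would never succeed */
    }
}

/* Returns 0 on success, -1 if the caller is not the recorded owner. */
int owner_lock_release(owner_lock_t *l, int cpu)
{
    int expected = cpu;
    return atomic_compare_exchange_strong_explicit(&l->owner, &expected, NO_OWNER,
                                                   memory_order_release,
                                                   memory_order_relaxed) ? 0 : -1;
}
```

With such a lock, a second owner_lock_acquire(&l, 2) on CPU 2 returns
SELF_DEADLOCK immediately, whereas a lock word that was corrupted by a stray
write would show an owner that no CPU recognizes as itself.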

> The probability of hangs varies over time: I got the first one the day after the
> upgrade (I am not even sure it was the same as this one, since I had no traces
> from it), but this second one happened after 2-3 weeks of uptime.
> 
> Maybe this will help someone find out what is happening.

Can you send me your config file and a description of the workload? I have a similar
E7501 around (with MPT fusion).

What drivers are you using, btw?


* Re: [2.4] NMI WD detected lockup during page alloc
  2004-04-05 20:43 ` Marcelo Tosatti
@ 2004-04-05 21:27   ` Oleg Drokin
  2004-04-05 22:12     ` Andrea Arcangeli
  0 siblings, 1 reply; 5+ messages in thread
From: Oleg Drokin @ 2004-04-05 21:27 UTC
  To: Marcelo Tosatti; +Cc: linux-kernel, andrea, akpm


Hello!

On Mon, Apr 05, 2004 at 05:43:17PM -0300, Marcelo Tosatti wrote:
> > The probability of hangs varies over time: I got the first one the day after the
> > upgrade (I am not even sure it was the same as this one, since I had no traces
> > from it), but this second one happened after 2-3 weeks of uptime.
> > Maybe this will help someone find out what is happening.
> Can you send me your config file and description of workload? I have a similar E7501
> around (with MPT fusion). 

I have some Adaptec stuff, not MPT fusion.

Config is attached.
The workload is a boring HTTP proxy (squid) for some 200+ clients on a not very fat
channel, and some IP accounting using the firewall (btw, that reminds me that ipt_state
cannot be unloaded on this box: it just spins doing something in the kernel if I try
to unload it, and ipt_state is then shown as being in the deleted state. I
always reboot in this case, so it has no influence on the main problem).
There is also some home-grown accounting software (that leaks a bit, so over time swap
is used more and more), a mail server (sendmail) with maybe 100-200 incoming
mails per minute, DNS, a modem pool (ppp), and samba3.

> What drivers are you using, btw?

In addition to what I have compiled in:
# lsmod
Module                  Size  Used by    Not tainted
ppp_deflate             4568   1  (autoclean)
zlib_inflate           20868   0  (autoclean) [ppp_deflate]
zlib_deflate           21272   0  (autoclean) [ppp_deflate]
ppp_async              10208   2  (autoclean)
ppp_generic            27404   6  (autoclean) [ppp_deflate ppp_async]
slhc                    6384   0  (autoclean) [ppp_generic]
tulip                  45760   0  (unused)
e1000                  75844   1 
ipt_MARK                1368  81  (autoclean)
iptable_mangle          2776   1  (autoclean)
ipt_MASQUERADE          2360   1  (autoclean)
iptable_nat            22278   1  (autoclean) [ipt_MASQUERADE]
ipt_state               1016   4  (autoclean)
ip_conntrack           30432   0  (autoclean) [ipt_MASQUERADE iptable_nat ipt_state]
ipt_unclean             7672   2  (autoclean)
iptable_filter          2412   1  (autoclean)
ip_tables              15872   9  [ipt_MARK iptable_mangle ipt_MASQUERADE iptable_nat ipt_state ipt_unclean iptable_filter]
microcode               7936   0  (autoclean)
raid1                  18704   1  (autoclean)
md                     66048   2  [raid1]
loop                   13080   0  (autoclean)
lvm-mod                63040  13 


Bye,
    Oleg

[-- Attachment #2: mordor-config --]
[-- Type: text/plain, Size: 5522 bytes --]

CONFIG_X86=y
CONFIG_UID16=y
CONFIG_EXPERIMENTAL=y
CONFIG_MODULES=y
CONFIG_KMOD=y
CONFIG_MPENTIUM4=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_CMPXCHG=y
CONFIG_X86_XADD=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
CONFIG_X86_L1_CACHE_SHIFT=7
CONFIG_X86_HAS_TSC=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_PGE=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_F00F_WORKS_OK=y
CONFIG_X86_MCE=y
CONFIG_MICROCODE=m
CONFIG_X86_MSR=m
CONFIG_X86_CPUID=m
CONFIG_EDD=y
CONFIG_HIGHMEM4G=y
CONFIG_HIGHMEM=y
CONFIG_HIGHIO=y
CONFIG_MTRR=y
CONFIG_SMP=y
CONFIG_NR_CPUS=32
CONFIG_X86_TSC=y
CONFIG_HAVE_DEC_LOCK=y
CONFIG_NET=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_PCI=y
CONFIG_PCI_GOANY=y
CONFIG_PCI_BIOS=y
CONFIG_PCI_DIRECT=y
CONFIG_ISA=y
CONFIG_PCI_NAMES=y
CONFIG_HOTPLUG=y
CONFIG_SYSVIPC=y
CONFIG_SYSCTL=y
CONFIG_KCORE_ELF=y
CONFIG_BINFMT_AOUT=m
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_OOM_KILLER=y
CONFIG_ACPI_BOOT=y
CONFIG_PARPORT=m
CONFIG_PARPORT_PC=m
CONFIG_PARPORT_PC_CML1=m
CONFIG_PARPORT_SERIAL=m
CONFIG_PARPORT_PC_FIFO=y
CONFIG_PARPORT_PC_SUPERIO=y
CONFIG_BLK_DEV_FD=m
CONFIG_BLK_DEV_LOOP=m
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=m
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_STATS=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
CONFIG_BLK_DEV_LVM=m
CONFIG_PACKET=y
CONFIG_PACKET_MMAP=y
CONFIG_NETLINK_DEV=m
CONFIG_NETFILTER=y
CONFIG_FILTER=y
CONFIG_UNIX=y
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
CONFIG_IP_ADVANCED_ROUTER=y
CONFIG_IP_MULTIPLE_TABLES=y
CONFIG_IP_ROUTE_FWMARK=y
CONFIG_IP_ROUTE_NAT=y
CONFIG_IP_ROUTE_MULTIPATH=y
CONFIG_IP_ROUTE_TOS=y
CONFIG_NET_IPIP=m
CONFIG_NET_IPGRE=m
CONFIG_INET_ECN=y
CONFIG_SYN_COOKIES=y
CONFIG_IP_NF_CONNTRACK=m
CONFIG_IP_NF_FTP=m
CONFIG_IP_NF_AMANDA=m
CONFIG_IP_NF_TFTP=m
CONFIG_IP_NF_IRC=m
CONFIG_IP_NF_QUEUE=m
CONFIG_IP_NF_IPTABLES=m
CONFIG_IP_NF_MATCH_LIMIT=m
CONFIG_IP_NF_MATCH_MAC=m
CONFIG_IP_NF_MATCH_PKTTYPE=m
CONFIG_IP_NF_MATCH_MULTIPORT=m
CONFIG_IP_NF_MATCH_TOS=m
CONFIG_IP_NF_MATCH_RECENT=m
CONFIG_IP_NF_MATCH_ECN=m
CONFIG_IP_NF_MATCH_DSCP=m
CONFIG_IP_NF_MATCH_AH_ESP=m
CONFIG_IP_NF_MATCH_LENGTH=m
CONFIG_IP_NF_MATCH_TTL=m
CONFIG_IP_NF_MATCH_TCPMSS=m
CONFIG_IP_NF_MATCH_HELPER=m
CONFIG_IP_NF_MATCH_STATE=m
CONFIG_IP_NF_MATCH_CONNTRACK=m
CONFIG_IP_NF_MATCH_UNCLEAN=m
CONFIG_IP_NF_MATCH_OWNER=m
CONFIG_IP_NF_FILTER=m
CONFIG_IP_NF_TARGET_REJECT=m
CONFIG_IP_NF_TARGET_MIRROR=m
CONFIG_IP_NF_NAT=m
CONFIG_IP_NF_NAT_NEEDED=y
CONFIG_IP_NF_TARGET_MASQUERADE=m
CONFIG_IP_NF_TARGET_REDIRECT=m
CONFIG_IP_NF_NAT_AMANDA=m
CONFIG_IP_NF_NAT_LOCAL=y
CONFIG_IP_NF_NAT_SNMP_BASIC=m
CONFIG_IP_NF_NAT_IRC=m
CONFIG_IP_NF_NAT_FTP=m
CONFIG_IP_NF_NAT_TFTP=m
CONFIG_IP_NF_MANGLE=m
CONFIG_IP_NF_TARGET_TOS=m
CONFIG_IP_NF_TARGET_ECN=m
CONFIG_IP_NF_TARGET_DSCP=m
CONFIG_IP_NF_TARGET_MARK=m
CONFIG_IP_NF_TARGET_LOG=m
CONFIG_IP_NF_TARGET_ULOG=m
CONFIG_IP_NF_TARGET_TCPMSS=m
CONFIG_IP_NF_ARPTABLES=m
CONFIG_IP_NF_ARPFILTER=m
CONFIG_IP_NF_ARP_MANGLE=m
CONFIG_IPV6_SCTP__=y
CONFIG_SCTP_HMAC_SHA1=y
CONFIG_NET_SCHED=y
CONFIG_NET_SCH_CBQ=m
CONFIG_NET_SCH_HTB=m
CONFIG_NET_SCH_CSZ=m
CONFIG_NET_SCH_HFSC=m
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
CONFIG_NET_SCH_INGRESS=m
CONFIG_NET_QOS=y
CONFIG_NET_ESTIMATOR=y
CONFIG_NET_CLS=y
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_POLICE=y
CONFIG_IDE=y
CONFIG_BLK_DEV_IDE=y
CONFIG_BLK_DEV_IDEDISK=y
CONFIG_IDEDISK_MULTI_MODE=y
CONFIG_BLK_DEV_IDECD=m
CONFIG_BLK_DEV_IDEPCI=y
CONFIG_IDEPCI_SHARE_IRQ=y
CONFIG_BLK_DEV_IDEDMA_PCI=y
CONFIG_IDEDMA_PCI_AUTO=y
CONFIG_BLK_DEV_IDEDMA=y
CONFIG_BLK_DEV_HPT366=y
CONFIG_BLK_DEV_PIIX=y
CONFIG_IDEDMA_AUTO=y
CONFIG_SCSI=y
CONFIG_BLK_DEV_SD=y
CONFIG_SD_EXTRA_DEVS=10
CONFIG_BLK_DEV_SR=m
CONFIG_SR_EXTRA_DEVS=1
CONFIG_CHR_DEV_SG=m
CONFIG_SCSI_DEBUG_QUEUES=y
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_AIC79XX=y
CONFIG_AIC79XX_CMDS_PER_DEVICE=32
CONFIG_AIC79XX_RESET_DELAY_MS=15000
CONFIG_AIC79XX_DEBUG_MASK=0
CONFIG_NETDEVICES=y
CONFIG_DUMMY=m
CONFIG_TUN=m
CONFIG_NET_ETHERNET=y
CONFIG_NET_PCI=y
CONFIG_TULIP=m
CONFIG_TULIP_MWI=y
CONFIG_TULIP_MMIO=y
CONFIG_E100=y
CONFIG_E1000=m
CONFIG_PPP=m
CONFIG_PPP_FILTER=y
CONFIG_PPP_ASYNC=m
CONFIG_PPP_SYNC_TTY=m
CONFIG_PPP_DEFLATE=m
CONFIG_PPP_BSDCOMP=m
CONFIG_PPPOE=m
CONFIG_SLIP=m
CONFIG_SLIP_COMPRESSED=y
CONFIG_SLIP_SMART=y
CONFIG_VT=y
CONFIG_VT_CONSOLE=y
CONFIG_SERIAL=y
CONFIG_SERIAL_CONSOLE=y
CONFIG_SERIAL_NONSTANDARD=y
CONFIG_MOXA_SMARTIO=y
CONFIG_UNIX98_PTYS=y
CONFIG_UNIX98_PTY_COUNT=2048
CONFIG_PRINTER=m
CONFIG_MOUSE=y
CONFIG_PSMOUSE=y
CONFIG_WATCHDOG=y
CONFIG_SOFT_WATCHDOG=m
CONFIG_NVRAM=m
CONFIG_RTC=y
CONFIG_AGP=y
CONFIG_AGP_INTEL=y
CONFIG_DRM=y
CONFIG_DRM_NEW=y
CONFIG_DRM_R128=y
CONFIG_QUOTA=y
CONFIG_QFMT_V2=y
CONFIG_REISERFS_FS=y
CONFIG_REISERFS_PROC_INFO=y
CONFIG_EXT3_FS=m
CONFIG_JBD=m
CONFIG_CRAMFS=m
CONFIG_TMPFS=y
CONFIG_RAMFS=y
CONFIG_ISO9660_FS=m
CONFIG_JOLIET=y
CONFIG_ZISOFS=y
CONFIG_PROC_FS=y
CONFIG_DEVPTS_FS=y
CONFIG_EXT2_FS=y
CONFIG_NFS_FS=m
CONFIG_NFS_V3=y
CONFIG_NFSD=m
CONFIG_NFSD_V3=y
CONFIG_NFSD_TCP=y
CONFIG_SUNRPC=m
CONFIG_LOCKD=m
CONFIG_LOCKD_V4=y
CONFIG_ZISOFS_FS=m
CONFIG_MSDOS_PARTITION=y
CONFIG_NLS=y
CONFIG_NLS_DEFAULT="iso8859-1"
CONFIG_VGA_CONSOLE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_MAGIC_SYSRQ=y
CONFIG_DEBUG_SPINLOCK=y
CONFIG_FRAME_POINTER=y
CONFIG_LOG_BUF_SHIFT=16
CONFIG_CRC32=y
CONFIG_ZLIB_INFLATE=m
CONFIG_ZLIB_DEFLATE=m


* Re: [2.4] NMI WD detected lockup during page alloc
  2004-04-05 21:27   ` Oleg Drokin
@ 2004-04-05 22:12     ` Andrea Arcangeli
  2004-04-06  7:02       ` Oleg Drokin
  0 siblings, 1 reply; 5+ messages in thread
From: Andrea Arcangeli @ 2004-04-05 22:12 UTC
  To: Oleg Drokin; +Cc: Marcelo Tosatti, linux-kernel, akpm

On Tue, Apr 06, 2004 at 12:27:34AM +0300, Oleg Drokin wrote:
> In addition to what I have compiled in:
> # lsmod
> Module                  Size  Used by    Not tainted
> ppp_deflate             4568   1  (autoclean)

you may want to disable compression, this sounds like mm corruption and
compression isn't trivial to handle in kernel skbs (though I doubt this
is the problem but it's easy to disable).

> ipt_state               1016   4  (autoclean)

the hang while unloading this module may also be a sign of a bug in the
module, so it would be nice if you could also reproduce without the above
ipt_state.

If this still doesn't help, then you can try to go UP again and see if it
stabilizes, since SMP stresses the memory bus harder. Another thing
you can do is to remove half of the RAM and see if it stabilizes, to try
to identify buggy RAM slots.

Overall it's unlikely the oops is useful unfortunately since that piece
of the kernel is the most stressed ever, and it just signals random mm
corruption. I assume this is the first time you've got the nmi watchdog
oops, if you could get it again it would be more interesting, I'd expect
next time you would get it in another place.

hope this helps.


* Re: [2.4] NMI WD detected lockup during page alloc
  2004-04-05 22:12     ` Andrea Arcangeli
@ 2004-04-06  7:02       ` Oleg Drokin
  0 siblings, 0 replies; 5+ messages in thread
From: Oleg Drokin @ 2004-04-06  7:02 UTC
  To: Andrea Arcangeli; +Cc: Marcelo Tosatti, linux-kernel, akpm

Hello!

On Tue, Apr 06, 2004 at 12:12:55AM +0200, Andrea Arcangeli wrote:
> > In addition to what I have compiled in:
> > # lsmod
> > Module                  Size  Used by    Not tainted
> > ppp_deflate             4568   1  (autoclean)
> you may want to disable compression, this sounds like mm corruption and
> compression isn't trivial to handle in kernel skbs (though I doubt this
> is the problem but it's easy to disable).

Ok.

> > ipt_state               1016   4  (autoclean)
> the hang while unloading this module may also be a sign of a bug in the
> module, so it would be nice if you could also reproduce without the above
> ipt_state.

Unfortunately this is not so easy to do, though I believe there is just some
sort of race on unload that is not hit until the module is unloaded, and
therefore it is completely unrelated.

> If this still doesn't help, then you can try to go UP again and see if it
> stabilizes, since SMP stresses the memory bus harder. Another thing
> you can do is to remove half of the RAM and see if it stabilizes, to try
> to identify buggy RAM slots.

I have ECC RAM there, and it passed 14 days of memtest (yes, I know memtest uses
only 1 CPU), so I do not think I have memory problems, though this is not
an absolute guarantee against them, of course.
Also, running in UP mode for weeks is not all that fun and still proves nothing,
as I do not have a clear way to reproduce it within a certain time.

> Overall it's unlikely the oops is useful unfortunately since that piece
> of the kernel is the most stressed ever, and it just signals random mm
> corruption. I assume this is the first time you've got the nmi watchdog
> oops, if you could get it again it would be more interesting, I'd expect
> next time you would get it in another place.

Well, I had a hang before this oops, and that was the main reason I enabled the NMI
watchdog. At that first hang nothing got to the serial console, so I guessed
it was a similar spinlock deadlock.
We'll see what I get when another NMI watchdog event occurs. I am running
with spinlock debugging this time, so hopefully, if the spinlock is really just
corrupted, its magic will be corrupted as well and I will get a clear warning about
that.
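
The idea behind relying on the spinlock magic can be sketched as follows. This
is a simplified user-space illustration, assuming a debug build stores a known
constant next to the lock word and verifies it before every acquisition. The
sketch_* names and the constant are assumptions for illustration, not a copy of
the kernel's definitions.

```c
#include <assert.h>
#include <string.h>

/* Illustrative constant; assumed here, not taken from the thread. */
#define SKETCH_SPINLOCK_MAGIC 0xdead4eadu

typedef struct {
    volatile unsigned int lock;   /* lock word */
    unsigned int magic;           /* set once at init, never changed afterwards */
} sketch_dbg_spinlock_t;

void sketch_dbg_init(sketch_dbg_spinlock_t *l)
{
    l->lock = 1;
    l->magic = SKETCH_SPINLOCK_MAGIC;
}

/* Returns 1 if the structure looks intact, 0 if the magic was overwritten. */
int sketch_dbg_check(const sketch_dbg_spinlock_t *l)
{
    return l->magic == SKETCH_SPINLOCK_MAGIC;
}
```

A stray write that covers the whole structure makes sketch_dbg_check() return 0,
which a debug build could turn into an immediate warning. Corruption that hits
only the lock word leaves the magic intact, so a check like this cannot catch
every case.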

Thank you.

Bye,
    Oleg

