Re: [2.4] NMI WD detected lockup during page alloc


From: Marcelo Tosatti <marcelo.tosatti@cyclades.com>
To: Oleg Drokin <green@linuxhacker.ru>
Cc: linux-kernel@vger.kernel.org, andrea@suse.de, akpm@osdl.org
Subject: Re: [2.4] NMI WD detected lockup during page alloc
Date: Mon, 5 Apr 2004 17:43:17 -0300
Message-ID: <20040405204317.GA13528@logos.cnet>
In-Reply-To: <20040404121756.GA8854@linuxhacker.ru>

On Sun, Apr 04, 2004 at 03:17:56PM +0300, Oleg Drokin wrote:
> Hello!
> 
>    One of my servers started to experience mysterious hangs after an upgrade to
>    dual P4 Xeon (before that it was running a UP kernel) (HT enabled now).
>    So I enabled the NMI watchdog, and it finally triggered recently.
>    The kernel is 2.4.25+ (pulled from the 2.4 bitkeeper tree on XX/XX, but
>    it seems the related files in mm/ have not changed since at least January
>    2004 anyway).
>    The HW is a dual P4 Xeon on some Intel-branded server (E7501-based or
>    something), 2G E?? RAM (highmem enabled).
> 
>    That's what I got on the serial console:
> NMI Watchdog detected LOCKUP on CPU2, eip c013b527, registers:
> CPU:    2
> EIP:    0010:[<c013b527>]    Not tainted
> EFLAGS: 00000086
> eax: 00000000   ebx: c02dca38   ecx: 000048ce   edx: c02dca38
> esi: c02dca74   edi: 00000000   ebp: d34b1e5c   esp: d34b1e30
> ds: 0018   es: 0018   ss: 0018
> Process mrtg (pid: 14663, stackpage=d34b1000)
> Stack: 00038000 00000282 00000000 00015006 00015006 00000286 00000000 c02dca38
>        c02dca38 c02dcb38 00000002 d34b1ea0 c013adfa c0139395 d34b1ea0 00000202
>        c02dcaec 32353530 d34b1e7c c02dca38 c02dca38 c02dcb34 00000000 000001d2
> Call Trace:    [<c013adfa>] [<c0139395>] [<c012dc0d>] [<c012e6d7>] [<c0119330>]
>   [<c014bca5>] [<c0159301>] [<c014ee56>] [<c014bd1b>] [<c0153b3b>] [<c0118f70>]
>   [<c01076b0>]
> Code: f3 90 7e f9 e9 11 f4 ff ff 80 3f 00 f3 90 7e f9 e9 8e fd ff
> >>EIP; c013b527 <.text.lock.page_alloc+f/28>   <=====
> Trace; c013adfa <__alloc_pages+6a/270>
> Trace; c0139395 <lru_cache_del+15/20>
> Trace; c012dc0d <do_wp_page+6d/2e0>
> Trace; c012e6d7 <handle_mm_fault+f7/110>
> Trace; c0119330 <do_page_fault+3c0/586>
> Trace; c014bca5 <cp_new_stat64+e5/110>
> Trace; c0159301 <dput+31/190>
> Trace; c014ee56 <path_release+16/40>
> Trace; c014bd1b <sys_stat64+4b/80>
> Trace; c0153b3b <sys_fcntl64+5b/c0>
> Trace; c0118f70 <do_page_fault+0/586>
> Trace; c01076b0 <error_code+34/3c>
> 
> 
> So it seems it was blocked trying to take zone->lock in
> mm/page_alloc.c::rmqueue().
> The actual call trace seems to be the following (lots of stale entries
> seem to be on the actual stack):
> 
> rmqueue
> __alloc_pages+6a
> do_wp_page+6d
> handle_mm_fault+f7 (this is in fact handle_pte_fault())
> do_page_fault+3c0
> error_code+34
> 
> I fail to see a path where we could take the lock on the same zone twice on
> the same CPU, so maybe the zone structure was somehow corrupted (I do not have
> spinlock debugging enabled yet). I also do not think there is a memory problem
> in that box that could explain this.

I also fail to see how zone->lock could be left locked. Its only users are
rmqueue() and __free_pages_ok(), and the code paths that take it there are not
prone to problems.

> The probability of hangs varies over time: I got the first one the day after
> the upgrade (not even sure if it was the same as this one, since I had no
> traces from it), but this second one happened after 2-3 weeks of uptime.
> 
> Maybe it will help someone find out what is happening.

Can you send me your config file and a description of the workload? I have a
similar E7501 box around (with MPT Fusion).

What drivers are you using, btw?

Thread overview: 5+ messages
2004-04-04 12:17 [2.4] NMI WD detected lockup during page alloc Oleg Drokin
2004-04-05 20:43 ` Marcelo Tosatti [this message]
2004-04-05 21:27   ` Oleg Drokin
2004-04-05 22:12     ` Andrea Arcangeli
2004-04-06  7:02       ` Oleg Drokin
