From: Willy Tarreau <w@1wt.eu>
To: Nikola Ciprich <nikola.ciprich@linuxbox.cz>
Cc: "linux-kernel mlist" <linux-kernel@vger.kernel.org>,
	"linux-stable mlist" <stable@kernel.org>,
	"Hervé Commowick" <hcommowick@exosec.fr>
Subject: Re: [stable] 2.6.32.21 - uptime related crashes?
Date: Thu, 28 Apr 2011 20:34:34 +0200
Message-ID: <20110428183434.GG30645@1wt.eu>
In-Reply-To: <20110428082625.GA23293@pcnci.linuxbox.cz>

Hello Nikola,

On Thu, Apr 28, 2011 at 10:26:25AM +0200, Nikola Ciprich wrote:
> Hello everybody,
> 
> I'm trying to solve a strange issue: today, my fourth machine running 2.6.32.21 just crashed. What makes the cases similar, apart from the same kernel version, is that all boxes had very similar uptimes: 214, 216, 216, and 224 days. This might just be a coincidence, but I think it might be important.

Interestingly, one of our customers just had two machines that crashed
yesterday after 212 days and 212 days + 20h respectively. They were running
Debian's 2.6.32-bpo.5-amd64, which is based on 2.6.32.23 AIUI.

The crash looks very similar to the following bug, which we have updated:

   https://bugzilla.kernel.org/show_bug.cgi?id=16991

(bugzilla doesn't appear to respond as I'm posting this mail).

The top of your output is missing. In our case, as in the reports on the bug
above, there was a divide by zero error. Did you happen to spot this one
too, or do you just not know? I observe "divide_error+0x15/0x20" in one
of your reports, so it's possible that it matches the same pattern, at least
for one trace. Just in case, it would be nice to add your traces to the
bugzilla entry above.

> Unfortunately I only have backtraces of two crashes (and those are trimmed, sorry), and they do not look as similar as I'd like, but still maybe there is something in common:
> 
> [<ffffffff81120cc7>] pollwake+0x57/0x60 
> [<ffffffff81046720>] ? default_wake_function+0x0/0x10 
> [<ffffffff8103683a>] __wake_up_common+0x5a/0x90 
> [<ffffffff8103a313>] __wake_up+0x43/0x70 
> [<ffffffffa0321573>] process_masterspan+0x643/0x670 [dahdi] 
> [<ffffffffa0326595>] coretimer_func+0x135/0x1d0 [dahdi] 
> [<ffffffff8105d74d>] run_timer_softirq+0x15d/0x320 
> [<ffffffffa0326460>] ? coretimer_func+0x0/0x1d0 [dahdi] 
> [<ffffffff8105690c>] __do_softirq+0xcc/0x220 
> [<ffffffff8100c40c>] call_softirq+0x1c/0x30 
> [<ffffffff8100e3ba>] do_softirq+0x4a/0x80 
> [<ffffffff810567c7>] irq_exit+0x87/0x90 
> [<ffffffff8100d7b7>] do_IRQ+0x77/0xf0 
> [<ffffffff8100bc53>] ret_from_intr+0x0/0xa 
> <EOI> [<ffffffffa019e556>] ? acpi_idle_enter_bm+0x273/0x2a1 [processor] 
> [<ffffffffa019e54c>] ? acpi_idle_enter_bm+0x269/0x2a1 [processor] 
> [<ffffffff81280095>] ? cpuidle_idle_call+0xa5/0x150 
> [<ffffffff8100a18f>] ? cpu_idle+0x4f/0x90 
> [<ffffffff81323c95>] ? rest_init+0x75/0x80 
> [<ffffffff81582d7f>] ? start_kernel+0x2ef/0x390 
> [<ffffffff81582271>] ? x86_64_start_reservations+0x81/0xc0 
> [<ffffffff81582386>] ? x86_64_start_kernel+0xd6/0x100 
> 
> this box (actually two of the crashed ones) is using dahdi_dummy module to generate timing for asterisk SW pbx, so maybe it's related to it.
> 
> 
> [<ffffffff810a5063>] handle_IRQ_event+0x63/0x1c0
> [<ffffffff810a71ae>] handle_edge_irq+0xce/0x160
> [<ffffffff8100e1bf>] handle_irq+0x1f/0x30
> [<ffffffff8100d7ae>] do_IRQ+0x6e/0xf0
> [<ffffffff8100bc53>] ret_from_intr+0x0/0xa
> <EOI> [<ffffffff8133?f?f>] ? _spin_unlock_irq+0xf/0x40
> [<ffffffff81337f79>] ? _spin_unlock_irq+0x9/0x40
> [<ffffffff81064b9a>] ? exit_signals+0x8a/0x130
> [<ffffffff8105372e>] ? do_exit+0x7e/0x7d0
> [<ffffffff8100f8a7>] ? oops_end+0xa7/0xb0
> [<ffffffff8100faa6>] ? die+0x56/0x90
> [<ffffffff8100c810>] ? do_trap+0x130/0x150
> [<ffffffff8100ccca>] ? do_divide_error+0x8a/0xa0
> [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> [<ffffffff8104400b>] ? cpuacct_charge+0x6b/0x90
> [<ffffffff8100c045>] ? divide_error+0x15/0x20
> [<ffffffff8103d227>] ? find_busiest_group+0x3d7/0xa00
> [<ffffffff8103cfff>] ? find_busiest_group+0x1af/0xa00
> [<ffffffff81335483>] ? thread_return+0x4ce/0x7bb
> [<ffffffff8133bec5>] ? do_nanosleep+0x75/0x30
> [<ffffffff810?1?4e>] ? hrtimer_nanosleep+0x9e/0x120
> [<ffffffff810?08f0>] ? hrtimer_wakeup+0x0/0x30
> [<ffffffff810?183f>] ? sys_nanosleep+0x6f/0x80
> 
> The other two don't use it. The only similarity I see here is that it seems to be related to IRQ handling, but otherwise the two traces don't have anything in common.
> Does anybody have an idea of where I should look? Of course I should update all those boxes to (at least) the latest 2.6.32.x, and I'll do it for sure, but I'd first like to know where the problem was, and whether it has been fixed, or how to fix it...
> I'd be grateful for any help...

There were quite a bunch of scheduler updates recently. We may be lucky and
hope for the bug to have vanished with the changes, but we may as well see
the same crash in 7 months :-/

My coworker Hervé (CC'd), who worked on the issue, suggests that we might have
something which goes wrong past a certain uptime (e.g. 212 days), and which
needs a special event to be triggered (I/O, process exiting, etc...). I think
this makes quite some sense.

Could you check your CONFIG_HZ so that we can convert those uptimes to
jiffies? Maybe this will ring a bell in someone's head :-/
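
To illustrate the conversion, here is a minimal sketch in Python. The
CONFIG_HZ of the affected machines is not known at this point, so the
candidate values (100, 250, 300, 1000) are an assumption based on the choices
the kernel configuration offers on x86. The helper names are made up for this
example, and the uptimes are the ones reported in this thread. The sketch
also ignores the fact that the kernel's jiffies counter does not start at
zero at boot.

```python
# Convert reported uptimes to jiffies for several candidate CONFIG_HZ values.
# Assumption: the real CONFIG_HZ is unknown; these are the usual x86 choices.
SECONDS_PER_DAY = 86400
CANDIDATE_HZ = (100, 250, 300, 1000)
UPTIMES_DAYS = (212, 214, 216, 224)  # uptimes mentioned in this thread


def uptime_to_jiffies(days, hz):
    """Return the number of timer ticks elapsed after `days` of uptime."""
    return days * SECONDS_PER_DAY * hz


def wrap_days(bits, hz):
    """Return the uptime in days at which a `bits`-wide tick counter wraps."""
    return (1 << bits) / (hz * SECONDS_PER_DAY)


if __name__ == "__main__":
    for hz in CANDIDATE_HZ:
        print(f"HZ={hz}: a 32-bit counter wraps after "
              f"{wrap_days(32, hz):.1f} days")
        for days in UPTIMES_DAYS:
            ticks = uptime_to_jiffies(days, hz)
            print(f"  {days} days = {ticks} jiffies (0x{ticks:x})")
```

For example, with HZ=250, 212 days is 4579200000 jiffies, and a 32-bit tick
counter would wrap after about 198.8 days; with HZ=1000 it would wrap after
about 49.7 days. Whether any such boundary is related to these crashes is not
established here.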

Best regards,
Willy

