public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
To: NeilBrown <neilb@suse.de>,
	Linus Torvalds <torvalds@linux-foundation.org>
Cc: John Stultz <john.stultz@linaro.org>,
	Sander Eikelenboom <linux@eikelenboom.it>,
	stefan.bader@canonical.com, rjw@sisk.pl,
	Thomas Gleixner <tglx@linutronix.de>,
	linux-kernel@vger.kernel.org
Subject: Re: Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unable to handle kernel NULL pointer dereference at 0000000000000598 [ 1.478005] IP: [<ffffffff8107a6c4>] queue_work_on+0x4/0x30
Date: Wed, 4 Jan 2012 09:46:00 -0500	[thread overview]
Message-ID: <20120104144600.GB3322@phenom.dumpdata.com> (raw)
In-Reply-To: <20120104122040.5a0fa6a3@notabene.brown>

On Wed, Jan 04, 2012 at 12:20:40PM +1100, NeilBrown wrote:
> On Tue, 03 Jan 2012 16:53:00 -0800 John Stultz <john.stultz@linaro.org> wrote:
> 
> > On Wed, 2012-01-04 at 11:31 +1100, NeilBrown wrote:
> > > On Tue, 03 Jan 2012 15:09:48 -0800 John Stultz <john.stultz@linaro.org> wrote:
> > > > >From the stack trace, we've kicked off a rtc_timer_do_work, probably
> > > > from the rtc_initialize_alarm() schedule_work call added in Neil's
> > > > patch.  From there, we call __rtc_set_alarm -> cmos_set_alarm ->
> > > > cmos_rq_disable -> cmos_checkintr -> rtc_update_irq -> schedule_work.
> > > > 
> > > > So, what it looks to me is that in cmos_checkintr, we grab the cmos->rtc
> > > > and pass that along. Unfortunately, since the cmos->rtc value isn't set
> > > > until after rtc_device_register() returns its null at that point. So
> > > > your patch isn't really fixing the issue, but just reducing the race
> > > > window for the second cpu to schedule the work.
> > > > 
> > > > Sigh. I'd guess dropping the schedule_work call from
> > > > rtc_initialize_alarm() is the right approach (see below). When reviewing
> > > > Neil's patch it seemed like a good idea there, but it seems off to me
> > > > now.
> > > > 
> > > > Neil, any thoughts on the following? Can you expand on the condition you
> > > > were worried about in around that call?
> > > 
> > > If you set an alarm in the future, then shutdown and boot again after that
> > > time, then you will end up with a timer_queue node which is in the past.
> > 
> > Thanks for explaining this again.
> > 
> > Hrm. It seems the easy answer is to simply not add alarms that are in
> > the past.  Further, I'm a bit perplexed, as if they are in the past, the
> > enabled flag shouldn't be set.   __rtc_read_alarm() does check the
> > current time, so maybe we can make sure we don't return old values? I
> > guess I assumed __rtc_read_alarm() avoided returning stale values, but
> > apparently not.
> 
> That would probably be a more robust approach.  Also it might make sense to
> clean out old alarms whenever we are about to add a new one.
> 
> > 
> > > When this happens the queue gets stuck.  That entry-in-the-past won't get
> > > removed until and interrupt happens and an interrupt won't happen because the
> > > RTC only triggers an interrupt when the alarm is "now".
> > > 
> > > So you'll find that e.g. "hwclock" will always tell you that 'select' timed
> > > out.
> > > 
> > > So we force the interrupt work to happen at the start just in case.
> > 
> > Unfortunately its too early. 
> > 
> > > Did you see my proposed patch which converted those calls to do the work
> > > in-process rather than passing it to a worker-thread?  I think that is a
> > > clean fix.
> > 
> > I don't think I saw it today. Was it from before the holidays?
> 
> About 4 hours ago:
>   Subject: Re: Patch Upstream: rtc: Expire alarms after the time is set.
> 
> > 
> > Even so, at this point, I don't know if we have enough time for testing,
> > so I'm thinking we either just drop the problematic sched_work call or
> > revert the whole thing and try again for 3.3
> 
> I wouldn't object to that.  The bug only triggers in unusual circumstances
> and is quite easy to work around so it is safer  to wait until we have a
> really good fix.

Linus,

Sorry for getting you in this loop so late-ish. Would it be possible to revert
93b2ec0128c431148b216b8f7337c1a52131ef03 before 3.2 is released? If there are a
couple of days to work this out we can probably come up with a proper patch but
we don't know when 3.2 is going out (presumarily today?).

  reply	other threads:[~2012-01-04 14:47 UTC|newest]

Thread overview: 26+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2012-01-03 16:13 [3.2.0-RC7] BUG: unable to handle kernel NULL pointer dereference at 0000000000000598 [ 1.478005] IP: [<ffffffff8107a6c4>] queue_work_on+0x4/0x30 Sander Eikelenboom
2012-01-03 19:07 ` Regression: ONE CPU fails bootup at " Konrad Rzeszutek Wilk
2012-01-03 19:17   ` Sander Eikelenboom
2012-01-03 19:26   ` Stefan Bader
2012-01-03 20:11     ` Sander Eikelenboom
2012-01-03 20:10   ` Sander Eikelenboom
2012-01-03 22:33     ` Konrad Rzeszutek Wilk
2012-01-03 23:09   ` John Stultz
2012-01-04  0:31     ` NeilBrown
2012-01-04  0:53       ` John Stultz
2012-01-04  1:20         ` NeilBrown
2012-01-04 14:46           ` Konrad Rzeszutek Wilk [this message]
2012-01-04 15:12           ` Regression: ONE CPU fails bootup at Re: [3.2.0-RC7] BUG: unable to handle kernel NULL pointer dereference at 0000000000000598 " Stefan Bader
2012-01-05 22:03             ` NeilBrown
2012-01-04  8:17         ` Stefan Bader
2012-01-04 12:25           ` Stefan Bader
2012-01-04 13:17             ` Sander Eikelenboom
2012-01-04 18:33               ` John Stultz
2012-01-04 14:13             ` Stefan Bader
2012-01-06 20:41               ` John Stultz
2012-01-08 20:48                 ` Sander Eikelenboom
2012-01-09 13:26                 ` Stefan Bader
2012-01-04 18:35             ` John Stultz
2012-01-04 18:36           ` John Stultz
2012-01-04 18:50             ` Stefan Bader
2012-01-04 19:47             ` Konrad Rzeszutek Wilk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20120104144600.GB3322@phenom.dumpdata.com \
    --to=konrad.wilk@oracle.com \
    --cc=john.stultz@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux@eikelenboom.it \
    --cc=neilb@suse.de \
    --cc=rjw@sisk.pl \
    --cc=stefan.bader@canonical.com \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox