From mboxrd@z Thu Jan 1 00:00:00 1970
From: Bernhard Schiffner
Subject: Re: Suspend resume problem (WAS Re: [ANNOUNCE] 3.8.10-rt6)
Date: Wed, 01 May 2013 10:30:48 +0200
Message-ID: <47157022.drQAKXAcr6@bs8>
References: <20130429201202.GB7979@linutronix.de>
 <20130429161925.2a6ea78a@riff.lan>
 <20130430170948.GB4688@linutronix.de>
Mime-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 7Bit
To: linux-rt-users
Return-path:
Received: from moutng.kundenserver.de ([212.227.126.187]:49737
 "EHLO moutng.kundenserver.de" rhost-flags-OK-OK-OK-OK)
 by vger.kernel.org with ESMTP id S1751219Ab3EAId0 (ORCPT );
 Wed, 1 May 2013 04:33:26 -0400
In-Reply-To: <20130430170948.GB4688@linutronix.de>
Sender: linux-rt-users-owner@vger.kernel.org
List-ID:

On Tuesday, 30 April 2013, 19:09:48, Sebastian Andrzej Siewior wrote:
> * Clark Williams | 2013-04-29 16:19:25 [-0500]:
> >On Mon, 29 Apr 2013 22:12:02 +0200
> >
> >Sebastian Andrzej Siewior wrote:
> >> - suspend / resume seems to program the timer wrong and wait
> >>
> >>   ages until it continues.
> >
> >It has to be something we're doing when we apply RT to v3.8.x, since
> >v3.8.x suspends/resumes with no issues and I was able to suspend and
> >resume fine with the 3.6-rt series.
>
> I think I figured out what is going on, or at least I think I did.
>
> This log snippet is from the resume path (from suspend to mem):
>
> [   15.052115] Enabling non-boot CPUs ...
> [   15.052115] smpboot: Booting Node 0 Processor 1 APIC 0x1
> [   14.841378] Initializing CPU#1
> [   42.840017] [sched_delayed] sched: RT throttling activated
> [   42.842144] CPU1 is up
> [   42.842536] smpboot: Booting Node 0 Processor 2 APIC 0x2
>
> Two things happen here:
> - the time goes backwards from 15.X to 14.X. This is okay because the
>   14.X is the timestamp from the secondary CPU, which is not yet
>   synchronized with the boot CPU
> - the printk with "CPU1 is up" is coming from the boot CPU and,
>   according to the timestamp, about 28 secs passed by. But this did not
>   really happen, as the whole procedure took less time.
>
> The next thing that happens is that RCU assumes nobody is making any
> progress (for almost 28 secs) and triggers NMIs & printks to get some
> attention. I have a trace where
> - CPU0: arch_trigger_all_cpu_backtrace_handler() => printk()
>   has "lock" and is spinning for logbuf_lock
>
> - CPU1: print_cpu_stall() => printk() (spinning for the lock) => NMI =>
>   arch_trigger_all_cpu_backtrace_handler()
>   it may have logbuf_lock and is spinning for "lock"
>
> I can't tell if CPU1 got the logbuf_lock at this time, but it seemed
> that it made no progress until I ended it.
> This NMI-related deadlock is a problem which should also trigger in
> mainline, right?
>
> Now, the time jump on the other hand is the real issue here and is
> RT-only. It looks like we get a big number of timer updates via
> tick_do_update_jiffies64() because, according to ktime_get(), that much
> time really passed by.
>
> The solution seems as simple as
>
> From c27eb2e0ab0b5acd96a4b62288976f1b72789b3e Mon Sep 17 00:00:00 2001
> From: Sebastian Andrzej Siewior
> Date: Tue, 30 Apr 2013 18:53:55 +0200
> Subject: [PATCH] time/timekeeping: shadow tk->cycle_last together with
>  clock->cycle_last
>
> Commit ("timekeeping: Store cycle_last value in timekeeper struct as
> well") introduced a tk-> based cycle_last value which needs to be reset
> on the resume path as well, or else ktime_get() will think that time
> increased a lot.
>
> Signed-off-by: Sebastian Andrzej Siewior
> ---
>  kernel/time/timekeeping.c | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 99f943b..688817f 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -777,6 +777,7 @@ static void timekeeping_resume(void)
>  	}
>  	/* re-base the last cycle value */
>  	tk->clock->cycle_last = tk->clock->read(tk->clock);
> +	tk->cycle_last = tk->clock->cycle_last;
>  	tk->ntp_error = 0;
>  	timekeeping_suspended = 0;
>  	timekeeping_update(tk, false, true);
>
> >Clark
>
> Sebastian
> --

This patch, together with the in_nmi() patch, solves the resume problem
for me. Architecture X64, patched against 3.8.10-rt6.

THANKS!

Bernhard
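
For anyone following the thread later, the time jump described in the
commit message can be sketched in plain user-space C. This is only an
illustration, not kernel code: the struct and function names
(sketch_clock, sketch_timekeeper, sketch_elapsed,
sketch_resume_unpatched, sketch_resume_patched) are hypothetical, the
counter values are made up, and it assumes the time reader takes
tk->cycle_last as the base for its delta, as the commit message says.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel structures; not the real layout. */
struct sketch_clock {
	uint64_t counter;    /* current value of the hardware counter */
	uint64_t cycle_last; /* plays the role of clock->cycle_last */
};

struct sketch_timekeeper {
	struct sketch_clock *clock;
	uint64_t cycle_last; /* plays the role of tk->cycle_last (shadow copy) */
};

/* Cycles that a reader using the shadow copy believes have elapsed. */
uint64_t sketch_elapsed(const struct sketch_timekeeper *tk)
{
	assert(tk && tk->clock);
	return tk->clock->counter - tk->cycle_last;
}

/* Resume without the patch: only clock->cycle_last is re-based. */
void sketch_resume_unpatched(struct sketch_timekeeper *tk)
{
	assert(tk && tk->clock);
	tk->clock->cycle_last = tk->clock->counter;
}

/* Resume with the patch: the shadow copy is re-based as well. */
void sketch_resume_patched(struct sketch_timekeeper *tk)
{
	assert(tk && tk->clock);
	tk->clock->cycle_last = tk->clock->counter;
	tk->cycle_last = tk->clock->cycle_last;
}
```

With both cycle_last copies at 1000 before suspend and the counter at
29000 on resume, sketch_resume_unpatched() leaves sketch_elapsed() at
28000 cycles, while sketch_resume_patched() brings it back to 0. The
stale shadow copy is what makes the reader believe that much time
passed.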