* Re: /proc/stat btime accuracy problem [not found] ` <BANLkTimuPrqN1euyOqAGm2m4Ea1PdbrzDQ@mail.gmail.com> @ 2011-06-02 0:49 ` john stultz 2011-06-02 6:34 ` Bjorn Helgaas 2011-06-02 10:00 ` Alan Cox 0 siblings, 2 replies; 7+ messages in thread From: john stultz @ 2011-06-02 0:49 UTC (permalink / raw) To: Bjorn Helgaas Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial, Alan Cox On Wed, 2011-06-01 at 18:31 -0600, Bjorn Helgaas wrote: > On Wed, Jun 1, 2011 at 5:58 PM, john stultz <johnstul@us.ibm.com> wrote: > > My first instinct is "don't do that!" to whatever driver is disabling > > irqs for so long. Do you know what's actually causing these long irq off > > periods? > > > > I assume you're noticing this offset by seeing that CLOCK_REALTIME is > > off from the RTC right after boot? How severe is this? The RTC read is > > only second granular, so there's a fair amount of error (~1 second) > > possible right at boot, so this then must be many seconds worth of lost > > ticks to be noticeable, right? > > I'm using 2.6.34, so not really new. I think the major offender is > kernel serial printk, which is done in polled mode. A *lot* of it, > e.g., 30+ seconds' worth. CC'ing the linux-serial list and Alan for their thoughts: Does the serial port is disabling irqs for 30+ seconds during boot seem at all sane? I would think it would periodically enable irqs in the polling loop to let things be processed, but maybe not? > I wonder if there's some reasonably clean > way to resync with the RTC, say at the time we register a clocksource > better than jiffies, or in clocksource_done_booting(), or something. I definitely think we need to address the long irq disable periods instead of trying to hack around the issue in the timekeeping code. thanks -john ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: /proc/stat btime accuracy problem 2011-06-02 0:49 ` /proc/stat btime accuracy problem john stultz @ 2011-06-02 6:34 ` Bjorn Helgaas 2011-06-07 5:20 ` Bjorn Helgaas 2011-06-02 10:00 ` Alan Cox 1 sibling, 1 reply; 7+ messages in thread From: Bjorn Helgaas @ 2011-06-02 6:34 UTC (permalink / raw) To: john stultz Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial, Alan Cox On Wed, Jun 1, 2011 at 6:49 PM, john stultz <johnstul@us.ibm.com> wrote: > On Wed, 2011-06-01 at 18:31 -0600, Bjorn Helgaas wrote: >> On Wed, Jun 1, 2011 at 5:58 PM, john stultz <johnstul@us.ibm.com> wrote: >> > My first instinct is "don't do that!" to whatever driver is disabling >> > irqs for so long. Do you know what's actually causing these long irq off >> > periods? >> > >> > I assume you're noticing this offset by seeing that CLOCK_REALTIME is >> > off from the RTC right after boot? How severe is this? The RTC read is >> > only second granular, so there's a fair amount of error (~1 second) >> > possible right at boot, so this then must be many seconds worth of lost >> > ticks to be noticeable, right? >> >> I'm using 2.6.34, so not really new. I think the major offender is >> kernel serial printk, which is done in polled mode. A *lot* of it, >> e.g., 30+ seconds' worth. > > CC'ing the linux-serial list and Alan for their thoughts: Does the > serial port is disabling irqs for 30+ seconds during boot seem at all > sane? > > I would think it would periodically enable irqs in the polling loop to > let things be processed, but maybe not? > >> I wonder if there's some reasonably clean >> way to resync with the RTC, say at the time we register a clocksource >> better than jiffies, or in clocksource_done_booting(), or something. > > I definitely think we need to address the long irq disable periods > instead of trying to hack around the issue in the timekeeping code. OK, I have a theory. Here are snippets from my dmesg log (this is from a current upstream kernel). The "BH" lines are instrumentation I added to read the RTC at a few places: Linux version 3.0.0-smp-DEV ... BH now rtc 1306992452 (start_kernel, before setup_arch) Printk 230K of numa=fake debug stuff (more than seems strictly necessary to me, but there it is). All this data goes into the log buffer, not to the UART, because the console hasn't been initialized yet. Command line: auto BOOT_IMAGE=300smpD ro root=/dev/hda1,/dev/sda1 ignore_loglevel console=ttyS0,115200n8 swiotlb=16000 numa=fake=16M BH now rtc 1306992457 (start_kernel, after mm init) BH btime rtc 1306992457 xt 1306992457 wtm -1306992457 (timekeeping_init) Snapshot btime and init xtime and wall_to_monotonic. So theoretically, we should see 1306992457 in /proc/stat btime. Console: colour dummy device 80x25 console [ttyS0] enabled Now ttyS0 is enabled, so we dump the log buffer to the UART. I think this happens in console_unlock(), with interrupts disabled for the whole buffer. BH now rtc 1306992481 xt 1306992459 wtm -1306992457 clocksource_register jiffies This RTC read is in clocksource_register(); note that xtime is now 22 seconds behind the RTC. The UART is running at 115200 baud, and 230K/(115200/10) = about 20 seconds, so this sort of matches the time I expect it to take to dump the buffer. HPET: 3 timers in total, 0 timers will be used for per-cpu timer hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0 hpet0: 3 comparators, 64-bit 14.318180 MHz counter Switching to clocksource hpet Refined TSC clocksource calibration: 2400.084 MHz. Switching to clocksource tsc BH <set rtc 1306992488 xt 1306992465 wtm -1306992457 to 1306992488 BH settimeofday delta 23 BH set> rtc 1306992488 xt 1306992488 wtm -1306992480 This settimeofday() call happens when userland runs "hwclock --hctosys". We adjust xtime and wall_to_monotonic by 23 seconds. Finally, we read /proc/stat. Since we only re-synced the system clock to the RTC (we didn't sync to an external source), btime ought to be the same as it was when we collected it, i.e., 1306992457. But instead, we get this, which I assume is the same wall_to_monotonic we computed in settimeofday(), but rounded: btime 1306992479 Things that look interesting to me: - numa=fake prints an excessive amount of debug info! - We register the ttyS0 console before registering the jiffies clocksource. It seems like the jiffies clocksource ought to be registered about the same time we initialize xtime in timekeeping_init(). - Interrupts are disabled the entire time we dump the log buffer, which could be a very long time. Maybe we could restore/disable interrupts periodically there. Bjorn ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: /proc/stat btime accuracy problem 2011-06-02 6:34 ` Bjorn Helgaas @ 2011-06-07 5:20 ` Bjorn Helgaas 2011-06-07 17:50 ` john stultz 0 siblings, 1 reply; 7+ messages in thread From: Bjorn Helgaas @ 2011-06-07 5:20 UTC (permalink / raw) To: john stultz Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial, Alan Cox I'm still spinning my wheels on this, so I guess the only thing left is to ask even more stupid questions :) I'm only concerned about the early boot sequence, and I think only about the period when we're using the jiffies clocksource. My understanding is: - I'm using the jiffies clocksource during early boot. - Jiffies depends on a periodic (1000 HZ in my case) interrupt that updates xtime via the tick_periodic -> do_timer -> update_wall_time path. - If those periodic interrupts are lost, those xtime updates are forever lost. - An interrupt would be lost if interrupts are disabled for an interval that covers two or more ticks (my guess ... I'm thinking that if interrupts were re-enabled before the second tick, the first one would be delayed but not lost). - The RTC runs independently of CPU interrupts being disabled, so its time is not lost. - User-space will typically reset xtime to match the RTC And my sequence of events is: - xtime = RTC reading #1 - wall_to_monotonic = -xtime - periodic tick increments xtime - some ticks are lost while interrupts are disabled - by the time we switch from jiffies to hpet and eventually tsc clock source, the RTC is ahead of xtime by several seconds (1-2 in a normal boot, 30+ in more extreme cases) - user-space resets xtime to RTC ("hwclock -hctosys" in my case), which adds the delta to xtime and subtracts it from wall_to_monotonic - getboottime() returns -wall_to_monotonic (should be RTC reading #1, but now "reading #1 + delta") It seems like we're throwing away information here at the time we switch from jiffies to a more capable clocksource -- at that point, we know the RTC - xtime delta, and we know that delta represents time when interrupts were disabled. (Obviously this only applies during early boot, before we do any RTC updates.) My naive thought was "well, what if we just use the RTC directly as a clocksource." It's crappy resolution, but at least it doesn't lose time, so I tried the following, which didn't work at all (hangs during boot). But I don't know enough to know *why* this isn't feasible. Seems like jiffies can be different sizes, so why not 1 Hz? +static cycle_t rtc_read(struct clocksource *cs) +{ + static struct timespec base; + struct timespec now; + static int first = 1; + + read_persistent_clock(&now); + if (first) { + first = 0; + base = now; + } + + return (cycle_t) (now.tv_sec - base.tv_sec); +} + +struct clocksource clocksource_rtc = { + .name = "rtc", + .rating = 1, + .read = rtc_read, + .flags = CLOCK_SOURCE_IS_CONTINUOUS, + .mask = CLOCKSOURCE_MASK(64), + .mult = NSEC_PER_SEC, +}; + +struct clocksource * __init clocksource_default_clock(void) +{ + return &clocksource_rtc; +} Bjorn ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: /proc/stat btime accuracy problem 2011-06-07 5:20 ` Bjorn Helgaas @ 2011-06-07 17:50 ` john stultz 2011-06-08 1:03 ` john stultz 0 siblings, 1 reply; 7+ messages in thread From: john stultz @ 2011-06-07 17:50 UTC (permalink / raw) To: Bjorn Helgaas Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial, Alan Cox On Mon, 2011-06-06 at 23:20 -0600, Bjorn Helgaas wrote: > I'm still spinning my wheels on this, so I guess the only thing left > is to ask even more stupid questions :) > > I'm only concerned about the early boot sequence, and I think only > about the period when we're using the jiffies clocksource. My > understanding is: > > - I'm using the jiffies clocksource during early boot. > - Jiffies depends on a periodic (1000 HZ in my case) interrupt that > updates xtime via the tick_periodic -> do_timer -> update_wall_time > path. Yep. > - If those periodic interrupts are lost, those xtime updates are forever lost. Yep. > - An interrupt would be lost if interrupts are disabled for an > interval that covers two or more ticks (my guess ... I'm thinking that > if interrupts were re-enabled before the second tick, the first one > would be delayed but not lost). Yep. There's also some possibly connected issues here to irq starvation related to the irq priorities (so even if irqs were disabled, if the irq is getting hammered, and that irq is higher priority then the tick, you can lose ticks that way as well). > - The RTC runs independently of CPU interrupts being disabled, so > its time is not lost. Yep. > - User-space will typically reset xtime to match the RTC Not really sure about this one. I think most systems will set the system time via NTP and then after we're considered in-sync with ntpd we'll set the RTC to system time every 11 minutes. But regardless, the issue that if we lose ticks, the btime won't seem to be correct remains. > And my sequence of events is: > > - xtime = RTC reading #1 > - wall_to_monotonic = -xtime > - periodic tick increments xtime > - some ticks are lost while interrupts are disabled > - by the time we switch from jiffies to hpet and eventually tsc > clock source, the RTC is ahead of xtime by several seconds (1-2 in a > normal boot, 30+ in more extreme cases) > - user-space resets xtime to RTC ("hwclock -hctosys" in my case), > which adds the delta to xtime and subtracts it from wall_to_monotonic > - getboottime() returns -wall_to_monotonic (should be RTC reading > #1, but now "reading #1 + delta") > > It seems like we're throwing away information here at the time we > switch from jiffies to a more capable clocksource -- at that point, we > know the RTC - xtime delta, and we know that delta represents time > when interrupts were disabled. (Obviously this only applies during > early boot, before we do any RTC updates.) But I think you're focusing on trying to solve the symptom instead of the problem. The really big issue here is that irqs are apparently being disabled for 30 seconds at a time. Sure, once a real clocksource is registered, maybe you don't see timekeeping problems, but if the serial console gets more output, but then you might see strange scheduling issues, or very late timers. Further, you could hit other strange problems like OOM issues if you're doing lots of RCU and the grace periods don't get to run. Further, even if we did use the RTC to correct for lost ticks that happened while using the jiffies clocksource, you have the fact that the RTC resolution is so coarse, you couldn't account for lost ticks of less then a second anyway (which I suspect is much more common then the 30 second intervals you're seeing). > My naive thought was "well, what if we just use the RTC directly as a > clocksource." It's crappy resolution, but at least it doesn't lose > time, so I tried the following, which didn't work at all (hangs during > boot). But I don't know enough to know *why* this isn't feasible. It wouldn't be impossible to use the RTC as a clocksource (I think old 601 ppc macs use this). However, its not really a generic solution, as systems have a number of different types of RTCs, some which go over i2c buses or require interrupts in order to be read. read_persistent_clock is safe, but it doesn't solve the issue for systems that don't provide a read_persistent_clock hook. > Seems like jiffies can be different sizes, so why not 1 Hz? Hmm. That is interesting. I'm guessing it probably hits an edge case where the timekeeping code expects there to be a non-zero shift value. But again, I don't think this approach is going to solve all the issues that might be caused by 30-seconds of irqs being off. Maybe to get this back on coarse, could you provide some additional details about the machine where you're seeing this? Is there one specific driver that is putting out tons of output over the serial console? Or is there anything unique about the serial port or its settings (is it configured at 300 baud :)? What is the /proc/interrupts count after boot on one of these systems? thanks -john ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: /proc/stat btime accuracy problem 2011-06-07 17:50 ` john stultz @ 2011-06-08 1:03 ` john stultz 2011-06-08 4:16 ` Bjorn Helgaas 0 siblings, 1 reply; 7+ messages in thread From: john stultz @ 2011-06-08 1:03 UTC (permalink / raw) To: Bjorn Helgaas Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial, Alan Cox On Tue, 2011-06-07 at 10:50 -0700, john stultz wrote: > Maybe to get this back on coarse, could you provide some additional > details about the machine where you're seeing this? Is there one > specific driver that is putting out tons of output over the serial > console? Or is there anything unique about the serial port or its > settings (is it configured at 300 baud :)? What is the /proc/interrupts > count after boot on one of these systems? Sorry, I just remembered you already provided some of these details below: On Thu, 2011-06-02 at 00:34 -0600, Bjorn Helgaas wrote: > Linux version 3.0.0-smp-DEV ... > BH now rtc 1306992452 (start_kernel, before setup_arch) > Printk 230K of numa=fake debug stuff (more than seems strictly > necessary to me, but there it is). All this data goes into the log > buffer, not to the UART, because the console hasn't been > initialized yet. [snip] > console [ttyS0] enabled > Now ttyS0 is enabled, so we dump the log buffer to the UART. I think > this happens in console_unlock(), with interrupts disabled for the whole > buffer. > > BH now rtc 1306992481 xt 1306992459 wtm -1306992457 > clocksource_register jiffies > This RTC read is in clocksource_register(); note that xtime is now > 22 seconds behind the RTC. The UART is running at 115200 baud, > and 230K/(115200/10) = about 20 seconds, so this sort of matches > the time I expect it to take to dump the buffer. Ok. So having the 230k data backlog at console_unlock() seems to be part of the issue. Maybe something like the following could help? By only holding irqs off for 1k chunks? thanks -john diff --git a/kernel/printk.c b/kernel/printk.c index 3518539..9703b22 100644 --- a/kernel/printk.c +++ b/kernel/printk.c @@ -1243,6 +1243,7 @@ void console_unlock(void) unsigned long flags; unsigned _con_start, _log_end; unsigned wake_klogd = 0; + unsigned chunk_size, length; if (console_suspended) { up(&console_sem); @@ -1251,14 +1252,18 @@ void console_unlock(void) console_may_schedule = 0; + chunk_size = min(LOG_BUF_MASK, 1024); /* 1k chunks */ + for ( ; ; ) { spin_lock_irqsave(&logbuf_lock, flags); wake_klogd |= log_start - log_end; if (con_start == log_end) break; /* Nothing to print */ + length = (log_end - con_start) & LOG_BUF_MASK; + length = min(length , chunk_size); _con_start = con_start; - _log_end = log_end; - con_start = log_end; /* Flush */ + _log_end = (con_start + length) & LOG_BUF_MASK; + con_start = _log_end; /* Flush */ spin_unlock(&logbuf_lock); stop_critical_timings(); /* don't trace print latency */ call_console_drivers(_con_start, _log_end); ^ permalink raw reply related [flat|nested] 7+ messages in thread
* Re: /proc/stat btime accuracy problem 2011-06-08 1:03 ` john stultz @ 2011-06-08 4:16 ` Bjorn Helgaas 0 siblings, 0 replies; 7+ messages in thread From: Bjorn Helgaas @ 2011-06-08 4:16 UTC (permalink / raw) To: john stultz Cc: Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial, Alan Cox On Tue, Jun 7, 2011 at 7:03 PM, john stultz <johnstul@us.ibm.com> wrote: > On Tue, 2011-06-07 at 10:50 -0700, john stultz wrote: >> Maybe to get this back on coarse, could you provide some additional >> details about the machine where you're seeing this? Is there one >> specific driver that is putting out tons of output over the serial >> console? Or is there anything unique about the serial port or its >> settings (is it configured at 300 baud :)? What is the /proc/interrupts >> count after boot on one of these systems? > > Sorry, I just remembered you already provided some of these details > below: > > On Thu, 2011-06-02 at 00:34 -0600, Bjorn Helgaas wrote: >> Linux version 3.0.0-smp-DEV ... >> BH now rtc 1306992452 (start_kernel, before setup_arch) >> Printk 230K of numa=fake debug stuff (more than seems strictly >> necessary to me, but there it is). All this data goes into the log >> buffer, not to the UART, because the console hasn't been >> initialized yet. > [snip] >> console [ttyS0] enabled >> Now ttyS0 is enabled, so we dump the log buffer to the UART. I think >> this happens in console_unlock(), with interrupts disabled for the whole >> buffer. >> >> BH now rtc 1306992481 xt 1306992459 wtm -1306992457 >> clocksource_register jiffies >> This RTC read is in clocksource_register(); note that xtime is now >> 22 seconds behind the RTC. The UART is running at 115200 baud, >> and 230K/(115200/10) = about 20 seconds, so this sort of matches >> the time I expect it to take to dump the buffer. > > > Ok. So having the 230k data backlog at console_unlock() seems to be part of the issue. > > Maybe something like the following could help? By only holding irqs off for 1k chunks? > > thanks > -john > > > diff --git a/kernel/printk.c b/kernel/printk.c > index 3518539..9703b22 100644 > --- a/kernel/printk.c > +++ b/kernel/printk.c > @@ -1243,6 +1243,7 @@ void console_unlock(void) > unsigned long flags; > unsigned _con_start, _log_end; > unsigned wake_klogd = 0; > + unsigned chunk_size, length; > > if (console_suspended) { > up(&console_sem); > @@ -1251,14 +1252,18 @@ void console_unlock(void) > > console_may_schedule = 0; > > + chunk_size = min(LOG_BUF_MASK, 1024); /* 1k chunks */ > + > for ( ; ; ) { > spin_lock_irqsave(&logbuf_lock, flags); > wake_klogd |= log_start - log_end; > if (con_start == log_end) > break; /* Nothing to print */ > + length = (log_end - con_start) & LOG_BUF_MASK; > + length = min(length , chunk_size); > _con_start = con_start; > - _log_end = log_end; > - con_start = log_end; /* Flush */ > + _log_end = (con_start + length) & LOG_BUF_MASK; > + con_start = _log_end; /* Flush */ > spin_unlock(&logbuf_lock); > stop_critical_timings(); /* don't trace print latency */ > call_console_drivers(_con_start, _log_end); That's a much better attempt than the disastrous one I made. But it didn't help at all, and I think I know one reason: if we reenable interrupts every 1K, and I have 230K to dump, that's 230 additional ticks we might get. But with HZ=1000, that's only a quarter second of wallclock time. Even at 115200 baud, I think we can only print about 10 characters/ms. It doesn't seem reasonable to me to try to flush in chunks anywhere near that small. But something else must still be going on, because even when I *did* turn the chunk size down to 100 or 10 or even 4 characters, we still didn't recover any of that missing wallclock time. I tried to verify that interrupts were actually enabled by the local_irq_restore(), and they were, but maybe there's still a priority issue or something. I wonder if we could do something with the TSC, e.g., capture it early, then use it in getboottime() to compute the delta back from the current time. Machines without a stable TSC would use the present system. Bjorn -- To unsubscribe from this list: send the line "unsubscribe linux-serial" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html ^ permalink raw reply [flat|nested] 7+ messages in thread
* Re: /proc/stat btime accuracy problem 2011-06-02 0:49 ` /proc/stat btime accuracy problem john stultz 2011-06-02 6:34 ` Bjorn Helgaas @ 2011-06-02 10:00 ` Alan Cox 1 sibling, 0 replies; 7+ messages in thread From: Alan Cox @ 2011-06-02 10:00 UTC (permalink / raw) To: john stultz Cc: Bjorn Helgaas, Thomas Gleixner, linux-kernel@vger.kernel.org, linux-serial > CC'ing the linux-serial list and Alan for their thoughts: Does the > serial port is disabling irqs for 30+ seconds during boot seem at all > sane? printk is supposed to work from anywhere so it takes the hit. It should only be printk that is involved however. Alan ^ permalink raw reply [flat|nested] 7+ messages in thread
end of thread, other threads:[~2011-06-08 4:17 UTC | newest]
Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <BANLkTiniq6GVMJPO_-efKrWHQg9V9Dod9g@mail.gmail.com>
[not found] ` <1306967733.11492.11.camel@work-vm>
[not found] ` <BANLkTi=yEMDLEo99WZm2Bt4pi_pog5igww@mail.gmail.com>
[not found] ` <1306972711.11492.23.camel@work-vm>
[not found] ` <BANLkTimuPrqN1euyOqAGm2m4Ea1PdbrzDQ@mail.gmail.com>
2011-06-02 0:49 ` /proc/stat btime accuracy problem john stultz
2011-06-02 6:34 ` Bjorn Helgaas
2011-06-07 5:20 ` Bjorn Helgaas
2011-06-07 17:50 ` john stultz
2011-06-08 1:03 ` john stultz
2011-06-08 4:16 ` Bjorn Helgaas
2011-06-02 10:00 ` Alan Cox
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).