All of lore.kernel.org
 help / color / mirror / Atom feed
From: Chris Wright <chrisw@sous-sol.org>
To: linux-kernel@vger.kernel.org, stable@kernel.org,
	Greg KH <gregkh@suse.de>, Chris Wright <chrisw@kernel.org>,
	Adrian Bunk <bunk@stusta.de>
Cc: Justin Forbes <jmforbes@linuxtx.org>,
	Zwane Mwaikambo <zwane@arm.linux.org.uk>,
	"Theodore Ts'o" <tytso@mit.edu>,
	Randy Dunlap <rdunlap@xenotime.net>,
	Dave Jones <davej@redhat.com>,
	Chuck Wolber <chuckw@quantumlinux.com>,
	Chris Wedgwood <reviews@ml.cw.f00f.org>,
	Michael Krufky <mkrufky@linuxtv.org>,
	torvalds@osdl.org, akpm@osdl.org, alan@lxorguk.ukuu.org.uk,
	Ingo Molnar <mingo@elte.hu>
Subject: [patch 19/50] sched: fix bad missed wakeups in the i386, x86_64, ia64, ACPI and APM idle code
Date: Fri, 05 Jan 2007 18:28:12 -0800	[thread overview]
Message-ID: <20070106023229.297246000@sous-sol.org> (raw)
In-Reply-To: 20070106022753.334962000@sous-sol.org

[-- Attachment #1: sched-fix-bad-missed-wakeups-in-the-i386-x86_64-ia64-acpi-and-apm-idle-code.patch --]
[-- Type: text/plain, Size: 8173 bytes --]

-stable review patch.  If anyone has any objections, please let us know.
------------------

From: Ingo Molnar <mingo@elte.hu>

Fernando Lopez-Lezcano reported frequent scheduling latencies and audio 
xruns starting at the 2.6.18-rt kernel, and those problems persisted all 
until current -rt kernels. The latencies were serious and unjustified by 
system load, often in the milliseconds range.

After a patient and heroic multi-month effort of Fernando, where he 
tested dozens of kernels, tried various configs, boot options, 
test-patches of mine and provided latency traces of those incidents, the 
following 'smoking gun' trace was captured by him:

                 _------=> CPU#
                / _-----=> irqs-off
               | / _----=> need-resched
               || / _---=> hardirq/softirq
               ||| / _--=> preempt-depth
               |||| /
               |||||     delay
   cmd     pid ||||| time  |   caller
      \   /    |||||   \   |   /
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup <<...>-5856> (37 0)
  IRQ_19-1479  1D..1    0us : __trace_start_sched_wakeup (c01262ba 0 0)
  IRQ_19-1479  1D..1    0us : resched_task (try_to_wake_up)
  IRQ_19-1479  1D..1    0us : __spin_unlock_irqrestore (try_to_wake_up)
  ...
  <idle>-0     1...1   11us!: default_idle (cpu_idle)
  ...
  <idle>-0     0Dn.1  602us : smp_apic_timer_interrupt (c0103baf 1 0)
  ...
   <...>-5856  0D..2  618us : __switch_to (__schedule)
   <...>-5856  0D..2  618us : __schedule <<idle>-0> (20 162)
   <...>-5856  0D..2  619us : __spin_unlock_irq (__schedule)
   <...>-5856  0...1  619us : trace_stop_sched_switched (__schedule)
   <...>-5856  0D..1  619us : trace_stop_sched_switched <<...>-5856> (37 0)

what is visible in this trace is that CPU#1 ran try_to_wake_up() for 
PID:5856, it placed PID:5856 on CPU#0's runqueue and ran resched_task() 
for CPU#0. But it decided to not send an IPI that no CPU - due to 
TS_POLLING. But CPU#0 never woke up after its NEED_RESCHED bit was set, 
and only rescheduled to PID:5856 upon the next lapic timer IRQ. The 
result was a 600+ usecs latency and a missed wakeup!

the bug turned out to be an idle-wakeup bug introduced into the mainline 
kernel this summer via an optimization in the x86_64 tree:

    commit 495ab9c045e1b0e5c82951b762257fe1c9d81564
    Author: Andi Kleen <ak@suse.de>
    Date:   Mon Jun 26 13:59:11 2006 +0200

    [PATCH] i386/x86-64/ia64: Move polling flag into thread_info_status

    During some profiling I noticed that default_idle causes a lot of
    memory traffic. I think that is caused by the atomic operations
    to clear/set the polling flag in thread_info. There is actually
    no reason to make this atomic - only the idle thread does it
    to itself, other CPUs only read it. So I moved it into ti->status.

the problem is this type of change:

        if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
-               clear_thread_flag(TIF_POLLING_NRFLAG);
+               current_thread_info()->status &= ~TS_POLLING;
                smp_mb__after_clear_bit();
                while (!need_resched()) {
                        local_irq_disable();

this changes clear_thread_flag() to an explicit clearing of TS_POLLING. 
clear_thread_flag() is defined as:

        clear_bit(flag, &ti->flags);

and clear_bit() is a LOCK-ed atomic instruction on all x86 platforms:

  static inline void clear_bit(int nr, volatile unsigned long * addr)
  {
          __asm__ __volatile__( LOCK_PREFIX
                  "btrl %1,%0"

hence smp_mb__after_clear_bit() is defined as a simple compile barrier:

  #define smp_mb__after_clear_bit()       barrier()

but the explicit TS_POLLING clearing introduced by the patch:

+               current_thread_info()->status &= ~TS_POLLING;

is not an atomic op! So the clearing of the TS_POLLING bit is freely 
reorderable with the reading of the NEED_RESCHED bit - and both now 
reside in different memory addresses.

CPU idle wakeup very much depends on ordered memory ops, the clearing of 
the TS_POLLING flag must always be done before we test need_resched() 
and hit the idle instruction(s). [Symmetrically, the wakeup code needs 
to set NEED_RESCHED before it tests the TS_POLLING flag, so memory 
ordering is paramount.]

Fernando's dual-core Athlon64 system has a sufficiently advanced memory 
ordering model so that it triggered this scenario very often.

( And it also turned out that the reason why these latencies never
  triggered on my testsystems is that i routinely use idle=poll, which
  was the only idle variant not affected by this bug. )

The fix is to change the smp_mb__after_clear_bit() to an smp_mb(), to 
act as an absolute barrier between the TS_POLLING write and the 
NEED_RESCHED read. This affects almost all idling methods (default, 
ACPI, APM), on all 3 x86 architectures: i386, x86_64, ia64.

Signed-off-by: Ingo Molnar <mingo@elte.hu>
Tested-by: Fernando Lopez-Lezcano <nando@ccrma.Stanford.EDU>
[chrisw: backport to 2.6.19.1]
Signed-off-by: Chris Wright <chrisw@sous-sol.org>
---
 arch/i386/kernel/apm.c        |    6 +++++-
 arch/i386/kernel/process.c    |    7 ++++++-
 arch/ia64/kernel/process.c    |   10 ++++++++--
 arch/x86_64/kernel/process.c  |    6 +++++-
 drivers/acpi/processor_idle.c |   12 ++++++++++--
 5 files changed, 34 insertions(+), 7 deletions(-)

--- linux-2.6.19.1.orig/arch/i386/kernel/apm.c
+++ linux-2.6.19.1/arch/i386/kernel/apm.c
@@ -784,7 +784,11 @@ static int apm_do_idle(void)
 	polling = !!(current_thread_info()->status & TS_POLLING);
 	if (polling) {
 		current_thread_info()->status &= ~TS_POLLING;
-		smp_mb__after_clear_bit();
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
 	}
 	if (!need_resched()) {
 		idled = 1;
--- linux-2.6.19.1.orig/arch/i386/kernel/process.c
+++ linux-2.6.19.1/arch/i386/kernel/process.c
@@ -103,7 +103,12 @@ void default_idle(void)
 
 	if (!hlt_counter && boot_cpu_data.hlt_works_ok) {
 		current_thread_info()->status &= ~TS_POLLING;
-		smp_mb__after_clear_bit();
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
+
 		while (!need_resched()) {
 			local_irq_disable();
 			if (!need_resched())
--- linux-2.6.19.1.orig/arch/ia64/kernel/process.c
+++ linux-2.6.19.1/arch/ia64/kernel/process.c
@@ -268,10 +268,16 @@ cpu_idle (void)
 
 	/* endless idle loop with no priority at all */
 	while (1) {
-		if (can_do_pal_halt)
+		if (can_do_pal_halt) {
 			current_thread_info()->status &= ~TS_POLLING;
-		else
+			/*
+			 * TS_POLLING-cleared state must be visible before we
+			 * test NEED_RESCHED:
+			 */
+			smp_mb();
+		} else {
 			current_thread_info()->status |= TS_POLLING;
+		}
 
 		if (!need_resched()) {
 			void (*idle)(void);
--- linux-2.6.19.1.orig/arch/x86_64/kernel/process.c
+++ linux-2.6.19.1/arch/x86_64/kernel/process.c
@@ -111,7 +111,11 @@ static void default_idle(void)
 	local_irq_enable();
 
 	current_thread_info()->status &= ~TS_POLLING;
-	smp_mb__after_clear_bit();
+	/*
+	 * TS_POLLING-cleared state must be visible before we
+	 * test NEED_RESCHED:
+	 */
+	smp_mb();
 	while (!need_resched()) {
 		local_irq_disable();
 		if (!need_resched())
--- linux-2.6.19.1.orig/drivers/acpi/processor_idle.c
+++ linux-2.6.19.1/drivers/acpi/processor_idle.c
@@ -211,7 +211,11 @@ acpi_processor_power_activate(struct acp
 static void acpi_safe_halt(void)
 {
 	current_thread_info()->status &= ~TS_POLLING;
-	smp_mb__after_clear_bit();
+	/*
+	 * TS_POLLING-cleared state must be visible before we
+	 * test NEED_RESCHED:
+	 */
+	smp_mb();
 	if (!need_resched())
 		safe_halt();
 	current_thread_info()->status |= TS_POLLING;
@@ -345,7 +349,11 @@ static void acpi_processor_idle(void)
 	 */
 	if (cx->type == ACPI_STATE_C2 || cx->type == ACPI_STATE_C3) {
 		current_thread_info()->status &= ~TS_POLLING;
-		smp_mb__after_clear_bit();
+		/*
+		 * TS_POLLING-cleared state must be visible before we
+		 * test NEED_RESCHED:
+		 */
+		smp_mb();
 		if (need_resched()) {
 			current_thread_info()->status |= TS_POLLING;
 			local_irq_enable();

--

  parent reply	other threads:[~2007-01-06  2:31 UTC|newest]

Thread overview: 70+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2007-01-06  2:27 [patch 00/50] -stable review Chris Wright
2007-01-06  2:27 ` [patch 01/50] dm-crypt: Select CRYPTO_CBC Chris Wright
2007-01-06  2:27 ` [patch 02/50] sha512: Fix sha384 block size Chris Wright
2007-01-06  2:27 ` [patch 03/50] read_zero_pagealigned() locking fix Chris Wright
2007-01-06  2:27 ` [patch 04/50] ieee80211softmac: Fix mutex_lock at exit of ieee80211_softmac_get_genie Chris Wright
2007-01-08 15:24   ` John W. Linville
2007-01-06  2:27 ` [patch 05/50] x86-64: Mark rdtsc as sync only for netburst, not for core2 Chris Wright
2007-01-06  2:27 ` [patch 06/50] bonding: incorrect bonding state reported via ioctl Chris Wright
2007-01-06  2:28 ` [patch 07/50] DVB: lgdt330x: fix signal / lock status detection bug Chris Wright
2007-01-06  2:28 ` [patch 08/50] V4L: Fix broken TUNER_LG_NTSC_TAPE radio support Chris Wright
2007-01-06  2:28 ` [patch 09/50] Revert "[PATCH] zd1211rw: Removed unneeded packed attributes" Chris Wright
2007-01-08 15:32   ` John W. Linville
2007-01-06  2:28 ` [patch 10/50] libata: handle 0xff status properly Chris Wright
2007-01-06  2:28 ` [patch 11/50] ieee1394: ohci1394: add PPC_PMAC platform code to driver probe Chris Wright
2007-01-06  2:28 ` [patch 12/50] kbuild: dont put temp files in source Chris Wright
2007-01-06  2:28 ` [patch 13/50] ARM: Add sys_*at syscalls Chris Wright
2007-01-06  2:28 ` [patch 14/50] sched: remove __cpuinitdata anotation to cpu_isolated_map Chris Wright
2007-01-06  2:28 ` [patch 15/50] IB/srp: Fix FMR mapping for 32-bit kernels and addresses above 4G Chris Wright
2007-01-06  2:28 ` [patch 16/50] SCSI: add missing cdb clearing in scsi_execute() Chris Wright
2007-01-06  2:28 ` [patch 17/50] Bluetooth: Add packet size checks for CAPI messages (CVE-2006-6106) Chris Wright
2007-01-06  2:28 ` [patch 18/50] i2c: fix broken ds1337 initialization Chris Wright
2007-01-06  2:28 ` Chris Wright [this message]
2007-01-06  2:28 ` [patch 20/50] Fix for shmem_truncate_range() BUG_ON() Chris Wright
2007-01-06  2:28 ` [patch 21/50] smc911x: fix netpoll compilation faliure Chris Wright
2007-01-06  2:28 ` [patch 22/50] fix aoe without scatter-gather [Bug 7662] Chris Wright
2007-01-06  2:28 ` [patch 23/50] UDP: Fix reversed logic in udp_get_port() Chris Wright
2007-01-06  2:28 ` [patch 24/50] cciss: fix XFER_READ/XFER_WRITE in do_cciss_request Chris Wright
2007-01-06  2:28 ` [patch 25/50] [stable] [stable patch] i386: CPU hotplug broken with 2GB VMSPLIT Chris Wright
2007-01-06  2:28 ` [patch 26/50] ramfs breaks without CONFIG_BLOCK Chris Wright
2007-01-06  2:28 ` [patch 27/50] Buglet in vmscan.c Chris Wright
2007-01-06  2:28 ` [patch 28/50] softmac: Fixed handling of deassociation from AP Chris Wright
2007-01-08 15:14   ` John W. Linville
2007-01-06  2:28 ` [patch 29/50] zd1211rw: Call ieee80211_rx in tasklet Chris Wright
2007-01-08 15:20   ` John W. Linville
2007-01-06  2:28 ` [patch 30/50] handle ext3 directory corruption better (CVE-2006-6053) Chris Wright
2007-01-06  2:28 ` [patch 31/50] corrupted cramfs filesystems cause kernel oops (CVE-2006-5823) Chris Wright
2007-01-06  2:28 ` [patch 32/50] ext2: skip pages past number of blocks in ext2_find_entry (CVE-2006-6054) Chris Wright
2007-01-06  2:28 ` [patch 33/50] PKTGEN: Fix module load/unload races Chris Wright
2007-01-06  2:28 ` [patch 34/50] SPARC64: Fix "mem=xxx" handling Chris Wright
2007-01-06  2:28 ` [patch 35/50] SPARC64: Handle ISA devices with no regs property Chris Wright
2007-01-06  2:28 ` [patch 36/50] NET: Dont export linux/random.h outside __KERNEL__ Chris Wright
2007-01-06  2:28 ` [patch 37/50] sparc32: add offset in pci_map_sg() Chris Wright
2007-01-06  2:28 ` [patch 38/50] VM: Fix nasty and subtle race in shared mmaped page writeback Chris Wright
2007-01-06  2:28 ` [patch 39/50] V4L: cx2341x: audio_properties is an u16, not u8 Chris Wright
2007-01-06  2:28 ` [patch 40/50] dvb-core: fix bug in CRC-32 checking on 64-bit systems Chris Wright
2007-01-06  2:28 ` [patch 41/50] V4L: cx88: Fix leadtek_eeprom tagging Chris Wright
2007-01-06  2:28 ` [patch 42/50] ebtables: dont compute gap before checking struct type Chris Wright
2007-01-06  2:28 ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Chris Wright
2007-01-06  2:28 ` [patch 44/50] SOUND: Sparc CS4231: Use 64 for period_bytes_min Chris Wright
2007-01-06 14:20   ` [patch 43/50] SOUND: Sparc CS4231: Fix IRQ return value and initialization Jan Engelhardt
2007-01-06  2:28 ` [patch 45/50] IPV4/IPV6: Fix inet{,6} device initialization order Chris Wright
2007-01-06  2:28 ` [patch 46/50] asix: Fix typo for AX88772 PHY Selection Chris Wright
2007-01-06  2:28 ` [patch 47/50] NetLabel: correctly fill in unused CIPSOv4 level and category mappings Chris Wright
2007-01-06  2:28 ` [patch 48/50] connector: some fixes for ia64 unaligned access errors Chris Wright
2007-01-06  2:28 ` [patch 49/50] fix OOM killing of swapoff Chris Wright
2007-01-06  2:28 ` [patch 50/50] Fix incorrect user space access locking in mincore() (CVE-2006-4814) Chris Wright
2007-01-06  2:37 ` [patch 00/50] -stable review Chris Wright
2007-02-04 18:24 ` Michael Krufky
2007-02-05 21:43   ` [stable] " Greg KH
2007-02-05 22:12     ` Chris Wright
2007-02-05 22:18       ` Greg KH
2007-02-05 22:16     ` Michael Krufky
2007-02-05 22:33       ` Greg KH
2007-02-07 21:08         ` Michael Krufky
2007-02-09 19:41           ` Michael Krufky
2007-02-09 19:51             ` Greg KH
2007-02-20 23:12               ` Greg KH
2007-02-21  2:29                 ` Michael Krufky
2007-02-21 21:01                   ` [v4l-dvb-maintainer] " Michael Krufky
2007-02-26 14:35                     ` Adrian Bunk

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20070106023229.297246000@sous-sol.org \
    --to=chrisw@sous-sol.org \
    --cc=akpm@osdl.org \
    --cc=alan@lxorguk.ukuu.org.uk \
    --cc=bunk@stusta.de \
    --cc=chrisw@kernel.org \
    --cc=chuckw@quantumlinux.com \
    --cc=davej@redhat.com \
    --cc=gregkh@suse.de \
    --cc=jmforbes@linuxtx.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@elte.hu \
    --cc=mkrufky@linuxtv.org \
    --cc=rdunlap@xenotime.net \
    --cc=reviews@ml.cw.f00f.org \
    --cc=stable@kernel.org \
    --cc=torvalds@osdl.org \
    --cc=tytso@mit.edu \
    --cc=zwane@arm.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.