From: Steven Rostedt <rostedt@goodmis.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>,
Peter Zijlstra <peterz@infradead.org>,
Thomas Gleixner <tglx@linutronix.de>
Subject: [057/126] sched_clock: Prevent 64bit inatomicity on 32bit systems
Date: Mon, 06 May 2013 23:58:09 -0400 [thread overview]
Message-ID: <20130507035850.417705164@goodmis.org> (raw)
In-Reply-To: 20130507035712.909872333@goodmis.org
[-- Attachment #1: 0057-sched_clock-Prevent-64bit-inatomicity-on-32bit-syste.patch --]
[-- Type: text/plain, Size: 5520 bytes --]
3.6.11.3 stable review patch.
If anyone has any objections, please let me know.
------------------
From: Thomas Gleixner <tglx@linutronix.de>
[ Upstream commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 ]
The sched_clock_remote() implementation has the following inatomicity
problem on 32bit systems when accessing the remote scd->clock, which
is a 64bit value.
CPU0 CPU1
sched_clock_local() sched_clock_remote(CPU0)
...
remote_clock = scd[CPU0]->clock
read_low32bit(scd[CPU0]->clock)
cmpxchg64(scd->clock,...)
read_high32bit(scd[CPU0]->clock)
While the update of scd->clock is using an atomic64 mechanism, the
readout on the remote cpu is not, which can cause completely bogus
readouts.
It is a quite rare problem, because it requires the update to hit the
narrow race window between the low/high readout and the update must go
across the 32bit boundary.
The resulting misbehaviour is, that CPU1 will see the sched_clock on
CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value
to this bogus timestamp. This stays that way due to the clamping
implementation for about 4 seconds until the synchronization with
CLOCK_MONOTONIC undoes the problem.
The issue is hard to observe, because it might only result in a less
accurate SCHED_OTHER timeslicing behaviour. To create observable
damage on realtime scheduling classes, it is necessary that the bogus
update of CPU1 sched_clock happens in the context of an realtime
thread, which then gets charged 4 seconds of RT runtime, which results
in the RT throttler mechanism to trigger and prevent scheduling of RT
tasks for a little less than 4 seconds. So this is quite unlikely as
well.
The issue was quite hard to decode as the reproduction time is between
2 days and 3 weeks and intrusive tracing makes it less likely, but the
following trace recorded with trace_clock=global, which uses
sched_clock_local(), gave the final hint:
<idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
<idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ...
irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0
<idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80
What happens is that CPU0 goes idle and invokes
sched_clock_idle_sleep_event() which invokes sched_clock_local() and
CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
sched_remote_clock(). The time jump gets propagated to CPU0 via
sched_remote_clock() and stays stale on both cores for ~4 seconds.
There are only two other possibilities, which could cause a stale
sched clock:
1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
wrong value.
2) sched_clock() which reads the TSC returns a sporadic wrong value.
#1 can be excluded because sched_clock would continue to increase for
one jiffy and then go stale.
#2 can be excluded because it would not make the clock jump
forward. It would just result in a stale sched_clock for one jiffy.
After quite some brain twisting and finding the same pattern on other
traces, sched_clock_remote() remained the only place which could cause
such a problem and as explained above it's indeed racy on 32bit
systems.
So while on 64bit systems the readout is atomic, we need to verify the
remote readout on 32bit machines. We need to protect the local->clock
readout in sched_clock_remote() on 32bit as well because an NMI could
hit between the low and the high readout, call sched_clock_local() and
modify local->clock.
Thanks to Siegfried Wulsch for bearing with my debug requests and
going through the tedious tasks of running a bunch of reproducer
systems to generate the debug information which let me decode the
issue.
Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
Acked-by: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Signed-off-by: Steven Rostedt <rostedt@goodmis.org>
---
kernel/sched/clock.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index c685e31..c3ae144 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -176,10 +176,36 @@ static u64 sched_clock_remote(struct sched_clock_data *scd)
u64 this_clock, remote_clock;
u64 *ptr, old_val, val;
+#if BITS_PER_LONG != 64
+again:
+ /*
+ * Careful here: The local and the remote clock values need to
+ * be read out atomic as we need to compare the values and
+ * then update either the local or the remote side. So the
+ * cmpxchg64 below only protects one readout.
+ *
+ * We must reread via sched_clock_local() in the retry case on
+ * 32bit as an NMI could use sched_clock_local() via the
+ * tracer and hit between the readout of
+ * the low32bit and the high 32bit portion.
+ */
+ this_clock = sched_clock_local(my_scd);
+ /*
+ * We must enforce atomic readout on 32bit, otherwise the
+ * update on the remote cpu can hit inbetween the readout of
+ * the low32bit and the high 32bit portion.
+ */
+ remote_clock = cmpxchg64(&scd->clock, 0, 0);
+#else
+ /*
+ * On 64bit the read of [my]scd->clock is atomic versus the
+ * update, so we can avoid the above 32bit dance.
+ */
sched_clock_local(my_scd);
again:
this_clock = my_scd->clock;
remote_clock = scd->clock;
+#endif
/*
* Use the opportunity that we have both locks
--
1.7.10.4
next prev parent reply other threads:[~2013-05-07 3:58 UTC|newest]
Thread overview: 132+ messages / expand[flat|nested] mbox.gz Atom feed top
2013-05-07 3:57 [000/126] 3.6.11.3-stable review Steven Rostedt
2013-05-07 3:57 ` [001/126] ASoC: imx-ssi: Fix occasional AC97 reset failure Steven Rostedt
2013-05-07 3:57 ` [002/126] ASoC: dma-sh7760: Fix compile error Steven Rostedt
2013-05-07 3:57 ` [003/126] ASoC: spear_pcm: Update to new pcm_new() API Steven Rostedt
2013-05-07 3:57 ` [004/126] regmap: cache Fix regcache-rbtree sync Steven Rostedt
2013-05-07 3:57 ` [005/126] spi/s3c64xx: modified error interrupt handling and init Steven Rostedt
2013-05-07 3:57 ` [006/126] spi/mpc512x-psc: optionally keep PSC SS asserted across xfer segmensts Steven Rostedt
2013-05-07 3:57 ` [007/126] UBIFS: make space fixup work in the remount case Steven Rostedt
2013-05-07 3:57 ` [008/126] ALSA: hda - Enabling Realtek ALC 671 codec Steven Rostedt
2013-05-07 3:57 ` [009/126] ALSA: hda - fix typo in proc output Steven Rostedt
2013-05-07 3:57 ` [010/126] mm: prevent mmap_cache race in find_vma() Steven Rostedt
2013-05-07 3:57 ` [011/126] EISA/PCI: Init EISA early, before PNP Steven Rostedt
2013-05-07 3:57 ` [012/126] EISA/PCI: Fix bus res reference Steven Rostedt
2013-05-07 3:57 ` [013/126] libata: Use integer return value for atapi_command_packet_set Steven Rostedt
2013-05-07 3:57 ` [014/126] libata: Set max sector to 65535 for Slimtype DVD A DS8A8SH drive Steven Rostedt
2013-05-07 3:57 ` [015/126] alpha: Add irongate_io to PCI bus resources Steven Rostedt
2013-05-07 3:57 ` [016/126] PCI/PM: Disable runtime PM of PCIe ports Steven Rostedt
2013-05-07 3:57 ` [017/126] ata_piix: Fix DVD not dectected at some Haswell platforms Steven Rostedt
2013-05-07 3:57 ` [018/126] ftrace: Consistently restore trace function on sysctl enabling Steven Rostedt
2013-05-07 3:57 ` [019/126] powerpc: pSeries_lpar_hpte_remove fails from Adjunct partition being performed before the ANDCOND test Steven Rostedt
2013-05-07 3:57 ` [020/126] mwifiex: limit channel number not to overflow memory Steven Rostedt
2013-05-07 3:57 ` [021/126] mac80211: fix remain-on-channel cancel crash Steven Rostedt
2013-05-07 3:57 ` [022/126] x86: remove the x32 syscall bitmask from syscall_get_nr() Steven Rostedt
2013-05-07 3:57 ` [023/126] hwspinlock: fix __hwspin_lock_request error path Steven Rostedt
2013-05-07 3:57 ` [024/126] remoteproc: fix error path of handle_vdev Steven Rostedt
2013-05-07 3:57 ` [025/126] remoteproc: fix FW_CONFIG typo Steven Rostedt
2013-05-07 3:57 ` [026/126] spinlocks and preemption points need to be at least compiler barriers Steven Rostedt
2013-05-07 3:57 ` [027/126] crypto: ux500 - add missing comma Steven Rostedt
2013-05-07 3:57 ` [028/126] crypto: gcm - fix assumption that assoc has one segment Steven Rostedt
2013-05-07 3:57 ` [029/126] drm/mgag200: Index 24 in extended CRTC registers is 24 in hex, not decimal Steven Rostedt
2013-05-07 3:57 ` [030/126] block: avoid using uninitialized value in from queue_var_store Steven Rostedt
2013-05-07 3:57 ` [031/126] x86: Fix rebuild with EFI_STUB enabled Steven Rostedt
2013-05-07 3:57 ` [032/126] thermal: return an error on failure to register thermal class Steven Rostedt
2013-05-07 3:57 ` [033/126] msi-wmi: Fix memory leak Steven Rostedt
2013-05-07 3:57 ` [034/126] cpufreq: exynos: Get booting freq value in exynos_cpufreq_init Steven Rostedt
2013-05-07 3:57 ` [035/126] drm/i915: add quirk to invert brightness on eMachines G725 Steven Rostedt
2013-05-07 3:57 ` [036/126] drm/i915: add quirk to invert brightness on eMachines e725 Steven Rostedt
2013-05-07 3:57 ` [037/126] drm/i915: add quirk to invert brightness on Packard Bell NCL20 Steven Rostedt
2013-05-07 3:57 ` [038/126] r8169: fix auto speed down issue Steven Rostedt
2013-05-07 3:57 ` [039/126] vfio-pci: Fix possible integer overflow Steven Rostedt
2013-05-07 3:57 ` [040/126] can: gw: use kmem_cache_free() instead of kfree() Steven Rostedt
2013-05-07 3:57 ` [041/126] rt2x00: rt2x00pci_regbusy_read() - only print register access failure once Steven Rostedt
2013-05-07 3:57 ` [042/126] ALSA: usb-audio: fix endianness bug in snd_nativeinstruments_* Steven Rostedt
2013-05-07 3:57 ` [043/126] ASoC: wm8903: Fix the bypass to HP/LINEOUT when no DAC or ADC is running Steven Rostedt
2013-05-07 3:57 ` [044/126] tracing: Fix double free when function profile init failed Steven Rostedt
2013-05-07 3:57 ` [045/126] PM / reboot: call syscore_shutdown() after disable_nonboot_cpus() Steven Rostedt
2013-05-07 3:57 ` [046/126] GFS2: Fix unlock of fcntl locks during withdrawn state Steven Rostedt
2013-05-07 3:57 ` [047/126] libsas: fix handling vacant phy in sas_set_ex_phy() Steven Rostedt
2013-05-07 3:58 ` [048/126] cifs: Allow passwords which begin with a delimitor Steven Rostedt
2013-05-07 3:58 ` [049/126] target: Fix incorrect fallthrough of ALUA Standby/Offline/Transition CDBs Steven Rostedt
2013-05-07 3:58 ` [050/126] vfs: Revert spurious fix to spinning prevention in prune_icache_sb Steven Rostedt
2013-05-07 3:58 ` [051/126] kref: Implement kref_get_unless_zero v3 Steven Rostedt
2013-05-07 3:58 ` [052/126] kobject: fix kset_find_obj() race with concurrent last kobject_put() Steven Rostedt
2013-05-07 3:58 ` [053/126] x86-32: Fix possible incomplete TLB invalidate with PAE pagetables Steven Rostedt
2013-05-07 3:58 ` [054/126] tracing: Fix possible NULL pointer dereferences Steven Rostedt
2013-05-07 3:58 ` [055/126] udl: handle EDID failure properly Steven Rostedt
2013-05-07 3:58 ` [056/126] ftrace: Move ftrace_filter_lseek out of CONFIG_DYNAMIC_FTRACE section Steven Rostedt
2013-05-07 3:58 ` Steven Rostedt [this message]
2013-05-07 3:58 ` [058/126] x86, mm, paravirt: Fix vmalloc_fault oops during lazy MMU updates Steven Rostedt
2013-05-07 3:58 ` [059/126] x86, mm: Patch out arch_flush_lazy_mmu_mode() when running on bare metal Steven Rostedt
2013-05-07 3:58 ` [060/126] ARM: Do 15e0d9e37c (ARM: pm: let platforms select cpu_suspend support) properly Steven Rostedt
2013-05-07 3:58 ` [061/126] hrtimer: Dont reinitialize a cpu_base lock on CPU_UP Steven Rostedt
2013-05-07 3:58 ` [062/126] can: mcp251x: add missing IRQF_ONESHOT to request_threaded_irq Steven Rostedt
2013-05-07 3:58 ` [063/126] can: sja1000: fix handling on dt properties on little endian systems Steven Rostedt
2013-05-07 3:58 ` [064/126] hugetlbfs: add swap entry check in follow_hugetlb_page() Steven Rostedt
2013-05-07 3:58 ` [065/126] kernel/signal.c: stop info leak via the tkill and the tgkill syscalls Steven Rostedt
2013-05-07 3:58 ` [066/126] hfsplus: fix potential overflow in hfsplus_file_truncate() Steven Rostedt
2013-05-07 3:58 ` [067/126] KVM: x86: fix for buffer overflow in handling of MSR_KVM_SYSTEM_TIME (CVE-2013-1796) Steven Rostedt
2013-05-07 3:58 ` [068/126] KVM: x86: Convert MSR_KVM_SYSTEM_TIME to use gfn_to_hva_cache functions (CVE-2013-1797) Steven Rostedt
2013-05-07 3:58 ` [069/126] KVM: Fix bounds checking in ioapic indirect register reads (CVE-2013-1798) Steven Rostedt
2013-05-07 3:58 ` [070/126] KVM: Allow cross page reads and writes from cached translations Steven Rostedt
2013-05-07 3:58 ` [071/126] ARM: i.MX35: enable MAX clock Steven Rostedt
2013-05-07 3:58 ` [072/126] ARM: clk-imx35: Bugfix iomux clock Steven Rostedt
2013-05-07 3:58 ` [073/126] tg3: Add 57766 device support Steven Rostedt
2013-05-07 3:58 ` [074/126] sched: Convert BUG_ON()s in try_to_wake_up_local() to WARN_ON_ONCE()s Steven Rostedt
2013-05-07 3:58 ` [075/126] ARM: 7696/1: Fix kexec by setting outer_cache.inv_all for Feroceon Steven Rostedt
2013-05-07 3:58 ` [076/126] ARM: 7698/1: perf: fix group validation when using enable_on_exec Steven Rostedt
2013-05-07 3:58 ` [077/126] ath9k_htc: accept 1.x firmware newer than 1.3 Steven Rostedt
2013-05-07 3:58 ` [078/126] ath9k_hw: change AR9580 initvals to fix a stability issue Steven Rostedt
2013-05-07 3:58 ` [079/126] mac80211: fix cfg80211 interaction on auth/assoc request Steven Rostedt
2013-05-07 3:58 ` [080/126] ssb: implement spurious tone avoidance Steven Rostedt
2013-05-07 3:58 ` [081/126] crypto: algif - suppress sending source address information in recvmsg Steven Rostedt
2013-05-07 3:58 ` [082/126] perf: Treat attr.config as u64 in perf_swevent_init() Steven Rostedt
2013-05-07 3:58 ` [083/126] perf/x86: Fix offcore_rsp valid mask for SNB/IVB Steven Rostedt
2013-05-07 3:58 ` [084/126] vm: add vm_iomap_memory() helper function Steven Rostedt
2013-05-07 3:58 ` [085/126] vm: convert snd_pcm_lib_mmap_iomem() to vm_iomap_memory() helper Steven Rostedt
2013-05-07 3:58 ` [086/126] vm: convert fb_mmap " Steven Rostedt
2013-05-07 3:58 ` [087/126] vm: convert HPET mmap " Steven Rostedt
2013-05-07 3:58 ` [088/126] mtd: Disable mtdchar mmap on MMU systems Steven Rostedt
2013-05-07 3:58 ` [089/126] vm: convert mtdchar mmap to vm_iomap_memory() helper Steven Rostedt
2013-05-07 3:58 ` [090/126] Btrfs: make sure nbytes are right after log replay Steven Rostedt
2013-05-07 3:58 ` [091/126] Add file_ns_capable() helper function for open-time capability checking Steven Rostedt
2013-05-07 3:58 ` [092/126] aio: fix possible invalid memory access when DEBUG is enabled Steven Rostedt
2013-05-07 3:58 ` [093/126] TTY: do not update atime/mtime on read/write Steven Rostedt
2013-05-07 3:58 ` [094/126] TTY: fix atime/mtime regression Steven Rostedt
2013-05-07 7:03 ` Jiri Slaby
2013-05-07 12:30 ` Steven Rostedt
2013-05-07 3:58 ` [095/126] atm: update msg_namelen in vcc_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [096/126] ax25: fix info leak via msg_name in ax25_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [097/126] Bluetooth: fix possible info leak in bt_sock_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [098/126] Bluetooth: RFCOMM - Fix missing msg_namelen update in rfcomm_sock_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [099/126] caif: Fix missing msg_namelen update in caif_seqpkt_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [100/126] irda: Fix missing msg_namelen update in irda_recvmsg_dgram() Steven Rostedt
2013-05-07 3:58 ` [101/126] iucv: Fix missing msg_namelen update in iucv_sock_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [102/126] l2tp: fix info leak in l2tp_ip6_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [103/126] llc: Fix missing msg_namelen update in llc_ui_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [104/126] netrom: fix info leak via msg_name in nr_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [105/126] netrom: fix invalid use of sizeof " Steven Rostedt
2013-05-07 3:58 ` [106/126] NFC: llcp: fix info leaks via msg_name in llcp_sock_recvmsg() Steven Rostedt
2013-05-07 3:58 ` [107/126] rose: fix info leak via msg_name in rose_recvmsg() Steven Rostedt
2013-05-07 3:59 ` [108/126] tipc: fix info leaks via msg_name in recv_msg/recv_stream Steven Rostedt
2013-05-07 3:59 ` [109/126] cbq: incorrect processing of high limits Steven Rostedt
2013-05-07 3:59 ` [110/126] net IPv6 : Fix broken IPv6 routing table after loopback down-up Steven Rostedt
2013-05-07 3:59 ` [111/126] net: count hw_addr syncs so that unsync works properly Steven Rostedt
2013-05-07 3:59 ` [112/126] atl1e: limit gso segment size to prevent generation of wrong ip length fields Steven Rostedt
2013-05-07 3:59 ` [113/126] bonding: fix bonding_masters race condition in bond unloading Steven Rostedt
2013-05-07 3:59 ` [114/126] bonding: IFF_BONDING is not stripped on enslave failure Steven Rostedt
2013-05-07 3:59 ` [115/126] af_unix: If we dont care about credentials coallesce all messages Steven Rostedt
2013-05-07 3:59 ` [116/126] ipv6/tcp: Stop processing ICMPv6 redirect messages Steven Rostedt
2013-05-07 3:59 ` [117/126] rtnetlink: Call nlmsg_parse() with correct header length Steven Rostedt
2013-05-07 3:59 ` [118/126] tcp: incoming connections might use wrong route under synflood Steven Rostedt
2013-05-07 3:59 ` [119/126] tcp: Reallocate headroom if it would overflow csum_start Steven Rostedt
2013-05-07 3:59 ` [120/126] esp4: fix error return code in esp_output() Steven Rostedt
2013-05-07 3:59 ` [121/126] tcp: call tcp_replace_ts_recent() from tcp_ack() Steven Rostedt
2013-05-07 3:59 ` [122/126] net: rate-limit warn-bad-offload splats Steven Rostedt
2013-05-07 3:59 ` [123/126] net: fix incorrect credentials passing Steven Rostedt
2013-05-07 3:59 ` [124/126] net: drop dst before queueing fragments Steven Rostedt
2013-05-07 3:59 ` [125/126] ARM: 7699/1: sched_clock: Add more notrace to prevent recursion Steven Rostedt
2013-05-07 3:59 ` [126/126] ARM: 7692/1: iop3xx: move IOP3XX_PERIPHERAL_VIRT_BASE Steven Rostedt
2013-05-07 4:05 ` [000/126] 3.6.11.3-stable review Steven Rostedt
2013-05-07 20:17 ` Aaro Koskinen
2013-05-07 20:28 ` Steven Rostedt
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20130507035850.417705164@goodmis.org \
--to=rostedt@goodmis.org \
--cc=Siegfried.Wulsch@rovema.de \
--cc=linux-kernel@vger.kernel.org \
--cc=peterz@infradead.org \
--cc=stable@vger.kernel.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox