stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Nicholas Piggin <npiggin@gmail.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Jonathan Cameron <Jonathan.Cameron@huawei.com>,
	David Miller <davem@davemloft.net>,
	dzickus@redhat.com, sfr@canb.auug.org.au, mpe@ellerman.id.au,
	Stephen Boyd <sboyd@codeaurora.org>,
	linuxarm@huawei.com, abdhalee@linux.vnet.ibm.com,
	John Stultz <john.stultz@linaro.org>,
	akpm@linux-foundation.org, paulmck@linux.vnet.ibm.com,
	torvalds@linux-foundation.org
Subject: [PATCH 4.9 59/84] timers: Fix excessive granularity of new timers after a nohz idle
Date: Mon, 28 Aug 2017 10:05:24 +0200	[thread overview]
Message-ID: <20170828080531.857567847@linuxfoundation.org> (raw)
In-Reply-To: <20170828080529.526391781@linuxfoundation.org>

4.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Nicholas Piggin <npiggin@gmail.com>

commit 2fe59f507a65dbd734b990a11ebc7488f6f87a24 upstream.

When a timer base is idle, it is forwarded when a new timer is added
to ensure that granularity does not become excessive. When not idle,
the timer tick is expected to increment the base.

However there are several problems:

- If an existing timer is modified, the base is forwarded only after
  the index is calculated.

- The base is not forwarded by add_timer_on.

- There is a window after a timer is restarted from a nohz idle, after
  it is marked not-idle and before the timer tick on this CPU, where a
  timer may be added but the ancient base does not get forwarded.

These result in excessive granularity (a 1 jiffy timeout can blow out
to 100s of jiffies), which cause the rcu lockup detector to trigger,
among other things.

Fix this by keeping track of whether the timer base has been idle
since it was last run or forwarded, and if so then forward it before
adding a new timer.

There is still a case where mod_timer optimises the case of a pending
timer mod with the same expiry time, where the timer can see excessive
granularity relative to the new, shorter interval. A comment is added,
but it's not changed because it is an important fastpath for
networking.

This has been tested and found to fix the RCU softlockup messages.

Testing was also done with tracing to measure requested versus
achieved wakeup latencies for all non-deferrable timers in an idle
system (with no lockup watchdogs running). Wakeup latency relative to
absolute latency is calculated (note this suffers from round-up skew
at low absolute times) and analysed:

             max     avg      std
upstream   506.0    1.20     4.68
patched      2.0    1.08     0.15

The bug was noticed due to the lockup detector Kconfig changes
dropping it out of people's .configs and resulting in larger base
clk skew When the lockup detectors are enabled, no CPU can go idle for
longer than 4 seconds, which limits the granularity errors.
Sub-optimal timer behaviour is observable on a smaller scale in that
case:

	     max     avg      std
upstream     9.0    1.05     0.19
patched      2.0    1.04     0.11

Fixes: Fixes: a683f390b93f ("timers: Forward the wheel clock whenever possible")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Tested-by: David Miller <davem@davemloft.net>
Cc: dzickus@redhat.com
Cc: sfr@canb.auug.org.au
Cc: mpe@ellerman.id.au
Cc: Stephen Boyd <sboyd@codeaurora.org>
Cc: linuxarm@huawei.com
Cc: abdhalee@linux.vnet.ibm.com
Cc: John Stultz <john.stultz@linaro.org>
Cc: akpm@linux-foundation.org
Cc: paulmck@linux.vnet.ibm.com
Cc: torvalds@linux-foundation.org
Link: http://lkml.kernel.org/r/20170822084348.21436-1-npiggin@gmail.com
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/time/timer.c |   50 +++++++++++++++++++++++++++++++++++++++++---------
 1 file changed, 41 insertions(+), 9 deletions(-)

--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -201,6 +201,7 @@ struct timer_base {
 	bool			migration_enabled;
 	bool			nohz_active;
 	bool			is_idle;
+	bool			must_forward_clk;
 	DECLARE_BITMAP(pending_map, WHEEL_SIZE);
 	struct hlist_head	vectors[WHEEL_SIZE];
 } ____cacheline_aligned;
@@ -891,13 +892,19 @@ get_target_base(struct timer_base *base,
 
 static inline void forward_timer_base(struct timer_base *base)
 {
-	unsigned long jnow = READ_ONCE(jiffies);
+	unsigned long jnow;
 
 	/*
-	 * We only forward the base when it's idle and we have a delta between
-	 * base clock and jiffies.
+	 * We only forward the base when we are idle or have just come out of
+	 * idle (must_forward_clk logic), and have a delta between base clock
+	 * and jiffies. In the common case, run_timers will take care of it.
 	 */
-	if (!base->is_idle || (long) (jnow - base->clk) < 2)
+	if (likely(!base->must_forward_clk))
+		return;
+
+	jnow = READ_ONCE(jiffies);
+	base->must_forward_clk = base->is_idle;
+	if ((long)(jnow - base->clk) < 2)
 		return;
 
 	/*
@@ -973,6 +980,11 @@ __mod_timer(struct timer_list *timer, un
 	 * same array bucket then just return:
 	 */
 	if (timer_pending(timer)) {
+		/*
+		 * The downside of this optimization is that it can result in
+		 * larger granularity than you would get from adding a new
+		 * timer with this expiry.
+		 */
 		if (timer->expires == expires)
 			return 1;
 
@@ -983,6 +995,7 @@ __mod_timer(struct timer_list *timer, un
 		 * dequeue/enqueue dance.
 		 */
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 
 		clk = base->clk;
 		idx = calc_wheel_index(expires, clk);
@@ -999,6 +1012,7 @@ __mod_timer(struct timer_list *timer, un
 		}
 	} else {
 		base = lock_timer_base(timer, &flags);
+		forward_timer_base(base);
 	}
 
 	timer_stats_timer_set_start_info(timer);
@@ -1028,12 +1042,10 @@ __mod_timer(struct timer_list *timer, un
 			spin_lock(&base->lock);
 			WRITE_ONCE(timer->flags,
 				   (timer->flags & ~TIMER_BASEMASK) | base->cpu);
+			forward_timer_base(base);
 		}
 	}
 
-	/* Try to forward a stale timer base clock */
-	forward_timer_base(base);
-
 	timer->expires = expires;
 	/*
 	 * If 'idx' was calculated above and the base time did not advance
@@ -1150,6 +1162,7 @@ void add_timer_on(struct timer_list *tim
 		WRITE_ONCE(timer->flags,
 			   (timer->flags & ~TIMER_BASEMASK) | cpu);
 	}
+	forward_timer_base(base);
 
 	debug_activate(timer, timer->expires);
 	internal_add_timer(base, timer);
@@ -1538,10 +1551,16 @@ u64 get_next_timer_interrupt(unsigned lo
 		if (!is_max_delta)
 			expires = basem + (u64)(nextevt - basej) * TICK_NSEC;
 		/*
-		 * If we expect to sleep more than a tick, mark the base idle:
+		 * If we expect to sleep more than a tick, mark the base idle.
+		 * Also the tick is stopped so any added timer must forward
+		 * the base clk itself to keep granularity small. This idle
+		 * logic is only maintained for the BASE_STD base, deferrable
+		 * timers may still see large granularity skew (by design).
 		 */
-		if ((expires - basem) > TICK_NSEC)
+		if ((expires - basem) > TICK_NSEC) {
+			base->must_forward_clk = true;
 			base->is_idle = true;
+		}
 	}
 	spin_unlock(&base->lock);
 
@@ -1651,6 +1670,19 @@ static __latent_entropy void run_timer_s
 {
 	struct timer_base *base = this_cpu_ptr(&timer_bases[BASE_STD]);
 
+	/*
+	 * must_forward_clk must be cleared before running timers so that any
+	 * timer functions that call mod_timer will not try to forward the
+	 * base. idle trcking / clock forwarding logic is only used with
+	 * BASE_STD timers.
+	 *
+	 * The deferrable base does not do idle tracking at all, so we do
+	 * not forward it. This can result in very large variations in
+	 * granularity for deferrable timers, but they can be deferred for
+	 * long periods due to idle.
+	 */
+	base->must_forward_clk = false;
+
 	__run_timers(base);
 	if (IS_ENABLED(CONFIG_NO_HZ_COMMON) && base->nohz_active)
 		__run_timers(this_cpu_ptr(&timer_bases[BASE_DEF]));

  parent reply	other threads:[~2017-08-28  8:13 UTC|newest]

Thread overview: 88+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-08-28  8:04 [PATCH 4.9 00/84] 4.9.46-stable review Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 01/84] sparc64: remove unnecessary log message Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 02/84] af_key: do not use GFP_KERNEL in atomic contexts Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 03/84] dccp: purge write queue in dccp_destroy_sock() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 04/84] dccp: defer ccid_hc_tx_delete() at dismantle time Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 05/84] ipv4: fix NULL dereference in free_fib_info_rcu() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 06/84] net_sched/sfq: update hierarchical backlog when drop packet Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 07/84] net_sched: remove warning from qdisc_hash_add Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 08/84] bpf: fix bpf_trace_printk on 32 bit archs Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 09/84] openvswitch: fix skb_panic due to the incorrect actions attrlen Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 10/84] ptr_ring: use kmalloc_array() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 11/84] ipv4: better IP_MAX_MTU enforcement Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 12/84] nfp: fix infinite loop on umapping cleanup Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 13/84] sctp: fully initialize the IPv6 address in sctp_v6_to_addr() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 14/84] tipc: fix use-after-free Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 15/84] ipv6: reset fn->rr_ptr when replacing route Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 16/84] ipv6: repair fib6 tree in failure case Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 17/84] tcp: when rearming RTO, if RTO time is in past then fire RTO ASAP Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 18/84] net/mlx4_core: Enable 4K UAR if SRIOV module parameter is not enabled Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 19/84] irda: do not leak initialized list.dev to userspace Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 20/84] net: sched: fix NULL pointer dereference when action calls some targets Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 21/84] net_sched: fix order of queue length updates in qdisc_replace() Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 22/84] bpf, verifier: add additional patterns to evaluate_reg_imm_alu Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 23/84] bpf: adjust verifier heuristics Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 24/84] bpf, verifier: fix alu ops against map_value{, _adj} register types Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 25/84] bpf: fix mixed signed/unsigned derived min/max value bounds Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 26/84] bpf/verifier: fix min/max handling in BPF_SUB Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 27/84] Input: trackpoint - add new trackpoint firmware ID Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 28/84] Input: elan_i2c - add ELAN0602 ACPI ID to support Lenovo Yoga310 Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 29/84] Input: ALPS - fix two-finger scroll breakage in right side on ALPS touchpad Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 30/84] KVM: s390: sthyi: fix sthyi inline assembly Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 31/84] KVM: s390: sthyi: fix specification exception detection Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 32/84] KVM: x86: block guest protection keys unless the host has them enabled Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 33/84] ALSA: usb-audio: Add delay quirk for H650e/Jabra 550a USB headsets Greg Kroah-Hartman
2017-08-28  8:04 ` [PATCH 4.9 34/84] ALSA: core: Fix unexpected error at replacing user TLV Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 35/84] ALSA: hda - Add stereo mic quirk for Lenovo G50-70 (17aa:3978) Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 36/84] ALSA: firewire: fix NULL pointer dereference when releasing uninitialized data of iso-resource Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 37/84] ARCv2: PAE40: Explicitly set MSB counterpart of SLC region ops addresses Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 38/84] mm, shmem: fix handling /sys/kernel/mm/transparent_hugepage/shmem_enabled Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 39/84] i2c: designware: Fix system suspend Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 40/84] mm/madvise.c: fix freeing of locked page with MADV_FREE Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 41/84] fork: fix incorrect fput of ->exe_file causing use-after-free Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 42/84] mm/memblock.c: reversed logic in memblock_discard() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 44/84] drm/atomic: If the atomic check fails, return its value first Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 45/84] drm: rcar-du: Fix crash in encoder failure error path Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 46/84] drm: rcar-du: Fix display timing controller parameter Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 47/84] drm: rcar-du: Fix H/V sync signal polarity configuration Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 48/84] tracing: Call clear_boot_tracer() at lateinit_sync Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 49/84] tracing: Fix kmemleak in tracing_map_array_free() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 50/84] tracing: Fix freeing of filter in create_filter() when set_str is false Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 51/84] kbuild: linker script do not match C names unless LD_DEAD_CODE_DATA_ELIMINATION is configured Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 52/84] cifs: Fix df output for users with quota limits Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 53/84] cifs: return ENAMETOOLONG for overlong names in cifs_open()/cifs_lookup() Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 54/84] nfsd: Limit end of page list when decoding NFSv4 WRITE Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 55/84] ftrace: Check for null ret_stack on profile function graph entry function Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 56/84] perf/core: Fix group {cpu,task} validation Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 57/84] perf probe: Fix --funcs to show correct symbols for offline module Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 58/84] perf/x86/intel/rapl: Make package handling more robust Greg Kroah-Hartman
2017-08-28  8:05 ` Greg Kroah-Hartman [this message]
2017-08-28  8:05 ` [PATCH 4.9 60/84] x86/mm: Fix use-after-free of ldt_struct Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 61/84] net: sunrpc: svcsock: fix NULL-pointer exception Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 62/84] Revert "leds: handle suspend/resume in heartbeat trigger" Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 63/84] netfilter: nat: fix src map lookup Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 64/84] Bluetooth: hidp: fix possible might sleep error in hidp_session_thread Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 65/84] Bluetooth: cmtp: fix possible might sleep error in cmtp_session Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 66/84] Bluetooth: bnep: fix possible might sleep error in bnep_session Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 67/84] Revert "android: binder: Sanity check at binder ioctl" Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 68/84] binder: use group leader instead of open thread Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 69/84] binder: Use wake up hint for synchronous transactions Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 70/84] ANDROID: binder: fix proc->tsk check Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 71/84] iio: imu: adis16480: Fix acceleration scale factor for adis16480 Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 72/84] iio: hid-sensor-trigger: Fix the race with user space powering up sensors Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 73/84] staging: rtl8188eu: add RNX-N150NUB support Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 74/84] Clarify (and fix) MAX_LFS_FILESIZE macros Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 75/84] ntb_transport: fix qp count bug Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 76/84] ntb_transport: fix bug calculating num_qps_mw Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 77/84] NTB: ntb_test: fix bug printing ntb_perf results Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 78/84] ntb: no sleep in ntb_async_tx_submit Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 79/84] ntb: ntb_test: ensure the link is up before trying to configure the mws Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 80/84] ntb: transport shouldnt disable link due to bogus values in SPADs Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 81/84] ACPI: ioapic: Clear on-stack resource before using it Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 82/84] ACPI / APEI: Add missing synchronize_rcu() on NOTIFY_SCI removal Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 83/84] ACPI: EC: Fix regression related to wrong ECDT initialization order Greg Kroah-Hartman
2017-08-28  8:05 ` [PATCH 4.9 84/84] powerpc/mm: Ensure cpumask update is ordered Greg Kroah-Hartman
2017-08-28 19:39 ` [PATCH 4.9 00/84] 4.9.46-stable review Shuah Khan
2017-08-29  0:10 ` Guenter Roeck
2017-08-29 12:02 ` Sumit Semwal
2017-08-29 15:17   ` Greg Kroah-Hartman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20170828080531.857567847@linuxfoundation.org \
    --to=gregkh@linuxfoundation.org \
    --cc=Jonathan.Cameron@huawei.com \
    --cc=abdhalee@linux.vnet.ibm.com \
    --cc=akpm@linux-foundation.org \
    --cc=davem@davemloft.net \
    --cc=dzickus@redhat.com \
    --cc=john.stultz@linaro.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linuxarm@huawei.com \
    --cc=mpe@ellerman.id.au \
    --cc=npiggin@gmail.com \
    --cc=paulmck@linux.vnet.ibm.com \
    --cc=sboyd@codeaurora.org \
    --cc=sfr@canb.auug.org.au \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).