From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
To: linux-kernel@vger.kernel.org
Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	stable@vger.kernel.org, Frederic Weisbecker <fweisbec@gmail.com>,
	Stanislaw Gruszka <sgruszka@redhat.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>
Subject: [ 019/102] sched: Lower chances of cputime scaling overflow
Date: Fri, 17 May 2013 14:35:34 -0700
Message-ID: <20130517213246.400649292@linuxfoundation.org>
In-Reply-To: <20130517213244.277411019@linuxfoundation.org>

3.9-stable review patch.  If anyone has any objections, please let me know.

------------------

From: Frederic Weisbecker <fweisbec@gmail.com>

commit d9a3c9823a2e6a543eb7807fb3d15d8233817ec5 upstream.

Some users have reported that after running a process with
hundreds of threads on intensive CPU-bound loads, the cputime
of the group started to freeze after a few days.

This is due to how we scale the tick-based cputime against
the scheduler precise execution time value.

We add up the tick-based cputime of all threads in the group and
multiply it by the sum of the scheduler exec runtime of the whole
group.

This easily overflows after a few days/weeks of execution.
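
To illustrate the overflow described above, here is a minimal userspace
sketch. It assumes cputime is counted in ticks at a hypothetical HZ of
1000, and the helper names (mul_u64_overflows, group_ticks) are made up
for this example; they are not kernel functions.

```c
#include <stdint.h>

/* Returns 1 if a * b does not fit in 64 bits, 0 otherwise. */
static int mul_u64_overflows(uint64_t a, uint64_t b)
{
	return a != 0 && b > UINT64_MAX / a;
}

/*
 * Hypothetical helper: ticks accumulated by nthreads fully CPU-bound
 * threads running for the given number of seconds at hz ticks/second.
 */
static uint64_t group_ticks(uint64_t nthreads, uint64_t seconds, uint64_t hz)
{
	return nthreads * seconds * hz;
}
```

Under these assumptions, 200 threads running for 3 days accumulate
group_ticks(200, 3 * 86400, 1000) = 51840000000 ticks. If half of that
is system time, multiplying it by the total runtime already exceeds
64 bits, since any two factors of 2^32 or more overflow.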

A proposed solution was to compute that multiplication
on stime instead of utime:
   62188451f0d63add7ad0cd2a1ae269d600c1663d
   ("cputime: Avoid multiplication overflow on utime scaling")

The rationale was that it's easy for a thread to spend most of its
time in userspace under an intensive CPU-bound workload, but much
harder to sustain a long, intensive CPU-bound run in the kernel.

This assumption was defeated when a user recently reported still
seeing cputime freezes after the above patch. The issue is triggered
by intensive networking workloads, where most of the cputime is
consumed in the kernel.

To further reduce the opportunities for multiplication overflow,
let's reduce the multiplication factors to the quotient and remainder
of the division between sched exec runtime and cputime. Assuming the
difference between these two is never very large, this should work in
many situations.
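
The reduction described above can be sketched in userspace C as
follows. This is only an illustration under stated assumptions:
scale_stime_sketch() and div_u64_rem_sketch() are hypothetical names,
the latter standing in for the kernel's div64_u64_rem() helper, and
plain uint64_t replaces cputime_t. Like the kernel code, it assumes
rtime and total are both non-zero.

```c
#include <stdint.h>

/* Userspace stand-in for div64_u64_rem(): quotient, remainder via *rem. */
static uint64_t div_u64_rem_sketch(uint64_t a, uint64_t b, uint64_t *rem)
{
	*rem = a % b;
	return a / b;
}

/* Approximates (stime * rtime) / total using smaller factors. */
static uint64_t scale_stime_sketch(uint64_t stime, uint64_t rtime,
				   uint64_t total)
{
	uint64_t rem, res, scaled;

	if (rtime >= total) {
		/* rtime = res * total + rem, so scale up by res ... */
		res = div_u64_rem_sketch(rtime, total, &rem);
		scaled = stime * res;
		/* ... then add the share contributed by the remainder. */
		scaled += (stime * rem) / total;
	} else {
		/* total = res * rtime + rem, so scale down by res ... */
		res = div_u64_rem_sketch(total, rtime, &rem);
		scaled = stime / res;
		/* ... then subtract the share owed to the remainder. */
		scaled -= (scaled * rem) / total;
	}

	return scaled;
}
```

With stime = 30000000000, rtime = 60000000005 and total = 60000000000,
the direct product stime * rtime does not fit in 64 bits, while the
sketch only multiplies stime by the quotient 1 and the remainder 5.
The intermediate products can still overflow when the remainder is
large, which is why this only lowers the chances of overflow.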

This gets the same results as the upstream scaling code except for
a small difference: the upstream code always rounds the result down to
the nearest integer not greater than the precise result.
The new code may round to the nearest integer either above or below
it. In practice this difference probably doesn't matter, but
it's worth mentioning.
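
The rounding difference can be seen with small numbers. The sketch
below is a userspace illustration only: scale_direct() and
scale_reduced() are hypothetical names, plain C division replaces the
kernel's div64 helpers, and rtime and total are assumed non-zero.

```c
#include <stdint.h>

/* Direct form: floor((stime * rtime) / total); the product may overflow. */
static uint64_t scale_direct(uint64_t stime, uint64_t rtime, uint64_t total)
{
	return (stime * rtime) / total;
}

/* Reduced-factor form, mirroring the logic of the patch. */
static uint64_t scale_reduced(uint64_t stime, uint64_t rtime, uint64_t total)
{
	uint64_t res, rem, scaled;

	if (rtime >= total) {
		res = rtime / total;
		rem = rtime % total;
		scaled = stime * res + (stime * rem) / total;
	} else {
		res = total / rtime;
		rem = total % rtime;
		scaled = stime / res;
		/* Subtracting a rounded-down term can leave the result high. */
		scaled -= (scaled * rem) / total;
	}

	return scaled;
}
```

For stime = 5, rtime = 7, total = 10 the precise result is 3.5: the
direct form yields 3 and the reduced form yields 4. For stime = 5,
rtime = 3, total = 10 (precise result 1.5) both yield 1.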

If this solution turns out not to be enough, we'll
need to partly revert to the behaviour prior to commit
     0cf55e1ec08bb5a22e068309e2d8ba1180ab4239
     ("sched, cputime: Introduce thread_group_times()")

Back then, the scaling was done at exit() time, before adding the cputime
of an exiting thread to the signal struct. We would then also need to
scale the cputime of live threads one by one in thread_group_cputime(). The
drawback may be slightly slower code at exit time.

Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Stanislaw Gruszka <sgruszka@redhat.com>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Stanislaw Gruszka <sgruszka@redhat.com>
Acked-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

---
 kernel/sched/cputime.c |   46 ++++++++++++++++++++++++++++++++++------------
 1 file changed, 34 insertions(+), 12 deletions(-)

--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -521,18 +521,36 @@ EXPORT_SYMBOL_GPL(vtime_account_irq_ente
 
 #else /* !CONFIG_VIRT_CPU_ACCOUNTING */
 
-static cputime_t scale_stime(cputime_t stime, cputime_t rtime, cputime_t total)
+/*
+ * Perform (stime * rtime) / total with reduced chances
+ * of multiplication overflows by using smaller factors
+ * like quotient and remainders of divisions between
+ * rtime and total.
+ */
+static cputime_t scale_stime(u64 stime, u64 rtime, u64 total)
 {
-	u64 temp = (__force u64) rtime;
-
-	temp *= (__force u64) stime;
+	u64 rem, res, scaled;
 
-	if (sizeof(cputime_t) == 4)
-		temp = div_u64(temp, (__force u32) total);
-	else
-		temp = div64_u64(temp, (__force u64) total);
+	if (rtime >= total) {
+		/*
+		 * Scale up to rtime / total then add
+		 * the remainder scaled to stime / total.
+		 */
+		res = div64_u64_rem(rtime, total, &rem);
+		scaled = stime * res;
+		scaled += div64_u64(stime * rem, total);
+	} else {
+		/*
+		 * Same in reverse: scale down to total / rtime
+		 * then subtract that result scaled to
+		 * the remaining part.
+		 */
+		res = div64_u64_rem(total, rtime, &rem);
+		scaled = div64_u64(stime, res);
+		scaled -= div64_u64(scaled * rem, total);
+	}
 
-	return (__force cputime_t) temp;
+	return (__force cputime_t) scaled;
 }
 
 /*
@@ -560,10 +578,14 @@ static void cputime_adjust(struct task_c
 	 */
 	rtime = nsecs_to_cputime(curr->sum_exec_runtime);
 
-	if (total)
-		stime = scale_stime(stime, rtime, total);
-	else
+	if (!rtime) {
+		stime = 0;
+	} else if (!total) {
 		stime = rtime;
+	} else {
+		stime = scale_stime((__force u64)stime,
+				    (__force u64)rtime, (__force u64)total);
+	}
 
 	/*
 	 * If the tick based count grows faster than the scheduler one,


