From: bert schulze
Subject: 4.14-rt timer issues using PREEMPT_RT_FULL=y and NO_HZ_FULL_ALL=y
Date: Tue, 12 Dec 2017 22:58:18 +0100
Message-ID: <20171212215818.GA18168@a.fritz.box>
To: linux-rt-users@vger.kernel.org

Hi folks,

I'm having issues with v4.14-rt1 to v4.14.3-rt5 using NO_HZ_FULL_ALL=y
with PREEMPT_RT_FULL=y and kernel.timer_migration enabled (which seems to
be the default).

Full config used: http://paste.debian.net/hidden/eb51a120/

The kernel either boots fine or locks up during boot (sysrq still works,
and boot continues after some seconds up to minutes). If a hang occurred
during boot, dmesg will contain:

root@deb9:~# dmesg | grep hrtimer
[ 1.507207] hrtimer: interrupt took 28740 ns

If the system booted fine (no "interrupt took #### ns" message), it
behaves as expected as long as timer migration is disabled:

root@deb9:~# echo 0 > /proc/sys/kernel/timer_migration

A simple sleep (or anything else using nanosleep()) is sufficient to
reproduce this.
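
A minimal sketch of how the check could be scripted, in case it helps anyone reproduce this. It assumes a POSIX shell with awk, taskset from util-linux and GNU date (for %N). The helper names loc_delta and sleep_ms_on_cpu are made up for illustration and are not part of any existing tool.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical.

# Print the per-CPU difference between two "LOC:" lines taken from
# /proc/interrupts before and after a test run.
loc_delta() {
    printf '%s\n%s\n' "$1" "$2" | awk '
        # First line: remember the numeric per-CPU counters after "LOC:".
        NR == 1 {
            n = 0
            for (i = 2; i <= NF && $i ~ /^[0-9]+$/; i++)
                before[++n] = $i
        }
        # Second line: print the difference for each counter.
        NR == 2 {
            for (i = 1; i <= n; i++)
                printf "%s%d", (i > 1 ? " " : ""), $(i + 1) - before[i]
            print ""
        }'
}

# Time a 0.1 s sleep pinned to one CPU and print the elapsed time in ms.
# Assumes GNU date, which supports %N (nanoseconds).
sleep_ms_on_cpu() (
    start=$(date +%s%N)
    taskset -c "$1" sleep 0.1
    end=$(date +%s%N)
    echo $(( ($end - $start) / 1000000 ))
)

# Possible usage:
#   before=$(grep LOC: /proc/interrupts)
#   for cpu in 0 1 2 3; do echo "CPU$cpu: $(sleep_ms_on_cpu $cpu) ms"; done
#   after=$(grep LOC: /proc/interrupts)
#   loc_delta "$before" "$after"
```

On an unaffected system each sleep should report roughly 100 ms, and the first delta (the housekeeping CPU) should stay close to CONFIG_HZ times the elapsed seconds.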

The expected behaviour with kernel.timer_migration = 0:

root@deb9:~# grep LOC: /proc/interrupts
LOC: 91968 801 775 590 Local timer interrupts
root@deb9:~# for cpu in {0..3} ;do time taskset -ac $cpu sleep 0.1 ;done
real 0m0.104s // CPU0 ok
real 0m0.104s // CPU1 ok
real 0m0.104s // CPU2 ok
real 0m0.105s // CPU3 ok
root@deb9:~# grep LOC: /proc/interrupts
LOC: 101069 824 782 599 Local timer interrupts

Roughly 10 seconds passed and the housekeeping CPU shows ~10,000
additional timer interrupts (which matches CONFIG_HZ=1000).

Doing the same with kernel.timer_migration = 1:

root@deb9:~# for cpu in {0..3} ;do time taskset -ac $cpu sleep 0.1 ;done
real 0m0.104s // CPU0 ok
[ 125.282455] hrtimer: interrupt took 2230 ns <--
real 0m28.023s // CPU1 not ok
real 0m9.129s // CPU2 not ok
real 0m10.000s // CPU3 not ok

Once the "hrtimer: interrupt took #### ns" message has appeared, any
sleep on the adaptive-tick CPUs is completely off, and …

root@deb9:~# grep LOC: /proc/interrupts
LOC: 12544410 874 828 638 Local timer interrupts

… timer interrupts on the housekeeping CPU advanced by ~12,400,000 after
roughly 60 seconds, even though the system has been up for only 2 minutes.

root@deb9:~# uptime
21:37:14 up 2 min, 1 user, load average: 0.17, 0.15, 0.06

To rule out my hardware, I've reproduced this on i7-6700, i7-3517u and
i7-2xxxHQ hardware as well as in QEMU itself.

Everything is back to normal when passing "nohz_full=" to the kernel to
disable adaptive-tick CPUs.

I've furthermore tested v4.13.13-rt5 and the WIP.timers branch of tip.git;
both work as expected.

Thanks,
Bert