From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751204AbdJANGd (ORCPT ); Sun, 1 Oct 2017 09:06:33 -0400
Received: from mail.bix.bg ([193.105.196.21]:44947 "HELO mail.bix.bg"
        rhost-flags-OK-OK-OK-OK) by vger.kernel.org with SMTP
        id S1751018AbdJANGc (ORCPT ); Sun, 1 Oct 2017 09:06:32 -0400
Message-ID: <1506863190.18340.20.camel@declera.com>
Subject: Re: [regression 4.14rc] 74def747bcd0 (genirq: Restrict effective affinity to interrupts actually using it)
From: Yanko Kaneti
To: Thorsten Leemhuis
Cc: LKML, Thomas Gleixner, Chuck Ebbert, Marc Zyngier
Date: Sun, 01 Oct 2017 16:06:30 +0300
In-Reply-To: <4cd63212-8f77-891d-67a0-9a305ddba549@leemhuis.info>
References: <1505833936.2634.11.camel@declera.com>
        <4374f6c0-dd67-3bd3-91a0-685eb9a0d711@arm.com>
        <1505835616.2634.14.camel@declera.com>
        <225dd0d8-2c27-57a6-c17d-c552c011d8da@arm.com>
        <20170919203044.560cb9f1@gmail.com>
        <4cd63212-8f77-891d-67a0-9a305ddba549@leemhuis.info>
Content-Type: text/plain; charset="UTF-8"
X-Mailer: Evolution 3.26.0 (3.26.0-1.fc28)
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 2017-10-01 at 14:46 +0200, Thorsten Leemhuis wrote:
> Hi, the regression tracker here. What's the status of this issue? Was
> the problem fixed? It seems nothing happened for more than 10 days -- or
> did the discussion move somewhere else?
> Ciao, Thorsten

The commit was reverted last week before rc2
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0551968add53777fddd18f4ffb4e3bbc1f646d79

Thanks for tracking it
-Yanko

>
> On 20.09.2017 02:30, Chuck Ebbert wrote:
> > On Tue, 19 Sep 2017 16:51:06 +0100
> > Marc Zyngier wrote:
> >
> > > On 19/09/17 16:40, Yanko Kaneti wrote:
> > > > On Tue, 2017-09-19 at 16:33 +0100, Marc Zyngier wrote:
> > > > > On 19/09/17 16:12, Yanko Kaneti wrote:
> > > > > > Hello,
> > > > > >
> > > > > > Fedora rawhide config here.
> > > > > > AMD FX-8370E
> > > > > >
> > > > > > Bisected a problem to:
> > > > > > 74def747bcd0 (genirq: Restrict effective affinity to interrupts
> > > > > > actually using it)
> > > > > >
> > > > > > It seems to be causing stalls, short lived or long lived lockups
> > > > > > very shortly after boot. Everything becomes jerky.
> > > > > >
> > > > > > The only visible in the log indication is something like :
> > > > > > ....
> > > > > > [ 59.802129] clocksource: timekeeping watchdog on CPU3: Marking clocksource 'tsc' as unstable because the skew is too large:
> > > > > > [ 59.802134] clocksource: 'hpet' wd_now: 3326e7aa wd_last: 329956f8 mask: ffffffff
> > > > > > [ 59.802137] clocksource: 'tsc' cs_now: 423662bc6f cs_last: 41dfc91650 mask: ffffffffffffffff
> > > > > > [ 59.802140] tsc: Marking TSC unstable due to clocksource watchdog
> > > > > > [ 59.802158] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > > > > > [ 59.802161] sched_clock: Marking unstable (59802142067, 15510)<-(59920871789, -118714277)
> > > > > > [ 60.015604] clocksource: Switched to clocksource hpet
> > > > > > [ 89.015994] INFO: NMI handler (perf_event_nmi_handler) took too long to run: 209.660 msecs
> > > > > > [ 89.016003] perf: interrupt took too long (1638003 > 2500), lowering kernel.perf_event_max_sample_rate to 1000
> > > > > > ....
> > > > > >
> > > > > > Just reverting that commit on top of linus mainline cures all the
> > > > > > symptoms
> > > > >
> > > > > Interesting. Do you still get HPET interrupts?
> > > >
> > > > Sorry, I might need some basic help here (i.e where do I count
> > > > them...)
> > >
> > > /proc/interrupts should display them.
> > >
> > > > After the watchdog switches the clocksource to hpet the system is
> > > > still somewhat alive, so I'll guess some clock is still
> > > > ticking....
> > >
> > > Probably, but I suspect they're not hitting the right CPU, hence the
> > > lockups.
> > >
> > > Unfortunately, my x86-foo is pretty minimal, and I'm about to drop off
> > > the net for a few days.
> > >
> > > Thomas, any insight?
> >
> > Looking at flat_cpu_mask_to_apicid(), I don't see how 74def747bcd0
> > can be correct:
> >
> > 	struct cpumask *effmsk = irq_data_get_effective_affinity_mask(irqdata);
> > 	unsigned long cpu_mask = cpumask_bits(mask)[0] & APIC_ALL_CPUS;
> >
> > 	if (!cpu_mask)
> > 		return -EINVAL;
> > 	*apicid = (unsigned int)cpu_mask;
> > 	cpumask_bits(effmsk)[0] = cpu_mask;
> >
> > Before that patch, this function wrote to the effective mask
> > unconditionally. After, it only writes to effective_mask if it is
> > already non-zero.
> >
> >
> > http://news.gmane.org/find-root.php?message_id=20170919203044.560cb9f1%40gmail.com
> > http://mid.gmane.org/20170919203044.560cb9f1%40gmail.com
> >
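
For reference, the behaviour Chuck describes in the quoted text can be modelled in user space. This is only a rough sketch, not kernel code. It assumes (the thread does not quote the patch itself) that after 74def747bcd0 the accessor hands back the plain affinity mask whenever the effective mask is empty. Every toy_* name and TOY_APIC_ALL_CPUS is a hypothetical stand-in, and the masks are reduced to a single word.

```c
/* Toy user-space model; every name here is hypothetical, not a kernel API. */
#define TOY_APIC_ALL_CPUS 0xFFUL  /* stand-in for APIC_ALL_CPUS */

struct toy_irq_data {
    unsigned long affinity;            /* requested affinity mask */
    unsigned long effective_affinity;  /* mask the interrupt is actually routed to */
};

typedef unsigned long *(*toy_getter)(struct toy_irq_data *);

/* Pre-patch behaviour as described in the thread: always the effective mask. */
static unsigned long *toy_get_effective_before(struct toy_irq_data *d)
{
    return &d->effective_affinity;
}

/* ASSUMED post-patch behaviour: fall back to affinity if effective is empty. */
static unsigned long *toy_get_effective_after(struct toy_irq_data *d)
{
    if (d->effective_affinity != 0)
        return &d->effective_affinity;
    return &d->affinity;
}

/* Mirrors the quoted flat_cpu_mask_to_apicid() body, writing via the getter. */
static int toy_mask_to_apicid(struct toy_irq_data *d, unsigned long mask,
                              toy_getter get, unsigned int *apicid)
{
    unsigned long cpu_mask = mask & TOY_APIC_ALL_CPUS;

    if (!cpu_mask)
        return -1;  /* stands in for -EINVAL */
    *apicid = (unsigned int)cpu_mask;
    *get(d) = cpu_mask;  /* lands in whichever mask the getter returned */
    return 0;
}
```

Starting from an empty effective mask, the "before" getter leaves the requested affinity alone and fills in effective_affinity. With the assumed "after" getter the same call overwrites affinity instead and effective_affinity stays zero, so it can never become non-empty through this path.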