From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 3 May 2016 08:46:56 -0700
From: "gregkh@linuxfoundation.org"
To: Steven Haigh
Cc: Boris Ostrovsky, linux-kernel@vger.kernel.org
Subject: Re: 4.4: INFO: rcu_sched self-detected stall on CPU
Message-ID: <20160503154656.GA27311@kroah.com>
References: <56F54EE0.6030004@oracle.com> <56F56172.9020805@crc.id.au> <56F5653B.1090700@oracle.com> <56F5A87A.8000903@crc.id.au> <56FA4336.2030301@crc.id.au> <56FA8DDD.7070406@oracle.com> <56FABF17.7090608@crc.id.au> <56FAC3AC.9050802@crc.id.au> <20160502205431.GA14983@kroah.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To:
User-Agent: Mutt/1.6.1 (2016-04-27)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, May 04, 2016 at 01:11:46AM +1000, Steven Haigh wrote:
> On 03/05/16 06:54, gregkh@linuxfoundation.org wrote:
> > On Wed, Mar 30, 2016 at 05:04:28AM +1100, Steven Haigh wrote:
> >> Greg, please see below - this is probably more for you...
> >>
> >> On 03/29/2016 04:56 AM, Steven Haigh wrote:
> >>>
> >>> Interestingly enough, this just happened again - but on a different
> >>> virtual machine. I'm starting to wonder if this may have something to do
> >>> with the uptime of the machine, as the system this happens to is
> >>> always different.
> >>>
> >>> Destroying it and monitoring it again has so far come up blank.
> >>>
> >>> I've thrown the latest lot of kernel messages here:
> >>> http://paste.fedoraproject.org/346802/59241532
> >>
> >> So I just did a bit of digging via the almighty Google.
> >>
> >> I started hunting for these lines, as they appear just before the stall:
> >> BUG: Bad rss-counter state mm:ffff88007b7db480 idx:2 val:-1
> >> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:0 val:-1
> >> BUG: Bad rss-counter state mm:ffff880079c638c0 idx:2 val:-1
> >>
> >> I stumbled across this post on the lkml:
> >> http://marc.info/?l=linux-kernel&m=145141546409607
> >>
> >> The patch attached there makes the following change to
> >> unmap_mapping_range in mm/memory.c:
> >>> - struct zap_details details;
> >>> + struct zap_details details = { };
> >>
> >> When I browse the git tree for 4.4.6:
> >> https://git.kernel.org/cgit/linux/kernel/git/stable/linux-stable.git/tree/mm/memory.c?id=refs/tags/v4.4.6
> >>
> >> I see at line 2411:
> >> struct zap_details details;
> >>
> >> Is this something that was missed when merging into the 4.4 tree?
> >> I'll admit my kernel knowledge is not enough to understand what the
> >> code actually does, but the similarities here seem uncanny.
> >
> > I'm sorry, I have no idea what you are asking me about here. Did I miss
> > a patch that should be backported? Did I backport something
> > incorrectly?
>
> Hi Greg + all,
>
> I did actually find the cause of my rss-counter problems: the
> experimental PVH functionality in Xen. It caused a number of
> corruptions, both on disk and in memory. Turning it off resolved the
> problem.
>
> As for the 'fix' above: it seems there was talk that zap_details should
> be initialized with { } to avoid a problem seen in newer kernel
> versions in linux-next.
>
> The question that I cannot answer (and I leave open to those on the
> list more knowledgeable than I) is whether that fix should also be
> applied to other trees.
>
> So the question as I see it:
> Is this an actual bug that we're just not seeing hit in other kernel
> versions - one that the newer oom reaper code from linux-next
> uncovered - or is the code as-is in the 4.4 tree considered correct?
>
> It could well be that the experimental code in the Xen PVH was tickling
> something that triggered the same type of issue as per the original bug
> report leading to the patch quoted above.

I would recommend working with the xen developers, on their mailing
list, about this issue.  If you end up with a patch that needs to be
applied, please let me and stable@ know about it.

thanks,

greg k-h