From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <linux-fsdevel-owner@vger.kernel.org>
Received: from mx0b-001b2d01.pphosted.com ([148.163.158.5]:53341 "EHLO
        mx0a-001b2d01.pphosted.com" rhost-flags-OK-OK-OK-FAIL)
        by vger.kernel.org with ESMTP id S1753651AbdCTMIo (ORCPT
        <rfc822;linux-fsdevel@vger.kernel.org>);
        Mon, 20 Mar 2017 08:08:44 -0400
Received: from pps.filterd (m0098414.ppops.net [127.0.0.1])
        by mx0b-001b2d01.pphosted.com (8.16.0.20/8.16.0.20) with SMTP id v2KBxFs0093243
        for <linux-fsdevel@vger.kernel.org>; Mon, 20 Mar 2017 08:08:28 -0400
Received: from e06smtp13.uk.ibm.com (e06smtp13.uk.ibm.com [195.75.94.109])
        by mx0b-001b2d01.pphosted.com with ESMTP id 29911qghdk-1
        (version=TLSv1.2 cipher=AES256-SHA bits=256 verify=NOT)
        for <linux-fsdevel@vger.kernel.org>; Mon, 20 Mar 2017 08:08:28 -0400
Received: from localhost
        by e06smtp13.uk.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted
        for <linux-fsdevel@vger.kernel.org> from <heiko.carstens@de.ibm.com>;
        Mon, 20 Mar 2017 12:08:26 -0000
Date: Mon, 20 Mar 2017 13:08:22 +0100
From: Heiko Carstens <heiko.carstens@de.ibm.com>
To: Al Viro <viro@ZenIV.linux.org.uk>,
        Gustavo Luiz Ferreira Walbon <gwalbon@br.ibm.com>,
        linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [BUG 4.9/4.10] crash in __d_lookup() due to corrupted
 dentry_hashtable
References: <20170303133150.GE5319@osiris>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <20170303133150.GE5319@osiris>
Message-Id: <20170320120822.GF3327@osiris>
Sender: linux-fsdevel-owner@vger.kernel.org
List-ID: <linux-fsdevel.vger.kernel.org>

On Fri, Mar 03, 2017 at 02:31:50PM +0100, Heiko Carstens wrote:
> Hello Al,
> 
> Gustavo reported the crash below within __d_lookup() on s390. I'm wondering
> if you can make any sense of it:
> 
> Unable to handle kernel pointer dereference in virtual kernel address space
> Failing address: fffffffffffff000 TEID: fffffffffffff803
> Fault in home space mode while using kernel ASCE.

...

> Kernel panic - not syncing: Fatal exception: panic_on_oops
> 
> Looking at the relevant part of __d_lookup:
> 
> struct dentry *__d_lookup(const struct dentry *parent, const struct qstr *name)
> {
> 	unsigned int hash = name->hash;
> 	struct hlist_bl_head *b = d_hash(hash);  <--- points to corrupted entry
> 	struct hlist_bl_node *node;
> 	struct dentry *found = NULL;
> 	struct dentry *dentry;
> 
> 	rcu_read_lock();
> 	
> 	hlist_bl_for_each_entry_rcu(dentry, node, b, d_hash) {
> 
> 		if (dentry->d_name.hash != hash)
> 			continue;
> ...
> 
> The contents of *b within the dump is:
> 
> > struct hlist_bl_head 000003e0806248f8
> struct hlist_bl_head {
> 	first = 0xffffffffffffffff
> }
> 
> Note that 0x000003e0806248f8 is a valid address within the
> dentry_hashtable. In addition all other entries look ok, as far as I can
> tell. This is the only entry that contains a -1UL value.
> 
> We also have a second dump with a similar crash with a 4.9 kernel. In that
> case there are in total three entries spread within the dentry_hashtable
> with a -1UL value, while all other entries seem to look ok. So there seems
> to be a pattern.
> 
> Note: these kernels do contain addon patches that are not mainline, but I
> don't believe that any of those can explain these corruptions.

Famous last words... it looks like it was indeed one of our addon patches.

With the bug fixed, Gustavo reports that the system now survives a 60-hour
stress test, which it previously did not.