From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: linux-nfs-owner@vger.kernel.org
Received: from mail.candelatech.com ([208.74.158.172]:50597 "EHLO
	ns3.lanforge.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751465Ab3AXANc (ORCPT
	<rfc822;linux-nfs@vger.kernel.org>); Wed, 23 Jan 2013 19:13:32 -0500
Message-ID: <51007CA8.2050105@candelatech.com>
Date: Wed, 23 Jan 2013 16:13:28 -0800
From: Ben Greear <greearb@candelatech.com>
MIME-Version: 1.0
To: Eric Dumazet <eric.dumazet@gmail.com>
CC: netdev <netdev@vger.kernel.org>,
        "linux-nfs@vger.kernel.org" <linux-nfs@vger.kernel.org>
Subject: Re: 3.7.3+:  Bad paging request in ip_rcv_finish while running NFS
 traffic.
References: <50FDADF4.3060601@candelatech.com>  <50FDDE35.7070806@candelatech.com>  <1358829606.3464.3151.camel@edumazet-glaptop>  <50FE2A57.3040804@candelatech.com>  <50FEC796.5090404@candelatech.com>  <1358875020.3464.4006.camel@edumazet-glaptop>  <1358875607.3464.4020.camel@edumazet-glaptop>  <50FF102F.2050008@candelatech.com> <50FF4BC9.1060206@candelatech.com>  <5100785D.8040101@candelatech.com> <1358985688.12374.1247.camel@edumazet-glaptop>
In-Reply-To: <1358985688.12374.1247.camel@edumazet-glaptop>
Content-Type: text/plain; charset=UTF-8; format=flowed
Sender: linux-nfs-owner@vger.kernel.org
List-ID: <linux-nfs.vger.kernel.org>

On 01/23/2013 04:01 PM, Eric Dumazet wrote:
> On Wed, 2013-01-23 at 15:55 -0800, Ben Greear wrote:
>> On 01/22/2013 06:32 PM, Ben Greear wrote:
>>
>> So, I'm slowly making some progress.  I've verified that the skb
>> has bogus dst (0xdeadbeef) at the top of the ip_rcv_finish
>> method.  I'm trying to track it backwards and figure out which
>> device it belongs to, etc....takes a while to reproduce though.
>>
>> One thing about this stack trace below...the dev_seq_stop() does
>> a rcu read-unlock.  Now, I can't figure out exactly how ip_rcv()
>> can cause dev_seq_stop() to run, but if this stack is legit,
>> then maybe by the time we enter the ip_rcv_finish() code we are
>> running without rcu_read_lock() held?
>>
>> If so, that would probably explain the bug.
>>
>
> The whole thing is run under rcu_read_lock() done in
> __netif_receive_skb()

I was worried that dev_seq_stop might be called
incorrectly, causing an asymmetric unlock.  I have no
idea how that might happen, but several crashes
have dev_seq_stop listed in the trace, so it made me suspicious.

>
> My suspicion was that we called netif_rx() from macvlan leaving a
> not refcounted skb dst.
>
> But the patch I sent to you didn't solve the bug, so it's something else.
>
> You could trace at which point the dst was released. (where you set
> dst->input/output to deadbeef)

The code that sets deadbeef currently sits in some garbage collector
timer code, but I can work on saving the call site that first pokes
the dst into the garbage collection list...
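
Roughly the idea, as a userspace sketch only.  Everything named fake_*
and the gc_caller field are made up for illustration and are not the
actual kernel code; in the kernel the same thing would hang off struct
dst_entry, and I'm assuming GCC's __builtin_return_address() to grab
the caller:

```c
#include <stddef.h>
#include <stdint.h>

/* Poison value written into the function pointers, as in the thread. */
#define DST_POISON ((void *)(uintptr_t)0xdeadbeef)

/* Hypothetical stand-in for the kernel's struct dst_entry. */
struct fake_dst {
	void *input;
	void *output;
	void *gc_caller;	/* first call site that queued this entry for GC */
};

/*
 * Queue an entry for garbage collection and remember who asked first.
 * noinline so that the return address really is the caller's.
 */
static __attribute__((noinline)) void fake_dst_gc_queue(struct fake_dst *dst)
{
	if (!dst->gc_caller)
		dst->gc_caller = __builtin_return_address(0);
}

/* What the GC timer would do later: poison the function pointers. */
static void fake_dst_gc_poison(struct fake_dst *dst)
{
	dst->input = DST_POISON;
	dst->output = DST_POISON;
}

/* Check to run at the top of the receive path. */
static int fake_dst_is_poisoned(const struct fake_dst *dst)
{
	return dst->input == DST_POISON || dst->output == DST_POISON;
}
```

When the poisoned check fires, printing gc_caller should show which
path released the dst, rather than just the timer that poisoned it.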

Thanks,
Ben

-- 
Ben Greear <greearb@candelatech.com>
Candela Technologies Inc  http://www.candelatech.com