Date: Sat, 7 Jan 2017 16:10:45 +0100
From: Greg Kurz
To: Al Viro
Cc: Tuomas Tynkkynen, linux-fsdevel@vger.kernel.org,
 v9fs-developer@lists.sourceforge.net, linux-kernel@vger.kernel.org
Subject: Re: [V9fs-developer] 9pfs hangs since 4.7
Message-ID: <20170107161045.742893b1@bahia.lan>
In-Reply-To: <20170107062647.GB12074@ZenIV.linux.org.uk>
References: <20161124215023.02deb03c@duuni>
 <20170102102035.7d1cf903@duuni>
 <20170102162309.GZ1555@ZenIV.linux.org.uk>
 <20170104013355.4a8923b6@duuni>
 <20170104014753.GE1555@ZenIV.linux.org.uk>
 <20170104220447.74f2265d@duuni>
 <20170104230101.GG1555@ZenIV.linux.org.uk>
 <20170106145235.51630baf@bahia.lan>
 <20170107062647.GB12074@ZenIV.linux.org.uk>

On Sat, 7 Jan 2017 06:26:47 +0000
Al Viro wrote:

> On Fri, Jan 06, 2017 at 02:52:35PM +0100, Greg Kurz wrote:
>
> > Looking at the tag numbers, I think we're hitting the hardcoded limit of 128
> > simultaneous requests in QEMU (which doesn't produce any error; new requests
> > are silently dropped).
> >
> > Tuomas, can you change MAX_REQ to some higher value (< 65535, since the tag
> > is 2 bytes and 0xffff is reserved) to confirm?
>
> Huh?
>
> Just how is a client supposed to cope with that behaviour? 9P is not
> SunRPC - there's a reason why it doesn't live on top of UDP. Sure, it's
> datagram-oriented, but it really wants a reliable transport...
>
> Setting the ring size at MAX_REQ is fine; that'll give you ENOSPC on an
> attempt to put a request there, and p9_virtio_request() will wait for
> things to clear, but if you've accepted a request, that's bloody it -
> you really should go and handle it.
>

Yes, you're right, and "dropped" in my previous mail actually meant "not
accepted" (virtqueue_pop() not called)... sorry for the confusion. :-\

> How does it happen, anyway? qemu-side, I mean... Does it move the buffer
> to the used ring as soon as it has fetched the request? AFAICS, it doesn't -
> virtqueue_push() is called just before pdu_free(); we might get complications
> in case of TFLUSH handling (queue with MAX_REQ-1 requests submitted, TFLUSH
> arrives, cancel_pdu is found and ->cancelled is set on it, then v9fs_flush()
> waits for it to complete. Once the damn thing is done, the buffer is released
> by virtqueue_push(), but pdu freeing is delayed until v9fs_flush() gets woken
> up. In the meanwhile, another request arrives into the slot freed by that
> virtqueue_push() and we are out of pdus).
>

Indeed. Even if this doesn't seem to be the problem here, I guess it should
be fixed.

> So it could happen, and things might get unpleasant to some extent, but...
> no TFLUSH had been present in all that traffic. And none of the stuck
> processes had been spinning in p9_virtio_request(), so they *did* find
> ring slots...

So we're back to your previous proposal of checking whether virtqueue_kick()
returned false...
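
Concretely, that check would go something like this in p9_virtio_request() -
a condensed sketch against net/9p/trans_virtio.c, with the sg setup and some
error paths trimmed, and the exact errno being my guess:

static int p9_virtio_request(struct p9_client *client, struct p9_req_t *req)
{
	struct virtio_chan *chan = client->trans;
	struct scatterlist *sgs[2];
	unsigned long flags;
	int err;

req_retry:
	spin_lock_irqsave(&chan->lock, flags);

	/* ... build sgs[] from the request and reply buffers ... */

	err = virtqueue_add_sgs(chan->vq, sgs, 1, 1, req, GFP_ATOMIC);
	if (err == -ENOSPC) {
		/* ring full: wait for a completion to free a slot, then
		 * retry -- this is what the current code already does */
		chan->ring_bufs_avail = 0;
		spin_unlock_irqrestore(&chan->lock, flags);
		err = wait_event_killable(chan->vc_wq, chan->ring_bufs_avail);
		if (err == -ERESTARTSYS)
			return err;
		goto req_retry;
	}

	/* the proposed change: don't ignore the return value */
	if (!virtqueue_kick(chan->vq)) {
		/* device is broken; fail fast instead of waiting forever
		 * for a reply that will never arrive */
		spin_unlock_irqrestore(&chan->lock, flags);
		return -EIO;	/* errno is a guess */
	}

	spin_unlock_irqrestore(&chan->lock, flags);
	return 0;
}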
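
To make the TFLUSH ordering you describe concrete, the QEMU completion path
is roughly shaped like this -- paraphrased from hw/9pfs, names approximate,
and the wakeup helper below is hypothetical:

static void pdu_complete(V9fsPDU *pdu, ssize_t len)
{
    /* ... marshal the reply into the guest buffers ... */

    virtqueue_push(vq, &pdu->elem, len);    /* slot back to the guest */
    virtio_notify(vdev, vq);

    if (pdu->cancelled) {
        /* a TFLUSH is waiting on this pdu: wake v9fs_flush(), which
         * calls pdu_free() only once it gets scheduled again.  Between
         * the push above and that deferred free, the guest may submit
         * a new request into the just-released slot while all MAX_REQ
         * pdus are still accounted as busy -> out of pdus. */
        wake_flush_waiter(pdu);             /* hypothetical helper */
        return;
    }

    pdu_free(pdu);    /* normal path: slot and pdu released together */
}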
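
And for completeness, the limit I was suggesting Tuomas raise for the
experiment -- IIRC the define lives in hw/9pfs/virtio-9p.h, though the exact
location may vary across QEMU versions:

/* hw/9pfs/virtio-9p.h */
#define MAX_REQ         128     /* hardcoded number of simultaneous requests;
                                 * any experimental bump must stay below
                                 * 65535, since 9P tags are 16-bit and
                                 * 0xffff (P9_NOTAG) is reserved */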