From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41038)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <richard.purdie@linuxfoundation.org>)
	id 1ZYjKv-0000gv-22
	for qemu-devel@nongnu.org; Sun, 06 Sep 2015 19:27:02 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <richard.purdie@linuxfoundation.org>)
	id 1ZYjKr-0001Pu-QT
	for qemu-devel@nongnu.org; Sun, 06 Sep 2015 19:27:00 -0400
Received: from 5751f4a1.skybroadband.com ([87.81.244.161]:49704
	helo=dan.rpsys.net) by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <richard.purdie@linuxfoundation.org>)
	id 1ZYjKr-0001PY-FM
	for qemu-devel@nongnu.org; Sun, 06 Sep 2015 19:26:57 -0400
Message-ID: <1441581997.24871.227.camel@linuxfoundation.org>
From: Richard Purdie <richard.purdie@linuxfoundation.org>
Date: Mon, 07 Sep 2015 00:26:37 +0100
In-Reply-To: <CAPokK=qr6Rr+kPbZu=qbUgn0yHGzu8FbzMkSF4CSP92xdhgPrg@mail.gmail.com>
References: <1441362357.24871.155.camel@linuxfoundation.org>
	<CAFEAcA874VoaFLDU+5Ai1DgQ8ahgSrh6Xfkg6VSqHwZ0422PSA@mail.gmail.com>
	<1441365880.24871.164.camel@linuxfoundation.org>
	<CAFEAcA8ict1SFxpu4yb_y1Rvp6kTZn3xweMF_7Vw1H0Mc-w3iw@mail.gmail.com>
	<1441370585.24871.166.camel@linuxfoundation.org>
	<1441387258.24871.197.camel@linuxfoundation.org>
	<CAFEAcA9YWnhnc5eL8ST1bHjkBj1_UiVd7ZDET2GjKbV+BQ+igQ@mail.gmail.com>
	<CAPokK=p9GUx7OeJ7xNwbnJkzd7JMLVayhgOCf0XapHnW9M9JLA@mail.gmail.com>
	<1441549313.24871.218.camel@linuxfoundation.org>
	<CAPokK=qr6Rr+kPbZu=qbUgn0yHGzu8FbzMkSF4CSP92xdhgPrg@mail.gmail.com>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Peter Crosthwaite <crosthwaitepeter@gmail.com>
Cc: Peter Maydell <peter.maydell@linaro.org>, qemu-devel <qemu-devel@nongnu.org>

On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> >> > On 4 September 2015 at 18:20, Richard Purdie
> >> > <richard.purdie@linuxfoundation.org> wrote:
> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> >> >>> > On 4 September 2015 at 12:24, Richard Purdie
> >> >>> > <richard.purdie@linuxfoundation.org> wrote:
> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
> >> >>> > > overrunning. I can add the asserts but I think it would just confirm
> >> >>> > > this.
> >> >>> >
> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
> >> >>>
> >> >>> I've now confirmed that it does indeed trigger the assert in
> >> >>> smc91c111_receive().
> >> >>
> >> >> I just tried an experiment where I put:
> >> >>
> >> >>     if (s->rx_fifo_len >= NUM_PACKETS)
> >> >>         return -1;
> >> >>
> >> >> into smc91c111_receive() and my reproducer stops reproducing the
> >> >> problem.
> >>
> >> Does it just stop the crash or does it eliminate the problem
> >> completely with a fully now-working network?
> >
> > It stops the crash, the network works great.
> >
> >> >> I also noticed can_receive() could also have a check on buffer
> >> >> availability. Would one of these changes be the correct fix here?
> >> >
> >> > The interesting question is why smc91c111_allocate_packet() doesn't
> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
> >> > shared between the tx and rx buffers, so how could we both have
> >> > already filled the rx_fifo and have a spare packet for the allocate
> >> > function to return?
> >>
> >> Maybe this:
> >>
> >>             case 5: /* Release.  */
> >>                 smc91c111_release_packet(s, s->packet_num);
> >>                 break;
> >>
> >> The guest is able to free an allocated packet without the accompanying
> >> pop of tx/rx fifo. This may suggest some sort of guest error?
> >>
> >> The fix depends on the behaviour of the real hardware. If that MMIO op
> >> is supposed to dequeue the corresponding queue entry then we may need
> >> to patch that logic to do search the queues and dequeue it. Otherwise
> >> we need to find out the genuine length of the rx queue, and clamp it
> >> without something like Richards patch. There are a few other bits and
> >> pieces that suggest the guest can have independent control of the
> >> queues and allocated buffers but i'm confused as to how the rx fifo
> >> length can get up to 10 in any case.
> >
> > I think I have a handle on what is going on. smc91c111_release_packet()
> > changes s->allocated() but not rx_fifo. can_receive() only looks at
> > s->allocated. We can trigger new network packets to arrive from
> > smc91c111_release_packet() which calls qemu_flush_queued_packets()
> > *before* we change rx_fifo and this can loop.
> >
> > The patch below which explicitly orders the qemu_flush_queued_packets()
> > call resolved the test case I was able to reproduce this problem in.
> >
> > So there are three ways to fix this, either can_receive() needs to check
> > both s->allocated() and rx_fifo,
> 
> This is probably the winner for me.
> 
> > or the code is more explicit about when
> > qemu_flush_queued_packets() is called (as per my patch below), or the
> > case 4 where smc91c111_release_packet() and then
> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
> > which also works, albeit with more ugly code.

It seems can_receive isn't enough, we'd need to put some checks into
receive itself since once can_receive says "yes", multiple packets can
arrive to _receive without further checks of can_receive. I've either
messed up my previous test or been lucky.

I tested an assert in _recieve() which confirms it can be called when
can_receive() says it isn't ready.

If we return -1 in _receive, the code will stop sending packets and all
works as it should, it recovers just fine. So I think that is looking
like the correct fix. I'd note that it already effectively has half this
check in the allocate_packet call, its just missing the rx_fifo_len one.

Cheers,

Richard