From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:41038)
 by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1ZYjKv-0000gv-22
 for qemu-devel@nongnu.org; Sun, 06 Sep 2015 19:27:02 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
 (envelope-from ) id 1ZYjKr-0001Pu-QT
 for qemu-devel@nongnu.org; Sun, 06 Sep 2015 19:27:00 -0400
Received: from 5751f4a1.skybroadband.com ([87.81.244.161]:49704 helo=dan.rpsys.net)
 by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from )
 id 1ZYjKr-0001PY-FM
 for qemu-devel@nongnu.org; Sun, 06 Sep 2015 19:26:57 -0400
Message-ID: <1441581997.24871.227.camel@linuxfoundation.org>
From: Richard Purdie
Date: Mon, 07 Sep 2015 00:26:37 +0100
References: <1441362357.24871.155.camel@linuxfoundation.org>
 <1441365880.24871.164.camel@linuxfoundation.org>
 <1441370585.24871.166.camel@linuxfoundation.org>
 <1441387258.24871.197.camel@linuxfoundation.org>
 <1441549313.24871.218.camel@linuxfoundation.org>
Content-Type: text/plain; charset="UTF-8"
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
To: Peter Crosthwaite
Cc: Peter Maydell, qemu-devel

On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie wrote:
> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell wrote:
> >> > On 4 September 2015 at 18:20, Richard Purdie wrote:
> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> >> >>> > On 4 September 2015 at 12:24, Richard Purdie wrote:
> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
> >> >>> > > overrunning. I can add the asserts but I think it would just confirm
> >> >>> > > this.
> >> >>> > >
> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
> >> >>>
> >> >>> I've now confirmed that it does indeed trigger the assert in
> >> >>> smc91c111_receive().
> >> >>
> >> >> I just tried an experiment where I put:
> >> >>
> >> >>     if (s->rx_fifo_len >= NUM_PACKETS)
> >> >>         return -1;
> >> >>
> >> >> into smc91c111_receive() and my reproducer stops reproducing the
> >> >> problem.
> >>
> >> Does it just stop the crash or does it eliminate the problem
> >> completely with a fully now-working network?
> >
> > It stops the crash, the network works great.
> >
> >> >> I also noticed can_receive() could also have a check on buffer
> >> >> availability. Would one of these changes be the correct fix here?
> >> >
> >> > The interesting question is why smc91c111_allocate_packet() doesn't
> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
> >> > shared between the tx and rx buffers, so how could we both have
> >> > already filled the rx_fifo and have a spare packet for the allocate
> >> > function to return?
> >>
> >> Maybe this:
> >>
> >>     case 5: /* Release. */
> >>         smc91c111_release_packet(s, s->packet_num);
> >>         break;
> >>
> >> The guest is able to free an allocated packet without the accompanying
> >> pop of tx/rx fifo. This may suggest some sort of guest error?
> >>
> >> The fix depends on the behaviour of the real hardware. If that MMIO op
> >> is supposed to dequeue the corresponding queue entry then we may need
> >> to patch that logic to do search the queues and dequeue it. Otherwise
> >> we need to find out the genuine length of the rx queue, and clamp it
> >> without something like Richards patch. There are a few other bits and
> >> pieces that suggest the guest can have independent control of the
> >> queues and allocated buffers but i'm confused as to how the rx fifo
> >> length can get up to 10 in any case.
> >
> > I think I have a handle on what is going on.
> > smc91c111_release_packet() changes s->allocated() but not rx_fifo.
> > can_receive() only looks at s->allocated. We can trigger new network
> > packets to arrive from smc91c111_release_packet() which calls
> > qemu_flush_queued_packets() *before* we change rx_fifo and this can
> > loop.
> >
> > The patch below which explicitly orders the qemu_flush_queued_packets()
> > call resolved the test case I was able to reproduce this problem in.
> >
> > So there are three ways to fix this, either can_receive() needs to check
> > both s->allocated() and rx_fifo,
>
> This is probably the winner for me.
>
> > or the code is more explicit about when
> > qemu_flush_queued_packets() is called (as per my patch below), or the
> > case 4 where smc91c111_release_packet() and then
> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
> > which also works, albeit with more ugly code.

It seems can_receive() isn't enough; we'd need to put a check into
_receive() itself, since once can_receive() says "yes", multiple packets
can arrive at _receive() without further calls to can_receive(). I've
either messed up my previous test or been lucky.

I tested an assert in _receive() which confirms it can be called when
can_receive() says it isn't ready.

If we return -1 in _receive(), the code stops sending packets and
everything works as it should; it recovers just fine. So I think that is
looking like the correct fix. I'd note that it already effectively has
half of this check in the allocate_packet call; it's just missing the
rx_fifo_len one.

Cheers,

Richard
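
For illustration, here is a minimal sketch of the receive-side check
described above. It is a toy model, not the real device code: ToyState,
the toy_* functions and NUM_PACKETS = 4 are all assumptions made up for
this example, and the return type is long rather than the ssize_t the
real receive callback uses. It only shows why both halves of the check
are needed when a buffer is released without popping rx_fifo.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_PACKETS 4   /* illustrative value, not taken from the device model */

/* Hypothetical, cut-down stand-in for the device state. */
typedef struct {
    uint32_t allocated;          /* bitmask of packet buffers in use */
    int rx_fifo[NUM_PACKETS];    /* packet numbers waiting for the guest */
    int rx_fifo_len;
} ToyState;

/* Return a free packet number, or -1 if every buffer is in use. */
static int toy_allocate_packet(ToyState *s)
{
    for (int i = 0; i < NUM_PACKETS; i++) {
        if (!(s->allocated & (1u << i))) {
            s->allocated |= 1u << i;
            return i;
        }
    }
    return -1;
}

/* Models the "case 5" release: frees the buffer but leaves rx_fifo alone. */
static void toy_release_packet(ToyState *s, int packet)
{
    s->allocated &= ~(1u << packet);
}

/* Receive path with both checks. Returns size on success, -1 if full. */
static long toy_receive(ToyState *s, size_t size)
{
    if (s->rx_fifo_len >= NUM_PACKETS) {
        return -1;               /* the rx_fifo_len half that was missing */
    }
    int packet = toy_allocate_packet(s);
    if (packet < 0) {
        return -1;               /* the half the allocate call already covers */
    }
    assert(s->rx_fifo_len < NUM_PACKETS);
    s->rx_fifo[s->rx_fifo_len++] = packet;
    return (long)size;
}

/* Fill the FIFO, release one buffer without popping, then receive again. */
static long toy_demo(void)
{
    ToyState s = {0};
    for (int i = 0; i < NUM_PACKETS; i++) {
        if (toy_receive(&s, 64) != 64) {
            return -2;           /* unexpected failure while filling */
        }
    }
    toy_release_packet(&s, s.rx_fifo[0]);
    return toy_receive(&s, 64);  /* a buffer is free, but rx_fifo is full */
}
```

In toy_demo() the allocate call alone would succeed after the release,
so without the rx_fifo_len test the final receive would write one entry
past the end of rx_fifo. With the test in place it is refused instead.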