* [Qemu-devel] Segfault using qemu-system-arm in smc91c111
@ 2015-09-04 10:25 Richard Purdie
2015-09-04 10:45 ` Peter Maydell
0 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-04 10:25 UTC (permalink / raw)
To: qemu-devel
We're seeing repeated segfaults in qemu-system-arm when we heavily use
the network. I have a coredump backtrace:
Reading symbols from /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/sysroots/x86_64-linux/usr/bin/qemu-system-arm...done.
[New LWP 4536]
[New LWP 4534]
[New LWP 4530]
[New LWP 4537]
[New LWP 6396]
warning: Corrupted shared library list: 0x7f8d5f27e540 != 0x6198225000007f8d
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
Core was generated by `/home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 smc91c111_pop_tx_fifo_done (s=0x7f8d6158b560)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:179
179 s->tx_fifo_done[i] = s->tx_fifo_done[i + 1];
(gdb) bt
#0 smc91c111_pop_tx_fifo_done (s=0x7f8d6158b560)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:179
#1 smc91c111_writeb (opaque=0x7f8d6158b560, offset=12, value=<optimized out>)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:431
#2 0x00007f8d5ecacd65 in memory_region_oldmmio_write_accessor (mr=<optimized out>, addr=<optimized out>, value=<optimized out>,
size=<optimized out>, shift=<optimized out>, mask=<optimized out>, attrs=...)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:434
#3 0x00007f8d5ecac5dd in access_with_adjusted_size (addr=140245200319840, addr@entry=12, value=0xc, value@entry=0x7f8d52ac63e8,
size=1, access_size_min=2031671516, access_size_max=32, access=0x7f8d5ecacd30 <memory_region_oldmmio_write_accessor>,
mr=0x7f8d6158f8f0, attrs=...)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:506
#4 0x00007f8d5ecae08b in memory_region_dispatch_write (mr=mr@entry=0x7f8d6158f8f0, addr=12, data=2, size=size@entry=1,
attrs=attrs@entry=...)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:1171
#5 0x00007f8d5ec7b78f in address_space_rw (as=0x7f8d5f408600 <address_space_memory>, addr=268501004, attrs=...,
buf=buf@entry=0x7f8d52ac64b0 "\002", len=1, is_write=is_write@entry=true)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2451
#6 0x00007f8d5ec7b9e0 in address_space_write (len=<optimized out>, buf=0x7f8d52ac64b0 "\002", attrs=..., addr=<optimized out>,
as=<optimized out>)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2521
#7 subpage_write (opaque=<optimized out>, addr=<optimized out>, value=<optimized out>, len=<optimized out>, attrs=...)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2081
#8 0x00007f8d5ecac5dd in access_with_adjusted_size (addr=140245200319840, addr@entry=12, value=0xc, value@entry=0x7f8d52ac6558,
size=1, access_size_min=2031671516, access_size_max=32, access=0x7f8d5ecac500 <memory_region_write_with_attrs_accessor>,
mr=0x7f8d618d5750, attrs=...)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:506
#9 0x00007f8d5ecae08b in memory_region_dispatch_write (mr=0x7f8d618d5750, addr=12, data=2, size=1, attrs=...)
at /home/pokybuild/yocto-autobuilder/yocto-worker/nightly-arm-lsb/build/build/tmp/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:1171
#10 0x00007f8d5584b512 in ?? ()
(gdb) print s->tx_fifo_done
$1 = {99614720, 99614720, 99614720, 99614720}
(gdb) print s->tx_fifo_done_len
$2 = 99614719
so it looks like tx_fifo_done_len has been corrupted; going beyond that
is harder for me to figure out. Does anyone happen to know what might be
going on here? This is with qemu 2.4.0.
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 10:25 [Qemu-devel] Segfault using qemu-system-arm in smc91c111 Richard Purdie
@ 2015-09-04 10:45 ` Peter Maydell
2015-09-04 11:24 ` Richard Purdie
0 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2015-09-04 10:45 UTC (permalink / raw)
To: Richard Purdie; +Cc: qemu-devel
On 4 September 2015 at 11:25, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> We're seeing repeated segfaults in qemu-system-arm when we heavily use
> the network. I have a coredump backtrace:
> (gdb) print s->tx_fifo_done
> $1 = {99614720, 99614720, 99614720, 99614720}
> (gdb) print s->tx_fifo_done_len
> $2 = 99614719
>
> so it looks like tx_fifo_done_len has been corrupted, going beyond that
> is harder for me to figure out. Does anyone happen to know what might be
> going on here? This is with qemu 2.4.0.
That would suggest the rx_fifo buffer is overrunning (assuming
none of the other fields in the struct look like they've
been corrupted). Can you try putting
assert(s->rx_fifo_len < NUM_PACKETS);
before
s->rx_fifo[s->rx_fifo_len++] = packetnum;
in smc91c111_receive(), and see if you hit that assertion?
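For context, the check would sit roughly here; this is a sketch from
memory rather than the literal 2.4.0 source (the 0x80 allocation-failure
value and the surrounding lines are assumptions):
    /* sketch only -- placement of the suggested assertion in
     * smc91c111_receive(); surrounding lines paraphrased */
    packetnum = smc91c111_allocate_packet(s);
    if (packetnum == 0x80) {
        return -1;                            /* no free packet buffer */
    }
    assert(s->rx_fifo_len < NUM_PACKETS);     /* the check to add */
    s->rx_fifo[s->rx_fifo_len++] = packetnum;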
Also, do you have a more specific reproduce case so I can try
to replicate the problem here?
thanks
-- PMM
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 10:45 ` Peter Maydell
@ 2015-09-04 11:24 ` Richard Purdie
2015-09-04 11:31 ` Peter Maydell
0 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-04 11:24 UTC (permalink / raw)
To: Peter Maydell; +Cc: qemu-devel
On Fri, 2015-09-04 at 11:45 +0100, Peter Maydell wrote:
> On 4 September 2015 at 11:25, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > We're seeing repeated segfaults in qemu-system-arm when we heavily use
> > the network. I have a coredump backtrace:
>
> > (gdb) print s->tx_fifo_done
> > $1 = {99614720, 99614720, 99614720, 99614720}
> > (gdb) print s->tx_fifo_done_len
> > $2 = 99614719
> >
> > so it looks like tx_fifo_done_len has been corrupted, going beyond that
> > is harder for me to figure out. Does anyone happen to know what might be
> > going on here? This is with qemu 2.4.0.
>
> That would suggest the rx_fifo buffer is overrunning (assuming
> none of the other fields in the struct look like they've
> been corrupted). Can you try putting
> assert(s->rx_fifo_len < NUM_PACKETS);
> before
> s->rx_fifo[s->rx_fifo_len++] = packetnum;
> in smc91c111_receive(), and see if you hit that assertion?
(gdb) print s->tx_fifo_len
$2 = 0
(gdb) print s->rx_fifo_len
$3 = 10
So just based on that, yes, it seems the rx_fifo is overrunning. I can
add the asserts but I think it would just confirm this.
> Also, do you have a more specific reproduce case so I can try
> to replicate the problem here?
I'm not sure how familiar you are with the Yocto Project. Basically we
build a core-image-sato rootfs image, then boot it under qemu and run
some tests against it. This seems to reliably fail for arm, particularly
on our debian8 autobuilder, for reasons as yet unknown. The build logs
for a couple of example failures are:
https://autobuilder.yoctoproject.org/main/builders/nightly-oecore/builds/477
https://autobuilder.yoctoproject.org/main/builders/nightly-arm-lsb/builds/465
There is an issue where the tests don't stop running after qemu
segfaults; they continue to try and connect to it, which we'll work on
separately. There is a segfault/coredump showing the same backtrace for
both of the above builds.
So if you had an OE build environment, you could download (or build) a
core-image-sato, then just run the tests against it (bitbake
core-image-sato -c testimage). We've yet to figure out exactly which
environments trigger it but it does seem to fail fairly regularly (>50%
of the time) when running these tests.
I appreciate it's not exactly an easy reproducer, but the setup is
designed to be replicated and you did ask! :)
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 11:24 ` Richard Purdie
@ 2015-09-04 11:31 ` Peter Maydell
2015-09-04 12:43 ` Richard Purdie
0 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2015-09-04 11:31 UTC (permalink / raw)
To: Richard Purdie; +Cc: qemu-devel
On 4 September 2015 at 12:24, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> So just based on that, yes, seems that the rx_fifo looks to be
> overrunning. I can add the asserts but I think it would just confirm
> this.
Yes, the point of adding assertions is to confirm a hypothesis.
>> Also, do you have a more specific reproduce case so I can try
>> to replicate the problem here?
>
> Not sure how familiar you are with the yocto project?
I don't really know about the Yocto project, no. What I need
is a set of instructions I can follow ("download this, run this
command line", etc) so I can reproduce it on my machine.
thanks
-- PMM
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 11:31 ` Peter Maydell
@ 2015-09-04 12:43 ` Richard Purdie
2015-09-04 17:20 ` Richard Purdie
0 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-04 12:43 UTC (permalink / raw)
To: Peter Maydell; +Cc: qemu-devel
On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> On 4 September 2015 at 12:24, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > So just based on that, yes, seems that the rx_fifo looks to be
> > overrunning. I can add the asserts but I think it would just confirm
> > this.
>
> Yes, the point of adding assertions is to confirm a hypothesis.
I've now confirmed that it does indeed trigger the assert in
smc91c111_receive().
> >> Also, do you have a more specific reproduce case so I can try
> >> to replicate the problem here?
> >
> > Not sure how familiar you are with the yocto project?
>
> I don't really know about the Yocto project, no. What I need
> is a set of instructions I can follow ("download this, run this
> command line", etc) so I can reproduce it on my machine.
I'm running some tests at the moment to see which environments this is
reproducible in and will try to distil reproduction steps based on
that.
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 12:43 ` Richard Purdie
@ 2015-09-04 17:20 ` Richard Purdie
2015-09-04 17:30 ` Peter Maydell
0 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-04 17:20 UTC (permalink / raw)
To: Peter Maydell; +Cc: qemu-devel
On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> > On 4 September 2015 at 12:24, Richard Purdie
> > <richard.purdie@linuxfoundation.org> wrote:
> > > So just based on that, yes, seems that the rx_fifo looks to be
> > > overrunning. I can add the asserts but I think it would just confirm
> > > this.
> >
> > Yes, the point of adding assertions is to confirm a hypothesis.
>
> I've now confirmed that it does indeed trigger the assert in
> smc91c111_receive().
I just tried an experiment where I put:
if (s->rx_fifo_len >= NUM_PACKETS)
return -1;
into smc91c111_receive() and my reproducer stops reproducing the
problem. I also noticed that can_receive() could have a check on buffer
availability. Would one of these changes be the correct fix here?
(Still working on a reproducer, ended up fixing the other test
continuation issues so the failure is more obvious).
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 17:20 ` Richard Purdie
@ 2015-09-04 17:30 ` Peter Maydell
2015-09-05 20:30 ` Peter Crosthwaite
0 siblings, 1 reply; 20+ messages in thread
From: Peter Maydell @ 2015-09-04 17:30 UTC (permalink / raw)
To: Richard Purdie; +Cc: qemu-devel
On 4 September 2015 at 18:20, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>> > On 4 September 2015 at 12:24, Richard Purdie
>> > <richard.purdie@linuxfoundation.org> wrote:
>> > > So just based on that, yes, seems that the rx_fifo looks to be
>> > > overrunning. I can add the asserts but I think it would just confirm
>> > > this.
>> >
>> > Yes, the point of adding assertions is to confirm a hypothesis.
>>
>> I've now confirmed that it does indeed trigger the assert in
>> smc91c111_receive().
>
> I just tried an experiment where I put:
>
> if (s->rx_fifo_len >= NUM_PACKETS)
> return -1;
>
> into smc91c111_receive() and my reproducer stops reproducing the
> problem. I also noticed can_receive() could also have a check on buffer
> availability. Would one of these changes be the correct fix here?
The interesting question is why smc91c111_allocate_packet() doesn't
fail in this situation. We only have NUM_PACKETS worth of storage,
shared between the tx and rx buffers, so how could we both have
already filled the rx_fifo and have a spare packet for the allocate
function to return?
-- PMM
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-04 17:30 ` Peter Maydell
@ 2015-09-05 20:30 ` Peter Crosthwaite
2015-09-06 14:21 ` Richard Purdie
0 siblings, 1 reply; 20+ messages in thread
From: Peter Crosthwaite @ 2015-09-05 20:30 UTC (permalink / raw)
To: Peter Maydell; +Cc: Richard Purdie, qemu-devel
On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> On 4 September 2015 at 18:20, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
>> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>>> > On 4 September 2015 at 12:24, Richard Purdie
>>> > <richard.purdie@linuxfoundation.org> wrote:
>>> > > So just based on that, yes, seems that the rx_fifo looks to be
>>> > > overrunning. I can add the asserts but I think it would just confirm
>>> > > this.
>>> >
>>> > Yes, the point of adding assertions is to confirm a hypothesis.
>>>
>>> I've now confirmed that it does indeed trigger the assert in
>>> smc91c111_receive().
>>
>> I just tried an experiment where I put:
>>
>> if (s->rx_fifo_len >= NUM_PACKETS)
>> return -1;
>>
>> into smc91c111_receive() and my reproducer stops reproducing the
>> problem.
Does it just stop the crash, or does it eliminate the problem
completely, with the network now fully working?
>> I also noticed can_receive() could also have a check on buffer
>> availability. Would one of these changes be the correct fix here?
>
> The interesting question is why smc91c111_allocate_packet() doesn't
> fail in this situation. We only have NUM_PACKETS worth of storage,
> shared between the tx and rx buffers, so how could we both have
> already filled the rx_fifo and have a spare packet for the allocate
> function to return?
Maybe this:
case 5: /* Release. */
smc91c111_release_packet(s, s->packet_num);
break;
The guest is able to free an allocated packet without the accompanying
pop of the tx/rx fifo. This may suggest some sort of guest error?
The fix depends on the behaviour of the real hardware. If that MMIO op
is supposed to dequeue the corresponding queue entry then we may need
to patch that logic to search the queues and dequeue it. Otherwise
we need to find out the genuine length of the rx queue and clamp it
with something like Richard's patch. There are a few other bits and
pieces that suggest the guest can have independent control of the
queues and allocated buffers, but I'm confused as to how the rx fifo
length can get up to 10 in any case.
Regards,
Peter
>
> -- PMM
>
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-05 20:30 ` Peter Crosthwaite
@ 2015-09-06 14:21 ` Richard Purdie
2015-09-06 18:37 ` Peter Crosthwaite
0 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-06 14:21 UTC (permalink / raw)
To: Peter Crosthwaite; +Cc: Peter Maydell, qemu-devel
On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> > On 4 September 2015 at 18:20, Richard Purdie
> > <richard.purdie@linuxfoundation.org> wrote:
> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> >>> > On 4 September 2015 at 12:24, Richard Purdie
> >>> > <richard.purdie@linuxfoundation.org> wrote:
> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
> >>> > > overrunning. I can add the asserts but I think it would just confirm
> >>> > > this.
> >>> >
> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
> >>>
> >>> I've now confirmed that it does indeed trigger the assert in
> >>> smc91c111_receive().
> >>
> >> I just tried an experiment where I put:
> >>
> >> if (s->rx_fifo_len >= NUM_PACKETS)
> >> return -1;
> >>
> >> into smc91c111_receive() and my reproducer stops reproducing the
> >> problem.
>
> Does it just stop the crash or does it eliminate the problem
> completely with a fully now-working network?
It stops the crash, the network works great.
> >> I also noticed can_receive() could also have a check on buffer
> >> availability. Would one of these changes be the correct fix here?
> >
> > The interesting question is why smc91c111_allocate_packet() doesn't
> > fail in this situation. We only have NUM_PACKETS worth of storage,
> > shared between the tx and rx buffers, so how could we both have
> > already filled the rx_fifo and have a spare packet for the allocate
> > function to return?
>
> Maybe this:
>
> case 5: /* Release. */
> smc91c111_release_packet(s, s->packet_num);
> break;
>
> The guest is able to free an allocated packet without the accompanying
> pop of tx/rx fifo. This may suggest some sort of guest error?
>
> The fix depends on the behaviour of the real hardware. If that MMIO op
> is supposed to dequeue the corresponding queue entry then we may need
> to patch that logic to do search the queues and dequeue it. Otherwise
> we need to find out the genuine length of the rx queue, and clamp it
> without something like Richards patch. There are a few other bits and
> pieces that suggest the guest can have independent control of the
> queues and allocated buffers but i'm confused as to how the rx fifo
> length can get up to 10 in any case.
I think I have a handle on what is going on. smc91c111_release_packet()
changes s->allocated but not rx_fifo, and can_receive() only looks at
s->allocated. We can trigger new network packets to arrive from
smc91c111_release_packet(), which calls qemu_flush_queued_packets()
*before* we change rx_fifo, and this can loop.
The patch below, which explicitly orders the qemu_flush_queued_packets()
call, resolved the test case I was able to reproduce this problem in.
So there are three ways to fix this: either can_receive() needs to check
both s->allocated and rx_fifo, or the code is more explicit about when
qemu_flush_queued_packets() is called (as per my patch below), or the
order in case 4, where smc91c111_release_packet() and then
smc91c111_pop_rx_fifo(s) are called, is reversed. I also tested the
latter, which works as well, albeit with uglier code.
The problem is much more reproducible with the assert, btw; booting a
qemu image with this and hitting the network interface with scp of a few
large files is usually enough.
So which patch would be preferred? :)
Cheers,
Richard
Index: qemu-2.4.0/hw/net/smc91c111.c
===================================================================
--- qemu-2.4.0.orig/hw/net/smc91c111.c
+++ qemu-2.4.0/hw/net/smc91c111.c
@@ -185,7 +185,6 @@ static void smc91c111_release_packet(smc
s->allocated &= ~(1 << packet);
if (s->tx_alloc == 0x80)
smc91c111_tx_alloc(s);
- qemu_flush_queued_packets(qemu_get_queue(s->nic));
}
/* Flush the TX FIFO. */
@@ -237,9 +236,11 @@ static void smc91c111_do_tx(smc91c111_st
}
}
#endif
- if (s->ctr & CTR_AUTO_RELEASE)
+ if (s->ctr & CTR_AUTO_RELEASE) {
/* Race? */
smc91c111_release_packet(s, packetnum);
+ qemu_flush_queued_packets(qemu_get_queue(s->nic));
+ }
else if (s->tx_fifo_done_len < NUM_PACKETS)
s->tx_fifo_done[s->tx_fifo_done_len++] = packetnum;
qemu_send_packet(qemu_get_queue(s->nic), p, len);
@@ -379,9 +380,11 @@ static void smc91c111_writeb(void *opaqu
smc91c111_release_packet(s, s->rx_fifo[0]);
}
smc91c111_pop_rx_fifo(s);
+ qemu_flush_queued_packets(qemu_get_queue(s->nic));
break;
case 5: /* Release. */
smc91c111_release_packet(s, s->packet_num);
+ qemu_flush_queued_packets(qemu_get_queue(s->nic));
break;
case 6: /* Add to TX FIFO. */
smc91c111_queue_tx(s, s->packet_num);
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-06 14:21 ` Richard Purdie
@ 2015-09-06 18:37 ` Peter Crosthwaite
2015-09-06 23:26 ` Richard Purdie
2015-09-07 18:42 ` Peter Maydell
0 siblings, 2 replies; 20+ messages in thread
From: Peter Crosthwaite @ 2015-09-06 18:37 UTC (permalink / raw)
To: Richard Purdie; +Cc: Peter Maydell, qemu-devel
On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
>> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> > On 4 September 2015 at 18:20, Richard Purdie
>> > <richard.purdie@linuxfoundation.org> wrote:
>> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>> >>> > On 4 September 2015 at 12:24, Richard Purdie
>> >>> > <richard.purdie@linuxfoundation.org> wrote:
>> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
>> >>> > > overrunning. I can add the asserts but I think it would just confirm
>> >>> > > this.
>> >>> >
>> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
>> >>>
>> >>> I've now confirmed that it does indeed trigger the assert in
>> >>> smc91c111_receive().
>> >>
>> >> I just tried an experiment where I put:
>> >>
>> >> if (s->rx_fifo_len >= NUM_PACKETS)
>> >> return -1;
>> >>
>> >> into smc91c111_receive() and my reproducer stops reproducing the
>> >> problem.
>>
>> Does it just stop the crash or does it eliminate the problem
>> completely with a fully now-working network?
>
> It stops the crash, the network works great.
>
>> >> I also noticed can_receive() could also have a check on buffer
>> >> availability. Would one of these changes be the correct fix here?
>> >
>> > The interesting question is why smc91c111_allocate_packet() doesn't
>> > fail in this situation. We only have NUM_PACKETS worth of storage,
>> > shared between the tx and rx buffers, so how could we both have
>> > already filled the rx_fifo and have a spare packet for the allocate
>> > function to return?
>>
>> Maybe this:
>>
>> case 5: /* Release. */
>> smc91c111_release_packet(s, s->packet_num);
>> break;
>>
>> The guest is able to free an allocated packet without the accompanying
>> pop of tx/rx fifo. This may suggest some sort of guest error?
>>
>> The fix depends on the behaviour of the real hardware. If that MMIO op
>> is supposed to dequeue the corresponding queue entry then we may need
>> to patch that logic to do search the queues and dequeue it. Otherwise
>> we need to find out the genuine length of the rx queue, and clamp it
>> without something like Richards patch. There are a few other bits and
>> pieces that suggest the guest can have independent control of the
>> queues and allocated buffers but i'm confused as to how the rx fifo
>> length can get up to 10 in any case.
>
> I think I have a handle on what is going on. smc91c111_release_packet()
> changes s->allocated() but not rx_fifo. can_receive() only looks at
> s->allocated. We can trigger new network packets to arrive from
> smc91c111_release_packet() which calls qemu_flush_queued_packets()
> *before* we change rx_fifo and this can loop.
>
> The patch below which explicitly orders the qemu_flush_queued_packets()
> call resolved the test case I was able to reproduce this problem in.
>
> So there are three ways to fix this, either can_receive() needs to check
> both s->allocated() and rx_fifo,
This is probably the winner for me.
> or the code is more explicit about when
> qemu_flush_queued_packets() is called (as per my patch below), or the
> case 4 where smc91c111_release_packet() and then
> smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
> which also works, albeit with more ugly code.
>
> The problem is much more reproducible with the assert btw, booting a
> qemu image with this and hitting the network interface with scp of a few
> large files is usually enough.
>
> So which patch would be preferred? :)
>
> Cheers,
>
> Richard
>
>
>
> Index: qemu-2.4.0/hw/net/smc91c111.c
> ===================================================================
> --- qemu-2.4.0.orig/hw/net/smc91c111.c
> +++ qemu-2.4.0/hw/net/smc91c111.c
> @@ -185,7 +185,6 @@ static void smc91c111_release_packet(smc
> s->allocated &= ~(1 << packet);
> if (s->tx_alloc == 0x80)
> smc91c111_tx_alloc(s);
> - qemu_flush_queued_packets(qemu_get_queue(s->nic));
> }
>
> /* Flush the TX FIFO. */
> @@ -237,9 +236,11 @@ static void smc91c111_do_tx(smc91c111_st
> }
> }
> #endif
> - if (s->ctr & CTR_AUTO_RELEASE)
> + if (s->ctr & CTR_AUTO_RELEASE) {
> /* Race? */
> smc91c111_release_packet(s, packetnum);
> + qemu_flush_queued_packets(qemu_get_queue(s->nic));
> + }
> else if (s->tx_fifo_done_len < NUM_PACKETS)
> s->tx_fifo_done[s->tx_fifo_done_len++] = packetnum;
> qemu_send_packet(qemu_get_queue(s->nic), p, len);
> @@ -379,9 +380,11 @@ static void smc91c111_writeb(void *opaqu
> smc91c111_release_packet(s, s->rx_fifo[0]);
> }
> smc91c111_pop_rx_fifo(s);
> + qemu_flush_queued_packets(qemu_get_queue(s->nic));
> break;
> case 5: /* Release. */
> smc91c111_release_packet(s, s->packet_num);
> + qemu_flush_queued_packets(qemu_get_queue(s->nic));
I think this could still cause the same issue here, though. The guest
can release first (case 5) without a corresponding fifo pop (case 3),
which actually means that the first patch idea might be the winner, as
it catches this issue too. What is supposed to happen if the guest
does a case 5 while the rx_fifo is full, without a matching case 3, and
then a packet arrives on the wire?
Regards,
Peter
> break;
> case 6: /* Add to TX FIFO. */
> smc91c111_queue_tx(s, s->packet_num);
>
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-06 18:37 ` Peter Crosthwaite
@ 2015-09-06 23:26 ` Richard Purdie
2015-09-07 0:48 ` Peter Crosthwaite
2015-09-07 18:42 ` Peter Maydell
1 sibling, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-06 23:26 UTC (permalink / raw)
To: Peter Crosthwaite; +Cc: Peter Maydell, qemu-devel
On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> >> > On 4 September 2015 at 18:20, Richard Purdie
> >> > <richard.purdie@linuxfoundation.org> wrote:
> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
> >> >>> > On 4 September 2015 at 12:24, Richard Purdie
> >> >>> > <richard.purdie@linuxfoundation.org> wrote:
> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
> >> >>> > > overrunning. I can add the asserts but I think it would just confirm
> >> >>> > > this.
> >> >>> >
> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
> >> >>>
> >> >>> I've now confirmed that it does indeed trigger the assert in
> >> >>> smc91c111_receive().
> >> >>
> >> >> I just tried an experiment where I put:
> >> >>
> >> >> if (s->rx_fifo_len >= NUM_PACKETS)
> >> >> return -1;
> >> >>
> >> >> into smc91c111_receive() and my reproducer stops reproducing the
> >> >> problem.
> >>
> >> Does it just stop the crash or does it eliminate the problem
> >> completely with a fully now-working network?
> >
> > It stops the crash, the network works great.
> >
> >> >> I also noticed can_receive() could also have a check on buffer
> >> >> availability. Would one of these changes be the correct fix here?
> >> >
> >> > The interesting question is why smc91c111_allocate_packet() doesn't
> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
> >> > shared between the tx and rx buffers, so how could we both have
> >> > already filled the rx_fifo and have a spare packet for the allocate
> >> > function to return?
> >>
> >> Maybe this:
> >>
> >> case 5: /* Release. */
> >> smc91c111_release_packet(s, s->packet_num);
> >> break;
> >>
> >> The guest is able to free an allocated packet without the accompanying
> >> pop of tx/rx fifo. This may suggest some sort of guest error?
> >>
> >> The fix depends on the behaviour of the real hardware. If that MMIO op
> >> is supposed to dequeue the corresponding queue entry then we may need
> >> to patch that logic to do search the queues and dequeue it. Otherwise
> >> we need to find out the genuine length of the rx queue, and clamp it
> >> without something like Richards patch. There are a few other bits and
> >> pieces that suggest the guest can have independent control of the
> >> queues and allocated buffers but i'm confused as to how the rx fifo
> >> length can get up to 10 in any case.
> >
> > I think I have a handle on what is going on. smc91c111_release_packet()
> > changes s->allocated() but not rx_fifo. can_receive() only looks at
> > s->allocated. We can trigger new network packets to arrive from
> > smc91c111_release_packet() which calls qemu_flush_queued_packets()
> > *before* we change rx_fifo and this can loop.
> >
> > The patch below which explicitly orders the qemu_flush_queued_packets()
> > call resolved the test case I was able to reproduce this problem in.
> >
> > So there are three ways to fix this, either can_receive() needs to check
> > both s->allocated() and rx_fifo,
>
> This is probably the winner for me.
>
> > or the code is more explicit about when
> > qemu_flush_queued_packets() is called (as per my patch below), or the
> > case 4 where smc91c111_release_packet() and then
> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
> > which also works, albeit with more ugly code.
It seems can_receive isn't enough; we'd need to put some checks into
receive itself, since once can_receive says "yes", multiple packets can
arrive at _receive without further checks of can_receive. I've either
messed up my previous test or been lucky.
I tested an assert in _receive() which confirms it can be called when
can_receive() says it isn't ready.
If we return -1 in _receive, the code will stop sending packets and all
works as it should; it recovers just fine. So I think that is looking
like the correct fix. I'd note that it already effectively has half of
this check in the allocate_packet call; it's just missing the
rx_fifo_len one.
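To make that concrete, the guard I'm testing looks roughly like this
(a sketch; the signature and helper names are from memory of the 2.4.0
source, not verbatim):
    /* sketch: refuse packets at the top of smc91c111_receive() when
     * the RX FIFO is already full; rest of the function unchanged */
    static ssize_t smc91c111_receive(NetClientState *nc,
                                     const uint8_t *buf, size_t size)
    {
        smc91c111_state *s = qemu_get_nic_opaque(nc);

        if (s->rx_fifo_len >= NUM_PACKETS) {
            return -1;    /* tell the net layer we can't take it yet */
        }
        /* ... existing allocate_packet check and copy-in path ... */
    }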
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-06 23:26 ` Richard Purdie
@ 2015-09-07 0:48 ` Peter Crosthwaite
2015-09-07 7:09 ` Richard Purdie
` (2 more replies)
0 siblings, 3 replies; 20+ messages in thread
From: Peter Crosthwaite @ 2015-09-07 0:48 UTC (permalink / raw)
To: Richard Purdie; +Cc: Peter Maydell, qemu-devel
On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
>> On Sun, Sep 6, 2015 at 7:21 AM, Richard Purdie
>> <richard.purdie@linuxfoundation.org> wrote:
>> > On Sat, 2015-09-05 at 13:30 -0700, Peter Crosthwaite wrote:
>> >> On Fri, Sep 4, 2015 at 10:30 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
>> >> > On 4 September 2015 at 18:20, Richard Purdie
>> >> > <richard.purdie@linuxfoundation.org> wrote:
>> >> >> On Fri, 2015-09-04 at 13:43 +0100, Richard Purdie wrote:
>> >> >>> On Fri, 2015-09-04 at 12:31 +0100, Peter Maydell wrote:
>> >> >>> > On 4 September 2015 at 12:24, Richard Purdie
>> >> >>> > <richard.purdie@linuxfoundation.org> wrote:
>> >> >>> > > So just based on that, yes, seems that the rx_fifo looks to be
>> >> >>> > > overrunning. I can add the asserts but I think it would just confirm
>> >> >>> > > this.
>> >> >>> >
>> >> >>> > Yes, the point of adding assertions is to confirm a hypothesis.
>> >> >>>
>> >> >>> I've now confirmed that it does indeed trigger the assert in
>> >> >>> smc91c111_receive().
>> >> >>
>> >> >> I just tried an experiment where I put:
>> >> >>
>> >> >> if (s->rx_fifo_len >= NUM_PACKETS)
>> >> >> return -1;
>> >> >>
>> >> >> into smc91c111_receive() and my reproducer stops reproducing the
>> >> >> problem.
>> >>
>> >> Does it just stop the crash or does it eliminate the problem
>> >> completely with a fully now-working network?
>> >
>> > It stops the crash, the network works great.
>> >
>> >> >> I also noticed can_receive() could also have a check on buffer
>> >> >> availability. Would one of these changes be the correct fix here?
>> >> >
>> >> > The interesting question is why smc91c111_allocate_packet() doesn't
>> >> > fail in this situation. We only have NUM_PACKETS worth of storage,
>> >> > shared between the tx and rx buffers, so how could we both have
>> >> > already filled the rx_fifo and have a spare packet for the allocate
>> >> > function to return?
>> >>
>> >> Maybe this:
>> >>
>> >> case 5: /* Release. */
>> >> smc91c111_release_packet(s, s->packet_num);
>> >> break;
>> >>
>> >> The guest is able to free an allocated packet without the accompanying
>> >> pop of tx/rx fifo. This may suggest some sort of guest error?
>> >>
>> >> The fix depends on the behaviour of the real hardware. If that MMIO op
>> >> is supposed to dequeue the corresponding queue entry then we may need
>> >> to patch that logic to do search the queues and dequeue it. Otherwise
>> >> we need to find out the genuine length of the rx queue, and clamp it
>> >> without something like Richards patch. There are a few other bits and
>> >> pieces that suggest the guest can have independent control of the
>> >> queues and allocated buffers but i'm confused as to how the rx fifo
>> >> length can get up to 10 in any case.
>> >
>> > I think I have a handle on what is going on. smc91c111_release_packet()
>> > changes s->allocated() but not rx_fifo. can_receive() only looks at
>> > s->allocated. We can trigger new network packets to arrive from
>> > smc91c111_release_packet() which calls qemu_flush_queued_packets()
>> > *before* we change rx_fifo and this can loop.
>> >
>> > The patch below which explicitly orders the qemu_flush_queued_packets()
>> > call resolved the test case I was able to reproduce this problem in.
>> >
>> > So there are three ways to fix this, either can_receive() needs to check
>> > both s->allocated() and rx_fifo,
>>
>> This is probably the winner for me.
>>
>> > or the code is more explicit about when
>> > qemu_flush_queued_packets() is called (as per my patch below), or the
>> > case 4 where smc91c111_release_packet() and then
>> > smc91c111_pop_rx_fifo(s) is called is reversed. I also tested the latter
>> > which also works, albeit with more ugly code.
>
> It seems can_receive isn't enough, we'd need to put some checks into
> receive itself since once can_receive says "yes", multiple packets can
> arrive to _receive without further checks of can_receive.
This doesn't sound right. There are other network controllers that
rely on can_receive catching all cases properly. Is this a regression?
Looking at the logs, I see some refactoring of the QEMU net framework
around the June timeframe; if you rewind to QEMU 2.3 (or earlier) does
the bug go away?
> I've either
> messed up my previous test or been lucky.
>
> I tested an assert in _recieve() which confirms it can be called when
> can_receive() says it isn't ready.
>
A backtrace of this would be handy.
What is your reproducer? I have core-image-minimal handy but it has no
scp or sshd, so all I can think of to stress the network is wget, but
that doesn't replicate the problem.
Regards,
Peter
> If we return -1 in _receive, the code will stop sending packets and all
> works as it should, it recovers just fine. So I think that is looking
> like the correct fix. I'd note that it already effectively has half this
> check in the allocate_packet call, its just missing the rx_fifo_len one.
>
> Cheers,
>
> Richard
>
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 0:48 ` Peter Crosthwaite
@ 2015-09-07 7:09 ` Richard Purdie
2015-09-07 18:05 ` Peter Crosthwaite
2015-09-07 7:18 ` Richard Purdie
2015-09-07 7:47 ` Richard Purdie
2 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-07 7:09 UTC (permalink / raw)
To: Peter Crosthwaite; +Cc: Peter Maydell, qemu-devel
On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> > It seems can_receive isn't enough, we'd need to put some checks into
> > receive itself since once can_receive says "yes", multiple packets can
> > arrive to _receive without further checks of can_receive.
>
> This doesn't sound right. There are other network controllers that
> rely of can_receive catching all cases properly. Is this a regression?
> Looking at logs, I see some refactoring of QEMU net framework around
> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
> away?
We weren't seeing this problem until we upgraded to 2.4.
>
> > I've either
> > messed up my previous test or been lucky.
> >
> > I tested an assert in _recieve() which confirms it can be called when
> > can_receive() says it isn't ready.
> >
>
> A backtrace of this would be handy.
>
> What is your replicator? I have core-image-minimal handy but it has no
> scp or sshd so all I can think of to stress network is wget, but that
> doesn't replicate.
I've been using a core-image-sato and running "bitbake core-image-sato
-c testimage", which runs a set of tests against the target image. It
invariably crashes on the scp test when I put an assert in receive().
To make it simpler, if I just "runqemu qemuarm" to boot the
core-image-sato, then scp a 5MB file consisting of zeros into the image,
it hits the assert after around 2% transferred.
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 0:48 ` Peter Crosthwaite
2015-09-07 7:09 ` Richard Purdie
@ 2015-09-07 7:18 ` Richard Purdie
2015-09-07 7:47 ` Richard Purdie
2 siblings, 0 replies; 20+ messages in thread
From: Richard Purdie @ 2015-09-07 7:18 UTC (permalink / raw)
To: Peter Crosthwaite; +Cc: Peter Maydell, qemu-devel
On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
> On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
> > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
> > I tested an assert in _recieve() which confirms it can be called when
> > can_receive() says it isn't ready.
> >
>
> A backtrace of this would be handy.
This is the trace with my assert on smc91c111_can_receive(nc) being
false when we're in receive(), before the allocate_packet call:
#0 0x00007f355f276267 in __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:55
#1 0x00007f355f277eca in __GI_abort () at abort.c:89
#2 0x00007f355f26f03d in __assert_fail_base (fmt=0x7f355f3d1028 "%s%s%s:%u: %s%sAssertion `%s' failed.\n%n",
assertion=assertion@entry=0x7f3562158ed7 "smc91c111_can_receive(nc)",
file=file@entry=0x7f3562158dc8 "/media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c", line=line@entry=680, function=function@entry=0x7f35621591b0 <__PRETTY_FUNCTION__.26130> "smc91c111_receive") at assert.c:92
#3 0x00007f355f26f0f2 in __GI___assert_fail (assertion=assertion@entry=0x7f3562158ed7 "smc91c111_can_receive(nc)",
file=file@entry=0x7f3562158dc8 "/media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c", line=line@entry=680, function=function@entry=0x7f35621591b0 <__PRETTY_FUNCTION__.26130> "smc91c111_receive")
at assert.c:101
#4 0x00007f3561fca4d0 in smc91c111_receive (nc=0x7f3563604d10, buf=0x7f353c09d028 "RT", size=1514)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:680
#5 0x00007f356203058b in qemu_deliver_packet (sender=<optimised out>, flags=<optimised out>, data=data@entry=0x7f353c09d028 "RT",
size=<optimised out>, opaque=0x7f3563604d10)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/net.c:577
#6 0x00007f3562031eaa in qemu_net_queue_deliver (size=<optimised out>, data=<optimised out>, flags=<optimised out>,
sender=<optimised out>, queue=0x7f3563604e70)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/queue.c:157
#7 qemu_net_queue_flush (queue=0x7f3563604e70)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/queue.c:254
#8 0x00007f356202fc7c in qemu_flush_or_purge_queued_packets (nc=0x7f3563604d10, purge=<optimised out>)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/net/net.c:606
#9 0x00007f3561fcacec in smc91c111_writeb (opaque=0x7f35635178f0, offset=<optimised out>, value=128)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:382
#10 0x00007f3561fcaf84 in smc91c111_writew (opaque=0x7f35635178f0, offset=0, value=128)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/hw/net/smc91c111.c:612
#11 0x00007f3561e52cc5 in memory_region_oldmmio_write_accessor (mr=<optimised out>, addr=<optimised out>, value=<optimised out>,
size=<optimised out>, shift=<optimised out>, mask=<optimised out>, attrs=...)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:434
#12 0x00007f3561e5227d in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7f35572d93f8, size=size@entry=2,
access_size_min=<optimised out>, access_size_max=<optimised out>,
access=0x7f3561e52c90 <memory_region_oldmmio_write_accessor>, mr=0x7f356351bc80, attrs=...)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:506
#13 0x00007f3561e53d5b in memory_region_dispatch_write (mr=mr@entry=0x7f356351bc80, addr=0, data=128, size=size@entry=2,
attrs=attrs@entry=...) at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:1171
#14 0x00007f3561e1fc51 in address_space_rw (as=<optimised out>, addr=268500992, attrs=..., buf=buf@entry=0x7f35572d94c0 "\200",
len=2, is_write=is_write@entry=true)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2445
#15 0x00007f3561e1fda0 in address_space_write (len=<optimised out>, buf=0x7f35572d94c0 "\200", attrs=..., addr=<optimised out>,
as=<optimised out>) at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2521
#16 subpage_write (opaque=<optimised out>, addr=<optimised out>, value=<optimised out>, len=<optimised out>, attrs=...)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/exec.c:2081
#17 0x00007f3561e5227d in access_with_adjusted_size (addr=addr@entry=0, value=value@entry=0x7f35572d9568, size=size@entry=2,
access_size_min=<optimised out>, access_size_max=<optimised out>,
access=0x7f3561e521a0 <memory_region_write_with_attrs_accessor>, mr=0x7f356375e360, attrs=...)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:506
#18 0x00007f3561e53d5b in memory_region_dispatch_write (mr=0x7f356375e360, addr=0, data=128, size=2, attrs=...)
at /media/build1/poky/build/tmp-lsb/work/x86_64-linux/qemu-native/2.4.0-r1/qemu-2.4.0/memory.c:1171
#19 0x00007f3559342734 in ?? ()
The rest of the stack is just ??. So some write into the smc91c111
memory region triggers qemu_flush_or_purge_queued_packets(), which then
keeps delivering packets, as far as I can tell.
Do we need to check can_receive() before triggering the
qemu_flush_or_purge_queued_packets() call? The code is clearly calling
this in places where the driver simply isn't ready.
(qemu_flush_queued_packets() == qemu_flush_or_purge_queued_packets())
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 0:48 ` Peter Crosthwaite
2015-09-07 7:09 ` Richard Purdie
2015-09-07 7:18 ` Richard Purdie
@ 2015-09-07 7:47 ` Richard Purdie
2015-09-07 9:21 ` Peter Maydell
2 siblings, 1 reply; 20+ messages in thread
From: Richard Purdie @ 2015-09-07 7:47 UTC (permalink / raw)
To: Peter Crosthwaite; +Cc: Peter Maydell, qemu-devel
On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
> This doesn't sound right. There are other network controllers that
> rely of can_receive catching all cases properly. Is this a regression?
> Looking at logs, I see some refactoring of QEMU net framework around
> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
> away?
I did find an interesting comment in this commit:
http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
"""
Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
net queues need to be explicitly flushed after qemu_can_send_packet()
returns false, because the netdev side will disable the polling of fd.
"""
smc91c111 is calling flush functions when it knows can_receive
would/should return false. I believe that is the bug here.
I suspect the driver needs:
* can_receive to actually return the right value
* the flush calls to be placed where there is receive space
This could explain what changed to break this and why moving the flush
calls works in my patch.
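A rough sketch of what those two points could look like (assuming
can_receive() currently only consults s->allocated; the fullness test
and signatures here are assumptions, not the actual 2.4.0 code):
    /* sketch: have can_receive() reflect RX FIFO space as well as
     * packet-buffer allocation (the s->allocated test is assumed) */
    static int smc91c111_can_receive(NetClientState *nc)
    {
        smc91c111_state *s = qemu_get_nic_opaque(nc);

        return s->rx_fifo_len < NUM_PACKETS
               && s->allocated != (1 << NUM_PACKETS) - 1;
    }

    /* sketch: only kick the queued packets when we can actually
     * accept one, instead of flushing unconditionally */
    if (smc91c111_can_receive(qemu_get_queue(s->nic))) {
        qemu_flush_queued_packets(qemu_get_queue(s->nic));
    }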
Cheers,
Richard
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 7:47 ` Richard Purdie
@ 2015-09-07 9:21 ` Peter Maydell
2015-09-07 18:12 ` Peter Crosthwaite
2015-09-08 9:55 ` Jason Wang
0 siblings, 2 replies; 20+ messages in thread
From: Peter Maydell @ 2015-09-07 9:21 UTC (permalink / raw)
To: Richard Purdie; +Cc: Jason Wang, Stefan Hajnoczi, Peter Crosthwaite, qemu-devel
CCing the net maintainers on this thread seems like it would
be a good idea...
On 7 September 2015 at 08:47, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>> This doesn't sound right. There are other network controllers that
>> rely of can_receive catching all cases properly. Is this a regression?
>> Looking at logs, I see some refactoring of QEMU net framework around
>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>> away?
>
> I did find an interesting comment in this commit:
>
> http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
>
> """
> Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
> net queues need to be explicitly flushed after qemu_can_send_packet()
> returns false, because the netdev side will disable the polling of fd.
> """
>
> smc91x111 is calling flush functions when it knows can_receive
> would/should return false. I believe that is the bug here.
>
> I suspect the driver needs:
>
> * can_receive to actually return the right value
> * the locations of the flush calls to be when there is receive space
>
> This could explain what changed to break this and why moving the flush
> calls works in my patch.
thanks
-- PMM
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 7:09 ` Richard Purdie
@ 2015-09-07 18:05 ` Peter Crosthwaite
0 siblings, 0 replies; 20+ messages in thread
From: Peter Crosthwaite @ 2015-09-07 18:05 UTC (permalink / raw)
To: Richard Purdie; +Cc: Peter Maydell, qemu-devel
On Mon, Sep 7, 2015 at 12:09 AM, Richard Purdie
<richard.purdie@linuxfoundation.org> wrote:
> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>> On Sun, Sep 6, 2015 at 4:26 PM, Richard Purdie
>> <richard.purdie@linuxfoundation.org> wrote:
>> > On Sun, 2015-09-06 at 11:37 -0700, Peter Crosthwaite wrote:
>> > It seems can_receive isn't enough, we'd need to put some checks into
>> > receive itself since once can_receive says "yes", multiple packets can
>> > arrive to _receive without further checks of can_receive.
>>
>> This doesn't sound right. There are other network controllers that
>> rely of can_receive catching all cases properly. Is this a regression?
>> Looking at logs, I see some refactoring of QEMU net framework around
>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>> away?
>
> We weren't seeing this problem until we upgraded to 2.4.
>
>>
>> > I've either
>> > messed up my previous test or been lucky.
>> >
>> > I tested an assert in _recieve() which confirms it can be called when
>> > can_receive() says it isn't ready.
>> >
>>
>> A backtrace of this would be handy.
>>
>> What is your replicator? I have core-image-minimal handy but it has no
>> scp or sshd so all I can think of to stress network is wget, but that
>> doesn't replicate.
>
> I've been using a core-image-sato and using the "bitbake core-image-sato
> -c testimage" which runs a set of tests against the target image. It
> invariably crashes on the scp test when I put an assert in receive().
>
> To make it simpler, if I just "runqemu qemuarm" to boot the
> core-image-sato, then scp a 5MB file consisting of zeros into the image,
> it hits the assert after around 2% transferred.
>
I can't replicate. The 5MB scp works for me. Can you bisect qemu?
Regards,
Peter
> Cheers,
>
> Richard
>
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 9:21 ` Peter Maydell
@ 2015-09-07 18:12 ` Peter Crosthwaite
2015-09-08 9:55 ` Jason Wang
1 sibling, 0 replies; 20+ messages in thread
From: Peter Crosthwaite @ 2015-09-07 18:12 UTC (permalink / raw)
To: Peter Maydell; +Cc: Stefan Hajnoczi, Jason Wang, Richard Purdie, qemu-devel
On Mon, Sep 7, 2015 at 2:21 AM, Peter Maydell <peter.maydell@linaro.org> wrote:
> CCing the net maintainers on this thread seems like it would
> be a good idea...
>
> On 7 September 2015 at 08:47, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
>> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>>> This doesn't sound right. There are other network controllers that
>>> rely of can_receive catching all cases properly. Is this a regression?
>>> Looking at logs, I see some refactoring of QEMU net framework around
>>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>>> away?
>>
>> I did find an interesting comment in this commit:
>>
>> http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
>>
>> """
>> Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
>> net queues need to be explicitly flushed after qemu_can_send_packet()
>> returns false, because the netdev side will disable the polling of fd.
>> """
>>
>> smc91x111 is calling flush functions when it knows can_receive
>> would/should return false. I believe that is the bug here.
>>
>> I suspect the driver needs:
>>
>> * can_receive to actually return the right value
Agreed, I think we have that figured out further up the thread.
>> * the locations of the flush calls to be when there is receive space
>>
So my understanding is the net layer should be able to properly handle
a spurious flush. This avoids every device model having to do
"if (can_receive) { flush(); }" rather than just flush(). There are
other enet controllers that do this, e.g. xilinx_axienet and e1000. Has
this API changed in the net layer?
Regards,
Peter
>> This could explain what changed to break this and why moving the flush
>> calls works in my patch.
>
> thanks
> -- PMM
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-06 18:37 ` Peter Crosthwaite
2015-09-06 23:26 ` Richard Purdie
@ 2015-09-07 18:42 ` Peter Maydell
1 sibling, 0 replies; 20+ messages in thread
From: Peter Maydell @ 2015-09-07 18:42 UTC (permalink / raw)
To: Peter Crosthwaite; +Cc: Richard Purdie, qemu-devel
On 6 September 2015 at 19:37, Peter Crosthwaite
<crosthwaitepeter@gmail.com> wrote:
> This could still cause the same issue here though I think. The guest
> can release first (case 5) without a corresponding fifo pop (case 3),
> which actually means that the first patch idea might be the winner as
> it catches this issue too. What is supposed to happen if the guest
> does a case 5 while the rx_fifo is full without a matching case 3, and
> then a packet arrives on the wire?
The data sheet seems to be unclear. Definitely incoming data when
there's no packet available to be allocated means we drop the
incoming packet and signal an rx overrun. The data sheet doesn't
say anything about what happens when the rx fifo overflows, though...
My best guess is that fifo overflow also results in an overrun report,
but you'd probably have to actually write a test case for the real
hardware to find out for sure.
thanks
-- PMM
* Re: [Qemu-devel] Segfault using qemu-system-arm in smc91c111
2015-09-07 9:21 ` Peter Maydell
2015-09-07 18:12 ` Peter Crosthwaite
@ 2015-09-08 9:55 ` Jason Wang
1 sibling, 0 replies; 20+ messages in thread
From: Jason Wang @ 2015-09-08 9:55 UTC (permalink / raw)
To: Peter Maydell, Richard Purdie
Cc: Stefan Hajnoczi, Peter Crosthwaite, Fam Zheng, qemu-devel
On 09/07/2015 05:21 PM, Peter Maydell wrote:
> CCing the net maintainers on this thread seems like it would
> be a good idea...
>
> On 7 September 2015 at 08:47, Richard Purdie
> <richard.purdie@linuxfoundation.org> wrote:
>> On Sun, 2015-09-06 at 17:48 -0700, Peter Crosthwaite wrote:
>>> This doesn't sound right. There are other network controllers that
>>> rely of can_receive catching all cases properly. Is this a regression?
>>> Looking at logs, I see some refactoring of QEMU net framework around
>>> June timeframe, if you rewind to QEMU 2.3 (or earlier) does the bug go
>>> away?
>> I did find an interesting comment in this commit:
>>
>> http://git.qemu.org/?p=qemu.git;a=commitdiff;h=625de449fc5597f2e1aff9cb586e249e198f03c9
>>
>> """
>> Since commit 6e99c63 "net/socket: Drop net_socket_can_send" and friends,
>> net queues need to be explicitly flushed after qemu_can_send_packet()
>> returns false, because the netdev side will disable the polling of fd.
>> """
>>
>> smc91x111 is calling flush functions when it knows can_receive
>> would/should return false. I believe that is the bug here.
>>
>> I suspect the driver needs:
>>
>> * can_receive to actually return the right value
>> * the locations of the flush calls to be when there is receive space
Yes, please do this.
>> This could explain what changed to break this and why moving the flush
>> calls works in my patch.
> thanks
> -- PMM