From: Bin Wu <wu.wubin@huawei.com>
To: Stefan Hajnoczi <stefanha@gmail.com>
Cc: kwolf@redhat.com, famz@redhat.com, stefanha@redhat.com,
	subo7@huawei.com, kathy.wangting@huawei.com,
	bruce.fon@huawei.com, qemu-devel@nongnu.org,
	arei.gonglei@huawei.com, boby.chen@huawei.com,
	pbonzini@redhat.com, rudy.zhangmin@huawei.com
Subject: Re: [Qemu-devel] [PATCH v2] qemu-coroutine: segfault when restarting co_queue
Date: Tue, 10 Feb 2015 08:51:22 +0800
Message-ID: <54D9560A.1080100@huawei.com>
In-Reply-To: <20150209144813.GA2076@stefanha-thinkpad.redhat.com>

On 2015/2/9 22:48, Stefan Hajnoczi wrote:
> On Mon, Feb 09, 2015 at 02:50:39PM +0800, Bin Wu wrote:
>> From: Bin Wu <wu.wubin@huawei.com>
>>
>> We tested VM migration together with disk images using drive_mirror. During
>> migration, two VMs copied large files between each other. During the
>> test, a segfault occurred. The stack was as follows:
>>
>> (gdb) bt
>> qemu-coroutine-lock.c:66
>> to=0x7fa5a1798648) at qemu-coroutine.c:97
>> request=0x7fa28c2ffa10, reply=0x7fa28c2ffa30, qiov=0x0, offset=0) at
>> block/nbd-client.c:165
>> sector_num=8552704, nb_sectors=2040, qiov=0x7fa5a1757468, offset=0) at
>> block/nbd-client.c:262
>> sector_num=8552704, nb_sectors=2048, qiov=0x7fa5a1757468) at
>> block/nbd-client.c:296
>> nb_sectors=2048, qiov=0x7fa5a1757468) at block/nbd.c:291
>> req=0x7fa28c2ffbb0, offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468,
>> flags=0) at block.c:3321
>> offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
>> block.c:3447
>> sector_num=8552704, nb_sectors=2048, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
>> block.c:3471
>> nb_sectors=2048, qiov=0x7fa5a1757468) at block.c:3480
>> nb_sectors=2048, qiov=0x7fa5a1757468) at block/raw_bsd.c:62
>> req=0x7fa28c2ffe30, offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468,
>> flags=0) at block.c:3321
>> offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
>> block.c:3447
>> sector_num=8552704, nb_sectors=2048, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
>> block.c:3471
>> coroutine-ucontext.c:121
> 
> This backtrace is incomplete.  Where are the function names?  The
> parameter lists appear incomplete too.
> 

I put the stack in the git log, so some lines were dropped. Here is the full backtrace:
(gdb) bt
#0  0x00007fa5a0c63fc5 in qemu_co_queue_run_restart (co=0x7fa5a1798648) at
qemu-coroutine-lock.c:66
#1  0x00007fa5a0c63bed in coroutine_swap (from=0x7fa5a178f160,
to=0x7fa5a1798648) at qemu-coroutine.c:97
#2  0x00007fa5a0c63dbf in qemu_coroutine_yield () at qemu-coroutine.c:140
#3  0x00007fa5a0c9e474 in nbd_co_receive_reply (s=0x7fa5a1a3cfd0,
request=0x7fa28c2ffa10, reply=0x7fa28c2ffa30, qiov=0x0, offset=0) at
block/nbd-client.c:165
#4  0x00007fa5a0c9e8b5 in nbd_co_writev_1 (client=0x7fa5a1a3cfd0,
sector_num=8552704, nb_sectors=2040, qiov=0x7fa5a1757468, offset=0) at
block/nbd-client.c:262
#5  0x00007fa5a0c9e9dd in nbd_client_session_co_writev (client=0x7fa5a1a3cfd0,
sector_num=8552704, nb_sectors=2048, qiov=0x7fa5a1757468) at
block/nbd-client.c:296
#6  0x00007fa5a0c9dda1 in nbd_co_writev (bs=0x7fa5a198fcb0, sector_num=8552704,
nb_sectors=2048, qiov=0x7fa5a1757468) at block/nbd.c:291
#7  0x00007fa5a0c509a4 in bdrv_aligned_pwritev (bs=0x7fa5a198fcb0,
req=0x7fa28c2ffbb0, offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468,
flags=0) at block.c:3321
#8  0x00007fa5a0c50f3f in bdrv_co_do_pwritev (bs=0x7fa5a198fcb0,
offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
block.c:3447
#9  0x00007fa5a0c51007 in bdrv_co_do_writev (bs=0x7fa5a198fcb0,
sector_num=8552704, nb_sectors=2048, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
block.c:3471
#10 0x00007fa5a0c51074 in bdrv_co_writev (bs=0x7fa5a198fcb0, sector_num=8552704,
nb_sectors=2048, qiov=0x7fa5a1757468) at block.c:3480
#11 0x00007fa5a0c652ec in raw_co_writev (bs=0x7fa5a198c110, sector_num=8552704,
nb_sectors=2048, qiov=0x7fa5a1757468) at block/raw_bsd.c:62
#12 0x00007fa5a0c509a4 in bdrv_aligned_pwritev (bs=0x7fa5a198c110,
req=0x7fa28c2ffe30, offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468,
flags=0) at block.c:3321
#13 0x00007fa5a0c50f3f in bdrv_co_do_pwritev (bs=0x7fa5a198c110,
offset=4378984448, bytes=1048576, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
block.c:3447
#14 0x00007fa5a0c51007 in bdrv_co_do_writev (bs=0x7fa5a198c110,
sector_num=8552704, nb_sectors=2048, qiov=0x7fa5a1757468, flags=(unknown: 0)) at
block.c:3471
#15 0x00007fa5a0c542b3 in bdrv_co_do_rw (opaque=0x7fa5a17a0000) at block.c:4706
#16 0x00007fa5a0c64e6e in coroutine_trampoline (i0=-1585909408, i1=32677) at
coroutine-ucontext.c:121
#17 0x00007fa59dc5aa50 in __correctly_grouped_prefixwc () from /lib64/libc.so.6
#18 0x0000000000000000 in ?? ()


>> After analyzing the stack and reviewing the code, we found that
>> qemu_co_queue_run_restart should not be placed in the coroutine_swap function,
>> which can be invoked by either qemu_coroutine_enter or qemu_coroutine_yield. Only
>> qemu_coroutine_enter needs to restart the co_queue.
>>
>> The error scenario is as follows: coroutine C1 enters C2, C2 yields
>> back to C1, then C1 terminates and the related coroutine memory
>> becomes invalid. After a while, the C2 coroutine is entered again.
>> At this point, C1 is used as a parameter passed to
>> qemu_co_queue_run_restart. Therefore, qemu_co_queue_run_restart
>> accesses invalid memory and a segfault occurs.
>>
>> The qemu_co_queue_run_restart function re-enters coroutines waiting
>> in the co_queue. However, this function should only be used in the
>> qemu_coroutine_enter context. Only in this context, when the current
>> coroutine gets execution control again (after the execution of
>> qemu_coroutine_switch), can we restart the target coroutine, because the
>> target coroutine has either yielded back to the current coroutine or
>> terminated.
>>
>> At first we wanted to put qemu_co_queue_run_restart in qemu_coroutine_enter,
>> but we found that we cannot access the target coroutine if it terminates.
> 
> This example captures the scenario you describe:
> 
> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
> index 525247b..883cbf5 100644
> --- a/qemu-coroutine.c
> +++ b/qemu-coroutine.c
> @@ -103,7 +103,10 @@ static void coroutine_swap(Coroutine *from, Coroutine *to)
>  {
>      CoroutineAction ret;
>  
> +    fprintf(stderr, "> %s from %p to %p\n", __func__, from, to);
>      ret = qemu_coroutine_switch(from, to, COROUTINE_YIELD);
> +    fprintf(stderr, "< %s from %p to %p switch %s\n", __func__, from, to,
> +            ret == COROUTINE_YIELD ? "yield" : "terminate");
>  
>      qemu_co_queue_run_restart(to);
>  
> @@ -111,6 +114,7 @@ static void coroutine_swap(Coroutine *from, Coroutine *to)
>      case COROUTINE_YIELD:
>          return;
>      case COROUTINE_TERMINATE:
> +        fprintf(stderr, "coroutine_delete %p\n", to);
>          trace_qemu_coroutine_terminate(to);
>          coroutine_delete(to);
>          return;
> diff --git a/tests/test-coroutine.c b/tests/test-coroutine.c
> index 27d1b6f..d44c428 100644
> --- a/tests/test-coroutine.c
> +++ b/tests/test-coroutine.c
> @@ -13,6 +13,7 @@
>  
>  #include <glib.h>
>  #include "block/coroutine.h"
> +#include "block/coroutine_int.h"
>  
>  /*
>   * Check that qemu_in_coroutine() works
> @@ -122,6 +123,35 @@ static void test_yield(void)
>      g_assert_cmpint(i, ==, 5); /* coroutine must yield 5 times */
>  }
>  
> +static void coroutine_fn c2_fn(void *opaque)
> +{
> +    fprintf(stderr, "c2 Part 1\n");
> +    qemu_coroutine_yield();
> +    fprintf(stderr, "c2 Part 2\n");
> +}
> +
> +static void coroutine_fn c1_fn(void *opaque)
> +{
> +    Coroutine *c2 = opaque;
> +
> +    fprintf(stderr, "c1 Part 1\n");
> +    qemu_coroutine_enter(c2, NULL);
> +    fprintf(stderr, "c1 Part 2\n");
> +}
> +
> +static void test_co_queue(void)
> +{
> +    Coroutine *c1;
> +    Coroutine *c2;
> +
> +    c1 = qemu_coroutine_create(c1_fn);
> +    c2 = qemu_coroutine_create(c2_fn);
> +
> +    qemu_coroutine_enter(c1, c2);
> +    memset(c1, 0xff, sizeof(Coroutine));
> +    qemu_coroutine_enter(c2, NULL);
> +}
> +
>  /*
>   * Check that creation, enter, and return work
>   */
> @@ -343,6 +373,7 @@ static void perf_cost(void)
>  int main(int argc, char **argv)
>  {
>      g_test_init(&argc, &argv, NULL);
> +    g_test_add_func("/basic/co_queue", test_co_queue);
>      g_test_add_func("/basic/lifecycle", test_lifecycle);
>      g_test_add_func("/basic/yield", test_yield);
>      g_test_add_func("/basic/nesting", test_nesting);
> 
> Here is the output (with printfs in coroutine_swap):
> 
> -> coroutine_swap from MAIN to C1
> c1 Part 1
> -> coroutine_swap from C1 to C2
> c2 Part 1
> -> coroutine_swap from C2 to C1
> <- coroutine_swap from C1 to C2 switch yield
> c1 Part 2
> <- coroutine_swap from MAIN to C1 switch terminate
> coroutine_delete C1
> -> coroutine_swap from MAIN to C2
> <- coroutine_swap from C2 to C1 switch yield  !!!
> c2 Part 2
> <- coroutine_swap from MAIN to C2 switch terminate
> coroutine_delete C2
> 
> I have marked the problematic line with "!!!".  The to=C1 variable is
> used after C1 has been deleted.
> 
> The test crashes since it writes 0xff to C1 after it has terminated.

Yes, this is exactly the error scenario I described. Thanks, Stefan.
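
To make the ordering concrete, here is a rough, self-contained C sketch that
replays the four returns from qemu_coroutine_switch() (the "<-" lines in the
trace above). It is only a toy model and not QEMU code: ToyCoroutine,
SwitchReturn, toy_run_restart() and replay() are made-up names, no real
context switching happens, and the "deleted" flag merely stands in for freed
coroutine memory.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: none of these names exist in QEMU. */
typedef enum { CTX_ENTER, CTX_YIELD } SwapContext;   /* who called coroutine_swap */
typedef enum { ACT_YIELD, ACT_TERMINATE } ToyAction; /* value returned by the switch */

typedef struct {
    const char *name;
    bool deleted;            /* stands in for "memory has been freed" */
} ToyCoroutine;

typedef struct {
    SwapContext ctx;         /* context in which the switch returns */
    ToyCoroutine *to;        /* the 'to' argument of that coroutine_swap call */
    ToyAction ret;           /* action reported by the switch */
} SwitchReturn;

static int stale_accesses;

/* Stand-in for qemu_co_queue_run_restart(): only counts use-after-delete. */
static void toy_run_restart(ToyCoroutine *co)
{
    assert(co != NULL);
    if (co->deleted) {
        stale_accesses++;
    }
}

/* Replay the four "<-" lines of the trace; return the number of stale accesses. */
int replay(bool fixed)
{
    ToyCoroutine c1 = { "C1", false };
    ToyCoroutine c2 = { "C2", false };
    const SwitchReturn trace[] = {
        { CTX_ENTER, &c2, ACT_YIELD },      /* C1 resumes: C2 yielded */
        { CTX_ENTER, &c1, ACT_TERMINATE },  /* main resumes: C1 finished */
        { CTX_YIELD, &c1, ACT_YIELD },      /* C2 resumes inside its yield */
        { CTX_ENTER, &c2, ACT_TERMINATE },  /* main resumes: C2 finished */
    };
    size_t i;

    stale_accesses = 0;
    for (i = 0; i < sizeof(trace) / sizeof(trace[0]); i++) {
        const SwitchReturn *ev = &trace[i];
        /* Old code restarts after every switch; the patch skips the yield path. */
        bool restart = !fixed || ev->ret == ACT_TERMINATE || ev->ctx == CTX_ENTER;

        if (restart) {
            toy_run_restart(ev->to);
        }
        if (ev->ret == ACT_TERMINATE) {
            ev->to->deleted = true;         /* models coroutine_delete(to) */
        }
    }
    return stale_accesses;
}
```

In this model, replay(false) reports one stale access (the third event, where
to=C1 has already been deleted), and replay(true) reports none, because the
restart is skipped when the switch returns in the yield context.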

> 
>> Signed-off-by: Bin Wu <wu.wubin@huawei.com>
>> ---
>>  qemu-coroutine.c | 16 ++++++++++------
>>  1 file changed, 10 insertions(+), 6 deletions(-)
>>
>> diff --git a/qemu-coroutine.c b/qemu-coroutine.c
>> index 525247b..cc0bdfa 100644
>> --- a/qemu-coroutine.c
>> +++ b/qemu-coroutine.c
>> @@ -99,29 +99,31 @@ static void coroutine_delete(Coroutine *co)
>>      qemu_coroutine_delete(co);
>>  }
>>  
>> -static void coroutine_swap(Coroutine *from, Coroutine *to)
>> +static CoroutineAction coroutine_swap(Coroutine *from, Coroutine *to)
>>  {
>>      CoroutineAction ret;
>>  
>>      ret = qemu_coroutine_switch(from, to, COROUTINE_YIELD);
>>  
>> -    qemu_co_queue_run_restart(to);
>> -
>>      switch (ret) {
>>      case COROUTINE_YIELD:
>> -        return;
>> +        break;
>>      case COROUTINE_TERMINATE:
>>          trace_qemu_coroutine_terminate(to);
>> +        qemu_co_queue_run_restart(to);
>>          coroutine_delete(to);
>> -        return;
>> +        break;
>>      default:
>>          abort();
>>      }
>> +
>> +    return ret;
>>  }
>>  
>>  void qemu_coroutine_enter(Coroutine *co, void *opaque)
>>  {
>>      Coroutine *self = qemu_coroutine_self();
>> +    CoroutineAction ret;
>>  
>>      trace_qemu_coroutine_enter(self, co, opaque);
>>  
>> @@ -132,7 +134,9 @@ void qemu_coroutine_enter(Coroutine *co, void *opaque)
>>  
>>      co->caller = self;
>>      co->entry_arg = opaque;
>> -    coroutine_swap(self, co);
>> +    ret = coroutine_swap(self, co);
>> +    if (ret == COROUTINE_YIELD)
>> +        qemu_co_queue_run_restart(co);
>>  }
> 
> Your fix looks correct although QEMU coding style requires {}.
> 
> I tried to think of a simpler solution that keeps a single
> qemu_co_queue_run_restart() call but was unable to find one.
> 
> Please send another revision with a test-coroutine.c test case so we can
> protect against regressions.

OK, I will send another version later.

> 
> Thanks,
> Stefan
> 

-- 
Bin Wu
