* [Qemu-devel] Segfaults in chardev due to races
From: Max Reitz @ 2018-12-21 22:31 UTC
To: Marc-André Lureau; +Cc: Paolo Bonzini, qemu-devel@nongnu.org
Hi,
While fixing iotest 147 (which currently cannot run in parallel), I
noticed that once it is fixed (and does run in parallel), I see
occasional segfaults in chardev/char-socket.c.
They usually look like this:
#0 0x0000556f3b5765d0 in object_get_class (obj=obj@entry=0x0) at
qom/object.c:822
#1 0x0000556f3b6162a5 in qio_channel_writev_full (ioc=0x0,
iov=0x7fffc8bc9aa0, niov=1, fds=0x0, nfds=0, errp=0x0) at io/channel.c:76
#2 0x0000556f3b6056b4 in io_channel_send_full (ioc=0x0,
buf=buf@entry=0x7fd3a4006be0, len=len@entry=138, fds=0x0, nfds=0) at
chardev/char-io.c:123
#3 0x0000556f3b60a50b in tcp_chr_write
(chr=0x556f3cfa8600, buf=0x7fd3a4006be0 "{\"timestamp\":
{\"seconds\": 1545429461, \"microseconds\": 104325}, \"event\":
\"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=138)
at chardev/char-socket.c:135
#4 0x0000556f3b602440 in qemu_chr_write_buffer
(s=s@entry=0x556f3cfa8600, buf=buf@entry=0x7fd3a4006be0
"{\"timestamp\": {\"seconds\": 1545429461, \"microseconds\": 104325},
\"event\": \"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=138, offset=offset@entry=0x7fffc8bc9b90,
write_all=false) at chardev/char.c:112
#5 0x0000556f3b6026ef in qemu_chr_write
(s=0x556f3cfa8600, buf=buf@entry=0x7fd3a4006be0 "{\"timestamp\":
{\"seconds\": 1545429461, \"microseconds\": 104325}, \"event\":
\"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=<optimized out>,
write_all=write_all@entry=false) at chardev/char.c:147
#6 0x0000556f3b6046af in qemu_chr_fe_write
(be=be@entry=0x556f3cece140, buf=buf@entry=0x7fd3a4006be0
"{\"timestamp\": {\"seconds\": 1545429461, \"microseconds\": 104325},
\"event\": \"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=<optimized out>) at chardev/char-fe.c:42
#7 0x0000556f3b2513d8 in monitor_flush_locked
(mon=mon@entry=0x556f3cece140) at monitor.c:406
#8 0x0000556f3b2515a9 in monitor_flush_locked (mon=0x556f3cece140) at
monitor.c:449
#9 0x0000556f3b2515a9 in monitor_puts (mon=mon@entry=0x556f3cece140,
str=0x556f3de59919 "") at monitor.c:449
#10 0x0000556f3b2515ef in qmp_send_response (mon=0x556f3cece140,
rsp=<optimized out>) at monitor.c:498
#11 0x0000556f3b2517ea in monitor_qapi_event_emit
(event=event@entry=QAPI_EVENT_SHUTDOWN,
qdict=qdict@entry=0x556f3e4613e0) at monitor.c:526
#12 0x0000556f3b251b22 in monitor_qapi_event_queue_no_reenter
(qdict=0x556f3e4613e0, event=QAPI_EVENT_SHUTDOWN) at monitor.c:551
#13 0x0000556f3b251b22 in monitor_qapi_event_queue
(event=QAPI_EVENT_SHUTDOWN, qdict=0x0) at monitor.c:627
#14 0x0000556f3b251b22 in monitor_qapi_event_queue
(event=event@entry=QAPI_EVENT_SHUTDOWN,
qdict=qdict@entry=0x556f3e4613e0) at monitor.c:594
#15 0x0000556f3b6563f1 in qapi_event_send_shutdown (guest=<optimized
out>, reason=<optimized out>) at qapi/qapi-events-run-state.c:49
#16 0x0000556f3b37ac07 in main_loop_should_exit () at
include/sysemu/sysemu.h:36
#17 0x0000556f3b37ac07 in main_loop () at vl.c:1909
#18 0x0000556f3b207735 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at vl.c:4661
Investigating, I saw that tcp_chr_write() is sometimes invoked in
parallel with tcp_chr_free_connection() -- from two different
AioContexts (and two different pthreads). s->ioc is set to NULL before
s->connected is set to 0, and because these run concurrently,
tcp_chr_write() can see s->connected as 1 and then fetch the
already-NULL s->ioc, which causes the segfault.
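In outline, the race looks like this (a simplified sketch of the
relevant char-socket.c paths, not a verbatim excerpt):

/* Thread A -- disconnect path */
static void tcp_chr_free_connection(Chardev *chr)
{
    SocketChardev *s = SOCKET_CHARDEV(chr);

    object_unref(OBJECT(s->ioc));
    s->ioc = NULL;      /* (1) the channel is gone...             */
                        /* ...but until (2), other threads still
                         * see connected == 1                     */
    s->connected = 0;   /* (2) the flag only flips here           */
}

/* Thread B -- QMP event emission path */
static int tcp_chr_write(Chardev *chr, const uint8_t *buf, int len)
{
    SocketChardev *s = SOCKET_CHARDEV(chr);

    if (s->connected) { /* may still read 1 between (1) and (2)  */
        /* s->ioc may already be NULL here -> the NULL
         * dereference in the backtrace above */
        return io_channel_send_full(s->ioc, buf, len, NULL, 0);
    }
    /* XXX: indicate an error ? */
    return len;
}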
The issue is more easily reproducible by inserting the following into
tcp_chr_free_connection() above "s->connected = 0;":
/* Busy-wait to widen the race window; the volatile counter and the
 * "memory" clobber keep the compiler from eliding the loop. */
for (volatile int i = 0; i < 1000000; i++) {
    __asm__ __volatile__ ("" ::: "memory");
}
Then, run this command line:
$ x86_64-softmmu/qemu-system-x86_64 -qmp tcp:localhost:4242,server
And this script:
#!/usr/bin/ruby
require 'socket'
c = TCPSocket.new('localhost', 4242)
c.readline                               # QMP greeting
c.puts("{'execute':'qmp_capabilities'}")
c.readline                               # capabilities response
c.puts("{'execute':'quit'}")
c.close                                  # closing now races with the SHUTDOWN event
(The above backtrace was generated this way because, for some reason, I
could no longer reproduce the segfaults in 147 afterwards. Huh. But the
loop I added does not yield to a coroutine or anything like that; it
just enlarges the race window, so I think this is still a real issue.)
Now I thought it would suffice to just pull "s->connected = 0;" up to
the start of tcp_chr_free_connection(). But first of all, I wasn't very
fond of that solution, because tcp_chr_write() marks the !s->connected
path with an XXX comment, implying that ending up there is actually bad.
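For reference, the attempted change was essentially this reordering
(again just a sketch):

static void tcp_chr_free_connection(Chardev *chr)
{
    SocketChardev *s = SOCKET_CHARDEV(chr);

    s->connected = 0;   /* moved up from further down the function,
                         * so concurrent tcp_chr_write() calls bail
                         * out early */

    object_unref(OBJECT(s->ioc));
    s->ioc = NULL;
    /* ...rest of the teardown unchanged... */
}

Note that without further synchronization, a writer that has already
passed its s->connected check can still race with the object_unref();
that would explain the failing cast assertion below.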
More importantly, though, once I did that, another segfault appeared
(again from running iotest 147 in parallel):
(gdb) bt
#0 0x00005629e4c7b29f in object_dynamic_cast_assert
(obj=0x5629e6847690, typename=typename@entry=0x5629e4f3933b
"qio-channel-socket", file=file@entry=0x5629e4f423e8
"io/channel-socket.c", line=line@entry=512,
func=func@entry=0x5629e4f42b00 <__func__.20946>
"qio_channel_socket_writev") at qom/object.c:686
#1 0x00005629e4d1ded2 in qio_channel_socket_writev (ioc=<optimized
out>, iov=0x7ffe36e71c40, niov=1, fds=0x0, nfds=0, errp=0x0) at
io/channel-socket.c:512
#2 0x00005629e4d0a6b4 in io_channel_send_full (ioc=0x5629e6847690,
buf=buf@entry=0x7f7ae4006a30, len=len@entry=138, fds=0x0, nfds=0) at
chardev/char-io.c:123
#3 0x00005629e4d0f4cb in tcp_chr_write
(chr=0x5629e67dced0, buf=0x7f7ae4006a30 "{\"timestamp\":
{\"seconds\": 1545428667, \"microseconds\": 895334}, \"event\":
\"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=138)
at chardev/char-socket.c:135
#4 0x00005629e4d07440 in qemu_chr_write_buffer
(s=s@entry=0x5629e67dced0, buf=buf@entry=0x7f7ae4006a30
"{\"timestamp\": {\"seconds\": 1545428667, \"microseconds\": 895334},
\"event\": \"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=138, offset=offset@entry=0x7ffe36e71d30,
write_all=false) at chardev/char.c:112
#5 0x00005629e4d076ef in qemu_chr_write
(s=0x5629e67dced0, buf=buf@entry=0x7f7ae4006a30 "{\"timestamp\":
{\"seconds\": 1545428667, \"microseconds\": 895334}, \"event\":
\"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=<optimized out>,
write_all=write_all@entry=false) at chardev/char.c:147
#6 0x00005629e4d096af in qemu_chr_fe_write
(be=be@entry=0x5629e681da90, buf=buf@entry=0x7f7ae4006a30
"{\"timestamp\": {\"seconds\": 1545428667, \"microseconds\": 895334},
\"event\": \"SHUTDOWN\", \"data\": {\"guest\": false, \"reason\":
\"host-qmp-quit\"}}\r\n", len=<optimized out>) at chardev/char-fe.c:42
#7 0x00005629e49563d8 in monitor_flush_locked
(mon=mon@entry=0x5629e681da90) at monitor.c:406
#8 0x00005629e49565a9 in monitor_flush_locked (mon=0x5629e681da90) at
monitor.c:449
#9 0x00005629e49565a9 in monitor_puts (mon=mon@entry=0x5629e681da90,
str=0x5629e768c7f9 "") at monitor.c:449
#10 0x00005629e49565ef in qmp_send_response (mon=0x5629e681da90,
rsp=<optimized out>) at monitor.c:498
#11 0x00005629e49567ea in monitor_qapi_event_emit
(event=event@entry=QAPI_EVENT_SHUTDOWN,
qdict=qdict@entry=0x5629e79825f0) at monitor.c:526
#12 0x00005629e4956b22 in monitor_qapi_event_queue_no_reenter
(qdict=0x5629e79825f0, event=QAPI_EVENT_SHUTDOWN) at monitor.c:551
#13 0x00005629e4956b22 in monitor_qapi_event_queue
(event=QAPI_EVENT_SHUTDOWN, qdict=0x0) at monitor.c:627
#14 0x00005629e4956b22 in monitor_qapi_event_queue
(event=event@entry=QAPI_EVENT_SHUTDOWN,
qdict=qdict@entry=0x5629e79825f0) at monitor.c:594
#15 0x00005629e4d5b3b1 in qapi_event_send_shutdown (guest=<optimized
out>, reason=<optimized out>) at qapi/qapi-events-run-state.c:49
#16 0x00005629e4a7fc07 in main_loop_should_exit () at
include/sysemu/sysemu.h:36
#17 0x00005629e4a7fc07 in main_loop () at vl.c:1909
#18 0x00005629e490c735 in main (argc=<optimized out>, argv=<optimized
out>, envp=<optimized out>) at vl.c:4661
At this point, I decided that I had ventured far enough. (Especially
since it's Dec 21 and 11 pm and all that.)
I suppose the issue is that QMP events are sent by one thread, and
client disconnects are handled by a different one. So if a QMP event is
sent while a client disconnects concurrently, races may occur; and the
only protection against concurrent access appears to be the
chr_write_lock, which I don't think is enough.
Max
* Re: [Qemu-devel] Segfaults in chardev due to races
From: Paolo Bonzini @ 2018-12-22 9:17 UTC
To: Max Reitz, Marc-André Lureau; +Cc: qemu-devel@nongnu.org
On 21/12/18 23:31, Max Reitz wrote:
> I suppose the issue is that QMP events are sent by one thread, and
> client disconnects are handled by a different one. So if a QMP event is
> sent while a client disconnects concurrently, races may occur; and the
> only protection against concurrent access appears to be the
> chr_write_lock, which I don't think is enough.
I think disconnection (tcp_chr_disconnect) has to take the
chr_write_lock too.
Paolo
* Re: [Qemu-devel] Segfaults in chardev due to races
From: Max Reitz @ 2019-01-23 15:33 UTC
To: Paolo Bonzini, Marc-André Lureau; +Cc: qemu-devel@nongnu.org
On 22.12.18 10:17, Paolo Bonzini wrote:
> On 21/12/18 23:31, Max Reitz wrote:
>> I suppose the issue is that QMP events are sent by one thread, and
>> client disconnects are handled by a different one. So if a QMP event is
>> sent while a client disconnects concurrently, races may occur; and the
>> only protection against concurrent access appears to be the
>> chr_write_lock, which I don't think is enough.
>
> I think disconnection (tcp_chr_disconnect) has to take the
> chr_write_lock too.
That seems to fix the issue for me (can also be reproduced by running
iotest 169 in parallel), but how should this be implemented? I suppose
tcp_chr_disconnect() can't really take the lock itself, because it's
called by tcp_chr_write() which is invoked with the lock held.
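Maybe something like the following would work (just a sketch; the
_locked variant is made up for illustration):

/* Assumes the caller holds chr->chr_write_lock */
static void tcp_chr_disconnect_locked(Chardev *chr)
{
    tcp_chr_free_connection(chr);
    /* ...reconnect timer, CHR_EVENT_CLOSED, etc.... */
}

static void tcp_chr_disconnect(Chardev *chr)
{
    qemu_mutex_lock(&chr->chr_write_lock);
    tcp_chr_disconnect_locked(chr);
    qemu_mutex_unlock(&chr->chr_write_lock);
}

tcp_chr_write() already runs with chr_write_lock held, so its error
path would then call tcp_chr_disconnect_locked() directly instead of
tcp_chr_disconnect().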
Max
* Re: [Qemu-devel] Segfaults in chardev due to races
From: Marc-André Lureau @ 2019-02-06 18:38 UTC
To: Max Reitz
Cc: Paolo Bonzini, qemu-devel@nongnu.org, Markus Armbruster, Peter Xu
Hi
On Wed, Jan 23, 2019 at 4:39 PM Max Reitz <mreitz@redhat.com> wrote:
>
> On 22.12.18 10:17, Paolo Bonzini wrote:
> > On 21/12/18 23:31, Max Reitz wrote:
> >> I suppose the issue is that QMP events are sent by one thread, and
> >> client disconnects are handled by a different one. So if a QMP event is
> >> sent while a client disconnects concurrently, races may occur; and the
> >> only protection against concurrent access appears to be the
> >> chr_write_lock, which I don't think is enough.
> >
> > I think disconnection (tcp_chr_disconnect) has to take the
> > chr_write_lock too.
>
> That seems to fix the issue for me (can also be reproduced by running
> iotest 169 in parallel), but how should this be implemented? I suppose
> tcp_chr_disconnect() can't really take the lock itself, because it's
> called by tcp_chr_write() which is invoked with the lock held.
>
> Max
>
Is anybody fixing this? Paolo's suggestion seems right in general, but
it is not easy to implement correctly for all chardevs. (C isn't
helping!)
I think this can be treated as a regression from commit
8258292e18c39480b64eba9f3551ab772ce29b5d, which enabled the OOB monitor
by default (on the socket chardev).
--
Marc-André Lureau
* Re: [Qemu-devel] Segfaults in chardev due to races
From: Peter Xu @ 2019-02-11 6:13 UTC
To: Marc-André Lureau
Cc: Max Reitz, Paolo Bonzini, qemu-devel@nongnu.org,
Markus Armbruster
On Wed, Feb 06, 2019 at 07:38:18PM +0100, Marc-André Lureau wrote:
> Hi
>
> On Wed, Jan 23, 2019 at 4:39 PM Max Reitz <mreitz@redhat.com> wrote:
> >
> > On 22.12.18 10:17, Paolo Bonzini wrote:
> > > On 21/12/18 23:31, Max Reitz wrote:
> > >> I suppose the issue is that QMP events are sent by one thread, and
> > >> client disconnects are handled by a different one. So if a QMP event is
> > >> sent while a client disconnects concurrently, races may occur; and the
> > >> only protection against concurrent access appears to be the
> > >> chr_write_lock, which I don't think is enough.
> > >
> > > I think disconnection (tcp_chr_disconnect) has to take the
> > > chr_write_lock too.
> >
> > That seems to fix the issue for me (can also be reproduced by running
> > iotest 169 in parallel), but how should this be implemented? I suppose
> > tcp_chr_disconnect() can't really take the lock itself, because it's
> > called by tcp_chr_write() which is invoked with the lock held.
> >
> > Max
> >
>
> Is anybody fixing this? Paolo suggestion seems right in general, but
> not easy to implement correctly for all chardevs. (C isn't helping!)
>
> I think this can be treated as a regression from commit
> 8258292e18c39480b64eba9f3551ab772ce29b5d, OOB monitor is now enabled
> by default (on socket chardev).
So far I see that only the TCP chardev implements chr_disconnect; does
that mean it is still doable?
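(For context, the hook I mean is the per-backend callback in
ChardevClass, roughly like this -- sketched from memory, so the exact
declaration may differ:)

typedef struct ChardevClass {
    /* ... */
    void (*chr_disconnect)(Chardev *chr);
    /* ... */
} ChardevClass;

Since only the socket backend fills it in, the locking might only need
to be sorted out in char-socket.c for now.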
I'm also uncertain whether the response queue idea could help here,
since that idea was about keeping all the chardev I/O and control paths
in the same thread. When you wanted (and managed) to remove the
response queue, I was uncertain about cases like this one: with the
response queues, the chardev still behaved very much like a
single-threaded component (we only offloaded the work that must be done
in the main thread, and the two threads talked only via
request/response queues protected by mutexes). Now we are doing real
multi-threading in many places.
So IMHO we either fix all the multi-thread safety issues, or maybe we
fall back? I don't know the chardev code well enough to say what would
be required to prove 100% multi-thread safety; that is why, when I was
working on OOB, I chose the response queue and tried to stay on the
safe side. I'm unsure whether that is still an alternative.
Thanks,
--
Peter Xu