[Qemu-devel] Endless loop in qcow2_alloc_cluster

* [Qemu-devel] Endless loop in qcow2_alloc_cluster_offset
@ 2009-11-19 12:19 Jan Kiszka
  2009-11-19 14:49 ` [Qemu-devel] " Kevin Wolf
  2010-05-07  1:19 ` Marcelo Tosatti
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Kiszka @ 2009-11-19 12:19 UTC (permalink / raw)
  To: qemu-devel; +Cc: Kevin Wolf, kvm

Hi,

I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
endless loop in qcow2_alloc_cluster_offset, namely over
QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):

(gdb) bt
#0  0x000000000048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
#1  0x00000000004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
#2  0x0000000000482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
#3  0x0000000000470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
#4  0x0000000000472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
#5  0x0000000000435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
#6  0x0000000000425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
#7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
#8  0x0000000000426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
#9  0x000000000042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
#10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
#11 0x00007f48ae89d070 in start_thread () from /lib64/libpthread.so.0
#12 0x00007f48abf0711d in clone () from /lib64/libc.so.6
#13 0x0000000000000000 in ?? ()
(gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
$5 = (struct QCowL2Meta *) 0xcb3568
(gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
$6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}

So next == first.

Is something fiddling with cluster_allocs concurrently, e.g. some signal
handler? Or what could cause this list corruption? Would it be enough to
move to QLIST_FOREACH_SAFE?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-11-19 12:19 [Qemu-devel] Endless loop in qcow2_alloc_cluster_offset Jan Kiszka
@ 2009-11-19 14:49 ` Kevin Wolf
  2009-11-19 14:58   ` Jan Kiszka
  2010-05-07  1:19 ` Marcelo Tosatti
  1 sibling, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2009-11-19 14:49 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, kvm

Hi Jan,

Am 19.11.2009 13:19, schrieb Jan Kiszka:
> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> $5 = (struct QCowL2Meta *) 0xcb3568
> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
> 
> So next == first.

Oops. Doesn't sound quite right...

> Is something fiddling with cluster_allocs concurrently, e.g. some signal
> handler? Or what could cause this list corruption? Would it be enough to
> move to QLIST_FOREACH_SAFE?

Are there any specific signals you're thinking of? Related to block code
I can only think of SIGUSR2 and this one shouldn't call any block driver
functions directly. You're using aio=threads, I assume? (It's the default)

QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
doesn't insert or remove any elements. If the list is corrupted now, I
think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
the endless loop would occur one call later.
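
To illustrate the point about QLIST_FOREACH_SAFE, here is a minimal sketch
with hand-written loops standing in for the two macro forms. The struct and
function names are made up for illustration, and the limit parameter exists
only so that the demonstration terminates. On a node whose next pointer
refers to itself, both traversals keep going until the limit is hit:

```c
#include <stddef.h>

/* Hypothetical stand-in for a list element; not the real QCowL2Meta. */
struct node {
    struct node *next;
};

/* Plain traversal: follows cur->next after each iteration. */
size_t walk_plain(struct node *first, size_t limit)
{
    size_t steps = 0;

    for (struct node *cur = first; cur != NULL; cur = cur->next) {
        if (++steps == limit) {
            break;              /* give up instead of spinning forever */
        }
    }
    return steps;
}

/* "Safe" traversal: caches cur->next before the loop body runs. This only
 * protects against the body removing cur, not against a cycle. */
size_t walk_safe(struct node *first, size_t limit)
{
    size_t steps = 0;
    struct node *cur, *next;

    for (cur = first; cur != NULL && (next = cur->next, 1); cur = next) {
        if (++steps == limit) {
            break;
        }
    }
    return steps;
}
```

Since the loop body in question does not modify the list, caching the next
pointer changes nothing about which elements are visited.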

The only way I see to get such a loop in a list is to re-insert an
element that already is part of the list. The only insert is at
qcow2-cluster.c:777. Remains the question how we came there twice
without run_dependent_requests() removing the L2Meta from our list first
- because this is definitely wrong...
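
The double-insert theory can be checked in isolation. The following is a
sketch under assumptions: the types are simplified stand-ins, and
alloc_insert_head() performs the pointer updates of a BSD-style
LIST_INSERT_HEAD, which is assumed here to match what QLIST_INSERT_HEAD
does. Inserting the same element twice leaves le_next pointing at the
element itself and le_prev pointing at the list head, which is the pattern
in the gdb output:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified element and list head. */
struct alloc {
    struct alloc *le_next;      /* next element */
    struct alloc **le_prev;     /* address of the previous next pointer */
};

struct alloc_list {
    struct alloc *lh_first;
};

/* Same pointer updates as a BSD-style LIST_INSERT_HEAD. */
void alloc_insert_head(struct alloc_list *head, struct alloc *elm)
{
    elm->le_next = head->lh_first;
    if (elm->le_next != NULL) {
        head->lh_first->le_prev = &elm->le_next;
    }
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
}

/* Insert one element twice and report whether it ended up self-linked. */
bool double_insert_creates_loop(void)
{
    struct alloc_list list = { NULL };
    struct alloc m = { NULL, NULL };

    alloc_insert_head(&list, &m);
    if (m.le_next == &m) {
        return false;           /* a single insert must not create a loop */
    }

    alloc_insert_head(&list, &m);   /* the bug: m is already on the list */
    return m.le_next == &m
        && list.lh_first == &m
        && m.le_prev == &list.lh_first;
}
```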

Presumably, it's not reproducible?

Kevin

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-11-19 14:49 ` [Qemu-devel] " Kevin Wolf
@ 2009-11-19 14:58   ` Jan Kiszka
  2009-12-07 14:16     ` Jan Kiszka
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Kiszka @ 2009-11-19 14:58 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, kvm

Kevin Wolf wrote:
> Hi Jan,
> 
> Am 19.11.2009 13:19, schrieb Jan Kiszka:
>> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>> $5 = (struct QCowL2Meta *) 0xcb3568
>> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
>>
>> So next == first.
> 
> Oops. Doesn't sound quite right...
> 
>> Is something fiddling with cluster_allocs concurrently, e.g. some signal
>> handler? Or what could cause this list corruption? Would it be enough to
>> move to QLIST_FOREACH_SAFE?
> 
> Are there any specific signals you're thinking of? Related to block code

No, was just blind guessing.

> I can only think of SIGUSR2 and this one shouldn't call any block driver
> functions directly. You're using aio=threads, I assume? (It's the default)

Yes, all on defaults.

> 
> QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
> doesn't insert or remove any elements. If the list is corrupted now, I
> think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
> the endless loop would occur one call later.
> 
> The only way I see to get such a loop in a list is to re-insert an
> element that already is part of the list. The only insert is at
> qcow2-cluster.c:777. Remains the question how we came there twice
> without run_dependent_requests() removing the L2Meta from our list first
> - because this is definitely wrong...
> 
> Presumably, it's not reproducible?

Likely not. What I did was nothing special, and I did not notice such a
crash in the last months.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-11-19 14:58   ` Jan Kiszka
@ 2009-12-07 14:16     ` Jan Kiszka
  2009-12-07 14:50       ` Jan Kiszka
  2009-12-07 15:00       ` Kevin Wolf
  0 siblings, 2 replies; 15+ messages in thread
From: Jan Kiszka @ 2009-12-07 14:16 UTC (permalink / raw)
  Cc: Kevin Wolf, qemu-devel, kvm

Jan Kiszka wrote:
> Kevin Wolf wrote:
>> Hi Jan,
>>
>> Am 19.11.2009 13:19, schrieb Jan Kiszka:
>>> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>>> $5 = (struct QCowL2Meta *) 0xcb3568
>>> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>>> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
>>>
>>> So next == first.
>> Oops. Doesn't sound quite right...
>>
>>> Is something fiddling with cluster_allocs concurrently, e.g. some signal
>>> handler? Or what could cause this list corruption? Would it be enough to
>>> move to QLIST_FOREACH_SAFE?
>> Are there any specific signals you're thinking of? Related to block code
> 
> No, was just blind guessing.
> 
>> I can only think of SIGUSR2 and this one shouldn't call any block driver
>> functions directly. You're using aio=threads, I assume? (It's the default)
> 
> Yes, all on defaults.
> 
>> QLIST_FOREACH_SAFE shouldn't make a difference in this place as the loop
>> doesn't insert or remove any elements. If the list is corrupted now, I
>> think it would be corrupted with QLIST_FOREACH_SAFE as well - at best,
>> the endless loop would occur one call later.
>>
>> The only way I see to get such a loop in a list is to re-insert an
>> element that already is part of the list. The only insert is at
>> qcow2-cluster.c:777. Remains the question how we came there twice
>> without run_dependent_requests() removing the L2Meta from our list first
>> - because this is definitely wrong...
>>
>> Presumably, it's not reproducible?
> 
> Likely not. What I did was nothing special, and I did not notice such a
> crash in the last months.

And now it happened again (qemu-kvm head, during kernel installation
from network onto local qcow2-disk). Any clever idea how to proceed with
this?

I could try to run the step in a loop, hopefully retriggering it once in
a (likely longer) while. But then we need some good instrumentation first.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 14:16     ` Jan Kiszka
@ 2009-12-07 14:50       ` Jan Kiszka
  2009-12-07 15:03         ` Kevin Wolf
  2009-12-07 15:04         ` Avi Kivity
  2009-12-07 15:00       ` Kevin Wolf
  1 sibling, 2 replies; 15+ messages in thread
From: Jan Kiszka @ 2009-12-07 14:50 UTC (permalink / raw)
  To: Kevin Wolf, qemu-devel, kvm

Jan Kiszka wrote:
> And now it happened again (qemu-kvm head, during kernel installation
> from network onto local qcow2-disk). Any clever idea how to proceed with
> this?
> 
> I could try to run the step in a loop, hopefully retriggering it once in
> a (likely longer) while. But then we need some good instrumentation first.
> 

Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
goes on in the code, but this looks fishy:

preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
stack variable. It seems that qcow2_alloc_cluster_offset() may insert
this structure into cluster_allocs and leave it there. So we corrupt the
queue as soon as preallocate() returns, no?

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 14:16     ` Jan Kiszka
  2009-12-07 14:50       ` Jan Kiszka
@ 2009-12-07 15:00       ` Kevin Wolf
  2009-12-07 16:09         ` Jan Kiszka
  2009-12-08 14:51         ` Kevin Wolf
  1 sibling, 2 replies; 15+ messages in thread
From: Kevin Wolf @ 2009-12-07 15:00 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, kvm

Am 07.12.2009 15:16, schrieb Jan Kiszka:
>> Likely not. What I did was nothing special, and I did not notice such a
>> crash in the last months.
> 
> And now it happened again (qemu-kvm head, during kernel installation
> from network onto local qcow2-disk). Any clever idea how to proceed with
> this?

I still haven't seen this and I still have no theory on what could be
happening here. I'm just trying to write down what I think must happen
to get into this situation. Maybe you can point at something I'm missing
or maybe it helps you to have a sudden inspiration.

The crash happens because we have a loop in the s->cluster_allocs list.
A loop can only be created by inserting an object twice. The only insert
to this list happens in qcow2_alloc_cluster_offset (though an earlier
call than that of the stack trace).

There is only one relevant caller of this function, qcow_aio_write_cb.
Part of it is a call to run_dependent_requests which removes the request
from s->cluster_allocs. So after the QLIST_REMOVE in
run_dependent_requests the request can't be contained in the list, but
at the call of qcow2_alloc_cluster_offset it must be contained again. It
must be added somewhere in between these two calls.

In qcow_aio_write_cb there isn't much happening between these calls. The
only thing that could somehow become dangerous is the
qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

> I could try to run the step in a loop, hopefully retriggering it once in
> a (likely longer) while. But then we need some good instrumentation first.

I can't explain what exactly would be going wrong there, but if my
thoughts are right so far, I think that moving this into a Bottom Half
would help. So if you can reproduce it in a loop this could be worth a try.

I'd certainly prefer to understand the problem first, but thinking about
AIO is the perfect way to make your brain hurt...

Kevin

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 14:50       ` Jan Kiszka
@ 2009-12-07 15:03         ` Kevin Wolf
  2009-12-07 15:25           ` Jan Kiszka
  2009-12-07 15:04         ` Avi Kivity
  1 sibling, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2009-12-07 15:03 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, kvm

Am 07.12.2009 15:50, schrieb Jan Kiszka:
> Jan Kiszka wrote:
>> And now it happened again (qemu-kvm head, during kernel installation
>> from network onto local qcow2-disk). Any clever idea how to proceed with
>> this?
>>
>> I could try to run the step in a loop, hopefully retriggering it once in
>> a (likely longer) while. But then we need some good instrumentation first.
>>
> 
> Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
> goes on in the code, but this looks fishy:
> 
> preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
> stack variable. It seems that qcow2_alloc_cluster_offset() may insert
> this structure into cluster_allocs and leave it there. So we corrupt the
> queue as soon as preallocate() returns, no?

preallocate() is about metadata preallocation during image creation. It
is only ever run by qemu-img. Apart from that it calls
run_dependent_requests() which removes the request from the list again.

Kevin

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 14:50       ` Jan Kiszka
  2009-12-07 15:03         ` Kevin Wolf
@ 2009-12-07 15:04         ` Avi Kivity
  1 sibling, 0 replies; 15+ messages in thread
From: Avi Kivity @ 2009-12-07 15:04 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Kevin Wolf, qemu-devel, kvm

On 12/07/2009 04:50 PM, Jan Kiszka wrote:
>
> Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
> goes on in the code, but this looks fishy:
>
>    

Plenty of ghosts in qcow2, of all those explorers who tried to brave the 
code.  Only Kevin has ever come back.

> preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
> stack variable. It seems that qcow2_alloc_cluster_offset() may insert
> this structure into cluster_allocs and leave it there. So we corrupt the
> queue as soon as preallocate() returns, no?
>
>    

We invoke run_dependent_requests() which should dequeue those &meta 
again (I think).

-- 
error compiling committee.c: too many arguments to function

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 15:03         ` Kevin Wolf
@ 2009-12-07 15:25           ` Jan Kiszka
  0 siblings, 0 replies; 15+ messages in thread
From: Jan Kiszka @ 2009-12-07 15:25 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, kvm

Kevin Wolf wrote:
> Am 07.12.2009 15:50, schrieb Jan Kiszka:
>> Jan Kiszka wrote:
>>> And now it happened again (qemu-kvm head, during kernel installation
>>> from network onto local qcow2-disk). Any clever idea how to proceed with
>>> this?
>>>
>>> I could try to run the step in a loop, hopefully retriggering it once in
>>> a (likely longer) while. But then we need some good instrumentation first.
>>>
>> Maybe I'm seeing ghosts, and I don't even have a minimal clue about what
>> goes on in the code, but this looks fishy:
>>
>> preallocate() invokes qcow2_alloc_cluster_offset() passing &meta, a
>> stack variable. It seems that qcow2_alloc_cluster_offset() may insert
>> this structure into cluster_allocs and leave it there. So we corrupt the
>> queue as soon as preallocate() returns, no?
> 
> preallocate() is about metadata preallocation during image creation. It
> is only ever run by qemu-img. Apart from that it calls
> run_dependent_requests() which removes the request from the list again.

OK, I see - was far too easy anyway.

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 15:00       ` Kevin Wolf
@ 2009-12-07 16:09         ` Jan Kiszka
  2009-12-07 16:26           ` Kevin Wolf
  2009-12-08 14:51         ` Kevin Wolf
  1 sibling, 1 reply; 15+ messages in thread
From: Jan Kiszka @ 2009-12-07 16:09 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: qemu-devel, kvm

Kevin Wolf wrote:
> Am 07.12.2009 15:16, schrieb Jan Kiszka:
>>> Likely not. What I did was nothing special, and I did not notice such a
>>> crash in the last months.
>> And now it happened again (qemu-kvm head, during kernel installation
>> from network onto local qcow2-disk). Any clever idea how to proceed with
>> this?
> 
> I still haven't seen this and I still have no theory on what could be
> happening here. I'm just trying to write down what I think must happen
> to get into this situation. Maybe you can point at something I'm missing
> or maybe it helps you to have a sudden inspiration.
> 
> The crash happens because we have a loop in the s->cluster_allocs list.
> A loop can only be created by inserting an object twice. The only insert
> to this list happens in qcow2_alloc_cluster_offset (though an earlier
> call than that of the stack trace).
> 
> There is only one relevant caller of this function, qcow_aio_write_cb.
> Part of it is a call to run_dependent_requests which removes the request
> from s->cluster_allocs. So after the QLIST_REMOVE in
> run_dependent_requests the request can't be contained in the list, but
> at the call of qcow2_alloc_cluster_offset it must be contained again. It
> must be added somewhere in between these two calls.
> 
> In qcow_aio_write_cb there isn't much happening between these calls. The
> only thing that could somehow become dangerous is the
> qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

If m->nb_clusters is 0, the entry won't be removed from the list. And
if something corrupted nb_clusters so that it became 0 although it's
still enqueued, we would see the deadly loop I faced, right?
Unfortunately, any arbitrary memory corruption that generates such zeros
can cause this...
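
A sketch of that scenario, under assumptions: the types and function names
below are hypothetical, and req_finish() only mirrors the guard described
above (dequeue only when nb_clusters is non-zero); it is not the real
run_dependent_requests(). If the count is zeroed while the element is still
queued, the removal is skipped, and the next insert of the same element
links it to itself:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified request metadata and list head. */
struct req {
    int nb_clusters;
    struct req *le_next;
    struct req **le_prev;
};

struct req_list {
    struct req *lh_first;
};

/* Pointer updates of a BSD-style LIST_INSERT_HEAD. */
void req_insert_head(struct req_list *head, struct req *elm)
{
    elm->le_next = head->lh_first;
    if (elm->le_next != NULL) {
        head->lh_first->le_prev = &elm->le_next;
    }
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
}

/* Pointer updates of a BSD-style LIST_REMOVE. */
void req_remove(struct req *elm)
{
    if (elm->le_next != NULL) {
        elm->le_next->le_prev = elm->le_prev;
    }
    *elm->le_prev = elm->le_next;
}

/* Guarded dequeue: skipped entirely when nb_clusters is 0. */
void req_finish(struct req *m)
{
    if (m->nb_clusters != 0) {
        req_remove(m);
    }
}

/* Normal path: the element is gone from the list after finishing. */
bool normal_finish_dequeues(void)
{
    struct req_list list = { NULL };
    struct req m = { 1, NULL, NULL };

    req_insert_head(&list, &m);
    req_finish(&m);
    return list.lh_first == NULL;
}

/* Corrupted path: a zeroed count leaves the element queued, and reusing
 * the same struct for the next allocation creates the self-loop. */
bool zeroed_count_leaves_loop(void)
{
    struct req_list list = { NULL };
    struct req m = { 1, NULL, NULL };

    req_insert_head(&list, &m);
    m.nb_clusters = 0;              /* simulated corruption */
    req_finish(&m);                 /* removal is skipped */
    bool still_queued = (list.lh_first == &m);

    m.nb_clusters = 1;
    req_insert_head(&list, &m);     /* second insert of the same element */
    return still_queued && m.le_next == &m;
}
```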

Jan

-- 
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 16:09         ` Jan Kiszka
@ 2009-12-07 16:26           ` Kevin Wolf
  0 siblings, 0 replies; 15+ messages in thread
From: Kevin Wolf @ 2009-12-07 16:26 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, kvm

Am 07.12.2009 17:09, schrieb Jan Kiszka:
> Kevin Wolf wrote:
>> In qcow_aio_write_cb there isn't much happening between these calls. The
>> only thing that could somehow become dangerous is the
>> qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.
> 
> If m->nb_clusters is 0, the entry won't be removed from the list. And
> if something corrupted nb_clusters so that it became 0 although it's
> still enqueued, we would see the deadly loop I faced, right?
> Unfortunately, any arbitrary memory corruption that generates such zeros
> can cause this...

Right, this looks like another way to get into that endless loop. I
don't think it's very likely the cause, but who knows.

Kevin

* Re: [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-12-07 15:00       ` Kevin Wolf
  2009-12-07 16:09         ` Jan Kiszka
@ 2009-12-08 14:51         ` Kevin Wolf
  1 sibling, 0 replies; 15+ messages in thread
From: Kevin Wolf @ 2009-12-08 14:51 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: qemu-devel, kvm

Am 07.12.2009 16:00, schrieb Kevin Wolf:
> Am 07.12.2009 15:16, schrieb Jan Kiszka:
>>> Likely not. What I did was nothing special, and I did not notice such a
>>> crash in the last months.
>>
>> And now it happened again (qemu-kvm head, during kernel installation
>> from network onto local qcow2-disk). Any clever idea how to proceed with
>> this?
> 
> I still haven't seen this and I still have no theory on what could be
> happening here. I'm just trying to write down what I think must happen
> to get into this situation. Maybe you can point at something I'm missing
> or maybe it helps you to have a sudden inspiration.
> 
> The crash happens because we have a loop in the s->cluster_allocs list.
> A loop can only be created by inserting an object twice. The only insert
> to this list happens in qcow2_alloc_cluster_offset (though an earlier
> call than that of the stack trace).
> 
> There is only one relevant caller of this function, qcow_aio_write_cb.
> Part of it is a call to run_dependent_requests which removes the request
> from s->cluster_allocs. So after the QLIST_REMOVE in
> run_dependent_requests the request can't be contained in the list, but
> at the call of qcow2_alloc_cluster_offset it must be contained again. It
> must be added somewhere in between these two calls.
> 
> In qcow_aio_write_cb there isn't much happening between these calls. The
> only thing that could somehow become dangerous is the
> qcow_aio_write_cb(req, 0); for queued requests in run_dependent_requests.

Hm, you're using only one disk, and it's an IDE disk, right? Then the
queue of dependent requests should be empty anyway, so no dangerous
calls here. Maybe your theory of a memory corruption is the better one.

Kevin

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2009-11-19 12:19 [Qemu-devel] Endless loop in qcow2_alloc_cluster_offset Jan Kiszka
  2009-11-19 14:49 ` [Qemu-devel] " Kevin Wolf
@ 2010-05-07  1:19 ` Marcelo Tosatti
  2010-05-07  7:37   ` Kevin Wolf
  1 sibling, 1 reply; 15+ messages in thread
From: Marcelo Tosatti @ 2010-05-07  1:19 UTC (permalink / raw)
  To: Jan Kiszka; +Cc: Kevin Wolf, qemu-devel, kvm

On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
> Hi,
> 
> I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
> endless loop in qcow2_alloc_cluster_offset, namely over
> QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
> 
> (gdb) bt
> #0  0x000000000048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
> #1  0x00000000004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
> #2  0x0000000000482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
> #3  0x0000000000470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
> #4  0x0000000000472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
> #5  0x0000000000435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
> #6  0x0000000000425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
> #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
> #8  0x0000000000426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
> #9  0x000000000042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
> #10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
> #11 0x00007f48ae89d070 in start_thread () from /lib64/libpthread.so.0
> #12 0x00007f48abf0711d in clone () from /lib64/libc.so.6
> #13 0x0000000000000000 in ?? ()
> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> $5 = (struct QCowL2Meta *) 0xcb3568
> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
> 
> So next == first.
> 

Seen the exact same bug twice in a row while installing FC12 with IDE
disk, current qemu-kvm.git. 

qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
-m 1000  -vnc :1 \
-net nic,model=virtio \
-net tap,script=/root/ifup.sh -serial stdio \
-cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso \
-monitor telnet::4445,server,nowait -usbdevice tablet

Can't reproduce though.

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2010-05-07  1:19 ` Marcelo Tosatti
@ 2010-05-07  7:37   ` Kevin Wolf
  2010-05-07 15:16     ` Marcelo Tosatti
  0 siblings, 1 reply; 15+ messages in thread
From: Kevin Wolf @ 2010-05-07  7:37 UTC (permalink / raw)
  To: Marcelo Tosatti; +Cc: Jan Kiszka, qemu-devel, kvm

Am 07.05.2010 03:19, schrieb Marcelo Tosatti:
> On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
>> Hi,
>>
>> I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
>> endless loop in qcow2_alloc_cluster_offset, namely over
>> QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
>>
>> (gdb) bt
>> #0  0x000000000048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
>> #1  0x00000000004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
>> #2  0x0000000000482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
>> #3  0x0000000000470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
>> #4  0x0000000000472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
>> #5  0x0000000000435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
>> #6  0x0000000000425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
>> #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
>> #8  0x0000000000426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
>> #9  0x000000000042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
>> #10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
>> #11 0x00007f48ae89d070 in start_thread () from /lib64/libpthread.so.0
>> #12 0x00007f48abf0711d in clone () from /lib64/libc.so.6
>> #13 0x0000000000000000 in ?? ()
>> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>> $5 = (struct QCowL2Meta *) 0xcb3568
>> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
>> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
>>
>> So next == first.
>>
> 
> Seen the exact same bug twice in a row while installing FC12 with IDE
> disk, current qemu-kvm.git. 
> 
> qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
> -m 1000  -vnc :1 \
> -net nic,model=virtio \
> -net tap,script=/root/ifup.sh -serial stdio \
> -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso \
> -monitor telnet::4445,server,nowait -usbdevice tablet
> 
> Can't reproduce though.

In current git master? That's interesting news. I had kind of expected
it would be fixed with c644db3d.

Kevin

* [Qemu-devel] Re: Endless loop in qcow2_alloc_cluster_offset
  2010-05-07  7:37   ` Kevin Wolf
@ 2010-05-07 15:16     ` Marcelo Tosatti
  0 siblings, 0 replies; 15+ messages in thread
From: Marcelo Tosatti @ 2010-05-07 15:16 UTC (permalink / raw)
  To: Kevin Wolf; +Cc: Jan Kiszka, qemu-devel, kvm

On Fri, May 07, 2010 at 09:37:22AM +0200, Kevin Wolf wrote:
> Am 07.05.2010 03:19, schrieb Marcelo Tosatti:
> > On Thu, Nov 19, 2009 at 01:19:55PM +0100, Jan Kiszka wrote:
> >> Hi,
> >>
> >> I just managed to push a qemu-kvm process (git rev. b496fe3431) into an
> >> endless loop in qcow2_alloc_cluster_offset, namely over
> >> QLIST_FOREACH(old_alloc, &s->cluster_allocs, next_in_flight):
> >>
> >> (gdb) bt
> >> #0  0x000000000048614b in qcow2_alloc_cluster_offset (bs=0xc4e1d0, offset=7417184256, n_start=0, n_end=16, num=0xcb351c, m=0xcb3568) at /data/qemu-kvm/block/qcow2-cluster.c:750
> >> #1  0x00000000004828d0 in qcow_aio_write_cb (opaque=0xcb34d0, ret=0) at /data/qemu-kvm/block/qcow2.c:587
> >> #2  0x0000000000482a44 in qcow_aio_writev (bs=<value optimized out>, sector_num=<value optimized out>, qiov=<value optimized out>, nb_sectors=<value optimized out>, cb=<value optimized out>, opaque=<value optimized out>) at /data/qemu-kvm/block/qcow2.c:645
> >> #3  0x0000000000470e89 in bdrv_aio_writev (bs=0xc4e1d0, sector_num=2, qiov=0x7f48a9010ed0, nb_sectors=16, cb=0x470d20 <bdrv_rw_em_cb>, opaque=0x7f48a9010f0c) at /data/qemu-kvm/block.c:1362
> >> #4  0x0000000000472991 in bdrv_write_em (bs=0xc4e1d0, sector_num=14486688, buf=0xd67200 "H\a", nb_sectors=16) at /data/qemu-kvm/block.c:1736
> >> #5  0x0000000000435581 in ide_sector_write (s=0xc92650) at /data/qemu-kvm/hw/ide/core.c:622
> >> #6  0x0000000000425fc2 in kvm_handle_io (env=<value optimized out>) at /data/qemu-kvm/kvm-all.c:553
> >> #7  kvm_run (env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:964
> >> #8  0x0000000000426049 in kvm_cpu_exec (env=0x1000) at /data/qemu-kvm/qemu-kvm.c:1651
> >> #9  0x000000000042627d in kvm_main_loop_cpu (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1893
> >> #10 ap_main_loop (_env=<value optimized out>) at /data/qemu-kvm/qemu-kvm.c:1943
> >> #11 0x00007f48ae89d070 in start_thread () from /lib64/libpthread.so.0
> >> #12 0x00007f48abf0711d in clone () from /lib64/libc.so.6
> >> #13 0x0000000000000000 in ?? ()
> >> (gdb) print ((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> >> $5 = (struct QCowL2Meta *) 0xcb3568
> >> (gdb) print *((BDRVQcowState *)bs->opaque)->cluster_allocs.lh_first 
> >> $6 = {offset = 7417176064, n_start = 0, nb_available = 16, nb_clusters = 0, depends_on = 0xcb3568, dependent_requests = {lh_first = 0x0}, next_in_flight = {le_next = 0xcb3568, le_prev = 0xc4ebd8}}
> >>
> >> So next == first.
> >>
> > 
> > Seen the exact same bug twice in a row while installing FC12 with IDE
> > disk, current qemu-kvm.git. 
> > 
> > qemu-system-x86_64 -drive file=/root/images/fc12-ide.img,cache=writeback \
> > -m 1000  -vnc :1 \
> > -net nic,model=virtio \
> > -net tap,script=/root/ifup.sh -serial stdio \
> > -cdrom /root/iso/linux/Fedora-12-x86_64-DVD.iso \
> > -monitor telnet::4445,server,nowait -usbdevice tablet
> > 
> > Can't reproduce though.
> 
> In current git master? That's interesting news. I had kind of expected
> it would be fixed with c644db3d.

Yes, with 31b460256 more precisely. And the symptom was the same as Jan
reported, cluster_allocs.lh_first had le_next pointing to itself.

Perhaps you can add an assert there, so it abort()'s in that case along
with some useful information? I'll try to reproduce.
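
One possible shape for such a check, as a minimal sketch: the types and
function names are hypothetical and not taken from the qcow2 code. The
insert helper walks the list first and aborts via assert() if the element
is already queued, so the process stops at the offending insert instead of
spinning in a later traversal:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified element and list head. */
struct entry {
    struct entry *le_next;
    struct entry **le_prev;
};

struct entry_list {
    struct entry *lh_first;
};

/* Linear scan; assumes the list is still acyclic, which holds as long as
 * every insert goes through entry_insert_head_checked(). */
bool entry_is_queued(const struct entry_list *head, const struct entry *elm)
{
    const struct entry *cur;

    for (cur = head->lh_first; cur != NULL; cur = cur->le_next) {
        if (cur == elm) {
            return true;
        }
    }
    return false;
}

/* Head insert that refuses to queue the same element twice. */
void entry_insert_head_checked(struct entry_list *head, struct entry *elm)
{
    assert(!entry_is_queued(head, elm));    /* abort on double insert */

    elm->le_next = head->lh_first;
    if (elm->le_next != NULL) {
        head->lh_first->le_prev = &elm->le_next;
    }
    head->lh_first = elm;
    elm->le_prev = &head->lh_first;
}
```

The scan costs time linear in the number of in-flight allocations, so a
check like this is better suited to a debugging build than to a hot path.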
