* [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
@ 2016-12-14 19:12 Halil Pasic
2016-12-15 8:24 ` Stefan Hajnoczi
2016-12-15 10:52 ` Dr. David Alan Gilbert
0 siblings, 2 replies; 8+ messages in thread
From: Halil Pasic @ 2016-12-14 19:12 UTC (permalink / raw)
To: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
We have a migration problem, which is in my opinion caused by a
deficiency in how vq->inuse is calculated after the migration (commit
bccdef6b "virtio: recalculate vq->inuse after migration" to
blame).
We got a bug report with this log from a live migration target:
2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f
2016-12-13T18:59:03.647385Z qemu-system-s390x: error while loading state for instance 0x0 of device '/fe.0.0001/virtio-net'
2016-12-13T18:59:03.647540Z qemu-system-s390x: load of migration failed: Operation not permitted
2016-12-13 18:59:03.796+0000: shutting down, reason=failed
They use QEMU version 2.7, but looking at the current git master
I think this has not been fixed in the meantime.
So here goes the argument. The recalculation is done like this:
+ vdev->vq[i].inuse = vdev->vq[i].last_avail_idx -
+ vdev->vq[i].used_idx;
This does not seem correct when last_avail_idx has already
wrapped around but used_idx has not yet. We see from the log that
last_avail_idx (0x2f76) is less than used_idx (0x762f), thus
inuse (of type int) ends up being negative.
+ if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
Because vdev->vq[i].vring.num is unsigned int, per the usual arithmetic
conversions ("Otherwise, if the operand that has unsigned integer type
has rank greater or equal to the rank of the type of the other operand,
then the operand with signed integer type is converted to the type of
the operand with unsigned integer type." C99) inuse gets converted to
unsigned int.
Thus the negative inuse becomes a huge unsigned value, the comparison is
true, and the check fails and produces the log cited above.
+ error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
+ "used_idx 0x%x",
+ i, vdev->vq[i].vring.num,
+ vdev->vq[i].last_avail_idx,
+ vdev->vq[i].used_idx);
+ return -1;
+ }
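To make the conversion argument concrete, here is a minimal standalone sketch of the recalculation and the check. The struct vq_model and the function inuse_check_rejects() are hypothetical names used only for illustration, not QEMU code, and the sketch assumes int is wider than 16 bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

static_assert(sizeof(int) > sizeof(uint16_t),
              "example assumes int is wider than 16 bits");

/* Hypothetical stand-in for the VirtQueue fields used by the check. */
struct vq_model {
    unsigned int num;        /* models vring.num, the queue size */
    uint16_t last_avail_idx;
    uint16_t used_idx;
    int inuse;
};

/* Mirrors the recalculation and range check quoted above.
 * Returns true when the check would reject the migrated state. */
static bool inuse_check_rejects(struct vq_model *vq)
{
    /* Both uint16_t operands are promoted to int, so the difference
     * can be negative. */
    vq->inuse = vq->last_avail_idx - vq->used_idx;

    /* inuse is converted to unsigned int for this comparison, so a
     * negative value turns into a very large one. */
    return vq->inuse > vq->num;
}
```

With num = 0x100, last_avail_idx = 0x2f76 and used_idx = 0x762f, inuse comes out as 12150 - 30255 = -18105 and inuse_check_rejects() returns true, which matches the error in the log.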
Do we want to try to fix this for 2.8? I already have a small patch prepared.
Regards,
Halil
^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-14 19:12 [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure Halil Pasic
@ 2016-12-15 8:24 ` Stefan Hajnoczi
2016-12-15 10:52 ` Dr. David Alan Gilbert
1 sibling, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2016-12-15 8:24 UTC (permalink / raw)
To: Halil Pasic; +Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
On Wed, Dec 14, 2016 at 08:12:17PM +0100, Halil Pasic wrote:
> We have a migration problem, which is in my opinion caused by a
> deficiency in how vq->inuse is calculated after the migration (commit
> bccdef6b "virtio: recalculate vq->inuse after migration" to
> blame).
>
>
> We got a bugreport with this log for a live migration target.
>
> 2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f
Thanks for spotting the signedness bug described below. Regardless of
the bug, last_avail_idx 0x2f76 - used_idx 0x762f is still an invalid
state and should be treated as an error. The virtqueue only has 256
elements so there's no way these descriptor indices can be valid. I
wanted to point that out since there must be another problem remaining
somewhere.
> 2016-12-13T18:59:03.647385Z qemu-system-s390x: error while loading state for instance 0x0 of device '/fe.0.0001/virtio-net'
> 2016-12-13T18:59:03.647540Z qemu-system-s390x: load of migration failed: Operation not permitted
> 2016-12-13 18:59:03.796+0000: shutting down, reason=failed
>
> They use QEMU version 2.7 but looking at the current git master
> I think this did not get fixed in the meanwhile.
>
> So here goes the argument. The recalculation is done like this:
>
> + vdev->vq[i].inuse = vdev->vq[i].last_avail_idx -
> + vdev->vq[i].used_idx;
>
> This does not seem correct when last_avail_idx has already
> wrapped around but used_idx not yet. We see from the log that
> last_avail_idx (0x2f76) is less than used_idx (0x762f), thus
> inuse (of type int) ends up being negative.
Good catch. This works:
vdev->vq[i].inuse = (uint16_t)(vdev->vq[i].last_avail_idx - vdev->vq[i].used_idx);
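A small sketch of why the cast helps, assuming int is wider than 16 bits: truncating the promoted difference to uint16_t yields the distance between the two free-running indices modulo 65536. The helper name recalc_inuse() is hypothetical and only wraps the expression above.

```c
#include <assert.h>
#include <stdint.h>

static_assert(sizeof(int) > sizeof(uint16_t),
              "example assumes int is wider than 16 bits");

/* Recalculate inuse with the uint16_t cast: the int difference is
 * truncated to 16 bits, which gives the distance modulo 65536. */
static int recalc_inuse(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}
```

For a legitimate wraparound such as recalc_inuse(0x0005, 0xfffe) the result is 7. For the values from the log, recalc_inuse(0x2f76, 0x762f) is 0xb947, which is still far above the queue size of 0x100, so that state is rejected either way.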
> Do we want to try to fix this for 2.8? I already have a small patch prepared.
Please send the fix for 2.8.1 (-stable). 2.8.0-rc4 is currently being
tagged and build tested, it's too late to change it.
Stefan
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-14 19:12 [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure Halil Pasic
2016-12-15 8:24 ` Stefan Hajnoczi
@ 2016-12-15 10:52 ` Dr. David Alan Gilbert
2016-12-15 11:32 ` Halil Pasic
1 sibling, 1 reply; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2016-12-15 10:52 UTC (permalink / raw)
To: Halil Pasic; +Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
* Halil Pasic (pasic@linux.vnet.ibm.com) wrote:
> We have a migration problem, which is in my opinion caused by a
> deficiency in how vq->inuse is calculated after the migration (commit
> bccdef6b "virtio: recalculate vq->inuse after migration" to
> blame).
>
>
> We got a bugreport with this log for a live migration target.
>
> 2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f
Is this actually an endian problem - 2f76 vs 762f ?
Dave
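The observation is easy to check mechanically. The following is only an illustrative sketch with hypothetical helper names (swap16, is_byte_swapped_pair); it says nothing about where a swap might happen in QEMU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Swap the two bytes of a 16-bit value. */
static inline uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/* True when b is a byte-swapped copy of a. */
static bool is_byte_swapped_pair(uint16_t a, uint16_t b)
{
    /* Swapping twice must give back the original value. */
    assert(swap16(swap16(a)) == a);
    return swap16(a) == b;
}
```

swap16(0x2f76) is 0x762f, so is_byte_swapped_pair(0x2f76, 0x762f) holds. If one of the two indices had been read with the wrong byte order, both would be equal after correction and inuse would be 0.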
> 2016-12-13T18:59:03.647385Z qemu-system-s390x: error while loading state for instance 0x0 of device '/fe.0.0001/virtio-net'
> 2016-12-13T18:59:03.647540Z qemu-system-s390x: load of migration failed: Operation not permitted
> 2016-12-13 18:59:03.796+0000: shutting down, reason=failed
>
> They use QEMU version 2.7 but looking at the current git master
> I think this did not get fixed in the meanwhile.
>
> So here goes the argument. The recalculation is done like this:
>
> + vdev->vq[i].inuse = vdev->vq[i].last_avail_idx -
> + vdev->vq[i].used_idx;
>
> This does not seem correct when last_avail_idx has already
> wrapped around but used_idx not yet. We see from the log that
> last_avail_idx (0x2f76) is less than used_idx (0x762f), thus
> inuse (of type int) ends up being negative.
>
> + if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
>
> Because vdev->vq[i].vring.num is unsigned int ala usual arithmetic
> conversions ("Otherwise, if the operand that has unsigned integer type
> has rank greater or equal to the rank of the type of the other operand,
> then the operand with signed integer type is converted to the type of
> the operand with unsigned integer type." C99) inuse gets converted to
> unsigned int.
>
> Thus the check fails and produces the log cited above.
>
> + error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
> + "used_idx 0x%x",
> + i, vdev->vq[i].vring.num,
> + vdev->vq[i].last_avail_idx,
> + vdev->vq[i].used_idx);
> + return -1;
> + }
>
> Do we want to try to fix this for 2.8? I already have a small patch prepared.
>
> Regards,
> Halil
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-15 10:52 ` Dr. David Alan Gilbert
@ 2016-12-15 11:32 ` Halil Pasic
2016-12-15 11:38 ` Dr. David Alan Gilbert
` (2 more replies)
0 siblings, 3 replies; 8+ messages in thread
From: Halil Pasic @ 2016-12-15 11:32 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
On 12/15/2016 11:52 AM, Dr. David Alan Gilbert wrote:
>> We got a bugreport with this log for a live migration target.
>>
>> 2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f
> Is this actually an endian problem - 2f76 vs 762f ?
>
> Dave
>
Thanks! It seems you are right:
static inline uint16_t vring_avail_idx(VirtQueue *vq)
{
    hwaddr pa;
    pa = vq->vring.avail + offsetof(VRingAvail, idx);
    vq->shadow_avail_idx = virtio_lduw_phys(vq->vdev, pa);

we should have endianness handling here before assigning shadow_avail_idx I guess

    return vq->shadow_avail_idx;
}
I will meditate a bit more on this and probably create a patch to fix it.
What makes me wonder is that according to the reports live migration usually
works (ca. 1% fails)...
Can I credit you as reporter in case I end up making a fix?
Halil
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-15 11:32 ` Halil Pasic
@ 2016-12-15 11:38 ` Dr. David Alan Gilbert
2016-12-15 13:37 ` Paolo Bonzini
2016-12-15 14:06 ` Stefan Hajnoczi
2 siblings, 0 replies; 8+ messages in thread
From: Dr. David Alan Gilbert @ 2016-12-15 11:38 UTC (permalink / raw)
To: Halil Pasic; +Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
* Halil Pasic (pasic@linux.vnet.ibm.com) wrote:
>
>
> On 12/15/2016 11:52 AM, Dr. David Alan Gilbert wrote:
> >> We got a bugreport with this log for a live migration target.
> >>
> >> 2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f
> > Is this actually an endian problem - 2f76 vs 762f ?
> >
> > Dave
> >
>
> Thanks! It seems you are right:
>
> static inline uint16_t vring_avail_idx(VirtQueue *vq)
> {
> hwaddr pa;
> pa = vq->vring.avail + offsetof(VRingAvail, idx);
> vq->shadow_avail_idx = virtio_lduw_phys(vq->vdev, pa);
>
> we should have endianness handling here before assigning shadow_avail_idx I guess
>
> return vq->shadow_avail_idx;
> }
>
> I will meditate a bit more on this and probably create a patch to fix it.
>
> What makes me wonder is that according to the reports live migration usually
> works (ca 1% fails)...
> Can I credit you as reporter in case I end up making a fix?
Sure if you want.
Dave
> Halil
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-15 11:32 ` Halil Pasic
2016-12-15 11:38 ` Dr. David Alan Gilbert
@ 2016-12-15 13:37 ` Paolo Bonzini
2016-12-15 16:16 ` Halil Pasic
2016-12-15 14:06 ` Stefan Hajnoczi
2 siblings, 1 reply; 8+ messages in thread
From: Paolo Bonzini @ 2016-12-15 13:37 UTC (permalink / raw)
To: Halil Pasic, Dr. David Alan Gilbert
Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
On 15/12/2016 12:32, Halil Pasic wrote:
> static inline uint16_t vring_avail_idx(VirtQueue *vq)
> {
> hwaddr pa;
> pa = vq->vring.avail + offsetof(VRingAvail, idx);
> vq->shadow_avail_idx = virtio_lduw_phys(vq->vdev, pa);
>
> we should have endianness handling here before assigning shadow_avail_idx I guess
>
> return vq->shadow_avail_idx;
> }
Endianness is already handled:
static inline uint16_t virtio_lduw_phys(VirtIODevice *vdev, hwaddr pa)
{
    if (virtio_access_is_big_endian(vdev)) {
        return lduw_be_phys(&address_space_memory, pa);
    }
    return lduw_le_phys(&address_space_memory, pa);
}
> I will meditate a bit more on this and probably create a patch to fix it.
>
> What makes me wonder is that according to the reports live migration usually
> works (ca 1% fails)...
What is the backtrace of the vring_avail_idx call? If your device is
virtio 1.0, and vdev->guest_features has not been initialized correctly,
you might incorrectly treat LE virtio 1.0 data as BE virtio 0.9 data:
    if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
        /* Devices conforming to VIRTIO 1.0 or later are always LE. */
        return false;
    }
    return true;
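As a rough standalone model of that failure mode (not QEMU code): the sketch below reduces the decision to the single feature bit from the branch quoted above and loads a 16-bit value from two raw bytes. All names (dev_model, access_is_big_endian, model_lduw) are hypothetical, and the real helper may take further target-dependent cases into account.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical device model: only the VERSION_1 feature bit matters. */
struct dev_model {
    bool has_version_1;
};

/* Simplified stand-in for the branch quoted above:
 * virtio 1.0 devices are LE, legacy devices are treated as BE. */
static bool access_is_big_endian(const struct dev_model *dev)
{
    return !dev->has_version_1;
}

/* Load a 16-bit value from two bytes of (modelled) guest memory. */
static uint16_t model_lduw(const struct dev_model *dev, const uint8_t *p)
{
    assert(dev != NULL && p != NULL);
    if (access_is_big_endian(dev)) {
        return (uint16_t)((p[0] << 8) | p[1]);
    }
    return (uint16_t)(p[0] | (p[1] << 8));
}
```

With the bytes { 0x2f, 0x76 } in memory, a model with has_version_1 unset reads 0x2f76 while one with it set reads 0x762f, so the same ring memory yields swapped indices if the feature bit is wrong at the time of the load.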
Thanks,
Paolo
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-15 13:37 ` Paolo Bonzini
@ 2016-12-15 16:16 ` Halil Pasic
0 siblings, 0 replies; 8+ messages in thread
From: Halil Pasic @ 2016-12-15 16:16 UTC (permalink / raw)
To: Paolo Bonzini, Dr. David Alan Gilbert
Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi
On 12/15/2016 02:37 PM, Paolo Bonzini wrote:
>
>
> On 15/12/2016 12:32, Halil Pasic wrote:
>> static inline uint16_t vring_avail_idx(VirtQueue *vq)
>> {
>> hwaddr pa;
>> pa = vq->vring.avail + offsetof(VRingAvail, idx);
>> vq->shadow_avail_idx = virtio_lduw_phys(vq->vdev, pa);
>>
>> we should have endianness handling here before assigning shadow_avail_idx I guess
>>
>> return vq->shadow_avail_idx;
>> }
>
> Endianness is already handled:
>
> static inline uint16_t virtio_lduw_phys(VirtIODevice *vdev, hwaddr pa)
> {
> if (virtio_access_is_big_endian(vdev)) {
> return lduw_be_phys(&address_space_memory, pa);
> }
> return lduw_le_phys(&address_space_memory, pa);
> }
Thanks Paolo, you are obviously right. Sorry for the noise.
>
>> I will meditate a bit more on this and probably create a patch to fix it.
>>
>> What makes me wonder is that according to the reports live migration usually
>> works (ca 1% fails)...
Seems I will have to get a dump and/or reproduce the problem myself
before I can tell what is going on there -- the guru saved me some
meditation.
>
> What is the backtrace of the vring_avail_idx call? If your device is
As far as I can see from the code the guest features should be already
loaded from the migration stream.
Thanks again!
Halil
> virtio 1.0, and vdev->guest_features has not been initialized correctly,
> you might incorrectly treat LE virtio 1.0 data as BE virtio 0.9 data:
>
> if (virtio_vdev_has_feature(vdev, VIRTIO_F_VERSION_1)) {
> /* Devices conforming to VIRTIO 1.0 or later are always LE. */
> return false;
> }
> return true;
>
> Thanks,
>
> Paolo
>
* Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
2016-12-15 11:32 ` Halil Pasic
2016-12-15 11:38 ` Dr. David Alan Gilbert
2016-12-15 13:37 ` Paolo Bonzini
@ 2016-12-15 14:06 ` Stefan Hajnoczi
2 siblings, 0 replies; 8+ messages in thread
From: Stefan Hajnoczi @ 2016-12-15 14:06 UTC (permalink / raw)
To: Halil Pasic
Cc: Dr. David Alan Gilbert, Christian Borntraeger, QEMU Developers,
Stefan Hajnoczi
On Thu, Dec 15, 2016 at 11:32 AM, Halil Pasic <pasic@linux.vnet.ibm.com> wrote:
> On 12/15/2016 11:52 AM, Dr. David Alan Gilbert wrote:
>>> We got a bugreport with this log for a live migration target.
>>>
>>> 2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f
>> Is this actually an endian problem - 2f76 vs 762f ?
>>
>> Dave
>>
>
> Thanks! It seems you are right:
Please still submit the uint16_t -> int conversion fix because what
you discovered is a real bug.
Stefan