From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:51231)
	by lists.gnu.org with esmtp (Exim 4.71) (envelope-from )
	id 1cHTeu-0001Na-7y for qemu-devel@nongnu.org;
	Thu, 15 Dec 2016 05:53:09 -0500
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from ) id 1cHTep-0004rr-CX for qemu-devel@nongnu.org;
	Thu, 15 Dec 2016 05:53:08 -0500
Received: from mx1.redhat.com ([209.132.183.28]:55270)
	by eggs.gnu.org with esmtps (TLS1.0:DHE_RSA_AES_256_CBC_SHA1:32)
	(Exim 4.71) (envelope-from ) id 1cHTep-0004m3-6n
	for qemu-devel@nongnu.org; Thu, 15 Dec 2016 05:53:03 -0500
Date: Thu, 15 Dec 2016 10:52:57 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20161215105257.GD2509@work-vm>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
Subject: Re: [Qemu-devel] commit virtio: recalculate vq->inuse after migration might cause last_avail_idx vs. used_idx failure
To: Halil Pasic
Cc: Christian Borntraeger, QEMU Developers, Stefan Hajnoczi

* Halil Pasic (pasic@linux.vnet.ibm.com) wrote:
> We have a migration problem, which is in my opinion caused by a
> deficiency in how vq->inuse is calculated after the migration (commit
> bccdef6b "virtio: recalculate vq->inuse after migration" to
> blame).
>
>
> We got a bugreport with this log for a live migration target.
>
> 2016-12-13T18:59:03.647309Z qemu-system-s390x: VQ 1 size 0x100 < last_avail_idx 0x2f76 - used_idx 0x762f

Is this actually an endian problem - 2f76 vs 762f?

Dave

> 2016-12-13T18:59:03.647385Z qemu-system-s390x: error while loading state for instance 0x0 of device '/fe.0.0001/virtio-net'
> 2016-12-13T18:59:03.647540Z qemu-system-s390x: load of migration failed: Operation not permitted
> 2016-12-13 18:59:03.796+0000: shutting down, reason=failed
>
> They use QEMU version 2.7 but looking at the current git master
> I think this did not get fixed in the meanwhile.
>
> So here goes the argument. The recalculation is done like this:
>
> +    vdev->vq[i].inuse = vdev->vq[i].last_avail_idx -
> +                        vdev->vq[i].used_idx;
>
> This does not seem correct when last_avail_idx has already
> wrapped around but used_idx not yet. We see from the log that
> last_avail_idx (0x2f76) is less than used_idx (0x762f) thus
> inuse (of type int) ends up being negative.
>
> +    if (vdev->vq[i].inuse > vdev->vq[i].vring.num) {
>
> Because vdev->vq[i].vring.num is unsigned int ala usual arithmetic
> conversions ("Otherwise, if the operand that has unsigned integer type
> has rank greater or equal to the rank of the type of the other operand,
> then the operand with signed integer type is converted to the type of
> the operand with unsigned integer type." C99) inuse gets converted to
> unsigned int.
>
> Thus the check fails and produces the log cited above.
>
> +        error_report("VQ %d size 0x%x < last_avail_idx 0x%x - "
> +                     "used_idx 0x%x",
> +                     i, vdev->vq[i].vring.num,
> +                     vdev->vq[i].last_avail_idx,
> +                     vdev->vq[i].used_idx);
> +        return -1;
> +    }
>
> Do we want to try to fix this for 2.8? I already have a small patch prepared.
>
> Regards,
> Halil
>
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
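
To make the arithmetic in this thread easy to check by hand, here is a
minimal standalone C sketch. It is not QEMU code. The function names are
hypothetical, and the 16-bit index type is an assumption; only the int
type of inuse and the unsigned int type of vring.num come from the
quoted text. inuse_mod16() is an illustrative wrap-aware variant, not
the patch mentioned above.

```c
#include <stdint.h>

/* Swap the two bytes of a 16-bit value (hypothetical helper). */
static uint16_t swap16(uint16_t v)
{
    return (uint16_t)((v << 8) | (v >> 8));
}

/*
 * The recalculation as quoted: a plain subtraction stored in an int.
 * Assumption: both indices are 16-bit, so they promote to int and the
 * result can be negative.
 */
static int inuse_as_quoted(uint16_t last_avail_idx, uint16_t used_idx)
{
    return last_avail_idx - used_idx;
}

/*
 * The quoted range check. In "inuse > num" C converts the int operand
 * to unsigned int implicitly; the cast only makes that step visible.
 * Returns nonzero when the check would reject the state.
 */
static int check_rejects(int inuse, unsigned int num)
{
    return (unsigned int)inuse > num;
}

/*
 * Hypothetical wrap-aware variant: reduce the difference modulo 2^16
 * so that an index that has wrapped still yields a small distance.
 */
static uint16_t inuse_mod16(uint16_t last_avail_idx, uint16_t used_idx)
{
    return (uint16_t)(last_avail_idx - used_idx);
}
```

With the values from the log, inuse_as_quoted(0x2f76, 0x762f) is -18105
and check_rejects() reports a failure for it against 0x100, which
matches the error message. inuse_mod16(0x2f76, 0x762f) is 0xb947, still
far above 0x100, so wraparound handling alone would not bring these two
values into range. swap16(0x762f) is 0x2f76, and
inuse_mod16(0x2f76, swap16(0x762f)) is 0, which is what the endianness
question is getting at. None of this establishes the root cause.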