From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from eggs.gnu.org ([208.118.235.92]:47365) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TLEdl-0007SI-7J for qemu-devel@nongnu.org; Mon, 08 Oct 2012 10:49:11 -0400 Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1TLEdj-0008QO-1Y for qemu-devel@nongnu.org; Mon, 08 Oct 2012 10:49:05 -0400 Received: from mx1.redhat.com ([209.132.183.28]:23300) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1TLEdi-0008QI-OB for qemu-devel@nongnu.org; Mon, 08 Oct 2012 10:49:02 -0400 Received: from int-mx02.intmail.prod.int.phx2.redhat.com (int-mx02.intmail.prod.int.phx2.redhat.com [10.5.11.12]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id q98En15W017366 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Mon, 8 Oct 2012 10:49:01 -0400 Message-ID: <5072E838.8080104@redhat.com> Date: Mon, 08 Oct 2012 16:50:32 +0200 From: Hans de Goede MIME-Version: 1.0 References: <1349682696-3046-1-git-send-email-hdegoede@redhat.com> <1349682696-3046-5-git-send-email-hdegoede@redhat.com> <5072CC04.4010300@redhat.com> In-Reply-To: <5072CC04.4010300@redhat.com> Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Subject: Re: [Qemu-devel] [PATCH 04/12] usb: Add support for input pipelining List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , To: Gerd Hoffmann Cc: qemu-devel@nongnu.org Hi, On 10/08/2012 02:50 PM, Gerd Hoffmann wrote: > On 10/08/12 09:51, Hans de Goede wrote: >> Currently we effectively only do pipelining for output endpoints, since >> controllers will stop filling the queue for a packet with stop the queue >> on a short read semantics enabled. > >> This causes large input transfers to get split into multiple packets, which >> comes with a serious performance penalty. >> >> This patch makes it possible to do pipelining for input endpoints, by >> re-combining packets which belong to 1 transfer at the guest device-driver >> level into 1 large packet before sending them to the device, this is mostly >> useful as a significant performance improvement for redirection of real USB >> devices. > > This splitting and recombining is too much magic in the core layer for > my taste. I disagree, let me try to explain below. > I think the core work flow should stay as-is. Instead we should add > more meta data to USBPacket (stop-queue-on-short-read bit, maybe > interrupt-on-complete too). We agree on the meta data :) > Continue queuing packets even with > stop-queue-on-short-read set. This can only happen if we combine them first. We *must* ensure that we do not read additional data after a short completion and add that to a packet belonging to the previous short ended transfer, or if we end up cancelling after the short completion, throwing away that data, otherwise the guest and device can get out of sync. Which is exactly what was happening with the uhci code, before I added the SPD tests there. Another important reason for combining is that in most cases, we only see short-not-ok flags because the device-driver in the guest has asked for a single bulk in transfer which does not fit in a single td, so it gets split, and to make sure that if it ends up short, the problem of data from another part of the protocol (ie usb mass storage bulk only transport data versus sense) ending up in tds belonging to the previous transfer does not happen, the short-not-ok flag gets set on all packets but the last one. Notice btw that the xhci controller has no notion of short-not-ok, instead it simply allows for very large bulk transfers to be submitted in one "atomic" entity. So far to not cause this mixup of to which protocol phase in-data belongs, we do not queue up packets to the device past a short-not-ok boundary, but this means for serial <-> usb converters that instead of doing: submit 1st 256 byte bulk in submit 2nd 256 byte bulk in resubmit 256 byte bulk in We do: submit 1st 64/256 byte bulk in submit 2nd 64/256 byte bulk in Meaning that for receiving 256 bytes we pay 4 times the round trip time instead of 1 time, and since we do not submit further packets after a short-not-ok packet, that we don't have a 2nd 256 byte transfer queued up ready to receive more data while the 1st one is being processed. Making reading from such a device highly unreliable if there is any sort of load on the host. The whole combining "magic" results in a single 256 byte packet, which does not have its short-not-ok flag set, allowing true pipelining and having another in transfer queued up to keep the device busy will the 1st one is being processed. Re-combining the packets into a single transfer, is the *only* way to ensure that a short completion cannot end up reading some data from the next protocol phase. One possible alternative to re-combining would be to pass the short-not-ok flag along the chain all the way down into usbfs, but that won't work since the USBFS_URB_SHORT_NOT_OK flag does not work with XHCI controllers! There has even been discussion about outright refusing packet submissions with that flag set in newer Linux kernels, but that would cause regressions for apps which set it when they don't really need it (and most don't). So since more and more machines are using XHCI controllers, passing the short-not-ok flag along is not a valid option. So lets try to split this discussion into multiple questions: 1) Can we agree that re-combining input packets back into their original transfers as done at the guest device driver level, is useful and given the serial <-> usb converters case, even necessary to have ? 2) If we agree on 1), what is the right place to do them ? To be able to properly combine packets into one larger transfer, an uncombine them again later, some coordination is needed between the entity doing the combining / uncombining and the hcd code: 2a) The combining entity needs to know when the hcd is done with queuing up packets for an ep. 2b) When uncombining it is possible that less packets are actually used / filled with data, then were combined into one transfer. When this happens the hcd code will stop processing the queue after the short packet, and wait for the guest to restart it. When the guest restarts the queue it will be modified and the un-used packets will be gone from it. Since the hcd code keeps its own queue, when when uncombining the hcd code needs to be told to forget about the unused packets. Note that currently we have the same problem on a halt of the ep queue already, and currently we expect the hcd to completely empty the queue on a stall. I'm planning to change this code path over to also let the core tell the hcd to forget about the packets it had queued beyond the stall, as this seems something to do at the core level rather then reproducing it in every hcd emulation separately. Given the necessary close interaction between the hcd code and the combine / uncombine code + not wanting to reproduce the same code in both host-linux.c and redirect.c I believe that the usb core is the right place for this. > Maybe add a callback to notify USBDevice > when the host controller is done filling the queue, so USBDevices can > process all queued packets in one go (bottom half would work too > though). Changing the callback which I added for this to a bh is fine with me, I think that both from a making clear what is going on when reading the code, and from a having all the data hot in cache pov, the callback is better though. > Then expect USBDevices to get things right based on the > USBPacket flags passed on (i.e. have host-linux use > USBFS_URB_BULK_CONTINUATION as needed). Note that the need for host-linux to use USBFS_URB_BULK_CONTINUATION + SHORT_NOT_OK flags with older kernels, and to simply no longer split with newer kernels, already exists! A single ehci td can span 20k, and we split at 16k, so an input transfer ending short in the first 16k will already cause the wrong thing to happen (the sense data of usb-ms gets read as the next 4k), without input pipelining coming into play at all, and with the xhci emulation we can already get much larger in transfers... The reason why I disable input pipelining for host-linux for now in my patchset is because it will make the problem worse, not because the problem is new. For the full details on all the magic needed to correctly handle larger input transfers on usbdevfs with all possible different kernel versions, please see: https://github.com/libusbx/libusbx/commit/ede02ba91920f9be787a7f3cd006c5a4b92b5eab Or just switch to libusb ... > This way usb-host and usbredir will see what is really going on instead > of having the usb core magically fixing up stuff for them. I believe this is based on you thinking simply passing along a short-not-ok flag all the way down into usbfs will be enough, but it definitely is not enough, since the XHCI hardware has no notion of short-not-ok, the only way to do input pipelinig with an XHCI attached device is to re-combine the split packets into a single large input transfer. And the proper place for such combining code is in the core IMHO. > I think this > will serve us better long-term. Maybe you are seeing this "data *and* > stall" issue (patch 1) exactly because of all this combining & splitting > logic? I'm not seeing the data and stall condition in practice at all, but while designing all this I noticed that this can happen. a single ehci td can be a multiple of the endpoints maxPacketSize and then the endpoint can respond with maxPacketSize data, maxPacketSize data, stall at which point we will have both data and a stall condition in one. And the Linux ehci + usbfs + libusb + usbredir code all handle passing along both data and a stall as a result for one packet / transfer, so this is something which eventually we will need to support in the qemu usb core too. For now I decided on the quick fix in usbredir to cover the eventuality of this actually happening and added a warning to see if this actually ever happens (which it seems not to in my testing done sofar). Regards, Hans