From: Paolo Bonzini
Date: Tue, 16 May 2017 04:19:27 -0400 (EDT)
Subject: Re: [Qemu-devel] Reply: Re: [RFC] virtio-fc: draft idea of virtual fibre channel HBA
To: Hannes Reinecke
Cc: Lin Ma, Stefan Hajnoczi, Zhiqiang Zhou, Fam Zheng, qemu-devel@nongnu.org,
    stefanha@redhat.com, mst@redhat.com
Message-ID: <1282321742.7961608.1494922767010.JavaMail.zimbra@redhat.com>
In-Reply-To: <5c410c0e-f0b0-a049-84ed-7e31eb4e1dab@suse.de>

> Maybe a union with an overall size of 256 bytes (to hold the iSCSI IQN
> string), which for FC carries the WWPN and the WWNN?

That depends on how you would like to do controller passthrough in
general. iSCSI doesn't have the 64-bit target ID, and doesn't have
(AFAIK) hot-plug/hot-unplug support, so it's less important than FC.
(A rough strawman of such a union is sketched at the end of this mail.)

> > 2) If the initiator ID is the moral equivalent of a MAC address,
> > shouldn't it be the host that provides the initiator ID to the guest in
> > the virtio-scsi config space? (From your proposal, I'd guess it's the
> > latter, but maybe I am not reading correctly.)
>
> That would be dependent on the emulation. For emulated SCSI disks I guess
> we need to specify it on the command line somewhere, but for SCSI
> passthrough we could grab it from the underlying device.

Wait, that would be the target ID. The initiator ID would be the NPIV
vport's WWNN/WWPN. It could be specified on the QEMU command line, or it
could be tied to some file descriptor (created and initialized by
libvirt, which has CAP_SYS_ADMIN, and then passed to QEMU; similar to
tap file descriptors).

> >> b) stop exposing the devices attached to that NPIV host to the guest
> >
> > What do you mean exactly?
> >
> That's one of the longer-term plans I have.
> When doing NPIV, currently all devices from the NPIV host appear on the
> host, including all partitions, LVM devices and what not.
[...]
> If we make the (guest) initiator ID identical to the NPIV WWPN we can
> tag the _host_ to not expose any partitions on any LUNs, making the
> above quite easy.

Yes, definitely.

> > At this point, I can think of several ways to do this, one being SG_IO
> > in QEMU while the others are more esoteric.
> >
> > 1) use virtio-scsi with userspace passthrough (current solution).
>
> With option (1) and the target/initiator ID extensions we should be able
> to get basic NPIV support to work, and would even be able to handle
> reservations in a sane manner.
Agreed, but I'm no longer so sure that the advantages outweigh the
disadvantages. Also, let's add the lack of FC-NVMe support to the
disadvantages.

> > 2) the exact opposite: use the recently added "mediated device
> > passthrough" (mdev) framework to present a "fake" PCI device to the
> > guest.
>
> (2) sounds interesting, but I'd have to have a look into the code to
> figure out if it could easily be done.

Not that easy, but it's the bread and butter of the hardware
manufacturers. If we want them to do it alone, (2) is the way. Both
nVidia and Intel are using it.

> > 3) handle passthrough with a kernel driver. Under this model, the guest
> > uses the virtio device, but the passthrough of commands and TMFs is
> > performed by the host driver.
> >
> > We can then choose whether to do it with virtio-scsi or with a new
> > virtio-fc.
>
> (3) would be feasible, as it would effectively mean 'just' updating the
> current NPIV mechanism. However, this would essentially lock us in for
> FC; any other type (think NVMe) will require yet another solution.

An FC-NVMe driver could also expose the same vhost interface, couldn't
it? FC-NVMe doesn't have to share the Linux code, but sharing the virtio
standard and the userspace ABI would be great.

In fact, the main advantage of virtio-fc would be that (if we define it
properly) it could be reused for FC-NVMe instead of having to extend
e.g. virtio-blk.

For example, virtio-scsi has request, to-device payload, response,
from-device payload. virtio-fc's request format could be the initiator
and target port identifiers, followed by FCP_CMD, to-device payload,
FCP_RSP, from-device payload. (A rough sketch of such a layout is at the
end of this mail.)

> > 4) same as (3), but in userspace with a "macvtap"-like layer (e.g.,
> > socket+bind creates an NPIV vport). This layer can work on some kind of
> > FCP encapsulation, not the raw thing, and virtio-fc could be designed
> > according to a similar format for simplicity.
>
> (4) would require raw FCP frame access, which is one thing we do _not_
> have. Each card (except for the pure FCoE ones like bnx2fc, fnic, and
> fcoe) only allows access to pre-formatted I/O commands, and has its own
> mechanism for generating sequence IDs etc. So anything requiring raw FCP
> access is basically out of the game.

Not raw. It could even be defined at the exchange level (plus some
special things for discovery and login services). But I agree that (4)
is a bit pie-in-the-sky.

> Overall, I would vote to specify a new virtio-scsi format _first_,
> keeping in mind all of these options.
> (1), (3), and (4) all require an update anyway :-)
>
> The big advantage I see with (1) is that it can be added with just some
> code changes to qemu and virtio-scsi. Every other option requires some
> vendor buy-in, which inevitably leads to more discussions, delays, and
> more complex interaction (changes to qemu, virtio, _and_ the affected
> HBAs).

I agree. But if we have to reinvent everything in a couple of years for
NVMe over fabrics, maybe it's not worth it.

> While we're at it: we also need a 'timeout' field in the virtio request
> structure. I even posted an RFC for it :-)

Yup, I've seen it. :)

Paolo
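
P.S.: to make the ID extension idea at the top of this mail a bit more
concrete, here is a very rough strawman of the 256-byte union you
suggested. Nothing below exists in the current virtio-scsi headers; all
names and sizes are made up for illustration only.

    #include <linux/types.h>

    /*
     * Strawman only: a 256-byte initiator/target port ID.  For FC it
     * carries the WWPN/WWNN pair, for iSCSI the IQN string.  Purely
     * illustrative, not part of any existing virtio spec or header.
     */
    union virtio_scsi_port_id {
        struct {
            __le64 wwpn;            /* world-wide port name */
            __le64 wwnn;            /* world-wide node name */
            __u8   reserved[240];   /* pad the FC variant to 256 bytes */
        } fc;
        char iscsi_iqn[256];        /* NUL-terminated iSCSI IQN string */
    };

Whether a separate discriminator field is needed, or the transport is
implied by the device and its feature bits, is of course a separate
question.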
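
P.P.S.: and, in the same spirit, a rough sketch of the virtio-fc request
layout described above (initiator and target port identifiers, followed
by FCP_CMD, to-device payload, FCP_RSP, from-device payload). Again,
every name and size here is made up and would have to be pinned down in
an actual spec proposal.

    #include <linux/types.h>

    /*
     * Strawman only: virtio-fc request/response headers mirroring the
     * virtio-scsi shape (request, to-device payload, response,
     * from-device payload) but carrying FCP IUs plus explicit port IDs.
     */
    #define VIRTIO_FC_CMD_IU_SIZE 64    /* room for an FCP_CMND IU (guess) */
    #define VIRTIO_FC_RSP_IU_SIZE 96    /* room for an FCP_RSP IU (guess) */

    /* Device-readable part: who is talking to whom, plus the FCP_CMND IU. */
    struct virtio_fc_cmd_req {
        __le64 initiator_wwpn;      /* NPIV vport issuing the command */
        __le64 initiator_wwnn;
        __le64 target_wwpn;         /* remote port addressed by the command */
        __le64 target_wwnn;
        __u8   fcp_cmd[VIRTIO_FC_CMD_IU_SIZE];  /* FCP_CMND IU as on the wire */
        /* ... followed by the to-device data buffers, if any */
    };

    /* Device-writable part: the FCP_RSP IU coming back from the target. */
    struct virtio_fc_cmd_resp {
        __u8 fcp_rsp[VIRTIO_FC_RSP_IU_SIZE];    /* FCP_RSP IU as returned */
        /* ... followed by the from-device data buffers, if any */
    };

If something like this works for FCP, it should also be reusable for
FC-NVMe with a different IU payload, which is the main attraction of
defining virtio-fc at this level.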