From mboxrd@z Thu Jan 1 00:00:00 1970 From: Tony Krowiak Date: Thu, 02 Aug 2018 21:59:42 +0000 Subject: Re: [PATCH v7 22/22] s390: doc: detailed specifications for AP virtualization Message-Id: <8b9476dd-cb65-ba88-cd0d-71dbc364a1ba@linux.ibm.com> In-Reply-To: References: To: linux-s390@vger.kernel.org, kvm@vger.kernel.org List-ID: On 08/02/2018 11:25 AM, Alex Williamson wrote: > On Wed, 1 Aug 2018 19:05:35 -0400 > Tony Krowiak wrote: > >> On 07/27/2018 01:44 PM, Alex Williamson wrote: >>> On Thu, 26 Jul 2018 21:54:29 +0200 >>> Christian Borntraeger wrote: >>> ... >>>> +The process for reserving an AP queue for use by a KVM guest is: >>>> + >>>> +* The vfio-ap driver during its initialization will perform the following: >>>> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap >>>> + * Create the 'matrix' device in the 'vfio_ap' root >>>> + * Register the matrix device with the device core >>>> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and >>>> + newer) and to provide the vfio_ap driver's probe and remove callback >>>> + interfaces. The reason why older devices are not supported is because there >>>> + are no systems available on which to test. >>>> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be >>>> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain >>>> + mask to /sys/bus/ap/aqmask. >>>> + >>>> + For example to reserve all the AP Queues on the system for vfio_ap driver: >>>> + >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask >>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask >>> And this is a reasonable user syntax? ;) >> I am not a fan of writing a 256-bit bitmask as such. It is entirely too >> difficult to >> figure out which character to specify to indicate a bit number to be set >> for anything >> past the first few characters. Of course, tooling could be used to >> accomplish this >> task, but the sysfs interfaces should probably be user-friendly. A >> comma-separated >> list could be used; for example to set reserve adapters 21, 99 and 249: >> >> echo 21,99,249 > /sys/bus/ap/apmask >> >> The problem with this is how do we indicate that all bits are to be >> cleared? Do you >> have any suggestions for a reasonable user syntax? > Perhaps an empty write, just a carriage return, "null". I'd shy away > from magic digits, like 256 to clear. The crypto team has decided to eliminate these sysfs attributes for the AP bus and rely on bind/unbind. > >>> ... >>>> + * mdev_attr_groups >>>> + This attribute group identifies the user-defined sysfs attributes of the >>>> + mediated device. When a device is registered with the VFIO mediated device >>>> + framework, the sysfs attributes files identified in the 'mdev_attr_groups' >>>> + structure will be created in the mediated matrix device's directory. The >>>> + sysfs attributes for a mediated matrix device are: >>>> + * assign_adapter: >>>> + A write-only file for assigning an AP adapter to the mediated matrix >>>> + device. To assign an adapter, the APID of the adapter is written to the >>>> + file. >>>> + * assign_domain: >>>> + A write-only file for assigning an AP usage domain to the mediated matrix >>>> + device. To assign a domain, the APQI of the AP queue corresponding to a >>>> + usage domain is written to the file. >>>> + * matrix: >>>> + A read-only file for displaying the APQNs derived from the adapters and >>>> + domains assigned to the mediated matrix device. >>>> + * assign_control_domain: >>>> + A write-only file for assigning an AP control domain to the mediated >>>> + matrix device. To assign a control domain, the ID of a domain to be >>>> + controlled is written to the file. For the initial implementation, the set >>>> + of control domains will always include the set of usage domains, so it is >>>> + only necessary to assign control domains that are not also assigned as >>>> + usage domains. >>> How will the user know when this changes? What's the transition plan >>> to maintain compatibility when it does change? >> I'm sorry, I don't understand the question. When you say user, about >> whom are >> you talking; the person doing the assignments? What changes are you talking >> about? > The user would be the person interaction with the sysfs attribute > assign_control_domain where it states "[f]or the initial implementation, > the set of control domains will always include the set of usage > domains, so it is only necessary to assign control domains that are not > also assigned as usage domains." I infer from the usage of "initial > implementation" that this behavior may not always be the case and > therefore if it's not final what is the transition plan so that a user > interacting with this attribute can know the current behavior. Ah, I get your point now. I see how this verbiage would lead you to that conclusion and it will be removed. This behavior will not change. > >>> ... >>>> +4. The administrator now needs to configure the matrixes for mediated >>>> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2). >>>> + >>>> + This is how the matrix is configured for Guest1: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 6 > assign_adapter >>>> + echo 4 > assign_domain >>>> + echo 0xab > assign_domain >>>> + >>>> + For this implementation, all usage domains - i.e., domains assigned >>>> + via the assign_domain attribute file - will also be configured in the ADM >>>> + field of the KVM guest's CRYCB, so there is no need to assign control >>>> + domains here unless you want to assign control domains that are not >>>> + assigned as usage domains. >>>> + >>>> + If a mistake is made configuring an adapter, domain or control domain, >>>> + you can use the unassign_xxx files to unassign the adapter, domain or >>>> + control domain. >>> Would it be more consistent with other sysfs entries to call these >>> "bind" and "unbind" rather than "assign" and "unassign >> Aren't the bind and unbind sysfs interfaces typically associated with >> binding devices to and unbinding devices from a device driver? > Yes > >> In this >> case we are talking about configuring a mediated device's AP matrix - i.e., >> setting bits identifying the adapters, domains and control domains defined >> for the mediated device (i.e., a guest). Maybe there are better words we >> could use than assign/unassign, but I find the use of bind/unbind to be >> confusing. I'm open to suggestions. > If we remove the 'activate' interface and dedicate resources to the > mdev device at the point where we create an intersection in the matrix, > wouldn't that resource be "bound" to the mdev at that time? This is > not a strong issue for me, I just thought "assign" seemed inconsistent > as the operation felt more like a traditional "bind". We are going to remove the activate for the next series, but stick with the assign terminology because we're not really binding devices, but constructing a matrix that identifies the AP queues to which the guest using the mdev will have access. > >>>> + >>>> + To display the matrix configuration for Guest1: >>>> + >>>> + cat matrix >>>> + >>>> + This is how the matrix is configured for Guest2: >>>> + >>>> + echo 5 > assign_adapter >>>> + echo 0x47 > assign_domain >>>> + echo 0xff > assign_domain >>>> + >>>> +5. The adminstrator now needs to activate the mediated devices. >>>> + echo 1 > activate >>> Or optionally not. As in reply to cover letter, I don't think this >>> interface is sufficiently justified, or necessarily desirable. >> To clarify your objection, let me state what I think you are saying. >> >> 1. You do not think there is a good reason to assign duplicate APQNs >> - i.e., the cross product of all adapter IDs and domain IDs assigned >> to the mdev. > I think the cross product of adapters and domains is a basic part of > the design here, what I object to is that mdevs can exist with > overlapping cross products and which one can be activated is dependent > on activation ordering. For example, if I have the following mdevs: > > {adapters}{domains} > A: {0,1}{0,1} > B: {0}{0} > C: {0}{1} > D: {1}{0} > E: {1}{1} > > Who is supposed to understand and debug that A cannot be activated > while any of B/C/D/E are activated? What does the debugging process > look like? If the mdev is not explicitly activated, only opened, how > do we determine the fault is from a resource conflict on the mdev > definition? On the other hand if the cross product is committed at the > time it's specified, then a scenario where E already exists and we're > trying to create A might look like: > > 1. Specify adapter mask of {0,1} > 2. Specify domain mask of {0,1} <-- write(2) fails with -EBUSY > > Here we know that one or more of the domains we've specified are not > available on the set of adapters we've configured. This is moot for the next series because the activate will be removed and consistency checking will be done as adapters/domains are assigned. > >> 2. All mdevs should be validly configured for use regardless of when >> guests using them are started. In other words, libvirt should be >> able to depend on the fact that an mdev is ready for use at all >> times. > Yes, an mdev should represent committed resources. We support dynamic > creation and removal if the resources need to be freed for the host or > another mdev configuration. See my previous response. > >> I will discuss this with the team when we meet on Thursday and get >> back to you. If I am mistaken in my analysis of your concern, please >> clarify. >> >> I would like to mention that when a guest using an mdev starts, >> the mdev will be activated if not already activated. The guest >> will not start if another activated mdev is using an APQN assigned >> to the mdev of the guest being started. > Understood. > >>>> +6. Start Guest1: >>>> + >>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ... >>>> + >>>> +7. Start Guest2: >>>> + >>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \ >>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ... >>>> + >>>> +When the guest is shut down, the mediated matrix device may be removed. >>>> + >>>> +Using our example again, to remove the mediated matrix device $uuid1: >>>> + >>>> + /sys/devices/virtual/misc >>>> + --- [vfio_ap] >>>> + --------- [mdev_supported_types] >>>> + ------------ [vfio_ap-passthrough] >>>> + --------------- [devices] >>>> + ------------------ [$uuid1] >>>> + --------------------- activate >>>> + --------------------- remove >>>> + >>>> + >>>> + echo 0 > activate >>>> + echo 1 > remove >>>> + >>>> + This will release all the AP queues for the mediated device and >>>> + remove all of the mdev matrix device's sysfs structures. To >>>> + recreate and reconfigure the mdev matrix device, all of the steps starting >>>> + with step 4 will have to be performed again. >>>> + >>>> + It is not necessary to remove an mdev matrix device, but one may want to >>>> + remove it if no guest will use it during the lifetime of the linux host. If >>>> + the mdev matrix device is removed, one may want to unbind the AP queues the >>>> + guest was using from the vfio_ap device driver and bind them back to the >>>> + default driver. Alternatively, the AP queues can be configured for another >>>> + mdev matrix (i.e., guest). In either case, one must take care to change the >>>> + secure key configured for the domain to which the queue is connected. >>> This seems prime for data leakage, extremely sensitive data at that. >>> Who's responsibility is it to prevent this? Shouldn't closing the >>> device reset the device, which should perhaps wipe any key >>> configuration? >> The device is zeroized (reset) when it is removed. The zeroize >> instruction - to >> the best of my knowledge - does nothing with the secure key for a given >> queue. >> There is also no instruction that I know of to clear keys, so it would >> appear >> that this is left to the system administrator. > :-o That seems like a pretty serious fault. Given that this interface > is specifically for crypto devices where such sensitive information > seems common, doesn't it seem like the kernel, ie. the mdev vendor > driver, should provide guarantees that should your user instance crash > or be killed that keys are not available to the system admin? I may have misspoke with regard to my understanding of the AP reset. I've been told that the keys stored with the domain will be reset. I will verify this and get back to you. > >>>> + >>>> + >>>> +Limitations >>>> +=========== >>>> +An admin needs to be careful when writing to sysfs files >>>> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver >>>> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap >>>> +driver and assigned to a mediated device, admin should not re-assign the >>>> +AP queues back to the default driver, because of technical limitations. >>>> +The kernel does not prevent you from re-assigning from AP queues reserved >>>> +for the vfio_ap driver back to default driver. Future enhancements in >>>> +the ap_bus and vfio_ap are likely to introduce complete support for such >>>> +dynamic reconfiguration. But until then be extremely careful. >>> Why don't we have these enhancements now? Should the whole thing be >>> marked experimental without them? This sounds similar to a vfio-pci >>> case where a group with multiple devices could have device which is >>> unused by the user unbound from vfio-pci and re-bound to a native host >>> driver. We BUG_ON when this occurs to protect the data. >> We have customers that have been waiting a long time and are very anxious >> to have guest access to AP devices. In the interest of reducing time to >> market, we have decided to simplify the initial implementation as much >> as possible. I wouldn't necessarily call this experimental, but view it >> more as the minimal viable product. >> >> As far as BUG_ON, I will discuss this with the crypto team at our >> meeting on Thursday and get back to you. > A minimum viable product still needs to protect the host and does not > excuse us from later breaking the user exposed ABI. I will have a more thorough look into it. > >>> Probably also need to evaluate updates to >>> Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is >>> supposed to interact with an 'activate' attribute. Thanks, >> As it stands, libvirt will not create or modify the attributes of an mdev. >> If a guest is started under libvirt and its mdev has not been activated, >> it will be activated when the mdev fd is opened. Of course, if another >> activated mdev is assigned an APQN assigned to the mdev of the guest >> being started, the guest will be terminated. > Exactly why I disapprove of the activated interface, but it's being > proposed and recommended for use as a standard part of the lifecycle > management for these devices. Thanks, As I stated elsewhere, we are removing the activate interface. All consistency checking will be done as adapters/domains are assigned to the mdev. > > Alex >