* Re: [PATCH v7 22/22] s390: doc: detailed specifications for AP virtualization
[not found] <e8a516d2-84e7-ab3e-0ddf-f4ed4f09b727@linux.ibm.com>
@ 2018-08-02 21:59 ` Tony Krowiak
0 siblings, 0 replies; 2+ messages in thread
From: Tony Krowiak @ 2018-08-02 21:59 UTC (permalink / raw)
To: linux-s390, kvm
On 08/02/2018 11:25 AM, Alex Williamson wrote:
> On Wed, 1 Aug 2018 19:05:35 -0400
> Tony Krowiak <akrowiak@linux.ibm.com> wrote:
>
>> On 07/27/2018 01:44 PM, Alex Williamson wrote:
>>> On Thu, 26 Jul 2018 21:54:29 +0200
>>> Christian Borntraeger <borntraeger@de.ibm.com> wrote:
>>> ...
>>>> +The process for reserving an AP queue for use by a KVM guest is:
>>>> +
>>>> +* The vfio-ap driver during its initialization will perform the following:
>>>> + * Create the 'vfio_ap' root device - /sys/devices/virtual/misc/vfio_ap
>>>> + * Create the 'matrix' device in the 'vfio_ap' root
>>>> + * Register the matrix device with the device core
>>>> +* Register with the ap_bus for AP queue devices of type 10 devices (CEX4 and
>>>> + newer) and to provide the vfio_ap driver's probe and remove callback
>>>> + interfaces. The reason why older devices are not supported is because there
>>>> + are no systems available on which to test.
>>>> +* The admin needs to reserve the AP Queue for vfio_ap driver. This can be
>>>> + done by writing the AP adapter mask to /sys/bus/ap/apmask and the AP domain
>>>> + mask to /sys/bus/ap/aqmask.
>>>> +
>>>> + For example to reserve all the AP Queues on the system for vfio_ap driver:
>>>> +
>>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/apmask
>>>> + echo 0x0000000000000000000000000000000000000000000000000000000000000000 > /sys/bus/ap/aqmask
>>> And this is a reasonable user syntax? ;)
>> I am not a fan of writing a 256-bit bitmask as such. It is entirely too
>> difficult to
>> figure out which character to specify to indicate a bit number to be set
>> for anything
>> past the first few characters. Of course, tooling could be used to
>> accomplish this
>> task, but the sysfs interfaces should probably be user-friendly. A
>> comma-separated
>> list could be used; for example to set reserve adapters 21, 99 and 249:
>>
>> echo 21,99,249 > /sys/bus/ap/apmask
>>
>> The problem with this is how do we indicate that all bits are to be
>> cleared? Do you
>> have any suggestions for a reasonable user syntax?
> Perhaps an empty write, just a carriage return, "null". I'd shy away
> from magic digits, like 256 to clear.
The crypto team has decided to eliminate these sysfs attributes for the
AP bus
and rely on bind/unbind.
>
>>> ...
>>>> + * mdev_attr_groups
>>>> + This attribute group identifies the user-defined sysfs attributes of the
>>>> + mediated device. When a device is registered with the VFIO mediated device
>>>> + framework, the sysfs attributes files identified in the 'mdev_attr_groups'
>>>> + structure will be created in the mediated matrix device's directory. The
>>>> + sysfs attributes for a mediated matrix device are:
>>>> + * assign_adapter:
>>>> + A write-only file for assigning an AP adapter to the mediated matrix
>>>> + device. To assign an adapter, the APID of the adapter is written to the
>>>> + file.
>>>> + * assign_domain:
>>>> + A write-only file for assigning an AP usage domain to the mediated matrix
>>>> + device. To assign a domain, the APQI of the AP queue corresponding to a
>>>> + usage domain is written to the file.
>>>> + * matrix:
>>>> + A read-only file for displaying the APQNs derived from the adapters and
>>>> + domains assigned to the mediated matrix device.
>>>> + * assign_control_domain:
>>>> + A write-only file for assigning an AP control domain to the mediated
>>>> + matrix device. To assign a control domain, the ID of a domain to be
>>>> + controlled is written to the file. For the initial implementation, the set
>>>> + of control domains will always include the set of usage domains, so it is
>>>> + only necessary to assign control domains that are not also assigned as
>>>> + usage domains.
>>> How will the user know when this changes? What's the transition plan
>>> to maintain compatibility when it does change?
>> I'm sorry, I don't understand the question. When you say user, about
>> whom are
>> you talking; the person doing the assignments? What changes are you talking
>> about?
> The user would be the person interaction with the sysfs attribute
> assign_control_domain where it states "[f]or the initial implementation,
> the set of control domains will always include the set of usage
> domains, so it is only necessary to assign control domains that are not
> also assigned as usage domains." I infer from the usage of "initial
> implementation" that this behavior may not always be the case and
> therefore if it's not final what is the transition plan so that a user
> interacting with this attribute can know the current behavior.
Ah, I get your point now. I see how this verbiage would lead you to that
conclusion and it will be removed. This behavior will not change.
>
>>> ...
>>>> +4. The administrator now needs to configure the matrixes for mediated
>>>> + devices $uuid1 (for Guest1) and $uuid2 (for Guest2).
>>>> +
>>>> + This is how the matrix is configured for Guest1:
>>>> +
>>>> + echo 5 > assign_adapter
>>>> + echo 6 > assign_adapter
>>>> + echo 4 > assign_domain
>>>> + echo 0xab > assign_domain
>>>> +
>>>> + For this implementation, all usage domains - i.e., domains assigned
>>>> + via the assign_domain attribute file - will also be configured in the ADM
>>>> + field of the KVM guest's CRYCB, so there is no need to assign control
>>>> + domains here unless you want to assign control domains that are not
>>>> + assigned as usage domains.
>>>> +
>>>> + If a mistake is made configuring an adapter, domain or control domain,
>>>> + you can use the unassign_xxx files to unassign the adapter, domain or
>>>> + control domain.
>>> Would it be more consistent with other sysfs entries to call these
>>> "bind" and "unbind" rather than "assign" and "unassign
>> Aren't the bind and unbind sysfs interfaces typically associated with
>> binding devices to and unbinding devices from a device driver?
> Yes
>
>> In this
>> case we are talking about configuring a mediated device's AP matrix - i.e.,
>> setting bits identifying the adapters, domains and control domains defined
>> for the mediated device (i.e., a guest). Maybe there are better words we
>> could use than assign/unassign, but I find the use of bind/unbind to be
>> confusing. I'm open to suggestions.
> If we remove the 'activate' interface and dedicate resources to the
> mdev device at the point where we create an intersection in the matrix,
> wouldn't that resource be "bound" to the mdev at that time? This is
> not a strong issue for me, I just thought "assign" seemed inconsistent
> as the operation felt more like a traditional "bind".
We are going to remove the activate for the next series, but stick with
the assign terminology because we're not really binding devices, but
constructing a
matrix that identifies the AP queues to which the guest using the mdev
will have access.
>
>>>> +
>>>> + To display the matrix configuration for Guest1:
>>>> +
>>>> + cat matrix
>>>> +
>>>> + This is how the matrix is configured for Guest2:
>>>> +
>>>> + echo 5 > assign_adapter
>>>> + echo 0x47 > assign_domain
>>>> + echo 0xff > assign_domain
>>>> +
>>>> +5. The adminstrator now needs to activate the mediated devices.
>>>> + echo 1 > activate
>>> Or optionally not. As in reply to cover letter, I don't think this
>>> interface is sufficiently justified, or necessarily desirable.
>> To clarify your objection, let me state what I think you are saying.
>>
>> 1. You do not think there is a good reason to assign duplicate APQNs
>> - i.e., the cross product of all adapter IDs and domain IDs assigned
>> to the mdev.
> I think the cross product of adapters and domains is a basic part of
> the design here, what I object to is that mdevs can exist with
> overlapping cross products and which one can be activated is dependent
> on activation ordering. For example, if I have the following mdevs:
>
> {adapters}{domains}
> A: {0,1}{0,1}
> B: {0}{0}
> C: {0}{1}
> D: {1}{0}
> E: {1}{1}
>
> Who is supposed to understand and debug that A cannot be activated
> while any of B/C/D/E are activated? What does the debugging process
> look like? If the mdev is not explicitly activated, only opened, how
> do we determine the fault is from a resource conflict on the mdev
> definition? On the other hand if the cross product is committed at the
> time it's specified, then a scenario where E already exists and we're
> trying to create A might look like:
>
> 1. Specify adapter mask of {0,1}
> 2. Specify domain mask of {0,1} <-- write(2) fails with -EBUSY
>
> Here we know that one or more of the domains we've specified are not
> available on the set of adapters we've configured.
This is moot for the next series because the activate will be removed and
consistency checking will be done as adapters/domains are assigned.
>
>> 2. All mdevs should be validly configured for use regardless of when
>> guests using them are started. In other words, libvirt should be
>> able to depend on the fact that an mdev is ready for use at all
>> times.
> Yes, an mdev should represent committed resources. We support dynamic
> creation and removal if the resources need to be freed for the host or
> another mdev configuration.
See my previous response.
>
>> I will discuss this with the team when we meet on Thursday and get
>> back to you. If I am mistaken in my analysis of your concern, please
>> clarify.
>>
>> I would like to mention that when a guest using an mdev starts,
>> the mdev will be activated if not already activated. The guest
>> will not start if another activated mdev is using an APQN assigned
>> to the mdev of the guest being started.
> Understood.
>
>>>> +6. Start Guest1:
>>>> +
>>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \
>>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid1 ...
>>>> +
>>>> +7. Start Guest2:
>>>> +
>>>> + /usr/bin/qemu-system-s390x ... -cpu xxx,ap=on,apft=on \
>>>> + -device vfio-ap,sysfsdev=/sys/devices/virtual/misc/vfio_ap/mdev_supported_types/vfio_ap-passthrough/devices/$uuid2 ...
>>>> +
>>>> +When the guest is shut down, the mediated matrix device may be removed.
>>>> +
>>>> +Using our example again, to remove the mediated matrix device $uuid1:
>>>> +
>>>> + /sys/devices/virtual/misc
>>>> + --- [vfio_ap]
>>>> + --------- [mdev_supported_types]
>>>> + ------------ [vfio_ap-passthrough]
>>>> + --------------- [devices]
>>>> + ------------------ [$uuid1]
>>>> + --------------------- activate
>>>> + --------------------- remove
>>>> +
>>>> +
>>>> + echo 0 > activate
>>>> + echo 1 > remove
>>>> +
>>>> + This will release all the AP queues for the mediated device and
>>>> + remove all of the mdev matrix device's sysfs structures. To
>>>> + recreate and reconfigure the mdev matrix device, all of the steps starting
>>>> + with step 4 will have to be performed again.
>>>> +
>>>> + It is not necessary to remove an mdev matrix device, but one may want to
>>>> + remove it if no guest will use it during the lifetime of the linux host. If
>>>> + the mdev matrix device is removed, one may want to unbind the AP queues the
>>>> + guest was using from the vfio_ap device driver and bind them back to the
>>>> + default driver. Alternatively, the AP queues can be configured for another
>>>> + mdev matrix (i.e., guest). In either case, one must take care to change the
>>>> + secure key configured for the domain to which the queue is connected.
>>> This seems prime for data leakage, extremely sensitive data at that.
>>> Who's responsibility is it to prevent this? Shouldn't closing the
>>> device reset the device, which should perhaps wipe any key
>>> configuration?
>> The device is zeroized (reset) when it is removed. The zeroize
>> instruction - to
>> the best of my knowledge - does nothing with the secure key for a given
>> queue.
>> There is also no instruction that I know of to clear keys, so it would
>> appear
>> that this is left to the system administrator.
> :-o That seems like a pretty serious fault. Given that this interface
> is specifically for crypto devices where such sensitive information
> seems common, doesn't it seem like the kernel, ie. the mdev vendor
> driver, should provide guarantees that should your user instance crash
> or be killed that keys are not available to the system admin?
I may have misspoke with regard to my understanding of the AP reset. I've
been told that the keys stored with the domain will be reset. I will
verify this and get back to you.
>
>>>> +
>>>> +
>>>> +Limitations
>>>> +===========
>>>> +An admin needs to be careful when writing to sysfs files
>>>> +/sys/bus/ap/apmask and /sys/bus/ap/aqmask. These files control the driver
>>>> +to which an AP queue is bound to. Once an AP queue is bound vfio_ap
>>>> +driver and assigned to a mediated device, admin should not re-assign the
>>>> +AP queues back to the default driver, because of technical limitations.
>>>> +The kernel does not prevent you from re-assigning from AP queues reserved
>>>> +for the vfio_ap driver back to default driver. Future enhancements in
>>>> +the ap_bus and vfio_ap are likely to introduce complete support for such
>>>> +dynamic reconfiguration. But until then be extremely careful.
>>> Why don't we have these enhancements now? Should the whole thing be
>>> marked experimental without them? This sounds similar to a vfio-pci
>>> case where a group with multiple devices could have device which is
>>> unused by the user unbound from vfio-pci and re-bound to a native host
>>> driver. We BUG_ON when this occurs to protect the data.
>> We have customers that have been waiting a long time and are very anxious
>> to have guest access to AP devices. In the interest of reducing time to
>> market, we have decided to simplify the initial implementation as much
>> as possible. I wouldn't necessarily call this experimental, but view it
>> more as the minimal viable product.
>>
>> As far as BUG_ON, I will discuss this with the crypto team at our
>> meeting on Thursday and get back to you.
> A minimum viable product still needs to protect the host and does not
> excuse us from later breaking the user exposed ABI.
I will have a more thorough look into it.
>
>>> Probably also need to evaluate updates to
>>> Documentation/ABI/testing/sysfs-bus-vfio-mdev, especially if libvirt is
>>> supposed to interact with an 'activate' attribute. Thanks,
>> As it stands, libvirt will not create or modify the attributes of an mdev.
>> If a guest is started under libvirt and its mdev has not been activated,
>> it will be activated when the mdev fd is opened. Of course, if another
>> activated mdev is assigned an APQN assigned to the mdev of the guest
>> being started, the guest will be terminated.
> Exactly why I disapprove of the activated interface, but it's being
> proposed and recommended for use as a standard part of the lifecycle
> management for these devices. Thanks,
As I stated elsewhere, we are removing the activate interface. All
consistency checking will be done as adapters/domains are assigned to
the mdev.
>
> Alex
>
^ permalink raw reply [flat|nested] 2+ messages in thread