* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Cornelia Huck @ 2018-06-14 10:02 UTC (permalink / raw)
To: Siwei Liu
Cc: Alexander Duyck, virtio-dev, Jiri Pirko, Michael S. Tsirkin,
Jakub Kicinski, Samudrala, Sridhar, qemu-devel, virtualization,
Netdev, aaron.f.brown
In-Reply-To: <CADGSJ213f8tJpNXuOhv8qRew-Y5VZAwA+srNMrLZYnKdVGLdAA@mail.gmail.com>
I've been pointed to this discussion (which I had missed previously)
and I'm getting a headache. Let me first summarize how I understand how
this feature is supposed to work, then I'll respond to some individual
points.
The basic idea is to enable guests to migrate seamlessly, while still
making it possible for them to use a passed-through device for more
performance etc. The means to do so is to hook a virtio-net device
together with a network device passed through via vfio. The
vfio-handled device is there for performance, the virtio device for
migratability. We have a new virtio feature bit for that which needs to
be negotiated for that 'combined' device to be available. We have to
consider two cases:
- Older guests that do not support the new feature bit. We presume that
those guests will be confused if they get two network devices with
the same MAC, so the idea is to not show them the vfio-handled device
at all.
- Guests that negotiate the feature bit. We only know positively that
they (a) know the feature bit and (b) are prepared to handle the
consequences of negotiating it after they set the FEATURES_OK bit.
This is therefore the earliest point in time that the vfio-handled
device should be visible or usable for the guest.
On Wed, 13 Jun 2018 18:02:01 -0700
Siwei Liu <loseweigh@gmail.com> wrote:
> On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
> > On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
> >>
> >> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
> >>>
> >>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
> >>>>
> >>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
> >>>>>
> >>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
> >>>>>>
> >>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
> >>>>>>>
> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
> >>>>>>> device to
> >>>>>>> act as a standby for another device with the same MAC address.
> >>>>>>>
> >>>>>>> I tested this with a small change to the patch to mark the STANDBY
> >>>>>>> feature 'true'
> >>>>>>> by default as i am using libvirt to start the VMs.
> >>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
> >>>>>>> via libvirt
> >>>>>>> XML file?
> >>>>>>>
> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >>>>>>
> >>>>>> So I do not think we can commit to this interface: we
> >>>>>> really need to control visibility of the primary device.
> >>>>>
> >>>>> The problem is legacy guest won't use primary device at all if we do
> >>>>> this.
> >>>>
> >>>> And that's by design - I think it's the only way to ensure the
> >>>> legacy guest isn't confused.
> >>>
> >>> Yes. I think so. But i am not sure if Qemu is the right place to control
> >>> the visibility
> >>> of the primary device. The primary device may not be specified as an
> >>> argument to Qemu. It
> >>> may be plugged in later.
> >>> The cloud service provider is providing a feature that enables low
> >>> latency datapath and live
> >>> migration capability.
> >>> A tenant can use this feature only if he is running a VM that has
> >>> virtio-net with failover support.
So, do you know from the outset that there will be such a coupled
device? I.e., is it a property of the VM definition?
Can there be a 'prepared' virtio-net device that presents the STANDBY
feature even if there currently is no vfio-handled device available --
but making it possible to simply hotplug that device later?
Should it be possible to add a virtio/vfio pair later on?
> >>
> >> Well live migration is there already. The new feature is low latency
> >> data path.
> >
> >
> > we get live migration with just virtio. But I meant live migration with VF
> > as
> > primary device.
> >
> >>
> >> And it's the guest that needs failover support not the VM.
> >
> >
> > Isn't guest and VM synonymous?
I think we need to be really careful to not mix up the two: The VM
contains the definitions, but it is up to the guest how it uses them.
> >
> >
> >>
> >>
> >>> I think Qemu should check if guest virtio-net supports this feature and
> >>> provide a mechanism for
> >>> an upper layer indicating if the STANDBY feature is successfully
> >>> negotiated or not.
> >>> The upper layer can then decide if it should hot plug a VF with the same
> >>> MAC and manage the 2 links.
> >>> If VF is successfully hot plugged, virtio-net link should be disabled.
> >>
> >> Did you even talk to upper layer management about it?
> >> Just list the steps they need to do and you will see
> >> that's a lot of machinery to manage by the upper layer.
> >>
> >> What do we gain in flexibility? As far as I can see the
> >> only gain is some resources saved for legacy VMs.
> >>
> >> That's not a lot as tenant of the upper layer probably already has
> >> at least a hunch that it's a new guest otherwise
> >> why bother specifying the feature at all - you
> >> save even more resources without it.
> >>
> >
> > I am not all that familiar with how Qemu manages network devices. If we can
> > do all the
> > required management of the primary/standby devices within Qemu, that is
> > definitely a better
> > approach without upper layer involvement.
>
> Right. I would imagine in the extreme case the upper layer doesn't
> have to be involved at all if QEMU manages all hot plug/unplug logic.
> The management tool can supply passthrough device and virtio with the
> same group UUID, QEMU auto-manages the presence of the primary, and
> hot plug the device as needed before or after the migration.
I do not really see how you can manage that kind of stuff in QEMU only.
Have you talked to some libvirt folks? (And I'm not sure what you refer
to with 'group UUID'?)
Also, I think you need to make a distinction between hotplugging a
device and making it visible to the guest. What does 'hotplugging' mean
here? Adding it to the VM definition? Would it be enough to have the
vfio-based device not operational until the virtio feature bit has been
negotiated?
What happens if the guest does not use the vfio-based device after it
has been made available? Will you still disable the virtio-net link?
(All that link handling definitely sounds like a task for libvirt or
the like.)
Regarding hot(un)plugging during migration, I think you also need to
keep in mind that different architectures/busses have different
semantics there. Something that works if there's an unplug handshake may
not work on a platform with surprise removal.
Have you considered guest agents? All of this is punching through
several layers, and I'm not sure if that is a good idea.
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-14 1:02 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Alexander Duyck, virtio-dev, aaron.f.brown, Jiri Pirko,
Michael S. Tsirkin, Jakub Kicinski, Netdev, qemu-devel,
virtualization
In-Reply-To: <b9c88515-cf41-2a9a-078e-9c9e5adbbf14@intel.com>
On Tue, Jun 12, 2018 at 5:08 PM, Samudrala, Sridhar
<sridhar.samudrala@intel.com> wrote:
> On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
>>
>> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>>>
>>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>>>
>>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>>>
>>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>>>
>>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>>>
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>>>> device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the STANDBY
>>>>>>> feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu
>>>>>>> via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>
>>>>>> So I do not think we can commit to this interface: we
>>>>>> really need to control visibility of the primary device.
>>>>>
>>>>> The problem is legacy guest won't use primary device at all if we do
>>>>> this.
>>>>
>>>> And that's by design - I think it's the only way to ensure the
>>>> legacy guest isn't confused.
>>>
>>> Yes. I think so. But i am not sure if Qemu is the right place to control
>>> the visibility
>>> of the primary device. The primary device may not be specified as an
>>> argument to Qemu. It
>>> may be plugged in later.
>>> The cloud service provider is providing a feature that enables low
>>> latency datapath and live
>>> migration capability.
>>> A tenant can use this feature only if he is running a VM that has
>>> virtio-net with failover support.
>>
>> Well live migration is there already. The new feature is low latency
>> data path.
>
>
> we get live migration with just virtio. But I meant live migration with VF
> as
> primary device.
>
>>
>> And it's the guest that needs failover support not the VM.
>
>
> Isn't guest and VM synonymous?
>
>
>>
>>
>>> I think Qemu should check if guest virtio-net supports this feature and
>>> provide a mechanism for
>>> an upper layer indicating if the STANDBY feature is successfully
>>> negotiated or not.
>>> The upper layer can then decide if it should hot plug a VF with the same
>>> MAC and manage the 2 links.
>>> If VF is successfully hot plugged, virtio-net link should be disabled.
>>
>> Did you even talk to upper layer management about it?
>> Just list the steps they need to do and you will see
>> that's a lot of machinery to manage by the upper layer.
>>
>> What do we gain in flexibility? As far as I can see the
>> only gain is some resources saved for legacy VMs.
>>
>> That's not a lot as tenant of the upper layer probably already has
>> at least a hunch that it's a new guest otherwise
>> why bother specifying the feature at all - you
>> save even more resources without it.
>>
>
> I am not all that familiar with how Qemu manages network devices. If we can
> do all the
> required management of the primary/standby devices within Qemu, that is
> definitely a better
> approach without upper layer involvement.
Right. I would imagine in the extreme case the upper layer doesn't
have to be involved at all if QEMU manages all hot plug/unplug logic.
The management tool can supply passthrough device and virtio with the
same group UUID, QEMU auto-manages the presence of the primary, and
hot plug the device as needed before or after the migration.
-Siwei
>
>
>>
>>
>>>>> How about control the visibility of standby device?
>>>>>
>>>>> Thanks
>>>>
>>>> standy the always there to guarantee no downtime.
>>>>
>>>>>> However just for testing purposes, we could add a non-stable
>>>>>> interface "x-standby" with the understanding that as any
>>>>>> x- prefix it's unstable and will be changed down the road,
>>>>>> likely in the next release.
>>>>>>
>>>>>>
>>>>>>> ---
>>>>>>> hw/net/virtio-net.c | 2 ++
>>>>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>> 2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>> true),
>>>>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>>> SPEED_UNKNOWN),
>>>>>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> + false),
>>>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>>>> };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>> * Steering */
>>>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for
>>>>>>> another device
>>>>>>> + * with the same MAC.
>>>>>>> + */
>>>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set
>>>>>>> linkspeed and duplex */
>>>>>>> #ifndef VIRTIO_NET_NO_LEGACY
>>>>>>> --
>>>>>>> 2.14.3
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
>> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>>
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Siwei Liu @ 2018-06-14 0:56 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Alexander Duyck, virtio-dev, qemu-devel, Samudrala, Sridhar,
virtualization
In-Reply-To: <20180612143554-mutt-send-email-mst@kernel.org>
On Tue, Jun 12, 2018 at 4:47 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Tue, Jun 05, 2018 at 03:09:26PM -0700, Siwei Liu wrote:
>> The thing is cloud service provider might prefer sticking to the same
>> level of service agreement (SLA) of keeping VF over migration,
>
> That requirement is trivially satisfied by just a single VF :) If your
> SLA does not require live migration, you should do just that.
The requirement is enable live migration by default without getting
user too much attention. Migration should be attempted if guest is
live migratable. If not, present a single VF to legacy guest.
You seem to think it's not a valid use case? Or impossible to do so?
What if I have something in mind that can support my theory as well?
-Siwei
>
> --
> MST
^ permalink raw reply
* [PULL v2] vhost, virtio
From: Michael S. Tsirkin @ 2018-06-13 23:45 UTC (permalink / raw)
To: Linus Torvalds
Cc: ohad, kevin, kvm, mst, netdev, linux-remoteproc, linux-kernel,
stable, bjorn.andersson, syzbot+87cfa083e727a224754b,
virtualization
Page hints are reworked - I dropped them for now.
The following changes since commit 29dcea88779c856c7dc92040a0c01233263101d4:
Linux 4.17 (2018-06-03 14:15:21 -0700)
are available in the Git repository at:
git://git.kernel.org/pub/scm/linux/kernel/git/mst/vhost.git tags/for_linus
for you to fetch changes up to 2eb98105f8c7f4b867f7f714a998f5b8c1bb009b:
virtio: update the comments for transport features (2018-06-12 04:59:29 +0300)
----------------------------------------------------------------
virtio, vhost: features, fixes
VF support for virtio.
DMA barriers for virtio strong barriers.
Bugfixes.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
----------------------------------------------------------------
Michael S. Tsirkin (2):
virtio_ring: switch to dma_XX barriers for rpmsg
vhost: fix info leak due to uninitialized memory
Tiwei Bie (2):
virtio_pci: support enabling VFs
virtio: update the comments for transport features
drivers/vhost/vhost.c | 3 +++
drivers/virtio/virtio_pci_common.c | 30 ++++++++++++++++++++++++++++++
drivers/virtio/virtio_pci_modern.c | 14 ++++++++++++++
include/linux/virtio_ring.h | 4 ++--
include/uapi/linux/virtio_config.h | 16 ++++++++++++----
5 files changed, 61 insertions(+), 6 deletions(-)
^ permalink raw reply
* [RFC] virtio-iommu version 0.7
From: Jean-Philippe Brucker @ 2018-06-13 17:58 UTC (permalink / raw)
To: virtio-dev, virtualization; +Cc: eric.auger, tnowicki
This is version 0.7 of the virtio-iommu specification. The diff from 0.6,
included below, is fairly small and consists of the following changes:
* Address comments from 0.6, rework bits of the implementation notes.
* Change resv_mem parameters to be consistent with the rest of the
spec.
* Add the MMIO flag to MAP requests. At the moment it is used by
mapped MSIs mostly for completeness, but will be important for IDENTITY
resv_mem regions that next versions introduce. Please find more
information about this on the v0.6 thread [1].
For mapped MSIs, the MMIO flag allows host userspace to easily catch MSI
maps, and route the rest to VFIO. Without it the host needs to check
addresses against the MSI doorbell on every map request, before passing
them down to VFIO. This version makes the flag mandatory in MMIO map
requests. It could be useful on the unmap request as well but adding
such flag seems unintuitive (and changes unmap semantics, no way to do
unmap-all anymore). Adding both flags showed barely any performance
improvement on my kvmtool prototype. Please let me know if you notice
anything interesting on the QEMU and vhost prototypes, if you get around
comparing this.
* A notable change to protection semantics: write-only may imply
read-write (but read-only never implies read-write). We need this
behavior because some architectures do not support write-only mappings
and the corresponding host drivers don't reject write-only map requests.
I chose to follow POSIX mmap() semantics for this ("if the application
requests only PROT_WRITE, the implementation may also allow read
access.")
Sources: git://linux-arm.org/virtio-iommu.git viommu/v0.7
Docs: http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.pdf
http://jpbrucker.net/virtio-iommu/spec/virtio-iommu.html
Diff: http://jpbrucker.net/virtio-iommu/spec/diffs/virtio-iommu-pdf-diff-v0.6-v0.7.pdf
I'll send the updated Linux driver, which you can find on my
virtio-iommu/devel branch, after the merge window.
Next RFCs should be more interesting, with support for page table sharing
and some optimizations. It is progressing nicely but there isn't any rush,
since we're currently discussing the host-side interface in VFIO, which
virtio-iommu will try to follow closely. Since I'm working on a few
related projects, expect a similar cadence (around four months) for next
versions.
[1] https://www.spinics.net/lists/linux-virtualization/msg32628.html
--- >8 ---
diff --git a/MSI.tex b/MSI.tex
index fb54af4..7758fd1 100644
--- a/MSI.tex
+++ b/MSI.tex
@@ -11,22 +11,18 @@ number and destination processing units. Additional devices between the
endpoint and the IRQ chip may translate the doorbell address, the IRQ
number and verify that the endpoint is allowed to send this interrupt.
-Different platforms implement IRQ remapping and routing in different ways.
-This section describes three ways of dealing with Message Signaled
-Interrupts in virtio-iommu devices and drivers.
-
-In simplest systems, the endpoint writes the plain interrupt number to the
+Different platforms implement IRQ remapping and routing in different ways. In
+simplest systems, the endpoint writes the plain interrupt number to the
doorbell, and the IRQ chip signals the interruption to destination CPUs
-programmed by software. Section \ref{sec:viommu / MSI / Address bypass}
-describes how to implement a simple system with virtio-iommu. Section
-\ref{sec:viommu / MSI / Address translation} describes the added complexity
-(from the host point of view) of translating the IRQ chip doorbell.
+programmed by software. Sections \ref{sec:viommu / MSI / Address bypass} and
+\ref{sec:viommu / MSI / Address translation} describe two ways of implementing
+MSIs with virtio-iommu.
More complex systems add a level of indirection in the MSI message. The address
or data contains an index into a remapping table, that describes interrupt
delivery in details and is programmed by software either into the IRQ chip or
-the IOMMU. Section \ref{sec:viommu / MSI / IRQ remapping} describes how to use
-the remapping feature of virtio-iommu.
+the IOMMU. This is shown in section \ref{sec:viommu / MSI / IRQ remapping} but
+isn't yet supported by virtio-iommu.
\subsubsection{Address bypass}\label{sec:viommu / MSI / Address bypass}
@@ -66,8 +62,8 @@ struct __attribute__((packed)) {
},
.mem = {
.subtype = VIRTIO_IOMMU_RESV_MEM_T_MSI,
- .addr = 0xfee00000,
- .size = 0x00100000,
+ .start = 0xfee00000,
+ .end = 0xfeefffff,
},
};
\end{lstlisting}
@@ -90,13 +86,20 @@ translation can only forbid an endpoint from sending interrupts. If it is
allowed to send MSIs, the endpoint can easily spoof another endpoint by
sending interrupts that were not assigned to it.
-From the virtio-iommu point of view, this is the simplest to implement, because
-there is no special address range. The whole address space is treated the same
-by the virtio-iommu device.
-
-However, this mode of operations may add significant complexity in the host
-implementation.
-
+From the virtio-iommu point of view, this is the simplest to implement,
+because there is no special address range and no need for a PROBE request.
+The whole address space is treated the same way by the virtio-iommu
+device.
+
+However, this mode of operations may add some complexity in the host
+implementation. To setup MSIs, the guest writes an IOVA into the MSI-X
+table of the PCI endpoint. One possible host implementation emulates IRQ
+chips and captures requests that map virtual addresses to doorbell
+registers. These requests have the VIRTIO_IOMMU_MAP_F_MMIO flag set,
+making them easy to differentiate from requests that target normal memory
+and are forwarded to the physical IOMMU driver. The host also traps
+accesses to the endpoint's MSI-X table, and creates IRQ routes by
+translating the written IOVA into the corresponding doorbell.
\subsubsection{IRQ remapping}\label{sec:viommu / MSI / IRQ remapping}
diff --git a/assignment.tex b/assignment.tex
index 4e26a9c..e372713 100644
--- a/assignment.tex
+++ b/assignment.tex
@@ -35,8 +35,9 @@ struct __attribute__((packed)) {
.length = sizeof(resv.mem),
},
.mem = {
- .addr = 0x08000000,
- .size = 0x00100000,
+ .subtype = VIRTIO_IOMMU_RESV_MEM_T_RESERVED,
+ .start = 0x08000000,
+ .end = 0x080fffff,
},
};
\end{lstlisting}
diff --git a/device-operations.tex b/device-operations.tex
index 40c68cf..7af6fb0 100644
--- a/device-operations.tex
+++ b/device-operations.tex
@@ -294,15 +294,6 @@ If the VIRTIO_IOMMU_F_DOMAIN_BITS feature is offered, the driver SHOULD
NOT send requests with \field{domain} greater than the size described by
\field{domain_bits}.
-% We mandate truncation to allow a future extension X.Y that would store
-% information in addresses and domain IDs.
-%
-% If device is 0.2 and driver is X.Y, then device ignores ext. bits. But
-% if device is X.Y and device is 0.2, then driver *might* set ext. bits to
-% garbage. But this extension would be negotiated with a feature bit
-% anyway. If it's not, then device must assume that driver is 0.2 and must
-% keep truncating the fields.
-
The driver SHOULD NOT use multiple descriptor chains for a single request.
\devicenormative{\subsubsection}{Device operations}{Device Types / IOMMU Device / Device operations}
@@ -321,12 +312,14 @@ to zero.
The device MUST ignore reserved fields of the head and the tail of a
request.
-If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered, the device MUST
-truncate the range described by \field{virt_start} and \field{virt_end} in
-requests to fit in the range described by \field{input_range}.
+If the VIRTIO_IOMMU_F_INPUT_RANGE feature is offered and the range
+described by fields \field{virt_start} and \field{virt_end} doesn't fit in
+the range described by \field{input_range}, the device MAY set
+\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
-If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered, the device MUST ignore bits
-above \field{domain_bits} in field \field{domain} of requests.
+If the VIRTIO_IOMMU_F_DOMAIN_BITS is offered and bits above
+\field{domain_bits} are set in field \field{domain}, the device MAY set
+\field{status} to VIRTIO_IOMMU_S_RANGE and ignore the request.
\subsubsection{ATTACH request}\label{sec:Device Types / IOMMU Device / Device operations / ATTACH request}
@@ -415,7 +408,9 @@ struct virtio_iommu_req_detach {
\end{lstlisting}
Detach an endpoint from its domain. When this request completes, the
-endpoint cannot access any mapping from that domain anymore.
+endpoint cannot access any mapping from that domain anymore. If feature
+VIRTIO_IOMMU_F_BYPASS has been negotiated, then the endpoint accesses the
+guest-physical address space once this request completes.
After all endpoints have been successfully detached from a domain, it
ceases to exist and its ID can be reused by the driver for another domain.
@@ -457,6 +452,7 @@ struct virtio_iommu_req_map {
#define VIRTIO_IOMMU_MAP_F_READ (1 << 0)
#define VIRTIO_IOMMU_MAP_F_WRITE (1 << 1)
#define VIRTIO_IOMMU_MAP_F_EXEC (1 << 2)
+#define VIRTIO_IOMMU_MAP_F_MMIO (1 << 3)
\end{lstlisting}
Map a range of virtually-contiguous addresses to a range of
@@ -478,15 +474,23 @@ guest-physical addresses for use by the host (for instance MSI doorbells).
Guest physical boundaries are set by the host using a firmware mechanism
outside the scope of this specification.
-\begin{note}
-On flags: it is unlikely that all possible combinations of flags will be
-supported by the physical IOMMU. For instance, $W \& !R$ or $X \& W$ might
-be invalid. We do not have a way to advertise supported and implicit (for
-instance $W \rightarrow R$) flags or combination thereof for the moment,
-you are free to send any suggestions for describing this. Please keep in
-mind that we might soon want to add more flags, such as privileged,
-device, transient, shared, etc. (whatever these would mean).
-\end{note}
+Availability and allowed combinations of \field{flags} depend of the
+underlying IOMMU architectures. VIRTIO_IOMMU_MAP_F_READ and
+VIRTIO_IOMMU_MAP_F_WRITE are usually implemented, although READ is
+sometimes implied. VIRTIO_IOMMU_MAP_F_EXEC might not be available. In
+addition combinations such as "WRITE and not READ" or "WRITE and EXEC"
+might not be supported.
+
+The VIRTIO_IOMMU_MAP_F_MMIO flag is a memory type rather than a protection
+lag. It may be used, for example, to map Message Signaled Interrupt
+doorbells when a VIRTIO_IOMMU_RESV_MEM_T_MSI region isn't available. To
+trigger interrupts the endpoint performs a direct memory write to another
+peripheral, the IRQ chip. Since it is a signal, the write must not be
+buffered, elided, or combined with other writes by the memory
+interconnect. The precise meaning of the MMIO flag depends on the
+underlying memory architecture (for example on Armv8-A it corresponds to
+the "Device-nGnRE" memory type). Unless needed by mapped MSIs, the device
+isn't required to support the MMIO flag.
This request is only available when VIRTIO_IOMMU_F_MAP_UNMAP has been
negotiated.
@@ -497,6 +501,11 @@ The driver SHOULD set undefined \field{flags} bits to zero.
\field{virt_end} MUST be strictly greater than \field{virt_start}.
+The driver SHOULD set the VIRTIO_IOMMU_MAP_F_MMIO flag when the physical
+range corresponds to memory-mapped device registers. The physical range
+SHOULD have a single memory type: either normal memory or memory-mapped
+I/O.
+
\devicenormative{\paragraph}{MAP request}{Device Types / IOMMU Device / Device operations / MAP request}
If \field{virt_start}, \field{phys_start} or (\field{virt_end} + 1) is
@@ -510,6 +519,15 @@ here, because the driver might be attempting to map with special flags
that the device doesn't recognize. Creating the mapping with incompatible
flags may introduce a security hazard.}
+If a flag or combination of flag isn't supported, the device MAY set the
+request \field{status} to VIRTIO_IOMMU_S_UNSUPP.
+
+The device MUST NOT allow writes to a range mapped without the
+VIRTIO_IOMMU_MAP_F_WRITE flag. However, if the underlying architecture
+does not support write-only mappings, the device MAY allow reads to a
+range mapped with VIRTIO_IOMMU_MAP_F_WRITE but not
+VIRTIO_IOMMU_MAP_F_READ.
+
If \field{domain} does not exist, the device SHOULD set the request
\field{status} to VIRTIO_IOMMU_S_NOENT.
@@ -730,21 +748,24 @@ allocated by the driver, or that are special.
struct virtio_iommu_probe_resv_mem {
u8 subtype;
u8 reserved[3];
- le64 addr;
- le64 size;
+ le64 start;
+ le64 end;
};
\end{lstlisting}
-Fields \field{addr} and \field{size} describe the range of reserved
-virtual addresses. \field{subtype} may be one of:
+Fields \field{start} and \field{end} describe the range of reserved virtual
+addresses. \field{subtype} may be one of:
\begin{description}
\item[VIRTIO_IOMMU_RESV_MEM_T_RESERVED (0)]
- Accesses to virtual addresses in this region are not translated by the
- device. They may either be aborted by the device (or the underlying
- IOMMU), bypass it, or never even reach it. The guest should neither
- use these virtual addresses in a MAP request nor instruct endpoints to
- perform DMA on them.
+ Accesses to virtual addresses in this region have undefined behavior.
+ They may be aborted by the device, bypass it, or never even reach it.
+ The region may also be used for host mappings, for example Message
+ Signaled Interrupts (see \ref{sec:viommu / Hardware device
+ assignment}).
+
+ The guest should neither use these virtual addresses in a MAP request
+ nor instruct endpoints to perform DMA on them.
\item[VIRTIO_IOMMU_RESV_MEM_T_MSI (1)]
This region is a doorbell for Message Signaled Interrupts (MSIs). It
^ permalink raw reply related
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-13 14:23 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, pawel.moll, Tom Lendacky, cohuck, Ram Pai, linux-kernel,
virtualization, Christoph Hellwig, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <59e60715f27b10bc6816193eaf324824eff69c46.camel@kernel.crashing.org>
On Mon, Jun 11, 2018 at 01:34:50PM +1000, Benjamin Herrenschmidt wrote:
> On Mon, 2018-06-11 at 06:28 +0300, Michael S. Tsirkin wrote:
> >
> > > However if the administrator
> > > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > > flag, virtio will not be able to pass control to the DMA ops associated
> > > with the virtio devices. Which means, we have no opportunity to share
> > > the I/O buffers with the hypervisor/qemu.
> > >
> > > How do you suggest, we handle this case?
> >
> > As step 1, ignore it as a user error.
>
> Ugh ... not again. Ram, don't bring that subject back we ALREADY
> addressed it, and requiring the *user* to do special things is just
> utterly and completely wrong.
>
> The *user* has no bloody idea what that stuff is, will never know to
> set whatver magic qemu flag etc... The user will just start a a VM
> normally and expect things to work. Requiring the *user* to know things
> like that iommu virtio flag is complete nonsense.
You should consider teaching QEMU that the platform has support for the
ultravisor thing so it can set flags for you.
That's true even if something else is done for virtio - if you don't
have the capability right now you will come to regret it, host userspace
is just fundamentally in charge of control-path enumeration in the KVM
stack, I understand you want to take some of it away for security but
trying to hide things from QEMU completely is a mistake IMHO.
Just my $.02.
> If by "user" you mean libvirt, then you are now requesting about 4 or 5
> different projects to be patched to add speical cases for something
> they know nothing about and is completely irrelevant, while it can be
> entirely addressed with a 1-liner in virtio kernel side to allow the
> arch to plumb alternate DMA ops.
>
> So for some reason you seem to be dead set on a path that leads to
> mountain of user pain, changes to many different projects and overall
> havok while there is a much much simpler and elegant solution at hand
> which I described (again) in the response to Ram I sent about 5mn ago.
What you sent to Ram sounds OK to me - I only hope we do all mean
the same thing because the 3 paragraphs above are all
about the libvirt/QEMU split mess which linux or virtio
should not really care about.
And I'd like to stress that direct path isn't merely legacy.
It's a common sense approach that when you have a hypervisor doing
things like taking care of cache coherency, you do not want to do them
in the guest as well. And the same guest can use a pass-through device
where it does need to do all the coherency mess, so while it's quite
possible to build a platform-specific way to tell guests which devices
need which kind of treatment, people understandably chose to do it in a
portable way through the virtio device.
I understand you guys are working on a new project that is going to use
bounce buffers anyway so what do you care about optimizations but we
can't just slow everyone else down for your benefit.
> > Further you can for example add per-device quirks in virtio so it can be
> > switched to dma api. make extra decisions in platform code then.
Isn't above exactly what you sent to Ram?
--
MST
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-13 14:03 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, pawel.moll, Tom Lendacky, cohuck, Ram Pai, linux-kernel,
virtualization, Christoph Hellwig, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <07b804fccd7373c650be79ac9fa77ae7f2375ced.camel@kernel.crashing.org>
On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> On Sun, 2018-06-10 at 19:39 -0700, Ram Pai wrote:
> >
> > However if the administrator
> > ignores/forgets/deliberatey-decides/is-constrained to NOT enable the
> > flag, virtio will not be able to pass control to the DMA ops associated
> > with the virtio devices. Which means, we have no opportunity to share
> > the I/O buffers with the hypervisor/qemu.
> >
> > How do you suggest, we handle this case?
>
> At the risk of repeating myself, let's just do the first pass which is
> to switch virtio over to always using the DMA API in the actual data
> flow code, with a hook at initialization time that replaces the DMA ops
> with some home cooked "direct" ops in the case where the IOMMU flag
> isn't set.
I'm not sure I understand all of the details, will have to
see the patch, but superficially it sounds good to me.
> This will be equivalent to what we have today but avoids having 2
> separate code path all over the driver.
>
> Then a second stage, I think, is to replace this "hook" so that the
> architecture gets a say in the matter.
>
> Basically, something like:
>
> arch_virtio_update_dma_ops(pci_dev, qemu_direct_mode).
>
> IE, virtio would tell the arch whether the "other side" is in fact QEMU
> in a mode that bypasses the IOMMU and is cache coherent with the guest.
> This is our legacy "qemu special" mode. If the performance is
> sufficient we may want to deprecate it over time and have qemu enable
> the iommu by default but we still need it.
>
> A weak implementation of the above will be provied that just puts in
> the direct ops when qemu_direct_mode is set, and thus provides today's
> behaviour on any arch that doesn't override it. If the flag is not set,
> the ops are left to whatever the arch PCI layer already set.
>
> This will provide the opportunity for architectures that want to do
> something special, such as in our case, when we want to force even the
> "qemu_direct_mode" to go via bounce buffers, to put our own ops in
> there, while retaining the legacy behaviour otherwise.
>
> It also means that the "gunk" is entirely localized in that one
> function, the rest of virtio just uses the DMA API normally.
>
> Christoph, are you actually hacking "stage 1" above already or should
> we produce patches ?
>
> Cheers,
> Ben.
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-13 13:59 UTC (permalink / raw)
To: Christoph Hellwig
Cc: Tom Lendacky, pawel.moll, robh, Benjamin Herrenschmidt, cohuck,
Ram Pai, linux-kernel, virtualization, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180613074141.GA12033@infradead.org>
On Wed, Jun 13, 2018 at 12:41:41AM -0700, Christoph Hellwig wrote:
> On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> > At the risk of repeating myself, let's just do the first pass which is
> > to switch virtio over to always using the DMA API in the actual data
> > flow code, with a hook at initialization time that replaces the DMA ops
> > with some home cooked "direct" ops in the case where the IOMMU flag
> > isn't set.
> >
> > This will be equivalent to what we have today but avoids having 2
> > separate code path all over the driver.
> >
> > Then a second stage, I think, is to replace this "hook" so that the
> > architecture gets a say in the matter.
>
> I don't think we can actually use dma_direct_ops. It still allows
> architectures to override parts of the dma setup, which virtio seems
> to blindly assume phys == dma and not cache flushing.
>
> I think the right way forward is to either add a new
> VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
> possible).
Given this is exactly what happens now, this seems possible, but maybe
we want a non-PCI specific name.
> And then make sure recent qemu always sets it.
I don't think that part is going to happen, sorry.
Hypervisors can set it when they *actually have* a real PCI device.
People emulate systems which have a bunch of overhead in the DMA API
which is required for real DMA. Your proposal would double that overhead
by first doing it in guest then re-doing it in host.
I don't think it's justified when 99% of the world doesn't need it.
--
MST
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-06-13 13:49 UTC (permalink / raw)
To: Christoph Hellwig
Cc: robh, pawel.moll, Tom Lendacky, cohuck, Ram Pai, linux-kernel,
virtualization, joe, Rustad, Mark D, david, linuxppc-dev, elfring,
Anshuman Khandual
In-Reply-To: <20180608063655.GA32080@infradead.org>
On Thu, Jun 07, 2018 at 11:36:55PM -0700, Christoph Hellwig wrote:
> > This seems to be what was being asked for in this thread,
> > with comments claiming IOMMU flag adds too much overhead.
>
> Right now it means implementing a virtual iommu, which I agree is
> way too much overhead.
Well not really. The flag in question will have a desired effect without
a virtual iommu.
> > SEV guys report that they just set the iommu flag and then it all works.
> > I guess if there's translation we can think of this as a kind of iommu.
> > Maybe we should rename PLATFORM_IOMMU to PLARTFORM_TRANSLATION?
>
> VIRTIO_F_BEHAVES_LIKE_A_REAL_PCI_DEVICE_DONT_TRY_TO_OUTSMART_ME
>
> as said it's not just translations, it is cache coherence as well.
Well it's only for DMA. So maybe PLATFORM_DMA.
I suspect people will then come and complain that they
do *not* want cache coherence tricks because virtio is
running on a CPU, but we'll see.
> > And apparently some people complain that just setting that flag makes
> > qemu check translation on each access with an unacceptable performance
> > overhead. Forcing same behaviour for everyone on general principles
> > even without the flag is unlikely to make them happy.
>
> That sounds like a qemu implementation bug. If qemu knowns that
> guest physiscall == guest dma space there is no need to check.
Possibly. Or it's possible it's all just theoretical, no one
posted any numbers.
--
MST
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-13 13:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: robh, cohuck, pawel.moll, Tom Lendacky, Michael S. Tsirkin,
Ram Pai, linux-kernel, virtualization, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <5dbcafa73b065bc619fd6adc9ef47eb6367b8378.camel@kernel.crashing.org>
On Wed, 2018-06-13 at 22:25 +1000, Benjamin Herrenschmidt wrote:
> On Wed, 2018-06-13 at 00:41 -0700, Christoph Hellwig wrote:
> > On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> > > At the risk of repeating myself, let's just do the first pass which is
> > > to switch virtio over to always using the DMA API in the actual data
> > > flow code, with a hook at initialization time that replaces the DMA ops
> > > with some home cooked "direct" ops in the case where the IOMMU flag
> > > isn't set.
> > >
> > > This will be equivalent to what we have today but avoids having 2
> > > separate code path all over the driver.
> > >
> > > Then a second stage, I think, is to replace this "hook" so that the
> > > architecture gets a say in the matter.
> >
> > I don't think we can actually use dma_direct_ops. It still allows
> > architectures to override parts of the dma setup, which virtio seems
> > to blindly assume phys == dma and not cache flushing.
>
> By direct ops I didn't mean *the* dma_direct_ops but a virtio-local
> variants that effectively reproduces the existing expectations (ie,
> virtio-dma-legacy-ops or something).
Actually ... the stuff in lib/dma-direct.c seems to be just it, no ?
There's no cache flushing and there's no architecture hooks that I can
see other than the AMD security stuff which is probably fine.
Or am I missing something ?
Cheers,
Ben.
>
> > I think the right way forward is to either add a new
> > VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
> > possible). And then make sure recent qemu always sets it.
>
> Ben.
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Benjamin Herrenschmidt @ 2018-06-13 12:25 UTC (permalink / raw)
To: Christoph Hellwig
Cc: robh, cohuck, pawel.moll, Tom Lendacky, Michael S. Tsirkin,
Ram Pai, linux-kernel, virtualization, joe, Rustad, Mark D,
Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <20180613074141.GA12033@infradead.org>
On Wed, 2018-06-13 at 00:41 -0700, Christoph Hellwig wrote:
> On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> > At the risk of repeating myself, let's just do the first pass which is
> > to switch virtio over to always using the DMA API in the actual data
> > flow code, with a hook at initialization time that replaces the DMA ops
> > with some home cooked "direct" ops in the case where the IOMMU flag
> > isn't set.
> >
> > This will be equivalent to what we have today but avoids having 2
> > separate code path all over the driver.
> >
> > Then a second stage, I think, is to replace this "hook" so that the
> > architecture gets a say in the matter.
>
> I don't think we can actually use dma_direct_ops. It still allows
> architectures to override parts of the dma setup, which virtio seems
> to blindly assume phys == dma and not cache flushing.
By direct ops I didn't mean *the* dma_direct_ops but a virtio-local
variants that effectively reproduces the existing expectations (ie,
virtio-dma-legacy-ops or something).
> I think the right way forward is to either add a new
> VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
> possible). And then make sure recent qemu always sets it.
Ben.
^ permalink raw reply
* Re: [RFC V2] virtio: Add platform specific DMA API translation for virito devices
From: Christoph Hellwig @ 2018-06-13 7:41 UTC (permalink / raw)
To: Benjamin Herrenschmidt
Cc: robh, cohuck, pawel.moll, Tom Lendacky, Michael S. Tsirkin,
Ram Pai, linux-kernel, virtualization, Christoph Hellwig, joe,
Rustad, Mark D, Anshuman Khandual, linuxppc-dev, elfring, david
In-Reply-To: <07b804fccd7373c650be79ac9fa77ae7f2375ced.camel@kernel.crashing.org>
On Mon, Jun 11, 2018 at 01:29:18PM +1000, Benjamin Herrenschmidt wrote:
> At the risk of repeating myself, let's just do the first pass which is
> to switch virtio over to always using the DMA API in the actual data
> flow code, with a hook at initialization time that replaces the DMA ops
> with some home cooked "direct" ops in the case where the IOMMU flag
> isn't set.
>
> This will be equivalent to what we have today but avoids having 2
> separate code path all over the driver.
>
> Then a second stage, I think, is to replace this "hook" so that the
> architecture gets a say in the matter.
I don't think we can actually use dma_direct_ops. It still allows
architectures to override parts of the dma setup, which virtio seems
to blindly assume phys == dma and not cache flushing.
I think the right way forward is to either add a new
VIRTIO_F_IS_PCI_DEVICE (or redefine the existing iommu flag if deemed
possible). And then make sure recent qemu always sets it.
^ permalink raw reply
* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Jason Wang @ 2018-06-13 5:40 UTC (permalink / raw)
To: Samudrala, Sridhar, Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, qemu-devel, virtualization
In-Reply-To: <c29a5ac9-cfb5-0a19-f28c-30d41bd5efcf@intel.com>
On 2018年06月13日 12:24, Samudrala, Sridhar wrote:
> On 6/12/2018 7:38 PM, Jason Wang wrote:
>>
>>
>> On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
>>> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>>>
>>>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>>>> I don't think this is sufficient.
>>>>>
>>>>> If both primary and standby devices are present, a legacy guest
>>>>> without
>>>>> support for the feature might see two devices with same mac and get
>>>>> confused.
>>>>>
>>>>> I think that we should only make primary visible after guest acked
>>>>> the
>>>>> backup feature bit.
>>>> I think we want exactly the reverse? E.g fail the negotiation when
>>>> guest
>>>> does not ack backup feature.
>>>>
>>>> Otherwise legacy guest won't even have the chance to see primary
>>>> device in
>>>> the guest.
>>> That's by design.
>>
>> So management needs to know the capability of guest to set the backup
>> feature. This looks a chicken or egg problem to me.
>
> I don't think so. If the tenant requests 'accelerated datapath
> feature', the management
> will set 'standby' feature bit on virtio-net interface and if the
> guest virtio-net driver
> supports this feature, then the tenant VM will get that capability via
> a hot-plugged
> primary device.
Ok, I thought exactly the reverse because of the commit title is "enable
virtio_net to act as a standby for a passthru device". But re-read the
commit log content, I understand the case a little bit. Btw, VF is not
necessarily faster than virtio-net, especially consider virtio-net may
have a lot of queues.
>
>
>>
>>>
>>>>> And on reset or when backup is cleared in some other way, unplug the
>>>>> primary.
>>>> What if guest does not support hotplug?
>>> It shouldn't ack the backup feature then and it's a good point.
>>> We should both document this and check kernel config has
>>> hotplug support. Sridhar could you take a look pls?
>>>
>>>>> Something like the below will do the job:
>>>>>
>>>>> Primary device is added with a special "primary-failover" flag.
>>>>> A virtual machine is then initialized with just a standby virtio
>>>>> device. Primary is not yet added.
>>>> A question is how to do the matching? Qemu knows nothing about e.g mac
>>>> address of a pass-through device I believe?
>>> Supply a flag to the VFIO when it's added, this way QEMU will know.
>>>
>>>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>>>> It then exposes the primary device to the guest, and triggers
>>>>> a device addition event (hot-plug event) for it.
>>>> Do you mean we won't have primary device in the initial qemu cli?
>>> No, that's not what I mean.
>>>
>>> I mean management will supply a flag to VFIO and then
>>>
>>>
>>> - VFIO defers exposing
>>> primary to guest until guest acks the feature bit.
>>> - When we see guest ack, initiate hotplug.
>>> - On reboot, hide it again.
>>> - On reset without reboot, request hot-unplug and on eject hide it
>>> again.
>>
>> This sounds much like a kind of bonding in qemu.
>>
>>>
>>>>> If QEMU detects guest driver removal, it initiates a hot-unplug
>>>>> sequence
>>>>> to remove the primary driver. In particular, if QEMU detects guest
>>>>> re-initialization (e.g. by detecting guest reset) it immediately
>>>>> removes
>>>>> the primary device.
>>>> I believe guest failover module should handle this gracefully?
>>> It can't control enumeration order, if primary is enumerated before
>>> standby then guest will load its driver and it's too late
>>> when failover driver is loaded.
>>
>> Well, even if we can delay the visibility of primary until DRIVER_OK,
>> there still be a race I think? And it looks to me it's still a bug of
>> guest:
>>
>> E.g primary could be probed before failover_register() in guest. Then
>> we will miss the enslaving of primary forever.
>
> That is not an issue. Even if the primary is probed before failover
> driver, it will
> enslave the primary via the call to failover_existing_slave_register()
> as part of
> failover_register() routine.
Aha I get it. So the enumeration order is not an issue.
Consider primary may still be seen by guest kernel even if we delay its
visibility, I wonder whether we can control the lifecycle of primary
through driver but not qemu. This can simplify a lot of things.
Thanks
>
>>
>> Thanks
>>
>>>
>>>> Thanks
>>>>
>>>>> We can move some of this code to management as well,
>>>>> architecturally it
>>>>> does not make too much sense but it might be easier
>>>>> implementation-wise.
>>>>>
>>>>> HTH
>>>>>
>>>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>>>> Ping on this patch now that the kernel patches are accepted into
>>>>>> davem's net-next tree.
>>>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>>>
>>>>>>
>>>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>>>> This feature bit can be used by hypervisor to indicate
>>>>>>> virtio_net device to
>>>>>>> act as a standby for another device with the same MAC address.
>>>>>>>
>>>>>>> I tested this with a small change to the patch to mark the
>>>>>>> STANDBY feature 'true'
>>>>>>> by default as i am using libvirt to start the VMs.
>>>>>>> Is there a way to pass the newly added feature bit 'standby' to
>>>>>>> qemu via libvirt
>>>>>>> XML file?
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>> ---
>>>>>>> hw/net/virtio-net.c | 2 ++
>>>>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>>> 2 files changed, 5 insertions(+)
>>>>>>>
>>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>>> index 90502fca7c..38b3140670 100644
>>>>>>> --- a/hw/net/virtio-net.c
>>>>>>> +++ b/hw/net/virtio-net.c
>>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>>> true),
>>>>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>>> SPEED_UNKNOWN),
>>>>>>> DEFINE_PROP_STRING("duplex", VirtIONet,
>>>>>>> net_conf.duplex_str),
>>>>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>>> + false),
>>>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>>>> };
>>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>> * Steering */
>>>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for
>>>>>>> another device
>>>>>>> + * with the same MAC.
>>>>>>> + */
>>>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set
>>>>>>> linkspeed and duplex */
>>>>>>> #ifndef VIRTIO_NET_NO_LEGACY
>>
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-13 4:24 UTC (permalink / raw)
To: Jason Wang, Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, qemu-devel, virtualization
In-Reply-To: <5718de66-74c9-1362-a87b-2eba02475b48@redhat.com>
On 6/12/2018 7:38 PM, Jason Wang wrote:
>
>
> On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
>> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>>
>>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>>> I don't think this is sufficient.
>>>>
>>>> If both primary and standby devices are present, a legacy guest
>>>> without
>>>> support for the feature might see two devices with same mac and get
>>>> confused.
>>>>
>>>> I think that we should only make primary visible after guest acked the
>>>> backup feature bit.
>>> I think we want exactly the reverse? E.g fail the negotiation when
>>> guest
>>> does not ack backup feature.
>>>
>>> Otherwise legacy guest won't even have the chance to see primary
>>> device in
>>> the guest.
>> That's by design.
>
> So management needs to know the capability of guest to set the backup
> feature. This looks a chicken or egg problem to me.
I don't think so. If the tenant requests 'accelerated datapath feature', the management
will set 'standby' feature bit on virtio-net interface and if the guest virtio-net driver
supports this feature, then the tenant VM will get that capability via a hot-plugged
primary device.
>
>>
>>>> And on reset or when backup is cleared in some other way, unplug the
>>>> primary.
>>> What if guest does not support hotplug?
>> It shouldn't ack the backup feature then and it's a good point.
>> We should both document this and check kernel config has
>> hotplug support. Sridhar could you take a look pls?
>>
>>>> Something like the below will do the job:
>>>>
>>>> Primary device is added with a special "primary-failover" flag.
>>>> A virtual machine is then initialized with just a standby virtio
>>>> device. Primary is not yet added.
>>> A question is how to do the matching? Qemu knows nothing about e.g mac
>>> address of a pass-through device I believe?
>> Supply a flag to the VFIO when it's added, this way QEMU will know.
>>
>>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>>> It then exposes the primary device to the guest, and triggers
>>>> a device addition event (hot-plug event) for it.
>>> Do you mean we won't have primary device in the initial qemu cli?
>> No, that's not what I mean.
>>
>> I mean management will supply a flag to VFIO and then
>>
>>
>> - VFIO defers exposing
>> primary to guest until guest acks the feature bit.
>> - When we see guest ack, initiate hotplug.
>> - On reboot, hide it again.
>> - On reset without reboot, request hot-unplug and on eject hide it
>> again.
>
> This sounds much like a kind of bonding in qemu.
>
>>
>>>> If QEMU detects guest driver removal, it initiates a hot-unplug
>>>> sequence
>>>> to remove the primary driver. In particular, if QEMU detects guest
>>>> re-initialization (e.g. by detecting guest reset) it immediately
>>>> removes
>>>> the primary device.
>>> I believe guest failover module should handle this gracefully?
>> It can't control enumeration order, if primary is enumerated before
>> standby then guest will load its driver and it's too late
>> when failover driver is loaded.
>
> Well, even if we can delay the visibility of primary until DRIVER_OK,
> there still be a race I think? And it looks to me it's still a bug of
> guest:
>
> E.g primary could be probed before failover_register() in guest. Then
> we will miss the enslaving of primary forever.
That is not an issue. Even if the primary is probed before failover driver, it will
enslave the primary via the call to failover_existing_slave_register() as part of
failover_register() routine.
>
> Thanks
>
>>
>>> Thanks
>>>
>>>> We can move some of this code to management as well,
>>>> architecturally it
>>>> does not make too much sense but it might be easier
>>>> implementation-wise.
>>>>
>>>> HTH
>>>>
>>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>>> Ping on this patch now that the kernel patches are accepted into
>>>>> davem's net-next tree.
>>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>>
>>>>>
>>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>>> This feature bit can be used by hypervisor to indicate virtio_net
>>>>>> device to
>>>>>> act as a standby for another device with the same MAC address.
>>>>>>
>>>>>> I tested this with a small change to the patch to mark the
>>>>>> STANDBY feature 'true'
>>>>>> by default as i am using libvirt to start the VMs.
>>>>>> Is there a way to pass the newly added feature bit 'standby' to
>>>>>> qemu via libvirt
>>>>>> XML file?
>>>>>>
>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>> ---
>>>>>> hw/net/virtio-net.c | 2 ++
>>>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>> 2 files changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>> index 90502fca7c..38b3140670 100644
>>>>>> --- a/hw/net/virtio-net.c
>>>>>> +++ b/hw/net/virtio-net.c
>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>> true),
>>>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed,
>>>>>> SPEED_UNKNOWN),
>>>>>> DEFINE_PROP_STRING("duplex", VirtIONet,
>>>>>> net_conf.duplex_str),
>>>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features,
>>>>>> VIRTIO_NET_F_STANDBY,
>>>>>> + false),
>>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>>> };
>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h
>>>>>> b/include/standard-headers/linux/virtio_net.h
>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>> @@ -57,6 +57,9 @@
>>>>>> * Steering */
>>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for
>>>>>> another device
>>>>>> + * with the same MAC.
>>>>>> + */
>>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set
>>>>>> linkspeed and duplex */
>>>>>> #ifndef VIRTIO_NET_NO_LEGACY
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Jason Wang @ 2018-06-13 2:41 UTC (permalink / raw)
To: Samudrala, Sridhar, Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, virtualization, qemu-devel
In-Reply-To: <7aa6dbe7-bac2-96f6-d96c-f51e5173c556@intel.com>
On 2018年06月13日 08:20, Samudrala, Sridhar wrote:
> On 6/12/2018 4:54 AM, Michael S. Tsirkin wrote:
>> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>>
>>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>>> I don't think this is sufficient.
>>>>
>>>> If both primary and standby devices are present, a legacy guest
>>>> without
>>>> support for the feature might see two devices with same mac and get
>>>> confused.
>>>>
>>>> I think that we should only make primary visible after guest acked the
>>>> backup feature bit.
>>> I think we want exactly the reverse? E.g fail the negotiation when
>>> guest
>>> does not ack backup feature.
>>>
>>> Otherwise legacy guest won't even have the chance to see primary
>>> device in
>>> the guest.
>> That's by design.
>>
>>>> And on reset or when backup is cleared in some other way, unplug the
>>>> primary.
>>> What if guest does not support hotplug?
>> It shouldn't ack the backup feature then and it's a good point.
>> We should both document this and check kernel config has
>> hotplug support. Sridhar could you take a look pls?
>
> Right now we select NET_FAILOVER when virtio-net is enabled, Should we
> select
> PCI_HOTPLUG when NET_FAILOVER is enabled?
Maybe I was wrong but it looks to me PCI_HOTPLUG is no sufficient since
it depends on other to work e.g HOTPLUG_PCI_ACPI.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Jason Wang @ 2018-06-13 2:38 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, Samudrala, Sridhar, qemu-devel,
virtualization
In-Reply-To: <20180612144743-mutt-send-email-mst@kernel.org>
On 2018年06月12日 19:54, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>
>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>> I don't think this is sufficient.
>>>
>>> If both primary and standby devices are present, a legacy guest without
>>> support for the feature might see two devices with same mac and get
>>> confused.
>>>
>>> I think that we should only make primary visible after guest acked the
>>> backup feature bit.
>> I think we want exactly the reverse? E.g fail the negotiation when guest
>> does not ack backup feature.
>>
>> Otherwise legacy guest won't even have the chance to see primary device in
>> the guest.
> That's by design.
So management needs to know the capability of guest to set the backup
feature. This looks a chicken or egg problem to me.
>
>>> And on reset or when backup is cleared in some other way, unplug the
>>> primary.
>> What if guest does not support hotplug?
> It shouldn't ack the backup feature then and it's a good point.
> We should both document this and check kernel config has
> hotplug support. Sridhar could you take a look pls?
>
>>> Something like the below will do the job:
>>>
>>> Primary device is added with a special "primary-failover" flag.
>>> A virtual machine is then initialized with just a standby virtio
>>> device. Primary is not yet added.
>> A question is how to do the matching? Qemu knows nothing about e.g mac
>> address of a pass-through device I believe?
> Supply a flag to the VFIO when it's added, this way QEMU will know.
>
>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>> It then exposes the primary device to the guest, and triggers
>>> a device addition event (hot-plug event) for it.
>> Do you mean we won't have primary device in the initial qemu cli?
> No, that's not what I mean.
>
> I mean management will supply a flag to VFIO and then
>
>
> - VFIO defers exposing
> primary to guest until guest acks the feature bit.
> - When we see guest ack, initiate hotplug.
> - On reboot, hide it again.
> - On reset without reboot, request hot-unplug and on eject hide it again.
This sounds much like a kind of bonding in qemu.
>
>>> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>>> to remove the primary driver. In particular, if QEMU detects guest
>>> re-initialization (e.g. by detecting guest reset) it immediately removes
>>> the primary device.
>> I believe guest failover module should handle this gracefully?
> It can't control enumeration order, if primary is enumerated before
> standby then guest will load its driver and it's too late
> when failover driver is loaded.
Well, even if we can delay the visibility of primary until DRIVER_OK,
there still be a race I think? And it looks to me it's still a bug of guest:
E.g primary could be probed before failover_register() in guest. Then we
will miss the enslaving of primary forever.
Thanks
>
>> Thanks
>>
>>> We can move some of this code to management as well, architecturally it
>>> does not make too much sense but it might be easier implementation-wise.
>>>
>>> HTH
>>>
>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>
>>>>
>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>> act as a standby for another device with the same MAC address.
>>>>>
>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>> by default as i am using libvirt to start the VMs.
>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>> XML file?
>>>>>
>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> ---
>>>>> hw/net/virtio-net.c | 2 ++
>>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>>> 2 files changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>> index 90502fca7c..38b3140670 100644
>>>>> --- a/hw/net/virtio-net.c
>>>>> +++ b/hw/net/virtio-net.c
>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>> true),
>>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>> + false),
>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>> };
>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>> index e9f255ea3f..01ec09684c 100644
>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>> @@ -57,6 +57,9 @@
>>>>> * Steering */
>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
>>>>> + * with the same MAC.
>>>>> + */
>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>>>>> #ifndef VIRTIO_NET_NO_LEGACY
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-13 0:20 UTC (permalink / raw)
To: Michael S. Tsirkin, Jason Wang
Cc: alexander.h.duyck, virtio-dev, qemu-devel, virtualization
In-Reply-To: <20180612144743-mutt-send-email-mst@kernel.org>
On 6/12/2018 4:54 AM, Michael S. Tsirkin wrote:
> On Wed, Jun 06, 2018 at 10:29:03AM +0800, Jason Wang wrote:
>>
>> On 2018年06月05日 20:33, Michael S. Tsirkin wrote:
>>> I don't think this is sufficient.
>>>
>>> If both primary and standby devices are present, a legacy guest without
>>> support for the feature might see two devices with same mac and get
>>> confused.
>>>
>>> I think that we should only make primary visible after guest acked the
>>> backup feature bit.
>> I think we want exactly the reverse? E.g fail the negotiation when guest
>> does not ack backup feature.
>>
>> Otherwise legacy guest won't even have the chance to see primary device in
>> the guest.
> That's by design.
>
>>> And on reset or when backup is cleared in some other way, unplug the
>>> primary.
>> What if guest does not support hotplug?
> It shouldn't ack the backup feature then and it's a good point.
> We should both document this and check kernel config has
> hotplug support. Sridhar could you take a look pls?
Right now we select NET_FAILOVER when virtio-net is enabled, Should we select
PCI_HOTPLUG when NET_FAILOVER is enabled?
>
>>> Something like the below will do the job:
>>>
>>> Primary device is added with a special "primary-failover" flag.
>>> A virtual machine is then initialized with just a standby virtio
>>> device. Primary is not yet added.
>> A question is how to do the matching? Qemu knows nothing about e.g mac
>> address of a pass-through device I believe?
> Supply a flag to the VFIO when it's added, this way QEMU will know.
>
>>> Later QEMU detects that guest driver device set DRIVER_OK.
>>> It then exposes the primary device to the guest, and triggers
>>> a device addition event (hot-plug event) for it.
>> Do you mean we won't have primary device in the initial qemu cli?
> No, that's not what I mean.
>
> I mean management will supply a flag to VFIO and then
>
>
> - VFIO defers exposing
> primary to guest until guest acks the feature bit.
> - When we see guest ack, initiate hotplug.
> - On reboot, hide it again.
> - On reset without reboot, request hot-unplug and on eject hide it again.
>
>
>>> If QEMU detects guest driver removal, it initiates a hot-unplug sequence
>>> to remove the primary driver. In particular, if QEMU detects guest
>>> re-initialization (e.g. by detecting guest reset) it immediately removes
>>> the primary device.
>> I believe guest failover module should handle this gracefully?
> It can't control enumeration order, if primary is enumerated before
> standby then guest will load its driver and it's too late
> when failover driver is loaded.
Does it mean that if guest unloads virtio-net driver, Qemu should automatically
unplug the associated primary device?
What about guest unloading/re-loading primary device driver?
>
>> Thanks
>>
>>> We can move some of this code to management as well, architecturally it
>>> does not make too much sense but it might be easier implementation-wise.
>>>
>>> HTH
>>>
>>> On Mon, Jun 04, 2018 at 06:41:48PM -0700, Samudrala, Sridhar wrote:
>>>> Ping on this patch now that the kernel patches are accepted into davem's net-next tree.
>>>> https://patchwork.ozlabs.org/cover/920005/
>>>>
>>>>
>>>> On 5/7/2018 4:09 PM, Sridhar Samudrala wrote:
>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>> act as a standby for another device with the same MAC address.
>>>>>
>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>> by default as i am using libvirt to start the VMs.
>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>> XML file?
>>>>>
>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> ---
>>>>> hw/net/virtio-net.c | 2 ++
>>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>>> 2 files changed, 5 insertions(+)
>>>>>
>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>> index 90502fca7c..38b3140670 100644
>>>>> --- a/hw/net/virtio-net.c
>>>>> +++ b/hw/net/virtio-net.c
>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>> true),
>>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>> + false),
>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>> };
>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>> index e9f255ea3f..01ec09684c 100644
>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>> @@ -57,6 +57,9 @@
>>>>> * Steering */
>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
>>>>> + * with the same MAC.
>>>>> + */
>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>>>>> #ifndef VIRTIO_NET_NO_LEGACY
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] Re: [Qemu-devel] [PATCH] qemu: Introduce VIRTIO_NET_F_STANDBY feature bit to virtio_net
From: Samudrala, Sridhar @ 2018-06-13 0:08 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: alexander.h.duyck, virtio-dev, aaron.f.brown, jiri, kubakici,
netdev, qemu-devel, loseweigh, virtualization
In-Reply-To: <20180612142557-mutt-send-email-mst@kernel.org>
On 6/12/2018 4:34 AM, Michael S. Tsirkin wrote:
> On Mon, Jun 11, 2018 at 10:02:45PM -0700, Samudrala, Sridhar wrote:
>> On 6/11/2018 7:17 PM, Michael S. Tsirkin wrote:
>>> On Tue, Jun 12, 2018 at 09:54:44AM +0800, Jason Wang wrote:
>>>> On 2018年06月12日 01:26, Michael S. Tsirkin wrote:
>>>>> On Mon, May 07, 2018 at 04:09:54PM -0700, Sridhar Samudrala wrote:
>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>>> act as a standby for another device with the same MAC address.
>>>>>>
>>>>>> I tested this with a small change to the patch to mark the STANDBY feature 'true'
>>>>>> by default as i am using libvirt to start the VMs.
>>>>>> Is there a way to pass the newly added feature bit 'standby' to qemu via libvirt
>>>>>> XML file?
>>>>>>
>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>> So I do not think we can commit to this interface: we
>>>>> really need to control visibility of the primary device.
>>>> The problem is legacy guest won't use primary device at all if we do this.
>>> And that's by design - I think it's the only way to ensure the
>>> legacy guest isn't confused.
>> Yes. I think so. But i am not sure if Qemu is the right place to control the visibility
>> of the primary device. The primary device may not be specified as an argument to Qemu. It
>> may be plugged in later.
>> The cloud service provider is providing a feature that enables low latency datapath and live
>> migration capability.
>> A tenant can use this feature only if he is running a VM that has virtio-net with failover support.
> Well live migration is there already. The new feature is low latency
> data path.
we get live migration with just virtio. But I meant live migration with VF as
primary device.
>
> And it's the guest that needs failover support not the VM.
Isn't guest and VM synonymous?
>
>
>> I think Qemu should check if guest virtio-net supports this feature and provide a mechanism for
>> an upper layer indicating if the STANDBY feature is successfully negotiated or not.
>> The upper layer can then decide if it should hot plug a VF with the same MAC and manage the 2 links.
>> If VF is successfully hot plugged, virtio-net link should be disabled.
> Did you even talk to upper layer management about it?
> Just list the steps they need to do and you will see
> that's a lot of machinery to manage by the upper layer.
>
> What do we gain in flexibility? As far as I can see the
> only gain is some resources saved for legacy VMs.
>
> That's not a lot as tenant of the upper layer probably already has
> at least a hunch that it's a new guest otherwise
> why bother specifying the feature at all - you
> save even more resources without it.
>
I am not all that familiar with how Qemu manages network devices. If we can do all the
required management of the primary/standby devices within Qemu, that is definitely a better
approach without upper layer involvement.
>
>
>>>> How about control the visibility of standby device?
>>>>
>>>> Thanks
>>> standy the always there to guarantee no downtime.
>>>
>>>>> However just for testing purposes, we could add a non-stable
>>>>> interface "x-standby" with the understanding that as any
>>>>> x- prefix it's unstable and will be changed down the road,
>>>>> likely in the next release.
>>>>>
>>>>>
>>>>>> ---
>>>>>> hw/net/virtio-net.c | 2 ++
>>>>>> include/standard-headers/linux/virtio_net.h | 3 +++
>>>>>> 2 files changed, 5 insertions(+)
>>>>>>
>>>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>>>> index 90502fca7c..38b3140670 100644
>>>>>> --- a/hw/net/virtio-net.c
>>>>>> +++ b/hw/net/virtio-net.c
>>>>>> @@ -2198,6 +2198,8 @@ static Property virtio_net_properties[] = {
>>>>>> true),
>>>>>> DEFINE_PROP_INT32("speed", VirtIONet, net_conf.speed, SPEED_UNKNOWN),
>>>>>> DEFINE_PROP_STRING("duplex", VirtIONet, net_conf.duplex_str),
>>>>>> + DEFINE_PROP_BIT64("standby", VirtIONet, host_features, VIRTIO_NET_F_STANDBY,
>>>>>> + false),
>>>>>> DEFINE_PROP_END_OF_LIST(),
>>>>>> };
>>>>>> diff --git a/include/standard-headers/linux/virtio_net.h b/include/standard-headers/linux/virtio_net.h
>>>>>> index e9f255ea3f..01ec09684c 100644
>>>>>> --- a/include/standard-headers/linux/virtio_net.h
>>>>>> +++ b/include/standard-headers/linux/virtio_net.h
>>>>>> @@ -57,6 +57,9 @@
>>>>>> * Steering */
>>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>>> +#define VIRTIO_NET_F_STANDBY 62 /* Act as standby for another device
>>>>>> + * with the same MAC.
>>>>>> + */
>>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>>>>>> #ifndef VIRTIO_NET_NO_LEGACY
>>>>>> --
>>>>>> 2.14.3
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: virtio-dev-unsubscribe@lists.oasis-open.org
> For additional commands, e-mail: virtio-dev-help@lists.oasis-open.org
>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v4 1/3] compiler-gcc.h: add gnu_inline to all inline declarations
From: H. Peter Anvin @ 2018-06-12 21:31 UTC (permalink / raw)
To: Nick Desaulniers
Cc: Kate Stewart, linux-efi, brijesh.singh, J. Kiszka, Will Deacon,
virtualization, Masahiro Yamada, Manoj Gupta, boris.ostrovsky,
mawilcox, x86, akataria, Greg Hackmann, mingo, Alistair Strachan,
David Rientjes, geert, Arnd Bergmann, Linux Kbuild mailing list,
Philippe Ombredanne, rostedt, acme, Andrey Ryabinin, sedat.dilek,
Thomas Gleixner, Michal Marek, tstellar
In-Reply-To: <CAKwvOdkeEqb0HnHoggpwmHiKtDo06VN0N_r9ffv0rzigNLvXUg@mail.gmail.com>
On 06/12/18 13:19, Nick Desaulniers wrote:
>
> Oh, by [0] __GCC_STDC_INLINE__ is indeed actually the correct symbol
> to check for. Clang does support that, so nothing to fix there.
>
>> By the way, you should check clang against gcc's predefined macros by doing:
>>
>> gcc [options] -x c -Wp,-dM -E /dev/null | sort
>>
>> Options can change the predefined macros substantially, especially the, -std=, arch and -O options. -x c can be replaced with e.g. -x c++, objective-c, assembler-with-cpp etc.
>
> Neat, I'll have to bookmark that incantation. I can s/gcc/clang/ to
> get a similar list (which is how I know it supports
> __GCC_STDC_INLINE__).>
I bet that if you add -fgnu89-inlines then it *does* define
__GNUC_GNU_INLINE__.
>
> Patch now becomes something like:
>
> #ifdef __GNUC_GNU_INLINE__
> #define __gnu_inline __attribute__((gnu_inline))
> #else
> #define __gnu_inline
> #endif
>
> #define inline inline __attribute__((always_inline, unused)) notrace
> __gnu_inline
> ...
>
> Issues with that approach?
>
s/__GNUC_GNU_INLINE__/__GNUC_STDC_INLINE__/
-hpa
^ permalink raw reply
* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Jens Axboe @ 2018-06-12 20:40 UTC (permalink / raw)
To: Matthew Wilcox, linux-kernel, linux-scsi, target-devel,
linux1394-devel, linux-usb, kvm, virtualization, netdev,
Juergen Gross, qla2xxx-upstream, Kent Overstreet
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>
On 6/12/18 1:05 PM, Matthew Wilcox wrote:
> Removing the percpu_ida code nets over 400 lines of removal. It's not
> as spectacular as deleting an entire architecture, but it's still a
> worthy reduction in lines of code.
>
> Untested due to lack of hardware and not understanding how to set up a
> target platform.
>
> Changes from v1:
> - Fixed bugs pointed out by Jens in iscsit_wait_for_tag()
> - Abstracted out tag freeing as requested by Bart
> - Made iscsit_wait_for_tag static as pointed out by 0day
Reviewed-by: Jens Axboe <axboe@kernel.dk>
--
Jens Axboe
^ permalink raw reply
* Re: [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Bart Van Assche @ 2018-06-12 20:06 UTC (permalink / raw)
To: kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-usb@vger.kernel.org, willy@infradead.org,
virtualization@lists.linux-foundation.org,
kent.overstreet@gmail.com, linux1394-devel@lists.sourceforge.net,
jgross@suse.com, axboe@kernel.dk, linux-scsi@vger.kernel.org,
qla2xxx-upstream@qlogic.com, target-devel@vger.kernel.org,
netdev@vger.kernel.org
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>
On Tue, 2018-06-12 at 12:05 -0700, Matthew Wilcox wrote:
> Removing the percpu_ida code nets over 400 lines of removal. It's not
> as spectacular as deleting an entire architecture, but it's still a
> worthy reduction in lines of code.
>
> Untested due to lack of hardware and not understanding how to set up a
> target platform.
>
> Changes from v1:
> - Fixed bugs pointed out by Jens in iscsit_wait_for_tag()
> - Abstracted out tag freeing as requested by Bart
> - Made iscsit_wait_for_tag static as pointed out by 0day
For the whole series:
Reviewed-by: Bart Van Assche <bart.vanassche@wdc.com>
Thanks,
Bart.
^ permalink raw reply
* [PATCH 3/3] Remove percpu_ida
From: Matthew Wilcox @ 2018-06-12 19:05 UTC (permalink / raw)
To: linux-kernel, linux-scsi, target-devel, linux1394-devel,
linux-usb, kvm, virtualization, netdev, Juergen Gross,
qla2xxx-upstream, Kent Overstreet, Jens Axboe
Cc: Matthew Wilcox
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>
With its one user gone, remove the library code.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
---
include/linux/percpu_ida.h | 83 ---------
lib/Makefile | 2 +-
lib/percpu_ida.c | 370 -------------------------------------
3 files changed, 1 insertion(+), 454 deletions(-)
delete mode 100644 include/linux/percpu_ida.h
delete mode 100644 lib/percpu_ida.c
diff --git a/include/linux/percpu_ida.h b/include/linux/percpu_ida.h
deleted file mode 100644
index 07d78e4653bc..000000000000
--- a/include/linux/percpu_ida.h
+++ /dev/null
@@ -1,83 +0,0 @@
-/* SPDX-License-Identifier: GPL-2.0 */
-#ifndef __PERCPU_IDA_H__
-#define __PERCPU_IDA_H__
-
-#include <linux/types.h>
-#include <linux/bitops.h>
-#include <linux/init.h>
-#include <linux/sched.h>
-#include <linux/spinlock_types.h>
-#include <linux/wait.h>
-#include <linux/cpumask.h>
-
-struct percpu_ida_cpu;
-
-struct percpu_ida {
- /*
- * number of tags available to be allocated, as passed to
- * percpu_ida_init()
- */
- unsigned nr_tags;
- unsigned percpu_max_size;
- unsigned percpu_batch_size;
-
- struct percpu_ida_cpu __percpu *tag_cpu;
-
- /*
- * Bitmap of cpus that (may) have tags on their percpu freelists:
- * steal_tags() uses this to decide when to steal tags, and which cpus
- * to try stealing from.
- *
- * It's ok for a freelist to be empty when its bit is set - steal_tags()
- * will just keep looking - but the bitmap _must_ be set whenever a
- * percpu freelist does have tags.
- */
- cpumask_t cpus_have_tags;
-
- struct {
- spinlock_t lock;
- /*
- * When we go to steal tags from another cpu (see steal_tags()),
- * we want to pick a cpu at random. Cycling through them every
- * time we steal is a bit easier and more or less equivalent:
- */
- unsigned cpu_last_stolen;
-
- /* For sleeping on allocation failure */
- wait_queue_head_t wait;
-
- /*
- * Global freelist - it's a stack where nr_free points to the
- * top
- */
- unsigned nr_free;
- unsigned *freelist;
- } ____cacheline_aligned_in_smp;
-};
-
-/*
- * Number of tags we move between the percpu freelist and the global freelist at
- * a time
- */
-#define IDA_DEFAULT_PCPU_BATCH_MOVE 32U
-/* Max size of percpu freelist, */
-#define IDA_DEFAULT_PCPU_SIZE ((IDA_DEFAULT_PCPU_BATCH_MOVE * 3) / 2)
-
-int percpu_ida_alloc(struct percpu_ida *pool, int state);
-void percpu_ida_free(struct percpu_ida *pool, unsigned tag);
-
-void percpu_ida_destroy(struct percpu_ida *pool);
-int __percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags,
- unsigned long max_size, unsigned long batch_size);
-static inline int percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags)
-{
- return __percpu_ida_init(pool, nr_tags, IDA_DEFAULT_PCPU_SIZE,
- IDA_DEFAULT_PCPU_BATCH_MOVE);
-}
-
-typedef int (*percpu_ida_cb)(unsigned, void *);
-int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
- void *data);
-
-unsigned percpu_ida_free_tags(struct percpu_ida *pool, int cpu);
-#endif /* __PERCPU_IDA_H__ */
diff --git a/lib/Makefile b/lib/Makefile
index 84c6dcb31fbb..f4722a7fa62c 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -40,7 +40,7 @@ obj-y += bcd.o div64.o sort.o parser.o debug_locks.o random32.o \
bust_spinlocks.o kasprintf.o bitmap.o scatterlist.o \
gcd.o lcm.o list_sort.o uuid.o flex_array.o iov_iter.o clz_ctz.o \
bsearch.o find_bit.o llist.o memweight.o kfifo.o \
- percpu-refcount.o percpu_ida.o rhashtable.o reciprocal_div.o \
+ percpu-refcount.o rhashtable.o reciprocal_div.o \
once.o refcount.o usercopy.o errseq.o bucket_locks.o
obj-$(CONFIG_STRING_SELFTEST) += test_string.o
obj-y += string_helpers.o
diff --git a/lib/percpu_ida.c b/lib/percpu_ida.c
deleted file mode 100644
index 9bbd9c5d375a..000000000000
--- a/lib/percpu_ida.c
+++ /dev/null
@@ -1,370 +0,0 @@
-/*
- * Percpu IDA library
- *
- * Copyright (C) 2013 Datera, Inc. Kent Overstreet
- *
- * This program is free software; you can redistribute it and/or
- * modify it under the terms of the GNU General Public License as
- * published by the Free Software Foundation; either version 2, or (at
- * your option) any later version.
- *
- * This program is distributed in the hope that it will be useful, but
- * WITHOUT ANY WARRANTY; without even the implied warranty of
- * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU
- * General Public License for more details.
- */
-
-#include <linux/mm.h>
-#include <linux/bitmap.h>
-#include <linux/bitops.h>
-#include <linux/bug.h>
-#include <linux/err.h>
-#include <linux/export.h>
-#include <linux/init.h>
-#include <linux/kernel.h>
-#include <linux/percpu.h>
-#include <linux/sched/signal.h>
-#include <linux/string.h>
-#include <linux/spinlock.h>
-#include <linux/percpu_ida.h>
-
-struct percpu_ida_cpu {
- /*
- * Even though this is percpu, we need a lock for tag stealing by remote
- * CPUs:
- */
- spinlock_t lock;
-
- /* nr_free/freelist form a stack of free IDs */
- unsigned nr_free;
- unsigned freelist[];
-};
-
-static inline void move_tags(unsigned *dst, unsigned *dst_nr,
- unsigned *src, unsigned *src_nr,
- unsigned nr)
-{
- *src_nr -= nr;
- memcpy(dst + *dst_nr, src + *src_nr, sizeof(unsigned) * nr);
- *dst_nr += nr;
-}
-
-/*
- * Try to steal tags from a remote cpu's percpu freelist.
- *
- * We first check how many percpu freelists have tags
- *
- * Then we iterate through the cpus until we find some tags - we don't attempt
- * to find the "best" cpu to steal from, to keep cacheline bouncing to a
- * minimum.
- */
-static inline void steal_tags(struct percpu_ida *pool,
- struct percpu_ida_cpu *tags)
-{
- unsigned cpus_have_tags, cpu = pool->cpu_last_stolen;
- struct percpu_ida_cpu *remote;
-
- for (cpus_have_tags = cpumask_weight(&pool->cpus_have_tags);
- cpus_have_tags; cpus_have_tags--) {
- cpu = cpumask_next(cpu, &pool->cpus_have_tags);
-
- if (cpu >= nr_cpu_ids) {
- cpu = cpumask_first(&pool->cpus_have_tags);
- if (cpu >= nr_cpu_ids)
- BUG();
- }
-
- pool->cpu_last_stolen = cpu;
- remote = per_cpu_ptr(pool->tag_cpu, cpu);
-
- cpumask_clear_cpu(cpu, &pool->cpus_have_tags);
-
- if (remote == tags)
- continue;
-
- spin_lock(&remote->lock);
-
- if (remote->nr_free) {
- memcpy(tags->freelist,
- remote->freelist,
- sizeof(unsigned) * remote->nr_free);
-
- tags->nr_free = remote->nr_free;
- remote->nr_free = 0;
- }
-
- spin_unlock(&remote->lock);
-
- if (tags->nr_free)
- break;
- }
-}
-
-/*
- * Pop up to IDA_PCPU_BATCH_MOVE IDs off the global freelist, and push them onto
- * our percpu freelist:
- */
-static inline void alloc_global_tags(struct percpu_ida *pool,
- struct percpu_ida_cpu *tags)
-{
- move_tags(tags->freelist, &tags->nr_free,
- pool->freelist, &pool->nr_free,
- min(pool->nr_free, pool->percpu_batch_size));
-}
-
-/**
- * percpu_ida_alloc - allocate a tag
- * @pool: pool to allocate from
- * @state: task state for prepare_to_wait
- *
- * Returns a tag - an integer in the range [0..nr_tags) (passed to
- * tag_pool_init()), or otherwise -ENOSPC on allocation failure.
- *
- * Safe to be called from interrupt context (assuming it isn't passed
- * TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE, of course).
- *
- * @gfp indicates whether or not to wait until a free id is available (it's not
- * used for internal memory allocations); thus if passed __GFP_RECLAIM we may sleep
- * however long it takes until another thread frees an id (same semantics as a
- * mempool).
- *
- * Will not fail if passed TASK_UNINTERRUPTIBLE | TASK_INTERRUPTIBLE.
- */
-int percpu_ida_alloc(struct percpu_ida *pool, int state)
-{
- DEFINE_WAIT(wait);
- struct percpu_ida_cpu *tags;
- unsigned long flags;
- int tag = -ENOSPC;
-
- tags = raw_cpu_ptr(pool->tag_cpu);
- spin_lock_irqsave(&tags->lock, flags);
-
- /* Fastpath */
- if (likely(tags->nr_free >= 0)) {
- tag = tags->freelist[--tags->nr_free];
- spin_unlock_irqrestore(&tags->lock, flags);
- return tag;
- }
- spin_unlock_irqrestore(&tags->lock, flags);
-
- while (1) {
- spin_lock_irqsave(&pool->lock, flags);
- tags = this_cpu_ptr(pool->tag_cpu);
-
- /*
- * prepare_to_wait() must come before steal_tags(), in case
- * percpu_ida_free() on another cpu flips a bit in
- * cpus_have_tags
- *
- * global lock held and irqs disabled, don't need percpu lock
- */
- if (state != TASK_RUNNING)
- prepare_to_wait(&pool->wait, &wait, state);
-
- if (!tags->nr_free)
- alloc_global_tags(pool, tags);
- if (!tags->nr_free)
- steal_tags(pool, tags);
-
- if (tags->nr_free) {
- tag = tags->freelist[--tags->nr_free];
- if (tags->nr_free)
- cpumask_set_cpu(smp_processor_id(),
- &pool->cpus_have_tags);
- }
-
- spin_unlock_irqrestore(&pool->lock, flags);
-
- if (tag >= 0 || state == TASK_RUNNING)
- break;
-
- if (signal_pending_state(state, current)) {
- tag = -ERESTARTSYS;
- break;
- }
-
- schedule();
- }
- if (state != TASK_RUNNING)
- finish_wait(&pool->wait, &wait);
-
- return tag;
-}
-EXPORT_SYMBOL_GPL(percpu_ida_alloc);
-
-/**
- * percpu_ida_free - free a tag
- * @pool: pool @tag was allocated from
- * @tag: a tag previously allocated with percpu_ida_alloc()
- *
- * Safe to be called from interrupt context.
- */
-void percpu_ida_free(struct percpu_ida *pool, unsigned tag)
-{
- struct percpu_ida_cpu *tags;
- unsigned long flags;
- unsigned nr_free;
-
- BUG_ON(tag >= pool->nr_tags);
-
- tags = raw_cpu_ptr(pool->tag_cpu);
-
- spin_lock_irqsave(&tags->lock, flags);
- tags->freelist[tags->nr_free++] = tag;
-
- nr_free = tags->nr_free;
-
- if (nr_free == 1) {
- cpumask_set_cpu(smp_processor_id(),
- &pool->cpus_have_tags);
- wake_up(&pool->wait);
- }
- spin_unlock_irqrestore(&tags->lock, flags);
-
- if (nr_free == pool->percpu_max_size) {
- spin_lock_irqsave(&pool->lock, flags);
- spin_lock(&tags->lock);
-
- if (tags->nr_free == pool->percpu_max_size) {
- move_tags(pool->freelist, &pool->nr_free,
- tags->freelist, &tags->nr_free,
- pool->percpu_batch_size);
-
- wake_up(&pool->wait);
- }
- spin_unlock(&tags->lock);
- spin_unlock_irqrestore(&pool->lock, flags);
- }
-}
-EXPORT_SYMBOL_GPL(percpu_ida_free);
-
-/**
- * percpu_ida_destroy - release a tag pool's resources
- * @pool: pool to free
- *
- * Frees the resources allocated by percpu_ida_init().
- */
-void percpu_ida_destroy(struct percpu_ida *pool)
-{
- free_percpu(pool->tag_cpu);
- free_pages((unsigned long) pool->freelist,
- get_order(pool->nr_tags * sizeof(unsigned)));
-}
-EXPORT_SYMBOL_GPL(percpu_ida_destroy);
-
-/**
- * percpu_ida_init - initialize a percpu tag pool
- * @pool: pool to initialize
- * @nr_tags: number of tags that will be available for allocation
- *
- * Initializes @pool so that it can be used to allocate tags - integers in the
- * range [0, nr_tags). Typically, they'll be used by driver code to refer to a
- * preallocated array of tag structures.
- *
- * Allocation is percpu, but sharding is limited by nr_tags - for best
- * performance, the workload should not span more cpus than nr_tags / 128.
- */
-int __percpu_ida_init(struct percpu_ida *pool, unsigned long nr_tags,
- unsigned long max_size, unsigned long batch_size)
-{
- unsigned i, cpu, order;
-
- memset(pool, 0, sizeof(*pool));
-
- init_waitqueue_head(&pool->wait);
- spin_lock_init(&pool->lock);
- pool->nr_tags = nr_tags;
- pool->percpu_max_size = max_size;
- pool->percpu_batch_size = batch_size;
-
- /* Guard against overflow */
- if (nr_tags > (unsigned) INT_MAX + 1) {
- pr_err("percpu_ida_init(): nr_tags too large\n");
- return -EINVAL;
- }
-
- order = get_order(nr_tags * sizeof(unsigned));
- pool->freelist = (void *) __get_free_pages(GFP_KERNEL, order);
- if (!pool->freelist)
- return -ENOMEM;
-
- for (i = 0; i < nr_tags; i++)
- pool->freelist[i] = i;
-
- pool->nr_free = nr_tags;
-
- pool->tag_cpu = __alloc_percpu(sizeof(struct percpu_ida_cpu) +
- pool->percpu_max_size * sizeof(unsigned),
- sizeof(unsigned));
- if (!pool->tag_cpu)
- goto err;
-
- for_each_possible_cpu(cpu)
- spin_lock_init(&per_cpu_ptr(pool->tag_cpu, cpu)->lock);
-
- return 0;
-err:
- percpu_ida_destroy(pool);
- return -ENOMEM;
-}
-EXPORT_SYMBOL_GPL(__percpu_ida_init);
-
-/**
- * percpu_ida_for_each_free - iterate free ids of a pool
- * @pool: pool to iterate
- * @fn: interate callback function
- * @data: parameter for @fn
- *
- * Note, this doesn't guarantee to iterate all free ids restrictly. Some free
- * ids might be missed, some might be iterated duplicated, and some might
- * be iterated and not free soon.
- */
-int percpu_ida_for_each_free(struct percpu_ida *pool, percpu_ida_cb fn,
- void *data)
-{
- unsigned long flags;
- struct percpu_ida_cpu *remote;
- unsigned cpu, i, err = 0;
-
- for_each_possible_cpu(cpu) {
- remote = per_cpu_ptr(pool->tag_cpu, cpu);
- spin_lock_irqsave(&remote->lock, flags);
- for (i = 0; i < remote->nr_free; i++) {
- err = fn(remote->freelist[i], data);
- if (err)
- break;
- }
- spin_unlock_irqrestore(&remote->lock, flags);
- if (err)
- goto out;
- }
-
- spin_lock_irqsave(&pool->lock, flags);
- for (i = 0; i < pool->nr_free; i++) {
- err = fn(pool->freelist[i], data);
- if (err)
- break;
- }
- spin_unlock_irqrestore(&pool->lock, flags);
-out:
- return err;
-}
-EXPORT_SYMBOL_GPL(percpu_ida_for_each_free);
-
-/**
- * percpu_ida_free_tags - return free tags number of a specific cpu or global pool
- * @pool: pool related
- * @cpu: specific cpu or global pool if @cpu == nr_cpu_ids
- *
- * Note: this just returns a snapshot of free tags number.
- */
-unsigned percpu_ida_free_tags(struct percpu_ida *pool, int cpu)
-{
- struct percpu_ida_cpu *remote;
- if (cpu == nr_cpu_ids)
- return pool->nr_free;
- remote = per_cpu_ptr(pool->tag_cpu, cpu);
- return remote->nr_free;
-}
-EXPORT_SYMBOL_GPL(percpu_ida_free_tags);
--
2.17.1
^ permalink raw reply related
* [PATCH 2/3] Convert target drivers to use sbitmap
From: Matthew Wilcox @ 2018-06-12 19:05 UTC (permalink / raw)
To: linux-kernel, linux-scsi, target-devel, linux1394-devel,
linux-usb, kvm, virtualization, netdev, Juergen Gross,
qla2xxx-upstream, Kent Overstreet, Jens Axboe
Cc: Matthew Wilcox
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>
The sbitmap and the percpu_ida perform essentially the same task,
allocating tags for commands. The sbitmap outperforms the percpu_ida
as documented here: https://lkml.org/lkml/2014/4/22/553
The sbitmap interface is a little harder to use, but being able to
remove the percpu_ida code and getting better performance justifies the
additional complexity.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Acked-by: Felipe Balbi <felipe.balbi@linux.intel.com> # f_tcm
---
drivers/scsi/qla2xxx/qla_target.c | 10 ++++---
drivers/target/iscsi/iscsi_target_util.c | 33 +++++++++++++++++++++---
drivers/target/sbp/sbp_target.c | 5 ++--
drivers/target/target_core_transport.c | 5 ++--
drivers/target/tcm_fc/tfc_cmd.c | 6 ++---
drivers/usb/gadget/function/f_tcm.c | 5 ++--
drivers/vhost/scsi.c | 6 ++---
drivers/xen/xen-scsiback.c | 5 ++--
include/target/iscsi/iscsi_target_core.h | 1 +
include/target/target_core_base.h | 7 ++---
10 files changed, 59 insertions(+), 24 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
index 05290966e630..a1725a054749 100644
--- a/drivers/scsi/qla2xxx/qla_target.c
+++ b/drivers/scsi/qla2xxx/qla_target.c
@@ -4277,9 +4277,9 @@ static struct qla_tgt_cmd *qlt_get_tag(scsi_qla_host_t *vha,
{
struct se_session *se_sess = sess->se_sess;
struct qla_tgt_cmd *cmd;
- int tag;
+ int tag, cpu;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0)
return NULL;
@@ -4292,6 +4292,7 @@ static struct qla_tgt_cmd *qlt_get_tag(scsi_qla_host_t *vha,
qlt_incr_num_pend_cmds(vha);
cmd->vha = vha;
cmd->se_cmd.map_tag = tag;
+ cmd->se_cmd.map_cpu = cpu;
cmd->sess = sess;
cmd->loop_id = sess->loop_id;
cmd->conf_compl_supported = sess->conf_compl_supported;
@@ -5294,7 +5295,7 @@ qlt_alloc_qfull_cmd(struct scsi_qla_host *vha,
struct fc_port *sess;
struct se_session *se_sess;
struct qla_tgt_cmd *cmd;
- int tag;
+ int tag, cpu;
unsigned long flags;
if (unlikely(tgt->tgt_stop)) {
@@ -5326,7 +5327,7 @@ qlt_alloc_qfull_cmd(struct scsi_qla_host *vha,
se_sess = sess->se_sess;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0)
return;
@@ -5357,6 +5358,7 @@ qlt_alloc_qfull_cmd(struct scsi_qla_host *vha,
cmd->reset_count = ha->base_qpair->chip_reset;
cmd->q_full = 1;
cmd->qpair = ha->base_qpair;
+ cmd->se_cmd.map_cpu = cpu;
if (qfull) {
cmd->q_full = 1;
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 7e98697cfb8e..8cfcf9033507 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -17,7 +17,7 @@
******************************************************************************/
#include <linux/list.h>
-#include <linux/percpu_ida.h>
+#include <linux/sched/signal.h>
#include <net/ipv6.h> /* ipv6_addr_equal() */
#include <scsi/scsi_tcq.h>
#include <scsi/iscsi_proto.h>
@@ -147,6 +147,30 @@ void iscsit_free_r2ts_from_list(struct iscsi_cmd *cmd)
spin_unlock_bh(&cmd->r2t_lock);
}
+static int iscsit_wait_for_tag(struct se_session *se_sess, int state, int *cpup)
+{
+ int tag = -1;
+ DEFINE_WAIT(wait);
+ struct sbq_wait_state *ws;
+
+ if (state == TASK_RUNNING)
+ return tag;
+
+ ws = &se_sess->sess_tag_pool.ws[0];
+ for (;;) {
+ prepare_to_wait_exclusive(&ws->wait, &wait, state);
+ if (signal_pending_state(state, current))
+ break;
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, cpup);
+ if (tag >= 0)
+ break;
+ schedule();
+ }
+
+ finish_wait(&ws->wait, &wait);
+ return tag;
+}
+
/*
* May be called from software interrupt (timer) context for allocating
* iSCSI NopINs.
@@ -155,9 +179,11 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
{
struct iscsi_cmd *cmd;
struct se_session *se_sess = conn->sess->se_sess;
- int size, tag;
+ int size, tag, cpu;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, state);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
+ if (tag < 0)
+ tag = iscsit_wait_for_tag(se_sess, state, &cpu);
if (tag < 0)
return NULL;
@@ -166,6 +192,7 @@ struct iscsi_cmd *iscsit_allocate_cmd(struct iscsi_conn *conn, int state)
memset(cmd, 0, size);
cmd->se_cmd.map_tag = tag;
+ cmd->se_cmd.map_cpu = cpu;
cmd->conn = conn;
cmd->data_direction = DMA_NONE;
INIT_LIST_HEAD(&cmd->i_conn_node);
diff --git a/drivers/target/sbp/sbp_target.c b/drivers/target/sbp/sbp_target.c
index 679ae29d25ab..42b21f2ac8b0 100644
--- a/drivers/target/sbp/sbp_target.c
+++ b/drivers/target/sbp/sbp_target.c
@@ -926,15 +926,16 @@ static struct sbp_target_request *sbp_mgt_get_req(struct sbp_session *sess,
{
struct se_session *se_sess = sess->se_sess;
struct sbp_target_request *req;
- int tag;
+ int tag, cpu;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0)
return ERR_PTR(-ENOMEM);
req = &((struct sbp_target_request *)se_sess->sess_cmd_map)[tag];
memset(req, 0, sizeof(*req));
req->se_cmd.map_tag = tag;
+ req->se_cmd.map_cpu = cpu;
req->se_cmd.tag = next_orb;
return req;
diff --git a/drivers/target/target_core_transport.c b/drivers/target/target_core_transport.c
index f0e8f0f4ccb4..18c53c5cdd3d 100644
--- a/drivers/target/target_core_transport.c
+++ b/drivers/target/target_core_transport.c
@@ -260,7 +260,8 @@ int transport_alloc_session_tags(struct se_session *se_sess,
}
}
- rc = percpu_ida_init(&se_sess->sess_tag_pool, tag_num);
+ rc = sbitmap_queue_init_node(&se_sess->sess_tag_pool, tag_num, -1,
+ false, GFP_KERNEL, NUMA_NO_NODE);
if (rc < 0) {
pr_err("Unable to init se_sess->sess_tag_pool,"
" tag_num: %u\n", tag_num);
@@ -547,7 +548,7 @@ void transport_free_session(struct se_session *se_sess)
target_put_nacl(se_nacl);
}
if (se_sess->sess_cmd_map) {
- percpu_ida_destroy(&se_sess->sess_tag_pool);
+ sbitmap_queue_free(&se_sess->sess_tag_pool);
kvfree(se_sess->sess_cmd_map);
}
kmem_cache_free(se_sess_cache, se_sess);
diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c
index 13e4efbe1ce7..a183d4da7db2 100644
--- a/drivers/target/tcm_fc/tfc_cmd.c
+++ b/drivers/target/tcm_fc/tfc_cmd.c
@@ -28,7 +28,6 @@
#include <linux/configfs.h>
#include <linux/ctype.h>
#include <linux/hash.h>
-#include <linux/percpu_ida.h>
#include <asm/unaligned.h>
#include <scsi/scsi_tcq.h>
#include <scsi/libfc.h>
@@ -448,9 +447,9 @@ static void ft_recv_cmd(struct ft_sess *sess, struct fc_frame *fp)
struct ft_cmd *cmd;
struct fc_lport *lport = sess->tport->lport;
struct se_session *se_sess = sess->se_sess;
- int tag;
+ int tag, cpu;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0)
goto busy;
@@ -458,6 +457,7 @@ static void ft_recv_cmd(struct ft_sess *sess, struct fc_frame *fp)
memset(cmd, 0, sizeof(struct ft_cmd));
cmd->se_cmd.map_tag = tag;
+ cmd->se_cmd.map_cpu = cpu;
cmd->sess = sess;
cmd->seq = fc_seq_assign(lport, fp);
if (!cmd->seq) {
diff --git a/drivers/usb/gadget/function/f_tcm.c b/drivers/usb/gadget/function/f_tcm.c
index 9f670d9224b9..5003e857dce7 100644
--- a/drivers/usb/gadget/function/f_tcm.c
+++ b/drivers/usb/gadget/function/f_tcm.c
@@ -1071,15 +1071,16 @@ static struct usbg_cmd *usbg_get_cmd(struct f_uas *fu,
{
struct se_session *se_sess = tv_nexus->tvn_se_sess;
struct usbg_cmd *cmd;
- int tag;
+ int tag, cpu;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0)
return ERR_PTR(-ENOMEM);
cmd = &((struct usbg_cmd *)se_sess->sess_cmd_map)[tag];
memset(cmd, 0, sizeof(*cmd));
cmd->se_cmd.map_tag = tag;
+ cmd->se_cmd.map_cpu = cpu;
cmd->se_cmd.tag = cmd->tag = scsi_tag;
cmd->fu = fu;
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 70d35e696533..c9c5d6b291cc 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -46,7 +46,6 @@
#include <linux/virtio_scsi.h>
#include <linux/llist.h>
#include <linux/bitmap.h>
-#include <linux/percpu_ida.h>
#include "vhost.h"
@@ -567,7 +566,7 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq, struct vhost_scsi_tpg *tpg,
struct se_session *se_sess;
struct scatterlist *sg, *prot_sg;
struct page **pages;
- int tag;
+ int tag, cpu;
tv_nexus = tpg->tpg_nexus;
if (!tv_nexus) {
@@ -576,7 +575,7 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq, struct vhost_scsi_tpg *tpg,
}
se_sess = tv_nexus->tvn_se_sess;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0) {
pr_err("Unable to obtain tag for vhost_scsi_cmd\n");
return ERR_PTR(-ENOMEM);
@@ -591,6 +590,7 @@ vhost_scsi_get_tag(struct vhost_virtqueue *vq, struct vhost_scsi_tpg *tpg,
cmd->tvc_prot_sgl = prot_sg;
cmd->tvc_upages = pages;
cmd->tvc_se_cmd.map_tag = tag;
+ cmd->tvc_se_cmd.map_cpu = cpu;
cmd->tvc_tag = scsi_tag;
cmd->tvc_lun = lun;
cmd->tvc_task_attr = task_attr;
diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index ec6635258ed8..764dd9aa0131 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -654,9 +654,9 @@ static struct vscsibk_pend *scsiback_get_pend_req(struct vscsiif_back_ring *ring
struct scsiback_nexus *nexus = tpg->tpg_nexus;
struct se_session *se_sess = nexus->tvn_se_sess;
struct vscsibk_pend *req;
- int tag, i;
+ int tag, cpu, i;
- tag = percpu_ida_alloc(&se_sess->sess_tag_pool, TASK_RUNNING);
+ tag = sbitmap_queue_get(&se_sess->sess_tag_pool, &cpu);
if (tag < 0) {
pr_err("Unable to obtain tag for vscsiif_request\n");
return ERR_PTR(-ENOMEM);
@@ -665,6 +665,7 @@ static struct vscsibk_pend *scsiback_get_pend_req(struct vscsiif_back_ring *ring
req = &((struct vscsibk_pend *)se_sess->sess_cmd_map)[tag];
memset(req, 0, sizeof(*req));
req->se_cmd.map_tag = tag;
+ req->se_cmd.map_cpu = cpu;
for (i = 0; i < VSCSI_MAX_GRANTS; i++)
req->grant_handles[i] = SCSIBACK_INVALID_HANDLE;
diff --git a/include/target/iscsi/iscsi_target_core.h b/include/target/iscsi/iscsi_target_core.h
index cf5f3fff1f1a..f2e6abea8490 100644
--- a/include/target/iscsi/iscsi_target_core.h
+++ b/include/target/iscsi/iscsi_target_core.h
@@ -4,6 +4,7 @@
#include <linux/dma-direction.h> /* enum dma_data_direction */
#include <linux/list.h> /* struct list_head */
+#include <linux/sched.h>
#include <linux/socket.h> /* struct sockaddr_storage */
#include <linux/types.h> /* u8 */
#include <scsi/iscsi_proto.h> /* itt_t */
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index 260c2f3e9460..448f291125c2 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -4,7 +4,7 @@
#include <linux/configfs.h> /* struct config_group */
#include <linux/dma-direction.h> /* enum dma_data_direction */
-#include <linux/percpu_ida.h> /* struct percpu_ida */
+#include <linux/sbitmap.h>
#include <linux/percpu-refcount.h>
#include <linux/semaphore.h> /* struct semaphore */
#include <linux/completion.h>
@@ -455,6 +455,7 @@ struct se_cmd {
int sam_task_attr;
/* Used for se_sess->sess_tag_pool */
unsigned int map_tag;
+ int map_cpu;
/* Transport protocol dependent state, see transport_state_table */
enum transport_state_table t_state;
/* See se_cmd_flags_table */
@@ -608,7 +609,7 @@ struct se_session {
struct list_head sess_wait_list;
spinlock_t sess_cmd_lock;
void *sess_cmd_map;
- struct percpu_ida sess_tag_pool;
+ struct sbitmap_queue sess_tag_pool;
};
struct se_device;
@@ -936,7 +937,7 @@ static inline void atomic_dec_mb(atomic_t *v)
static inline void target_free_tag(struct se_session *sess, struct se_cmd *cmd)
{
- percpu_ida_free(&sess->sess_tag_pool, cmd->map_tag);
+ sbitmap_queue_clear(&sess->sess_tag_pool, cmd->map_tag, cmd->map_cpu);
}
#endif /* TARGET_CORE_BASE_H */
--
2.17.1
^ permalink raw reply related
* [PATCH 1/3] target: Abstract tag freeing
From: Matthew Wilcox @ 2018-06-12 19:05 UTC (permalink / raw)
To: linux-kernel, linux-scsi, target-devel, linux1394-devel,
linux-usb, kvm, virtualization, netdev, Juergen Gross,
qla2xxx-upstream, Kent Overstreet, Jens Axboe
Cc: Matthew Wilcox
In-Reply-To: <20180612190545.10781-1-willy@infradead.org>
Introduce target_free_tag() and convert all drivers to use it.
Signed-off-by: Matthew Wilcox <willy@infradead.org>
---
drivers/scsi/qla2xxx/qla_target.c | 4 ++--
drivers/target/iscsi/iscsi_target_util.c | 2 +-
drivers/target/sbp/sbp_target.c | 2 +-
drivers/target/tcm_fc/tfc_cmd.c | 4 ++--
drivers/usb/gadget/function/f_tcm.c | 2 +-
drivers/vhost/scsi.c | 2 +-
drivers/xen/xen-scsiback.c | 4 +---
include/target/target_core_base.h | 5 +++++
8 files changed, 14 insertions(+), 11 deletions(-)
diff --git a/drivers/scsi/qla2xxx/qla_target.c b/drivers/scsi/qla2xxx/qla_target.c
index b85c833099ff..05290966e630 100644
--- a/drivers/scsi/qla2xxx/qla_target.c
+++ b/drivers/scsi/qla2xxx/qla_target.c
@@ -3783,7 +3783,7 @@ void qlt_free_cmd(struct qla_tgt_cmd *cmd)
return;
}
cmd->jiffies_at_free = get_jiffies_64();
- percpu_ida_free(&sess->se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
+ target_free_tag(sess->se_sess, &cmd->se_cmd);
}
EXPORT_SYMBOL(qlt_free_cmd);
@@ -4146,7 +4146,7 @@ static void __qlt_do_work(struct qla_tgt_cmd *cmd)
qlt_send_term_exchange(qpair, NULL, &cmd->atio, 1, 0);
qlt_decr_num_pend_cmds(vha);
- percpu_ida_free(&sess->se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
+ target_free_tag(sess->se_sess, &cmd->se_cmd);
spin_unlock_irqrestore(qpair->qp_lock_ptr, flags);
spin_lock_irqsave(&ha->tgt.sess_lock, flags);
diff --git a/drivers/target/iscsi/iscsi_target_util.c b/drivers/target/iscsi/iscsi_target_util.c
index 4435bf374d2d..7e98697cfb8e 100644
--- a/drivers/target/iscsi/iscsi_target_util.c
+++ b/drivers/target/iscsi/iscsi_target_util.c
@@ -711,7 +711,7 @@ void iscsit_release_cmd(struct iscsi_cmd *cmd)
kfree(cmd->iov_data);
kfree(cmd->text_in_ptr);
- percpu_ida_free(&sess->se_sess->sess_tag_pool, se_cmd->map_tag);
+ target_free_tag(sess->se_sess, se_cmd);
}
EXPORT_SYMBOL(iscsit_release_cmd);
diff --git a/drivers/target/sbp/sbp_target.c b/drivers/target/sbp/sbp_target.c
index fb1003921d85..679ae29d25ab 100644
--- a/drivers/target/sbp/sbp_target.c
+++ b/drivers/target/sbp/sbp_target.c
@@ -1460,7 +1460,7 @@ static void sbp_free_request(struct sbp_target_request *req)
kfree(req->pg_tbl);
kfree(req->cmd_buf);
- percpu_ida_free(&se_sess->sess_tag_pool, se_cmd->map_tag);
+ target_free_tag(se_sess, se_cmd);
}
static void sbp_mgt_agent_process(struct work_struct *work)
diff --git a/drivers/target/tcm_fc/tfc_cmd.c b/drivers/target/tcm_fc/tfc_cmd.c
index ec372860106f..13e4efbe1ce7 100644
--- a/drivers/target/tcm_fc/tfc_cmd.c
+++ b/drivers/target/tcm_fc/tfc_cmd.c
@@ -92,7 +92,7 @@ static void ft_free_cmd(struct ft_cmd *cmd)
if (fr_seq(fp))
fc_seq_release(fr_seq(fp));
fc_frame_free(fp);
- percpu_ida_free(&sess->se_sess->sess_tag_pool, cmd->se_cmd.map_tag);
+ target_free_tag(sess->se_sess, &cmd->se_cmd);
ft_sess_put(sess); /* undo get from lookup at recv */
}
@@ -461,7 +461,7 @@ static void ft_recv_cmd(struct ft_sess *sess, struct fc_frame *fp)
cmd->sess = sess;
cmd->seq = fc_seq_assign(lport, fp);
if (!cmd->seq) {
- percpu_ida_free(&se_sess->sess_tag_pool, tag);
+ target_free_tag(se_sess, &cmd->se_cmd);
goto busy;
}
cmd->req_frame = fp; /* hold frame during cmd */
diff --git a/drivers/usb/gadget/function/f_tcm.c b/drivers/usb/gadget/function/f_tcm.c
index d78dbb73bde8..9f670d9224b9 100644
--- a/drivers/usb/gadget/function/f_tcm.c
+++ b/drivers/usb/gadget/function/f_tcm.c
@@ -1288,7 +1288,7 @@ static void usbg_release_cmd(struct se_cmd *se_cmd)
struct se_session *se_sess = se_cmd->se_sess;
kfree(cmd->data_buf);
- percpu_ida_free(&se_sess->sess_tag_pool, se_cmd->map_tag);
+ target_free_tag(se_sess, se_cmd);
}
static u32 usbg_sess_get_index(struct se_session *se_sess)
diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
index 7ad57094d736..70d35e696533 100644
--- a/drivers/vhost/scsi.c
+++ b/drivers/vhost/scsi.c
@@ -324,7 +324,7 @@ static void vhost_scsi_release_cmd(struct se_cmd *se_cmd)
}
vhost_scsi_put_inflight(tv_cmd->inflight);
- percpu_ida_free(&se_sess->sess_tag_pool, se_cmd->map_tag);
+ target_free_tag(se_sess, se_cmd);
}
static u32 vhost_scsi_sess_get_index(struct se_session *se_sess)
diff --git a/drivers/xen/xen-scsiback.c b/drivers/xen/xen-scsiback.c
index 7bc88fd43cfc..ec6635258ed8 100644
--- a/drivers/xen/xen-scsiback.c
+++ b/drivers/xen/xen-scsiback.c
@@ -1377,9 +1377,7 @@ static int scsiback_check_stop_free(struct se_cmd *se_cmd)
static void scsiback_release_cmd(struct se_cmd *se_cmd)
{
- struct se_session *se_sess = se_cmd->se_sess;
-
- percpu_ida_free(&se_sess->sess_tag_pool, se_cmd->map_tag);
+ target_free_tag(se_cmd->se_sess, se_cmd);
}
static u32 scsiback_sess_get_index(struct se_session *se_sess)
diff --git a/include/target/target_core_base.h b/include/target/target_core_base.h
index 922a39f45abc..260c2f3e9460 100644
--- a/include/target/target_core_base.h
+++ b/include/target/target_core_base.h
@@ -934,4 +934,9 @@ static inline void atomic_dec_mb(atomic_t *v)
smp_mb__after_atomic();
}
+static inline void target_free_tag(struct se_session *sess, struct se_cmd *cmd)
+{
+ percpu_ida_free(&sess->sess_tag_pool, cmd->map_tag);
+}
+
#endif /* TARGET_CORE_BASE_H */
--
2.17.1
^ permalink raw reply related
* [PATCH 0/3] Use sbitmap instead of percpu_ida
From: Matthew Wilcox @ 2018-06-12 19:05 UTC (permalink / raw)
To: linux-kernel, linux-scsi, target-devel, linux1394-devel,
linux-usb, kvm, virtualization, netdev, Juergen Gross,
qla2xxx-upstream, Kent Overstreet, Jens Axboe
Cc: Matthew Wilcox
Removing the percpu_ida code nets over 400 lines of removal. It's not
as spectacular as deleting an entire architecture, but it's still a
worthy reduction in lines of code.
Untested due to lack of hardware and not understanding how to set up a
target platform.
Changes from v1:
- Fixed bugs pointed out by Jens in iscsit_wait_for_tag()
- Abstracted out tag freeing as requested by Bart
- Made iscsit_wait_for_tag static as pointed out by 0day
Matthew Wilcox (3):
target: Abstract tag freeing
Convert target drivers to use sbitmap
Remove percpu_ida
drivers/scsi/qla2xxx/qla_target.c | 14 +-
drivers/target/iscsi/iscsi_target_util.c | 35 ++-
drivers/target/sbp/sbp_target.c | 7 +-
drivers/target/target_core_transport.c | 5 +-
drivers/target/tcm_fc/tfc_cmd.c | 10 +-
drivers/usb/gadget/function/f_tcm.c | 7 +-
drivers/vhost/scsi.c | 8 +-
drivers/xen/xen-scsiback.c | 9 +-
include/linux/percpu_ida.h | 83 -----
include/target/iscsi/iscsi_target_core.h | 1 +
include/target/target_core_base.h | 10 +-
lib/Makefile | 2 +-
lib/percpu_ida.c | 370 -----------------------
13 files changed, 73 insertions(+), 488 deletions(-)
delete mode 100644 include/linux/percpu_ida.h
delete mode 100644 lib/percpu_ida.c
--
2.17.1
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox