* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Alexander Duyck @ 2018-01-23 3:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski,
Samudrala, Sridhar, virtualization, achiad shochat,
Achiad Shochat, Netdev, David Miller
In-Reply-To: <20180123035619-mutt-send-email-mst@kernel.org>
On Mon, Jan 22, 2018 at 6:04 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> On Mon, Jan 22, 2018 at 05:34:37PM -0800, Samudrala, Sridhar wrote:
>> On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
>> > On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
>> > > > > You could probably
>> > > > > even handle the Tx queue selection via a simple eBPF program and map
>> > > > > since the input for whatever is used to select Tx should be pretty
>> > > > > simple, destination MAC, source NUMA node, etc, and the data-set
>> > > > > shouldn't be too large.
>> > > > That sounds interesting. A separate device might make this kind of setup
>> > > > a bit easier. Sridhar, did you look into creating a separate device for
>> > > > the virtual bond device at all? It does not have to be in a separate
>> > > > module, that kind of refactoring can come later, but once we commit to
>> > > > using the same single device as virtio, we can't change that.
>> > > No. I haven't looked into creating a separate device. If we are going to
>> > > create a new
>> > > device, i guess it has to be of a new device type with its own driver.
>> > Well not necessarily - just a separate netdev ops.
>> > Kind of like e.g. vlans share a driver with the main driver.
>>
>> Not sure what you meant by vlans sharing a driver with the main driver.
>> IIUC, vlans are supported via 802.1q driver and creates a netdev of type
>> 'vlan'
>> with vlan_netdev_ops
>
> But nothing prevents a single module from registering
> multiple ops.
Just to clarify, it seems like what you are suggesting is just adding
the "master" as a separate set of netdev ops or netdevice and to have
virtio spawn two network devices, one slave and one master, if the
BACKUP bit is set. Do I have that right?
I am good with the code still living in the virtio driver and
consolidation with other similar implementations and further
improvement could probably happen later as part of some refactor.
Thanks.
- Alex
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Michael S. Tsirkin @ 2018-01-23 2:04 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Netdev,
Alexander Duyck, virtualization, achiad shochat, Achiad Shochat,
David Miller
In-Reply-To: <be3864b4-f108-6a7d-8628-204d4d4ae278@intel.com>
On Mon, Jan 22, 2018 at 05:34:37PM -0800, Samudrala, Sridhar wrote:
> On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
> > On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
> > > > > You could probably
> > > > > even handle the Tx queue selection via a simple eBPF program and map
> > > > > since the input for whatever is used to select Tx should be pretty
> > > > > simple, destination MAC, source NUMA node, etc, and the data-set
> > > > > shouldn't be too large.
> > > > That sounds interesting. A separate device might make this kind of setup
> > > > a bit easier. Sridhar, did you look into creating a separate device for
> > > > the virtual bond device at all? It does not have to be in a separate
> > > > module, that kind of refactoring can come later, but once we commit to
> > > > using the same single device as virtio, we can't change that.
> > > No. I haven't looked into creating a separate device. If we are going to
> > > create a new
> > > device, i guess it has to be of a new device type with its own driver.
> > Well not necessarily - just a separate netdev ops.
> > Kind of like e.g. vlans share a driver with the main driver.
>
> Not sure what you meant by vlans sharing a driver with the main driver.
> IIUC, vlans are supported via 802.1q driver and creates a netdev of type
> 'vlan'
> with vlan_netdev_ops
But nothing prevents a single module from registering
multiple ops.
> >
> > > As we are using virtio_net to control and manage the VF data path, it is not
> > > clear to me
> > > what is the advantage of creating a new device rather than extending
> > > virtio_net to manage
> > > the VF datapath via transparent bond mechanism.
> > >
> > > Thanks
> > > Sridhar
> > So that XDP redirect actions can differentiate between virtio, PT
> > device and the bond. Without it there's no way to redirect
> > to virtio specifically.
> I guess this usecase is for a guest admin to add bpf programs to VF netdev
> and
> redirect frames to virtio.
No - this doesn't make much sense IMHO. The usecase would be VM
bridging where we process packets incoming on one virtio interface, and
transmit them on another one. It was pointed out using a separate master
netdev and making both virtio and PT its slaves would allow redirect
switch to force virtio, without it it won't be possible to force
redirect to virtio.
How important that use-case is, I am not sure.
> How does bond enable this feature?
>
>
> Thanks
> Sridhar
As it's a userspace configuration, I guess for starters we
can punt this to userspace, too.
--
MST
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Samudrala, Sridhar @ 2018-01-23 1:37 UTC (permalink / raw)
To: Stephen Hemminger
Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
Jakub Kicinski, Netdev, Alexander Duyck, virtualization,
achiad shochat, Achiad Shochat, David Miller
In-Reply-To: <20180122160204.130451f2@xeon-e3>
On 1/22/2018 4:02 PM, Stephen Hemminger wrote:
>
>>>
>>>> In the case of SwitchDev it
>>>> should be possible for the port representors and the switch to provide
>>>> data on which interfaces are bonded on the host side and which aren't.
>>>> With that data it would be pretty easy to just put together a list of
>>>> addresses that would prefer to go the para-virtual route instead of
>>>> being transmitted through physical hardware.
>>>>
>>>> In addition a bridge implies much more overhead since normally a
>>>> bridge can receive a packet in on one interface and transmit it on
>>>> another. We don't really need that. This is more of a VEPA type setup
>>>> and doesn't need to be anything all that complex. You could probably
>>>> even handle the Tx queue selection via a simple eBPF program and map
>>>> since the input for whatever is used to select Tx should be pretty
>>>> simple, destination MAC, source NUMA node, etc, and the data-set
>>>> shouldn't be too large.
>>> That sounds interesting. A separate device might make this kind of setup
>>> a bit easier. Sridhar, did you look into creating a separate device for
>>> the virtual bond device at all? It does not have to be in a separate
>>> module, that kind of refactoring can come later, but once we commit to
>>> using the same single device as virtio, we can't change that.
>> No. I haven't looked into creating a separate device. If we are going to
>> create a new
>> device, i guess it has to be of a new device type with its own driver.
>>
>> As we are using virtio_net to control and manage the VF data path, it is
>> not clear to me
>> what is the advantage of creating a new device rather than extending
>> virtio_net to manage
>> the VF datapath via transparent bond mechanism.
>>
>> Thanks
>> Sridhar
>>
>>
> The requirement with Azure accelerated network was that a stock distribution image from the
> store must be able to run unmodified and get accelerated networking.
> Not sure if other environments need to work the same, but it would be nice.
>
> That meant no additional setup scripts (aka no bonding) and also it must
> work transparently with hot-plug. Also there are diverse set of environments:
> openstack, cloudinit, network manager and systemd. The solution had to not depend
> on any one of them, but also not break any of them.
Yes. Cloud Service Providers using KVM as hypervisor have a similar
requirement to provide accelerated
networking with VM images that support virtio_net.
Thanks
Sridhar
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Samudrala, Sridhar @ 2018-01-23 1:34 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Netdev,
Alexander Duyck, virtualization, achiad shochat, Achiad Shochat,
David Miller
In-Reply-To: <20180123020332-mutt-send-email-mst@kernel.org>
On 1/22/2018 4:05 PM, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
>>>> You could probably
>>>> even handle the Tx queue selection via a simple eBPF program and map
>>>> since the input for whatever is used to select Tx should be pretty
>>>> simple, destination MAC, source NUMA node, etc, and the data-set
>>>> shouldn't be too large.
>>> That sounds interesting. A separate device might make this kind of setup
>>> a bit easier. Sridhar, did you look into creating a separate device for
>>> the virtual bond device at all? It does not have to be in a separate
>>> module, that kind of refactoring can come later, but once we commit to
>>> using the same single device as virtio, we can't change that.
>> No. I haven't looked into creating a separate device. If we are going to
>> create a new
>> device, i guess it has to be of a new device type with its own driver.
> Well not necessarily - just a separate netdev ops.
> Kind of like e.g. vlans share a driver with the main driver.
Not sure what you meant by vlans sharing a driver with the main driver.
IIUC, vlans are supported via 802.1q driver and creates a netdev of
type 'vlan'
with vlan_netdev_ops
>
>> As we are using virtio_net to control and manage the VF data path, it is not
>> clear to me
>> what is the advantage of creating a new device rather than extending
>> virtio_net to manage
>> the VF datapath via transparent bond mechanism.
>>
>> Thanks
>> Sridhar
> So that XDP redirect actions can differentiate between virtio, PT
> device and the bond. Without it there's no way to redirect
> to virtio specifically.
I guess this usecase is for a guest admin to add bpf programs to VF
netdev and
redirect frames to virtio. How does bond enable this feature?
Thanks
Sridhar
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Michael S. Tsirkin @ 2018-01-23 1:23 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Duyck, Alexander H, virtio-dev, achiad shochat,
Samudrala, Sridhar, Alexander Duyck, virtualization,
Achiad Shochat, Netdev, David Miller
In-Reply-To: <20180122171301.3463156b@cakuba.netronome.com>
On Mon, Jan 22, 2018 at 05:13:01PM -0800, Jakub Kicinski wrote:
> On Tue, 23 Jan 2018 02:47:57 +0200, Michael S. Tsirkin wrote:
> > On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> > > On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > > > > As we are using virtio_net to control and manage the VF data path, it is not
> > > > > clear to me
> > > > > what is the advantage of creating a new device rather than extending
> > > > > virtio_net to manage
> > > > > the VF datapath via transparent bond mechanism.
> > > >
> > > > So that XDP redirect actions can differentiate between virtio, PT
> > > > device and the bond. Without it there's no way to redirect
> > > > to virtio specifically.
> > >
> > > Let's make a list :P
> > >
> > > separate netdev:
> > > 1. virtio device being a bond master is confusing/unexpected.
> > > 2. virtio device being both a master and a slave is confusing.
> >
> > vlans are like this too, aren't they?
>
> Perhaps a bad wording. Both master and member would be better.
>
> > > 3. configuration of a master may have different semantics than
> > > configuration of a slave.
> > > 4. two para-virt devices may create a loop (or rather will be bound
> > > to each other indeterministically, depending on which spawns first).
> >
> > For 2 virtio devices, we can disable the bond to make it deterministic.
>
> Do you mean the hypervisor can or is there a knob in virtio_net to mask
> off features?
Hypervisor can do it too. And it really should:
specifying 2 devices as backup and giving them same mac
seems like a misconfiguration.
But it's easy to do in virtio without knobs - check
that the potential slave is also a virtio device with the
backup flag, and don't make it a slave.
> Would that require re-probe of the virtio device?
Probably not.
> > > 5. there is no user configuration AFAIR in existing patch, VM admin
> > > won't be able to prevent the bond. Separate netdev we can make
> > > removable even if it's spawned automatically.
> >
> > That's more or less a feature. If you want configurability, just use
> > any of the existing generic solutions (team,bond,bridge,...).
>
> The use case in mind is that VM admin wants to troubleshoot a problem
> and temporarily disable the auto-bond without touching the hypervisor
> (and either member preferably).
I don't think it's possible to support this unconditionally.
E.g. think of a config where these actually share
a backend, with virtio becoming active when PT
access to the backend is disabled.
So you will need some device specific extension for that.
> > > 6. XDP redirect use-case (or any explicit use of the virtio slave)
> > > (from MST)
> > >
> > > independent driver:
> > > 7. code reuse.
> >
> > With netvsc? That precludes a separate device though because of
> > compatibility.
>
> Hopefully with one of the established bonding drivers (fingers
> crossed).
There is very little similarity. Calling this device a bond
just confuses people.
> We may see proliferation of special bonds (see Achiad's
> presentation from last netdev about NIC-NUMA-node-bonds).
I'll take a look, but this isn't like a bond at all, no more than a vlan
is a bond. E.g. if PT link goes down then link is down period and you
do not want to switch to virtio.
> > > separate device:
> >
> > I'm not sure I understand how "separate device" is different from
> > "separate netdev". Do you advocate for a special device who's job is
> > just to tell the guest "bind these two devices together"?
> >
> > Yea, sure, that works. However for sure it's more work to
> > implement and manage at all levels. Further
> >
> > - it doesn't answer the question
> > - a feature bit in a virtio device is cheap enough that
> > I wouldn't worry too much about this feature
> > going unused later.
>
> Right, I think we are referring to different things as device. I mean
> a bus device/struct device, but no strong preference on that one. I'll
> be happy as long as there is a separate netdev, really :)
>
> > > 8. bond any netdev with any netdev.
> > > 9. reuse well-known device driver model.
> > > a. natural anchor for hypervisor configuration (switchdev etc.)
> >
> > saparate netdev has the same property
> >
> > > b. next-gen silicon may be able to disguise as virtio device, and the
> > > loop check in virtio driver will prevent the legitimate bond it such
> > > case. AFAIU that's one of the goals of next-gen virtio spec as well.
> >
> > In fact we have a virtio feature bit for the fallback.
> > So this part does not depend on how software in guest works
> > and does not need software solutions.
>
> You mean in the new spec? Nice. Still I think people will try to
> implement the old one too given sufficiently capable HW.
Existing HW won't have the BACKUP feature so the new functionality
won't be activated. So no problem I think.
--
MST
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Jakub Kicinski @ 2018-01-23 1:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev, achiad shochat,
Samudrala, Sridhar, Alexander Duyck, virtualization,
Achiad Shochat, Netdev, David Miller
In-Reply-To: <20180123023810-mutt-send-email-mst@kernel.org>
On Tue, 23 Jan 2018 02:47:57 +0200, Michael S. Tsirkin wrote:
> On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> > On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > > > As we are using virtio_net to control and manage the VF data path, it is not
> > > > clear to me
> > > > what is the advantage of creating a new device rather than extending
> > > > virtio_net to manage
> > > > the VF datapath via transparent bond mechanism.
> > >
> > > So that XDP redirect actions can differentiate between virtio, PT
> > > device and the bond. Without it there's no way to redirect
> > > to virtio specifically.
> >
> > Let's make a list :P
> >
> > separate netdev:
> > 1. virtio device being a bond master is confusing/unexpected.
> > 2. virtio device being both a master and a slave is confusing.
>
> vlans are like this too, aren't they?
Perhaps a bad wording. Both master and member would be better.
> > 3. configuration of a master may have different semantics than
> > configuration of a slave.
> > 4. two para-virt devices may create a loop (or rather will be bound
> > to each other indeterministically, depending on which spawns first).
>
> For 2 virtio devices, we can disable the bond to make it deterministic.
Do you mean the hypervisor can or is there a knob in virtio_net to mask
off features? Would that require re-probe of the virtio device?
> > 5. there is no user configuration AFAIR in existing patch, VM admin
> > won't be able to prevent the bond. Separate netdev we can make
> > removable even if it's spawned automatically.
>
> That's more or less a feature. If you want configurability, just use
> any of the existing generic solutions (team,bond,bridge,...).
The use case in mind is that VM admin wants to troubleshoot a problem
and temporarily disable the auto-bond without touching the hypervisor
(and either member preferably).
> > 6. XDP redirect use-case (or any explicit use of the virtio slave)
> > (from MST)
> >
> > independent driver:
> > 7. code reuse.
>
> With netvsc? That precludes a separate device though because of
> compatibility.
Hopefully with one of the established bonding drivers (fingers
crossed). We may see proliferation of special bonds (see Achiad's
presentation from last netdev about NIC-NUMA-node-bonds).
> > separate device:
>
> I'm not sure I understand how "separate device" is different from
> "separate netdev". Do you advocate for a special device who's job is
> just to tell the guest "bind these two devices together"?
>
> Yea, sure, that works. However for sure it's more work to
> implement and manage at all levels. Further
>
> - it doesn't answer the question
> - a feature bit in a virtio device is cheap enough that
> I wouldn't worry too much about this feature
> going unused later.
Right, I think we are referring to different things as device. I mean
a bus device/struct device, but no strong preference on that one. I'll
be happy as long as there is a separate netdev, really :)
> > 8. bond any netdev with any netdev.
> > 9. reuse well-known device driver model.
> > a. natural anchor for hypervisor configuration (switchdev etc.)
>
> saparate netdev has the same property
>
> > b. next-gen silicon may be able to disguise as virtio device, and the
> > loop check in virtio driver will prevent the legitimate bond it such
> > case. AFAIU that's one of the goals of next-gen virtio spec as well.
>
> In fact we have a virtio feature bit for the fallback.
> So this part does not depend on how software in guest works
> and does not need software solutions.
You mean in the new spec? Nice. Still I think people will try to
implement the old one too given sufficiently capable HW.
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Michael S. Tsirkin @ 2018-01-23 0:47 UTC (permalink / raw)
To: Jakub Kicinski
Cc: Duyck, Alexander H, virtio-dev, achiad shochat,
Samudrala, Sridhar, Alexander Duyck, virtualization,
Achiad Shochat, Netdev, David Miller
In-Reply-To: <20180122161623.22b6e151@cakuba.netronome.com>
On Mon, Jan 22, 2018 at 04:16:23PM -0800, Jakub Kicinski wrote:
> On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > > As we are using virtio_net to control and manage the VF data path, it is not
> > > clear to me
> > > what is the advantage of creating a new device rather than extending
> > > virtio_net to manage
> > > the VF datapath via transparent bond mechanism.
> >
> > So that XDP redirect actions can differentiate between virtio, PT
> > device and the bond. Without it there's no way to redirect
> > to virtio specifically.
>
> Let's make a list :P
>
> separate netdev:
> 1. virtio device being a bond master is confusing/unexpected.
> 2. virtio device being both a master and a slave is confusing.
vlans are like this too, aren't they?
> 3. configuration of a master may have different semantics than
> configuration of a slave.
> 4. two para-virt devices may create a loop (or rather will be bound
> to each other indeterministically, depending on which spawns first).
For 2 virtio devices, we can disable the bond to make it deterministic.
> 5. there is no user configuration AFAIR in existing patch, VM admin
> won't be able to prevent the bond. Separate netdev we can make
> removable even if it's spawned automatically.
That's more or less a feature. If you want configurability, just use
any of the existing generic solutions (team,bond,bridge,...).
> 6. XDP redirect use-case (or any explicit use of the virtio slave)
> (from MST)
>
> independent driver:
> 7. code reuse.
With netvsc? That precludes a separate device though because of
compatibility.
>
> separate device:
I'm not sure I understand how "separate device" is different from
"separate netdev". Do you advocate for a special device who's job is
just to tell the guest "bind these two devices together"?
Yea, sure, that works. However for sure it's more work to
implement and manage at all levels. Further
- it doesn't answer the question
- a feature bit in a virtio device is cheap enough that
I wouldn't worry too much about this feature
going unused later.
> 8. bond any netdev with any netdev.
> 9. reuse well-known device driver model.
> a. natural anchor for hypervisor configuration (switchdev etc.)
saparate netdev has the same property
> b. next-gen silicon may be able to disguise as virtio device, and the
> loop check in virtio driver will prevent the legitimate bond it such
> case. AFAIU that's one of the goals of next-gen virtio spec as well.
In fact we have a virtio feature bit for the fallback.
So this part does not depend on how software in guest works
and does not need software solutions.
--
MST
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Jakub Kicinski @ 2018-01-23 0:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Duyck, Alexander H, virtio-dev, achiad shochat,
Samudrala, Sridhar, Alexander Duyck, virtualization,
Achiad Shochat, Netdev, David Miller
In-Reply-To: <20180123020332-mutt-send-email-mst@kernel.org>
On Tue, 23 Jan 2018 02:05:48 +0200, Michael S. Tsirkin wrote:
> > As we are using virtio_net to control and manage the VF data path, it is not
> > clear to me
> > what is the advantage of creating a new device rather than extending
> > virtio_net to manage
> > the VF datapath via transparent bond mechanism.
>
> So that XDP redirect actions can differentiate between virtio, PT
> device and the bond. Without it there's no way to redirect
> to virtio specifically.
Let's make a list :P
separate netdev:
1. virtio device being a bond master is confusing/unexpected.
2. virtio device being both a master and a slave is confusing.
3. configuration of a master may have different semantics than
configuration of a slave.
4. two para-virt devices may create a loop (or rather will be bound
to each other indeterministically, depending on which spawns first).
5. there is no user configuration AFAIR in existing patch, VM admin
won't be able to prevent the bond. Separate netdev we can make
removable even if it's spawned automatically.
6. XDP redirect use-case (or any explicit use of the virtio slave)
(from MST)
independent driver:
7. code reuse.
separate device:
8. bond any netdev with any netdev.
9. reuse well-known device driver model.
a. natural anchor for hypervisor configuration (switchdev etc.)
b. next-gen silicon may be able to disguise as virtio device, and the
loop check in virtio driver will prevent the legitimate bond it such
case. AFAIU that's one of the goals of next-gen virtio spec as well.
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Michael S. Tsirkin @ 2018-01-23 0:05 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Netdev,
Alexander Duyck, virtualization, achiad shochat, Achiad Shochat,
David Miller
In-Reply-To: <7edf772b-627c-6121-3332-479caed524da@intel.com>
On Mon, Jan 22, 2018 at 03:27:40PM -0800, Samudrala, Sridhar wrote:
> > > You could probably
> > > even handle the Tx queue selection via a simple eBPF program and map
> > > since the input for whatever is used to select Tx should be pretty
> > > simple, destination MAC, source NUMA node, etc, and the data-set
> > > shouldn't be too large.
> > That sounds interesting. A separate device might make this kind of setup
> > a bit easier. Sridhar, did you look into creating a separate device for
> > the virtual bond device at all? It does not have to be in a separate
> > module, that kind of refactoring can come later, but once we commit to
> > using the same single device as virtio, we can't change that.
>
> No. I haven't looked into creating a separate device. If we are going to
> create a new
> device, i guess it has to be of a new device type with its own driver.
Well not necessarily - just a separate netdev ops.
Kind of like e.g. vlans share a driver with the main driver.
> As we are using virtio_net to control and manage the VF data path, it is not
> clear to me
> what is the advantage of creating a new device rather than extending
> virtio_net to manage
> the VF datapath via transparent bond mechanism.
>
> Thanks
> Sridhar
So that XDP redirect actions can differentiate between virtio, PT
device and the bond. Without it there's no way to redirect
to virtio specifically.
--
MST
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Stephen Hemminger @ 2018-01-23 0:02 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Duyck, Alexander H, virtio-dev, Michael S. Tsirkin,
Jakub Kicinski, Netdev, Alexander Duyck, virtualization,
achiad shochat, Achiad Shochat, David Miller
In-Reply-To: <7edf772b-627c-6121-3332-479caed524da@intel.com>
On Mon, 22 Jan 2018 15:27:40 -0800
"Samudrala, Sridhar" <sridhar.samudrala@intel.com> wrote:
> On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote:
> > On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
> >> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> >>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
> >>>>
> >>>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> >>>>> On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> >>>>>> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >>>>>> <sridhar.samudrala@intel.com> wrote:
> >>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
> >>>>>>> act as a backup for another device with the same MAC address.
> >>>>>>>
> >>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >>>>>>> ---
> >>>>>>> drivers/net/virtio_net.c | 2 +-
> >>>>>>> include/uapi/linux/virtio_net.h | 3 +++
> >>>>>>> 2 files changed, 4 insertions(+), 1 deletion(-)
> >>>>>>>
> >>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >>>>>>> index 12dfc5fee58e..f149a160a8c5 100644
> >>>>>>> --- a/drivers/net/virtio_net.c
> >>>>>>> +++ b/drivers/net/virtio_net.c
> >>>>>>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >>>>>>> VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >>>>>>> VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >>>>>>> VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >>>>>>> - VIRTIO_NET_F_SPEED_DUPLEX
> >>>>>>> + VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >>>>>>>
> >>>>>>> static unsigned int features[] = {
> >>>>>>> VIRTNET_FEATURES,
> >>>>>>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >>>>>>> index 5de6ed37695b..c7c35fd1a5ed 100644
> >>>>>>> --- a/include/uapi/linux/virtio_net.h
> >>>>>>> +++ b/include/uapi/linux/virtio_net.h
> >>>>>>> @@ -57,6 +57,9 @@
> >>>>>>> * Steering */
> >>>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> >>>>>>>
> >>>>>>> +#define VIRTIO_NET_F_BACKUP 62 /* Act as backup for another device
> >>>>>>> + * with the same MAC.
> >>>>>>> + */
> >>>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
> >>>>>>>
> >>>>>>> #ifndef VIRTIO_NET_NO_LEGACY
> >>>>>> I'm not a huge fan of the name "backup" since that implies that the
> >>>>>> Virtio interface is only used if the VF is not present, and there are
> >>>>>> multiple instances such as dealing with east/west or
> >>>>>> broadcast/multicast traffic where it may be desirable to use the
> >>>>>> para-virtual interface rather then deal with PCI overhead/bottleneck
> >>>>>> to send the packet.
> >>>>> Right now hypervisors mostly expect that yes, only one at a time is
> >>>>> used. E.g. if you try to do multicast sending packets on both VF and
> >>>>> virtio then you will end up with two copies of each packet.
> >>>> I think we want to use only 1 interface to send out any packet. In case of
> >>>> broadcast/multicasts it would be an optimization to send them via virtio and
> >>>> this patch series adds that optimization.
> >>> Right that's what I think we should rather avoid for now.
> >>>
> >>> It's *not* an optimization if there's a single VM on this host,
> >>> or if a specific multicast group does not have any VMs on same
> >>> host.
> >> Agreed. In my mind this is something that is controlled by the
> >> pass-thru interface once it is enslaved.
> > It would be pretty tricky to control through the PT
> > interface since a PT interface pretends to be a physical
> > device, which has no concept of VMs.
> >
> >>> I'd rather we just sent everything out on the PT if that's
> >>> there. The reason we have virtio in the picture is just so
> >>> we can migrate without downtime.
> >> I wasn't saying we do that in all cases. That would be something that
> >> would have to be decided by the pass-thru interface. Ideally the
> >> virtio would provide just enough information to get itself into the
> >> bond and I see this being the mechanism for it to do so. From there
> >> the complexity mostly lies in the pass-thru interface to configure the
> >> correct transmit modes if for example you have multiple pass-thru
> >> interfaces or a more complex traffic setup due to things like
> >> SwitchDev.
> >>
> >> In my mind we go the bonding route and there are few use cases for all
> >> of this. First is the backup case that is being addressed here. That
> >> becomes your basic "copy netvsc" approach for this which would be
> >> default. It is how we would handle basic pass-thru back-up paths. If
> >> the host decides to send multicast/broadcast traffic from the host up
> >> through it that is a host side decision. I am okay with our default
> >> transmit behavior from the guest being to send everything through the
> >> pass-thru interface. All the virtio would be doing is advertising that
> >> it is a side channel for some sort of bond with another interface. By
> >> calling it a side channel it implies that it isn't the direct channel
> >> for the interface, but it is an alternative communications channel for
> >> things like backup, and other future uses.
> >>
> >> There end up being a few different "phase 2" concepts I have floating
> >> around which I want to avoid limiting. Solving the east/west problem
> >> is an example. In my mind that just becomes a bonding Tx mode that is
> >> controlled via the pass-thru interface. The VF could black-list the
> >> local addresses so that they instead fall into the virtio interface.
> >> In addition I seem to recall watching a presentation from Mellanox
> >> where they were talking about a VF per NUMA node in a system which
> >> would imply multiple VFs with the same MAC address. I'm not sure how
> >> the current patch handles that. Obviously that would require a more
> >> complex load balancing setup. The bonding solution ends up being a way
> >> to resolve that so that they could just have it take care of picking
> >> the right Tx queue based on the NUMA affinity and fall back to the
> >> virtio/netvsc when those fail.
> > The way I see it, we'd need to pass a bunch of extra information
> > host to guest, and we'd have to use a PV interface for it.
> > When we do this, we'll need to have another
> > feature bit, and we can call it SIDE_CHANNEL or whatever.
> >
> >
> >>>> In the receive path, the broadcasts should only go the PF and reach the VM
> >>>> via vitio so that the VM doesn't see duplicate broadcasts.
> >>>>
> >>>>
> >>>>> To me the east/west scenario looks like you want something
> >>>>> more similar to a bridge on top of the virtio/PT pair.
> >>>>>
> >>>>> So I suspect that use-case will need a separate configuration bit,
> >>>>> and possibly that's when you will want something more powerful
> >>>>> such as a full bridge.
> >>>> east-west optimization of unicasts would be a harder problem to solve as
> >>>> somehow the hypervisor needs to indicate the VM about the local dest. macs
> >>> Using a bridge with a dedicated device for east/west would let
> >>> bridge use standard learning techniques for that perhaps?
> >> I'm not sure about having the guest have to learn.
> > It's certainly a way to do this, but certainly not the only one.
> >
> >> In my mind the VF
> >> should know who is local and who isn't.
> > Right. But note that these things change.
> >
> >> In the case of SwitchDev it
> >> should be possible for the port representors and the switch to provide
> >> data on which interfaces are bonded on the host side and which aren't.
> >> With that data it would be pretty easy to just put together a list of
> >> addresses that would prefer to go the para-virtual route instead of
> >> being transmitted through physical hardware.
> >>
> >> In addition a bridge implies much more overhead since normally a
> >> bridge can receive a packet in on one interface and transmit it on
> >> another. We don't really need that. This is more of a VEPA type setup
> >> and doesn't need to be anything all that complex. You could probably
> >> even handle the Tx queue selection via a simple eBPF program and map
> >> since the input for whatever is used to select Tx should be pretty
> >> simple, destination MAC, source NUMA node, etc, and the data-set
> >> shouldn't be too large.
> > That sounds interesting. A separate device might make this kind of setup
> > a bit easier. Sridhar, did you look into creating a separate device for
> > the virtual bond device at all? It does not have to be in a separate
> > module, that kind of refactoring can come later, but once we commit to
> > using the same single device as virtio, we can't change that.
>
> No. I haven't looked into creating a separate device. If we are going to
> create a new
> device, i guess it has to be of a new device type with its own driver.
>
> As we are using virtio_net to control and manage the VF data path, it is
> not clear to me
> what is the advantage of creating a new device rather than extending
> virtio_net to manage
> the VF datapath via transparent bond mechanism.
>
> Thanks
> Sridhar
>
>
The requirement with Azure accelerated network was that a stock distribution image from the
store must be able to run unmodified and get accelerated networking.
Not sure if other environments need to work the same, but it would be nice.
That meant no additional setup scripts (aka no bonding) and also it must
work transparently with hot-plug. Also there are diverse set of environments:
openstack, cloudinit, network manager and systemd. The solution had to not depend
on any one of them, but also not break any of them.
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Samudrala, Sridhar @ 2018-01-22 23:27 UTC (permalink / raw)
To: Michael S. Tsirkin, Alexander Duyck
Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski, Netdev,
virtualization, achiad shochat, Achiad Shochat, David Miller
In-Reply-To: <20180122231713-mutt-send-email-mst@kernel.org>
On 1/22/2018 1:31 PM, Michael S. Tsirkin wrote:
> On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
>> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
>>> On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
>>>>
>>>> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
>>>>> On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
>>>>>> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
>>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>> This feature bit can be used by hypervisor to indicate virtio_net device to
>>>>>>> act as a backup for another device with the same MAC address.
>>>>>>>
>>>>>>> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
>>>>>>> ---
>>>>>>> drivers/net/virtio_net.c | 2 +-
>>>>>>> include/uapi/linux/virtio_net.h | 3 +++
>>>>>>> 2 files changed, 4 insertions(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
>>>>>>> index 12dfc5fee58e..f149a160a8c5 100644
>>>>>>> --- a/drivers/net/virtio_net.c
>>>>>>> +++ b/drivers/net/virtio_net.c
>>>>>>> @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
>>>>>>> VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
>>>>>>> VIRTIO_NET_F_CTRL_MAC_ADDR, \
>>>>>>> VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
>>>>>>> - VIRTIO_NET_F_SPEED_DUPLEX
>>>>>>> + VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
>>>>>>>
>>>>>>> static unsigned int features[] = {
>>>>>>> VIRTNET_FEATURES,
>>>>>>> diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
>>>>>>> index 5de6ed37695b..c7c35fd1a5ed 100644
>>>>>>> --- a/include/uapi/linux/virtio_net.h
>>>>>>> +++ b/include/uapi/linux/virtio_net.h
>>>>>>> @@ -57,6 +57,9 @@
>>>>>>> * Steering */
>>>>>>> #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
>>>>>>>
>>>>>>> +#define VIRTIO_NET_F_BACKUP 62 /* Act as backup for another device
>>>>>>> + * with the same MAC.
>>>>>>> + */
>>>>>>> #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
>>>>>>>
>>>>>>> #ifndef VIRTIO_NET_NO_LEGACY
>>>>>> I'm not a huge fan of the name "backup" since that implies that the
>>>>>> Virtio interface is only used if the VF is not present, and there are
>>>>>> multiple instances such as dealing with east/west or
>>>>>> broadcast/multicast traffic where it may be desirable to use the
>>>>>> para-virtual interface rather then deal with PCI overhead/bottleneck
>>>>>> to send the packet.
>>>>> Right now hypervisors mostly expect that yes, only one at a time is
>>>>> used. E.g. if you try to do multicast sending packets on both VF and
>>>>> virtio then you will end up with two copies of each packet.
>>>> I think we want to use only 1 interface to send out any packet. In case of
>>>> broadcast/multicasts it would be an optimization to send them via virtio and
>>>> this patch series adds that optimization.
>>> Right that's what I think we should rather avoid for now.
>>>
>>> It's *not* an optimization if there's a single VM on this host,
>>> or if a specific multicast group does not have any VMs on same
>>> host.
>> Agreed. In my mind this is something that is controlled by the
>> pass-thru interface once it is enslaved.
> It would be pretty tricky to control through the PT
> interface since a PT interface pretends to be a physical
> device, which has no concept of VMs.
>
>>> I'd rather we just sent everything out on the PT if that's
>>> there. The reason we have virtio in the picture is just so
>>> we can migrate without downtime.
>> I wasn't saying we do that in all cases. That would be something that
>> would have to be decided by the pass-thru interface. Ideally the
>> virtio would provide just enough information to get itself into the
>> bond and I see this being the mechanism for it to do so. From there
>> the complexity mostly lies in the pass-thru interface to configure the
>> correct transmit modes if for example you have multiple pass-thru
>> interfaces or a more complex traffic setup due to things like
>> SwitchDev.
>>
>> In my mind we go the bonding route and there are few use cases for all
>> of this. First is the backup case that is being addressed here. That
>> becomes your basic "copy netvsc" approach for this which would be
>> default. It is how we would handle basic pass-thru back-up paths. If
>> the host decides to send multicast/broadcast traffic from the host up
>> through it that is a host side decision. I am okay with our default
>> transmit behavior from the guest being to send everything through the
>> pass-thru interface. All the virtio would be doing is advertising that
>> it is a side channel for some sort of bond with another interface. By
>> calling it a side channel it implies that it isn't the direct channel
>> for the interface, but it is an alternative communications channel for
>> things like backup, and other future uses.
>>
>> There end up being a few different "phase 2" concepts I have floating
>> around which I want to avoid limiting. Solving the east/west problem
>> is an example. In my mind that just becomes a bonding Tx mode that is
>> controlled via the pass-thru interface. The VF could black-list the
>> local addresses so that they instead fall into the virtio interface.
>> In addition I seem to recall watching a presentation from Mellanox
>> where they were talking about a VF per NUMA node in a system which
>> would imply multiple VFs with the same MAC address. I'm not sure how
>> the current patch handles that. Obviously that would require a more
>> complex load balancing setup. The bonding solution ends up being a way
>> to resolve that so that they could just have it take care of picking
>> the right Tx queue based on the NUMA affinity and fall back to the
>> virtio/netvsc when those fail.
> The way I see it, we'd need to pass a bunch of extra information
> host to guest, and we'd have to use a PV interface for it.
> When we do this, we'll need to have another
> feature bit, and we can call it SIDE_CHANNEL or whatever.
>
>
>>>> In the receive path, the broadcasts should only go the PF and reach the VM
>>>> via vitio so that the VM doesn't see duplicate broadcasts.
>>>>
>>>>
>>>>> To me the east/west scenario looks like you want something
>>>>> more similar to a bridge on top of the virtio/PT pair.
>>>>>
>>>>> So I suspect that use-case will need a separate configuration bit,
>>>>> and possibly that's when you will want something more powerful
>>>>> such as a full bridge.
>>>> east-west optimization of unicasts would be a harder problem to solve as
>>>> somehow the hypervisor needs to indicate the VM about the local dest. macs
>>> Using a bridge with a dedicated device for east/west would let
>>> bridge use standard learning techniques for that perhaps?
>> I'm not sure about having the guest have to learn.
> It's certainly a way to do this, but certainly not the only one.
>
>> In my mind the VF
>> should know who is local and who isn't.
> Right. But note that these things change.
>
>> In the case of SwitchDev it
>> should be possible for the port representors and the switch to provide
>> data on which interfaces are bonded on the host side and which aren't.
>> With that data it would be pretty easy to just put together a list of
>> addresses that would prefer to go the para-virtual route instead of
>> being transmitted through physical hardware.
>>
>> In addition a bridge implies much more overhead since normally a
>> bridge can receive a packet in on one interface and transmit it on
>> another. We don't really need that. This is more of a VEPA type setup
>> and doesn't need to be anything all that complex. You could probably
>> even handle the Tx queue selection via a simple eBPF program and map
>> since the input for whatever is used to select Tx should be pretty
>> simple, destination MAC, source NUMA node, etc, and the data-set
>> shouldn't be too large.
> That sounds interesting. A separate device might make this kind of setup
> a bit easier. Sridhar, did you look into creating a separate device for
> the virtual bond device at all? It does not have to be in a separate
> module, that kind of refactoring can come later, but once we commit to
> using the same single device as virtio, we can't change that.
No. I haven't looked into creating a separate device. If we are going to
create a new
device, i guess it has to be of a new device type with its own driver.
As we are using virtio_net to control and manage the VF data path, it is
not clear to me
what is the advantage of creating a new device rather than extending
virtio_net to manage
the VF datapath via transparent bond mechanism.
Thanks
Sridhar
^ permalink raw reply
* Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Michael S. Tsirkin @ 2018-01-22 21:41 UTC (permalink / raw)
To: Siwei Liu
Cc: Alexander Duyck, virtio-dev, Jakub Kicinski, Sridhar Samudrala,
virtualization, Netdev, David Miller
In-Reply-To: <CADGSJ22BsdB-hDb0xebSyWrwDuBJaFACjb26PDoqAd2Ah+tsQg@mail.gmail.com>
On Mon, Jan 22, 2018 at 12:27:14PM -0800, Siwei Liu wrote:
> First off, as mentioned in another thread, the model of stacking up
> virt-bond functionality over virtio seems a wrong direction to me.
> Essentially the migration process would need to carry over all guest
> side configurations previously done on the VF/PT and get them moved to
> the new device being it virtio or VF/PT.
I might be wrong but I don't see why we should worry about this usecase.
Whoever has a bond configured already has working config for migration.
We are trying to help people who don't, not convert existig users.
> Without the help of a new
> upper layer bond driver that enslaves virtio and VF/PT devices
> underneath, virtio will be overloaded with too much specifics being a
> VF/PT backup in the future.
So this paragraph already includes at least two conflicting
proposals. On the one hand you want a separate device for
the virtual bond, on the other you are saying a separate
driver.
Further, the reason to have a separate *driver* was that
some people wanted to share code with netvsc - and that
one does not create a separate device, which you can't
change without breaking existing configs.
So some people want a fully userspace-configurable switchdev, and that
already exists at some level, and maybe it makes sense to add more
features for performance.
But the point was that some host configurations are very simple,
and it probably makes sense to pass this information to the guest
and have guest act on it directly. Let's not conflate the two.
--
MST
^ permalink raw reply
* Re: [virtio-dev] [RFC PATCH net-next v2 1/2] virtio_net: Introduce VIRTIO_NET_F_BACKUP feature bit
From: Michael S. Tsirkin @ 2018-01-22 21:31 UTC (permalink / raw)
To: Alexander Duyck
Cc: Duyck, Alexander H, virtio-dev, Jakub Kicinski,
Samudrala, Sridhar, virtualization, achiad shochat,
Achiad Shochat, Netdev, David Miller
In-Reply-To: <CAKgT0UeyNvVQc11KXc3updJfa9p7a9NcfRC=gP6=ktkjrSkOag@mail.gmail.com>
On Wed, Jan 17, 2018 at 01:49:58PM -0800, Alexander Duyck wrote:
> On Wed, Jan 17, 2018 at 11:57 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > On Wed, Jan 17, 2018 at 11:25:41AM -0800, Samudrala, Sridhar wrote:
> >>
> >>
> >> On 1/17/2018 11:02 AM, Michael S. Tsirkin wrote:
> >> > On Wed, Jan 17, 2018 at 10:15:52AM -0800, Alexander Duyck wrote:
> >> > > On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> >> > > <sridhar.samudrala@intel.com> wrote:
> >> > > > This feature bit can be used by hypervisor to indicate virtio_net device to
> >> > > > act as a backup for another device with the same MAC address.
> >> > > >
> >> > > > Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> >> > > > ---
> >> > > > drivers/net/virtio_net.c | 2 +-
> >> > > > include/uapi/linux/virtio_net.h | 3 +++
> >> > > > 2 files changed, 4 insertions(+), 1 deletion(-)
> >> > > >
> >> > > > diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> >> > > > index 12dfc5fee58e..f149a160a8c5 100644
> >> > > > --- a/drivers/net/virtio_net.c
> >> > > > +++ b/drivers/net/virtio_net.c
> >> > > > @@ -2829,7 +2829,7 @@ static struct virtio_device_id id_table[] = {
> >> > > > VIRTIO_NET_F_GUEST_ANNOUNCE, VIRTIO_NET_F_MQ, \
> >> > > > VIRTIO_NET_F_CTRL_MAC_ADDR, \
> >> > > > VIRTIO_NET_F_MTU, VIRTIO_NET_F_CTRL_GUEST_OFFLOADS, \
> >> > > > - VIRTIO_NET_F_SPEED_DUPLEX
> >> > > > + VIRTIO_NET_F_SPEED_DUPLEX, VIRTIO_NET_F_BACKUP
> >> > > >
> >> > > > static unsigned int features[] = {
> >> > > > VIRTNET_FEATURES,
> >> > > > diff --git a/include/uapi/linux/virtio_net.h b/include/uapi/linux/virtio_net.h
> >> > > > index 5de6ed37695b..c7c35fd1a5ed 100644
> >> > > > --- a/include/uapi/linux/virtio_net.h
> >> > > > +++ b/include/uapi/linux/virtio_net.h
> >> > > > @@ -57,6 +57,9 @@
> >> > > > * Steering */
> >> > > > #define VIRTIO_NET_F_CTRL_MAC_ADDR 23 /* Set MAC address */
> >> > > >
> >> > > > +#define VIRTIO_NET_F_BACKUP 62 /* Act as backup for another device
> >> > > > + * with the same MAC.
> >> > > > + */
> >> > > > #define VIRTIO_NET_F_SPEED_DUPLEX 63 /* Device set linkspeed and duplex */
> >> > > >
> >> > > > #ifndef VIRTIO_NET_NO_LEGACY
> >> > > I'm not a huge fan of the name "backup" since that implies that the
> >> > > Virtio interface is only used if the VF is not present, and there are
> >> > > multiple instances such as dealing with east/west or
> >> > > broadcast/multicast traffic where it may be desirable to use the
> >> > > para-virtual interface rather then deal with PCI overhead/bottleneck
> >> > > to send the packet.
> >> > Right now hypervisors mostly expect that yes, only one at a time is
> >> > used. E.g. if you try to do multicast sending packets on both VF and
> >> > virtio then you will end up with two copies of each packet.
> >>
> >> I think we want to use only 1 interface to send out any packet. In case of
> >> broadcast/multicasts it would be an optimization to send them via virtio and
> >> this patch series adds that optimization.
> >
> > Right that's what I think we should rather avoid for now.
> >
> > It's *not* an optimization if there's a single VM on this host,
> > or if a specific multicast group does not have any VMs on same
> > host.
>
> Agreed. In my mind this is something that is controlled by the
> pass-thru interface once it is enslaved.
It would be pretty tricky to control through the PT
interface since a PT interface pretends to be a physical
device, which has no concept of VMs.
> > I'd rather we just sent everything out on the PT if that's
> > there. The reason we have virtio in the picture is just so
> > we can migrate without downtime.
>
> I wasn't saying we do that in all cases. That would be something that
> would have to be decided by the pass-thru interface. Ideally the
> virtio would provide just enough information to get itself into the
> bond and I see this being the mechanism for it to do so. From there
> the complexity mostly lies in the pass-thru interface to configure the
> correct transmit modes if for example you have multiple pass-thru
> interfaces or a more complex traffic setup due to things like
> SwitchDev.
>
> In my mind we go the bonding route and there are few use cases for all
> of this. First is the backup case that is being addressed here. That
> becomes your basic "copy netvsc" approach for this which would be
> default. It is how we would handle basic pass-thru back-up paths. If
> the host decides to send multicast/broadcast traffic from the host up
> through it that is a host side decision. I am okay with our default
> transmit behavior from the guest being to send everything through the
> pass-thru interface. All the virtio would be doing is advertising that
> it is a side channel for some sort of bond with another interface. By
> calling it a side channel it implies that it isn't the direct channel
> for the interface, but it is an alternative communications channel for
> things like backup, and other future uses.
>
> There end up being a few different "phase 2" concepts I have floating
> around which I want to avoid limiting. Solving the east/west problem
> is an example. In my mind that just becomes a bonding Tx mode that is
> controlled via the pass-thru interface. The VF could black-list the
> local addresses so that they instead fall into the virtio interface.
> In addition I seem to recall watching a presentation from Mellanox
> where they were talking about a VF per NUMA node in a system which
> would imply multiple VFs with the same MAC address. I'm not sure how
> the current patch handles that. Obviously that would require a more
> complex load balancing setup. The bonding solution ends up being a way
> to resolve that so that they could just have it take care of picking
> the right Tx queue based on the NUMA affinity and fall back to the
> virtio/netvsc when those fail.
The way I see it, we'd need to pass a bunch of extra information
host to guest, and we'd have to use a PV interface for it.
When we do this, we'll need to have another
feature bit, and we can call it SIDE_CHANNEL or whatever.
> >> In the receive path, the broadcasts should only go the PF and reach the VM
> >> via vitio so that the VM doesn't see duplicate broadcasts.
> >>
> >>
> >> >
> >> > To me the east/west scenario looks like you want something
> >> > more similar to a bridge on top of the virtio/PT pair.
> >> >
> >> > So I suspect that use-case will need a separate configuration bit,
> >> > and possibly that's when you will want something more powerful
> >> > such as a full bridge.
> >>
> >> east-west optimization of unicasts would be a harder problem to solve as
> >> somehow the hypervisor needs to indicate the VM about the local dest. macs
> >
> > Using a bridge with a dedicated device for east/west would let
> > bridge use standard learning techniques for that perhaps?
>
> I'm not sure about having the guest have to learn.
It's certainly a way to do this, but certainly not the only one.
> In my mind the VF
> should know who is local and who isn't.
Right. But note that these things change.
> In the case of SwitchDev it
> should be possible for the port representors and the switch to provide
> data on which interfaces are bonded on the host side and which aren't.
> With that data it would be pretty easy to just put together a list of
> addresses that would prefer to go the para-virtual route instead of
> being transmitted through physical hardware.
>
> In addition a bridge implies much more overhead since normally a
> bridge can receive a packet in on one interface and transmit it on
> another. We don't really need that. This is more of a VEPA type setup
> and doesn't need to be anything all that complex. You could probably
> even handle the Tx queue selection via a simple eBPF program and map
> since the input for whatever is used to select Tx should be pretty
> simple, destination MAC, source NUMA node, etc, and the data-set
> shouldn't be too large.
That sounds interesting. A separate device might make this kind of setup
a bit easier. Sridhar, did you look into creating a separate device for
the virtual bond device at all? It does not have to be in a separate
module, that kind of refactoring can come later, but once we commit to
using the same single device as virtio, we can't change that.
> >> > > What if instead of BACKUP we used the name SIDE_CHANNEL? Basically it
> >> > > is a bit of double entendre as we are using the physical MAC address
> >> > > to provide configuration information, and then in addition this
> >> > > interface acts as a secondary channel for passing frames to and from
> >> > > the guest rather than just using the VF.
> >> > >
> >> > > Just a thought.
> >> > >
> >> > > Thanks.
> >> > >
> >> > > - Alex
> >> > I just feel that's a very generic name, not conveying enough information
> >> > about how they hypervisor expects the pair of devices to be used.
>
> I would argue that BACKUP is too specific. I can see many other uses
> other than just being a backup interface
True but the ones you listed above all seem to require virtio
changes anyway, we'll be able to add a new feature or
rename this one as we make them.
> and I don't want us to box
> ourselves in. I agree that it makes sense for active/backup to be the
> base mode, but I really don't think it conveys all of the
> possibilities since I see this as possibly being able to solve
> multiple issues as this evolves. I'm also open to other suggestions.
> We could go with PASSTHRU_SIDE_CHANNEL for instance, I just didn't
> suggest that since it seemed kind of wordy.
There's only so much info a single bit can confer :)
So we can always rename later, the point is to draw the line
and say "this is the functionality current hosts expect".
--
MST
^ permalink raw reply
* Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Samudrala, Sridhar @ 2018-01-22 21:05 UTC (permalink / raw)
To: Siwei Liu
Cc: Alexander Duyck, virtio-dev, Michael S. Tsirkin, Jakub Kicinski,
Netdev, virtualization, David Miller
In-Reply-To: <CADGSJ22BsdB-hDb0xebSyWrwDuBJaFACjb26PDoqAd2Ah+tsQg@mail.gmail.com>
On 1/22/2018 12:27 PM, Siwei Liu wrote:
> First off, as mentioned in another thread, the model of stacking up
> virt-bond functionality over virtio seems a wrong direction to me.
> Essentially the migration process would need to carry over all guest
> side configurations previously done on the VF/PT and get them moved to
> the new device being it virtio or VF/PT. Without the help of a new
> upper layer bond driver that enslaves virtio and VF/PT devices
> underneath, virtio will be overloaded with too much specifics being a
> VF/PT backup in the future. I hope you're already aware of the issue
> in longer term and move to that model as soon as possible. See more
> inline.
The idea behind this design is to provide a low latency datapath to
virtio_net while
preserving live migration feature without the need for the guest admin
to configure
a bond between VF and virtio_net.
As this feature is enabled and configured via virtio_net which has a
back channel to
the hypervisor, adding this functionality to virtio_net looks like a
reasonable option.
Adding a new driver and a new device requires defining a new interface
and a channel
between the hypervisor and the VM and if required we may implement that in
future.
>
> On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
> <sridhar.samudrala@intel.com> wrote:
>> This patch enables virtio_net to switch over to a VF datapath when a VF
>> netdev is present with the same MAC address. The VF datapath is only used
>> for unicast traffic. Broadcasts/multicasts go via virtio datapath so that
>> east-west broadcasts don't use the PCI bandwidth.
> Why not making an this an option/optimization rather than being the
> only means? The problem of east-west broadcast eating PCI bandwidth
> depends on specifics of the (virtual) network setup, while some users
> won't want to lose VF's merits such as latency. Why restricting
> broadcast/multicast xmit to virtio only which potentially regresses
> the performance against raw VF?
I am planning to remove this option when i resubmit the patches.
>
>> It allows live migration
>> of a VM with a direct attached VF without the need to setup a bond/team
>> between a VF and virtio net device in the guest.
>>
>> The hypervisor needs to unplug the VF device from the guest on the source
>> host and reset the MAC filter of the VF to initiate failover of datapath to
>> virtio before starting the migration. After the migration is completed, the
>> destination hypervisor sets the MAC filter on the VF and plugs it back to
>> the guest to switch over to VF datapath.
> Is there a host side patch (planned) for this MAC filter switching
> process? As said in another thread, that simple script won't work for
> macvtap backend.
The host side patch to enable qemu to configure this feature is included
in this patch
series.
I have been testing this feature using a shell script, but i hope
someone in the libvirt
community will extend 'virsh' to handle live migration when this
feature is supported.
Thanks
Sridhar
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [RFC PATCH net-next v2 2/2] virtio_net: Extend virtio to use VF datapath when available
From: Siwei Liu @ 2018-01-22 20:27 UTC (permalink / raw)
To: Sridhar Samudrala
Cc: Alexander Duyck, virtio-dev, Michael S. Tsirkin, Jakub Kicinski,
Netdev, virtualization, David Miller
In-Reply-To: <1515736720-39368-3-git-send-email-sridhar.samudrala@intel.com>
First off, as mentioned in another thread, the model of stacking up
virt-bond functionality over virtio seems a wrong direction to me.
Essentially the migration process would need to carry over all guest
side configurations previously done on the VF/PT and get them moved to
the new device being it virtio or VF/PT. Without the help of a new
upper layer bond driver that enslaves virtio and VF/PT devices
underneath, virtio will be overloaded with too much specifics being a
VF/PT backup in the future. I hope you're already aware of the issue
in longer term and move to that model as soon as possible. See more
inline.
On Thu, Jan 11, 2018 at 9:58 PM, Sridhar Samudrala
<sridhar.samudrala@intel.com> wrote:
> This patch enables virtio_net to switch over to a VF datapath when a VF
> netdev is present with the same MAC address. The VF datapath is only used
> for unicast traffic. Broadcasts/multicasts go via virtio datapath so that
> east-west broadcasts don't use the PCI bandwidth.
Why not making an this an option/optimization rather than being the
only means? The problem of east-west broadcast eating PCI bandwidth
depends on specifics of the (virtual) network setup, while some users
won't want to lose VF's merits such as latency. Why restricting
broadcast/multicast xmit to virtio only which potentially regresses
the performance against raw VF?
> It allows live migration
> of a VM with a direct attached VF without the need to setup a bond/team
> between a VF and virtio net device in the guest.
>
> The hypervisor needs to unplug the VF device from the guest on the source
> host and reset the MAC filter of the VF to initiate failover of datapath to
> virtio before starting the migration. After the migration is completed, the
> destination hypervisor sets the MAC filter on the VF and plugs it back to
> the guest to switch over to VF datapath.
Is there a host side patch (planned) for this MAC filter switching
process? As said in another thread, that simple script won't work for
macvtap backend.
Thanks,
-Siwei
>
> This patch is based on the discussion initiated by Jesse on this thread.
> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>
> Signed-off-by: Sridhar Samudrala <sridhar.samudrala@intel.com>
> Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
> ---
> drivers/net/virtio_net.c | 307 ++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 305 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/virtio_net.c b/drivers/net/virtio_net.c
> index f149a160a8c5..0e58d364fde9 100644
> --- a/drivers/net/virtio_net.c
> +++ b/drivers/net/virtio_net.c
> @@ -30,6 +30,8 @@
> #include <linux/cpu.h>
> #include <linux/average.h>
> #include <linux/filter.h>
> +#include <linux/netdevice.h>
> +#include <linux/netpoll.h>
> #include <net/route.h>
> #include <net/xdp.h>
>
> @@ -120,6 +122,15 @@ struct receive_queue {
> struct xdp_rxq_info xdp_rxq;
> };
>
> +struct virtnet_vf_pcpu_stats {
> + u64 rx_packets;
> + u64 rx_bytes;
> + u64 tx_packets;
> + u64 tx_bytes;
> + struct u64_stats_sync syncp;
> + u32 tx_dropped;
> +};
> +
> struct virtnet_info {
> struct virtio_device *vdev;
> struct virtqueue *cvq;
> @@ -182,6 +193,10 @@ struct virtnet_info {
> u32 speed;
>
> unsigned long guest_offloads;
> +
> + /* State to manage the associated VF interface. */
> + struct net_device __rcu *vf_netdev;
> + struct virtnet_vf_pcpu_stats __percpu *vf_stats;
> };
>
> struct padded_vnet_hdr {
> @@ -1314,16 +1329,53 @@ static int xmit_skb(struct send_queue *sq, struct sk_buff *skb)
> return virtqueue_add_outbuf(sq->vq, sq->sg, num_sg, skb, GFP_ATOMIC);
> }
>
> +/* Send skb on the slave VF device. */
> +static int virtnet_vf_xmit(struct net_device *dev, struct net_device *vf_netdev,
> + struct sk_buff *skb)
> +{
> + struct virtnet_info *vi = netdev_priv(dev);
> + unsigned int len = skb->len;
> + int rc;
> +
> + skb->dev = vf_netdev;
> + skb->queue_mapping = qdisc_skb_cb(skb)->slave_dev_queue_mapping;
> +
> + rc = dev_queue_xmit(skb);
> + if (likely(rc == NET_XMIT_SUCCESS || rc == NET_XMIT_CN)) {
> + struct virtnet_vf_pcpu_stats *pcpu_stats
> + = this_cpu_ptr(vi->vf_stats);
> +
> + u64_stats_update_begin(&pcpu_stats->syncp);
> + pcpu_stats->tx_packets++;
> + pcpu_stats->tx_bytes += len;
> + u64_stats_update_end(&pcpu_stats->syncp);
> + } else {
> + this_cpu_inc(vi->vf_stats->tx_dropped);
> + }
> +
> + return rc;
> +}
> +
> static netdev_tx_t start_xmit(struct sk_buff *skb, struct net_device *dev)
> {
> struct virtnet_info *vi = netdev_priv(dev);
> int qnum = skb_get_queue_mapping(skb);
> struct send_queue *sq = &vi->sq[qnum];
> + struct net_device *vf_netdev;
> int err;
> struct netdev_queue *txq = netdev_get_tx_queue(dev, qnum);
> bool kick = !skb->xmit_more;
> bool use_napi = sq->napi.weight;
>
> + /* If VF is present and up then redirect packets
> + * called with rcu_read_lock_bh
> + */
> + vf_netdev = rcu_dereference_bh(vi->vf_netdev);
> + if (vf_netdev && netif_running(vf_netdev) &&
> + !netpoll_tx_running(dev) &&
> + is_unicast_ether_addr(eth_hdr(skb)->h_dest))
> + return virtnet_vf_xmit(dev, vf_netdev, skb);
> +
> /* Free up any pending old buffers before queueing new ones. */
> free_old_xmit_skbs(sq);
>
> @@ -1470,10 +1522,41 @@ static int virtnet_set_mac_address(struct net_device *dev, void *p)
> return ret;
> }
>
> +static void virtnet_get_vf_stats(struct net_device *dev,
> + struct virtnet_vf_pcpu_stats *tot)
> +{
> + struct virtnet_info *vi = netdev_priv(dev);
> + int i;
> +
> + memset(tot, 0, sizeof(*tot));
> +
> + for_each_possible_cpu(i) {
> + const struct virtnet_vf_pcpu_stats *stats
> + = per_cpu_ptr(vi->vf_stats, i);
> + u64 rx_packets, rx_bytes, tx_packets, tx_bytes;
> + unsigned int start;
> +
> + do {
> + start = u64_stats_fetch_begin_irq(&stats->syncp);
> + rx_packets = stats->rx_packets;
> + tx_packets = stats->tx_packets;
> + rx_bytes = stats->rx_bytes;
> + tx_bytes = stats->tx_bytes;
> + } while (u64_stats_fetch_retry_irq(&stats->syncp, start));
> +
> + tot->rx_packets += rx_packets;
> + tot->tx_packets += tx_packets;
> + tot->rx_bytes += rx_bytes;
> + tot->tx_bytes += tx_bytes;
> + tot->tx_dropped += stats->tx_dropped;
> + }
> +}
> +
> static void virtnet_stats(struct net_device *dev,
> struct rtnl_link_stats64 *tot)
> {
> struct virtnet_info *vi = netdev_priv(dev);
> + struct virtnet_vf_pcpu_stats vf_stats;
> int cpu;
> unsigned int start;
>
> @@ -1504,6 +1587,13 @@ static void virtnet_stats(struct net_device *dev,
> tot->rx_dropped = dev->stats.rx_dropped;
> tot->rx_length_errors = dev->stats.rx_length_errors;
> tot->rx_frame_errors = dev->stats.rx_frame_errors;
> +
> + virtnet_get_vf_stats(dev, &vf_stats);
> + tot->rx_packets += vf_stats.rx_packets;
> + tot->tx_packets += vf_stats.tx_packets;
> + tot->rx_bytes += vf_stats.rx_bytes;
> + tot->tx_bytes += vf_stats.tx_bytes;
> + tot->tx_dropped += vf_stats.tx_dropped;
> }
>
> #ifdef CONFIG_NET_POLL_CONTROLLER
> @@ -2635,6 +2725,13 @@ static int virtnet_probe(struct virtio_device *vdev)
>
> INIT_WORK(&vi->config_work, virtnet_config_changed_work);
>
> + if (virtio_has_feature(vdev, VIRTIO_NET_F_BACKUP)) {
> + vi->vf_stats =
> + netdev_alloc_pcpu_stats(struct virtnet_vf_pcpu_stats);
> + if (!vi->vf_stats)
> + goto free_stats;
> + }
> +
> /* If we can receive ANY GSO packets, we must allocate large ones. */
> if (virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO4) ||
> virtio_has_feature(vdev, VIRTIO_NET_F_GUEST_TSO6) ||
> @@ -2668,7 +2765,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> */
> dev_err(&vdev->dev, "device MTU appears to have changed "
> "it is now %d < %d", mtu, dev->min_mtu);
> - goto free_stats;
> + goto free_vf_stats;
> }
>
> dev->mtu = mtu;
> @@ -2692,7 +2789,7 @@ static int virtnet_probe(struct virtio_device *vdev)
> /* Allocate/initialize the rx/tx queues, and invoke find_vqs */
> err = init_vqs(vi);
> if (err)
> - goto free_stats;
> + goto free_vf_stats;
>
> #ifdef CONFIG_SYSFS
> if (vi->mergeable_rx_bufs)
> @@ -2747,6 +2844,8 @@ static int virtnet_probe(struct virtio_device *vdev)
> cancel_delayed_work_sync(&vi->refill);
> free_receive_page_frags(vi);
> virtnet_del_vqs(vi);
> +free_vf_stats:
> + free_percpu(vi->vf_stats);
> free_stats:
> free_percpu(vi->stats);
> free:
> @@ -2768,19 +2867,184 @@ static void remove_vq_common(struct virtnet_info *vi)
> virtnet_del_vqs(vi);
> }
>
> +static struct net_device *get_virtio_bymac(const u8 *mac)
> +{
> + struct net_device *dev;
> +
> + ASSERT_RTNL();
> +
> + for_each_netdev(&init_net, dev) {
> + if (dev->netdev_ops != &virtnet_netdev)
> + continue; /* not a virtio_net device */
> +
> + if (ether_addr_equal(mac, dev->perm_addr))
> + return dev;
> + }
> +
> + return NULL;
> +}
> +
> +static struct net_device *get_virtio_byref(struct net_device *vf_netdev)
> +{
> + struct net_device *dev;
> +
> + ASSERT_RTNL();
> +
> + for_each_netdev(&init_net, dev) {
> + struct virtnet_info *vi;
> +
> + if (dev->netdev_ops != &virtnet_netdev)
> + continue; /* not a virtio_net device */
> +
> + vi = netdev_priv(dev);
> + if (rtnl_dereference(vi->vf_netdev) == vf_netdev)
> + return dev; /* a match */
> + }
> +
> + return NULL;
> +}
> +
> +/* Called when VF is injecting data into network stack.
> + * Change the associated network device from VF to virtio.
> + * note: already called with rcu_read_lock
> + */
> +static rx_handler_result_t virtnet_vf_handle_frame(struct sk_buff **pskb)
> +{
> + struct sk_buff *skb = *pskb;
> + struct net_device *ndev = rcu_dereference(skb->dev->rx_handler_data);
> + struct virtnet_info *vi = netdev_priv(ndev);
> + struct virtnet_vf_pcpu_stats *pcpu_stats =
> + this_cpu_ptr(vi->vf_stats);
> +
> + skb->dev = ndev;
> +
> + u64_stats_update_begin(&pcpu_stats->syncp);
> + pcpu_stats->rx_packets++;
> + pcpu_stats->rx_bytes += skb->len;
> + u64_stats_update_end(&pcpu_stats->syncp);
> +
> + return RX_HANDLER_ANOTHER;
> +}
> +
> +static int virtnet_vf_join(struct net_device *vf_netdev,
> + struct net_device *ndev)
> +{
> + int ret;
> +
> + ret = netdev_rx_handler_register(vf_netdev,
> + virtnet_vf_handle_frame, ndev);
> + if (ret != 0) {
> + netdev_err(vf_netdev,
> + "can not register virtio VF receive handler (err = %d)\n",
> + ret);
> + goto rx_handler_failed;
> + }
> +
> + ret = netdev_upper_dev_link(vf_netdev, ndev, NULL);
> + if (ret != 0) {
> + netdev_err(vf_netdev,
> + "can not set master device %s (err = %d)\n",
> + ndev->name, ret);
> + goto upper_link_failed;
> + }
> +
> + vf_netdev->flags |= IFF_SLAVE;
> +
> + /* Align MTU of VF with master */
> + ret = dev_set_mtu(vf_netdev, ndev->mtu);
> + if (ret)
> + netdev_warn(vf_netdev,
> + "unable to change mtu to %u\n", ndev->mtu);
> +
> + call_netdevice_notifiers(NETDEV_JOIN, vf_netdev);
> +
> + netdev_info(vf_netdev, "joined to %s\n", ndev->name);
> + return 0;
> +
> +upper_link_failed:
> + netdev_rx_handler_unregister(vf_netdev);
> +rx_handler_failed:
> + return ret;
> +}
> +
> +static int virtnet_register_vf(struct net_device *vf_netdev)
> +{
> + struct net_device *ndev;
> + struct virtnet_info *vi;
> +
> + if (vf_netdev->addr_len != ETH_ALEN)
> + return NOTIFY_DONE;
> +
> + /* We will use the MAC address to locate the virtio_net interface to
> + * associate with the VF interface. If we don't find a matching
> + * virtio interface, move on.
> + */
> + ndev = get_virtio_bymac(vf_netdev->perm_addr);
> + if (!ndev)
> + return NOTIFY_DONE;
> +
> + vi = netdev_priv(ndev);
> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> + return NOTIFY_DONE;
> +
> + if (rtnl_dereference(vi->vf_netdev))
> + return NOTIFY_DONE;
> +
> + if (virtnet_vf_join(vf_netdev, ndev) != 0)
> + return NOTIFY_DONE;
> +
> + netdev_info(ndev, "VF registering %s\n", vf_netdev->name);
> +
> + dev_hold(vf_netdev);
> + rcu_assign_pointer(vi->vf_netdev, vf_netdev);
> +
> + return NOTIFY_OK;
> +}
> +
> +static int virtnet_unregister_vf(struct net_device *vf_netdev)
> +{
> + struct net_device *ndev;
> + struct virtnet_info *vi;
> +
> + ndev = get_virtio_byref(vf_netdev);
> + if (!ndev)
> + return NOTIFY_DONE;
> +
> + vi = netdev_priv(ndev);
> + if (!virtio_has_feature(vi->vdev, VIRTIO_NET_F_BACKUP))
> + return NOTIFY_DONE;
> +
> + netdev_info(ndev, "VF unregistering %s\n", vf_netdev->name);
> +
> + netdev_rx_handler_unregister(vf_netdev);
> + netdev_upper_dev_unlink(vf_netdev, ndev);
> + RCU_INIT_POINTER(vi->vf_netdev, NULL);
> + dev_put(vf_netdev);
> +
> + return NOTIFY_OK;
> +}
> +
> static void virtnet_remove(struct virtio_device *vdev)
> {
> struct virtnet_info *vi = vdev->priv;
> + struct net_device *vf_netdev;
>
> virtnet_cpu_notif_remove(vi);
>
> /* Make sure no work handler is accessing the device. */
> flush_work(&vi->config_work);
>
> + rtnl_lock();
> + vf_netdev = rtnl_dereference(vi->vf_netdev);
> + if (vf_netdev)
> + virtnet_unregister_vf(vf_netdev);
> + rtnl_unlock();
> +
> unregister_netdev(vi->dev);
>
> remove_vq_common(vi);
>
> + free_percpu(vi->vf_stats);
> free_percpu(vi->stats);
> free_netdev(vi->dev);
> }
> @@ -2859,6 +3123,42 @@ static struct virtio_driver virtio_net_driver = {
> #endif
> };
>
> +static int virtio_netdev_event(struct notifier_block *this,
> + unsigned long event, void *ptr)
> +{
> + struct net_device *event_dev = netdev_notifier_info_to_dev(ptr);
> +
> + /* Skip our own events */
> + if (event_dev->netdev_ops == &virtnet_netdev)
> + return NOTIFY_DONE;
> +
> + /* Avoid non-Ethernet type devices */
> + if (event_dev->type != ARPHRD_ETHER)
> + return NOTIFY_DONE;
> +
> + /* Avoid Vlan dev with same MAC registering as VF */
> + if (is_vlan_dev(event_dev))
> + return NOTIFY_DONE;
> +
> + /* Avoid Bonding master dev with same MAC registering as VF */
> + if ((event_dev->priv_flags & IFF_BONDING) &&
> + (event_dev->flags & IFF_MASTER))
> + return NOTIFY_DONE;
> +
> + switch (event) {
> + case NETDEV_REGISTER:
> + return virtnet_register_vf(event_dev);
> + case NETDEV_UNREGISTER:
> + return virtnet_unregister_vf(event_dev);
> + default:
> + return NOTIFY_DONE;
> + }
> +}
> +
> +static struct notifier_block virtio_netdev_notifier = {
> + .notifier_call = virtio_netdev_event,
> +};
> +
> static __init int virtio_net_driver_init(void)
> {
> int ret;
> @@ -2877,6 +3177,8 @@ static __init int virtio_net_driver_init(void)
> ret = register_virtio_driver(&virtio_net_driver);
> if (ret)
> goto err_virtio;
> +
> + register_netdevice_notifier(&virtio_netdev_notifier);
> return 0;
> err_virtio:
> cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> @@ -2889,6 +3191,7 @@ module_init(virtio_net_driver_init);
>
> static __exit void virtio_net_driver_exit(void)
> {
> + unregister_netdevice_notifier(&virtio_netdev_notifier);
> unregister_virtio_driver(&virtio_net_driver);
> cpuhp_remove_multi_state(CPUHP_VIRT_NET_DEAD);
> cpuhp_remove_multi_state(virtionet_online);
> --
> 2.14.3
>
^ permalink raw reply
* Re: [PATCH v2 net-next] virtio_net: Add ethtool stats
From: David Miller @ 2018-01-22 20:25 UTC (permalink / raw)
To: makita.toshiaki; +Cc: f.fainelli, mst, netdev, virtualization
In-Reply-To: <1516171105-2483-1-git-send-email-makita.toshiaki@lab.ntt.co.jp>
From: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
Date: Wed, 17 Jan 2018 15:38:25 +0900
> The main purpose of this patch is adding a way of checking per-queue stats.
> It's useful to debug performance problems on multiqueue environment.
>
> $ ethtool -S ens10
> NIC statistics:
> rx_queue_0_packets: 2090408
> rx_queue_0_bytes: 3164825094
> rx_queue_1_packets: 2082531
> rx_queue_1_bytes: 3152932314
> tx_queue_0_packets: 2770841
> tx_queue_0_bytes: 4194955474
> tx_queue_1_packets: 3084697
> tx_queue_1_bytes: 4670196372
>
> This change converts existing per-cpu stats structure into per-queue one.
> This should not impact on performance since each queue counter is not
> updated concurrently by multiple cpus.
>
> Performance numbers:
> - Guest has 2 vcpus and 2 queues
> - Guest runs netserver
> - Host runs 100-flow super_netperf
>
> Before After Diff
> UDP_STREAM 18byte 86.22 87.00 +0.90%
> UDP_STREAM 1472byte 4055.27 4042.18 -0.32%
> TCP_STREAM 16956.32 16890.63 -0.39%
> UDP_RR 178667.11 185862.70 +4.03%
> TCP_RR 128473.04 124985.81 -2.71%
>
> Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp>
> ---
> v2:
> - Removed redundant counters which can be obtained from dev_get_stats.
> - Made queue counter structure different for tx and rx so they can be
> easily extended separately, as some additional counters are expected
> like XDP related ones and VM-Exit event.
> - Added performance numbers in commitlog.
Applied, thanks.
^ permalink raw reply
* Re: [PATCH net-next 0/2] Enable virtio to act as a master for a passthru device
From: Siwei Liu @ 2018-01-22 19:00 UTC (permalink / raw)
To: Samudrala, Sridhar
Cc: Alexander Duyck, virtio-dev, Michael S. Tsirkin, Jakub Kicinski,
Netdev, virtualization
In-Reply-To: <CADGSJ21rDfqzsnd+m5f6w=LqoTY3Bb0p1Pe+_tppdEm_8HysKA@mail.gmail.com>
Apologies I didn't notice that the discussion was mistakenly taken
offline. Post it back.
-Siwei
On Sat, Jan 13, 2018 at 7:25 AM, Siwei Liu <loseweigh@gmail.com> wrote:
> On Thu, Jan 11, 2018 at 12:32 PM, Samudrala, Sridhar
> <sridhar.samudrala@intel.com> wrote:
>> On 1/8/2018 9:22 AM, Siwei Liu wrote:
>>>
>>> On Sat, Jan 6, 2018 at 2:33 AM, Samudrala, Sridhar
>>> <sridhar.samudrala@intel.com> wrote:
>>>>
>>>> On 1/5/2018 9:07 AM, Siwei Liu wrote:
>>>>>
>>>>> On Thu, Jan 4, 2018 at 8:22 AM, Samudrala, Sridhar
>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>
>>>>>> On 1/3/2018 10:28 AM, Alexander Duyck wrote:
>>>>>>>
>>>>>>> On Wed, Jan 3, 2018 at 10:14 AM, Samudrala, Sridhar
>>>>>>> <sridhar.samudrala@intel.com> wrote:
>>>>>>>>
>>>>>>>>
>>>>>>>> On 1/3/2018 8:59 AM, Alexander Duyck wrote:
>>>>>>>>>
>>>>>>>>> On Tue, Jan 2, 2018 at 6:16 PM, Jakub Kicinski <kubakici@wp.pl>
>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>> On Tue, 2 Jan 2018 16:35:36 -0800, Sridhar Samudrala wrote:
>>>>>>>>>>>
>>>>>>>>>>> This patch series enables virtio to switch over to a VF datapath
>>>>>>>>>>> when
>>>>>>>>>>> a
>>>>>>>>>>> VF
>>>>>>>>>>> netdev is present with the same MAC address. It allows live
>>>>>>>>>>> migration
>>>>>>>>>>> of
>>>>>>>>>>> a VM
>>>>>>>>>>> with a direct attached VF without the need to setup a bond/team
>>>>>>>>>>> between
>>>>>>>>>>> a
>>>>>>>>>>> VF and virtio net device in the guest.
>>>>>>>>>>>
>>>>>>>>>>> The hypervisor needs to unplug the VF device from the guest on the
>>>>>>>>>>> source
>>>>>>>>>>> host and reset the MAC filter of the VF to initiate failover of
>>>>>>>>>>> datapath
>>>>>>>>>>> to
>>>>>>>>>>> virtio before starting the migration. After the migration is
>>>>>>>>>>> completed,
>>>>>>>>>>> the
>>>>>>>>>>> destination hypervisor sets the MAC filter on the VF and plugs it
>>>>>>>>>>> back
>>>>>>>>>>> to
>>>>>>>>>>> the guest to switch over to VF datapath.
>>>>>>>>>>>
>>>>>>>>>>> It is based on netvsc implementation and it may be possible to
>>>>>>>>>>> make
>>>>>>>>>>> this
>>>>>>>>>>> code
>>>>>>>>>>> generic and move it to a common location that can be shared by
>>>>>>>>>>> netvsc
>>>>>>>>>>> and virtio.
>>>>>>>>>>>
>>>>>>>>>>> This patch series is based on the discussion initiated by Jesse on
>>>>>>>>>>> this
>>>>>>>>>>> thread.
>>>>>>>>>>> https://marc.info/?l=linux-virtualization&m=151189725224231&w=2
>>>>>>>>>>
>>>>>>>>>> How does the notion of a device which is both a bond and a leg of a
>>>>>>>>>> bond fit with Alex's recent discussions about feature propagation?
>>>>>>>>>> Which propagation rules will apply to VirtIO master? Meaning of
>>>>>>>>>> the
>>>>>>>>>> flags on a software upper device may be different. Why muddy the
>>>>>>>>>> architecture like this and not introduce a synthetic bond device?
>>>>>>>>>
>>>>>>>>> It doesn't really fit with the notion I had. I think there may have
>>>>>>>>> been a bit of a disconnect as I have been out for the last week or
>>>>>>>>> so
>>>>>>>>> for the holidays.
>>>>>>>>>
>>>>>>>>> My thought on this was that the feature bit should be spawning a new
>>>>>>>>> para-virtual bond device and that bond should have the virto and the
>>>>>>>>> VF as slaves. Also I thought there was some discussion about trying
>>>>>>>>> to
>>>>>>>>> reuse as much of the netvsc code as possible for this so that we
>>>>>>>>> could
>>>>>>>>> avoid duplication of effort and have the two drivers use the same
>>>>>>>>> approach. It seems like it should be pretty straight forward since
>>>>>>>>> you
>>>>>>>>> would have the feature bit in the case of virto, and netvsc just
>>>>>>>>> does
>>>>>>>>> this sort of thing by default if I am not mistaken.
>>>>>>>>
>>>>>>>> This patch is mostly based on netvsc implementation. The only change
>>>>>>>> is
>>>>>>>> avoiding the
>>>>>>>> explicit dev_open() call of the VF netdev after a delay. I am
>>>>>>>> assuming
>>>>>>>> that
>>>>>>>> the guest userspace
>>>>>>>> will bring up the VF netdev and the hypervisor will update the MAC
>>>>>>>> filters
>>>>>>>> to switch to
>>>>>>>> the right data path.
>>>>>>>> We could commonize the code and make it shared between netvsc and
>>>>>>>> virtio.
>>>>>>>> Do
>>>>>>>> we want
>>>>>>>> to do this right away or later? If so, what would be a good location
>>>>>>>> for
>>>>>>>> these shared functions?
>>>>>>>> Is it net/core/dev.c?
>>>>>>>
>>>>>>> No, I would think about starting a new driver file in "/drivers/net/".
>>>>>>> The idea is this driver would be utilized to create a bond
>>>>>>> automatically and set the appropriate registration hooks. If nothing
>>>>>>> else you could probably just call it something generic like virt-bond
>>>>>>> or vbond or whatever.
>>>>>>
>>>>>>
>>>>>> We are trying to avoid creating another driver or a device. Can we
>>>>>> look
>>>>>> into
>>>>>> consolidation of the 2 implementations(virtio & netvsc) as a later
>>>>>> patch?
>>>>>>
>>>>>>>> Also, if we want to go with a solution that creates a bond device, do
>>>>>>>> we
>>>>>>>> want virtio_net/netvsc
>>>>>>>> drivers to create a upper device? Such a solution is already
>>>>>>>> possible
>>>>>>>> via
>>>>>>>> config scripts that can
>>>>>>>> create a bond with virtio and a VF net device as slaves. netvsc and
>>>>>>>> this
>>>>>>>> patch series is trying to
>>>>>>>> make it as simple as possible for the VM to use directly attached
>>>>>>>> devices
>>>>>>>> and support live migration
>>>>>>>> by switching to virtio datapath as a backup during the migration
>>>>>>>> process
>>>>>>>> when the VF device
>>>>>>>> is unplugged.
>>>>>>>
>>>>>>> We all understand that. But you are making the solution very virtio
>>>>>>> specific. We want to see this be usable for other interfaces such as
>>>>>>> netsc and whatever other virtual interfaces are floating around out
>>>>>>> there.
>>>>>>>
>>>>>>> Also I haven't seen us address what happens as far as how we will
>>>>>>> handle this on the host. My thought was we should have a paired
>>>>>>> interface. Something like veth, but made up of a bond on each end. So
>>>>>>> in the host we should have one bond that has a tap/vhost interface and
>>>>>>> a VF port representor, and on the other we would be looking at the
>>>>>>> virtio interface and the VF. Attaching the tap/vhost to the bond could
>>>>>>> be a way of triggering the feature bit to be set in the virtio. That
>>>>>>> way communication between the guest and the host won't get too
>>>>>>> confusing as you will see all traffic from the bonded MAC address
>>>>>>> always show up on the host side bond instead of potentially showing up
>>>>>>> on two unrelated interfaces. It would also make for a good way to
>>>>>>> resolve the east/west traffic problem on hosts since you could just
>>>>>>> send the broadcast/multicast traffic via the tap/vhost/virtio channel
>>>>>>> instead of having to send it back through the port representor and eat
>>>>>>> up all that PCIe bus traffic.
>>>>>>
>>>>>> From the host point of view, here is a simple script that needs to be
>>>>>> run to
>>>>>> do the
>>>>>> live migration. We don't need any bond configuration on the host.
>>>>>>
>>>>>> virsh detach-interface $DOMAIN hostdev --mac $MAC
>>>>>> ip link set $PF vf $VF_NUM mac $ZERO_MAC
>>>>>
>>>>> I'm not sure I understand how this script may work with regard to
>>>>> "live" migration.
>>>>>
>>>>> I'm confused, this script seems to require virtio-net to be configured
>>>>> on top of a different PF than where the migrating VF is seated. Or
>>>>> else, how does identical MAC address filter get programmed to one PF
>>>>> with two (or more) child virtual interfaces (e.g. one macvtap for
>>>>> virtio-net plus one VF)? The coincidence of it being able to work on
>>>>> the NIC of one/some vendor(s) does not apply to the others AFAIK.
>>>>>
>>>>> If you're planning to use a different PF, I don't see how gratuitous
>>>>> ARP announcements are generated to make this a "live" migration.
>>>>
>>>>
>>>> I am not using a different PF. virtio is backed by a tap/bridge with PF
>>>> attached
>>>> to that bridge. When we reset VF MAC after it is unplugged, all the
>>>> packets
>>>> for
>>>> the guest MAC will go to PF and reach virtio via the bridge.
>>>>
>>> That is the limitation of this scheme: it only works for virtio backed
>>> by tap/bridge, rather than backed by macvtap on top of the
>>> corresponding *PF*. Nowadays more datacenter users prefer macvtap as
>>> opposed to bridge, simply because of better isolation and performance
>>> (e.g. host stack consumption on NIC promiscuity processing are not
>>> scalable for bridges). Additionally, the ongoing virtio receive
>>> zero-copy work will be tightly integrated with macvtap, the
>>> performance optimization of which is apparently difficult (if
>>> technically possible at all) to be done on bridge. Why do we limit the
>>> host backend support to only bridge at this point?
>>
>>
>> No. This should work with virtio backed by macvtap over PF too.
>>
>>>
>>>> If we want to use virtio backed by macvtap on top of another VF as the
>>>> backup
>>>> channel, and we could set the guest MAC to that VF after unplugging the
>>>> directly
>>>> attached VF.
>>>
>>> I meant macvtap on the regarding PF instead of another VF. You know,
>>> users shouldn't have to change guest MAC back and forth. Live
>>> migration shouldn't involve any form of user intervention IMHO.
>>
>> Yes. macvtap on top of PF should work too. Hypervisor doesn't need to change
>> the guest MAC. The PF driver needs to program the HW MAC filters so that
>> the
>> frames reach PF when VF is unplugged.
>
> So the HW MAC filter is deferred to get programmed for virtio only
> until VF is unplugged, correct? This is not the regular plumbing order
> for macvtap. Unless I miss something obvious, how does this get
> reflected in the script below?
>
> virsh detach-interface $DOMAIN hostdev --mac $MAC
> ip link set $PF vf $VF_NUM mac $ZERO_MAC
>
> i.e. commands above won't automatically trigger the programming of MAC
> filters for virtio.
>
> If you program two identical MAC address filters for both VF and
> virito at the same point, I'm sure if won't work at all. It does not
> sound clear to me how you propose to make it work if you don't plan
> to change the plumbing order?
>
>>
>>
>>>
>>>>>> virsh migrate --live $DOMAIN qemu+ssh://$REMOTE_HOST/system
>>>>>>
>>>>>> ssh $REMOTE_HOST ip link set $PF vf $VF_NUM mac $MAC
>>>>>> ssh $REMOTE_HOST virsh attach-interface $DOMAIN hostdev $REMOTE_HOSTDEV
>>>>>> --mac $MAC
>>>>>
>>>>> How do you keep guest side VF configurations e.g. MTU and VLAN filters
>>>>> around across the migration? More broadly, how do you make sure the
>>>>> new VF still as performant as previously done such that all hardware
>>>>> ring tunings and offload settings can be kept as much as it can be?
>>>>> I'm afraid this simple script won't work for those real-world
>>>>> scenarios.
>>>>
>>>>
>>>>> I would agree with Alex that we'll soon need a host-side stub/entity
>>>>> with cached guest configurations that may make VF switching
>>>>> straightforward and transparent.
>>>>
>>>> The script is only adding MAC filter to the VF on the destination. If the
>>>> source host has
>>>> done any additional tunings on the VF they need to be done on the
>>>> destination host too.
>>>
>>> I was mainly saying the VF's run-time configuration in the guest more
>>> than those to be configured from the host side. Let's say guest admin
>>> had changed the VF's MTU value, the default of which is 1500, to 9000
>>> before the migration. How do you save and restore the old running
>>> config for the VF across the migration?
>>
>> Such optimizations should be possible on top of this patch. We need to sync
>> up
>> any changes/updates to VF configuration/features with virtio.
>
> This is possible but not the ideal way to build it. Virtio perhaps
> would not be the best place to stack this (VF specifics for live
> migration) up further. We need a new driver and do it right from the
> very beginning.
>
> Thanks,
> -Siwei
>
>>
>>>
>>>> It is also possible that the VF on the destination is based on a totally
>>>> different NIC which
>>>> may be more or less performant. Or the destination may not even support a
>>>> VF
>>>> datapath too.
>>>
>>> This argument is rather weak. In almost all real-world live migration
>>> scenarios, the hardware configurations on both source and destination
>>> are (required to be) identical. Being able to support heterogenous
>>> live migration doesn't mean we can do nothing but throw all running
>>> configs or driver tunings away when it's done. Specifically, I don't
>>> find a reason not to apply the guest network configs including NIC
>>> offload settings if those are commonly supported on both ends, even on
>>> virtio-net. While for some of the configs it might be noticeable for
>>> user to respond to the loss or change, complaints would still arise
>>> when issues are painful to troubleshoot and/or difficult to get them
>>> detected and restored. This is why I say real-world scenarios are more
>>> complex than just switch and go.
>>>
>>
>> Sure. These patches by themselves don't enable live migration automatically.
>> Hypervisor
>> needs to do some additional setup before and after the migration.
^ permalink raw reply
* Re: [virtio-dev] Re: [PATCH v22 2/3] virtio-balloon: VIRTIO_BALLOON_F_FREE_PAGE_VQ
From: Wei Wang @ 2018-01-22 11:25 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: yang.zhang.wz, virtio-dev, riel, quan.xu0, kvm, nilal,
liliang.opensource, linux-kernel, mhocko, linux-mm, pbonzini,
akpm, virtualization
In-Reply-To: <20180119143517-mutt-send-email-mst@kernel.org>
On 01/19/2018 08:39 PM, Michael S. Tsirkin wrote:
> On Fri, Jan 19, 2018 at 11:44:21AM +0800, Wei Wang wrote:
>> On 01/18/2018 12:44 AM, Michael S. Tsirkin wrote:
>>> On Wed, Jan 17, 2018 at 01:10:11PM +0800, Wei Wang wrote:
>>>
>>>> + vb->start_cmd_id = cmd_id;
>>>> + queue_work(vb->balloon_wq, &vb->report_free_page_work);
>>> It seems that if a command was already queued (with a different id),
>>> this will result in new command id being sent to host twice, which will
>>> likely confuse the host.
>> I think that case won't happen, because
>> - the host sends a cmd id to the guest via the config, while the guest acks
>> back the received cmd id via the virtqueue;
>> - the guest ack back a cmd id only when a new cmd id is received from the
>> host, that is the above check:
>>
>> if (cmd_id != vb->start_cmd_id) { --> the driver only queues the
>> reporting work only when a new cmd id is received
>> /*
>> * Host requests to start the reporting by sending a
>> * new cmd id.
>> */
>> WRITE_ONCE(vb->report_free_page, true);
>> vb->start_cmd_id = cmd_id;
>> queue_work(vb->balloon_wq,
>> &vb->report_free_page_work);
>> }
>>
>> So the same cmd id wouldn't queue the reporting work twice.
>>
> Like this:
>
> vb->start_cmd_id = cmd_id;
> queue_work(vb->balloon_wq, &vb->report_free_page_work);
>
> command id changes
>
> vb->start_cmd_id = cmd_id;
>
> work executes
>
> queue_work(vb->balloon_wq, &vb->report_free_page_work);
>
> work executes again
>
If we think about the whole working flow, I think this case couldn't happen:
1) device send cmd_id=1 to driver;
2) driver receives cmd_id=1 in the config and acks cmd_id=1 to the
device via the vq;
3) device revives cmd_id=1;
4) device wants to stop the reporting by sending cmd_id=STOP;
5) driver receives cmd_id=STOP from the config, and acks cmd_id=STOP to
the device via the vq;
6) device sends cmd_id=2 to driver;
...
cmd_id=2 won't come after cmd_id=1, there will be a STOP cmd in between
them (STOP won't queue the work).
How about defining the correct device behavior in the spec:
The device Should NOT send a second cmd id to the driver until a STOP
cmd ack for the previous cmd id has been received from the guest.
Best,
Wei
^ permalink raw reply
* [PATCH 6/6] MAINTAINERS: Add entry for Jailhouse
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, linux-pci, x86, Linux Kernel Mailing List,
virtualization
In-Reply-To: <cover.1516601570.git.jan.kiszka@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
MAINTAINERS | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/MAINTAINERS b/MAINTAINERS
index 426ba037d943..dd51a2012b36 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -7468,6 +7468,13 @@ Q: http://patchwork.linuxtv.org/project/linux-media/list/
S: Maintained
F: drivers/media/dvb-frontends/ix2505v*
+JAILHOUSE HYPERVISOR INTERFACE
+M: Jan Kiszka <jan.kiszka@siemens.com>
+L: jailhouse-dev@googlegroups.com
+S: Maintained
+F: arch/x86/kernel/jailhouse.c
+F: arch/x86/include/asm/jailhouse_para.h
+
JC42.4 TEMPERATURE SENSOR DRIVER
M: Guenter Roeck <linux@roeck-us.net>
L: linux-hwmon@vger.kernel.org
--
2.13.6
^ permalink raw reply related
* [PATCH 5/6] x86/jailhouse: Allow to use PCI_MMCONFIG without ACPI
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, linux-pci, x86, Linux Kernel Mailing List,
virtualization
In-Reply-To: <cover.1516601570.git.jan.kiszka@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com>
Jailhouse does not use ACPI, but it does support MMCONFIG. Make sure the
latter can be built without having to enable ACPI as well. Primarily, we
need to make the AMD mmconf-fam10h_64 depend upon MMCONFIG and ACPI,
instead of just the former.
Saves some bytes in the Jailhouse non-root kernel.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
arch/x86/Kconfig | 6 +++++-
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/cpu/amd.c | 2 +-
3 files changed, 7 insertions(+), 3 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index f2038417a590..77ba0eb0a258 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2597,7 +2597,7 @@ config PCI_DIRECT
config PCI_MMCONFIG
bool "Support mmconfig PCI config space access" if X86_64
default y
- depends on PCI && (ACPI || SFI) && (PCI_GOMMCONFIG || PCI_GOANY || X86_64)
+ depends on PCI && (ACPI || SFI || JAILHOUSE_GUEST) && (PCI_GOMMCONFIG || PCI_GOANY || X86_64)
config PCI_OLPC
def_bool y
@@ -2612,6 +2612,10 @@ config PCI_DOMAINS
def_bool y
depends on PCI
+config MMCONF_FAM10H
+ def_bool y
+ depends on PCI_MMCONFIG && ACPI
+
config PCI_CNB20LE_QUIRK
bool "Read CNB20LE Host Bridge Windows" if EXPERT
depends on PCI
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index aed9296dccd3..b2c9e230e2fe 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -143,6 +143,6 @@ ifeq ($(CONFIG_X86_64),y)
obj-$(CONFIG_GART_IOMMU) += amd_gart_64.o aperture_64.o
obj-$(CONFIG_CALGARY_IOMMU) += pci-calgary_64.o tce_64.o
- obj-$(CONFIG_PCI_MMCONFIG) += mmconf-fam10h_64.o
+ obj-$(CONFIG_MMCONF_FAM10H) += mmconf-fam10h_64.o
obj-y += vsmp_64.o
endif
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index ea831c858195..47edf599f6fd 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -690,7 +690,7 @@ static void init_amd_k8(struct cpuinfo_x86 *c)
static void init_amd_gh(struct cpuinfo_x86 *c)
{
-#ifdef CONFIG_X86_64
+#ifdef CONFIG_MMCONF_FAM10H
/* do this for boot cpu */
if (c == &boot_cpu_data)
check_enable_amd_mmconf_dmi();
--
2.13.6
^ permalink raw reply related
* [PATCH 4/6] x86: Consolidate PCI_MMCONFIG configs
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, linux-pci, x86, Linux Kernel Mailing List,
virtualization
In-Reply-To: <cover.1516601570.git.jan.kiszka@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com>
Not sure if those two worked by design or just by chance so far. In any
case, it's at least cleaner and clearer to express this in a single
config statement.
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
arch/x86/Kconfig | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 423e4b64e683..f2038417a590 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2595,8 +2595,9 @@ config PCI_DIRECT
depends on PCI && (X86_64 || (PCI_GODIRECT || PCI_GOANY || PCI_GOOLPC || PCI_GOMMCONFIG))
config PCI_MMCONFIG
- def_bool y
- depends on X86_32 && PCI && (ACPI || SFI) && (PCI_GOMMCONFIG || PCI_GOANY)
+ bool "Support mmconfig PCI config space access" if X86_64
+ default y
+ depends on PCI && (ACPI || SFI) && (PCI_GOMMCONFIG || PCI_GOANY || X86_64)
config PCI_OLPC
def_bool y
@@ -2611,10 +2612,6 @@ config PCI_DOMAINS
def_bool y
depends on PCI
-config PCI_MMCONFIG
- bool "Support mmconfig PCI config space access"
- depends on X86_64 && PCI && ACPI
-
config PCI_CNB20LE_QUIRK
bool "Read CNB20LE Host Bridge Windows" if EXPERT
depends on PCI
--
2.13.6
^ permalink raw reply related
* [PATCH 3/6] x86/jailhouse: Enable PCI mmconfig access in inmates
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, linux-pci, x86, Linux Kernel Mailing List,
virtualization
In-Reply-To: <cover.1516601570.git.jan.kiszka@siemens.com>
From: Otavio Pontes <otavio.pontes@intel.com>
Use the PCI mmconfig base address exported by jailhouse in boot
parameters in order to access the memory mapped PCI configuration space.
Signed-off-by: Otavio Pontes <otavio.pontes@intel.com>
[Jan: rebased, fixed !CONFIG_PCI_MMCONFIG]
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
arch/x86/include/asm/pci_x86.h | 2 ++
arch/x86/kernel/jailhouse.c | 7 +++++++
arch/x86/pci/mmconfig-shared.c | 4 ++--
3 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/pci_x86.h b/arch/x86/include/asm/pci_x86.h
index eb66fa9cd0fc..959d618dbb17 100644
--- a/arch/x86/include/asm/pci_x86.h
+++ b/arch/x86/include/asm/pci_x86.h
@@ -151,6 +151,8 @@ extern int pci_mmconfig_insert(struct device *dev, u16 seg, u8 start, u8 end,
phys_addr_t addr);
extern int pci_mmconfig_delete(u16 seg, u8 start, u8 end);
extern struct pci_mmcfg_region *pci_mmconfig_lookup(int segment, int bus);
+extern struct pci_mmcfg_region *__init pci_mmconfig_add(int segment, int start,
+ int end, u64 addr);
extern struct list_head pci_mmcfg_list;
diff --git a/arch/x86/kernel/jailhouse.c b/arch/x86/kernel/jailhouse.c
index b68fd895235a..7fe2a73da0b3 100644
--- a/arch/x86/kernel/jailhouse.c
+++ b/arch/x86/kernel/jailhouse.c
@@ -124,6 +124,13 @@ static int __init jailhouse_pci_arch_init(void)
if (pcibios_last_bus < 0)
pcibios_last_bus = 0xff;
+#ifdef CONFIG_PCI_MMCONFIG
+ if (setup_data.pci_mmconfig_base) {
+ pci_mmconfig_add(0, 0, 0xff, setup_data.pci_mmconfig_base);
+ pci_mmcfg_arch_init();
+ }
+#endif
+
return 0;
}
diff --git a/arch/x86/pci/mmconfig-shared.c b/arch/x86/pci/mmconfig-shared.c
index 96684d0adcf9..0e590272366b 100644
--- a/arch/x86/pci/mmconfig-shared.c
+++ b/arch/x86/pci/mmconfig-shared.c
@@ -94,8 +94,8 @@ static struct pci_mmcfg_region *pci_mmconfig_alloc(int segment, int start,
return new;
}
-static struct pci_mmcfg_region *__init pci_mmconfig_add(int segment, int start,
- int end, u64 addr)
+struct pci_mmcfg_region *__init pci_mmconfig_add(int segment, int start,
+ int end, u64 addr)
{
struct pci_mmcfg_region *new;
--
2.13.6
^ permalink raw reply related
* [PATCH 2/6] pci: Scan all functions when probing while running over Jailhouse
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, Benedikt Spranger, linux-pci, x86,
Linux Kernel Mailing List, virtualization
In-Reply-To: <cover.1516601570.git.jan.kiszka@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com>
PCI and PCIBIOS probing only scans devices at function number 0/8/16/...
Subdevices (e.g. multiqueue) have function numbers which are not a
multiple of 8.
The simple hypervisor Jailhouse passes subdevices directly w/o providing
a virtual PCI topology like KVM. As a consequence a PCI passthrough from
Jailhouse to a guest will not be detected by Linux.
Based on patch by Benedikt Spranger, adding Jailhouse probing to avoid
changing the behavior in the absence of the hypervisor.
CC: Benedikt Spranger <b.spranger@linutronix.de>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
arch/x86/pci/legacy.c | 4 +++-
drivers/pci/probe.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/arch/x86/pci/legacy.c b/arch/x86/pci/legacy.c
index 1cb01abcb1be..a7b0476b4f44 100644
--- a/arch/x86/pci/legacy.c
+++ b/arch/x86/pci/legacy.c
@@ -5,6 +5,7 @@
#include <linux/export.h>
#include <linux/pci.h>
#include <asm/pci_x86.h>
+#include <asm/jailhouse_para.h>
/*
* Discover remaining PCI buses in case there are peer host bridges.
@@ -34,13 +35,14 @@ int __init pci_legacy_init(void)
void pcibios_scan_specific_bus(int busn)
{
+ int stride = jailhouse_paravirt() ? 1 : 8;
int devfn;
u32 l;
if (pci_find_bus(0, busn))
return;
- for (devfn = 0; devfn < 256; devfn += 8) {
+ for (devfn = 0; devfn < 256; devfn += stride) {
if (!raw_pci_read(0, busn, devfn, PCI_VENDOR_ID, 2, &l) &&
l != 0x0000 && l != 0xffff) {
DBG("Found device at %02x:%02x [%04x]\n", busn, devfn, l);
diff --git a/drivers/pci/probe.c b/drivers/pci/probe.c
index 14e0ea1ff38b..60ad14c8245f 100644
--- a/drivers/pci/probe.c
+++ b/drivers/pci/probe.c
@@ -17,6 +17,7 @@
#include <linux/acpi.h>
#include <linux/irqdomain.h>
#include <linux/pm_runtime.h>
+#include <linux/hypervisor.h>
#include "pci.h"
#define CARDBUS_LATENCY_TIMER 176 /* secondary latency timer */
@@ -2454,6 +2455,7 @@ static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
unsigned int available_buses)
{
unsigned int used_buses, normal_bridges = 0, hotplug_bridges = 0;
+ unsigned int stride = jailhouse_paravirt() ? 1 : 8;
unsigned int start = bus->busn_res.start;
unsigned int devfn, cmax, max = start;
struct pci_dev *dev;
@@ -2461,7 +2463,7 @@ static unsigned int pci_scan_child_bus_extend(struct pci_bus *bus,
dev_dbg(&bus->dev, "scanning bus\n");
/* Go find them, Rover! */
- for (devfn = 0; devfn < 0x100; devfn += 8)
+ for (devfn = 0; devfn < 0x100; devfn += stride)
pci_scan_slot(bus, devfn);
/* Reserve buses for SR-IOV capability. */
--
2.13.6
^ permalink raw reply related
* [PATCH 1/6] jailhouse: Provide detection for non-x86 systems
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, Mark Rutland, linux-pci, x86,
Linux Kernel Mailing List, virtualization, Rob Herring
In-Reply-To: <cover.1516601570.git.jan.kiszka@siemens.com>
From: Jan Kiszka <jan.kiszka@siemens.com>
Implement jailhouse_paravirt() via device tree probing on architectures
!= x86. Will be used by the PCI core.
CC: Rob Herring <robh+dt@kernel.org>
CC: Mark Rutland <mark.rutland@arm.com>
Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com>
---
Documentation/devicetree/bindings/jailhouse.txt | 8 ++++++++
arch/x86/include/asm/jailhouse_para.h | 2 +-
include/linux/hypervisor.h | 17 +++++++++++++++--
3 files changed, 24 insertions(+), 3 deletions(-)
create mode 100644 Documentation/devicetree/bindings/jailhouse.txt
diff --git a/Documentation/devicetree/bindings/jailhouse.txt b/Documentation/devicetree/bindings/jailhouse.txt
new file mode 100644
index 000000000000..2901c25ff340
--- /dev/null
+++ b/Documentation/devicetree/bindings/jailhouse.txt
@@ -0,0 +1,8 @@
+Jailhouse non-root cell device tree bindings
+--------------------------------------------
+
+When running in a non-root Jailhouse cell (partition), the device tree of this
+platform shall have a top-level "hypervisor" node with the following
+properties:
+
+- compatible = "jailhouse,cell"
diff --git a/arch/x86/include/asm/jailhouse_para.h b/arch/x86/include/asm/jailhouse_para.h
index 875b54376689..b885a961a150 100644
--- a/arch/x86/include/asm/jailhouse_para.h
+++ b/arch/x86/include/asm/jailhouse_para.h
@@ -1,7 +1,7 @@
/* SPDX-License-Identifier: GPL2.0 */
/*
- * Jailhouse paravirt_ops implementation
+ * Jailhouse paravirt detection
*
* Copyright (c) Siemens AG, 2015-2017
*
diff --git a/include/linux/hypervisor.h b/include/linux/hypervisor.h
index b19563f9a8eb..fc08b433c856 100644
--- a/include/linux/hypervisor.h
+++ b/include/linux/hypervisor.h
@@ -8,15 +8,28 @@
*/
#ifdef CONFIG_X86
+
+#include <asm/jailhouse_para.h>
#include <asm/x86_init.h>
+
static inline void hypervisor_pin_vcpu(int cpu)
{
x86_platform.hyper.pin_vcpu(cpu);
}
-#else
+
+#else /* !CONFIG_X86 */
+
+#include <linux/of.h>
+
static inline void hypervisor_pin_vcpu(int cpu)
{
}
-#endif
+
+static inline bool jailhouse_paravirt(void)
+{
+ return of_find_compatible_node(NULL, NULL, "jailhouse,cell");
+}
+
+#endif /* !CONFIG_X86 */
#endif /* __LINUX_HYPEVISOR_H */
--
2.13.6
^ permalink raw reply related
* [PATCH 0/6] jailhouse: Enhance secondary Jailhouse guest support /wrt PCI
From: Jan Kiszka @ 2018-01-22 6:12 UTC (permalink / raw)
To: Thomas Gleixner, Ingo Molnar, H . Peter Anvin, Bjorn Helgaas
Cc: jailhouse-dev, Mark Rutland, Benedikt Spranger, linux-pci, x86,
Linux Kernel Mailing List, virtualization, Rob Herring,
Otavio Pontes
Basic x86 support [1] for running Linux as secondary Jailhouse [2] guest
is currently pending in the tip tree. This builds on top and enhances
the PCI support for x86 and also ARM guests (ARM[64] does not require
platform patches and works already).
Key elements of this series are:
- detection of Jailhouse via device tree hypervisor node
- function-level PCI scan if Jailhouse is detected
- MMCONFIG support for x86 guests
As most changes affect x86, I would suggest to route the series also via
tip after the necessary acks are collected.
Jan
[1] https://lkml.org/lkml/2017/11/27/125
[2] http://jailhouse-project.org
CC: Benedikt Spranger <b.spranger@linutronix.de>
CC: Mark Rutland <mark.rutland@arm.com>
CC: Otavio Pontes <otavio.pontes@intel.com>
CC: Rob Herring <robh+dt@kernel.org>
Jan Kiszka (5):
jailhouse: Provide detection for non-x86 systems
pci: Scan all functions when probing while running over Jailhouse
x86: Consolidate PCI_MMCONFIG configs
x86/jailhouse: Allow to use PCI_MMCONFIG without ACPI
MAINTAINERS: Add entry for Jailhouse
Otavio Pontes (1):
x86/jailhouse: Enable PCI mmconfig access in inmates
Documentation/devicetree/bindings/jailhouse.txt | 8 ++++++++
MAINTAINERS | 7 +++++++
arch/x86/Kconfig | 11 ++++++-----
arch/x86/include/asm/jailhouse_para.h | 2 +-
arch/x86/include/asm/pci_x86.h | 2 ++
arch/x86/kernel/Makefile | 2 +-
arch/x86/kernel/cpu/amd.c | 2 +-
arch/x86/kernel/jailhouse.c | 7 +++++++
arch/x86/pci/legacy.c | 4 +++-
arch/x86/pci/mmconfig-shared.c | 4 ++--
drivers/pci/probe.c | 4 +++-
include/linux/hypervisor.h | 17 +++++++++++++++--
12 files changed, 56 insertions(+), 14 deletions(-)
create mode 100644 Documentation/devicetree/bindings/jailhouse.txt
--
2.13.6
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox