Kernel KVM virtualization development
* RE: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
       [not found]   ` <ZJMLHSq9rjGIVS4V@nvidia.com>
@ 2023-06-27  6:55     ` Tian, Kevin
  2023-07-03  5:27       ` Cao, Yahui
  0 siblings, 1 reply; 4+ messages in thread
From: Tian, Kevin @ 2023-06-27  6:55 UTC (permalink / raw)
  To: Jason Gunthorpe, Liu, Lingyu
  Cc: intel-wired-lan@lists.osuosl.org, Liu, Yi L, Burra, Phani R,
	kvm@vger.kernel.org

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Wednesday, June 21, 2023 10:37 PM
> 
> On Wed, Jun 21, 2023 at 09:11:07AM +0000, Lingyu Liu wrote:
> > diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
> > index 2579bc0bd193..c2a83a97af05 100644
> > --- a/drivers/net/ethernet/intel/ice/ice_migration.c
> > +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
> 
> > +static int
> > +ice_migration_restore_tx_head(struct ice_vf *vf,
> > +			      struct ice_migration_dev_state *devstate,
> > +			      struct vfio_device *vdev)
> > +{
> > +	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
> > +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
> > +	struct ice_pf *pf = vf->pf;
> > +	u16 max_ring_len = 0;
> > +	struct device *dev;
> > +	int ret = 0;
> > +	int i = 0;
> > +
> > +	dev = ice_pf_to_dev(vf->pf);
> > +
> > +	if (!vsi) {
> > +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
> > +		return -EINVAL;
> > +	}
> > +
> > +	ice_for_each_txq(vsi, i) {
> > +		if (!test_bit(i, vf->txq_ena))
> > +			continue;
> > +
> > +		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
> > +	}
> > +
> > +	if (max_ring_len == 0)
> > +		return 0;
> > +
> > +	tx_desc = (struct ice_tx_desc *)kcalloc
> > +		  (max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> > +	tx_desc_dummy = (struct ice_tx_desc *)kcalloc
> > +			(max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
> > +	if (!tx_desc || !tx_desc_dummy) {
> > +		dev_err(dev, "VF %d failed to allocate memory for tx descriptors to restore tx head\n",
> > +			vf->vf_id);
> > +		ret = -ENOMEM;
> > +		goto err;
> > +	}
> > +
> > +	for (i = 0; i < max_ring_len; i++) {
> > +		u32 td_cmd;
> > +
> > +		td_cmd = ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY;
> > +		tx_desc_dummy[i].cmd_type_offset_bsz =
> > +					ice_build_ctob(td_cmd, 0, SZ_256, 0);
> > +	}
> > +
> > +	/* For each tx queue, we restore the tx head following below steps:
> > +	 * 1. backup original tx ring descriptor memory
> > +	 * 2. overwrite the tx ring descriptor with dummy packets
> > +	 * 3. kick doorbell register to trigger descriptor writeback,
> > +	 *    then tx head will move from 0 to tail - 1 and tx head is restored
> > +	 *    to the place we expect.
> > +	 * 4. restore the tx ring with original tx ring descriptor memory in
> > +	 *    order not to corrupt the ring context.
> > +	 */
> > +	ice_for_each_txq(vsi, i) {
> > +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
> > +		u16 *tx_heads = devstate->tx_head;
> > +		u32 tx_head;
> > +		int j;
> > +
> > +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
> > +			continue;
> > +
> > +		if (tx_heads[i] >= tx_ring->count) {
> > +			dev_err(dev, "saved tx ring head exceeds tx ring count\n");
> > +			ret = -EINVAL;
> > +			goto err;
> > +		}
> > +		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
> > +				  tx_ring->count * sizeof(tx_desc[0]), false);
> > +		if (ret) {
> > +			dev_err(dev, "kvm read guest tx ring error: %d\n",
> > +				ret);
> > +			goto err;
> 
> You can't call VFIO functions from a netdev driver. All this code
> needs to be moved into the variant driver.
> 
> This design seems pretty wild to me, it doesn't seem too robust
> against a hostile VM - eg these DMAs can all fail under guest control,
> and then what?

Yeah that sounds fragile.
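
To make the fragility concrete, here is a userspace toy model of the four-step head restore described in the patch comment. It is only a sketch under stated assumptions: `sim_ring`, `sim_dma()`, `sim_restore_head()`, `SIM_DUMMY_DESC` and the `dma_budget` counter are hypothetical names that exist neither in the ice driver nor in VFIO, and the device's own descriptor fetch in step 3 is not modelled as a DMA that can fail.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define SIM_RING_LEN   8
#define SIM_DUMMY_DESC 0xD0D0D0D0ULL	/* hypothetical dummy descriptor */

/* Hypothetical stand-in for one guest TX ring plus the device head. */
struct sim_ring {
	uint64_t desc[SIM_RING_LEN];	/* guest-owned descriptor memory */
	uint16_t hw_head;		/* head tracked by the simulated device */
	int dma_budget;			/* DMA copies that succeed before failing */
};

static void sim_ring_init(struct sim_ring *r, int dma_budget)
{
	for (int i = 0; i < SIM_RING_LEN; i++)
		r->desc[i] = (uint64_t)i + 1;	/* "original" guest descriptors */
	r->hw_head = 0;
	r->dma_budget = dma_budget;
}

/* Simulated whole-ring DMA copy; fails once the budget is used up. */
static int sim_dma(struct sim_ring *r, uint64_t *buf, bool write)
{
	if (r->dma_budget <= 0)
		return -1;
	r->dma_budget--;
	if (write)
		memcpy(r->desc, buf, sizeof(r->desc));
	else
		memcpy(buf, r->desc, sizeof(r->desc));
	return 0;
}

/* Returns 0 on success, -1 as soon as any DMA step fails. */
static int sim_restore_head(struct sim_ring *r, uint16_t saved_head)
{
	uint64_t backup[SIM_RING_LEN], dummy[SIM_RING_LEN];

	assert(saved_head < SIM_RING_LEN);
	for (int i = 0; i < SIM_RING_LEN; i++)
		dummy[i] = SIM_DUMMY_DESC;

	if (sim_dma(r, backup, false))	/* step 1: back up the ring */
		return -1;
	if (sim_dma(r, dummy, true))	/* step 2: overwrite with dummies */
		return -1;
	r->hw_head = saved_head;	/* step 3: doorbell moves the head */
	if (sim_dma(r, backup, true))	/* step 4: put the original ring back */
		return -1;
	return 0;
}
```

In this model a failure in step 1 or 2 leaves the ring untouched, while a failure in step 4 returns an error with the head already moved and the guest ring still full of dummy descriptors, and nothing in the function can undo that.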

At least the range that will be overwritten in the resuming path should
be verified on the source side. If it is inaccessible, the driver should fail
the state transition immediately rather than leave the failure to be
discovered in the resuming path, where it is unrecoverable.

Btw, I don't know how the spec describes the hardware behavior in this
situation. If the behavior is undefined when hostile software deliberately
causes DMA failures on the TX queue, then not restoring the queue head
could also be an option that lets the migration continue in such a scenario.
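
A minimal sketch of the source-side check, assuming the driver can ask whether an IOVA range is currently mapped. The types and names (`mapped_range`, `txq_state`, `src_precheck_queues()`, `SIM_DESC_SIZE`) are hypothetical and are not part of the ice driver or of VFIO; the mapping table stands in for whatever the real accessibility test would be.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SIM_DESC_SIZE 16u	/* assumed size of one TX descriptor in bytes */

/* Hypothetical view of one IOVA range the device may currently access. */
struct mapped_range {
	uint64_t iova;
	uint64_t len;
};

/* Hypothetical per-queue state captured on the source side. */
struct txq_state {
	uint64_t ring_iova;	/* base IOVA of the descriptor ring */
	uint16_t count;		/* number of descriptors in the ring */
	uint16_t head;		/* head value that would be saved */
};

/* True if [iova, iova + len) lies entirely inside one mapped range. */
static bool range_is_mapped(uint64_t iova, uint64_t len,
			    const struct mapped_range *maps, size_t nmaps)
{
	for (size_t i = 0; i < nmaps; i++) {
		if (iova >= maps[i].iova && len <= maps[i].len &&
		    iova - maps[i].iova <= maps[i].len - len)
			return true;
	}
	return false;
}

/*
 * Run before the source commits to the state transition: any queue whose
 * ring could not be rewritten on the destination fails the transition here,
 * while the VM can still keep running on the source.
 */
static int src_precheck_queues(const struct txq_state *q, size_t nq,
			       const struct mapped_range *maps, size_t nmaps)
{
	assert(q != NULL || nq == 0);
	for (size_t i = 0; i < nq; i++) {
		uint64_t bytes = (uint64_t)q[i].count * SIM_DESC_SIZE;

		if (q[i].head >= q[i].count)
			return -EINVAL;
		if (!range_is_mapped(q[i].ring_iova, bytes, maps, nmaps))
			return -EFAULT;
	}
	return 0;
}
```

A check like this only narrows the window: it says nothing about whether the same range will still be accessible on the destination when the resuming path runs.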

> 
> We also don't have any guarantees defined for the VFIO protocol about
> what state the vIOMMU will be in prior to reaching RUNNING.

This is a good point. Actually it's not just a gap in vIOMMU handling. It's
a dependency on IOMMUFD, regardless of whether the IOAS that the migrated
device is currently attached to is GPA or GIOVA. The device state can
be restored only after IOMMUFD is fully recovered and the device is
re-attached to the IOAS.

We need a way for the migration driver to advertise such a dependency to
the user.
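
The ordering constraint can be pictured as a simple gate on the destination side. This is a sketch only: no such interface exists in VFIO or IOMMUFD today, and `resume_deps`, `resume_load_device_state()` and the choice of -EAGAIN are all hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for the destination side of a migration. */
struct resume_deps {
	bool iommufd_recovered;		/* IOMMUFD state fully rebuilt */
	bool ioas_attached;		/* device re-attached to its IOAS */
	bool device_state_loaded;	/* set once the device state is restored */
};

/*
 * Refuse to load device state that needs DMA (such as the TX head restore)
 * until both dependencies hold. The error code is an arbitrary choice here.
 */
static int resume_load_device_state(struct resume_deps *d)
{
	assert(d != NULL);
	if (!d->iommufd_recovered || !d->ioas_attached)
		return -EAGAIN;
	d->device_state_loaded = true;
	return 0;
}
```

The open question in this thread is how a migration driver would tell userspace that it has this dependency at all, which the sketch does not answer.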

> 
> IDK, all of this looks like it is trying really hard to hackily force
> HW that was never meant to support live migration to somehow do
> something that looks like it.
> 
> You really need to present an explanation in the VFIO driver comments
> about how this whole scheme actually works and is secure and
> functional against a hostile guest.
> 

Agree. And please post the next version to the VFIO community to gain
more attention.


* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-06-27  6:55     ` [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head Tian, Kevin
@ 2023-07-03  5:27       ` Cao, Yahui
  2023-07-03 21:03         ` Jason Gunthorpe
  0 siblings, 1 reply; 4+ messages in thread
From: Cao, Yahui @ 2023-07-03  5:27 UTC (permalink / raw)
  To: Tian, Kevin, Jason Gunthorpe, Liu, Lingyu
  Cc: intel-wired-lan@lists.osuosl.org, Liu, Yi L, kvm@vger.kernel.org

Hi Jason & Kevin,

On 6/27/2023 2:55 PM, Tian, Kevin wrote:
>> From: Jason Gunthorpe <jgg@nvidia.com>
>> Sent: Wednesday, June 21, 2023 10:37 PM
>>
>> On Wed, Jun 21, 2023 at 09:11:07AM +0000, Lingyu Liu wrote:
>>> diff --git a/drivers/net/ethernet/intel/ice/ice_migration.c b/drivers/net/ethernet/intel/ice/ice_migration.c
>>> index 2579bc0bd193..c2a83a97af05 100644
>>> --- a/drivers/net/ethernet/intel/ice/ice_migration.c
>>> +++ b/drivers/net/ethernet/intel/ice/ice_migration.c
>>> +static int
>>> +ice_migration_restore_tx_head(struct ice_vf *vf,
>>> +			      struct ice_migration_dev_state *devstate,
>>> +			      struct vfio_device *vdev)
>>> +{
>>> +	struct ice_tx_desc *tx_desc_dummy, *tx_desc;
>>> +	struct ice_vsi *vsi = ice_get_vf_vsi(vf);
>>> +	struct ice_pf *pf = vf->pf;
>>> +	u16 max_ring_len = 0;
>>> +	struct device *dev;
>>> +	int ret = 0;
>>> +	int i = 0;
>>> +
>>> +	dev = ice_pf_to_dev(vf->pf);
>>> +
>>> +	if (!vsi) {
>>> +		dev_err(dev, "VF %d VSI is NULL\n", vf->vf_id);
>>> +		return -EINVAL;
>>> +	}
>>> +
>>> +	ice_for_each_txq(vsi, i) {
>>> +		if (!test_bit(i, vf->txq_ena))
>>> +			continue;
>>> +
>>> +		max_ring_len = max(vsi->tx_rings[i]->count, max_ring_len);
>>> +	}
>>> +
>>> +	if (max_ring_len == 0)
>>> +		return 0;
>>> +
>>> +	tx_desc = (struct ice_tx_desc *)kcalloc
>>> +		  (max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
>>> +	tx_desc_dummy = (struct ice_tx_desc *)kcalloc
>>> +			(max_ring_len, sizeof(struct ice_tx_desc), GFP_KERNEL);
>>> +	if (!tx_desc || !tx_desc_dummy) {
>>> +		dev_err(dev, "VF %d failed to allocate memory for tx descriptors to restore tx head\n",
>>> +			vf->vf_id);
>>> +		ret = -ENOMEM;
>>> +		goto err;
>>> +	}
>>> +
>>> +	for (i = 0; i < max_ring_len; i++) {
>>> +		u32 td_cmd;
>>> +
>>> +		td_cmd = ICE_TXD_LAST_DESC_CMD | ICE_TX_DESC_CMD_DUMMY;
>>> +		tx_desc_dummy[i].cmd_type_offset_bsz =
>>> +					ice_build_ctob(td_cmd, 0, SZ_256, 0);
>>> +	}
>>> +
>>> +	/* For each tx queue, we restore the tx head following below steps:
>>> +	 * 1. backup original tx ring descriptor memory
>>> +	 * 2. overwrite the tx ring descriptor with dummy packets
>>> +	 * 3. kick doorbell register to trigger descriptor writeback,
>>> +	 *    then tx head will move from 0 to tail - 1 and tx head is restored
>>> +	 *    to the place we expect.
>>> +	 * 4. restore the tx ring with original tx ring descriptor memory in
>>> +	 *    order not to corrupt the ring context.
>>> +	 */
>>> +	ice_for_each_txq(vsi, i) {
>>> +		struct ice_tx_ring *tx_ring = vsi->tx_rings[i];
>>> +		u16 *tx_heads = devstate->tx_head;
>>> +		u32 tx_head;
>>> +		int j;
>>> +
>>> +		if (!test_bit(i, vf->txq_ena) || tx_heads[i] == 0)
>>> +			continue;
>>> +
>>> +		if (tx_heads[i] >= tx_ring->count) {
>>> +			dev_err(dev, "saved tx ring head exceeds tx ring count\n");
>>> +			ret = -EINVAL;
>>> +			goto err;
>>> +		}
>>> +		ret = vfio_dma_rw(vdev, tx_ring->dma, (void *)tx_desc,
>>> +				  tx_ring->count * sizeof(tx_desc[0]), false);
>>> +		if (ret) {
>>> +			dev_err(dev, "kvm read guest tx ring error: %d\n",
>>> +				ret);
>>> +			goto err;
>> You can't call VFIO functions from a netdev driver. All this code
>> needs to be moved into the variant driver.


Will move vfio_dma_rw() into the VFIO driver and pass a callback function
into the netdev driver.


>>
>> This design seems pretty wild to me, it doesn't seem too robust
>> against a hostile VM - eg these DMAs can all fail under guest control,
>> and then what?
> Yeah that sounds fragile.
>
> At least the range that will be overwritten in the resuming path should
> be verified on the source side. If it is inaccessible, the driver should fail
> the state transition immediately rather than leave the failure to be
> discovered in the resuming path, where it is unrecoverable.
>
> Btw, I don't know how the spec describes the hardware behavior in this
> situation. If the behavior is undefined when hostile software deliberately
> causes DMA failures on the TX queue, then not restoring the queue head
> could also be an option that lets the migration continue in such a scenario.


Thanks for the advice. I will check that vfio_dma_rw() succeeds on the
source side and fail the state transition as soon as it returns an error.

When hostile software deliberately causes DMA failures on the TX queue, the
TX queue head keeps its original value, which is 0 on the destination side.
In that case, I'll let the VM resume with the TX head left at that original
value.
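
The fallback described above could look roughly like this. It is a sketch, not the ice driver's code: `head_outcome` and `dst_pick_tx_head()` are hypothetical names, and whether the hardware and the guest tolerate a head left at 0 is exactly the point still under discussion.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical result of the destination-side decision for one TX queue. */
struct head_outcome {
	uint16_t head;		/* head the queue ends up with */
	bool restored;		/* true only if the saved head was applied */
};

/*
 * dma_err is the result of the ring DMA on the destination (0 on success).
 * On failure the head stays at its reset value of 0 and the resume goes on
 * instead of aborting the migration.
 */
static struct head_outcome dst_pick_tx_head(uint16_t saved_head,
					    uint16_t ring_count, int dma_err)
{
	struct head_outcome out = { .head = 0, .restored = false };

	/* The bounds check is assumed to have been done by the caller. */
	assert(saved_head < ring_count);

	if (dma_err != 0)
		return out;

	out.head = saved_head;
	out.restored = true;
	return out;
}
```

Returning the `restored` flag alongside the head lets the caller log or report that the queue came up without its saved position, rather than failing silently.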


>
>> We also don't have any guarantees defined for the VFIO protocol about
>> what state the vIOMMU will be in prior to reaching RUNNING.
> This is a good point. Actually it's not just a gap in vIOMMU handling. It's
> a dependency on IOMMUFD, regardless of whether the IOAS that the migrated
> device is currently attached to is GPA or GIOVA. The device state can
> be restored only after IOMMUFD is fully recovered and the device is
> re-attached to the IOAS.
>
> We need a way for the migration driver to advertise such a dependency to
> the user.


Since this part is new to me, I may need further guidance from you and
other community experts on how to resolve the dependency.

Thanks.


>
>> IDK, all of this looks like it is trying really hard to hackily force
>> HW that was never meant to support live migration to somehow do
>> something that looks like it.
>>
>> You really need to present an explanation in the VFIO driver comments
>> about how this whole scheme actually works and is secure and
>> functional against a hostile guest.
>>
> Agree. And please post the next version to the VFIO community to gain
> more attention.


I'll add more comments about the whole scheme and post the next version to
the VFIO community.

Thank you Jason and Kevin for the valuable feedback.

Thanks.
Yahui.



* Re: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-03  5:27       ` Cao, Yahui
@ 2023-07-03 21:03         ` Jason Gunthorpe
  2023-07-04  7:35           ` Tian, Kevin
  0 siblings, 1 reply; 4+ messages in thread
From: Jason Gunthorpe @ 2023-07-03 21:03 UTC (permalink / raw)
  To: Cao, Yahui
  Cc: Tian, Kevin, Liu, Lingyu, intel-wired-lan@lists.osuosl.org,
	Liu, Yi L, kvm@vger.kernel.org

On Mon, Jul 03, 2023 at 01:27:51PM +0800, Cao, Yahui wrote:

> > > You can't call VFIO functions from a netdev driver. All this code
> > > needs to be moved into the variant driver.
> 
> Will move vfio_dma_rw() into the VFIO driver and pass a callback function
> into the netdev driver.

Please use proper layering; you should not need to stitch your driver
together with weird function pointers.
 
> > > We also don't have any guarantees defined for the VFIO protocol about
> > > what state the vIOMMU will be in prior to reaching RUNNING.
> > This is a good point. Actually it's not just a gap in vIOMMU handling. It's
> > a dependency on IOMMUFD, regardless of whether the IOAS that the migrated
> > device is currently attached to is GPA or GIOVA. The device state can
> > be restored only after IOMMUFD is fully recovered and the device is
> > re-attached to the IOAS.
> >
> > We need a way for the migration driver to advertise such a dependency to
> > the user.
> 
> Since this part is new to me, I may need further guidance from you and
> other community experts on how to resolve the dependency.

Personally I'm quite uncomfortable with a driver that tries to work
this way, and I'm not sure we should encourage this. Can Intel really make
a convincing case that this is safe and correct?

Jason


* RE: [Intel-wired-lan] [PATCH iwl-next V2 10/15] ice: save and restore TX queue head
  2023-07-03 21:03         ` Jason Gunthorpe
@ 2023-07-04  7:35           ` Tian, Kevin
  0 siblings, 0 replies; 4+ messages in thread
From: Tian, Kevin @ 2023-07-04  7:35 UTC (permalink / raw)
  To: Jason Gunthorpe, Cao, Yahui
  Cc: Liu, Lingyu, intel-wired-lan@lists.osuosl.org, Liu, Yi L,
	kvm@vger.kernel.org

> From: Jason Gunthorpe <jgg@nvidia.com>
> Sent: Tuesday, July 4, 2023 5:04 AM
> 
> On Mon, Jul 03, 2023 at 01:27:51PM +0800, Cao, Yahui wrote:
> 
> > > > We also don't have any guarantees defined for the VFIO protocol about
> > > > what state the vIOMMU will be in prior to reaching RUNNING.
> > > This is a good point. Actually it's not just a gap in vIOMMU handling. It's
> > > a dependency on IOMMUFD, regardless of whether the IOAS that the migrated
> > > device is currently attached to is GPA or GIOVA. The device state can
> > > be restored only after IOMMUFD is fully recovered and the device is
> > > re-attached to the IOAS.
> > >
> > > We need a way for the migration driver to advertise such a dependency to
> > > the user.
> >
> > Since this part is new to me, I may need further guidance from you and
> > other community experts on how to resolve the dependency.
> 
> Personally I'm quite uncomfortable with a driver that tries to work
> this way, and I'm not sure we should encourage this. Can Intel really make
> a convincing case that this is safe and correct?
> 

I dislike it too. I will discuss the correctness of this approach with Yahui,
along with any cleaner alternative.
