From: Yishai Hadas <yishaih@nvidia.com>
To: "Cédric Le Goater" <clg@kaod.org>,
	alex.williamson@redhat.com, jgg@nvidia.com
Cc: <saeedm@nvidia.com>, <kvm@vger.kernel.org>,
	<netdev@vger.kernel.org>, <kuba@kernel.org>,
	<kevin.tian@intel.com>, <joao.m.martins@oracle.com>,
	<leonro@nvidia.com>, <maorg@nvidia.com>, <cohuck@redhat.com>,
	'Avihai Horon' <avihaih@nvidia.com>,
	Tarun Gupta <targupta@nvidia.com>
Subject: Re: [PATCH V7 vfio 07/10] vfio/mlx5: Create and destroy page tracker object
Date: Wed, 6 Sep 2023 12:48:05 +0300
Message-ID: <1b60d2d3-e8b3-b47e-ad4b-e157bcd4bf18@nvidia.com>
In-Reply-To: <9a4ddb8c-a48a-67b0-b8ad-428ee936454e@kaod.org>

On 06/09/2023 11:55, Cédric Le Goater wrote:
> Hello,
>
> On 9/8/22 20:34, Yishai Hadas wrote:
>> Add support for creating and destroying page tracker object.
>>
>> This object is used to control/report the device dirty pages.
>>
>> As part of creating the tracker need to consider the device capabilities
>> for max ranges and adapt/combine ranges accordingly.
>>
>> Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
>> ---
>>   drivers/vfio/pci/mlx5/cmd.c | 147 ++++++++++++++++++++++++++++++++++++
>>   drivers/vfio/pci/mlx5/cmd.h |   1 +
>>   2 files changed, 148 insertions(+)
>>
>> diff --git a/drivers/vfio/pci/mlx5/cmd.c b/drivers/vfio/pci/mlx5/cmd.c
>> index 0a362796d567..f1cad96af6ab 100644
>> --- a/drivers/vfio/pci/mlx5/cmd.c
>> +++ b/drivers/vfio/pci/mlx5/cmd.c
>> @@ -410,6 +410,148 @@ int mlx5vf_cmd_load_vhca_state(struct mlx5vf_pci_core_device *mvdev,
>>       return err;
>>   }
>>
>> +static void combine_ranges(struct rb_root_cached *root, u32 cur_nodes,
>> +               u32 req_nodes)
>> +{
>> +    struct interval_tree_node *prev, *curr, *comb_start, *comb_end;
>> +    unsigned long min_gap;
>> +    unsigned long curr_gap;
>> +
>> +    /* Special shortcut when a single range is required */
>> +    if (req_nodes == 1) {
>> +        unsigned long last;
>> +
>> +        curr = comb_start = interval_tree_iter_first(root, 0, ULONG_MAX);
>> +        while (curr) {
>> +            last = curr->last;
>> +            prev = curr;
>> +            curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
>> +            if (prev != comb_start)
>> +                interval_tree_remove(prev, root);
>> +        }
>> +        comb_start->last = last;
>> +        return;
>> +    }
>> +
>> +    /* Combine ranges which have the smallest gap */
>> +    while (cur_nodes > req_nodes) {
>> +        prev = NULL;
>> +        min_gap = ULONG_MAX;
>> +        curr = interval_tree_iter_first(root, 0, ULONG_MAX);
>> +        while (curr) {
>> +            if (prev) {
>> +                curr_gap = curr->start - prev->last;
>> +                if (curr_gap < min_gap) {
>> +                    min_gap = curr_gap;
>> +                    comb_start = prev;
>> +                    comb_end = curr;
>> +                }
>> +            }
>> +            prev = curr;
>> +            curr = interval_tree_iter_next(curr, 0, ULONG_MAX);
>> +        }
>> +        comb_start->last = comb_end->last;
>> +        interval_tree_remove(comb_end, root);
>> +        cur_nodes--;
>> +    }
>> +}
>> +
>> +static int mlx5vf_create_tracker(struct mlx5_core_dev *mdev,
>> +                 struct mlx5vf_pci_core_device *mvdev,
>> +                 struct rb_root_cached *ranges, u32 nnodes)
>> +{
>> +    int max_num_range =
>> +        MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_max_num_range);
>> +    struct mlx5_vhca_page_tracker *tracker = &mvdev->tracker;
>> +    int record_size = MLX5_ST_SZ_BYTES(page_track_range);
>> +    u32 out[MLX5_ST_SZ_DW(general_obj_out_cmd_hdr)] = {};
>> +    struct interval_tree_node *node = NULL;
>> +    u64 total_ranges_len = 0;
>> +    u32 num_ranges = nnodes;
>> +    u8 log_addr_space_size;
>> +    void *range_list_ptr;
>> +    void *obj_context;
>> +    void *cmd_hdr;
>> +    int inlen;
>> +    void *in;
>> +    int err;
>> +    int i;
>> +
>> +    if (num_ranges > max_num_range) {
>> +        combine_ranges(ranges, nnodes, max_num_range);
>> +        num_ranges = max_num_range;
>> +    }
>> +
>> +    inlen = MLX5_ST_SZ_BYTES(create_page_track_obj_in) +
>> +                 record_size * num_ranges;
>> +    in = kzalloc(inlen, GFP_KERNEL);
>> +    if (!in)
>> +        return -ENOMEM;
>> +
>> +    cmd_hdr = MLX5_ADDR_OF(create_page_track_obj_in, in,
>> +                   general_obj_in_cmd_hdr);
>> +    MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, opcode,
>> +         MLX5_CMD_OP_CREATE_GENERAL_OBJECT);
>> +    MLX5_SET(general_obj_in_cmd_hdr, cmd_hdr, obj_type,
>> +         MLX5_OBJ_TYPE_PAGE_TRACK);
>> +    obj_context = MLX5_ADDR_OF(create_page_track_obj_in, in, obj_context);
>> +    MLX5_SET(page_track, obj_context, vhca_id, mvdev->vhca_id);
>> +    MLX5_SET(page_track, obj_context, track_type, 1);
>> +    MLX5_SET(page_track, obj_context, log_page_size,
>> +         ilog2(tracker->host_qp->tracked_page_size));
>> +    MLX5_SET(page_track, obj_context, log_msg_size,
>> +         ilog2(tracker->host_qp->max_msg_size));
>> +    MLX5_SET(page_track, obj_context, reporting_qpn, tracker->fw_qp->qpn);
>> +    MLX5_SET(page_track, obj_context, num_ranges, num_ranges);
>> +
>> +    range_list_ptr = MLX5_ADDR_OF(page_track, obj_context, track_range);
>> +    node = interval_tree_iter_first(ranges, 0, ULONG_MAX);
>> +    for (i = 0; i < num_ranges; i++) {
>> +        void *addr_range_i_base = range_list_ptr + record_size * i;
>> +        unsigned long length = node->last - node->start;
>> +
>> +        MLX5_SET64(page_track_range, addr_range_i_base, start_address,
>> +               node->start);
>> +        MLX5_SET64(page_track_range, addr_range_i_base, length, length);
>> +        total_ranges_len += length;
>> +        node = interval_tree_iter_next(node, 0, ULONG_MAX);
>> +    }
>> +
>> +    WARN_ON(node);
>> +    log_addr_space_size = ilog2(total_ranges_len);
>> +    if (log_addr_space_size <
>> +        (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_min_addr_space)) ||
>> +        log_addr_space_size >
>> +        (MLX5_CAP_ADV_VIRTUALIZATION(mdev, pg_track_log_max_addr_space))) {
>> +        err = -EOPNOTSUPP;
>> +        goto out;
>> +    }
>
>
> We are seeing an issue with dirty page tracking when doing migration
> of an OVMF VM guest. The vfio-pci variant driver for the MLX5 VF
> device complains when dirty page tracking is initialized from QEMU :
>
>   qemu-kvm: 0000:b1:00.2: Failed to start DMA logging, err -95 (Operation not supported)
>
> The 64-bit computed range is:
>
>   vfio_device_dirty_tracking_start nr_ranges 2 32:[0x0 - 0x807fffff], 64:[0x100000000 - 0x3838000fffff]
>
> which seems to be too large for the HW. AFAICT, the MLX5 HW has a 42-bit
> address space limitation for dirty tracking (min is 12). Is it a
> FW tunable or a strict limitation?

It's mainly a FW limitation.

Tracking an address space larger than 2^42 might take FW a long time to
allocate the required resources, which might end up in a command
timeout, etc.

>
> We should probably introduce more ranges to overcome the issue.

More ranges can help only if the total address space of the given ranges 
is < 2^42.

So, if there are some areas that don't require tracking (why?), breaking
the address space into more ranges with a smaller total size can help.

Yishai



Thread overview: 21+ messages
2022-09-08 18:34 [PATCH V7 vfio 00/10] Add device DMA logging support for mlx5 driver Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 01/10] net/mlx5: Introduce ifc bits for page tracker Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 02/10] net/mlx5: Query ADV_VIRTUALIZATION capabilities Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 03/10] vfio: Introduce DMA logging uAPIs Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 04/10] vfio: Add an IOVA bitmap support Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 05/10] vfio: Introduce the DMA logging feature support Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 06/10] vfio/mlx5: Init QP based resources for dirty tracking Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 07/10] vfio/mlx5: Create and destroy page tracker object Yishai Hadas
2023-09-06  8:55   ` Cédric Le Goater
2023-09-06  9:48     ` Yishai Hadas [this message]
2023-09-06 11:51     ` Jason Gunthorpe
2023-09-06 12:08       ` Joao Martins
2023-09-07  9:56       ` Cédric Le Goater
2023-09-07 10:51         ` Joao Martins
2023-09-07 12:16           ` Cédric Le Goater
2023-09-07 16:33             ` Joao Martins
2023-09-07 17:34               ` Cédric Le Goater
2022-09-08 18:34 ` [PATCH V7 vfio 08/10] vfio/mlx5: Report dirty pages from tracker Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 09/10] vfio/mlx5: Manage error scenarios on tracker Yishai Hadas
2022-09-08 18:34 ` [PATCH V7 vfio 10/10] vfio/mlx5: Set the driver DMA logging callbacks Yishai Hadas
2022-09-08 20:17 ` [PATCH V7 vfio 00/10] Add device DMA logging support for mlx5 driver Alex Williamson
