From: Fabiano Rosas <farosas@suse.de>
To: Jack Wang <jinpu.wang@ionos.com>,
	qemu-devel@nongnu.org, peterx@redhat.com
Cc: Li Zhijian <lizhijian@fujitsu.com>, Yu Zhang <yu.zhang@ionos.com>,
	linux-rdma@vger.kernel.org, michael@flatgalaxy.com,
	Michael Galaxy <mrgalaxy@nvidia.com>
Subject: Re: [PATCH] migration/rdma: Remove qemu_rdma_broken_ipv6_kernel
Date: Tue, 01 Apr 2025 16:32:55 -0300
Message-ID: <87cydvllso.fsf@suse.de>
In-Reply-To: <20250327061123.14453-1-jinpu.wang@ionos.com>

Jack Wang <jinpu.wang@ionos.com> writes:

> I hit the following error while testing migration in a pure RoCE environment:
> "-incoming rdma:[::]:8089: RDMA ERROR: You only have RoCE / iWARP devices in your
> systems and your management software has specified '[::]', but IPv6 over RoCE /
> iWARP is not supported in Linux."
>
> In our setup, the target host binds RDMA on IPv6 while the source connects
> over IPv4. With qemu_rdma_broken_ipv6_kernel removed, migration works
> fine.
>
> Checking the git history, the function was added when RDMA migration was
> introduced, more than 10 years ago. linux-rdma has improved IPv6 support
> for RoCE/iWARP over the years since. A few fixes from 2016 seem
> related to the issue, e.g.:
> aeb76df46d11 ("IB/core: Set routable RoCE gid type for ipv4/ipv6 networks")
>
> Other fixes from 2018, e.g.:
> 052eac6eeb56 RDMA/cma: Update RoCE multicast routines to use net namespace
> 8d20a1f0ecd5 RDMA/cma: Fix rdma_cm raw IB path setting for RoCE
> 9327c7afdce3 RDMA/cma: Provide a function to set RoCE path record L2 parameters
> 5c181bda77f4 RDMA/cma: Set default GID type as RoCE when resolving RoCE route
> 3c7f67d1880d IB/cma: Fix default RoCE type setting
> be1d325a3358 IB/core: Set RoCEv2 MGID according to spec
> 63a5f483af0e IB/cma: Set default gid type to RoCEv2
>
> So remove the outdated function and its usage.
>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: Li Zhijian <lizhijian@fujitsu.com>
> Cc: Yu Zhang <yu.zhang@ionos.com>
> Cc: qemu-devel@nongnu.org
> Cc: linux-rdma@vger.kernel.org
> Cc: michael@flatgalaxy.com
> Signed-off-by: Jack Wang <jinpu.wang@ionos.com>
> Tested-by: Li zhijian <lizhijian@fujitsu.com>
> Reviewed-by: Michael Galaxy <mrgalaxy@nvidia.com>
> ---
> v1: drop RFC, fix build error (zhijian), collect Reviewed-by and Tested-by
>
>  migration/rdma.c | 159 -----------------------------------------------
>  1 file changed, 159 deletions(-)
>
> diff --git a/migration/rdma.c b/migration/rdma.c
> index 76fb0349238a..e228520b8e01 100644
> --- a/migration/rdma.c
> +++ b/migration/rdma.c
> @@ -767,149 +767,6 @@ static void qemu_rdma_dump_gid(const char *who, struct rdma_cm_id *id)
>      trace_qemu_rdma_dump_gid(who, sgid, dgid);
>  }
>  
> -/*
> - * As of now, IPv6 over RoCE / iWARP is not supported by linux.
> - * We will try the next addrinfo struct, and fail if there are
> - * no other valid addresses to bind against.
> - *
> - * If user is listening on '[::]', then we will not have a opened a device
> - * yet and have no way of verifying if the device is RoCE or not.
> - *
> - * In this case, the source VM will throw an error for ALL types of
> - * connections (both IPv4 and IPv6) if the destination machine does not have
> - * a regular infiniband network available for use.
> - *
> - * The only way to guarantee that an error is thrown for broken kernels is
> - * for the management software to choose a *specific* interface at bind time
> - * and validate what time of hardware it is.
> - *
> - * Unfortunately, this puts the user in a fix:
> - *
> - *  If the source VM connects with an IPv4 address without knowing that the
> - *  destination has bound to '[::]' the migration will unconditionally fail
> - *  unless the management software is explicitly listening on the IPv4
> - *  address while using a RoCE-based device.
> - *
> - *  If the source VM connects with an IPv6 address, then we're OK because we can
> - *  throw an error on the source (and similarly on the destination).
> - *
> - *  But in mixed environments, this will be broken for a while until it is fixed
> - *  inside linux.
> - *
> - * We do provide a *tiny* bit of help in this function: We can list all of the
> - * devices in the system and check to see if all the devices are RoCE or
> - * Infiniband.
> - *
> - * If we detect that we have a *pure* RoCE environment, then we can safely
> - * thrown an error even if the management software has specified '[::]' as the
> - * bind address.
> - *
> - * However, if there is are multiple hetergeneous devices, then we cannot make
> - * this assumption and the user just has to be sure they know what they are
> - * doing.
> - *
> - * Patches are being reviewed on linux-rdma.
> - */
> -static int qemu_rdma_broken_ipv6_kernel(struct ibv_context *verbs, Error **errp)
> -{
> -    /* This bug only exists in linux, to our knowledge. */
> -#ifdef CONFIG_LINUX
> -    struct ibv_port_attr port_attr;
> -
> -    /*
> -     * Verbs are only NULL if management has bound to '[::]'.
> -     *
> -     * Let's iterate through all the devices and see if there any pure IB
> -     * devices (non-ethernet).
> -     *
> -     * If not, then we can safely proceed with the migration.
> -     * Otherwise, there are no guarantees until the bug is fixed in linux.
> -     */
> -    if (!verbs) {
> -        int num_devices;
> -        struct ibv_device **dev_list = ibv_get_device_list(&num_devices);
> -        bool roce_found = false;
> -        bool ib_found = false;
> -
> -        for (int x = 0; x < num_devices; x++) {
> -            verbs = ibv_open_device(dev_list[x]);
> -            /*
> -             * ibv_open_device() is not documented to set errno.  If
> -             * it does, it's somebody else's doc bug.  If it doesn't,
> -             * the use of errno below is wrong.
> -             * TODO Find out whether ibv_open_device() sets errno.
> -             */
> -            if (!verbs) {
> -                if (errno == EPERM) {
> -                    continue;
> -                } else {
> -                    error_setg_errno(errp, errno,
> -                                     "could not open RDMA device context");
> -                    return -1;
> -                }
> -            }
> -
> -            if (ibv_query_port(verbs, 1, &port_attr)) {
> -                ibv_close_device(verbs);
> -                error_setg(errp,
> -                           "RDMA ERROR: Could not query initial IB port");
> -                return -1;
> -            }
> -
> -            if (port_attr.link_layer == IBV_LINK_LAYER_INFINIBAND) {
> -                ib_found = true;
> -            } else if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
> -                roce_found = true;
> -            }
> -
> -            ibv_close_device(verbs);
> -
> -        }
> -
> -        if (roce_found) {
> -            if (ib_found) {
> -                warn_report("migrations may fail:"
> -                            " IPv6 over RoCE / iWARP in linux"
> -                            " is broken. But since you appear to have a"
> -                            " mixed RoCE / IB environment, be sure to only"
> -                            " migrate over the IB fabric until the kernel "
> -                            " fixes the bug.");
> -            } else {
> -                error_setg(errp, "RDMA ERROR: "
> -                           "You only have RoCE / iWARP devices in your systems"
> -                           " and your management software has specified '[::]'"
> -                           ", but IPv6 over RoCE / iWARP is not supported in Linux.");
> -                return -1;
> -            }
> -        }
> -
> -        return 0;
> -    }
> -
> -    /*
> -     * If we have a verbs context, that means that some other than '[::]' was
> -     * used by the management software for binding. In which case we can
> -     * actually warn the user about a potentially broken kernel.
> -     */
> -
> -    /* IB ports start with 1, not 0 */
> -    if (ibv_query_port(verbs, 1, &port_attr)) {
> -        error_setg(errp, "RDMA ERROR: Could not query initial IB port");
> -        return -1;
> -    }
> -
> -    if (port_attr.link_layer == IBV_LINK_LAYER_ETHERNET) {
> -        error_setg(errp, "RDMA ERROR: "
> -                   "Linux kernel's RoCE / iWARP does not support IPv6 "
> -                   "(but patches on linux-rdma in progress)");
> -        return -1;
> -    }
> -
> -#endif
> -
> -    return 0;
> -}
> -
>  /*
>   * Figure out which RDMA device corresponds to the requested IP hostname
>   * Also create the initial connection manager identifiers for opening
> @@ -955,7 +812,6 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
>  
>      /* Try all addresses, saving the first error in @err */
>      for (struct rdma_addrinfo *e = res; e != NULL; e = e->ai_next) {
> -        Error **local_errp = err ? NULL : &err;
>  
>          inet_ntop(e->ai_family,
>              &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
> @@ -964,13 +820,6 @@ static int qemu_rdma_resolve_host(RDMAContext *rdma, Error **errp)
>          ret = rdma_resolve_addr(rdma->cm_id, NULL, e->ai_dst_addr,
>                  RDMA_RESOLVE_TIMEOUT_MS);
>          if (ret >= 0) {
> -            if (e->ai_family == AF_INET6) {
> -                ret = qemu_rdma_broken_ipv6_kernel(rdma->cm_id->verbs,
> -                                                   local_errp);
> -                if (ret < 0) {
> -                    continue;
> -                }
> -            }
>              error_free(err);

err is now unused and should be removed entirely. The comment before the
loop needs touching up as well.

>              goto route;
>          }
> @@ -2663,7 +2512,6 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
>  
>      /* Try all addresses, saving the first error in @err */
>      for (e = res; e != NULL; e = e->ai_next) {
> -        Error **local_errp = err ? NULL : &err;
>  
>          inet_ntop(e->ai_family,
>              &((struct sockaddr_in *) e->ai_dst_addr)->sin_addr, ip, sizeof ip);
> @@ -2672,13 +2520,6 @@ static int qemu_rdma_dest_init(RDMAContext *rdma, Error **errp)
>          if (ret < 0) {
>              continue;
>          }
> -        if (e->ai_family == AF_INET6) {
> -            ret = qemu_rdma_broken_ipv6_kernel(listen_id->verbs,
> -                                               local_errp);
> -            if (ret < 0) {
> -                continue;
> -            }
> -        }
>          error_free(err);

Same here.

>          break;
>      }
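
To illustrate the two comments above: once the IPv6 check is gone, nothing
inside either loop sets err, so the loop reduces to "try each address, stop
at the first one that works". The following is only a minimal standalone
sketch of that shape, not the actual patch. struct addr_candidate,
mock_resolve() and first_resolvable() are hypothetical stand-ins for
struct rdma_addrinfo, rdma_resolve_addr() and the loop body, and it assumes
the caller reports one generic error when no address works.

```c
#include <stddef.h>

/* Hypothetical stand-in for struct rdma_addrinfo: only what the loop needs. */
struct addr_candidate {
    int family;                   /* illustrative only, e.g. 4 or 6 */
    int resolvable;               /* nonzero if the mock resolve succeeds */
    struct addr_candidate *next;  /* next candidate, like ai_next */
};

/* Hypothetical stand-in for rdma_resolve_addr(): >= 0 on success, < 0 on failure. */
static int mock_resolve(const struct addr_candidate *c)
{
    return c->resolvable ? 0 : -1;
}

/*
 * Try all addresses and return the first one that resolves, or NULL.
 * No per-iteration error is collected, so no local error variable is
 * needed; the caller is assumed to report a single failure on NULL.
 */
static const struct addr_candidate *
first_resolvable(const struct addr_candidate *head)
{
    for (const struct addr_candidate *e = head; e != NULL; e = e->next) {
        if (mock_resolve(e) >= 0) {
            return e;
        }
    }
    return NULL;
}
```

With the loop in this form, the comment before it would no longer need to
mention saving the first error in @err.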

Thread overview: 3+ messages
2025-03-27  6:11 [PATCH] migration/rdma: Remove qemu_rdma_broken_ipv6_kernel Jack Wang
2025-04-01 19:32 ` Fabiano Rosas [this message]
2025-04-02  5:06   ` Jinpu Wang
