From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from lists.gnu.org (lists.gnu.org [209.51.188.17]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by smtp.lore.kernel.org (Postfix) with ESMTPS id 587EEC001DF for ; Tue, 1 Aug 2023 12:04:12 +0000 (UTC) Received: from localhost ([::1] helo=lists1p.gnu.org) by lists.gnu.org with esmtp (Exim 4.90_1) (envelope-from ) id 1qQo5n-0007Nv-Ii; Tue, 01 Aug 2023 08:03:11 -0400 Received: from eggs.gnu.org ([2001:470:142:3::10]) by lists.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qQo5h-0007Nc-Rc for qemu-devel@nongnu.org; Tue, 01 Aug 2023 08:03:05 -0400 Received: from us-smtp-delivery-124.mimecast.com ([170.10.133.124]) by eggs.gnu.org with esmtps (TLS1.2:ECDHE_RSA_AES_256_GCM_SHA384:256) (Exim 4.90_1) (envelope-from ) id 1qQo5f-00076m-75 for qemu-devel@nongnu.org; Tue, 01 Aug 2023 08:03:05 -0400 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1690891381; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=1/QchisxrKi3bQEZQgacy8ScnR3A9D+zsdafodyi1gU=; b=F1itARALQDDUt5RFgiTDOYW19bta8TT7OOKkehfKug16qhsJMKQGWVOBJ98GClETYJ0hJZ VpOsLJGaWyf7xh1oafp6JvmBUABl680x9XMxITPzlRfKMMW8yl4bE5u3Gb4PdITNbVWDX+ EHfKBKtH0tcQFMMDnqaGgDIi6IGyFKk= Received: from mail-qv1-f69.google.com (mail-qv1-f69.google.com [209.85.219.69]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-148-D17KcuW_Mqu4ZI6fpHw_cw-1; Tue, 01 Aug 2023 08:02:59 -0400 X-MC-Unique: D17KcuW_Mqu4ZI6fpHw_cw-1 Received: by mail-qv1-f69.google.com with SMTP id 6a1803df08f44-63cfe46bbb6so62885426d6.2 for ; Tue, 01 Aug 2023 05:02:59 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1690891379; x=1691496179; h=content-transfer-encoding:in-reply-to:from:references:cc:to :content-language:subject:user-agent:mime-version:date:message-id :x-gm-message-state:from:to:cc:subject:date:message-id:reply-to; bh=1/QchisxrKi3bQEZQgacy8ScnR3A9D+zsdafodyi1gU=; b=eY4TSU5klM8a0caKuiSt6m7HRaJci+lCVZVmGDsbU/a9eLHIytGGjvGQ2T+iEfBEgR KGJwtAvsG8D2yHA5uSeSRhddJ6vLS7qBAhag6rsp1kl1324yXR+ox5HGZj96LEIZihP4 V+vx+yGnWK0/4s78dgDcbx7NG45I7PApn7aBKsRNbNGF6wfxRy7WemNA0q2T2878GkIG gp+B5y3pieRZQWpwwlxivIsdexLqEKL4T8Ttkt5D1pMATe9yskDicxh4JI016wdUkKuS vWdElP7KAmL/sLhgIVZ2CldL6Po38pzgTwEnqKUEHZz3ZtDFTTSKgC8nKY76SL14YNOW ovEg== X-Gm-Message-State: ABy/qLY+/kXPlkyVGMFgjdaTar3BbKjEppE+6LNDUW+S0qMz+Gbc3hTN K3nwF0AEuwoO3VseuAkGxrtMc4+3zOmN6VJ39vQGCdiTsCAkMjotRknz4/H+kL9mFbQq5+5LQ+/ IhkSJYElyj0GQsGc= X-Received: by 2002:a0c:b389:0:b0:636:d181:46ec with SMTP id t9-20020a0cb389000000b00636d18146ecmr11276565qve.32.1690891378750; Tue, 01 Aug 2023 05:02:58 -0700 (PDT) X-Google-Smtp-Source: APBJJlFqBke1INEaR9HeeezNmaN8lZ+gmeEJsHBrXaWxsnyrBCzbRxGhVKy8cEDmC4RoCuyfUFsYcA== X-Received: by 2002:a0c:b389:0:b0:636:d181:46ec with SMTP id t9-20020a0cb389000000b00636d18146ecmr11276524qve.32.1690891378131; Tue, 01 Aug 2023 05:02:58 -0700 (PDT) Received: from ?IPV6:2a01:e0a:280:24f0:9db0:474c:ff43:9f5c? ([2a01:e0a:280:24f0:9db0:474c:ff43:9f5c]) by smtp.gmail.com with ESMTPSA id l6-20020a0cac06000000b0063d33bea5cdsm4575996qvb.92.2023.08.01.05.02.56 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 01 Aug 2023 05:02:57 -0700 (PDT) Message-ID: <977f0fae-5dca-c56c-e740-739e2a984a7e@redhat.com> Date: Tue, 1 Aug 2023 14:02:54 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.13.0 Subject: Re: [PATCH for-8.2 v2 5/6] vfio/migration: Add P2P support for VFIO migration Content-Language: en-US To: Avihai Horon , qemu-devel@nongnu.org Cc: Alex Williamson , Paolo Bonzini , Joao Martins , Yishai Hadas , Jason Gunthorpe , Maor Gottlieb , Kirti Wankhede , Tarun Gupta References: <20230731102521.15335-1-avihaih@nvidia.com> <20230731102521.15335-6-avihaih@nvidia.com> From: =?UTF-8?Q?C=c3=a9dric_Le_Goater?= In-Reply-To: <20230731102521.15335-6-avihaih@nvidia.com> Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit Received-SPF: pass client-ip=170.10.133.124; envelope-from=clg@redhat.com; helo=us-smtp-delivery-124.mimecast.com X-Spam_score_int: -21 X-Spam_score: -2.2 X-Spam_bar: -- X-Spam_report: (-2.2 / 5.0 requ) BAYES_00=-1.9, DKIMWL_WL_HIGH=-0.001, DKIM_SIGNED=0.1, DKIM_VALID=-0.1, DKIM_VALID_AU=-0.1, DKIM_VALID_EF=-0.1, NICE_REPLY_A=-0.092, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H4=0.001, RCVD_IN_MSPIKE_WL=0.001, SPF_HELO_NONE=0.001, SPF_PASS=-0.001, T_SCC_BODY_TEXT_LINE=-0.01 autolearn=ham autolearn_force=no X-Spam_action: no action X-BeenThere: qemu-devel@nongnu.org X-Mailman-Version: 2.1.29 Precedence: list List-Id: List-Unsubscribe: , List-Archive: List-Post: List-Help: List-Subscribe: , Errors-To: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org Sender: qemu-devel-bounces+qemu-devel=archiver.kernel.org@nongnu.org On 7/31/23 12:25, Avihai Horon wrote: > VFIO migration uAPI defines an optional intermediate P2P quiescent > state. While in the P2P quiescent state, P2P DMA transactions cannot be > initiated by the device, but the device can respond to incoming ones. > Additionally, all outstanding P2P transactions are guaranteed to have > been completed by the time the device enters this state. > > The purpose of this state is to support migration of multiple devices > that might do P2P transactions between themselves. > > Add support for P2P migration by transitioning all the devices to the > P2P quiescent state before stopping or starting the devices. Use the new > VMChangeStateHandler prepare_cb to achieve that behavior. > > This will allow migration of multiple VFIO devices if all of them > support P2P migration. > > Signed-off-by: Avihai Horon > --- > docs/devel/vfio-migration.rst | 93 +++++++++++++++++++++-------------- > hw/vfio/common.c | 6 ++- > hw/vfio/migration.c | 47 ++++++++++++++++-- > hw/vfio/trace-events | 1 + > 4 files changed, 106 insertions(+), 41 deletions(-) > > diff --git a/docs/devel/vfio-migration.rst b/docs/devel/vfio-migration.rst > index b433cb5bb2..605fe60e96 100644 > --- a/docs/devel/vfio-migration.rst > +++ b/docs/devel/vfio-migration.rst > @@ -23,9 +23,21 @@ and recommends that the initial bytes are sent and loaded in the destination > before stopping the source VM. Enabling this migration capability will > guarantee that and thus, can potentially reduce downtime even further. > > -Note that currently VFIO migration is supported only for a single device. This > -is due to VFIO migration's lack of P2P support. However, P2P support is planned > -to be added later on. > +To support migration of multiple devices that might do P2P transactions between > +themselves, VFIO migration uAPI defines an intermediate P2P quiescent state. > +While in the P2P quiescent state, P2P DMA transactions cannot be initiated by > +the device, but the device can respond to incoming ones. Additionally, all > +outstanding P2P transactions are guaranteed to have been completed by the time > +the device enters this state. > + > +All the devices that support P2P migration are first transitioned to the P2P > +quiescent state and only then are they stopped or started. This makes migration > +safe P2P-wise, since starting and stopping the devices is not done atomically > +for all the devices together. > + > +Thus, multiple VFIO devices migration is allowed only if all the devices > +support P2P migration. Single VFIO device migration is allowed regardless of > +P2P migration support. > > A detailed description of the UAPI for VFIO device migration can be found in > the comment for the ``vfio_device_mig_state`` structure in the header file > @@ -132,54 +144,63 @@ will be blocked. > Flow of state changes during Live migration > =========================================== > > -Below is the flow of state change during live migration. > +Below is the state change flow during live migration for a VFIO device that > +supports both precopy and P2P migration. The flow for devices that don't > +support it is similar, except that the relevant states for precopy and P2P are > +skipped. > The values in the parentheses represent the VM state, the migration state, and > the VFIO device state, respectively. > -The text in the square brackets represents the flow if the VFIO device supports > -pre-copy. > > Live migration save path > ------------------------ > > :: > > - QEMU normal running state > - (RUNNING, _NONE, _RUNNING) > - | > + QEMU normal running state > + (RUNNING, _NONE, _RUNNING) > + | > migrate_init spawns migration_thread > - Migration thread then calls each device's .save_setup() > - (RUNNING, _SETUP, _RUNNING [_PRE_COPY]) > - | > - (RUNNING, _ACTIVE, _RUNNING [_PRE_COPY]) > - If device is active, get pending_bytes by .state_pending_{estimate,exact}() > - If total pending_bytes >= threshold_size, call .save_live_iterate() > - [Data of VFIO device for pre-copy phase is copied] > - Iterate till total pending bytes converge and are less than threshold > - | > - On migration completion, vCPU stops and calls .save_live_complete_precopy for > - each active device. The VFIO device is then transitioned into _STOP_COPY state > - (FINISH_MIGRATE, _DEVICE, _STOP_COPY) > - | > - For the VFIO device, iterate in .save_live_complete_precopy until > - pending data is 0 > - (FINISH_MIGRATE, _DEVICE, _STOP) > - | > - (FINISH_MIGRATE, _COMPLETED, _STOP) > - Migraton thread schedules cleanup bottom half and exits > + Migration thread then calls each device's .save_setup() > + (RUNNING, _SETUP, _PRE_COPY) > + | > + (RUNNING, _ACTIVE, _PRE_COPY) > + If device is active, get pending_bytes by .state_pending_{estimate,exact}() > + If total pending_bytes >= threshold_size, call .save_live_iterate() > + Data of VFIO device for pre-copy phase is copied > + Iterate till total pending bytes converge and are less than threshold > + | > + On migration completion, the vCPUs and the VFIO device are stopped > + The VFIO device is first put in P2P quiescent state > + (FINISH_MIGRATE, _ACTIVE, _PRE_COPY_P2P) > + | > + Then the VFIO device is put in _STOP_COPY state > + (FINISH_MIGRATE, _ACTIVE, _STOP_COPY) > + .save_live_complete_precopy() is called for each active device > + For the VFIO device, iterate in .save_live_complete_precopy() until > + pending data is 0 > + | > + (POSTMIGRATE, _COMPLETED, _STOP_COPY) > + Migraton thread schedules cleanup bottom half and exits > + | > + .save_cleanup() is called > + (POSTMIGRATE, _COMPLETED, _STOP) > > Live migration resume path > -------------------------- > > :: > > - Incoming migration calls .load_setup for each device > - (RESTORE_VM, _ACTIVE, _STOP) > - | > - For each device, .load_state is called for that device section data > - (RESTORE_VM, _ACTIVE, _RESUMING) > - | > - At the end, .load_cleanup is called for each device and vCPUs are started > - (RUNNING, _NONE, _RUNNING) > + Incoming migration calls .load_setup() for each device > + (RESTORE_VM, _ACTIVE, _STOP) > + | > + For each device, .load_state() is called for that device section data > + (RESTORE_VM, _ACTIVE, _RESUMING) > + | > + At the end, .load_cleanup() is called for each device and vCPUs are started > + The VFIO device is first put in P2P quiescent state > + (RUNNING, _ACTIVE, _RUNNING_P2P) > + | > + (RUNNING, _NONE, _RUNNING) > > Postcopy > ======== > diff --git a/hw/vfio/common.c b/hw/vfio/common.c > index 16cf79a76c..7c3d636025 100644 > --- a/hw/vfio/common.c > +++ b/hw/vfio/common.c > @@ -441,14 +441,16 @@ bool vfio_device_state_is_running(VFIODevice *vbasedev) > { > VFIOMigration *migration = vbasedev->migration; > > - return migration->device_state == VFIO_DEVICE_STATE_RUNNING; > + return migration->device_state == VFIO_DEVICE_STATE_RUNNING || > + migration->device_state == VFIO_DEVICE_STATE_RUNNING_P2P; > } > > bool vfio_device_state_is_precopy(VFIODevice *vbasedev) > { > VFIOMigration *migration = vbasedev->migration; > > - return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY; > + return migration->device_state == VFIO_DEVICE_STATE_PRE_COPY || > + migration->device_state == VFIO_DEVICE_STATE_PRE_COPY_P2P; > } > > static bool vfio_devices_all_dirty_tracking(VFIOContainer *container) > diff --git a/hw/vfio/migration.c b/hw/vfio/migration.c > index 48f9c23cbe..05b1dac2ed 100644 > --- a/hw/vfio/migration.c > +++ b/hw/vfio/migration.c > @@ -71,8 +71,12 @@ static const char *mig_state_to_str(enum vfio_device_mig_state state) > return "STOP_COPY"; > case VFIO_DEVICE_STATE_RESUMING: > return "RESUMING"; > + case VFIO_DEVICE_STATE_RUNNING_P2P: > + return "RUNNING_P2P"; > case VFIO_DEVICE_STATE_PRE_COPY: > return "PRE_COPY"; > + case VFIO_DEVICE_STATE_PRE_COPY_P2P: > + return "PRE_COPY_P2P"; > default: > return "UNKNOWN STATE"; > } > @@ -652,6 +656,43 @@ static const SaveVMHandlers savevm_vfio_handlers = { > > /* ---------------------------------------------------------------------- */ > > +static void vfio_vmstate_change_prepare(void *opaque, bool running, > + RunState state) > +{ > + VFIODevice *vbasedev = opaque; > + VFIOMigration *migration = vbasedev->migration; > + enum vfio_device_mig_state new_state; > + int ret; > + > + if (!(migration->mig_flags & VFIO_MIGRATION_P2P)) { > + return; > + } > + > + new_state = migration->device_state == VFIO_DEVICE_STATE_PRE_COPY ? > + VFIO_DEVICE_STATE_PRE_COPY_P2P : > + VFIO_DEVICE_STATE_RUNNING_P2P; > + > + /* > + * If setting the device in new_state fails, the device should be reset. > + * To do so, use ERROR state as a recover state. > + */ > + ret = vfio_migration_set_state(vbasedev, new_state, > + VFIO_DEVICE_STATE_ERROR); > + if (ret) { > + /* > + * Migration should be aborted in this case, but vm_state_notify() > + * currently does not support reporting failures. > + */ > + if (migrate_get_current()->to_dst_file) { > + qemu_file_set_error(migrate_get_current()->to_dst_file, ret); > + } > + } > + > + trace_vfio_vmstate_change_prepare(vbasedev->name, running, > + RunState_str(state), > + mig_state_to_str(new_state)); > +} > + > static void vfio_vmstate_change(void *opaque, bool running, RunState state) > { > VFIODevice *vbasedev = opaque; > @@ -798,9 +839,9 @@ static int vfio_migration_init(VFIODevice *vbasedev) > register_savevm_live(id, VMSTATE_INSTANCE_ID_ANY, 1, &savevm_vfio_handlers, > vbasedev); > > - migration->vm_state = qdev_add_vm_change_state_handler(vbasedev->dev, > - vfio_vmstate_change, > - vbasedev); > + migration->vm_state = qdev_add_vm_change_state_handler_full( > + vbasedev->dev, vfio_vmstate_change, vfio_vmstate_change_prepare, > + vbasedev); We could also avoid registering the prepare handler if !VFIO_MIGRATION_P2P. Thanks, C. > migration->migration_state.notify = vfio_migration_state_notifier; > add_migration_state_change_notifier(&migration->migration_state); > > diff --git a/hw/vfio/trace-events b/hw/vfio/trace-events > index ee7509e68e..329736a738 100644 > --- a/hw/vfio/trace-events > +++ b/hw/vfio/trace-events > @@ -167,3 +167,4 @@ vfio_save_setup(const char *name, uint64_t data_buffer_size) " (%s) data buffer > vfio_state_pending_estimate(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 > vfio_state_pending_exact(const char *name, uint64_t precopy, uint64_t postcopy, uint64_t stopcopy_size, uint64_t precopy_init_size, uint64_t precopy_dirty_size) " (%s) precopy 0x%"PRIx64" postcopy 0x%"PRIx64" stopcopy size 0x%"PRIx64" precopy initial size 0x%"PRIx64" precopy dirty size 0x%"PRIx64 > vfio_vmstate_change(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s" > +vfio_vmstate_change_prepare(const char *name, int running, const char *reason, const char *dev_state) " (%s) running %d reason %s device state %s"