Date: Sun, 23 Jun 2024 16:27:11 -0400
From: Peter Xu
To: "Maciej S. Szmigiero"
Cc: Fabiano Rosas, Alex Williamson, Cédric Le Goater, Eric Blake,
	Markus Armbruster, Daniel P. Berrangé, Avihai Horon, Joao Martins,
	qemu-devel@nongnu.org
Subject: Re: [PATCH v1 00/13] Multifd 🔀 device state transfer support with VFIO consumer

On Tue, Jun 18, 2024 at 06:12:18PM +0200, Maciej S. Szmigiero wrote:
> From: "Maciej S. Szmigiero"
>
> This is an updated v1 patch series of the RFC (v0) series located here:
> https://lore.kernel.org/qemu-devel/cover.1713269378.git.maciej.szmigiero@oracle.com/

OK, I took some hours thinking about this today, and here are some high
level comments for this series.  I'll start with the points that are most
relevant to what Fabiano has already suggested in the other thread, then
add some more:

https://lore.kernel.org/r/20240620212111.29319-1-farosas@suse.de

1. Multifd device state support
===============================

As Fabiano suggested in his RFC post, we may need one more layer of
abstraction to represent VFIO's demand of letting multifd send an
arbitrary buffer to the wire.  This is about more than just "how to pass
the device state buffer to the sender threads".

So far, MultiFDMethods is only about RAM.  The latest master branch,
where Fabiano just merged two more RAM compressors built on top of the
MultiFDMethods model, doesn't change that: they are still all about RAM.
I think it's better to keep it that way, so maybe MultiFDMethods should
some day be renamed to MultiFDRamMethods.

multifd_send_fill_packet() may only be suitable for RAM buffers, not
ad-hoc buffers like the ones VFIO is using.
multifd_send_zero_page_detect() may not be needed for arbitrary buffers
either; most of these helpers are still page-based.

I think it also means we shouldn't call ->send_prepare() when the multifd
send thread notices that it's going to send a VFIO buffer.  It should
look something like this:

    int type = multifd_payload_type(p->data);

    if (type == MULTIFD_PAYLOAD_RAM) {
        multifd_send_state->ops->send_prepare(p, &local_err);
    } else {
        /* VFIO buffers should belong here */
        assert(type == MULTIFD_PAYLOAD_DEVICE_STATE);
        ...
    }
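(Purely for illustration, a minimal sketch of the layering I have in
mind -- the names below are made up, loosely following Fabiano's RFC,
not an existing or proposed API.  The point is only that p->data becomes
a tagged payload, so the send thread can dispatch on the payload type
without going through the RAM-only hooks:)

    /* Sketch only: illustrative type and field names, not actual QEMU code. */
    typedef enum {
        MULTIFD_PAYLOAD_NONE,
        MULTIFD_PAYLOAD_RAM,
        MULTIFD_PAYLOAD_DEVICE_STATE,
    } MultiFDPayloadType;

    typedef struct {
        MultiFDPayloadType type;
        union {
            MultiFDPages_t ram;        /* the existing page-based payload */
            struct {
                char *idstr;           /* which device the chunk belongs to */
                uint8_t *buf;          /* opaque device state chunk */
                size_t len;
            } device_state;
        } u;
    } MultiFDSendData;

    static inline MultiFDPayloadType multifd_payload_type(MultiFDSendData *d)
    {
        return d->type;
    }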
It also means it shouldn't contain code like:

    nocomp_send_prepare():
        if (p->is_device_state_job) {
            return nocomp_send_prepare_device_state(p, errp);
        } else {
            return nocomp_send_prepare_ram(p, errp);
        }

nocomp should only exist in the RAM world, not in VFIO's.

It looks like you agree with Fabiano's RFC proposal, so please work on
top of that to provide this layer.  Please also make sure that
"git grep device_state migration/multifd.c" outputs as little as
possible when you work on the new version.  Currently:

    $ git grep device_state migration/multifd.c | wc -l
    59

The hope is zero, or at least a minimum with good reasons for whatever
remains.

2. Frequent mallocs/frees
=========================

Fabiano's series can also help address some of these, but it looks like
this series uses malloc/free for more than just the opaque data buffer.
This is not required to get things merged, but it would be nice to avoid
those allocations where possible.

3. load_state_buffer() and VFIODeviceStatePacket protocol
=========================================================

VFIODeviceStatePacket is the new protocol you introduced into multifd
packets, along with the new load_state_buffer() hook for loading such
buffers.

My question is whether it's needed at all, or whether it would be more
generic (and also easier) to just allow any device state in the multifd
packets, then load it with vmstate_load().

I mean, vmstate_load() should really have worked on these buffers, since
after all what VFIO expects is: (1) VFIO_MIG_FLAG_DEV_DATA_STATE as the
first flag (uint64), the size as the second, then (2) the rest of the
buffer to load into the VFIO kernel driver.  That is exactly what
happens during the blackout window, so it's not clear to me why
load_state_buffer() is needed.  I also see that you're using exactly the
same chunk size for such buffering (VFIOMigration.data_buffer_size).

I think you do have a "reason": VFIODeviceStatePacket and its buffer
loading resolve one major issue that wasn't there before but exists now:
multifd allows concurrent arrival of VFIO buffers, even though the
buffers *must* be loaded sequentially.  That's a major pain point of the
current VFIO kernel ioctl design, IMHO.  I think I once asked the NVIDIA
people whether the VFIO get_state/set_state interface could allow
indexing or tagging of buffers, but I never got a real answer.  IMHO
that would be extremely helpful for migration concurrency, rather than
relying on a single serialized buffer; as it stands, concurrently
saving/loading one VFIO device could be extremely hard, if not
impossible.

Now, IIUC, your series resolves that with vfio_load_bufs_thread(), which
holds off the load process until the buffers have arrived in sequence.
I think that causes one issue that I'll mention below as a separate
topic.

Besides that, my point is that this is not a reason to introduce
VFIODeviceStatePacket, load_state_buffer() and so on.  My understanding
is that we do need a way to re-serialize the buffers, but it doesn't
need load_state_buffer(); instead we can call vmstate_load() in order,
properly invoking vfio_load_state() with the right buffers.  It would be
nice if VFIO could keep its "load state" logic in one place.

One benefit of such a more generic framework is that QEMU can easily
extend this infrastructure to other device states, so that logically we
could also consider sending non-VFIO device states in the multifd
buffers.  With your current solution, on the other hand, we need new
structures, new hooks, and a lot of new code, while solving fewer
problems.  That's not optimal.
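(To show roughly what I mean, here is a hand-wavy sketch.  The helper
name is made up, it assumes the VFIO_MIG_FLAG_* framing is visible to
the code doing this, it would have to live in migration core since it
uses savevm's vmstate_load(), and it skips error handling; it is not a
concrete API proposal.  The idea is just to wrap each re-ordered chunk
back into the stream framing that vfio_load_state() already parses,
using an in-memory QEMUFile, then push it through the normal vmstate
load path:)

    static int vfio_feed_chunk_via_vmstate(SaveStateEntry *se,
                                           const uint8_t *chunk, size_t len)
    {
        QIOChannelBuffer *bioc;
        QEMUFile *fout, *fin;
        int ret;

        bioc = qio_channel_buffer_new(len + 3 * sizeof(uint64_t));
        fout = qemu_file_new_output(QIO_CHANNEL(bioc));

        /* Reproduce the framing written by vfio_save_block() & friends. */
        qemu_put_be64(fout, VFIO_MIG_FLAG_DEV_DATA_STATE);
        qemu_put_be64(fout, len);
        qemu_put_buffer(fout, chunk, len);
        qemu_put_be64(fout, VFIO_MIG_FLAG_END_OF_STATE);
        qemu_fflush(fout);

        /* Replay it through the usual load path, i.e. vfio_load_state(). */
        qio_channel_io_seek(QIO_CHANNEL(bioc), 0, 0, NULL);
        fin = qemu_file_new_input(QIO_CHANNEL(bioc));
        ret = vmstate_load(fin, se);

        qemu_fclose(fin);
        qemu_fclose(fout);
        object_unref(OBJECT(bioc));
        return ret;
    }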
4. Risk of OOM on unlimited VFIO buffering
==========================================

This follows from the bullet above, but my real question here is: how
does VFIO guarantee that buffering the VFIO state cannot cause an OOM
condition?  I mean, your current proposal uses vfio_load_bufs_thread()
as a separate thread that only loads the VFIO state once sequential data
has been received, but is there an upper limit on how much buffering it
can do?  IOW:

    vfio_load_state_buffer():
        if (packet->idx >= migration->load_bufs->len) {
            g_array_set_size(migration->load_bufs, packet->idx + 1);
        }

        lb = &g_array_index(migration->load_bufs, typeof(*lb), packet->idx);
        ...

        lb->data = g_memdup2(&packet->data, data_size - sizeof(*packet));
        lb->len = data_size - sizeof(*packet);
        lb->is_present = true;

What if the GArray keeps growing, with lb->data allocated for each
entry, until it hits the memcg limit of the process (if QEMU runs inside
one)?  Or it could simply deplete host memory and cause an OOM kill.  I
think we need to find a way to throttle the maximum memory usage of such
buffering.

So far this will admittedly be more of a problem if it happens during
the VFIO iteration phases, but I still hope a solution can work for both
the iteration phase and the switchover phase, even if you only do this
in the switchover phase (and I don't know why you don't care about the
VFIO iteration phase, if you care enough about how VFIO works with
migration now; that should literally help VFIO migrate faster on 25G+
networks, with or without a shorter blackout window).

5. Worker thread model
======================

So far I'm not happy with how this proposal creates the threads, nor
with the two new hooks that mostly exist just to create them.  I know I
suggested that, but that was in comparison to what I read in the even
earlier version, and sorry I wasn't able to suggest something better at
the time, simply because I had thought about it less.

As I mentioned in the other reply elsewhere, I think we should first
have these threads ready to take data at the start of migration, so that
this keeps working when someone adds VFIO iteration support.  Then the
jobs (mostly what vfio_save_complete_precopy_async_thread() does now)
can be enqueued into the thread pool.

It's better to have the thread pool owned by migration rather than
threads owned by VFIO, because that also paves the way for non-VFIO
device state saves, as I mentioned above regarding the multifd packet
header.  Maybe we can have a flag in the packet header saying "this is
device xxx's state, just load it".

I'd start by looking at util/thread-pool.c, removing all the AIO
implications and simply providing a raw thread pool that does what
thread_pool_submit() does today.

I know this is a lot, but I really think it's the right thing to do; we
can discuss it, and you can correct me where I'm wrong.  If you want, I
can have a look at this pool model and prepare a patch, so that you can
work on other VFIO-relevant stuff and pick it up later, if that helps
you reach the goal of landing all of this in 9.1.

I hope this email more or less explains why I think this feature is more
involved than it looks, and not yet mature in its design.  And I hope
I'm not purely asking for too much: merging this VFIO series first and
then refactoring on top would mean dropping a lot of unneeded code right
after adding it, not to mention that the protocol would need yet another
break.  It just doesn't sound like the right thing to do.

Thanks,

-- 
Peter Xu