qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed
From: Steven Sistare <steven.sistare@oracle.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: JAEHOON KIM <jhkim@linux.ibm.com>,
	qemu-devel@nongnu.org, jjherne@linux.ibm.com, peterx@redhat.com,
	farosas@suse.de
Subject: Re: [PATCH v1] migration: Wait for cpr.sock file to appear before connecting
Date: Fri, 6 Jun 2025 13:04:36 -0400	[thread overview]
Message-ID: <b2d90921-0991-4a57-a141-ad0c830f8618@oracle.com> (raw)
In-Reply-To: <aEMR6Xjs8tRJ8_sp@redhat.com>

On 6/6/2025 12:06 PM, Daniel P. Berrangé wrote:
> On Fri, Jun 06, 2025 at 11:50:10AM -0400, Steven Sistare wrote:
>> On 6/6/2025 11:43 AM, Daniel P. Berrangé wrote:
>>> On Fri, Jun 06, 2025 at 10:37:28AM -0500, JAEHOON KIM wrote:
>>>> On 6/6/2025 10:12 AM, Steven Sistare wrote:
>>>>> On 6/6/2025 11:06 AM, JAEHOON KIM wrote:
>>>>>> On 6/6/2025 9:14 AM, Steven Sistare wrote:
>>>>>>> On 6/6/2025 9:53 AM, Daniel P. Berrangé wrote:
>>>>>>>> On Thu, Jun 05, 2025 at 06:08:08PM -0500, Jaehoon Kim wrote:
>>>>>>>>> When the source VM attempts to connect to the destination VM's Unix
>>>>>>>>> domain socket(cpr.sock) during CPR transfer, the socket
>>>>>>>>> file might not
>>>>>>>>> yet be exist if the destination side hasn't completed the bind
>>>>>>>>> operation. This can lead to connection failures when
>>>>>>>>> running tests with
>>>>>>>>> the qtest framework.
>>>>>>>>
>>>>>>>> This sounds like a flawed test impl to me - whatever is initiating
>>>>>>>> the cpr operation on the source has done so prematurely - it should
>>>>>>>> ensure the dest is ready before starting the operation.
>>>>>>>>
>>>>>>>>> To address this, add cpr_validate_socket_path(), which wait for the
>>>>>>>>> socket file to appear. This avoids intermittent qtest
>>>>>>>>> failures caused by
>>>>>>>>> early connection attempts.
>>>>>>>>
>>>>>>>> IMHO it is dubious to special case cpr in this way.
>>>>>>>
>>>>>>> I agree with Daniel, and unfortunately it is not just a test issue;
>>>>>>> every management framework that supports cpr-transfer must add logic to
>>>>>>> wait for the cpr socket to appear in the target before proceeding.
>>>>>>>
>>>>>>> This is analogous to waiting for the monitor socket to appear before
>>>>>>> connecting to it.
>>>>>>>
>>>>>>> - Steve
>>>>>>
>>>>>> Thank you very much for your valuable review and feedback.
>>>>>>
>>>>>> Just to clarify, the added cpr_validate_socket_path() function is
>>>>>> not limited to the test framework.
>>>>>> It is part of the actual CPR implementation and is intended to
>>>>>> ensure correct behavior in all cases, including outside of tests.
>>>>>>
>>>>>> I mentioned the qtest failure simply as an example where this issue
>>>>>> became apparent.
>>>>>
>>>>> Yes, I understand that you understand :)
>>>>> Are you willing to move your fix to the qtest?
>>>>>
>>>>> - Steve
>>>>
>>>> Thank you for your question and feedback.
>>>>
>>>> I agree that the issue could be addressed within the qtest framework to
>>>> improve stability.
>>>>
>>>> However, this socket readiness check is a fundamental part of CPR transfer
>>>> process,
>>>> and I believe that incorporating cpr_validate_socket_path() directly into
>>>> the CPR implementation helps ensure more reliable behavior
>>>> across all environments - not only during testing.
>>>>
>>>> Just from my perspective, adding this logic to the CPR code does not
>>>> introduce significant overhead or side effects.
>>>> I would appreciate if you could share more details about your concerns, so I
>>>> can better address them.
>>>
>>> Requiring a busy wait like this is a sign of a design problem.
>>>
>>> There needs to be a way to setup the incoming socket listener
>>> without resorting to busy waiting - that's showing a lack of
>>> synchronization.
>>
>> How is this a design problem?  If I start a program that creates a listening unix
>> domain socket, I cannot attempt to connect to it until the socket is created and
>> listening. Clients face the same issue when starting qemu and connecting to the
>> monitor socket.
> 
> Yes, the monitor has the same conceptual problem, and this caused problems
> for libvirt starting QEMU for many years.
> 
> With the busy wait you risk looping forever if the child (target) QEMU
> already exited for some reason without ever creating the socket. You
> can mitigate this by using 'kill($PID, 0)' in the loop and looking
> for -ERSCH, but this only works if you know the pid involved.
> 
> One option is to use 'daemonize' such that when the parent sees the initial
> QEMU process leader exit, the parent has a guarantee that the daemonized
> QEMU already has the UNIX listener open, and any failure indicates QEMU
> already quit.
> 
> The other option is to use FD passing such that QEMU is not responsible
> for opening the listener socket - it gets passed a pre-opened listener
> FD, so the parent has a guarantee it can successfull connect immediately
> and any failure indicates QEMU already quit.
> 
> For the tests, passing a pre-opened UNIX socket FD could work, but I'm
> still curious why this is only a problem for the CPR tests, and not
> the other migration tests which don't use 'defer'. What has made CPR
> special to expose a race ?

For normal migration, target qemu listens on the migration socket, then listens
on the monitor.  After the client connects to the monitor (waiting for it to appear
as needed), them the migration socket already exists.

For cpr, target qemu creates the cpr socket and listens before the monitor is
started, which is necessary because cpr state is needed before backends or
devices are created.

A few months back I sent a series to start the monitor first (I think I called
it the precreate phase), but it was derailed over discussions about allowing
qemu to start with no arguments and be configured exclusively via the monitor.

- Steve



  reply	other threads:[~2025-06-06 17:05 UTC|newest]

Thread overview: 25+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-06-05 23:08 [PATCH v1] migration: Wait for cpr.sock file to appear before connecting Jaehoon Kim
2025-06-06 13:40 ` Fabiano Rosas
2025-06-06 14:48   ` JAEHOON KIM
2025-06-06 15:47     ` Daniel P. Berrangé
2025-06-06 13:53 ` Daniel P. Berrangé
2025-06-06 14:14   ` Steven Sistare
2025-06-06 15:06     ` JAEHOON KIM
2025-06-06 15:12       ` Steven Sistare
2025-06-06 15:37         ` JAEHOON KIM
2025-06-06 15:43           ` Daniel P. Berrangé
2025-06-06 15:50             ` Steven Sistare
2025-06-06 16:06               ` Daniel P. Berrangé
2025-06-06 17:04                 ` Steven Sistare [this message]
2025-06-06 18:06                   ` JAEHOON KIM
2025-06-06 19:37                     ` Steven Sistare
2025-06-08 22:01                       ` JAEHOON KIM
2025-06-09  8:06                       ` Daniel P. Berrangé
2025-06-09 13:12                         ` Steven Sistare
2025-06-09 13:20                           ` Daniel P. Berrangé
2025-06-09 13:39                             ` Steven Sistare
2025-06-09 13:48                               ` JAEHOON KIM
2025-06-09 13:50                               ` Daniel P. Berrangé
2025-06-09 14:54                                 ` JAEHOON KIM
2025-06-09 14:57                                   ` Daniel P. Berrangé
2025-06-09 15:32                                     ` JAEHOON KIM

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=b2d90921-0991-4a57-a141-ad0c830f8618@oracle.com \
    --to=steven.sistare@oracle.com \
    --cc=berrange@redhat.com \
    --cc=farosas@suse.de \
    --cc=jhkim@linux.ibm.com \
    --cc=jjherne@linux.ibm.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).