All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: xen | Failed pipeline for staging | 6a47ba2f
       [not found] <644bfbc6939d8_2a49bbb403253f4@gitlab-sidekiq-catchall-v2-78885c497-qxnp2.mail>
@ 2023-04-29  3:05 ` Stefano Stabellini
  2023-04-29 11:41   ` andrew.cooper3
  2023-05-03 14:38   ` dom0less vs xenstored setup race Was: " andrew.cooper3
  0 siblings, 2 replies; 15+ messages in thread
From: Stefano Stabellini @ 2023-04-29  3:05 UTC (permalink / raw)
  To: alejandro.vallejo
  Cc: committers, michal.orzel, xen-devel, andrew.cooper3, sstabellini

[-- Attachment #1: Type: text/plain, Size: 1358 bytes --]

On Fri, 28 Apr 2023, GitLab wrote:
> Pipeline #852233694 triggered by
> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
> Ganis
> had 3 failed jobs
> Failed jobs
> ✖
> test
> qemu-smoke-dom0less-arm64-gcc

This is a real failure on staging. Unfortunately it is intermittent. It
usually happens once every 3-8 tests for me.

The test script is:
automation/scripts/qemu-smoke-dom0less-arm64.sh

and for this test it is invoked without arguments. It is starting 2
dom0less VMs in parallel, then dom0 does a xl network-attach and the
domU is supposed to setup eth0 and ping.

The failure is that nothing happens after "xl network-attach". The domU
never hotplugs any interfaces. I have logs that show that eth0 never
shows up and the only interface is lo no matter how long we wait.


On a hunch, I removed Alejandro patches. Without them, I ran 20 tests
without any failures. I have not investigated further but it looks like
one of these 4 commits is the problem:

2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol 

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-04-29  3:05 ` xen | Failed pipeline for staging | 6a47ba2f Stefano Stabellini
@ 2023-04-29 11:41   ` andrew.cooper3
  2023-04-29 13:34     ` Marek Marczykowski-Górecki
  2023-04-30  1:21     ` Stefano Stabellini
  2023-05-03 14:38   ` dom0less vs xenstored setup race Was: " andrew.cooper3
  1 sibling, 2 replies; 15+ messages in thread
From: andrew.cooper3 @ 2023-04-29 11:41 UTC (permalink / raw)
  To: Stefano Stabellini, alejandro.vallejo; +Cc: committers, michal.orzel, xen-devel

On 29/04/2023 4:05 am, Stefano Stabellini wrote:
> On Fri, 28 Apr 2023, GitLab wrote:
>> Pipeline #852233694 triggered by
>> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
>> Ganis
>> had 3 failed jobs
>> Failed jobs
>> ✖
>> test
>> qemu-smoke-dom0less-arm64-gcc
> This is a real failure on staging. Unfortunately it is intermittent. It
> usually happens once every 3-8 tests for me.
>
> The test script is:
> automation/scripts/qemu-smoke-dom0less-arm64.sh
>
> and for this test it is invoked without arguments. It is starting 2
> dom0less VMs in parallel, then dom0 does a xl network-attach and the
> domU is supposed to setup eth0 and ping.
>
> The failure is that nothing happens after "xl network-attach". The domU
> never hotplugs any interfaces. I have logs that show that eth0 never
> shows up and the only interface is lo no matter how long we wait.
>
>
> On a hunch, I removed Alejandro patches. Without them, I ran 20 tests
> without any failures. I have not investigated further but it looks like
> one of these 4 commits is the problem:
>
> 2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
> 2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
> 2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
> 2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol 

In commit order (reverse of above), these patches are:

1) Modify the python bindings and xenbaked
2) Introduce a new library function with a better API/ABI
3) Modify xenconsoled
4) Modify init-xenstore-domain

The test isn't using anything from 4 or 1, and 2 definitely isn't
breaking anything on its own.

That just leaves 3.  This test does turn activate xenconsoled by virtue
of invoking xencommons, but that doesn't help explain why a change in
xenconsoled interferes (and only intermittently on this one single test)
with `xl network-attach`.

The xenconsoled change does have correctness fix in it, requiring
xenconsoled to ask for all domains info in one go.  This does mean it's
hypercall-buffering (i.e. bouncing) a 4M array now where previously it
was racy figuring out which VMs had come and gone.

~Andrew


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-04-29 11:41   ` andrew.cooper3
@ 2023-04-29 13:34     ` Marek Marczykowski-Górecki
  2023-04-29 19:48       ` andrew.cooper3
  2023-04-30  1:21     ` Stefano Stabellini
  1 sibling, 1 reply; 15+ messages in thread
From: Marek Marczykowski-Górecki @ 2023-04-29 13:34 UTC (permalink / raw)
  To: andrew.cooper3
  Cc: Stefano Stabellini, alejandro.vallejo, committers, michal.orzel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 2671 bytes --]

On Sat, Apr 29, 2023 at 12:41:26PM +0100, andrew.cooper3@citrix.com wrote:
> On 29/04/2023 4:05 am, Stefano Stabellini wrote:
> > On Fri, 28 Apr 2023, GitLab wrote:
> >> Pipeline #852233694 triggered by
> >> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
> >> Ganis
> >> had 3 failed jobs
> >> Failed jobs
> >> ✖
> >> test
> >> qemu-smoke-dom0less-arm64-gcc
> > This is a real failure on staging. Unfortunately it is intermittent. It
> > usually happens once every 3-8 tests for me.
> >
> > The test script is:
> > automation/scripts/qemu-smoke-dom0less-arm64.sh
> >
> > and for this test it is invoked without arguments. It is starting 2
> > dom0less VMs in parallel, then dom0 does a xl network-attach and the
> > domU is supposed to setup eth0 and ping.
> >
> > The failure is that nothing happens after "xl network-attach". The domU
> > never hotplugs any interfaces. I have logs that show that eth0 never
> > shows up and the only interface is lo no matter how long we wait.
> >
> >
> > On a hunch, I removed Alejandro patches. Without them, I ran 20 tests
> > without any failures. I have not investigated further but it looks like
> > one of these 4 commits is the problem:
> >
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol 
> 
> In commit order (reverse of above), these patches are:
> 
> 1) Modify the python bindings and xenbaked
> 2) Introduce a new library function with a better API/ABI
> 3) Modify xenconsoled
> 4) Modify init-xenstore-domain
> 
> The test isn't using anything from 4 or 1, and 2 definitely isn't
> breaking anything on its own.
> 
> That just leaves 3.  This test does turn activate xenconsoled by virtue
> of invoking xencommons, but that doesn't help explain why a change in
> xenconsoled interferes (and only intermittently on this one single test)
> with `xl network-attach`.
> 
> The xenconsoled change does have correctness fix in it, requiring
> xenconsoled to ask for all domains info in one go.  This does mean it's
> hypercall-buffering (i.e. bouncing) a 4M array now where previously it
> was racy figuring out which VMs had come and gone.

Can it be that xl network-attach fails and that failure is silently
ignored by the test?

-- 
Best Regards,
Marek Marczykowski-Górecki
Invisible Things Lab

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-04-29 13:34     ` Marek Marczykowski-Górecki
@ 2023-04-29 19:48       ` andrew.cooper3
  0 siblings, 0 replies; 15+ messages in thread
From: andrew.cooper3 @ 2023-04-29 19:48 UTC (permalink / raw)
  To: Marek Marczykowski-Górecki
  Cc: Stefano Stabellini, alejandro.vallejo, committers, michal.orzel,
	xen-devel

On 29/04/2023 2:34 pm, Marek Marczykowski-Górecki wrote:
> On Sat, Apr 29, 2023 at 12:41:26PM +0100, andrew.cooper3@citrix.com wrote:
>> On 29/04/2023 4:05 am, Stefano Stabellini wrote:
>>> On Fri, 28 Apr 2023, GitLab wrote:
>>>> Pipeline #852233694 triggered by
>>>> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
>>>> Ganis
>>>> had 3 failed jobs
>>>> Failed jobs
>>>> ✖
>>>> test
>>>> qemu-smoke-dom0less-arm64-gcc
>>> This is a real failure on staging. Unfortunately it is intermittent. It
>>> usually happens once every 3-8 tests for me.
>>>
>>> The test script is:
>>> automation/scripts/qemu-smoke-dom0less-arm64.sh
>>>
>>> and for this test it is invoked without arguments. It is starting 2
>>> dom0less VMs in parallel, then dom0 does a xl network-attach and the
>>> domU is supposed to setup eth0 and ping.
>>>
>>> The failure is that nothing happens after "xl network-attach". The domU
>>> never hotplugs any interfaces. I have logs that show that eth0 never
>>> shows up and the only interface is lo no matter how long we wait.
>>>
>>>
>>> On a hunch, I removed Alejandro patches. Without them, I ran 20 tests
>>> without any failures. I have not investigated further but it looks like
>>> one of these 4 commits is the problem:
>>>
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
>>> 2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol 
>> In commit order (reverse of above), these patches are:
>>
>> 1) Modify the python bindings and xenbaked
>> 2) Introduce a new library function with a better API/ABI
>> 3) Modify xenconsoled
>> 4) Modify init-xenstore-domain
>>
>> The test isn't using anything from 4 or 1, and 2 definitely isn't
>> breaking anything on its own.
>>
>> That just leaves 3.  This test does turn activate xenconsoled by virtue
>> of invoking xencommons, but that doesn't help explain why a change in
>> xenconsoled interferes (and only intermittently on this one single test)
>> with `xl network-attach`.
>>
>> The xenconsoled change does have correctness fix in it, requiring
>> xenconsoled to ask for all domains info in one go.  This does mean it's
>> hypercall-buffering (i.e. bouncing) a 4M array now where previously it
>> was racy figuring out which VMs had come and gone.
> Can it be that xl network-attach fails and that failure is silently
> ignored by the test?

Well, it's ultimately doing a ping test between the two VMs, so the
network-attach is rather important.  I don't see an obviously way for us
to get false negatives like this.

~Andrew


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-04-29 11:41   ` andrew.cooper3
  2023-04-29 13:34     ` Marek Marczykowski-Górecki
@ 2023-04-30  1:21     ` Stefano Stabellini
  2023-05-02  3:59       ` Stefano Stabellini
  1 sibling, 1 reply; 15+ messages in thread
From: Stefano Stabellini @ 2023-04-30  1:21 UTC (permalink / raw)
  To: andrew.cooper3
  Cc: Stefano Stabellini, alejandro.vallejo, committers, michal.orzel,
	xen-devel

[-- Attachment #1: Type: text/plain, Size: 3157 bytes --]

On Sat, 29 Apr 2023, andrew.cooper3@citrix.com wrote:
> On 29/04/2023 4:05 am, Stefano Stabellini wrote:
> > On Fri, 28 Apr 2023, GitLab wrote:
> >> Pipeline #852233694 triggered by
> >> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon]
> >> Ganis
> >> had 3 failed jobs
> >> Failed jobs
> >> ✖
> >> test
> >> qemu-smoke-dom0less-arm64-gcc
> > This is a real failure on staging. Unfortunately it is intermittent. It
> > usually happens once every 3-8 tests for me.
> >
> > The test script is:
> > automation/scripts/qemu-smoke-dom0less-arm64.sh
> >
> > and for this test it is invoked without arguments. It is starting 2
> > dom0less VMs in parallel, then dom0 does a xl network-attach and the
> > domU is supposed to setup eth0 and ping.
> >
> > The failure is that nothing happens after "xl network-attach". The domU
> > never hotplugs any interfaces. I have logs that show that eth0 never
> > shows up and the only interface is lo no matter how long we wait.
> >
> >
> > On a hunch, I removed Alejandro patches. Without them, I ran 20 tests
> > without any failures. I have not investigated further but it looks like
> > one of these 4 commits is the problem:
> >
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Make init-xenstore-domain use xc_domain_getinfolist()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Refactor console/io.c to avoid using xc_domain_getinfo()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Create xc_domain_getinfo_single()
> > 2023-04-28 11:41 Alejandro Vallejo    tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol 
> 
> In commit order (reverse of above), these patches are:
> 
> 1) Modify the python bindings and xenbaked
> 2) Introduce a new library function with a better API/ABI
> 3) Modify xenconsoled
> 4) Modify init-xenstore-domain
> 
> The test isn't using anything from 4 or 1, and 2 definitely isn't
> breaking anything on its own.
> 
> That just leaves 3.  This test does turn activate xenconsoled by virtue
> of invoking xencommons, but that doesn't help explain why a change in
> xenconsoled interferes (and only intermittently on this one single test)
> with `xl network-attach`.
> 
> The xenconsoled change does have correctness fix in it, requiring
> xenconsoled to ask for all domains info in one go.  This does mean it's
> hypercall-buffering (i.e. bouncing) a 4M array now where previously it
> was racy figuring out which VMs had come and gone.

Your guess was correct. I have done more bisecting today. The culprit is
the following commit (I reverted only this commit and ran 25 tests
successfully, usually it fails in less than 5):

e522c98c3    tools: Refactor console/io.c to avoid using xc_domain_getinfo()

I don't know why. Traditionally if this was OSSTest we would revert the
commit until we understand what is going on to unblock master/staging. I
suggest to do the same here to be consistent. And also because otherwise
future failures in this test due to other bugs might be masked by this
unsolved issue.

I have nothing against this commit and I'd be happy for it to go in
again as soon as things are not necessarely resolved, but at least
better understood.

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-04-30  1:21     ` Stefano Stabellini
@ 2023-05-02  3:59       ` Stefano Stabellini
  2023-05-02  6:53         ` Jan Beulich
  0 siblings, 1 reply; 15+ messages in thread
From: Stefano Stabellini @ 2023-05-02  3:59 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: andrew.cooper3, alejandro.vallejo, committers, michal.orzel,
	xen-devel

On Sat, 29 Apr 2023, Stefano Stabellini wrote:
> Your guess was correct. I have done more bisecting today. The culprit is
> the following commit (I reverted only this commit and ran 25 tests
> successfully, usually it fails in less than 5):
> 
> e522c98c3    tools: Refactor console/io.c to avoid using xc_domain_getinfo()

I did more debugging. One problem seems to be that
XEN_SYSCTL_getdomaininfolist is buggy in the hypervisor. The field
u.getdomaininfolist.num_domains is not copied back to the guest. It
doesn't look like the hypercall would behave well for more than 1 guest.
I am appending the fix.

This is not sufficient to fix the failure. On a hunch, I made this
change:


 	/* Fetch info on every valid domain except for dom0 */
-	ret = xc_domain_getinfolist(xc, 1, DOMID_FIRST_RESERVED - 1, domaininfo);
+	ret = xc_domain_getinfolist(xc, 1, 10, domaininfo);
 	if (ret < 0)
 		return;
 
With it, everything works. I have run out of time today for my
investigation.


I would like to take the opportunity to highlight that gitlab-ci did a
very good job spotting an issue. I am glad we are starting to reap the
benefits of all the hard work we put into it.

Cheers,

Stefano

---
xen: fix broken XEN_SYSCTL_getdomaininfolist hypercall

XEN_SYSCTL_getdomaininfolist doesn't actually update the guest
num_domains field, only its local copy. Fix that.

Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>

diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
index 02505ab044..0e1097be96 100644
--- a/xen/common/sysctl.c
+++ b/xen/common/sysctl.c
@@ -107,10 +107,8 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
         
         rcu_read_unlock(&domlist_read_lock);
         
-        if ( ret != 0 )
-            break;
-        
         op->u.getdomaininfolist.num_domains = num_domains;
+        __copy_field_to_guest(u_sysctl, op, u.getdomaininfolist.num_domains);
     }
     break;
 


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-02  3:59       ` Stefano Stabellini
@ 2023-05-02  6:53         ` Jan Beulich
  2023-05-02 11:10           ` Alejandro Vallejo
  0 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2023-05-02  6:53 UTC (permalink / raw)
  To: Stefano Stabellini
  Cc: andrew.cooper3, alejandro.vallejo, committers, michal.orzel,
	xen-devel

On 02.05.2023 05:59, Stefano Stabellini wrote:
> xen: fix broken XEN_SYSCTL_getdomaininfolist hypercall
> 
> XEN_SYSCTL_getdomaininfolist doesn't actually update the guest
> num_domains field, only its local copy. Fix that.

This isn't true, at least not always / unconditionally. "copyback" is
what controls copying back of the entire struct, and in the success
case this looks to be happening fine. Yet for the failure case it's
unclear whether any copying back is actually intended. (If the op was
to return merely the number of active domains, I think that ought to
be restricted to max_domains == 0 and the handle also being a null
one.

I'm also having a hard time seeing what failure case the test ended
up encountering: There are only two errors which can occur - one
from the XSM hook (which is mishandled, and I'll make a separate
patch for that) and the other from failing to copy back the info for
the domain being looked at. I hope we can exclude the former, so are
you suggesting the info struct copy-back is failing in your case?

> Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com>

In any event, if anything needs fixing here, a Fixes: tag would be
nice.

Jan

> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c
> index 02505ab044..0e1097be96 100644
> --- a/xen/common/sysctl.c
> +++ b/xen/common/sysctl.c
> @@ -107,10 +107,8 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl)
>          
>          rcu_read_unlock(&domlist_read_lock);
>          
> -        if ( ret != 0 )
> -            break;
> -        
>          op->u.getdomaininfolist.num_domains = num_domains;
> +        __copy_field_to_guest(u_sysctl, op, u.getdomaininfolist.num_domains);
>      }
>      break;
>  



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-02  6:53         ` Jan Beulich
@ 2023-05-02 11:10           ` Alejandro Vallejo
  0 siblings, 0 replies; 15+ messages in thread
From: Alejandro Vallejo @ 2023-05-02 11:10 UTC (permalink / raw)
  To: Jan Beulich
  Cc: Stefano Stabellini, andrew.cooper3, committers, michal.orzel,
	xen-devel

On Tue, May 02, 2023 at 08:53:31AM +0200, Jan Beulich wrote:
> I'm also having a hard time seeing what failure case the test ended
> up encountering: There are only two errors which can occur - one
> from the XSM hook (which is mishandled, and I'll make a separate
> patch for that) and the other from failing to copy back the info for
> the domain being looked at. I hope we can exclude the former, so are
> you suggesting the info struct copy-back is failing in your case?

Something's off somewhere, that's for sure. I'll post the remainder of
the series (not all of it made it to staging) so it's not left hanging
and start investigating this failure.

Cheers,
Alejandro


^ permalink raw reply	[flat|nested] 15+ messages in thread

* dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-04-29  3:05 ` xen | Failed pipeline for staging | 6a47ba2f Stefano Stabellini
  2023-04-29 11:41   ` andrew.cooper3
@ 2023-05-03 14:38   ` andrew.cooper3
  2023-05-03 15:15     ` Julien Grall
  1 sibling, 1 reply; 15+ messages in thread
From: andrew.cooper3 @ 2023-05-03 14:38 UTC (permalink / raw)
  To: Stefano Stabellini, alejandro.vallejo
  Cc: committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross,
	Roger Pau Monné, Jan Beulich, Edwin Török

Hello,

After what seems like an unreasonable amount of debugging, we've tracked
down exactly what is going wrong here.

https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944

Of note is the smoke.serial log around:

io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
obj: CREATE connection 0xffff90fff1f0
*** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
00000000, rsp_prod 00000000
io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
(@introduceDomain domlist )

XS_INTRODUCE (in C xenstored at least, not checked O yet) always
clobbers the ring pointers.  The added pressure on dom0 that the
xensconsoled adds with it's 4M hypercall bounce buffer occasionally
defers xenstored long enough that the XS_INTRODUCE clobbers the first
message that dom1 wrote into the ring.

The other behaviour seen was xenstored observing a header looking like this:

*** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
0x6d726f66 }

which was rejected as being too long.  That's "control/platform" in
ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
and writing the payload.


Anyway, it is buggy for XS_INTRODUCE to be called on a live an
unsuspecting connection.  It is ultimately init-dom0less's fault for
telling dom1 it's good to go before having waited for XS_INTRODUCE to
complete.

I am going to start by correcting the documentation to make these
details clear, and then figure out what is the best set of steps to
unbreak this.

~Andrew


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-03 14:38   ` dom0less vs xenstored setup race Was: " andrew.cooper3
@ 2023-05-03 15:15     ` Julien Grall
  2023-05-03 15:22       ` Juergen Gross
  2023-05-03 21:53       ` Stefano Stabellini
  0 siblings, 2 replies; 15+ messages in thread
From: Julien Grall @ 2023-05-03 15:15 UTC (permalink / raw)
  To: andrew.cooper3, Stefano Stabellini, alejandro.vallejo
  Cc: committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross,
	Roger Pau Monné, Jan Beulich, Edwin Török

Hi,

On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote:
> Hello,
> 
> After what seems like an unreasonable amount of debugging, we've tracked
> down exactly what is going wrong here.
> 
> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> 
> Of note is the smoke.serial log around:
> 
> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> obj: CREATE connection 0xffff90fff1f0
> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> 00000000, rsp_prod 00000000
> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> (@introduceDomain domlist )
> 
> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> clobbers the ring pointers.  The added pressure on dom0 that the
> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
> defers xenstored long enough that the XS_INTRODUCE clobbers the first
> message that dom1 wrote into the ring.
> 
> The other behaviour seen was xenstored observing a header looking like this:
> 
> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> 0x6d726f66 }
> 
> which was rejected as being too long.  That's "control/platform" in
> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
> and writing the payload.
> 
> 
> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
> unsuspecting connection.  It is ultimately init-dom0less's fault for
> telling dom1 it's good to go before having waited for XS_INTRODUCE to
> complete.

So the problem is xenstored will set interface->connection to 
XENSTORE_CONNECTED before finalizing the connection. Caqn you try the 
following, for now, very hackish patch:

diff --git a/tools/xenstore/xenstored_domain.c 
b/tools/xenstore/xenstored_domain.c
index f62be2245c42..bbf85bbbea3b 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
                 talloc_steal(domain->conn, domain);

                 if (!restore) {
+                       domain_conn_reset(domain);
                         /* Notify the domain that xenstore is available */
                         interface->connection = XENSTORE_CONNECTED;
                         xenevtchn_notify(xce_handle, domain->port);
@@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection 
*conn,
         if (!domain)
                 return errno;

-       domain_conn_reset(domain);
-
         send_ack(conn, XS_INTRODUCE);

         return 0;

Cheers,

-- 
Julien Grall


^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-03 15:15     ` Julien Grall
@ 2023-05-03 15:22       ` Juergen Gross
  2023-05-03 15:33         ` andrew.cooper3
  2023-05-03 21:53       ` Stefano Stabellini
  1 sibling, 1 reply; 15+ messages in thread
From: Juergen Gross @ 2023-05-03 15:22 UTC (permalink / raw)
  To: Julien Grall, andrew.cooper3, Stefano Stabellini,
	alejandro.vallejo
  Cc: committers, michal.orzel, xen-devel, Julien Grall,
	Roger Pau Monné, Jan Beulich, Edwin Török


[-- Attachment #1.1.1: Type: text/plain, Size: 2778 bytes --]

On 03.05.23 17:15, Julien Grall wrote:
> Hi,
> 
> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote:
>> Hello,
>>
>> After what seems like an unreasonable amount of debugging, we've tracked
>> down exactly what is going wrong here.
>>
>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
>>
>> Of note is the smoke.serial log around:
>>
>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
>> obj: CREATE connection 0xffff90fff1f0
>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
>> 00000000, rsp_prod 00000000
>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
>> (@introduceDomain domlist )
>>
>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
>> clobbers the ring pointers.  The added pressure on dom0 that the
>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
>> message that dom1 wrote into the ring.
>>
>> The other behaviour seen was xenstored observing a header looking like this:
>>
>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
>> 0x6d726f66 }
>>
>> which was rejected as being too long.  That's "control/platform" in
>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
>> and writing the payload.
>>
>>
>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
>> unsuspecting connection.  It is ultimately init-dom0less's fault for
>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
>> complete.
> 
> So the problem is xenstored will set interface->connection to XENSTORE_CONNECTED 
> before finalizing the connection. Caqn you try the following, for now, very 
> hackish patch:
> 
> diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
> index f62be2245c42..bbf85bbbea3b 100644
> --- a/tools/xenstore/xenstored_domain.c
> +++ b/tools/xenstore/xenstored_domain.c
> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
>                  talloc_steal(domain->conn, domain);
> 
>                  if (!restore) {
> +                       domain_conn_reset(domain);
>                          /* Notify the domain that xenstore is available */
>                          interface->connection = XENSTORE_CONNECTED;

I think there are barriers missing (especially in order to work on Arm)?

And I think you will break dom0 with calling domain_conn_reset(), as the
kernel might already have written data into the xenbus page. So you might
want to make the call depend on !is_master_domain.


Juergen

[-- Attachment #1.1.2: OpenPGP public key --]
[-- Type: application/pgp-keys, Size: 3149 bytes --]

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 495 bytes --]

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-03 15:22       ` Juergen Gross
@ 2023-05-03 15:33         ` andrew.cooper3
  0 siblings, 0 replies; 15+ messages in thread
From: andrew.cooper3 @ 2023-05-03 15:33 UTC (permalink / raw)
  To: Juergen Gross, Julien Grall, Stefano Stabellini,
	alejandro.vallejo
  Cc: committers, michal.orzel, xen-devel, Julien Grall,
	Roger Pau Monné, Jan Beulich, Edwin Török

On 03/05/2023 4:22 pm, Juergen Gross wrote:
> On 03.05.23 17:15, Julien Grall wrote:
>> Hi,
>>
>> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote:
>>> Hello,
>>>
>>> After what seems like an unreasonable amount of debugging, we've
>>> tracked
>>> down exactly what is going wrong here.
>>>
>>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
>>>
>>> Of note is the smoke.serial log around:
>>>
>>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
>>> obj: CREATE connection 0xffff90fff1f0
>>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
>>> 00000000, rsp_prod 00000000
>>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
>>> (@introduceDomain domlist )
>>>
>>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
>>> clobbers the ring pointers.  The added pressure on dom0 that the
>>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
>>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
>>> message that dom1 wrote into the ring.
>>>
>>> The other behaviour seen was xenstored observing a header looking
>>> like this:
>>>
>>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
>>> 0x6d726f66 }
>>>
>>> which was rejected as being too long.  That's "control/platform" in
>>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
>>> and writing the payload.
>>>
>>>
>>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
>>> unsuspecting connection.  It is ultimately init-dom0less's fault for
>>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
>>> complete.
>>
>> So the problem is xenstored will set interface->connection to
>> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
>> following, for now, very hackish patch:
>>
>> diff --git a/tools/xenstore/xenstored_domain.c
>> b/tools/xenstore/xenstored_domain.c
>> index f62be2245c42..bbf85bbbea3b 100644
>> --- a/tools/xenstore/xenstored_domain.c
>> +++ b/tools/xenstore/xenstored_domain.c
>> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void
>> *ctx,
>>                  talloc_steal(domain->conn, domain);
>>
>>                  if (!restore) {
>> +                       domain_conn_reset(domain);
>>                          /* Notify the domain that xenstore is
>> available */
>>                          interface->connection = XENSTORE_CONNECTED;
>
> I think there are barriers missing (especially in order to work on Arm)?

Yes there are.  I think x86 skates by on side effects of hypercalls.

>
> And I think you will break dom0 with calling domain_conn_reset(), as the
> kernel might already have written data into the xenbus page. So you might
> want to make the call depend on !is_master_domain.

And this is why I am very deliberately not doing anything until the
documentation is matches reality, and is safe to use.

For starters, shuffling this doesn't make any difference for a domU
which hasn't been taught about this optional extension.  Ignoring such
cases is not an acceptable fix.

~Andrew


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-03 15:15     ` Julien Grall
  2023-05-03 15:22       ` Juergen Gross
@ 2023-05-03 21:53       ` Stefano Stabellini
  2023-05-03 22:20         ` andrew.cooper3
  1 sibling, 1 reply; 15+ messages in thread
From: Stefano Stabellini @ 2023-05-03 21:53 UTC (permalink / raw)
  To: Julien Grall
  Cc: andrew.cooper3, Stefano Stabellini, alejandro.vallejo, committers,
	michal.orzel, xen-devel, Julien Grall, Juergen Gross,
	Roger Pau Monné, Jan Beulich, Edwin Török

[-- Attachment #1: Type: text/plain, Size: 3585 bytes --]

On Wed, 3 May 2023, Julien Grall wrote:
> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote:
> > Hello,
> > 
> > After what seems like an unreasonable amount of debugging, we've tracked
> > down exactly what is going wrong here.
> > 
> > https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> > 
> > Of note is the smoke.serial log around:
> > 
> > io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> > obj: CREATE connection 0xffff90fff1f0
> > *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> > 00000000, rsp_prod 00000000
> > io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> > (@introduceDomain domlist )
> > 
> > XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> > clobbers the ring pointers.  The added pressure on dom0 that the
> > xensconsoled adds with it's 4M hypercall bounce buffer occasionally
> > defers xenstored long enough that the XS_INTRODUCE clobbers the first
> > message that dom1 wrote into the ring.
> > 
> > The other behaviour seen was xenstored observing a header looking like this:
> > 
> > *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> > 0x6d726f66 }
> > 
> > which was rejected as being too long.  That's "control/platform" in
> > ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
> > and writing the payload.
> > 
> > 
> > Anyway, it is buggy for XS_INTRODUCE to be called on a live an
> > unsuspecting connection.  It is ultimately init-dom0less's fault for
> > telling dom1 it's good to go before having waited for XS_INTRODUCE to
> > complete.
> 
> So the problem is xenstored will set interface->connection to
> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
> following, for now, very hackish patch:
> 
> diff --git a/tools/xenstore/xenstored_domain.c
> b/tools/xenstore/xenstored_domain.c
> index f62be2245c42..bbf85bbbea3b 100644
> --- a/tools/xenstore/xenstored_domain.c
> +++ b/tools/xenstore/xenstored_domain.c
> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
>                 talloc_steal(domain->conn, domain);
> 
>                 if (!restore) {
> +                       domain_conn_reset(domain);
>                         /* Notify the domain that xenstore is available */
>                         interface->connection = XENSTORE_CONNECTED;
>                         xenevtchn_notify(xce_handle, domain->port);
> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn,
>         if (!domain)
>                 return errno;
> 
> -       domain_conn_reset(domain);
> -
>         send_ack(conn, XS_INTRODUCE);

Following Jurgen's suggestion, I made this slightly modified version of
the patch. With it, the problem is solved:

https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703


diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c
index f62be2245c..5ca160d9f2 100644
--- a/tools/xenstore/xenstored_domain.c
+++ b/tools/xenstore/xenstored_domain.c
@@ -688,6 +688,8 @@ static struct domain *introduce_domain(const void *ctx,
 		talloc_steal(domain->conn, domain);
 
 		if (!restore) {
+			if (!is_master_domain)
+				domain_conn_reset(domain);
 			/* Notify the domain that xenstore is available */
 			interface->connection = XENSTORE_CONNECTED;
 			xenevtchn_notify(xce_handle, domain->port);
@@ -730,8 +732,6 @@ int do_introduce(const void *ctx, struct connection *conn,
 	if (!domain)
 		return errno;
 
-	domain_conn_reset(domain);
-
 	send_ack(conn, XS_INTRODUCE);
 
 	return 0;

^ permalink raw reply related	[flat|nested] 15+ messages in thread

* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-03 21:53       ` Stefano Stabellini
@ 2023-05-03 22:20         ` andrew.cooper3
  2023-05-03 23:16           ` Stefano Stabellini
  0 siblings, 1 reply; 15+ messages in thread
From: andrew.cooper3 @ 2023-05-03 22:20 UTC (permalink / raw)
  To: Stefano Stabellini, Julien Grall
  Cc: alejandro.vallejo, committers, michal.orzel, xen-devel,
	Julien Grall, Juergen Gross, Roger Pau Monné, Jan Beulich,
	Edwin Török

On 03/05/2023 10:53 pm, Stefano Stabellini wrote:
> On Wed, 3 May 2023, Julien Grall wrote:
>> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote:
>>> Hello,
>>>
>>> After what seems like an unreasonable amount of debugging, we've tracked
>>> down exactly what is going wrong here.
>>>
>>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
>>>
>>> Of note is the smoke.serial log around:
>>>
>>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
>>> obj: CREATE connection 0xffff90fff1f0
>>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
>>> 00000000, rsp_prod 00000000
>>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
>>> (@introduceDomain domlist )
>>>
>>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
>>> clobbers the ring pointers.  The added pressure on dom0 that the
>>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
>>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
>>> message that dom1 wrote into the ring.
>>>
>>> The other behaviour seen was xenstored observing a header looking like this:
>>>
>>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
>>> 0x6d726f66 }
>>>
>>> which was rejected as being too long.  That's "control/platform" in
>>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
>>> and writing the payload.
>>>
>>>
>>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
>>> unsuspecting connection.  It is ultimately init-dom0less's fault for
>>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
>>> complete.
>> So the problem is xenstored will set interface->connection to
>> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
>> following, for now, very hackish patch:
>>
>> diff --git a/tools/xenstore/xenstored_domain.c
>> b/tools/xenstore/xenstored_domain.c
>> index f62be2245c42..bbf85bbbea3b 100644
>> --- a/tools/xenstore/xenstored_domain.c
>> +++ b/tools/xenstore/xenstored_domain.c
>> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
>>                 talloc_steal(domain->conn, domain);
>>
>>                 if (!restore) {
>> +                       domain_conn_reset(domain);
>>                         /* Notify the domain that xenstore is available */
>>                         interface->connection = XENSTORE_CONNECTED;
>>                         xenevtchn_notify(xce_handle, domain->port);
>> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn,
>>         if (!domain)
>>                 return errno;
>>
>> -       domain_conn_reset(domain);
>> -
>>         send_ack(conn, XS_INTRODUCE);
> Following Jurgen's suggestion, I made this slightly modified version of
> the patch. With it, the problem is solved:
>
> https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703

This fails to solve 3(?) of the 4(?) bugs pointed out between this email
thread and on IRC.

Stop with the bull-in-a-china-shop approach.  There is no acceptable fix
to this mess which starts with anything other than corrections to the
documentation, and a plan for how to make startup work robustly given
all the bugs introduced previously by failing to do it properly the
first time around.

~Andrew


^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f
  2023-05-03 22:20         ` andrew.cooper3
@ 2023-05-03 23:16           ` Stefano Stabellini
  0 siblings, 0 replies; 15+ messages in thread
From: Stefano Stabellini @ 2023-05-03 23:16 UTC (permalink / raw)
  To: andrew.cooper3
  Cc: Stefano Stabellini, Julien Grall, alejandro.vallejo, committers,
	michal.orzel, xen-devel, Julien Grall, Juergen Gross,
	Roger Pau Monné, Jan Beulich, Edwin Török

[-- Attachment #1: Type: text/plain, Size: 4056 bytes --]

On Wed, 3 May 2023, andrew.cooper3@citrix.com wrote:
> On 03/05/2023 10:53 pm, Stefano Stabellini wrote:
> > On Wed, 3 May 2023, Julien Grall wrote:
> >> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote:
> >>> Hello,
> >>>
> >>> After what seems like an unreasonable amount of debugging, we've tracked
> >>> down exactly what is going wrong here.
> >>>
> >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944
> >>>
> >>> Of note is the smoke.serial log around:
> >>>
> >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 )
> >>> obj: CREATE connection 0xffff90fff1f0
> >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons
> >>> 00000000, rsp_prod 00000000
> >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT
> >>> (@introduceDomain domlist )
> >>>
> >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always
> >>> clobbers the ring pointers.  The added pressure on dom0 that the
> >>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally
> >>> defers xenstored long enough that the XS_INTRODUCE clobbers the first
> >>> message that dom1 wrote into the ring.
> >>>
> >>> The other behaviour seen was xenstored observing a header looking like this:
> >>>
> >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len
> >>> 0x6d726f66 }
> >>>
> >>> which was rejected as being too long.  That's "control/platform" in
> >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header
> >>> and writing the payload.
> >>>
> >>>
> >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an
> >>> unsuspecting connection.  It is ultimately init-dom0less's fault for
> >>> telling dom1 it's good to go before having waited for XS_INTRODUCE to
> >>> complete.
> >> So the problem is xenstored will set interface->connection to
> >> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the
> >> following, for now, very hackish patch:
> >>
> >> diff --git a/tools/xenstore/xenstored_domain.c
> >> b/tools/xenstore/xenstored_domain.c
> >> index f62be2245c42..bbf85bbbea3b 100644
> >> --- a/tools/xenstore/xenstored_domain.c
> >> +++ b/tools/xenstore/xenstored_domain.c
> >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx,
> >>                 talloc_steal(domain->conn, domain);
> >>
> >>                 if (!restore) {
> >> +                       domain_conn_reset(domain);
> >>                         /* Notify the domain that xenstore is available */
> >>                         interface->connection = XENSTORE_CONNECTED;
> >>                         xenevtchn_notify(xce_handle, domain->port);
> >> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn,
> >>         if (!domain)
> >>                 return errno;
> >>
> >> -       domain_conn_reset(domain);
> >> -
> >>         send_ack(conn, XS_INTRODUCE);
> > Following Jurgen's suggestion, I made this slightly modified version of
> > the patch. With it, the problem is solved:
> >
> > https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703
> 
> This fails to solve 3(?) of the 4(?) bugs pointed out between this email
> thread and on IRC.
> 
> Stop with the bull-in-a-china-shop approach.  There is no acceptable fix
> to this mess which starts with anything other than corrections to the
> documentation, and a plan for how to make startup work robustly given
> all the bugs introduced previously by failing to do it properly the
> first time around.

I am not suggesting this is the fix (I didn't add any Signed-off-by or
commit message on purpose). I think it is useful to know if a
theoretical proposal would work in practice as it helps us understand
the problem. In the little time I had in-between meetings I thought I
could help a bit by providing this update.

Like you, I would appreciate a comprehensive fix which includes a
documentation update.

Genuine question: how would you like to proceed? In the project
management sense of who does what and what is the suggested timeline.

^ permalink raw reply	[flat|nested] 15+ messages in thread

end of thread, other threads:[~2023-05-03 23:16 UTC | newest]

Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
     [not found] <644bfbc6939d8_2a49bbb403253f4@gitlab-sidekiq-catchall-v2-78885c497-qxnp2.mail>
2023-04-29  3:05 ` xen | Failed pipeline for staging | 6a47ba2f Stefano Stabellini
2023-04-29 11:41   ` andrew.cooper3
2023-04-29 13:34     ` Marek Marczykowski-Górecki
2023-04-29 19:48       ` andrew.cooper3
2023-04-30  1:21     ` Stefano Stabellini
2023-05-02  3:59       ` Stefano Stabellini
2023-05-02  6:53         ` Jan Beulich
2023-05-02 11:10           ` Alejandro Vallejo
2023-05-03 14:38   ` dom0less vs xenstored setup race Was: " andrew.cooper3
2023-05-03 15:15     ` Julien Grall
2023-05-03 15:22       ` Juergen Gross
2023-05-03 15:33         ` andrew.cooper3
2023-05-03 21:53       ` Stefano Stabellini
2023-05-03 22:20         ` andrew.cooper3
2023-05-03 23:16           ` Stefano Stabellini

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.