* Re: xen | Failed pipeline for staging | 6a47ba2f [not found] <644bfbc6939d8_2a49bbb403253f4@gitlab-sidekiq-catchall-v2-78885c497-qxnp2.mail> @ 2023-04-29 3:05 ` Stefano Stabellini 2023-04-29 11:41 ` andrew.cooper3 2023-05-03 14:38 ` dom0less vs xenstored setup race Was: " andrew.cooper3 0 siblings, 2 replies; 15+ messages in thread From: Stefano Stabellini @ 2023-04-29 3:05 UTC (permalink / raw) To: alejandro.vallejo Cc: committers, michal.orzel, xen-devel, andrew.cooper3, sstabellini [-- Attachment #1: Type: text/plain, Size: 1358 bytes --] On Fri, 28 Apr 2023, GitLab wrote: > Pipeline #852233694 triggered by > [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon] > Ganis > had 3 failed jobs > Failed jobs > ✖ > test > qemu-smoke-dom0less-arm64-gcc This is a real failure on staging. Unfortunately it is intermittent. It usually happens once every 3-8 tests for me. The test script is: automation/scripts/qemu-smoke-dom0less-arm64.sh and for this test it is invoked without arguments. It is starting 2 dom0less VMs in parallel, then dom0 does a xl network-attach and the domU is supposed to setup eth0 and ping. The failure is that nothing happens after "xl network-attach". The domU never hotplugs any interfaces. I have logs that show that eth0 never shows up and the only interface is lo no matter how long we wait. On a hunch, I removed Alejandro patches. Without them, I ran 20 tests without any failures. I have not investigated further but it looks like one of these 4 commits is the problem: 2023-04-28 11:41 Alejandro Vallejo tools: Make init-xenstore-domain use xc_domain_getinfolist() 2023-04-28 11:41 Alejandro Vallejo tools: Refactor console/io.c to avoid using xc_domain_getinfo() 2023-04-28 11:41 Alejandro Vallejo tools: Create xc_domain_getinfo_single() 2023-04-28 11:41 Alejandro Vallejo tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-04-29 3:05 ` xen | Failed pipeline for staging | 6a47ba2f Stefano Stabellini @ 2023-04-29 11:41 ` andrew.cooper3 2023-04-29 13:34 ` Marek Marczykowski-Górecki 2023-04-30 1:21 ` Stefano Stabellini 2023-05-03 14:38 ` dom0less vs xenstored setup race Was: " andrew.cooper3 1 sibling, 2 replies; 15+ messages in thread From: andrew.cooper3 @ 2023-04-29 11:41 UTC (permalink / raw) To: Stefano Stabellini, alejandro.vallejo; +Cc: committers, michal.orzel, xen-devel On 29/04/2023 4:05 am, Stefano Stabellini wrote: > On Fri, 28 Apr 2023, GitLab wrote: >> Pipeline #852233694 triggered by >> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon] >> Ganis >> had 3 failed jobs >> Failed jobs >> ✖ >> test >> qemu-smoke-dom0less-arm64-gcc > This is a real failure on staging. Unfortunately it is intermittent. It > usually happens once every 3-8 tests for me. > > The test script is: > automation/scripts/qemu-smoke-dom0less-arm64.sh > > and for this test it is invoked without arguments. It is starting 2 > dom0less VMs in parallel, then dom0 does a xl network-attach and the > domU is supposed to setup eth0 and ping. > > The failure is that nothing happens after "xl network-attach". The domU > never hotplugs any interfaces. I have logs that show that eth0 never > shows up and the only interface is lo no matter how long we wait. > > > On a hunch, I removed Alejandro patches. Without them, I ran 20 tests > without any failures. I have not investigated further but it looks like > one of these 4 commits is the problem: > > 2023-04-28 11:41 Alejandro Vallejo tools: Make init-xenstore-domain use xc_domain_getinfolist() > 2023-04-28 11:41 Alejandro Vallejo tools: Refactor console/io.c to avoid using xc_domain_getinfo() > 2023-04-28 11:41 Alejandro Vallejo tools: Create xc_domain_getinfo_single() > 2023-04-28 11:41 Alejandro Vallejo tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol In commit order (reverse of above), these patches are: 1) Modify the python bindings and xenbaked 2) Introduce a new library function with a better API/ABI 3) Modify xenconsoled 4) Modify init-xenstore-domain The test isn't using anything from 4 or 1, and 2 definitely isn't breaking anything on its own. That just leaves 3. This test does turn activate xenconsoled by virtue of invoking xencommons, but that doesn't help explain why a change in xenconsoled interferes (and only intermittently on this one single test) with `xl network-attach`. The xenconsoled change does have correctness fix in it, requiring xenconsoled to ask for all domains info in one go. This does mean it's hypercall-buffering (i.e. bouncing) a 4M array now where previously it was racy figuring out which VMs had come and gone. ~Andrew ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-04-29 11:41 ` andrew.cooper3 @ 2023-04-29 13:34 ` Marek Marczykowski-Górecki 2023-04-29 19:48 ` andrew.cooper3 2023-04-30 1:21 ` Stefano Stabellini 1 sibling, 1 reply; 15+ messages in thread From: Marek Marczykowski-Górecki @ 2023-04-29 13:34 UTC (permalink / raw) To: andrew.cooper3 Cc: Stefano Stabellini, alejandro.vallejo, committers, michal.orzel, xen-devel [-- Attachment #1: Type: text/plain, Size: 2671 bytes --] On Sat, Apr 29, 2023 at 12:41:26PM +0100, andrew.cooper3@citrix.com wrote: > On 29/04/2023 4:05 am, Stefano Stabellini wrote: > > On Fri, 28 Apr 2023, GitLab wrote: > >> Pipeline #852233694 triggered by > >> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon] > >> Ganis > >> had 3 failed jobs > >> Failed jobs > >> ✖ > >> test > >> qemu-smoke-dom0less-arm64-gcc > > This is a real failure on staging. Unfortunately it is intermittent. It > > usually happens once every 3-8 tests for me. > > > > The test script is: > > automation/scripts/qemu-smoke-dom0less-arm64.sh > > > > and for this test it is invoked without arguments. It is starting 2 > > dom0less VMs in parallel, then dom0 does a xl network-attach and the > > domU is supposed to setup eth0 and ping. > > > > The failure is that nothing happens after "xl network-attach". The domU > > never hotplugs any interfaces. I have logs that show that eth0 never > > shows up and the only interface is lo no matter how long we wait. > > > > > > On a hunch, I removed Alejandro patches. Without them, I ran 20 tests > > without any failures. I have not investigated further but it looks like > > one of these 4 commits is the problem: > > > > 2023-04-28 11:41 Alejandro Vallejo tools: Make init-xenstore-domain use xc_domain_getinfolist() > > 2023-04-28 11:41 Alejandro Vallejo tools: Refactor console/io.c to avoid using xc_domain_getinfo() > > 2023-04-28 11:41 Alejandro Vallejo tools: Create xc_domain_getinfo_single() > > 2023-04-28 11:41 Alejandro Vallejo tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol > > In commit order (reverse of above), these patches are: > > 1) Modify the python bindings and xenbaked > 2) Introduce a new library function with a better API/ABI > 3) Modify xenconsoled > 4) Modify init-xenstore-domain > > The test isn't using anything from 4 or 1, and 2 definitely isn't > breaking anything on its own. > > That just leaves 3. This test does turn activate xenconsoled by virtue > of invoking xencommons, but that doesn't help explain why a change in > xenconsoled interferes (and only intermittently on this one single test) > with `xl network-attach`. > > The xenconsoled change does have correctness fix in it, requiring > xenconsoled to ask for all domains info in one go. This does mean it's > hypercall-buffering (i.e. bouncing) a 4M array now where previously it > was racy figuring out which VMs had come and gone. Can it be that xl network-attach fails and that failure is silently ignored by the test? -- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab [-- Attachment #2: signature.asc --] [-- Type: application/pgp-signature, Size: 488 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-04-29 13:34 ` Marek Marczykowski-Górecki @ 2023-04-29 19:48 ` andrew.cooper3 0 siblings, 0 replies; 15+ messages in thread From: andrew.cooper3 @ 2023-04-29 19:48 UTC (permalink / raw) To: Marek Marczykowski-Górecki Cc: Stefano Stabellini, alejandro.vallejo, committers, michal.orzel, xen-devel On 29/04/2023 2:34 pm, Marek Marczykowski-Górecki wrote: > On Sat, Apr 29, 2023 at 12:41:26PM +0100, andrew.cooper3@citrix.com wrote: >> On 29/04/2023 4:05 am, Stefano Stabellini wrote: >>> On Fri, 28 Apr 2023, GitLab wrote: >>>> Pipeline #852233694 triggered by >>>> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon] >>>> Ganis >>>> had 3 failed jobs >>>> Failed jobs >>>> ✖ >>>> test >>>> qemu-smoke-dom0less-arm64-gcc >>> This is a real failure on staging. Unfortunately it is intermittent. It >>> usually happens once every 3-8 tests for me. >>> >>> The test script is: >>> automation/scripts/qemu-smoke-dom0less-arm64.sh >>> >>> and for this test it is invoked without arguments. It is starting 2 >>> dom0less VMs in parallel, then dom0 does a xl network-attach and the >>> domU is supposed to setup eth0 and ping. >>> >>> The failure is that nothing happens after "xl network-attach". The domU >>> never hotplugs any interfaces. I have logs that show that eth0 never >>> shows up and the only interface is lo no matter how long we wait. >>> >>> >>> On a hunch, I removed Alejandro patches. Without them, I ran 20 tests >>> without any failures. I have not investigated further but it looks like >>> one of these 4 commits is the problem: >>> >>> 2023-04-28 11:41 Alejandro Vallejo tools: Make init-xenstore-domain use xc_domain_getinfolist() >>> 2023-04-28 11:41 Alejandro Vallejo tools: Refactor console/io.c to avoid using xc_domain_getinfo() >>> 2023-04-28 11:41 Alejandro Vallejo tools: Create xc_domain_getinfo_single() >>> 2023-04-28 11:41 Alejandro Vallejo tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol >> In commit order (reverse of above), these patches are: >> >> 1) Modify the python bindings and xenbaked >> 2) Introduce a new library function with a better API/ABI >> 3) Modify xenconsoled >> 4) Modify init-xenstore-domain >> >> The test isn't using anything from 4 or 1, and 2 definitely isn't >> breaking anything on its own. >> >> That just leaves 3. This test does turn activate xenconsoled by virtue >> of invoking xencommons, but that doesn't help explain why a change in >> xenconsoled interferes (and only intermittently on this one single test) >> with `xl network-attach`. >> >> The xenconsoled change does have correctness fix in it, requiring >> xenconsoled to ask for all domains info in one go. This does mean it's >> hypercall-buffering (i.e. bouncing) a 4M array now where previously it >> was racy figuring out which VMs had come and gone. > Can it be that xl network-attach fails and that failure is silently > ignored by the test? Well, it's ultimately doing a ping test between the two VMs, so the network-attach is rather important. I don't see an obviously way for us to get false negatives like this. ~Andrew ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-04-29 11:41 ` andrew.cooper3 2023-04-29 13:34 ` Marek Marczykowski-Górecki @ 2023-04-30 1:21 ` Stefano Stabellini 2023-05-02 3:59 ` Stefano Stabellini 1 sibling, 1 reply; 15+ messages in thread From: Stefano Stabellini @ 2023-04-30 1:21 UTC (permalink / raw) To: andrew.cooper3 Cc: Stefano Stabellini, alejandro.vallejo, committers, michal.orzel, xen-devel [-- Attachment #1: Type: text/plain, Size: 3157 bytes --] On Sat, 29 Apr 2023, andrew.cooper3@citrix.com wrote: > On 29/04/2023 4:05 am, Stefano Stabellini wrote: > > On Fri, 28 Apr 2023, GitLab wrote: > >> Pipeline #852233694 triggered by > >> [568538936b4ac45a343cb3a4ab0c6cda?s=48&d=identicon] > >> Ganis > >> had 3 failed jobs > >> Failed jobs > >> ✖ > >> test > >> qemu-smoke-dom0less-arm64-gcc > > This is a real failure on staging. Unfortunately it is intermittent. It > > usually happens once every 3-8 tests for me. > > > > The test script is: > > automation/scripts/qemu-smoke-dom0less-arm64.sh > > > > and for this test it is invoked without arguments. It is starting 2 > > dom0less VMs in parallel, then dom0 does a xl network-attach and the > > domU is supposed to setup eth0 and ping. > > > > The failure is that nothing happens after "xl network-attach". The domU > > never hotplugs any interfaces. I have logs that show that eth0 never > > shows up and the only interface is lo no matter how long we wait. > > > > > > On a hunch, I removed Alejandro patches. Without them, I ran 20 tests > > without any failures. I have not investigated further but it looks like > > one of these 4 commits is the problem: > > > > 2023-04-28 11:41 Alejandro Vallejo tools: Make init-xenstore-domain use xc_domain_getinfolist() > > 2023-04-28 11:41 Alejandro Vallejo tools: Refactor console/io.c to avoid using xc_domain_getinfo() > > 2023-04-28 11:41 Alejandro Vallejo tools: Create xc_domain_getinfo_single() > > 2023-04-28 11:41 Alejandro Vallejo tools: Make some callers of xc_domain_getinfo() use xc_domain_getinfol > > In commit order (reverse of above), these patches are: > > 1) Modify the python bindings and xenbaked > 2) Introduce a new library function with a better API/ABI > 3) Modify xenconsoled > 4) Modify init-xenstore-domain > > The test isn't using anything from 4 or 1, and 2 definitely isn't > breaking anything on its own. > > That just leaves 3. This test does turn activate xenconsoled by virtue > of invoking xencommons, but that doesn't help explain why a change in > xenconsoled interferes (and only intermittently on this one single test) > with `xl network-attach`. > > The xenconsoled change does have correctness fix in it, requiring > xenconsoled to ask for all domains info in one go. This does mean it's > hypercall-buffering (i.e. bouncing) a 4M array now where previously it > was racy figuring out which VMs had come and gone. Your guess was correct. I have done more bisecting today. The culprit is the following commit (I reverted only this commit and ran 25 tests successfully, usually it fails in less than 5): e522c98c3 tools: Refactor console/io.c to avoid using xc_domain_getinfo() I don't know why. Traditionally if this was OSSTest we would revert the commit until we understand what is going on to unblock master/staging. I suggest to do the same here to be consistent. And also because otherwise future failures in this test due to other bugs might be masked by this unsolved issue. I have nothing against this commit and I'd be happy for it to go in again as soon as things are not necessarely resolved, but at least better understood. ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-04-30 1:21 ` Stefano Stabellini @ 2023-05-02 3:59 ` Stefano Stabellini 2023-05-02 6:53 ` Jan Beulich 0 siblings, 1 reply; 15+ messages in thread From: Stefano Stabellini @ 2023-05-02 3:59 UTC (permalink / raw) To: Stefano Stabellini Cc: andrew.cooper3, alejandro.vallejo, committers, michal.orzel, xen-devel On Sat, 29 Apr 2023, Stefano Stabellini wrote: > Your guess was correct. I have done more bisecting today. The culprit is > the following commit (I reverted only this commit and ran 25 tests > successfully, usually it fails in less than 5): > > e522c98c3 tools: Refactor console/io.c to avoid using xc_domain_getinfo() I did more debugging. One problem seems to be that XEN_SYSCTL_getdomaininfolist is buggy in the hypervisor. The field u.getdomaininfolist.num_domains is not copied back to the guest. It doesn't look like the hypercall would behave well for more than 1 guest. I am appending the fix. This is not sufficient to fix the failure. On a hunch, I made this change: /* Fetch info on every valid domain except for dom0 */ - ret = xc_domain_getinfolist(xc, 1, DOMID_FIRST_RESERVED - 1, domaininfo); + ret = xc_domain_getinfolist(xc, 1, 10, domaininfo); if (ret < 0) return; With it, everything works. I have run out of time today for my investigation. I would like to take the opportunity to highlight that gitlab-ci did a very good job spotting an issue. I am glad we are starting to reap the benefits of all the hard work we put into it. Cheers, Stefano --- xen: fix broken XEN_SYSCTL_getdomaininfolist hypercall XEN_SYSCTL_getdomaininfolist doesn't actually update the guest num_domains field, only its local copy. Fix that. Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c index 02505ab044..0e1097be96 100644 --- a/xen/common/sysctl.c +++ b/xen/common/sysctl.c @@ -107,10 +107,8 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) rcu_read_unlock(&domlist_read_lock); - if ( ret != 0 ) - break; - op->u.getdomaininfolist.num_domains = num_domains; + __copy_field_to_guest(u_sysctl, op, u.getdomaininfolist.num_domains); } break; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-05-02 3:59 ` Stefano Stabellini @ 2023-05-02 6:53 ` Jan Beulich 2023-05-02 11:10 ` Alejandro Vallejo 0 siblings, 1 reply; 15+ messages in thread From: Jan Beulich @ 2023-05-02 6:53 UTC (permalink / raw) To: Stefano Stabellini Cc: andrew.cooper3, alejandro.vallejo, committers, michal.orzel, xen-devel On 02.05.2023 05:59, Stefano Stabellini wrote: > xen: fix broken XEN_SYSCTL_getdomaininfolist hypercall > > XEN_SYSCTL_getdomaininfolist doesn't actually update the guest > num_domains field, only its local copy. Fix that. This isn't true, at least not always / unconditionally. "copyback" is what controls copying back of the entire struct, and in the success case this looks to be happening fine. Yet for the failure case it's unclear whether any copying back is actually intended. (If the op was to return merely the number of active domains, I think that ought to be restricted to max_domains == 0 and the handle also being a null one. I'm also having a hard time seeing what failure case the test ended up encountering: There are only two errors which can occur - one from the XSM hook (which is mishandled, and I'll make a separate patch for that) and the other from failing to copy back the info for the domain being looked at. I hope we can exclude the former, so are you suggesting the info struct copy-back is failing in your case? > Signed-off-by: Stefano Stabellini <stefano.stabellini@amd.com> In any event, if anything needs fixing here, a Fixes: tag would be nice. Jan > diff --git a/xen/common/sysctl.c b/xen/common/sysctl.c > index 02505ab044..0e1097be96 100644 > --- a/xen/common/sysctl.c > +++ b/xen/common/sysctl.c > @@ -107,10 +107,8 @@ long do_sysctl(XEN_GUEST_HANDLE_PARAM(xen_sysctl_t) u_sysctl) > > rcu_read_unlock(&domlist_read_lock); > > - if ( ret != 0 ) > - break; > - > op->u.getdomaininfolist.num_domains = num_domains; > + __copy_field_to_guest(u_sysctl, op, u.getdomaininfolist.num_domains); > } > break; > ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: xen | Failed pipeline for staging | 6a47ba2f 2023-05-02 6:53 ` Jan Beulich @ 2023-05-02 11:10 ` Alejandro Vallejo 0 siblings, 0 replies; 15+ messages in thread From: Alejandro Vallejo @ 2023-05-02 11:10 UTC (permalink / raw) To: Jan Beulich Cc: Stefano Stabellini, andrew.cooper3, committers, michal.orzel, xen-devel On Tue, May 02, 2023 at 08:53:31AM +0200, Jan Beulich wrote: > I'm also having a hard time seeing what failure case the test ended > up encountering: There are only two errors which can occur - one > from the XSM hook (which is mishandled, and I'll make a separate > patch for that) and the other from failing to copy back the info for > the domain being looked at. I hope we can exclude the former, so are > you suggesting the info struct copy-back is failing in your case? Something's off somewhere, that's for sure. I'll post the remainder of the series (not all of it made it to staging) so it's not left hanging and start investigating this failure. Cheers, Alejandro ^ permalink raw reply [flat|nested] 15+ messages in thread
* dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-04-29 3:05 ` xen | Failed pipeline for staging | 6a47ba2f Stefano Stabellini 2023-04-29 11:41 ` andrew.cooper3 @ 2023-05-03 14:38 ` andrew.cooper3 2023-05-03 15:15 ` Julien Grall 1 sibling, 1 reply; 15+ messages in thread From: andrew.cooper3 @ 2023-05-03 14:38 UTC (permalink / raw) To: Stefano Stabellini, alejandro.vallejo Cc: committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross, Roger Pau Monné, Jan Beulich, Edwin Török Hello, After what seems like an unreasonable amount of debugging, we've tracked down exactly what is going wrong here. https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 Of note is the smoke.serial log around: io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) obj: CREATE connection 0xffff90fff1f0 *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons 00000000, rsp_prod 00000000 io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT (@introduceDomain domlist ) XS_INTRODUCE (in C xenstored at least, not checked O yet) always clobbers the ring pointers. The added pressure on dom0 that the xensconsoled adds with it's 4M hypercall bounce buffer occasionally defers xenstored long enough that the XS_INTRODUCE clobbers the first message that dom1 wrote into the ring. The other behaviour seen was xenstored observing a header looking like this: *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len 0x6d726f66 } which was rejected as being too long. That's "control/platform" in ASCII, so the XS_INTRODUCE intersected dom1 between writing the header and writing the payload. Anyway, it is buggy for XS_INTRODUCE to be called on a live an unsuspecting connection. It is ultimately init-dom0less's fault for telling dom1 it's good to go before having waited for XS_INTRODUCE to complete. I am going to start by correcting the documentation to make these details clear, and then figure out what is the best set of steps to unbreak this. ~Andrew ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-05-03 14:38 ` dom0less vs xenstored setup race Was: " andrew.cooper3 @ 2023-05-03 15:15 ` Julien Grall 2023-05-03 15:22 ` Juergen Gross 2023-05-03 21:53 ` Stefano Stabellini 0 siblings, 2 replies; 15+ messages in thread From: Julien Grall @ 2023-05-03 15:15 UTC (permalink / raw) To: andrew.cooper3, Stefano Stabellini, alejandro.vallejo Cc: committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross, Roger Pau Monné, Jan Beulich, Edwin Török Hi, On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote: > Hello, > > After what seems like an unreasonable amount of debugging, we've tracked > down exactly what is going wrong here. > > https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 > > Of note is the smoke.serial log around: > > io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) > obj: CREATE connection 0xffff90fff1f0 > *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons > 00000000, rsp_prod 00000000 > io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT > (@introduceDomain domlist ) > > XS_INTRODUCE (in C xenstored at least, not checked O yet) always > clobbers the ring pointers. The added pressure on dom0 that the > xensconsoled adds with it's 4M hypercall bounce buffer occasionally > defers xenstored long enough that the XS_INTRODUCE clobbers the first > message that dom1 wrote into the ring. > > The other behaviour seen was xenstored observing a header looking like this: > > *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len > 0x6d726f66 } > > which was rejected as being too long. That's "control/platform" in > ASCII, so the XS_INTRODUCE intersected dom1 between writing the header > and writing the payload. > > > Anyway, it is buggy for XS_INTRODUCE to be called on a live an > unsuspecting connection. It is ultimately init-dom0less's fault for > telling dom1 it's good to go before having waited for XS_INTRODUCE to > complete. So the problem is xenstored will set interface->connection to XENSTORE_CONNECTED before finalizing the connection. Caqn you try the following, for now, very hackish patch: diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c index f62be2245c42..bbf85bbbea3b 100644 --- a/tools/xenstore/xenstored_domain.c +++ b/tools/xenstore/xenstored_domain.c @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx, talloc_steal(domain->conn, domain); if (!restore) { + domain_conn_reset(domain); /* Notify the domain that xenstore is available */ interface->connection = XENSTORE_CONNECTED; xenevtchn_notify(xce_handle, domain->port); @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn, if (!domain) return errno; - domain_conn_reset(domain); - send_ack(conn, XS_INTRODUCE); return 0; Cheers, -- Julien Grall ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-05-03 15:15 ` Julien Grall @ 2023-05-03 15:22 ` Juergen Gross 2023-05-03 15:33 ` andrew.cooper3 2023-05-03 21:53 ` Stefano Stabellini 1 sibling, 1 reply; 15+ messages in thread From: Juergen Gross @ 2023-05-03 15:22 UTC (permalink / raw) To: Julien Grall, andrew.cooper3, Stefano Stabellini, alejandro.vallejo Cc: committers, michal.orzel, xen-devel, Julien Grall, Roger Pau Monné, Jan Beulich, Edwin Török [-- Attachment #1.1.1: Type: text/plain, Size: 2778 bytes --] On 03.05.23 17:15, Julien Grall wrote: > Hi, > > On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote: >> Hello, >> >> After what seems like an unreasonable amount of debugging, we've tracked >> down exactly what is going wrong here. >> >> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 >> >> Of note is the smoke.serial log around: >> >> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) >> obj: CREATE connection 0xffff90fff1f0 >> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons >> 00000000, rsp_prod 00000000 >> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT >> (@introduceDomain domlist ) >> >> XS_INTRODUCE (in C xenstored at least, not checked O yet) always >> clobbers the ring pointers. The added pressure on dom0 that the >> xensconsoled adds with it's 4M hypercall bounce buffer occasionally >> defers xenstored long enough that the XS_INTRODUCE clobbers the first >> message that dom1 wrote into the ring. >> >> The other behaviour seen was xenstored observing a header looking like this: >> >> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len >> 0x6d726f66 } >> >> which was rejected as being too long. That's "control/platform" in >> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header >> and writing the payload. >> >> >> Anyway, it is buggy for XS_INTRODUCE to be called on a live an >> unsuspecting connection. It is ultimately init-dom0less's fault for >> telling dom1 it's good to go before having waited for XS_INTRODUCE to >> complete. > > So the problem is xenstored will set interface->connection to XENSTORE_CONNECTED > before finalizing the connection. Caqn you try the following, for now, very > hackish patch: > > diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c > index f62be2245c42..bbf85bbbea3b 100644 > --- a/tools/xenstore/xenstored_domain.c > +++ b/tools/xenstore/xenstored_domain.c > @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx, > talloc_steal(domain->conn, domain); > > if (!restore) { > + domain_conn_reset(domain); > /* Notify the domain that xenstore is available */ > interface->connection = XENSTORE_CONNECTED; I think there are barriers missing (especially in order to work on Arm)? And I think you will break dom0 with calling domain_conn_reset(), as the kernel might already have written data into the xenbus page. So you might want to make the call depend on !is_master_domain. Juergen [-- Attachment #1.1.2: OpenPGP public key --] [-- Type: application/pgp-keys, Size: 3149 bytes --] [-- Attachment #2: OpenPGP digital signature --] [-- Type: application/pgp-signature, Size: 495 bytes --] ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-05-03 15:22 ` Juergen Gross @ 2023-05-03 15:33 ` andrew.cooper3 0 siblings, 0 replies; 15+ messages in thread From: andrew.cooper3 @ 2023-05-03 15:33 UTC (permalink / raw) To: Juergen Gross, Julien Grall, Stefano Stabellini, alejandro.vallejo Cc: committers, michal.orzel, xen-devel, Julien Grall, Roger Pau Monné, Jan Beulich, Edwin Török On 03/05/2023 4:22 pm, Juergen Gross wrote: > On 03.05.23 17:15, Julien Grall wrote: >> Hi, >> >> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote: >>> Hello, >>> >>> After what seems like an unreasonable amount of debugging, we've >>> tracked >>> down exactly what is going wrong here. >>> >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 >>> >>> Of note is the smoke.serial log around: >>> >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) >>> obj: CREATE connection 0xffff90fff1f0 >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons >>> 00000000, rsp_prod 00000000 >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT >>> (@introduceDomain domlist ) >>> >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always >>> clobbers the ring pointers. The added pressure on dom0 that the >>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally >>> defers xenstored long enough that the XS_INTRODUCE clobbers the first >>> message that dom1 wrote into the ring. >>> >>> The other behaviour seen was xenstored observing a header looking >>> like this: >>> >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len >>> 0x6d726f66 } >>> >>> which was rejected as being too long. That's "control/platform" in >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header >>> and writing the payload. >>> >>> >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an >>> unsuspecting connection. It is ultimately init-dom0less's fault for >>> telling dom1 it's good to go before having waited for XS_INTRODUCE to >>> complete. >> >> So the problem is xenstored will set interface->connection to >> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the >> following, for now, very hackish patch: >> >> diff --git a/tools/xenstore/xenstored_domain.c >> b/tools/xenstore/xenstored_domain.c >> index f62be2245c42..bbf85bbbea3b 100644 >> --- a/tools/xenstore/xenstored_domain.c >> +++ b/tools/xenstore/xenstored_domain.c >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void >> *ctx, >> talloc_steal(domain->conn, domain); >> >> if (!restore) { >> + domain_conn_reset(domain); >> /* Notify the domain that xenstore is >> available */ >> interface->connection = XENSTORE_CONNECTED; > > I think there are barriers missing (especially in order to work on Arm)? Yes there are. I think x86 skates by on side effects of hypercalls. > > And I think you will break dom0 with calling domain_conn_reset(), as the > kernel might already have written data into the xenbus page. So you might > want to make the call depend on !is_master_domain. And this is why I am very deliberately not doing anything until the documentation is matches reality, and is safe to use. For starters, shuffling this doesn't make any difference for a domU which hasn't been taught about this optional extension. Ignoring such cases is not an acceptable fix. ~Andrew ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-05-03 15:15 ` Julien Grall 2023-05-03 15:22 ` Juergen Gross @ 2023-05-03 21:53 ` Stefano Stabellini 2023-05-03 22:20 ` andrew.cooper3 1 sibling, 1 reply; 15+ messages in thread From: Stefano Stabellini @ 2023-05-03 21:53 UTC (permalink / raw) To: Julien Grall Cc: andrew.cooper3, Stefano Stabellini, alejandro.vallejo, committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross, Roger Pau Monné, Jan Beulich, Edwin Török [-- Attachment #1: Type: text/plain, Size: 3585 bytes --] On Wed, 3 May 2023, Julien Grall wrote: > On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote: > > Hello, > > > > After what seems like an unreasonable amount of debugging, we've tracked > > down exactly what is going wrong here. > > > > https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 > > > > Of note is the smoke.serial log around: > > > > io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) > > obj: CREATE connection 0xffff90fff1f0 > > *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons > > 00000000, rsp_prod 00000000 > > io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT > > (@introduceDomain domlist ) > > > > XS_INTRODUCE (in C xenstored at least, not checked O yet) always > > clobbers the ring pointers. The added pressure on dom0 that the > > xensconsoled adds with it's 4M hypercall bounce buffer occasionally > > defers xenstored long enough that the XS_INTRODUCE clobbers the first > > message that dom1 wrote into the ring. > > > > The other behaviour seen was xenstored observing a header looking like this: > > > > *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len > > 0x6d726f66 } > > > > which was rejected as being too long. That's "control/platform" in > > ASCII, so the XS_INTRODUCE intersected dom1 between writing the header > > and writing the payload. > > > > > > Anyway, it is buggy for XS_INTRODUCE to be called on a live an > > unsuspecting connection. It is ultimately init-dom0less's fault for > > telling dom1 it's good to go before having waited for XS_INTRODUCE to > > complete. > > So the problem is xenstored will set interface->connection to > XENSTORE_CONNECTED before finalizing the connection. Caqn you try the > following, for now, very hackish patch: > > diff --git a/tools/xenstore/xenstored_domain.c > b/tools/xenstore/xenstored_domain.c > index f62be2245c42..bbf85bbbea3b 100644 > --- a/tools/xenstore/xenstored_domain.c > +++ b/tools/xenstore/xenstored_domain.c > @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx, > talloc_steal(domain->conn, domain); > > if (!restore) { > + domain_conn_reset(domain); > /* Notify the domain that xenstore is available */ > interface->connection = XENSTORE_CONNECTED; > xenevtchn_notify(xce_handle, domain->port); > @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn, > if (!domain) > return errno; > > - domain_conn_reset(domain); > - > send_ack(conn, XS_INTRODUCE); Following Jurgen's suggestion, I made this slightly modified version of the patch. With it, the problem is solved: https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703 diff --git a/tools/xenstore/xenstored_domain.c b/tools/xenstore/xenstored_domain.c index f62be2245c..5ca160d9f2 100644 --- a/tools/xenstore/xenstored_domain.c +++ b/tools/xenstore/xenstored_domain.c @@ -688,6 +688,8 @@ static struct domain *introduce_domain(const void *ctx, talloc_steal(domain->conn, domain); if (!restore) { + if (!is_master_domain) + domain_conn_reset(domain); /* Notify the domain that xenstore is available */ interface->connection = XENSTORE_CONNECTED; xenevtchn_notify(xce_handle, domain->port); @@ -730,8 +732,6 @@ int do_introduce(const void *ctx, struct connection *conn, if (!domain) return errno; - domain_conn_reset(domain); - send_ack(conn, XS_INTRODUCE); return 0; ^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-05-03 21:53 ` Stefano Stabellini @ 2023-05-03 22:20 ` andrew.cooper3 2023-05-03 23:16 ` Stefano Stabellini 0 siblings, 1 reply; 15+ messages in thread From: andrew.cooper3 @ 2023-05-03 22:20 UTC (permalink / raw) To: Stefano Stabellini, Julien Grall Cc: alejandro.vallejo, committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross, Roger Pau Monné, Jan Beulich, Edwin Török On 03/05/2023 10:53 pm, Stefano Stabellini wrote: > On Wed, 3 May 2023, Julien Grall wrote: >> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote: >>> Hello, >>> >>> After what seems like an unreasonable amount of debugging, we've tracked >>> down exactly what is going wrong here. >>> >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 >>> >>> Of note is the smoke.serial log around: >>> >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) >>> obj: CREATE connection 0xffff90fff1f0 >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons >>> 00000000, rsp_prod 00000000 >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT >>> (@introduceDomain domlist ) >>> >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always >>> clobbers the ring pointers. The added pressure on dom0 that the >>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally >>> defers xenstored long enough that the XS_INTRODUCE clobbers the first >>> message that dom1 wrote into the ring. >>> >>> The other behaviour seen was xenstored observing a header looking like this: >>> >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len >>> 0x6d726f66 } >>> >>> which was rejected as being too long. That's "control/platform" in >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header >>> and writing the payload. >>> >>> >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an >>> unsuspecting connection. It is ultimately init-dom0less's fault for >>> telling dom1 it's good to go before having waited for XS_INTRODUCE to >>> complete. >> So the problem is xenstored will set interface->connection to >> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the >> following, for now, very hackish patch: >> >> diff --git a/tools/xenstore/xenstored_domain.c >> b/tools/xenstore/xenstored_domain.c >> index f62be2245c42..bbf85bbbea3b 100644 >> --- a/tools/xenstore/xenstored_domain.c >> +++ b/tools/xenstore/xenstored_domain.c >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx, >> talloc_steal(domain->conn, domain); >> >> if (!restore) { >> + domain_conn_reset(domain); >> /* Notify the domain that xenstore is available */ >> interface->connection = XENSTORE_CONNECTED; >> xenevtchn_notify(xce_handle, domain->port); >> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn, >> if (!domain) >> return errno; >> >> - domain_conn_reset(domain); >> - >> send_ack(conn, XS_INTRODUCE); > Following Jurgen's suggestion, I made this slightly modified version of > the patch. With it, the problem is solved: > > https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703 This fails to solve 3(?) of the 4(?) bugs pointed out between this email thread and on IRC. Stop with the bull-in-a-china-shop approach. There is no acceptable fix to this mess which starts with anything other than corrections to the documentation, and a plan for how to make startup work robustly given all the bugs introduced previously by failing to do it properly the first time around. ~Andrew ^ permalink raw reply [flat|nested] 15+ messages in thread
* Re: dom0less vs xenstored setup race Was: xen | Failed pipeline for staging | 6a47ba2f 2023-05-03 22:20 ` andrew.cooper3 @ 2023-05-03 23:16 ` Stefano Stabellini 0 siblings, 0 replies; 15+ messages in thread From: Stefano Stabellini @ 2023-05-03 23:16 UTC (permalink / raw) To: andrew.cooper3 Cc: Stefano Stabellini, Julien Grall, alejandro.vallejo, committers, michal.orzel, xen-devel, Julien Grall, Juergen Gross, Roger Pau Monné, Jan Beulich, Edwin Török [-- Attachment #1: Type: text/plain, Size: 4056 bytes --] On Wed, 3 May 2023, andrew.cooper3@citrix.com wrote: > On 03/05/2023 10:53 pm, Stefano Stabellini wrote: > > On Wed, 3 May 2023, Julien Grall wrote: > >> On 03/05/2023 15:38, andrew.cooper3@citrix.com wrote: > >>> Hello, > >>> > >>> After what seems like an unreasonable amount of debugging, we've tracked > >>> down exactly what is going wrong here. > >>> > >>> https://gitlab.com/xen-project/people/andyhhp/xen/-/jobs/4219721944 > >>> > >>> Of note is the smoke.serial log around: > >>> > >>> io: IN 0xffff90fec250 d0 20230503 14:20:42 INTRODUCE (1 233473 1 ) > >>> obj: CREATE connection 0xffff90fff1f0 > >>> *** d1 CONN RESET req_cons 00000000, req_prod 0000003a rsp_cons > >>> 00000000, rsp_prod 00000000 > >>> io: OUT 0xffff9105cef0 d0 20230503 14:20:42 WATCH_EVENT > >>> (@introduceDomain domlist ) > >>> > >>> XS_INTRODUCE (in C xenstored at least, not checked O yet) always > >>> clobbers the ring pointers. The added pressure on dom0 that the > >>> xensconsoled adds with it's 4M hypercall bounce buffer occasionally > >>> defers xenstored long enough that the XS_INTRODUCE clobbers the first > >>> message that dom1 wrote into the ring. > >>> > >>> The other behaviour seen was xenstored observing a header looking like this: > >>> > >>> *** d1 HDR { ty 0x746e6f63, rqid 0x2f6c6f72, txid 0x74616c70, len > >>> 0x6d726f66 } > >>> > >>> which was rejected as being too long. That's "control/platform" in > >>> ASCII, so the XS_INTRODUCE intersected dom1 between writing the header > >>> and writing the payload. > >>> > >>> > >>> Anyway, it is buggy for XS_INTRODUCE to be called on a live an > >>> unsuspecting connection. It is ultimately init-dom0less's fault for > >>> telling dom1 it's good to go before having waited for XS_INTRODUCE to > >>> complete. > >> So the problem is xenstored will set interface->connection to > >> XENSTORE_CONNECTED before finalizing the connection. Caqn you try the > >> following, for now, very hackish patch: > >> > >> diff --git a/tools/xenstore/xenstored_domain.c > >> b/tools/xenstore/xenstored_domain.c > >> index f62be2245c42..bbf85bbbea3b 100644 > >> --- a/tools/xenstore/xenstored_domain.c > >> +++ b/tools/xenstore/xenstored_domain.c > >> @@ -688,6 +688,7 @@ static struct domain *introduce_domain(const void *ctx, > >> talloc_steal(domain->conn, domain); > >> > >> if (!restore) { > >> + domain_conn_reset(domain); > >> /* Notify the domain that xenstore is available */ > >> interface->connection = XENSTORE_CONNECTED; > >> xenevtchn_notify(xce_handle, domain->port); > >> @@ -730,8 +731,6 @@ int do_introduce(const void *ctx, struct connection *conn, > >> if (!domain) > >> return errno; > >> > >> - domain_conn_reset(domain); > >> - > >> send_ack(conn, XS_INTRODUCE); > > Following Jurgen's suggestion, I made this slightly modified version of > > the patch. With it, the problem is solved: > > > > https://gitlab.com/xen-project/people/sstabellini/xen/-/pipelines/856450703 > > This fails to solve 3(?) of the 4(?) bugs pointed out between this email > thread and on IRC. > > Stop with the bull-in-a-china-shop approach. There is no acceptable fix > to this mess which starts with anything other than corrections to the > documentation, and a plan for how to make startup work robustly given > all the bugs introduced previously by failing to do it properly the > first time around. I am not suggesting this is the fix (I didn't add any Signed-off-by or commit message on purpose). I think it is useful to know if a theoretical proposal would work in practice as it helps us understand the problem. In the little time I had in-between meetings I thought I could help a bit by providing this update. Like you, I would appreciate a comprehensive fix which includes a documentation update. Genuine question: how would you like to proceed? In the project management sense of who does what and what is the suggested timeline. ^ permalink raw reply [flat|nested] 15+ messages in thread
end of thread, other threads:[~2023-05-03 23:16 UTC | newest]
Thread overview: 15+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
[not found] <644bfbc6939d8_2a49bbb403253f4@gitlab-sidekiq-catchall-v2-78885c497-qxnp2.mail>
2023-04-29 3:05 ` xen | Failed pipeline for staging | 6a47ba2f Stefano Stabellini
2023-04-29 11:41 ` andrew.cooper3
2023-04-29 13:34 ` Marek Marczykowski-Górecki
2023-04-29 19:48 ` andrew.cooper3
2023-04-30 1:21 ` Stefano Stabellini
2023-05-02 3:59 ` Stefano Stabellini
2023-05-02 6:53 ` Jan Beulich
2023-05-02 11:10 ` Alejandro Vallejo
2023-05-03 14:38 ` dom0less vs xenstored setup race Was: " andrew.cooper3
2023-05-03 15:15 ` Julien Grall
2023-05-03 15:22 ` Juergen Gross
2023-05-03 15:33 ` andrew.cooper3
2023-05-03 21:53 ` Stefano Stabellini
2023-05-03 22:20 ` andrew.cooper3
2023-05-03 23:16 ` Stefano Stabellini
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.