From: Andrew Cooper <andrew.cooper3@citrix.com>
To: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <Ian.Jackson@eu.citrix.com>,
Ian Campbell <Ian.Campbell@citrix.com>,
Xen-devel <xen-devel@lists.xen.org>
Subject: Re: [PATCH] tools/libxl: Fixes to stream v2 task joining logic
Date: Thu, 23 Jul 2015 14:07:02 +0100 [thread overview]
Message-ID: <55B0E6F6.8030803@citrix.com> (raw)
In-Reply-To: <20150723120346.GA12377@zion.uk.xensource.com>
On 23/07/15 13:03, Wei Liu wrote:
> On Thu, Jul 23, 2015 at 12:42:25PM +0100, Andrew Cooper wrote:
>> On 23/07/15 12:38, Wei Liu wrote:
>>> On Thu, Jul 23, 2015 at 12:09:38PM +0100, Andrew Cooper wrote:
>>>> During review of the libxl migration v2 series, I changes the task
>>>> joining logic, but clearly didn't think the result through
>>>> properly. This patch fixes several errors.
>>>>
>>>> 1) Do not call check_all_finished() in the success cases of
>>>> libxl__xc_domain_{save,restore}_done(). It serves no specific purpose
>>>> as the save helper state will report itself as inactive by this point,
>>>> and avoids triggering a second stream->completion_callback() in the case
>>>> that write_toolstack_record()/stream_continue() record errors
>>>> synchronously themselves.
>>>>
>>>> 2) Only ever set stream->rc in stream_complete(). The first version of
>>>> the migration v2 series had separate rc and joined_rc parameters, where
>>>> this logic worked. However when combining the two, the teardown path
>>>> fails to trigger if stream_done() records stream->rc itself. A side
>>>> effect of this is that stream_done() needs to take an rc parameter.
>>>>
>>>> 3) Avoid stacking of check_all_finished() via synchronous teardown of
>>>> tasks. If the _abort() functions call back synchronously,
>>>> stream->completion_callback() ends up getting called twice, as first and
>>>> last check_all_finished() frames observe each task being finished.
>>>>
>>>> Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
>>>> CC: Ian Campbell <Ian.Campbell@citrix.com>
>>>> CC: Ian Jackson <Ian.Jackson@eu.citrix.com>
>>>> CC: Wei Liu <wei.liu2@citrix.com>
>>>>
>>>> ---
>>>> I found this while working to fix the toolstack record issue, but am
>>>> posting this bugfix ahead of the other work as OSSTest has tripped over
>>>> the issue.
>>> This change itself doesn't seem to have anything to do with libxc. In
>>> OSSTest the error that triggers this knock on effect is the failure of
>>> xc_map_foreign_bulk. Does that mean this patch only fix half of the
>>> problem seen in OSSTest?
>> Saving to image new xl format (info 0x3/0x0/1797)
>> xc: info: In experimental xc_domain_save2
>> xc: info: Saving domain 2, type x86 HVM
>> xc: progress: Memory: 67584/1344513 5%
>> xc: progress: Memory: 135168/1344513 10%
>> xc: progress: Memory: 201728/1344513 15%
>> xc: progress: Memory: 269312/1344513 20%
>> xc: progress: Memory: 336896/1344513 25%
>> xc: progress: Memory: 403456/1344513 30%
>> xc: progress: Memory: 471040/1344513 35%
>> xc: progress: Memory: 538624/1344513 40%
>> xc: progress: Memory: 605184/1344513 45%
>> xc: progress: Memory: 672768/1344513 50%
>> xc: progress: Memory: 740352/1344513 55%
>> xc: error: xc_map_foreign_bulk: mmap failed (12 = Cannot allocate
>> memory): Internal error
>> xc: error: Failed to map guest pages (12 = Cannot allocate memory):
>> Internal error
>> xc: error: Save failed (12 = Cannot allocate memory): Internal error
>> libxl: error: libxl_stream_write.c:267:libxl__xc_domain_save_done:
>> saving domain: domain responded to suspend request: Cannot allocate memory
>> xl: libxl_event.c:1866: libxl__ao_inprogress_gc: Assertion
>> `!ao->complete' failed.
>>
>>
>> I don't know why xc_map_foreign_bulk() failed, but at a guess I would
>> day dom0 is out of RAM.
>>
> I don't think so. There is no OOM error in serial log and dmesg.
>
> A similar failure is discovered in QEMU mainline test. That one is using
> OVMF + QEMU upstream. So this failure is not related to firmware
> or device model.
>
> http://logs.test-lab.xenproject.org/osstest/logs/59819/test-amd64-i386-xl-qemuu-debianhvm-amd64-xsm/info.html
>
> It should be a real bug manifests only on 32bit toolstack + guest with
> large amount of ram (5000 in this case).
>
> Looks like we have one more bug to hunt down.
>From linux_privcmd_map_foreign_bulk()
addr = mmap(NULL, (unsigned long)num << XC_PAGE_SHIFT, prot, MAP_SHARED,
fd, 0);
if ( addr == MAP_FAILED )
{
PERROR("xc_map_foreign_bulk: mmap failed");
return NULL;
}
So the mmap() call did genuinely fail with ENOMEM.
Migration v2 only ever has 1024 guest pages ever mapped at once (see
xc_sr_save.c:write_batch() ), and doesn't leak the mapping, which means
that the kernel decided that it was out of memory.
~Andrew
next prev parent reply other threads:[~2015-07-23 13:07 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2015-07-23 11:09 [PATCH] tools/libxl: Fixes to stream v2 task joining logic Andrew Cooper
2015-07-23 11:38 ` Wei Liu
2015-07-23 11:42 ` Andrew Cooper
2015-07-23 12:03 ` Wei Liu
2015-07-23 13:07 ` Andrew Cooper [this message]
2015-07-23 14:02 ` Ian Jackson
2015-07-24 11:41 ` Ian Jackson
2015-07-27 14:30 ` Andrew Cooper
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=55B0E6F6.8030803@citrix.com \
--to=andrew.cooper3@citrix.com \
--cc=Ian.Campbell@citrix.com \
--cc=Ian.Jackson@eu.citrix.com \
--cc=wei.liu2@citrix.com \
--cc=xen-devel@lists.xen.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.