From: David Hildenbrand <david@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>,
Eduardo Habkost <ehabkost@redhat.com>,
"Michael S. Tsirkin" <mst@redhat.com>,
"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
qemu-devel@nongnu.org, Pankaj Gupta <pankaj.gupta@ionos.com>,
Igor Mammedov <imammedo@redhat.com>,
Paolo Bonzini <pbonzini@redhat.com>,
Marek Kedzierski <mkedzier@redhat.com>
Subject: Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
Date: Thu, 22 Jul 2021 16:13:26 +0200
Message-ID: <3e10dfdb-0a6d-69fc-e3f6-c62345cb03f1@redhat.com>
In-Reply-To: <YPl25wrKvadQW7Ff@redhat.com>
On 22.07.21 15:47, Daniel P. Berrangé wrote:
> On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
>> On 22.07.21 15:31, Daniel P. Berrangé wrote:
>>> On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
>>>> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
>>>> does not require a SIGBUS handler, doesn't actually touch page content,
>>>> and avoids context switches; it is, therefore, faster and easier to handle
>>>> than our current approach.
>>>>
>>>> While MADV_POPULATE_WRITE is, in general, faster than manual
>>>> prefaulting, and especially faster with 4k pages, there is still value in
>>>> prefaulting using multiple threads to speed up preallocation.
>>>>
>>>> More details on MADV_POPULATE_WRITE can be found in the Linux commit
>>>> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
>>>> page tables") and in the man page proposal [1].
>>>>
>>>> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>>>>
>>>> This resolves the TODO in do_touch_pages().
>>>>
>>>> In the future, we might want to look into using fallocate(), eventually
>>>> combined with MADV_POPULATE_READ, when dealing with shared file
>>>> mappings.
>>>>
>>>> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>> include/qemu/osdep.h | 7 ++++
>>>> util/oslib-posix.c | 88 +++++++++++++++++++++++++++++++++-----------
>>>> 2 files changed, 74 insertions(+), 21 deletions(-)
>>>
>>>
>>>> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>>>> return NULL;
>>>> }
>>>> +static void *do_madv_populate_write_pages(void *arg)
>>>> +{
>>>> + MemsetThread *memset_args = (MemsetThread *)arg;
>>>> + const size_t size = memset_args->numpages * memset_args->hpagesize;
>>>> + char * const addr = memset_args->addr;
>>>> + int ret;
>>>> +
>>>> + if (!size) {
>>>> + return NULL;
>>>> + }
>>>> +
>>>> + /* See do_touch_pages(). */
>>>> + qemu_mutex_lock(&page_mutex);
>>>> + while (!threads_created_flag) {
>>>> + qemu_cond_wait(&page_cond, &page_mutex);
>>>> + }
>>>> + qemu_mutex_unlock(&page_mutex);
>>>> +
>>>> + ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
>>>> + if (ret) {
>>>> + memset_thread_failed = true;
>>>
>>> I'm wondering if this use of memset_thread_failed is sufficient.
>>>
>>> This is pre-existing from the current impl, and ends up being
>>> used to set the bool result of 'touch_all_pages'. The caller
>>> of that then does
>>>
>>> if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
>>> error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>>> "pages available to allocate guest RAM");
>>> }
>>>
>>> this was reasonable with the old impl, because the only reason
>>> we ever see 'memset_thread_failed==true' is if we got SIGBUS
>>> due to ENOMEM.
>>>
>>> My concern is that madvise() has a bunch of possible errno
>>> codes returned on failure, and we're not distinguishing
>>> them. In the past this kind of thing has burnt us making
>>> failures hard to debug.
>>>
>>> Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
>>>
>>> Then, we can make 'touch_all_pages' have an 'Error **errp'
>>> parameter, and it can directly call
>>>
>>> error_setg_errno(errp, memset_thread_errno, ....some message...)
>>>
>>> when memset_thread_errno is non-zero, and thus we can remove
>>> the generic message from the caller of touch_all_pages.
>>>
>>> If you agree, it'd be best to refactor the existing code to
>>> use this pattern in an initial patch.
>>
>> We could also simply trace the return value, which should be comparatively
>> easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
>> latter is highly unlikely to happen when actually preallocating.
>>
>> We made sure that we don't end up with -EINVAL, as we're sensing whether
>> MADV_POPULATE_WRITE works on the mapping.
>
> Those are in the "normal" usage scenarios. I'm wondering about the
> abnormal scenarios where QEMU code is mistakenly screwed up or
> libvirt / mgmt app makes some config mistake. eg we can get
> things like EPERM if selinux or seccomp block the madvise
> syscall by mistake (common if QEMU is inside docker for example),
> or we can get EINVAL if the 'addr' is not page aligned, and so on.
>
>> So when it comes to debugging, I'd actually prefer tracing -errno, as the
>> real error will be of little help to end users.
>
> I don't care about the end users interpreting it, rather us as maintainers
> who get a bug report containing insufficient info to diagnose the root
> cause.
Well, okay. I'll have a look at how this turns out.
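For reference, a minimal standalone sketch of the pattern Daniel describes
above: each worker records the errno from madvise() instead of setting a
bool, and the caller turns the first recorded value into a message. This is
not the QEMU code. PopulateJob, populate_worker() and populate_range() are
hypothetical names, plain pthreads stand in for the QEMU thread helpers, and
snprintf()/strerror() stand in for error_setg_errno(). The fallback value 23
for MADV_POPULATE_WRITE is the one from the Linux UAPI headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Older libc headers may lack the definition (Linux UAPI value). */
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/* Hypothetical stand-in for MemsetThread; not QEMU's actual struct. */
typedef struct {
    char *addr;
    size_t size;
    int *first_errno;       /* shared; 0 means no failure recorded yet */
    pthread_mutex_t *lock;  /* protects *first_errno */
} PopulateJob;

static void *populate_worker(void *arg)
{
    PopulateJob *job = arg;

    if (job->size && madvise(job->addr, job->size, MADV_POPULATE_WRITE)) {
        int err = errno;    /* save before another call can clobber it */

        pthread_mutex_lock(job->lock);
        if (*job->first_errno == 0) {
            *job->first_errno = err;
        }
        pthread_mutex_unlock(job->lock);
    }
    return NULL;
}

/*
 * Returns 0 on success, or the first errno recorded. On failure, msg
 * receives a description that includes the errno text.
 */
static int populate_range(char *addr, size_t size, int nthreads,
                          char *msg, size_t msglen)
{
    enum { MAX_THREADS = 16 };
    pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    pthread_t tids[MAX_THREADS];
    PopulateJob jobs[MAX_THREADS];
    const size_t pagesize = (size_t)sysconf(_SC_PAGESIZE);
    int first_errno = 0;
    int started = 0;

    assert(nthreads >= 1 && nthreads <= MAX_THREADS);

    /* Page-granular chunk per thread; the last thread takes the rest. */
    const size_t chunk = (size / (size_t)nthreads) / pagesize * pagesize;

    for (int i = 0; i < nthreads; i++) {
        jobs[i].addr = addr + (size_t)i * chunk;
        jobs[i].size = (i == nthreads - 1) ? size - (size_t)i * chunk : chunk;
        jobs[i].first_errno = &first_errno;
        jobs[i].lock = &lock;

        int ret = pthread_create(&tids[i], NULL, populate_worker, &jobs[i]);
        if (ret) {
            /* pthread_create() returns the error number directly. */
            pthread_mutex_lock(&lock);
            if (first_errno == 0) {
                first_errno = ret;
            }
            pthread_mutex_unlock(&lock);
            break;
        }
        started++;
    }

    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }

    if (first_errno) {
        snprintf(msg, msglen, "preallocating memory failed: %s (errno %d)",
                 strerror(first_errno), first_errno);
    }
    return first_errno;
}
```

With this shape, a misaligned 'addr' or a blocked syscall shows up as EINVAL
or EPERM in the message rather than as the generic "Insufficient free host
memory" text. Only the first errno is kept, which is an assumption of this
sketch; recording one value per thread would work just as well.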
--
Thanks,
David / dhildenb
Thread overview: 16+ messages
2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
2021-07-22 13:31 ` Daniel P. Berrangé
2021-07-22 13:39 ` David Hildenbrand
2021-07-22 13:47 ` Daniel P. Berrangé
2021-07-22 14:13 ` David Hildenbrand [this message]
2021-07-22 12:36 ` [PATCH v2 2/6] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
2021-07-27 19:01 ` Dr. David Alan Gilbert
2021-07-28 11:23 ` Pankaj Gupta
2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
2021-07-27 19:04 ` Dr. David Alan Gilbert
2021-07-30 15:13 ` David Hildenbrand
2021-07-28 11:32 ` Pankaj Gupta
2021-07-22 12:36 ` [PATCH v2 5/6] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 6/6] util/oslib-posix: Forward SIGBUS to MCE handler under Linux David Hildenbrand