qemu-devel.nongnu.org archive mirror
From: David Hildenbrand <david@redhat.com>
To: "Daniel P. Berrangé" <berrange@redhat.com>
Cc: Pankaj Gupta <pankaj.gupta.linux@gmail.com>,
	Eduardo Habkost <ehabkost@redhat.com>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
	qemu-devel@nongnu.org, Pankaj Gupta <pankaj.gupta@ionos.com>,
	Igor Mammedov <imammedo@redhat.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Marek Kedzierski <mkedzier@redhat.com>
Subject: Re: [PATCH v2 1/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc()
Date: Thu, 22 Jul 2021 16:13:26 +0200
Message-ID: <3e10dfdb-0a6d-69fc-e3f6-c62345cb03f1@redhat.com>
In-Reply-To: <YPl25wrKvadQW7Ff@redhat.com>

On 22.07.21 15:47, Daniel P. Berrangé wrote:
> On Thu, Jul 22, 2021 at 03:39:50PM +0200, David Hildenbrand wrote:
>> On 22.07.21 15:31, Daniel P. Berrangé wrote:
>>> On Thu, Jul 22, 2021 at 02:36:30PM +0200, David Hildenbrand wrote:
>>>> Let's sense support and use it for preallocation. MADV_POPULATE_WRITE
>>>> does not require a SIGBUS handler, doesn't actually touch page content,
>>>> and avoids context switches; it is, therefore, faster and easier to handle
>>>> than our current approach.
>>>>
>>>> While MADV_POPULATE_WRITE is, in general, faster than manual
>>>> prefaulting, and especially faster with 4k pages, there is still value in
>>>> prefaulting using multiple threads to speed up preallocation.
>>>>
>>>> More details on MADV_POPULATE_WRITE can be found in the Linux commit
>>>> 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault
>>>> page tables") and in the man page proposal [1].
>>>>
>>>> [1] https://lkml.kernel.org/r/20210712083917.16361-1-david@redhat.com
>>>>
>>>> This resolves the TODO in do_touch_pages().
>>>>
>>>> In the future, we might want to look into using fallocate(), eventually
>>>> combined with MADV_POPULATE_READ, when dealing with shared file
>>>> mappings.
>>>>
>>>> Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    include/qemu/osdep.h |  7 ++++
>>>>    util/oslib-posix.c   | 88 +++++++++++++++++++++++++++++++++-----------
>>>>    2 files changed, 74 insertions(+), 21 deletions(-)
>>>
>>>
>>>> @@ -497,6 +493,31 @@ static void *do_touch_pages(void *arg)
>>>>        return NULL;
>>>>    }
>>>> +static void *do_madv_populate_write_pages(void *arg)
>>>> +{
>>>> +    MemsetThread *memset_args = (MemsetThread *)arg;
>>>> +    const size_t size = memset_args->numpages * memset_args->hpagesize;
>>>> +    char * const addr = memset_args->addr;
>>>> +    int ret;
>>>> +
>>>> +    if (!size) {
>>>> +        return NULL;
>>>> +    }
>>>> +
>>>> +    /* See do_touch_pages(). */
>>>> +    qemu_mutex_lock(&page_mutex);
>>>> +    while (!threads_created_flag) {
>>>> +        qemu_cond_wait(&page_cond, &page_mutex);
>>>> +    }
>>>> +    qemu_mutex_unlock(&page_mutex);
>>>> +
>>>> +    ret = qemu_madvise(addr, size, QEMU_MADV_POPULATE_WRITE);
>>>> +    if (ret) {
>>>> +        memset_thread_failed = true;
>>>
>>> I'm wondering if this use of memset_thread_failed is sufficient.
>>>
>>> This is pre-existing from the current impl, and ends up being
>>> used to set the bool result of 'touch_all_pages'. The caller
>>> of that then does
>>>
>>>       if (touch_all_pages(area, hpagesize, numpages, smp_cpus)) {
>>>           error_setg(errp, "os_mem_prealloc: Insufficient free host memory "
>>>               "pages available to allocate guest RAM");
>>>       }
>>>
>>> this was reasonable with the old impl, because the only reason
>>> we ever see 'memset_thread_failed==true' is if we got SIGBUS
>>> due to ENOMEM.
>>>
>>> My concern is that madvise() has a bunch of possible errno
>>> codes returned on failure, and we're not distinguishing
>>> them. In the past this kind of thing has burnt us making
>>> failures hard to debug.
>>>
>>> Could we turn 'bool memset_thread_failed' into 'int memset_thread_errno'
>>>
>>> Then, we can make 'touch_all_pages' have an 'Error **errp'
>>> parameter, and it can directly call
>>>
>>>    error_setg_errno(errp, memset_thread_errno, ....some message...)
>>>
>>> when memset_thread_errno is non-zero, and thus we can remove
>>> the generic message from the caller of touch_all_pages.
>>>
>>> If you agree, it'd be best to refactor the existing code to
>>> use this pattern in an initial patch.
>>
>> We could also simply trace the return value, which should be comparatively
>> easy to add. We should be getting either -ENOMEM or -EHWPOISON. And the
>> latter is highly unlikely to happen when actually preallocating.
>>
>> We made sure that we don't end up with -EINVAL as we're sensing if
>> MADV_POPULATE_WRITE works on the mapping.
> 
> Those are in the "normal" usage scenarios. I'm wondering about the
> abnormal scenarios where QEMU code is mistakenly screwed up or
> libvirt / mgmt app makes some config mistake. E.g. we can get
> things like EPERM if selinux or seccomp block the madvise
> syscall by mistake (common if QEMU is inside docker for example),
> or we can get EINVAL if the 'addr' is not page aligned, and so on.
> 
>> So when it comes to debugging, I'd actually prefer tracing -errno, as the
>> real error will be of little help to end users.
> 
> I don't care about the end users interpreting it, rather us as maintainers
> who get a bug report containing insufficient info to diagnose the root
> cause.

Well, okay. I'll have a look at how this turns out.
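
For reference, a minimal standalone sketch of the pattern suggested above:
replace the shared bool with an int that records the first errno seen by any
worker, so the caller can report the real cause. This is not the QEMU code.
It uses plain pthreads and C11 atomics, leaves out the page_mutex/page_cond
start barrier, and the names PopulateThread, populate_worker,
populate_all_pages and MAX_THREADS are hypothetical. The fallback value 23
for MADV_POPULATE_WRITE is the Linux uapi value, for older headers.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23 /* Linux 5.14+; older headers lack it */
#endif

#define MAX_THREADS 16 /* hypothetical cap, for this sketch only */

/* Hypothetical per-thread arguments, loosely modelled on MemsetThread. */
typedef struct {
    char *addr;
    size_t size;
    atomic_int *first_errno; /* shared; 0 means "no failure so far" */
} PopulateThread;

static void *populate_worker(void *arg)
{
    PopulateThread *t = arg;

    if (t->size && madvise(t->addr, t->size, MADV_POPULATE_WRITE)) {
        int expected = 0;
        /* Keep only the first errno; later failures do not overwrite it. */
        atomic_compare_exchange_strong(t->first_errno, &expected, errno);
    }
    return NULL;
}

/* Returns 0 on success, or the positive errno of the first failure. */
static int populate_all_pages(char *area, size_t pagesize, size_t numpages,
                              size_t nthreads)
{
    pthread_t tids[MAX_THREADS];
    PopulateThread args[MAX_THREADS];
    atomic_int first_errno = 0;
    size_t started = 0;

    assert(nthreads > 0 && nthreads <= MAX_THREADS);

    for (size_t i = 0; i < nthreads; i++) {
        /* Spread the remainder over the first (numpages % nthreads) threads. */
        size_t pages = numpages / nthreads + (i < numpages % nthreads ? 1 : 0);
        int ret;

        args[i] = (PopulateThread){ area, pages * pagesize, &first_errno };
        area += pages * pagesize;

        ret = pthread_create(&tids[i], NULL, populate_worker, &args[i]);
        if (ret) {
            int expected = 0;
            /* pthread_create() returns the error number directly. */
            atomic_compare_exchange_strong(&first_errno, &expected, ret);
            break;
        }
        started++;
    }

    for (size_t i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
    }
    return atomic_load(&first_errno);
}
```

In QEMU, the non-zero return value would presumably be passed to
error_setg_errno(), as proposed above, instead of the generic "Insufficient
free host memory" message. With this sketch, a misaligned 'addr' and a kernel
without MADV_POPULATE_WRITE both surface as EINVAL rather than as a bare
"failed" flag.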

-- 
Thanks,

David / dhildenb



Thread overview: 16+ messages
2021-07-22 12:36 [PATCH v2 0/6] util/oslib-posix: Support MADV_POPULATE_WRITE for os_mem_prealloc() David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 1/6] " David Hildenbrand
2021-07-22 13:31   ` Daniel P. Berrangé
2021-07-22 13:39     ` David Hildenbrand
2021-07-22 13:47       ` Daniel P. Berrangé
2021-07-22 14:13         ` David Hildenbrand [this message]
2021-07-22 12:36 ` [PATCH v2 2/6] util/oslib-posix: Introduce and use MemsetContext for touch_all_pages() David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 3/6] util/oslib-posix: Don't create too many threads with small memory or little pages David Hildenbrand
2021-07-27 19:01   ` Dr. David Alan Gilbert
2021-07-28 11:23   ` Pankaj Gupta
2021-07-22 12:36 ` [PATCH v2 4/6] util/oslib-posix: Avoid creating a single thread with MADV_POPULATE_WRITE David Hildenbrand
2021-07-27 19:04   ` Dr. David Alan Gilbert
2021-07-30 15:13     ` David Hildenbrand
2021-07-28 11:32   ` Pankaj Gupta
2021-07-22 12:36 ` [PATCH v2 5/6] util/oslib-posix: Support concurrent os_mem_prealloc() invocation David Hildenbrand
2021-07-22 12:36 ` [PATCH v2 6/6] util/oslib-posix: Forward SIGBUS to MCE handler under Linux David Hildenbrand
