linux-mm.kvack.org archive mirror
From: Michal Hocko <mhocko@suse.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	virtualization@lists.linux-foundation.org,
	Andrew Morton <akpm@linux-foundation.org>,
	"Michael S. Tsirkin" <mst@redhat.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Oscar Salvador <osalvador@suse.de>,
	Jason Wang <jasowang@redhat.com>,
	Xuan Zhuo <xuanzhuo@linux.alibaba.com>
Subject: Re: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals
Date: Tue, 27 Jun 2023 17:14:20 +0200
Message-ID: <ZJr8zM/Van7UaUif@dhcp22.suse.cz>
In-Reply-To: <0929f4b9-bdad-bcb4-4192-44e88378016b@redhat.com>

On Tue 27-06-23 16:57:53, David Hildenbrand wrote:
> On 27.06.23 16:17, Michal Hocko wrote:
> > On Tue 27-06-23 15:14:11, David Hildenbrand wrote:
> > > On 27.06.23 14:40, Michal Hocko wrote:
> > > > On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
> > > > > John Hubbard writes [1]:
> > > > > 
> > > > >           Some device drivers add memory to the system via memory hotplug.
> > > > >           When the driver is unloaded, that memory is hot-unplugged.
> > > > > 
> > > > >           However, memory hot unplug can fail. And these days, it fails a
> > > > >           little too easily, with respect to the above case. Specifically, if
> > > > >           a signal is pending on the process, hot unplug fails.
> > > > > 
> > > > >           [...]
> > > > > 
> > > > >           So in this case, other things (unmovable pages, un-splittable huge
> > > > >           pages) can also cause the above problem. However, those are
> > > > >           demonstrably less common than simply having a pending signal. I've
> > > > >           got bug reports from users who can trivially reproduce this by
> > > > >           killing their process with a "kill -9", for example.
> > > > 
> > > > > This looks like a bug in said driver, no? If the teardown process is
> > > > > killed, that could very well happen right before offlining, so you end
> > > > > up in the very same state. Or what am I missing?
> > > 
> > > IIUC (John can correct me if I am wrong):
> > > 
> > > 1) The process holds the device node open
> > > 2) The process gets killed or quits
> > > 3) As the process gets torn down, it closes the device node
> > > 4) Closing the device node results in the driver removing the device and
> > >     calling offline_and_remove_memory()
> > > 
> > > So it's not a "tear down process" that somehow explicitly triggers the
> > > offlining+removal; it's just a side effect of the process letting go of
> > > the device node as it gets torn down.
> > 
> > Isn't that just fragile? The operation might fail for other reasons. Why
> > cannot there be a hold on the resource to control the tear down
> > explicitly?
> 
> I'll let John comment on that. But from what I understood, in most setups
> where ZONE_MOVABLE gets used for hotplugged memory
> offline_and_remove_memory() succeeds and allows for reusing the device later
> without a reboot.
> 
> For the cases where it doesn't work, a reboot is required.

Then the solution should be really robust and have the means to handle
the failure, e.g. by retrying or alerting the admin.
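
To illustrate what "retrying or alerting the admin" could look like, here
is a minimal user-space sketch, not kernel code. Every name in it
(unplug_with_retry, try_offline_fn, alert_fn, the demo stubs) is
hypothetical and only stands in for whatever the driver would really call:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical outcome codes; not taken from the kernel. */
enum unplug_result { UNPLUG_OK, UNPLUG_GAVE_UP };

/* Hypothetical callback: returns true when one offlining attempt succeeds. */
typedef bool (*try_offline_fn)(void *ctx);

/* Hypothetical callback that tells the admin manual action is needed. */
typedef void (*alert_fn)(const char *msg);

/*
 * Try a bounded number of times; if every attempt fails, raise an alert
 * instead of failing silently or looping forever.
 */
static enum unplug_result unplug_with_retry(try_offline_fn try_offline,
					    void *ctx, int max_attempts,
					    alert_fn alert)
{
	assert(max_attempts > 0);

	for (int i = 0; i < max_attempts; i++) {
		if (try_offline(ctx))
			return UNPLUG_OK;
	}

	alert("memory could not be offlined; device reuse may need a reboot");
	return UNPLUG_GAVE_UP;
}

/* Demo stub: fails until the counter behind ctx reaches zero. */
static bool succeed_after(void *ctx)
{
	int *remaining_failures = ctx;

	if (*remaining_failures > 0) {
		(*remaining_failures)--;
		return false;
	}
	return true;
}

/* Demo stub: count alerts and print them to stderr. */
static int alerts_raised;

static void count_alert(const char *msg)
{
	fprintf(stderr, "ALERT: %s\n", msg);
	alerts_raised++;
}
```

The point of the sketch is only that the failure path is explicit: the
caller learns that the unplug gave up, and somebody is told about it.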

> > > > > Especially with ZONE_MOVABLE, offlining is supposed to work in most
> > > > > cases when offlining actually hotplugged (not boot) memory, and only fail
> > > > > in rare corner cases (e.g., some driver holds a reference to a page in
> > > > > ZONE_MOVABLE, turning it unmovable).
> > > > > 
> > > > > In these corner cases we really don't want to be stuck forever in
> > > > > offline_and_remove_memory(). But in the general cases, we really want to
> > > > > do our best to make memory offlining succeed -- in a reasonable
> > > > > timeframe.
> > > > > 
> > > > > Reliably failing in the described case when there is a fatal signal pending
> > > > > is sub-optimal. The pending signal check is mostly only relevant when user
> > > > > space explicitly triggers offlining of memory using sysfs device attributes
> > > > > ("state" or "online" attribute), but not when coming via
> > > > > offline_and_remove_memory().
> > > > > 
> > > > > So let's use a timer instead and ignore fatal signals, because they are
> > > > > not really expressive for offline_and_remove_memory() users. Let's default
> > > > > to 30 seconds if no timeout was specified, and limit the timeout to 120
> > > > > seconds.
> > > > 
> > > > > I really hate having timeouts back. They have proven to be hard to
> > > > > get right, and a timeout is essentially a policy implemented in the
> > > > > kernel. They simply do not belong in kernel space IMHO.
> > > 
> > > > As much as I agree with you in terms of offlining triggered from user space
> > > > (e.g., write "state" or "online" attribute) where user space is actually in
> > > > charge and can do something reasonable (timeout, retry, whatever), in the
> > > > offline_and_remove_memory() case it's the driver that wants a
> > > > best-effort memory offlining+removal.
> > > 
> > > If it times out, virtio-mem will simply try another block or retry later.
> > > Right now, it could get stuck forever in offline_and_remove_memory(), which
> > > is obviously "not great". Fortunately, for virtio-mem it's configurable and
> > > we use the alloc_contig_range()-method for now as default.
> > 
> > > It seems that offline_and_remove_memory() is using the wrong operation
> > > then, if it wants opportunistic offlining with some sort of policy. A
> > > timeout might be just one policy to use, but a failure mode or a retry
> > > count might be a better fit for some users. So rather than (ab)using
> > > offline_pages(), would it make more sense to extract the basic offlining
> > > steps and allow drivers like virtio-mem to reuse them and define their
> > > own policy?
> 
> virtio-mem, in default operation, does that: use alloc_contig_range() to
> logically unplug ("fake offline") that memory and then just trigger
> offline_and_remove_memory() to make it "officially offline".
> 
> In that mode, offline_and_remove_memory() cannot really time out and is
> almost always going to succeed (except for memory notifiers and some hugetlb
> dissolving).
> 
> Right now we also allow the admin to configure ordinary offlining directly
> (without prior fake offlining) when bigger memory blocks are used:
> offline_pages() is more reliable than alloc_contig_range(), for example,
> because it disables the PCP and the LRU cache, and retries more often (well,
> unfortunately then also forever). It has a higher chance of succeeding
> especially when bigger blocks of memory are offlined+removed.
> 
> Maybe we should make the alloc_contig_range()-based mechanism more
> configurable and make it the only mode in virtio-mem, such that we don't
> have to mess with offline_and_remove_memory() endless loops -- at least for
> virtio-mem.

Yes, that sounds better than hooking into offline_pages() the way this
patch is doing.
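
For the "drivers define their own policy" idea discussed above, a rough
user-space sketch of the shape I have in mind follows. It assumes a
hypothetical single-shot primitive that makes one pass and never loops
internally; struct offline_policy, offline_with_policy() and the stubs
are made up for illustration and do not exist in the tree:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical policy knobs owned by the driver, not by core mm. */
struct offline_policy {
	unsigned int max_attempts;	/* 0 means no attempt limit */
	unsigned long deadline_ticks;	/* 0 means no deadline */
};

enum offline_status {
	OFFLINE_DONE,
	OFFLINE_RETRIES_EXHAUSTED,
	OFFLINE_TIMED_OUT,
};

/* Hypothetical single-shot primitive: one pass over the block, no looping. */
typedef bool (*offline_once_fn)(void *block);

/* Hypothetical monotonic clock supplied by the caller. */
typedef unsigned long (*clock_fn)(void);

/* The retry loop, and therefore the policy, lives in the caller. */
static enum offline_status offline_with_policy(offline_once_fn once,
					       void *block,
					       const struct offline_policy *p,
					       clock_fn now)
{
	/* Refuse a policy that would allow an endless loop. */
	assert(p->max_attempts || p->deadline_ticks);

	unsigned long start = now();

	for (unsigned int attempt = 0; ; attempt++) {
		if (p->max_attempts && attempt >= p->max_attempts)
			return OFFLINE_RETRIES_EXHAUSTED;
		if (p->deadline_ticks && now() - start >= p->deadline_ticks)
			return OFFLINE_TIMED_OUT;
		if (once(block))
			return OFFLINE_DONE;
	}
}

/* Demo stubs: a fake clock that advances one tick per read. */
static unsigned long fake_ticks;

static unsigned long fake_clock(void)
{
	return fake_ticks++;
}

static bool never_succeeds(void *block)
{
	(void)block;
	return false;
}

static bool always_succeeds(void *block)
{
	(void)block;
	return true;
}
```

A timeout, a retry count, or both are then just different values of the
policy struct chosen by the driver, and the core primitive stays free of
any policy.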
-- 
Michal Hocko
SUSE Labs


