From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 9B22AEB64DC for ; Tue, 27 Jun 2023 14:58:02 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 0060D8D0002; Tue, 27 Jun 2023 10:58:02 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id EF7E18D0001; Tue, 27 Jun 2023 10:58:01 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id D990A8D0002; Tue, 27 Jun 2023 10:58:01 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id CA11F8D0001 for ; Tue, 27 Jun 2023 10:58:01 -0400 (EDT) Received: from smtpin06.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay06.hostedemail.com (Postfix) with ESMTP id 59F9AAF884 for ; Tue, 27 Jun 2023 14:58:01 +0000 (UTC) X-FDA: 80948832762.06.CA7901C Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf22.hostedemail.com (Postfix) with ESMTP id CDD26C0031 for ; Tue, 27 Jun 2023 14:57:58 +0000 (UTC) Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=jKkY6hcU; spf=pass (imf22.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1687877879; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=nBsZGNNzl56VjNTKNf9g3STGv+bD1DKhCo4+kBNFHLc=; b=a/iDcwoZoSUWcQnReax+P8o7zCCtfNA1tj0Qlx9kBh2JikkEn+1rdpErkI3mdzP42bmBHU HjpEMrlJ47ubhfYE+QOFUXtwlvAHaLulM8KL2/yAyMgB0Z978BEnHIx1FnsYU+LYHtCh96 NsyCGPGhZ9WRLpyh26HaqqOEd+zzF4s= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=jKkY6hcU; spf=pass (imf22.hostedemail.com: domain of david@redhat.com designates 170.10.133.124 as permitted sender) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1687877879; a=rsa-sha256; cv=none; b=frDgsWMSnY7sWcLN/RIC8LqEgU0Bu9bhyoYFv4daQeZBTrokAWTt6C32ANNyddPpcaDI6V 8y0KtITQpVB32dVjb2ppKHn48+l/fofjWdC0vmqdxtsDrkf3/0JvTl4dux/bGbw1Mxze/X uV2VI9VembfZYRNFbw8y9xNIhHLynQs= DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1687877878; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=nBsZGNNzl56VjNTKNf9g3STGv+bD1DKhCo4+kBNFHLc=; b=jKkY6hcUtvOl3UvjqlTC/lS/R1jxO8Fxq+EYPFzpiuzWLOH9s+/JOns0kc8Lk4KZYuseRe 8HvfktAE6UTF7HwhDd/JGTeOM/jgmJtXKgQTr7vwkCIPWVy+iL1aKVcs0zJ9XjUO+LL0p2 2K8WW0CNU52/vSHOpHbDP92czA0kmQY= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.3, cipher=TLS_AES_256_GCM_SHA384) id us-mta-418-3qOHbb9VN5C5Mo5hFiw7sw-1; Tue, 27 Jun 2023 10:57:56 -0400 X-MC-Unique: 3qOHbb9VN5C5Mo5hFiw7sw-1 Received: by mail-wm1-f71.google.com with SMTP id 5b1f17b1804b1-3fa8f8fb7b3so51902975e9.2 for ; Tue, 27 Jun 2023 07:57:56 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20221208; t=1687877875; x=1690469875; h=content-transfer-encoding:in-reply-to:subject:organization:from :content-language:references:cc:to:user-agent:mime-version:date :message-id:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=nBsZGNNzl56VjNTKNf9g3STGv+bD1DKhCo4+kBNFHLc=; b=AruwNxqSGqLPVfq070HtNhQ6G24NM6BmjazVzN0hApAe70zeDPMS48vk3WWS8XA4u4 VQ8xTuFAH51IcmCrkGYhgmXyAfRDP1BR//rfN0Eiyp0QBjSkNegr6dHZ/6gUoaOodH0d YNlSTBlzkaGfzBSwtIkeHCuuRDqeL91QpbsWv+KKqdQ/AlOdi7oC7CmKt/qChH7PxwO4 xcrq9OXLkMTEoczSbqExA/Vx1mVylnlEx7TWror8ZlUULZbvOygTz/iD/POw7u360ggz b6SZoaE3vPRewGCYZpy2KSyfz6YcjPslcQkYNQritbwR4zGndlNp0FQx5DW7kfqhnavv RKPQ== X-Gm-Message-State: AC+VfDwF/KRXm4BcXHw+VFtslWcC0K7nXhTa6zRPFBt4mt2RFKV6RO9w dLVD3UNjpzd0cvIhNDCES/PEaAAJ6IKZ7/Irp19rf+z+mKXYhKcmcpvtZC+EfvU3VLM1Jf/6ekf DXQiIwG7UZMo= X-Received: by 2002:a1c:f603:0:b0:3f8:f80e:7b64 with SMTP id w3-20020a1cf603000000b003f8f80e7b64mr35939321wmc.17.1687877875235; Tue, 27 Jun 2023 07:57:55 -0700 (PDT) X-Google-Smtp-Source: ACHHUZ4OJIM0ycl9reSJHcWeIgPliu9ko0l3mhlC9E+Sxkbz7ygcqfi/Y9H9AcRdL4MqhL0uGti8Vw== X-Received: by 2002:a1c:f603:0:b0:3f8:f80e:7b64 with SMTP id w3-20020a1cf603000000b003f8f80e7b64mr35939293wmc.17.1687877874781; Tue, 27 Jun 2023 07:57:54 -0700 (PDT) Received: from ?IPV6:2003:cb:c737:4900:68b3:e93b:e07a:558b? (p200300cbc737490068b3e93be07a558b.dip0.t-ipconnect.de. [2003:cb:c737:4900:68b3:e93b:e07a:558b]) by smtp.gmail.com with ESMTPSA id k22-20020a05600c0b5600b003fb40f5f553sm3985669wmr.31.2023.06.27.07.57.53 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Tue, 27 Jun 2023 07:57:54 -0700 (PDT) Message-ID: <0929f4b9-bdad-bcb4-4192-44e88378016b@redhat.com> Date: Tue, 27 Jun 2023 16:57:53 +0200 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.12.0 To: Michal Hocko Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, virtualization@lists.linux-foundation.org, Andrew Morton , "Michael S. Tsirkin" , John Hubbard , Oscar Salvador , Jason Wang , Xuan Zhuo References: <20230627112220.229240-1-david@redhat.com> <20230627112220.229240-4-david@redhat.com> <74cbbdd3-5a05-25b1-3f81-2fd47e089ac3@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8; format=flowed Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: CDD26C0031 X-Rspam-User: X-Stat-Signature: pm7pyjdg6d996u34th9bo8xsn8a4bueg X-Rspamd-Server: rspam01 X-HE-Tag: 1687877878-390929 X-HE-Meta: U2FsdGVkX19mjaFlNmisWBTSOF9ZP0zsscbuE2OrxW/yTZ/LcXjqWUuFOAlmh5OJRTFn0Px0n55MzeQnVis/Wlb7MAVEorq0h2G31tVVK/Vu/7PTNSAHOIDKiVDxyXNER8hIqR+1xY1zo0hwCSKrvmeqjEUBjUXe3eMC01BrkKlgRrqcsFJEQYCPkpLG2GMPK2u5vXU3967DFHDnKAP5qaRF6cd7ZxTtolaeHCQRxg52nrYnsNd7nbiXvmP51VrHOp8hOlBzT6Wi9pyBLmWAQEuETSsmvpqsvvtmaFNWJuxPvaO8g/IbSM1R6gGU+EigOIgl3rPA+Fgc8zyIF5GajcsabGaNiSmxU3zG9JqMe6fnNV8pk3EB6+DAgXWysPDfWoK4hqC8ghyAsfGqkyFVAy5lvm3TRTh6CaTXEjtfNU/4JX3gLxj7p25Q4bSSgG8Lm+wk+aaVPVSuuhG6g/Qv+KegLe8ErrGBvkm7eT+RrXYdrDKhc3NnACt95JvnyXnzDoUT5yjjmz19Athox6ILgaS0hNpJHJqpWbVxya0hOCbiTTTj9+QS8XjaN3wiO4eituj5P2SUJIKc3VCGdj9AVL/RGdUIdzDdW0dcjoNpK5BlXVy7Gy/orbn6qmLi+owQn9KpRCR/pJUh1RQup8/dgKjvtz+/l5wrEnAnmjVkX3Rjn/LjMw2Lo1PLuOshirBl8mtSoIR/WMtCaCXHhGxhA5rzHKdmdYi3Vv6oMys4+/PnYnJCYngSYauY14ZxDYO+9Nd3Pg6DA8ImlAlLn7nUa/WuBBleQGcv066uSO+SczvnYv73GNxaiXxvxgKDC5FxDzF8yl4E51098u7gcVJelf9gr3EqIa2R73wJOTs0oJoy0zf1IdxzmTLKtLkx9rfic+rah7WER5zVBtLgTRLVTK1Vl4pOvCX9wB6gkwqkjnEj4r3RV4pNX8kOG9z0L8bNLVrewOEsFDpoVaH/PgI JTKnpXYW rYrtgIB0sGr4ghN4PMFwEn6+ZSG3Ki0ydwfnx3KUo3NjRgRF6FR4eWit5fqIbHMSE6/ZIXlkZqCw4XUanFAsaIgu1MCKsZcR4eO4ZSNOlfYrVucb9DrD1jatSMgZGggOv11dfqMeIXnaA5ggoJ6PE9SDWhxrnQlVR/0phwyemAVqwl5sxNC7NRB18UNtAWrBZu56O9h6vPt0SKMHJFtDcxRuM+M6dX1JGWQJNkQ6S+hG3MS16UQ4yP952imarcd6YTfCFWO1aQSkM03WK8z9hedWXk+T3D+CCw7HI8cQFoc3vVuKIdd98iVx69xuLlQH3yZzod4efGdmfUfwV1vhir+EhqpCzpe/BF0+D0g9npD/c31o5lW199ZYIu408cfZKQHgr6Ag/+IVcrlrF7B5gUwZH8XPS/J7NFycAIoCaF674Veh9cL/QwIwnuQ== X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 27.06.23 16:17, Michal Hocko wrote: > On Tue 27-06-23 15:14:11, David Hildenbrand wrote: >> On 27.06.23 14:40, Michal Hocko wrote: >>> On Tue 27-06-23 13:22:18, David Hildenbrand wrote: >>>> John Hubbard writes [1]: >>>> >>>> Some device drivers add memory to the system via memory hotplug. >>>> When the driver is unloaded, that memory is hot-unplugged. >>>> >>>> However, memory hot unplug can fail. And these days, it fails a >>>> little too easily, with respect to the above case. Specifically, if >>>> a signal is pending on the process, hot unplug fails. >>>> >>>> [...] >>>> >>>> So in this case, other things (unmovable pages, un-splittable huge >>>> pages) can also cause the above problem. However, those are >>>> demonstrably less common than simply having a pending signal. I've >>>> got bug reports from users who can trivially reproduce this by >>>> killing their process with a "kill -9", for example. >>> >>> This looks like a bug of the said driver no? If the tear down process is >>> killed it could very well happen right before offlining so you end up in >>> the very same state. Or what am I missing? >> >> IIUC (John can correct me if I am wrong): >> >> 1) The process holds the device node open >> 2) The process gets killed or quits >> 3) As the process gets torn down, it closes the device node >> 4) Closing the device node results in the driver removing the device and >> calling offline_and_remove_memory() >> >> So it's not a "tear down process" that triggers that offlining_removal >> somehow explicitly, it's just a side-product of it letting go of the device >> node as the process gets torn down. > > Isn't that just fragile? The operation might fail for other reasons. Why > cannot there be a hold on the resource to control the tear down > explicitly? I'll let John comment on that. But from what I understood, in most setups where ZONE_MOVABLE gets used for hotplugged memory offline_and_remove_memory() succeeds and allows for reusing the device later without a reboot. For the cases where it doesn't work, a reboot is required. > >>>> Especially with ZONE_MOVABLE, offlining is supposed to work in most >>>> cases when offlining actually hotplugged (not boot) memory, and only fail >>>> in rare corner cases (e.g., some driver holds a reference to a page in >>>> ZONE_MOVABLE, turning it unmovable). >>>> >>>> In these corner cases we really don't want to be stuck forever in >>>> offline_and_remove_memory(). But in the general cases, we really want to >>>> do our best to make memory offlining succeed -- in a reasonable >>>> timeframe. >>>> >>>> Reliably failing in the described case when there is a fatal signal pending >>>> is sub-optimal. The pending signal check is mostly only relevant when user >>>> space explicitly triggers offlining of memory using sysfs device attributes >>>> ("state" or "online" attribute), but not when coming via >>>> offline_and_remove_memory(). >>>> >>>> So let's use a timer instead and ignore fatal signals, because they are >>>> not really expressive for offline_and_remove_memory() users. Let's default >>>> to 30 seconds if no timeout was specified, and limit the timeout to 120 >>>> seconds. >>> >>> I really hate having timeouts back. They just proven to be hard to get >>> right and it is essentially a policy implemented in the kernel. They >>> simply do not belong to the kernel space IMHO. >> >> As much as I agree with you in terms of offlining triggered from user space >> (e.g., write "state" or "online" attribute) where user-space is actually in >> charge and can do something reasonable (timeout, retry, whatever), in these >> the offline_and_remove_memory() case it's the driver that wants a >> best-effort memory offlining+removal. >> >> If it times out, virtio-mem will simply try another block or retry later. >> Right now, it could get stuck forever in offline_and_remove_memory(), which >> is obviously "not great". Fortunately, for virtio-mem it's configurable and >> we use the alloc_contig_range()-method for now as default. > > It seems that offline_and_remove_memory is using a wrong operation then. > If it wants an opportunistic offlining with some sort of policy. Timeout > might be just one policy to use but failure mode or a retry count might > be a better fit for some users. So rather than (ab)using offline_pages, > would be make more sense to extract basic offlining steps and allow > drivers like virtio-mem to reuse them and define their own policy? virtio-mem, in default operation, does that: use alloc_contig_range() to logically unplug ("fake offline") that memory and then just trigger offline_and_remove_memory() to make it "officially offline". In that mode, offline_and_remove_memory() cannot really timeout and is almost always going to succeed (except memory notifiers and some hugetlb dissolving). Right now we also allow the admin to configure ordinary offlining directly (without prior fake offlining) when bigger memory blocks are used: offline_pages() is more reliable than alloc_contig_range(), for example, because it disables the PCP and the LRU cache, and retries more often (well, unfortunately then also forever). It has a higher chance of succeeding especially when bigger blocks of memory are offlined+removed. Maybe we should make the alloc_contig_range()-based mechanism more configurable and make it the only mode in virtio-mem, such that we don't have to mess with offline_and_remove_memory() endless loops -- at least for virtio-mem. > >> If it would time out for John's driver, we most certainly don't want to be >> stuck in offline_and_remove_memory(), blocking device/driver unloading (and >> even a reboot IIRC) possibly forever. > > Now I am confused. John driver wants to tear down the device now? There > is no way to release that memory otherwise AFAIU from the initial > problem description. From what I understood if offline_and_remove_memory() succeeded, the device can be reinitialized and used again. Otherwise, a reboot is required because that memory is still added to the system. Thanks Michal! -- Cheers, David / dhildenb