From: David Hildenbrand
Organization: Red Hat
Date: Tue, 27 Jun 2023 15:14:11 +0200
To: Michal Hocko
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, virtualization@lists.linux-foundation.org, Andrew Morton, "Michael S. Tsirkin", John Hubbard, Oscar Salvador, Jason Wang, Xuan Zhuo
Subject: Re: [PATCH v1 3/5] mm/memory_hotplug: make offline_and_remove_memory() timeout instead of failing on fatal signals
Message-ID: <74cbbdd3-5a05-25b1-3f81-2fd47e089ac3@redhat.com>
References: <20230627112220.229240-1-david@redhat.com> <20230627112220.229240-4-david@redhat.com>
On 27.06.23 14:40, Michal Hocko wrote:
> On Tue 27-06-23 13:22:18, David Hildenbrand wrote:
>> John Hubbard writes [1]:
>>
>>   Some device drivers add memory to the system via memory hotplug.
>>   When the driver is unloaded, that memory is hot-unplugged.
>>
>>   However, memory hot unplug can fail. And these days, it fails a
>>   little too easily, with respect to the above case. Specifically, if
>>   a signal is pending on the process, hot unplug fails.
>>
>>   [...]
>>
>>   So in this case, other things (unmovable pages, un-splittable huge
>>   pages) can also cause the above problem. However, those are
>>   demonstrably less common than simply having a pending signal. I've
>>   got bug reports from users who can trivially reproduce this by
>>   killing their process with a "kill -9", for example.
>
> This looks like a bug of the said driver no? If the tear down process is
> killed it could very well happen right before offlining so you end up in
> the very same state. Or what am I missing?

IIUC (John can correct me if I am wrong):

1) The process holds the device node open
2) The process gets killed or quits
3) As the process gets torn down, it closes the device node
4) Closing the device node results in the driver removing the device and
   calling offline_and_remove_memory()

So it's not a "tear down process" that explicitly triggers the
offlining+removal; it's just a side effect of the process letting go of
the device node as it gets torn down.

>
>> Especially with ZONE_MOVABLE, offlining is supposed to work in most
>> cases when offlining actually hotplugged (not boot) memory, and only fail
>> in rare corner cases (e.g., some driver holds a reference to a page in
>> ZONE_MOVABLE, turning it unmovable).
>>
>> In these corner cases we really don't want to be stuck forever in
>> offline_and_remove_memory(). But in the general cases, we really want to
>> do our best to make memory offlining succeed -- in a reasonable
>> timeframe.
>>
>> Reliably failing in the described case when there is a fatal signal pending
>> is sub-optimal. The pending signal check is mostly only relevant when user
>> space explicitly triggers offlining of memory using sysfs device attributes
>> ("state" or "online" attribute), but not when coming via
>> offline_and_remove_memory().
>>
>> So let's use a timer instead and ignore fatal signals, because they are
>> not really expressive for offline_and_remove_memory() users. Let's default
>> to 30 seconds if no timeout was specified, and limit the timeout to 120
>> seconds.
>
> I really hate having timeouts back. They just proven to be hard to get
> right and it is essentially a policy implemented in the kernel. They
> simply do not belong to the kernel space IMHO.
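The mechanism described in the quoted patch text (retry offlining until a deadline instead of bailing out on a pending fatal signal, defaulting to 30 seconds and capping at 120 seconds) can be illustrated with a small user-space C simulation. This is a sketch under stated assumptions, not the kernel code: `effective_timeout()`, `offline_with_timeout()`, the `-EAGAIN`/`-ETIMEDOUT` return conventions and the demo callback are all hypothetical, and seconds are measured with `time()` rather than a kernel clock.

```c
#include <errno.h>
#include <time.h>

/* Hypothetical names for the values mentioned in the patch description. */
#define OFFLINE_DEFAULT_TIMEOUT_SEC 30u
#define OFFLINE_MAX_TIMEOUT_SEC 120u

/* 0 means "no timeout specified": use the default; otherwise cap it. */
static unsigned int effective_timeout(unsigned int timeout_sec)
{
	if (timeout_sec == 0)
		return OFFLINE_DEFAULT_TIMEOUT_SEC;
	if (timeout_sec > OFFLINE_MAX_TIMEOUT_SEC)
		return OFFLINE_MAX_TIMEOUT_SEC;
	return timeout_sec;
}

/* Hypothetical offlining pass: 0 on success, -EAGAIN if still busy. */
typedef int (*offline_pass_fn)(void *ctx);

/*
 * Retry until a pass succeeds, fails hard, or the deadline expires.
 * Pending signals are deliberately not consulted. Uses wall-clock
 * seconds from time(); a real implementation would want a monotonic clock.
 */
static int offline_with_timeout(offline_pass_fn pass, void *ctx,
				unsigned int timeout_sec)
{
	const time_t deadline = time(NULL) + (time_t)effective_timeout(timeout_sec);

	for (;;) {
		int ret = pass(ctx);

		if (ret != -EAGAIN)
			return ret;		/* success (0) or a hard error */
		if (time(NULL) >= deadline)
			return -ETIMEDOUT;	/* give up; the caller decides */
	}
}

/* Demo pass: the range stays busy for a few passes, then goes offline. */
struct demo_range {
	int busy_passes;
	int attempts;
};

static int demo_pass(void *ctx)
{
	struct demo_range *r = ctx;

	r->attempts++;
	if (r->busy_passes > 0) {
		r->busy_passes--;
		return -EAGAIN;
	}
	return 0;
}
```

With `busy_passes = 3`, the demo range is offlined on the fourth pass, well before a 5-second deadline; a hard error such as `-EBUSY` from the pass is returned immediately without retrying.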

As much as I agree with you in terms of offlining triggered from user
space (e.g., writing the "state" or "online" attribute), where user space
is actually in charge and can do something reasonable (timeout, retry,
whatever), in the offline_and_remove_memory() case it's the driver that
wants a best-effort memory offlining+removal.

If it times out, virtio-mem will simply try another block or retry later.
Right now, it could get stuck forever in offline_and_remove_memory(), which
is obviously "not great". Fortunately, for virtio-mem it's configurable, and
we use the alloc_contig_range()-method as the default for now.

In the case where offlining would time out for John's driver, we most
certainly don't want to be stuck in offline_and_remove_memory(), blocking
device/driver unloading (and even a reboot IIRC), possibly forever.

I'd much rather have offline_and_remove_memory() indicate "timeout" to an
in-kernel user a bit earlier than get stuck in there forever.

The timeout parameter gives in-kernel users a bit of flexibility, which I
showcase for virtio-mem, which unplugs smaller blocks and would rather fail
fast and retry later.

Sure, we could make the timeout configurable to optimize for some corner
cases, but that's not really what offline_and_remove_memory() users want,
and I doubt anybody would fine-tune it: they want a best-effort attempt.
And that's IMHO not really a policy; it's an implementation detail of these
drivers.

For example, the driver from John could simply never call
offline_and_remove_memory() and always require a reboot when wanting to
reuse a device. But that's definitely not what users want.

virtio-mem could simply never call offline_and_remove_memory() and indicate
"I don't support unplug of these online memory blocks". But that's
*definitely* not what users want.
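How a driver might consume such a timeout can be sketched the same way: on a timeout it moves on to another block and leaves the stuck one for a later retry, instead of blocking. This is a hypothetical user-space illustration, not virtio-mem's actual code; `unplug_any_block()`, `remove_block_fn` and `demo_remove()` are made-up names, and treating `-ETIMEDOUT` and `-EBUSY` as "skip this block" is an assumption.

```c
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for offline_and_remove_memory() on one block. */
typedef int (*remove_block_fn)(int block_id);

/*
 * Try to unplug one block. A block that times out or is busy is skipped
 * (and could be retried later), so a single stuck block never blocks
 * the driver. Returns the id of the removed block, or -1 if none could
 * be removed this round. Block ids are assumed to be non-negative.
 */
static int unplug_any_block(remove_block_fn try_remove, const int *block_ids,
			    size_t nr_blocks, size_t *nr_skipped)
{
	*nr_skipped = 0;
	for (size_t i = 0; i < nr_blocks; i++) {
		int ret = try_remove(block_ids[i]);

		if (ret == 0)
			return block_ids[i];
		if (ret == -ETIMEDOUT || ret == -EBUSY) {
			(*nr_skipped)++;
			continue;	/* try another block */
		}
		return -1;		/* unexpected error: stop this round */
	}
	return -1;			/* nothing removed: caller retries later */
}

/* Demo: blocks 0 and 1 are stuck, any other block offlines fine. */
static int demo_remove(int block_id)
{
	return block_id < 2 ? -ETIMEDOUT : 0;
}
```

With blocks `{0, 1, 2}`, the sketch skips the two stuck blocks and removes block 2; with only `{0, 1}` it removes nothing and returns -1, which the caller would treat as "retry later".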

I'm very open to alternatives regarding offline_and_remove_memory(); so far
this was the only reasonable thing I could come up with that actually
achieves what we want for these users: not getting stuck in there forever,
but failing sooner rather than later. And most importantly, not somehow
trying to involve user space that isn't even in charge of the offlining
operation.

-- 
Cheers,

David / dhildenb