Date: Mon, 3 May 2021 14:33:43 +0300
From: Mike Rapoport
To: David Hildenbrand
Cc: linux-kernel@vger.kernel.org, Andrew Morton, "Michael S. Tsirkin",
	Jason Wang, Alexey Dobriyan, "Matthew Wilcox (Oracle)", Oscar Salvador,
	Michal Hocko, Roman Gushchin, Alex Shi, Steven Price, Mike Kravetz,
	Aili Yao, Jiri Bohac, "K. Y. Srinivasan", Haiyang Zhang,
	Stephen Hemminger, Wei Liu, Naoya Horiguchi,
	linux-hyperv@vger.kernel.org, virtualization@lists.linux-foundation.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v1 7/7] fs/proc/kcore: use page_offline_(freeze|unfreeze)
References: <20210429122519.15183-1-david@redhat.com>
 <20210429122519.15183-8-david@redhat.com>
 <5a5a7552-4f0a-75bc-582f-73d24afcf57b@redhat.com>
 <2f66cbfc-aa29-b3ef-4c6a-0da8b29b56f6@redhat.com>
In-Reply-To: <2f66cbfc-aa29-b3ef-4c6a-0da8b29b56f6@redhat.com>
On Mon, May 03, 2021 at 12:13:45PM +0200, David Hildenbrand wrote:
> On 03.05.21 11:28, Mike Rapoport wrote:
> > On Mon, May 03, 2021 at 10:28:36AM +0200, David Hildenbrand wrote:
> > > On 02.05.21 08:34, Mike Rapoport wrote:
> > > > On Thu, Apr 29, 2021 at 02:25:19PM +0200, David Hildenbrand wrote:
> > > > > Let's properly synchronize with drivers that set PageOffline().
> > > > > Unfreeze every now and then, so drivers that want to set
> > > > > PageOffline() can make progress.
> > > > >
> > > > > Signed-off-by: David Hildenbrand
> > > > > ---
> > > > >  fs/proc/kcore.c | 15 +++++++++++++++
> > > > >  1 file changed, 15 insertions(+)
> > > > >
> > > > > diff --git a/fs/proc/kcore.c b/fs/proc/kcore.c
> > > > > index 92ff1e4436cb..3d7531f47389 100644
> > > > > --- a/fs/proc/kcore.c
> > > > > +++ b/fs/proc/kcore.c
> > > > > @@ -311,6 +311,7 @@ static void append_kcore_note(char *notes, size_t *i, const char *name,
> > > > >  static ssize_t
> > > > >  read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > >  {
> > > > > +	size_t page_offline_frozen = 0;
> > > > >  	char *buf = file->private_data;
> > > > >  	size_t phdrs_offset, notes_offset, data_offset;
> > > > >  	size_t phdrs_len, notes_len;
> > > > > @@ -509,6 +510,18 @@ read_kcore(struct file *file, char __user *buffer, size_t buflen, loff_t *fpos)
> > > > >  		pfn = __pa(start) >> PAGE_SHIFT;
> > > > >  		page = pfn_to_online_page(pfn);
> > > >
> > > > Can't this race with page offlining for the first time we get here?
> > >
> > > To clarify, we have three types of offline pages in the kernel ...
> > >
> > > a) Pages part of an offline memory section; the memmap is stale and
> > > not trustworthy. pfn_to_online_page() checks that. We *can* protect
> > > against memory offlining using get_online_mems()/put_online_mems(),
> > > but we usually avoid doing so, as the race window is very small (and a
> > > problem all over the kernel that we basically never hit) and the
> > > locking is rather expensive. In the future, we might switch to RCU to
> > > handle that more efficiently and avoid these possible races.
> > >
> > > b) PageOffline(): logically offline pages contained in an online
> > > memory section with a sane memmap. virtio-mem calls these pages "fake
> > > offline"; something like a "temporary" memory hole. The new mechanism
> > > I propose will be used to handle synchronization, as the races can be
> > > more severe, e.g., when reading actual page content here.
> > >
> > > c) Soft offline pages: hwpoisoned pages that are not actually harmful
> > > yet, but could become harmful in the future. So we better try to
> > > remove the page from the page allocator and try to migrate away
> > > existing users.
> > >
> > > So page_offline_* handles "b) PageOffline()" only. There is a tiny
> > > race between pfn_to_online_page(pfn) and looking at the memmap, as we
> > > already have in many cases throughout the kernel, to be tackled in the
> > > future.
> >
> > Right, but here you anyway add locking, so why exclude the first
> > iteration?
>
> What we're protecting is PageOffline() below. If I didn't mess up, we
> should always be calling page_offline_freeze() before calling
> PageOffline(). Or am I missing something?

Somehow I was under the impression we are protecting both
pfn_to_online_page() and PageOffline().

> > BTW, did you consider something like
>
> Yes, I played with something like that. We'd have to handle the first
> page_offline_freeze() differently, though, and that's where things got a
> bit ugly in my attempts.
> >
> > 	if (page_offline_frozen++ % MAX_ORDER_NR_PAGES == 0) {
> > 		page_offline_unfreeze();
> > 		cond_resched();
> > 		page_offline_freeze();
> > 	}
> >
> > We don't seem to care about page_offline_frozen overflows here, do we?
>
> No, the buffer size is also size_t and gets incremented on a per-byte
> basis. The variant I have right now looked the cleanest to me. Happy to
> hear simpler alternatives.

Well, locking for the first time before the while() loop and doing the
resched-relock outside the switch() would definitely be nicer, and it
makes the last unlock unconditional.

The cost of preventing memory offline during reads of the !KCORE_RAM
parts does not seem that significant to me, but I may be missing
something.

-- 
Sincerely yours,
Mike.