From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id 57901EB64D9
	for <linux-mm@archiver.kernel.org>; Tue,  4 Jul 2023 17:04:06 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id CB20C28009D; Tue,  4 Jul 2023 13:04:05 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id C3BB7280096; Tue,  4 Jul 2023 13:04:05 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id ADC0628009D; Tue,  4 Jul 2023 13:04:05 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (smtprelay0017.hostedemail.com [216.40.44.17])
	by kanga.kvack.org (Postfix) with ESMTP id 9C2EE280096
	for <linux-mm@kvack.org>; Tue,  4 Jul 2023 13:04:05 -0400 (EDT)
Received: from smtpin22.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay09.hostedemail.com (Postfix) with ESMTP id 6900C80531
	for <linux-mm@kvack.org>; Tue,  4 Jul 2023 17:04:05 +0000 (UTC)
X-FDA: 80974552050.22.F287231
Received: from mail-yw1-f178.google.com (mail-yw1-f178.google.com [209.85.128.178])
	by imf12.hostedemail.com (Postfix) with ESMTP id 5A08140013
	for <linux-mm@kvack.org>; Tue,  4 Jul 2023 17:04:03 +0000 (UTC)
Authentication-Results: imf12.hostedemail.com;
	dkim=pass header.d=google.com header.s=20221208 header.b="In/HPs//";
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf12.hostedemail.com: domain of hughd@google.com designates 209.85.128.178 as permitted sender) smtp.mailfrom=hughd@google.com
ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com;
	s=arc-20220608; t=1688490243;
	h=from:from:sender:reply-to:subject:subject:date:date:
	 message-id:message-id:to:to:cc:cc:mime-version:mime-version:
	 content-type:content-type:content-transfer-encoding:
	 in-reply-to:in-reply-to:references:references:dkim-signature;
	bh=o5CkhYNN16WNUQJu71BeSwdzRNxPgL/gs2UX0urk5eQ=;
	b=28wT7tlrjtj6b8QKrWbWd3qAwadO58aaGyBvJLtxYO1k5WvUwJBb1jD6k5L18VYxwQlaEE
	5fRum+A/WlbSwuJZ/yVp2I4TbCi17GZgGA2PkFe+1IfjxepmBvpkzuQE68V4MXFXPpnPaw
	AFLP186zzp5pUzOYXg6MccnaHVpgJWg=
ARC-Authentication-Results: i=1;
	imf12.hostedemail.com;
	dkim=pass header.d=google.com header.s=20221208 header.b="In/HPs//";
	dmarc=pass (policy=reject) header.from=google.com;
	spf=pass (imf12.hostedemail.com: domain of hughd@google.com designates 209.85.128.178 as permitted sender) smtp.mailfrom=hughd@google.com
ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1688490243; a=rsa-sha256;
	cv=none;
	b=km2+O3I5E6ozvD0LZf7ETEmczlzUAasHiHrv7IT35onY0RI6ZNZL++aT62bLPgM478UKeW
	c6TQly70+K+mi3S0ioSOhsrKey+lh5hrjHoMKRe+Q59pZbZYRhtjz3oYg9psAulsrPNNjD
	FRVo2j3ZaG9w6OGqO4HgkLzr8EX7xBU=
Received: by mail-yw1-f178.google.com with SMTP id 00721157ae682-57722942374so70797607b3.1
        for <linux-mm@kvack.org>; Tue, 04 Jul 2023 10:04:03 -0700 (PDT)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=google.com; s=20221208; t=1688490242; x=1691082242;
        h=mime-version:references:message-id:in-reply-to:subject:cc:to:from
         :date:from:to:cc:subject:date:message-id:reply-to;
        bh=o5CkhYNN16WNUQJu71BeSwdzRNxPgL/gs2UX0urk5eQ=;
        b=In/HPs//bUn+2tgW3AmtAl5dxW/ADGV2YA56prULm3E/vdfno08a8tljcsPueGrFwv
         jqarDMxpAnYo9aMI5Mb54mMD0rJ8CZohyLEima6tkL05J+cSnzXMAc1rCu3B7NnhNLQl
         ZvVWe49F0YiOX95QlRJtsrnvcRJ4qDr/z3uf8RXLQDyMLLM3K+NeYbv+E3W8yR9z4V2c
         WTkreLz7srWoSO6EVx02fhhtx9sRWBvRmIBi/UDAvsanUA+rvGltjBmFivqq6iRGLfxg
         0Is7JP64PK57hxtBi5rB3DxXAUc/nlIHjgsphR/2KftHY0traJeP5nIjkP89b4sp+32J
         RUxA==
X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
        d=1e100.net; s=20221208; t=1688490242; x=1691082242;
        h=mime-version:references:message-id:in-reply-to:subject:cc:to:from
         :date:x-gm-message-state:from:to:cc:subject:date:message-id:reply-to;
        bh=o5CkhYNN16WNUQJu71BeSwdzRNxPgL/gs2UX0urk5eQ=;
        b=Y2icoh7LNijdRLJhfdxWXKgoIdu20H3xgncMmAr2lwJcNBvvlZI9uoMvQpjl1OP+so
         7avTS8Re8t0RGPty4AFBmAmJKRZMwm+N0/W1kyXPZ0f+M93vp5DC9sxI08YOu5/tj0g9
         h36uE+CuT7WOn+dBAvGi4sT/8cOk0WsjmTgLL7LwXQRYrFtutpk+BvD+ukM5nuSaNHUY
         LdsJ5Zy7CHDFjbN0IQzuCFFhLf1fLqxrcC06HqbKOcyp9+np9M2PDg5QiAOz3a68VmAc
         G4XjFzQQLN5we6G3SjwKLXkHDfzQgFuS+6k8HmlfAD4h6Qd9b7gagPZTd4ZJClO7IWyR
         +AmA==
X-Gm-Message-State: ABy/qLbT9u5uKNovar1Z6K7TM6s6/DIQBkh9PzThOWCoTVGE3UNrCPFE
	tmq605AWdRf8PPnc5xi9Iz38cA==
X-Google-Smtp-Source: APBJJlEujSIy21mB6q57p7XiiokfNCFL+rCQls/CklTtga65bPCnY5wMW8YqJmU/LAIKTHdSHjCBvg==
X-Received: by 2002:a0d:dd02:0:b0:55a:3ce9:dc3d with SMTP id g2-20020a0ddd02000000b0055a3ce9dc3dmr13990263ywe.13.1688490242129;
        Tue, 04 Jul 2023 10:04:02 -0700 (PDT)
Received: from ripple.attlocal.net (172-10-233-147.lightspeed.sntcca.sbcglobal.net. [172.10.233.147])
        by smtp.gmail.com with ESMTPSA id n20-20020a819c54000000b00577632aa85esm2323151ywa.3.2023.07.04.10.03.58
        (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256);
        Tue, 04 Jul 2023 10:04:01 -0700 (PDT)
Date: Tue, 4 Jul 2023 10:03:57 -0700 (PDT)
From: Hugh Dickins <hughd@google.com>
X-X-Sender: hugh@ripple.attlocal.net
To: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
cc: Hugh Dickins <hughd@google.com>, Jason Gunthorpe <jgg@ziepe.ca>, 
    Andrew Morton <akpm@linux-foundation.org>, 
    Vasily Gorbik <gor@linux.ibm.com>, Mike Kravetz <mike.kravetz@oracle.com>, 
    Mike Rapoport <rppt@kernel.org>, 
    "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>, 
    Matthew Wilcox <willy@infradead.org>, David Hildenbrand <david@redhat.com>, 
    Suren Baghdasaryan <surenb@google.com>, 
    Qi Zheng <zhengqi.arch@bytedance.com>, Yang Shi <shy828301@gmail.com>, 
    Mel Gorman <mgorman@techsingularity.net>, Peter Xu <peterx@redhat.com>, 
    Peter Zijlstra <peterz@infradead.org>, Will Deacon <will@kernel.org>, 
    Yu Zhao <yuzhao@google.com>, Alistair Popple <apopple@nvidia.com>, 
    Ralph Campbell <rcampbell@nvidia.com>, Ira Weiny <ira.weiny@intel.com>, 
    Steven Price <steven.price@arm.com>, SeongJae Park <sj@kernel.org>, 
    Lorenzo Stoakes <lstoakes@gmail.com>, Huang Ying <ying.huang@intel.com>, 
    Naoya Horiguchi <naoya.horiguchi@nec.com>, 
    Christophe Leroy <christophe.leroy@csgroup.eu>, 
    Zack Rusin <zackr@vmware.com>, Axel Rasmussen <axelrasmussen@google.com>, 
    Anshuman Khandual <anshuman.khandual@arm.com>, 
    Pasha Tatashin <pasha.tatashin@soleen.com>, 
    Miaohe Lin <linmiaohe@huawei.com>, Minchan Kim <minchan@kernel.org>, 
    Christoph Hellwig <hch@infradead.org>, Song Liu <song@kernel.org>, 
    Thomas Hellstrom <thomas.hellstrom@linux.intel.com>, 
    Russell King <linux@armlinux.org.uk>, 
    "David S. Miller" <davem@davemloft.net>, 
    Michael Ellerman <mpe@ellerman.id.au>, 
    "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>, 
    Heiko Carstens <hca@linux.ibm.com>, 
    Christian Borntraeger <borntraeger@linux.ibm.com>, 
    Claudio Imbrenda <imbrenda@linux.ibm.com>, 
    Alexander Gordeev <agordeev@linux.ibm.com>, Jann Horn <jannh@google.com>, 
    Vishal Moola <vishal.moola@gmail.com>, Vlastimil Babka <vbabka@suse.cz>, 
    linux-arm-kernel@lists.infradead.org, sparclinux@vger.kernel.org, 
    linuxppc-dev@lists.ozlabs.org, linux-s390@vger.kernel.org, 
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [PATCH v2 07/12] s390: add pte_free_defer() for pgtables sharing
 page
In-Reply-To: <20230704171905.1263478f@thinkpad-T15>
Message-ID: <e678affb-5eee-a055-7af1-1d29a965663b@google.com>
References: <54cb04f-3762-987f-8294-91dafd8ebfb0@google.com> <a722dbec-bd9e-1213-1edd-53cd547aa4f@google.com> <20230628211624.531cdc58@thinkpad-T15> <cd7c2851-1440-7220-6c53-16b343b1474@google.com> <ZJ2hsM5Tn+yUZ5ZV@ziepe.ca> <20230629175645.7654d0a8@thinkpad-T15>
 <edaa96f-80c1-1252-acbb-71c4f045b035@google.com> <7bef5695-fa4a-7215-7e9d-d4a83161c7ab@google.com> <20230704171905.1263478f@thinkpad-T15>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Rspamd-Queue-Id: 5A08140013
X-Rspam-User: 
X-Rspamd-Server: rspam05
X-Stat-Signature: wnsp4ah8ececoifgsh9jt9a8td4zzyyp
X-HE-Tag: 1688490243-399473
X-HE-Meta: U2FsdGVkX1902JaYzQGTvucdGL//wgAIDEzZN0HmNqksBE1k+CDxVBh6Av3SAh+5lV4Qr+xHsZvi6qDPP9p2WmyRpz+oJyjwJ40nJXutmQqU3U8B8PL0e3h/Tv3QFm9mTY6yp7KV2Wmqi3wiqFkKOnI0lvDc0KaJ6mxwboh+M03yZaF6vDHR9CVKTqs+zSpriu6YuHHZRay8/WO9A29mo2HIbnIpeQhwv+pxCrQ9DO/mbFt51tKXn1wI55Pgz/fnxP3/Rs8cyg5DiU+KM3iXtaUJDRuHS03Xfplit7xD/sFQBP5veRXR9AMf8Mi7bFU+i7Dom7MRp7dBin5XrEY9G/PfsrI4OQfCJzoASflw5n+661Ub2ZtAEplffMEydkB1jZIfoxVIkd8S/iQlOkrQSrc+wg+U1up7BInV0kIBf1IJB9ssWJ27vOnMOcpYUdIr61LvTiS42fPMj104ktBD+0f2e8t4t/HSIhthb0E+PxZQFdINJKuaxm6CpCOYss19wXVkP38BTeH6oj/w7lY3hUq7H+epJAmh3AJKcAf5ca5EemzKDhPc4h29+vx/Ur5efnj6DnV1p6EZzeugwX0xHUDf0glfZk+r2jkFxgr4Vhc6Aiv3nCS+sEjtgNuz7iIBqD79hGTbNKrlq8YmGiM2/lUEDitoV6vtHAC7FUWHT6MeXduIKEmDdRyotcAvcWQ5dTHjdNsyzrbjY8uHJbcgabdm/M9ALwFrAbemy+06wVu1W7y3q6jKQrWw+IOPGJML4GPQ4BSb/uh5wLrQKcXGMzN38ezLIgK3sQznC3xKHPmpu4g9ml7Q2gU48n5YQJVE1Pq8gd1FDKQtvFzhb4DAxIZVIrsEuZdsCgr4VdQdmsjCbgPDDoPsQgZR9BjBLwrnhdubXqsYN18nm+bC7wtMr6w31f9YyiyCoJcfsRXdzEtN+T01rHZ50aofevp675icgtScEtQooJ30ypU0PPV
 X32wLwy+
 G1Vz9hHZRdJwE6qzq3hvQIJSFGmYNpddmty4D/h3ce2T95wIn+U79UvYN4zXDtj+Wf4bq+BKkBBclX0GWgECrjOXZpeLHNCJPCNhQZIrQQqLnXX+72IEopaWuLqlKyWHkoAQUODvuaKPmpwVWvcUdSYIZ5o6ykcBomqTui0zKFQD5EFu8PtDmQXfk/ahfc/tfqQYJz43mKwo02WCa5xOW2/KJioCgOeH0rC7SFNOFMPVpjs1Gil+1y9/m/e6DeNYO2x5tjdlRKAZZg8eq/sqxZlFV3GUEpoYpCPQYrZU0ASNIA7bu91wNVRWLHvdB3LWwdtbhjKZEHxWI2n68FwXxcVSQB6O4+fTjxRII8sBE1xmtJwJGbcuqfCeS1fgecYXYsFYLu6OlGLEQTEo4XEL5PXlxHifSOzdNA7l4/vz+yiXBQEYsTdkmIXB1CxZY0RDAw0WwCbqFsVQ1ledLfmzIMSlYMGdeRxl5ZSGd0gGym18z5euvb+B2RMRdOAxW/YOa+j0L
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On Tue, 4 Jul 2023, Gerald Schaefer wrote:
> On Sat, 1 Jul 2023 21:32:38 -0700 (PDT)
> Hugh Dickins <hughd@google.com> wrote:
> > On Thu, 29 Jun 2023, Hugh Dickins wrote:
> > > 
> > > I've grown to dislike the (ab)use of pt_frag_refcount even more, to the
> > > extent that I've not even tried to verify it; but I think I do get the
> > > point now, that we need further info than just PPHHAA to know whether
> > > the page is on the list or not.  But I think that if we move where the
> > > call_rcu() is done, then the page can stay on or off the list by same
> > > rules as before (but need to check HH bits along with PP when deciding
> > > whether to allocate, and whether to list_add_tail() when freeing).  
> > 
> > No, not quite the same rules as before: I came to realize that using
> > list_add_tail() for the HH pages would be liable to put a page on the
> > list which forever blocked reuse of PP list_add_tail() pages after it
> > (could be solved by a list_move() somewhere, but we have agreed to
> > prefer simplicity).
> > 
> > I've dropped the HH bits, I'm using PageActive like we did on powerpc,
> > I've dropped most of the pte_free_*() helpers, and list_del_init() is
> > an easier way of dealing with those "is it on the list" questions.
> > I expect that we shall be close to reaching agreement on...
> 
> This looks really nice, almost too good and easy to be true. I did not
> find any obvious flaw, just some comments below. It also survived LTP
> without any visible havoc, so I guess this approach is the best so far.

Phew! I'm of course glad to hear this: thanks for your efforts on it.

...
> > --- a/arch/s390/mm/pgalloc.c
> > +++ b/arch/s390/mm/pgalloc.c
> > @@ -229,6 +229,15 @@ void page_table_free_pgste(struct page *page)
> >   * logic described above. Both AA bits are set to 1 to denote a 4KB-pgtable
> >   * while the PP bits are never used, nor such a page is added to or removed
> >   * from mm_context_t::pgtable_list.
> > + *
> > + * pte_free_defer() overrides those rules: it takes the page off pgtable_list,
> > + * and prevents both 2K fragments from being reused. pte_free_defer() has to
> > + * guarantee that its pgtable cannot be reused before the RCU grace period
> > + * has elapsed (which page_table_free_rcu() does not actually guarantee).
> 
> Hmm, I think page_table_free_rcu() has to guarantee the same, i.e. not
> allow reuse before grace period elapsed. And I hope that it does so, by
> setting the PP bits, which would be noticed in page_table_alloc(), in
> case the page would be seen there.
> 
> Unlike pte_free_defer(), page_table_free_rcu() would add pages back to the
> end of the list, and so they could be seen in page_table_alloc(), but they
> should not be reused before grace period elapsed and __tlb_remove_table()
> cleared the PP bits, as far as I understand.
> 
> So what exactly do you mean with "which page_table_free_rcu() does not actually
> guarantee"?

I'll answer without locating and re-reading what Jason explained earlier,
perhaps in a separate thread, about pseudo-RCU-ness in tlb_remove_table():
he may have explained it better.  And without working out again all the
MMU_GATHER #defines, and which of them do and do not apply to s390 here.

The detail that sticks in my mind is the fallback in tlb_remove_table()
in mm/mmu_gather.c: if its __get_free_page(GFP_NOWAIT) fails, it cannot
batch the tables for freeing by RCU, and resorts instead to an immediate 
TLB flush (I think: that again involves chasing definitions) followed by
tlb_remove_table_sync_one() - which just delivers an interrupt to each CPU,
and is commented: 
/*
 * This isn't an RCU grace period and hence the page-tables cannot be
 * assumed to be actually RCU-freed.
 *
 * It is however sufficient for software page-table walkers that rely on
 * IRQ disabling.
 */

Whether that's good for your PP pages or not, I've given no thought:
I've just taken it on trust that what s390 has working today is good.

If that __get_free_page(GFP_NOWAIT) fallback instead used call_rcu(),
then I would not have written "(which page_table_free_rcu() does not
actually guarantee)".  But it cannot use call_rcu() because it does
not have an rcu_head to work with - it's in some generic code, and
there is no MMU_GATHER_CAN_USE_PAGE_RCU_HEAD for architectures to set.

And Jason would have much preferred us to address the issue from that
angle; but not only would doing so destroy my sanity, I'd also destroy
20 architectures TLB-flushing, unbuilt and untested, in the attempt.

...
> > @@ -325,10 +346,17 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  		 */
> >  		mask = atomic_xor_bits(&page->_refcount, 0x11U << (bit + 24));
> >  		mask >>= 24;
> > -		if (mask & 0x03U)
> > +		if ((mask & 0x03U) && !PageActive(page)) {
> > +			/*
> > +			 * Other half is allocated, and neither half has had
> > +			 * its free deferred: add page to head of list, to make
> > +			 * this freed half available for immediate reuse.
> > +			 */
> >  			list_add(&page->lru, &mm->context.pgtable_list);
> > -		else
> > -			list_del(&page->lru);
> > +		} else {
> > +			/* If page is on list, now remove it. */
> > +			list_del_init(&page->lru);
> > +		}
> 
> Ok, we might end up with some unnecessary list_del_init() here, e.g. if
> other half is still allocated, when called from pte_free_defer() on a
> fully allocated page, which was not on the list (and with PageActive, and
> (mask & 0x03U) true).
> Not sure if adding an additional mask check to the else path would be
> needed, but it seems that list_del_init() should also be able to handle
> this.

list_del_init() is very cheap in the unnecessary case: the cachelines
required are already there.  You don't want a flag to say whether to
call it or not, it is already the efficient approach.

(But you were right not to use it in your pt_frag_refcount version,
because there we were still trying to do the call_rcu() per fragment
rather than per page, so page->lru could have been on the RCU queue.)

> 
> Same thought applies to the similar logic in page_table_free_rcu()
> below.
> 
> >  		spin_unlock_bh(&mm->context.lock);
> >  		mask = atomic_xor_bits(&page->_refcount, 0x10U << (bit + 24));
> >  		mask >>= 24;
> > @@ -342,8 +370,10 @@ void page_table_free(struct mm_struct *mm, unsigned long *table)
> >  	}
> >  
> >  	page_table_release_check(page, table, half, mask);
> > -	pgtable_pte_page_dtor(page);
> > -	__free_page(page);
> > +	if (TestClearPageActive(page))
> > +		call_rcu(&page->rcu_head, pte_free_now);
> > +	else
> > +		pte_free_now(&page->rcu_head);
> 
> This ClearPageActive, and the similar thing in __tlb_remove_table() below,
> worries me a bit, because it is done outside the spin_lock. It "feels" like
> there could be some race with the PageActive checks inside the spin_lock,
> but when drawing some pictures, I could not find any such scenario yet.
> Also, our existing spin_lock is probably not supposed to protect against
> PageActive changes anyway, right?

Here (and similarly in __tlb_remove_table()) is where we are about to free
the page table page: both of the fragments have already been released,
there is nobody left who could be racing against us to set PageActive.

I chose PageActive for its name, not for any special behaviour of that
flag: nothing else could be setting or clearing it while we own the page.

Hugh