From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Mon, 11 May 2020 11:06:48 -0400
From: Johannes Weiner <hannes@cmpxchg.org>
To: Hugh Dickins <hughd@google.com>
Cc: Joonsoo Kim <js1304@gmail.com>, Alex Shi <alex.shi@linux.alibaba.com>,
	Shakeel Butt <shakeelb@google.com>, Michal Hocko <mhocko@suse.com>,
	"Kirill A. Shutemov" <kirill@shutemov.name>,
	Roman Gushchin <guro@fb.com>, linux-mm@kvack.org,
	cgroups@vger.kernel.org, linux-kernel@vger.kernel.org,
	kernel-team@fb.com, Andrew Morton <akpm@linux-foundation.org>
Subject: Re: [PATCH 05/18] mm: memcontrol: convert page cache to a new
 mem_cgroup_charge() API
Message-ID: <20200511150648.GA306292@cmpxchg.org>
References: <20200420221126.341272-1-hannes@cmpxchg.org>
 <20200420221126.341272-6-hannes@cmpxchg.org>
 <20200422064041.GE6780@js1304-desktop>
 <20200422120946.GA358439@cmpxchg.org>
 <20200423052450.GA12538@js1304-desktop>
 <20200508160122.GB181181@cmpxchg.org>
 <alpine.LSU.2.11.2005102350360.2769@eggly.anvils>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline
In-Reply-To: <alpine.LSU.2.11.2005102350360.2769@eggly.anvils>
List-ID: <linux-mm.kvack.org>

On Mon, May 11, 2020 at 12:38:04AM -0700, Hugh Dickins wrote:
> On Fri, 8 May 2020, Johannes Weiner wrote:
> > 
> > I looked at this some more, as well as compared it to non-shmem
> > swapping. My conclusion is - and Hugh may correct me on this - that
> > the deletion looks mandatory but is actually an optimization. Page
> > reclaim will ultimately pick these pages up.
> > 
> > When non-shmem pages are swapped in by readahead (locked until IO
> > completes) and their page tables are simultaneously unmapped, the
> > zap_pte_range() code calls free_swap_and_cache() and the locked pages
> > are stranded in the swap cache with no page table references. We rely
> > on page reclaim to pick them up later on.
> > 
> > The same appears to be true for shmem. If the references to the swap
> > page are zapped while we're trying to swap in, we can strand the page
> > in the swap cache. But it's not up to swapin to detect this reliably,
> > it just frees the page more quickly than having to wait for reclaim.
> 
> I think you've got all that exactly right, thanks for working it out.
> It originates from v3.7's 215c02bc33bb ("tmpfs: fix shmem_getpage_gfp()
> VM_BUG_ON") - in which I also had to thank you.

I should have looked where it actually came from - I had forgotten
about that patch!

> I think I chose to do the delete_from_swap_cache() right there, partly
> because of following shmem_unuse_inode() code which already did that,
> partly on the basis that while we have to observe the case then it's
> better to clean it up, and partly out of guilt that our page lock here
> is what had prevented shmem_undo_range() from completing its job; but
> I believe you're right that unused swapcache reclaim would sort it out
> eventually.

That makes sense to me.
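
To make the two cleanup paths concrete, here is a heavily simplified userspace model. The struct and all model_* functions are hypothetical stand-ins, not kernel APIs. They only mimic the ordering discussed above: free_swap_and_cache() merely trylocks the page, so a page held locked by a concurrent swapin survives the zap and stays in swap cache until either reclaim or the lock holder removes it.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily simplified model of one swapcache page. */
struct model_page {
	int swap_refs;		/* remaining references to the swap entry */
	bool locked;		/* page lock, e.g. held by a swapin in progress */
	bool in_swapcache;
};

/*
 * Loosely modeled on free_swap_and_cache(): drop a swap reference and,
 * if it was the last one, free the cached page only when a trylock of
 * the page would succeed. A locked page is left stranded in swap cache.
 */
static void model_free_swap_and_cache(struct model_page *p)
{
	if (--p->swap_refs == 0 && !p->locked)
		p->in_swapcache = false;
}

/* Loosely modeled on reclaim of unused swapcache: unreferenced and unlocked. */
static void model_reclaim(struct model_page *p)
{
	if (p->in_swapcache && p->swap_refs == 0 && !p->locked)
		p->in_swapcache = false;
}

/* The optional eager cleanup by the lock holder (cf. delete_from_swap_cache()). */
static void model_swapin_cleanup(struct model_page *p)
{
	assert(p->locked);	/* only the task holding the page lock may do this */
	p->in_swapcache = false;
}
```

Either path ends with the page out of swap cache; in this model the eager cleanup only shortens how long the stranded page lingers, which matches the "optimization, not correctness" reading.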

> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index e80167927dce..236642775f89 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -640,7 +640,7 @@ static int shmem_add_to_page_cache(struct page *page,
> >  		xas_lock_irq(&xas);
> >  		entry = xas_find_conflict(&xas);
> >  		if (entry != expected)
> > -			xas_set_err(&xas, -EEXIST);
> > +			xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
> 
> Two things on this.
> 
> Minor matter of taste, I'd prefer that as
> 			xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
> which would be more general and more understandable -
> but what you have written should be fine for the actual callers.

Yes, checking `expected' was to differentiate the behavior depending
on the callsite. But testing `entry' is more obvious in that location.
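
A small truth-table sketch of the two error choices, assuming `entry' is what xas_find_conflict() returned and `expected' is the swap entry (or NULL) the caller passed in. The helper names are hypothetical and exist only for this comparison.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Original version: pick the error based on what the caller expected. */
static int conflict_err_by_expected(const void *entry, const void *expected)
{
	if (entry == expected)
		return 0;	/* no race: the slot still holds what we expected */
	return expected ? -ENOENT : -EEXIST;
}

/* Suggested version: pick the error based on what is actually there. */
static int conflict_err_by_entry(const void *entry, const void *expected)
{
	if (entry == expected)
		return 0;
	return entry ? -EEXIST : -ENOENT;	/* occupied vs. emptied slot */
}
```

The two expressions agree whenever the slot was found empty or the caller expected an empty slot. They differ only when an expected swap entry was replaced by a different non-NULL entry, where testing `entry' reports -EEXIST rather than -ENOENT.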

> Except... I think returning -ENOENT there will not work correctly,
> in the case of a punched hole.  Because (unless you've reworked it
> and I just haven't looked) shmem_getpage_gfp() knows to retry in
> the case of -EEXIST, but -ENOENT will percolate up to shmem_fault()
> and result in a SIGBUS, or a read/write error, when the hole should
> just get refilled instead.

Good catch, I had indeed missed that. I'm going to make it retry on
-ENOENT as well.

We could have it go directly to allocating a new page, but it seems
unnecessarily complicated: we've already been retrying in this
situation until now, so I would stick to "there was a race, retry."
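
A rough model of why the retry matters, assuming a single slot whose swap entry is hole-punched while swapin waits for the page lock. The slot states and model_* helpers are hypothetical; the real shmem_getpage_gfp() does far more. Treating -ENOENT as final hands the error back to the caller (which shmem_fault() would turn into SIGBUS); retrying lets the second pass find the hole and fill it with a new page.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

enum slot_state { SLOT_EMPTY, SLOT_SWAP, SLOT_PAGE };

struct slot {
	enum slot_state state;
	bool punch_during_swapin;	/* hypothetical: simulate a racing hole punch */
};

/* Stand-in for shmem_swapin_page(): may lose a race with hole punching. */
static int model_swapin(struct slot *s)
{
	if (s->punch_during_swapin) {
		s->punch_during_swapin = false;
		s->state = SLOT_EMPTY;	/* entry vanished under us */
		return -ENOENT;
	}
	s->state = SLOT_PAGE;
	return 0;
}

/* Stand-in for the lookup/retry loop; *allocated reports a fresh page. */
static int model_getpage(struct slot *s, bool retry_enoent, bool *allocated)
{
	*allocated = false;
repeat:
	if (s->state == SLOT_SWAP) {
		int error = model_swapin(s);

		if (error == -EEXIST || (retry_enoent && error == -ENOENT))
			goto repeat;	/* there was a race, retry */
		return error;
	}
	if (s->state == SLOT_EMPTY) {
		s->state = SLOT_PAGE;	/* refill the hole with a new page */
		*allocated = true;
	}
	return 0;
}
```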

> Not something that needs fixing in a hurry (it took trinity to
> generate this racy case in the first place), I'll take another look
> once I've pulled it into a tree (or collected next mmotm) - unless
> you've already changed it around by then.

Attaching a delta fix based on your observations.

Andrew, barring any objections to this, could you please fold it into
the version you have in your tree already?

---

>From 33d03ceebce0a6261d472ddc9c5a07940f44714c Mon Sep 17 00:00:00 2001
From: Johannes Weiner <hannes@cmpxchg.org>
Date: Mon, 11 May 2020 10:45:14 -0400
Subject: [PATCH] mm: memcontrol: convert page cache to a new
 mem_cgroup_charge() API fix

Incorporate Hugh's feedback:

- shmem_getpage_gfp() needs to handle the new -ENOENT that was
  previously implied in the -EEXIST when a swap entry changed under us
  in any way. Otherwise hole punching could cause a racing fault to
  SIGBUS instead of allocating a new page.

- It is indeed page reclaim that picks up any swapcache we leave
  stranded when free_swap_and_cache() runs on a page locked by
  somebody else. Document that our delete_from_swap_cache() is an
  optimization, not something we rely on for correctness.

- Style cleanup: testing `expected' to decide on -EEXIST vs -ENOENT
  differentiates the callsites, but is a bit awkward to read. Test
  `entry' instead.

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
---
 mm/shmem.c | 15 +++++++++------
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/mm/shmem.c b/mm/shmem.c
index afd5a057ebb7..00fb001e8f3e 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -638,7 +638,7 @@ static int shmem_add_to_page_cache(struct page *page,
 		xas_lock_irq(&xas);
 		entry = xas_find_conflict(&xas);
 		if (entry != expected)
-			xas_set_err(&xas, expected ? -ENOENT : -EEXIST);
+			xas_set_err(&xas, entry ? -EEXIST : -ENOENT);
 		xas_create_range(&xas);
 		if (xas_error(&xas))
 			goto unlock;
@@ -1686,10 +1686,13 @@ static int shmem_swapin_page(struct inode *inode, pgoff_t index,
 		 * We already confirmed swap under page lock, but
 		 * free_swap_and_cache() only trylocks a page, so it
 		 * is just possible that the entry has been truncated
-		 * or holepunched since swap was confirmed.
-		 * shmem_undo_range() will have done some of the
-		 * unaccounting, now delete_from_swap_cache() will do
-		 * the rest.
+		 * or holepunched since swap was confirmed. This could
+		 * occur at any time while the page is locked, and
+		 * usually page reclaim will take care of the stranded
+		 * swapcache page. But when we catch it, we may as
+		 * well clean up after ourselves: shmem_undo_range()
+		 * will have done some of the unaccounting, now
+		 * delete_from_swap_cache() will do the rest.
 		 */
 		if (error == -ENOENT)
 			delete_from_swap_cache(page);
@@ -1765,7 +1768,7 @@ static int shmem_getpage_gfp(struct inode *inode, pgoff_t index,
 	if (xa_is_value(page)) {
 		error = shmem_swapin_page(inode, index, &page,
 					  sgp, gfp, vma, fault_type);
-		if (error == -EEXIST)
+		if (error == -EEXIST || error == -ENOENT)
 			goto repeat;
 
 		*pagep = page;
-- 
2.26.2