From: "Huang, Ying"
To: Rafael Aquini
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Date: Thu, 24 Sep 2020 15:45:52 +0800
Message-ID: <87tuvnd3db.fsf@yhuang-dev.intel.com>
In-Reply-To: <20200924063038.GD1023012@optiplex-lnx> (Rafael Aquini's message of "Thu, 24 Sep 2020 02:30:38 -0400")
References: <20200922184838.978540-1-aquini@redhat.com> <878sd1qllb.fsf@yhuang-dev.intel.com> <20200923043459.GL795820@optiplex-lnx> <87sgb9oz1u.fsf@yhuang-dev.intel.com> <20200923130138.GM795820@optiplex-lnx> <87blhwng5f.fsf@yhuang-dev.intel.com> <20200924020928.GC1023012@optiplex-lnx> <877dsjessq.fsf@yhuang-dev.intel.com> <20200924063038.GD1023012@optiplex-lnx>

Rafael Aquini writes:
> On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
>> Rafael Aquini writes:
>> > The bug here is quite simple: split_swap_cluster() misses checking for
>> > lock_cluster() returning NULL before committing to change
>> > cluster_info->flags.
>>
>> I don't think so. We shouldn't run into this situation in the first
>> place, so the "fix" hides the real bug instead of fixing it. Just like
>> we call VM_BUG_ON_PAGE(!PageLocked(head), head) in
>> split_huge_page_to_list() instead of silently returning if
>> !PageLocked(head).
>>
>
> Not the same thing, obviously, as you are going for an apples-to-carrots
> comparison, but since you mentioned:
>
> split_huge_page_to_list() asserts (in debug builds) *page is locked,
>
>	VM_BUG_ON_PAGE(!PageLocked(head), head);
>

It asserts *head instead of *page.

> and later checks if *head bears the SwapCache flag.
> deferred_split_scan(), OTOH, doesn't hand down the compound head locked,
> but the 2nd page in the group instead.

No. deferred_split_scan() will call trylock_page() on the 2nd page in
the group, but

	static inline int trylock_page(struct page *page)
	{
		page = compound_head(page);
		return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
	}

so the head page will be locked instead.

> This doesn't necessarily mean it's a problem, though, but it might help
> in hitting the issue.
>
>> > The fundamental problem has nothing to do with allocating, or not
>> > allocating, a swap cluster; it has to do with the fact that the THP
>> > deferred split scan can transiently race with swapcache insertion, and
>> > the fact that when you run your swap area on rotational storage
>> > cluster_info is _always_ NULL. split_swap_cluster() needs to check for
>> > lock_cluster() returning NULL because that's one possible case, and it
>> > clearly fails to do so.
>>
>> If there's a race, we should fix the race.
>> But the code path for swapcache insertion is:
>>
>>   add_to_swap()
>>     get_swap_page()	/* Return if it fails to allocate */
>>     add_to_swap_cache()
>>       SetPageSwapCache()
>>
>> while the code path to split a THP is:
>>
>>   split_huge_page_to_list()
>>     if PageSwapCache()
>>       split_swap_cluster()
>>
>> Both code paths are protected by the page lock, so there must be some
>> other reason triggering the bug.
>
> As mentioned above, no, they seem not to be protected (at least, not by
> the same page lock, depending on the case). While add_to_swap() will
> ensure a page lock on the compound head, split_huge_page_to_list() does
> not.
>
>
>> And again, for HDD, a THP shouldn't have PageSwapCache() set in the
>> first place. If it is set, the bug is in the setting of the flag, and
>> that is what we should fix.
>>
>
> I fail to follow your claim here. Where is the guarantee, in the code,
> that you'll never have a compound head in the swapcache?

We may have a THP in the swap cache only if a non-rotational disk is
used as the swap device. This is the design assumption of the THP swap
support, and it is guaranteed by making swap space allocation for a THP
fail on HDD. If the implementation doesn't guarantee this, we will fix
the implementation so that it does.

>> > Run a workload that causes multiple THP COWs, and add a memory hogger
>> > to create memory pressure so you'll force the reclaimers to kick the
>> > registered shrinkers. The trigger is not heavy swapping, and that's
>> > probably why most swap test cases don't hit it. The window is tight,
>> > but you will get the NULL pointer dereference.
>>
>> Do you have a script to reproduce the bug?
>>
>
> Nope, a convoluted set of internal regression tests we have usually
> triggers it. In the wild, customers running HANA are seeing it,
> occasionally.

So you haven't reproduced the bug on an upstream kernel? If not, can you
help to run the test with a debug kernel based on the upstream kernel?
I can provide some debug patches.
>> > Regardless of whether you find further bugs or not, this patch is
>> > needed to correct a blunt coding mistake.
>>
>> As above. I don't agree with that.
>>
>
> It's OK to disagree; split_swap_cluster() still misses the cluster_info
> NULL check, though.

In contrast, if the check were necessary, we shouldn't ignore the
failure silently, but use something like

	ci = lock_cluster(si, offset);
+	VM_BUG_ON(!ci);
	cluster_clear_huge(ci);

in split_swap_cluster() to enforce the check and report the bug as early
as possible. But this appears unnecessary now because of the NULL
pointer access in cluster_clear_huge().

Best Regards,
Huang, Ying