From: "Huang, Ying"
To: Rafael Aquini
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, akpm@linux-foundation.org
Subject: Re: [PATCH] mm: swapfile: avoid split_swap_cluster() NULL pointer dereference
Date: Thu, 24 Sep 2020 15:45:52 +0800
Message-ID: <87tuvnd3db.fsf@yhuang-dev.intel.com>
In-Reply-To: <20200924063038.GD1023012@optiplex-lnx> (Rafael Aquini's message of "Thu, 24 Sep 2020 02:30:38 -0400")
References: <20200922184838.978540-1-aquini@redhat.com> <878sd1qllb.fsf@yhuang-dev.intel.com> <20200923043459.GL795820@optiplex-lnx> <87sgb9oz1u.fsf@yhuang-dev.intel.com> <20200923130138.GM795820@optiplex-lnx> <87blhwng5f.fsf@yhuang-dev.intel.com> <20200924020928.GC1023012@optiplex-lnx> <877dsjessq.fsf@yhuang-dev.intel.com> <20200924063038.GD1023012@optiplex-lnx>

Rafael Aquini writes:
> On Thu, Sep 24, 2020 at 11:51:17AM +0800, Huang, Ying wrote:
>> Rafael Aquini writes:
>> > The bug here is quite simple: split_swap_cluster() misses checking for
>> > lock_cluster() returning NULL before committing to change
>> > cluster_info->flags.
>>
>> I don't think so. We shouldn't run into this situation in the first
>> place, so the "fix" hides the real bug instead of fixing it. Just like
>> we call VM_BUG_ON_PAGE(!PageLocked(head), head) in
>> split_huge_page_to_list() instead of silently returning if
>> !PageLocked(head).
>>
>
> Not the same thing, obviously, as you are going for an apples-to-carrots
> comparison, but since you mentioned:
>
> split_huge_page_to_list() asserts (in debug builds) *page is locked,
>
>	VM_BUG_ON_PAGE(!PageLocked(head), head);
>

It asserts *head instead of *page.

> and later checks if *head bears the SwapCache flag.
> deferred_split_scan(), OTOH, doesn't hand down the compound head locked,
> but the 2nd page in the group instead.

No. deferred_split_scan() will call trylock_page() on the 2nd page in
the group, but

	static inline int trylock_page(struct page *page)
	{
		page = compound_head(page);
		return (likely(!test_and_set_bit_lock(PG_locked, &page->flags)));
	}

so the head page will be locked instead.

> This doesn't necessarily mean it's a problem, though, but it might help
> in hitting the issue.
>
>> > The fundamental problem has nothing to do with allocating, or not
>> > allocating, a swap cluster; it has to do with the fact that the THP
>> > deferred split scan can transiently race with swapcache insertion, and
>> > the fact that when you run your swap area on rotational storage
>> > cluster_info is _always_ NULL. split_swap_cluster() needs to check for
>> > lock_cluster() returning NULL because that's one possible case, and it
>> > clearly fails to do so.
>>
>> If there's a race, we should fix the race.
>> But the code path for swapcache insertion is:
>>
>>   add_to_swap()
>>     get_swap_page()	/* Return if it fails to allocate */
>>     add_to_swap_cache()
>>       SetPageSwapCache()
>>
>> while the code path to split a THP is:
>>
>>   split_huge_page_to_list()
>>     if PageSwapCache()
>>       split_swap_cluster()
>>
>> Both code paths are protected by the page lock, so there must be some
>> other reason triggering the bug.
>
> As mentioned above, no, they seem not to be protected (at least, not by
> the same page lock, depending on the case). While add_to_swap() will
> ensure a page lock on the compound head, split_huge_page_to_list() does
> not.
>
>
>> And again, for HDD, a THP shouldn't have PageSwapCache() set in the
>> first place. If it is set, the bug is in the setting of the flag, and
>> that is what we should fix.
>>
>
> I fail to follow your claim here. Where is the guarantee, in the code,
> that you'll never have a compound head in the swapcache?

We may have a THP in the swap cache only if a non-rotational disk is
used as the swap device. This is the design assumption of the THP swap
support, and it is guaranteed by making swap space allocation for a THP
fail on HDD. If the implementation doesn't guarantee this, we will fix
the implementation so that it does.

>> > Run a workload that causes multiple THP COWs, and add a memory hogger
>> > to create memory pressure so you'll force the reclaimers to kick the
>> > registered shrinkers. The trigger is not heavy swapping, and that's
>> > probably why most swap test cases don't hit it. The window is tight,
>> > but you will get the NULL pointer dereference.
>>
>> Do you have a script to reproduce the bug?
>>
>
> Nope, a convoluted set of internal regression tests we have usually
> triggers it. In the wild, customers running HANA are seeing it,
> occasionally.

So you haven't reproduced the bug on an upstream kernel? If not, can you
help to run the test with a debug kernel based on the upstream kernel?
I can provide some debug patches.
>> > Regardless of whether you find further bugs or not, this patch is
>> > needed to correct a blunt coding mistake.
>>
>> As above. I don't agree with that.
>>
>
> It's OK to disagree; split_swap_cluster() still misses the cluster_info
> NULL check, though.

In contrast, if the check were necessary, we shouldn't ignore the
failure silently, but use something like

	ci = lock_cluster(si, offset);
+	VM_BUG_ON(!ci);
	cluster_clear_huge(ci);

in split_swap_cluster() to enforce the check and report the bug as early
as possible. But this appears unnecessary now because of the NULL
pointer access in cluster_clear_huge().

Best Regards,
Huang, Ying