Date: Wed, 25 Nov 2020 16:38:16 -0500
From: Andrea Arcangeli
To: Mike Rapoport
Cc: David Hildenbrand, Vlastimil Babka, Mel Gorman, Andrew Morton,
	linux-mm@kvack.org, Qian Cai, Michal Hocko,
	linux-kernel@vger.kernel.org, Baoquan He
Subject: Re: [PATCH 1/1] mm: compaction: avoid fast_isolate_around() to set pageblock_skip on reserved pages
References: <35F8AADA-6CAA-4BD6-A4CF-6F29B3F402A4@redhat.com> <20201125210414.GO123287@linux.ibm.com>
In-Reply-To: <20201125210414.GO123287@linux.ibm.com>

On Wed, Nov 25, 2020 at 11:04:14PM +0200, Mike Rapoport wrote:
> I think the very root cause is how e820__memblock_setup() registers
> memory with memblock:
>
> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
>
> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
>
> 		memblock_add(entry->addr, entry->size);
>
> From that point the system has inconsistent view of RAM in both
> memblock.memory and memblock.reserved and, which is then translated to
> memmap etc.
>
> Unfortunately, simply adding all RAM to memblock is not possible as
> there are systems that for them "the addresses listed in the reserved
> range must never be accessed, or (as we discovered) even be reachable by
> an active page table entry" [1].
>
> [1] https://lore.kernel.org/lkml/20200528151510.GA6154@raspberrypi/

It looks like what's missing is a memblock_reserve call, which I don't
think would interfere at all with the issue above, since it won't
create a direct mapping and it'll simply invoke the second stage that
wasn't invoked here.

I guess this would have a better chance of getting the second
initialization stage to run in reserve_bootmem_region, and it would
likely solve the problem without breaking E820_TYPE_RESERVED, which is
known by the kernel:

> 		if (entry->type == E820_TYPE_SOFT_RESERVED)
> 			memblock_reserve(entry->addr, entry->size);
>

+		if (entry->type == 20)
+			memblock_reserve(entry->addr, entry->size);

> 		if (entry->type != E820_TYPE_RAM && entry->type != E820_TYPE_RESERVED_KERN)
> 			continue;
>

This is however just to show the problem, I didn't check what type 20
is.

To me it doesn't look like the root cause though: the root cause is
that if you don't call memblock_reserve, page->flags remains
uninitialized.

I think page_alloc.c needs to be more robust and detect at least
whether holes within zones (but ideally all pfn_valid struct pages in
the system, even if beyond the end of the zone) aren't being
initialized in the second stage, without relying on the arch code to
remember to call memblock_reserve.

In fact it's not clear why memblock_reserve even exists: that
information can be calculated reliably by page_alloc as a function of
memblock.memory alone, by walking all nodes and all zones. It doesn't
even seem to help in destroying the direct mapping;
reserve_bootmem_region just initializes the struct pages, so it doesn't
need a special memblock.reserved to find those ranges.

In fact it's scary that code then does stuff like this, trusting that
the memblock_reserve information is nearly complete (which it
obviously isn't, given that type 20 doesn't get queued and I got that
type 20 on all my systems):

	for_each_reserved_mem_region(i, &start, &end) {
		if (addr >= start && addr_end <= end)
			return true;
	}

That code in irq-gic-v3-its.c should stop using
for_each_reserved_mem_region and start checking
pfn_valid(addr>>PAGE_SHIFT) and
PageReserved(pfn_to_page(addr>>PAGE_SHIFT)) instead.

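To make the pfn_valid/PageReserved suggestion concrete, here is a
minimal userspace sketch of a per-page range check. It is only a model:
the toy memmap[], MAX_PFN, PG_RESERVED_BIT, mark_reserved() and
range_is_reserved() are hypothetical names made up for illustration,
and pfn_valid(), pfn_to_page() and PageReserved() are simplified
stand-ins for the kernel helpers of the same name, not the real
implementations.

```c
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT	12
#define MAX_PFN		16ULL		/* size of the toy memmap (assumption) */
#define PG_RESERVED_BIT	0x1UL		/* hypothetical flag bit for this model */

struct page {
	unsigned long flags;
};

static struct page memmap[MAX_PFN];

/* Simplified stand-ins for the kernel helpers of the same name. */
static bool pfn_valid(uint64_t pfn)
{
	return pfn < MAX_PFN;
}

static struct page *pfn_to_page(uint64_t pfn)
{
	return &memmap[pfn];
}

static bool PageReserved(const struct page *page)
{
	return (page->flags & PG_RESERVED_BIT) != 0;
}

/* Hypothetical helper: mark pfns in [start_pfn, end_pfn) as reserved. */
static void mark_reserved(uint64_t start_pfn, uint64_t end_pfn)
{
	for (uint64_t pfn = start_pfn; pfn < end_pfn && pfn_valid(pfn); pfn++)
		memmap[pfn].flags |= PG_RESERVED_BIT;
}

/*
 * Return true only if every page backing [addr, addr_end) has a valid
 * struct page and that page is marked reserved. An empty range is
 * treated as not reserved.
 */
static bool range_is_reserved(uint64_t addr, uint64_t addr_end)
{
	if (addr >= addr_end)
		return false;

	uint64_t last = (addr_end - 1) >> PAGE_SHIFT;

	for (uint64_t pfn = addr >> PAGE_SHIFT; pfn <= last; pfn++) {
		if (!pfn_valid(pfn) || !PageReserved(pfn_to_page(pfn)))
			return false;
	}
	return true;
}
```

The point of the model is that the answer comes from the struct page
state itself, so it does not depend on whether the arch code remembered
to queue the range with memblock_reserve.
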
At best memblock.reserved should be calculated automatically by
page_alloc.c based on zone_start_pfn/zone_end_pfn and not passed in by
the e820 caller. Instead of adding the memblock_reserve call for type
20, we should delete the memblock_reserve function.

Thanks,
Andrea
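
As a rough illustration of deriving the holes from the memory list and
the zone span alone, here is a userspace sketch. It assumes the memory
regions are sorted by start and non-overlapping; struct pfn_range and
zone_holes() are hypothetical names for this example only, not kernel
APIs.

```c
#include <stddef.h>
#include <stdint.h>

/* Half-open pfn range [start, end); hypothetical type for this sketch. */
struct pfn_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Walk the sorted, non-overlapping memory regions and record every gap
 * inside [zone_start_pfn, zone_end_pfn). Returns the number of holes
 * found; only the first max_holes of them are stored in holes[].
 */
static size_t zone_holes(const struct pfn_range *mem, size_t nr_mem,
			 uint64_t zone_start_pfn, uint64_t zone_end_pfn,
			 struct pfn_range *holes, size_t max_holes)
{
	size_t n = 0;
	uint64_t cursor = zone_start_pfn;

	for (size_t i = 0; i < nr_mem && cursor < zone_end_pfn; i++) {
		if (mem[i].end <= cursor)
			continue;	/* region lies before the cursor */
		if (mem[i].start >= zone_end_pfn)
			break;		/* region lies past the zone */
		if (mem[i].start > cursor) {
			/* gap between the cursor and this region */
			if (n < max_holes) {
				holes[n].start = cursor;
				holes[n].end = mem[i].start;
			}
			n++;
		}
		cursor = mem[i].end;
	}

	if (cursor < zone_end_pfn) {
		/* trailing gap up to the end of the zone */
		if (n < max_holes) {
			holes[n].start = cursor;
			holes[n].end = zone_end_pfn;
		}
		n++;
	}
	return n;
}
```

With memory regions [0,4), [6,10), [12,20) and a zone spanning pfns
2..16, the walk reports the holes [4,6) and [10,12). Those are the
ranges whose struct pages would need the second initialization stage
even though nobody reserved them explicitly.
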