Subject: Re: [PATCH] mm, kasan: don't poison boot memory
From: David Hildenbrand
To: George Kennedy, Andrey Konovalov
Cc: Andrew Morton, Catalin Marinas, Vincenzo Frascino, Dmitry Vyukov, Konrad Rzeszutek Wilk, Will Deacon, Andrey Ryabinin, Alexander Potapenko, Marco Elver, Peter Collingbourne, Evgenii Stepanov, Branislav Rankov, Kevin Brodsky, Christoph Hellwig, kasan-dev, Linux ARM, Linux Memory Management List, LKML, Dhaval Giani, Mike Rapoport
References: <487751e1ccec8fcd32e25a06ce000617e96d7ae1.1613595269.git.andreyknvl@google.com> <797fae72-e3ea-c0b0-036a-9283fa7f2317@oracle.com> <1ac78f02-d0af-c3ff-cc5e-72d6b074fc43@redhat.com>
 <56c97056-6d8b-db0e-e303-421ee625abe3@redhat.com>
Organization: Red Hat GmbH
Message-ID: <4c7351e2-e97c-e740-5800-ada5504588aa@redhat.com>
Date: Mon, 22 Feb 2021 17:39:29 +0100
In-Reply-To: <56c97056-6d8b-db0e-e303-421ee625abe3@redhat.com>

On 22.02.21 17:13, David Hildenbrand wrote:
> On 22.02.21 16:13, George Kennedy wrote:
>>
>>
>> On 2/22/2021 4:52 AM, David Hildenbrand wrote:
>>> On 20.02.21 00:04, George Kennedy wrote:
>>>>
>>>>
>>>> On 2/19/2021 11:45 AM, George Kennedy wrote:
>>>>>
>>>>>
>>>>> On 2/18/2021 7:09 PM, Andrey Konovalov wrote:
>>>>>> On Fri, Feb 19, 2021 at 1:06 AM George Kennedy wrote:
>>>>>>>
>>>>>>>
>>>>>>> On 2/18/2021 3:55 AM, David Hildenbrand wrote:
>>>>>>>> On 17.02.21 21:56, Andrey Konovalov wrote:
>>>>>>>>> During boot, all non-reserved memblock memory is exposed to the buddy
>>>>>>>>> allocator. Poisoning all that memory with KASAN lengthens boot time,
>>>>>>>>> especially on systems with large amount of RAM. This patch makes
>>>>>>>>> page_alloc to not call kasan_free_pages() on all new memory.
>>>>>>>>>
>>>>>>>>> __free_pages_core() is used when exposing fresh memory during system
>>>>>>>>> boot and when onlining memory during hotplug. This patch adds a new
>>>>>>>>> FPI_SKIP_KASAN_POISON flag and passes it to __free_pages_ok() through
>>>>>>>>> free_pages_prepare() from __free_pages_core().
>>>>>>>>>
>>>>>>>>> This has little impact on KASAN memory tracking.
>>>>>>>>>
>>>>>>>>> Assuming that there are no references to newly exposed pages before they
>>>>>>>>> are ever allocated, there won't be any intended (but buggy) accesses to
>>>>>>>>> that memory that KASAN would normally detect.
>>>>>>>>>
>>>>>>>>> However, with this patch, KASAN stops detecting wild and large
>>>>>>>>> out-of-bounds accesses that happen to land on a fresh memory page that
>>>>>>>>> was never allocated. This is taken as an acceptable trade-off.
>>>>>>>>>
>>>>>>>>> All memory allocated normally when the boot is over keeps getting
>>>>>>>>> poisoned as usual.
>>>>>>>>>
>>>>>>>>> Signed-off-by: Andrey Konovalov
>>>>>>>>> Change-Id: Iae6b1e4bb8216955ffc14af255a7eaaa6f35324d
>>>>>>>> Not sure this is the right thing to do, see
>>>>>>>>
>>>>>>>> https://lkml.kernel.org/r/bcf8925d-0949-3fe1-baa8-cc536c529860@oracle.com
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Reversing the order in which memory gets allocated + used during boot
>>>>>>>> (in a patch by me) might have revealed an invalid memory access during
>>>>>>>> boot.
>>>>>>>>
>>>>>>>> I suspect that that issue would no longer get detected with your
>>>>>>>> patch, as the invalid memory access would simply not get detected.
>>>>>>>> Now, I cannot prove that :)
>>>>>>> Since David's patch we're having trouble with the iBFT ACPI table, which
>>>>>>> is mapped in via kmap() - see acpi_map() in "drivers/acpi/osl.c".
>>>>>>> KASAN detects that it is being used after free when ibft_init() accesses
>>>>>>> the iBFT table, but as of yet we can't find where it gets freed (we've
>>>>>>> instrumented calls to kunmap()).
>>>>>> Maybe it doesn't get freed, but what you see is a wild or a large
>>>>>> out-of-bounds access. Since KASAN marks all memory as freed during the
>>>>>> memblock->page_alloc transition, such bugs can manifest as
>>>>>> use-after-frees.
>>>>>
>>>>> It gets freed and re-used. By the time the iBFT table is accessed by
>>>>> ibft_init() the page has been over-written.
>>>>>
>>>>> Setting page flags like the following before the call to kmap()
>>>>> prevents the iBFT table page from being freed:
>>>>
>>>> Cleaned up version:
>>>>
>>>> diff --git a/drivers/acpi/osl.c b/drivers/acpi/osl.c
>>>> index 0418feb..8f0a8e7 100644
>>>> --- a/drivers/acpi/osl.c
>>>> +++ b/drivers/acpi/osl.c
>>>> @@ -287,9 +287,12 @@ static void __iomem *acpi_map(acpi_physical_address pg_off, unsigned long pg_sz)
>>>>
>>>>          pfn = pg_off >> PAGE_SHIFT;
>>>>          if (should_use_kmap(pfn)) {
>>>> +                struct page *page = pfn_to_page(pfn);
>>>> +
>>>>                  if (pg_sz > PAGE_SIZE)
>>>>                          return NULL;
>>>> -                return (void __iomem __force *)kmap(pfn_to_page(pfn));
>>>> +                SetPageReserved(page);
>>>> +                return (void __iomem __force *)kmap(page);
>>>>          } else
>>>>                  return acpi_os_ioremap(pg_off, pg_sz);
>>>>  }
>>>> @@ -299,9 +302,12 @@ static void acpi_unmap(acpi_physical_address pg_off, void __iomem *vaddr)
>>>>          unsigned long pfn;
>>>>
>>>>          pfn = pg_off >> PAGE_SHIFT;
>>>> -        if (should_use_kmap(pfn))
>>>> -                kunmap(pfn_to_page(pfn));
>>>> -        else
>>>> +        if (should_use_kmap(pfn)) {
>>>> +                struct page *page = pfn_to_page(pfn);
>>>> +
>>>> +                ClearPageReserved(page);
>>>> +                kunmap(page);
>>>> +        } else
>>>>                  iounmap(vaddr);
>>>>  }
>>>>
>>>> David, the above works, but wondering why it is now necessary. kunmap()
>>>> is not hit. What other ways could a page mapped via kmap() be unmapped?
>>>>
>>>
>>> Let me look into the code ... I have little experience with ACPI
>>> details, so bear with me.
>>>
>>> I assume that acpi_map()/acpi_unmap() map some firmware blob that is
>>> provided via firmware/bios/... to us.
>>>
>>> should_use_kmap() tells us whether
>>> a) we have a "struct page" and should kmap() that one
>>> b) we don't have a "struct page" and should ioremap.
>>>
>>> As it is a blob, the firmware should always reserve that memory region
>>> via memblock (e.g., memblock_reserve()), such that we either
>>> 1) don't create a memmap ("struct page") at all (-> case b) )
>>> 2) if we have to create a memmap, we mark the page PG_reserved and
>>>    *never* expose it to the buddy (-> case a) )
>>>
>>>
>>> Are you telling me that in this case we might have a memmap for the HW
>>> blob that is *not* PG_reserved? In that case it most probably got
>>> exposed to the buddy where it can happily get allocated/freed.
>>>
>>> The latent BUG would be that that blob gets exposed to the system like
>>> ordinary RAM, and not reserved via memblock early during boot.
>>> Assuming that blob has a low physical address, with my patch it will
>>> get allocated/used a lot earlier - which would mean we trigger this
>>> latent BUG now more easily.
>>>
>>> There have been similar latent BUGs on ARM boards that my patch
>>> discovered where special RAM regions did not get marked as reserved
>>> via the device tree properly.
>>>
>>> Now, this is just a wild guess :) Can you dump the page when mapping
>>> (before PageReserved()) and when unmapping, to see what the state of
>>> that memmap is?
>>
>> Thank you David for the explanation and your help on this,
>>
>> dump_page() before PageReserved and before kmap() in the above patch:
>>
>> [    1.116480] ACPI: Core revision 20201113
>> [    1.117628] XXX acpi_map: about to call kmap()...
>> [    1.118561] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0xbe453
>> [    1.120381] flags: 0xfffffc0000000()
>> [    1.121116] raw: 000fffffc0000000 ffffea0002f914c8 ffffea0002f914c8 0000000000000000
>> [    1.122638] raw: 0000000000000000 0000000000000000 00000000ffffffff 0000000000000000
>> [    1.124146] page dumped because: acpi_map pre SetPageReserved
>>
>> I also added dump_page() before unmapping, but it is not hit.
>> The following for the same pfn now shows up, I believe as a result of
>> setting PageReserved:
>>
>> [   28.098208] BUG: Bad page state in process modprobe  pfn:be453
>> [   28.098394] page:ffffea0002f914c0 refcount:0 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0xbe453
>> [   28.098394] flags: 0xfffffc0001000(reserved)
>> [   28.098394] raw: 000fffffc0001000 dead000000000100 dead000000000122 0000000000000000
>> [   28.098394] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
>> [   28.098394] page dumped because: PAGE_FLAGS_CHECK_AT_PREP flag(s) set
>> [   28.098394] page_owner info is not present (never set?)
>> [   28.098394] Modules linked in:
>> [   28.098394] CPU: 2 PID: 204 Comm: modprobe Not tainted 5.11.0-3dbd5e3 #66
>> [   28.098394] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
>> [   28.098394] Call Trace:
>> [   28.098394]  dump_stack+0xdb/0x120
>> [   28.098394]  bad_page.cold.108+0xc6/0xcb
>> [   28.098394]  check_new_page_bad+0x47/0xa0
>> [   28.098394]  get_page_from_freelist+0x30cd/0x5730
>> [   28.098394]  ? __isolate_free_page+0x4f0/0x4f0
>> [   28.098394]  ? init_object+0x7e/0x90
>> [   28.098394]  __alloc_pages_nodemask+0x2d8/0x650
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  ? __alloc_pages_slowpath.constprop.103+0x2110/0x2110
>> [   28.098394]  ? __sanitizer_cov_trace_pc+0x21/0x50
>> [   28.098394]  alloc_pages_vma+0xe2/0x560
>> [   28.098394]  do_fault+0x194/0x12c0
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  __handle_mm_fault+0x1650/0x26c0
>> [   28.098394]  ? copy_page_range+0x1350/0x1350
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  handle_mm_fault+0x1f9/0x810
>> [   28.098394]  ? write_comp_data+0x2f/0x90
>> [   28.098394]  do_user_addr_fault+0x6f7/0xca0
>> [   28.098394]  exc_page_fault+0xaf/0x1a0
>> [   28.098394]  asm_exc_page_fault+0x1e/0x30
>> [   28.098394] RIP: 0010:__clear_user+0x30/0x60
>
> I think the PAGE_FLAGS_CHECK_AT_PREP check in this instance means that
> someone is trying to allocate that page with the PG_reserved bit set.
> This means that the page actually was exposed to the buddy.
>
> However, when you SetPageReserved(), I don't think that PG_buddy is set
> and the refcount is 0. That could indicate that the page is on the buddy
> PCP list. Could be that it is getting reused a couple of times.
>
> The PFN 0xbe453 looks a little strange, though. Do we expect ACPI tables
> close to 3 GiB? No idea. Could it be that you are trying to map a wrong
> table? Just a guess.

... but I assume ibft_check_device() would bail out on an invalid
checksum. So the question is: why is this page not properly marked as
reserved already?

-- 
Thanks,

David / dhildenb
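
For reference, the "close to 3 GiB" estimate and the flag change between the two dumps can be checked with a little arithmetic. The following is only a minimal user-space sketch, assuming 4 KiB base pages (a PAGE_SHIFT of 12, which the thread does not state explicitly). The helper names pfn_to_phys_addr() and flags_delta() are made up for illustration and are not kernel APIs.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 4 KiB base pages (PAGE_SHIFT of 12), as is usual on x86-64. */
#define ASSUMED_PAGE_SHIFT 12

/* Hypothetical helper: physical address of the first byte of page frame pfn. */
uint64_t pfn_to_phys_addr(uint64_t pfn)
{
	/* Guard against shifting significant bits out of the 64-bit value. */
	assert(pfn <= (UINT64_MAX >> ASSUMED_PAGE_SHIFT));
	return pfn << ASSUMED_PAGE_SHIFT;
}

/* Hypothetical helper: bits that differ between two page->flags values. */
uint64_t flags_delta(uint64_t before, uint64_t after)
{
	return before ^ after;
}
```

With these assumptions, pfn_to_phys_addr(0xbe453) yields 0xbe453000, which is roughly 2.97 GiB, and flags_delta(0xfffffc0000000, 0xfffffc0001000) yields 0x1000, the single bit that the second dump decodes as "reserved".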