From: Anshuman Khandual
To: Mark Rutland
Subject: Re: [PATCH V2 2/2] arm64/mm: Enable memory hot remove
Date: Wed, 24 Apr 2019 11:29:28 +0530
In-Reply-To: <20190423160525.GD56999@lakrids.cambridge.arm.com>
Cc: cai@lca.pw, mhocko@suse.com, ira.weiny@intel.com, david@redhat.com, catalin.marinas@arm.com, will.deacon@arm.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org, logang@deltatee.com, james.morse@arm.com, cpandya@codeaurora.org, arunks@codeaurora.org, akpm@linux-foundation.org, osalvador@suse.de, mgorman@techsingularity.net, dan.j.williams@intel.com, linux-arm-kernel@lists.infradead.org, robin.murphy@arm.com

On 04/23/2019 09:35 PM, Mark Rutland wrote:
> On Tue, Apr 23, 2019 at 01:01:58PM +0530, Anshuman Khandual wrote:
>> Generic usage for init_mm.page_table_lock
>>
>> Unless I have missed something else these are the generic init_mm kernel page table
>> modifiers at runtime (at least which uses init_mm.page_table_lock)
>>
>> 1. ioremap_page_range()   /* Mapped I/O memory area */
>> 2. apply_to_page_range()  /* Change existing kernel linear map */
>> 3. vmap_page_range()      /* Vmalloc area */
>
> Internally, those all use the __p??_alloc() functions to handle racy
> additions by transiently taking the PTL when installing a new table, but
> otherwise walk kernel tables _without_ the PTL held. Note that none of
> these ever free an intermediate level of table.

Right, they don't free intermediate level page tables, but I was curious
only about the leaf level modifications.

>
> I believe that the idea is that operations on separate VMAs should never

I guess by 'VMA' you meant a kernel virtual range, not an actual VMA
(vm_area_struct), which applies only to user space and not to the kernel.

> conflict at the leaf level, and operations on the same VMA should be
> serialised somehow w.r.t. that VMA.
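The racy-addition handling described above (allocate a new table, take the PTL only to re-check and install it, free it if someone else got there first) can be sketched in userspace C. This is a minimal sketch under stated assumptions, not kernel code: struct slot, struct table and table_alloc() are hypothetical names, a pthread mutex stands in for init_mm.page_table_lock, and every reader is assumed to take that lock too. The kernel's version (e.g. __pte_alloc_kernel() in mm/memory.c at the time) also issues smp_wmb() before publishing so that lockless walkers see an initialised table; this sketch does not model that.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical next-level table: 512 entries, like a 4 KiB page of 8-byte entries. */
struct table {
	unsigned long entries[512];
};

/* Hypothetical upper-level entry; next == NULL plays the role of pmd_none(). */
struct slot {
	struct table *next;
};

/* Stand-in for init_mm.page_table_lock (assumption: the kernel uses a spinlock). */
static pthread_mutex_t ptl = PTHREAD_MUTEX_INITIALIZER;

/*
 * Ensure *slot points at a next-level table.
 * The allocation happens without the lock held; the lock is taken only
 * to re-check the slot and publish the new table. If another caller
 * installed a table first, our allocation is freed and theirs is kept.
 * Assumes every reader of slot->next also holds ptl.
 */
static int table_alloc(struct slot *slot)
{
	struct table *new;

	assert(slot != NULL);

	new = calloc(1, sizeof(*new));	/* zeroed: all entries "none" */
	if (new == NULL)
		return -ENOMEM;

	pthread_mutex_lock(&ptl);
	if (slot->next == NULL) {	/* still empty: install ours */
		slot->next = new;
		new = NULL;
	}
	pthread_mutex_unlock(&ptl);

	if (new != NULL)		/* lost the race: discard our copy */
		free(new);

	return 0;
}
```

Calling table_alloc() twice on the same empty slot installs one table and leaves it in place on the second call. Note that this pattern only serialises concurrent additions: nothing in it stops another path from freeing slot->next while a lockless walker is using it, which is the hot-remove concern in this thread.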
AFAICT there is nothing other than the hotplug lock, i.e. mem_hotplug_lock,
which prevents concurrent init_mm modifications, and the current situation
is only safe because these VA areas somehow don't overlap with respect to
intermediate page table level spans.

>
> AFAICT, these functions are _never_ called on the linear/direct map or
> vmemmap VA ranges, and whether or not these can conflict with hot-remove
> is entirely dependent on whether those ranges can share a level of table
> with the vmalloc region.

Right, but all these VA ranges (linear, vmemmap, vmalloc) are wired into
init_mm, so I wonder whether it is prudent to rely on a layout scheme, which
varies a lot across architectures, when deciding on race protection. I
wonder why these users should not call [get|put]_online_mems() to prevent
races with hotplug. I will try this out.

That is, unless generic MM expects the platform to lay out these VA ranges
(linear, vmemmap, vmalloc) in a way that guarantees non-overlap at
intermediate level page table spans. Only then would we not need a lock.

>
> Do you know how likely that is to occur? e.g. what proportion of the

TBH, I don't know.

> vmalloc region may share a level of table with the linear or vmemmap
> regions in a typical arm64 or x86 configuration? Can we deliberately
> provoke this failure case?

I have not enumerated those yet, but there are multiple configs on arm64,
and probably on x86, that determine the kernel VA space layout and hence
whether these races are possible. But regardless, it's not right to make
assumptions about the vmalloc range span and skip taking a lock.

I am not sure how to provoke this failure case from user space with a simple
hotplug, because vmalloc physical allocation normally cannot be controlled
without hacking the kernel.

>
> [...]
>
>> In all of the above.
>>
>> - Page table pages [p4d|pud|pmd|pte]_alloc_[kernel] settings are
>>   protected with init_mm.page_table_lock
>
> Racy addition is protected in this manner.

Right.
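Whether the vmalloc region can share a level of table with the linear or vmemmap regions comes down to whether two VA ranges touch a common aligned block of the size that one table at that level covers. A minimal sketch of that check follows, assuming a 4 KiB granule with four levels of 512-entry tables (as with arm64 4K pages and 48-bit VA); other granules, level counts or VA sizes change the shifts. struct va_range, share_table() and the *_TABLE_SHIFT names are hypothetical, not kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Assumed geometry: 4 KiB granule, 4 levels, 512 entries per table,
 * so each table covers 512 times what one of its entries covers.
 */
enum {
	PTE_TABLE_SHIFT = 21,	/* one PTE table covers 2 MiB   */
	PMD_TABLE_SHIFT = 30,	/* one PMD table covers 1 GiB   */
	PUD_TABLE_SHIFT = 39,	/* one PUD table covers 512 GiB */
};

/* Hypothetical half-open kernel VA range [start, end). */
struct va_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Return true if a single table covering (1 << shift) bytes would map
 * addresses from both ranges, i.e. the ranges touch a common
 * (1 << shift)-aligned block. Both ranges must be non-empty.
 */
static bool share_table(struct va_range a, struct va_range b, unsigned int shift)
{
	uint64_t a_first, a_last, b_first, b_last;

	assert(a.start < a.end && b.start < b.end);

	a_first = a.start >> shift;
	a_last = (a.end - 1) >> shift;
	b_first = b.start >> shift;
	b_last = (b.end - 1) >> shift;

	/* Closed intervals of block indices overlap. */
	return a_first <= b_last && b_first <= a_last;
}
```

For example, the adjacent 1 GiB aligned ranges [0x40000000, 0x80000000) and [0x80000000, 0xc0000000) share no PTE or PMD table but do share a PUD table, so a hot remove of one range that frees that PUD table could race with a concurrent mapping operation on the other range. Aligning region boundaries to the coverage of the highest table level that hot remove may free would rule such sharing out, which is the alignment option raised in this thread.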
>
>> - Should not it require init_mm.page_table_lock for all leaf level
>>   (PUD|PMD|PTE) modification as well ?
>
> As above, I believe that the PTL is assumed to not be necessary there
> since other mutual exclusion should be in effect to prevent racy
> modification of leaf entries.

I wonder what those mutual exclusions are, other than the memory hotplug
lock. Again, if they rely on kernel VA space layout assumptions, that's not
a good idea.

>
>> - Should not this require init_mm.page_table_lock for page table walk
>>   itself ?
>>
>> Not taking an overall lock for all these three operations will
>> potentially race with an ongoing memory hot remove operation which
>> takes an overall lock as proposed. Wondering if this has this been
>> safe till now ?
>
> I suspect that the answer is that hot-remove is not thoroughly
> stress-tested today, and conflicts are possible but rare.

I will make these generic modifiers call [get|put]_online_mems() in a
separate patch, at least to protect them from concurrent memory hot remove.

>
> As above, can we figure out how likely conflicts are, and try to come up
> with a stress test?

I will try hot plugging a memory range, without actually onlining it, while
a vmalloc stress test runs on the system.

>
> Is it possible to avoid these specific conflicts (ignoring ptdump) by
> aligning VA regions such that they cannot share intermediate levels of
> table?

Kernel VA space layout is platform specific, and core MM does not mandate
much about it. Hence generic modifiers should not make any assumptions about
it but should protect themselves with locks. Doing anything else just pushes
the problem into the future.

_______________________________________________
linux-arm-kernel mailing list
linux-arm-kernel@lists.infradead.org
http://lists.infradead.org/mailman/listinfo/linux-arm-kernel