Re: [PATCH V2 2/2] arm64/mm: Enable memory hot remove


From: David Hildenbrand <david@redhat.com>
To: Anshuman Khandual <anshuman.khandual@arm.com>,
	Mark Rutland <mark.rutland@arm.com>
Cc: linux-kernel@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org,
	akpm@linux-foundation.org, will.deacon@arm.com,
	catalin.marinas@arm.com, mhocko@suse.com,
	mgorman@techsingularity.net, james.morse@arm.com,
	robin.murphy@arm.com, cpandya@codeaurora.org,
	arunks@codeaurora.org, dan.j.williams@intel.com,
	osalvador@suse.de, cai@lca.pw, logang@deltatee.com,
	ira.weiny@intel.com
Subject: Re: [PATCH V2 2/2] arm64/mm: Enable memory hot remove
Date: Tue, 23 Apr 2019 09:51:45 +0200
Message-ID: <14364fe5-0999-0c60-d109-af22c13deafa@redhat.com>
In-Reply-To: <5c8e4a69-8c71-85e1-3275-c04f84bde639@arm.com>

On 23.04.19 09:45, Anshuman Khandual wrote:
> 
> 
> On 04/23/2019 01:07 PM, David Hildenbrand wrote:
>> On 23.04.19 09:31, Anshuman Khandual wrote:
>>>
>>>
>>> On 04/18/2019 10:58 AM, Anshuman Khandual wrote:
>>>> On 04/17/2019 11:09 PM, Mark Rutland wrote:
>>>>> On Wed, Apr 17, 2019 at 10:15:35PM +0530, Anshuman Khandual wrote:
>>>>>> On 04/17/2019 07:51 PM, Mark Rutland wrote:
>>>>>>> On Wed, Apr 17, 2019 at 03:28:18PM +0530, Anshuman Khandual wrote:
>>>>>>>> On 04/15/2019 07:18 PM, Mark Rutland wrote:
>>>>>>>>> On Sun, Apr 14, 2019 at 11:29:13AM +0530, Anshuman Khandual wrote:
>>>>>
>>>>>>>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>>>>>>>
>>>>>>>>> What precisely is the page_table_lock intended to protect?
>>>>>>>>
>>>>>>>> Concurrent modification to kernel page table (init_mm) while clearing entries.
>>>>>>>
>>>>>>> Concurrent modification by what code?
>>>>>>>
>>>>>>> If something else can *modify* the portion of the table that we're
>>>>>>> manipulating, then I don't see how we can safely walk the table up to
>>>>>>> this point without holding the lock, nor how we can safely add memory.
>>>>>>>
>>>>>>> Even if this is to protect something else which *reads* the tables,
>>>>>>> other code in arm64 which modifies the kernel page tables doesn't take
>>>>>>> the lock.
>>>>>>>
>>>>>>> Usually, if you can do a lockless walk you have to verify that things
>>>>>>> didn't change once you've taken the lock, but we don't follow that
>>>>>>> pattern here.
>>>>>>>
>>>>>>> As things stand it's not clear to me whether this is necessary or
>>>>>>> sufficient.
>>>>>>
>>>>>> Hence let's take a more conservative approach and wrap the entire process of
>>>>>> remove_pagetable() under init_mm.page_table_lock, which looks safe except
>>>>>> in the worst case where free_pages() gets stuck for some reason, in which
>>>>>> case we have a bigger memory problem to deal with than a soft lockup.
>>>>>
>>>>> Sorry, but I'm not happy with _any_ solution until we understand where
>>>>> and why we need to take the init_mm ptl, and have made some effort to
>>>>> ensure that the kernel correctly does so elsewhere. It is not sufficient
>>>>> to consider this code in isolation.
>>>>
>>>> We will have to take the kernel page table lock to avoid making assumptions about
>>>> the present or possible future kernel VA space layout. Wrapping the entire
>>>> remove_pagetable() is coarse-grained, but I don't see why it should
>>>> not be sufficient, at least for this particular teardown operation, regardless of
>>>> how this might affect other kernel pgtable walkers.
>>>>
>>>> IIUC your concern is regarding other parts of kernel code (arm64/generic) which
>>>> assume that the kernel page table won't be changing and hence normally walk the
>>>> table without holding the pgtable lock. Hence those existing pgtable walkers will be
>>>> affected after this change.
>>>>
>>>>>
>>>>> IIUC, before this patch we never clear non-leaf entries in the kernel
>>>>> page tables, so readers don't presently need to take the ptl in order to
>>>>> safely walk down to a leaf entry.
>>>>
>>>> Got it. Will look into this.
>>>>
>>>>>
>>>>> For example, the arm64 ptdump code never takes the ptl, and as of this
>>>>> patch it will blow up if it races with a hot-remove, regardless of
>>>>> whether the hot-remove code itself holds the ptl.
>>>>
>>>> Got it. Are there more such examples where this can be problematic? I
>>>> will be happy to investigate all such places and change/add locking scheme
>>>> in there to make them work with memory hot remove.
>>>>
>>>>>
>>>>> Note that the same applies to the x86 ptdump code; we cannot assume that
>>>>> just because x86 does something that it happens to be correct.
>>>>
>>>> I understand. Will look into other non-x86 platforms as well on how they are
>>>> dealing with this.
>>>>
>>>>>
>>>>> I strongly suspect there are other cases that would fall afoul of this,
>>>>> in both arm64 and generic code.
>>>
>>> On X86
>>>
>>> - kernel_physical_mapping_init() takes the lock for pgtable page allocations as well
>>>   as all leaf level entries including large mappings.
>>>
>>> On Power
>>>
>>> - remove_pagetable() takes an overall high-level init_mm.page_table_lock as I had
>>>   suggested before. __map_kernel_page() calls [pud|pmd|pte]_alloc_[kernel], which
>>>   allocate page table pages under init_mm.page_table_lock, but the
>>>   actual setting of the leaf entries is not protected (unlike x86)
>>>
>>> 	arch_add_memory()
>>> 		create_section_mapping()
>>> 			radix__create_section_mapping()
>>> 				create_physical_mapping()
>>> 					__map_kernel_page()
>>> On arm64.
>>>
>>> Both kernel page table dump and linear mapping (__create_pgd_mapping on init_mm)
>>> will require init_mm.page_table_lock to be really safe against this new memory
>>> hot remove code. I will do the necessary changes as part of this series next time
>>> around. IIUC there is no equivalent generic feature for ARM64_PTDUMP_CORE/DEBUGFS.
>>>
>>>> Will start looking into all such possible cases both on arm64 and generic.
>>>> Mean while more such pointers would be really helpful.
>>>
>>> Generic usage for init_mm.page_table_lock
>>>
>>> Unless I have missed something else, these are the generic init_mm kernel page table
>>> modifiers at runtime (at least those which use init_mm.page_table_lock)
>>>
>>> 	1. ioremap_page_range()		/* Mapped I/O memory area */
>>> 	2. apply_to_page_range()	/* Change existing kernel linear map */
>>> 	3. vmap_page_range()		/* Vmalloc area */
>>>
>>> A. IOREMAP
>>>
>>> ioremap_page_range()
>>> 	ioremap_p4d_range()
>>> 		p4d_alloc()
>>> 		ioremap_try_huge_p4d() -> p4d_set_huge() -> set_p4d()
>>> 		ioremap_pud_range()
>>> 			pud_alloc()
>>> 			ioremap_try_huge_pud() -> pud_set_huge() -> set_pud()
>>> 			ioremap_pmd_range()
>>> 				pmd_alloc()
>>> 				ioremap_try_huge_pmd() -> pmd_set_huge() -> set_pmd()
>>> 				ioremap_pte_range()
>>> 					pte_alloc_kernel()
>>> 						set_pte_at() -> set_pte()
>>> B. APPLY_TO_PAGE_RANGE
>>>
>>> apply_to_page_range()
>>> 	apply_to_p4d_range()
>>> 		p4d_alloc()
>>> 		apply_to_pud_range()
>>> 			pud_alloc()
>>> 			apply_to_pmd_range()
>>> 				pmd_alloc()
>>> 				apply_to_pte_range()
>>> 					pte_alloc_kernel()
>>>
>>> C. VMAP_PAGE_RANGE
>>>
>>> vmap_page_range()
>>> vmap_page_range_noflush()
>>> 	vmap_p4d_range()
>>> 		p4d_alloc()
>>> 		vmap_pud_range()
>>> 			pud_alloc()
>>> 			vmap_pmd_range()
>>> 				pmd_alloc()
>>> 				vmap_pte_range()
>>> 					pte_alloc_kernel()
>>> 					set_pte_at()
>>>
>>> In all of the above:
>>>
>>> - Page table page allocations via [p4d|pud|pmd|pte]_alloc_[kernel] are protected with init_mm.page_table_lock
>>> - Shouldn't all leaf-level (PUD|PMD|PTE) modifications require init_mm.page_table_lock as well?
>>> - Shouldn't the page table walk itself require init_mm.page_table_lock?
>>>
>>> Not taking an overall lock for all three of these operations will potentially race with an ongoing memory
>>> hot remove operation which takes an overall lock as proposed. Wondering if this has been safe until
>>> now?
>>>
>>
>> All memory add/remove operations are currently guarded by
>> mem_hotplug_lock as far as I know.
> 
> Right, but it seems like it guards against concurrent memory hot add or remove operations with
> respect to memory blocks, sections, sysfs etc. But does it cover other init_mm
> modifiers or accessors?
> 

Don't think so, it's purely for memory add/remove via
add_memory/remove_memory/devm_memremap_pages, not anything beyond that.
Whoever uses get_online_mems/put_online_mems is safe with respect to that
- mostly slab/slub.
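
To make the scope concrete, here is a rough user-space sketch, not kernel
code. It assumes mem_hotplug_lock behaves like a reader/writer lock, with
memory add/remove as the writer and get_online_mems/put_online_mems users as
readers. Every *_model name and helper below is a hypothetical stand-in, and
the "lockless walker" merely represents any init_mm page table walker that
never takes this lock.

```c
/*
 * User-space model only, not kernel code. All names here are
 * hypothetical stand-ins for the kernel primitives they mention.
 */
#include <assert.h>
#include <stdbool.h>

/* Minimal non-blocking reader/writer lock: many readers or one writer. */
struct rw_model {
	int readers;
	bool writer;
};

static bool rw_try_read(struct rw_model *l)
{
	if (l->writer)
		return false;
	l->readers++;
	return true;
}

static void rw_read_done(struct rw_model *l)
{
	assert(l->readers > 0);
	l->readers--;
}

static bool rw_try_write(struct rw_model *l)
{
	if (l->writer || l->readers > 0)
		return false;
	l->writer = true;
	return true;
}

static void rw_write_done(struct rw_model *l)
{
	assert(l->writer);
	l->writer = false;
}

/* Assumed stand-in for mem_hotplug_lock. */
static struct rw_model mem_hotplug_model;

/* Models memory add/remove taking the lock as a writer. */
static bool hotplug_begin(void)
{
	return rw_try_write(&mem_hotplug_model);
}

static void hotplug_done(void)
{
	rw_write_done(&mem_hotplug_model);
}

/* Models a get_online_mems()/put_online_mems() user such as slab. */
static bool online_mems_user_can_run(void)
{
	if (!rw_try_read(&mem_hotplug_model))
		return false;	/* excluded while add/remove is in progress */
	rw_read_done(&mem_hotplug_model);
	return true;
}

/* Models a page table walker that takes no lock at all. */
static bool lockless_walker_can_run(void)
{
	return true;	/* never consults mem_hotplug_model */
}
```

In this model, between hotplug_begin() and hotplug_done() the
online_mems_user_can_run() check fails while lockless_walker_can_run() still
succeeds, which is the gap being discussed: the lock only excludes code that
actually takes it.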

-- 

Thanks,

David / dhildenb
