LinuxPPC-Dev Archive on lore.kernel.org
* [RFC PATCH 1/3] powerpc: Don't hard code the size of pte page
From: Aneesh Kumar K.V @ 2013-01-30 13:18 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1359551916-10321-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

Use PTRS_PER_PTE to derive the hidx offset within a pte page instead of
hard-coding 0x8000.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable.h |    6 ++++++
 arch/powerpc/mm/hash_low_64.S      |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index a9cbd3b..fc57855 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -17,6 +17,12 @@ struct mm_struct;
 #  include <asm/pgtable-ppc32.h>
 #endif
 
+/*
+ * hidx is in the second half of the page table. We use the
+ * 8 bytes per each pte entry.
+ */
+#define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)
+
 #ifndef __ASSEMBLY__
 
 #include <asm/tlbflush.h>
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index 5658508..94fd37b 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -484,7 +484,7 @@ END_FTR_SECTION(CPU_FTR_NOEXECUTE|CPU_FTR_COHERENT_ICACHE, CPU_FTR_NOEXECUTE)
 	beq	htab_inval_old_hpte
 
 	ld	r6,STK_PARAM(R6)(r1)
-	ori	r26,r6,0x8000		/* Load the hidx mask */
+	ori	r26,r6,PTE_PAGE_HIDX_OFFSET /* Load the hidx mask. */
 	ld	r26,0(r26)
 	addi	r5,r25,36		/* Check actual HPTE_SUB bit, this */
 	rldcr.	r0,r31,r5,0		/* must match pgtable.h definition */
@@ -601,7 +601,7 @@ htab_pte_insert_ok:
 	sld	r4,r4,r5
 	andc	r26,r26,r4
 	or	r26,r26,r3
-	ori	r5,r6,0x8000
+	ori	r5,r6,PTE_PAGE_HIDX_OFFSET
 	std	r26,0(r5)
 	lwsync
 	std	r30,0(r6)
-- 
1.7.10
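
For illustration, a minimal userspace C sketch of the arithmetic behind
PTE_PAGE_HIDX_OFFSET. It assumes PTRS_PER_PTE is 4096, which makes the macro
equal to the 0x8000 it replaces; the helper names are hypothetical and not
part of the patch. It also checks the two properties the assembly appears to
rely on: the value fits the 16-bit immediate of ori, and OR-ing it into a pte
address gives the same result as adding it, as long as that bit is clear in
the address.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed value for this sketch: 4096 ptes per pte page. */
#define PTRS_PER_PTE 4096UL

/* From the patch: the hidx words start after the ptes, 8 bytes per pte. */
#define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)

/* Hypothetical helper: "ori" takes a 16-bit unsigned immediate. */
static int fits_ori_immediate(unsigned long value)
{
	return value <= 0xffffUL;
}

/*
 * Hypothetical helper: address of the hidx word paired with the pte at
 * ptep, formed with OR as the assembly does. This equals an addition only
 * while the offset bit is clear in ptep, i.e. ptep lies in the first half
 * of a suitably aligned pte page.
 */
static uintptr_t hidx_addr_or(uintptr_t ptep)
{
	assert((ptep & PTE_PAGE_HIDX_OFFSET) == 0);
	return ptep | PTE_PAGE_HIDX_OFFSET;
}

static void hidx_examples(void)
{
	/* Sixth pte of a pte page that starts at a 64K boundary. */
	uintptr_t ptep = 0x10000 + 5 * 8;

	assert(PTE_PAGE_HIDX_OFFSET == 0x8000);
	assert(fits_ori_immediate(PTE_PAGE_HIDX_OFFSET));
	assert(hidx_addr_or(ptep) == ptep + PTE_PAGE_HIDX_OFFSET);
}
```

If PTRS_PER_PTE grew beyond 8192 in this sketch, fits_ori_immediate() would
fail, which is the kind of constraint a derived constant makes easy to
overlook.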


* Re: [PATCH v2] ppc/iommu: use find_first_bit to look up entries in the iommu table
From: Thadeu Lima de Souza Cascardo @ 2013-01-30 12:55 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulus, linuxppc-dev, shangw, anton
In-Reply-To: <1359419756.18955.16.camel@pasglop>

On Tue, Jan 29, 2013 at 11:35:56AM +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2013-01-10 at 17:33 -0200, Thadeu Lima de Souza Cascardo wrote:
> > Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@linux.vnet.ibm.com>
> > ---
> > v2:
> > Remove the unneeded extra variable i, which caused build failure.
> 
> I believe something equivalent is already in -next, can you dbl check ?
> 
> Cheers,
> Ben.
> 

There is, and it uses bitmap_empty, which is even clearer.

Thanks.
Cascardo.

> > ---
> >  arch/powerpc/kernel/iommu.c |    9 ++-------
> >  1 files changed, 2 insertions(+), 7 deletions(-)
> > 
> > diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
> > index 6d48ff8..0fc44d2 100644
> > --- a/arch/powerpc/kernel/iommu.c
> > +++ b/arch/powerpc/kernel/iommu.c
> > @@ -708,7 +708,7 @@ struct iommu_table *iommu_init_table(struct iommu_table *tbl, int nid)
> >  
> >  void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> >  {
> > -	unsigned long bitmap_sz, i;
> > +	unsigned long bitmap_sz;
> >  	unsigned int order;
> >  
> >  	if (!tbl || !tbl->it_map) {
> > @@ -725,14 +725,9 @@ void iommu_free_table(struct iommu_table *tbl, const char *node_name)
> >  		clear_bit(0, tbl->it_map);
> >  
> >  	/* verify that table contains no entries */
> > -	/* it_size is in entries, and we're examining 64 at a time */
> > -	for (i = 0; i < (tbl->it_size/64); i++) {
> > -		if (tbl->it_map[i] != 0) {
> > +	if (find_first_bit(tbl->it_map, tbl->it_size) < tbl->it_size)
> >  			printk(KERN_WARNING "%s: Unexpected TCEs for %s\n",
> >  				__func__, node_name);
> > -			break;
> > -		}
> > -	}
> >  
> >  	/* calculate bitmap size in bytes */
> >  	bitmap_sz = (tbl->it_size + 7) / 8;
> 
> 
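
A minimal userspace sketch of why a bit-granular check differs from the
word-wise loop that the quoted patch removes. The helper names are
hypothetical stand-ins written for this example (sketch_find_first_bit()
mimics the return convention of the kernel's find_first_bit(): the index of
the first set bit, or the size if none is set), and BITS_PER_LONG is defined
locally instead of the literal 64 used in the original loop.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (sizeof(unsigned long) * CHAR_BIT)

/*
 * Word-wise check, modelled on the removed loop: only whole words are
 * examined, so bits in a trailing partial word are never looked at.
 */
static int old_check_has_entries(const unsigned long *map, unsigned long nbits)
{
	unsigned long i;

	for (i = 0; i < nbits / BITS_PER_LONG; i++)
		if (map[i] != 0)
			return 1;
	return 0;
}

/* Stand-in for find_first_bit(): first set bit index, or nbits if none. */
static unsigned long sketch_find_first_bit(const unsigned long *map,
					   unsigned long nbits)
{
	unsigned long bit;

	for (bit = 0; bit < nbits; bit++)
		if (map[bit / BITS_PER_LONG] & (1UL << (bit % BITS_PER_LONG)))
			return bit;
	return nbits;
}

/* Bit-granular check, in the shape the patch proposes. */
static int new_check_has_entries(const unsigned long *map, unsigned long nbits)
{
	return sketch_find_first_bit(map, nbits) < nbits;
}

static void bitmap_examples(void)
{
	/* One full word that is empty, plus 8 more bits with bit 3 set. */
	unsigned long map[2] = { 0, 1UL << 3 };
	unsigned long nbits = BITS_PER_LONG + 8;

	assert(old_check_has_entries(map, nbits) == 0);	/* entry missed */
	assert(new_check_has_entries(map, nbits) == 1);	/* entry found */
}
```

When the table size is a multiple of the word size, both checks agree; they
differ only for a trailing partial word, as in bitmap_examples().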


* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Tang Chen @ 2013-01-30 10:18 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <5108F2B3.3090506@cn.fujitsu.com>

On 01/30/2013 06:15 PM, Tang Chen wrote:
> Hi Simon,
>
> Please see below. :)
>
> On 01/29/2013 08:52 PM, Simon Jeons wrote:
>> Hi Tang,
>>
>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
>>
>> Some questions ask you, not has relationship with this patchset, but is
>> memory hotplug stuff.
>>
>> 1. In function node_states_check_changes_online:
>>
>> comments:
>> * If we don't have HIGHMEM nor movable node,
>> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
>> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>>
>> How to understand it? Why we don't have HIGHMEM nor movable node and
>> node_states[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
>> N_NORMAL_MEMORY only means the node has regular memory.
>>
>
> First of all, I think we need to understand why we need N_MEMORY.
>
> In order to support movable node, which has only ZONE_MOVABLE (the last
> zone),
> we introduce N_MEMORY to represent the node has normal, highmem and
> movable memory.
>
> Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE.

Sorry, that should read: "we don't have movable node" means you didn't
configure CONFIG_MOVABLE_NODE.

> This config option doesn't mean we don't have movable pages, (NO)
> it means we don't have a node which has only movable pages (only have
> ZONE_MOVABLE). (YES)
>
> Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node),
> we don't need a
> separate node_states[] element to represent a particular node because we
> won't have a node
> which has only ZONE_MOVABLE.
>
> So,
> 1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY
> == N_NORMAL_MEMORY,
> which means N_NORMAL_MEMORY effects as N_MEMORY. If we online pages as
> movable, we need
> to update node_states[N_NORMAL_MEMORY].
>
> Please refer to the definition of enum zone_type, if we don't have
> CONFIG_HIGHMEM, we won't
> have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always there.
> So we can have movable
> pages, and the zone_last should be ZONE_MOVABLE.
>
> Again, because we won't have a node only having ZONE_MOVABLE, so we just
> need to update
> node_states[N_NORMAL_MEMORY].
>
>> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
>> * contains nodes which have zones of 0...ZONE_MOVABLE,
>> * set zone_last to ZONE_MOVABLE.
>>
>> How to understand?
>
> 2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem,
> so if we don't have
> movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY effects as
> N_MEMORY. If we
> online pages as movable, we need to update node_states[N_NORMAL_MEMORY].
>
>>
>> 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
>> correct? The comments said that must include/overlap, why?
>>
>
> This one is easy, if I understand you correctly.
> move_pfn_range_left() is used to move the left most part [start_pfn,
> end_pfn) of z2 to z1.
> So if end_pfn<= z2->zone_start_pfn, it means [start_pfn, end_pfn) is not
> part of z2.
> Then it fails.
>
>> 3. In function online_pages, the normal case(w/o online_kernel,
>> online_movable), why not check if the new zone is overlap with adjacent
>> zones?
>>
>
> Can a zone overlap with the others? I don't think so.
>
> One pfn can only be in one zone,
> zone = page_zone(pfn_to_page(pfn));
>
> so a zone cannot overlap with the others, I think. :)
>
> But maybe I misunderstand you. :)
>
>> 4. Could you summarize the difference implementation between hot-add and
>> logic-add, hot-remove and logic-remove?
>
> Sorry, I don't quite understand what do you mean by logic-add/remove.
> Would you please explain more ?
>
> If you meant the sys fs interfaces, I think they are just another set of
> entrances
> of memory hotplug.
>
> Thanks. :)
>
>>
>>
>>>
>>> This patch-set aims to implement physical memory hot-removing.
>>>
>>> The patches can free/remove the following things:
>>>
>>> - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>> - memmap of sparse-vmemmap : [PATCH 6,7,8,10/15]
>>> - page table of removed memory : [RFC PATCH 7,8,10/15]
>>> - node and related sysfs files : [RFC PATCH 13-15/15]
>>>
>>>
>>> Existing problem:
>>> If CONFIG_MEMCG is selected, we will allocate memory to store page
>>> cgroup
>>> when we online pages.
>>>
>>> For example: there is a memory device on node 1. The address range
>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9,
>>> memory10,
>>> and memory11 under the directory /sys/devices/system/memory/.
>>>
>>> If CONFIG_MEMCG is selected, when we online memory8, the memory
>>> stored page
>>> cgroup is not provided by this memory device. But when we online
>>> memory9, the
>>> memory stored page cgroup may be provided by memory8. So we can't
>>> offline
>>> memory8 now. We should offline the memory in the reversed order.
>>>
>>> When the memory device is hotremoved, we will auto offline memory
>>> provided
>>> by this memory device. But we don't know which memory is onlined
>>> first, so
>>> offlining memory may fail.
>>>
>>> In patch1, we provide a solution which is not good enough:
>>> Iterate twice to offline the memory.
>>> 1st iterate: offline every non primary memory block.
>>> 2nd iterate: offline primary (i.e. first added) memory block.
>>>
>>> And a new idea from Wen Congyang<wency@cn.fujitsu.com> is:
>>> allocate the memory from the memory block they are describing.
>>>
>>> But we are not sure if it is OK to do so because there is not
>>> existing API
>>> to do so, and we need to move page_cgroup memory allocation from
>>> MEM_GOING_ONLINE
>>> to MEM_ONLINE. And also, it may interfere the hugepage.
>>>
>>>
>>>
>>> How to test this patchset?
>>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG,
>>> MEMORY_HOTREMOVE,
>>> ACPI_HOTPLUG_MEMORY must be selected.
>>> 2. load the module acpi_memhotplug
>>> 3. hotplug the memory device(it depends on your hardware)
>>> You will see the memory device under the directory
>>> /sys/bus/acpi/devices/.
>>> Its name is PNP0C80:XX.
>>> 4. online/offline pages provided by this memory device
>>> You can write online/offline to
>>> /sys/devices/system/memory/memoryX/state to
>>> online/offline pages provided by this memory device
>>> 5. hotremove the memory device
>>> You can hotremove the memory device by the hardware, or writing 1 to
>>> /sys/bus/acpi/devices/PNP0C80:XX/eject.
>>
>> Is there a similar knode to hot-add the memory device?
>>
>>>
>>>
>>> Note: if the memory provided by the memory device is used by the
>>> kernel, it
>>> can't be offlined. It is not a bug.
>>>
>>>
>>> Changelogs from v5 to v6:
>>> Patch3: Add some more comments to explain memory hot-remove.
>>> Patch4: Remove bootmem member in struct firmware_map_entry.
>>> Patch6: Repeatedly register bootmem pages when using hugepage.
>>> Patch8: Repeatedly free bootmem pages when using hugepage.
>>> Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>> Patch15: New patch, pgdat is not freed in patch14, so don't allocate
>>> a new
>>> one when online a node.
>>>
>>> Changelogs from v4 to v5:
>>> Patch7: new patch, move pgdat_resize_lock into
>>> sparse_remove_one_section() to
>>> avoid disabling irq because we need flush tlb when free pagetables.
>>> Patch8: new patch, pick up some common APIs that are used to free
>>> direct mapping
>>> and vmemmap pagetables.
>>> Patch9: free direct mapping pagetables on x86_64 arch.
>>> Patch10: free vmemmap pagetables.
>>> Patch11: since freeing memmap with vmemmap has been implemented, the
>>> config
>>> macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>> no longer needed.
>>> Patch13: no need to modify acpi_memory_disable_device() since it was
>>> removed,
>>> and add nid parameter when calling remove_memory().
>>>
>>> Changelogs from v3 to v4:
>>> Patch7: remove unused codes.
>>> Patch8: fix nr_pages that is passed to free_map_bootmem()
>>>
>>> Changelogs from v2 to v3:
>>> Patch9: call sync_global_pgds() if pgd is changed
>>> Patch10: fix a problem in the patch
>>>
>>> Changelogs from v1 to v2:
>>> Patch1: new patch, offline memory twice. 1st iterate: offline every
>>> non primary
>>> memory block. 2nd iterate: offline primary (i.e. first added) memory
>>> block.
>>>
>>> Patch3: new patch, no logical change, just remove redundant codes.
>>>
>>> Patch9: merge the patch from wujianguo into this patch. flush tlb on
>>> all cpu
>>> after the pagetable is changed.
>>>
>>> Patch12: new patch, free node_data when a node is offlined.
>>>
>>>
>>> Tang Chen (6):
>>> memory-hotplug: move pgdat_resize_lock into
>>> sparse_remove_one_section()
>>> memory-hotplug: remove page table of x86_64 architecture
>>> memory-hotplug: remove memmap of sparse-vmemmap
>>> memory-hotplug: Integrated __remove_section() of
>>> CONFIG_SPARSEMEM_VMEMMAP.
>>> memory-hotplug: remove sysfs file of node
>>> memory-hotplug: Do not allocate pdgat if it was not freed when
>>> offline.
>>>
>>> Wen Congyang (5):
>>> memory-hotplug: try to offline the memory twice to avoid dependence
>>> memory-hotplug: remove redundant codes
>>> memory-hotplug: introduce new function arch_remove_memory() for
>>> removing page table depends on architecture
>>> memory-hotplug: Common APIs to support page tables hot-remove
>>> memory-hotplug: free node_data when a node is offlined
>>>
>>> Yasuaki Ishimatsu (4):
>>> memory-hotplug: check whether all memory blocks are offlined or not
>>> when removing memory
>>> memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>> memory-hotplug: implement register_page_bootmem_info_section of
>>> sparse-vmemmap
>>> memory-hotplug: memory_hotplug: clear zone when removing the memory
>>>
>>> arch/arm64/mm/mmu.c | 3 +
>>> arch/ia64/mm/discontig.c | 10 +
>>> arch/ia64/mm/init.c | 18 ++
>>> arch/powerpc/mm/init_64.c | 10 +
>>> arch/powerpc/mm/mem.c | 12 +
>>> arch/s390/mm/init.c | 12 +
>>> arch/s390/mm/vmem.c | 10 +
>>> arch/sh/mm/init.c | 17 ++
>>> arch/sparc/mm/init_64.c | 10 +
>>> arch/tile/mm/init.c | 8 +
>>> arch/x86/include/asm/pgtable_types.h | 1 +
>>> arch/x86/mm/init_32.c | 12 +
>>> arch/x86/mm/init_64.c | 390 +++++++++++++++++++++++++++++
>>> arch/x86/mm/pageattr.c | 47 ++--
>>> drivers/acpi/acpi_memhotplug.c | 8 +-
>>> drivers/base/memory.c | 6 +
>>> drivers/firmware/memmap.c | 96 +++++++-
>>> include/linux/bootmem.h | 1 +
>>> include/linux/firmware-map.h | 6 +
>>> include/linux/memory_hotplug.h | 15 +-
>>> include/linux/mm.h | 4 +-
>>> mm/memory_hotplug.c | 459 +++++++++++++++++++++++++++++++---
>>> mm/sparse.c | 8 +-
>>> 23 files changed, 1094 insertions(+), 69 deletions(-)
>>>
>>
>>
>>
>


* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Tang Chen @ 2013-01-30 10:15 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359463973.1624.15.camel@kernel>

Hi Simon,

Please see below. :)

On 01/29/2013 08:52 PM, Simon Jeons wrote:
> Hi Tang,
>
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
>
> I have some questions for you. They are not related to this patchset, but
> they are about memory hotplug.
>
> 1. In function node_states_check_changes_online:
>
> comments:
> * If we don't have HIGHMEM nor movable node,
> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>
> How should I understand this? Why, when we have neither HIGHMEM nor a movable
> node, does node_states[N_NORMAL_MEMORY] contain 0...ZONE_MOVABLE? IIUC,
> N_NORMAL_MEMORY only means the node has regular memory.
>

First of all, I think we need to understand why we need N_MEMORY.

To support movable nodes, i.e. nodes which have only ZONE_MOVABLE (the last
zone), we introduced N_MEMORY to represent nodes that have normal, highmem
or movable memory.

Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE.
This config option doesn't mean we don't have movable pages, (NO)
it means we don't have a node which has only movable pages (only have 
ZONE_MOVABLE). (YES)

Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node), we
don't need a separate node_states[] element to represent a particular node,
because we won't have a node which has only ZONE_MOVABLE.

So,
1) if we have neither highmem nor movable node, N_MEMORY == N_HIGH_MEMORY ==
   N_NORMAL_MEMORY, which means N_NORMAL_MEMORY acts as N_MEMORY. If we online
   pages as movable, we need to update node_states[N_NORMAL_MEMORY].

Please refer to the definition of enum zone_type: if we don't have
CONFIG_HIGHMEM, we won't have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE
will always be there. So we can have movable pages, and zone_last should be
ZONE_MOVABLE.

Again, because we won't have a node that has only ZONE_MOVABLE, we just need
to update node_states[N_NORMAL_MEMORY].

> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> * contains nodes which have zones of 0...ZONE_MOVABLE,
> * set zone_last to ZONE_MOVABLE.
>
> How to understand?

2) this code is inside #ifdef CONFIG_HIGHMEM, which means we have highmem. So
   if we don't have movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY
   acts as N_MEMORY. If we online pages as movable, we need to update
   node_states[N_NORMAL_MEMORY].
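
To make the two cases concrete, here is a small userspace sketch of how
zone_last could be chosen. It is only a model under stated assumptions: the
enum and function names are hypothetical, the zone list is simplified to
three zones, and the config options are passed as booleans instead of being
compile-time settings as they are in the kernel.

```c
#include <assert.h>
#include <stdbool.h>

/* Simplified zone list for this sketch; the real enum zone_type has more. */
enum sketch_zone { SK_ZONE_NORMAL, SK_ZONE_HIGHMEM, SK_ZONE_MOVABLE };

/*
 * Last zone whose pages count towards node_states[N_NORMAL_MEMORY].
 * With neither highmem nor movable node, N_NORMAL_MEMORY stands in for
 * N_MEMORY, so it has to cover everything up to ZONE_MOVABLE.
 */
static enum sketch_zone zone_last_normal(bool highmem, bool movable_node)
{
	if (!highmem && !movable_node)
		return SK_ZONE_MOVABLE;
	return SK_ZONE_NORMAL;
}

/*
 * Last zone whose pages count towards node_states[N_HIGH_MEMORY].
 * Only meaningful when highmem is configured. Without movable node,
 * N_HIGH_MEMORY stands in for N_MEMORY.
 */
static enum sketch_zone zone_last_high(bool movable_node)
{
	return movable_node ? SK_ZONE_HIGHMEM : SK_ZONE_MOVABLE;
}

static void zone_last_examples(void)
{
	/* Case 1) above: no highmem, no movable node. */
	assert(zone_last_normal(false, false) == SK_ZONE_MOVABLE);
	/* Case 2) above: highmem, but no movable node. */
	assert(zone_last_high(false) == SK_ZONE_MOVABLE);
	/* With movable node, a separate N_MEMORY covers ZONE_MOVABLE. */
	assert(zone_last_normal(false, true) == SK_ZONE_NORMAL);
	assert(zone_last_high(true) == SK_ZONE_HIGHMEM);
}
```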

>
> 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
> correct? The comments said that must include/overlap, why?
>

This one is easy, if I understand you correctly.
move_pfn_range_left() is used to move the leftmost part [start_pfn, end_pfn)
of z2 to z1. So if end_pfn <= z2->zone_start_pfn, [start_pfn, end_pfn) is not
part of z2, and the move fails.
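
The range check can be sketched in userspace C as below. The struct and
function names are hypothetical stand-ins. Only the last test (no overlap
with z2) is the one discussed above; the first two are assumptions added so
that "leftmost part of z2" is fully expressed.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the zone fields used by this check. */
struct sketch_zone_range {
	unsigned long start_pfn;	/* first pfn of the zone */
	unsigned long end_pfn;		/* one past the last pfn of the zone */
};

/* True if [start_pfn, end_pfn) can be moved out of the left side of z2. */
static bool can_move_left(const struct sketch_zone_range *z2,
			  unsigned long start_pfn, unsigned long end_pfn)
{
	/* Assumption: the range must not reach beyond the end of z2. */
	if (end_pfn > z2->end_pfn)
		return false;
	/* Assumption: the range must begin at or before z2's left edge. */
	if (start_pfn > z2->start_pfn)
		return false;
	/* From the discussion: the range must overlap z2 at all. */
	if (end_pfn <= z2->start_pfn)
		return false;
	return true;
}

static void move_left_examples(void)
{
	struct sketch_zone_range z2 = { 100, 200 };

	assert(can_move_left(&z2, 100, 150));	/* leftmost part of z2 */
	assert(!can_move_left(&z2, 50, 100));	/* ends where z2 begins */
	assert(!can_move_left(&z2, 120, 150));	/* middle of z2, not left */
}
```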

> 3. In function online_pages, in the normal case (w/o online_kernel,
> online_movable), why not check whether the new zone overlaps with adjacent
> zones?
>

Can a zone overlap with the others? I don't think so.

One pfn can only be in one zone,
    zone = page_zone(pfn_to_page(pfn));

so a zone cannot overlap with the others, I think. :)

But maybe I misunderstand you. :)

> 4. Could you summarize the implementation differences between hot-add and
> logic-add, and between hot-remove and logic-remove?

Sorry, I don't quite understand what you mean by logic-add/remove.
Would you please explain more?

If you meant the sysfs interfaces, I think they are just another set of
entry points to memory hotplug.

Thanks.  :)

>
>
>>
>> This patch-set aims to implement physical memory hot-removing.
>>
>> The patches can free/remove the following things:
>>
>>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
>>    - node and related sysfs files              : [RFC PATCH 13-15/15]
>>
>>
>> Existing problem:
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages.
>>
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
>> cgroup is not provided by this memory device. But when we online memory9, the
>> memory stored page cgroup may be provided by memory8. So we can't offline
>> memory8 now. We should offline the memory in the reversed order.
>>
>> When the memory device is hotremoved, we will auto offline memory provided
>> by this memory device. But we don't know which memory is onlined first, so
>> offlining memory may fail.
>>
>> In patch1, we provide a solution which is not good enough:
>> Iterate twice to offline the memory.
>> 1st iterate: offline every non primary memory block.
>> 2nd iterate: offline primary (i.e. first added) memory block.
>>
>> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
>> allocate the memory from the memory block they are describing.
>>
>> But we are not sure if it is OK to do so because there is not existing API
>> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
>> to MEM_ONLINE. And also, it may interfere the hugepage.
>>
>>
>>
>> How to test this patchset?
>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>>     ACPI_HOTPLUG_MEMORY must be selected.
>> 2. load the module acpi_memhotplug
>> 3. hotplug the memory device(it depends on your hardware)
>>     You will see the memory device under the directory /sys/bus/acpi/devices/.
>>     Its name is PNP0C80:XX.
>> 4. online/offline pages provided by this memory device
>>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
>>     online/offline pages provided by this memory device
>> 5. hotremove the memory device
>>     You can hotremove the memory device by the hardware, or writing 1 to
>>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
>
> Is there a similar knode to hot-add the memory device?
>
>>
>>
>> Note: if the memory provided by the memory device is used by the kernel, it
>> can't be offlined. It is not a bug.
>>
>>
>> Changelogs from v5 to v6:
>>   Patch3: Add some more comments to explain memory hot-remove.
>>   Patch4: Remove bootmem member in struct firmware_map_entry.
>>   Patch6: Repeatedly register bootmem pages when using hugepage.
>>   Patch8: Repeatedly free bootmem pages when using hugepage.
>>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>>            one when online a node.
>>
>> Changelogs from v4 to v5:
>>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>>           avoid disabling irq because we need flush tlb when free pagetables.
>>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
>>           and vmemmap pagetables.
>>   Patch9: free direct mapping pagetables on x86_64 arch.
>>   Patch10: free vmemmap pagetables.
>>   Patch11: since freeing memmap with vmemmap has been implemented, the config
>>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>            no longer needed.
>>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>>            and add nid parameter when calling remove_memory().
>>
>> Changelogs from v3 to v4:
>>   Patch7: remove unused codes.
>>   Patch8: fix nr_pages that is passed to free_map_bootmem()
>>
>> Changelogs from v2 to v3:
>>   Patch9: call sync_global_pgds() if pgd is changed
>>   Patch10: fix a problem in the patch
>>
>> Changelogs from v1 to v2:
>>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>>           memory block. 2nd iterate: offline primary (i.e. first added) memory
>>           block.
>>
>>   Patch3: new patch, no logical change, just remove redundant codes.
>>
>>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>>           after the pagetable is changed.
>>
>>   Patch12: new patch, free node_data when a node is offlined.
>>
>>
>> Tang Chen (6):
>>    memory-hotplug: move pgdat_resize_lock into
>>      sparse_remove_one_section()
>>    memory-hotplug: remove page table of x86_64 architecture
>>    memory-hotplug: remove memmap of sparse-vmemmap
>>    memory-hotplug: Integrated __remove_section() of
>>      CONFIG_SPARSEMEM_VMEMMAP.
>>    memory-hotplug: remove sysfs file of node
>>    memory-hotplug: Do not allocate pdgat if it was not freed when
>>      offline.
>>
>> Wen Congyang (5):
>>    memory-hotplug: try to offline the memory twice to avoid dependence
>>    memory-hotplug: remove redundant codes
>>    memory-hotplug: introduce new function arch_remove_memory() for
>>      removing page table depends on architecture
>>    memory-hotplug: Common APIs to support page tables hot-remove
>>    memory-hotplug: free node_data when a node is offlined
>>
>> Yasuaki Ishimatsu (4):
>>    memory-hotplug: check whether all memory blocks are offlined or not
>>      when removing memory
>>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>    memory-hotplug: implement register_page_bootmem_info_section of
>>      sparse-vmemmap
>>    memory-hotplug: memory_hotplug: clear zone when removing the memory
>>
>>   arch/arm64/mm/mmu.c                  |    3 +
>>   arch/ia64/mm/discontig.c             |   10 +
>>   arch/ia64/mm/init.c                  |   18 ++
>>   arch/powerpc/mm/init_64.c            |   10 +
>>   arch/powerpc/mm/mem.c                |   12 +
>>   arch/s390/mm/init.c                  |   12 +
>>   arch/s390/mm/vmem.c                  |   10 +
>>   arch/sh/mm/init.c                    |   17 ++
>>   arch/sparc/mm/init_64.c              |   10 +
>>   arch/tile/mm/init.c                  |    8 +
>>   arch/x86/include/asm/pgtable_types.h |    1 +
>>   arch/x86/mm/init_32.c                |   12 +
>>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>>   arch/x86/mm/pageattr.c               |   47 ++--
>>   drivers/acpi/acpi_memhotplug.c       |    8 +-
>>   drivers/base/memory.c                |    6 +
>>   drivers/firmware/memmap.c            |   96 +++++++-
>>   include/linux/bootmem.h              |    1 +
>>   include/linux/firmware-map.h         |    6 +
>>   include/linux/memory_hotplug.h       |   15 +-
>>   include/linux/mm.h                   |    4 +-
>>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>>   mm/sparse.c                          |    8 +-
>>   23 files changed, 1094 insertions(+), 69 deletions(-)
>>
>
>
>


* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Greg KH @ 2013-01-30  4:58 UTC (permalink / raw)
  To: Toshi Kani
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	rjw, linux-acpi, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <1357861230-29549-2-git-send-email-toshi.kani@hp.com>

On Thu, Jan 10, 2013 at 04:40:19PM -0700, Toshi Kani wrote:
> +/*
> + * Hot-plug device information
> + */

Again, stop it with the "generic" hotplug term here, and everywhere
else.  You are doing a very _specific_ type of hotplug devices, so spell
it out.  We've worked hard to hotplug _everything_ in Linux, you are
going to confuse a lot of people with this type of terms.

> +union shp_dev_info {
> +	struct shp_cpu {
> +		u32		cpu_id;
> +	} cpu;

What is this?  Why not point to the system device for the cpu?

> +	struct shp_memory {
> +		int		node;
> +		u64		start_addr;
> +		u64		length;
> +	} mem;

Same here, why not point to the system device?

> +	struct shp_hostbridge {
> +	} hb;
> +
> +	struct shp_node {
> +	} node;

What happened here with these?  Empty structures?  Huh?

> +};
> +
> +struct shp_device {
> +	struct list_head	list;
> +	struct device		*device;

No, make it a "real" device, embed the device into it.

But, again, I'm going to ask why you aren't using the existing cpu /
memory / bridge / node devices that we have in the kernel.  Please use
them, or give me a _really_ good reason why they will not work.

> +	enum shp_class		class;
> +	union shp_dev_info	info;
> +};
> +
> +/*
> + * Hot-plug request
> + */
> +struct shp_request {
> +	/* common info */
> +	enum shp_operation	operation;	/* operation */
> +
> +	/* hot-plug event info: only valid for hot-plug operations */
> +	void			*handle;	/* FW handle */
> +	u32			event;		/* FW event */

What is this?

greg k-h
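
One way to read the "embed the device" request is sketched below in userspace
C. All names are hypothetical: sketch_device is a stand-in for the kernel's
struct device, and sketch_container_of() follows the same idea as the kernel's
container_of() macro. With the device embedded, the wrapper can be recovered
from a device pointer without a separate back-pointer or lookup.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal stand-in for struct device. */
struct sketch_device {
	const char *name;
};

/* Same idea as container_of(): step back from a member to its container. */
#define sketch_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Wrapper with the device embedded instead of pointed to. */
struct sketch_shp_device {
	int class_id;			/* placeholder for the class field */
	struct sketch_device dev;	/* embedded, not a pointer */
};

/* Recover the wrapper from a pointer to its embedded device. */
static struct sketch_shp_device *to_sketch_shp_device(struct sketch_device *dev)
{
	return sketch_container_of(dev, struct sketch_shp_device, dev);
}

static void embed_examples(void)
{
	struct sketch_shp_device shp = { 2, { "mem0" } };

	assert(to_sketch_shp_device(&shp.dev) == &shp);
	assert(to_sketch_shp_device(&shp.dev)->class_id == 2);
}
```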


* Re: [RFC PATCH v2 03/12] drivers/base: Add system device hotplug framework
From: Greg KH @ 2013-01-30  4:54 UTC (permalink / raw)
  To: Toshi Kani
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	rjw, linux-acpi, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <1357861230-29549-4-git-send-email-toshi.kani@hp.com>

On Thu, Jan 10, 2013 at 04:40:21PM -0700, Toshi Kani wrote:
> Added sys_hotplug.c, which is the system device hotplug framework code.
> 
> shp_register_handler() allows modules to register their hotplug handlers
> to the framework.  shp_submit_req() provides the interface to submit
> a hotplug or online/offline request of system devices.  The request is
> then put into hp_workqueue.  shp_start_req() calls all registered handlers
> in ascending order for each phase.  If any handler failed in validate or
> execute phase, shp_start_req() initiates its rollback procedure.
> 
> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> ---
>  drivers/base/Makefile      |    1 
>  drivers/base/sys_hotplug.c |  313 ++++++++++++++++++++++++++++++++++++++++++++
>  2 files changed, 314 insertions(+)
>  create mode 100644 drivers/base/sys_hotplug.c
> 
> diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> index 5aa2d70..2e9b2f1 100644
> --- a/drivers/base/Makefile
> +++ b/drivers/base/Makefile
> @@ -21,6 +21,7 @@ endif
>  obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
>  obj-$(CONFIG_REGMAP)	+= regmap/
>  obj-$(CONFIG_SOC_BUS) += soc.o
> +obj-y			+= sys_hotplug.o

No option to select this for systems that don't need it?  If not, then
put it up higher with all of the other code for the core.

thanks,

greg k-h
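
For readers following the quoted description, a rough userspace sketch of
"call handlers in ascending order, roll back on failure". The struct, the
function names and the example handlers are hypothetical. Undoing the
already completed handlers in reverse order is an assumption of this sketch;
the quoted description only says that a rollback procedure is initiated.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical handler entry; the array is assumed sorted by order. */
struct sketch_handler {
	int order;
	int (*execute)(void);
	void (*rollback)(void);
};

/*
 * Run handlers in array order. On the first failure, undo the handlers
 * that already succeeded, newest first, and return the error.
 */
static int sketch_run_handlers(const struct sketch_handler *h, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++) {
		int ret = h[i].execute();

		if (ret != 0) {
			while (i-- > 0)
				h[i].rollback();
			return ret;
		}
	}
	return 0;
}

/* Example handlers used only to exercise the sketch. */
static int sketch_undone;

static int sketch_ok(void) { return 0; }
static int sketch_fail(void) { return -1; }
static void sketch_undo(void) { sketch_undone++; }

static void rollback_examples(void)
{
	const struct sketch_handler h[] = {
		{ 10, sketch_ok, sketch_undo },
		{ 20, sketch_ok, sketch_undo },
		{ 30, sketch_fail, sketch_undo },
	};

	sketch_undone = 0;
	assert(sketch_run_handlers(h, 3) == -1);
	assert(sketch_undone == 2);	/* only the two that succeeded */
}
```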


* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Greg KH @ 2013-01-30  4:53 UTC (permalink / raw)
  To: Toshi Kani
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	rjw, linux-acpi, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <1357861230-29549-2-git-send-email-toshi.kani@hp.com>

On Thu, Jan 10, 2013 at 04:40:19PM -0700, Toshi Kani wrote:
> Added include/linux/sys_hotplug.h, which defines the system device
> hotplug framework interfaces used by the framework itself and
> handlers.
> 
> The order values define the calling sequence of handlers.  For add
> execute, the ordering is ACPI->MEM->CPU.  Memory is onlined before
> CPU so that threads on new CPUs can start using their local memory.
> The ordering of the delete execute is symmetric to the add execute.
> 
> struct shp_request defines a hot-plug request information.  The
> device resource information is managed with a list so that a single
> request may target to multiple devices.
> 
> Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> ---
>  include/linux/sys_hotplug.h |  181 +++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 181 insertions(+)
>  create mode 100644 include/linux/sys_hotplug.h
> 
> diff --git a/include/linux/sys_hotplug.h b/include/linux/sys_hotplug.h
> new file mode 100644
> index 0000000..86674dd
> --- /dev/null
> +++ b/include/linux/sys_hotplug.h
> @@ -0,0 +1,181 @@
> +/*
> + * sys_hotplug.h - System device hot-plug framework
> + *
> + * Copyright (C) 2012 Hewlett-Packard Development Company, L.P.
> + *	Toshi Kani <toshi.kani@hp.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License version 2 as
> + * published by the Free Software Foundation.
> + */
> +
> +#ifndef _LINUX_SYS_HOTPLUG_H
> +#define _LINUX_SYS_HOTPLUG_H
> +
> +#include <linux/list.h>
> +#include <linux/device.h>
> +
> +/*
> + * System device hot-plug operation proceeds in the following order.
> + *   Validate phase -> Execute phase -> Commit phase
> + *
> + * The order values below define the calling sequence of platform
> + * neutral handlers for each phase in ascending order.  The order
> + * values of firmware-specific handlers are defined in sys_hotplug.h
> + * under firmware specific directories.
> + */
> +
> +/* All order values must be smaller than this value */
> +#define SHP_ORDER_MAX				0xffffff
> +
> +/* Add Validate order values */
> +
> +/* Add Execute order values */
> +#define SHP_MEM_ADD_EXECUTE_ORDER		100
> +#define SHP_CPU_ADD_EXECUTE_ORDER		110
> +
> +/* Add Commit order values */
> +
> +/* Delete Validate order values */
> +#define SHP_CPU_DEL_VALIDATE_ORDER		100
> +#define SHP_MEM_DEL_VALIDATE_ORDER		110
> +
> +/* Delete Execute order values */
> +#define SHP_CPU_DEL_EXECUTE_ORDER		10
> +#define SHP_MEM_DEL_EXECUTE_ORDER		20
> +
> +/* Delete Commit order values */
> +

Empty value?

Anyway, as I said before, don't use "values", just call things directly
in the order you need to.

This isn't like other operating systems, we don't need to be so
"flexible", we can modify the core code as much as we want and need to
if future things come along :)
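
To make that concrete, here is a rough userspace sketch of what "call
things directly" could look like for the add and delete execute paths.
The handler names and the trace buffer are made up for illustration and
are not taken from the patch set; only the ACPI->MEM->CPU ordering (and
its mirror image for delete) comes from the changelog above.

```c
#include <assert.h>
#include <string.h>

#define SHP_LOG_LEN 64

/* Append a handler name to the trace without overrunning the buffer. */
static void shp_trace(char *log, const char *name)
{
	assert(log != NULL && name != NULL);
	strncat(log, name, SHP_LOG_LEN - strlen(log) - 1);
}

/* Hypothetical handlers; the real ones would do the actual hotplug work. */
static int acpi_bus_add(char *log) { shp_trace(log, "ACPI "); return 0; }
static int mem_add(char *log)      { shp_trace(log, "MEM ");  return 0; }
static int cpu_add(char *log)      { shp_trace(log, "CPU ");  return 0; }

static int cpu_del(char *log)      { shp_trace(log, "CPU ");  return 0; }
static int mem_del(char *log)      { shp_trace(log, "MEM ");  return 0; }
static int acpi_bus_del(char *log) { shp_trace(log, "ACPI "); return 0; }

/* Add execute: the order is simply the order of the calls. */
static int shp_add_execute(char *log)
{
	int ret;

	ret = acpi_bus_add(log);
	if (ret)
		return ret;
	ret = mem_add(log);	/* memory first, so new CPUs find local memory */
	if (ret)
		return ret;
	return cpu_add(log);
}

/* Delete execute: the mirror image of the add path. */
static int shp_del_execute(char *log)
{
	int ret;

	ret = cpu_del(log);
	if (ret)
		return ret;
	ret = mem_del(log);
	if (ret)
		return ret;
	return acpi_bus_del(log);
}
```

With this shape there are no order constants to keep in sync; adding a
step later means adding one call in the right place.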

thanks,

greg k-h

^ permalink raw reply

* Re: [RFC PATCH v2 02/12] ACPI: Add sys_hotplug.h for system device hotplug framework
From: Greg KH @ 2013-01-30  4:51 UTC (permalink / raw)
  To: Toshi Kani
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	Rafael J. Wysocki, linux-acpi, isimatu.yasuaki, srivatsa.bhat,
	guohanjun, bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <1358191290.14145.88.camel@misato.fc.hp.com>

On Mon, Jan 14, 2013 at 12:21:30PM -0700, Toshi Kani wrote:
> On Mon, 2013-01-14 at 20:07 +0100, Rafael J. Wysocki wrote:
> > On Monday, January 14, 2013 11:42:09 AM Toshi Kani wrote:
> > > On Mon, 2013-01-14 at 19:47 +0100, Rafael J. Wysocki wrote:
> > > > On Monday, January 14, 2013 08:53:53 AM Toshi Kani wrote:
> > > > > On Fri, 2013-01-11 at 22:25 +0100, Rafael J. Wysocki wrote:
> > > > > > On Thursday, January 10, 2013 04:40:20 PM Toshi Kani wrote:
> > > > > > > Added include/acpi/sys_hotplug.h, which is ACPI-specific system
> > > > > > > device hotplug header and defines the order values of ACPI-specific
> > > > > > > handlers.
> > > > > > > 
> > > > > > > Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> > > > > > > ---
> > > > > > >  include/acpi/sys_hotplug.h |   48 ++++++++++++++++++++++++++++++++++++++++++++
> > > > > > >  1 file changed, 48 insertions(+)
> > > > > > >  create mode 100644 include/acpi/sys_hotplug.h
> > > > > > > 
> > > > > > > diff --git a/include/acpi/sys_hotplug.h b/include/acpi/sys_hotplug.h
> > > > > > > new file mode 100644
> > > > > > > index 0000000..ad80f61
> > > > > > > --- /dev/null
> > > > > > > +++ b/include/acpi/sys_hotplug.h
> > > > > > > @@ -0,0 +1,48 @@
> > > > > > > +/*
> > > > > > > + * sys_hotplug.h - ACPI System device hot-plug framework
> > > > > > > + *
> > > > > > > + * Copyright (C) 2012 Hewlett-Packard Development Company, L.P.
> > > > > > > + *	Toshi Kani <toshi.kani@hp.com>
> > > > > > > + *
> > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > + * published by the Free Software Foundation.
> > > > > > > + */
> > > > > > > +
> > > > > > > +#ifndef _ACPI_SYS_HOTPLUG_H
> > > > > > > +#define _ACPI_SYS_HOTPLUG_H
> > > > > > > +
> > > > > > > +#include <linux/list.h>
> > > > > > > +#include <linux/device.h>
> > > > > > > +#include <linux/sys_hotplug.h>
> > > > > > > +
> > > > > > > +/*
> > > > > > > + * System device hot-plug operation proceeds in the following order.
> > > > > > > + *   Validate phase -> Execute phase -> Commit phase
> > > > > > > + *
> > > > > > > + * The order values below define the calling sequence of ACPI-specific
> > > > > > > + * handlers for each phase in ascending order.  The order value of
> > > > > > > + * platform-neutral handlers are defined in <linux/sys_hotplug.h>.
> > > > > > > + */
> > > > > > > +
> > > > > > > +/* Add Validate order values */
> > > > > > > +#define SHP_ACPI_BUS_ADD_VALIDATE_ORDER		0	/* must be first */
> > > > > > > +
> > > > > > > +/* Add Execute order values */
> > > > > > > +#define SHP_ACPI_BUS_ADD_EXECUTE_ORDER		10
> > > > > > > +#define SHP_ACPI_RES_ADD_EXECUTE_ORDER		20
> > > > > > > +
> > > > > > > +/* Add Commit order values */
> > > > > > > +#define SHP_ACPI_BUS_ADD_COMMIT_ORDER		10
> > > > > > > +
> > > > > > > +/* Delete Validate order values */
> > > > > > > +#define SHP_ACPI_BUS_DEL_VALIDATE_ORDER		0	/* must be first */
> > > > > > > +#define SHP_ACPI_RES_DEL_VALIDATE_ORDER		10
> > > > > > > +
> > > > > > > +/* Delete Execute order values */
> > > > > > > +#define SHP_ACPI_BUS_DEL_EXECUTE_ORDER		100
> > > > > > > +
> > > > > > > +/* Delete Commit order values */
> > > > > > > +#define SHP_ACPI_BUS_DEL_COMMIT_ORDER		100
> > > > > > > +
> > > > > > > +#endif	/* _ACPI_SYS_HOTPLUG_H */
> > > > > > > --
> > > > > > 
> > > > > > Why did you use the particular values above?
> > > > > 
> > > > > The ordering values above are used to define the relative order among
> > > > > handlers.  For instance, the 100 for SHP_ACPI_BUS_DEL_EXECUTE_ORDER can
> > > > > potentially be 21 since it is still larger than 20 for
> > > > > SHP_MEM_DEL_EXECUTE_ORDER defined in linux/sys_hotplug.h.  I picked 100
> > > > > so that more platform-neutral handlers can be added in between 20 and
> > > > > 100 in future.
> > > > 
> > > > I thought so, but I don't think it's a good idea to add gaps like this.
> > > 
> > > OK, I will use an equal gap of 10 for all values.  So, the 100 in the
> > > above example will be changed to 30.  
> > 
> > I wonder why you want to have those gaps at all.
> 
> Oh, I see.  I think some gap is helpful since it allows a new handler to
> come between without recompiling other modules.  For instance, OEM
> vendors may want to add their own handlers with loadable modules after
> the kernel is distributed.

No, we don't support such a model, sorry, just make it a sequence of
numbers and go from there.  If a vendor wants to modify the kernel to
add new values, they can rebuild the core code as well.

I really don't like the whole idea of values in the first place, can't
we just do things in the correct order in the code, and not be driven by
random magic values?

thanks,

greg k-h

^ permalink raw reply

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Greg KH @ 2013-01-30  4:48 UTC (permalink / raw)
  To: Toshi Kani
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	Rafael J. Wysocki, linux-acpi, isimatu.yasuaki, srivatsa.bhat,
	guohanjun, bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <1358190124.14145.79.camel@misato.fc.hp.com>

On Mon, Jan 14, 2013 at 12:02:04PM -0700, Toshi Kani wrote:
> On Mon, 2013-01-14 at 19:48 +0100, Rafael J. Wysocki wrote:
> > On Monday, January 14, 2013 08:33:48 AM Toshi Kani wrote:
> > > On Fri, 2013-01-11 at 22:23 +0100, Rafael J. Wysocki wrote:
> > > > On Thursday, January 10, 2013 04:40:19 PM Toshi Kani wrote:
> > > > > Added include/linux/sys_hotplug.h, which defines the system device
> > > > > hotplug framework interfaces used by the framework itself and
> > > > > handlers.
> > > > > 
> > > > > The order values define the calling sequence of handlers.  For add
> > > > > execute, the ordering is ACPI->MEM->CPU.  Memory is onlined before
> > > > > CPU so that threads on new CPUs can start using their local memory.
> > > > > The ordering of the delete execute is symmetric to the add execute.
> > > > > 
> > > > > struct shp_request defines a hot-plug request information.  The
> > > > > device resource information is managed with a list so that a single
> > > > > request may target to multiple devices.
> > > > > 
> > >  :
> > > > > +
> > > > > +struct shp_device {
> > > > > +	struct list_head	list;
> > > > > +	struct device		*device;
> > > > > +	enum shp_class		class;
> > > > > +	union shp_dev_info	info;
> > > > > +};
> > > > > +
> > > > > +/*
> > > > > + * Hot-plug request
> > > > > + */
> > > > > +struct shp_request {
> > > > > +	/* common info */
> > > > > +	enum shp_operation	operation;	/* operation */
> > > > > +
> > > > > +	/* hot-plug event info: only valid for hot-plug operations */
> > > > > +	void			*handle;	/* FW handle */
> > > > 
> > > > What's the role of handle here?
> > > 
> > > On ACPI-based platforms, the handle keeps a notified ACPI handle when a
> > > hot-plug request is made.  ACPI bus handlers, acpi_add_execute() /
> > > acpi_del_execute(), then scans / trims ACPI devices from the handle.
> > 
> > OK, so this is ACPI-specific and should be described as such.
> 
> Other FW interface I know is parisc, which has mod_index (module index)
> to identify a unique object, just like what ACPI handle does.  The
> handle can keep the mod_index as an opaque value as well.  But as you
> said, I do not know if the handle works for all other FWs.  So, I will
> add descriptions, such that the hot-plug event info is modeled after
> ACPI and may need to be revisited when supporting other FW.

Please make it a "real" pointer, and not a void *, those shouldn't be
used at all if possible.
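
As a small illustration of the difference, assuming a made-up firmware
object type (struct shp_fw_object and struct shp_request_sketch below
are hypothetical and not part of the patch), a typed pointer lets the
compiler reject a wrong object being passed in, which a void * cannot do:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical firmware object; on ACPI the ACPI handle would play this role. */
struct shp_fw_object {
	unsigned long id;
};

/* Cut-down stand-in for struct shp_request, showing only the handle. */
struct shp_request_sketch {
	struct shp_fw_object *handle;	/* typed, not void * */
};

/* Return the firmware id behind a request, or 0 if no handle is attached. */
static unsigned long shp_request_fw_id(const struct shp_request_sketch *req)
{
	assert(req != NULL);
	return req->handle ? req->handle->id : 0;
}
```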

thanks,

greg k-h

^ permalink raw reply

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Simon Jeons @ 2013-01-30  7:32 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <5108B5D1.6050600@cn.fujitsu.com>

On Wed, 2013-01-30 at 13:55 +0800, Tang Chen wrote:
> On 01/30/2013 11:27 AM, Simon Jeons wrote:
> > On Wed, 2013-01-30 at 10:16 +0800, Tang Chen wrote:
> >> On 01/29/2013 09:04 PM, Simon Jeons wrote:
> >>> Hi Tang,
> >>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >>>> From: Wen Congyang <wency@cn.fujitsu.com>
> >>>>
> >>>> When memory is removed, the corresponding pagetables should also be removed.
> >>>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> >>>> architecture pagetable removing.
> >>>
> >>> Why don't need to build_all_zonelists like online_pages does during
> >>> hot-add path(add_memory)?
> >>
> >> Hi Simon,
> >>
> >> As you said, build_all_zonelists is done by online_pages. When the
> >> memory device
> >> is hot-added, we cannot use it. we can only use is when we online the
> >> pages on it.
> >
> > Why?
> >
> > If a node has just one memory device and memory is small, some zone will
> > not present like zone_highmem, then hot-add another memory device and
> > zone_highmem appear, if you should build_all_zonelists this time?
> 
> Hi Simon,
> 
> We built zone list when the first memory on the node is hot-added.
> 
> add_memory()
>   |-->if (!node_online(nid)) hotadd_new_pgdat()
>                               |-->free_area_init_node()
>                               |-->build_all_zonelists()
> 
> All the zones on the new node will be initialized as empty. So here, we 
> build zone list.
> 
> But actually we did nothing because no page is online, and zones are empty.
> In build_zonelists_node(), populated_zone(zone) will always be false.
> 
> The real work of building zone list is when pages are online. :)
> 
> 
> And in your question, you said some small memory is there, and 
> zone_normal is present.
> OK, when these pages are onlined (not added), the zone list has been 
> rebuilt.
> But pages in zone_highmem is not added, which means not onlined, so we 
> don't need to
> build zone list for it. And later, the zone_highmem pages are added, we 
> still don't
> rebuild the zone list because the real rebuilding work is when the pages 
> are onlined.
> 
> I think this is the current logic. :)

Thanks for the clarification. Actually, I had missed "Even if the memory is
hot-added, it is not at ready-to-use state. For using newly added
memory, you have to 'online' the memory section" in the doc. :)
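
For the archive, the add-versus-online distinction described above can
be modelled in a few lines of userspace C. This is only a toy: the
toy_* names and structures are invented here and are not the kernel's
data structures. It just shows that rebuilding the zone list after a
hot-add finds nothing new, because a zone only counts as populated once
some of its pages have been onlined.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_ZONES 2	/* index 0: "normal", index 1: "highmem" in this toy */

struct toy_zone {
	unsigned long spanned_pages;	/* pages hot-added to the zone */
	unsigned long present_pages;	/* pages actually onlined */
};

struct toy_node {
	struct toy_zone zones[TOY_NR_ZONES];
	int zonelist[TOY_NR_ZONES];	/* indices of populated zones */
	int nr_zonelist;
};

/* A zone is populated only if it has onlined pages. */
static bool toy_populated_zone(const struct toy_zone *zone)
{
	return zone->present_pages != 0;
}

/* Rebuild the zone list from scratch, skipping empty zones. */
static void toy_build_zonelist(struct toy_node *node)
{
	int i;

	node->nr_zonelist = 0;
	for (i = 0; i < TOY_NR_ZONES; i++)
		if (toy_populated_zone(&node->zones[i]))
			node->zonelist[node->nr_zonelist++] = i;
}

/* Hot-add: the memory becomes known, but no page is usable yet. */
static void toy_add_memory(struct toy_node *node, int zone, unsigned long pages)
{
	assert(zone >= 0 && zone < TOY_NR_ZONES);
	node->zones[zone].spanned_pages += pages;
	toy_build_zonelist(node);	/* runs, but finds no newly populated zone */
}

/* Online: pages become usable, so the rebuild can pick the zone up. */
static void toy_online_pages(struct toy_node *node, int zone, unsigned long pages)
{
	struct toy_zone *z;

	assert(zone >= 0 && zone < TOY_NR_ZONES);
	z = &node->zones[zone];
	assert(pages <= z->spanned_pages - z->present_pages);
	z->present_pages += pages;
	toy_build_zonelist(node);
}
```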

> 
> Thanks. :)
> 
> >
> >>
> >> But we can online the pages as different types, kernel or movable (which
> >> belongs to
> >> different zones), and we can online part of the memory, not all of them.
> >> So each time we online some pages, we should check if we need to update
> >> the zone list.
> >>
> >> So I think that is why we do build_all_zonelists when online_pages.
> >> (just my opinion)
> >>
> >> Thanks. :)
> >>
> >>>
> >>>>
> >>>> All pages of virtual mapping in removed memory cannot be freed if some pages
> >>>> used as PGD/PUD include not only removed memory but also other memory. So the
> >>>> patch uses the following way to check whether a page can be freed or not.
> >>>>
> >>>>    1. When removing memory, the page structs of the removed memory are filled
> >>>>       with 0xFD.
> >>>>    2. When all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can
> >>>>       be cleared. In this case, the page used as PT/PMD can be freed.
> >>>>
> >>>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> >>>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> >>>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >>>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> >>>> ---
> >>>>    arch/x86/include/asm/pgtable_types.h |    1 +
> >>>>    arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
> >>>>    arch/x86/mm/pageattr.c               |   47 +++---
> >>>>    include/linux/bootmem.h              |    1 +
> >>>>    4 files changed, 326 insertions(+), 22 deletions(-)
> >>>>
> >>>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> >>>> index 3c32db8..4b6fd2a 100644
> >>>> --- a/arch/x86/include/asm/pgtable_types.h
> >>>> +++ b/arch/x86/include/asm/pgtable_types.h
> >>>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
> >>>>     * as a pte too.
> >>>>     */
> >>>>    extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> >>>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
> >>>>
> >>>>    #endif	/* !__ASSEMBLY__ */
> >>>>
> >>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >>>> index 9ac1723..fe01116 100644
> >>>> --- a/arch/x86/mm/init_64.c
> >>>> +++ b/arch/x86/mm/init_64.c
> >>>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>>>    }
> >>>>    EXPORT_SYMBOL_GPL(arch_add_memory);
> >>>>
> >>>> +#define PAGE_INUSE 0xFD
> >>>> +
> >>>> +static void __meminit free_pagetable(struct page *page, int order)
> >>>> +{
> >>>> +	struct zone *zone;
> >>>> +	bool bootmem = false;
> >>>> +	unsigned long magic;
> >>>> +	unsigned int nr_pages = 1 << order;
> >>>> +
> >>>> +	/* bootmem page has reserved flag */
> >>>> +	if (PageReserved(page)) {
> >>>> +		__ClearPageReserved(page);
> >>>> +		bootmem = true;
> >>>> +
> >>>> +		magic = (unsigned long)page->lru.next;
> >>>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> >>>> +			while (nr_pages--)
> >>>> +				put_page_bootmem(page++);
> >>>> +		} else
> >>>> +			__free_pages_bootmem(page, order);
> >>>> +	} else
> >>>> +		free_pages((unsigned long)page_address(page), order);
> >>>> +
> >>>> +	/*
> >>>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> >>>> +	 * are all allocated by bootmem.
> >>>> +	 */
> >>>> +	if (bootmem) {
> >>>> +		zone = page_zone(page);
> >>>> +		zone_span_writelock(zone);
> >>>> +		zone->present_pages += nr_pages;
> >>>> +		zone_span_writeunlock(zone);
> >>>> +		totalram_pages += nr_pages;
> >>>> +	}
> >>>> +}
> >>>> +
> >>>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> >>>> +{
> >>>> +	pte_t *pte;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> >>>> +		pte = pte_start + i;
> >>>> +		if (pte_val(*pte))
> >>>> +			return;
> >>>> +	}
> >>>> +
> >>>> +	/* free a pte table */
> >>>> +	free_pagetable(pmd_page(*pmd), 0);
> >>>> +	spin_lock(&init_mm.page_table_lock);
> >>>> +	pmd_clear(pmd);
> >>>> +	spin_unlock(&init_mm.page_table_lock);
> >>>> +}
> >>>> +
> >>>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> >>>> +{
> >>>> +	pmd_t *pmd;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> >>>> +		pmd = pmd_start + i;
> >>>> +		if (pmd_val(*pmd))
> >>>> +			return;
> >>>> +	}
> >>>> +
> >>>> +	/* free a pmd table */
> >>>> +	free_pagetable(pud_page(*pud), 0);
> >>>> +	spin_lock(&init_mm.page_table_lock);
> >>>> +	pud_clear(pud);
> >>>> +	spin_unlock(&init_mm.page_table_lock);
> >>>> +}
> >>>> +
> >>>> +/* Return true if pgd is changed, otherwise return false. */
> >>>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> >>>> +{
> >>>> +	pud_t *pud;
> >>>> +	int i;
> >>>> +
> >>>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> >>>> +		pud = pud_start + i;
> >>>> +		if (pud_val(*pud))
> >>>> +			return false;
> >>>> +	}
> >>>> +
> >>>> +	/* free a pud table */
> >>>> +	free_pagetable(pgd_page(*pgd), 0);
> >>>> +	spin_lock(&init_mm.page_table_lock);
> >>>> +	pgd_clear(pgd);
> >>>> +	spin_unlock(&init_mm.page_table_lock);
> >>>> +
> >>>> +	return true;
> >>>> +}
> >>>> +
> >>>> +static void __meminit
> >>>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> >>>> +		 bool direct)
> >>>> +{
> >>>> +	unsigned long next, pages = 0;
> >>>> +	pte_t *pte;
> >>>> +	void *page_addr;
> >>>> +	phys_addr_t phys_addr;
> >>>> +
> >>>> +	pte = pte_start + pte_index(addr);
> >>>> +	for (; addr < end; addr = next, pte++) {
> >>>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> >>>> +		if (next > end)
> >>>> +			next = end;
> >>>> +
> >>>> +		if (!pte_present(*pte))
> >>>> +			continue;
> >>>> +
> >>>> +		/*
> >>>> +		 * We mapped [0,1G) memory as identity mapping when
> >>>> +		 * initializing, in arch/x86/kernel/head_64.S. These
> >>>> +		 * pagetables cannot be removed.
> >>>> +		 */
> >>>> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> >>>> +		if (phys_addr < (phys_addr_t)0x40000000)
> >>>> +			return;
> >>>> +
> >>>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> >>>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> >>>> +			if (!direct) {
> >>>> +				free_pagetable(pte_page(*pte), 0);
> >>>> +				pages++;
> >>>> +			}
> >>>> +
> >>>> +			spin_lock(&init_mm.page_table_lock);
> >>>> +			pte_clear(&init_mm, addr, pte);
> >>>> +			spin_unlock(&init_mm.page_table_lock);
> >>>> +		} else {
> >>>> +			/*
> >>>> +			 * If we are not removing the whole page, it means
> >>>> +			 * other ptes in this page are being used and we cannot
> >>>> +			 * remove them. So fill the unused ptes with 0xFD, and
> >>>> +			 * remove the page when it is wholly filled with 0xFD.
> >>>> +			 */
> >>>> +			memset((void *)addr, PAGE_INUSE, next - addr);
> >>>> +			page_addr = page_address(pte_page(*pte));
> >>>> +
> >>>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> >>>> +				free_pagetable(pte_page(*pte), 0);
> >>>> +				pages++;
> >>>> +
> >>>> +				spin_lock(&init_mm.page_table_lock);
> >>>> +				pte_clear(&init_mm, addr, pte);
> >>>> +				spin_unlock(&init_mm.page_table_lock);
> >>>> +			}
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	/* Call free_pte_table() in remove_pmd_table(). */
> >>>> +	flush_tlb_all();
> >>>> +	if (direct)
> >>>> +		update_page_count(PG_LEVEL_4K, -pages);
> >>>> +}
> >>>> +
> >>>> +static void __meminit
> >>>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> >>>> +		 bool direct)
> >>>> +{
> >>>> +	unsigned long pte_phys, next, pages = 0;
> >>>> +	pte_t *pte_base;
> >>>> +	pmd_t *pmd;
> >>>> +
> >>>> +	pmd = pmd_start + pmd_index(addr);
> >>>> +	for (; addr < end; addr = next, pmd++) {
> >>>> +		next = pmd_addr_end(addr, end);
> >>>> +
> >>>> +		if (!pmd_present(*pmd))
> >>>> +			continue;
> >>>> +
> >>>> +		if (pmd_large(*pmd)) {
> >>>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> >>>> +			    IS_ALIGNED(next, PMD_SIZE)) {
> >>>> +				if (!direct) {
> >>>> +					free_pagetable(pmd_page(*pmd),
> >>>> +						       get_order(PMD_SIZE));
> >>>> +					pages++;
> >>>> +				}
> >>>> +
> >>>> +				spin_lock(&init_mm.page_table_lock);
> >>>> +				pmd_clear(pmd);
> >>>> +				spin_unlock(&init_mm.page_table_lock);
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			/*
> >>>> +			 * We use 2M page, but we need to remove part of them,
> >>>> +			 * so split 2M page to 4K page.
> >>>> +			 */
> >>>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> >>>> +			BUG_ON(!pte_base);
> >>>> +			__split_large_page((pte_t *)pmd, addr,
> >>>> +					   (pte_t *)pte_base);
> >>>> +
> >>>> +			spin_lock(&init_mm.page_table_lock);
> >>>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> >>>> +			spin_unlock(&init_mm.page_table_lock);
> >>>> +
> >>>> +			flush_tlb_all();
> >>>> +		}
> >>>> +
> >>>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> >>>> +		remove_pte_table(pte_base, addr, next, direct);
> >>>> +		free_pte_table(pte_base, pmd);
> >>>> +		unmap_low_page(pte_base);
> >>>> +	}
> >>>> +
> >>>> +	/* Call free_pmd_table() in remove_pud_table(). */
> >>>> +	if (direct)
> >>>> +		update_page_count(PG_LEVEL_2M, -pages);
> >>>> +}
> >>>> +
> >>>> +static void __meminit
> >>>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >>>> +		 bool direct)
> >>>> +{
> >>>> +	unsigned long pmd_phys, next, pages = 0;
> >>>> +	pmd_t *pmd_base;
> >>>> +	pud_t *pud;
> >>>> +
> >>>> +	pud = pud_start + pud_index(addr);
> >>>> +	for (; addr < end; addr = next, pud++) {
> >>>> +		next = pud_addr_end(addr, end);
> >>>> +
> >>>> +		if (!pud_present(*pud))
> >>>> +			continue;
> >>>> +
> >>>> +		if (pud_large(*pud)) {
> >>>> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> >>>> +			    IS_ALIGNED(next, PUD_SIZE)) {
> >>>> +				if (!direct) {
> >>>> +					free_pagetable(pud_page(*pud),
> >>>> +						       get_order(PUD_SIZE));
> >>>> +					pages++;
> >>>> +				}
> >>>> +
> >>>> +				spin_lock(&init_mm.page_table_lock);
> >>>> +				pud_clear(pud);
> >>>> +				spin_unlock(&init_mm.page_table_lock);
> >>>> +				continue;
> >>>> +			}
> >>>> +
> >>>> +			/*
> >>>> +			 * We use 1G page, but we need to remove part of them,
> >>>> +			 * so split 1G page to 2M page.
> >>>> +			 */
> >>>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> >>>> +			BUG_ON(!pmd_base);
> >>>> +			__split_large_page((pte_t *)pud, addr,
> >>>> +					   (pte_t *)pmd_base);
> >>>> +
> >>>> +			spin_lock(&init_mm.page_table_lock);
> >>>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> >>>> +			spin_unlock(&init_mm.page_table_lock);
> >>>> +
> >>>> +			flush_tlb_all();
> >>>> +		}
> >>>> +
> >>>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> >>>> +		remove_pmd_table(pmd_base, addr, next, direct);
> >>>> +		free_pmd_table(pmd_base, pud);
> >>>> +		unmap_low_page(pmd_base);
> >>>> +	}
> >>>> +
> >>>> +	if (direct)
> >>>> +		update_page_count(PG_LEVEL_1G, -pages);
> >>>> +}
> >>>> +
> >>>> +/* start and end are both virtual address. */
> >>>> +static void __meminit
> >>>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> >>>> +{
> >>>> +	unsigned long next;
> >>>> +	pgd_t *pgd;
> >>>> +	pud_t *pud;
> >>>> +	bool pgd_changed = false;
> >>>> +
> >>>> +	for (; start < end; start = next) {
> >>>> +		pgd = pgd_offset_k(start);
> >>>> +		if (!pgd_present(*pgd))
> >>>> +			continue;
> >>>> +
> >>>> +		next = pgd_addr_end(start, end);
> >>>> +
> >>>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> >>>> +		remove_pud_table(pud, start, next, direct);
> >>>> +		if (free_pud_table(pud, pgd))
> >>>> +			pgd_changed = true;
> >>>> +		unmap_low_page(pud);
> >>>> +	}
> >>>> +
> >>>> +	if (pgd_changed)
> >>>> +		sync_global_pgds(start, end - 1);
> >>>> +
> >>>> +	flush_tlb_all();
> >>>> +}
> >>>> +
> >>>>    #ifdef CONFIG_MEMORY_HOTREMOVE
> >>>>    int __ref arch_remove_memory(u64 start, u64 size)
> >>>>    {
> >>>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> >>>> index a718e0d..7dcb6f9 100644
> >>>> --- a/arch/x86/mm/pageattr.c
> >>>> +++ b/arch/x86/mm/pageattr.c
> >>>> @@ -501,21 +501,13 @@ out_unlock:
> >>>>    	return do_split;
> >>>>    }
> >>>>
> >>>> -static int split_large_page(pte_t *kpte, unsigned long address)
> >>>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> >>>>    {
> >>>>    	unsigned long pfn, pfninc = 1;
> >>>>    	unsigned int i, level;
> >>>> -	pte_t *pbase, *tmp;
> >>>> +	pte_t *tmp;
> >>>>    	pgprot_t ref_prot;
> >>>> -	struct page *base;
> >>>> -
> >>>> -	if (!debug_pagealloc)
> >>>> -		spin_unlock(&cpa_lock);
> >>>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >>>> -	if (!debug_pagealloc)
> >>>> -		spin_lock(&cpa_lock);
> >>>> -	if (!base)
> >>>> -		return -ENOMEM;
> >>>> +	struct page *base = virt_to_page(pbase);
> >>>>
> >>>>    	spin_lock(&pgd_lock);
> >>>>    	/*
> >>>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>>>    	 * up for us already:
> >>>>    	 */
> >>>>    	tmp = lookup_address(address, &level);
> >>>> -	if (tmp != kpte)
> >>>> -		goto out_unlock;
> >>>> +	if (tmp != kpte) {
> >>>> +		spin_unlock(&pgd_lock);
> >>>> +		return 1;
> >>>> +	}
> >>>>
> >>>> -	pbase = (pte_t *)page_address(base);
> >>>>    	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> >>>>    	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> >>>>    	/*
> >>>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>>>    	 * going on.
> >>>>    	 */
> >>>>    	__flush_tlb_all();
> >>>> +	spin_unlock(&pgd_lock);
> >>>>
> >>>> -	base = NULL;
> >>>> +	return 0;
> >>>> +}
> >>>>
> >>>> -out_unlock:
> >>>> -	/*
> >>>> -	 * If we dropped out via the lookup_address check under
> >>>> -	 * pgd_lock then stick the page back into the pool:
> >>>> -	 */
> >>>> -	if (base)
> >>>> +static int split_large_page(pte_t *kpte, unsigned long address)
> >>>> +{
> >>>> +	pte_t *pbase;
> >>>> +	struct page *base;
> >>>> +
> >>>> +	if (!debug_pagealloc)
> >>>> +		spin_unlock(&cpa_lock);
> >>>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >>>> +	if (!debug_pagealloc)
> >>>> +		spin_lock(&cpa_lock);
> >>>> +	if (!base)
> >>>> +		return -ENOMEM;
> >>>> +
> >>>> +	pbase = (pte_t *)page_address(base);
> >>>> +	if (__split_large_page(kpte, address, pbase))
> >>>>    		__free_page(base);
> >>>> -	spin_unlock(&pgd_lock);
> >>>>
> >>>>    	return 0;
> >>>>    }
> >>>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> >>>> index 3f778c2..190ff06 100644
> >>>> --- a/include/linux/bootmem.h
> >>>> +++ b/include/linux/bootmem.h
> >>>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
> >>>>    			      unsigned long size);
> >>>>    extern void free_bootmem(unsigned long physaddr, unsigned long size);
> >>>>    extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> >>>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
> >>>>
> >>>>    /*
> >>>>     * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
> >>>
> >>>

^ permalink raw reply

* Re: [PATCH] powerpc/mm: Fix hash computation function
From: Mike Qiu @ 2013-01-30  7:23 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: paulus, Aneesh Kumar K.V
In-Reply-To: <1359524442-5861-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

With the fix, the machine boots up successfully.

Tested-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
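
For anyone reading along, the hash formulas spelled out in the comments
of the patch below can be written out in plain C roughly as follows.
This is a userspace model built only from those comments and from the
masks in the removed lines; the function names are mine, and the real
code goes on to mask the result with the hash table mask, which is left
out here.

```c
#include <assert.h>
#include <stdint.h>

/* 256M segment: hash = vsid ^ ((va >> shift) & ((1ul << (28 - shift)) - 1)) */
static inline uint64_t hash_256m(uint64_t vsid, uint64_t va, unsigned int shift)
{
	assert(shift < 28);
	return vsid ^ ((va >> shift) & ((UINT64_C(1) << (28 - shift)) - 1));
}

/* 1T segment after the fix: vsid ^ (vsid << 25) ^ low bits of the va */
static inline uint64_t hash_1t_fixed(uint64_t vsid, uint64_t va, unsigned int shift)
{
	uint64_t low;

	assert(shift < 40);
	low = (va >> shift) & ((UINT64_C(1) << (40 - shift)) - 1);
	return (vsid << 25) ^ vsid ^ low;
}

/*
 * 1T segment before the fix, following the removed lines:
 * (vsid << 25) & 0x7fffffffff, and vsid & 0xffffff, so any vsid
 * bit above bit 23 never reaches the unshifted term.
 */
static inline uint64_t hash_1t_old(uint64_t vsid, uint64_t va, unsigned int shift)
{
	uint64_t low;

	assert(shift < 40);
	low = (va >> shift) & ((UINT64_C(1) << (40 - shift)) - 1);
	return ((vsid << 25) & UINT64_C(0x7fffffffff)) ^
	       (vsid & UINT64_C(0xffffff)) ^ low;
}
```

With a made-up vsid of 0x08000001 and va of 0x1000 (4K pages, shift 12),
the two 1T variants differ in bit 27 once both are masked to 39 bits.
That is the same bit that differs between the two hash values in the
log below, though I have not checked that the logged fault went through
the 1T path.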

On 2013/1/30 13:40, Aneesh Kumar K.V wrote:
> From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>
>
> The ASM version of hash computation function was truncating the upper bit.
> Make the ASM version similar to hpt_hash function. Remove masking vsid bits.
> Without this patch, we observed hang during bootup due to not satisfying page
> fault request correctly. The fault handler used wrong hash values to update
> the HPTE. Hence we kept looping with page fault.
>
> hash_page(ea=000001003e260008, access=203, trap=300 ip=3fff91787134 dsisr 42000000
> The computed value of hash 000000000f22f390
> update: avpnv=4003e46054003e00, hash=000000000722f390, f=80000006, psize: 2 ...
>
> Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
> Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
> ---
>  arch/powerpc/mm/hash_low_64.S |   62 +++++++++++++++++++++++------------------
>  1 file changed, 35 insertions(+), 27 deletions(-)
>
> diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
> index 5658508..7443481 100644
> --- a/arch/powerpc/mm/hash_low_64.S
> +++ b/arch/powerpc/mm/hash_low_64.S
> @@ -115,11 +115,13 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
>  	sldi	r29,r5,SID_SHIFT - VPN_SHIFT
>  	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT - VPN_SHIFT)
>  	or	r29,r28,r29
> -
> -	/* Calculate hash value for primary slot and store it in r28 */
> -	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
> -	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
> -	xor	r28,r5,r0
> +	/*
> +	 * Calculate hash value for primary slot and store it in r28
> +	 * r3 = va, r5 = vsid
> +	 * r0 = (va >> 12) & ((1ul << (28 - 12)) -1)
> +	 */
> +	rldicl	r0,r3,64-12,48
> +	xor	r28,r5,r0		/* hash */
>  	b	4f
>
>  3:	/* Calc vpn and put it in r29 */
> @@ -130,11 +132,12 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
>  	/*
>  	 * calculate hash value for primary slot and
>  	 * store it in r28 for 1T segment
> +	 * r3 = va, r5 = vsid
>  	 */
> -	rldic	r28,r5,25,25		/* (vsid << 25) & 0x7fffffffff */
> -	clrldi	r5,r5,40		/* vsid & 0xffffff */
> -	rldicl	r0,r3,64-12,36		/* (ea >> 12) & 0xfffffff */
> -	xor	r28,r28,r5
> +	sldi	r28,r5,25		/* vsid << 25 */
> +	/* r0 =  (va >> 12) & ((1ul << (40 - 12)) -1) */
> +	rldicl	r0,r3,64-12,36
> +	xor	r28,r28,r5		/* vsid ^ ( vsid << 25) */
>  	xor	r28,r28,r0		/* hash */
>
>  	/* Convert linux PTE bits into HW equivalents */
> @@ -407,11 +410,13 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
>  	 */
>  	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT - VPN_SHIFT)
>  	or	r29,r28,r29
> -
> -	/* Calculate hash value for primary slot and store it in r28 */
> -	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
> -	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
> -	xor	r28,r5,r0
> +	/*
> +	 * Calculate hash value for primary slot and store it in r28
> +	 * r3 = va, r5 = vsid
> +	 * r0 = (va >> 12) & ((1ul << (28 - 12)) -1)
> +	 */
> +	rldicl	r0,r3,64-12,48
> +	xor	r28,r5,r0		/* hash */
>  	b	4f
>
>  3:	/* Calc vpn and put it in r29 */
> @@ -426,11 +431,12 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
>  	/*
>  	 * Calculate hash value for primary slot and
>  	 * store it in r28  for 1T segment
> +	 * r3 = va, r5 = vsid
>  	 */
> -	rldic	r28,r5,25,25		/* (vsid << 25) & 0x7fffffffff */
> -	clrldi	r5,r5,40		/* vsid & 0xffffff */
> -	rldicl	r0,r3,64-12,36		/* (ea >> 12) & 0xfffffff */
> -	xor	r28,r28,r5
> +	sldi	r28,r5,25		/* vsid << 25 */
> +	/* r0 = (va >> 12) & ((1ul << (40 - 12)) -1) */
> +	rldicl	r0,r3,64-12,36
> +	xor	r28,r28,r5		/* vsid ^ ( vsid << 25) */
>  	xor	r28,r28,r0		/* hash */
>
>  	/* Convert linux PTE bits into HW equivalents */
> @@ -752,25 +758,27 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
>  	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT - VPN_SHIFT)
>  	or	r29,r28,r29
>
> -	/* Calculate hash value for primary slot and store it in r28 */
> -	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
> -	rldicl	r0,r3,64-16,52		/* (ea >> 16) & 0xfff */
> -	xor	r28,r5,r0
> +	/* Calculate hash value for primary slot and store it in r28
> +	 * r3 = va, r5 = vsid
> +	 * r0 = (va >> 16) & ((1ul << (28 - 16)) -1)
> +	 */
> +	rldicl	r0,r3,64-16,52
> +	xor	r28,r5,r0		/* hash */
>  	b	4f
>
>  3:	/* Calc vpn and put it in r29 */
>  	sldi	r29,r5,SID_SHIFT_1T - VPN_SHIFT
>  	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT_1T - VPN_SHIFT)
>  	or	r29,r28,r29
> -
>  	/*
>  	 * calculate hash value for primary slot and
>  	 * store it in r28 for 1T segment
> +	 * r3 = va, r5 = vsid
>  	 */
> -	rldic	r28,r5,25,25		/* (vsid << 25) & 0x7fffffffff */
> -	clrldi	r5,r5,40		/* vsid & 0xffffff */
> -	rldicl	r0,r3,64-16,40		/* (ea >> 16) & 0xffffff */
> -	xor	r28,r28,r5
> +	sldi	r28,r5,25		/* vsid << 25 */
> +	/* r0 = (va >> 16) & ((1ul << (40 - 16)) -1) */
> +	rldicl	r0,r3,64-16,40
> +	xor	r28,r28,r5		/* vsid ^ ( vsid << 25) */
>  	xor	r28,r28,r0		/* hash */
>
>  	/* Convert linux PTE bits into HW equivalents */


* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Tang Chen @ 2013-01-30  5:55 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359516425.1288.5.camel@kernel>

On 01/30/2013 11:27 AM, Simon Jeons wrote:
> On Wed, 2013-01-30 at 10:16 +0800, Tang Chen wrote:
>> On 01/29/2013 09:04 PM, Simon Jeons wrote:
>>> Hi Tang,
>>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>>>> From: Wen Congyang<wency@cn.fujitsu.com>
>>>>
>>>> When memory is removed, the corresponding pagetables should also be removed.
>>>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
>>>> architecture pagetable removing.
>>>
>>> Why don't we need to call build_all_zonelists during the hot-add path
>>> (add_memory), as online_pages does?
>>
>> Hi Simon,
>>
>> As you said, build_all_zonelists is done by online_pages. When the memory
>> device is hot-added, we cannot use it yet. We can only use it when we online
>> the pages on it.
>
> Why?
>
> If a node has just one memory device and its memory is small, some zones, such
> as zone_highmem, will not be present. If another memory device is then
> hot-added and zone_highmem appears, shouldn't build_all_zonelists be called
> this time?

Hi Simon,

We build the zone list when the first memory on the node is hot-added.

add_memory()
  |-->if (!node_online(nid)) hotadd_new_pgdat()
                              |-->free_area_init_node()
                              |-->build_all_zonelists()

All the zones on the new node are initialized as empty, and this is where we
build the zone list.

But this call actually does nothing, because no page is online yet and the
zones are empty: in build_zonelists_node(), populated_zone(zone) is always
false.

The real work of building the zone list happens when pages are onlined. :)
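
As a toy illustration of the point above (not kernel code), the sketch below
models why building the zone list before any page is online produces nothing.
The names toy_zone, toy_populated_zone, toy_build_zonelist and
toy_online_pages are hypothetical stand-ins; the only idea taken from the
discussion is that a zone without online pages is skipped.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-in for a memory zone. */
struct toy_zone {
	const char *name;
	unsigned long present_pages;	/* pages currently online */
};

/* Same idea as populated_zone(): a zone counts only if it has pages. */
int toy_populated_zone(const struct toy_zone *z)
{
	return z->present_pages != 0;
}

/*
 * Collect every populated zone into list[] and return how many entries
 * were written. With only empty zones this returns 0, i.e. the build is
 * effectively a no-op.
 */
size_t toy_build_zonelist(struct toy_zone *zones, size_t nr,
			  struct toy_zone **list)
{
	size_t n = 0;

	assert(zones != NULL && list != NULL);
	for (size_t i = 0; i < nr; i++)
		if (toy_populated_zone(&zones[i]))
			list[n++] = &zones[i];
	return n;
}

/* Onlining pages is the step that makes a rebuild worthwhile. */
void toy_online_pages(struct toy_zone *z, unsigned long nr_pages)
{
	z->present_pages += nr_pages;
}
```

With two empty zones the list stays empty; after pages are onlined into one
zone, a rebuild picks up exactly that zone.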


In your example, a small amount of memory is already there and zone_normal is
present. When those pages were onlined (not just added), the zone list was
rebuilt. The pages in zone_highmem have not been added yet, which means they
are not onlined, so we don't need to build a zone list for them. Later, when
the zone_highmem pages are added, we still don't rebuild the zone list,
because the real rebuilding happens when the pages are onlined.

I think this is the current logic. :)

Thanks. :)

>
>>
>> But we can online the pages as different types, kernel or movable (which
>> belong to different zones), and we can online part of the memory, not all of
>> it. So each time we online some pages, we should check whether we need to
>> update the zone list.
>>
>> So I think that is why we do build_all_zonelists when online_pages.
>> (just my opinion)
>>
>> Thanks. :)
>>
>>>
>>>>
>>>> Not all pages of the virtual mapping in removed memory can be freed, because
>>>> some pages used as PGD/PUD cover not only the removed memory but also other
>>>> memory. So the patch uses the following way to check whether a page can be
>>>> freed or not.
>>>>
>>>>    1. When removing memory, the page structs of the removed memory are filled
>>>>       with 0xFD.
>>>>    2. If all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can be
>>>>       cleared. In this case, the page used as PT/PMD can be freed.
>>>>
>>>> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
>>>> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
>>>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>>>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>>>> ---
>>>>    arch/x86/include/asm/pgtable_types.h |    1 +
>>>>    arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>>>>    arch/x86/mm/pageattr.c               |   47 +++---
>>>>    include/linux/bootmem.h              |    1 +
>>>>    4 files changed, 326 insertions(+), 22 deletions(-)
>>>>
>>>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>>>> index 3c32db8..4b6fd2a 100644
>>>> --- a/arch/x86/include/asm/pgtable_types.h
>>>> +++ b/arch/x86/include/asm/pgtable_types.h
>>>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>>>>     * as a pte too.
>>>>     */
>>>>    extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>>>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>>>
>>>>    #endif	/* !__ASSEMBLY__ */
>>>>
>>>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>>>> index 9ac1723..fe01116 100644
>>>> --- a/arch/x86/mm/init_64.c
>>>> +++ b/arch/x86/mm/init_64.c
>>>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>>>    }
>>>>    EXPORT_SYMBOL_GPL(arch_add_memory);
>>>>
>>>> +#define PAGE_INUSE 0xFD
>>>> +
>>>> +static void __meminit free_pagetable(struct page *page, int order)
>>>> +{
>>>> +	struct zone *zone;
>>>> +	bool bootmem = false;
>>>> +	unsigned long magic;
>>>> +	unsigned int nr_pages = 1 << order;
>>>> +
>>>> +	/* bootmem page has reserved flag */
>>>> +	if (PageReserved(page)) {
>>>> +		__ClearPageReserved(page);
>>>> +		bootmem = true;
>>>> +
>>>> +		magic = (unsigned long)page->lru.next;
>>>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>>>> +			while (nr_pages--)
>>>> +				put_page_bootmem(page++);
>>>> +		} else
>>>> +			__free_pages_bootmem(page, order);
>>>> +	} else
>>>> +		free_pages((unsigned long)page_address(page), order);
>>>> +
>>>> +	/*
>>>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
>>>> +	 * are all allocated by bootmem.
>>>> +	 */
>>>> +	if (bootmem) {
>>>> +		zone = page_zone(page);
>>>> +		zone_span_writelock(zone);
>>>> +		zone->present_pages += nr_pages;
>>>> +		zone_span_writeunlock(zone);
>>>> +		totalram_pages += nr_pages;
>>>> +	}
>>>> +}
>>>> +
>>>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>>>> +{
>>>> +	pte_t *pte;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>>>> +		pte = pte_start + i;
>>>> +		if (pte_val(*pte))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	/* free a pte table */
>>>> +	free_pagetable(pmd_page(*pmd), 0);
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>> +	pmd_clear(pmd);
>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>> +}
>>>> +
>>>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>>>> +{
>>>> +	pmd_t *pmd;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>>>> +		pmd = pmd_start + i;
>>>> +		if (pmd_val(*pmd))
>>>> +			return;
>>>> +	}
>>>> +
>>>> +	/* free a pmd table */
>>>> +	free_pagetable(pud_page(*pud), 0);
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>> +	pud_clear(pud);
>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>> +}
>>>> +
>>>> +/* Return true if pgd is changed, otherwise return false. */
>>>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
>>>> +{
>>>> +	pud_t *pud;
>>>> +	int i;
>>>> +
>>>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>>>> +		pud = pud_start + i;
>>>> +		if (pud_val(*pud))
>>>> +			return false;
>>>> +	}
>>>> +
>>>> +	/* free a pud table */
>>>> +	free_pagetable(pgd_page(*pgd), 0);
>>>> +	spin_lock(&init_mm.page_table_lock);
>>>> +	pgd_clear(pgd);
>>>> +	spin_unlock(&init_mm.page_table_lock);
>>>> +
>>>> +	return true;
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>>>> +		 bool direct)
>>>> +{
>>>> +	unsigned long next, pages = 0;
>>>> +	pte_t *pte;
>>>> +	void *page_addr;
>>>> +	phys_addr_t phys_addr;
>>>> +
>>>> +	pte = pte_start + pte_index(addr);
>>>> +	for (; addr < end; addr = next, pte++) {
>>>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
>>>> +		if (next > end)
>>>> +			next = end;
>>>> +
>>>> +		if (!pte_present(*pte))
>>>> +			continue;
>>>> +
>>>> +		/*
>>>> +		 * We mapped [0,1G) memory as identity mapping when
>>>> +		 * initializing, in arch/x86/kernel/head_64.S. These
>>>> +		 * pagetables cannot be removed.
>>>> +		 */
>>>> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
>>>> +		if (phys_addr < (phys_addr_t)0x40000000)
>>>> +			return;
>>>> +
>>>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
>>>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
>>>> +			if (!direct) {
>>>> +				free_pagetable(pte_page(*pte), 0);
>>>> +				pages++;
>>>> +			}
>>>> +
>>>> +			spin_lock(&init_mm.page_table_lock);
>>>> +			pte_clear(&init_mm, addr, pte);
>>>> +			spin_unlock(&init_mm.page_table_lock);
>>>> +		} else {
>>>> +			/*
>>>> +			 * If we are not removing the whole page, it means
>>>> +			 * other ptes in this page are being used and we cannot
>>>> +			 * remove them. So fill the unused ptes with 0xFD, and
>>>> +			 * remove the page when it is wholly filled with 0xFD.
>>>> +			 */
>>>> +			memset((void *)addr, PAGE_INUSE, next - addr);
>>>> +			page_addr = page_address(pte_page(*pte));
>>>> +
>>>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>>>> +				free_pagetable(pte_page(*pte), 0);
>>>> +				pages++;
>>>> +
>>>> +				spin_lock(&init_mm.page_table_lock);
>>>> +				pte_clear(&init_mm, addr, pte);
>>>> +				spin_unlock(&init_mm.page_table_lock);
>>>> +			}
>>>> +		}
>>>> +	}
>>>> +
>>>> +	/* Call free_pte_table() in remove_pmd_table(). */
>>>> +	flush_tlb_all();
>>>> +	if (direct)
>>>> +		update_page_count(PG_LEVEL_4K, -pages);
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>>>> +		 bool direct)
>>>> +{
>>>> +	unsigned long pte_phys, next, pages = 0;
>>>> +	pte_t *pte_base;
>>>> +	pmd_t *pmd;
>>>> +
>>>> +	pmd = pmd_start + pmd_index(addr);
>>>> +	for (; addr < end; addr = next, pmd++) {
>>>> +		next = pmd_addr_end(addr, end);
>>>> +
>>>> +		if (!pmd_present(*pmd))
>>>> +			continue;
>>>> +
>>>> +		if (pmd_large(*pmd)) {
>>>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
>>>> +			    IS_ALIGNED(next, PMD_SIZE)) {
>>>> +				if (!direct) {
>>>> +					free_pagetable(pmd_page(*pmd),
>>>> +						       get_order(PMD_SIZE));
>>>> +					pages++;
>>>> +				}
>>>> +
>>>> +				spin_lock(&init_mm.page_table_lock);
>>>> +				pmd_clear(pmd);
>>>> +				spin_unlock(&init_mm.page_table_lock);
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			/*
>>>> +			 * We use 2M page, but we need to remove part of them,
>>>> +			 * so split 2M page to 4K page.
>>>> +			 */
>>>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
>>>> +			BUG_ON(!pte_base);
>>>> +			__split_large_page((pte_t *)pmd, addr,
>>>> +					   (pte_t *)pte_base);
>>>> +
>>>> +			spin_lock(&init_mm.page_table_lock);
>>>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>>>> +			spin_unlock(&init_mm.page_table_lock);
>>>> +
>>>> +			flush_tlb_all();
>>>> +		}
>>>> +
>>>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>>>> +		remove_pte_table(pte_base, addr, next, direct);
>>>> +		free_pte_table(pte_base, pmd);
>>>> +		unmap_low_page(pte_base);
>>>> +	}
>>>> +
>>>> +	/* Call free_pmd_table() in remove_pud_table(). */
>>>> +	if (direct)
>>>> +		update_page_count(PG_LEVEL_2M, -pages);
>>>> +}
>>>> +
>>>> +static void __meminit
>>>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
>>>> +		 bool direct)
>>>> +{
>>>> +	unsigned long pmd_phys, next, pages = 0;
>>>> +	pmd_t *pmd_base;
>>>> +	pud_t *pud;
>>>> +
>>>> +	pud = pud_start + pud_index(addr);
>>>> +	for (; addr < end; addr = next, pud++) {
>>>> +		next = pud_addr_end(addr, end);
>>>> +
>>>> +		if (!pud_present(*pud))
>>>> +			continue;
>>>> +
>>>> +		if (pud_large(*pud)) {
>>>> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
>>>> +			    IS_ALIGNED(next, PUD_SIZE)) {
>>>> +				if (!direct) {
>>>> +					free_pagetable(pud_page(*pud),
>>>> +						       get_order(PUD_SIZE));
>>>> +					pages++;
>>>> +				}
>>>> +
>>>> +				spin_lock(&init_mm.page_table_lock);
>>>> +				pud_clear(pud);
>>>> +				spin_unlock(&init_mm.page_table_lock);
>>>> +				continue;
>>>> +			}
>>>> +
>>>> +			/*
>>>> +			 * We use 1G page, but we need to remove part of them,
>>>> +			 * so split 1G page to 2M page.
>>>> +			 */
>>>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
>>>> +			BUG_ON(!pmd_base);
>>>> +			__split_large_page((pte_t *)pud, addr,
>>>> +					   (pte_t *)pmd_base);
>>>> +
>>>> +			spin_lock(&init_mm.page_table_lock);
>>>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
>>>> +			spin_unlock(&init_mm.page_table_lock);
>>>> +
>>>> +			flush_tlb_all();
>>>> +		}
>>>> +
>>>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
>>>> +		remove_pmd_table(pmd_base, addr, next, direct);
>>>> +		free_pmd_table(pmd_base, pud);
>>>> +		unmap_low_page(pmd_base);
>>>> +	}
>>>> +
>>>> +	if (direct)
>>>> +		update_page_count(PG_LEVEL_1G, -pages);
>>>> +}
>>>> +
>>>> +/* start and end are both virtual address. */
>>>> +static void __meminit
>>>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
>>>> +{
>>>> +	unsigned long next;
>>>> +	pgd_t *pgd;
>>>> +	pud_t *pud;
>>>> +	bool pgd_changed = false;
>>>> +
>>>> +	for (; start < end; start = next) {
>>>> +		pgd = pgd_offset_k(start);
>>>> +		if (!pgd_present(*pgd))
>>>> +			continue;
>>>> +
>>>> +		next = pgd_addr_end(start, end);
>>>> +
>>>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>>>> +		remove_pud_table(pud, start, next, direct);
>>>> +		if (free_pud_table(pud, pgd))
>>>> +			pgd_changed = true;
>>>> +		unmap_low_page(pud);
>>>> +	}
>>>> +
>>>> +	if (pgd_changed)
>>>> +		sync_global_pgds(start, end - 1);
>>>> +
>>>> +	flush_tlb_all();
>>>> +}
>>>> +
>>>>    #ifdef CONFIG_MEMORY_HOTREMOVE
>>>>    int __ref arch_remove_memory(u64 start, u64 size)
>>>>    {
>>>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>>>> index a718e0d..7dcb6f9 100644
>>>> --- a/arch/x86/mm/pageattr.c
>>>> +++ b/arch/x86/mm/pageattr.c
>>>> @@ -501,21 +501,13 @@ out_unlock:
>>>>    	return do_split;
>>>>    }
>>>>
>>>> -static int split_large_page(pte_t *kpte, unsigned long address)
>>>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>>>    {
>>>>    	unsigned long pfn, pfninc = 1;
>>>>    	unsigned int i, level;
>>>> -	pte_t *pbase, *tmp;
>>>> +	pte_t *tmp;
>>>>    	pgprot_t ref_prot;
>>>> -	struct page *base;
>>>> -
>>>> -	if (!debug_pagealloc)
>>>> -		spin_unlock(&cpa_lock);
>>>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>>> -	if (!debug_pagealloc)
>>>> -		spin_lock(&cpa_lock);
>>>> -	if (!base)
>>>> -		return -ENOMEM;
>>>> +	struct page *base = virt_to_page(pbase);
>>>>
>>>>    	spin_lock(&pgd_lock);
>>>>    	/*
>>>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>>>    	 * up for us already:
>>>>    	 */
>>>>    	tmp = lookup_address(address, &level);
>>>> -	if (tmp != kpte)
>>>> -		goto out_unlock;
>>>> +	if (tmp != kpte) {
>>>> +		spin_unlock(&pgd_lock);
>>>> +		return 1;
>>>> +	}
>>>>
>>>> -	pbase = (pte_t *)page_address(base);
>>>>    	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>>>    	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>>>    	/*
>>>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>>>    	 * going on.
>>>>    	 */
>>>>    	__flush_tlb_all();
>>>> +	spin_unlock(&pgd_lock);
>>>>
>>>> -	base = NULL;
>>>> +	return 0;
>>>> +}
>>>>
>>>> -out_unlock:
>>>> -	/*
>>>> -	 * If we dropped out via the lookup_address check under
>>>> -	 * pgd_lock then stick the page back into the pool:
>>>> -	 */
>>>> -	if (base)
>>>> +static int split_large_page(pte_t *kpte, unsigned long address)
>>>> +{
>>>> +	pte_t *pbase;
>>>> +	struct page *base;
>>>> +
>>>> +	if (!debug_pagealloc)
>>>> +		spin_unlock(&cpa_lock);
>>>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>>>> +	if (!debug_pagealloc)
>>>> +		spin_lock(&cpa_lock);
>>>> +	if (!base)
>>>> +		return -ENOMEM;
>>>> +
>>>> +	pbase = (pte_t *)page_address(base);
>>>> +	if (__split_large_page(kpte, address, pbase))
>>>>    		__free_page(base);
>>>> -	spin_unlock(&pgd_lock);
>>>>
>>>>    	return 0;
>>>>    }
>>>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
>>>> index 3f778c2..190ff06 100644
>>>> --- a/include/linux/bootmem.h
>>>> +++ b/include/linux/bootmem.h
>>>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>>>>    			      unsigned long size);
>>>>    extern void free_bootmem(unsigned long physaddr, unsigned long size);
>>>>    extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
>>>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>>>>
>>>>    /*
>>>>     * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
>>>
>>>
>
>
>
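
For readers following the quoted patch, the 0xFD marking scheme it describes
can be sketched in userspace as below. This is only an illustration under
stated assumptions: toy_memchr_inv is a hypothetical stand-in for the kernel's
memchr_inv(), toy_mark_unused and TOY_PAGE_SIZE are made-up names, and a plain
byte buffer plays the role of the page holding page structs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define TOY_PAGE_SIZE	4096
#define PAGE_INUSE	0xFD	/* marker byte, as in the quoted patch */

/*
 * Userspace stand-in for memchr_inv(): return a pointer to the first byte
 * that is NOT equal to c, or NULL if all bytes match.
 */
const void *toy_memchr_inv(const void *start, int c, size_t bytes)
{
	const unsigned char *p = start;

	for (size_t i = 0; i < bytes; i++)
		if (p[i] != (unsigned char)c)
			return p + i;
	return NULL;
}

/*
 * Mark [off, off + len) of the page as no longer used by filling it with
 * the marker. Returns 1 when the whole page now consists of the marker,
 * i.e. when it would be safe to free it, and 0 otherwise.
 */
int toy_mark_unused(unsigned char *page, size_t off, size_t len)
{
	assert(off <= TOY_PAGE_SIZE && len <= TOY_PAGE_SIZE - off);
	memset(page + off, PAGE_INUSE, len);
	return toy_memchr_inv(page, PAGE_INUSE, TOY_PAGE_SIZE) == NULL;
}
```

Marking only half of a page reports that it must be kept; marking the other
half later reports that the page can now be freed.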


* [PATCH] powerpc/mm: Fix hash computation function
From: Aneesh Kumar K.V @ 2013-01-30  5:40 UTC (permalink / raw)
  To: benh, paulus; +Cc: Mike Qiu, linuxppc-dev, Aneesh Kumar K.V

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

The ASM version of the hash computation function was truncating the upper
bits of the vsid. Make the ASM version match the hpt_hash function by removing
the masking of the vsid bits. Without this patch, we observed a hang during
bootup because page fault requests were not satisfied correctly: the fault
handler used wrong hash values to update the HPTE, so we kept looping on the
same page fault.

hash_page(ea=000001003e260008, access=203, trap=300 ip=3fff91787134 dsisr 42000000
The computed value of hash 000000000f22f390
update: avpnv=4003e46054003e00, hash=000000000722f390, f=80000006, psize: 2 ...
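
For illustration, the hash formulas spelled out in the new comments of this
patch can be written in C. This is a userspace sketch, assuming segment shifts
of 28 (256M) and 40 (1T) as the comments in the diff suggest; the function
names are hypothetical, and hash_1t_old only mimics the masking done by the
removed rldic/clrldi pair.

```c
#include <assert.h>
#include <stdint.h>

#define SID_SHIFT_256M	28	/* assumed: 256M segments */
#define SID_SHIFT_1T	40	/* assumed: 1T segments */

/* 256M segment: hash = vsid ^ ((va >> shift) & ((1 << (28 - shift)) - 1)) */
uint64_t hash_256m(uint64_t va, uint64_t vsid, unsigned int shift)
{
	uint64_t mask;

	assert(shift < SID_SHIFT_256M);
	mask = (UINT64_C(1) << (SID_SHIFT_256M - shift)) - 1;
	return vsid ^ ((va >> shift) & mask);
}

/* 1T segment, as after the patch: the full vsid takes part in the XOR. */
uint64_t hash_1t(uint64_t va, uint64_t vsid, unsigned int shift)
{
	uint64_t mask;

	assert(shift < SID_SHIFT_1T);
	mask = (UINT64_C(1) << (SID_SHIFT_1T - shift)) - 1;
	return vsid ^ (vsid << 25) ^ ((va >> shift) & mask);
}

/* 1T segment, as before the patch: vsid is masked before the XOR. */
uint64_t hash_1t_old(uint64_t va, uint64_t vsid, unsigned int shift)
{
	uint64_t mask, hi, lo;

	assert(shift < SID_SHIFT_1T);
	mask = (UINT64_C(1) << (SID_SHIFT_1T - shift)) - 1;
	hi = (vsid << 25) & UINT64_C(0x7fffffffff);	/* rldic r28,r5,25,25 */
	lo = vsid & UINT64_C(0xffffff);			/* clrldi r5,r5,40 */
	return hi ^ lo ^ ((va >> shift) & mask);
}
```

In this model, a vsid with bit 27 set makes hash_1t and hash_1t_old differ in
bit 27 (0x08000000) of the low 39 bits. The two hash values in the log above
differ in that same bit position, which is consistent with the truncation the
patch removes.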

Reported-by: Mike Qiu <qiudayu@linux.vnet.ibm.com>
Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/mm/hash_low_64.S |   62 +++++++++++++++++++++++------------------
 1 file changed, 35 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index 5658508..7443481 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -115,11 +115,13 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
 	sldi	r29,r5,SID_SHIFT - VPN_SHIFT
 	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT - VPN_SHIFT)
 	or	r29,r28,r29
-
-	/* Calculate hash value for primary slot and store it in r28 */
-	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
-	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
-	xor	r28,r5,r0
+	/*
+	 * Calculate hash value for primary slot and store it in r28
+	 * r3 = va, r5 = vsid
+	 * r0 = (va >> 12) & ((1ul << (28 - 12)) -1)
+	 */
+	rldicl	r0,r3,64-12,48
+	xor	r28,r5,r0		/* hash */
 	b	4f
 
 3:	/* Calc vpn and put it in r29 */
@@ -130,11 +132,12 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
 	/*
 	 * calculate hash value for primary slot and
 	 * store it in r28 for 1T segment
+	 * r3 = va, r5 = vsid
 	 */
-	rldic	r28,r5,25,25		/* (vsid << 25) & 0x7fffffffff */
-	clrldi	r5,r5,40		/* vsid & 0xffffff */
-	rldicl	r0,r3,64-12,36		/* (ea >> 12) & 0xfffffff */
-	xor	r28,r28,r5
+	sldi	r28,r5,25		/* vsid << 25 */
+	/* r0 =  (va >> 12) & ((1ul << (40 - 12)) -1) */
+	rldicl	r0,r3,64-12,36
+	xor	r28,r28,r5		/* vsid ^ ( vsid << 25) */
 	xor	r28,r28,r0		/* hash */
 
 	/* Convert linux PTE bits into HW equivalents */
@@ -407,11 +410,13 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
 	 */
 	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT - VPN_SHIFT)
 	or	r29,r28,r29
-
-	/* Calculate hash value for primary slot and store it in r28 */
-	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
-	rldicl	r0,r3,64-12,48		/* (ea >> 12) & 0xffff */
-	xor	r28,r5,r0
+	/*
+	 * Calculate hash value for primary slot and store it in r28
+	 * r3 = va, r5 = vsid
+	 * r0 = (va >> 12) & ((1ul << (28 - 12)) -1)
+	 */
+	rldicl	r0,r3,64-12,48
+	xor	r28,r5,r0		/* hash */
 	b	4f
 
 3:	/* Calc vpn and put it in r29 */
@@ -426,11 +431,12 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
 	/*
 	 * Calculate hash value for primary slot and
 	 * store it in r28  for 1T segment
+	 * r3 = va, r5 = vsid
 	 */
-	rldic	r28,r5,25,25		/* (vsid << 25) & 0x7fffffffff */
-	clrldi	r5,r5,40		/* vsid & 0xffffff */
-	rldicl	r0,r3,64-12,36		/* (ea >> 12) & 0xfffffff */
-	xor	r28,r28,r5
+	sldi	r28,r5,25		/* vsid << 25 */
+	/* r0 = (va >> 12) & ((1ul << (40 - 12)) -1) */
+	rldicl	r0,r3,64-12,36
+	xor	r28,r28,r5		/* vsid ^ ( vsid << 25) */
 	xor	r28,r28,r0		/* hash */
 
 	/* Convert linux PTE bits into HW equivalents */
@@ -752,25 +758,27 @@ END_MMU_FTR_SECTION_IFSET(MMU_FTR_1T_SEGMENT)
 	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT - VPN_SHIFT)
 	or	r29,r28,r29
 
-	/* Calculate hash value for primary slot and store it in r28 */
-	rldicl	r5,r5,0,25		/* vsid & 0x0000007fffffffff */
-	rldicl	r0,r3,64-16,52		/* (ea >> 16) & 0xfff */
-	xor	r28,r5,r0
+	/* Calculate hash value for primary slot and store it in r28
+	 * r3 = va, r5 = vsid
+	 * r0 = (va >> 16) & ((1ul << (28 - 16)) -1)
+	 */
+	rldicl	r0,r3,64-16,52
+	xor	r28,r5,r0		/* hash */
 	b	4f
 
 3:	/* Calc vpn and put it in r29 */
 	sldi	r29,r5,SID_SHIFT_1T - VPN_SHIFT
 	rldicl  r28,r3,64 - VPN_SHIFT,64 - (SID_SHIFT_1T - VPN_SHIFT)
 	or	r29,r28,r29
-
 	/*
 	 * calculate hash value for primary slot and
 	 * store it in r28 for 1T segment
+	 * r3 = va, r5 = vsid
 	 */
-	rldic	r28,r5,25,25		/* (vsid << 25) & 0x7fffffffff */
-	clrldi	r5,r5,40		/* vsid & 0xffffff */
-	rldicl	r0,r3,64-16,40		/* (ea >> 16) & 0xffffff */
-	xor	r28,r28,r5
+	sldi	r28,r5,25		/* vsid << 25 */
+	/* r0 = (va >> 16) & ((1ul << (40 - 16)) -1) */
+	rldicl	r0,r3,64-16,40
+	xor	r28,r28,r5		/* vsid ^ ( vsid << 25) */
 	xor	r28,r28,r0		/* hash */
 
 	/* Convert linux PTE bits into HW equivalents */
-- 
1.7.10
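
To help read the rldicl operands in the diff above, a small C model of the
instruction may be useful. It follows the usual description of rldicl (rotate
left by SH, then clear the MB high-order bits); the helper names are
hypothetical, and this is a sketch that ignores the record form and condition
register updates.

```c
#include <assert.h>
#include <stdint.h>

/* Rotate a 64-bit value left by n bits (n in 0..63). */
static uint64_t rotl64(uint64_t x, unsigned int n)
{
	return n ? (x << n) | (x >> (64 - n)) : x;
}

/*
 * Model of "rldicl ra,rs,sh,mb": rotate rs left by sh bits, then clear
 * the mb most significant bits of the result.
 */
uint64_t rldicl(uint64_t rs, unsigned int sh, unsigned int mb)
{
	assert(sh < 64 && mb < 64);
	return rotl64(rs, sh) & (~UINT64_C(0) >> mb);
}

/* Model of "sldi ra,rs,n": plain logical shift left. */
uint64_t sldi(uint64_t rs, unsigned int n)
{
	assert(n < 64);
	return rs << n;
}
```

With this model, rldicl(va, 64-12, 48) yields (va >> 12) & 0xffff and
rldicl(va, 64-12, 36) yields (va >> 12) & 0xfffffff, which matches the
"(va >> 12) & ((1ul << (N - 12)) - 1)" form used in the new comments for
N = 28 and N = 40.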


* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Simon Jeons @ 2013-01-30  3:27 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <51088298.9080302@cn.fujitsu.com>

On Wed, 2013-01-30 at 10:16 +0800, Tang Chen wrote:
> On 01/29/2013 09:04 PM, Simon Jeons wrote:
> > Hi Tang,
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> From: Wen Congyang<wency@cn.fujitsu.com>
> >>
> >> When memory is removed, the corresponding pagetables should also be removed.
> >> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> >> architecture pagetable removing.
> >
> > Why don't we need to call build_all_zonelists during the hot-add path
> > (add_memory), as online_pages does?
> 
> Hi Simon,
> 
> As you said, build_all_zonelists is done by online_pages. When the memory
> device is hot-added, we cannot use it yet. We can only use it when we online
> the pages on it.

Why?

If a node has just one memory device and its memory is small, some zones, such
as zone_highmem, will not be present. If another memory device is then
hot-added and zone_highmem appears, shouldn't build_all_zonelists be called
this time?

> 
> But we can online the pages as different types, kernel or movable (which
> belong to different zones), and we can online part of the memory, not all of
> it. So each time we online some pages, we should check whether we need to
> update the zone list.
> 
> So I think that is why we do build_all_zonelists when online_pages.
> (just my opinion)
> 
> Thanks. :)
> 
> >
> >>
> >> Not all pages of the virtual mapping in removed memory can be freed, because
> >> some pages used as PGD/PUD cover not only the removed memory but also other
> >> memory. So the patch uses the following way to check whether a page can be
> >> freed or not.
> >>
> >>   1. When removing memory, the page structs of the removed memory are filled
> >>      with 0xFD.
> >>   2. If all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can be
> >>      cleared. In this case, the page used as PT/PMD can be freed.
> >>
> >> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
> >> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
> >> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
> >> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
> >> ---
> >>   arch/x86/include/asm/pgtable_types.h |    1 +
> >>   arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
> >>   arch/x86/mm/pageattr.c               |   47 +++---
> >>   include/linux/bootmem.h              |    1 +
> >>   4 files changed, 326 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> >> index 3c32db8..4b6fd2a 100644
> >> --- a/arch/x86/include/asm/pgtable_types.h
> >> +++ b/arch/x86/include/asm/pgtable_types.h
> >> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
> >>    * as a pte too.
> >>    */
> >>   extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> >> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
> >>
> >>   #endif	/* !__ASSEMBLY__ */
> >>
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 9ac1723..fe01116 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> >> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>   }
> >>   EXPORT_SYMBOL_GPL(arch_add_memory);
> >>
> >> +#define PAGE_INUSE 0xFD
> >> +
> >> +static void __meminit free_pagetable(struct page *page, int order)
> >> +{
> >> +	struct zone *zone;
> >> +	bool bootmem = false;
> >> +	unsigned long magic;
> >> +	unsigned int nr_pages = 1 << order;
> >> +
> >> +	/* bootmem page has reserved flag */
> >> +	if (PageReserved(page)) {
> >> +		__ClearPageReserved(page);
> >> +		bootmem = true;
> >> +
> >> +		magic = (unsigned long)page->lru.next;
> >> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> >> +			while (nr_pages--)
> >> +				put_page_bootmem(page++);
> >> +		} else
> >> +			__free_pages_bootmem(page, order);
> >> +	} else
> >> +		free_pages((unsigned long)page_address(page), order);
> >> +
> >> +	/*
> >> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> >> +	 * are all allocated by bootmem.
> >> +	 */
> >> +	if (bootmem) {
> >> +		zone = page_zone(page);
> >> +		zone_span_writelock(zone);
> >> +		zone->present_pages += nr_pages;
> >> +		zone_span_writeunlock(zone);
> >> +		totalram_pages += nr_pages;
> >> +	}
> >> +}
> >> +
> >> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> >> +{
> >> +	pte_t *pte;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> >> +		pte = pte_start + i;
> >> +		if (pte_val(*pte))
> >> +			return;
> >> +	}
> >> +
> >> +	/* free a pte table */
> >> +	free_pagetable(pmd_page(*pmd), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pmd_clear(pmd);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +}
> >> +
> >> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> >> +{
> >> +	pmd_t *pmd;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> >> +		pmd = pmd_start + i;
> >> +		if (pmd_val(*pmd))
> >> +			return;
> >> +	}
> >> +
> >> +	/* free a pmd table */
> >> +	free_pagetable(pud_page(*pud), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pud_clear(pud);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +}
> >> +
> >> +/* Return true if pgd is changed, otherwise return false. */
> >> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> >> +{
> >> +	pud_t *pud;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> >> +		pud = pud_start + i;
> >> +		if (pud_val(*pud))
> >> +			return false;
> >> +	}
> >> +
> >> +	/* free a pud table */
> >> +	free_pagetable(pgd_page(*pgd), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pgd_clear(pgd);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long next, pages = 0;
> >> +	pte_t *pte;
> >> +	void *page_addr;
> >> +	phys_addr_t phys_addr;
> >> +
> >> +	pte = pte_start + pte_index(addr);
> >> +	for (; addr < end; addr = next, pte++) {
> >> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> >> +		if (next > end)
> >> +			next = end;
> >> +
> >> +		if (!pte_present(*pte))
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * We mapped [0,1G) memory as identity mapping when
> >> +		 * initializing, in arch/x86/kernel/head_64.S. These
> >> +		 * pagetables cannot be removed.
> >> +		 */
> >> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> >> +		if (phys_addr < (phys_addr_t)0x40000000)
> >> +			return;
> >> +
> >> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> >> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> >> +			if (!direct) {
> >> +				free_pagetable(pte_page(*pte), 0);
> >> +				pages++;
> >> +			}
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pte_clear(&init_mm, addr, pte);
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +		} else {
> >> +			/*
> >> +			 * If we are not removing the whole page, it means
> >> +			 * other ptes in this page are being used and we cannot
> >> +			 * remove them. So fill the unused ptes with 0xFD, and
> >> +			 * remove the page when it is wholly filled with 0xFD.
> >> +			 */
> >> +			memset((void *)addr, PAGE_INUSE, next - addr);
> >> +			page_addr = page_address(pte_page(*pte));
> >> +
> >> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> >> +				free_pagetable(pte_page(*pte), 0);
> >> +				pages++;
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pte_clear(&init_mm, addr, pte);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	/* Call free_pte_table() in remove_pmd_table(). */
> >> +	flush_tlb_all();
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_4K, -pages);
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pte_phys, next, pages = 0;
> >> +	pte_t *pte_base;
> >> +	pmd_t *pmd;
> >> +
> >> +	pmd = pmd_start + pmd_index(addr);
> >> +	for (; addr < end; addr = next, pmd++) {
> >> +		next = pmd_addr_end(addr, end);
> >> +
> >> +		if (!pmd_present(*pmd))
> >> +			continue;
> >> +
> >> +		if (pmd_large(*pmd)) {
> >> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> >> +			    IS_ALIGNED(next, PMD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pmd_page(*pmd),
> >> +						       get_order(PMD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pmd_clear(pmd);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 2M page, but we need to remove part of them,
> >> +			 * so split 2M page to 4K page.
> >> +			 */
> >> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> >> +			BUG_ON(!pte_base);
> >> +			__split_large_page((pte_t *)pmd, addr,
> >> +					   (pte_t *)pte_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> >> +		remove_pte_table(pte_base, addr, next, direct);
> >> +		free_pte_table(pte_base, pmd);
> >> +		unmap_low_page(pte_base);
> >> +	}
> >> +
> >> +	/* Call free_pmd_table() in remove_pud_table(). */
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_2M, -pages);
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pmd_phys, next, pages = 0;
> >> +	pmd_t *pmd_base;
> >> +	pud_t *pud;
> >> +
> >> +	pud = pud_start + pud_index(addr);
> >> +	for (; addr < end; addr = next, pud++) {
> >> +		next = pud_addr_end(addr, end);
> >> +
> >> +		if (!pud_present(*pud))
> >> +			continue;
> >> +
> >> +		if (pud_large(*pud)) {
> >> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> >> +			    IS_ALIGNED(next, PUD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pud_page(*pud),
> >> +						       get_order(PUD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pud_clear(pud);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 1G page, but we need to remove part of them,
> >> +			 * so split 1G page to 2M page.
> >> +			 */
> >> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> >> +			BUG_ON(!pmd_base);
> >> +			__split_large_page((pte_t *)pud, addr,
> >> +					   (pte_t *)pmd_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> >> +		remove_pmd_table(pmd_base, addr, next, direct);
> >> +		free_pmd_table(pmd_base, pud);
> >> +		unmap_low_page(pmd_base);
> >> +	}
> >> +
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_1G, -pages);
> >> +}
> >> +
> >> +/* start and end are both virtual address. */
> >> +static void __meminit
> >> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> >> +{
> >> +	unsigned long next;
> >> +	pgd_t *pgd;
> >> +	pud_t *pud;
> >> +	bool pgd_changed = false;
> >> +
> >> +	for (; start < end; start = next) {
> >> +		pgd = pgd_offset_k(start);
> >> +		if (!pgd_present(*pgd))
> >> +			continue;
> >> +
> >> +		next = pgd_addr_end(start, end);
> >> +
> >> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> >> +		remove_pud_table(pud, start, next, direct);
> >> +		if (free_pud_table(pud, pgd))
> >> +			pgd_changed = true;
> >> +		unmap_low_page(pud);
> >> +	}
> >> +
> >> +	if (pgd_changed)
> >> +		sync_global_pgds(start, end - 1);
> >> +
> >> +	flush_tlb_all();
> >> +}
> >> +
> >>   #ifdef CONFIG_MEMORY_HOTREMOVE
> >>   int __ref arch_remove_memory(u64 start, u64 size)
> >>   {
> >> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> >> index a718e0d..7dcb6f9 100644
> >> --- a/arch/x86/mm/pageattr.c
> >> +++ b/arch/x86/mm/pageattr.c
> >> @@ -501,21 +501,13 @@ out_unlock:
> >>   	return do_split;
> >>   }
> >>
> >> -static int split_large_page(pte_t *kpte, unsigned long address)
> >> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> >>   {
> >>   	unsigned long pfn, pfninc = 1;
> >>   	unsigned int i, level;
> >> -	pte_t *pbase, *tmp;
> >> +	pte_t *tmp;
> >>   	pgprot_t ref_prot;
> >> -	struct page *base;
> >> -
> >> -	if (!debug_pagealloc)
> >> -		spin_unlock(&cpa_lock);
> >> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> -	if (!debug_pagealloc)
> >> -		spin_lock(&cpa_lock);
> >> -	if (!base)
> >> -		return -ENOMEM;
> >> +	struct page *base = virt_to_page(pbase);
> >>
> >>   	spin_lock(&pgd_lock);
> >>   	/*
> >> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>   	 * up for us already:
> >>   	 */
> >>   	tmp = lookup_address(address, &level);
> >> -	if (tmp != kpte)
> >> -		goto out_unlock;
> >> +	if (tmp != kpte) {
> >> +		spin_unlock(&pgd_lock);
> >> +		return 1;
> >> +	}
> >>
> >> -	pbase = (pte_t *)page_address(base);
> >>   	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> >>   	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> >>   	/*
> >> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>   	 * going on.
> >>   	 */
> >>   	__flush_tlb_all();
> >> +	spin_unlock(&pgd_lock);
> >>
> >> -	base = NULL;
> >> +	return 0;
> >> +}
> >>
> >> -out_unlock:
> >> -	/*
> >> -	 * If we dropped out via the lookup_address check under
> >> -	 * pgd_lock then stick the page back into the pool:
> >> -	 */
> >> -	if (base)
> >> +static int split_large_page(pte_t *kpte, unsigned long address)
> >> +{
> >> +	pte_t *pbase;
> >> +	struct page *base;
> >> +
> >> +	if (!debug_pagealloc)
> >> +		spin_unlock(&cpa_lock);
> >> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> +	if (!debug_pagealloc)
> >> +		spin_lock(&cpa_lock);
> >> +	if (!base)
> >> +		return -ENOMEM;
> >> +
> >> +	pbase = (pte_t *)page_address(base);
> >> +	if (__split_large_page(kpte, address, pbase))
> >>   		__free_page(base);
> >> -	spin_unlock(&pgd_lock);
> >>
> >>   	return 0;
> >>   }
> >> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> >> index 3f778c2..190ff06 100644
> >> --- a/include/linux/bootmem.h
> >> +++ b/include/linux/bootmem.h
> >> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
> >>   			      unsigned long size);
> >>   extern void free_bootmem(unsigned long physaddr, unsigned long size);
> >>   extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> >> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
> >>
> >>   /*
> >>    * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> > the body of a message to majordomo@vger.kernel.org
> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
> >
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Tang Chen @ 2013-01-30  3:00 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359514113.1288.2.camel@kernel>

On 01/30/2013 10:48 AM, Simon Jeons wrote:
> On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote:
>> On 01/29/2013 08:52 PM, Simon Jeons wrote:
>>> Hi Tang,
>>>
>>> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>>>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
>>
>> Hi Simon,
>>
>> I'll summarize all the info and answer you later. :)
>>
>> Thanks for asking. :)
>
> Thanks Tang. IIRC, there is a qemu feature that emulates memory
> hot-add/remove, for testing when no machine that supports memory
> hot-add/remove is available. Is that qemu feature merged? If not, where
> can I get that patchset?

Hi Simon,

There are patches to support memory hot-add/remove in qemu, but they are
not merged yet. You can get the latest patches here:
http://lists.nongnu.org/archive/html/qemu-devel/2012-12/msg02693.html

BTW, they are unstable and full of problems, and you need to compile your
own seabios too.

Thanks. :)

>
>>
>>>
>>> I have some questions that are not related to this patchset but concern
>>> memory hotplug in general.
>>>
>>> 1. In function node_states_check_changes_online:
>>>
>>> comments:
>>> * If we don't have HIGHMEM nor movable node,
>>> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
>>> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>>>
>>> How should this be understood? Why, when we have neither HIGHMEM nor a
>>> movable node, does node_states[N_NORMAL_MEMORY] contain zones
>>> 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular
>>> memory.
>>>
>>> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
>>> * contains nodes which have zones of 0...ZONE_MOVABLE,
>>> * set zone_last to ZONE_MOVABLE.
>>>
>>> How should this one be understood?
>>>
>>> 2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
>>> correct? The comment says the range must include/overlap; why?
>>>
>>> 3. In function online_pages, in the normal case (without online_kernel or
>>> online_movable), why is there no check for whether the new zone overlaps
>>> adjacent zones?
>>>
>>> 4. Could you summarize the implementation differences between hot-add and
>>> logic-add, and between hot-remove and logic-remove?
>>>
>>>
>>>>
>>>> This patch-set aims to implement physical memory hot-removing.
>>>>
>>>> The patches can free/remove the following things:
>>>>
>>>>     - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>>>     - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>>>>     - page table of removed memory              : [RFC PATCH 7,8,10/15]
>>>>     - node and related sysfs files              : [RFC PATCH 13-15/15]
>>>>
>>>>
>>>> Existing problem:
>>>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>>>> when we online pages.
>>>>
>>>> For example: there is a memory device on node 1. The address range
>>>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>>>> and memory11 under the directory /sys/devices/system/memory/.
>>>>
>>>> If CONFIG_MEMCG is selected, the memory that stores the page cgroup for
>>>> memory8 is not provided by this memory device when we online memory8. But
>>>> when we online memory9, the memory that stores its page cgroup may be
>>>> provided by memory8, so memory8 cannot be offlined while memory9 is
>>>> online. We should offline the memory blocks in the reverse of the order
>>>> in which they were onlined.
>>>>
>>>> When the memory device is hot-removed, we automatically offline the memory
>>>> provided by this device. But we don't know which memory block was onlined
>>>> first, so offlining may fail.
>>>>
>>>> In patch1, we provide a solution which is not good enough:
>>>> Iterate twice to offline the memory.
>>>> 1st iteration: offline every non-primary memory block.
>>>> 2nd iteration: offline the primary (i.e. first added) memory block.
>>>>
>>>> A new idea from Wen Congyang <wency@cn.fujitsu.com> is to allocate the
>>>> page_cgroup memory from the memory block it describes.
>>>>
>>>> But we are not sure whether that is OK, because there is no existing API
>>>> for it, and we would need to move the page_cgroup memory allocation from
>>>> MEM_GOING_ONLINE to MEM_ONLINE. It may also interfere with hugepages.
>>>>
>>>>
>>>>
>>>> How to test this patchset?
>>>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>>>>      ACPI_HOTPLUG_MEMORY must be selected.
>>>> 2. load the module acpi_memhotplug
>>>> 3. hotplug the memory device(it depends on your hardware)
>>>>      You will see the memory device under the directory /sys/bus/acpi/devices/.
>>>>      Its name is PNP0C80:XX.
>>>> 4. online/offline pages provided by this memory device
>>>>      You can write online/offline to /sys/devices/system/memory/memoryX/state to
>>>>      online/offline pages provided by this memory device
>>>> 5. hotremove the memory device
>>>>      You can hotremove the memory device by the hardware, or writing 1 to
>>>>      /sys/bus/acpi/devices/PNP0C80:XX/eject.
>>>
>>> Is there a similar knode to hot-add the memory device?
>>>
>>>>
>>>>
>>>> Note: if the memory provided by the memory device is used by the kernel, it
>>>> can't be offlined. It is not a bug.
>>>>
>>>>
>>>> Changelogs from v5 to v6:
>>>>    Patch3: Add some more comments to explain memory hot-remove.
>>>>    Patch4: Remove bootmem member in struct firmware_map_entry.
>>>>    Patch6: Repeatedly register bootmem pages when using hugepage.
>>>>    Patch8: Repeatedly free bootmem pages when using hugepage.
>>>>    Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>>>    Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>>>>             one when online a node.
>>>>
>>>> Changelogs from v4 to v5:
>>>>    Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>>>>            avoid disabling irq because we need flush tlb when free pagetables.
>>>>    Patch8: new patch, pick up some common APIs that are used to free direct mapping
>>>>            and vmemmap pagetables.
>>>>    Patch9: free direct mapping pagetables on x86_64 arch.
>>>>    Patch10: free vmemmap pagetables.
>>>>    Patch11: since freeing memmap with vmemmap has been implemented, the config
>>>>             macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>>>             no longer needed.
>>>>    Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>>>>             and add nid parameter when calling remove_memory().
>>>>
>>>> Changelogs from v3 to v4:
>>>>    Patch7: remove unused codes.
>>>>    Patch8: fix nr_pages that is passed to free_map_bootmem()
>>>>
>>>> Changelogs from v2 to v3:
>>>>    Patch9: call sync_global_pgds() if pgd is changed
>>>>    Patch10: fix a problem in the patch
>>>>
>>>> Changelogs from v1 to v2:
>>>>    Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>>>>            memory block. 2nd iterate: offline primary (i.e. first added) memory
>>>>            block.
>>>>
>>>>    Patch3: new patch, no logical change, just remove redundant code.
>>>>
>>>>    Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>>>>            after the pagetable is changed.
>>>>
>>>>    Patch12: new patch, free node_data when a node is offlined.
>>>>
>>>>
>>>> Tang Chen (6):
>>>>     memory-hotplug: move pgdat_resize_lock into
>>>>       sparse_remove_one_section()
>>>>     memory-hotplug: remove page table of x86_64 architecture
>>>>     memory-hotplug: remove memmap of sparse-vmemmap
>>>>     memory-hotplug: Integrated __remove_section() of
>>>>       CONFIG_SPARSEMEM_VMEMMAP.
>>>>     memory-hotplug: remove sysfs file of node
>>>>     memory-hotplug: Do not allocate pdgat if it was not freed when
>>>>       offline.
>>>>
>>>> Wen Congyang (5):
>>>>     memory-hotplug: try to offline the memory twice to avoid dependence
>>>>     memory-hotplug: remove redundant codes
>>>>     memory-hotplug: introduce new function arch_remove_memory() for
>>>>       removing page table depends on architecture
>>>>     memory-hotplug: Common APIs to support page tables hot-remove
>>>>     memory-hotplug: free node_data when a node is offlined
>>>>
>>>> Yasuaki Ishimatsu (4):
>>>>     memory-hotplug: check whether all memory blocks are offlined or not
>>>>       when removing memory
>>>>     memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>>>     memory-hotplug: implement register_page_bootmem_info_section of
>>>>       sparse-vmemmap
>>>>     memory-hotplug: memory_hotplug: clear zone when removing the memory
>>>>
>>>>    arch/arm64/mm/mmu.c                  |    3 +
>>>>    arch/ia64/mm/discontig.c             |   10 +
>>>>    arch/ia64/mm/init.c                  |   18 ++
>>>>    arch/powerpc/mm/init_64.c            |   10 +
>>>>    arch/powerpc/mm/mem.c                |   12 +
>>>>    arch/s390/mm/init.c                  |   12 +
>>>>    arch/s390/mm/vmem.c                  |   10 +
>>>>    arch/sh/mm/init.c                    |   17 ++
>>>>    arch/sparc/mm/init_64.c              |   10 +
>>>>    arch/tile/mm/init.c                  |    8 +
>>>>    arch/x86/include/asm/pgtable_types.h |    1 +
>>>>    arch/x86/mm/init_32.c                |   12 +
>>>>    arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>>>>    arch/x86/mm/pageattr.c               |   47 ++--
>>>>    drivers/acpi/acpi_memhotplug.c       |    8 +-
>>>>    drivers/base/memory.c                |    6 +
>>>>    drivers/firmware/memmap.c            |   96 +++++++-
>>>>    include/linux/bootmem.h              |    1 +
>>>>    include/linux/firmware-map.h         |    6 +
>>>>    include/linux/memory_hotplug.h       |   15 +-
>>>>    include/linux/mm.h                   |    4 +-
>>>>    mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>>>>    mm/sparse.c                          |    8 +-
>>>>    23 files changed, 1094 insertions(+), 69 deletions(-)
>>>>
>>>
>>>
>>>
>
>

^ permalink raw reply

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Simon Jeons @ 2013-01-30  2:48 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <5108864B.6040804@cn.fujitsu.com>

On Wed, 2013-01-30 at 10:32 +0800, Tang Chen wrote:
> On 01/29/2013 08:52 PM, Simon Jeons wrote:
> > Hi Tang,
> >
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
> 
> Hi Simon,
> 
> I'll summarize all the info and answer you later. :)
> 
> Thanks for asking. :)

Thanks Tang. IIRC, there is a qemu feature that emulates memory
hot-add/remove, for testing when no machine that supports memory
hot-add/remove is available. Is that qemu feature merged? If not, where
can I get that patchset?

> 
> >
> > I have some questions that are not related to this patchset but concern
> > memory hotplug in general.
> >
> > 1. In function node_states_check_changes_online:
> >
> > comments:
> > * If we don't have HIGHMEM nor movable node,
> > * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> > * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> >
> > How should this be understood? Why, when we have neither HIGHMEM nor a
> > movable node, does node_states[N_NORMAL_MEMORY] contain zones
> > 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular
> > memory.
> >
> > * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> > * contains nodes which have zones of 0...ZONE_MOVABLE,
> > * set zone_last to ZONE_MOVABLE.
> >
> > How should this one be understood?
> >
> > 2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
> > correct? The comment says the range must include/overlap; why?
> >
> > 3. In function online_pages, in the normal case (without online_kernel or
> > online_movable), why is there no check for whether the new zone overlaps
> > adjacent zones?
> >
> > 4. Could you summarize the implementation differences between hot-add and
> > logic-add, and between hot-remove and logic-remove?
> >
> >
> >>
> >> This patch-set aims to implement physical memory hot-removing.
> >>
> >> The patches can free/remove the following things:
> >>
> >>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
> >>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
> >>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
> >>    - node and related sysfs files              : [RFC PATCH 13-15/15]
> >>
> >>
> >> Existing problem:
> >> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> >> when we online pages.
> >>
> >> For example: there is a memory device on node 1. The address range
> >> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> >> and memory11 under the directory /sys/devices/system/memory/.
> >>
> >> If CONFIG_MEMCG is selected, the memory that stores the page cgroup for
> >> memory8 is not provided by this memory device when we online memory8. But
> >> when we online memory9, the memory that stores its page cgroup may be
> >> provided by memory8, so memory8 cannot be offlined while memory9 is
> >> online. We should offline the memory blocks in the reverse of the order
> >> in which they were onlined.
> >>
> >> When the memory device is hot-removed, we automatically offline the memory
> >> provided by this device. But we don't know which memory block was onlined
> >> first, so offlining may fail.
> >>
> >> In patch1, we provide a solution which is not good enough:
> >> Iterate twice to offline the memory.
> >> 1st iteration: offline every non-primary memory block.
> >> 2nd iteration: offline the primary (i.e. first added) memory block.
> >>
> >> A new idea from Wen Congyang <wency@cn.fujitsu.com> is to allocate the
> >> page_cgroup memory from the memory block it describes.
> >>
> >> But we are not sure whether that is OK, because there is no existing API
> >> for it, and we would need to move the page_cgroup memory allocation from
> >> MEM_GOING_ONLINE to MEM_ONLINE. It may also interfere with hugepages.
> >>
> >>
> >>
> >> How to test this patchset?
> >> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
> >>     ACPI_HOTPLUG_MEMORY must be selected.
> >> 2. load the module acpi_memhotplug
> >> 3. hotplug the memory device(it depends on your hardware)
> >>     You will see the memory device under the directory /sys/bus/acpi/devices/.
> >>     Its name is PNP0C80:XX.
> >> 4. online/offline pages provided by this memory device
> >>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
> >>     online/offline pages provided by this memory device
> >> 5. hotremove the memory device
> >>     You can hotremove the memory device by the hardware, or writing 1 to
> >>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
> >
> > Is there a similar knode to hot-add the memory device?
> >
> >>
> >>
> >> Note: if the memory provided by the memory device is used by the kernel, it
> >> can't be offlined. It is not a bug.
> >>
> >>
> >> Changelogs from v5 to v6:
> >>   Patch3: Add some more comments to explain memory hot-remove.
> >>   Patch4: Remove bootmem member in struct firmware_map_entry.
> >>   Patch6: Repeatedly register bootmem pages when using hugepage.
> >>   Patch8: Repeatedly free bootmem pages when using hugepage.
> >>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
> >>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
> >>            one when online a node.
> >>
> >> Changelogs from v4 to v5:
> >>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
> >>           avoid disabling irq because we need flush tlb when free pagetables.
> >>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
> >>           and vmemmap pagetables.
> >>   Patch9: free direct mapping pagetables on x86_64 arch.
> >>   Patch10: free vmemmap pagetables.
> >>   Patch11: since freeing memmap with vmemmap has been implemented, the config
> >>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
> >>            no longer needed.
> >>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
> >>            and add nid parameter when calling remove_memory().
> >>
> >> Changelogs from v3 to v4:
> >>   Patch7: remove unused codes.
> >>   Patch8: fix nr_pages that is passed to free_map_bootmem()
> >>
> >> Changelogs from v2 to v3:
> >>   Patch9: call sync_global_pgds() if pgd is changed
> >>   Patch10: fix a problem in the patch
> >>
> >> Changelogs from v1 to v2:
> >>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
> >>           memory block. 2nd iterate: offline primary (i.e. first added) memory
> >>           block.
> >>
> >>   Patch3: new patch, no logical change, just remove redundant code.
> >>
> >>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
> >>           after the pagetable is changed.
> >>
> >>   Patch12: new patch, free node_data when a node is offlined.
> >>
> >>
> >> Tang Chen (6):
> >>    memory-hotplug: move pgdat_resize_lock into
> >>      sparse_remove_one_section()
> >>    memory-hotplug: remove page table of x86_64 architecture
> >>    memory-hotplug: remove memmap of sparse-vmemmap
> >>    memory-hotplug: Integrated __remove_section() of
> >>      CONFIG_SPARSEMEM_VMEMMAP.
> >>    memory-hotplug: remove sysfs file of node
> >>    memory-hotplug: Do not allocate pdgat if it was not freed when
> >>      offline.
> >>
> >> Wen Congyang (5):
> >>    memory-hotplug: try to offline the memory twice to avoid dependence
> >>    memory-hotplug: remove redundant codes
> >>    memory-hotplug: introduce new function arch_remove_memory() for
> >>      removing page table depends on architecture
> >>    memory-hotplug: Common APIs to support page tables hot-remove
> >>    memory-hotplug: free node_data when a node is offlined
> >>
> >> Yasuaki Ishimatsu (4):
> >>    memory-hotplug: check whether all memory blocks are offlined or not
> >>      when removing memory
> >>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
> >>    memory-hotplug: implement register_page_bootmem_info_section of
> >>      sparse-vmemmap
> >>    memory-hotplug: memory_hotplug: clear zone when removing the memory
> >>
> >>   arch/arm64/mm/mmu.c                  |    3 +
> >>   arch/ia64/mm/discontig.c             |   10 +
> >>   arch/ia64/mm/init.c                  |   18 ++
> >>   arch/powerpc/mm/init_64.c            |   10 +
> >>   arch/powerpc/mm/mem.c                |   12 +
> >>   arch/s390/mm/init.c                  |   12 +
> >>   arch/s390/mm/vmem.c                  |   10 +
> >>   arch/sh/mm/init.c                    |   17 ++
> >>   arch/sparc/mm/init_64.c              |   10 +
> >>   arch/tile/mm/init.c                  |    8 +
> >>   arch/x86/include/asm/pgtable_types.h |    1 +
> >>   arch/x86/mm/init_32.c                |   12 +
> >>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
> >>   arch/x86/mm/pageattr.c               |   47 ++--
> >>   drivers/acpi/acpi_memhotplug.c       |    8 +-
> >>   drivers/base/memory.c                |    6 +
> >>   drivers/firmware/memmap.c            |   96 +++++++-
> >>   include/linux/bootmem.h              |    1 +
> >>   include/linux/firmware-map.h         |    6 +
> >>   include/linux/memory_hotplug.h       |   15 +-
> >>   include/linux/mm.h                   |    4 +-
> >>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
> >>   mm/sparse.c                          |    8 +-
> >>   23 files changed, 1094 insertions(+), 69 deletions(-)
> >>
> >
> >
> >

^ permalink raw reply

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Tang Chen @ 2013-01-30  2:32 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359463973.1624.15.camel@kernel>

On 01/29/2013 08:52 PM, Simon Jeons wrote:
> Hi Tang,
>
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> Here is the physical memory hot-remove patch-set based on 3.8rc-2.

Hi Simon,

I'll summarize all the info and answer you later. :)

Thanks for asking. :)

>
> I have some questions that are not related to this patchset but concern
> memory hotplug in general.
>
> 1. In function node_states_check_changes_online:
>
> comments:
> * If we don't have HIGHMEM nor movable node,
> * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
>
> How should this be understood? Why, when we have neither HIGHMEM nor a
> movable node, does node_states[N_NORMAL_MEMORY] contain zones
> 0...ZONE_MOVABLE? IIUC, N_NORMAL_MEMORY only means the node has regular
> memory.
>
> * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> * contains nodes which have zones of 0...ZONE_MOVABLE,
> * set zone_last to ZONE_MOVABLE.
>
> How should this one be understood?
>
> 2. In function move_pfn_range_left, why is end <= z2->zone_start_pfn not
> correct? The comment says the range must include/overlap; why?
>
> 3. In function online_pages, in the normal case (without online_kernel or
> online_movable), why is there no check for whether the new zone overlaps
> adjacent zones?
>
> 4. Could you summarize the implementation differences between hot-add and
> logic-add, and between hot-remove and logic-remove?
>
>
>>
>> This patch-set aims to implement physical memory hot-removing.
>>
>> The patches can free/remove the following things:
>>
>>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
>>    - node and related sysfs files              : [RFC PATCH 13-15/15]
>>
>>
>> Existing problem:
>> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
>> when we online pages.
>>
>> For example: there is a memory device on node 1. The address range
>> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
>> and memory11 under the directory /sys/devices/system/memory/.
>>
>> If CONFIG_MEMCG is selected, the memory that stores the page cgroup for
>> memory8 is not provided by this memory device when we online memory8. But
>> when we online memory9, the memory that stores its page cgroup may be
>> provided by memory8, so memory8 cannot be offlined while memory9 is
>> online. We should offline the memory blocks in the reverse of the order
>> in which they were onlined.
>>
>> When the memory device is hot-removed, we automatically offline the memory
>> provided by this device. But we don't know which memory block was onlined
>> first, so offlining may fail.
>>
>> In patch1, we provide a solution which is not good enough:
>> Iterate twice to offline the memory.
>> 1st iteration: offline every non-primary memory block.
>> 2nd iteration: offline the primary (i.e. first added) memory block.
>>
>> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
>> allocate the memory from the memory block they are describing.
>>
>> But we are not sure if it is OK to do so because there is not existing API
>> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
>> to MEM_ONLINE. And also, it may interfere the hugepage.
>>
>>
>>
>> How to test this patchset?
>> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>>     ACPI_HOTPLUG_MEMORY must be selected.
>> 2. load the module acpi_memhotplug
>> 3. hotplug the memory device(it depends on your hardware)
>>     You will see the memory device under the directory /sys/bus/acpi/devices/.
>>     Its name is PNP0C80:XX.
>> 4. online/offline pages provided by this memory device
>>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
>>     online/offline pages provided by this memory device
>> 5. hotremove the memory device
>>     You can hotremove the memory device by the hardware, or writing 1 to
>>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
>
> Is there a similar knode to hot-add the memory device?
>
>>
>>
>> Note: if the memory provided by the memory device is used by the kernel, it
>> can't be offlined. It is not a bug.
>>
>>
>> Changelogs from v5 to v6:
>>   Patch3: Add some more comments to explain memory hot-remove.
>>   Patch4: Remove bootmem member in struct firmware_map_entry.
>>   Patch6: Repeatedly register bootmem pages when using hugepage.
>>   Patch8: Repeatedly free bootmem pages when using hugepage.
>>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>>            one when online a node.
>>
>> Changelogs from v4 to v5:
>>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>>           avoid disabling irq because we need flush tlb when free pagetables.
>>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
>>           and vmemmap pagetables.
>>   Patch9: free direct mapping pagetables on x86_64 arch.
>>   Patch10: free vmemmap pagetables.
>>   Patch11: since freeing memmap with vmemmap has been implemented, the config
>>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>>            no longer needed.
>>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>>            and add nid parameter when calling remove_memory().
>>
>> Changelogs from v3 to v4:
>>   Patch7: remove unused codes.
>>   Patch8: fix nr_pages that is passed to free_map_bootmem()
>>
>> Changelogs from v2 to v3:
>>   Patch9: call sync_global_pgds() if pgd is changed
>>   Patch10: fix a problem int the patch
>>
>> Changelogs from v1 to v2:
>>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>>           memory block. 2nd iterate: offline primary (i.e. first added) memory
>>           block.
>>
>>   Patch3: new patch, no logical change, just remove reduntant codes.
>>
>>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>>           after the pagetable is changed.
>>
>>   Patch12: new patch, free node_data when a node is offlined.
>>
>>
>> Tang Chen (6):
>>    memory-hotplug: move pgdat_resize_lock into
>>      sparse_remove_one_section()
>>    memory-hotplug: remove page table of x86_64 architecture
>>    memory-hotplug: remove memmap of sparse-vmemmap
>>    memory-hotplug: Integrated __remove_section() of
>>      CONFIG_SPARSEMEM_VMEMMAP.
>>    memory-hotplug: remove sysfs file of node
>>    memory-hotplug: Do not allocate pdgat if it was not freed when
>>      offline.
>>
>> Wen Congyang (5):
>>    memory-hotplug: try to offline the memory twice to avoid dependence
>>    memory-hotplug: remove redundant codes
>>    memory-hotplug: introduce new function arch_remove_memory() for
>>      removing page table depends on architecture
>>    memory-hotplug: Common APIs to support page tables hot-remove
>>    memory-hotplug: free node_data when a node is offlined
>>
>> Yasuaki Ishimatsu (4):
>>    memory-hotplug: check whether all memory blocks are offlined or not
>>      when removing memory
>>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
>>    memory-hotplug: implement register_page_bootmem_info_section of
>>      sparse-vmemmap
>>    memory-hotplug: memory_hotplug: clear zone when removing the memory
>>
>>   arch/arm64/mm/mmu.c                  |    3 +
>>   arch/ia64/mm/discontig.c             |   10 +
>>   arch/ia64/mm/init.c                  |   18 ++
>>   arch/powerpc/mm/init_64.c            |   10 +
>>   arch/powerpc/mm/mem.c                |   12 +
>>   arch/s390/mm/init.c                  |   12 +
>>   arch/s390/mm/vmem.c                  |   10 +
>>   arch/sh/mm/init.c                    |   17 ++
>>   arch/sparc/mm/init_64.c              |   10 +
>>   arch/tile/mm/init.c                  |    8 +
>>   arch/x86/include/asm/pgtable_types.h |    1 +
>>   arch/x86/mm/init_32.c                |   12 +
>>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>>   arch/x86/mm/pageattr.c               |   47 ++--
>>   drivers/acpi/acpi_memhotplug.c       |    8 +-
>>   drivers/base/memory.c                |    6 +
>>   drivers/firmware/memmap.c            |   96 +++++++-
>>   include/linux/bootmem.h              |    1 +
>>   include/linux/firmware-map.h         |    6 +
>>   include/linux/memory_hotplug.h       |   15 +-
>>   include/linux/mm.h                   |    4 +-
>>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>>   mm/sparse.c                          |    8 +-
>>   23 files changed, 1094 insertions(+), 69 deletions(-)
>>
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>
>
>
>

^ permalink raw reply

* Re: PREMPT_RT
From: Vineeth @ 2013-01-30  2:29 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <CAFbQSaCxSVDRW=kcjkrbrm=8Sy4N0q29ALfqb4o3UdB3crY64g@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1174 bytes --]

Hi,

I've added a VDSO patch to my 2.6.32 kernel. When I enable PREEMPT_RT, I get
the oops message below.
What is the reason for this? When I searched, many places mentioned
that it is caused by a get_cpu()/put_cpu() preempt-disabled region.
Can someone help me understand this?

BUG: sleeping function called from invalid context at kernel/rtmutex.c:684
pcnt: 10002 0 in_atomic(): 1, irqs_disabled(): 1, pid: 0, name: swapper
Call Trace:
[c0771d40] [c000750c] show_stack+0x44/0x160 (unreliable)
[c0771d70] [c0021114] __might_sleep+0xe4/0x108
[c0771d80] [c05700bc] rt_spin_lock+0x38/0xb4
[c0771d90] [c005a3b0] ntp_tick_length+0x20/0x54
[c0771db0] [c00594e8] update_wall_time+0xb0/0xa3c
[c0771e20] [c0040bc0] do_timer+0x38/0x4c
[c0771e30] [c005f2c4] tick_do_update_jiffies64+0x1cc/0x2a8
[c0771e70] [c005f494] tick_check_idle+0xf4/0x120
[c0771ea0] [c0039a50] irq_enter+0x68/0x7c
[c0771eb0] [c000ae78] timer_interrupt+0xb0/0x168
[c0771ed0] [c000e84c] ret_from_except+0x0/0x18
[c0771f90] [c000874c] cpu_idle+0x58/0xf4
[c0771fb0] [c0002330] rest_init+0xa0/0xb0
[c0771fc0] [c06fc9d4] start_kernel+0x320/0x334
[c0771ff0] [c00003f0] skpinv+0x308/0x344


Thanks

[-- Attachment #2: Type: text/html, Size: 1369 bytes --]

^ permalink raw reply

* PREMPT_RT
From: Vineeth @ 2013-01-30  2:19 UTC (permalink / raw)
  To: linuxppc-dev
In-Reply-To: <CAFbQSaBm-0ROHX80JQN6ef4Hgy1PgH8PmYRSRpyJ1+mNo+HnbQ@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 308 bytes --]

Hi,

I've added a VDSO patch to my 2.6.32 kernel. When I enable PREEMPT_RT, I get
the oops messages below.
What is the reason for this? When I searched, many places mentioned
that it is caused by a get_cpu()/put_cpu() preempt-disabled region.
Can someone help me understand this?

Thanks

[-- Attachment #2: Type: text/html, Size: 508 bytes --]

^ permalink raw reply

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Tang Chen @ 2013-01-30  2:16 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359464694.1624.18.camel@kernel>

On 01/29/2013 09:04 PM, Simon Jeons wrote:
> Hi Tang,
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> From: Wen Congyang<wency@cn.fujitsu.com>
>>
>> When memory is removed, the corresponding pagetables should alse be removed.
>> This patch introduces some common APIs to support vmemmap pagetable and x86_64
>> architecture pagetable removing.
>
> Why don't need to build_all_zonelists like online_pages does during
> hot-add path(add_memory)?

Hi Simon,

As you said, build_all_zonelists() is called by online_pages(). When a memory
device is hot-added, we cannot use its memory yet. We can only use it after
we online the pages on it.

But we can online the pages as different types, kernel or movable (which
belong to different zones), and we can online only part of the memory
instead of all of it. So each time we online some pages, we should check
whether we need to update the zonelists.

So I think that is why build_all_zonelists() is called in online_pages().
(just my opinion)
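
For reference, here is a minimal user-space sketch of choosing the online
type for one memory block through sysfs. It is only an illustration, not
part of the patchset. It assumes the state file accepts "online_kernel" and
"online_movable" in addition to "online" and "offline", as mentioned earlier
in this thread. write_block_state() is a hypothetical helper name, and the
caller supplies the path.

```c
#include <stdio.h>

/*
 * Hypothetical helper: write a state string such as "online_movable" to a
 * memory block state file, e.g. /sys/devices/system/memory/memoryX/state.
 * Returns 0 on success, -1 on failure.
 */
static int write_block_state(const char *state_path, const char *state)
{
	FILE *f = fopen(state_path, "w");
	int ret = 0;

	if (!f)
		return -1;

	/* The kernel parses the string written to the state file. */
	if (fputs(state, f) == EOF)
		ret = -1;

	/* Write errors may only be reported when the stream is flushed. */
	if (fclose(f) == EOF)
		ret = -1;

	return ret;
}
```

For example, calling write_block_state() with the path
/sys/devices/system/memory/memory9/state and the string "online_movable"
would request the movable type for that block. It needs root, and the
kernel may still reject the request.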

Thanks. :)

>
>>
>> All pages of virtual mapping in removed memory cannot be freedi if some pages
>> used as PGD/PUD includes not only removed memory but also other memory. So the
>> patch uses the following way to check whether page can be freed or not.
>>
>>   1. When removing memory, the page structs of the revmoved memory are filled
>>      with 0FD.
>>   2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
>>      In this case, the page used as PT/PMD can be freed.
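
A minimal user-space model of the check in steps 1 and 2 above, for
illustration only. mark_unused(), page_fully_unused() and MODEL_PAGE_SIZE
are hypothetical names. The patch itself uses memset() and memchr_inv() on
the page backing the vmemmap; memchr_inv() is kernel-only, so a plain loop
stands in for it here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MODEL_PAGE_SIZE 4096	/* assumption: 4K pages */
#define PAGE_INUSE 0xFD		/* filler byte, same value as in the patch */

/* Step 1: mark [off, off + len) of the backing page as no longer needed. */
static void mark_unused(unsigned char *page, size_t off, size_t len)
{
	memset(page + off, PAGE_INUSE, len);
}

/* Step 2: the page may be freed only when every byte is the filler. */
static bool page_fully_unused(const unsigned char *page)
{
	size_t i;

	for (i = 0; i < MODEL_PAGE_SIZE; i++)
		if (page[i] != PAGE_INUSE)
			return false;

	return true;
}
```

Marking only the first half of a page leaves page_fully_unused() false.
Once the second half is marked as well, it returns true, which corresponds
to the point where the patch frees the page and clears the entry.
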
>>
>> Signed-off-by: Yasuaki Ishimatsu<isimatu.yasuaki@jp.fujitsu.com>
>> Signed-off-by: Jianguo Wu<wujianguo@huawei.com>
>> Signed-off-by: Wen Congyang<wency@cn.fujitsu.com>
>> Signed-off-by: Tang Chen<tangchen@cn.fujitsu.com>
>> ---
>>   arch/x86/include/asm/pgtable_types.h |    1 +
>>   arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>>   arch/x86/mm/pageattr.c               |   47 +++---
>>   include/linux/bootmem.h              |    1 +
>>   4 files changed, 326 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>> index 3c32db8..4b6fd2a 100644
>> --- a/arch/x86/include/asm/pgtable_types.h
>> +++ b/arch/x86/include/asm/pgtable_types.h
>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>>    * as a pte too.
>>    */
>>   extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>
>>   #endif	/* !__ASSEMBLY__ */
>>
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 9ac1723..fe01116 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>   }
>>   EXPORT_SYMBOL_GPL(arch_add_memory);
>>
>> +#define PAGE_INUSE 0xFD
>> +
>> +static void __meminit free_pagetable(struct page *page, int order)
>> +{
>> +	struct zone *zone;
>> +	bool bootmem = false;
>> +	unsigned long magic;
>> +	unsigned int nr_pages = 1<<  order;
>> +
>> +	/* bootmem page has reserved flag */
>> +	if (PageReserved(page)) {
>> +		__ClearPageReserved(page);
>> +		bootmem = true;
>> +
>> +		magic = (unsigned long)page->lru.next;
>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +			while (nr_pages--)
>> +				put_page_bootmem(page++);
>> +		} else
>> +			__free_pages_bootmem(page, order);
>> +	} else
>> +		free_pages((unsigned long)page_address(page), order);
>> +
>> +	/*
>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
>> +	 * are all allocated by bootmem.
>> +	 */
>> +	if (bootmem) {
>> +		zone = page_zone(page);
>> +		zone_span_writelock(zone);
>> +		zone->present_pages += nr_pages;
>> +		zone_span_writeunlock(zone);
>> +		totalram_pages += nr_pages;
>> +	}
>> +}
>> +
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> +	pte_t *pte;
>> +	int i;
>> +
>> +	for (i = 0; i<  PTRS_PER_PTE; i++) {
>> +		pte = pte_start + i;
>> +		if (pte_val(*pte))
>> +			return;
>> +	}
>> +
>> +	/* free a pte talbe */
>> +	free_pagetable(pmd_page(*pmd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pmd_clear(pmd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> +	pmd_t *pmd;
>> +	int i;
>> +
>> +	for (i = 0; i<  PTRS_PER_PMD; i++) {
>> +		pmd = pmd_start + i;
>> +		if (pmd_val(*pmd))
>> +			return;
>> +	}
>> +
>> +	/* free a pmd talbe */
>> +	free_pagetable(pud_page(*pud), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pud_clear(pud);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +/* Return true if pgd is changed, otherwise return false. */
>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
>> +{
>> +	pud_t *pud;
>> +	int i;
>> +
>> +	for (i = 0; i<  PTRS_PER_PUD; i++) {
>> +		pud = pud_start + i;
>> +		if (pud_val(*pud))
>> +			return false;
>> +	}
>> +
>> +	/* free a pud table */
>> +	free_pagetable(pgd_page(*pgd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pgd_clear(pgd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +
>> +	return true;
>> +}
>> +
>> +static void __meminit
>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long next, pages = 0;
>> +	pte_t *pte;
>> +	void *page_addr;
>> +	phys_addr_t phys_addr;
>> +
>> +	pte = pte_start + pte_index(addr);
>> +	for (; addr<  end; addr = next, pte++) {
>> +		next = (addr + PAGE_SIZE)&  PAGE_MASK;
>> +		if (next>  end)
>> +			next = end;
>> +
>> +		if (!pte_present(*pte))
>> +			continue;
>> +
>> +		/*
>> +		 * We mapped [0,1G) memory as identity mapping when
>> +		 * initializing, in arch/x86/kernel/head_64.S. These
>> +		 * pagetables cannot be removed.
>> +		 */
>> +		phys_addr = pte_val(*pte) + (addr&  PAGE_MASK);
>> +		if (phys_addr<  (phys_addr_t)0x40000000)
>> +			return;
>> +
>> +		if (IS_ALIGNED(addr, PAGE_SIZE)&&
>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
>> +			if (!direct) {
>> +				free_pagetable(pte_page(*pte), 0);
>> +				pages++;
>> +			}
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pte_clear(&init_mm, addr, pte);
>> +			spin_unlock(&init_mm.page_table_lock);
>> +		} else {
>> +			/*
>> +			 * If we are not removing the whole page, it means
>> +			 * other ptes in this page are being used and we canot
>> +			 * remove them. So fill the unused ptes with 0xFD, and
>> +			 * remove the page when it is wholly filled with 0xFD.
>> +			 */
>> +			memset((void *)addr, PAGE_INUSE, next - addr);
>> +			page_addr = page_address(pte_page(*pte));
>> +
>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>> +				free_pagetable(pte_page(*pte), 0);
>> +				pages++;
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pte_clear(&init_mm, addr, pte);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +			}
>> +		}
>> +	}
>> +
>> +	/* Call free_pte_table() in remove_pmd_table(). */
>> +	flush_tlb_all();
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_4K, -pages);
>> +}
>> +
>> +static void __meminit
>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long pte_phys, next, pages = 0;
>> +	pte_t *pte_base;
>> +	pmd_t *pmd;
>> +
>> +	pmd = pmd_start + pmd_index(addr);
>> +	for (; addr<  end; addr = next, pmd++) {
>> +		next = pmd_addr_end(addr, end);
>> +
>> +		if (!pmd_present(*pmd))
>> +			continue;
>> +
>> +		if (pmd_large(*pmd)) {
>> +			if (IS_ALIGNED(addr, PMD_SIZE)&&
>> +			    IS_ALIGNED(next, PMD_SIZE)) {
>> +				if (!direct) {
>> +					free_pagetable(pmd_page(*pmd),
>> +						       get_order(PMD_SIZE));
>> +					pages++;
>> +				}
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pmd_clear(pmd);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 2M page, but we need to remove part of them,
>> +			 * so split 2M page to 4K page.
>> +			 */
>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
>> +			BUG_ON(!pte_base);
>> +			__split_large_page((pte_t *)pmd, addr,
>> +					   (pte_t *)pte_base);
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>> +			spin_unlock(&init_mm.page_table_lock);
>> +
>> +			flush_tlb_all();
>> +		}
>> +
>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>> +		remove_pte_table(pte_base, addr, next, direct);
>> +		free_pte_table(pte_base, pmd);
>> +		unmap_low_page(pte_base);
>> +	}
>> +
>> +	/* Call free_pmd_table() in remove_pud_table(). */
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_2M, -pages);
>> +}
>> +
>> +static void __meminit
>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long pmd_phys, next, pages = 0;
>> +	pmd_t *pmd_base;
>> +	pud_t *pud;
>> +
>> +	pud = pud_start + pud_index(addr);
>> +	for (; addr<  end; addr = next, pud++) {
>> +		next = pud_addr_end(addr, end);
>> +
>> +		if (!pud_present(*pud))
>> +			continue;
>> +
>> +		if (pud_large(*pud)) {
>> +			if (IS_ALIGNED(addr, PUD_SIZE)&&
>> +			    IS_ALIGNED(next, PUD_SIZE)) {
>> +				if (!direct) {
>> +					free_pagetable(pud_page(*pud),
>> +						       get_order(PUD_SIZE));
>> +					pages++;
>> +				}
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pud_clear(pud);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 1G page, but we need to remove part of them,
>> +			 * so split 1G page to 2M page.
>> +			 */
>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
>> +			BUG_ON(!pmd_base);
>> +			__split_large_page((pte_t *)pud, addr,
>> +					   (pte_t *)pmd_base);
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
>> +			spin_unlock(&init_mm.page_table_lock);
>> +
>> +			flush_tlb_all();
>> +		}
>> +
>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
>> +		remove_pmd_table(pmd_base, addr, next, direct);
>> +		free_pmd_table(pmd_base, pud);
>> +		unmap_low_page(pmd_base);
>> +	}
>> +
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_1G, -pages);
>> +}
>> +
>> +/* start and end are both virtual address. */
>> +static void __meminit
>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
>> +{
>> +	unsigned long next;
>> +	pgd_t *pgd;
>> +	pud_t *pud;
>> +	bool pgd_changed = false;
>> +
>> +	for (; start<  end; start = next) {
>> +		pgd = pgd_offset_k(start);
>> +		if (!pgd_present(*pgd))
>> +			continue;
>> +
>> +		next = pgd_addr_end(start, end);
>> +
>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>> +		remove_pud_table(pud, start, next, direct);
>> +		if (free_pud_table(pud, pgd))
>> +			pgd_changed = true;
>> +		unmap_low_page(pud);
>> +	}
>> +
>> +	if (pgd_changed)
>> +		sync_global_pgds(start, end - 1);
>> +
>> +	flush_tlb_all();
>> +}
>> +
>>   #ifdef CONFIG_MEMORY_HOTREMOVE
>>   int __ref arch_remove_memory(u64 start, u64 size)
>>   {
>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>> index a718e0d..7dcb6f9 100644
>> --- a/arch/x86/mm/pageattr.c
>> +++ b/arch/x86/mm/pageattr.c
>> @@ -501,21 +501,13 @@ out_unlock:
>>   	return do_split;
>>   }
>>
>> -static int split_large_page(pte_t *kpte, unsigned long address)
>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>   {
>>   	unsigned long pfn, pfninc = 1;
>>   	unsigned int i, level;
>> -	pte_t *pbase, *tmp;
>> +	pte_t *tmp;
>>   	pgprot_t ref_prot;
>> -	struct page *base;
>> -
>> -	if (!debug_pagealloc)
>> -		spin_unlock(&cpa_lock);
>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> -	if (!debug_pagealloc)
>> -		spin_lock(&cpa_lock);
>> -	if (!base)
>> -		return -ENOMEM;
>> +	struct page *base = virt_to_page(pbase);
>>
>>   	spin_lock(&pgd_lock);
>>   	/*
>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>   	 * up for us already:
>>   	 */
>>   	tmp = lookup_address(address,&level);
>> -	if (tmp != kpte)
>> -		goto out_unlock;
>> +	if (tmp != kpte) {
>> +		spin_unlock(&pgd_lock);
>> +		return 1;
>> +	}
>>
>> -	pbase = (pte_t *)page_address(base);
>>   	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>   	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>   	/*
>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>   	 * going on.
>>   	 */
>>   	__flush_tlb_all();
>> +	spin_unlock(&pgd_lock);
>>
>> -	base = NULL;
>> +	return 0;
>> +}
>>
>> -out_unlock:
>> -	/*
>> -	 * If we dropped out via the lookup_address check under
>> -	 * pgd_lock then stick the page back into the pool:
>> -	 */
>> -	if (base)
>> +static int split_large_page(pte_t *kpte, unsigned long address)
>> +{
>> +	pte_t *pbase;
>> +	struct page *base;
>> +
>> +	if (!debug_pagealloc)
>> +		spin_unlock(&cpa_lock);
>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> +	if (!debug_pagealloc)
>> +		spin_lock(&cpa_lock);
>> +	if (!base)
>> +		return -ENOMEM;
>> +
>> +	pbase = (pte_t *)page_address(base);
>> +	if (__split_large_page(kpte, address, pbase))
>>   		__free_page(base);
>> -	spin_unlock(&pgd_lock);
>>
>>   	return 0;
>>   }
>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
>> index 3f778c2..190ff06 100644
>> --- a/include/linux/bootmem.h
>> +++ b/include/linux/bootmem.h
>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>>   			      unsigned long size);
>>   extern void free_bootmem(unsigned long physaddr, unsigned long size);
>>   extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>>
>>   /*
>>    * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-acpi" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>

^ permalink raw reply

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Simon Jeons @ 2013-01-30  2:13 UTC (permalink / raw)
  To: Jianguo Wu
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev
In-Reply-To: <51087D09.4090205@huawei.com>

On Wed, 2013-01-30 at 09:53 +0800, Jianguo Wu wrote:
> On 2013/1/29 21:02, Simon Jeons wrote:
> 
> > Hi Tang,
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> From: Wen Congyang <wency@cn.fujitsu.com>
> >>
> >> When memory is removed, the corresponding pagetables should alse be removed.
> >> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> >> architecture pagetable removing.
> >>
> > 
> > When page table of hot-add memory is created?
> 
> 
> Hi Simon,
> 
> For x86_64, page table of hot-add memory is created by:
>     add_memory->arch_add_memory->init_memory_mapping->kernel_physical_mapping_init

Yup, thanks. :)

> 
> > 
> >> All pages of virtual mapping in removed memory cannot be freedi if some pages
> >> used as PGD/PUD includes not only removed memory but also other memory. So the
> >> patch uses the following way to check whether page can be freed or not.
> >>
> >>  1. When removing memory, the page structs of the revmoved memory are filled
> >>     with 0FD.
> >>  2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
> >>     In this case, the page used as PT/PMD can be freed.
> >>
> >> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> >> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> >> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> >> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> >> ---
> >>  arch/x86/include/asm/pgtable_types.h |    1 +
> >>  arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
> >>  arch/x86/mm/pageattr.c               |   47 +++---
> >>  include/linux/bootmem.h              |    1 +
> >>  4 files changed, 326 insertions(+), 22 deletions(-)
> >>
> >> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> >> index 3c32db8..4b6fd2a 100644
> >> --- a/arch/x86/include/asm/pgtable_types.h
> >> +++ b/arch/x86/include/asm/pgtable_types.h
> >> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
> >>   * as a pte too.
> >>   */
> >>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> >> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
> >>  
> >>  #endif	/* !__ASSEMBLY__ */
> >>  
> >> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> >> index 9ac1723..fe01116 100644
> >> --- a/arch/x86/mm/init_64.c
> >> +++ b/arch/x86/mm/init_64.c
> >> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
> >>  }
> >>  EXPORT_SYMBOL_GPL(arch_add_memory);
> >>  
> >> +#define PAGE_INUSE 0xFD
> >> +
> >> +static void __meminit free_pagetable(struct page *page, int order)
> >> +{
> >> +	struct zone *zone;
> >> +	bool bootmem = false;
> >> +	unsigned long magic;
> >> +	unsigned int nr_pages = 1 << order;
> >> +
> >> +	/* bootmem page has reserved flag */
> >> +	if (PageReserved(page)) {
> >> +		__ClearPageReserved(page);
> >> +		bootmem = true;
> >> +
> >> +		magic = (unsigned long)page->lru.next;
> >> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> >> +			while (nr_pages--)
> >> +				put_page_bootmem(page++);
> >> +		} else
> >> +			__free_pages_bootmem(page, order);
> >> +	} else
> >> +		free_pages((unsigned long)page_address(page), order);
> >> +
> >> +	/*
> >> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> >> +	 * are all allocated by bootmem.
> >> +	 */
> >> +	if (bootmem) {
> >> +		zone = page_zone(page);
> >> +		zone_span_writelock(zone);
> >> +		zone->present_pages += nr_pages;
> >> +		zone_span_writeunlock(zone);
> >> +		totalram_pages += nr_pages;
> >> +	}
> >> +}
> >> +
> >> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> >> +{
> >> +	pte_t *pte;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> >> +		pte = pte_start + i;
> >> +		if (pte_val(*pte))
> >> +			return;
> >> +	}
> >> +
> >> +	/* free a pte talbe */
> >> +	free_pagetable(pmd_page(*pmd), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pmd_clear(pmd);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +}
> >> +
> >> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> >> +{
> >> +	pmd_t *pmd;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> >> +		pmd = pmd_start + i;
> >> +		if (pmd_val(*pmd))
> >> +			return;
> >> +	}
> >> +
> >> +	/* free a pmd talbe */
> >> +	free_pagetable(pud_page(*pud), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pud_clear(pud);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +}
> >> +
> >> +/* Return true if pgd is changed, otherwise return false. */
> >> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> >> +{
> >> +	pud_t *pud;
> >> +	int i;
> >> +
> >> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> >> +		pud = pud_start + i;
> >> +		if (pud_val(*pud))
> >> +			return false;
> >> +	}
> >> +
> >> +	/* free a pud table */
> >> +	free_pagetable(pgd_page(*pgd), 0);
> >> +	spin_lock(&init_mm.page_table_lock);
> >> +	pgd_clear(pgd);
> >> +	spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +	return true;
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long next, pages = 0;
> >> +	pte_t *pte;
> >> +	void *page_addr;
> >> +	phys_addr_t phys_addr;
> >> +
> >> +	pte = pte_start + pte_index(addr);
> >> +	for (; addr < end; addr = next, pte++) {
> >> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> >> +		if (next > end)
> >> +			next = end;
> >> +
> >> +		if (!pte_present(*pte))
> >> +			continue;
> >> +
> >> +		/*
> >> +		 * We mapped [0,1G) memory as identity mapping when
> >> +		 * initializing, in arch/x86/kernel/head_64.S. These
> >> +		 * pagetables cannot be removed.
> >> +		 */
> >> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> >> +		if (phys_addr < (phys_addr_t)0x40000000)
> >> +			return;
> >> +
> >> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> >> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> >> +			if (!direct) {
> >> +				free_pagetable(pte_page(*pte), 0);
> >> +				pages++;
> >> +			}
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pte_clear(&init_mm, addr, pte);
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +		} else {
> >> +			/*
> >> +			 * If we are not removing the whole page, it means
> >> +			 * other ptes in this page are being used and we canot
> >> +			 * remove them. So fill the unused ptes with 0xFD, and
> >> +			 * remove the page when it is wholly filled with 0xFD.
> >> +			 */
> >> +			memset((void *)addr, PAGE_INUSE, next - addr);
> >> +			page_addr = page_address(pte_page(*pte));
> >> +
> >> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> >> +				free_pagetable(pte_page(*pte), 0);
> >> +				pages++;
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pte_clear(&init_mm, addr, pte);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +			}
> >> +		}
> >> +	}
> >> +
> >> +	/* Call free_pte_table() in remove_pmd_table(). */
> >> +	flush_tlb_all();
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_4K, -pages);
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pte_phys, next, pages = 0;
> >> +	pte_t *pte_base;
> >> +	pmd_t *pmd;
> >> +
> >> +	pmd = pmd_start + pmd_index(addr);
> >> +	for (; addr < end; addr = next, pmd++) {
> >> +		next = pmd_addr_end(addr, end);
> >> +
> >> +		if (!pmd_present(*pmd))
> >> +			continue;
> >> +
> >> +		if (pmd_large(*pmd)) {
> >> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> >> +			    IS_ALIGNED(next, PMD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pmd_page(*pmd),
> >> +						       get_order(PMD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pmd_clear(pmd);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 2M page, but we need to remove part of them,
> >> +			 * so split 2M page to 4K page.
> >> +			 */
> >> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> >> +			BUG_ON(!pte_base);
> >> +			__split_large_page((pte_t *)pmd, addr,
> >> +					   (pte_t *)pte_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> >> +		remove_pte_table(pte_base, addr, next, direct);
> >> +		free_pte_table(pte_base, pmd);
> >> +		unmap_low_page(pte_base);
> >> +	}
> >> +
> >> +	/* Call free_pmd_table() in remove_pud_table(). */
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_2M, -pages);
> >> +}
> >> +
> >> +static void __meminit
> >> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> >> +		 bool direct)
> >> +{
> >> +	unsigned long pmd_phys, next, pages = 0;
> >> +	pmd_t *pmd_base;
> >> +	pud_t *pud;
> >> +
> >> +	pud = pud_start + pud_index(addr);
> >> +	for (; addr < end; addr = next, pud++) {
> >> +		next = pud_addr_end(addr, end);
> >> +
> >> +		if (!pud_present(*pud))
> >> +			continue;
> >> +
> >> +		if (pud_large(*pud)) {
> >> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> >> +			    IS_ALIGNED(next, PUD_SIZE)) {
> >> +				if (!direct) {
> >> +					free_pagetable(pud_page(*pud),
> >> +						       get_order(PUD_SIZE));
> >> +					pages++;
> >> +				}
> >> +
> >> +				spin_lock(&init_mm.page_table_lock);
> >> +				pud_clear(pud);
> >> +				spin_unlock(&init_mm.page_table_lock);
> >> +				continue;
> >> +			}
> >> +
> >> +			/*
> >> +			 * We use 1G page, but we need to remove part of them,
> >> +			 * so split 1G page to 2M page.
> >> +			 */
> >> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> >> +			BUG_ON(!pmd_base);
> >> +			__split_large_page((pte_t *)pud, addr,
> >> +					   (pte_t *)pmd_base);
> >> +
> >> +			spin_lock(&init_mm.page_table_lock);
> >> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> >> +			spin_unlock(&init_mm.page_table_lock);
> >> +
> >> +			flush_tlb_all();
> >> +		}
> >> +
> >> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> >> +		remove_pmd_table(pmd_base, addr, next, direct);
> >> +		free_pmd_table(pmd_base, pud);
> >> +		unmap_low_page(pmd_base);
> >> +	}
> >> +
> >> +	if (direct)
> >> +		update_page_count(PG_LEVEL_1G, -pages);
> >> +}
> >> +
> >> +/* start and end are both virtual address. */
> >> +static void __meminit
> >> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> >> +{
> >> +	unsigned long next;
> >> +	pgd_t *pgd;
> >> +	pud_t *pud;
> >> +	bool pgd_changed = false;
> >> +
> >> +	for (; start < end; start = next) {
> >> +		next = pgd_addr_end(start, end);
> >> +
> >> +		pgd = pgd_offset_k(start);
> >> +		if (!pgd_present(*pgd))
> >> +			continue;
> >> +
> >> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> >> +		remove_pud_table(pud, start, next, direct);
> >> +		if (free_pud_table(pud, pgd))
> >> +			pgd_changed = true;
> >> +		unmap_low_page(pud);
> >> +	}
> >> +
> >> +	if (pgd_changed)
> >> +		sync_global_pgds(start, end - 1);
> >> +
> >> +	flush_tlb_all();
> >> +}
> >> +
> >>  #ifdef CONFIG_MEMORY_HOTREMOVE
> >>  int __ref arch_remove_memory(u64 start, u64 size)
> >>  {
> >> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> >> index a718e0d..7dcb6f9 100644
> >> --- a/arch/x86/mm/pageattr.c
> >> +++ b/arch/x86/mm/pageattr.c
> >> @@ -501,21 +501,13 @@ out_unlock:
> >>  	return do_split;
> >>  }
> >>  
> >> -static int split_large_page(pte_t *kpte, unsigned long address)
> >> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
> >>  {
> >>  	unsigned long pfn, pfninc = 1;
> >>  	unsigned int i, level;
> >> -	pte_t *pbase, *tmp;
> >> +	pte_t *tmp;
> >>  	pgprot_t ref_prot;
> >> -	struct page *base;
> >> -
> >> -	if (!debug_pagealloc)
> >> -		spin_unlock(&cpa_lock);
> >> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> -	if (!debug_pagealloc)
> >> -		spin_lock(&cpa_lock);
> >> -	if (!base)
> >> -		return -ENOMEM;
> >> +	struct page *base = virt_to_page(pbase);
> >>  
> >>  	spin_lock(&pgd_lock);
> >>  	/*
> >> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>  	 * up for us already:
> >>  	 */
> >>  	tmp = lookup_address(address, &level);
> >> -	if (tmp != kpte)
> >> -		goto out_unlock;
> >> +	if (tmp != kpte) {
> >> +		spin_unlock(&pgd_lock);
> >> +		return 1;
> >> +	}
> >>  
> >> -	pbase = (pte_t *)page_address(base);
> >>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
> >>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
> >>  	/*
> >> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
> >>  	 * going on.
> >>  	 */
> >>  	__flush_tlb_all();
> >> +	spin_unlock(&pgd_lock);
> >>  
> >> -	base = NULL;
> >> +	return 0;
> >> +}
> >>  
> >> -out_unlock:
> >> -	/*
> >> -	 * If we dropped out via the lookup_address check under
> >> -	 * pgd_lock then stick the page back into the pool:
> >> -	 */
> >> -	if (base)
> >> +static int split_large_page(pte_t *kpte, unsigned long address)
> >> +{
> >> +	pte_t *pbase;
> >> +	struct page *base;
> >> +
> >> +	if (!debug_pagealloc)
> >> +		spin_unlock(&cpa_lock);
> >> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> >> +	if (!debug_pagealloc)
> >> +		spin_lock(&cpa_lock);
> >> +	if (!base)
> >> +		return -ENOMEM;
> >> +
> >> +	pbase = (pte_t *)page_address(base);
> >> +	if (__split_large_page(kpte, address, pbase))
> >>  		__free_page(base);
> >> -	spin_unlock(&pgd_lock);
> >>  
> >>  	return 0;
> >>  }
> >> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> >> index 3f778c2..190ff06 100644
> >> --- a/include/linux/bootmem.h
> >> +++ b/include/linux/bootmem.h
> >> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
> >>  			      unsigned long size);
> >>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
> >>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> >> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
> >>  
> >>  /*
> >>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Jianguo Wu @ 2013-01-30  1:53 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, Tang Chen, linux-mm, paulus, hpa,
	sparclinux, cl, linux-s390, x86, linux-acpi, isimatu.yasuaki,
	linfeng, mgorman, kosaki.motohiro, rientjes, len.brown, wency,
	cmetcalf, glommer, yinghai, laijs, linux-kernel, minchan.kim,
	akpm, linuxppc-dev
In-Reply-To: <1359464544.1624.16.camel@kernel>

On 2013/1/29 21:02, Simon Jeons wrote:

> Hi Tang,
> On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
>> From: Wen Congyang <wency@cn.fujitsu.com>
>>
>> When memory is removed, the corresponding pagetables should also be removed.
>> This patch introduces some common APIs to support removing vmemmap pagetables
>> and x86_64 architecture pagetables.
>>
> 
> When is the page table of hot-added memory created?


Hi Simon,

For x86_64, the page table of hot-added memory is created by:
    add_memory->arch_add_memory->init_memory_mapping->kernel_physical_mapping_init

> 
>> Not all pages of the virtual mapping in removed memory can be freed if some
>> pages used as PGD/PUD cover not only removed memory but also other memory. So
>> the patch checks whether a page can be freed as follows:
>>
>>  1. When removing memory, the page structs of the removed memory are filled
>>     with 0xFD.
>>  2. If all page structs on a PT/PMD are filled with 0xFD, the PT/PMD can be
>>     cleared. In this case, the page used as PT/PMD can be freed.
>>
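The two steps above can be sketched in plain userspace C. This is only a
rough model under stated assumptions, not the kernel code: SKETCH_PAGE_SIZE,
page_fully_marked() and mark_removed() are hypothetical names made up for the
illustration, and page_fully_marked() stands in for the memchr_inv() check the
patch uses. The marker value matches PAGE_INUSE (0xFD) from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE  4096u	/* assumed page size for this sketch */
#define SKETCH_PAGE_INUSE 0xFD	/* same marker value as PAGE_INUSE in the patch */

/*
 * Hypothetical stand-in for the memchr_inv() test in the patch:
 * return true when every byte of the page equals the marker.
 */
static bool page_fully_marked(const unsigned char *page)
{
	size_t i;

	for (i = 0; i < SKETCH_PAGE_SIZE; i++)
		if (page[i] != SKETCH_PAGE_INUSE)
			return false;
	return true;
}

/*
 * Step 1: fill the removed range [off, off + len) with the marker.
 * Step 2: report whether the whole page is now marked, i.e. whether
 * the page backing it could be freed.
 */
static bool mark_removed(unsigned char *page, size_t off, size_t len)
{
	if (off > SKETCH_PAGE_SIZE || len > SKETCH_PAGE_SIZE - off)
		return false;	/* range does not fit in the page */

	memset(page + off, SKETCH_PAGE_INUSE, len);
	return page_fully_marked(page);
}
```

With a zeroed page, removing [0, 1024) and then [2048, 4096) leaves the page
in use; only once [1024, 2048) is removed as well does mark_removed() report
that the page can go.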
>> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
>> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
>> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
>> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
>> ---
>>  arch/x86/include/asm/pgtable_types.h |    1 +
>>  arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>>  arch/x86/mm/pageattr.c               |   47 +++---
>>  include/linux/bootmem.h              |    1 +
>>  4 files changed, 326 insertions(+), 22 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
>> index 3c32db8..4b6fd2a 100644
>> --- a/arch/x86/include/asm/pgtable_types.h
>> +++ b/arch/x86/include/asm/pgtable_types.h
>> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>>   * as a pte too.
>>   */
>>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
>> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>>  
>>  #endif	/* !__ASSEMBLY__ */
>>  
>> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
>> index 9ac1723..fe01116 100644
>> --- a/arch/x86/mm/init_64.c
>> +++ b/arch/x86/mm/init_64.c
>> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>>  }
>>  EXPORT_SYMBOL_GPL(arch_add_memory);
>>  
>> +#define PAGE_INUSE 0xFD
>> +
>> +static void __meminit free_pagetable(struct page *page, int order)
>> +{
>> +	struct zone *zone;
>> +	bool bootmem = false;
>> +	unsigned long magic;
>> +	unsigned int nr_pages = 1 << order;
>> +
>> +	/* bootmem page has reserved flag */
>> +	if (PageReserved(page)) {
>> +		__ClearPageReserved(page);
>> +		bootmem = true;
>> +
>> +		magic = (unsigned long)page->lru.next;
>> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
>> +			while (nr_pages--)
>> +				put_page_bootmem(page++);
>> +		} else
>> +			__free_pages_bootmem(page, order);
>> +	} else
>> +		free_pages((unsigned long)page_address(page), order);
>> +
>> +	/*
>> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
>> +	 * are all allocated by bootmem.
>> +	 */
>> +	if (bootmem) {
>> +		zone = page_zone(page);
>> +		zone_span_writelock(zone);
>> +		zone->present_pages += nr_pages;
>> +		zone_span_writeunlock(zone);
>> +		totalram_pages += nr_pages;
>> +	}
>> +}
>> +
>> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
>> +{
>> +	pte_t *pte;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PTE; i++) {
>> +		pte = pte_start + i;
>> +		if (pte_val(*pte))
>> +			return;
>> +	}
>> +
>> +	/* free a pte table */
>> +	free_pagetable(pmd_page(*pmd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pmd_clear(pmd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
>> +{
>> +	pmd_t *pmd;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PMD; i++) {
>> +		pmd = pmd_start + i;
>> +		if (pmd_val(*pmd))
>> +			return;
>> +	}
>> +
>> +	/* free a pmd table */
>> +	free_pagetable(pud_page(*pud), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pud_clear(pud);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +}
>> +
>> +/* Return true if pgd is changed, otherwise return false. */
>> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
>> +{
>> +	pud_t *pud;
>> +	int i;
>> +
>> +	for (i = 0; i < PTRS_PER_PUD; i++) {
>> +		pud = pud_start + i;
>> +		if (pud_val(*pud))
>> +			return false;
>> +	}
>> +
>> +	/* free a pud table */
>> +	free_pagetable(pgd_page(*pgd), 0);
>> +	spin_lock(&init_mm.page_table_lock);
>> +	pgd_clear(pgd);
>> +	spin_unlock(&init_mm.page_table_lock);
>> +
>> +	return true;
>> +}
>> +
>> +static void __meminit
>> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long next, pages = 0;
>> +	pte_t *pte;
>> +	void *page_addr;
>> +	phys_addr_t phys_addr;
>> +
>> +	pte = pte_start + pte_index(addr);
>> +	for (; addr < end; addr = next, pte++) {
>> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
>> +		if (next > end)
>> +			next = end;
>> +
>> +		if (!pte_present(*pte))
>> +			continue;
>> +
>> +		/*
>> +		 * We mapped [0,1G) memory as identity mapping when
>> +		 * initializing, in arch/x86/kernel/head_64.S. These
>> +		 * pagetables cannot be removed.
>> +		 */
>> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
>> +		if (phys_addr < (phys_addr_t)0x40000000)
>> +			return;
>> +
>> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
>> +		    IS_ALIGNED(next, PAGE_SIZE)) {
>> +			if (!direct) {
>> +				free_pagetable(pte_page(*pte), 0);
>> +				pages++;
>> +			}
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pte_clear(&init_mm, addr, pte);
>> +			spin_unlock(&init_mm.page_table_lock);
>> +		} else {
>> +			/*
>> +			 * If we are not removing the whole page, it means
>> +			 * other ptes in this page are being used and we cannot
>> +			 * remove them. So fill the unused ptes with 0xFD, and
>> +			 * remove the page when it is wholly filled with 0xFD.
>> +			 */
>> +			memset((void *)addr, PAGE_INUSE, next - addr);
>> +			page_addr = page_address(pte_page(*pte));
>> +
>> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
>> +				free_pagetable(pte_page(*pte), 0);
>> +				pages++;
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pte_clear(&init_mm, addr, pte);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +			}
>> +		}
>> +	}
>> +
>> +	/* Call free_pte_table() in remove_pmd_table(). */
>> +	flush_tlb_all();
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_4K, -pages);
>> +}
>> +
>> +static void __meminit
>> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long pte_phys, next, pages = 0;
>> +	pte_t *pte_base;
>> +	pmd_t *pmd;
>> +
>> +	pmd = pmd_start + pmd_index(addr);
>> +	for (; addr < end; addr = next, pmd++) {
>> +		next = pmd_addr_end(addr, end);
>> +
>> +		if (!pmd_present(*pmd))
>> +			continue;
>> +
>> +		if (pmd_large(*pmd)) {
>> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
>> +			    IS_ALIGNED(next, PMD_SIZE)) {
>> +				if (!direct) {
>> +					free_pagetable(pmd_page(*pmd),
>> +						       get_order(PMD_SIZE));
>> +					pages++;
>> +				}
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pmd_clear(pmd);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 2M page, but we need to remove part of them,
>> +			 * so split 2M page to 4K page.
>> +			 */
>> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
>> +			BUG_ON(!pte_base);
>> +			__split_large_page((pte_t *)pmd, addr,
>> +					   (pte_t *)pte_base);
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
>> +			spin_unlock(&init_mm.page_table_lock);
>> +
>> +			flush_tlb_all();
>> +		}
>> +
>> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
>> +		remove_pte_table(pte_base, addr, next, direct);
>> +		free_pte_table(pte_base, pmd);
>> +		unmap_low_page(pte_base);
>> +	}
>> +
>> +	/* Call free_pmd_table() in remove_pud_table(). */
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_2M, -pages);
>> +}
>> +
>> +static void __meminit
>> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
>> +		 bool direct)
>> +{
>> +	unsigned long pmd_phys, next, pages = 0;
>> +	pmd_t *pmd_base;
>> +	pud_t *pud;
>> +
>> +	pud = pud_start + pud_index(addr);
>> +	for (; addr < end; addr = next, pud++) {
>> +		next = pud_addr_end(addr, end);
>> +
>> +		if (!pud_present(*pud))
>> +			continue;
>> +
>> +		if (pud_large(*pud)) {
>> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
>> +			    IS_ALIGNED(next, PUD_SIZE)) {
>> +				if (!direct) {
>> +					free_pagetable(pud_page(*pud),
>> +						       get_order(PUD_SIZE));
>> +					pages++;
>> +				}
>> +
>> +				spin_lock(&init_mm.page_table_lock);
>> +				pud_clear(pud);
>> +				spin_unlock(&init_mm.page_table_lock);
>> +				continue;
>> +			}
>> +
>> +			/*
>> +			 * We use 1G page, but we need to remove part of them,
>> +			 * so split 1G page to 2M page.
>> +			 */
>> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
>> +			BUG_ON(!pmd_base);
>> +			__split_large_page((pte_t *)pud, addr,
>> +					   (pte_t *)pmd_base);
>> +
>> +			spin_lock(&init_mm.page_table_lock);
>> +			pud_populate(&init_mm, pud, __va(pmd_phys));
>> +			spin_unlock(&init_mm.page_table_lock);
>> +
>> +			flush_tlb_all();
>> +		}
>> +
>> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
>> +		remove_pmd_table(pmd_base, addr, next, direct);
>> +		free_pmd_table(pmd_base, pud);
>> +		unmap_low_page(pmd_base);
>> +	}
>> +
>> +	if (direct)
>> +		update_page_count(PG_LEVEL_1G, -pages);
>> +}
>> +
>> +/* start and end are both virtual address. */
>> +static void __meminit
>> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
>> +{
>> +	unsigned long next;
>> +	pgd_t *pgd;
>> +	pud_t *pud;
>> +	bool pgd_changed = false;
>> +
>> +	for (; start < end; start = next) {
>> +		next = pgd_addr_end(start, end);
>> +
>> +		pgd = pgd_offset_k(start);
>> +		if (!pgd_present(*pgd))
>> +			continue;
>> +
>> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
>> +		remove_pud_table(pud, start, next, direct);
>> +		if (free_pud_table(pud, pgd))
>> +			pgd_changed = true;
>> +		unmap_low_page(pud);
>> +	}
>> +
>> +	if (pgd_changed)
>> +		sync_global_pgds(start, end - 1);
>> +
>> +	flush_tlb_all();
>> +}
>> +
>>  #ifdef CONFIG_MEMORY_HOTREMOVE
>>  int __ref arch_remove_memory(u64 start, u64 size)
>>  {
>> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
>> index a718e0d..7dcb6f9 100644
>> --- a/arch/x86/mm/pageattr.c
>> +++ b/arch/x86/mm/pageattr.c
>> @@ -501,21 +501,13 @@ out_unlock:
>>  	return do_split;
>>  }
>>  
>> -static int split_large_page(pte_t *kpte, unsigned long address)
>> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>>  {
>>  	unsigned long pfn, pfninc = 1;
>>  	unsigned int i, level;
>> -	pte_t *pbase, *tmp;
>> +	pte_t *tmp;
>>  	pgprot_t ref_prot;
>> -	struct page *base;
>> -
>> -	if (!debug_pagealloc)
>> -		spin_unlock(&cpa_lock);
>> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> -	if (!debug_pagealloc)
>> -		spin_lock(&cpa_lock);
>> -	if (!base)
>> -		return -ENOMEM;
>> +	struct page *base = virt_to_page(pbase);
>>  
>>  	spin_lock(&pgd_lock);
>>  	/*
>> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>  	 * up for us already:
>>  	 */
>>  	tmp = lookup_address(address, &level);
>> -	if (tmp != kpte)
>> -		goto out_unlock;
>> +	if (tmp != kpte) {
>> +		spin_unlock(&pgd_lock);
>> +		return 1;
>> +	}
>>  
>> -	pbase = (pte_t *)page_address(base);
>>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>>  	/*
>> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>>  	 * going on.
>>  	 */
>>  	__flush_tlb_all();
>> +	spin_unlock(&pgd_lock);
>>  
>> -	base = NULL;
>> +	return 0;
>> +}
>>  
>> -out_unlock:
>> -	/*
>> -	 * If we dropped out via the lookup_address check under
>> -	 * pgd_lock then stick the page back into the pool:
>> -	 */
>> -	if (base)
>> +static int split_large_page(pte_t *kpte, unsigned long address)
>> +{
>> +	pte_t *pbase;
>> +	struct page *base;
>> +
>> +	if (!debug_pagealloc)
>> +		spin_unlock(&cpa_lock);
>> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
>> +	if (!debug_pagealloc)
>> +		spin_lock(&cpa_lock);
>> +	if (!base)
>> +		return -ENOMEM;
>> +
>> +	pbase = (pte_t *)page_address(base);
>> +	if (__split_large_page(kpte, address, pbase))
>>  		__free_page(base);
>> -	spin_unlock(&pgd_lock);
>>  
>>  	return 0;
>>  }
>> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
>> index 3f778c2..190ff06 100644
>> --- a/include/linux/bootmem.h
>> +++ b/include/linux/bootmem.h
>> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>>  			      unsigned long size);
>>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
>>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
>> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>>  
>>  /*
>>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

* Re: [PATCH 4/5] net: mvmdio: allow Device Tree and platform device to coexist
From: Florian Fainelli @ 2013-01-29 20:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring,
	Greg Kroah-Hartman, Paul Mackerras, Lennert Buytenhek,
	Rob Landley, netdev, linuxppc-dev, davem, linux-arm-kernel
In-Reply-To: <20130129175912.GE25646@obsidianresearch.com>

On Tuesday 29 January 2013 18:59:12, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2013 at 04:24:07PM +0100, Florian Fainelli wrote:
> > -	dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> > +	if (pdev->dev.of_node) {
> > +		dev->regs = of_iomap(pdev->dev.of_node, 0);
> > +		if (!dev->regs) {
> > +			dev_err(&pdev->dev, "No SMI register address given in DT\n");
> > +			ret = -ENODEV;
> > +			goto out_free;
> > +		}
> > +
> > +		dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> > +	} else {
> > +		r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> > +
> > +		dev->regs = ioremap(r->start, resource_size(r));
> > +		if (!dev->regs) {
> > +			dev_err(&pdev->dev, "No SMI register address given\n");
> > +			ret = -ENODEV;
> > +			goto out_free;
> > +		}
> > +
> > +		dev->err_interrupt = platform_get_irq(pdev, 0);
> > +	}
>
> Why do you have these different paths for OF and platform? AFAIK these
> days when an OF device is automatically converted into a platform
> device all the struct resources are created too, so can't you just
> use platform_get_resource and devm_request_and_ioremap for both flows?
>
> Ditto for the interrupt - platform_get_irq should work in both cases?

There was no particular reason and I updated the patchset to do that precisely
in version 2.
-- 
Florian

* Re: [PATCH 5/5] mv643xx_eth: convert to use the Marvell Orion MDIO driver
From: Florian Fainelli @ 2013-01-29 20:41 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring,
	Greg Kroah-Hartman, Paul Mackerras, Lennert Buytenhek,
	Rob Landley, netdev, linuxppc-dev, davem, linux-arm-kernel
In-Reply-To: <20130129181306.GF25646@obsidianresearch.com>

On Tuesday 29 January 2013 19:13:06, Jason Gunthorpe wrote:
> On Tue, Jan 29, 2013 at 04:24:08PM +0100, Florian Fainelli wrote:
> > This patch converts the Marvell MV643XX ethernet driver to use the
> > Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
> > registering the Marvell MV643XX ethernet driver are also updated to
> > register a Marvell Orion MDIO driver. This driver voluntarily overlaps
> > with the Marvell Ethernet shared registers because it will use a subset
> > of this shared register (shared_base + 0x4 - shared_base + 0x84). The
> > Ethernet driver is also updated to look up a PHY device using the
> > Orion MDIO bus driver.
>
> Can you finish off this job by making the mv643xx_eth driver accept
> the standard phy-handle OF property instead of using a phy address?

I can certainly do that. At the same time, we need to continue supporting the
"old" platform-device-style registration without breaking it (PowerPC in
particular, and the hopefully yet-to-be-converted orion5x). So phy_scan(),
as I modified it, will probably still be there.

>
> Ie the end result should be something like:
>
>                 smi0: mdio@72000 {
>                         device_type = "mdio";
>                         compatible = "marvell,orion-mdio";
>                         reg = <0x72004 0x4>;
>
>                         #address-cells = <1>;
>                         #size-cells = <0>;
>                         PHY1: ethernet-phy@1 {
>                                 reg = <1>;
>                                 device_type = "ethernet-phy";
>                                 phy-id = <0x01410e90>;
>                         };
>                 };
>
>                 egiga0 {
>                         device_type = "network";
>                         compatible = "marvell,mv643xx-eth";
>                         reg = <0x72000 0x4000>;
>                         port_number = <0>;
>                         phy-handle = <&PHY1>;
>                         interrupts = <11>;
>                         local-mac-address = [000000000002];  /* Filled by boot loader */
>                 };
>
> Regards,
> Jason

-- 
Florian

* Re: [PATCH 2/2] pseries/iommu: remove DDW on kexec
From: Nishanth Aravamudan @ 2013-01-29 20:33 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: miltonm, paulus, anton, nfont, linuxppc-dev
In-Reply-To: <1359457108.26096.7.camel@concordia>

Hi Michael,

On 29.01.2013 [21:58:28 +1100], Michael Ellerman wrote:
> On Mon, 2013-01-28 at 18:03 -0800, Nishanth Aravamudan wrote:
> > pseries/iommu: remove DDW on kexec
> >  ...
> >     
> > I believe the simplest, easiest-to-maintain fix is to just change our
> > initcall to, rather than detecting and updating the new kernel's DDW
> > knowledge, just remove all DDW configurations. When the drivers
> > re-initialize, we will set everything back up as it was before.
> 
> I don't know this code at all, but this sounds like it will also work
> for kdump, right? ie. when the original kernel has crashed the 2nd
> kernel will tear the DDW down and set it back up.

Yes, my actual test-case (and what was reported as broken) was kdump.
From my relatively vague (but now growing) understanding of that
process, kdump does use kexec under the covers to switch to the crash
kernel, and so does get the same benefit from this change.

Another data point, though: it might make sense to recommend using
disable_ddw on the kdump kernel command line anyway (I'm working on
figuring this out for the distros, etc.), as DDW isn't 'free' and it's
unclear whether performance is a huge concern for the crash kernel (that
sort of varies with where your storage is and how much you need to dump,
which for kdump generally doesn't seem like that much?).

Thanks,
Nish
