From mboxrd@z Thu Jan 1 00:00:00 1970 From: Jianguo Wu Date: Fri, 07 Dec 2012 02:20:16 +0000 Subject: Re: [Patch v4 08/12] memory-hotplug: remove memmap of sparse-vmemmap Message-Id: <50C15260.3050608@huawei.com> List-Id: References: <1354010422-19648-1-git-send-email-wency@cn.fujitsu.com> <1354010422-19648-9-git-send-email-wency@cn.fujitsu.com> <50B5DC00.20103@huawei.com> <50B80FB1.6040906@cn.fujitsu.com> <50BC0D2D.8040008@huawei.com> <50C14976.2050606@cn.fujitsu.com> In-Reply-To: <50C14976.2050606@cn.fujitsu.com> MIME-Version: 1.0 Content-Type: text/plain; charset="windows-1252" Content-Transfer-Encoding: quoted-printable To: Tang Chen Cc: Wen Congyang , x86@kernel.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org, linux-acpi@vger.kernel.org, linux-s390@vger.kernel.org, linux-sh@vger.kernel.org, linux-ia64@vger.kernel.org, cmetcalf@tilera.com, sparclinux@vger.kernel.org, David Rientjes , Jiang Liu , Len Brown , benh@kernel.crashing.org, paulus@samba.org, Christoph Lameter , Minchan Kim , Andrew Morton , KOSAKI Motohiro , Yasuaki Ishimatsu Hi Tang, On 2012/12/7 9:42, Tang Chen wrote: > Hi Wu, >=20 > I met some problems when I was digging into the code. It's very > kind of you if you could help me with that. :) >=20 > If I misunderstood your code, please tell me. > Please see below. :) >=20 > On 12/03/2012 10:23 AM, Jianguo Wu wrote: >> Signed-off-by: Jianguo Wu >> Signed-off-by: Jiang Liu >> --- >> include/linux/mm.h | 1 + >> mm/sparse-vmemmap.c | 231 +++++++++++++++++++++++++++++++++++++++++++= ++++++++ >> mm/sparse.c | 3 +- >> 3 files changed, 234 insertions(+), 1 deletions(-) >> >> diff --git a/include/linux/mm.h b/include/linux/mm.h >> index 5657670..1f26af5 100644 >> --- a/include/linux/mm.h >> +++ b/include/linux/mm.h >> @@ -1642,6 +1642,7 @@ int vmemmap_populate(struct page *start_page, unsi= gned long pages, int node); >> void vmemmap_populate_print_last(void); >> void register_page_bootmem_memmap(unsigned long section_nr, struct pag= e *map, >> unsigned long size); >> +void vmemmap_free(struct page *memmap, unsigned long nr_pages); >> >> enum mf_flags { >> MF_COUNT_INCREASED =3D 1<< 0, >> diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c >> index 1b7e22a..748732d 100644 >> --- a/mm/sparse-vmemmap.c >> +++ b/mm/sparse-vmemmap.c >> @@ -29,6 +29,10 @@ >> #include >> #include >> >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> +#include >> +#endif >> + >> /* >> * Allocate a block of memory to be used to back the virtual memory map >> * or to back the page tables that are used to create the mapping. >> @@ -224,3 +228,230 @@ void __init sparse_mem_maps_populate_node(struct p= age **map_map, >> vmemmap_buf_end =3D NULL; >> } >> } >> + >> +#ifdef CONFIG_MEMORY_HOTREMOVE >> + >> +#define PAGE_INUSE 0xFD >> + >> +static void vmemmap_free_pages(struct page *page, int order) >> +{ >> + struct zone *zone; >> + unsigned long magic; >> + >> + magic =3D (unsigned long) page->lru.next; >> + if (magic =3D SECTION_INFO || magic =3D MIX_SECTION_INFO) { >> + put_page_bootmem(page); >> + >> + zone =3D page_zone(page); >> + zone_span_writelock(zone); >> + zone->present_pages++; >> + zone_span_writeunlock(zone); >> + totalram_pages++; >> + } else >> + free_pages((unsigned long)page_address(page), order); >=20 > Here, I think SECTION_INFO and MIX_SECTION_INFO pages are all allocated > by bootmem, so I put this function this way. >=20 > I'm not sure if parameter order is necessary here. It will always be 0 > in your code. Is this OK to you ? >=20 parameter order is necessary in cpu_has_pse case: vmemmap_pmd_remove free_pagetable(pmd_page(*pmd), get_order(PMD_SIZE)) > static void free_pagetable(struct page *page) > { > struct zone *zone; > bool bootmem =3D false; > unsigned long magic; >=20 > /* bootmem page has reserved flag */ > if (PageReserved(page)) { > __ClearPageReserved(page); > bootmem =3D true; > } >=20 > magic =3D (unsigned long) page->lru.next; > if (magic =3D SECTION_INFO || magic =3D MIX_SECTION_INFO) > put_page_bootmem(page); > else > __free_page(page); >=20 > /* > * SECTION_INFO pages and MIX_SECTION_INFO pages > * are all allocated by bootmem. > */ > if (bootmem) { > zone =3D page_zone(page); > zone_span_writelock(zone); > zone->present_pages++; > zone_span_writeunlock(zone); > totalram_pages++; > } > } >=20 > (snip) >=20 >> + >> +static void vmemmap_pte_remove(pmd_t *pmd, unsigned long addr, unsigned= long end) >> +{ >> + pte_t *pte; >> + unsigned long next; >> + void *page_addr; >> + >> + pte =3D pte_offset_kernel(pmd, addr); >> + for (; addr< end; pte++, addr +=3D PAGE_SIZE) { >> + next =3D (addr + PAGE_SIZE)& PAGE_MASK; >> + if (next> end) >> + next =3D end; >> + >> + if (pte_none(*pte)) >=20 > Here, you checked xxx_none() in your vmemmap_xxx_remove(), but you used > !xxx_present() in your x86_64 patches. Is it OK if I only check > !xxx_present() ? It is Ok. >=20 >> + continue; >> + if (IS_ALIGNED(addr, PAGE_SIZE)&& >> + IS_ALIGNED(next, PAGE_SIZE)) { >> + vmemmap_free_pages(pte_page(*pte), 0); >> + spin_lock(&init_mm.page_table_lock); >> + pte_clear(&init_mm, addr, pte); >> + spin_unlock(&init_mm.page_table_lock); >> + } else { >> + /* >> + * Removed page structs are filled with 0xFD. >> + */ >> + memset((void *)addr, PAGE_INUSE, next - addr); >> + page_addr =3D page_address(pte_page(*pte)); >> + >> + if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) { >> + spin_lock(&init_mm.page_table_lock); >> + pte_clear(&init_mm, addr, pte); >> + spin_unlock(&init_mm.page_table_lock); >=20 > Here, since we clear pte, we should also free the page, right ? >=20 Right, I forgot here, sorry. >> + } >> + } >> + } >> + >> + free_pte_table(pmd); >> + __flush_tlb_all(); >> +} >> + >> +static void vmemmap_pmd_remove(pud_t *pud, unsigned long addr, unsigned= long end) >> +{ >> + unsigned long next; >> + pmd_t *pmd; >> + >> + pmd =3D pmd_offset(pud, addr); >> + for (; addr< end; addr =3D next, pmd++) { >> + next =3D (addr, end); >=20 > And by the way, there isn't pte_addr_end() in kernel, why ? > I saw you calculated it like this: >=20 > next =3D (addr + PAGE_SIZE) & PAGE_MASK; > if (next > end) > next =3D end; >=20 > This logic is very similar to {pmd|pud|pgd}_addr_end(). Shall we add a > pte_addr_end() or something ? :) Maybe just keep this for now if no other place need pte_addr_end()? > Since there is no such code in kernel for a long time, I think there > must be some reasons. Maybe in current kernel, doesn't deal not PTE_SIZE alignment address=EF=BC= =9F =20 >=20 > I merged free_xxx_table() and remove_xxx_table() as common interfaces. Greate! Thanks for your work:). >=20 > And again, thanks for your patient and nice explanation. :) >=20 > (snip) >=20 > . >=20