Linux-mm Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH] mm: kmemleak: Treat vm_struct as alternative reference to vmalloc'ed objects
From: Catalin Marinas @ 2017-05-24 16:52 UTC (permalink / raw)
  To: Andy Lutomirski
  Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Michal Hocko,
	Luis R. Rodriguez
In-Reply-To: <CALCETrVaFPjQrVAiOad6GhFvK=AQphF0Kx5zDsCcAt4bPfQbnw@mail.gmail.com>

On Mon, May 22, 2017 at 11:19:08AM -0700, Andy Lutomirski wrote:
> On Mon, May 22, 2017 at 10:35 AM, Catalin Marinas
> <catalin.marinas@arm.com> wrote:
> > Kmemleak requires that vmalloc'ed objects have a minimum reference count
> > of 2: one in the corresponding vm_struct object and the other owned by
> > the vmalloc() caller. There are cases, however, where the original
> > vmalloc() returned pointer is lost and, instead, a pointer to vm_struct
> > is stored (see free_thread_stack()). Kmemleak currently reports such
> > objects as leaks.
> >
> > This patch adds support for treating any surplus references to an object
> > as additional references to a specified object. It introduces the
> > kmemleak_vmalloc() API function which takes a vm_struct pointer and sets
> > its surplus reference passing to the actual vmalloc() returned pointer.
> > The __vmalloc_node_range() calling site has been modified accordingly.
> >
> > An unrelated minor change is included in this patch to change the type
> > of kmemleak_object.flags to unsigned int (previously unsigned long).
> >
> > Reported-by: "Luis R. Rodriguez" <mcgrof@kernel.org>
> > Cc: Michal Hocko <mhocko@kernel.org>
> > Cc: Andy Lutomirski <luto@amacapital.net>
> > Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
> > ---
> >
> > As per [1], I added support to use pointers to vm_struct as an
> > alternative way to avoid false positives when the original vmalloc()
> > pointer has been lost. This is slightly harder to reason about but it
> > seems to work for this use-case. I'm not aware of other cases (than
> > free_thread_stack()) where the original vmalloc() pointer is removed in
> > favour of a vm_struct one.
> >
> > An alternative implementation (simpler to understand), if preferred, is
> > to annotate alloc_thread_stack_node() and free_thread_stack() with
> > kmemleak_unignore()/kmemleak_ignore() calls and proper comments.
> >
> 
> I personally prefer the option in this patch.  It keeps the special
> case in kmemleak and the allocation code rather than putting it in the
> consumer code.
> 
> Also, I want to add an API at some point that vmallocs some memory and
> returns the vm_struct directly.  That won't work with explicit
> annotations in the caller because kmemleak might think it's leaked
> before the caller can execute the annotations.

While kmemleak delays the reporting of newly allocated objects to avoid
such race, we need to keep annotations to a minimum anyway (only for
special cases, definitely not for each caller of an allocation API). The
proposed kmemleak_vmalloc() API in this patch would cover your case
without any additional annotation.

-- 
Catalin

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v1 00/11] mm/kasan: support per-page shadow memory to reduce memory consumption
From: Dmitry Vyukov @ 2017-05-24 16:31 UTC (permalink / raw)
  To: Joonsoo Kim
  Cc: Andrey Ryabinin, Andrew Morton, Alexander Potapenko, kasan-dev,
	linux-mm@kvack.org, LKML, Thomas Gleixner, Ingo Molnar,
	H . Peter Anvin, kernel-team
In-Reply-To: <20170524060432.GA8672@js1304-desktop>

On Wed, May 24, 2017 at 8:04 AM, Joonsoo Kim <js1304@gmail.com> wrote:
>> >> > From: Joonsoo Kim <iamjoonsoo.kim@lge.com>
>> >> >
>> >> > Hello, all.
>> >> >
>> >> > This is an attempt to recude memory consumption of KASAN. Please see
>> >> > following description to get the more information.
>> >> >
>> >> > 1. What is per-page shadow memory
>> >> >
>> >> > This patch introduces infrastructure to support per-page shadow memory.
>> >> > Per-page shadow memory is the same with original shadow memory except
>> >> > the granualarity. It's one byte shows the shadow value for the page.
>> >> > The purpose of introducing this new shadow memory is to save memory
>> >> > consumption.
>> >> >
>> >> > 2. Problem of current approach
>> >> >
>> >> > Until now, KASAN needs shadow memory for all the range of the memory
>> >> > so the amount of statically allocated memory is so large. It causes
>> >> > the problem that KASAN cannot run on the system with hard memory
>> >> > constraint. Even if KASAN can run, large memory consumption due to
>> >> > KASAN changes behaviour of the workload so we cannot validate
>> >> > the moment that we want to check.
>> >> >
>> >> > 3. How does this patch fix the problem
>> >> >
>> >> > This patch tries to fix the problem by reducing memory consumption for
>> >> > the shadow memory. There are two observations.
>> >> >
>> >>
>> >>
>> >> I think that the best way to deal with your problem is to increase shadow scale size.
>> >>
>> >> You'll need to add tunable to gcc to control shadow size. I expect that gcc has some
>> >> places where 8-shadow scale size is hardcoded, but it should be fixable.
>> >>
>> >> The kernel also have some small amount of code written with KASAN_SHADOW_SCALE_SIZE == 8 in mind,
>> >> which should be easy to fix.
>> >>
>> >> Note that bigger shadow scale size requires bigger alignment of allocated memory and variables.
>> >> However, according to comments in gcc/asan.c gcc already aligns stack and global variables and at
>> >> 32-bytes boundary.
>> >> So we could bump shadow scale up to 32 without increasing current stack consumption.
>> >>
>> >> On a small machine (1Gb) 1/32 of shadow is just 32Mb which is comparable to yours 30Mb, but I expect it to be
>> >> much faster. More importantly, this will require only small amount of simple changes in code, which will be
>> >> a *lot* more easier to maintain.
>>
>>
>> Interesting option. We never considered increasing scale in user space
>> due to performance implications. But the algorithm always supported up
>> to 128x scale. Definitely worth considering as an option.
>
> Could you explain me how does increasing scale reduce performance? I
> tried to guess the reason but failed.


The main reason is inline instrumentation. Inline instrumentation for
a check of 8-byte access (which are very common in 64-bit code) is
just a check of the shadow byte for 0. For smaller accesses we have
more complex instrumentation that first checks shadow for 0 and then
does precise check based on size/offset of the access + shadow value.
That's slower and also increases register pressure and code size
(which can further reduce performance due to icache overflow). If we
increase scale to 16/32, all accesses will need that slow path.
Another thing is stack instrumentation: larger scale will require
larger redzones to ensure proper alignment. That will increase stack
frames and also more instructions to poison/unpoison stack shadow on
function entry/exit.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] mm/hugetlb: Report -EHWPOISON not -EFAULT when FOLL_HWPOISON is specified
From: James Morse @ 2017-05-24 16:09 UTC (permalink / raw)
  To: linux-mm; +Cc: Kirill A . Shutemov, Andrew Morton, james.morse, Punit Agrawal

KVM uses get_user_pages() to resolve its stage2 faults. KVM sets the
FOLL_HWPOISON flag causing faultin_page() to return -EHWPOISON when it
finds a VM_FAULT_HWPOISON. KVM handles these hwpoison pages as a special
case. (check_user_page_hwpoison())

When huge pages are involved, this doesn't work so well. get_user_pages()
calls follow_hugetlb_page(), which stops early if it receives
VM_FAULT_HWPOISON from hugetlb_fault(), eventually returning -EFAULT to
the caller. The step to map this to -EHWPOISON based on the FOLL_ flags
is missing. The hwpoison special case is skipped, and -EFAULT is returned
to user-space, causing Qemu or kvmtool to exit.

Instead, move this VM_FAULT_ to errno mapping code into a header file
and use it from faultin_page() and follow_hugetlb_page().

With this, KVM works as expected.

CC: Punit Agrawal <punit.agrawal@arm.com>
Signed-off-by: James Morse <james.morse@arm.com>
---
This isn't a problem for arm64 today as we haven't enabled MEMORY_FAILURE,
but I can't see any reason this doesn't happen on x86 too, so I think this
should be a fix. This doesn't apply earlier than stable's v4.11.1 due to
all sorts of cleanup. My best offer is:
Cc: stable@vger.kernel.org # 4.11.1

 include/linux/mm.h | 10 ++++++++++
 mm/gup.c           |  9 +++------
 mm/hugetlb.c       |  3 +++
 3 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 7cb17c6b97de..48b47c214c50 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2327,6 +2327,16 @@ static inline struct page *follow_page(struct vm_area_struct *vma,
 #define FOLL_REMOTE	0x2000	/* we are working on non-current tsk/mm */
 #define FOLL_COW	0x4000	/* internal GUP flag */
 
+static inline int vm_fault_to_errno(int vm_fault, int foll_flags) {
+	if (vm_fault & VM_FAULT_OOM)
+		return -ENOMEM;
+	if (vm_fault & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
+		return (foll_flags & FOLL_HWPOISON) ? -EHWPOISON : -EFAULT;
+	if (vm_fault & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
+		return -EFAULT;
+	return 0;
+}
+
 typedef int (*pte_fn_t)(pte_t *pte, pgtable_t token, unsigned long addr,
 			void *data);
 extern int apply_to_page_range(struct mm_struct *mm, unsigned long address,
diff --git a/mm/gup.c b/mm/gup.c
index d9e6fddcc51f..69f6cec279b3 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -407,12 +407,9 @@ static int faultin_page(struct task_struct *tsk, struct vm_area_struct *vma,
 
 	ret = handle_mm_fault(vma, address, fault_flags);
 	if (ret & VM_FAULT_ERROR) {
-		if (ret & VM_FAULT_OOM)
-			return -ENOMEM;
-		if (ret & (VM_FAULT_HWPOISON | VM_FAULT_HWPOISON_LARGE))
-			return *flags & FOLL_HWPOISON ? -EHWPOISON : -EFAULT;
-		if (ret & (VM_FAULT_SIGBUS | VM_FAULT_SIGSEGV))
-			return -EFAULT;
+		int err = vm_fault_to_errno(ret, *flags);
+		if (err)
+			return err;
 		BUG();
 	}
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e5828875f7bb..08f69dadbc63 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4170,7 +4170,10 @@ long follow_hugetlb_page(struct mm_struct *mm, struct vm_area_struct *vma,
 			}
 			ret = hugetlb_fault(mm, vma, vaddr, fault_flags);
 			if (ret & VM_FAULT_ERROR) {
+				int err = vm_fault_to_errno(ret, flags);
 				remainder = 0;
+				if (err)
+					return err;
 				break;
 			}
 			if (ret & VM_FAULT_RETRY) {
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH 0/6] refine and rename slub sysfs
From: Christoph Lameter @ 2017-05-24 16:03 UTC (permalink / raw)
  To: Wei Yang; +Cc: Michal Hocko, penberg, rientjes, akpm, linux-mm, linux-kernel
In-Reply-To: <20170524152124.GB8445@WeideMacBook-Pro.local>

On Wed, 24 May 2017, Wei Yang wrote:

> >
> >Who is going to use those new entries and for what purpose? Why do we
> >want to expose even more details of the slab allocator to the userspace.
> >Is the missing information something fundamental that some user space
> >cannot work without it? Seriously these are essential questions you
> >should have answer for _before_ posting the patch and mention all those
> >reasons in the changelog.
>
> It is me who wants to get more details of the slub behavior.
> AFAIK, no one else is expecting this.

I would appreciate some clearer structured statistics. These are important
for diagnostics and for debugging. Do not go overboard with this. Respin
it and provide also a cleanup of the slabinfo tool? I would appreciate it.

> Hmm, if we really don't want to export these entries, why not remove related
> code? Looks we are sure they will not be touched.

Please have a look at the slabinfo code which depends on those fields in
order to display slab information. I have patchsets here that will add
more functionality to slab and those will also add additional fields to
sysfs.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: hwpoison: Use compound_head() flags for huge pages
From: Punit Agrawal @ 2017-05-24 15:57 UTC (permalink / raw)
  To: James Morse; +Cc: linux-mm, Naoya Horiguchi
In-Reply-To: <20170524130204.21845-1-james.morse@arm.com>

James Morse <james.morse@arm.com> writes:

> memory_failure() chooses a recovery action function based on the page
> flags. For huge pages it uses the tail page flags which don't have
> anything interesting set, resulting in:
>> Memory failure: 0x9be3b4: Unknown page state
>> Memory failure: 0x9be3b4: recovery action for unknown page: Failed
>
> Instead, save a copy of the head page's flags if this is a huge page,
> this means if there are no relevant flags for this tail page, we use
> the head pages flags instead. This results in the me_huge_page()
> recovery action being called:
>> Memory failure: 0x9b7969: recovery action for huge page: Delayed
>
> For hugepages that have not yet been allocated, this allows the hugepage
> to be dequeued.
>
> CC: Punit Agrawal <punit.agrawal@arm.com>
> Signed-off-by: James Morse <james.morse@arm.com>
> ---
> This is intended as a fix, but I can't find the patch that introduced this
> behaviour. (not recent, and there is a lot of history down there!)
>
> This doesn't apply to stable trees before v3.10...
> Cc: stable@vger.kernel.org # 3.10.105
>
>  mm/memory-failure.c | 5 ++++-
>  1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 2527dfeddb00..44a6a33af219 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1184,7 +1184,10 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
>  	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
>  	 * correctly, we save a copy of the page flags at this time.
>  	 */
> -	page_flags = p->flags;
> +	if (PageHuge(p))
> +		page_flags = hpage->flags;
> +	else
> +		page_flags = p->flags;
>  
>  	/*
>  	 * unpoison always clear PG_hwpoison inside page lock

I can confirm that this patch reduces the number of failing cases when
running hugepage tests from mce-tests suite.

FWIW,

Acked-by: Punit Agrawal <punit.agrawal@arm.com>
Tested-by: Punit Agrawal <punit.agrawal@arm.com>

Thanks!

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH] mm/migrate: Fix ref-count handling when !hugepage_migration_supported()
From: Punit Agrawal @ 2017-05-24 15:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Punit Agrawal, will.deacon, catalin.marinas, manoj.iyer,
	linux-kernel, linux-arm-kernel, linux-mm, tbaicar, timur,
	Joonsoo Kim, Naoya Horiguchi, Wanpeng Li, Christoph Lameter

On failing to migrate a page, soft_offline_huge_page() performs the
necessary update to the hugepage ref-count. When
!hugepage_migration_supported() , unmap_and_move_hugepage() also
decrements the page ref-count for the hugepage. The combined behaviour
leaves the ref-count in an inconsistent state.

This leads to soft lockups when running the overcommitted hugepage test
from mce-tests suite.

Soft offlining pfn 0x83ed600 at process virtual address 0x400000000000
soft offline: 0x83ed600: migration failed 1, type
1fffc00000008008 (uptodate|head)
INFO: rcu_preempt detected stalls on CPUs/tasks:
 Tasks blocked on level-0 rcu_node (CPUs 0-7): P2715
  (detected by 7, t=5254 jiffies, g=963, c=962, q=321)
  thugetlb_overco R  running task        0  2715   2685 0x00000008
  Call trace:
  [<ffff000008089f90>] dump_backtrace+0x0/0x268
  [<ffff00000808a2d4>] show_stack+0x24/0x30
  [<ffff000008100d34>] sched_show_task+0x134/0x180
  [<ffff0000081c90fc>] rcu_print_detail_task_stall_rnp+0x54/0x7c
  [<ffff00000813cfd4>] rcu_check_callbacks+0xa74/0xb08
  [<ffff000008143a3c>] update_process_times+0x34/0x60
  [<ffff0000081550e8>] tick_sched_handle.isra.7+0x38/0x70
  [<ffff00000815516c>] tick_sched_timer+0x4c/0x98
  [<ffff0000081442e0>] __hrtimer_run_queues+0xc0/0x300
  [<ffff000008144fa4>] hrtimer_interrupt+0xac/0x228
  [<ffff0000089a56d4>] arch_timer_handler_phys+0x3c/0x50
  [<ffff00000812f1bc>] handle_percpu_devid_irq+0x8c/0x290
  [<ffff0000081297fc>] generic_handle_irq+0x34/0x50
  [<ffff000008129f00>] __handle_domain_irq+0x68/0xc0
  [<ffff0000080816b4>] gic_handle_irq+0x5c/0xb0

Fix this by dropping the ref-count decrement in
unmap_and_move_hugepage() when !hugepage_migration_supported().

Fixes: 32665f2bbfed ("mm/migrate: correct failure handling if !hugepage_migration_support()")
Reported-by: Manoj Iyer <manoj.iyer@canonical.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com>
Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Cc: Wanpeng Li <liwanp@linux.vnet.ibm.com>
Cc: Christoph Lameter <cl@linux.com>

--
Hi Andrew,

We ran into this bug when working towards enabling memory corruption
on arm64. The patch was tested on an arm64 platform running v4.12-rc2
with the series to enable memory corruption handling[0].

Please consider merging as a fix for the 4.12 release.

Thanks,
Punit

[0] https://www.spinics.net/lists/arm-kernel/msg581657.html
---
 mm/migrate.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 89a0a1707f4c..187abd1526df 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1201,10 +1201,8 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	 * tables or check whether the hugepage is pmd-based or not before
 	 * kicking migration.
 	 */
-	if (!hugepage_migration_supported(page_hstate(hpage))) {
-		putback_active_hugepage(hpage);
+	if (!hugepage_migration_supported(page_hstate(hpage)))
 		return -ENOSYS;
-	}
 
 	new_hpage = get_new_page(hpage, private, &result);
 	if (!new_hpage)
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [PATCH] mm: per-cgroup memory reclaim stats
From: Johannes Weiner @ 2017-05-24 15:26 UTC (permalink / raw)
  To: Andrew Morton, Roman Gushchin
  Cc: Tejun Heo, Li Zefan, Michal Hocko, Vladimir Davydov, cgroups,
	linux-doc, linux-kernel, linux-mm
In-Reply-To: <1494530183-30808-1-git-send-email-guro@fb.com>

On Thu, May 11, 2017 at 08:16:23PM +0100, Roman Gushchin wrote:
> Track the following reclaim counters for every memory cgroup:
> PGREFILL, PGSCAN, PGSTEAL, PGACTIVATE, PGDEACTIVATE, PGLAZYFREE and
> PGLAZYFREED.
> 
> These values are exposed using the memory.stats interface of cgroup v2.
> 
> The meaning of each value is the same as for global counters,
> available using /proc/vmstat.
> 
> Also, for consistency, rename mem_cgroup_count_vm_event() to
> count_memcg_event_mm().
> 
> Signed-off-by: Roman Gushchin <guro@fb.com>
> Suggested-by: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Johannes Weiner <hannes@cmpxchg.org>
> Cc: Tejun Heo <tj@kernel.org>
> Cc: Li Zefan <lizefan@huawei.com>
> Cc: Michal Hocko <mhocko@kernel.org>
> Cc: Vladimir Davydov <vdavydov.dev@gmail.com>
> Cc: cgroups@vger.kernel.org
> Cc: linux-doc@vger.kernel.org
> Cc: linux-kernel@vger.kernel.org
> Cc: linux-mm@kvack.org

Acked-by: Johannes Weiner <hannes@cmpxchg.org>

Andrew, if there aren't any other objections, could you pick this one
up please?

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
From: Andrea Arcangeli @ 2017-05-24 15:22 UTC (permalink / raw)
  To: Mike Rapoport
  Cc: Michal Hocko, Vlastimil Babka, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Pavel Emelyanov, linux-mm,
	lkml, Linux API
In-Reply-To: <20170524142735.GF3063@rapoport-lnx>

Hello,

On Wed, May 24, 2017 at 05:27:36PM +0300, Mike Rapoport wrote:
> khugepaged does skip over VMAs which have userfault. We could register the
> regions with userfault before populating them to avoid collapses in the
> transition period. But then we'll have to populate these regions with
> UFFDIO_COPY which adds quite an overhead.

Yes, in fact with postcopy-only mode, there's no issue because of the
above.

The case where THP has to be temporarily disabled by CRIU is before
postcopy/userfaults engages, i.e. during the precopy with a
precopy+postcopy mode.

QEMU preferred mode is to do one pass of precopy before starting
postcopy/userfaults. During QEMU precopy phase VM_HUGEPAGE is set for
maximum performance and to back with THP in the destination as many
readonly (i.e. no source-redirtied) pages as possible. The dirty
logging in the source happens at 4k granularity by forcing the KVM
shadow MMU to map all pages at 4k granularity and by tracking the
dirty bit in software for the updates happening through the primary
MMU (linux pagetables dirty bit are ignored because soft dirty would
be too slow with O(N) complexity where N is linear with the size of
the VM, not with the number of re-dirtied pages in a precopy
pass). After that we track which 4k pages aren't uptodate on
destination and we zap them at 4k granularity with MADV_DONTNEED (we
badly need madvisev in fact to reduce the totally unnecessary flood of
4k wide MADV_DONTNEED there). So before calling the MADV_DONTNEED
flood, QEMU sets VM_NOHUGEPAGE, and after calling UFFDIO_REGISTER QEMU
sets back VM_HUGEPAGE (as the UFFDIO registration will keep khugepaged
at bay until postcopy completes). QEMU then finally calls
UFFDIO_UNREGISTER and khugepaged starts compacting everything that was
migrated through 4k wide userfaults.

CRIU doesn't attempt to populate destination with THP at all to be
simpler, but the problem is similar. It still has to call
VM_NOHUGEPAGE somehow during precopy (i.e. during the whole precopy
phase, precisely to avoid having to call MADV_DONTNEED to zap
4k not-uptodate fragments).

QEMU gets away with setting VM_NOHUGEPAGE and then back to VM_HUGEPAGE
without any issue because it's cooperative. CRIU as opposed has to
restore the same vm_flags that the vma had in the source to avoid
changing the behavior of the app after precopy+postcopy
completes. This is where the need of clearing the VM_*HUGEPAGE bits
from vm_flags comes into play.

Thanks,
Andrea

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH 0/6] refine and rename slub sysfs
From: Wei Yang @ 2017-05-24 15:21 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Wei Yang, cl, penberg, rientjes, akpm, linux-mm, linux-kernel
In-Reply-To: <20170524120318.GE14733@dhcp22.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 1381 bytes --]

On Wed, May 24, 2017 at 02:03:18PM +0200, Michal Hocko wrote:
>On Wed 24-05-17 17:54:50, Wei Yang wrote:
>> On Tue, May 23, 2017 at 08:39:11AM +0200, Michal Hocko wrote:
>[...]
>> >Is this worth risking breakage of the userspace which consume this data
>> >now? Do you have any user space code which will greatly benefit from the
>> >new data and which couldn't do the same with the current format/output?
>> >
>> >If yes this all should be in the changelog.
>> 
>> The answer is no.
>> 
>> I have the same concern as yours. So this patch set could be divided into two
>> parts: 1. add some new entry with current name convention, 2. change the name
>> convention.
>
>Who is going to use those new entries and for what purpose? Why do we
>want to expose even more details of the slab allocator to the userspace.
>Is the missing information something fundamental that some user space
>cannot work without it? Seriously these are essential questions you
>should have answer for _before_ posting the patch and mention all those
>reasons in the changelog.

It is me who wants to get more details of the slub behavior.  
AFAIK, no one else is expecting this.

Hmm, if we really don't want to export these entries, why not remove related
code? Looks we are sure they will not be touched.

>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
From: Vlastimil Babka @ 2017-05-24 15:17 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
In-Reply-To: <20170524134237.GH14733@dhcp22.suse.cz>

On 05/24/2017 03:42 PM, Michal Hocko wrote:
> diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> index ec7d6ae01c96..64aed7386fe4 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2246,11 +2246,11 @@
>  			that the amount of memory usable for all allocations
>  			is not too small.
>  
> -	movable_node	[KNL] Boot-time switch to make hotplugable to be
> -			movable. This means that the memory of such nodes
> -			will be usable only for movable allocations which
> -			rules out almost all kernel allocations. Use with
> -			caution!
> +	movable_node	[KNL] Boot-time switch to make hotplugable memory
> +			NUMA nodes to be movable. This means that the memory
> +			of such nodes will be usable only for movable
> +			allocations which rules out almost all kernel
> +			allocations. Use with caution!
>  
>  	MTD_Partition=	[MTD]
>  			Format: <name>,<region-number>,<size>,<offset>
> 
> Better?

Yes, thanks.

> [...]
>>> --- a/mm/Kconfig
>>> +++ b/mm/Kconfig
>>> @@ -149,32 +149,6 @@ config NO_BOOTMEM
>>>  config MEMORY_ISOLATION
>>>  	bool
>>>  
>>> -config MOVABLE_NODE
>>> -	bool "Enable to assign a node which has only movable memory"
>>> -	depends on HAVE_MEMBLOCK
>>> -	depends on NO_BOOTMEM
>>> -	depends on X86_64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG
>>> -	depends on NUMA
>>
>> That's a lot of depends. What happens if some of them are not met and
>> the movable_node bootparam is used?
> 
> Good question. I haven't explored that, to be honest. Now that I am looking closer
> I am not even sure why all those dependencies are thre. MEMORY_HOTPLUG
> is clear and OF_EARLY_FLATTREE is explained by 41a9ada3e6b4 ("of/fdt:
> mark hotpluggable memory"). NUMA is less clear to me because
> MEMORY_HOTPLUG doesn't really depend on NUMA systems. Dependency on
> NO_BOOTMEM is also not clear to me because zones layout
> doesn't really depend on the specific boot time allocator.
> 
> So we are left with HAVE_MEMBLOCK which seems to be there because
> movable_node_enabled is defined there while the parameter handling is in
> the hotplug proper. But there is no real reason to have it like that.
> This compiles but I will have to put throw my full compile battery on it
> to be sure. I will make it a separate patch.

I'd expect stuff might compile and work (run without crash), just in
some cases the boot option could be effectively ignored? In that case
it's just a matter of documenting the option, possibly also some warning
when used, e.g. "node_movable was ignored because CONFIG_FOO is not
enabled"?

Vlastimil

> Thanks!
> --- 
> diff --git a/include/linux/memblock.h b/include/linux/memblock.h
> index 9622fb8c101b..071692894254 100644
> --- a/include/linux/memblock.h
> +++ b/include/linux/memblock.h
> @@ -57,8 +57,6 @@ struct memblock {
>  
>  extern struct memblock memblock;
>  extern int memblock_debug;
> -/* If movable_node boot option specified */
> -extern bool movable_node_enabled;
>  
>  #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
>  #define __init_memblock __meminit
> @@ -171,11 +169,6 @@ static inline bool memblock_is_hotpluggable(struct memblock_region *m)
>  	return m->flags & MEMBLOCK_HOTPLUG;
>  }
>  
> -static inline bool __init_memblock movable_node_is_enabled(void)
> -{
> -	return movable_node_enabled;
> -}
> -
>  static inline bool memblock_is_mirror(struct memblock_region *m)
>  {
>  	return m->flags & MEMBLOCK_MIRROR;
> diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
> index 9e0249d0f5e4..9c1ac94f857b 100644
> --- a/include/linux/memory_hotplug.h
> +++ b/include/linux/memory_hotplug.h
> @@ -115,6 +115,12 @@ extern void __online_page_free(struct page *page);
>  extern int try_online_node(int nid);
>  
>  extern bool memhp_auto_online;
> +/* If movable_node boot option specified */
> +extern bool movable_node_enabled;
> +static inline bool movable_node_is_enabled(void)
> +{
> +	return movable_node_enabled;
> +}
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  extern bool is_pageblock_removable_nolock(struct page *page);
> @@ -266,6 +272,10 @@ static inline void put_online_mems(void) {}
>  static inline void mem_hotplug_begin(void) {}
>  static inline void mem_hotplug_done(void) {}
>  
> +static inline bool __init_memblock movable_node_is_enabled(void)
> +{
> +	return false;
> +}
>  #endif /* ! CONFIG_MEMORY_HOTPLUG */
>  
>  #ifdef CONFIG_MEMORY_HOTREMOVE
> diff --git a/mm/memblock.c b/mm/memblock.c
> index 4895f5a6cf7e..8c52fb11510c 100644
> --- a/mm/memblock.c
> +++ b/mm/memblock.c
> @@ -54,7 +54,6 @@ struct memblock memblock __initdata_memblock = {
>  };
>  
>  int memblock_debug __initdata_memblock;
> -bool movable_node_enabled __initdata_memblock = false;
>  static bool system_has_some_mirror __initdata_memblock = false;
>  static int memblock_can_resize __initdata_memblock;
>  static int memblock_memory_in_slab __initdata_memblock = 0;
> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
> index 2a14f8c18a22..b0d2bf3256d0 100644
> --- a/mm/memory_hotplug.c
> +++ b/mm/memory_hotplug.c
> @@ -79,6 +79,8 @@ static struct {
>  #define memhp_lock_acquire()      lock_map_acquire(&mem_hotplug.dep_map)
>  #define memhp_lock_release()      lock_map_release(&mem_hotplug.dep_map)
>  
> +bool movable_node_enabled = false;
> +
>  #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
>  bool memhp_auto_online;
>  #else
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
From: Mike Rapoport @ 2017-05-24 15:13 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Pavel Emelyanov, Michal Hocko, Kirill A. Shutemov, Andrew Morton,
	Arnd Bergmann, Kirill A. Shutemov, Andrea Arcangeli, linux-mm,
	lkml, Linux API
In-Reply-To: <91778b6e-cb69-bfca-51da-f8c3256e630e@suse.cz>

On Wed, May 24, 2017 at 04:54:38PM +0200, Vlastimil Babka wrote:
> On 05/24/2017 04:28 PM, Pavel Emelyanov wrote:
> > On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
> >> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
> >>>> Hm so the prctl does:
> >>>>
> >>>>                 if (arg2)
> >>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
> >>>>                 else
> >>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> >>>>
> >>>> That's rather lazy implementation IMHO. Could we change it so the flag
> >>>> is stored elsewhere in the mm, and the code that decides to (not) use
> >>>> THP will check both the per-vma flag and the per-mm flag?
> >>>
> >>> I afraid I don't understand how that can help.
> >>> What we need is an ability to temporarily disable collapse of the pages in
> >>> VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
> >>> THP, the vma->vm_flags for those VMAs will remain intact.
> >>
> >> That's what I'm saying - instead of implementing the prctl flag via
> >> mm->def_flags (which gets permanently propagated to newly created vma's
> >> but e.g. doesn't affect already existing ones), it would be setting a
> >> flag somewhere in mm, which khugepaged (and page faults) would check in
> >> addition to the per-vma flags.
> > 
> > I do not insist, but this would make existing paths (checking for flags) be 
> > 2 times slower -- from now on these would need to check two bits (vma flags
> > and mm flags) which are 100% in different cache lines.
> 
> I'd expect you already have mm struct cached during a page fault. And
> THP-eligible page fault is just one per pmd, the overhead should be
> practically zero.
> 
> > What Mike is proposing is the way to fine-tune the existing vma flags. This
> > would keep current paths as fast (or slow ;) ) as they are now. All the
> > complexity would go to rare cases when someone needs to turn thp off for a
> > while and then turn it back on.
> 
> Yeah but it's extending user-space API for a corner case. We should do
> that only when there's no other option.

With madvise() I'm suggesting we rather add "completeness" to the existing
API, IMHO. We do have API that sets VM_HUGEPAGE and clears VM_NOHUGEPAGE or
vise versa, but we do not have an API that can clear both flags...

And if we would use prctl(), we either change user visible behaviour or we
still need to extend the API and use, say, arg2 to distinguish between the
current behaviour and the new one.

--
Sincerely yours,
Mike. 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm/vmalloc: a slight change of compare target in __insert_vmap_area()
From: Wei Yang @ 2017-05-24 15:07 UTC (permalink / raw)
  To: Michal Hocko; +Cc: Wei Yang, akpm, linux-mm, linux-kernel
In-Reply-To: <20170524121135.GF14733@dhcp22.suse.cz>

[-- Attachment #1: Type: text/plain, Size: 2558 bytes --]

On Wed, May 24, 2017 at 02:11:35PM +0200, Michal Hocko wrote:
>On Wed 24-05-17 18:03:47, Wei Yang wrote:
>> The vmap RB tree store the elements in order and no overlap between any of
>> them. The comparison in __insert_vmap_area() is to decide which direction
>> the search should follow and make sure the new vmap_area is not overlap
>> with any other.
>> 
>> Current implementation fails to do the overlap check.
>> 
>> When first "if" is not true, it means
>> 
>>     va->va_start >= tmp_va->va_end
>> 
>> And with the truth
>> 
>>     xxx->va_end > xxx->va_start
>> 
>> The deduction is
>> 
>>     va->va_end > tmp_va->va_start
>> 
>> which is the condition in second "if".
>> 
>> This patch changes a little of the comparison in __insert_vmap_area() to
>> make sure it forbids the overlapped vmap_area.
>
>Why do we care about overlapping vmap areas at this level. This is an
>internal function and all the sanity checks should have been done by
>that time AFAIR. Could you describe the problem which you are trying to
>fix/address?
>

No problem it tries to fix.

I just follow the original idea, which tries to catch the exception case by
the BUG(). While in the above analysis, the BUG() will never be triggered.

So we have two options:
1. Still tries to catch the exception by change the "if" a little.
2. If we don't care about the overlap case, the "if" clause could be
   simplified.  Only "if ... else ..." is enough.

You prefer the second one?

>> Signed-off-by: Wei Yang <richard.weiyang@gmail.com>
>> ---
>>  mm/vmalloc.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>> 
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index 0b057628a7ba..8087451cb332 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -360,9 +360,9 @@ static void __insert_vmap_area(struct vmap_area *va)
>>  
>>  		parent = *p;
>>  		tmp_va = rb_entry(parent, struct vmap_area, rb_node);
>> -		if (va->va_start < tmp_va->va_end)
>> +		if (va->va_end <= tmp_va->va_start)
>>  			p = &(*p)->rb_left;
>> -		else if (va->va_end > tmp_va->va_start)
>> +		else if (va->va_start >= tmp_va->va_end)
>>  			p = &(*p)->rb_right;
>>  		else
>>  			BUG();
>> -- 
>> 2.11.0
>> 
>> --
>> To unsubscribe, send a message with 'unsubscribe linux-mm' in
>> the body to majordomo@kvack.org.  For more info on Linux MM,
>> see: http://www.linux-mm.org/ .
>> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
>
>-- 
>Michal Hocko
>SUSE Labs

-- 
Wei Yang
Help you, Help me

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 819 bytes --]

^ permalink raw reply

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
From: Vlastimil Babka @ 2017-05-24 14:54 UTC (permalink / raw)
  To: Pavel Emelyanov, Mike Rapoport
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, linux-mm, lkml, Linux API
In-Reply-To: <ab5bbeb6-0c61-f505-f365-37ca43415696@virtuozzo.com>

On 05/24/2017 04:28 PM, Pavel Emelyanov wrote:
> On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
>> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>>>> Hm so the prctl does:
>>>>
>>>>                 if (arg2)
>>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>>>                 else
>>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>>
>>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>>> THP will check both the per-vma flag and the per-mm flag?
>>>
>>> I afraid I don't understand how that can help.
>>> What we need is an ability to temporarily disable collapse of the pages in
>>> VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
>>> THP, the vma->vm_flags for those VMAs will remain intact.
>>
>> That's what I'm saying - instead of implementing the prctl flag via
>> mm->def_flags (which gets permanently propagated to newly created vma's
>> but e.g. doesn't affect already existing ones), it would be setting a
>> flag somewhere in mm, which khugepaged (and page faults) would check in
>> addition to the per-vma flags.
> 
> I do not insist, but this would make existing paths (checking for flags) be 
> 2 times slower -- from now on these would need to check two bits (vma flags
> and mm flags) which are 100% in different cache lines.

I'd expect you already have mm struct cached during a page fault. And
THP-eligible page fault is just one per pmd, the overhead should be
practically zero.

> What Mike is proposing is the way to fine-tune the existing vma flags. This
> would keep current paths as fast (or slow ;) ) as they are now. All the
> complexity would go to rare cases when someone needs to turn thp off for a
> while and then turn it back on.

Yeah but it's extending user-space API for a corner case. We should do
that only when there's no other option.

> -- Pavel
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: Define KB, MB, GB, TB in core VM
From: Michal Hocko @ 2017-05-24 14:31 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: Vlastimil Babka, Christoph Hellwig, Andrew Morton, linux-mm,
	linux-kernel
In-Reply-To: <7f85724c-6fc1-bb51-11e4-15fc2f89372b@linux.vnet.ibm.com>

On Wed 24-05-17 12:10:13, Anshuman Khandual wrote:
[...]
> So the question is are we willing to do all these changes across
> the tree to achieve common definitions of KB, MB, GB, TB in the
> kernel ? Is it worth ?

I do not think this is worth losing time. Any tree wide change should
have a considerable advantage in the end. These macro helpers do not
sound overly important to care. But that is just my 2c
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
From: Pavel Emelyanov @ 2017-05-24 14:28 UTC (permalink / raw)
  To: Vlastimil Babka, Mike Rapoport
  Cc: Michal Hocko, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, linux-mm, lkml, Linux API
In-Reply-To: <aec1376e-34b3-56ce-448e-7fbddcda448b@suse.cz>

On 05/24/2017 02:31 PM, Vlastimil Babka wrote:
> On 05/24/2017 12:39 PM, Mike Rapoport wrote:
>>> Hm so the prctl does:
>>>
>>>                 if (arg2)
>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>>                 else
>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I afraid I don't understand how that can help.
>> What we need is an ability to temporarily disable collapse of the pages in
>> VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
>> THP, the vma->vm_flags for those VMAs will remain intact.
> 
> That's what I'm saying - instead of implementing the prctl flag via
> mm->def_flags (which gets permanently propagated to newly created vma's
> but e.g. doesn't affect already existing ones), it would be setting a
> flag somewhere in mm, which khugepaged (and page faults) would check in
> addition to the per-vma flags.

I do not insist, but this would make existing paths (checking for flags) be 
2 times slower -- from now on these would need to check two bits (vma flags
and mm flags) which are 100% in different cache lines.

What Mike is proposing is the way to fine-tune the existing vma flags. This
would keep current paths as fast (or slow ;) ) as they are now. All the
complexity would go to rare cases when someone needs to turn thp off for a
while and then turn it back on.

-- Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
From: Mike Rapoport @ 2017-05-24 14:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, Pavel Emelyanov, linux-mm,
	lkml, Linux API
In-Reply-To: <20170524111800.GD14733@dhcp22.suse.cz>

On Wed, May 24, 2017 at 01:18:00PM +0200, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
> > On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
> > > On 05/24/2017 09:50 AM, Mike Rapoport wrote:
> > > > On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
> > > >> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
> > > >>>
> > > >>> Probably I didn't explained it too well.
> > > >>>
> > > >>> The range is intentionally not populated. When we combine pre- and
> > > >>> post-copy for process migration, we create memory pre-dump without stopping
> > > >>> the process, then we freeze the process without dumping the pages it has
> > > >>> dirtied between pre-dump and freeze, and then, during restore, we populate
> > > >>> the dirtied pages using userfaultfd.
> > > >>>
> > > >>> When CRIU restores a process in such scenario, it does something like:
> > > >>>
> > > >>> * mmap() memory region
> > > >>> * fill in the pages that were collected during the pre-dump
> > > >>> * do some other stuff
> > > >>> * register memory region with userfaultfd
> > > >>> * populate the missing memory on demand
> > > >>>
> > > >>> khugepaged collapses the pages in the partially populated regions before we
> > > >>> have a chance to register these regions with userfaultfd, which would
> > > >>> prevent the collapse.
> > > >>>
> > > >>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
> > > >>> there would be no race because there would be nothing for khugepaged to
> > > >>> collapse at that point. But the problem is that we have no way to reset
> > > >>> *HUGEPAGE flags after the memory restore is complete.
> > > >>
> > > >> Hmm, I wouldn't be that sure if this is indeed race-free. Check that
> > > >> this scenario is indeed impossible?
> > > >>
> > > >> - you do the mmap
> > > >> - khugepaged will choose the process' mm to scan
> > > >> - khugepaged will get to the vma in question, it doesn't have
> > > >> MADV_NOHUGEPAGE yet
> > > >> - you set MADV_NOHUGEPAGE on the vma
> > > >> - you start populating the vma
> > > >> - khugepaged sees the vma is non-empty, collapses
> > > >>
> > > >> unless I'm wrong, the racers will have mmap_sem for reading only when
> > > >> setting/checking the MADV_NOHUGEPAGE? Might be actually considered a bug.
> > > >>
> > > >> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
> > > >> nonzero value, the flag is set, otherwise it is cleared." says the
> > > >> manpage. Do it before the mmap and you avoid the race as well?
> > > > 
> > > > Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
> > > > When I've tried to use it, I've ended up with VM_NOHUGEPAGE set on all VMAs
> > > > created after prctl(). This returns me to the state when checkpoint-restore
> > > > alters the application vma->vm_flags although it shouldn't and I do not see
> > > > a way to fix it using existing interfaces.
> > > 
> > > [CC linux-api, should have been done in the initial posting already]
> > 
> > Sorry, missed that.
> >  
> > > Hm so the prctl does:
> > > 
> > >                 if (arg2)
> > >                         me->mm->def_flags |= VM_NOHUGEPAGE;
> > >                 else
> > >                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
> > > 
> > > That's rather lazy implementation IMHO. Could we change it so the flag
> > > is stored elsewhere in the mm, and the code that decides to (not) use
> > > THP will check both the per-vma flag and the per-mm flag?
> > 
> > I afraid I don't understand how that can help.
> > What we need is an ability to temporarily disable collapse of the pages in
> > VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
> > THP, the vma->vm_flags for those VMAs will remain intact.
> 
> Why cannot khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

khugepaged does skip over VMAs which have userfault. We could register the
regions with userfault before populating them to avoid collapses in the
transition period. But then we'll have to populate these regions with
UFFDIO_COPY which adds quite an overhead.

> -- 
> Michal Hocko
> SUSE Labs
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH] mm: introduce MADV_CLR_HUGEPAGE
From: Pavel Emelyanov @ 2017-05-24 14:25 UTC (permalink / raw)
  To: Michal Hocko, Mike Rapoport
  Cc: Vlastimil Babka, Kirill A. Shutemov, Andrew Morton, Arnd Bergmann,
	Kirill A. Shutemov, Andrea Arcangeli, linux-mm, lkml, Linux API
In-Reply-To: <20170524111800.GD14733@dhcp22.suse.cz>

On 05/24/2017 02:18 PM, Michal Hocko wrote:
> On Wed 24-05-17 13:39:48, Mike Rapoport wrote:
>> On Wed, May 24, 2017 at 09:58:06AM +0200, Vlastimil Babka wrote:
>>> On 05/24/2017 09:50 AM, Mike Rapoport wrote:
>>>> On Mon, May 22, 2017 at 05:52:47PM +0200, Vlastimil Babka wrote:
>>>>> On 05/22/2017 04:29 PM, Mike Rapoport wrote:
>>>>>>
>>>>>> Probably I didn't explained it too well.
>>>>>>
>>>>>> The range is intentionally not populated. When we combine pre- and
>>>>>> post-copy for process migration, we create memory pre-dump without stopping
>>>>>> the process, then we freeze the process without dumping the pages it has
>>>>>> dirtied between pre-dump and freeze, and then, during restore, we populate
>>>>>> the dirtied pages using userfaultfd.
>>>>>>
>>>>>> When CRIU restores a process in such scenario, it does something like:
>>>>>>
>>>>>> * mmap() memory region
>>>>>> * fill in the pages that were collected during the pre-dump
>>>>>> * do some other stuff
>>>>>> * register memory region with userfaultfd
>>>>>> * populate the missing memory on demand
>>>>>>
>>>>>> khugepaged collapses the pages in the partially populated regions before we
>>>>>> have a chance to register these regions with userfaultfd, which would
>>>>>> prevent the collapse.
>>>>>>
>>>>>> We could have used MADV_NOHUGEPAGE right after the mmap() call, and then
>>>>>> there would be no race because there would be nothing for khugepaged to
>>>>>> collapse at that point. But the problem is that we have no way to reset
>>>>>> *HUGEPAGE flags after the memory restore is complete.
>>>>>
>>>>> Hmm, I wouldn't be that sure if this is indeed race-free. Check that
>>>>> this scenario is indeed impossible?
>>>>>
>>>>> - you do the mmap
>>>>> - khugepaged will choose the process' mm to scan
>>>>> - khugepaged will get to the vma in question, it doesn't have
>>>>> MADV_NOHUGEPAGE yet
>>>>> - you set MADV_NOHUGEPAGE on the vma
>>>>> - you start populating the vma
>>>>> - khugepaged sees the vma is non-empty, collapses
>>>>>
>>>>> unless I'm wrong, the racers will have mmap_sem for reading only when
>>>>> setting/checking the MADV_NOHUGEPAGE? Might be actually considered a bug.
>>>>>
>>>>> However, can't you use prctl(PR_SET_THP_DISABLE) instead? "If arg2 has a
>>>>> nonzero value, the flag is set, otherwise it is cleared." says the
>>>>> manpage. Do it before the mmap and you avoid the race as well?
>>>>
>>>> Unfortunately, prctl(PR_SET_THP_DISABLE) didn't help :(
>>>> When I've tried to use it, I've ended up with VM_NOHUGEPAGE set on all VMAs
>>>> created after prctl(). This returns me to the state when checkpoint-restore
>>>> alters the application vma->vm_flags although it shouldn't and I do not see
>>>> a way to fix it using existing interfaces.
>>>
>>> [CC linux-api, should have been done in the initial posting already]
>>
>> Sorry, missed that.
>>  
>>> Hm so the prctl does:
>>>
>>>                 if (arg2)
>>>                         me->mm->def_flags |= VM_NOHUGEPAGE;
>>>                 else
>>>                         me->mm->def_flags &= ~VM_NOHUGEPAGE;
>>>
>>> That's rather lazy implementation IMHO. Could we change it so the flag
>>> is stored elsewhere in the mm, and the code that decides to (not) use
>>> THP will check both the per-vma flag and the per-mm flag?
>>
>> I afraid I don't understand how that can help.
>> What we need is an ability to temporarily disable collapse of the pages in
>> VMAs that do not have VM_*HUGEPAGE flags set and that after we re-enable
>> THP, the vma->vm_flags for those VMAs will remain intact.
> 
> Why cannot khugepaged simply skip over all VMAs which have userfault
> regions registered? This would sound like a less error prone approach to
> me.

It already does so. The problem is that there's a race window. We first populate VMA
with pages, then register it in UFFD. Between these two actions khugepaged comes
and generates a huge page out of populated pages and holes. And the holes in question
are not, well, holes -- they should be populated later via the UFFD, while the
generated huge page prevents this from happening.

-- Pavel

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
From: Michal Hocko @ 2017-05-24 13:50 UTC (permalink / raw)
  To: Igor Mammedov
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen,
	David Rientjes, Daniel Kiper, Vitaly Kuznetsov, LKML
In-Reply-To: <20170524153017.7a66368d@nial.brq.redhat.com>

On Wed 24-05-17 15:30:17, Igor Mammedov wrote:
> On Wed, 24 May 2017 14:24:11 +0200
> Michal Hocko <mhocko@kernel.org> wrote:
> 
> [...]
> > index facc20a3f962..ec7d6ae01c96 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2246,8 +2246,11 @@
> [...]
> > +			movable. This means that the memory of such nodes
> > +			will be usable only for movable allocations which
> > +			rules out almost all kernel allocations. Use with
> > +			caution!
> maybe dumb question but, is it really true that kernel won't ever
> do kernel allocations from movable zone?
> 
> looking at kmalloc(slab): we can get here:
> 
> get_page_from_freelist() ->
>     rmqueue() ->
>         __rmqueue() ->
>             __rmqueue_fallback() ->
>                 find_suitable_fallback()
> 
> and it might return movable fallback and page could be stolen from there.

No, you are mixing movable/unmovable pageblocks and movable zones.
Movable zone basically works the same way as the Highmem zone. Have a
look at gfp_zone() and high_zoneidx usage in get_page_from_freelist
-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
From: Michal Hocko @ 2017-05-24 13:42 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: linux-mm, Andrew Morton, Mel Gorman, Andrea Arcangeli,
	Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu, qiuxishi,
	Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen, David Rientjes,
	Daniel Kiper, Igor Mammedov, Vitaly Kuznetsov, LKML
In-Reply-To: <3a85146e-2f31-8a9e-26da-6051119586fe@suse.cz>

On Wed 24-05-17 14:53:57, Vlastimil Babka wrote:
> On 05/24/2017 02:24 PM, Michal Hocko wrote:
[...]
> > diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
> > index facc20a3f962..ec7d6ae01c96 100644
> > --- a/Documentation/admin-guide/kernel-parameters.txt
> > +++ b/Documentation/admin-guide/kernel-parameters.txt
> > @@ -2246,8 +2246,11 @@
> >  			that the amount of memory usable for all allocations
> >  			is not too small.
> >  
> > -	movable_node	[KNL] Boot-time switch to enable the effects
> > -			of CONFIG_MOVABLE_NODE=y. See mm/Kconfig for details.
> > +	movable_node	[KNL] Boot-time switch to make hotplugable to be
> 
> 			hotplugable what, memory? nodes?


diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index ec7d6ae01c96..64aed7386fe4 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -2246,11 +2246,11 @@
 			that the amount of memory usable for all allocations
 			is not too small.
 
-	movable_node	[KNL] Boot-time switch to make hotplugable to be
-			movable. This means that the memory of such nodes
-			will be usable only for movable allocations which
-			rules out almost all kernel allocations. Use with
-			caution!
+	movable_node	[KNL] Boot-time switch to make hotplugable memory
+			NUMA nodes to be movable. This means that the memory
+			of such nodes will be usable only for movable
+			allocations which rules out almost all kernel
+			allocations. Use with caution!
 
 	MTD_Partition=	[MTD]
 			Format: <name>,<region-number>,<size>,<offset>

Better?

[...]
> > --- a/mm/Kconfig
> > +++ b/mm/Kconfig
> > @@ -149,32 +149,6 @@ config NO_BOOTMEM
> >  config MEMORY_ISOLATION
> >  	bool
> >  
> > -config MOVABLE_NODE
> > -	bool "Enable to assign a node which has only movable memory"
> > -	depends on HAVE_MEMBLOCK
> > -	depends on NO_BOOTMEM
> > -	depends on X86_64 || OF_EARLY_FLATTREE || MEMORY_HOTPLUG
> > -	depends on NUMA
> 
> That's a lot of depends. What happens if some of them are not met and
> the movable_node bootparam is used?

Good question. I haven't explored that, to be honest. Now that I am looking closer
I am not even sure why all those dependencies are thre. MEMORY_HOTPLUG
is clear and OF_EARLY_FLATTREE is explained by 41a9ada3e6b4 ("of/fdt:
mark hotpluggable memory"). NUMA is less clear to me because
MEMORY_HOTPLUG doesn't really depend on NUMA systems. Dependency on
NO_BOOTMEM is also not clear to me because zones layout
doesn't really depend on the specific boot time allocator.

So we are left with HAVE_MEMBLOCK which seems to be there because
movable_node_enabled is defined there while the parameter handling is in
the hotplug proper. But there is no real reason to have it like that.
This compiles but I will have to put throw my full compile battery on it
to be sure. I will make it a separate patch.

Thanks!
--- 
diff --git a/include/linux/memblock.h b/include/linux/memblock.h
index 9622fb8c101b..071692894254 100644
--- a/include/linux/memblock.h
+++ b/include/linux/memblock.h
@@ -57,8 +57,6 @@ struct memblock {
 
 extern struct memblock memblock;
 extern int memblock_debug;
-/* If movable_node boot option specified */
-extern bool movable_node_enabled;
 
 #ifdef CONFIG_ARCH_DISCARD_MEMBLOCK
 #define __init_memblock __meminit
@@ -171,11 +169,6 @@ static inline bool memblock_is_hotpluggable(struct memblock_region *m)
 	return m->flags & MEMBLOCK_HOTPLUG;
 }
 
-static inline bool __init_memblock movable_node_is_enabled(void)
-{
-	return movable_node_enabled;
-}
-
 static inline bool memblock_is_mirror(struct memblock_region *m)
 {
 	return m->flags & MEMBLOCK_MIRROR;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9e0249d0f5e4..9c1ac94f857b 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -115,6 +115,12 @@ extern void __online_page_free(struct page *page);
 extern int try_online_node(int nid);
 
 extern bool memhp_auto_online;
+/* If movable_node boot option specified */
+extern bool movable_node_enabled;
+static inline bool movable_node_is_enabled(void)
+{
+	return movable_node_enabled;
+}
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
 extern bool is_pageblock_removable_nolock(struct page *page);
@@ -266,6 +272,10 @@ static inline void put_online_mems(void) {}
 static inline void mem_hotplug_begin(void) {}
 static inline void mem_hotplug_done(void) {}
 
+static inline bool __init_memblock movable_node_is_enabled(void)
+{
+	return false;
+}
 #endif /* ! CONFIG_MEMORY_HOTPLUG */
 
 #ifdef CONFIG_MEMORY_HOTREMOVE
diff --git a/mm/memblock.c b/mm/memblock.c
index 4895f5a6cf7e..8c52fb11510c 100644
--- a/mm/memblock.c
+++ b/mm/memblock.c
@@ -54,7 +54,6 @@ struct memblock memblock __initdata_memblock = {
 };
 
 int memblock_debug __initdata_memblock;
-bool movable_node_enabled __initdata_memblock = false;
 static bool system_has_some_mirror __initdata_memblock = false;
 static int memblock_can_resize __initdata_memblock;
 static int memblock_memory_in_slab __initdata_memblock = 0;
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 2a14f8c18a22..b0d2bf3256d0 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -79,6 +79,8 @@ static struct {
 #define memhp_lock_acquire()      lock_map_acquire(&mem_hotplug.dep_map)
 #define memhp_lock_release()      lock_map_release(&mem_hotplug.dep_map)
 
+bool movable_node_enabled = false;
+
 #ifndef CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
 bool memhp_auto_online;
 #else

-- 
Michal Hocko
SUSE Labs

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* Re: [RFC PATCH 2/2] mm, memory_hotplug: drop CONFIG_MOVABLE_NODE
From: Igor Mammedov @ 2017-05-24 13:30 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-mm, Andrew Morton, Mel Gorman, Vlastimil Babka,
	Andrea Arcangeli, Jerome Glisse, Reza Arbab, Yasuaki Ishimatsu,
	qiuxishi, Kani Toshimitsu, slaoub, Joonsoo Kim, Andi Kleen,
	David Rientjes, Daniel Kiper, Vitaly Kuznetsov, LKML,
	Michal Hocko
In-Reply-To: <20170524122411.25212-3-mhocko@kernel.org>

On Wed, 24 May 2017 14:24:11 +0200
Michal Hocko <mhocko@kernel.org> wrote:

[...]
> index facc20a3f962..ec7d6ae01c96 100644
> --- a/Documentation/admin-guide/kernel-parameters.txt
> +++ b/Documentation/admin-guide/kernel-parameters.txt
> @@ -2246,8 +2246,11 @@
[...]
> +			movable. This means that the memory of such nodes
> +			will be usable only for movable allocations which
> +			rules out almost all kernel allocations. Use with
> +			caution!
maybe dumb question but, is it really true that kernel won't ever
do kernel allocations from movable zone?

looking at kmalloc(slab): we can get here:

get_page_from_freelist() ->
    rmqueue() ->
        __rmqueue() ->
            __rmqueue_fallback() ->
                find_suitable_fallback()

and it might return movable fallback and page could be stolen from there.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [Question] Mlocked count will not be decreased
From: Vlastimil Babka @ 2017-05-24 13:16 UTC (permalink / raw)
  To: Xishi Qiu; +Cc: Yisheng Xie, Kefeng Wang, linux-mm, linux-kernel, zhongjiang
In-Reply-To: <5925784E.802@huawei.com>

On 05/24/2017 02:10 PM, Xishi Qiu wrote:
> On 2017/5/24 19:52, Vlastimil Babka wrote:
> 
>> On 05/24/2017 01:38 PM, Xishi Qiu wrote:
>>>>
>>>> Race condition with what? Who else would isolate our pages?
>>>>
>>>
>>> Hi Vlastimil,
>>>
>>> I find the root cause, if the page was not cached on the current cpu,
>>> lru_add_drain() will not push it to LRU. So we should handle fail
>>> case in mlock_vma_page().
>>
>> Yeah that would explain it.
>>
>>> follow_page_pte()
>>> 		...
>>> 		if (page->mapping && trylock_page(page)) {
>>> 			lru_add_drain();  /* push cached pages to LRU */
>>> 			/*
>>> 			 * Because we lock page here, and migration is
>>> 			 * blocked by the pte's page reference, and we
>>> 			 * know the page is still mapped, we don't even
>>> 			 * need to check for file-cache page truncation.
>>> 			 */
>>> 			mlock_vma_page(page);
>>> 			unlock_page(page);
>>> 		}
>>> 		...
>>>
>>> I think we should add yisheng's patch, also we should add the following change.
>>> I think it is better than use lru_add_drain_all().
>>
>> I agree about yisheng's fix (but v2 didn't address my comments). I don't
>> think we should add the hunk below, as that deviates from the rest of
>> the design.
> 
> Hi Vlastimil,
> 
> The rest of the design is that mlock should always success here, right?

The rest of the design allows a temporary disconnect between mlocked
flag and being placed on unevictable lru.

> If we don't handle the fail case, the page will be in anon/file lru list
> later when call __pagevec_lru_add(), but NR_MLOCK increased,
> this is wrong, right?

It's not wrong, the page cannot get evicted even if on wrong lru, so
effectively it's already mlocked. We would be underaccounting NR_MLOCK.

> Thanks,
> Xishi Qiu
> 
>>
>> Thanks,
>> Vlastimil
>>
>>> diff --git a/mm/mlock.c b/mm/mlock.c
>>> index 3d3ee6c..ca2aeb9 100644
>>> --- a/mm/mlock.c
>>> +++ b/mm/mlock.c
>>> @@ -88,6 +88,11 @@ void mlock_vma_page(struct page *page)
>>>  		count_vm_event(UNEVICTABLE_PGMLOCKED);
>>>  		if (!isolate_lru_page(page))
>>>  			putback_lru_page(page);
>>> +		else {
>>> +			ClearPageMlocked(page);
>>> +			mod_zone_page_state(page_zone(page), NR_MLOCK,
>>> +					-hpage_nr_pages(page));
>>> +		}
>>>  	}
>>>  }
>>>
>>> Thanks,
>>> Xishi Qiu
>>>
>>
>>
>> .
>>
> 
> 
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>
> 

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* [PATCH v4 9/9] arm64: hugetlb: Cleanup setup_hugepagesz
From: Punit Agrawal @ 2017-05-24 13:11 UTC (permalink / raw)
  To: will.deacon, catalin.marinas
  Cc: Steve Capper, linux-arm-kernel, mark.rutland, linux-mm,
	David Woods, stable, Punit Agrawal
In-Reply-To: <20170524131122.5309-1-punit.agrawal@arm.com>

From: Steve Capper <steve.capper@arm.com>

Replace a lot of if statements with switch and case labels to make it
much clearer which huge page sizes are supported.

Also, we prevent PUD_SIZE from being used on systems not running with
4KB PAGE_SIZE. Before if one supplied PUD_SIZE in these circumstances,
then unusuable huge page sizes would be in use.

Fixes: 084bd29810a5 ("ARM64: mm: HugeTLB support.")
Cc: David Woods <dwoods@mellanox.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Steve Capper <steve.capper@arm.com>
Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Reviewed-by: Mark Rutland <mark.rutland@arm.com>
---
 arch/arm64/mm/hugetlbpage.c | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index a10dfbd0ffd5..b5dc7cca3f45 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -381,20 +381,20 @@ static __init int setup_hugepagesz(char *opt)
 {
 	unsigned long ps = memparse(opt, &opt);
 
-	if (ps == PMD_SIZE) {
-		hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
-	} else if (ps == PUD_SIZE) {
-		hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
-	} else if (ps == (PAGE_SIZE * CONT_PTES)) {
-		hugetlb_add_hstate(CONT_PTE_SHIFT);
-	} else if (ps == (PMD_SIZE * CONT_PMDS)) {
-		hugetlb_add_hstate((PMD_SHIFT + CONT_PMD_SHIFT) - PAGE_SHIFT);
-	} else {
-		hugetlb_bad_size();
-		pr_err("hugepagesz: Unsupported page size %lu K\n", ps >> 10);
-		return 0;
+	switch (ps) {
+#ifdef CONFIG_ARM64_4K_PAGES
+	case PUD_SIZE:
+#endif
+	case PMD_SIZE * CONT_PMDS:
+	case PMD_SIZE:
+	case PAGE_SIZE * CONT_PTES:
+		hugetlb_add_hstate(ilog2(ps) - PAGE_SHIFT);
+		return 1;
 	}
-	return 1;
+
+	hugetlb_bad_size();
+	pr_err("hugepagesz: Unsupported page size %lu K\n", ps >> 10);
+	return 0;
 }
 __setup("hugepagesz=", setup_hugepagesz);
 
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v4 8/9] arm64: Re-enable support for contiguous hugepages
From: Punit Agrawal @ 2017-05-24 13:11 UTC (permalink / raw)
  To: will.deacon, catalin.marinas
  Cc: Punit Agrawal, linux-arm-kernel, steve.capper, mark.rutland,
	linux-mm
In-Reply-To: <20170524131122.5309-1-punit.agrawal@arm.com>

also known as -

Revert "Revert "Revert "arm64: hugetlb: partial revert of 66b3923a1a0f"""

Now that our hugetlb implementation is compliant with the
break-before-make requirements of the architecture and we have addressed
some of the issues in core code required for properly dealing with
hardware poisoning of contiguous hugepages let's re-enable support for
contiguous hugepages.

This reverts commit 6ae979ab39a368c18ceb0424bf824d172d6ab56f.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
---
 arch/arm64/mm/hugetlbpage.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 3e78673b1bcb..a10dfbd0ffd5 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -385,6 +385,10 @@ static __init int setup_hugepagesz(char *opt)
 		hugetlb_add_hstate(PMD_SHIFT - PAGE_SHIFT);
 	} else if (ps == PUD_SIZE) {
 		hugetlb_add_hstate(PUD_SHIFT - PAGE_SHIFT);
+	} else if (ps == (PAGE_SIZE * CONT_PTES)) {
+		hugetlb_add_hstate(CONT_PTE_SHIFT);
+	} else if (ps == (PMD_SIZE * CONT_PMDS)) {
+		hugetlb_add_hstate((PMD_SHIFT + CONT_PMD_SHIFT) - PAGE_SHIFT);
 	} else {
 		hugetlb_bad_size();
 		pr_err("hugepagesz: Unsupported page size %lu K\n", ps >> 10);
@@ -393,3 +397,13 @@ static __init int setup_hugepagesz(char *opt)
 	return 1;
 }
 __setup("hugepagesz=", setup_hugepagesz);
+
+#ifdef CONFIG_ARM64_64K_PAGES
+static __init int add_default_hugepagesz(void)
+{
+	if (size_to_hstate(CONT_PTES * PAGE_SIZE) == NULL)
+		hugetlb_add_hstate(CONT_PTE_SHIFT);
+	return 0;
+}
+arch_initcall(add_default_hugepagesz);
+#endif
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v4 7/9] arm64: hugetlb: Override set_huge_swap_pte_at() to support contiguous hugepages
From: Punit Agrawal @ 2017-05-24 13:11 UTC (permalink / raw)
  To: will.deacon, catalin.marinas
  Cc: Punit Agrawal, linux-arm-kernel, steve.capper, mark.rutland,
	linux-mm, David Woods
In-Reply-To: <20170524131122.5309-1-punit.agrawal@arm.com>

The default implementation of set_huge_swap_pte_at() does not support
hugepages consisting of contiguous ptes. Override it to add support for
contiguous hugepages.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: David Woods <dwoods@mellanox.com>
---
 arch/arm64/include/asm/hugetlb.h |  3 +++
 arch/arm64/mm/hugetlbpage.c      | 12 ++++++++++++
 2 files changed, 15 insertions(+)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index bb86f0741863..e5f6210d1321 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -84,6 +84,9 @@ extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
 extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
 			   pte_t *ptep, unsigned long sz);
 #define huge_pte_clear huge_pte_clear
+extern void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
+				 pte_t *ptep, pte_t pte, unsigned long sz);
+#define set_huge_swap_pte_at set_huge_swap_pte_at
 
 #include <asm-generic/hugetlb.h>
 
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index 240b2fd53266..3e78673b1bcb 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -167,6 +167,18 @@ void set_huge_pte_at(struct mm_struct *mm, unsigned long addr,
 	}
 }
 
+void set_huge_swap_pte_at(struct mm_struct *mm, unsigned long addr,
+			  pte_t *ptep, pte_t pte, unsigned long sz)
+{
+	int i, ncontig;
+	size_t pgsize;
+
+	ncontig = num_contig_ptes(sz, &pgsize);
+
+	for (i = 0; i < ncontig; i++, ptep++)
+		set_pte(ptep, pte);
+}
+
 pte_t *huge_pte_alloc(struct mm_struct *mm,
 		      unsigned long addr, unsigned long sz)
 {
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v4 6/9] arm64: hugetlb: Override huge_pte_clear() to support contiguous hugepages
From: Punit Agrawal @ 2017-05-24 13:11 UTC (permalink / raw)
  To: will.deacon, catalin.marinas
  Cc: Punit Agrawal, linux-arm-kernel, steve.capper, mark.rutland,
	linux-mm, David Woods
In-Reply-To: <20170524131122.5309-1-punit.agrawal@arm.com>

The default huge_pte_clear() implementation does not clear contiguous
page table entries when it encounters contiguous hugepages that are
supported on arm64.

Fix this by overriding the default implementation to clear all the
entries associated with contiguous hugepages.

Signed-off-by: Punit Agrawal <punit.agrawal@arm.com>
Cc: David Woods <dwoods@mellanox.com>
---
 arch/arm64/include/asm/hugetlb.h |  6 +++++-
 arch/arm64/mm/hugetlbpage.c      | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 41 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/hugetlb.h b/arch/arm64/include/asm/hugetlb.h
index bbc1e35aa601..bb86f0741863 100644
--- a/arch/arm64/include/asm/hugetlb.h
+++ b/arch/arm64/include/asm/hugetlb.h
@@ -18,7 +18,6 @@
 #ifndef __ASM_HUGETLB_H
 #define __ASM_HUGETLB_H
 
-#include <asm-generic/hugetlb.h>
 #include <asm/page.h>
 
 static inline pte_t huge_ptep_get(pte_t *ptep)
@@ -82,5 +81,10 @@ extern void huge_ptep_set_wrprotect(struct mm_struct *mm,
 				    unsigned long addr, pte_t *ptep);
 extern void huge_ptep_clear_flush(struct vm_area_struct *vma,
 				  unsigned long addr, pte_t *ptep);
+extern void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
+			   pte_t *ptep, unsigned long sz);
+#define huge_pte_clear huge_pte_clear
+
+#include <asm-generic/hugetlb.h>
 
 #endif /* __ASM_HUGETLB_H */
diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c
index e9061704aec3..240b2fd53266 100644
--- a/arch/arm64/mm/hugetlbpage.c
+++ b/arch/arm64/mm/hugetlbpage.c
@@ -68,6 +68,30 @@ static int find_num_contig(struct mm_struct *mm, unsigned long addr,
 	return CONT_PTES;
 }
 
+static inline int num_contig_ptes(unsigned long size, size_t *pgsize)
+{
+	int contig_ptes = 0;
+
+	*pgsize = size;
+
+	switch (size) {
+	case PUD_SIZE:
+	case PMD_SIZE:
+		contig_ptes = 1;
+		break;
+	case CONT_PMD_SIZE:
+		*pgsize = PMD_SIZE;
+		contig_ptes = CONT_PMDS;
+		break;
+	case CONT_PTE_SIZE:
+		*pgsize = PAGE_SIZE;
+		contig_ptes = CONT_PTES;
+		break;
+	}
+
+	return contig_ptes;
+}
+
 /*
  * Changing some bits of contiguous entries requires us to follow a
  * Break-Before-Make approach, breaking the whole contiguous set
@@ -245,6 +269,18 @@ pte_t arch_make_huge_pte(pte_t entry, struct vm_area_struct *vma,
 	return entry;
 }
 
+void huge_pte_clear(struct mm_struct *mm, unsigned long addr,
+		    pte_t *ptep, unsigned long sz)
+{
+	int i, ncontig;
+	size_t pgsize;
+
+	ncontig = num_contig_ptes(sz, &pgsize);
+
+	for (i = 0; i < ncontig; i++, addr += pgsize, ptep++)
+		pte_clear(mm, addr, ptep);
+}
+
 pte_t huge_ptep_get_and_clear(struct mm_struct *mm,
 			      unsigned long addr, pte_t *ptep)
 {
-- 
2.11.0

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox