* + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
@ 2014-02-25 23:53 akpm
[not found] ` <530D9F50.1080400@de.ibm.com>
0 siblings, 1 reply; 19+ messages in thread
From: akpm @ 2014-02-25 23:53 UTC (permalink / raw)
To: mm-commits, viro, schwidefsky, rientjes, riel, peterz, pbonzini,
oleg, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes,
gerald.schaefer, ebiederm, borntraeger, aarcange, athorlton
Subject: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree
To: athorlton@sgi.com,aarcange@redhat.com,borntraeger@de.ibm.com,ebiederm@xmission.com,gerald.schaefer@de.ibm.com,hannes@cmpxchg.org,heiko.carstens@de.ibm.com,kirill.shutemov@linux.intel.com,mgorman@suse.de,mingo@kernel.org,oleg@redhat.com,pbonzini@redhat.com,peterz@infradead.org,riel@redhat.com,rientjes@google.com,schwidefsky@de.ibm.com,viro@zeniv.linux.org.uk
From: akpm@linux-foundation.org
Date: Tue, 25 Feb 2014 15:53:13 -0800
The patch titled
Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
has been added to the -mm tree. Its filename is
mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
This patch should soon appear at
http://ozlabs.org/~akpm/mmots/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
and later at
http://ozlabs.org/~akpm/mmotm/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch
Before you just go and hit "reply", please:
a) Consider who else should be cc'ed
b) Prefer to cc a suitable mailing list as well
c) Ideally: find the original patch on the mailing list and do a
reply-to-all to that, adding suitable additional cc's
*** Remember to use Documentation/SubmitChecklist when testing your code ***
The -mm tree is included into linux-next and is updated
there every 3-4 working days
------------------------------------------------------
From: Alex Thorlton <athorlton@sgi.com>
Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags"
The main motivation behind this patch is to provide a way to disable THP
for jobs where the code cannot be modified, and using a malloc hook with
madvise is not an option (i.e. statically allocated data). This patch
allows us to do just that, without affecting other jobs running on the
system.
We need to do this sort of thing for jobs where THP hurts performance, due
to the possibility of increased remote memory accesses that can be created
by situations such as the following:
When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will be
handed out, and the THP will be stuck on whatever node the chunk was
originally referenced from. If many remote nodes need to do work on that
same chunk, they'll be making remote accesses.
With THP disabled, 4K pages can be handed out to separate nodes as
they're needed, greatly reducing the amount of remote accesses to
memory.
This patch is based on some of my work combined with some
suggestions/patches given by Oleg Nesterov. The main goal here is to add
a prctl switch to allow us to disable to THP on a per mm_struct basis.
Here's a bit of test data with the new patch in place...
First with the flag unset:
# perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 0
Set thp_disabled state to 0
Process pid = 18027
PF/
MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23
Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
1,008,986 context-switches # 0.000 M/sec [100.00%]
7,717 CPU-migrations # 0.000 M/sec [100.00%]
1,698,932 page-faults # 0.000 M/sec
355,222,544,890,379 cycles # 1.298 GHz [100.00%]
536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
148,286,797,266,411 instructions # 0.42 insns per cycle
# 3.62 stalled cycles per insn [100.00%]
27,061,793,159,503 branches # 98.867 M/sec [100.00%]
1,188,655,196 branch-misses # 0.00% of all branches
427.001706337 seconds time elapsed
Now with the flag set:
# perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
Setting thp_disabled for this task...
thp_disable: 1
Set thp_disabled state to 1
Process pid = 144957
PF/
MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23
Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':
138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
534,205 context-switches # 0.000 M/sec [100.00%]
4,595 CPU-migrations # 0.000 M/sec [100.00%]
63,133,119 page-faults # 0.000 M/sec
147,977,747,269,768 cycles # 1.066 GHz [100.00%]
200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
180,916,213,503,160 instructions # 1.22 insns per cycle
# 1.11 stalled cycles per insn [100.00%]
26,999,511,005,868 branches # 194.536 M/sec [100.00%]
714,066,351 branch-misses # 0.00% of all branches
216.196778807 seconds time elapsed
As with previous versions of the patch, We're getting about a 2x
performance increase here. Here's a link to the test case I used, along
with the little wrapper to activate the flag:
http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz
This patch (of 3):
Revert 8e72033f2a489b6c98 and add in code to fix up any issues caused by
the revert.
The revert is necessary because hugepage_madvise would return -EINVAL when
VM_NOHUGEPAGE is set, which will break subsequent chunks of this patch
set.
Here's a snip of an e-mail from Gerald detailing the original purpose of
this code, and providing justification for the revert:
<snip>
The intent of 8e72033f2a48 was to guard against any future programming
errors that may result in an madvice(MADV_HUGEPAGE) on guest mappings,
which would crash the kernel.
Martin suggested adding the bit to arch/s390/mm/pgtable.c, if 8e72033f2a48
was to be reverted, because that check will also prevent a kernel crash
in the case described above, it will now send a SIGSEGV instead.
This would now also allow to do the madvise on other parts, if needed,
so it is a more flexible approach. One could also say that it would have
been better to do it this way right from the beginning...
</snip>
Signed-off-by: Alex Thorlton <athorlton@sgi.com>
Suggested-by: Oleg Nesterov <oleg@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@de.ibm.com>
Cc: Martin Schwidefsky <schwidefsky@de.ibm.com>
Cc: Heiko Carstens <heiko.carstens@de.ibm.com>
Cc: Christian Borntraeger <borntraeger@de.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com>
Cc: Mel Gorman <mgorman@suse.de>
Cc: Rik van Riel <riel@redhat.com>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: "Eric W. Biederman" <ebiederm@xmission.com>
Cc: Alexander Viro <viro@zeniv.linux.org.uk>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: David Rientjes <rientjes@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
arch/s390/mm/pgtable.c | 3 +++
mm/huge_memory.c | 4 ----
2 files changed, 3 insertions(+), 4 deletions(-)
diff -puN arch/s390/mm/pgtable.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags arch/s390/mm/pgtable.c
--- a/arch/s390/mm/pgtable.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags
+++ a/arch/s390/mm/pgtable.c
@@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned
if (!pmd_present(*pmd) &&
__pte_alloc(mm, vma, pmd, vmaddr))
return -ENOMEM;
+ /* large pmds cannot yet be handled */
+ if (pmd_large(*pmd))
+ return -EFAULT;
/* pmd now points to a valid segment table entry. */
rmap = kmalloc(sizeof(*rmap), GFP_KERNEL|__GFP_REPEAT);
if (!rmap)
diff -puN mm/huge_memory.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags mm/huge_memory.c
--- a/mm/huge_memory.c~mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags
+++ a/mm/huge_memory.c
@@ -1891,8 +1891,6 @@ out:
int hugepage_madvise(struct vm_area_struct *vma,
unsigned long *vm_flags, int advice)
{
- struct mm_struct *mm = vma->vm_mm;
^ permalink raw reply [flat|nested] 19+ messages in thread[parent not found: <530D9F50.1080400@de.ibm.com>]
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree [not found] ` <530D9F50.1080400@de.ibm.com> @ 2014-02-26 14:50 ` Oleg Nesterov 2014-02-26 15:06 ` Christian Borntraeger 0 siblings, 1 reply; 19+ messages in thread From: Oleg Nesterov @ 2014-02-26 14:50 UTC (permalink / raw) To: Christian Borntraeger Cc: akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange, athorlton On 02/26, Christian Borntraeger wrote: > > On 26/02/14 00:53, akpm@linux-foundation.org wrote: > > Subject: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree > > To: athorlton@sgi.com,aarcange@redhat.com,borntraeger@de.ibm.com,ebiederm@xmission.com,gerald.schaefer@de.ibm.com,hannes@cmpxchg.org,heiko.carstens@de.ibm.com,kirill.shutemov@linux.intel.com,mgorman@suse.de,mingo@kernel.org,oleg@redhat.com,pbonzini@redhat.com,peterz@infradead.org,riel@redhat.com,rientjes@google.com,schwidefsky@de.ibm.com,viro@zeniv.linux.org.uk > > From: akpm@linux-foundation.org > > Date: Tue, 25 Feb 2014 15:53:13 -0800 > > > > > > The patch titled > > Subject: mm: revert "thp: make MADV_HUGEPAGE check for mm->def_flags" > > has been added to the -mm tree. Its filename is > > mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch > > > > This patch should soon appear at > > http://ozlabs.org/~akpm/mmots/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch > > and later at > > http://ozlabs.org/~akpm/mmotm/broken-out/mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch > > > NAK. > > Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages. > (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this > breaks any recent kvm guest on s390. Well, I can't really discuss the changes in arch/s390. But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ? Otherwise I'd suggest the change below. Oleg. --- x/mm/huge_memory.c +++ x/mm/huge_memory.c @@ -1968,8 +1968,6 @@ out: int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, int advice) { - struct mm_struct *mm = vma->vm_mm; - switch (advice) { case MADV_HUGEPAGE: /* @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru */ if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) return -EINVAL; - if (mm->def_flags & VM_NOHUGEPAGE) + +/* + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). + */ +#ifdef CONFIG_S390 + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE) return -EINVAL; +#endif + *vm_flags &= ~VM_NOHUGEPAGE; *vm_flags |= VM_HUGEPAGE; /* ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 14:50 ` Oleg Nesterov @ 2014-02-26 15:06 ` Christian Borntraeger 2014-02-26 15:22 ` Kirill A. Shutemov 2014-02-26 15:31 ` Oleg Nesterov 0 siblings, 2 replies; 19+ messages in thread From: Christian Borntraeger @ 2014-02-26 15:06 UTC (permalink / raw) To: Oleg Nesterov Cc: akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange, athorlton On 26/02/14 15:50, Oleg Nesterov wrote: [...] >> NAK. >> >> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages. >> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this >> breaks any recent kvm guest on s390. > > Well, I can't really discuss the changes in arch/s390. > > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ? > Otherwise I'd suggest the change below. > > Oleg. > > > --- x/mm/huge_memory.c > +++ x/mm/huge_memory.c > @@ -1968,8 +1968,6 @@ out: > int hugepage_madvise(struct vm_area_struct *vma, > unsigned long *vm_flags, int advice) > { > - struct mm_struct *mm = vma->vm_mm; > - > switch (advice) { > case MADV_HUGEPAGE: > /* > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru > */ > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > return -EINVAL; > - if (mm->def_flags & VM_NOHUGEPAGE) > + > +/* > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > + */ > +#ifdef CONFIG_S390 > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE) > return -EINVAL; > +#endif > + Ifdefs are ugly but might be the only way of not breaking existing userspace. If we come up with a solution for THP in KVM host processes on s390, we can then remove that wart. We could even limit that hack to KVM only processes to retain Alex' prctl capability by checking mm_has_pgste (defined in arch/s390/include/asm/pgtable.h should be included via linux/mm.h) > + > +/* > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > + */ > +#ifdef CONFIG_S390 > + if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm)) > return -EINVAL; > +#endif > + ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 15:06 ` Christian Borntraeger @ 2014-02-26 15:22 ` Kirill A. Shutemov 2014-02-26 15:31 ` Oleg Nesterov 1 sibling, 0 replies; 19+ messages in thread From: Kirill A. Shutemov @ 2014-02-26 15:22 UTC (permalink / raw) To: Christian Borntraeger Cc: Oleg Nesterov, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange, athorlton Christian Borntraeger wrote: > On 26/02/14 15:50, Oleg Nesterov wrote: > [...] > > >> NAK. > >> > >> Since 2012 qemu does call "qemu_madvise(new_block->host, size, QEMU_MADV_HUGEPAGE);" for all kvm pages. > >> (commit ad0b5321f1f797274603ebbe20108b0750baee94 Call MADV_HUGEPAGE for guest RAM allocations) so this > >> breaks any recent kvm guest on s390. > > > > Well, I can't really discuss the changes in arch/s390. > > > > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ? > > Otherwise I'd suggest the change below. > > > > Oleg. > > > > > > --- x/mm/huge_memory.c > > +++ x/mm/huge_memory.c > > @@ -1968,8 +1968,6 @@ out: > > int hugepage_madvise(struct vm_area_struct *vma, > > unsigned long *vm_flags, int advice) > > { > > - struct mm_struct *mm = vma->vm_mm; > > - > > switch (advice) { > > case MADV_HUGEPAGE: > > /* > > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru > > */ > > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > > return -EINVAL; > > - if (mm->def_flags & VM_NOHUGEPAGE) > > + > > +/* > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > > + */ > > +#ifdef CONFIG_S390 > > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE) > > return -EINVAL; > > +#endif > > + > > Ifdefs are ugly IS_ENABLED(CONFIG_S390) should help with this. -- Kirill A. Shutemov ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 15:06 ` Christian Borntraeger 2014-02-26 15:22 ` Kirill A. Shutemov @ 2014-02-26 15:31 ` Oleg Nesterov 2014-02-26 16:55 ` Gerald Schaefer 2014-02-26 16:57 ` Peter Zijlstra 1 sibling, 2 replies; 19+ messages in thread From: Oleg Nesterov @ 2014-02-26 15:31 UTC (permalink / raw) To: Christian Borntraeger Cc: akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange, athorlton On 02/26, Christian Borntraeger wrote: > > On 26/02/14 15:50, Oleg Nesterov wrote: > > > > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ? > > Otherwise I'd suggest the change below. > > > > Oleg. > > > > > > --- x/mm/huge_memory.c > > +++ x/mm/huge_memory.c > > @@ -1968,8 +1968,6 @@ out: > > int hugepage_madvise(struct vm_area_struct *vma, > > unsigned long *vm_flags, int advice) > > { > > - struct mm_struct *mm = vma->vm_mm; > > - > > switch (advice) { > > case MADV_HUGEPAGE: > > /* > > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru > > */ > > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > > return -EINVAL; > > - if (mm->def_flags & VM_NOHUGEPAGE) > > + > > +/* > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > > + */ > > +#ifdef CONFIG_S390 > > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE) > > return -EINVAL; > > +#endif > > + > > Ifdefs are ugly but might be the only way of not breaking existing userspace. Yes, agreed. but note that this code is already s390-specific, nobody else puts VM_NOHUGEPAGE into ->def_flags (until this series of course). > If we come up with a solution for THP in KVM host processes on s390, we can > then remove that wart. We could even limit that hack to KVM only processes > to retain Alex' prctl capability by checking mm_has_pgste Yes, yes, I looked at mm->context.has_pgste too ;) But I wasn't sure we can rely on it. Thanks. > > +/* > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > > + */ > > +#ifdef CONFIG_S390 > > + if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm)) > > return -EINVAL; Hmm... but why do we need to check VM_NOHUGEPAGE then? Can't the change below work? Oleg. --- x/arch/s390/mm/pgtable.c +++ x/arch/s390/mm/pgtable.c @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m vma->vm_flags &= ~VM_HUGEPAGE; vma->vm_flags |= VM_NOHUGEPAGE; } - mm->def_flags |= VM_NOHUGEPAGE; } #else static inline void thp_split_mm(struct mm_struct *mm) --- x/mm/huge_memory.c +++ x/mm/huge_memory.c @@ -1968,8 +1968,6 @@ out: int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, int advice) { - struct mm_struct *mm = vma->vm_mm; - switch (advice) { case MADV_HUGEPAGE: /* @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru */ if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) return -EINVAL; - if (mm->def_flags & VM_NOHUGEPAGE) + +#ifdef CONFIG_S390 + if (mm_has_pgste(vma->vm_mm)) return -EINVAL; +#endif + *vm_flags &= ~VM_NOHUGEPAGE; *vm_flags |= VM_HUGEPAGE; /* ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 15:31 ` Oleg Nesterov @ 2014-02-26 16:55 ` Gerald Schaefer 2014-02-26 16:57 ` Peter Zijlstra 1 sibling, 0 replies; 19+ messages in thread From: Gerald Schaefer @ 2014-02-26 16:55 UTC (permalink / raw) To: Oleg Nesterov Cc: Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, peterz, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, ebiederm, aarcange, athorlton On Wed, 26 Feb 2014 16:31:44 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > On 02/26, Christian Borntraeger wrote: > > > > On 26/02/14 15:50, Oleg Nesterov wrote: > > > > > > But perhaps qemu can be changed to avoid MADV_HUGEPAGE on s390 ? > > > Otherwise I'd suggest the change below. > > > > > > Oleg. > > > > > > > > > --- x/mm/huge_memory.c > > > +++ x/mm/huge_memory.c > > > @@ -1968,8 +1968,6 @@ out: > > > int hugepage_madvise(struct vm_area_struct *vma, > > > unsigned long *vm_flags, int advice) > > > { > > > - struct mm_struct *mm = vma->vm_mm; > > > - > > > switch (advice) { > > > case MADV_HUGEPAGE: > > > /* > > > @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_stru > > > */ > > > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > > > return -EINVAL; > > > - if (mm->def_flags & VM_NOHUGEPAGE) > > > + > > > +/* > > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > > > + */ > > > +#ifdef CONFIG_S390 > > > + if (vma->vm_mm->def_flags & VM_NOHUGEPAGE) > > > return -EINVAL; > > > +#endif > > > + > > > > Ifdefs are ugly but might be the only way of not breaking existing userspace. > > Yes, agreed. but note that this code is already s390-specific, nobody > else puts VM_NOHUGEPAGE into ->def_flags (until this series of course). > > > If we come up with a solution for THP in KVM host processes on s390, we can > > then remove that wart. We could even limit that hack to KVM only processes > > to retain Alex' prctl capability by checking mm_has_pgste > > Yes, yes, I looked at mm->context.has_pgste too ;) But I wasn't sure > we can rely on it. Thanks. > > > > +/* > > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > > > + */ > > > +#ifdef CONFIG_S390 > > > + if ((vma->vm_mm->def_flags & VM_NOHUGEPAGE) && mm_has_pgste(vma->vm_mm)) > > > return -EINVAL; > > Hmm... but why do we need to check VM_NOHUGEPAGE then? Can't the > change below work? Yes, good point, that should do the trick. > > Oleg. > > --- x/arch/s390/mm/pgtable.c > +++ x/arch/s390/mm/pgtable.c > @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m > vma->vm_flags &= ~VM_HUGEPAGE; > vma->vm_flags |= VM_NOHUGEPAGE; > } > - mm->def_flags |= VM_NOHUGEPAGE; > } > #else > static inline void thp_split_mm(struct mm_struct *mm) > --- x/mm/huge_memory.c > +++ x/mm/huge_memory.c > @@ -1968,8 +1968,6 @@ out: > int hugepage_madvise(struct vm_area_struct *vma, > unsigned long *vm_flags, int advice) > { > - struct mm_struct *mm = vma->vm_mm; > - > switch (advice) { > case MADV_HUGEPAGE: > /* > @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru > */ > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > return -EINVAL; > - if (mm->def_flags & VM_NOHUGEPAGE) > + > +#ifdef CONFIG_S390 > + if (mm_has_pgste(vma->vm_mm)) > return -EINVAL; > +#endif > + > *vm_flags &= ~VM_NOHUGEPAGE; > *vm_flags |= VM_HUGEPAGE; > /* > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 15:31 ` Oleg Nesterov 2014-02-26 16:55 ` Gerald Schaefer @ 2014-02-26 16:57 ` Peter Zijlstra 2014-02-26 17:22 ` Alex Thorlton 2014-02-26 18:08 ` Oleg Nesterov 1 sibling, 2 replies; 19+ messages in thread From: Peter Zijlstra @ 2014-02-26 16:57 UTC (permalink / raw) To: Oleg Nesterov Cc: Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange, athorlton On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote: > --- x/arch/s390/mm/pgtable.c > +++ x/arch/s390/mm/pgtable.c > @@ -1084,7 +1084,6 @@ static inline void thp_split_mm(struct m > vma->vm_flags &= ~VM_HUGEPAGE; > vma->vm_flags |= VM_NOHUGEPAGE; > } > - mm->def_flags |= VM_NOHUGEPAGE; > } > #else > static inline void thp_split_mm(struct mm_struct *mm) > --- x/mm/huge_memory.c > +++ x/mm/huge_memory.c > @@ -1968,8 +1968,6 @@ out: > int hugepage_madvise(struct vm_area_struct *vma, > unsigned long *vm_flags, int advice) > { > - struct mm_struct *mm = vma->vm_mm; > - > switch (advice) { > case MADV_HUGEPAGE: > /* > @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru > */ > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > return -EINVAL; > - if (mm->def_flags & VM_NOHUGEPAGE) > + > +#ifdef CONFIG_S390 Do we want a comment here, explaining why s390 is special again? > + if (mm_has_pgste(vma->vm_mm)) > return -EINVAL; > +#endif > + > *vm_flags &= ~VM_NOHUGEPAGE; > *vm_flags |= VM_HUGEPAGE; > /* > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 16:57 ` Peter Zijlstra @ 2014-02-26 17:22 ` Alex Thorlton 2014-02-26 18:06 ` Oleg Nesterov 2014-02-26 18:08 ` Oleg Nesterov 1 sibling, 1 reply; 19+ messages in thread From: Alex Thorlton @ 2014-02-26 17:22 UTC (permalink / raw) To: Peter Zijlstra Cc: Oleg Nesterov, Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On Wed, Feb 26, 2014 at 05:57:59PM +0100, Peter Zijlstra wrote: > On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote: > Do we want a comment here, explaining why s390 is special again? Here's what I've got, with everybody's suggestions spun together: diff --git a/mm/huge_memory.c b/mm/huge_memory.c index 82166bf..7f01491 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1968,8 +1968,6 @@ out: int hugepage_madvise(struct vm_area_struct *vma, unsigned long *vm_flags, int advice) { - struct mm_struct *mm = vma->vm_mm; - switch (advice) { case MADV_HUGEPAGE: /* @@ -1977,8 +1975,16 @@ int hugepage_madvise(struct vm_area_struct *vma, */ if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) return -EINVAL; - if (mm->def_flags & VM_NOHUGEPAGE) + +/* + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). + */ +#ifdef CONFIG_S390 + if (mm_has_pgste(vma->vm_mm)) return -EINVAL; +#endif + *vm_flags &= ~VM_NOHUGEPAGE; *vm_flags |= VM_HUGEPAGE; /* I've compiled and tested and it works ok on my machines (I don't have an s390 to test on though :). Is everybody okay with this solution? BTW, Kirill, I looked at using IS_ENABLED to clean up the ifdef, but it won't work here, since mm_has_pgste isn't defined unless CONFIG_S390 is defined. i.e.: This line: if (IS_ENABLED(CONFIG_S390) && mm_has_pgste(vma->vm_mm)) won't compile. ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 17:22 ` Alex Thorlton @ 2014-02-26 18:06 ` Oleg Nesterov 2014-02-26 19:05 ` Gerald Schaefer ` (2 more replies) 0 siblings, 3 replies; 19+ messages in thread From: Oleg Nesterov @ 2014-02-26 18:06 UTC (permalink / raw) To: Alex Thorlton Cc: Peter Zijlstra, Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On 02/26, Alex Thorlton wrote: > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > + */ > +#ifdef CONFIG_S390 > + if (mm_has_pgste(vma->vm_mm)) > return -EINVAL; > +#endif The comment is not really right... And personally I think that @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment, if (!pmd_present(*pmd) && __pte_alloc(mm, vma, pmd, vmaddr)) return -ENOMEM; + /* large pmds cannot yet be handled */ + if (pmd_large(*pmd)) + return -EFAULT; change still makes sense, so that we can simply revert this s390- specific hack in hugepage_madvise(). I'd suggest the patch below on top of your changes, but I won't argue. It would be nice to also change thp_split_mm() to not not play with mm->def_flags, but I am not sure if we can do this. Oleg. --- Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie() As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm. Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Oleg Nesterov <oleg@redhat.com> diff --git a/mm/huge_memory.c b/mm/huge_memory.c index a4310a5..0e08d92 100644 --- a/mm/huge_memory.c +++ b/mm/huge_memory.c @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma, { switch (advice) { case MADV_HUGEPAGE: +#ifdef CONFIG_S390 + /* + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages + * and expects it must fail on s390. Avoid a possible SIGSEGV + * until qemu is changed. + */ + if (mm_has_pgste(vma->vm_mm)) + return -EINVAL; +#endif /* * Be somewhat over-protective like KSM for now! */ if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) return -EINVAL; + *vm_flags &= ~VM_NOHUGEPAGE; *vm_flags |= VM_HUGEPAGE; /* ^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 18:06 ` Oleg Nesterov @ 2014-02-26 19:05 ` Gerald Schaefer 2014-02-27 16:45 ` Oleg Nesterov 2014-02-26 19:27 ` Christian Borntraeger 2014-02-26 20:41 ` Paolo Bonzini 2 siblings, 1 reply; 19+ messages in thread From: Gerald Schaefer @ 2014-02-26 19:05 UTC (permalink / raw) To: Oleg Nesterov Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, ebiederm, aarcange On Wed, 26 Feb 2014 19:06:03 +0100 Oleg Nesterov <oleg@redhat.com> wrote: > On 02/26, Alex Thorlton wrote: > > > > + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > > + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > > + */ > > +#ifdef CONFIG_S390 > > + if (mm_has_pgste(vma->vm_mm)) > > return -EINVAL; > > +#endif > > The comment is not really right... > > And personally I think that > > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment, > if (!pmd_present(*pmd) && > __pte_alloc(mm, vma, pmd, vmaddr)) > return -ENOMEM; > + /* large pmds cannot yet be handled */ > + if (pmd_large(*pmd)) > + return -EFAULT; > > change still makes sense, so that we can simply revert this s390- > specific hack in hugepage_madvise(). Yes, agreed. > > I'd suggest the patch below on top of your changes, but I won't argue. > > It would be nice to also change thp_split_mm() to not not play with > mm->def_flags, but I am not sure if we can do this. Hmm, I'm also wondering about this. Basically, we only need VM_NOHUGEPAGE in vma->vm_flags, which is done for all existing vmas in thp_split_mm(). But if there should be new vmas created afterwards, it would still be necessary to also have VM_NOHUGEPAGE in mm->def_flags, because the vm_flags for new vmas will be set via OR of mm->def_flags, e.g. in do_brk() and do_mmap_pgoff(). I guess the question is if new vmas can be created for the qemu/kvm host process? Anyway, this would then have to be a separate patch, to keep the "revertability" of this hack. > > Oleg. > --- > > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie() > > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm. > > Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> > Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com> > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index a4310a5..0e08d92 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma, > { > switch (advice) { > case MADV_HUGEPAGE: > +#ifdef CONFIG_S390 > + /* > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages > + * and expects it must fail on s390. Avoid a possible SIGSEGV > + * until qemu is changed. > + */ > + if (mm_has_pgste(vma->vm_mm)) > + return -EINVAL; > +#endif > /* > * Be somewhat over-protective like KSM for now! > */ > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > return -EINVAL; > + > *vm_flags &= ~VM_NOHUGEPAGE; > *vm_flags |= VM_HUGEPAGE; > /* > ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 19:05 ` Gerald Schaefer @ 2014-02-27 16:45 ` Oleg Nesterov 0 siblings, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2014-02-27 16:45 UTC (permalink / raw) To: Gerald Schaefer Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, ebiederm, aarcange On 02/26, Gerald Schaefer wrote: > > On Wed, 26 Feb 2014 19:06:03 +0100 > Oleg Nesterov <oleg@redhat.com> wrote: > > > It would be nice to also change thp_split_mm() to not not play with > > mm->def_flags, but I am not sure if we can do this. > > Hmm, I'm also wondering about this. Basically, we only need VM_NOHUGEPAGE > in vma->vm_flags, which is done for all existing vmas in thp_split_mm(). > But if there should be new vmas created afterwards, it would still be > necessary to also have VM_NOHUGEPAGE in mm->def_flags, because the > vm_flags for new vmas will be set via OR of mm->def_flags, e.g. in > do_brk() and do_mmap_pgoff(). Yes, exactly, this was my concern. And while I know nothing about s390, it seems to me that huge pages should be forbidden for any vma if ->has_pgste was set. > Anyway, this would then have to be a separate patch, to keep the > "revertability" of this hack. Agreed. Thanks! Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 18:06 ` Oleg Nesterov 2014-02-26 19:05 ` Gerald Schaefer @ 2014-02-26 19:27 ` Christian Borntraeger 2014-02-26 19:39 ` Alex Thorlton 2014-02-26 20:41 ` Paolo Bonzini 2 siblings, 1 reply; 19+ messages in thread From: Christian Borntraeger @ 2014-02-26 19:27 UTC (permalink / raw) To: Oleg Nesterov, Alex Thorlton Cc: Peter Zijlstra, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On 26/02/14 19:06, Oleg Nesterov wrote: > On 02/26, Alex Thorlton wrote: >> >> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because >> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). >> + */ >> +#ifdef CONFIG_S390 >> + if (mm_has_pgste(vma->vm_mm)) >> return -EINVAL; >> +#endif > > The comment is not really right... > > And personally I think that > > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment, > if (!pmd_present(*pmd) && > __pte_alloc(mm, vma, pmd, vmaddr)) > return -ENOMEM; > + /* large pmds cannot yet be handled */ > + if (pmd_large(*pmd)) > + return -EFAULT; > > change still makes sense, so that we can simply revert this s390- > specific hack in hugepage_madvise(). Yes, it still makes sense to cover existing THPs here. > I'd suggest the patch below on top of your changes, but I won't argue. > > It would be nice to also change thp_split_mm() to not not play with > mm->def_flags, but I am not sure if we can do this. > > Oleg. > --- > > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie() > > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm. > > Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> > Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com> > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > index a4310a5..0e08d92 100644 > --- a/mm/huge_memory.c > +++ b/mm/huge_memory.c > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma, > { > switch (advice) { > case MADV_HUGEPAGE: > +#ifdef CONFIG_S390 > + /* > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages > + * and expects it must fail on s390. Avoid a possible SIGSEGV > + * until qemu is changed. I prefer: * until kvm/s390 can handle large pages in the host. Otherwise qemu has to be changed again, if we get THP working for kvm. > + */ > + if (mm_has_pgste(vma->vm_mm)) > + return -EINVAL; > +#endif > /* > * Be somewhat over-protective like KSM for now! > */ > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > return -EINVAL; > + Unrelated white space? > *vm_flags &= ~VM_NOHUGEPAGE; > *vm_flags |= VM_HUGEPAGE; > /* > With the comment and white space change: Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Thanks for the quick patch ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 19:27 ` Christian Borntraeger @ 2014-02-26 19:39 ` Alex Thorlton 2014-02-26 23:24 ` Andrew Morton 0 siblings, 1 reply; 19+ messages in thread From: Alex Thorlton @ 2014-02-26 19:39 UTC (permalink / raw) To: Christian Borntraeger Cc: Oleg Nesterov, Peter Zijlstra, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On Wed, Feb 26, 2014 at 08:27:36PM +0100, Christian Borntraeger wrote: > On 26/02/14 19:06, Oleg Nesterov wrote: > > On 02/26, Alex Thorlton wrote: > >> > >> + * MADV_HUGEPAGE after PRCTL_THP_DISABLE is broken on s390 because > >> + * qemu blindly does madvise(MADV_HUGEPAGE) after s390_enable_sie(). > >> + */ > >> +#ifdef CONFIG_S390 > >> + if (mm_has_pgste(vma->vm_mm)) > >> return -EINVAL; > >> +#endif > > > > The comment is not really right... > > > > And personally I think that > > > > @@ -504,6 +504,9 @@ static int gmap_connect_pgtable(unsigned long address, unsigned long segment, > > if (!pmd_present(*pmd) && > > __pte_alloc(mm, vma, pmd, vmaddr)) > > return -ENOMEM; > > + /* large pmds cannot yet be handled */ > > + if (pmd_large(*pmd)) > > + return -EFAULT; > > > > change still makes sense, so that we can simply revert this s390- > > specific hack in hugepage_madvise(). > > Yes, it still makes sense to cover existing THPs here. Yes, it does. I just snipped the chunk of the original patch that I actually changed in my last e-mail. I didn't intend to actually remove the above portion in the final patch - sorry for the confusion! > > > > I'd suggest the patch below on top of your changes, but I won't argue. I like this approach, and your updated comment as well. This should probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 with this new patch included in the series, or will Andrew pull this in? > > > > It would be nice to also change thp_split_mm() to not not play with > > mm->def_flags, but I am not sure if we can do this. > > > > Oleg. > > --- > > > > Subject: [PATCH] s390: make sure MADV_HUGEPAGE fails after s390_enable_sie() > > > > As Christian pointed out, the recent 'Revert "thp: make MADV_HUGEPAGE > > check for mm->def_flags"' breaks qemu, it does QEMU_MADV_HUGEPAGE for > > all kvm pages but this doesn't work after s390_enable_sie/thp_split_mm. > > > > Reported-by: Christian Borntraeger <borntraeger@de.ibm.com> > > Suggested-by: Christian Borntraeger <borntraeger@de.ibm.com> > > Signed-off-by: Oleg Nesterov <oleg@redhat.com> > > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c > > index a4310a5..0e08d92 100644 > > --- a/mm/huge_memory.c > > +++ b/mm/huge_memory.c > > @@ -1970,11 +1970,22 @@ int hugepage_madvise(struct vm_area_struct *vma, > > { > > switch (advice) { > > case MADV_HUGEPAGE: > > +#ifdef CONFIG_S390 > > + /* > > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu > > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages > > + * and expects it must fail on s390. Avoid a possible SIGSEGV > > + * until qemu is changed. > > I prefer: > * until kvm/s390 can handle large pages in the host. > > Otherwise qemu has to be changed again, if we get THP working for kvm. > > > + */ > > + if (mm_has_pgste(vma->vm_mm)) > > + return -EINVAL; > > +#endif > > /* > > * Be somewhat over-protective like KSM for now! > > */ > > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > > return -EINVAL; > > + > > Unrelated white space? > > *vm_flags &= ~VM_NOHUGEPAGE; > > *vm_flags |= VM_HUGEPAGE; > > /* > > > > > > With the comment and white space change: > > Tested-by: Christian Borntraeger <borntraeger@de.ibm.com> > Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> > > Thanks for the quick patch Yes, thank you all for the quick responses, and for looking over these patches! ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 19:39 ` Alex Thorlton @ 2014-02-26 23:24 ` Andrew Morton 2014-02-27 0:01 ` Alex Thorlton 0 siblings, 1 reply; 19+ messages in thread From: Andrew Morton @ 2014-02-26 23:24 UTC (permalink / raw) To: Alex Thorlton Cc: Christian Borntraeger, Oleg Nesterov, Peter Zijlstra, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <athorlton@sgi.com> wrote: > > > > > > > I'd suggest the patch below on top of your changes, but I won't argue. > > I like this approach, and your updated comment as well. This should > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 > with this new patch included in the series, or will Andrew pull this in? Paolo's comments on Oleg's patch remain unaddressed, so please take a look at that and send out the final version? An incremental patch would be preferred, please. I'll later fold that into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to avoid any bisection holes. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 23:24 ` Andrew Morton @ 2014-02-27 0:01 ` Alex Thorlton 2014-02-27 17:26 ` Alex Thorlton 0 siblings, 1 reply; 19+ messages in thread From: Alex Thorlton @ 2014-02-27 0:01 UTC (permalink / raw) To: Andrew Morton Cc: Christian Borntraeger, Oleg Nesterov, Peter Zijlstra, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote: > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <athorlton@sgi.com> wrote: > > > > > > > > > > > I'd suggest the patch below on top of your changes, but I won't argue. > > > > I like this approach, and your updated comment as well. This should > > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 > > with this new patch included in the series, or will Andrew pull this in? > > Paolo's comments on Oleg's patch remain unaddressed, so please take a > look at that and send out the final version? Got it. I'll get that to you tonight/tomorrow morning. > An incremental patch would be preferred, please. I'll later fold that > into mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch to > avoid any bisection holes. Understood. I'll keep them separate. Thanks, Andrew! - Alex ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-27 0:01 ` Alex Thorlton @ 2014-02-27 17:26 ` Alex Thorlton 0 siblings, 0 replies; 19+ messages in thread From: Alex Thorlton @ 2014-02-27 17:26 UTC (permalink / raw) To: Andrew Morton Cc: Christian Borntraeger, Oleg Nesterov, Peter Zijlstra, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange On Wed, Feb 26, 2014 at 06:01:53PM -0600, Alex Thorlton wrote: > On Wed, Feb 26, 2014 at 03:24:34PM -0800, Andrew Morton wrote: > > On Wed, 26 Feb 2014 13:39:17 -0600 Alex Thorlton <athorlton@sgi.com> wrote: > > > > > > > > > > > > > > > I'd suggest the patch below on top of your changes, but I won't argue. > > > > > > I like this approach, and your updated comment as well. This should > > > probably go in as [PATCH 2/4] in my series. Do I need to spin a v5 > > > with this new patch included in the series, or will Andrew pull this in? > > > > Paolo's comments on Oleg's patch remain unaddressed, so please take a > > look at that and send out the final version? > > Got it. I'll get that to you tonight/tomorrow morning. Just kicked out another version of the patch that should cover all comments/suggestions. Let me know if you need anything else from me! - Alex ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 18:06 ` Oleg Nesterov 2014-02-26 19:05 ` Gerald Schaefer 2014-02-26 19:27 ` Christian Borntraeger @ 2014-02-26 20:41 ` Paolo Bonzini 2014-02-27 16:34 ` Oleg Nesterov 2 siblings, 1 reply; 19+ messages in thread From: Paolo Bonzini @ 2014-02-26 20:41 UTC (permalink / raw) To: Oleg Nesterov Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, mingo, mgorman, kirill shutemov, heiko carstens, hannes, gerald schaefer, ebiederm, aarcange > +#ifdef CONFIG_S390 > + /* > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages > + * and expects it must fail on s390. Avoid a possible SIGSEGV > + * until qemu is changed. > + */ > + if (mm_has_pgste(vma->vm_mm)) > + return -EINVAL; > +#endif The comment is not quite true. QEMU doesn't expect that the madvise fails. It simply expects that the madvise doesn't cause SIGSEGVs or later malfunctioning, because (quoting tha man page) madvise "does not influence the semantics of the application". There's nothing to fix in QEMU, in fact I believe this should return 0 rather than -EINVAL. It's perfectly legal for the kernel to ignore an madvise, and this is what should happen here. Paolo ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 20:41 ` Paolo Bonzini @ 2014-02-27 16:34 ` Oleg Nesterov 0 siblings, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2014-02-27 16:34 UTC (permalink / raw) To: Paolo Bonzini Cc: Alex Thorlton, Peter Zijlstra, Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, mingo, mgorman, kirill shutemov, heiko carstens, hannes, gerald schaefer, ebiederm, aarcange On 02/26, Paolo Bonzini wrote: > > > +#ifdef CONFIG_S390 > > + /* > > + * MADV_HUGEPAGE is broken after s390_enable_sie(), qemu > > + * blindly does madvise(MADV_HUGEPAGE) for for all kvm pages > > + * and expects it must fail on s390. Avoid a possible SIGSEGV > > + * until qemu is changed. > > + */ > > + if (mm_has_pgste(vma->vm_mm)) > > + return -EINVAL; > > +#endif > > The comment is not quite true. QEMU doesn't expect that the madvise fails. Yes, sorry. I didn't mean "it expects -EINVAL". > It simply expects that the madvise doesn't cause SIGSEGVs or later > malfunctioning, because (quoting tha man page) madvise "does not influence > the semantics of the application". Yes, I understand. But currently this means "MADV_HUGEPAGE should not actually work", this is what I tried to say. > There's nothing to fix in QEMU, I was going to argue, but this is probably true too. In short: I agree with any comment ;) Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree 2014-02-26 16:57 ` Peter Zijlstra 2014-02-26 17:22 ` Alex Thorlton @ 2014-02-26 18:08 ` Oleg Nesterov 1 sibling, 0 replies; 19+ messages in thread From: Oleg Nesterov @ 2014-02-26 18:08 UTC (permalink / raw) To: Peter Zijlstra Cc: Christian Borntraeger, akpm, linux-kernel, viro, schwidefsky, rientjes, riel, pbonzini, mingo, mgorman, kirill.shutemov, heiko.carstens, hannes, gerald.schaefer, ebiederm, aarcange, athorlton On 02/26, Peter Zijlstra wrote: > > On Wed, Feb 26, 2014 at 04:31:44PM +0100, Oleg Nesterov wrote: > > /* > > @@ -1977,8 +1975,12 @@ int hugepage_madvise(struct vm_area_stru > > */ > > if (*vm_flags & (VM_HUGEPAGE | VM_NO_THP)) > > return -EINVAL; > > - if (mm->def_flags & VM_NOHUGEPAGE) > > + > > +#ifdef CONFIG_S390 > > Do we want a comment here, explaining why s390 is special again? Yes, yes, sure. Especially because this is (hopefully) the temporary hack. Oleg. ^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2014-02-27 17:26 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2014-02-25 23:53 + mm-revert-thp-make-madv_hugepage-check-for-mm-def_flags.patch added to -mm tree akpm
[not found] ` <530D9F50.1080400@de.ibm.com>
2014-02-26 14:50 ` Oleg Nesterov
2014-02-26 15:06 ` Christian Borntraeger
2014-02-26 15:22 ` Kirill A. Shutemov
2014-02-26 15:31 ` Oleg Nesterov
2014-02-26 16:55 ` Gerald Schaefer
2014-02-26 16:57 ` Peter Zijlstra
2014-02-26 17:22 ` Alex Thorlton
2014-02-26 18:06 ` Oleg Nesterov
2014-02-26 19:05 ` Gerald Schaefer
2014-02-27 16:45 ` Oleg Nesterov
2014-02-26 19:27 ` Christian Borntraeger
2014-02-26 19:39 ` Alex Thorlton
2014-02-26 23:24 ` Andrew Morton
2014-02-27 0:01 ` Alex Thorlton
2014-02-27 17:26 ` Alex Thorlton
2014-02-26 20:41 ` Paolo Bonzini
2014-02-27 16:34 ` Oleg Nesterov
2014-02-26 18:08 ` Oleg Nesterov
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.