LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v3 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
From: Sam Bobroff @ 2018-07-25  5:26 UTC (permalink / raw)
  To: Paul Mackerras; +Cc: linuxppc-dev, kvm, kvm-ppc, david, clg
In-Reply-To: <20180723054337.GA29207@fergus>

[-- Attachment #1: Type: text/plain, Size: 5175 bytes --]

On Mon, Jul 23, 2018 at 03:43:37PM +1000, Paul Mackerras wrote:
> On Thu, Jul 19, 2018 at 12:25:10PM +1000, Sam Bobroff wrote:
> > From: Sam Bobroff <sam.bobroff@au1.ibm.com>
> > 
> > It is not currently possible to create the full number of possible
> > VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
> > threads per core than it's core stride (or "VSMT mode"). This is
> > because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
> > even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> > 
> > To address this, "pack" the VCORE ID and XIVE offsets by using
> > knowledge of the way the VCPU IDs will be used when there are less
> > guest threads per core than the core stride. The primary thread of
> > each core will always be used first. Then, if the guest uses more than
> > one thread per core, these secondary threads will sequentially follow
> > the primary in each core.
> > 
> > So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> > VCPUs are being spaced apart, so at least half of each core is empty
> > and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> > into the second half of each core (4..7, in an 8-thread core).
> > 
> > Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> > each core is being left empty, and we can map down into the second and
> > third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> > 
> > Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> > threads are being used and 7/8 of the core is empty, allowing use of
> > the 1, 3, 5 and 7 thread slots.
> > 
> > (Strides less than 8 are handled similarly.)
> > 
> > This allows the VCORE ID or offset to be calculated quickly from the
> > VCPU ID or XIVE server numbers, without access to the VCPU structure.
> > 
> > Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
> 
> I have some comments relating to the situation where the stride
> (i.e. kvm->arch.emul_smt_mode) is less than 8; see below.
> 
> [snip]
> > +static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
> > +{
> > +	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 3, 5, 7};
> 
> This needs to be {0, 4, 2, 6, 1, 5, 3, 7} (with the 3 and 5 swapped
> from what you have) for the case when stride == 4 and block == 3.  In
> that case we need block_offsets[block] to be 3; if it is 5, then we
> will collide with the case where block == 2 for the next virtual core.

Agh! Yes it does.

> > +	int stride = kvm->arch.emul_smt_mode;
> > +	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
> > +	u32 packed_id;
> > +
> > +	BUG_ON(block >= MAX_SMT_THREADS);
> > +	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
> > +	BUG_ON(packed_id >= KVM_MAX_VCPUS);
> > +	return packed_id;
> > +}
> > +
> >  #endif /* __ASM_KVM_BOOK3S_H__ */
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index de686b340f4a..363c2fb0d89e 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> > @@ -1816,7 +1816,7 @@ static int threads_per_vcore(struct kvm *kvm)
> >  	return threads_per_subcore;
> >  }
> >  
> > -static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> > +static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
> >  {
> >  	struct kvmppc_vcore *vcore;
> >  
> > @@ -1830,7 +1830,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
> >  	init_swait_queue_head(&vcore->wq);
> >  	vcore->preempt_tb = TB_NIL;
> >  	vcore->lpcr = kvm->arch.lpcr;
> > -	vcore->first_vcpuid = core * kvm->arch.smt_mode;
> > +	vcore->first_vcpuid = id;
> >  	vcore->kvm = kvm;
> >  	INIT_LIST_HEAD(&vcore->preempt_list);
> >  
> > @@ -2048,12 +2048,18 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
> >  	mutex_lock(&kvm->lock);
> >  	vcore = NULL;
> >  	err = -EINVAL;
> > -	core = id / kvm->arch.smt_mode;
> > +	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
> > +		BUG_ON(kvm->arch.smt_mode != 1);
> > +		core = kvmppc_pack_vcpu_id(kvm, id);
> 
> We now have a way for userspace to trigger a BUG_ON, as far as I can
> see.  The only check on id up to this point is that it is less than
> KVM_MAX_VCPU_ID, which means that the BUG_ON(block >= MAX_SMT_THREADS)
> can be triggered, if kvm->arch.emul_smt_mode < MAX_SMT_THREADS, by
> giving an id that is greater than or equal to KVM_MAX_VCPUS *
> kvm->arch.emul_smt+mode.
> 
> > +	} else {
> > +		core = id / kvm->arch.smt_mode;
> > +	}
> >  	if (core < KVM_MAX_VCORES) {
> >  		vcore = kvm->arch.vcores[core];
> > +		BUG_ON(cpu_has_feature(CPU_FTR_ARCH_300) && vcore);
> 
> Doesn't this just mean that userspace has chosen an id big enough to
> cause a collision in the output space of kvmppc_pack_vcpu_id()?  How
> is this not user-triggerable?
> 
> Paul.

Yep, good point. Particularly when dealing with a malicious userspace
that won't follow QEMU's allocation pattern.

I'll re-work it and re-post. I'll discuss the changes in the next
version.

Thanks for the review!

Sam.

[-- Attachment #2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

^ permalink raw reply

* [PATCH v4 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
From: Sam Bobroff @ 2018-07-25  6:12 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: kvm, kvm-ppc, paulus, david, clg

From: Sam Bobroff <sam.bobroff@au1.ibm.com>

It is not currently possible to create the full number of possible
VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
threads per core than it's core stride (or "VSMT mode"). This is
because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
even though the VCPU ID is less than KVM_MAX_VCPU_ID.

To address this, "pack" the VCORE ID and XIVE offsets by using
knowledge of the way the VCPU IDs will be used when there are less
guest threads per core than the core stride. The primary thread of
each core will always be used first. Then, if the guest uses more than
one thread per core, these secondary threads will sequentially follow
the primary in each core.

So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
VCPUs are being spaced apart, so at least half of each core is empty
and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
into the second half of each core (4..7, in an 8-thread core).

Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
each core is being left empty, and we can map down into the second and
third quarters of each core (2, 3 and 5, 6 in an 8-thread core).

Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
threads are being used and 7/8 of the core is empty, allowing use of
the 1, 5, 3 and 7 thread slots.

(Strides less than 8 are handled similarly.)

This allows the VCORE ID or offset to be calculated quickly from the
VCPU ID or XIVE server numbers, without access to the VCPU structure.

Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>
---
Hello everyone,

Here's v4, addressing comments from Paul. See below for the changes.

Cheers,
Sam.

Patch set v4:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
* The block offsets are corrected for the vsmt=4 case (swap 3 and 5).
* The initial checks on the VCPU ID are changed to be against (KVM_MAX_VCPUS *
  emul_smt_mode) (a tigher bound).
* The BUG_ON()s in kvmppc_pack_vcpu_id() are changed to a WARN_ONCE(), and
  they'll return 0 in that case. KVM_MAX_VCPUS is another possible return value
  in this case, but it seemed safer to use 0 as it's not likely to cause a
  memory overrun if it's used. The warning shouldn't ever be triggered when
  VCPU IDs are allocate using the scheme that QEMU uses.
* The vcore creation code is adjusted to handle collisions by returning -EINVAL
  instead of bugging out.

Patch set v3:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

Patch set v2:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
* Corrected places in kvm/book3s_xive.c where IDs weren't packed.
* Because kvmppc_pack_vcpu_id() is only called on P9, there is no need to test "emul_smt_mode > 1", so remove it.
* Re-ordered block_offsets[] to be more ascending.
* Added more detailed description of the packing algorithm.

Patch set v1:
Patch 1/1: KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space

 arch/powerpc/include/asm/kvm_book3s.h | 46 +++++++++++++++++++++++++++++++++++
 arch/powerpc/kvm/book3s_hv.c          | 25 ++++++++++++++-----
 arch/powerpc/kvm/book3s_xive.c        | 19 +++++++++------
 3 files changed, 77 insertions(+), 13 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index 1f345a0b6ba2..dc5f35c6a6eb 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -390,4 +390,50 @@ extern int kvmppc_h_logical_ci_store(struct kvm_vcpu *vcpu);
 #define SPLIT_HACK_MASK			0xff000000
 #define SPLIT_HACK_OFFS			0xfb000000
 
+/* Pack a VCPU ID from the [0..KVM_MAX_VCPU_ID) space down to the
+ * [0..KVM_MAX_VCPUS) space, while using knowledge of the guest's core stride
+ * (but not it's actual threading mode, which is not available) to avoid
+ * collisions.
+ *
+ * The implementation leaves VCPU IDs from the range [0..KVM_MAX_VCPUS) (block
+ * 0) unchanged: if the guest is filling each VCORE completely then it will be
+ * using consecutive IDs and it will fill the space without any packing.
+ *
+ * For higher VCPU IDs, the packed ID is based on the VCPU ID modulo
+ * KVM_MAX_VCPUS (effectively masking off the top bits) and then an offset is
+ * added to avoid collisions.
+ *
+ * VCPU IDs in the range [KVM_MAX_VCPUS..(KVM_MAX_VCPUS*2)) (block 1) are only
+ * possible if the guest is leaving at least 1/2 of each VCORE empty, so IDs
+ * can be safely packed into the second half of each VCORE by adding an offset
+ * of (stride / 2).
+ *
+ * Similarly, if VCPU IDs in the range [(KVM_MAX_VCPUS*2)..(KVM_MAX_VCPUS*4))
+ * (blocks 2 and 3) are seen, the guest must be leaving at least 3/4 of each
+ * VCORE empty so packed IDs can be offset by (stride / 4) and (stride * 3 / 4).
+ *
+ * Finally, VCPU IDs from blocks 5..7 will only be seen if the guest is using a
+ * stride of 8 and 1 thread per core so the remaining offsets of 1, 5, 3 and 7
+ * must be free to use.
+ *
+ * (The offsets for each block are stored in block_offsets[], indexed by the
+ * block number if the stride is 8. For cases where the guest's stride is less
+ * than 8, we can re-use the block_offsets array by multiplying the block
+ * number by (MAX_SMT_THREADS / stride) to reach the correct entry.)
+ */
+static inline u32 kvmppc_pack_vcpu_id(struct kvm *kvm, u32 id)
+{
+	const int block_offsets[MAX_SMT_THREADS] = {0, 4, 2, 6, 1, 5, 3, 7};
+	int stride = kvm->arch.emul_smt_mode;
+	int block = (id / KVM_MAX_VCPUS) * (MAX_SMT_THREADS / stride);
+	u32 packed_id;
+
+	if (WARN_ONCE(block >= MAX_SMT_THREADS, "VCPU ID too large to pack"))
+		return 0;
+	packed_id = (id % KVM_MAX_VCPUS) + block_offsets[block];
+	if (WARN_ONCE(packed_id >= KVM_MAX_VCPUS, "VCPU ID packing failed"))
+		return 0;
+	return packed_id;
+}
+
 #endif /* __ASM_KVM_BOOK3S_H__ */
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index de686b340f4a..4cfd3a137dfe 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -1816,7 +1816,7 @@ static int threads_per_vcore(struct kvm *kvm)
 	return threads_per_subcore;
 }
 
-static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
+static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int id)
 {
 	struct kvmppc_vcore *vcore;
 
@@ -1830,7 +1830,7 @@ static struct kvmppc_vcore *kvmppc_vcore_create(struct kvm *kvm, int core)
 	init_swait_queue_head(&vcore->wq);
 	vcore->preempt_tb = TB_NIL;
 	vcore->lpcr = kvm->arch.lpcr;
-	vcore->first_vcpuid = core * kvm->arch.smt_mode;
+	vcore->first_vcpuid = id;
 	vcore->kvm = kvm;
 	INIT_LIST_HEAD(&vcore->preempt_list);
 
@@ -1989,10 +1989,15 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 						   unsigned int id)
 {
 	struct kvm_vcpu *vcpu;
-	int err;
+	int err = -EINVAL;
 	int core;
 	struct kvmppc_vcore *vcore;
 
+	if (id >= (KVM_MAX_VCPUS * kvm->arch.emul_smt_mode)) {
+		WARN_ONCE(true, "DNCI: VCPU ID too high\n");
+		goto out;
+	}
+
 	err = -ENOMEM;
 	vcpu = kmem_cache_zalloc(kvm_vcpu_cache, GFP_KERNEL);
 	if (!vcpu)
@@ -2048,12 +2053,20 @@ static struct kvm_vcpu *kvmppc_core_vcpu_create_hv(struct kvm *kvm,
 	mutex_lock(&kvm->lock);
 	vcore = NULL;
 	err = -EINVAL;
-	core = id / kvm->arch.smt_mode;
+	if (cpu_has_feature(CPU_FTR_ARCH_300)) {
+		BUG_ON(kvm->arch.smt_mode != 1);
+		core = kvmppc_pack_vcpu_id(kvm, id);
+	} else {
+		core = id / kvm->arch.smt_mode;
+	}
 	if (core < KVM_MAX_VCORES) {
 		vcore = kvm->arch.vcores[core];
-		if (!vcore) {
+		if (WARN_ONCE(vcore && cpu_has_feature(CPU_FTR_ARCH_300),
+			      "collision on id %u", core)) {
+			vcore = NULL;
+		} else if (!vcore) {
 			err = -ENOMEM;
-			vcore = kvmppc_vcore_create(kvm, core);
+			vcore = kvmppc_vcore_create(kvm, id & ~(kvm->arch.smt_mode - 1));
 			kvm->arch.vcores[core] = vcore;
 			kvm->arch.online_vcores++;
 		}
diff --git a/arch/powerpc/kvm/book3s_xive.c b/arch/powerpc/kvm/book3s_xive.c
index f9818d7d3381..126f02b3ffb8 100644
--- a/arch/powerpc/kvm/book3s_xive.c
+++ b/arch/powerpc/kvm/book3s_xive.c
@@ -317,6 +317,11 @@ static int xive_select_target(struct kvm *kvm, u32 *server, u8 prio)
 	return -EBUSY;
 }
 
+static u32 xive_vp(struct kvmppc_xive *xive, u32 server)
+{
+	return xive->vp_base + kvmppc_pack_vcpu_id(xive->kvm, server);
+}
+
 static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
 			     struct kvmppc_xive_src_block *sb,
 			     struct kvmppc_xive_irq_state *state)
@@ -362,7 +367,7 @@ static u8 xive_lock_and_mask(struct kvmppc_xive *xive,
 	 */
 	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
 		xive_native_configure_irq(hw_num,
-					  xive->vp_base + state->act_server,
+					  xive_vp(xive, state->act_server),
 					  MASKED, state->number);
 		/* set old_p so we can track if an H_EOI was done */
 		state->old_p = true;
@@ -418,7 +423,7 @@ static void xive_finish_unmask(struct kvmppc_xive *xive,
 	 */
 	if (xd->flags & OPAL_XIVE_IRQ_MASK_VIA_FW) {
 		xive_native_configure_irq(hw_num,
-					  xive->vp_base + state->act_server,
+					  xive_vp(xive, state->act_server),
 					  state->act_priority, state->number);
 		/* If an EOI is needed, do it here */
 		if (!state->old_p)
@@ -495,7 +500,7 @@ static int xive_target_interrupt(struct kvm *kvm,
 	kvmppc_xive_select_irq(state, &hw_num, NULL);
 
 	return xive_native_configure_irq(hw_num,
-					 xive->vp_base + server,
+					 xive_vp(xive, server),
 					 prio, state->number);
 }
 
@@ -883,7 +888,7 @@ int kvmppc_xive_set_mapped(struct kvm *kvm, unsigned long guest_irq,
 	 * which is fine for a never started interrupt.
 	 */
 	xive_native_configure_irq(hw_irq,
-				  xive->vp_base + state->act_server,
+				  xive_vp(xive, state->act_server),
 				  state->act_priority, state->number);
 
 	/*
@@ -959,7 +964,7 @@ int kvmppc_xive_clr_mapped(struct kvm *kvm, unsigned long guest_irq,
 
 	/* Reconfigure the IPI */
 	xive_native_configure_irq(state->ipi_number,
-				  xive->vp_base + state->act_server,
+				  xive_vp(xive, state->act_server),
 				  state->act_priority, state->number);
 
 	/*
@@ -1084,7 +1089,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 		pr_devel("Duplicate !\n");
 		return -EEXIST;
 	}
-	if (cpu >= KVM_MAX_VCPUS) {
+	if (cpu >= (KVM_MAX_VCPUS * vcpu->kvm->arch.emul_smt_mode)) {
 		pr_devel("Out of bounds !\n");
 		return -EINVAL;
 	}
@@ -1098,7 +1103,7 @@ int kvmppc_xive_connect_vcpu(struct kvm_device *dev,
 	xc->xive = xive;
 	xc->vcpu = vcpu;
 	xc->server_num = cpu;
-	xc->vp_id = xive->vp_base + cpu;
+	xc->vp_id = xive_vp(xive, cpu);
 	xc->mfrr = 0xff;
 	xc->valid = true;
 
-- 
2.16.1.74.g9b0b1f47b

^ permalink raw reply related

* Re: [PATCH v7 4/4] kexec_file: Load kernel at top of system RAM if required
From: Baoquan He @ 2018-07-25  6:48 UTC (permalink / raw)
  To: Michal Hocko
  Cc: Andrew Morton, linux-kernel, robh+dt, dan.j.williams,
	nicolas.pitre, josh, fengguang.wu, bp, andy.shevchenko,
	patrik.r.jakobsson, airlied, kys, haiyangz, sthemmin,
	dmitry.torokhov, frowand.list, keith.busch, jonathan.derrick,
	lorenzo.pieralisi, bhelgaas, tglx, brijesh.singh, jglisse,
	thomas.lendacky, gregkh, baiyaowei, richard.weiyang, devel,
	linux-input, linux-nvdimm, devicetree, linux-pci, ebiederm,
	vgoyal, dyoung, yinghai, monstr, davem, chris, jcmvbkbc, gustavo,
	maarten.lankhorst, seanpaul, linux-parisc, linuxppc-dev, kexec
In-Reply-To: <20180723143443.GD18181@dhcp22.suse.cz>

On 07/23/18 at 04:34pm, Michal Hocko wrote:
> On Thu 19-07-18 23:17:53, Baoquan He wrote:
> > Kexec has been a formal feature in our distro, and customers owning
> > those kind of very large machine can make use of this feature to speed
> > up the reboot process. On uefi machine, the kexec_file loading will
> > search place to put kernel under 4G from top to down. As we know, the
> > 1st 4G space is DMA32 ZONE, dma, pci mmcfg, bios etc all try to consume
> > it. It may have possibility to not be able to find a usable space for
> > kernel/initrd. From the top down of the whole memory space, we don't
> > have this worry. 
> 
> I do not have the full context here but let me note that you should be
> careful when doing top-down reservation because you can easily get into
> hotplugable memory and break the hotremove usecase. We even warn when
> this is done. See memblock_find_in_range_node

Kexec read kernel/initrd file into buffer, just search usable positions
for them to do the later copying. You can see below struct kexec_segment, 
for the old kexec_load, kernel/initrd are read into user space buffer,
the @buf stores the user space buffer address, @mem stores the position
where kernel/initrd will be put. In kernel, it calls
kimage_load_normal_segment() to copy user space buffer to intermediate
pages which are allocated with flag GFP_KERNEL. These intermediate pages
are recorded as entries, later when user execute "kexec -e" to trigger
kexec jumping, it will do the final copying from the intermediate pages
to the real destination pages which @mem pointed. Because we can't touch
the existed data in 1st kernel when do kexec kernel loading. With my
understanding, GFP_KERNEL will make those intermediate pages be
allocated inside immovable area, it won't impact hotplugging. But the
@mem we searched in the whole system RAM might be lost along with
hotplug. Hence we need do kexec kernel again when hotplug event is
detected.

#define KEXEC_CONTROL_MEMORY_GFP (GFP_KERNEL | __GFP_NORETRY)

struct kexec_segment {
        /*
         * This pointer can point to user memory if kexec_load() system
         * call is used or will point to kernel memory if
         * kexec_file_load() system call is used.
         *
         * Use ->buf when expecting to deal with user memory and use ->kbuf
         * when expecting to deal with kernel memory.
         */
        union {
                void __user *buf;
                void *kbuf;
        };
        size_t bufsz;                                                                                                                             
        unsigned long mem;
        size_t memsz;
};

Thanks
Baoquan

^ permalink raw reply

* Re: [PATCH 0/7] powerpc: Modernize unhandled signals message
From: Michael Neuling @ 2018-07-25  7:00 UTC (permalink / raw)
  To: Murilo Opsfelder Araujo, linux-kernel
  Cc: Alastair D'Silva, Andrew Donnellan, Balbir Singh,
	Benjamin Herrenschmidt, Christophe Leroy, Cyril Bur,
	Eric W . Biederman, Michael Ellerman, Nicholas Piggin,
	Paul Mackerras, Simon Guo, Sukadev Bhattiprolu, Tobin C . Harding,
	linuxppc-dev
In-Reply-To: <20180724192720.32417-1-muriloo@linux.ibm.com>

 On Tue, 2018-07-24 at 16:27 -0300, Murilo Opsfelder Araujo wrote:
> Hi, everyone.
>=20
> This series was inspired by the need to modernize and display more
> informative messages about unhandled signals.
>=20
> The "unhandled signal NN" is not very informative.  We thought it would
> be helpful adding a human-readable message describing what the signal
> number means, printing the VMA address, and dumping the instructions.
>=20
> We can add more informative messages, like informing what each code of a
> SIGSEGV signal means.  We are open to suggestions.
>=20
> I have collected some early feedback from Michael Ellerman about this
> series and would love to hear more feedback from you all.

Nice.. the instruction dump would have been very handy when debugging the P=
CR
init issue I had a month or so back.

> Before this series:
>=20
>     Jul 24 13:01:07 localhost kernel: pandafault[5989]: unhandled signal =
11 at 00000000100007d0 nip 000000001000061c lr 00003fff85a75100 code 2
>=20
> After this series:
>=20
>     Jul 24 13:08:01 localhost kernel: pandafault[10758]: segfault (11) at=
 00000000100007d0 nip 000000001000061c lr 00007fffabc85100 code 2 in pandaf=
ault[10000000+10000]
>     Jul 24 13:08:01 localhost kernel: Instruction dump:
>     Jul 24 13:08:01 localhost kernel: 4bfffeec 4bfffee8 3c401002 38427f00=
 fbe1fff8 f821ffc1 7c3f0b78 3d22fffe
>     Jul 24 13:08:01 localhost kernel: 392988d0 f93f0020 e93f0020 39400048=
 <99490000> 39200000 7d234b78 383f0040

What happens if we get a sudden flood of these from different processes tha=
t
overlap their output? Are we going to be able to match up the process with
instruction dump?

Should we prefix every line with the PID to avoid this?

Mikey

^ permalink raw reply

* Re: [PATCH v4 00/11] hugetlb: Factorize hugetlb architecture primitives
From: Alexandre Ghiti @ 2018-07-25  7:38 UTC (permalink / raw)
  To: Paul Burton
  Cc: linux, catalin.marinas, will.deacon, tony.luck, fenghua.yu, ralf,
	jhogan, jejb, deller, benh, paulus, mpe, ysato, dalias, davem,
	tglx, mingo, hpa, x86, arnd, linux-arm-kernel, linux-kernel,
	linux-ia64, linux-mips, linux-parisc, linuxppc-dev, linux-sh,
	sparclinux, linux-arch, Naoya Horiguchi, Mike Kravetz
In-Reply-To: <20180725003428.jsklz7pj4m7lj3m4@pburton-laptop>

Hi Paul,

Thanks for having tested it, I remove mips from my list.

Thanks again,

Alex


On 07/25/2018 02:34 AM, Paul Burton wrote:
> Hi Alexandre,
>
> On Thu, Jul 05, 2018 at 11:07:05AM +0000, Alexandre Ghiti wrote:
>> In order to reduce copy/paste of functions across architectures and then
>> make riscv hugetlb port (and future ports) simpler and smaller, this
>> patchset intends to factorize the numerous hugetlb primitives that are
>> defined across all the architectures.
>>
>> Except for prepare_hugepage_range, this patchset moves the versions that
>> are just pass-through to standard pte primitives into
>> asm-generic/hugetlb.h by using the same #ifdef semantic that can be
>> found in asm-generic/pgtable.h, i.e. __HAVE_ARCH_***.
>>
>> s390 architecture has not been tackled in this serie since it does not
>> use asm-generic/hugetlb.h at all.
>> powerpc could be factorized a bit more (cf huge_ptep_set_wrprotect).
>>
>> This patchset has been compiled on x86 only.
> For MIPS these look good - I don't see any issues & they pass a build
> test (using cavium_octeon_defconfig which enables huge pages), so:
>
>      Acked-by: Paul Burton <paul.burton@mips.com> # MIPS parts
>
> Thanks,
>      Paul

^ permalink raw reply

* Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure
From: zhong jiang @ 2018-07-25  9:04 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi, linux-kernel, linux-mm, haren, npiggin, bsingharora,
	paulmck, Tim Chen, linuxppc-dev, x86
In-Reply-To: <3ec16be6-ce2a-99af-4805-e00a59bad677@linux.vnet.ibm.com>

On 2018/7/25 0:10, Laurent Dufour wrote:
>
> On 24/07/2018 16:26, zhong jiang wrote:
>> On 2018/5/17 19:06, Laurent Dufour wrote:
>>> From: Peter Zijlstra <peterz@infradead.org>
>>>
>>> Provide infrastructure to do a speculative fault (not holding
>>> mmap_sem).
>>>
>>> The not holding of mmap_sem means we can race against VMA
>>> change/removal and page-table destruction. We use the SRCU VMA freeing
>>> to keep the VMA around. We use the VMA seqcount to detect change
>>> (including umapping / page-table deletion) and we use gup_fast() style
>>> page-table walking to deal with page-table races.
>>>
>>> Once we've obtained the page and are ready to update the PTE, we
>>> validate if the state we started the fault with is still valid, if
>>> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
>>> PTE and we're done.
>>>
>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>
>>> [Manage the newly introduced pte_spinlock() for speculative page
>>>  fault to fail if the VMA is touched in our back]
>>> [Rename vma_is_dead() to vma_has_changed() and declare it here]
>>> [Fetch p4d and pud]
>>> [Set vmd.sequence in __handle_mm_fault()]
>>> [Abort speculative path when handle_userfault() has to be called]
>>> [Add additional VMA's flags checks in handle_speculative_fault()]
>>> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
>>> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
>>> [Remove warning comment about waiting for !seq&1 since we don't want
>>>  to wait]
>>> [Remove warning about no huge page support, mention it explictly]
>>> [Don't call do_fault() in the speculative path as __do_fault() calls
>>>  vma->vm_ops->fault() which may want to release mmap_sem]
>>> [Only vm_fault pointer argument for vma_has_changed()]
>>> [Fix check against huge page, calling pmd_trans_huge()]
>>> [Use READ_ONCE() when reading VMA's fields in the speculative path]
>>> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>>>  processing done in vm_normal_page()]
>>> [Check that vma->anon_vma is already set when starting the speculative
>>>  path]
>>> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>>>  the processing done in mpol_misplaced()]
>>> [Don't support VMA growing up or down]
>>> [Move check on vm_sequence just before calling handle_pte_fault()]
>>> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
>>> [Add mem cgroup oom check]
>>> [Use READ_ONCE to access p*d entries]
>>> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
>>> [Don't fetch pte again in handle_pte_fault() when running the speculative
>>>  path]
>>> [Check PMD against concurrent collapsing operation]
>>> [Try spin lock the pte during the speculative path to avoid deadlock with
>>>  other CPU's invalidating the TLB and requiring this CPU to catch the
>>>  inter processor's interrupt]
>>> [Move define of FAULT_FLAG_SPECULATIVE here]
>>> [Introduce __handle_speculative_fault() and add a check against
>>>  mm->mm_users in handle_speculative_fault() defined in mm.h]
>>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>>> ---
>>>  include/linux/hugetlb_inline.h |   2 +-
>>>  include/linux/mm.h             |  30 ++++
>>>  include/linux/pagemap.h        |   4 +-
>>>  mm/internal.h                  |  16 +-
>>>  mm/memory.c                    | 340 ++++++++++++++++++++++++++++++++++++++++-
>>>  5 files changed, 385 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
>>> index 0660a03d37d9..9e25283d6fc9 100644
>>> --- a/include/linux/hugetlb_inline.h
>>> +++ b/include/linux/hugetlb_inline.h
>>> @@ -8,7 +8,7 @@
>>>  
>>>  static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>>>  {
>>> -	return !!(vma->vm_flags & VM_HUGETLB);
>>> +	return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>>>  }
>>>  
>>>  #else
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 05cbba70104b..31acf98a7d92 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -315,6 +315,7 @@ extern pgprot_t protection_map[16];
>>>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>>>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>>>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>>> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
>>>  
>>>  #define FAULT_FLAG_TRACE \
>>>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
>>> @@ -343,6 +344,10 @@ struct vm_fault {
>>>  	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>>  	unsigned long address;		/* Faulting virtual address */
>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>> +	unsigned int sequence;
>>> +	pmd_t orig_pmd;			/* value of PMD at the time of fault */
>>> +#endif
>>>  	pmd_t *pmd;			/* Pointer to pmd entry matching
>>>  					 * the 'address' */
>>>  	pud_t *pud;			/* Pointer to pud entry matching
>>> @@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page);
>>>  #ifdef CONFIG_MMU
>>>  extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>  		unsigned int flags);
>>> +
>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>> +extern int __handle_speculative_fault(struct mm_struct *mm,
>>> +				      unsigned long address,
>>> +				      unsigned int flags);
>>> +static inline int handle_speculative_fault(struct mm_struct *mm,
>>> +					   unsigned long address,
>>> +					   unsigned int flags)
>>> +{
>>> +	/*
>>> +	 * Try speculative page fault for multithreaded user space task only.
>>> +	 */
>>> +	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
>>> +		return VM_FAULT_RETRY;
>>> +	return __handle_speculative_fault(mm, address, flags);
>>> +}
>>> +#else
>>> +static inline int handle_speculative_fault(struct mm_struct *mm,
>>> +					   unsigned long address,
>>> +					   unsigned int flags)
>>> +{
>>> +	return VM_FAULT_RETRY;
>>> +}
>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>> +
>>>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>>  			    unsigned long address, unsigned int fault_flags,
>>>  			    bool *unlocked);
>>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>>> index b1bd2186e6d2..6e2aa4e79af7 100644
>>> --- a/include/linux/pagemap.h
>>> +++ b/include/linux/pagemap.h
>>> @@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>>>  	pgoff_t pgoff;
>>>  	if (unlikely(is_vm_hugetlb_page(vma)))
>>>  		return linear_hugepage_index(vma, address);
>>> -	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>>> -	pgoff += vma->vm_pgoff;
>>> +	pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
>>> +	pgoff += READ_ONCE(vma->vm_pgoff);
>>>  	return pgoff;
>>>  }
>>>  
>>> diff --git a/mm/internal.h b/mm/internal.h
>>> index fb2667b20f0a..10b188c87fa4 100644
>>> --- a/mm/internal.h
>>> +++ b/mm/internal.h
>>> @@ -44,7 +44,21 @@ int do_swap_page(struct vm_fault *vmf);
>>>  extern struct vm_area_struct *get_vma(struct mm_struct *mm,
>>>  				      unsigned long addr);
>>>  extern void put_vma(struct vm_area_struct *vma);
>>> -#endif
>>> +
>>> +static inline bool vma_has_changed(struct vm_fault *vmf)
>>> +{
>>> +	int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
>>> +	unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
>>> +
>>> +	/*
>>> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
>>> +	 * the wmb in vma_rb_erase().
>>> +	 */
>>> +	smp_rmb();
>>> +
>>> +	return ret || seq != vmf->sequence;
>>> +}
>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>  
>>>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>>  		unsigned long floor, unsigned long ceiling);
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index ab32b0b4bd69..7bbbb8c7b9cd 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -769,7 +769,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>>  	if (page)
>>>  		dump_page(page, "bad pte");
>>>  	pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
>>> -		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>>> +		 (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
>>> +		 mapping, index);
>>>  	pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
>>>  		 vma->vm_file,
>>>  		 vma->vm_ops ? vma->vm_ops->fault : NULL,
>>> @@ -2306,6 +2307,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>>  }
>>>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>>>  
>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>> +static bool pte_spinlock(struct vm_fault *vmf)
>>> +{
>>> +	bool ret = false;
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +	pmd_t pmdval;
>>> +#endif
>>> +
>>> +	/* Check if vma is still valid */
>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>> +		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>> +		spin_lock(vmf->ptl);
>>> +		return true;
>>> +	}
>>> +
>>> +again:
>>> +	local_irq_disable();
>>> +	if (vma_has_changed(vmf))
>>> +		goto out;
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +	/*
>>> +	 * We check if the pmd value is still the same to ensure that there
>>> +	 * is not a huge collapse operation in progress in our back.
>>> +	 */
>>> +	pmdval = READ_ONCE(*vmf->pmd);
>>> +	if (!pmd_same(pmdval, vmf->orig_pmd))
>>> +		goto out;
>>> +#endif
>>> +
>>> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>> +	if (unlikely(!spin_trylock(vmf->ptl))) {
>>> +		local_irq_enable();
>>> +		goto again;
>>> +	}
>>> +
>>> +	if (vma_has_changed(vmf)) {
>>> +		spin_unlock(vmf->ptl);
>>> +		goto out;
>>> +	}
>>> +
>>> +	ret = true;
>>> +out:
>>> +	local_irq_enable();
>>> +	return ret;
>>> +}
>>> +
>>> +static bool pte_map_lock(struct vm_fault *vmf)
>>> +{
>>> +	bool ret = false;
>>> +	pte_t *pte;
>>> +	spinlock_t *ptl;
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +	pmd_t pmdval;
>>> +#endif
>>> +
>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>> +		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>> +					       vmf->address, &vmf->ptl);
>>> +		return true;
>>> +	}
>>> +
>>> +	/*
>>> +	 * The first vma_has_changed() guarantees the page-tables are still
>>> +	 * valid, having IRQs disabled ensures they stay around, hence the
>>> +	 * second vma_has_changed() to make sure they are still valid once
>>> +	 * we've got the lock. After that a concurrent zap_pte_range() will
>>> +	 * block on the PTL and thus we're safe.
>>> +	 */
>>> +again:
>>> +	local_irq_disable();
>>> +	if (vma_has_changed(vmf))
>>> +		goto out;
>>> +
>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>> +	/*
>>> +	 * We check if the pmd value is still the same to ensure that there
>>> +	 * is not a huge collapse operation in progress in our back.
>>> +	 */
>>> +	pmdval = READ_ONCE(*vmf->pmd);
>>> +	if (!pmd_same(pmdval, vmf->orig_pmd))
>>> +		goto out;
>>> +#endif
>>> +
>>> +	/*
>>> +	 * Same as pte_offset_map_lock() except that we call
>>> +	 * spin_trylock() in place of spin_lock() to avoid race with
>>> +	 * unmap path which may have the lock and wait for this CPU
>>> +	 * to invalidate TLB but this CPU has irq disabled.
>>> +	 * Since we are in a speculative patch, accept it could fail
>>> +	 */
>>> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>> +	pte = pte_offset_map(vmf->pmd, vmf->address);
>>> +	if (unlikely(!spin_trylock(ptl))) {
>>> +		pte_unmap(pte);
>>> +		local_irq_enable();
>>> +		goto again;
>>> +	}
>>> +
>>> +	if (vma_has_changed(vmf)) {
>>> +		pte_unmap_unlock(pte, ptl);
>>> +		goto out;
>>> +	}
>>> +
>>> +	vmf->pte = pte;
>>> +	vmf->ptl = ptl;
>>> +	ret = true;
>>> +out:
>>> +	local_irq_enable();
>>> +	return ret;
>>> +}
>>> +#else
>>>  static inline bool pte_spinlock(struct vm_fault *vmf)
>>>  {
>>>  	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>> @@ -2319,6 +2432,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>>>  				       vmf->address, &vmf->ptl);
>>>  	return true;
>>>  }
>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>  
>>>  /*
>>>   * handle_pte_fault chooses page fault handler according to an entry which was
>>> @@ -3208,6 +3322,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>>  		ret = check_stable_address_space(vma->vm_mm);
>>>  		if (ret)
>>>  			goto unlock;
>>> +		/*
>>> +		 * Don't call the userfaultfd during the speculative path.
>>> +		 * We already checked for the VMA to not be managed through
>>> +		 * userfaultfd, but it may be set in our back once we have lock
>>> +		 * the pte. In such a case we can ignore it this time.
>>> +		 */
>>> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>> +			goto setpte;
>>>  		/* Deliver the page fault to userland, check inside PT lock */
>>>  		if (userfaultfd_missing(vma)) {
>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>> @@ -3249,7 +3371,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>>  		goto unlock_and_release;
>>>  
>>>  	/* Deliver the page fault to userland, check inside PT lock */
>>> -	if (userfaultfd_missing(vma)) {
>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) && userfaultfd_missing(vma)) {
>>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>  		mem_cgroup_cancel_charge(page, memcg, false);
>>>  		put_page(page);
>>> @@ -3994,13 +4116,22 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>  
>>>  	if (unlikely(pmd_none(*vmf->pmd))) {
>>>  		/*
>>> +		 * In the case of the speculative page fault handler we abort
>>> +		 * the speculative path immediately as the pmd is probably
>>> +		 * in the way to be converted in a huge one. We will try
>>> +		 * again holding the mmap_sem (which implies that the collapse
>>> +		 * operation is done).
>>> +		 */
>>> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>> +			return VM_FAULT_RETRY;
>>> +		/*
>>>  		 * Leave __pte_alloc() until later: because vm_ops->fault may
>>>  		 * want to allocate huge page, and if we expose page table
>>>  		 * for an instant, it will be difficult to retract from
>>>  		 * concurrent faults and from rmap lookups.
>>>  		 */
>>>  		vmf->pte = NULL;
>>> -	} else {
>>> +	} else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>  		/* See comment in pte_alloc_one_map() */
>>>  		if (pmd_devmap_trans_unstable(vmf->pmd))
>>>  			return 0;
>>> @@ -4009,6 +4140,9 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>  		 * pmd from under us anymore at this point because we hold the
>>>  		 * mmap_sem read mode and khugepaged takes it in write mode.
>>>  		 * So now it's safe to run pte_offset_map().
>>> +		 * This is not applicable to the speculative page fault handler
>>> +		 * but in that case, the pte is fetched earlier in
>>> +		 * handle_speculative_fault().
>>>  		 */
>>>  		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>>>  		vmf->orig_pte = *vmf->pte;
>>> @@ -4031,6 +4165,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>  	if (!vmf->pte) {
>>>  		if (vma_is_anonymous(vmf->vma))
>>>  			return do_anonymous_page(vmf);
>>> +		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>> +			return VM_FAULT_RETRY;
>>>  		else
>>>  			return do_fault(vmf);
>>>  	}
>>> @@ -4128,6 +4264,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>  	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>>>  	if (!vmf.pmd)
>>>  		return VM_FAULT_OOM;
>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>> +	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
>>> +#endif
>>>  	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
>>>  		ret = create_huge_pmd(&vmf);
>>>  		if (!(ret & VM_FAULT_FALLBACK))
>>> @@ -4161,6 +4300,201 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>  	return handle_pte_fault(&vmf);
>>>  }
>>>  
>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>> +/*
>>> + * Tries to handle the page fault in a speculative way, without grabbing the
>>> + * mmap_sem.
>>> + */
>>> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>> +			       unsigned int flags)
>>> +{
>>> +	struct vm_fault vmf = {
>>> +		.address = address,
>>> +	};
>>> +	pgd_t *pgd, pgdval;
>>> +	p4d_t *p4d, p4dval;
>>> +	pud_t pudval;
>>> +	int seq, ret = VM_FAULT_RETRY;
>>> +	struct vm_area_struct *vma;
>>> +#ifdef CONFIG_NUMA
>>> +	struct mempolicy *pol;
>>> +#endif
>>> +
>>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>>> +	flags |= FAULT_FLAG_SPECULATIVE;
>>> +
>>> +	vma = get_vma(mm, address);
>>> +	if (!vma)
>>> +		return ret;
>>> +
>>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>>> +	if (seq & 1)
>>> +		goto out_put;
>>> +
>>> +	/*
>>> +	 * Can't call vm_ops service has we don't know what they would do
>>> +	 * with the VMA.
>>> +	 * This include huge page from hugetlbfs.
>>> +	 */
>>> +	if (vma->vm_ops)
>>> +		goto out_put;
>>> +
>>   Hi   Laurent
>>    
>>    I think that most of pagefault will leave here.   Is there any case  need to skip ?
>>   I have tested the following  patch, it work well.
> Hi Zhong,
>
> Well this will allow file mapping to be handle in a speculative way, but that's
> a bit dangerous today as there is no guaranty that the vm_ops.vm_fault()
> operation will be fair.
>
> In the case of the anonymous file mapping that's often not a problem, depending
> on the underlying file system, but there are so many cases to check and this is
> hard to say this can be done in a speculative way as is.
 This patch say that spf just handle anonyous page. but I find that do_swap_page
 also maybe release the mmap_sem without FAULT_FLAG_RETRY_NOWAIT. why is it safe
 to handle the case. I think that the case is similar to file page. Maybe I miss
 something else.

 I test the patches and find just only 18% of the pagefault will enter into the
 speculative page fault during a process startup. As I had said. most of pagefault
 will be handled by ops->fault. I do not know the data you had posted is how to get.
 

 Thanks
 zhong jiang
> The huge work to do is to double check that all the code called by
> vm_ops.fault() is not dealing with the mmap_sem, which could be handled using
> FAULT_FLAG_RETRY_NOWAIT, and care is also needed about the resources that code
> is managing as it may assume that it is under the protection of the mmap_sem in
> read mode, and that can be done implicitly.
>
> Cheers,
> Laurent.
>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index 936128b..9bc1545 100644
>>  @@ -3893,8 +3898,6 @@ static int handle_pte_fault(struct fault_env *fe)
>>         if (!fe->pte) {
>>                 if (vma_is_anonymous(fe->vma))
>>                         return do_anonymous_page(fe);
>> -               else if (fe->flags & FAULT_FLAG_SPECULATIVE)
>> -                       return VM_FAULT_RETRY;
>>                 else
>>                         return do_fault(fe);
>>         }
>> @@ -4026,20 +4029,11 @@ int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>                 goto out_put;
>>         }
>>         /*
>> -        * Can't call vm_ops service has we don't know what they would do
>> -        * with the VMA.
>> -        * This include huge page from hugetlbfs.
>> -        */
>> -       if (vma->vm_ops) {
>> -               trace_spf_vma_notsup(_RET_IP_, vma, address);
>> -               goto out_put;
>> -       }
>>
>>
>> Thanks
>> zhong jiang
>>> +	/*
>>> +	 * __anon_vma_prepare() requires the mmap_sem to be held
>>> +	 * because vm_next and vm_prev must be safe. This can't be guaranteed
>>> +	 * in the speculative path.
>>> +	 */
>>> +	if (unlikely(!vma->anon_vma))
>>> +		goto out_put;
>>> +
>>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>>> +
>>> +	/* Can't call userland page fault handler in the speculative path */
>>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>>> +		goto out_put;
>>> +
>>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>>> +		/*
>>> +		 * This could be detected by the check address against VMA's
>>> +		 * boundaries but we want to trace it as not supported instead
>>> +		 * of changed.
>>> +		 */
>>> +		goto out_put;
>>> +
>>> +	if (address < READ_ONCE(vma->vm_start)
>>> +	    || READ_ONCE(vma->vm_end) <= address)
>>> +		goto out_put;
>>> +
>>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>> +				       flags & FAULT_FLAG_INSTRUCTION,
>>> +				       flags & FAULT_FLAG_REMOTE)) {
>>> +		ret = VM_FAULT_SIGSEGV;
>>> +		goto out_put;
>>> +	}
>>> +
>>> +	/* This is one is required to check that the VMA has write access set */
>>> +	if (flags & FAULT_FLAG_WRITE) {
>>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
>>> +			ret = VM_FAULT_SIGSEGV;
>>> +			goto out_put;
>>> +		}
>>> +	} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
>>> +		ret = VM_FAULT_SIGSEGV;
>>> +		goto out_put;
>>> +	}
>>> +
>>> +#ifdef CONFIG_NUMA
>>> +	/*
>>> +	 * MPOL_INTERLEAVE implies additional checks in
>>> +	 * mpol_misplaced() which are not compatible with the
>>> +	 *speculative page fault processing.
>>> +	 */
>>> +	pol = __get_vma_policy(vma, address);
>>> +	if (!pol)
>>> +		pol = get_task_policy(current);
>>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>>> +		goto out_put;
>>> +#endif
>>> +
>>> +	/*
>>> +	 * Do a speculative lookup of the PTE entry.
>>> +	 */
>>> +	local_irq_disable();
>>> +	pgd = pgd_offset(mm, address);
>>> +	pgdval = READ_ONCE(*pgd);
>>> +	if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
>>> +		goto out_walk;
>>> +
>>> +	p4d = p4d_offset(pgd, address);
>>> +	p4dval = READ_ONCE(*p4d);
>>> +	if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
>>> +		goto out_walk;
>>> +
>>> +	vmf.pud = pud_offset(p4d, address);
>>> +	pudval = READ_ONCE(*vmf.pud);
>>> +	if (pud_none(pudval) || unlikely(pud_bad(pudval)))
>>> +		goto out_walk;
>>> +
>>> +	/* Huge pages at PUD level are not supported. */
>>> +	if (unlikely(pud_trans_huge(pudval)))
>>> +		goto out_walk;
>>> +
>>> +	vmf.pmd = pmd_offset(vmf.pud, address);
>>> +	vmf.orig_pmd = READ_ONCE(*vmf.pmd);
>>> +	/*
>>> +	 * pmd_none could mean that a hugepage collapse is in progress
>>> +	 * in our back as collapse_huge_page() mark it before
>>> +	 * invalidating the pte (which is done once the IPI is catched
>>> +	 * by all CPU and we have interrupt disabled).
>>> +	 * For this reason we cannot handle THP in a speculative way since we
>>> +	 * can't safely indentify an in progress collapse operation done in our
>>> +	 * back on that PMD.
>>> +	 * Regarding the order of the following checks, see comment in
>>> +	 * pmd_devmap_trans_unstable()
>>> +	 */
>>> +	if (unlikely(pmd_devmap(vmf.orig_pmd) ||
>>> +		     pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
>>> +		     is_swap_pmd(vmf.orig_pmd)))
>>> +		goto out_walk;
>>> +
>>> +	/*
>>> +	 * The above does not allocate/instantiate page-tables because doing so
>>> +	 * would lead to the possibility of instantiating page-tables after
>>> +	 * free_pgtables() -- and consequently leaking them.
>>> +	 *
>>> +	 * The result is that we take at least one !speculative fault per PMD
>>> +	 * in order to instantiate it.
>>> +	 */
>>> +
>>> +	vmf.pte = pte_offset_map(vmf.pmd, address);
>>> +	vmf.orig_pte = READ_ONCE(*vmf.pte);
>>> +	barrier(); /* See comment in handle_pte_fault() */
>>> +	if (pte_none(vmf.orig_pte)) {
>>> +		pte_unmap(vmf.pte);
>>> +		vmf.pte = NULL;
>>> +	}
>>> +
>>> +	vmf.vma = vma;
>>> +	vmf.pgoff = linear_page_index(vma, address);
>>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>> +	vmf.sequence = seq;
>>> +	vmf.flags = flags;
>>> +
>>> +	local_irq_enable();
>>> +
>>> +	/*
>>> +	 * We need to re-validate the VMA after checking the bounds, otherwise
>>> +	 * we might have a false positive on the bounds.
>>> +	 */
>>> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
>>> +		goto out_put;
>>> +
>>> +	mem_cgroup_oom_enable();
>>> +	ret = handle_pte_fault(&vmf);
>>> +	mem_cgroup_oom_disable();
>>> +
>>> +	put_vma(vma);
>>> +
>>> +	/*
>>> +	 * The task may have entered a memcg OOM situation but
>>> +	 * if the allocation error was handled gracefully (no
>>> +	 * VM_FAULT_OOM), there is no need to kill anything.
>>> +	 * Just clean up the OOM state peacefully.
>>> +	 */
>>> +	if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>>> +		mem_cgroup_oom_synchronize(false);
>>> +	return ret;
>>> +
>>> +out_walk:
>>> +	local_irq_enable();
>>> +out_put:
>>> +	put_vma(vma);
>>> +	return ret;
>>> +}
>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>> +
>>>  /*
>>>   * By the time we get here, we already hold the mm semaphore
>>>   *
>>
>
> .
>

^ permalink raw reply

* [PATCH kernel RFC 1/3] powerpc/pseries/iommu: Allow dynamic window to start from zero
From: Alexey Kardashevskiy @ 2018-07-25  9:50 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc,
	Benjamin Herrenschmidt, Michael Ellerman, Paul Mackerras,
	Russell Currey
In-Reply-To: <20180725095032.2196-1-aik@ozlabs.ru>

At the moment the kernel does not expect dynamic windows to ever start
at zero on a PCI bus as PAPR requires the hypervisor to create a 32bit
default window which starts from zero and the pseries kernel only
creates additional windows.

However PAPR permits removing the default window and creating another
one instead, starting from zero as well. In fact, the kernel used to
remove the default window after sha1 25ebc45b934 but this has been
reverted later.

Since there are devices capable of more than 32 bits for DMA but less than
50, and currently available hardware allows the second window only
at 1<<59, we will need to be able to create bigger windows starting from
zero. This does the initial preparation and should not cause any
behavioral changes.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 06f0296..9ece42f 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -53,6 +53,8 @@
 
 #include "pseries.h"
 
+#define DDW_INVALID_OFFSET	((uint64_t)-1)
+
 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
 	struct iommu_table_group *table_group;
@@ -844,7 +846,7 @@ static u64 find_existing_ddw(struct device_node *pdn)
 {
 	struct direct_window *window;
 	const struct dynamic_dma_window_prop *direct64;
-	u64 dma_addr = 0;
+	u64 dma_addr = DDW_INVALID_OFFSET;
 
 	spin_lock(&direct_window_list_lock);
 	/* check if we already created a window and dupe that config if so */
@@ -992,7 +994,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	mutex_lock(&direct_window_init_mutex);
 
 	dma_addr = find_existing_ddw(pdn);
-	if (dma_addr != 0)
+	if (dma_addr != DDW_INVALID_OFFSET)
 		goto out_unlock;
 
 	/*
@@ -1228,7 +1230,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 		}
 		if (pdn && PCI_DN(pdn)) {
 			dma_offset = enable_ddw(pdev, pdn);
-			if (dma_offset != 0) {
+			if (dma_offset != DDW_INVALID_OFFSET) {
 				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
 				set_dma_offset(dev, dma_offset);
 				set_dma_ops(dev, &dma_nommu_ops);
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel RFC 0/3] powerpc/pseries/iommu: GPU coherent memory pass through
From: Alexey Kardashevskiy @ 2018-07-25  9:50 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc,
	Benjamin Herrenschmidt, Michael Ellerman, Paul Mackerras,
	Russell Currey

I am trying to pass through a 3D controller:
[0302]: NVIDIA Corporation GV100GL [Tesla V100 SXM2] [10de:1db1] (rev a1)

which has a quite unique feature as coherent memory directly accessible
from a POWER9 CPU via an NVLink2 transport.

So in addition to passing a PCI device + accompanying NPU devices,
we will also be passing the host physical address range as it is done
on the bare metal system.

The memory on the host is presented as:

===
[aik@yc02goos ~]$ lsprop /proc/device-tree/memory@42000000000
ibm,chip-id      000000fe (254)
device_type      "memory"
compatible       "ibm,coherent-device-memory"
reg              00000420 00000000 00000020 00000000
linux,usable-memory
                 00000420 00000000 00000000 00000000
phandle          00000726 (1830)
name             "memory"
ibm,associativity
                 00000004 000000fe 000000fe 000000fe 000000fe
===

and the host does not touch it as the second 64bit value of
"linux,usable-memory" - the size - is null. Later on the NVIDIA driver
trains the NVLink2 and probes this memory and this is how it becomes
onlined.

In the virtual environment I am planning on doing the same thing,
however there is a difference in 64bit DMA handling. The powernv
platform uses a PHB3 bypass mode and that just works but
the pseries platform uses DDW RTAS API to achieve the same
result and the problem with this is that we need a huge DMA
window to start from zero (because this GPU supports less than
50bits for DMA address space) and cover not just present memory
but also this new coherent memory.

This is based on sha1
d72e90f3 Linus Torvalds "Linux 4.18-rc6".

Please comment. Thanks.

Alexey Kardashevskiy (3):
  powerpc/pseries/iommu: Allow dynamic window to start from zero
  powerpc/pseries/iommu: Force default DMA window removal
  powerpc/pseries/iommu: Use memory@ nodes in max RAM address
    calculation

 arch/powerpc/platforms/pseries/iommu.c | 77 ++++++++++++++++++++++++++++++----
 1 file changed, 70 insertions(+), 7 deletions(-)

-- 
2.11.0

^ permalink raw reply

* [PATCH kernel RFC 2/3] powerpc/pseries/iommu: Force default DMA window removal
From: Alexey Kardashevskiy @ 2018-07-25  9:50 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc,
	Benjamin Herrenschmidt, Michael Ellerman, Paul Mackerras,
	Russell Currey
In-Reply-To: <20180725095032.2196-1-aik@ozlabs.ru>

It is quite common for a device to support more than 32bit but less than
64bit for DMA, for example, GPUs often support 42..50bits. However
the pseries platform only allows huge DMA window (the one which allows
the use of more than 2GB of DMA space) for 64bit-capable devices mostly
because:

1. we may have 32bit and >32bit devices on the same IOMMU domain and
we cannot place the new big window where the 32bit one is located;

2. the existing hardware only supports the second window at very high
offset of 1<<59 == 0x0800.0000.0000.0000.

So in order to allow 33..59bit DMA, we have to remove the default DMA
window and place a huge one there instead.

The PAPR spec says that the platform may decide not to use the default
window and remove it using DDW RTAS calls. There are few possible ways
for the platform to decide:

1. look at the device IDs and decide in advance that such and such
devices are capable of more than 32bit DMA (powernv's sketchy bypass
does something like this - it drops the default window if all devices
on the PE are from the same vendor) - this is not great as involves
guessing because, unlike sketchy bypass, the GPU case involves 2 vendor
ids and does not scale;

2. advertise 1 available DMA window in the hypervisor via
ibm,query-pe-dma-window so the pseries platform could take it as a clue
that if more bits for DMA are needed, it has to remove the default
window - this is not great as it is implicit clue rather than direct
instruction;

3. removing the default DMA window at all it not really an option as
PAPR mandates its presense at the guest boot time;

4. make the hypervisor explicitly tell the guest that the default window
is better be removed so the guest does not have to think hard and can
simply do what requested and this is what this patch does.

This makes use of the latter approach and exploits a new
"qemu,dma-force-remove-default" flag in a vPHB.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 26 +++++++++++++++++++++++---
 1 file changed, 23 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 9ece42f..840afe5 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -54,6 +54,7 @@
 #include "pseries.h"

 #define DDW_INVALID_OFFSET	((uint64_t)-1)
+#define DDW_INVALID_LIOBN	((uint32_t)-1)

 static struct iommu_table_group *iommu_pseries_alloc_group(int node)
 {
@@ -977,7 +978,8 @@ static LIST_HEAD(failed_ddw_pdn_list);
  *
  * returns the dma offset for use by dma_set_mask
  */
-static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
+static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
+		u32 default_liobn)
 {
 	int len, ret;
 	struct ddw_query_response query;
@@ -1022,6 +1024,16 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn)
 	if (ret)
 		goto out_failed;

+	/*
+	 * The device tree has a request to force remove the default window,
+	 * do this.
+	 */
+	if (default_liobn != DDW_INVALID_LIOBN && (!ddw_avail[2] ||
+			rtas_call(ddw_avail[2], 1, 1, NULL, default_liobn))) {
+		dev_dbg(&dev->dev, "Could not remove window");
+		goto out_failed;
+	}
+
        /*
 	 * Query if there is a second window of size to map the
 	 * whole partition.  Query returns number of windows, largest
@@ -1212,7 +1224,7 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 	pdev = to_pci_dev(dev);

 	/* only attempt to use a new window if 64-bit DMA is requested */
-	if (!disable_ddw && dma_mask == DMA_BIT_MASK(64)) {
+	if (!disable_ddw && dma_mask > DMA_BIT_MASK(32)) {
 		dn = pci_device_to_OF_node(pdev);
 		dev_dbg(dev, "node is %pOF\n", dn);

@@ -1229,7 +1241,15 @@ static int dma_set_mask_pSeriesLP(struct device *dev, u64 dma_mask)
 				break;
 		}
 		if (pdn && PCI_DN(pdn)) {
-			dma_offset = enable_ddw(pdev, pdn);
+			u32 flag = 0, liobn = DDW_INVALID_LIOBN;
+			int ret = of_property_read_u32(pdn,
+					"qemu,dma-force-remove-default", &flag);
+
+			if (!ret && flag && dma_window &&
+					dma_mask != DMA_BIT_MASK(64))
+				liobn = be32_to_cpu(dma_window[0]);
+
+			dma_offset = enable_ddw(pdev, pdn, liobn);
 			if (dma_offset != DDW_INVALID_OFFSET) {
 				dev_info(dev, "Using 64-bit direct DMA at offset %llx\n", dma_offset);
 				set_dma_offset(dev, dma_offset);
-- 
2.11.0

^ permalink raw reply related

* [PATCH kernel RFC 3/3] powerpc/pseries/iommu: Use memory@ nodes in max RAM address calculation
From: Alexey Kardashevskiy @ 2018-07-25  9:50 UTC (permalink / raw)
  To: linuxppc-dev
  Cc: Alexey Kardashevskiy, David Gibson, kvm-ppc,
	Benjamin Herrenschmidt, Michael Ellerman, Paul Mackerras,
	Russell Currey
In-Reply-To: <20180725095032.2196-1-aik@ozlabs.ru>

We might have memory@ nodes with "linux,usable-memory" set to zero
(for example, to replicate powernv's behaviour for GPU coherent memory)
which means that the memory needs an extra initialization but since
it can be used afterwards, the pseries platform will try mapping it
for DMA so the DMA window needs to cover those memory regions too.

This walks through the memory nodes to find the highest RAM address to
let a huge DMA window cover that too in case this memory gets onlined
later.

The existing memory_hotplug_max() does not do the job as it calls:

1. memblock_end_of_DRAM() which looks at memory blocks and
GPU RAM is not there because of size==0 in linux,usable-memory
property of the memory node;

2. hot_add_drconf_memory_max() does not support sparse memory
if we want to map this memory in the guest where it is mapped
on the host (and it looks like we have to), the drconf chunk
is easily getting bigger that a megabyte.

Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru>
---
 arch/powerpc/platforms/pseries/iommu.c | 43 +++++++++++++++++++++++++++++++++-
 1 file changed, 42 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c b/arch/powerpc/platforms/pseries/iommu.c
index 840afe5..74404f8 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -967,6 +967,47 @@ struct failed_ddw_pdn {
 
 static LIST_HEAD(failed_ddw_pdn_list);
 
+static unsigned long read_n_cells(int n, const __be32 **buf)
+{
+	unsigned long result = 0;
+
+	while (n--) {
+		result = (result << 32) | of_read_number(*buf, 1);
+		(*buf)++;
+	}
+	return result;
+}
+
+static phys_addr_t ddw_memory_hotplug_max(void)
+{
+	phys_addr_t max_addr = memory_hotplug_max();
+	struct device_node *memory;
+
+	for_each_node_by_type(memory, "memory") {
+		unsigned long start, size;
+		int ranges, n_mem_addr_cells, n_mem_size_cells, len;
+		const __be32 *memcell_buf;
+
+		memcell_buf = of_get_property(memory, "reg", &len);
+		if (!memcell_buf || len <= 0)
+			continue;
+
+		n_mem_addr_cells = of_n_addr_cells(memory);
+		n_mem_size_cells = of_n_size_cells(memory);
+
+		/* ranges in cell */
+		ranges = (len >> 2) / (n_mem_addr_cells + n_mem_size_cells);
+
+		/* these are order-sensitive, and modify the buffer pointer */
+		start = read_n_cells(n_mem_addr_cells, &memcell_buf);
+		size = read_n_cells(n_mem_size_cells, &memcell_buf);
+
+		max_addr = max_t(phys_addr_t, max_addr, start + size);
+	}
+
+	return max_addr;
+}
+
 /*
  * If the PE supports dynamic dma windows, and there is space for a table
  * that can map all pages in a linear offset, then setup such a table,
@@ -1067,7 +1108,7 @@ static u64 enable_ddw(struct pci_dev *dev, struct device_node *pdn,
 	}
 	/* verify the window * number of ptes will map the partition */
 	/* check largest block * page size > max memory hotplug addr */
-	max_addr = memory_hotplug_max();
+	max_addr = ddw_memory_hotplug_max();
 	if (query.largest_available_block < (max_addr >> page_shift)) {
 		dev_dbg(&dev->dev, "can't map partition max 0x%llx with %u "
 			  "%llu-sized pages\n", max_addr,  query.largest_available_block,
-- 
2.11.0

^ permalink raw reply related

* [PATCH] powerpc/64s: fix page table fragment refcount race vs speculative references
From: Nicholas Piggin @ 2018-07-25  9:53 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V

The page table fragment allocator uses the main page refcount racily
with respect to speculative references. A customer observed a BUG due
to page table page refcount underflow in the fragment allocator. This
can be caused by the fragment allocator set_page_count stomping on a
speculative reference, and then the speculative failure handler
decrements the new reference, and the underflow eventually pops when
the page tables are freed.

Fix this by using a dedicated field in the struct page for the page
table fragment allocator.

Fixes: 5c1f6ee9a31c ("powerpc: Reduce PTE table memory wastage")
Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/mmu_context_book3s64.c |  8 ++++----
 arch/powerpc/mm/pgtable-book3s64.c     | 17 +++++++++++------
 include/linux/mm_types.h               |  5 ++++-
 3 files changed, 19 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index f3d4b4a0e561..3bb5cec03d1f 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -200,9 +200,9 @@ static void pte_frag_destroy(void *pte_frag)
 	/* drop all the pending references */
 	count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
 	/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (page_ref_sub_and_test(page, PTE_FRAG_NR - count)) {
+	if (atomic_sub_and_test(PTE_FRAG_NR - count, &page->pt_frag_refcount)) {
 		pgtable_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
@@ -215,9 +215,9 @@ static void pmd_frag_destroy(void *pmd_frag)
 	/* drop all the pending references */
 	count = ((unsigned long)pmd_frag & ~PAGE_MASK) >> PMD_FRAG_SIZE_SHIFT;
 	/* We allow PTE_FRAG_NR fragments from a PTE page */
-	if (page_ref_sub_and_test(page, PMD_FRAG_NR - count)) {
+	if (atomic_sub_and_test(PMD_FRAG_NR - count, &page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 4afbfbb64bfd..78d0b3d5ebad 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -270,6 +270,8 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 		return NULL;
 	}
 
+	atomic_set(&page->pt_frag_refcount, 1);
+
 	ret = page_address(page);
 	/*
 	 * if we support only one fragment just return the
@@ -285,7 +287,7 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 	 * count.
 	 */
 	if (likely(!mm->context.pmd_frag)) {
-		set_page_count(page, PMD_FRAG_NR);
+		atomic_set(&page->pt_frag_refcount, PMD_FRAG_NR);
 		mm->context.pmd_frag = ret + PMD_FRAG_SIZE;
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -308,9 +310,10 @@ void pmd_fragment_free(unsigned long *pmd)
 {
 	struct page *page = virt_to_page(pmd);
 
-	if (put_page_testzero(page)) {
+	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
+	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		pgtable_pmd_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
@@ -352,6 +355,7 @@ static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel)
 			return NULL;
 	}
 
+	atomic_set(&page->pt_frag_refcount, 1);
 
 	ret = page_address(page);
 	/*
@@ -367,7 +371,7 @@ static pte_t *__alloc_for_ptecache(struct mm_struct *mm, int kernel)
 	 * count.
 	 */
 	if (likely(!mm->context.pte_frag)) {
-		set_page_count(page, PTE_FRAG_NR);
+		atomic_set(&page->pt_frag_refcount, PTE_FRAG_NR);
 		mm->context.pte_frag = ret + PTE_FRAG_SIZE;
 	}
 	spin_unlock(&mm->page_table_lock);
@@ -390,10 +394,11 @@ void pte_fragment_free(unsigned long *table, int kernel)
 {
 	struct page *page = virt_to_page(table);
 
-	if (put_page_testzero(page)) {
+	BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
+	if (atomic_dec_and_test(&page->pt_frag_refcount)) {
 		if (!kernel)
 			pgtable_page_dtor(page);
-		free_unref_page(page);
+		__free_page(page);
 	}
 }
 
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 99ce070e7dcb..22651e124071 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -139,7 +139,10 @@ struct page {
 			unsigned long _pt_pad_1;	/* compound_head */
 			pgtable_t pmd_huge_pte; /* protected by page->ptl */
 			unsigned long _pt_pad_2;	/* mapping */
-			struct mm_struct *pt_mm;	/* x86 pgds only */
+			union {
+				struct mm_struct *pt_mm; /* x86 pgds only */
+				atomic_t pt_frag_refcount; /* powerpc */
+			};
 #if ALLOC_SPLIT_PTLOCKS
 			spinlock_t *ptl;
 #else
-- 
2.17.0

^ permalink raw reply related

* [PATCH] powerpc/64s: free page table caches at exit_mmap time
From: Nicholas Piggin @ 2018-07-25  9:54 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V

The kernel page table caches are tied to init_mm, so there is no
more need for them after userspace is finished.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/mm/mmu_context_book3s64.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
index 3bb5cec03d1f..5738c2db751c 100644
--- a/arch/powerpc/mm/mmu_context_book3s64.c
+++ b/arch/powerpc/mm/mmu_context_book3s64.c
@@ -221,7 +221,7 @@ static void pmd_frag_destroy(void *pmd_frag)
 	}
 }
 
-static void destroy_pagetable_page(struct mm_struct *mm)
+static void destroy_pagetable_cache(struct mm_struct *mm)
 {
 	void *frag;
 
@@ -244,13 +244,14 @@ void destroy_context(struct mm_struct *mm)
 		WARN_ON(process_tb[mm->context.id].prtb0 != 0);
 	else
 		subpage_prot_free(mm);
-	destroy_pagetable_page(mm);
 	destroy_contexts(&mm->context);
 	mm->context.id = MMU_NO_CONTEXT;
 }
 
 void arch_exit_mmap(struct mm_struct *mm)
 {
+	destroy_pagetable_cache(mm);
+
 	if (radix_enabled()) {
 		/*
 		 * Radix doesn't have a valid bit in the process table
-- 
2.17.0

^ permalink raw reply related

* Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure
From: Laurent Dufour @ 2018-07-25 10:44 UTC (permalink / raw)
  To: zhong jiang
  Cc: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi, linux-kernel, linux-mm, haren, npiggin, bsingharora,
	paulmck, Tim Chen, linuxppc-dev, x86
In-Reply-To: <5B583D19.4000409@huawei.com>



On 25/07/2018 11:04, zhong jiang wrote:
> On 2018/7/25 0:10, Laurent Dufour wrote:
>>
>> On 24/07/2018 16:26, zhong jiang wrote:
>>> On 2018/5/17 19:06, Laurent Dufour wrote:
>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>
>>>> Provide infrastructure to do a speculative fault (not holding
>>>> mmap_sem).
>>>>
>>>> The not holding of mmap_sem means we can race against VMA
>>>> change/removal and page-table destruction. We use the SRCU VMA freeing
>>>> to keep the VMA around. We use the VMA seqcount to detect change
>>>> (including umapping / page-table deletion) and we use gup_fast() style
>>>> page-table walking to deal with page-table races.
>>>>
>>>> Once we've obtained the page and are ready to update the PTE, we
>>>> validate if the state we started the fault with is still valid, if
>>>> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
>>>> PTE and we're done.
>>>>
>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>
>>>> [Manage the newly introduced pte_spinlock() for speculative page
>>>>  fault to fail if the VMA is touched in our back]
>>>> [Rename vma_is_dead() to vma_has_changed() and declare it here]
>>>> [Fetch p4d and pud]
>>>> [Set vmd.sequence in __handle_mm_fault()]
>>>> [Abort speculative path when handle_userfault() has to be called]
>>>> [Add additional VMA's flags checks in handle_speculative_fault()]
>>>> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
>>>> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
>>>> [Remove warning comment about waiting for !seq&1 since we don't want
>>>>  to wait]
>>>> [Remove warning about no huge page support, mention it explictly]
>>>> [Don't call do_fault() in the speculative path as __do_fault() calls
>>>>  vma->vm_ops->fault() which may want to release mmap_sem]
>>>> [Only vm_fault pointer argument for vma_has_changed()]
>>>> [Fix check against huge page, calling pmd_trans_huge()]
>>>> [Use READ_ONCE() when reading VMA's fields in the speculative path]
>>>> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>>>>  processing done in vm_normal_page()]
>>>> [Check that vma->anon_vma is already set when starting the speculative
>>>>  path]
>>>> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>>>>  the processing done in mpol_misplaced()]
>>>> [Don't support VMA growing up or down]
>>>> [Move check on vm_sequence just before calling handle_pte_fault()]
>>>> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
>>>> [Add mem cgroup oom check]
>>>> [Use READ_ONCE to access p*d entries]
>>>> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
>>>> [Don't fetch pte again in handle_pte_fault() when running the speculative
>>>>  path]
>>>> [Check PMD against concurrent collapsing operation]
>>>> [Try spin lock the pte during the speculative path to avoid deadlock with
>>>>  other CPU's invalidating the TLB and requiring this CPU to catch the
>>>>  inter processor's interrupt]
>>>> [Move define of FAULT_FLAG_SPECULATIVE here]
>>>> [Introduce __handle_speculative_fault() and add a check against
>>>>  mm->mm_users in handle_speculative_fault() defined in mm.h]
>>>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>>>> ---
>>>>  include/linux/hugetlb_inline.h |   2 +-
>>>>  include/linux/mm.h             |  30 ++++
>>>>  include/linux/pagemap.h        |   4 +-
>>>>  mm/internal.h                  |  16 +-
>>>>  mm/memory.c                    | 340 ++++++++++++++++++++++++++++++++++++++++-
>>>>  5 files changed, 385 insertions(+), 7 deletions(-)
>>>>
>>>> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
>>>> index 0660a03d37d9..9e25283d6fc9 100644
>>>> --- a/include/linux/hugetlb_inline.h
>>>> +++ b/include/linux/hugetlb_inline.h
>>>> @@ -8,7 +8,7 @@
>>>>  
>>>>  static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>>>>  {
>>>> -	return !!(vma->vm_flags & VM_HUGETLB);
>>>> +	return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>>>>  }
>>>>  
>>>>  #else
>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>> index 05cbba70104b..31acf98a7d92 100644
>>>> --- a/include/linux/mm.h
>>>> +++ b/include/linux/mm.h
>>>> @@ -315,6 +315,7 @@ extern pgprot_t protection_map[16];
>>>>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>>>>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>>>>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>>>> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
>>>>  
>>>>  #define FAULT_FLAG_TRACE \
>>>>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
>>>> @@ -343,6 +344,10 @@ struct vm_fault {
>>>>  	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>>>  	unsigned long address;		/* Faulting virtual address */
>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>> +	unsigned int sequence;
>>>> +	pmd_t orig_pmd;			/* value of PMD at the time of fault */
>>>> +#endif
>>>>  	pmd_t *pmd;			/* Pointer to pmd entry matching
>>>>  					 * the 'address' */
>>>>  	pud_t *pud;			/* Pointer to pud entry matching
>>>> @@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page);
>>>>  #ifdef CONFIG_MMU
>>>>  extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>>  		unsigned int flags);
>>>> +
>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>> +extern int __handle_speculative_fault(struct mm_struct *mm,
>>>> +				      unsigned long address,
>>>> +				      unsigned int flags);
>>>> +static inline int handle_speculative_fault(struct mm_struct *mm,
>>>> +					   unsigned long address,
>>>> +					   unsigned int flags)
>>>> +{
>>>> +	/*
>>>> +	 * Try speculative page fault for multithreaded user space task only.
>>>> +	 */
>>>> +	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
>>>> +		return VM_FAULT_RETRY;
>>>> +	return __handle_speculative_fault(mm, address, flags);
>>>> +}
>>>> +#else
>>>> +static inline int handle_speculative_fault(struct mm_struct *mm,
>>>> +					   unsigned long address,
>>>> +					   unsigned int flags)
>>>> +{
>>>> +	return VM_FAULT_RETRY;
>>>> +}
>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>> +
>>>>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>>>  			    unsigned long address, unsigned int fault_flags,
>>>>  			    bool *unlocked);
>>>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>>>> index b1bd2186e6d2..6e2aa4e79af7 100644
>>>> --- a/include/linux/pagemap.h
>>>> +++ b/include/linux/pagemap.h
>>>> @@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>>>>  	pgoff_t pgoff;
>>>>  	if (unlikely(is_vm_hugetlb_page(vma)))
>>>>  		return linear_hugepage_index(vma, address);
>>>> -	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>>>> -	pgoff += vma->vm_pgoff;
>>>> +	pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
>>>> +	pgoff += READ_ONCE(vma->vm_pgoff);
>>>>  	return pgoff;
>>>>  }
>>>>  
>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>> index fb2667b20f0a..10b188c87fa4 100644
>>>> --- a/mm/internal.h
>>>> +++ b/mm/internal.h
>>>> @@ -44,7 +44,21 @@ int do_swap_page(struct vm_fault *vmf);
>>>>  extern struct vm_area_struct *get_vma(struct mm_struct *mm,
>>>>  				      unsigned long addr);
>>>>  extern void put_vma(struct vm_area_struct *vma);
>>>> -#endif
>>>> +
>>>> +static inline bool vma_has_changed(struct vm_fault *vmf)
>>>> +{
>>>> +	int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
>>>> +	unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
>>>> +
>>>> +	/*
>>>> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
>>>> +	 * the wmb in vma_rb_erase().
>>>> +	 */
>>>> +	smp_rmb();
>>>> +
>>>> +	return ret || seq != vmf->sequence;
>>>> +}
>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>>  
>>>>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>>>  		unsigned long floor, unsigned long ceiling);
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index ab32b0b4bd69..7bbbb8c7b9cd 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -769,7 +769,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>>>  	if (page)
>>>>  		dump_page(page, "bad pte");
>>>>  	pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
>>>> -		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>>>> +		 (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
>>>> +		 mapping, index);
>>>>  	pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
>>>>  		 vma->vm_file,
>>>>  		 vma->vm_ops ? vma->vm_ops->fault : NULL,
>>>> @@ -2306,6 +2307,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>>>  }
>>>>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>>>>  
>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>> +static bool pte_spinlock(struct vm_fault *vmf)
>>>> +{
>>>> +	bool ret = false;
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +	pmd_t pmdval;
>>>> +#endif
>>>> +
>>>> +	/* Check if vma is still valid */
>>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>> +		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>> +		spin_lock(vmf->ptl);
>>>> +		return true;
>>>> +	}
>>>> +
>>>> +again:
>>>> +	local_irq_disable();
>>>> +	if (vma_has_changed(vmf))
>>>> +		goto out;
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +	/*
>>>> +	 * We check if the pmd value is still the same to ensure that there
>>>> +	 * is not a huge collapse operation in progress in our back.
>>>> +	 */
>>>> +	pmdval = READ_ONCE(*vmf->pmd);
>>>> +	if (!pmd_same(pmdval, vmf->orig_pmd))
>>>> +		goto out;
>>>> +#endif
>>>> +
>>>> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>> +	if (unlikely(!spin_trylock(vmf->ptl))) {
>>>> +		local_irq_enable();
>>>> +		goto again;
>>>> +	}
>>>> +
>>>> +	if (vma_has_changed(vmf)) {
>>>> +		spin_unlock(vmf->ptl);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	ret = true;
>>>> +out:
>>>> +	local_irq_enable();
>>>> +	return ret;
>>>> +}
>>>> +
>>>> +static bool pte_map_lock(struct vm_fault *vmf)
>>>> +{
>>>> +	bool ret = false;
>>>> +	pte_t *pte;
>>>> +	spinlock_t *ptl;
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +	pmd_t pmdval;
>>>> +#endif
>>>> +
>>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>> +		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>> +					       vmf->address, &vmf->ptl);
>>>> +		return true;
>>>> +	}
>>>> +
>>>> +	/*
>>>> +	 * The first vma_has_changed() guarantees the page-tables are still
>>>> +	 * valid, having IRQs disabled ensures they stay around, hence the
>>>> +	 * second vma_has_changed() to make sure they are still valid once
>>>> +	 * we've got the lock. After that a concurrent zap_pte_range() will
>>>> +	 * block on the PTL and thus we're safe.
>>>> +	 */
>>>> +again:
>>>> +	local_irq_disable();
>>>> +	if (vma_has_changed(vmf))
>>>> +		goto out;
>>>> +
>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>> +	/*
>>>> +	 * We check if the pmd value is still the same to ensure that there
>>>> +	 * is not a huge collapse operation in progress in our back.
>>>> +	 */
>>>> +	pmdval = READ_ONCE(*vmf->pmd);
>>>> +	if (!pmd_same(pmdval, vmf->orig_pmd))
>>>> +		goto out;
>>>> +#endif
>>>> +
>>>> +	/*
>>>> +	 * Same as pte_offset_map_lock() except that we call
>>>> +	 * spin_trylock() in place of spin_lock() to avoid race with
>>>> +	 * unmap path which may have the lock and wait for this CPU
>>>> +	 * to invalidate TLB but this CPU has irq disabled.
>>>> +	 * Since we are in a speculative patch, accept it could fail
>>>> +	 */
>>>> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>> +	pte = pte_offset_map(vmf->pmd, vmf->address);
>>>> +	if (unlikely(!spin_trylock(ptl))) {
>>>> +		pte_unmap(pte);
>>>> +		local_irq_enable();
>>>> +		goto again;
>>>> +	}
>>>> +
>>>> +	if (vma_has_changed(vmf)) {
>>>> +		pte_unmap_unlock(pte, ptl);
>>>> +		goto out;
>>>> +	}
>>>> +
>>>> +	vmf->pte = pte;
>>>> +	vmf->ptl = ptl;
>>>> +	ret = true;
>>>> +out:
>>>> +	local_irq_enable();
>>>> +	return ret;
>>>> +}
>>>> +#else
>>>>  static inline bool pte_spinlock(struct vm_fault *vmf)
>>>>  {
>>>>  	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>> @@ -2319,6 +2432,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>>>>  				       vmf->address, &vmf->ptl);
>>>>  	return true;
>>>>  }
>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>>  
>>>>  /*
>>>>   * handle_pte_fault chooses page fault handler according to an entry which was
>>>> @@ -3208,6 +3322,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>>>  		ret = check_stable_address_space(vma->vm_mm);
>>>>  		if (ret)
>>>>  			goto unlock;
>>>> +		/*
>>>> +		 * Don't call the userfaultfd during the speculative path.
>>>> +		 * We already checked for the VMA to not be managed through
>>>> +		 * userfaultfd, but it may be set in our back once we have lock
>>>> +		 * the pte. In such a case we can ignore it this time.
>>>> +		 */
>>>> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>>> +			goto setpte;
>>>>  		/* Deliver the page fault to userland, check inside PT lock */
>>>>  		if (userfaultfd_missing(vma)) {
>>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>> @@ -3249,7 +3371,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>>>  		goto unlock_and_release;
>>>>  
>>>>  	/* Deliver the page fault to userland, check inside PT lock */
>>>> -	if (userfaultfd_missing(vma)) {
>>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) && userfaultfd_missing(vma)) {
>>>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>  		mem_cgroup_cancel_charge(page, memcg, false);
>>>>  		put_page(page);
>>>> @@ -3994,13 +4116,22 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>>  
>>>>  	if (unlikely(pmd_none(*vmf->pmd))) {
>>>>  		/*
>>>> +		 * In the case of the speculative page fault handler we abort
>>>> +		 * the speculative path immediately as the pmd is probably
>>>> +		 * in the way to be converted in a huge one. We will try
>>>> +		 * again holding the mmap_sem (which implies that the collapse
>>>> +		 * operation is done).
>>>> +		 */
>>>> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>>> +			return VM_FAULT_RETRY;
>>>> +		/*
>>>>  		 * Leave __pte_alloc() until later: because vm_ops->fault may
>>>>  		 * want to allocate huge page, and if we expose page table
>>>>  		 * for an instant, it will be difficult to retract from
>>>>  		 * concurrent faults and from rmap lookups.
>>>>  		 */
>>>>  		vmf->pte = NULL;
>>>> -	} else {
>>>> +	} else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>>  		/* See comment in pte_alloc_one_map() */
>>>>  		if (pmd_devmap_trans_unstable(vmf->pmd))
>>>>  			return 0;
>>>> @@ -4009,6 +4140,9 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>>  		 * pmd from under us anymore at this point because we hold the
>>>>  		 * mmap_sem read mode and khugepaged takes it in write mode.
>>>>  		 * So now it's safe to run pte_offset_map().
>>>> +		 * This is not applicable to the speculative page fault handler
>>>> +		 * but in that case, the pte is fetched earlier in
>>>> +		 * handle_speculative_fault().
>>>>  		 */
>>>>  		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>>>>  		vmf->orig_pte = *vmf->pte;
>>>> @@ -4031,6 +4165,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>>  	if (!vmf->pte) {
>>>>  		if (vma_is_anonymous(vmf->vma))
>>>>  			return do_anonymous_page(vmf);
>>>> +		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>>> +			return VM_FAULT_RETRY;
>>>>  		else
>>>>  			return do_fault(vmf);
>>>>  	}
>>>> @@ -4128,6 +4264,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>>  	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>>>>  	if (!vmf.pmd)
>>>>  		return VM_FAULT_OOM;
>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>> +	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
>>>> +#endif
>>>>  	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
>>>>  		ret = create_huge_pmd(&vmf);
>>>>  		if (!(ret & VM_FAULT_FALLBACK))
>>>> @@ -4161,6 +4300,201 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>>  	return handle_pte_fault(&vmf);
>>>>  }
>>>>  
>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>> +/*
>>>> + * Tries to handle the page fault in a speculative way, without grabbing the
>>>> + * mmap_sem.
>>>> + */
>>>> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>>> +			       unsigned int flags)
>>>> +{
>>>> +	struct vm_fault vmf = {
>>>> +		.address = address,
>>>> +	};
>>>> +	pgd_t *pgd, pgdval;
>>>> +	p4d_t *p4d, p4dval;
>>>> +	pud_t pudval;
>>>> +	int seq, ret = VM_FAULT_RETRY;
>>>> +	struct vm_area_struct *vma;
>>>> +#ifdef CONFIG_NUMA
>>>> +	struct mempolicy *pol;
>>>> +#endif
>>>> +
>>>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>>>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>>>> +	flags |= FAULT_FLAG_SPECULATIVE;
>>>> +
>>>> +	vma = get_vma(mm, address);
>>>> +	if (!vma)
>>>> +		return ret;
>>>> +
>>>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>>>> +	if (seq & 1)
>>>> +		goto out_put;
>>>> +
>>>> +	/*
>>>> +	 * Can't call vm_ops service has we don't know what they would do
>>>> +	 * with the VMA.
>>>> +	 * This include huge page from hugetlbfs.
>>>> +	 */
>>>> +	if (vma->vm_ops)
>>>> +		goto out_put;
>>>> +
>>>   Hi   Laurent
>>>    
>>>    I think that most of pagefault will leave here.   Is there any case  need to skip ?
>>>   I have tested the following  patch, it work well.
>> Hi Zhong,
>>
>> Well this will allow file mapping to be handle in a speculative way, but that's
>> a bit dangerous today as there is no guaranty that the vm_ops.vm_fault()
>> operation will be fair.
>>
>> In the case of the anonymous file mapping that's often not a problem, depending
>> on the underlying file system, but there are so many cases to check and this is
>> hard to say this can be done in a speculative way as is.
>  This patch say that spf just handle anonyous page. but I find that do_swap_page
>  also maybe release the mmap_sem without FAULT_FLAG_RETRY_NOWAIT. why is it safe
>  to handle the case. I think that the case is similar to file page. Maybe I miss
>  something else.

do_swap_page() may released the mmap_sem through the call to
__lock_page_or_retry(), but this can only happen if FAULT_FLAG_ALLOW_RETRY or
FAULT_FLAG_KILLABLE are set and they are unset in __handle_speculative_fault().

> 
>  I test the patches and find just only 18% of the pagefault will enter into the
>  speculative page fault during a process startup. As I had said. most of pagefault
>  will be handled by ops->fault. I do not know the data you had posted is how to get.

I do agree that handling file mapping will be required, but this will add more
complexity to this series, since we need a way for drivers to tell they are
compatible with the speculative path.

May be I should give it a try on the next send.

For my information, what was the performance improvement you seen when handling
file page faulting this way ?

Thanks,
Laurent.

> 
> 
>  Thanks
>  zhong jiang
>> The huge work to do is to double check that all the code called by
>> vm_ops.fault() is not dealing with the mmap_sem, which could be handled using
>> FAULT_FLAG_RETRY_NOWAIT, and care is also needed about the resources that code
>> is managing as it may assume that it is under the protection of the mmap_sem in
>> read mode, and that can be done implicitly.
>>
>> Cheers,
>> Laurent.
>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index 936128b..9bc1545 100644
>>>  @@ -3893,8 +3898,6 @@ static int handle_pte_fault(struct fault_env *fe)
>>>         if (!fe->pte) {
>>>                 if (vma_is_anonymous(fe->vma))
>>>                         return do_anonymous_page(fe);
>>> -               else if (fe->flags & FAULT_FLAG_SPECULATIVE)
>>> -                       return VM_FAULT_RETRY;
>>>                 else
>>>                         return do_fault(fe);
>>>         }
>>> @@ -4026,20 +4029,11 @@ int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>>                 goto out_put;
>>>         }
>>>         /*
>>> -        * Can't call vm_ops service has we don't know what they would do
>>> -        * with the VMA.
>>> -        * This include huge page from hugetlbfs.
>>> -        */
>>> -       if (vma->vm_ops) {
>>> -               trace_spf_vma_notsup(_RET_IP_, vma, address);
>>> -               goto out_put;
>>> -       }
>>>
>>>
>>> Thanks
>>> zhong jiang
>>>> +	/*
>>>> +	 * __anon_vma_prepare() requires the mmap_sem to be held
>>>> +	 * because vm_next and vm_prev must be safe. This can't be guaranteed
>>>> +	 * in the speculative path.
>>>> +	 */
>>>> +	if (unlikely(!vma->anon_vma))
>>>> +		goto out_put;
>>>> +
>>>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>>>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>>>> +
>>>> +	/* Can't call userland page fault handler in the speculative path */
>>>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>>>> +		goto out_put;
>>>> +
>>>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>>>> +		/*
>>>> +		 * This could be detected by the check address against VMA's
>>>> +		 * boundaries but we want to trace it as not supported instead
>>>> +		 * of changed.
>>>> +		 */
>>>> +		goto out_put;
>>>> +
>>>> +	if (address < READ_ONCE(vma->vm_start)
>>>> +	    || READ_ONCE(vma->vm_end) <= address)
>>>> +		goto out_put;
>>>> +
>>>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>>> +				       flags & FAULT_FLAG_INSTRUCTION,
>>>> +				       flags & FAULT_FLAG_REMOTE)) {
>>>> +		ret = VM_FAULT_SIGSEGV;
>>>> +		goto out_put;
>>>> +	}
>>>> +
>>>> +	/* This is one is required to check that the VMA has write access set */
>>>> +	if (flags & FAULT_FLAG_WRITE) {
>>>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
>>>> +			ret = VM_FAULT_SIGSEGV;
>>>> +			goto out_put;
>>>> +		}
>>>> +	} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
>>>> +		ret = VM_FAULT_SIGSEGV;
>>>> +		goto out_put;
>>>> +	}
>>>> +
>>>> +#ifdef CONFIG_NUMA
>>>> +	/*
>>>> +	 * MPOL_INTERLEAVE implies additional checks in
>>>> +	 * mpol_misplaced() which are not compatible with the
>>>> +	 *speculative page fault processing.
>>>> +	 */
>>>> +	pol = __get_vma_policy(vma, address);
>>>> +	if (!pol)
>>>> +		pol = get_task_policy(current);
>>>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>>>> +		goto out_put;
>>>> +#endif
>>>> +
>>>> +	/*
>>>> +	 * Do a speculative lookup of the PTE entry.
>>>> +	 */
>>>> +	local_irq_disable();
>>>> +	pgd = pgd_offset(mm, address);
>>>> +	pgdval = READ_ONCE(*pgd);
>>>> +	if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
>>>> +		goto out_walk;
>>>> +
>>>> +	p4d = p4d_offset(pgd, address);
>>>> +	p4dval = READ_ONCE(*p4d);
>>>> +	if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
>>>> +		goto out_walk;
>>>> +
>>>> +	vmf.pud = pud_offset(p4d, address);
>>>> +	pudval = READ_ONCE(*vmf.pud);
>>>> +	if (pud_none(pudval) || unlikely(pud_bad(pudval)))
>>>> +		goto out_walk;
>>>> +
>>>> +	/* Huge pages at PUD level are not supported. */
>>>> +	if (unlikely(pud_trans_huge(pudval)))
>>>> +		goto out_walk;
>>>> +
>>>> +	vmf.pmd = pmd_offset(vmf.pud, address);
>>>> +	vmf.orig_pmd = READ_ONCE(*vmf.pmd);
>>>> +	/*
>>>> +	 * pmd_none could mean that a hugepage collapse is in progress
>>>> +	 * in our back as collapse_huge_page() mark it before
>>>> +	 * invalidating the pte (which is done once the IPI is catched
>>>> +	 * by all CPU and we have interrupt disabled).
>>>> +	 * For this reason we cannot handle THP in a speculative way since we
>>>> +	 * can't safely indentify an in progress collapse operation done in our
>>>> +	 * back on that PMD.
>>>> +	 * Regarding the order of the following checks, see comment in
>>>> +	 * pmd_devmap_trans_unstable()
>>>> +	 */
>>>> +	if (unlikely(pmd_devmap(vmf.orig_pmd) ||
>>>> +		     pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
>>>> +		     is_swap_pmd(vmf.orig_pmd)))
>>>> +		goto out_walk;
>>>> +
>>>> +	/*
>>>> +	 * The above does not allocate/instantiate page-tables because doing so
>>>> +	 * would lead to the possibility of instantiating page-tables after
>>>> +	 * free_pgtables() -- and consequently leaking them.
>>>> +	 *
>>>> +	 * The result is that we take at least one !speculative fault per PMD
>>>> +	 * in order to instantiate it.
>>>> +	 */
>>>> +
>>>> +	vmf.pte = pte_offset_map(vmf.pmd, address);
>>>> +	vmf.orig_pte = READ_ONCE(*vmf.pte);
>>>> +	barrier(); /* See comment in handle_pte_fault() */
>>>> +	if (pte_none(vmf.orig_pte)) {
>>>> +		pte_unmap(vmf.pte);
>>>> +		vmf.pte = NULL;
>>>> +	}
>>>> +
>>>> +	vmf.vma = vma;
>>>> +	vmf.pgoff = linear_page_index(vma, address);
>>>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>>> +	vmf.sequence = seq;
>>>> +	vmf.flags = flags;
>>>> +
>>>> +	local_irq_enable();
>>>> +
>>>> +	/*
>>>> +	 * We need to re-validate the VMA after checking the bounds, otherwise
>>>> +	 * we might have a false positive on the bounds.
>>>> +	 */
>>>> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
>>>> +		goto out_put;
>>>> +
>>>> +	mem_cgroup_oom_enable();
>>>> +	ret = handle_pte_fault(&vmf);
>>>> +	mem_cgroup_oom_disable();
>>>> +
>>>> +	put_vma(vma);
>>>> +
>>>> +	/*
>>>> +	 * The task may have entered a memcg OOM situation but
>>>> +	 * if the allocation error was handled gracefully (no
>>>> +	 * VM_FAULT_OOM), there is no need to kill anything.
>>>> +	 * Just clean up the OOM state peacefully.
>>>> +	 */
>>>> +	if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>>>> +		mem_cgroup_oom_synchronize(false);
>>>> +	return ret;
>>>> +
>>>> +out_walk:
>>>> +	local_irq_enable();
>>>> +out_put:
>>>> +	put_vma(vma);
>>>> +	return ret;
>>>> +}
>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>> +
>>>>  /*
>>>>   * By the time we get here, we already hold the mm semaphore
>>>>   *
>>>
>>
>> .
>>
> 
> 

^ permalink raw reply

* Re: [PATCH v11 19/26] mm: provide speculative fault infrastructure
From: zhong jiang @ 2018-07-25 11:23 UTC (permalink / raw)
  To: Laurent Dufour
  Cc: akpm, mhocko, peterz, kirill, ak, dave, jack, Matthew Wilcox,
	khandual, aneesh.kumar, benh, mpe, paulus, Thomas Gleixner,
	Ingo Molnar, hpa, Will Deacon, Sergey Senozhatsky,
	sergey.senozhatsky.work, Andrea Arcangeli, Alexei Starovoitov,
	kemi.wang, Daniel Jordan, David Rientjes, Jerome Glisse,
	Ganesh Mahendran, Minchan Kim, Punit Agrawal, vinayak menon,
	Yang Shi, linux-kernel, linux-mm, haren, npiggin, bsingharora,
	paulmck, Tim Chen, linuxppc-dev, x86
In-Reply-To: <569cb172-2a27-8455-9c9e-5446777060c2@linux.vnet.ibm.com>

On 2018/7/25 18:44, Laurent Dufour wrote:
>
> On 25/07/2018 11:04, zhong jiang wrote:
>> On 2018/7/25 0:10, Laurent Dufour wrote:
>>> On 24/07/2018 16:26, zhong jiang wrote:
>>>> On 2018/5/17 19:06, Laurent Dufour wrote:
>>>>> From: Peter Zijlstra <peterz@infradead.org>
>>>>>
>>>>> Provide infrastructure to do a speculative fault (not holding
>>>>> mmap_sem).
>>>>>
>>>>> The not holding of mmap_sem means we can race against VMA
>>>>> change/removal and page-table destruction. We use the SRCU VMA freeing
>>>>> to keep the VMA around. We use the VMA seqcount to detect change
>>>>> (including umapping / page-table deletion) and we use gup_fast() style
>>>>> page-table walking to deal with page-table races.
>>>>>
>>>>> Once we've obtained the page and are ready to update the PTE, we
>>>>> validate if the state we started the fault with is still valid, if
>>>>> not, we'll fail the fault with VM_FAULT_RETRY, otherwise we update the
>>>>> PTE and we're done.
>>>>>
>>>>> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
>>>>>
>>>>> [Manage the newly introduced pte_spinlock() for speculative page
>>>>>  fault to fail if the VMA is touched in our back]
>>>>> [Rename vma_is_dead() to vma_has_changed() and declare it here]
>>>>> [Fetch p4d and pud]
>>>>> [Set vmd.sequence in __handle_mm_fault()]
>>>>> [Abort speculative path when handle_userfault() has to be called]
>>>>> [Add additional VMA's flags checks in handle_speculative_fault()]
>>>>> [Clear FAULT_FLAG_ALLOW_RETRY in handle_speculative_fault()]
>>>>> [Don't set vmf->pte and vmf->ptl if pte_map_lock() failed]
>>>>> [Remove warning comment about waiting for !seq&1 since we don't want
>>>>>  to wait]
>>>>> [Remove warning about no huge page support, mention it explictly]
>>>>> [Don't call do_fault() in the speculative path as __do_fault() calls
>>>>>  vma->vm_ops->fault() which may want to release mmap_sem]
>>>>> [Only vm_fault pointer argument for vma_has_changed()]
>>>>> [Fix check against huge page, calling pmd_trans_huge()]
>>>>> [Use READ_ONCE() when reading VMA's fields in the speculative path]
>>>>> [Explicitly check for __HAVE_ARCH_PTE_SPECIAL as we can't support for
>>>>>  processing done in vm_normal_page()]
>>>>> [Check that vma->anon_vma is already set when starting the speculative
>>>>>  path]
>>>>> [Check for memory policy as we can't support MPOL_INTERLEAVE case due to
>>>>>  the processing done in mpol_misplaced()]
>>>>> [Don't support VMA growing up or down]
>>>>> [Move check on vm_sequence just before calling handle_pte_fault()]
>>>>> [Don't build SPF services if !CONFIG_SPECULATIVE_PAGE_FAULT]
>>>>> [Add mem cgroup oom check]
>>>>> [Use READ_ONCE to access p*d entries]
>>>>> [Replace deprecated ACCESS_ONCE() by READ_ONCE() in vma_has_changed()]
>>>>> [Don't fetch pte again in handle_pte_fault() when running the speculative
>>>>>  path]
>>>>> [Check PMD against concurrent collapsing operation]
>>>>> [Try spin lock the pte during the speculative path to avoid deadlock with
>>>>>  other CPU's invalidating the TLB and requiring this CPU to catch the
>>>>>  inter processor's interrupt]
>>>>> [Move define of FAULT_FLAG_SPECULATIVE here]
>>>>> [Introduce __handle_speculative_fault() and add a check against
>>>>>  mm->mm_users in handle_speculative_fault() defined in mm.h]
>>>>> Signed-off-by: Laurent Dufour <ldufour@linux.vnet.ibm.com>
>>>>> ---
>>>>>  include/linux/hugetlb_inline.h |   2 +-
>>>>>  include/linux/mm.h             |  30 ++++
>>>>>  include/linux/pagemap.h        |   4 +-
>>>>>  mm/internal.h                  |  16 +-
>>>>>  mm/memory.c                    | 340 ++++++++++++++++++++++++++++++++++++++++-
>>>>>  5 files changed, 385 insertions(+), 7 deletions(-)
>>>>>
>>>>> diff --git a/include/linux/hugetlb_inline.h b/include/linux/hugetlb_inline.h
>>>>> index 0660a03d37d9..9e25283d6fc9 100644
>>>>> --- a/include/linux/hugetlb_inline.h
>>>>> +++ b/include/linux/hugetlb_inline.h
>>>>> @@ -8,7 +8,7 @@
>>>>>  
>>>>>  static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
>>>>>  {
>>>>> -	return !!(vma->vm_flags & VM_HUGETLB);
>>>>> +	return !!(READ_ONCE(vma->vm_flags) & VM_HUGETLB);
>>>>>  }
>>>>>  
>>>>>  #else
>>>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>>>> index 05cbba70104b..31acf98a7d92 100644
>>>>> --- a/include/linux/mm.h
>>>>> +++ b/include/linux/mm.h
>>>>> @@ -315,6 +315,7 @@ extern pgprot_t protection_map[16];
>>>>>  #define FAULT_FLAG_USER		0x40	/* The fault originated in userspace */
>>>>>  #define FAULT_FLAG_REMOTE	0x80	/* faulting for non current tsk/mm */
>>>>>  #define FAULT_FLAG_INSTRUCTION  0x100	/* The fault was during an instruction fetch */
>>>>> +#define FAULT_FLAG_SPECULATIVE	0x200	/* Speculative fault, not holding mmap_sem */
>>>>>  
>>>>>  #define FAULT_FLAG_TRACE \
>>>>>  	{ FAULT_FLAG_WRITE,		"WRITE" }, \
>>>>> @@ -343,6 +344,10 @@ struct vm_fault {
>>>>>  	gfp_t gfp_mask;			/* gfp mask to be used for allocations */
>>>>>  	pgoff_t pgoff;			/* Logical page offset based on vma */
>>>>>  	unsigned long address;		/* Faulting virtual address */
>>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>>> +	unsigned int sequence;
>>>>> +	pmd_t orig_pmd;			/* value of PMD at the time of fault */
>>>>> +#endif
>>>>>  	pmd_t *pmd;			/* Pointer to pmd entry matching
>>>>>  					 * the 'address' */
>>>>>  	pud_t *pud;			/* Pointer to pud entry matching
>>>>> @@ -1415,6 +1420,31 @@ int invalidate_inode_page(struct page *page);
>>>>>  #ifdef CONFIG_MMU
>>>>>  extern int handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>>>  		unsigned int flags);
>>>>> +
>>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>>> +extern int __handle_speculative_fault(struct mm_struct *mm,
>>>>> +				      unsigned long address,
>>>>> +				      unsigned int flags);
>>>>> +static inline int handle_speculative_fault(struct mm_struct *mm,
>>>>> +					   unsigned long address,
>>>>> +					   unsigned int flags)
>>>>> +{
>>>>> +	/*
>>>>> +	 * Try speculative page fault for multithreaded user space task only.
>>>>> +	 */
>>>>> +	if (!(flags & FAULT_FLAG_USER) || atomic_read(&mm->mm_users) == 1)
>>>>> +		return VM_FAULT_RETRY;
>>>>> +	return __handle_speculative_fault(mm, address, flags);
>>>>> +}
>>>>> +#else
>>>>> +static inline int handle_speculative_fault(struct mm_struct *mm,
>>>>> +					   unsigned long address,
>>>>> +					   unsigned int flags)
>>>>> +{
>>>>> +	return VM_FAULT_RETRY;
>>>>> +}
>>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>>> +
>>>>>  extern int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
>>>>>  			    unsigned long address, unsigned int fault_flags,
>>>>>  			    bool *unlocked);
>>>>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>>>>> index b1bd2186e6d2..6e2aa4e79af7 100644
>>>>> --- a/include/linux/pagemap.h
>>>>> +++ b/include/linux/pagemap.h
>>>>> @@ -456,8 +456,8 @@ static inline pgoff_t linear_page_index(struct vm_area_struct *vma,
>>>>>  	pgoff_t pgoff;
>>>>>  	if (unlikely(is_vm_hugetlb_page(vma)))
>>>>>  		return linear_hugepage_index(vma, address);
>>>>> -	pgoff = (address - vma->vm_start) >> PAGE_SHIFT;
>>>>> -	pgoff += vma->vm_pgoff;
>>>>> +	pgoff = (address - READ_ONCE(vma->vm_start)) >> PAGE_SHIFT;
>>>>> +	pgoff += READ_ONCE(vma->vm_pgoff);
>>>>>  	return pgoff;
>>>>>  }
>>>>>  
>>>>> diff --git a/mm/internal.h b/mm/internal.h
>>>>> index fb2667b20f0a..10b188c87fa4 100644
>>>>> --- a/mm/internal.h
>>>>> +++ b/mm/internal.h
>>>>> @@ -44,7 +44,21 @@ int do_swap_page(struct vm_fault *vmf);
>>>>>  extern struct vm_area_struct *get_vma(struct mm_struct *mm,
>>>>>  				      unsigned long addr);
>>>>>  extern void put_vma(struct vm_area_struct *vma);
>>>>> -#endif
>>>>> +
>>>>> +static inline bool vma_has_changed(struct vm_fault *vmf)
>>>>> +{
>>>>> +	int ret = RB_EMPTY_NODE(&vmf->vma->vm_rb);
>>>>> +	unsigned int seq = READ_ONCE(vmf->vma->vm_sequence.sequence);
>>>>> +
>>>>> +	/*
>>>>> +	 * Matches both the wmb in write_seqlock_{begin,end}() and
>>>>> +	 * the wmb in vma_rb_erase().
>>>>> +	 */
>>>>> +	smp_rmb();
>>>>> +
>>>>> +	return ret || seq != vmf->sequence;
>>>>> +}
>>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>>>  
>>>>>  void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
>>>>>  		unsigned long floor, unsigned long ceiling);
>>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>>> index ab32b0b4bd69..7bbbb8c7b9cd 100644
>>>>> --- a/mm/memory.c
>>>>> +++ b/mm/memory.c
>>>>> @@ -769,7 +769,8 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>>>>>  	if (page)
>>>>>  		dump_page(page, "bad pte");
>>>>>  	pr_alert("addr:%p vm_flags:%08lx anon_vma:%p mapping:%p index:%lx\n",
>>>>> -		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
>>>>> +		 (void *)addr, READ_ONCE(vma->vm_flags), vma->anon_vma,
>>>>> +		 mapping, index);
>>>>>  	pr_alert("file:%pD fault:%pf mmap:%pf readpage:%pf\n",
>>>>>  		 vma->vm_file,
>>>>>  		 vma->vm_ops ? vma->vm_ops->fault : NULL,
>>>>> @@ -2306,6 +2307,118 @@ int apply_to_page_range(struct mm_struct *mm, unsigned long addr,
>>>>>  }
>>>>>  EXPORT_SYMBOL_GPL(apply_to_page_range);
>>>>>  
>>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>>> +static bool pte_spinlock(struct vm_fault *vmf)
>>>>> +{
>>>>> +	bool ret = false;
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +	pmd_t pmdval;
>>>>> +#endif
>>>>> +
>>>>> +	/* Check if vma is still valid */
>>>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>>> +		vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>>> +		spin_lock(vmf->ptl);
>>>>> +		return true;
>>>>> +	}
>>>>> +
>>>>> +again:
>>>>> +	local_irq_disable();
>>>>> +	if (vma_has_changed(vmf))
>>>>> +		goto out;
>>>>> +
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +	/*
>>>>> +	 * We check if the pmd value is still the same to ensure that there
>>>>> +	 * is not a huge collapse operation in progress in our back.
>>>>> +	 */
>>>>> +	pmdval = READ_ONCE(*vmf->pmd);
>>>>> +	if (!pmd_same(pmdval, vmf->orig_pmd))
>>>>> +		goto out;
>>>>> +#endif
>>>>> +
>>>>> +	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>>> +	if (unlikely(!spin_trylock(vmf->ptl))) {
>>>>> +		local_irq_enable();
>>>>> +		goto again;
>>>>> +	}
>>>>> +
>>>>> +	if (vma_has_changed(vmf)) {
>>>>> +		spin_unlock(vmf->ptl);
>>>>> +		goto out;
>>>>> +	}
>>>>> +
>>>>> +	ret = true;
>>>>> +out:
>>>>> +	local_irq_enable();
>>>>> +	return ret;
>>>>> +}
>>>>> +
>>>>> +static bool pte_map_lock(struct vm_fault *vmf)
>>>>> +{
>>>>> +	bool ret = false;
>>>>> +	pte_t *pte;
>>>>> +	spinlock_t *ptl;
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +	pmd_t pmdval;
>>>>> +#endif
>>>>> +
>>>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>>> +		vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
>>>>> +					       vmf->address, &vmf->ptl);
>>>>> +		return true;
>>>>> +	}
>>>>> +
>>>>> +	/*
>>>>> +	 * The first vma_has_changed() guarantees the page-tables are still
>>>>> +	 * valid, having IRQs disabled ensures they stay around, hence the
>>>>> +	 * second vma_has_changed() to make sure they are still valid once
>>>>> +	 * we've got the lock. After that a concurrent zap_pte_range() will
>>>>> +	 * block on the PTL and thus we're safe.
>>>>> +	 */
>>>>> +again:
>>>>> +	local_irq_disable();
>>>>> +	if (vma_has_changed(vmf))
>>>>> +		goto out;
>>>>> +
>>>>> +#ifdef CONFIG_TRANSPARENT_HUGEPAGE
>>>>> +	/*
>>>>> +	 * We check if the pmd value is still the same to ensure that there
>>>>> +	 * is not a huge collapse operation in progress in our back.
>>>>> +	 */
>>>>> +	pmdval = READ_ONCE(*vmf->pmd);
>>>>> +	if (!pmd_same(pmdval, vmf->orig_pmd))
>>>>> +		goto out;
>>>>> +#endif
>>>>> +
>>>>> +	/*
>>>>> +	 * Same as pte_offset_map_lock() except that we call
>>>>> +	 * spin_trylock() in place of spin_lock() to avoid race with
>>>>> +	 * unmap path which may have the lock and wait for this CPU
>>>>> +	 * to invalidate TLB but this CPU has irq disabled.
>>>>> +	 * Since we are in a speculative patch, accept it could fail
>>>>> +	 */
>>>>> +	ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>>> +	pte = pte_offset_map(vmf->pmd, vmf->address);
>>>>> +	if (unlikely(!spin_trylock(ptl))) {
>>>>> +		pte_unmap(pte);
>>>>> +		local_irq_enable();
>>>>> +		goto again;
>>>>> +	}
>>>>> +
>>>>> +	if (vma_has_changed(vmf)) {
>>>>> +		pte_unmap_unlock(pte, ptl);
>>>>> +		goto out;
>>>>> +	}
>>>>> +
>>>>> +	vmf->pte = pte;
>>>>> +	vmf->ptl = ptl;
>>>>> +	ret = true;
>>>>> +out:
>>>>> +	local_irq_enable();
>>>>> +	return ret;
>>>>> +}
>>>>> +#else
>>>>>  static inline bool pte_spinlock(struct vm_fault *vmf)
>>>>>  {
>>>>>  	vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
>>>>> @@ -2319,6 +2432,7 @@ static inline bool pte_map_lock(struct vm_fault *vmf)
>>>>>  				       vmf->address, &vmf->ptl);
>>>>>  	return true;
>>>>>  }
>>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>>>  
>>>>>  /*
>>>>>   * handle_pte_fault chooses page fault handler according to an entry which was
>>>>> @@ -3208,6 +3322,14 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>>>>  		ret = check_stable_address_space(vma->vm_mm);
>>>>>  		if (ret)
>>>>>  			goto unlock;
>>>>> +		/*
>>>>> +		 * Don't call the userfaultfd during the speculative path.
>>>>> +		 * We already checked for the VMA to not be managed through
>>>>> +		 * userfaultfd, but it may be set in our back once we have lock
>>>>> +		 * the pte. In such a case we can ignore it this time.
>>>>> +		 */
>>>>> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>>>> +			goto setpte;
>>>>>  		/* Deliver the page fault to userland, check inside PT lock */
>>>>>  		if (userfaultfd_missing(vma)) {
>>>>>  			pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>> @@ -3249,7 +3371,7 @@ static int do_anonymous_page(struct vm_fault *vmf)
>>>>>  		goto unlock_and_release;
>>>>>  
>>>>>  	/* Deliver the page fault to userland, check inside PT lock */
>>>>> -	if (userfaultfd_missing(vma)) {
>>>>> +	if (!(vmf->flags & FAULT_FLAG_SPECULATIVE) && userfaultfd_missing(vma)) {
>>>>>  		pte_unmap_unlock(vmf->pte, vmf->ptl);
>>>>>  		mem_cgroup_cancel_charge(page, memcg, false);
>>>>>  		put_page(page);
>>>>> @@ -3994,13 +4116,22 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>>>  
>>>>>  	if (unlikely(pmd_none(*vmf->pmd))) {
>>>>>  		/*
>>>>> +		 * In the case of the speculative page fault handler we abort
>>>>> +		 * the speculative path immediately as the pmd is probably
>>>>> +		 * in the way to be converted in a huge one. We will try
>>>>> +		 * again holding the mmap_sem (which implies that the collapse
>>>>> +		 * operation is done).
>>>>> +		 */
>>>>> +		if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>>>> +			return VM_FAULT_RETRY;
>>>>> +		/*
>>>>>  		 * Leave __pte_alloc() until later: because vm_ops->fault may
>>>>>  		 * want to allocate huge page, and if we expose page table
>>>>>  		 * for an instant, it will be difficult to retract from
>>>>>  		 * concurrent faults and from rmap lookups.
>>>>>  		 */
>>>>>  		vmf->pte = NULL;
>>>>> -	} else {
>>>>> +	} else if (!(vmf->flags & FAULT_FLAG_SPECULATIVE)) {
>>>>>  		/* See comment in pte_alloc_one_map() */
>>>>>  		if (pmd_devmap_trans_unstable(vmf->pmd))
>>>>>  			return 0;
>>>>> @@ -4009,6 +4140,9 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>>>  		 * pmd from under us anymore at this point because we hold the
>>>>>  		 * mmap_sem read mode and khugepaged takes it in write mode.
>>>>>  		 * So now it's safe to run pte_offset_map().
>>>>> +		 * This is not applicable to the speculative page fault handler
>>>>> +		 * but in that case, the pte is fetched earlier in
>>>>> +		 * handle_speculative_fault().
>>>>>  		 */
>>>>>  		vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
>>>>>  		vmf->orig_pte = *vmf->pte;
>>>>> @@ -4031,6 +4165,8 @@ static int handle_pte_fault(struct vm_fault *vmf)
>>>>>  	if (!vmf->pte) {
>>>>>  		if (vma_is_anonymous(vmf->vma))
>>>>>  			return do_anonymous_page(vmf);
>>>>> +		else if (vmf->flags & FAULT_FLAG_SPECULATIVE)
>>>>> +			return VM_FAULT_RETRY;
>>>>>  		else
>>>>>  			return do_fault(vmf);
>>>>>  	}
>>>>> @@ -4128,6 +4264,9 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>>>  	vmf.pmd = pmd_alloc(mm, vmf.pud, address);
>>>>>  	if (!vmf.pmd)
>>>>>  		return VM_FAULT_OOM;
>>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>>> +	vmf.sequence = raw_read_seqcount(&vma->vm_sequence);
>>>>> +#endif
>>>>>  	if (pmd_none(*vmf.pmd) && transparent_hugepage_enabled(vma)) {
>>>>>  		ret = create_huge_pmd(&vmf);
>>>>>  		if (!(ret & VM_FAULT_FALLBACK))
>>>>> @@ -4161,6 +4300,201 @@ static int __handle_mm_fault(struct vm_area_struct *vma, unsigned long address,
>>>>>  	return handle_pte_fault(&vmf);
>>>>>  }
>>>>>  
>>>>> +#ifdef CONFIG_SPECULATIVE_PAGE_FAULT
>>>>> +/*
>>>>> + * Tries to handle the page fault in a speculative way, without grabbing the
>>>>> + * mmap_sem.
>>>>> + */
>>>>> +int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>>>> +			       unsigned int flags)
>>>>> +{
>>>>> +	struct vm_fault vmf = {
>>>>> +		.address = address,
>>>>> +	};
>>>>> +	pgd_t *pgd, pgdval;
>>>>> +	p4d_t *p4d, p4dval;
>>>>> +	pud_t pudval;
>>>>> +	int seq, ret = VM_FAULT_RETRY;
>>>>> +	struct vm_area_struct *vma;
>>>>> +#ifdef CONFIG_NUMA
>>>>> +	struct mempolicy *pol;
>>>>> +#endif
>>>>> +
>>>>> +	/* Clear flags that may lead to release the mmap_sem to retry */
>>>>> +	flags &= ~(FAULT_FLAG_ALLOW_RETRY|FAULT_FLAG_KILLABLE);
>>>>> +	flags |= FAULT_FLAG_SPECULATIVE;
>>>>> +
>>>>> +	vma = get_vma(mm, address);
>>>>> +	if (!vma)
>>>>> +		return ret;
>>>>> +
>>>>> +	seq = raw_read_seqcount(&vma->vm_sequence); /* rmb <-> seqlock,vma_rb_erase() */
>>>>> +	if (seq & 1)
>>>>> +		goto out_put;
>>>>> +
>>>>> +	/*
>>>>> +	 * Can't call vm_ops service has we don't know what they would do
>>>>> +	 * with the VMA.
>>>>> +	 * This include huge page from hugetlbfs.
>>>>> +	 */
>>>>> +	if (vma->vm_ops)
>>>>> +		goto out_put;
>>>>> +
>>>>   Hi   Laurent
>>>>    
>>>>    I think that most of pagefault will leave here.   Is there any case  need to skip ?
>>>>   I have tested the following  patch, it work well.
>>> Hi Zhong,
>>>
>>> Well this will allow file mapping to be handle in a speculative way, but that's
>>> a bit dangerous today as there is no guaranty that the vm_ops.vm_fault()
>>> operation will be fair.
>>>
>>> In the case of the anonymous file mapping that's often not a problem, depending
>>> on the underlying file system, but there are so many cases to check and this is
>>> hard to say this can be done in a speculative way as is.
>>  This patch say that spf just handle anonyous page. but I find that do_swap_page
>>  also maybe release the mmap_sem without FAULT_FLAG_RETRY_NOWAIT. why is it safe
>>  to handle the case. I think that the case is similar to file page. Maybe I miss
>>  something else.
> do_swap_page() may released the mmap_sem through the call to
> __lock_page_or_retry(), but this can only happen if FAULT_FLAG_ALLOW_RETRY or
> FAULT_FLAG_KILLABLE are set and they are unset in __handle_speculative_fault().
 For spf. Indeed. Thank you for clarification.
>>  I test the patches and find just only 18% of the pagefault will enter into the
>>  speculative page fault during a process startup. As I had said. most of pagefault
>>  will be handled by ops->fault. I do not know the data you had posted is how to get.
> I do agree that handling file mapping will be required, but this will add more
> complexity to this series, since we need a way for drivers to tell they are
> compatible with the speculative path.
 As the above mentioned. the specualtive page fault do not pass FAULT_FLAG_ALLOW_RETRY.
 In other words, File page will not refer to release mmap_sem for spf.
 
 but I am still not quite clear that what should drivers do to compatible with speculatve path.
 The speculative path should not refer to the mmap_sem for filemap_fault.

 Thanks,
 zhong jiang
> May be I should give it a try on the next send.
 Ok, I will try.
> For my information, what was the performance improvement you seen when handling
> file page faulting this way ?
 I am sorry that. It is the data that Ganesh test the launch time on Andriod.
> Thanks,
> Laurent.
>
>>
>>  Thanks
>>  zhong jiang
>>> The huge work to do is to double check that all the code called by
>>> vm_ops.fault() is not dealing with the mmap_sem, which could be handled using
>>> FAULT_FLAG_RETRY_NOWAIT, and care is also needed about the resources that code
>>> is managing as it may assume that it is under the protection of the mmap_sem in
>>> read mode, and that can be done implicitly.
>>>
>>> Cheers,
>>> Laurent.
>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index 936128b..9bc1545 100644
>>>>  @@ -3893,8 +3898,6 @@ static int handle_pte_fault(struct fault_env *fe)
>>>>         if (!fe->pte) {
>>>>                 if (vma_is_anonymous(fe->vma))
>>>>                         return do_anonymous_page(fe);
>>>> -               else if (fe->flags & FAULT_FLAG_SPECULATIVE)
>>>> -                       return VM_FAULT_RETRY;
>>>>                 else
>>>>                         return do_fault(fe);
>>>>         }
>>>> @@ -4026,20 +4029,11 @@ int __handle_speculative_fault(struct mm_struct *mm, unsigned long address,
>>>>                 goto out_put;
>>>>         }
>>>>         /*
>>>> -        * Can't call vm_ops service has we don't know what they would do
>>>> -        * with the VMA.
>>>> -        * This include huge page from hugetlbfs.
>>>> -        */
>>>> -       if (vma->vm_ops) {
>>>> -               trace_spf_vma_notsup(_RET_IP_, vma, address);
>>>> -               goto out_put;
>>>> -       }
>>>>
>>>>
>>>> Thanks
>>>> zhong jiang
>>>>> +	/*
>>>>> +	 * __anon_vma_prepare() requires the mmap_sem to be held
>>>>> +	 * because vm_next and vm_prev must be safe. This can't be guaranteed
>>>>> +	 * in the speculative path.
>>>>> +	 */
>>>>> +	if (unlikely(!vma->anon_vma))
>>>>> +		goto out_put;
>>>>> +
>>>>> +	vmf.vma_flags = READ_ONCE(vma->vm_flags);
>>>>> +	vmf.vma_page_prot = READ_ONCE(vma->vm_page_prot);
>>>>> +
>>>>> +	/* Can't call userland page fault handler in the speculative path */
>>>>> +	if (unlikely(vmf.vma_flags & VM_UFFD_MISSING))
>>>>> +		goto out_put;
>>>>> +
>>>>> +	if (vmf.vma_flags & VM_GROWSDOWN || vmf.vma_flags & VM_GROWSUP)
>>>>> +		/*
>>>>> +		 * This could be detected by the check address against VMA's
>>>>> +		 * boundaries but we want to trace it as not supported instead
>>>>> +		 * of changed.
>>>>> +		 */
>>>>> +		goto out_put;
>>>>> +
>>>>> +	if (address < READ_ONCE(vma->vm_start)
>>>>> +	    || READ_ONCE(vma->vm_end) <= address)
>>>>> +		goto out_put;
>>>>> +
>>>>> +	if (!arch_vma_access_permitted(vma, flags & FAULT_FLAG_WRITE,
>>>>> +				       flags & FAULT_FLAG_INSTRUCTION,
>>>>> +				       flags & FAULT_FLAG_REMOTE)) {
>>>>> +		ret = VM_FAULT_SIGSEGV;
>>>>> +		goto out_put;
>>>>> +	}
>>>>> +
>>>>> +	/* This is one is required to check that the VMA has write access set */
>>>>> +	if (flags & FAULT_FLAG_WRITE) {
>>>>> +		if (unlikely(!(vmf.vma_flags & VM_WRITE))) {
>>>>> +			ret = VM_FAULT_SIGSEGV;
>>>>> +			goto out_put;
>>>>> +		}
>>>>> +	} else if (unlikely(!(vmf.vma_flags & (VM_READ|VM_EXEC|VM_WRITE)))) {
>>>>> +		ret = VM_FAULT_SIGSEGV;
>>>>> +		goto out_put;
>>>>> +	}
>>>>> +
>>>>> +#ifdef CONFIG_NUMA
>>>>> +	/*
>>>>> +	 * MPOL_INTERLEAVE implies additional checks in
>>>>> +	 * mpol_misplaced() which are not compatible with the
>>>>> +	 *speculative page fault processing.
>>>>> +	 */
>>>>> +	pol = __get_vma_policy(vma, address);
>>>>> +	if (!pol)
>>>>> +		pol = get_task_policy(current);
>>>>> +	if (pol && pol->mode == MPOL_INTERLEAVE)
>>>>> +		goto out_put;
>>>>> +#endif
>>>>> +
>>>>> +	/*
>>>>> +	 * Do a speculative lookup of the PTE entry.
>>>>> +	 */
>>>>> +	local_irq_disable();
>>>>> +	pgd = pgd_offset(mm, address);
>>>>> +	pgdval = READ_ONCE(*pgd);
>>>>> +	if (pgd_none(pgdval) || unlikely(pgd_bad(pgdval)))
>>>>> +		goto out_walk;
>>>>> +
>>>>> +	p4d = p4d_offset(pgd, address);
>>>>> +	p4dval = READ_ONCE(*p4d);
>>>>> +	if (p4d_none(p4dval) || unlikely(p4d_bad(p4dval)))
>>>>> +		goto out_walk;
>>>>> +
>>>>> +	vmf.pud = pud_offset(p4d, address);
>>>>> +	pudval = READ_ONCE(*vmf.pud);
>>>>> +	if (pud_none(pudval) || unlikely(pud_bad(pudval)))
>>>>> +		goto out_walk;
>>>>> +
>>>>> +	/* Huge pages at PUD level are not supported. */
>>>>> +	if (unlikely(pud_trans_huge(pudval)))
>>>>> +		goto out_walk;
>>>>> +
>>>>> +	vmf.pmd = pmd_offset(vmf.pud, address);
>>>>> +	vmf.orig_pmd = READ_ONCE(*vmf.pmd);
>>>>> +	/*
>>>>> +	 * pmd_none could mean that a hugepage collapse is in progress
>>>>> +	 * in our back as collapse_huge_page() mark it before
>>>>> +	 * invalidating the pte (which is done once the IPI is catched
>>>>> +	 * by all CPU and we have interrupt disabled).
>>>>> +	 * For this reason we cannot handle THP in a speculative way since we
>>>>> +	 * can't safely indentify an in progress collapse operation done in our
>>>>> +	 * back on that PMD.
>>>>> +	 * Regarding the order of the following checks, see comment in
>>>>> +	 * pmd_devmap_trans_unstable()
>>>>> +	 */
>>>>> +	if (unlikely(pmd_devmap(vmf.orig_pmd) ||
>>>>> +		     pmd_none(vmf.orig_pmd) || pmd_trans_huge(vmf.orig_pmd) ||
>>>>> +		     is_swap_pmd(vmf.orig_pmd)))
>>>>> +		goto out_walk;
>>>>> +
>>>>> +	/*
>>>>> +	 * The above does not allocate/instantiate page-tables because doing so
>>>>> +	 * would lead to the possibility of instantiating page-tables after
>>>>> +	 * free_pgtables() -- and consequently leaking them.
>>>>> +	 *
>>>>> +	 * The result is that we take at least one !speculative fault per PMD
>>>>> +	 * in order to instantiate it.
>>>>> +	 */
>>>>> +
>>>>> +	vmf.pte = pte_offset_map(vmf.pmd, address);
>>>>> +	vmf.orig_pte = READ_ONCE(*vmf.pte);
>>>>> +	barrier(); /* See comment in handle_pte_fault() */
>>>>> +	if (pte_none(vmf.orig_pte)) {
>>>>> +		pte_unmap(vmf.pte);
>>>>> +		vmf.pte = NULL;
>>>>> +	}
>>>>> +
>>>>> +	vmf.vma = vma;
>>>>> +	vmf.pgoff = linear_page_index(vma, address);
>>>>> +	vmf.gfp_mask = __get_fault_gfp_mask(vma);
>>>>> +	vmf.sequence = seq;
>>>>> +	vmf.flags = flags;
>>>>> +
>>>>> +	local_irq_enable();
>>>>> +
>>>>> +	/*
>>>>> +	 * We need to re-validate the VMA after checking the bounds, otherwise
>>>>> +	 * we might have a false positive on the bounds.
>>>>> +	 */
>>>>> +	if (read_seqcount_retry(&vma->vm_sequence, seq))
>>>>> +		goto out_put;
>>>>> +
>>>>> +	mem_cgroup_oom_enable();
>>>>> +	ret = handle_pte_fault(&vmf);
>>>>> +	mem_cgroup_oom_disable();
>>>>> +
>>>>> +	put_vma(vma);
>>>>> +
>>>>> +	/*
>>>>> +	 * The task may have entered a memcg OOM situation but
>>>>> +	 * if the allocation error was handled gracefully (no
>>>>> +	 * VM_FAULT_OOM), there is no need to kill anything.
>>>>> +	 * Just clean up the OOM state peacefully.
>>>>> +	 */
>>>>> +	if (task_in_memcg_oom(current) && !(ret & VM_FAULT_OOM))
>>>>> +		mem_cgroup_oom_synchronize(false);
>>>>> +	return ret;
>>>>> +
>>>>> +out_walk:
>>>>> +	local_irq_enable();
>>>>> +out_put:
>>>>> +	put_vma(vma);
>>>>> +	return ret;
>>>>> +}
>>>>> +#endif /* CONFIG_SPECULATIVE_PAGE_FAULT */
>>>>> +
>>>>>  /*
>>>>>   * By the time we get here, we already hold the mm semaphore
>>>>>   *
>>> .
>>>
>>
>
> .
>

^ permalink raw reply

* Re: [PATCH v4 1/1] KVM: PPC: Book3S HV: pack VCORE IDs to access full VCPU ID space
From: Paul Mackerras @ 2018-07-25 11:24 UTC (permalink / raw)
  To: Sam Bobroff; +Cc: linuxppc-dev, kvm, kvm-ppc, david, clg
In-Reply-To: <dc8e985d05d7de08cce080dfe6e70beeed2a50a1.1532499120.git.sbobroff@linux.ibm.com>

On Wed, Jul 25, 2018 at 04:12:02PM +1000, Sam Bobroff wrote:
> From: Sam Bobroff <sam.bobroff@au1.ibm.com>
> 
> It is not currently possible to create the full number of possible
> VCPUs (KVM_MAX_VCPUS) on Power9 with KVM-HV when the guest uses less
> threads per core than it's core stride (or "VSMT mode"). This is
> because the VCORE ID and XIVE offsets to grow beyond KVM_MAX_VCPUS
> even though the VCPU ID is less than KVM_MAX_VCPU_ID.
> 
> To address this, "pack" the VCORE ID and XIVE offsets by using
> knowledge of the way the VCPU IDs will be used when there are less
> guest threads per core than the core stride. The primary thread of
> each core will always be used first. Then, if the guest uses more than
> one thread per core, these secondary threads will sequentially follow
> the primary in each core.
> 
> So, the only way an ID above KVM_MAX_VCPUS can be seen, is if the
> VCPUs are being spaced apart, so at least half of each core is empty
> and IDs between KVM_MAX_VCPUS and (KVM_MAX_VCPUS * 2) can be mapped
> into the second half of each core (4..7, in an 8-thread core).
> 
> Similarly, if IDs above KVM_MAX_VCPUS * 2 are seen, at least 3/4 of
> each core is being left empty, and we can map down into the second and
> third quarters of each core (2, 3 and 5, 6 in an 8-thread core).
> 
> Lastly, if IDs above KVM_MAX_VCPUS * 4 are seen, only the primary
> threads are being used and 7/8 of the core is empty, allowing use of
> the 1, 5, 3 and 7 thread slots.
> 
> (Strides less than 8 are handled similarly.)
> 
> This allows the VCORE ID or offset to be calculated quickly from the
> VCPU ID or XIVE server numbers, without access to the VCPU structure.
> 
> Signed-off-by: Sam Bobroff <sam.bobroff@au1.ibm.com>

Thanks, applied to my kvm-ppc-next branch.

Paul.

^ permalink raw reply

* Re: [RFC PATCH v2] powerpc/64s: Move idle code to powernv C code
From: Gautham R Shenoy @ 2018-07-25 11:26 UTC (permalink / raw)
  To: Nicholas Piggin; +Cc: linuxppc-dev, Gautham R . Shenoy, Vaidyanathan Srinivasan
In-Reply-To: <20180721042924.32206-1-npiggin@gmail.com>

Hello Nicholas,

On Sat, Jul 21, 2018 at 02:29:24PM +1000, Nicholas Piggin wrote:
> Reimplement Book3S idle code to C, in the powernv platform code.
> Assembly stubs are used to save and restore the stack frame and
> non-volatile GPRs before going to idle, but these are small and
> mostly agnostic to microarchitecture implementation details.
> 
> The optimisation where EC=ESL=0 idle modes did not have to save
> GPRs or mtmsrd L=0 is restored, because it's simple to do.
> 
> Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> but saves and restores them all explicitly. This can easily be
> extended to tracking the set of system-wide SPRs that do not have
> to be saved each time.
> 
> Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> way this stuff will cope with non-trivial new CPU implementation
> details, firmware changes, etc., without becoming unmaintainable.
> 
> Since RFC v1:
> - Now tested and working with POWER9 hash and radix.
> - KVM support added. This took a bit of work to untangle and might
>   still have some issues, but POWER9 seems to work including hash on
>   radix with dependent threads mode.
> - This snowballed a bit because of KVM and other details making it
>   not feasible to leave POWER7/8 code alone. That's only half done
>   at the moment.
> - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
>   support done it might be another hundred or so lines of C.
> 
> Would appreciate any feedback on the approach in particular the
> significantly different (and hopefully cleaner) KVM approach.


I have reviewed the C-code movement from idle_book3s.S to
powernv/idle.c. Some comments on that part of the code inline.

I haven't yet gone through the book3s_hv_rmhandlers.S code changes.


> 
> Thanks,
> Nick
> 

[..snip..]

> 
>  #ifdef CONFIG_PPC_POWERNV
> -	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> -	u32 *core_idle_state_ptr;
> -	u8 thread_idle_state;		/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> -	/* Mask to indicate thread id in core */
> -	u8 thread_mask;
> -	/* Mask to denote subcore sibling threads */
> -	u8 subcore_sibling_mask;
> -	/* Flag to request this thread not to stop */
> -	atomic_t dont_stop;
> -	/* The PSSCR value that the kernel requested before going to stop */
> -	u64 requested_psscr;
> +	union {
> +		/* P7/P8 specific fields */
> +		struct {
> +			/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> +			unsigned long *core_idle_state_ptr;

Do we still need *core_idle_state_ptr ? Since this code is accessed
through C where we have access to paca_ptrs global variable we can
reference the "idle_state" (defined for P9 below). A lot of
locking/unlocking code in powernv/idle.c further down the patch
already does that, right?



> +			/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> +			u8 thread_idle_state;
> +			/* Mask to indicate thread id in core */
> +			u8 thread_mask;
> +			/* Mask to denote subcore sibling threads */
> +			u8 subcore_sibling_mask;
> +		};
> 
> -	/*
> -	 * Save area for additional SPRs that need to be
> -	 * saved/restored during cpuidle stop.
> -	 */
> -	struct stop_sprs stop_sprs;
> +		/* P9 specific fields */
> +		struct {
> +			/* The PSSCR value that the kernel requested before going to stop */
> +			u64 requested_psscr;
> +			unsigned long idle_state;
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +			/* Flag to request this thread not to stop */
> +			atomic_t dont_stop;
> +#endif
> +		};
> +	};
>  #endif
> 

[..snip..]

>  	/*
> diff --git a/arch/powerpc/platforms/powernv/idle.c b/arch/powerpc/platforms/powernv/idle.c
> index 12f13acee1f6..5b6af21ea9aa 100644
> --- a/arch/powerpc/platforms/powernv/idle.c
> +++ b/arch/powerpc/platforms/powernv/idle.c
> @@ -16,6 +16,7 @@
>  #include <linux/device.h>

[..snip..]

> +static inline void atomic_start_thread_idle(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	int thread_nr = cpu_thread_in_core(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;

We are using this in both POWER8 and POWER9 code and are (rightly IMO)
referencing the idle_state variable.

> +
> +	clear_bit(thread_nr, state);
> +}
> +
> +static inline void atomic_stop_thread_idle(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	int thread_nr = cpu_thread_in_core(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +
> +	set_bit(thread_nr, state);
> +}
> +
> +static inline void atomic_lock_thread_idle(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +
> +	while (unlikely(test_and_set_bit_lock(NR_PNV_CORE_IDLE_LOCK_BIT, state)))
> +		barrier();
> +	isync();
> +}
> +
> +static inline void atomic_unlock_and_stop_thread_idle(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +	u64 s = READ_ONCE(*state);
> +	u64 new, tmp;
> +
> +	BUG_ON(!(s & PNV_CORE_IDLE_LOCK_BIT));
> +	BUG_ON(s & thread);
> +
> +again:
> +	new = (s | thread) & ~PNV_CORE_IDLE_LOCK_BIT;
> +	tmp = cmpxchg(state, s, new);
> +	if (unlikely(tmp != s)) {
> +		s = tmp;
> +		goto again;
> +	}
> +}
> +
> +static inline void atomic_unlock_thread_idle(void)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +
> +	BUG_ON(!test_bit(NR_PNV_CORE_IDLE_LOCK_BIT, state));
> +	clear_bit_unlock(NR_PNV_CORE_IDLE_LOCK_BIT, state);
> +}
> +
> +struct p7_sprs {
> +	/* per core */
> +	u64 tscr;
> +	u64 worc;
> +
> +	/* per subcore */
> +	u64 sdr1;
> +	u64 rpr;
> +	u64 amor;
> +
> +	/* per thread */
> +	u64 lpcr;
> +	u64 hfscr;
> +	u64 fscr;
> +	u64 pid;
> +	u64 purr;
> +	u64 spurr;
> +	u64 dscr;
> +	u64 wort;
> +
> +	u64 mmcra;
> +	u32 mmcr0;
> +	u32 mmcr1;
> +	u64 mmcr2;
> +};
> +
> +static unsigned long power7_idle_insn(unsigned long type)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long thread = 1UL << cpu_thread_in_core(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;

Ditto.

> +	unsigned long srr1;
> +	struct p7_sprs sprs;
> +	int rc;
> +
> +	memset(&sprs, 0, sizeof(sprs));
> +
> +	if (type != PNV_THREAD_NAP) {
> +		atomic_lock_thread_idle();
> +
> +		if (power7_fastsleep_workaround_entry) {
> +			/* XXX: lock over application, last thread of core */
> +			rc = opal_config_cpu_idle_state(OPAL_CONFIG_IDLE_FASTSLEEP,
> +							OPAL_CONFIG_IDLE_APPLY);
> +			if (rc)
> +				return 0; /* XXX */
> +		}
> +	}
> +
> +	if (type == PNV_THREAD_WINKLE) {
> +		sprs.rpr = mfspr(SPRN_RPR);
> +		sprs.tscr = mfspr(SPRN_TSCR);

Per-subcore SPRs aren't being saved here ?

> +
> +		sprs.lpcr = mfspr(SPRN_LPCR);
> +		sprs.hfscr = mfspr(SPRN_HFSCR);
> +		sprs.fscr = mfspr(SPRN_FSCR);
> +		sprs.pid = mfspr(SPRN_PID);
> +		sprs.purr = mfspr(SPRN_PURR);
> +		sprs.spurr = mfspr(SPRN_SPURR);
> +		sprs.dscr = mfspr(SPRN_DSCR);
> +
> +		sprs.mmcra = mfspr(SPRN_MMCRA);
> +		sprs.mmcr0 = mfspr(SPRN_MMCR0);
> +		sprs.mmcr1 = mfspr(SPRN_MMCR1);
> +		sprs.mmcr2 = mfspr(SPRN_MMCR2);
> +	}
> +
> +	atomic_unlock_thread_idle();
> +
> +	srr1 = isa206_idle_insn_mayloss(type);
> +
> +	WARN_ON_ONCE(!srr1);
> +	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
> +
> +	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
> +		hmi_exception_realmode(NULL);
> +
> +	if (likely((srr1 & SRR1_WAKESTATE) != SRR1_WS_HVLOSS))
> +		return srr1;
> +
> +	/* HV state loss */
> +	if (type == PNV_THREAD_NAP) {
> +		BUG();
> +		for (;;)
> +			;
> +	}
> +
> +	atomic_lock_thread_idle();
> +
> +	WARN_ON(*state & thread);
> +
> +	if ((*state & ((1 << threads_per_core) - 1)) != 0) {
> +		isync();
> +		goto core_woken;
> +	}
> +
> +	/* Per-core SPRs */
> +	mtspr(SPRN_RPR, sprs.rpr);
> +	mtspr(SPRN_TSCR, sprs.tscr);
> +
> +	if (power7_fastsleep_workaround_exit) {
> +		rc = opal_config_cpu_idle_state(OPAL_CONFIG_IDLE_FASTSLEEP,
> +						OPAL_CONFIG_IDLE_UNDO);
> +		WARN_ON(rc);
> +	}
> +


> +	/* TB */
> +	if (opal_resync_timebase() != OPAL_SUCCESS)
> +		BUG();

If we are waking up only from fastsleep, we won't lose any SPRs, but
only the timebase. So we can return at this point.


> +
> +core_woken:
> +	atomic_unlock_and_stop_thread_idle();
> +
> +	/* Per-thread SPRs */
> +	mtspr(SPRN_LPCR, sprs.lpcr);
> +	mtspr(SPRN_HFSCR, sprs.hfscr);
> +	mtspr(SPRN_FSCR, sprs.fscr);
> +	mtspr(SPRN_PID, sprs.pid);
> +	mtspr(SPRN_PURR, sprs.purr);
> +	mtspr(SPRN_SPURR, sprs.spurr);
> +	mtspr(SPRN_DSCR, sprs.dscr);
> +
> +	mtspr(SPRN_MMCRA, sprs.mmcra);
> +	mtspr(SPRN_MMCR0, sprs.mmcr0);
> +	mtspr(SPRN_MMCR1, sprs.mmcr1);
> +	mtspr(SPRN_MMCR2, sprs.mmcr2);
> +
> +	__slb_flush_and_rebolt();
> +
> +	return srr1;
> +}
> +
> +extern unsigned long idle_kvm_start_guest(unsigned long srr1);
> +
> +unsigned long power7_offline(void)
> +{
> +	unsigned long srr1;
> +
> +	mtmsr(MSR_IDLE);
> +
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +	/* Tell KVM we're entering idle. */
> +	/******************************************************/
> +	/*  N O T E   W E L L    ! ! !    N O T E   W E L L   */
> +	/* The following store to HSTATE_HWTHREAD_STATE(r13)  */
> +	/* MUST occur in real mode, i.e. with the MMU off,    */
> +	/* and the MMU must stay off until we clear this flag */
> +	/* and test HSTATE_HWTHREAD_REQ(r13) in               */
> +	/* pnv_powersave_wakeup in this file.                 */
> +	/* The reason is that another thread can switch the   */
> +	/* MMU to a guest context whenever this flag is set   */
> +	/* to KVM_HWTHREAD_IN_IDLE, and if the MMU was on,    */
> +	/* that would potentially cause this thread to start  */
> +	/* executing instructions from guest memory in        */
> +	/* hypervisor mode, leading to a host crash or data   */
> +	/* corruption, or worse.                              */
> +	/******************************************************/
> +	local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
> +#endif
> +
> +	__ppc64_runlatch_off();
> +	srr1 = power7_idle_insn(power7_offline_type);
> +	__ppc64_runlatch_on();
> +
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +	if (local_paca->kvm_hstate.hwthread_state != KVM_HWTHREAD_IN_KERNEL) {
> +		local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
> +		/* Order setting hwthread_state vs. testing hwthread_req */
> +		smp_mb();
> +	}
> +	if (local_paca->kvm_hstate.hwthread_req)
> +		srr1 = idle_kvm_start_guest(srr1);
> +#endif
> +
> +	mtmsr(MSR_KERNEL);
> +
> +	return srr1;
> +}
> +
>  static unsigned long __power7_idle_type(unsigned long type)
>  {
>  	unsigned long srr1;
> @@ -320,9 +521,11 @@ static unsigned long __power7_idle_type(unsigned long type)
>  	if (!prep_irq_for_idle_irqsoff())
>  		return 0;
> 
> +	mtmsr(MSR_IDLE);
>  	__ppc64_runlatch_off();
>  	srr1 = power7_idle_insn(type);
>  	__ppc64_runlatch_on();
> +	mtmsr(MSR_KERNEL);
> 
>  	fini_irq_for_idle_irqsoff();
> 
> @@ -345,6 +548,242 @@ void power7_idle(void)
>  	power7_idle_type(PNV_THREAD_NAP);
>  }
> 
> +struct p9_sprs {
> +	/* per core */
> +	u64 ptcr;
> +	u64 rpr;
> +	u64 tscr;
> +	u64 ldbar;
> +	u64 amor;
> +
> +	/* per thread */
> +	u64 lpcr;
> +	u64 hfscr;
> +	u64 fscr;
> +	u64 pid;
> +	u64 purr;
> +	u64 spurr;
> +	u64 dscr;
> +	u64 wort;
> +
> +	u64 mmcra;
> +	u32 mmcr0;
> +	u32 mmcr1;
> +	u64 mmcr2;
> +};
> +
> +static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> +{
> +	int cpu = raw_smp_processor_id();
> +	int first = cpu_first_thread_sibling(cpu);
> +	unsigned long *state = &paca_ptrs[first]->idle_state;
> +	unsigned long srr1;
> +	unsigned long mmcr0 = 0;
> +	struct p9_sprs sprs;
> +	bool sprs_saved = false;
> +
> +	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
> +		WARN_ON(!mmu_on);

So this is a warning that we are trying to perform a ESL = 0 stop from a
kvm offline case. Hence we cannot return IR|DR disabled.

> +
> +		/*
> +		 * Wake synchronously. SRESET via xscom may still cause
> +		 * a 0x100 powersave wakeup with SRR1 reason!
> +		 */
> +		srr1 = isa3_idle_stop_noloss(psscr);
> +		if (likely(!srr1))
> +			return 0;
> +
> +		if ((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS) {
> +			/*
> +			 * Registers not saved, can't recover!
> +			 * This would be a hardware bug
> +			 */
> +			BUG();
> +			for (;;)
> +				;
> +		}
> +
> +	} else {
> +		if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
> +			 /*
> +			  * POWER9 DD2 can incorrectly set PMAO when waking up
> +			  * after a state-loss idle. Saving and restoring MMCR0
> +			  * over idle is a workaround.
> +			  */
> +			mmcr0 = mfspr(SPRN_MMCR0);
> +		}
> +		if ((psscr & PSSCR_RL_MASK) >= pnv_first_hv_loss_level) {
> +			sprs.lpcr = mfspr(SPRN_LPCR);
> +			sprs.hfscr = mfspr(SPRN_HFSCR);
> +			sprs.fscr = mfspr(SPRN_FSCR);
> +			sprs.pid = mfspr(SPRN_PID);
> +			sprs.purr = mfspr(SPRN_PURR);
> +			sprs.spurr = mfspr(SPRN_SPURR);
> +			sprs.dscr = mfspr(SPRN_DSCR);
> +			sprs.wort = mfspr(SPRN_WORT);
> +
> +			sprs.mmcra = mfspr(SPRN_MMCRA);
> +			sprs.mmcr0 = mfspr(SPRN_MMCR0);
> +			sprs.mmcr1 = mfspr(SPRN_MMCR1);
> +			sprs.mmcr2 = mfspr(SPRN_MMCR2);
> +
> +			sprs.ptcr = mfspr(SPRN_PTCR);
> +			sprs.rpr = mfspr(SPRN_RPR);
> +			sprs.tscr = mfspr(SPRN_TSCR);
> +			sprs.ldbar = mfspr(SPRN_LDBAR);
> +			sprs.amor = mfspr(SPRN_AMOR);
> +
> +			atomic_start_thread_idle();
> +			sprs_saved = true;
> +		}
> +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +		if (cpu_has_feature(CPU_FTR_P9_TM_XER_SO_BUG)) {
> +			local_paca->requested_psscr = psscr;
> +			/* order setting requested_psscr vs testing dont_stop */
> +			smp_mb();
> +			if (atomic_read(&local_paca->dont_stop)) {
> +				local_paca->requested_psscr = 0;
> +				return 0;
> +			}

Can we move the local_paca->dont_stop check early on before saving the
SPRs ?

> +
> +			srr1 = isa3_idle_stop_mayloss(psscr);
> +			local_paca->requested_psscr = 0;
> +		} else
> +#endif
> +			srr1 = isa3_idle_stop_mayloss(psscr);
> +
> +	       psscr = mfspr(SPRN_PSSCR);
> +	}
> +
> +	WARN_ON_ONCE(!srr1);
> +	WARN_ON_ONCE(mfmsr() & (MSR_IR|MSR_DR));
> +
> +	if ((srr1 & SRR1_WAKESTATE) != SRR1_WS_NOLOSS) {
> +		unsigned long mmcra;
> +
> +		/*
> +		 * Workaround for POWER9 DD2, if we lost resources, the ERAT
> +		 * might have been mixed up and needs flushing. We also need
> +		 * to reload MMCR0 (see mmcr0 comment above).
> +		 */
> +		if (!cpu_has_feature(CPU_FTR_POWER9_DD2_1)) {
> +			asm volatile(PPC_INVALIDATE_ERAT);
> +			mtspr(SPRN_MMCR0, mmcr0);
> +		}
> +
> +		/*
> +		 * DD2.2 and earlier need to set then clear bit 60 in MMCRA
> +		 * to ensure the PMU starts running.
> +		 */
> +		mmcra = mfspr(SPRN_MMCRA);
> +		mmcra |= PPC_BIT(60);
> +		mtspr(SPRN_MMCRA, mmcra);
> +		mmcra &= ~PPC_BIT(60);
> +		mtspr(SPRN_MMCRA, mmcra);
> +	}
> +
> +	if (unlikely((srr1 & SRR1_WAKEMASK_P8) == SRR1_WAKEHMI))
> +		hmi_exception_realmode(NULL);
> +
> +	/*
> +	 * On POWER9, SRR1 bits do not match exactly as expected.
> +	 * SRR1_WS_GPRLOSS (10b) can also result in SPR loss, so
> +	 * always test PSSCR if there is any state loss.
> +	 */
> +	if (likely((psscr & PSSCR_RL_MASK) < pnv_first_hv_loss_level)) {
> +		if (sprs_saved)
> +			atomic_stop_thread_idle();
> +		goto out;
> +	}
> +
> +	/* HV state loss */
> +	if (!sprs_saved) {
> +		BUG();
> +		for (;;)
> +			;
> +	}
> +
> +	atomic_lock_thread_idle();
> +
> +	if ((*state & ((1 << threads_per_core) - 1)) != 0)
> +		goto core_woken;
> +
> +	/* Per-core SPRs */
> +	mtspr(SPRN_PTCR, sprs.ptcr);
> +	mtspr(SPRN_RPR, sprs.rpr);
> +	mtspr(SPRN_TSCR, sprs.tscr);
> +	mtspr(SPRN_LDBAR, sprs.ldbar);
> +	mtspr(SPRN_AMOR, sprs.amor);
> +
> +	if ((psscr & PSSCR_RL_MASK) >= pnv_first_tb_loss_level) {
> +		/* TB loss */
> +		if (opal_resync_timebase() != OPAL_SUCCESS)
> +			BUG();
> +	}
> +
> +core_woken:
> +	atomic_unlock_and_stop_thread_idle();
> +
> +	/* Per-thread SPRs */
> +	mtspr(SPRN_LPCR, sprs.lpcr);
> +	mtspr(SPRN_HFSCR, sprs.hfscr);
> +	mtspr(SPRN_FSCR, sprs.fscr);
> +	mtspr(SPRN_PID, sprs.pid);
> +	mtspr(SPRN_PURR, sprs.purr);
> +	mtspr(SPRN_SPURR, sprs.spurr);
> +	mtspr(SPRN_DSCR, sprs.dscr);
> +	mtspr(SPRN_WORT, sprs.wort);
> +
> +	mtspr(SPRN_MMCRA, sprs.mmcra);
> +	mtspr(SPRN_MMCR0, sprs.mmcr0);
> +	mtspr(SPRN_MMCR1, sprs.mmcr1);
> +	mtspr(SPRN_MMCR2, sprs.mmcr2);

We need to set SPRN_SPRG3 to local_paca->sprg_vdso. Something I missed
last year while adding restore_additional_sprs.


> +
> +	if (!radix_enabled())
> +		__slb_flush_and_rebolt();
> +
> +out:
> +	if (mmu_on)
> +		mtmsr(MSR_KERNEL);
> +
> +	return srr1;
> +}
> +
> +unsigned long power9_offline_stop(unsigned long psscr)
> +{
> +	unsigned long srr1;
> +
> +#ifndef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> +	__ppc64_runlatch_off();
> +	srr1 = power9_idle_stop(psscr, true);
> +	__ppc64_runlatch_on();
> +#else
> +	/*
> +	 * Tell KVM we're entering idle.
> +	 * This does not have to be done in real mode because the P9 MMU
> +	 * is independent per-thread. Some steppings share radix/hash mode
> +	 * between threads, but in that case KVM has a barrier sync in real
> +	 * mode before and after switching between radix and hash.
> +	 */
> +	local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
> +
> +	__ppc64_runlatch_off();
> +	srr1 = power9_idle_stop(psscr, false);


Ok, so the "mmu_on" parameter for power9_idle_stop() indicates whether
we should return from the function call with translation enabled or
not. In the case we are offlining the secondaries in the context of
KVM, we want to ensure that when we wakeup we are still in real-mode
so that we can perform the hypervisor--> guest transition in
real-mode.

Is this understanding correct ?

If yes, and we requested an ESL=0 stop here (may be the device tree
only had lite states), then srr1 returned by the function will be 0
here.

Shouldn't we explicitly set an MSR_IDLE at this point ? 

> +	__ppc64_runlatch_on();
> +
> +	if (local_paca->kvm_hstate.hwthread_state != KVM_HWTHREAD_IN_KERNEL) {
> +		local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_KERNEL;
> +		/* Order setting hwthread_state vs. testing hwthread_req */
> +		smp_mb();
> +	}
> +	if (local_paca->kvm_hstate.hwthread_req)
> +		srr1 = idle_kvm_start_guest(srr1);
> +	mtmsr(MSR_KERNEL);
> +#endif
> +
> +	return srr1;
> +}
> +
>  static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
>  				      unsigned long stop_psscr_mask)
>  {
> @@ -358,7 +797,7 @@ static unsigned long __power9_idle_type(unsigned long stop_psscr_val,
>  	psscr = (psscr & ~stop_psscr_mask) | stop_psscr_val;
> 
>  	__ppc64_runlatch_off();
> -	srr1 = power9_idle_stop(psscr);
> +	srr1 = power9_idle_stop(psscr, true);
>  	__ppc64_runlatch_on();
> 
>  	fini_irq_for_idle_irqsoff();
> @@ -407,7 +846,7 @@ void pnv_power9_force_smt4_catch(void)
>  			atomic_inc(&paca_ptrs[cpu0+thr]->dont_stop);
>  	}
>  	/* order setting dont_stop vs testing requested_psscr */
> -	mb();
> +	smp_mb();
>  	for (thr = 0; thr < threads_per_core; ++thr) {
>  		if (!paca_ptrs[cpu0+thr]->requested_psscr)
>  			++awake_threads;
> @@ -466,7 +905,7 @@ static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
>  	 * Program the LPCR via stop-api only if the deepest stop state
>  	 * can lose hypervisor context.
>  	 */
> -	if (supported_cpuidle_states & OPAL_PM_LOSE_FULL_CONTEXT)
> +	if (pm_flags_possible & OPAL_PM_LOSE_FULL_CONTEXT)
>  		opal_slw_set_reg(pir, SPRN_LPCR, lpcr_val);
>  }
> 
> @@ -478,7 +917,6 @@ static void pnv_program_cpu_hotplug_lpcr(unsigned int cpu, u64 lpcr_val)
>  unsigned long pnv_cpu_offline(unsigned int cpu)
>  {
>  	unsigned long srr1;
> -	u32 idle_states = pnv_get_supported_cpuidle_states();
>  	u64 lpcr_val;
> 
>  	/*
> @@ -503,15 +941,8 @@ unsigned long pnv_cpu_offline(unsigned int cpu)
>  		psscr = (psscr & ~pnv_deepest_stop_psscr_mask) |
>  						pnv_deepest_stop_psscr_val;
>  		srr1 = power9_offline_stop(psscr);
> -
> -	} else if ((idle_states & OPAL_PM_WINKLE_ENABLED) &&
> -		   (idle_states & OPAL_PM_LOSE_FULL_CONTEXT)) {
> -		srr1 = power7_idle_insn(PNV_THREAD_WINKLE);
> -	} else if ((idle_states & OPAL_PM_SLEEP_ENABLED) ||
> -		   (idle_states & OPAL_PM_SLEEP_ENABLED_ER1)) {
> -		srr1 = power7_idle_insn(PNV_THREAD_SLEEP);
> -	} else if (idle_states & OPAL_PM_NAP_ENABLED) {
> -		srr1 = power7_idle_insn(PNV_THREAD_NAP);
> +	} else if (cpu_has_feature(CPU_FTR_ARCH_206) && power7_offline_type) {
> +		srr1 = power7_offline();
>  	} else {
>  		/* This is the fallback method. We emulate snooze */
>  		while (!generic_check_cpu_restart(cpu)) {
> @@ -623,7 +1054,8 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
>  	u64 *psscr_val = NULL;
>  	u64 *psscr_mask = NULL;
>  	u32 *residency_ns = NULL;
> -	u64 max_residency_ns = 0;
> +	u64 max_deep_residency_ns = 0;
> +	u64 max_default_residency_ns = 0;
>  	int rc = 0, i;
> 
>  	psscr_val = kcalloc(dt_idle_states, sizeof(*psscr_val), GFP_KERNEL);
> @@ -661,26 +1093,33 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
>  	}
> 
>  	/*
> -	 * Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
> -	 * and the pnv_default_stop_{val,mask}.
> -	 *
> -	 * pnv_first_deep_stop_state should be set to the first stop
> -	 * level to cause hypervisor state loss.
> -	 *
>  	 * pnv_deepest_stop_{val,mask} should be set to values corresponding to
>  	 * the deepest stop state.
>  	 *
>  	 * pnv_default_stop_{val,mask} should be set to values corresponding to
> -	 * the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
> +	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
>  	 */
> -	pnv_first_deep_stop_state = MAX_STOP_STATE;
> +	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
> +	pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
>  	for (i = 0; i < dt_idle_states; i++) {
>  		int err;
>  		u64 psscr_rl = psscr_val[i] & PSSCR_RL_MASK;
> 
> +		/*
> +		 * Deep state idle entry/exit does not currently cope with
> +		 * states that lose timebase but not SPRs, skip.
> +		 */
> +		if ( (flags[i] & OPAL_PM_TIMEBASE_STOP) &&
> +		    !(flags[i] & OPAL_PM_LOSE_FULL_CONTEXT))
> +			continue;

On POWER8 we have the capability of handling fastsleep which loses
timebase but not SPRs.

Would it be ok to set the pnv_first_hv_loss_level to the smallest of
the first state that loses either TIMEBASE or FULL_CONTEXT ? Worst
case we will end up restoring the SPRs which are not lost, but
functionally that is not incorrect no?


> +
> +		if ((flags[i] & OPAL_PM_TIMEBASE_STOP) &&
> +		     (pnv_first_tb_loss_level > psscr_rl))
> +			pnv_first_tb_loss_level = psscr_rl;
> +
>  		if ((flags[i] & OPAL_PM_LOSE_FULL_CONTEXT) &&
> -		     (pnv_first_deep_stop_state > psscr_rl))
> -			pnv_first_deep_stop_state = psscr_rl;
> +		     (pnv_first_hv_loss_level > psscr_rl))
> +			pnv_first_hv_loss_level = psscr_rl;
> 
>  		err = validate_psscr_val_mask(&psscr_val[i], &psscr_mask[i],
>  					      flags[i]);

--
Thanks and Regards
gautham.

^ permalink raw reply

* Re: [RFC PATCH v2] powerpc/64s: Move idle code to powernv C code
From: Nicholas Piggin @ 2018-07-25 12:31 UTC (permalink / raw)
  To: Gautham R Shenoy; +Cc: linuxppc-dev, Vaidyanathan Srinivasan
In-Reply-To: <20180725112645.GA9947@in.ibm.com>

Hi Gautham,

Thanks for the review, I also missed one or two things from you last
one, but I haven't forgotten them.

 On Wed, 25 Jul 2018 16:56:45 +0530
Gautham R Shenoy <ego@linux.vnet.ibm.com> wrote:

> Hello Nicholas,
> 
> On Sat, Jul 21, 2018 at 02:29:24PM +1000, Nicholas Piggin wrote:
> > Reimplement Book3S idle code to C, in the powernv platform code.
> > Assembly stubs are used to save and restore the stack frame and
> > non-volatile GPRs before going to idle, but these are small and
> > mostly agnostic to microarchitecture implementation details.
> > 
> > The optimisation where EC=ESL=0 idle modes did not have to save
> > GPRs or mtmsrd L=0 is restored, because it's simple to do.
> > 
> > Idle wakeup no longer uses the ->cpu_restore call to reinit SPRs,
> > but saves and restores them all explicitly. This can easily be
> > extended to tracking the set of system-wide SPRs that do not have
> > to be saved each time.
> > 
> > Moving the HMI, SPR, OPAL, locking, etc. to C is the only real
> > way this stuff will cope with non-trivial new CPU implementation
> > details, firmware changes, etc., without becoming unmaintainable.
> > 
> > Since RFC v1:
> > - Now tested and working with POWER9 hash and radix.
> > - KVM support added. This took a bit of work to untangle and might
> >   still have some issues, but POWER9 seems to work including hash on
> >   radix with dependent threads mode.
> > - This snowballed a bit because of KVM and other details making it
> >   not feasible to leave POWER7/8 code alone. That's only half done
> >   at the moment.
> > - So far this trades about 800 lines of asm for 500 of C. With POWER7/8
> >   support done it might be another hundred or so lines of C.
> > 
> > Would appreciate any feedback on the approach in particular the
> > significantly different (and hopefully cleaner) KVM approach.  
> 
> 
> I have reviewed the C-code movement from idle_book3s.S to
> powernv/idle.c. Some comments on that part of the code inline.

Thanks. Yes the POWER7/8 code is a bit incomplete as you noticed, but
you had some good suggestions there too.

> I haven't yet gone through the book3s_hv_rmhandlers.S code changes.

Yeah that's very tricky code and I don't know if I got it right yet.
Will have to run it past the KVM people too.

> 
> 
> > 
> > Thanks,
> > Nick
> >   
> 
> [..snip..]
> 
> > 
> >  #ifdef CONFIG_PPC_POWERNV
> > -	/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> > -	u32 *core_idle_state_ptr;
> > -	u8 thread_idle_state;		/* PNV_THREAD_RUNNING/NAP/SLEEP	*/
> > -	/* Mask to indicate thread id in core */
> > -	u8 thread_mask;
> > -	/* Mask to denote subcore sibling threads */
> > -	u8 subcore_sibling_mask;
> > -	/* Flag to request this thread not to stop */
> > -	atomic_t dont_stop;
> > -	/* The PSSCR value that the kernel requested before going to stop */
> > -	u64 requested_psscr;
> > +	union {
> > +		/* P7/P8 specific fields */
> > +		struct {
> > +			/* Per-core mask tracking idle threads and a lock bit-[L][TTTTTTTT] */
> > +			unsigned long *core_idle_state_ptr;  
> 
> Do we still need *core_idle_state_ptr ? Since this code is accessed
> through C where we have access to paca_ptrs global variable we can
> reference the "idle_state" (defined for P9 below). A lot of
> locking/unlocking code in powernv/idle.c further down the patch
> already does that, right?

Yeah you're probably right I think.

> > +
> > +	if (type == PNV_THREAD_WINKLE) {
> > +		sprs.rpr = mfspr(SPRN_RPR);
> > +		sprs.tscr = mfspr(SPRN_TSCR);  
> 
> Per-subcore SPRs aren't being saved here ?

Right, it's a bit incomplete. I'll post out a new version with P8
support soon.


> > +
> > +	if (power7_fastsleep_workaround_exit) {
> > +		rc = opal_config_cpu_idle_state(OPAL_CONFIG_IDLE_FASTSLEEP,
> > +						OPAL_CONFIG_IDLE_UNDO);
> > +		WARN_ON(rc);
> > +	}
> > +  
> 
> 
> > +	/* TB */
> > +	if (opal_resync_timebase() != OPAL_SUCCESS)
> > +		BUG();  
> 
> If we are waking up only from fastsleep, we won't lose any SPRs, but
> only the timebase. So we can return at this point.

Okay, got it.

> > +static unsigned long power9_idle_stop(unsigned long psscr, bool mmu_on)
> > +{
> > +	int cpu = raw_smp_processor_id();
> > +	int first = cpu_first_thread_sibling(cpu);
> > +	unsigned long *state = &paca_ptrs[first]->idle_state;
> > +	unsigned long srr1;
> > +	unsigned long mmcr0 = 0;
> > +	struct p9_sprs sprs;
> > +	bool sprs_saved = false;
> > +
> > +	if (!(psscr & (PSSCR_EC|PSSCR_ESL))) {
> > +		WARN_ON(!mmu_on);  
> 
> So this is a warning that we are trying to perform a ESL = 0 stop from a
> kvm offline case. Hence we cannot return IR|DR disabled.

Yes.

> > +#ifdef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +		if (cpu_has_feature(CPU_FTR_P9_TM_XER_SO_BUG)) {
> > +			local_paca->requested_psscr = psscr;
> > +			/* order setting requested_psscr vs testing dont_stop */
> > +			smp_mb();
> > +			if (atomic_read(&local_paca->dont_stop)) {
> > +				local_paca->requested_psscr = 0;
> > +				return 0;
> > +			}  
> 
> Can we move the local_paca->dont_stop check early on before saving the
> SPRs ?

It should be a very rare case but yes I don't see why not.

> > +	/* Per-thread SPRs */
> > +	mtspr(SPRN_LPCR, sprs.lpcr);
> > +	mtspr(SPRN_HFSCR, sprs.hfscr);
> > +	mtspr(SPRN_FSCR, sprs.fscr);
> > +	mtspr(SPRN_PID, sprs.pid);
> > +	mtspr(SPRN_PURR, sprs.purr);
> > +	mtspr(SPRN_SPURR, sprs.spurr);
> > +	mtspr(SPRN_DSCR, sprs.dscr);
> > +	mtspr(SPRN_WORT, sprs.wort);
> > +
> > +	mtspr(SPRN_MMCRA, sprs.mmcra);
> > +	mtspr(SPRN_MMCR0, sprs.mmcr0);
> > +	mtspr(SPRN_MMCR1, sprs.mmcr1);
> > +	mtspr(SPRN_MMCR2, sprs.mmcr2);  
> 
> We need to set SPRN_SPRG3 to local_paca->sprg_vdso. Something I missed
> last year while adding restore_additional_sprs.

Will add that.

> > +unsigned long power9_offline_stop(unsigned long psscr)
> > +{
> > +	unsigned long srr1;
> > +
> > +#ifndef CONFIG_KVM_BOOK3S_HV_POSSIBLE
> > +	__ppc64_runlatch_off();
> > +	srr1 = power9_idle_stop(psscr, true);
> > +	__ppc64_runlatch_on();
> > +#else
> > +	/*
> > +	 * Tell KVM we're entering idle.
> > +	 * This does not have to be done in real mode because the P9 MMU
> > +	 * is independent per-thread. Some steppings share radix/hash mode
> > +	 * between threads, but in that case KVM has a barrier sync in real
> > +	 * mode before and after switching between radix and hash.
> > +	 */
> > +	local_paca->kvm_hstate.hwthread_state = KVM_HWTHREAD_IN_IDLE;
> > +
> > +	__ppc64_runlatch_off();
> > +	srr1 = power9_idle_stop(psscr, false);  
> 
> 
> Ok, so the "mmu_on" parameter for power9_idle_stop() indicates whether
> we should return from the function call with translation enabled or
> not. In the case we are offlining the secondaries in the context of
> KVM, we want to ensure that when we wakeup we are still in real-mode
> so that we can perform the hypervisor--> guest transition in
> real-mode.
> 
> Is this understanding correct ?

Yes basically. Well we can wake up in guest MMU context so host
can't turn the MMU back on before calling back to KVM code.

> If yes, and we requested an ESL=0 stop here (may be the device tree
> only had lite states), then srr1 returned by the function will be 0
> here.
> 
> Shouldn't we explicitly set an MSR_IDLE at this point ? 

Previously we could not support EC=0 mode because KVM required a SRESET
wake up. The C code does not require that, but I still don't think it
can work properly because we don't get an SRR1. Without that, I think
KVM doesn't know what to do when it wakes up. So I think it's a bit
harder than it looks.

> > @@ -661,26 +1093,33 @@ static int __init pnv_power9_idle_init(struct device_node *np, u32 *flags,
> >  	}
> > 
> >  	/*
> > -	 * Set pnv_first_deep_stop_state, pnv_deepest_stop_psscr_{val,mask},
> > -	 * and the pnv_default_stop_{val,mask}.
> > -	 *
> > -	 * pnv_first_deep_stop_state should be set to the first stop
> > -	 * level to cause hypervisor state loss.
> > -	 *
> >  	 * pnv_deepest_stop_{val,mask} should be set to values corresponding to
> >  	 * the deepest stop state.
> >  	 *
> >  	 * pnv_default_stop_{val,mask} should be set to values corresponding to
> > -	 * the shallowest (OPAL_PM_STOP_INST_FAST) loss-less stop state.
> > +	 * the deepest loss-less (OPAL_PM_STOP_INST_FAST) stop state.
> >  	 */
> > -	pnv_first_deep_stop_state = MAX_STOP_STATE;
> > +	pnv_first_tb_loss_level = MAX_STOP_STATE + 1;
> > +	pnv_first_hv_loss_level = MAX_STOP_STATE + 1;
> >  	for (i = 0; i < dt_idle_states; i++) {
> >  		int err;
> >  		u64 psscr_rl = psscr_val[i] & PSSCR_RL_MASK;
> > 
> > +		/*
> > +		 * Deep state idle entry/exit does not currently cope with
> > +		 * states that lose timebase but not SPRs, skip.
> > +		 */
> > +		if ( (flags[i] & OPAL_PM_TIMEBASE_STOP) &&
> > +		    !(flags[i] & OPAL_PM_LOSE_FULL_CONTEXT))
> > +			continue;  
> 
> On POWER8 we have the capability of handling fastsleep which loses
> timebase but not SPRs.

Right.

> Would it be ok to set the pnv_first_hv_loss_level to the smallest of
> the first state that loses either TIMEBASE or FULL_CONTEXT ? Worst
> case we will end up restoring the SPRs which are not lost, but
> functionally that is not incorrect no?

This is POWER9 init so it should be okay, I will do that for POWER8.

I could just force the FULL_STOP flag for POWER9 case so theoretical
future state that loses TB first can at least be used as you say,
without adding too much complexity or untested code.

Thanks,
Nick

^ permalink raw reply

* Re: [PATCH] powerpc/64s: free page table caches at exit_mmap time
From: Aneesh Kumar K.V @ 2018-07-25 12:47 UTC (permalink / raw)
  To: Nicholas Piggin, linuxppc-dev; +Cc: Aneesh Kumar K . V, Nicholas Piggin
In-Reply-To: <20180725095428.22561-1-npiggin@gmail.com>

Nicholas Piggin <npiggin@gmail.com> writes:

> The kernel page table caches are tied to init_mm, so there is no
> more need for them after userspace is finished.
>

The commit message could be improved with reference to active_mm.

something like?

destroy_context get called when we drop the last reference for mm which
could be much later than the task exit due to other lazy mm reference to
it. We could free the page table cache pages on task exit because they
only cache the userspace page table and kernel thread should not access
the user space address.

The mapping for kernel threads itself is maintained in init_mm and page
table cache for that is attached to init_mm

Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>

> Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
> ---
>  arch/powerpc/mm/mmu_context_book3s64.c | 5 +++--
>  1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/mm/mmu_context_book3s64.c b/arch/powerpc/mm/mmu_context_book3s64.c
> index 3bb5cec03d1f..5738c2db751c 100644
> --- a/arch/powerpc/mm/mmu_context_book3s64.c
> +++ b/arch/powerpc/mm/mmu_context_book3s64.c
> @@ -221,7 +221,7 @@ static void pmd_frag_destroy(void *pmd_frag)
>  	}
>  }
>  
> -static void destroy_pagetable_page(struct mm_struct *mm)
> +static void destroy_pagetable_cache(struct mm_struct *mm)
>  {
>  	void *frag;
>  
> @@ -244,13 +244,14 @@ void destroy_context(struct mm_struct *mm)
>  		WARN_ON(process_tb[mm->context.id].prtb0 != 0);
>  	else
>  		subpage_prot_free(mm);
> -	destroy_pagetable_page(mm);
>  	destroy_contexts(&mm->context);
>  	mm->context.id = MMU_NO_CONTEXT;
>  }
>  
>  void arch_exit_mmap(struct mm_struct *mm)
>  {
> +	destroy_pagetable_cache(mm);
> +
>  	if (radix_enabled()) {
>  		/*
>  		 * Radix doesn't have a valid bit in the process table
> -- 
> 2.17.0

^ permalink raw reply

* Re: [PATCH 4/7 v6] iommu/arm-smmu: Add support for the fsl-mc bus
From: Robin Murphy @ 2018-07-25 12:48 UTC (permalink / raw)
  To: Nipun Gupta, will.deacon, robh+dt, robh, mark.rutland,
	catalin.marinas, gregkh, laurentiu.tudor, bhelgaas, hch
  Cc: joro, m.szyprowski, shawnguo, frowand.list, iommu, linux-kernel,
	devicetree, linux-arm-kernel, linuxppc-dev, linux-pci,
	bharat.bhushan, stuyoder, leoyang.li
In-Reply-To: <1531135103-10699-5-git-send-email-nipun.gupta@nxp.com>

On 09/07/18 12:18, Nipun Gupta wrote:
> Implement bus specific support for the fsl-mc bus including
> registering arm_smmu_ops and bus specific device add operations.

I guess this is about as neat as it can get;

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
> ---
>   drivers/iommu/arm-smmu.c |  7 +++++++
>   drivers/iommu/iommu.c    | 13 +++++++++++++
>   include/linux/fsl/mc.h   |  8 ++++++++
>   include/linux/iommu.h    |  2 ++
>   4 files changed, 30 insertions(+)
> 
> diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
> index f7a96bc..a011bb6 100644
> --- a/drivers/iommu/arm-smmu.c
> +++ b/drivers/iommu/arm-smmu.c
> @@ -52,6 +52,7 @@
>   #include <linux/spinlock.h>
>   
>   #include <linux/amba/bus.h>
> +#include <linux/fsl/mc.h>
>   
>   #include "io-pgtable.h"
>   #include "arm-smmu-regs.h"
> @@ -1459,6 +1460,8 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev)
>   
>   	if (dev_is_pci(dev))
>   		group = pci_device_group(dev);
> +	else if (dev_is_fsl_mc(dev))
> +		group = fsl_mc_device_group(dev);
>   	else
>   		group = generic_device_group(dev);
>   
> @@ -2037,6 +2040,10 @@ static void arm_smmu_bus_init(void)
>   		bus_set_iommu(&pci_bus_type, &arm_smmu_ops);
>   	}
>   #endif
> +#ifdef CONFIG_FSL_MC_BUS
> +	if (!iommu_present(&fsl_mc_bus_type))
> +		bus_set_iommu(&fsl_mc_bus_type, &arm_smmu_ops);
> +#endif
>   }
>   
>   static int arm_smmu_device_probe(struct platform_device *pdev)
> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
> index d227b86..df2f49e 100644
> --- a/drivers/iommu/iommu.c
> +++ b/drivers/iommu/iommu.c
> @@ -32,6 +32,7 @@
>   #include <linux/pci.h>
>   #include <linux/bitops.h>
>   #include <linux/property.h>
> +#include <linux/fsl/mc.h>
>   #include <trace/events/iommu.h>
>   
>   static struct kset *iommu_group_kset;
> @@ -988,6 +989,18 @@ struct iommu_group *pci_device_group(struct device *dev)
>   	return iommu_group_alloc();
>   }
>   
> +/* Get the IOMMU group for device on fsl-mc bus */
> +struct iommu_group *fsl_mc_device_group(struct device *dev)
> +{
> +	struct device *cont_dev = fsl_mc_cont_dev(dev);
> +	struct iommu_group *group;
> +
> +	group = iommu_group_get(cont_dev);
> +	if (!group)
> +		group = iommu_group_alloc();
> +	return group;
> +}
> +
>   /**
>    * iommu_group_get_for_dev - Find or create the IOMMU group for a device
>    * @dev: target device
> diff --git a/include/linux/fsl/mc.h b/include/linux/fsl/mc.h
> index f27cb14..dddaca1 100644
> --- a/include/linux/fsl/mc.h
> +++ b/include/linux/fsl/mc.h
> @@ -351,6 +351,14 @@ struct fsl_mc_io {
>   #define dev_is_fsl_mc(_dev) (0)
>   #endif
>   
> +/* Macro to check if a device is a container device */
> +#define fsl_mc_is_cont_dev(_dev) (to_fsl_mc_device(_dev)->flags & \
> +	FSL_MC_IS_DPRC)
> +
> +/* Macro to get the container device of a MC device */
> +#define fsl_mc_cont_dev(_dev) (fsl_mc_is_cont_dev(_dev) ? \
> +	(_dev) : (_dev)->parent)
> +
>   /*
>    * module_fsl_mc_driver() - Helper macro for drivers that don't do
>    * anything special in module init/exit.  This eliminates a lot of
> diff --git a/include/linux/iommu.h b/include/linux/iommu.h
> index 7447b0b..209891d 100644
> --- a/include/linux/iommu.h
> +++ b/include/linux/iommu.h
> @@ -389,6 +389,8 @@ static inline size_t iommu_map_sg(struct iommu_domain *domain,
>   extern struct iommu_group *pci_device_group(struct device *dev);
>   /* Generic device grouping function */
>   extern struct iommu_group *generic_device_group(struct device *dev);
> +/* FSL-MC device grouping function */
> +struct iommu_group *fsl_mc_device_group(struct device *dev);
>   
>   /**
>    * struct iommu_fwspec - per-device IOMMU instance data
> 

^ permalink raw reply

* Re: [PATCH 5/7 v6] bus/fsl-mc: support dma configure for devices on fsl-mc bus
From: Robin Murphy @ 2018-07-25 12:51 UTC (permalink / raw)
  To: Nipun Gupta, will.deacon, robh+dt, robh, mark.rutland,
	catalin.marinas, gregkh, laurentiu.tudor, bhelgaas, hch
  Cc: joro, m.szyprowski, shawnguo, frowand.list, iommu, linux-kernel,
	devicetree, linux-arm-kernel, linuxppc-dev, linux-pci,
	bharat.bhushan, stuyoder, leoyang.li
In-Reply-To: <1531135103-10699-6-git-send-email-nipun.gupta@nxp.com>

On 09/07/18 12:18, Nipun Gupta wrote:
> This patch adds support of dma configuration for devices on fsl-mc
> bus using 'dma_configure' callback for busses. Also, directly calling
> arch_setup_dma_ops is removed from the fsl-mc bus.

Reviewed-by: Robin Murphy <robin.murphy@arm.com>

> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
> Reviewed-by: Laurentiu Tudor <laurentiu.tudor@nxp.com>
> ---
>   drivers/bus/fsl-mc/fsl-mc-bus.c | 15 +++++++++++----
>   1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c b/drivers/bus/fsl-mc/fsl-mc-bus.c
> index 5d8266c..fa43c7d 100644
> --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> @@ -127,6 +127,16 @@ static int fsl_mc_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
>   	return 0;
>   }
>   
> +static int fsl_mc_dma_configure(struct device *dev)
> +{
> +	struct device *dma_dev = dev;
> +
> +	while (dev_is_fsl_mc(dma_dev))
> +		dma_dev = dma_dev->parent;
> +
> +	return of_dma_configure(dev, dma_dev->of_node, 0);
> +}
> +
>   static ssize_t modalias_show(struct device *dev, struct device_attribute *attr,
>   			     char *buf)
>   {
> @@ -148,6 +158,7 @@ struct bus_type fsl_mc_bus_type = {
>   	.name = "fsl-mc",
>   	.match = fsl_mc_bus_match,
>   	.uevent = fsl_mc_bus_uevent,
> +	.dma_configure  = fsl_mc_dma_configure,
>   	.dev_groups = fsl_mc_dev_groups,
>   };
>   EXPORT_SYMBOL_GPL(fsl_mc_bus_type);
> @@ -633,10 +644,6 @@ int fsl_mc_device_add(struct fsl_mc_obj_desc *obj_desc,
>   			goto error_cleanup_dev;
>   	}
>   
> -	/* Objects are coherent, unless 'no shareability' flag set. */
> -	if (!(obj_desc->flags & FSL_MC_OBJ_FLAG_NO_MEM_SHAREABILITY))
> -		arch_setup_dma_ops(&mc_dev->dev, 0, 0, NULL, true);
> -
>   	/*
>   	 * The device-specific probe callback will get invoked by device_add()
>   	 */
> 

^ permalink raw reply

* Re: [PATCH 1/7 v6] Documentation: fsl-mc: add iommu-map device-tree binding for fsl-mc bus
From: Robin Murphy @ 2018-07-25 13:00 UTC (permalink / raw)
  To: Nipun Gupta, will.deacon, robh+dt, robh, mark.rutland,
	catalin.marinas, gregkh, laurentiu.tudor, bhelgaas, hch
  Cc: joro, m.szyprowski, shawnguo, frowand.list, iommu, linux-kernel,
	devicetree, linux-arm-kernel, linuxppc-dev, linux-pci,
	bharat.bhushan, stuyoder, leoyang.li
In-Reply-To: <1531135103-10699-2-git-send-email-nipun.gupta@nxp.com>

On 09/07/18 12:18, Nipun Gupta wrote:
> The existing IOMMU bindings cannot be used to specify the relationship
> between fsl-mc devices and IOMMUs. This patch adds a generic binding for
> mapping fsl-mc devices to IOMMUs, using iommu-map property.

No more nits from me :)

Acked-by: Robin Murphy <robin.murphy@arm.com>

> Signed-off-by: Nipun Gupta <nipun.gupta@nxp.com>
> Reviewed-by: Rob Herring <robh@kernel.org>
> ---
>   .../devicetree/bindings/misc/fsl,qoriq-mc.txt      | 39 ++++++++++++++++++++++
>   1 file changed, 39 insertions(+)
> 
> diff --git a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
> index 6611a7c..01fdc33 100644
> --- a/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
> +++ b/Documentation/devicetree/bindings/misc/fsl,qoriq-mc.txt
> @@ -9,6 +9,25 @@ blocks that can be used to create functional hardware objects/devices
>   such as network interfaces, crypto accelerator instances, L2 switches,
>   etc.
>   
> +For an overview of the DPAA2 architecture and fsl-mc bus see:
> +Documentation/networking/dpaa2/overview.rst
> +
> +As described in the above overview, all DPAA2 objects in a DPRC share the
> +same hardware "isolation context" and a 10-bit value called an ICID
> +(isolation context id) is expressed by the hardware to identify
> +the requester.
> +
> +The generic 'iommus' property is insufficient to describe the relationship
> +between ICIDs and IOMMUs, so an iommu-map property is used to define
> +the set of possible ICIDs under a root DPRC and how they map to
> +an IOMMU.
> +
> +For generic IOMMU bindings, see
> +Documentation/devicetree/bindings/iommu/iommu.txt.
> +
> +For arm-smmu binding, see:
> +Documentation/devicetree/bindings/iommu/arm,smmu.txt.
> +
>   Required properties:
>   
>       - compatible
> @@ -88,14 +107,34 @@ Sub-nodes:
>                 Value type: <phandle>
>                 Definition: Specifies the phandle to the PHY device node associated
>                             with the this dpmac.
> +Optional properties:
> +
> +- iommu-map: Maps an ICID to an IOMMU and associated iommu-specifier
> +  data.
> +
> +  The property is an arbitrary number of tuples of
> +  (icid-base,iommu,iommu-base,length).
> +
> +  Any ICID i in the interval [icid-base, icid-base + length) is
> +  associated with the listed IOMMU, with the iommu-specifier
> +  (i - icid-base + iommu-base).
>   
>   Example:
>   
> +        smmu: iommu@5000000 {
> +               compatible = "arm,mmu-500";
> +               #iommu-cells = <1>;
> +               stream-match-mask = <0x7C00>;
> +               ...
> +        };
> +
>           fsl_mc: fsl-mc@80c000000 {
>                   compatible = "fsl,qoriq-mc";
>                   reg = <0x00000008 0x0c000000 0 0x40>,    /* MC portal base */
>                         <0x00000000 0x08340000 0 0x40000>; /* MC control reg */
>                   msi-parent = <&its>;
> +                /* define map for ICIDs 23-64 */
> +                iommu-map = <23 &smmu 23 41>;
>                   #address-cells = <3>;
>                   #size-cells = <1>;
>   
> 

^ permalink raw reply

* Re: [PATCH 1/4] treewide: convert ISO_8859-1 text comments to utf-8
From: Arnd Bergmann @ 2018-07-25 13:12 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Joe Perches, Samuel Ortiz, David S. Miller, Rob Herring,
	Michael Ellerman, Jonathan Cameron, linux-wireless, Networking,
	DTML, Linux Kernel Mailing List, Linux ARM,
	open list:HARDWARE RANDOM NUMBER GENERATOR CORE, linuxppc-dev,
	linux-iio, Linux PM list, lvs-devel, netfilter-devel, coreteam
In-Reply-To: <20180724175531.75276cf4e539124aa9e27177@linux-foundation.org>

tools/perf/tests/.gitignore:
                            LLVM byte-codes, uncompressed
On Wed, Jul 25, 2018 at 2:55 AM, Andrew Morton
<akpm@linux-foundation.org> wrote:
> On Tue, 24 Jul 2018 17:13:20 -0700 Joe Perches <joe@perches.com> wrote:
>
>> On Tue, 2018-07-24 at 14:00 -0700, Andrew Morton wrote:
>> > On Tue, 24 Jul 2018 13:13:25 +0200 Arnd Bergmann <arnd@arndb.de> wrote:
>> > > Almost all files in the kernel are either plain text or UTF-8
>> > > encoded. A couple however are ISO_8859-1, usually just a few
>> > > characters in a C comments, for historic reasons.
>> > > This converts them all to UTF-8 for consistency.
>> []
>> > Will we be getting a checkpatch rule to keep things this way?
>>
>> How would that be done?
>
> I'm using this, seems to work.
>
>         if ! file $p | grep -q -P ", ASCII text|, UTF-8 Unicode text"
>         then
>                 echo $p: weird charset
>         fi

There are a couple of files that my version of 'find' incorrectly identified as
something completely different, like:

Documentation/devicetree/bindings/pinctrl/pinctrl-sx150x.txt:
            SemOne archive data
Documentation/devicetree/bindings/rtc/epson,rtc7301.txt:
            Microsoft Document Imaging Format
Documentation/filesystems/nfs/pnfs-block-server.txt:
            PPMN archive data
arch/arm/boot/dts/bcm283x-rpi-usb-host.dtsi:
        Sendmail frozen configuration  - version = "host";
Documentation/networking/segmentation-offloads.txt:
        StuffIt Deluxe Segment (data) : gmentation Offloads in the
Linux Networking Stack
arch/sparc/include/asm/visasm.h:                              SAS 7+
arch/xtensa/kernel/setup.c:                                         ,
init=0x454c, stat=0x090a, dev=0x2009, bas=0x2020
drivers/cpufreq/powernow-k8.c:
TI-XX Graphing Calculator (FLASH)
tools/testing/selftests/net/forwarding/tc_shblocks.sh:
                            Minix filesystem, V2 (big endian)
tools/perf/tests/.gitignore:
                            LLVM byte-codes, uncompressed

All of the above seem to be valid ASCII or UTF-8 files, so the check
above will lead
to false-positives, but it may be good enough as they are the
exception, and may be
bugs in 'file'.

Not sure if we need to worry about 'file' not being installed.

       Arnd

^ permalink raw reply

* Re: [RFC 4/4] virtio: Add platform specific DMA API translation for virito devices
From: Michael S. Tsirkin @ 2018-07-25 13:31 UTC (permalink / raw)
  To: Anshuman Khandual
  Cc: virtualization, linux-kernel, linuxppc-dev, aik, robh, joe,
	elfring, david, jasowang, benh, mpe, hch, linuxram, haren, paulus,
	srikar
In-Reply-To: <3dd36d8e-3bc8-91ba-cf6d-939495439d2c@linux.vnet.ibm.com>

On Mon, Jul 23, 2018 at 07:46:09AM +0530, Anshuman Khandual wrote:
> There is a redundant definition of virtio_has_iommu_quirk in the tools
> directory (tools/virtio/linux/virtio_config.h) which does not seem to
> be used any where. I guess that can be removed without problem.

It's there just to make tools/virtio build.
Try
	make -C tools/virtio/
In fact I see there's a missing definition for
dma_rmb/dma_wmb there, I'll post a patch.

-- 
MST

^ permalink raw reply

* [PATCH] powerpc/64s/radix: tlb do not flush on page size when fullmm
From: Nicholas Piggin @ 2018-07-25 13:58 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Nicholas Piggin, Aneesh Kumar K . V

When the mm is being torn down there will be a full PID flush so
there is no need to flush the TLB on page size changes.

Signed-off-by: Nicholas Piggin <npiggin@gmail.com>
---
 arch/powerpc/include/asm/tlb.h | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/arch/powerpc/include/asm/tlb.h b/arch/powerpc/include/asm/tlb.h
index 5d3107f2b014..d1b3dc4a6a0a 100644
--- a/arch/powerpc/include/asm/tlb.h
+++ b/arch/powerpc/include/asm/tlb.h
@@ -84,6 +84,9 @@ static inline void __tlb_remove_tlb_entry(struct mmu_gather *tlb, pte_t *ptep,
 static inline void tlb_remove_check_page_size_change(struct mmu_gather *tlb,
 						     unsigned int page_size)
 {
+	if (tlb->fullmm)
+		return;
+
 	if (!tlb->page_size)
 		tlb->page_size = page_size;
 	else if (tlb->page_size != page_size) {
-- 
2.17.0

^ permalink raw reply related

* [RFC PATCH 0/4] mm: mmu_gather changes to support explicit paging
From: Nicholas Piggin @ 2018-07-25 14:06 UTC (permalink / raw)
  To: linux-mm; +Cc: Nicholas Piggin, linuxppc-dev, linux-arch, linux-arm-kernel

The first 3 patches in this series are some generic mm changes I
would like to make, including a possible fix which may(?) be needed
for ARM64. Other than the bugfix, these first 3 patches should not
change anything so hopefully they aren't too controversial.

The powerpc patch is also there for reference. 

Thanks,
Nick

Nicholas Piggin (4):
  mm: move tlb_table_flush to tlb_flush_mmu_free
  mm: mmu_notifier fix for tlb_end_vma
  mm: allow arch to have tlb_flush caled on an empty TLB range
  powerpc/64s/radix: optimise TLB flush with precise TLB ranges in
    mmu_gather

 arch/powerpc/include/asm/tlb.h | 34 +++++++++++++++++++++++++++++++++
 arch/powerpc/mm/tlb-radix.c    | 10 ++++++++++
 include/asm-generic/tlb.h      | 35 ++++++++++++++++++++++++++++++----
 mm/memory.c                    | 14 ++------------
 4 files changed, 77 insertions(+), 16 deletions(-)

-- 
2.17.0

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox