Linux virtualization list
 help / color / mirror / Atom feed
* Re: [PATCH v4 3/6] smp: add function to execute a function synchronously on a cpu
From: Peter Zijlstra @ 2016-04-05  8:11 UTC (permalink / raw)
  To: Juergen Gross
  Cc: x86, jeremy, jdelvare, hpa, akataria, linux-kernel,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, xen-devel, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-4-git-send-email-jgross@suse.com>

On Tue, Apr 05, 2016 at 07:10:04AM +0200, Juergen Gross wrote:
> +int smp_call_on_cpu(unsigned int cpu, bool pin, int (*func)(void *), void *par)

Why .pin and not .phys? .pin does not (to me) reflect the
hypervisor/physical-cpu thing.

Also, as per smp_call_function_single() would it not be more consistent
to make this the last argument?

> +{
> +	struct smp_call_on_cpu_struct sscs = {
> +		.work = __WORK_INITIALIZER(sscs.work, smp_call_on_cpu_callback),
> +		.done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
> +		.func = func,
> +		.data = par,
> +		.cpu  = pin ? cpu : -1,
> +	};
> +
> +	if (cpu >= nr_cpu_ids)

You might want to also include cpu_online().

	if (cpu >= nr_cpu_ids || !cpu_online(cpu))
> +		return -ENXIO;

Seeing how its fairly hard to schedule work on a cpu that's not actually
there.

> +
> +	queue_work_on(cpu, system_wq, &sscs.work);
> +	wait_for_completion(&sscs.done);
> +
> +	return sscs.ret;
> +}

^ permalink raw reply

* Re: [PATCH] virtio: fix "warning: ‘queue’ may be used uninitialized"
From: Michael S. Tsirkin @ 2016-04-05  8:04 UTC (permalink / raw)
  To: Jeff Mahoney; +Cc: virtualization
In-Reply-To: <5702AEFB.5090701@suse.com>

On Mon, Apr 04, 2016 at 02:14:19PM -0400, Jeff Mahoney wrote:
> This fixes the following warning:
> drivers/virtio/virtio_ring.c:1032:5: warning: ‘queue’ may be used
> uninitialized in this function
> 
> The conditions that govern when queue is set aren't apparent to gcc.
> 
> Setting queue = NULL clears the warning.
> 
> Signed-off-by: Jeff Mahoney <jeffm@suse.com>

Which gcc version produces this warning?
I do not seem to see it with gcc 5.3.1.
Also - use uninitialized_var then?

> ---
> 
>  drivers/virtio/virtio_ring.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> --- a/drivers/virtio/virtio_ring.c
> +++ b/drivers/virtio/virtio_ring.c
> @@ -1006,7 +1006,7 @@ struct virtqueue *vring_create_virtqueue
>  	const char *name)
>  {
>  	struct virtqueue *vq;
> -	void *queue;
> +	void *queue = NULL;
>  	dma_addr_t dma_addr;
>  	size_t queue_size_in_bytes;
>  	struct vring vring;
> 
> -- 
> Jeff Mahoney
> SUSE Labs
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* [PATCH v4 6/6] hwmon: use smp_call_on_cpu() for dell-smm i8k
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-1-git-send-email-jgross@suse.com>

Use the smp_call_on_cpu() function to call system management
mode on cpu 0.
Make call secure by adding get_online_cpus() to avoid e.g. suspend
resume cycles in between.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4: add call to get_online_cpus()
---
 drivers/hwmon/dell-smm-hwmon.c | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/drivers/hwmon/dell-smm-hwmon.c b/drivers/hwmon/dell-smm-hwmon.c
index c43318d..643f3a1 100644
--- a/drivers/hwmon/dell-smm-hwmon.c
+++ b/drivers/hwmon/dell-smm-hwmon.c
@@ -21,6 +21,7 @@
 
 #define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
 
+#include <linux/cpu.h>
 #include <linux/delay.h>
 #include <linux/module.h>
 #include <linux/types.h>
@@ -35,6 +36,7 @@
 #include <linux/uaccess.h>
 #include <linux/io.h>
 #include <linux/sched.h>
+#include <linux/smp.h>
 
 #include <linux/i8k.h>
 
@@ -130,23 +132,15 @@ static inline const char *i8k_get_dmi_data(int field)
 /*
  * Call the System Management Mode BIOS. Code provided by Jonathan Buzzard.
  */
-static int i8k_smm(struct smm_regs *regs)
+static int i8k_smm_func(void *par)
 {
 	int rc;
+	struct smm_regs *regs = par;
 	int eax = regs->eax;
-	cpumask_var_t old_mask;
 
 	/* SMM requires CPU 0 */
-	if (!alloc_cpumask_var(&old_mask, GFP_KERNEL))
-		return -ENOMEM;
-	cpumask_copy(old_mask, &current->cpus_allowed);
-	rc = set_cpus_allowed_ptr(current, cpumask_of(0));
-	if (rc)
-		goto out;
-	if (smp_processor_id() != 0) {
-		rc = -EBUSY;
-		goto out;
-	}
+	if (smp_processor_id() != 0)
+		return -EBUSY;
 
 #if defined(CONFIG_X86_64)
 	asm volatile("pushq %%rax\n\t"
@@ -204,13 +198,24 @@ static int i8k_smm(struct smm_regs *regs)
 	if (rc != 0 || (regs->eax & 0xffff) == 0xffff || regs->eax == eax)
 		rc = -EINVAL;
 
-out:
-	set_cpus_allowed_ptr(current, old_mask);
-	free_cpumask_var(old_mask);
 	return rc;
 }
 
 /*
+ * Call the System Management Mode BIOS.
+ */
+static int i8k_smm(struct smm_regs *regs)
+{
+	int ret;
+
+	get_online_cpus();
+	ret = smp_call_on_cpu(0, true, i8k_smm_func, regs);
+	put_online_cpus();
+
+	return ret;
+}
+
+/*
  * Read the fan status.
  */
 static int i8k_get_fan_status(int fan)
-- 
2.6.2

^ permalink raw reply related

* [PATCH v4 5/6] dcdbas: make use of smp_call_on_cpu()
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-1-git-send-email-jgross@suse.com>

Use smp_call_on_cpu() to raise SMI on cpu 0.
Make call secure by adding get_online_cpus() to avoid e.g. suspend
resume cycles in between.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4: add call to get_online_cpus()
---
 drivers/firmware/dcdbas.c | 51 ++++++++++++++++++++++++-----------------------
 1 file changed, 26 insertions(+), 25 deletions(-)

diff --git a/drivers/firmware/dcdbas.c b/drivers/firmware/dcdbas.c
index 829eec8..68e1d38 100644
--- a/drivers/firmware/dcdbas.c
+++ b/drivers/firmware/dcdbas.c
@@ -23,6 +23,7 @@
 #include <linux/platform_device.h>
 #include <linux/dma-mapping.h>
 #include <linux/errno.h>
+#include <linux/cpu.h>
 #include <linux/gfp.h>
 #include <linux/init.h>
 #include <linux/kernel.h>
@@ -238,33 +239,14 @@ static ssize_t host_control_on_shutdown_store(struct device *dev,
 	return count;
 }
 
-/**
- * dcdbas_smi_request: generate SMI request
- *
- * Called with smi_data_lock.
- */
-int dcdbas_smi_request(struct smi_cmd *smi_cmd)
+static int raise_smi(void *par)
 {
-	cpumask_var_t old_mask;
-	int ret = 0;
+	struct smi_cmd *smi_cmd = par;
 
-	if (smi_cmd->magic != SMI_CMD_MAGIC) {
-		dev_info(&dcdbas_pdev->dev, "%s: invalid magic value\n",
-			 __func__);
-		return -EBADR;
-	}
-
-	/* SMI requires CPU 0 */
-	if (!alloc_cpumask_var(&old_mask, GFP_KERNEL))
-		return -ENOMEM;
-
-	cpumask_copy(old_mask, &current->cpus_allowed);
-	set_cpus_allowed_ptr(current, cpumask_of(0));
 	if (smp_processor_id() != 0) {
 		dev_dbg(&dcdbas_pdev->dev, "%s: failed to get CPU 0\n",
 			__func__);
-		ret = -EBUSY;
-		goto out;
+		return -EBUSY;
 	}
 
 	/* generate SMI */
@@ -280,9 +262,28 @@ int dcdbas_smi_request(struct smi_cmd *smi_cmd)
 		: "memory"
 	);
 
-out:
-	set_cpus_allowed_ptr(current, old_mask);
-	free_cpumask_var(old_mask);
+	return 0;
+}
+/**
+ * dcdbas_smi_request: generate SMI request
+ *
+ * Called with smi_data_lock.
+ */
+int dcdbas_smi_request(struct smi_cmd *smi_cmd)
+{
+	int ret;
+
+	if (smi_cmd->magic != SMI_CMD_MAGIC) {
+		dev_info(&dcdbas_pdev->dev, "%s: invalid magic value\n",
+			 __func__);
+		return -EBADR;
+	}
+
+	/* SMI requires CPU 0 */
+	get_online_cpus();
+	ret = smp_call_on_cpu(0, true, raise_smi, smi_cmd);
+	put_online_cpus();
+
 	return ret;
 }
 
-- 
2.6.2

^ permalink raw reply related

* [PATCH v4 4/6] xen: add xen_pin_vcpu() to support calling functions on a dedicated pcpu
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-1-git-send-email-jgross@suse.com>

Some hardware models (e.g. Dell Studio 1555 laptops) require calls to
the firmware to be issued on cpu 0 only. As Dom0 might have to use
these calls, add xen_pin_vcpu() to achieve this functionality.

In case either the domain doesn't have the privilege to make the
related hypercall or the hypervisor isn't supporting it, issue a
warning once and disable further pinning attempts.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 arch/x86/xen/enlighten.c | 40 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 40 insertions(+)

diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index 880862c..7907bcf8 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1885,6 +1885,45 @@ static void xen_set_cpu_features(struct cpuinfo_x86 *c)
 	}
 }
 
+static void xen_pin_vcpu(int cpu)
+{
+	static bool disable_pinning;
+	struct sched_pin_override pin_override;
+	int ret;
+
+	if (disable_pinning)
+		return;
+
+	pin_override.pcpu = cpu;
+	ret = HYPERVISOR_sched_op(SCHEDOP_pin_override, &pin_override);
+	if (cpu < 0)
+		return;
+
+	switch (ret) {
+	case -ENOSYS:
+		pr_warn("The kernel tried to call a function on physical cpu %d, but Xen isn't\n"
+			"supporting this. In case of problems you might consider vcpu pinning.\n",
+			cpu);
+		disable_pinning = true;
+		break;
+	case -EPERM:
+		WARN(1, "Trying to pin vcpu without having privilege to do so\n");
+		disable_pinning = true;
+		break;
+	case -EINVAL:
+	case -EBUSY:
+		pr_warn("The kernel tried to call a function on physical cpu %d, but this cpu\n"
+			"seems not to be available. Please check your Xen cpu configuration.\n",
+			cpu);
+		break;
+	case 0:
+		break;
+	default:
+		WARN(1, "rc %d while trying to pin vcpu\n", ret);
+		disable_pinning = true;
+	}
+}
+
 const struct hypervisor_x86 x86_hyper_xen = {
 	.name			= "Xen",
 	.detect			= xen_platform,
@@ -1893,6 +1932,7 @@ const struct hypervisor_x86 x86_hyper_xen = {
 #endif
 	.x2apic_available	= xen_x2apic_para_available,
 	.set_cpu_features       = xen_set_cpu_features,
+	.pin_vcpu               = xen_pin_vcpu,
 };
 EXPORT_SYMBOL(x86_hyper_xen);
 
-- 
2.6.2

^ permalink raw reply related

* [PATCH v4 3/6] smp: add function to execute a function synchronously on a cpu
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-1-git-send-email-jgross@suse.com>

On some hardware models (e.g. Dell Studio 1555 laptop) some hardware
related functions (e.g. SMIs) are to be executed on physical cpu 0
only. Instead of open coding such a functionality multiple times in
the kernel add a service function for this purpose. This will enable
the possibility to take special measures in virtualized environments
like Xen, too.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4: change return value in case of illegal cpu as requested by Peter Zijlstra
    make pinning of vcpu an option as suggested by Peter Zijlstra

V2: instead of manipulating the allowed set of cpus use cpu specific
    workqueue as requested by Peter Zijlstra
---
 include/linux/smp.h |  2 ++
 kernel/smp.c        | 50 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/up.c         | 17 +++++++++++++++++
 3 files changed, 69 insertions(+)

diff --git a/include/linux/smp.h b/include/linux/smp.h
index c441407..3b5813b 100644
--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -196,4 +196,6 @@ extern void arch_enable_nonboot_cpus_end(void);
 
 void smp_setup_processor_id(void);
 
+int smp_call_on_cpu(unsigned int cpu, bool pin, int (*func)(void *), void *par);
+
 #endif /* __LINUX_SMP_H */
diff --git a/kernel/smp.c b/kernel/smp.c
index 9388064..357458b 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -740,3 +740,53 @@ void wake_up_all_idle_cpus(void)
 	preempt_enable();
 }
 EXPORT_SYMBOL_GPL(wake_up_all_idle_cpus);
+
+/**
+ * smp_call_on_cpu - Call a function on a specific cpu
+ *
+ * Used to call a function on a specific cpu and wait for it to return.
+ * Optionally make sure the call is done on a specified physical cpu via vcpu
+ * pinning in order to support virtualized environments.
+ */
+struct smp_call_on_cpu_struct {
+	struct work_struct	work;
+	struct completion	done;
+	int			(*func)(void *);
+	void			*data;
+	int			ret;
+	int			cpu;
+};
+
+static void smp_call_on_cpu_callback(struct work_struct *work)
+{
+	struct smp_call_on_cpu_struct *sscs;
+
+	sscs = container_of(work, struct smp_call_on_cpu_struct, work);
+	if (sscs->cpu >= 0)
+		hypervisor_pin_vcpu(sscs->cpu);
+	sscs->ret = sscs->func(sscs->data);
+	if (sscs->cpu >= 0)
+		hypervisor_pin_vcpu(-1);
+
+	complete(&sscs->done);
+}
+
+int smp_call_on_cpu(unsigned int cpu, bool pin, int (*func)(void *), void *par)
+{
+	struct smp_call_on_cpu_struct sscs = {
+		.work = __WORK_INITIALIZER(sscs.work, smp_call_on_cpu_callback),
+		.done = COMPLETION_INITIALIZER_ONSTACK(sscs.done),
+		.func = func,
+		.data = par,
+		.cpu  = pin ? cpu : -1,
+	};
+
+	if (cpu >= nr_cpu_ids)
+		return -ENXIO;
+
+	queue_work_on(cpu, system_wq, &sscs.work);
+	wait_for_completion(&sscs.done);
+
+	return sscs.ret;
+}
+EXPORT_SYMBOL_GPL(smp_call_on_cpu);
diff --git a/kernel/up.c b/kernel/up.c
index 3ccee2b..8266810b 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -83,3 +83,20 @@ void on_each_cpu_cond(bool (*cond_func)(int cpu, void *info),
 	preempt_enable();
 }
 EXPORT_SYMBOL(on_each_cpu_cond);
+
+int smp_call_on_cpu(unsigned int cpu, bool pin, int (*func)(void *), void *par)
+{
+	int ret;
+
+	if (cpu != 0)
+		return -ENXIO;
+
+	if (pin)
+		hypervisor_pin_vcpu(0);
+	ret = func(par);
+	if (pin)
+		hypervisor_pin_vcpu(-1);
+
+	return ret;
+}
+EXPORT_SYMBOL_GPL(smp_call_on_cpu);
-- 
2.6.2

^ permalink raw reply related

* [PATCH v4 2/6] virt, sched: add generic vcpu pinning support
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-1-git-send-email-jgross@suse.com>

Add generic virtualization support for pinning the current vcpu to a
specified physical cpu. As this operation isn't performance critical
(a very limited set of operations like BIOS calls and SMIs is expected
to need this) just add a hypervisor specific indirection.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
V4: move this patch some places up in the series
    WARN_ONCE in case platform doesn't support pinning as requested by
    Peter Zijlstra

V3: use getc_cpu()/put_cpu() as suggested by David Vrabel

V2: adapt to using workqueues
    add include/linux/hypervisor.h to hide architecture specific stuff
    from generic kernel code

In case paravirt maintainers don't want to be responsible for
include/linux/hypervisor.h I could take it.
---
 MAINTAINERS                       |  1 +
 arch/x86/include/asm/hypervisor.h |  4 ++++
 arch/x86/kernel/cpu/hypervisor.c  | 11 +++++++++++
 include/linux/hypervisor.h        | 17 +++++++++++++++++
 kernel/smp.c                      |  1 +
 kernel/up.c                       |  1 +
 6 files changed, 35 insertions(+)
 create mode 100644 include/linux/hypervisor.h

diff --git a/MAINTAINERS b/MAINTAINERS
index 2ec5079..959173e 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -8330,6 +8330,7 @@ S:	Supported
 F:	Documentation/virtual/paravirt_ops.txt
 F:	arch/*/kernel/paravirt*
 F:	arch/*/include/asm/paravirt.h
+F:	include/linux/hypervisor.h
 
 PARIDE DRIVERS FOR PARALLEL PORT IDE DEVICES
 M:	Tim Waugh <tim@cyberelk.net>
diff --git a/arch/x86/include/asm/hypervisor.h b/arch/x86/include/asm/hypervisor.h
index 055ea99..67942b6 100644
--- a/arch/x86/include/asm/hypervisor.h
+++ b/arch/x86/include/asm/hypervisor.h
@@ -43,6 +43,9 @@ struct hypervisor_x86 {
 
 	/* X2APIC detection (run once per boot) */
 	bool		(*x2apic_available)(void);
+
+	/* pin current vcpu to specified physical cpu (run rarely) */
+	void		(*pin_vcpu)(int);
 };
 
 extern const struct hypervisor_x86 *x86_hyper;
@@ -56,6 +59,7 @@ extern const struct hypervisor_x86 x86_hyper_kvm;
 extern void init_hypervisor(struct cpuinfo_x86 *c);
 extern void init_hypervisor_platform(void);
 extern bool hypervisor_x2apic_available(void);
+extern void hypervisor_pin_vcpu(int cpu);
 #else
 static inline void init_hypervisor(struct cpuinfo_x86 *c) { }
 static inline void init_hypervisor_platform(void) { }
diff --git a/arch/x86/kernel/cpu/hypervisor.c b/arch/x86/kernel/cpu/hypervisor.c
index 73d391a..ff108f8 100644
--- a/arch/x86/kernel/cpu/hypervisor.c
+++ b/arch/x86/kernel/cpu/hypervisor.c
@@ -85,3 +85,14 @@ bool __init hypervisor_x2apic_available(void)
 	       x86_hyper->x2apic_available &&
 	       x86_hyper->x2apic_available();
 }
+
+void hypervisor_pin_vcpu(int cpu)
+{
+	if (!x86_hyper)
+		return;
+
+	if (x86_hyper->pin_vcpu)
+		x86_hyper->pin_vcpu(cpu);
+	else
+		WARN_ONCE(1, "vcpu pinning requested but not supported!\n");
+}
diff --git a/include/linux/hypervisor.h b/include/linux/hypervisor.h
new file mode 100644
index 0000000..3fa5ef2
--- /dev/null
+++ b/include/linux/hypervisor.h
@@ -0,0 +1,17 @@
+#ifndef __LINUX_HYPEVISOR_H
+#define __LINUX_HYPEVISOR_H
+
+/*
+ *	Generic Hypervisor support
+ *		Juergen Gross <jgross@suse.com>
+ */
+
+#ifdef CONFIG_HYPERVISOR_GUEST
+#include <asm/hypervisor.h>
+#else
+static inline void hypervisor_pin_vcpu(int cpu)
+{
+}
+#endif
+
+#endif /* __LINUX_HYPEVISOR_H */
diff --git a/kernel/smp.c b/kernel/smp.c
index 7416544..9388064 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -14,6 +14,7 @@
 #include <linux/smp.h>
 #include <linux/cpu.h>
 #include <linux/sched.h>
+#include <linux/hypervisor.h>
 
 #include "smpboot.h"
 
diff --git a/kernel/up.c b/kernel/up.c
index 1760bf3..3ccee2b 100644
--- a/kernel/up.c
+++ b/kernel/up.c
@@ -6,6 +6,7 @@
 #include <linux/kernel.h>
 #include <linux/export.h>
 #include <linux/smp.h>
+#include <linux/hypervisor.h>
 
 int smp_call_function_single(int cpu, void (*func) (void *info), void *info,
 				int wait)
-- 
2.6.2

^ permalink raw reply related

* [PATCH v4 1/6] xen: sync xen header
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux
In-Reply-To: <1459833007-11618-1-git-send-email-jgross@suse.com>

Import the actual version of include/xen/interface/sched.h from Xen.

Signed-off-by: Juergen Gross <jgross@suse.com>
---
 include/xen/interface/sched.h | 100 ++++++++++++++++++++++++++++++++++--------
 1 file changed, 82 insertions(+), 18 deletions(-)

diff --git a/include/xen/interface/sched.h b/include/xen/interface/sched.h
index f184909..a4c4d73 100644
--- a/include/xen/interface/sched.h
+++ b/include/xen/interface/sched.h
@@ -3,6 +3,24 @@
  *
  * Scheduler state interactions
  *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to
+ * deal in the Software without restriction, including without limitation the
+ * rights to use, copy, modify, merge, publish, distribute, sublicense, and/or
+ * sell copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER
+ * DEALINGS IN THE SOFTWARE.
+ *
  * Copyright (c) 2005, Keir Fraser <keir@xensource.com>
  */
 
@@ -12,18 +30,30 @@
 #include <xen/interface/event_channel.h>
 
 /*
+ * Guest Scheduler Operations
+ *
+ * The SCHEDOP interface provides mechanisms for a guest to interact
+ * with the scheduler, including yield, blocking and shutting itself
+ * down.
+ */
+
+/*
  * The prototype for this hypercall is:
- *  long sched_op_new(int cmd, void *arg)
+ * long HYPERVISOR_sched_op(enum sched_op cmd, void *arg, ...)
+ *
  * @cmd == SCHEDOP_??? (scheduler operation).
  * @arg == Operation-specific extra argument(s), as described below.
+ * ...  == Additional Operation-specific extra arguments, described below.
  *
- * **NOTE**:
- * Versions of Xen prior to 3.0.2 provide only the following legacy version
+ * Versions of Xen prior to 3.0.2 provided only the following legacy version
  * of this hypercall, supporting only the commands yield, block and shutdown:
  *  long sched_op(int cmd, unsigned long arg)
  * @cmd == SCHEDOP_??? (scheduler operation).
  * @arg == 0               (SCHEDOP_yield and SCHEDOP_block)
  *      == SHUTDOWN_* code (SCHEDOP_shutdown)
+ *
+ * This legacy version is available to new guests as:
+ * long HYPERVISOR_sched_op_compat(enum sched_op cmd, unsigned long arg)
  */
 
 /*
@@ -44,12 +74,17 @@
 /*
  * Halt execution of this domain (all VCPUs) and notify the system controller.
  * @arg == pointer to sched_shutdown structure.
+ *
+ * If the sched_shutdown_t reason is SHUTDOWN_suspend then
+ * x86 PV guests must also set RDX (EDX for 32-bit guests) to the MFN
+ * of the guest's start info page.  RDX/EDX is the third hypercall
+ * argument.
+ *
+ * In addition, which reason is SHUTDOWN_suspend this hypercall
+ * returns 1 if suspend was cancelled or the domain was merely
+ * checkpointed, and 0 if it is resuming in a new domain.
  */
 #define SCHEDOP_shutdown    2
-struct sched_shutdown {
-    unsigned int reason; /* SHUTDOWN_* */
-};
-DEFINE_GUEST_HANDLE_STRUCT(sched_shutdown);
 
 /*
  * Poll a set of event-channel ports. Return when one or more are pending. An
@@ -57,12 +92,6 @@ DEFINE_GUEST_HANDLE_STRUCT(sched_shutdown);
  * @arg == pointer to sched_poll structure.
  */
 #define SCHEDOP_poll        3
-struct sched_poll {
-    GUEST_HANDLE(evtchn_port_t) ports;
-    unsigned int nr_ports;
-    uint64_t timeout;
-};
-DEFINE_GUEST_HANDLE_STRUCT(sched_poll);
 
 /*
  * Declare a shutdown for another domain. The main use of this function is
@@ -71,15 +100,11 @@ DEFINE_GUEST_HANDLE_STRUCT(sched_poll);
  * @arg == pointer to sched_remote_shutdown structure.
  */
 #define SCHEDOP_remote_shutdown        4
-struct sched_remote_shutdown {
-    domid_t domain_id;         /* Remote domain ID */
-    unsigned int reason;       /* SHUTDOWN_xxx reason */
-};
 
 /*
  * Latch a shutdown code, so that when the domain later shuts down it
  * reports this code to the control tools.
- * @arg == as for SCHEDOP_shutdown.
+ * @arg == sched_shutdown, as for SCHEDOP_shutdown.
  */
 #define SCHEDOP_shutdown_code 5
 
@@ -92,10 +117,47 @@ struct sched_remote_shutdown {
  * With id != 0 and timeout != 0, poke watchdog timer and set new timeout.
  */
 #define SCHEDOP_watchdog    6
+
+/*
+ * Override the current vcpu affinity by pinning it to one physical cpu or
+ * undo this override restoring the previous affinity.
+ * @arg == pointer to sched_pin_override structure.
+ *
+ * A negative pcpu value will undo a previous pin override and restore the
+ * previous cpu affinity.
+ * This call is allowed for the hardware domain only and requires the cpu
+ * to be part of the domain's cpupool.
+ */
+#define SCHEDOP_pin_override 7
+
+struct sched_shutdown {
+    unsigned int reason; /* SHUTDOWN_* => shutdown reason */
+};
+DEFINE_GUEST_HANDLE_STRUCT(sched_shutdown);
+
+struct sched_poll {
+    GUEST_HANDLE(evtchn_port_t) ports;
+    unsigned int nr_ports;
+    uint64_t timeout;
+};
+DEFINE_GUEST_HANDLE_STRUCT(sched_poll);
+
+struct sched_remote_shutdown {
+    domid_t domain_id;         /* Remote domain ID */
+    unsigned int reason;       /* SHUTDOWN_* => shutdown reason */
+};
+DEFINE_GUEST_HANDLE_STRUCT(sched_remote_shutdown);
+
 struct sched_watchdog {
     uint32_t id;                /* watchdog ID */
     uint32_t timeout;           /* timeout */
 };
+DEFINE_GUEST_HANDLE_STRUCT(sched_watchdog);
+
+struct sched_pin_override {
+    int32_t pcpu;
+};
+DEFINE_GUEST_HANDLE_STRUCT(sched_pin_override);
 
 /*
  * Reason codes for SCHEDOP_shutdown. These may be interpreted by control
@@ -107,6 +169,7 @@ struct sched_watchdog {
 #define SHUTDOWN_suspend    2  /* Clean up, save suspend info, kill.         */
 #define SHUTDOWN_crash      3  /* Tell controller we've crashed.             */
 #define SHUTDOWN_watchdog   4  /* Restart because watchdog time expired.     */
+
 /*
  * Domain asked to perform 'soft reset' for it. The expected behavior is to
  * reset internal Xen state for the domain returning it to the point where it
@@ -115,5 +178,6 @@ struct sched_watchdog {
  * interfaces again.
  */
 #define SHUTDOWN_soft_reset 5
+#define SHUTDOWN_MAX        5  /* Maximum valid shutdown reason.             */
 
 #endif /* __XEN_PUBLIC_SCHED_H__ */
-- 
2.6.2

^ permalink raw reply related

* [PATCH v4 0/6] Support calling functions on dedicated physical cpu
From: Juergen Gross @ 2016-04-05  5:10 UTC (permalink / raw)
  To: linux-kernel, xen-devel
  Cc: Juergen Gross, jeremy, jdelvare, peterz, hpa, akataria, x86,
	virtualization, chrisw, mingo, david.vrabel, Douglas_Warzecha,
	pali.rohar, boris.ostrovsky, tglx, linux

Some hardware (e.g. Dell Studio laptops) require special functions to
be called on physical cpu 0 in order to avoid occasional hangs. When
running as dom0 under Xen this could be achieved only via special boot
parameters (vcpu pinning) limiting the hypervisor in it's scheduling
decisions.

This patch series is adding a generic function to be able to temporarily
pin a (virtual) cpu to a dedicated physical cpu for executing above
mentioned functions on that specific cpu. The drivers (dcdbas and i8k)
requiring this functionality are modified accordingly.

Changes in V4:
- move patches 5 and 6 further up in the series
- patch 2 (was 5): WARN_ONCE in case platform doesn't support pinning
  as requested by Peter Zijlstra
- patch 3 (was 2): change return value in case of illegal cpu as
  requested by Peter Zijlstra
- patch 3 (was 2): make pinning of vcpu an option as suggested by
  Peter Zijlstra
- patches 5 and 6 (were 3 and 4): add call to get_online_cpus()

Changes in V3:
- use get_cpu()/put_cpu() as suggested by David Vrabel

Changes in V2:
- instead of manipulating the allowed set of cpus use cpu specific
  workqueue as requested by Peter Zijlstra
- add include/linux/hypervisor.h to hide architecture specific stuff
  from generic kernel code


Juergen Gross (6):
  xen: sync xen header
  virt, sched: add generic vcpu pinning support
  smp: add function to execute a function synchronously on a cpu
  xen: add xen_pin_vcpu() to support calling functions on a dedicated
    pcpu
  dcdbas: make use of smp_call_on_cpu()
  hwmon: use smp_call_on_cpu() for dell-smm i8k

 MAINTAINERS                       |   1 +
 arch/x86/include/asm/hypervisor.h |   4 ++
 arch/x86/kernel/cpu/hypervisor.c  |  11 +++++
 arch/x86/xen/enlighten.c          |  40 +++++++++++++++
 drivers/firmware/dcdbas.c         |  51 +++++++++----------
 drivers/hwmon/dell-smm-hwmon.c    |  35 +++++++------
 include/linux/hypervisor.h        |  17 +++++++
 include/linux/smp.h               |   2 +
 include/xen/interface/sched.h     | 100 +++++++++++++++++++++++++++++++-------
 kernel/smp.c                      |  51 +++++++++++++++++++
 kernel/up.c                       |  18 +++++++
 11 files changed, 272 insertions(+), 58 deletions(-)
 create mode 100644 include/linux/hypervisor.h

-- 
2.6.2

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Balbir Singh @ 2016-04-05  3:10 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, Naoya Horiguchi, jlayton,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160404060134.GA7555@bbox>



On 04/04/16 16:01, Minchan Kim wrote:
> On Mon, Apr 04, 2016 at 03:53:59PM +1000, Balbir Singh wrote:
>>
>> On 30/03/16 18:12, Minchan Kim wrote:
>>> Procedure of page migration is as follows:
>>>
>>> First of all, it should isolate a page from LRU and try to
>>> migrate the page. If it is successful, it releases the page
>>> for freeing. Otherwise, it should put the page back to LRU
>>> list.
>>>
>>> For LRU pages, we have used putback_lru_page for both freeing
>>> and putback to LRU list. It's okay because put_page is aware of
>>> LRU list so if it releases last refcount of the page, it removes
>>> the page from LRU list. However, It makes unnecessary operations
>>> (e.g., lru_cache_add, pagevec and flags operations. It would be
>>> not significant but no worth to do) and harder to support new
>>> non-lru page migration because put_page isn't aware of non-lru
>>> page's data structure.
>>>
>>> To solve the problem, we can add new hook in put_page with
>>> PageMovable flags check but it can increase overhead in
>>> hot path and needs new locking scheme to stabilize the flag check
>>> with put_page.
>>>
>>> So, this patch cleans it up to divide two semantic(ie, put and putback).
>>> If migration is successful, use put_page instead of putback_lru_page and
>>> use putback_lru_page only on failure. That makes code more readable
>>> and doesn't add overhead in put_page.
>> So effectively when we return from unmap_and_move() the page is either
>> put_page or putback_lru_page() and the page is gone from under us.
> I didn't get your point.
> Could you elaborate it more what you want to say about this patch?

I was just adding to my understanding of this change based on your changelog.
My understanding is that we take the extra reference in isolate_lru_page()
but by the time we return from unmap_and_move() we drop the extra reference

Balbir Singh

^ permalink raw reply

* Re: [PATCH] virtio: virtio 1.0 cs04 spec compliance for reset
From: Jason Wang @ 2016-04-05  2:53 UTC (permalink / raw)
  To: Michael S. Tsirkin, linux-kernel; +Cc: Bandan Das, stable, virtualization
In-Reply-To: <1459686336-29527-1-git-send-email-mst@redhat.com>



On 04/03/2016 08:26 PM, Michael S. Tsirkin wrote:
> The spec says: after writing 0 to device_status, the driver MUST wait
> for a read of device_status to return 0 before reinitializing the
> device.
>
> Cc: stable@vger.kernel.org
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> ---
>  drivers/virtio/virtio_pci_modern.c | 11 ++++++++---
>  1 file changed, 8 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/virtio/virtio_pci_modern.c b/drivers/virtio/virtio_pci_modern.c
> index f6f28cc..e76bd91 100644
> --- a/drivers/virtio/virtio_pci_modern.c
> +++ b/drivers/virtio/virtio_pci_modern.c
> @@ -17,6 +17,7 @@
>   *
>   */
>  
> +#include <linux/delay.h>
>  #define VIRTIO_PCI_NO_LEGACY
>  #include "virtio_pci_common.h"
>  
> @@ -271,9 +272,13 @@ static void vp_reset(struct virtio_device *vdev)
>  	struct virtio_pci_device *vp_dev = to_vp_device(vdev);
>  	/* 0 status means a reset. */
>  	vp_iowrite8(0, &vp_dev->common->device_status);
> -	/* Flush out the status write, and flush in device writes,
> -	 * including MSI-X interrupts, if any. */
> -	vp_ioread8(&vp_dev->common->device_status);
> +	/* After writing 0 to device_status, the driver MUST wait for a read of
> +	 * device_status to return 0 before reinitializing the device.
> +	 * This will flush out the status write, and flush in device writes,
> +	 * including MSI-X interrupts, if any.
> +	 */
> +	while (vp_ioread8(&vp_dev->common->device_status))
> +		msleep(1);
>  	/* Flush pending VQ/configuration callbacks. */
>  	vp_synchronize_vectors(vdev);
>  }

Acked-by: Jason Wang <jasowang@redhat.com>

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Naoya Horiguchi @ 2016-04-05  1:54 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: Rik van Riel, Sergey Senozhatsky, YiPing Xu, aquini@redhat.com,
	rknize@motorola.com, linux-mm@kvack.org, Chan Gyun Jeong,
	Hugh Dickins, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, bfields@fieldses.org,
	Minchan Kim, Gioh Kim, Mel Gorman, Sangseok Lee, Joonsoo Kim,
	Andrew Morton, jlayton@poochiereds.net, koct9i@gmail.com, Al Viro
In-Reply-To: <57027E47.7070909@suse.cz>

On Mon, Apr 04, 2016 at 04:46:31PM +0200, Vlastimil Babka wrote:
> On 04/04/2016 06:45 AM, Naoya Horiguchi wrote:
> > On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
...
> >>>
> >>> Also (but not your fault) the put_page() preceding
> >>> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> >>> pin we are releasing and which one we still have (hopefully? if I
> >>> read description of da1b13ccfbebe right) otherwise it looks like
> >>> doing something with a page that we just potentially freed.
> >>
> >> Yes, while I read the code, I had same question. I think the releasing
> >> refcount is for get_any_page.
> > 
> > As the other callers of page migration do, soft_offline_page expects the
> > migration source page to be freed at this put_page() (no pin remains.)
> > The refcount released here is from isolate_lru_page() in __soft_offline_page().
> > (the pin by get_any_page is released by put_hwpoison_page just after it.)
> > 
> > .. yes, doing something just after freeing page looks weird, but that's
> > how PageHWPoison flag works. IOW, many other page flags are maintained
> > only during one "allocate-free" life span, but PageHWPoison still does
> > its job beyond it.
> 
> But what prevents the page from being allocated again between put_page()
> and test_set_page_hwpoison()? In that case we would be marking page
> poisoned while still in use, which is the same as marking it while still
> in use after a failed migration?

Actually nothing prevents that race. But I think that the result of the race
is that the error page can be reused for allocation, which results in killing
processes at page fault time. Soft offline is kind of mild/precautious thing
(for correctable errors that don't require immediate handling), so killing
processes looks to me an overkill. And marking hwpoison means that we can no
longer do retry from userspace.

And another practical thing is the race with unpoison_memory() as described
in commit da1b13ccfbebe. unpoison_memory() properly works only for properly
poisoned pages, so doing unpoison for in-use hwpoisoned pages is fragile.
That's why I'd like to avoid setting PageHWPoison for in-use pages if possible.

> (Also, which part prevents pages with PageHWPoison to be allocated
> again, anyway? I can't find it and test_set_page_hwpoison() doesn't
> remove from buddy freelists).

check_new_page() in mm/page_alloc.c should prevent reallocation of PageHWPoison.
As you pointed out, memory error handler doens't remove it from buddy freelists.


BTW, it might be a bit off-topic, but recently I felt that check_new_page()
might be improvable, because when check_new_page() returns 1, the whole buddy
block (not only the bad page) seems to be leaked from buddy freelist.
For example, if thp (order 9) is requested, and PageHWPoison (or any other
types of bad pages) is found in an order 9 block, all 512 page are discarded.
Unpoison can't bring it back to buddy.
So, some code to split buddy block including bad page (and recovering code from
unpoison) might be helpful, although that's another story ...

Thanks,
Naoya Horiguchi

^ permalink raw reply

* [PATCH] virtio: fix "warning: ‘queue’ may be used uninitialized"
From: Jeff Mahoney @ 2016-04-04 18:14 UTC (permalink / raw)
  To: Michael S. Tsirkin, virtualization

This fixes the following warning:
drivers/virtio/virtio_ring.c:1032:5: warning: ‘queue’ may be used
uninitialized in this function

The conditions that govern when queue is set aren't apparent to gcc.

Setting queue = NULL clears the warning.

Signed-off-by: Jeff Mahoney <jeffm@suse.com>
---

 drivers/virtio/virtio_ring.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ -1006,7 +1006,7 @@ struct virtqueue *vring_create_virtqueue
 	const char *name)
 {
 	struct virtqueue *vq;
-	void *queue;
+	void *queue = NULL;
 	dma_addr_t dma_addr;
 	size_t queue_size_in_bytes;
 	struct vring vring;

-- 
Jeff Mahoney
SUSE Labs
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Vlastimil Babka @ 2016-04-04 14:46 UTC (permalink / raw)
  To: Naoya Horiguchi, Minchan Kim
  Cc: jlayton@poochiereds.net, Rik van Riel, YiPing Xu,
	aquini@redhat.com, rknize@motorola.com, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org, bfields@fieldses.org,
	linux-mm@kvack.org, Gioh Kim, Mel Gorman, Sangseok Lee,
	Andrew Morton, Joonsoo Kim, koct9i@gmail.com, Al Viro
In-Reply-To: <20160404044458.GA20250@hori1.linux.bs1.fc.nec.co.jp>

On 04/04/2016 06:45 AM, Naoya Horiguchi wrote:
> On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
>> Thanks for catching it, Vlastimil.
>> It was my mistake. But in this chance, I looked over hwpoison code and
>> I saw other places which increases num_poisoned_pages are successful
>> migration, already freed page and successful invalidated page.
>> IOW, they are already successful isolated page so I guess it should
>> increase the count when only successful migration is done?
> 
> Yes, that's right. When exiting with migration's failure, we shouldn't call
> test_set_page_hwpoison or num_poisoned_pages_inc, so current code checking
> (rc != -EAGAIN) is simply incorrect. Your change fixes the bug in memory
> error handling. Great!

Ah, I see, soft onlining works differently than I thought.

>> And when I read memory_failure, it bails out without killing if it
>> encounters HWPoisoned page so I think it's not for catching and
>> kill the poor proces.
>>
>>>
>>> Also (but not your fault) the put_page() preceding
>>> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
>>> pin we are releasing and which one we still have (hopefully? if I
>>> read description of da1b13ccfbebe right) otherwise it looks like
>>> doing something with a page that we just potentially freed.
>>
>> Yes, while I read the code, I had same question. I think the releasing
>> refcount is for get_any_page.
> 
> As the other callers of page migration do, soft_offline_page expects the
> migration source page to be freed at this put_page() (no pin remains.)
> The refcount released here is from isolate_lru_page() in __soft_offline_page().
> (the pin by get_any_page is released by put_hwpoison_page just after it.)
> 
> .. yes, doing something just after freeing page looks weird, but that's
> how PageHWPoison flag works. IOW, many other page flags are maintained
> only during one "allocate-free" life span, but PageHWPoison still does
> its job beyond it.

But what prevents the page from being allocated again between put_page()
and test_set_page_hwpoison()? In that case we would be marking page
poisoned while still in use, which is the same as marking it while still
in use after a failed migration?

(Also, which part prevents pages with PageHWPoison to be allocated
again, anyway? I can't find it and test_set_page_hwpoison() doesn't
remove from buddy freelists).

Thanks.

> As for commenting, this put_page() is called in any MIGRATEPAGE_SUCCESS
> case (regardless of callers), so what we can say here is "we free the
> source page here, bypassing LRU list" or something?
> 
> Thanks,
> Naoya Horiguchi
> 

^ permalink raw reply

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
From: Vlastimil Babka @ 2016-04-04 13:24 UTC (permalink / raw)
  To: Minchan Kim
  Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
	dri-devel, virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim,
	Mel Gorman, Sangseok Lee, Andrew Morton, Joonsoo Kim, koct9i,
	Al Viro
In-Reply-To: <20160404051225.GA6838@bbox>

On 04/04/2016 07:12 AM, Minchan Kim wrote:
> On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote:
>> Might have been better as a separate migration patch and then a
>> compaction patch. It's prefixed mm/compaction, but most changed are
>> in mm/migrate.c
>
> Indeed. The title is rather misleading but not sure it's a good idea
> to separate compaction and migration part.

Guess it's better to see the new functions together with its user after 
all, OK.

> I will just resend to change the tile from "mm/compaction" to
> "mm/migration".

OK!

>> Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
>> page->mapping->a_ops->isolate_page exists for PageMovable() pages.
>> What if it's a false positive on a PG_reclaim page? Can we rely on
>> PG_reclaim always (and without races) implying PageLRU() so that we
>> don't even attempt isolate_movable_page()?
>
> For now, we shouldn't have such a false positive because PageMovable
> checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable
> under PG_lock.
>
> But I read your question about user-mapped drvier pages so we cannot
> use _mapcount anymore so I will find another thing. A option is this.
>
> static inline int PageMovable(struct page *page)
> {
>          int ret = 0;
>          struct address_space *mapping;
>          struct address_space_operations *a_op;
>
>          if (!test_bit(PG_movable, &(page->flags))
>                  goto out;
>
>          mapping = page->mapping;
>          if (!mapping)
>                  goto out;
>
>          a_op = mapping->a_op;
>          if (!aop)
>                  goto out;
>          if (a_op->isolate_page)
>                  ret = 1;
> out:
>          return ret;
>
> }
>
> It works under PG_lock but with this, we need trylock_page to peek
> whether it's movable non-lru or not for scanning pfn.

Hm I hoped that with READ_ONCE() we could do the peek safely without 
trylock_page, if we use it only as a heuristic. But I guess it would 
require at least RCU-level protection of the 
page->mapping->a_op->isolate_page chain.

> For avoiding that, we need another function to peek which just checks
> PG_movable bit instead of all things.
>
>
> /*
>   * If @page_locked is false, we cannot guarantee page->mapping's stability
>   * so just the function checks with PG_movable which could be false positive
>   * so caller should check it again under PG_lock to check a_ops->isolate_page.
>   */
> static inline int PageMovable(struct page *page, bool page_locked)
> {
>          int ret = 0;
>          struct address_space *mapping;
>          struct address_space_operations *a_op;
>
>          if (!test_bit(PG_movable, &(page->flags))
>                  goto out;
>
>          if (!page_locked) {
>                  ret = 1;
>                  goto out;
>          }
>
>          mapping = page->mapping;
>          if (!mapping)
>                  goto out;
>
>          a_op = mapping->a_op;
>          if (!aop)
>                  goto out;
>          if (a_op->isolate_page)
>                  ret = 1;
> out:
>          return ret;
> }

I wouldn't put everything into single function, but create something 
like __PageMovable() just for the unlocked peek. Unlike the 
zone->lru_lock, we don't keep page_lock() across iterations in 
isolate_migratepages_block(), as obviously each page has different lock.
So the page_locked parameter would be always passed as constant, and at 
that point it's better to have separate functions.

So I guess the question is how many false positives from overlap with 
PG_reclaim the scanner will hit if we give up on 
PAGE_MOVABLE_MAPCOUNT_VALUE, as that will increase number of page locks 
just to realize that it's not actual PageMovable() page...

> Thanks for detail review, Vlastimil!
> I will resend new versions after vacation in this week.

You're welcome, great!

^ permalink raw reply

* Re: [PATCH v3 00/16] Support non-lru page migration
From: John Einar Reitan @ 2016-04-04 13:17 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton, Vlastimil Babka, Mel Gorman
In-Reply-To: <1459321935-3655-1-git-send-email-minchan@kernel.org>


[-- Attachment #1.1: Type: text/plain, Size: 1679 bytes --]

On Wed, Mar 30, 2016 at 04:11:59PM +0900, Minchan Kim wrote:
> Recently, I got many reports about perfermance degradation
> in embedded system(Android mobile phone, webOS TV and so on)
> and failed to fork easily.
> 
> The problem was fragmentation caused by zram and GPU driver
> pages. Their pages cannot be migrated so compaction cannot
> work well, either so reclaimer ends up shrinking all of working
> set pages. It made system very slow and even to fail to fork
> easily.
> 
> Other pain point is that they cannot work with CMA.
> Most of CMA memory space could be idle(ie, it could be used
> for movable pages unless driver is using) but if driver(i.e.,
> zram) cannot migrate his page, that memory space could be
> wasted. In our product which has big CMA memory, it reclaims
> zones too exccessively although there are lots of free space
> in CMA so system was very slow easily.
> 
> To solve these problem, this patch try to add facility to
> migrate non-lru pages via introducing new friend functions
> of migratepage in address_space_operation and new page flags.
> 
> 	(isolate_page, putback_page)
> 	(PG_movable, PG_isolated)
> 
> For details, please read description in
> "mm/compaction: support non-lru movable page migration".

Thanks, this mirrors what we see with the ARM Mali GPU drivers too.

One thing with the current design which worries me is the potential
for multiple calls due to many separated pages being migrated.
On GPUs (or any other device) which has an IOMMU and L2 cache, which
isn't coherent with the CPU, we must do L2 cache flush & invalidation
per page. I guess batching pages isn't easily possible?


[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 648 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
From: Vlastimil Babka @ 2016-04-04 13:09 UTC (permalink / raw)
  To: Minchan Kim
  Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Jonathan Corbet,
	Hugh Dickins, linux-kernel, virtualization, bfields, linux-mm,
	Gioh Kim, Mel Gorman, Sangseok Lee, Andrew Morton, Joonsoo Kim,
	koct9i, Al Viro
In-Reply-To: <20160404022552.GD6543@bbox>

On 04/04/2016 04:25 AM, Minchan Kim wrote:
>>
>> Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.
>>
>> In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned
>> about doing this in general under page_lock and suggested that each
>> user handles concurrent calls to isolate_page() internally. Might be
>> more generic that way, even if all current implementers will
>> actually use the page lock.
>
> We need PG_lock for two reasons.
>
> Firstly, it guarantees page's flags operation(i.e., PG_movable, PG_isolated)
> atomicity. Another thing is for stability for page->mapping->a_ops.
>
> For example,
>
> isolate_migratepages_block
>          if (PageMovable(page))
>                  isolate_movable_page
>                          get_page_unless_zero <--- 1
>                          trylock_page
>                          page->mapping->a_ops->isolate_page <--- 2
>
> Between 1 and 2, driver can nullify page->mapping so we need PG_lock

Hmm I see, that really doesn't seem easily solvable without page_lock.
My idea is that compaction code would just check PageMovable() and 
PageIsolated() to find a candidate. page->mapping->a_ops->isolate_page 
would do the driver-specific necessary locking, revalidate if the page 
state and succeed isolation, or fail. It would need to handle the 
possibility that the page already doesn't belong to the mapping, which 
is probably not a problem. But what if the driver is a module that was 
already unloaded, and even though we did NULL-check each part from page 
to isolate_page, it points to a function that's already gone? That would 
need some extra handling to prevent that, hm...

>>
>>> +2. Migration
>>> +
>>> +After successful isolation, VM calls migratepage. The migratepage's goal is
>>> +to move content of the old page to new page and set up struct page fields
>>> +of new page. If migration is successful, subsystem should release old page's
>>> +refcount to free. Keep in mind that subsystem should clear PG_movable and
>>> +PG_isolated before releasing the refcount.  If everything are done, user
>>> +should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
>>> +at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
>>> +migration because VM interprets -EAGAIN as "temporal migration failure".
>>> +
>>> +3. Putback
>>> +
>>> +If migration was unsuccessful, VM calls putback_page. The subsystem should
>>> +insert isolated page to own data structure again if it has. And subsystem
>>> +should clear PG_isolated which was marked in isolation step.
>>> +
>>> +Note about releasing page:
>>> +
>>> +Subsystem can release pages whenever it want but if it releses the page
>>> +which is already isolated, it should clear PG_isolated but doesn't touch
>>> +PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
>>> +his job done. Otherweise, subsystem should clear both page flags before
>>> +releasing the page.
>>
>> I don't understand this right now. But maybe I will get it after
>> reading the patches and suggest some improved wording here.
>
> I will try to explain why such rule happens in there.
>
> The problem is that put_page is aware of PageLRU. So, if someone releases
> last refcount of LRU page, __put_page checks PageLRU and then, clear the
> flags and detatch the page in LRU list(i.e., data structure).
> But in case of driver page, data structure like LRU among drivers is not only one.
> IOW, we should add following code in put_page to handle various requirements
> of driver page.
>
> void __put_page(struct page *page)
> {
>          if (PageMovable(page)) {
>                  /*
>                   * It will tity up driver's data structure like LRU
>                   * and reset page's flags. And it should be atomic
>                   * and always successful
>                   */
>                  page->put(page);
>                  __ClearPageMovable(page);
>          } else if (PageCompound(page))
>                  __put_compound_page(page);
>          else
>                  __put_single_page(page);
>
> }
>
> I'd like to avoid add new branch for not popular job in put_page which is hot.
> (Might change in future but not popular at the moment)
> So, rule of driver is as follows.
>
> When the driver releases the page and he found the page is PG_isolated,
> he should unmark only PG_isolated, not PG_movable so migration side of
> VM can catch it up "Hmm, the isolated non-lru page doesn't have PG_isolated
> any more. It means drivers releases the page. So, let's put the page
> instead of putback operation".
>
> When the driver releases the page and he doesn't see PG_isolated mark
> of the page, driver should reset both PG_isolated and PG_movable.

Yeah think I understand now, thanks for the explanation. But since I 
found the "freeing isolated page" part to be racy in the 02/16 
subthread, it might be premature now to improve the wording now :/

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Minchan Kim @ 2016-04-04  6:01 UTC (permalink / raw)
  To: Balbir Singh
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, Naoya Horiguchi, jlayton,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <57020177.60006@gmail.com>

On Mon, Apr 04, 2016 at 03:53:59PM +1000, Balbir Singh wrote:
> 
> 
> On 30/03/16 18:12, Minchan Kim wrote:
> > Procedure of page migration is as follows:
> >
> > First of all, it should isolate a page from LRU and try to
> > migrate the page. If it is successful, it releases the page
> > for freeing. Otherwise, it should put the page back to LRU
> > list.
> >
> > For LRU pages, we have used putback_lru_page for both freeing
> > and putback to LRU list. It's okay because put_page is aware of
> > LRU list so if it releases last refcount of the page, it removes
> > the page from LRU list. However, It makes unnecessary operations
> > (e.g., lru_cache_add, pagevec and flags operations. It would be
> > not significant but no worth to do) and harder to support new
> > non-lru page migration because put_page isn't aware of non-lru
> > page's data structure.
> >
> > To solve the problem, we can add new hook in put_page with
> > PageMovable flags check but it can increase overhead in
> > hot path and needs new locking scheme to stabilize the flag check
> > with put_page.
> >
> > So, this patch cleans it up to divide two semantic(ie, put and putback).
> > If migration is successful, use put_page instead of putback_lru_page and
> > use putback_lru_page only on failure. That makes code more readable
> > and doesn't add overhead in put_page.
> So effectively when we return from unmap_and_move() the page is either
> put_page or putback_lru_page() and the page is gone from under us.

I didn't get your point.
Could you elaborate it more what you want to say about this patch?

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Balbir Singh @ 2016-04-04  5:53 UTC (permalink / raw)
  To: Minchan Kim, Andrew Morton
  Cc: Rik van Riel, YiPing Xu, aquini, rknize, Sergey Senozhatsky,
	Chan Gyun Jeong, Hugh Dickins, linux-kernel, Al Viro,
	virtualization, bfields, linux-mm, Gioh Kim, koct9i, Sangseok Lee,
	jlayton, Naoya Horiguchi, Joonsoo Kim, Vlastimil Babka,
	Mel Gorman
In-Reply-To: <1459321935-3655-2-git-send-email-minchan@kernel.org>



On 30/03/16 18:12, Minchan Kim wrote:
> Procedure of page migration is as follows:
>
> First of all, it should isolate a page from LRU and try to
> migrate the page. If it is successful, it releases the page
> for freeing. Otherwise, it should put the page back to LRU
> list.
>
> For LRU pages, we have used putback_lru_page for both freeing
> and putback to LRU list. It's okay because put_page is aware of
> LRU list so if it releases last refcount of the page, it removes
> the page from LRU list. However, It makes unnecessary operations
> (e.g., lru_cache_add, pagevec and flags operations. It would be
> not significant but no worth to do) and harder to support new
> non-lru page migration because put_page isn't aware of non-lru
> page's data structure.
>
> To solve the problem, we can add new hook in put_page with
> PageMovable flags check but it can increase overhead in
> hot path and needs new locking scheme to stabilize the flag check
> with put_page.
>
> So, this patch cleans it up to divide two semantic(ie, put and putback).
> If migration is successful, use put_page instead of putback_lru_page and
> use putback_lru_page only on failure. That makes code more readable
> and doesn't add overhead in put_page.
So effectively when we return from unmap_and_move() the page is either
put_page or putback_lru_page() and the page is gone from under us.

^ permalink raw reply

* Re: mm/hwpoison: fix wrong num_poisoned_pages account
From: Naoya Horiguchi @ 2016-04-04  5:46 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini@redhat.com, rknize@motorola.com,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins,
	linux-kernel@vger.kernel.org, Al Viro,
	virtualization@lists.linux-foundation.org, bfields@fieldses.org,
	linux-mm@kvack.org, Gioh Kim, koct9i@gmail.com, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton@poochiereds.net,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160404053830.GA7042@bbox>

On Mon, Apr 04, 2016 at 02:38:30PM +0900, Minchan Kim wrote:
> 
> Forking new thread,
> 
> Hello Naoya,
> 
> On Mon, Apr 04, 2016 at 04:45:12AM +0000, Naoya Horiguchi wrote:
> > On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
> > > On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> > > > On 03/30/2016 09:12 AM, Minchan Kim wrote:
> > > > >Procedure of page migration is as follows:
> > > > >
> > > > >First of all, it should isolate a page from LRU and try to
> > > > >migrate the page. If it is successful, it releases the page
> > > > >for freeing. Otherwise, it should put the page back to LRU
> > > > >list.
> > > > >
> > > > >For LRU pages, we have used putback_lru_page for both freeing
> > > > >and putback to LRU list. It's okay because put_page is aware of
> > > > >LRU list so if it releases last refcount of the page, it removes
> > > > >the page from LRU list. However, It makes unnecessary operations
> > > > >(e.g., lru_cache_add, pagevec and flags operations. It would be
> > > > >not significant but no worth to do) and harder to support new
> > > > >non-lru page migration because put_page isn't aware of non-lru
> > > > >page's data structure.
> > > > >
> > > > >To solve the problem, we can add new hook in put_page with
> > > > >PageMovable flags check but it can increase overhead in
> > > > >hot path and needs new locking scheme to stabilize the flag check
> > > > >with put_page.
> > > > >
> > > > >So, this patch cleans it up to divide two semantic(ie, put and putback).
> > > > >If migration is successful, use put_page instead of putback_lru_page and
> > > > >use putback_lru_page only on failure. That makes code more readable
> > > > >and doesn't add overhead in put_page.
> > > > >
> > > > >Comment from Vlastimil
> > > > >"Yeah, and compaction (perhaps also other migration users) has to drain
> > > > >the lru pvec... Getting rid of this stuff is worth even by itself."
> > > > >
> > > > >Cc: Mel Gorman <mgorman@suse.de>
> > > > >Cc: Hugh Dickins <hughd@google.com>
> > > > >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > > >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > > >Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > > 
> > > > [...]
> > > > 
> > > > >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> > > > >  		list_del(&page->lru);
> > > > >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> > > > >  				page_is_file_cache(page));
> > > > >-		/* Soft-offlined page shouldn't go through lru cache list */
> > > > >+	}
> > > > >+
> > > > >+	/*
> > > > >+	 * If migration is successful, drop the reference grabbed during
> > > > >+	 * isolation. Otherwise, restore the page to LRU list unless we
> > > > >+	 * want to retry.
> > > > >+	 */
> > > > >+	if (rc == MIGRATEPAGE_SUCCESS) {
> > > > >+		put_page(page);
> > > > >  		if (reason == MR_MEMORY_FAILURE) {
> > > > >-			put_page(page);
> > > > >  			if (!test_set_page_hwpoison(page))
> > > > >  				num_poisoned_pages_inc();
> > > > >-		} else
> > > > >+		}
> > > > 
> > > > Hmm, I didn't notice it previously, or it's due to rebasing, but it
> > > > seems that you restricted the memory failure handling (i.e. setting
> > > > hwpoison) to MIGRATE_SUCCESS, while previously it was done for all
> > > > non-EAGAIN results. I think that goes against the intention of
> > > > hwpoison, which is IIRC to catch and kill the poor process that
> > > > still uses the page?
> > > 
> > > That's why I Cc'ed Naoya Horiguchi to catch things I might make
> > > mistake.
> > > 
> > > Thanks for catching it, Vlastimil.
> > > It was my mistake. But in this chance, I looked over hwpoison code and
> > > I saw other places which increases num_poisoned_pages are successful
> > > migration, already freed page and successful invalidated page.
> > > IOW, they are already successful isolated page so I guess it should
> > > increase the count when only successful migration is done?
> > 
> > Yes, that's right. When exiting with migration's failure, we shouldn't call
> > test_set_page_hwpoison or num_poisoned_pages_inc, so current code checking
> > (rc != -EAGAIN) is simply incorrect. Your change fixes the bug in memory
> > error handling. Great!
> 
> Thanks for confirming, Naoya.
> I will send it as separate patch with Ccing -stable.
> 
> > 
> > > And when I read memory_failure, it bails out without killing if it
> > > encounters HWPoisoned page so I think it's not for catching and
> > > kill the poor proces.
> > >
> > > > 
> > > > Also (but not your fault) the put_page() preceding
> > > > test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> > > > pin we are releasing and which one we still have (hopefully? if I
> > > > read description of da1b13ccfbebe right) otherwise it looks like
> > > > doing something with a page that we just potentially freed.
> > >
> > > Yes, while I read the code, I had same question. I think the releasing
> > > refcount is for get_any_page.
> > 
> > As the other callers of page migration do, soft_offline_page expects the
> > migration source page to be freed at this put_page() (no pin remains.)
> > The refcount released here is from isolate_lru_page() in __soft_offline_page().
> > (the pin by get_any_page is released by put_hwpoison_page just after it.)
> > 
> > .. yes, doing something just after freeing page looks weird, but that's
> > how PageHWPoison flag works. IOW, many other page flags are maintained
> > only during one "allocate-free" life span, but PageHWPoison still does
> > its job beyond it.
> 
> Got it. Thanks for the clarification.
> 
> > 
> > As for commenting, this put_page() is called in any MIGRATEPAGE_SUCCESS
> > case (regardless of callers), so what we can say here is "we free the
> > source page here, bypassing LRU list" or something?
> 
> Naoya, I wrote up the patch but hard to say I write up correct description.
> Could you review this?
> 
> Thankks.
> 
> From 916b0d5960169e93d1fa3c8b7bdb03fe2b86b455 Mon Sep 17 00:00:00 2001
> From: Minchan Kim <minchan@kernel.org>
> Date: Mon, 4 Apr 2016 14:26:30 +0900
> Subject: [PATCH] mm/hwpoison: fix wrong num_poisoned_pages account
> 
> Currently, migartion code increses num_poisoned_pages on *failed*

             migration ...  increase

> migration page as well as successfully migrated one at the trial
> of memory-failture. It will make the stat wrong.

            failure

> As well, it marks the page as PG_HWPoison even if the migration
> trial failed. It would make we cannot recover the corrupted page
> using memory-failure facility.
> 
> This patches fixes it.
> 
> Cc: stable@vger.kernel.org
> Signed-off-by: Minchan Kim <minchan@kernel.org>

Thank you for the fix.

Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

> ---
>  mm/migrate.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6c822a7b27e0..f9dfb18a4eba 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -975,7 +975,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
>  		dec_zone_page_state(page, NR_ISOLATED_ANON +
>  				page_is_file_cache(page));
>  		/* Soft-offlined page shouldn't go through lru cache list */
> -		if (reason == MR_MEMORY_FAILURE) {
> +		if (reason == MR_MEMORY_FAILURE && rc == MIGRATEPAGE_SUCCESS) {
> +			/*
> +			 * With this release, we free successfully migrated
> +			 * page and set PG_HWPoison on just freed page
> +			 * intentionally. Although it's rather weird, it's how
> +			 * HWPoison flag works at the moment.
> +			 */
>  			put_page(page);
>  			if (!test_set_page_hwpoison(page))
>  				num_poisoned_pages_inc();
> -- 
> 1.9.1
> 

^ permalink raw reply

* mm/hwpoison: fix wrong num_poisoned_pages account
From: Minchan Kim @ 2016-04-04  5:38 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Rik van Riel, YiPing Xu, aquini@redhat.com, rknize@motorola.com,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins,
	linux-kernel@vger.kernel.org, Al Viro,
	virtualization@lists.linux-foundation.org, bfields@fieldses.org,
	linux-mm@kvack.org, Gioh Kim, koct9i@gmail.com, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton@poochiereds.net,
	Vlastimil Babka, Mel Gorman


Forking new thread,

Hello Naoya,

On Mon, Apr 04, 2016 at 04:45:12AM +0000, Naoya Horiguchi wrote:
> On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
> > On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> > > On 03/30/2016 09:12 AM, Minchan Kim wrote:
> > > >Procedure of page migration is as follows:
> > > >
> > > >First of all, it should isolate a page from LRU and try to
> > > >migrate the page. If it is successful, it releases the page
> > > >for freeing. Otherwise, it should put the page back to LRU
> > > >list.
> > > >
> > > >For LRU pages, we have used putback_lru_page for both freeing
> > > >and putback to LRU list. It's okay because put_page is aware of
> > > >LRU list so if it releases last refcount of the page, it removes
> > > >the page from LRU list. However, It makes unnecessary operations
> > > >(e.g., lru_cache_add, pagevec and flags operations. It would be
> > > >not significant but no worth to do) and harder to support new
> > > >non-lru page migration because put_page isn't aware of non-lru
> > > >page's data structure.
> > > >
> > > >To solve the problem, we can add new hook in put_page with
> > > >PageMovable flags check but it can increase overhead in
> > > >hot path and needs new locking scheme to stabilize the flag check
> > > >with put_page.
> > > >
> > > >So, this patch cleans it up to divide two semantic(ie, put and putback).
> > > >If migration is successful, use put_page instead of putback_lru_page and
> > > >use putback_lru_page only on failure. That makes code more readable
> > > >and doesn't add overhead in put_page.
> > > >
> > > >Comment from Vlastimil
> > > >"Yeah, and compaction (perhaps also other migration users) has to drain
> > > >the lru pvec... Getting rid of this stuff is worth even by itself."
> > > >
> > > >Cc: Mel Gorman <mgorman@suse.de>
> > > >Cc: Hugh Dickins <hughd@google.com>
> > > >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > > >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > > >Signed-off-by: Minchan Kim <minchan@kernel.org>
> > > 
> > > [...]
> > > 
> > > >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> > > >  		list_del(&page->lru);
> > > >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> > > >  				page_is_file_cache(page));
> > > >-		/* Soft-offlined page shouldn't go through lru cache list */
> > > >+	}
> > > >+
> > > >+	/*
> > > >+	 * If migration is successful, drop the reference grabbed during
> > > >+	 * isolation. Otherwise, restore the page to LRU list unless we
> > > >+	 * want to retry.
> > > >+	 */
> > > >+	if (rc == MIGRATEPAGE_SUCCESS) {
> > > >+		put_page(page);
> > > >  		if (reason == MR_MEMORY_FAILURE) {
> > > >-			put_page(page);
> > > >  			if (!test_set_page_hwpoison(page))
> > > >  				num_poisoned_pages_inc();
> > > >-		} else
> > > >+		}
> > > 
> > > Hmm, I didn't notice it previously, or it's due to rebasing, but it
> > > seems that you restricted the memory failure handling (i.e. setting
> > > hwpoison) to MIGRATE_SUCCESS, while previously it was done for all
> > > non-EAGAIN results. I think that goes against the intention of
> > > hwpoison, which is IIRC to catch and kill the poor process that
> > > still uses the page?
> > 
> > That's why I Cc'ed Naoya Horiguchi to catch things I might make
> > mistake.
> > 
> > Thanks for catching it, Vlastimil.
> > It was my mistake. But in this chance, I looked over hwpoison code and
> > I saw other places which increases num_poisoned_pages are successful
> > migration, already freed page and successful invalidated page.
> > IOW, they are already successful isolated page so I guess it should
> > increase the count when only successful migration is done?
> 
> Yes, that's right. When exiting with migration's failure, we shouldn't call
> test_set_page_hwpoison or num_poisoned_pages_inc, so current code checking
> (rc != -EAGAIN) is simply incorrect. Your change fixes the bug in memory
> error handling. Great!

Thanks for confirming, Naoya.
I will send it as separate patch with Ccing -stable.

> 
> > And when I read memory_failure, it bails out without killing if it
> > encounters HWPoisoned page so I think it's not for catching and
> > kill the poor proces.
> >
> > > 
> > > Also (but not your fault) the put_page() preceding
> > > test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> > > pin we are releasing and which one we still have (hopefully? if I
> > > read description of da1b13ccfbebe right) otherwise it looks like
> > > doing something with a page that we just potentially freed.
> >
> > Yes, while I read the code, I had same question. I think the releasing
> > refcount is for get_any_page.
> 
> As the other callers of page migration do, soft_offline_page expects the
> migration source page to be freed at this put_page() (no pin remains.)
> The refcount released here is from isolate_lru_page() in __soft_offline_page().
> (the pin by get_any_page is released by put_hwpoison_page just after it.)
> 
> .. yes, doing something just after freeing page looks weird, but that's
> how PageHWPoison flag works. IOW, many other page flags are maintained
> only during one "allocate-free" life span, but PageHWPoison still does
> its job beyond it.

Got it. Thanks for the clarification.

> 
> As for commenting, this put_page() is called in any MIGRATEPAGE_SUCCESS
> case (regardless of callers), so what we can say here is "we free the
> source page here, bypassing LRU list" or something?

Naoya, I wrote up the patch but hard to say I write up correct description.
Could you review this?

Thankks.

From 916b0d5960169e93d1fa3c8b7bdb03fe2b86b455 Mon Sep 17 00:00:00 2001
From: Minchan Kim <minchan@kernel.org>
Date: Mon, 4 Apr 2016 14:26:30 +0900
Subject: [PATCH] mm/hwpoison: fix wrong num_poisoned_pages account

Currently, migartion code increses num_poisoned_pages on *failed*
migration page as well as successfully migrated one at the trial
of memory-failture. It will make the stat wrong.
As well, it marks the page as PG_HWPoison even if the migration
trial failed. It would make we cannot recover the corrupted page
using memory-failure facility.

This patches fixes it.

Cc: stable@vger.kernel.org
Signed-off-by: Minchan Kim <minchan@kernel.org>
---
 mm/migrate.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/migrate.c b/mm/migrate.c
index 6c822a7b27e0..f9dfb18a4eba 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -975,7 +975,13 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
 		/* Soft-offlined page shouldn't go through lru cache list */
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason == MR_MEMORY_FAILURE && rc == MIGRATEPAGE_SUCCESS) {
+			/*
+			 * With this release, we free successfully migrated
+			 * page and set PG_HWPoison on just freed page
+			 * intentionally. Although it's rather weird, it's how
+			 * HWPoison flag works at the moment.
+			 */
 			put_page(page);
 			if (!test_set_page_hwpoison(page))
 				num_poisoned_pages_inc();
-- 
1.9.1

^ permalink raw reply related

* Re: [PATCH v3 02/16] mm/compaction: support non-lru movable page migration
From: Minchan Kim @ 2016-04-04  5:12 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
	dri-devel, virtualization, bfields, linux-mm, Gioh Kim, Gioh Kim,
	Mel Gorman, Sangseok Lee, Andrew Morton, Joonsoo Kim, koct9i,
	Al Viro
In-Reply-To: <56FEE82A.30602@suse.cz>

On Fri, Apr 01, 2016 at 11:29:14PM +0200, Vlastimil Babka wrote:
> Might have been better as a separate migration patch and then a
> compaction patch. It's prefixed mm/compaction, but most changed are
> in mm/migrate.c

Indeed. The title is rather misleading but not sure it's a good idea
to separate compaction and migration part.
I will just resend to change the tile from "mm/compaction" to
"mm/migration".

> 
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >We have allowed migration for only LRU pages until now and it was
> >enough to make high-order pages. But recently, embedded system(e.g.,
> >webOS, android) uses lots of non-movable pages(e.g., zram, GPU memory)
> >so we have seen several reports about troubles of small high-order
> >allocation. For fixing the problem, there were several efforts
> >(e,g,. enhance compaction algorithm, SLUB fallback to 0-order page,
> >reserved memory, vmalloc and so on) but if there are lots of
> >non-movable pages in system, their solutions are void in the long run.
> >
> >So, this patch is to support facility to change non-movable pages
> >with movable. For the feature, this patch introduces functions related
> >to migration to address_space_operations as well as some page flags.
> >
> >Basically, this patch supports two page-flags and two functions related
> >to page migration. The flag and page->mapping stability are protected
> >by PG_lock.
> >
> >	PG_movable
> >	PG_isolated
> >
> >	bool (*isolate_page) (struct page *, isolate_mode_t);
> >	void (*putback_page) (struct page *);
> >
> >Duty of subsystem want to make their pages as migratable are
> >as follows:
> >
> >1. It should register address_space to page->mapping then mark
> >the page as PG_movable via __SetPageMovable.
> >
> >2. It should mark the page as PG_isolated via SetPageIsolated
> >if isolation is sucessful and return true.
> 
> Ah another thing to document (especially in the comments/Doc) is
> that the subsystem must not expect anything to survive in page.lru
> (or fields that union it) after having isolated successfully.

Indeed. I surprised I didn't miss because I wrote it down somewhere
but might miss it during rebase.
I will fix it.

> 
> >3. If migration is successful, it should clear PG_isolated and
> >PG_movable of the page for free preparation then release the
> >reference of the page to free.
> >
> >4. If migration fails, putback function of subsystem should
> >clear PG_isolated via ClearPageIsolated.
> >
> >5. If a subsystem want to release isolated page, it should
> >clear PG_isolated but not PG_movable. Instead, VM will do it.
> 
> Under lock? Or just with ClearPageIsolated?

Both:
ClearPageIsolated undert PG_lock.

Yes, it's better to change ClearPageIsolated to __ClearPageIsolated.

> 
> >Cc: Vlastimil Babka <vbabka@suse.cz>
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: dri-devel@lists.freedesktop.org
> >Cc: virtualization@lists.linux-foundation.org
> >Signed-off-by: Gioh Kim <gurugio@hanmail.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  Documentation/filesystems/Locking      |   4 +
> >  Documentation/filesystems/vfs.txt      |   5 +
> >  fs/proc/page.c                         |   3 +
> >  include/linux/fs.h                     |   2 +
> >  include/linux/migrate.h                |   2 +
> >  include/linux/page-flags.h             |  31 ++++++
> >  include/uapi/linux/kernel-page-flags.h |   1 +
> >  mm/compaction.c                        |  14 ++-
> >  mm/migrate.c                           | 174 +++++++++++++++++++++++++++++----
> >  9 files changed, 217 insertions(+), 19 deletions(-)
> >
> >diff --git a/Documentation/filesystems/Locking b/Documentation/filesystems/Locking
> >index 619af9bfdcb3..0bb79560abb3 100644
> >--- a/Documentation/filesystems/Locking
> >+++ b/Documentation/filesystems/Locking
> >@@ -195,7 +195,9 @@ unlocks and drops the reference.
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	int (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	int (*migratepage)(struct address_space *, struct page *, struct page *);
> >+	void (*putback_page) (struct page *);
> >  	int (*launder_page)(struct page *);
> >  	int (*is_partially_uptodate)(struct page *, unsigned long, unsigned long);
> >  	int (*error_remove_page)(struct address_space *, struct page *);
> >@@ -219,7 +221,9 @@ invalidatepage:		yes
> >  releasepage:		yes
> >  freepage:		yes
> >  direct_IO:
> >+isolate_page:		yes
> >  migratepage:		yes (both)
> >+putback_page:		yes
> >  launder_page:		yes
> >  is_partially_uptodate:	yes
> >  error_remove_page:	yes
> >diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> >index b02a7d598258..4c1b6c3b4bc8 100644
> >--- a/Documentation/filesystems/vfs.txt
> >+++ b/Documentation/filesystems/vfs.txt
> >@@ -592,9 +592,14 @@ struct address_space_operations {
> >  	int (*releasepage) (struct page *, int);
> >  	void (*freepage)(struct page *);
> >  	ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter, loff_t offset);
> >+	/* isolate a page for migration */
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >  	/* migrate the contents of a page to the specified target */
> >  	int (*migratepage) (struct page *, struct page *);
> >+	/* put the page back to right list */
> 
> ... "after a failed migration" ?

Better.

> 
> >+	void (*putback_page) (struct page *);
> >  	int (*launder_page) (struct page *);
> >+
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >  	void (*is_dirty_writeback) (struct page *, bool *, bool *);
> >diff --git a/fs/proc/page.c b/fs/proc/page.c
> >index 3ecd445e830d..ce3d08a4ad8d 100644
> >--- a/fs/proc/page.c
> >+++ b/fs/proc/page.c
> >@@ -157,6 +157,9 @@ u64 stable_page_flags(struct page *page)
> >  	if (page_is_idle(page))
> >  		u |= 1 << KPF_IDLE;
> >
> >+	if (PageMovable(page))
> >+		u |= 1 << KPF_MOVABLE;
> >+
> >  	u |= kpf_copy_bit(k, KPF_LOCKED,	PG_locked);
> >
> >  	u |= kpf_copy_bit(k, KPF_SLAB,		PG_slab);
> >diff --git a/include/linux/fs.h b/include/linux/fs.h
> >index da9e67d937e5..36f2d610e7a8 100644
> >--- a/include/linux/fs.h
> >+++ b/include/linux/fs.h
> >@@ -401,6 +401,8 @@ struct address_space_operations {
> >  	 */
> >  	int (*migratepage) (struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >+	bool (*isolate_page)(struct page *, isolate_mode_t);
> >+	void (*putback_page)(struct page *);
> >  	int (*launder_page) (struct page *);
> >  	int (*is_partially_uptodate) (struct page *, unsigned long,
> >  					unsigned long);
> >diff --git a/include/linux/migrate.h b/include/linux/migrate.h
> >index 9b50325e4ddf..404fbfefeb33 100644
> >--- a/include/linux/migrate.h
> >+++ b/include/linux/migrate.h
> >@@ -37,6 +37,8 @@ extern int migrate_page(struct address_space *,
> >  			struct page *, struct page *, enum migrate_mode);
> >  extern int migrate_pages(struct list_head *l, new_page_t new, free_page_t free,
> >  		unsigned long private, enum migrate_mode mode, int reason);
> >+extern bool isolate_movable_page(struct page *page, isolate_mode_t mode);
> >+extern void putback_movable_page(struct page *page);
> >
> >  extern int migrate_prep(void);
> >  extern int migrate_prep_local(void);
> >diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> >index f4ed4f1b0c77..77ebf8fdbc6e 100644
> >--- a/include/linux/page-flags.h
> >+++ b/include/linux/page-flags.h
> >@@ -129,6 +129,10 @@ enum pageflags {
> >
> >  	/* Compound pages. Stored in first tail page's flags */
> >  	PG_double_map = PG_private_2,
> >+
> >+	/* non-lru movable pages */
> >+	PG_movable = PG_reclaim,
> >+	PG_isolated = PG_owner_priv_1,
> 
> Documentation should probably state that these fields alias and
> subsystem supporting the movable pages shouldn't use them elsewhere.

Yeb.

> 
> Also I'm a bit uncomfortable how isolate_movable_page() blindly expects that
> page->mapping->a_ops->isolate_page exists for PageMovable() pages.
> What if it's a false positive on a PG_reclaim page? Can we rely on
> PG_reclaim always (and without races) implying PageLRU() so that we
> don't even attempt isolate_movable_page()?

For now, we shouldn't have such a false positive because PageMovable
checks page->_mapcount == PAGE_MOVABLE_MAPCOUNT_VALUE as well as PG_movable
under PG_lock.

But I read your question about user-mapped drvier pages so we cannot
use _mapcount anymore so I will find another thing. A option is this.

static inline int PageMovable(struct page *page)
{
        int ret = 0;
        struct address_space *mapping;
        struct address_space_operations *a_op;

        if (!test_bit(PG_movable, &(page->flags))
                goto out;

        mapping = page->mapping;
        if (!mapping)
                goto out;

        a_op = mapping->a_op;
        if (!aop)
                goto out;
        if (a_op->isolate_page)
                ret = 1;
out:
        return ret;

}

It works under PG_lock but with this, we need trylock_page to peek
whether it's movable non-lru or not for scanning pfn.
For avoiding that, we need another function to peek which just checks
PG_movable bit instead of all things.


/*
 * If @page_locked is false, we cannot guarantee page->mapping's stability
 * so just the function checks with PG_movable which could be false positive
 * so caller should check it again under PG_lock to check a_ops->isolate_page.
 */
static inline int PageMovable(struct page *page, bool page_locked)
{
        int ret = 0;
        struct address_space *mapping;
        struct address_space_operations *a_op;

        if (!test_bit(PG_movable, &(page->flags))
                goto out;

        if (!page_locked) {
                ret = 1;
                goto out;
        }

        mapping = page->mapping;
        if (!mapping)
                goto out;

        a_op = mapping->a_op;
        if (!aop)
                goto out;
        if (a_op->isolate_page)
                ret = 1;
out:
        return ret;
}

> 
> >  };
> >
> >  #ifndef __GENERATING_BOUNDS_H
> >@@ -614,6 +618,33 @@ static inline void __ClearPageBalloon(struct page *page)
> >  	atomic_set(&page->_mapcount, -1);
> >  }
> >
> >+#define PAGE_MOVABLE_MAPCOUNT_VALUE (-255)
> 
> IIRC this was what Gioh's previous attempts used instead of
> PG_movable? Is it still needed? Doesn't it prevent a driver

It needs to avoid false positive as I said.

> providing movable *and* mapped pages?

Absolutely true. I will rethink about it.

> If it's to distinguish the PG_reclaim alias that I mention above, it
> seems like an overkill to me. Why would be need both special
> mapcount value and a flag? Checking that
> page->mapping->a_ops->isolate_page exists before calling it should
> be enough to resolve the ambiguity?

As I mentioned, using a_ops->isolate_page needs to be done under PG_lock.
And the idea I suggested above will work, I guess.
I will try it.

> 
> >+
> >+static inline int PageMovable(struct page *page)
> >+{
> >+	return ((test_bit(PG_movable, &(page)->flags) &&
> >+		atomic_read(&page->_mapcount) == PAGE_MOVABLE_MAPCOUNT_VALUE)
> >+		|| PageBalloon(page));
> >+}
> >+
> >+/* Caller should hold a PG_lock */
> >+static inline void __SetPageMovable(struct page *page,
> >+				struct address_space *mapping)
> >+{
> >+	page->mapping = mapping;
> >+	__set_bit(PG_movable, &page->flags);
> >+	atomic_set(&page->_mapcount, PAGE_MOVABLE_MAPCOUNT_VALUE);
> >+}
> >+
> >+static inline void __ClearPageMovable(struct page *page)
> >+{
> >+	atomic_set(&page->_mapcount, -1);
> >+	__clear_bit(PG_movable, &(page)->flags);
> >+	page->mapping = NULL;
> >+}
> >+
> >+PAGEFLAG(Isolated, isolated, PF_ANY);
> >+
> >  /*
> >   * If network-based swap is enabled, sl*b must keep track of whether pages
> >   * were allocated from pfmemalloc reserves.
> >diff --git a/include/uapi/linux/kernel-page-flags.h b/include/uapi/linux/kernel-page-flags.h
> >index 5da5f8751ce7..a184fd2434fa 100644
> >--- a/include/uapi/linux/kernel-page-flags.h
> >+++ b/include/uapi/linux/kernel-page-flags.h
> >@@ -34,6 +34,7 @@
> >  #define KPF_BALLOON		23
> >  #define KPF_ZERO_PAGE		24
> >  #define KPF_IDLE		25
> >+#define KPF_MOVABLE		26
> >
> >
> >  #endif /* _UAPILINUX_KERNEL_PAGE_FLAGS_H */
> >diff --git a/mm/compaction.c b/mm/compaction.c
> >index ccf97b02b85f..7557aedddaee 100644
> >--- a/mm/compaction.c
> >+++ b/mm/compaction.c
> >@@ -703,7 +703,7 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >
> >  		/*
> >  		 * Check may be lockless but that's ok as we recheck later.
> >-		 * It's possible to migrate LRU pages and balloon pages
> >+		 * It's possible to migrate LRU and movable kernel pages.
> >  		 * Skip any other type of page
> >  		 */
> >  		is_lru = PageLRU(page);
> >@@ -714,6 +714,18 @@ isolate_migratepages_block(struct compact_control *cc, unsigned long low_pfn,
> >  					goto isolate_success;
> >  				}
> >  			}
> >+
> >+			if (unlikely(PageMovable(page)) &&
> >+					!PageIsolated(page)) {
> >+				if (locked) {
> >+					spin_unlock_irqrestore(&zone->lru_lock,
> >+									flags);
> >+					locked = false;
> >+				}
> >+
> >+				if (isolate_movable_page(page, isolate_mode))
> >+					goto isolate_success;
> >+			}
> >  		}
> >
> >  		/*
> >diff --git a/mm/migrate.c b/mm/migrate.c
> >index 53529c805752..b56bf2b3fe8c 100644
> >--- a/mm/migrate.c
> >+++ b/mm/migrate.c
> >@@ -73,6 +73,85 @@ int migrate_prep_local(void)
> >  	return 0;
> >  }
> >
> >+bool isolate_movable_page(struct page *page, isolate_mode_t mode)
> >+{
> >+	bool ret = false;
> 
> Maintaining "ret" seems useless here. All the "goto out*" statements
> are executed only when ret is false, and ret == true is returned by
> a different return.

Yeb. Will change.

> 
> >+
> >+	/*
> >+	 * Avoid burning cycles with pages that are yet under __free_pages(),
> >+	 * or just got freed under us.
> >+	 *
> >+	 * In case we 'win' a race for a movable page being freed under us and
> >+	 * raise its refcount preventing __free_pages() from doing its job
> >+	 * the put_page() at the end of this block will take care of
> >+	 * release this page, thus avoiding a nasty leakage.
> >+	 */
> >+	if (unlikely(!get_page_unless_zero(page)))
> >+		goto out;
> >+
> >+	/*
> >+	 * Check PG_movable before holding a PG_lock because page's owner
> >+	 * assumes anybody doesn't touch PG_lock of newly allocated page.
> >+	 */
> >+	if (unlikely(!PageMovable(page)))
> >+		goto out_putpage;
> >+	/*
> >+	 * As movable pages are not isolated from LRU lists, concurrent
> >+	 * compaction threads can race against page migration functions
> >+	 * as well as race against the releasing a page.
> >+	 *
> >+	 * In order to avoid having an already isolated movable page
> >+	 * being (wrongly) re-isolated while it is under migration,
> >+	 * or to avoid attempting to isolate pages being released,
> >+	 * lets be sure we have the page lock
> >+	 * before proceeding with the movable page isolation steps.
> >+	 */
> >+	if (unlikely(!trylock_page(page)))
> >+		goto out_putpage;
> >+
> >+	if (!PageMovable(page) || PageIsolated(page))
> >+		goto out_no_isolated;
> >+
> >+	ret = page->mapping->a_ops->isolate_page(page, mode);
> >+	if (!ret)
> >+		goto out_no_isolated;
> >+
> >+	WARN_ON_ONCE(!PageIsolated(page));
> >+	unlock_page(page);
> >+	return ret;
> >+
> >+out_no_isolated:
> >+	unlock_page(page);
> >+out_putpage:
> >+	put_page(page);
> >+out:
> >+	return ret;
> >+}
> >+
> >+/* It should be called on page which is PG_movable */
> >+void putback_movable_page(struct page *page)
> >+{
> >+	/*
> >+	 * 'lock_page()' stabilizes the page and prevents races against
> >+	 * concurrent isolation threads attempting to re-isolate it.
> >+	 */
> >+	VM_BUG_ON_PAGE(!PageMovable(page), page);
> >+
> >+	lock_page(page);
> >+	if (PageIsolated(page)) {
> >+		struct address_space *mapping;
> >+
> >+		mapping = page_mapping(page);
> >+		mapping->a_ops->putback_page(page);
> >+		WARN_ON_ONCE(PageIsolated(page));
> >+	} else {
> >+		__ClearPageMovable(page);
> >+	}
> >+	unlock_page(page);
> >+	/* drop the extra ref count taken for movable page isolation */
> >+	put_page(page);
> >+}
> >+
> >  /*
> >   * Put previously isolated pages back onto the appropriate lists
> >   * from where they were once taken off for compaction/migration.
> >@@ -94,10 +173,18 @@ void putback_movable_pages(struct list_head *l)
> >  		list_del(&page->lru);
> >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> >  				page_is_file_cache(page));
> >-		if (unlikely(isolated_balloon_page(page)))
> >+		if (unlikely(isolated_balloon_page(page))) {
> >  			balloon_page_putback(page);
> >-		else
> >+		} else if (unlikely(PageMovable(page))) {
> >+			if (PageIsolated(page)) {
> >+				putback_movable_page(page);
> >+			} else {
> >+				__ClearPageMovable(page);
> 
> We don't do lock_page() here, so what prevents parallel compaction
> isolating the same page?

Need PG_lock.

> 
> >+				put_page(page);
> >+			}
> >+		} else {
> >  			putback_lru_page(page);
> >+		}
> >  	}
> >  }
> >
> >@@ -592,7 +679,7 @@ void migrate_page_copy(struct page *newpage, struct page *page)
> >   ***********************************************************/
> >
> >  /*
> >- * Common logic to directly migrate a single page suitable for
> >+ * Common logic to directly migrate a single LRU page suitable for
> >   * pages that do not use PagePrivate/PagePrivate2.
> >   *
> >   * Pages are locked upon entry and exit.
> >@@ -755,24 +842,54 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  				enum migrate_mode mode)
> >  {
> >  	struct address_space *mapping;
> >-	int rc;
> >+	int rc = -EAGAIN;
> >+	bool lru_movable = true;
> >
> >  	VM_BUG_ON_PAGE(!PageLocked(page), page);
> >  	VM_BUG_ON_PAGE(!PageLocked(newpage), newpage);
> >
> >  	mapping = page_mapping(page);
> >-	if (!mapping)
> >-		rc = migrate_page(mapping, newpage, page, mode);
> >-	else if (mapping->a_ops->migratepage)
> >-		/*
> >-		 * Most pages have a mapping and most filesystems provide a
> >-		 * migratepage callback. Anonymous pages are part of swap
> >-		 * space which also has its own migratepage callback. This
> >-		 * is the most common path for page migration.
> >-		 */
> >-		rc = mapping->a_ops->migratepage(mapping, newpage, page, mode);
> >-	else
> >-		rc = fallback_migrate_page(mapping, newpage, page, mode);
> >+	/*
> >+	 * In case of non-lru page, it could be released after
> >+	 * isolation step. In that case, we shouldn't try
> >+	 * fallback migration which was designed for LRU pages.
> >+	 *
> >+	 * The rule for such case is that subsystem should clear
> >+	 * PG_isolated but remains PG_movable so VM should catch
> >+	 * it and clear PG_movable for it.
> >+	 */
> >+	if (unlikely(PageMovable(page))) {
> 
> Can false positive from PG_reclaim occur here?

PageMovable has _mapcount == PAGE_MOVALBE_MAPCOUNT_VALUE check.

> 
> >+		lru_movable = false;
> >+		VM_BUG_ON_PAGE(!mapping, page);
> >+		if (!PageIsolated(page)) {
> >+			rc = MIGRATEPAGE_SUCCESS;
> >+			__ClearPageMovable(page);
> >+			goto out;
> >+		}
> >+	}
> >+
> >+	if (likely(lru_movable)) {
> >+		if (!mapping)
> >+			rc = migrate_page(mapping, newpage, page, mode);
> >+		else if (mapping->a_ops->migratepage)
> >+			/*
> >+			 * Most pages have a mapping and most filesystems
> >+			 * provide a migratepage callback. Anonymous pages
> >+			 * are part of swap space which also has its own
> >+			 * migratepage callback. This is the most common path
> >+			 * for page migration.
> >+			 */
> >+			rc = mapping->a_ops->migratepage(mapping, newpage,
> >+							page, mode);
> >+		else
> >+			rc = fallback_migrate_page(mapping, newpage,
> >+							page, mode);
> >+	} else {
> >+		rc = mapping->a_ops->migratepage(mapping, newpage,
> >+						page, mode);
> >+		WARN_ON_ONCE(rc == MIGRATEPAGE_SUCCESS &&
> >+			PageIsolated(page));
> >+	}
> >
> >  	/*
> >  	 * When successful, old pagecache page->mapping must be cleared before
> >@@ -782,6 +899,7 @@ static int move_to_new_page(struct page *newpage, struct page *page,
> >  		if (!PageAnon(page))
> >  			page->mapping = NULL;
> >  	}
> >+out:
> >  	return rc;
> >  }
> >
> >@@ -960,6 +1078,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  			put_new_page(newpage, private);
> >  		else
> >  			put_page(newpage);
> >+		if (PageMovable(page))
> >+			__ClearPageMovable(page);
> >  		goto out;
> >  	}
> >
> >@@ -1000,8 +1120,26 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  				num_poisoned_pages_inc();
> >  		}
> >  	} else {
> >-		if (rc != -EAGAIN)
> >-			putback_lru_page(page);
> >+		if (rc != -EAGAIN) {
> >+			/*
> >+			 * subsystem couldn't remove PG_movable since page is
> >+			 * isolated so PageMovable check is not racy in here.
> >+			 * But PageIsolated check can be racy but it's okay
> >+			 * because putback_movable_page checks it under PG_lock
> >+			 * again.
> >+			 */
> >+			if (unlikely(PageMovable(page))) {
> >+				if (PageIsolated(page))
> >+					putback_movable_page(page);
> >+				else {
> >+					__ClearPageMovable(page);
> 
> Again, we don't do lock_page() here, so what prevents parallel
> compaction isolating the same page?

It seems to need PG_lock in there, too.
Thanks for catching it up.

> 
> Sorry for so many questions, hope they all have good answers and
> this series is a success :) Thanks for picking it up.

No problem at all. Many question means the code or/and doc is not
clear and still need be improved.

Thanks for detail review, Vlastimil!
I will resend new versions after vacation in this week.

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Naoya Horiguchi @ 2016-04-04  4:45 UTC (permalink / raw)
  To: Minchan Kim
  Cc: Rik van Riel, YiPing Xu, aquini@redhat.com, rknize@motorola.com,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins,
	linux-kernel@vger.kernel.org, Al Viro,
	virtualization@lists.linux-foundation.org, bfields@fieldses.org,
	linux-mm@kvack.org, Gioh Kim, koct9i@gmail.com, Sangseok Lee,
	Joonsoo Kim, Andrew Morton, jlayton@poochiereds.net,
	Vlastimil Babka, Mel Gorman
In-Reply-To: <20160404013917.GC6543@bbox>

On Mon, Apr 04, 2016 at 10:39:17AM +0900, Minchan Kim wrote:
> On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> > On 03/30/2016 09:12 AM, Minchan Kim wrote:
> > >Procedure of page migration is as follows:
> > >
> > >First of all, it should isolate a page from LRU and try to
> > >migrate the page. If it is successful, it releases the page
> > >for freeing. Otherwise, it should put the page back to LRU
> > >list.
> > >
> > >For LRU pages, we have used putback_lru_page for both freeing
> > >and putback to LRU list. It's okay because put_page is aware of
> > >LRU list so if it releases last refcount of the page, it removes
> > >the page from LRU list. However, It makes unnecessary operations
> > >(e.g., lru_cache_add, pagevec and flags operations. It would be
> > >not significant but no worth to do) and harder to support new
> > >non-lru page migration because put_page isn't aware of non-lru
> > >page's data structure.
> > >
> > >To solve the problem, we can add new hook in put_page with
> > >PageMovable flags check but it can increase overhead in
> > >hot path and needs new locking scheme to stabilize the flag check
> > >with put_page.
> > >
> > >So, this patch cleans it up to divide two semantic(ie, put and putback).
> > >If migration is successful, use put_page instead of putback_lru_page and
> > >use putback_lru_page only on failure. That makes code more readable
> > >and doesn't add overhead in put_page.
> > >
> > >Comment from Vlastimil
> > >"Yeah, and compaction (perhaps also other migration users) has to drain
> > >the lru pvec... Getting rid of this stuff is worth even by itself."
> > >
> > >Cc: Mel Gorman <mgorman@suse.de>
> > >Cc: Hugh Dickins <hughd@google.com>
> > >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> > >Signed-off-by: Minchan Kim <minchan@kernel.org>
> > 
> > [...]
> > 
> > >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> > >  		list_del(&page->lru);
> > >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> > >  				page_is_file_cache(page));
> > >-		/* Soft-offlined page shouldn't go through lru cache list */
> > >+	}
> > >+
> > >+	/*
> > >+	 * If migration is successful, drop the reference grabbed during
> > >+	 * isolation. Otherwise, restore the page to LRU list unless we
> > >+	 * want to retry.
> > >+	 */
> > >+	if (rc == MIGRATEPAGE_SUCCESS) {
> > >+		put_page(page);
> > >  		if (reason == MR_MEMORY_FAILURE) {
> > >-			put_page(page);
> > >  			if (!test_set_page_hwpoison(page))
> > >  				num_poisoned_pages_inc();
> > >-		} else
> > >+		}
> > 
> > Hmm, I didn't notice it previously, or it's due to rebasing, but it
> > seems that you restricted the memory failure handling (i.e. setting
> > hwpoison) to MIGRATE_SUCCESS, while previously it was done for all
> > non-EAGAIN results. I think that goes against the intention of
> > hwpoison, which is IIRC to catch and kill the poor process that
> > still uses the page?
> 
> That's why I Cc'ed Naoya Horiguchi to catch things I might make
> mistake.
> 
> Thanks for catching it, Vlastimil.
> It was my mistake. But in this chance, I looked over hwpoison code and
> I saw other places which increases num_poisoned_pages are successful
> migration, already freed page and successful invalidated page.
> IOW, they are already successful isolated page so I guess it should
> increase the count when only successful migration is done?

Yes, that's right. When exiting with migration's failure, we shouldn't call
test_set_page_hwpoison or num_poisoned_pages_inc, so current code checking
(rc != -EAGAIN) is simply incorrect. Your change fixes the bug in memory
error handling. Great!

> And when I read memory_failure, it bails out without killing if it
> encounters HWPoisoned page so I think it's not for catching and
> kill the poor proces.
>
> > 
> > Also (but not your fault) the put_page() preceding
> > test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> > pin we are releasing and which one we still have (hopefully? if I
> > read description of da1b13ccfbebe right) otherwise it looks like
> > doing something with a page that we just potentially freed.
>
> Yes, while I read the code, I had same question. I think the releasing
> refcount is for get_any_page.

As the other callers of page migration do, soft_offline_page expects the
migration source page to be freed at this put_page() (no pin remains.)
The refcount released here is from isolate_lru_page() in __soft_offline_page().
(the pin by get_any_page is released by put_hwpoison_page just after it.)

.. yes, doing something just after freeing page looks weird, but that's
how PageHWPoison flag works. IOW, many other page flags are maintained
only during one "allocate-free" life span, but PageHWPoison still does
its job beyond it.

As for commenting, this put_page() is called in any MIGRATEPAGE_SUCCESS
case (regardless of callers), so what we can say here is "we free the
source page here, bypassing LRU list" or something?

Thanks,
Naoya Horiguchi

^ permalink raw reply

* Re: [PATCH v3 03/16] mm: add non-lru movable page support document
From: Minchan Kim @ 2016-04-04  2:25 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Jonathan Corbet,
	Hugh Dickins, linux-kernel, virtualization, bfields, linux-mm,
	Gioh Kim, Mel Gorman, Sangseok Lee, Andrew Morton, Joonsoo Kim,
	koct9i, Al Viro
In-Reply-To: <56FE87EA.60806@suse.cz>

On Fri, Apr 01, 2016 at 04:38:34PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >This patch describes what a subsystem should do for non-lru movable
> >page supporting.
> 
> Intentionally reading this first without studying the code to better
> catch things that would seem obvious otherwise.
> 
> >Cc: Jonathan Corbet <corbet@lwn.net>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> >---
> >  Documentation/filesystems/vfs.txt | 11 ++++++-
> >  Documentation/vm/page_migration   | 69 ++++++++++++++++++++++++++++++++++++++-
> >  2 files changed, 78 insertions(+), 2 deletions(-)
> >
> >diff --git a/Documentation/filesystems/vfs.txt b/Documentation/filesystems/vfs.txt
> >index 4c1b6c3b4bc8..d63142f8ed7b 100644
> >--- a/Documentation/filesystems/vfs.txt
> >+++ b/Documentation/filesystems/vfs.txt
> >@@ -752,12 +752,21 @@ struct address_space_operations {
> >          and transfer data directly between the storage and the
> >          application's address space.
> >
> >+  isolate_page: Called by the VM when isolating a movable non-lru page.
> >+	If page is successfully isolated, we should mark the page as
> >+	PG_isolated via __SetPageIsolated.
> 
> Patch 02 changelog suggests SetPageIsolated, so this is confusing. I
> guess the main point is that there might be parallel attempts and
> only one is allowed to succeed, right? Whether it's done by atomic

Right.

> ops or otherwise doesn't matter to e.g. compaction.

It should be atomic under PG_lock so it would be better to change
__SetPageIsolated in patch 02 to show the intention "the operation
is not atomic so you need some lock(e.g., PG_lock) to make sure
atomicity".


> 
> >    migrate_page:  This is used to compact the physical memory usage.
> >          If the VM wants to relocate a page (maybe off a memory card
> >          that is signalling imminent failure) it will pass a new page
> >  	and an old page to this function.  migrate_page should
> >  	transfer any private data across and update any references
> >-        that it has to the page.
> >+	that it has to the page. If migrated page is non-lru page,
> >+	we should clear PG_isolated and PG_movable via __ClearPageIsolated
> >+	and __ClearPageMovable.
> 
> Similar concern as __SetPageIsolated.
> 
> >+
> >+  putback_page: Called by the VM when isolated page's migration fails.
> >+	We should clear PG_isolated marked in isolated_page function.
> 
> Note this kind of wording is less confusing and could be used above wrt my concerns.
> 
> >
> >    launder_page: Called before freeing a page - it writes back the dirty page. To
> >    	prevent redirtying the page, it is kept locked during the whole
> >diff --git a/Documentation/vm/page_migration b/Documentation/vm/page_migration
> >index fea5c0864170..c4e7551a414e 100644
> >--- a/Documentation/vm/page_migration
> >+++ b/Documentation/vm/page_migration
> >@@ -142,5 +142,72 @@ is increased so that the page cannot be freed while page migration occurs.
> >  20. The new page is moved to the LRU and can be scanned by the swapper
> >      etc again.
> >
> >-Christoph Lameter, May 8, 2006.
> >+C. Non-LRU Page migration
> >+-------------------------
> >+
> >+Although original migration aimed for reducing the latency of memory access
> >+for NUMA, compaction who want to create high-order page is also main customer.
> >+
> >+Ppage migration's disadvantage is that it was designed to migrate only
> >+*LRU* pages. However, there are potential non-lru movable pages which can be
> >+migrated in system, for example, zsmalloc, virtio-balloon pages.
> >+For virtio-balloon pages, some parts of migration code path was hooked up
> >+and added virtio-balloon specific functions to intercept logi.
> 
> logi -> logic?

-_-;;

> 
> >+It's too specific to one subsystem so other subsystem who want to make
> >+their pages movable should add own specific hooks in migration path.
> 
> s/should/would have to/ I guess?

Better.


> 
> >+To solve such problem, VM supports non-LRU page migration which provides
> >+generic functions for non-LRU movable pages without needing subsystem
> >+specific hook in mm/{migrate|compact}.c.
> >+
> >+If a subsystem want to make own pages movable, it should mark pages as
> >+PG_movable via __SetPageMovable. __SetPageMovable needs address_space for
> >+argument for register functions which will be called by VM.
> >+
> >+Three functions in address_space_operation related to non-lru movable page:
> >+
> >+	bool (*isolate_page) (struct page *, isolate_mode_t);
> >+	int (*migratepage) (struct address_space *,
> >+		struct page *, struct page *, enum migrate_mode);
> >+	void (*putback_page)(struct page *);
> >+
> >+1. Isolation
> >+
> >+What VM expected on isolate_page of subsystem is to set PG_isolated flags
> >+of the page if it was successful. With that, concurrent isolation among
> >+CPUs skips the isolated page by other CPU earlier. VM calls isolate_page
> >+under PG_lock of page. If a subsystem cannot isolate the page, it should
> >+return false.
> 
> Ah, I see, so it's designed with page lock to handle the concurrent isolations etc.
> 
> In http://marc.info/?l=linux-mm&m=143816716511904&w=2 Mel has warned
> about doing this in general under page_lock and suggested that each
> user handles concurrent calls to isolate_page() internally. Might be
> more generic that way, even if all current implementers will
> actually use the page lock.

We need PG_lock for two reasons.

Firstly, it guarantees page's flags operation(i.e., PG_movable, PG_isolated)
atomicity. Another thing is for stability for page->mapping->a_ops.

For example,

isolate_migratepages_block
        if (PageMovable(page))
                isolate_movable_page
                        get_page_unless_zero <--- 1
                        trylock_page
                        page->mapping->a_ops->isolate_page <--- 2

Between 1 and 2, driver can nullify page->mapping so we need PG_lock
to collaborate driver in the end. IOW, user should call
__ClearPageMovable where reset page->mapping to NULl under PG_lock.

> 
> Also it's worth reading that mail in full and incorporating here, as
> there are more concerns related to concurrency that should be
> documented, e.g. with pages that can be mapped to userspace. Not a
> case with zram and balloon pages I guess, but one of Gioh's original
> use cases was a driver which IIRC could map pages. So the design and
> documentation should keep that in mind.

Hmm, I didn't consider driver userspace mapped pages.
It's really worth to consider. I will think about it.

> 
> >+2. Migration
> >+
> >+After successful isolation, VM calls migratepage. The migratepage's goal is
> >+to move content of the old page to new page and set up struct page fields
> >+of new page. If migration is successful, subsystem should release old page's
> >+refcount to free. Keep in mind that subsystem should clear PG_movable and
> >+PG_isolated before releasing the refcount.  If everything are done, user
> >+should return MIGRATEPAGE_SUCCESS. If subsystem cannot migrate the page
> >+at the moment, migratepage can return -EAGAIN. On -EAGAIN, VM will retry page
> >+migration because VM interprets -EAGAIN as "temporal migration failure".
> >+
> >+3. Putback
> >+
> >+If migration was unsuccessful, VM calls putback_page. The subsystem should
> >+insert isolated page to own data structure again if it has. And subsystem
> >+should clear PG_isolated which was marked in isolation step.
> >+
> >+Note about releasing page:
> >+
> >+Subsystem can release pages whenever it want but if it releses the page
> >+which is already isolated, it should clear PG_isolated but doesn't touch
> >+PG_movable under PG_lock. Instead of it, VM will clear PG_movable after
> >+his job done. Otherweise, subsystem should clear both page flags before
> >+releasing the page.
> 
> I don't understand this right now. But maybe I will get it after
> reading the patches and suggest some improved wording here.

I will try to explain why such rule happens in there.

The problem is that put_page is aware of PageLRU. So, if someone releases
last refcount of LRU page, __put_page checks PageLRU and then, clear the
flags and detatch the page in LRU list(i.e., data structure).
But in case of driver page, data structure like LRU among drivers is not only one.
IOW, we should add following code in put_page to handle various requirements
of driver page.

void __put_page(struct page *page)
{
        if (PageMovable(page)) {
                /*
                 * It will tity up driver's data structure like LRU
                 * and reset page's flags. And it should be atomic
                 * and always successful
                 */
                page->put(page);
                __ClearPageMovable(page);
        } else if (PageCompound(page))
                __put_compound_page(page);
        else
                __put_single_page(page);
                          
}

I'd like to avoid add new branch for not popular job in put_page which is hot.
(Might change in future but not popular at the moment)
So, rule of driver is as follows.

When the driver releases the page and he found the page is PG_isolated,
he should unmark only PG_isolated, not PG_movable so migration side of
VM can catch it up "Hmm, the isolated non-lru page doesn't have PG_isolated
any more. It means drivers releases the page. So, let's put the page
instead of putback operation".

When the driver releases the page and he doesn't see PG_isolated mark
of the page, driver should reset both PG_isolated and PG_movable.

> 
> >+
> >+Note about PG_isolated:
> >+
> >+PG_isolated check on a page is valid only if the page's flag is already
> >+set to PG_movable.
> 
> But it's not possible to check both atomically, so I guess it
> implies checking under page lock? If that's true, should be
> explicit.

Sure.

Thanks for the review, Vlastimil. :)

^ permalink raw reply

* Re: [PATCH v3 01/16] mm: use put_page to free page instead of putback_lru_page
From: Minchan Kim @ 2016-04-04  1:39 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: jlayton, Rik van Riel, YiPing Xu, aquini, rknize,
	Sergey Senozhatsky, Chan Gyun Jeong, Hugh Dickins, linux-kernel,
	virtualization, bfields, linux-mm, Gioh Kim, Mel Gorman,
	Sangseok Lee, Andrew Morton, Naoya Horiguchi, Joonsoo Kim, koct9i,
	Al Viro
In-Reply-To: <56FE706D.7080507@suse.cz>

On Fri, Apr 01, 2016 at 02:58:21PM +0200, Vlastimil Babka wrote:
> On 03/30/2016 09:12 AM, Minchan Kim wrote:
> >Procedure of page migration is as follows:
> >
> >First of all, it should isolate a page from LRU and try to
> >migrate the page. If it is successful, it releases the page
> >for freeing. Otherwise, it should put the page back to LRU
> >list.
> >
> >For LRU pages, we have used putback_lru_page for both freeing
> >and putback to LRU list. It's okay because put_page is aware of
> >LRU list so if it releases last refcount of the page, it removes
> >the page from LRU list. However, It makes unnecessary operations
> >(e.g., lru_cache_add, pagevec and flags operations. It would be
> >not significant but no worth to do) and harder to support new
> >non-lru page migration because put_page isn't aware of non-lru
> >page's data structure.
> >
> >To solve the problem, we can add new hook in put_page with
> >PageMovable flags check but it can increase overhead in
> >hot path and needs new locking scheme to stabilize the flag check
> >with put_page.
> >
> >So, this patch cleans it up to divide two semantic(ie, put and putback).
> >If migration is successful, use put_page instead of putback_lru_page and
> >use putback_lru_page only on failure. That makes code more readable
> >and doesn't add overhead in put_page.
> >
> >Comment from Vlastimil
> >"Yeah, and compaction (perhaps also other migration users) has to drain
> >the lru pvec... Getting rid of this stuff is worth even by itself."
> >
> >Cc: Mel Gorman <mgorman@suse.de>
> >Cc: Hugh Dickins <hughd@google.com>
> >Cc: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> >Acked-by: Vlastimil Babka <vbabka@suse.cz>
> >Signed-off-by: Minchan Kim <minchan@kernel.org>
> 
> [...]
> 
> >@@ -974,28 +986,28 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
> >  		list_del(&page->lru);
> >  		dec_zone_page_state(page, NR_ISOLATED_ANON +
> >  				page_is_file_cache(page));
> >-		/* Soft-offlined page shouldn't go through lru cache list */
> >+	}
> >+
> >+	/*
> >+	 * If migration is successful, drop the reference grabbed during
> >+	 * isolation. Otherwise, restore the page to LRU list unless we
> >+	 * want to retry.
> >+	 */
> >+	if (rc == MIGRATEPAGE_SUCCESS) {
> >+		put_page(page);
> >  		if (reason == MR_MEMORY_FAILURE) {
> >-			put_page(page);
> >  			if (!test_set_page_hwpoison(page))
> >  				num_poisoned_pages_inc();
> >-		} else
> >+		}
> 
> Hmm, I didn't notice it previously, or it's due to rebasing, but it
> seems that you restricted the memory failure handling (i.e. setting
> hwpoison) to MIGRATE_SUCCESS, while previously it was done for all
> non-EAGAIN results. I think that goes against the intention of
> hwpoison, which is IIRC to catch and kill the poor process that
> still uses the page?

That's why I Cc'ed Naoya Horiguchi to catch things I might make
mistake.

Thanks for catching it, Vlastimil.
It was my mistake. But in this chance, I looked over hwpoison code and
I saw other places which increases num_poisoned_pages are successful
migration, already freed page and successful invalidated page.
IOW, they are already successful isolated page so I guess it should
increase the count when only successful migration is done?
And when I read memory_failure, it bails out without killing if it
encounters HWPoisoned page so I think it's not for catching and
kill the poor proces.

> 
> Also (but not your fault) the put_page() preceding
> test_set_page_hwpoison(page)) IMHO deserves a comment saying which
> pin we are releasing and which one we still have (hopefully? if I
> read description of da1b13ccfbebe right) otherwise it looks like
> doing something with a page that we just potentially freed.

Yes, while I read the code, I had same question. I think the releasing
refcount is for get_any_page.

Naoya, could you answer above two questions?

Thanks.

> 
> >+	} else {
> >+		if (rc != -EAGAIN)
> >  			putback_lru_page(page);
> >+		if (put_new_page)
> >+			put_new_page(newpage, private);
> >+		else
> >+			put_page(newpage);
> >  	}
> >
> >-	/*
> >-	 * If migration was not successful and there's a freeing callback, use
> >-	 * it.  Otherwise, putback_lru_page() will drop the reference grabbed
> >-	 * during isolation.
> >-	 */
> >-	if (put_new_page)
> >-		put_new_page(newpage, private);
> >-	else if (unlikely(__is_movable_balloon_page(newpage))) {
> >-		/* drop our reference, page already in the balloon */
> >-		put_page(newpage);
> >-	} else
> >-		putback_lru_page(newpage);
> >-
> >  	if (result) {
> >  		if (rc)
> >  			*result = rc;
> >
> 

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox