linuxppc-dev.lists.ozlabs.org archive mirror
* [RFCv2 0/4] Prototype PAPR hash page table resizing (guest side)
@ 2016-01-11  5:52 David Gibson
From: David Gibson @ 2016-01-11  5:52 UTC
  To: paulus, benh, michael, bharata; +Cc: thuth, lvivier, linuxppc-dev, David Gibson

I've previously discussed with Paul and Ben the possibility of
extending PAPR to allow changing the size of a running guest's hash
page table (HPT).  This would allow for much more flexible memory
hotplug, since the HPT wouldn't have to be sized in advance for the
maximum possible memory size of the guest.

This is a second draft / prototype implementation of the guest side of
this.

Obviously, for now it uses vendor-specific hypercalls rather than
official PAPR ones (and likewise non-standard hypertas property and
CAS vector extensions).  I have a draft implementation of these in
qemu for TCG guests which I hope to post in the reasonably near
future.

The design assumes that the HPT change happens in two phases:

   1) The "prepare" phase may be slow but can run asynchronously while
      the guest runs normally

   2) The "commit" phase switches to a previously prepared HPT, and
      must be run with no concurrent updates to the HPT - in practice
      that means stop_machine() for a Linux guest.

To go with that there are two (proposed) hcalls:

H_RESIZE_HPT_PREPARE:
    This starts (1) for a new HPT of a given size.  It will typically
    return H_LONG_BUSY_* and the guest must call it in a (sleeping)
    loop until it completes.

    Calling PREPARE with a different size from one already in progress
    will cancel the in-progress preparation (freeing the potential HPT
    if already allocated) and start a new one for the given size.

    As a special case, calling PREPARE with shift == 0 will cancel any
    in-progress preparation and not start a new one, instead reverting
    to the existing HPT.

H_RESIZE_HPT_COMMIT:
    Switches to an HPT of the given size.  It will fail if there isn't
    a fully prepared HPT of the given size ready to go.  No HPT updates
    (H_ENTER etc.) may be run on *any* guest CPU while this is called.

    Once COMMIT returns H_SUCCESS, the guest will be operating on the
    new HPT.  On any other return it is still running on the old HPT.

    The hypervisor could cancel a prospective HPT for its own reasons,
    e.g. it could time out if the guest waits too long between
    PREPARE and COMMIT, or it could "forget" about an in-progress
    preparation due to live migration.  In that case COMMIT will fail,
    which the guest should be prepared to handle.
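The PREPARE/COMMIT rules above amount to a small state machine. The C sketch below models only the hypervisor's bookkeeping, under invented assumptions: the mock_* names and return codes are hypothetical (they are not the real H_* values), and a preparation is arbitrarily made to report "long busy" twice before it is ready.

```c
#include <assert.h>

/* Hypothetical return codes for this model only; NOT the H_* values
 * from asm/hvcall.h. */
enum mock_rc { MOCK_SUCCESS, MOCK_LONG_BUSY, MOCK_FAILED };

/* Toy model of the hypervisor's resize state for one guest. */
struct mock_hv {
	unsigned long cur_shift;     /* HPT the guest is running on */
	unsigned long pending_shift; /* 0 = no preparation in progress */
	int busy_calls;              /* LONG_BUSY replies left before ready */
};

enum mock_rc mock_prepare(struct mock_hv *hv, unsigned long shift)
{
	if (shift == 0) {
		/* shift == 0: cancel any preparation, keep the existing HPT */
		hv->pending_shift = 0;
		hv->busy_calls = 0;
		return MOCK_SUCCESS;
	}
	if (shift != hv->pending_shift) {
		/* New size: drop any other preparation and start over */
		hv->pending_shift = shift;
		hv->busy_calls = 2;
	}
	if (hv->busy_calls > 0) {
		hv->busy_calls--;
		return MOCK_LONG_BUSY;   /* guest should sleep and retry */
	}
	return MOCK_SUCCESS;
}

enum mock_rc mock_commit(struct mock_hv *hv, unsigned long shift)
{
	/* Only a fully prepared HPT of exactly this size can be committed */
	if (shift == 0 || shift != hv->pending_shift || hv->busy_calls > 0)
		return MOCK_FAILED;      /* guest stays on the old HPT */
	hv->cur_shift = shift;
	hv->pending_shift = 0;
	return MOCK_SUCCESS;
}

/* Walk through the normal path; aborts if the model misbehaves. */
void mock_walkthrough(void)
{
	struct mock_hv hv = { .cur_shift = 24 };

	assert(mock_prepare(&hv, 26) == MOCK_LONG_BUSY);
	assert(mock_commit(&hv, 26) == MOCK_FAILED);    /* not ready yet */
	assert(mock_prepare(&hv, 26) == MOCK_LONG_BUSY);
	assert(mock_prepare(&hv, 26) == MOCK_SUCCESS);
	assert(mock_commit(&hv, 25) == MOCK_FAILED);    /* wrong size */
	assert(mock_commit(&hv, 26) == MOCK_SUCCESS);
	assert(hv.cur_shift == 26);
}
```

The same model also shows PREPARE with shift == 0 cancelling a preparation, after which COMMIT fails and the guest keeps its old HPT.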

Both hypercalls take a flags parameter for extensibility, but I
haven't defined any flags so far.

I have two possible implementations in mind for the host side, both of
which should work with the same guest interface:

A) During the prepare phase we just allocate and clear the HPT (and
   install VRMA HPTEs for KVM).  During the commit phase we translate
   all bolted entries from the old HPT to the new one, then continue.

   This approach is relatively simple to implement, but could lead to
   a substantial delay during the commit phase.  Initial rough
   measurements suggest it will be around 200ms on a POWER8 for a 1G
   HPT (128G guest).  Since typical live migration downtimes are
   300-500ms, that's probably still good enough to be useful.

B) During the prepare phase H_ENTER etc. calls are mirrored to both
   the current HPT and the prospective HPT.  Existing HPTEs are
   migrated to the new HPT in the background.  The prepare phase
   completes once the old and new HPTs are in sync.  The commit phase
   simply pivots to the new HPT.
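Option B's mirroring can be illustrated with a deliberately simplified sketch. It assumes a direct-mapped table of non-zero keys rather than real 8-entry HPTE groups, ignores H_REMOVE and bolted entries, and all toy_* names are hypothetical. It only shows updates landing in both tables while a background cursor copies existing entries until the two are in sync.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy "HPT": direct-mapped table of non-zero keys, 0 = empty slot. */
struct toy_hpt {
	unsigned long *slot;
	size_t nslots;            /* must be a power of two */
};

static void toy_insert(struct toy_hpt *t, unsigned long key)
{
	t->slot[key & (t->nslots - 1)] = key;
}

struct toy_resize {
	struct toy_hpt *cur;      /* HPT the guest is using */
	struct toy_hpt *next;     /* prospective HPT */
	size_t migrated;          /* old slots copied so far */
};

/* H_ENTER-like update during prepare: mirrored to both tables. */
void toy_enter(struct toy_resize *r, unsigned long key)
{
	assert(key != 0);
	toy_insert(r->cur, key);
	toy_insert(r->next, key);
}

/* One background migration step; returns true once fully in sync. */
bool toy_migrate_step(struct toy_resize *r)
{
	if (r->migrated < r->cur->nslots) {
		unsigned long key = r->cur->slot[r->migrated++];

		if (key)
			toy_insert(r->next, key);
	}
	return r->migrated == r->cur->nslots;
}
```

Once toy_migrate_step() reports true, a COMMIT in this model would only need to swap the cur and next pointers, which is why the commit phase of option B can be short.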


Please comment on the proposed new PAPR interface and this
implementation.  Any information on what the next step would be in
proposing this as a formal PAPR update would be useful too.

Changes since v1:
  * Added a firmware feature bit for HPT resizing, initialized from
    the device tree
  * Added support for advertising HPT resizing support via
    ibm,client-architecture-support
  * Assorted minor revisions

David Gibson (4):
  pseries: Add hypercall wrappers for hash page table resizing
  pseries: Add support for hash table resizing
  pseries: debugfs hook to trigger a hash page table resize
  pseries: Advertise HPT resizing support via CAS

 arch/powerpc/include/asm/firmware.h       |   5 +-
 arch/powerpc/include/asm/hvcall.h         |   2 +
 arch/powerpc/include/asm/plpar_wrappers.h |  12 +++
 arch/powerpc/include/asm/prom.h           |   1 +
 arch/powerpc/kernel/prom_init.c           |   2 +-
 arch/powerpc/platforms/pseries/firmware.c |   1 +
 arch/powerpc/platforms/pseries/lpar.c     | 135 ++++++++++++++++++++++++++++++
 7 files changed, 155 insertions(+), 3 deletions(-)

-- 
2.5.0


* [RFCv2 1/4] pseries: Add hypercall wrappers for hash page table resizing
From: David Gibson @ 2016-01-11  5:52 UTC
  To: paulus, benh, michael, bharata; +Cc: thuth, lvivier, linuxppc-dev, David Gibson

This adds the hypercall numbers and wrapper functions for the hash page
table resizing hypercalls.

These are experimental "platform specific" values for now, until we have a
formal PAPR update.

It also adds a new firmware feature flag to track the presence of the
HPT resizing calls.
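As a rough illustration of how such a feature bit gets set, the sketch below scans a device-tree style list of NUL-separated strings (the shape of the hypertas property) and ORs in FW_FEATURE_HPT_RESIZE on an exact name match. It is a cut-down, hypothetical re-creation rather than the code in firmware.c: the table has a single entry and wildcard entries such as "hcall-best-energy-1*" are not handled.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Bit value taken from the patch; the table is a cut-down illustration. */
#define FW_FEATURE_HPT_RESIZE	0x0000000001000000UL

struct fw_feature_name {
	unsigned long val;
	const char *name;
};

static const struct fw_feature_name table[] = {
	{ FW_FEATURE_HPT_RESIZE, "hcall-hpt-resize" },
};

/* Scan a string list laid out as "a\0b\0...\0" and OR in the feature
 * bit for every exact name match. */
unsigned long hypertas_to_features(const char *prop, size_t len)
{
	unsigned long features = 0;
	const char *s;

	assert(prop != NULL || len == 0);
	for (s = prop; s < prop + len; s += strlen(s) + 1)
		for (size_t i = 0; i < sizeof(table) / sizeof(table[0]); i++)
			if (strcmp(s, table[i].name) == 0)
				features |= table[i].val;
	return features;
}
```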

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/firmware.h       |  5 +++--
 arch/powerpc/include/asm/hvcall.h         |  2 ++
 arch/powerpc/include/asm/plpar_wrappers.h | 12 ++++++++++++
 arch/powerpc/platforms/pseries/firmware.c |  1 +
 4 files changed, 18 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/firmware.h b/arch/powerpc/include/asm/firmware.h
index e05808a..339f71d 100644
--- a/arch/powerpc/include/asm/firmware.h
+++ b/arch/powerpc/include/asm/firmware.h
@@ -42,7 +42,7 @@
 #define FW_FEATURE_SPLPAR	ASM_CONST(0x0000000000100000)
 #define FW_FEATURE_LPAR		ASM_CONST(0x0000000000400000)
 #define FW_FEATURE_PS3_LV1	ASM_CONST(0x0000000000800000)
-/* Free				ASM_CONST(0x0000000001000000) */
+#define FW_FEATURE_HPT_RESIZE	ASM_CONST(0x0000000001000000)
 #define FW_FEATURE_CMO		ASM_CONST(0x0000000002000000)
 #define FW_FEATURE_VPHN		ASM_CONST(0x0000000004000000)
 #define FW_FEATURE_XCMO		ASM_CONST(0x0000000008000000)
@@ -68,7 +68,8 @@ enum {
 		FW_FEATURE_MULTITCE | FW_FEATURE_SPLPAR | FW_FEATURE_LPAR |
 		FW_FEATURE_CMO | FW_FEATURE_VPHN | FW_FEATURE_XCMO |
 		FW_FEATURE_SET_MODE | FW_FEATURE_BEST_ENERGY |
-		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN,
+		FW_FEATURE_TYPE1_AFFINITY | FW_FEATURE_PRRN |
+		FW_FEATURE_HPT_RESIZE,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_POWERNV_POSSIBLE = FW_FEATURE_OPAL | FW_FEATURE_OPALv2 |
 		FW_FEATURE_OPALv3,
diff --git a/arch/powerpc/include/asm/hvcall.h b/arch/powerpc/include/asm/hvcall.h
index 85bc8c0..ae1fcb7 100644
--- a/arch/powerpc/include/asm/hvcall.h
+++ b/arch/powerpc/include/asm/hvcall.h
@@ -273,6 +273,8 @@
 
 /* Platform specific hcalls, used by KVM */
 #define H_RTAS			0xf000
+#define H_RESIZE_HPT_PREPARE	0xf003
+#define H_RESIZE_HPT_COMMIT	0xf004
 
 /* "Platform specific hcalls", provided by PHYP */
 #define H_GET_24X7_CATALOG_PAGE	0xF078
diff --git a/arch/powerpc/include/asm/plpar_wrappers.h b/arch/powerpc/include/asm/plpar_wrappers.h
index 67859ed..8f1d8fe 100644
--- a/arch/powerpc/include/asm/plpar_wrappers.h
+++ b/arch/powerpc/include/asm/plpar_wrappers.h
@@ -225,6 +225,18 @@ static inline long plpar_pte_protect(unsigned long flags, unsigned long ptex,
 	return plpar_hcall_norets(H_PROTECT, flags, ptex, avpn);
 }
 
+static inline long plpar_resize_hpt_prepare(unsigned long flags,
+					    unsigned long shift)
+{
+	return plpar_hcall_norets(H_RESIZE_HPT_PREPARE, flags, shift);
+}
+
+static inline long plpar_resize_hpt_commit(unsigned long flags,
+					   unsigned long shift)
+{
+	return plpar_hcall_norets(H_RESIZE_HPT_COMMIT, flags, shift);
+}
+
 static inline long plpar_tce_get(unsigned long liobn, unsigned long ioba,
 		unsigned long *tce_ret)
 {
diff --git a/arch/powerpc/platforms/pseries/firmware.c b/arch/powerpc/platforms/pseries/firmware.c
index 8c80588..7b287be 100644
--- a/arch/powerpc/platforms/pseries/firmware.c
+++ b/arch/powerpc/platforms/pseries/firmware.c
@@ -63,6 +63,7 @@ hypertas_fw_features_table[] = {
 	{FW_FEATURE_VPHN,		"hcall-vphn"},
 	{FW_FEATURE_SET_MODE,		"hcall-set-mode"},
 	{FW_FEATURE_BEST_ENERGY,	"hcall-best-energy-1*"},
+	{FW_FEATURE_HPT_RESIZE,		"hcall-hpt-resize"},
 };
 
 /* Build up the firmware features bitmask using the contents of
-- 
2.5.0


* [RFCv2 2/4] pseries: Add support for hash table resizing
From: David Gibson @ 2016-01-11  5:52 UTC
  To: paulus, benh, michael, bharata; +Cc: thuth, lvivier, linuxppc-dev, David Gibson

This adds support for using experimental hypercalls to change the size
of the main hash page table while running as a PAPR guest.  For now these
hypercalls are only in experimental qemu versions.

The interface is two-part: first H_RESIZE_HPT_PREPARE is used to allocate
and prepare the new hash table.  This may be slow, but can be done
asynchronously.  Then, H_RESIZE_HPT_COMMIT is used to switch to the new
hash table.  This requires that no CPUs be concurrently updating the HPT,
and so must be run under stop_machine().
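The PREPARE polling loop in this patch (accumulate the requested long-busy delay, and cancel with shift 0 once a 10 second budget is exceeded) can be exercised outside the kernel with a sketch like the one below. Assumptions: the H_LONG_BUSY_* range 9900-9905 and its 1 ms to 100 s mapping are reproduced from my understanding of asm/hvcall.h, demo_prepare() is a hypothetical stand-in for the hcall, the msleep() is omitted, and every non-success error is collapsed to -EIO.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define H_SUCCESS                 0
#define H_LONG_BUSY_ORDER_1_MSEC  9900  /* assumed values */
#define H_LONG_BUSY_ORDER_100_SEC 9905
#define HPT_RESIZE_TIMEOUT        10000 /* ms, as in the patch */

static int is_long_busy(long rc)
{
	return rc >= H_LONG_BUSY_ORDER_1_MSEC && rc <= H_LONG_BUSY_ORDER_100_SEC;
}

/* 9900 -> 1 ms, 9901 -> 10 ms, ..., 9905 -> 100 s */
static unsigned int longbusy_msecs(long rc)
{
	unsigned int ms = 1;

	for (long i = H_LONG_BUSY_ORDER_1_MSEC; i < rc; i++)
		ms *= 10;
	return ms;
}

typedef long (*prepare_fn)(unsigned long flags, unsigned long shift);

/* Hypothetical hcall stand-in: busy (100 ms each) for demo_busy_calls
 * calls, then success; remembers the last shift it was given. */
static int demo_busy_calls;
static unsigned long demo_last_shift;

static long demo_prepare(unsigned long flags, unsigned long shift)
{
	(void)flags;
	demo_last_shift = shift;
	if (shift != 0 && demo_busy_calls > 0) {
		demo_busy_calls--;
		return H_LONG_BUSY_ORDER_1_MSEC + 2;   /* 100 ms */
	}
	return H_SUCCESS;
}

/* Returns 0 once prepared, -ETIMEDOUT after cancelling, else -EIO. */
int resize_prepare_loop(prepare_fn prepare, unsigned long shift)
{
	unsigned int total_delay = 0;
	long rc;

	assert(prepare != NULL);
	rc = prepare(0, shift);
	while (is_long_busy(rc)) {
		unsigned int delay = longbusy_msecs(rc);

		total_delay += delay;
		if (total_delay > HPT_RESIZE_TIMEOUT) {
			prepare(0, 0);   /* shift == 0 cancels the preparation */
			return -ETIMEDOUT;
		}
		/* a real guest would msleep(delay) here before retrying */
		rc = prepare(0, shift);
	}
	return rc == H_SUCCESS ? 0 : -EIO;
}
```

Note that, as in the patch, the delay is added to the budget before sleeping, so the final over-budget delay is never actually slept.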

This patch only supplies a function to execute the hash table change,
nothing yet calls it.

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/pseries/lpar.c | 109 ++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index b7a67e3..f6e7af5 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -27,6 +27,8 @@
 #include <linux/console.h>
 #include <linux/export.h>
 #include <linux/jump_label.h>
+#include <linux/delay.h>
+#include <linux/stop_machine.h>
 #include <asm/processor.h>
 #include <asm/mmu.h>
 #include <asm/page.h>
@@ -794,3 +796,110 @@ int h_get_mpp_x(struct hvcall_mpp_x_data *mpp_x_data)
 
 	return rc;
 }
+
+#define HPT_RESIZE_TIMEOUT	10000 /* ms */
+
+struct hpt_resize_state {
+	unsigned long shift;
+	int commit_rc;
+};
+
+static int pseries_lpar_resize_hpt_commit(void *data)
+{
+	struct hpt_resize_state *state = data;
+
+	state->commit_rc = plpar_resize_hpt_commit(0, state->shift);
+	if (state->commit_rc != H_SUCCESS)
+		return -EIO;
+
+	/* Hypervisor has transitioned the HTAB, update our globals */
+	ppc64_pft_size = state->shift;
+	htab_size_bytes = 1UL << ppc64_pft_size;
+	htab_hash_mask = (htab_size_bytes >> 7) - 1;
+
+	return 0;
+}
+
+/* Must be called in user context */
+int pseries_lpar_resize_hpt(unsigned long shift)
+{
+	struct hpt_resize_state state = {
+		.shift = shift,
+		.commit_rc = H_FUNCTION,
+	};
+	unsigned int delay, total_delay = 0;
+	int rc;
+	ktime_t t0, t1, t2;
+
+	might_sleep();
+
+	if (!firmware_has_feature(FW_FEATURE_HPT_RESIZE))
+		return -ENODEV;
+
+	printk(KERN_INFO "lpar: Attempting to resize HPT to shift %lu\n",
+	       shift);
+
+	t0 = ktime_get();
+
+	rc = plpar_resize_hpt_prepare(0, shift);
+	while (H_IS_LONG_BUSY(rc)) {
+		delay = get_longbusy_msecs(rc);
+		total_delay += delay;
+		if (total_delay > HPT_RESIZE_TIMEOUT) {
+			/* prepare call with shift==0 cancels an
+			 * in-progress resize */
+			rc = plpar_resize_hpt_prepare(0, 0);
+			if (rc != H_SUCCESS)
+				printk(KERN_WARNING
+				       "lpar: Unexpected error %d cancelling timed out HPT resize\n",
+				       rc);
+			return -ETIMEDOUT;
+		}
+		msleep(delay);
+		rc = plpar_resize_hpt_prepare(0, shift);
+	}
+
+	switch (rc) {
+	case H_SUCCESS:
+		/* Continue on */
+		break;
+
+	case H_PARAMETER:
+		return -EINVAL;
+	case H_RESOURCE:
+		return -EPERM;
+	default:
+		printk(KERN_WARNING
+		       "lpar: Unexpected error %d from H_RESIZE_HPT_PREPARE\n",
+		       rc);
+		return -EIO;
+	}
+
+	t1 = ktime_get();
+
+	rc = stop_machine(pseries_lpar_resize_hpt_commit, &state, NULL);
+
+	t2 = ktime_get();
+
+	if (rc != 0) {
+		switch (state.commit_rc) {
+		case H_PTEG_FULL:
+			printk(KERN_WARNING
+			       "lpar: Hash collision while resizing HPT\n");
+			return -ENOSPC;
+
+		default:
+			printk(KERN_WARNING
+			       "lpar: Unexpected error %d from H_RESIZE_HPT_COMMIT\n",
+			       state.commit_rc);
+			return -EIO;
+		}
+	}
+
+	printk(KERN_INFO
+	       "lpar: HPT resize to shift %lu complete (%lld ms / %lld ms)\n",
+	       shift, (long long) ktime_ms_delta(t1, t0),
+	       (long long) ktime_ms_delta(t2, t1));
+
+	return 0;
+}
-- 
2.5.0


* [RFCv2 3/4] pseries: debugfs hook to trigger a hash page table resize
From: David Gibson @ 2016-01-11  5:52 UTC
  To: paulus, benh, michael, bharata; +Cc: thuth, lvivier, linuxppc-dev, David Gibson

This patch adds a special file /sys/kernel/debug/powerpc/pft-size
which can be used to view the current size of the hash page table (as
a bit shift) and to trigger a resize of the hash table on PAPR guests.
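A minimal user-space sketch of driving this file, assuming debugfs is mounted at /sys/kernel/debug and the caller is root (the file is created with mode 0600). The helper names are hypothetical. Because the value is a shift, writing the current value plus one requests a table twice as large; stdio buffering means a failed resize may only be reported when the file is closed.

```c
#include <assert.h>
#include <stdio.h>

/* Path added by this patch; assumes debugfs at /sys/kernel/debug */
#define PFT_SIZE_PATH "/sys/kernel/debug/powerpc/pft-size"

/* Returns the current HPT size as a shift, or -1 on error. */
long read_pft_shift(const char *path)
{
	unsigned long long shift;
	long ret = -1;
	FILE *f;

	assert(path != NULL);
	f = fopen(path, "r");
	if (!f)
		return -1;
	if (fscanf(f, "%llu", &shift) == 1)
		ret = (long)shift;
	fclose(f);
	return ret;
}

/* Requests a resize to the given shift. Returns 0 on success, -1 if
 * open, write or close failed (errno should say why). */
int request_pft_shift(const char *path, unsigned long shift)
{
	FILE *f;
	int ok;

	assert(path != NULL);
	f = fopen(path, "w");
	if (!f)
		return -1;
	ok = fprintf(f, "%lu\n", shift) > 0;
	/* the buffered write is flushed here, so errors surface here too */
	if (fclose(f) != 0)
		ok = 0;
	return ok ? 0 : -1;
}
```

On a guest, after checking that read_pft_shift(PFT_SIZE_PATH) did not return -1, passing that value plus one to request_pft_shift() would ask for a doubling; if the hypervisor lacks "hcall-hpt-resize", the error from pseries_lpar_resize_hpt() (-ENODEV) should show up in errno.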

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/platforms/pseries/lpar.c | 26 ++++++++++++++++++++++++++
 1 file changed, 26 insertions(+)

diff --git a/arch/powerpc/platforms/pseries/lpar.c b/arch/powerpc/platforms/pseries/lpar.c
index f6e7af5..dba9644 100644
--- a/arch/powerpc/platforms/pseries/lpar.c
+++ b/arch/powerpc/platforms/pseries/lpar.c
@@ -29,6 +29,7 @@
 #include <linux/jump_label.h>
 #include <linux/delay.h>
 #include <linux/stop_machine.h>
+#include <linux/debugfs.h>
 #include <asm/processor.h>
 #include <asm/mmu.h>
 #include <asm/page.h>
@@ -903,3 +904,28 @@ int pseries_lpar_resize_hpt(unsigned long shift)
 
 	return 0;
 }
+
+static int ppc64_pft_size_get(void *data, u64 *val)
+{
+	*val = ppc64_pft_size;
+	return 0;
+}
+
+static int ppc64_pft_size_set(void *data, u64 val)
+{
+	return pseries_lpar_resize_hpt(val);
+}
+
+DEFINE_SIMPLE_ATTRIBUTE(fops_ppc64_pft_size,
+			ppc64_pft_size_get, ppc64_pft_size_set,	"%llu\n");
+
+static int __init pseries_lpar_debugfs(void)
+{
+	if (!debugfs_create_file("pft-size", 0600, powerpc_debugfs_root,
+				 NULL, &fops_ppc64_pft_size)) {
+		pr_err("lpar: unable to create ppc64_pft_size debugfs file\n");
+	}
+
+	return 0;
+}
+machine_device_initcall(pseries, pseries_lpar_debugfs);
-- 
2.5.0


* [RFCv2 4/4] pseries: Advertise HPT resizing support via CAS
From: David Gibson @ 2016-01-11  5:52 UTC
  To: paulus, benh, michael, bharata; +Cc: thuth, lvivier, linuxppc-dev, David Gibson

The hypervisor needs to know a guest is capable of using the HPT resizing
PAPR extension in order to take full advantage of it for memory hotplug.

If the hypervisor knows the guest is HPT resize aware, it can size the
initial HPT based on the initial guest RAM size, relying on the guest to
resize the HPT when more memory is hot-added.  Without this, the hypervisor
must size the HPT for the maximum possible guest RAM, which can lead to
a huge waste of space if the guest never actually expands to that maximum
size.
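To put rough numbers on that waste: the cover letter pairs a 1G HPT with a 128G guest, i.e. a ratio of about 1/128. Assuming that ratio and rounding RAM up to a power of two (both assumptions for this sketch, not PAPR rules), the HPT shift for a given RAM size is:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed sizing rule: HPT = RAM / 128, with RAM rounded up to a
 * power of two. Returns log2 of the HPT size in bytes. */
unsigned int hpt_shift_for_ram(uint64_t ram_bytes)
{
	unsigned int shift = 0;

	assert(ram_bytes >= 128 && ram_bytes <= (1ULL << 63));
	/* smallest shift with 2^shift >= ram_bytes */
	while ((1ULL << shift) < ram_bytes)
		shift++;
	return shift - 7;   /* divide by 128 = 2^7 */
}
```

Under those assumptions, a guest booted with 4 GiB but allowed to grow to 1 TiB would need a 2^33-byte (8 GiB) HPT if sized for the maximum, against 2^25 bytes (32 MiB) for its initial RAM, a 256x difference.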

This patch advertises the guest's support for HPT resizing via the
ibm,client-architecture-support OF interface.  Obviously, the actual
encoding in the CAS vector is tentative until the extension is officially
incorporated into PAPR.  For now we use bit 0 of (previously unused) byte 8
of option vector 5.
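For reference, the OV5_HPT_RESIZE value in this patch packs the option vector 5 byte index into its high byte and the bit mask into its low byte. The sketch decodes it, assuming the kernel's OV5_INDX()/OV5_FEAT() helpers are the simple shift and mask shown (my reading of asm/prom.h), and converts the mask to the MSB-0 bit numbering used by PAPR, where 0x80 is bit 0.

```c
#include <assert.h>

/* Assumed to match asm/prom.h: high byte = byte index into option
 * vector 5, low byte = bit mask within that byte. */
#define OV5_INDX(x)	((x) >> 8)
#define OV5_FEAT(x)	((x) & 0xff)

#define OV5_PRRN	0x0540	/* existing entry, for comparison */
#define OV5_HPT_RESIZE	0x0880	/* value proposed by this patch */

/* Convert a single-bit mask to an MSB-0 bit number (0x80 -> bit 0). */
int ov5_msb0_bit(unsigned int feat)
{
	int bit = 0;

	assert(feat != 0 && (feat & (feat - 1)) == 0 && feat <= 0x80);
	while ((0x80u >> bit) != feat)
		bit++;
	return bit;
}
```

Decoded this way, OV5_HPT_RESIZE is byte 8, bit 0, matching the description above, while the existing OV5_PRRN is byte 5, bit 1.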

Signed-off-by: David Gibson <david@gibson.dropbear.id.au>
---
 arch/powerpc/include/asm/prom.h | 1 +
 arch/powerpc/kernel/prom_init.c | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/prom.h b/arch/powerpc/include/asm/prom.h
index 7f436ba..7a57b77 100644
--- a/arch/powerpc/include/asm/prom.h
+++ b/arch/powerpc/include/asm/prom.h
@@ -151,6 +151,7 @@ struct of_drconf_cell {
 #define OV5_XCMO		0x0440	/* Page Coalescing */
 #define OV5_TYPE1_AFFINITY	0x0580	/* Type 1 NUMA affinity */
 #define OV5_PRRN		0x0540	/* Platform Resource Reassignment */
+#define OV5_HPT_RESIZE		0x0880	/* Hash Page Table resizing */
 #define OV5_PFO_HW_RNG		0x0E80	/* PFO Random Number Generator */
 #define OV5_PFO_HW_842		0x0E40	/* PFO Compression Accelerator */
 #define OV5_PFO_HW_ENCR		0x0E20	/* PFO Encryption Accelerator */
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 92dea8d..d82b883 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -712,7 +712,7 @@ unsigned char ibm_architecture_vec[] = {
 	OV5_FEAT(OV5_TYPE1_AFFINITY) | OV5_FEAT(OV5_PRRN),
 	0,
 	0,
-	0,
+	OV5_FEAT(OV5_HPT_RESIZE),
 	/* WARNING: The offset of the "number of cores" field below
 	 * must match by the macro below. Update the definition if
 	 * the structure layout changes.
-- 
2.5.0

