linuxppc-dev.lists.ozlabs.org archive mirror
* [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support
@ 2008-06-12 21:53 Robert Jennings
  2008-06-12 22:08 ` [PATCH 01/19] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
                   ` (18 more replies)
  0 siblings, 19 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 21:53 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

Cooperative Memory Overcommitment (CMO) is a pSeries platform feature
that enables the allocation of more memory to a set of logical partitions
than is physically present.  For example, a system with 16GB of memory
can be configured to simultaneously run three logical partitions, each
with 8GB of memory allocated to it.

The system firmware can page out memory as needed to meet the demands
of each partition.  To minimize the effects of firmware paging, the
Collaborative Memory Manager (CMM) driver acts as a balloon driver,
working with firmware to return memory ahead of any paging needs.

The OS is provided with an entitlement of IO memory for device drivers
to map.  This amount varies with the number of virtual IO adapters
present and can change as devices are hot-plugged.  The VIO bus code
distributes this memory to devices.  Logical partitions supporting CMO
may only have virtual IO devices; physical devices are not supported.
Above the entitled level, IO mappings can fail, and the IOMMU code has
been updated to handle this.

Virtual IO adapters have been updated to handle DMA mapping failures and
to size their entitlement needs.
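
For illustration only (a hedged sketch; the driver context, helper name,
and retry policy below are hypothetical and not taken from these
patches), a CMO-aware virtual IO driver treats a failed DMA mapping as a
normal, recoverable event rather than a fatal error:

	/* Sketch: back off when the partition is over its IO entitlement.
	 * Uses the 2.6.26-era single-argument dma_mapping_error();
	 * needs <linux/dma-mapping.h>. */
	static int example_map_tx_buffer(struct device *dev, void *buf,
					 size_t len, dma_addr_t *handle)
	{
		*handle = dma_map_single(dev, buf, len, DMA_TO_DEVICE);
		if (dma_mapping_error(*handle))
			return -EAGAIN;	/* over entitlement: retry later */
		return 0;
	}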

Platform support for CMM and hot-plug entitlement events is also
included in the following patches.

The changes should have minimal impact on non-CMO enabled environments.
This patch set has been written against 2.6.26-rc5 and has been tested
at that level.

Regards,
Robert Jennings


* [PATCH 01/19] powerpc: Remove extraneous error reporting for hcall failures in lparcfg
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
@ 2008-06-12 22:08 ` Robert Jennings
  2008-06-12 22:08 ` [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
                   ` (17 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:08 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Nathan Fontenot <nfont@austin.ibm.com>

Remove the extraneous error reporting used when an hcall made from
lparcfg fails.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/powerpc/kernel/lparcfg.c |   32 --------------------------------
 1 file changed, 32 deletions(-)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -129,33 +129,6 @@ static int iseries_lparcfg_data(struct s
 /*
  * Methods used to fetch LPAR data when running on a pSeries platform.
  */
-static void log_plpar_hcall_return(unsigned long rc, char *tag)
-{
-	switch(rc) {
-	case 0:
-		return;
-	case H_HARDWARE:
-		printk(KERN_INFO "plpar-hcall (%s) "
-				"Hardware fault\n", tag);
-		return;
-	case H_FUNCTION:
-		printk(KERN_INFO "plpar-hcall (%s) "
-				"Function not allowed\n", tag);
-		return;
-	case H_AUTHORITY:
-		printk(KERN_INFO "plpar-hcall (%s) "
-				"Not authorized to this function\n", tag);
-		return;
-	case H_PARAMETER:
-		printk(KERN_INFO "plpar-hcall (%s) "
-				"Bad parameter(s)\n",tag);
-		return;
-	default:
-		printk(KERN_INFO "plpar-hcall (%s) "
-				"Unexpected rc(0x%lx)\n", tag, rc);
-	}
-}
-
 /*
  * H_GET_PPP hcall returns info in 4 parms.
  *  entitled_capacity,unallocated_capacity,
@@ -191,8 +164,6 @@ static unsigned int h_get_ppp(unsigned l
 	*aggregation = retbuf[2];
 	*resource = retbuf[3];
 
-	log_plpar_hcall_return(rc, "H_GET_PPP");
-
 	return rc;
 }
 
@@ -205,9 +176,6 @@ static void h_pic(unsigned long *pool_id
 
 	*pool_idle_time = retbuf[0];
 	*num_procs = retbuf[1];
-
-	if (rc != H_AUTHORITY)
-		log_plpar_hcall_return(rc, "H_PIC");
 }
 
 #define SPLPAR_CHARACTERISTICS_TOKEN 20


* [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
  2008-06-12 22:08 ` [PATCH 01/19] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
@ 2008-06-12 22:08 ` Robert Jennings
  2008-06-13  0:23   ` Stephen Rothwell
  2008-06-16 16:07   ` Nathan Fontenot
  2008-06-12 22:09 ` [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
                   ` (16 subsequent siblings)
  18 siblings, 2 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:08 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Nathan Fontenot <nfont@austin.ibm.com>

Split the retrieval and setting of processor entitlement and weight into
helper routines.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/powerpc/kernel/lparcfg.c |  163 ++++++++++++++++++++++--------------------
 1 file changed, 86 insertions(+), 77 deletions(-)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -167,7 +167,8 @@ static unsigned int h_get_ppp(unsigned l
 	return rc;
 }
 
-static void h_pic(unsigned long *pool_idle_time, unsigned long *num_procs)
+static unsigned int h_pic(unsigned long *pool_idle_time,
+			  unsigned long *num_procs)
 {
 	unsigned long rc;
 	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
@@ -176,6 +177,53 @@ static void h_pic(unsigned long *pool_id
 
 	*pool_idle_time = retbuf[0];
 	*num_procs = retbuf[1];
+
+	return rc;
+}
+
+/*
+ * parse_ppp_data
+ * Parse out the data returned from h_get_ppp and h_pic
+ */
+static void parse_ppp_data(struct seq_file *m)
+{
+	unsigned long h_entitled, h_unallocated;
+	unsigned long h_aggregation, h_resource;
+	int rc;
+
+	rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
+		       &h_resource);
+	if (rc)
+		return;
+
+	seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
+	seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
+	seq_printf(m, "system_active_processors=%ld\n",
+		   (h_resource >> 0 * 8) & 0xffff);
+
+	/* pool related entries are appropriate for shared configs */
+	if (lppaca[0].shared_proc) {
+		unsigned long pool_idle_time, pool_procs;
+
+		seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+
+		/* report pool_capacity in percentage */
+		seq_printf(m, "pool_capacity=%ld\n",
+			   ((h_resource >> 2 * 8) & 0xffff) * 100);
+
+		rc = h_pic(&pool_idle_time, &pool_procs);
+		if (!rc) {
+			seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
+			seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
+		}
+	}
+
+	seq_printf(m, "unallocated_capacity_weight=%ld\n",
+		   (h_resource >> 4 * 8) & 0xFF);
+
+	seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
+	seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
+	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
 }
 
 #define SPLPAR_CHARACTERISTICS_TOKEN 20
@@ -302,59 +350,11 @@ static int pseries_lparcfg_data(struct s
 	partition_active_processors = lparcfg_count_active_processors();
 
 	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		unsigned long h_entitled, h_unallocated;
-		unsigned long h_aggregation, h_resource;
-		unsigned long pool_idle_time, pool_procs;
-		unsigned long purr;
-
-		h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
-			  &h_resource);
-
-		seq_printf(m, "R4=0x%lx\n", h_entitled);
-		seq_printf(m, "R5=0x%lx\n", h_unallocated);
-		seq_printf(m, "R6=0x%lx\n", h_aggregation);
-		seq_printf(m, "R7=0x%lx\n", h_resource);
-
-		purr = get_purr();
-
 		/* this call handles the ibm,get-system-parameter contents */
 		parse_system_parameter_string(m);
+		parse_ppp_data(m);
 
-		seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
-
-		seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
-
-		seq_printf(m, "system_active_processors=%ld\n",
-			   (h_resource >> 0 * 8) & 0xffff);
-
-		/* pool related entries are appropriate for shared configs */
-		if (lppaca[0].shared_proc) {
-
-			h_pic(&pool_idle_time, &pool_procs);
-
-			seq_printf(m, "pool=%ld\n",
-				   (h_aggregation >> 0 * 8) & 0xffff);
-
-			/* report pool_capacity in percentage */
-			seq_printf(m, "pool_capacity=%ld\n",
-				   ((h_resource >> 2 * 8) & 0xffff) * 100);
-
-			seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
-
-			seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
-		}
-
-		seq_printf(m, "unallocated_capacity_weight=%ld\n",
-			   (h_resource >> 4 * 8) & 0xFF);
-
-		seq_printf(m, "capacity_weight=%ld\n",
-			   (h_resource >> 5 * 8) & 0xFF);
-
-		seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
-
-		seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
-
-		seq_printf(m, "purr=%ld\n", purr);
+		seq_printf(m, "purr=%ld\n", get_purr());
 
 	} else {		/* non SPLPAR case */
 
@@ -382,6 +382,37 @@ static int pseries_lparcfg_data(struct s
 	return 0;
 }
 
+static ssize_t update_ppp(u64 *new_entitled, u8 *new_weight)
+{
+	unsigned long current_entitled;
+	unsigned long dummy;
+	unsigned long resource;
+	u8 current_weight;
+	ssize_t retval;
+
+	/* Get our current parameters */
+	retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+	if (retval)
+		return retval;
+
+	current_weight = (resource >> 5 * 8) & 0xFF;
+
+	if (new_entitled)
+		*new_weight = current_weight;
+
+	if (new_weight)
+		*new_entitled = current_entitled;
+
+	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+		 __FUNCTION__, current_entitled, current_weight);
+
+	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+		 __FUNCTION__, *new_entitled, *new_weight);
+
+	retval = plpar_hcall_norets(H_SET_PPP, *new_entitled, *new_weight);
+	return retval;
+}
+
 /*
  * Interface for changing system parameters (variable capacity weight
  * and entitled capacity).  Format of input is "param_name=value";
@@ -399,12 +430,6 @@ static ssize_t lparcfg_write(struct file
 	char *tmp;
 	u64 new_entitled, *new_entitled_ptr = &new_entitled;
 	u8 new_weight, *new_weight_ptr = &new_weight;
-
-	unsigned long current_entitled;	/* parameters for h_get_ppp */
-	unsigned long dummy;
-	unsigned long resource;
-	u8 current_weight;
-
 	ssize_t retval = -ENOMEM;
 
 	if (!firmware_has_feature(FW_FEATURE_SPLPAR) ||
@@ -432,33 +457,17 @@ static ssize_t lparcfg_write(struct file
 		*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
 		if (endp == tmp)
 			goto out;
-		new_weight_ptr = &current_weight;
+
+		retval = update_ppp(new_entitled_ptr, NULL);
 	} else if (!strcmp(kbuf, "capacity_weight")) {
 		char *endp;
 		*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
 		if (endp == tmp)
 			goto out;
-		new_entitled_ptr = &current_entitled;
-	} else
-		goto out;
 
-	/* Get our current parameters */
-	retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
-	if (retval) {
-		retval = -EIO;
+		retval = update_ppp(NULL, new_weight_ptr);
+	} else
 		goto out;
-	}
-
-	current_weight = (resource >> 5 * 8) & 0xFF;
-
-	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
-		 __func__, current_entitled, current_weight);
-
-	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
-		 __func__, *new_entitled_ptr, *new_weight_ptr);
-
-	retval = plpar_hcall_norets(H_SET_PPP, *new_entitled_ptr,
-				    *new_weight_ptr);
 
 	if (retval == H_SUCCESS || retval == H_CONSTRAINED) {
 		retval = count;


* [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
  2008-06-12 22:08 ` [PATCH 01/19] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
  2008-06-12 22:08 ` [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
@ 2008-06-12 22:09 ` Robert Jennings
  2008-06-16 16:09   ` [PATCH 03/19][v2] " Nathan Fontenot
                     ` (2 more replies)
  2008-06-12 22:11 ` [PATCH 04/19] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
                   ` (15 subsequent siblings)
  18 siblings, 3 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:09 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Nathan Fontenot <nfont@austin.ibm.com>

Update /proc/ppc64/lparcfg to display Cooperative Memory Overcommitment
statistics as reported by the H_GET_MPP hcall.  This also updates the
lparcfg interface to allow setting memory entitlement and weight.
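
As a usage illustration (a hypothetical userspace sketch, not part of
the patch; it assumes the value is given in bytes, matching the other
entitled-memory fields below):

	#include <stdio.h>

	/* Sketch: request a new memory entitlement through lparcfg. */
	int main(void)
	{
		FILE *f = fopen("/proc/ppc64/lparcfg", "w");

		if (!f) {
			perror("lparcfg");
			return 1;
		}
		/* Firmware may answer H_CONSTRAINED and clamp the request. */
		fprintf(f, "entitled_memory=%llu\n", 2ULL << 30);
		fclose(f);
		return 0;
	}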

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/powerpc/kernel/lparcfg.c |  139 +++++++++++++++++++++++++++++++++++++++++---
 include/asm-powerpc/hvcall.h  |   18 +++++
 2 files changed, 147 insertions(+), 10 deletions(-)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -129,6 +129,35 @@ static int iseries_lparcfg_data(struct s
 /*
  * Methods used to fetch LPAR data when running on a pSeries platform.
  */
+/**
+ * h_get_mpp
+ * H_GET_MPP hcall returns info in 7 parms
+ */
+int h_get_mpp(struct hvcall_mpp_data *mpp_data)
+{
+	int rc;
+	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+	rc = plpar_hcall(H_GET_MPP, retbuf);
+
+	mpp_data->entitled_mem = retbuf[0];
+	mpp_data->mapped_mem = retbuf[1];
+
+	mpp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+	mpp_data->pool_num = retbuf[2] & 0xffff;
+
+	mpp_data->mem_weight = (retbuf[3] >> 7 * 8) & 0xff;
+	mpp_data->unallocated_mem_weight = (retbuf[3] >> 6 * 8) & 0xff;
+	mpp_data->unallocated_entitlement = retbuf[3] & 0xffffffffffff;
+
+	mpp_data->pool_size = retbuf[4];
+	mpp_data->loan_request = retbuf[5];
+	mpp_data->backing_mem = retbuf[6];
+
+	return rc;
+}
+EXPORT_SYMBOL(h_get_mpp);
+
 /*
  * H_GET_PPP hcall returns info in 4 parms.
  *  entitled_capacity,unallocated_capacity,
@@ -226,6 +255,44 @@ static void parse_ppp_data(struct seq_fi
 	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
 }
 
+/**
+ * parse_mpp_data
+ * Parse out data returned from h_get_mpp
+ */
+static void parse_mpp_data(struct seq_file *m)
+{
+	struct hvcall_mpp_data mpp_data;
+	int rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return;
+
+	seq_printf(m, "entitled_memory=%ld\n", mpp_data.entitled_mem);
+
+	if (mpp_data.mapped_mem != -1)
+		seq_printf(m, "mapped_entitled_memory=%ld\n",
+			   mpp_data.mapped_mem);
+
+	seq_printf(m, "entitled_memory_group_number=%d\n", mpp_data.group_num);
+	seq_printf(m, "entitled_memory_pool_number=%d\n", mpp_data.pool_num);
+
+	seq_printf(m, "entitled_memory_weight=%d\n", mpp_data.mem_weight);
+	seq_printf(m, "unallocated_entitled_memory_weight=%d\n",
+		   mpp_data.unallocated_mem_weight);
+	seq_printf(m, "unallocated_io_mapping_entitlement=%ld\n",
+		   mpp_data.unallocated_entitlement);
+
+	if (mpp_data.pool_size != -1)
+		seq_printf(m, "entitled_memory_pool_size=%ld bytes\n",
+			   mpp_data.pool_size);
+
+	seq_printf(m, "entitled_memory_loan_request=%ld\n",
+		   mpp_data.loan_request);
+
+	seq_printf(m, "backing_memory=%ld bytes\n", mpp_data.backing_mem);
+}
+
 #define SPLPAR_CHARACTERISTICS_TOKEN 20
 #define SPLPAR_MAXLENGTH 1026*(sizeof(char))
 
@@ -353,6 +420,7 @@ static int pseries_lparcfg_data(struct s
 		/* this call handles the ibm,get-system-parameter contents */
 		parse_system_parameter_string(m);
 		parse_ppp_data(m);
+		parse_mpp_data(m);
 
 		seq_printf(m, "purr=%ld\n", get_purr());
 
@@ -382,12 +450,13 @@ static int pseries_lparcfg_data(struct s
 	return 0;
 }
 
-static ssize_t update_ppp(u64 *new_entitled, u8 *new_weight)
+static ssize_t update_ppp(u64 *entitlement, u8 *weight)
 {
 	unsigned long current_entitled;
 	unsigned long dummy;
 	unsigned long resource;
-	u8 current_weight;
+	u8 current_weight, new_weight;
+	u64 new_entitled;
 	ssize_t retval;
 
 	/* Get our current parameters */
@@ -397,22 +466,60 @@ static ssize_t update_ppp(u64 *new_entit
 
 	current_weight = (resource >> 5 * 8) & 0xFF;
 
-	if (new_entitled)
-		*new_weight = current_weight;
-
-	if (new_weight)
-		*new_entitled = current_entitled;
+	if (entitlement) {
+		new_weight = current_weight;
+		new_entitled = *entitlement;
+	} else {
+		new_weight = *weight;
+		new_entitled = current_entitled;
+	}
 
 	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
 		 __FUNCTION__, current_entitled, current_weight);
 
 	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
-		 __FUNCTION__, *new_entitled, *new_weight);
+		 __FUNCTION__, new_entitled, new_weight);
 
-	retval = plpar_hcall_norets(H_SET_PPP, *new_entitled, *new_weight);
+	retval = plpar_hcall_norets(H_SET_PPP, new_entitled, new_weight);
 	return retval;
 }
 
+/**
+ * update_mpp
+ *
+ * Update the memory entitlement and weight for the partition.  Caller must
+ * specify either a new entitlement or weight, not both, to be updated
+ * since the h_set_mpp call takes both entitlement and weight as parameters.
+ */
+static ssize_t update_mpp(u64 *entitlement, u8 *weight)
+{
+	struct hvcall_mpp_data mpp_data;
+	u64 new_entitled;
+	u8 new_weight;
+	ssize_t rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return rc;
+
+	if (entitlement) {
+		new_weight = mpp_data.mem_weight;
+		new_entitled = *entitlement;
+	} else {
+		new_weight = *weight;
+		new_entitled = mpp_data.entitled_mem;
+	}
+
+	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+		 __FUNCTION__, mpp_data.entitled_mem, mpp_data.mem_weight);
+
+	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+		 __FUNCTION__, new_entitled, new_weight);
+
+	rc = plpar_hcall_norets(H_SET_MPP, new_entitled, new_weight);
+	return rc;
+}
+
 /*
  * Interface for changing system parameters (variable capacity weight
  * and entitled capacity).  Format of input is "param_name=value";
@@ -466,6 +573,20 @@ static ssize_t lparcfg_write(struct file
 			goto out;
 
 		retval = update_ppp(NULL, new_weight_ptr);
+	} else if (!strcmp(kbuf, "entitled_memory")) {
+		char *endp;
+		*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(new_entitled_ptr, NULL);
+	} else if (!strcmp(kbuf, "entitled_memory_weight")) {
+		char *endp;
+		*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(NULL, new_weight_ptr);
 	} else
 		goto out;
 
Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -210,7 +210,9 @@
 #define H_JOIN			0x298
 #define H_VASI_STATE            0x2A4
 #define H_ENABLE_CRQ		0x2B0
-#define MAX_HCALL_OPCODE	H_ENABLE_CRQ
+#define H_SET_MPP		0x2D0
+#define H_GET_MPP		0x2D4
+#define MAX_HCALL_OPCODE	H_GET_MPP
 
 #ifndef __ASSEMBLY__
 
@@ -270,6 +272,20 @@ struct hcall_stats {
 };
 #define HCALL_STAT_ARRAY_SIZE	((MAX_HCALL_OPCODE >> 2) + 1)
 
+struct hvcall_mpp_data {
+	unsigned long entitled_mem;
+	unsigned long mapped_mem;
+	unsigned short group_num;
+	unsigned short pool_num;
+	unsigned char mem_weight;
+	unsigned char unallocated_mem_weight;
+	unsigned long unallocated_entitlement;	/* value in bytes */
+	unsigned long pool_size;
+	long loan_request;
+	unsigned long backing_mem;
+};
+
+int h_get_mpp(struct hvcall_mpp_data *);
 #endif /* __ASSEMBLY__ */
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_HVCALL_H */


* [PATCH 04/19] powerpc: Split retrieval of processor entitlement data into a helper routine
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (2 preceding siblings ...)
  2008-06-12 22:09 ` [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
@ 2008-06-12 22:11 ` Robert Jennings
  2008-06-12 22:11 ` [PATCH 05/19] powerpc: Enable CMO feature during platform setup Robert Jennings
                   ` (14 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:11 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Nathan Fontenot <nfont@austin.ibm.com>

Split the retrieval of processor entitlement data returned in the H_GET_PPP
hcall into its own helper routine.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/powerpc/kernel/lparcfg.c |   80 ++++++++++++++++++++++++------------------
 1 file changed, 45 insertions(+), 35 deletions(-)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -158,6 +158,18 @@ int h_get_mpp(struct hvcall_mpp_data *mp
 }
 EXPORT_SYMBOL(h_get_mpp);
 
+struct hvcall_ppp_data {
+	u64	entitlement;
+	u64	unallocated_entitlement;
+	u16	group_num;
+	u16	pool_num;
+	u8	capped;
+	u8	weight;
+	u8	unallocated_weight;
+	u16	active_procs_in_pool;
+	u16	active_system_procs;
+};
+
 /*
  * H_GET_PPP hcall returns info in 4 parms.
  *  entitled_capacity,unallocated_capacity,
@@ -178,20 +190,24 @@ EXPORT_SYMBOL(h_get_mpp);
  *              XXXX - Active processors in Physical Processor Pool.
  *                  XXXX  - Processors active on platform.
  */
-static unsigned int h_get_ppp(unsigned long *entitled,
-			      unsigned long *unallocated,
-			      unsigned long *aggregation,
-			      unsigned long *resource)
+static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data)
 {
 	unsigned long rc;
 	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
 
 	rc = plpar_hcall(H_GET_PPP, retbuf);
 
-	*entitled = retbuf[0];
-	*unallocated = retbuf[1];
-	*aggregation = retbuf[2];
-	*resource = retbuf[3];
+	ppp_data->entitlement = retbuf[0];
+	ppp_data->unallocated_entitlement = retbuf[1];
+
+	ppp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+	ppp_data->pool_num = retbuf[2] & 0xffff;
+
+	ppp_data->capped = (retbuf[3] >> 6 * 8) & 0x01;
+	ppp_data->weight = (retbuf[3] >> 5 * 8) & 0xff;
+	ppp_data->unallocated_weight = (retbuf[3] >> 4 * 8) & 0xff;
+	ppp_data->active_procs_in_pool = (retbuf[3] >> 2 * 8) & 0xffff;
+	ppp_data->active_system_procs = retbuf[3] & 0xffff;
 
 	return rc;
 }
@@ -216,29 +232,27 @@ static unsigned int h_pic(unsigned long
  */
 static void parse_ppp_data(struct seq_file *m)
 {
-	unsigned long h_entitled, h_unallocated;
-	unsigned long h_aggregation, h_resource;
+	struct hvcall_ppp_data ppp_data;
 	int rc;
 
-	rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
-		       &h_resource);
+	rc = h_get_ppp(&ppp_data);
 	if (rc)
 		return;
 
-	seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
-	seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
-	seq_printf(m, "system_active_processors=%ld\n",
-		   (h_resource >> 0 * 8) & 0xffff);
+	seq_printf(m, "partition_entitled_capacity=%ld\n",
+	           ppp_data.entitlement);
+	seq_printf(m, "group=%d\n", ppp_data.group_num);
+	seq_printf(m, "system_active_processors=%d\n",
+	           ppp_data.active_system_procs);
 
 	/* pool related entries are appropriate for shared configs */
 	if (lppaca[0].shared_proc) {
 		unsigned long pool_idle_time, pool_procs;
 
-		seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+		seq_printf(m, "pool=%d\n", ppp_data.pool_num);
 
 		/* report pool_capacity in percentage */
-		seq_printf(m, "pool_capacity=%ld\n",
-			   ((h_resource >> 2 * 8) & 0xffff) * 100);
+		seq_printf(m, "pool_capacity=%d\n",
+			   ppp_data.active_procs_in_pool * 100);
 
 		rc = h_pic(&pool_idle_time, &pool_procs);
 		if (!rc) {
@@ -247,12 +261,12 @@ static void parse_ppp_data(struct seq_fi
 		}
 	}
 
-	seq_printf(m, "unallocated_capacity_weight=%ld\n",
-		   (h_resource >> 4 * 8) & 0xFF);
-
-	seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
-	seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
-	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
+	seq_printf(m, "unallocated_capacity_weight=%d\n",
+		   ppp_data.unallocated_weight);
+	seq_printf(m, "capacity_weight=%d\n", ppp_data.weight);
+	seq_printf(m, "capped=%d\n", ppp_data.capped);
+	seq_printf(m, "unallocated_capacity=%ld\n",
+		   ppp_data.unallocated_entitlement);
 }
 
 /**
@@ -452,30 +466,26 @@ static int pseries_lparcfg_data(struct s
 
 static ssize_t update_ppp(u64 *entitlement, u8 *weight)
 {
-	unsigned long current_entitled;
-	unsigned long dummy;
-	unsigned long resource;
-	u8 current_weight, new_weight;
+	struct hvcall_ppp_data ppp_data;
+	u8 new_weight;
 	u64 new_entitled;
 	ssize_t retval;
 
 	/* Get our current parameters */
-	retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+	retval = h_get_ppp(&ppp_data);
 	if (retval)
 		return retval;
 
-	current_weight = (resource >> 5 * 8) & 0xFF;
-
 	if (entitlement) {
-		new_weight = current_weight;
+		new_weight = ppp_data.weight;
 		new_entitled = *entitlement;
 	} else {
 		new_weight = *weight;
-		new_entitled = current_entitled;
+		new_entitled = ppp_data.entitlement;
 	}
 
 	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
-		 __FUNCTION__, current_entitled, current_weight);
+	         __FUNCTION__, ppp_data.entitlement, ppp_data.weight);
 
 	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
 		 __FUNCTION__, new_entitled, new_weight);


* [PATCH 05/19] powerpc: Enable CMO feature during platform setup
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (3 preceding siblings ...)
  2008-06-12 22:11 ` [PATCH 04/19] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
@ 2008-06-12 22:11 ` Robert Jennings
  2008-06-12 22:12 ` [PATCH 06/19] powerpc: Utilities to set firmware page state Robert Jennings
                   ` (13 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:11 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

For Cooperative Memory Overcommitment (CMO), set the FW_FEATURE_CMO
flag in powerpc_firmware_features from the RTAS ibm,get-system-parameter
call prior to calling iommu_init_early_pSeries.

With this, any CMO specific functionality can be controlled by checking:
 firmware_has_feature(FW_FEATURE_CMO)
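
For example (sketch only; the function below is hypothetical), code
added later in this series gates its setup on exactly this test:

	#include <asm/firmware.h>

	static void example_setup(void)
	{
		/* Non-CMO partitions skip all CMO-specific work. */
		if (!firmware_has_feature(FW_FEATURE_CMO))
			return;

		pr_info("CMO enabled; using entitlement-aware setup\n");
	}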

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 arch/powerpc/platforms/pseries/setup.c |   71 +++++++++++++++++++++++++++++++++
 include/asm-powerpc/firmware.h         |    3 +
 2 files changed, 73 insertions(+), 1 deletion(-)

Index: b/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -314,6 +314,76 @@ static int pseries_set_xdabr(unsigned lo
 			H_DABRX_KERNEL | H_DABRX_USER);
 }
 
+#define CMO_CHARACTERISTICS_TOKEN 44
+#define CMO_MAXLENGTH 1026
+
+/**
+ * fw_cmo_feature_init - FW_FEATURE_CMO is not stored in ibm,hypertas-functions,
+ * handle that here. (Stolen from parse_system_parameter_string)
+ */
+void pSeries_cmo_feature_init(void)
+{
+	char *ptr, *key, *value, *end;
+	int call_status;
+	int PrPSP = -1;
+	int SecPSP = -1;
+
+	pr_debug(" -> fw_cmo_feature_init()\n");
+	spin_lock(&rtas_data_buf_lock);
+	memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
+	call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+				NULL,
+				CMO_CHARACTERISTICS_TOKEN,
+				__pa(rtas_data_buf),
+				RTAS_DATA_BUF_SIZE);
+
+	if (call_status != 0) {
+		spin_unlock(&rtas_data_buf_lock);
+		pr_debug("CMO not available\n");
+		pr_debug(" <- fw_cmo_feature_init()\n");
+		return;
+	}
+
+	end = rtas_data_buf + CMO_MAXLENGTH - 2;
+	ptr = rtas_data_buf + 2;	/* step over strlen value */
+	key = value = ptr;
+
+	while (*ptr && (ptr <= end)) {
+		/* Separate the key and value by replacing '=' with '\0' and
+		 * point the value at the string after the '='
+		 */
+		if (ptr[0] == '=') {
+			ptr[0] = '\0';
+			value = ptr + 1;
+		} else if (ptr[0] == '\0' || ptr[0] == ',') {
+			/* Terminate the string containing the key/value pair */
+			ptr[0] = '\0';
+
+			if (key == value) {
+				pr_debug("Malformed key/value pair\n");
+				/* Never found a '=', end processing */
+				break;
+			}
+
+			if (0 == strcmp(key, "PrPSP"))
+				PrPSP = simple_strtoul(value, NULL, 10);
+			else if (0 == strcmp(key, "SecPSP"))
+				SecPSP = simple_strtoul(value, NULL, 10);
+			value = key = ptr + 1;
+		}
+		ptr++;
+	}
+
+	if (PrPSP != -1 || SecPSP != -1) {
+		pr_info("CMO enabled\n");
+		pr_debug("CMO enabled, PrPSP=%d, SecPSP=%d\n", PrPSP, SecPSP);
+		powerpc_firmware_features |= FW_FEATURE_CMO;
+	} else
+		pr_debug("CMO not enabled, PrPSP=%d, SecPSP=%d\n", PrPSP, SecPSP);
+	spin_unlock(&rtas_data_buf_lock);
+	pr_debug(" <- fw_cmo_feature_init()\n");
+}
+
 /*
  * Early initialization.  Relocation is on but do not reference unbolted pages
 */
@@ -329,6 +399,7 @@ static void __init pSeries_init_early(vo
 	else if (firmware_has_feature(FW_FEATURE_XDABR))
 		ppc_md.set_dabr = pseries_set_xdabr;
 
+	pSeries_cmo_feature_init();
 	iommu_init_early_pSeries();
 
 	pr_debug(" <- pSeries_init_early()\n");
Index: b/include/asm-powerpc/firmware.h
===================================================================
--- a/include/asm-powerpc/firmware.h
+++ b/include/asm-powerpc/firmware.h
@@ -45,6 +45,7 @@
 #define FW_FEATURE_PS3_LV1	ASM_CONST(0x0000000000800000)
 #define FW_FEATURE_BEAT		ASM_CONST(0x0000000001000000)
 #define FW_FEATURE_BULK_REMOVE	ASM_CONST(0x0000000002000000)
+#define FW_FEATURE_CMO		ASM_CONST(0x0000000004000000)
 
 #ifndef __ASSEMBLY__
 
@@ -57,7 +58,7 @@ enum {
 		FW_FEATURE_MIGRATE | FW_FEATURE_PERFMON | FW_FEATURE_CRQ |
 		FW_FEATURE_VIO | FW_FEATURE_RDMA | FW_FEATURE_LLAN |
 		FW_FEATURE_BULK | FW_FEATURE_XDABR | FW_FEATURE_MULTITCE |
-		FW_FEATURE_SPLPAR | FW_FEATURE_LPAR,
+		FW_FEATURE_SPLPAR | FW_FEATURE_LPAR | FW_FEATURE_CMO,
 	FW_FEATURE_PSERIES_ALWAYS = 0,
 	FW_FEATURE_ISERIES_POSSIBLE = FW_FEATURE_ISERIES | FW_FEATURE_LPAR,
 	FW_FEATURE_ISERIES_ALWAYS = FW_FEATURE_ISERIES | FW_FEATURE_LPAR,


* [PATCH 06/19] powerpc: Utilities to set firmware page state
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (4 preceding siblings ...)
  2008-06-12 22:11 ` [PATCH 05/19] powerpc: Enable CMO feature during platform setup Robert Jennings
@ 2008-06-12 22:12 ` Robert Jennings
  2008-06-12 22:13 ` [PATCH 07/19] powerpc: Add collaborative memory manager Robert Jennings
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:12 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Brian King <brking@linux.vnet.ibm.com>

Newer versions of firmware support page states, which are used by the
collaborative memory manager (future patch) to "loan" pages to the
hypervisor for use by other partitions.
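
A hedged sketch of the loan/return cycle these wrappers make possible
(the CMM balloon driver in a later patch is the real consumer; locking
and error handling are omitted here):

	/* Loan one page to firmware, then later reclaim it. */
	unsigned long addr = __get_free_page(GFP_NOIO);

	if (addr) {
		plpar_page_set_loaned(__pa(addr));	/* firmware may reuse it */
		/* ... the page must not be touched while loaned ... */
		plpar_page_set_active(__pa(addr));	/* take it back */
		free_page(addr);
	}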

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

---

 arch/powerpc/platforms/pseries/plpar_wrappers.h |   10 ++++++++++
 include/asm-powerpc/hvcall.h                    |    5 +++++
 2 files changed, 15 insertions(+)

Index: b/arch/powerpc/platforms/pseries/plpar_wrappers.h
===================================================================
--- a/arch/powerpc/platforms/pseries/plpar_wrappers.h
+++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h
@@ -42,6 +42,16 @@ static inline long register_slb_shadow(u
 	return vpa_call(0x3, cpu, vpa);
 }
 
+static inline long plpar_page_set_loaned(unsigned long vpa)
+{
+	return plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_LOANED, vpa, 0);
+}
+
+static inline long plpar_page_set_active(unsigned long vpa)
+{
+	return plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_ACTIVE, vpa, 0);
+}
+
 extern void vpa_init(int cpu);
 
 static inline long plpar_pte_enter(unsigned long flags,
Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -92,6 +92,11 @@
 #define H_EXACT			(1UL<<(63-24))	/* Use exact PTE or return H_PTEG_FULL */
 #define H_R_XLATE		(1UL<<(63-25))	/* include a valid logical page num in the pte if the valid bit is set */
 #define H_READ_4		(1UL<<(63-26))	/* Return 4 PTEs */
+#define H_PAGE_STATE_CHANGE	(1UL<<(63-28))
+#define H_PAGE_UNUSED		((1UL<<(63-29)) | (1UL<<(63-30)))
+#define H_PAGE_SET_UNUSED	(H_PAGE_STATE_CHANGE | H_PAGE_UNUSED)
+#define H_PAGE_SET_LOANED	(H_PAGE_SET_UNUSED | (1UL<<(63-31)))
+#define H_PAGE_SET_ACTIVE	H_PAGE_STATE_CHANGE
 #define H_AVPN			(1UL<<(63-32))	/* An avpn is provided as a sanity test */
 #define H_ANDCOND		(1UL<<(63-33))
 #define H_ICACHE_INVALIDATE	(1UL<<(63-40))	/* icbi, etc.  (ignored for IO pages) */


* [PATCH 07/19] powerpc: Add collaborative memory manager
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (5 preceding siblings ...)
  2008-06-12 22:12 ` [PATCH 06/19] powerpc: Utilities to set firmware page state Robert Jennings
@ 2008-06-12 22:13 ` Robert Jennings
  2008-06-12 22:13 ` [PATCH 08/19] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
                   ` (11 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:13 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Brian King <brking@linux.vnet.ibm.com>

Adds a collaborative memory manager, which acts as a simple balloon driver
for System p machines that support cooperative memory overcommitment
(CMO).
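
Once loaded, the balloon can be observed from userspace through the
sysfs attributes registered below; a hypothetical sketch (the path
assumes the "cmm" sysdev class with id 0, as set up in this patch):

	#include <stdio.h>

	/* Sketch: sample how many pages are currently on loan. */
	int main(void)
	{
		unsigned long kb = 0;
		FILE *f = fopen("/sys/devices/system/cmm/cmm0/loaned_kb", "r");

		if (!f)
			return 1;
		if (fscanf(f, "%lu", &kb) == 1)
			printf("memory on loan: %lu kB\n", kb);
		fclose(f);
		return 0;
	}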

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

---

 arch/powerpc/platforms/pseries/Kconfig  |   11 +
 arch/powerpc/platforms/pseries/Makefile |    1 +
 arch/powerpc/platforms/pseries/cmm.c    |  470 ++++++++++++++++++++++++++++++++
 3 files changed, 482 insertions(+)

Index: b/arch/powerpc/platforms/pseries/Kconfig
===================================================================
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -39,3 +39,14 @@ config PPC_PSERIES_DEBUG
 	depends on PPC_PSERIES && PPC_EARLY_DEBUG
 	bool "Enable extra debug logging in platforms/pseries"
 	default y
+
+config CMM
+	tristate "Collaborative memory management"
+	depends on PPC_PSERIES
+	help
+	  Select this option if you want to enable the kernel interface
+	  to reduce the memory size of the system. This is accomplished
+	  by allocating pages of memory and putting them "on hold". This
+	  only makes sense for a system running in an LPAR where the unused
+	  pages will be reused for other LPARs. The interface allows firmware
+	  to balance memory across many LPARs.
Index: b/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_HVC_CONSOLE)	+= hvconsole.o
 obj-$(CONFIG_HVCS)		+= hvcserver.o
 obj-$(CONFIG_HCALL_STATS)	+= hvCall_inst.o
 obj-$(CONFIG_PHYP_DUMP)	+= phyp_dump.o
+obj-$(CONFIG_CMM)		+= cmm.o
Index: b/arch/powerpc/platforms/pseries/cmm.c
===================================================================
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -0,0 +1,470 @@
+/*
+ * arch/powerpc/platforms/pseries/cmm.c
+ *
+ * Collaborative memory management interface.
+ *
+ * Copyright (C) 2008 IBM Corporation
+ * Author(s): Brian King (brking@linux.vnet.ibm.com),
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA  02111-1307  USA
+ *
+ */
+
+#include <linux/ctype.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/oom.h>
+#include <linux/sched.h>
+#include <linux/stringify.h>
+#include <linux/swap.h>
+#include <linux/sysdev.h>
+#include <asm/firmware.h>
+#include <asm/hvcall.h>
+#include <asm/mmu.h>
+#include <asm/pgalloc.h>
+#include <asm/uaccess.h>
+
+#include "plpar_wrappers.h"
+
+#define CMM_DRIVER_VERSION	"1.0.0"
+#define CMM_DEFAULT_DELAY	1
+#define CMM_DEBUG			0
+#define CMM_DISABLE		0
+#define CMM_OOM_KB		1024
+#define CMM_MIN_MEM_MB		256
+#define KB2PAGES(_p)		((_p)>>(PAGE_SHIFT-10))
+#define PAGES2KB(_p)		((_p)<<(PAGE_SHIFT-10))
+
+static unsigned int delay = CMM_DEFAULT_DELAY;
+static unsigned int oom_kb = CMM_OOM_KB;
+static unsigned int cmm_debug = CMM_DEBUG;
+static unsigned int cmm_disabled = CMM_DISABLE;
+static unsigned long min_mem_mb = CMM_MIN_MEM_MB;
+static struct sys_device cmm_sysdev;
+
+MODULE_AUTHOR("Brian King <brking@linux.vnet.ibm.com>");
+MODULE_DESCRIPTION("IBM System p Collaborative Memory Manager");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(CMM_DRIVER_VERSION);
+
+module_param_named(delay, delay, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(delay, "Delay (in seconds) between polls to query hypervisor paging requests. "
+		 "[Default=" __stringify(CMM_DEFAULT_DELAY) "]");
+module_param_named(oom_kb, oom_kb, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(oom_kb, "Amount of memory in kb to free on OOM. "
+		 "[Default=" __stringify(CMM_OOM_KB) "]");
+module_param_named(min_mem_mb, min_mem_mb, ulong, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(min_mem_mb, "Minimum amount of memory (in MB) to not balloon. "
+		 "[Default=" __stringify(CMM_MIN_MEM_MB) "]");
+module_param_named(debug, cmm_debug, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(debug, "Enable module debugging logging. Set to 1 to enable. "
+		 "[Default=" __stringify(CMM_DEBUG) "]");
+
+#define CMM_NR_PAGES ((PAGE_SIZE - sizeof(void *) - sizeof(unsigned long)) / sizeof(unsigned long))
+
+#define cmm_dbg(...) if (cmm_debug) { printk(KERN_INFO "cmm: "__VA_ARGS__); }
+
+struct cmm_page_array {
+	struct cmm_page_array *next;
+	unsigned long index;
+	unsigned long page[CMM_NR_PAGES];
+};
+
+static unsigned long loaned_pages;
+static unsigned long loaned_pages_target;
+static unsigned long oom_freed_pages;
+
+static struct cmm_page_array *cmm_page_list;
+static DEFINE_SPINLOCK(cmm_lock);
+
+static struct task_struct *cmm_thread_ptr;
+
+/**
+ * cmm_alloc_pages - Allocate pages and mark them as loaned
+ * @nr:	number of pages to allocate
+ *
+ * Return value:
+ * 	number of pages requested to be allocated which were not
+ **/
+static long cmm_alloc_pages(long nr)
+{
+	struct cmm_page_array *pa, *npa;
+	unsigned long addr;
+	long rc;
+
+	cmm_dbg("Begin request for %ld pages\n", nr);
+
+	while (nr) {
+		addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
+				       __GFP_NORETRY | __GFP_NOMEMALLOC);
+		if (!addr)
+			break;
+		spin_lock(&cmm_lock);
+		pa = cmm_page_list;
+		if (!pa || pa->index >= CMM_NR_PAGES) {
+			/* Need a new page for the page list. */
+			spin_unlock(&cmm_lock);
+			npa = (struct cmm_page_array *)__get_free_page(GFP_NOIO | __GFP_NOWARN |
+								       __GFP_NORETRY | __GFP_NOMEMALLOC);
+			if (!npa) {
+				pr_info("%s: Can not allocate new page list\n", __FUNCTION__);
+				free_page(addr);
+				break;
+			}
+			spin_lock(&cmm_lock);
+			pa = cmm_page_list;
+
+			if (!pa || pa->index >= CMM_NR_PAGES) {
+				npa->next = pa;
+				npa->index = 0;
+				pa = npa;
+				cmm_page_list = pa;
+			} else
+				free_page((unsigned long) npa);
+		}
+
+		if ((rc = plpar_page_set_loaned(__pa(addr)))) {
+			pr_err("%s: Can not set page to loaned. rc=%ld\n", __FUNCTION__, rc);
+			spin_unlock(&cmm_lock);
+			free_page(addr);
+			break;
+		}
+
+		pa->page[pa->index++] = addr;
+		loaned_pages++;
+		totalram_pages--;
+		spin_unlock(&cmm_lock);
+		nr--;
+	}
+
+	cmm_dbg("End request with %ld pages unfulfilled\n", nr);
+	return nr;
+}
+
+/**
+ * cmm_free_pages - Free pages and mark them as active
+ * @nr:	number of pages to free
+ *
+ * Return value:
+ * 	number of pages requested to be freed which were not
+ **/
+static long cmm_free_pages(long nr)
+{
+	struct cmm_page_array *pa;
+	unsigned long addr;
+
+	cmm_dbg("Begin free of %ld pages.\n", nr);
+	spin_lock(&cmm_lock);
+	pa = cmm_page_list;
+	while (nr) {
+		if (!pa || pa->index <= 0)
+			break;
+		addr = pa->page[--pa->index];
+
+		if (pa->index == 0) {
+			pa = pa->next;
+			free_page((unsigned long) cmm_page_list);
+			cmm_page_list = pa;
+		}
+
+		plpar_page_set_active(__pa(addr));
+		free_page(addr);
+		loaned_pages--;
+		nr--;
+		totalram_pages++;
+	}
+	spin_unlock(&cmm_lock);
+	cmm_dbg("End request with %ld pages unfulfilled\n", nr);
+	return nr;
+}
+
+/**
+ * cmm_oom_notify - OOM notifier
+ * @self:	notifier block struct
+ * @dummy:	not used
+ * @parm:	returned - number of pages freed
+ *
+ * Return value:
+ * 	NOTIFY_OK
+ **/
+static int cmm_oom_notify(struct notifier_block *self,
+			  unsigned long dummy, void *parm)
+{
+	unsigned long *freed = parm;
+	long nr = KB2PAGES(oom_kb);
+
+	cmm_dbg("OOM processing started\n");
+	nr = cmm_free_pages(nr);
+	loaned_pages_target = loaned_pages;
+	*freed += KB2PAGES(oom_kb) - nr;
+	oom_freed_pages += KB2PAGES(oom_kb) - nr;
+	cmm_dbg("OOM processing complete\n");
+	return NOTIFY_OK;
+}
+
+/**
+ * cmm_get_mpp - Read memory performance parameters
+ *
+ * Makes hcall to query the current page loan request from the hypervisor.
+ *
+ * Return value:
+ * 	nothing
+ **/
+static void cmm_get_mpp(void)
+{
+	int rc;
+	struct hvcall_mpp_data mpp_data;
+	unsigned long active_pages_target;
+	signed long page_loan_request;
+
+	rc = h_get_mpp(&mpp_data);
+
+	if (rc != H_SUCCESS)
+		return;
+
+	page_loan_request = div_s64(mpp_data.loan_request, PAGE_SIZE);
+	loaned_pages_target = page_loan_request + loaned_pages;
+	if (loaned_pages_target > oom_freed_pages)
+		loaned_pages_target -= oom_freed_pages;
+	else
+		loaned_pages_target = 0;
+
+	active_pages_target = totalram_pages + loaned_pages - loaned_pages_target;
+
+	if ((min_mem_mb * 1024 * 1024) > (active_pages_target * PAGE_SIZE))
+		loaned_pages_target = totalram_pages + loaned_pages -
+			((min_mem_mb * 1024 * 1024) / PAGE_SIZE);
+
+	cmm_dbg("delta = %ld, loaned = %lu, target = %lu, oom = %lu, totalram = %lu\n",
+		page_loan_request, loaned_pages, loaned_pages_target,
+		oom_freed_pages, totalram_pages);
+}
+
+static struct notifier_block cmm_oom_nb = {
+	.notifier_call = cmm_oom_notify
+};
+
+/**
+ * cmm_thread - CMM task thread
+ * @dummy:	not used
+ *
+ * Return value:
+ * 	0
+ **/
+static int cmm_thread(void *dummy)
+{
+	unsigned long timeleft;
+
+	while (1) {
+		timeleft = msleep_interruptible(delay * 1000);
+
+		if (kthread_should_stop() || timeleft) {
+			loaned_pages_target = loaned_pages;
+			break;
+		}
+
+		cmm_get_mpp();
+
+		if (loaned_pages_target > loaned_pages) {
+			if (cmm_alloc_pages(loaned_pages_target - loaned_pages))
+				loaned_pages_target = loaned_pages;
+		} else if (loaned_pages_target < loaned_pages)
+			cmm_free_pages(loaned_pages - loaned_pages_target);
+	}
+	return 0;
+}
+
+#define CMM_SHOW(name, format, args...)			\
+	static ssize_t show_##name(struct sys_device *dev, char *buf)	\
+	{							\
+		return sprintf(buf, format, ##args);		\
+	}							\
+	static SYSDEV_ATTR(name, S_IRUGO, show_##name, NULL)
+
+CMM_SHOW(loaned_kb, "%lu\n", PAGES2KB(loaned_pages));
+CMM_SHOW(loaned_target_kb, "%lu\n", PAGES2KB(loaned_pages_target));
+
+static ssize_t show_oom_pages(struct sys_device *dev, char *buf)
+{
+	return sprintf(buf, "%lu\n", PAGES2KB(oom_freed_pages));
+}
+
+static ssize_t store_oom_pages(struct sys_device *dev,
+			       const char *buf, size_t count)
+{
+	unsigned long val = simple_strtoul(buf, NULL, 10);
+
+	if (!capable(CAP_SYS_ADMIN))
+		return -EPERM;
+	if (val != 0)
+		return -EBADMSG;
+
+	oom_freed_pages = 0;
+	return count;
+}
+
+static SYSDEV_ATTR(oom_freed_kb, S_IWUSR | S_IRUGO,
+		   show_oom_pages, store_oom_pages);
+
+static struct sysdev_attribute *cmm_attrs[] = {
+	&attr_loaned_kb,
+	&attr_loaned_target_kb,
+	&attr_oom_freed_kb,
+};
+
+static struct sysdev_class cmm_sysdev_class = {
+	.name = "cmm",
+};
+
+/**
+ * cmm_sysfs_register - Register with sysfs
+ *
+ * Return value:
+ * 	0 on success / other on failure
+ **/
+static int cmm_sysfs_register(struct sys_device *sysdev)
+{
+	int i, rc;
+
+	if ((rc = sysdev_class_register(&cmm_sysdev_class)))
+		return rc;
+
+	sysdev->id = 0;
+	sysdev->cls = &cmm_sysdev_class;
+
+	if ((rc = sysdev_register(sysdev)))
+		goto class_unregister;
+
+	for (i = 0; i < ARRAY_SIZE(cmm_attrs); i++) {
+		if ((rc = sysdev_create_file(sysdev, cmm_attrs[i])))
+			goto fail;
+	}
+
+	return 0;
+
+fail:
+	while (--i >= 0)
+		sysdev_remove_file(sysdev, cmm_attrs[i]);
+	sysdev_unregister(sysdev);
+class_unregister:
+	sysdev_class_unregister(&cmm_sysdev_class);
+	return rc;
+}
+
+/**
+ * cmm_unregister_sysfs - Unregister from sysfs
+ *
+ **/
+static void cmm_unregister_sysfs(struct sys_device *sysdev)
+{
+	int i;
+
+	for (i = 0; i < ARRAY_SIZE(cmm_attrs); i++)
+		sysdev_remove_file(sysdev, cmm_attrs[i]);
+	sysdev_unregister(sysdev);
+	sysdev_class_unregister(&cmm_sysdev_class);
+}
+
+/**
+ * cmm_init - Module initialization
+ *
+ * Return value:
+ * 	0 on success / other on failure
+ **/
+static int cmm_init(void)
+{
+	int rc = -ENOMEM;
+
+	if (!firmware_has_feature(FW_FEATURE_CMO))
+		return -EOPNOTSUPP;
+
+	if ((rc = register_oom_notifier(&cmm_oom_nb)) < 0)
+		return rc;
+
+	if ((rc = cmm_sysfs_register(&cmm_sysdev)))
+		goto out_oom_notifier;
+
+	if (cmm_disabled)
+		return rc;
+
+	cmm_thread_ptr = kthread_run(cmm_thread, NULL, "cmmthread");
+	if (IS_ERR(cmm_thread_ptr)) {
+		rc = PTR_ERR(cmm_thread_ptr);
+		goto out_unregister_sysfs;
+	}
+
+	return rc;
+
+out_unregister_sysfs:
+	cmm_unregister_sysfs(&cmm_sysdev);
+out_oom_notifier:
+	unregister_oom_notifier(&cmm_oom_nb);
+	return rc;
+}
+
+/**
+ * cmm_exit - Module exit
+ *
+ * Return value:
+ * 	nothing
+ **/
+static void cmm_exit(void)
+{
+	if (cmm_thread_ptr)
+		kthread_stop(cmm_thread_ptr);
+	unregister_oom_notifier(&cmm_oom_nb);
+	cmm_free_pages(loaned_pages);
+	cmm_unregister_sysfs(&cmm_sysdev);
+}
+
+/**
+ * cmm_set_disable - Disable/Enable CMM
+ *
+ * Return value:
+ * 	0 on success / other on failure
+ **/
+static int cmm_set_disable(const char *val, struct kernel_param *kp)
+{
+	int disable = simple_strtoul(val, NULL, 10);
+
+	if (disable != 0 && disable != 1)
+		return -EINVAL;
+
+	if (disable && !cmm_disabled) {
+		if (cmm_thread_ptr)
+			kthread_stop(cmm_thread_ptr);
+		cmm_thread_ptr = NULL;
+		cmm_free_pages(loaned_pages);
+	} else if (!disable && cmm_disabled) {
+		cmm_thread_ptr = kthread_run(cmm_thread, NULL, "cmmthread");
+		if (IS_ERR(cmm_thread_ptr))
+			return PTR_ERR(cmm_thread_ptr);
+	}
+
+	cmm_disabled = disable;
+	return 0;
+}
+
+module_param_call(disable, cmm_set_disable, param_get_uint,
+		  &cmm_disabled, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable, "Disable CMM. Set to 1 to disable. "
+		 "[Default=" __stringify(CMM_DISABLE) "]");
+
+module_init(cmm_init);
+module_exit(cmm_exit);


* [PATCH 08/19] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (6 preceding siblings ...)
  2008-06-12 22:13 ` [PATCH 07/19] powerpc: Add collaborative memory manager Robert Jennings
@ 2008-06-12 22:13 ` Robert Jennings
  2008-06-12 22:14 ` [PATCH 09/19] powerpc: Add CMO paging statistics Robert Jennings
                   ` (10 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:13 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Brian King <brking@linux.vnet.ibm.com>

When Cooperative Memory Overcommitment (CMO) is enabled on System p,
native PCI devices and eBus devices are not currently supported.
Prevent PCI bus probing and eBus device probing if the feature is
enabled.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

---

 arch/powerpc/kernel/ibmebus.c          |    6 ++++++
 arch/powerpc/platforms/pseries/setup.c |    4 ++++
 2 files changed, 10 insertions(+)

Index: b/arch/powerpc/kernel/ibmebus.c
===================================================================
--- a/arch/powerpc/kernel/ibmebus.c
+++ b/arch/powerpc/kernel/ibmebus.c
@@ -45,6 +45,7 @@
 #include <linux/of_platform.h>
 #include <asm/ibmebus.h>
 #include <asm/abs_addr.h>
+#include <asm/firmware.h>
 
 static struct device ibmebus_bus_device = { /* fake "parent" device */
 	.bus_id = "ibmebus",
@@ -332,6 +333,11 @@ static int __init ibmebus_bus_init(void)
 {
 	int err;
 
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		printk(KERN_WARNING "Not probing eBus since CMO is enabled\n");
+		return 0;
+	}
+
 	printk(KERN_INFO "IBM eBus Device Driver\n");
 
 	err = of_bus_type_init(&ibmebus_bus_type, "ibmebus");
Index: b/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -539,6 +539,10 @@ static void pseries_shared_idle_sleep(vo
 
 static int pSeries_pci_probe_mode(struct pci_bus *bus)
 {
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		dev_warn(&bus->dev, "Not probing PCI bus since CMO is enabled\n");
+		return PCI_PROBE_NONE;
+	}
 	if (firmware_has_feature(FW_FEATURE_LPAR))
 		return PCI_PROBE_DEVTREE;
 	return PCI_PROBE_NORMAL;


* [PATCH 09/19] powerpc: Add CMO paging statistics
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (7 preceding siblings ...)
  2008-06-12 22:13 ` [PATCH 08/19] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
@ 2008-06-12 22:14 ` Robert Jennings
  2008-06-12 22:15 ` [PATCH 10/19] powerpc: move get_longbusy_msecs out of ehca/ehea Robert Jennings
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:14 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Brian King <brking@linux.vnet.ibm.com>

With the addition of Cooperative Memory Overcommitment (CMO) support
for IBM Power Systems, two fields have been added to the VPA to report
paging statistics. Add support in lparcfg to report them to userspace.

Signed-off-by: Brian King <brking@linux.vnet.ibm.com>

---

 arch/powerpc/kernel/lparcfg.c |   20 ++++++++++++++++++++
 include/asm-powerpc/lppaca.h  |    5 ++++-
 2 files changed, 24 insertions(+), 1 deletion(-)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -410,6 +410,25 @@ static int lparcfg_count_active_processo
 	return count;
 }
 
+static void pseries_cmo_data(struct seq_file *m)
+{
+	int cpu;
+	unsigned long cmo_faults = 0;
+	unsigned long cmo_fault_time = 0;
+
+	if (!firmware_has_feature(FW_FEATURE_CMO))
+		return;
+
+	for_each_possible_cpu(cpu) {
+		cmo_faults += lppaca[cpu].cmo_faults;
+		cmo_fault_time += lppaca[cpu].cmo_fault_time;
+	}
+
+	seq_printf(m, "cmo_faults=%lu\n", cmo_faults);
+	seq_printf(m, "cmo_fault_time_usec=%lu\n",
+		   cmo_fault_time / tb_ticks_per_usec);
+}
+
 static int pseries_lparcfg_data(struct seq_file *m, void *v)
 {
 	int partition_potential_processors;
@@ -435,6 +454,7 @@ static int pseries_lparcfg_data(struct s
 		parse_system_parameter_string(m);
 		parse_ppp_data(m);
 		parse_mpp_data(m);
+		pseries_cmo_data(m);
 
 		seq_printf(m, "purr=%ld\n", get_purr());
 
Index: b/include/asm-powerpc/lppaca.h
===================================================================
--- a/include/asm-powerpc/lppaca.h
+++ b/include/asm-powerpc/lppaca.h
@@ -125,7 +125,10 @@ struct lppaca {
 	// NOTE: This value will ALWAYS be zero for dedicated processors and
 	// will NEVER be zero for shared processors (ie, initialized to a 1).
 	volatile u32 yield_count;	// PLIC increments each dispatchx00-x03
-	u8	reserved6[124];		// Reserved                     x04-x7F
+	u32 reserved6;
+	volatile u64 cmo_faults;	// CMO page fault count         x08-x0F
+	volatile u64 cmo_fault_time;	// CMO page fault time          x10-x17
+	u8	reserved7[104];		// Reserved                     x18-x7F
 
 //============================================================================
 // CACHE_LINE_4-5 0x0180 - 0x027F Contains PMC interrupt data

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 10/19] powerpc: move get_longbusy_msecs out of ehca/ehea
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (8 preceding siblings ...)
  2008-06-12 22:14 ` [PATCH 09/19] powerpc: Add CMO paging statistics Robert Jennings
@ 2008-06-12 22:15 ` Robert Jennings
  2008-06-12 22:18   ` [PATCH 10/19] [repost] " Robert Jennings
  2008-06-12 22:19 ` [PATCH 11/19] powerpc: iommu enablement for CMO Robert Jennings
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:15 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

In support of Cooperative Memory Overcommitment (CMO) this moves
get_longbusy_msecs() out of the ehca and ehea drivers and into the
architecture's hvcall header as plpar_get_longbusy_msecs.  Some firmware
calls made in pSeries platform iommu code will need to share this
functionality.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
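
For illustration only (not part of the patch): the long-busy retry
idiom that this helper supports, mirroring the ehca/ehea loops
converted below.  A minimal sketch for kernel context; the opcode and
arguments are placeholders.

#include <linux/delay.h>
#include <asm/hvcall.h>

static long hcall_retry_example(unsigned long opcode, unsigned long arg1,
				unsigned long arg2)
{
	long rc;

	do {
		rc = plpar_hcall_norets(opcode, arg1, arg2);
		/* H_LONG_BUSY_* encodes a suggested wait; sleep for that
		 * interval, then retry the hypervisor call. */
		if (H_IS_LONG_BUSY(rc))
			msleep_interruptible(plpar_get_longbusy_msecs(rc));
	} while (H_IS_LONG_BUSY(rc));

	return rc;
}
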
 drivers/infiniband/hw/ehca/hcp_if.c |   24 ++----------------------
 drivers/net/ehea/ehea_phyp.c        |    4 ++--
 drivers/net/ehea/ehea_phyp.h        |   20 --------------------
 include/asm-powerpc/hvcall.h        |   27 +++++++++++++++++++++++++++
 4 files changed, 31 insertions(+), 44 deletions(-)

Index: b/drivers/infiniband/hw/ehca/hcp_if.c
===================================================================
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -90,26 +90,6 @@
 
 static DEFINE_SPINLOCK(hcall_lock);
 
-static u32 get_longbusy_msecs(int longbusy_rc)
-{
-	switch (longbusy_rc) {
-	case H_LONG_BUSY_ORDER_1_MSEC:
-		return 1;
-	case H_LONG_BUSY_ORDER_10_MSEC:
-		return 10;
-	case H_LONG_BUSY_ORDER_100_MSEC:
-		return 100;
-	case H_LONG_BUSY_ORDER_1_SEC:
-		return 1000;
-	case H_LONG_BUSY_ORDER_10_SEC:
-		return 10000;
-	case H_LONG_BUSY_ORDER_100_SEC:
-		return 100000;
-	default:
-		return 1;
-	}
-}
-
 static long ehca_plpar_hcall_norets(unsigned long opcode,
 				    unsigned long arg1,
 				    unsigned long arg2,
@@ -139,7 +119,7 @@ static long ehca_plpar_hcall_norets(unsi
 			spin_unlock_irqrestore(&hcall_lock, flags);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
@@ -192,7 +172,7 @@ static long ehca_plpar_hcall9(unsigned l
 			spin_unlock_irqrestore(&hcall_lock, flags);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
Index: b/drivers/net/ehea/ehea_phyp.c
===================================================================
--- a/drivers/net/ehea/ehea_phyp.c
+++ b/drivers/net/ehea/ehea_phyp.c
@@ -61,7 +61,7 @@ static long ehea_plpar_hcall_norets(unsi
 					 arg5, arg6, arg7);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
@@ -102,7 +102,7 @@ static long ehea_plpar_hcall9(unsigned l
 				   arg6, arg7, arg8, arg9);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
Index: b/drivers/net/ehea/ehea_phyp.h
===================================================================
--- a/drivers/net/ehea/ehea_phyp.h
+++ b/drivers/net/ehea/ehea_phyp.h
@@ -40,26 +40,6 @@
  * hcp_*  - structures, variables and functions related to Hypervisor Calls
  */
 
-static inline u32 get_longbusy_msecs(int long_busy_ret_code)
-{
-	switch (long_busy_ret_code) {
-	case H_LONG_BUSY_ORDER_1_MSEC:
-		return 1;
-	case H_LONG_BUSY_ORDER_10_MSEC:
-		return 10;
-	case H_LONG_BUSY_ORDER_100_MSEC:
-		return 100;
-	case H_LONG_BUSY_ORDER_1_SEC:
-		return 1000;
-	case H_LONG_BUSY_ORDER_10_SEC:
-		return 10000;
-	case H_LONG_BUSY_ORDER_100_SEC:
-		return 100000;
-	default:
-		return 1;
-	}
-}
-
 /* Number of pages which can be registered at once by H_REGISTER_HEA_RPAGES */
 #define EHEA_MAX_RPAGE 512
 
Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -222,6 +222,33 @@
 #ifndef __ASSEMBLY__
 
 /**
+ * plpar_get_longbusy_msecs: - Return number of msecs for H_LONG_BUSY* response
+ * @long_busy_ret_code: The H_LONG_BUSY_* constant to process
+ *
+ * This returns the number of msecs that corresponds to an H_LONG_BUSY_*
+ * response from a plpar_hcall.  If there is no match, 1 is returned.
+ */
+static inline u32 plpar_get_longbusy_msecs(int long_busy_ret_code)
+{
+	switch (long_busy_ret_code) {
+	case H_LONG_BUSY_ORDER_1_MSEC:
+		return 1;
+	case H_LONG_BUSY_ORDER_10_MSEC:
+		return 10;
+	case H_LONG_BUSY_ORDER_100_MSEC:
+		return 100;
+	case H_LONG_BUSY_ORDER_1_SEC:
+		return 1000;
+	case H_LONG_BUSY_ORDER_10_SEC:
+		return 10000;
+	case H_LONG_BUSY_ORDER_100_SEC:
+		return 100000;
+	default:
+		return 1;
+	}
+}
+
+/**
  * plpar_hcall_norets: - Make a pseries hypervisor call with no return arguments
  * @opcode: The hypervisor call to make.
  *

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 10/19] [repost] powerpc: move get_longbusy_msecs out of ehca/ehea
  2008-06-12 22:15 ` [PATCH 10/19] powerpc: move get_longbusy_msecs out of ehca/ehea Robert Jennings
@ 2008-06-12 22:18   ` Robert Jennings
  2008-06-13 18:24     ` Brian King
  0 siblings, 1 reply; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:18 UTC (permalink / raw)
  To: paulus; +Cc: netdev, linuxppc-dev, David Darrington, Brian King

From: Robert Jennings <rcj@linux.vnet.ibm.com>

In support of Cooperative Memory Overcommitment (CMO) this moves
get_longbusy_msecs() out of the ehca and ehea drivers and into the
architecture's hvcall header as plpar_get_longbusy_msecs.  Some firmware
calls made in pSeries platform iommu code will need to share this
functionality.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---

I missed copying netdev on this patch the first time.

 drivers/infiniband/hw/ehca/hcp_if.c |   24 ++----------------------
 drivers/net/ehea/ehea_phyp.c        |    4 ++--
 drivers/net/ehea/ehea_phyp.h        |   20 --------------------
 include/asm-powerpc/hvcall.h        |   27 +++++++++++++++++++++++++++
 4 files changed, 31 insertions(+), 44 deletions(-)

Index: b/drivers/infiniband/hw/ehca/hcp_if.c
===================================================================
--- a/drivers/infiniband/hw/ehca/hcp_if.c
+++ b/drivers/infiniband/hw/ehca/hcp_if.c
@@ -90,26 +90,6 @@
 
 static DEFINE_SPINLOCK(hcall_lock);
 
-static u32 get_longbusy_msecs(int longbusy_rc)
-{
-	switch (longbusy_rc) {
-	case H_LONG_BUSY_ORDER_1_MSEC:
-		return 1;
-	case H_LONG_BUSY_ORDER_10_MSEC:
-		return 10;
-	case H_LONG_BUSY_ORDER_100_MSEC:
-		return 100;
-	case H_LONG_BUSY_ORDER_1_SEC:
-		return 1000;
-	case H_LONG_BUSY_ORDER_10_SEC:
-		return 10000;
-	case H_LONG_BUSY_ORDER_100_SEC:
-		return 100000;
-	default:
-		return 1;
-	}
-}
-
 static long ehca_plpar_hcall_norets(unsigned long opcode,
 				    unsigned long arg1,
 				    unsigned long arg2,
@@ -139,7 +119,7 @@ static long ehca_plpar_hcall_norets(unsi
 			spin_unlock_irqrestore(&hcall_lock, flags);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
@@ -192,7 +172,7 @@ static long ehca_plpar_hcall9(unsigned l
 			spin_unlock_irqrestore(&hcall_lock, flags);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
Index: b/drivers/net/ehea/ehea_phyp.c
===================================================================
--- a/drivers/net/ehea/ehea_phyp.c
+++ b/drivers/net/ehea/ehea_phyp.c
@@ -61,7 +61,7 @@ static long ehea_plpar_hcall_norets(unsi
 					 arg5, arg6, arg7);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
@@ -102,7 +102,7 @@ static long ehea_plpar_hcall9(unsigned l
 				   arg6, arg7, arg8, arg9);
 
 		if (H_IS_LONG_BUSY(ret)) {
-			sleep_msecs = get_longbusy_msecs(ret);
+			sleep_msecs = plpar_get_longbusy_msecs(ret);
 			msleep_interruptible(sleep_msecs);
 			continue;
 		}
Index: b/drivers/net/ehea/ehea_phyp.h
===================================================================
--- a/drivers/net/ehea/ehea_phyp.h
+++ b/drivers/net/ehea/ehea_phyp.h
@@ -40,26 +40,6 @@
  * hcp_*  - structures, variables and functions related to Hypervisor Calls
  */
 
-static inline u32 get_longbusy_msecs(int long_busy_ret_code)
-{
-	switch (long_busy_ret_code) {
-	case H_LONG_BUSY_ORDER_1_MSEC:
-		return 1;
-	case H_LONG_BUSY_ORDER_10_MSEC:
-		return 10;
-	case H_LONG_BUSY_ORDER_100_MSEC:
-		return 100;
-	case H_LONG_BUSY_ORDER_1_SEC:
-		return 1000;
-	case H_LONG_BUSY_ORDER_10_SEC:
-		return 10000;
-	case H_LONG_BUSY_ORDER_100_SEC:
-		return 100000;
-	default:
-		return 1;
-	}
-}
-
 /* Number of pages which can be registered at once by H_REGISTER_HEA_RPAGES */
 #define EHEA_MAX_RPAGE 512
 
Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -222,6 +222,33 @@
 #ifndef __ASSEMBLY__
 
 /**
+ * plpar_get_longbusy_msecs: - Return number of msecs for H_LONG_BUSY* response
+ * @long_busy_ret_code: The H_LONG_BUSY_* constant to process
+ *
+ * This returns the number of msecs that corresponds to an H_LONG_BUSY_*
+ * response from a plpar_hcall.  If there is no match, 1 is returned.
+ */
+static inline u32 plpar_get_longbusy_msecs(int long_busy_ret_code)
+{
+	switch (long_busy_ret_code) {
+	case H_LONG_BUSY_ORDER_1_MSEC:
+		return 1;
+	case H_LONG_BUSY_ORDER_10_MSEC:
+		return 10;
+	case H_LONG_BUSY_ORDER_100_MSEC:
+		return 100;
+	case H_LONG_BUSY_ORDER_1_SEC:
+		return 1000;
+	case H_LONG_BUSY_ORDER_10_SEC:
+		return 10000;
+	case H_LONG_BUSY_ORDER_100_SEC:
+		return 100000;
+	default:
+		return 1;
+	}
+}
+
+/**
  * plpar_hcall_norets: - Make a pseries hypervisor call with no return arguments
  * @opcode: The hypervisor call to make.
  *

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 11/19] powerpc: iommu enablement for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (9 preceding siblings ...)
  2008-06-12 22:15 ` [PATCH 10/19] powerpc: move get_longbusy_msecs out of ehca/ehea Robert Jennings
@ 2008-06-12 22:19 ` Robert Jennings
  2008-06-13  1:43   ` Olof Johansson
  2008-06-20 15:12   ` [PATCH 11/19][v2] " Robert Jennings
  2008-06-12 22:19 ` [PATCH 12/19] powerpc: vio bus support " Robert Jennings
                   ` (7 subsequent siblings)
  18 siblings, 2 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:19 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

To support Cooperative Memory Overcommitment (CMO), we need to check
for failure and busy responses from some of the tce hcalls.

These changes to the pseries platform affect the common powerpc
architecture code; the updates needed for the other affected platforms
are included in this patch.

pSeries platform IOMMU code changes:
 * platform TCE functions must handle H_NOT_ENOUGH_RESOURCES errors.
 * platform TCE functions must retry when H_LONG_BUSY_* is returned.
 * platform TCE functions must return error when H_NOT_ENOUGH_RESOURCES
   encountered.

Architecture IOMMU code changes:
 * Calls to ppc_md.tce_build need to check the return value and return
   DMA_ERROR_CODE on failure.

Architecture changes:
 * the tce_build member of struct machdep_calls changes to return int so
   that the tce_build*_pSeriesLP functions can indicate failure
 * all other platforms need updates to their iommu functions to match the
   new calling semantics; they return 0 on success (an illustrative stub
   follows the diffstat below).  The other platforms' default configs have
   been built, but no further testing was performed.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
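
For illustration only (not part of the patch): a stub showing the
shape of a platform tce_build implementation under the new calling
convention.  write_one_tce()/clear_one_tce() are hypothetical helpers;
the contract is what matters: return 0 on success, or undo any partial
work and return non-zero on a transient failure so the generic code
can unwind its bitmap via iommu_undo() without calling tce_free.

static int tce_build_example(struct iommu_table *tbl, long index,
			     long npages, unsigned long uaddr,
			     enum dma_data_direction direction)
{
	long i;

	for (i = 0; i < npages; i++) {
		if (write_one_tce(tbl, index + i, uaddr, direction)) {
			/* Undo the entries already written before
			 * reporting the error to the caller. */
			while (i--)
				clear_one_tce(tbl, index + i);
			return -1;
		}
		uaddr += TCE_PAGE_SIZE;
	}
	return 0;
}
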
 arch/powerpc/kernel/iommu.c            |   71 +++++++++++++++++++++++++++++--
 arch/powerpc/platforms/cell/iommu.c    |    3 +
 arch/powerpc/platforms/iseries/iommu.c |    3 +
 arch/powerpc/platforms/pasemi/iommu.c  |    3 +
 arch/powerpc/platforms/pseries/iommu.c |   76 ++++++++++++++++++++++++++++-----
 arch/powerpc/sysdev/dart_iommu.c       |    3 +
 include/asm-powerpc/machdep.h          |    2 
 7 files changed, 139 insertions(+), 22 deletions(-)

Index: b/arch/powerpc/kernel/iommu.c
===================================================================
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -183,6 +183,49 @@ static unsigned long iommu_range_alloc(s
 	return n;
 }
 
+/** iommu_undo - Clear iommu_table bits without calling platform tce_free.
+ *
+ * @tbl - struct iommu_table to alter
+ * @dma_addr - DMA address to free entries for
+ * @npages - number of pages to free entries for
+ *
+ * This is the same as __iommu_free without the call to ppc_md.tce_free();
+ *
+ * To clean up after ppc_md.tce_build() errors we need to clear bits
+ * in the table without calling the ppc_md.tce_free() method; calling
+ * ppc_md.tce_free() could alter entries that were not touched due to a
+ * premature failure in ppc_md.tce_build().
+ *
+ * The ppc_md.tce_build() needs to perform its own clean up prior to
+ * returning its error.
+ */
+static void iommu_undo(struct iommu_table *tbl, dma_addr_t dma_addr,
+			 unsigned int npages)
+{
+	unsigned long entry, free_entry;
+
+	entry = dma_addr >> IOMMU_PAGE_SHIFT;
+	free_entry = entry - tbl->it_offset;
+
+	if (((free_entry + npages) > tbl->it_size) ||
+	    (entry < tbl->it_offset)) {
+		if (printk_ratelimit()) {
+			printk(KERN_INFO "iommu_undo: invalid entry\n");
+			printk(KERN_INFO "\tentry    = 0x%lx\n", entry);
+			printk(KERN_INFO "\tdma_addr = 0x%lx\n", (u64)dma_addr);
+			printk(KERN_INFO "\tTable    = 0x%lx\n", (u64)tbl);
+			printk(KERN_INFO "\tbus#     = 0x%lx\n", tbl->it_busno);
+			printk(KERN_INFO "\tsize     = 0x%lx\n", tbl->it_size);
+			printk(KERN_INFO "\tstartOff = 0x%lx\n", tbl->it_offset);
+			printk(KERN_INFO "\tindex    = 0x%lx\n", tbl->it_index);
+			WARN_ON(1);
+		}
+		return;
+	}
+
+	iommu_area_free(tbl->it_map, free_entry, npages);
+}
+
 static dma_addr_t iommu_alloc(struct device *dev, struct iommu_table *tbl,
 			      void *page, unsigned int npages,
 			      enum dma_data_direction direction,
@@ -190,6 +233,7 @@ static dma_addr_t iommu_alloc(struct dev
 {
 	unsigned long entry, flags;
 	dma_addr_t ret = DMA_ERROR_CODE;
+	int rc;
 
 	spin_lock_irqsave(&(tbl->it_lock), flags);
 
@@ -204,9 +248,20 @@ static dma_addr_t iommu_alloc(struct dev
 	ret = entry << IOMMU_PAGE_SHIFT;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
-	ppc_md.tce_build(tbl, entry, npages, (unsigned long)page & IOMMU_PAGE_MASK,
-			 direction);
+	rc = ppc_md.tce_build(tbl, entry, npages,
+	                      (unsigned long)page & IOMMU_PAGE_MASK, direction);
 
+	/* ppc_md.tce_build() only returns non-zero for transient errors.
+	 * Clean up the table bitmap in this case and return
+	 * DMA_ERROR_CODE. For all other errors the functionality is
+	 * not altered.
+	 */
+	if (unlikely(rc)) {
+		iommu_undo(tbl, ret, npages);
+
+		spin_unlock_irqrestore(&(tbl->it_lock), flags);
+		return DMA_ERROR_CODE;
+	}
 
 	/* Flush/invalidate TLB caches if necessary */
 	if (ppc_md.tce_flush)
@@ -275,7 +330,7 @@ int iommu_map_sg(struct device *dev, str
 	dma_addr_t dma_next = 0, dma_addr;
 	unsigned long flags;
 	struct scatterlist *s, *outs, *segstart;
-	int outcount, incount, i;
+	int outcount, incount, i, rc = 0;
 	unsigned int align;
 	unsigned long handle;
 	unsigned int max_seg_size;
@@ -336,7 +391,10 @@ int iommu_map_sg(struct device *dev, str
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
-		ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK, direction);
+		rc = ppc_md.tce_build(tbl, entry, npages,
+		                      vaddr & IOMMU_PAGE_MASK, direction);
+		if (unlikely(rc))
+			goto failure;
 
 		/* If we are in an open segment, try merging */
 		if (segstart != s) {
@@ -399,7 +457,10 @@ int iommu_map_sg(struct device *dev, str
 
 			vaddr = s->dma_address & IOMMU_PAGE_MASK;
 			npages = iommu_num_pages(s->dma_address, s->dma_length);
-			__iommu_free(tbl, vaddr, npages);
+			if (!rc)
+				__iommu_free(tbl, vaddr, npages);
+			else
+				iommu_undo(tbl, vaddr, npages);
 			s->dma_address = DMA_ERROR_CODE;
 			s->dma_length = 0;
 		}
Index: b/arch/powerpc/platforms/cell/iommu.c
===================================================================
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -172,7 +172,7 @@ static void invalidate_tce_cache(struct 
 	}
 }
 
-static void tce_build_cell(struct iommu_table *tbl, long index, long npages,
+static int tce_build_cell(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction)
 {
 	int i;
@@ -210,6 +210,7 @@ static void tce_build_cell(struct iommu_
 
 	pr_debug("tce_build_cell(index=%lx,n=%lx,dir=%d,base_pte=%lx)\n",
 		 index, npages, direction, base_pte);
+	return 0;
 }
 
 static void tce_free_cell(struct iommu_table *tbl, long index, long npages)
Index: b/arch/powerpc/platforms/iseries/iommu.c
===================================================================
--- a/arch/powerpc/platforms/iseries/iommu.c
+++ b/arch/powerpc/platforms/iseries/iommu.c
@@ -41,7 +41,7 @@
 #include <asm/iseries/hv_call_event.h>
 #include <asm/iseries/iommu.h>
 
-static void tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
+static int tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction)
 {
 	u64 rc;
@@ -70,6 +70,7 @@ static void tce_build_iSeries(struct iom
 		index++;
 		uaddr += TCE_PAGE_SIZE;
 	}
+	return 0;
 }
 
 static void tce_free_iSeries(struct iommu_table *tbl, long index, long npages)
Index: b/arch/powerpc/platforms/pasemi/iommu.c
===================================================================
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -83,7 +83,7 @@ static u32 *iob_l2_base;
 static struct iommu_table iommu_table_iobmap;
 static int iommu_table_iobmap_inited;
 
-static void iobmap_build(struct iommu_table *tbl, long index,
+static int iobmap_build(struct iommu_table *tbl, long index,
 			 long npages, unsigned long uaddr,
 			 enum dma_data_direction direction)
 {
@@ -107,6 +107,7 @@ static void iobmap_build(struct iommu_ta
 		uaddr += IOBMAP_PAGE_SIZE;
 		bus_addr += IOBMAP_PAGE_SIZE;
 	}
+	return 0;
 }
 
 
Index: b/arch/powerpc/platforms/pseries/iommu.c
===================================================================
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -25,6 +25,7 @@
  */
 
 #include <linux/init.h>
+#include <linux/delay.h>
 #include <linux/types.h>
 #include <linux/slab.h>
 #include <linux/mm.h>
@@ -48,7 +49,7 @@
 #include "plpar_wrappers.h"
 
 
-static void tce_build_pSeries(struct iommu_table *tbl, long index,
+static int tce_build_pSeries(struct iommu_table *tbl, long index,
 			      long npages, unsigned long uaddr,
 			      enum dma_data_direction direction)
 {
@@ -71,6 +72,7 @@ static void tce_build_pSeries(struct iom
 		uaddr += TCE_PAGE_SIZE;
 		tcep++;
 	}
+	return 0;
 }
 
 
@@ -93,13 +95,18 @@ static unsigned long tce_get_pseries(str
 	return *tcep;
 }
 
-static void tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static void tce_free_pSeriesLP(struct iommu_table*, long, long);
+static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
+
+static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				long npages, unsigned long uaddr,
 				enum dma_data_direction direction)
 {
-	u64 rc;
+	u64 rc = 0;
 	u64 proto_tce, tce;
 	u64 rpn;
+	int sleep_msecs, ret = 0;
+	long tcenum_start = tcenum, npages_start = npages;
 
 	rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT;
 	proto_tce = TCE_PCI_READ;
@@ -108,7 +115,21 @@ static void tce_build_pSeriesLP(struct i
 
 	while (npages--) {
 		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
-		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
+		do {
+			rc = plpar_tce_put((u64)tbl->it_index,
+			                   (u64)tcenum << 12, tce);
+			if (unlikely(H_IS_LONG_BUSY(rc))) {
+				sleep_msecs = plpar_get_longbusy_msecs(rc);
+				mdelay(sleep_msecs);
+			}
+		} while (unlikely(H_IS_LONG_BUSY(rc)));
+
+		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
+			ret = (int)rc;
+			tce_free_pSeriesLP(tbl, tcenum_start,
+			                   (npages_start - (npages + 1)));
+			break;
+		}
 
 		if (rc && printk_ratelimit()) {
 			printk("tce_build_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
@@ -121,19 +142,22 @@ static void tce_build_pSeriesLP(struct i
 		tcenum++;
 		rpn++;
 	}
+	return ret;
 }
 
 static DEFINE_PER_CPU(u64 *, tce_page) = NULL;
 
-static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
 				     enum dma_data_direction direction)
 {
-	u64 rc;
+	u64 rc = 0;
 	u64 proto_tce;
 	u64 *tcep;
 	u64 rpn;
 	long l, limit;
+	long tcenum_start = tcenum, npages_start = npages;
+	int sleep_msecs, ret = 0;
 
 	if (npages == 1)
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
@@ -171,15 +195,26 @@ static void tce_buildmulti_pSeriesLP(str
 			rpn++;
 		}
 
-		rc = plpar_tce_put_indirect((u64)tbl->it_index,
-					    (u64)tcenum << 12,
-					    (u64)virt_to_abs(tcep),
-					    limit);
+		do {
+			rc = plpar_tce_put_indirect(tbl->it_index, tcenum << 12,
+						    virt_to_abs(tcep), limit);
+			if (unlikely(H_IS_LONG_BUSY(rc))) {
+				sleep_msecs = plpar_get_longbusy_msecs(rc);
+				mdelay(sleep_msecs);
+			}
+		} while (unlikely(H_IS_LONG_BUSY(rc)));
 
 		npages -= limit;
 		tcenum += limit;
 	} while (npages > 0 && !rc);
 
+	if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
+		ret = (int)rc;
+		tce_freemulti_pSeriesLP(tbl, tcenum_start,
+		                        (npages_start - (npages + limit)));
+		return ret;
+	}
+
 	if (rc && printk_ratelimit()) {
 		printk("tce_buildmulti_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
 		printk("\tindex   = 0x%lx\n", (u64)tbl->it_index);
@@ -187,14 +222,23 @@ static void tce_buildmulti_pSeriesLP(str
 		printk("\ttce[0] val = 0x%lx\n", tcep[0]);
 		show_stack(current, (unsigned long *)__get_SP());
 	}
+	return ret;
 }
 
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
+	int sleep_msecs;
 	u64 rc;
 
 	while (npages--) {
-		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, 0);
+		do {
+			rc = plpar_tce_put((u64)tbl->it_index,
+			                   (u64)tcenum << 12, 0);
+			if (unlikely(H_IS_LONG_BUSY(rc))) {
+				sleep_msecs = plpar_get_longbusy_msecs(rc);
+				mdelay(sleep_msecs);
+			}
+		} while (unlikely(H_IS_LONG_BUSY(rc)));
 
 		if (rc && printk_ratelimit()) {
 			printk("tce_free_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
@@ -210,9 +254,17 @@ static void tce_free_pSeriesLP(struct io
 
 static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
 {
+	int sleep_msecs;
 	u64 rc;
 
-	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
+	do {
+		rc = plpar_tce_stuff((u64)tbl->it_index,
+		                     (u64)tcenum << 12, 0, npages);
+		if (unlikely(H_IS_LONG_BUSY(rc))) {
+			sleep_msecs = plpar_get_longbusy_msecs(rc);
+			mdelay(sleep_msecs);
+		}
+	} while (unlikely(H_IS_LONG_BUSY(rc)));
 
 	if (rc && printk_ratelimit()) {
 		printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");
Index: b/arch/powerpc/sysdev/dart_iommu.c
===================================================================
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -147,7 +147,7 @@ static void dart_flush(struct iommu_tabl
 	}
 }
 
-static void dart_build(struct iommu_table *tbl, long index,
+static int dart_build(struct iommu_table *tbl, long index,
 		       long npages, unsigned long uaddr,
 		       enum dma_data_direction direction)
 {
@@ -183,6 +183,7 @@ static void dart_build(struct iommu_tabl
 	} else {
 		dart_dirty = 1;
 	}
+	return 0;
 }
 
 
Index: b/include/asm-powerpc/machdep.h
===================================================================
--- a/include/asm-powerpc/machdep.h
+++ b/include/asm-powerpc/machdep.h
@@ -76,7 +76,7 @@ struct machdep_calls {
 	 * destroyed as well */
 	void		(*hpte_clear_all)(void);
 
-	void		(*tce_build)(struct iommu_table * tbl,
+	int		(*tce_build)(struct iommu_table * tbl,
 				     long index,
 				     long npages,
 				     unsigned long uaddr,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 12/19] powerpc: vio bus support for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (10 preceding siblings ...)
  2008-06-12 22:19 ` [PATCH 11/19] powerpc: iommu enablement for CMO Robert Jennings
@ 2008-06-12 22:19 ` Robert Jennings
  2008-06-13  5:12   ` Stephen Rothwell
  2008-06-23 20:25   ` [PATCH 12/19][v2] " Robert Jennings
  2008-06-12 22:21 ` [PATCH 13/19] powerpc: Verify CMO memory entitlement updates with virtual I/O Robert Jennings
                   ` (6 subsequent siblings)
  18 siblings, 2 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:19 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

Enable bus level entitled memory accounting for Cooperative Memory
Overcommitment (CMO) environments.  The normal code path should not
be affected.

The following changes are made to the VIO bus layer for CMO:
 * add IO memory accounting per device structure.
 * add IO memory entitlement query function to driver structure.
 * during vio bus probe, if CMO is enabled, check that the driver has a
   memory entitlement query function defined; fail the probe if it is not
   (an illustrative driver sketch follows the diffstat below).
 * fail to register a driver if its IO entitlement function is not defined.
 * create a set of dma_ops at the vio level for CMO that will track
   allocations and return DMA failures once entitlement is reached.
   Entitlement will be limited by the overall system entitlement.  Devices
   will have a reserved quantity of memory that is guaranteed; the rest can
   be used as available.
 * expose entitlement, current allocation, desired allocation, and the
   allocation error counter for devices to the user through sysfs
 * provide mechanism for changing a device's desired entitlement at run time
   for devices as an exported function and sysfs tunable
 * track any DMA failures for entitled IO memory for each vio device.
 * check entitlement against available system entitlement on device add
 * track entitlement metrics (high water mark, current usage)
 * provide function to reset high water mark
 * provide minimum and desired entitlement numbers at a bus level
 * provide drivers with a minimum guaranteed entitlement
 * balance available entitlement between devices to satisfy their needs
 * handle system entitlement changes and device hotplug

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
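
For illustration only (not part of the patch): what a virtual I/O
driver must now provide.  The get_io_entitlement member and its use
are taken from the probe path below; the "example_*" names and the
sizing math are placeholders, and the return type is inferred from the
assignment to the device's size_t desired entitlement.

/* Report the worst-case IO memory this driver may have mapped at
 * once, e.g. a ring of 64 page-sized buffers. */
static unsigned long example_get_io_entitlement(struct vio_dev *vdev)
{
	return 64 * PAGE_SIZE;
}

static struct vio_driver example_driver = {
	.id_table           = example_id_table,
	.probe              = example_probe,
	.remove             = example_remove,
	.get_io_entitlement = example_get_io_entitlement,
	.driver = {
		.name = "example",
	},
};
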
 arch/powerpc/kernel/vio.c |  958 ++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/vio.h |   30 +
 2 files changed, 981 insertions(+), 7 deletions(-)

Index: b/arch/powerpc/kernel/vio.c
===================================================================
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1,11 +1,12 @@
 /*
  * IBM PowerPC Virtual I/O Infrastructure Support.
  *
- *    Copyright (c) 2003-2005 IBM Corp.
+ *    Copyright (c) 2003,2008 IBM Corp.
  *     Dave Engebretsen engebret@us.ibm.com
  *     Santiago Leon santil@us.ibm.com
  *     Hollis Blanchard <hollisb@us.ibm.com>
  *     Stephen Rothwell
+ *     Robert Jennings <rcjenn@us.ibm.com>
  *
  *      This program is free software; you can redistribute it and/or
  *      modify it under the terms of the GNU General Public License
@@ -46,6 +47,561 @@ static struct vio_dev vio_bus_device  = 
 	.dev.bus = &vio_bus_type,
 };
 
+/**
+ * vio_cmo_pool - A pool of IO memory for CMO use
+ *
+ * @size: The size of the pool in bytes
+ * @free: The amount of free memory in the pool
+ */
+struct vio_cmo_pool {
+	size_t size;
+	size_t free;
+};
+
+/* How many ms to delay queued balance work */
+#define VIO_CMO_BALANCE_DELAY 100
+
+/* Portion out IO memory to CMO devices by this chunk size */
+#define VIO_CMO_BALANCE_CHUNK 131072
+
+/**
+ * vio_cmo_dev_entry - A device that is CMO-enabled and requires entitlement
+ *
+ * @viodev: struct vio_dev pointer
+ * @list: pointer to other devices on bus that are being tracked
+ */
+struct vio_cmo_dev_entry {
+	struct vio_dev *viodev;
+	struct list_head list;
+};
+
+/**
+ * vio_cmo - VIO bus accounting structure for CMO entitlement
+ *
+ * @lock: spinlock for entire structure
+ * @balance_q: work queue for balancing system entitlement
+ * @device_list: list of CMO-enabled devices requiring entitlement
+ * @entitled: total system entitlement in bytes
+ * @reserve: pool of memory from which devices reserve entitlement, incl. spare
+ * @excess: pool of excess entitlement not needed for device reserves or spare
+ * @spare: IO memory for device hotplug functionality
+ * @min: minimum necessary for system operation
+ * @desired: desired memory for system operation
+ * @curr: bytes currently allocated
+ * @high: high water mark for IO data usage
+ */
+struct vio_cmo {
+	spinlock_t lock;
+	struct delayed_work balance_q;
+	struct list_head device_list;
+	size_t entitled;
+	struct vio_cmo_pool reserve;
+	struct vio_cmo_pool excess;
+	size_t spare;
+	size_t min;
+	size_t desired;
+	size_t curr;
+	size_t high;
+} vio_cmo;
+
+/**
+ * vio_cmo_num_OF_devs - Count the number of OF devices that have DMA windows
+ */
+static int vio_cmo_num_OF_devs(void)
+{
+	struct device_node *node_vroot;
+	int count = 0;
+
+	/*
+	 * Count the number of vdevice entries with an
+	 * ibm,my-dma-window OF property
+	 */
+	node_vroot = of_find_node_by_name(NULL, "vdevice");
+	if (node_vroot) {
+		struct device_node *of_node;
+		struct property *prop;
+
+		for (of_node = node_vroot->child; of_node != NULL;
+		                of_node = of_node->sibling) {
+			prop = of_find_property(of_node, "ibm,my-dma-window",
+			                       NULL);
+			if (prop)
+				count++;
+		}
+	}
+	return count;
+}
+
+/**
+ * vio_cmo_alloc - allocate IO memory for CMO-enabled devices
+ *
+ * @viodev: VIO device requesting IO memory
+ * @size: size of allocation requested
+ *
+ * Allocations come from memory reserved for the devices and any excess
+ * IO memory available to all devices.  The spare pool used to service
+ * hotplug must be equal to %VIO_CMO_MIN_ENT for the excess pool to be
+ * made available.
+ *
+ * Return codes:
+ *  0 for successful allocation and -ENOMEM for a failure
+ */
+static inline int vio_cmo_alloc(struct vio_dev *viodev, size_t size)
+{
+	unsigned long flags;
+	size_t reserve_free = 0;
+	size_t excess_free = 0;
+	int ret = -ENOMEM;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+
+	/* Determine the amount of free entitlement available in reserve */
+	if (viodev->cmo.entitled > viodev->cmo.allocated)
+		reserve_free = viodev->cmo.entitled - viodev->cmo.allocated;
+
+	/* If spare is not fulfilled, the excess pool can not be used. */
+	if (vio_cmo.spare >= VIO_CMO_MIN_ENT)
+		excess_free = vio_cmo.excess.free;
+
+	/* The request can be satisfied */
+	if ((reserve_free + excess_free) >= size) {
+		vio_cmo.curr += size;
+		if (vio_cmo.curr > vio_cmo.high)
+			vio_cmo.high = vio_cmo.curr;
+		viodev->cmo.allocated += size;
+		size -= min(reserve_free, size);
+		vio_cmo.excess.free -= size;
+		ret = 0;
+	}
+
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	return ret;
+}
+
+/**
+ * vio_cmo_dealloc - deallocate IO memory from CMO-enabled devices
+ * @viodev: VIO device freeing IO memory
+ * @size: size of deallocation
+ *
+ * IO memory is freed by the device back to the correct memory pools.
+ * The spare pool is replenished first from either memory pool, then
+ * the reserve pool is used to reduce device entitlement, the excess
+ * pool is used to increase the reserve pool toward the desired entitlement
+ * target, and then the remaining memory is returned to the pools.
+ *
+ */
+static inline void vio_cmo_dealloc(struct vio_dev *viodev, size_t size)
+{
+	unsigned long flags;
+	size_t spare_needed = 0;
+	size_t excess_freed = 0;
+	size_t reserve_freed = size;
+	size_t tmp;
+	int balance = 0;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	vio_cmo.curr -= size;
+
+	/* Amount of memory freed from the excess pool */
+	if (viodev->cmo.allocated > viodev->cmo.entitled) {
+		excess_freed = min(reserve_freed, (viodev->cmo.allocated -
+		                                   viodev->cmo.entitled));
+		reserve_freed -= excess_freed;
+	}
+
+	/* Remove allocation from device */
+	viodev->cmo.allocated -= (reserve_freed + excess_freed);
+
+	/* Spare is a subset of the reserve pool, replenish it first. */
+	spare_needed = VIO_CMO_MIN_ENT - vio_cmo.spare;
+
+	/*
+	 * Replenish the spare in the reserve pool from the excess pool.
+	 * This moves entitlement into the reserve pool.
+	 */
+	if (spare_needed && excess_freed) {
+		tmp = min(excess_freed, spare_needed);
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.reserve.size += tmp;
+		vio_cmo.spare += tmp;
+		excess_freed -= tmp;
+		spare_needed -= tmp;
+		balance = 1;
+	}
+
+	/*
+	 * Replenish the spare in the reserve pool from the reserve pool.
+	 * This removes entitlement from the device down to VIO_CMO_MIN_ENT,
+	 * if needed, and gives it to the spare pool. The amount of used
+	 * memory in this pool does not change.
+	 */
+	if (spare_needed && reserve_freed) {
+		tmp = min(spare_needed, min(reserve_freed,
+		                            (viodev->cmo.entitled -
+		                             VIO_CMO_MIN_ENT)));
+
+		vio_cmo.spare += tmp;
+		viodev->cmo.entitled -= tmp;
+		reserve_freed -= tmp;
+		spare_needed -= tmp;
+		balance = 1;
+	}
+
+	/*
+	 * Increase the reserve pool until the desired allocation is met.
+	 * Move an allocation freed from the excess pool into the reserve
+	 * pool and schedule a balance operation.
+	 */
+	if (excess_freed && (vio_cmo.desired > vio_cmo.reserve.size)) {
+		tmp = min(excess_freed, (vio_cmo.desired - vio_cmo.reserve.size));
+
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.reserve.size += tmp;
+		excess_freed -= tmp;
+		balance = 1;
+	}
+
+	/* Return memory from the excess pool to that pool */
+	if (excess_freed)
+		vio_cmo.excess.free += excess_freed;
+
+	if (balance)
+		schedule_delayed_work(&vio_cmo.balance_q, VIO_CMO_BALANCE_DELAY);
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_entitlement_update - Manage system entitlement changes
+ *
+ * @new_entitlement: new system entitlement to attempt to accommodate
+ *
+ * Increases in entitlement will be used to fulfill the spare entitlement
+ * and the rest is given to the excess pool.  Decreases, if they are
+ * possible, come from the excess pool and from unused device entitlement
+ *
+ * Returns: 0 on success, -ENOMEM when change can not be made
+ */
+int vio_cmo_entitlement_update(size_t new_entitlement)
+{
+	struct vio_dev *viodev;
+	struct vio_cmo_dev_entry *dev_ent;
+	unsigned long flags;
+	size_t avail, delta, tmp;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+
+	/* Entitlement increases */
+	if (new_entitlement > vio_cmo.entitled) {
+		delta = new_entitlement - vio_cmo.entitled;
+
+		/* Fulfill spare allocation */
+		if (vio_cmo.spare < VIO_CMO_MIN_ENT) {
+			tmp = min(delta, (VIO_CMO_MIN_ENT - vio_cmo.spare));
+			vio_cmo.spare += tmp;
+			vio_cmo.reserve.size += tmp;
+			delta -= tmp;
+		}
+
+		/* Remaining new allocation goes to the excess pool */
+		vio_cmo.entitled += delta;
+		vio_cmo.excess.size += delta;
+		vio_cmo.excess.free += delta;
+
+		goto out;
+	}
+
+	/* Entitlement decreases */
+	delta = vio_cmo.entitled - new_entitlement;
+	avail = vio_cmo.excess.free;
+
+	/*
+	 * Need to check how much unused entitlement each device can
+	 * sacrifice to fulfill entitlement change.
+	 */
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+		if (avail >= delta)
+			break;
+
+		viodev = dev_ent->viodev;
+		if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+		    (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+				avail += viodev->cmo.entitled -
+				         max_t(size_t, viodev->cmo.allocated,
+				               VIO_CMO_MIN_ENT);
+	}
+
+	if (delta <= avail) {
+		vio_cmo.entitled -= delta;
+
+		/* Take entitlement from the excess pool first */
+		tmp = min(vio_cmo.excess.free, delta);
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.excess.free -= tmp;
+		delta -= tmp;
+
+		/*
+		 * Remove all but VIO_CMO_MIN_ENT bytes from devices
+		 * until entitlement change is served
+		 */
+		list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+			if (!delta)
+				break;
+
+			viodev = dev_ent->viodev;
+			tmp = 0;
+			if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+			    (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+				tmp = viodev->cmo.entitled -
+				      max_t(size_t, viodev->cmo.allocated,
+				            VIO_CMO_MIN_ENT);
+			viodev->cmo.entitled -= min(tmp, delta);
+			delta -= min(tmp, delta);
+		}
+	} else {
+		spin_unlock_irqrestore(&vio_cmo.lock, flags);
+		return -ENOMEM;
+	}
+
+out:
+	schedule_delayed_work(&vio_cmo.balance_q, 0);
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	return 0;
+}
+EXPORT_SYMBOL(vio_cmo_entitlement_update);
+
+/**
+ * vio_cmo_balance - Balance entitlement among devices
+ *
+ * @work: work queue structure for this operation
+ *
+ * Any system entitlement above the minimum needed for devices, or
+ * already allocated to devices, can be distributed to the devices.
+ * The list of devices is iterated through to recalculate the desired
+ * entitlement level and to determine how much entitlement above the
+ * minimum entitlement is allocated to devices.
+ *
+ * Small chunks of the available entitlement are given to devices until
+ * their requirements are fulfilled or there is no entitlement left to give.
+ * Upon completion sizes of the reserve and excess pools are calculated.
+ *
+ * The system minimum entitlement level is also recalculated here.
+ * Entitlement will be reserved for devices even after vio_bus_remove to
+ * accommodate reloading the driver.  The OF tree is walked to count the
+ * number of devices present and this will remove entitlement for devices
+ * that have actually left the system after having vio_bus_remove called.
+ */
+static void vio_cmo_balance(struct work_struct *work)
+{
+	struct vio_cmo *cmo;
+	struct vio_dev *viodev;
+	struct vio_cmo_dev_entry *dev_ent;
+	unsigned long flags;
+	size_t avail = 0, level, chunk, need;
+	int devcount = 0, fulfilled;
+
+	cmo = container_of(work, struct vio_cmo, balance_q.work);
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+
+	/* Calculate minimum entitlement and fulfill spare */
+	cmo->min = vio_cmo_num_OF_devs() * VIO_CMO_MIN_ENT;
+	BUG_ON(cmo->min > cmo->entitled);
+	cmo->spare = min_t(size_t, VIO_CMO_MIN_ENT, (cmo->entitled - cmo->min));
+	cmo->min += cmo->spare;
+	cmo->desired = cmo->min;
+
+	/*
+	 * Determine how much entitlement is available and reset device
+	 * entitlements
+	 */
+	avail = cmo->entitled - cmo->spare;
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+		viodev = dev_ent->viodev;
+		devcount++;
+		viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+		cmo->desired += (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+		avail -= max_t(size_t, viodev->cmo.allocated, VIO_CMO_MIN_ENT);
+	}
+
+	/*
+	 * Having provided each device with the minimum entitlement, loop
+	 * over the devices portioning out the remaining entitlement
+	 * until there is nothing left.
+	 */
+	level = VIO_CMO_MIN_ENT;
+	while (avail) {
+		fulfilled = 0;
+		list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+			viodev = dev_ent->viodev;
+
+			if (viodev->cmo.desired <= level) {
+				fulfilled++;
+				continue;
+			}
+
+			/*
+			 * Give the device up to VIO_CMO_BALANCE_CHUNK
+			 * bytes of entitlement, but do not exceed the
+			 * desired level of entitlement for the device.
+			 */
+			chunk = min_t(size_t, avail, VIO_CMO_BALANCE_CHUNK);
+			chunk = min(chunk, (viodev->cmo.desired -
+			                    viodev->cmo.entitled));
+			viodev->cmo.entitled += chunk;
+
+			/*
+			 * If the memory for this entitlement increase was
+			 * already allocated to the device it does not come
+			 * from the available pool being portioned out.
+			 */
+			need = max(viodev->cmo.allocated, viodev->cmo.entitled)-
+			       max(viodev->cmo.allocated, level);
+			avail -= need;
+
+		}
+		if (fulfilled == devcount)
+			break;
+		level += VIO_CMO_BALANCE_CHUNK;
+	}
+
+	/* Calculate new reserve and excess pool sizes */
+	cmo->reserve.size = cmo->min;
+	cmo->excess.free = 0;
+	cmo->excess.size = 0;
+	need = 0;
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+		viodev = dev_ent->viodev;
+		/* Calculated reserve size above the minimum entitlement */
+		if (viodev->cmo.entitled)
+			cmo->reserve.size += (viodev->cmo.entitled -
+			                      VIO_CMO_MIN_ENT);
+		/* Calculated used excess entitlement */
+		if (viodev->cmo.allocated > viodev->cmo.entitled)
+			need += viodev->cmo.allocated - viodev->cmo.entitled;
+	}
+	cmo->excess.size = cmo->entitled - cmo->reserve.size;
+	cmo->excess.free = cmo->excess.size - need;
+
+	cancel_delayed_work(container_of(work, struct delayed_work, work));
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+#define RND_TO_PAGE(x) (((x / PAGE_SIZE) + (x % PAGE_SIZE ? 1 : 0)) * PAGE_SIZE)
+
+static void *vio_dma_iommu_alloc_coherent(struct device *dev, size_t size,
+                                          dma_addr_t *dma_handle, gfp_t flag)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	void *ret;
+
+	if (vio_cmo_alloc(viodev, RND_TO_PAGE(size))) {
+		atomic_inc(&viodev->cmo.allocs_failed);
+		return NULL;
+	}
+
+	ret = dma_iommu_ops.alloc_coherent(dev, size, dma_handle, flag);
+	if (unlikely(ret == NULL)) {
+		vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+		atomic_inc(&viodev->cmo.allocs_failed);
+	}
+
+	return ret;
+}
+
+static void vio_dma_iommu_free_coherent(struct device *dev, size_t size,
+                                        void *vaddr, dma_addr_t dma_handle)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+
+	dma_iommu_ops.free_coherent(dev, size, vaddr, dma_handle);
+
+	vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+}
+
+static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
+                                           size_t size,
+                                           enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	dma_addr_t ret = DMA_ERROR_CODE;
+
+	if (vio_cmo_alloc(viodev, RND_TO_PAGE(size))) {
+		atomic_inc(&viodev->cmo.allocs_failed);
+		return ret;
+	}
+
+	ret = dma_iommu_ops.map_single(dev, vaddr, size, direction);
+	if (unlikely(dma_mapping_error(ret))) {
+		vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+		atomic_inc(&viodev->cmo.allocs_failed);
+	}
+
+	return ret;
+}
+
+static void vio_dma_iommu_unmap_single(struct device *dev,
+		dma_addr_t dma_handle, size_t size,
+		enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+
+	dma_iommu_ops.unmap_single(dev, dma_handle, size, direction);
+
+	vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+}
+
+static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
+                                int nelems, enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	struct scatterlist *sgl;
+	int ret, count = 0;
+	size_t alloc_size = 0;
+
+	for (sgl = sglist; count < nelems; count++, sgl++)
+		alloc_size += RND_TO_PAGE(sgl->length);
+
+	if (vio_cmo_alloc(viodev, alloc_size)) {
+		atomic_inc(&viodev->cmo.allocs_failed);
+		return 0;
+	}
+
+	ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction);
+
+	if (unlikely(!ret)) {
+		vio_cmo_dealloc(viodev, alloc_size);
+		atomic_inc(&viodev->cmo.allocs_failed);
+	}
+
+	return ret;
+}
+
+static void vio_dma_iommu_unmap_sg(struct device *dev,
+		struct scatterlist *sglist, int nelems,
+		enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	struct scatterlist *sgl;
+	size_t alloc_size = 0;
+	int count = 0;
+
+	for (sgl = sglist; count < nelems; count++, sgl++)
+		alloc_size += RND_TO_PAGE(sgl->length);
+
+	dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction);
+
+	vio_cmo_dealloc(viodev, alloc_size);
+}
+
+struct dma_mapping_ops vio_dma_mapping_ops = {
+	.alloc_coherent = vio_dma_iommu_alloc_coherent,
+	.free_coherent  = vio_dma_iommu_free_coherent,
+	.map_single     = vio_dma_iommu_map_single,
+	.unmap_single   = vio_dma_iommu_unmap_single,
+	.map_sg         = vio_dma_iommu_map_sg,
+	.unmap_sg       = vio_dma_iommu_unmap_sg,
+};
+
 static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 {
 	const unsigned char *dma_window;
@@ -108,16 +664,113 @@ static int vio_bus_probe(struct device *
 	struct vio_dev *viodev =3D to_vio_dev(dev);
 	struct vio_driver *viodrv =3D to_vio_driver(dev->driver);
 	const struct vio_device_id *id;
+	struct vio_cmo_dev_entry *dev_ent;
+	unsigned long flags;
+	size_t size;
 	int error = -ENODEV;
 
 	if (!viodrv->probe)
 		return error;
 
+	memset(&viodev->cmo, 0, sizeof(viodev->cmo));
+
+	/*
+	 * Determine the device's IO memory entitlement needs, attempting
+	 * to satisfy the system minimum entitlement at first and scheduling
+	 * a balance operation to take care of the rest at a later time.
+	 */
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		/* Check that the driver is CMO enabled and get entitlement */
+		if (!viodrv->get_io_entitlement) {
+			dev_err(dev, "%s: device driver does not support CMO\n",
+			        __func__);
+			return -EINVAL;
+		}
+
+		/*
+		 * Check to see that device has a DMA window and configure
+		 * entitlement for the device.
+		 */
+		if (of_get_property(viodev->dev.archdata.of_node,
+		                    "ibm,my-dma-window", NULL)) {
+			viodev->cmo.desired = viodrv->get_io_entitlement(viodev);
+			if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
+				viodev->cmo.desired = VIO_CMO_MIN_ENT;
+			size = VIO_CMO_MIN_ENT;
+
+			dev_ent = kmalloc(sizeof(struct vio_cmo_dev_entry),
+			                  GFP_KERNEL);
+			if (!dev_ent)
+				return -ENOMEM;
+
+			dev_ent->viodev = viodev;
+			spin_lock_irqsave(&vio_cmo.lock, flags);
+			list_add(&dev_ent->list, &vio_cmo.device_list);
+		} else {
+			viodev->cmo.desired = 0;
+			size = 0;
+			spin_lock_irqsave(&vio_cmo.lock, flags);
+		}
+
+		/*
+		 * If the needs for vio_cmo.min have not changed since they
+		 * were last set, the number of devices in the OF tree has
+		 * been constant and the IO memory for this is already in
+		 * the reserve pool.
+		 */
+		if (vio_cmo.min == ((vio_cmo_num_OF_devs() + 1) *
+		                    VIO_CMO_MIN_ENT)) {
+			/* Update desired entitlement if the device requires it */
+			if (size)
+				vio_cmo.desired += (viodev->cmo.desired -
+			                        VIO_CMO_MIN_ENT);
+		} else {
+			size_t tmp;
+
+			tmp = vio_cmo.spare + vio_cmo.excess.free;
+			if (tmp < size) {
+				dev_err(dev, "%s: insufficient free "
+				        "entitlement to add device. "
+				        "Need %lu, have %lu\n", __func__,
+					size, (vio_cmo.spare + tmp));
+				error = -ENOMEM;
+				goto out_unlock;
+			}
+
+			/* Use excess pool first to fulfill request */
+			tmp = min(size, vio_cmo.excess.free);
+			vio_cmo.excess.free -= tmp;
+			vio_cmo.excess.size -= tmp;
+			vio_cmo.reserve.size += tmp;
+
+			/* Use spare if excess pool was insufficient */
+			vio_cmo.spare -= size - tmp;
+
+			/* Update bus accounting */
+			vio_cmo.min += size;
+			vio_cmo.desired += viodev->cmo.desired;
+		}
+
+		/* Give the device the minimum system entitlement */
+		viodev->cmo.entitled = size;
+
+		/* Balance IO usage if needed */
+		if (viodev->cmo.entitled < viodev->cmo.desired)
+			schedule_delayed_work(&vio_cmo.balance_q,
+			                      VIO_CMO_BALANCE_DELAY);
+
+		spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	}
+
 	id = vio_match_device(viodrv->id_table, viodev);
 	if (id)
 		error = viodrv->probe(viodev, id);
 
 	return error;
+
+out_unlock:
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	return error;
 }
 
 /* convert from struct device to struct vio_dev and pass to driver. */
@@ -125,12 +778,89 @@ static int vio_bus_remove(struct device 
 {
 	struct vio_dev *viodev =3D to_vio_dev(dev);
 	struct vio_driver *viodrv =3D to_vio_driver(dev->driver);
+	struct vio_cmo_dev_entry *dev_ent;
+	struct device *devptr;
+	unsigned long flags;
+	size_t tmp;
+	int ret = 1;
+
+	/*
+	 * Hold a reference to the device after the remove function is called
+	 * to allow for CMO accounting cleanup for the device.
+	 */
+	devptr = get_device(dev);
 
 	if (viodrv->remove)
-		return viodrv->remove(viodev);
+		ret = viodrv->remove(viodev);
+
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		spin_lock_irqsave(&vio_cmo.lock, flags);
+		if (viodev->cmo.allocated) {
+			dev_err(dev, "%s: device had %lu bytes of IO "
+			        "allocated after remove operation.\n",
+			        __func__, viodev->cmo.allocated);
+			BUG();
+		}
+
+		/*
+		 * Remove the device from the device list being maintained for
+		 * CMO enabled devices.
+		 */
+		list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+			if (viodev == dev_ent->viodev) {
+				list_del(&dev_ent->list);
+				kfree(dev_ent);
+				break;
+			}
 
-	/* driver can't remove */
-	return 1;
+		/*
+		 * Devices may not require any entitlement and they do not need
+		 * to be processed.  Otherwise, return the device's entitlement
+		 * back to the pools.
+		 */
+		if (viodev->cmo.entitled) {
+			/*
+			 * This device has not yet left the OF tree, its
+			 * minimum entitlement remains in vio_cmo.min and
+			 * vio_cmo.desired
+			 */
+			vio_cmo.desired -= (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+
+			/*
+			 * Save min allocation for device in reserve as long
+			 * as it exists in OF tree as determined by later
+			 * balance operation
+			 */
+			viodev->cmo.entitled -= VIO_CMO_MIN_ENT;
+
+			/* Replenish spare from freed reserve pool */
+			if (viodev->cmo.entitled && (vio_cmo.spare < VIO_CMO_MIN_ENT)) {
+				tmp = min(viodev->cmo.entitled, (VIO_CMO_MIN_ENT -
+				                                 vio_cmo.spare));
+				vio_cmo.spare += tmp;
+				viodev->cmo.entitled -= tmp;
+			}
+
+			/* Remaining reserve goes to excess pool */
+			vio_cmo.excess.size += viodev->cmo.entitled;
+			vio_cmo.excess.free += viodev->cmo.entitled;
+			vio_cmo.reserve.size -= viodev->cmo.entitled;
+
+			/*
+			 * Until the device is removed it will keep a
+			 * minimum entitlement; this will guarantee that
+			 * a module unload/load will succeed.
+			 */
+			viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+			viodev->cmo.desired = VIO_CMO_MIN_ENT;
+			atomic_set(&viodev->cmo.allocs_failed, 0);
+		}
+
+		spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	}
+
+	put_device(devptr);
+	return ret;
 }
 
 /**
@@ -142,6 +872,13 @@ int vio_register_driver(struct vio_drive
 	printk(KERN_DEBUG "%s: driver %s registering\n", __func__,
 		viodrv->driver.name);
 
+	if (firmware_has_feature(FW_FEATURE_CMO) &&
+	    !viodrv->get_io_entitlement) {
+		printk(KERN_ERR "%s: driver %s does not support CMO, "
+		       "not registered.\n", __func__, viodrv->driver.name);
+		return -EINVAL;
+	}
+
 	/* fill in 'struct driver' fields */
 	viodrv->driver.bus = &vio_bus_type;
 
@@ -215,7 +952,11 @@ struct vio_dev *vio_register_device_node
 			viodev->unit_address = *unit_address;
 	}
 	viodev->dev.archdata.of_node = of_node_get(of_node);
-	viodev->dev.archdata.dma_ops = &dma_iommu_ops;
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		vio_dma_mapping_ops.dma_supported = dma_iommu_ops.dma_supported;
+		viodev->dev.archdata.dma_ops = &vio_dma_mapping_ops;
+	} else
+		viodev->dev.archdata.dma_ops = &dma_iommu_ops;
 	viodev->dev.archdata.dma_data = vio_build_iommu_table(viodev);
 	viodev->dev.archdata.numa_node = of_node_to_nid(of_node);
 
@@ -244,6 +985,11 @@ static int __init vio_bus_init(void)
 {
 	int err;
 	struct device_node *node_vroot;
+	struct hvcall_mpp_data mpp_data;
+
+	memset(&vio_cmo, 0, sizeof(struct vio_cmo));
+	spin_lock_init(&vio_cmo.lock);
+	INIT_LIST_HEAD(&vio_cmo.device_list);
 
 	err = bus_register(&vio_bus_type);
 	if (err) {
@@ -263,6 +1009,44 @@ static int __init vio_bus_init(void)
 	}
 
 	node_vroot = of_find_node_by_name(NULL, "vdevice");
+
+	/* Set the reserve pool based on the number of virtual devices in OF */
+	if (firmware_has_feature(FW_FEATURE_CMO)) {
+		INIT_DELAYED_WORK(&vio_cmo.balance_q, vio_cmo_balance);
+
+		/* Get current system entitlement */
+		err = h_get_mpp(&mpp_data);
+
+		/*
+		 * On failure, continue with entitlement set to 0, will panic()
+		 * later when spare is reserved and lights will be set.
+		 */
+		if (err != H_SUCCESS) {
+			printk(KERN_ERR "%s: unable to determine system IO "
+			       "entitlement. (%d)\n", __func__, err);
+			vio_cmo.entitled = 0;
+		} else {
+			vio_cmo.entitled = mpp_data.entitled_mem;
+		}
+
+		/* Set reservation and check against entitlement */
+		vio_cmo.spare = VIO_CMO_MIN_ENT;
+		vio_cmo.reserve.size = vio_cmo.spare;
+		vio_cmo.reserve.size += (vio_cmo_num_OF_devs() *
+		                         VIO_CMO_MIN_ENT);
+		if (vio_cmo.reserve.size > vio_cmo.entitled) {
+			printk(KERN_ERR "%s: insufficient system entitlement\n",
+			       __func__);
+			panic("%s: Insufficient system entitlement", __func__);
+		}
+
+		/* Set the remaining accounting variables */
+		vio_cmo.excess.size = vio_cmo.entitled - vio_cmo.reserve.size;
+		vio_cmo.excess.free = vio_cmo.excess.size;
+		vio_cmo.min = vio_cmo.reserve.size;
+		vio_cmo.desired = vio_cmo.reserve.size;
+	}
+
 	if (node_vroot) {
 		struct device_node *of_node;
 
@@ -280,6 +1064,71 @@ static int __init vio_bus_init(void)
 }
 __initcall(vio_bus_init);
=20
+/**
+ * vio_cmo_set_dev_desired - Set desired entitlement for a device
+ *
+ * @viodev: struct vio_dev for device to alter
+ * @desired: new desired entitlement level in bytes
+ *
+ * For use by devices to request a change to their entitlement at runtime or
+ * through sysfs.  The desired entitlement level is changed and a balancing
+ * of system resources is scheduled to run in the future.
+ */
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired)
+{
+	unsigned long flags;
+	struct vio_cmo_dev_entry *dev_ent;
+
+	if (!firmware_has_feature(FW_FEATURE_CMO))
+		return;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	if (desired < VIO_CMO_MIN_ENT)
+		desired = VIO_CMO_MIN_ENT;
+
+	/*
+	 * Changes will not be made for devices not in the device list.
+	 * If it is not in the device list, then no driver is loaded
+	 * for the device and it can not receive entitlement.
+	 */
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+		if (viodev == dev_ent->viodev)
+			break;
+	if (!dev_ent)
+		return;
+
+	/* Increase/decrease in desired device entitlement */
+	if (desired >= viodev->cmo.desired) {
+		/* Just bump the bus and device values prior to a balance */
+		vio_cmo.desired += desired - viodev->cmo.desired;
+		viodev->cmo.desired = desired;
+	} else {
+		/* Decrease bus and device values for desired entitlement */
+		vio_cmo.desired -= viodev->cmo.desired - desired;
+		viodev->cmo.desired = desired;
+		/*
+		 * If less entitlement is desired than current entitlement, move
+		 * any reserve memory in the change region to the excess pool.
+		 */
+		if (viodev->cmo.entitled > desired) {
+			vio_cmo.reserve.size -= viodev->cmo.entitled - desired;
+			vio_cmo.excess.size += viodev->cmo.entitled - desired;
+			/*
+			 * If entitlement moving from the reserve pool to the
+			 * excess pool is currently unused, add to the excess
+			 * free counter.
+			 */
+			if (viodev->cmo.allocated < viodev->cmo.entitled)
+				vio_cmo.excess.free += viodev->cmo.entitled -
+				                       max(viodev->cmo.allocated, desired);
+			viodev->cmo.entitled = desired;
+		}
+	}
+	schedule_delayed_work(&vio_cmo.balance_q, 0);
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+EXPORT_SYMBOL(vio_cmo_set_dev_desired);
+
 static ssize_t name_show(struct device *dev,
 		struct device_attribute *attr, char *buf)
 {
@@ -294,9 +1143,107 @@ static ssize_t devspec_show(struct devic
 	return sprintf(buf, "%s\n", of_node ? of_node->full_name : "none");
 }
 
+#define viodev_cmo_rd_attr(name)                                        \
+static ssize_t viodev_cmo_##name##_show(struct device *dev,             \
+                                        struct device_attribute *attr,  \
+                                         char *buf)                     \
+{                                                                       \
+	return sprintf(buf, "%lu\n", to_vio_dev(dev)->cmo.name);        \
+}
+
+static ssize_t viodev_cmo_allocs_failed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	return sprintf(buf, "%d\n", atomic_read(&viodev->cmo.allocs_failed));
+}
+
+static ssize_t viodev_cmo_allocs_failed_reset(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	atomic_set(&viodev->cmo.allocs_failed, 0);
+	return count;
+}
+
+static ssize_t viodev_cmo_desired_set(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	size_t new_desired;
+	int ret;
+
+	ret = strict_strtoul(buf, 10, &new_desired);
+	if (ret)
+		return ret;
+
+	vio_cmo_set_dev_desired(viodev, new_desired);
+	return count;
+}
+
+viodev_cmo_rd_attr(desired);
+viodev_cmo_rd_attr(entitled);
+viodev_cmo_rd_attr(allocated);
+
 static struct device_attribute vio_dev_attrs[] = {
 	__ATTR_RO(name),
 	__ATTR_RO(devspec),
+	__ATTR(cmo_desired,       S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+	       viodev_cmo_desired_show, viodev_cmo_desired_set),
+	__ATTR(cmo_entitled,      S_IRUGO, viodev_cmo_entitled_show,      NULL),
+	__ATTR(cmo_allocated,     S_IRUGO, viodev_cmo_allocated_show,     NULL),
+	__ATTR(cmo_allocs_failed, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+	       viodev_cmo_allocs_failed_show, viodev_cmo_allocs_failed_reset),
+	__ATTR_NULL
+};
+
+#define viobus_cmo_rd_attr(name)                                        \
+static ssize_t                                                          \
+viobus_cmo_##name##_show(struct bus_type *bt, char *buf)                \
+{                                                                       \
+	return sprintf(buf, "%lu\n", vio_cmo.name);                     \
+}
+
+#define viobus_cmo_pool_rd_attr(name, var)                              \
+static ssize_t                                                          \
+viobus_cmo_##name##_pool_show_##var(struct bus_type *bt, char *buf)     \
+{                                                                       \
+	return sprintf(buf, "%lu\n", vio_cmo.name.var);                 \
+}
+
+static ssize_t viobus_cmo_high_reset(struct bus_type *bt, const char *buf,
+                                     size_t count)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	vio_cmo.high = vio_cmo.curr;
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+
+	return count;
+}
+
+viobus_cmo_rd_attr(entitled);
+viobus_cmo_pool_rd_attr(reserve, size);
+viobus_cmo_pool_rd_attr(excess, size);
+viobus_cmo_pool_rd_attr(excess, free);
+viobus_cmo_rd_attr(spare);
+viobus_cmo_rd_attr(min);
+viobus_cmo_rd_attr(desired);
+viobus_cmo_rd_attr(curr);
+viobus_cmo_rd_attr(high);
+
+static struct bus_attribute vio_bus_attrs[] = {
+	__ATTR(cmo_entitled, S_IRUGO, viobus_cmo_entitled_show, NULL),
+	__ATTR(cmo_reserve_size, S_IRUGO, viobus_cmo_reserve_pool_show_size, NULL),
+	__ATTR(cmo_excess_size, S_IRUGO, viobus_cmo_excess_pool_show_size, NULL),
+	__ATTR(cmo_excess_free, S_IRUGO, viobus_cmo_excess_pool_show_free, NULL),
+	__ATTR(cmo_spare,   S_IRUGO, viobus_cmo_spare_show,   NULL),
+	__ATTR(cmo_min,     S_IRUGO, viobus_cmo_min_show,     NULL),
+	__ATTR(cmo_desired, S_IRUGO, viobus_cmo_desired_show, NULL),
+	__ATTR(cmo_curr,    S_IRUGO, viobus_cmo_curr_show,    NULL),
+	__ATTR(cmo_high,    S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+	       viobus_cmo_high_show, viobus_cmo_high_reset),
 	__ATTR_NULL
 };
 
@@ -335,6 +1282,7 @@ static int vio_hotplug(struct device *de
 static struct bus_type vio_bus_type = {
 	.name = "vio",
 	.dev_attrs = vio_dev_attrs,
+	.bus_attrs = vio_bus_attrs,
 	.uevent = vio_hotplug,
 	.match = vio_bus_match,
 	.probe = vio_bus_probe,
Index: b/include/asm-powerpc/vio.h
===================================================================
--- a/include/asm-powerpc/vio.h
+++ b/include/asm-powerpc/vio.h
@@ -39,16 +39,34 @@
 #define VIO_IRQ_DISABLE		0UL
 #define VIO_IRQ_ENABLE		1UL
 
+/*
+ * VIO CMO minimum entitlement for all devices and spare entitlement
+ */
+#define VIO_CMO_MIN_ENT 1562624
+
 struct iommu_table;
 
-/*
- * The vio_dev structure is used to describe virtual I/O devices.
+/**
+ * vio_dev - This structure is used to describe virtual I/O devices.
+ *
+ * @desired: set from return of driver's get_io_entitlement() function
+ * @entitled: bytes of IO data that has been reserved for this device.
+ * @entitled_target: Target IO data allocation for device. This may be set
+ *   lower than entitled to enable balancing; can not be larger than entitled.
+ * @allocated: bytes of IO data currently in use by the device.
+ * @allocs_failed: number of DMA failures due to insufficient entitlement.
  */
 struct vio_dev {
 	const char *name;
 	const char *type;
 	uint32_t unit_address;
 	unsigned int irq;
+	struct {
+		size_t desired;
+		size_t entitled;
+		size_t allocated;
+		atomic_t allocs_failed;
+	} cmo;
 	struct device dev;
 };
 
@@ -56,12 +74,20 @@ struct vio_driver {
 	const struct vio_device_id *id_table;
 	int (*probe)(struct vio_dev *dev, const struct vio_device_id *id);
 	int (*remove)(struct vio_dev *dev);
+	/* A driver must have a get_io_entitlement() function to
+	 * be loaded in a CMO environment; it may return 0 if no I/O
+	 * entitlement is needed.
+	 */
+	unsigned long (*get_io_entitlement)(struct vio_dev *dev);
 	struct device_driver driver;
 };
 
 extern int vio_register_driver(struct vio_driver *drv);
 extern void vio_unregister_driver(struct vio_driver *drv);
 
+extern int vio_cmo_entitlement_update(size_t);
+extern void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired);
+
 extern void __devinit vio_unregister_device(struct vio_dev *dev);
 
 struct device_node;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 13/19] powerpc: Verify CMO memory entitlement updates with virtual I/O
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (11 preceding siblings ...)
  2008-06-12 22:19 ` [PATCH 12/19] powerpc: vio bus support " Robert Jennings
@ 2008-06-12 22:21 ` Robert Jennings
  2008-06-12 22:21 ` [PATCH 14/19] powerpc: hvc enablement for CMO Robert Jennings
                   ` (5 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:21 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Nathan Fontenot <nfont@austin.ibm.com>

Verify memory entitlement updates can be handled by vio.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 arch/powerpc/kernel/lparcfg.c |   10 ++++++++++
 1 file changed, 10 insertions(+)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -34,6 +34,7 @@
 #include <asm/time.h>
 #include <asm/prom.h>
 #include <asm/vdso_datapage.h>
+#include <asm/vio.h>
 
 #define MODULE_VERS "1.7"
 #define MODULE_NAME "lparcfg"
@@ -528,6 +529,15 @@ static ssize_t update_mpp(u64 *entitleme
 	u8 new_weight;
 	ssize_t rc;
 
+	if (entitlement) {
+		/* Check with vio to ensure the new memory entitlement
+		 * can be handled.
+		 */
+		rc = vio_cmo_entitlement_update(*entitlement);
+		if (rc)
+			return rc;
+	}
+
 	rc = h_get_mpp(&mpp_data);
 	if (rc)
 		return rc;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 14/19] powerpc: hvc enablement for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (12 preceding siblings ...)
  2008-06-12 22:21 ` [PATCH 13/19] powerpc: Verify CMO memory entitlement updates with virtual I/O Robert Jennings
@ 2008-06-12 22:21 ` Robert Jennings
  2008-06-12 22:22 ` [PATCH 15/19] powerpc: hvcs " Robert Jennings
                   ` (4 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:21 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

Define a get_io_entitlement function so that the hvc console driver can
function in a Cooperative Memory Overcommitment (CMO) environment; it
returns 0 to indicate that no IO entitlement is required, as the driver
does not perform DMA operations.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 drivers/char/hvc_vio.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: b/drivers/char/hvc_vio.c
===================================================================
--- a/drivers/char/hvc_vio.c
+++ b/drivers/char/hvc_vio.c
@@ -82,6 +82,11 @@ static struct hv_ops hvc_get_put_ops = {
 	.put_chars = hvc_put_chars,
 };
 
+unsigned long hvc_io_entitlement(struct vio_dev *vdev)
+{
+	return 0;
+}
+
 static int __devinit hvc_vio_probe(struct vio_dev *vdev,
 				const struct vio_device_id *id)
 {
@@ -111,6 +116,7 @@ static struct vio_driver hvc_vio_driver 
 	.id_table	= hvc_driver_table,
 	.probe		= hvc_vio_probe,
 	.remove		= hvc_vio_remove,
+	.get_io_entitlement = hvc_io_entitlement,
 	.driver		= {
 		.name	= hvc_driver_name,
 		.owner	= THIS_MODULE,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 15/19] powerpc: hvcs enablement for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (13 preceding siblings ...)
  2008-06-12 22:21 ` [PATCH 14/19] powerpc: hvc enablement for CMO Robert Jennings
@ 2008-06-12 22:22 ` Robert Jennings
  2008-06-12 22:22 ` [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
                   ` (3 subsequent siblings)
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:22 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

Define a get_io_entitlement function so that the hvcs driver can
function in a Cooperative Memory Overcommitment (CMO) environment; it
returns 0 to indicate that no IO entitlement is required, as the driver
does not perform DMA operations.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 drivers/char/hvcs.c |    6 ++++++
 1 file changed, 6 insertions(+)

Index: b/drivers/char/hvcs.c
===================================================================
--- a/drivers/char/hvcs.c
+++ b/drivers/char/hvcs.c
@@ -756,6 +756,11 @@ static int hvcs_get_index(void)
 	return -1;
 }
 
+unsigned long hvcs_get_io_entitlement(struct vio_dev *vdev)
+{
+	return 0;
+}
+
 static int __devinit hvcs_probe(
 	struct vio_dev *dev,
 	const struct vio_device_id *id)
@@ -869,6 +874,7 @@ static struct vio_driver hvcs_vio_driver
 	.id_table	= hvcs_driver_table,
 	.probe		= hvcs_probe,
 	.remove		= hvcs_remove,
+	.get_io_entitlement = hvcs_get_io_entitlement,
 	.driver		= {
 		.name	= hvcs_driver_name,
 		.owner	= THIS_MODULE,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (14 preceding siblings ...)
  2008-06-12 22:22 ` [PATCH 15/19] powerpc: hvcs " Robert Jennings
@ 2008-06-12 22:22 ` Robert Jennings
  2008-06-13  5:18   ` Stephen Rothwell
  2008-06-23 20:21   ` [PATCH 16/19][v2] " Robert Jennings
  2008-06-12 22:23 ` [PATCH 17/19] ibmveth: enable driver for CMO Robert Jennings
                   ` (2 subsequent siblings)
  18 siblings, 2 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:22 UTC (permalink / raw)
  To: paulus; +Cc: netdev, linuxppc-dev, David Darrington, Brian King

From: Santiago Leon <santil@us.ibm.com>

Activate larger rx buffer pools when the MTU is changed to a larger
value, and de-activate the large rx buffer pools when the MTU changes
to a smaller value.

Signed-off-by: Santiago Leon <santil@us.ibm.com>

---

 drivers/net/ibmveth.c |   20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -1054,7 +1054,6 @@ static int ibmveth_change_mtu(struct net
 {
 	struct ibmveth_adapter *adapter = dev->priv;
 	int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
-	int reinit = 0;
 	int i, rc;
 
 	if (new_mtu < IBMVETH_MAX_MTU)
@@ -1067,15 +1066,21 @@ static int ibmveth_change_mtu(struct net
 	if (i == IbmVethNumBufferPools)
 		return -EINVAL;
 
+	/* Deactivate all the buffer pools so that the next loop can activate
+	   only the buffer pools necessary to hold the new MTU */
+	for(i = 0; i<IbmVethNumBufferPools; i++)
+		if (adapter->rx_buff_pool[i].active) {
+			ibmveth_free_buffer_pool(adapter,
+						 &adapter->rx_buff_pool[i]);
+			adapter->rx_buff_pool[i].active = 0;
+		}
+
 	/* Look for an active buffer pool that can hold the new MTU */
 	for(i = 0; i<IbmVethNumBufferPools; i++) {
-		if (!adapter->rx_buff_pool[i].active) {
-			adapter->rx_buff_pool[i].active = 1;
-			reinit = 1;
-		}
+		adapter->rx_buff_pool[i].active = 1;
 
 		if (new_mtu_oh < adapter->rx_buff_pool[i].buff_size) {
-			if (reinit && netif_running(adapter->netdev)) {
+			if (netif_running(adapter->netdev)) {
 				adapter->pool_config = 1;
 				ibmveth_close(adapter->netdev);
 				adapter->pool_config = 0;
@@ -1402,14 +1407,15 @@ const char * buf, size_t count)
 				return -EPERM;
 			}
 
-			pool->active = 0;
 			if (netif_running(netdev)) {
 				adapter->pool_config = 1;
 				ibmveth_close(netdev);
+				pool->active = 0;
 				adapter->pool_config = 0;
 				if ((rc = ibmveth_open(netdev)))
 					return rc;
 			}
+			pool->active = 0;
 		}
 	} else if (attr == &veth_num_attr) {
 		if (value <= 0 || value > IBMVETH_MAX_POOL_COUNT)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 17/19] ibmveth: enable driver for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (15 preceding siblings ...)
  2008-06-12 22:22 ` [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
@ 2008-06-12 22:23 ` Robert Jennings
  2008-06-13  5:25   ` Stephen Rothwell
  2008-06-23 20:20   ` [PATCH 17/19][v2] " Robert Jennings
  2008-06-12 22:24 ` [PATCH 18/19] ibmvscsi: driver enablement " Robert Jennings
  2008-06-12 22:31 ` [PATCH 19/19] powerpc: Update arch vector to indicate support " Robert Jennings
  18 siblings, 2 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:23 UTC (permalink / raw)
  To: paulus; +Cc: netdev, linuxppc-dev, David Darrington, Brian King

From: Robert Jennings <rcj@linux.vnet.ibm.com>

Enable ibmveth for Cooperative Memory Overcommitment (CMO).  For this driver
it means calculating a desired amount of IO memory based on the current MTU
and updating this value with the bus when MTU changes occur.  Because DMA
mappings can fail, we have added a bounce buffer for temporary cases where
the driver can not map IO memory for the buffer pool.

The following changes are made to enable the driver for CMO:
 * DMA mapping errors will not result in error messages if entitlement has
   been exceeded and resources were not available.
 * DMA mapping errors are handled gracefully; ibmveth_replenish_buffer_pool()
   is corrected to check the return from dma_map_single and fail gracefully.
 * The driver defines a get_io_entitlement function so that it can operate
   in a CMO environment.
 * When the MTU is changed, the driver updates the device IO entitlement.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Santiago Leon <santil@us.ibm.com>

---

 drivers/net/ibmveth.c |  169 ++++++++++++++++++++++++++++++++++++++++----------
 drivers/net/ibmveth.h |    5 +
 2 files changed, 140 insertions(+), 34 deletions(-)

Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -33,6 +33,7 @@
 */
 
 #include <linux/module.h>
+#include <linux/moduleparam.h>
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/ioport.h>
@@ -52,7 +53,9 @@
 #include <asm/hvcall.h>
 #include <asm/atomic.h>
 #include <asm/vio.h>
+#include <asm/iommu.h>
 #include <asm/uaccess.h>
+#include <asm/firmware.h>
 #include <linux/seq_file.h>
 
 #include "ibmveth.h"
@@ -94,8 +97,10 @@ static void ibmveth_proc_register_adapte
 static void ibmveth_proc_unregister_adapter(struct ibmveth_adapter *adapter);
 static irqreturn_t ibmveth_interrupt(int irq, void *dev_instance);
 static void ibmveth_rxq_harvest_buffer(struct ibmveth_adapter *adapter);
+static unsigned long ibmveth_get_io_entitlement(struct vio_dev *vdev);
 static struct kobj_type ktype_veth_pool;
 
+
 #ifdef CONFIG_PROC_FS
 #define IBMVETH_PROC_DIR "ibmveth"
 static struct proc_dir_entry *ibmveth_proc_dir;
@@ -226,16 +231,16 @@ static void ibmveth_replenish_buffer_poo
 	u32 i;
 	u32 count = pool->size - atomic_read(&pool->available);
 	u32 buffers_added = 0;
+	struct sk_buff *skb;
+	unsigned int free_index, index;
+	u64 correlator;
+	unsigned long lpar_rc;
+	dma_addr_t dma_addr;
 
 	mb();
 
 	for(i = 0; i < count; ++i) {
-		struct sk_buff *skb;
-		unsigned int free_index, index;
-		u64 correlator;
 		union ibmveth_buf_desc desc;
-		unsigned long lpar_rc;
-		dma_addr_t dma_addr;
 
 		skb = alloc_skb(pool->buff_size, GFP_ATOMIC);
 
@@ -255,6 +260,9 @@ static void ibmveth_replenish_buffer_poo
 		dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
 				pool->buff_size, DMA_FROM_DEVICE);
 
+		if (dma_mapping_error(dma_addr))
+			goto failure;
+
 		pool->free_map[free_index] = IBM_VETH_INVALID_MAP;
 		pool->dma_addr[index] = dma_addr;
 		pool->skbuff[index] = skb;
@@ -267,20 +275,9 @@ static void ibmveth_replenish_buffer_poo
 
 		lpar_rc = h_add_logical_lan_buffer(adapter->vdev->unit_address, desc.desc);
 
-		if(lpar_rc != H_SUCCESS) {
-			pool->free_map[free_index] = index;
-			pool->skbuff[index] = NULL;
-			if (pool->consumer_index == 0)
-				pool->consumer_index = pool->size - 1;
-			else
-				pool->consumer_index--;
-			dma_unmap_single(&adapter->vdev->dev,
-					pool->dma_addr[index], pool->buff_size,
-					DMA_FROM_DEVICE);
-			dev_kfree_skb_any(skb);
-			adapter->replenish_add_buff_failure++;
-			break;
-		} else {
+		if (lpar_rc != H_SUCCESS)
+			goto failure;
+		else {
 			buffers_added++;
 			adapter->replenish_add_buff_success++;
 		}
@@ -288,6 +285,24 @@ static void ibmveth_replenish_buffer_poo
 
 	mb();
 	atomic_add(buffers_added, &(pool->available));
+	return;
+
+failure:
+	pool->free_map[free_index] = index;
+	pool->skbuff[index] = NULL;
+	if (pool->consumer_index == 0)
+		pool->consumer_index = pool->size - 1;
+	else
+		pool->consumer_index--;
+	if (!dma_mapping_error(dma_addr))
+		dma_unmap_single(&adapter->vdev->dev,
+		                 pool->dma_addr[index], pool->buff_size,
+		                 DMA_FROM_DEVICE);
+	dev_kfree_skb_any(skb);
+	adapter->replenish_add_buff_failure++;
+
+	mb();
+	atomic_add(buffers_added, &(pool->available));
 }
 
 /* replenish routine */
@@ -297,7 +312,7 @@ static void ibmveth_replenish_task(struc
 
 	adapter->replenish_task_cycles++;
 
-	for(i = 0; i < IbmVethNumBufferPools; i++)
+	for (i = (IbmVethNumBufferPools - 1); i >= 0; i--)
 		if(adapter->rx_buff_pool[i].active)
 			ibmveth_replenish_buffer_pool(adapter,
 						     &adapter->rx_buff_pool[i]);
@@ -472,6 +487,18 @@ static void ibmveth_cleanup(struct ibmve
 		if (adapter->rx_buff_pool[i].active)
 			ibmveth_free_buffer_pool(adapter,
 						 &adapter->rx_buff_pool[i]);
+
+	if(adapter->bounce_buffer != NULL) {
+		if(!dma_mapping_error(adapter->bounce_buffer_dma)) {
+			dma_unmap_single(&adapter->vdev->dev,
+					adapter->bounce_buffer_dma,
+					adapter->netdev->mtu + IBMVETH_BUFF_OH,
+					DMA_BIDIRECTIONAL);
+			adapter->bounce_buffer_dma = DMA_ERROR_CODE;
+		}
+		kfree(adapter->bounce_buffer);
+		adapter->bounce_buffer = NULL;
+	}
 }
 
 static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter,
@@ -607,6 +634,24 @@ static int ibmveth_open(struct net_devic
 		return rc;
 	}
 
+	adapter->bounce_buffer =
+	    kmalloc(netdev->mtu + IBMVETH_BUFF_OH, GFP_KERNEL);
+	if(!adapter->bounce_buffer) {
+		ibmveth_error_printk("unable to allocate bounce buffer\n");
+		ibmveth_cleanup(adapter);
+		napi_disable(&adapter->napi);
+		return -ENOMEM;
+	}
+	adapter->bounce_buffer_dma =
+	    dma_map_single(&adapter->vdev->dev, adapter->bounce_buffer,
+			   netdev->mtu + IBMVETH_BUFF_OH, DMA_BIDIRECTIONAL);
+	if(dma_mapping_error(adapter->bounce_buffer_dma)) {
+		ibmveth_error_printk("unable to map bounce buffer\n");
+		ibmveth_cleanup(adapter);
+		napi_disable(&adapter->napi);
+		return -ENOMEM;
+	}
+
 	ibmveth_debug_printk("initial replenish cycle\n");
 	ibmveth_interrupt(netdev->irq, netdev);
 
@@ -853,10 +898,12 @@ static int ibmveth_start_xmit(struct sk_
 	unsigned int tx_packets = 0;
 	unsigned int tx_send_failed = 0;
 	unsigned int tx_map_failed = 0;
+	int used_bounce = 0;
+	unsigned long data_dma_addr;
 
 	desc.fields.flags_len = IBMVETH_BUF_VALID | skb->len;
-	desc.fields.address = dma_map_single(&adapter->vdev->dev, skb->data,
-					     skb->len, DMA_TO_DEVICE);
+	data_dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
+				       skb->len, DMA_TO_DEVICE);
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
@@ -875,12 +922,16 @@ static int ibmveth_start_xmit(struct sk_
 		buf[1] = 0;
 	}
 
-	if (dma_mapping_error(desc.fields.address)) {
-		ibmveth_error_printk("tx: unable to map xmit buffer\n");
+	if (dma_mapping_error(data_dma_addr)) {
+		if (!firmware_has_feature(FW_FEATURE_CMO))
+			ibmveth_error_printk("tx: unable to map xmit buffer\n");
+		skb_copy_from_linear_data(skb, adapter->bounce_buffer,
+					  skb->len);
+		desc.fields.address = adapter->bounce_buffer_dma;
 		tx_map_failed++;
-		tx_dropped++;
-		goto out;
-	}
+		used_bounce = 1;
+	} else
+		desc.fields.address = data_dma_addr;
 
 	/* send the frame. Arbitrarily set retrycount to 1024 */
 	correlator = 0;
@@ -904,8 +955,9 @@ static int ibmveth_start_xmit(struct sk_
 		netdev->trans_start = jiffies;
 	}
 
-	dma_unmap_single(&adapter->vdev->dev, desc.fields.address,
-			 skb->len, DMA_TO_DEVICE);
+	if (!used_bounce)
+		dma_unmap_single(&adapter->vdev->dev, data_dma_addr,
+				 skb->len, DMA_TO_DEVICE);
 
 out:	spin_lock_irqsave(&adapter->stats_lock, flags);
 	netdev->stats.tx_dropped += tx_dropped;
@@ -1053,8 +1105,9 @@ static void ibmveth_set_multicast_list(s
 static int ibmveth_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct ibmveth_adapter *adapter = dev->priv;
+	struct vio_dev *viodev = adapter->vdev;
 	int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
-	int i, rc;
+	int i;
 
 	if (new_mtu < IBMVETH_MAX_MTU)
 		return -EINVAL;
@@ -1085,10 +1138,15 @@ static int ibmveth_change_mtu(struct net
 				ibmveth_close(adapter->netdev);
 				adapter->pool_config = 0;
 				dev->mtu = new_mtu;
-				if ((rc = ibmveth_open(adapter->netdev)))
-					return rc;
-			} else
-				dev->mtu = new_mtu;
+				vio_cmo_set_dev_desired(viodev,
+						ibmveth_get_io_entitlement
+						(viodev));
+				return ibmveth_open(adapter->netdev);
+			}
+			dev->mtu = new_mtu;
+			vio_cmo_set_dev_desired(viodev,
+						ibmveth_get_io_entitlement
+						(viodev));
 			return 0;
 		}
 	}
@@ -1103,6 +1161,46 @@ static void ibmveth_poll_controller(stru
 }
 #endif
 
+/**
+ * ibmveth_get_io_entitlement - Calculate IO entitlement needed by the driver
+ *
+ * @vdev: struct vio_dev for the device whose entitlement is to be returned
+ *
+ * Return value:
+ *	Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmveth_get_io_entitlement(struct vio_dev *vdev)
+{
+	struct net_device *netdev = dev_get_drvdata(&vdev->dev);
+	struct ibmveth_adapter *adapter;
+	unsigned long ret;
+	int i;
+	int rxqentries = 1;
+
+	/* netdev inits at probe time along with the structures we need below */
+	if (netdev == NULL)
+		return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+
+	adapter = netdev_priv(netdev);
+
+	ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
+	ret += IOMMU_PAGE_ALIGN(netdev->mtu);
+
+	for (i = 0; i < IbmVethNumBufferPools; i++) {
+		/* add the size of the active receive buffers */
+		if (adapter->rx_buff_pool[i].active)
+			ret +=
+			    adapter->rx_buff_pool[i].size *
+			    IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+			            buff_size);
+		rxqentries += adapter->rx_buff_pool[i].size;
+	}
+	/* add the size of the receive queue entries */
+	ret += IOMMU_PAGE_ALIGN(rxqentries * sizeof(struct ibmveth_rx_q_entry));
+
+	return ret;
+}
+
 static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 {
 	int rc, i;
@@ -1247,6 +1345,8 @@ static int __devexit ibmveth_remove(stru
 	ibmveth_proc_unregister_adapter(adapter);
 
 	free_netdev(netdev);
+	dev_set_drvdata(&dev->dev, NULL);
+
 	return 0;
 }
 
@@ -1491,6 +1591,7 @@ static struct vio_driver ibmveth_driver 
 	.id_table	= ibmveth_device_table,
 	.probe		= ibmveth_probe,
 	.remove		= ibmveth_remove,
+	.get_io_entitlement = ibmveth_get_io_entitlement,
 	.driver		= {
 		.name	= ibmveth_driver_name,
 		.owner	= THIS_MODULE,
Index: b/drivers/net/ibmveth.h
===================================================================
--- a/drivers/net/ibmveth.h
+++ b/drivers/net/ibmveth.h
@@ -93,9 +93,12 @@ static inline long h_illan_attributes(un
   plpar_hcall_norets(H_CHANGE_LOGICAL_LAN_MAC, ua, mac)
 
 #define IbmVethNumBufferPools 5
+#define IBMVETH_IO_ENTITLEMENT_DEFAULT 4243456 /* MTU of 1500 needs 4.2Mb */
 #define IBMVETH_BUFF_OH 22 /* Overhead: 14 ethernet header + 8 opaque handle */
 #define IBMVETH_MAX_MTU 68
 #define IBMVETH_MAX_POOL_COUNT 4096
+#define IBMVETH_BUFF_LIST_SIZE 4096
+#define IBMVETH_FILT_LIST_SIZE 4096
 #define IBMVETH_MAX_BUF_SIZE (1024 * 128)
 
 static int pool_size[] = { 512, 1024 * 2, 1024 * 16, 1024 * 32, 1024 * 64 };
@@ -143,6 +146,8 @@ struct ibmveth_adapter {
     struct ibmveth_rx_q rx_queue;
     int pool_config;
    int rx_csum;
+    void * bounce_buffer;
+    dma_addr_t bounce_buffer_dma;
 
     /* adapter specific stats */
     u64 replenish_task_cycles;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 18/19] ibmvscsi: driver enablement for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (16 preceding siblings ...)
  2008-06-12 22:23 ` [PATCH 17/19] ibmveth: enable driver for CMO Robert Jennings
@ 2008-06-12 22:24 ` Robert Jennings
  2008-06-13 18:30   ` Brian King
  2008-06-12 22:31 ` [PATCH 19/19] powerpc: Update arch vector to indicate support " Robert Jennings
  18 siblings, 1 reply; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:24 UTC (permalink / raw)
  To: paulus
  Cc: linux-scsi@vger.kernel.org, Brian King, linuxppc-dev, David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

Enable the driver to function in a Cooperative Memory Overcommitment (CMO)
environment.

The following changes are made to enable the driver for CMO:
 * DMA mapping errors will not result in error messages if entitlement has
   been exceeded and resources were not available.
 * The driver defines a get_io_entitlement function so that it can operate
   in a CMO environment; it reports how much IO memory the driver needs in
   order to function well.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 drivers/scsi/ibmvscsi/ibmvscsi.c |   46 +++++++++++++++++++++++++++++++++------
 drivers/scsi/ibmvscsi/ibmvscsi.h |    2 ++
 2 files changed, 41 insertions(+), 7 deletions(-)

Index: b/drivers/scsi/ibmvscsi/ibmvscsi.c
===================================================================
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -72,6 +72,8 @@
 #include <linux/delay.h>
 #include <asm/firmware.h>
 #include <asm/vio.h>
+#include <asm/firmware.h>
+#include <asm/iommu.h>
 #include <scsi/scsi.h>
 #include <scsi/scsi_cmnd.h>
 #include <scsi/scsi_host.h>
@@ -426,8 +428,10 @@ static int map_sg_data(struct scsi_cmnd 
 					   SG_ALL * sizeof(struct srp_direct_buf),
 					   &evt_struct->ext_list_token, 0);
 		if (!evt_struct->ext_list) {
-			sdev_printk(KERN_ERR, cmd->device,
-				    "Can't allocate memory for indirect table\n");
+			if (!firmware_has_feature(FW_FEATURE_CMO))
+				sdev_printk(KERN_ERR, cmd->device,
+				            "Can't allocate memory "
+				            "for indirect table\n");
 			return 0;
 		}
 	}
@@ -743,7 +747,9 @@ static int ibmvscsi_queuecommand(struct 
 	srp_cmd->lun = ((u64) lun) << 48;
 
 	if (!map_data_for_srp_cmd(cmnd, evt_struct, srp_cmd, hostdata->dev)) {
-		sdev_printk(KERN_ERR, cmnd->device, "couldn't convert cmd to srp_cmd\n");
+		if (!firmware_has_feature(FW_FEATURE_CMO))
+			sdev_printk(KERN_ERR, cmnd->device,
+			            "couldn't convert cmd to srp_cmd\n");
 		free_event_struct(&hostdata->pool, evt_struct);
 		return SCSI_MLQUEUE_HOST_BUSY;
 	}
@@ -855,7 +861,10 @@ static void send_mad_adapter_info(struct
 					    DMA_BIDIRECTIONAL);
 
 	if (dma_mapping_error(req->buffer)) {
-		dev_err(hostdata->dev, "Unable to map request_buffer for adapter_info!\n");
+		if (!firmware_has_feature(FW_FEATURE_CMO))
+			dev_err(hostdata->dev,
+			        "Unable to map request_buffer for "
+			        "adapter_info!\n");
 		free_event_struct(&hostdata->pool, evt_struct);
 		return;
 	}
@@ -1400,7 +1409,9 @@ static int ibmvscsi_do_host_config(struc
 						    DMA_BIDIRECTIONAL);
 
 	if (dma_mapping_error(host_config->buffer)) {
-		dev_err(hostdata->dev, "dma_mapping error getting host config\n");
+		if (!firmware_has_feature(FW_FEATURE_CMO))
+			dev_err(hostdata->dev,
+			        "dma_mapping error getting host config\n");
 		free_event_struct(&hostdata->pool, evt_struct);
 		return -1;
 	}
@@ -1604,7 +1615,7 @@ static struct scsi_host_template driver_
 	.eh_host_reset_handler = ibmvscsi_eh_host_reset_handler,
 	.slave_configure = ibmvscsi_slave_configure,
 	.change_queue_depth = ibmvscsi_change_queue_depth,
-	.cmd_per_lun = 16,
+	.cmd_per_lun = IBMVSCSI_CMDS_PER_LUN_DEFAULT,
 	.can_queue = IBMVSCSI_MAX_REQUESTS_DEFAULT,
 	.this_id = -1,
 	.sg_tablesize = SG_ALL,
@@ -1613,6 +1624,26 @@ static struct scsi_host_template driver_
 };
 
 /**
+ * ibmvscsi_get_io_entitlement - Calculate IO entitlement needed by the driver
+ *
+ * @vdev: struct vio_dev for the device whose entitlement is to be returned
+ *
+ * Return value:
+ *	Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmvscsi_get_io_entitlement(struct vio_dev *vdev)
+{
+	/* iu_storage data allocated in initialize_event_pool */
+	unsigned long io_entitlement = max_requests * sizeof(union viosrp_iu);
+
+	/* add io space for sg data */
+	io_entitlement += (IBMVSCSI_MAX_SECTORS_DEFAULT *
+	                     IBMVSCSI_CMDS_PER_LUN_DEFAULT);
+
+	return IOMMU_PAGE_ALIGN(io_entitlement);
+}
+
+/**
  * Called by bus code for each adapter
  */
 static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
@@ -1641,7 +1672,7 @@ static int ibmvscsi_probe(struct vio_dev
 	hostdata->host = host;
 	hostdata->dev = dev;
 	atomic_set(&hostdata->request_limit, -1);
-	hostdata->host->max_sectors = 32 * 8; /* default max I/O 32 pages */
+	hostdata->host->max_sectors = IBMVSCSI_MAX_SECTORS_DEFAULT;
 
 	rc = ibmvscsi_ops->init_crq_queue(&hostdata->queue, hostdata, max_requests);
 	if (rc != 0 && rc != H_RESOURCE) {
@@ -1735,6 +1766,7 @@ static struct vio_driver ibmvscsi_driver
 	.id_table = ibmvscsi_device_table,
 	.probe = ibmvscsi_probe,
 	.remove = ibmvscsi_remove,
+	.get_io_entitlement = ibmvscsi_get_io_entitlement,
 	.driver = {
 		.name = "ibmvscsi",
 		.owner = THIS_MODULE,
Index: b/drivers/scsi/ibmvscsi/ibmvscsi.h
===================================================================
--- a/drivers/scsi/ibmvscsi/ibmvscsi.h
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.h
@@ -45,6 +45,8 @@ struct Scsi_Host;
 #define MAX_INDIRECT_BUFS 10
 
 #define IBMVSCSI_MAX_REQUESTS_DEFAULT 100
+#define IBMVSCSI_CMDS_PER_LUN_DEFAULT 16
+#define IBMVSCSI_MAX_SECTORS_DEFAULT 256 /* 32 * 8 = default max I/O 32 pages */
 #define IBMVSCSI_MAX_CMDS_PER_LUN 64
=20
 /* ------------------------------------------------------------

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 19/19] powerpc: Update arch vector to indicate support for CMO
  2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
                   ` (17 preceding siblings ...)
  2008-06-12 22:24 ` [PATCH 18/19] ibmvscsi: driver enablement " Robert Jennings
@ 2008-06-12 22:31 ` Robert Jennings
  18 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-12 22:31 UTC (permalink / raw)
  To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington

From: Nathan Fontenot <nfont@austin.ibm.com>

Update the architecture vector to indicate that Cooperative Memory
Overcommitment is supported.

This is the last patch in the series.  Committing it will signal to
the platform firmware that the kernel is CMO enabled.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---

 arch/powerpc/kernel/prom_init.c |    5 ++++-
 1 file changed, 4 insertions(+), 1 deletion(-)

Index: b/arch/powerpc/kernel/prom_init.c
===================================================================
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -642,6 +642,7 @@ static void __init early_cmdline_parse(v
 #else
 #define OV5_MSI			0x00
 #endif /* CONFIG_PCI_MSI */
+#define OV5_CMO			0x80	/* Cooperative Memory Overcommitment */
 
 /*
  * The architecture vector has an array of PVR mask/value pairs,
@@ -684,10 +685,12 @@ static unsigned char ibm_architecture_ve
 	0,				/* don't halt */
 
 	/* option vector 5: PAPR/OF options */
-	3 - 2,				/* length */
+	5 - 2,				/* length */
 	0,				/* don't ignore, don't halt */
 	OV5_LPAR | OV5_SPLPAR | OV5_LARGE_PAGES | OV5_DRCONF_MEMORY |
 	OV5_DONATE_DEDICATE_CPU | OV5_MSI,
+	0,
+	OV5_CMO,
 };
=20
 /* Old method - ELF header with PT_NOTE sections */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines
  2008-06-12 22:08 ` [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
@ 2008-06-13  0:23   ` Stephen Rothwell
  2008-06-13 19:11     ` Nathan Fontenot
  2008-06-16 16:07   ` Nathan Fontenot
  1 sibling, 1 reply; 41+ messages in thread
From: Stephen Rothwell @ 2008-06-13  0:23 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

[-- Attachment #1: Type: text/plain, Size: 782 bytes --]

Hi Robert,

On Thu, 12 Jun 2008 17:08:58 -0500 Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
>
> -		seq_printf(m, "R4=0x%lx\n", h_entitled);
> -		seq_printf(m, "R5=0x%lx\n", h_unallocated);
> -		seq_printf(m, "R6=0x%lx\n", h_aggregation);
> -		seq_printf(m, "R7=0x%lx\n", h_resource);

This changes a user visible interface by removing the above.  I don't
know if this matters (probably not), but it should be mentioned in the
changelog.

> +	if (new_entitled)
> +		*new_weight = current_weight;
> +
> +	if (new_weight)
> +		*new_entitled = current_entitled;

These look fishy - checking one pointer for NULL and then updating via
the other pointer.
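
Presumably each check should guard its own store, i.e. something like:

	if (new_entitled)
		*new_entitled = current_entitled;

	if (new_weight)
		*new_weight = current_weight;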

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/19] powerpc: iommu enablement for CMO
  2008-06-12 22:19 ` [PATCH 11/19] powerpc: iommu enablement for CMO Robert Jennings
@ 2008-06-13  1:43   ` Olof Johansson
  2008-06-20 15:03     ` Robert Jennings
  2008-06-20 15:12   ` [PATCH 11/19][v2] " Robert Jennings
  1 sibling, 1 reply; 41+ messages in thread
From: Olof Johansson @ 2008-06-13  1:43 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

Hi,

Some comments and questions below.


-Olof

On Thu, Jun 12, 2008 at 05:19:36PM -0500, Robert Jennings wrote:
> Index: b/arch/powerpc/kernel/iommu.c
> ===================================================================
> --- a/arch/powerpc/kernel/iommu.c
> +++ b/arch/powerpc/kernel/iommu.c
> @@ -183,6 +183,49 @@ static unsigned long iommu_range_alloc(s
>  	return n;
>  }
>  
> +/** iommu_undo - Clear iommu_table bits without calling platform tce_free.
> + *
> + * @tbl - struct iommu_table to alter
> + * @dma_addr - DMA address to free entries for
> + * @npages - number of pages to free entries for
> + *
> + * This is the same as __iommu_free without the call to ppc_md.tce_free();

__iommu_free has the __ prepended to indicate that it's not locking.
Since this does the same, please keep the __. Also see comments below.

> + *
> + * To clean up after ppc_md.tce_build() errors we need to clear bits
> + * in the table without calling the ppc_md.tce_free() method; calling
> + * ppc_md.tce_free() could alter entries that were not touched due to a
> + * premature failure in ppc_md.tce_build().
> + *
> + * The ppc_md.tce_build() needs to perform its own clean up prior to
> + * returning its error.
> + */
> +static void iommu_undo(struct iommu_table *tbl, dma_addr_t dma_addr,
> +			 unsigned int npages)
> +{
> +	unsigned long entry, free_entry;
> +
> +	entry = dma_addr >> IOMMU_PAGE_SHIFT;
> +	free_entry = entry - tbl->it_offset;
> +
> +	if (((free_entry + npages) > tbl->it_size) ||
> +	    (entry < tbl->it_offset)) {
> +		if (printk_ratelimit()) {
> +			printk(KERN_INFO "iommu_undo: invalid entry\n");
> +			printk(KERN_INFO "\tentry    = 0x%lx\n", entry);
> +			printk(KERN_INFO "\tdma_addr = 0x%lx\n", (u64)dma_addr);
> +			printk(KERN_INFO "\tTable    = 0x%lx\n", (u64)tbl);
> +			printk(KERN_INFO "\tbus#     = 0x%lx\n", tbl->it_busno);
> +			printk(KERN_INFO "\tsize     = 0x%lx\n", tbl->it_size);
> +			printk(KERN_INFO "\tstartOff = 0x%lx\n", tbl->it_offset);
> +			printk(KERN_INFO "\tindex    = 0x%lx\n", tbl->it_index);
> +			WARN_ON(1);
> +		}
> +		return;
> +	}
> +
> +	iommu_area_free(tbl->it_map, free_entry, npages);
> +}

Ick, this should just be refactored to reuse code together with
iommu_free() instead of duplicating it. Also, the error checking
shouldn't be needed here.

Actually, is there harm in calling tce_free for these cases anyway? I'm
guessing it's not a performance critical path.
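
For instance, both paths could share a common tail; a rough sketch only
(the helper name here is made up):

	/* caller holds the lock; range checks done as in __iommu_free() */
	static void __iommu_free_entries(struct iommu_table *tbl,
					 dma_addr_t dma_addr,
					 unsigned int npages, int do_tce_free)
	{
		unsigned long entry = dma_addr >> IOMMU_PAGE_SHIFT;
		unsigned long free_entry = entry - tbl->it_offset;

		if (do_tce_free)
			ppc_md.tce_free(tbl, entry, npages);
		iommu_area_free(tbl->it_map, free_entry, npages);
	}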

> @@ -275,7 +330,7 @@ int iommu_map_sg(struct device *dev, str
>  	dma_addr_t dma_next = 0, dma_addr;
>  	unsigned long flags;
>  	struct scatterlist *s, *outs, *segstart;
> -	int outcount, incount, i;
> +	int outcount, incount, i, rc = 0;
>  	unsigned int align;
>  	unsigned long handle;
>  	unsigned int max_seg_size;
> @@ -336,7 +391,10 @@ int iommu_map_sg(struct device *dev, str
>  			    npages, entry, dma_addr);
>  
>  		/* Insert into HW table */
> -		ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK, direction);
> +		rc = ppc_md.tce_build(tbl, entry, npages,
> +		                      vaddr & IOMMU_PAGE_MASK, direction);
> +		if(unlikely(rc))
> +			goto failure;
>  
>  		/* If we are in an open segment, try merging */
>  		if (segstart != s) {
> @@ -399,7 +457,10 @@ int iommu_map_sg(struct device *dev, str
>  
>  			vaddr = s->dma_address & IOMMU_PAGE_MASK;
>  			npages = iommu_num_pages(s->dma_address, s->dma_length);
> -			__iommu_free(tbl, vaddr, npages);
> +			if (!rc)
> +				__iommu_free(tbl, vaddr, npages);
> +			else
> +				iommu_undo(tbl, vaddr, npages);

'rc' is a quite generic name to carry state this far away from where
it's set. Either a more descriptive name (build_fail, whatever), or if
the above is true, just call __iommu_free here as well.

> -static void tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
> +static void tce_free_pSeriesLP(struct iommu_table*, long, long);
> +static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
> +
> +static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				long npages, unsigned long uaddr,
>  				enum dma_data_direction direction)
>  {
> -	u64 rc;
> +	u64 rc = 0;
>  	u64 proto_tce, tce;
>  	u64 rpn;
> +	int sleep_msecs, ret = 0;
> +	long tcenum_start = tcenum, npages_start = npages;
>  
>  	rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT;
>  	proto_tce = TCE_PCI_READ;
> @@ -108,7 +115,21 @@ static void tce_build_pSeriesLP(struct i
>  
>  	while (npages--) {
>  		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
> -		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
> +		do {
> +			rc = plpar_tce_put((u64)tbl->it_index,
> +			                   (u64)tcenum << 12, tce);
> +			if (unlikely(H_IS_LONG_BUSY(rc))) {
> +				sleep_msecs = plpar_get_longbusy_msecs(rc);
> +				mdelay(sleep_msecs);

Ouch! You're holding locks and stuff here. Do you really want this right
here?

> +			}
> +		} while (unlikely(H_IS_LONG_BUSY(rc)));

Do you also want to keep doing this forever, or eventually just fail
instead?
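
E.g. a bounded retry, roughly like this (the retry cap is made up):

	int retries = 10;	/* arbitrary cap before giving up */

	do {
		rc = plpar_tce_put((u64)tbl->it_index,
				   (u64)tcenum << 12, tce);
		if (unlikely(H_IS_LONG_BUSY(rc)))
			mdelay(plpar_get_longbusy_msecs(rc));
	} while (unlikely(H_IS_LONG_BUSY(rc)) && --retries > 0);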

> +		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
> +			ret = (int)rc;
> +			tce_free_pSeriesLP(tbl, tcenum_start,
> +			                   (npages_start - (npages + 1)));
> +			break;
> +		}
>  
>  		if (rc && printk_ratelimit()) {
>  			printk("tce_build_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
> @@ -121,19 +142,22 @@ static void tce_build_pSeriesLP(struct i
>  		tcenum++;
>  		rpn++;
>  	}
> +	return ret;
>  }
>  
>  static DEFINE_PER_CPU(u64 *, tce_page) = NULL;
>  
> -static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> +static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
>  				     long npages, unsigned long uaddr,
>  				     enum dma_data_direction direction)
>  {
> -	u64 rc;
> +	u64 rc = 0;
>  	u64 proto_tce;
>  	u64 *tcep;
>  	u64 rpn;
>  	long l, limit;
> +	long tcenum_start = tcenum, npages_start = npages;
> +	int sleep_msecs, ret = 0;
>  
>  	if (npages == 1)
>  		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> @@ -171,15 +195,26 @@ static void tce_buildmulti_pSeriesLP(str
>  			rpn++;
>  		}
>  
> -		rc = plpar_tce_put_indirect((u64)tbl->it_index,
> -					    (u64)tcenum << 12,
> -					    (u64)virt_to_abs(tcep),
> -					    limit);
> +		do {
> +			rc = plpar_tce_put_indirect(tbl->it_index, tcenum << 12,
> +						    virt_to_abs(tcep), limit);
> +			if (unlikely(H_IS_LONG_BUSY(rc))) {
> +				sleep_msecs = plpar_get_longbusy_msecs(rc);
> +				mdelay(sleep_msecs);
> +			}
> +		} while (unlikely(H_IS_LONG_BUSY(rc)));
>  
>  		npages -= limit;
>  		tcenum += limit;
>  	} while (npages > 0 && !rc);
>  
> +	if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
> +		ret = (int)rc;
> +		tce_freemulti_pSeriesLP(tbl, tcenum_start,
> +		                        (npages_start - (npages + limit)));
> +		return ret;
> +	}
> +
>  	if (rc && printk_ratelimit()) {
>  		printk("tce_buildmulti_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
>  		printk("\tindex   = 0x%lx\n", (u64)tbl->it_index);
> @@ -187,14 +222,23 @@ static void tce_buildmulti_pSeriesLP(str
>  		printk("\ttce[0] val = 0x%lx\n", tcep[0]);
>  		show_stack(current, (unsigned long *)__get_SP());
>  	}
> +	return ret;
>  }
>  
>  static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
>  {
> +	int sleep_msecs;
>  	u64 rc;
>  
>  	while (npages--) {
> -		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, 0);
> +		do {
> +			rc = plpar_tce_put((u64)tbl->it_index,
> +			                   (u64)tcenum << 12, 0);
> +			if (unlikely(H_IS_LONG_BUSY(rc))) {
> +				sleep_msecs = plpar_get_longbusy_msecs(rc);
> +				mdelay(sleep_msecs);
> +			}
> +		} while (unlikely(H_IS_LONG_BUSY(rc)));

Can this ever happen? I would hope that any entry that's got an active
mapping is actually pinned in memory, what other than paging in from
disk can result in long busy?

> @@ -210,9 +254,17 @@ static void tce_free_pSeriesLP(struct io
>  
>  static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
>  {
> +	int sleep_msecs;
>  	u64 rc;
>  
> -	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
> +	do {
> +		rc = plpar_tce_stuff((u64)tbl->it_index,
> +		                     (u64)tcenum << 12, 0, npages);
> +		if (unlikely(H_IS_LONG_BUSY(rc))) {
> +			sleep_msecs = plpar_get_longbusy_msecs(rc);
> +			mdelay(sleep_msecs);
> +		}
> +	} while (unlikely(H_IS_LONG_BUSY(rc)));
>  
>  	if (rc && printk_ratelimit()) {
>  		printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 12/19] powerpc: vio bus support for CMO
  2008-06-12 22:19 ` [PATCH 12/19] powerpc: vio bus support " Robert Jennings
@ 2008-06-13  5:12   ` Stephen Rothwell
  2008-06-23 20:23     ` Robert Jennings
  2008-06-23 20:25   ` [PATCH 12/19][v2] " Robert Jennings
  1 sibling, 1 reply; 41+ messages in thread
From: Stephen Rothwell @ 2008-06-13  5:12 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

[-- Attachment #1: Type: text/plain, Size: 933 bytes --]

Hi Robert,

Firstly, can all this new stuff be ifdef'ed out if not needed as the
vio infrastructure is also used on legacy iSeries and this adds quite a
bit of stuff that won't ever be used there.

On Thu, 12 Jun 2008 17:19:59 -0500 Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
>
> +static int vio_cmo_num_OF_devs(void)
> +{
> +	struct device_node *node_vroot;
> +	int count = 0;
> +
> +	/*
> +	 * Count the number of vdevice entries with an
> +	 * ibm,my-dma-window OF property
> +	 */
> +	node_vroot = of_find_node_by_name(NULL, "vdevice");
> +	if (node_vroot) {
> +		struct device_node *of_node;
> +		struct property *prop;
> +
> +		for (of_node = node_vroot->child; of_node != NULL;
> +		                of_node = of_node->sibling) {

Use:
		for_each_child_of_node(node_vroot, of_node) {


-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu
  2008-06-12 22:22 ` [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
@ 2008-06-13  5:18   ` Stephen Rothwell
  2008-06-23 20:21   ` [PATCH 16/19][v2] " Robert Jennings
  1 sibling, 0 replies; 41+ messages in thread
From: Stephen Rothwell @ 2008-06-13  5:18 UTC (permalink / raw)
  To: Robert Jennings
  Cc: netdev, linuxppc-dev, paulus, Brian King, David Darrington

[-- Attachment #1: Type: text/plain, Size: 457 bytes --]

Hi Robert,

On Thu, 12 Jun 2008 17:22:49 -0500 Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
>
> +	/* Deactivate all the buffer pools so that the next loop can activate
> +	   only the buffer pools necessary to hold the new MTU */
> +	for(i = 0; i<IbmVethNumBufferPools; i++)

Please use spaces after "for" and around binary operators.
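
I.e., for the quoted loop:

	for (i = 0; i < IbmVethNumBufferPools; i++)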

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/19] ibmveth: enable driver for CMO
  2008-06-12 22:23 ` [PATCH 17/19] ibmveth: enable driver for CMO Robert Jennings
@ 2008-06-13  5:25   ` Stephen Rothwell
  2008-06-23 20:20   ` [PATCH 17/19][v2] " Robert Jennings
  1 sibling, 0 replies; 41+ messages in thread
From: Stephen Rothwell @ 2008-06-13  5:25 UTC (permalink / raw)
  To: Robert Jennings
  Cc: netdev, linuxppc-dev, paulus, Brian King, David Darrington

[-- Attachment #1: Type: text/plain, Size: 294 bytes --]

Hi Robert,

On Thu, 12 Jun 2008 17:23:11 -0500 Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
>
> +	if(adapter->bounce_buffer != NULL) {

Space after "if".  Here and elsewhere.

-- 
Cheers,
Stephen Rothwell                    sfr@canb.auug.org.au
http://www.canb.auug.org.au/~sfr/

[-- Attachment #2: Type: application/pgp-signature, Size: 197 bytes --]

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/19] [repost] powerpc: move get_longbusy_msecs out of ehca/ehea
  2008-06-12 22:18   ` [PATCH 10/19] [repost] " Robert Jennings
@ 2008-06-13 18:24     ` Brian King
  2008-06-13 19:55       ` Jeff Garzik
  0 siblings, 1 reply; 41+ messages in thread
From: Brian King @ 2008-06-13 18:24 UTC (permalink / raw)
  To: Robert Jennings
  Cc: netdev, linuxppc-dev, paulus, David Darrington, Jeff Garzik

Jeff,

Regarding the patches Rob just posted here, we'd like to just take them
through the powerpc tree with your sign off since they are part of a
Power platform feature we are enabling.

Thanks,

Brian

Robert Jennings wrote:
> From: Robert Jennings <rcj@linux.vnet.ibm.com>
> 
> In support of Cooperative Memory Overcommitment (CMO) this moves
> get_longbusy_msecs() out of the ehca and ehea drivers and into the 
> architecture's hvcall header as plpar_get_longbusy_msecs.  Some firmware
> calls made in pSeries platform iommu code will need to share this
> functionality.
> 
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
> 
> ---
> 
> I missed copying netdev on this patch the first time.
> 
>  drivers/infiniband/hw/ehca/hcp_if.c |   24 ++----------------------
>  drivers/net/ehea/ehea_phyp.c        |    4 ++--
>  drivers/net/ehea/ehea_phyp.h        |   20 --------------------
>  include/asm-powerpc/hvcall.h        |   27 +++++++++++++++++++++++++++
>  4 files changed, 31 insertions(+), 44 deletions(-)
> 
> Index: b/drivers/infiniband/hw/ehca/hcp_if.c
> ===================================================================
> --- a/drivers/infiniband/hw/ehca/hcp_if.c
> +++ b/drivers/infiniband/hw/ehca/hcp_if.c
> @@ -90,26 +90,6 @@
>  
>  static DEFINE_SPINLOCK(hcall_lock);
>  
> -static u32 get_longbusy_msecs(int longbusy_rc)
> -{
> -	switch (longbusy_rc) {
> -	case H_LONG_BUSY_ORDER_1_MSEC:
> -		return 1;
> -	case H_LONG_BUSY_ORDER_10_MSEC:
> -		return 10;
> -	case H_LONG_BUSY_ORDER_100_MSEC:
> -		return 100;
> -	case H_LONG_BUSY_ORDER_1_SEC:
> -		return 1000;
> -	case H_LONG_BUSY_ORDER_10_SEC:
> -		return 10000;
> -	case H_LONG_BUSY_ORDER_100_SEC:
> -		return 100000;
> -	default:
> -		return 1;
> -	}
> -}
> -
>  static long ehca_plpar_hcall_norets(unsigned long opcode,
>  				    unsigned long arg1,
>  				    unsigned long arg2,
> @@ -139,7 +119,7 @@ static long ehca_plpar_hcall_norets(unsi
>  			spin_unlock_irqrestore(&hcall_lock, flags);
>  
>  		if (H_IS_LONG_BUSY(ret)) {
> -			sleep_msecs = get_longbusy_msecs(ret);
> +			sleep_msecs = plpar_get_longbusy_msecs(ret);
>  			msleep_interruptible(sleep_msecs);
>  			continue;
>  		}
> @@ -192,7 +172,7 @@ static long ehca_plpar_hcall9(unsigned l
>  			spin_unlock_irqrestore(&hcall_lock, flags);
>  
>  		if (H_IS_LONG_BUSY(ret)) {
> -			sleep_msecs = get_longbusy_msecs(ret);
> +			sleep_msecs = plpar_get_longbusy_msecs(ret);
>  			msleep_interruptible(sleep_msecs);
>  			continue;
>  		}
> Index: b/drivers/net/ehea/ehea_phyp.c
> ===================================================================
> --- a/drivers/net/ehea/ehea_phyp.c
> +++ b/drivers/net/ehea/ehea_phyp.c
> @@ -61,7 +61,7 @@ static long ehea_plpar_hcall_norets(unsi
>  					 arg5, arg6, arg7);
>  
>  		if (H_IS_LONG_BUSY(ret)) {
> -			sleep_msecs = get_longbusy_msecs(ret);
> +			sleep_msecs = plpar_get_longbusy_msecs(ret);
>  			msleep_interruptible(sleep_msecs);
>  			continue;
>  		}
> @@ -102,7 +102,7 @@ static long ehea_plpar_hcall9(unsigned l
>  				   arg6, arg7, arg8, arg9);
>  
>  		if (H_IS_LONG_BUSY(ret)) {
> -			sleep_msecs = get_longbusy_msecs(ret);
> +			sleep_msecs = plpar_get_longbusy_msecs(ret);
>  			msleep_interruptible(sleep_msecs);
>  			continue;
>  		}
> Index: b/drivers/net/ehea/ehea_phyp.h
> ===================================================================
> --- a/drivers/net/ehea/ehea_phyp.h
> +++ b/drivers/net/ehea/ehea_phyp.h
> @@ -40,26 +40,6 @@
>   * hcp_*  - structures, variables and functions releated to Hypervisor Calls
>   */
>  
> -static inline u32 get_longbusy_msecs(int long_busy_ret_code)
> -{
> -	switch (long_busy_ret_code) {
> -	case H_LONG_BUSY_ORDER_1_MSEC:
> -		return 1;
> -	case H_LONG_BUSY_ORDER_10_MSEC:
> -		return 10;
> -	case H_LONG_BUSY_ORDER_100_MSEC:
> -		return 100;
> -	case H_LONG_BUSY_ORDER_1_SEC:
> -		return 1000;
> -	case H_LONG_BUSY_ORDER_10_SEC:
> -		return 10000;
> -	case H_LONG_BUSY_ORDER_100_SEC:
> -		return 100000;
> -	default:
> -		return 1;
> -	}
> -}
> -
>  /* Number of pages which can be registered at once by H_REGISTER_HEA_RPAGES */
>  #define EHEA_MAX_RPAGE 512
>  
> Index: b/include/asm-powerpc/hvcall.h
> ===================================================================
> --- a/include/asm-powerpc/hvcall.h
> +++ b/include/asm-powerpc/hvcall.h
> @@ -222,6 +222,33 @@
>  #ifndef __ASSEMBLY__
>  
>  /**
> + * plpar_get_longbusy_msecs: - Return number of msecs for H_LONG_BUSY* response
> + * @long_busy_ret_code: The H_LONG_BUSY_* constant to process
> + *
> + * This returns the number of msecs that corresponds to an H_LONG_BUSY_*
> + * response from a plpar_hcall.  If there is no match 1 is returned.
> + */
> +static inline u32 plpar_get_longbusy_msecs(int long_busy_ret_code)
> +{
> +	switch (long_busy_ret_code) {
> +	case H_LONG_BUSY_ORDER_1_MSEC:
> +		return 1;
> +	case H_LONG_BUSY_ORDER_10_MSEC:
> +		return 10;
> +	case H_LONG_BUSY_ORDER_100_MSEC:
> +		return 100;
> +	case H_LONG_BUSY_ORDER_1_SEC:
> +		return 1000;
> +	case H_LONG_BUSY_ORDER_10_SEC:
> +		return 10000;
> +	case H_LONG_BUSY_ORDER_100_SEC:
> +		return 100000;
> +	default:
> +		return 1;
> +	}
> +}
> +
> +/**
>   * plpar_hcall_norets: - Make a pseries hypervisor call with no return arguments
>   * @opcode: The hypervisor call to make.
>   *


-- 
Brian King
Linux on Power Virtualization
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 18/19] ibmvscsi: driver enablement for CMO
  2008-06-12 22:24 ` [PATCH 18/19] ibmvscsi: driver enablement " Robert Jennings
@ 2008-06-13 18:30   ` Brian King
  0 siblings, 0 replies; 41+ messages in thread
From: Brian King @ 2008-06-13 18:30 UTC (permalink / raw)
  To: Robert Jennings; +Cc: linuxppc-dev, paulus, linux-scsi, David Darrington

CC'ing linux-scsi, although we'd like to take this through the powerpc tree since it
is part of a patch set to enable a Power platform feature.

-Brian

Robert Jennings wrote:
> From: Robert Jennings <rcj@linux.vnet.ibm.com>
> 
> Enable the driver to function in a Cooperative Memory Overcommitment (CMO)
> environment.
> 
> The following changes are made to enable the driver for CMO:
>  * DMA mapping errors will not result in error messages if entitlement has
>    been exceeded and resources were not available.
>  * The driver defines a get_io_entitlement function for use in a CMO
>    environment.  It indicates how much IO memory the driver needs in
>    order to function well.
> 
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
> 
> ---
>  drivers/scsi/ibmvscsi/ibmvscsi.c |   46 +++++++++++++++++++++++++++++++++------
>  drivers/scsi/ibmvscsi/ibmvscsi.h |    2 ++
>  2 files changed, 41 insertions(+), 7 deletions(-)
> 
> Index: b/drivers/scsi/ibmvscsi/ibmvscsi.c
> ===================================================================
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.c
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
> @@ -72,6 +72,8 @@
>  #include <linux/delay.h>
>  #include <asm/firmware.h>
>  #include <asm/vio.h>
> +#include <asm/firmware.h>
> +#include <asm/iommu.h>
>  #include <scsi/scsi.h>
>  #include <scsi/scsi_cmnd.h>
>  #include <scsi/scsi_host.h>
> @@ -426,8 +428,10 @@ static int map_sg_data(struct scsi_cmnd 
>  					   SG_ALL * sizeof(struct srp_direct_buf),
>  					   &evt_struct->ext_list_token, 0);
>  		if (!evt_struct->ext_list) {
> -			sdev_printk(KERN_ERR, cmd->device,
> -				    "Can't allocate memory for indirect table\n");
> +			if (!firmware_has_feature(FW_FEATURE_CMO))
> +				sdev_printk(KERN_ERR, cmd->device,
> +				            "Can't allocate memory "
> +				            "for indirect table\n");
>  			return 0;
>  		}
>  	}
> @@ -743,7 +747,9 @@ static int ibmvscsi_queuecommand(struct 
>  	srp_cmd->lun = ((u64) lun) << 48;
> 
>  	if (!map_data_for_srp_cmd(cmnd, evt_struct, srp_cmd, hostdata->dev)) {
> -		sdev_printk(KERN_ERR, cmnd->device, "couldn't convert cmd to srp_cmd\n");
> +		if (!firmware_has_feature(FW_FEATURE_CMO))
> +			sdev_printk(KERN_ERR, cmnd->device,
> +			            "couldn't convert cmd to srp_cmd\n");
>  		free_event_struct(&hostdata->pool, evt_struct);
>  		return SCSI_MLQUEUE_HOST_BUSY;
>  	}
> @@ -855,7 +861,10 @@ static void send_mad_adapter_info(struct
>  					    DMA_BIDIRECTIONAL);
> 
>  	if (dma_mapping_error(req->buffer)) {
> -		dev_err(hostdata->dev, "Unable to map request_buffer for adapter_info!\n");
> +		if (!firmware_has_feature(FW_FEATURE_CMO))
> +			dev_err(hostdata->dev,
> +			        "Unable to map request_buffer for "
> +			        "adapter_info!\n");
>  		free_event_struct(&hostdata->pool, evt_struct);
>  		return;
>  	}
> @@ -1400,7 +1409,9 @@ static int ibmvscsi_do_host_config(struc
>  						    DMA_BIDIRECTIONAL);
> 
>  	if (dma_mapping_error(host_config->buffer)) {
> -		dev_err(hostdata->dev, "dma_mapping error getting host config\n");
> +		if (!firmware_has_feature(FW_FEATURE_CMO))
> +			dev_err(hostdata->dev,
> +			        "dma_mapping error getting host config\n");
>  		free_event_struct(&hostdata->pool, evt_struct);
>  		return -1;
>  	}
> @@ -1604,7 +1615,7 @@ static struct scsi_host_template driver_
>  	.eh_host_reset_handler = ibmvscsi_eh_host_reset_handler,
>  	.slave_configure = ibmvscsi_slave_configure,
>  	.change_queue_depth = ibmvscsi_change_queue_depth,
> -	.cmd_per_lun = 16,
> +	.cmd_per_lun = IBMVSCSI_CMDS_PER_LUN_DEFAULT,
>  	.can_queue = IBMVSCSI_MAX_REQUESTS_DEFAULT,
>  	.this_id = -1,
>  	.sg_tablesize = SG_ALL,
> @@ -1613,6 +1624,26 @@ static struct scsi_host_template driver_
>  };
> 
>  /**
> + * ibmvscsi_get_io_entitlement - Calculate IO entitlement needed by the driver
> + *
> + * @vdev: struct vio_dev for the device whose entitlement is to be returned
> + *
> + * Return value:
> + *	Number of bytes of IO data the driver will need to perform well.
> + */
> +static unsigned long ibmvscsi_get_io_entitlement(struct vio_dev *vdev)
> +{
> +	/* iu_storage data allocated in initialize_event_pool */
> +	unsigned long io_entitlement = max_requests * sizeof(union viosrp_iu);
> +
> +	/* add io space for sg data */
> +	io_entitlement += (IBMVSCSI_MAX_SECTORS_DEFAULT *
> +	                     IBMVSCSI_CMDS_PER_LUN_DEFAULT);
> +
> +	return IOMMU_PAGE_ALIGN(io_entitlement);
> +}
> +
> +/**
>   * Called by bus code for each adapter
>   */
>  static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
> @@ -1641,7 +1672,7 @@ static int ibmvscsi_probe(struct vio_dev
>  	hostdata->host = host;
>  	hostdata->dev = dev;
>  	atomic_set(&hostdata->request_limit, -1);
> -	hostdata->host->max_sectors = 32 * 8; /* default max I/O 32 pages */
> +	hostdata->host->max_sectors = IBMVSCSI_MAX_SECTORS_DEFAULT;
> 
>  	rc = ibmvscsi_ops->init_crq_queue(&hostdata->queue, hostdata, max_requests);
>  	if (rc != 0 && rc != H_RESOURCE) {
> @@ -1735,6 +1766,7 @@ static struct vio_driver ibmvscsi_driver
>  	.id_table = ibmvscsi_device_table,
>  	.probe = ibmvscsi_probe,
>  	.remove = ibmvscsi_remove,
> +	.get_io_entitlement = ibmvscsi_get_io_entitlement,
>  	.driver = {
>  		.name = "ibmvscsi",
>  		.owner = THIS_MODULE,
> Index: b/drivers/scsi/ibmvscsi/ibmvscsi.h
> ===================================================================
> --- a/drivers/scsi/ibmvscsi/ibmvscsi.h
> +++ b/drivers/scsi/ibmvscsi/ibmvscsi.h
> @@ -45,6 +45,8 @@ struct Scsi_Host;
>  #define MAX_INDIRECT_BUFS 10
> 
>  #define IBMVSCSI_MAX_REQUESTS_DEFAULT 100
> +#define IBMVSCSI_CMDS_PER_LUN_DEFAULT 16
> +#define IBMVSCSI_MAX_SECTORS_DEFAULT 256 /* 32 * 8 = default max I/O 32 pages */
>  #define IBMVSCSI_MAX_CMDS_PER_LUN 64
> 
>  /* ------------------------------------------------------------
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc-dev


-- 
Brian King
Linux on Power Virtualization
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines
  2008-06-13  0:23   ` Stephen Rothwell
@ 2008-06-13 19:11     ` Nathan Fontenot
  0 siblings, 0 replies; 41+ messages in thread
From: Nathan Fontenot @ 2008-06-13 19:11 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

Stephen Rothwell wrote:
> Hi Robert,
> 
> On Thu, 12 Jun 2008 17:08:58 -0500 Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
>> -		seq_printf(m, "R4=0x%lx\n", h_entitled);
>> -		seq_printf(m, "R5=0x%lx\n", h_unallocated);
>> -		seq_printf(m, "R6=0x%lx\n", h_aggregation);
>> -		seq_printf(m, "R7=0x%lx\n", h_resource);
> 
> This changes a user visible interface by removing the above.  I don't
> know if this matters (probably not), but it should be mentioned in the
> changelog.
>

You're right, this should have been mentioned.  The values being printed
out are the raw values returned from the H_GET_PPP hcall; they were then
parsed and pretty-printed afterwards.  I don't see a need to print these
values out twice.

>> +	if (new_entitled)
>> +		*new_weight = current_weight;
>> +
>> +	if (new_weight)
>> +		*new_entitled = current_entitled;
> 
> These look fishy - checking one pointer for NULL and then updating via
> the other pointer.
> 

I thought something about this looked strange...

Unfortunately this code gets slightly updated again in patch 3/19 of
this patch series.  The point you make is valid though; we should not be
dereferencing the pointers without validating them.

I'll update this patch with a new changelog and pull the change from
patch 3/19 into this patch where it belongs.
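
For reference, the corrected logic would look something like this sketch
(essentially what the reposted update_ppp later in this thread does):

	/* Only dereference the pointer that was checked for NULL. */
	if (entitlement) {
		new_weight = current_weight;
		new_entitled = *entitlement;
	} else if (weight) {
		new_weight = *weight;
		new_entitled = current_entitled;
	} else
		return -EINVAL;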

-Nathan
> 
> 
> ------------------------------------------------------------------------
> 
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc-dev

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 10/19] [repost] powerpc: move get_longbusy_msecs out of ehca/ehea
  2008-06-13 18:24     ` Brian King
@ 2008-06-13 19:55       ` Jeff Garzik
  0 siblings, 0 replies; 41+ messages in thread
From: Jeff Garzik @ 2008-06-13 19:55 UTC (permalink / raw)
  To: Brian King; +Cc: netdev, linuxppc-dev, paulus, David Darrington

Brian King wrote:
> Jeff,
> 
> Regarding the patches Rob just posted here, we'd like to just take them
> through the powerpc tree with your sign off since they are part of a
> Power platform feature we are enabling.
> 
> Thanks,
> 
> Brian
> 
> Robert Jennings wrote:
>> From: Robert Jennings <rcj@linux.vnet.ibm.com>
>>
>> In support of Cooperative Memory Overcommitment (CMO) this moves
>> get_longbusy_msecs() out of the ehca and ehea drivers and into the 
>> architecture's hvcall header as plpar_get_longbusy_msecs.  Some firmware
>> calls made in pSeries platform iommu code will need to share this
>> functionality.
>>
>> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
>>
>> ---
>>
>> I missed copying netdev on this patch the first time.
>>
>>  drivers/infiniband/hw/ehca/hcp_if.c |   24 ++----------------------
>>  drivers/net/ehea/ehea_phyp.c        |    4 ++--
>>  drivers/net/ehea/ehea_phyp.h        |   20 --------------------
>>  include/asm-powerpc/hvcall.h        |   27 +++++++++++++++++++++++++++
>>  4 files changed, 31 insertions(+), 44 deletions(-)

ACK the quoted patch

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines
  2008-06-12 22:08 ` [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
  2008-06-13  0:23   ` Stephen Rothwell
@ 2008-06-16 16:07   ` Nathan Fontenot
  1 sibling, 0 replies; 41+ messages in thread
From: Nathan Fontenot @ 2008-06-16 16:07 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

Split the retrieval and setting of processor entitlement and weight into
helper routines.  This also removes the printing of the raw values
returned from h_get_ppp, the values are already parsed and printed.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
  arch/powerpc/kernel/lparcfg.c |  163 ++++++++++++++++++++++--------------------
  1 file changed, 86 insertions(+), 77 deletions(-)

Index: linux-2.6.git/arch/powerpc/kernel/lparcfg.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/lparcfg.c	2008-06-16 10:22:42.000000000 -0500
+++ linux-2.6.git/arch/powerpc/kernel/lparcfg.c	2008-06-16 10:32:29.000000000 -0500
@@ -167,7 +167,8 @@
  	return rc;
  }

-static void h_pic(unsigned long *pool_idle_time, unsigned long *num_procs)
+static unsigned int h_pic(unsigned long *pool_idle_time,
+			  unsigned long *num_procs)
  {
  	unsigned long rc;
  	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
@@ -176,6 +177,53 @@

  	*pool_idle_time = retbuf[0];
  	*num_procs = retbuf[1];
+
+	return rc;
+}
+
+/*
+ * parse_ppp_data
+ * Parse out the data returned from h_get_ppp and h_pic
+ */
+static void parse_ppp_data(struct seq_file *m)
+{
+	unsigned long h_entitled, h_unallocated;
+	unsigned long h_aggregation, h_resource;
+	int rc;
+
+	rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
+		       &h_resource);
+	if (rc)
+		return;
+
+	seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
+	seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
+	seq_printf(m, "system_active_processors=%ld\n",
+		   (h_resource >> 0 * 8) & 0xffff);
+
+	/* pool related entries are appropriate for shared configs */
+	if (lppaca[0].shared_proc) {
+		unsigned long pool_idle_time, pool_procs;
+
+		seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+
+		/* report pool_capacity in percentage */
+		seq_printf(m, "pool_capacity=%ld\n",
+			   ((h_resource >> 2 * 8) & 0xffff) * 100);
+
+		rc = h_pic(&pool_idle_time, &pool_procs);
+		if (!rc) {
+			seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
+			seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
+		}
+	}
+
+	seq_printf(m, "unallocated_capacity_weight=%ld\n",
+		   (h_resource >> 4 * 8) & 0xFF);
+
+	seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
+	seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
+	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
  }

  #define SPLPAR_CHARACTERISTICS_TOKEN 20
@@ -302,59 +350,11 @@
  	partition_active_processors = lparcfg_count_active_processors();

  	if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
-		unsigned long h_entitled, h_unallocated;
-		unsigned long h_aggregation, h_resource;
-		unsigned long pool_idle_time, pool_procs;
-		unsigned long purr;
-
-		h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
-			  &h_resource);
-
-		seq_printf(m, "R4=0x%lx\n", h_entitled);
-		seq_printf(m, "R5=0x%lx\n", h_unallocated);
-		seq_printf(m, "R6=0x%lx\n", h_aggregation);
-		seq_printf(m, "R7=0x%lx\n", h_resource);
-
-		purr = get_purr();
-
  		/* this call handles the ibm,get-system-parameter contents */
  		parse_system_parameter_string(m);
+		parse_ppp_data(m);

-		seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
-
-		seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
-
-		seq_printf(m, "system_active_processors=%ld\n",
-			   (h_resource >> 0 * 8) & 0xffff);
-
-		/* pool related entries are apropriate for shared configs */
-		if (lppaca[0].shared_proc) {
-
-			h_pic(&pool_idle_time, &pool_procs);
-
-			seq_printf(m, "pool=%ld\n",
-				   (h_aggregation >> 0 * 8) & 0xffff);
-
-			/* report pool_capacity in percentage */
-			seq_printf(m, "pool_capacity=%ld\n",
-				   ((h_resource >> 2 * 8) & 0xffff) * 100);
-
-			seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
-
-			seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
-		}
-
-		seq_printf(m, "unallocated_capacity_weight=%ld\n",
-			   (h_resource >> 4 * 8) & 0xFF);
-
-		seq_printf(m, "capacity_weight=%ld\n",
-			   (h_resource >> 5 * 8) & 0xFF);
-
-		seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
-
-		seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
-
-		seq_printf(m, "purr=%ld\n", purr);
+		seq_printf(m, "purr=%ld\n", get_purr());

  	} else {		/* non SPLPAR case */

@@ -382,6 +382,41 @@
  	return 0;
  }

+static ssize_t update_ppp(u64 *entitlement, u8 *weight)
+{
+	unsigned long current_entitled;
+	unsigned long dummy;
+	unsigned long resource;
+	u8 current_weight, new_weight;
+	u64 new_entitled;
+	ssize_t retval;
+
+	/* Get our current parameters */
+	retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+	if (retval)
+		return retval;
+
+	current_weight = (resource >> 5 * 8) & 0xFF;
+
+	if (entitlement) {
+		new_weight = current_weight;
+		new_entitled = *entitlement;
+	} else if (weight) {
+		new_weight = *weight;
+		new_entitled = current_entitled;
+	} else
+		return -EINVAL;
+
+	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+		 __FUNCTION__, current_entitled, current_weight);
+
+	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+		 __FUNCTION__, new_entitled, new_weight);
+
+	retval = plpar_hcall_norets(H_SET_PPP, new_entitled, new_weight);
+	return retval;
+}
+
  /*
   * Interface for changing system parameters (variable capacity weight
   * and entitled capacity).  Format of input is "param_name=value";
@@ -399,12 +434,6 @@
  	char *tmp;
  	u64 new_entitled, *new_entitled_ptr = &new_entitled;
  	u8 new_weight, *new_weight_ptr = &new_weight;
-
-	unsigned long current_entitled;	/* parameters for h_get_ppp */
-	unsigned long dummy;
-	unsigned long resource;
-	u8 current_weight;
-
  	ssize_t retval = -ENOMEM;

  	if (!firmware_has_feature(FW_FEATURE_SPLPAR) ||
@@ -432,33 +461,17 @@
  		*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
  		if (endp == tmp)
  			goto out;
-		new_weight_ptr = &current_weight;
+
+		retval = update_ppp(new_entitled_ptr, NULL);
  	} else if (!strcmp(kbuf, "capacity_weight")) {
  		char *endp;
  		*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
  		if (endp == tmp)
  			goto out;
-		new_entitled_ptr = &current_entitled;
-	} else
-		goto out;

-	/* Get our current parameters */
-	retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
-	if (retval) {
-		retval = -EIO;
+		retval = update_ppp(NULL, new_weight_ptr);
+	} else
  		goto out;
-	}
-
-	current_weight = (resource >> 5 * 8) & 0xFF;
-
-	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
-		 __func__, current_entitled, current_weight);
-
-	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
-		 __func__, *new_entitled_ptr, *new_weight_ptr);
-
-	retval = plpar_hcall_norets(H_SET_PPP, *new_entitled_ptr,
-				    *new_weight_ptr);

  	if (retval == H_SUCCESS || retval == H_CONSTRAINED) {
  		retval = count;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 03/19][v2] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
  2008-06-12 22:09 ` [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
@ 2008-06-16 16:09   ` Nathan Fontenot
  2008-06-16 20:47   ` [PATCH 03/19][v3] " Nathan Fontenot
  2008-06-24 15:26   ` [PATCH 03/19] " Nathan Fontenot
  2 siblings, 0 replies; 41+ messages in thread
From: Nathan Fontenot @ 2008-06-16 16:09 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

This patch has been updated to remove a piece that is now included
in patch 2/19 of this series where it should be.

-Nathan

Update /proc/ppc64/lparcfg to enable displaying of Cooperative Memory
Overcommitment statistics as reported by the H_GET_MPP hcall.  This also
updates the lparcfg interface to allow setting memory entitlement and
weight.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
  arch/powerpc/kernel/lparcfg.c |  139 +++++++++++++++++++++++++++++++++++++++---
  include/asm-powerpc/hvcall.h  |   18 +++++
  2 files changed, 147 insertions(+), 10 deletions(-)

Index: linux-2.6.git/arch/powerpc/kernel/lparcfg.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/lparcfg.c	2008-06-16 10:32:29.000000000 -0500
+++ linux-2.6.git/arch/powerpc/kernel/lparcfg.c	2008-06-16 10:47:55.000000000 -0500
@@ -129,6 +129,35 @@
  /*
   * Methods used to fetch LPAR data when running on a pSeries platform.
   */
+/**
+ * h_get_mpp
+ * H_GET_MPP hcall returns info in 7 parms
+ */
+int h_get_mpp(struct hvcall_mpp_data *mpp_data)
+{
+	int rc;
+	unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
+
+	rc = plpar_hcall(H_GET_MPP, retbuf);
+
+	mpp_data->entitled_mem = retbuf[0];
+	mpp_data->mapped_mem = retbuf[1];
+
+	mpp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+	mpp_data->pool_num = retbuf[2] & 0xffff;
+
+	mpp_data->mem_weight = (retbuf[3] >> 7 * 8) & 0xff;
+	mpp_data->unallocated_mem_weight = (retbuf[3] >> 6 * 8) & 0xff;
+	mpp_data->unallocated_entitlement = retbuf[3] & 0xffffffffffff;
+
+	mpp_data->pool_size = retbuf[4];
+	mpp_data->loan_request = retbuf[5];
+	mpp_data->backing_mem = retbuf[6];
+
+	return rc;
+}
+EXPORT_SYMBOL(h_get_mpp);
+
  /*
   * H_GET_PPP hcall returns info in 4 parms.
   *  entitled_capacity,unallocated_capacity,
@@ -226,6 +255,44 @@
  	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
  }

+/**
+ * parse_mpp_data
+ * Parse out data returned from h_get_mpp
+ */
+static void parse_mpp_data(struct seq_file *m)
+{
+	struct hvcall_mpp_data mpp_data;
+	int rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return;
+
+	seq_printf(m, "entitled_memory=%ld\n", mpp_data.entitled_mem);
+
+	if (mpp_data.mapped_mem != -1)
+		seq_printf(m, "mapped_entitled_memory=%ld\n",
+			   mpp_data.mapped_mem);
+
+	seq_printf(m, "entitled_memory_group_number=%d\n", mpp_data.group_num);
+	seq_printf(m, "entitled_memory_pool_number=%d\n", mpp_data.pool_num);
+
+	seq_printf(m, "entitled_memory_weight=%d\n", mpp_data.mem_weight);
+	seq_printf(m, "unallocated_entitled_memory_weight=%d\n",
+		   mpp_data.unallocated_mem_weight);
+	seq_printf(m, "unallocated_io_mapping_entitlement=%ld\n",
+		   mpp_data.unallocated_entitlement);
+
+	if (mpp_data.pool_size != -1)
+		seq_printf(m, "entitled_memory_pool_size=%ld bytes\n",
+			   mpp_data.pool_size);
+
+	seq_printf(m, "entitled_memory_loan_request=%ld\n",
+		   mpp_data.loan_request);
+
+	seq_printf(m, "backing_memory=%ld bytes\n", mpp_data.backing_mem);
+}
+
  #define SPLPAR_CHARACTERISTICS_TOKEN 20
  #define SPLPAR_MAXLENGTH 1026*(sizeof(char))

@@ -353,6 +420,7 @@
  		/* this call handles the ibm,get-system-parameter contents */
  		parse_system_parameter_string(m);
  		parse_ppp_data(m);
+		parse_mpp_data(m);

  		seq_printf(m, "purr=%ld\n", get_purr());

@@ -417,6 +485,43 @@
  	return retval;
  }

+/**
+ * update_mpp
+ *
+ * Update the memory entitlement and weight for the partition.  Caller must
+ * specify either a new entitlement or weight, not both, to be updated
+ * since the h_set_mpp call takes both entitlement and weight as
+ * parameters.
+ */
+static ssize_t update_mpp(u64 *entitlement, u8 *weight)
+{
+	struct hvcall_mpp_data mpp_data;
+	u64 new_entitled;
+	u8 new_weight;
+	ssize_t rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return rc;
+
+	if (entitlement) {
+		new_weight = mpp_data.mem_weight;
+		new_entitled = *entitlement;
+	} else if (weight) {
+		new_weight = *weight;
+		new_entitled = mpp_data.entitled_mem;
+	} else
+		return -EINVAL;
+
+	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+		 __FUNCTION__, mpp_data.entitled_mem, mpp_data.mem_weight);
+
+	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+		 __FUNCTION__, new_entitled, new_weight);
+
+	rc = plpar_hcall_norets(H_SET_MPP, new_entitled, new_weight);
+	return rc;
+}
+
  /*
   * Interface for changing system parameters (variable capacity weight
   * and entitled capacity).  Format of input is "param_name=value";
@@ -470,6 +575,20 @@
  			goto out;

  		retval = update_ppp(NULL, new_weight_ptr);
+	} else if (!strcmp(kbuf, "entitled_memory")) {
+		char *endp;
+		*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(new_entitled_ptr, NULL);
+	} else if (!strcmp(kbuf, "entitled_memory_weight")) {
+		char *endp;
+		*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(NULL, new_weight_ptr);
  	} else
  		goto out;

Index: linux-2.6.git/include/asm-powerpc/hvcall.h
===================================================================
--- linux-2.6.git.orig/include/asm-powerpc/hvcall.h	2008-06-16 10:25:58.000000000 -0500
+++ linux-2.6.git/include/asm-powerpc/hvcall.h	2008-06-16 10:42:42.000000000 -0500
@@ -210,7 +210,9 @@
  #define H_JOIN			0x298
  #define H_VASI_STATE            0x2A4
  #define H_ENABLE_CRQ		0x2B0
-#define MAX_HCALL_OPCODE	H_ENABLE_CRQ
+#define H_SET_MPP		0x2D0
+#define H_GET_MPP		0x2D4
+#define MAX_HCALL_OPCODE	H_GET_MPP

  #ifndef __ASSEMBLY__

@@ -270,6 +272,20 @@
  };
  #define HCALL_STAT_ARRAY_SIZE	((MAX_HCALL_OPCODE >> 2) + 1)

+struct hvcall_mpp_data {
+	unsigned long entitled_mem;
+	unsigned long mapped_mem;
+	unsigned short group_num;
+	unsigned short pool_num;
+	unsigned char mem_weight;
+	unsigned char unallocated_mem_weight;
+	unsigned long unallocated_entitlement;	/* value in bytes */
+	unsigned long pool_size;
+	long loan_request;
+	unsigned long backing_mem;
+};
+
+int h_get_mpp(struct hvcall_mpp_data *);
  #endif /* __ASSEMBLY__ */
  #endif /* __KERNEL__ */
  #endif /* _ASM_POWERPC_HVCALL_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH 03/19][v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
  2008-06-12 22:09 ` [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
  2008-06-16 16:09   ` [PATCH 03/19][v2] " Nathan Fontenot
@ 2008-06-16 20:47   ` Nathan Fontenot
  2008-06-24 14:23     ` Brian King
  2008-06-24 15:26   ` [PATCH 03/19] " Nathan Fontenot
  2 siblings, 1 reply; 41+ messages in thread
From: Nathan Fontenot @ 2008-06-16 20:47 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

Version 3 of the patch, a small update to use the correct plpar_hcall
variant for H_GET_MPP: the hcall returns seven values, so plpar_hcall9
must be used rather than plpar_hcall.
-------------------

Update /proc/ppc64/lparcfg to enable displaying of Cooperative Memory
Overcommitment statistics as reported by the H_GET_MPP hcall.  This also
updates the lparcfg interface to allow setting memory entitlement and
weight.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
  arch/powerpc/kernel/lparcfg.c |  139 +++++++++++++++++++++++++++++++++++++++---
  include/asm-powerpc/hvcall.h  |   18 +++++
  2 files changed, 147 insertions(+), 10 deletions(-)

Index: linux-2.6.git/arch/powerpc/kernel/lparcfg.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/lparcfg.c	2008-06-16 10:32:29.000000000 -0500
+++ linux-2.6.git/arch/powerpc/kernel/lparcfg.c	2008-06-16 10:47:55.000000000 -0500
@@ -129,6 +129,35 @@
  /*
   * Methods used to fetch LPAR data when running on a pSeries platform.
   */
+/**
+ * h_get_mpp
+ * H_GET_MPP hcall returns info in 7 parms
+ */
+int h_get_mpp(struct hvcall_mpp_data *mpp_data)
+{
+	int rc;
+	unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+
+	rc = plpar_hcall9(H_GET_MPP, retbuf);
+
+	mpp_data->entitled_mem = retbuf[0];
+	mpp_data->mapped_mem = retbuf[1];
+
+	mpp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+	mpp_data->pool_num = retbuf[2] & 0xffff;
+
+	mpp_data->mem_weight = (retbuf[3] >> 7 * 8) & 0xff;
+	mpp_data->unallocated_mem_weight = (retbuf[3] >> 6 * 8) & 0xff;
+	mpp_data->unallocated_entitlement = retbuf[3] & 0xffffffffffff;
+
+	mpp_data->pool_size = retbuf[4];
+	mpp_data->loan_request = retbuf[5];
+	mpp_data->backing_mem = retbuf[6];
+
+	return rc;
+}
+EXPORT_SYMBOL(h_get_mpp);
+
  /*
   * H_GET_PPP hcall returns info in 4 parms.
   *  entitled_capacity,unallocated_capacity,
@@ -226,6 +255,44 @@
  	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
  }

+/**
+ * parse_mpp_data
+ * Parse out data returned from h_get_mpp
+ */
+static void parse_mpp_data(struct seq_file *m)
+{
+	struct hvcall_mpp_data mpp_data;
+	int rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return;
+
+	seq_printf(m, "entitled_memory=%ld\n", mpp_data.entitled_mem);
+
+	if (mpp_data.mapped_mem != -1)
+		seq_printf(m, "mapped_entitled_memory=%ld\n",
+			   mpp_data.mapped_mem);
+
+	seq_printf(m, "entitled_memory_group_number=%d\n", mpp_data.group_num);
+	seq_printf(m, "entitled_memory_pool_number=%d\n", mpp_data.pool_num);
+
+	seq_printf(m, "entitled_memory_weight=%d\n", mpp_data.mem_weight);
+	seq_printf(m, "unallocated_entitled_memory_weight=%d\n",
+		   mpp_data.unallocated_mem_weight);
+	seq_printf(m, "unallocated_io_mapping_entitlement=%ld\n",
+		   mpp_data.unallocated_entitlement);
+
+	if (mpp_data.pool_size != -1)
+		seq_printf(m, "entitled_memory_pool_size=%ld bytes\n",
+			   mpp_data.pool_size);
+
+	seq_printf(m, "entitled_memory_loan_request=%ld\n",
+		   mpp_data.loan_request);
+
+	seq_printf(m, "backing_memory=%ld bytes\n", mpp_data.backing_mem);
+}
+
  #define SPLPAR_CHARACTERISTICS_TOKEN 20
  #define SPLPAR_MAXLENGTH 1026*(sizeof(char))

@@ -353,6 +420,7 @@
  		/* this call handles the ibm,get-system-parameter contents */
  		parse_system_parameter_string(m);
  		parse_ppp_data(m);
+		parse_mpp_data(m);

  		seq_printf(m, "purr=%ld\n", get_purr());

@@ -417,6 +485,43 @@
  	return retval;
  }

+/**
+ * update_mpp
+ *
+ * Update the memory entitlement and weight for the partition.  Caller must
+ * specify either a new entitlement or weight, not both, to be updated
+ * since the h_set_mpp call takes both entitlement and weight as
+ * parameters.
+ */
+static ssize_t update_mpp(u64 *entitlement, u8 *weight)
+{
+	struct hvcall_mpp_data mpp_data;
+	u64 new_entitled;
+	u8 new_weight;
+	ssize_t rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return rc;
+
+	if (entitlement) {
+		new_weight = mpp_data.mem_weight;
+		new_entitled = *entitlement;
+	} else if (weight) {
+		new_weight = *weight;
+		new_entitled = mpp_data.entitled_mem;
+	} else
+		return -EINVAL;
+
+	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+		 __FUNCTION__, mpp_data.entitled_mem, mpp_data.mem_weight);
+
+	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+		 __FUNCTION__, new_entitled, new_weight);
+
+	rc = plpar_hcall_norets(H_SET_MPP, new_entitled, new_weight);
+	return rc;
+}
+
  /*
   * Interface for changing system parameters (variable capacity weight
   * and entitled capacity).  Format of input is "param_name=value";
@@ -470,6 +575,20 @@
  			goto out;

  		retval = update_ppp(NULL, new_weight_ptr);
+	} else if (!strcmp(kbuf, "entitled_memory")) {
+		char *endp;
+		*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(new_entitled_ptr, NULL);
+	} else if (!strcmp(kbuf, "entitled_memory_weight")) {
+		char *endp;
+		*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(NULL, new_weight_ptr);
  	} else
  		goto out;

Index: linux-2.6.git/include/asm-powerpc/hvcall.h
===================================================================
--- linux-2.6.git.orig/include/asm-powerpc/hvcall.h	2008-06-16 10:25:58.000000000 -0500
+++ linux-2.6.git/include/asm-powerpc/hvcall.h	2008-06-16 10:42:42.000000000 -0500
@@ -210,7 +210,9 @@
  #define H_JOIN			0x298
  #define H_VASI_STATE            0x2A4
  #define H_ENABLE_CRQ		0x2B0
-#define MAX_HCALL_OPCODE	H_ENABLE_CRQ
+#define H_SET_MPP		0x2D0
+#define H_GET_MPP		0x2D4
+#define MAX_HCALL_OPCODE	H_GET_MPP

  #ifndef __ASSEMBLY__

@@ -270,6 +272,20 @@
  };
  #define HCALL_STAT_ARRAY_SIZE	((MAX_HCALL_OPCODE >> 2) + 1)

+struct hvcall_mpp_data {
+	unsigned long entitled_mem;
+	unsigned long mapped_mem;
+	unsigned short group_num;
+	unsigned short pool_num;
+	unsigned char mem_weight;
+	unsigned char unallocated_mem_weight;
+	unsigned long unallocated_entitlement;	/* value in bytes */
+	unsigned long pool_size;
+	long loan_request;
+	unsigned long backing_mem;
+};
+
+int h_get_mpp(struct hvcall_mpp_data *);
  #endif /* __ASSEMBLY__ */
  #endif /* __KERNEL__ */
  #endif /* _ASM_POWERPC_HVCALL_H */

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/19] powerpc: iommu enablement for CMO
  2008-06-13  1:43   ` Olof Johansson
@ 2008-06-20 15:03     ` Robert Jennings
  0 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-20 15:03 UTC (permalink / raw)
  To: Olof Johansson; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

* Olof Johansson (olof@lixom.net) wrote:
> Hi,
> 
> Some comments and questions below.
> 
> 
> -Olof
> 
> On Thu, Jun 12, 2008 at 05:19:36PM -0500, Robert Jennings wrote:
> > Index: b/arch/powerpc/kernel/iommu.c
> > ===================================================================
> > --- a/arch/powerpc/kernel/iommu.c
> > +++ b/arch/powerpc/kernel/iommu.c
> > @@ -183,6 +183,49 @@ static unsigned long iommu_range_alloc(s
> >  	return n;
> >  }
> >  
> > +/** iommu_undo - Clear iommu_table bits without calling platform tce_free.
> > + *
> > + * @tbl - struct iommu_table to alter
> > + * @dma_addr - DMA address to free entries for
> > + * @npages - number of pages to free entries for
> > + *
> > + * This is the same as __iommu_free without the call to ppc_md.tce_free();
> 
> __iommu_free has the __ prepended to indicate that it's not locking.
> Since this does the same, please keep the __. Also see comments below.
> 
> > + *
> > + * To clean up after ppc_md.tce_build() errors we need to clear bits
> > + * in the table without calling the ppc_md.tce_free() method; calling
> > + * ppc_md.tce_free() could alter entries that were not touched due to a
> > + * premature failure in ppc_md.tce_build().
> > + *
> > + * The ppc_md.tce_build() needs to perform its own clean up prior to
> > + * returning its error.
> > + */
> > +static void iommu_undo(struct iommu_table *tbl, dma_addr_t dma_addr,
> > +			 unsigned int npages)
> > +{
> > +	unsigned long entry, free_entry;
> > +
> > +	entry = dma_addr >> IOMMU_PAGE_SHIFT;
> > +	free_entry = entry - tbl->it_offset;
> > +
> > +	if (((free_entry + npages) > tbl->it_size) ||
> > +	    (entry < tbl->it_offset)) {
> > +		if (printk_ratelimit()) {
> > +			printk(KERN_INFO "iommu_undo: invalid entry\n");
> > +			printk(KERN_INFO "\tentry    = 0x%lx\n", entry);
> > +			printk(KERN_INFO "\tdma_addr = 0x%lx\n", (u64)dma_addr);
> > +			printk(KERN_INFO "\tTable    = 0x%lx\n", (u64)tbl);
> > +			printk(KERN_INFO "\tbus#     = 0x%lx\n", tbl->it_busno);
> > +			printk(KERN_INFO "\tsize     = 0x%lx\n", tbl->it_size);
> > +			printk(KERN_INFO "\tstartOff = 0x%lx\n", tbl->it_offset);
> > +			printk(KERN_INFO "\tindex    = 0x%lx\n", tbl->it_index);
> > +			WARN_ON(1);
> > +		}
> > +		return;
> > +	}
> > +
> > +	iommu_area_free(tbl->it_map, free_entry, npages);
> > +}
> 
> Ick, this should just be refactored to reuse code together with
> iommu_free() instead of duplicating it. Also, the error checking
> shouldn't be needed here.
> 
> Actually, is there harm in calling tce_free for these cases anyway? I'm
> guessing it's not a performance critical path.

There is no harm and I will get rid of this and just call __iommu_free.

> > @@ -275,7 +330,7 @@ int iommu_map_sg(struct device *dev, str
> >  	dma_addr_t dma_next = 0, dma_addr;
> >  	unsigned long flags;
> >  	struct scatterlist *s, *outs, *segstart;
> > -	int outcount, incount, i;
> > +	int outcount, incount, i, rc = 0;
> >  	unsigned int align;
> >  	unsigned long handle;
> >  	unsigned int max_seg_size;
> > @@ -336,7 +391,10 @@ int iommu_map_sg(struct device *dev, str
> >  			    npages, entry, dma_addr);
> >  
> >  		/* Insert into HW table */
> > -		ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK, direction);
> > +		rc = ppc_md.tce_build(tbl, entry, npages,
> > +		                      vaddr & IOMMU_PAGE_MASK, direction);
> > +		if(unlikely(rc))
> > +			goto failure;
> >  
> >  		/* If we are in an open segment, try merging */
> >  		if (segstart != s) {
> > @@ -399,7 +457,10 @@ int iommu_map_sg(struct device *dev, str
> >  
> >  			vaddr = s->dma_address & IOMMU_PAGE_MASK;
> >  			npages = iommu_num_pages(s->dma_address, s->dma_length);
> > -			__iommu_free(tbl, vaddr, npages);
> > +			if (!rc)
> > +				__iommu_free(tbl, vaddr, npages);
> > +			else
> > +				iommu_undo(tbl, vaddr, npages);
> 
> 'rc' is a quite generic name to carry state this far away from where
> it's set. Either a more descriptive name (build_fail, whatever), or if
> the above is true, just call __iommu_free here as well.

I'll fix this.

> > -static void tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
> > +static void tce_free_pSeriesLP(struct iommu_table*, long, long);
> > +static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
> > +
> > +static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
> >  				long npages, unsigned long uaddr,
> >  				enum dma_data_direction direction)
> >  {
> > -	u64 rc;
> > +	u64 rc = 0;
> >  	u64 proto_tce, tce;
> >  	u64 rpn;
> > +	int sleep_msecs, ret = 0;
> > +	long tcenum_start = tcenum, npages_start = npages;
> >  
> >  	rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT;
> >  	proto_tce = TCE_PCI_READ;
> > @@ -108,7 +115,21 @@ static void tce_build_pSeriesLP(struct i
> >  
> >  	while (npages--) {
> >  		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
> > -		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
> > +		do {
> > +			rc = plpar_tce_put((u64)tbl->it_index,
> > +			                   (u64)tcenum << 12, tce);
> > +			if (unlikely(H_IS_LONG_BUSY(rc))) {
> > +				sleep_msecs = plpar_get_longbusy_msecs(rc);
> > +				mdelay(sleep_msecs);
> 
> Ouch! You're holding locks and stuff here. Do you really want this right
> here?

All of this delay business will be removed.  The firmware will not be
returning any H_LONG_BUSY return codes.

> > +			}
> > +		} while (unlikely(H_IS_LONG_BUSY(rc)));
> 
> Do you also want to keep doing this forever, or eventually just fail
> instead?
> 
> > +		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
> > +			ret = (int)rc;
> > +			tce_free_pSeriesLP(tbl, tcenum_start,
> > +			                   (npages_start - (npages + 1)));
> > +			break;
> > +		}
> >  
> >  		if (rc && printk_ratelimit()) {
> >  			printk("tce_build_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
> > @@ -121,19 +142,22 @@ static void tce_build_pSeriesLP(struct i
> >  		tcenum++;
> >  		rpn++;
> >  	}
> > +	return ret;
> >  }
> >  
> >  static DEFINE_PER_CPU(u64 *, tce_page) = NULL;
> >  
> > -static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> > +static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
> >  				     long npages, unsigned long uaddr,
> >  				     enum dma_data_direction direction)
> >  {
> > -	u64 rc;
> > +	u64 rc = 0;
> >  	u64 proto_tce;
> >  	u64 *tcep;
> >  	u64 rpn;
> >  	long l, limit;
> > +	long tcenum_start = tcenum, npages_start = npages;
> > +	int sleep_msecs, ret = 0;
> >  
> >  	if (npages == 1)
> >  		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
> > @@ -171,15 +195,26 @@ static void tce_buildmulti_pSeriesLP(str
> >  			rpn++;
> >  		}
> >  
> > -		rc = plpar_tce_put_indirect((u64)tbl->it_index,
> > -					    (u64)tcenum << 12,
> > -					    (u64)virt_to_abs(tcep),
> > -					    limit);
> > +		do {
> > +			rc = plpar_tce_put_indirect(tbl->it_index, tcenum << 12,
> > +						    virt_to_abs(tcep), limit);
> > +			if (unlikely(H_IS_LONG_BUSY(rc))) {
> > +				sleep_msecs = plpar_get_longbusy_msecs(rc);
> > +				mdelay(sleep_msecs);
> > +			}
> > +		} while (unlikely(H_IS_LONG_BUSY(rc)));
> >  
> >  		npages -= limit;
> >  		tcenum += limit;
> >  	} while (npages > 0 && !rc);
> >  
> > +	if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
> > +		ret = (int)rc;
> > +		tce_freemulti_pSeriesLP(tbl, tcenum_start,
> > +		                        (npages_start - (npages + limit)));
> > +		return ret;
> > +	}
> > +
> >  	if (rc && printk_ratelimit()) {
> >  		printk("tce_buildmulti_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
> >  		printk("\tindex   = 0x%lx\n", (u64)tbl->it_index);
> > @@ -187,14 +222,23 @@ static void tce_buildmulti_pSeriesLP(str
> >  		printk("\ttce[0] val = 0x%lx\n", tcep[0]);
> >  		show_stack(current, (unsigned long *)__get_SP());
> >  	}
> > +	return ret;
> >  }
> >  
> >  static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
> >  {
> > +	int sleep_msecs;
> >  	u64 rc;
> >  
> >  	while (npages--) {
> > -		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, 0);
> > +		do {
> > +			rc = plpar_tce_put((u64)tbl->it_index,
> > +			                   (u64)tcenum << 12, 0);
> > +			if (unlikely(H_IS_LONG_BUSY(rc))) {
> > +				sleep_msecs = plpar_get_longbusy_msecs(rc);
> > +				mdelay(sleep_msecs);
> > +			}
> > +		} while (unlikely(H_IS_LONG_BUSY(rc)));
> 
> Can this ever happen? I would hope that any entry that's got an active
> mapping is actually pinned in memory; what other than paging in from
> disk can result in long busy?

This won't happen, I'll remove it.

> > @@ -210,9 +254,17 @@ static void tce_free_pSeriesLP(struct io
> >  
> >  static void tce_freemulti_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
> >  {
> > +	int sleep_msecs;
> >  	u64 rc;
> >  
> > -	rc = plpar_tce_stuff((u64)tbl->it_index, (u64)tcenum << 12, 0, npages);
> > +	do {
> > +		rc = plpar_tce_stuff((u64)tbl->it_index,
> > +		                     (u64)tcenum << 12, 0, npages);
> > +		if (unlikely(H_IS_LONG_BUSY(rc))) {
> > +			sleep_msecs = plpar_get_longbusy_msecs(rc);
> > +			mdelay(sleep_msecs);
> > +		}
> > +	} while (unlikely(H_IS_LONG_BUSY(rc)));
> >  
> >  	if (rc && printk_ratelimit()) {
> >  		printk("tce_freemulti_pSeriesLP: plpar_tce_stuff failed\n");

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 11/19][v2] powerpc: iommu enablement for CMO
  2008-06-12 22:19 ` [PATCH 11/19] powerpc: iommu enablement for CMO Robert Jennings
  2008-06-13  1:43   ` Olof Johansson
@ 2008-06-20 15:12   ` Robert Jennings
  1 sibling, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-20 15:12 UTC (permalink / raw)
  To: paulus, linuxppc-dev, Brian King, Nathan Fontenot,
	David Darrington

To support Cooperative Memory Overcommitment (CMO), we need to check
for failure from some of the tce hcalls.

These changes for the pseries platform affect the powerpc architecture;
changes for the other affected platforms are included in this patch.

pSeries platform IOMMU code changes:
 * platform TCE functions must handle H_NOT_ENOUGH_RESOURCES errors and
   return an error.

Architecture IOMMU code changes:
 * Calls to ppc_md.tce_build need to check return values and return
   DMA_ERROR_CODE for transient errors.

Architecture changes:
 * struct machdep_calls for tce_build*_pSeriesLP functions need to change
   to indicate failure.
 * all other platforms will need updates to iommu functions to match the new
   calling semantics; they will return 0 on success.  The other platforms'
   default configs have been built, but no further testing was performed.

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 arch/powerpc/kernel/iommu.c            |   26 +++++++++++++++++++++----
 arch/powerpc/platforms/cell/iommu.c    |    3 ++-
 arch/powerpc/platforms/iseries/iommu.c |    3 ++-
 arch/powerpc/platforms/pasemi/iommu.c  |    3 ++-
 arch/powerpc/platforms/pseries/iommu.c |   34 ++++++++++++++++++++++++++++-----
 arch/powerpc/sysdev/dart_iommu.c       |    3 ++-
 include/asm-powerpc/machdep.h          |    2 +-
 7 files changed, 60 insertions(+), 14 deletions(-)

Index: b/arch/powerpc/kernel/iommu.c
===================================================================
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -49,6 +49,8 @@ static int novmerge = 1;
 
 static int protect4gb = 1;
 
+static void __iommu_free(struct iommu_table *, dma_addr_t, unsigned int);
+
 static inline unsigned long iommu_num_pages(unsigned long vaddr,
 					    unsigned long slen)
 {
@@ -190,6 +192,7 @@ static dma_addr_t iommu_alloc(struct dev
 {
 	unsigned long entry, flags;
 	dma_addr_t ret = DMA_ERROR_CODE;
+	int build_fail;
 
 	spin_lock_irqsave(&(tbl->it_lock), flags);
 
@@ -204,9 +207,21 @@ static dma_addr_t iommu_alloc(struct dev
 	ret = entry << IOMMU_PAGE_SHIFT;	/* Set the return dma address */
 
 	/* Put the TCEs in the HW table */
-	ppc_md.tce_build(tbl, entry, npages, (unsigned long)page & IOMMU_PAGE_MASK,
-			 direction);
+	build_fail = ppc_md.tce_build(tbl, entry, npages,
+	                              (unsigned long)page & IOMMU_PAGE_MASK,
+	                              direction);
+
+	/* ppc_md.tce_build() only returns non-zero for transient errors.
+	 * Clean up the table bitmap in this case and return
+	 * DMA_ERROR_CODE. For all other errors the functionality is
+	 * not altered.
+	 */
+	if (unlikely(build_fail)) {
+		__iommu_free(tbl, ret, npages);
 
+		spin_unlock_irqrestore(&(tbl->it_lock), flags);
+		return DMA_ERROR_CODE;
+	}
 
 	/* Flush/invalidate TLB caches if necessary */
 	if (ppc_md.tce_flush)
@@ -275,7 +290,7 @@ int iommu_map_sg(struct device *dev, str
 	dma_addr_t dma_next = 0, dma_addr;
 	unsigned long flags;
 	struct scatterlist *s, *outs, *segstart;
-	int outcount, incount, i;
+	int outcount, incount, i, build_fail = 0;
 	unsigned int align;
 	unsigned long handle;
 	unsigned int max_seg_size;
@@ -336,7 +351,10 @@ int iommu_map_sg(struct device *dev, str
 			    npages, entry, dma_addr);
 
 		/* Insert into HW table */
-		ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK, direction);
+		build_fail = ppc_md.tce_build(tbl, entry, npages,
+		                              vaddr & IOMMU_PAGE_MASK, direction);
+		if (unlikely(build_fail))
+			goto failure;
 
 		/* If we are in an open segment, try merging */
 		if (segstart != s) {
Index: b/arch/powerpc/platforms/cell/iommu.c
===================================================================
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -172,7 +172,7 @@ static void invalidate_tce_cache(struct 
 	}
 }
 
-static void tce_build_cell(struct iommu_table *tbl, long index, long npages,
+static int tce_build_cell(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction)
 {
 	int i;
@@ -210,6 +210,7 @@ static void tce_build_cell(struct iommu_
 
 	pr_debug("tce_build_cell(index=%lx,n=%lx,dir=%d,base_pte=%lx)\n",
 		 index, npages, direction, base_pte);
+	return 0;
 }
 
 static void tce_free_cell(struct iommu_table *tbl, long index, long npages)
Index: b/arch/powerpc/platforms/iseries/iommu.c
===================================================================
--- a/arch/powerpc/platforms/iseries/iommu.c
+++ b/arch/powerpc/platforms/iseries/iommu.c
@@ -41,7 +41,7 @@
 #include <asm/iseries/hv_call_event.h>
 #include <asm/iseries/iommu.h>
 
-static void tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
+static int tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
 		unsigned long uaddr, enum dma_data_direction direction)
 {
 	u64 rc;
@@ -70,6 +70,7 @@ static void tce_build_iSeries(struct iom
 		index++;
 		uaddr += TCE_PAGE_SIZE;
 	}
+	return 0;
 }
 
 static void tce_free_iSeries(struct iommu_table *tbl, long index, long npages)
Index: b/arch/powerpc/platforms/pasemi/iommu.c
===================================================================
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -83,7 +83,7 @@ static u32 *iob_l2_base;
 static struct iommu_table iommu_table_iobmap;
 static int iommu_table_iobmap_inited;
 
-static void iobmap_build(struct iommu_table *tbl, long index,
+static int iobmap_build(struct iommu_table *tbl, long index,
 			 long npages, unsigned long uaddr,
 			 enum dma_data_direction direction)
 {
@@ -107,6 +107,7 @@ static void iobmap_build(struct iommu_ta
 		uaddr += IOBMAP_PAGE_SIZE;
 		bus_addr += IOBMAP_PAGE_SIZE;
 	}
+	return 0;
 }
 
 
Index: b/arch/powerpc/platforms/pseries/iommu.c
===================================================================
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -48,7 +48,7 @@
 #include "plpar_wrappers.h"
 
 
-static void tce_build_pSeries(struct iommu_table *tbl, long index,
+static int tce_build_pSeries(struct iommu_table *tbl, long index,
 			      long npages, unsigned long uaddr,
 			      enum dma_data_direction direction)
 {
@@ -71,6 +71,7 @@ static void tce_build_pSeries(struct iom
 		uaddr += TCE_PAGE_SIZE;
 		tcep++;
 	}
+	return 0;
 }
 
 
@@ -93,13 +94,18 @@ static unsigned long tce_get_pseries(str
 	return *tcep;
 }
 
-static void tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static void tce_free_pSeriesLP(struct iommu_table*, long, long);
+static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
+
+static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				long npages, unsigned long uaddr,
 				enum dma_data_direction direction)
 {
-	u64 rc;
+	u64 rc = 0;
 	u64 proto_tce, tce;
 	u64 rpn;
+	int ret = 0;
+	long tcenum_start = tcenum, npages_start = npages;
 
 	rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT;
 	proto_tce = TCE_PCI_READ;
@@ -110,6 +116,13 @@ static void tce_build_pSeriesLP(struct i
 		tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
 		rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
 
+		if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
+			ret = (int)rc;
+			tce_free_pSeriesLP(tbl, tcenum_start,
+			                   (npages_start - (npages + 1)));
+			break;
+		}
+
 		if (rc && printk_ratelimit()) {
 			printk("tce_build_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
 			printk("\tindex   = 0x%lx\n", (u64)tbl->it_index);
@@ -121,19 +134,22 @@ static void tce_build_pSeriesLP(struct i
 		tcenum++;
 		rpn++;
 	}
+	return ret;
 }
 
 static DEFINE_PER_CPU(u64 *, tce_page) = NULL;
 
-static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
 				     long npages, unsigned long uaddr,
 				     enum dma_data_direction direction)
 {
-	u64 rc;
+	u64 rc = 0;
 	u64 proto_tce;
 	u64 *tcep;
 	u64 rpn;
 	long l, limit;
+	long tcenum_start = tcenum, npages_start = npages;
+	int ret = 0;
 
 	if (npages == 1)
 		return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
@@ -180,6 +196,13 @@ static void tce_buildmulti_pSeriesLP(str
 		tcenum += limit;
 	} while (npages > 0 && !rc);
 
+	if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
+		ret = (int)rc;
+		tce_freemulti_pSeriesLP(tbl, tcenum_start,
+		                        (npages_start - (npages + limit)));
+		return ret;
+	}
+
 	if (rc && printk_ratelimit()) {
 		printk("tce_buildmulti_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
 		printk("\tindex   = 0x%lx\n", (u64)tbl->it_index);
@@ -187,6 +210,7 @@ static void tce_buildmulti_pSeriesLP(str
 		printk("\ttce[0] val = 0x%lx\n", tcep[0]);
 		show_stack(current, (unsigned long *)__get_SP());
 	}
+	return ret;
 }
 
 static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
Index: b/arch/powerpc/sysdev/dart_iommu.c
===================================================================
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -147,7 +147,7 @@ static void dart_flush(struct iommu_tabl
 	}
 }
 
-static void dart_build(struct iommu_table *tbl, long index,
+static int dart_build(struct iommu_table *tbl, long index,
 		       long npages, unsigned long uaddr,
 		       enum dma_data_direction direction)
 {
@@ -183,6 +183,7 @@ static void dart_build(struct iommu_tabl
 	} else {
 		dart_dirty = 1;
 	}
+	return 0;
 }
 
 
Index: b/include/asm-powerpc/machdep.h
===================================================================
--- a/include/asm-powerpc/machdep.h
+++ b/include/asm-powerpc/machdep.h
@@ -76,7 +76,7 @@ struct machdep_calls {
 	 * destroyed as well */
 	void		(*hpte_clear_all)(void);
 
-	void		(*tce_build)(struct iommu_table * tbl,
+	int		(*tce_build)(struct iommu_table *tbl,
 				     long index,
 				     long npages,
 				     unsigned long uaddr,

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 17/19][v2] ibmveth: enable driver for CMO
  2008-06-12 22:23 ` [PATCH 17/19] ibmveth: enable driver for CMO Robert Jennings
  2008-06-13  5:25   ` Stephen Rothwell
@ 2008-06-23 20:20   ` Robert Jennings
  1 sibling, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-23 20:20 UTC (permalink / raw)
  To: paulus, linuxppc-dev, netdev, Brian King, Nathan Fontenot,
	David Darrington

Fixed patch formatting.

Enable ibmveth for Cooperative Memory Overcommitment (CMO).  For this driver
this means calculating a desired amount of IO memory based on the current MTU
and updating this value with the bus when MTU changes occur.  Because DMA
mappings can fail, we have added a bounce buffer for temporary cases where
the driver cannot map IO memory for the buffer pool.

The following changes are made to enable the driver for CMO:
 * DMA mapping errors will not result in error messages if entitlement has
   been exceeded and resources were not available.
 * DMA mapping errors are handled gracefully; ibmveth_replenish_buffer_pool()
   is corrected to check the return from dma_map_single and back out
   cleanly on failure.
 * The driver defines a get_io_entitlement function for use in a CMO
   environment.
 * When the MTU is changed, the driver will update the device IO
   entitlement with the bus.
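
As a rough sketch of the entitlement calculation described above (this is
illustrative only, not the exact code from the patch below; the driver_data
lookup and the idea of summing the pools active at the current MTU are
assumptions based on the rest of the driver):

static unsigned long ibmveth_get_io_entitlement(struct vio_dev *vdev)
{
	/* Assumption: the adapter is reached through the vio_dev's
	 * driver data, as elsewhere in this driver. */
	struct net_device *netdev = vdev->dev.driver_data;
	struct ibmveth_adapter *adapter = netdev->priv;
	unsigned long entitlement = 0;
	int i;

	/* Sum the DMA mapping space needed by the buffer pools active
	 * for the current MTU, rounded up to a full IOMMU page. */
	for (i = 0; i < IbmVethNumBufferPools; i++)
		if (adapter->rx_buff_pool[i].active)
			entitlement += adapter->rx_buff_pool[i].size *
			               adapter->rx_buff_pool[i].buff_size;

	return IOMMU_PAGE_ALIGN(entitlement);
}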

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Santiago Leon <santil@us.ibm.com>

---
 drivers/net/ibmveth.c |  169 ++++++++++++++++++++++++++++++++++++++++----------
 drivers/net/ibmveth.h |    5 +
 2 files changed, 140 insertions(+), 34 deletions(-)

Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -33,6 +33,7 @@
 */
 
 #include <linux/module.h>
+#include <linux/moduleparam.h>
 #include <linux/types.h>
 #include <linux/errno.h>
 #include <linux/ioport.h>
@@ -52,7 +53,9 @@
 #include <asm/hvcall.h>
 #include <asm/atomic.h>
 #include <asm/vio.h>
+#include <asm/iommu.h>
 #include <asm/uaccess.h>
+#include <asm/firmware.h>
 #include <linux/seq_file.h>
 
 #include "ibmveth.h"
@@ -94,8 +97,10 @@ static void ibmveth_proc_register_adapte
 static void ibmveth_proc_unregister_adapter(struct ibmveth_adapter *adapter);
 static irqreturn_t ibmveth_interrupt(int irq, void *dev_instance);
 static void ibmveth_rxq_harvest_buffer(struct ibmveth_adapter *adapter);
+static unsigned long ibmveth_get_io_entitlement(struct vio_dev *vdev);
 static struct kobj_type ktype_veth_pool;
 
+
 #ifdef CONFIG_PROC_FS
 #define IBMVETH_PROC_DIR "ibmveth"
 static struct proc_dir_entry *ibmveth_proc_dir;
@@ -226,16 +231,16 @@ static void ibmveth_replenish_buffer_poo
 	u32 i;
 	u32 count = pool->size - atomic_read(&pool->available);
 	u32 buffers_added = 0;
+	struct sk_buff *skb;
+	unsigned int free_index, index;
+	u64 correlator;
+	unsigned long lpar_rc;
+	dma_addr_t dma_addr;
 
 	mb();
 
 	for(i = 0; i < count; ++i) {
-		struct sk_buff *skb;
-		unsigned int free_index, index;
-		u64 correlator;
 		union ibmveth_buf_desc desc;
-		unsigned long lpar_rc;
-		dma_addr_t dma_addr;
 
 		skb = alloc_skb(pool->buff_size, GFP_ATOMIC);
 
@@ -255,6 +260,9 @@ static void ibmveth_replenish_buffer_poo
 		dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
 				pool->buff_size, DMA_FROM_DEVICE);
 
+		if (dma_mapping_error(dma_addr))
+			goto failure;
+
 		pool->free_map[free_index] = IBM_VETH_INVALID_MAP;
 		pool->dma_addr[index] = dma_addr;
 		pool->skbuff[index] = skb;
@@ -267,20 +275,9 @@ static void ibmveth_replenish_buffer_poo
 
 		lpar_rc = h_add_logical_lan_buffer(adapter->vdev->unit_address, desc.desc);
 
-		if(lpar_rc != H_SUCCESS) {
-			pool->free_map[free_index] = index;
-			pool->skbuff[index] = NULL;
-			if (pool->consumer_index == 0)
-				pool->consumer_index = pool->size - 1;
-			else
-				pool->consumer_index--;
-			dma_unmap_single(&adapter->vdev->dev,
-					pool->dma_addr[index], pool->buff_size,
-					DMA_FROM_DEVICE);
-			dev_kfree_skb_any(skb);
-			adapter->replenish_add_buff_failure++;
-			break;
-		} else {
+		if (lpar_rc != H_SUCCESS)
+			goto failure;
+		else {
 			buffers_added++;
 			adapter->replenish_add_buff_success++;
 		}
@@ -288,6 +285,24 @@ static void ibmveth_replenish_buffer_poo
 
 	mb();
 	atomic_add(buffers_added, &(pool->available));
+	return;
+
+failure:
+	pool->free_map[free_index] = index;
+	pool->skbuff[index] = NULL;
+	if (pool->consumer_index == 0)
+		pool->consumer_index = pool->size - 1;
+	else
+		pool->consumer_index--;
+	if (!dma_mapping_error(dma_addr))
+		dma_unmap_single(&adapter->vdev->dev,
+		                 pool->dma_addr[index], pool->buff_size,
+		                 DMA_FROM_DEVICE);
+	dev_kfree_skb_any(skb);
+	adapter->replenish_add_buff_failure++;
+
+	mb();
+	atomic_add(buffers_added, &(pool->available));
 }
 
 /* replenish routine */
@@ -297,7 +312,7 @@ static void ibmveth_replenish_task(struc
 
 	adapter->replenish_task_cycles++;
 
-	for(i = 0; i < IbmVethNumBufferPools; i++)
+	for (i = (IbmVethNumBufferPools - 1); i >= 0; i--)
 		if(adapter->rx_buff_pool[i].active)
 			ibmveth_replenish_buffer_pool(adapter,
 						     &adapter->rx_buff_pool[i]);
@@ -472,6 +487,18 @@ static void ibmveth_cleanup(struct ibmve
 		if (adapter->rx_buff_pool[i].active)
 			ibmveth_free_buffer_pool(adapter,
 						 &adapter->rx_buff_pool[i]);
+
+	if (adapter->bounce_buffer != NULL) {
+		if (!dma_mapping_error(adapter->bounce_buffer_dma)) {
+			dma_unmap_single(&adapter->vdev->dev,
+					adapter->bounce_buffer_dma,
+					adapter->netdev->mtu + IBMVETH_BUFF_OH,
+					DMA_BIDIRECTIONAL);
+			adapter->bounce_buffer_dma = DMA_ERROR_CODE;
+		}
+		kfree(adapter->bounce_buffer);
+		adapter->bounce_buffer = NULL;
+	}
 }
 
 static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter,
@@ -607,6 +634,24 @@ static int ibmveth_open(struct net_devic
 		return rc;
 	}
 
+	adapter->bounce_buffer =
+	    kmalloc(netdev->mtu + IBMVETH_BUFF_OH, GFP_KERNEL);
+	if (!adapter->bounce_buffer) {
+		ibmveth_error_printk("unable to allocate bounce buffer\n");
+		ibmveth_cleanup(adapter);
+		napi_disable(&adapter->napi);
+		return -ENOMEM;
+	}
+	adapter->bounce_buffer_dma =
+	    dma_map_single(&adapter->vdev->dev, adapter->bounce_buffer,
+			   netdev->mtu + IBMVETH_BUFF_OH, DMA_BIDIRECTIONAL);
+	if (dma_mapping_error(adapter->bounce_buffer_dma)) {
+		ibmveth_error_printk("unable to map bounce buffer\n");
+		ibmveth_cleanup(adapter);
+		napi_disable(&adapter->napi);
+		return -ENOMEM;
+	}
+
 	ibmveth_debug_printk("initial replenish cycle\n");
 	ibmveth_interrupt(netdev->irq, netdev);
 
@@ -853,10 +898,12 @@ static int ibmveth_start_xmit(struct sk_
 	unsigned int tx_packets = 0;
 	unsigned int tx_send_failed = 0;
 	unsigned int tx_map_failed = 0;
+	int used_bounce = 0;
+	unsigned long data_dma_addr;
 
 	desc.fields.flags_len = IBMVETH_BUF_VALID | skb->len;
-	desc.fields.address = dma_map_single(&adapter->vdev->dev, skb->data,
-					     skb->len, DMA_TO_DEVICE);
+	data_dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
+				       skb->len, DMA_TO_DEVICE);
 
 	if (skb->ip_summed == CHECKSUM_PARTIAL &&
 	    ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
@@ -875,12 +922,16 @@ static int ibmveth_start_xmit(struct sk_
 		buf[1] = 0;
 	}
 
-	if (dma_mapping_error(desc.fields.address)) {
-		ibmveth_error_printk("tx: unable to map xmit buffer\n");
+	if (dma_mapping_error(data_dma_addr)) {
+		if (!firmware_has_feature(FW_FEATURE_CMO))
+			ibmveth_error_printk("tx: unable to map xmit buffer\n");
+		skb_copy_from_linear_data(skb, adapter->bounce_buffer,
+					  skb->len);
+		desc.fields.address = adapter->bounce_buffer_dma;
 		tx_map_failed++;
-		tx_dropped++;
-		goto out;
-	}
+		used_bounce = 1;
+	} else
+		desc.fields.address = data_dma_addr;
 
 	/* send the frame. Arbitrarily set retrycount to 1024 */
 	correlator = 0;
@@ -904,8 +955,9 @@ static int ibmveth_start_xmit(struct sk_
 		netdev->trans_start = jiffies;
 	}
 
-	dma_unmap_single(&adapter->vdev->dev, desc.fields.address,
-			 skb->len, DMA_TO_DEVICE);
+	if (!used_bounce)
+		dma_unmap_single(&adapter->vdev->dev, data_dma_addr,
+				 skb->len, DMA_TO_DEVICE);
 
 out:	spin_lock_irqsave(&adapter->stats_lock, flags);
 	netdev->stats.tx_dropped += tx_dropped;
@@ -1053,8 +1105,9 @@ static void ibmveth_set_multicast_list(s
 static int ibmveth_change_mtu(struct net_device *dev, int new_mtu)
 {
 	struct ibmveth_adapter *adapter = dev->priv;
+	struct vio_dev *viodev = adapter->vdev;
 	int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
-	int i, rc;
+	int i;
 
 	if (new_mtu < IBMVETH_MAX_MTU)
 		return -EINVAL;
@@ -1085,10 +1138,15 @@ static int ibmveth_change_mtu(struct net
 				ibmveth_close(adapter->netdev);
 				adapter->pool_config = 0;
 				dev->mtu = new_mtu;
-				if ((rc = ibmveth_open(adapter->netdev)))
-					return rc;
-			} else
-				dev->mtu = new_mtu;
+				vio_cmo_set_dev_desired(viodev,
+						ibmveth_get_io_entitlement
+						(viodev));
+				return ibmveth_open(adapter->netdev);
+			}
+			dev->mtu = new_mtu;
+			vio_cmo_set_dev_desired(viodev,
+						ibmveth_get_io_entitlement
+						(viodev));
 			return 0;
 		}
 	}
@@ -1103,6 +1161,46 @@ static void ibmveth_poll_controller(stru
 }
 #endif
 
+/**
+ * ibmveth_get_io_entitlement - Calculate IO entitlement needed by the driver
+ *
+ * @vdev: struct vio_dev for the device whose entitlement is to be returned
+ *
+ * Return value:
+ *	Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmveth_get_io_entitlement(struct vio_dev *vdev)
+{
+	struct net_device *netdev = dev_get_drvdata(&vdev->dev);
+	struct ibmveth_adapter *adapter;
+	unsigned long ret;
+	int i;
+	int rxqentries = 1;
+
+	/* netdev inits at probe time along with the structures we need below*/
+	if (netdev == NULL)
+		return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+
+	adapter = netdev_priv(netdev);
+
+	ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
+	ret += IOMMU_PAGE_ALIGN(netdev->mtu);
+
+	for (i = 0; i < IbmVethNumBufferPools; i++) {
+		/* add the size of the active receive buffers */
+		if (adapter->rx_buff_pool[i].active)
+			ret +=
+			    adapter->rx_buff_pool[i].size *
+			    IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+			            buff_size);
+		rxqentries += adapter->rx_buff_pool[i].size;
+	}
+	/* add the size of the receive queue entries */
+	ret += IOMMU_PAGE_ALIGN(rxqentries * sizeof(struct ibmveth_rx_q_entry));
+
+	return ret;
+}
+
 static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
 {
 	int rc, i;
@@ -1247,6 +1345,8 @@ static int __devexit ibmveth_remove(stru
 	ibmveth_proc_unregister_adapter(adapter);
 
 	free_netdev(netdev);
+	dev_set_drvdata(&dev->dev, NULL);
+
 	return 0;
 }
 
@@ -1491,6 +1591,7 @@ static struct vio_driver ibmveth_driver 
 	.id_table	= ibmveth_device_table,
 	.probe		= ibmveth_probe,
 	.remove		= ibmveth_remove,
+	.get_io_entitlement = ibmveth_get_io_entitlement,
 	.driver		= {
 		.name	= ibmveth_driver_name,
 		.owner	= THIS_MODULE,
Index: b/drivers/net/ibmveth.h
===================================================================
--- a/drivers/net/ibmveth.h
+++ b/drivers/net/ibmveth.h
@@ -93,9 +93,12 @@ static inline long h_illan_attributes(un
   plpar_hcall_norets(H_CHANGE_LOGICAL_LAN_MAC, ua, mac)
 
 #define IbmVethNumBufferPools 5
+#define IBMVETH_IO_ENTITLEMENT_DEFAULT 4243456 /* MTU of 1500 needs 4.2Mb */
 #define IBMVETH_BUFF_OH 22 /* Overhead: 14 ethernet header + 8 opaque handle */
 #define IBMVETH_MAX_MTU 68
 #define IBMVETH_MAX_POOL_COUNT 4096
+#define IBMVETH_BUFF_LIST_SIZE 4096
+#define IBMVETH_FILT_LIST_SIZE 4096
 #define IBMVETH_MAX_BUF_SIZE (1024 * 128)
 
 static int pool_size[] = { 512, 1024 * 2, 1024 * 16, 1024 * 32, 1024 * 64 };
@@ -143,6 +146,8 @@ struct ibmveth_adapter {
     struct ibmveth_rx_q rx_queue;
     int pool_config;
     int rx_csum;
+    void *bounce_buffer;
+    dma_addr_t bounce_buffer_dma;
 
     /* adapter specific stats */
     u64 replenish_task_cycles;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 16/19][v2] ibmveth: Automatically enable larger rx buffer pools for larger mtu
  2008-06-12 22:22 ` [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
  2008-06-13  5:18   ` Stephen Rothwell
@ 2008-06-23 20:21   ` Robert Jennings
  1 sibling, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-23 20:21 UTC (permalink / raw)
  To: paulus, linuxppc-dev, netdev, Brian King, Nathan Fontenot,
	David Darrington

From: Santiago Leon <santil@us.ibm.com>

Fixed patch formatting.

Activates larger rx buffer pools when the MTU is changed to a larger
value, and de-activates the large rx buffer pools when the MTU changes
to a smaller value.
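
In outline, the selection rule the diff below implements looks like this
(a sketch only; ibmveth_pools_for_mtu is a made-up helper name, using
pool_size[] and IBMVETH_BUFF_OH from ibmveth.h):

	/* Sketch: pools 0..i become active, where pool i is the smallest
	 * pool whose buffer size can hold the new MTU plus overhead. */
	static int ibmveth_pools_for_mtu(int new_mtu)
	{
		int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
		int i;

		for (i = 0; i < IbmVethNumBufferPools; i++)
			if (new_mtu_oh < pool_size[i])
				return i;	/* largest pool needed */
		return -1;			/* no pool fits: -EINVAL case */
	}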

Signed-off-by: Santiago Leon <santil@us.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---

 drivers/net/ibmveth.c |   20 +++++++++++++-------
 1 file changed, 13 insertions(+), 7 deletions(-)

Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -1054,7 +1054,6 @@ static int ibmveth_change_mtu(struct net
 {
 	struct ibmveth_adapter *adapter = dev->priv;
 	int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
-	int reinit = 0;
 	int i, rc;
 
 	if (new_mtu < IBMVETH_MAX_MTU)
@@ -1067,15 +1066,21 @@ static int ibmveth_change_mtu(struct net
 	if (i == IbmVethNumBufferPools)
 		return -EINVAL;
 
+	/* Deactivate all the buffer pools so that the next loop can activate
+	   only the buffer pools necessary to hold the new MTU */
+	for (i = 0; i < IbmVethNumBufferPools; i++)
+		if (adapter->rx_buff_pool[i].active) {
+			ibmveth_free_buffer_pool(adapter,
+						 &adapter->rx_buff_pool[i]);
+			adapter->rx_buff_pool[i].active = 0;
+		}
+
 	/* Look for an active buffer pool that can hold the new MTU */
 	for(i = 0; i<IbmVethNumBufferPools; i++) {
-		if (!adapter->rx_buff_pool[i].active) {
-			adapter->rx_buff_pool[i].active = 1;
-			reinit = 1;
-		}
+		adapter->rx_buff_pool[i].active = 1;
 
 		if (new_mtu_oh < adapter->rx_buff_pool[i].buff_size) {
-			if (reinit && netif_running(adapter->netdev)) {
+			if (netif_running(adapter->netdev)) {
 				adapter->pool_config = 1;
 				ibmveth_close(adapter->netdev);
 				adapter->pool_config = 0;
@@ -1402,14 +1407,15 @@ const char * buf, size_t count)
 				return -EPERM;
 			}
 
-			pool->active = 0;
 			if (netif_running(netdev)) {
 				adapter->pool_config = 1;
 				ibmveth_close(netdev);
+				pool->active = 0;
 				adapter->pool_config = 0;
 				if ((rc = ibmveth_open(netdev)))
 					return rc;
 			}
+			pool->active = 0;
 		}
 	} else if (attr == &veth_num_attr) {
 		if (value <= 0 || value > IBMVETH_MAX_POOL_COUNT)

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 12/19] powerpc: vio bus support for CMO
  2008-06-13  5:12   ` Stephen Rothwell
@ 2008-06-23 20:23     ` Robert Jennings
  0 siblings, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-23 20:23 UTC (permalink / raw)
  To: Stephen Rothwell; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

* Stephen Rothwell (sfr@canb.auug.org.au) wrote:
> Hi Robert,
> 
> Firstly, can all this new stuff be ifdef'ed out if not needed as the
> vio infrastructure is also used on legacy iSeries and this adds quite a
> bit of stuff that won't ever be used there.

I've changed the patch to ifdef out CMO for legacy iSeries.  This should
keep things cleaner.
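
For reference, the stub pattern used is the usual one (a sketch; the full
set of stubs is in the v2 patch below):

	#ifdef CONFIG_PPC_PSERIES
	/* full CMO implementation */
	#else
	/* Dummy functions for iSeries platform */
	static int vio_cmo_bus_probe(struct vio_dev *viodev) { return 0; }
	static void vio_cmo_bus_remove(struct vio_dev *viodev) {}
	#endif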

> On Thu, 12 Jun 2008 17:19:59 -0500 Robert Jennings <rcj@linux.vnet.ibm.com> wrote:
> >
> > +static int vio_cmo_num_OF_devs(void)
> > +{
> > +	struct device_node *node_vroot;
> > +	int count = 0;
> > +
> > +	/*
> > +	 * Count the number of vdevice entries with an
> > +	 * ibm,my-dma-window OF property
> > +	 */
> > +	node_vroot = of_find_node_by_name(NULL, "vdevice");
> > +	if (node_vroot) {
> > +		struct device_node *of_node;
> > +		struct property *prop;
> > +
> > +		for (of_node = node_vroot->child; of_node != NULL;
> > +		                of_node = of_node->sibling) {
> 
> Use:
> 		for_each_child_of_node(node_vroot, of_node) {

Fixed.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 12/19][v2] powerpc: vio bus support for CMO
  2008-06-12 22:19 ` [PATCH 12/19] powerpc: vio bus support " Robert Jennings
  2008-06-13  5:12   ` Stephen Rothwell
@ 2008-06-23 20:25   ` Robert Jennings
  1 sibling, 0 replies; 41+ messages in thread
From: Robert Jennings @ 2008-06-23 20:25 UTC (permalink / raw)
  To: paulus, linuxppc-dev, Brian King, Nathan Fontenot,
	David Darrington

From: Robert Jennings <rcj@linux.vnet.ibm.com>

Enable bus level entitled memory accounting for Cooperative Memory
Overcommitment (CMO) environments.  The normal code path should not
be affected.

The following changes are made to the VIO bus layer for CMO:
 * add IO memory accounting per device structure.
 * add IO memory entitlement query function to driver structure.
 * during vio bus probe, if CMO is enabled, check that driver has
   memory entitlement query function defined.  Fail if function not defined.
 * fail to register driver if io entitlement function not defined.
 * create set of dma_ops at vio level for CMO that will track allocations
   and return DMA failures once entitlement is reached.  Entitlement will
   be limited by overall system entitlement.  Devices will have a reserved
   quantity of memory that is guaranteed, the rest can be used as available.
 * expose entitlement, current allocation, desired allocation, and the
   allocation error counter for devices to the user through sysfs
 * provide mechanism for changing a device's desired entitlement at run time
   for devices as an exported function and sysfs tunable
 * track any DMA failures for entitled IO memory for each vio device.
 * check entitlement against available system entitlement on device add
 * track entitlement metrics (high water mark, current usage)
 * provide function to reset high water mark
 * provide minimum and desired entitlement numbers at a bus level
 * provide drivers with a minimum guaranteed entitlement
 * balance available entitlement between devices to satisfy their needs
 * handle system entitlement changes and device hotplug
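
For driver writers the contract is small; a minimal sketch of a CMO-aware
vio driver follows (hypothetical "foo" driver; the FOO_* sizes are
invented for illustration):

	/* Sketch: report the worst-case IO memory the device will map */
	static unsigned long foo_get_io_entitlement(struct vio_dev *vdev)
	{
		return IOMMU_PAGE_ALIGN(FOO_RX_RING_BYTES) +
		       IOMMU_PAGE_ALIGN(FOO_TX_RING_BYTES);
	}

	static struct vio_driver foo_driver = {
		.id_table           = foo_device_table,
		.probe              = foo_probe,
		.remove             = foo_remove,
		/* without this, vio_register_driver() fails under CMO */
		.get_io_entitlement = foo_get_io_entitlement,
	};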

Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
 arch/powerpc/kernel/vio.c | 1030 ++++++++++++++++++++++++++++++++++++++++++++++++++
 include/asm-powerpc/vio.h |   30 +
 2 files changed, 1052 insertions(+), 8 deletions(-)

Index: b/arch/powerpc/kernel/vio.c
===================================================================
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1,11 +1,12 @@
 /*
  * IBM PowerPC Virtual I/O Infrastructure Support.
  *
- *    Copyright (c) 2003-2005 IBM Corp.
+ *    Copyright (c) 2003,2008 IBM Corp.
  *     Dave Engebretsen engebret@us.ibm.com
  *     Santiago Leon santil@us.ibm.com
  *     Hollis Blanchard <hollisb@us.ibm.com>
  *     Stephen Rothwell
+ *     Robert Jennings <rcjenn@us.ibm.com>
  *
  *      This program is free software; you can redistribute it and/or
  *      modify it under the terms of the GNU General Public License
@@ -46,6 +47,986 @@ static struct vio_dev vio_bus_device  = 
 	.dev.bus = &vio_bus_type,
 };
 
+#ifdef CONFIG_PPC_PSERIES
+/**
+ * vio_cmo_pool - A pool of IO memory for CMO use
+ *
+ * @size: The size of the pool in bytes
+ * @free: The amount of free memory in the pool
+ */
+struct vio_cmo_pool {
+	size_t size;
+	size_t free;
+};
+
+/* How many ms to delay queued balance work */
+#define VIO_CMO_BALANCE_DELAY 100
+
+/* Portion out IO memory to CMO devices by this chunk size */
+#define VIO_CMO_BALANCE_CHUNK 131072
+
+/**
+ * vio_cmo_dev_entry - A device that is CMO-enabled and requires entitlement
+ *
+ * @vio_dev: struct vio_dev pointer
+ * @list: pointer to other devices on bus that are being tracked
+ */
+struct vio_cmo_dev_entry {
+	struct vio_dev *viodev;
+	struct list_head list;
+};
+
+/**
+ * vio_cmo - VIO bus accounting structure for CMO entitlement
+ *
+ * @lock: spinlock for entire structure
+ * @balance_q: work queue for balancing system entitlement
+ * @device_list: list of CMO-enabled devices requiring entitlement
+ * @entitled: total system entitlement in bytes
+ * @reserve: pool of memory from which devices reserve entitlement, incl. spare
+ * @excess: pool of excess entitlement not needed for device reserves or spare
+ * @spare: IO memory for device hotplug functionality
+ * @min: minimum necessary for system operation
+ * @desired: desired memory for system operation
+ * @curr: bytes currently allocated
+ * @high: high water mark for IO data usage
+ */
+struct vio_cmo {
+	spinlock_t lock;
+	struct delayed_work balance_q;
+	struct list_head device_list;
+	size_t entitled;
+	struct vio_cmo_pool reserve;
+	struct vio_cmo_pool excess;
+	size_t spare;
+	size_t min;
+	size_t desired;
+	size_t curr;
+	size_t high;
+} vio_cmo;
+
+/**
+ * vio_cmo_OF_devices - Count the number of OF devices that have DMA windows
+ */
+static int vio_cmo_num_OF_devs(void)
+{
+	struct device_node *node_vroot;
+	int count = 0;
+
+	/*
+	 * Count the number of vdevice entries with an
+	 * ibm,my-dma-window OF property
+	 */
+	node_vroot = of_find_node_by_name(NULL, "vdevice");
+	if (node_vroot) {
+		struct device_node *of_node;
+		struct property *prop;
+
+		for_each_child_of_node(node_vroot, of_node) {
+			prop = of_find_property(of_node, "ibm,my-dma-window",
+			                       NULL);
+			if (prop)
+				count++;
+		}
+	}
+	of_node_put(node_vroot);
+	return count;
+}
+
+/**
+ * vio_cmo_alloc - allocate IO memory for CMO-enable devices
+ *
+ * @viodev: VIO device requesting IO memory
+ * @size: size of allocation requested
+ *
+ * Allocations come from memory reserved for the devices and any excess
+ * IO memory available to all devices.  The spare pool used to service
+ * hotplug must be equal to %VIO_CMO_MIN_ENT for the excess pool to be
+ * made available.
+ *
+ * Return codes:
+ *  0 for successful allocation and -ENOMEM for a failure
+ */
+static inline int vio_cmo_alloc(struct vio_dev *viodev, size_t size)
+{
+	unsigned long flags;
+	size_t reserve_free = 0;
+	size_t excess_free = 0;
+	int ret = -ENOMEM;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+
+	/* Determine the amount of free entitlement available in reserve */
+	if (viodev->cmo.entitled > viodev->cmo.allocated)
+		reserve_free = viodev->cmo.entitled - viodev->cmo.allocated;
+
+	/* If spare is not fulfilled, the excess pool can not be used. */
+	if (vio_cmo.spare >= VIO_CMO_MIN_ENT)
+		excess_free = vio_cmo.excess.free;
+
+	/* The request can be satisfied */
+	if ((reserve_free + excess_free) >= size) {
+		vio_cmo.curr += size;
+		if (vio_cmo.curr > vio_cmo.high)
+			vio_cmo.high = vio_cmo.curr;
+		viodev->cmo.allocated += size;
+		size -= min(reserve_free, size);
+		vio_cmo.excess.free -= size;
+		ret = 0;
+	}
+
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	return ret;
+}
+
+/**
+ * vio_cmo_dealloc - deallocate IO memory from CMO-enable devices
+ * @viodev: VIO device freeing IO memory
+ * @size: size of deallocation
+ *
+ * IO memory is freed by the device back to the correct memory pools.
+ * The spare pool is replenished first from either memory pool, then
+ * the reserve pool is used to reduce device entitlement, the excess
+ * pool is used to increase the reserve pool toward the desired entitlement
+ * target, and then the remaining memory is returned to the pools.
+ *
+ */
+static inline void vio_cmo_dealloc(struct vio_dev *viodev, size_t size)
+{
+	unsigned long flags;
+	size_t spare_needed = 0;
+	size_t excess_freed = 0;
+	size_t reserve_freed = size;
+	size_t tmp;
+	int balance = 0;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	vio_cmo.curr -= size;
+
+	/* Amount of memory freed from the excess pool */
+	if (viodev->cmo.allocated > viodev->cmo.entitled) {
+		excess_freed = min(reserve_freed, (viodev->cmo.allocated -
+		                                   viodev->cmo.entitled));
+		reserve_freed -= excess_freed;
+	}
+
+	/* Remove allocation from device */
+	viodev->cmo.allocated -= (reserve_freed + excess_freed);
+
+	/* Spare is a subset of the reserve pool, replenish it first. */
+	spare_needed = VIO_CMO_MIN_ENT - vio_cmo.spare;
+
+	/*
+	 * Replenish the spare in the reserve pool from the excess pool.
+	 * This moves entitlement into the reserve pool.
+	 */
+	if (spare_needed && excess_freed) {
+		tmp = min(excess_freed, spare_needed);
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.reserve.size += tmp;
+		vio_cmo.spare += tmp;
+		excess_freed -= tmp;
+		spare_needed -= tmp;
+		balance = 1;
+	}
+
+	/*
+	 * Replenish the spare in the reserve pool from the reserve pool.
+	 * This removes entitlement from the device down to VIO_CMO_MIN_ENT,
+	 * if needed, and gives it to the spare pool. The amount of used
+	 * memory in this pool does not change.
+	 */
+	if (spare_needed && reserve_freed) {
+		tmp = min(spare_needed, min(reserve_freed,
+		                            (viodev->cmo.entitled -
+		                             VIO_CMO_MIN_ENT)));
+
+		vio_cmo.spare += tmp;
+		viodev->cmo.entitled -= tmp;
+		reserve_freed -= tmp;
+		spare_needed -= tmp;
+		balance = 1;
+	}
+
+	/*
+	 * Increase the reserve pool until the desired allocation is met.
+	 * Move an allocation freed from the excess pool into the reserve
+	 * pool and schedule a balance operation.
+	 */
+	if (excess_freed && (vio_cmo.desired > vio_cmo.reserve.size)) {
+		tmp = min(excess_freed, (vio_cmo.desired - vio_cmo.reserve.size));
+
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.reserve.size += tmp;
+		excess_freed -= tmp;
+		balance = 1;
+	}
+
+	/* Return memory from the excess pool to that pool */
+	if (excess_freed)
+		vio_cmo.excess.free += excess_freed;
+
+	if (balance)
+		schedule_delayed_work(&vio_cmo.balance_q, VIO_CMO_BALANCE_DELAY);
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_entitlement_update - Manage system entitlement changes
+ *
+ * @new_entitlement: new system entitlement to attempt to accommodate
+ *
+ * Increases in entitlement will be used to fulfill the spare entitlement
+ * and the rest is given to the excess pool.  Decreases, if they are
+ * possible, come from the excess pool and from unused device entitlement
+ *
+ * Returns: 0 on success, -ENOMEM when change can not be made
+ */
+int vio_cmo_entitlement_update(size_t new_entitlement)
+{
+	struct vio_dev *viodev;
+	struct vio_cmo_dev_entry *dev_ent;
+	unsigned long flags;
+	size_t avail, delta, tmp;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+
+	/* Entitlement increases */
+	if (new_entitlement > vio_cmo.entitled) {
+		delta = new_entitlement - vio_cmo.entitled;
+
+		/* Fulfill spare allocation */
+		if (vio_cmo.spare < VIO_CMO_MIN_ENT) {
+			tmp = min(delta, (VIO_CMO_MIN_ENT - vio_cmo.spare));
+			vio_cmo.spare += tmp;
+			vio_cmo.reserve.size += tmp;
+			delta -= tmp;
+		}
+
+		/* Remaining new allocation goes to the excess pool */
+		vio_cmo.entitled += delta;
+		vio_cmo.excess.size += delta;
+		vio_cmo.excess.free += delta;
+
+		goto out;
+	}
+
+	/* Entitlement decreases */
+	delta = vio_cmo.entitled - new_entitlement;
+	avail = vio_cmo.excess.free;
+
+	/*
+	 * Need to check how much unused entitlement each device can
+	 * sacrifice to fulfill entitlement change.
+	 */
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+		if (avail >= delta)
+			break;
+
+		viodev = dev_ent->viodev;
+		if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+		    (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+				avail += viodev->cmo.entitled -
+				         max_t(size_t, viodev->cmo.allocated,
+				               VIO_CMO_MIN_ENT);
+	}
+
+	if (delta <= avail) {
+		vio_cmo.entitled -= delta;
+
+		/* Take entitlement from the excess pool first */
+		tmp = min(vio_cmo.excess.free, delta);
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.excess.free -= tmp;
+		delta -= tmp;
+
+		/*
+		 * Remove all but VIO_CMO_MIN_ENT bytes from devices
+		 * until entitlement change is served
+		 */
+		list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+			if (!delta)
+				break;
+
+			viodev = dev_ent->viodev;
+			tmp = 0;
+			if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+			    (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+				tmp = viodev->cmo.entitled -
+				      max_t(size_t, viodev->cmo.allocated,
+				            VIO_CMO_MIN_ENT);
+			viodev->cmo.entitled -= min(tmp, delta);
+			delta -= min(tmp, delta);
+		}
+		}
+	} else {
+		spin_unlock_irqrestore(&vio_cmo.lock, flags);
+		return -ENOMEM;
+	}
+
+out:
+	schedule_delayed_work(&vio_cmo.balance_q, 0);
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	return 0;
+}
+
+/**
+ * vio_cmo_balance - Balance entitlement among devices
+ *
+ * @work: work queue structure for this operation
+ *
+ * Any system entitlement above the minimum needed for devices, or
+ * already allocated to devices, can be distributed to the devices.
+ * The list of devices is iterated through to recalculate the desired
+ * entitlement level and to determine how much entitlement above the
+ * minimum entitlement is allocated to devices.
+ *
+ * Small chunks of the available entitlement are given to devices until
+ * their requirements are fulfilled or there is no entitlement left to give.
+ * Upon completion sizes of the reserve and excess pools are calculated.
+ *
+ * The system minimum entitlement level is also recalculated here.
+ * Entitlement will be reserved for devices even after vio_bus_remove to
+ * accommodate reloading the driver.  The OF tree is walked to count the
+ * number of devices present and this will remove entitlement for devices
+ * that have actually left the system after having vio_bus_remove called.
+ */
+static void vio_cmo_balance(struct work_struct *work)
+{
+	struct vio_cmo *cmo;
+	struct vio_dev *viodev;
+	struct vio_cmo_dev_entry *dev_ent;
+	unsigned long flags;
+	size_t avail = 0, level, chunk, need;
+	int devcount = 0, fulfilled;
+
+	cmo = container_of(work, struct vio_cmo, balance_q.work);
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+
+	/* Calculate minimum entitlement and fulfill spare */
+	cmo->min = vio_cmo_num_OF_devs() * VIO_CMO_MIN_ENT;
+	BUG_ON(cmo->min > cmo->entitled);
+	cmo->spare = min_t(size_t, VIO_CMO_MIN_ENT, (cmo->entitled - cmo->min));
+	cmo->min += cmo->spare;
+	cmo->desired = cmo->min;
+
+	/*
+	 * Determine how much entitlement is available and reset device
+	 * entitlements
+	 */
+	avail = cmo->entitled - cmo->spare;
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+		viodev = dev_ent->viodev;
+		devcount++;
+		viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+		cmo->desired += (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+		avail -= max_t(size_t, viodev->cmo.allocated, VIO_CMO_MIN_ENT);
+	}
+
+	/*
+	 * Having provided each device with the minimum entitlement, loop
+	 * over the devices portioning out the remaining entitlement
+	 * until there is nothing left.
+	 */
+	level = VIO_CMO_MIN_ENT;
+	while (avail) {
+		fulfilled = 0;
+		list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+			viodev = dev_ent->viodev;
+
+			if (viodev->cmo.desired <= level) {
+				fulfilled++;
+				continue;
+			}
+
+			/*
+			 * Give the device up to VIO_CMO_BALANCE_CHUNK
+			 * bytes of entitlement, but do not exceed the
+			 * desired level of entitlement for the device.
+			 */
+			chunk = min_t(size_t, avail, VIO_CMO_BALANCE_CHUNK);
+			chunk = min(chunk, (viodev->cmo.desired -
+			                    viodev->cmo.entitled));
+			viodev->cmo.entitled += chunk;
+
+			/*
+			 * If the memory for this entitlement increase was
+			 * already allocated to the device it does not come
+			 * from the available pool being portioned out.
+			 */
+			need = max(viodev->cmo.allocated, viodev->cmo.entitled)-
+			       max(viodev->cmo.allocated, level);
+			avail -= need;
+
+		}
+		if (fulfilled == devcount)
+			break;
+		level += VIO_CMO_BALANCE_CHUNK;
+	}
+	}
+
+	/* Calculate new reserve and excess pool sizes */
+	cmo->reserve.size = cmo->min;
+	cmo->excess.free = 0;
+	cmo->excess.size = 0;
+	need = 0;
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+		viodev = dev_ent->viodev;
+		/* Calculated reserve size above the minimum entitlement */
+		if (viodev->cmo.entitled)
+			cmo->reserve.size += (viodev->cmo.entitled -
+			                      VIO_CMO_MIN_ENT);
+		/* Calculated used excess entitlement */
+		if (viodev->cmo.allocated > viodev->cmo.entitled)
+			need += viodev->cmo.allocated - viodev->cmo.entitled;
+	}
+	cmo->excess.size = cmo->entitled - cmo->reserve.size;
+	cmo->excess.free = cmo->excess.size - need;
+
+	cancel_delayed_work(container_of(work, struct delayed_work, work));
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+#define RND_TO_PAGE(x) (((x / PAGE_SIZE) + (x % PAGE_SIZE ? 1 : 0)) * PAGE_SIZE)
+
+static void *vio_dma_iommu_alloc_coherent(struct device *dev, size_t size,
+                                          dma_addr_t *dma_handle, gfp_t flag)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	void *ret;
+
+	if (vio_cmo_alloc(viodev, RND_TO_PAGE(size))) {
+		atomic_inc(&viodev->cmo.allocs_failed);
+		return NULL;
+	}
+
+	ret = dma_iommu_ops.alloc_coherent(dev, size, dma_handle, flag);
+	if (unlikely(ret == NULL)) {
+		vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+		atomic_inc(&viodev->cmo.allocs_failed);
+	}
+
+	return ret;
+}
+
+static void vio_dma_iommu_free_coherent(struct device *dev, size_t size,
+                                        void *vaddr, dma_addr_t dma_handle)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+
+	dma_iommu_ops.free_coherent(dev, size, vaddr, dma_handle);
+
+	vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+}
+
+static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
+                                           size_t size,
+                                           enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	dma_addr_t ret = DMA_ERROR_CODE;
+
+	if (vio_cmo_alloc(viodev, RND_TO_PAGE(size))) {
+		atomic_inc(&viodev->cmo.allocs_failed);
+		return ret;
+	}
+
+	ret = dma_iommu_ops.map_single(dev, vaddr, size, direction);
+	if (unlikely(dma_mapping_error(ret))) {
+		vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+		atomic_inc(&viodev->cmo.allocs_failed);
+	}
+
+	return ret;
+}
+
+static void vio_dma_iommu_unmap_single(struct device *dev,
+		dma_addr_t dma_handle, size_t size,
+		enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+
+	dma_iommu_ops.unmap_single(dev, dma_handle, size, direction);
+
+	vio_cmo_dealloc(viodev, RND_TO_PAGE(size));
+}
+
+static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
+                                int nelems, enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	struct scatterlist *sgl;
+	int ret, count = 0;
+	size_t alloc_size = 0;
+
+	for (sgl = sglist; count < nelems; count++, sgl++)
+		alloc_size += RND_TO_PAGE(sgl->length);
+
+	if (vio_cmo_alloc(viodev, alloc_size)) {
+		atomic_inc(&viodev->cmo.allocs_failed);
+		return 0;
+	}
+
+	ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction);
+
+	if (unlikely(!ret)) {
+		vio_cmo_dealloc(viodev, alloc_size);
+		atomic_inc(&viodev->cmo.allocs_failed);
+	}
+
+	return ret;
+}
+
+static void vio_dma_iommu_unmap_sg(struct device *dev,
+		struct scatterlist *sglist, int nelems,
+		enum dma_data_direction direction)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	struct scatterlist *sgl;
+	size_t alloc_size = 0;
+	int count = 0;
+
+	for (sgl = sglist; count < nelems; count++, sgl++)
+		alloc_size += RND_TO_PAGE(sgl->length);
+
+	dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction);
+
+	vio_cmo_dealloc(viodev, alloc_size);
+}
+
+struct dma_mapping_ops vio_dma_mapping_ops = {
+	.alloc_coherent = vio_dma_iommu_alloc_coherent,
+	.free_coherent  = vio_dma_iommu_free_coherent,
+	.map_single     = vio_dma_iommu_map_single,
+	.unmap_single   = vio_dma_iommu_unmap_single,
+	.map_sg         = vio_dma_iommu_map_sg,
+	.unmap_sg       = vio_dma_iommu_unmap_sg,
+};
+
+/**
+ * vio_cmo_set_dev_desired - Set desired entitlement for a device
+ *
+ * @viodev: struct vio_dev for device to alter
+ * @new_desired: new desired entitlement level in bytes
+ *
+ * For use by devices to request a change to their entitlement at runtime or
+ * through sysfs.  The desired entitlement level is changed and a balancing
+ * of system resources is scheduled to run in the future.
+ */
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired)
+{
+	unsigned long flags;
+	struct vio_cmo_dev_entry *dev_ent;
+
+	if (!firmware_has_feature(FW_FEATURE_CMO))
+		return;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	if (desired < VIO_CMO_MIN_ENT)
+		desired = VIO_CMO_MIN_ENT;
+
+	/*
+	 * Changes will not be made for devices not in the device list.
+	 * If it is not in the device list, then no driver is loaded
+	 * for the device and it can not receive entitlement.
+	 */
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+		if (viodev == dev_ent->viodev)
+			break;
+	if (!dev_ent)
+		return;
+
+	/* Increase/decrease in desired device entitlement */
+	if (desired >= viodev->cmo.desired) {
+		/* Just bump the bus and device values prior to a balance */
+		vio_cmo.desired += desired - viodev->cmo.desired;
+		viodev->cmo.desired = desired;
+	} else {
+		/* Decrease bus and device values for desired entitlement */
+		vio_cmo.desired -= viodev->cmo.desired - desired;
+		viodev->cmo.desired = desired;
+		/*
+		 * If less entitlement is desired than current entitlement, move
+		 * any reserve memory in the change region to the excess pool.
+		 */
+		if (viodev->cmo.entitled > desired) {
+			vio_cmo.reserve.size -= viodev->cmo.entitled - desired;
+			vio_cmo.excess.size += viodev->cmo.entitled - desired;
+			/*
+			 * If entitlement moving from the reserve pool to the
+			 * excess pool is currently unused, add to the excess
+			 * free counter.
+			 */
+			if (viodev->cmo.allocated < viodev->cmo.entitled)
+				vio_cmo.excess.free += viodev->cmo.entitled -
+				                       max(viodev->cmo.allocated, desired);
+			viodev->cmo.entitled = desired;
+		}
+		}
+	}
+	schedule_delayed_work(&vio_cmo.balance_q, 0);
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_bus_probe - Handle CMO specific bus probe activities
+ *
+ * @viodev - Pointer to struct vio_dev for device
+ *
+ * Determine the devices IO memory entitlement needs, attempting
+ * to satisfy the system minimum entitlement at first and scheduling
+ * a balance operation to take care of the rest at a later time.
+ *
+ * Returns: 0 on success, -EINVAL when device doesn't support CMO, and
+ *          -ENOMEM when entitlement is not available for device or
+ *          device entry.
+ *
+ */
+static int vio_cmo_bus_probe(struct vio_dev *viodev)
+{
+	struct vio_cmo_dev_entry *dev_ent;
+	struct device *dev = &viodev->dev;
+	struct vio_driver *viodrv = to_vio_driver(dev->driver);
+	unsigned long flags;
+	size_t size;
+
+	/* Check that the driver is CMO enabled and get entitlement */
+	if (!viodrv->get_io_entitlement) {
+		dev_err(dev, "%s: device driver does not support CMO\n",
+		        __func__);
+		return -EINVAL;
+	}
+
+	/*
+	 * Check to see that device has a DMA window and configure
+	 * entitlement for the device.
+	 */
+	if (of_get_property(viodev->dev.archdata.of_node,
+	                    "ibm,my-dma-window", NULL)) {
+		viodev->cmo.desired = viodrv->get_io_entitlement(viodev);
+		if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
+			viodev->cmo.desired = VIO_CMO_MIN_ENT;
+		size = VIO_CMO_MIN_ENT;
+
+		dev_ent = kmalloc(sizeof(struct vio_cmo_dev_entry),
+		                  GFP_KERNEL);
+		if (!dev_ent)
+			return -ENOMEM;
+
+		dev_ent->viodev = viodev;
+		spin_lock_irqsave(&vio_cmo.lock, flags);
+		list_add(&dev_ent->list, &vio_cmo.device_list);
+	} else {
+		viodev->cmo.desired = 0;
+		size = 0;
+		spin_lock_irqsave(&vio_cmo.lock, flags);
+	}
+
+	/*
+	 * If the needs for vio_cmo.min have not changed since they
+	 * were last set, the number of devices in the OF tree has
+	 * been constant and the IO memory for this is already in
+	 * the reserve pool.
+	 */
+	if (vio_cmo.min == ((vio_cmo_num_OF_devs() + 1) *
+	                    VIO_CMO_MIN_ENT)) {
+		/* Updated desired entitlement if device requires it */
+		if (size)
+			vio_cmo.desired += (viodev->cmo.desired -
+		                        VIO_CMO_MIN_ENT);
+	} else {
+		size_t tmp;
+
+		tmp = vio_cmo.spare + vio_cmo.excess.free;
+		if (tmp < size) {
+			dev_err(dev, "%s: insufficient free "
+			        "entitlement to add device. "
+			        "Need %lu, have %lu\n", __func__,
+				size, (vio_cmo.spare + tmp));
+			spin_unlock_irqrestore(&vio_cmo.lock, flags);
+			return -ENOMEM;
+		}
+
+		/* Use excess pool first to fulfill request */
+		tmp = min(size, vio_cmo.excess.free);
+		vio_cmo.excess.free -= tmp;
+		vio_cmo.excess.size -= tmp;
+		vio_cmo.reserve.size += tmp;
+
+		/* Use spare if excess pool was insufficient */
+		vio_cmo.spare -= size - tmp;
+
+		/* Update bus accounting */
+		vio_cmo.min += size;
+		vio_cmo.desired += viodev->cmo.desired;
+	}
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+	return 0;
+}
+
+/**
+ * vio_cmo_bus_remove - Handle CMO specific bus removal activities
+ *
+ * @viodev - Pointer to struct vio_dev for device
+ *
+ * Remove the device from the cmo device list.  The minimum entitlement
+ * will be reserved for the device as long as it is in the system.  The
+ * rest of the entitlement the device had been allocated will be returned
+ * to the system.
+ */
+static void vio_cmo_bus_remove(struct vio_dev *viodev)
+{
+	struct vio_cmo_dev_entry *dev_ent;
+	unsigned long flags;
+	size_t tmp;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	if (viodev->cmo.allocated) {
+		dev_err(&viodev->dev, "%s: device had %lu bytes of IO "
+		        "allocated after remove operation.\n",
+		        __func__, viodev->cmo.allocated);
+		BUG();
+	}
+
+	/*
+	 * Remove the device from the device list being maintained for
+	 * CMO enabled devices.
+	 */
+	list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+		if (viodev == dev_ent->viodev) {
+			list_del(&dev_ent->list);
+			kfree(dev_ent);
+			break;
+		}
+
+	/*
+	 * Devices may not require any entitlement and they do not need
+	 * to be processed.  Otherwise, return the device's entitlement
+	 * back to the pools.
+	 */
+	if (viodev->cmo.entitled) {
+		/*
+		 * This device has not yet left the OF tree, it's
+		 * minimum entitlement remains in vio_cmo.min and
+		 * vio_cmo.desired
+		 */
+		vio_cmo.desired -= (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+
+		/*
+		 * Save min allocation for device in reserve as long
+		 * as it exists in OF tree as determined by later
+		 * balance operation
+		 */
+		viodev->cmo.entitled -= VIO_CMO_MIN_ENT;
+
+		/* Replenish spare from freed reserve pool */
+		if (viodev->cmo.entitled && (vio_cmo.spare < VIO_CMO_MIN_ENT)) {
+			tmp = min(viodev->cmo.entitled, (VIO_CMO_MIN_ENT -
+			                                 vio_cmo.spare));
+			vio_cmo.spare += tmp;
+			viodev->cmo.entitled -= tmp;
+		}
+
+		/* Remaining reserve goes to excess pool */
+		vio_cmo.excess.size += viodev->cmo.entitled;
+		vio_cmo.excess.free += viodev->cmo.entitled;
+		vio_cmo.reserve.size -= viodev->cmo.entitled;
+
+		/*
+		 * Until the device is removed it will keep a
+		 * minimum entitlement; this will guarantee that
+		 * a module unload/load will result in a success.
+		 */
+		viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+		viodev->cmo.desired = VIO_CMO_MIN_ENT;
+		atomic_set(&viodev->cmo.allocs_failed, 0);
+	}
+
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+static void vio_cmo_set_dma_ops(struct vio_dev *viodev)
+{
+	vio_dma_mapping_ops.dma_supported = dma_iommu_ops.dma_supported;
+	viodev->dev.archdata.dma_ops = &vio_dma_mapping_ops;
+}
+
+/**
+ * vio_cmo_bus_init - CMO entitlement initialization at bus init time
+ *
+ * Set up the reserve and excess entitlement pools based on available
+ * system entitlement and the number of devices in the OF tree that
+ * require entitlement in the reserve pool.
+ */
+static void vio_cmo_bus_init(void)
+{
+	struct hvcall_mpp_data mpp_data;
+	int err;
+
+	memset(&vio_cmo, 0, sizeof(struct vio_cmo));
+	spin_lock_init(&vio_cmo.lock);
+	INIT_LIST_HEAD(&vio_cmo.device_list);
+	INIT_DELAYED_WORK(&vio_cmo.balance_q, vio_cmo_balance);
+
+	/* Get current system entitlement */
+	err = h_get_mpp(&mpp_data);
+
+	/*
+	 * On failure, continue with entitlement set to 0, will panic()
+	 * later when spare is reserved.
+	 */
+	if (err != H_SUCCESS) {
+		printk(KERN_ERR "%s: unable to determine system IO "
+		       "entitlement. (%d)\n", __func__, err);
+		vio_cmo.entitled = 0;
+	} else {
+		vio_cmo.entitled = mpp_data.entitled_mem;
+	}
+
+	/* Set reservation and check against entitlement */
+	vio_cmo.spare = VIO_CMO_MIN_ENT;
+	vio_cmo.reserve.size = vio_cmo.spare;
+	vio_cmo.reserve.size += (vio_cmo_num_OF_devs() *
+	                         VIO_CMO_MIN_ENT);
+	if (vio_cmo.reserve.size > vio_cmo.entitled) {
+		printk(KERN_ERR "%s: insufficient system entitlement\n",
+		       __func__);
+		panic("%s: Insufficient system entitlement", __func__);
+	}
+
+	/* Set the remaining accounting variables */
+	vio_cmo.excess.size = vio_cmo.entitled - vio_cmo.reserve.size;
+	vio_cmo.excess.free = vio_cmo.excess.size;
+	vio_cmo.min = vio_cmo.reserve.size;
+	vio_cmo.desired = vio_cmo.reserve.size;
+}
+
+/* sysfs device functions and data structures for CMO */
+
+#define viodev_cmo_rd_attr(name)                                        \
+static ssize_t viodev_cmo_##name##_show(struct device *dev,             \
+                                        struct device_attribute *attr,  \
+                                         char *buf)                     \
+{                                                                       \
+	return sprintf(buf, "%lu\n", to_vio_dev(dev)->cmo.name);        \
+}
+
+static ssize_t viodev_cmo_allocs_failed_show(struct device *dev,
+		struct device_attribute *attr, char *buf)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	return sprintf(buf, "%d\n", atomic_read(&viodev->cmo.allocs_failed));
+}
+
+static ssize_t viodev_cmo_allocs_failed_reset(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	atomic_set(&viodev->cmo.allocs_failed, 0);
+	return count;
+}
+
+static ssize_t viodev_cmo_desired_set(struct device *dev,
+		struct device_attribute *attr, const char *buf, size_t count)
+{
+	struct vio_dev *viodev = to_vio_dev(dev);
+	size_t new_desired;
+	int ret;
+
+	ret = strict_strtoul(buf, 10, &new_desired);
+	if (ret)
+		return ret;
+
+	vio_cmo_set_dev_desired(viodev, new_desired);
+	return count;
+}
+
+viodev_cmo_rd_attr(desired);
+viodev_cmo_rd_attr(entitled);
+viodev_cmo_rd_attr(allocated);
+
+static ssize_t name_show(struct device *, struct device_attribute *, char *);
+static ssize_t devspec_show(struct device *, struct device_attribute *, char *);
+static struct device_attribute vio_cmo_dev_attrs[] = {
+	__ATTR_RO(name),
+	__ATTR_RO(devspec),
+	__ATTR(cmo_desired,       S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+	       viodev_cmo_desired_show, viodev_cmo_desired_set),
+	__ATTR(cmo_entitled,      S_IRUGO, viodev_cmo_entitled_show,      NULL),
+	__ATTR(cmo_allocated,     S_IRUGO, viodev_cmo_allocated_show,     NULL),
+	__ATTR(cmo_allocs_failed, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+	       viodev_cmo_allocs_failed_show, viodev_cmo_allocs_failed_reset),
+	__ATTR_NULL
+};
+
+/* sysfs bus functions and data structures for CMO */
+
+#define viobus_cmo_rd_attr(name)                                        \
+static ssize_t                                                          \
+viobus_cmo_##name##_show(struct bus_type *bt, char *buf)                \
+{                                                                       \
+	return sprintf(buf, "%lu\n", vio_cmo.name);                     \
+}
+
+#define viobus_cmo_pool_rd_attr(name, var)                              \
+static ssize_t                                                          \
+viobus_cmo_##name##_pool_show_##var(struct bus_type *bt, char *buf)     \
+{                                                                       \
+	return sprintf(buf, "%lu\n", vio_cmo.name.var);                 \
+}
+
+static ssize_t viobus_cmo_high_reset(struct bus_type *bt, const char *buf,
+                                     size_t count)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&vio_cmo.lock, flags);
+	vio_cmo.high = vio_cmo.curr;
+	spin_unlock_irqrestore(&vio_cmo.lock, flags);
+
+	return count;
+}
+
+viobus_cmo_rd_attr(entitled);
+viobus_cmo_pool_rd_attr(reserve, size);
+viobus_cmo_pool_rd_attr(excess, size);
+viobus_cmo_pool_rd_attr(excess, free);
+viobus_cmo_rd_attr(spare);
+viobus_cmo_rd_attr(min);
+viobus_cmo_rd_attr(desired);
+viobus_cmo_rd_attr(curr);
+viobus_cmo_rd_attr(high);
+
+static struct bus_attribute vio_cmo_bus_attrs[] = {
+	__ATTR(cmo_entitled, S_IRUGO, viobus_cmo_entitled_show, NULL),
+	__ATTR(cmo_reserve_size, S_IRUGO, viobus_cmo_reserve_pool_show_size, NULL),
+	__ATTR(cmo_excess_size, S_IRUGO, viobus_cmo_excess_pool_show_size, NULL),
+	__ATTR(cmo_excess_free, S_IRUGO, viobus_cmo_excess_pool_show_free, NULL),
+	__ATTR(cmo_spare,   S_IRUGO, viobus_cmo_spare_show,   NULL),
+	__ATTR(cmo_min,     S_IRUGO, viobus_cmo_min_show,     NULL),
+	__ATTR(cmo_desired, S_IRUGO, viobus_cmo_desired_show, NULL),
+	__ATTR(cmo_curr,    S_IRUGO, viobus_cmo_curr_show,    NULL),
+	__ATTR(cmo_high,    S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+	       viobus_cmo_high_show, viobus_cmo_high_reset),
+	__ATTR_NULL
+};
+
+static void vio_cmo_sysfs_init(void)
+{
+	vio_bus_type.dev_attrs = vio_cmo_dev_attrs;
+	vio_bus_type.bus_attrs = vio_cmo_bus_attrs;
+}
+#else /* CONFIG_PPC_PSERIES */
+/* Dummy functions for iSeries platform */
+int vio_cmo_entitlement_update(size_t new_entitlement) { return 0; }
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired) {}
+static int vio_cmo_bus_probe(struct vio_dev *viodev) { return 0; }
+static void vio_cmo_bus_remove(struct vio_dev *viodev) {}
+static void vio_cmo_set_dma_ops(struct vio_dev *viodev) {}
+static void vio_cmo_bus_init() {}
+static void vio_cmo_sysfs_init() { }
+#endif /* CONFIG_PPC_PSERIES */
+EXPORT_SYMBOL(vio_cmo_entitlement_update);
+EXPORT_SYMBOL(vio_cmo_set_dev_desired);
+
 static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
 {
 	const unsigned char *dma_window;
@@ -114,8 +1095,17 @@ static int vio_bus_probe(struct device *
 		return error;
 
 	id = vio_match_device(viodrv->id_table, viodev);
-	if (id)
+	if (id) {
+		memset(&viodev->cmo, 0, sizeof(viodev->cmo));
+		if (firmware_has_feature(FW_FEATURE_CMO)) {
+			error = vio_cmo_bus_probe(viodev);
+			if (error)
+				return error;
+		}
 		error = viodrv->probe(viodev, id);
+		if (error)
+			vio_cmo_bus_remove(viodev);
+	}
 
 	return error;
 }
@@ -125,12 +1115,23 @@ static int vio_bus_remove(struct device 
 {
 	struct vio_dev *viodev = to_vio_dev(dev);
 	struct vio_driver *viodrv = to_vio_driver(dev->driver);
+	struct device *devptr;
+	int ret = 1;
+
+	/*
+	 * Hold a reference to the device after the remove function is called
+	 * to allow for CMO accounting cleanup for the device.
+	 */
+	devptr = get_device(dev);
 
 	if (viodrv->remove)
-		return viodrv->remove(viodev);
+		ret = viodrv->remove(viodev);
 
-	/* driver can't remove */
-	return 1;
+	if (!ret && firmware_has_feature(FW_FEATURE_CMO))
+		vio_cmo_bus_remove(viodev);
+
+	put_device(devptr);
+	return ret;
 }
 
 /**
@@ -142,6 +1143,13 @@ int vio_register_driver(struct vio_drive
 	printk(KERN_DEBUG "%s: driver %s registering\n", __func__,
 		viodrv->driver.name);
 
+	if (firmware_has_feature(FW_FEATURE_CMO) &&
+	    !viodrv->get_io_entitlement) {
+		printk(KERN_ERR "%s: driver %s does not support CMO, "
+		       "not registered.\n", __func__, viodrv->driver.name);
+		return -EINVAL;
+	}
+
 	/* fill in 'struct driver' fields */
 	viodrv->driver.bus = &vio_bus_type;
 
@@ -215,7 +1223,11 @@ struct vio_dev *vio_register_device_node
 			viodev->unit_address = *unit_address;
 	}
 	viodev->dev.archdata.of_node = of_node_get(of_node);
-	viodev->dev.archdata.dma_ops = &dma_iommu_ops;
+
+	if (firmware_has_feature(FW_FEATURE_CMO))
+		vio_cmo_set_dma_ops(viodev);
+	else
+		viodev->dev.archdata.dma_ops = &dma_iommu_ops;
 	viodev->dev.archdata.dma_data = vio_build_iommu_table(viodev);
 	viodev->dev.archdata.numa_node = of_node_to_nid(of_node);
 
@@ -245,6 +1257,9 @@ static int __init vio_bus_init(void)
 	int err;
 	struct device_node *node_vroot;
 
+	if (firmware_has_feature(FW_FEATURE_CMO))
+		vio_cmo_sysfs_init();
+
 	err = bus_register(&vio_bus_type);
 	if (err) {
 		printk(KERN_ERR "failed to register VIO bus\n");
@@ -262,6 +1277,9 @@ static int __init vio_bus_init(void)
 		return err;
 	}
 
+	if (firmware_has_feature(FW_FEATURE_CMO))
+		vio_cmo_bus_init();
+
 	node_vroot = of_find_node_by_name(NULL, "vdevice");
 	if (node_vroot) {
 		struct device_node *of_node;
Index: b/include/asm-powerpc/vio.h
===================================================================
--- a/include/asm-powerpc/vio.h
+++ b/include/asm-powerpc/vio.h
@@ -39,16 +39,34 @@
 #define VIO_IRQ_DISABLE		0UL
 #define VIO_IRQ_ENABLE		1UL
 
+/*
+ * VIO CMO minimum entitlement for all devices and spare entitlement
+ */
+#define VIO_CMO_MIN_ENT 1562624
+
 struct iommu_table;
 
-/*
- * The vio_dev structure is used to describe virtual I/O devices.
+/**
+ * vio_dev - This structure is used to describe virtual I/O devices.
+ *
+ * @desired: set from return of driver's get_io_entitlement() function
+ * @entitled: bytes of IO data that has been reserved for this device.
+ * @entitled_target: Target IO data allocation for device. This may be set
+ *   lower than entitled to enable balancing; can not be larger than entitled.
+ * @allocated: bytes of IO data currently in use by the device.
+ * @allocs_failed: number of DMA failures due to insufficient entitlement.
  */
 struct vio_dev {
 	const char *name;
 	const char *type;
 	uint32_t unit_address;
 	unsigned int irq;
+	struct {
+		size_t desired;
+		size_t entitled;
+		size_t allocated;
+		atomic_t allocs_failed;
+	} cmo;
 	struct device dev;
 };
 
@@ -56,12 +74,20 @@ struct vio_driver {
 	const struct vio_device_id *id_table;
 	int (*probe)(struct vio_dev *dev, const struct vio_device_id *id);
 	int (*remove)(struct vio_dev *dev);
+	/* A driver must have a get_io_entitlement() function to
+	 * be loaded in a CMO environment, it may return 0 if no I/O
+	 * entitlement is needed.
+	 */
+	unsigned long (*get_io_entitlement)(struct vio_dev *dev);
 	struct device_driver driver;
 };
 
 extern int vio_register_driver(struct vio_driver *drv);
 extern void vio_unregister_driver(struct vio_driver *drv);
 
+extern int vio_cmo_entitlement_update(size_t);
+extern void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired);
+
 extern void __devinit vio_unregister_device(struct vio_dev *dev);
=20
 struct device_node;

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/19][v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
  2008-06-16 20:47   ` [PATCH 03/19][v3] " Nathan Fontenot
@ 2008-06-24 14:23     ` Brian King
  0 siblings, 0 replies; 41+ messages in thread
From: Brian King @ 2008-06-24 14:23 UTC (permalink / raw)
  To: Nathan Fontenot; +Cc: linuxppc-dev, paulus, David Darrington

Just a few minor nits.

> +/**
> + * h_get_mpp
> + * H_GET_MPP hcall returns info in 7 parms
> + */
> +int h_get_mpp(struct hvcall_mpp_data *mpp_data)
> +{
> +    int rc;
> +    unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
   ^^^^

Should be tabs instead of spaces in this function and a few others
in this patch file.

> +/**
> + * parse_mpp_data
> + * Parse out data returned from h_get_mpp
> + */
> +static void parse_mpp_data(struct seq_file *m)
> +{
> +    struct hvcall_mpp_data mpp_data;
> +    int rc;

Same here.

> +/**
> + * update_mpp
> + *
> + * Update the memory entitlement and weight for the partition.  Caller
> must
> + * spercify either a new entitlement or weight, not both, to be updated
      ^^^^^^^^

> + * since the h_set_mpp call takes both entitlement and weight as
> parameters.
> + */
> +static ssize_t update_mpp(u64 *entitlement, u8 *weight)
> +{
> +    struct hvcall_mpp_data mpp_data;

Tab/spacing here.

> @@ -270,6 +272,20 @@
>  };
>  #define HCALL_STAT_ARRAY_SIZE    ((MAX_HCALL_OPCODE >> 2) + 1)
> 
> +struct hvcall_mpp_data {
> +    unsigned long entitled_mem;
> +    unsigned long mapped_mem;
> +    unsigned short group_num;
> +    unsigned short pool_num;
> +    unsigned char mem_weight;
> +    unsigned char unallocated_mem_weight;
> +    unsigned long unallocated_entitlement;    /* value in bytes */
> +    unsigned long pool_size;
> +    long loan_request;

Might as well be specific here and call this a signed long.
Tab/spacing issue here as well.


-- 
Brian King
Linux on Power Virtualization
IBM Linux Technology Center

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
  2008-06-12 22:09 ` [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
  2008-06-16 16:09   ` [PATCH 03/19][v2] " Nathan Fontenot
  2008-06-16 20:47   ` [PATCH 03/19][v3] " Nathan Fontenot
@ 2008-06-24 15:26   ` Nathan Fontenot
  2 siblings, 0 replies; 41+ messages in thread
From: Nathan Fontenot @ 2008-06-24 15:26 UTC (permalink / raw)
  To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington

v4 of this patch; it corrects the whitespace and spelling issues pointed out
by Brian King, and makes loan_request a signed long per Brian's suggestion.

Update /proc/ppc64/lparcfg to display Cooperative Memory Overcommitment
statistics as reported by the H_GET_MPP hcall.  This also updates the
lparcfg write interface to allow setting memory entitlement and weight.
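
As a usage illustration (not part of the patch), here is a minimal
user-space sketch of the write interface; the 4 GB figure is invented and
the value is assumed to be in bytes:

/* Hypothetical sketch: set the partition's memory entitlement through
 * /proc/ppc64/lparcfg.  Input format is "param_name=value", as parsed
 * by lparcfg_write() below.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/ppc64/lparcfg", "w");

	if (!f) {
		perror("lparcfg");
		return 1;
	}
	fprintf(f, "entitled_memory=%llu\n", 4ULL << 30);
	fclose(f);
	return 0;
}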

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>

---
  arch/powerpc/kernel/lparcfg.c |  119 ++++++++++++++++++++++++++++++++++++++++++
  include/asm-powerpc/hvcall.h  |   18 ++++++
  2 files changed, 136 insertions(+), 1 deletion(-)

Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -129,6 +129,35 @@ static int iseries_lparcfg_data(struct s
  /*
   * Methods used to fetch LPAR data when running on a pSeries platform.
   */
+/**
+ * h_get_mpp
+ * H_GET_MPP hcall returns info in 7 parms
+ */
+int h_get_mpp(struct hvcall_mpp_data *mpp_data)
+{
+	int rc;
+	unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+
+	rc = plpar_hcall9(H_GET_MPP, retbuf);
+
+	mpp_data->entitled_mem = retbuf[0];
+	mpp_data->mapped_mem = retbuf[1];
+
+	mpp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+	mpp_data->pool_num = retbuf[2] & 0xffff;
+
+	mpp_data->mem_weight = (retbuf[3] >> 7 * 8) & 0xff;
+	mpp_data->unallocated_mem_weight = (retbuf[3] >> 6 * 8) & 0xff;
+	mpp_data->unallocated_entitlement = retbuf[3] & 0xffffffffffff;
+
+	mpp_data->pool_size = retbuf[4];
+	mpp_data->loan_request = retbuf[5];
+	mpp_data->backing_mem = retbuf[6];
+
+	return rc;
+}
+EXPORT_SYMBOL(h_get_mpp);
+
  /*
   * H_GET_PPP hcall returns info in 4 parms.
   *  entitled_capacity,unallocated_capacity,
@@ -226,6 +255,44 @@ static void parse_ppp_data(struct seq_fi
  	seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
  }

+/**
+ * parse_mpp_data
+ * Parse out data returned from h_get_mpp
+ */
+static void parse_mpp_data(struct seq_file *m)
+{
+	struct hvcall_mpp_data mpp_data;
+	int rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return;
+
+	seq_printf(m, "entitled_memory=%ld\n", mpp_data.entitled_mem);
+
+	if (mpp_data.mapped_mem != -1)
+		seq_printf(m, "mapped_entitled_memory=%ld\n",
+		           mpp_data.mapped_mem);
+
+	seq_printf(m, "entitled_memory_group_number=%d\n", mpp_data.group_num);
+	seq_printf(m, "entitled_memory_pool_number=%d\n", mpp_data.pool_num);
+
+	seq_printf(m, "entitled_memory_weight=%d\n", mpp_data.mem_weight);
+	seq_printf(m, "unallocated_entitled_memory_weight=%d\n",
+	           mpp_data.unallocated_mem_weight);
+	seq_printf(m, "unallocated_io_mapping_entitlement=%ld\n",
+	           mpp_data.unallocated_entitlement);
+
+	if (mpp_data.pool_size != -1)
+		seq_printf(m, "entitled_memory_pool_size=%ld bytes\n",
+		           mpp_data.pool_size);
+
+	seq_printf(m, "entitled_memory_loan_request=%ld\n",
+	           mpp_data.loan_request);
+
+	seq_printf(m, "backing_memory=%ld bytes\n", mpp_data.backing_mem);
+}
+
  #define SPLPAR_CHARACTERISTICS_TOKEN 20
  #define SPLPAR_MAXLENGTH 1026*(sizeof(char))

@@ -353,6 +420,7 @@ static int pseries_lparcfg_data(struct s
  		/* this call handles the ibm,get-system-parameter contents */
  		parse_system_parameter_string(m);
  		parse_ppp_data(m);
+		parse_mpp_data(m);

  		seq_printf(m, "purr=%ld\n", get_purr());
  	} else {		/* non SPLPAR case */
@@ -416,6 +484,43 @@ static ssize_t update_ppp(u64 *entitleme
  	return retval;
  }

+/**
+ * update_mpp
+ *
+ * Update the memory entitlement and weight for the partition.  Caller must
+ * specify either a new entitlement or weight, not both, to be updated
+ * since the h_set_mpp call takes both entitlement and weight as parameters.
+ */
+static ssize_t update_mpp(u64 *entitlement, u8 *weight)
+{
+	struct hvcall_mpp_data mpp_data;
+	u64 new_entitled;
+	u8 new_weight;
+	ssize_t rc;
+
+	rc = h_get_mpp(&mpp_data);
+	if (rc)
+		return rc;
+
+	if (entitlement) {
+		new_weight = mpp_data.mem_weight;
+		new_entitled = *entitlement;
+	} else if (weight) {
+		new_weight = *weight;
+		new_entitled = mpp_data.entitled_mem;
+	} else
+		return -EINVAL;
+
+	pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+	         __FUNCTION__, mpp_data.entitled_mem, mpp_data.mem_weight);
+
+	pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+	         __FUNCTION__, new_entitled, new_weight);
+
+	rc = plpar_hcall_norets(H_SET_MPP, new_entitled, new_weight);
+	return rc;
+}
+
  /*
   * Interface for changing system parameters (variable capacity weight
   * and entitled capacity).  Format of input is "param_name=value";
@@ -469,6 +574,20 @@ static ssize_t lparcfg_write(struct file
  			goto out;

  		retval = update_ppp(NULL, new_weight_ptr);
+	} else if (!strcmp(kbuf, "entitled_memory")) {
+		char *endp;
+		*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(new_entitled_ptr, NULL);
+	} else if (!strcmp(kbuf, "entitled_memory_weight")) {
+		char *endp;
+		*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
+		if (endp == tmp)
+			goto out;
+
+		retval = update_mpp(NULL, new_weight_ptr);
  	} else
  		goto out;

Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -210,7 +210,9 @@
  #define H_JOIN			0x298
  #define H_VASI_STATE            0x2A4
  #define H_ENABLE_CRQ		0x2B0
-#define MAX_HCALL_OPCODE	H_ENABLE_CRQ
+#define H_SET_MPP		0x2D0
+#define H_GET_MPP		0x2D4
+#define MAX_HCALL_OPCODE	H_GET_MPP

  #ifndef __ASSEMBLY__

@@ -270,6 +272,20 @@ struct hcall_stats {
  };
  #define HCALL_STAT_ARRAY_SIZE	((MAX_HCALL_OPCODE >> 2) + 1)

+struct hvcall_mpp_data {
+	unsigned long entitled_mem;
+	unsigned long mapped_mem;
+	unsigned short group_num;
+	unsigned short pool_num;
+	unsigned char mem_weight;
+	unsigned char unallocated_mem_weight;
+	unsigned long unallocated_entitlement;  /* value in bytes */
+	unsigned long pool_size;
+	signed long loan_request;
+	unsigned long backing_mem;
+};
+
+int h_get_mpp(struct hvcall_mpp_data *);
  #endif /* __ASSEMBLY__ */
  #endif /* __KERNEL__ */
  #endif /* _ASM_POWERPC_HVCALL_H */

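To make the shift-and-mask unpacking in h_get_mpp() above easier to
verify, here is a stand-alone sketch of the same extraction run on
fabricated register values; the layout follows the shifts in the patch,
while the sample numbers and the 64-bit unsigned long are assumptions:

#include <stdio.h>

int main(void)
{
	/* Fabricated values: group 42, pool 7; weight 128, unallocated
	 * weight 64, 1 MB of unallocated entitlement.
	 */
	unsigned long retbuf2 = 0x002a0007UL;
	unsigned long retbuf3 = 0x8040000000100000UL;

	unsigned short group_num = (retbuf2 >> 2 * 8) & 0xffff;
	unsigned short pool_num = retbuf2 & 0xffff;
	unsigned char mem_weight = (retbuf3 >> 7 * 8) & 0xff;
	unsigned char unallocated_mem_weight = (retbuf3 >> 6 * 8) & 0xff;
	unsigned long unallocated_entitlement = retbuf3 & 0xffffffffffffUL;

	printf("group=%u pool=%u weight=%u unalloc_weight=%u unalloc=%lu\n",
	       group_num, pool_num, mem_weight, unallocated_mem_weight,
	       unallocated_entitlement);
	return 0;
}
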
^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, newest: 2008-06-24 15:26 UTC

Thread overview: 41+ messages
2008-06-12 21:53 [PATCH 00/19] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
2008-06-12 22:08 ` [PATCH 01/19] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
2008-06-12 22:08 ` [PATCH 02/19] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
2008-06-13  0:23   ` Stephen Rothwell
2008-06-13 19:11     ` Nathan Fontenot
2008-06-16 16:07   ` Nathan Fontenot
2008-06-12 22:09 ` [PATCH 03/19] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
2008-06-16 16:09   ` [PATCH 03/19][v2] " Nathan Fontenot
2008-06-16 20:47   ` [PATCH 03/19][v3] " Nathan Fontenot
2008-06-24 14:23     ` Brian King
2008-06-24 15:26   ` [PATCH 03/19] " Nathan Fontenot
2008-06-12 22:11 ` [PATCH 04/19] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
2008-06-12 22:11 ` [PATCH 05/19] powerpc: Enable CMO feature during platform setup Robert Jennings
2008-06-12 22:12 ` [PATCH 06/19] powerpc: Utilities to set firmware page state Robert Jennings
2008-06-12 22:13 ` [PATCH 07/19] powerpc: Add collaborative memory manager Robert Jennings
2008-06-12 22:13 ` [PATCH 08/19] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
2008-06-12 22:14 ` [PATCH 09/19] powerpc: Add CMO paging statistics Robert Jennings
2008-06-12 22:15 ` [PATCH 10/19] powerpc: move get_longbusy_msecs out of ehca/ehea Robert Jennings
2008-06-12 22:18   ` [PATCH 10/19] [repost] " Robert Jennings
2008-06-13 18:24     ` Brian King
2008-06-13 19:55       ` Jeff Garzik
2008-06-12 22:19 ` [PATCH 11/19] powerpc: iommu enablement for CMO Robert Jennings
2008-06-13  1:43   ` Olof Johansson
2008-06-20 15:03     ` Robert Jennings
2008-06-20 15:12   ` [PATCH 11/19][v2] " Robert Jennings
2008-06-12 22:19 ` [PATCH 12/19] powerpc: vio bus support " Robert Jennings
2008-06-13  5:12   ` Stephen Rothwell
2008-06-23 20:23     ` Robert Jennings
2008-06-23 20:25   ` [PATCH 12/19][v2] " Robert Jennings
2008-06-12 22:21 ` [PATCH 13/19] powerpc: Verify CMO memory entitlement updates with virtual I/O Robert Jennings
2008-06-12 22:21 ` [PATCH 14/19] powerpc: hvc enablement for CMO Robert Jennings
2008-06-12 22:22 ` [PATCH 15/19] powerpc: hvcs " Robert Jennings
2008-06-12 22:22 ` [PATCH 16/19] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
2008-06-13  5:18   ` Stephen Rothwell
2008-06-23 20:21   ` [PATCH 16/19][v2] " Robert Jennings
2008-06-12 22:23 ` [PATCH 17/19] ibmveth: enable driver for CMO Robert Jennings
2008-06-13  5:25   ` Stephen Rothwell
2008-06-23 20:20   ` [PATCH 17/19][v2] " Robert Jennings
2008-06-12 22:24 ` [PATCH 18/19] ibmvscsi: driver enablement " Robert Jennings
2008-06-13 18:30   ` Brian King
2008-06-12 22:31 ` [PATCH 19/19] powerpc: Update arch vector to indicate support " Robert Jennings
