* [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support
@ 2008-07-04 12:44 Robert Jennings
2008-07-04 12:51 ` [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
` (17 more replies)
0 siblings, 18 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:44 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
This is version 3 of the full patchset. Please consider this for 2.6.27.
Only the vio bus patch is changed, but with two other patches dropped I
felt it would be clearer to post the full set.
A change was made so that devices not performing DMA operations will not
require modification; therefore, the hvc and hvcs driver patches have
been dropped. The vio bus patch now has a more descriptive patch header,
as it is a large addition of code.
Cooperative Memory Overcommitment (CMO) is a pSeries platform feature
that enables the allocation of more memory to a set of logical partitions
than is physically present. For example, a system with 16 GB of memory
can be configured to simultaneously run 3 logical partitions each with
8 GB of memory allocated to them.
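The arithmetic behind that example can be sketched as follows. This is only an illustration of the overcommit idea; the helper names are hypothetical and do not appear anywhere in the patch set.

```c
#include <stddef.h>

/* Sum of the memory assigned to every partition, in GB (hypothetical helper). */
static unsigned long cmo_total_assigned_gb(const unsigned long *lpar_gb, size_t n)
{
	unsigned long sum = 0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += lpar_gb[i];
	return sum;
}

/* Amount by which the assignments exceed physical memory, or 0 if they fit. */
static unsigned long cmo_overcommit_gb(unsigned long physical_gb,
				       const unsigned long *lpar_gb, size_t n)
{
	unsigned long assigned = cmo_total_assigned_gb(lpar_gb, n);

	return assigned > physical_gb ? assigned - physical_gb : 0;
}
```

With three 8 GB partitions on a 16 GB system, 24 GB is assigned and 8 GB of that has no physical backing at any one time; that shortfall is what firmware paging and the balloon driver have to cover.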
The system firmware can page out memory as needed to meet the needs
of each partition. To minimize the effects of firmware paging memory,
the Collaborative Memory Manager (CMM) driver acts as a balloon driver
to work with firmware to provide memory ahead of any paging needs.
The OS is provided with an entitlement of IO memory for device drivers
to map. This amount varies with the number of virtual IO adapters
present and can change as devices are hot-plugged. The VIO bus code
distributes this memory to devices. Logical partitions supporting CMO
may only have virtual IO devices; physical devices are not supported.
Above the entitled level, IO mappings can fail, and the IOMMU needed to be
updated to handle this change.
Virtual IO adapters have been updated to handle DMA mapping failures and
to size their entitlement needs.
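A minimal sketch of the accounting idea, assuming a single byte counter per partition: a mapping request is refused once it would push the mapped total past the entitlement, so callers must be prepared for failure. The structure and function names are hypothetical and are not the interfaces added by the vio bus or IOMMU patches.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bookkeeping for a partition's IO memory entitlement. */
struct io_entitlement {
	size_t entitled;	/* bytes the platform guarantees can be mapped */
	size_t mapped;		/* bytes currently mapped for DMA */
};

/*
 * Try to account for a new mapping. Returns false when the request would
 * exceed the entitlement (or the entitlement has already shrunk below the
 * mapped total); the caller is expected to handle that failure.
 */
static bool io_entitlement_map(struct io_entitlement *e, size_t bytes)
{
	if (e->mapped > e->entitled || bytes > e->entitled - e->mapped)
		return false;
	e->mapped += bytes;
	return true;
}

/* Release a mapping that io_entitlement_map() previously accepted. */
static void io_entitlement_unmap(struct io_entitlement *e, size_t bytes)
{
	e->mapped -= bytes;
}
```

The extra `mapped > entitled` check covers the case where the entitlement is lowered while mappings are outstanding, which the cover letter implies can happen when devices are hot-plugged.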
Platform support for CMM and hot-plug entitlement events is also
included in the following patches.
The changes should have minimal impact on non-CMO-enabled environments.
This patch set has been written against 2.6.26-rc8 and has been tested
at that level.
Regards,
Robert Jennings
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
@ 2008-07-04 12:51 ` Robert Jennings
2008-07-22 3:34 ` Paul Mackerras
2008-07-04 12:51 ` [PATCH 02/16 v3] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
` (16 subsequent siblings)
17 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:51 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Nathan Fontenot <nfont@austin.ibm.com>
Remove the extraneous error reporting used when a hcall made from lparcfg fails.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 32 --------------------------------
1 file changed, 32 deletions(-)
Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -129,33 +129,6 @@ static int iseries_lparcfg_data(struct s
/*
* Methods used to fetch LPAR data when running on a pSeries platform.
*/
-static void log_plpar_hcall_return(unsigned long rc, char *tag)
-{
- switch(rc) {
- case 0:
- return;
- case H_HARDWARE:
- printk(KERN_INFO "plpar-hcall (%s) "
- "Hardware fault\n", tag);
- return;
- case H_FUNCTION:
- printk(KERN_INFO "plpar-hcall (%s) "
- "Function not allowed\n", tag);
- return;
- case H_AUTHORITY:
- printk(KERN_INFO "plpar-hcall (%s) "
- "Not authorized to this function\n", tag);
- return;
- case H_PARAMETER:
- printk(KERN_INFO "plpar-hcall (%s) "
- "Bad parameter(s)\n",tag);
- return;
- default:
- printk(KERN_INFO "plpar-hcall (%s) "
- "Unexpected rc(0x%lx)\n", tag, rc);
- }
-}
-
/*
* H_GET_PPP hcall returns info in 4 parms.
* entitled_capacity,unallocated_capacity,
@@ -191,8 +164,6 @@ static unsigned int h_get_ppp(unsigned l
*aggregation = retbuf[2];
*resource = retbuf[3];

- log_plpar_hcall_return(rc, "H_GET_PPP");
-
return rc;
}

@@ -205,9 +176,6 @@ static void h_pic(unsigned long *pool_id

*pool_idle_time = retbuf[0];
*num_procs = retbuf[1];
-
- if (rc != H_AUTHORITY)
- log_plpar_hcall_return(rc, "H_PIC");
}

#define SPLPAR_CHARACTERISTICS_TOKEN 20
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 02/16 v3] powerpc: Split processor entitlement retrieval and gathering to helper routines
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
2008-07-04 12:51 ` [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
@ 2008-07-04 12:51 ` Robert Jennings
2008-07-22 18:53 ` Nathan Fontenot
2008-07-04 12:51 ` [PATCH 03/16 v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
` (15 subsequent siblings)
17 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:51 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Nathan Fontenot <nfont@austin.ibm.com>
Split the retrieval and setting of processor entitlement and weight into
helper routines. This also removes the printing of the raw values
returned from h_get_ppp, since the values are already parsed and printed.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 168 ++++++++++++++++++++++-------------------
1 file changed, 90 insertions(+), 78 deletions(-)
Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -167,7 +167,8 @@ static unsigned int h_get_ppp(unsigned l
return rc;
}

-static void h_pic(unsigned long *pool_idle_time, unsigned long *num_procs)
+static unsigned h_pic(unsigned long *pool_idle_time,
+ unsigned long *num_procs)
{
unsigned long rc;
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
@@ -176,6 +177,53 @@ static void h_pic(unsigned long *pool_id

*pool_idle_time = retbuf[0];
*num_procs = retbuf[1];
+
+ return rc;
+}
+
+/*
+ * parse_ppp_data
+ * Parse out the data returned from h_get_ppp and h_pic
+ */
+static void parse_ppp_data(struct seq_file *m)
+{
+ unsigned long h_entitled, h_unallocated;
+ unsigned long h_aggregation, h_resource;
+ int rc;
+
+ rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
+ &h_resource);
+ if (rc)
+ return;
+
+ seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
+ seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
+ seq_printf(m, "system_active_processors=%ld\n",
+ (h_resource >> 0 * 8) & 0xffff);
+
+ /* pool related entries are apropriate for shared configs */
+ if (lppaca[0].shared_proc) {
+ unsigned long pool_idle_time, pool_procs;
+
+ seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+
+ /* report pool_capacity in percentage */
+ seq_printf(m, "pool_capacity=%ld\n",
+ ((h_resource >> 2 * 8) & 0xffff) * 100);
+
+ rc = h_pic(&pool_idle_time, &pool_procs);
+ if (! rc) {
+ seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
+ seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
+ }
+ }
+
+ seq_printf(m, "unallocated_capacity_weight=%ld\n",
+ (h_resource >> 4 * 8) & 0xFF);
+
+ seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
+ seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
+ seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
}

#define SPLPAR_CHARACTERISTICS_TOKEN 20
@@ -302,60 +350,11 @@ static int pseries_lparcfg_data(struct s
partition_active_processors = lparcfg_count_active_processors();

if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
- unsigned long h_entitled, h_unallocated;
- unsigned long h_aggregation, h_resource;
- unsigned long pool_idle_time, pool_procs;
- unsigned long purr;
-
- h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
- &h_resource);
-
- seq_printf(m, "R4=0x%lx\n", h_entitled);
- seq_printf(m, "R5=0x%lx\n", h_unallocated);
- seq_printf(m, "R6=0x%lx\n", h_aggregation);
- seq_printf(m, "R7=0x%lx\n", h_resource);
-
- purr = get_purr();
-
/* this call handles the ibm,get-system-parameter contents */
parse_system_parameter_string(m);
+ parse_ppp_data(m);

- seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
-
- seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
-
- seq_printf(m, "system_active_processors=%ld\n",
- (h_resource >> 0 * 8) & 0xffff);
-
- /* pool related entries are apropriate for shared configs */
- if (lppaca[0].shared_proc) {
-
- h_pic(&pool_idle_time, &pool_procs);
-
- seq_printf(m, "pool=%ld\n",
- (h_aggregation >> 0 * 8) & 0xffff);
-
- /* report pool_capacity in percentage */
- seq_printf(m, "pool_capacity=%ld\n",
- ((h_resource >> 2 * 8) & 0xffff) * 100);
-
- seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
-
- seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
- }
-
- seq_printf(m, "unallocated_capacity_weight=%ld\n",
- (h_resource >> 4 * 8) & 0xFF);
-
- seq_printf(m, "capacity_weight=%ld\n",
- (h_resource >> 5 * 8) & 0xFF);
-
- seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
-
- seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
-
- seq_printf(m, "purr=%ld\n", purr);
-
+ seq_printf(m, "purr=%ld\n", get_purr());
} else { /* non SPLPAR case */

seq_printf(m, "system_active_processors=%d\n",
@@ -382,6 +381,41 @@ static int pseries_lparcfg_data(struct s
return 0;
}

+static ssize_t update_ppp(u64 *entitlement, u8 *weight)
+{
+ unsigned long current_entitled;
+ unsigned long dummy;
+ unsigned long resource;
+ u8 current_weight, new_weight;
+ u64 new_entitled;
+ ssize_t retval;
+
+ /* Get our current parameters */
+ retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+ if (retval)
+ return retval;
+
+ current_weight = (resource >> 5 * 8) & 0xFF;
+
+ if (entitlement) {
+ new_weight = current_weight;
+ new_entitled = *entitlement;
+ } else if (weight) {
+ new_weight = *weight;
+ new_entitled = current_entitled;
+ } else
+ return -EINVAL;
+
+ pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+ __FUNCTION__, current_entitled, current_weight);
+
+ pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+ __FUNCTION__, new_entitled, new_weight);
+
+ retval = plpar_hcall_norets(H_SET_PPP, new_entitled, new_weight);
+ return retval;
+}
+
/*
* Interface for changing system parameters (variable capacity weight
* and entitled capacity). Format of input is "param_name=value";
@@ -399,12 +433,6 @@ static ssize_t lparcfg_write(struct file
char *tmp;
u64 new_entitled, *new_entitled_ptr = &new_entitled;
u8 new_weight, *new_weight_ptr = &new_weight;
-
- unsigned long current_entitled; /* parameters for h_get_ppp */
- unsigned long dummy;
- unsigned long resource;
- u8 current_weight;
-
ssize_t retval = -ENOMEM;

if (!firmware_has_feature(FW_FEATURE_SPLPAR) ||
@@ -432,33 +460,17 @@ static ssize_t lparcfg_write(struct file
*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
if (endp == tmp)
goto out;
- new_weight_ptr = &current_weight;
+
+ retval = update_ppp(new_entitled_ptr, NULL);
} else if (!strcmp(kbuf, "capacity_weight")) {
char *endp;
*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
if (endp == tmp)
goto out;
- new_entitled_ptr = &current_entitled;
- } else
- goto out;

- /* Get our current parameters */
- retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
- if (retval) {
- retval = -EIO;
+ retval = update_ppp(NULL, new_weight_ptr);
+ } else
goto out;
- }
-
- current_weight = (resource >> 5 * 8) & 0xFF;
-
- pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
- __func__, current_entitled, current_weight);
-
- pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
- __func__, *new_entitled_ptr, *new_weight_ptr);
-
- retval = plpar_hcall_norets(H_SET_PPP, *new_entitled_ptr,
- *new_weight_ptr);

if (retval == H_SUCCESS || retval == H_CONSTRAINED) {
retval = count;
^ permalink raw reply [flat|nested] 38+ messages in thread
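As a reading aid for the shift-and-mask expressions in parse_ppp_data() above, here is a user-space sketch that unpacks the fourth H_GET_PPP return value into named fields. The shifts and masks are copied from the patch; the structure, its field names, and the function name are illustrative assumptions, not kernel code.

```c
#include <stdint.h>

/* Fields packed into the H_GET_PPP "resource" word (names are illustrative). */
struct ppp_resource {
	uint8_t  capped;		/* bit 0 of byte 6 */
	uint8_t  weight;		/* byte 5 */
	uint8_t  unallocated_weight;	/* byte 4 */
	uint16_t active_procs_in_pool;	/* bytes 3..2 */
	uint16_t active_system_procs;	/* bytes 1..0 */
};

/* Unpack the word using the same shifts and masks as parse_ppp_data(). */
static struct ppp_resource decode_ppp_resource(uint64_t r)
{
	struct ppp_resource out;

	out.capped               = (r >> 6 * 8) & 0x01;
	out.weight               = (r >> 5 * 8) & 0xff;
	out.unallocated_weight   = (r >> 4 * 8) & 0xff;
	out.active_procs_in_pool = (r >> 2 * 8) & 0xffff;
	out.active_system_procs  = r & 0xffff;
	return out;
}
```

Naming the fields once makes it harder to pair an output key with the wrong shift when the printing code is later rearranged.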
* [PATCH 03/16 v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
2008-07-04 12:51 ` [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
2008-07-04 12:51 ` [PATCH 02/16 v3] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
@ 2008-07-04 12:51 ` Robert Jennings
2008-07-22 18:55 ` Nathan Fontenot
2008-07-04 12:52 ` [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
` (14 subsequent siblings)
17 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:51 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Nathan Fontenot <nfont@austin.ibm.com>
Update /proc/ppc64/lparcfg to enable displaying of Cooperative Memory
Overcommitment statistics as reported by the H_GET_MPP hcall. This also
updates the lparcfg interface to allow setting memory entitlement and
weight.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 119 ++++++++++++++++++++++++++++++++++++++++++
include/asm-powerpc/hvcall.h | 18 ++++++
2 files changed, 136 insertions(+), 1 deletion(-)
Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -129,6 +129,35 @@ static int iseries_lparcfg_data(struct s
/*
* Methods used to fetch LPAR data when running on a pSeries platform.
*/
+/**
+ * h_get_mpp
+ * H_GET_MPP hcall returns info in 7 parms
+ */
+int h_get_mpp(struct hvcall_mpp_data *mpp_data)
+{
+ int rc;
+ unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+
+ rc = plpar_hcall9(H_GET_MPP, retbuf);
+
+ mpp_data->entitled_mem = retbuf[0];
+ mpp_data->mapped_mem = retbuf[1];
+
+ mpp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+ mpp_data->pool_num = retbuf[2] & 0xffff;
+
+ mpp_data->mem_weight = (retbuf[3] >> 7 * 8) & 0xff;
+ mpp_data->unallocated_mem_weight = (retbuf[3] >> 6 * 8) & 0xff;
+ mpp_data->unallocated_entitlement = retbuf[3] & 0xffffffffffff;
+
+ mpp_data->pool_size = retbuf[4];
+ mpp_data->loan_request = retbuf[5];
+ mpp_data->backing_mem = retbuf[6];
+
+ return rc;
+}
+EXPORT_SYMBOL(h_get_mpp);
+
/*
* H_GET_PPP hcall returns info in 4 parms.
* entitled_capacity,unallocated_capacity,
@@ -226,6 +255,44 @@ static void parse_ppp_data(struct seq_fi
seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
}

+/**
+ * parse_mpp_data
+ * Parse out data returned from h_get_mpp
+ */
+static void parse_mpp_data(struct seq_file *m)
+{
+ struct hvcall_mpp_data mpp_data;
+ int rc;
+
+ rc = h_get_mpp(&mpp_data);
+ if (rc)
+ return;
+
+ seq_printf(m, "entitled_memory=%ld\n", mpp_data.entitled_mem);
+
+ if (mpp_data.mapped_mem != -1)
+ seq_printf(m, "mapped_entitled_memory=%ld\n",
+ mpp_data.mapped_mem);
+
+ seq_printf(m, "entitled_memory_group_number=%d\n", mpp_data.group_num);
+ seq_printf(m, "entitled_memory_pool_number=%d\n", mpp_data.pool_num);
+
+ seq_printf(m, "entitled_memory_weight=%d\n", mpp_data.mem_weight);
+ seq_printf(m, "unallocated_entitled_memory_weight=%d\n",
+ mpp_data.unallocated_mem_weight);
+ seq_printf(m, "unallocated_io_mapping_entitlement=%ld\n",
+ mpp_data.unallocated_entitlement);
+
+ if (mpp_data.pool_size != -1)
+ seq_printf(m, "entitled_memory_pool_size=%ld bytes\n",
+ mpp_data.pool_size);
+
+ seq_printf(m, "entitled_memory_loan_request=%ld\n",
+ mpp_data.loan_request);
+
+ seq_printf(m, "backing_memory=%ld bytes\n", mpp_data.backing_mem);
+}
+
#define SPLPAR_CHARACTERISTICS_TOKEN 20
#define SPLPAR_MAXLENGTH 1026*(sizeof(char))

@@ -353,6 +420,7 @@ static int pseries_lparcfg_data(struct s
/* this call handles the ibm,get-system-parameter contents */
parse_system_parameter_string(m);
parse_ppp_data(m);
+ parse_mpp_data(m);

seq_printf(m, "purr=%ld\n", get_purr());
} else { /* non SPLPAR case */
@@ -416,6 +484,43 @@ static ssize_t update_ppp(u64 *entitleme
return retval;
}

+/**
+ * update_mpp
+ *
+ * Update the memory entitlement and weight for the partition. Caller must
+ * specify either a new entitlement or weight, not both, to be updated
+ * since the h_set_mpp call takes both entitlement and weight as parameters.
+ */
+static ssize_t update_mpp(u64 *entitlement, u8 *weight)
+{
+ struct hvcall_mpp_data mpp_data;
+ u64 new_entitled;
+ u8 new_weight;
+ ssize_t rc;
+
+ rc = h_get_mpp(&mpp_data);
+ if (rc)
+ return rc;
+
+ if (entitlement) {
+ new_weight = mpp_data.mem_weight;
+ new_entitled = *entitlement;
+ } else if (weight) {
+ new_weight = *weight;
+ new_entitled = mpp_data.entitled_mem;
+ } else
+ return -EINVAL;
+
+ pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+ __FUNCTION__, mpp_data.entitled_mem, mpp_data.mem_weight);
+
+ pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+ __FUNCTION__, new_entitled, new_weight);
+
+ rc = plpar_hcall_norets(H_SET_MPP, new_entitled, new_weight);
+ return rc;
+}
+
/*
* Interface for changing system parameters (variable capacity weight
* and entitled capacity). Format of input is "param_name=value";
@@ -469,6 +574,20 @@ static ssize_t lparcfg_write(struct file
goto out;

retval = update_ppp(NULL, new_weight_ptr);
+ } else if (!strcmp(kbuf, "entitled_memory")) {
+ char *endp;
+ *new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
+ if (endp == tmp)
+ goto out;
+
+ retval = update_mpp(new_entitled_ptr, NULL);
+ } else if (!strcmp(kbuf, "entitled_memory_weight")) {
+ char *endp;
+ *new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
+ if (endp == tmp)
+ goto out;
+
+ retval = update_mpp(NULL, new_weight_ptr);
} else
goto out;

Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -210,7 +210,9 @@
#define H_JOIN 0x298
#define H_VASI_STATE 0x2A4
#define H_ENABLE_CRQ 0x2B0
-#define MAX_HCALL_OPCODE H_ENABLE_CRQ
+#define H_SET_MPP 0x2D0
+#define H_GET_MPP 0x2D4
+#define MAX_HCALL_OPCODE H_GET_MPP

#ifndef __ASSEMBLY__

@@ -270,6 +272,20 @@ struct hcall_stats {
};
#define HCALL_STAT_ARRAY_SIZE ((MAX_HCALL_OPCODE >> 2) + 1)

+struct hvcall_mpp_data {
+ unsigned long entitled_mem;
+ unsigned long mapped_mem;
+ unsigned short group_num;
+ unsigned short pool_num;
+ unsigned char mem_weight;
+ unsigned char unallocated_mem_weight;
+ unsigned long unallocated_entitlement; /* value in bytes */
+ unsigned long pool_size;
+ signed long loan_request;
+ unsigned long backing_mem;
+};
+
+int h_get_mpp(struct hvcall_mpp_data *);
#endif /* __ASSEMBLY__ */
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_HVCALL_H */
^ permalink raw reply [flat|nested] 38+ messages in thread
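The write path above accepts input of the form "param_name=value" and rejects a value with no digits via the `endp == tmp` check. A user-space sketch of that parsing step, using strtoul() in place of the kernel's simple_strtoul(), might look like this; the function name is hypothetical.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Split "param_name=value" in place and parse the decimal value.
 * Returns 0 on success, -1 when there is no '=' or no digits follow it.
 * On success *name points at the (now NUL-terminated) parameter name.
 */
static int parse_lparcfg_setting(char *buf, char **name, unsigned long *value)
{
	char *eq = strchr(buf, '=');
	char *endp;

	if (!eq)
		return -1;
	*eq = '\0';

	*value = strtoul(eq + 1, &endp, 10);
	if (endp == eq + 1)
		return -1;	/* nothing parsed: same idea as "endp == tmp" */

	*name = buf;
	return 0;
}
```

The returned name would then be compared against the strings the patch recognizes, such as "entitled_memory" and "entitled_memory_weight".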
* [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (2 preceding siblings ...)
2008-07-04 12:51 ` [PATCH 03/16 v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
@ 2008-07-04 12:52 ` Robert Jennings
2008-07-22 5:54 ` Paul Mackerras
2008-07-22 18:56 ` Nathan Fontenot
2008-07-04 12:52 ` [PATCH 05/16 v3] powerpc: Enable CMO feature during platform setup Robert Jennings
` (13 subsequent siblings)
17 siblings, 2 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:52 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Nathan Fontenot <nfont@austin.ibm.com>
Split the retrieval of processor entitlement data returned in the H_GET_PPP
hcall into its own helper routine.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 80 ++++++++++++++++++++++++------------------
1 file changed, 45 insertions(+), 35 deletions(-)
Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -158,6 +158,18 @@ int h_get_mpp(struct hvcall_mpp_data *mp
}
EXPORT_SYMBOL(h_get_mpp);

+struct hvcall_ppp_data {
+ u64 entitlement;
+ u64 unallocated_entitlement;
+ u16 group_num;
+ u16 pool_num;
+ u8 capped;
+ u8 weight;
+ u8 unallocated_weight;
+ u16 active_procs_in_pool;
+ u16 active_system_procs;
+};
+
/*
* H_GET_PPP hcall returns info in 4 parms.
* entitled_capacity,unallocated_capacity,
@@ -178,20 +190,24 @@ EXPORT_SYMBOL(h_get_mpp);
* XXXX - Active processors in Physical Processor Pool.
* XXXX - Processors active on platform.
*/
-static unsigned int h_get_ppp(unsigned long *entitled,
- unsigned long *unallocated,
- unsigned long *aggregation,
- unsigned long *resource)
+static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data)
{
unsigned long rc;
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];

rc = plpar_hcall(H_GET_PPP, retbuf);

- *entitled = retbuf[0];
- *unallocated = retbuf[1];
- *aggregation = retbuf[2];
- *resource = retbuf[3];
+ ppp_data->entitlement = retbuf[0];
+ ppp_data->unallocated_entitlement = retbuf[1];
+
+ ppp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+ ppp_data->pool_num = retbuf[2] & 0xffff;
+
+ ppp_data->capped = (retbuf[3] >> 6 * 8) & 0x01;
+ ppp_data->weight = (retbuf[3] >> 5 * 8) & 0xff;
+ ppp_data->unallocated_weight = (retbuf[3] >> 4 * 8) & 0xff;
+ ppp_data->active_procs_in_pool = (retbuf[3] >> 2 * 8) & 0xffff;
+ ppp_data->active_system_procs = retbuf[3] & 0xffff;

return rc;
}
@@ -216,29 +232,27 @@ static unsigned h_pic(unsigned long *poo
*/
static void parse_ppp_data(struct seq_file *m)
{
- unsigned long h_entitled, h_unallocated;
- unsigned long h_aggregation, h_resource;
+ struct hvcall_ppp_data ppp_data;
int rc;

- rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
- &h_resource);
+ rc = h_get_ppp(&ppp_data);
if (rc)
return;

- seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
- seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
- seq_printf(m, "system_active_processors=%ld\n",
- (h_resource >> 0 * 8) & 0xffff);
+ seq_printf(m, "partition_entitled_capacity=%ld\n",
+ ppp_data.entitlement);
+ seq_printf(m, "group=%d\n", ppp_data.group_num);
+ seq_printf(m, "system_active_processors=%d\n",
+ ppp_data.active_system_procs);

/* pool related entries are apropriate for shared configs */
if (lppaca[0].shared_proc) {
unsigned long pool_idle_time, pool_procs;

- seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+ seq_printf(m, "pool=%d\n", ppp_data.pool_num);

/* report pool_capacity in percentage */
- seq_printf(m, "pool_capacity=%ld\n",
- ((h_resource >> 2 * 8) & 0xffff) * 100);
+ seq_printf(m, "pool_capacity=%d\n",
+ ppp_data.active_procs_in_pool * 100);

rc = h_pic(&pool_idle_time, &pool_procs);
if (! rc) {
@@ -247,12 +261,12 @@ static void parse_ppp_data(struct seq_fi
}
}

- seq_printf(m, "unallocated_capacity_weight=%ld\n",
- (h_resource >> 4 * 8) & 0xFF);
-
- seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
- seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
- seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
+ seq_printf(m, "unallocated_capacity_weight=%d\n",
+ ppp_data.unallocated_weight);
+ seq_printf(m, "capacity_weight=%d\n", ppp_data.weight);
+ seq_printf(m, "capped=%d\n", ppp_data.capped);
+ seq_printf(m, "unallocated_capacity=%ld\n",
+ ppp_data.unallocated_entitlement);
}

/**
@@ -451,31 +465,27 @@ static int pseries_lparcfg_data(struct s

static ssize_t update_ppp(u64 *entitlement, u8 *weight)
{
- unsigned long current_entitled;
- unsigned long dummy;
- unsigned long resource;
- u8 current_weight, new_weight;
+ struct hvcall_ppp_data ppp_data;
+ u8 new_weight;
u64 new_entitled;
ssize_t retval;

/* Get our current parameters */
- retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+ retval = h_get_ppp(&ppp_data);
if (retval)
return retval;

- current_weight = (resource >> 5 * 8) & 0xFF;
-
if (entitlement) {
- new_weight = current_weight;
+ new_weight = ppp_data.weight;
new_entitled = *entitlement;
} else if (weight) {
new_weight = *weight;
- new_entitled = current_entitled;
+ new_entitled = ppp_data.entitlement;
} else
return -EINVAL;

pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
- __FUNCTION__, current_entitled, current_weight);
+ __FUNCTION__, ppp_data.entitlement, ppp_data.weight);

pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
__FUNCTION__, new_entitled, new_weight);
^ permalink raw reply [flat|nested] 38+ messages in thread
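update_ppp() and update_mpp() share one pattern: read the current pair, override exactly one member from a non-NULL pointer, and pass both values to the set hcall. A stand-alone sketch of that selection step, with hypothetical type and function names and no hypervisor call, could read:

```c
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative pair of values that a set hcall takes together. */
struct ppp_pair {
	uint64_t entitlement;
	uint8_t  weight;
};

/*
 * Build the pair to set from the current values plus one override.
 * If both pointers are non-NULL the entitlement wins, mirroring the
 * if / else-if order in the patch. Returns -EINVAL if neither is given.
 */
static int select_new_ppp(const struct ppp_pair *cur,
			  const uint64_t *entitlement, const uint8_t *weight,
			  struct ppp_pair *out)
{
	if (entitlement) {
		out->entitlement = *entitlement;
		out->weight = cur->weight;
	} else if (weight) {
		out->entitlement = cur->entitlement;
		out->weight = *weight;
	} else {
		return -EINVAL;
	}
	return 0;
}
```

Keeping the untouched member at its current value is what lets a caller change one parameter even though the hcall always takes both.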
* [PATCH 05/16 v3] powerpc: Enable CMO feature during platform setup
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (3 preceding siblings ...)
2008-07-04 12:52 ` [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
@ 2008-07-04 12:52 ` Robert Jennings
2008-07-04 12:52 ` Robert Jennings
` (12 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:52 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Robert Jennings <rcj@linux.vnet.ibm.com>
For Cooperative Memory Overcommitment (CMO), set the FW_FEATURE_CMO
flag in powerpc_firmware_features from the rtas ibm,get-system-parameters
table prior to calling iommu_init_early_pSeries.
With this, any CMO specific functionality can be controlled by checking:
firmware_has_feature(FW_FEATURE_CMO)
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/platforms/pseries/setup.c | 71 +++++++++++++++++++++++++++++++++
include/asm-powerpc/firmware.h | 3 +
2 files changed, 73 insertions(+), 1 deletion(-)
Index: b/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -314,6 +314,76 @@ static int pseries_set_xdabr(unsigned lo
H_DABRX_KERNEL | H_DABRX_USER);
}

+#define CMO_CHARACTERISTICS_TOKEN 44
+#define CMO_MAXLENGTH 1026
+
+/**
+ * fw_cmo_feature_init - FW_FEATURE_CMO is not stored in ibm,hypertas-functions,
+ * handle that here. (Stolen from parse_system_parameter_string)
+ */
+void pSeries_cmo_feature_init(void)
+{
+ char *ptr, *key, *value, *end;
+ int call_status;
+ int PrPSP = -1;
+ int SecPSP = -1;
+
+ pr_debug(" -> fw_cmo_feature_init()\n");
+ spin_lock(&rtas_data_buf_lock);
+ memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
+ call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+ NULL,
+ CMO_CHARACTERISTICS_TOKEN,
+ __pa(rtas_data_buf),
+ RTAS_DATA_BUF_SIZE);
+
+ if (call_status != 0) {
+ spin_unlock(&rtas_data_buf_lock);
+ pr_debug("CMO not available\n");
+ pr_debug(" <- fw_cmo_feature_init()\n");
+ return;
+ }
+
+ end = rtas_data_buf + CMO_MAXLENGTH - 2;
+ ptr = rtas_data_buf + 2; /* step over strlen value */
+ key = value = ptr;
+
+ while (*ptr && (ptr <= end)) {
+ /* Separate the key and value by replacing '=' with '\0' and
+ * point the value at the string after the '='
+ */
+ if (ptr[0] == '=') {
+ ptr[0] = '\0';
+ value = ptr + 1;
+ } else if (ptr[0] == '\0' || ptr[0] == ',') {
+ /* Terminate the string containing the key/value pair */
+ ptr[0] = '\0';
+
+ if (key == value) {
+ pr_debug("Malformed key/value pair\n");
+ /* Never found a '=', end processing */
+ break;
+ }
+
+ if (0 == strcmp(key, "PrPSP"))
+ PrPSP = simple_strtoul(value, NULL, 10);
+ else if (0 == strcmp(key, "SecPSP"))
+ SecPSP = simple_strtoul(value, NULL, 10);
+ value = key = ptr + 1;
+ }
+ ptr++;
+ }
+
+ if (PrPSP != -1 || SecPSP != -1) {
+ pr_info("CMO enabled\n");
+ pr_debug("CMO enabled, PrPSP=%d, SecPSP=%d\n", PrPSP, SecPSP);
+ powerpc_firmware_features |= FW_FEATURE_CMO;
+ } else
+ pr_debug("CMO not enabled, PrPSP=%d, SecPSP=%d\n", PrPSP, SecPSP);
+ spin_unlock(&rtas_data_buf_lock);
+ pr_debug(" <- fw_cmo_feature_init()\n");
+}
+
/*
* Early initialization. Relocation is on but do not reference unbolted pages
*/
@@ -329,6 +399,7 @@ static void __init pSeries_init_early(vo
else if (firmware_has_feature(FW_FEATURE_XDABR))
ppc_md.set_dabr = pseries_set_xdabr;

+ pSeries_cmo_feature_init();
iommu_init_early_pSeries();

pr_debug(" <- pSeries_init_early()\n");
Index: b/include/asm-powerpc/firmware.h
===================================================================
--- a/include/asm-powerpc/firmware.h
+++ b/include/asm-powerpc/firmware.h
@@ -45,6 +45,7 @@
#define FW_FEATURE_PS3_LV1 ASM_CONST(0x0000000000800000)
#define FW_FEATURE_BEAT ASM_CONST(0x0000000001000000)
#define FW_FEATURE_BULK_REMOVE ASM_CONST(0x0000000002000000)
+#define FW_FEATURE_CMO ASM_CONST(0x0000000004000000)

#ifndef __ASSEMBLY__

@@ -57,7 +58,7 @@ enum {
FW_FEATURE_MIGRATE | FW_FEATURE_PERFMON | FW_FEATURE_CRQ |
FW_FEATURE_VIO | FW_FEATURE_RDMA | FW_FEATURE_LLAN |
FW_FEATURE_BULK | FW_FEATURE_XDABR | FW_FEATURE_MULTITCE |
- FW_FEATURE_SPLPAR | FW_FEATURE_LPAR,
+ FW_FEATURE_SPLPAR | FW_FEATURE_LPAR | FW_FEATURE_CMO,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_ISERIES_POSSIBLE = FW_FEATURE_ISERIES | FW_FEATURE_LPAR,
FW_FEATURE_ISERIES_ALWAYS = FW_FEATURE_ISERIES | FW_FEATURE_LPAR,
^ permalink raw reply [flat|nested] 38+ messages in thread
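The parsing loop in pSeries_cmo_feature_init() walks a comma-separated list of "key=value" pairs and records PrPSP and SecPSP. The sketch below shows the same idea in user space with strchr() instead of a character-by-character scan. The function name is hypothetical, and the exact layout of the firmware string is assumed only from what the loop above handles.

```c
#include <stdlib.h>
#include <string.h>

/*
 * Scan a comma-separated "key=value" string for PrPSP and SecPSP.
 * Keys that are absent leave their output at -1. Returns 1 if either
 * key was found, which is the condition the patch uses to set
 * FW_FEATURE_CMO. Modifies buf in place, as the kernel routine does.
 */
static int cmo_parse_params(char *buf, int *prpsp, int *secpsp)
{
	char *pair, *next;

	*prpsp = -1;
	*secpsp = -1;

	for (pair = buf; pair && *pair; pair = next) {
		char *eq;

		/* Cut the current pair off from the rest of the list. */
		next = strchr(pair, ',');
		if (next)
			*next++ = '\0';

		eq = strchr(pair, '=');
		if (!eq)
			break;	/* malformed pair: stop, as the patch does */
		*eq = '\0';

		if (strcmp(pair, "PrPSP") == 0)
			*prpsp = (int)strtoul(eq + 1, NULL, 10);
		else if (strcmp(pair, "SecPSP") == 0)
			*secpsp = (int)strtoul(eq + 1, NULL, 10);
	}

	return *prpsp != -1 || *secpsp != -1;
}
```

Unlike the kernel routine, this sketch does not skip a two-byte length prefix or bound the scan by CMO_MAXLENGTH; it assumes a plain NUL-terminated string.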
* [PATCH 05/16 v3] powerpc: Enable CMO feature during platform setup
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (4 preceding siblings ...)
2008-07-04 12:52 ` [PATCH 05/16 v3] powerpc: Enable CMO feature during platform setup Robert Jennings
@ 2008-07-04 12:52 ` Robert Jennings
2008-07-04 12:52 ` [PATCH 06/16 v3] powerpc: Utilities to set firmware page state Robert Jennings
` (11 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:52 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Robert Jennings <rcj@linux.vnet.ibm.com>
For Cooperative Memory Overcommitment (CMO), set the FW_FEATURE_CMO
flag in powerpc_firmware_features based on the result of the RTAS
ibm,get-system-parameter call, prior to calling iommu_init_early_pSeries.
With this, any CMO-specific functionality can be controlled by checking:
firmware_has_feature(FW_FEATURE_CMO)
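The detection step described above can be sketched in user space as follows. This is only an illustration, not the kernel code: `lookup_param`, `detect_cmo` and `FEATURE_CMO` are hypothetical names, and the parser assumes the parameter text is a plain comma-separated `key=value` string. It mirrors the patch in treating CMO as present when either `PrPSP` or `SecPSP` is reported.

```c
#include <stdlib.h>
#include <string.h>

/* Illustrative flag; same bit as FW_FEATURE_CMO in the patch. */
#define FEATURE_CMO 0x4000000UL

/*
 * Hypothetical helper: find "key" in a comma-separated "k=v,k=v" string.
 * Returns the decimal value, or -1 if the key is absent.
 */
static long lookup_param(const char *params, const char *key)
{
    size_t klen = strlen(key);
    const char *p = params;

    while (*p) {
        const char *end = strchr(p, ',');
        size_t len = end ? (size_t)(end - p) : strlen(p);

        /* A match needs the full key followed directly by '='. */
        if (len > klen && strncmp(p, key, klen) == 0 && p[klen] == '=')
            return strtol(p + klen + 1, NULL, 10);

        p += len;
        if (*p == ',')
            p++;
    }
    return -1;
}

/* Set the CMO bit when either key is reported, as the patch does. */
static unsigned long detect_cmo(const char *params, unsigned long features)
{
    if (lookup_param(params, "PrPSP") != -1 ||
        lookup_param(params, "SecPSP") != -1)
        features |= FEATURE_CMO;
    return features;
}
```

Callers would then test the returned mask for `FEATURE_CMO`, in the same way the kernel code tests firmware_has_feature(FW_FEATURE_CMO).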
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/platforms/pseries/setup.c | 71 +++++++++++++++++++++++++++++++++
include/asm-powerpc/firmware.h | 3 +
2 files changed, 73 insertions(+), 1 deletion(-)
Index: b/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -314,6 +314,76 @@ static int pseries_set_xdabr(unsigned lo
H_DABRX_KERNEL | H_DABRX_USER);
}

+#define CMO_CHARACTERISTICS_TOKEN 44
+#define CMO_MAXLENGTH 1026
+
+/**
+ * fw_cmo_feature_init - FW_FEATURE_CMO is not stored in ibm,hypertas-functions,
+ * handle that here. (Stolen from parse_system_parameter_string)
+ */
+void pSeries_cmo_feature_init(void)
+{
+ char *ptr, *key, *value, *end;
+ int call_status;
+ int PrPSP = -1;
+ int SecPSP = -1;
+
+ pr_debug(" -> fw_cmo_feature_init()\n");
+ spin_lock(&rtas_data_buf_lock);
+ memset(rtas_data_buf, 0, RTAS_DATA_BUF_SIZE);
+ call_status = rtas_call(rtas_token("ibm,get-system-parameter"), 3, 1,
+ NULL,
+ CMO_CHARACTERISTICS_TOKEN,
+ __pa(rtas_data_buf),
+ RTAS_DATA_BUF_SIZE);
+
+ if (call_status != 0) {
+ spin_unlock(&rtas_data_buf_lock);
+ pr_debug("CMO not available\n");
+ pr_debug(" <- fw_cmo_feature_init()\n");
+ return;
+ }
+
+ end = rtas_data_buf + CMO_MAXLENGTH - 2;
+ ptr = rtas_data_buf + 2; /* step over strlen value */
+ key = value = ptr;
+
+ while (*ptr && (ptr <= end)) {
+ /* Separate the key and value by replacing '=' with '\0' and
+ * point the value at the string after the '='
+ */
+ if (ptr[0] == '=') {
+ ptr[0] = '\0';
+ value = ptr + 1;
+ } else if (ptr[0] == '\0' || ptr[0] == ',') {
+ /* Terminate the string containing the key/value pair */
+ ptr[0] = '\0';
+
+ if (key == value) {
+ pr_debug("Malformed key/value pair\n");
+ /* Never found a '=', end processing */
+ break;
+ }
+
+ if (0 == strcmp(key, "PrPSP"))
+ PrPSP = simple_strtoul(value, NULL, 10);
+ else if (0 == strcmp(key, "SecPSP"))
+ SecPSP = simple_strtoul(value, NULL, 10);
+ value = key = ptr + 1;
+ }
+ ptr++;
+ }
+
+ if (PrPSP != -1 || SecPSP != -1) {
+ pr_info("CMO enabled\n");
+ pr_debug("CMO enabled, PrPSP=%d, SecPSP=%d\n", PrPSP, SecPSP);
+ powerpc_firmware_features |= FW_FEATURE_CMO;
+ } else
+ pr_debug("CMO not enabled, PrPSP=%d, SecPSP=%d\n", PrPSP, SecPSP);
+ spin_unlock(&rtas_data_buf_lock);
+ pr_debug(" <- fw_cmo_feature_init()\n");
+}
+
/*
* Early initialization. Relocation is on but do not reference unbolted pages
*/
@@ -329,6 +399,7 @@ static void __init pSeries_init_early(vo
else if (firmware_has_feature(FW_FEATURE_XDABR))
ppc_md.set_dabr = pseries_set_xdabr;

+ pSeries_cmo_feature_init();
iommu_init_early_pSeries();

pr_debug(" <- pSeries_init_early()\n");
Index: b/include/asm-powerpc/firmware.h
===================================================================
--- a/include/asm-powerpc/firmware.h
+++ b/include/asm-powerpc/firmware.h
@@ -45,6 +45,7 @@
#define FW_FEATURE_PS3_LV1 ASM_CONST(0x0000000000800000)
#define FW_FEATURE_BEAT ASM_CONST(0x0000000001000000)
#define FW_FEATURE_BULK_REMOVE ASM_CONST(0x0000000002000000)
+#define FW_FEATURE_CMO ASM_CONST(0x0000000004000000)

#ifndef __ASSEMBLY__

@@ -57,7 +58,7 @@ enum {
FW_FEATURE_MIGRATE | FW_FEATURE_PERFMON | FW_FEATURE_CRQ |
FW_FEATURE_VIO | FW_FEATURE_RDMA | FW_FEATURE_LLAN |
FW_FEATURE_BULK | FW_FEATURE_XDABR | FW_FEATURE_MULTITCE |
- FW_FEATURE_SPLPAR | FW_FEATURE_LPAR,
+ FW_FEATURE_SPLPAR | FW_FEATURE_LPAR | FW_FEATURE_CMO,
FW_FEATURE_PSERIES_ALWAYS = 0,
FW_FEATURE_ISERIES_POSSIBLE = FW_FEATURE_ISERIES | FW_FEATURE_LPAR,
FW_FEATURE_ISERIES_ALWAYS = FW_FEATURE_ISERIES | FW_FEATURE_LPAR,
* [PATCH 06/16 v3] powerpc: Utilities to set firmware page state
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (5 preceding siblings ...)
2008-07-04 12:52 ` Robert Jennings
@ 2008-07-04 12:52 ` Robert Jennings
2008-07-04 12:53 ` Robert Jennings
` (10 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:52 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Brian King <brking@linux.vnet.ibm.com>
Newer versions of firmware support page states, which are used by the
collaborative memory manager (future patch) to "loan" pages to the
hypervisor for use by other partitions.
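As an illustration of how the page-state flags in this patch compose, here is a small user-space sketch. The bit positions are copied from the hvcall.h hunk; `decode_state` and the `page_state` enum are hypothetical and only show which request a given flags word encodes. They are not a description of how firmware interprets the bits.

```c
#include <stdint.h>

/* Same bit positions as the patch; bits are numbered from the MSB (bit 0). */
#define PAGE_STATE_CHANGE (1ULL << (63 - 28))
#define PAGE_UNUSED       ((1ULL << (63 - 29)) | (1ULL << (63 - 30)))
#define PAGE_SET_UNUSED   (PAGE_STATE_CHANGE | PAGE_UNUSED)
#define PAGE_SET_LOANED   (PAGE_SET_UNUSED | (1ULL << (63 - 31)))
#define PAGE_SET_ACTIVE   PAGE_STATE_CHANGE

enum page_state { PAGE_ACTIVE, PAGE_UNUSED_STATE, PAGE_LOANED };

/*
 * Hypothetical decoder: report which state a flags word requests.
 * Returns 0 and fills *out, or -1 if no state change is requested.
 */
static int decode_state(uint64_t flags, enum page_state *out)
{
    if (!(flags & PAGE_STATE_CHANGE))
        return -1;
    if ((flags & PAGE_SET_LOANED) == PAGE_SET_LOANED)
        *out = PAGE_LOANED;          /* all four bits set */
    else if ((flags & PAGE_SET_UNUSED) == PAGE_SET_UNUSED)
        *out = PAGE_UNUSED_STATE;    /* change + both "unused" bits */
    else
        *out = PAGE_ACTIVE;          /* change bit alone */
    return 0;
}
```

The "loaned" request is a superset of the "unused" request, so the decoder has to test for it first.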
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/platforms/pseries/plpar_wrappers.h | 10 ++++++++++
include/asm-powerpc/hvcall.h | 5 +++++
2 files changed, 15 insertions(+)
Index: b/arch/powerpc/platforms/pseries/plpar_wrappers.h
===================================================================
--- a/arch/powerpc/platforms/pseries/plpar_wrappers.h
+++ b/arch/powerpc/platforms/pseries/plpar_wrappers.h
@@ -42,6 +42,16 @@ static inline long register_slb_shadow(u
return vpa_call(0x3, cpu, vpa);
}

+static inline long plpar_page_set_loaned(unsigned long vpa)
+{
+ return plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_LOANED, vpa, 0);
+}
+
+static inline long plpar_page_set_active(unsigned long vpa)
+{
+ return plpar_hcall_norets(H_PAGE_INIT, H_PAGE_SET_ACTIVE, vpa, 0);
+}
+
extern void vpa_init(int cpu);

static inline long plpar_pte_enter(unsigned long flags,
Index: b/include/asm-powerpc/hvcall.h
===================================================================
--- a/include/asm-powerpc/hvcall.h
+++ b/include/asm-powerpc/hvcall.h
@@ -92,6 +92,11 @@
#define H_EXACT (1UL<<(63-24)) /* Use exact PTE or return H_PTEG_FULL */
#define H_R_XLATE (1UL<<(63-25)) /* include a valid logical page num in the pte if the valid bit is set */
#define H_READ_4 (1UL<<(63-26)) /* Return 4 PTEs */
+#define H_PAGE_STATE_CHANGE (1UL<<(63-28))
+#define H_PAGE_UNUSED ((1UL<<(63-29)) | (1UL<<(63-30)))
+#define H_PAGE_SET_UNUSED (H_PAGE_STATE_CHANGE | H_PAGE_UNUSED)
+#define H_PAGE_SET_LOANED (H_PAGE_SET_UNUSED | (1UL<<(63-31)))
+#define H_PAGE_SET_ACTIVE H_PAGE_STATE_CHANGE
#define H_AVPN (1UL<<(63-32)) /* An avpn is provided as a sanity test */
#define H_ANDCOND (1UL<<(63-33))
#define H_ICACHE_INVALIDATE (1UL<<(63-40)) /* icbi, etc. (ignored for IO pages) */
* [PATCH 07/16 v3] powerpc: Add collaborative memory manager
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (7 preceding siblings ...)
2008-07-04 12:53 ` Robert Jennings
@ 2008-07-04 12:53 ` Robert Jennings
2008-07-22 4:53 ` Paul Mackerras
2008-07-04 12:54 ` [PATCH 08/16 v3] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
` (8 subsequent siblings)
17 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:53 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Brian King <brking@linux.vnet.ibm.com>
Adds a collaborative memory manager, which acts as a simple balloon driver
for System p machines that support cooperative memory overcommitment
(CMO).
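The balloon target calculation performed by the driver (see cmm_get_mpp in the diff below) can be sketched in user space as follows. This is an approximation under stated assumptions: `loan_target`, `struct balloon` and `SKETCH_PAGE_SIZE` are hypothetical names, the page size is assumed to be 4 KiB, and the sketch uses signed arithmetic and an explicit clamp, whereas the patch works on unsigned counters.

```c
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096   /* assumption: 4 KiB pages */

struct balloon {
    uint64_t totalram_pages;    /* pages the OS currently owns */
    uint64_t loaned_pages;      /* pages currently loaned to the hypervisor */
    uint64_t oom_freed_pages;   /* pages taken back because of OOM events */
};

/*
 * Compute how many pages should be on loan, given the hypervisor's
 * loan request (a signed delta in bytes) and a floor of min_mem_mb
 * that must stay with the OS.
 */
static uint64_t loan_target(const struct balloon *b,
                            int64_t loan_request_bytes,
                            uint64_t min_mem_mb)
{
    int64_t target = (int64_t)b->loaned_pages
                   + loan_request_bytes / SKETCH_PAGE_SIZE
                   - (int64_t)b->oom_freed_pages;
    uint64_t all_pages = b->totalram_pages + b->loaned_pages;
    uint64_t min_pages = min_mem_mb * 1024 * 1024 / SKETCH_PAGE_SIZE;
    uint64_t max_loan = all_pages > min_pages ? all_pages - min_pages : 0;

    if (target < 0)
        target = 0;                     /* nothing left to give back */
    if ((uint64_t)target > max_loan)
        target = (int64_t)max_loan;     /* never balloon below the floor */
    return (uint64_t)target;
}
```

A polling thread would compare this target with the current loan count and allocate or free pages to close the gap, which is what cmm_thread does with cmm_alloc_pages and cmm_free_pages.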
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/platforms/pseries/Kconfig | 11 +
arch/powerpc/platforms/pseries/Makefile | 1 +
arch/powerpc/platforms/pseries/cmm.c | 468 ++++++++++++++++++++++++++++++++
3 files changed, 480 insertions(+)
Index: b/arch/powerpc/platforms/pseries/Kconfig
===================================================================
--- a/arch/powerpc/platforms/pseries/Kconfig
+++ b/arch/powerpc/platforms/pseries/Kconfig
@@ -39,3 +39,14 @@ config PPC_PSERIES_DEBUG
depends on PPC_PSERIES && PPC_EARLY_DEBUG
bool "Enable extra debug logging in platforms/pseries"
default y
+
+config CMM
+ tristate "Collaborative memory management"
+ depends on PPC_PSERIES
+ help
+ Select this option, if you want to enable the kernel interface
+ to reduce the memory size of the system. This is accomplished
+ by allocating pages of memory and put them "on hold". This only
+ makes sense for a system running in an LPAR where the unused pages
+ will be reused for other LPARs. The interface allows firmware to
+ balance memory across many LPARs.
Index: b/arch/powerpc/platforms/pseries/Makefile
===================================================================
--- a/arch/powerpc/platforms/pseries/Makefile
+++ b/arch/powerpc/platforms/pseries/Makefile
@@ -24,3 +24,4 @@ obj-$(CONFIG_HVC_CONSOLE) += hvconsole.o
obj-$(CONFIG_HVCS) += hvcserver.o
obj-$(CONFIG_HCALL_STATS) += hvCall_inst.o
obj-$(CONFIG_PHYP_DUMP) += phyp_dump.o
+obj-$(CONFIG_CMM) += cmm.o
Index: b/arch/powerpc/platforms/pseries/cmm.c
===================================================================
--- /dev/null
+++ b/arch/powerpc/platforms/pseries/cmm.c
@@ -0,0 +1,468 @@
+/*
+ * Collaborative memory management interface.
+ *
+ * Copyright (C) 2008 IBM Corporation
+ * Author(s): Brian King (brking@linux.vnet.ibm.com),
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place, Suite 330, Boston, MA 02111-1307 USA
+ *
+ */
+
+#include <linux/ctype.h>
+#include <linux/delay.h>
+#include <linux/errno.h>
+#include <linux/fs.h>
+#include <linux/init.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+#include <linux/oom.h>
+#include <linux/sched.h>
+#include <linux/stringify.h>
+#include <linux/swap.h>
+#include <linux/sysdev.h>
+#include <asm/firmware.h>
+#include <asm/hvcall.h>
+#include <asm/mmu.h>
+#include <asm/pgalloc.h>
+#include <asm/uaccess.h>
+
+#include "plpar_wrappers.h"
+
+#define CMM_DRIVER_VERSION "1.0.0"
+#define CMM_DEFAULT_DELAY 1
+#define CMM_DEBUG 0
+#define CMM_DISABLE 0
+#define CMM_OOM_KB 1024
+#define CMM_MIN_MEM_MB 256
+#define KB2PAGES(_p) ((_p)>>(PAGE_SHIFT-10))
+#define PAGES2KB(_p) ((_p)<<(PAGE_SHIFT-10))
+
+static unsigned int delay = CMM_DEFAULT_DELAY;
+static unsigned int oom_kb = CMM_OOM_KB;
+static unsigned int cmm_debug = CMM_DEBUG;
+static unsigned int cmm_disabled = CMM_DISABLE;
+static unsigned long min_mem_mb = CMM_MIN_MEM_MB;
+static struct sys_device cmm_sysdev;
+
+MODULE_AUTHOR("Brian King <brking@linux.vnet.ibm.com>");
+MODULE_DESCRIPTION("IBM System p Collaborative Memory Manager");
+MODULE_LICENSE("GPL");
+MODULE_VERSION(CMM_DRIVER_VERSION);
+
+module_param_named(delay, delay, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(delay, "Delay (in seconds) between polls to query hypervisor paging requests. "
+ "[Default=" __stringify(CMM_DEFAULT_DELAY) "]");
+module_param_named(oom_kb, oom_kb, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(oom_kb, "Amount of memory in kb to free on OOM. "
+ "[Default=" __stringify(CMM_OOM_KB) "]");
+module_param_named(min_mem_mb, min_mem_mb, ulong, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(min_mem_mb, "Minimum amount of memory (in MB) to not balloon. "
+ "[Default=" __stringify(CMM_MIN_MEM_MB) "]");
+module_param_named(debug, cmm_debug, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(debug, "Enable module debugging logging. Set to 1 to enable. "
+ "[Default=" __stringify(CMM_DEBUG) "]");
+
+#define CMM_NR_PAGES ((PAGE_SIZE - sizeof(void *) - sizeof(unsigned long)) / sizeof(unsigned long))
+
+#define cmm_dbg(...) if (cmm_debug) { printk(KERN_INFO "cmm: "__VA_ARGS__); }
+
+struct cmm_page_array {
+ struct cmm_page_array *next;
+ unsigned long index;
+ unsigned long page[CMM_NR_PAGES];
+};
+
+static unsigned long loaned_pages;
+static unsigned long loaned_pages_target;
+static unsigned long oom_freed_pages;
+
+static struct cmm_page_array *cmm_page_list;
+static DEFINE_SPINLOCK(cmm_lock);
+
+static struct task_struct *cmm_thread_ptr;
+
+/**
+ * cmm_alloc_pages - Allocate pages and mark them as loaned
+ * @nr: number of pages to allocate
+ *
+ * Return value:
+ * number of pages requested to be allocated which were not
+ **/
+static long cmm_alloc_pages(long nr)
+{
+ struct cmm_page_array *pa, *npa;
+ unsigned long addr;
+ long rc;
+
+ cmm_dbg("Begin request for %ld pages\n", nr);
+
+ while (nr) {
+ addr = __get_free_page(GFP_NOIO | __GFP_NOWARN |
+ __GFP_NORETRY | __GFP_NOMEMALLOC);
+ if (!addr)
+ break;
+ spin_lock(&cmm_lock);
+ pa = cmm_page_list;
+ if (!pa || pa->index >= CMM_NR_PAGES) {
+ /* Need a new page for the page list. */
+ spin_unlock(&cmm_lock);
+ npa = (struct cmm_page_array *)__get_free_page(GFP_NOIO | __GFP_NOWARN |
+ __GFP_NORETRY | __GFP_NOMEMALLOC);
+ if (!npa) {
+ pr_info("%s: Can not allocate new page list\n", __FUNCTION__);
+ free_page(addr);
+ break;
+ }
+ spin_lock(&cmm_lock);
+ pa = cmm_page_list;
+
+ if (!pa || pa->index >= CMM_NR_PAGES) {
+ npa->next = pa;
+ npa->index = 0;
+ pa = npa;
+ cmm_page_list = pa;
+ } else
+ free_page((unsigned long) npa);
+ }
+
+ if ((rc = plpar_page_set_loaned(__pa(addr)))) {
+ pr_err("%s: Can not set page to loaned. rc=%ld\n", __FUNCTION__, rc);
+ spin_unlock(&cmm_lock);
+ free_page(addr);
+ break;
+ }
+
+ pa->page[pa->index++] = addr;
+ loaned_pages++;
+ totalram_pages--;
+ spin_unlock(&cmm_lock);
+ nr--;
+ }
+
+ cmm_dbg("End request with %ld pages unfulfilled\n", nr);
+ return nr;
+}
+
+/**
+ * cmm_free_pages - Free pages and mark them as active
+ * @nr: number of pages to free
+ *
+ * Return value:
+ * number of pages requested to be freed which were not
+ **/
+static long cmm_free_pages(long nr)
+{
+ struct cmm_page_array *pa;
+ unsigned long addr;
+
+ cmm_dbg("Begin free of %ld pages.\n", nr);
+ spin_lock(&cmm_lock);
+ pa = cmm_page_list;
+ while (nr) {
+ if (!pa || pa->index <= 0)
+ break;
+ addr = pa->page[--pa->index];
+
+ if (pa->index == 0) {
+ pa = pa->next;
+ free_page((unsigned long) cmm_page_list);
+ cmm_page_list = pa;
+ }
+
+ plpar_page_set_active(__pa(addr));
+ free_page(addr);
+ loaned_pages--;
+ nr--;
+ totalram_pages++;
+ }
+ spin_unlock(&cmm_lock);
+ cmm_dbg("End request with %ld pages unfulfilled\n", nr);
+ return nr;
+}
+
+/**
+ * cmm_oom_notify - OOM notifier
+ * @self: notifier block struct
+ * @dummy: not used
+ * @parm: returned - number of pages freed
+ *
+ * Return value:
+ * NOTIFY_OK
+ **/
+static int cmm_oom_notify(struct notifier_block *self,
+ unsigned long dummy, void *parm)
+{
+ unsigned long *freed = parm;
+ long nr = KB2PAGES(oom_kb);
+
+ cmm_dbg("OOM processing started\n");
+ nr = cmm_free_pages(nr);
+ loaned_pages_target = loaned_pages;
+ *freed += KB2PAGES(oom_kb) - nr;
+ oom_freed_pages += KB2PAGES(oom_kb) - nr;
+ cmm_dbg("OOM processing complete\n");
+ return NOTIFY_OK;
+}
+
+/**
+ * cmm_get_mpp - Read memory performance parameters
+ *
+ * Makes hcall to query the current page loan request from the hypervisor.
+ *
+ * Return value:
+ * nothing
+ **/
+static void cmm_get_mpp(void)
+{
+ int rc;
+ struct hvcall_mpp_data mpp_data;
+ unsigned long active_pages_target;
+ signed long page_loan_request;
+
+ rc = h_get_mpp(&mpp_data);
+
+ if (rc != H_SUCCESS)
+ return;
+
+ page_loan_request = div_s64((s64)mpp_data.loan_request, PAGE_SIZE);
+ loaned_pages_target = page_loan_request + loaned_pages;
+ if (loaned_pages_target > oom_freed_pages)
+ loaned_pages_target -= oom_freed_pages;
+ else
+ loaned_pages_target = 0;
+
+ active_pages_target = totalram_pages + loaned_pages - loaned_pages_target;
+
+ if ((min_mem_mb * 1024 * 1024) > (active_pages_target * PAGE_SIZE))
+ loaned_pages_target = totalram_pages + loaned_pages -
+ ((min_mem_mb * 1024 * 1024) / PAGE_SIZE);
+
+ cmm_dbg("delta = %ld, loaned = %lu, target = %lu, oom = %lu, totalram = %lu\n",
+ page_loan_request, loaned_pages, loaned_pages_target,
+ oom_freed_pages, totalram_pages);
+}
+
+static struct notifier_block cmm_oom_nb = {
+ .notifier_call = cmm_oom_notify
+};
+
+/**
+ * cmm_thread - CMM task thread
+ * @dummy: not used
+ *
+ * Return value:
+ * 0
+ **/
+static int cmm_thread(void *dummy)
+{
+ unsigned long timeleft;
+
+ while (1) {
+ timeleft = msleep_interruptible(delay * 1000);
+
+ if (kthread_should_stop() || timeleft) {
+ loaned_pages_target = loaned_pages;
+ break;
+ }
+
+ cmm_get_mpp();
+
+ if (loaned_pages_target > loaned_pages) {
+ if (cmm_alloc_pages(loaned_pages_target - loaned_pages))
+ loaned_pages_target = loaned_pages;
+ } else if (loaned_pages_target < loaned_pages)
+ cmm_free_pages(loaned_pages - loaned_pages_target);
+ }
+ return 0;
+}
+
+#define CMM_SHOW(name, format, args...) \
+ static ssize_t show_##name(struct sys_device *dev, char *buf) \
+ { \
+ return sprintf(buf, format, ##args); \
+ } \
+ static SYSDEV_ATTR(name, S_IRUGO, show_##name, NULL)
+
+CMM_SHOW(loaned_kb, "%lu\n", PAGES2KB(loaned_pages));
+CMM_SHOW(loaned_target_kb, "%lu\n", PAGES2KB(loaned_pages_target));
+
+static ssize_t show_oom_pages(struct sys_device *dev, char *buf)
+{
+ return sprintf(buf, "%lu\n", PAGES2KB(oom_freed_pages));
+}
+
+static ssize_t store_oom_pages(struct sys_device *dev,
+ const char *buf, size_t count)
+{
+ unsigned long val = simple_strtoul (buf, NULL, 10);
+
+ if (!capable(CAP_SYS_ADMIN))
+ return -EPERM;
+ if (val != 0)
+ return -EBADMSG;
+
+ oom_freed_pages = 0;
+ return count;
+}
+
+static SYSDEV_ATTR(oom_freed_kb, S_IWUSR| S_IRUGO,
+ show_oom_pages, store_oom_pages);
+
+static struct sysdev_attribute *cmm_attrs[] = {
+ &attr_loaned_kb,
+ &attr_loaned_target_kb,
+ &attr_oom_freed_kb,
+};
+
+static struct sysdev_class cmm_sysdev_class = {
+ .name = "cmm",
+};
+
+/**
+ * cmm_sysfs_register - Register with sysfs
+ *
+ * Return value:
+ * 0 on success / other on failure
+ **/
+static int cmm_sysfs_register(struct sys_device *sysdev)
+{
+ int i, rc;
+
+ if ((rc = sysdev_class_register(&cmm_sysdev_class)))
+ return rc;
+
+ sysdev->id = 0;
+ sysdev->cls = &cmm_sysdev_class;
+
+ if ((rc = sysdev_register(sysdev)))
+ goto class_unregister;
+
+ for (i = 0; i < ARRAY_SIZE(cmm_attrs); i++) {
+ if ((rc = sysdev_create_file(sysdev, cmm_attrs[i])))
+ goto fail;
+ }
+
+ return 0;
+
+fail:
+ while (--i >= 0)
+ sysdev_remove_file(sysdev, cmm_attrs[i]);
+ sysdev_unregister(sysdev);
+class_unregister:
+ sysdev_class_unregister(&cmm_sysdev_class);
+ return rc;
+}
+
+/**
+ * cmm_unregister_sysfs - Unregister from sysfs
+ *
+ **/
+static void cmm_unregister_sysfs(struct sys_device *sysdev)
+{
+ int i;
+
+ for (i = 0; i < ARRAY_SIZE(cmm_attrs); i++)
+ sysdev_remove_file(sysdev, cmm_attrs[i]);
+ sysdev_unregister(sysdev);
+ sysdev_class_unregister(&cmm_sysdev_class);
+}
+
+/**
+ * cmm_init - Module initialization
+ *
+ * Return value:
+ * 0 on success / other on failure
+ **/
+static int cmm_init(void)
+{
+ int rc = -ENOMEM;
+
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ return -EOPNOTSUPP;
+
+ if ((rc = register_oom_notifier(&cmm_oom_nb)) < 0)
+ return rc;
+
+ if ((rc = cmm_sysfs_register(&cmm_sysdev)))
+ goto out_oom_notifier;
+
+ if (cmm_disabled)
+ return rc;
+
+ cmm_thread_ptr = kthread_run(cmm_thread, NULL, "cmmthread");
+ if (IS_ERR(cmm_thread_ptr)) {
+ rc = PTR_ERR(cmm_thread_ptr);
+ goto out_unregister_sysfs;
+ }
+
+ return rc;
+
+out_unregister_sysfs:
+ cmm_unregister_sysfs(&cmm_sysdev);
+out_oom_notifier:
+ unregister_oom_notifier(&cmm_oom_nb);
+ return rc;
+}
+
+/**
+ * cmm_exit - Module exit
+ *
+ * Return value:
+ * nothing
+ **/
+static void cmm_exit(void)
+{
+ if (cmm_thread_ptr)
+ kthread_stop(cmm_thread_ptr);
+ unregister_oom_notifier(&cmm_oom_nb);
+ cmm_free_pages(loaned_pages);
+ cmm_unregister_sysfs(&cmm_sysdev);
+}
+
+/**
+ * cmm_set_disable - Disable/Enable CMM
+ *
+ * Return value:
+ * 0 on success / other on failure
+ **/
+static int cmm_set_disable(const char *val, struct kernel_param *kp)
+{
+ int disable = simple_strtoul(val, NULL, 10);
+
+ if (disable != 0 && disable != 1)
+ return -EINVAL;
+
+ if (disable && !cmm_disabled) {
+ if (cmm_thread_ptr)
+ kthread_stop(cmm_thread_ptr);
+ cmm_thread_ptr = NULL;
+ cmm_free_pages(loaned_pages);
+ } else if (!disable && cmm_disabled) {
+ cmm_thread_ptr = kthread_run(cmm_thread, NULL, "cmmthread");
+ if (IS_ERR(cmm_thread_ptr))
+ return PTR_ERR(cmm_thread_ptr);
+ }
+
+ cmm_disabled = disable;
+ return 0;
+}
+
+module_param_call(disable, cmm_set_disable, param_get_uint,
+ &cmm_disabled, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(disable, "Disable CMM. Set to 1 to disable. "
+ "[Default=" __stringify(CMM_DISABLE) "]");
+
+module_init(cmm_init);
+module_exit(cmm_exit);
* [PATCH 08/16 v3] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (8 preceding siblings ...)
2008-07-04 12:53 ` [PATCH 07/16 v3] powerpc: Add collaborative memory manager Robert Jennings
@ 2008-07-04 12:54 ` Robert Jennings
2008-07-14 21:35 ` Brian King
2008-07-04 12:54 ` [PATCH 09/16 v3] powerpc: Add CMO paging statistics Robert Jennings
` (7 subsequent siblings)
17 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:54 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Brian King <brking@linux.vnet.ibm.com>
Cooperative Memory Overcommitment (CMO) on System p does not currently
support native PCI devices or eBus devices. Prevent the PCI bus probe
and the eBus device probe when the feature is enabled.
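The precedence this patch introduces for the PCI probe mode can be sketched as below. The enum and flag values are illustrative stand-ins, not the kernel's PCI_PROBE_* or FW_FEATURE_* definitions; the point is only that the CMO check comes before the existing LPAR check.

```c
/* Illustrative stand-ins for the kernel's probe modes and feature bits. */
enum probe_mode { PROBE_NONE, PROBE_NORMAL, PROBE_DEVTREE };

#define FEATURE_LPAR (1UL << 0)
#define FEATURE_CMO  (1UL << 1)

/* Pick a PCI probe mode from a firmware feature mask. */
static enum probe_mode pci_probe_mode(unsigned long features)
{
    if (features & FEATURE_CMO)
        return PROBE_NONE;      /* CMO partitions have virtual IO only */
    if (features & FEATURE_LPAR)
        return PROBE_DEVTREE;   /* LPAR: take devices from the device tree */
    return PROBE_NORMAL;
}
```

Since a CMO partition is also an LPAR, testing the flags in the opposite order would never reach the CMO case.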
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/ibmebus.c | 6 ++++++
arch/powerpc/platforms/pseries/setup.c | 4 ++++
2 files changed, 10 insertions(+)
Index: b/arch/powerpc/kernel/ibmebus.c
===================================================================
--- a/arch/powerpc/kernel/ibmebus.c
+++ b/arch/powerpc/kernel/ibmebus.c
@@ -45,6 +45,7 @@
#include <linux/of_platform.h>
#include <asm/ibmebus.h>
#include <asm/abs_addr.h>
+#include <asm/firmware.h>

static struct device ibmebus_bus_device = { /* fake "parent" device */
.bus_id = "ibmebus",
@@ -332,6 +333,11 @@ static int __init ibmebus_bus_init(void)
{
int err;

+ if (firmware_has_feature(FW_FEATURE_CMO)) {
+ printk(KERN_WARNING "Not probing eBus since CMO is enabled\n");
+ return 0;
+ }
+
printk(KERN_INFO "IBM eBus Device Driver\n");

err = of_bus_type_init(&ibmebus_bus_type, "ibmebus");
Index: b/arch/powerpc/platforms/pseries/setup.c
===================================================================
--- a/arch/powerpc/platforms/pseries/setup.c
+++ b/arch/powerpc/platforms/pseries/setup.c
@@ -539,6 +539,10 @@ static void pseries_shared_idle_sleep(vo

static int pSeries_pci_probe_mode(struct pci_bus *bus)
{
+ if (firmware_has_feature(FW_FEATURE_CMO)) {
+ dev_warn(&bus->dev, "Not probing PCI bus since CMO is enabled\n");
+ return PCI_PROBE_NONE;
+ }
if (firmware_has_feature(FW_FEATURE_LPAR))
return PCI_PROBE_DEVTREE;
return PCI_PROBE_NORMAL;
* [PATCH 09/16 v3] powerpc: Add CMO paging statistics
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (9 preceding siblings ...)
2008-07-04 12:54 ` [PATCH 08/16 v3] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
@ 2008-07-04 12:54 ` Robert Jennings
2008-07-04 12:54 ` [PATCH 10/16 v3] powerpc: iommu enablement for CMO Robert Jennings
` (6 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:54 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Brian King <brking@linux.vnet.ibm.com>
With the addition of Cooperative Memory Overcommitment (CMO) support
for IBM Power Systems, two fields have been added to the VPA to report
paging statistics. Add support in lparcfg to report them to userspace.
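The aggregation that lparcfg performs for these two fields can be sketched in user space as follows. `struct cpu_area`, `struct cmo_stats` and `sum_cmo_stats` are hypothetical names; the sketch assumes, as the patch does, that the fault time is kept in timebase ticks and is converted to microseconds only after summing over all CPUs.

```c
#include <stddef.h>
#include <stdint.h>

/* Per-CPU counters, standing in for the two new VPA fields. */
struct cpu_area {
    uint64_t cmo_faults;        /* page fault count */
    uint64_t cmo_fault_time;    /* time spent in faults, in timebase ticks */
};

struct cmo_stats {
    uint64_t faults;
    uint64_t fault_time_usec;
};

/* Sum the counters over all CPUs, then convert ticks to microseconds. */
static struct cmo_stats sum_cmo_stats(const struct cpu_area *areas,
                                      size_t ncpus,
                                      uint64_t ticks_per_usec)
{
    struct cmo_stats s = { 0, 0 };
    uint64_t ticks = 0;
    size_t i;

    for (i = 0; i < ncpus; i++) {
        s.faults += areas[i].cmo_faults;
        ticks += areas[i].cmo_fault_time;
    }
    if (ticks_per_usec)         /* guard against division by zero */
        s.fault_time_usec = ticks / ticks_per_usec;
    return s;
}
```

Converting once after the sum, instead of per CPU, avoids losing the sub-microsecond remainder of each CPU's counter.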
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 20 ++++++++++++++++++++
include/asm-powerpc/lppaca.h | 5 ++++-
2 files changed, 24 insertions(+), 1 deletion(-)
Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -410,6 +410,25 @@ static int lparcfg_count_active_processo
return count;
}

+static void pseries_cmo_data(struct seq_file *m)
+{
+ int cpu;
+ unsigned long cmo_faults = 0;
+ unsigned long cmo_fault_time = 0;
+
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ return;
+
+ for_each_possible_cpu(cpu) {
+ cmo_faults += lppaca[cpu].cmo_faults;
+ cmo_fault_time += lppaca[cpu].cmo_fault_time;
+ }
+
+ seq_printf(m, "cmo_faults=%lu\n", cmo_faults);
+ seq_printf(m, "cmo_fault_time_usec=%lu\n",
+ cmo_fault_time / tb_ticks_per_usec);
+}
+
static int pseries_lparcfg_data(struct seq_file *m, void *v)
{
int partition_potential_processors;
@@ -435,6 +454,7 @@ static int pseries_lparcfg_data(struct s
parse_system_parameter_string(m);
parse_ppp_data(m);
parse_mpp_data(m);
+ pseries_cmo_data(m);

seq_printf(m, "purr=%ld\n", get_purr());
} else { /* non SPLPAR case */
Index: b/include/asm-powerpc/lppaca.h
===================================================================
--- a/include/asm-powerpc/lppaca.h
+++ b/include/asm-powerpc/lppaca.h
@@ -125,7 +125,10 @@ struct lppaca {
// NOTE: This value will ALWAYS be zero for dedicated processors and
// will NEVER be zero for shared processors (ie, initialized to a 1).
volatile u32 yield_count; // PLIC increments each dispatchx00-x03
- u8 reserved6[124]; // Reserved x04-x7F
+ u32 reserved6;
+ volatile u64 cmo_faults; // CMO page fault count x08-x0F
+ volatile u64 cmo_fault_time; // CMO page fault time x10-x17
+ u8 reserved7[104]; // Reserved x18-x7F

//=============================================================================
// CACHE_LINE_4-5 0x0180 - 0x027F Contains PMC interrupt data
* [PATCH 10/16 v3] powerpc: iommu enablement for CMO
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (10 preceding siblings ...)
2008-07-04 12:54 ` [PATCH 09/16 v3] powerpc: Add CMO paging statistics Robert Jennings
@ 2008-07-04 12:54 ` Robert Jennings
2008-07-05 17:51 ` Olof Johansson
` (2 more replies)
2008-07-04 12:55 ` [PATCH 11/16 v3] powerpc: vio bus support " Robert Jennings
` (5 subsequent siblings)
17 siblings, 3 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:54 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
To support Cooperative Memory Overcommitment (CMO), we need to check
for failure from some of the tce hcalls.
These changes for the pseries platform affect the powerpc architecture;
patches for the other affected platforms are included in this patch.
pSeries platform IOMMU code changes:
* platform TCE functions must handle H_NOT_ENOUGH_RESOURCES errors and
return an error.
Architecture IOMMU code changes:
* Calls to ppc_md.tce_build need to check return values and return
DMA_ERROR_CODE for transient errors.
Architecture changes:
* The tce_build member of struct machdep_calls and the tce_build*_pSeriesLP
functions need to change to return a value that indicates failure.
* All other platforms need updates to their iommu functions to match the new
calling semantics; they always return 0 for success. The other platforms'
default configs have been built, but no further testing was performed.
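The error handling described above can be sketched outside the kernel as
follows. This is only an illustration under stated assumptions: struct
sketch_table, sketch_iommu_alloc() and the build callback are hypothetical
stand-ins for struct iommu_table, iommu_alloc() and ppc_md.tce_build, and
the table is reduced to a byte-per-entry bitmap. The point is that a
non-zero return from the build step releases the entries that were just
reserved and reports a mapping error to the caller.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_TABLE_ENTRIES 64
#define SKETCH_MAP_ERROR ((long)-1)

/* Hypothetical stand-in for struct iommu_table. */
struct sketch_table {
    unsigned char used[SKETCH_TABLE_ENTRIES];
    /* Hypothetical stand-in for ppc_md.tce_build; returns 0 on success. */
    int (*build)(long entry, long npages);
};

/* First-fit search for npages contiguous free entries; marks them used. */
static long sketch_range_alloc(struct sketch_table *tbl, long npages)
{
    long start, i;

    if (npages <= 0 || npages > SKETCH_TABLE_ENTRIES)
        return -1;
    for (start = 0; start + npages <= SKETCH_TABLE_ENTRIES; start++) {
        for (i = 0; i < npages; i++)
            if (tbl->used[start + i])
                break;
        if (i == npages) {
            for (i = 0; i < npages; i++)
                tbl->used[start + i] = 1;
            return start;
        }
    }
    return -1;
}

/* Release a previously reserved range of entries. */
static void sketch_range_free(struct sketch_table *tbl, long entry, long npages)
{
    long i;

    assert(entry >= 0 && entry + npages <= SKETCH_TABLE_ENTRIES);
    for (i = 0; i < npages; i++)
        tbl->used[entry + i] = 0;
}

/* Reserve entries, build them, and roll back if the build step fails. */
long sketch_iommu_alloc(struct sketch_table *tbl, long npages)
{
    long entry = sketch_range_alloc(tbl, npages);

    if (entry < 0)
        return SKETCH_MAP_ERROR;
    if (tbl->build(entry, npages)) {
        sketch_range_free(tbl, entry, npages);
        return SKETCH_MAP_ERROR;
    }
    return entry;
}

/* Two trivial build callbacks used to exercise both paths. */
int sketch_build_ok(long entry, long npages)
{
    (void)entry; (void)npages;
    return 0;
}

int sketch_build_fail(long entry, long npages)
{
    (void)entry; (void)npages;
    return -1;
}
```

Because the failed range is released before returning, a later request can
reuse the same entries once the transient condition has cleared.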
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/iommu.c | 26 +++++++++++++++++++++----
arch/powerpc/platforms/cell/iommu.c | 3 ++-
arch/powerpc/platforms/iseries/iommu.c | 3 ++-
arch/powerpc/platforms/pasemi/iommu.c | 3 ++-
arch/powerpc/platforms/pseries/iommu.c | 34 ++++++++++++++++++++++++++++-----
arch/powerpc/sysdev/dart_iommu.c | 3 ++-
include/asm-powerpc/machdep.h | 2 +-
7 files changed, 60 insertions(+), 14 deletions(-)
Index: b/arch/powerpc/kernel/iommu.c
===================================================================
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -49,6 +49,8 @@ static int novmerge = 1;
static int protect4gb = 1;
+static void __iommu_free(struct iommu_table *, dma_addr_t, unsigned int);
+
static inline unsigned long iommu_num_pages(unsigned long vaddr,
unsigned long slen)
{
@@ -190,6 +192,7 @@ static dma_addr_t iommu_alloc(struct dev
{
unsigned long entry, flags;
dma_addr_t ret = DMA_ERROR_CODE;
+ int build_fail;
spin_lock_irqsave(&(tbl->it_lock), flags);
@@ -204,9 +207,21 @@ static dma_addr_t iommu_alloc(struct dev
ret = entry << IOMMU_PAGE_SHIFT; /* Set the return dma address */
/* Put the TCEs in the HW table */
- ppc_md.tce_build(tbl, entry, npages, (unsigned long)page & IOMMU_PAGE_MASK,
- direction);
+ build_fail = ppc_md.tce_build(tbl, entry, npages,
+ (unsigned long)page & IOMMU_PAGE_MASK,
+ direction);
+
+ /* ppc_md.tce_build() only returns non-zero for transient errors.
+ * Clean up the table bitmap in this case and return
+ * DMA_ERROR_CODE. For all other errors the functionality is
+ * not altered.
+ */
+ if (unlikely(build_fail)) {
+ __iommu_free(tbl, ret, npages);
+ spin_unlock_irqrestore(&(tbl->it_lock), flags);
+ return DMA_ERROR_CODE;
+ }
/* Flush/invalidate TLB caches if necessary */
if (ppc_md.tce_flush)
@@ -275,7 +290,7 @@ int iommu_map_sg(struct device *dev, str
dma_addr_t dma_next = 0, dma_addr;
unsigned long flags;
struct scatterlist *s, *outs, *segstart;
- int outcount, incount, i;
+ int outcount, incount, i, build_fail = 0;
unsigned int align;
unsigned long handle;
unsigned int max_seg_size;
@@ -336,7 +351,10 @@ int iommu_map_sg(struct device *dev, str
npages, entry, dma_addr);
/* Insert into HW table */
- ppc_md.tce_build(tbl, entry, npages, vaddr & IOMMU_PAGE_MASK, direction);
+ build_fail = ppc_md.tce_build(tbl, entry, npages,
+ vaddr & IOMMU_PAGE_MASK, direction);
+ if (unlikely(build_fail))
+ goto failure;
/* If we are in an open segment, try merging */
if (segstart != s) {
Index: b/arch/powerpc/platforms/cell/iommu.c
===================================================================
--- a/arch/powerpc/platforms/cell/iommu.c
+++ b/arch/powerpc/platforms/cell/iommu.c
@@ -172,7 +172,7 @@ static void invalidate_tce_cache(struct
}
}
-static void tce_build_cell(struct iommu_table *tbl, long index, long npages,
+static int tce_build_cell(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction)
{
int i;
@@ -210,6 +210,7 @@ static void tce_build_cell(struct iommu_
pr_debug("tce_build_cell(index=%lx,n=%lx,dir=%d,base_pte=%lx)\n",
index, npages, direction, base_pte);
+ return 0;
}
static void tce_free_cell(struct iommu_table *tbl, long index, long npages)
Index: b/arch/powerpc/platforms/iseries/iommu.c
===================================================================
--- a/arch/powerpc/platforms/iseries/iommu.c
+++ b/arch/powerpc/platforms/iseries/iommu.c
@@ -41,7 +41,7 @@
#include <asm/iseries/hv_call_event.h>
#include <asm/iseries/iommu.h>
-static void tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
+static int tce_build_iSeries(struct iommu_table *tbl, long index, long npages,
unsigned long uaddr, enum dma_data_direction direction)
{
u64 rc;
@@ -70,6 +70,7 @@ static void tce_build_iSeries(struct iom
index++;
uaddr += TCE_PAGE_SIZE;
}
+ return 0;
}
static void tce_free_iSeries(struct iommu_table *tbl, long index, long npages)
Index: b/arch/powerpc/platforms/pasemi/iommu.c
===================================================================
--- a/arch/powerpc/platforms/pasemi/iommu.c
+++ b/arch/powerpc/platforms/pasemi/iommu.c
@@ -83,7 +83,7 @@ static u32 *iob_l2_base;
static struct iommu_table iommu_table_iobmap;
static int iommu_table_iobmap_inited;
-static void iobmap_build(struct iommu_table *tbl, long index,
+static int iobmap_build(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
enum dma_data_direction direction)
{
@@ -107,6 +107,7 @@ static void iobmap_build(struct iommu_ta
uaddr += IOBMAP_PAGE_SIZE;
bus_addr += IOBMAP_PAGE_SIZE;
}
+ return 0;
}
Index: b/arch/powerpc/platforms/pseries/iommu.c
===================================================================
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -48,7 +48,7 @@
#include "plpar_wrappers.h"
-static void tce_build_pSeries(struct iommu_table *tbl, long index,
+static int tce_build_pSeries(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
enum dma_data_direction direction)
{
@@ -71,6 +71,7 @@ static void tce_build_pSeries(struct iom
uaddr += TCE_PAGE_SIZE;
tcep++;
}
+ return 0;
}
@@ -93,13 +94,18 @@ static unsigned long tce_get_pseries(str
return *tcep;
}
-static void tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static void tce_free_pSeriesLP(struct iommu_table*, long, long);
+static void tce_freemulti_pSeriesLP(struct iommu_table*, long, long);
+
+static int tce_build_pSeriesLP(struct iommu_table *tbl, long tcenum,
long npages, unsigned long uaddr,
enum dma_data_direction direction)
{
- u64 rc;
+ u64 rc = 0;
u64 proto_tce, tce;
u64 rpn;
+ int ret = 0;
+ long tcenum_start = tcenum, npages_start = npages;
rpn = (virt_to_abs(uaddr)) >> TCE_SHIFT;
proto_tce = TCE_PCI_READ;
@@ -110,6 +116,13 @@ static void tce_build_pSeriesLP(struct i
tce = proto_tce | (rpn & TCE_RPN_MASK) << TCE_RPN_SHIFT;
rc = plpar_tce_put((u64)tbl->it_index, (u64)tcenum << 12, tce);
+ if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
+ ret = (int)rc;
+ tce_free_pSeriesLP(tbl, tcenum_start,
+ (npages_start - (npages + 1)));
+ break;
+ }
+
if (rc && printk_ratelimit()) {
printk("tce_build_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
printk("\tindex = 0x%lx\n", (u64)tbl->it_index);
@@ -121,19 +134,22 @@ static void tce_build_pSeriesLP(struct i
tcenum++;
rpn++;
}
+ return ret;
}
static DEFINE_PER_CPU(u64 *, tce_page) = NULL;
-static void tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
+static int tce_buildmulti_pSeriesLP(struct iommu_table *tbl, long tcenum,
long npages, unsigned long uaddr,
enum dma_data_direction direction)
{
- u64 rc;
+ u64 rc = 0;
u64 proto_tce;
u64 *tcep;
u64 rpn;
long l, limit;
+ long tcenum_start = tcenum, npages_start = npages;
+ int ret = 0;
if (npages == 1)
return tce_build_pSeriesLP(tbl, tcenum, npages, uaddr,
@@ -180,6 +196,13 @@ static void tce_buildmulti_pSeriesLP(str
tcenum += limit;
} while (npages > 0 && !rc);
+ if (unlikely(rc == H_NOT_ENOUGH_RESOURCES)) {
+ ret = (int)rc;
+ tce_freemulti_pSeriesLP(tbl, tcenum_start,
+ (npages_start - (npages + limit)));
+ return ret;
+ }
+
if (rc && printk_ratelimit()) {
printk("tce_buildmulti_pSeriesLP: plpar_tce_put failed. rc=%ld\n", rc);
printk("\tindex = 0x%lx\n", (u64)tbl->it_index);
@@ -187,6 +210,7 @@ static void tce_buildmulti_pSeriesLP(str
printk("\ttce[0] val = 0x%lx\n", tcep[0]);
show_stack(current, (unsigned long *)__get_SP());
}
+ return ret;
}
static void tce_free_pSeriesLP(struct iommu_table *tbl, long tcenum, long npages)
Index: b/arch/powerpc/sysdev/dart_iommu.c
===================================================================
--- a/arch/powerpc/sysdev/dart_iommu.c
+++ b/arch/powerpc/sysdev/dart_iommu.c
@@ -147,7 +147,7 @@ static void dart_flush(struct iommu_tabl
}
}
-static void dart_build(struct iommu_table *tbl, long index,
+static int dart_build(struct iommu_table *tbl, long index,
long npages, unsigned long uaddr,
enum dma_data_direction direction)
{
@@ -183,6 +183,7 @@ static void dart_build(struct iommu_tabl
} else {
dart_dirty = 1;
}
+ return 0;
}
Index: b/include/asm-powerpc/machdep.h
===================================================================
--- a/include/asm-powerpc/machdep.h
+++ b/include/asm-powerpc/machdep.h
@@ -76,7 +76,7 @@ struct machdep_calls {
* destroyed as well */
void (*hpte_clear_all)(void);
- void (*tce_build)(struct iommu_table * tbl,
+ int (*tce_build)(struct iommu_table *tbl,
long index,
long npages,
unsigned long uaddr,
* [PATCH 11/16 v3] powerpc: vio bus support for CMO
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (11 preceding siblings ...)
2008-07-04 12:54 ` [PATCH 10/16 v3] powerpc: iommu enablement for CMO Robert Jennings
@ 2008-07-04 12:55 ` Robert Jennings
2008-07-04 12:55 ` [PATCH 12/16 v3] powerpc: Verify CMO memory entitlement updates with virtual I/O Robert Jennings
` (4 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:55 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Robert Jennings <rcj@linux.vnet.ibm.com>
This is a large patch but the normal code path is not affected. For
non-pSeries platforms the code is ifdef'ed out and for non-CMO enabled
pSeries systems this does not affect the normal code path. Devices that
do not perform DMA operations do not need modification with this patch.
The function get_desired_dma was renamed from get_io_entitlement for
clarity.
Overview
Cooperative Memory Overcommitment (CMO) allows for a set of OS partitions
to be run with less RAM than the aggregate needs of the group of
partitions. The firmware will balance memory between the partitions
and page in/out memory as needed. Based on the number and type of IO
adapters present, each partition is allocated an amount of memory for
DMA operations and this allocation will be guaranteed to the partition;
this is referred to as the partition's 'entitlement'.
Partitions running in a CMO environment can only have virtual IO devices
present. The VIO bus layer will manage the IO entitlement for the system.
Accounting, at a system and per-device level, is tracked in the VIO bus
code and exposed via sysfs. A set of dma_ops functions are added to
the bus to allow for this accounting.
Bus initialization
At initialization, the bus will calculate the minimum needs of the system
based on providing each device present with a standard minimum entitlement
along with a spare allocation for the bus to handle hotplug events.
If the minimum needs cannot be met, the system boot will be halted.
Device changes
The significant changes for devices while running under CMO are that the
devices must specify how much dedicated IO entitlement they desire and
must also handle DMA mapping errors that can occur due to constrained
IO memory. The virtual IO drivers are modified to silence errors when
DMA mappings fail for CMO and handle these failures gracefully.
Each device will be guaranteed a minimum entitlement that can always
be mapped. Devices will specify how much entitlement they desire and
the VIO bus will attempt to provide for this. Devices can change their
desired entitlement level at any point in time to address particular needs
(via vio_cmo_set_dev_desired()), not just at device probe time.
VIO bus changes
The system will have a particular entitlement level available from which
it can provide memory to the devices. The bus defines two pools of memory
within this entitlement, the reserve and excess pools. Each device is
provided with its own entitlement, no less than a system-defined minimum
entitlement and no greater than what the device has specified as its
desired entitlement. The entitlement provided to devices comes from the
reserve pool. The reserve pool can also contain a spare allocation as
large as the system-defined minimum entitlement, which is used for device
hotplug events. Any entitlement not needed to fulfill the needs of the
reserve pool is placed in the excess pool. Each device is guaranteed
that it can map up to its entitled level; additional mappings are possible
as long as there is unmapped memory in the excess pool.
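The allocation rule in the paragraph above can be sketched as plain C.
This is a simplified illustration, not the kernel code: struct sketch_dev,
struct sketch_bus and sketch_cmo_alloc() are hypothetical names, locking
and the usage statistics are omitted, and the "spare is fully funded"
condition is reduced to a flag. A request succeeds when the device's
unused entitlement plus the usable excess pool covers it, and only the
part that exceeds the device's entitlement is charged to the excess pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device accounting. */
struct sketch_dev {
    size_t entitled;   /* guaranteed IO memory for this device */
    size_t allocated;  /* IO memory currently mapped by this device */
};

/* Hypothetical bus-wide accounting. */
struct sketch_bus {
    size_t excess_free; /* unmapped memory in the excess pool */
    int spare_full;     /* non-zero when the hotplug spare is fully funded */
};

/* Returns 0 if the mapping may proceed, -1 if it must fail. */
int sketch_cmo_alloc(struct sketch_bus *bus, struct sketch_dev *dev, size_t size)
{
    size_t reserve_free = 0;
    size_t excess_free = 0;
    size_t from_reserve;

    /* Unused part of the device's own entitlement */
    if (dev->entitled > dev->allocated)
        reserve_free = dev->entitled - dev->allocated;

    /* The excess pool is only usable while the spare is fully funded */
    if (bus->spare_full)
        excess_free = bus->excess_free;

    if (reserve_free + excess_free < size)
        return -1;

    /* Charge the device's entitlement first, then the excess pool */
    from_reserve = size < reserve_free ? size : reserve_free;
    assert(bus->excess_free >= size - from_reserve);
    dev->allocated += size;
    bus->excess_free -= size - from_reserve;
    return 0;
}
```

Charging the device's own entitlement first keeps the excess pool
available to whichever device needs to exceed its guarantee.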
Bus probe
As the system starts, each device is given an entitlement equal only
to the system defined minimum entitlement. The reserve pool is equal
to the sum of these entitlements, plus a spare allocation. The VIO bus
also tracks the aggregate desired entitlement of all the devices. If the
system desired entitlement is greater than the size of the reserve pool,
IO memory that devices unmap from the excess pool is moved into the reserve
pool and a balance operation is scheduled for some time in the future.
Entitlement balancing
The balance function tries to fairly distribute entitlement between the
devices in the system with the goal of providing each device with its
desired amount of entitlement. Devices using more than what would be
ideal will have their entitled set-point adjusted; this will effectively
set a goal for lower IO memory usage as future mappings can fail and
deallocations will trigger a balance operation to distribute the newly
unmapped memory. A fair distribution of entitlement can take several
balance operations to achieve. Entitlement changes and device DLPAR
events will alter the state of CMO and will trigger balance operations.
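A reduced sketch of the chunked distribution may help here. It is an
illustration under assumptions, not the balance function from this patch:
sketch_balance() and struct sketch_bal_dev are hypothetical, "avail" is
taken to be the entitlement left after every device has received the
minimum, and memory that devices have already mapped is ignored. Each pass
gives every unsatisfied device at most one chunk, so no device is
satisfied entirely ahead of the others.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-device view used only for this sketch. */
struct sketch_bal_dev {
    size_t desired;   /* entitlement the device asked for */
    size_t entitled;  /* entitlement the device ends up with */
};

/*
 * Reset every device to min_ent, then hand out 'avail' bytes in passes
 * of at most 'chunk' bytes per device until all requests are met or
 * nothing is left to give.
 */
void sketch_balance(struct sketch_bal_dev *devs, size_t ndevs,
                    size_t avail, size_t min_ent, size_t chunk)
{
    size_t i, give;
    int progress;

    for (i = 0; i < ndevs; i++)
        devs[i].entitled = min_ent;

    if (chunk == 0)
        return;

    do {
        progress = 0;
        for (i = 0; i < ndevs && avail; i++) {
            if (devs[i].desired <= devs[i].entitled)
                continue;
            /* One chunk at most, capped by the request and by avail */
            give = devs[i].desired - devs[i].entitled;
            if (give > chunk)
                give = chunk;
            if (give > avail)
                give = avail;
            assert(avail >= give);
            devs[i].entitled += give;
            avail -= give;
            progress = 1;
        }
    } while (progress && avail);
}
```

With three devices desiring 10, 30 and 50 units, a minimum of 10, a chunk
of 10 and 40 units to distribute, the second device is fully satisfied and
the third receives the remainder.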
Hotplug events
The VIO bus allows for changes in system entitlement at run-time via
'vio_cmo_entitlement_update()'. When devices are added the hotplug
device event will be preceded by a system entitlement increase and this
is reversed when devices are removed.
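The effect of an entitlement change on the pools can be sketched as below.
This is a simplified model with hypothetical names (struct sketch_cmo,
sketch_entitlement_update()); unlike the function in this patch, a
decrease is served only from free excess memory and never reclaims unused
device entitlement. An increase first tops up the hotplug spare in the
reserve pool and puts the remainder in the excess pool.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bus-wide accounting used only for this sketch. */
struct sketch_cmo {
    size_t entitled;      /* total system entitlement */
    size_t spare;         /* spare held in the reserve pool for hotplug */
    size_t reserve_size;  /* size of the reserve pool, including spare */
    size_t excess_size;   /* size of the excess pool */
    size_t excess_free;   /* unmapped part of the excess pool */
};

/* Returns 0 when the change is applied, -1 when it cannot be served. */
int sketch_entitlement_update(struct sketch_cmo *c, size_t new_ent,
                              size_t min_ent)
{
    size_t delta, tmp;

    if (new_ent >= c->entitled) {
        delta = new_ent - c->entitled;
        c->entitled = new_ent;

        /* Top up the spare before anything reaches the excess pool */
        if (c->spare < min_ent) {
            tmp = min_ent - c->spare;
            if (tmp > delta)
                tmp = delta;
            c->spare += tmp;
            c->reserve_size += tmp;
            delta -= tmp;
        }
        c->excess_size += delta;
        c->excess_free += delta;
        return 0;
    }

    /* Decrease: only free excess memory can be given back here */
    delta = c->entitled - new_ent;
    if (delta > c->excess_free)
        return -1;
    assert(c->excess_size >= c->excess_free);
    c->entitled = new_ent;
    c->excess_size -= delta;
    c->excess_free -= delta;
    return 0;
}
```

A rejected decrease leaves all counters untouched, so the caller can
report the failure without any cleanup.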
The following changes are made to the VIO bus layer for CMO:
* add IO memory accounting per device structure.
* add IO memory entitlement query function to driver structure.
* during vio bus probe, if CMO is enabled, check that driver has
memory entitlement query function defined. Fail if function not defined.
* fail to register driver if io entitlement function not defined.
* create set of dma_ops at vio level for CMO that will track allocations
and return DMA failures once entitlement is reached. Entitlement will be
limited by overall system entitlement. Devices will have a reserved
quantity of memory that is guaranteed; the rest can be used as available.
* expose entitlement, current allocation, desired allocation, and the
allocation error counter for devices to the user through sysfs
* provide mechanism for changing a device's desired entitlement at run time
for devices as an exported function and sysfs tunable
* track any DMA failures for entitled IO memory for each vio device.
* check entitlement against available system entitlement on device add
* track entitlement metrics (high water mark, current usage)
* provide function to reset high water mark
* provide minimum and desired entitlement numbers at a bus level
* provide drivers with a minimum guaranteed entitlement
* balance available entitlement between devices to satisfy their needs
* handle system entitlement changes and device hotplug
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/vio.c | 1024 ++++++++++++++++++++++++++++++++++++++++++++++
include/asm-powerpc/vio.h | 29 +
2 files changed, 1045 insertions(+), 8 deletions(-)
Index: b/arch/powerpc/kernel/vio.c
===================================================================
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1,11 +1,12 @@
/*
* IBM PowerPC Virtual I/O Infrastructure Support.
*
- * Copyright (c) 2003-2005 IBM Corp.
+ * Copyright (c) 2003,2008 IBM Corp.
* Dave Engebretsen engebret@us.ibm.com
* Santiago Leon santil@us.ibm.com
* Hollis Blanchard <hollisb@us.ibm.com>
* Stephen Rothwell
+ * Robert Jennings <rcjenn@us.ibm.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -46,6 +47,987 @@ static struct vio_dev vio_bus_device =
.dev.bus = &vio_bus_type,
};

+#ifdef CONFIG_PPC_PSERIES
+/**
+ * vio_cmo_pool - A pool of IO memory for CMO use
+ *
+ * @size: The size of the pool in bytes
+ * @free: The amount of free memory in the pool
+ */
+struct vio_cmo_pool {
+ size_t size;
+ size_t free;
+};
+
+/* How many ms to delay queued balance work */
+#define VIO_CMO_BALANCE_DELAY 100
+
+/* Portion out IO memory to CMO devices by this chunk size */
+#define VIO_CMO_BALANCE_CHUNK 131072
+
+/**
+ * vio_cmo_dev_entry - A device that is CMO-enabled and requires entitlement
+ *
+ * @viodev: struct vio_dev pointer
+ * @list: pointer to other devices on bus that are being tracked
+ */
+struct vio_cmo_dev_entry {
+ struct vio_dev *viodev;
+ struct list_head list;
+};
+
+/**
+ * vio_cmo - VIO bus accounting structure for CMO entitlement
+ *
+ * @lock: spinlock for entire structure
+ * @balance_q: work queue for balancing system entitlement
+ * @device_list: list of CMO-enabled devices requiring entitlement
+ * @entitled: total system entitlement in bytes
+ * @reserve: pool of memory from which devices reserve entitlement, incl. spare
+ * @excess: pool of excess entitlement not needed for device reserves or spare
+ * @spare: IO memory for device hotplug functionality
+ * @min: minimum necessary for system operation
+ * @desired: desired memory for system operation
+ * @curr: bytes currently allocated
+ * @high: high water mark for IO data usage
+ */
+struct vio_cmo {
+ spinlock_t lock;
+ struct delayed_work balance_q;
+ struct list_head device_list;
+ size_t entitled;
+ struct vio_cmo_pool reserve;
+ struct vio_cmo_pool excess;
+ size_t spare;
+ size_t min;
+ size_t desired;
+ size_t curr;
+ size_t high;
+} vio_cmo;
+
+/**
+ * vio_cmo_num_OF_devs - Count the number of OF devices that have DMA windows
+ */
+static int vio_cmo_num_OF_devs(void)
+{
+ struct device_node *node_vroot;
+ int count = 0;
+
+ /*
+ * Count the number of vdevice entries with an
+ * ibm,my-dma-window OF property
+ */
+ node_vroot = of_find_node_by_name(NULL, "vdevice");
+ if (node_vroot) {
+ struct device_node *of_node;
+ struct property *prop;
+
+ for_each_child_of_node(node_vroot, of_node) {
+ prop = of_find_property(of_node, "ibm,my-dma-window",
+ NULL);
+ if (prop)
+ count++;
+ }
+ }
+ of_node_put(node_vroot);
+ return count;
+}
+
+/**
+ * vio_cmo_alloc - allocate IO memory for CMO-enabled devices
+ *
+ * @viodev: VIO device requesting IO memory
+ * @size: size of allocation requested
+ *
+ * Allocations come from memory reserved for the devices and any excess
+ * IO memory available to all devices. The spare pool used to service
+ * hotplug must be equal to %VIO_CMO_MIN_ENT for the excess pool to be
+ * made available.
+ *
+ * Return codes:
+ * 0 for successful allocation and -ENOMEM for a failure
+ */
+static inline int vio_cmo_alloc(struct vio_dev *viodev, size_t size)
+{
+ unsigned long flags;
+ size_t reserve_free = 0;
+ size_t excess_free = 0;
+ int ret = -ENOMEM;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+
+ /* Determine the amount of free entitlement available in reserve */
+ if (viodev->cmo.entitled > viodev->cmo.allocated)
+ reserve_free = viodev->cmo.entitled - viodev->cmo.allocated;
+
+ /* If spare is not fulfilled, the excess pool can not be used. */
+ if (vio_cmo.spare >= VIO_CMO_MIN_ENT)
+ excess_free = vio_cmo.excess.free;
+
+ /* The request can be satisfied */
+ if ((reserve_free + excess_free) >= size) {
+ vio_cmo.curr += size;
+ if (vio_cmo.curr > vio_cmo.high)
+ vio_cmo.high = vio_cmo.curr;
+ viodev->cmo.allocated += size;
+ size -= min(reserve_free, size);
+ vio_cmo.excess.free -= size;
+ ret = 0;
+ }
+
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return ret;
+}
+
+/**
+ * vio_cmo_dealloc - deallocate IO memory from CMO-enabled devices
+ * @viodev: VIO device freeing IO memory
+ * @size: size of deallocation
+ *
+ * IO memory is freed by the device back to the correct memory pools.
+ * The spare pool is replenished first from either memory pool, then
+ * the reserve pool is used to reduce device entitlement, the excess
+ * pool is used to increase the reserve pool toward the desired entitlement
+ * target, and then the remaining memory is returned to the pools.
+ *
+ */
+static inline void vio_cmo_dealloc(struct vio_dev *viodev, size_t size)
+{
+ unsigned long flags;
+ size_t spare_needed = 0;
+ size_t excess_freed = 0;
+ size_t reserve_freed = size;
+ size_t tmp;
+ int balance = 0;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ vio_cmo.curr -= size;
+
+ /* Amount of memory freed from the excess pool */
+ if (viodev->cmo.allocated > viodev->cmo.entitled) {
+ excess_freed = min(reserve_freed, (viodev->cmo.allocated -
+ viodev->cmo.entitled));
+ reserve_freed -= excess_freed;
+ }
+
+ /* Remove allocation from device */
+ viodev->cmo.allocated -= (reserve_freed + excess_freed);
+
+ /* Spare is a subset of the reserve pool, replenish it first. */
+ spare_needed = VIO_CMO_MIN_ENT - vio_cmo.spare;
+
+ /*
+ * Replenish the spare in the reserve pool from the excess pool.
+ * This moves entitlement into the reserve pool.
+ */
+ if (spare_needed && excess_freed) {
+ tmp = min(excess_freed, spare_needed);
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.reserve.size += tmp;
+ vio_cmo.spare += tmp;
+ excess_freed -= tmp;
+ spare_needed -= tmp;
+ balance = 1;
+ }
+
+ /*
+ * Replenish the spare in the reserve pool from the reserve pool.
+ * This removes entitlement from the device down to VIO_CMO_MIN_ENT,
+ * if needed, and gives it to the spare pool. The amount of used
+ * memory in this pool does not change.
+ */
+ if (spare_needed && reserve_freed) {
+ tmp = min(spare_needed, min(reserve_freed,
+ (viodev->cmo.entitled -
+ VIO_CMO_MIN_ENT)));
+
+ vio_cmo.spare += tmp;
+ viodev->cmo.entitled -= tmp;
+ reserve_freed -= tmp;
+ spare_needed -= tmp;
+ balance = 1;
+ }
+
+ /*
+ * Increase the reserve pool until the desired allocation is met.
+ * Move an allocation freed from the excess pool into the reserve
+ * pool and schedule a balance operation.
+ */
+ if (excess_freed && (vio_cmo.desired > vio_cmo.reserve.size)) {
+ tmp = min(excess_freed, (vio_cmo.desired - vio_cmo.reserve.size));
+
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.reserve.size += tmp;
+ excess_freed -= tmp;
+ balance = 1;
+ }
+
+ /* Return memory from the excess pool to that pool */
+ if (excess_freed)
+ vio_cmo.excess.free += excess_freed;
+
+ if (balance)
+ schedule_delayed_work(&vio_cmo.balance_q, VIO_CMO_BALANCE_DELAY);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_entitlement_update - Manage system entitlement changes
+ *
+ * @new_entitlement: new system entitlement to attempt to accommodate
+ *
+ * Increases in entitlement will be used to fulfill the spare entitlement
+ * and the rest is given to the excess pool. Decreases, if they are
+ * possible, come from the excess pool and from unused device entitlement
+ *
+ * Returns: 0 on success, -ENOMEM when change can not be made
+ */
+int vio_cmo_entitlement_update(size_t new_entitlement)
+{
+ struct vio_dev *viodev;
+ struct vio_cmo_dev_entry *dev_ent;
+ unsigned long flags;
+ size_t avail, delta, tmp;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+
+ /* Entitlement increases */
+ if (new_entitlement > vio_cmo.entitled) {
+ delta = new_entitlement - vio_cmo.entitled;
+
+ /* Fulfill spare allocation */
+ if (vio_cmo.spare < VIO_CMO_MIN_ENT) {
+ tmp = min(delta, (VIO_CMO_MIN_ENT - vio_cmo.spare));
+ vio_cmo.spare += tmp;
+ vio_cmo.reserve.size += tmp;
+ delta -= tmp;
+ }
+
+ /* Remaining new allocation goes to the excess pool */
+ vio_cmo.entitled += delta;
+ vio_cmo.excess.size += delta;
+ vio_cmo.excess.free += delta;
+
+ goto out;
+ }
+
+ /* Entitlement decreases */
+ delta = vio_cmo.entitled - new_entitlement;
+ avail = vio_cmo.excess.free;
+
+ /*
+ * Need to check how much unused entitlement each device can
+ * sacrifice to fulfill entitlement change.
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ if (avail >= delta)
+ break;
+
+ viodev = dev_ent->viodev;
+ if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+ (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+ avail += viodev->cmo.entitled -
+ max_t(size_t, viodev->cmo.allocated,
+ VIO_CMO_MIN_ENT);
+ }
+
+ if (delta <= avail) {
+ vio_cmo.entitled -= delta;
+
+ /* Take entitlement from the excess pool first */
+ tmp = min(vio_cmo.excess.free, delta);
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.excess.free -= tmp;
+ delta -= tmp;
+
+ /*
+ * Remove all but VIO_CMO_MIN_ENT bytes from devices
+ * until entitlement change is served
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ if (!delta)
+ break;
+
+ viodev = dev_ent->viodev;
+ tmp = 0;
+ if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+ (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+ tmp = viodev->cmo.entitled -
+ max_t(size_t, viodev->cmo.allocated,
+ VIO_CMO_MIN_ENT);
+ viodev->cmo.entitled -= min(tmp, delta);
+ delta -= min(tmp, delta);
+ }
+ } else {
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return -ENOMEM;
+ }
+
+out:
+ schedule_delayed_work(&vio_cmo.balance_q, 0);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return 0;
+}
+
+/**
+ * vio_cmo_balance - Balance entitlement among devices
+ *
+ * @work: work queue structure for this operation
+ *
+ * Any system entitlement above the minimum needed for devices, or
+ * already allocated to devices, can be distributed to the devices.
+ * The list of devices is iterated through to recalculate the desired
+ * entitlement level and to determine how much entitlement above the
+ * minimum entitlement is allocated to devices.
+ *
+ * Small chunks of the available entitlement are given to devices until
+ * their requirements are fulfilled or there is no entitlement left to give.
+ * Upon completion sizes of the reserve and excess pools are calculated.
+ *
+ * The system minimum entitlement level is also recalculated here.
+ * Entitlement will be reserved for devices even after vio_bus_remove to
+ * accommodate reloading the driver. The OF tree is walked to count the
+ * number of devices present and this will remove entitlement for devices
+ * that have actually left the system after having vio_bus_remove called.
+ */
+static void vio_cmo_balance(struct work_struct *work)
+{
+ struct vio_cmo *cmo;
+ struct vio_dev *viodev;
+ struct vio_cmo_dev_entry *dev_ent;
+ unsigned long flags;
+ size_t avail = 0, level, chunk, need;
+ int devcount = 0, fulfilled;
+
+ cmo = container_of(work, struct vio_cmo, balance_q.work);
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+
+ /* Calculate minimum entitlement and fulfill spare */
+ cmo->min = vio_cmo_num_OF_devs() * VIO_CMO_MIN_ENT;
+ BUG_ON(cmo->min > cmo->entitled);
+ cmo->spare = min_t(size_t, VIO_CMO_MIN_ENT, (cmo->entitled - cmo->min));
+ cmo->min += cmo->spare;
+ cmo->desired = cmo->min;
+
+ /*
+ * Determine how much entitlement is available and reset device
+ * entitlements
+ */
+ avail = cmo->entitled - cmo->spare;
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ viodev = dev_ent->viodev;
+ devcount++;
+ viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+ cmo->desired += (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+ avail -= max_t(size_t, viodev->cmo.allocated, VIO_CMO_MIN_ENT);
+ }
+
+ /*
+ * Having provided each device with the minimum entitlement, loop
+ * over the devices portioning out the remaining entitlement
+ * until there is nothing left.
+ */
+ level = VIO_CMO_MIN_ENT;
+ while (avail) {
+ fulfilled = 0;
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ viodev = dev_ent->viodev;
+
+ if (viodev->cmo.desired <= level) {
+ fulfilled++;
+ continue;
+ }
+
+ /*
+ * Give the device up to VIO_CMO_BALANCE_CHUNK
+ * bytes of entitlement, but do not exceed the
+ * desired level of entitlement for the device.
+ */
+ chunk = min_t(size_t, avail, VIO_CMO_BALANCE_CHUNK);
+ chunk = min(chunk, (viodev->cmo.desired -
+ viodev->cmo.entitled));
+ viodev->cmo.entitled += chunk;
+
+ /*
+ * If the memory for this entitlement increase was
+ * already allocated to the device it does not come
+ * from the available pool being portioned out.
+ */
+ need = max(viodev->cmo.allocated, viodev->cmo.entitled)-
+ max(viodev->cmo.allocated, level);
+ avail -= need;
+
+ }
+ if (fulfilled == devcount)
+ break;
+ level += VIO_CMO_BALANCE_CHUNK;
+ }
+
+ /* Calculate new reserve and excess pool sizes */
+ cmo->reserve.size = cmo->min;
+ cmo->excess.free = 0;
+ cmo->excess.size = 0;
+ need = 0;
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ viodev = dev_ent->viodev;
+ /* Calculated reserve size above the minimum entitlement */
+ if (viodev->cmo.entitled)
+ cmo->reserve.size += (viodev->cmo.entitled -
+ VIO_CMO_MIN_ENT);
+ /* Calculated used excess entitlement */
+ if (viodev->cmo.allocated > viodev->cmo.entitled)
+ need += viodev->cmo.allocated - viodev->cmo.entitled;
+ }
+ cmo->excess.size = cmo->entitled - cmo->reserve.size;
+ cmo->excess.free = cmo->excess.size - need;
+
+ cancel_delayed_work(container_of(work, struct delayed_work, work));
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+static void *vio_dma_iommu_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flag)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ void *ret;
+
+ if (vio_cmo_alloc(viodev, roundup(size, PAGE_SIZE))) {
+ atomic_inc(&viodev->cmo.allocs_failed);
+ return NULL;
+ }
+
+ ret = dma_iommu_ops.alloc_coherent(dev, size, dma_handle, flag);
+ if (unlikely(ret == NULL)) {
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+ atomic_inc(&viodev->cmo.allocs_failed);
+ }
+
+ return ret;
+}
+
+static void vio_dma_iommu_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+
+ dma_iommu_ops.free_coherent(dev, size, vaddr, dma_handle);
+
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+}
+
+static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
+ size_t size,
+ enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ dma_addr_t ret = DMA_ERROR_CODE;
+
+ if (vio_cmo_alloc(viodev, roundup(size, PAGE_SIZE))) {
+ atomic_inc(&viodev->cmo.allocs_failed);
+ return ret;
+ }
+
+ ret = dma_iommu_ops.map_single(dev, vaddr, size, direction);
+ if (unlikely(dma_mapping_error(ret))) {
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+ atomic_inc(&viodev->cmo.allocs_failed);
+ }
+
+ return ret;
+}
+
+static void vio_dma_iommu_unmap_single(struct device *dev,
+ dma_addr_t dma_handle, size_t size,
+ enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+
+ dma_iommu_ops.unmap_single(dev, dma_handle, size, direction);
+
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+}
+
+static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
+ int nelems, enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ struct scatterlist *sgl;
+ int ret, count = 0;
+ size_t alloc_size = 0;
+
+ for (sgl = sglist; count < nelems; count++, sgl++)
+ alloc_size += roundup(sgl->length, PAGE_SIZE);
+
+ if (vio_cmo_alloc(viodev, alloc_size)) {
+ atomic_inc(&viodev->cmo.allocs_failed);
+ return 0;
+ }
+
+ ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction);
+
+ if (unlikely(!ret)) {
+ vio_cmo_dealloc(viodev, alloc_size);
+ atomic_inc(&viodev->cmo.allocs_failed);
+ }
+
+ return ret;
+}
+
+static void vio_dma_iommu_unmap_sg(struct device *dev,
+ struct scatterlist *sglist, int nelems,
+ enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ struct scatterlist *sgl;
+ size_t alloc_size = 0;
+ int count = 0;
+
+ for (sgl = sglist; count < nelems; count++, sgl++)
+ alloc_size += roundup(sgl->length, PAGE_SIZE);
+
+ dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction);
+
+ vio_cmo_dealloc(viodev, alloc_size);
+}
+
+struct dma_mapping_ops vio_dma_mapping_ops = {
+ .alloc_coherent = vio_dma_iommu_alloc_coherent,
+ .free_coherent = vio_dma_iommu_free_coherent,
+ .map_single = vio_dma_iommu_map_single,
+ .unmap_single = vio_dma_iommu_unmap_single,
+ .map_sg = vio_dma_iommu_map_sg,
+ .unmap_sg = vio_dma_iommu_unmap_sg,
+};
+
+/**
+ * vio_cmo_set_dev_desired - Set desired entitlement for a device
+ *
+ * @viodev: struct vio_dev for device to alter
+ * @desired: new desired entitlement level in bytes
+ *
+ * For use by devices to request a change to their entitlement at runtime or
+ * through sysfs. The desired entitlement level is changed and a balancing
+ * of system resources is scheduled to run in the future.
+ */
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired)
+{
+ unsigned long flags;
+ struct vio_cmo_dev_entry *dev_ent;
+ int found = 0;
+
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ return;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ if (desired < VIO_CMO_MIN_ENT)
+ desired = VIO_CMO_MIN_ENT;
+
+ /*
+ * Changes will not be made for devices not in the device list.
+ * If it is not in the device list, then no driver is loaded
+ * for the device and it can not receive entitlement.
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+ if (viodev == dev_ent->viodev) {
+ found = 1;
+ break;
+ }
+ if (!found) {
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return;
+ }
+
+ /* Increase/decrease in desired device entitlement */
+ if (desired >= viodev->cmo.desired) {
+ /* Just bump the bus and device values prior to a balance */
+ vio_cmo.desired += desired - viodev->cmo.desired;
+ viodev->cmo.desired = desired;
+ } else {
+ /* Decrease bus and device values for desired entitlement */
+ vio_cmo.desired -= viodev->cmo.desired - desired;
+ viodev->cmo.desired = desired;
+ /*
+ * If less entitlement is desired than current entitlement, move
+ * any reserve memory in the change region to the excess pool.
+ */
+ if (viodev->cmo.entitled > desired) {
+ vio_cmo.reserve.size -= viodev->cmo.entitled - desired;
+ vio_cmo.excess.size += viodev->cmo.entitled - desired;
+ /*
+ * If entitlement moving from the reserve pool to the
+ * excess pool is currently unused, add to the excess
+ * free counter.
+ */
+ if (viodev->cmo.allocated < viodev->cmo.entitled)
+ vio_cmo.excess.free += viodev->cmo.entitled -
+ max(viodev->cmo.allocated, desired);
+ viodev->cmo.entitled = desired;
+ }
+ }
+ schedule_delayed_work(&vio_cmo.balance_q, 0);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_bus_probe - Handle CMO specific bus probe activities
+ *
+ * @viodev - Pointer to struct vio_dev for device
+ *
+ * Determine the device's IO memory entitlement needs, attempting
+ * to satisfy the system minimum entitlement at first and scheduling
+ * a balance operation to take care of the rest at a later time.
+ *
+ * Returns: 0 on success, -EINVAL when device doesn't support CMO, and
+ * -ENOMEM when entitlement is not available for device or
+ * device entry.
+ *
+ */
+static int vio_cmo_bus_probe(struct vio_dev *viodev)
+{
+ struct vio_cmo_dev_entry *dev_ent;
+ struct device *dev = &viodev->dev;
+ struct vio_driver *viodrv = to_vio_driver(dev->driver);
+ unsigned long flags;
+ size_t size;
+
+ /*
+ * Check to see that device has a DMA window and configure
+ * entitlement for the device.
+ */
+ if (of_get_property(viodev->dev.archdata.of_node,
+ "ibm,my-dma-window", NULL)) {
+ /* Check that the driver is CMO enabled and get desired DMA */
+ if (!viodrv->get_desired_dma) {
+ dev_err(dev, "%s: device driver does not support CMO\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ viodev->cmo.desired = viodrv->get_desired_dma(viodev);
+ if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
+ viodev->cmo.desired = VIO_CMO_MIN_ENT;
+ size = VIO_CMO_MIN_ENT;
+
+ dev_ent = kmalloc(sizeof(struct vio_cmo_dev_entry),
+ GFP_KERNEL);
+ if (!dev_ent)
+ return -ENOMEM;
+
+ dev_ent->viodev = viodev;
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ list_add(&dev_ent->list, &vio_cmo.device_list);
+ } else {
+ viodev->cmo.desired = 0;
+ size = 0;
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ }
+
+ /*
+ * If the needs for vio_cmo.min have not changed since they
+ * were last set, the number of devices in the OF tree has
+ * been constant and the IO memory for this is already in
+ * the reserve pool.
+ */
+ if (vio_cmo.min == ((vio_cmo_num_OF_devs() + 1) *
+ VIO_CMO_MIN_ENT)) {
+ /* Update desired entitlement if device requires it */
+ if (size)
+ vio_cmo.desired += (viodev->cmo.desired -
+ VIO_CMO_MIN_ENT);
+ } else {
+ size_t tmp;
+
+ tmp = vio_cmo.spare + vio_cmo.excess.free;
+ if (tmp < size) {
+ dev_err(dev, "%s: insufficient free "
+ "entitlement to add device. "
+ "Need %lu, have %lu\n", __func__,
+ size, tmp);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return -ENOMEM;
+ }
+
+ /* Use excess pool first to fulfill request */
+ tmp = min(size, vio_cmo.excess.free);
+ vio_cmo.excess.free -= tmp;
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.reserve.size += tmp;
+
+ /* Use spare if excess pool was insufficient */
+ vio_cmo.spare -= size - tmp;
+
+ /* Update bus accounting */
+ vio_cmo.min += size;
+ vio_cmo.desired += viodev->cmo.desired;
+ }
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return 0;
+}
+
+/**
+ * vio_cmo_bus_remove - Handle CMO specific bus removal activities
+ *
+ * @viodev - Pointer to struct vio_dev for device
+ *
+ * Remove the device from the cmo device list. The minimum entitlement
+ * will be reserved for the device as long as it is in the system. The
+ * rest of the entitlement the device had been allocated will be returned
+ * to the system.
+ */
+static void vio_cmo_bus_remove(struct vio_dev *viodev)
+{
+ struct vio_cmo_dev_entry *dev_ent;
+ unsigned long flags;
+ size_t tmp;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ if (viodev->cmo.allocated) {
+ dev_err(&viodev->dev, "%s: device had %lu bytes of IO "
+ "allocated after remove operation.\n",
+ __func__, viodev->cmo.allocated);
+ BUG();
+ }
+
+ /*
+ * Remove the device from the device list being maintained for
+ * CMO enabled devices.
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+ if (viodev == dev_ent->viodev) {
+ list_del(&dev_ent->list);
+ kfree(dev_ent);
+ break;
+ }
+
+ /*
+ * Devices may not require any entitlement and they do not need
+ * to be processed. Otherwise, return the device's entitlement
+ * back to the pools.
+ */
+ if (viodev->cmo.entitled) {
+ /*
+ * This device has not yet left the OF tree; its
+ * minimum entitlement remains in vio_cmo.min and
+ * vio_cmo.desired
+ */
+ vio_cmo.desired -= (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+
+ /*
+ * Save min allocation for device in reserve as long
+ * as it exists in OF tree as determined by later
+ * balance operation
+ */
+ viodev->cmo.entitled -= VIO_CMO_MIN_ENT;
+
+ /* Replenish spare from freed reserve pool */
+ if (viodev->cmo.entitled && (vio_cmo.spare < VIO_CMO_MIN_ENT)) {
+ tmp = min(viodev->cmo.entitled, (VIO_CMO_MIN_ENT -
+ vio_cmo.spare));
+ vio_cmo.spare += tmp;
+ viodev->cmo.entitled -= tmp;
+ }
+
+ /* Remaining reserve goes to excess pool */
+ vio_cmo.excess.size += viodev->cmo.entitled;
+ vio_cmo.excess.free += viodev->cmo.entitled;
+ vio_cmo.reserve.size -= viodev->cmo.entitled;
+
+ /*
+ * Until the device is removed it will keep a
+ * minimum entitlement; this will guarantee that
+ * a module unload/load will result in a success.
+ */
+ viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+ viodev->cmo.desired = VIO_CMO_MIN_ENT;
+ atomic_set(&viodev->cmo.allocs_failed, 0);
+ }
+
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+static void vio_cmo_set_dma_ops(struct vio_dev *viodev)
+{
+ vio_dma_mapping_ops.dma_supported = dma_iommu_ops.dma_supported;
+ viodev->dev.archdata.dma_ops = &vio_dma_mapping_ops;
+}
+
+/**
+ * vio_cmo_bus_init - CMO entitlement initialization at bus init time
+ *
+ * Set up the reserve and excess entitlement pools based on available
+ * system entitlement and the number of devices in the OF tree that
+ * require entitlement in the reserve pool.
+ */
+static void vio_cmo_bus_init(void)
+{
+ struct hvcall_mpp_data mpp_data;
+ int err;
+
+ memset(&vio_cmo, 0, sizeof(struct vio_cmo));
+ spin_lock_init(&vio_cmo.lock);
+ INIT_LIST_HEAD(&vio_cmo.device_list);
+ INIT_DELAYED_WORK(&vio_cmo.balance_q, vio_cmo_balance);
+
+ /* Get current system entitlement */
+ err = h_get_mpp(&mpp_data);
+
+ /*
+ * On failure, continue with entitlement set to 0, will panic()
+ * later when spare is reserved.
+ */
+ if (err != H_SUCCESS) {
+ printk(KERN_ERR "%s: unable to determine system IO "
+ "entitlement. (%d)\n", __func__, err);
+ vio_cmo.entitled = 0;
+ } else {
+ vio_cmo.entitled = mpp_data.entitled_mem;
+ }
+
+ /* Set reservation and check against entitlement */
+ vio_cmo.spare = VIO_CMO_MIN_ENT;
+ vio_cmo.reserve.size = vio_cmo.spare;
+ vio_cmo.reserve.size += (vio_cmo_num_OF_devs() *
+ VIO_CMO_MIN_ENT);
+ if (vio_cmo.reserve.size > vio_cmo.entitled) {
+ printk(KERN_ERR "%s: insufficient system entitlement\n",
+ __func__);
+ panic("%s: Insufficient system entitlement", __func__);
+ }
+
+ /* Set the remaining accounting variables */
+ vio_cmo.excess.size = vio_cmo.entitled - vio_cmo.reserve.size;
+ vio_cmo.excess.free = vio_cmo.excess.size;
+ vio_cmo.min = vio_cmo.reserve.size;
+ vio_cmo.desired = vio_cmo.reserve.size;
+}
+
+/* sysfs device functions and data structures for CMO */
+
+#define viodev_cmo_rd_attr(name) \
+static ssize_t viodev_cmo_##name##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", to_vio_dev(dev)->cmo.name); \
+}
+
+static ssize_t viodev_cmo_allocs_failed_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ return sprintf(buf, "%d\n", atomic_read(&viodev->cmo.allocs_failed));
+}
+
+static ssize_t viodev_cmo_allocs_failed_reset(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ atomic_set(&viodev->cmo.allocs_failed, 0);
+ return count;
+}
+
+static ssize_t viodev_cmo_desired_set(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ size_t new_desired;
+ int ret;
+
+ ret = strict_strtoul(buf, 10, &new_desired);
+ if (ret)
+ return ret;
+
+ vio_cmo_set_dev_desired(viodev, new_desired);
+ return count;
+}
+
+viodev_cmo_rd_attr(desired);
+viodev_cmo_rd_attr(entitled);
+viodev_cmo_rd_attr(allocated);
+
+static ssize_t name_show(struct device *, struct device_attribute *, char *);
+static ssize_t devspec_show(struct device *, struct device_attribute *, char *);
+static struct device_attribute vio_cmo_dev_attrs[] = {
+ __ATTR_RO(name),
+ __ATTR_RO(devspec),
+ __ATTR(cmo_desired, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+ viodev_cmo_desired_show, viodev_cmo_desired_set),
+ __ATTR(cmo_entitled, S_IRUGO, viodev_cmo_entitled_show, NULL),
+ __ATTR(cmo_allocated, S_IRUGO, viodev_cmo_allocated_show, NULL),
+ __ATTR(cmo_allocs_failed, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+ viodev_cmo_allocs_failed_show, viodev_cmo_allocs_failed_reset),
+ __ATTR_NULL
+};
+
+/* sysfs bus functions and data structures for CMO */
+
+#define viobus_cmo_rd_attr(name) \
+static ssize_t \
+viobus_cmo_##name##_show(struct bus_type *bt, char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", vio_cmo.name); \
+}
+
+#define viobus_cmo_pool_rd_attr(name, var) \
+static ssize_t \
+viobus_cmo_##name##_pool_show_##var(struct bus_type *bt, char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", vio_cmo.name.var); \
+}
+
+static ssize_t viobus_cmo_high_reset(struct bus_type *bt, const char *buf,
+ size_t count)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ vio_cmo.high = vio_cmo.curr;
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+
+ return count;
+}
+
+viobus_cmo_rd_attr(entitled);
+viobus_cmo_pool_rd_attr(reserve, size);
+viobus_cmo_pool_rd_attr(excess, size);
+viobus_cmo_pool_rd_attr(excess, free);
+viobus_cmo_rd_attr(spare);
+viobus_cmo_rd_attr(min);
+viobus_cmo_rd_attr(desired);
+viobus_cmo_rd_attr(curr);
+viobus_cmo_rd_attr(high);
+
+static struct bus_attribute vio_cmo_bus_attrs[] = {
+ __ATTR(cmo_entitled, S_IRUGO, viobus_cmo_entitled_show, NULL),
+ __ATTR(cmo_reserve_size, S_IRUGO, viobus_cmo_reserve_pool_show_size, NULL),
+ __ATTR(cmo_excess_size, S_IRUGO, viobus_cmo_excess_pool_show_size, NULL),
+ __ATTR(cmo_excess_free, S_IRUGO, viobus_cmo_excess_pool_show_free, NULL),
+ __ATTR(cmo_spare, S_IRUGO, viobus_cmo_spare_show, NULL),
+ __ATTR(cmo_min, S_IRUGO, viobus_cmo_min_show, NULL),
+ __ATTR(cmo_desired, S_IRUGO, viobus_cmo_desired_show, NULL),
+ __ATTR(cmo_curr, S_IRUGO, viobus_cmo_curr_show, NULL),
+ __ATTR(cmo_high, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+ viobus_cmo_high_show, viobus_cmo_high_reset),
+ __ATTR_NULL
+};
+
+static void vio_cmo_sysfs_init(void)
+{
+ vio_bus_type.dev_attrs = vio_cmo_dev_attrs;
+ vio_bus_type.bus_attrs = vio_cmo_bus_attrs;
+}
+#else /* CONFIG_PPC_PSERIES */
+/* Dummy functions for iSeries platform */
+int vio_cmo_entitlement_update(size_t new_entitlement) { return 0; }
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired) {}
+static int vio_cmo_bus_probe(struct vio_dev *viodev) { return 0; }
+static void vio_cmo_bus_remove(struct vio_dev *viodev) {}
+static void vio_cmo_set_dma_ops(struct vio_dev *viodev) {}
+static void vio_cmo_bus_init(void) {}
+static void vio_cmo_sysfs_init(void) {}
+#endif /* CONFIG_PPC_PSERIES */
+EXPORT_SYMBOL(vio_cmo_entitlement_update);
+EXPORT_SYMBOL(vio_cmo_set_dev_desired);
+
static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
{
const unsigned char *dma_window;
@@ -114,8 +1096,17 @@ static int vio_bus_probe(struct device *
return error;

id = vio_match_device(viodrv->id_table, viodev);
- if (id)
+ if (id) {
+ memset(&viodev->cmo, 0, sizeof(viodev->cmo));
+ if (firmware_has_feature(FW_FEATURE_CMO)) {
+ error = vio_cmo_bus_probe(viodev);
+ if (error)
+ return error;
+ }
error =3D viodrv->probe(viodev, id);
+ if (error && firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_bus_remove(viodev);
+ }

return error;
}
@@ -125,12 +1116,23 @@ static int vio_bus_remove(struct device
{
struct vio_dev *viodev = to_vio_dev(dev);
struct vio_driver *viodrv = to_vio_driver(dev->driver);
+ struct device *devptr;
+ int ret = 1;
+
+ /*
+ * Hold a reference to the device after the remove function is called
+ * to allow for CMO accounting cleanup for the device.
+ */
+ devptr = get_device(dev);

if (viodrv->remove)
- return viodrv->remove(viodev);
+ ret = viodrv->remove(viodev);

- /* driver can't remove */
- return 1;
+ if (!ret && firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_bus_remove(viodev);
+
+ put_device(devptr);
+ return ret;
}

/**
@@ -215,7 +1217,11 @@ struct vio_dev *vio_register_device_node
viodev->unit_address = *unit_address;
}
viodev->dev.archdata.of_node = of_node_get(of_node);
- viodev->dev.archdata.dma_ops = &dma_iommu_ops;
+
+ if (firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_set_dma_ops(viodev);
+ else
+ viodev->dev.archdata.dma_ops = &dma_iommu_ops;
viodev->dev.archdata.dma_data = vio_build_iommu_table(viodev);
viodev->dev.archdata.numa_node = of_node_to_nid(of_node);

@@ -245,6 +1251,9 @@ static int __init vio_bus_init(void)
int err;
struct device_node *node_vroot;

+ if (firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_sysfs_init();
+
err = bus_register(&vio_bus_type);
if (err) {
printk(KERN_ERR "failed to register VIO bus\n");
@@ -262,6 +1271,9 @@ static int __init vio_bus_init(void)
return err;
}

+ if (firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_bus_init();
+
node_vroot = of_find_node_by_name(NULL, "vdevice");
if (node_vroot) {
struct device_node *of_node;
Index: b/include/asm-powerpc/vio.h
===================================================================
--- a/include/asm-powerpc/vio.h
+++ b/include/asm-powerpc/vio.h
@@ -39,16 +39,34 @@
#define VIO_IRQ_DISABLE 0UL
#define VIO_IRQ_ENABLE 1UL

+/*
+ * VIO CMO minimum entitlement for all devices and spare entitlement
+ */
+#define VIO_CMO_MIN_ENT 1562624
+
struct iommu_table;

-/*
- * The vio_dev structure is used to describe virtual I/O devices.
+/**
+ * vio_dev - This structure is used to describe virtual I/O devices.
+ *
+ * @desired: set from return of driver's get_desired_dma() function
+ * @entitled: bytes of IO data that has been reserved for this device.
+ * @entitled_target: Target IO data allocation for device. This may be set
+ * lower than entitled to enable balancing; can not be larger than entitled.
+ * @allocated: bytes of IO data currently in use by the device.
+ * @allocs_failed: number of DMA failures due to insufficient entitlement.
*/
struct vio_dev {
const char *name;
const char *type;
uint32_t unit_address;
unsigned int irq;
+ struct {
+ size_t desired;
+ size_t entitled;
+ size_t allocated;
+ atomic_t allocs_failed;
+ } cmo;
struct device dev;
};

@@ -56,12 +74,19 @@ struct vio_driver {
const struct vio_device_id *id_table;
int (*probe)(struct vio_dev *dev, const struct vio_device_id *id);
int (*remove)(struct vio_dev *dev);
+ /* A driver must have a get_desired_dma() function to
+ * be loaded in a CMO environment if it uses DMA.
+ */
+ unsigned long (*get_desired_dma)(struct vio_dev *dev);
struct device_driver driver;
};

extern int vio_register_driver(struct vio_driver *drv);
extern void vio_unregister_driver(struct vio_driver *drv);

+extern int vio_cmo_entitlement_update(size_t);
+extern void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired);
+
extern void __devinit vio_unregister_device(struct vio_dev *dev);

struct device_node;
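The DMA wrappers in the vio bus patch above (vio_dma_iommu_map_single() and its siblings) all follow one pattern: charge the page-rounded size against the device's entitlement, attempt the real mapping, and undo the charge if the mapping fails. The following is a minimal user-space sketch of that pattern only. The toy_* names, the 4096-byte page size, and the simple "allocated must not exceed entitled" check are assumptions made for illustration; the real vio_cmo_alloc() is defined elsewhere in the patch and is not reproduced here.

```c
#include <stddef.h>

#define TOY_PAGE_SIZE 4096UL	/* assumed page size, illustration only */
#define TOY_ROUNDUP(x, y) ((((x) + (y) - 1) / (y)) * (y))

/* Hypothetical, simplified per-device accounting (not the kernel struct). */
struct toy_dev {
	size_t entitled;	/* bytes the device may have mapped */
	size_t allocated;	/* bytes currently charged */
	unsigned int allocs_failed;
};

/* Stand-in for vio_cmo_alloc(): 0 on success, -1 when over budget. */
static int toy_charge(struct toy_dev *d, size_t size)
{
	if (d->allocated + size > d->entitled)
		return -1;
	d->allocated += size;
	return 0;
}

/* Stand-in for vio_cmo_dealloc(). */
static void toy_uncharge(struct toy_dev *d, size_t size)
{
	d->allocated -= size;
}

/* Stand-in for the underlying IOMMU mapping; fails on request. */
static int toy_iommu_map(int should_fail)
{
	return should_fail ? -1 : 0;
}

/* Same shape as the wrappers in the patch: charge, map, roll back. */
static int toy_map_single(struct toy_dev *d, size_t size, int iommu_fails)
{
	size_t charged = TOY_ROUNDUP(size, TOY_PAGE_SIZE);

	if (toy_charge(d, charged)) {
		d->allocs_failed++;
		return -1;
	}
	if (toy_iommu_map(iommu_fails)) {
		/* The mapping failed after the charge: give it back. */
		toy_uncharge(d, charged);
		d->allocs_failed++;
		return -1;
	}
	return 0;
}

/* Unmap releases exactly the page-rounded amount that was charged. */
static void toy_unmap_single(struct toy_dev *d, size_t size)
{
	toy_uncharge(d, TOY_ROUNDUP(size, TOY_PAGE_SIZE));
}
```

Because both the map and unmap paths round to the same page multiple, the charged total returns to its starting value once every mapping has been released, which is the invariant vio_cmo_bus_remove() checks before it calls BUG().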
* [PATCH 12/16 v3] powerpc: Verify CMO memory entitlement updates with virtual I/O
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (12 preceding siblings ...)
2008-07-04 12:55 ` [PATCH 11/16 v3] powerpc: vio bus support " Robert Jennings
@ 2008-07-04 12:55 ` Robert Jennings
2008-07-04 12:55 ` [PATCH 13/16 v3] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
` (3 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:55 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Nathan Fontenot <nfont@austin.ibm.com>
Verify memory entitlement updates can be handled by vio.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 10 ++++++++++
1 file changed, 10 insertions(+)
Index: b/arch/powerpc/kernel/lparcfg.c
===================================================================
--- a/arch/powerpc/kernel/lparcfg.c
+++ b/arch/powerpc/kernel/lparcfg.c
@@ -34,6 +34,7 @@
#include <asm/time.h>
#include <asm/prom.h>
#include <asm/vdso_datapage.h>
+#include <asm/vio.h>

#define MODULE_VERS "1.7"
#define MODULE_NAME "lparcfg"
@@ -528,6 +529,15 @@ static ssize_t update_mpp(u64 *entitleme
u8 new_weight;
ssize_t rc;

+ if (entitlement) {
+ /* Check with vio to ensure the new memory entitlement
+ * can be handled.
+ */
+ rc = vio_cmo_entitlement_update(*entitlement);
+ if (rc)
+ return rc;
+ }
+
rc = h_get_mpp(&mpp_data);
if (rc)
return rc;
* [PATCH 13/16 v3] ibmveth: Automatically enable larger rx buffer pools for larger mtu
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (13 preceding siblings ...)
2008-07-04 12:55 ` [PATCH 12/16 v3] powerpc: Verify CMO memory entitlement updates with virtual I/O Robert Jennings
@ 2008-07-04 12:55 ` Robert Jennings
2008-07-04 12:56 ` [PATCH 14/16 v3] ibmveth: enable driver for CMO Robert Jennings
` (2 subsequent siblings)
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:55 UTC (permalink / raw)
To: paulus; +Cc: netdev, linuxppc-dev, David Darrington, Brian King
From: Santiago Leon <santil@us.ibm.com>
Activate larger rx buffer pools when the MTU is changed to a larger
value, and deactivate the large rx buffer pools when the MTU is changed
to a smaller value.
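The selection logic in the diff below can be sketched in isolation as follows: first confirm that some pool can hold the new MTU plus overhead, then deactivate every pool, then reactivate pools in order until one is large enough. This is a hedged, stand-alone model; struct toy_pool, toy_select_pools(), and the pool sizes used with it are hypothetical, and the buffer freeing and the close/open cycle done by the driver are left out.

```c
#include <stddef.h>

/* Hypothetical stand-in for one rx buffer pool. */
struct toy_pool {
	int buff_size;	/* bytes per buffer in this pool */
	int active;	/* 1 if the pool is in use */
};

/*
 * Returns the index of the first pool that can hold mtu + overhead,
 * or -1 if no pool is large enough. On success pools 0..index are
 * active and all later pools are inactive. On failure the pools are
 * left untouched. Pools are assumed to be sorted by ascending size.
 */
static int toy_select_pools(struct toy_pool *pools, int n,
			    int mtu, int overhead)
{
	int need = mtu + overhead;
	int i;

	/* Reject the request before changing any state. */
	for (i = 0; i < n; i++)
		if (need < pools[i].buff_size)
			break;
	if (i == n)
		return -1;

	/* Deactivate everything so only the needed pools come back. */
	for (i = 0; i < n; i++)
		pools[i].active = 0;

	/* Activate pools in order until one can hold the new MTU. */
	for (i = 0; i < n; i++) {
		pools[i].active = 1;
		if (need < pools[i].buff_size)
			return i;
	}
	return -1;	/* not reached: the first loop found a fit */
}
```

With three pools of 512, 2048, and 16384 bytes (illustrative sizes) and an overhead of 22 bytes, an MTU of 1500 leaves the first two pools active, while an MTU of 9000 activates all three.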
Signed-off-by: Santiago Leon <santil@us.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
drivers/net/ibmveth.c | 20 +++++++++++++-------
1 file changed, 13 insertions(+), 7 deletions(-)
Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -1054,7 +1054,6 @@ static int ibmveth_change_mtu(struct net
{
struct ibmveth_adapter *adapter = dev->priv;
int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
- int reinit = 0;
int i, rc;

if (new_mtu < IBMVETH_MAX_MTU)
@@ -1067,15 +1066,21 @@ static int ibmveth_change_mtu(struct net
if (i == IbmVethNumBufferPools)
return -EINVAL;

+ /* Deactivate all the buffer pools so that the next loop can activate
+ only the buffer pools necessary to hold the new MTU */
+ for (i = 0; i < IbmVethNumBufferPools; i++)
+ if (adapter->rx_buff_pool[i].active) {
+ ibmveth_free_buffer_pool(adapter,
+ &adapter->rx_buff_pool[i]);
+ adapter->rx_buff_pool[i].active = 0;
+ }
+
/* Look for an active buffer pool that can hold the new MTU */
for(i = 0; i<IbmVethNumBufferPools; i++) {
- if (!adapter->rx_buff_pool[i].active) {
- adapter->rx_buff_pool[i].active = 1;
- reinit = 1;
- }
+ adapter->rx_buff_pool[i].active = 1;

if (new_mtu_oh < adapter->rx_buff_pool[i].buff_size) {
- if (reinit && netif_running(adapter->netdev)) {
+ if (netif_running(adapter->netdev)) {
adapter->pool_config = 1;
ibmveth_close(adapter->netdev);
adapter->pool_config = 0;
@@ -1402,14 +1407,15 @@ const char * buf, size_t count)
return -EPERM;
}

- pool->active = 0;
if (netif_running(netdev)) {
adapter->pool_config = 1;
ibmveth_close(netdev);
+ pool->active = 0;
adapter->pool_config = 0;
if ((rc = ibmveth_open(netdev)))
return rc;
}
+ pool->active = 0;
}
} else if (attr == &veth_num_attr) {
if (value <= 0 || value > IBMVETH_MAX_POOL_COUNT)
* [PATCH 14/16 v3] ibmveth: enable driver for CMO
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (14 preceding siblings ...)
2008-07-04 12:55 ` [PATCH 13/16 v3] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
@ 2008-07-04 12:56 ` Robert Jennings
2008-07-08 20:38 ` [PATCH 14/16 v3] [v2] " Robert Jennings
2008-07-04 12:56 ` [PATCH 15/16 v3] ibmvscsi: driver enablement " Robert Jennings
2008-07-04 12:57 ` [PATCH 16/16 v3] powerpc: Update arch vector to indicate support " Robert Jennings
17 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:56 UTC (permalink / raw)
To: paulus; +Cc: netdev, linuxppc-dev, David Darrington, Brian King
Enable ibmveth for Cooperative Memory Overcommitment (CMO). For this driver
that means calculating a desired amount of IO memory based on the current MTU
and updating this value with the bus when the MTU changes. Because DMA
mappings can fail, a bounce buffer is added for the temporary cases where
the driver cannot map IO memory for the buffer pool.
The following changes are made to enable the driver for CMO:
* DMA mapping errors will not result in error messages if entitlement has
  been exceeded and resources were not available.
* DMA mapping errors are handled gracefully: ibmveth_replenish_buffer_pool()
  is corrected to check the return from dma_map_single() and fail gracefully.
* The driver defines a get_desired_dma function so that it can operate
  in a CMO environment.
* When the MTU is changed, the driver updates the device IO entitlement.
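The bounce-buffer fallback described above can be sketched on its own: when mapping the packet data fails, the data is copied into a buffer that was allocated and mapped once at open time, and that buffer's address is handed to the device instead. The sketch below is a user-space model under stated assumptions; struct toy_adapter, toy_map(), toy_tx_addr(), and TOY_DMA_ERROR are hypothetical stand-ins, not the driver's types or the kernel DMA API.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_DMA_ERROR ((uintptr_t)-1)	/* stand-in for DMA_ERROR_CODE */

/* Hypothetical adapter state holding the pre-mapped bounce buffer. */
struct toy_adapter {
	unsigned char bounce[2048];	/* sized at open time in the driver */
	uintptr_t bounce_dma;		/* mapping made once at open time */
	unsigned int tx_map_failed;
};

/* Hypothetical mapper: fails on request, else returns a fake address. */
static uintptr_t toy_map(const void *data, int fail)
{
	return fail ? TOY_DMA_ERROR : (uintptr_t)data;
}

/*
 * Pick the address to hand to the device. *used_bounce tells the
 * caller whether a per-packet unmap is still owed afterwards.
 * The caller must ensure len <= sizeof(a->bounce).
 */
static uintptr_t toy_tx_addr(struct toy_adapter *a, const void *data,
			     size_t len, int map_fails, int *used_bounce)
{
	uintptr_t addr = toy_map(data, map_fails);

	*used_bounce = 0;
	if (addr == TOY_DMA_ERROR) {
		/* Mapping failed: copy into the pre-mapped buffer. */
		memcpy(a->bounce, data, len);
		a->tx_map_failed++;
		*used_bounce = 1;
		return a->bounce_dma;
	}
	return addr;
}
```

The used_bounce flag matters on the way out: a packet that went through the bounce buffer has no per-packet mapping to release, so the caller skips the unmap in that case, as ibmveth_start_xmit() does in the diff below.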
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Santiago Leon <santil@us.ibm.com>
---
drivers/net/ibmveth.c | 169 ++++++++++++++++++++++++++++++++++++++++----------
drivers/net/ibmveth.h | 5 +
2 files changed, 140 insertions(+), 34 deletions(-)
Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -33,6 +33,7 @@
*/
#include <linux/module.h>
+#include <linux/moduleparam.h>
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/ioport.h>
@@ -52,7 +53,9 @@
#include <asm/hvcall.h>
#include <asm/atomic.h>
#include <asm/vio.h>
+#include <asm/iommu.h>
#include <asm/uaccess.h>
+#include <asm/firmware.h>
#include <linux/seq_file.h>
#include "ibmveth.h"
@@ -94,8 +97,10 @@ static void ibmveth_proc_register_adapte
static void ibmveth_proc_unregister_adapter(struct ibmveth_adapter *adapter);
static irqreturn_t ibmveth_interrupt(int irq, void *dev_instance);
static void ibmveth_rxq_harvest_buffer(struct ibmveth_adapter *adapter);
+static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev);
static struct kobj_type ktype_veth_pool;
+
#ifdef CONFIG_PROC_FS
#define IBMVETH_PROC_DIR "ibmveth"
static struct proc_dir_entry *ibmveth_proc_dir;
@@ -226,16 +231,16 @@ static void ibmveth_replenish_buffer_poo
u32 i;
u32 count = pool->size - atomic_read(&pool->available);
u32 buffers_added = 0;
+ struct sk_buff *skb;
+ unsigned int free_index, index;
+ u64 correlator;
+ unsigned long lpar_rc;
+ dma_addr_t dma_addr;
mb();
for(i = 0; i < count; ++i) {
- struct sk_buff *skb;
- unsigned int free_index, index;
- u64 correlator;
union ibmveth_buf_desc desc;
- unsigned long lpar_rc;
- dma_addr_t dma_addr;
skb = alloc_skb(pool->buff_size, GFP_ATOMIC);
@@ -255,6 +260,9 @@ static void ibmveth_replenish_buffer_poo
dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
pool->buff_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(dma_addr))
+ goto failure;
+
pool->free_map[free_index] = IBM_VETH_INVALID_MAP;
pool->dma_addr[index] = dma_addr;
pool->skbuff[index] = skb;
@@ -267,20 +275,9 @@ static void ibmveth_replenish_buffer_poo
lpar_rc = h_add_logical_lan_buffer(adapter->vdev->unit_address, desc.desc);
- if(lpar_rc != H_SUCCESS) {
- pool->free_map[free_index] = index;
- pool->skbuff[index] = NULL;
- if (pool->consumer_index == 0)
- pool->consumer_index = pool->size - 1;
- else
- pool->consumer_index--;
- dma_unmap_single(&adapter->vdev->dev,
- pool->dma_addr[index], pool->buff_size,
- DMA_FROM_DEVICE);
- dev_kfree_skb_any(skb);
- adapter->replenish_add_buff_failure++;
- break;
- } else {
+ if (lpar_rc != H_SUCCESS)
+ goto failure;
+ else {
buffers_added++;
adapter->replenish_add_buff_success++;
}
@@ -288,6 +285,24 @@ static void ibmveth_replenish_buffer_poo
mb();
atomic_add(buffers_added, &(pool->available));
+ return;
+
+failure:
+ pool->free_map[free_index] = index;
+ pool->skbuff[index] = NULL;
+ if (pool->consumer_index == 0)
+ pool->consumer_index = pool->size - 1;
+ else
+ pool->consumer_index--;
+ if (!dma_mapping_error(dma_addr))
+ dma_unmap_single(&adapter->vdev->dev,
+ pool->dma_addr[index], pool->buff_size,
+ DMA_FROM_DEVICE);
+ dev_kfree_skb_any(skb);
+ adapter->replenish_add_buff_failure++;
+
+ mb();
+ atomic_add(buffers_added, &(pool->available));
}
/* replenish routine */
@@ -297,7 +312,7 @@ static void ibmveth_replenish_task(struc
adapter->replenish_task_cycles++;
- for(i = 0; i < IbmVethNumBufferPools; i++)
+ for (i = (IbmVethNumBufferPools - 1); i >= 0; i--)
if(adapter->rx_buff_pool[i].active)
ibmveth_replenish_buffer_pool(adapter,
&adapter->rx_buff_pool[i]);
@@ -472,6 +487,18 @@ static void ibmveth_cleanup(struct ibmve
if (adapter->rx_buff_pool[i].active)
ibmveth_free_buffer_pool(adapter,
&adapter->rx_buff_pool[i]);
+
+ if (adapter->bounce_buffer != NULL) {
+ if (!dma_mapping_error(adapter->bounce_buffer_dma)) {
+ dma_unmap_single(&adapter->vdev->dev,
+ adapter->bounce_buffer_dma,
+ adapter->netdev->mtu + IBMVETH_BUFF_OH,
+ DMA_BIDIRECTIONAL);
+ adapter->bounce_buffer_dma = DMA_ERROR_CODE;
+ }
+ kfree(adapter->bounce_buffer);
+ adapter->bounce_buffer = NULL;
+ }
}
static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter,
@@ -607,6 +634,24 @@ static int ibmveth_open(struct net_devic
return rc;
}
+ adapter->bounce_buffer =
+ kmalloc(netdev->mtu + IBMVETH_BUFF_OH, GFP_KERNEL);
+ if (!adapter->bounce_buffer) {
+ ibmveth_error_printk("unable to allocate bounce buffer\n");
+ ibmveth_cleanup(adapter);
+ napi_disable(&adapter->napi);
+ return -ENOMEM;
+ }
+ adapter->bounce_buffer_dma =
+ dma_map_single(&adapter->vdev->dev, adapter->bounce_buffer,
+ netdev->mtu + IBMVETH_BUFF_OH, DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(adapter->bounce_buffer_dma)) {
+ ibmveth_error_printk("unable to map bounce buffer\n");
+ ibmveth_cleanup(adapter);
+ napi_disable(&adapter->napi);
+ return -ENOMEM;
+ }
+
ibmveth_debug_printk("initial replenish cycle\n");
ibmveth_interrupt(netdev->irq, netdev);
@@ -853,10 +898,12 @@ static int ibmveth_start_xmit(struct sk_
unsigned int tx_packets = 0;
unsigned int tx_send_failed = 0;
unsigned int tx_map_failed = 0;
+ int used_bounce = 0;
+ unsigned long data_dma_addr;
desc.fields.flags_len = IBMVETH_BUF_VALID | skb->len;
- desc.fields.address = dma_map_single(&adapter->vdev->dev, skb->data,
- skb->len, DMA_TO_DEVICE);
+ data_dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
+ skb->len, DMA_TO_DEVICE);
if (skb->ip_summed == CHECKSUM_PARTIAL &&
ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
@@ -875,12 +922,16 @@ static int ibmveth_start_xmit(struct sk_
buf[1] = 0;
}
- if (dma_mapping_error(desc.fields.address)) {
- ibmveth_error_printk("tx: unable to map xmit buffer\n");
+ if (dma_mapping_error(data_dma_addr)) {
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ ibmveth_error_printk("tx: unable to map xmit buffer\n");
+ skb_copy_from_linear_data(skb, adapter->bounce_buffer,
+ skb->len);
+ desc.fields.address = adapter->bounce_buffer_dma;
tx_map_failed++;
- tx_dropped++;
- goto out;
- }
+ used_bounce = 1;
+ } else
+ desc.fields.address = data_dma_addr;
/* send the frame. Arbitrarily set retrycount to 1024 */
correlator = 0;
@@ -904,8 +955,9 @@ static int ibmveth_start_xmit(struct sk_
netdev->trans_start = jiffies;
}
- dma_unmap_single(&adapter->vdev->dev, desc.fields.address,
- skb->len, DMA_TO_DEVICE);
+ if (!used_bounce)
+ dma_unmap_single(&adapter->vdev->dev, data_dma_addr,
+ skb->len, DMA_TO_DEVICE);
out: spin_lock_irqsave(&adapter->stats_lock, flags);
netdev->stats.tx_dropped += tx_dropped;
@@ -1053,8 +1105,9 @@ static void ibmveth_set_multicast_list(s
static int ibmveth_change_mtu(struct net_device *dev, int new_mtu)
{
struct ibmveth_adapter *adapter = dev->priv;
+ struct vio_dev *viodev = adapter->vdev;
int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
- int i, rc;
+ int i;
if (new_mtu < IBMVETH_MAX_MTU)
return -EINVAL;
@@ -1085,10 +1138,15 @@ static int ibmveth_change_mtu(struct net
ibmveth_close(adapter->netdev);
adapter->pool_config = 0;
dev->mtu = new_mtu;
- if ((rc = ibmveth_open(adapter->netdev)))
- return rc;
- } else
- dev->mtu = new_mtu;
+ vio_cmo_set_dev_desired(viodev,
+ ibmveth_get_desired_dma
+ (viodev));
+ return ibmveth_open(adapter->netdev);
+ }
+ dev->mtu = new_mtu;
+ vio_cmo_set_dev_desired(viodev,
+ ibmveth_get_desired_dma
+ (viodev));
return 0;
}
}
@@ -1103,6 +1161,46 @@ static void ibmveth_poll_controller(stru
}
#endif
+/**
+ * ibmveth_get_desired_dma - Calculate IO entitlement needed by the driver
+ *
+ * @vdev: struct vio_dev for the device whose entitlement is to be returned
+ *
+ * Return value:
+ * Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev)
+{
+ struct net_device *netdev = dev_get_drvdata(&vdev->dev);
+ struct ibmveth_adapter *adapter;
+ unsigned long ret;
+ int i;
+ int rxqentries = 1;
+
+ /* netdev inits at probe time along with the structures we need below*/
+ if (netdev == NULL)
+ return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+
+ adapter = netdev_priv(netdev);
+
+ ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
+ ret += IOMMU_PAGE_ALIGN(netdev->mtu);
+
+ for (i = 0; i < IbmVethNumBufferPools; i++) {
+ /* add the size of the active receive buffers */
+ if (adapter->rx_buff_pool[i].active)
+ ret +=
+ adapter->rx_buff_pool[i].size *
+ IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+ buff_size);
+ rxqentries += adapter->rx_buff_pool[i].size;
+ }
+ /* add the size of the receive queue entries */
+ ret += IOMMU_PAGE_ALIGN(rxqentries * sizeof(struct ibmveth_rx_q_entry));
+
+ return ret;
+}
+
static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
{
int rc, i;
@@ -1247,6 +1345,8 @@ static int __devexit ibmveth_remove(stru
ibmveth_proc_unregister_adapter(adapter);
free_netdev(netdev);
+ dev_set_drvdata(&dev->dev, NULL);
+
return 0;
}
@@ -1491,6 +1591,7 @@ static struct vio_driver ibmveth_driver
.id_table = ibmveth_device_table,
.probe = ibmveth_probe,
.remove = ibmveth_remove,
+ .get_desired_dma = ibmveth_get_desired_dma,
.driver = {
.name = ibmveth_driver_name,
.owner = THIS_MODULE,
Index: b/drivers/net/ibmveth.h
===================================================================
--- a/drivers/net/ibmveth.h
+++ b/drivers/net/ibmveth.h
@@ -93,9 +93,12 @@ static inline long h_illan_attributes(un
plpar_hcall_norets(H_CHANGE_LOGICAL_LAN_MAC, ua, mac)
#define IbmVethNumBufferPools 5
+#define IBMVETH_IO_ENTITLEMENT_DEFAULT 4243456 /* MTU of 1500 needs 4.2Mb */
#define IBMVETH_BUFF_OH 22 /* Overhead: 14 ethernet header + 8 opaque handle */
#define IBMVETH_MAX_MTU 68
#define IBMVETH_MAX_POOL_COUNT 4096
+#define IBMVETH_BUFF_LIST_SIZE 4096
+#define IBMVETH_FILT_LIST_SIZE 4096
#define IBMVETH_MAX_BUF_SIZE (1024 * 128)
static int pool_size[] = { 512, 1024 * 2, 1024 * 16, 1024 * 32, 1024 * 64 };
@@ -143,6 +146,8 @@ struct ibmveth_adapter {
struct ibmveth_rx_q rx_queue;
int pool_config;
int rx_csum;
+ void *bounce_buffer;
+ dma_addr_t bounce_buffer_dma;
/* adapter specific stats */
u64 replenish_task_cycles;
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 15/16 v3] ibmvscsi: driver enablement for CMO
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (15 preceding siblings ...)
2008-07-04 12:56 ` [PATCH 14/16 v3] ibmveth: enable driver for CMO Robert Jennings
@ 2008-07-04 12:56 ` Robert Jennings
2008-07-07 14:34 ` Brian King
2008-07-08 20:35 ` [PATCH 15/16 v3] [v2] " Robert Jennings
2008-07-04 12:57 ` [PATCH 16/16 v3] powerpc: Update arch vector to indicate support " Robert Jennings
17 siblings, 2 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:56 UTC (permalink / raw)
To: paulus; +Cc: linux-scsi, linuxppc-dev, David Darrington, Brian King
From: Robert Jennings <rcj@linux.vnet.ibm.com>
Enable the driver to function in a Cooperative Memory Overcommitment (CMO)
environment.
The following changes are made to enable the driver for CMO:
* DMA mapping errors no longer produce error messages when entitlement has
been exceeded and resources are not available.
* The driver defines a get_desired_dma function for use in a CMO
environment. It reports how much IO memory the driver would like in order
to function well.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
drivers/scsi/ibmvscsi/ibmvscsi.c | 46 +++++++++++++++++++++++++++++++++------
drivers/scsi/ibmvscsi/ibmvscsi.h | 2 ++
2 files changed, 41 insertions(+), 7 deletions(-)
Index: b/drivers/scsi/ibmvscsi/ibmvscsi.c
===================================================================
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -72,6 +72,8 @@
#include <linux/delay.h>
#include <asm/firmware.h>
#include <asm/vio.h>
+#include <asm/firmware.h>
+#include <asm/iommu.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>
@@ -426,8 +428,10 @@ static int map_sg_data(struct scsi_cmnd
SG_ALL * sizeof(struct srp_direct_buf),
&evt_struct->ext_list_token, 0);
if (!evt_struct->ext_list) {
- sdev_printk(KERN_ERR, cmd->device,
- "Can't allocate memory for indirect table\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ sdev_printk(KERN_ERR, cmd->device,
+ "Can't allocate memory "
+ "for indirect table\n");
return 0;
}
}
@@ -743,7 +747,9 @@ static int ibmvscsi_queuecommand(struct
srp_cmd->lun = ((u64) lun) << 48;
if (!map_data_for_srp_cmd(cmnd, evt_struct, srp_cmd, hostdata->dev)) {
- sdev_printk(KERN_ERR, cmnd->device, "couldn't convert cmd to srp_cmd\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ sdev_printk(KERN_ERR, cmnd->device,
+ "couldn't convert cmd to srp_cmd\n");
free_event_struct(&hostdata->pool, evt_struct);
return SCSI_MLQUEUE_HOST_BUSY;
}
@@ -855,7 +861,10 @@ static void send_mad_adapter_info(struct
DMA_BIDIRECTIONAL);
if (dma_mapping_error(req->buffer)) {
- dev_err(hostdata->dev, "Unable to map request_buffer for adapter_info!\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ dev_err(hostdata->dev,
+ "Unable to map request_buffer for "
+ "adapter_info!\n");
free_event_struct(&hostdata->pool, evt_struct);
return;
}
@@ -1400,7 +1409,9 @@ static int ibmvscsi_do_host_config(struc
DMA_BIDIRECTIONAL);
if (dma_mapping_error(host_config->buffer)) {
- dev_err(hostdata->dev, "dma_mapping error getting host config\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ dev_err(hostdata->dev,
+ "dma_mapping error getting host config\n");
free_event_struct(&hostdata->pool, evt_struct);
return -1;
}
@@ -1604,7 +1615,7 @@ static struct scsi_host_template driver_
.eh_host_reset_handler = ibmvscsi_eh_host_reset_handler,
.slave_configure = ibmvscsi_slave_configure,
.change_queue_depth = ibmvscsi_change_queue_depth,
- .cmd_per_lun = 16,
+ .cmd_per_lun = IBMVSCSI_CMDS_PER_LUN_DEFAULT,
.can_queue = IBMVSCSI_MAX_REQUESTS_DEFAULT,
.this_id = -1,
.sg_tablesize = SG_ALL,
@@ -1613,6 +1624,26 @@ static struct scsi_host_template driver_
};
/**
+ * ibmvscsi_get_desired_dma - Calculate IO entitlement needed by the driver
+ *
+ * @vdev: struct vio_dev for the device whose entitlement is to be returned
+ *
+ * Return value:
+ * Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmvscsi_get_desired_dma(struct vio_dev *vdev)
+{
+ /* iu_storage data allocated in initialize_event_pool */
+ unsigned long io_entitlement = max_requests * sizeof(union viosrp_iu);
+
+ /* add io space for sg data */
+ io_entitlement += (IBMVSCSI_MAX_SECTORS_DEFAULT *
+ IBMVSCSI_CMDS_PER_LUN_DEFAULT);
+
+ return IOMMU_PAGE_ALIGN(io_entitlement);
+}
+
+/**
* Called by bus code for each adapter
*/
static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
@@ -1641,7 +1672,7 @@ static int ibmvscsi_probe(struct vio_dev
hostdata->host = host;
hostdata->dev = dev;
atomic_set(&hostdata->request_limit, -1);
- hostdata->host->max_sectors = 32 * 8; /* default max I/O 32 pages */
+ hostdata->host->max_sectors = IBMVSCSI_MAX_SECTORS_DEFAULT;
rc = ibmvscsi_ops->init_crq_queue(&hostdata->queue, hostdata, max_requests);
if (rc != 0 && rc != H_RESOURCE) {
@@ -1735,6 +1766,7 @@ static struct vio_driver ibmvscsi_driver
.id_table = ibmvscsi_device_table,
.probe = ibmvscsi_probe,
.remove = ibmvscsi_remove,
+ .get_desired_dma = ibmvscsi_get_desired_dma,
.driver = {
.name = "ibmvscsi",
.owner = THIS_MODULE,
Index: b/drivers/scsi/ibmvscsi/ibmvscsi.h
===================================================================
--- a/drivers/scsi/ibmvscsi/ibmvscsi.h
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.h
@@ -45,6 +45,8 @@ struct Scsi_Host;
#define MAX_INDIRECT_BUFS 10
#define IBMVSCSI_MAX_REQUESTS_DEFAULT 100
+#define IBMVSCSI_CMDS_PER_LUN_DEFAULT 16
+#define IBMVSCSI_MAX_SECTORS_DEFAULT 256 /* 32 * 8 = default max I/O 32 pages */
#define IBMVSCSI_MAX_CMDS_PER_LUN 64
/* ------------------------------------------------------------
^ permalink raw reply [flat|nested] 38+ messages in thread
* [PATCH 16/16 v3] powerpc: Update arch vector to indicate support for CMO
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
` (16 preceding siblings ...)
2008-07-04 12:56 ` [PATCH 15/16 v3] ibmvscsi: driver enablement " Robert Jennings
@ 2008-07-04 12:57 ` Robert Jennings
17 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-04 12:57 UTC (permalink / raw)
To: paulus; +Cc: Brian King, linuxppc-dev, David Darrington
From: Nathan Fontenot <nfont@austin.ibm.com>
Update the architecture vector to indicate that Cooperative Memory
Overcommitment is supported.
This is the last patch in the series. Committing it will signal to
the platform firmware that the kernel is CMO enabled.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/prom_init.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
Index: b/arch/powerpc/kernel/prom_init.c
===================================================================
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -642,6 +642,7 @@ static void __init early_cmdline_parse(v
#else
#define OV5_MSI 0x00
#endif /* CONFIG_PCI_MSI */
+#define OV5_CMO 0x80 /* Cooperative Memory Overcommitment */
/*
* The architecture vector has an array of PVR mask/value pairs,
@@ -684,10 +685,12 @@ static unsigned char ibm_architecture_ve
0, /* don't halt */
/* option vector 5: PAPR/OF options */
- 3 - 2, /* length */
+ 5 - 2, /* length */
0, /* don't ignore, don't halt */
OV5_LPAR | OV5_SPLPAR | OV5_LARGE_PAGES | OV5_DRCONF_MEMORY |
OV5_DONATE_DEDICATE_CPU | OV5_MSI,
+ 0,
+ OV5_CMO,
};
/* Old method - ELF header with PT_NOTE sections */
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 10/16 v3] powerpc: iommu enablement for CMO
2008-07-04 12:54 ` [PATCH 10/16 v3] powerpc: iommu enablement for CMO Robert Jennings
@ 2008-07-05 17:51 ` Olof Johansson
2008-07-08 20:48 ` [PATCH 10/16 v3] [v2] " Robert Jennings
2008-07-22 4:57 ` [PATCH 10/16 v3] " Paul Mackerras
2 siblings, 0 replies; 38+ messages in thread
From: Olof Johansson @ 2008-07-05 17:51 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington
Hi,
On Jul 4, 2008, at 7:54 AM, Robert Jennings wrote:
> To support Cooperative Memory Overcommitment (CMO), we need to check
> for failure from some of the tce hcalls.
>
> These changes for the pseries platform affect the powerpc
> architecture;
> patches for the other affected platforms are included in this patch.
>
> pSeries platform IOMMU code changes:
> * platform TCE functions must handle H_NOT_ENOUGH_RESOURCES errors and
> return an error.
>
> Architecture IOMMU code changes:
> * Calls to ppc_md.tce_build need to check return values and return
> DMA_MAPPING_ERROR for transient errors.
>
> Architecture changes:
> * struct machdep_calls for tce_build*_pSeriesLP functions need to
> change
> to indicate failure.
> * all other platforms will need updates to iommu functions to match
> the new
> calling semantics; they will return 0 on success. The other
> platforms
> default configs have been built, but no further testing was
> performed.
>
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Olof Johansson <olof@lixom.net>
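The calling-convention change acked above can be pictured with a small user-space sketch. It is only an illustration under stated assumptions: `sketch_tce_build`, `sketch_map`, `struct sketch_table`, and `SKETCH_DMA_ERROR` are hypothetical names, not the kernel's definitions. The point is that the platform hook reports failure through its return value, and the generic code turns that into an error address for the caller to test.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for illustration; not the kernel's definitions. */
#define SKETCH_DMA_ERROR     ((uint64_t)~0ULL)
#define SKETCH_TABLE_ENTRIES 8
#define SKETCH_PAGE_SHIFT    12

struct sketch_table {
	uint64_t entry[SKETCH_TABLE_ENTRIES];
	size_t budget;	/* entries the "firmware" will still accept */
};

/* Models a platform hook: returns 0 on success, nonzero when resources run out. */
static int sketch_tce_build(struct sketch_table *tbl, size_t index,
			    size_t npages, uint64_t base)
{
	size_t i;

	assert(index + npages <= SKETCH_TABLE_ENTRIES);
	if (npages > tbl->budget)
		return -1;	/* transient shortage, nothing was written */

	for (i = 0; i < npages; i++)
		tbl->entry[index + i] = base + (i << SKETCH_PAGE_SHIFT);
	tbl->budget -= npages;
	return 0;
}

/* Models the generic caller: checks the hook and reports an error address. */
static uint64_t sketch_map(struct sketch_table *tbl, size_t index,
			   size_t npages, uint64_t base)
{
	if (sketch_tce_build(tbl, index, npages, base) != 0)
		return SKETCH_DMA_ERROR;
	return (uint64_t)index << SKETCH_PAGE_SHIFT;
}
```

A driver sitting on top of such a mapping call would compare the returned address against the error value before using it, which is the pattern the virtual IO driver patches in this series rely on.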
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 15/16 v3] ibmvscsi: driver enablement for CMO
2008-07-04 12:56 ` [PATCH 15/16 v3] ibmvscsi: driver enablement " Robert Jennings
@ 2008-07-07 14:34 ` Brian King
2008-07-08 17:41 ` Robert Jennings
2008-07-08 20:35 ` [PATCH 15/16 v3] [v2] " Robert Jennings
1 sibling, 1 reply; 38+ messages in thread
From: Brian King @ 2008-07-07 14:34 UTC (permalink / raw)
To: Robert Jennings; +Cc: linux-scsi, linuxppc-dev, David Darrington, paulus
Robert Jennings wrote:
> @@ -1613,6 +1624,26 @@ static struct scsi_host_template driver_
> };
>
> /**
> + * ibmvscsi_get_desired_dma - Calculate IO entitlement needed by the driver
> + *
> + * @vdev: struct vio_dev for the device whose entitlement is to be returned
> + *
> + * Return value:
> + * Number of bytes of IO data the driver will need to perform well.
> + */
> +static unsigned long ibmvscsi_get_desired_dma(struct vio_dev *vdev)
> +{
> + /* iu_storage data allocated in initialize_event_pool */
> + unsigned long io_entitlement = max_requests * sizeof(union viosrp_iu);
Since you are removing the use of "entitlement" in the function description,
you should probably remove it everywhere in this patch.
> +
> + /* add io space for sg data */
> + io_entitlement += (IBMVSCSI_MAX_SECTORS_DEFAULT *
> + IBMVSCSI_CMDS_PER_LUN_DEFAULT);
> +
> + return IOMMU_PAGE_ALIGN(io_entitlement);
I really think this function should just return the number of bytes and
let the caller round it up to any boundary requirements it might have.
-Brian
--
Brian King
Linux on Power Virtualization
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 15/16 v3] ibmvscsi: driver enablement for CMO
2008-07-07 14:34 ` Brian King
@ 2008-07-08 17:41 ` Robert Jennings
0 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-08 17:41 UTC (permalink / raw)
To: Brian King; +Cc: linux-scsi, linuxppc-dev, David Darrington, paulus
* Brian King (brking@linux.vnet.ibm.com) wrote:
> Robert Jennings wrote:
> > @@ -1613,6 +1624,26 @@ static struct scsi_host_template driver_
> > };
> >
> > /**
> > + * ibmvscsi_get_desired_dma - Calculate IO entitlement needed by the driver
> > + *
> > + * @vdev: struct vio_dev for the device whose entitlement is to be returned
> > + *
> > + * Return value:
> > + * Number of bytes of IO data the driver will need to perform well.
> > + */
> > +static unsigned long ibmvscsi_get_desired_dma(struct vio_dev *vdev)
> > +{
> > + /* iu_storage data allocated in initialize_event_pool */
> > + unsigned long io_entitlement = max_requests * sizeof(union viosrp_iu);
>
> Since you are removing the use of "entitlement" in the function description,
> you should probably remove it everywhere in this patch.
I'll clean this up.
> > +
> > + /* add io space for sg data */
> > + io_entitlement += (IBMVSCSI_MAX_SECTORS_DEFAULT *
> > + IBMVSCSI_CMDS_PER_LUN_DEFAULT);
> > +
> > + return IOMMU_PAGE_ALIGN(io_entitlement);
>
> I really think this function should just return the number of bytes and
> let the caller round it up to any boundary requirements it might have.
I agree. I'll be posting a new patch after I work out another issue I'm
having. Hope to have that out soon.
--Rob Jennings
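The split of responsibilities agreed on in this exchange can be sketched in user space as follows. This is a minimal sketch, not the driver or bus code: `struct sketch_scsi_cfg`, `sketch_get_desired_dma`, `sketch_page_align`, `sketch_bus_set_desired`, and the 4096-byte page size are all assumptions made for illustration. The driver-side function returns a raw byte count, and the caller applies whatever rounding it needs.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed IOMMU page size for this sketch only. */
#define SKETCH_IOMMU_PAGE_SIZE 4096UL

/* Hypothetical inputs mirroring the terms used in the quoted function. */
struct sketch_scsi_cfg {
	unsigned long max_requests;	/* number of event structs */
	size_t iu_size;			/* stand-in for sizeof(union viosrp_iu) */
	unsigned long max_sectors;
	unsigned long cmds_per_lun;
};

/* Driver side: report the desired number of bytes, deliberately unaligned. */
static unsigned long sketch_get_desired_dma(const struct sketch_scsi_cfg *cfg)
{
	unsigned long desired;

	assert(cfg != NULL);
	desired = cfg->max_requests * cfg->iu_size;
	desired += cfg->max_sectors * cfg->cmds_per_lun;
	return desired;
}

/* Caller side: round up to the page boundary the bus requires. */
static unsigned long sketch_page_align(unsigned long bytes)
{
	return (bytes + SKETCH_IOMMU_PAGE_SIZE - 1) &
	       ~(SKETCH_IOMMU_PAGE_SIZE - 1);
}

/* Bus side: the one place where alignment policy lives. */
static unsigned long sketch_bus_set_desired(const struct sketch_scsi_cfg *cfg)
{
	return sketch_page_align(sketch_get_desired_dma(cfg));
}
```

Keeping the rounding in the caller means a change in the bus's boundary requirements would not require touching each driver's calculation.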
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 15/16 v3] [v2] ibmvscsi: driver enablement for CMO
2008-07-04 12:56 ` [PATCH 15/16 v3] ibmvscsi: driver enablement " Robert Jennings
2008-07-07 14:34 ` Brian King
@ 2008-07-08 20:35 ` Robert Jennings
2008-07-10 13:43 ` Brian King
1 sibling, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-08 20:35 UTC (permalink / raw)
To: paulus, benh, linuxppc-dev, linux-scsi, Brian King,
Nathan Fontenot, David Darrington
From: Robert Jennings <rcj@linux.vnet.ibm.com>
I removed references to 'entitlement' after renaming the function
'get_io_entitlement' to 'get_desired_dma', which more accurately describes
what the function does. Also, this function does not need to page align the
return value; the VIO bus is responsible for that.
(We would like to take this patch through linuxppc-dev with the full
change set for this feature. We are copying linux-scsi for review and ack)
Enable the driver to function in a Cooperative Memory Overcommitment (CMO)
environment.
The following changes are made to enable the driver for CMO:
* DMA mapping errors no longer produce error messages when entitlement has
been exceeded and resources are not available.
* The driver defines a get_desired_dma function for use in a CMO
environment. It reports how much IO memory the driver would like in order
to function well.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
drivers/scsi/ibmvscsi/ibmvscsi.c | 45 +++++++++++++++++++++++++++++++++------
drivers/scsi/ibmvscsi/ibmvscsi.h | 2 ++
2 files changed, 40 insertions(+), 7 deletions(-)
Index: b/drivers/scsi/ibmvscsi/ibmvscsi.c
===================================================================
--- a/drivers/scsi/ibmvscsi/ibmvscsi.c
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.c
@@ -72,6 +72,7 @@
#include <linux/delay.h>
#include <asm/firmware.h>
#include <asm/vio.h>
+#include <asm/firmware.h>
#include <scsi/scsi.h>
#include <scsi/scsi_cmnd.h>
#include <scsi/scsi_host.h>
@@ -426,8 +427,10 @@ static int map_sg_data(struct scsi_cmnd
SG_ALL * sizeof(struct srp_direct_buf),
&evt_struct->ext_list_token, 0);
if (!evt_struct->ext_list) {
- sdev_printk(KERN_ERR, cmd->device,
- "Can't allocate memory for indirect table\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ sdev_printk(KERN_ERR, cmd->device,
+ "Can't allocate memory "
+ "for indirect table\n");
return 0;
}
}
@@ -743,7 +746,9 @@ static int ibmvscsi_queuecommand(struct
srp_cmd->lun = ((u64) lun) << 48;
if (!map_data_for_srp_cmd(cmnd, evt_struct, srp_cmd, hostdata->dev)) {
- sdev_printk(KERN_ERR, cmnd->device, "couldn't convert cmd to srp_cmd\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ sdev_printk(KERN_ERR, cmnd->device,
+ "couldn't convert cmd to srp_cmd\n");
free_event_struct(&hostdata->pool, evt_struct);
return SCSI_MLQUEUE_HOST_BUSY;
}
@@ -855,7 +860,10 @@ static void send_mad_adapter_info(struct
DMA_BIDIRECTIONAL);
if (dma_mapping_error(req->buffer)) {
- dev_err(hostdata->dev, "Unable to map request_buffer for adapter_info!\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ dev_err(hostdata->dev,
+ "Unable to map request_buffer for "
+ "adapter_info!\n");
free_event_struct(&hostdata->pool, evt_struct);
return;
}
@@ -1400,7 +1408,9 @@ static int ibmvscsi_do_host_config(struc
DMA_BIDIRECTIONAL);
if (dma_mapping_error(host_config->buffer)) {
- dev_err(hostdata->dev, "dma_mapping error getting host config\n");
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ dev_err(hostdata->dev,
+ "dma_mapping error getting host config\n");
free_event_struct(&hostdata->pool, evt_struct);
return -1;
}
@@ -1604,7 +1614,7 @@ static struct scsi_host_template driver_
.eh_host_reset_handler = ibmvscsi_eh_host_reset_handler,
.slave_configure = ibmvscsi_slave_configure,
.change_queue_depth = ibmvscsi_change_queue_depth,
- .cmd_per_lun = 16,
+ .cmd_per_lun = IBMVSCSI_CMDS_PER_LUN_DEFAULT,
.can_queue = IBMVSCSI_MAX_REQUESTS_DEFAULT,
.this_id = -1,
.sg_tablesize = SG_ALL,
@@ -1613,6 +1623,26 @@ static struct scsi_host_template driver_
};
/**
+ * ibmvscsi_get_desired_dma - Calculate IO memory desired by the driver
+ *
+ * @vdev: struct vio_dev for the device whose desired IO mem is to be returned
+ *
+ * Return value:
+ * Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmvscsi_get_desired_dma(struct vio_dev *vdev)
+{
+ /* iu_storage data allocated in initialize_event_pool */
+ unsigned long desired_io = max_requests * sizeof(union viosrp_iu);
+
+ /* add io space for sg data */
+ desired_io += (IBMVSCSI_MAX_SECTORS_DEFAULT *
+ IBMVSCSI_CMDS_PER_LUN_DEFAULT);
+
+ return desired_io;
+}
+
+/**
* Called by bus code for each adapter
*/
static int ibmvscsi_probe(struct vio_dev *vdev, const struct vio_device_id *id)
@@ -1641,7 +1671,7 @@ static int ibmvscsi_probe(struct vio_dev
hostdata->host = host;
hostdata->dev = dev;
atomic_set(&hostdata->request_limit, -1);
- hostdata->host->max_sectors = 32 * 8; /* default max I/O 32 pages */
+ hostdata->host->max_sectors = IBMVSCSI_MAX_SECTORS_DEFAULT;
rc = ibmvscsi_ops->init_crq_queue(&hostdata->queue, hostdata, max_requests);
if (rc != 0 && rc != H_RESOURCE) {
@@ -1735,6 +1765,7 @@ static struct vio_driver ibmvscsi_driver
.id_table = ibmvscsi_device_table,
.probe = ibmvscsi_probe,
.remove = ibmvscsi_remove,
+ .get_desired_dma = ibmvscsi_get_desired_dma,
.driver = {
.name = "ibmvscsi",
.owner = THIS_MODULE,
Index: b/drivers/scsi/ibmvscsi/ibmvscsi.h
===================================================================
--- a/drivers/scsi/ibmvscsi/ibmvscsi.h
+++ b/drivers/scsi/ibmvscsi/ibmvscsi.h
@@ -45,6 +45,8 @@ struct Scsi_Host;
#define MAX_INDIRECT_BUFS 10
#define IBMVSCSI_MAX_REQUESTS_DEFAULT 100
+#define IBMVSCSI_CMDS_PER_LUN_DEFAULT 16
+#define IBMVSCSI_MAX_SECTORS_DEFAULT 256 /* 32 * 8 = default max I/O 32 pages */
#define IBMVSCSI_MAX_CMDS_PER_LUN 64
/* ------------------------------------------------------------
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 14/16 v3] [v2] ibmveth: enable driver for CMO
2008-07-04 12:56 ` [PATCH 14/16 v3] ibmveth: enable driver for CMO Robert Jennings
@ 2008-07-08 20:38 ` Robert Jennings
0 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-08 20:38 UTC (permalink / raw)
To: paulus, benh, linuxppc-dev, netdev, Brian King, Nathan Fontenot,
David Darrington
I removed references to 'entitlement' after renaming the function
'get_io_entitlement' to 'get_desired_dma', which more accurately describes
what the function does. Also, this function does not need to page align the
return value; the VIO bus is responsible for that.
(We would like to take this patch through linuxppc-dev with the full
change set for this feature. We are copying netdev for review and ack)
Enable ibmveth for Cooperative Memory Overcommitment (CMO). For this driver,
that means calculating a desired amount of IO memory based on the current MTU
and updating this value with the bus when the MTU changes. Because DMA
mappings can fail, we have added a bounce buffer for temporary cases where
the driver cannot map IO memory for the buffer pool.
The following changes are made to enable the driver for CMO:
* DMA mapping errors no longer produce error messages when entitlement has
been exceeded and resources are not available.
* DMA mapping errors are handled gracefully: ibmveth_replenish_buffer_pool()
now checks the return value of dma_map_single() and backs out on failure.
* The driver defines a get_desired_dma function so that it can operate
in a CMO environment.
* When the MTU is changed, the driver updates the device's IO entitlement
with the VIO bus.
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
Signed-off-by: Santiago Leon <santil@us.ibm.com>
---
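For reviewers, the transmit fallback described in the changelog can be pictured with the user-space sketch below. It is an illustration under assumptions, not the driver code: `struct sketch_adapter`, `struct sketch_tx`, `sketch_prepare_tx`, and `SKETCH_DMA_ERROR` are hypothetical names, and the `mapped` argument stands in for the result of a per-packet mapping attempt. When that attempt fails, the frame is copied into a buffer that was mapped once up front, and the descriptor points at that buffer instead of dropping the packet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical error value for a failed mapping; sketch only. */
#define SKETCH_DMA_ERROR ((uint64_t)~0ULL)

struct sketch_adapter {
	unsigned char *bounce_buffer;	/* allocated and mapped once at open */
	size_t bounce_len;		/* sized from MTU plus overhead */
	uint64_t bounce_dma;		/* address of the pre-mapped buffer */
	unsigned int tx_map_failed;	/* statistics counter */
};

struct sketch_tx {
	uint64_t address;	/* address placed in the send descriptor */
	int used_bounce;	/* nonzero: skip the per-packet unmap later */
};

/* Choose the descriptor address, falling back to the bounce buffer. */
static struct sketch_tx sketch_prepare_tx(struct sketch_adapter *a,
					  const void *data, size_t len,
					  uint64_t mapped)
{
	struct sketch_tx tx = { mapped, 0 };

	assert(len <= a->bounce_len);
	if (mapped == SKETCH_DMA_ERROR) {
		/* Mapping failed: copy the frame and send from the bounce buffer. */
		memcpy(a->bounce_buffer, data, len);
		tx.address = a->bounce_dma;
		tx.used_bounce = 1;
		a->tx_map_failed++;
	}
	return tx;
}
```

The `used_bounce` flag matters on the completion side: only a per-packet mapping should be unmapped after the send, while the bounce buffer mapping stays in place until the device is closed.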
drivers/net/ibmveth.c | 169 ++++++++++++++++++++++++++++++++++++++++----------
drivers/net/ibmveth.h | 5 +
2 files changed, 140 insertions(+), 34 deletions(-)
Index: b/drivers/net/ibmveth.c
===================================================================
--- a/drivers/net/ibmveth.c
+++ b/drivers/net/ibmveth.c
@@ -33,6 +33,7 @@
*/
#include <linux/module.h>
+#include <linux/moduleparam.h>
#include <linux/types.h>
#include <linux/errno.h>
#include <linux/ioport.h>
@@ -52,7 +53,9 @@
#include <asm/hvcall.h>
#include <asm/atomic.h>
#include <asm/vio.h>
+#include <asm/iommu.h>
#include <asm/uaccess.h>
+#include <asm/firmware.h>
#include <linux/seq_file.h>
#include "ibmveth.h"
@@ -94,8 +97,10 @@ static void ibmveth_proc_register_adapte
static void ibmveth_proc_unregister_adapter(struct ibmveth_adapter *adapter);
static irqreturn_t ibmveth_interrupt(int irq, void *dev_instance);
static void ibmveth_rxq_harvest_buffer(struct ibmveth_adapter *adapter);
+static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev);
static struct kobj_type ktype_veth_pool;
+
#ifdef CONFIG_PROC_FS
#define IBMVETH_PROC_DIR "ibmveth"
static struct proc_dir_entry *ibmveth_proc_dir;
@@ -226,16 +231,16 @@ static void ibmveth_replenish_buffer_poo
u32 i;
u32 count = pool->size - atomic_read(&pool->available);
u32 buffers_added = 0;
+ struct sk_buff *skb;
+ unsigned int free_index, index;
+ u64 correlator;
+ unsigned long lpar_rc;
+ dma_addr_t dma_addr;
mb();
for(i = 0; i < count; ++i) {
- struct sk_buff *skb;
- unsigned int free_index, index;
- u64 correlator;
union ibmveth_buf_desc desc;
- unsigned long lpar_rc;
- dma_addr_t dma_addr;
skb = alloc_skb(pool->buff_size, GFP_ATOMIC);
@@ -255,6 +260,9 @@ static void ibmveth_replenish_buffer_poo
dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
pool->buff_size, DMA_FROM_DEVICE);
+ if (dma_mapping_error(dma_addr))
+ goto failure;
+
pool->free_map[free_index] = IBM_VETH_INVALID_MAP;
pool->dma_addr[index] = dma_addr;
pool->skbuff[index] = skb;
@@ -267,20 +275,9 @@ static void ibmveth_replenish_buffer_poo
lpar_rc = h_add_logical_lan_buffer(adapter->vdev->unit_address, desc.desc);
- if(lpar_rc != H_SUCCESS) {
- pool->free_map[free_index] = index;
- pool->skbuff[index] = NULL;
- if (pool->consumer_index == 0)
- pool->consumer_index = pool->size - 1;
- else
- pool->consumer_index--;
- dma_unmap_single(&adapter->vdev->dev,
- pool->dma_addr[index], pool->buff_size,
- DMA_FROM_DEVICE);
- dev_kfree_skb_any(skb);
- adapter->replenish_add_buff_failure++;
- break;
- } else {
+ if (lpar_rc != H_SUCCESS)
+ goto failure;
+ else {
buffers_added++;
adapter->replenish_add_buff_success++;
}
@@ -288,6 +285,24 @@ static void ibmveth_replenish_buffer_poo
mb();
atomic_add(buffers_added, &(pool->available));
+ return;
+
+failure:
+ pool->free_map[free_index] = index;
+ pool->skbuff[index] = NULL;
+ if (pool->consumer_index == 0)
+ pool->consumer_index = pool->size - 1;
+ else
+ pool->consumer_index--;
+ if (!dma_mapping_error(dma_addr))
+ dma_unmap_single(&adapter->vdev->dev,
+ pool->dma_addr[index], pool->buff_size,
+ DMA_FROM_DEVICE);
+ dev_kfree_skb_any(skb);
+ adapter->replenish_add_buff_failure++;
+
+ mb();
+ atomic_add(buffers_added, &(pool->available));
}
/* replenish routine */
@@ -297,7 +312,7 @@ static void ibmveth_replenish_task(struc
adapter->replenish_task_cycles++;
- for(i = 0; i < IbmVethNumBufferPools; i++)
+ for (i = (IbmVethNumBufferPools - 1); i >= 0; i--)
if(adapter->rx_buff_pool[i].active)
ibmveth_replenish_buffer_pool(adapter,
&adapter->rx_buff_pool[i]);
@@ -472,6 +487,18 @@ static void ibmveth_cleanup(struct ibmve
if (adapter->rx_buff_pool[i].active)
ibmveth_free_buffer_pool(adapter,
&adapter->rx_buff_pool[i]);
+
+ if (adapter->bounce_buffer != NULL) {
+ if (!dma_mapping_error(adapter->bounce_buffer_dma)) {
+ dma_unmap_single(&adapter->vdev->dev,
+ adapter->bounce_buffer_dma,
+ adapter->netdev->mtu + IBMVETH_BUFF_OH,
+ DMA_BIDIRECTIONAL);
+ adapter->bounce_buffer_dma = DMA_ERROR_CODE;
+ }
+ kfree(adapter->bounce_buffer);
+ adapter->bounce_buffer = NULL;
+ }
}
static int ibmveth_register_logical_lan(struct ibmveth_adapter *adapter,
@@ -607,6 +634,24 @@ static int ibmveth_open(struct net_devic
return rc;
}
+ adapter->bounce_buffer =
+ kmalloc(netdev->mtu + IBMVETH_BUFF_OH, GFP_KERNEL);
+ if (!adapter->bounce_buffer) {
+ ibmveth_error_printk("unable to allocate bounce buffer\n");
+ ibmveth_cleanup(adapter);
+ napi_disable(&adapter->napi);
+ return -ENOMEM;
+ }
+ adapter->bounce_buffer_dma =
+ dma_map_single(&adapter->vdev->dev, adapter->bounce_buffer,
+ netdev->mtu + IBMVETH_BUFF_OH, DMA_BIDIRECTIONAL);
+ if (dma_mapping_error(adapter->bounce_buffer_dma)) {
+ ibmveth_error_printk("unable to map bounce buffer\n");
+ ibmveth_cleanup(adapter);
+ napi_disable(&adapter->napi);
+ return -ENOMEM;
+ }
+
ibmveth_debug_printk("initial replenish cycle\n");
ibmveth_interrupt(netdev->irq, netdev);
@@ -853,10 +898,12 @@ static int ibmveth_start_xmit(struct sk_
unsigned int tx_packets = 0;
unsigned int tx_send_failed = 0;
unsigned int tx_map_failed = 0;
+ int used_bounce = 0;
+ unsigned long data_dma_addr;
desc.fields.flags_len = IBMVETH_BUF_VALID | skb->len;
- desc.fields.address = dma_map_single(&adapter->vdev->dev, skb->data,
- skb->len, DMA_TO_DEVICE);
+ data_dma_addr = dma_map_single(&adapter->vdev->dev, skb->data,
+ skb->len, DMA_TO_DEVICE);
if (skb->ip_summed == CHECKSUM_PARTIAL &&
ip_hdr(skb)->protocol != IPPROTO_TCP && skb_checksum_help(skb)) {
@@ -875,12 +922,16 @@ static int ibmveth_start_xmit(struct sk_
buf[1] = 0;
}
- if (dma_mapping_error(desc.fields.address)) {
- ibmveth_error_printk("tx: unable to map xmit buffer\n");
+ if (dma_mapping_error(data_dma_addr)) {
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ ibmveth_error_printk("tx: unable to map xmit buffer\n");
+ skb_copy_from_linear_data(skb, adapter->bounce_buffer,
+ skb->len);
+ desc.fields.address = adapter->bounce_buffer_dma;
tx_map_failed++;
- tx_dropped++;
- goto out;
- }
+ used_bounce = 1;
+ } else
+ desc.fields.address = data_dma_addr;
/* send the frame. Arbitrarily set retrycount to 1024 */
correlator = 0;
@@ -904,8 +955,9 @@ static int ibmveth_start_xmit(struct sk_
netdev->trans_start = jiffies;
}
- dma_unmap_single(&adapter->vdev->dev, desc.fields.address,
- skb->len, DMA_TO_DEVICE);
+ if (!used_bounce)
+ dma_unmap_single(&adapter->vdev->dev, data_dma_addr,
+ skb->len, DMA_TO_DEVICE);
out: spin_lock_irqsave(&adapter->stats_lock, flags);
netdev->stats.tx_dropped += tx_dropped;
@@ -1053,8 +1105,9 @@ static void ibmveth_set_multicast_list(s
static int ibmveth_change_mtu(struct net_device *dev, int new_mtu)
{
struct ibmveth_adapter *adapter = dev->priv;
+ struct vio_dev *viodev = adapter->vdev;
int new_mtu_oh = new_mtu + IBMVETH_BUFF_OH;
- int i, rc;
+ int i;
if (new_mtu < IBMVETH_MAX_MTU)
return -EINVAL;
@@ -1085,10 +1138,15 @@ static int ibmveth_change_mtu(struct net
ibmveth_close(adapter->netdev);
adapter->pool_config = 0;
dev->mtu = new_mtu;
- if ((rc = ibmveth_open(adapter->netdev)))
- return rc;
- } else
- dev->mtu = new_mtu;
+ vio_cmo_set_dev_desired(viodev,
+ ibmveth_get_desired_dma
+ (viodev));
+ return ibmveth_open(adapter->netdev);
+ }
+ dev->mtu = new_mtu;
+ vio_cmo_set_dev_desired(viodev,
+ ibmveth_get_desired_dma
+ (viodev));
return 0;
}
}
@@ -1103,6 +1161,46 @@ static void ibmveth_poll_controller(stru
}
#endif
+/**
+ * ibmveth_get_desired_dma - Calculate IO memory desired by the driver
+ *
+ * @vdev: struct vio_dev for the device whose desired IO mem is to be returned
+ *
+ * Return value:
+ * Number of bytes of IO data the driver will need to perform well.
+ */
+static unsigned long ibmveth_get_desired_dma(struct vio_dev *vdev)
+{
+ struct net_device *netdev = dev_get_drvdata(&vdev->dev);
+ struct ibmveth_adapter *adapter;
+ unsigned long ret;
+ int i;
+ int rxqentries = 1;
+
+ /* netdev inits at probe time along with the structures we need below*/
+ if (netdev == NULL)
+ return IOMMU_PAGE_ALIGN(IBMVETH_IO_ENTITLEMENT_DEFAULT);
+
+ adapter = netdev_priv(netdev);
+
+ ret = IBMVETH_BUFF_LIST_SIZE + IBMVETH_FILT_LIST_SIZE;
+ ret += IOMMU_PAGE_ALIGN(netdev->mtu);
+
+ for (i = 0; i < IbmVethNumBufferPools; i++) {
+ /* add the size of the active receive buffers */
+ if (adapter->rx_buff_pool[i].active)
+ ret +=
+ adapter->rx_buff_pool[i].size *
+ IOMMU_PAGE_ALIGN(adapter->rx_buff_pool[i].
+ buff_size);
+ rxqentries += adapter->rx_buff_pool[i].size;
+ }
+ /* add the size of the receive queue entries */
+ ret += IOMMU_PAGE_ALIGN(rxqentries * sizeof(struct ibmveth_rx_q_entry));
+
+ return ret;
+}
+
static int __devinit ibmveth_probe(struct vio_dev *dev, const struct vio_device_id *id)
{
int rc, i;
@@ -1247,6 +1345,8 @@ static int __devexit ibmveth_remove(stru
ibmveth_proc_unregister_adapter(adapter);
free_netdev(netdev);
+ dev_set_drvdata(&dev->dev, NULL);
+
return 0;
}
@@ -1491,6 +1591,7 @@ static struct vio_driver ibmveth_driver
.id_table = ibmveth_device_table,
.probe = ibmveth_probe,
.remove = ibmveth_remove,
+ .get_desired_dma = ibmveth_get_desired_dma,
.driver = {
.name = ibmveth_driver_name,
.owner = THIS_MODULE,
Index: b/drivers/net/ibmveth.h
===================================================================
--- a/drivers/net/ibmveth.h
+++ b/drivers/net/ibmveth.h
@@ -93,9 +93,12 @@ static inline long h_illan_attributes(un
plpar_hcall_norets(H_CHANGE_LOGICAL_LAN_MAC, ua, mac)
#define IbmVethNumBufferPools 5
+#define IBMVETH_IO_ENTITLEMENT_DEFAULT 4243456 /* MTU of 1500 needs 4.2Mb */
#define IBMVETH_BUFF_OH 22 /* Overhead: 14 ethernet header + 8 opaque handle */
#define IBMVETH_MAX_MTU 68
#define IBMVETH_MAX_POOL_COUNT 4096
+#define IBMVETH_BUFF_LIST_SIZE 4096
+#define IBMVETH_FILT_LIST_SIZE 4096
#define IBMVETH_MAX_BUF_SIZE (1024 * 128)
static int pool_size[] = { 512, 1024 * 2, 1024 * 16, 1024 * 32, 1024 * 64 };
@@ -143,6 +146,8 @@ struct ibmveth_adapter {
struct ibmveth_rx_q rx_queue;
int pool_config;
int rx_csum;
+ void *bounce_buffer;
+ dma_addr_t bounce_buffer_dma;
/* adapter specific stats */
u64 replenish_task_cycles;
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 10/16 v3] [v2] powerpc: iommu enablement for CMO
2008-07-04 12:54 ` [PATCH 10/16 v3] powerpc: iommu enablement for CMO Robert Jennings
2008-07-05 17:51 ` Olof Johansson
@ 2008-07-08 20:48 ` Robert Jennings
2008-07-22 5:04 ` Paul Mackerras
2008-07-22 4:57 ` [PATCH 10/16 v3] " Paul Mackerras
2 siblings, 1 reply; 38+ messages in thread
From: Robert Jennings @ 2008-07-08 20:48 UTC (permalink / raw)
To: paulus, benh, linuxppc-dev, Brian King, Nathan Fontenot,
David Darrington
From: Robert Jennings <rcj@linux.vnet.ibm.com>
Minor change to add a call to align the return from the device's
get_desired_dma() function with IOMMU_PAGE_ALIGN(). Also removed a
comment referring to a non-existent structure member.
This is a large patch but the normal code path is not affected. For
non-pSeries platforms the code is ifdef'ed out and for non-CMO enabled
pSeries systems this does not affect the normal code path. Devices that
do not perform DMA operations do not need modification with this patch.
The function get_desired_dma was renamed from get_io_entitlement for
clarity.
Overview
Cooperative Memory Overcommitment (CMO) allows for a set of OS partitions
to be run with less RAM than the aggregate needs of the group of
partitions. The firmware will balance memory between the partitions
and page in/out memory as needed. Based on the number and type of IO
adapters present, each partition is allocated an amount of memory for
DMA operations and this allocation will be guaranteed to the partition;
this is referred to as the partition's 'entitlement'.
Partitions running in a CMO environment can only have virtual IO devices
present. The VIO bus layer will manage the IO entitlement for the system.
Accounting, at a system and per-device level, is tracked in the VIO bus
code and exposed via sysfs. A set of dma_ops functions is added to
the bus to allow for this accounting.
Bus initialization
At initialization, the bus will calculate the minimum needs of the system
based on providing each device present with a standard minimum entitlement
along with a spare allocation for the bus to handle hot-plug events.
If the minimum needs cannot be met, the system boot will be halted.
Device changes
The significant changes for devices while running under CMO are that the
devices must specify how much dedicated IO entitlement they desire and
must also handle DMA mapping errors that can occur due to constrained
IO memory. The virtual IO drivers are modified to silence errors when
DMA mappings fail for CMO and handle these failures gracefully.
Each device will be guaranteed a minimum entitlement that can always
be mapped. Devices will specify how much entitlement they desire and
the VIO bus will attempt to provide for this. Devices can change their
desired entitlement level at any point in time to address particular needs
(via vio_cmo_set_dev_desired()), not just at device probe time.
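As a rough illustration of how a driver might size its desired
entitlement, the sketch below sums page-aligned buffer sizes in the way
ibmveth_get_desired_dma() does in this series. It is a standalone
user-space approximation, not the driver code: the sketch_* names, the
4096-byte IOMMU page size and the pool layout are assumptions made for
the example.

```c
#include <stddef.h>

#define SKETCH_IOMMU_PAGE_SIZE 4096UL /* assumed IOMMU page size */

/* Round n up to a multiple of the assumed IOMMU page size. */
unsigned long sketch_iommu_page_align(unsigned long n)
{
	return (n + SKETCH_IOMMU_PAGE_SIZE - 1) & ~(SKETCH_IOMMU_PAGE_SIZE - 1);
}

/* Hypothetical receive buffer pool description. */
struct sketch_pool {
	int active;              /* non-zero when the pool is in use */
	unsigned long count;     /* number of buffers in the pool */
	unsigned long buff_size; /* size of each buffer in bytes */
};

/*
 * Desired IO memory: the fixed per-device structures plus one
 * page-aligned mapping for every buffer in every active pool.
 */
unsigned long sketch_desired_dma(const struct sketch_pool *pools,
				 size_t npools, unsigned long fixed_bytes)
{
	unsigned long total = sketch_iommu_page_align(fixed_bytes);
	size_t i;

	for (i = 0; i < npools; i++) {
		if (pools[i].active)
			total += pools[i].count *
				 sketch_iommu_page_align(pools[i].buff_size);
	}
	return total;
}
```

A driver would report a total like this through its get_desired_dma
callback, or pass it to vio_cmo_set_dev_desired() when its needs change,
for example after an MTU change.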
VIO bus changes
The system will have a particular entitlement level available from which
it can provide memory to the devices. The bus defines two pools of memory
within this entitlement, the reserve and excess pools. Each device is
provided with its own entitlement, no less than a system-defined minimum
entitlement and no greater than what the device has specified as its
desired entitlement. The entitlement provided to devices comes from the
reserve pool. The reserve pool can also contain a spare allocation as
large as the system-defined minimum entitlement, which is used for device
hot-plug events. Any entitlement not needed to fulfill the needs of the
reserve pool is placed in the excess pool. Each device is guaranteed
that it can map up to its entitled level; additional mappings are possible
as long as there is unmapped memory in the excess pool.
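The guarantee described above can be sketched as a small accounting
check. This is a simplified user-space model under stated assumptions,
not the code in this patch: the alloc_sketch_* names are hypothetical,
and the real vio_cmo_alloc() additionally takes the bus lock, tracks a
high water mark and only opens the excess pool once the spare
allocation is fulfilled.

```c
#include <stddef.h>

/* Hypothetical per-device accounting. */
struct alloc_sketch_dev {
	size_t entitled;  /* bytes guaranteed to this device */
	size_t allocated; /* bytes currently mapped by this device */
};

/* Hypothetical bus-wide accounting. */
struct alloc_sketch_bus {
	size_t excess_free; /* unmapped bytes left in the excess pool */
};

/*
 * Account for a mapping of 'size' bytes. The device's unused
 * entitlement is consumed first and only the remainder is taken from
 * the shared excess pool. Returns 0 on success and -1 when the request
 * does not fit.
 */
int alloc_sketch_map(struct alloc_sketch_bus *bus,
		     struct alloc_sketch_dev *dev, size_t size)
{
	size_t reserve_free = 0;
	size_t from_excess;

	if (dev->entitled > dev->allocated)
		reserve_free = dev->entitled - dev->allocated;

	if (reserve_free + bus->excess_free < size)
		return -1;

	from_excess = size > reserve_free ? size - reserve_free : 0;
	dev->allocated += size;
	bus->excess_free -= from_excess;
	return 0;
}
```

A mapping that fits within the device's entitlement never touches the
excess pool, which is what makes the entitlement a guarantee rather
than a best-effort limit.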
Bus probe
As the system starts, each device is given an entitlement equal only
to the system-defined minimum entitlement. The reserve pool is equal
to the sum of these entitlements, plus a spare allocation. The VIO bus
also tracks the aggregate desired entitlement of all the devices. If the
system desired entitlement is greater than the size of the reserve pool,
IO memory that devices unmap from the excess pool is moved into the
reserve pool and a balance operation is scheduled for some time in the
future.
Entitlement balancing
The balance function tries to fairly distribute entitlement between the
devices in the system with the goal of providing each device with its
desired amount of entitlement. Devices using more than what would be
ideal will have their entitled set-point adjusted; this will effectively
set a goal for lower IO memory usage as future mappings can fail and
deallocations will trigger a balance operation to distribute the newly
unmapped memory. A fair distribution of entitlement can take several
balance operations to achieve. Entitlement changes and device DLPAR
events will alter the state of CMO and will trigger balance operations.
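The chunked distribution can be illustrated with a simplified model.
This is a sketch under stated assumptions rather than the
vio_cmo_balance() implementation: the bal_sketch_* names and the
constants are hypothetical, and the adjustment for memory a device has
already mapped above its entitlement is left out.

```c
#include <stddef.h>

#define BAL_SKETCH_MIN_ENT 4096u /* hypothetical minimum entitlement */
#define BAL_SKETCH_CHUNK   1024u /* hypothetical balance chunk size */

struct bal_sketch_dev {
	size_t desired;  /* what the driver asked for */
	size_t entitled; /* what the bus currently guarantees */
};

/*
 * Give every device the minimum, then hand out 'avail' bytes in small
 * chunks, one pass over the devices per level, until every device has
 * reached its desired entitlement or nothing is left to give.
 */
void bal_sketch_balance(struct bal_sketch_dev *devs, size_t ndevs,
			size_t avail)
{
	size_t level = BAL_SKETCH_MIN_ENT;
	size_t i, fulfilled, chunk;

	for (i = 0; i < ndevs; i++)
		devs[i].entitled = BAL_SKETCH_MIN_ENT;

	while (avail) {
		fulfilled = 0;
		for (i = 0; i < ndevs; i++) {
			if (devs[i].desired <= level) {
				fulfilled++;
				continue;
			}
			/* Never exceed the chunk size, what is left, or desired. */
			chunk = avail < BAL_SKETCH_CHUNK ? avail : BAL_SKETCH_CHUNK;
			if (chunk > devs[i].desired - devs[i].entitled)
				chunk = devs[i].desired - devs[i].entitled;
			devs[i].entitled += chunk;
			avail -= chunk;
		}
		if (fulfilled == ndevs)
			break;
		level += BAL_SKETCH_CHUNK;
	}
}
```

Handing out small chunks per pass keeps one device with a large desired
value from absorbing all of the available entitlement before the others
have been served.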
Hotplug events
The VIO bus allows for changes in system entitlement at run-time via
'vio_cmo_entitlement_update()'. When devices are added the hot-plug
device event will be preceded by a system entitlement increase and this
is reversed when devices are removed.
The following changes are made to the VIO bus layer for CMO:
* add IO memory accounting per device structure.
* add IO memory entitlement query function to driver structure.
* during vio bus probe, if CMO is enabled, check that driver has
memory entitlement query function defined. Fail if function not defined.
* fail to register driver if io entitlement function not defined.
* create set of dma_ops at vio level for CMO that will track allocations
and return DMA failures once entitlement is reached. Entitlement will be
limited by overall system entitlement. Devices will have a reserved
quantity of memory that is guaranteed, the rest can be used as available.
* expose entitlement, current allocation, desired allocation, and the
allocation error counter for devices to the user through sysfs
* provide mechanism for changing a device's desired entitlement at run time
for devices as an exported function and sysfs tunable
* track any DMA failures for entitled IO memory for each vio device.
* check entitlement against available system entitlement on device add
* track entitlement metrics (high water mark, current usage)
* provide function to reset high water mark
* provide minimum and desired entitlement numbers at a bus level
* provide drivers with a minimum guaranteed entitlement
* balance available entitlement between devices to satisfy their needs
* handle system entitlement changes and device hot-plug
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/vio.c | 1024 ++++++++++++++++++++++++++++++++++++++++++++++
include/asm-powerpc/vio.h | 27 +
2 files changed, 1043 insertions(+), 8 deletions(-)
Index: b/arch/powerpc/kernel/vio.c
===================================================================
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -1,11 +1,12 @@
/*
* IBM PowerPC Virtual I/O Infrastructure Support.
*
- * Copyright (c) 2003-2005 IBM Corp.
+ * Copyright (c) 2003,2008 IBM Corp.
* Dave Engebretsen engebret@us.ibm.com
* Santiago Leon santil@us.ibm.com
* Hollis Blanchard <hollisb@us.ibm.com>
* Stephen Rothwell
+ * Robert Jennings <rcjenn@us.ibm.com>
*
* This program is free software; you can redistribute it and/or
* modify it under the terms of the GNU General Public License
@@ -46,6 +47,987 @@ static struct vio_dev vio_bus_device =
.dev.bus = &vio_bus_type,
};
+#ifdef CONFIG_PPC_PSERIES
+/**
+ * vio_cmo_pool - A pool of IO memory for CMO use
+ *
+ * @size: The size of the pool in bytes
+ * @free: The amount of free memory in the pool
+ */
+struct vio_cmo_pool {
+ size_t size;
+ size_t free;
+};
+
+/* How many ms to delay queued balance work */
+#define VIO_CMO_BALANCE_DELAY 100
+
+/* Portion out IO memory to CMO devices by this chunk size */
+#define VIO_CMO_BALANCE_CHUNK 131072
+
+/**
+ * vio_cmo_dev_entry - A device that is CMO-enabled and requires entitlement
+ *
+ * @viodev: struct vio_dev pointer
+ * @list: pointer to other devices on bus that are being tracked
+ */
+struct vio_cmo_dev_entry {
+ struct vio_dev *viodev;
+ struct list_head list;
+};
+
+/**
+ * vio_cmo - VIO bus accounting structure for CMO entitlement
+ *
+ * @lock: spinlock for entire structure
+ * @balance_q: work queue for balancing system entitlement
+ * @device_list: list of CMO-enabled devices requiring entitlement
+ * @entitled: total system entitlement in bytes
+ * @reserve: pool of memory from which devices reserve entitlement, incl. spare
+ * @excess: pool of excess entitlement not needed for device reserves or spare
+ * @spare: IO memory for device hotplug functionality
+ * @min: minimum necessary for system operation
+ * @desired: desired memory for system operation
+ * @curr: bytes currently allocated
+ * @high: high water mark for IO data usage
+ */
+struct vio_cmo {
+ spinlock_t lock;
+ struct delayed_work balance_q;
+ struct list_head device_list;
+ size_t entitled;
+ struct vio_cmo_pool reserve;
+ struct vio_cmo_pool excess;
+ size_t spare;
+ size_t min;
+ size_t desired;
+ size_t curr;
+ size_t high;
+} vio_cmo;
+
+/**
+ * vio_cmo_num_OF_devs - Count the number of OF devices that have DMA windows
+ */
+static int vio_cmo_num_OF_devs(void)
+{
+ struct device_node *node_vroot;
+ int count = 0;
+
+ /*
+ * Count the number of vdevice entries with an
+ * ibm,my-dma-window OF property
+ */
+ node_vroot = of_find_node_by_name(NULL, "vdevice");
+ if (node_vroot) {
+ struct device_node *of_node;
+ struct property *prop;
+
+ for_each_child_of_node(node_vroot, of_node) {
+ prop = of_find_property(of_node, "ibm,my-dma-window",
+ NULL);
+ if (prop)
+ count++;
+ }
+ }
+ of_node_put(node_vroot);
+ return count;
+}
+
+/**
+ * vio_cmo_alloc - allocate IO memory for CMO-enabled devices
+ *
+ * @viodev: VIO device requesting IO memory
+ * @size: size of allocation requested
+ *
+ * Allocations come from memory reserved for the devices and any excess
+ * IO memory available to all devices. The spare pool used to service
+ * hotplug must be equal to %VIO_CMO_MIN_ENT for the excess pool to be
+ * made available.
+ *
+ * Return codes:
+ * 0 for successful allocation and -ENOMEM for a failure
+ */
+static inline int vio_cmo_alloc(struct vio_dev *viodev, size_t size)
+{
+ unsigned long flags;
+ size_t reserve_free = 0;
+ size_t excess_free = 0;
+ int ret = -ENOMEM;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+
+ /* Determine the amount of free entitlement available in reserve */
+ if (viodev->cmo.entitled > viodev->cmo.allocated)
+ reserve_free = viodev->cmo.entitled - viodev->cmo.allocated;
+
+ /* If spare is not fulfilled, the excess pool can not be used. */
+ if (vio_cmo.spare >= VIO_CMO_MIN_ENT)
+ excess_free = vio_cmo.excess.free;
+
+ /* The request can be satisfied */
+ if ((reserve_free + excess_free) >= size) {
+ vio_cmo.curr += size;
+ if (vio_cmo.curr > vio_cmo.high)
+ vio_cmo.high = vio_cmo.curr;
+ viodev->cmo.allocated += size;
+ size -= min(reserve_free, size);
+ vio_cmo.excess.free -= size;
+ ret = 0;
+ }
+
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return ret;
+}
+
+/**
+ * vio_cmo_dealloc - deallocate IO memory from CMO-enabled devices
+ * @viodev: VIO device freeing IO memory
+ * @size: size of deallocation
+ *
+ * IO memory is freed by the device back to the correct memory pools.
+ * The spare pool is replenished first from either memory pool, then
+ * the reserve pool is used to reduce device entitlement, the excess
+ * pool is used to increase the reserve pool toward the desired entitlement
+ * target, and then the remaining memory is returned to the pools.
+ *
+ */
+static inline void vio_cmo_dealloc(struct vio_dev *viodev, size_t size)
+{
+ unsigned long flags;
+ size_t spare_needed = 0;
+ size_t excess_freed = 0;
+ size_t reserve_freed = size;
+ size_t tmp;
+ int balance = 0;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ vio_cmo.curr -= size;
+
+ /* Amount of memory freed from the excess pool */
+ if (viodev->cmo.allocated > viodev->cmo.entitled) {
+ excess_freed = min(reserve_freed, (viodev->cmo.allocated -
+ viodev->cmo.entitled));
+ reserve_freed -= excess_freed;
+ }
+
+ /* Remove allocation from device */
+ viodev->cmo.allocated -= (reserve_freed + excess_freed);
+
+ /* Spare is a subset of the reserve pool, replenish it first. */
+ spare_needed = VIO_CMO_MIN_ENT - vio_cmo.spare;
+
+ /*
+ * Replenish the spare in the reserve pool from the excess pool.
+ * This moves entitlement into the reserve pool.
+ */
+ if (spare_needed && excess_freed) {
+ tmp = min(excess_freed, spare_needed);
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.reserve.size += tmp;
+ vio_cmo.spare += tmp;
+ excess_freed -= tmp;
+ spare_needed -= tmp;
+ balance = 1;
+ }
+
+ /*
+ * Replenish the spare in the reserve pool from the reserve pool.
+ * This removes entitlement from the device down to VIO_CMO_MIN_ENT,
+ * if needed, and gives it to the spare pool. The amount of used
+ * memory in this pool does not change.
+ */
+ if (spare_needed && reserve_freed) {
+ tmp = min(spare_needed, min(reserve_freed,
+ (viodev->cmo.entitled -
+ VIO_CMO_MIN_ENT)));
+
+ vio_cmo.spare += tmp;
+ viodev->cmo.entitled -= tmp;
+ reserve_freed -= tmp;
+ spare_needed -= tmp;
+ balance = 1;
+ }
+
+ /*
+ * Increase the reserve pool until the desired allocation is met.
+ * Move an allocation freed from the excess pool into the reserve
+ * pool and schedule a balance operation.
+ */
+ if (excess_freed && (vio_cmo.desired > vio_cmo.reserve.size)) {
+ tmp = min(excess_freed, (vio_cmo.desired - vio_cmo.reserve.size));
+
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.reserve.size += tmp;
+ excess_freed -= tmp;
+ balance = 1;
+ }
+
+ /* Return memory from the excess pool to that pool */
+ if (excess_freed)
+ vio_cmo.excess.free += excess_freed;
+
+ if (balance)
+ schedule_delayed_work(&vio_cmo.balance_q, VIO_CMO_BALANCE_DELAY);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_entitlement_update - Manage system entitlement changes
+ *
+ * @new_entitlement: new system entitlement to attempt to accommodate
+ *
+ * Increases in entitlement will be used to fulfill the spare entitlement
+ * and the rest is given to the excess pool. Decreases, if they are
+ * possible, come from the excess pool and from unused device entitlement
+ *
+ * Returns: 0 on success, -ENOMEM when change can not be made
+ */
+int vio_cmo_entitlement_update(size_t new_entitlement)
+{
+ struct vio_dev *viodev;
+ struct vio_cmo_dev_entry *dev_ent;
+ unsigned long flags;
+ size_t avail, delta, tmp;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+
+ /* Entitlement increases */
+ if (new_entitlement > vio_cmo.entitled) {
+ delta = new_entitlement - vio_cmo.entitled;
+
+ /* Fulfill spare allocation */
+ if (vio_cmo.spare < VIO_CMO_MIN_ENT) {
+ tmp = min(delta, (VIO_CMO_MIN_ENT - vio_cmo.spare));
+ vio_cmo.spare += tmp;
+ vio_cmo.reserve.size += tmp;
+ delta -= tmp;
+ }
+
+ /* Remaining new allocation goes to the excess pool */
+ vio_cmo.entitled += delta;
+ vio_cmo.excess.size += delta;
+ vio_cmo.excess.free += delta;
+
+ goto out;
+ }
+
+ /* Entitlement decreases */
+ delta = vio_cmo.entitled - new_entitlement;
+ avail = vio_cmo.excess.free;
+
+ /*
+ * Need to check how much unused entitlement each device can
+ * sacrifice to fulfill entitlement change.
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ if (avail >= delta)
+ break;
+
+ viodev = dev_ent->viodev;
+ if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+ (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+ avail += viodev->cmo.entitled -
+ max_t(size_t, viodev->cmo.allocated,
+ VIO_CMO_MIN_ENT);
+ }
+
+ if (delta <= avail) {
+ vio_cmo.entitled -= delta;
+
+ /* Take entitlement from the excess pool first */
+ tmp = min(vio_cmo.excess.free, delta);
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.excess.free -= tmp;
+ delta -= tmp;
+
+ /*
+ * Remove all but VIO_CMO_MIN_ENT bytes from devices
+ * until entitlement change is served
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ if (!delta)
+ break;
+
+ viodev = dev_ent->viodev;
+ tmp = 0;
+ if ((viodev->cmo.entitled > viodev->cmo.allocated) &&
+ (viodev->cmo.entitled > VIO_CMO_MIN_ENT))
+ tmp = viodev->cmo.entitled -
+ max_t(size_t, viodev->cmo.allocated,
+ VIO_CMO_MIN_ENT);
+ viodev->cmo.entitled -= min(tmp, delta);
+ delta -= min(tmp, delta);
+ }
+ } else {
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return -ENOMEM;
+ }
+
+out:
+ schedule_delayed_work(&vio_cmo.balance_q, 0);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return 0;
+}
+
+/**
+ * vio_cmo_balance - Balance entitlement among devices
+ *
+ * @work: work queue structure for this operation
+ *
+ * Any system entitlement above the minimum needed for devices, or
+ * already allocated to devices, can be distributed to the devices.
+ * The list of devices is iterated through to recalculate the desired
+ * entitlement level and to determine how much entitlement above the
+ * minimum entitlement is allocated to devices.
+ *
+ * Small chunks of the available entitlement are given to devices until
+ * their requirements are fulfilled or there is no entitlement left to give.
+ * Upon completion sizes of the reserve and excess pools are calculated.
+ *
+ * The system minimum entitlement level is also recalculated here.
+ * Entitlement will be reserved for devices even after vio_bus_remove to
+ * accommodate reloading the driver. The OF tree is walked to count the
+ * number of devices present and this will remove entitlement for devices
+ * that have actually left the system after having vio_bus_remove called.
+ */
+static void vio_cmo_balance(struct work_struct *work)
+{
+ struct vio_cmo *cmo;
+ struct vio_dev *viodev;
+ struct vio_cmo_dev_entry *dev_ent;
+ unsigned long flags;
+ size_t avail = 0, level, chunk, need;
+ int devcount = 0, fulfilled;
+
+ cmo = container_of(work, struct vio_cmo, balance_q.work);
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+
+ /* Calculate minimum entitlement and fulfill spare */
+ cmo->min = vio_cmo_num_OF_devs() * VIO_CMO_MIN_ENT;
+ BUG_ON(cmo->min > cmo->entitled);
+ cmo->spare = min_t(size_t, VIO_CMO_MIN_ENT, (cmo->entitled - cmo->min));
+ cmo->min += cmo->spare;
+ cmo->desired = cmo->min;
+
+ /*
+ * Determine how much entitlement is available and reset device
+ * entitlements
+ */
+ avail = cmo->entitled - cmo->spare;
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ viodev = dev_ent->viodev;
+ devcount++;
+ viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+ cmo->desired += (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+ avail -= max_t(size_t, viodev->cmo.allocated, VIO_CMO_MIN_ENT);
+ }
+
+ /*
+ * Having provided each device with the minimum entitlement, loop
+ * over the devices portioning out the remaining entitlement
+ * until there is nothing left.
+ */
+ level = VIO_CMO_MIN_ENT;
+ while (avail) {
+ fulfilled = 0;
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ viodev = dev_ent->viodev;
+
+ if (viodev->cmo.desired <= level) {
+ fulfilled++;
+ continue;
+ }
+
+ /*
+ * Give the device up to VIO_CMO_BALANCE_CHUNK
+ * bytes of entitlement, but do not exceed the
+ * desired level of entitlement for the device.
+ */
+ chunk = min_t(size_t, avail, VIO_CMO_BALANCE_CHUNK);
+ chunk = min(chunk, (viodev->cmo.desired -
+ viodev->cmo.entitled));
+ viodev->cmo.entitled += chunk;
+
+ /*
+ * If the memory for this entitlement increase was
+ * already allocated to the device it does not come
+ * from the available pool being portioned out.
+ */
+ need = max(viodev->cmo.allocated, viodev->cmo.entitled)-
+ max(viodev->cmo.allocated, level);
+ avail -= need;
+
+ }
+ if (fulfilled == devcount)
+ break;
+ level += VIO_CMO_BALANCE_CHUNK;
+ }
+
+ /* Calculate new reserve and excess pool sizes */
+ cmo->reserve.size = cmo->min;
+ cmo->excess.free = 0;
+ cmo->excess.size = 0;
+ need = 0;
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list) {
+ viodev = dev_ent->viodev;
+ /* Calculated reserve size above the minimum entitlement */
+ if (viodev->cmo.entitled)
+ cmo->reserve.size += (viodev->cmo.entitled -
+ VIO_CMO_MIN_ENT);
+ /* Calculated used excess entitlement */
+ if (viodev->cmo.allocated > viodev->cmo.entitled)
+ need += viodev->cmo.allocated - viodev->cmo.entitled;
+ }
+ cmo->excess.size = cmo->entitled - cmo->reserve.size;
+ cmo->excess.free = cmo->excess.size - need;
+
+ cancel_delayed_work(container_of(work, struct delayed_work, work));
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+static void *vio_dma_iommu_alloc_coherent(struct device *dev, size_t size,
+ dma_addr_t *dma_handle, gfp_t flag)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ void *ret;
+
+ if (vio_cmo_alloc(viodev, roundup(size, PAGE_SIZE))) {
+ atomic_inc(&viodev->cmo.allocs_failed);
+ return NULL;
+ }
+
+ ret = dma_iommu_ops.alloc_coherent(dev, size, dma_handle, flag);
+ if (unlikely(ret == NULL)) {
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+ atomic_inc(&viodev->cmo.allocs_failed);
+ }
+
+ return ret;
+}
+
+static void vio_dma_iommu_free_coherent(struct device *dev, size_t size,
+ void *vaddr, dma_addr_t dma_handle)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+
+ dma_iommu_ops.free_coherent(dev, size, vaddr, dma_handle);
+
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+}
+
+static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
+ size_t size,
+ enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ dma_addr_t ret = DMA_ERROR_CODE;
+
+ if (vio_cmo_alloc(viodev, roundup(size, PAGE_SIZE))) {
+ atomic_inc(&viodev->cmo.allocs_failed);
+ return ret;
+ }
+
+ ret = dma_iommu_ops.map_single(dev, vaddr, size, direction);
+ if (unlikely(dma_mapping_error(ret))) {
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+ atomic_inc(&viodev->cmo.allocs_failed);
+ }
+
+ return ret;
+}
+
+static void vio_dma_iommu_unmap_single(struct device *dev,
+ dma_addr_t dma_handle, size_t size,
+ enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+
+ dma_iommu_ops.unmap_single(dev, dma_handle, size, direction);
+
+ vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
+}
+
+static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
+ int nelems, enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ struct scatterlist *sgl;
+ int ret, count = 0;
+ size_t alloc_size = 0;
+
+ for (sgl = sglist; count < nelems; count++, sgl++)
+ alloc_size += roundup(sgl->length, PAGE_SIZE);
+
+ if (vio_cmo_alloc(viodev, alloc_size)) {
+ atomic_inc(&viodev->cmo.allocs_failed);
+ return 0;
+ }
+
+ ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction);
+
+ if (unlikely(!ret)) {
+ vio_cmo_dealloc(viodev, alloc_size);
+ atomic_inc(&viodev->cmo.allocs_failed);
+ }
+
+ return ret;
+}
+
+static void vio_dma_iommu_unmap_sg(struct device *dev,
+ struct scatterlist *sglist, int nelems,
+ enum dma_data_direction direction)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ struct scatterlist *sgl;
+ size_t alloc_size = 0;
+ int count = 0;
+
+ for (sgl = sglist; count < nelems; count++, sgl++)
+ alloc_size += roundup(sgl->length, PAGE_SIZE);
+
+ dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction);
+
+ vio_cmo_dealloc(viodev, alloc_size);
+}
+
+struct dma_mapping_ops vio_dma_mapping_ops = {
+ .alloc_coherent = vio_dma_iommu_alloc_coherent,
+ .free_coherent = vio_dma_iommu_free_coherent,
+ .map_single = vio_dma_iommu_map_single,
+ .unmap_single = vio_dma_iommu_unmap_single,
+ .map_sg = vio_dma_iommu_map_sg,
+ .unmap_sg = vio_dma_iommu_unmap_sg,
+};
+
+/**
+ * vio_cmo_set_dev_desired - Set desired entitlement for a device
+ *
+ * @viodev: struct vio_dev for device to alter
+ * @desired: new desired entitlement level in bytes
+ *
+ * For use by devices to request a change to their entitlement at runtime or
+ * through sysfs. The desired entitlement level is changed and a balancing
+ * of system resources is scheduled to run in the future.
+ */
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired)
+{
+ unsigned long flags;
+ struct vio_cmo_dev_entry *dev_ent;
+ int found = 0;
+
+ if (!firmware_has_feature(FW_FEATURE_CMO))
+ return;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ if (desired < VIO_CMO_MIN_ENT)
+ desired = VIO_CMO_MIN_ENT;
+
+ /*
+ * Changes will not be made for devices not in the device list.
+ * If it is not in the device list, then no driver is loaded
+ * for the device and it can not receive entitlement.
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+ if (viodev == dev_ent->viodev) {
+ found = 1;
+ break;
+ }
+ if (!found) {
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return;
+ }
+
+ /* Increase/decrease in desired device entitlement */
+ if (desired >= viodev->cmo.desired) {
+ /* Just bump the bus and device values prior to a balance*/
+ vio_cmo.desired += desired - viodev->cmo.desired;
+ viodev->cmo.desired = desired;
+ } else {
+ /* Decrease bus and device values for desired entitlement */
+ vio_cmo.desired -= viodev->cmo.desired - desired;
+ viodev->cmo.desired = desired;
+ /*
+ * If less entitlement is desired than current entitlement, move
+ * any reserve memory in the change region to the excess pool.
+ */
+ if (viodev->cmo.entitled > desired) {
+ vio_cmo.reserve.size -= viodev->cmo.entitled - desired;
+ vio_cmo.excess.size += viodev->cmo.entitled - desired;
+ /*
+ * If entitlement moving from the reserve pool to the
+ * excess pool is currently unused, add to the excess
+ * free counter.
+ */
+ if (viodev->cmo.allocated < viodev->cmo.entitled)
+ vio_cmo.excess.free += viodev->cmo.entitled -
+ max(viodev->cmo.allocated, desired);
+ viodev->cmo.entitled = desired;
+ }
+ }
+ schedule_delayed_work(&vio_cmo.balance_q, 0);
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+/**
+ * vio_cmo_bus_probe - Handle CMO specific bus probe activities
+ *
+ * @viodev - Pointer to struct vio_dev for device
+ *
+ * Determine the device's IO memory entitlement needs, attempting
+ * to satisfy the system minimum entitlement at first and scheduling
+ * a balance operation to take care of the rest at a later time.
+ *
+ * Returns: 0 on success, -EINVAL when device doesn't support CMO, and
+ * -ENOMEM when entitlement is not available for device or
+ * device entry.
+ *
+ */
+static int vio_cmo_bus_probe(struct vio_dev *viodev)
+{
+ struct vio_cmo_dev_entry *dev_ent;
+ struct device *dev = &viodev->dev;
+ struct vio_driver *viodrv = to_vio_driver(dev->driver);
+ unsigned long flags;
+ size_t size;
+
+ /*
+ * Check to see that device has a DMA window and configure
+ * entitlement for the device.
+ */
+ if (of_get_property(viodev->dev.archdata.of_node,
+ "ibm,my-dma-window", NULL)) {
+ /* Check that the driver is CMO enabled and get desired DMA */
+ if (!viodrv->get_desired_dma) {
+ dev_err(dev, "%s: device driver does not support CMO\n",
+ __func__);
+ return -EINVAL;
+ }
+
+ viodev->cmo.desired = IOMMU_PAGE_ALIGN(viodrv->get_desired_dma(viodev));
+ if (viodev->cmo.desired < VIO_CMO_MIN_ENT)
+ viodev->cmo.desired = VIO_CMO_MIN_ENT;
+ size = VIO_CMO_MIN_ENT;
+
+ dev_ent = kmalloc(sizeof(struct vio_cmo_dev_entry),
+ GFP_KERNEL);
+ if (!dev_ent)
+ return -ENOMEM;
+
+ dev_ent->viodev = viodev;
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ list_add(&dev_ent->list, &vio_cmo.device_list);
+ } else {
+ viodev->cmo.desired = 0;
+ size = 0;
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ }
+
+ /*
+ * If the needs for vio_cmo.min have not changed since they
+ * were last set, the number of devices in the OF tree has
+ * been constant and the IO memory for this is already in
+ * the reserve pool.
+ */
+ if (vio_cmo.min == ((vio_cmo_num_OF_devs() + 1) *
+ VIO_CMO_MIN_ENT)) {
+ /* Updated desired entitlement if device requires it */
+ if (size)
+ vio_cmo.desired += (viodev->cmo.desired -
+ VIO_CMO_MIN_ENT);
+ } else {
+ size_t tmp;
+
+ tmp = vio_cmo.spare + vio_cmo.excess.free;
+ if (tmp < size) {
+ dev_err(dev, "%s: insufficient free "
+ "entitlement to add device. "
+ "Need %lu, have %lu\n", __func__,
+ size, (vio_cmo.spare + tmp));
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return -ENOMEM;
+ }
+
+ /* Use excess pool first to fulfill request */
+ tmp = min(size, vio_cmo.excess.free);
+ vio_cmo.excess.free -= tmp;
+ vio_cmo.excess.size -= tmp;
+ vio_cmo.reserve.size += tmp;
+
+ /* Use spare if excess pool was insufficient */
+ vio_cmo.spare -= size - tmp;
+
+ /* Update bus accounting */
+ vio_cmo.min += size;
+ vio_cmo.desired += viodev->cmo.desired;
+ }
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+ return 0;
+}
+
+/**
+ * vio_cmo_bus_remove - Handle CMO specific bus removal activities
+ *
+ * @viodev - Pointer to struct vio_dev for device
+ *
+ * Remove the device from the cmo device list. The minimum entitlement
+ * will be reserved for the device as long as it is in the system. The
+ * rest of the entitlement the device had been allocated will be returned
+ * to the system.
+ */
+static void vio_cmo_bus_remove(struct vio_dev *viodev)
+{
+ struct vio_cmo_dev_entry *dev_ent;
+ unsigned long flags;
+ size_t tmp;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ if (viodev->cmo.allocated) {
+ dev_err(&viodev->dev, "%s: device had %lu bytes of IO "
+ "allocated after remove operation.\n",
+ __func__, viodev->cmo.allocated);
+ BUG();
+ }
+
+ /*
+ * Remove the device from the device list being maintained for
+ * CMO enabled devices.
+ */
+ list_for_each_entry(dev_ent, &vio_cmo.device_list, list)
+ if (viodev == dev_ent->viodev) {
+ list_del(&dev_ent->list);
+ kfree(dev_ent);
+ break;
+ }
+
+ /*
+ * Devices may not require any entitlement and they do not need
+ * to be processed. Otherwise, return the device's entitlement
+ * back to the pools.
+ */
+ if (viodev->cmo.entitled) {
+ /*
+ * This device has not yet left the OF tree, its
+ * minimum entitlement remains in vio_cmo.min and
+ * vio_cmo.desired
+ */
+ vio_cmo.desired -= (viodev->cmo.desired - VIO_CMO_MIN_ENT);
+
+ /*
+ * Save min allocation for device in reserve as long
+ * as it exists in OF tree as determined by later
+ * balance operation
+ */
+ viodev->cmo.entitled -= VIO_CMO_MIN_ENT;
+
+ /* Replenish spare from freed reserve pool */
+ if (viodev->cmo.entitled && (vio_cmo.spare < VIO_CMO_MIN_ENT)) {
+ tmp = min(viodev->cmo.entitled, (VIO_CMO_MIN_ENT -
+ vio_cmo.spare));
+ vio_cmo.spare += tmp;
+ viodev->cmo.entitled -= tmp;
+ }
+
+ /* Remaining reserve goes to excess pool */
+ vio_cmo.excess.size += viodev->cmo.entitled;
+ vio_cmo.excess.free += viodev->cmo.entitled;
+ vio_cmo.reserve.size -= viodev->cmo.entitled;
+
+ /*
+ * Until the device is removed it will keep a
+ * minimum entitlement; this will guarantee that
+ * a module unload/load will result in a success.
+ */
+ viodev->cmo.entitled = VIO_CMO_MIN_ENT;
+ viodev->cmo.desired = VIO_CMO_MIN_ENT;
+ atomic_set(&viodev->cmo.allocs_failed, 0);
+ }
+
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+}
+
+static void vio_cmo_set_dma_ops(struct vio_dev *viodev)
+{
+ vio_dma_mapping_ops.dma_supported = dma_iommu_ops.dma_supported;
+ viodev->dev.archdata.dma_ops = &vio_dma_mapping_ops;
+}
+
+/**
+ * vio_cmo_bus_init - CMO entitlement initialization at bus init time
+ *
+ * Set up the reserve and excess entitlement pools based on available
+ * system entitlement and the number of devices in the OF tree that
+ * require entitlement in the reserve pool.
+ */
+static void vio_cmo_bus_init(void)
+{
+ struct hvcall_mpp_data mpp_data;
+ int err;
+
+ memset(&vio_cmo, 0, sizeof(struct vio_cmo));
+ spin_lock_init(&vio_cmo.lock);
+ INIT_LIST_HEAD(&vio_cmo.device_list);
+ INIT_DELAYED_WORK(&vio_cmo.balance_q, vio_cmo_balance);
+
+ /* Get current system entitlement */
+ err = h_get_mpp(&mpp_data);
+
+ /*
+ * On failure, continue with entitlement set to 0, will panic()
+ * later when spare is reserved.
+ */
+ if (err != H_SUCCESS) {
+ printk(KERN_ERR "%s: unable to determine system IO "\
+ "entitlement. (%d)\n", __func__, err);
+ vio_cmo.entitled = 0;
+ } else {
+ vio_cmo.entitled = mpp_data.entitled_mem;
+ }
+
+ /* Set reservation and check against entitlement */
+ vio_cmo.spare = VIO_CMO_MIN_ENT;
+ vio_cmo.reserve.size = vio_cmo.spare;
+ vio_cmo.reserve.size += (vio_cmo_num_OF_devs() *
+ VIO_CMO_MIN_ENT);
+ if (vio_cmo.reserve.size > vio_cmo.entitled) {
+ printk(KERN_ERR "%s: insufficient system entitlement\n",
+ __func__);
+ panic("%s: Insufficient system entitlement", __func__);
+ }
+
+ /* Set the remaining accounting variables */
+ vio_cmo.excess.size = vio_cmo.entitled - vio_cmo.reserve.size;
+ vio_cmo.excess.free = vio_cmo.excess.size;
+ vio_cmo.min = vio_cmo.reserve.size;
+ vio_cmo.desired = vio_cmo.reserve.size;
+}
+
+/* sysfs device functions and data structures for CMO */
+
+#define viodev_cmo_rd_attr(name) \
+static ssize_t viodev_cmo_##name##_show(struct device *dev, \
+ struct device_attribute *attr, \
+ char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", to_vio_dev(dev)->cmo.name); \
+}
+
+static ssize_t viodev_cmo_allocs_failed_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ return sprintf(buf, "%d\n", atomic_read(&viodev->cmo.allocs_failed));
+}
+
+static ssize_t viodev_cmo_allocs_failed_reset(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ atomic_set(&viodev->cmo.allocs_failed, 0);
+ return count;
+}
+
+static ssize_t viodev_cmo_desired_set(struct device *dev,
+ struct device_attribute *attr, const char *buf, size_t count)
+{
+ struct vio_dev *viodev = to_vio_dev(dev);
+ size_t new_desired;
+ int ret;
+
+ ret = strict_strtoul(buf, 10, &new_desired);
+ if (ret)
+ return ret;
+
+ vio_cmo_set_dev_desired(viodev, new_desired);
+ return count;
+}
+
+viodev_cmo_rd_attr(desired);
+viodev_cmo_rd_attr(entitled);
+viodev_cmo_rd_attr(allocated);
+
+static ssize_t name_show(struct device *, struct device_attribute *, char *);
+static ssize_t devspec_show(struct device *, struct device_attribute *, char *);
+static struct device_attribute vio_cmo_dev_attrs[] = {
+ __ATTR_RO(name),
+ __ATTR_RO(devspec),
+ __ATTR(cmo_desired, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+ viodev_cmo_desired_show, viodev_cmo_desired_set),
+ __ATTR(cmo_entitled, S_IRUGO, viodev_cmo_entitled_show, NULL),
+ __ATTR(cmo_allocated, S_IRUGO, viodev_cmo_allocated_show, NULL),
+ __ATTR(cmo_allocs_failed, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+ viodev_cmo_allocs_failed_show, viodev_cmo_allocs_failed_reset),
+ __ATTR_NULL
+};
+
+/* sysfs bus functions and data structures for CMO */
+
+#define viobus_cmo_rd_attr(name) \
+static ssize_t \
+viobus_cmo_##name##_show(struct bus_type *bt, char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", vio_cmo.name); \
+}
+
+#define viobus_cmo_pool_rd_attr(name, var) \
+static ssize_t \
+viobus_cmo_##name##_pool_show_##var(struct bus_type *bt, char *buf) \
+{ \
+ return sprintf(buf, "%lu\n", vio_cmo.name.var); \
+}
+
+static ssize_t viobus_cmo_high_reset(struct bus_type *bt, const char *buf,
+ size_t count)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&vio_cmo.lock, flags);
+ vio_cmo.high = vio_cmo.curr;
+ spin_unlock_irqrestore(&vio_cmo.lock, flags);
+
+ return count;
+}
+
+viobus_cmo_rd_attr(entitled);
+viobus_cmo_pool_rd_attr(reserve, size);
+viobus_cmo_pool_rd_attr(excess, size);
+viobus_cmo_pool_rd_attr(excess, free);
+viobus_cmo_rd_attr(spare);
+viobus_cmo_rd_attr(min);
+viobus_cmo_rd_attr(desired);
+viobus_cmo_rd_attr(curr);
+viobus_cmo_rd_attr(high);
+
+static struct bus_attribute vio_cmo_bus_attrs[] = {
+ __ATTR(cmo_entitled, S_IRUGO, viobus_cmo_entitled_show, NULL),
+ __ATTR(cmo_reserve_size, S_IRUGO, viobus_cmo_reserve_pool_show_size, NULL),
+ __ATTR(cmo_excess_size, S_IRUGO, viobus_cmo_excess_pool_show_size, NULL),
+ __ATTR(cmo_excess_free, S_IRUGO, viobus_cmo_excess_pool_show_free, NULL),
+ __ATTR(cmo_spare, S_IRUGO, viobus_cmo_spare_show, NULL),
+ __ATTR(cmo_min, S_IRUGO, viobus_cmo_min_show, NULL),
+ __ATTR(cmo_desired, S_IRUGO, viobus_cmo_desired_show, NULL),
+ __ATTR(cmo_curr, S_IRUGO, viobus_cmo_curr_show, NULL),
+ __ATTR(cmo_high, S_IWUSR|S_IRUSR|S_IWGRP|S_IRGRP|S_IROTH,
+ viobus_cmo_high_show, viobus_cmo_high_reset),
+ __ATTR_NULL
+};
+
+static void vio_cmo_sysfs_init(void)
+{
+ vio_bus_type.dev_attrs = vio_cmo_dev_attrs;
+ vio_bus_type.bus_attrs = vio_cmo_bus_attrs;
+}
+#else /* CONFIG_PPC_PSERIES */
+/* Dummy functions for iSeries platform */
+int vio_cmo_entitlement_update(size_t new_entitlement) { return 0; }
+void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired) {}
+static int vio_cmo_bus_probe(struct vio_dev *viodev) { return 0; }
+static void vio_cmo_bus_remove(struct vio_dev *viodev) {}
+static void vio_cmo_set_dma_ops(struct vio_dev *viodev) {}
+static void vio_cmo_bus_init() {}
+static void vio_cmo_sysfs_init() { }
+#endif /* CONFIG_PPC_PSERIES */
+EXPORT_SYMBOL(vio_cmo_entitlement_update);
+EXPORT_SYMBOL(vio_cmo_set_dev_desired);
+
static struct iommu_table *vio_build_iommu_table(struct vio_dev *dev)
{
const unsigned char *dma_window;
@@ -114,8 +1096,17 @@ static int vio_bus_probe(struct device *
return error;

id = vio_match_device(viodrv->id_table, viodev);
- if (id)
+ if (id) {
+ memset(&viodev->cmo, 0, sizeof(viodev->cmo));
+ if (firmware_has_feature(FW_FEATURE_CMO)) {
+ error =3D vio_cmo_bus_probe(viodev);
+ if (error)
+ return error;
+ }
error =3D viodrv->probe(viodev, id);
+ if (error)
+ vio_cmo_bus_remove(viodev);
+ }

return error;
}
@@ -125,12 +1116,23 @@ static int vio_bus_remove(struct device
{
struct vio_dev *viodev = to_vio_dev(dev);
struct vio_driver *viodrv = to_vio_driver(dev->driver);
+ struct device *devptr;
+ int ret = 1;
+
+ /*
+ * Hold a reference to the device after the remove function is called
+ * to allow for CMO accounting cleanup for the device.
+ */
+ devptr = get_device(dev);

if (viodrv->remove)
- return viodrv->remove(viodev);
+ ret = viodrv->remove(viodev);

- /* driver can't remove */
- return 1;
+ if (!ret && firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_bus_remove(viodev);
+
+ put_device(devptr);
+ return ret;
}

/**
@@ -215,7 +1217,11 @@ struct vio_dev *vio_register_device_node
viodev->unit_address = *unit_address;
}
viodev->dev.archdata.of_node = of_node_get(of_node);
- viodev->dev.archdata.dma_ops = &dma_iommu_ops;
+
+ if (firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_set_dma_ops(viodev);
+ else
+ viodev->dev.archdata.dma_ops = &dma_iommu_ops;
viodev->dev.archdata.dma_data = vio_build_iommu_table(viodev);
viodev->dev.archdata.numa_node = of_node_to_nid(of_node);

@@ -245,6 +1251,9 @@ static int __init vio_bus_init(void)
int err;
struct device_node *node_vroot;

+ if (firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_sysfs_init();
+
err = bus_register(&vio_bus_type);
if (err) {
printk(KERN_ERR "failed to register VIO bus\n");
@@ -262,6 +1271,9 @@ static int __init vio_bus_init(void)
return err;
}

+ if (firmware_has_feature(FW_FEATURE_CMO))
+ vio_cmo_bus_init();
+
node_vroot = of_find_node_by_name(NULL, "vdevice");
if (node_vroot) {
struct device_node *of_node;
Index: b/include/asm-powerpc/vio.h
===================================================================
--- a/include/asm-powerpc/vio.h
+++ b/include/asm-powerpc/vio.h
@@ -39,16 +39,32 @@
#define VIO_IRQ_DISABLE 0UL
#define VIO_IRQ_ENABLE 1UL

+/*
+ * VIO CMO minimum entitlement for all devices and spare entitlement
+ */
+#define VIO_CMO_MIN_ENT 1562624
+
struct iommu_table;

-/*
- * The vio_dev structure is used to describe virtual I/O devices.
+/**
+ * vio_dev - This structure is used to describe virtual I/O devices.
+ *
+ * @desired: set from return of driver's get_desired_dma() function
+ * @entitled: bytes of IO data that has been reserved for this device.
+ * @allocated: bytes of IO data currently in use by the device.
+ * @allocs_failed: number of DMA failures due to insufficient entitlement.
*/
struct vio_dev {
const char *name;
const char *type;
uint32_t unit_address;
unsigned int irq;
+ struct {
+ size_t desired;
+ size_t entitled;
+ size_t allocated;
+ atomic_t allocs_failed;
+ } cmo;
struct device dev;
};

@@ -56,12 +72,19 @@ struct vio_driver {
const struct vio_device_id *id_table;
int (*probe)(struct vio_dev *dev, const struct vio_device_id *id);
int (*remove)(struct vio_dev *dev);
+ /* A driver must have a get_desired_dma() function to
+ * be loaded in a CMO environment if it uses DMA.
+ */
+ unsigned long (*get_desired_dma)(struct vio_dev *dev);
struct device_driver driver;
};

extern int vio_register_driver(struct vio_driver *drv);
extern void vio_unregister_driver(struct vio_driver *drv);

+extern int vio_cmo_entitlement_update(size_t);
+extern void vio_cmo_set_dev_desired(struct vio_dev *viodev, size_t desired);
+
extern void __devinit vio_unregister_device(struct vio_dev *dev);

struct device_node;
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 15/16 v3] [v2] ibmvscsi: driver enablement for CMO
2008-07-08 20:35 ` [PATCH 15/16 v3] [v2] " Robert Jennings
@ 2008-07-10 13:43 ` Brian King
0 siblings, 0 replies; 38+ messages in thread
From: Brian King @ 2008-07-10 13:43 UTC (permalink / raw)
To: Robert Jennings; +Cc: linux-scsi, linuxppc-dev, David Darrington, paulus
Acked-by: Brian King <brking@linux.vnet.ibm.com>
--
Brian King
Linux on Power Virtualization
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 08/16 v3] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled
2008-07-04 12:54 ` [PATCH 08/16 v3] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
@ 2008-07-14 21:35 ` Brian King
0 siblings, 0 replies; 38+ messages in thread
From: Brian King @ 2008-07-14 21:35 UTC (permalink / raw)
To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, paulus, David Darrington
Ben,
Please drop this patch from the series. After further discussion, this patch
is not required and has actually been causing problems.
Thanks,
Brian
Robert Jennings wrote:
> From: Brian King <brking@linux.vnet.ibm.com>
>
> The Cooperative Memory Overcommit (CMO) on System p does not currently
> support native PCI devices or eBus devices when enabled. Prevent
> PCI bus probe and eBus device probe if the feature is enabled.
>
> Signed-off-by: Brian King <brking@linux.vnet.ibm.com>
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
>
> ---
>
> arch/powerpc/kernel/ibmebus.c | 6 ++++++
> arch/powerpc/platforms/pseries/setup.c | 4 ++++
> 2 files changed, 10 insertions(+)
>
> Index: b/arch/powerpc/kernel/ibmebus.c
> ===================================================================
> --- a/arch/powerpc/kernel/ibmebus.c
> +++ b/arch/powerpc/kernel/ibmebus.c
> @@ -45,6 +45,7 @@
> #include <linux/of_platform.h>
> #include <asm/ibmebus.h>
> #include <asm/abs_addr.h>
> +#include <asm/firmware.h>
>
> static struct device ibmebus_bus_device = { /* fake "parent" device */
> .bus_id = "ibmebus",
> @@ -332,6 +333,11 @@ static int __init ibmebus_bus_init(void)
> {
> int err;
>
> + if (firmware_has_feature(FW_FEATURE_CMO)) {
> + printk(KERN_WARNING "Not probing eBus since CMO is enabled\n");
> + return 0;
> + }
> +
> printk(KERN_INFO "IBM eBus Device Driver\n");
>
> err = of_bus_type_init(&ibmebus_bus_type, "ibmebus");
> Index: b/arch/powerpc/platforms/pseries/setup.c
> ===================================================================
> --- a/arch/powerpc/platforms/pseries/setup.c
> +++ b/arch/powerpc/platforms/pseries/setup.c
> @@ -539,6 +539,10 @@ static void pseries_shared_idle_sleep(vo
>
> static int pSeries_pci_probe_mode(struct pci_bus *bus)
> {
> + if (firmware_has_feature(FW_FEATURE_CMO)) {
> + dev_warn(&bus->dev, "Not probing PCI bus since CMO is enabled\n");
> + return PCI_PROBE_NONE;
> + }
> if (firmware_has_feature(FW_FEATURE_LPAR))
> return PCI_PROBE_DEVTREE;
> return PCI_PROBE_NORMAL;
> _______________________________________________
> Linuxppc-dev mailing list
> Linuxppc-dev@ozlabs.org
> https://ozlabs.org/mailman/listinfo/linuxppc-dev
--
Brian King
Linux on Power Virtualization
IBM Linux Technology Center
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg
2008-07-04 12:51 ` [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
@ 2008-07-22 3:34 ` Paul Mackerras
0 siblings, 0 replies; 38+ messages in thread
From: Paul Mackerras @ 2008-07-22 3:34 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, David Darrington
Robert Jennings writes:
> From: Nathan Fontenot <nfont@austin.ibm.com>
>
> Remove the extraneous error reporting used when a hcall made from lparcfg fails.
>
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
> Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
Acked-by: Paul Mackerras <paulus@samba.org>
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 07/16 v3] powerpc: Add collaborative memory manager
2008-07-04 12:53 ` [PATCH 07/16 v3] powerpc: Add collaborative memory manager Robert Jennings
@ 2008-07-22 4:53 ` Paul Mackerras
0 siblings, 0 replies; 38+ messages in thread
From: Paul Mackerras @ 2008-07-22 4:53 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, David Darrington
Robert Jennings writes:
> From: Brian King <brking@linux.vnet.ibm.com>
>
> Adds a collaborative memory manager, which acts as a simple balloon driver
> for System p machines that support cooperative memory overcommitment
> (CMO).
> +config CMM
> + tristate "Collaborative memory management"
> + depends on PPC_PSERIES
So CMM doesn't depend on LPARCFG, yet h_get_mpp is only defined if
LPARCFG=y, which makes this blow up if CMM=y and LPARCFG=n:
> +static void cmm_get_mpp(void)
> +{
> + int rc;
> + struct hvcall_mpp_data mpp_data;
> + unsigned long active_pages_target;
> + signed long page_loan_request;
> +
> + rc = h_get_mpp(&mpp_data);
Similarly, the call to h_get_mpp in vio_cmo_bus_init fails to link if
PPC_PSERIES=y and LPARCFG=n, which is a possible configuration.
Please think about and fix up the config dependencies.
Paul.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 10/16 v3] powerpc: iommu enablement for CMO
2008-07-04 12:54 ` [PATCH 10/16 v3] powerpc: iommu enablement for CMO Robert Jennings
2008-07-05 17:51 ` Olof Johansson
2008-07-08 20:48 ` [PATCH 10/16 v3] [v2] " Robert Jennings
@ 2008-07-22 4:57 ` Paul Mackerras
2008-07-22 13:28 ` Robert Jennings
2 siblings, 1 reply; 38+ messages in thread
From: Paul Mackerras @ 2008-07-22 4:57 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, David Darrington
Robert Jennings writes:
> To support Cooperative Memory Overcommitment (CMO), we need to check
> for failure from some of the tce hcalls.
This patch runs into context mismatches because of changes made by
Michael Ellerman's patch "Fix sparse warnings in
arch/powerpc/platforms/pseries" (now in Linus' tree), which changed
code like
if (condition)
return function_returning_void(args);
into
if (condition) {
function_returning_void(args);
return;
}
which will cause problems for your patch. Please check if any of
these changes need to be undone again.
Paul.
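
The difference between the two forms can be shown with a small stand-alone sketch. The names tce_op, wrapper_void and wrapper_int are hypothetical and are not taken from either patch; the only assumption is that a callee which used to return void now returns a status that its caller has to pass on.

```c
#include <assert.h>

/* Hypothetical callee: 0 on success, negative on failure. */
static int tce_op(int fail)
{
	return fail ? -1 : 0;
}

/*
 * Form left behind by the sparse cleanup. It is correct while the
 * callee returns void, but it silently drops the status once the
 * callee starts returning one.
 */
static void wrapper_void(int fail)
{
	tce_op(fail);
	return;
}

/* Form needed once failures have to reach the caller. */
static int wrapper_int(int fail)
{
	return tce_op(fail);
}
```

With these definitions, wrapper_int(1) hands the failure back to its caller while wrapper_void(1) discards it, which is why the brace-and-return form may need to be converted back wherever the callee gains a return value.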
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 10/16 v3] [v2] powerpc: iommu enablement for CMO
2008-07-08 20:48 ` [PATCH 10/16 v3] [v2] " Robert Jennings
@ 2008-07-22 5:04 ` Paul Mackerras
2008-07-22 13:30 ` Robert Jennings
0 siblings, 1 reply; 38+ messages in thread
From: Paul Mackerras @ 2008-07-22 5:04 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, David Darrington, linuxppc-dev
Robert Jennings writes:
> From: Robert Jennings <rcj@linux.vnet.ibm.com>
>
> Minor change to add a call to align the return from the device's
> get_desired_dma() function with IOMMU_PAGE_ALIGN(). Also removed a
> comment referring to a non-existent structure member.
>
> This is a large patch but the normal code path is not affected. For
> non-pSeries platforms the code is ifdef'ed out and for non-CMO enabled
> pSeries systems this does not affect the normal code path. Devices that
> do not perform DMA operations do not need modification with this patch.
> The function get_desired_dma was renamed from get_io_entitlement for
> clarity.
This patch is actually a new version of [PATCH 11/16 v3] powerpc: vio
bus support for CMO, not a new version of [PATCH 10/16 v3] powerpc:
iommu enablement for CMO as the subject would indicate, which tripped
me up. Anyway, my first comment is that the first paragraph of the
description ("Minor change to ...") is not appropriate for the git
tree and will have to be edited before the patch is applied. If the
extra changes are worth describing, describe them (in stand-alone
fashion) in the description; otherwise put things like this after the
line of three dashes, which terminates the description.
> +static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
> + int nelems, enum dma_data_direction direction)
This function, and the related unmap_sg, map_single and unmap_single
functions, now take an extra "struct dma_attrs *attrs" argument since
Mark Nelson's patch "powerpc/dma: implement new dma_*map*_attrs()
interfaces" went in (and it's now in Linus' tree). You need to roll
something like the patch below in with the 11/16 patch.
Paul.
diff --git a/arch/powerpc/kernel/vio.c b/arch/powerpc/kernel/vio.c
index ad818d1..e05baea 100644
--- a/arch/powerpc/kernel/vio.c
+++ b/arch/powerpc/kernel/vio.c
@@ -518,7 +518,8 @@ static void vio_dma_iommu_free_coherent(struct device *dev, size_t size,
static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
size_t size,
- enum dma_data_direction direction)
+ enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
struct vio_dev *viodev = to_vio_dev(dev);
dma_addr_t ret = DMA_ERROR_CODE;
@@ -528,7 +529,7 @@ static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
return ret;
}
- ret = dma_iommu_ops.map_single(dev, vaddr, size, direction);
+ ret = dma_iommu_ops.map_single(dev, vaddr, size, direction, attrs);
if (unlikely(dma_mapping_error(ret))) {
vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
atomic_inc(&viodev->cmo.allocs_failed);
@@ -539,17 +540,18 @@ static dma_addr_t vio_dma_iommu_map_single(struct device *dev, void *vaddr,
static void vio_dma_iommu_unmap_single(struct device *dev,
dma_addr_t dma_handle, size_t size,
- enum dma_data_direction direction)
+ enum dma_data_direction direction, struct dma_attrs *attrs)
{
struct vio_dev *viodev = to_vio_dev(dev);
- dma_iommu_ops.unmap_single(dev, dma_handle, size, direction);
+ dma_iommu_ops.unmap_single(dev, dma_handle, size, direction, attrs);
vio_cmo_dealloc(viodev, roundup(size, PAGE_SIZE));
}
static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
- int nelems, enum dma_data_direction direction)
+ int nelems, enum dma_data_direction direction,
+ struct dma_attrs *attrs)
{
struct vio_dev *viodev = to_vio_dev(dev);
struct scatterlist *sgl;
@@ -564,7 +566,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
return 0;
}
- ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction);
+ ret = dma_iommu_ops.map_sg(dev, sglist, nelems, direction, attrs);
if (unlikely(!ret)) {
vio_cmo_dealloc(viodev, alloc_size);
@@ -581,7 +583,7 @@ static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
static void vio_dma_iommu_unmap_sg(struct device *dev,
struct scatterlist *sglist, int nelems,
- enum dma_data_direction direction)
+ enum dma_data_direction direction, struct dma_attrs *attrs)
{
struct vio_dev *viodev = to_vio_dev(dev);
struct scatterlist *sgl;
@@ -591,7 +593,7 @@ static void vio_dma_iommu_unmap_sg(struct device *dev,
for (sgl = sglist; count < nelems; count++, sgl++)
alloc_size += roundup(sgl->dma_length, PAGE_SIZE);
- dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction);
+ dma_iommu_ops.unmap_sg(dev, sglist, nelems, direction, attrs);
vio_cmo_dealloc(viodev, alloc_size);
}
^ permalink raw reply related [flat|nested] 38+ messages in thread
* Re: [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine
2008-07-04 12:52 ` [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
@ 2008-07-22 5:54 ` Paul Mackerras
2008-07-22 18:49 ` Nathan Fontenot
2008-07-22 18:56 ` Nathan Fontenot
1 sibling, 1 reply; 38+ messages in thread
From: Paul Mackerras @ 2008-07-22 5:54 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, David Darrington
Robert Jennings writes:
> Split the retrieval of processor entitlement data returned in the H_GET_PPP
> hcall into its own helper routine.
This seems to change the value reported for pool_capacity radically:
> /* report pool_capacity in percentage */
> - seq_printf(m, "pool_capacity=%ld\n",
> - ((h_resource >> 2 * 8) & 0xffff) * 100);
> + seq_printf(m, "pool_capacity=%d\n", ppp_data.group_num * 100);
On a Power6 partition here with your patch series applied, I see
pool_capacity=3277200
in /proc/ppc64/lparcfg. Without your patches, I get
pool_capacity=400
pool_idle_time=0
pool_num_procs=0
This looks like an incompatible user-visible change to me, and we
haven't even changed the lparcfg version number at the beginning of
the /proc/ppc64/lparcfg output. Why is the pool_capacity so
different, and why do the pool_idle_time and pool_num_procs lines
disappear?
Regards,
Paul.
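
For reference, the arithmetic in the removed lines can be sketched outside the kernel. The helper names field16 and pool_capacity and the sample register values are assumptions made for illustration; only the shift, the mask and the factor of 100 come from the quoted code.

```c
#include <assert.h>

/* Extract the 16-bit field that starts at byte offset 'byte' of 'word'. */
static unsigned long field16(unsigned long word, unsigned int byte)
{
	return (word >> (byte * 8)) & 0xffff;
}

/*
 * pool_capacity as computed by the removed lines:
 * ((h_resource >> 2 * 8) & 0xffff) * 100
 */
static unsigned long pool_capacity(unsigned long h_resource)
{
	return field16(h_resource, 2) * 100;
}
```

A field value of 4 gives the 400 seen without the patches. The 3277200 seen with the patches is 32772 * 100, and 32772 is 0x8004, so the new code is multiplying a different 16-bit quantity (ppp_data.group_num, per the added line) rather than the field the old code extracted.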
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 10/16 v3] powerpc: iommu enablement for CMO
2008-07-22 4:57 ` [PATCH 10/16 v3] " Paul Mackerras
@ 2008-07-22 13:28 ` Robert Jennings
0 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-22 13:28 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Brian King, linuxppc-dev, David Darrington
* Paul Mackerras (paulus@samba.org) wrote:
> Robert Jennings writes:
>
> > To support Cooperative Memory Overcommitment (CMO), we need to check
> > for failure from some of the tce hcalls.
>
> This patch runs into context mismatches because of changes made by
> Michael Ellerman's patch "Fix sparse warnings in
> arch/powerpc/platforms/pseries" (now in Linus' tree), which changed
> code like
>
> if (condition)
> return function_returning_void(args);
>
> into
>
> if (condition) {
> function_returning_void(args);
> return;
> }
>
> which will cause problems for your patch. Please check if any of
> these changes need to be undone again.
I do need to revert those changes, the return values will no longer be
void. I'll get those tested and posted. Thanks.
--Rob Jennings
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 10/16 v3] [v2] powerpc: iommu enablement for CMO
2008-07-22 5:04 ` Paul Mackerras
@ 2008-07-22 13:30 ` Robert Jennings
0 siblings, 0 replies; 38+ messages in thread
From: Robert Jennings @ 2008-07-22 13:30 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Brian King, David Darrington, linuxppc-dev
* Paul Mackerras (paulus@samba.org) wrote:
> Robert Jennings writes:
>
> > Minor change to add a call to align the return from the device's
> > get_desired_dma() function with IOMMU_PAGE_ALIGN(). Also removed a
> > comment referring to a non-existent structure member.
>
> Anyway, my first comment is that the first paragraph of the
> description ("Minor change to ...") is not appropriate for the git
> tree and will have to be edited before the patch is applied. If the
> extra changes are worth describing, describe them (in stand-alone
> fashion) in the description; otherwise put things like this after the
> line of three dashes, which terminates the description.
I'll correct this with the other changes that need to be made.
> > +static int vio_dma_iommu_map_sg(struct device *dev, struct scatterlist *sglist,
> > + int nelems, enum dma_data_direction direction)
>
> This function, and the related unmap_sg, map_single and unmap_single
> functions, now take an extra "struct dma_attrs *attrs" argument since
> Mark Nelson's patch "powerpc/dma: implement new dma_*map*_attrs()
> interfaces" went in (and it's now in Linus' tree). You need to roll
> something like the patch below in with the 11/16 patch.
I'll repost with updates to add the dma_attrs field.
--Rob
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine
2008-07-22 5:54 ` Paul Mackerras
@ 2008-07-22 18:49 ` Nathan Fontenot
0 siblings, 0 replies; 38+ messages in thread
From: Nathan Fontenot @ 2008-07-22 18:49 UTC (permalink / raw)
To: Paul Mackerras; +Cc: Brian King, linuxppc-dev, David Darrington
Paul Mackerras wrote:
> Robert Jennings writes:
>
>> Split the retrieval of processor entitlement data returned in the H_GET_PPP
>> hcall into its own helper routine.
>
> This seems to change the value reported for pool_capacity radically:
>
>> /* report pool_capacity in percentage */
>> - seq_printf(m, "pool_capacity=%ld\n",
>> - ((h_resource >> 2 * 8) & 0xffff) * 100);
>> + seq_printf(m, "pool_capacity=%d\n", ppp_data.group_num * 100);
>
> On a Power6 partition here with your patch series applied, I see
>
> pool_capacity=3277200
>
> in /proc/ppc64/lparcfg. Without your patches, I get
>
> pool_capacity=400
> pool_idle_time=0
> pool_num_procs=0
>
> This looks like an incompatible user-visible change to me, and we
> haven't even changed the lparcfg version number at the beginning of
> the /proc/ppc64/lparcfg output. Why is the pool_capacity so
> different, and why do the pool_idle_time and pool_num_procs lines
> disappear?
>
ok, three problems, three new patches.
The reporting of pool_capacity was a bug in using the wrong information
reported by h_get_ppp in the patch. This is in a new patch 4/16.
The failure to report the pool_idle_time and pool_num_procs was due to
an update to h_pic where we started checking the return code of the
h_call for H_PIC. The values were not reported if the h_call fails,
which on my partition it fails with -10 (H_Authority). I have reverted
this back to the previous behavior and report the values of pool_idle_time
and pool_num_procs regardless of the h_call return code. This is
in a new patch 2/16.
Yes, the lparcfg version number should have been updated. I missed that.
Fixed in a new patch 3/16.
-Nathan
> Regards,
> Paul.
^ permalink raw reply [flat|nested] 38+ messages in thread
* Re: [PATCH 02/16 v3] powerpc: Split processor entitlement retrieval and gathering to helper routines
2008-07-04 12:51 ` [PATCH 02/16 v3] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
@ 2008-07-22 18:53 ` Nathan Fontenot
0 siblings, 0 replies; 38+ messages in thread
From: Nathan Fontenot @ 2008-07-22 18:53 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington
Updated patch to remove checking the return code from the h_call for
H_PIC. This reverts the reporting back to its original state.
Split the retrieval and setting of processor entitlement and weight into
helper routines. This also removes the printing of the raw values
returned from h_get_ppp, the values are already parsed and printed.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 168 ++++++++++++++++++++++-------------------
1 file changed, 90 insertions(+), 78 deletions(-)
Index: linux-2.6.git/arch/powerpc/kernel/lparcfg.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/lparcfg.c 2008-07-22 10:35:13.000000000 -0500
+++ linux-2.6.git/arch/powerpc/kernel/lparcfg.c 2008-07-22 12:50:47.000000000 -0500
@@ -167,7 +167,8 @@
return rc;
}
-static void h_pic(unsigned long *pool_idle_time, unsigned long *num_procs)
+static unsigned h_pic(unsigned long *pool_idle_time,
+ unsigned long *num_procs)
{
unsigned long rc;
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
@@ -176,6 +177,51 @@
*pool_idle_time = retbuf[0];
*num_procs = retbuf[1];
+
+ return rc;
+}
+
+/*
+ * parse_ppp_data
+ * Parse out the data returned from h_get_ppp and h_pic
+ */
+static void parse_ppp_data(struct seq_file *m)
+{
+ unsigned long h_entitled, h_unallocated;
+ unsigned long h_aggregation, h_resource;
+ int rc;
+
+ rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
+ &h_resource);
+ if (rc)
+ return;
+
+ seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
+ seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
+ seq_printf(m, "system_active_processors=%ld\n",
+ (h_resource >> 0 * 8) & 0xffff);
+
+ /* pool related entries are apropriate for shared configs */
+ if (lppaca[0].shared_proc) {
+ unsigned long pool_idle_time, pool_procs;
+
+ seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+
+ /* report pool_capacity in percentage */
+ seq_printf(m, "pool_capacity=%ld\n",
+ ((h_resource >> 2 * 8) & 0xffff) * 100);
+
+ h_pic(&pool_idle_time, &pool_procs);
+ seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
+ seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
+ }
+
+ seq_printf(m, "unallocated_capacity_weight=%ld\n",
+ (h_resource >> 4 * 8) & 0xFF);
+
+ seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
+ seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
+ seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
}
#define SPLPAR_CHARACTERISTICS_TOKEN 20
@@ -302,60 +348,11 @@
partition_active_processors = lparcfg_count_active_processors();
if (firmware_has_feature(FW_FEATURE_SPLPAR)) {
- unsigned long h_entitled, h_unallocated;
- unsigned long h_aggregation, h_resource;
- unsigned long pool_idle_time, pool_procs;
- unsigned long purr;
-
- h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
- &h_resource);
-
- seq_printf(m, "R4=0x%lx\n", h_entitled);
- seq_printf(m, "R5=0x%lx\n", h_unallocated);
- seq_printf(m, "R6=0x%lx\n", h_aggregation);
- seq_printf(m, "R7=0x%lx\n", h_resource);
-
- purr = get_purr();
-
/* this call handles the ibm,get-system-parameter contents */
parse_system_parameter_string(m);
+ parse_ppp_data(m);
- seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
-
- seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
-
- seq_printf(m, "system_active_processors=%ld\n",
- (h_resource >> 0 * 8) & 0xffff);
-
- /* pool related entries are apropriate for shared configs */
- if (lppaca[0].shared_proc) {
-
- h_pic(&pool_idle_time, &pool_procs);
-
- seq_printf(m, "pool=%ld\n",
- (h_aggregation >> 0 * 8) & 0xffff);
-
- /* report pool_capacity in percentage */
- seq_printf(m, "pool_capacity=%ld\n",
- ((h_resource >> 2 * 8) & 0xffff) * 100);
-
- seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
-
- seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
- }
-
- seq_printf(m, "unallocated_capacity_weight=%ld\n",
- (h_resource >> 4 * 8) & 0xFF);
-
- seq_printf(m, "capacity_weight=%ld\n",
- (h_resource >> 5 * 8) & 0xFF);
-
- seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
-
- seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
-
- seq_printf(m, "purr=%ld\n", purr);
-
+ seq_printf(m, "purr=%ld\n", get_purr());
} else { /* non SPLPAR case */
seq_printf(m, "system_active_processors=%d\n",
@@ -382,6 +379,41 @@
return 0;
}
+static ssize_t update_ppp(u64 *entitlement, u8 *weight)
+{
+ unsigned long current_entitled;
+ unsigned long dummy;
+ unsigned long resource;
+ u8 current_weight, new_weight;
+ u64 new_entitled;
+ ssize_t retval;
+
+ /* Get our current parameters */
+ retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+ if (retval)
+ return retval;
+
+ current_weight = (resource >> 5 * 8) & 0xFF;
+
+ if (entitlement) {
+ new_weight = current_weight;
+ new_entitled = *entitlement;
+ } else if (weight) {
+ new_weight = *weight;
+ new_entitled = current_entitled;
+ } else
+ return -EINVAL;
+
+ pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+ __FUNCTION__, current_entitled, current_weight);
+
+ pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+ __FUNCTION__, new_entitled, new_weight);
+
+ retval = plpar_hcall_norets(H_SET_PPP, new_entitled, new_weight);
+ return retval;
+}
+
/*
* Interface for changing system parameters (variable capacity weight
* and entitled capacity). Format of input is "param_name=value";
@@ -399,12 +431,6 @@
char *tmp;
u64 new_entitled, *new_entitled_ptr = &new_entitled;
u8 new_weight, *new_weight_ptr = &new_weight;
-
- unsigned long current_entitled; /* parameters for h_get_ppp */
- unsigned long dummy;
- unsigned long resource;
- u8 current_weight;
-
ssize_t retval = -ENOMEM;
if (!firmware_has_feature(FW_FEATURE_SPLPAR) ||
@@ -432,33 +458,17 @@
*new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
if (endp == tmp)
goto out;
- new_weight_ptr = &current_weight;
+
+ retval = update_ppp(new_entitled_ptr, NULL);
} else if (!strcmp(kbuf, "capacity_weight")) {
char *endp;
*new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
if (endp == tmp)
goto out;
- new_entitled_ptr = &current_entitled;
- } else
- goto out;
- /* Get our current parameters */
- retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
- if (retval) {
- retval = -EIO;
+ retval = update_ppp(NULL, new_weight_ptr);
+ } else
goto out;
- }
-
- current_weight = (resource >> 5 * 8) & 0xFF;
-
- pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
- __func__, current_entitled, current_weight);
-
- pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
- __func__, *new_entitled_ptr, *new_weight_ptr);
-
- retval = plpar_hcall_norets(H_SET_PPP, *new_entitled_ptr,
- *new_weight_ptr);
if (retval == H_SUCCESS || retval == H_CONSTRAINED) {
retval = count;
* Re: [PATCH 03/16 v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg
2008-07-04 12:51 ` [PATCH 03/16 v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
@ 2008-07-22 18:55 ` Nathan Fontenot
0 siblings, 0 replies; 38+ messages in thread
From: Nathan Fontenot @ 2008-07-22 18:55 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington
Updated patch to increment the lparcfg module version number.
Update /proc/ppc64/lparcfg to enable displaying of Cooperative Memory
Overcommitment statistics as reported by the H_GET_MPP hcall. This also
updates the lparcfg interface to allow setting memory entitlement and
weight.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 119 ++++++++++++++++++++++++++++++++++++++++++
include/asm-powerpc/hvcall.h | 18 ++++++
2 files changed, 136 insertions(+), 1 deletion(-)
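
Note for readers: with this patch the memory entitlement values show up in
/proc/ppc64/lparcfg as "name=value" lines. The following is a minimal
userspace sketch of how a tool might pick one of them out of text it has
already read from that file. lparcfg_lookup() is a hypothetical helper and
not part of this patch.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/*
 * Find "key=value" in a buffer of newline-separated lines and parse the
 * value as a decimal unsigned long. Trailing text after the number (such
 * as " bytes") is ignored. Returns 0 on success, or -1 if the key is
 * missing or its value does not start with a digit.
 */
static int lparcfg_lookup(const char *text, const char *key,
			  unsigned long *value)
{
	size_t keylen;
	const char *line;

	assert(text != NULL && key != NULL && value != NULL);

	keylen = strlen(key);
	line = text;

	while (*line) {
		const char *eol = strchr(line, '\n');

		/* Require '=' right after the key so prefixes do not match. */
		if (strncmp(line, key, keylen) == 0 && line[keylen] == '=') {
			const char *num = line + keylen + 1;

			if (!isdigit((unsigned char)*num))
				return -1;
			*value = strtoul(num, NULL, 10);
			return 0;
		}
		if (!eol)
			break;
		line = eol + 1;
	}
	return -1;
}
```

The explicit '=' check keeps a lookup of entitled_memory from matching the
entitled_memory_weight line, since both names are printed by
parse_mpp_data().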
Index: linux-2.6.git/arch/powerpc/kernel/lparcfg.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/lparcfg.c 2008-07-22 12:50:47.000000000 -0500
+++ linux-2.6.git/arch/powerpc/kernel/lparcfg.c 2008-07-22 13:16:58.000000000 -0500
@@ -35,7 +35,7 @@
#include <asm/prom.h>
#include <asm/vdso_datapage.h>
-#define MODULE_VERS "1.7"
+#define MODULE_VERS "1.8"
#define MODULE_NAME "lparcfg"
/* #define LPARCFG_DEBUG */
@@ -129,6 +129,35 @@
/*
* Methods used to fetch LPAR data when running on a pSeries platform.
*/
+/**
+ * h_get_mpp
+ * H_GET_MPP hcall returns info in 7 parms
+ */
+int h_get_mpp(struct hvcall_mpp_data *mpp_data)
+{
+ int rc;
+ unsigned long retbuf[PLPAR_HCALL9_BUFSIZE];
+
+ rc = plpar_hcall9(H_GET_MPP, retbuf);
+
+ mpp_data->entitled_mem = retbuf[0];
+ mpp_data->mapped_mem = retbuf[1];
+
+ mpp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+ mpp_data->pool_num = retbuf[2] & 0xffff;
+
+ mpp_data->mem_weight = (retbuf[3] >> 7 * 8) & 0xff;
+ mpp_data->unallocated_mem_weight = (retbuf[3] >> 6 * 8) & 0xff;
+ mpp_data->unallocated_entitlement = retbuf[3] & 0xffffffffffff;
+
+ mpp_data->pool_size = retbuf[4];
+ mpp_data->loan_request = retbuf[5];
+ mpp_data->backing_mem = retbuf[6];
+
+ return rc;
+}
+EXPORT_SYMBOL(h_get_mpp);
+
/*
* H_GET_PPP hcall returns info in 4 parms.
* entitled_capacity,unallocated_capacity,
@@ -224,6 +253,44 @@
seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
}
+/**
+ * parse_mpp_data
+ * Parse out data returned from h_get_mpp
+ */
+static void parse_mpp_data(struct seq_file *m)
+{
+ struct hvcall_mpp_data mpp_data;
+ int rc;
+
+ rc = h_get_mpp(&mpp_data);
+ if (rc)
+ return;
+
+ seq_printf(m, "entitled_memory=%ld\n", mpp_data.entitled_mem);
+
+ if (mpp_data.mapped_mem != -1)
+ seq_printf(m, "mapped_entitled_memory=%ld\n",
+ mpp_data.mapped_mem);
+
+ seq_printf(m, "entitled_memory_group_number=%d\n", mpp_data.group_num);
+ seq_printf(m, "entitled_memory_pool_number=%d\n", mpp_data.pool_num);
+
+ seq_printf(m, "entitled_memory_weight=%d\n", mpp_data.mem_weight);
+ seq_printf(m, "unallocated_entitled_memory_weight=%d\n",
+ mpp_data.unallocated_mem_weight);
+ seq_printf(m, "unallocated_io_mapping_entitlement=%ld\n",
+ mpp_data.unallocated_entitlement);
+
+ if (mpp_data.pool_size != -1)
+ seq_printf(m, "entitled_memory_pool_size=%ld bytes\n",
+ mpp_data.pool_size);
+
+ seq_printf(m, "entitled_memory_loan_request=%ld\n",
+ mpp_data.loan_request);
+
+ seq_printf(m, "backing_memory=%ld bytes\n", mpp_data.backing_mem);
+}
+
#define SPLPAR_CHARACTERISTICS_TOKEN 20
#define SPLPAR_MAXLENGTH 1026*(sizeof(char))
@@ -351,6 +418,7 @@
/* this call handles the ibm,get-system-parameter contents */
parse_system_parameter_string(m);
parse_ppp_data(m);
+ parse_mpp_data(m);
seq_printf(m, "purr=%ld\n", get_purr());
} else { /* non SPLPAR case */
@@ -414,6 +482,43 @@
return retval;
}
+/**
+ * update_mpp
+ *
+ * Update the memory entitlement and weight for the partition. Caller must
+ * specify either a new entitlement or weight, not both, to be updated
+ * since the h_set_mpp call takes both entitlement and weight as parameters.
+ */
+static ssize_t update_mpp(u64 *entitlement, u8 *weight)
+{
+ struct hvcall_mpp_data mpp_data;
+ u64 new_entitled;
+ u8 new_weight;
+ ssize_t rc;
+
+ rc = h_get_mpp(&mpp_data);
+ if (rc)
+ return rc;
+
+ if (entitlement) {
+ new_weight = mpp_data.mem_weight;
+ new_entitled = *entitlement;
+ } else if (weight) {
+ new_weight = *weight;
+ new_entitled = mpp_data.entitled_mem;
+ } else
+ return -EINVAL;
+
+ pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
+ __FUNCTION__, mpp_data.entitled_mem, mpp_data.mem_weight);
+
+ pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
+ __FUNCTION__, new_entitled, new_weight);
+
+ rc = plpar_hcall_norets(H_SET_MPP, new_entitled, new_weight);
+ return rc;
+}
+
/*
* Interface for changing system parameters (variable capacity weight
* and entitled capacity). Format of input is "param_name=value";
@@ -467,6 +572,20 @@
goto out;
retval = update_ppp(NULL, new_weight_ptr);
+ } else if (!strcmp(kbuf, "entitled_memory")) {
+ char *endp;
+ *new_entitled_ptr = (u64) simple_strtoul(tmp, &endp, 10);
+ if (endp == tmp)
+ goto out;
+
+ retval = update_mpp(new_entitled_ptr, NULL);
+ } else if (!strcmp(kbuf, "entitled_memory_weight")) {
+ char *endp;
+ *new_weight_ptr = (u8) simple_strtoul(tmp, &endp, 10);
+ if (endp == tmp)
+ goto out;
+
+ retval = update_mpp(NULL, new_weight_ptr);
} else
goto out;
Index: linux-2.6.git/include/asm-powerpc/hvcall.h
===================================================================
--- linux-2.6.git.orig/include/asm-powerpc/hvcall.h 2008-07-22 12:48:55.000000000 -0500
+++ linux-2.6.git/include/asm-powerpc/hvcall.h 2008-07-22 13:16:36.000000000 -0500
@@ -210,7 +210,9 @@
#define H_JOIN 0x298
#define H_VASI_STATE 0x2A4
#define H_ENABLE_CRQ 0x2B0
-#define MAX_HCALL_OPCODE H_ENABLE_CRQ
+#define H_SET_MPP 0x2D0
+#define H_GET_MPP 0x2D4
+#define MAX_HCALL_OPCODE H_GET_MPP
#ifndef __ASSEMBLY__
@@ -270,6 +272,20 @@
};
#define HCALL_STAT_ARRAY_SIZE ((MAX_HCALL_OPCODE >> 2) + 1)
+struct hvcall_mpp_data {
+ unsigned long entitled_mem;
+ unsigned long mapped_mem;
+ unsigned short group_num;
+ unsigned short pool_num;
+ unsigned char mem_weight;
+ unsigned char unallocated_mem_weight;
+ unsigned long unallocated_entitlement; /* value in bytes */
+ unsigned long pool_size;
+ signed long loan_request;
+ unsigned long backing_mem;
+};
+
+int h_get_mpp(struct hvcall_mpp_data *);
#endif /* __ASSEMBLY__ */
#endif /* __KERNEL__ */
#endif /* _ASM_POWERPC_HVCALL_H */
* Re: [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine
2008-07-04 12:52 ` [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
2008-07-22 5:54 ` Paul Mackerras
@ 2008-07-22 18:56 ` Nathan Fontenot
1 sibling, 0 replies; 38+ messages in thread
From: Nathan Fontenot @ 2008-07-22 18:56 UTC (permalink / raw)
To: Robert Jennings; +Cc: Brian King, linuxppc-dev, paulus, David Darrington
Updated patch to correct the reporting of pool_capacity.
Split the retrieval of processor entitlement data returned in the H_GET_PPP
hcall into its own helper routine.
Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
Signed-off-by: Robert Jennings <rcj@linux.vnet.ibm.com>
---
arch/powerpc/kernel/lparcfg.c | 80 ++++++++++++++++++++++++------------------
1 file changed, 45 insertions(+), 35 deletions(-)
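
Note for readers: the new h_get_ppp() unpacks the aggregation and resource
words returned by H_GET_PPP into named fields. The sketch below repeats the
same shifts and masks in plain userspace C so the layout can be checked
without an LPAR. struct ppp_fields and unpack_ppp() are hypothetical names
that only mirror the fields used in the diff.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Userspace mirror of the packed fields stored in hvcall_ppp_data. */
struct ppp_fields {
	uint16_t group_num;
	uint16_t pool_num;
	uint8_t capped;
	uint8_t weight;
	uint8_t unallocated_weight;
	uint16_t active_procs_in_pool;
	uint16_t active_system_procs;
};

/*
 * Split the two packed 64-bit words using the same shifts and masks as
 * the patch: "aggregation" corresponds to retbuf[2] and "resource" to
 * retbuf[3].
 */
static void unpack_ppp(uint64_t aggregation, uint64_t resource,
		       struct ppp_fields *out)
{
	assert(out != NULL);

	out->group_num = (aggregation >> 2 * 8) & 0xffff;
	out->pool_num = aggregation & 0xffff;

	out->capped = (resource >> 6 * 8) & 0x01;
	out->weight = (resource >> 5 * 8) & 0xff;
	out->unallocated_weight = (resource >> 4 * 8) & 0xff;
	out->active_procs_in_pool = (resource >> 2 * 8) & 0xffff;
	out->active_system_procs = resource & 0xffff;
}
```

Decoding once into a struct lets parse_ppp_data() and update_ppp() share the
same field definitions instead of repeating the bit arithmetic.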
Index: linux-2.6.git/arch/powerpc/kernel/lparcfg.c
===================================================================
--- linux-2.6.git.orig/arch/powerpc/kernel/lparcfg.c 2008-07-22 13:16:58.000000000 -0500
+++ linux-2.6.git/arch/powerpc/kernel/lparcfg.c 2008-07-22 13:17:25.000000000 -0500
@@ -158,6 +158,18 @@
}
EXPORT_SYMBOL(h_get_mpp);
+struct hvcall_ppp_data {
+ u64 entitlement;
+ u64 unallocated_entitlement;
+ u16 group_num;
+ u16 pool_num;
+ u8 capped;
+ u8 weight;
+ u8 unallocated_weight;
+ u16 active_procs_in_pool;
+ u16 active_system_procs;
+};
+
/*
* H_GET_PPP hcall returns info in 4 parms.
* entitled_capacity,unallocated_capacity,
@@ -178,20 +190,24 @@
* XXXX - Active processors in Physical Processor Pool.
* XXXX - Processors active on platform.
*/
-static unsigned int h_get_ppp(unsigned long *entitled,
- unsigned long *unallocated,
- unsigned long *aggregation,
- unsigned long *resource)
+static unsigned int h_get_ppp(struct hvcall_ppp_data *ppp_data)
{
unsigned long rc;
unsigned long retbuf[PLPAR_HCALL_BUFSIZE];
rc = plpar_hcall(H_GET_PPP, retbuf);
- *entitled = retbuf[0];
- *unallocated = retbuf[1];
- *aggregation = retbuf[2];
- *resource = retbuf[3];
+ ppp_data->entitlement = retbuf[0];
+ ppp_data->unallocated_entitlement = retbuf[1];
+
+ ppp_data->group_num = (retbuf[2] >> 2 * 8) & 0xffff;
+ ppp_data->pool_num = retbuf[2] & 0xffff;
+
+ ppp_data->capped = (retbuf[3] >> 6 * 8) & 0x01;
+ ppp_data->weight = (retbuf[3] >> 5 * 8) & 0xff;
+ ppp_data->unallocated_weight = (retbuf[3] >> 4 * 8) & 0xff;
+ ppp_data->active_procs_in_pool = (retbuf[3] >> 2 * 8) & 0xffff;
+ ppp_data->active_system_procs = retbuf[3] & 0xffff;
return rc;
}
@@ -216,41 +232,40 @@
*/
static void parse_ppp_data(struct seq_file *m)
{
- unsigned long h_entitled, h_unallocated;
- unsigned long h_aggregation, h_resource;
+ struct hvcall_ppp_data ppp_data;
int rc;
- rc = h_get_ppp(&h_entitled, &h_unallocated, &h_aggregation,
- &h_resource);
+ rc = h_get_ppp(&ppp_data);
if (rc)
return;
- seq_printf(m, "partition_entitled_capacity=%ld\n", h_entitled);
- seq_printf(m, "group=%ld\n", (h_aggregation >> 2 * 8) & 0xffff);
- seq_printf(m, "system_active_processors=%ld\n",
- (h_resource >> 0 * 8) & 0xffff);
+ seq_printf(m, "partition_entitled_capacity=%ld\n",
+ ppp_data.entitlement);
+ seq_printf(m, "group=%d\n", ppp_data.group_num);
+ seq_printf(m, "system_active_processors=%d\n",
+ ppp_data.active_system_procs);
/* pool related entries are apropriate for shared configs */
if (lppaca[0].shared_proc) {
unsigned long pool_idle_time, pool_procs;
- seq_printf(m, "pool=%ld\n", (h_aggregation >> 0 * 8) & 0xffff);
+ seq_printf(m, "pool=%d\n", ppp_data.pool_num);
/* report pool_capacity in percentage */
- seq_printf(m, "pool_capacity=%ld\n",
- ((h_resource >> 2 * 8) & 0xffff) * 100);
+ seq_printf(m, "pool_capacity=%d\n",
+ ppp_data.active_procs_in_pool * 100);
h_pic(&pool_idle_time, &pool_procs);
seq_printf(m, "pool_idle_time=%ld\n", pool_idle_time);
seq_printf(m, "pool_num_procs=%ld\n", pool_procs);
}
- seq_printf(m, "unallocated_capacity_weight=%ld\n",
- (h_resource >> 4 * 8) & 0xFF);
-
- seq_printf(m, "capacity_weight=%ld\n", (h_resource >> 5 * 8) & 0xFF);
- seq_printf(m, "capped=%ld\n", (h_resource >> 6 * 8) & 0x01);
- seq_printf(m, "unallocated_capacity=%ld\n", h_unallocated);
+ seq_printf(m, "unallocated_capacity_weight=%d\n",
+ ppp_data.unallocated_weight);
+ seq_printf(m, "capacity_weight=%d\n", ppp_data.weight);
+ seq_printf(m, "capped=%d\n", ppp_data.capped);
+ seq_printf(m, "unallocated_capacity=%ld\n",
+ ppp_data.unallocated_entitlement);
}
/**
@@ -449,31 +464,27 @@
static ssize_t update_ppp(u64 *entitlement, u8 *weight)
{
- unsigned long current_entitled;
- unsigned long dummy;
- unsigned long resource;
- u8 current_weight, new_weight;
+ struct hvcall_ppp_data ppp_data;
+ u8 new_weight;
u64 new_entitled;
ssize_t retval;
/* Get our current parameters */
- retval = h_get_ppp(&current_entitled, &dummy, &dummy, &resource);
+ retval = h_get_ppp(&ppp_data);
if (retval)
return retval;
- current_weight = (resource >> 5 * 8) & 0xFF;
-
if (entitlement) {
- new_weight = current_weight;
+ new_weight = ppp_data.weight;
new_entitled = *entitlement;
} else if (weight) {
new_weight = *weight;
- new_entitled = current_entitled;
+ new_entitled = ppp_data.entitlement;
} else
return -EINVAL;
pr_debug("%s: current_entitled = %lu, current_weight = %u\n",
- __FUNCTION__, current_entitled, current_weight);
+ __FUNCTION__, ppp_data.entitlement, ppp_data.weight);
pr_debug("%s: new_entitled = %lu, new_weight = %u\n",
__FUNCTION__, new_entitled, new_weight);
Thread overview: 38+ messages
2008-07-04 12:44 [PATCH 00/16 v3] powerpc: pSeries Cooperative Memory Overcommitment support Robert Jennings
2008-07-04 12:51 ` [PATCH 01/16 v3] powerpc: Remove extraneous error reporting for hcall failures in lparcfg Robert Jennings
2008-07-22 3:34 ` Paul Mackerras
2008-07-04 12:51 ` [PATCH 02/16 v3] powerpc: Split processor entitlement retrieval and gathering to helper routines Robert Jennings
2008-07-22 18:53 ` Nathan Fontenot
2008-07-04 12:51 ` [PATCH 03/16 v3] powerpc: Add memory entitlement capabilities to /proc/ppc64/lparcfg Robert Jennings
2008-07-22 18:55 ` Nathan Fontenot
2008-07-04 12:52 ` [PATCH 04/16 v3] powerpc: Split retrieval of processor entitlement data into a helper routine Robert Jennings
2008-07-22 5:54 ` Paul Mackerras
2008-07-22 18:49 ` Nathan Fontenot
2008-07-22 18:56 ` Nathan Fontenot
2008-07-04 12:52 ` [PATCH 05/16 v3] powerpc: Enable CMO feature during platform setup Robert Jennings
2008-07-04 12:52 ` Robert Jennings
2008-07-04 12:52 ` [PATCH 06/16 v3] powerpc: Utilities to set firmware page state Robert Jennings
2008-07-04 12:53 ` Robert Jennings
2008-07-04 12:53 ` [PATCH 07/16 v3] powerpc: Add collaborative memory manager Robert Jennings
2008-07-22 4:53 ` Paul Mackerras
2008-07-04 12:54 ` [PATCH 08/16 v3] powerpc: Do not probe PCI buses or eBus devices if CMO is enabled Robert Jennings
2008-07-14 21:35 ` Brian King
2008-07-04 12:54 ` [PATCH 09/16 v3] powerpc: Add CMO paging statistics Robert Jennings
2008-07-04 12:54 ` [PATCH 10/16 v3] powerpc: iommu enablement for CMO Robert Jennings
2008-07-05 17:51 ` Olof Johansson
2008-07-08 20:48 ` [PATCH 10/16 v3] [v2] " Robert Jennings
2008-07-22 5:04 ` Paul Mackerras
2008-07-22 13:30 ` Robert Jennings
2008-07-22 4:57 ` [PATCH 10/16 v3] " Paul Mackerras
2008-07-22 13:28 ` Robert Jennings
2008-07-04 12:55 ` [PATCH 11/16 v3] powerpc: vio bus support " Robert Jennings
2008-07-04 12:55 ` [PATCH 12/16 v3] powerpc: Verify CMO memory entitlement updates with virtual I/O Robert Jennings
2008-07-04 12:55 ` [PATCH 13/16 v3] ibmveth: Automatically enable larger rx buffer pools for larger mtu Robert Jennings
2008-07-04 12:56 ` [PATCH 14/16 v3] ibmveth: enable driver for CMO Robert Jennings
2008-07-08 20:38 ` [PATCH 14/16 v3] [v2] " Robert Jennings
2008-07-04 12:56 ` [PATCH 15/16 v3] ibmvscsi: driver enablement " Robert Jennings
2008-07-07 14:34 ` Brian King
2008-07-08 17:41 ` Robert Jennings
2008-07-08 20:35 ` [PATCH 15/16 v3] [v2] " Robert Jennings
2008-07-10 13:43 ` Brian King
2008-07-04 12:57 ` [PATCH 16/16 v3] powerpc: Update arch vector to indicate support " Robert Jennings