LinuxPPC-Dev Archive on lore.kernel.org

LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Tang Chen @ 2013-01-31  7:10 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359613162.1587.0.camel@kernel>

On 01/31/2013 02:19 PM, Simon Jeons wrote:
> Hi Tang,
> On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
>> Hi Simon,
>>
>> Please see below. :)
>>
>> On 01/31/2013 09:22 AM, Simon Jeons wrote:
>>>
>>> Sorry, I still confuse. :(
>>> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
>>> node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
>>>
>>> node_states is what? node_states[N_NORMAL_MEMOR] or
>>> node_states[N_MEMORY]?
>>
>> Are you asking what node_states[] is ?
>>
>> node_states[] is an array of nodemask,
>>
>>       extern nodemask_t node_states[NR_NODE_STATES];
>>
>> For example, node_states[N_NORMAL_MEMOR] represents which nodes have
>> normal memory.
>> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
>> node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ...
>> ZONE_MOVABLE.
>>
>
> Sorry, how can nodes_state[N_NORMAL_MEMORY] represents a node have 0 ...
> *ZONE_MOVABLE*, the comment of enum nodes_states said that
> N_NORMAL_MEMORY just means the node has regular memory.
>

Hi Simon,

Let's say it in this way.

If we don't have CONFIG_HIGHMEM, N_HIGH_MEMORY == N_NORMAL_MEMORY. We 
don't have a separate
macro to represent highmem because we don't have highmem.
This is easy to understand, right ?

Now, think it just like above:
If we don't have CONFIG_MOVABLE_NODE, N_MEMORY == N_HIGH_MEMORY == 
N_NORMAL_MEMORY.
This means we don't allow a node to have only movable memory, not we 
don't have movable memory.
A node could have normal memory and movable memory. So 
nodes_state[N_NORMAL_MEMORY] represents
a node have 0 ... *ZONE_MOVABLE*.

I think the point is: CONFIG_MOVABLE_NODE means we allow a node to have 
only movable memory.
So without CONFIG_MOVABLE_NODE, it doesn't mean a node cannot have 
movable memory. It means
the node cannot have only movable memory. It can have normal memory and 
movable memory.

1) With CONFIG_MOVABLE_NODE:
    N_NORMAL_MEMORY: nodes who have normal memory.
                     normal memory only
                     normal and highmem
                     normal and highmem and movablemem
                     normal and movablemem
    N_MEMORY: nodes who has memory (any memory)
                     normal memory only
                     normal and highmem
                     normal and highmem and movablemem
                     normal and movablemem ---------------- We can have 
movablemem.
                     highmem only -------------------------
                     highmem and movablemem ---------------
                     movablemem only ---------------------- We can have 
movablemem only.    ***

2) With out CONFIG_MOVABLE_NODE:
    N_MEMORY == N_NORMAL_MEMORY: (Here, I omit N_HIGH_MEMORY)
                     normal memory only
                     normal and highmem
                     normal and highmem and movablemem
                     normal and movablemem ---------------- We can have 
movablemem.
                     No movablemem only ------------------- We cannot 
have movablemem only. ***

The semantics is not that clear here. So we can only try to understand 
it from the code where
we use N_MEMORY. :)

That is my understanding of this.

Thanks. :)

^ permalink raw reply

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Simon Jeons @ 2013-01-31  6:19 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <5109E59F.5080104@cn.fujitsu.com>

Hi Tang,
On Thu, 2013-01-31 at 11:31 +0800, Tang Chen wrote:
> Hi Simon,
> 
> Please see below. :)
> 
> On 01/31/2013 09:22 AM, Simon Jeons wrote:
> >
> > Sorry, I still confuse. :(
> > update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> > node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
> >
> > node_states is what? node_states[N_NORMAL_MEMOR] or
> > node_states[N_MEMORY]?
> 
> Are you asking what node_states[] is ?
> 
> node_states[] is an array of nodemask,
> 
>      extern nodemask_t node_states[NR_NODE_STATES];
> 
> For example, node_states[N_NORMAL_MEMOR] represents which nodes have 
> normal memory.
> If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
> node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ... 
> ZONE_MOVABLE.
> 

Sorry, how can nodes_state[N_NORMAL_MEMORY] represents a node have 0 ...
*ZONE_MOVABLE*, the comment of enum nodes_states said that
N_NORMAL_MEMORY just means the node has regular memory.  

> 
> > Why check !z1->wait_table in function move_pfn_range_left and function
> > __add_zone? I think zone->wait_table is initialized in
> > free_area_init_core, which will be called during system initialization
> > and hotadd_new_pgdat path.
> 
> I think,
> 
> free_area_init_core(), in the for loop,
>   |--> size = zone_spanned_pages_in_node();
>   |--> if (!size)
>                continue;  ----------------  If zone is empty, we jump 
> out the for loop.
>   |--> init_currently_empty_zone()
> 
> So, if the zone is empty, wait_table is not initialized.
> 
> In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1 
> could be empty.
> So we need to check it and initialize z1->wait_table because we are 
> moving pages into it.

thanks.

> 
> 
> > There is a zone populated check in function online_pages. But zone is
> > populated in free_area_init_core which will be called during system
> > initialization and hotadd_new_pgdat path. Why still need this check?
> >
> 
> Because we could also rebuild zone list when we offline pages.
> 
> __offline_pages()
>   |--> zone->present_pages -= offlined_pages;
>   |--> if (!populated_zone(zone)) {
>                build_all_zonelists(NULL, NULL);
>        }
> 
> If the zone is empty, and other zones on the same node is not empty, the 
> node
> won't be offlined, and next time we online pages of this zone, the pgdat 
> won't
> be initialized again, and we need to check populated_zone(zone) when 
> onlining
> pages.

thanks.

> 
> Thanks. :)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Greg KH @ 2013-01-31  5:24 UTC (permalink / raw)
  To: Toshi Kani
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	Rafael J. Wysocki, linux-acpi, isimatu.yasuaki, srivatsa.bhat,
	guohanjun, bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <1359594912.15120.85.camel@misato.fc.hp.com>

On Wed, Jan 30, 2013 at 06:15:12PM -0700, Toshi Kani wrote:
> > Please make it a "real" pointer, and not a void *, those shouldn't be
> > used at all if possible.
> 
> How about changing the "void *handle" to acpi_dev_node below?   
> 
>    struct acpi_dev_node    acpi_node;
> 
> Basically, it has the same challenge as struct device, which uses
> acpi_dev_node as well.  We can add other FW node when needed (just like
> device also has *of_node).

That sounds good to me.

^ permalink raw reply

* [PATCH][UPSTEAM] powerpc/mpic: add irq_set_wake support
From: Wang Dongsheng @ 2013-01-31  3:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: Wang Dongsheng

Add irq_set_wake support. Just add IRQF_NO_SUSPEND to desc->action->flag.
So the wake up interrupt will not be disable in suspend_device_irqs.

Signed-off-by: Wang Dongsheng <dongsheng.wang@freescale.com>
---
 arch/powerpc/sysdev/mpic.c |   15 +++++++++++++++
 1 files changed, 15 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/sysdev/mpic.c b/arch/powerpc/sysdev/mpic.c
index 9c6e535..2ed0220 100644
--- a/arch/powerpc/sysdev/mpic.c
+++ b/arch/powerpc/sysdev/mpic.c
@@ -920,6 +920,18 @@ int mpic_set_irq_type(struct irq_data *d, unsigned int flow_type)
 	return IRQ_SET_MASK_OK_NOCOPY;
 }

+static int mpic_irq_set_wake(struct irq_data *d, unsigned int on)
+{
+	struct irq_desc *desc = container_of(d, struct irq_desc, irq_data);
+
+	if (on)
+		desc->action->flags |= IRQF_NO_SUSPEND;
+	else
+		desc->action->flags &= ~IRQF_NO_SUSPEND;
+
+	return 0;
+}
+
 void mpic_set_vector(unsigned int virq, unsigned int vector)
 {
 	struct mpic *mpic = mpic_from_irq(virq);
@@ -957,6 +969,7 @@ static struct irq_chip mpic_irq_chip = {
 	.irq_unmask	= mpic_unmask_irq,
 	.irq_eoi	= mpic_end_irq,
 	.irq_set_type	= mpic_set_irq_type,
+	.irq_set_wake	= mpic_irq_set_wake,
 };

 #ifdef CONFIG_SMP
@@ -971,6 +984,7 @@ static struct irq_chip mpic_tm_chip = {
 	.irq_mask	= mpic_mask_tm,
 	.irq_unmask	= mpic_unmask_tm,
 	.irq_eoi	= mpic_end_irq,
+	.irq_set_wake	= mpic_irq_set_wake,
 };

 #ifdef CONFIG_MPIC_U3_HT_IRQS
@@ -981,6 +995,7 @@ static struct irq_chip mpic_irq_ht_chip = {
 	.irq_unmask	= mpic_unmask_ht_irq,
 	.irq_eoi	= mpic_end_ht_irq,
 	.irq_set_type	= mpic_set_irq_type,
+	.irq_set_wake	= mpic_irq_set_wake,
 };
 #endif /* CONFIG_MPIC_U3_HT_IRQS */

--
1.7.5.1

^ permalink raw reply related

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Tang Chen @ 2013-01-31  3:31 UTC (permalink / raw)
  To: Simon Jeons
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1359595344.1557.13.camel@kernel>

Hi Simon,

Please see below. :)

On 01/31/2013 09:22 AM, Simon Jeons wrote:
>
> Sorry, I still confuse. :(
> update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
> node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?
>
> node_states is what? node_states[N_NORMAL_MEMOR] or
> node_states[N_MEMORY]?

Are you asking what node_states[] is ?

node_states[] is an array of nodemask,

     extern nodemask_t node_states[NR_NODE_STATES];

For example, node_states[N_NORMAL_MEMOR] represents which nodes have 
normal memory.
If N_MEMORY == N_HIGH_MEMORY == N_NORMAL_MEMORY, node_states[N_MEMORY] is
node_states[N_NORMAL_MEMOR]. So it represents which nodes have 0 ... 
ZONE_MOVABLE.

> Why check !z1->wait_table in function move_pfn_range_left and function
> __add_zone? I think zone->wait_table is initialized in
> free_area_init_core, which will be called during system initialization
> and hotadd_new_pgdat path.

I think,

free_area_init_core(), in the for loop,
  |--> size = zone_spanned_pages_in_node();
  |--> if (!size)
               continue;  ----------------  If zone is empty, we jump 
out the for loop.
  |--> init_currently_empty_zone()

So, if the zone is empty, wait_table is not initialized.

In move_pfn_range_left(z1, z2), we move pages from z2 to z1. But z1 
could be empty.
So we need to check it and initialize z1->wait_table because we are 
moving pages into it.

> There is a zone populated check in function online_pages. But zone is
> populated in free_area_init_core which will be called during system
> initialization and hotadd_new_pgdat path. Why still need this check?
>

Because we could also rebuild zone list when we offline pages.

__offline_pages()
  |--> zone->present_pages -= offlined_pages;
  |--> if (!populated_zone(zone)) {
               build_all_zonelists(NULL, NULL);
       }

If the zone is empty, and other zones on the same node is not empty, the 
node
won't be offlined, and next time we online pages of this zone, the pgdat 
won't
be initialized again, and we need to check populated_zone(zone) when 
onlining
pages.

Thanks. :)

^ permalink raw reply

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Toshi Kani @ 2013-01-31  2:57 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	rjw, linux-acpi, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130130045830.GH30002@kroah.com>

On Tue, 2013-01-29 at 23:58 -0500, Greg KH wrote:
> On Thu, Jan 10, 2013 at 04:40:19PM -0700, Toshi Kani wrote:
> > +/*
> > + * Hot-plug device information
> > + */
> 
> Again, stop it with the "generic" hotplug term here, and everywhere
> else.  You are doing a very _specific_ type of hotplug devices, so spell
> it out.  We've worked hard to hotplug _everything_ in Linux, you are
> going to confuse a lot of people with this type of terms.

Agreed.  I will clarify in all places.

> > +union shp_dev_info {
> > +	struct shp_cpu {
> > +		u32		cpu_id;
> > +	} cpu;
> 
> What is this?  Why not point to the system device for the cpu?

This info is used to on-line a new CPU and create its system/cpu device.
In other word, a system/cpu device is created as a result of CPU
hotplug.

> > +	struct shp_memory {
> > +		int		node;
> > +		u64		start_addr;
> > +		u64		length;
> > +	} mem;
> 
> Same here, why not point to the system device?

Same as above.

> > +	struct shp_hostbridge {
> > +	} hb;
> > +
> > +	struct shp_node {
> > +	} node;
> 
> What happened here with these?  Empty structures?  Huh?

They are place holders for now.  PCI bridge hot-plug and node hot-plug
are still very much work in progress, so I have not integrated them into
this framework yet.

> > +};
> > +
> > +struct shp_device {
> > +	struct list_head	list;
> > +	struct device		*device;
> 
> No, make it a "real" device, embed the device into it.

This device pointer is used to send KOBJ_ONLINE/OFFLINE event during CPU
online/offline operation in order to maintain the current behavior.  CPU
online/offline operation only changes the state of CPU, so its
system/cpu device continues to be present before and after an operation.
(Whereas, CPU hot-add/delete operation creates or removes a system/cpu
device.)  So, this "*device" needs to be a pointer to reference an
existing device that is to be on-lined/off-lined.

> But, again, I'm going to ask why you aren't using the existing cpu /
> memory / bridge / node devices that we have in the kernel.  Please use
> them, or give me a _really_ good reason why they will not work.

We cannot use the existing system devices or ACPI devices here.  During
hot-plug, ACPI handler sets this shp_device info, so that cpu and memory
handlers (drivers/cpu.c and mm/memory_hotplug.c) can obtain their target
device information in a platform-neutral way.  During hot-add, we first
creates an ACPI device node (i.e. device under /sys/bus/acpi/devices),
but platform-neutral modules cannot use them as they are ACPI-specific.
Also, its system device (i.e. device under /sys/devices/system) has not
been created until the hot-add operation completes.

> > +	enum shp_class		class;
> > +	union shp_dev_info	info;
> > +};
> > +
> > +/*
> > + * Hot-plug request
> > + */
> > +struct shp_request {
> > +	/* common info */
> > +	enum shp_operation	operation;	/* operation */
> > +
> > +	/* hot-plug event info: only valid for hot-plug operations */
> > +	void			*handle;	/* FW handle */
> > +	u32			event;		/* FW event */
> 
> What is this?

The shp_request describes a hotplug or online/offline operation that is
requested.  In case of hot-plug request, the "*handle" describes a
target device (which is an ACPI device object) and the "event" describes
a type of request, such as hot-add or hot-delete.

Thanks,
-Toshi

^ permalink raw reply

* Re: [PATCH 0/3] Enable multiple MSI feature in pSeries
From: Mike @ 2013-01-31  2:10 UTC (permalink / raw)
  To: linuxppc-dev; +Cc: tglx, linux-kernel
In-Reply-To: <1358235536-32741-1-git-send-email-qiudayu@linux.vnet.ibm.com>

Hi all

Any comments about my patchset?

Thanks 

Mike
在 2013-01-15二的 15:38 +0800，Mike Qiu写道：
> Currently, multiple MSI feature hasn't been enabled in pSeries,
> These patches try to enbale this feature.
> 
> These patches have been tested by using ipr driver, and the driver patch
> has been made by Wen Xiong <wenxiong@linux.vnet.ibm.com>:
> 
> [PATCH 0/7] Add support for new IBM SAS controllers
> 
> Test platform: One partition of pSeries with one cpu core(4 SMTs) and 
>                RAID bus controller: IBM PCI-E IPR SAS Adapter (ASIC) in POWER7
> OS version: SUSE Linux Enterprise Server 11 SP2  (ppc64) with 3.8-rc3 kernel 
> 
> IRQ 21 and 22 are assigned to the ipr device which support 2 mutiple MSI.
> 
> The test results is shown by 'cat /proc/interrups':
>           CPU0       CPU1       CPU2       CPU3       
> 16:     240458     261601     226310     200425      XICS Level     IPI
> 17:          0          0          0          0      XICS Level     RAS_EPOW
> 18:         10          0          3          2      XICS Level     hvc_console
> 19:     122182      28481      28527      28864      XICS Level     ibmvscsi
> 20:        506    7388226        108        118      XICS Level     eth0
> 21:          6          5          5          5      XICS Level     host1-0
> 22:        817        814        816        813      XICS Level     host1-1
> LOC:     398077     316725     231882     203049   Local timer interrupts
> SPU:       1659        919        961        903   Spurious interrupts
> CNT:          0          0          0          0   Performance
> monitoring interrupts
> MCE:          0          0          0          0   Machine check exceptions
> 
> Mike Qiu (3):
>   irq: Set multiple MSI descriptor data for multiple IRQs
>   irq: Add hw continuous IRQs map to virtual continuous IRQs support
>   powerpc/pci: Enable pSeries multiple MSI feature
> 
>  arch/powerpc/kernel/msi.c            |    4 --
>  arch/powerpc/platforms/pseries/msi.c |   62 ++++++++++++++++++++++++++++++++-
>  include/linux/irq.h                  |    4 ++
>  include/linux/irqdomain.h            |    3 ++
>  kernel/irq/chip.c                    |   40 ++++++++++++++++-----
>  kernel/irq/irqdomain.c               |   61 +++++++++++++++++++++++++++++++++
>  6 files changed, 158 insertions(+), 16 deletions(-)
> 

^ permalink raw reply

* Re: [RFC PATCH v2 03/12] drivers/base: Add system device hotplug framework
From: Toshi Kani @ 2013-01-31  1:48 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	rjw, linux-acpi, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130130045437.GG30002@kroah.com>

On Tue, 2013-01-29 at 23:54 -0500, Greg KH wrote:
> On Thu, Jan 10, 2013 at 04:40:21PM -0700, Toshi Kani wrote:
> > Added sys_hotplug.c, which is the system device hotplug framework code.
> > 
> > shp_register_handler() allows modules to register their hotplug handlers
> > to the framework.  shp_submit_req() provides the interface to submit
> > a hotplug or online/offline request of system devices.  The request is
> > then put into hp_workqueue.  shp_start_req() calls all registered handlers
> > in ascending order for each phase.  If any handler failed in validate or
> > execute phase, shp_start_req() initiates its rollback procedure.
> > 
> > Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> > ---
> >  drivers/base/Makefile      |    1 
> >  drivers/base/sys_hotplug.c |  313 ++++++++++++++++++++++++++++++++++++++++++++
> >  2 files changed, 314 insertions(+)
> >  create mode 100644 drivers/base/sys_hotplug.c
> > 
> > diff --git a/drivers/base/Makefile b/drivers/base/Makefile
> > index 5aa2d70..2e9b2f1 100644
> > --- a/drivers/base/Makefile
> > +++ b/drivers/base/Makefile
> > @@ -21,6 +21,7 @@ endif
> >  obj-$(CONFIG_SYS_HYPERVISOR) += hypervisor.o
> >  obj-$(CONFIG_REGMAP)	+= regmap/
> >  obj-$(CONFIG_SOC_BUS) += soc.o
> > +obj-y			+= sys_hotplug.o
> 
> No option to select this for systems that don't need it?  If not, then
> put it up higher with all of the other code for the core.

It used to have CONFIG_HOTPLUG, but I removed it as you suggested.  Yes,
I will put it up higher.  

Thanks,
-Toshi

^ permalink raw reply

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Toshi Kani @ 2013-01-31  1:46 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	rjw, linux-acpi, isimatu.yasuaki, srivatsa.bhat, guohanjun,
	bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130130045330.GF30002@kroah.com>

On Tue, 2013-01-29 at 23:53 -0500, Greg KH wrote:
> On Thu, Jan 10, 2013 at 04:40:19PM -0700, Toshi Kani wrote:
> > Added include/linux/sys_hotplug.h, which defines the system device
> > hotplug framework interfaces used by the framework itself and
> > handlers.
> > 
> > The order values define the calling sequence of handlers.  For add
> > execute, the ordering is ACPI->MEM->CPU.  Memory is onlined before
> > CPU so that threads on new CPUs can start using their local memory.
> > The ordering of the delete execute is symmetric to the add execute.
> > 
> > struct shp_request defines a hot-plug request information.  The
> > device resource information is managed with a list so that a single
> > request may target to multiple devices.
> > 
> > Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> > ---
> >  include/linux/sys_hotplug.h |  181 +++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 181 insertions(+)
> >  create mode 100644 include/linux/sys_hotplug.h
> > 
> > diff --git a/include/linux/sys_hotplug.h b/include/linux/sys_hotplug.h
> > new file mode 100644
> > index 0000000..86674dd
> > --- /dev/null
> > +++ b/include/linux/sys_hotplug.h
> > @@ -0,0 +1,181 @@
> > +/*
> > + * sys_hotplug.h - System device hot-plug framework
> > + *
> > + * Copyright (C) 2012 Hewlett-Packard Development Company, L.P.
> > + *	Toshi Kani <toshi.kani@hp.com>
> > + *
> > + * This program is free software; you can redistribute it and/or modify
> > + * it under the terms of the GNU General Public License version 2 as
> > + * published by the Free Software Foundation.
> > + */
> > +
> > +#ifndef _LINUX_SYS_HOTPLUG_H
> > +#define _LINUX_SYS_HOTPLUG_H
> > +
> > +#include <linux/list.h>
> > +#include <linux/device.h>
> > +
> > +/*
> > + * System device hot-plug operation proceeds in the following order.
> > + *   Validate phase -> Execute phase -> Commit phase
> > + *
> > + * The order values below define the calling sequence of platform
> > + * neutral handlers for each phase in ascending order.  The order
> > + * values of firmware-specific handlers are defined in sys_hotplug.h
> > + * under firmware specific directories.
> > + */
> > +
> > +/* All order values must be smaller than this value */
> > +#define SHP_ORDER_MAX				0xffffff
> > +
> > +/* Add Validate order values */
> > +
> > +/* Add Execute order values */
> > +#define SHP_MEM_ADD_EXECUTE_ORDER		100
> > +#define SHP_CPU_ADD_EXECUTE_ORDER		110
> > +
> > +/* Add Commit order values */
> > +
> > +/* Delete Validate order values */
> > +#define SHP_CPU_DEL_VALIDATE_ORDER		100
> > +#define SHP_MEM_DEL_VALIDATE_ORDER		110
> > +
> > +/* Delete Execute order values */
> > +#define SHP_CPU_DEL_EXECUTE_ORDER		10
> > +#define SHP_MEM_DEL_EXECUTE_ORDER		20
> > +
> > +/* Delete Commit order values */
> > +
> 
> Empty value?

Yes, in this version, all the delete commit order values are defined in
<acpi/sys_hotplug.h>.

> Anyway, as I said before, don't use "values", just call things directly
> in the order you need to.
> 
> This isn't like other operating systems, we don't need to be so
> "flexible", we can modify the core code as much as we want and need to
> if future things come along :)

Understood.  As described in the previous email, I will define them with
enum and avoid using values.

Thanks,
-Toshi

^ permalink raw reply

* Re: [RFC PATCH v2 02/12] ACPI: Add sys_hotplug.h for system device hotplug framework
From: Toshi Kani @ 2013-01-31  1:38 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	Rafael J. Wysocki, linux-acpi, isimatu.yasuaki, srivatsa.bhat,
	guohanjun, bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130130045153.GE30002@kroah.com>

On Tue, 2013-01-29 at 23:51 -0500, Greg KH wrote:
> On Mon, Jan 14, 2013 at 12:21:30PM -0700, Toshi Kani wrote:
> > On Mon, 2013-01-14 at 20:07 +0100, Rafael J. Wysocki wrote:
> > > On Monday, January 14, 2013 11:42:09 AM Toshi Kani wrote:
> > > > On Mon, 2013-01-14 at 19:47 +0100, Rafael J. Wysocki wrote:
> > > > > On Monday, January 14, 2013 08:53:53 AM Toshi Kani wrote:
> > > > > > On Fri, 2013-01-11 at 22:25 +0100, Rafael J. Wysocki wrote:
> > > > > > > On Thursday, January 10, 2013 04:40:20 PM Toshi Kani wrote:
> > > > > > > > Added include/acpi/sys_hotplug.h, which is ACPI-specific system
> > > > > > > > device hotplug header and defines the order values of ACPI-specific
> > > > > > > > handlers.
> > > > > > > > 
> > > > > > > > Signed-off-by: Toshi Kani <toshi.kani@hp.com>
> > > > > > > > ---
> > > > > > > >  include/acpi/sys_hotplug.h |   48 ++++++++++++++++++++++++++++++++++++++++++++
> > > > > > > >  1 file changed, 48 insertions(+)
> > > > > > > >  create mode 100644 include/acpi/sys_hotplug.h
> > > > > > > > 
> > > > > > > > diff --git a/include/acpi/sys_hotplug.h b/include/acpi/sys_hotplug.h
> > > > > > > > new file mode 100644
> > > > > > > > index 0000000..ad80f61
> > > > > > > > --- /dev/null
> > > > > > > > +++ b/include/acpi/sys_hotplug.h
> > > > > > > > @@ -0,0 +1,48 @@
> > > > > > > > +/*
> > > > > > > > + * sys_hotplug.h - ACPI System device hot-plug framework
> > > > > > > > + *
> > > > > > > > + * Copyright (C) 2012 Hewlett-Packard Development Company, L.P.
> > > > > > > > + *	Toshi Kani <toshi.kani@hp.com>
> > > > > > > > + *
> > > > > > > > + * This program is free software; you can redistribute it and/or modify
> > > > > > > > + * it under the terms of the GNU General Public License version 2 as
> > > > > > > > + * published by the Free Software Foundation.
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +#ifndef _ACPI_SYS_HOTPLUG_H
> > > > > > > > +#define _ACPI_SYS_HOTPLUG_H
> > > > > > > > +
> > > > > > > > +#include <linux/list.h>
> > > > > > > > +#include <linux/device.h>
> > > > > > > > +#include <linux/sys_hotplug.h>
> > > > > > > > +
> > > > > > > > +/*
> > > > > > > > + * System device hot-plug operation proceeds in the following order.
> > > > > > > > + *   Validate phase -> Execute phase -> Commit phase
> > > > > > > > + *
> > > > > > > > + * The order values below define the calling sequence of ACPI-specific
> > > > > > > > + * handlers for each phase in ascending order.  The order value of
> > > > > > > > + * platform-neutral handlers are defined in <linux/sys_hotplug.h>.
> > > > > > > > + */
> > > > > > > > +
> > > > > > > > +/* Add Validate order values */
> > > > > > > > +#define SHP_ACPI_BUS_ADD_VALIDATE_ORDER		0	/* must be first */
> > > > > > > > +
> > > > > > > > +/* Add Execute order values */
> > > > > > > > +#define SHP_ACPI_BUS_ADD_EXECUTE_ORDER		10
> > > > > > > > +#define SHP_ACPI_RES_ADD_EXECUTE_ORDER		20
> > > > > > > > +
> > > > > > > > +/* Add Commit order values */
> > > > > > > > +#define SHP_ACPI_BUS_ADD_COMMIT_ORDER		10
> > > > > > > > +
> > > > > > > > +/* Delete Validate order values */
> > > > > > > > +#define SHP_ACPI_BUS_DEL_VALIDATE_ORDER		0	/* must be first */
> > > > > > > > +#define SHP_ACPI_RES_DEL_VALIDATE_ORDER		10
> > > > > > > > +
> > > > > > > > +/* Delete Execute order values */
> > > > > > > > +#define SHP_ACPI_BUS_DEL_EXECUTE_ORDER		100
> > > > > > > > +
> > > > > > > > +/* Delete Commit order values */
> > > > > > > > +#define SHP_ACPI_BUS_DEL_COMMIT_ORDER		100
> > > > > > > > +
> > > > > > > > +#endif	/* _ACPI_SYS_HOTPLUG_H */
> > > > > > > > --
> > > > > > > 
> > > > > > > Why did you use the particular values above?
> > > > > > 
> > > > > > The ordering values above are used to define the relative order among
> > > > > > handlers.  For instance, the 100 for SHP_ACPI_BUS_DEL_EXECUTE_ORDER can
> > > > > > potentially be 21 since it is still larger than 20 for
> > > > > > SHP_MEM_DEL_EXECUTE_ORDER defined in linux/sys_hotplug.h.  I picked 100
> > > > > > so that more platform-neutral handlers can be added in between 20 and
> > > > > > 100 in future.
> > > > > 
> > > > > I thought so, but I don't think it's a good idea to add gaps like this.
> > > > 
> > > > OK, I will use an equal gap of 10 for all values.  So, the 100 in the
> > > > above example will be changed to 30.  
> > > 
> > > I wonder why you want to have those gaps at all.
> > 
> > Oh, I see.  I think some gap is helpful since it allows a new handler to
> > come between without recompiling other modules.  For instance, OEM
> > vendors may want to add their own handlers with loadable modules after
> > the kernel is distributed.
> 
> No, we don't support such a model, sorry, just make it a sequence of
> numbers and go from there.  If a vendor wants to modify the kernel to
> add new values, they can rebuild the core code as well.
> 
> I really don't like the whole idea of values in the first place, can't
> we just do things in the correct order in the code, and not be driven by
> random magic values?

OK, I will define all the values with enum, which is something like
below.  I think it is more manageable in this way as we do not have to
define magic values.

enum shp_add_order {
    /* Validate Phase */
    SHP_FW_BUS_ADD_VALIDATE_ORDER,

    /* Execute Phase */
    SHP_FW_BUS_ADD_EXECUTE_ORDER,
    SHP_FW_RES_ADD_EXECUTE_ORDER,
    SHP_MEM_ADD_EXECUTE_ORDER,
    SHP_CPU_ADD_EXECUTE_ORDER,

    /* Commit Phase */
    SHP_ADD_COMMIT_BASE_ORDER,
    SHP_FW_BUS_ADD_COMMIT_ORDER,
};

Thanks,
-Toshi

^ permalink raw reply

* Re: [RFC PATCH v2 01/12] Add sys_hotplug.h for system device hotplug framework
From: Toshi Kani @ 2013-01-31  1:15 UTC (permalink / raw)
  To: Greg KH
  Cc: linux-s390, jiang.liu, wency, linux-mm, yinghai, linux-kernel,
	Rafael J. Wysocki, linux-acpi, isimatu.yasuaki, srivatsa.bhat,
	guohanjun, bhelgaas, akpm, linuxppc-dev, lenb
In-Reply-To: <20130130044859.GD30002@kroah.com>

On Tue, 2013-01-29 at 23:48 -0500, Greg KH wrote:
> On Mon, Jan 14, 2013 at 12:02:04PM -0700, Toshi Kani wrote:
> > On Mon, 2013-01-14 at 19:48 +0100, Rafael J. Wysocki wrote:
> > > On Monday, January 14, 2013 08:33:48 AM Toshi Kani wrote:
> > > > On Fri, 2013-01-11 at 22:23 +0100, Rafael J. Wysocki wrote:
> > > > > On Thursday, January 10, 2013 04:40:19 PM Toshi Kani wrote:
> > > > > > Added include/linux/sys_hotplug.h, which defines the system device
> > > > > > hotplug framework interfaces used by the framework itself and
> > > > > > handlers.
> > > > > > 
> > > > > > The order values define the calling sequence of handlers.  For add
> > > > > > execute, the ordering is ACPI->MEM->CPU.  Memory is onlined before
> > > > > > CPU so that threads on new CPUs can start using their local memory.
> > > > > > The ordering of the delete execute is symmetric to the add execute.
> > > > > > 
> > > > > > struct shp_request defines a hot-plug request information.  The
> > > > > > device resource information is managed with a list so that a single
> > > > > > request may target to multiple devices.
> > > > > > 
> > > >  :
> > > > > > +
> > > > > > +struct shp_device {
> > > > > > +	struct list_head	list;
> > > > > > +	struct device		*device;
> > > > > > +	enum shp_class		class;
> > > > > > +	union shp_dev_info	info;
> > > > > > +};
> > > > > > +
> > > > > > +/*
> > > > > > + * Hot-plug request
> > > > > > + */
> > > > > > +struct shp_request {
> > > > > > +	/* common info */
> > > > > > +	enum shp_operation	operation;	/* operation */
> > > > > > +
> > > > > > +	/* hot-plug event info: only valid for hot-plug operations */
> > > > > > +	void			*handle;	/* FW handle */
> > > > > 
> > > > > What's the role of handle here?
> > > > 
> > > > On ACPI-based platforms, the handle keeps a notified ACPI handle when a
> > > > hot-plug request is made.  ACPI bus handlers, acpi_add_execute() /
> > > > acpi_del_execute(), then scans / trims ACPI devices from the handle.
> > > 
> > > OK, so this is ACPI-specific and should be described as such.
> > 
> > Other FW interface I know is parisc, which has mod_index (module index)
> > to identify a unique object, just like what ACPI handle does.  The
> > handle can keep the mod_index as an opaque value as well.  But as you
> > said, I do not know if the handle works for all other FWs.  So, I will
> > add descriptions, such that the hot-plug event info is modeled after
> > ACPI and may need to be revisited when supporting other FW.
> 
> Please make it a "real" pointer, and not a void *, those shouldn't be
> used at all if possible.

How about changing the "void *handle" to acpi_dev_node below?   

   struct acpi_dev_node    acpi_node;

Basically, it has the same challenge as struct device, which uses
acpi_dev_node as well.  We can add other FW node when needed (just like
device also has *of_node).

Thanks,
-Toshi

^ permalink raw reply

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Simon Jeons @ 2013-01-31  1:22 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <5108F2B3.3090506@cn.fujitsu.com>

Hi Tang,
On Wed, 2013-01-30 at 18:15 +0800, Tang Chen wrote:
> Hi Simon,
> 
> Please see below. :)
> 
> On 01/29/2013 08:52 PM, Simon Jeons wrote:
> > Hi Tang,
> >
> > On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> >> Here is the physical memory hot-remove patch-set based on 3.8rc-2.
> >
> > Some questions ask you, not has relationship with this patchset, but is
> > memory hotplug stuff.
> >
> > 1. In function node_states_check_changes_online:
> >
> > comments:
> > * If we don't have HIGHMEM nor movable node,
> > * node_states[N_NORMAL_MEMORY] contains nodes which have zones of
> > * 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.
> >
> > How to understand it? Why we don't have HIGHMEM nor movable node and
> > node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
> > N_NORMAL_MEMORY only means the node has regular memory.
> >
> 
> First of all, I think we need to understand why we need N_MEMORY.
> 
> In order to support movable node, which has only ZONE_MOVABLE (the last 
> zone),
> we introduce N_MEMORY to represent the node has normal, highmem and 
> movable memory.
> 
> Here, "we have movable node" means you configured CONFIG_MOVABLE_NODE.
> This config option doesn't mean we don't have movable pages, (NO)
> it means we don't have a node which has only movable pages (only have 
> ZONE_MOVABLE). (YES)
> 
> Here, if we don't have CONFIG_MOVABLE_NODE (we don't have movable node), 
> we don't need a
> separate node_states[] element to represent a particular node because we 
> won't have a node
> which has only ZONE_MOVABLE.
> 
> So,
> 1) if we don't have highmem nor movable node, N_MEMORY == N_HIGH_MEMORY 
> == N_NORMAL_MEMORY,
>     which means N_NORMAL_MEMORY effects as N_MEMORY. If we online pages 
> as movable, we need
>     to update node_states[N_NORMAL_MEMORY].

Sorry, I still confuse. :( 
update node_states[N_NORMAL_MEMORY] to node_states[N_MEMORY] or
node_states[N_NORMAL_MEMOR] present 0...ZONE_MOVABLE?

> 
> Please refer to the definition of enum zone_type, if we don't have 
> CONFIG_HIGHMEM, we won't
> have ZONE_HIGHMEM, but ZONE_NORMAL and ZONE_MOVABLE will always there. 
> So we can have movable
> pages, and the zone_last should be ZONE_MOVABLE.

node_states is what? node_states[N_NORMAL_MEMOR] or
node_states[N_MEMORY]?

> 
> Again, because we won't have a node only having ZONE_MOVABLE, so we just 
> need to update
> node_states[N_NORMAL_MEMORY].
> 
> > * If we don't have movable node, node_states[N_NORMAL_MEMORY]
> > * contains nodes which have zones of 0...ZONE_MOVABLE,
> > * set zone_last to ZONE_MOVABLE.
> >
> > How to understand?
> 
> 2) this code is in #ifdef CONFIG_HIGHMEM, which means we have highmem, 
> so if we don't have
>     movable node, N_MEMORY == N_HIGH_MEMORY, and N_HIGH_MEMORY effects 
> as N_MEMORY. If we
>     online pages as movable, we need to update node_states[N_NORMAL_MEMORY].
> 
> >
> > 2. In function move_pfn_range_left, why end<= z2->zone_start_pfn is not
> > correct? The comments said that must include/overlap, why?
> >
> 
> This one is easy, if I understand you correctly.
> move_pfn_range_left() is used to move the left most part [start_pfn, 
> end_pfn) of z2 to z1.
> So if end_pfn<= z2->zone_start_pfn, it means [start_pfn, end_pfn) is not 
> part of z2.
> Then it fails.

Yup, very clear now. :)
Why check !z1->wait_table in function move_pfn_range_left and function
__add_zone? I think zone->wait_table is initialized in
free_area_init_core, which will be called during system initialization
and hotadd_new_pgdat path.

> 
> > 3. In function online_pages, the normal case(w/o online_kenrel,
> > online_movable), why not check if the new zone is overlap with adjacent
> > zones?
> >
> 
> Can a zone overlap with the others ? I don't think so.
> 
> One pfn could only be in one zone,
>     zone = page_zone(pfn_to_page(pfn));

thanks. :)

There is a zone populated check in function online_pages. But zone is
populated in free_area_init_core which will be called during system
initialization and hotadd_new_pgdat path. Why still need this check?

> 
> it could overlap with others, I think. :)
> 
> But maybe I misunderstand you. :)
> 
> > 4. Could you summarize the difference implementation between hot-add and
> > logic-add, hot-remove and logic-remove?
> 
> Sorry, I don't quite understand what do you mean by logic-add/remove.
> Would you please explain more ?
> 
> If you meant the sys fs interfaces, I think they are just another set of 
> entrances
> of memory hotplug.

Please ingore this silly question. :(

> 
> Thanks.  :)
> 
> >
> >
> >>
> >> This patch-set aims to implement physical memory hot-removing.
> >>
> >> The patches can free/remove the following things:
> >>
> >>    - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
> >>    - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
> >>    - page table of removed memory              : [RFC PATCH 7,8,10/15]
> >>    - node and related sysfs files              : [RFC PATCH 13-15/15]
> >>
> >>
> >> Existing problem:
> >> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> >> when we online pages.
> >>
> >> For example: there is a memory device on node 1. The address range
> >> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> >> and memory11 under the directory /sys/devices/system/memory/.
> >>
> >> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
> >> cgroup is not provided by this memory device. But when we online memory9, the
> >> memory stored page cgroup may be provided by memory8. So we can't offline
> >> memory8 now. We should offline the memory in the reversed order.
> >>
> >> When the memory device is hotremoved, we will auto offline memory provided
> >> by this memory device. But we don't know which memory is onlined first, so
> >> offlining memory may fail.
> >>
> >> In patch1, we provide a solution which is not good enough:
> >> Iterate twice to offline the memory.
> >> 1st iterate: offline every non primary memory block.
> >> 2nd iterate: offline primary (i.e. first added) memory block.
> >>
> >> And a new idea from Wen Congyang<wency@cn.fujitsu.com>  is:
> >> allocate the memory from the memory block they are describing.
> >>
> >> But we are not sure if it is OK to do so because there is not existing API
> >> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
> >> to MEM_ONLINE. And also, it may interfere the hugepage.
> >>
> >>
> >>
> >> How to test this patchset?
> >> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
> >>     ACPI_HOTPLUG_MEMORY must be selected.
> >> 2. load the module acpi_memhotplug
> >> 3. hotplug the memory device(it depends on your hardware)
> >>     You will see the memory device under the directory /sys/bus/acpi/devices/.
> >>     Its name is PNP0C80:XX.
> >> 4. online/offline pages provided by this memory device
> >>     You can write online/offline to /sys/devices/system/memory/memoryX/state to
> >>     online/offline pages provided by this memory device
> >> 5. hotremove the memory device
> >>     You can hotremove the memory device by the hardware, or writing 1 to
> >>     /sys/bus/acpi/devices/PNP0C80:XX/eject.
> >
> > Is there a similar knode to hot-add the memory device?
> >
> >>
> >>
> >> Note: if the memory provided by the memory device is used by the kernel, it
> >> can't be offlined. It is not a bug.
> >>
> >>
> >> Changelogs from v5 to v6:
> >>   Patch3: Add some more comments to explain memory hot-remove.
> >>   Patch4: Remove bootmem member in struct firmware_map_entry.
> >>   Patch6: Repeatedly register bootmem pages when using hugepage.
> >>   Patch8: Repeatedly free bootmem pages when using hugepage.
> >>   Patch14: Don't free pgdat when offlining a node, just reset it to 0.
> >>   Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
> >>            one when online a node.
> >>
> >> Changelogs from v4 to v5:
> >>   Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
> >>           avoid disabling irq because we need flush tlb when free pagetables.
> >>   Patch8: new patch, pick up some common APIs that are used to free direct mapping
> >>           and vmemmap pagetables.
> >>   Patch9: free direct mapping pagetables on x86_64 arch.
> >>   Patch10: free vmemmap pagetables.
> >>   Patch11: since freeing memmap with vmemmap has been implemented, the config
> >>            macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
> >>            no longer needed.
> >>   Patch13: no need to modify acpi_memory_disable_device() since it was removed,
> >>            and add nid parameter when calling remove_memory().
> >>
> >> Changelogs from v3 to v4:
> >>   Patch7: remove unused codes.
> >>   Patch8: fix nr_pages that is passed to free_map_bootmem()
> >>
> >> Changelogs from v2 to v3:
> >>   Patch9: call sync_global_pgds() if pgd is changed
> >>   Patch10: fix a problem int the patch
> >>
> >> Changelogs from v1 to v2:
> >>   Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
> >>           memory block. 2nd iterate: offline primary (i.e. first added) memory
> >>           block.
> >>
> >>   Patch3: new patch, no logical change, just remove reduntant codes.
> >>
> >>   Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
> >>           after the pagetable is changed.
> >>
> >>   Patch12: new patch, free node_data when a node is offlined.
> >>
> >>
> >> Tang Chen (6):
> >>    memory-hotplug: move pgdat_resize_lock into
> >>      sparse_remove_one_section()
> >>    memory-hotplug: remove page table of x86_64 architecture
> >>    memory-hotplug: remove memmap of sparse-vmemmap
> >>    memory-hotplug: Integrated __remove_section() of
> >>      CONFIG_SPARSEMEM_VMEMMAP.
> >>    memory-hotplug: remove sysfs file of node
> >>    memory-hotplug: Do not allocate pdgat if it was not freed when
> >>      offline.
> >>
> >> Wen Congyang (5):
> >>    memory-hotplug: try to offline the memory twice to avoid dependence
> >>    memory-hotplug: remove redundant codes
> >>    memory-hotplug: introduce new function arch_remove_memory() for
> >>      removing page table depends on architecture
> >>    memory-hotplug: Common APIs to support page tables hot-remove
> >>    memory-hotplug: free node_data when a node is offlined
> >>
> >> Yasuaki Ishimatsu (4):
> >>    memory-hotplug: check whether all memory blocks are offlined or not
> >>      when removing memory
> >>    memory-hotplug: remove /sys/firmware/memmap/X sysfs
> >>    memory-hotplug: implement register_page_bootmem_info_section of
> >>      sparse-vmemmap
> >>    memory-hotplug: memory_hotplug: clear zone when removing the memory
> >>
> >>   arch/arm64/mm/mmu.c                  |    3 +
> >>   arch/ia64/mm/discontig.c             |   10 +
> >>   arch/ia64/mm/init.c                  |   18 ++
> >>   arch/powerpc/mm/init_64.c            |   10 +
> >>   arch/powerpc/mm/mem.c                |   12 +
> >>   arch/s390/mm/init.c                  |   12 +
> >>   arch/s390/mm/vmem.c                  |   10 +
> >>   arch/sh/mm/init.c                    |   17 ++
> >>   arch/sparc/mm/init_64.c              |   10 +
> >>   arch/tile/mm/init.c                  |    8 +
> >>   arch/x86/include/asm/pgtable_types.h |    1 +
> >>   arch/x86/mm/init_32.c                |   12 +
> >>   arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
> >>   arch/x86/mm/pageattr.c               |   47 ++--
> >>   drivers/acpi/acpi_memhotplug.c       |    8 +-
> >>   drivers/base/memory.c                |    6 +
> >>   drivers/firmware/memmap.c            |   96 +++++++-
> >>   include/linux/bootmem.h              |    1 +
> >>   include/linux/firmware-map.h         |    6 +
> >>   include/linux/memory_hotplug.h       |   15 +-
> >>   include/linux/mm.h                   |    4 +-
> >>   mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
> >>   mm/sparse.c                          |    8 +-
> >>   23 files changed, 1094 insertions(+), 69 deletions(-)
> >>
> >> --
> >> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> >> the body to majordomo@kvack.org.  For more info on Linux MM,
> >> see: http://www.linux-mm.org/ .
> >> Don't email:<a href=mailto:"dont@kvack.org">  email@kvack.org</a>
> >
> >
> >

^ permalink raw reply

* [GIT PULL 00/21] perf/core improvements and fixes
From: Arnaldo Carvalho de Melo @ 2013-01-30 14:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Frederic Weisbecker, Stephane Eranian,
	arnaldo.melo, linuxppc-dev, Paul Mackerras, Thomas Jarosch,
	Jiri Olsa, Arnaldo Carvalho de Melo, Andi Kleen, Hugh Dickins,
	Mel Gorman, Michael Ellerman, Borislav Petkov, Andrea Arcangeli,
	Rik van Riel, Corey Ashford, Namhyung Kim, Anton Blanchard,
	Steven Rostedt, Arnaldo Carvalho de Melo, Sukadev Bhattiprolu,
	Peter Hurley, Mike Galbraith, linux-kernel, David Ahern,
	Andrew Morton

Hi Ingo,

	Please consider pulling.

	Namhyung, Jiri, the 'group report' patches are at acme/perf/group,
will send a pull req later if it survives further testing.

- Arnaldo

The following changes since commit a2d28d0c198b65fac28ea6212f5f8edc77b29c27:

  Merge tag 'perf-core-for-mingo' of git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux into perf/core (2013-01-25 11:34:00 +0100)

are available in the git repository at:


  git://git.kernel.org/pub/scm/linux/kernel/git/acme/linux tags/perf-core-for-mingo

for you to fetch changes up to 5809fde040de2afa477a6c593ce2e8fd2c11d9d3:

  perf header: Fix double fclose() on do_write(fd, xxx) failure (2013-01-30 10:40:44 -0300)

----------------------------------------------------------------
perf/core improvements and fixes:

. Fix some leaks in exit paths.

. Use memdup where applicable

. Remove some die() calls, allowing callers to handle exit paths
  gracefully.

. Correct typo in tools Makefile, fix from Borislav Petkov.

. Add 'perf bench numa mem' NUMA performance measurement suite, from Ingo Molnar.

. Handle dynamic array's element size properly, fix from Jiri Olsa.

. Fix memory leaks on evsel->counts, from Namhyung Kim.

. Make numa benchmark optional, allowing the build in machines where required
  numa libraries are not present, fix from Peter Hurley.

. Add interval printing in 'perf stat', from Stephane Eranian.

. Fix compile warnings in tests/attr.c, from Sukadev Bhattiprolu.

. Fix double free, pclose instead of fclose, leaks and double fclose errors
  found with the cppcheck tool, from Thomas Jarosch.

Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>

----------------------------------------------------------------
Arnaldo Carvalho de Melo (8):
      perf tools: Stop using 'self' in strlist
      perf tools: Stop using 'self' in map.[ch]
      perf tools: Use memdup in map__clone
      perf kmem: Use memdup()
      perf header: Stop using die() calls when processing tracing data
      perf ui browser: Free browser->helpline() on ui_browser__hide()
      perf tests: Call machine__exit in the vmlinux matches kallsyms test
      perf tests: Fix leaks on PERF_RECORD_* test

Borislav Petkov (1):
      tools: Correct typo in tools Makefile

Ingo Molnar (1):
      perf: Add 'perf bench numa mem' NUMA performance measurement suite

Jiri Olsa (1):
      tools lib traceevent: Handle dynamic array's element size properly

Namhyung Kim (1):
      perf evsel: Fix memory leaks on evsel->counts

Peter Hurley (1):
      perf tools: Make numa benchmark optional

Stephane Eranian (2):
      perf evsel: Add prev_raw_count field
      perf stat: Add interval printing

Sukadev Bhattiprolu (1):
      perf tools, powerpc: Fix compile warnings in tests/attr.c

Thomas Jarosch (5):
      perf tools: Fix possible double free on error
      perf sort: Use pclose() instead of fclose() on pipe stream
      perf tools: Fix memory leak on error
      perf header: Fix memory leak for the "Not caching a kptr_restrict'ed /proc/kallsyms" case
      perf header: Fix double fclose() on do_write(fd, xxx) failure

 tools/Makefile                           |    2 +-
 tools/lib/traceevent/event-parse.c       |   39 +-
 tools/perf/Documentation/perf-stat.txt   |    4 +
 tools/perf/Makefile                      |   13 +
 tools/perf/arch/common.c                 |    1 +
 tools/perf/bench/bench.h                 |    1 +
 tools/perf/bench/numa.c                  | 1731 ++++++++++++++++++++++++++++++
 tools/perf/builtin-bench.c               |   17 +
 tools/perf/builtin-kmem.c                |    6 +-
 tools/perf/builtin-stat.c                |  158 ++-
 tools/perf/config/feature-tests.mak      |   11 +
 tools/perf/tests/attr.c                  |    5 +
 tools/perf/tests/open-syscall-all-cpus.c |    1 +
 tools/perf/tests/perf-record.c           |   12 +-
 tools/perf/tests/vmlinux-kallsyms.c      |    4 +-
 tools/perf/ui/browser.c                  |    2 +
 tools/perf/util/event.c                  |    4 +-
 tools/perf/util/evsel.c                  |   31 +
 tools/perf/util/evsel.h                  |    2 +
 tools/perf/util/header.c                 |   25 +-
 tools/perf/util/map.c                    |  118 +-
 tools/perf/util/map.h                    |   24 +-
 tools/perf/util/sort.c                   |    7 +-
 tools/perf/util/strlist.c                |   54 +-
 tools/perf/util/strlist.h                |   42 +-
 25 files changed, 2154 insertions(+), 160 deletions(-)
 create mode 100644 tools/perf/bench/numa.c

^ permalink raw reply

* [PATCH 16/21] perf tools, powerpc: Fix compile warnings in tests/attr.c
From: Arnaldo Carvalho de Melo @ 2013-01-30 14:46 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Anton Blanchard, linux-kernel, Arnaldo Carvalho de Melo,
	linuxppc-dev, Michael Ellerman, Sukadev Bhattiprolu, Jiri Olsa
In-Reply-To: <1359557222-17547-1-git-send-email-acme@infradead.org>

From: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>

We print several '__u64' quantities using '%llu'. On powerpc, we by
default include '<asm-generic/int-l64.h> which results in __u64 being an
unsigned long. This causes compile warnings which are treated as errors
due to '-Werror'.

By defining __SANE_USERSPACE_TYPES__ we include <asm-generic/int-ll64.h>
and define __u64 as unsigned long long.

Changelog[v2]:
	[Michael Ellerman] Use __SANE_USERSPACE_TYPES__ and avoid PRIu64
	format specifier - which as Jiri Olsa pointed out, breaks on x86-64.

Signed-off-by: Sukadev Bhattiprolu <sukadev@linux.vnet.ibm.com>
Cc: Anton Blanchard <anton@au1.ibm.com>
Cc: Jiri Olsa <jolsa@redhat.com>
Cc: Michael Ellerman <ellerman@au1.ibm.com>
Cc: linuxppc-dev@ozlabs.org
Link: http://lkml.kernel.org/r/20130124054439.GA31588@us.ibm.com
Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com>
---
 tools/perf/tests/attr.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/tools/perf/tests/attr.c b/tools/perf/tests/attr.c
index f61dd3f..bdcceb8 100644
--- a/tools/perf/tests/attr.c
+++ b/tools/perf/tests/attr.c
@@ -19,6 +19,11 @@
  * permissions. All the event text files are stored there.
  */
 
+/*
+ * Powerpc needs __SANE_USERSPACE_TYPES__ before <linux/types.h> to select
+ * 'int-ll64.h' and avoid compile warnings when printing __u64 with %llu.
+ */
+#define __SANE_USERSPACE_TYPES__
 #include <stdlib.h>
 #include <stdio.h>
 #include <inttypes.h>
-- 
1.8.1.1.361.gec3ae6e

^ permalink raw reply related

* powerpc/usb: machine check when writing to portsc in ehci-fsl
From: Guy Yribarren @ 2013-01-30 14:27 UTC (permalink / raw)
  To: linuxppc-dev

Currently using kernel 3.7.1 with mpc8315e, I am observing instability
during the usb initialisation. The usb controller is set as a host
controller with internal PHY defined in UTMI mode. This issue appears
sometimes after a warm reset. The kernel hangs and the log buffer =
contents
shows that a machine check exception is detected when execution is =
located
into ehci_fsl_setup_phy function.=20
=20
The latest messages available from the console are:
ehci_hcd: USB 2.0 'Enhanced' Host Controller (EHCI) Driver
ohci_hcd: USB 1.1 'Open' Host Controller (OHCI) Driver
uhci_hcd: USB Universal Host Controller Interface driver
/immr@e0000000/usb@23000: Invalid 'dr_mode' property, fallback to host =
mode
fsl-ehci fsl-ehci.0: Freescale On-Chip EHCI Host Controller
fsl-ehci fsl-ehci.0: new USB bus registered, assigned bus number 1

By adding multiple printk into the ehci-fsl, I found that the issue =
happens
when executing the write to the portsc register nearly at the end of the
function (ehci_writel(ehci, portsc, =
&ehci->regs->port_status[port_offset]);)

This issue appears either with or without external usb devices =
connected.

I didn't find any restrictions to access this register and/or =
explanation
from freescale doc and usb2 & ehci specification.

This issue has been also seen with older kernel release like 2.6.39.=20

Applying a warn reset, then the processor boot correctly and the usb
interface is now fully functional and really stable with intensive
activities.

The processor is based on the latest 1.2 mask revision.=20

Hope that someone has some suggestions about this strange behaviour.=20

Thanks for your help,
Regards

Guy Yribarren
=20
ACTIS Computer - 42 Route de Satigny - CH-1217 Meyrin - Switzerland
Tel +41 (22) 706 1830 - Fax +41 (22) 794 4391
guy.yribarren <at> actis-computer.com

=A0=A0=A0

^ permalink raw reply

* [PATCH 5/5] KVM: PPC: e500mc: Enable e6500 cores
From: Mihai Caraman @ 2013-01-30 13:29 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1359552584-17861-1-git-send-email-mihai.caraman@freescale.com>

Extend processor compatibility names to e6500 cores.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
 arch/powerpc/kvm/e500mc.c |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/kvm/e500mc.c b/arch/powerpc/kvm/e500mc.c
index 1f89d26..6c87299 100644
--- a/arch/powerpc/kvm/e500mc.c
+++ b/arch/powerpc/kvm/e500mc.c
@@ -172,6 +172,8 @@ int kvmppc_core_check_processor_compat(void)
 		r = 0;
 	else if (strcmp(cur_cpu_spec->cpu_name, "e5500") == 0)
 		r = 0;
+	else if (strcmp(cur_cpu_spec->cpu_name, "e6500") == 0)
+		r = 0;
 	else
 		r = -ENOTSUPP;
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 4/5] KVM: PPC: e500: Emulate EPTCFG register
From: Mihai Caraman @ 2013-01-30 13:29 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1359552584-17861-1-git-send-email-mihai.caraman@freescale.com>

EPTCFG register defined by E.PT is accessed unconditionally by Linux guests
in the presence of MAV 2.0. Emulate EPTCFG register now.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/kvm/e500.h             |    6 ++++++
 arch/powerpc/kvm/e500_emulate.c     |    9 +++++++++
 arch/powerpc/kvm/e500_mmu.c         |    5 +++++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 88fcfe6..f480b20 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -503,6 +503,7 @@ struct kvm_vcpu_arch {
 	u32 tlbcfg[4];
 	u32 tlbps[4];
 	u32 mmucfg;
+	u32 eptcfg;
 	u32 epr;
 	struct kvmppc_booke_debug_reg dbg_reg;
 #endif
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index b9f76d8..983eb95 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -308,4 +308,10 @@ static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
 	return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
 }
 
+static inline unsigned int supports_page_tables(const struct kvm_vcpu *vcpu)
+{
+	return ((vcpu->arch.tlbcfg[0] & TLBnCFG_IND)
+		|| (vcpu->arch.tlbcfg[1] & TLBnCFG_IND));
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index 5515dc5..493e231 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -339,6 +339,15 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
 			return EMULATE_FAIL;
 		*spr_val = vcpu->arch.tlbps[1];
 		break;
+	case SPRN_EPTCFG:
+		if (!has_mmu_v2(vcpu))
+			return EMULATE_FAIL;
+		/*
+		 * Legacy Linux guests access EPTCFG register even if the E.PT
+		 * category is disabled in the VM. Give them a chance to live.
+		 */
+		*spr_val = vcpu->arch.eptcfg;
+		break;
 	default:
 		emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
 	}
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 9a1f7b7..199c11e 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -799,6 +799,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	if (has_mmu_v2(vcpu)) {
 		vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
 		vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+
+		if (supports_page_tables(vcpu))
+			vcpu->arch.eptcfg = mfspr(SPRN_EPTCFG);
+		else
+			vcpu->arch.eptcfg = 0;
 	}
 
 	kvmppc_recalc_tlb1map_range(vcpu_e500);
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 3/5] KVM: PPC: e500: Remove E.PT category from VCPUs
From: Mihai Caraman @ 2013-01-30 13:29 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1359552584-17861-1-git-send-email-mihai.caraman@freescale.com>

Embedded.Page Table (E.PT) category in VMs requires indirect tlb entries
emulation which is not supported yet. Configure TLBnCFG to remove E.PT
category from VCPUs.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
 arch/powerpc/kvm/e500_mmu.c |   10 ++++++----
 1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 129299a..9a1f7b7 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -692,12 +692,14 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 	vcpu_e500->gtlb_offset[0] = 0;
 	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
 
-	vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+	vcpu->arch.tlbcfg[0] &=
+			      ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
 	if (params.tlb_sizes[0] <= 2048)
 		vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
 	vcpu->arch.tlbcfg[0] |= params.tlb_ways[0] << TLBnCFG_ASSOC_SHIFT;
 
-	vcpu->arch.tlbcfg[1] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+	vcpu->arch.tlbcfg[1] &=
+			      ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
 	vcpu->arch.tlbcfg[1] |= params.tlb_sizes[1];
 	vcpu->arch.tlbcfg[1] |= params.tlb_ways[1] << TLBnCFG_ASSOC_SHIFT;
 
@@ -783,13 +785,13 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 
 	/* Init TLB configuration register */
 	vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
-			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
 	vcpu->arch.tlbcfg[0] |= vcpu_e500->gtlb_params[0].entries;
 	vcpu->arch.tlbcfg[0] |=
 		vcpu_e500->gtlb_params[0].ways << TLBnCFG_ASSOC_SHIFT;
 
 	vcpu->arch.tlbcfg[1] = mfspr(SPRN_TLB1CFG) &
-			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
+			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC | TLBnCFG_IND);
 	vcpu->arch.tlbcfg[1] |= vcpu_e500->gtlb_params[1].entries;
 	vcpu->arch.tlbcfg[1] |=
 		vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 0/5] KVM: PPC: e500: Enable FSL e6500 core
From: Mihai Caraman @ 2013-01-30 13:29 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm

Enable Freescale e6500 core adding missing MAV 2.0 support. LRAT and Page
Table are not addresses by this commit.

Mihai Caraman (5):
  KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
  KVM: PPC: e500: Emulate TLBnPS registers
  KVM: PPC: e500: Remove E.PT category from VCPUs
  KVM: PPC: e500: Emulate EPTCFG register
  KVM: PPC: e500mc: Enable e6500 cores

 arch/powerpc/include/asm/kvm_host.h |    2 ++
 arch/powerpc/kvm/e500.h             |   11 +++++++++++
 arch/powerpc/kvm/e500_emulate.c     |   19 +++++++++++++++++++
 arch/powerpc/kvm/e500_mmu.c         |   24 ++++++++++++++++++------
 arch/powerpc/kvm/e500mc.c           |    2 ++
 5 files changed, 52 insertions(+), 6 deletions(-)

-- 
1.7.4.1

^ permalink raw reply

* [PATCH 2/5] KVM: PPC: e500: Emulate TLBnPS registers
From: Mihai Caraman @ 2013-01-30 13:29 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1359552584-17861-1-git-send-email-mihai.caraman@freescale.com>

Emulate TLBnPS registers which are available in MMU Architecture Version
(MAV) 2.0.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
 arch/powerpc/include/asm/kvm_host.h |    1 +
 arch/powerpc/kvm/e500.h             |    5 +++++
 arch/powerpc/kvm/e500_emulate.c     |   10 ++++++++++
 arch/powerpc/kvm/e500_mmu.c         |    5 +++++
 4 files changed, 21 insertions(+), 0 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 8a72d59..88fcfe6 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -501,6 +501,7 @@ struct kvm_vcpu_arch {
 	spinlock_t wdt_lock;
 	struct timer_list wdt_timer;
 	u32 tlbcfg[4];
+	u32 tlbps[4];
 	u32 mmucfg;
 	u32 epr;
 	struct kvmppc_booke_debug_reg dbg_reg;
diff --git a/arch/powerpc/kvm/e500.h b/arch/powerpc/kvm/e500.h
index 41cefd4..b9f76d8 100644
--- a/arch/powerpc/kvm/e500.h
+++ b/arch/powerpc/kvm/e500.h
@@ -303,4 +303,9 @@ static inline unsigned int get_tlbmiss_tid(struct kvm_vcpu *vcpu)
 #define get_tlb_sts(gtlbe)              (MAS1_TS)
 #endif /* !BOOKE_HV */
 
+static inline unsigned int has_mmu_v2(const struct kvm_vcpu *vcpu)
+{
+	return ((vcpu->arch.mmucfg & MMUCFG_MAVN) == MMUCFG_MAVN_V2);
+}
+
 #endif /* KVM_E500_H */
diff --git a/arch/powerpc/kvm/e500_emulate.c b/arch/powerpc/kvm/e500_emulate.c
index e78f353..5515dc5 100644
--- a/arch/powerpc/kvm/e500_emulate.c
+++ b/arch/powerpc/kvm/e500_emulate.c
@@ -329,6 +329,16 @@ int kvmppc_core_emulate_mfspr(struct kvm_vcpu *vcpu, int sprn, ulong *spr_val)
 		*spr_val = vcpu->arch.ivor[BOOKE_IRQPRIO_DBELL_CRIT];
 		break;
 #endif
+	case SPRN_TLB0PS:
+		if (!has_mmu_v2(vcpu))
+			return EMULATE_FAIL;
+		*spr_val = vcpu->arch.tlbps[0];
+		break;
+	case SPRN_TLB1PS:
+		if (!has_mmu_v2(vcpu))
+			return EMULATE_FAIL;
+		*spr_val = vcpu->arch.tlbps[1];
+		break;
 	default:
 		emulated = kvmppc_booke_emulate_mfspr(vcpu, sprn, spr_val);
 	}
diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index bb1b2b0..129299a 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -794,6 +794,11 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	vcpu->arch.tlbcfg[1] |=
 		vcpu_e500->gtlb_params[1].ways << TLBnCFG_ASSOC_SHIFT;
 
+	if (has_mmu_v2(vcpu)) {
+		vcpu->arch.tlbps[0] = mfspr(SPRN_TLB0PS);
+		vcpu->arch.tlbps[1] = mfspr(SPRN_TLB1PS);
+	}
+
 	kvmppc_recalc_tlb1map_range(vcpu_e500);
 	return 0;
 
-- 
1.7.4.1

^ permalink raw reply related

* [PATCH 1/5] KVM: PPC: e500: Move VCPU's MMUCFG register initialization earlier
From: Mihai Caraman @ 2013-01-30 13:29 UTC (permalink / raw)
  To: kvm-ppc; +Cc: Mihai Caraman, linuxppc-dev, kvm
In-Reply-To: <1359552584-17861-1-git-send-email-mihai.caraman@freescale.com>

VCPU's MMUCFG register initialization should not depend on KVM_CAP_SW_TLB
ioctl call. Move it earlier into tlb initalization phase.

Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com>
---
 arch/powerpc/kvm/e500_mmu.c |    4 ++--
 1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/kvm/e500_mmu.c b/arch/powerpc/kvm/e500_mmu.c
index 5c44759..bb1b2b0 100644
--- a/arch/powerpc/kvm/e500_mmu.c
+++ b/arch/powerpc/kvm/e500_mmu.c
@@ -692,8 +692,6 @@ int kvm_vcpu_ioctl_config_tlb(struct kvm_vcpu *vcpu,
 	vcpu_e500->gtlb_offset[0] = 0;
 	vcpu_e500->gtlb_offset[1] = params.tlb_sizes[0];
 
-	vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
-
 	vcpu->arch.tlbcfg[0] &= ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
 	if (params.tlb_sizes[0] <= 2048)
 		vcpu->arch.tlbcfg[0] |= params.tlb_sizes[0];
@@ -781,6 +779,8 @@ int kvmppc_e500_tlb_init(struct kvmppc_vcpu_e500 *vcpu_e500)
 	if (!vcpu_e500->g2h_tlb1_map)
 		goto err;
 
+	vcpu->arch.mmucfg = mfspr(SPRN_MMUCFG) & ~MMUCFG_LPIDSIZE;
+
 	/* Init TLB configuration register */
 	vcpu->arch.tlbcfg[0] = mfspr(SPRN_TLB0CFG) &
 			     ~(TLBnCFG_N_ENTRY | TLBnCFG_ASSOC);
-- 
1.7.4.1

^ permalink raw reply related

* [RFC PATCH 2/3] arch/powerpc: Reduce the PTE_INDEX_SIZE
From: Aneesh Kumar K.V @ 2013-01-30 13:18 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1359551916-10321-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

This make one PMD cover 16MB range. That helps in easier implementation of THP
on power. THP core code make use of one pmd entry to track the huge page and
the range mapped by a single pmd entry should be equal to the huge page size
supported by the hardware.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable-ppc64-64k.h |    4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable-ppc64-64k.h b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
index be4e287..3c529b4 100644
--- a/arch/powerpc/include/asm/pgtable-ppc64-64k.h
+++ b/arch/powerpc/include/asm/pgtable-ppc64-64k.h
@@ -4,10 +4,10 @@
 #include <asm-generic/pgtable-nopud.h>
 
 
-#define PTE_INDEX_SIZE  12
+#define PTE_INDEX_SIZE  8
 #define PMD_INDEX_SIZE  12
 #define PUD_INDEX_SIZE	0
-#define PGD_INDEX_SIZE  6
+#define PGD_INDEX_SIZE  10
 
 #ifndef __ASSEMBLY__
 #define PTE_TABLE_SIZE	(sizeof(real_pte_t) << PTE_INDEX_SIZE)
-- 
1.7.10

^ permalink raw reply related

* [RFC PATCH 1/3] powerpc: Add support for multiple PTE page fragment in PTE page
From: Aneesh Kumar K.V @ 2013-01-30 13:18 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev

Hello,

This is the preparatory patchset for supporting THP on powerpc. I am sending this across
so that we can get review on this change.

For full THP support on powerpc, I have an early version that boots and runs few test. But
I am still running into few issues, like Firmware NMI etc. Once those are resolved, I will
clean them up and send the complete patchset. 

IMHO we should NOT be merging this patchset until we have the full THP patchset on the list.
But still it will be nice to get review on the multiple PTE page fragment change. Other
architectures like s390 and sparc do support something similar. The implementation here
closely follows the s390 implementation.

-aneesh

^ permalink raw reply

* [RFC PATCH 3/3] powerpc: Reduce PTE table memory wastage
From: Aneesh Kumar K.V @ 2013-01-30 13:18 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1359551916-10321-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

We now have PTE page consuming only 2K of the 64K page.This is in order to
facilitate transparent huge page support, which works much better if our PMDs
cover 16MB instead of 256MB.

Inorder to reduce the wastage, we now have multiple PTE page fragment
from the same PTE page.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/mmu-book3e.h |    4 +
 arch/powerpc/include/asm/mmu-hash64.h |    4 +
 arch/powerpc/include/asm/page.h       |    4 +
 arch/powerpc/include/asm/pgalloc-32.h |   45 ++++++++
 arch/powerpc/include/asm/pgalloc-64.h |  143 ++++++++++++++++++++-----
 arch/powerpc/include/asm/pgalloc.h    |   46 +-------
 arch/powerpc/kernel/setup_64.c        |    4 +-
 arch/powerpc/mm/mmu_context_hash64.c  |   12 +++
 arch/powerpc/mm/pgtable_64.c          |  189 +++++++++++++++++++++++++++++++++
 9 files changed, 377 insertions(+), 74 deletions(-)

diff --git a/arch/powerpc/include/asm/mmu-book3e.h b/arch/powerpc/include/asm/mmu-book3e.h
index eeabcdb..5fe3738 100644
--- a/arch/powerpc/include/asm/mmu-book3e.h
+++ b/arch/powerpc/include/asm/mmu-book3e.h
@@ -231,6 +231,10 @@ typedef struct {
 	u64 high_slices_psize;  /* 4 bits per slice for now */
 	u16 user_psize;         /* page size index */
 #endif
+#ifdef CONFIG_PPC_64K_PAGES
+	/* for 2K page table support */
+	struct list_head pgtable_list;
+#endif
 } mm_context_t;
 
 /* Page size definitions, common between 32 and 64-bit
diff --git a/arch/powerpc/include/asm/mmu-hash64.h b/arch/powerpc/include/asm/mmu-hash64.h
index 9673f73..7876b13 100644
--- a/arch/powerpc/include/asm/mmu-hash64.h
+++ b/arch/powerpc/include/asm/mmu-hash64.h
@@ -480,6 +480,10 @@ typedef struct {
 	unsigned long acop;	/* mask of enabled coprocessor types */
 	unsigned int cop_pid;	/* pid value used with coprocessors */
 #endif /* CONFIG_PPC_ICSWX */
+#ifdef CONFIG_PPC_64K_PAGES
+	/* for 2K page table support */
+	struct list_head pgtable_list;
+#endif
 } mm_context_t;
 
 
diff --git a/arch/powerpc/include/asm/page.h b/arch/powerpc/include/asm/page.h
index f072e97..38e7ff6 100644
--- a/arch/powerpc/include/asm/page.h
+++ b/arch/powerpc/include/asm/page.h
@@ -378,7 +378,11 @@ void arch_free_page(struct page *page, int order);
 
 struct vm_area_struct;
 
+#ifdef CONFIG_PPC_64K_PAGES
+typedef pte_t *pgtable_t;
+#else
 typedef struct page *pgtable_t;
+#endif
 
 #include <asm-generic/memory_model.h>
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/include/asm/pgalloc-32.h b/arch/powerpc/include/asm/pgalloc-32.h
index 580cf73..27b2386 100644
--- a/arch/powerpc/include/asm/pgalloc-32.h
+++ b/arch/powerpc/include/asm/pgalloc-32.h
@@ -37,6 +37,17 @@ extern void pgd_free(struct mm_struct *mm, pgd_t *pgd);
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm, unsigned long addr);
 extern pgtable_t pte_alloc_one(struct mm_struct *mm, unsigned long addr);
 
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+	free_page((unsigned long)pte);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+	pgtable_page_dtor(ptepage);
+	__free_page(ptepage);
+}
+
 static inline void pgtable_free(void *table, unsigned index_size)
 {
 	BUG_ON(index_size); /* 32-bit doesn't use this */
@@ -45,4 +56,38 @@ static inline void pgtable_free(void *table, unsigned index_size)
 
 #define check_pgt_cache()	do { } while (0)
 
+#ifdef CONFIG_SMP
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+				    void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
+
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	pgtable_free(table, shift);
+}
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+				    void *table, int shift)
+{
+	pgtable_free(table, shift);
+}
+#endif
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+				  unsigned long address)
+{
+	struct page *page = page_address(table);
+
+	tlb_flush_pgtable(tlb, address);
+	pgtable_page_dtor(page);
+	pgtable_free_tlb(tlb, page, 0);
+}
 #endif /* _ASM_POWERPC_PGALLOC_32_H */
diff --git a/arch/powerpc/include/asm/pgalloc-64.h b/arch/powerpc/include/asm/pgalloc-64.h
index 292725c..882b713 100644
--- a/arch/powerpc/include/asm/pgalloc-64.h
+++ b/arch/powerpc/include/asm/pgalloc-64.h
@@ -72,9 +72,91 @@ static inline void pud_populate(struct mm_struct *mm, pud_t *pud, pmd_t *pmd)
 #define pmd_populate_kernel(mm, pmd, pte) pmd_set(pmd, (unsigned long)(pte))
 #define pmd_pgtable(pmd) pmd_page(pmd)
 
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
+					  unsigned long address)
+{
+        return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+}
+
+static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
+					unsigned long address)
+{
+	pte_t *pte;
+	struct page *page;
+
+	pte = pte_alloc_one_kernel(mm, address);
+	if (!pte)
+		return NULL;
+	page = virt_to_page(pte);
+	pgtable_page_ctor(page);
+	return page;
+}
+
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+	free_page((unsigned long)pte);
+}
+
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
+{
+	pgtable_page_dtor(ptepage);
+	__free_page(ptepage);
+}
 
-#else /* CONFIG_PPC_64K_PAGES */
+#ifdef CONFIG_SMP
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+				    void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	tlb_remove_table(tlb, (void *)pgf);
+}
 
+static inline void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	if (!shift)
+		free_page((unsigned long)table);
+	else {
+		BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(shift), table);
+	}
+}
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+				    void *table, int shift)
+{
+	pgtable_free(table, shift);
+}
+#endif
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+				  unsigned long address)
+{
+	struct page *page = page_address(table);
+
+	tlb_flush_pgtable(tlb, address);
+	pgtable_page_dtor(page);
+	pgtable_free_tlb(tlb, page, 0);
+}
+
+#else /* if CONFIG_PPC_64K_PAGES */
+
+extern unsigned long *page_table_alloc(struct mm_struct *, unsigned long);
+extern void page_table_free(struct mm_struct *, unsigned long *);
+#ifdef CONFIG_SMP
+extern void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift);
+extern void __tlb_remove_table(void *_table);
+#else
+static inline void pgtable_free_tlb(struct mmu_gather *tlb,
+				    void *table, int shift)
+{
+	pgtable_free(table, shift);
+}
+#endif
 #define pud_populate(mm, pud, pmd)	pud_set(pud, (unsigned long)pmd)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
@@ -83,51 +165,56 @@ static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd,
 	pmd_set(pmd, (unsigned long)pte);
 }
 
-#define pmd_populate(mm, pmd, pte_page) \
-	pmd_populate_kernel(mm, pmd, page_address(pte_page))
-#define pmd_pgtable(pmd) pmd_page(pmd)
-
-#endif /* CONFIG_PPC_64K_PAGES */
-
-static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+static inline void pmd_populate(struct mm_struct *mm, pmd_t *pmd,
+				pgtable_t pte_page)
 {
-	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
-				GFP_KERNEL|__GFP_REPEAT);
+	pmd_set(pmd, (unsigned long)pte_page);
 }
 
-static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
+static inline pgtable_t pmd_pgtable(pmd_t pmd)
 {
-	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
+	return (pgtable_t)(pmd_val(pmd) & -sizeof(pte_t)*PTRS_PER_PTE);
 }
 
 static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm,
 					  unsigned long address)
 {
-        return (pte_t *)__get_free_page(GFP_KERNEL | __GFP_REPEAT | __GFP_ZERO);
+	return (pte_t *)page_table_alloc(mm, address);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm,
 					unsigned long address)
 {
-	struct page *page;
-	pte_t *pte;
+	return (pgtable_t)page_table_alloc(mm, address);
+}
 
-	pte = pte_alloc_one_kernel(mm, address);
-	if (!pte)
-		return NULL;
-	page = virt_to_page(pte);
-	pgtable_page_ctor(page);
-	return page;
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+{
+	page_table_free(mm, (unsigned long *)pte);
 }
 
-static inline void pgtable_free(void *table, unsigned index_size)
+static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
 {
-	if (!index_size)
-		free_page((unsigned long)table);
-	else {
-		BUG_ON(index_size > MAX_PGTABLE_INDEX_SIZE);
-		kmem_cache_free(PGT_CACHE(index_size), table);
-	}
+	page_table_free(mm, (unsigned long *)ptepage);
+}
+
+static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t table,
+				  unsigned long address)
+{
+	tlb_flush_pgtable(tlb, address);
+	pgtable_free_tlb(tlb, table, 0);
+}
+#endif /* CONFIG_PPC_64K_PAGES */
+
+static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
+{
+	return kmem_cache_alloc(PGT_CACHE(PMD_INDEX_SIZE),
+				GFP_KERNEL|__GFP_REPEAT);
+}
+
+static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
+{
+	kmem_cache_free(PGT_CACHE(PMD_INDEX_SIZE), pmd);
 }
 
 #define __pmd_free_tlb(tlb, pmd, addr)		      \
diff --git a/arch/powerpc/include/asm/pgalloc.h b/arch/powerpc/include/asm/pgalloc.h
index bf301ac..e9a9f60 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -3,6 +3,7 @@
 #ifdef __KERNEL__
 
 #include <linux/mm.h>
+#include <asm-generic/tlb.h>
 
 #ifdef CONFIG_PPC_BOOK3E
 extern void tlb_flush_pgtable(struct mmu_gather *tlb, unsigned long address);
@@ -13,56 +14,11 @@ static inline void tlb_flush_pgtable(struct mmu_gather *tlb,
 }
 #endif /* !CONFIG_PPC_BOOK3E */
 
-static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
-{
-	free_page((unsigned long)pte);
-}
-
-static inline void pte_free(struct mm_struct *mm, pgtable_t ptepage)
-{
-	pgtable_page_dtor(ptepage);
-	__free_page(ptepage);
-}
-
 #ifdef CONFIG_PPC64
 #include <asm/pgalloc-64.h>
 #else
 #include <asm/pgalloc-32.h>
 #endif
 
-#ifdef CONFIG_SMP
-struct mmu_gather;
-extern void tlb_remove_table(struct mmu_gather *, void *);
-
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
-{
-	unsigned long pgf = (unsigned long)table;
-	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
-	pgf |= shift;
-	tlb_remove_table(tlb, (void *)pgf);
-}
-
-static inline void __tlb_remove_table(void *_table)
-{
-	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
-	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
-
-	pgtable_free(table, shift);
-}
-#else /* CONFIG_SMP */
-static inline void pgtable_free_tlb(struct mmu_gather *tlb, void *table, unsigned shift)
-{
-	pgtable_free(table, shift);
-}
-#endif /* !CONFIG_SMP */
-
-static inline void __pte_free_tlb(struct mmu_gather *tlb, struct page *ptepage,
-				  unsigned long address)
-{
-	tlb_flush_pgtable(tlb, address);
-	pgtable_page_dtor(ptepage);
-	pgtable_free_tlb(tlb, page_address(ptepage), 0);
-}
-
 #endif /* __KERNEL__ */
 #endif /* _ASM_POWERPC_PGALLOC_H */
diff --git a/arch/powerpc/kernel/setup_64.c b/arch/powerpc/kernel/setup_64.c
index efb6a41..784de65 100644
--- a/arch/powerpc/kernel/setup_64.c
+++ b/arch/powerpc/kernel/setup_64.c
@@ -575,7 +575,9 @@ void __init setup_arch(char **cmdline_p)
 	init_mm.end_code = (unsigned long) _etext;
 	init_mm.end_data = (unsigned long) _edata;
 	init_mm.brk = klimit;
-	
+#ifdef CONFIG_PPC_64K_PAGES
+	INIT_LIST_HEAD(&init_mm.context.pgtable_list);
+#endif
 	irqstack_early_init();
 	exc_lvl_early_init();
 	emergency_stack_init();
diff --git a/arch/powerpc/mm/mmu_context_hash64.c b/arch/powerpc/mm/mmu_context_hash64.c
index 40bc5b0..adf2bf9 100644
--- a/arch/powerpc/mm/mmu_context_hash64.c
+++ b/arch/powerpc/mm/mmu_context_hash64.c
@@ -94,6 +94,8 @@ int init_new_context(struct task_struct *tsk, struct mm_struct *mm)
 	spin_lock_init(mm->context.cop_lockp);
 #endif /* CONFIG_PPC_ICSWX */
 
+	INIT_LIST_HEAD(&mm->context.pgtable_list);
+
 	return 0;
 }
 
@@ -107,11 +109,21 @@ EXPORT_SYMBOL_GPL(__destroy_context);
 
 void destroy_context(struct mm_struct *mm)
 {
+	struct page *page;
+	struct list_head *item, *tmp;
+
 #ifdef CONFIG_PPC_ICSWX
 	drop_cop(mm->context.acop, mm);
 	kfree(mm->context.cop_lockp);
 	mm->context.cop_lockp = NULL;
 #endif /* CONFIG_PPC_ICSWX */
+	list_for_each_safe(item, tmp, &mm->context.pgtable_list) {
+		page = list_entry(item, struct page, lru);
+		list_del(&page->lru);
+		pgtable_page_dtor(page);
+		atomic_set(&page->_mapcount, -1);
+		__free_page(page);
+	}
 	__destroy_context(mm->context.id);
 	subpage_prot_free(mm);
 	mm->context.id = MMU_NO_CONTEXT;
diff --git a/arch/powerpc/mm/pgtable_64.c b/arch/powerpc/mm/pgtable_64.c
index e212a27..ec80314 100644
--- a/arch/powerpc/mm/pgtable_64.c
+++ b/arch/powerpc/mm/pgtable_64.c
@@ -69,6 +69,7 @@
 unsigned long ioremap_bot = IOREMAP_BASE;
 
 #ifdef CONFIG_PPC_MMU_NOHASH
+/* FIXME!! */
 static void *early_alloc_pgtable(unsigned long size)
 {
 	void *pt;
@@ -337,3 +338,191 @@ EXPORT_SYMBOL(__ioremap_at);
 EXPORT_SYMBOL(iounmap);
 EXPORT_SYMBOL(__iounmap);
 EXPORT_SYMBOL(__iounmap_at);
+
+#ifdef CONFIG_PPC_64K_PAGES
+/*
+ * we support 15 fragments per PTE page. This is limited by how many
+ * bits we can pack in page->_mapcount. We use the first half for
+ * tracking the usage for rcu page table free.
+ */
+#define FRAG_MASK_BITS	15
+#define FRAG_MASK ((1 << FRAG_MASK_BITS) - 1)
+/*
+ * We use a 2K PTE page fragment and another 2K for storing
+ * real_pte_t hash index
+ */
+#define PTE_FRAG_SIZE (2 * PTRS_PER_PTE * sizeof(pte_t))
+
+static inline unsigned int atomic_xor_bits(atomic_t *v, unsigned int bits)
+{
+	unsigned int old, new;
+
+	do {
+		old = atomic_read(v);
+		new = old ^ bits;
+	} while (atomic_cmpxchg(v, old, new) != old);
+	return new;
+}
+
+unsigned long *page_table_alloc(struct mm_struct *mm, unsigned long vmaddr)
+{
+	struct page *page;
+	unsigned int mask, bit;
+	unsigned long *table;
+
+	/* Allocate fragments of a 4K page as 1K/2K page table */
+	spin_lock(&mm->page_table_lock);
+	mask = FRAG_MASK;
+	if (!list_empty(&mm->context.pgtable_list)) {
+		page = list_first_entry(&mm->context.pgtable_list,
+					struct page, lru);
+		table = (unsigned long *) page_address(page);
+		mask = atomic_read(&page->_mapcount);
+		/*
+		 * Update with the higher order mask bits accumulated,
+		 * added as a part of rcu free.
+		 */
+		mask = mask | (mask >> FRAG_MASK_BITS);
+	}
+	if ((mask & FRAG_MASK) == FRAG_MASK) {
+		spin_unlock(&mm->page_table_lock);
+		page = alloc_page(GFP_KERNEL|__GFP_REPEAT);
+		if (!page)
+			return NULL;
+		pgtable_page_ctor(page);
+		atomic_set(&page->_mapcount, 1);
+		table = (unsigned long *) page_address(page);
+		spin_lock(&mm->page_table_lock);
+		INIT_LIST_HEAD(&page->lru);
+		list_add(&page->lru, &mm->context.pgtable_list);
+	} else {
+		/* The second half is used for real_pte_t hindex */
+		for (bit = 1; mask & bit; bit <<= 1)
+			table = (unsigned long *)((char *)table + PTE_FRAG_SIZE);
+
+		mask = atomic_xor_bits(&page->_mapcount, bit);
+		/*
+		 * We have taken up all the space, remove this from
+		 * the list, we will add it back when we have a free slot
+		 */
+		if ((mask & FRAG_MASK) == FRAG_MASK)
+			list_del_init(&page->lru);
+	}
+	spin_unlock(&mm->page_table_lock);
+	/*
+	 * zero out the newly allocated area, this make sure we don't
+	 * see the old left over pte values
+	 */
+	memset(table, 0, PTE_FRAG_SIZE);
+	return table;
+}
+
+void page_table_free(struct mm_struct *mm, unsigned long *table)
+{
+	struct page *page;
+	unsigned int bit, mask;
+
+	/* Free 2K page table fragment of a 64K page */
+	page = virt_to_page(table);
+	bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+	spin_lock(&mm->page_table_lock);
+	mask = atomic_xor_bits(&page->_mapcount, bit);
+	if (mask == 0)
+		list_del(&page->lru);
+	else if (mask & FRAG_MASK) {
+		/*
+		 * Add the page table page to pgtable_list so that
+		 * the free fragment can be used by the next alloc
+		 */
+		list_del_init(&page->lru);
+		list_add(&page->lru, &mm->context.pgtable_list);
+	}
+	spin_unlock(&mm->page_table_lock);
+	if (mask == 0) {
+		pgtable_page_dtor(page);
+		atomic_set(&page->_mapcount, -1);
+		__free_page(page);
+	}
+}
+
+#ifdef CONFIG_SMP
+static void __page_table_free_rcu(void *table)
+{
+	unsigned int bit;
+	struct page *page;
+	/*
+	 * this is a PTE page free 2K page table
+	 * fragment of a 64K page.
+	 */
+	page = virt_to_page(table);
+	bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+	bit <<= FRAG_MASK_BITS;
+	/*
+	 * clear the higher half and if nobody used the page in
+	 * between, even lower half would be zero.
+	 */
+	if (atomic_xor_bits(&page->_mapcount, bit) == 0) {
+		pgtable_page_dtor(page);
+		atomic_set(&page->_mapcount, -1);
+		__free_page(page);
+	}
+}
+
+static void page_table_free_rcu(struct mmu_gather *tlb, unsigned long *table)
+{
+	struct page *page;
+	struct mm_struct *mm;
+	unsigned int bit, mask;
+
+	mm = tlb->mm;
+	/* Free 2K page table fragment of a 64K page */
+	page = virt_to_page(table);
+	bit = 1 << ((__pa(table) & ~PAGE_MASK) / PTE_FRAG_SIZE);
+	spin_lock(&mm->page_table_lock);
+	/*
+	 * stash the actual mask in higher half, and clear the lower half
+	 * and selectively, add remove from pgtable list
+	 */
+	mask = atomic_xor_bits(&page->_mapcount, bit | (bit << FRAG_MASK_BITS));
+	if (!(mask & FRAG_MASK))
+		list_del(&page->lru);
+	else {
+		/*
+		 * Add the page table page to pgtable_list so that
+		 * the free fragment can be used by the next alloc
+		 */
+		list_del_init(&page->lru);
+		list_add_tail(&page->lru, &mm->context.pgtable_list);
+	}
+	spin_unlock(&mm->page_table_lock);
+	tlb_remove_table(tlb, table);
+}
+
+void pgtable_free_tlb(struct mmu_gather *tlb, void *table, int shift)
+{
+	unsigned long pgf = (unsigned long)table;
+
+	BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+	pgf |= shift;
+	if (shift == 0)
+		/* PTE page needs special handling */
+		page_table_free_rcu(tlb, table);
+	else
+		tlb_remove_table(tlb, (void *)pgf);
+}
+
+void __tlb_remove_table(void *_table)
+{
+	void *table = (void *)((unsigned long)_table & ~MAX_PGTABLE_INDEX_SIZE);
+	unsigned shift = (unsigned long)_table & MAX_PGTABLE_INDEX_SIZE;
+
+	if (!shift)
+		/* PTE page needs special handling */
+		__page_table_free_rcu(table);
+	else {
+		BUG_ON(shift > MAX_PGTABLE_INDEX_SIZE);
+		kmem_cache_free(PGT_CACHE(shift), table);
+	}
+}
+#endif
+#endif /* CONFIG_PPC_64K_PAGES */
-- 
1.7.10

^ permalink raw reply related

* [RFC PATCH 1/3] powerpc: Don't hard code the size of pte page
From: Aneesh Kumar K.V @ 2013-01-30 13:18 UTC (permalink / raw)
  To: benh, paulus; +Cc: linuxppc-dev, Aneesh Kumar K.V
In-Reply-To: <1359551916-10321-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com>

From: "Aneesh Kumar K.V" <aneesh.kumar@linux.vnet.ibm.com>

USE PTRS_PER_PTE to indicate the size of pte page.

Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
---
 arch/powerpc/include/asm/pgtable.h |    6 ++++++
 arch/powerpc/mm/hash_low_64.S      |    4 ++--
 2 files changed, 8 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index a9cbd3b..fc57855 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -17,6 +17,12 @@ struct mm_struct;
 #  include <asm/pgtable-ppc32.h>
 #endif
 
+/*
+ * hidx is in the second half of the page table. We use the
+ * 8 bytes per each pte entry.
+ */
+#define PTE_PAGE_HIDX_OFFSET (PTRS_PER_PTE * 8)
+
 #ifndef __ASSEMBLY__
 
 #include <asm/tlbflush.h>
diff --git a/arch/powerpc/mm/hash_low_64.S b/arch/powerpc/mm/hash_low_64.S
index 5658508..94fd37b 100644
--- a/arch/powerpc/mm/hash_low_64.S
+++ b/arch/powerpc/mm/hash_low_64.S
@@ -484,7 +484,7 @@ END_FTR_SECTION(CPU_FTR_NOEXECUTE|CPU_FTR_COHERENT_ICACHE, CPU_FTR_NOEXECUTE)
 	beq	htab_inval_old_hpte
 
 	ld	r6,STK_PARAM(R6)(r1)
-	ori	r26,r6,0x8000		/* Load the hidx mask */
+	ori	r26,r6,PTE_PAGE_HIDX_OFFSET /* Load the hidx mask. */
 	ld	r26,0(r26)
 	addi	r5,r25,36		/* Check actual HPTE_SUB bit, this */
 	rldcr.	r0,r31,r5,0		/* must match pgtable.h definition */
@@ -601,7 +601,7 @@ htab_pte_insert_ok:
 	sld	r4,r4,r5
 	andc	r26,r26,r4
 	or	r26,r26,r3
-	ori	r5,r6,0x8000
+	ori	r5,r6,PTE_PAGE_HIDX_OFFSET
 	std	r26,0(r5)
 	lwsync
 	std	r30,0(r6)
-- 
1.7.10

^ permalink raw reply related

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox