LinuxPPC-Dev Archive on lore.kernel.org
* Re: [PATCH 01/18] powerpc: Add ability to build little endian kernels
From: Benjamin Herrenschmidt @ 2010-10-01 12:09 UTC
  To: Josh Boyer
  Cc: Michal Marek, linuxppc-dev, Albert Herranz, linux-kernel, paulus,
	Ian Munsie, Andreas Schwab, Andrew Morton, Sam Ravnborg,
	Torez Smith
In-Reply-To: <AANLkTimGtB_KoOgwk3Ujn2UcNzs9qHxzeZcb1ShLgcXh@mail.gmail.com>

On Fri, 2010-10-01 at 07:28 -0400, Josh Boyer wrote:
> > Shouldn't we have something that limits to the sub-arch'es that
> > actually support it?  I doubt I'm ever going to make FSL-Book-e
> > support LE.
> 
> Yes, it should. 

Sure, that's only WIP patches :-)

Tho FSL BookE would be relatively easy...

Cheers,
Ben.

* Re: Introduce support for little endian PowerPC
From: Benjamin Herrenschmidt @ 2010-10-01 12:14 UTC
  To: Josh Boyer; +Cc: linuxppc-dev, paulus, linux-kernel, Ian Munsie
In-Reply-To: <AANLkTinSDMvBb7rnnfwYMkEVOgDUqRetZg86rh-UmSAg@mail.gmail.com>

On Fri, 2010-10-01 at 07:30 -0400, Josh Boyer wrote:
> 
> > From a community aspect is anyone actually going to use this?  Is
> > this going to be the equivalent of voyager on x86?  I've got nothing
> > against some of the endian clean ups this introduces.  However the
> > changes to misc_32.S are a bit ugly from a readability point of view.
> > Just seems like this is likely to bit-rot pretty quickly.
> 
> I'm with Kumar on this one.  Why would we want to support this?  I
> can't say I would be very willing to help anyone run in LE mode, let
> alone have it randomly selectable. 

There are some good reasons in the field ... sadly.

At this stage this is mostly an experiment, which went pretty well in
the sense that it's actually quite easy and a lot of the "fixes" are
actually reasonable cleanups to carry.

Now, the main reasons in practice are anything touching graphics.

There's quite a few IP cores out there for SoCs that don't have HW
swappers, and -tons- of more or less ugly code that can't deal with non
native pixel ordering (hell, even Xorg isn't good at it, we really only
support cards that have HW swappers today).

There's an even bigger pile of application code that deals with graphics
without any regard for endianness and is essentially unfixable.
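
To make the swapper point a bit more concrete, here is a minimal sketch of
what a driver or application has to do in software when the framebuffer
byte order differs from the CPU's and there is no HW swapper. It is not
taken from the patch set, and swap_pixel32() and swap_scanline32() are
made-up names; it assumes plain 32-bit pixels.

```c
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of one 32-bit pixel (hypothetical helper). */
static uint32_t swap_pixel32(uint32_t px)
{
	return ((px & 0x000000ffu) << 24) |
	       ((px & 0x0000ff00u) << 8)  |
	       ((px & 0x00ff0000u) >> 8)  |
	       ((px & 0xff000000u) >> 24);
}

/* Swap a whole scanline in place; this per-pixel cost is what a
 * hardware swapper would otherwise hide. */
static void swap_scanline32(uint32_t *line, size_t npixels)
{
	for (size_t i = 0; i < npixels; i++)
		line[i] = swap_pixel32(line[i]);
}
```

Every code path that touches pixel data needs a call like this, which is
why code written without it is so hard to retrofit.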

So it becomes a matter of potential customers that will take it if it
does LE and won't if it doesn't ...

Now, I don't have a problem supporting that as the maintainer, as I
said, from a kernel standpoint, it's all quite easy to deal with. Some
of the most gory aspects in misc_32.S could probably be done in a way
that is slightly more readable, but the approach is actually good, I
think, to have macros to represent the high/low parts of register pairs.
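
As a rough C analogue of that register-pair idea (the actual macros are
assembler and live in the misc_32.S patch, which is not reproduced here),
the sketch below picks the high and low 32-bit halves of a 64-bit value
as it sits in memory. All names are hypothetical; the only point is that
the index of the "high" word flips with the byte order, so naming it once
keeps the rest of the code endian neutral.

```c
#include <stdint.h>
#include <string.h>

/* Detect the byte order at run time (hypothetical helper). */
static int host_is_little_endian(void)
{
	const uint16_t probe = 1;
	uint8_t first;

	memcpy(&first, &probe, 1);
	return first == 1;
}

/* Position of the most/least significant word of a 64-bit value in memory. */
static int hi_word_index(void) { return host_is_little_endian() ? 1 : 0; }
static int lo_word_index(void) { return host_is_little_endian() ? 0 : 1; }

/* Fetch the high half without caring about the byte order at the call site. */
static uint32_t high_word(uint64_t v)
{
	uint32_t w[2];

	memcpy(w, &v, sizeof(w));
	return w[hi_word_index()];
}

/* Same for the low half. */
static uint32_t low_word(uint64_t v)
{
	uint32_t w[2];

	memcpy(w, &v, sizeof(w));
	return w[lo_word_index()];
}
```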

So at this stage, I'd say, let's not dismiss it just because we all come
from a long education of hating LE for the sake of it :-)

It makes -some- sense, even if it's not necessarily on the markets
targeted by FSL today for example. At least from the kernel POV, it
doesn't seem to me to be a significant support burden at all.

Cheers,
Ben.

* Re: Introduce support for little endian PowerPC
From: Benjamin Herrenschmidt @ 2010-10-01 12:15 UTC
  To: Gary Thomas; +Cc: paulus, linuxppc-dev, linux-kernel, Ian Munsie
In-Reply-To: <4CA5CC3C.6020006@mlbassoc.com>

On Fri, 2010-10-01 at 05:55 -0600, Gary Thomas wrote:
> On 10/01/2010 05:30 AM, Josh Boyer wrote:
> > On Fri, Oct 1, 2010 at 5:02 AM, Kumar Gala<galak@kernel.crashing.org>  wrote:
> >>
> >> On Oct 1, 2010, at 2:05 AM, Ian Munsie wrote:
> >>
> >>> Some PowerPC processors can be run in either big or little endian modes, some
> >>> others can map selected pages of memory as little endian, which allows the same
> >>> thing. Until now we have only supported the default big endian mode in Linux.
> >>> This patch set introduces little endian support for the 44x family of PowerPC
> >>> processors.
> >>
> >>  From a community aspect is anyone actually going to use this?  Is this going to be the equivalent of voyager on x86?  I've got nothing against some of the endian clean ups this introduces.  However the changes to misc_32.S are a bit ugly from a readability point of view.  Just seems like this is likely to bit-rot pretty quickly.
> >
> > I'm with Kumar on this one.  Why would we want to support this?  I
> > can't say I would be very willing to help anyone run in LE mode, let
> > alone have it randomly selectable.
> 
> Indeed, I thought we had killed that Windows-NT dog ~15 years ago :-)

Actually this has more to do with having to deal with code written for
ARM LE :-)

Cheers,
Ben.

* Re: Introduce support for little endian PowerPC
From: Benjamin Herrenschmidt @ 2010-10-01 12:21 UTC
  To: Josh Boyer; +Cc: paulus, linuxppc-dev, Ian Munsie, linux-kernel
In-Reply-To: <AANLkTikHcoccu-WGtSOSdkwU6Tw_An_VFkUcEoY==8=c@mail.gmail.com>

On Fri, 2010-10-01 at 07:36 -0400, Josh Boyer wrote:
> On Fri, Oct 1, 2010 at 3:05 AM, Ian Munsie <imunsie@au1.ibm.com> wrote:
> > This patch set in combination with a patched GCC, binutils, uClibc and
> > buildroot has allowed for a full proof of concept little endian environment on
> > a 440 Taishan board, which was able to successfully run busybox, OpenSSH and a
> > handful of other userspace programs without problems.
> 
> Aside from my general "uh, why?" stance, I'm very very hesitant to
> integrate anything in the kernel that doesn't have released patches
> on the toolchain side.

We aren't yet talking about merging that as-is, though I believe at
least -some- of the patches have merit on their own, such as the proper
accessors for device-tree properties. At the very least, it would make
it less painful for archs like ARM to borrow code in that area and will
make it cleaner for sparse when we generalize endian annotations.
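
For illustration only, this is the sort of accessor meant here: device-tree
cells are stored big endian, so a reader that assembles the value byte by
byte gives the same result on a BE or LE kernel. dt_read_be32() is a
made-up name for this sketch; in the kernel one would more likely reach for
the existing byteorder helpers such as be32_to_cpup().

```c
#include <stdint.h>

/* Read one big-endian 32-bit device-tree cell, independent of CPU byte
 * order (hypothetical helper, not the accessor from the patch set). */
static uint32_t dt_read_be32(const void *cell)
{
	const uint8_t *p = cell;

	return ((uint32_t)p[0] << 24) |
	       ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8)  |
	        (uint32_t)p[3];
}
```

Dereferencing the property as a plain u32 pointer only works on a BE
kernel, which is how much of the existing code gets away with it.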

The toolchain work was done as a quick & dirty experiment. Whether some
"proper" work there will happen remains to be decided.

> Also, which uClibc?  The old and crusty uClibc that uses the horrible
> linuxthreads, or the somewhat less crusty that just switched to NPTL
> (which hasn't been verified on normal PowerPC that I recall).  Why not
> use glibc...

Because this was a proof of concept and as such, it was easier to deal
with uclibc initially to get busybox going :-)

> > This is not yet complete support for little endian PowerPC, some outstanding
> > issues that I am aware of are:
> >  * We only support 32bit PowerPC for now (and indeed, only 44x)
> >  * The vdso has not been fixed to be endian agnostic - any userspace program
> >   accessing it will get an unexpected result.
> >  * I have not touched PCI at all
> >  * Remaining device tree accesses still need to be examined to ensure they are
> >   correctly handling the endianness of the device tree.
> >  * Any other driver that uses the device tree is likely to be broken for the same reason.
> >  * I've included a patch for the alignment handler, however it is as yet
> >   completely untested due to a property of the hardware I've been using for
> >   testing.
> 
> I'm not meaning to detract here, but the Kconfig should be dependent
> on && BROKEN until the above is fixed.

Right.

I think Ian wasn't clear enough on the fact that those patches aren't
meant to be merged in the next merge window :-) I told him to shoot them
to the list for review, comments and discussions, but if we decide to
move along with integrating that, there's definitely more work to do.

Cheers,
Ben.

* Re: [PATCH 01/18] powerpc: Add ability to build little endian kernels
From: Benjamin Herrenschmidt @ 2010-10-01 12:22 UTC
  To: Josh Boyer
  Cc: Michal Marek, Sam Ravnborg, Albert Herranz, linux-kernel, paulus,
	Ian Munsie, Andreas Schwab, Andrew Morton, linuxppc-dev,
	Torez Smith
In-Reply-To: <AANLkTimdV2FTEhcKXcM_+k23PBFTR5KvyjqEGxzzX72Q@mail.gmail.com>

On Fri, 2010-10-01 at 07:40 -0400, Josh Boyer wrote:
> Have you tested this support with a userspace containing floating
> point instructions?  I wonder if CONFIG_MATH_EMULATION is going to
> need work at all, and if the boards with an actual FPU (440EP, 440EPx,
> 460EX, etc) would have issues. 

That's one of the things on the TODO list. We've tested that on a 44x
with no FPU so far and made sure we built without math emu.

Cheers,
Ben.

* Re: Introduce support for little endian PowerPC
From: Gary Thomas @ 2010-10-01 12:37 UTC
  To: Benjamin Herrenschmidt; +Cc: paulus, linuxppc-dev, linux-kernel, Ian Munsie
In-Reply-To: <1285935329.2463.79.camel@pasglop>

On 10/01/2010 06:15 AM, Benjamin Herrenschmidt wrote:
> On Fri, 2010-10-01 at 05:55 -0600, Gary Thomas wrote:
>> On 10/01/2010 05:30 AM, Josh Boyer wrote:
>>> On Fri, Oct 1, 2010 at 5:02 AM, Kumar Gala<galak@kernel.crashing.org>   wrote:
>>>>
>>>> On Oct 1, 2010, at 2:05 AM, Ian Munsie wrote:
>>>>
>>>>> Some PowerPC processors can be run in either big or little endian modes, some
>>>>> others can map selected pages of memory as little endian, which allows the same
>>>>> thing. Until now we have only supported the default big endian mode in Linux.
>>>>> This patch set introduces little endian support for the 44x family of PowerPC
>>>>> processors.
>>>>
>>>>    From a community aspect is anyone actually going to use this?  Is this going to be the equivalent of voyager on x86?  I've got nothing against some of the endian clean ups this introduces.  However the changes to misc_32.S are a bit ugly from a readability point of view.  Just seems like this is likely to bit-rot pretty quickly.
>>>
>>> I'm with Kumar on this one.  Why would we want to support this?  I
>>> can't say I would be very willing to help anyone run in LE mode, let
>>> alone have it randomly selectable.
>>
>> Indeed, I thought we had killed that Windows-NT dog ~15 years ago :-)
>
> Actually this has more to do with having to deal with code written for
> ARM LE :-)

The comment was mostly meant as a reminder of the main reason this was considered
a long time ago.

I understand that the world has moved on, and sadly the vast majority
of hardware is now little endian (although it still baffles me why anyone
would think that way...)

-- 
------------------------------------------------------------
Gary Thomas                 |  Consulting for the
MLB Associates              |    Embedded world
------------------------------------------------------------

* RE: [RFC] irq: Migrate powerpc virq subsystem into generic code
From: Lorenzo Pieralisi @ 2010-10-01 12:44 UTC
  To: 'Grant Likely', benh, devicetree-discuss, linuxppc-dev
In-Reply-To: <20100922203117.10426.69278.stgit@angua>

Hi Grant, Ben, all

> -----Original Message-----
> From: devicetree-discuss-
> bounces+lorenzo.pieralisi=arm.com@lists.ozlabs.org [mailto:devicetree-
> discuss-bounces+lorenzo.pieralisi=arm.com@lists.ozlabs.org] On Behalf
> Of Grant Likely
> Sent: 22 September 2010 21:33
> To: benh@kernel.crashing.org; devicetree-discuss@lists.ozlabs.org;
> linuxppc-dev@lists.ozlabs.org
> Subject: [RFC] irq: Migrate powerpc virq subsystem into generic code
> 
> Being able to dynamically manage linux irq ranges is useful.  Migrate
> the powerpc virq code into common code so that other architectures can
> use it.
> 
> This patch also removes the unused irq_early_init() references.
> 
> Signed-off-by: Grant Likely <grant.likely@secretlab.ca>
> ---
> Only compile tested; but I wanted to get this out for comments.  I
> think this is the right set of routines to generalize for virq on
> other architectures.
> 
> g.

I have a question on the PowerPC IRQ layer and how to use it
for device drivers and device tree in general.

A device such as eth smsc911x (example) requires the platform_data 
pointer to specify irq sense/level to programme the chip accordingly. 
As agreed with Grant these pieces of information should be retrieved 
from the device tree, from the interrupt-specifier.

Now: the device driver should be interrupt-controller agnostic, so
I cannot just retrieve the interrupts property at probe and decode it
in order to get the interrupt sense/level bits (the driver has no clue 
about the interrupt-controller irq flags encoding).
I need to associate an irq_host with the interrupt controller node, with
a proper xlate function to correctly decode the interrupt-specifier 
and set the irq type accordingly (irq_create_of_mapping(),
for now 1:1 on ARM, so useless from this standpoint).

At platform device init time, of_irq_to_resource() is called to parse
and map irqs; if we code the irq_host correctly for the ARM GIC for
instance the xlate function gets called and irq type set accordingly
(and maybe the function could set platform_device IRQ resource 
flags as well ?)

At driver dt probe, from the hwirq number defined in "interrupts"
the driver retrieves the virq, hence the sense/level flags, and uses them,
or just uses the platform_device IRQ resource flags if set properly
by the OF layer. 

Correct ?

If yes I will have a stab at it on an ARM platform with complex
IRQ routing.
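
To be concrete about the xlate step, here is a minimal sketch of the
decode I have in mind. It is only an assumption of how it could look: the
two-cell specifier layout (hwirq, then a controller-specific flag), the
flag values and the DEMO_* names are all invented for this example, and
the cells are taken as already converted to CPU byte order. Only the
out_hwirq/out_type shape mirrors the xlate callback in the quoted patch.

```c
#include <stdint.h>

typedef unsigned long irq_hw_number_t;	/* same typedef as in the quoted patch */

/* Hypothetical generic sense values the driver would see. */
enum demo_irq_type {
	DEMO_IRQ_TYPE_NONE        = 0,
	DEMO_IRQ_TYPE_EDGE_RISING = 1,
	DEMO_IRQ_TYPE_LEVEL_HIGH  = 4,
};

/* Decode <hwirq flag> where flag 0 = level high, 1 = edge rising
 * (an invented controller encoding). */
static int demo_xlate(const uint32_t *intspec, unsigned int intsize,
		      irq_hw_number_t *out_hwirq, unsigned int *out_type)
{
	if (intsize < 1)
		return -1;

	*out_hwirq = intspec[0];
	*out_type = DEMO_IRQ_TYPE_NONE;	/* keep the controller default */

	if (intsize > 1) {
		switch (intspec[1]) {
		case 0:
			*out_type = DEMO_IRQ_TYPE_LEVEL_HIGH;
			break;
		case 1:
			*out_type = DEMO_IRQ_TYPE_EDGE_RISING;
			break;
		default:
			return -1;	/* unknown controller flag */
		}
	}
	return 0;
}
```

The driver then only ever sees the generic type and never has to know the
controller's flag encoding.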

Thank you very much.

Cheers,
Lorenzo

> 
>  arch/microblaze/kernel/setup.c |    2
>  arch/powerpc/Kconfig           |    3
>  arch/powerpc/include/asm/irq.h |  270 ----------------
>  arch/powerpc/kernel/irq.c      |  659 --------------------------------------
>  include/linux/virq.h           |  302 ++++++++++++++++++
>  kernel/irq/Makefile            |    1
>  kernel/irq/virq.c              |  687 ++++++++++++++++++++++++++++++++++++++++
>  7 files changed, 995 insertions(+), 929 deletions(-)
>  create mode 100644 include/linux/virq.h
>  create mode 100644 kernel/irq/virq.c
> 
> diff --git a/arch/microblaze/kernel/setup.c b/arch/microblaze/kernel/setup.c
> index f5f7688..39cf20d 100644
> --- a/arch/microblaze/kernel/setup.c
> +++ b/arch/microblaze/kernel/setup.c
> @@ -51,8 +51,6 @@ void __init setup_arch(char **cmdline_p)
> 
>  	unflatten_device_tree();
> 
> -	/* NOTE I think that this function is not necessary to call */
> -	/* irq_early_init(); */
>  	setup_cpuinfo();
> 
>  	microblaze_cache_init();
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index 631e5a0..cc06e59 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -146,6 +146,9 @@ config EARLY_PRINTK
>  	bool
>  	default y
> 
> +config VIRQ
> +	def_bool y
> +
>  config COMPAT
>  	bool
>  	default y if PPC64
> diff --git a/arch/powerpc/include/asm/irq.h b/arch/powerpc/include/asm/irq.h
> index 67ab5fb..6dea0cb 100644
> --- a/arch/powerpc/include/asm/irq.h
> +++ b/arch/powerpc/include/asm/irq.h
> @@ -17,10 +17,6 @@
>  #include <asm/atomic.h>
> 
> 
> -/* Define a way to iterate across irqs. */
> -#define for_each_irq(i) \
> -	for ((i) = 0; (i) < NR_IRQS; ++(i))
> -
>  extern atomic_t ppc_n_lost_interrupts;
> 
>  /* This number is used when no interrupt has been assigned */
> @@ -41,270 +37,6 @@ extern atomic_t ppc_n_lost_interrupts;
>  /* Same thing, used by the generic IRQ code */
>  #define NR_IRQS_LEGACY		NUM_ISA_INTERRUPTS
> 
> -/* This type is the placeholder for a hardware interrupt number. It has to
> - * be big enough to enclose whatever representation is used by a given
> - * platform.
> - */
> -typedef unsigned long irq_hw_number_t;
> -
> -/* Interrupt controller "host" data structure. This could be defined as a
> - * irq domain controller. That is, it handles the mapping between hardware
> - * and virtual interrupt numbers for a given interrupt domain. The host
> - * structure is generally created by the PIC code for a given PIC instance
> - * (though a host can cover more than one PIC if they have a flat number
> - * model). It's the host callbacks that are responsible for setting the
> - * irq_chip on a given irq_desc after it's been mapped.
> - *
> - * The host code and data structures are fairly agnostic to the fact that
> - * we use an open firmware device-tree. We do have references to struct
> - * device_node in two places: in irq_find_host() to find the host matching
> - * a given interrupt controller node, and of course as an argument to its
> - * counterpart host->ops->match() callback. However, those are treated as
> - * generic pointers by the core and the fact that it's actually a device-node
> - * pointer is purely a convention between callers and implementation. This
> - * code could thus be used on other architectures by replacing those two
> - * by some sort of arch-specific void * "token" used to identify interrupt
> - * controllers.
> - */
> -struct irq_host;
> -struct radix_tree_root;
> -
> -/* Functions below are provided by the host and called whenever a new mapping
> - * is created or an old mapping is disposed. The host can then proceed to
> - * whatever internal data structures management is required. It also needs
> - * to setup the irq_desc when returning from map().
> - */
> -struct irq_host_ops {
> -	/* Match an interrupt controller device node to a host, returns
> -	 * 1 on a match
> -	 */
> -	int (*match)(struct irq_host *h, struct device_node *node);
> -
> -	/* Create or update a mapping between a virtual irq number and a hw
> -	 * irq number. This is called only once for a given mapping.
> -	 */
> -	int (*map)(struct irq_host *h, unsigned int virq, irq_hw_number_t hw);
> -
> -	/* Dispose of such a mapping */
> -	void (*unmap)(struct irq_host *h, unsigned int virq);
> -
> -	/* Update of such a mapping  */
> -	void (*remap)(struct irq_host *h, unsigned int virq, irq_hw_number_t hw);
> -
> -	/* Translate device-tree interrupt specifier from raw format coming
> -	 * from the firmware to a irq_hw_number_t (interrupt line number) and
> -	 * type (sense) that can be passed to set_irq_type(). In the absence
> -	 * of this callback, irq_create_of_mapping() and irq_of_parse_and_map()
> -	 * will return the hw number in the first cell and IRQ_TYPE_NONE for
> -	 * the type (which amount to keeping whatever default value the
> -	 * interrupt controller has for that line)
> -	 */
> -	int (*xlate)(struct irq_host *h, struct device_node *ctrler,
> -		     const u32 *intspec, unsigned int intsize,
> -		     irq_hw_number_t *out_hwirq, unsigned int *out_type);
> -};
> -
> -struct irq_host {
> -	struct list_head	link;
> -
> -	/* type of reverse mapping technique */
> -	unsigned int		revmap_type;
> -#define IRQ_HOST_MAP_LEGACY     0 /* legacy 8259, gets irqs 1..15 */
> -#define IRQ_HOST_MAP_NOMAP	1 /* no fast reverse mapping */
> -#define IRQ_HOST_MAP_LINEAR	2 /* linear map of interrupts */
> -#define IRQ_HOST_MAP_TREE	3 /* radix tree */
> -	union {
> -		struct {
> -			unsigned int size;
> -			unsigned int *revmap;
> -		} linear;
> -		struct radix_tree_root tree;
> -	} revmap_data;
> -	struct irq_host_ops	*ops;
> -	void			*host_data;
> -	irq_hw_number_t		inval_irq;
> -
> -	/* Optional device node pointer */
> -	struct device_node	*of_node;
> -};
> -
> -/* The main irq map itself is an array of NR_IRQ entries containing the
> - * associate host and irq number. An entry with a host of NULL is free.
> - * An entry can be allocated if it's free, the allocator always then sets
> - * hwirq first to the host's invalid irq number and then fills ops.
> - */
> -struct irq_map_entry {
> -	irq_hw_number_t	hwirq;
> -	struct irq_host	*host;
> -};
> -
> -extern struct irq_map_entry irq_map[NR_IRQS];
> -
> -extern irq_hw_number_t virq_to_hw(unsigned int virq);
> -
> -/**
> - * irq_alloc_host - Allocate a new irq_host data structure
> - * @of_node: optional device-tree node of the interrupt controller
> - * @revmap_type: type of reverse mapping to use
> - * @revmap_arg: for IRQ_HOST_MAP_LINEAR linear only: size of the map
> - * @ops: map/unmap host callbacks
> - * @inval_irq: provide a hw number in that host space that is always
> invalid
> - *
> - * Allocates and initialize and irq_host structure. Note that in the
> case of
> - * IRQ_HOST_MAP_LEGACY, the map() callback will be called before this
> returns
> - * for all legacy interrupts except 0 (which is always the invalid irq
> for
> - * a legacy controller). For a IRQ_HOST_MAP_LINEAR, the map is
> allocated by
> - * this call as well. For a IRQ_HOST_MAP_TREE, the radix tree will be
> allocated
> - * later during boot automatically (the reverse mapping will use the
> slow path
> - * until that happens).
> - */
> -extern struct irq_host *irq_alloc_host(struct device_node *of_node,
> -				       unsigned int revmap_type,
> -				       unsigned int revmap_arg,
> -				       struct irq_host_ops *ops,
> -				       irq_hw_number_t inval_irq);
> -
> -
> -/**
> - * irq_find_host - Locates a host for a given device node
> - * @node: device-tree node of the interrupt controller
> - */
> -extern struct irq_host *irq_find_host(struct device_node *node);
> -
> -
> -/**
> - * irq_set_default_host - Set a "default" host
> - * @host: default host pointer
> - *
> - * For convenience, it's possible to set a "default" host that will be
> used
> - * whenever NULL is passed to irq_create_mapping(). It makes life
> easier for
> - * platforms that want to manipulate a few hard coded interrupt
> numbers that
> - * aren't properly represented in the device-tree.
> - */
> -extern void irq_set_default_host(struct irq_host *host);
> -
> -
> -/**
> - * irq_set_virq_count - Set the maximum number of virt irqs
> - * @count: number of linux virtual irqs, capped with NR_IRQS
> - *
> - * This is mainly for use by platforms like iSeries who want to
> program
> - * the virtual irq number in the controller to avoid the reverse
> mapping
> - */
> -extern void irq_set_virq_count(unsigned int count);
> -
> -
> -/**
> - * irq_create_mapping - Map a hardware interrupt into linux virq space
> - * @host: host owning this hardware interrupt or NULL for default host
> - * @hwirq: hardware irq number in that host space
> - *
> - * Only one mapping per hardware interrupt is permitted. Returns a
> linux
> - * virq number.
> - * If the sense/trigger is to be specified, set_irq_type() should be
> called
> - * on the number returned from that call.
> - */
> -extern unsigned int irq_create_mapping(struct irq_host *host,
> -				       irq_hw_number_t hwirq);
> -
> -
> -/**
> - * irq_dispose_mapping - Unmap an interrupt
> - * @virq: linux virq number of the interrupt to unmap
> - */
> -extern void irq_dispose_mapping(unsigned int virq);
> -
> -/**
> - * irq_find_mapping - Find a linux virq from an hw irq number.
> - * @host: host owning this hardware interrupt
> - * @hwirq: hardware irq number in that host space
> - *
> - * This is a slow path, for use by generic code. It's expected that an
> - * irq controller implementation directly calls the appropriate low
> level
> - * mapping function.
> - */
> -extern unsigned int irq_find_mapping(struct irq_host *host,
> -				     irq_hw_number_t hwirq);
> -
> -/**
> - * irq_create_direct_mapping - Allocate a virq for direct mapping
> - * @host: host to allocate the virq for or NULL for default host
> - *
> - * This routine is used for irq controllers which can choose the
> hardware
> - * interrupt numbers they generate. In such a case it's simplest to
> use
> - * the linux virq as the hardware interrupt number.
> - */
> -extern unsigned int irq_create_direct_mapping(struct irq_host *host);
> -
> -/**
> - * irq_radix_revmap_insert - Insert a hw irq to linux virq number
> mapping.
> - * @host: host owning this hardware interrupt
> - * @virq: linux irq number
> - * @hwirq: hardware irq number in that host space
> - *
> - * This is for use by irq controllers that use a radix tree reverse
> - * mapping for fast lookup.
> - */
> -extern void irq_radix_revmap_insert(struct irq_host *host, unsigned
> int virq,
> -				    irq_hw_number_t hwirq);
> -
> -/**
> - * irq_radix_revmap_lookup - Find a linux virq from a hw irq number.
> - * @host: host owning this hardware interrupt
> - * @hwirq: hardware irq number in that host space
> - *
> - * This is a fast path, for use by irq controller code that uses radix
> tree
> - * revmaps
> - */
> -extern unsigned int irq_radix_revmap_lookup(struct irq_host *host,
> -					    irq_hw_number_t hwirq);
> -
> -/**
> - * irq_linear_revmap - Find a linux virq from a hw irq number.
> - * @host: host owning this hardware interrupt
> - * @hwirq: hardware irq number in that host space
> - *
> - * This is a fast path, for use by irq controller code that uses
> linear
> - * revmaps. It does fallback to the slow path if the revmap doesn't
> exist
> - * yet and will create the revmap entry with appropriate locking
> - */
> -
> -extern unsigned int irq_linear_revmap(struct irq_host *host,
> -				      irq_hw_number_t hwirq);
> -
> -
> -
> -/**
> - * irq_alloc_virt - Allocate virtual irq numbers
> - * @host: host owning these new virtual irqs
> - * @count: number of consecutive numbers to allocate
> - * @hint: pass a hint number, the allocator will try to use a 1:1
> mapping
> - *
> - * This is a low level function that is used internally by
> irq_create_mapping()
> - * and that can be used by some irq controllers implementations for
> things
> - * like allocating ranges of numbers for MSIs. The revmaps are left
> untouched.
> - */
> -extern unsigned int irq_alloc_virt(struct irq_host *host,
> -				   unsigned int count,
> -				   unsigned int hint);
> -
> -/**
> - * irq_free_virt - Free virtual irq numbers
> - * @virq: virtual irq number of the first interrupt to free
> - * @count: number of interrupts to free
> - *
> - * This function is the opposite of irq_alloc_virt. It will not clear
> reverse
> - * maps, this should be done previously by unmap'ing the interrupt. In
> fact,
> - * all interrupts covered by the range being freed should have been
> unmapped
> - * prior to calling this.
> - */
> -extern void irq_free_virt(unsigned int virq, unsigned int count);
> -
> -/**
> - * irq_early_init - Init irq remapping subsystem
> - */
> -extern void irq_early_init(void);
> -
>  static __inline__ int irq_canonicalize(int irq)
>  {
>  	return irq;
> @@ -342,5 +74,7 @@ extern int call_handle_irq(int irq, void *p1,
>  			   struct thread_info *tp, void *func);
>  extern void do_IRQ(struct pt_regs *regs);
> 
> +#include <linux/virq.h>
> +
>  #endif /* _ASM_IRQ_H */
>  #endif /* __KERNEL__ */
> diff --git a/arch/powerpc/kernel/irq.c b/arch/powerpc/kernel/irq.c
> index 4a65386..86d8e42 100644
> --- a/arch/powerpc/kernel/irq.c
> +++ b/arch/powerpc/kernel/irq.c
> @@ -523,553 +523,6 @@ void do_softirq(void)
>  }
> 
> 
> -/*
> - * IRQ controller and virtual interrupts
> - */
> -
> -static LIST_HEAD(irq_hosts);
> -static DEFINE_RAW_SPINLOCK(irq_big_lock);
> -static unsigned int revmap_trees_allocated;
> -static DEFINE_MUTEX(revmap_trees_mutex);
> -struct irq_map_entry irq_map[NR_IRQS];
> -static unsigned int irq_virq_count = NR_IRQS;
> -static struct irq_host *irq_default_host;
> -
> -irq_hw_number_t virq_to_hw(unsigned int virq)
> -{
> -	return irq_map[virq].hwirq;
> -}
> -EXPORT_SYMBOL_GPL(virq_to_hw);
> -
> -static int default_irq_host_match(struct irq_host *h, struct
> device_node *np)
> -{
> -	return h->of_node != NULL && h->of_node == np;
> -}
> -
> -struct irq_host *irq_alloc_host(struct device_node *of_node,
> -				unsigned int revmap_type,
> -				unsigned int revmap_arg,
> -				struct irq_host_ops *ops,
> -				irq_hw_number_t inval_irq)
> -{
> -	struct irq_host *host;
> -	unsigned int size = sizeof(struct irq_host);
> -	unsigned int i;
> -	unsigned int *rmap;
> -	unsigned long flags;
> -
> -	/* Allocate structure and revmap table if using linear mapping */
> -	if (revmap_type == IRQ_HOST_MAP_LINEAR)
> -		size += revmap_arg * sizeof(unsigned int);
> -	host = zalloc_maybe_bootmem(size, GFP_KERNEL);
> -	if (host == NULL)
> -		return NULL;
> -
> -	/* Fill structure */
> -	host->revmap_type = revmap_type;
> -	host->inval_irq = inval_irq;
> -	host->ops = ops;
> -	host->of_node = of_node_get(of_node);
> -
> -	if (host->ops->match == NULL)
> -		host->ops->match = default_irq_host_match;
> -
> -	raw_spin_lock_irqsave(&irq_big_lock, flags);
> -
> -	/* If it's a legacy controller, check for duplicates and
> -	 * mark it as allocated (we use irq 0 host pointer for that
> -	 */
> -	if (revmap_type == IRQ_HOST_MAP_LEGACY) {
> -		if (irq_map[0].host != NULL) {
> -			raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> -			/* If we are early boot, we can't free the structure,
> -			 * too bad...
> -			 * this will be fixed once slab is made available early
> -			 * instead of the current cruft
> -			 */
> -			if (mem_init_done)
> -				kfree(host);
> -			return NULL;
> -		}
> -		irq_map[0].host = host;
> -	}
> -
> -	list_add(&host->link, &irq_hosts);
> -	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> -
> -	/* Additional setups per revmap type */
> -	switch(revmap_type) {
> -	case IRQ_HOST_MAP_LEGACY:
> -		/* 0 is always the invalid number for legacy */
> -		host->inval_irq = 0;
> -		/* setup us as the host for all legacy interrupts */
> -		for (i = 1; i < NUM_ISA_INTERRUPTS; i++) {
> -			irq_map[i].hwirq = i;
> -			smp_wmb();
> -			irq_map[i].host = host;
> -			smp_wmb();
> -
> -			/* Clear norequest flags */
> -			irq_to_desc(i)->status &= ~IRQ_NOREQUEST;
> -
> -			/* Legacy flags are left to default at this point,
> -			 * one can then use irq_create_mapping() to
> -			 * explicitly change them
> -			 */
> -			ops->map(host, i, i);
> -		}
> -		break;
> -	case IRQ_HOST_MAP_LINEAR:
> -		rmap = (unsigned int *)(host + 1);
> -		for (i = 0; i < revmap_arg; i++)
> -			rmap[i] = NO_IRQ;
> -		host->revmap_data.linear.size = revmap_arg;
> -		smp_wmb();
> -		host->revmap_data.linear.revmap = rmap;
> -		break;
> -	default:
> -		break;
> -	}
> -
> -	pr_debug("irq: Allocated host of type %d @0x%p\n", revmap_type,
> host);
> -
> -	return host;
> -}
> -
> -struct irq_host *irq_find_host(struct device_node *node)
> -{
> -	struct irq_host *h, *found = NULL;
> -	unsigned long flags;
> -
> -	/* We might want to match the legacy controller last since
> -	 * it might potentially be set to match all interrupts in
> -	 * the absence of a device node. This isn't a problem so far
> -	 * yet though...
> -	 */
> -	raw_spin_lock_irqsave(&irq_big_lock, flags);
> -	list_for_each_entry(h, &irq_hosts, link)
> -		if (h->ops->match(h, node)) {
> -			found = h;
> -			break;
> -		}
> -	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> -	return found;
> -}
> -EXPORT_SYMBOL_GPL(irq_find_host);
> -
> -void irq_set_default_host(struct irq_host *host)
> -{
> -	pr_debug("irq: Default host set to @0x%p\n", host);
> -
> -	irq_default_host = host;
> -}
> -
> -void irq_set_virq_count(unsigned int count)
> -{
> -	pr_debug("irq: Trying to set virq count to %d\n", count);
> -
> -	BUG_ON(count < NUM_ISA_INTERRUPTS);
> -	if (count < NR_IRQS)
> -		irq_virq_count = count;
> -}
> -
> -static int irq_setup_virq(struct irq_host *host, unsigned int virq,
> -			    irq_hw_number_t hwirq)
> -{
> -	struct irq_desc *desc;
> -
> -	desc = irq_to_desc_alloc_node(virq, 0);
> -	if (!desc) {
> -		pr_debug("irq: -> allocating desc failed\n");
> -		goto error;
> -	}
> -
> -	/* Clear IRQ_NOREQUEST flag */
> -	desc->status &= ~IRQ_NOREQUEST;
> -
> -	/* map it */
> -	smp_wmb();
> -	irq_map[virq].hwirq = hwirq;
> -	smp_mb();
> -
> -	if (host->ops->map(host, virq, hwirq)) {
> -		pr_debug("irq: -> mapping failed, freeing\n");
> -		goto error;
> -	}
> -
> -	return 0;
> -
> -error:
> -	irq_free_virt(virq, 1);
> -	return -1;
> -}
> -
> -unsigned int irq_create_direct_mapping(struct irq_host *host)
> -{
> -	unsigned int virq;
> -
> -	if (host == NULL)
> -		host = irq_default_host;
> -
> -	BUG_ON(host == NULL);
> -	WARN_ON(host->revmap_type != IRQ_HOST_MAP_NOMAP);
> -
> -	virq = irq_alloc_virt(host, 1, 0);
> -	if (virq == NO_IRQ) {
> -		pr_debug("irq: create_direct virq allocation failed\n");
> -		return NO_IRQ;
> -	}
> -
> -	pr_debug("irq: create_direct obtained virq %d\n", virq);
> -
> -	if (irq_setup_virq(host, virq, virq))
> -		return NO_IRQ;
> -
> -	return virq;
> -}
> -
> -unsigned int irq_create_mapping(struct irq_host *host,
> -				irq_hw_number_t hwirq)
> -{
> -	unsigned int virq, hint;
> -
> -	pr_debug("irq: irq_create_mapping(0x%p, 0x%lx)\n", host, hwirq);
> -
> -	/* Look for default host if necessary */
> -	if (host == NULL)
> -		host = irq_default_host;
> -	if (host == NULL) {
> -		printk(KERN_WARNING "irq_create_mapping called for"
> -		       " NULL host, hwirq=%lx\n", hwirq);
> -		WARN_ON(1);
> -		return NO_IRQ;
> -	}
> -	pr_debug("irq: -> using host @%p\n", host);
> -
> -	/* Check if the mapping already exists; if it does, call
> -	 * host->ops->remap() to update the flags
> -	 */
> -	virq = irq_find_mapping(host, hwirq);
> -	if (virq != NO_IRQ) {
> -		if (host->ops->remap)
> -			host->ops->remap(host, virq, hwirq);
> -		pr_debug("irq: -> existing mapping on virq %d\n", virq);
> -		return virq;
> -	}
> -
> -	/* Get a virtual interrupt number */
> -	if (host->revmap_type == IRQ_HOST_MAP_LEGACY) {
> -		/* Handle legacy */
> -		virq = (unsigned int)hwirq;
> -		if (virq == 0 || virq >= NUM_ISA_INTERRUPTS)
> -			return NO_IRQ;
> -		return virq;
> -	} else {
> -		/* Allocate a virtual interrupt number */
> -		hint = hwirq % irq_virq_count;
> -		virq = irq_alloc_virt(host, 1, hint);
> -		if (virq == NO_IRQ) {
> -			pr_debug("irq: -> virq allocation failed\n");
> -			return NO_IRQ;
> -		}
> -	}
> -
> -	if (irq_setup_virq(host, virq, hwirq))
> -		return NO_IRQ;
> -
> -	printk(KERN_DEBUG "irq: irq %lu on host %s mapped to virtual irq %u\n",
> -		hwirq, host->of_node ? host->of_node->full_name : "null", virq);
> -
> -	return virq;
> -}
> -EXPORT_SYMBOL_GPL(irq_create_mapping);
> -
> -unsigned int irq_create_of_mapping(struct device_node *controller,
> -				   const u32 *intspec, unsigned int intsize)
> -{
> -	struct irq_host *host;
> -	irq_hw_number_t hwirq;
> -	unsigned int type = IRQ_TYPE_NONE;
> -	unsigned int virq;
> -
> -	if (controller == NULL)
> -		host = irq_default_host;
> -	else
> -		host = irq_find_host(controller);
> -	if (host == NULL) {
> -		printk(KERN_WARNING "irq: no irq host found for %s !\n",
> -		       controller->full_name);
> -		return NO_IRQ;
> -	}
> -
> -	/* If host has no translation, then we assume interrupt line */
> -	if (host->ops->xlate == NULL)
> -		hwirq = intspec[0];
> -	else {
> -		if (host->ops->xlate(host, controller, intspec, intsize,
> -				     &hwirq, &type))
> -			return NO_IRQ;
> -	}
> -
> -	/* Create mapping */
> -	virq = irq_create_mapping(host, hwirq);
> -	if (virq == NO_IRQ)
> -		return virq;
> -
> -	/* Set type if specified and different than the current one */
> -	if (type != IRQ_TYPE_NONE &&
> -	    type != (irq_to_desc(virq)->status & IRQF_TRIGGER_MASK))
> -		set_irq_type(virq, type);
> -	return virq;
> -}
> -EXPORT_SYMBOL_GPL(irq_create_of_mapping);
> -
> -void irq_dispose_mapping(unsigned int virq)
> -{
> -	struct irq_host *host;
> -	irq_hw_number_t hwirq;
> -
> -	if (virq == NO_IRQ)
> -		return;
> -
> -	host = irq_map[virq].host;
> -	WARN_ON (host == NULL);
> -	if (host == NULL)
> -		return;
> -
> -	/* Never unmap legacy interrupts */
> -	if (host->revmap_type == IRQ_HOST_MAP_LEGACY)
> -		return;
> -
> -	/* remove chip and handler */
> -	set_irq_chip_and_handler(virq, NULL, NULL);
> -
> -	/* Make sure it's completed */
> -	synchronize_irq(virq);
> -
> -	/* Tell the PIC about it */
> -	if (host->ops->unmap)
> -		host->ops->unmap(host, virq);
> -	smp_mb();
> -
> -	/* Clear reverse map */
> -	hwirq = irq_map[virq].hwirq;
> -	switch(host->revmap_type) {
> -	case IRQ_HOST_MAP_LINEAR:
> -		if (hwirq < host->revmap_data.linear.size)
> -			host->revmap_data.linear.revmap[hwirq] = NO_IRQ;
> -		break;
> -	case IRQ_HOST_MAP_TREE:
> -		/*
> -		 * Check if radix tree allocated yet, if not then nothing to
> -		 * remove.
> -		 */
> -		smp_rmb();
> -		if (revmap_trees_allocated < 1)
> -			break;
> -		mutex_lock(&revmap_trees_mutex);
> -		radix_tree_delete(&host->revmap_data.tree, hwirq);
> -		mutex_unlock(&revmap_trees_mutex);
> -		break;
> -	}
> -
> -	/* Destroy map */
> -	smp_mb();
> -	irq_map[virq].hwirq = host->inval_irq;
> -
> -	/* Set some flags */
> -	irq_to_desc(virq)->status |= IRQ_NOREQUEST;
> -
> -	/* Free it */
> -	irq_free_virt(virq, 1);
> -}
> -EXPORT_SYMBOL_GPL(irq_dispose_mapping);
> -
> -unsigned int irq_find_mapping(struct irq_host *host,
> -			      irq_hw_number_t hwirq)
> -{
> -	unsigned int i;
> -	unsigned int hint = hwirq % irq_virq_count;
> -
> -	/* Look for default host if necessary */
> -	if (host == NULL)
> -		host = irq_default_host;
> -	if (host == NULL)
> -		return NO_IRQ;
> -
> -	/* legacy -> bail early */
> -	if (host->revmap_type == IRQ_HOST_MAP_LEGACY)
> -		return hwirq;
> -
> -	/* Slow path does a linear search of the map */
> -	if (hint < NUM_ISA_INTERRUPTS)
> -		hint = NUM_ISA_INTERRUPTS;
> -	i = hint;
> -	do  {
> -		if (irq_map[i].host == host &&
> -		    irq_map[i].hwirq == hwirq)
> -			return i;
> -		i++;
> -		if (i >= irq_virq_count)
> -			i = NUM_ISA_INTERRUPTS;
> -	} while(i != hint);
> -	return NO_IRQ;
> -}
> -EXPORT_SYMBOL_GPL(irq_find_mapping);
> -
> -
> -unsigned int irq_radix_revmap_lookup(struct irq_host *host,
> -				     irq_hw_number_t hwirq)
> -{
> -	struct irq_map_entry *ptr;
> -	unsigned int virq;
> -
> -	WARN_ON(host->revmap_type != IRQ_HOST_MAP_TREE);
> -
> -	/*
> -	 * Check if the radix tree exists and has been initialized.
> -	 * If not, we fallback to slow mode
> -	 */
> -	if (revmap_trees_allocated < 2)
> -		return irq_find_mapping(host, hwirq);
> -
> -	/* Now try to resolve */
> -	/*
> -	 * No rcu_read_lock(ing) needed, the ptr returned can't go under us
> -	 * as it's referencing an entry in the static irq_map table.
> -	 */
> -	ptr = radix_tree_lookup(&host->revmap_data.tree, hwirq);
> -
> -	/*
> -	 * If found in radix tree, then fine.
> -	 * Else fallback to linear lookup - this should not happen in practice
> -	 * as it means that we failed to insert the node in the radix tree.
> -	 */
> -	if (ptr)
> -		virq = ptr - irq_map;
> -	else
> -		virq = irq_find_mapping(host, hwirq);
> -
> -	return virq;
> -}
> -
> -void irq_radix_revmap_insert(struct irq_host *host, unsigned int virq,
> -			     irq_hw_number_t hwirq)
> -{
> -
> -	WARN_ON(host->revmap_type != IRQ_HOST_MAP_TREE);
> -
> -	/*
> -	 * Check if the radix tree exists yet.
> -	 * If not, then the irq will be inserted into the tree when it gets
> -	 * initialized.
> -	 */
> -	smp_rmb();
> -	if (revmap_trees_allocated < 1)
> -		return;
> -
> -	if (virq != NO_IRQ) {
> -		mutex_lock(&revmap_trees_mutex);
> -		radix_tree_insert(&host->revmap_data.tree, hwirq,
> -				  &irq_map[virq]);
> -		mutex_unlock(&revmap_trees_mutex);
> -	}
> -}
> -
> -unsigned int irq_linear_revmap(struct irq_host *host,
> -			       irq_hw_number_t hwirq)
> -{
> -	unsigned int *revmap;
> -
> -	WARN_ON(host->revmap_type != IRQ_HOST_MAP_LINEAR);
> -
> -	/* Check revmap bounds */
> -	if (unlikely(hwirq >= host->revmap_data.linear.size))
> -		return irq_find_mapping(host, hwirq);
> -
> -	/* Check if revmap was allocated */
> -	revmap = host->revmap_data.linear.revmap;
> -	if (unlikely(revmap == NULL))
> -		return irq_find_mapping(host, hwirq);
> -
> -	/* Fill up revmap with slow path if no mapping found */
> -	if (unlikely(revmap[hwirq] == NO_IRQ))
> -		revmap[hwirq] = irq_find_mapping(host, hwirq);
> -
> -	return revmap[hwirq];
> -}
> -
> -unsigned int irq_alloc_virt(struct irq_host *host,
> -			    unsigned int count,
> -			    unsigned int hint)
> -{
> -	unsigned long flags;
> -	unsigned int i, j, found = NO_IRQ;
> -
> -	if (count == 0 || count > (irq_virq_count - NUM_ISA_INTERRUPTS))
> -		return NO_IRQ;
> -
> -	raw_spin_lock_irqsave(&irq_big_lock, flags);
> -
> -	/* Use hint for 1 interrupt if any */
> -	if (count == 1 && hint >= NUM_ISA_INTERRUPTS &&
> -	    hint < irq_virq_count && irq_map[hint].host == NULL) {
> -		found = hint;
> -		goto hint_found;
> -	}
> -
> -	/* Look for count consecutive numbers in the allocatable
> -	 * (non-legacy) space
> -	 */
> -	for (i = NUM_ISA_INTERRUPTS, j = 0; i < irq_virq_count; i++) {
> -		if (irq_map[i].host != NULL)
> -			j = 0;
> -		else
> -			j++;
> -
> -		if (j == count) {
> -			found = i - count + 1;
> -			break;
> -		}
> -	}
> -	if (found == NO_IRQ) {
> -		raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> -		return NO_IRQ;
> -	}
> - hint_found:
> -	for (i = found; i < (found + count); i++) {
> -		irq_map[i].hwirq = host->inval_irq;
> -		smp_wmb();
> -		irq_map[i].host = host;
> -	}
> -	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> -	return found;
> -}
> -
> -void irq_free_virt(unsigned int virq, unsigned int count)
> -{
> -	unsigned long flags;
> -	unsigned int i;
> -
> -	WARN_ON (virq < NUM_ISA_INTERRUPTS);
> -	WARN_ON (count == 0 || (virq + count) > irq_virq_count);
> -
> -	raw_spin_lock_irqsave(&irq_big_lock, flags);
> -	for (i = virq; i < (virq + count); i++) {
> -		struct irq_host *host;
> -
> -		if (i < NUM_ISA_INTERRUPTS ||
> -		    (virq + count) > irq_virq_count)
> -			continue;
> -
> -		host = irq_map[i].host;
> -		irq_map[i].hwirq = host->inval_irq;
> -		smp_wmb();
> -		irq_map[i].host = NULL;
> -	}
> -	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> -}
> -
>  int arch_early_irq_init(void)
>  {
>  	struct irq_desc *desc;
> @@ -1090,118 +543,6 @@ int arch_init_chip_data(struct irq_desc *desc, int node)
>  	return 0;
>  }
> 
> -/* We need to create the radix trees late */
> -static int irq_late_init(void)
> -{
> -	struct irq_host *h;
> -	unsigned int i;
> -
> -	/*
> -	 * No mutual exclusion with respect to accessors of the tree is needed
> -	 * here as the synchronization is done via the state variable
> -	 * revmap_trees_allocated.
> -	 */
> -	list_for_each_entry(h, &irq_hosts, link) {
> -		if (h->revmap_type == IRQ_HOST_MAP_TREE)
> -			INIT_RADIX_TREE(&h->revmap_data.tree, GFP_KERNEL);
> -	}
> -
> -	/*
> -	 * Make sure the radix trees inits are visible before setting
> -	 * the flag
> -	 */
> -	smp_wmb();
> -	revmap_trees_allocated = 1;
> -
> -	/*
> -	 * Insert the reverse mapping for those interrupts already present
> -	 * in irq_map[].
> -	 */
> -	mutex_lock(&revmap_trees_mutex);
> -	for (i = 0; i < irq_virq_count; i++) {
> -		if (irq_map[i].host &&
> -		    (irq_map[i].host->revmap_type == IRQ_HOST_MAP_TREE))
> -			radix_tree_insert(&irq_map[i].host->revmap_data.tree,
> -					  irq_map[i].hwirq, &irq_map[i]);
> -	}
> -	mutex_unlock(&revmap_trees_mutex);
> -
> -	/*
> -	 * Make sure the radix trees insertions are visible before setting
> -	 * the flag
> -	 */
> -	smp_wmb();
> -	revmap_trees_allocated = 2;
> -
> -	return 0;
> -}
> -arch_initcall(irq_late_init);
> -
> -#ifdef CONFIG_VIRQ_DEBUG
> -static int virq_debug_show(struct seq_file *m, void *private)
> -{
> -	unsigned long flags;
> -	struct irq_desc *desc;
> -	const char *p;
> -	char none[] = "none";
> -	int i;
> -
> -	seq_printf(m, "%-5s  %-7s  %-15s  %s\n", "virq", "hwirq",
> -		      "chip name", "host name");
> -
> -	for (i = 1; i < nr_irqs; i++) {
> -		desc = irq_to_desc(i);
> -		if (!desc)
> -			continue;
> -
> -		raw_spin_lock_irqsave(&desc->lock, flags);
> -
> -		if (desc->action && desc->action->handler) {
> -			seq_printf(m, "%5d  ", i);
> -			seq_printf(m, "0x%05lx  ", virq_to_hw(i));
> -
> -			if (desc->chip && desc->chip->name)
> -				p = desc->chip->name;
> -			else
> -				p = none;
> -			seq_printf(m, "%-15s  ", p);
> -
> -			if (irq_map[i].host && irq_map[i].host->of_node)
> -				p = irq_map[i].host->of_node->full_name;
> -			else
> -				p = none;
> -			seq_printf(m, "%s\n", p);
> -		}
> -
> -		raw_spin_unlock_irqrestore(&desc->lock, flags);
> -	}
> -
> -	return 0;
> -}
> -
> -static int virq_debug_open(struct inode *inode, struct file *file)
> -{
> -	return single_open(file, virq_debug_show, inode->i_private);
> -}
> -
> -static const struct file_operations virq_debug_fops = {
> -	.open = virq_debug_open,
> -	.read = seq_read,
> -	.llseek = seq_lseek,
> -	.release = single_release,
> -};
> -
> -static int __init irq_debugfs_init(void)
> -{
> -	if (debugfs_create_file("virq_mapping", S_IRUGO, powerpc_debugfs_root,
> -				 NULL, &virq_debug_fops) == NULL)
> -		return -ENOMEM;
> -
> -	return 0;
> -}
> -__initcall(irq_debugfs_init);
> -#endif /* CONFIG_VIRQ_DEBUG */
> -
>  #ifdef CONFIG_PPC64
>  static int __init setup_noirqdistrib(char *str)
>  {
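
For readers following the removed allocator above: a minimal user-space sketch of the irq_alloc_virt() search, assuming a small fixed table and hypothetical sketch_* names (none of them are in the patch). It leaves out irq_big_lock and the smp_wmb() ordering and only shows the hint-then-scan logic.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_NR_IRQS 64u /* hypothetical table size */
#define SKETCH_NUM_ISA 16u /* legacy range, never handed out here */
#define SKETCH_NO_IRQ  0u

/* owner[i] == NULL means virq i is free (mirrors irq_map[i].host == NULL). */
static const void *sketch_owner[SKETCH_NR_IRQS];

/* Returns the first virq of a run of 'count' free entries, or SKETCH_NO_IRQ. */
static unsigned int sketch_alloc_virt(const void *host, unsigned int count,
				      unsigned int hint)
{
	unsigned int i, run = 0, found = SKETCH_NO_IRQ;

	assert(host != NULL);
	if (count == 0 || count > SKETCH_NR_IRQS - SKETCH_NUM_ISA)
		return SKETCH_NO_IRQ;

	if (count == 1 && hint >= SKETCH_NUM_ISA && hint < SKETCH_NR_IRQS &&
	    sketch_owner[hint] == NULL) {
		/* A single interrupt takes the hinted slot when it is usable. */
		found = hint;
	} else {
		/* Otherwise look for 'count' consecutive free slots. */
		for (i = SKETCH_NUM_ISA; i < SKETCH_NR_IRQS; i++) {
			run = (sketch_owner[i] != NULL) ? 0 : run + 1;
			if (run == count) {
				found = i - count + 1;
				break;
			}
		}
	}
	if (found == SKETCH_NO_IRQ)
		return SKETCH_NO_IRQ;

	/* Claim the whole range for this host. */
	for (i = found; i < found + count; i++)
		sketch_owner[i] = host;
	return found;
}
```

With an empty table, a single-interrupt request with hint 20 gets virq 20; a second request with the same hint falls back to the scan and gets the first free slot, 16.
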
> diff --git a/include/linux/virq.h b/include/linux/virq.h
> new file mode 100644
> index 0000000..06035ef
> --- /dev/null
> +++ b/include/linux/virq.h
> @@ -0,0 +1,302 @@
> +/*
> + * Virtual IRQ infrastructure
> + *
> + * Virtual IRQs provide support for dynamically allocating ranges of IRQ
> + * numbers for use by interrupt controllers.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of version 2 of the GNU General Public
> + * License as published by the Free Software Foundation.
> + */
> +
> +
> +#ifdef __KERNEL__
> +#ifndef _LINUX_VIRQ_H
> +#define _LINUX_VIRQ_H
> +
> +#include <asm/irq.h>
> +
> +#ifdef CONFIG_VIRQ
> +
> +/* Define a way to iterate across irqs. */
> +#define for_each_irq(i) \
> +	for ((i) = 0; (i) < NR_IRQS; ++(i))
> +
> +/* This type is the placeholder for a hardware interrupt number. It has to
> + * be big enough to enclose whatever representation is used by a given
> + * platform.
> + */
> +typedef unsigned long irq_hw_number_t;
> +
> +/* Interrupt controller "host" data structure. This could be defined as an
> + * irq domain controller. That is, it handles the mapping between hardware
> + * and virtual interrupt numbers for a given interrupt domain. The host
> + * structure is generally created by the PIC code for a given PIC instance
> + * (though a host can cover more than one PIC if they have a flat number
> + * model). It's the host callbacks that are responsible for setting the
> + * irq_chip on a given irq_desc after it's been mapped.
> + *
> + * The host code and data structures are fairly agnostic to the fact that
> + * we use an open firmware device-tree. We do have references to struct
> + * device_node in two places: in irq_find_host() to find the host matching
> + * a given interrupt controller node, and of course as an argument to its
> + * counterpart host->ops->match() callback. However, those are treated as
> + * generic pointers by the core and the fact that it's actually a device-node
> + * pointer is purely a convention between callers and implementation. This
> + * code could thus be used on other architectures by replacing those two
> + * by some sort of arch-specific void * "token" used to identify interrupt
> + * controllers.
> + */
> +struct irq_host;
> +struct radix_tree_root;
> +struct device_node;
> +
> +/**
> + * struct irq_host_ops - operations for managing per-domain hw irq numbers
> + *
> + * Functions below are provided by the host and called whenever a new mapping
> + * is created or an old mapping is disposed. The host can then proceed to
> + * whatever internal data structures management is required. It also needs
> + * to setup the irq_desc when returning from map().
> + */
> +struct irq_host_ops {
> +	/* Match an interrupt controller device node to a host, returns
> +	 * 1 on a match
> +	 */
> +	int (*match)(struct irq_host *h, struct device_node *node);
> +
> +	/* Create or update a mapping between a virtual irq number and a hw
> +	 * irq number. This is called only once for a given mapping.
> +	 */
> +	int (*map)(struct irq_host *h, unsigned int virq, irq_hw_number_t hw);
> +
> +	/* Dispose of such a mapping */
> +	void (*unmap)(struct irq_host *h, unsigned int virq);
> +
> +	/* Update of such a mapping  */
> +	void (*remap)(struct irq_host *h, unsigned int virq, irq_hw_number_t hw);
> +
> +	/* Translate device-tree interrupt specifier from raw format coming
> +	 * from the firmware to a irq_hw_number_t (interrupt line number) and
> +	 * type (sense) that can be passed to set_irq_type(). In the absence
> +	 * of this callback, irq_create_of_mapping() and irq_of_parse_and_map()
> +	 * will return the hw number in the first cell and IRQ_TYPE_NONE for
> +	 * the type (which amount to keeping whatever default value the
> +	 * interrupt controller has for that line)
> +	 */
> +	int (*xlate)(struct irq_host *h, struct device_node *ctrler,
> +		     const u32 *intspec, unsigned int intsize,
> +		     irq_hw_number_t *out_hwirq, unsigned int *out_type);
> +};
> +
> +/**
> + * struct irq_host - a single irq domain. maps hw irq numbers to Linux irq.
> + * @link: entry in global irq_host list
> + * @revmap_type: Method of reverse mapping hwirq to Linux irq number
> + * @revmap_data: reverse map data
> + * @ops: irq domain operations (documented above)
> + * @host_data: irq controller driver data; core does not touch this pointer
> + * @inval_irq: hw irq number used for unassigned virqs
> + * @of_node: Optional pointer to the irq controllers device tree node.
> + *
> + * One irq_host is allocated for each range (domain) of Linux irq numbers
> + * allocated.  Typically, one irq_host is allocated per controller, but it
> + * is perfectly valid to manage multiple controllers with a single irq_host
> + * instance if need be.
> + */
> +struct irq_host {
> +	struct list_head	link;
> +
> +	/* type of reverse mapping technique */
> +	unsigned int		revmap_type;
> +#define IRQ_HOST_MAP_LEGACY     0 /* legacy 8259, gets irqs 1..15 */
> +#define IRQ_HOST_MAP_NOMAP	1 /* no fast reverse mapping */
> +#define IRQ_HOST_MAP_LINEAR	2 /* linear map of interrupts */
> +#define IRQ_HOST_MAP_TREE	3 /* radix tree */
> +	union {
> +		struct {
> +			unsigned int size;
> +			unsigned int *revmap;
> +		} linear;
> +		struct radix_tree_root tree;
> +	} revmap_data;
> +	struct irq_host_ops	*ops;
> +	void			*host_data;
> +	irq_hw_number_t		inval_irq;
> +
> +	/* Optional device node pointer */
> +	struct device_node	*of_node;
> +};
> +
> +/**
> + * irq_alloc_host() - Allocate a new irq_host data structure
> + * @of_node: optional device-tree node of the interrupt controller
> + * @revmap_type: type of reverse mapping to use
> + * @revmap_arg: for IRQ_HOST_MAP_LINEAR linear only: size of the map
> + * @ops: map/unmap host callbacks
> + * @inval_irq: provide a hw number in that host space that is always invalid
> + *
> + * Allocates and initializes an irq_host structure. Note that in the case of
> + * IRQ_HOST_MAP_LEGACY, the map() callback will be called before this returns
> + * for all legacy interrupts except 0 (which is always the invalid irq for
> + * a legacy controller). For a IRQ_HOST_MAP_LINEAR, the map is allocated by
> + * this call as well. For a IRQ_HOST_MAP_TREE, the radix tree will be allocated
> + * later during boot automatically (the reverse mapping will use the slow path
> + * until that happens).
> + */
> +extern struct irq_host *irq_alloc_host(struct device_node *of_node,
> +				       unsigned int revmap_type,
> +				       unsigned int revmap_arg,
> +				       struct irq_host_ops *ops,
> +				       irq_hw_number_t inval_irq);
> +
> +/* The main irq map itself is an array of NR_IRQ entries containing the
> + * associated host and irq number. An entry with a host of NULL is free.
> + * An entry can be allocated if it's free, the allocator always then sets
> + * hwirq first to the host's invalid irq number and then fills host.
> + */
> +struct irq_map_entry {
> +	irq_hw_number_t	hwirq;
> +	struct irq_host	*host;
> +};
> +extern struct irq_map_entry irq_map[NR_IRQS];
> +
> +extern irq_hw_number_t virq_to_hw(unsigned int virq);
> +
> +/**
> + * irq_find_host - Locates a host for a given device node
> + * @node: device-tree node of the interrupt controller
> + */
> +extern struct irq_host *irq_find_host(struct device_node *node);
> +
> +/**
> + * irq_set_default_host - Set a "default" host
> + * @host: default host pointer
> + *
> + * For convenience, it's possible to set a "default" host that will be used
> + * whenever NULL is passed to irq_create_mapping(). It makes life easier for
> + * platforms that want to manipulate a few hard coded interrupt numbers that
> + * aren't properly represented in the device-tree.
> + */
> +extern void irq_set_default_host(struct irq_host *host);
> +
> +/**
> + * irq_set_virq_count - Set the maximum number of virt irqs
> + * @count: number of linux virtual irqs, capped with NR_IRQS
> + *
> + * This is mainly for use by platforms like iSeries who want to program
> + * the virtual irq number in the controller to avoid the reverse mapping
> + */
> +extern void irq_set_virq_count(unsigned int count);
> +
> +/**
> + * irq_create_mapping - Map a hardware interrupt into linux virq space
> + * @host: host owning this hardware interrupt or NULL for default host
> + * @hwirq: hardware irq number in that host space
> + *
> + * Only one mapping per hardware interrupt is permitted. Returns a linux
> + * virq number.
> + * If the sense/trigger is to be specified, set_irq_type() should be called
> + * on the number returned from that call.
> + */
> +extern unsigned int irq_create_mapping(struct irq_host *host,
> +				       irq_hw_number_t hwirq);
> +
> +/**
> + * irq_dispose_mapping - Unmap an interrupt
> + * @virq: linux virq number of the interrupt to unmap
> + */
> +extern void irq_dispose_mapping(unsigned int virq);
> +
> +/**
> + * irq_find_mapping - Find a linux virq from an hw irq number.
> + * @host: host owning this hardware interrupt
> + * @hwirq: hardware irq number in that host space
> + *
> + * This is a slow path, for use by generic code. It's expected that an
> + * irq controller implementation directly calls the appropriate low level
> + * mapping function.
> + */
> +extern unsigned int irq_find_mapping(struct irq_host *host,
> +				     irq_hw_number_t hwirq);
> +
> +/**
> + * irq_create_direct_mapping - Allocate a virq for direct mapping
> + * @host: host to allocate the virq for or NULL for default host
> + *
> + * This routine is used for irq controllers which can choose the hardware
> + * interrupt numbers they generate. In such a case it's simplest to use
> + * the linux virq as the hardware interrupt number.
> + */
> +extern unsigned int irq_create_direct_mapping(struct irq_host *host);
> +
> +/**
> + * irq_radix_revmap_insert - Insert a hw irq to linux virq number mapping.
> + * @host: host owning this hardware interrupt
> + * @virq: linux irq number
> + * @hwirq: hardware irq number in that host space
> + *
> + * This is for use by irq controllers that use a radix tree reverse
> + * mapping for fast lookup.
> + */
> +extern void irq_radix_revmap_insert(struct irq_host *host, unsigned int virq,
> +				    irq_hw_number_t hwirq);
> +
> +/**
> + * irq_radix_revmap_lookup - Find a linux virq from a hw irq number.
> + * @host: host owning this hardware interrupt
> + * @hwirq: hardware irq number in that host space
> + *
> + * This is a fast path, for use by irq controller code that uses radix tree
> + * revmaps
> + */
> +extern unsigned int irq_radix_revmap_lookup(struct irq_host *host,
> +					    irq_hw_number_t hwirq);
> +
> +/**
> + * irq_linear_revmap - Find a linux virq from a hw irq number.
> + * @host: host owning this hardware interrupt
> + * @hwirq: hardware irq number in that host space
> + *
> + * This is a fast path, for use by irq controller code that uses linear
> + * revmaps. It does fallback to the slow path if the revmap doesn't exist
> + * yet and will create the revmap entry with appropriate locking
> + */
> +
> +extern unsigned int irq_linear_revmap(struct irq_host *host,
> +				      irq_hw_number_t hwirq);
> +
> +
> +
> +/**
> + * irq_alloc_virt - Allocate virtual irq numbers
> + * @host: host owning these new virtual irqs
> + * @count: number of consecutive numbers to allocate
> + * @hint: pass a hint number, the allocator will try to use a 1:1 mapping
> + *
> + * This is a low level function that is used internally by irq_create_mapping()
> + * and that can be used by some irq controllers implementations for things
> + * like allocating ranges of numbers for MSIs. The revmaps are left untouched.
> + */
> +extern unsigned int irq_alloc_virt(struct irq_host *host,
> +				   unsigned int count,
> +				   unsigned int hint);
> +
> +/**
> + * irq_free_virt - Free virtual irq numbers
> + * @virq: virtual irq number of the first interrupt to free
> + * @count: number of interrupts to free
> + *
> + * This function is the opposite of irq_alloc_virt. It will not clear reverse
> + * maps, this should be done previously by unmap'ing the interrupt. In fact,
> + * all interrupts covered by the range being freed should have been unmapped
> + * prior to calling this.
> + */
> +extern void irq_free_virt(unsigned int virq, unsigned int count);
> +
> +
> +#endif /* CONFIG_VIRQ */
> +
> +#endif /* _LINUX_VIRQ_H */
> +#endif /* __KERNEL__ */
> +
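
To make the irq_linear_revmap() contract documented above concrete: a minimal user-space sketch, assuming a tiny fixed-size cache and hypothetical sk_* names that are not part of the patch. It shows only the "fast array lookup, fall back to the slow scan and cache the result" behaviour and ignores locking and memory barriers.

```c
#include <assert.h>
#include <stddef.h>

#define SK_NR_IRQS 32u
#define SK_NO_IRQ  0u

/* Forward table: virq -> (hwirq, host), like irq_map[] in the patch. */
struct sk_map_entry {
	unsigned long hwirq;
	const void *host;
};
static struct sk_map_entry sk_map[SK_NR_IRQS];

/* Hypothetical host with a small linear reverse map. */
struct sk_linear_host {
	unsigned int size;      /* number of valid revmap slots */
	unsigned int revmap[8]; /* hwirq -> virq, SK_NO_IRQ if not cached */
};

/* Slow path: linear scan of the forward table. */
static unsigned int sk_find_mapping(const struct sk_linear_host *h,
				    unsigned long hwirq)
{
	unsigned int i;

	for (i = 1; i < SK_NR_IRQS; i++)
		if (sk_map[i].host == h && sk_map[i].hwirq == hwirq)
			return i;
	return SK_NO_IRQ;
}

/* Fast path: index the revmap, filling it from the slow path on a miss. */
static unsigned int sk_linear_revmap(struct sk_linear_host *h,
				     unsigned long hwirq)
{
	assert(h != NULL);
	if (hwirq >= h->size)
		return sk_find_mapping(h, hwirq); /* outside the cache */
	if (h->revmap[hwirq] == SK_NO_IRQ)
		h->revmap[hwirq] = sk_find_mapping(h, hwirq);
	return h->revmap[hwirq];
}
```

The first lookup of a given hwirq pays for the scan; later lookups of the same hwirq are a single array read. A hwirq beyond the revmap size always takes the slow path.
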
> diff --git a/kernel/irq/Makefile b/kernel/irq/Makefile
> index 7d04780..f5207dc 100644
> --- a/kernel/irq/Makefile
> +++ b/kernel/irq/Makefile
> @@ -1,5 +1,6 @@
> 
>  obj-y := handle.o manage.o spurious.o resend.o chip.o devres.o
> +obj-$(CONFIG_VIRQ) += virq.o
>  obj-$(CONFIG_GENERIC_IRQ_PROBE) += autoprobe.o
>  obj-$(CONFIG_PROC_FS) += proc.o
>  obj-$(CONFIG_GENERIC_PENDING_IRQ) += migration.o
> diff --git a/kernel/irq/virq.c b/kernel/irq/virq.c
> new file mode 100644
> index 0000000..b3c0db3
> --- /dev/null
> +++ b/kernel/irq/virq.c
> @@ -0,0 +1,687 @@
> +/*
> + * Mapping support from per-controller hw irq numbers to linux irqs
> + *
> + *  Derived from arch/i386/kernel/irq.c
> + *    Copyright (C) 1992 Linus Torvalds
> + *  Adapted from arch/i386 by Gary Thomas
> + *    Copyright (C) 1995-1996 Gary Thomas (gdt@linuxppc.org)
> + *  Updated and modified by Cort Dougan <cort@fsmlabs.com>
> + *    Copyright (C) 1996-2001 Cort Dougan
> + *  Adapted for Power Macintosh by Paul Mackerras
> + *    Copyright (C) 1996 Paul Mackerras (paulus@cs.anu.edu.au)
> + *  Generalized for virtual irq mapping on all platforms by Grant Likely
> + *    Copyright (C) 2010 Secret Lab Technologies Ltd.
> + *
> + * This program is free software; you can redistribute it and/or
> + * modify it under the terms of the GNU General Public License
> + * as published by the Free Software Foundation; either version
> + * 2 of the License, or (at your option) any later version.
> + */
> +
> +#include <linux/interrupt.h>
> +#include <linux/irq.h>
> +#include <linux/module.h>
> +#include <linux/mutex.h>
> +#include <linux/slab.h>
> +#include <linux/radix-tree.h>
> +#include <linux/virq.h>
> +#include <linux/of_irq.h>
> +
> +/*
> + * IRQ controller and virtual interrupts
> + */
> +static LIST_HEAD(irq_hosts);
> +static DEFINE_RAW_SPINLOCK(irq_big_lock);
> +static unsigned int revmap_trees_allocated;
> +static DEFINE_MUTEX(revmap_trees_mutex);
> +struct irq_map_entry irq_map[NR_IRQS];
> +static unsigned int irq_virq_count = NR_IRQS;
> +static struct irq_host *irq_default_host;
> +
> +irq_hw_number_t virq_to_hw(unsigned int virq)
> +{
> +	return irq_map[virq].hwirq;
> +}
> +EXPORT_SYMBOL_GPL(virq_to_hw);
> +
> +static int default_irq_host_match(struct irq_host *h, struct device_node *np)
> +{
> +	return h->of_node != NULL && h->of_node == np;
> +}
> +
> +struct irq_host *irq_alloc_host(struct device_node *of_node,
> +				unsigned int revmap_type,
> +				unsigned int revmap_arg,
> +				struct irq_host_ops *ops,
> +				irq_hw_number_t inval_irq)
> +{
> +	struct irq_host *host;
> +	unsigned int size = sizeof(struct irq_host);
> +	unsigned int i;
> +	unsigned int *rmap;
> +	unsigned long flags;
> +
> +	/* Allocate structure and revmap table if using linear mapping */
> +	if (revmap_type == IRQ_HOST_MAP_LINEAR)
> +		size += revmap_arg * sizeof(unsigned int);
> +	host = zalloc_maybe_bootmem(size, GFP_KERNEL);
> +	if (host == NULL)
> +		return NULL;
> +
> +	/* Fill structure */
> +	host->revmap_type = revmap_type;
> +	host->inval_irq = inval_irq;
> +	host->ops = ops;
> +	host->of_node = of_node_get(of_node);
> +
> +	if (host->ops->match == NULL)
> +		host->ops->match = default_irq_host_match;
> +
> +	raw_spin_lock_irqsave(&irq_big_lock, flags);
> +
> +	/* If it's a legacy controller, check for duplicates and
> +	 * mark it as allocated (we use irq 0 host pointer for that)
> +	 */
> +	if (revmap_type == IRQ_HOST_MAP_LEGACY) {
> +		if (irq_map[0].host != NULL) {
> +			raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> +			/* If we are early boot, we can't free the structure,
> +			 * too bad...
> +			 * this will be fixed once slab is made available early
> +			 * instead of the current cruft
> +			 */
> +			if (mem_init_done)
> +				kfree(host);
> +			return NULL;
> +		}
> +		irq_map[0].host = host;
> +	}
> +
> +	list_add(&host->link, &irq_hosts);
> +	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> +
> +	/* Additional setups per revmap type */
> +	switch(revmap_type) {
> +	case IRQ_HOST_MAP_LEGACY:
> +		/* 0 is always the invalid number for legacy */
> +		host->inval_irq = 0;
> +		/* setup us as the host for all legacy interrupts */
> +		for (i = 1; i < NUM_ISA_INTERRUPTS; i++) {
> +			irq_map[i].hwirq = i;
> +			smp_wmb();
> +			irq_map[i].host = host;
> +			smp_wmb();
> +
> +			/* Clear norequest flags */
> +			irq_to_desc(i)->status &= ~IRQ_NOREQUEST;
> +
> +			/* Legacy flags are left to default at this point,
> +			 * one can then use irq_create_mapping() to
> +			 * explicitly change them
> +			 */
> +			ops->map(host, i, i);
> +		}
> +		break;
> +	case IRQ_HOST_MAP_LINEAR:
> +		rmap = (unsigned int *)(host + 1);
> +		for (i = 0; i < revmap_arg; i++)
> +			rmap[i] = NO_IRQ;
> +		host->revmap_data.linear.size = revmap_arg;
> +		smp_wmb();
> +		host->revmap_data.linear.revmap = rmap;
> +		break;
> +	default:
> +		break;
> +	}
> +
> +	pr_debug("irq: Allocated host of type %d @0x%p\n", revmap_type, host);
> +
> +	return host;
> +}
> +
> +struct irq_host *irq_find_host(struct device_node *node)
> +{
> +	struct irq_host *h, *found = NULL;
> +	unsigned long flags;
> +
> +	/* We might want to match the legacy controller last since
> +	 * it might potentially be set to match all interrupts in
> +	 * the absence of a device node. This isn't a problem so far
> +	 * yet though...
> +	 */
> +	raw_spin_lock_irqsave(&irq_big_lock, flags);
> +	list_for_each_entry(h, &irq_hosts, link)
> +		if (h->ops->match(h, node)) {
> +			found = h;
> +			break;
> +		}
> +	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> +	return found;
> +}
> +EXPORT_SYMBOL_GPL(irq_find_host);
> +
> +void irq_set_default_host(struct irq_host *host)
> +{
> +	pr_debug("irq: Default host set to @0x%p\n", host);
> +
> +	irq_default_host = host;
> +}
> +
> +void irq_set_virq_count(unsigned int count)
> +{
> +	pr_debug("irq: Trying to set virq count to %d\n", count);
> +
> +	BUG_ON(count < NUM_ISA_INTERRUPTS);
> +	if (count < NR_IRQS)
> +		irq_virq_count = count;
> +}
> +
> +static int irq_setup_virq(struct irq_host *host, unsigned int virq,
> +			    irq_hw_number_t hwirq)
> +{
> +	struct irq_desc *desc;
> +
> +	desc = irq_to_desc_alloc_node(virq, 0);
> +	if (!desc) {
> +		pr_debug("irq: -> allocating desc failed\n");
> +		goto error;
> +	}
> +
> +	/* Clear IRQ_NOREQUEST flag */
> +	desc->status &= ~IRQ_NOREQUEST;
> +
> +	/* map it */
> +	smp_wmb();
> +	irq_map[virq].hwirq = hwirq;
> +	smp_mb();
> +
> +	if (host->ops->map(host, virq, hwirq)) {
> +		pr_debug("irq: -> mapping failed, freeing\n");
> +		goto error;
> +	}
> +
> +	return 0;
> +
> +error:
> +	irq_free_virt(virq, 1);
> +	return -1;
> +}
> +
> +unsigned int irq_create_direct_mapping(struct irq_host *host)
> +{
> +	unsigned int virq;
> +
> +	if (host == NULL)
> +		host = irq_default_host;
> +
> +	BUG_ON(host == NULL);
> +	WARN_ON(host->revmap_type != IRQ_HOST_MAP_NOMAP);
> +
> +	virq = irq_alloc_virt(host, 1, 0);
> +	if (virq == NO_IRQ) {
> +		pr_debug("irq: create_direct virq allocation failed\n");
> +		return NO_IRQ;
> +	}
> +
> +	pr_debug("irq: create_direct obtained virq %d\n", virq);
> +
> +	if (irq_setup_virq(host, virq, virq))
> +		return NO_IRQ;
> +
> +	return virq;
> +}
> +
> +unsigned int irq_create_mapping(struct irq_host *host,
> +				irq_hw_number_t hwirq)
> +{
> +	unsigned int virq, hint;
> +
> +	pr_debug("irq: irq_create_mapping(0x%p, 0x%lx)\n", host, hwirq);
> +
> +	/* Look for default host if necessary */
> +	if (host == NULL)
> +		host = irq_default_host;
> +	if (host == NULL) {
> +		printk(KERN_WARNING "irq_create_mapping called for"
> +		       " NULL host, hwirq=%lx\n", hwirq);
> +		WARN_ON(1);
> +		return NO_IRQ;
> +	}
> +	pr_debug("irq: -> using host @%p\n", host);
> +
> +	/* Check if mapping already exist, if it does, call
> +	 * host->ops->map() to update the flags
> +	 */
> +	virq = irq_find_mapping(host, hwirq);
> +	if (virq != NO_IRQ) {
> +		if (host->ops->remap)
> +			host->ops->remap(host, virq, hwirq);
> +		pr_debug("irq: -> existing mapping on virq %d\n", virq);
> +		return virq;
> +	}
> +
> +	/* Get a virtual interrupt number */
> +	if (host->revmap_type == IRQ_HOST_MAP_LEGACY) {
> +		/* Handle legacy */
> +		virq = (unsigned int)hwirq;
> +		if (virq == 0 || virq >= NUM_ISA_INTERRUPTS)
> +			return NO_IRQ;
> +		return virq;
> +	} else {
> +		/* Allocate a virtual interrupt number */
> +		hint = hwirq % irq_virq_count;
> +		virq = irq_alloc_virt(host, 1, hint);
> +		if (virq == NO_IRQ) {
> +			pr_debug("irq: -> virq allocation failed\n");
> +			return NO_IRQ;
> +		}
> +	}
> +
> +	if (irq_setup_virq(host, virq, hwirq))
> +		return NO_IRQ;
> +
> +	printk(KERN_DEBUG "irq: irq %lu on host %s mapped to virtual irq %u\n",
> +		hwirq, host->of_node ? host->of_node->full_name : "null", virq);
> +
> +	return virq;
> +}
> +EXPORT_SYMBOL_GPL(irq_create_mapping);
> +
> +unsigned int irq_create_of_mapping(struct device_node *controller,
> +				   const u32 *intspec, unsigned int intsize)
> +{
> +	struct irq_host *host;
> +	irq_hw_number_t hwirq;
> +	unsigned int type = IRQ_TYPE_NONE;
> +	unsigned int virq;
> +
> +	if (controller == NULL)
> +		host = irq_default_host;
> +	else
> +		host = irq_find_host(controller);
> +	if (host == NULL) {
> +		printk(KERN_WARNING "irq: no irq host found for %s !\n",
> +		       controller->full_name);
> +		return NO_IRQ;
> +	}
> +
> +	/* If host has no translation, then we assume interrupt line */
> +	if (host->ops->xlate == NULL)
> +		hwirq = intspec[0];
> +	else {
> +		if (host->ops->xlate(host, controller, intspec, intsize,
> +				     &hwirq, &type))
> +			return NO_IRQ;
> +	}
> +
> +	/* Create mapping */
> +	virq = irq_create_mapping(host, hwirq);
> +	if (virq == NO_IRQ)
> +		return virq;
> +
> +	/* Set type if specified and different than the current one */
> +	if (type != IRQ_TYPE_NONE &&
> +	    type != (irq_to_desc(virq)->status & IRQF_TRIGGER_MASK))
> +		set_irq_type(virq, type);
> +	return virq;
> +}
> +EXPORT_SYMBOL_GPL(irq_create_of_mapping);
> +
> +void irq_dispose_mapping(unsigned int virq)
> +{
> +	struct irq_host *host;
> +	irq_hw_number_t hwirq;
> +
> +	if (virq == NO_IRQ)
> +		return;
> +
> +	host = irq_map[virq].host;
> +	WARN_ON (host == NULL);
> +	if (host == NULL)
> +		return;
> +
> +	/* Never unmap legacy interrupts */
> +	if (host->revmap_type == IRQ_HOST_MAP_LEGACY)
> +		return;
> +
> +	/* remove chip and handler */
> +	set_irq_chip_and_handler(virq, NULL, NULL);
> +
> +	/* Make sure it's completed */
> +	synchronize_irq(virq);
> +
> +	/* Tell the PIC about it */
> +	if (host->ops->unmap)
> +		host->ops->unmap(host, virq);
> +	smp_mb();
> +
> +	/* Clear reverse map */
> +	hwirq = irq_map[virq].hwirq;
> +	switch(host->revmap_type) {
> +	case IRQ_HOST_MAP_LINEAR:
> +		if (hwirq < host->revmap_data.linear.size)
> +			host->revmap_data.linear.revmap[hwirq] = NO_IRQ;
> +		break;
> +	case IRQ_HOST_MAP_TREE:
> +		/*
> +		 * Check if radix tree allocated yet, if not then nothing to
> +		 * remove.
> +		 */
> +		smp_rmb();
> +		if (revmap_trees_allocated < 1)
> +			break;
> +		mutex_lock(&revmap_trees_mutex);
> +		radix_tree_delete(&host->revmap_data.tree, hwirq);
> +		mutex_unlock(&revmap_trees_mutex);
> +		break;
> +	}
> +
> +	/* Destroy map */
> +	smp_mb();
> +	irq_map[virq].hwirq = host->inval_irq;
> +
> +	/* Set some flags */
> +	irq_to_desc(virq)->status |= IRQ_NOREQUEST;
> +
> +	/* Free it */
> +	irq_free_virt(virq, 1);
> +}
> +EXPORT_SYMBOL_GPL(irq_dispose_mapping);
> +
> +unsigned int irq_find_mapping(struct irq_host *host,
> +			      irq_hw_number_t hwirq)
> +{
> +	unsigned int i;
> +	unsigned int hint = hwirq % irq_virq_count;
> +
> +	/* Look for default host if necessary */
> +	if (host == NULL)
> +		host = irq_default_host;
> +	if (host == NULL)
> +		return NO_IRQ;
> +
> +	/* legacy -> bail early */
> +	if (host->revmap_type == IRQ_HOST_MAP_LEGACY)
> +		return hwirq;
> +
> +	/* Slow path does a linear search of the map */
> +	if (hint < NUM_ISA_INTERRUPTS)
> +		hint = NUM_ISA_INTERRUPTS;
> +	i = hint;
> +	do  {
> +		if (irq_map[i].host == host &&
> +		    irq_map[i].hwirq == hwirq)
> +			return i;
> +		i++;
> +		if (i >= irq_virq_count)
> +			i = NUM_ISA_INTERRUPTS;
> +	} while(i != hint);
> +	return NO_IRQ;
> +}
> +EXPORT_SYMBOL_GPL(irq_find_mapping);
> +
> +
> +unsigned int irq_radix_revmap_lookup(struct irq_host *host,
> +				     irq_hw_number_t hwirq)
> +{
> +	struct irq_map_entry *ptr;
> +	unsigned int virq;
> +
> +	WARN_ON(host->revmap_type != IRQ_HOST_MAP_TREE);
> +
> +	/*
> +	 * Check if the radix tree exists and has been initialized.
> +	 * If not, we fallback to slow mode
> +	 */
> +	if (revmap_trees_allocated < 2)
> +		return irq_find_mapping(host, hwirq);
> +
> +	/* Now try to resolve */
> +	/*
> +	 * No rcu_read_lock(ing) needed, the ptr returned can't go under us
> +	 * as it's referencing an entry in the static irq_map table.
> +	 */
> +	ptr = radix_tree_lookup(&host->revmap_data.tree, hwirq);
> +
> +	/*
> +	 * If found in radix tree, then fine.
> +	 * Else fallback to linear lookup - this should not happen in practice
> +	 * as it means that we failed to insert the node in the radix tree.
> +	 */
> +	if (ptr)
> +		virq = ptr - irq_map;
> +	else
> +		virq = irq_find_mapping(host, hwirq);
> +
> +	return virq;
> +}
> +
> +void irq_radix_revmap_insert(struct irq_host *host, unsigned int virq,
> +			     irq_hw_number_t hwirq)
> +{
> +
> +	WARN_ON(host->revmap_type != IRQ_HOST_MAP_TREE);
> +
> +	/*
> +	 * Check if the radix tree exists yet.
> +	 * If not, then the irq will be inserted into the tree when it gets
> +	 * initialized.
> +	 */
> +	smp_rmb();
> +	if (revmap_trees_allocated < 1)
> +		return;
> +
> +	if (virq != NO_IRQ) {
> +		mutex_lock(&revmap_trees_mutex);
> +		radix_tree_insert(&host->revmap_data.tree, hwirq,
> +				  &irq_map[virq]);
> +		mutex_unlock(&revmap_trees_mutex);
> +	}
> +}
> +
> +unsigned int irq_linear_revmap(struct irq_host *host,
> +			       irq_hw_number_t hwirq)
> +{
> +	unsigned int *revmap;
> +
> +	WARN_ON(host->revmap_type != IRQ_HOST_MAP_LINEAR);
> +
> +	/* Check revmap bounds */
> +	if (unlikely(hwirq >= host->revmap_data.linear.size))
> +		return irq_find_mapping(host, hwirq);
> +
> +	/* Check if revmap was allocated */
> +	revmap = host->revmap_data.linear.revmap;
> +	if (unlikely(revmap == NULL))
> +		return irq_find_mapping(host, hwirq);
> +
> +	/* Fill up revmap with slow path if no mapping found */
> +	if (unlikely(revmap[hwirq] == NO_IRQ))
> +		revmap[hwirq] = irq_find_mapping(host, hwirq);
> +
> +	return revmap[hwirq];
> +}
> +
> +unsigned int irq_alloc_virt(struct irq_host *host,
> +			    unsigned int count,
> +			    unsigned int hint)
> +{
> +	unsigned long flags;
> +	unsigned int i, j, found = NO_IRQ;
> +
> +	if (count == 0 || count > (irq_virq_count - NUM_ISA_INTERRUPTS))
> +		return NO_IRQ;
> +
> +	raw_spin_lock_irqsave(&irq_big_lock, flags);
> +
> +	/* Use hint for 1 interrupt if any */
> +	if (count == 1 && hint >= NUM_ISA_INTERRUPTS &&
> +	    hint < irq_virq_count && irq_map[hint].host == NULL) {
> +		found = hint;
> +		goto hint_found;
> +	}
> +
> +	/* Look for count consecutive numbers in the allocatable
> +	 * (non-legacy) space
> +	 */
> +	for (i = NUM_ISA_INTERRUPTS, j = 0; i < irq_virq_count; i++) {
> +		if (irq_map[i].host != NULL)
> +			j = 0;
> +		else
> +			j++;
> +
> +		if (j == count) {
> +			found = i - count + 1;
> +			break;
> +		}
> +	}
> +	if (found == NO_IRQ) {
> +		raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> +		return NO_IRQ;
> +	}
> + hint_found:
> +	for (i = found; i < (found + count); i++) {
> +		irq_map[i].hwirq = host->inval_irq;
> +		smp_wmb();
> +		irq_map[i].host = host;
> +	}
> +	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> +	return found;
> +}
> +
> +void irq_free_virt(unsigned int virq, unsigned int count)
> +{
> +	unsigned long flags;
> +	unsigned int i;
> +
> +	WARN_ON (virq < NUM_ISA_INTERRUPTS);
> +	WARN_ON (count == 0 || (virq + count) > irq_virq_count);
> +
> +	raw_spin_lock_irqsave(&irq_big_lock, flags);
> +	for (i = virq; i < (virq + count); i++) {
> +		struct irq_host *host;
> +
> +		if (i < NUM_ISA_INTERRUPTS ||
> +		    (virq + count) > irq_virq_count)
> +			continue;
> +
> +		host = irq_map[i].host;
> +		irq_map[i].hwirq = host->inval_irq;
> +		smp_wmb();
> +		irq_map[i].host = NULL;
> +	}
> +	raw_spin_unlock_irqrestore(&irq_big_lock, flags);
> +}
> +
> +/* We need to create the radix trees late */
> +static int irq_late_init(void)
> +{
> +	struct irq_host *h;
> +	unsigned int i;
> +
> +	/*
> +	 * No mutual exclusion with respect to accessors of the tree is needed
> +	 * here as the synchronization is done via the state variable
> +	 * revmap_trees_allocated.
> +	 */
> +	list_for_each_entry(h, &irq_hosts, link) {
> +		if (h->revmap_type == IRQ_HOST_MAP_TREE)
> +			INIT_RADIX_TREE(&h->revmap_data.tree, GFP_KERNEL);
> +	}
> +
> +	/*
> +	 * Make sure the radix trees inits are visible before setting
> +	 * the flag
> +	 */
> +	smp_wmb();
> +	revmap_trees_allocated = 1;
> +
> +	/*
> +	 * Insert the reverse mapping for those interrupts already present
> +	 * in irq_map[].
> +	 */
> +	mutex_lock(&revmap_trees_mutex);
> +	for (i = 0; i < irq_virq_count; i++) {
> +		if (irq_map[i].host &&
> +		    (irq_map[i].host->revmap_type == IRQ_HOST_MAP_TREE))
> +			radix_tree_insert(&irq_map[i].host->revmap_data.tree,
> +					  irq_map[i].hwirq, &irq_map[i]);
> +	}
> +	mutex_unlock(&revmap_trees_mutex);
> +
> +	/*
> +	 * Make sure the radix trees insertions are visible before setting
> +	 * the flag
> +	 */
> +	smp_wmb();
> +	revmap_trees_allocated = 2;
> +
> +	return 0;
> +}
> +arch_initcall(irq_late_init);
> +
> +#ifdef CONFIG_VIRQ_DEBUG
> +static int virq_debug_show(struct seq_file *m, void *private)
> +{
> +	unsigned long flags;
> +	struct irq_desc *desc;
> +	const char *p;
> +	char none[] = "none";
> +	int i;
> +
> +	seq_printf(m, "%-5s  %-7s  %-15s  %s\n", "virq", "hwirq",
> +		      "chip name", "host name");
> +
> +	for (i = 1; i < nr_irqs; i++) {
> +		desc = irq_to_desc(i);
> +		if (!desc)
> +			continue;
> +
> +		raw_spin_lock_irqsave(&desc->lock, flags);
> +
> +		if (desc->action && desc->action->handler) {
> +			seq_printf(m, "%5d  ", i);
> +			seq_printf(m, "0x%05lx  ", virq_to_hw(i));
> +
> +			if (desc->chip && desc->chip->name)
> +				p = desc->chip->name;
> +			else
> +				p = none;
> +			seq_printf(m, "%-15s  ", p);
> +
> +			if (irq_map[i].host && irq_map[i].host->of_node)
> +				p = irq_map[i].host->of_node->full_name;
> +			else
> +				p = none;
> +			seq_printf(m, "%s\n", p);
> +		}
> +
> +		raw_spin_unlock_irqrestore(&desc->lock, flags);
> +	}
> +
> +	return 0;
> +}
> +
> +static int virq_debug_open(struct inode *inode, struct file *file)
> +{
> +	return single_open(file, virq_debug_show, inode->i_private);
> +}
> +
> +static const struct file_operations virq_debug_fops = {
> +	.open = virq_debug_open,
> +	.read = seq_read,
> +	.llseek = seq_lseek,
> +	.release = single_release,
> +};
> +
> +static int __init irq_debugfs_init(void)
> +{
> +	if (debugfs_create_file("virq_mapping", S_IRUGO, powerpc_debugfs_root,
> +				 NULL, &virq_debug_fops) == NULL)
> +		return -ENOMEM;
> +
> +	return 0;
> +}
> +__initcall(irq_debugfs_init);
> +#endif /* CONFIG_VIRQ_DEBUG */
> +
> 

^ permalink raw reply

* Re: Introduce support for little endian PowerPC
From: Michel Dänzer @ 2010-10-01 16:20 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: paulus, linuxppc-dev, linux-kernel, Ian Munsie
In-Reply-To: <1285935283.2463.78.camel@pasglop>

On Fri, 2010-10-01 at 22:14 +1000, Benjamin Herrenschmidt wrote:
>
> Now, the main reasons in practice are anything touching graphics.
>
> There's quite a few IP cores out there for SoCs that don't have HW
> swappers, and -tons- of more or less ugly code that can't deal with non
> native pixel ordering (hell, even Xorg isn't good at it, we really only
> support cards that have HW swappers today).

That's not true. Even the radeon driver doesn't really need the HW
swappers anymore with KMS.


> There's an even bigger pile of application code that deals with graphics
> without any regard for endianness and is essentially unfixable.

Out of curiosity, what kind of APIs are those apps using? X11 and OpenGL
have well-defined semantics wrt endianness, allowing the drivers to
handle any necessary byte swapping internally, and IME the vast majority
of apps handle this correctly.


-- 
Earthling Michel Dänzer           |                http://www.vmware.com
Libre software enthusiast         |          Debian, X and DRI developer

^ permalink raw reply

* Re: Introduce support for little endian PowerPC
From: Kumar Gala @ 2010-10-01 17:59 UTC (permalink / raw)
  To: Benjamin Herrenschmidt; +Cc: linuxppc-dev, paulus, Ian Munsie, linux-kernel
In-Reply-To: <1285935283.2463.78.camel@pasglop>


On Oct 1, 2010, at 7:14 AM, Benjamin Herrenschmidt wrote:

> On Fri, 2010-10-01 at 07:30 -0400, Josh Boyer wrote:
>>
>>> From a community aspect is anyone actually going to use this?  Is
>> this going to be the equivalent of voyager on x86?  I've got nothing
>> against some of the endian clean ups this introduces.  However the
>> changes to misc_32.S are a bit ugly from a readability point of view.
>> Just seems like this is likely to bit-rot pretty quickly.
>>
>> I'm with Kumar on this one.  Why would we want to support this?  I
>> can't say I would be very willing to help anyone run in LE mode, let
>> alone have it randomly selectable.
>
> There's some good reasons on the field ... sadly.
>
> At this stage this is mostly an experiment, which went pretty well in
> the sense that it's actually quite easy and a lot of the "fixes" are
> actually reasonable cleanups to carry.
>
> Now, the main reasons in practice are anything touching graphics.
>
> There's quite a few IP cores out there for SoCs that don't have HW
> swappers, and -tons- of more or less ugly code that can't deal with non
> native pixel ordering (hell, even Xorg isn't good at it, we really only
> support cards that have HW swappers today).
>
> There's an even bigger pile of application code that deals with graphics
> without any regard for endianness and is essentially unfixable.
>
> So it becomes a matter of potential customers that will take it if it
> does LE and won't if it doesn't ...
>
> Now, I don't have a problem supporting that as the maintainer, as I
> said, from a kernel standpoint, it's all quite easy to deal with. Some
> of the most gory aspects in misc_32.S could probably be done in a way
> that is slightly more readable, but the approach is actually good, I
> think, to have macros to represent the high/low parts of register pairs.
>
> So at this stage, I'd say, let's not dismiss it just because we all come
> from a long education of hating LE for the sake of it :-)
>
> It makes -some- sense, even if it's not necessarily on the markets
> targeted by FSL today for example. At least from the kernel POV, it
> doesn't seem to me to be a significant support burden at all.
>
> Cheers,
> Ben.

I'm not against it, and I agree some of the patches seem like good
clean up.  I'm concerned about this bit rotting pretty quickly.

- k

^ permalink raw reply

* [PATCH 0/9] v3 De-couple sysfs memory directories from memory sections
From: Nathan Fontenot @ 2010-10-01 18:22 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen

This set of patches removes the assumption that a single memory
section corresponds to a single directory in
/sys/devices/system/memory/.  On systems with large amounts of
memory (1+ TB) there are performance issues related to creating the
large number of sysfs directories.  For a powerpc machine with 1 TB
of memory we are creating 63,000+ directories.  This results in boot
times of around 45-50 minutes for systems with 1 TB of memory and
8 hours for systems with 2 TB of memory.  With this patch set applied
I am now seeing boot times of 5 minutes or less.

The root of this issue is in sysfs directory creation.  Every time
a directory is created, a string compare is done against all sibling
directories to ensure we do not create duplicates.  The list of
directory nodes in sysfs is kept as an unsorted list, so each creation
takes longer as the number of directories grows, and the total cost
grows roughly quadratically with the number of directories.
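
As a rough illustration of the duplicate check described above, here is
a minimal user-space sketch.  It is not the sysfs code: the array of
names and count_duplicate_checks() are hypothetical stand-ins that only
count how many string compares an unsorted sibling list needs.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NAME_LEN 32

/*
 * Hypothetical model, not the sysfs implementation: add n uniquely
 * named entries to an unsorted array, comparing each new name against
 * every existing sibling first.  Returns the total number of strcmp()
 * calls, or 0 if the array cannot be allocated.
 */
unsigned long count_duplicate_checks(unsigned int n)
{
	char (*names)[NAME_LEN];
	unsigned long compares = 0;
	unsigned int i, j;

	names = malloc((size_t)n * NAME_LEN);
	if (!names)
		return 0;

	for (i = 0; i < n; i++) {
		char candidate[NAME_LEN];

		snprintf(candidate, sizeof(candidate), "memory%u", i);

		/* duplicate check: compare against every existing sibling */
		for (j = 0; j < i; j++) {
			int dup = (strcmp(names[j], candidate) == 0);

			compares++;
			/* names are unique by construction */
			assert(!dup);
			(void)dup;
		}
		strcpy(names[i], candidate);
	}

	free(names);
	return compares;
}
```

In this model, creating n directories costs n * (n - 1) / 2 compares,
so doubling the number of directories roughly quadruples the work.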

The solution implemented by this patch set is to allow a single
directory in sysfs to span multiple memory sections.  This is
controlled by an optional architecture-defined function,
memory_block_size_bytes().  The default definition of this
routine returns a memory block size equal to the memory section
size.  This keeps the layout of sysfs memory directories, as seen
by userspace, the same as it is today.

For architectures that define their own version of this routine,
as is done for powerpc and x86_64 in this patchset, the view in userspace
would change such that each memoryXXX directory would span
multiple memory sections.  The number of sections spanned would
depend on the value reported by memory_block_size_bytes.
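
As a rough user-space sketch of the rule in the two paragraphs above
(not the patch itself): a block size is accepted only if it is a power
of two and at least one section in size, otherwise the section size is
used.  SECTION_SIZE_BITS is set to 24 here purely as an assumed example
value, and checked_block_size() and sections_in_block() are
hypothetical helper names.

```c
#include <assert.h>

#define SECTION_SIZE_BITS	24	/* assumed example value: 16 MB sections */
#define MIN_MEMORY_BLOCK_SIZE	(1UL << SECTION_SIZE_BITS)

/*
 * Fall back to one section per block if the requested size is not a
 * power of two or is smaller than a section.
 */
unsigned long checked_block_size(unsigned long block_sz)
{
	if ((block_sz & (block_sz - 1)) || block_sz < MIN_MEMORY_BLOCK_SIZE)
		return MIN_MEMORY_BLOCK_SIZE;
	return block_sz;
}

/* Number of memory sections covered by one sysfs memory directory. */
unsigned long sections_in_block(unsigned long block_sz)
{
	unsigned long sz = checked_block_size(block_sz);

	/* a power of two >= the section size is a multiple of it */
	assert(sz % MIN_MEMORY_BLOCK_SIZE == 0);
	return sz / MIN_MEMORY_BLOCK_SIZE;
}
```

With these assumed values, a 256 MB block size gives 16 sections per
directory, while the default gives 1.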

In both cases a new file 'end_phys_index' is created in each
memoryXXX directory.  This file will contain the physical id
of the last memory section covered by the sysfs directory.  For
the default case, the value in 'end_phys_index' will be the same
as in the existing 'phys_index' file.
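
To make the first/last section relationship concrete, here is a small
sketch.  It assumes blocks are aligned to multiples of the number of
sections per block; struct block_sections and block_for_section() are
hypothetical names, with fields named after the renamed struct members
listed for patch 5/9.

```c
#include <assert.h>

/* Hypothetical helper type for illustration only. */
struct block_sections {
	unsigned long start_section_nr;
	unsigned long end_section_nr;
};

/* Sections covered by the memory block that contains section_nr. */
struct block_sections block_for_section(unsigned long section_nr,
					unsigned long sections_per_block)
{
	struct block_sections r;

	assert(sections_per_block > 0);
	r.start_section_nr = section_nr - (section_nr % sections_per_block);
	r.end_section_nr = r.start_section_nr + sections_per_block - 1;
	return r;
}
```

With one section per block the two values are equal, matching the
default case described above.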

Updates for this version of the patch:

- Patches 2 and 3 have been swapped which has alleviated the need for the
  section count in the memory_block struct to be an atomic.

- The get_memory_block_size and memory_block_size_bytes routines now return
  an unsigned long instead of a u32.  This affects patches 4, 7, and 8.

- [Patch 5/9] The phys_index member of the memory block struct is changed to
  start_section_nr and the new end_phys_index is now named end_section_nr.

- [Patch 8/9] A new patch added to the set to define a version of
  memory_block_size_bytes() for x86_64 when CONFIG_X86_UV is set.

- [Patch 9/9] Correct the updates to hotplug documentation to indicate that
  4 or 5 files may be seen for each memory directory in sysfs.

-Nathan Fontenot

^ permalink raw reply

* [PATCH 1/9] v3 Move find_memory_block routine
From: Nathan Fontenot @ 2010-10-01 18:28 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Move the find_memory_block() routine up to avoid needing a forward
declaration in subsequent patches.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |   62 +++++++++++++++++++++++++-------------------------
 1 file changed, 31 insertions(+), 31 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-29 14:56:26.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-30 14:09:36.000000000 -0500
@@ -435,6 +435,37 @@
 	return 0;
 }
 
+/*
+ * For now, we have a linear search to go find the appropriate
+ * memory_block corresponding to a particular phys_index. If
+ * this gets to be a real problem, we can always use a radix
+ * tree or something here.
+ *
+ * This could be made generic for all sysdev classes.
+ */
+struct memory_block *find_memory_block(struct mem_section *section)
+{
+	struct kobject *kobj;
+	struct sys_device *sysdev;
+	struct memory_block *mem;
+	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+
+	/*
+	 * This only works because we know that section == sysdev->id
+	 * slightly redundant with sysdev_register()
+	 */
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+
+	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
+	if (!kobj)
+		return NULL;
+
+	sysdev = container_of(kobj, struct sys_device, kobj);
+	mem = container_of(sysdev, struct memory_block, sysdev);
+
+	return mem;
+}
+
 static int add_memory_block(int nid, struct mem_section *section,
 			unsigned long state, enum mem_add_context context)
 {
@@ -468,37 +499,6 @@
 	return ret;
 }
 
-/*
- * For now, we have a linear search to go find the appropriate
- * memory_block corresponding to a particular phys_index. If
- * this gets to be a real problem, we can always use a radix
- * tree or something here.
- *
- * This could be made generic for all sysdev classes.
- */
-struct memory_block *find_memory_block(struct mem_section *section)
-{
-	struct kobject *kobj;
-	struct sys_device *sysdev;
-	struct memory_block *mem;
-	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
-
-	/*
-	 * This only works because we know that section == sysdev->id
-	 * slightly redundant with sysdev_register()
-	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
-
-	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
-	if (!kobj)
-		return NULL;
-
-	sysdev = container_of(kobj, struct sys_device, kobj);
-	mem = container_of(sysdev, struct memory_block, sysdev);
-
-	return mem;
-}
-
 int remove_memory_block(unsigned long node_id, struct mem_section *section,
 		int phys_device)
 {

^ permalink raw reply

* [PATCH 2/9] v3 Add mutex for adding/removing memory blocks
From: Nathan Fontenot @ 2010-10-01 18:29 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Add a new mutex for use in adding and removing of memory blocks.  This
is needed to avoid any race conditions in which the same memory block could
be added and removed at the same time.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |    7 +++++++
 1 file changed, 7 insertions(+)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-30 14:09:36.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-30 14:12:41.000000000 -0500
@@ -27,6 +27,8 @@
 #include <asm/atomic.h>
 #include <asm/uaccess.h>
 
+static DEFINE_MUTEX(mem_sysfs_mutex);
+
 #define MEMORY_CLASS_NAME	"memory"
 
 static struct sysdev_class memory_sysdev_class = {
@@ -476,6 +478,8 @@
 	if (!mem)
 		return -ENOMEM;
 
+	mutex_lock(&mem_sysfs_mutex);
+
 	mem->phys_index = __section_nr(section);
 	mem->state = state;
 	mutex_init(&mem->state_mutex);
@@ -496,6 +500,7 @@
 			ret = register_mem_sect_under_node(mem, nid);
 	}
 
+	mutex_unlock(&mem_sysfs_mutex);
 	return ret;
 }
 
@@ -504,6 +509,7 @@
 {
 	struct memory_block *mem;
 
+	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
 	unregister_mem_sect_under_nodes(mem);
 	mem_remove_simple_file(mem, phys_index);
@@ -512,6 +518,7 @@
 	mem_remove_simple_file(mem, removable);
 	unregister_memory(mem, section);
 
+	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
 }
 

^ permalink raw reply

* [PATCH 3/9] v3 Add section count to memory_block struct
From: Nathan Fontenot @ 2010-10-01 18:30 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Add a section count property to the memory_block struct to track the number
of memory sections that have been added/removed from a memory block. This
allows us to know when the last memory section of a memory block has been
removed so we can remove the memory block.
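
The counting scheme can be sketched in user space roughly as follows.
This is only a model: struct block_model and the model_*() helpers are
hypothetical, and no locking is shown (in this series the count is
updated with mem_sysfs_mutex held).

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the kernel's struct memory_block. */
struct block_model {
	int section_count;
	bool registered;
};

/* Account for one more section; the block appears on first use. */
void model_add_section(struct block_model *blk)
{
	if (blk->section_count == 0)
		blk->registered = true;
	blk->section_count++;
}

/*
 * Drop one section.  Returns true when this was the last section,
 * i.e. when the block itself should be torn down.
 */
bool model_remove_section(struct block_model *blk)
{
	assert(blk->section_count > 0);
	blk->section_count--;
	if (blk->section_count == 0) {
		blk->registered = false;
		return true;
	}
	return false;
}
```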

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c  |   17 +++++++++++------
 include/linux/memory.h |    2 ++
 2 files changed, 13 insertions(+), 6 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-30 14:12:41.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-30 14:13:50.000000000 -0500
@@ -482,6 +482,7 @@
 
 	mem->phys_index = __section_nr(section);
 	mem->state = state;
+	mem->section_count++;
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
@@ -511,12 +512,16 @@
 
 	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
-	unregister_mem_sect_under_nodes(mem);
-	mem_remove_simple_file(mem, phys_index);
-	mem_remove_simple_file(mem, state);
-	mem_remove_simple_file(mem, phys_device);
-	mem_remove_simple_file(mem, removable);
-	unregister_memory(mem, section);
+
+	mem->section_count--;
+	if (mem->section_count == 0) {
+		unregister_mem_sect_under_nodes(mem);
+		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, state);
+		mem_remove_simple_file(mem, phys_device);
+		mem_remove_simple_file(mem, removable);
+		unregister_memory(mem, section);
+	}
 
 	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
Index: linux-next/include/linux/memory.h
===================================================================
--- linux-next.orig/include/linux/memory.h	2010-09-29 14:56:29.000000000 -0500
+++ linux-next/include/linux/memory.h	2010-09-30 14:13:50.000000000 -0500
@@ -23,6 +23,8 @@
 struct memory_block {
 	unsigned long phys_index;
 	unsigned long state;
+	int section_count;
+
 	/*
 	 * This serializes all state change requests.  It isn't
 	 * held during creation because the control files are

^ permalink raw reply

* [PATCH 4/9] v3 Allow memory blocks to span multiple memory sections
From: Nathan Fontenot @ 2010-10-01 18:31 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Update the memory sysfs code such that each sysfs memory directory is now
considered a memory block that can span multiple memory sections.
The default size of each memory block is the memory section size
(1 << SECTION_SIZE_BITS bytes), which maintains the current behavior of
having a single memory section per memory block (i.e. one sysfs directory
per memory section).

For architectures that want to have memory blocks span multiple
memory sections they need only define their own memory_block_size_bytes()
routine.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
 drivers/base/memory.c |  155 ++++++++++++++++++++++++++++++++++----------------
 1 file changed, 108 insertions(+), 47 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-30 14:13:50.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-30 14:46:00.000000000 -0500
@@ -30,6 +30,14 @@
 static DEFINE_MUTEX(mem_sysfs_mutex);
 
 #define MEMORY_CLASS_NAME	"memory"
+#define MIN_MEMORY_BLOCK_SIZE	(1 << SECTION_SIZE_BITS)
+
+static int sections_per_block;
+
+static inline int base_memory_block_id(int section_nr)
+{
+	return section_nr / sections_per_block;
+}
 
 static struct sysdev_class memory_sysdev_class = {
 	.name = MEMORY_CLASS_NAME,
@@ -84,28 +92,47 @@
  * register_memory - Setup a sysfs device for a memory block
  */
 static
-int register_memory(struct memory_block *memory, struct mem_section *section)
+int register_memory(struct memory_block *memory)
 {
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = __section_nr(section);
+	memory->sysdev.id = memory->phys_index / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
 }
 
 static void
-unregister_memory(struct memory_block *memory, struct mem_section *section)
+unregister_memory(struct memory_block *memory)
 {
 	BUG_ON(memory->sysdev.cls != &memory_sysdev_class);
-	BUG_ON(memory->sysdev.id != __section_nr(section));
 
 	/* drop the ref. we got in remove_memory_block() */
 	kobject_put(&memory->sysdev.kobj);
 	sysdev_unregister(&memory->sysdev);
 }
 
+unsigned long __weak memory_block_size_bytes(void)
+{
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+
+static unsigned long get_memory_block_size(void)
+{
+	unsigned long block_sz;
+
+	block_sz = memory_block_size_bytes();
+
+	/* Validate blk_sz is a power of 2 and not less than section size */
+	if ((block_sz & (block_sz - 1)) || (block_sz < MIN_MEMORY_BLOCK_SIZE)) {
+		WARN_ON(1);
+		block_sz = MIN_MEMORY_BLOCK_SIZE;
+	}
+
+	return block_sz;
+}
+
 /*
  * use this as the physical section index that this memsection
  * uses.
@@ -116,7 +143,7 @@
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index);
+	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
 }
 
 /*
@@ -125,13 +152,16 @@
 static ssize_t show_mem_removable(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
-	unsigned long start_pfn;
-	int ret;
+	unsigned long i, pfn;
+	int ret = 1;
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
 
-	start_pfn = section_nr_to_pfn(mem->phys_index);
-	ret = is_mem_section_removable(start_pfn, PAGES_PER_SECTION);
+	for (i = 0; i < sections_per_block; i++) {
+		pfn = section_nr_to_pfn(mem->phys_index + i);
+		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
+	}
+
 	return sprintf(buf, "%d\n", ret);
 }
 
@@ -184,17 +214,14 @@
  * OK to have direct references to sparsemem variables in here.
  */
 static int
-memory_block_action(struct memory_block *mem, unsigned long action)
+memory_section_action(unsigned long phys_index, unsigned long action)
 {
 	int i;
-	unsigned long psection;
 	unsigned long start_pfn, start_paddr;
 	struct page *first_page;
 	int ret;
-	int old_state = mem->state;
 
-	psection = mem->phys_index;
-	first_page = pfn_to_page(psection << PFN_SECTION_SHIFT);
+	first_page = pfn_to_page(phys_index << PFN_SECTION_SHIFT);
 
 	/*
 	 * The probe routines leave the pages reserved, just
@@ -207,8 +234,8 @@
 				continue;
 
 			printk(KERN_WARNING "section number %ld page number %d "
-				"not reserved, was it already online? \n",
-				psection, i);
+				"not reserved, was it already online?\n",
+				phys_index, i);
 			return -EBUSY;
 		}
 	}
@@ -219,18 +246,13 @@
 			ret = online_pages(start_pfn, PAGES_PER_SECTION);
 			break;
 		case MEM_OFFLINE:
-			mem->state = MEM_GOING_OFFLINE;
 			start_paddr = page_to_pfn(first_page) << PAGE_SHIFT;
 			ret = remove_memory(start_paddr,
 					    PAGES_PER_SECTION << PAGE_SHIFT);
-			if (ret) {
-				mem->state = old_state;
-				break;
-			}
 			break;
 		default:
-			WARN(1, KERN_WARNING "%s(%p, %ld) unknown action: %ld\n",
-					__func__, mem, action, action);
+			WARN(1, KERN_WARNING "%s(%ld, %ld) unknown action: "
+			     "%ld\n", __func__, phys_index, action, action);
 			ret = -EINVAL;
 	}
 
@@ -240,7 +262,8 @@
 static int memory_block_change_state(struct memory_block *mem,
 		unsigned long to_state, unsigned long from_state_req)
 {
-	int ret = 0;
+	int i, ret = 0;
+
 	mutex_lock(&mem->state_mutex);
 
 	if (mem->state != from_state_req) {
@@ -248,8 +271,22 @@
 		goto out;
 	}
 
-	ret = memory_block_action(mem, to_state);
-	if (!ret)
+	if (to_state == MEM_OFFLINE)
+		mem->state = MEM_GOING_OFFLINE;
+
+	for (i = 0; i < sections_per_block; i++) {
+		ret = memory_section_action(mem->phys_index + i, to_state);
+		if (ret)
+			break;
+	}
+
+	if (ret) {
+		for (i = 0; i < sections_per_block; i++)
+			memory_section_action(mem->phys_index + i,
+					      from_state_req);
+
+		mem->state = from_state_req;
+	} else
 		mem->state = to_state;
 
 out:
@@ -262,20 +299,15 @@
 		struct sysdev_attribute *attr, const char *buf, size_t count)
 {
 	struct memory_block *mem;
-	unsigned int phys_section_nr;
 	int ret = -EINVAL;
 
 	mem = container_of(dev, struct memory_block, sysdev);
-	phys_section_nr = mem->phys_index;
-
-	if (!present_section_nr(phys_section_nr))
-		goto out;
 
 	if (!strncmp(buf, "online", min((int)count, 6)))
 		ret = memory_block_change_state(mem, MEM_ONLINE, MEM_OFFLINE);
 	else if(!strncmp(buf, "offline", min((int)count, 7)))
 		ret = memory_block_change_state(mem, MEM_OFFLINE, MEM_ONLINE);
-out:
+
 	if (ret)
 		return ret;
 	return count;
@@ -315,7 +347,7 @@
 print_block_size(struct sysdev_class *class, struct sysdev_class_attribute *attr,
 		 char *buf)
 {
-	return sprintf(buf, "%lx\n", (unsigned long)PAGES_PER_SECTION * PAGE_SIZE);
+	return sprintf(buf, "%lx\n", get_memory_block_size());
 }
 
 static SYSDEV_CLASS_ATTR(block_size_bytes, 0444, print_block_size, NULL);
@@ -451,12 +483,13 @@
 	struct sys_device *sysdev;
 	struct memory_block *mem;
 	char name[sizeof(MEMORY_CLASS_NAME) + 9 + 1];
+	int block_id = base_memory_block_id(__section_nr(section));
 
 	/*
 	 * This only works because we know that section == sysdev->id
 	 * slightly redundant with sysdev_register()
 	 */
-	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, __section_nr(section));
+	sprintf(&name[0], "%s%d", MEMORY_CLASS_NAME, block_id);
 
 	kobj = kset_find_obj(&memory_sysdev_class.kset, name);
 	if (!kobj)
@@ -468,26 +501,27 @@
 	return mem;
 }
 
-static int add_memory_block(int nid, struct mem_section *section,
-			unsigned long state, enum mem_add_context context)
+static int init_memory_block(struct memory_block **memory,
+			     struct mem_section *section, unsigned long state)
 {
-	struct memory_block *mem = kzalloc(sizeof(*mem), GFP_KERNEL);
+	struct memory_block *mem;
 	unsigned long start_pfn;
+	int scn_nr;
 	int ret = 0;
 
+	mem = kzalloc(sizeof(*mem), GFP_KERNEL);
 	if (!mem)
 		return -ENOMEM;
 
-	mutex_lock(&mem_sysfs_mutex);
-
-	mem->phys_index = __section_nr(section);
+	scn_nr = __section_nr(section);
+	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
 	mem->state = state;
 	mem->section_count++;
 	mutex_init(&mem->state_mutex);
 	start_pfn = section_nr_to_pfn(mem->phys_index);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
-	ret = register_memory(mem, section);
+	ret = register_memory(mem);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
@@ -496,8 +530,29 @@
 		ret = mem_create_simple_file(mem, phys_device);
 	if (!ret)
 		ret = mem_create_simple_file(mem, removable);
+
+	*memory = mem;
+	return ret;
+}
+
+static int add_memory_section(int nid, struct mem_section *section,
+			unsigned long state, enum mem_add_context context)
+{
+	struct memory_block *mem;
+	int ret = 0;
+
+	mutex_lock(&mem_sysfs_mutex);
+
+	mem = find_memory_block(section);
+	if (mem) {
+		mem->section_count++;
+		kobject_put(&mem->sysdev.kobj);
+	} else
+		ret = init_memory_block(&mem, section, state);
+
 	if (!ret) {
-		if (context == HOTPLUG)
+		if (context == HOTPLUG &&
+		    mem->section_count == sections_per_block)
 			ret = register_mem_sect_under_node(mem, nid);
 	}
 
@@ -520,8 +575,10 @@
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
-		unregister_memory(mem, section);
-	}
+		unregister_memory(mem);
+		kfree(mem);
+	} else
+		kobject_put(&mem->sysdev.kobj);
 
 	mutex_unlock(&mem_sysfs_mutex);
 	return 0;
@@ -533,7 +590,7 @@
  */
 int register_new_memory(int nid, struct mem_section *section)
 {
-	return add_memory_block(nid, section, MEM_OFFLINE, HOTPLUG);
+	return add_memory_section(nid, section, MEM_OFFLINE, HOTPLUG);
 }
 
 int unregister_memory_section(struct mem_section *section)
@@ -552,12 +609,16 @@
 	unsigned int i;
 	int ret;
 	int err;
+	unsigned long block_sz;
 
 	memory_sysdev_class.kset.uevent_ops = &memory_uevent_ops;
 	ret = sysdev_class_register(&memory_sysdev_class);
 	if (ret)
 		goto out;
 
+	block_sz = get_memory_block_size();
+	sections_per_block = block_sz / MIN_MEMORY_BLOCK_SIZE;
+
 	/*
 	 * Create entries for memory sections that were found
 	 * during boot and have been initialized
@@ -565,8 +626,8 @@
 	for (i = 0; i < NR_MEM_SECTIONS; i++) {
 		if (!present_section_nr(i))
 			continue;
-		err = add_memory_block(0, __nr_to_section(i), MEM_ONLINE,
-				       BOOT);
+		err = add_memory_section(0, __nr_to_section(i), MEM_ONLINE,
+					 BOOT);
 		if (!ret)
 			ret = err;
 	}
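
For readers following the memory_block_change_state() hunk above, here is a
small userspace sketch of the "apply to every section, roll back on failure"
loop.  It is only a model under stated assumptions: the ex_* names, the fixed
four-section array and the fail_at knob are hypothetical and do not exist in
the kernel, and the error value is a plain -1 rather than an errno.

```c
#include <assert.h>
#include <stddef.h>

enum ex_mem_state { EX_MEM_ONLINE = 1, EX_MEM_OFFLINE = 2 };

/* Hypothetical stand-in for a memory block spanning several sections. */
struct ex_block {
	enum ex_mem_state state;	/* block-level state */
	enum ex_mem_state section[4];	/* per-section state */
	size_t nr_sections;
	long fail_at;			/* section that refuses to change, or -1 */
};

/* Model of a per-section action: may fail when leaving the current state. */
static int ex_section_action(struct ex_block *b, size_t i, enum ex_mem_state to)
{
	assert(i < b->nr_sections);
	if (b->fail_at >= 0 && (size_t)b->fail_at == i && to != b->state)
		return -1;
	b->section[i] = to;
	return 0;
}

static int ex_block_change_state(struct ex_block *b, enum ex_mem_state to,
				 enum ex_mem_state from_req)
{
	size_t i;
	int ret = 0;

	if (b->state != from_req)
		return -1;

	/* Walk the sections in order and stop at the first failure. */
	for (i = 0; i < b->nr_sections; i++) {
		ret = ex_section_action(b, i, to);
		if (ret)
			break;
	}

	if (ret) {
		/* As in the patch, every section is put back, not only the
		 * ones that had already changed; rollback errors are ignored.
		 */
		for (i = 0; i < b->nr_sections; i++)
			ex_section_action(b, i, from_req);
		return ret;
	}

	b->state = to;
	return 0;
}
```

In this model the block-level state only moves once every section has moved,
which is the property the loop in the patch is meant to provide.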


* [PATCH 5/9] v3 rename phys_index properties of memory block struct
From: Nathan Fontenot @ 2010-10-01 18:33 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Update the 'phys_index' property of the memory_block struct to be
called start_section_nr, and add an end_section_nr property.  The
data tracked here is the same but the updated naming is more in line
with what is stored here, namely the first and last section number
that the memory block spans.

The names presented to userspace remain the same, phys_index for
start_section_nr and end_phys_index for end_section_nr, to avoid breaking
anything in userspace.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
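The relationship between a section number, the block's first and last section,
and the two values shown in sysfs can be sketched in plain C as follows.  This
is an illustration only: struct ex_block_span and the ex_* helpers are
hypothetical names, and the block id is assumed to be section_nr /
sections_per_block, matching how sysdev.id is computed in this series.

```c
#include <assert.h>

/* Hypothetical stand-in for the two renamed memory_block fields. */
struct ex_block_span {
	unsigned long start_section_nr;
	unsigned long end_section_nr;
};

/* Assumption: a block id is the section number divided by the block width. */
static unsigned long ex_block_id(unsigned long section_nr,
				 unsigned long sections_per_block)
{
	assert(sections_per_block > 0);
	return section_nr / sections_per_block;
}

/* First and last section of the block that contains section_nr. */
static struct ex_block_span ex_span_for_section(unsigned long section_nr,
						unsigned long sections_per_block)
{
	struct ex_block_span s;

	s.start_section_nr =
		ex_block_id(section_nr, sections_per_block) * sections_per_block;
	s.end_section_nr = s.start_section_nr + sections_per_block - 1;
	return s;
}

/* Values as computed by the two show routines in this patch. */
static unsigned long ex_phys_index(const struct ex_block_span *s,
				   unsigned long sections_per_block)
{
	return s->start_section_nr / sections_per_block;
}

static unsigned long ex_end_phys_index(const struct ex_block_span *s,
				       unsigned long sections_per_block)
{
	return s->end_section_nr / sections_per_block;
}
```

With 16 sections per block, section 37 falls in the block spanning sections 32
through 47.  Because both show routines divide by sections_per_block, the two
reported values coincide for such a block in this sketch.
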
 drivers/base/memory.c  |   39 ++++++++++++++++++++++++++++++---------
 include/linux/memory.h |    3 ++-
 2 files changed, 32 insertions(+), 10 deletions(-)

Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-30 14:46:00.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-30 14:46:09.000000000 -0500
@@ -97,7 +97,7 @@
 	int error;
 
 	memory->sysdev.cls = &memory_sysdev_class;
-	memory->sysdev.id = memory->phys_index / sections_per_block;
+	memory->sysdev.id = memory->start_section_nr / sections_per_block;
 
 	error = sysdev_register(&memory->sysdev);
 	return error;
@@ -138,12 +138,26 @@
  * uses.
  */
 
-static ssize_t show_mem_phys_index(struct sys_device *dev,
+static ssize_t show_mem_start_phys_index(struct sys_device *dev,
 			struct sysdev_attribute *attr, char *buf)
 {
 	struct memory_block *mem =
 		container_of(dev, struct memory_block, sysdev);
-	return sprintf(buf, "%08lx\n", mem->phys_index / sections_per_block);
+	unsigned long phys_index;
+
+	phys_index = mem->start_section_nr / sections_per_block;
+	return sprintf(buf, "%08lx\n", phys_index);
+}
+
+static ssize_t show_mem_end_phys_index(struct sys_device *dev,
+			struct sysdev_attribute *attr, char *buf)
+{
+	struct memory_block *mem =
+		container_of(dev, struct memory_block, sysdev);
+	unsigned long phys_index;
+
+	phys_index = mem->end_section_nr / sections_per_block;
+	return sprintf(buf, "%08lx\n", phys_index);
 }
 
 /*
@@ -158,7 +172,7 @@
 		container_of(dev, struct memory_block, sysdev);
 
 	for (i = 0; i < sections_per_block; i++) {
-		pfn = section_nr_to_pfn(mem->phys_index + i);
+		pfn = section_nr_to_pfn(mem->start_section_nr + i);
 		ret &= is_mem_section_removable(pfn, PAGES_PER_SECTION);
 	}
 
@@ -275,14 +289,15 @@
 		mem->state = MEM_GOING_OFFLINE;
 
 	for (i = 0; i < sections_per_block; i++) {
-		ret = memory_section_action(mem->phys_index + i, to_state);
+		ret = memory_section_action(mem->start_section_nr + i,
+					    to_state);
 		if (ret)
 			break;
 	}
 
 	if (ret) {
 		for (i = 0; i < sections_per_block; i++)
-			memory_section_action(mem->phys_index + i,
+			memory_section_action(mem->start_section_nr + i,
 					      from_state_req);
 
 		mem->state = from_state_req;
@@ -330,7 +345,8 @@
 	return sprintf(buf, "%d\n", mem->phys_device);
 }
 
-static SYSDEV_ATTR(phys_index, 0444, show_mem_phys_index, NULL);
+static SYSDEV_ATTR(phys_index, 0444, show_mem_start_phys_index, NULL);
+static SYSDEV_ATTR(end_phys_index, 0444, show_mem_end_phys_index, NULL);
 static SYSDEV_ATTR(state, 0644, show_mem_state, store_mem_state);
 static SYSDEV_ATTR(phys_device, 0444, show_phys_device, NULL);
 static SYSDEV_ATTR(removable, 0444, show_mem_removable, NULL);
@@ -514,17 +530,21 @@
 		return -ENOMEM;
 
 	scn_nr = __section_nr(section);
-	mem->phys_index = base_memory_block_id(scn_nr) * sections_per_block;
+	mem->start_section_nr =
+			base_memory_block_id(scn_nr) * sections_per_block;
+	mem->end_section_nr = mem->start_section_nr + sections_per_block - 1;
 	mem->state = state;
 	mem->section_count++;
 	mutex_init(&mem->state_mutex);
-	start_pfn = section_nr_to_pfn(mem->phys_index);
+	start_pfn = section_nr_to_pfn(mem->start_section_nr);
 	mem->phys_device = arch_get_memory_phys_device(start_pfn);
 
 	ret = register_memory(mem);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_index);
 	if (!ret)
+		ret = mem_create_simple_file(mem, end_phys_index);
+	if (!ret)
 		ret = mem_create_simple_file(mem, state);
 	if (!ret)
 		ret = mem_create_simple_file(mem, phys_device);
@@ -572,6 +592,7 @@
 	if (mem->section_count == 0) {
 		unregister_mem_sect_under_nodes(mem);
 		mem_remove_simple_file(mem, phys_index);
+		mem_remove_simple_file(mem, end_phys_index);
 		mem_remove_simple_file(mem, state);
 		mem_remove_simple_file(mem, phys_device);
 		mem_remove_simple_file(mem, removable);
Index: linux-next/include/linux/memory.h
===================================================================
--- linux-next.orig/include/linux/memory.h	2010-09-30 14:44:39.000000000 -0500
+++ linux-next/include/linux/memory.h	2010-09-30 14:46:09.000000000 -0500
@@ -21,7 +21,8 @@
 #include <linux/mutex.h>
 
 struct memory_block {
-	unsigned long phys_index;
+	unsigned long start_section_nr;
+	unsigned long end_section_nr;
 	unsigned long state;
 	int section_count;
 


* [PATCH 6/9] v3 Update node sysfs code
From: Nathan Fontenot @ 2010-10-01 18:34 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Update the node sysfs code to be aware of the new capability for a memory
block to contain multiple memory sections and be aware of the memory block
structure name changes (start_section_nr).  This requires an additional
parameter to unregister_mem_sect_under_nodes so that we know which memory
section of the memory block to unregister.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
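To make the two pfn ranges in this patch concrete, here is a small standalone
sketch: registration walks the whole block, while unregistration walks a
single section.  The EX_* constants are example values chosen for
illustration (4KiB pages, 128MiB sections), not taken from any particular
configuration, and the ex_* helpers are local stand-ins for the kernel macros.

```c
#include <assert.h>

#define EX_PAGE_SHIFT		12
#define EX_SECTION_SIZE_BITS	27
#define EX_PFN_SECTION_SHIFT	(EX_SECTION_SIZE_BITS - EX_PAGE_SHIFT)
#define EX_PAGES_PER_SECTION	(1UL << EX_PFN_SECTION_SHIFT)

struct ex_pfn_range {
	unsigned long start;
	unsigned long end;	/* inclusive */
};

static unsigned long ex_section_nr_to_pfn(unsigned long section_nr)
{
	return section_nr << EX_PFN_SECTION_SHIFT;
}

/* Range walked when registering a block under a node. */
static struct ex_pfn_range ex_block_pfn_range(unsigned long start_section_nr,
					      unsigned long end_section_nr)
{
	struct ex_pfn_range r;

	assert(start_section_nr <= end_section_nr);
	r.start = ex_section_nr_to_pfn(start_section_nr);
	r.end = ex_section_nr_to_pfn(end_section_nr) + EX_PAGES_PER_SECTION - 1;
	return r;
}

/* Range walked when unregistering one section of a block. */
static struct ex_pfn_range ex_section_pfn_range(unsigned long section_nr)
{
	return ex_block_pfn_range(section_nr, section_nr);
}
```
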
 drivers/base/memory.c |    2 +-
 drivers/base/node.c   |   12 ++++++++----
 include/linux/node.h  |    6 ++++--
 3 files changed, 13 insertions(+), 7 deletions(-)

Index: linux-next/drivers/base/node.c
===================================================================
--- linux-next.orig/drivers/base/node.c	2010-09-30 14:44:38.000000000 -0500
+++ linux-next/drivers/base/node.c	2010-09-30 14:46:12.000000000 -0500
@@ -346,8 +346,10 @@
 		return -EFAULT;
 	if (!node_online(nid))
 		return 0;
-	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
-	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
+
+	sect_start_pfn = section_nr_to_pfn(mem_blk->start_section_nr);
+	sect_end_pfn = section_nr_to_pfn(mem_blk->end_section_nr);
+	sect_end_pfn += PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int page_nid;
 
@@ -371,7 +373,8 @@
 }
 
 /* unregister memory section under all nodes that it spans */
-int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+				    unsigned long phys_index)
 {
 	NODEMASK_ALLOC(nodemask_t, unlinked_nodes, GFP_KERNEL);
 	unsigned long pfn, sect_start_pfn, sect_end_pfn;
@@ -383,7 +386,8 @@
 	if (!unlinked_nodes)
 		return -ENOMEM;
 	nodes_clear(*unlinked_nodes);
-	sect_start_pfn = section_nr_to_pfn(mem_blk->phys_index);
+
+	sect_start_pfn = section_nr_to_pfn(phys_index);
 	sect_end_pfn = sect_start_pfn + PAGES_PER_SECTION - 1;
 	for (pfn = sect_start_pfn; pfn <= sect_end_pfn; pfn++) {
 		int nid;
Index: linux-next/drivers/base/memory.c
===================================================================
--- linux-next.orig/drivers/base/memory.c	2010-09-30 14:46:09.000000000 -0500
+++ linux-next/drivers/base/memory.c	2010-09-30 14:46:12.000000000 -0500
@@ -587,10 +587,10 @@
 
 	mutex_lock(&mem_sysfs_mutex);
 	mem = find_memory_block(section);
+	unregister_mem_sect_under_nodes(mem, __section_nr(section));
 
 	mem->section_count--;
 	if (mem->section_count == 0) {
-		unregister_mem_sect_under_nodes(mem);
 		mem_remove_simple_file(mem, phys_index);
 		mem_remove_simple_file(mem, end_phys_index);
 		mem_remove_simple_file(mem, state);
Index: linux-next/include/linux/node.h
===================================================================
--- linux-next.orig/include/linux/node.h	2010-09-30 14:44:38.000000000 -0500
+++ linux-next/include/linux/node.h	2010-09-30 14:46:12.000000000 -0500
@@ -44,7 +44,8 @@
 extern int unregister_cpu_under_node(unsigned int cpu, unsigned int nid);
 extern int register_mem_sect_under_node(struct memory_block *mem_blk,
 						int nid);
-extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk);
+extern int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+					   unsigned long phys_index);
 
 #ifdef CONFIG_HUGETLBFS
 extern void register_hugetlbfs_with_node(node_registration_func_t doregister,
@@ -72,7 +73,8 @@
 {
 	return 0;
 }
-static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk)
+static inline int unregister_mem_sect_under_nodes(struct memory_block *mem_blk,
+						  unsigned long phys_index)
 {
 	return 0;
 }


* [PATCH 7/9] v3 Define memory_block_size_bytes for powerpc/pseries
From: Nathan Fontenot @ 2010-10-01 18:35 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Define a version of memory_block_size_bytes() for powerpc/pseries such that
a memory block spans an entire lmb.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
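The lookup order used by get_memblock_size() below can be modelled outside
the kernel roughly like this.  The sketch replaces the device tree with a
hypothetical flat table (struct ex_tree and struct ex_mem_node are made up
for illustration) and only mirrors the decision flow: prefer ibm,lmb-size
when the dynamic-reconfiguration node exists, otherwise take the size of the
memory node that starts where memory@0 ends.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical flattened view of one /memory@<base> node. */
struct ex_mem_node {
	unsigned long base;
	unsigned long size;
};

struct ex_tree {
	int has_drconf;			/* dynamic-reconfiguration node present? */
	unsigned long lmb_size;		/* ibm,lmb-size value, 0 if missing */
	const struct ex_mem_node *nodes;
	size_t nr_nodes;
};

static const struct ex_mem_node *ex_find_node(const struct ex_tree *t,
					      unsigned long base)
{
	size_t i;

	for (i = 0; i < t->nr_nodes; i++)
		if (t->nodes[i].base == base)
			return &t->nodes[i];
	return NULL;
}

static unsigned long ex_get_memblock_size(const struct ex_tree *t)
{
	const struct ex_mem_node *n;

	assert(t != NULL);
	if (t->has_drconf)
		return t->lmb_size;

	/* No drconf node: find memory@0, then the node that follows it. */
	n = ex_find_node(t, 0);
	if (!n || !n->size)
		return 0;

	n = ex_find_node(t, n->size);
	return n ? n->size : 0;
}
```

A return value of 0 means no block size could be determined, which is the
case pseries_drconf_memory() turns into -EINVAL.
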
 arch/powerpc/platforms/pseries/hotplug-memory.c |   66 +++++++++++++++++++-----
 1 file changed, 53 insertions(+), 13 deletions(-)

Index: linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c
===================================================================
--- linux-next.orig/arch/powerpc/platforms/pseries/hotplug-memory.c	2010-09-30 14:44:37.000000000 -0500
+++ linux-next/arch/powerpc/platforms/pseries/hotplug-memory.c	2010-09-30 14:47:04.000000000 -0500
@@ -17,6 +17,54 @@
 #include <asm/pSeries_reconfig.h>
 #include <asm/sparsemem.h>
 
+static unsigned long get_memblock_size(void)
+{
+	struct device_node *np;
+	unsigned int memblock_size = 0;
+
+	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
+	if (np) {
+		const unsigned long *size;
+
+		size = of_get_property(np, "ibm,lmb-size", NULL);
+		memblock_size = size ? *size : 0;
+
+		of_node_put(np);
+	} else {
+		unsigned int memzero_size = 0;
+		const unsigned int *regs;
+
+		np = of_find_node_by_path("/memory@0");
+		if (np) {
+			regs = of_get_property(np, "reg", NULL);
+			memzero_size = regs ? regs[3] : 0;
+			of_node_put(np);
+		}
+
+		if (memzero_size) {
+			/* We now know the size of memory@0, use this to find
+			 * the first memoryblock and get its size.
+			 */
+			char buf[64];
+
+			sprintf(buf, "/memory@%x", memzero_size);
+			np = of_find_node_by_path(buf);
+			if (np) {
+				regs = of_get_property(np, "reg", NULL);
+				memblock_size = regs ? regs[3] : 0;
+				of_node_put(np);
+			}
+		}
+	}
+
+	return memblock_size;
+}
+
+unsigned long memory_block_size_bytes(void)
+{
+	return get_memblock_size();
+}
+
 static int pseries_remove_memblock(unsigned long base, unsigned int memblock_size)
 {
 	unsigned long start, start_pfn;
@@ -127,30 +175,22 @@
 
 static int pseries_drconf_memory(unsigned long *base, unsigned int action)
 {
-	struct device_node *np;
-	const unsigned long *lmb_size;
+	unsigned long memblock_size;
 	int rc;
 
-	np = of_find_node_by_path("/ibm,dynamic-reconfiguration-memory");
-	if (!np)
+	memblock_size = get_memblock_size();
+	if (!memblock_size)
 		return -EINVAL;
 
-	lmb_size = of_get_property(np, "ibm,lmb-size", NULL);
-	if (!lmb_size) {
-		of_node_put(np);
-		return -EINVAL;
-	}
-
 	if (action == PSERIES_DRCONF_MEM_ADD) {
-		rc = memblock_add(*base, *lmb_size);
+		rc = memblock_add(*base, memblock_size);
 		rc = (rc < 0) ? -EINVAL : 0;
 	} else if (action == PSERIES_DRCONF_MEM_REMOVE) {
-		rc = pseries_remove_memblock(*base, *lmb_size);
+		rc = pseries_remove_memblock(*base, memblock_size);
 	} else {
 		rc = -EINVAL;
 	}
 
-	of_node_put(np);
 	return rc;
 }
 


* [PATCH 8/9] v3 Define memory_block_size_bytes for x86_64 with CONFIG_X86_UV set
From: Nathan Fontenot @ 2010-10-01 18:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Define a version of memory_block_size_bytes for x86_64 when CONFIG_X86_UV is
set.

Signed-off-by: Robin Holt <holt@sgi.com>
Signed-off-by: Jack Steiner <steiner@sgi.com>

---
 arch/x86/mm/init_64.c |   14 ++++++++++++++
 1 file changed, 14 insertions(+)

Index: linux-next/arch/x86/mm/init_64.c
===================================================================
--- linux-next.orig/arch/x86/mm/init_64.c	2010-09-29 14:56:25.000000000 -0500
+++ linux-next/arch/x86/mm/init_64.c	2010-10-01 13:00:50.000000000 -0500
@@ -51,6 +51,7 @@
 #include <asm/numa.h>
 #include <asm/cacheflush.h>
 #include <asm/init.h>
+#include <asm/uv/uv.h>
 #include <linux/bootmem.h>
 
 static int __init parse_direct_gbpages_off(char *arg)
@@ -902,6 +903,19 @@
 	return NULL;
 }
 
+#ifdef CONFIG_X86_UV
+#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
+
+unsigned long memory_block_size_bytes(void)
+{
+	if (is_uv_system()) {
+		printk(KERN_INFO "UV: memory block size 2GB\n");
+		return 2UL * 1024 * 1024 * 1024;
+	}
+	return MIN_MEMORY_BLOCK_SIZE;
+}
+#endif
+
 #ifdef CONFIG_SPARSEMEM_VMEMMAP
 /*
  * Initialise the sparsemem vmemmap using huge-pages at the PMD level.


* [PATCH 9/9] v3 Update memory hotplug documentation
From: Nathan Fontenot @ 2010-10-01 18:37 UTC (permalink / raw)
  To: linux-kernel, linux-mm, linuxppc-dev
  Cc: Greg KH, steiner, Robin Holt, KAMEZAWA Hiroyuki, Dave Hansen
In-Reply-To: <4CA62700.7010809@austin.ibm.com>

Update the memory hotplug documentation to describe the new behavior of
memory blocks as exposed in sysfs.

Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

---
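As a rough aid for reading the updated text, the mapping from a physical
address to its sysfs directory can be sketched as below.  This follows the
naming code earlier in the series (directory number = block id) and assumes
block-aligned memory; ex_block_id() and ex_block_dir() are hypothetical
helper names, not kernel or libc interfaces.

```c
#include <assert.h>
#include <stdio.h>

/* Block id of the memory block containing phys_addr. */
static unsigned long ex_block_id(unsigned long long phys_addr,
				 unsigned long long block_size)
{
	assert(block_size > 0);
	return (unsigned long)(phys_addr / block_size);
}

/* Format the sysfs directory for that block; returns snprintf()'s result. */
static int ex_block_dir(char *buf, size_t len, unsigned long long phys_addr,
			unsigned long long block_size)
{
	return snprintf(buf, len, "/sys/devices/system/memory/memory%lu",
			ex_block_id(phys_addr, block_size));
}
```

With a 1GiB block, address 0x100000000 maps to memory4, as in the existing
example in the document; with a 2GiB block the same address maps to memory2.
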
 Documentation/memory-hotplug.txt |   47 +++++++++++++++++++++++++--------------
 1 file changed, 31 insertions(+), 16 deletions(-)

Index: linux-next/Documentation/memory-hotplug.txt
===================================================================
--- linux-next.orig/Documentation/memory-hotplug.txt	2010-09-29 14:56:24.000000000 -0500
+++ linux-next/Documentation/memory-hotplug.txt	2010-09-30 14:59:47.000000000 -0500
@@ -126,36 +126,51 @@
 --------------------------------
 4 sysfs files for memory hotplug
 --------------------------------
-All sections have their device information under /sys/devices/system/memory as
+All sections have their device information in sysfs.  Each section is part of
+a memory block under /sys/devices/system/memory as
 
 /sys/devices/system/memory/memoryXXX
-(XXX is section id.)
+(XXX is the section id.)
 
-Now, XXX is defined as start_address_of_section / section_size.
+Now, XXX is defined as (start_address_of_section / section_size) of the first
+section contained in the memory block.  The files 'phys_index' and
+'end_phys_index' under each directory report the beginning and end section ids
+for the memory block covered by the sysfs directory.  It is expected that all
+memory sections in this range are present and no memory holes exist in the
+range. Currently there is no way to determine if there is a memory hole, but
+the existence of one should not affect the hotplug capabilities of the memory
+block.
 
 For example, assume 1GiB section size. A device for a memory starting at
 0x100000000 is /sys/device/system/memory/memory4
 (0x100000000 / 1Gib = 4)
 This device covers address range [0x100000000 ... 0x140000000)
 
-Under each section, you can see 4 files.
+Under each section, you can see 4 or 5 files, the end_phys_index file being
+a recent addition and not present on older kernels.
 
-/sys/devices/system/memory/memoryXXX/phys_index
+/sys/devices/system/memory/memoryXXX/start_phys_index
+/sys/devices/system/memory/memoryXXX/end_phys_index
 /sys/devices/system/memory/memoryXXX/phys_device
 /sys/devices/system/memory/memoryXXX/state
 /sys/devices/system/memory/memoryXXX/removable
 
-'phys_index' : read-only and contains section id, same as XXX.
-'state'      : read-write
-               at read:  contains online/offline state of memory.
-               at write: user can specify "online", "offline" command
-'phys_device': read-only: designed to show the name of physical memory device.
-               This is not well implemented now.
-'removable'  : read-only: contains an integer value indicating
-               whether the memory section is removable or not
-               removable.  A value of 1 indicates that the memory
-               section is removable and a value of 0 indicates that
-               it is not removable.
+'phys_index'      : read-only and contains section id of the first section
+		    in the memory block, same as XXX.
+'end_phys_index'  : read-only and contains section id of the last section
+		    in the memory block.
+'state'           : read-write
+                    at read:  contains online/offline state of memory.
+                    at write: user can specify "online", "offline" command
+                    which will be performed on all sections in the block.
+'phys_device'     : read-only: designed to show the name of physical memory
+                    device.  This is not well implemented now.
+'removable'       : read-only: contains an integer value indicating
+                    whether the memory block is removable or not
+                    removable.  A value of 1 indicates that the memory
+                    block is removable and a value of 0 indicates that
+                    it is not removable. A memory block is removable only if
+                    every section in the block is removable.
 
 NOTE:
   These directories/files appear after physical memory hotplug phase.


* Re: [PATCH 2/9] v3 Add mutex for adding/removing memory blocks
From: Robin Holt @ 2010-10-01 18:45 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, steiner, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	linuxppc-dev, KAMEZAWA Hiroyuki
In-Reply-To: <4CA62896.2060307@austin.ibm.com>

On Fri, Oct 01, 2010 at 01:29:42PM -0500, Nathan Fontenot wrote:
> Add a new mutex for use in adding and removing of memory blocks.  This
> is needed to avoid any race conditions in which the same memory block could
> be added and removed at the same time.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

Reviewed-by: Robin Holt <holt@sgi.com>

I am fine with this patch by itself, but its only real function is
to protect the count introduced by the next patch.  You might want to
combine the patches, but if not, that is fine as well.

Robin


* Re: [PATCH 1/9] v3 Move find_memory_block routine
From: Robin Holt @ 2010-10-01 18:40 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, steiner, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	linuxppc-dev, KAMEZAWA Hiroyuki
In-Reply-To: <4CA62857.4030803@austin.ibm.com>

On Fri, Oct 01, 2010 at 01:28:39PM -0500, Nathan Fontenot wrote:
> Move the find_memory_block() routine up to avoid needing a forward
> declaration in subsequent patches.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

Reviewed-by: Robin Holt <holt@sgi.com>


* Re: [PATCH 3/9] v3 Add section count to memory_block struct
From: Robin Holt @ 2010-10-01 18:46 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, steiner, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	linuxppc-dev, KAMEZAWA Hiroyuki
In-Reply-To: <4CA628D0.6030508@austin.ibm.com>

On Fri, Oct 01, 2010 at 01:30:40PM -0500, Nathan Fontenot wrote:
> Add a section count property to the memory_block struct to track the number
> of memory sections that have been added/removed from a memory block. This
> allows us to know when the last memory section of a memory block has been
> removed so we can remove the memory block.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

Reviewed-by: Robin Holt <holt@sgi.com>


* Re: [PATCH 5/9] v3 rename phys_index properties of memory block struct
From: Robin Holt @ 2010-10-01 18:54 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, steiner, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	linuxppc-dev, KAMEZAWA Hiroyuki
In-Reply-To: <4CA62982.5080900@austin.ibm.com>

On Fri, Oct 01, 2010 at 01:33:38PM -0500, Nathan Fontenot wrote:
> Update the 'phys_index' property of a the memory_block struct to be
> called start_section_nr, and add a end_section_nr property.  The
> data tracked here is the same but the updated naming is more in line
> with what is stored here, namely the first and last section number
> that the memory block spans.
> 
> The names presented to userspace remain the same, phys_index for
> start_section_nr and end_phys_index for end_section_nr, to avoid breaking
> anything in userspace.
> 
> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>

Reviewed-by: Robin Holt <holt@sgi.com>


* Re: [PATCH 4/9] v3 Allow memory blocks to span multiple memory sections
From: Nathan Fontenot @ 2010-10-01 18:56 UTC (permalink / raw)
  To: Robin Holt
  Cc: Greg KH, steiner, linux-kernel, Dave Hansen, linux-mm,
	linuxppc-dev, KAMEZAWA Hiroyuki
In-Reply-To: <20101001185250.GK14064@sgi.com>

On 10/01/2010 01:52 PM, Robin Holt wrote:
> On Fri, Oct 01, 2010 at 01:31:51PM -0500, Nathan Fontenot wrote:
>> Update the memory sysfs code such that each sysfs memory directory is now
>> considered a memory block that can span multiple memory sections per
>> memory block.  The default size of each memory block is SECTION_SIZE_BITS
>> to maintain the current behavior of having a single memory section per
>> memory block (i.e. one sysfs directory per memory section).
>>
>> For architectures that want to have memory blocks span multiple
>> memory sections they need only define their own memory_block_size_bytes()
>> routine.
>>
>> Signed-off-by: Nathan Fontenot <nfont@austin.ibm.com>
>>
>> ---
>>  drivers/base/memory.c |  155 ++++++++++++++++++++++++++++++++++----------------
>>  1 file changed, 108 insertions(+), 47 deletions(-)
>>
>> Index: linux-next/drivers/base/memory.c
>> ===================================================================
>> --- linux-next.orig/drivers/base/memory.c	2010-09-30 14:13:50.000000000 -0500
>> +++ linux-next/drivers/base/memory.c	2010-09-30 14:46:00.000000000 -0500
> ...
>> +static unsigned long get_memory_block_size(void)
>> +{
>> +	u32 block_sz;
>         ^^^
> 
> I think this should be unsigned long.  u32 will work, but everything
> else has been changed to use unsigned long.  If you disagree, I will
> happily acquiesce as nothing is currently broken.  If SGI decides to make
> memory_block_size_bytes more dynamic, we will fix this up at that time.

You're right, that should have been made an unsigned long also.  I'll attach a new
patch with that corrected.
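
To illustrate why the width matters once block sizes grow, here is a small
userspace sketch.  It assumes 64-bit sizes and models the kernel's u32 with
uint32_t; the ex_* helpers and the 128MiB section size used in the example
are hypothetical and only serve the illustration.

```c
#include <assert.h>
#include <stdint.h>

/* What a block size looks like after being stored in a 32-bit variable. */
static uint32_t ex_block_size_as_u32(uint64_t block_size_bytes)
{
	return (uint32_t)block_size_bytes;
}

/* Sections per block as derived from a block size and a section size. */
static uint64_t ex_sections_per_block(uint64_t block_size_bytes,
				      uint64_t section_size_bytes)
{
	assert(section_size_bytes > 0);
	return block_size_bytes / section_size_bytes;
}
```

A 2GB block still fits in 32 bits, so nothing breaks today, but a 4GB block
would truncate to 0 and yield zero sections per block in this model.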

-Nathan


* Re: [PATCH 8/9] v3 Define memory_block_size_bytes for x86_64 with CONFIG_X86_UV set
From: Robin Holt @ 2010-10-01 18:57 UTC (permalink / raw)
  To: Nathan Fontenot
  Cc: Greg KH, steiner, linux-kernel, Dave Hansen, linux-mm, Robin Holt,
	linuxppc-dev, KAMEZAWA Hiroyuki
In-Reply-To: <4CA62A51.70807@austin.ibm.com>

On Fri, Oct 01, 2010 at 01:37:05PM -0500, Nathan Fontenot wrote:
> Define a version of memory_block_size_bytes for x86_64 when CONFIG_X86_UV is
> set.
> 
> Signed-off-by: Robin Holt <holt@sgi.com>
> Signed-off-by: Jack Steiner <steiner@sgi.com>

I think this technically needs a Signed-off-by: <you> since you
are passing it upstream.

> 
> ---
>  arch/x86/mm/init_64.c |   14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> Index: linux-next/arch/x86/mm/init_64.c
> ===================================================================
> --- linux-next.orig/arch/x86/mm/init_64.c	2010-09-29 14:56:25.000000000 -0500
> +++ linux-next/arch/x86/mm/init_64.c	2010-10-01 13:00:50.000000000 -0500
> @@ -51,6 +51,7 @@
>  #include <asm/numa.h>
>  #include <asm/cacheflush.h>
>  #include <asm/init.h>
> +#include <asm/uv/uv.h>
>  #include <linux/bootmem.h>
>  
>  static int __init parse_direct_gbpages_off(char *arg)
> @@ -902,6 +903,19 @@
>  	return NULL;
>  }
>  
> +#ifdef CONFIG_X86_UV
> +#define MIN_MEMORY_BLOCK_SIZE   (1 << SECTION_SIZE_BITS)
> +
> +unsigned long memory_block_size_bytes(void)
> +{
> +	if (is_uv_system()) {
> +		printk(KERN_INFO "UV: memory block size 2GB\n");
> +		return 2UL * 1024 * 1024 * 1024;
> +	}
> +	return MIN_MEMORY_BLOCK_SIZE;
> +}
> +#endif
> +
>  #ifdef CONFIG_SPARSEMEM_VMEMMAP
>  /*
>   * Initialise the sparsemem vmemmap using huge-pages at the PMD level.
> 


