LinuxPPC-Dev Archive on lore.kernel.org
 help / color / mirror / Atom feed
* Re: [PATCH 2/2] pseries/iommu: remove DDW on kexec
From: Nishanth Aravamudan @ 2013-01-29 20:33 UTC (permalink / raw)
  To: Michael Ellerman; +Cc: miltonm, paulus, anton, nfont, linuxppc-dev
In-Reply-To: <1359457108.26096.7.camel@concordia>

Hi Michael,

On 29.01.2013 [21:58:28 +1100], Michael Ellerman wrote:
> On Mon, 2013-01-28 at 18:03 -0800, Nishanth Aravamudan wrote:
> > pseries/iommu: remove DDW on kexec
> >  ...
> >     
> > I believe the simplest, easiest-to-maintain fix is to just change our
> > initcall to, rather than detecting and updating the new kernel's DDW
> > knowledge, just remove all DDW configurations. When the drivers
> > re-initialize, we will set everything back up as it was before.
> 
> I don't know this code at all, but this sounds like it will also work
> for kdump, right? ie. when the original kernel has crashed the 2nd
> kernel will tear the DDW down and set it back up.

Yes, my actual test-case (and what was reported as broken) was kdump.
>From my relatively vague (but now growing) understanding of that
process, kdump does use kexec under the covers to switch to the crash
kernel, and so does get the same benefit from this change.

Another datapoint, though, is that it might make sense to recommend (and
I'm working on figuring this out for the distros, etc) to use
disable_ddw anyways for the kdump kernel command-line, as DDW isn't
'free' and it's unclear if performance is a huge concern for the crash
kernel (sort of varies with where your storage is, and how much you need
to dump, which for kdump generally doesn't seem like that much?).

Thanks,
Nish

^ permalink raw reply

* Re: [PATCH 4/5] net: mvmdio: allow Device Tree and platform device to coexist
From: Jason Gunthorpe @ 2013-01-29 17:59 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring,
	Greg Kroah-Hartman, Paul Mackerras, Lennert Buytenhek,
	Rob Landley, netdev, linuxppc-dev, davem, linux-arm-kernel
In-Reply-To: <1359473048-26551-5-git-send-email-florian@openwrt.org>

On Tue, Jan 29, 2013 at 04:24:07PM +0100, Florian Fainelli wrote:
  
> -	dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> +	if (pdev->dev.of_node) {
> +		dev->regs = of_iomap(pdev->dev.of_node, 0);
> +		if (!dev->regs) {
> +			dev_err(&pdev->dev, "No SMI register address given in DT\n");
> +			ret = -ENODEV;
> +			goto out_free;
> +		}
> +
> +		dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> +	} else {
> +		r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> +
> +		dev->regs = ioremap(r->start, resource_size(r));
> +		if (!dev->regs) {
> +			dev_err(&pdev->dev, "No SMI register address given\n");
> +			ret = -ENODEV;
> +			goto out_free;
> +		}
> +
> +		dev->err_interrupt = platform_get_irq(pdev, 0);
> +	}

Why do you have these different paths for OF and platform? AFAIK these
days when a OF device is automatically converted into a platform
device all the struct resources are created too, so you can't you just
use platform_get_resource and devm_request_and_ioremap for both flows?

Ditto for the interrupt - platform_get_irq should work in both cases?

Jason

^ permalink raw reply

* Re: [PATCH 5/5] mv643xx_eth: convert to use the Marvell Orion MDIO driver
From: Jason Gunthorpe @ 2013-01-29 18:13 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring,
	Greg Kroah-Hartman, Paul Mackerras, Lennert Buytenhek,
	Rob Landley, netdev, linuxppc-dev, davem, linux-arm-kernel
In-Reply-To: <1359473048-26551-6-git-send-email-florian@openwrt.org>

On Tue, Jan 29, 2013 at 04:24:08PM +0100, Florian Fainelli wrote:
> This patch converts the Marvell MV643XX ethernet driver to use the
> Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
> registering the Marvell MV643XX ethernet driver are also updated to
> register a Marvell Orion MDIO driver. This driver voluntarily overlaps
> with the Marvell Ethernet shared registers because it will use a subset
> of this shared register (shared_base + 0x4 - shared_base + 0x84). The
> Ethernet driver is also updated to look up for a PHY device using the
> Orion MDIO bus driver.

Can you finish off this job by making the mv643xx_eth driver accept
the standard phy-handle OF property instead of using a phy address?

Ie the end result should be something like:

                smi0: mdio@72000 {
                        device_type = "mdio";
                        compatible = "marvell,orion-mdio";
                        reg = <0x72004 0x4>;

                        #address-cells = <1>;
                        #size-cells = <0>;
                        PHY1: ethernet-phy@1 {
                                reg = <1>;
                                device_type = "ethernet-phy";
                                phy-id = <0x01410e90>;
                        };
                };

                egiga0 {
                        device_type = "network";
                        compatible = "marvell,mv643xx-eth";
                        reg = <0x72000 0x4000>;
                        port_number = <0>;
                        phy-handle = <&PHY1>;
                        interrupts = <11>;
                        local-mac-address = [000000000002];  /* Filled by boot loader */
                };

Regards,
Jason

^ permalink raw reply

* Re: [PATCH 5/5] mv643xx_eth: convert to use the Marvell Orion MDIO driver
From: Thomas Petazzoni @ 2013-01-29 16:46 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <5107F88C.8030002@openwrt.org>

Dear Florian Fainelli,

On Tue, 29 Jan 2013 17:27:56 +0100, Florian Fainelli wrote:

> It looks like I introduced two redundant mvmdio instances as ge01
> refers to the ge00 smi bus (the same applies to ge11 and ge10).
> Thanks for spotting this.

Ok, good.

> If you take a closer look at mv643xx_eth you will see that the
> "shared" driver still handles the mconf bus window configuration,
> which is not abstracted yet.

Indeed, I've seen that. But I don't understand why it's done in the
mv643xx_eth_shared_probe(). The mbus window configuration registers are
per-network interface, so this call to mv643xx_eth_conf_mbus_windows()
could presumably be done in mv643xx_eth_probe().

At least in mvneta, we have the same registers, and we do their
initialization in the driver normal (and only) ->probe() routine.

> Besides that, I would rather do it step by step.

Yes, agreed. But I think it would be good to have followed patches that
progressively get rid of the shared driver thing, as it will help in
bringing a proper DT binding in the mv643xx_eth driver. But it
certainly doesn't need to be part of this specific patch.

Thanks,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH 5/5] mv643xx_eth: convert to use the Marvell Orion MDIO driver
From: Florian Fainelli @ 2013-01-29 16:27 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <20130129170131.7964a110@skate>

On 01/29/2013 05:01 PM, Thomas Petazzoni wrote:
> Dear Florian Fainelli,
>
> On Tue, 29 Jan 2013 16:24:08 +0100, Florian Fainelli wrote:
>> This patch converts the Marvell MV643XX ethernet driver to use the
>> Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
>> registering the Marvell MV643XX ethernet driver are also updated to
>> register a Marvell Orion MDIO driver. This driver voluntarily overlaps
>> with the Marvell Ethernet shared registers because it will use a subset
>> of this shared register (shared_base + 0x4 - shared_base + 0x84). The
>> Ethernet driver is also updated to look up for a PHY device using the
>> Orion MDIO bus driver.
>>
>> Signed-off-by: Florian Fainelli <florian@openwrt.org>
>> ---
>>   arch/arm/plat-orion/common.c               |   84 +++++++++++--
> In this file, there was one "MV643XX_ETH_SHARED_NAME" platform_device
> registered for each network interface. Why? If the driver is shared,
> isn't the whole idea to register it only once?
It looks like I introduced two redundant mvmdio instances as ge01 refers 
to the ge00 smi bus (the same applies to ge11 and ge10). Thanks for 
spotting this.

>
> In any case, one of the idea of separating the mvmdio driver from the
> mvneta driver in the first place is that there should be only one
> instance of the mvmdio device, even if there are multiple network
> interfaces. The reason is that from a HW point of the view, the MDIO
> unit is shared between the network interfaces. If you look at
> armada-370-xp.dtsi, there is only one mvmdio device registered, and two
> network interfaces (using the mvneta driver) that are registered (and
> actually up to four network interfaces can exist, they are added by
> some other .dtsi files depending on the specific SoC).
>
> So I don't think there should be one instance of the mvmdio per network
> interface.
>
> Also, I am wondering what's left in this MV643XX_ETH_SHARED_NAME driver
> once the MDIO stuff has been pulled out in a separate driver? I think
> the whole point of this work should be to get rid of this
> MV643XX_ETH_SHARED_NAME driver, no?

If you take a closer look at mv643xx_eth you will see that the "shared" 
driver still handles the mconf bus window configuration, which is not 
abstracted yet. Besides that, I would rather do it step by step.
--
Florian

^ permalink raw reply

* Re: [PATCH 5/5] mv643xx_eth: convert to use the Marvell Orion MDIO driver
From: Thomas Petazzoni @ 2013-01-29 16:01 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <1359473048-26551-6-git-send-email-florian@openwrt.org>

Dear Florian Fainelli,

On Tue, 29 Jan 2013 16:24:08 +0100, Florian Fainelli wrote:
> This patch converts the Marvell MV643XX ethernet driver to use the
> Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
> registering the Marvell MV643XX ethernet driver are also updated to
> register a Marvell Orion MDIO driver. This driver voluntarily overlaps
> with the Marvell Ethernet shared registers because it will use a subset
> of this shared register (shared_base + 0x4 - shared_base + 0x84). The
> Ethernet driver is also updated to look up for a PHY device using the
> Orion MDIO bus driver.
> 
> Signed-off-by: Florian Fainelli <florian@openwrt.org>
> ---
>  arch/arm/plat-orion/common.c               |   84 +++++++++++--

In this file, there was one "MV643XX_ETH_SHARED_NAME" platform_device
registered for each network interface. Why? If the driver is shared,
isn't the whole idea to register it only once?

In any case, one of the idea of separating the mvmdio driver from the
mvneta driver in the first place is that there should be only one
instance of the mvmdio device, even if there are multiple network
interfaces. The reason is that from a HW point of the view, the MDIO
unit is shared between the network interfaces. If you look at
armada-370-xp.dtsi, there is only one mvmdio device registered, and two
network interfaces (using the mvneta driver) that are registered (and
actually up to four network interfaces can exist, they are added by
some other .dtsi files depending on the specific SoC).

So I don't think there should be one instance of the mvmdio per network
interface.

Also, I am wondering what's left in this MV643XX_ETH_SHARED_NAME driver
once the MDIO stuff has been pulled out in a separate driver? I think
the whole point of this work should be to get rid of this
MV643XX_ETH_SHARED_NAME driver, no?

Thanks,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH 4/5] net: mvmdio: allow Device Tree and platform device to coexist
From: Thomas Petazzoni @ 2013-01-29 15:48 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <1359473048-26551-5-git-send-email-florian@openwrt.org>

Dear Florian Fainelli,

On Tue, 29 Jan 2013 16:24:07 +0100, Florian Fainelli wrote:
> This patch changes the Marvell MDIO driver to be registered by using
> both Device Tree and platform device methods. The driver voluntarily
> does not use devm_ioremap() to share the same error path for Device Tree
> and non-Device Tree cases.

Not sure why you think devm_ioremap() can't be used here. Maybe I'm
missing something, but could you explain? If you use devm_ioremap(),
then basically you don't need to do anything in the error path
regarding to the I/O mapping... since it's the whole purpose of the
devm_*() stuff to automagically undo things in the error case, and in
the ->remove() code.

> -	dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> +	if (pdev->dev.of_node) {
> +		dev->regs = of_iomap(pdev->dev.of_node, 0);
> +		if (!dev->regs) {
> +			dev_err(&pdev->dev, "No SMI register address given in DT\n");
> +			ret = -ENODEV;
> +			goto out_free;
> +		}
> +
> +		dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> +	} else {
> +		r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
> +
> +		dev->regs = ioremap(r->start, resource_size(r));
> +		if (!dev->regs) {
> +			dev_err(&pdev->dev, "No SMI register address given\n");
> +			ret = -ENODEV;
> +			goto out_free;
> +		}
> +
> +		dev->err_interrupt = platform_get_irq(pdev, 0);
> +	}

I think you can do a devm_ioremap() and a platform_get_irq() in both
cases here, and therefore keep the code common between the DT case and
the !DT case.

Thanks,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH 2/5] net: mvmdio: rename base register cookie from smireg to regs
From: Thomas Petazzoni @ 2013-01-29 15:34 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <1359473048-26551-3-git-send-email-florian@openwrt.org>

Dear Florian Fainelli,

On Tue, 29 Jan 2013 16:24:05 +0100, Florian Fainelli wrote:
> This patch renames the base register cookie in the mvmdio drive from
> "smireg" to "regs" since a subsequent patch is going to use an ioremap()
> cookie whose size is larger than a single register of 4 bytes. No
> functionnal code change introduced.
> 
> Signed-off-by: Florian Fainelli <florian@openwrt.org>

Acked-by: Thomas Petazzoni <thomas.petazzoni@free-electrons.com>
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH 1/5] net: mvmdio: unmap base register address at driver removal
From: Thomas Petazzoni @ 2013-01-29 15:32 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <1359473048-26551-2-git-send-email-florian@openwrt.org>

Dear Florian Fainelli,

On Tue, 29 Jan 2013 16:24:04 +0100, Florian Fainelli wrote:
> Fix the driver remove callback to unmap the base register address and
> not leak this mapping after the driver has been removed.
> 
> Signed-off-by: Florian Fainelli <florian@openwrt.org>

What about using devm_request_and_ioremap() instead, in order to get
automatic unmap on error and in the ->remove() path?

But maybe it won't work because this memory range is claimed both by
the MDIO driver and the Ethernet driver itself. In that case, you could
use devm_ioremap().

Best regards,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH 3/5] net: mvmdio: enhance driver to support SMI error/done interrupts
From: Thomas Petazzoni @ 2013-01-29 15:39 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <1359473048-26551-4-git-send-email-florian@openwrt.org>

Dear Florian Fainelli,

On Tue, 29 Jan 2013 16:24:06 +0100, Florian Fainelli wrote:

>  #define MVMDIO_SMI_DATA_SHIFT              0
>  #define MVMDIO_SMI_PHY_ADDR_SHIFT          16
> @@ -36,12 +40,28 @@
>  #define MVMDIO_SMI_WRITE_OPERATION         0
>  #define MVMDIO_SMI_READ_VALID              BIT(27)
>  #define MVMDIO_SMI_BUSY                    BIT(28)
> +#define MVMDIO_ERR_INT_CAUSE		   0x007C
> +#define  MVMDIO_ERR_INT_SMI_DONE	   0x00000010
> +#define MVMDIO_ERR_INT_MASK		   0x0080
>  
>  struct orion_mdio_dev {
>  	struct mutex lock;
>  	void __iomem *regs;
> +	/*
> +	 * If we have access to the error interrupt pin (which is
> +	 * somewhat misnamed as it not only reflects internal errors
> +	 * but also reflects SMI completion), use that to wait for
> +	 * SMI access completion instead of polling the SMI busy bit.
> +	 */
> +	int err_interrupt;
> +	wait_queue_head_t smi_busy_wait;
>  };
>  
> +static int orion_mdio_smi_is_done(struct orion_mdio_dev *dev)
> +{
> +	return !(readl(dev->regs) & MVMDIO_SMI_BUSY);
> +}
> +
>  /* Wait for the SMI unit to be ready for another operation
>   */
>  static int orion_mdio_wait_ready(struct mii_bus *bus)
> @@ -50,19 +70,30 @@ static int orion_mdio_wait_ready(struct mii_bus *bus)
>  	int count;
>  	u32 val;
>  
> -	count = 0;
> -	while (1) {
> -		val = readl(dev->regs);
> -		if (!(val & MVMDIO_SMI_BUSY))
> -			break;
> -
> -		if (count > 100) {
> -			dev_err(bus->parent, "Timeout: SMI busy for too long\n");
> -			return -ETIMEDOUT;
> +	if (dev->err_interrupt == NO_IRQ) {
> +		count = 0;
> +		while (1) {
> +			val = readl(dev->regs);
> +			if (!(val & MVMDIO_SMI_BUSY))
> +				break;

What about using your new orion_mdio_smi_is_done() function here?

> +
> +			if (count > 100) {
> +				dev_err(bus->parent,
> +					"Timeout: SMI busy for too long\n");
> +				return -ETIMEDOUT;
> +			}
> +
> +			udelay(10);
> +			count++;
>  		}
> +	}
>  
> -		udelay(10);
> -		count++;
> +	if (!orion_mdio_smi_is_done(dev)) {

Maybe it should be in an else if block so that the waitqueue case is
only considered if there is an IRQ registered? Of course practically
speaking, it's OK because if there is no IRQ, we'll wait in the polling
loop above, and either exit from the function on timeout, or continue
on success. But it still would make the code a little bit clearer, I'd
say.

>  static int orion_mdio_probe(struct platform_device *pdev)
>  {
>  	struct device_node *np = pdev->dev.of_node;
> @@ -181,6 +227,19 @@ static int orion_mdio_probe(struct platform_device *pdev)
>  		return -ENODEV;
>  	}
>  
> +	dev->err_interrupt = NO_IRQ;

Not needed, you already do dev->err_interrupt = something() below.

> +	init_waitqueue_head(&dev->smi_busy_wait);
> +
> +	dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
> +	if (dev->err_interrupt != NO_IRQ) {
> +		ret = devm_request_irq(&pdev->dev, dev->err_interrupt,
> +					orion_mdio_err_irq,
> +					IRQF_SHARED, pdev->name, dev);
> +		if (!ret)
> +			writel(MVMDIO_ERR_INT_SMI_DONE,
> +					dev->regs + MVMDIO_ERR_INT_MASK);
> +	}
> +
>  	mutex_init(&dev->lock);
>  
>  	ret = of_mdiobus_register(bus, np);
> @@ -202,6 +261,8 @@ static int orion_mdio_remove(struct platform_device *pdev)
>  	struct mii_bus *bus = platform_get_drvdata(pdev);
>  	struct orion_mdio_dev *dev = bus->priv;
>  
> +	writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
> +	free_irq(dev->err_interrupt, dev);

free_irq() not needed since the IRQ handler is registered with
devm_request_irq().

>  	mdiobus_unregister(bus);
>  	kfree(bus->irq);
>  	mdiobus_free(bus);

Thanks,

Thomas
-- 
Thomas Petazzoni, Free Electrons
Kernel, drivers, real-time and embedded Linux
development, consulting, training and support.
http://free-electrons.com

^ permalink raw reply

* Re: [PATCH 1/5] net: mvmdio: unmap base register address at driver removal
From: Florian Fainelli @ 2013-01-29 15:35 UTC (permalink / raw)
  To: Thomas Petazzoni
  Cc: Andrew Lunn, Russell King, Jason Cooper, linux-doc,
	devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, davem, Lennert Buytenhek
In-Reply-To: <20130129163242.1df2caf5@skate>

On 01/29/2013 04:32 PM, Thomas Petazzoni wrote:
> Dear Florian Fainelli,
>
> On Tue, 29 Jan 2013 16:24:04 +0100, Florian Fainelli wrote:
>> Fix the driver remove callback to unmap the base register address and
>> not leak this mapping after the driver has been removed.
>>
>> Signed-off-by: Florian Fainelli <florian@openwrt.org>
> What about using devm_request_and_ioremap() instead, in order to get
> automatic unmap on error and in the ->remove() path?

Right now, you are using of_iomap() which eases the task of fetching the 
resource and getting an ioremap cookie, which I why I kept that.

>
> But maybe it won't work because this memory range is claimed both by
> the MDIO driver and the Ethernet driver itself. In that case, you could
> use devm_ioremap().
Then we would loose the facility of of_iomap(), but fair enough, it can 
be inserted as a patch in this serie.

Thanks
--
Florian

^ permalink raw reply

* [PATCH 5/5] mv643xx_eth: convert to use the Marvell Orion MDIO driver
From: Florian Fainelli @ 2013-01-29 15:24 UTC (permalink / raw)
  To: davem
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, Florian Fainelli, Lennert Buytenhek
In-Reply-To: <1359473048-26551-1-git-send-email-florian@openwrt.org>

This patch converts the Marvell MV643XX ethernet driver to use the
Marvell Orion MDIO driver. As a result, PowerPC and ARM platforms
registering the Marvell MV643XX ethernet driver are also updated to
register a Marvell Orion MDIO driver. This driver voluntarily overlaps
with the Marvell Ethernet shared registers because it will use a subset
of this shared register (shared_base + 0x4 - shared_base + 0x84). The
Ethernet driver is also updated to look up for a PHY device using the
Orion MDIO bus driver.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 arch/arm/plat-orion/common.c               |   84 +++++++++++--
 arch/powerpc/platforms/chrp/pegasos_eth.c  |   20 +++
 arch/powerpc/sysdev/mv64x60_dev.c          |   14 ++-
 drivers/net/ethernet/marvell/Kconfig       |    1 +
 drivers/net/ethernet/marvell/mv643xx_eth.c |  187 ++--------------------------
 5 files changed, 113 insertions(+), 193 deletions(-)

diff --git a/arch/arm/plat-orion/common.c b/arch/arm/plat-orion/common.c
index 2d4b641..f9bc66e 100644
--- a/arch/arm/plat-orion/common.c
+++ b/arch/arm/plat-orion/common.c
@@ -238,6 +238,7 @@ static __init void ge_complete(
 	struct mv643xx_eth_shared_platform_data *orion_ge_shared_data,
 	struct resource *orion_ge_resource, unsigned long irq,
 	struct platform_device *orion_ge_shared,
+	struct platform_device *orion_ge_mvmdio,
 	struct mv643xx_eth_platform_data *eth_data,
 	struct platform_device *orion_ge)
 {
@@ -247,6 +248,7 @@ static __init void ge_complete(
 	orion_ge->dev.platform_data = eth_data;
 
 	platform_device_register(orion_ge_shared);
+	platform_device_register(orion_ge_mvmdio);
 	platform_device_register(orion_ge);
 }
 
@@ -258,8 +260,6 @@ struct mv643xx_eth_shared_platform_data orion_ge00_shared_data;
 static struct resource orion_ge00_shared_resources[] = {
 	{
 		.name	= "ge00 base",
-	}, {
-		.name	= "ge00 err irq",
 	},
 };
 
@@ -271,6 +271,19 @@ static struct platform_device orion_ge00_shared = {
 	},
 };
 
+static struct resource orion_ge00_mvmdio_resources[] = {
+	{
+		.name	= "ge00 mvmdio base",
+	}, {
+		.name	= "ge00 mvmdio err irq",
+	},
+};
+
+static struct platform_device orion_ge00_mvmdio = {
+	.name		= "orion-mdio",
+	.id		= 0,
+};
+
 static struct resource orion_ge00_resources[] = {
 	{
 		.name	= "ge00 irq",
@@ -295,10 +308,13 @@ void __init orion_ge00_init(struct mv643xx_eth_platform_data *eth_data,
 			    unsigned int tx_csum_limit)
 {
 	fill_resources(&orion_ge00_shared, orion_ge00_shared_resources,
-		       mapbase + 0x2000, SZ_16K - 1, irq_err);
+		       mapbase + 0x2000, SZ_16K - 1, NO_IRQ);
+	fill_resources(&orion_ge00_mvmdio, orion_ge00_mvmdio_resources,
+			mapbase + 0x2004, 0x84 - 1, irq_err);
 	orion_ge00_shared_data.tx_csum_limit = tx_csum_limit;
 	ge_complete(&orion_ge00_shared_data,
 		    orion_ge00_resources, irq, &orion_ge00_shared,
+		    &orion_ge00_mvmdio,
 		    eth_data, &orion_ge00);
 }
 
@@ -312,9 +328,7 @@ struct mv643xx_eth_shared_platform_data orion_ge01_shared_data = {
 static struct resource orion_ge01_shared_resources[] = {
 	{
 		.name	= "ge01 base",
-	}, {
-		.name	= "ge01 err irq",
-	},
+	}
 };
 
 static struct platform_device orion_ge01_shared = {
@@ -325,6 +339,19 @@ static struct platform_device orion_ge01_shared = {
 	},
 };
 
+static struct resource orion_ge01_mvmdio_resources[] = {
+	{
+		.name	= "ge01 mdio base",
+	}, {
+		.name	= "ge01 mdio err irq",
+	},
+};
+
+static struct platform_device orion_ge01_mvmdio = {
+	.name		= "orion-mdio",
+	.id		= 1,
+};
+
 static struct resource orion_ge01_resources[] = {
 	{
 		.name	= "ge01 irq",
@@ -349,10 +376,13 @@ void __init orion_ge01_init(struct mv643xx_eth_platform_data *eth_data,
 			    unsigned int tx_csum_limit)
 {
 	fill_resources(&orion_ge01_shared, orion_ge01_shared_resources,
-		       mapbase + 0x2000, SZ_16K - 1, irq_err);
+		       mapbase + 0x2000, SZ_16K - 1, NO_IRQ);
+	fill_resources(&orion_ge01_mvmdio, orion_ge01_mvmdio_resources,
+			mapbase + 0x2004, 0x84 - 1, irq_err);
 	orion_ge01_shared_data.tx_csum_limit = tx_csum_limit;
 	ge_complete(&orion_ge01_shared_data,
 		    orion_ge01_resources, irq, &orion_ge01_shared,
+		    &orion_ge01_mvmdio,
 		    eth_data, &orion_ge01);
 }
 
@@ -366,9 +396,7 @@ struct mv643xx_eth_shared_platform_data orion_ge10_shared_data = {
 static struct resource orion_ge10_shared_resources[] = {
 	{
 		.name	= "ge10 base",
-	}, {
-		.name	= "ge10 err irq",
-	},
+	}
 };
 
 static struct platform_device orion_ge10_shared = {
@@ -379,6 +407,19 @@ static struct platform_device orion_ge10_shared = {
 	},
 };
 
+static struct resource orion_ge10_mvmdio_resources[] = {
+	{
+		.name	= "ge10 mvmdio base",
+	}, {
+		.name	= "ge10 mvmdio err irq",
+	},
+};
+
+static struct platform_device orion_ge10_mvmdio = {
+	.name		= "orion-mdio",
+	.id		= 1,
+};
+
 static struct resource orion_ge10_resources[] = {
 	{
 		.name	= "ge10 irq",
@@ -403,8 +444,11 @@ void __init orion_ge10_init(struct mv643xx_eth_platform_data *eth_data,
 {
 	fill_resources(&orion_ge10_shared, orion_ge10_shared_resources,
 		       mapbase + 0x2000, SZ_16K - 1, irq_err);
+	fill_resources(&orion_ge10_mvmdio, orion_ge10_mvmdio_resources,
+		       mapbase + 0x2004, 0x84 - 1, irq_err);
 	ge_complete(&orion_ge10_shared_data,
 		    orion_ge10_resources, irq, &orion_ge10_shared,
+		    &orion_ge10_mvmdio,
 		    eth_data, &orion_ge10);
 }
 
@@ -418,8 +462,6 @@ struct mv643xx_eth_shared_platform_data orion_ge11_shared_data = {
 static struct resource orion_ge11_shared_resources[] = {
 	{
 		.name	= "ge11 base",
-	}, {
-		.name	= "ge11 err irq",
 	},
 };
 
@@ -431,6 +473,19 @@ static struct platform_device orion_ge11_shared = {
 	},
 };
 
+static struct resource orion_ge11_mvmdio_resources[] = {
+	{
+		.name	= "ge11 mvmdio base",
+	}, {
+		.name	= "ge11 mvmdio err irq",
+	},
+};
+
+static struct platform_device orion_ge11_mvmdio = {
+	.name		= "orion-mdio",
+	.id		= 1,
+};
+
 static struct resource orion_ge11_resources[] = {
 	{
 		.name	= "ge11 irq",
@@ -454,9 +509,12 @@ void __init orion_ge11_init(struct mv643xx_eth_platform_data *eth_data,
 			    unsigned long irq_err)
 {
 	fill_resources(&orion_ge11_shared, orion_ge11_shared_resources,
-		       mapbase + 0x2000, SZ_16K - 1, irq_err);
+		       mapbase + 0x2000, SZ_16K - 1, NO_IRQ);
+	fill_resources(&orion_ge11_mvmdio, orion_ge11_mvmdio_resources,
+		       mapbase + 0x2004, 0x84 - 1, irq_err);
 	ge_complete(&orion_ge11_shared_data,
 		    orion_ge11_resources, irq, &orion_ge11_shared,
+		    &orion_ge11_mvmdio,
 		    eth_data, &orion_ge11);
 }
 
diff --git a/arch/powerpc/platforms/chrp/pegasos_eth.c b/arch/powerpc/platforms/chrp/pegasos_eth.c
index 039fc8e..a671508 100644
--- a/arch/powerpc/platforms/chrp/pegasos_eth.c
+++ b/arch/powerpc/platforms/chrp/pegasos_eth.c
@@ -47,6 +47,25 @@ static struct platform_device mv643xx_eth_shared_device = {
 	.resource	= mv643xx_eth_shared_resources,
 };
 
+/*
+ * The orion mdio driver only covers shared + 0x4 up to shared + 0x84 - 1
+ */
+static struct resource mv643xx_eth_mvmdio_resources[] = {
+	[0] = {
+		.name	= "ethernet mdio base",
+		.start	= 0xf1000000 + MV643XX_ETH_SHARED_REGS + 0x4,
+		.end	= 0xf1000000 + MV643XX_ETH_SHARED_REGS + 0x83,
+		.flags	= IORESOURCE_MEM,
+	},
+};
+
+static struct platform_device mv643xx_eth_mvmdio_device = {
+	.name		= "orion-mdio",
+	.id		= 0,
+	.num_resources	= ARRAY_SIZE(mv643xx_eth_mvmdio_resources),
+	.resource	= mv643xx_eth_shared_resources,
+};
+
 static struct resource mv643xx_eth_port1_resources[] = {
 	[0] = {
 		.name	= "eth port1 irq",
@@ -82,6 +101,7 @@ static struct platform_device eth_port1_device = {
 
 static struct platform_device *mv643xx_eth_pd_devs[] __initdata = {
 	&mv643xx_eth_shared_device,
+	&mv643xx_eth_mvmdio_device,
 	&eth_port1_device,
 };
 
diff --git a/arch/powerpc/sysdev/mv64x60_dev.c b/arch/powerpc/sysdev/mv64x60_dev.c
index 0f6af41..630cea3 100644
--- a/arch/powerpc/sysdev/mv64x60_dev.c
+++ b/arch/powerpc/sysdev/mv64x60_dev.c
@@ -214,15 +214,25 @@ static struct platform_device * __init mv64x60_eth_register_shared_pdev(
 						struct device_node *np, int id)
 {
 	struct platform_device *pdev;
-	struct resource r[1];
+	struct resource r[2];
 	int err;
 
 	err = of_address_to_resource(np, 0, &r[0]);
 	if (err)
 		return ERR_PTR(err);
 
+	/* register an orion mdio bus driver */
+	r[1].start = r[0].start + 0x4;
+	r[1].end = r[0].start + 0x84 - 1;
+	r[1].flags = IORESOURCE_MEM;
+
+	pdev = platform_device_register_simple("orion-mdio", id, &r[1], 1);
+	if (!pdev)
+		return pdev;
+
 	pdev = platform_device_register_simple(MV643XX_ETH_SHARED_NAME, id,
-					       r, 1);
+					       &r[0], 1);
+
 	return pdev;
 }
 
diff --git a/drivers/net/ethernet/marvell/Kconfig b/drivers/net/ethernet/marvell/Kconfig
index edfba93..df06061 100644
--- a/drivers/net/ethernet/marvell/Kconfig
+++ b/drivers/net/ethernet/marvell/Kconfig
@@ -23,6 +23,7 @@ config MV643XX_ETH
 	depends on (MV64X60 || PPC32 || PLAT_ORION) && INET
 	select INET_LRO
 	select PHYLIB
+	select MVMDIO
 	---help---
 	  This driver supports the gigabit ethernet MACs in the
 	  Marvell Discovery PPC/MIPS chipset family (MV643XX) and
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index c27b23d8..8ef186f 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -67,14 +67,6 @@ static char mv643xx_eth_driver_version[] = "1.4";
  * Registers shared between all ports.
  */
 #define PHY_ADDR			0x0000
-#define SMI_REG				0x0004
-#define  SMI_BUSY			0x10000000
-#define  SMI_READ_VALID			0x08000000
-#define  SMI_OPCODE_READ		0x04000000
-#define  SMI_OPCODE_WRITE		0x00000000
-#define ERR_INT_CAUSE			0x0080
-#define  ERR_INT_SMI_DONE		0x00000010
-#define ERR_INT_MASK			0x0084
 #define WINDOW_BASE(w)			(0x0200 + ((w) << 3))
 #define WINDOW_SIZE(w)			(0x0204 + ((w) << 3))
 #define WINDOW_REMAP_HIGH(w)		(0x0280 + ((w) << 2))
@@ -264,25 +256,6 @@ struct mv643xx_eth_shared_private {
 	void __iomem *base;
 
 	/*
-	 * Points at the right SMI instance to use.
-	 */
-	struct mv643xx_eth_shared_private *smi;
-
-	/*
-	 * Provides access to local SMI interface.
-	 */
-	struct mii_bus *smi_bus;
-
-	/*
-	 * If we have access to the error interrupt pin (which is
-	 * somewhat misnamed as it not only reflects internal errors
-	 * but also reflects SMI completion), use that to wait for
-	 * SMI access completion instead of polling the SMI busy bit.
-	 */
-	int err_interrupt;
-	wait_queue_head_t smi_busy_wait;
-
-	/*
 	 * Per-port MBUS window access register value.
 	 */
 	u32 win_protect;
@@ -1080,98 +1053,6 @@ static void txq_set_fixed_prio_mode(struct tx_queue *txq)
 }
 
 
-/* mii management interface *************************************************/
-static irqreturn_t mv643xx_eth_err_irq(int irq, void *dev_id)
-{
-	struct mv643xx_eth_shared_private *msp = dev_id;
-
-	if (readl(msp->base + ERR_INT_CAUSE) & ERR_INT_SMI_DONE) {
-		writel(~ERR_INT_SMI_DONE, msp->base + ERR_INT_CAUSE);
-		wake_up(&msp->smi_busy_wait);
-		return IRQ_HANDLED;
-	}
-
-	return IRQ_NONE;
-}
-
-static int smi_is_done(struct mv643xx_eth_shared_private *msp)
-{
-	return !(readl(msp->base + SMI_REG) & SMI_BUSY);
-}
-
-static int smi_wait_ready(struct mv643xx_eth_shared_private *msp)
-{
-	if (msp->err_interrupt == NO_IRQ) {
-		int i;
-
-		for (i = 0; !smi_is_done(msp); i++) {
-			if (i == 10)
-				return -ETIMEDOUT;
-			msleep(10);
-		}
-
-		return 0;
-	}
-
-	if (!smi_is_done(msp)) {
-		wait_event_timeout(msp->smi_busy_wait, smi_is_done(msp),
-				   msecs_to_jiffies(100));
-		if (!smi_is_done(msp))
-			return -ETIMEDOUT;
-	}
-
-	return 0;
-}
-
-static int smi_bus_read(struct mii_bus *bus, int addr, int reg)
-{
-	struct mv643xx_eth_shared_private *msp = bus->priv;
-	void __iomem *smi_reg = msp->base + SMI_REG;
-	int ret;
-
-	if (smi_wait_ready(msp)) {
-		pr_warn("SMI bus busy timeout\n");
-		return -ETIMEDOUT;
-	}
-
-	writel(SMI_OPCODE_READ | (reg << 21) | (addr << 16), smi_reg);
-
-	if (smi_wait_ready(msp)) {
-		pr_warn("SMI bus busy timeout\n");
-		return -ETIMEDOUT;
-	}
-
-	ret = readl(smi_reg);
-	if (!(ret & SMI_READ_VALID)) {
-		pr_warn("SMI bus read not valid\n");
-		return -ENODEV;
-	}
-
-	return ret & 0xffff;
-}
-
-static int smi_bus_write(struct mii_bus *bus, int addr, int reg, u16 val)
-{
-	struct mv643xx_eth_shared_private *msp = bus->priv;
-	void __iomem *smi_reg = msp->base + SMI_REG;
-
-	if (smi_wait_ready(msp)) {
-		pr_warn("SMI bus busy timeout\n");
-		return -ETIMEDOUT;
-	}
-
-	writel(SMI_OPCODE_WRITE | (reg << 21) |
-		(addr << 16) | (val & 0xffff), smi_reg);
-
-	if (smi_wait_ready(msp)) {
-		pr_warn("SMI bus busy timeout\n");
-		return -ETIMEDOUT;
-	}
-
-	return 0;
-}
-
-
 /* statistics ***************************************************************/
 static struct net_device_stats *mv643xx_eth_get_stats(struct net_device *dev)
 {
@@ -2611,47 +2492,6 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 		goto out_free;
 
 	/*
-	 * Set up and register SMI bus.
-	 */
-	if (pd == NULL || pd->shared_smi == NULL) {
-		msp->smi_bus = mdiobus_alloc();
-		if (msp->smi_bus == NULL)
-			goto out_unmap;
-
-		msp->smi_bus->priv = msp;
-		msp->smi_bus->name = "mv643xx_eth smi";
-		msp->smi_bus->read = smi_bus_read;
-		msp->smi_bus->write = smi_bus_write,
-		snprintf(msp->smi_bus->id, MII_BUS_ID_SIZE, "%s-%d",
-			pdev->name, pdev->id);
-		msp->smi_bus->parent = &pdev->dev;
-		msp->smi_bus->phy_mask = 0xffffffff;
-		if (mdiobus_register(msp->smi_bus) < 0)
-			goto out_free_mii_bus;
-		msp->smi = msp;
-	} else {
-		msp->smi = platform_get_drvdata(pd->shared_smi);
-	}
-
-	msp->err_interrupt = NO_IRQ;
-	init_waitqueue_head(&msp->smi_busy_wait);
-
-	/*
-	 * Check whether the error interrupt is hooked up.
-	 */
-	res = platform_get_resource(pdev, IORESOURCE_IRQ, 0);
-	if (res != NULL) {
-		int err;
-
-		err = request_irq(res->start, mv643xx_eth_err_irq,
-				  IRQF_SHARED, "mv643xx_eth", msp);
-		if (!err) {
-			writel(ERR_INT_SMI_DONE, msp->base + ERR_INT_MASK);
-			msp->err_interrupt = res->start;
-		}
-	}
-
-	/*
 	 * (Re-)program MBUS remapping windows if we are asked to.
 	 */
 	dram = mv_mbus_dram_info();
@@ -2666,10 +2506,6 @@ static int mv643xx_eth_shared_probe(struct platform_device *pdev)
 
 	return 0;
 
-out_free_mii_bus:
-	mdiobus_free(msp->smi_bus);
-out_unmap:
-	iounmap(msp->base);
 out_free:
 	kfree(msp);
 out:
@@ -2679,14 +2515,7 @@ out:
 static int mv643xx_eth_shared_remove(struct platform_device *pdev)
 {
 	struct mv643xx_eth_shared_private *msp = platform_get_drvdata(pdev);
-	struct mv643xx_eth_shared_platform_data *pd = pdev->dev.platform_data;
 
-	if (pd == NULL || pd->shared_smi == NULL) {
-		mdiobus_unregister(msp->smi_bus);
-		mdiobus_free(msp->smi_bus);
-	}
-	if (msp->err_interrupt != NO_IRQ)
-		free_irq(msp->err_interrupt, msp);
 	iounmap(msp->base);
 	kfree(msp);
 
@@ -2752,11 +2581,11 @@ static void set_params(struct mv643xx_eth_private *mp,
 static struct phy_device *phy_scan(struct mv643xx_eth_private *mp,
 				   int phy_addr)
 {
-	struct mii_bus *bus = mp->shared->smi->smi_bus;
 	struct phy_device *phydev;
 	int start;
 	int num;
 	int i;
+	char phy_id[MII_BUS_ID_SIZE + 3];
 
 	if (phy_addr == MV643XX_ETH_PHY_ADDR_DEFAULT) {
 		start = phy_addr_get(mp) & 0x1f;
@@ -2766,17 +2595,19 @@ static struct phy_device *phy_scan(struct mv643xx_eth_private *mp,
 		num = 1;
 	}
 
+	/* Attempt to connect to the PHY using orion-mdio */
 	phydev = NULL;
 	for (i = 0; i < num; i++) {
 		int addr = (start + i) & 0x1f;
 
-		if (bus->phy_map[addr] == NULL)
-			mdiobus_scan(bus, addr);
+		snprintf(phy_id, sizeof(phy_id), PHY_ID_FMT,
+				"orion-mdio-mii", addr);
 
-		if (phydev == NULL) {
-			phydev = bus->phy_map[addr];
-			if (phydev != NULL)
-				phy_addr_set(mp, addr);
+		phydev = phy_connect(mp->dev, phy_id, NULL,
+				PHY_INTERFACE_MODE_GMII);
+		if (!IS_ERR(phydev)) {
+			phy_addr_set(mp, addr);
+			break;
 		}
 	}
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 4/5] net: mvmdio: allow Device Tree and platform device to coexist
From: Florian Fainelli @ 2013-01-29 15:24 UTC (permalink / raw)
  To: davem
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, Florian Fainelli, Lennert Buytenhek
In-Reply-To: <1359473048-26551-1-git-send-email-florian@openwrt.org>

This patch changes the Marvell MDIO driver to be registered by using
both Device Tree and platform device methods. The driver voluntarily
does not use devm_ioremap() to share the same error path for Device Tree
and non-Device Tree cases.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 drivers/net/ethernet/marvell/mvmdio.c |   46 +++++++++++++++++++++++----------
 1 file changed, 33 insertions(+), 13 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index cada794..10c593c 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -190,6 +190,7 @@ static irqreturn_t orion_mdio_err_irq(int irq, void *dev_id)
 static int orion_mdio_probe(struct platform_device *pdev)
 {
 	struct device_node *np = pdev->dev.of_node;
+	struct resource *r = NULL;
 	struct mii_bus *bus;
 	struct orion_mdio_dev *dev;
 	int i, ret;
@@ -219,18 +220,31 @@ static int orion_mdio_probe(struct platform_device *pdev)
 		bus->irq[i] = PHY_POLL;
 
 	dev = bus->priv;
-	dev->regs = of_iomap(pdev->dev.of_node, 0);
-	if (!dev->regs) {
-		dev_err(&pdev->dev, "No SMI register address given in DT\n");
-		kfree(bus->irq);
-		mdiobus_free(bus);
-		return -ENODEV;
-	}
-
 	dev->err_interrupt = NO_IRQ;
 	init_waitqueue_head(&dev->smi_busy_wait);
 
-	dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
+	if (pdev->dev.of_node) {
+		dev->regs = of_iomap(pdev->dev.of_node, 0);
+		if (!dev->regs) {
+			dev_err(&pdev->dev, "No SMI register address given in DT\n");
+			ret = -ENODEV;
+			goto out_free;
+		}
+
+		dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
+	} else {
+		r = platform_get_resource(pdev, IORESOURCE_MEM, 0);
+
+		dev->regs = ioremap(r->start, resource_size(r));
+		if (!dev->regs) {
+			dev_err(&pdev->dev, "No SMI register address given\n");
+			ret = -ENODEV;
+			goto out_free;
+		}
+
+		dev->err_interrupt = platform_get_irq(pdev, 0);
+	}
+
 	if (dev->err_interrupt != NO_IRQ) {
 		ret = devm_request_irq(&pdev->dev, dev->err_interrupt,
 					orion_mdio_err_irq,
@@ -242,18 +256,24 @@ static int orion_mdio_probe(struct platform_device *pdev)
 
 	mutex_init(&dev->lock);
 
-	ret = of_mdiobus_register(bus, np);
+	if (np)
+		ret = of_mdiobus_register(bus, np);
+	else
+		ret = mdiobus_register(bus);
 	if (ret < 0) {
 		dev_err(&pdev->dev, "Cannot register MDIO bus (%d)\n", ret);
 		iounmap(dev->regs);
-		kfree(bus->irq);
-		mdiobus_free(bus);
-		return ret;
+		goto out_free;
 	}
 
 	platform_set_drvdata(pdev, bus);
 
 	return 0;
+
+out_free:
+	kfree(bus->irq);
+	mdiobus_free(bus);
+	return ret;
 }
 
 static int orion_mdio_remove(struct platform_device *pdev)
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 3/5] net: mvmdio: enhance driver to support SMI error/done interrupts
From: Florian Fainelli @ 2013-01-29 15:24 UTC (permalink / raw)
  To: davem
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, Florian Fainelli, Lennert Buytenhek
In-Reply-To: <1359473048-26551-1-git-send-email-florian@openwrt.org>

This patch enhances the "mvmdio" to support a SMI error/done interrupt
line which can be used along with a wait queue instead of doing
busy-waiting on the registers. This is a feature which is available in
the mv643xx_eth SMI code and thus reduces again the gap between the two.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 .../devicetree/bindings/net/marvell-orion-mdio.txt |    3 +
 drivers/net/ethernet/marvell/mvmdio.c              |   83 +++++++++++++++++---
 2 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
index 34e7aaf..052b5f2 100644
--- a/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
+++ b/Documentation/devicetree/bindings/net/marvell-orion-mdio.txt
@@ -9,6 +9,9 @@ Required properties:
 - compatible: "marvell,orion-mdio"
 - reg: address and length of the SMI register
 
+Optional properties:
+- interrupts: interrupt line number for the SMI error/done interrupt
+
 The child nodes of the MDIO driver are the individual PHY devices
 connected to this MDIO bus. They must have a "reg" property given the
 PHY address on the MDIO bus.
diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index 5ecda58..cada794 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -24,10 +24,14 @@
 #include <linux/module.h>
 #include <linux/mutex.h>
 #include <linux/phy.h>
+#include <linux/interrupt.h>
 #include <linux/of_address.h>
 #include <linux/of_mdio.h>
+#include <linux/of_irq.h>
 #include <linux/platform_device.h>
 #include <linux/delay.h>
+#include <linux/sched.h>
+#include <linux/wait.h>
 
 #define MVMDIO_SMI_DATA_SHIFT              0
 #define MVMDIO_SMI_PHY_ADDR_SHIFT          16
@@ -36,12 +40,28 @@
 #define MVMDIO_SMI_WRITE_OPERATION         0
 #define MVMDIO_SMI_READ_VALID              BIT(27)
 #define MVMDIO_SMI_BUSY                    BIT(28)
+#define MVMDIO_ERR_INT_CAUSE		   0x007C
+#define  MVMDIO_ERR_INT_SMI_DONE	   0x00000010
+#define MVMDIO_ERR_INT_MASK		   0x0080
 
 struct orion_mdio_dev {
 	struct mutex lock;
 	void __iomem *regs;
+	/*
+	 * If we have access to the error interrupt pin (which is
+	 * somewhat misnamed as it not only reflects internal errors
+	 * but also reflects SMI completion), use that to wait for
+	 * SMI access completion instead of polling the SMI busy bit.
+	 */
+	int err_interrupt;
+	wait_queue_head_t smi_busy_wait;
 };
 
+static int orion_mdio_smi_is_done(struct orion_mdio_dev *dev)
+{
+	return !(readl(dev->regs) & MVMDIO_SMI_BUSY);
+}
+
 /* Wait for the SMI unit to be ready for another operation
  */
 static int orion_mdio_wait_ready(struct mii_bus *bus)
@@ -50,19 +70,30 @@ static int orion_mdio_wait_ready(struct mii_bus *bus)
 	int count;
 	u32 val;
 
-	count = 0;
-	while (1) {
-		val = readl(dev->regs);
-		if (!(val & MVMDIO_SMI_BUSY))
-			break;
-
-		if (count > 100) {
-			dev_err(bus->parent, "Timeout: SMI busy for too long\n");
-			return -ETIMEDOUT;
+	if (dev->err_interrupt == NO_IRQ) {
+		count = 0;
+		while (1) {
+			val = readl(dev->regs);
+			if (!(val & MVMDIO_SMI_BUSY))
+				break;
+
+			if (count > 100) {
+				dev_err(bus->parent,
+					"Timeout: SMI busy for too long\n");
+				return -ETIMEDOUT;
+			}
+
+			udelay(10);
+			count++;
 		}
+	}
 
-		udelay(10);
-		count++;
+	if (!orion_mdio_smi_is_done(dev)) {
+		wait_event_timeout(dev->smi_busy_wait,
+				orion_mdio_smi_is_done(dev),
+				msecs_to_jiffies(100));
+		if (!orion_mdio_smi_is_done(dev))
+			return -ETIMEDOUT;
 	}
 
 	return 0;
@@ -141,6 +172,21 @@ static int orion_mdio_reset(struct mii_bus *bus)
 	return 0;
 }
 
+static irqreturn_t orion_mdio_err_irq(int irq, void *dev_id)
+{
+	struct orion_mdio_dev *dev = dev_id;
+
+	if (readl(dev->regs + MVMDIO_ERR_INT_CAUSE) &
+			MVMDIO_ERR_INT_SMI_DONE) {
+		writel(~MVMDIO_ERR_INT_SMI_DONE,
+				dev->regs + MVMDIO_ERR_INT_CAUSE);
+		wake_up(&dev->smi_busy_wait);
+		return IRQ_HANDLED;
+	}
+
+	return IRQ_NONE;
+}
+
 static int orion_mdio_probe(struct platform_device *pdev)
 {
 	struct device_node *np = pdev->dev.of_node;
@@ -181,6 +227,19 @@ static int orion_mdio_probe(struct platform_device *pdev)
 		return -ENODEV;
 	}
 
+	dev->err_interrupt = NO_IRQ;
+	init_waitqueue_head(&dev->smi_busy_wait);
+
+	dev->err_interrupt = irq_of_parse_and_map(pdev->dev.of_node, 0);
+	if (dev->err_interrupt != NO_IRQ) {
+		ret = devm_request_irq(&pdev->dev, dev->err_interrupt,
+					orion_mdio_err_irq,
+					IRQF_SHARED, pdev->name, dev);
+		if (!ret)
+			writel(MVMDIO_ERR_INT_SMI_DONE,
+					dev->regs + MVMDIO_ERR_INT_MASK);
+	}
+
 	mutex_init(&dev->lock);
 
 	ret = of_mdiobus_register(bus, np);
@@ -202,6 +261,8 @@ static int orion_mdio_remove(struct platform_device *pdev)
 	struct mii_bus *bus = platform_get_drvdata(pdev);
 	struct orion_mdio_dev *dev = bus->priv;
 
+	writel(0, dev->regs + MVMDIO_ERR_INT_MASK);
+	free_irq(dev->err_interrupt, dev);
 	mdiobus_unregister(bus);
 	kfree(bus->irq);
 	mdiobus_free(bus);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 2/5] net: mvmdio: rename base register cookie from smireg to regs
From: Florian Fainelli @ 2013-01-29 15:24 UTC (permalink / raw)
  To: davem
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, Florian Fainelli, Lennert Buytenhek
In-Reply-To: <1359473048-26551-1-git-send-email-florian@openwrt.org>

This patch renames the base register cookie in the mvmdio drive from
"smireg" to "regs" since a subsequent patch is going to use an ioremap()
cookie whose size is larger than a single register of 4 bytes. No
functionnal code change introduced.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 drivers/net/ethernet/marvell/mvmdio.c |   18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index be5c690..5ecda58 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -39,7 +39,7 @@
 
 struct orion_mdio_dev {
 	struct mutex lock;
-	void __iomem *smireg;
+	void __iomem *regs;
 };
 
 /* Wait for the SMI unit to be ready for another operation
@@ -52,7 +52,7 @@ static int orion_mdio_wait_ready(struct mii_bus *bus)
 
 	count = 0;
 	while (1) {
-		val = readl(dev->smireg);
+		val = readl(dev->regs);
 		if (!(val & MVMDIO_SMI_BUSY))
 			break;
 
@@ -87,12 +87,12 @@ static int orion_mdio_read(struct mii_bus *bus, int mii_id,
 	writel(((mii_id << MVMDIO_SMI_PHY_ADDR_SHIFT) |
 		(regnum << MVMDIO_SMI_PHY_REG_SHIFT)  |
 		MVMDIO_SMI_READ_OPERATION),
-	       dev->smireg);
+	       dev->regs);
 
 	/* Wait for the value to become available */
 	count = 0;
 	while (1) {
-		val = readl(dev->smireg);
+		val = readl(dev->regs);
 		if (val & MVMDIO_SMI_READ_VALID)
 			break;
 
@@ -129,7 +129,7 @@ static int orion_mdio_write(struct mii_bus *bus, int mii_id,
 		(regnum << MVMDIO_SMI_PHY_REG_SHIFT)  |
 		MVMDIO_SMI_WRITE_OPERATION            |
 		(value << MVMDIO_SMI_DATA_SHIFT)),
-	       dev->smireg);
+	       dev->regs);
 
 	mutex_unlock(&dev->lock);
 
@@ -173,8 +173,8 @@ static int orion_mdio_probe(struct platform_device *pdev)
 		bus->irq[i] = PHY_POLL;
 
 	dev = bus->priv;
-	dev->smireg = of_iomap(pdev->dev.of_node, 0);
-	if (!dev->smireg) {
+	dev->regs = of_iomap(pdev->dev.of_node, 0);
+	if (!dev->regs) {
 		dev_err(&pdev->dev, "No SMI register address given in DT\n");
 		kfree(bus->irq);
 		mdiobus_free(bus);
@@ -186,7 +186,7 @@ static int orion_mdio_probe(struct platform_device *pdev)
 	ret = of_mdiobus_register(bus, np);
 	if (ret < 0) {
 		dev_err(&pdev->dev, "Cannot register MDIO bus (%d)\n", ret);
-		iounmap(dev->smireg);
+		iounmap(dev->regs);
 		kfree(bus->irq);
 		mdiobus_free(bus);
 		return ret;
@@ -205,7 +205,7 @@ static int orion_mdio_remove(struct platform_device *pdev)
 	mdiobus_unregister(bus);
 	kfree(bus->irq);
 	mdiobus_free(bus);
-	iounmap(dev->smireg);
+	iounmap(dev->regs);
 	return 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 1/5] net: mvmdio: unmap base register address at driver removal
From: Florian Fainelli @ 2013-01-29 15:24 UTC (permalink / raw)
  To: davem
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, Florian Fainelli, Lennert Buytenhek
In-Reply-To: <1359473048-26551-1-git-send-email-florian@openwrt.org>

Fix the driver remove callback to unmap the base register address and
not leak this mapping after the driver has been removed.

Signed-off-by: Florian Fainelli <florian@openwrt.org>
---
 drivers/net/ethernet/marvell/mvmdio.c |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/marvell/mvmdio.c b/drivers/net/ethernet/marvell/mvmdio.c
index 74f1c15..be5c690 100644
--- a/drivers/net/ethernet/marvell/mvmdio.c
+++ b/drivers/net/ethernet/marvell/mvmdio.c
@@ -200,9 +200,12 @@ static int orion_mdio_probe(struct platform_device *pdev)
 static int orion_mdio_remove(struct platform_device *pdev)
 {
 	struct mii_bus *bus = platform_get_drvdata(pdev);
+	struct orion_mdio_dev *dev = bus->priv;
+
 	mdiobus_unregister(bus);
 	kfree(bus->irq);
 	mdiobus_free(bus);
+	iounmap(dev->smireg);
 	return 0;
 }
 
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 0/5] mv643xx_eth: use mvmdio MDIO bus driver
From: Florian Fainelli @ 2013-01-29 15:24 UTC (permalink / raw)
  To: davem
  Cc: Thomas Petazzoni, Andrew Lunn, Russell King, Jason Cooper,
	linux-doc, devicetree-discuss, linux-kernel, Rob Herring, netdev,
	Paul Mackerras, linux-arm-kernel, Rob Landley, Greg Kroah-Hartman,
	linuxppc-dev, Florian Fainelli, Lennert Buytenhek

Hi all,

This patch converts the mv643xx_eth driver to use the mvmdio MDIO bus driver
instead of rolling its own implementation. As a result, all users of this
mv643xx_eth driver are converted to register an "orion-mdio" platform_device.
The mvmdio driver is also updated to support an interrupt line which reports
SMI error/completion, and to allow traditionnal platform device registration
instead of just device tree.

David, I think it makes sense for you to merge all of this, since we do
not want the architecture files to be desynchronized from the mv643xx_eth to
avoid runtime breakage. The potential for merge conflicts should be very small.

You can probably safely merge patches 1 to 4 if Thomas agrees, and we will
see what kind of feeback I get on patch 5.

Florian Fainelli (5):
  net: mvmdio: unmap base register address at driver removal
  net: mvmdio: rename base register cookie from smireg to regs
  net: mvmdio: enhance driver to support SMI error/done interrupts
  net: mvmdio: allow Device Tree and platform device to coexist
  mv643xx_eth: convert to use the Marvell Orion MDIO driver

 .../devicetree/bindings/net/marvell-orion-mdio.txt |    3 +
 arch/arm/plat-orion/common.c                       |   84 +++++++--
 arch/powerpc/platforms/chrp/pegasos_eth.c          |   20 +++
 arch/powerpc/sysdev/mv64x60_dev.c                  |   14 +-
 drivers/net/ethernet/marvell/Kconfig               |    1 +
 drivers/net/ethernet/marvell/mv643xx_eth.c         |  187 +-------------------
 drivers/net/ethernet/marvell/mvmdio.c              |  136 +++++++++++---
 7 files changed, 226 insertions(+), 219 deletions(-)

-- 
1.7.10.4

^ permalink raw reply

* Re: [git pull] Please pull powerpc.git merge branch
From: Geert Uytterhoeven @ 2013-01-29 14:14 UTC (permalink / raw)
  To: Benjamin Herrenschmidt
  Cc: linuxppc-dev list, Andrew Morton, Linus Torvalds,
	Linux Kernel list
In-Reply-To: <1359418161.18955.14.camel@pasglop>

On Tue, Jan 29, 2013 at 1:09 AM, Benjamin Herrenschmidt
<benh@kernel.crashing.org> wrote:
> On Mon, 2013-01-28 at 16:03 -0800, Linus Torvalds wrote:
>> I'll have you know that I haven't quite even left for Au yet, and I
>> have LCA before diving. So no snarky "in between dives" comments,
>> please.
>
> It wasn't meant to be "snarky", sorry about that...

Sharky? Sorry, couldn't resist.

Gr{oetje,eeting}s,

                        Poor Geert without sun, dives, and sharks

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
                                -- Linus Torvalds

^ permalink raw reply

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Simon Jeons @ 2013-01-29 13:04 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1357723959-5416-9-git-send-email-tangchen@cn.fujitsu.com>

Hi Tang,
On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> When memory is removed, the corresponding pagetables should alse be removed.
> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> architecture pagetable removing.

Why don't need to build_all_zonelists like online_pages does during
hot-add path(add_memory)?

> 
> All pages of virtual mapping in removed memory cannot be freedi if some pages
> used as PGD/PUD includes not only removed memory but also other memory. So the
> patch uses the following way to check whether page can be freed or not.
> 
>  1. When removing memory, the page structs of the revmoved memory are filled
>     with 0FD.
>  2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
>     In this case, the page used as PT/PMD can be freed.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 +++---
>  include/linux/bootmem.h              |    1 +
>  4 files changed, 326 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 3c32db8..4b6fd2a 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>   * as a pte too.
>   */
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>  
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ac1723..fe01116 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  }
>  EXPORT_SYMBOL_GPL(arch_add_memory);
>  
> +#define PAGE_INUSE 0xFD
> +
> +static void __meminit free_pagetable(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	bool bootmem = false;
> +	unsigned long magic;
> +	unsigned int nr_pages = 1 << order;
> +
> +	/* bootmem page has reserved flag */
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +		bootmem = true;
> +
> +		magic = (unsigned long)page->lru.next;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else
> +			__free_pages_bootmem(page, order);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +
> +	/*
> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> +	 * are all allocated by bootmem.
> +	 */
> +	if (bootmem) {
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages += nr_pages;
> +		zone_span_writeunlock(zone);
> +		totalram_pages += nr_pages;
> +	}
> +}
> +
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte talbe */
> +	free_pagetable(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd talbe */
> +	free_pagetable(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +/* Return true if pgd is changed, otherwise return false. */
> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return false;
> +	}
> +
> +	/* free a pud table */
> +	free_pagetable(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return true;
> +}
> +
> +static void __meminit
> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long next, pages = 0;
> +	pte_t *pte;
> +	void *page_addr;
> +	phys_addr_t phys_addr;
> +
> +	pte = pte_start + pte_index(addr);
> +	for (; addr < end; addr = next, pte++) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		/*
> +		 * We mapped [0,1G) memory as identity mapping when
> +		 * initializing, in arch/x86/kernel/head_64.S. These
> +		 * pagetables cannot be removed.
> +		 */
> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> +		if (phys_addr < (phys_addr_t)0x40000000)
> +			return;
> +
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			if (!direct) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +			}
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * If we are not removing the whole page, it means
> +			 * other ptes in this page are being used and we canot
> +			 * remove them. So fill the unused ptes with 0xFD, and
> +			 * remove the page when it is wholly filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);
> +			}
> +		}
> +	}
> +
> +	/* Call free_pte_table() in remove_pmd_table(). */
> +	flush_tlb_all();
> +	if (direct)
> +		update_page_count(PG_LEVEL_4K, -pages);
> +}
> +
> +static void __meminit
> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pte_phys, next, pages = 0;
> +	pte_t *pte_base;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_start + pmd_index(addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		if (pmd_large(*pmd)) {
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pmd_page(*pmd),
> +						       get_order(PMD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 2M page, but we need to remove part of them,
> +			 * so split 2M page to 4K page.
> +			 */
> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> +			BUG_ON(!pte_base);
> +			__split_large_page((pte_t *)pmd, addr,
> +					   (pte_t *)pte_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> +		remove_pte_table(pte_base, addr, next, direct);
> +		free_pte_table(pte_base, pmd);
> +		unmap_low_page(pte_base);
> +	}
> +
> +	/* Call free_pmd_table() in remove_pud_table(). */
> +	if (direct)
> +		update_page_count(PG_LEVEL_2M, -pages);
> +}
> +
> +static void __meminit
> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pmd_phys, next, pages = 0;
> +	pmd_t *pmd_base;
> +	pud_t *pud;
> +
> +	pud = pud_start + pud_index(addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +
> +		if (!pud_present(*pud))
> +			continue;
> +
> +		if (pud_large(*pud)) {
> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> +			    IS_ALIGNED(next, PUD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pud_page(*pud),
> +						       get_order(PUD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pud_clear(pud);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 1G page, but we need to remove part of them,
> +			 * so split 1G page to 2M page.
> +			 */
> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> +			BUG_ON(!pmd_base);
> +			__split_large_page((pte_t *)pud, addr,
> +					   (pte_t *)pmd_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> +		remove_pmd_table(pmd_base, addr, next, direct);
> +		free_pmd_table(pmd_base, pud);
> +		unmap_low_page(pmd_base);
> +	}
> +
> +	if (direct)
> +		update_page_count(PG_LEVEL_1G, -pages);
> +}
> +
> +/* start and end are both virtual address. */
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	bool pgd_changed = false;
> +
> +	for (; start < end; start = next) {
> +		pgd = pgd_offset_k(start);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		remove_pud_table(pud, start, next, direct);
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  int __ref arch_remove_memory(u64 start, u64 size)
>  {
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index a718e0d..7dcb6f9 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -501,21 +501,13 @@ out_unlock:
>  	return do_split;
>  }
>  
> -static int split_large_page(pte_t *kpte, unsigned long address)
> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>  {
>  	unsigned long pfn, pfninc = 1;
>  	unsigned int i, level;
> -	pte_t *pbase, *tmp;
> +	pte_t *tmp;
>  	pgprot_t ref_prot;
> -	struct page *base;
> -
> -	if (!debug_pagealloc)
> -		spin_unlock(&cpa_lock);
> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> -	if (!debug_pagealloc)
> -		spin_lock(&cpa_lock);
> -	if (!base)
> -		return -ENOMEM;
> +	struct page *base = virt_to_page(pbase);
>  
>  	spin_lock(&pgd_lock);
>  	/*
> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * up for us already:
>  	 */
>  	tmp = lookup_address(address, &level);
> -	if (tmp != kpte)
> -		goto out_unlock;
> +	if (tmp != kpte) {
> +		spin_unlock(&pgd_lock);
> +		return 1;
> +	}
>  
> -	pbase = (pte_t *)page_address(base);
>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>  	/*
> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * going on.
>  	 */
>  	__flush_tlb_all();
> +	spin_unlock(&pgd_lock);
>  
> -	base = NULL;
> +	return 0;
> +}
>  
> -out_unlock:
> -	/*
> -	 * If we dropped out via the lookup_address check under
> -	 * pgd_lock then stick the page back into the pool:
> -	 */
> -	if (base)
> +static int split_large_page(pte_t *kpte, unsigned long address)
> +{
> +	pte_t *pbase;
> +	struct page *base;
> +
> +	if (!debug_pagealloc)
> +		spin_unlock(&cpa_lock);
> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> +	if (!debug_pagealloc)
> +		spin_lock(&cpa_lock);
> +	if (!base)
> +		return -ENOMEM;
> +
> +	pbase = (pte_t *)page_address(base);
> +	if (__split_large_page(kpte, address, pbase))
>  		__free_page(base);
> -	spin_unlock(&pgd_lock);
>  
>  	return 0;
>  }
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index 3f778c2..190ff06 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>  			      unsigned long size);
>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>  
>  /*
>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

^ permalink raw reply

* Re: [PATCH v6 08/15] memory-hotplug: Common APIs to support page tables hot-remove
From: Simon Jeons @ 2013-01-29 13:02 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1357723959-5416-9-git-send-email-tangchen@cn.fujitsu.com>

Hi Tang,
On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> From: Wen Congyang <wency@cn.fujitsu.com>
> 
> When memory is removed, the corresponding pagetables should alse be removed.
> This patch introduces some common APIs to support vmemmap pagetable and x86_64
> architecture pagetable removing.
> 

When page table of hot-add memory is created?

> All pages of virtual mapping in removed memory cannot be freedi if some pages
> used as PGD/PUD includes not only removed memory but also other memory. So the
> patch uses the following way to check whether page can be freed or not.
> 
>  1. When removing memory, the page structs of the revmoved memory are filled
>     with 0FD.
>  2. All page structs are filled with 0xFD on PT/PMD, PT/PMD can be cleared.
>     In this case, the page used as PT/PMD can be freed.
> 
> Signed-off-by: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com>
> Signed-off-by: Jianguo Wu <wujianguo@huawei.com>
> Signed-off-by: Wen Congyang <wency@cn.fujitsu.com>
> Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
> ---
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_64.c                |  299 ++++++++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 +++---
>  include/linux/bootmem.h              |    1 +
>  4 files changed, 326 insertions(+), 22 deletions(-)
> 
> diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
> index 3c32db8..4b6fd2a 100644
> --- a/arch/x86/include/asm/pgtable_types.h
> +++ b/arch/x86/include/asm/pgtable_types.h
> @@ -352,6 +352,7 @@ static inline void update_page_count(int level, unsigned long pages) { }
>   * as a pte too.
>   */
>  extern pte_t *lookup_address(unsigned long address, unsigned int *level);
> +extern int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase);
>  
>  #endif	/* !__ASSEMBLY__ */
>  
> diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
> index 9ac1723..fe01116 100644
> --- a/arch/x86/mm/init_64.c
> +++ b/arch/x86/mm/init_64.c
> @@ -682,6 +682,305 @@ int arch_add_memory(int nid, u64 start, u64 size)
>  }
>  EXPORT_SYMBOL_GPL(arch_add_memory);
>  
> +#define PAGE_INUSE 0xFD
> +
> +static void __meminit free_pagetable(struct page *page, int order)
> +{
> +	struct zone *zone;
> +	bool bootmem = false;
> +	unsigned long magic;
> +	unsigned int nr_pages = 1 << order;
> +
> +	/* bootmem page has reserved flag */
> +	if (PageReserved(page)) {
> +		__ClearPageReserved(page);
> +		bootmem = true;
> +
> +		magic = (unsigned long)page->lru.next;
> +		if (magic == SECTION_INFO || magic == MIX_SECTION_INFO) {
> +			while (nr_pages--)
> +				put_page_bootmem(page++);
> +		} else
> +			__free_pages_bootmem(page, order);
> +	} else
> +		free_pages((unsigned long)page_address(page), order);
> +
> +	/*
> +	 * SECTION_INFO pages and MIX_SECTION_INFO pages
> +	 * are all allocated by bootmem.
> +	 */
> +	if (bootmem) {
> +		zone = page_zone(page);
> +		zone_span_writelock(zone);
> +		zone->present_pages += nr_pages;
> +		zone_span_writeunlock(zone);
> +		totalram_pages += nr_pages;
> +	}
> +}
> +
> +static void __meminit free_pte_table(pte_t *pte_start, pmd_t *pmd)
> +{
> +	pte_t *pte;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PTE; i++) {
> +		pte = pte_start + i;
> +		if (pte_val(*pte))
> +			return;
> +	}
> +
> +	/* free a pte talbe */
> +	free_pagetable(pmd_page(*pmd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pmd_clear(pmd);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +static void __meminit free_pmd_table(pmd_t *pmd_start, pud_t *pud)
> +{
> +	pmd_t *pmd;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PMD; i++) {
> +		pmd = pmd_start + i;
> +		if (pmd_val(*pmd))
> +			return;
> +	}
> +
> +	/* free a pmd talbe */
> +	free_pagetable(pud_page(*pud), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pud_clear(pud);
> +	spin_unlock(&init_mm.page_table_lock);
> +}
> +
> +/* Return true if pgd is changed, otherwise return false. */
> +static bool __meminit free_pud_table(pud_t *pud_start, pgd_t *pgd)
> +{
> +	pud_t *pud;
> +	int i;
> +
> +	for (i = 0; i < PTRS_PER_PUD; i++) {
> +		pud = pud_start + i;
> +		if (pud_val(*pud))
> +			return false;
> +	}
> +
> +	/* free a pud table */
> +	free_pagetable(pgd_page(*pgd), 0);
> +	spin_lock(&init_mm.page_table_lock);
> +	pgd_clear(pgd);
> +	spin_unlock(&init_mm.page_table_lock);
> +
> +	return true;
> +}
> +
> +static void __meminit
> +remove_pte_table(pte_t *pte_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long next, pages = 0;
> +	pte_t *pte;
> +	void *page_addr;
> +	phys_addr_t phys_addr;
> +
> +	pte = pte_start + pte_index(addr);
> +	for (; addr < end; addr = next, pte++) {
> +		next = (addr + PAGE_SIZE) & PAGE_MASK;
> +		if (next > end)
> +			next = end;
> +
> +		if (!pte_present(*pte))
> +			continue;
> +
> +		/*
> +		 * We mapped [0,1G) memory as identity mapping when
> +		 * initializing, in arch/x86/kernel/head_64.S. These
> +		 * pagetables cannot be removed.
> +		 */
> +		phys_addr = pte_val(*pte) + (addr & PAGE_MASK);
> +		if (phys_addr < (phys_addr_t)0x40000000)
> +			return;
> +
> +		if (IS_ALIGNED(addr, PAGE_SIZE) &&
> +		    IS_ALIGNED(next, PAGE_SIZE)) {
> +			if (!direct) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +			}
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pte_clear(&init_mm, addr, pte);
> +			spin_unlock(&init_mm.page_table_lock);
> +		} else {
> +			/*
> +			 * If we are not removing the whole page, it means
> +			 * other ptes in this page are being used and we canot
> +			 * remove them. So fill the unused ptes with 0xFD, and
> +			 * remove the page when it is wholly filled with 0xFD.
> +			 */
> +			memset((void *)addr, PAGE_INUSE, next - addr);
> +			page_addr = page_address(pte_page(*pte));
> +
> +			if (!memchr_inv(page_addr, PAGE_INUSE, PAGE_SIZE)) {
> +				free_pagetable(pte_page(*pte), 0);
> +				pages++;
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pte_clear(&init_mm, addr, pte);
> +				spin_unlock(&init_mm.page_table_lock);
> +			}
> +		}
> +	}
> +
> +	/* Call free_pte_table() in remove_pmd_table(). */
> +	flush_tlb_all();
> +	if (direct)
> +		update_page_count(PG_LEVEL_4K, -pages);
> +}
> +
> +static void __meminit
> +remove_pmd_table(pmd_t *pmd_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pte_phys, next, pages = 0;
> +	pte_t *pte_base;
> +	pmd_t *pmd;
> +
> +	pmd = pmd_start + pmd_index(addr);
> +	for (; addr < end; addr = next, pmd++) {
> +		next = pmd_addr_end(addr, end);
> +
> +		if (!pmd_present(*pmd))
> +			continue;
> +
> +		if (pmd_large(*pmd)) {
> +			if (IS_ALIGNED(addr, PMD_SIZE) &&
> +			    IS_ALIGNED(next, PMD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pmd_page(*pmd),
> +						       get_order(PMD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pmd_clear(pmd);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 2M page, but we need to remove part of them,
> +			 * so split 2M page to 4K page.
> +			 */
> +			pte_base = (pte_t *)alloc_low_page(&pte_phys);
> +			BUG_ON(!pte_base);
> +			__split_large_page((pte_t *)pmd, addr,
> +					   (pte_t *)pte_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pmd_populate_kernel(&init_mm, pmd, __va(pte_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pte_base = (pte_t *)map_low_page((pte_t *)pmd_page_vaddr(*pmd));
> +		remove_pte_table(pte_base, addr, next, direct);
> +		free_pte_table(pte_base, pmd);
> +		unmap_low_page(pte_base);
> +	}
> +
> +	/* Call free_pmd_table() in remove_pud_table(). */
> +	if (direct)
> +		update_page_count(PG_LEVEL_2M, -pages);
> +}
> +
> +static void __meminit
> +remove_pud_table(pud_t *pud_start, unsigned long addr, unsigned long end,
> +		 bool direct)
> +{
> +	unsigned long pmd_phys, next, pages = 0;
> +	pmd_t *pmd_base;
> +	pud_t *pud;
> +
> +	pud = pud_start + pud_index(addr);
> +	for (; addr < end; addr = next, pud++) {
> +		next = pud_addr_end(addr, end);
> +
> +		if (!pud_present(*pud))
> +			continue;
> +
> +		if (pud_large(*pud)) {
> +			if (IS_ALIGNED(addr, PUD_SIZE) &&
> +			    IS_ALIGNED(next, PUD_SIZE)) {
> +				if (!direct) {
> +					free_pagetable(pud_page(*pud),
> +						       get_order(PUD_SIZE));
> +					pages++;
> +				}
> +
> +				spin_lock(&init_mm.page_table_lock);
> +				pud_clear(pud);
> +				spin_unlock(&init_mm.page_table_lock);
> +				continue;
> +			}
> +
> +			/*
> +			 * We use 1G page, but we need to remove part of them,
> +			 * so split 1G page to 2M page.
> +			 */
> +			pmd_base = (pmd_t *)alloc_low_page(&pmd_phys);
> +			BUG_ON(!pmd_base);
> +			__split_large_page((pte_t *)pud, addr,
> +					   (pte_t *)pmd_base);
> +
> +			spin_lock(&init_mm.page_table_lock);
> +			pud_populate(&init_mm, pud, __va(pmd_phys));
> +			spin_unlock(&init_mm.page_table_lock);
> +
> +			flush_tlb_all();
> +		}
> +
> +		pmd_base = (pmd_t *)map_low_page((pmd_t *)pud_page_vaddr(*pud));
> +		remove_pmd_table(pmd_base, addr, next, direct);
> +		free_pmd_table(pmd_base, pud);
> +		unmap_low_page(pmd_base);
> +	}
> +
> +	if (direct)
> +		update_page_count(PG_LEVEL_1G, -pages);
> +}
> +
> +/* start and end are both virtual address. */
> +static void __meminit
> +remove_pagetable(unsigned long start, unsigned long end, bool direct)
> +{
> +	unsigned long next;
> +	pgd_t *pgd;
> +	pud_t *pud;
> +	bool pgd_changed = false;
> +
> +	for (; start < end; start = next) {
> +		pgd = pgd_offset_k(start);
> +		if (!pgd_present(*pgd))
> +			continue;
> +
> +		next = pgd_addr_end(start, end);
> +
> +		pud = (pud_t *)map_low_page((pud_t *)pgd_page_vaddr(*pgd));
> +		remove_pud_table(pud, start, next, direct);
> +		if (free_pud_table(pud, pgd))
> +			pgd_changed = true;
> +		unmap_low_page(pud);
> +	}
> +
> +	if (pgd_changed)
> +		sync_global_pgds(start, end - 1);
> +
> +	flush_tlb_all();
> +}
> +
>  #ifdef CONFIG_MEMORY_HOTREMOVE
>  int __ref arch_remove_memory(u64 start, u64 size)
>  {
> diff --git a/arch/x86/mm/pageattr.c b/arch/x86/mm/pageattr.c
> index a718e0d..7dcb6f9 100644
> --- a/arch/x86/mm/pageattr.c
> +++ b/arch/x86/mm/pageattr.c
> @@ -501,21 +501,13 @@ out_unlock:
>  	return do_split;
>  }
>  
> -static int split_large_page(pte_t *kpte, unsigned long address)
> +int __split_large_page(pte_t *kpte, unsigned long address, pte_t *pbase)
>  {
>  	unsigned long pfn, pfninc = 1;
>  	unsigned int i, level;
> -	pte_t *pbase, *tmp;
> +	pte_t *tmp;
>  	pgprot_t ref_prot;
> -	struct page *base;
> -
> -	if (!debug_pagealloc)
> -		spin_unlock(&cpa_lock);
> -	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> -	if (!debug_pagealloc)
> -		spin_lock(&cpa_lock);
> -	if (!base)
> -		return -ENOMEM;
> +	struct page *base = virt_to_page(pbase);
>  
>  	spin_lock(&pgd_lock);
>  	/*
> @@ -523,10 +515,11 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * up for us already:
>  	 */
>  	tmp = lookup_address(address, &level);
> -	if (tmp != kpte)
> -		goto out_unlock;
> +	if (tmp != kpte) {
> +		spin_unlock(&pgd_lock);
> +		return 1;
> +	}
>  
> -	pbase = (pte_t *)page_address(base);
>  	paravirt_alloc_pte(&init_mm, page_to_pfn(base));
>  	ref_prot = pte_pgprot(pte_clrhuge(*kpte));
>  	/*
> @@ -579,17 +572,27 @@ static int split_large_page(pte_t *kpte, unsigned long address)
>  	 * going on.
>  	 */
>  	__flush_tlb_all();
> +	spin_unlock(&pgd_lock);
>  
> -	base = NULL;
> +	return 0;
> +}
>  
> -out_unlock:
> -	/*
> -	 * If we dropped out via the lookup_address check under
> -	 * pgd_lock then stick the page back into the pool:
> -	 */
> -	if (base)
> +static int split_large_page(pte_t *kpte, unsigned long address)
> +{
> +	pte_t *pbase;
> +	struct page *base;
> +
> +	if (!debug_pagealloc)
> +		spin_unlock(&cpa_lock);
> +	base = alloc_pages(GFP_KERNEL | __GFP_NOTRACK, 0);
> +	if (!debug_pagealloc)
> +		spin_lock(&cpa_lock);
> +	if (!base)
> +		return -ENOMEM;
> +
> +	pbase = (pte_t *)page_address(base);
> +	if (__split_large_page(kpte, address, pbase))
>  		__free_page(base);
> -	spin_unlock(&pgd_lock);
>  
>  	return 0;
>  }
> diff --git a/include/linux/bootmem.h b/include/linux/bootmem.h
> index 3f778c2..190ff06 100644
> --- a/include/linux/bootmem.h
> +++ b/include/linux/bootmem.h
> @@ -53,6 +53,7 @@ extern void free_bootmem_node(pg_data_t *pgdat,
>  			      unsigned long size);
>  extern void free_bootmem(unsigned long physaddr, unsigned long size);
>  extern void free_bootmem_late(unsigned long physaddr, unsigned long size);
> +extern void __free_pages_bootmem(struct page *page, unsigned int order);
>  
>  /*
>   * Flags for reserve_bootmem (also if CONFIG_HAVE_ARCH_BOOTMEM_NODE,

^ permalink raw reply

* Re: [PATCH v6 00/15] memory-hotplug: hot-remove physical memory
From: Simon Jeons @ 2013-01-29 12:52 UTC (permalink / raw)
  To: Tang Chen
  Cc: linux-ia64, linux-sh, linux-mm, paulus, hpa, sparclinux, cl,
	linux-s390, x86, linux-acpi, isimatu.yasuaki, linfeng, mgorman,
	kosaki.motohiro, rientjes, len.brown, wency, cmetcalf, glommer,
	wujianguo, yinghai, laijs, linux-kernel, minchan.kim, akpm,
	linuxppc-dev
In-Reply-To: <1357723959-5416-1-git-send-email-tangchen@cn.fujitsu.com>

Hi Tang,

On Wed, 2013-01-09 at 17:32 +0800, Tang Chen wrote:
> Here is the physical memory hot-remove patch-set based on 3.8rc-2.

Some questions ask you, not has relationship with this patchset, but is
memory hotplug stuff.

1. In function node_states_check_changes_online:

comments:
* If we don't have HIGHMEM nor movable node,
* node_states[N_NORMAL_MEMORY] contains nodes which have zones of
* 0...ZONE_MOVABLE, set zone_last to ZONE_MOVABLE.

How to understand it? Why we don't have HIGHMEM nor movable node and
node_staes[N_NORMAL_MEMORY] contains 0...ZONE_MOVABLE, IIUC,
N_NORMAL_MEMORY only means the node has regular memory.

* If we don't have movable node, node_states[N_NORMAL_MEMORY]
* contains nodes which have zones of 0...ZONE_MOVABLE,
* set zone_last to ZONE_MOVABLE.

How to understand?

2. In function move_pfn_range_left, why end <= z2->zone_start_pfn is not
correct? The comments said that must include/overlap, why?

3. In function online_pages, the normal case(w/o online_kenrel,
online_movable), why not check if the new zone is overlap with adjacent
zones?

4. Could you summarize the difference implementation between hot-add and
logic-add, hot-remove and logic-remove?   


> 
> This patch-set aims to implement physical memory hot-removing.
> 
> The patches can free/remove the following things:
> 
>   - /sys/firmware/memmap/X/{end, start, type} : [PATCH 4/15]
>   - memmap of sparse-vmemmap                  : [PATCH 6,7,8,10/15]
>   - page table of removed memory              : [RFC PATCH 7,8,10/15]
>   - node and related sysfs files              : [RFC PATCH 13-15/15]
> 
> 
> Existing problem:
> If CONFIG_MEMCG is selected, we will allocate memory to store page cgroup
> when we online pages.
> 
> For example: there is a memory device on node 1. The address range
> is [1G, 1.5G). You will find 4 new directories memory8, memory9, memory10,
> and memory11 under the directory /sys/devices/system/memory/.
> 
> If CONFIG_MEMCG is selected, when we online memory8, the memory stored page
> cgroup is not provided by this memory device. But when we online memory9, the
> memory stored page cgroup may be provided by memory8. So we can't offline
> memory8 now. We should offline the memory in the reversed order.
> 
> When the memory device is hotremoved, we will auto offline memory provided
> by this memory device. But we don't know which memory is onlined first, so
> offlining memory may fail.
> 
> In patch1, we provide a solution which is not good enough:
> Iterate twice to offline the memory.
> 1st iterate: offline every non primary memory block.
> 2nd iterate: offline primary (i.e. first added) memory block.
> 
> And a new idea from Wen Congyang <wency@cn.fujitsu.com> is:
> allocate the memory from the memory block they are describing.
> 
> But we are not sure if it is OK to do so because there is not existing API
> to do so, and we need to move page_cgroup memory allocation from MEM_GOING_ONLINE
> to MEM_ONLINE. And also, it may interfere the hugepage.
> 
> 
> 
> How to test this patchset?
> 1. apply this patchset and build the kernel. MEMORY_HOTPLUG, MEMORY_HOTREMOVE,
>    ACPI_HOTPLUG_MEMORY must be selected.
> 2. load the module acpi_memhotplug
> 3. hotplug the memory device(it depends on your hardware)
>    You will see the memory device under the directory /sys/bus/acpi/devices/.
>    Its name is PNP0C80:XX.
> 4. online/offline pages provided by this memory device
>    You can write online/offline to /sys/devices/system/memory/memoryX/state to
>    online/offline pages provided by this memory device
> 5. hotremove the memory device
>    You can hotremove the memory device by the hardware, or writing 1 to
>    /sys/bus/acpi/devices/PNP0C80:XX/eject.

Is there a similar knode to hot-add the memory device?

> 
> 
> Note: if the memory provided by the memory device is used by the kernel, it
> can't be offlined. It is not a bug.
> 
> 
> Changelogs from v5 to v6:
>  Patch3: Add some more comments to explain memory hot-remove.
>  Patch4: Remove bootmem member in struct firmware_map_entry.
>  Patch6: Repeatedly register bootmem pages when using hugepage.
>  Patch8: Repeatedly free bootmem pages when using hugepage.
>  Patch14: Don't free pgdat when offlining a node, just reset it to 0.
>  Patch15: New patch, pgdat is not freed in patch14, so don't allocate a new
>           one when online a node.
> 
> Changelogs from v4 to v5:
>  Patch7: new patch, move pgdat_resize_lock into sparse_remove_one_section() to
>          avoid disabling irq because we need flush tlb when free pagetables.
>  Patch8: new patch, pick up some common APIs that are used to free direct mapping
>          and vmemmap pagetables.
>  Patch9: free direct mapping pagetables on x86_64 arch.
>  Patch10: free vmemmap pagetables.
>  Patch11: since freeing memmap with vmemmap has been implemented, the config
>           macro CONFIG_SPARSEMEM_VMEMMAP when defining __remove_section() is
>           no longer needed.
>  Patch13: no need to modify acpi_memory_disable_device() since it was removed,
>           and add nid parameter when calling remove_memory().
> 
> Changelogs from v3 to v4:
>  Patch7: remove unused codes.
>  Patch8: fix nr_pages that is passed to free_map_bootmem()
> 
> Changelogs from v2 to v3:
>  Patch9: call sync_global_pgds() if pgd is changed
>  Patch10: fix a problem int the patch
> 
> Changelogs from v1 to v2:
>  Patch1: new patch, offline memory twice. 1st iterate: offline every non primary
>          memory block. 2nd iterate: offline primary (i.e. first added) memory
>          block.
> 
>  Patch3: new patch, no logical change, just remove reduntant codes.
> 
>  Patch9: merge the patch from wujianguo into this patch. flush tlb on all cpu
>          after the pagetable is changed.
> 
>  Patch12: new patch, free node_data when a node is offlined.
> 
> 
> Tang Chen (6):
>   memory-hotplug: move pgdat_resize_lock into
>     sparse_remove_one_section()
>   memory-hotplug: remove page table of x86_64 architecture
>   memory-hotplug: remove memmap of sparse-vmemmap
>   memory-hotplug: Integrated __remove_section() of
>     CONFIG_SPARSEMEM_VMEMMAP.
>   memory-hotplug: remove sysfs file of node
>   memory-hotplug: Do not allocate pdgat if it was not freed when
>     offline.
> 
> Wen Congyang (5):
>   memory-hotplug: try to offline the memory twice to avoid dependence
>   memory-hotplug: remove redundant codes
>   memory-hotplug: introduce new function arch_remove_memory() for
>     removing page table depends on architecture
>   memory-hotplug: Common APIs to support page tables hot-remove
>   memory-hotplug: free node_data when a node is offlined
> 
> Yasuaki Ishimatsu (4):
>   memory-hotplug: check whether all memory blocks are offlined or not
>     when removing memory
>   memory-hotplug: remove /sys/firmware/memmap/X sysfs
>   memory-hotplug: implement register_page_bootmem_info_section of
>     sparse-vmemmap
>   memory-hotplug: memory_hotplug: clear zone when removing the memory
> 
>  arch/arm64/mm/mmu.c                  |    3 +
>  arch/ia64/mm/discontig.c             |   10 +
>  arch/ia64/mm/init.c                  |   18 ++
>  arch/powerpc/mm/init_64.c            |   10 +
>  arch/powerpc/mm/mem.c                |   12 +
>  arch/s390/mm/init.c                  |   12 +
>  arch/s390/mm/vmem.c                  |   10 +
>  arch/sh/mm/init.c                    |   17 ++
>  arch/sparc/mm/init_64.c              |   10 +
>  arch/tile/mm/init.c                  |    8 +
>  arch/x86/include/asm/pgtable_types.h |    1 +
>  arch/x86/mm/init_32.c                |   12 +
>  arch/x86/mm/init_64.c                |  390 +++++++++++++++++++++++++++++
>  arch/x86/mm/pageattr.c               |   47 ++--
>  drivers/acpi/acpi_memhotplug.c       |    8 +-
>  drivers/base/memory.c                |    6 +
>  drivers/firmware/memmap.c            |   96 +++++++-
>  include/linux/bootmem.h              |    1 +
>  include/linux/firmware-map.h         |    6 +
>  include/linux/memory_hotplug.h       |   15 +-
>  include/linux/mm.h                   |    4 +-
>  mm/memory_hotplug.c                  |  459 +++++++++++++++++++++++++++++++---
>  mm/sparse.c                          |    8 +-
>  23 files changed, 1094 insertions(+), 69 deletions(-)
> 
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to majordomo@kvack.org.  For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply

* Re: [PATCH v5 04/45] percpu_rwlock: Implement the core design of Per-CPU Reader-Writer Locks
From: Namhyung Kim @ 2013-01-29 11:12 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, walken, mingo,
	linux-arch, linux, xiaoguangrong, wangyun, paulmck, nikunj,
	linux-pm, rusty, rostedt, rjw, tglx, linux-arm-kernel, netdev,
	oleg, sbw, Tejun Heo, akpm, linuxppc-dev
In-Reply-To: <5100B8CC.4080406@linux.vnet.ibm.com>

On Thu, 24 Jan 2013 10:00:04 +0530, Srivatsa S. Bhat wrote:
> On 01/24/2013 01:27 AM, Tejun Heo wrote:
>> On Thu, Jan 24, 2013 at 01:03:52AM +0530, Srivatsa S. Bhat wrote:
>>> CPU 0                          CPU 1
>>>
>>> read_lock(&rwlock)
>>>
>>>                               write_lock(&rwlock) //spins, because CPU 0
>>>                               //has acquired the lock for read
>>>
>>> read_lock(&rwlock)
>>>    ^^^^^
>>> What happens here? Does CPU 0 start spinning (and hence deadlock) or will
>>> it continue realizing that it already holds the rwlock for read?
>> 
>> I don't think rwlock allows nesting write lock inside read lock.
>> read_lock(); write_lock() will always deadlock.
>> 
>
> Sure, I understand that :-) My question was, what happens when *two* CPUs
> are involved, as in, the read_lock() is invoked only on CPU 0 whereas the
> write_lock() is invoked on CPU 1.
>
> For example, the same scenario shown above, but with slightly different
> timing, will NOT result in a deadlock:
>
> Scenario 2:
>   CPU 0                                CPU 1
>
> read_lock(&rwlock)
>
>
> read_lock(&rwlock) //doesn't spin
>
>                                     write_lock(&rwlock) //spins, because CPU 0
>                                     //has acquired the lock for read
>
>
> So I was wondering whether the "fairness" logic of rwlocks would cause
> the second read_lock() to spin (in the first scenario shown above) because
> a writer is already waiting (and hence new readers should spin) and thus
> cause a deadlock.

In my understanding, current x86 rwlock does basically this (of course,
in an atomic fashion):


#define RW_LOCK_BIAS 0x10000

rwlock_init(rwlock)
{
	rwlock->lock = RW_LOCK_BIAS;
}

arch_read_lock(rwlock)
{
retry:
	if (--rwlock->lock >= 0)
		return;

        rwlock->lock++;
        while (rwlock->lock < 1)
        	continue;

        goto retry;
}

arch_write_lock(rwlock)
{
retry:
	if ((rwlock->lock -= RW_LOCK_BIAS) == 0)
        	return;

        rwlock->lock += RW_LOCK_BIAS;
	while (rwlock->lock != RW_LOCK_BIAS)
		continue;

        goto retry;
}


So I can't find where the 'fairness' logic comes from..

Thanks,
Namhyung

^ permalink raw reply

* Re: [PATCH 2/2] pseries/iommu: remove DDW on kexec
From: Michael Ellerman @ 2013-01-29 10:58 UTC (permalink / raw)
  To: Nishanth Aravamudan; +Cc: miltonm, paulus, anton, nfont, linuxppc-dev
In-Reply-To: <20130129020357.GB12156@linux.vnet.ibm.com>

On Mon, 2013-01-28 at 18:03 -0800, Nishanth Aravamudan wrote:
> pseries/iommu: remove DDW on kexec
>  ...
>     
> I believe the simplest, easiest-to-maintain fix is to just change our
> initcall to, rather than detecting and updating the new kernel's DDW
> knowledge, just remove all DDW configurations. When the drivers
> re-initialize, we will set everything back up as it was before.

I don't know this code at all, but this sounds like it will also work
for kdump, right? ie. when the original kernel has crashed the 2nd
kernel will tear the DDW down and set it back up.

cheers

^ permalink raw reply

* Re: [PATCH v5 07/45] CPU hotplug: Provide APIs to prevent CPU offline from atomic context
From: Namhyung Kim @ 2013-01-29 10:21 UTC (permalink / raw)
  To: Srivatsa S. Bhat
  Cc: linux-doc, peterz, fweisbec, linux-kernel, mingo, linux-arch,
	linux, xiaoguangrong, wangyun, paulmck, nikunj, linux-pm, rusty,
	rostedt, rjw, tglx, linux-arm-kernel, netdev, oleg, sbw, tj, akpm,
	linuxppc-dev
In-Reply-To: <20130122073446.13822.39253.stgit@srivatsabhat.in.ibm.com>

Hi Srivatsa,

On Tue, 22 Jan 2013 13:04:54 +0530, Srivatsa S. Bhat wrote:
> @@ -246,15 +291,21 @@ struct take_cpu_down_param {
>  static int __ref take_cpu_down(void *_param)
>  {
>  	struct take_cpu_down_param *param = _param;
> -	int err;
> +	unsigned long flags;
> +	int err = 0;

It seems no need to set 'err' to 0.

Thanks,
Namhyung

> +
> +	percpu_write_lock_irqsave(&hotplug_pcpu_rwlock, &flags);
>  
>  	/* Ensure this CPU doesn't handle any more interrupts. */
>  	err = __cpu_disable();
>  	if (err < 0)
> -		return err;
> +		goto out;
>  
>  	cpu_notify(CPU_DYING | param->mod, param->hcpu);
> -	return 0;
> +
> +out:
> +	percpu_write_unlock_irqrestore(&hotplug_pcpu_rwlock, &flags);
> +	return err;
>  }
>  
>  /* Requires cpu_add_remove_lock to be held */
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-pm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* [PATCH] powerpc/512x: initialize clocks before bus probing
From: Anatolij Gustschin @ 2013-01-29  8:09 UTC (permalink / raw)
  To: linuxppc-dev

Early driver probing can fail due to not available clocks
(clk_get() fails) since the clk API init didn't take place yet.
Move clocks init before bus probing.

Signed-off-by: Anatolij Gustschin <agust@denx.de>
---
 arch/powerpc/platforms/512x/mpc512x_shared.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/platforms/512x/mpc512x_shared.c b/arch/powerpc/platforms/512x/mpc512x_shared.c
index 37fc7a3..34bad13 100644
--- a/arch/powerpc/platforms/512x/mpc512x_shared.c
+++ b/arch/powerpc/platforms/512x/mpc512x_shared.c
@@ -428,8 +428,8 @@ void __init mpc512x_psc_fifo_init(void)
 
 void __init mpc512x_init(void)
 {
-	mpc512x_declare_of_platform_devices();
 	mpc5121_clk_init();
+	mpc512x_declare_of_platform_devices();
 	mpc512x_restart_init();
 	mpc512x_psc_fifo_init();
 }
-- 
1.7.11.7

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox