Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH v3 1/3] phylib: Convert MDIO and PHY Lib drivers to support 10G
From: David Daney @ 2011-10-13 16:00 UTC (permalink / raw)
  To: Andy Fleming; +Cc: davem, netdev
In-Reply-To: <1318516660-25452-2-git-send-email-afleming@freescale.com>

On 10/13/2011 07:37 AM, Andy Fleming wrote:
> 10G MDIO is a totally different protocol (clause 45 of 802.3).
> Supporting this new protocol requires a couple of changes:
>
> * Add a new parameter to the mdiobus_read functions to specify the
>    "device address" inside the PHY.
> * Add a phy45_read/write function which takes advantage of that
>    new parameter
> * Convert all of the existing drivers to use the new format
>
> I created a new clause-45-specific read/write functions because:
> 1) phy_read and phy_write are highly overloaded functions, and
>     finding every instance which is actually the PHY Lib version
>     was quite difficult
> 2) Most code which invokes phy_read/phy_write inside PHY Lib is
>     Clause-22-specific. None of the phy_read/phy_write invocations
>     were useable on 10G PHYs
>

I think converting all these phy_read/phy_write to take an extra
parameter is a mistake.  99% of the users have no need for the "device
address".  Also you are still passing the protocol mode as a high
order bit in the register address, so that part is still quite ugly.

The existing infrastructure where we pass the "device address" in bits
16..20 of the register number is much less disruptive.

If you don't like it, an easy and much less intrusive approach might
be a simple (untested) wrapper:

static inline int phy45_read(struct phy_device *phydev,
                              int devad, u16 regnum)
{
	u32 c45_reg = MII_ADDR_C45 | ((devad & 0x1f) << 16) | regnum;
	return phy_read(phydev, c45_reg)
}

static inline int phy45_write(struct phy_device *phydev,
                               int devad, u16 regnum, u16 val)
{
	u32 c45_reg = MII_ADDR_C45 | ((devad & 0x1f) << 16) | regnum;
	return phy_write(phydev, c45_reg, val)
}


> Signed-off-by: Andy Fleming<afleming@freescale.com>
> ---
> v2: Convert newer buses, split out generic PHY support
> v3: Make patch series more coherent
>
>   Documentation/networking/phy.txt                  |   15 +++--
>   arch/powerpc/platforms/pasemi/gpio_mdio.c         |    6 +-
>   drivers/net/ethernet/adi/bfin_mac.c               |    7 +-
>   drivers/net/ethernet/aeroflex/greth.c             |    5 +-
>   drivers/net/ethernet/amd/au1000_eth.c             |    7 +-
>   drivers/net/ethernet/broadcom/bcm63xx_enet.c      |    4 +-
>   drivers/net/ethernet/broadcom/sb1250-mac.c        |    7 +-
>   drivers/net/ethernet/broadcom/tg3.c               |    5 +-
>   drivers/net/ethernet/cadence/macb.c               |    7 +-
>   drivers/net/ethernet/dnet.c                       |    7 +-
>   drivers/net/ethernet/ethoc.c                      |    5 +-
>   drivers/net/ethernet/faraday/ftgmac100.c          |    5 +-
>   drivers/net/ethernet/freescale/fec.c              |    7 +-
>   drivers/net/ethernet/freescale/fec_mpc52xx_phy.c  |    7 +-
>   drivers/net/ethernet/freescale/fs_enet/mii-fec.c  |    6 +-
>   drivers/net/ethernet/freescale/fsl_pq_mdio.c      |   13 ++--
>   drivers/net/ethernet/freescale/fsl_pq_mdio.h      |   11 ++-
>   drivers/net/ethernet/lantiq_etop.c                |    5 +-
>   drivers/net/ethernet/marvell/mv643xx_eth.c        |    5 +-
>   drivers/net/ethernet/marvell/pxa168_eth.c         |    7 +-
>   drivers/net/ethernet/rdc/r6040.c                  |    5 +-
>   drivers/net/ethernet/s6gmac.c                     |    5 +-
>   drivers/net/ethernet/smsc/smsc911x.c              |   22 ++++---
>   drivers/net/ethernet/smsc/smsc9420.c              |   10 ++-
>   drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c |    9 ++-
>   drivers/net/ethernet/ti/cpmac.c                   |    4 +-
>   drivers/net/ethernet/ti/davinci_mdio.c            |    5 +-
>   drivers/net/ethernet/toshiba/tc35815.c            |    5 +-
>   drivers/net/ethernet/xilinx/ll_temac_mdio.c       |    5 +-
>   drivers/net/ethernet/xilinx/xilinx_emaclite.c     |    9 ++-
>   drivers/net/ethernet/xscale/ixp4xx_eth.c          |    7 +-
>   drivers/net/phy/fixed.c                           |    5 +-
>   drivers/net/phy/icplus.c                          |   17 +++--
>   drivers/net/phy/mdio-bitbang.c                    |    5 +-
>   drivers/net/phy/mdio-octeon.c                     |    5 +-
>   drivers/net/phy/mdio_bus.c                        |    8 +-
>   drivers/net/phy/phy.c                             |    5 +-
>   drivers/net/phy/phy_device.c                      |   64 +++++++++++++------
>   include/linux/phy.h                               |   70 ++++++++++++++++++---
>   net/dsa/slave.c                                   |    5 +-
>   40 files changed, 270 insertions(+), 141 deletions(-)
>

^ permalink raw reply

* RE: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Hans Schillström @ 2011-10-13 15:59 UTC (permalink / raw)
  To: Jiri Pirko, Maxime Bizon
  Cc: John Fastabend, davem@davemloft.net, netdev@vger.kernel.org,
	jesse@nicira.com, fubar@us.ibm.com
In-Reply-To: <20111013153850.GA2031@minipsycho>

Thu, Oct 13, 2011 at 05:04:34PM CEST, mbizon@freebox.fr wrote:
>>
>>On Tue, 2011-10-11 at 00:37 +0200, Jiri Pirko wrote:
>>
>>> Hmm, I must look at this again tomorrow but I have strong feeling this
>>> will break some some scenario including vlan-bridge-macvlan.
>>
>>unless I'm mistaken, today's behaviour:
>>
>># vconfig add eth0 100
>># brctl addbr br0
>># brctl addif br0 eth0
>>
>>=> eth0.100 gets no more packets, br0.100 is to be used
>>
>>after the patch won't we get the opposite ?
>
>Looks like it. The question is what is the correct behaviour...

I think this it become correct now, you should not destroy lover level if possible.
I.e. as John wrote "it's not an unexpected behaviour"

Consider adding a bridge to a vlan like this

vconfig add eth0 100
brctl addbr br1
brctl addif br1 eth0.100

If you later add a bridge (or bond) should the previous added bridge still work ? 
Yes I think so, for me it's the expected behaviour.

brctl addbr br0
brctl addif br0 eth0

Regards
Hans

^ permalink raw reply

* Re: [PATCH v3 1/3] phylib: Convert MDIO and PHY Lib drivers to support 10G
From: Andy Fleming @ 2011-10-13 15:59 UTC (permalink / raw)
  To: Ben Hutchings; +Cc: davem, netdev
In-Reply-To: <1318520760.2745.6.camel@bwh-desktop>


On Oct 13, 2011, at 10:46 AM, Ben Hutchings wrote:

> On Thu, 2011-10-13 at 09:37 -0500, Andy Fleming wrote:
>> 10G MDIO is a totally different protocol (clause 45 of 802.3).
>> Supporting this new protocol requires a couple of changes:
>> 
>> * Add a new parameter to the mdiobus_read functions to specify the
>>  "device address" inside the PHY.
>> * Add a phy45_read/write function which takes advantage of that
>>  new parameter
>> * Convert all of the existing drivers to use the new format
>> 
>> I created a new clause-45-specific read/write functions because:
>> 1) phy_read and phy_write are highly overloaded functions, and
>>   finding every instance which is actually the PHY Lib version
>>   was quite difficult
>> 2) Most code which invokes phy_read/phy_write inside PHY Lib is
>>   Clause-22-specific. None of the phy_read/phy_write invocations
>>   were useable on 10G PHYs
> [...]
>> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
>> index 3cbda08..00f5cfe 100644
>> --- a/drivers/net/phy/phy.c
>> +++ b/drivers/net/phy/phy.c
>> @@ -322,7 +322,8 @@ int phy_mii_ioctl(struct phy_device *phydev,
>> 
>> 	case SIOCGMIIREG:
>> 		mii_data->val_out = mdiobus_read(phydev->bus, mii_data->phy_id,
>> -						 mii_data->reg_num);
>> +						MDIO_DEVAD_NONE,
>> +						mii_data->reg_num);
>> 		break;
>> 
>> 	case SIOCSMIIREG:
>> @@ -354,7 +355,7 @@ int phy_mii_ioctl(struct phy_device *phydev,
>> 		}
>> 
>> 		mdiobus_write(phydev->bus, mii_data->phy_id,
>> -			      mii_data->reg_num, val);
>> +				MDIO_DEVAD_NONE, mii_data->reg_num, val);
>> 
>> 		if (mii_data->reg_num == MII_BMCR &&
>> 		    val & BMCR_RESET &&
> 
> What about c45 support here?  It's not terribly difficult to do…

Ok, I'll take a look at that.

> 
>> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
>> index 83a5a5a..22281d4 100644
>> --- a/drivers/net/phy/phy_device.c
>> +++ b/drivers/net/phy/phy_device.c
>> @@ -4,9 +4,11 @@
>>  * Framework for finding and configuring PHYs.
>>  * Also contains generic PHY driver
>>  *
>> + * 10G PHY Driver support mostly appropriated from drivers/net/mdio.c
>> + *
> 
> If you're saying you copied a significant amount of code, shouldn't you
> add the copyright notice too?

Ah, yeah, I'll fix that.


> 
> [...]
>> @@ -640,7 +660,6 @@ static int genphy_setup_forced(struct phy_device *phydev)
>> 	return err;
>> }
>> 
>> -
>> /**
>>  * genphy_restart_aneg - Enable and Restart Autonegotiation
>>  * @phydev: target phy_device struct
>> @@ -665,7 +684,6 @@ int genphy_restart_aneg(struct phy_device *phydev)
>> }
>> EXPORT_SYMBOL(genphy_restart_aneg);
>> 
>> -
>> /**
>>  * genphy_config_aneg - restart auto-negotiation or write BMCR
>>  * @phydev: target phy_device struct
>> @@ -882,6 +900,7 @@ static int genphy_config_init(struct phy_device *phydev)
>> 
>> 	return 0;
>> }
>> +
>> int genphy_suspend(struct phy_device *phydev)
>> {
>> 	int value;
>> @@ -1022,7 +1041,7 @@ static struct phy_driver genphy_driver = {
>> 	.read_status	= genphy_read_status,
>> 	.suspend	= genphy_suspend,
>> 	.resume		= genphy_resume,
>> -	.driver		= {.owner= THIS_MODULE, },
>> +	.driver		= {.owner = THIS_MODULE, },
>> };
>> 
>> static int __init phy_init(void)
>> @@ -1035,7 +1054,12 @@ static int __init phy_init(void)
>> 
>> 	rc = phy_driver_register(&genphy_driver);
>> 	if (rc)
>> -		mdio_bus_exit();
>> +		goto genphy_register_failed;
>> +
>> +	return rc;
>> +
>> +genphy_register_failed:
>> +	mdio_bus_exit();
>> 
>> 	return rc;
>> }
> 
> These changes are unrelated to c45.


Ah, yeah, these are residual from the 10G generic code.  I'll go ahead and forward those changes into the gen10g patch


> 
>> diff --git a/include/linux/phy.h b/include/linux/phy.h
>> index 54fc413..ae1fdd8 100644
>> --- a/include/linux/phy.h
>> +++ b/include/linux/phy.h
> [...]
>> @@ -65,6 +66,7 @@ typedef enum {
>> 	PHY_INTERFACE_MODE_RGMII_TXID,
>> 	PHY_INTERFACE_MODE_RTBI,
>> 	PHY_INTERFACE_MODE_SMII,
>> +	PHY_INTERFACE_MODE_XGMII
>> } phy_interface_t;
> [...]
> 
> XAUI, XFI?
> 
> (Maybe the distinction doesn't matter in this context.)


Yeah, I'm not quite sure. The term "interface" is meant to convey interface information needed by the PHY driver (and possibly the ethernet controller). My setup is:

MAC->XGMII->XAUI->XGMII->PHY

Right now it's being used somewhat sketchily to distinguish 10G from non-10G. I'd be fine with adding XAUI and XFI, I just don't know yet whether it's relevant. I'm currently working with a sample size of 1 for what I've done, plus information gleaned from the spec.

Andy

^ permalink raw reply

* Re: [PATCH v3 3/3] phylib: Add rudimentary Generic 10G support
From: Ben Hutchings @ 2011-10-13 15:51 UTC (permalink / raw)
  To: Andy Fleming; +Cc: davem, netdev
In-Reply-To: <1318516660-25452-4-git-send-email-afleming@freescale.com>

On Thu, 2011-10-13 at 09:37 -0500, Andy Fleming wrote:
> This is mostly taken from mdio.c, and modified to work under phylib.
> 
> However, the support is skewed toward 10GBaseT, as that is the only
> PHY available to me at this time.
[...]

Is it really necessary to have 2 copies of this code?

This one doesn't even look usable.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Maxime Bizon @ 2011-10-13 15:48 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: John Fastabend, davem, netdev, jesse, fubar
In-Reply-To: <20111013153850.GA2031@minipsycho>


On Thu, 2011-10-13 at 17:38 +0200, Jiri Pirko wrote:

> Looks like it. The question is what is the correct behaviour... 

since we don't want to break existing setup I would say the current one

but I must admit I'm not a big fan of it.

-- 
Maxime

^ permalink raw reply

* Re: [PATCH v3 1/3] phylib: Convert MDIO and PHY Lib drivers to support 10G
From: Ben Hutchings @ 2011-10-13 15:46 UTC (permalink / raw)
  To: Andy Fleming; +Cc: davem, netdev
In-Reply-To: <1318516660-25452-2-git-send-email-afleming@freescale.com>

On Thu, 2011-10-13 at 09:37 -0500, Andy Fleming wrote:
> 10G MDIO is a totally different protocol (clause 45 of 802.3).
> Supporting this new protocol requires a couple of changes:
> 
> * Add a new parameter to the mdiobus_read functions to specify the
>   "device address" inside the PHY.
> * Add a phy45_read/write function which takes advantage of that
>   new parameter
> * Convert all of the existing drivers to use the new format
> 
> I created a new clause-45-specific read/write functions because:
> 1) phy_read and phy_write are highly overloaded functions, and
>    finding every instance which is actually the PHY Lib version
>    was quite difficult
> 2) Most code which invokes phy_read/phy_write inside PHY Lib is
>    Clause-22-specific. None of the phy_read/phy_write invocations
>    were useable on 10G PHYs
[...]
> diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
> index 3cbda08..00f5cfe 100644
> --- a/drivers/net/phy/phy.c
> +++ b/drivers/net/phy/phy.c
> @@ -322,7 +322,8 @@ int phy_mii_ioctl(struct phy_device *phydev,
>  
>  	case SIOCGMIIREG:
>  		mii_data->val_out = mdiobus_read(phydev->bus, mii_data->phy_id,
> -						 mii_data->reg_num);
> +						MDIO_DEVAD_NONE,
> +						mii_data->reg_num);
>  		break;
>  
>  	case SIOCSMIIREG:
> @@ -354,7 +355,7 @@ int phy_mii_ioctl(struct phy_device *phydev,
>  		}
>  
>  		mdiobus_write(phydev->bus, mii_data->phy_id,
> -			      mii_data->reg_num, val);
> +				MDIO_DEVAD_NONE, mii_data->reg_num, val);
>  
>  		if (mii_data->reg_num == MII_BMCR &&
>  		    val & BMCR_RESET &&

What about c45 support here?  It's not terribly difficult to do...

> diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
> index 83a5a5a..22281d4 100644
> --- a/drivers/net/phy/phy_device.c
> +++ b/drivers/net/phy/phy_device.c
> @@ -4,9 +4,11 @@
>   * Framework for finding and configuring PHYs.
>   * Also contains generic PHY driver
>   *
> + * 10G PHY Driver support mostly appropriated from drivers/net/mdio.c
> + *

If you're saying you copied a significant amount of code, shouldn't you
add the copyright notice too?

[...]
> @@ -640,7 +660,6 @@ static int genphy_setup_forced(struct phy_device *phydev)
>  	return err;
>  }
>  
> -
>  /**
>   * genphy_restart_aneg - Enable and Restart Autonegotiation
>   * @phydev: target phy_device struct
> @@ -665,7 +684,6 @@ int genphy_restart_aneg(struct phy_device *phydev)
>  }
>  EXPORT_SYMBOL(genphy_restart_aneg);
>  
> -
>  /**
>   * genphy_config_aneg - restart auto-negotiation or write BMCR
>   * @phydev: target phy_device struct
> @@ -882,6 +900,7 @@ static int genphy_config_init(struct phy_device *phydev)
>  
>  	return 0;
>  }
> +
>  int genphy_suspend(struct phy_device *phydev)
>  {
>  	int value;
> @@ -1022,7 +1041,7 @@ static struct phy_driver genphy_driver = {
>  	.read_status	= genphy_read_status,
>  	.suspend	= genphy_suspend,
>  	.resume		= genphy_resume,
> -	.driver		= {.owner= THIS_MODULE, },
> +	.driver		= {.owner = THIS_MODULE, },
>  };
>  
>  static int __init phy_init(void)
> @@ -1035,7 +1054,12 @@ static int __init phy_init(void)
>  
>  	rc = phy_driver_register(&genphy_driver);
>  	if (rc)
> -		mdio_bus_exit();
> +		goto genphy_register_failed;
> +
> +	return rc;
> +
> +genphy_register_failed:
> +	mdio_bus_exit();
>  
>  	return rc;
>  }

These changes are unrelated to c45.

> diff --git a/include/linux/phy.h b/include/linux/phy.h
> index 54fc413..ae1fdd8 100644
> --- a/include/linux/phy.h
> +++ b/include/linux/phy.h
[...]
> @@ -65,6 +66,7 @@ typedef enum {
>  	PHY_INTERFACE_MODE_RGMII_TXID,
>  	PHY_INTERFACE_MODE_RTBI,
>  	PHY_INTERFACE_MODE_SMII,
> +	PHY_INTERFACE_MODE_XGMII
>  } phy_interface_t;
[...]

XAUI, XFI?

(Maybe the distinction doesn't matter in this context.)

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: (splice socket -> pipe) + EPOLLET -> epoll_wait does not not wake up !
From: Eric Dumazet @ 2011-10-13 15:43 UTC (permalink / raw)
  To: Марк Коренберг
  Cc: netdev, linux-kernel
In-Reply-To: <CAEmTpZFt=j-TQjk_8Hx1Yv781SH8MFR_=OrnfH+vke9Q8JZpFA@mail.gmail.com>

Le jeudi 13 octobre 2011 à 20:48 +0600, Марк Коренберг a écrit :
> The problem:
> 
> man 7 epoll said:
> For  stream-oriented  files  (e.g.,  pipe, FIFO, stream socket), the
> condition that the read/write I/O space is exhausted can also be
> detected by checking the amount of data read from / written to the
> target file descriptor.  For example, if you call read(2) by asking to
> read a certain amount of data and read(2) returns a lower number of
> bytes, you  can  be  sure of having exhausted the read I/O space for
> the file descriptor.
> 
> I decide to use splice socket -> pipe instead of recv. So I have
> registered socket's fd in epoll with EPOLLIN|EPOLLET.
> 
> When data appear in socket faster than I splice() it from socket, the
> following sometimes appear:
> 
> 1. in my code I sure, that pipe is empty.
> 2. my code do splice(socket, pipe, 65536)
> 3. splice return, say, 53248
> 4. my code accordingly to man, decide not to fire splice() again, as
> it thinks that it will return EWOULDBLOCK=EAGAIN.
> 5. so, my code go to epoll_wait to wait for EPOLLIN on socket
> 6. epoll hangs.
> 
> This is not appear if I do just recv(). But it may be because speed is
> lower, and some race condition in effect.
> 
> The hacked version of strace output is attached.

Your assumptions about splice() are false.

splice() can transfert partial pages.

So you can hit the 16 pages pipe limit, and splice() doesnt necessarly
returns 16*PAGE_SIZE.

With TCP frames, usually 1460 bytes per PAGE are used.

You must call splice() again and again, unless 0 bytes (EAGAIN) are
returned.

^ permalink raw reply

* Re: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Jiri Pirko @ 2011-10-13 15:38 UTC (permalink / raw)
  To: Maxime Bizon; +Cc: John Fastabend, davem, netdev, jesse, fubar
In-Reply-To: <1318518274.9266.94.camel@sakura.staff.proxad.net>

Thu, Oct 13, 2011 at 05:04:34PM CEST, mbizon@freebox.fr wrote:
>
>On Tue, 2011-10-11 at 00:37 +0200, Jiri Pirko wrote:
>
>> Hmm, I must look at this again tomorrow but I have strong feeling this
>> will break some some scenario including vlan-bridge-macvlan. 
>
>unless I'm mistaken, today's behaviour:
>
># vconfig add eth0 100
># brctl addbr br0
># brctl addif br0 eth0
>
>=> eth0.100 gets no more packets, br0.100 is to be used
>
>after the patch won't we get the opposite ?

Looks like it. The question is what is the correct behaviour...

>
>-- 
>Maxime
>
>

^ permalink raw reply

* Re: [PATCH 1/2] net: Allow skb_recycle_check to be done in stages
From: David Daney @ 2011-10-13 15:35 UTC (permalink / raw)
  To: Andy Fleming, davem; +Cc: netdev
In-Reply-To: <1318516435-24314-1-git-send-email-afleming@freescale.com>

On 10/13/2011 07:33 AM, Andy Fleming wrote:
> skb_recycle_check resets the skb if it's eligible for recycling.
> However, there are times when a driver might want to optionally
> manipulate the skb data with the skb before resetting the skb,
> but after it has determined eligibility.  We do this by splitting the
> eligibility check from the skb reset, creating two inline functions to
> accomplish that task.
>
> Signed-off-by: Andy Fleming<afleming@freescale.com>

Acked-by: David Daney <david.daney@cavium.com>

I need this for my (Octeon) Ethernet driver as well.  Currently we have 
an ad hoc driver local implementation of this, but I would like to use 
this core code if possible.




> ---
>
> I found this useful for a driver we're working on where the device can
> do different things, depending on whether the skb is recycleable.
>
>   include/linux/skbuff.h |   21 +++++++++++++++++++
>   net/core/skbuff.c      |   51 ++++++++++++++++++++++++-----------------------
>   2 files changed, 47 insertions(+), 25 deletions(-)
>

[...]

^ permalink raw reply

* [PATCH net-next] net: more accurate skb truesize
From: Eric Dumazet @ 2011-10-13 15:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, Andi Kleen

skb truesize currently accounts for sk_buff struct and part of skb head.

Considering that skb_shared_info is larger than sk_buff, its time to
take it into account for better memory accounting.

This patch introduces SKB_TRUESIZE(X) macro to centralize various
assumptions into a single place.

At skb alloc phase, we put skb_shared_info struct at the exact end of
skb head, to allow a better use of memory (lowering number of
reallocations), since kmalloc() gives us power-of-two memory blocks.

Note: This patch might trigger performance regressions because of
misconfigured protocol stacks, hitting per socket or global memory
limits that were previously not reached. But its a necessary step for a
more accurate memory accounting.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
CC: Andi Kleen <ak@linux.intel.com>
---
 include/linux/skbuff.h |    5 +++++
 net/core/skbuff.c      |   13 +++++++++----
 net/core/sock.c        |    2 +-
 net/ipv4/icmp.c        |    5 ++---
 net/ipv4/tcp_input.c   |   14 +++++++-------
 net/ipv6/icmp.c        |    3 +--
 net/iucv/af_iucv.c     |    2 +-
 net/sctp/protocol.c    |    2 +-
 8 files changed, 27 insertions(+), 19 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac6b05a..64f8695 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -46,6 +46,11 @@
 #define SKB_MAX_HEAD(X)		(SKB_MAX_ORDER((X), 0))
 #define SKB_MAX_ALLOC		(SKB_MAX_ORDER(0, 2))
 
+/* return minimum truesize of one skb containing X bytes of data */
+#define SKB_TRUESIZE(X) ((X) +						\
+			 SKB_DATA_ALIGN(sizeof(struct sk_buff)) +	\
+			 SKB_DATA_ALIGN(sizeof(struct skb_shared_info)))
+
 /* A. Checksumming of received packets by device.
  *
  *	NONE: device failed to checksum this packet.
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b2c5f1..be66154 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -184,11 +184,15 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 		goto out;
 	prefetchw(skb);
 
-	size = SKB_DATA_ALIGN(size);
-	data = kmalloc_node_track_caller(size + sizeof(struct skb_shared_info),
-			gfp_mask, node);
+	size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+	data = kmalloc_node_track_caller(size, gfp_mask, node);
 	if (!data)
 		goto nodata;
+	/* kmalloc(size) might give us more room than asked.
+	 * Put skb_shared_info exactly at the end of allocated zone,
+	 * to allow max possible filling before reallocation.
+	 */
+	size = SKB_WITH_OVERHEAD(ksize(data));
 	prefetchw(data + size);
 
 	/*
@@ -197,7 +201,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
 	 * the tail pointer in struct sk_buff!
 	 */
 	memset(skb, 0, offsetof(struct sk_buff, tail));
-	skb->truesize = size + sizeof(struct sk_buff);
+	/* Account for allocated memory : skb + skb->head */
+	skb->truesize = SKB_TRUESIZE(size);
 	atomic_set(&skb->users, 1);
 	skb->head = data;
 	skb->data = data;
diff --git a/net/core/sock.c b/net/core/sock.c
index 83c462d..5a08762 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -207,7 +207,7 @@ static struct lock_class_key af_callback_keys[AF_MAX];
  * not depend upon such differences.
  */
 #define _SK_MEM_PACKETS		256
-#define _SK_MEM_OVERHEAD	(sizeof(struct sk_buff) + 256)
+#define _SK_MEM_OVERHEAD	SKB_TRUESIZE(256)
 #define SK_WMEM_MAX		(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
 #define SK_RMEM_MAX		(_SK_MEM_OVERHEAD * _SK_MEM_PACKETS)
 
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 23ef31b..ab188ae 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -1152,10 +1152,9 @@ static int __net_init icmp_sk_init(struct net *net)
 		net->ipv4.icmp_sk[i] = sk;
 
 		/* Enough space for 2 64K ICMP packets, including
-		 * sk_buff struct overhead.
+		 * sk_buff/skb_shared_info struct overhead.
 		 */
-		sk->sk_sndbuf =
-			(2 * ((64 * 1024) + sizeof(struct sk_buff)));
+		sk->sk_sndbuf =	2 * SKB_TRUESIZE(64 * 1024);
 
 		/*
 		 * Speedup sock_wfree()
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 81cae64..c1653fe 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -265,8 +265,7 @@ static inline int TCP_ECN_rcv_ecn_echo(struct tcp_sock *tp, struct tcphdr *th)
 
 static void tcp_fixup_sndbuf(struct sock *sk)
 {
-	int sndmem = tcp_sk(sk)->rx_opt.mss_clamp + MAX_TCP_HEADER + 16 +
-		     sizeof(struct sk_buff);
+	int sndmem = SKB_TRUESIZE(tcp_sk(sk)->rx_opt.mss_clamp + MAX_TCP_HEADER);
 
 	if (sk->sk_sndbuf < 3 * sndmem) {
 		sk->sk_sndbuf = 3 * sndmem;
@@ -349,7 +348,7 @@ static void tcp_grow_window(struct sock *sk, struct sk_buff *skb)
 static void tcp_fixup_rcvbuf(struct sock *sk)
 {
 	struct tcp_sock *tp = tcp_sk(sk);
-	int rcvmem = tp->advmss + MAX_TCP_HEADER + 16 + sizeof(struct sk_buff);
+	int rcvmem = SKB_TRUESIZE(tp->advmss + MAX_TCP_HEADER);
 
 	/* Try to select rcvbuf so that 4 mss-sized segments
 	 * will fit to window and corresponding skbs will fit to our rcvbuf.
@@ -540,8 +539,7 @@ void tcp_rcv_space_adjust(struct sock *sk)
 			space /= tp->advmss;
 			if (!space)
 				space = 1;
-			rcvmem = (tp->advmss + MAX_TCP_HEADER +
-				  16 + sizeof(struct sk_buff));
+			rcvmem = SKB_TRUESIZE(tp->advmss + MAX_TCP_HEADER);
 			while (tcp_win_from_space(rcvmem) < tp->advmss)
 				rcvmem += 128;
 			space *= rcvmem;
@@ -4950,8 +4948,10 @@ static void tcp_new_space(struct sock *sk)
 	struct tcp_sock *tp = tcp_sk(sk);
 
 	if (tcp_should_expand_sndbuf(sk)) {
-		int sndmem = max_t(u32, tp->rx_opt.mss_clamp, tp->mss_cache) +
-			MAX_TCP_HEADER + 16 + sizeof(struct sk_buff);
+		int sndmem = SKB_TRUESIZE(max_t(u32,
+						tp->rx_opt.mss_clamp,
+						tp->mss_cache) +
+					  MAX_TCP_HEADER);
 		int demanded = max_t(unsigned int, tp->snd_cwnd,
 				     tp->reordering + 1);
 		sndmem *= 2 * demanded;
diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 2b59154..90868fb 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -835,8 +835,7 @@ static int __net_init icmpv6_sk_init(struct net *net)
 		/* Enough space for 2 64K ICMP packets, including
 		 * sk_buff struct overhead.
 		 */
-		sk->sk_sndbuf =
-			(2 * ((64 * 1024) + sizeof(struct sk_buff)));
+		sk->sk_sndbuf = 2 * SKB_TRUESIZE(64 * 1024);
 	}
 	return 0;
 
diff --git a/net/iucv/af_iucv.c b/net/iucv/af_iucv.c
index c39f3a4..274d150 100644
--- a/net/iucv/af_iucv.c
+++ b/net/iucv/af_iucv.c
@@ -1819,7 +1819,7 @@ static void iucv_callback_rx(struct iucv_path *path, struct iucv_message *msg)
 		goto save_message;
 
 	len = atomic_read(&sk->sk_rmem_alloc);
-	len += iucv_msg_length(msg) + sizeof(struct sk_buff);
+	len += SKB_TRUESIZE(iucv_msg_length(msg));
 	if (len > sk->sk_rcvbuf)
 		goto save_message;
 
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 91784f4..61b9fca 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1299,7 +1299,7 @@ SCTP_STATIC __init int sctp_init(void)
 	max_share = min(4UL*1024*1024, limit);
 
 	sysctl_sctp_rmem[0] = SK_MEM_QUANTUM; /* give each asoc 1 page min */
-	sysctl_sctp_rmem[1] = (1500 *(sizeof(struct sk_buff) + 1));
+	sysctl_sctp_rmem[1] = 1500 * SKB_TRUESIZE(1);
 	sysctl_sctp_rmem[2] = max(sysctl_sctp_rmem[1], max_share);
 
 	sysctl_sctp_wmem[0] = SK_MEM_QUANTUM;

^ permalink raw reply related

* Re: [iproute2 PATCH] Fix unterminated readlink() buffer usage
From: Stephen Hemminger @ 2011-10-13 15:22 UTC (permalink / raw)
  To: Thomas Jarosch; +Cc: netdev
In-Reply-To: <201110131030.21723.thomas.jarosch@intra2net.com>

On Thu, 13 Oct 2011 10:30:21 +0200
Thomas Jarosch <thomas.jarosch@intra2net.com> wrote:

> 
> Signed-off-by: Thomas Jarosch <thomas.jarosch@intra2net.com>

Applied thanks.

^ permalink raw reply

* Re: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Maxime Bizon @ 2011-10-13 15:04 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: John Fastabend, davem, netdev, jesse, fubar
In-Reply-To: <20111010223752.GB2373@minipsycho>


On Tue, 2011-10-11 at 00:37 +0200, Jiri Pirko wrote:

> Hmm, I must look at this again tomorrow but I have strong feeling this
> will break some some scenario including vlan-bridge-macvlan. 

unless I'm mistaken, today's behaviour:

# vconfig add eth0 100
# brctl addbr br0
# brctl addif br0 eth0

=> eth0.100 gets no more packets, br0.100 is to be used

after the patch won't we get the opposite ?

-- 
Maxime

^ permalink raw reply

* (splice socket -> pipe) + EPOLLET -> epoll_wait does not not wake up !
From: Марк Коренберг @ 2011-10-13 14:48 UTC (permalink / raw)
  To: netdev
In-Reply-To: <CAEmTpZG5MN8Ha4=ZcCnWC3jJP9x4Bw7ZEC2vQdOGJZiaG4QYkA@mail.gmail.com>

[-- Attachment #1: Type: text/plain, Size: 1193 bytes --]

The problem:

man 7 epoll said:
For  stream-oriented  files  (e.g.,  pipe, FIFO, stream socket), the
condition that the read/write I/O space is exhausted can also be
detected by checking the amount of data read from / written to the
target file descriptor.  For example, if you call read(2) by asking to
read a certain amount of data and read(2) returns a lower number of
bytes, you  can  be  sure of having exhausted the read I/O space for
the file descriptor.

I decide to use splice socket -> pipe instead of recv. So I have
registered socket's fd in epoll with EPOLLIN|EPOLLET.

When data appear in socket faster than I splice() it from socket, the
following sometimes appear:

1. in my code I sure, that pipe is empty.
2. my code do splice(socket, pipe, 65536)
3. splice return, say, 53248
4. my code accordingly to man, decide not to fire splice() again, as
it thinks that it will return EWOULDBLOCK=EAGAIN.
5. so, my code go to epoll_wait to wait for EPOLLIN on socket
6. epoll hangs.

This is not appear if I do just recv(). But it may be because speed is
lower, and some race condition in effect.

The hacked version of strace output is attached.

[-- Attachment #2: strace.txt --]
[-- Type: text/plain, Size: 4488 bytes --]

epoll_create(EPOLL_CLOEXEC)                    = 6

/* 4 is a socket of HTTP_CLIENT. (got from accept4(SOCK_CLOEXEC|SOCK_NONBLOCK)) */
epoll_ctl(6, EPOLL_CTL_ADD, 4, {EPOLLIN|EPOLLOUT|EPOLLET|EPOLLRDHUP, {fd=4}}) = 0
pipe2([7, 9], O_NONBLOCK|O_CLOEXEC)     = 0

/* Creating socket for a HTTP_SERVER */
socket(PF_INET, SOCK_STREAM|SOCK_CLOEXEC|SOCK_NONBLOCK, IPPROTO_IP) = 10
connect(10, {sa_family=AF_INET, sin_port=htons(80), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 EINPROGRESS (Operation now in progress)
epoll_ctl(6, EPOLL_CTL_ADD, 10, {EPOLLIN|EPOLLOUT|EPOLLET|0x2000, {fd=10}}) = 0
pipe2([11, 12], O_NONBLOCK|O_CLOEXEC)   = 0
epoll_wait(6, {{EPOLLOUT, {fd=4}}, {EPOLLOUT, {fd=10}}}, 100, -1) = 2
getsockopt(10, SOL_SOCKET, SO_ERROR, [0], [4]) = 0
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 421
splice(7, 0, 10, 0, 0x1a5, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 421
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 801
splice(11, 0, 4, 0, 0x321, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 801
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 345
splice(7, 0, 10, 0, 0x159, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 345
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 502
splice(11, 0, 4, 0, 0x1f6, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 502
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 458
splice(7, 0, 10, 0, 0x1ca, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 458
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 929
splice(11, 0, 4, 0, 0x3a1, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 929
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 345
splice(7, 0, 10, 0, 0x159, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 345
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 502
splice(11, 0, 4, 0, 0x1f6, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 502
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 543
splice(7, 0, 10, 0, 0x21f, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 543
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 348
splice(11, 0, 4, 0, 0x15c, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 348
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 543
splice(7, 0, 10, 0, 0x21f, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 543
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 38787
splice(11, 0, 4, 0, 0x9783, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)     = 38787
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 345
splice(7, 0, 10, 0, 0x159, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 345
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 502
splice(11, 0, 4, 0, 0x1f6, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 502
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(4, 0, 9, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 519
splice(7, 0, 10, 0, 0x207, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)      = 519
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 49152
splice(11, 0, 4, 0, 0xc000, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)     = 49152
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 49448
splice(11, 0, 4, 0, 0xc128, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)     = 49448
epoll_wait(6, {{EPOLLIN|EPOLLOUT, {fd=10}}}, 100, -1) = 1
splice(10, 0, 12, 0, 65536, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)    = 53248
splice(11, 0, 4, 0, 53248, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)     = 32472
epoll_wait(6, {{EPOLLOUT, {fd=4}}}, 100, -1) = 1
splice(11, 0, 4, 0, 20776, SPLICE_F_MOVE|SPLICE_F_NONBLOCK)     = 20776
epoll_wait(6, +++ killed by SIGINT +++

^ permalink raw reply

* [PATCH v3 2/3] phylib: Convert MDIO bitbang to new MDIO 45 format
From: Andy Fleming @ 2011-10-13 14:37 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1318516660-25452-1-git-send-email-afleming@freescale.com>

Now that we've added somewhat more complete MDIO 45 support to the PHY
Lib, convert the MDIO bitbang driver to use this new infrastructure.

Signed-off-by: Andy Fleming <afleming@freescale.com>
---
v2: rebase on top of tree
v3: Make patch series more coherent

 drivers/net/phy/mdio-bitbang.c |   29 +++++++++++++++--------------
 1 files changed, 15 insertions(+), 14 deletions(-)

diff --git a/drivers/net/phy/mdio-bitbang.c b/drivers/net/phy/mdio-bitbang.c
index 2f6f02e..df7f496 100644
--- a/drivers/net/phy/mdio-bitbang.c
+++ b/drivers/net/phy/mdio-bitbang.c
@@ -134,11 +134,10 @@ static void mdiobb_cmd(struct mdiobb_ctrl *ctrl, int op, u8 phy, u8 reg)
    MII_ADDR_C45 into the address. Theoretically clause 45 and normal devices
    can exist on the same bus. Normal devices should ignore the MDIO_ADDR
    phase. */
-static int mdiobb_cmd_addr(struct mdiobb_ctrl *ctrl, int phy, u32 addr)
+static void mdiobb_cmd_addr(struct mdiobb_ctrl *ctrl, int phy, int devad,
+				int reg)
 {
-	unsigned int dev_addr = (addr >> 16) & 0x1F;
-	unsigned int reg = addr & 0xFFFF;
-	mdiobb_cmd(ctrl, MDIO_C45_ADDR, phy, dev_addr);
+	mdiobb_cmd(ctrl, MDIO_C45_ADDR, phy, devad);
 
 	/* send the turnaround (10) */
 	mdiobb_send_bit(ctrl, 1);
@@ -148,8 +147,6 @@ static int mdiobb_cmd_addr(struct mdiobb_ctrl *ctrl, int phy, u32 addr)
 
 	ctrl->ops->set_mdio_dir(ctrl, 0);
 	mdiobb_get_bit(ctrl);
-
-	return dev_addr;
 }
 
 static int mdiobb_read(struct mii_bus *bus, int phy, int devad, int reg)
@@ -157,11 +154,13 @@ static int mdiobb_read(struct mii_bus *bus, int phy, int devad, int reg)
 	struct mdiobb_ctrl *ctrl = bus->priv;
 	int ret, i;
 
-	if (reg & MII_ADDR_C45) {
-		reg = mdiobb_cmd_addr(ctrl, phy, reg);
-		mdiobb_cmd(ctrl, MDIO_C45_READ, phy, reg);
-	} else
+	/* Clause 22 PHYs don't have a devad */
+	if (devad == MDIO_DEVAD_NONE)
 		mdiobb_cmd(ctrl, MDIO_READ, phy, reg);
+	else {
+		mdiobb_cmd_addr(ctrl, phy, devad, reg);
+		mdiobb_cmd(ctrl, MDIO_C45_READ, phy, devad);
+	}
 
 	ctrl->ops->set_mdio_dir(ctrl, 0);
 
@@ -186,11 +185,13 @@ static int mdiobb_write(struct mii_bus *bus, int phy, int devad, int reg,
 {
 	struct mdiobb_ctrl *ctrl = bus->priv;
 
-	if (reg & MII_ADDR_C45) {
-		reg = mdiobb_cmd_addr(ctrl, phy, reg);
-		mdiobb_cmd(ctrl, MDIO_C45_WRITE, phy, reg);
-	} else
+	/* Clause 22 PHYs don't have a devad */
+	if (devad == MDIO_DEVAD_NONE)
 		mdiobb_cmd(ctrl, MDIO_WRITE, phy, reg);
+	else {
+		mdiobb_cmd_addr(ctrl, phy, devad, reg);
+		mdiobb_cmd(ctrl, MDIO_C45_WRITE, phy, devad);
+	}
 
 	/* send the turnaround (10) */
 	mdiobb_send_bit(ctrl, 1);
-- 
1.7.3.4

^ permalink raw reply related

* [PATCH v3 0/3] Add support for 10G to PHY Lib
From: Andy Fleming @ 2011-10-13 14:37 UTC (permalink / raw)
  To: davem; +Cc: netdev

This sequence of patches upgrades PHY Lib to support buses which
implement clause 45 of the IEEE 802.3 spec (10G MDIO).

The first two mostly just update the mdiobus API to add a new argument
(devad), which addresses the specific "device" inside the PHY.

The last patch implements a generic driver for 10G PHYs, but
currently only supports 10G Copper PHYs (that's all I have
to test with).

v3: Make patch series more coherent

Andy Fleming (3):
  phylib: Convert MDIO and PHY Lib drivers to support 10G
  phylib: Convert MDIO bitbang to new MDIO 45 format
  phylib: Add rudimentary Generic 10G support

 Documentation/networking/phy.txt                  |   15 +-
 arch/powerpc/platforms/pasemi/gpio_mdio.c         |    6 +-
 drivers/net/ethernet/adi/bfin_mac.c               |    7 +-
 drivers/net/ethernet/aeroflex/greth.c             |    5 +-
 drivers/net/ethernet/amd/au1000_eth.c             |    7 +-
 drivers/net/ethernet/broadcom/bcm63xx_enet.c      |    4 +-
 drivers/net/ethernet/broadcom/sb1250-mac.c        |    7 +-
 drivers/net/ethernet/broadcom/tg3.c               |    5 +-
 drivers/net/ethernet/cadence/macb.c               |    7 +-
 drivers/net/ethernet/dnet.c                       |    7 +-
 drivers/net/ethernet/ethoc.c                      |    5 +-
 drivers/net/ethernet/faraday/ftgmac100.c          |    5 +-
 drivers/net/ethernet/freescale/fec.c              |    7 +-
 drivers/net/ethernet/freescale/fec_mpc52xx_phy.c  |    7 +-
 drivers/net/ethernet/freescale/fs_enet/mii-fec.c  |    6 +-
 drivers/net/ethernet/freescale/fsl_pq_mdio.c      |   13 +-
 drivers/net/ethernet/freescale/fsl_pq_mdio.h      |   11 +-
 drivers/net/ethernet/lantiq_etop.c                |    5 +-
 drivers/net/ethernet/marvell/mv643xx_eth.c        |    5 +-
 drivers/net/ethernet/marvell/pxa168_eth.c         |    7 +-
 drivers/net/ethernet/rdc/r6040.c                  |    5 +-
 drivers/net/ethernet/s6gmac.c                     |    5 +-
 drivers/net/ethernet/smsc/smsc911x.c              |   22 ++-
 drivers/net/ethernet/smsc/smsc9420.c              |   10 +-
 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c |    9 +-
 drivers/net/ethernet/ti/cpmac.c                   |    4 +-
 drivers/net/ethernet/ti/davinci_mdio.c            |    5 +-
 drivers/net/ethernet/toshiba/tc35815.c            |    5 +-
 drivers/net/ethernet/xilinx/ll_temac_mdio.c       |    5 +-
 drivers/net/ethernet/xilinx/xilinx_emaclite.c     |    9 +-
 drivers/net/ethernet/xscale/ixp4xx_eth.c          |    7 +-
 drivers/net/phy/fixed.c                           |    5 +-
 drivers/net/phy/icplus.c                          |   17 ++-
 drivers/net/phy/mdio-bitbang.c                    |   34 ++--
 drivers/net/phy/mdio-octeon.c                     |    5 +-
 drivers/net/phy/mdio_bus.c                        |    8 +-
 drivers/net/phy/phy.c                             |    5 +-
 drivers/net/phy/phy_device.c                      |  178 ++++++++++++++++++--
 include/linux/phy.h                               |   70 +++++++-
 net/dsa/slave.c                                   |    5 +-
 40 files changed, 401 insertions(+), 153 deletions(-)

-- 
1.7.3.4

^ permalink raw reply

* [PATCH v3 3/3] phylib: Add rudimentary Generic 10G support
From: Andy Fleming @ 2011-10-13 14:37 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1318516660-25452-1-git-send-email-afleming@freescale.com>

This is mostly taken from mdio.c, and modified to work under phylib.

However, the support is skewed toward 10GBaseT, as that is the only
PHY available to me at this time.

Signed-off-by: Andy Fleming <afleming@freescale.com>
---
v2: split off from 10G API changes
v3: Make patch series more coherent

 drivers/net/phy/phy_device.c |  118 ++++++++++++++++++++++++++++++++++++++++++
 1 files changed, 118 insertions(+), 0 deletions(-)

diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 22281d4..e2ee8dd 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -10,6 +10,7 @@
  *
  * Copyright (c) 2004-2006, 2008-2011 Freescale Semiconductor, Inc.
  *
+ *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
  * Free Software Foundation;  either version 2 of the  License, or (at your
@@ -54,6 +55,7 @@ static void phy_device_release(struct device *dev)
 }
 
 static struct phy_driver genphy_driver;
+static struct phy_driver gen10g_driver;
 extern int mdio_bus_init(void);
 extern void mdio_bus_exit(void);
 
@@ -439,6 +441,9 @@ int phy_init_hw(struct phy_device *phydev)
 
 static struct phy_driver *generic_for_interface(phy_interface_t interface)
 {
+	if (is_10g_interface(interface))
+		return &gen10g_driver;
+
 	return &genphy_driver;
 }
 
@@ -632,6 +637,12 @@ static int genphy_config_advert(struct phy_device *phydev)
 	return changed;
 }
 
+int gen10g_config_advert(struct phy_device *dev)
+{
+	return 0;
+}
+EXPORT_SYMBOL(gen10g_config_advert);
+
 /**
  * genphy_setup_forced - configures/forces speed/duplex from @phydev
  * @phydev: target phy_device struct
@@ -660,6 +671,11 @@ static int genphy_setup_forced(struct phy_device *phydev)
 	return err;
 }
 
+int gen10g_setup_forced(struct phy_device *phydev)
+{
+	return 0;
+}
+
 /**
  * genphy_restart_aneg - Enable and Restart Autonegotiation
  * @phydev: target phy_device struct
@@ -684,6 +700,13 @@ int genphy_restart_aneg(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_restart_aneg);
 
+int gen10g_restart_aneg(struct phy_device *phydev)
+{
+	return 0;
+}
+EXPORT_SYMBOL(gen10g_restart_aneg);
+
+
 /**
  * genphy_config_aneg - restart auto-negotiation or write BMCR
  * @phydev: target phy_device struct
@@ -725,6 +748,12 @@ int genphy_config_aneg(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_config_aneg);
 
+int gen10g_config_aneg(struct phy_device *phydev)
+{
+	return 0;
+}
+EXPORT_SYMBOL(gen10g_config_aneg);
+
 /**
  * genphy_update_link - update link status in @phydev
  * @phydev: target phy_device struct
@@ -854,6 +883,33 @@ int genphy_read_status(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_read_status);
 
+int gen10g_read_status(struct phy_device *phydev)
+{
+	int devad, reg;
+	u32 mmd_mask = phydev->mmds;
+
+	phydev->link = 1;
+
+	/* For now just lie and say it's 10G all the time */
+	phydev->speed = 10000;
+	phydev->duplex = DUPLEX_FULL;
+
+	for (devad = 0; mmd_mask; devad++, mmd_mask = mmd_mask >> 1) {
+		if (!mmd_mask & 1)
+			continue;
+
+		/* Read twice because link state is latched and a
+		 * read moves the current state into the register */
+		phy45_read(phydev, devad, MDIO_STAT1);
+		reg = phy45_read(phydev, devad, MDIO_STAT1);
+		if (reg < 0 || !(reg & MDIO_STAT1_LSTATUS))
+			phydev->link = 0;
+	}
+
+	return 0;
+}
+EXPORT_SYMBOL(gen10g_read_status);
+
 static int genphy_config_init(struct phy_device *phydev)
 {
 	int val;
@@ -901,6 +957,35 @@ static int genphy_config_init(struct phy_device *phydev)
 	return 0;
 }
 
+/* Replicate mdio45_probe */
+int gen10g_config_init(struct phy_device *phydev)
+{
+	int mmd, stat2, devs1, devs2;
+
+	phydev->supported = phydev->advertising = SUPPORTED_10000baseT_Full;
+
+	/* Assume PHY must have at least one of PMA/PMD, WIS, PCS, PHY
+	 * XS or DTE XS; give up if none is present. */
+	for (mmd = 1; mmd <= 5; mmd++) {
+		/* Is this MMD present? */
+		stat2 = phy45_read(phydev, mmd, MDIO_STAT2);
+		if (stat2 < 0 ||
+			(stat2 & MDIO_STAT2_DEVPRST) != MDIO_STAT2_DEVPRST_VAL)
+			continue;
+
+		/* It should tell us about all the other MMDs */
+		devs1 = phy45_read(phydev, mmd, MDIO_DEVS1);
+		devs2 = phy45_read(phydev, mmd, MDIO_DEVS2);
+		if (devs1 < 0 || devs2 < 0)
+			continue;
+
+		phydev->mmds = devs1 | (devs2 << 16);
+		return 0;
+	}
+
+	return -ENODEV;
+}
+
 int genphy_suspend(struct phy_device *phydev)
 {
 	int value;
@@ -916,6 +1001,12 @@ int genphy_suspend(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_suspend);
 
+int gen10g_suspend(struct phy_device *phydev)
+{
+	return 0;
+}
+EXPORT_SYMBOL(gen10g_suspend);
+
 int genphy_resume(struct phy_device *phydev)
 {
 	int value;
@@ -931,6 +1022,13 @@ int genphy_resume(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_resume);
 
+int gen10g_resume(struct phy_device *phydev)
+{
+	return 0;
+}
+EXPORT_SYMBOL(gen10g_resume);
+
+
 /**
  * phy_probe - probe and init a PHY device
  * @dev: device to probe and init
@@ -1044,6 +1142,19 @@ static struct phy_driver genphy_driver = {
 	.driver		= {.owner = THIS_MODULE, },
 };
 
+static struct phy_driver gen10g_driver = {
+	.phy_id		= 0xffffffff,
+	.phy_id_mask	= 0xffffffff,
+	.name		= "Generic 10G PHY",
+	.config_init	= gen10g_config_init,
+	.features	= 0,
+	.config_aneg	= gen10g_config_aneg,
+	.read_status	= gen10g_read_status,
+	.suspend	= gen10g_suspend,
+	.resume		= gen10g_resume,
+	.driver		= {.owner = THIS_MODULE, },
+};
+
 static int __init phy_init(void)
 {
 	int rc;
@@ -1056,8 +1167,14 @@ static int __init phy_init(void)
 	if (rc)
 		goto genphy_register_failed;
 
+	rc = phy_driver_register(&gen10g_driver);
+	if (rc)
+		goto gen10g_register_failed;
+
 	return rc;
 
+gen10g_register_failed:
+	phy_driver_unregister(&genphy_driver);
 genphy_register_failed:
 	mdio_bus_exit();
 
@@ -1066,6 +1183,7 @@ genphy_register_failed:
 
 static void __exit phy_exit(void)
 {
+	phy_driver_unregister(&gen10g_driver);
 	phy_driver_unregister(&genphy_driver);
 	mdio_bus_exit();
 }
-- 
1.7.3.4

^ permalink raw reply related

* [PATCH v3 1/3] phylib: Convert MDIO and PHY Lib drivers to support 10G
From: Andy Fleming @ 2011-10-13 14:37 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1318516660-25452-1-git-send-email-afleming@freescale.com>

10G MDIO is a totally different protocol (clause 45 of 802.3).
Supporting this new protocol requires a couple of changes:

* Add a new parameter to the mdiobus_read functions to specify the
  "device address" inside the PHY.
* Add a phy45_read/write function which takes advantage of that
  new parameter
* Convert all of the existing drivers to use the new format

I created a new clause-45-specific read/write functions because:
1) phy_read and phy_write are highly overloaded functions, and
   finding every instance which is actually the PHY Lib version
   was quite difficult
2) Most code which invokes phy_read/phy_write inside PHY Lib is
   Clause-22-specific. None of the phy_read/phy_write invocations
   were useable on 10G PHYs

Signed-off-by: Andy Fleming <afleming@freescale.com>
---
v2: Convert newer buses, split out generic PHY support
v3: Make patch series more coherent

 Documentation/networking/phy.txt                  |   15 +++--
 arch/powerpc/platforms/pasemi/gpio_mdio.c         |    6 +-
 drivers/net/ethernet/adi/bfin_mac.c               |    7 +-
 drivers/net/ethernet/aeroflex/greth.c             |    5 +-
 drivers/net/ethernet/amd/au1000_eth.c             |    7 +-
 drivers/net/ethernet/broadcom/bcm63xx_enet.c      |    4 +-
 drivers/net/ethernet/broadcom/sb1250-mac.c        |    7 +-
 drivers/net/ethernet/broadcom/tg3.c               |    5 +-
 drivers/net/ethernet/cadence/macb.c               |    7 +-
 drivers/net/ethernet/dnet.c                       |    7 +-
 drivers/net/ethernet/ethoc.c                      |    5 +-
 drivers/net/ethernet/faraday/ftgmac100.c          |    5 +-
 drivers/net/ethernet/freescale/fec.c              |    7 +-
 drivers/net/ethernet/freescale/fec_mpc52xx_phy.c  |    7 +-
 drivers/net/ethernet/freescale/fs_enet/mii-fec.c  |    6 +-
 drivers/net/ethernet/freescale/fsl_pq_mdio.c      |   13 ++--
 drivers/net/ethernet/freescale/fsl_pq_mdio.h      |   11 ++-
 drivers/net/ethernet/lantiq_etop.c                |    5 +-
 drivers/net/ethernet/marvell/mv643xx_eth.c        |    5 +-
 drivers/net/ethernet/marvell/pxa168_eth.c         |    7 +-
 drivers/net/ethernet/rdc/r6040.c                  |    5 +-
 drivers/net/ethernet/s6gmac.c                     |    5 +-
 drivers/net/ethernet/smsc/smsc911x.c              |   22 ++++---
 drivers/net/ethernet/smsc/smsc9420.c              |   10 ++-
 drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c |    9 ++-
 drivers/net/ethernet/ti/cpmac.c                   |    4 +-
 drivers/net/ethernet/ti/davinci_mdio.c            |    5 +-
 drivers/net/ethernet/toshiba/tc35815.c            |    5 +-
 drivers/net/ethernet/xilinx/ll_temac_mdio.c       |    5 +-
 drivers/net/ethernet/xilinx/xilinx_emaclite.c     |    9 ++-
 drivers/net/ethernet/xscale/ixp4xx_eth.c          |    7 +-
 drivers/net/phy/fixed.c                           |    5 +-
 drivers/net/phy/icplus.c                          |   17 +++--
 drivers/net/phy/mdio-bitbang.c                    |    5 +-
 drivers/net/phy/mdio-octeon.c                     |    5 +-
 drivers/net/phy/mdio_bus.c                        |    8 +-
 drivers/net/phy/phy.c                             |    5 +-
 drivers/net/phy/phy_device.c                      |   64 +++++++++++++------
 include/linux/phy.h                               |   70 ++++++++++++++++++---
 net/dsa/slave.c                                   |    5 +-
 40 files changed, 270 insertions(+), 141 deletions(-)

diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt
index 9eb1ba5..cf10707 100644
--- a/Documentation/networking/phy.txt
+++ b/Documentation/networking/phy.txt
@@ -40,13 +40,14 @@ The MDIO bus
 
  1) read and write functions must be implemented.  Their prototypes are:
 
-     int write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
-     int read(struct mii_bus *bus, int mii_id, int regnum);
-
-   mii_id is the address on the bus for the PHY, and regnum is the register
-   number.  These functions are guaranteed not to be called from interrupt
-   time, so it is safe for them to block, waiting for an interrupt to signal
-   the operation is complete
+     int write(struct mii_bus *bus, int addr, int devad, u16 regnum,
+		u16 value);
+     int read(struct mii_bus *bus, int addr, int devad, u16 regnum);
+
+   addr is the address on the bus for the PHY, devad is the address of the
+   internal device, and regnum is the register number.  These functions are
+   guaranteed not to be called from interrupt time, so it is safe for them
+   to block, waiting for an interrupt to signal the operation is complete
  
  2) A reset function is necessary.  This is used to return the bus to an
    initialized state.
diff --git a/arch/powerpc/platforms/pasemi/gpio_mdio.c b/arch/powerpc/platforms/pasemi/gpio_mdio.c
index 9886296..a7256b9 100644
--- a/arch/powerpc/platforms/pasemi/gpio_mdio.c
+++ b/arch/powerpc/platforms/pasemi/gpio_mdio.c
@@ -124,7 +124,8 @@ static void bitbang_pre(struct mii_bus *bus, int read, u8 addr, u8 reg)
 	}
 }
 
-static int gpio_mdio_read(struct mii_bus *bus, int phy_id, int location)
+static int gpio_mdio_read(struct mii_bus *bus, int phy_id, int devad,
+				int location)
 {
 	u16 rdreg;
 	int ret, i;
@@ -163,7 +164,8 @@ static int gpio_mdio_read(struct mii_bus *bus, int phy_id, int location)
 	return ret;
 }
 
-static int gpio_mdio_write(struct mii_bus *bus, int phy_id, int location, u16 val)
+static int gpio_mdio_write(struct mii_bus *bus, int phy_id, int devad,
+				int location, u16 val)
 {
 	int i;
 
diff --git a/drivers/net/ethernet/adi/bfin_mac.c b/drivers/net/ethernet/adi/bfin_mac.c
index b6d69c9..7e343b4 100644
--- a/drivers/net/ethernet/adi/bfin_mac.c
+++ b/drivers/net/ethernet/adi/bfin_mac.c
@@ -267,7 +267,8 @@ static int bfin_mdio_poll(void)
 }
 
 /* Read an off-chip register in a PHY through the MDC/MDIO port */
-static int bfin_mdiobus_read(struct mii_bus *bus, int phy_addr, int regnum)
+static int bfin_mdiobus_read(struct mii_bus *bus, int phy_addr, int devad,
+				int regnum)
 {
 	int ret;
 
@@ -288,8 +289,8 @@ static int bfin_mdiobus_read(struct mii_bus *bus, int phy_addr, int regnum)
 }
 
 /* Write an off-chip register in a PHY through the MDC/MDIO port */
-static int bfin_mdiobus_write(struct mii_bus *bus, int phy_addr, int regnum,
-			      u16 value)
+static int bfin_mdiobus_write(struct mii_bus *bus, int phy_addr, int devad,
+				int regnum, u16 value)
 {
 	int ret;
 
diff --git a/drivers/net/ethernet/aeroflex/greth.c b/drivers/net/ethernet/aeroflex/greth.c
index 6715bf5..faab1cc 100644
--- a/drivers/net/ethernet/aeroflex/greth.c
+++ b/drivers/net/ethernet/aeroflex/greth.c
@@ -1176,7 +1176,7 @@ static inline int wait_for_mdio(struct greth_private *greth)
 	return 1;
 }
 
-static int greth_mdio_read(struct mii_bus *bus, int phy, int reg)
+static int greth_mdio_read(struct mii_bus *bus, int phy, int devad, int reg)
 {
 	struct greth_private *greth = bus->priv;
 	int data;
@@ -1198,7 +1198,8 @@ static int greth_mdio_read(struct mii_bus *bus, int phy, int reg)
 	}
 }
 
-static int greth_mdio_write(struct mii_bus *bus, int phy, int reg, u16 val)
+static int greth_mdio_write(struct mii_bus *bus, int phy, int devad, int reg,
+				u16 val)
 {
 	struct greth_private *greth = bus->priv;
 
diff --git a/drivers/net/ethernet/amd/au1000_eth.c b/drivers/net/ethernet/amd/au1000_eth.c
index 8238667..9e879c2 100644
--- a/drivers/net/ethernet/amd/au1000_eth.c
+++ b/drivers/net/ethernet/amd/au1000_eth.c
@@ -224,7 +224,8 @@ static void au1000_mdio_write(struct net_device *dev, int phy_addr,
 	writel(mii_control, mii_control_reg);
 }
 
-static int au1000_mdiobus_read(struct mii_bus *bus, int phy_addr, int regnum)
+static int au1000_mdiobus_read(struct mii_bus *bus, int phy_addr, int devad,
+				int regnum)
 {
 	/* WARNING: bus->phy_map[phy_addr].attached_dev == dev does
 	 * _NOT_ hold (e.g. when PHY is accessed through other MAC's MII bus)
@@ -239,8 +240,8 @@ static int au1000_mdiobus_read(struct mii_bus *bus, int phy_addr, int regnum)
 	return au1000_mdio_read(dev, phy_addr, regnum);
 }
 
-static int au1000_mdiobus_write(struct mii_bus *bus, int phy_addr, int regnum,
-				u16 value)
+static int au1000_mdiobus_write(struct mii_bus *bus, int phy_addr, int devad,
+				int regnum, u16 value)
 {
 	struct net_device *const dev = bus->priv;
 
diff --git a/drivers/net/ethernet/broadcom/bcm63xx_enet.c b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
index a11a8ad..016bda7 100644
--- a/drivers/net/ethernet/broadcom/bcm63xx_enet.c
+++ b/drivers/net/ethernet/broadcom/bcm63xx_enet.c
@@ -139,7 +139,7 @@ static int bcm_enet_mdio_write(struct bcm_enet_priv *priv, int mii_id,
 /*
  * MII read callback from phylib
  */
-static int bcm_enet_mdio_read_phylib(struct mii_bus *bus, int mii_id,
+static int bcm_enet_mdio_read_phylib(struct mii_bus *bus, int mii_id, int devad,
 				     int regnum)
 {
 	return bcm_enet_mdio_read(bus->priv, mii_id, regnum);
@@ -149,7 +149,7 @@ static int bcm_enet_mdio_read_phylib(struct mii_bus *bus, int mii_id,
  * MII write callback from phylib
  */
 static int bcm_enet_mdio_write_phylib(struct mii_bus *bus, int mii_id,
-				      int regnum, u16 value)
+				      int devad, int regnum, u16 value)
 {
 	return bcm_enet_mdio_write(bus->priv, mii_id, regnum, value);
 }
diff --git a/drivers/net/ethernet/broadcom/sb1250-mac.c b/drivers/net/ethernet/broadcom/sb1250-mac.c
index 0a1d7f2..7d0c64e 100644
--- a/drivers/net/ethernet/broadcom/sb1250-mac.c
+++ b/drivers/net/ethernet/broadcom/sb1250-mac.c
@@ -435,7 +435,8 @@ static void sbmac_mii_senddata(void __iomem *sbm_mdio, unsigned int data,
  *  	   value read, or 0xffff if an error occurred.
  ********************************************************************* */
 
-static int sbmac_mii_read(struct mii_bus *bus, int phyaddr, int regidx)
+static int sbmac_mii_read(struct mii_bus *bus, int phyaddr, int devad,
+			int regidx)
 {
 	struct sbmac_softc *sc = (struct sbmac_softc *)bus->priv;
 	void __iomem *sbm_mdio = sc->sbm_mdio;
@@ -528,8 +529,8 @@ static int sbmac_mii_read(struct mii_bus *bus, int phyaddr, int regidx)
  *  	   0 for success
  ********************************************************************* */
 
-static int sbmac_mii_write(struct mii_bus *bus, int phyaddr, int regidx,
-			   u16 regval)
+static int sbmac_mii_write(struct mii_bus *bus, int phyaddr, int devad,
+			int regidx, u16 regval)
 {
 	struct sbmac_softc *sc = (struct sbmac_softc *)bus->priv;
 	void __iomem *sbm_mdio = sc->sbm_mdio;
diff --git a/drivers/net/ethernet/broadcom/tg3.c b/drivers/net/ethernet/broadcom/tg3.c
index fe712f9..5f4007e 100644
--- a/drivers/net/ethernet/broadcom/tg3.c
+++ b/drivers/net/ethernet/broadcom/tg3.c
@@ -1168,7 +1168,7 @@ static int tg3_bmcr_reset(struct tg3 *tp)
 	return 0;
 }
 
-static int tg3_mdio_read(struct mii_bus *bp, int mii_id, int reg)
+static int tg3_mdio_read(struct mii_bus *bp, int mii_id, int devad, int reg)
 {
 	struct tg3 *tp = bp->priv;
 	u32 val;
@@ -1183,7 +1183,8 @@ static int tg3_mdio_read(struct mii_bus *bp, int mii_id, int reg)
 	return val;
 }
 
-static int tg3_mdio_write(struct mii_bus *bp, int mii_id, int reg, u16 val)
+static int tg3_mdio_write(struct mii_bus *bp, int mii_id, int devad, int reg,
+			u16 val)
 {
 	struct tg3 *tp = bp->priv;
 	u32 ret = 0;
diff --git a/drivers/net/ethernet/cadence/macb.c b/drivers/net/ethernet/cadence/macb.c
index a437b46..9479b2a 100644
--- a/drivers/net/ethernet/cadence/macb.c
+++ b/drivers/net/ethernet/cadence/macb.c
@@ -89,7 +89,8 @@ static void __init macb_get_hwaddr(struct macb *bp)
 	}
 }
 
-static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+static int macb_mdio_read(struct mii_bus *bus, int mii_id, int devad,
+			int regnum)
 {
 	struct macb *bp = bus->priv;
 	int value;
@@ -109,8 +110,8 @@ static int macb_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 	return value;
 }
 
-static int macb_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
-			   u16 value)
+static int macb_mdio_write(struct mii_bus *bus, int mii_id, int devad,
+			int regnum, u16 value)
 {
 	struct macb *bp = bus->priv;
 
diff --git a/drivers/net/ethernet/dnet.c b/drivers/net/ethernet/dnet.c
index c1063d1..70e8347 100644
--- a/drivers/net/ethernet/dnet.c
+++ b/drivers/net/ethernet/dnet.c
@@ -100,7 +100,8 @@ static void __devinit dnet_get_hwaddr(struct dnet *bp)
 		memcpy(bp->dev->dev_addr, addr, sizeof(addr));
 }
 
-static int dnet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+static int dnet_mdio_read(struct mii_bus *bus, int mii_id, int devad,
+			int regnum)
 {
 	struct dnet *bp = bus->priv;
 	u16 value;
@@ -132,8 +133,8 @@ static int dnet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 	return value;
 }
 
-static int dnet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
-			   u16 value)
+static int dnet_mdio_write(struct mii_bus *bus, int mii_id, int devad,
+			int regnum, u16 value)
 {
 	struct dnet *bp = bus->priv;
 	u16 tmp;
diff --git a/drivers/net/ethernet/ethoc.c b/drivers/net/ethernet/ethoc.c
index bdb348a..700e7db 100644
--- a/drivers/net/ethernet/ethoc.c
+++ b/drivers/net/ethernet/ethoc.c
@@ -611,7 +611,7 @@ static int ethoc_poll(struct napi_struct *napi, int budget)
 	return rx_work_done;
 }
 
-static int ethoc_mdio_read(struct mii_bus *bus, int phy, int reg)
+static int ethoc_mdio_read(struct mii_bus *bus, int phy, int devad, int reg)
 {
 	struct ethoc *priv = bus->priv;
 	int i;
@@ -633,7 +633,8 @@ static int ethoc_mdio_read(struct mii_bus *bus, int phy, int reg)
 	return -EBUSY;
 }
 
-static int ethoc_mdio_write(struct mii_bus *bus, int phy, int reg, u16 val)
+static int ethoc_mdio_write(struct mii_bus *bus, int phy, int devad, int reg,
+				u16 val)
 {
 	struct ethoc *priv = bus->priv;
 	int i;
diff --git a/drivers/net/ethernet/faraday/ftgmac100.c b/drivers/net/ethernet/faraday/ftgmac100.c
index 54709af..c38fa69 100644
--- a/drivers/net/ethernet/faraday/ftgmac100.c
+++ b/drivers/net/ethernet/faraday/ftgmac100.c
@@ -865,7 +865,8 @@ static int ftgmac100_mii_probe(struct ftgmac100 *priv)
 /******************************************************************************
  * struct mii_bus functions
  *****************************************************************************/
-static int ftgmac100_mdiobus_read(struct mii_bus *bus, int phy_addr, int regnum)
+static int ftgmac100_mdiobus_read(struct mii_bus *bus, int phy_addr, int devad,
+				int regnum)
 {
 	struct net_device *netdev = bus->priv;
 	struct ftgmac100 *priv = netdev_priv(netdev);
@@ -901,7 +902,7 @@ static int ftgmac100_mdiobus_read(struct mii_bus *bus, int phy_addr, int regnum)
 }
 
 static int ftgmac100_mdiobus_write(struct mii_bus *bus, int phy_addr,
-				   int regnum, u16 value)
+				   int devad, int regnum, u16 value)
 {
 	struct net_device *netdev = bus->priv;
 	struct ftgmac100 *priv = netdev_priv(netdev);
diff --git a/drivers/net/ethernet/freescale/fec.c b/drivers/net/ethernet/freescale/fec.c
index 1124ce0..162f367 100644
--- a/drivers/net/ethernet/freescale/fec.c
+++ b/drivers/net/ethernet/freescale/fec.c
@@ -886,7 +886,8 @@ spin_unlock:
 		phy_print_status(phy_dev);
 }
 
-static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int devad,
+				int regnum)
 {
 	struct fec_enet_private *fep = bus->priv;
 	unsigned long time_left;
@@ -912,8 +913,8 @@ static int fec_enet_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 	return FEC_MMFR_DATA(readl(fep->hwp + FEC_MII_DATA));
 }
 
-static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int regnum,
-			   u16 value)
+static int fec_enet_mdio_write(struct mii_bus *bus, int mii_id, int devad,
+			int regnum, u16 value)
 {
 	struct fec_enet_private *fep = bus->priv;
 	unsigned long time_left;
diff --git a/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c b/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c
index 360a578..81a3fff 100644
--- a/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c
+++ b/drivers/net/ethernet/freescale/fec_mpc52xx_phy.c
@@ -49,13 +49,14 @@ static int mpc52xx_fec_mdio_transfer(struct mii_bus *bus, int phy_id,
 		in_be32(&fec->mii_data) & FEC_MII_DATA_DATAMSK : 0;
 }
 
-static int mpc52xx_fec_mdio_read(struct mii_bus *bus, int phy_id, int reg)
+static int mpc52xx_fec_mdio_read(struct mii_bus *bus, int phy_id, int devad,
+				int reg)
 {
 	return mpc52xx_fec_mdio_transfer(bus, phy_id, reg, FEC_MII_READ_FRAME);
 }
 
-static int mpc52xx_fec_mdio_write(struct mii_bus *bus, int phy_id, int reg,
-		u16 data)
+static int mpc52xx_fec_mdio_write(struct mii_bus *bus, int phy_id, int devad,
+				int reg, u16 data)
 {
 	return mpc52xx_fec_mdio_transfer(bus, phy_id, reg,
 		data | FEC_MII_WRITE_FRAME);
diff --git a/drivers/net/ethernet/freescale/fs_enet/mii-fec.c b/drivers/net/ethernet/freescale/fs_enet/mii-fec.c
index e0e9d6c..b4ae560 100644
--- a/drivers/net/ethernet/freescale/fs_enet/mii-fec.c
+++ b/drivers/net/ethernet/freescale/fs_enet/mii-fec.c
@@ -49,7 +49,8 @@
 
 #define FEC_MII_LOOPS	10000
 
-static int fs_enet_fec_mii_read(struct mii_bus *bus , int phy_id, int location)
+static int fs_enet_fec_mii_read(struct mii_bus *bus , int phy_id, int devad,
+				int location)
 {
 	struct fec_info* fec = bus->priv;
 	struct fec __iomem *fecp = fec->fecp;
@@ -72,7 +73,8 @@ static int fs_enet_fec_mii_read(struct mii_bus *bus , int phy_id, int location)
 	return ret;
 }
 
-static int fs_enet_fec_mii_write(struct mii_bus *bus, int phy_id, int location, u16 val)
+static int fs_enet_fec_mii_write(struct mii_bus *bus, int phy_id, int devad,
+				int location, u16 val)
 {
 	struct fec_info* fec = bus->priv;
 	struct fec __iomem *fecp = fec->fecp;
diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.c b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
index 52f4e8a..94b9e17 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.c
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.c
@@ -62,7 +62,7 @@ struct fsl_pq_mdio_priv {
  * controlling the external PHYs, for example.
  */
 int fsl_pq_local_mdio_write(struct fsl_pq_mdio __iomem *regs, int mii_id,
-		int regnum, u16 value)
+			int regnum, u16 value)
 {
 	/* Set the PHY address and the register address we want to write */
 	out_be32(&regs->miimadd, (mii_id << 8) | regnum);
@@ -87,8 +87,8 @@ int fsl_pq_local_mdio_write(struct fsl_pq_mdio __iomem *regs, int mii_id,
  * and are always tied to the local mdio pins, which may not be the
  * same as system mdio bus, used for controlling the external PHYs, for eg.
  */
-int fsl_pq_local_mdio_read(struct fsl_pq_mdio __iomem *regs,
-		int mii_id, int regnum)
+int fsl_pq_local_mdio_read(struct fsl_pq_mdio __iomem *regs, int mii_id,
+			int regnum)
 {
 	u16 value;
 
@@ -120,7 +120,8 @@ static struct fsl_pq_mdio __iomem *fsl_pq_mdio_get_regs(struct mii_bus *bus)
  * Write value to the PHY at mii_id at register regnum,
  * on the bus, waiting until the write is done before returning.
  */
-int fsl_pq_mdio_write(struct mii_bus *bus, int mii_id, int regnum, u16 value)
+int fsl_pq_mdio_write(struct mii_bus *bus, int mii_id, int devad, int regnum,
+			u16 value)
 {
 	struct fsl_pq_mdio __iomem *regs = fsl_pq_mdio_get_regs(bus);
 
@@ -132,7 +133,7 @@ int fsl_pq_mdio_write(struct mii_bus *bus, int mii_id, int regnum, u16 value)
  * Read the bus for PHY at addr mii_id, register regnum, and
  * return the value.  Clears miimcom first.
  */
-int fsl_pq_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+int fsl_pq_mdio_read(struct mii_bus *bus, int mii_id, int devad, int regnum)
 {
 	struct fsl_pq_mdio __iomem *regs = fsl_pq_mdio_get_regs(bus);
 
@@ -191,7 +192,7 @@ static int fsl_pq_mdio_find_free(struct mii_bus *new_bus)
 	for (i = PHY_MAX_ADDR; i > 0; i--) {
 		u32 phy_id;
 
-		if (get_phy_id(new_bus, i, &phy_id))
+		if (get_phy_id(new_bus, i, MDIO_DEVAD_NONE, &phy_id))
 			return -1;
 
 		if (phy_id == 0xffffffff)
diff --git a/drivers/net/ethernet/freescale/fsl_pq_mdio.h b/drivers/net/ethernet/freescale/fsl_pq_mdio.h
index bd17a2a..4b5254c 100644
--- a/drivers/net/ethernet/freescale/fsl_pq_mdio.h
+++ b/drivers/net/ethernet/freescale/fsl_pq_mdio.h
@@ -41,11 +41,14 @@ struct fsl_pq_mdio {
 	u8 res4[2728];
 } __packed;
 
-int fsl_pq_mdio_read(struct mii_bus *bus, int mii_id, int regnum);
-int fsl_pq_mdio_write(struct mii_bus *bus, int mii_id, int regnum, u16 value);
+
+int fsl_pq_mdio_read(struct mii_bus *bus, int mii_id, int devad, int regnum);
+int fsl_pq_mdio_write(struct mii_bus *bus, int mii_id, int devad, int regnum,
+			u16 value);
 int fsl_pq_local_mdio_write(struct fsl_pq_mdio __iomem *regs, int mii_id,
-			  int regnum, u16 value);
-int fsl_pq_local_mdio_read(struct fsl_pq_mdio __iomem *regs, int mii_id, int regnum);
+			int regnum, u16 value);
+int fsl_pq_local_mdio_read(struct fsl_pq_mdio __iomem *regs, int mii_id,
+			int regnum);
 int __init fsl_pq_mdio_init(void);
 void fsl_pq_mdio_exit(void);
 void fsl_pq_mdio_bus_name(char *name, struct device_node *np);
diff --git a/drivers/net/ethernet/lantiq_etop.c b/drivers/net/ethernet/lantiq_etop.c
index 6bb2b95..b5ae9d2 100644
--- a/drivers/net/ethernet/lantiq_etop.c
+++ b/drivers/net/ethernet/lantiq_etop.c
@@ -337,7 +337,8 @@ static const struct ethtool_ops ltq_etop_ethtool_ops = {
 };
 
 static int
-ltq_etop_mdio_wr(struct mii_bus *bus, int phy_addr, int phy_reg, u16 phy_data)
+ltq_etop_mdio_wr(struct mii_bus *bus, int phy_addr, int devad, int phy_reg,
+		u16 phy_data)
 {
 	u32 val = MDIO_REQUEST |
 		((phy_addr & MDIO_ADDR_MASK) << MDIO_ADDR_OFFSET) |
@@ -351,7 +352,7 @@ ltq_etop_mdio_wr(struct mii_bus *bus, int phy_addr, int phy_reg, u16 phy_data)
 }
 
 static int
-ltq_etop_mdio_rd(struct mii_bus *bus, int phy_addr, int phy_reg)
+ltq_etop_mdio_rd(struct mii_bus *bus, int phy_addr, int devad, int phy_reg)
 {
 	u32 val = MDIO_REQUEST | MDIO_READ |
 		((phy_addr & MDIO_ADDR_MASK) << MDIO_ADDR_OFFSET) |
diff --git a/drivers/net/ethernet/marvell/mv643xx_eth.c b/drivers/net/ethernet/marvell/mv643xx_eth.c
index f6821aa..87b8038 100644
--- a/drivers/net/ethernet/marvell/mv643xx_eth.c
+++ b/drivers/net/ethernet/marvell/mv643xx_eth.c
@@ -1119,7 +1119,7 @@ static int smi_wait_ready(struct mv643xx_eth_shared_private *msp)
 	return 0;
 }
 
-static int smi_bus_read(struct mii_bus *bus, int addr, int reg)
+static int smi_bus_read(struct mii_bus *bus, int addr, int devad, int reg)
 {
 	struct mv643xx_eth_shared_private *msp = bus->priv;
 	void __iomem *smi_reg = msp->base + SMI_REG;
@@ -1146,7 +1146,8 @@ static int smi_bus_read(struct mii_bus *bus, int addr, int reg)
 	return ret & 0xffff;
 }
 
-static int smi_bus_write(struct mii_bus *bus, int addr, int reg, u16 val)
+static int smi_bus_write(struct mii_bus *bus, int addr, int devad, int reg,
+			u16 val)
 {
 	struct mv643xx_eth_shared_private *msp = bus->priv;
 	void __iomem *smi_reg = msp->base + SMI_REG;
diff --git a/drivers/net/ethernet/marvell/pxa168_eth.c b/drivers/net/ethernet/marvell/pxa168_eth.c
index d17d062..84a4a36 100644
--- a/drivers/net/ethernet/marvell/pxa168_eth.c
+++ b/drivers/net/ethernet/marvell/pxa168_eth.c
@@ -1302,7 +1302,8 @@ static int smi_wait_ready(struct pxa168_eth_private *pep)
 	return 0;
 }
 
-static int pxa168_smi_read(struct mii_bus *bus, int phy_addr, int regnum)
+static int pxa168_smi_read(struct mii_bus *bus, int phy_addr, int devad,
+			   int regnum)
 {
 	struct pxa168_eth_private *pep = bus->priv;
 	int i = 0;
@@ -1326,8 +1327,8 @@ static int pxa168_smi_read(struct mii_bus *bus, int phy_addr, int regnum)
 	return val & 0xffff;
 }
 
-static int pxa168_smi_write(struct mii_bus *bus, int phy_addr, int regnum,
-			    u16 value)
+static int pxa168_smi_write(struct mii_bus *bus, int phy_addr, int devad,
+			    int regnum, u16 value)
 {
 	struct pxa168_eth_private *pep = bus->priv;
 
diff --git a/drivers/net/ethernet/rdc/r6040.c b/drivers/net/ethernet/rdc/r6040.c
index 1fc01ca..3ec419f 100644
--- a/drivers/net/ethernet/rdc/r6040.c
+++ b/drivers/net/ethernet/rdc/r6040.c
@@ -243,7 +243,8 @@ static void r6040_phy_write(void __iomem *ioaddr,
 	}
 }
 
-static int r6040_mdiobus_read(struct mii_bus *bus, int phy_addr, int reg)
+static int r6040_mdiobus_read(struct mii_bus *bus, int phy_addr, int devad,
+				int reg)
 {
 	struct net_device *dev = bus->priv;
 	struct r6040_private *lp = netdev_priv(dev);
@@ -253,7 +254,7 @@ static int r6040_mdiobus_read(struct mii_bus *bus, int phy_addr, int reg)
 }
 
 static int r6040_mdiobus_write(struct mii_bus *bus, int phy_addr,
-						int reg, u16 value)
+			int devad, int reg, u16 value)
 {
 	struct net_device *dev = bus->priv;
 	struct r6040_private *lp = netdev_priv(dev);
diff --git a/drivers/net/ethernet/s6gmac.c b/drivers/net/ethernet/s6gmac.c
index a7ff8ea..7430211 100644
--- a/drivers/net/ethernet/s6gmac.c
+++ b/drivers/net/ethernet/s6gmac.c
@@ -661,7 +661,7 @@ static int s6mii_busy(struct s6gmac *pd, int tmo)
 	return 0;
 }
 
-static int s6mii_read(struct mii_bus *bus, int phy_addr, int regnum)
+static int s6mii_read(struct mii_bus *bus, int phy_addr, int devad, int regnum)
 {
 	struct s6gmac *pd = bus->priv;
 	s6mii_enable(pd);
@@ -677,7 +677,8 @@ static int s6mii_read(struct mii_bus *bus, int phy_addr, int regnum)
 	return (u16)readl(pd->reg + S6_GMAC_MACMIISTAT);
 }
 
-static int s6mii_write(struct mii_bus *bus, int phy_addr, int regnum, u16 value)
+static int s6mii_write(struct mii_bus *bus, int phy_addr, int devad,
+			int regnum, u16 value)
 {
 	struct s6gmac *pd = bus->priv;
 	s6mii_enable(pd);
diff --git a/drivers/net/ethernet/smsc/smsc911x.c b/drivers/net/ethernet/smsc/smsc911x.c
index a3aa4c0..87ee880 100644
--- a/drivers/net/ethernet/smsc/smsc911x.c
+++ b/drivers/net/ethernet/smsc/smsc911x.c
@@ -441,7 +441,8 @@ static void smsc911x_mac_write(struct smsc911x_data *pdata,
 }
 
 /* Get a phy register */
-static int smsc911x_mii_read(struct mii_bus *bus, int phyaddr, int regidx)
+static int smsc911x_mii_read(struct mii_bus *bus, int phyaddr, int devad,
+				int regidx)
 {
 	struct smsc911x_data *pdata = (struct smsc911x_data *)bus->priv;
 	unsigned long flags;
@@ -477,8 +478,8 @@ out:
 }
 
 /* Set a phy register */
-static int smsc911x_mii_write(struct mii_bus *bus, int phyaddr, int regidx,
-			   u16 val)
+static int smsc911x_mii_write(struct mii_bus *bus, int phyaddr, int devad,
+			int regidx, u16 val)
 {
 	struct smsc911x_data *pdata = (struct smsc911x_data *)bus->priv;
 	unsigned long flags;
@@ -709,11 +710,12 @@ static int smsc911x_phy_reset(struct smsc911x_data *pdata)
 	BUG_ON(!phy_dev->bus);
 
 	SMSC_TRACE(pdata, hw, "Performing PHY BCR Reset");
-	smsc911x_mii_write(phy_dev->bus, phy_dev->addr, MII_BMCR, BMCR_RESET);
+	smsc911x_mii_write(phy_dev->bus, phy_dev->addr, MDIO_DEVAD_NONE,
+				MII_BMCR, BMCR_RESET);
 	do {
 		msleep(1);
 		temp = smsc911x_mii_read(phy_dev->bus, phy_dev->addr,
-			MII_BMCR);
+			MDIO_DEVAD_NONE, MII_BMCR);
 	} while ((i--) && (temp & BMCR_RESET));
 
 	if (temp & BMCR_RESET) {
@@ -761,8 +763,8 @@ static int smsc911x_phy_loopbacktest(struct net_device *dev)
 
 	for (i = 0; i < 10; i++) {
 		/* Set PHY to 10/FD, no ANEG, and loopback mode */
-		smsc911x_mii_write(phy_dev->bus, phy_dev->addr,	MII_BMCR,
-			BMCR_LOOPBACK | BMCR_FULLDPLX);
+		smsc911x_mii_write(phy_dev->bus, phy_dev->addr,	MDIO_DEVAD_NONE,
+			MII_BMCR, BMCR_LOOPBACK | BMCR_FULLDPLX);
 
 		/* Enable MAC tx/rx, FD */
 		spin_lock_irqsave(&pdata->mac_lock, flags);
@@ -790,7 +792,8 @@ static int smsc911x_phy_loopbacktest(struct net_device *dev)
 	spin_unlock_irqrestore(&pdata->mac_lock, flags);
 
 	/* Cancel PHY loopback mode */
-	smsc911x_mii_write(phy_dev->bus, phy_dev->addr, MII_BMCR, 0);
+	smsc911x_mii_write(phy_dev->bus, phy_dev->addr, MDIO_DEVAD_NONE,
+			MII_BMCR, 0);
 
 	smsc911x_reg_write(pdata, TX_CFG, 0);
 	smsc911x_reg_write(pdata, RX_CFG, 0);
@@ -1759,7 +1762,8 @@ smsc911x_ethtool_getregs(struct net_device *dev, struct ethtool_regs *regs,
 	}
 
 	for (i = 0; i <= 31; i++)
-		data[j++] = smsc911x_mii_read(phy_dev->bus, phy_dev->addr, i);
+		data[j++] = smsc911x_mii_read(phy_dev->bus, phy_dev->addr,
+					MDIO_DEVAD_NONE, i);
 }
 
 static void smsc911x_eeprom_enable_access(struct smsc911x_data *pdata)
diff --git a/drivers/net/ethernet/smsc/smsc9420.c b/drivers/net/ethernet/smsc/smsc9420.c
index 4f15680..3a08ef0 100644
--- a/drivers/net/ethernet/smsc/smsc9420.c
+++ b/drivers/net/ethernet/smsc/smsc9420.c
@@ -128,7 +128,8 @@ static inline void smsc9420_pci_flush_write(struct smsc9420_pdata *pd)
 	smsc9420_reg_read(pd, ID_REV);
 }
 
-static int smsc9420_mii_read(struct mii_bus *bus, int phyaddr, int regidx)
+static int smsc9420_mii_read(struct mii_bus *bus, int phyaddr, int devad,
+				int regidx)
 {
 	struct smsc9420_pdata *pd = (struct smsc9420_pdata *)bus->priv;
 	unsigned long flags;
@@ -165,8 +166,8 @@ out:
 	return reg;
 }
 
-static int smsc9420_mii_write(struct mii_bus *bus, int phyaddr, int regidx,
-			   u16 val)
+static int smsc9420_mii_write(struct mii_bus *bus, int phyaddr, int devad,
+			int regidx, u16 val)
 {
 	struct smsc9420_pdata *pd = (struct smsc9420_pdata *)bus->priv;
 	unsigned long flags;
@@ -329,7 +330,8 @@ smsc9420_ethtool_getregs(struct net_device *dev, struct ethtool_regs *regs,
 		return;
 
 	for (i = 0; i <= 31; i++)
-		data[j++] = smsc9420_mii_read(phy_dev->bus, phy_dev->addr, i);
+		data[j++] = smsc9420_mii_read(phy_dev->bus, phy_dev->addr,
+				MDIO_DEVAD_NONE, i);
 }
 
 static void smsc9420_eeprom_enable_access(struct smsc9420_pdata *pd)
diff --git a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
index 9c3b9d5..36cdb1b 100644
--- a/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
+++ b/drivers/net/ethernet/stmicro/stmmac/stmmac_mdio.c
@@ -38,13 +38,15 @@
  * stmmac_mdio_read
  * @bus: points to the mii_bus structure
  * @phyaddr: MII addr reg bits 15-11
+ * @devad: unused
  * @phyreg: MII addr reg bits 10-6
  * Description: it reads data from the MII register from within the phy device.
  * For the 7111 GMAC, we must set the bit 0 in the MII address register while
  * accessing the PHY registers.
  * Fortunately, it seems this has no drawback for the 7109 MAC.
  */
-static int stmmac_mdio_read(struct mii_bus *bus, int phyaddr, int phyreg)
+static int stmmac_mdio_read(struct mii_bus *bus, int phyaddr, int devad,
+				int phyreg)
 {
 	struct net_device *ndev = bus->priv;
 	struct stmmac_priv *priv = netdev_priv(ndev);
@@ -70,12 +72,13 @@ static int stmmac_mdio_read(struct mii_bus *bus, int phyaddr, int phyreg)
  * stmmac_mdio_write
  * @bus: points to the mii_bus structure
  * @phyaddr: MII addr reg bits 15-11
+ * @devad: unused
  * @phyreg: MII addr reg bits 10-6
  * @phydata: phy data
  * Description: it writes the data into the MII register from within the device.
  */
-static int stmmac_mdio_write(struct mii_bus *bus, int phyaddr, int phyreg,
-			     u16 phydata)
+static int stmmac_mdio_write(struct mii_bus *bus, int phyaddr, int devad,
+				int phyreg, u16 phydata)
 {
 	struct net_device *ndev = bus->priv;
 	struct stmmac_priv *priv = netdev_priv(ndev);
diff --git a/drivers/net/ethernet/ti/cpmac.c b/drivers/net/ethernet/ti/cpmac.c
index aaac0c7..c89eac5 100644
--- a/drivers/net/ethernet/ti/cpmac.c
+++ b/drivers/net/ethernet/ti/cpmac.c
@@ -272,7 +272,7 @@ static void cpmac_dump_skb(struct net_device *dev, struct sk_buff *skb)
 	printk("\n");
 }
 
-static int cpmac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
+static int cpmac_mdio_read(struct mii_bus *bus, int phy_id, int devad, int reg)
 {
 	u32 val;
 
@@ -285,7 +285,7 @@ static int cpmac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
 	return MDIO_DATA(val);
 }
 
-static int cpmac_mdio_write(struct mii_bus *bus, int phy_id,
+static int cpmac_mdio_write(struct mii_bus *bus, int phy_id, int devad,
 			    int reg, u16 val)
 {
 	while (cpmac_read(bus->priv, CPMAC_MDIO_ACCESS(0)) & MDIO_BUSY)
diff --git a/drivers/net/ethernet/ti/davinci_mdio.c b/drivers/net/ethernet/ti/davinci_mdio.c
index 7615040..92ed777 100644
--- a/drivers/net/ethernet/ti/davinci_mdio.c
+++ b/drivers/net/ethernet/ti/davinci_mdio.c
@@ -199,7 +199,8 @@ static inline int wait_for_idle(struct davinci_mdio_data *data)
 	return -ETIMEDOUT;
 }
 
-static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
+static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int devad,
+				int phy_reg)
 {
 	struct davinci_mdio_data *data = bus->priv;
 	u32 reg;
@@ -244,7 +245,7 @@ static int davinci_mdio_read(struct mii_bus *bus, int phy_id, int phy_reg)
 }
 
 static int davinci_mdio_write(struct mii_bus *bus, int phy_id,
-			      int phy_reg, u16 phy_data)
+			      int devad, int phy_reg, u16 phy_data)
 {
 	struct davinci_mdio_data *data = bus->priv;
 	u32 reg;
diff --git a/drivers/net/ethernet/toshiba/tc35815.c b/drivers/net/ethernet/toshiba/tc35815.c
index 71b785c..2b166f7 100644
--- a/drivers/net/ethernet/toshiba/tc35815.c
+++ b/drivers/net/ethernet/toshiba/tc35815.c
@@ -501,7 +501,7 @@ static void	panic_queues(struct net_device *dev);
 
 static void tc35815_restart_work(struct work_struct *work);
 
-static int tc_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
+static int tc_mdio_read(struct mii_bus *bus, int mii_id, int devad, int regnum)
 {
 	struct net_device *dev = bus->priv;
 	struct tc35815_regs __iomem *tr =
@@ -518,7 +518,8 @@ static int tc_mdio_read(struct mii_bus *bus, int mii_id, int regnum)
 	return tc_readl(&tr->MD_Data) & 0xffff;
 }
 
-static int tc_mdio_write(struct mii_bus *bus, int mii_id, int regnum, u16 val)
+static int tc_mdio_write(struct mii_bus *bus, int mii_id, int devad, int regnum,
+			u16 val)
 {
 	struct net_device *dev = bus->priv;
 	struct tc35815_regs __iomem *tr =
diff --git a/drivers/net/ethernet/xilinx/ll_temac_mdio.c b/drivers/net/ethernet/xilinx/ll_temac_mdio.c
index 8cf9d4f..a9ddc90 100644
--- a/drivers/net/ethernet/xilinx/ll_temac_mdio.c
+++ b/drivers/net/ethernet/xilinx/ll_temac_mdio.c
@@ -19,7 +19,7 @@
 /* ---------------------------------------------------------------------
  * MDIO Bus functions
  */
-static int temac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
+static int temac_mdio_read(struct mii_bus *bus, int phy_id, int devad, int reg)
 {
 	struct temac_local *lp = bus->priv;
 	u32 rc;
@@ -38,7 +38,8 @@ static int temac_mdio_read(struct mii_bus *bus, int phy_id, int reg)
 	return rc;
 }
 
-static int temac_mdio_write(struct mii_bus *bus, int phy_id, int reg, u16 val)
+static int temac_mdio_write(struct mii_bus *bus, int phy_id, int devad, int reg,
+				u16 val)
 {
 	struct temac_local *lp = bus->priv;
 
diff --git a/drivers/net/ethernet/xilinx/xilinx_emaclite.c b/drivers/net/ethernet/xilinx/xilinx_emaclite.c
index 8018d7d..36a5b1b 100644
--- a/drivers/net/ethernet/xilinx/xilinx_emaclite.c
+++ b/drivers/net/ethernet/xilinx/xilinx_emaclite.c
@@ -741,6 +741,7 @@ static int xemaclite_mdio_wait(struct net_local *lp)
  * xemaclite_mdio_read - Read from a given MII management register
  * @bus:	the mii_bus struct
  * @phy_id:	the phy address
+ * @devad:	unused
  * @reg:	register number to read from
  *
  * This function waits till the device is ready to accept a new MDIO
@@ -749,7 +750,8 @@ static int xemaclite_mdio_wait(struct net_local *lp)
  *
  * Return:	Value read from the MII management register
  */
-static int xemaclite_mdio_read(struct mii_bus *bus, int phy_id, int reg)
+static int xemaclite_mdio_read(struct mii_bus *bus, int phy_id, int devad,
+				int reg)
 {
 	struct net_local *lp = bus->priv;
 	u32 ctrl_reg;
@@ -785,14 +787,15 @@ static int xemaclite_mdio_read(struct mii_bus *bus, int phy_id, int reg)
  * xemaclite_mdio_write - Write to a given MII management register
  * @bus:	the mii_bus struct
  * @phy_id:	the phy address
+ * @devad:	unused
  * @reg:	register number to write to
  * @val:	value to write to the register number specified by reg
  *
  * This function waits till the device is ready to accept a new MDIO
  * request and then writes the val to the MDIO Write Data register.
  */
-static int xemaclite_mdio_write(struct mii_bus *bus, int phy_id, int reg,
-				u16 val)
+static int xemaclite_mdio_write(struct mii_bus *bus, int phy_id, int devad,
+				int reg, u16 val)
 {
 	struct net_local *lp = bus->priv;
 	u32 ctrl_reg;
diff --git a/drivers/net/ethernet/xscale/ixp4xx_eth.c b/drivers/net/ethernet/xscale/ixp4xx_eth.c
index ec96d91..2f5d9cb 100644
--- a/drivers/net/ethernet/xscale/ixp4xx_eth.c
+++ b/drivers/net/ethernet/xscale/ixp4xx_eth.c
@@ -473,7 +473,8 @@ static int ixp4xx_mdio_cmd(struct mii_bus *bus, int phy_id, int location,
 		((__raw_readl(&mdio_regs->mdio_status[1]) & 0xFF) << 8);
 }
 
-static int ixp4xx_mdio_read(struct mii_bus *bus, int phy_id, int location)
+static int ixp4xx_mdio_read(struct mii_bus *bus, int phy_id, int devad,
+			    int location)
 {
 	unsigned long flags;
 	int ret;
@@ -488,8 +489,8 @@ static int ixp4xx_mdio_read(struct mii_bus *bus, int phy_id, int location)
 	return ret;
 }
 
-static int ixp4xx_mdio_write(struct mii_bus *bus, int phy_id, int location,
-			     u16 val)
+static int ixp4xx_mdio_write(struct mii_bus *bus, int phy_id, int devad,
+			     int location, u16 val)
 {
 	unsigned long flags;
 	int ret;
diff --git a/drivers/net/phy/fixed.c b/drivers/net/phy/fixed.c
index 1fa4d73..31f621e 100644
--- a/drivers/net/phy/fixed.c
+++ b/drivers/net/phy/fixed.c
@@ -115,7 +115,8 @@ static int fixed_phy_update_regs(struct fixed_phy *fp)
 	return 0;
 }
 
-static int fixed_mdio_read(struct mii_bus *bus, int phy_id, int reg_num)
+static int fixed_mdio_read(struct mii_bus *bus, int phy_id, int devad,
+				int reg_num)
 {
 	struct fixed_mdio_bus *fmb = bus->priv;
 	struct fixed_phy *fp;
@@ -139,7 +140,7 @@ static int fixed_mdio_read(struct mii_bus *bus, int phy_id, int reg_num)
 }
 
 static int fixed_mdio_write(struct mii_bus *bus, int phy_id, int reg_num,
-			    u16 val)
+			    int devad, u16 val)
 {
 	return 0;
 }
diff --git a/drivers/net/phy/icplus.c b/drivers/net/phy/icplus.c
index d66bd8d..5228d9c 100644
--- a/drivers/net/phy/icplus.c
+++ b/drivers/net/phy/icplus.c
@@ -49,36 +49,41 @@ static int ip175c_config_init(struct phy_device *phydev)
 	if (full_reset_performed == 0) {
 
 		/* master reset */
-		err = mdiobus_write(phydev->bus, 30, 0, 0x175c);
+		err = mdiobus_write(phydev->bus, 30, MDIO_DEVAD_NONE, 0,
+						0x175c);
 		if (err < 0)
 			return err;
 
 		/* ensure no bus delays overlap reset period */
-		err = mdiobus_read(phydev->bus, 30, 0);
+		err = mdiobus_read(phydev->bus, 30, MDIO_DEVAD_NONE, 0);
 
 		/* data sheet specifies reset period is 2 msec */
 		mdelay(2);
 
 		/* enable IP175C mode */
-		err = mdiobus_write(phydev->bus, 29, 31, 0x175c);
+		err = mdiobus_write(phydev->bus, 29, MDIO_DEVAD_NONE, 31,
+						0x175c);
 		if (err < 0)
 			return err;
 
 		/* Set MII0 speed and duplex (in PHY mode) */
-		err = mdiobus_write(phydev->bus, 29, 22, 0x420);
+		err = mdiobus_write(phydev->bus, 29, MDIO_DEVAD_NONE, 22,
+						0x420);
 		if (err < 0)
 			return err;
 
 		/* reset switch ports */
 		for (i = 0; i < 5; i++) {
 			err = mdiobus_write(phydev->bus, i,
-					    MII_BMCR, BMCR_RESET);
+						 MDIO_DEVAD_NONE,
+						 MII_BMCR, BMCR_RESET);
 			if (err < 0)
 				return err;
 		}
 
 		for (i = 0; i < 5; i++)
-			err = mdiobus_read(phydev->bus, i, MII_BMCR);
+			err = mdiobus_read(phydev->bus, i, MDIO_DEVAD_NONE,
+						MII_BMCR);
 
 		mdelay(2);
 
diff --git a/drivers/net/phy/mdio-bitbang.c b/drivers/net/phy/mdio-bitbang.c
index 6539189..2f6f02e 100644
--- a/drivers/net/phy/mdio-bitbang.c
+++ b/drivers/net/phy/mdio-bitbang.c
@@ -152,7 +152,7 @@ static int mdiobb_cmd_addr(struct mdiobb_ctrl *ctrl, int phy, u32 addr)
 	return dev_addr;
 }
 
-static int mdiobb_read(struct mii_bus *bus, int phy, int reg)
+static int mdiobb_read(struct mii_bus *bus, int phy, int devad, int reg)
 {
 	struct mdiobb_ctrl *ctrl = bus->priv;
 	int ret, i;
@@ -181,7 +181,8 @@ static int mdiobb_read(struct mii_bus *bus, int phy, int reg)
 	return ret;
 }
 
-static int mdiobb_write(struct mii_bus *bus, int phy, int reg, u16 val)
+static int mdiobb_write(struct mii_bus *bus, int phy, int devad, int reg,
+			u16 val)
 {
 	struct mdiobb_ctrl *ctrl = bus->priv;
 
diff --git a/drivers/net/phy/mdio-octeon.c b/drivers/net/phy/mdio-octeon.c
index bd12ba9..356973d 100644
--- a/drivers/net/phy/mdio-octeon.c
+++ b/drivers/net/phy/mdio-octeon.c
@@ -24,7 +24,8 @@ struct octeon_mdiobus {
 	int phy_irq[PHY_MAX_ADDR];
 };
 
-static int octeon_mdiobus_read(struct mii_bus *bus, int phy_id, int regnum)
+static int octeon_mdiobus_read(struct mii_bus *bus, int phy_id, int devad,
+				int regnum)
 {
 	struct octeon_mdiobus *p = bus->priv;
 	union cvmx_smix_cmd smi_cmd;
@@ -52,7 +53,7 @@ static int octeon_mdiobus_read(struct mii_bus *bus, int phy_id, int regnum)
 		return -EIO;
 }
 
-static int octeon_mdiobus_write(struct mii_bus *bus, int phy_id,
+static int octeon_mdiobus_write(struct mii_bus *bus, int phy_id, int devad
 				int regnum, u16 val)
 {
 	struct octeon_mdiobus *p = bus->priv;
diff --git a/drivers/net/phy/mdio_bus.c b/drivers/net/phy/mdio_bus.c
index 6c58da2..a6fa970 100644
--- a/drivers/net/phy/mdio_bus.c
+++ b/drivers/net/phy/mdio_bus.c
@@ -208,14 +208,14 @@ EXPORT_SYMBOL(mdiobus_scan);
  * because the bus read/write functions may wait for an interrupt
  * to conclude the operation.
  */
-int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum)
+int mdiobus_read(struct mii_bus *bus, int addr, int devad, u16 regnum)
 {
 	int retval;
 
 	BUG_ON(in_interrupt());
 
 	mutex_lock(&bus->mdio_lock);
-	retval = bus->read(bus, addr, regnum);
+	retval = bus->read(bus, addr, devad, regnum);
 	mutex_unlock(&bus->mdio_lock);
 
 	return retval;
@@ -233,14 +233,14 @@ EXPORT_SYMBOL(mdiobus_read);
  * because the bus read/write functions may wait for an interrupt
  * to conclude the operation.
  */
-int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val)
+int mdiobus_write(struct mii_bus *bus, int addr, int devad, u16 regnum, u16 val)
 {
 	int err;
 
 	BUG_ON(in_interrupt());
 
 	mutex_lock(&bus->mdio_lock);
-	err = bus->write(bus, addr, regnum, val);
+	err = bus->write(bus, addr, devad, regnum, val);
 	mutex_unlock(&bus->mdio_lock);
 
 	return err;
diff --git a/drivers/net/phy/phy.c b/drivers/net/phy/phy.c
index 3cbda08..00f5cfe 100644
--- a/drivers/net/phy/phy.c
+++ b/drivers/net/phy/phy.c
@@ -322,7 +322,8 @@ int phy_mii_ioctl(struct phy_device *phydev,
 
 	case SIOCGMIIREG:
 		mii_data->val_out = mdiobus_read(phydev->bus, mii_data->phy_id,
-						 mii_data->reg_num);
+						MDIO_DEVAD_NONE,
+						mii_data->reg_num);
 		break;
 
 	case SIOCSMIIREG:
@@ -354,7 +355,7 @@ int phy_mii_ioctl(struct phy_device *phydev,
 		}
 
 		mdiobus_write(phydev->bus, mii_data->phy_id,
-			      mii_data->reg_num, val);
+				MDIO_DEVAD_NONE, mii_data->reg_num, val);
 
 		if (mii_data->reg_num == MII_BMCR &&
 		    val & BMCR_RESET &&
diff --git a/drivers/net/phy/phy_device.c b/drivers/net/phy/phy_device.c
index 83a5a5a..22281d4 100644
--- a/drivers/net/phy/phy_device.c
+++ b/drivers/net/phy/phy_device.c
@@ -4,9 +4,11 @@
  * Framework for finding and configuring PHYs.
  * Also contains generic PHY driver
  *
+ * 10G PHY Driver support mostly appropriated from drivers/net/mdio.c
+ *
  * Author: Andy Fleming
  *
- * Copyright (c) 2004 Freescale Semiconductor, Inc.
+ * Copyright (c) 2004-2006, 2008-2011 Freescale Semiconductor, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -29,6 +31,7 @@
 #include <linux/module.h>
 #include <linux/mii.h>
 #include <linux/ethtool.h>
+#include <linux/mdio.h>
 #include <linux/phy.h>
 
 #include <asm/io.h>
@@ -207,13 +210,13 @@ static struct phy_device* phy_device_create(struct mii_bus *bus,
  * Description: Reads the ID registers of the PHY at @addr on the
  *   @bus, stores it in @phy_id and returns zero on success.
  */
-int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id)
+int get_phy_id(struct mii_bus *bus, int addr, int devad, u32 *phy_id)
 {
 	int phy_reg;
 
 	/* Grab the bits from PHYIR1, and put them
 	 * in the upper half */
-	phy_reg = mdiobus_read(bus, addr, MII_PHYSID1);
+	phy_reg = mdiobus_read(bus, addr, devad, MII_PHYSID1);
 
 	if (phy_reg < 0)
 		return -EIO;
@@ -221,7 +224,7 @@ int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id)
 	*phy_id = (phy_reg & 0xffff) << 16;
 
 	/* Grab the bits from PHYIR2, and put them in the lower half */
-	phy_reg = mdiobus_read(bus, addr, MII_PHYSID2);
+	phy_reg = mdiobus_read(bus, addr, devad, MII_PHYSID2);
 
 	if (phy_reg < 0)
 		return -EIO;
@@ -242,21 +245,31 @@ EXPORT_SYMBOL(get_phy_id);
  */
 struct phy_device * get_phy_device(struct mii_bus *bus, int addr)
 {
-	struct phy_device *dev = NULL;
-	u32 phy_id;
+	u32 phy_id = 0x1fffffff;
+	int i;
 	int r;
 
-	r = get_phy_id(bus, addr, &phy_id);
+	/* Try Standard (ie Clause 22) access */
+	r = get_phy_id(bus, addr, MDIO_DEVAD_NONE, &phy_id);
 	if (r)
 		return ERR_PTR(r);
 
-	/* If the phy_id is mostly Fs, there is no device there */
-	if ((phy_id & 0x1fffffff) == 0x1fffffff)
-		return NULL;
+	/* If the PHY ID is mostly f's, we didn't find anything */
+	if ((phy_id & 0x1fffffff) != 0x1fffffff)
+		return phy_device_create(bus, addr, phy_id);
 
-	dev = phy_device_create(bus, addr, phy_id);
+	/* Otherwise we have to try Clause 45 */
+	for (i = 1; i < 5; i++) {
+		r = get_phy_id(bus, addr, i, &phy_id);
+		if (r)
+			return ERR_PTR(r);
 
-	return dev;
+		/* If the phy_id is mostly Fs, there is no device there */
+		if ((phy_id & 0x1fffffff) != 0x1fffffff)
+			break;
+	}
+
+	return phy_device_create(bus, addr, phy_id);
 }
 EXPORT_SYMBOL(get_phy_device);
 
@@ -424,6 +437,11 @@ int phy_init_hw(struct phy_device *phydev)
 	return phydev->drv->config_init(phydev);
 }
 
+static struct phy_driver *generic_for_interface(phy_interface_t interface)
+{
+	return &genphy_driver;
+}
+
 /**
  * phy_attach_direct - attach a network device to a given PHY device pointer
  * @dev: network device to attach
@@ -433,8 +451,8 @@ int phy_init_hw(struct phy_device *phydev)
  *
  * Description: Called by drivers to attach to a particular PHY
  *     device. The phy_device is found, and properly hooked up
- *     to the phy_driver.  If no driver is attached, then the
- *     genphy_driver is used.  The phy_device is given a ptr to
+ *     to the phy_driver.  If no driver is attached, then a
+ *     generic driver is used.  The phy_device is given a ptr to
  *     the attaching device, and given a callback for link status
  *     change.  The phy_device is returned to the attaching driver.
  */
@@ -447,7 +465,9 @@ static int phy_attach_direct(struct net_device *dev, struct phy_device *phydev,
 	/* Assume that if there is no driver, that it doesn't
 	 * exist, and we should use the genphy driver. */
 	if (NULL == d->driver) {
-		d->driver = &genphy_driver.driver;
+		int err;
+
+		d->driver = generic_for_interface(interface);
 
 		err = d->driver->probe(d);
 		if (err >= 0)
@@ -529,7 +549,7 @@ void phy_detach(struct phy_device *phydev)
 	 * was using the generic driver), we unbind the device
 	 * from the generic driver so that there's a chance a
 	 * real driver could be loaded */
-	if (phydev->dev.driver == &genphy_driver.driver)
+	if (phydev->dev.driver == generic_for_interface(phydev->interface))
 		device_release_driver(&phydev->dev);
 }
 EXPORT_SYMBOL(phy_detach);
@@ -640,7 +660,6 @@ static int genphy_setup_forced(struct phy_device *phydev)
 	return err;
 }
 
-
 /**
  * genphy_restart_aneg - Enable and Restart Autonegotiation
  * @phydev: target phy_device struct
@@ -665,7 +684,6 @@ int genphy_restart_aneg(struct phy_device *phydev)
 }
 EXPORT_SYMBOL(genphy_restart_aneg);
 
-
 /**
  * genphy_config_aneg - restart auto-negotiation or write BMCR
  * @phydev: target phy_device struct
@@ -882,6 +900,7 @@ static int genphy_config_init(struct phy_device *phydev)
 
 	return 0;
 }
+
 int genphy_suspend(struct phy_device *phydev)
 {
 	int value;
@@ -1022,7 +1041,7 @@ static struct phy_driver genphy_driver = {
 	.read_status	= genphy_read_status,
 	.suspend	= genphy_suspend,
 	.resume		= genphy_resume,
-	.driver		= {.owner= THIS_MODULE, },
+	.driver		= {.owner = THIS_MODULE, },
 };
 
 static int __init phy_init(void)
@@ -1035,7 +1054,12 @@ static int __init phy_init(void)
 
 	rc = phy_driver_register(&genphy_driver);
 	if (rc)
-		mdio_bus_exit();
+		goto genphy_register_failed;
+
+	return rc;
+
+genphy_register_failed:
+	mdio_bus_exit();
 
 	return rc;
 }
diff --git a/include/linux/phy.h b/include/linux/phy.h
index 54fc413..ae1fdd8 100644
--- a/include/linux/phy.h
+++ b/include/linux/phy.h
@@ -6,7 +6,7 @@
  *
  * Author: Andy Fleming
  *
- * Copyright (c) 2004 Freescale Semiconductor, Inc.
+ * Copyright (c) 2004, 2009-2011 Freescale Semiconductor, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -22,6 +22,7 @@
 #include <linux/device.h>
 #include <linux/ethtool.h>
 #include <linux/mii.h>
+#include <linux/mdio.h>
 #include <linux/timer.h>
 #include <linux/workqueue.h>
 #include <linux/mod_devicetable.h>
@@ -65,6 +66,7 @@ typedef enum {
 	PHY_INTERFACE_MODE_RGMII_TXID,
 	PHY_INTERFACE_MODE_RTBI,
 	PHY_INTERFACE_MODE_SMII,
+	PHY_INTERFACE_MODE_XGMII
 } phy_interface_t;
 
 
@@ -96,8 +98,10 @@ struct mii_bus {
 	const char *name;
 	char id[MII_BUS_ID_SIZE];
 	void *priv;
-	int (*read)(struct mii_bus *bus, int phy_id, int regnum);
-	int (*write)(struct mii_bus *bus, int phy_id, int regnum, u16 val);
+	int (*read)(struct mii_bus *bus, int port_addr, int dev_addr,
+			int regnum);
+	int (*write)(struct mii_bus *bus, int port_addr, int dev_addr,
+			int regnum, u16 val);
 	int (*reset)(struct mii_bus *bus);
 
 	/*
@@ -134,8 +138,9 @@ int mdiobus_register(struct mii_bus *bus);
 void mdiobus_unregister(struct mii_bus *bus);
 void mdiobus_free(struct mii_bus *bus);
 struct phy_device *mdiobus_scan(struct mii_bus *bus, int addr);
-int mdiobus_read(struct mii_bus *bus, int addr, u32 regnum);
-int mdiobus_write(struct mii_bus *bus, int addr, u32 regnum, u16 val);
+int mdiobus_read(struct mii_bus *bus, int addr, int devad, u16 regnum);
+int mdiobus_write(struct mii_bus *bus, int addr, int devad,
+			u16 regnum, u16 val);
 
 
 #define PHY_INTERRUPT_DISABLED	0x0
@@ -307,6 +312,7 @@ struct phy_device {
 	/* See mii.h for more info */
 	u32 supported;
 	u32 advertising;
+	u32 mmds;
 
 	int autoneg;
 
@@ -443,6 +449,21 @@ struct phy_fixup {
 };
 
 /**
+ * is_10g_interface - Distinguishes 10G from 10/100/1000
+ * @interface: PHY interface type
+ *
+ * Returns true if the passed interface is capable of 10G,
+ * and therefore indicates the need for Clause-45-style
+ * MDIO transactions.
+ *
+ * For now, XGMII is the only 10G interface
+ */
+static inline bool is_10g_interface(phy_interface_t interface)
+{
+	return interface == PHY_INTERFACE_MODE_XGMII;
+}
+
+/**
  * phy_read - Convenience function for reading a given PHY register
  * @phydev: the phy_device struct
  * @regnum: register number to read
@@ -453,7 +474,22 @@ struct phy_fixup {
  */
 static inline int phy_read(struct phy_device *phydev, u32 regnum)
 {
-	return mdiobus_read(phydev->bus, phydev->addr, regnum);
+	return mdiobus_read(phydev->bus, phydev->addr, MDIO_DEVAD_NONE, regnum);
+}
+
+/**
+ * phy45_read - Convenience function for reading a given port/dev/reg address
+ * @phydev: The phy_device struct
+ * @devad: The device address to read
+ * @regnum: The register number to read
+ *
+ * NOTE: MUST NOT be called from interrupt context,
+ * because the bus read/write functions may wait for an interrupt
+ * to conclude the operation.
+ */
+static inline int phy45_read(struct phy_device *phydev, int devad, u16 regnum)
+{
+	return mdiobus_read(phydev->bus, phydev->addr, devad, regnum);
 }
 
 /**
@@ -468,10 +504,28 @@ static inline int phy_read(struct phy_device *phydev, u32 regnum)
  */
 static inline int phy_write(struct phy_device *phydev, u32 regnum, u16 val)
 {
-	return mdiobus_write(phydev->bus, phydev->addr, regnum, val);
+	return mdiobus_write(phydev->bus, phydev->addr, MDIO_DEVAD_NONE, regnum,
+				val);
+}
+
+/**
+ * phy45_write - Convenience function for writing a given port/dev/reg
+ * @phydev: the phy_device struct
+ * @devad: the device addr
+ * @regnum: register number to write
+ * @val: value to write to @regnum
+ *
+ * NOTE: MUST NOT be called from interrupt context,
+ * because the bus read/write functions may wait for an interrupt
+ * to conclude the operation.
+ */
+static inline int phy45_write(struct phy_device *phydev, u16 regnum,
+				int devad, u16 val)
+{
+	return mdiobus_write(phydev->bus, phydev->addr, devad, regnum, val);
 }
 
-int get_phy_id(struct mii_bus *bus, int addr, u32 *phy_id);
+int get_phy_id(struct mii_bus *bus, int addr, int devad, u32 *phy_id);
 struct phy_device* get_phy_device(struct mii_bus *bus, int addr);
 int phy_device_register(struct phy_device *phy);
 int phy_init_hw(struct phy_device *phydev);
diff --git a/net/dsa/slave.c b/net/dsa/slave.c
index 56cf9b8..8bcb864 100644
--- a/net/dsa/slave.c
+++ b/net/dsa/slave.c
@@ -15,7 +15,7 @@
 #include "dsa_priv.h"
 
 /* slave mii_bus handling ***************************************************/
-static int dsa_slave_phy_read(struct mii_bus *bus, int addr, int reg)
+static int dsa_slave_phy_read(struct mii_bus *bus, int addr, int devad, int reg)
 {
 	struct dsa_switch *ds = bus->priv;
 
@@ -25,7 +25,8 @@ static int dsa_slave_phy_read(struct mii_bus *bus, int addr, int reg)
 	return 0xffff;
 }
 
-static int dsa_slave_phy_write(struct mii_bus *bus, int addr, int reg, u16 val)
+static int dsa_slave_phy_write(struct mii_bus *bus, int addr, int devad,
+				int reg, u16 val)
 {
 	struct dsa_switch *ds = bus->priv;
 
-- 
1.7.3.4

^ permalink raw reply related

* [PATCH 2/2] phylib: Modify Vitesse RGMII skew settings
From: Andy Fleming @ 2011-10-13 14:33 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1318516435-24314-1-git-send-email-afleming@freescale.com>

The Vitesse driver was using the RGMII_ID interface type to determine if
skew was necessary.  However, we want to move away from using that
interface type, as it's really a property of the board's PHY connection.
However, some boards depend on it, so we want to support it, while
allowing new boards to use the more flexible "fixups" approach.  To do
this, we extract the code which adds skew into its own function, and
call that function when RGMII_ID has been selected.

Another side-effect of this change is that if your PHY has skew set
already, it doesn't clear it.  This way, the fixup code can modify the
register without config_init then clearing it.

Signed-off-by: Andy Fleming <afleming@freescale.com>
---
 drivers/net/phy/vitesse.c |   34 ++++++++++++++++++++++------------
 1 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/drivers/net/phy/vitesse.c b/drivers/net/phy/vitesse.c
index 5d8f6e1..0ec8e09 100644
--- a/drivers/net/phy/vitesse.c
+++ b/drivers/net/phy/vitesse.c
@@ -3,7 +3,7 @@
  *
  * Author: Kriston Carson
  *
- * Copyright (c) 2005 Freescale Semiconductor, Inc.
+ * Copyright (c) 2005, 2009 Freescale Semiconductor, Inc.
  *
  * This program is free software; you can redistribute  it and/or modify it
  * under  the terms of  the GNU General  Public License as published by the
@@ -61,32 +61,42 @@ MODULE_DESCRIPTION("Vitesse PHY driver");
 MODULE_AUTHOR("Kriston Carson");
 MODULE_LICENSE("GPL");
 
-static int vsc824x_config_init(struct phy_device *phydev)
+int vsc824x_add_skew(struct phy_device *phydev)
 {
-	int extcon;
 	int err;
-
-	err = phy_write(phydev, MII_VSC8244_AUX_CONSTAT,
-			MII_VSC8244_AUXCONSTAT_INIT);
-	if (err < 0)
-		return err;
+	int extcon;
 
 	extcon = phy_read(phydev, MII_VSC8244_EXT_CON1);
 
 	if (extcon < 0)
-		return err;
+		return extcon;
 
 	extcon &= ~(MII_VSC8244_EXTCON1_TX_SKEW_MASK |
 			MII_VSC8244_EXTCON1_RX_SKEW_MASK);
 
-	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID)
-		extcon |= (MII_VSC8244_EXTCON1_TX_SKEW |
-				MII_VSC8244_EXTCON1_RX_SKEW);
+	extcon |= (MII_VSC8244_EXTCON1_TX_SKEW |
+			MII_VSC8244_EXTCON1_RX_SKEW);
 
 	err = phy_write(phydev, MII_VSC8244_EXT_CON1, extcon);
 
 	return err;
 }
+EXPORT_SYMBOL(vsc824x_add_skew);
+
+static int vsc824x_config_init(struct phy_device *phydev)
+{
+	int err;
+
+	err = phy_write(phydev, MII_VSC8244_AUX_CONSTAT,
+			MII_VSC8244_AUXCONSTAT_INIT);
+	if (err < 0)
+		return err;
+
+	if (phydev->interface == PHY_INTERFACE_MODE_RGMII_ID)
+		err = vsc824x_add_skew(phydev);
+
+	return err;
+}
 
 static int vsc824x_ack_interrupt(struct phy_device *phydev)
 {
-- 
1.7.3.4

^ permalink raw reply related

* [PATCH 1/2] net: Allow skb_recycle_check to be done in stages
From: Andy Fleming @ 2011-10-13 14:33 UTC (permalink / raw)
  To: davem; +Cc: netdev

skb_recycle_check resets the skb if it's eligible for recycling.
However, there are times when a driver might want to optionally
manipulate the skb data with the skb before resetting the skb,
but after it has determined eligibility.  We do this by splitting the
eligibility check from the skb reset, creating two inline functions to
accomplish that task.

Signed-off-by: Andy Fleming <afleming@freescale.com>
---

I found this useful for a driver we're working on where the device can
do different things, depending on whether the skb is recycleable.

 include/linux/skbuff.h |   21 +++++++++++++++++++
 net/core/skbuff.c      |   51 ++++++++++++++++++++++++-----------------------
 2 files changed, 47 insertions(+), 25 deletions(-)

diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index ac6b05a..6b35ca1 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -525,6 +525,7 @@ static inline struct sk_buff *alloc_skb_fclone(unsigned int size,
 	return __alloc_skb(size, priority, 1, NUMA_NO_NODE);
 }
 
+extern void skb_recycle(struct sk_buff *skb);
 extern bool skb_recycle_check(struct sk_buff *skb, int skb_size);
 
 extern struct sk_buff *skb_morph(struct sk_buff *dst, struct sk_buff *src);
@@ -2459,5 +2460,25 @@ static inline void skb_checksum_none_assert(struct sk_buff *skb)
 
 bool skb_partial_csum_set(struct sk_buff *skb, u16 start, u16 off);
 
+static inline bool skb_is_recycleable(struct sk_buff *skb, int skb_size)
+{
+	if (irqs_disabled())
+		return false;
+
+	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)
+		return false;
+
+	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
+		return false;
+
+	skb_size = SKB_DATA_ALIGN(skb_size + NET_SKB_PAD);
+	if (skb_end_pointer(skb) - skb->head < skb_size)
+		return false;
+
+	if (skb_shared(skb) || skb_cloned(skb))
+		return false;
+
+	return true;
+}
 #endif	/* __KERNEL__ */
 #endif	/* _LINUX_SKBUFF_H */
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 5b2c5f1..48bee84 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -475,6 +475,30 @@ void consume_skb(struct sk_buff *skb)
 EXPORT_SYMBOL(consume_skb);
 
 /**
+ * 	skb_recycle - clean up an skb for reuse
+ * 	@skb: buffer
+ *
+ * 	Recycles the skb to be reused as a receive buffer. This
+ * 	function does any necessary reference count dropping, and
+ * 	cleans up the skbuff as if it just came from __alloc_skb().
+ */
+void skb_recycle(struct sk_buff *skb)
+{
+	struct skb_shared_info *shinfo;
+
+	skb_release_head_state(skb);
+
+	shinfo = skb_shinfo(skb);
+	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+	atomic_set(&shinfo->dataref, 1);
+
+	memset(skb, 0, offsetof(struct sk_buff, tail));
+	skb->data = skb->head + NET_SKB_PAD;
+	skb_reset_tail_pointer(skb);
+}
+EXPORT_SYMBOL(skb_recycle);
+
+/**
  *	skb_recycle_check - check if skb can be reused for receive
  *	@skb: buffer
  *	@skb_size: minimum receive buffer size
@@ -488,33 +512,10 @@ EXPORT_SYMBOL(consume_skb);
  */
 bool skb_recycle_check(struct sk_buff *skb, int skb_size)
 {
-	struct skb_shared_info *shinfo;
-
-	if (irqs_disabled())
-		return false;
-
-	if (skb_shinfo(skb)->tx_flags & SKBTX_DEV_ZEROCOPY)
-		return false;
-
-	if (skb_is_nonlinear(skb) || skb->fclone != SKB_FCLONE_UNAVAILABLE)
-		return false;
-
-	skb_size = SKB_DATA_ALIGN(skb_size + NET_SKB_PAD);
-	if (skb_end_pointer(skb) - skb->head < skb_size)
-		return false;
-
-	if (skb_shared(skb) || skb_cloned(skb))
+	if (!skb_is_recycleable(skb, skb_size))
 		return false;
 
-	skb_release_head_state(skb);
-
-	shinfo = skb_shinfo(skb);
-	memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
-	atomic_set(&shinfo->dataref, 1);
-
-	memset(skb, 0, offsetof(struct sk_buff, tail));
-	skb->data = skb->head + NET_SKB_PAD;
-	skb_reset_tail_pointer(skb);
+	skb_recycle(skb);
 
 	return true;
 }
-- 
1.7.3.4

^ permalink raw reply related

* [PATCH v7 8/8] Disable task moving when using kernel memory accounting
From: Glauber Costa @ 2011-10-13 13:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen,
	netdev, linux-mm, kirill, avagin, devel, Glauber Costa
In-Reply-To: <1318511382-31051-1-git-send-email-glommer@parallels.com>

Since this code is still experimental, we are leaving the exact
details of how to move tasks between cgroups when kernel memory
accounting is used as future work.

For now, we simply disallow movement if there are any pending
accounted memory.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
---
 mm/memcontrol.c |   31 ++++++++++++++++++-------------
 1 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 1ba318d..b46232b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -408,23 +408,11 @@ void sock_update_memcg(struct sock *sk)
 
 	rcu_read_lock();
 	sk->sk_cgrp = mem_cgroup_from_task(current);
-
-	/*
-	 * We don't need to protect against anything task-related, because
-	 * we are basically stuck with the sock pointer that won't change,
-	 * even if the task that originated the socket changes cgroups.
-	 *
-	 * What we do have to guarantee, is that the chain leading us to
-	 * the top level won't change under our noses. Incrementing the
-	 * reference count via cgroup_exclude_rmdir guarantees that.
-	 */
-	cgroup_exclude_rmdir(mem_cgroup_css(sk->sk_cgrp));
 	rcu_read_unlock();
 }
 
 void sock_release_memcg(struct sock *sk)
 {
-	cgroup_release_and_wakeup_rmdir(mem_cgroup_css(sk->sk_cgrp));
 }
 
 void memcg_sockets_allocated_dec(struct mem_cgroup *memcg, struct proto *prot)
@@ -5634,10 +5622,17 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 {
 	int ret = 0;
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+	struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+	if (from != mem &&
+	    res_counter_read_u64(&mem->tcp.tcp_memory_allocated, RES_USAGE)) {
+		printk(KERN_WARNING "Can't move tasks between cgroups: "
+			"Kernel memory held. task: %s\n", p->comm);
+		return 1;
+	}
 
 	if (mem->move_charge_at_immigrate) {
 		struct mm_struct *mm;
-		struct mem_cgroup *from = mem_cgroup_from_task(p);
 
 		VM_BUG_ON(from == mem);
 
@@ -5805,6 +5800,16 @@ static int mem_cgroup_can_attach(struct cgroup_subsys *ss,
 				struct cgroup *cgroup,
 				struct task_struct *p)
 {
+	struct mem_cgroup *mem = mem_cgroup_from_cont(cgroup);
+	struct mem_cgroup *from = mem_cgroup_from_task(p);
+
+	if (from != mem &&
+	    res_counter_read_u64(&mem->tcp.tcp_memory_allocated, RES_USAGE)) {
+		printk(KERN_WARNING "Can't move tasks between cgroups: "
+			"Kernel memory held. task: %s\n", p->comm);
+		return 1;
+	}
+
 	return 0;
 }
 static void mem_cgroup_cancel_attach(struct cgroup_subsys *ss,
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* RE: [net-next PATCH] net: allow vlan traffic to be received under bond
From: Hans Schillström @ 2011-10-13 13:09 UTC (permalink / raw)
  To: John Fastabend
  Cc: Jesse Gross, Jiri Pirko, davem@davemloft.net,
	netdev@vger.kernel.org, fubar@us.ibm.com
In-Reply-To: <4E944116.8020103@intel.com>

>On 10/11/2011 4:08 AM, Hans Schillstrom wrote:
>> Hello
>> On Tuesday 11 October 2011 04:43:03 Jesse Gross wrote:
>>> On Mon, Oct 10, 2011 at 7:07 PM, John Fastabend
>>> <john.r.fastabend@intel.com> wrote:
>>>> On 10/10/2011 3:37 PM, Jiri Pirko wrote:
>>>>> Mon, Oct 10, 2011 at 09:16:41PM CEST, john.r.fastabend@intel.com wrote:
>>>>>> The following configuration used to work as I expected. At least
>>>>>> we could use the fcoe interfaces to do MPIO and the bond0 iface
>>>>>> to do load balancing or failover.
>>>>>>
>>>>>>       ---eth2.228-fcoe
>>>>>>       |
>>>>>> eth2 -----|
>>>>>>          |
>>>>>>          |---- bond0
>>>>>>          |
>>>>>> eth3 -----|
>>>>>>       |
>>>>>>       ---eth3.228-fcoe
>>>>>>
>>>>>> This worked because of a change we added to allow inactive slaves
>>>>>> to rx 'exact' matches. This functionality was kept intact with the
>>>>>> rx_handler mechanism. However now the vlan interface attached to the
>>>>>> active slave never receives traffic because the bonding rx_handler
>>>>>> updates the skb->dev and goto's another_round. Previously, the
>>>>>> vlan_do_receive() logic was called before the bonding rx_handler.
>>>>>>
>>>>>> Now by the time vlan_do_receive calls vlan_find_dev() the
>>>>>> skb->dev is set to bond0 and it is clear no vlan is attached
>>>>>> to this iface. The vlan lookup fails.
>>>>>>
>>>>>> This patch moves the VLAN check above the rx_handler. A VLAN
>>>>>> tagged frame is now routed to the eth2.228-fcoe iface in the
>>>>>> above schematic. Untagged frames continue to the bond0 as
>>>>>> normal. This case also remains intact,
>>>>>>
>>>>>> eth2 --> bond0 --> vlan.228
>>>>>>
>>>>>> Here the skb is VLAN tagged but the vlan lookup fails on eth2
>>>>>> causing the bonding rx_handler to be called. On the second
>>>>>> pass the vlan lookup is on the bond0 iface and completes as
>>>>>> expected.
>>>>>>
>>>>>> Putting a VLAN.228 on both the bond0 and eth2 device will
>>>>>> result in eth2.228 receiving the skb. I don't think this is
>>>>>> completely unexpected and was the result prior to the rx_handler
>>>>>> result.
>>
>> I think this OK, but I do have a question
>> if bond0 is in Active/Backup mode, eth2 and eth3 got the same MAC.addr,
>> what about the VLAN:s ?
>> (or is just one of thme working ??)
>>
>
>The VLAN MAC address will not be managed by the bond. In the
>storage case a SAN mac may be used (NETDEV_HW_ADDR_T_SAN).
>Otherwise the MAC can be managed normally.
>
>Both VLANs will receive frames but in some modes only to packet
>handlers that have exact matches. See bond_should_deliver_exact_match().
>
>.John.

Have made some test now,  this patch solves a big issue that we had with VLANs 
i.e. as a work-a-round we put macvlans in between the phys. interface and the bond.
I have tested the scenario below, where tipc is running on VLAN below the bonding interface.
With the patch it works fine now.
If you want you can add a
Tested-by: Hans Schillstrom <hams.schillstrom@ericsson.com>

                      +---------+        +---------+
                    +---------+ |      +---------+ |
                  +---------+ |-+    +---------+ |-+
                  | macvlan |-+      | macvlan |-+
                  +---------+        +---------+
                     | | |              | | |
                     | | |           +---------+
                     | | |       ----|  vlan8  |
                     | | |      /    +---------+
                     | | |     /
                  +----+----+ /
        +---------|  bond0  |=------------+
        |         +---------+             |
        |                                 |
   +----+----+  +---------+          +----+----+  +---------+
   |   eth1  |--|  vlan20 |          |   eth2  |--|  vlan21 |
   +----+----+  +---------+          +----+----+  +---------+
        |                                 |
        |                                 |
  +-----+-----+                     +-----+-----+
  | Switch-0  |_____________________|   Sw1     |
  |           |    ISL TRUNK        |           |
  +-+---+---+-+                     +-+---+---+-+
    |   |   |                         |   |   |
  vlan1 | vlan20                    vlan1 | vlan21
      vlan8                             vlan8



Thanks 
Hans

^ permalink raw reply

* [PATCH v7 7/8] Display current tcp memory allocation in kmem cgroup
From: Glauber Costa @ 2011-10-13 13:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen,
	netdev, linux-mm, kirill, avagin, devel, Glauber Costa
In-Reply-To: <1318511382-31051-1-git-send-email-glommer@parallels.com>

This patch introduces kmem.tcp_current_memory file, living in the
kmem_cgroup filesystem. It is a simple read-only file that displays the
amount of kernel memory currently consumed by the cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/cgroups/memory.txt |    1 +
 mm/memcontrol.c                  |    5 +++++
 2 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index e773bd7..b937a99 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -79,6 +79,7 @@ Brief summary of control files.
  memory.independent_kmem_limit	 # select whether or not kernel memory limits are
 				   independent of user limits
  memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
+ memory.kmem.tcp.usage_in_bytes  # show current tcp buf memory allocation
 
 1. History
 
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b696267..1ba318d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -543,6 +543,11 @@ static struct cftype tcp_files[] = {
 		.read_u64 = mem_cgroup_read,
 		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_LIMIT),
 	},
+	{
+		.name = "kmem.tcp.usage_in_bytes",
+		.read_u64 = mem_cgroup_read,
+		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_USAGE),
+	},
 };
 
 static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v7 6/8] tcp buffer limitation: per-cgroup limit
From: Glauber Costa @ 2011-10-13 13:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen,
	netdev, linux-mm, kirill, avagin, devel, Glauber Costa
In-Reply-To: <1318511382-31051-1-git-send-email-glommer@parallels.com>

This patch uses the "tcp_max_mem" field of the kmem_cgroup to
effectively control the amount of kernel memory pinned by a cgroup.

We have to make sure that none of the memory pressure thresholds
specified in the namespace are bigger than the current cgroup.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 Documentation/cgroups/memory.txt |    1 +
 include/linux/memcontrol.h       |   10 +++++
 mm/memcontrol.c                  |   79 ++++++++++++++++++++++++++++++++++----
 net/ipv4/sysctl_net_ipv4.c       |   20 ++++++++++
 4 files changed, 102 insertions(+), 8 deletions(-)

diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
index 0dafd70..e773bd7 100644
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -78,6 +78,7 @@ Brief summary of control files.
 
  memory.independent_kmem_limit	 # select whether or not kernel memory limits are
 				   independent of user limits
+ memory.kmem.tcp.limit_in_bytes  # set/show hard limit for tcp buf memory
 
 1. History
 
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index a27dad9..e0ccec5 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -397,6 +397,9 @@ int tcp_init_cgroup(const struct proto *prot, struct cgroup *cgrp,
 		    struct cgroup_subsys *ss);
 void tcp_destroy_cgroup(const struct proto *prot, struct cgroup *cgrp,
 			struct cgroup_subsys *ss);
+
+unsigned long long tcp_max_memory(const struct mem_cgroup *memcg);
+void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx);
 #else
 /* memcontrol includes sockets.h, that includes memcontrol.h ... */
 static inline void memcg_sockets_allocated_dec(struct mem_cgroup *memcg,
@@ -413,6 +416,13 @@ static inline void sock_update_memcg(struct sock *sk)
 static inline void sock_release_memcg(struct sock *sk)
 {
 }
+static inline unsigned long long tcp_max_memory(const struct mem_cgroup *memcg)
+{
+	return -1ULL;
+}
+static inline void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
+{
+}
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 #endif /* CONFIG_INET */
 #endif /* _LINUX_MEMCONTROL_H */
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index f953b32..b696267 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -365,6 +365,7 @@ enum mem_type {
 	_MEMSWAP,
 	_OOM_TYPE,
 	_KMEM,
+	_KMEM_TCP,
 };
 
 #define MEMFILE_PRIVATE(x, val)	(((x) << 16) | (val))
@@ -385,6 +386,11 @@ enum mem_type {
 
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
 static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont);
+static inline bool mem_cgroup_is_root(struct mem_cgroup *memcg)
+{
+	return (memcg == root_mem_cgroup);
+}
+
 /* Writing them here to avoid exposing memcg's inner layout */
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 #ifdef CONFIG_INET
@@ -510,6 +516,35 @@ struct percpu_counter *sockets_allocated_tcp(const struct mem_cgroup *memcg)
 }
 EXPORT_SYMBOL(sockets_allocated_tcp);
 
+static void tcp_update_limit(struct mem_cgroup *memcg, u64 val)
+{
+	struct net *net = current->nsproxy->net_ns;
+	int i;
+
+	val >>= PAGE_SHIFT;
+
+	for (i = 0; i < 3; i++)
+		memcg->tcp.tcp_prot_mem[i]  = min_t(long, val,
+					     net->ipv4.sysctl_tcp_mem[i]);
+}
+
+static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
+			    const char *buffer);
+
+static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft);
+/*
+ * We need those things internally in pages, so don't reuse
+ * mem_cgroup_{read,write}
+ */
+static struct cftype tcp_files[] = {
+	{
+		.name = "kmem.tcp.limit_in_bytes",
+		.write_string = mem_cgroup_write,
+		.read_u64 = mem_cgroup_read,
+		.private = MEMFILE_PRIVATE(_KMEM_TCP, RES_LIMIT),
+	},
+};
+
 static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
 {
 	struct res_counter *parent_res_counter = NULL;
@@ -527,6 +562,7 @@ int tcp_init_cgroup(const struct proto *prot, struct cgroup *cgrp,
 		    struct cgroup_subsys *ss)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct mem_cgroup *parent = parent_mem_cgroup(memcg);
 	struct net *net = current->nsproxy->net_ns;
 	/*
 	 * We need to initialize it at populate, not create time.
@@ -537,7 +573,20 @@ int tcp_init_cgroup(const struct proto *prot, struct cgroup *cgrp,
 	memcg->tcp.tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
 	memcg->tcp.tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
-	return 0;
+	/* Let root cgroup unlimited. All others, respect parent's if needed */
+	if (parent && !parent->use_hierarchy) {
+		unsigned long limit;
+		int ret;
+		limit = nr_free_buffer_pages() / 8;
+		limit = max(limit, 128UL);
+		ret = res_counter_set_limit(&memcg->tcp.tcp_memory_allocated,
+					    limit * 2);
+		if (ret)
+			return ret;
+	}
+
+	return cgroup_add_files(cgrp, ss, tcp_files,
+				ARRAY_SIZE(tcp_files));
 }
 EXPORT_SYMBOL(tcp_init_cgroup);
 
@@ -549,7 +598,18 @@ void tcp_destroy_cgroup(const struct proto *prot, struct cgroup *cgrp,
 	percpu_counter_destroy(&memcg->tcp.tcp_sockets_allocated);
 }
 EXPORT_SYMBOL(tcp_destroy_cgroup);
+
+unsigned long long tcp_max_memory(const struct mem_cgroup *memcg)
+{
+	return res_counter_read_u64(&CONSTCG(memcg)->tcp.tcp_memory_allocated,
+				    RES_LIMIT);
+}
 #undef CONSTCG
+
+void tcp_prot_mem(struct mem_cgroup *memcg, long val, int idx)
+{
+	memcg->tcp.tcp_prot_mem[idx] = val;
+}
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -1048,12 +1108,6 @@ static struct mem_cgroup *mem_cgroup_get_next(struct mem_cgroup *iter,
 #define for_each_mem_cgroup_all(iter) \
 	for_each_mem_cgroup_tree_cond(iter, NULL, true)
 
-
-static inline bool mem_cgroup_is_root(struct mem_cgroup *mem)
-{
-	return (mem == root_mem_cgroup);
-}
-
 void mem_cgroup_count_vm_event(struct mm_struct *mm, enum vm_event_item idx)
 {
 	struct mem_cgroup *mem;
@@ -4071,7 +4125,9 @@ static u64 mem_cgroup_read(struct cgroup *cont, struct cftype *cft)
 	case _KMEM:
 		val = res_counter_read_u64(&mem->kmem, name);
 		break;
-
+	case _KMEM_TCP:
+		val = res_counter_read_u64(&mem->tcp.tcp_memory_allocated, name);
+		break;
 	default:
 		BUG();
 		break;
@@ -4104,6 +4160,13 @@ static int mem_cgroup_write(struct cgroup *cont, struct cftype *cft,
 			break;
 		if (type == _MEM)
 			ret = mem_cgroup_resize_limit(memcg, val);
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+		else if (type == _KMEM_TCP) {
+			ret = res_counter_set_limit(&memcg->tcp.tcp_memory_allocated,
+						    val);
+			tcp_update_limit(memcg, val);
+		}
+#endif
 		else
 			ret = mem_cgroup_resize_memsw_limit(memcg, val);
 		break;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index bbd67ab..cdc35f6 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/memcontrol.h>
 #include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
@@ -182,6 +183,10 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	int ret;
 	unsigned long vec[3];
 	struct net *net = current->nsproxy->net_ns;
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	int i;
+	struct mem_cgroup *cg;
+#endif
 
 	ctl_table tmp = {
 		.data = &vec,
@@ -198,6 +203,21 @@ static int ipv4_tcp_mem(ctl_table *ctl, int write,
 	if (ret)
 		return ret;
 
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	rcu_read_lock();
+	cg = mem_cgroup_from_task(current);
+	for (i = 0; i < 3; i++)
+		if (vec[i] > tcp_max_memory(cg)) {
+			rcu_read_unlock();
+			return -EINVAL;
+		}
+
+	tcp_prot_mem(cg, vec[0], 0);
+	tcp_prot_mem(cg, vec[1], 1);
+	tcp_prot_mem(cg, vec[2], 2);
+	rcu_read_unlock();
+#endif
+
 	net->ipv4.sysctl_tcp_mem[0] = vec[0];
 	net->ipv4.sysctl_tcp_mem[1] = vec[1];
 	net->ipv4.sysctl_tcp_mem[2] = vec[2];
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v7 5/8] per-netns ipv4 sysctl_tcp_mem
From: Glauber Costa @ 2011-10-13 13:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen,
	netdev, linux-mm, kirill, avagin, devel, Glauber Costa
In-Reply-To: <1318511382-31051-1-git-send-email-glommer@parallels.com>

This patch allows each namespace to independently set up
its levels for tcp memory pressure thresholds. This patch
alone does not buy much: we need to make this values
per group of process somehow. This is achieved in the
patches that follows in this patchset.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/net/netns/ipv4.h   |    1 +
 include/net/tcp.h          |    1 -
 mm/memcontrol.c            |    8 ++++--
 net/ipv4/sysctl_net_ipv4.c |   51 +++++++++++++++++++++++++++++++++++++------
 net/ipv4/tcp.c             |   13 ++--------
 5 files changed, 53 insertions(+), 21 deletions(-)

diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index d786b4f..bbd023a 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -55,6 +55,7 @@ struct netns_ipv4 {
 	int current_rt_cache_rebuild_count;
 
 	unsigned int sysctl_ping_group_range[2];
+	long sysctl_tcp_mem[3];
 
 	atomic_t rt_genid;
 	atomic_t dev_addr_genid;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index ec57cf2..3609d87 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -232,7 +232,6 @@ extern int sysctl_tcp_fack;
 extern int sysctl_tcp_reordering;
 extern int sysctl_tcp_ecn;
 extern int sysctl_tcp_dsack;
-extern long sysctl_tcp_mem[3];
 extern int sysctl_tcp_wmem[3];
 extern int sysctl_tcp_rmem[3];
 extern int sysctl_tcp_app_win;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e79171..f953b32 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -390,6 +390,7 @@ static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont);
 #ifdef CONFIG_INET
 #include <net/sock.h>
 #include <net/ip.h>
+#include <linux/nsproxy.h>
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -526,14 +527,15 @@ int tcp_init_cgroup(const struct proto *prot, struct cgroup *cgrp,
 		    struct cgroup_subsys *ss)
 {
 	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	struct net *net = current->nsproxy->net_ns;
 	/*
 	 * We need to initialize it at populate, not create time.
 	 * This is because net sysctl tables are not up until much
 	 * later
 	 */
-	memcg->tcp.tcp_prot_mem[0] = sysctl_tcp_mem[0];
-	memcg->tcp.tcp_prot_mem[1] = sysctl_tcp_mem[1];
-	memcg->tcp.tcp_prot_mem[2] = sysctl_tcp_mem[2];
+	memcg->tcp.tcp_prot_mem[0] = net->ipv4.sysctl_tcp_mem[0];
+	memcg->tcp.tcp_prot_mem[1] = net->ipv4.sysctl_tcp_mem[1];
+	memcg->tcp.tcp_prot_mem[2] = net->ipv4.sysctl_tcp_mem[2];
 
 	return 0;
 }
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 69fd720..bbd67ab 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -14,6 +14,7 @@
 #include <linux/init.h>
 #include <linux/slab.h>
 #include <linux/nsproxy.h>
+#include <linux/swap.h>
 #include <net/snmp.h>
 #include <net/icmp.h>
 #include <net/ip.h>
@@ -174,6 +175,36 @@ static int proc_allowed_congestion_control(ctl_table *ctl,
 	return ret;
 }
 
+static int ipv4_tcp_mem(ctl_table *ctl, int write,
+			   void __user *buffer, size_t *lenp,
+			   loff_t *ppos)
+{
+	int ret;
+	unsigned long vec[3];
+	struct net *net = current->nsproxy->net_ns;
+
+	ctl_table tmp = {
+		.data = &vec,
+		.maxlen = sizeof(vec),
+		.mode = ctl->mode,
+	};
+
+	if (!write) {
+		ctl->data = &net->ipv4.sysctl_tcp_mem;
+		return proc_doulongvec_minmax(ctl, write, buffer, lenp, ppos);
+	}
+
+	ret = proc_doulongvec_minmax(&tmp, write, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	net->ipv4.sysctl_tcp_mem[0] = vec[0];
+	net->ipv4.sysctl_tcp_mem[1] = vec[1];
+	net->ipv4.sysctl_tcp_mem[2] = vec[2];
+
+	return 0;
+}
+
 static struct ctl_table ipv4_table[] = {
 	{
 		.procname	= "tcp_timestamps",
@@ -433,13 +464,6 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec
 	},
 	{
-		.procname	= "tcp_mem",
-		.data		= &sysctl_tcp_mem,
-		.maxlen		= sizeof(sysctl_tcp_mem),
-		.mode		= 0644,
-		.proc_handler	= proc_doulongvec_minmax
-	},
-	{
 		.procname	= "tcp_wmem",
 		.data		= &sysctl_tcp_wmem,
 		.maxlen		= sizeof(sysctl_tcp_wmem),
@@ -721,6 +745,12 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= ipv4_ping_group_range,
 	},
+	{
+		.procname	= "tcp_mem",
+		.maxlen		= sizeof(init_net.ipv4.sysctl_tcp_mem),
+		.mode		= 0644,
+		.proc_handler	= ipv4_tcp_mem,
+	},
 	{ }
 };
 
@@ -734,6 +764,7 @@ EXPORT_SYMBOL_GPL(net_ipv4_ctl_path);
 static __net_init int ipv4_sysctl_init_net(struct net *net)
 {
 	struct ctl_table *table;
+	unsigned long limit;
 
 	table = ipv4_net_table;
 	if (!net_eq(net, &init_net)) {
@@ -769,6 +800,12 @@ static __net_init int ipv4_sysctl_init_net(struct net *net)
 
 	net->ipv4.sysctl_rt_cache_rebuild_count = 4;
 
+	limit = nr_free_buffer_pages() / 8;
+	limit = max(limit, 128UL);
+	net->ipv4.sysctl_tcp_mem[0] = limit / 4 * 3;
+	net->ipv4.sysctl_tcp_mem[1] = limit;
+	net->ipv4.sysctl_tcp_mem[2] = net->ipv4.sysctl_tcp_mem[0] * 2;
+
 	net->ipv4.ipv4_hdr = register_net_sysctl_table(net,
 			net_ipv4_ctl_path, table);
 	if (net->ipv4.ipv4_hdr == NULL)
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 259f6d9..b1abebd 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -282,11 +282,9 @@ int sysctl_tcp_fin_timeout __read_mostly = TCP_FIN_TIMEOUT;
 struct percpu_counter tcp_orphan_count;
 EXPORT_SYMBOL_GPL(tcp_orphan_count);
 
-long sysctl_tcp_mem[3] __read_mostly;
 int sysctl_tcp_wmem[3] __read_mostly;
 int sysctl_tcp_rmem[3] __read_mostly;
 
-EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
@@ -334,7 +332,7 @@ EXPORT_SYMBOL(tcp_enter_memory_pressure_nocg);
 
 long *tcp_sysctl_mem_nocg(const struct mem_cgroup *memcg)
 {
-	return sysctl_tcp_mem;
+	return init_net.ipv4.sysctl_tcp_mem;
 }
 EXPORT_SYMBOL(tcp_sysctl_mem_nocg);
 
@@ -3298,14 +3296,9 @@ void __init tcp_init(void)
 	sysctl_tcp_max_orphans = cnt / 2;
 	sysctl_max_syn_backlog = max(128, cnt / 256);
 
-	limit = nr_free_buffer_pages() / 8;
-	limit = max(limit, 128UL);
-	sysctl_tcp_mem[0] = limit / 4 * 3;
-	sysctl_tcp_mem[1] = limit;
-	sysctl_tcp_mem[2] = sysctl_tcp_mem[0] * 2;
-
 	/* Set per-socket limits to no more than 1/128 the pressure threshold */
-	limit = ((unsigned long)sysctl_tcp_mem[1]) << (PAGE_SHIFT - 7);
+	limit = ((unsigned long)init_net.ipv4.sysctl_tcp_mem[1])
+		<< (PAGE_SHIFT - 7);
 	max_share = min(4UL*1024*1024, limit);
 
 	sysctl_tcp_wmem[0] = SK_MEM_QUANTUM;
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related

* [PATCH v7 4/8] per-cgroup tcp buffers control
From: Glauber Costa @ 2011-10-13 13:09 UTC (permalink / raw)
  To: linux-kernel
  Cc: akpm, lizf, kamezawa.hiroyu, ebiederm, davem, paul, gthelen,
	netdev, linux-mm, kirill, avagin, devel, Glauber Costa
In-Reply-To: <1318511382-31051-1-git-send-email-glommer@parallels.com>

With all the infrastructure in place, this patch implements
per-cgroup control for tcp memory pressure handling.

Signed-off-by: Glauber Costa <glommer@parallels.com>
Reviewed-by: KAMEZAWA Hiroyuki<kamezawa.hiroyu@jp.fujitsu.com>
CC: David S. Miller <davem@davemloft.net>
CC: Eric W. Biederman <ebiederm@xmission.com>
---
 include/linux/memcontrol.h |    4 +
 include/net/sock.h         |   14 ++++
 include/net/tcp.h          |   17 +++++
 mm/memcontrol.c            |  141 ++++++++++++++++++++++++++++++++++++++++++++
 net/core/sock.c            |   39 +++++++++++-
 net/ipv4/tcp.c             |   47 +++++++--------
 net/ipv4/tcp_ipv4.c        |   11 ++++
 net/ipv6/tcp_ipv6.c        |   10 +++-
 8 files changed, 255 insertions(+), 28 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 99a8ba2..a27dad9 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -393,6 +393,10 @@ void sock_update_memcg(struct sock *sk);
 void sock_release_memcg(struct sock *sk);
 void memcg_sockets_allocated_dec(struct mem_cgroup *memcg, struct proto *prot);
 void memcg_sockets_allocated_inc(struct mem_cgroup *memcg, struct proto *prot);
+int tcp_init_cgroup(const struct proto *prot, struct cgroup *cgrp,
+		    struct cgroup_subsys *ss);
+void tcp_destroy_cgroup(const struct proto *prot, struct cgroup *cgrp,
+			struct cgroup_subsys *ss);
 #else
 /* memcontrol includes sockets.h, that includes memcontrol.h ... */
 static inline void memcg_sockets_allocated_dec(struct mem_cgroup *memcg,
diff --git a/include/net/sock.h b/include/net/sock.h
index 163f87b..efd7664 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -64,6 +64,8 @@
 #include <net/dst.h>
 #include <net/checksum.h>
 
+int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss);
+void sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss);
 /*
  * This structure really needs to be cleaned up.
  * Most of it is for TCP, and not used by any of
@@ -819,6 +821,18 @@ struct proto {
 	/* Pointer to the per-cgroup version of the the sysctl_mem field */
 	long			*(*prot_mem)(const struct mem_cgroup *memcg);
 
+	/*
+	 * cgroup specific init/deinit functions. Called once for all
+	 * protocols that implement it, from cgroups populate function.
+	 * This function has to setup any files the protocol want to
+	 * appear in the kmem cgroup filesystem.
+	 */
+	int			(*init_cgroup)(const struct proto *prot,
+					       struct cgroup *cgrp,
+					       struct cgroup_subsys *ss);
+	void			(*destroy_cgroup)(const struct proto *prot,
+						  struct cgroup *cgrp,
+						  struct cgroup_subsys *ss);
 	int			*sysctl_wmem;
 	int			*sysctl_rmem;
 	int			max_header;
diff --git a/include/net/tcp.h b/include/net/tcp.h
index eac7bf6..ec57cf2 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -31,6 +31,7 @@
 #include <linux/crypto.h>
 #include <linux/cryptohash.h>
 #include <linux/kref.h>
+#include <linux/res_counter.h>
 
 #include <net/inet_connection_sock.h>
 #include <net/inet_timewait_sock.h>
@@ -255,6 +256,21 @@ extern int sysctl_tcp_thin_linear_timeouts;
 extern int sysctl_tcp_thin_dupack;
 
 struct mem_cgroup;
+struct tcp_memcontrol {
+	/* per-cgroup tcp memory pressure knobs */
+	struct res_counter tcp_memory_allocated;
+	struct percpu_counter tcp_sockets_allocated;
+	/* those two are read-mostly, leave them at the end */
+	long tcp_prot_mem[3];
+	int tcp_memory_pressure;
+};
+
+extern long *tcp_sysctl_mem_nocg(const struct mem_cgroup *memcg);
+struct percpu_counter *sockets_allocated_tcp_nocg(const struct mem_cgroup *memcg);
+int *memory_pressure_tcp_nocg(const struct mem_cgroup *memcg);
+long memory_allocated_tcp_add_nocg(struct mem_cgroup *memcg, long val,
+				   int *parent_status);
+
 extern long *tcp_sysctl_mem(const struct mem_cgroup *memcg);
 struct percpu_counter *sockets_allocated_tcp(const struct mem_cgroup *memcg);
 int *memory_pressure_tcp(const struct mem_cgroup *memcg);
@@ -1023,6 +1039,7 @@ static inline void tcp_openreq_init(struct request_sock *req,
 	ireq->loc_port = tcp_hdr(skb)->dest;
 }
 
+extern void tcp_enter_memory_pressure_nocg(struct sock *sk);
 extern void tcp_enter_memory_pressure(struct sock *sk);
 
 static inline int keepalive_intvl_when(const struct tcp_sock *tp)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 4e71fd8..4e79171 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,9 @@
 #include <linux/cpu.h>
 #include <linux/oom.h>
 #include "internal.h"
+#ifdef CONFIG_INET
+#include <net/tcp.h>
+#endif
 
 #include <asm/uaccess.h>
 
@@ -294,6 +297,10 @@ struct mem_cgroup {
 	 */
 	struct mem_cgroup_stat_cpu nocpu_base;
 	spinlock_t pcp_counter_lock;
+
+#ifdef CONFIG_INET
+	struct tcp_memcontrol tcp;
+#endif
 };
 
 /* Stuffs for move charges at task migration. */
@@ -377,10 +384,12 @@ enum mem_type {
 #define MEM_CGROUP_RECLAIM_SOFT		(1 << MEM_CGROUP_RECLAIM_SOFT_BIT)
 
 static struct mem_cgroup *parent_mem_cgroup(struct mem_cgroup *memcg);
+static struct mem_cgroup *mem_cgroup_from_cont(struct cgroup *cont);
 /* Writing them here to avoid exposing memcg's inner layout */
 #ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 #ifdef CONFIG_INET
 #include <net/sock.h>
+#include <net/ip.h>
 
 void sock_update_memcg(struct sock *sk)
 {
@@ -426,6 +435,119 @@ void memcg_sockets_allocated_inc(struct mem_cgroup *memcg, struct proto *prot)
 		percpu_counter_inc(prot->sockets_allocated(memcg));
 }
 EXPORT_SYMBOL(memcg_sockets_allocated_inc);
+
+/*
+ * Pressure flag: try to collapse.
+ * Technical note: it is used by multiple contexts non atomically.
+ * All the __sk_mem_schedule() is of this nature: accounting
+ * is strict, actions are advisory and have some latency.
+ */
+void tcp_enter_memory_pressure(struct sock *sk)
+{
+	struct mem_cgroup *memcg = sk->sk_cgrp;
+	if (!memcg->tcp.tcp_memory_pressure) {
+		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
+		memcg->tcp.tcp_memory_pressure = 1;
+	}
+}
+EXPORT_SYMBOL(tcp_enter_memory_pressure);
+
+#define CONSTCG(m) ((struct mem_cgroup *)(m))
+long *tcp_sysctl_mem(const struct mem_cgroup *memcg)
+{
+	return CONSTCG(memcg)->tcp.tcp_prot_mem;
+}
+EXPORT_SYMBOL(tcp_sysctl_mem);
+
+/*
+ * We will be passed a value in pages. But our limits are internally
+ * all in bytes. We need to convert it before testing the allocation,
+ * and convert it back when returning data to the network layer
+ */
+long memory_allocated_tcp_add(struct mem_cgroup *memcg, long val,
+			      int *parent_status)
+{
+	int ret = 0;
+	struct res_counter *failed;
+
+	if (val > 0) {
+		val <<= PAGE_SHIFT;
+		ret = res_counter_charge(&memcg->tcp.tcp_memory_allocated,
+					 val, &failed);
+		if (!ret)
+			*parent_status = UNDER_LIMIT;
+		else
+			*parent_status = OVER_LIMIT;
+	} else if (val < 0) {
+		if (*parent_status == OVER_LIMIT)
+			/*
+			 * res_counter charge already surely uncharged the
+			 * parent if something went wrong.
+			 */
+			WARN_ON(1);
+		else {
+			val = (-val) << PAGE_SHIFT;
+			res_counter_uncharge(&memcg->tcp.tcp_memory_allocated,
+					     val);
+		}
+	}
+
+	return res_counter_read_u64(&memcg->tcp.tcp_memory_allocated,
+				    RES_USAGE) >> PAGE_SHIFT;
+}
+EXPORT_SYMBOL(memory_allocated_tcp_add);
+
+int *memory_pressure_tcp(const struct mem_cgroup *memcg)
+{
+	return &CONSTCG(memcg)->tcp.tcp_memory_pressure;
+}
+EXPORT_SYMBOL(memory_pressure_tcp);
+
+struct percpu_counter *sockets_allocated_tcp(const struct mem_cgroup *memcg)
+{
+	return &CONSTCG(memcg)->tcp.tcp_sockets_allocated;
+}
+EXPORT_SYMBOL(sockets_allocated_tcp);
+
+static void tcp_create_cgroup(struct mem_cgroup *cg, struct cgroup_subsys *ss)
+{
+	struct res_counter *parent_res_counter = NULL;
+	struct mem_cgroup *parent = parent_mem_cgroup(cg);
+
+	if (parent)
+		parent_res_counter = &parent->tcp.tcp_memory_allocated;
+
+	cg->tcp.tcp_memory_pressure = 0;
+	res_counter_init(&cg->tcp.tcp_memory_allocated, parent_res_counter);
+	percpu_counter_init(&cg->tcp.tcp_sockets_allocated, 0);
+}
+
+int tcp_init_cgroup(const struct proto *prot, struct cgroup *cgrp,
+		    struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+	/*
+	 * We need to initialize it at populate, not create time.
+	 * This is because net sysctl tables are not up until much
+	 * later
+	 */
+	memcg->tcp.tcp_prot_mem[0] = sysctl_tcp_mem[0];
+	memcg->tcp.tcp_prot_mem[1] = sysctl_tcp_mem[1];
+	memcg->tcp.tcp_prot_mem[2] = sysctl_tcp_mem[2];
+
+	return 0;
+}
+EXPORT_SYMBOL(tcp_init_cgroup);
+
+void tcp_destroy_cgroup(const struct proto *prot, struct cgroup *cgrp,
+			struct cgroup_subsys *ss)
+{
+	struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+	percpu_counter_destroy(&memcg->tcp.tcp_sockets_allocated);
+}
+EXPORT_SYMBOL(tcp_destroy_cgroup);
+#undef CONSTCG
 #endif /* CONFIG_INET */
 #endif /* CONFIG_CGROUP_MEM_RES_CTLR_KMEM */
 
@@ -4833,14 +4955,27 @@ static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 
 	ret = cgroup_add_files(cont, ss, kmem_cgroup_files,
 			       ARRAY_SIZE(kmem_cgroup_files));
+
+	if (!ret)
+		ret = sockets_populate(cont, ss);
 	return ret;
 };
 
+static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
+				struct cgroup *cont)
+{
+	sockets_destroy(cont, ss);
+}
 #else
 static int register_kmem_files(struct cgroup *cont, struct cgroup_subsys *ss)
 {
 	return 0;
 }
+
+static void kmem_cgroup_destroy(struct cgroup_subsys *ss,
+				struct cgroup *cont)
+{
+}
 #endif
 
 static int alloc_mem_cgroup_per_zone_info(struct mem_cgroup *mem, int node)
@@ -5058,6 +5193,10 @@ mem_cgroup_create(struct cgroup_subsys *ss, struct cgroup *cont)
 	mem->last_scanned_node = MAX_NUMNODES;
 	INIT_LIST_HEAD(&mem->oom_notify);
 
+#if defined(CONFIG_CGROUP_MEM_RES_CTLR_KMEM) && defined(CONFIG_INET)
+	tcp_create_cgroup(mem, ss);
+#endif
+
 	if (parent)
 		mem->swappiness = mem_cgroup_swappiness(parent);
 	atomic_set(&mem->refcnt, 1);
@@ -5083,6 +5222,8 @@ static void mem_cgroup_destroy(struct cgroup_subsys *ss,
 {
 	struct mem_cgroup *mem = mem_cgroup_from_cont(cont);
 
+	kmem_cgroup_destroy(ss, cont);
+
 	mem_cgroup_put(mem);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 22ef143..3fa3ccb 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -135,6 +135,42 @@
 #include <net/tcp.h>
 #endif
 
+static DEFINE_RWLOCK(proto_list_lock);
+static LIST_HEAD(proto_list);
+
+int sockets_populate(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct proto *proto;
+	int ret = 0;
+
+	read_lock(&proto_list_lock);
+	list_for_each_entry(proto, &proto_list, node) {
+		if (proto->init_cgroup)
+			ret = proto->init_cgroup(proto, cgrp, ss);
+			if (ret)
+				goto out;
+	}
+
+	read_unlock(&proto_list_lock);
+	return ret;
+out:
+	list_for_each_entry_continue_reverse(proto, &proto_list, node)
+		if (proto->destroy_cgroup)
+			proto->destroy_cgroup(proto, cgrp, ss);
+	read_unlock(&proto_list_lock);
+	return ret;
+}
+
+void sockets_destroy(struct cgroup *cgrp, struct cgroup_subsys *ss)
+{
+	struct proto *proto;
+	read_lock(&proto_list_lock);
+	list_for_each_entry_reverse(proto, &proto_list, node)
+		if (proto->destroy_cgroup)
+			proto->destroy_cgroup(proto, cgrp, ss);
+	read_unlock(&proto_list_lock);
+}
+
 /*
  * Each address family might have different locking rules, so we have
  * one slock key per address family:
@@ -2262,9 +2298,6 @@ void sk_common_release(struct sock *sk)
 }
 EXPORT_SYMBOL(sk_common_release);
 
-static DEFINE_RWLOCK(proto_list_lock);
-static LIST_HEAD(proto_list);
-
 #ifdef CONFIG_PROC_FS
 #define PROTO_INUSE_NR	64	/* should be enough for the first time */
 struct prot_inuse {
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index dc8f01e..259f6d9 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -290,13 +290,6 @@ EXPORT_SYMBOL(sysctl_tcp_mem);
 EXPORT_SYMBOL(sysctl_tcp_rmem);
 EXPORT_SYMBOL(sysctl_tcp_wmem);
 
-atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
-
-/*
- * Current number of TCP sockets.
- */
-struct percpu_counter tcp_sockets_allocated;
-
 /*
  * TCP splice context
  */
@@ -306,47 +299,51 @@ struct tcp_splice_state {
 	unsigned int flags;
 };
 
-/*
- * Pressure flag: try to collapse.
- * Technical note: it is used by multiple contexts non atomically.
- * All the __sk_mem_schedule() is of this nature: accounting
- * is strict, actions are advisory and have some latency.
- */
+/* Current number of TCP sockets. */
+struct percpu_counter tcp_sockets_allocated;
+atomic_long_t tcp_memory_allocated;	/* Current allocated memory. */
 int tcp_memory_pressure __read_mostly;
 
-int *memory_pressure_tcp(const struct mem_cgroup *memcg)
+int *memory_pressure_tcp_nocg(const struct mem_cgroup *memcg)
 {
 	return &tcp_memory_pressure;
 }
-EXPORT_SYMBOL(memory_pressure_tcp);
+EXPORT_SYMBOL(memory_pressure_tcp_nocg);
 
-struct percpu_counter *sockets_allocated_tcp(const struct mem_cgroup *memcg)
+struct percpu_counter
+*sockets_allocated_tcp_nocg(const struct mem_cgroup *memcg)
 {
 	return &tcp_sockets_allocated;
 }
-EXPORT_SYMBOL(sockets_allocated_tcp);
+EXPORT_SYMBOL(sockets_allocated_tcp_nocg);
 
-void tcp_enter_memory_pressure(struct sock *sk)
+/*
+ * Pressure flag: try to collapse.
+ * Technical note: it is used by multiple contexts non atomically.
+ * All the __sk_mem_schedule() is of this nature: accounting
+ * is strict, actions are advisory and have some latency.
+ */
+void tcp_enter_memory_pressure_nocg(struct sock *sk)
 {
 	if (!tcp_memory_pressure) {
 		NET_INC_STATS(sock_net(sk), LINUX_MIB_TCPMEMORYPRESSURES);
 		tcp_memory_pressure = 1;
 	}
 }
-EXPORT_SYMBOL(tcp_enter_memory_pressure);
+EXPORT_SYMBOL(tcp_enter_memory_pressure_nocg);
 
-long *tcp_sysctl_mem(const struct mem_cgroup *memcg)
+long *tcp_sysctl_mem_nocg(const struct mem_cgroup *memcg)
 {
 	return sysctl_tcp_mem;
 }
-EXPORT_SYMBOL(tcp_sysctl_mem);
+EXPORT_SYMBOL(tcp_sysctl_mem_nocg);
 
-long memory_allocated_tcp_add(struct mem_cgroup *memcg, long val,
-			      int *parent_status)
+long memory_allocated_tcp_add_nocg(struct mem_cgroup *memcg, long val,
+				   int *parent_status)
 {
 	return atomic_long_add_return(val, &tcp_memory_allocated);
 }
-EXPORT_SYMBOL(memory_allocated_tcp_add);
+EXPORT_SYMBOL(memory_allocated_tcp_add_nocg);
 
 /* Convert seconds to retransmits based on initial and max timeout */
 static u8 secs_to_retrans(int seconds, int timeout, int rto_max)
@@ -3248,7 +3245,9 @@ void __init tcp_init(void)
 
 	BUILD_BUG_ON(sizeof(struct tcp_skb_cb) > sizeof(skb->cb));
 
+#ifndef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 	percpu_counter_init(&tcp_sockets_allocated, 0);
+#endif
 	percpu_counter_init(&tcp_orphan_count, 0);
 	tcp_hashinfo.bind_bucket_cachep =
 		kmem_cache_create("tcp_bind_bucket",
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 7072060..aac71e9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -2607,12 +2607,23 @@ struct proto tcp_prot = {
 	.hash			= inet_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
+	.orphan_count		= &tcp_orphan_count,
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
+	.init_cgroup		= tcp_init_cgroup,
+	.destroy_cgroup		= tcp_destroy_cgroup,
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
 	.memory_pressure	= memory_pressure_tcp,
 	.sockets_allocated	= sockets_allocated_tcp,
 	.orphan_count		= &tcp_orphan_count,
 	.mem_allocated_add	= memory_allocated_tcp_add,
 	.prot_mem		= tcp_sysctl_mem,
+#else
+	.enter_memory_pressure	= tcp_enter_memory_pressure_nocg,
+	.memory_pressure	= memory_pressure_tcp_nocg,
+	.sockets_allocated	= sockets_allocated_tcp_nocg,
+	.mem_allocated_add	= memory_allocated_tcp_add_nocg,
+	.prot_mem		= tcp_sysctl_mem_nocg,
+#endif
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index bdc0003..0a52587 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -2200,12 +2200,20 @@ struct proto tcpv6_prot = {
 	.hash			= tcp_v6_hash,
 	.unhash			= inet_unhash,
 	.get_port		= inet_csk_get_port,
+	.orphan_count		= &tcp_orphan_count,
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR_KMEM
 	.enter_memory_pressure	= tcp_enter_memory_pressure,
 	.sockets_allocated	= sockets_allocated_tcp,
 	.mem_allocated_add	= memory_allocated_tcp_add,
 	.memory_pressure	= memory_pressure_tcp,
-	.orphan_count		= &tcp_orphan_count,
 	.prot_mem		= tcp_sysctl_mem,
+#else
+	.enter_memory_pressure	= tcp_enter_memory_pressure_nocg,
+	.sockets_allocated	= sockets_allocated_tcp_nocg,
+	.mem_allocated_add	= memory_allocated_tcp_add_nocg,
+	.memory_pressure	= memory_pressure_tcp_nocg,
+	.prot_mem		= tcp_sysctl_mem_nocg,
+#endif
 	.sysctl_wmem		= sysctl_tcp_wmem,
 	.sysctl_rmem		= sysctl_tcp_rmem,
 	.max_header		= MAX_TCP_HEADER,
-- 
1.7.6.4

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org.  For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"dont@kvack.org"> email@kvack.org </a>

^ permalink raw reply related


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox