Netdev List

Netdev List
 help / color / mirror / Atom feed

* Re: [RFC] net: allow FEC driver to not have attached PHY
From: Greg Ungerer @ 2010-10-08  5:57 UTC (permalink / raw)
  To: Florian Fainelli; +Cc: netdev
In-Reply-To: <201010071524.43789.florian@openwrt.org>


Hi Florian,

On 07/10/10 23:24, Florian Fainelli wrote:
> On Thursday 07 October 2010 05:50:16 Greg Ungerer wrote:
>> Hi All,
>>
>> I have a board with a ColdFire SoC on it with the built-in FEC
>> ethernet module. On this hardware the FEC eth output is directly
>> attached to a RTL8305 4-port 10/100 switch. There is no conventional
>> PHY, the FEC output is direct into the uplink port of the switch
>> chip.
>>
>> This setup doesn't work after the FEC code was switch to using
>> phylib. The driver used to have code to bypass phy detection/setup
>> for this particular board. The phylib probe finds nothing, and of
>> course sets a no-link condition.
>>
>> So, what is the cleanest way to support this?
>
> If phy detection fails and you cannot attach to a known PHY driver, you could
> register a fixed-PHY driver wich will report the link to be up. I had to do
> something like this for the cpmac driver[1] where various switches and
> external PHY configurations are supported.
>
> [1]:
> https://dev.openwrt.org/browser/trunk/target/linux/ar7/patches-2.6.35/972-
> cpmac_multi_probe.patch

Ah yes, that looks like exactly what I need.
Thanks for the pointer.

Regards
Greg



>> The attached patch adds a config option to do this sort of generically
>> for the FEC driver. But I am wondering if there isn't a better way?
>>
>> Regards
>> Greg
>>
>>
>> ---
>>
>> [RFC] net: allow FEC driver to not have attached PHY
>>
>> At least one board using the FEC driver does not have a conventional
>> PHY attached to it, it is directly connected to a somewhat simple
>> ethernet switch (the board is the SnapGear/LITE, and the attached
>> 4-port ethernet switch is a RealTek RTL8305). This switch does not
>> present the usual register interface of a PHY, it presents nothing.
>> So a PHY scan will find nothing.
>>
>> After the FEC driver was changed to use phylib for supporting phys
>> it no longer works on this particular board/switch setup.
>>
>> Add a config option to allow configuring the FEC driver to not expect
>> a PHY to be present.
>>
>> Signed-off-by: Greg Ungerer<gerg@uclinux.org>
>> ---
>>   drivers/net/Kconfig |    9 +++++++++
>>   drivers/net/fec.c   |    7 +++++++
>>   2 files changed, 16 insertions(+), 0 deletions(-)
>>
>> diff --git a/drivers/net/Kconfig b/drivers/net/Kconfig
>> index 93494e2..ee44728 100644
>> --- a/drivers/net/Kconfig
>> +++ b/drivers/net/Kconfig
>> @@ -1934,6 +1934,15 @@ config FEC2
>>   	  Say Y here if you want to use the second built-in 10/100 Fast
>>   	  ethernet controller on some Motorola ColdFire processors.
>>
>> +config FEC_NOPHY
>> +	bool "FEC has no attached PHY"
>> +	depends on FEC
>> +	help
>> +	  Some boards using the FEC driver may not have a PHY directly
>> +	  attached to it. Typically in this scenario the FEC output is
>> +	  directly connected to the input of an ethernet switch or hub.
>> +	  Say Y here if your hardware is like this.
>> +
>>   config FEC_MPC52xx
>>   	tristate "MPC52xx FEC driver"
>>   	depends on PPC_MPC52xx&&  PPC_BESTCOMM
>> diff --git a/drivers/net/fec.c b/drivers/net/fec.c
>> index 768b840..3637f89 100644
>> --- a/drivers/net/fec.c
>> +++ b/drivers/net/fec.c
>> @@ -910,6 +910,11 @@ fec_enet_open(struct net_device *dev)
>>   	if (ret)
>>   		return ret;
>>
>> +#ifdef CONFIG_FEC_NOPHY
>> +	/* No PHY connected, assume link is always up */
>> +	fep->link = 1;
>> +	fec_restart(dev, 0);
>> +#else
>>   	/* Probe and connect to PHY when open the interface */
>>   	ret = fec_enet_mii_probe(dev);
>>   	if (ret) {
>> @@ -917,6 +922,8 @@ fec_enet_open(struct net_device *dev)
>>   		return ret;
>>   	}
>>   	phy_start(fep->phy_dev);
>> +#endif
>> +
>>   	netif_start_queue(dev);
>>   	fep->opened = 1;
>>   	return 0;
>> --
>> To unsubscribe from this list: send the line "unsubscribe netdev" in
>> the body of a message to majordomo@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>


-- 
------------------------------------------------------------------------
Greg Ungerer  --  Principal Engineer        EMAIL:     gerg@snapgear.com
SnapGear Group, McAfee                      PHONE:       +61 7 3435 2888
8 Gardner Close                             FAX:         +61 7 3217 5323
Milton, QLD, 4064, Australia                WEB: http://www.SnapGear.com

^ permalink raw reply

* Re: [PATCH] remove leftover rcu_read_unlock calls from __mkroute_output
From: Eric Dumazet @ 2010-10-08  4:47 UTC (permalink / raw)
  To: Dimitris Michailidis; +Cc: netdev
In-Reply-To: <1286498918-30636-1-git-send-email-dm@chelsio.com>

Le jeudi 07 octobre 2010 à 17:48 -0700, Dimitris Michailidis a écrit :
> Commit "fib: RCU conversion of fib_lookup()" removed rcu_read_lock() from
> __mkroute_output but left a couple of calls to rcu_read_unlock() in there.
> This causes lockdep to complain that the rcu_read_unlock() call in
> __ip_route_output_key causes a lock inbalance and quickly crashes the
> kernel. The below fixes this for me.
> 
> Signed-off-by: Dimitris Michailidis <dm@chelsio.com>

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>

Thanks Dimitris !



^ permalink raw reply

* Re: [PATCH] ehea: Fix a checksum issue on the receive path
From: Eric Dumazet @ 2010-10-08  4:45 UTC (permalink / raw)
  To: leitao; +Cc: davem, netdev, Jay Vosburgh
In-Reply-To: <1286493453-21784-1-git-send-email-leitao@linux.vnet.ibm.com>

Le jeudi 07 octobre 2010 à 19:17 -0400, leitao@linux.vnet.ibm.com a
écrit :
> Currently we set all skbs with CHECKSUM_UNNECESSARY, even
> those whose protocol we don't know. This patch just
> add the CHECKSUM_COMPLETE tag for non TCP/UDP packets.
> 
> Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
> Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
> Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
> ---
>  drivers/net/ehea/ehea_main.c |    9 ++++++++-
>  drivers/net/ehea/ehea_qmr.h  |    1 +
>  2 files changed, 9 insertions(+), 1 deletions(-)
> 
> diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
> index 0471cae..45fd045 100644
> --- a/drivers/net/ehea/ehea_main.c
> +++ b/drivers/net/ehea/ehea_main.c
> @@ -533,8 +533,15 @@ static inline void ehea_fill_skb(struct net_device *dev,
>  	int length = cqe->num_bytes_transfered - 4;	/*remove CRC */
>  
>  	skb_put(skb, length);
> -	skb->ip_summed = CHECKSUM_UNNECESSARY;
>  	skb->protocol = eth_type_trans(skb, dev);
> +
> +	/* The packet was not an IPV4 packet so a complemented checksum was
> +	   calculated. The value is found in the Internet Checksum field. */
> +	if (cqe->status & EHEA_CQE_BLIND_CKSUM) {
> +		skb->ip_summed = CHECKSUM_COMPLETE;
> +		skb->csum = csum_unfold(~cqe->inet_checksum_value);
> +	} else
> +		skb->ip_summed = CHECKSUM_UNNECESSARY;
>  }
>  

Hi Breno

Just to be clear : packets with wrong checksums are not given to upper
stack, so a tcpdump can not display them ? I am not sure many drivers do
that.

(EHEA_CQE_STAT_ERR_TCP, EHEA_CQE_STAT_ERR_IP)

Thanks !



^ permalink raw reply

* Re: IPv4: sysctl table check failed [was: mmotm 2010-10-07-14-08 uploaded]
From: Eric W. Biederman @ 2010-10-08  0:54 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Eric Dumazet, Jiri Slaby, linux-kernel, mm-commits, ML netdev,
	David S. Miller
In-Reply-To: <20101007152806.119d1522.akpm@linux-foundation.org>

Andrew Morton <akpm@linux-foundation.org> writes:

> On Fri, 08 Oct 2010 00:22:15 +0200
> Eric Dumazet <eric.dumazet@gmail.com> wrote:
>
>> Le vendredi 08 octobre 2010 __ 00:06 +0200, Jiri Slaby a __crit :
>> > On 10/07/2010 11:08 PM, akpm@linux-foundation.org wrote:
>> > > The mm-of-the-moment snapshot 2010-10-07-14-08 has been uploaded to
>> > 
>> > Hi, I got bunch of "sysctl table check failed" below. All seem to be
>> > related to ipv4:
>> 
>> I would say, sysctl check is buggy :(
>> 
>> min/max are optional
>> 
>> [PATCH] sysctl: min/max bounds are optional
>> 
>> sysctl check complains when proc_doulongvec_minmax or
>> proc_doulongvec_ms_jiffies_minmax are used by a vector of longs (with
>> more than one element), with no min or max value specified.
>> 
>> This is unexpected, given we had a bug on this min/max handling :)
>> 
>> Reported-by: Jiri Slaby <jirislaby@gmail.com>
>> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
>> ---
>>  kernel/sysctl_check.c |    9 ---------
>>  1 file changed, 9 deletions(-)
>> 
>> diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
>> index 04cdcf7..10b90d8 100644
>> --- a/kernel/sysctl_check.c
>> +++ b/kernel/sysctl_check.c
>> @@ -143,15 +143,6 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
>>  				if (!table->maxlen)
>>  					set_fail(&fail, table, "No maxlen");
>>  			}
>> -			if ((table->proc_handler == proc_doulongvec_minmax) ||
>> -			    (table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) {
>> -				if (table->maxlen > sizeof (unsigned long)) {
>> -					if (!table->extra1)
>> -						set_fail(&fail, table, "No min");
>> -					if (!table->extra2)
>> -						set_fail(&fail, table, "No max");
>> -				}
>> -			}
>>  #ifdef CONFIG_PROC_SYSCTL
>>  			if (table->procname && !table->proc_handler)
>>  				set_fail(&fail, table, "No proc_handler");
>
> That will probably fix it ;)
>
> net-avoid-limits-overflow.patch is dependent on this patch.  Unless
> Eric B squeaks I'll plan on sending this patch in for 2.6.37.

Oh.  I see. I actually had a sanity check for the case that was failing.
I probably spotted the buggy code and wanted to see if there was
anything that cared.

So sysctl_check was perfectly correct until the bug was removed from
proc_doulongvec_minmax.  Which also means we have been auditing the
kernel for quite a while to make certain that it is safe not to
increment min and max.

Eric

^ permalink raw reply

* [PATCH] remove leftover rcu_read_unlock calls from __mkroute_output
From: Dimitris Michailidis @ 2010-10-08  0:48 UTC (permalink / raw)
  To: eric.dumazet, netdev

Commit "fib: RCU conversion of fib_lookup()" removed rcu_read_lock() from
__mkroute_output but left a couple of calls to rcu_read_unlock() in there.
This causes lockdep to complain that the rcu_read_unlock() call in
__ip_route_output_key causes a lock inbalance and quickly crashes the
kernel. The below fixes this for me.

Signed-off-by: Dimitris Michailidis <dm@chelsio.com>
---
 net/ipv4/route.c |    6 ++----
 1 files changed, 2 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index 7864d0c..3888f6b 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2396,12 +2396,10 @@ static int __mkroute_output(struct rtable **result,
 
 
 	rth = dst_alloc(&ipv4_dst_ops);
-	if (!rth) {
-		rcu_read_unlock();
+	if (!rth)
 		return -ENOBUFS;
-	}
+
 	in_dev_hold(in_dev);
-	rcu_read_unlock();
 	rth->idev = in_dev;
 
 	atomic_set(&rth->dst.__refcnt, 1);
-- 
1.5.4


^ permalink raw reply related

* Re: [PATCHv4 net-next-2.6 1/5] XFRM,IPv6: Remove xfrm_spi_hash() dependency on destination address
From: Herbert Xu @ 2010-10-08  0:42 UTC (permalink / raw)
  To: Arnaud Ebalard; +Cc: David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, netdev
In-Reply-To: <87fwwhoj73.fsf@small.ssi.corp>

On Thu, Oct 07, 2010 at 10:13:04PM +0200, Arnaud Ebalard wrote:
> 
> I think I will try the following alternative approach based on your
> comments and proposals:
> 
>  - drop my patch to change spi hash computation
>  - handle destination address remapping during input upon failure of
>    xfrm_state_lookup()
>  - handle source address remapping as it is currently done in the patch,
>    i.e. by comparing received one against x->props.saddr once the
>    state found and do 

I'm fine with this from IPsec's point-of-view.

On the other hand, if it is at all possible to setup these remapping
entries in the SADB beforehand, that would definitely be my
preferred solution.

Cheers,
-- 
Email: Herbert Xu <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply

* [PATCH 2/2 net-next] bnx2: Enable AER on PCIE devices only
From: Michael Chan @ 2010-10-07 23:42 UTC (permalink / raw)
  To: davem; +Cc: netdev
In-Reply-To: <1286494973-5115-1-git-send-email-mchan@broadcom.com>

To prevent unnecessary error message.  pci_save_state() is also moved to
the end of ->probe() so that all PCI config, including AER state, will be
saved.

Update version to 2.0.18.

Signed-off-by: Michael Chan <mchan@broadcom.com>
Reviewed-by: Benjamin Li <mchan@broadcom.com>
---
 drivers/net/bnx2.c |   36 ++++++++++++++++++++++--------------
 1 files changed, 22 insertions(+), 14 deletions(-)

diff --git a/drivers/net/bnx2.c b/drivers/net/bnx2.c
index 56f3dfe..ae894bc 100644
--- a/drivers/net/bnx2.c
+++ b/drivers/net/bnx2.c
@@ -59,8 +59,8 @@
 #include "bnx2_fw.h"
 
 #define DRV_MODULE_NAME		"bnx2"
-#define DRV_MODULE_VERSION	"2.0.17"
-#define DRV_MODULE_RELDATE	"July 18, 2010"
+#define DRV_MODULE_VERSION	"2.0.18"
+#define DRV_MODULE_RELDATE	"Oct 7, 2010"
 #define FW_MIPS_FILE_06		"bnx2/bnx2-mips-06-6.0.15.fw"
 #define FW_RV2P_FILE_06		"bnx2/bnx2-rv2p-06-6.0.15.fw"
 #define FW_MIPS_FILE_09		"bnx2/bnx2-mips-09-6.0.17.fw"
@@ -7915,16 +7915,7 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device *dev)
 		goto err_out_disable;
 	}
 
-	/* AER (Advanced Error Reporting) hooks */
-	err = pci_enable_pcie_error_reporting(pdev);
-	if (err) {
-		dev_err(&pdev->dev, "pci_enable_pcie_error_reporting failed "
-				    "0x%x\n", err);
-		/* non-fatal, continue */
-	}
-
 	pci_set_master(pdev);
-	pci_save_state(pdev);
 
 	bp->pm_cap = pci_find_capability(pdev, PCI_CAP_ID_PM);
 	if (bp->pm_cap == 0) {
@@ -7979,6 +7970,15 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device *dev)
 		bp->flags |= BNX2_FLAG_PCIE;
 		if (CHIP_REV(bp) == CHIP_REV_Ax)
 			bp->flags |= BNX2_FLAG_JUMBO_BROKEN;
+
+		/* AER (Advanced Error Reporting) hooks */
+		err = pci_enable_pcie_error_reporting(pdev);
+		if (err) {
+			dev_err(&pdev->dev, "pci_enable_pcie_error_reporting "
+					    "failed 0x%x\n", err);
+			/* non-fatal, continue */
+		}
+
 	} else {
 		bp->pcix_cap = pci_find_capability(pdev, PCI_CAP_ID_PCIX);
 		if (bp->pcix_cap == 0) {
@@ -8235,16 +8235,20 @@ bnx2_init_board(struct pci_dev *pdev, struct net_device *dev)
 	bp->timer.data = (unsigned long) bp;
 	bp->timer.function = bnx2_timer;
 
+	pci_save_state(pdev);
+
 	return 0;
 
 err_out_unmap:
+	if (bp->flags & BNX2_FLAG_PCIE)
+		pci_disable_pcie_error_reporting(pdev);
+
 	if (bp->regview) {
 		iounmap(bp->regview);
 		bp->regview = NULL;
 	}
 
 err_out_release:
-	pci_disable_pcie_error_reporting(pdev);
 	pci_release_regions(pdev);
 
 err_out_disable:
@@ -8434,9 +8438,10 @@ bnx2_remove_one(struct pci_dev *pdev)
 
 	kfree(bp->temp_stats_blk);
 
-	free_netdev(dev);
+	if (bp->flags & BNX2_FLAG_PCIE)
+		pci_disable_pcie_error_reporting(pdev);
 
-	pci_disable_pcie_error_reporting(pdev);
+	free_netdev(dev);
 
 	pci_release_regions(pdev);
 	pci_disable_device(pdev);
@@ -8550,6 +8555,9 @@ static pci_ers_result_t bnx2_io_slot_reset(struct pci_dev *pdev)
 	}
 	rtnl_unlock();
 
+	if (!(bp->flags & BNX2_FLAG_PCIE))
+		return result;
+
 	err = pci_cleanup_aer_uncorrect_error_status(pdev);
 	if (err) {
 		dev_err(&pdev->dev,
-- 
1.6.4.GIT



^ permalink raw reply related

* Re: [PATCH] ehea: simplify conditional
From: Breno Leitao @ 2010-10-07 23:49 UTC (permalink / raw)
  To: Nicolas Kaiser; +Cc: netdev, linux-kernel
In-Reply-To: <20101008011450.0126ffc0@absol.kitzblitz>

On 10/07/2010 08:14 PM, Nicolas Kaiser wrote:
> Simplify: ((a&&  b) || (!a&&  !b)) =>  (a == b)
>
> Signed-off-by: Nicolas Kaiser<nikai@nikai.net>
Acked-by: Breno Leitao <leitao@linux.vnet.ibm.com>

^ permalink raw reply

* [PATCH] ehea: simplify conditional
From: Nicolas Kaiser @ 2010-10-07 23:14 UTC (permalink / raw)
  To: Breno Leitao; +Cc: netdev, linux-kernel

Simplify: ((a && b) || (!a && !b)) => (a == b)

Signed-off-by: Nicolas Kaiser <nikai@nikai.net>
---
 drivers/net/ehea/ehea_main.c |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 190fb69..ef305f3 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -1914,7 +1914,7 @@ static void ehea_promiscuous(struct net_device *dev, int enable)
 	struct hcp_ehea_port_cb7 *cb7;
 	u64 hret;
 
-	if ((enable && port->promisc) || (!enable && !port->promisc))
+	if (enable == port->promisc)
 		return;
 
 	cb7 = (void *)get_zeroed_page(GFP_ATOMIC);
-- 
1.7.2.2

^ permalink raw reply related

* [PATCH] ehea: Fix a checksum issue on the receive path
From: leitao @ 2010-10-07 23:17 UTC (permalink / raw)
  To: davem; +Cc: netdev, Breno Leitao, Jay Vosburgh

Currently we set all skbs with CHECKSUM_UNNECESSARY, even
those whose protocol we don't know. This patch just
add the CHECKSUM_COMPLETE tag for non TCP/UDP packets.

Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: Breno Leitao <leitao@linux.vnet.ibm.com>
Signed-off-by: Jay Vosburgh <fubar@us.ibm.com>
---
 drivers/net/ehea/ehea_main.c |    9 ++++++++-
 drivers/net/ehea/ehea_qmr.h  |    1 +
 2 files changed, 9 insertions(+), 1 deletions(-)

diff --git a/drivers/net/ehea/ehea_main.c b/drivers/net/ehea/ehea_main.c
index 0471cae..45fd045 100644
--- a/drivers/net/ehea/ehea_main.c
+++ b/drivers/net/ehea/ehea_main.c
@@ -533,8 +533,15 @@ static inline void ehea_fill_skb(struct net_device *dev,
 	int length = cqe->num_bytes_transfered - 4;	/*remove CRC */
 
 	skb_put(skb, length);
-	skb->ip_summed = CHECKSUM_UNNECESSARY;
 	skb->protocol = eth_type_trans(skb, dev);
+
+	/* The packet was not an IPV4 packet so a complemented checksum was
+	   calculated. The value is found in the Internet Checksum field. */
+	if (cqe->status & EHEA_CQE_BLIND_CKSUM) {
+		skb->ip_summed = CHECKSUM_COMPLETE;
+		skb->csum = csum_unfold(~cqe->inet_checksum_value);
+	} else
+		skb->ip_summed = CHECKSUM_UNNECESSARY;
 }
 
 static inline struct sk_buff *get_skb_by_index(struct sk_buff **skb_array,
diff --git a/drivers/net/ehea/ehea_qmr.h b/drivers/net/ehea/ehea_qmr.h
index f608a6c..3810473 100644
--- a/drivers/net/ehea/ehea_qmr.h
+++ b/drivers/net/ehea/ehea_qmr.h
@@ -150,6 +150,7 @@ struct ehea_rwqe {
 #define EHEA_CQE_TYPE_RQ           0x60
 #define EHEA_CQE_STAT_ERR_MASK     0x700F
 #define EHEA_CQE_STAT_FAT_ERR_MASK 0xF
+#define EHEA_CQE_BLIND_CKSUM       0x8000
 #define EHEA_CQE_STAT_ERR_TCP      0x4000
 #define EHEA_CQE_STAT_ERR_IP       0x2000
 #define EHEA_CQE_STAT_ERR_CRC      0x1000
-- 
1.6.5


^ permalink raw reply related

* Re: IPv4: sysctl table check failed [was: mmotm 2010-10-07-14-08 uploaded]
From: Andrew Morton @ 2010-10-07 22:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jiri Slaby, linux-kernel, mm-commits, ML netdev, David S. Miller,
	Eric W. Biederman
In-Reply-To: <1286490135.6536.75.camel@edumazet-laptop>

On Fri, 08 Oct 2010 00:22:15 +0200
Eric Dumazet <eric.dumazet@gmail.com> wrote:

> Le vendredi 08 octobre 2010 __ 00:06 +0200, Jiri Slaby a __crit :
> > On 10/07/2010 11:08 PM, akpm@linux-foundation.org wrote:
> > > The mm-of-the-moment snapshot 2010-10-07-14-08 has been uploaded to
> > 
> > Hi, I got bunch of "sysctl table check failed" below. All seem to be
> > related to ipv4:
> 
> I would say, sysctl check is buggy :(
> 
> min/max are optional
> 
> [PATCH] sysctl: min/max bounds are optional
> 
> sysctl check complains when proc_doulongvec_minmax or
> proc_doulongvec_ms_jiffies_minmax are used by a vector of longs (with
> more than one element), with no min or max value specified.
> 
> This is unexpected, given we had a bug on this min/max handling :)
> 
> Reported-by: Jiri Slaby <jirislaby@gmail.com>
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> ---
>  kernel/sysctl_check.c |    9 ---------
>  1 file changed, 9 deletions(-)
> 
> diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
> index 04cdcf7..10b90d8 100644
> --- a/kernel/sysctl_check.c
> +++ b/kernel/sysctl_check.c
> @@ -143,15 +143,6 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
>  				if (!table->maxlen)
>  					set_fail(&fail, table, "No maxlen");
>  			}
> -			if ((table->proc_handler == proc_doulongvec_minmax) ||
> -			    (table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) {
> -				if (table->maxlen > sizeof (unsigned long)) {
> -					if (!table->extra1)
> -						set_fail(&fail, table, "No min");
> -					if (!table->extra2)
> -						set_fail(&fail, table, "No max");
> -				}
> -			}
>  #ifdef CONFIG_PROC_SYSCTL
>  			if (table->procname && !table->proc_handler)
>  				set_fail(&fail, table, "No proc_handler");

That will probably fix it ;)

net-avoid-limits-overflow.patch is dependent on this patch.  Unless
Eric B squeaks I'll plan on sending this patch in for 2.6.37.


^ permalink raw reply

* Re: IPv4: sysctl table check failed [was: mmotm 2010-10-07-14-08 uploaded]
From: Andrew Morton @ 2010-10-07 22:22 UTC (permalink / raw)
  To: Jiri Slaby
  Cc: linux-kernel, mm-commits, ML netdev, David S. Miller,
	Eric Dumazet, Eric W. Biederman
In-Reply-To: <4CAE4479.6010606@gmail.com>

On Fri, 08 Oct 2010 00:06:49 +0200
Jiri Slaby <jirislaby@gmail.com> wrote:

> On 10/07/2010 11:08 PM, akpm@linux-foundation.org wrote:
> > The mm-of-the-moment snapshot 2010-10-07-14-08 has been uploaded to
> 
> Hi, I got bunch of "sysctl table check failed" below. All seem to be
> related to ipv4:
> 
> sysctl table check failed: /net/ipv4/tcp_mem  No min
> Pid: 1, comm: swapper Not tainted 2.6.36-rc7-mm1_64+ #1285
> Call Trace:
>  [<ffffffff8108d444>] set_fail+0xa4/0xf0
>  [<ffffffff8108d736>] sysctl_check_table+0x2a6/0x310
>  [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
>  [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
>  [<ffffffff81072204>] __register_sysctl_paths+0xf4/0x320
>  [<ffffffff815c1232>] ? printk+0x3c/0x42
>  [<ffffffff8115d3bc>] ? sysfs_add_file+0xc/0x10
>  [<ffffffff818b18bd>] ? sysctl_ipv4_init+0x0/0x87
>  [<ffffffff81072456>] register_sysctl_paths+0x26/0x30
>  [<ffffffff818b18fd>] sysctl_ipv4_init+0x40/0x87
>  [<ffffffff810002df>] do_one_initcall+0x3f/0x170
>  [<ffffffff81884d44>] kernel_init+0x158/0x1e2
>  [<ffffffff8102fb54>] kernel_thread_helper+0x4/0x10
>  [<ffffffff81884bec>] ? kernel_init+0x0/0x1e2
>  [<ffffffff8102fb50>] ? kernel_thread_helper+0x0/0x10

OK, thanks.

Eric D's net-avoid-limits-overflow.patch switched tcp_mem and udp_mem
from proc_dointvec() to proc_doulongvec_minmax().  And
sysctl_check_table() checks `min' and `max' for
proc_doulongvec_minmax() but not for proc_dointvec().

I'm not sure which Eric to blame ;) .min and .max are optional, so
perhaps the check is wrong?


^ permalink raw reply

* Re: IPv4: sysctl table check failed [was: mmotm 2010-10-07-14-08 uploaded]
From: Eric Dumazet @ 2010-10-07 22:22 UTC (permalink / raw)
  To: Jiri Slaby; +Cc: linux-kernel, akpm, mm-commits, ML netdev, David S. Miller
In-Reply-To: <4CAE4479.6010606@gmail.com>

Le vendredi 08 octobre 2010 à 00:06 +0200, Jiri Slaby a écrit :
> On 10/07/2010 11:08 PM, akpm@linux-foundation.org wrote:
> > The mm-of-the-moment snapshot 2010-10-07-14-08 has been uploaded to
> 
> Hi, I got bunch of "sysctl table check failed" below. All seem to be
> related to ipv4:

I would say, sysctl check is buggy :(

min/max are optional

[PATCH] sysctl: min/max bounds are optional

sysctl check complains when proc_doulongvec_minmax or
proc_doulongvec_ms_jiffies_minmax are used by a vector of longs (with
more than one element), with no min or max value specified.

This is unexpected, given we had a bug on this min/max handling :)

Reported-by: Jiri Slaby <jirislaby@gmail.com>
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 kernel/sysctl_check.c |    9 ---------
 1 file changed, 9 deletions(-)

diff --git a/kernel/sysctl_check.c b/kernel/sysctl_check.c
index 04cdcf7..10b90d8 100644
--- a/kernel/sysctl_check.c
+++ b/kernel/sysctl_check.c
@@ -143,15 +143,6 @@ int sysctl_check_table(struct nsproxy *namespaces, struct ctl_table *table)
 				if (!table->maxlen)
 					set_fail(&fail, table, "No maxlen");
 			}
-			if ((table->proc_handler == proc_doulongvec_minmax) ||
-			    (table->proc_handler == proc_doulongvec_ms_jiffies_minmax)) {
-				if (table->maxlen > sizeof (unsigned long)) {
-					if (!table->extra1)
-						set_fail(&fail, table, "No min");
-					if (!table->extra2)
-						set_fail(&fail, table, "No max");
-				}
-			}
 #ifdef CONFIG_PROC_SYSCTL
 			if (table->procname && !table->proc_handler)
 				set_fail(&fail, table, "No proc_handler");



^ permalink raw reply related

* IPv4: sysctl table check failed [was: mmotm 2010-10-07-14-08 uploaded]
From: Jiri Slaby @ 2010-10-07 22:06 UTC (permalink / raw)
  To: linux-kernel; +Cc: akpm, mm-commits, ML netdev, David S. Miller
In-Reply-To: <201010072140.o97Le69i025659@imap1.linux-foundation.org>

On 10/07/2010 11:08 PM, akpm@linux-foundation.org wrote:
> The mm-of-the-moment snapshot 2010-10-07-14-08 has been uploaded to

Hi, I got bunch of "sysctl table check failed" below. All seem to be
related to ipv4:

sysctl table check failed: /net/ipv4/tcp_mem  No min
Pid: 1, comm: swapper Not tainted 2.6.36-rc7-mm1_64+ #1285
Call Trace:
 [<ffffffff8108d444>] set_fail+0xa4/0xf0
 [<ffffffff8108d736>] sysctl_check_table+0x2a6/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff81072204>] __register_sysctl_paths+0xf4/0x320
 [<ffffffff815c1232>] ? printk+0x3c/0x42
 [<ffffffff8115d3bc>] ? sysfs_add_file+0xc/0x10
 [<ffffffff818b18bd>] ? sysctl_ipv4_init+0x0/0x87
 [<ffffffff81072456>] register_sysctl_paths+0x26/0x30
 [<ffffffff818b18fd>] sysctl_ipv4_init+0x40/0x87
 [<ffffffff810002df>] do_one_initcall+0x3f/0x170
 [<ffffffff81884d44>] kernel_init+0x158/0x1e2
 [<ffffffff8102fb54>] kernel_thread_helper+0x4/0x10
 [<ffffffff81884bec>] ? kernel_init+0x0/0x1e2
 [<ffffffff8102fb50>] ? kernel_thread_helper+0x0/0x10
sysctl table check failed: /net/ipv4/tcp_mem  No max
Pid: 1, comm: swapper Not tainted 2.6.36-rc7-mm1_64+ #1285
Call Trace:
 [<ffffffff8108d444>] set_fail+0xa4/0xf0
 [<ffffffff8108d4da>] sysctl_check_table+0x4a/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff81072204>] __register_sysctl_paths+0xf4/0x320
 [<ffffffff815c1232>] ? printk+0x3c/0x42
 [<ffffffff8115d3bc>] ? sysfs_add_file+0xc/0x10
 [<ffffffff818b18bd>] ? sysctl_ipv4_init+0x0/0x87
 [<ffffffff81072456>] register_sysctl_paths+0x26/0x30
 [<ffffffff818b18fd>] sysctl_ipv4_init+0x40/0x87
 [<ffffffff810002df>] do_one_initcall+0x3f/0x170
 [<ffffffff81884d44>] kernel_init+0x158/0x1e2
 [<ffffffff8102fb54>] kernel_thread_helper+0x4/0x10
 [<ffffffff81884bec>] ? kernel_init+0x0/0x1e2
 [<ffffffff8102fb50>] ? kernel_thread_helper+0x0/0x10
sysctl table check failed: /net/ipv4/udp_mem  No min
Pid: 1, comm: swapper Not tainted 2.6.36-rc7-mm1_64+ #1285
Call Trace:
 [<ffffffff8108d444>] set_fail+0xa4/0xf0
 [<ffffffff8108d736>] sysctl_check_table+0x2a6/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff81072204>] __register_sysctl_paths+0xf4/0x320
 [<ffffffff815c1232>] ? printk+0x3c/0x42
 [<ffffffff8115d3bc>] ? sysfs_add_file+0xc/0x10
 [<ffffffff818b18bd>] ? sysctl_ipv4_init+0x0/0x87
 [<ffffffff81072456>] register_sysctl_paths+0x26/0x30
 [<ffffffff818b18fd>] sysctl_ipv4_init+0x40/0x87
 [<ffffffff810002df>] do_one_initcall+0x3f/0x170
 [<ffffffff81884d44>] kernel_init+0x158/0x1e2
 [<ffffffff8102fb54>] kernel_thread_helper+0x4/0x10
 [<ffffffff81884bec>] ? kernel_init+0x0/0x1e2
 [<ffffffff8102fb50>] ? kernel_thread_helper+0x0/0x10
sysctl table check failed: /net/ipv4/udp_mem  No max
Pid: 1, comm: swapper Not tainted 2.6.36-rc7-mm1_64+ #1285
Call Trace:
 [<ffffffff8108d444>] set_fail+0xa4/0xf0
 [<ffffffff8108d4da>] sysctl_check_table+0x4a/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff8108d4eb>] sysctl_check_table+0x5b/0x310
 [<ffffffff81072204>] __register_sysctl_paths+0xf4/0x320
 [<ffffffff815c1232>] ? printk+0x3c/0x42
 [<ffffffff8115d3bc>] ? sysfs_add_file+0xc/0x10
 [<ffffffff818b18bd>] ? sysctl_ipv4_init+0x0/0x87
 [<ffffffff81072456>] register_sysctl_paths+0x26/0x30
 [<ffffffff818b18fd>] sysctl_ipv4_init+0x40/0x87
 [<ffffffff810002df>] do_one_initcall+0x3f/0x170
 [<ffffffff81884d44>] kernel_init+0x158/0x1e2
 [<ffffffff8102fb54>] kernel_thread_helper+0x4/0x10
 [<ffffffff81884bec>] ? kernel_init+0x0/0x1e2
 [<ffffffff8102fb50>] ? kernel_thread_helper+0x0/0x10

regards,
-- 
js

^ permalink raw reply

* Re: [PATCH] net: clear heap allocations for privileged ethtool actions
From: Kees Cook @ 2010-10-07 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: linux-kernel, David S. Miller, Ben Hutchings, Jeff Garzik,
	Jeff Kirsher, Peter P Waskiewicz Jr, netdev
In-Reply-To: <1286487085.3745.99.camel@edumazet-laptop>

Hi Eric,

On Thu, Oct 07, 2010 at 11:31:25PM +0200, Eric Dumazet wrote:
> Le jeudi 07 octobre 2010 à 14:10 -0700, Kees Cook a écrit :
> > Several other ethtool functions leave heap uncleared (potentially) by
> > drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
> > are well controlled. In some situations (e.g. unchecked error conditions),
> > the heap will remain unchanged in areas before copying back to userspace.
> > Note that these are less of an issue since these all require CAP_NET_ADMIN.
> 
> > @@ -775,7 +775,7 @@ static int ethtool_get_regs(struct net_device *dev, char __user *useraddr)
> >  	if (regs.len > reglen)
> >  		regs.len = reglen;
> >  
> > -	regbuf = kmalloc(reglen, GFP_USER);
> > +	regbuf = kzalloc(reglen, GFP_USER);
> >  	if (!regbuf)
> >  		return -ENOMEM;
> >  
> > -- 
> > 1.7.1
> > 
> 
> Are you sure this is not hiding a more problematic problem ?
> 
> Code does :
> 
> reglen = ops->get_regs_len(dev);
> if (regs.len > reglen)
> 	regs.len = reglen;
> regbuf = kmalloc(reglen, GFP_USER);
> 
> So we can not copy back kernel memory.
> 
> However, what happens if user provides regs.len = 1 byte, and driver
> get_regs() doesnt properly checks regs.len and write past end of regbuf
> -> We probably write on other parts of kernel memory

This code is basically a max() call from what I see.

regbuf = kmalloc(max(regs.len, ops->get_regs_len(dev)), GFP_USER);

If the user passes regs.len = 1, it will be ignored in favor of reglen,
so we'll not write past the end of regbuf, unless I'm misunderstanding.

-Kees

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply

* Re: [PATCH] net: clear heap allocations for privileged ethtool actions
From: Ben Hutchings @ 2010-10-07 21:40 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Kees Cook, linux-kernel, David S. Miller, Jeff Garzik,
	Jeff Kirsher, Peter P Waskiewicz Jr, netdev
In-Reply-To: <1286487085.3745.99.camel@edumazet-laptop>

On Thu, 2010-10-07 at 23:31 +0200, Eric Dumazet wrote:
> Le jeudi 07 octobre 2010 à 14:10 -0700, Kees Cook a écrit :
> > Several other ethtool functions leave heap uncleared (potentially) by
> > drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
> > are well controlled. In some situations (e.g. unchecked error conditions),
> > the heap will remain unchanged in areas before copying back to userspace.
> > Note that these are less of an issue since these all require CAP_NET_ADMIN.
> 
> > @@ -775,7 +775,7 @@ static int ethtool_get_regs(struct net_device *dev, char __user *useraddr)
> >  	if (regs.len > reglen)
> >  		regs.len = reglen;
> >  
> > -	regbuf = kmalloc(reglen, GFP_USER);
> > +	regbuf = kzalloc(reglen, GFP_USER);

Actually, I recently changed this to vmalloc() so your patch won't
apply.

> >  	if (!regbuf)
> >  		return -ENOMEM;
> >  
> > -- 
> > 1.7.1
> > 
> 
> Are you sure this is not hiding a more problematic problem ?
> 
> Code does :
> 
> reglen = ops->get_regs_len(dev);
> if (regs.len > reglen)
> 	regs.len = reglen;
> regbuf = kmalloc(reglen, GFP_USER);
> 
> So we can not copy back kernel memory.
> 
> However, what happens if user provides regs.len = 1 byte, and driver
> get_regs() doesnt properly checks regs.len and write past end of regbuf
> -> We probably write on other parts of kernel memory
[...]

Why should the driver's get_regs() check regs.len?  The buffer is
allocated based on reglen which is provided by the driver, not the user.

reglen (length of the kernel buffer) is not reduced; regs.len (length of
the user buffer) is.  That lets the user know how much of the user
buffer was actually used.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCH] net: clear heap allocations for privileged ethtool actions
From: Ben Hutchings @ 2010-10-07 21:34 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, David S. Miller, Jeff Garzik, Jeff Kirsher,
	Peter P Waskiewicz Jr, netdev
In-Reply-To: <20101007211004.GA20267@outflux.net>

On Thu, 2010-10-07 at 14:10 -0700, Kees Cook wrote:
> Several other ethtool functions leave heap uncleared (potentially) by
> drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
> are well controlled. In some situations (e.g. unchecked error conditions),
> the heap will remain unchanged in areas before copying back to userspace.
> Note that these are less of an issue since these all require CAP_NET_ADMIN.
> 
> Cc: stable@kernel.org
> Signed-off-by: Kees Cook <kees.cook@canonical.com>
> ---
>  net/core/ethtool.c |    6 +++---
>  1 files changed, 3 insertions(+), 3 deletions(-)
> 
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 7a85367..fb9cf30 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -397,7 +397,7 @@ static noinline_for_stack int ethtool_get_rxfh_indir(struct net_device *dev,
>  	    (KMALLOC_MAX_SIZE - sizeof(*indir)) / sizeof(*indir->ring_index))
>  		return -ENOMEM;
>  	full_size = sizeof(*indir) + sizeof(*indir->ring_index) * table_size;
> -	indir = kmalloc(full_size, GFP_USER);
> +	indir = kzalloc(full_size, GFP_USER);
>  	if (!indir)
>  		return -ENOMEM;
>  
[...]

Acked-by: Ben Hutchings <bhutchings@solarflare.com>

You could alternately recalculate full_size before copying back to the
user buffer:

	full_size = sizeof(*indir) + sizeof(*indir->ring_index) * indir->size;

but kzalloc() is more obviously safe.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] net: clear heap allocations for privileged ethtool actions
From: Eric Dumazet @ 2010-10-07 21:31 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, David S. Miller, Ben Hutchings, Jeff Garzik,
	Jeff Kirsher, Peter P Waskiewicz Jr, netdev
In-Reply-To: <20101007211004.GA20267@outflux.net>

Le jeudi 07 octobre 2010 à 14:10 -0700, Kees Cook a écrit :
> Several other ethtool functions leave heap uncleared (potentially) by
> drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
> are well controlled. In some situations (e.g. unchecked error conditions),
> the heap will remain unchanged in areas before copying back to userspace.
> Note that these are less of an issue since these all require CAP_NET_ADMIN.

> @@ -775,7 +775,7 @@ static int ethtool_get_regs(struct net_device *dev, char __user *useraddr)
>  	if (regs.len > reglen)
>  		regs.len = reglen;
>  
> -	regbuf = kmalloc(reglen, GFP_USER);
> +	regbuf = kzalloc(reglen, GFP_USER);
>  	if (!regbuf)
>  		return -ENOMEM;
>  
> -- 
> 1.7.1
> 

Are you sure this is not hiding a more problematic problem ?

Code does :

reglen = ops->get_regs_len(dev);
if (regs.len > reglen)
	regs.len = reglen;
regbuf = kmalloc(reglen, GFP_USER);

So we can not copy back kernel memory.

However, what happens if user provides regs.len = 1 byte, and driver
get_regs() doesnt properly checks regs.len and write past end of regbuf
-> We probably write on other parts of kernel memory

An audit is needed, but first driver I checked is buggy
(drivers/net/qlcnic/qlcnic_ethtool.c)

->
 	memset(p, 0, qlcnic_get_regs_len(dev));




^ permalink raw reply

* [PATCH] net: clear heap allocations for privileged ethtool actions
From: Kees Cook @ 2010-10-07 21:10 UTC (permalink / raw)
  To: linux-kernel
  Cc: David S. Miller, Ben Hutchings, Jeff Garzik, Jeff Kirsher,
	Peter P Waskiewicz Jr, netdev

Several other ethtool functions leave heap uncleared (potentially) by
drivers. Some interfaces appear safe (eeprom, etc), in that the sizes
are well controlled. In some situations (e.g. unchecked error conditions),
the heap will remain unchanged in areas before copying back to userspace.
Note that these are less of an issue since these all require CAP_NET_ADMIN.

Cc: stable@kernel.org
Signed-off-by: Kees Cook <kees.cook@canonical.com>
---
 net/core/ethtool.c |    6 +++---
 1 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/net/core/ethtool.c b/net/core/ethtool.c
index 7a85367..fb9cf30 100644
--- a/net/core/ethtool.c
+++ b/net/core/ethtool.c
@@ -397,7 +397,7 @@ static noinline_for_stack int ethtool_get_rxfh_indir(struct net_device *dev,
 	    (KMALLOC_MAX_SIZE - sizeof(*indir)) / sizeof(*indir->ring_index))
 		return -ENOMEM;
 	full_size = sizeof(*indir) + sizeof(*indir->ring_index) * table_size;
-	indir = kmalloc(full_size, GFP_USER);
+	indir = kzalloc(full_size, GFP_USER);
 	if (!indir)
 		return -ENOMEM;
 
@@ -538,7 +538,7 @@ static int ethtool_get_rx_ntuple(struct net_device *dev, void __user *useraddr)
 
 	gstrings.len = ret;
 
-	data = kmalloc(gstrings.len * ETH_GSTRING_LEN, GFP_USER);
+	data = kzalloc(gstrings.len * ETH_GSTRING_LEN, GFP_USER);
 	if (!data)
 		return -ENOMEM;
 
@@ -775,7 +775,7 @@ static int ethtool_get_regs(struct net_device *dev, char __user *useraddr)
 	if (regs.len > reglen)
 		regs.len = reglen;
 
-	regbuf = kmalloc(reglen, GFP_USER);
+	regbuf = kzalloc(reglen, GFP_USER);
 	if (!regbuf)
 		return -ENOMEM;
 
-- 
1.7.1

-- 
Kees Cook
Ubuntu Security Team

^ permalink raw reply related

* Re: [PATCH net-next-2.6] net: Fix rxq ref counting
From: Eric Dumazet @ 2010-10-07 20:59 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1010071304240.4792@pokey.mtv.corp.google.com>

Le jeudi 07 octobre 2010 à 13:09 -0700, Tom Herbert a écrit :
> The rx->count reference is used to track reference counts to the
> number of rx-queue kobjects created for the device.  This patch
> eliminates initialization of the counter in netif_alloc_rx_queues
> and instead increments the counter each time a kobject is created.
> This is now symmetric with the decrement that is done when an object is
> released.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7d14955..58b31d1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5026,7 +5026,6 @@ static int netif_alloc_rx_queues(struct net_device *dev)
>  			return -ENOMEM;
>  		}
>  		dev->_rx = rx;
> -		atomic_set(&rx->count, count);
>  
>  		/*
>  		 * Set a pointer to first element in the array which holds the
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index fa81fd0..b143173 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -726,6 +726,7 @@ static struct kobj_type rx_queue_ktype = {
>  static int rx_queue_add_kobject(struct net_device *net, int index)
>  {
>  	struct netdev_rx_queue *queue = net->_rx + index;
> +	struct netdev_rx_queue *first = queue->first;
>  	struct kobject *kobj = &queue->kobj;
>  	int error = 0;
>  
> @@ -738,6 +739,7 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
>  	}
>  
>  	kobject_uevent(kobj, KOBJ_ADD);
> +	atomic_inc(&first->count);
>  
>  	return error;
>  }

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>




^ permalink raw reply

* Re: [PATCH net-next-2.6] net: Fix rxq ref counting
From: Eric Dumazet @ 2010-10-07 20:56 UTC (permalink / raw)
  To: Tom Herbert; +Cc: davem, netdev
In-Reply-To: <alpine.DEB.1.00.1010071304240.4792@pokey.mtv.corp.google.com>

Le jeudi 07 octobre 2010 à 13:09 -0700, Tom Herbert a écrit :
> The rx->count reference is used to track reference counts to the
> number of rx-queue kobjects created for the device.  This patch
> eliminates initialization of the counter in netif_alloc_rx_queues
> and instead increments the counter each time a kobject is created.
> This is now symmetric with the decrement that is done when an object is
> released.
> 
> Signed-off-by: Tom Herbert <therbert@google.com>
> ---
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 7d14955..58b31d1 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5026,7 +5026,6 @@ static int netif_alloc_rx_queues(struct net_device *dev)
>  			return -ENOMEM;
>  		}
>  		dev->_rx = rx;
> -		atomic_set(&rx->count, count);
>  
>  		/*
>  		 * Set a pointer to first element in the array which holds the
> diff --git a/net/core/net-sysfs.c b/net/core/net-sysfs.c
> index fa81fd0..b143173 100644
> --- a/net/core/net-sysfs.c
> +++ b/net/core/net-sysfs.c
> @@ -726,6 +726,7 @@ static struct kobj_type rx_queue_ktype = {
>  static int rx_queue_add_kobject(struct net_device *net, int index)
>  {
>  	struct netdev_rx_queue *queue = net->_rx + index;
> +	struct netdev_rx_queue *first = queue->first;
>  	struct kobject *kobj = &queue->kobj;
>  	int error = 0;
>  
> @@ -738,6 +739,7 @@ static int rx_queue_add_kobject(struct net_device *net, int index)
>  	}
>  
>  	kobject_uevent(kobj, KOBJ_ADD);
> +	atomic_inc(&first->count);
>  
>  	return error;
>  }

Acked-by: Eric Dumazet <eric.dumazet@gmail.com>



^ permalink raw reply

* [PATCH net-next] neigh: speedup neigh_resolve_output()
From: Eric Dumazet @ 2010-10-07 20:44 UTC (permalink / raw)
  To: David Miller; +Cc: netdev
In-Reply-To: <1286456008.2912.171.camel@edumazet-laptop>

Le jeudi 07 octobre 2010 à 14:53 +0200, Eric Dumazet a écrit :

> Further improvements would need to use a seqlock instead of an rwlock to
> protect neigh->ha[], to not dirty neigh too often and remove two atomic
> ops.
> 

I implemented this idea in following patch, on top on previous one.

[PATCH net-next] neigh: speedup neigh_resolve_output()

Add a seqlock in struct neighbour to protect neigh->ha[], and avoid
dirtying neighbour in stress situation (many different flows / dsts)

Dirtying takes place because of read_lock(&n->lock) and n->used writes.

Switching to a seqlock, and writing n->used only on jiffies changes
permits less dirtying.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
 include/net/neighbour.h |   16 ++++++++++++
 net/core/neighbour.c    |   47 ++++++++++++++++++++++++--------------
 net/ipv4/arp.c          |    6 +---
 net/sched/sch_teql.c    |    8 +++---
 4 files changed, 51 insertions(+), 26 deletions(-)

diff --git a/include/net/neighbour.h b/include/net/neighbour.h
index a4538d5..f04e7a2 100644
--- a/include/net/neighbour.h
+++ b/include/net/neighbour.h
@@ -105,6 +105,7 @@ struct neighbour {
 	atomic_t		refcnt;
 	atomic_t		probes;
 	rwlock_t		lock;
+	seqlock_t		ha_lock;
 	unsigned char		ha[ALIGN(MAX_ADDR_LEN, sizeof(unsigned long))];
 	struct hh_cache		*hh;
 	int			(*output)(struct sk_buff *skb);
@@ -302,7 +303,10 @@ static inline void neigh_confirm(struct neighbour *neigh)
 
 static inline int neigh_event_send(struct neighbour *neigh, struct sk_buff *skb)
 {
-	neigh->used = jiffies;
+	unsigned long now = ACCESS_ONCE(jiffies);
+	
+	if (neigh->used != now)
+		neigh->used = now;
 	if (!(neigh->nud_state&(NUD_CONNECTED|NUD_DELAY|NUD_PROBE)))
 		return __neigh_event_send(neigh, skb);
 	return 0;
@@ -373,4 +377,14 @@ struct neighbour_cb {
 
 #define NEIGH_CB(skb)	((struct neighbour_cb *)(skb)->cb)
 
+static inline void neigh_ha_snapshot(char *dst, const struct neighbour *n,
+				     const struct net_device *dev)
+{
+	unsigned int seq;
+
+	do {
+		seq = read_seqbegin(&n->ha_lock);
+		memcpy(dst, n->ha, dev->addr_len);
+	} while (read_seqretry(&n->ha_lock, seq));
+}
 #endif
diff --git a/net/core/neighbour.c b/net/core/neighbour.c
index 53cc548..54aef9c 100644
--- a/net/core/neighbour.c
+++ b/net/core/neighbour.c
@@ -294,6 +294,7 @@ static struct neighbour *neigh_alloc(struct neigh_table *tbl)
 
 	skb_queue_head_init(&n->arp_queue);
 	rwlock_init(&n->lock);
+	seqlock_init(&n->ha_lock);
 	n->updated	  = n->used = now;
 	n->nud_state	  = NUD_NONE;
 	n->output	  = neigh_blackhole;
@@ -1015,7 +1016,7 @@ out_unlock_bh:
 }
 EXPORT_SYMBOL(__neigh_event_send);
 
-static void neigh_update_hhs(struct neighbour *neigh)
+static void neigh_update_hhs(const struct neighbour *neigh)
 {
 	struct hh_cache *hh;
 	void (*update)(struct hh_cache*, const struct net_device*, const unsigned char *)
@@ -1151,7 +1152,9 @@ int neigh_update(struct neighbour *neigh, const u8 *lladdr, u8 new,
 	}
 
 	if (lladdr != neigh->ha) {
+		write_seqlock(&neigh->ha_lock);
 		memcpy(&neigh->ha, lladdr, dev->addr_len);
+		write_sequnlock(&neigh->ha_lock);
 		neigh_update_hhs(neigh);
 		if (!(new & NUD_CONNECTED))
 			neigh->confirmed = jiffies -
@@ -1214,6 +1217,7 @@ static inline bool neigh_hh_lookup(struct neighbour *n, struct dst_entry *dst,
 {
 	struct hh_cache *hh;
 
+	smp_rmb(); /* paired with smp_wmb() in neigh_hh_init() */
 	for (hh = n->hh; hh; hh = hh->hh_next) {
 		if (hh->hh_type == protocol) {
 			atomic_inc(&hh->hh_refcnt);
@@ -1248,8 +1252,8 @@ static void neigh_hh_init(struct neighbour *n, struct dst_entry *dst,
 		kfree(hh);
 		return;
 	}
-	read_unlock(&n->lock);
-	write_lock(&n->lock);
+
+	write_lock_bh(&n->lock);
 	
 	/* must check if another thread already did the insert */
 	if (neigh_hh_lookup(n, dst, protocol)) {
@@ -1263,13 +1267,13 @@ static void neigh_hh_init(struct neighbour *n, struct dst_entry *dst,
 		hh->hh_output = n->ops->output;
 
 	hh->hh_next = n->hh;
+	smp_wmb(); /* paired with smp_rmb() in neigh_hh_lookup() */
 	n->hh	    = hh;
 
 	if (unlikely(cmpxchg(&dst->hh, NULL, hh) != NULL))
 		hh_cache_put(hh);
 end:
-	write_unlock(&n->lock);
-	read_lock(&n->lock);
+	write_unlock_bh(&n->lock);
 }
 
 /* This function can be used in contexts, where only old dev_queue_xmit
@@ -1308,16 +1312,18 @@ int neigh_resolve_output(struct sk_buff *skb)
 	if (!neigh_event_send(neigh, skb)) {
 		int err;
 		struct net_device *dev = neigh->dev;
+		unsigned int seq;
 
-		read_lock_bh(&neigh->lock);
 		if (dev->header_ops->cache &&
 		    !dst->hh &&
 		    !(dst->flags & DST_NOCACHE))
 			neigh_hh_init(neigh, dst, dst->ops->protocol);
 
-		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
-				      neigh->ha, NULL, skb->len);
-		read_unlock_bh(&neigh->lock);
+		do {
+			seq = read_seqbegin(&neigh->ha_lock);
+			err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+					      neigh->ha, NULL, skb->len);
+		} while (read_seqretry(&neigh->ha_lock, seq));
 
 		if (err >= 0)
 			rc = neigh->ops->queue_xmit(skb);
@@ -1344,13 +1350,16 @@ int neigh_connected_output(struct sk_buff *skb)
 	struct dst_entry *dst = skb_dst(skb);
 	struct neighbour *neigh = dst->neighbour;
 	struct net_device *dev = neigh->dev;
+	unsigned int seq;
 
 	__skb_pull(skb, skb_network_offset(skb));
 
-	read_lock_bh(&neigh->lock);
-	err = dev_hard_header(skb, dev, ntohs(skb->protocol),
-			      neigh->ha, NULL, skb->len);
-	read_unlock_bh(&neigh->lock);
+	do {
+		seq = read_seqbegin(&neigh->ha_lock);
+		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
+				      neigh->ha, NULL, skb->len);
+	} while (read_seqretry(&neigh->ha_lock, seq));
+
 	if (err >= 0)
 		err = neigh->ops->queue_xmit(skb);
 	else {
@@ -2148,10 +2157,14 @@ static int neigh_fill_info(struct sk_buff *skb, struct neighbour *neigh,
 
 	read_lock_bh(&neigh->lock);
 	ndm->ndm_state	 = neigh->nud_state;
-	if ((neigh->nud_state & NUD_VALID) &&
-	    nla_put(skb, NDA_LLADDR, neigh->dev->addr_len, neigh->ha) < 0) {
-		read_unlock_bh(&neigh->lock);
-		goto nla_put_failure;
+	if (neigh->nud_state & NUD_VALID) {
+		char haddr[MAX_ADDR_LEN];
+
+		neigh_ha_snapshot(haddr, neigh, neigh->dev);
+		if (nla_put(skb, NDA_LLADDR, neigh->dev->addr_len, haddr) < 0) {
+			read_unlock_bh(&neigh->lock);
+			goto nla_put_failure;
+		}
 	}
 
 	ci.ndm_used	 = jiffies_to_clock_t(now - neigh->used);
diff --git a/net/ipv4/arp.c b/net/ipv4/arp.c
index f353095..d8e540c 100644
--- a/net/ipv4/arp.c
+++ b/net/ipv4/arp.c
@@ -502,10 +502,8 @@ int arp_find(unsigned char *haddr, struct sk_buff *skb)
 
 	if (n) {
 		n->used = jiffies;
-		if (n->nud_state&NUD_VALID || neigh_event_send(n, skb) == 0) {
-			read_lock_bh(&n->lock);
-			memcpy(haddr, n->ha, dev->addr_len);
-			read_unlock_bh(&n->lock);
+		if (n->nud_state & NUD_VALID || neigh_event_send(n, skb) == 0) {
+			neigh_ha_snapshot(haddr, n, dev);
 			neigh_release(n);
 			return 0;
 		}
diff --git a/net/sched/sch_teql.c b/net/sched/sch_teql.c
index feaabc1..401af95 100644
--- a/net/sched/sch_teql.c
+++ b/net/sched/sch_teql.c
@@ -241,11 +241,11 @@ __teql_resolve(struct sk_buff *skb, struct sk_buff *skb_res, struct net_device *
 	}
 	if (neigh_event_send(n, skb_res) == 0) {
 		int err;
+		char haddr[MAX_ADDR_LEN];
 
-		read_lock(&n->lock);
-		err = dev_hard_header(skb, dev, ntohs(skb->protocol),
-				      n->ha, NULL, skb->len);
-		read_unlock(&n->lock);
+		neigh_ha_snapshot(haddr, n, dev);
+		err = dev_hard_header(skb, dev, ntohs(skb->protocol), haddr,
+				      NULL, skb->len);
 
 		if (err < 0) {
 			neigh_release(n);



^ permalink raw reply related

* Re: [PATCH V2] Use firmware provided index to register a network interface
From: Matt Domsch @ 2010-10-07 20:42 UTC (permalink / raw)
  To: Kay Sievers
  Cc: Greg KH, K, Narendra, netdev@vger.kernel.org,
	linux-hotplug@vger.kernel.org, linux-pci@vger.kernel.org,
	Hargrave, Jordan, Nijhawan, Vijay, Rose, Charles
In-Reply-To: <AANLkTinP5WPDPvk+kq8vsyP=xC9qcoe+c=1EBp0XJNPk@mail.gmail.com>

On Thu, Oct 07, 2010 at 10:05:14AM -0700, Kay Sievers wrote:
> On Thu, Oct 7, 2010 at 18:48, Greg KH <greg@kroah.com> wrote:
> > On Thu, Oct 07, 2010 at 11:31:13AM -0500, Matt Domsch wrote:
> >> 1) SMBIOS type 41 method.  Windows does not use this today, and I
> >>    can't speak to their future plans.  Narendra's kernel patch does,
> >>    as has biosdevname, the udev helper we first wrote for this
> >>    purpose, for several years.
> >
> > Then stick with that udev helper please :)
> 
> What about just exporting this information in sysfs, and not touch the naming?

The config tools all take a netdevice name as their argument.  What
would it look like then?


$ ifconfig $(netdevname 'Embedded NIC 1')

repeat for each tool that's called?  This is similar to what we
proposed with the userspace patch and libnetdevname, so the lookup can
happen inside each app, rather than the system admin having to do the
translation themselves.  That was rejected too...

otherwise, it's up to a human to do the translation in their head,
which isn't script-friendly.

-Matt

--
Matt Domsch
Technology Strategist
Dell | Office of the CTO

^ permalink raw reply

* Re: [PATCH] net: clear heap allocation for ETHTOOL_GRXCLSRLALL
From: Ben Hutchings @ 2010-10-07 20:28 UTC (permalink / raw)
  To: Kees Cook
  Cc: linux-kernel, David S. Miller, Jeff Garzik, Jeff Kirsher,
	Peter P Waskiewicz Jr, netdev
In-Reply-To: <20101007200348.GA6038@outflux.net>

On Thu, 2010-10-07 at 13:03 -0700, Kees Cook wrote:
> Calling ETHTOOL_GRXCLSRLALL with a large rule_cnt will allocate kernel
> heap without clearing it. For the one driver (niu) that implements it,
> it will leave the unused portion of heap unchanged and copy the full
> contents back to userspace.
> 
> Cc: stable@kernel.org
> Signed-off-by: Kees Cook <kees.cook@canonical.com>

Acked-by: Ben Hutchings <bhutchings@solarflare.com>

Should have spotted this myself. :-(

Ben.

> ---
>  net/core/ethtool.c |    2 +-
>  1 files changed, 1 insertions(+), 1 deletions(-)
> 
> diff --git a/net/core/ethtool.c b/net/core/ethtool.c
> index 7a85367..4016ac6 100644
> --- a/net/core/ethtool.c
> +++ b/net/core/ethtool.c
> @@ -348,7 +348,7 @@ static noinline_for_stack int ethtool_get_rxnfc(struct net_device *dev,
>  	if (info.cmd == ETHTOOL_GRXCLSRLALL) {
>  		if (info.rule_cnt > 0) {
>  			if (info.rule_cnt <= KMALLOC_MAX_SIZE / sizeof(u32))
> -				rule_buf = kmalloc(info.rule_cnt * sizeof(u32),
> +				rule_buf = kzalloc(info.rule_cnt * sizeof(u32),
>  						   GFP_USER);
>  			if (!rule_buf)
>  				return -ENOMEM;
> -- 
> 1.7.1
> 

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.

^ permalink raw reply

* Re: [PATCHv4 net-next-2.6 1/5] XFRM,IPv6: Remove xfrm_spi_hash() dependency on destination address
From: Arnaud Ebalard @ 2010-10-07 20:13 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David S. Miller, Eric Dumazet, Hideaki YOSHIFUJI, netdev
In-Reply-To: <20101005041707.GA26458@gondor.apana.org.au>

Hi,

Herbert Xu <herbert@gondor.apana.org.au> writes:

> On Tue, Oct 05, 2010 at 10:11:14AM +0800, Herbert Xu wrote:
>>
>> I'm thinking about the case where each remote end (or one remote
>> end with many IP addresses) chooses to use a single SPI which then
>> all gets hashed to the same bucket.
>> 
>> Outbound SAs are hashed into the same SPI hash table as inbound SAs.
>
> Another solution would be to create a hash table for inbound SAs
> only.  Unfortunately I don't think we have anything in our current
> user-interface to indicate whether an SA is inbound or outbound.
>
> So to do this you'll need to use a heuristic such as doing a
> lookup on the source/destination address at insertion time.

I spent some time scratching my head trying to find a good solution. It
would indeed be perfect to have a specific hash table for inbound
SA. But as you point, this would only be via a heuristic at insertion
time and there are various cases which would not work: a SA can be
installed w/o any of the addresses being configured on the system.

I think I will try the following alternative approach based on your
comments and proposals:

 - drop my patch to change spi hash computation
 - handle destination address remapping during input upon failure of
   xfrm_state_lookup()
 - handle source address remapping as it is currently done in the patch,
   i.e. by comparing received one against x->props.saddr once the
   state found and do 

To support the destination address remapping, I will have to reverse the
logic I currently have for destination remapping states, to allow the
lookup to be done based on the on-wire address (CoA) instead of the
address in the SA (HoA). If a remapping state is found for the on-wire
address, then a new lookup is done using the associated HoA this time.

I think it would make the feature easier less intrusive for the IPsec
stack.

Thanks for your support and patience, Herbert.

a+

^ permalink raw reply

page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox