Netdev List


* Re: [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list
From: Eric Dumazet @ 2010-11-12 14:26 UTC
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev, David Miller
In-Reply-To: <1289568858.3185.252.camel@edumazet-laptop>

On Friday, 12 November 2010 at 14:34 +0100, Eric Dumazet wrote:
> On Friday, 12 November 2010 at 10:22 +0100, Eric Dumazet wrote:
> > On Friday, 12 November 2010 at 16:19 +0800, Américo Wang wrote:
> > > On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> > 
> > > >A RCU conversion is far more complex.
> > > >
> > > 
> > > Yup.
> > 
> > 
> > Well, actually this is easy in this case.
> > 
> > I'll post a patch to do this RCU conversion.
> > 
> > 
> 
> Note: compile-tested only; I'd appreciate it if someone could test it ;)
> 
> Note: one patch from net-2.6 is not yet included in net-next-2.6, so
> please make sure you have it before testing ;)
> 
> ( http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=18943d292facbc70e6a36fc62399ae833f64671b )
> 
> 
> Thanks
> 
> [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list
> 
> in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).
> 
> This can easily be converted to RCU protection.
> 
> Writers hold RTNL, so mc_list_lock is removed, not replaced by a
> spinlock.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Cypher Wu <cypher.w@gmail.com>
> Cc: Américo Wang <xiyou.wangcong@gmail.com>
> ---

...

>  void ip_mc_up(struct in_device *in_dev)
>  {
> -	struct ip_mc_list *i;
> +	struct ip_mc_list *pmc;
>  
>  	ASSERT_RTNL();
>  
>  	ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
>  
> -	for (i=in_dev->mc_list; i; i=i->next)
> -		igmp_group_added(i);
> +	for_each_pmc_rtnl(in_dev, pmc);
> +		igmp_group_added(pmc);
>  }


Oops, there is an extra ';' after for_each_pmc_rtnl(in_dev, pmc), which
turns the loop body into an empty statement.

It should be:

	for_each_pmc_rtnl(in_dev, pmc)
		igmp_group_added(pmc);
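
The effect of the stray ';' can be reproduced in userspace. The sketch
below is only an illustration under assumed names: struct node and
for_each_node() are hypothetical stand-ins for struct ip_mc_list and
for_each_pmc_rtnl(), without the rtnl_dereference() wrapper. With the
extra ';' the loop runs to completion with an empty body, and the
indented statement then executes exactly once with the cursor already
NULL.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ip_mc_list: a singly linked node. */
struct node {
	int value;
	struct node *next;
};

/* Same shape as for_each_pmc_rtnl(), minus the rtnl_dereference() wrapper. */
#define for_each_node(head, pos) \
	for (pos = (head); pos != NULL; pos = pos->next)

/* Correct form: the statement after the macro is the loop body. */
static int count_visits_ok(struct node *head)
{
	struct node *pos;
	int visits = 0;

	for_each_node(head, pos)
		visits++;
	return visits;
}

/*
 * Buggy form: the stray ';' is an empty loop body, so the loop finishes
 * first and the indented statement runs once, after pos has become NULL.
 * The final cursor value is reported through pos_after.
 */
static int count_visits_buggy(struct node *head, struct node **pos_after)
{
	struct node *pos;
	int visits = 0;

	for_each_node(head, pos);
		visits++;
	*pos_after = pos;
	return visits;
}
```

For a three-element list, count_visits_ok() returns 3 while
count_visits_buggy() returns 1 and leaves the cursor NULL. In the
patch, that NULL cursor would be passed to igmp_group_added(). Some
compilers can warn about this pattern (for example through a
misleading-indentation diagnostic), depending on version and flags.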





* Re: [PATCH 2/2] ucc_geth: Fix deadlock
From: Anton Vorontsov @ 2010-11-12 14:09 UTC
  To: Joakim Tjernlund; +Cc: linuxppc-dev, netdev
In-Reply-To: <1289570109-8160-2-git-send-email-Joakim.Tjernlund@transmode.se>

On Fri, Nov 12, 2010 at 02:55:09PM +0100, Joakim Tjernlund wrote:
> This script:
>  while [ 1==1 ] ; do ifconfig eth0 up; usleep 1950000 ;ifconfig eth0 down; dmesg -c ;done
> causes in just a second or two:
> INFO: task ifconfig:572 blocked for more than 120 seconds.
[...]
> The reason appears to be ucc_geth_stop meets adjust_link as the
> PHY reports PHY changes. I believe adjust_link hangs somewhere,
> holding the PHY lock, because ucc_geth_stop disabled the
> controller HW.
> Fix is to stop the PHY before disabling the controller.
> 
> Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>

It's unclear where exactly adjust_link() hangs, but the patch
looks like the right thing overall.

Thanks!

Reviewed-by: Anton Vorontsov <cbouatmailru@gmail.com>

> ---
>  drivers/net/ucc_geth.c |   10 +++++++---
>  1 files changed, 7 insertions(+), 3 deletions(-)
> 
> diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
> index 6c254ed..06a5db3 100644
> --- a/drivers/net/ucc_geth.c
> +++ b/drivers/net/ucc_geth.c
> @@ -2050,12 +2050,16 @@ static void ucc_geth_stop(struct ucc_geth_private *ugeth)
>  
>  	ugeth_vdbg("%s: IN", __func__);
>  
> +	/*
> +	 * Tell the kernel the link is down.
> +	 * Must be done before disabling the controller
> +	 * or deadlock may happen.
> +	 */
> +	phy_stop(phydev);
> +
>  	/* Disable the controller */
>  	ugeth_disable(ugeth, COMM_DIR_RX_AND_TX);
>  
> -	/* Tell the kernel the link is down */
> -	phy_stop(phydev);
> -
>  	/* Mask all interrupts */
>  	out_be32(ugeth->uccf->p_uccm, 0x00000000);


* [PATCH v2] Prevent crashing when parsing bad X.25 facilities
From: Dan Rosenberg @ 2010-11-12 14:07 UTC
  To: andrew.hendry; +Cc: netdev

On parsing malformed X.25 facilities, decrementing the remaining length
may cause it to underflow.  Since the length is an unsigned integer,
this will result in the loop continuing until the kernel crashes.

This patch adds checks to ensure decrementing the remaining length does
not cause it to wrap around.

v2 prevents printing values outside the appropriate range.

Signed-off-by: Dan Rosenberg <drosenberg@vsecurity.com>
CC: stable <stable@kernel.org>
---
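
The wrap-around described above, and the kind of check that prevents
it, can be sketched in userspace. This is a minimal illustration, not
the kernel parser: the FAC_CLASS_* names and count_facilities() are
hypothetical, and the sketch assumes the top two bits of each facility
code select its size. Unlike the patch, the sketch also verifies that
the class D length byte is present before reading it; that extra check
is an assumption of this example.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Hypothetical class encoding: the top two bits of the code select the size. */
#define FAC_CLASS_MASK 0xC0u
#define FAC_CLASS_A    0x00u	/* code + 1 value byte */
#define FAC_CLASS_B    0x40u	/* code + 2 value bytes */
#define FAC_CLASS_C    0x80u	/* code + 3 value bytes */
#define FAC_CLASS_D    0xC0u	/* code + length byte + that many bytes */

/* What an unchecked decrement does to an unsigned length: 1 - 2 wraps. */
static unsigned int unchecked_decrement(unsigned int len, unsigned int need)
{
	return len - need;
}

/*
 * Count the complete facilities in buf[0..len). Returns -1 as soon as a
 * facility claims more bytes than remain, instead of letting len wrap.
 */
static int count_facilities(const unsigned char *buf, unsigned int len)
{
	const unsigned char *p = buf;
	int count = 0;

	while (len > 0) {
		unsigned int need;

		switch (*p & FAC_CLASS_MASK) {
		case FAC_CLASS_A:
			need = 2;
			break;
		case FAC_CLASS_B:
			need = 3;
			break;
		case FAC_CLASS_C:
			need = 4;
			break;
		default:	/* FAC_CLASS_D */
			if (len < 2)
				return -1;	/* no room for the length byte */
			need = p[1] + 2u;
			break;
		}
		if (len < need)
			return -1;	/* len -= need would wrap around */
		p += need;
		len -= need;
		count++;
	}
	return count;
}
```

A buffer holding one class A, one class B and one class D facility is
counted as three entries, while a buffer that ends in the middle of a
facility is rejected instead of driving the remaining length past zero.
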
 net/x25/x25_facilities.c |   13 ++++++++++---
 1 files changed, 10 insertions(+), 3 deletions(-)

diff --git a/net/x25/x25_facilities.c b/net/x25/x25_facilities.c
index 3a8c4c4..7f18e7d 100644
--- a/net/x25/x25_facilities.c
+++ b/net/x25/x25_facilities.c
@@ -61,6 +61,8 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 	while (len > 0) {
 		switch (*p & X25_FAC_CLASS_MASK) {
 		case X25_FAC_CLASS_A:
+			if (len < 2)
+				return 0;
 			switch (*p) {
 			case X25_FAC_REVERSE:
 				if((p[1] & 0x81) == 0x81) {
@@ -104,6 +106,8 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 			len -= 2;
 			break;
 		case X25_FAC_CLASS_B:
+			if (len < 3)
+				return 0;
 			switch (*p) {
 			case X25_FAC_PACKET_SIZE:
 				facilities->pacsize_in  = p[1];
@@ -125,6 +129,8 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 			len -= 3;
 			break;
 		case X25_FAC_CLASS_C:
+			if (len < 4)
+				return 0;
 			printk(KERN_DEBUG "X.25: unknown facility %02X, "
 			       "values %02X, %02X, %02X\n",
 			       p[0], p[1], p[2], p[3]);
@@ -132,6 +138,8 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 			len -= 4;
 			break;
 		case X25_FAC_CLASS_D:
+			if (len < p[1] + 2)
+				return 0;
 			switch (*p) {
 			case X25_FAC_CALLING_AE:
 				if (p[1] > X25_MAX_DTE_FACIL_LEN || p[1] <= 1)
@@ -149,9 +157,8 @@ int x25_parse_facilities(struct sk_buff *skb, struct x25_facilities *facilities,
 				break;
 			default:
 				printk(KERN_DEBUG "X.25: unknown facility %02X,"
-					"length %d, values %02X, %02X, "
-					"%02X, %02X\n",
-					p[0], p[1], p[2], p[3], p[4], p[5]);
+					"length %d\n",
+					p[0], p[1]);
 				break;
 			}
 			len -= p[1] + 2;




* Re: [PATCH 1/2] ucc_geth: Do not bring the whole IF down when TX failure.
From: Anton Vorontsov @ 2010-11-12 14:05 UTC
  To: Joakim Tjernlund; +Cc: linuxppc-dev, netdev
In-Reply-To: <1289570109-8160-1-git-send-email-Joakim.Tjernlund@transmode.se>

On Fri, Nov 12, 2010 at 02:55:08PM +0100, Joakim Tjernlund wrote:
> ucc_geth_close lacks a cancel_work_sync(&ugeth->timeout_work)
> to stop any outstanding processing of a TX failure. However, one
> cannot call cancel_work_sync without first fixing the timeout
> function, otherwise it will deadlock: the timeout work itself calls
> ucc_geth_close. This patch brings ucc_geth in line with gianfar:
> 
> Don't bring the interface down and up, just reinit controller HW
> and PHY.
> 
> Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>

Looks sane, thanks!

Reviewed-by: Anton Vorontsov <cbouatmailru@gmail.com>


* [PATCH 2/2] ucc_geth: Fix deadlock
From: Joakim Tjernlund @ 2010-11-12 13:55 UTC
  To: linuxppc-dev, netdev, Anton Vorontsov; +Cc: Joakim Tjernlund
In-Reply-To: <1289570109-8160-1-git-send-email-Joakim.Tjernlund@transmode.se>

This script:
 while [ 1==1 ] ; do ifconfig eth0 up; usleep 1950000 ;ifconfig eth0 down; dmesg -c ;done
causes in just a second or two:
INFO: task ifconfig:572 blocked for more than 120 seconds.
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
ifconfig      D 0ff65760     0   572    369 0x00000000
Call Trace:
[c6157be0] [c6008460] 0xc6008460 (unreliable)
[c6157ca0] [c0008608] __switch_to+0x4c/0x6c
[c6157cb0] [c028fecc] schedule+0x184/0x310
[c6157ce0] [c0290e54] __mutex_lock_slowpath+0xa4/0x150
[c6157d20] [c0290c48] mutex_lock+0x44/0x48
[c6157d30] [c01aba74] phy_stop+0x20/0x70
[c6157d40] [c01aef40] ucc_geth_stop+0x30/0x98
[c6157d60] [c01b18fc] ucc_geth_close+0x9c/0xdc
[c6157d80] [c01db0cc] __dev_close+0xa0/0xd0
[c6157d90] [c01deddc] __dev_change_flags+0x8c/0x148
[c6157db0] [c01def54] dev_change_flags+0x1c/0x64
[c6157dd0] [c0237ac8] devinet_ioctl+0x678/0x784
[c6157e50] [c0239a58] inet_ioctl+0xb0/0xbc
[c6157e60] [c01cafa8] sock_ioctl+0x174/0x2a0
[c6157e80] [c009a16c] vfs_ioctl+0xcc/0xe0
[c6157ea0] [c009a998] do_vfs_ioctl+0xc4/0x79c
[c6157f10] [c009b0b0] sys_ioctl+0x40/0x74
[c6157f40] [c00117c4] ret_from_syscall+0x0/0x38

The reason appears to be ucc_geth_stop meets adjust_link as the
PHY reports PHY changes. I believe adjust_link hangs somewhere,
holding the PHY lock, because ucc_geth_stop disabled the
controller HW.
Fix is to stop the PHY before disabling the controller.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
---
 drivers/net/ucc_geth.c |   10 +++++++---
 1 files changed, 7 insertions(+), 3 deletions(-)

diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 6c254ed..06a5db3 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -2050,12 +2050,16 @@ static void ucc_geth_stop(struct ucc_geth_private *ugeth)
 
 	ugeth_vdbg("%s: IN", __func__);
 
+	/*
+	 * Tell the kernel the link is down.
+	 * Must be done before disabling the controller
+	 * or deadlock may happen.
+	 */
+	phy_stop(phydev);
+
 	/* Disable the controller */
 	ugeth_disable(ugeth, COMM_DIR_RX_AND_TX);
 
-	/* Tell the kernel the link is down */
-	phy_stop(phydev);
-
 	/* Mask all interrupts */
 	out_be32(ugeth->uccf->p_uccm, 0x00000000);
 
-- 
1.7.2.2



* [PATCH 1/2] ucc_geth: Do not bring the whole IF down when TX failure.
From: Joakim Tjernlund @ 2010-11-12 13:55 UTC
  To: linuxppc-dev, netdev, Anton Vorontsov; +Cc: Joakim Tjernlund

ucc_geth_close lacks a cancel_work_sync(&ugeth->timeout_work)
to stop any outstanding processing of a TX failure. However, one
cannot call cancel_work_sync without first fixing the timeout
function, otherwise it will deadlock: the timeout work itself calls
ucc_geth_close. This patch brings ucc_geth in line with gianfar:

Don't bring the interface down and up, just reinit controller HW
and PHY.

Signed-off-by: Joakim Tjernlund <Joakim.Tjernlund@transmode.se>
---
 drivers/net/ucc_geth.c |   15 +++++++++------
 1 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/drivers/net/ucc_geth.c b/drivers/net/ucc_geth.c
index 97f9f7d..6c254ed 100644
--- a/drivers/net/ucc_geth.c
+++ b/drivers/net/ucc_geth.c
@@ -2065,9 +2065,6 @@ static void ucc_geth_stop(struct ucc_geth_private *ugeth)
 	/* Disable Rx and Tx */
 	clrbits32(&ug_regs->maccfg1, MACCFG1_ENABLE_RX | MACCFG1_ENABLE_TX);
 
-	phy_disconnect(ugeth->phydev);
-	ugeth->phydev = NULL;
-
 	ucc_geth_memclean(ugeth);
 }
 
@@ -3556,7 +3553,10 @@ static int ucc_geth_close(struct net_device *dev)
 
 	napi_disable(&ugeth->napi);
 
+	cancel_work_sync(&ugeth->timeout_work);
 	ucc_geth_stop(ugeth);
+	phy_disconnect(ugeth->phydev);
+	ugeth->phydev = NULL;
 
 	free_irq(ugeth->ug_info->uf_info.irq, ugeth->ndev);
 
@@ -3585,8 +3585,12 @@ static void ucc_geth_timeout_work(struct work_struct *work)
 		 * Must reset MAC *and* PHY. This is done by reopening
 		 * the device.
 		 */
-		ucc_geth_close(dev);
-		ucc_geth_open(dev);
+		netif_tx_stop_all_queues(dev);
+		ucc_geth_stop(ugeth);
+		ucc_geth_init_mac(ugeth);
+		/* Must start PHY here */
+		phy_start(ugeth->phydev);
+		netif_tx_start_all_queues(dev);
 	}
 
 	netif_tx_schedule_all(dev);
@@ -3600,7 +3604,6 @@ static void ucc_geth_timeout(struct net_device *dev)
 {
 	struct ucc_geth_private *ugeth = netdev_priv(dev);
 
-	netif_carrier_off(dev);
 	schedule_work(&ugeth->timeout_work);
 }
 
-- 
1.7.2.2



* [PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list
From: Eric Dumazet @ 2010-11-12 13:34 UTC
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev, David Miller
In-Reply-To: <1289553759.3185.1.camel@edumazet-laptop>

On Friday, 12 November 2010 at 10:22 +0100, Eric Dumazet wrote:
> On Friday, 12 November 2010 at 16:19 +0800, Américo Wang wrote:
> > On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> 
> > >A RCU conversion is far more complex.
> > >
> > 
> > Yup.
> 
> 
> Well, actually this is easy in this case.
> 
> I'll post a patch to do this RCU conversion.
> 
> 

Note: compile-tested only; I'd appreciate it if someone could test it ;)

Note: one patch from net-2.6 is not yet included in net-next-2.6, so
please make sure you have it before testing ;)

( http://git.kernel.org/?p=linux/kernel/git/davem/net-2.6.git;a=commitdiff;h=18943d292facbc70e6a36fc62399ae833f64671b )


Thanks

[PATCH net-next-2.6] igmp: RCU conversion of in_dev->mc_list

in_dev->mc_list is protected by one rwlock (in_dev->mc_list_lock).

This can easily be converted to RCU protection.

Writers hold RTNL, so mc_list_lock is removed, not replaced by a
spinlock.

Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
Cc: Cypher Wu <cypher.w@gmail.com>
Cc: Américo Wang <xiyou.wangcong@gmail.com>
---
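
As a rough userspace analogy for the publish/traverse pattern the
commit message describes, the sketch below uses C11 atomics. It is not
kernel code and does not implement RCU: struct mc_node, struct mc_head,
mc_add(), mc_find() and mc_del() are hypothetical names, the release
store only stands in for rcu_assign_pointer(), and the acquire loads
(stronger than rcu_dereference() requires) stand in for the reader
side. Writers are assumed to be serialised by an external lock, as
RTNL does in the patch. Grace periods and freeing are deliberately
left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for struct ip_mc_list. */
struct mc_node {
	unsigned int addr;
	struct mc_node *_Atomic next;
};

/* Hypothetical stand-in for the list head kept in struct in_device. */
struct mc_head {
	struct mc_node *_Atomic first;
};

/*
 * Writer side, assuming an external lock already serialises writers.
 * The node is fully initialised while still private, then published
 * with a release store so readers never see a half-built node.
 */
static void mc_add(struct mc_head *head, struct mc_node *n, unsigned int addr)
{
	n->addr = addr;
	atomic_init(&n->next,
		    atomic_load_explicit(&head->first, memory_order_relaxed));
	atomic_store_explicit(&head->first, n, memory_order_release);
}

/* Reader side: lock-free traversal using acquire loads. */
static struct mc_node *mc_find(struct mc_head *head, unsigned int addr)
{
	struct mc_node *n;

	for (n = atomic_load_explicit(&head->first, memory_order_acquire);
	     n != NULL;
	     n = atomic_load_explicit(&n->next, memory_order_acquire)) {
		if (n->addr == addr)
			return n;
	}
	return NULL;
}

/*
 * Writer side unlink. The node is only removed from the list here; a
 * reader may still be looking at it, so it must not be freed or reused
 * until all such readers are done. This sketch never frees it.
 */
static int mc_del(struct mc_head *head, unsigned int addr)
{
	struct mc_node *_Atomic *link = &head->first;
	struct mc_node *n;

	while ((n = atomic_load_explicit(link, memory_order_relaxed)) != NULL) {
		if (n->addr == addr) {
			atomic_store_explicit(link,
				atomic_load_explicit(&n->next, memory_order_relaxed),
				memory_order_release);
			return 1;
		}
		link = &n->next;
	}
	return 0;
}
```

The design point mirrored here is that readers take no lock at all,
while writers rely on a lock they already hold for other reasons, so
no additional spinlock is needed on the update path.
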
 include/linux/igmp.h       |   12 +
 include/linux/inetdevice.h |    5 
 include/net/inet_sock.h    |    2 
 net/ipv4/igmp.c            |  223 ++++++++++++++++-------------------
 4 files changed, 115 insertions(+), 127 deletions(-)
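
The reclaim callbacks in this patch recover the enclosing object from
its embedded rcu_head with container_of(). A userspace sketch of that
idiom follows, under assumed names: struct cb_head, struct group,
group_reclaim() and run_callback_now() are hypothetical, and the
callback is invoked immediately here, whereas call_rcu() defers it
until after a grace period.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Userspace equivalent of the kernel's container_of(), built on offsetof(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical callback head, standing in for struct rcu_head. */
struct cb_head {
	void (*func)(struct cb_head *head);
};

/* Hypothetical object with the callback head embedded as a member. */
struct group {
	unsigned int addr;
	int *freed_flag;	/* lets the caller observe the reclamation */
	struct cb_head cb;
};

/* Same shape as ip_mc_list_reclaim(): map head to object, then free it. */
static void group_reclaim(struct cb_head *head)
{
	struct group *g = container_of(head, struct group, cb);

	*g->freed_flag = 1;
	free(g);
}

/*
 * Hypothetical stand-in for a deferral API. It runs the callback right
 * away; a real RCU implementation would wait for a grace period first.
 */
static void run_callback_now(struct cb_head *head,
			     void (*func)(struct cb_head *head))
{
	head->func = func;
	head->func(head);
}
```

Embedding the head in the object means the deferral mechanism needs no
separate allocation: the callback receives only the head pointer and
derives the object address from the member offset.
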

diff --git a/include/linux/igmp.h b/include/linux/igmp.h
index 93fc244..7d16467 100644
--- a/include/linux/igmp.h
+++ b/include/linux/igmp.h
@@ -167,10 +167,10 @@ struct ip_sf_socklist {
  */
 
 struct ip_mc_socklist {
-	struct ip_mc_socklist	*next;
+	struct ip_mc_socklist __rcu *next_rcu;
 	struct ip_mreqn		multi;
 	unsigned int		sfmode;		/* MCAST_{INCLUDE,EXCLUDE} */
-	struct ip_sf_socklist	*sflist;
+	struct ip_sf_socklist __rcu	*sflist;
 	struct rcu_head		rcu;
 };
 
@@ -186,11 +186,14 @@ struct ip_sf_list {
 struct ip_mc_list {
 	struct in_device	*interface;
 	__be32			multiaddr;
+	unsigned int		sfmode;
 	struct ip_sf_list	*sources;
 	struct ip_sf_list	*tomb;
-	unsigned int		sfmode;
 	unsigned long		sfcount[2];
-	struct ip_mc_list	*next;
+	union {
+		struct ip_mc_list *next;
+		struct ip_mc_list __rcu *next_rcu;
+	};
 	struct timer_list	timer;
 	int			users;
 	atomic_t		refcnt;
@@ -201,6 +204,7 @@ struct ip_mc_list {
 	char			loaded;
 	unsigned char		gsquery;	/* check source marks? */
 	unsigned char		crcount;
+	struct rcu_head		rcu;
 };
 
 /* V3 exponential field decoding */
diff --git a/include/linux/inetdevice.h b/include/linux/inetdevice.h
index ccd5b07..380ba6b 100644
--- a/include/linux/inetdevice.h
+++ b/include/linux/inetdevice.h
@@ -52,9 +52,8 @@ struct in_device {
 	atomic_t		refcnt;
 	int			dead;
 	struct in_ifaddr	*ifa_list;	/* IP ifaddr chain		*/
-	rwlock_t		mc_list_lock;
-	struct ip_mc_list	*mc_list;	/* IP multicast filter chain    */
-	int			mc_count;	          /* Number of installed mcasts	*/
+	struct ip_mc_list __rcu	*mc_list;	/* IP multicast filter chain    */
+	int			mc_count;	/* Number of installed mcasts	*/
 	spinlock_t		mc_tomb_lock;
 	struct ip_mc_list	*mc_tomb;
 	unsigned long		mr_v1_seen;
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 1989cfd..8945f9f 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -141,7 +141,7 @@ struct inet_sock {
 				nodefrag:1;
 	int			mc_index;
 	__be32			mc_addr;
-	struct ip_mc_socklist	*mc_list;
+	struct ip_mc_socklist __rcu	*mc_list;
 	struct {
 		unsigned int		flags;
 		unsigned int		fragsize;
diff --git a/net/ipv4/igmp.c b/net/ipv4/igmp.c
index 08d0d81..ff4e5fd 100644
--- a/net/ipv4/igmp.c
+++ b/net/ipv4/igmp.c
@@ -149,11 +149,17 @@ static void ip_mc_clear_src(struct ip_mc_list *pmc);
 static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
 			 int sfcount, __be32 *psfsrc, int delta);
 
+
+static void ip_mc_list_reclaim(struct rcu_head *head)
+{
+	kfree(container_of(head, struct ip_mc_list, rcu));
+}
+
 static void ip_ma_put(struct ip_mc_list *im)
 {
 	if (atomic_dec_and_test(&im->refcnt)) {
 		in_dev_put(im->interface);
-		kfree(im);
+		call_rcu(&im->rcu, ip_mc_list_reclaim);
 	}
 }
 
@@ -163,7 +169,7 @@ static void ip_ma_put(struct ip_mc_list *im)
  *	Timer management
  */
 
-static __inline__ void igmp_stop_timer(struct ip_mc_list *im)
+static void igmp_stop_timer(struct ip_mc_list *im)
 {
 	spin_lock_bh(&im->lock);
 	if (del_timer(&im->timer))
@@ -496,14 +502,24 @@ empty_source:
 	return skb;
 }
 
+#define for_each_pmc_rcu(in_dev, pmc)				\
+	for (pmc = rcu_dereference(in_dev->mc_list);		\
+	     pmc != NULL;					\
+	     pmc = rcu_dereference(pmc->next_rcu))
+
+#define for_each_pmc_rtnl(in_dev, pmc)				\
+	for (pmc = rtnl_dereference(in_dev->mc_list);		\
+	     pmc != NULL;					\
+	     pmc = rtnl_dereference(pmc->next_rcu))
+
 static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 {
 	struct sk_buff *skb = NULL;
 	int type;
 
 	if (!pmc) {
-		read_lock(&in_dev->mc_list_lock);
-		for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+		rcu_read_lock();
+		for_each_pmc_rcu(in_dev, pmc) {
 			if (pmc->multiaddr == IGMP_ALL_HOSTS)
 				continue;
 			spin_lock_bh(&pmc->lock);
@@ -514,7 +530,7 @@ static int igmpv3_send_report(struct in_device *in_dev, struct ip_mc_list *pmc)
 			skb = add_grec(skb, pmc, type, 0, 0);
 			spin_unlock_bh(&pmc->lock);
 		}
-		read_unlock(&in_dev->mc_list_lock);
+		rcu_read_unlock();
 	} else {
 		spin_lock_bh(&pmc->lock);
 		if (pmc->sfcount[MCAST_EXCLUDE])
@@ -556,7 +572,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 	struct sk_buff *skb = NULL;
 	int type, dtype;
 
-	read_lock(&in_dev->mc_list_lock);
+	rcu_read_lock();
 	spin_lock_bh(&in_dev->mc_tomb_lock);
 
 	/* deleted MCA's */
@@ -593,7 +609,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 	spin_unlock_bh(&in_dev->mc_tomb_lock);
 
 	/* change recs */
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rcu(in_dev, pmc) {
 		spin_lock_bh(&pmc->lock);
 		if (pmc->sfcount[MCAST_EXCLUDE]) {
 			type = IGMPV3_BLOCK_OLD_SOURCES;
@@ -616,7 +632,7 @@ static void igmpv3_send_cr(struct in_device *in_dev)
 		}
 		spin_unlock_bh(&pmc->lock);
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 
 	if (!skb)
 		return;
@@ -813,14 +829,14 @@ static void igmp_heard_report(struct in_device *in_dev, __be32 group)
 	if (group == IGMP_ALL_HOSTS)
 		return;
 
-	read_lock(&in_dev->mc_list_lock);
-	for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, im) {
 		if (im->multiaddr == group) {
 			igmp_stop_timer(im);
 			break;
 		}
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 }
 
 static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
@@ -906,8 +922,8 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
 	 * - Use the igmp->igmp_code field as the maximum
 	 *   delay possible
 	 */
-	read_lock(&in_dev->mc_list_lock);
-	for (im=in_dev->mc_list; im!=NULL; im=im->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, im) {
 		int changed;
 
 		if (group && group != im->multiaddr)
@@ -925,7 +941,7 @@ static void igmp_heard_query(struct in_device *in_dev, struct sk_buff *skb,
 		if (changed)
 			igmp_mod_timer(im, max_delay);
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 }
 
 /* called in rcu_read_lock() section */
@@ -1110,8 +1126,8 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
 		kfree(pmc);
 	}
 	/* clear dead sources, too */
-	read_lock(&in_dev->mc_list_lock);
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, pmc) {
 		struct ip_sf_list *psf, *psf_next;
 
 		spin_lock_bh(&pmc->lock);
@@ -1123,7 +1139,7 @@ static void igmpv3_clear_delrec(struct in_device *in_dev)
 			kfree(psf);
 		}
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 }
 #endif
 
@@ -1209,7 +1225,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 
 	ASSERT_RTNL();
 
-	for (im=in_dev->mc_list; im; im=im->next) {
+	for_each_pmc_rtnl(in_dev, im) {
 		if (im->multiaddr == addr) {
 			im->users++;
 			ip_mc_add_src(in_dev, &addr, MCAST_EXCLUDE, 0, NULL, 0);
@@ -1217,7 +1233,7 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 		}
 	}
 
-	im = kmalloc(sizeof(*im), GFP_KERNEL);
+	im = kzalloc(sizeof(*im), GFP_KERNEL);
 	if (!im)
 		goto out;
 
@@ -1227,26 +1243,18 @@ void ip_mc_inc_group(struct in_device *in_dev, __be32 addr)
 	im->multiaddr = addr;
 	/* initial mode is (EX, empty) */
 	im->sfmode = MCAST_EXCLUDE;
-	im->sfcount[MCAST_INCLUDE] = 0;
 	im->sfcount[MCAST_EXCLUDE] = 1;
-	im->sources = NULL;
-	im->tomb = NULL;
-	im->crcount = 0;
 	atomic_set(&im->refcnt, 1);
 	spin_lock_init(&im->lock);
 #ifdef CONFIG_IP_MULTICAST
-	im->tm_running = 0;
 	setup_timer(&im->timer, &igmp_timer_expire, (unsigned long)im);
 	im->unsolicit_count = IGMP_Unsolicited_Report_Count;
-	im->reporter = 0;
-	im->gsquery = 0;
 #endif
-	im->loaded = 0;
-	write_lock_bh(&in_dev->mc_list_lock);
-	im->next = in_dev->mc_list;
-	in_dev->mc_list = im;
+
+	im->next_rcu = in_dev->mc_list;
 	in_dev->mc_count++;
-	write_unlock_bh(&in_dev->mc_list_lock);
+	rcu_assign_pointer(in_dev->mc_list, im);
+
 #ifdef CONFIG_IP_MULTICAST
 	igmpv3_del_delrec(in_dev, im->multiaddr);
 #endif
@@ -1287,17 +1295,18 @@ EXPORT_SYMBOL(ip_mc_rejoin_group);
 
 void ip_mc_dec_group(struct in_device *in_dev, __be32 addr)
 {
-	struct ip_mc_list *i, **ip;
+	struct ip_mc_list *i;
+	struct ip_mc_list __rcu **ip;
 
 	ASSERT_RTNL();
 
-	for (ip=&in_dev->mc_list; (i=*ip)!=NULL; ip=&i->next) {
+	for (ip = &in_dev->mc_list;
+	     (i = rtnl_dereference(*ip)) != NULL;
+	     ip = &i->next_rcu) {
 		if (i->multiaddr == addr) {
 			if (--i->users == 0) {
-				write_lock_bh(&in_dev->mc_list_lock);
-				*ip = i->next;
+				*ip = i->next_rcu;
 				in_dev->mc_count--;
-				write_unlock_bh(&in_dev->mc_list_lock);
 				igmp_group_dropped(i);
 
 				if (!in_dev->dead)
@@ -1316,34 +1325,34 @@ EXPORT_SYMBOL(ip_mc_dec_group);
 
 void ip_mc_unmap(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
-	for (i = in_dev->mc_list; i; i = i->next)
-		igmp_group_dropped(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_dropped(pmc);
 }
 
 void ip_mc_remap(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
-	for (i = in_dev->mc_list; i; i = i->next)
-		igmp_group_added(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_added(pmc);
 }
 
 /* Device going down */
 
 void ip_mc_down(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
-	for (i=in_dev->mc_list; i; i=i->next)
-		igmp_group_dropped(i);
+	for_each_pmc_rtnl(in_dev, pmc)
+		igmp_group_dropped(pmc);
 
 #ifdef CONFIG_IP_MULTICAST
 	in_dev->mr_ifc_count = 0;
@@ -1374,7 +1383,6 @@ void ip_mc_init_dev(struct in_device *in_dev)
 	in_dev->mr_qrv = IGMP_Unsolicited_Report_Count;
 #endif
 
-	rwlock_init(&in_dev->mc_list_lock);
 	spin_lock_init(&in_dev->mc_tomb_lock);
 }
 
@@ -1382,14 +1390,14 @@ void ip_mc_init_dev(struct in_device *in_dev)
 
 void ip_mc_up(struct in_device *in_dev)
 {
-	struct ip_mc_list *i;
+	struct ip_mc_list *pmc;
 
 	ASSERT_RTNL();
 
 	ip_mc_inc_group(in_dev, IGMP_ALL_HOSTS);
 
-	for (i=in_dev->mc_list; i; i=i->next)
-		igmp_group_added(i);
+	for_each_pmc_rtnl(in_dev, pmc);
+		igmp_group_added(pmc);
 }
 
 /*
@@ -1405,17 +1413,13 @@ void ip_mc_destroy_dev(struct in_device *in_dev)
 	/* Deactivate timers */
 	ip_mc_down(in_dev);
 
-	write_lock_bh(&in_dev->mc_list_lock);
-	while ((i = in_dev->mc_list) != NULL) {
-		in_dev->mc_list = i->next;
+	while ((i = rtnl_dereference(in_dev->mc_list)) != NULL) {
+		in_dev->mc_list = i->next_rcu;
 		in_dev->mc_count--;
-		write_unlock_bh(&in_dev->mc_list_lock);
+
 		igmp_group_dropped(i);
 		ip_ma_put(i);
-
-		write_lock_bh(&in_dev->mc_list_lock);
 	}
-	write_unlock_bh(&in_dev->mc_list_lock);
 }
 
 /* RTNL is locked */
@@ -1513,18 +1517,18 @@ static int ip_mc_del_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
 
 	if (!in_dev)
 		return -ENODEV;
-	read_lock(&in_dev->mc_list_lock);
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, pmc) {
 		if (*pmca == pmc->multiaddr)
 			break;
 	}
 	if (!pmc) {
 		/* MCA not found?? bug */
-		read_unlock(&in_dev->mc_list_lock);
+		rcu_read_unlock();
 		return -ESRCH;
 	}
 	spin_lock_bh(&pmc->lock);
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 #ifdef CONFIG_IP_MULTICAST
 	sf_markstate(pmc);
 #endif
@@ -1685,18 +1689,18 @@ static int ip_mc_add_src(struct in_device *in_dev, __be32 *pmca, int sfmode,
 
 	if (!in_dev)
 		return -ENODEV;
-	read_lock(&in_dev->mc_list_lock);
-	for (pmc=in_dev->mc_list; pmc; pmc=pmc->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, pmc) {
 		if (*pmca == pmc->multiaddr)
 			break;
 	}
 	if (!pmc) {
 		/* MCA not found?? bug */
-		read_unlock(&in_dev->mc_list_lock);
+		rcu_read_unlock();
 		return -ESRCH;
 	}
 	spin_lock_bh(&pmc->lock);
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 
 #ifdef CONFIG_IP_MULTICAST
 	sf_markstate(pmc);
@@ -1793,7 +1797,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
 
 	err = -EADDRINUSE;
 	ifindex = imr->imr_ifindex;
-	for (i = inet->mc_list; i; i = i->next) {
+	for_each_pmc_rtnl(inet, i) {
 		if (i->multi.imr_multiaddr.s_addr == addr &&
 		    i->multi.imr_ifindex == ifindex)
 			goto done;
@@ -1807,7 +1811,7 @@ int ip_mc_join_group(struct sock *sk , struct ip_mreqn *imr)
 		goto done;
 
 	memcpy(&iml->multi, imr, sizeof(*imr));
-	iml->next = inet->mc_list;
+	iml->next_rcu = inet->mc_list;
 	iml->sflist = NULL;
 	iml->sfmode = MCAST_EXCLUDE;
 	rcu_assign_pointer(inet->mc_list, iml);
@@ -1821,17 +1825,14 @@ EXPORT_SYMBOL(ip_mc_join_group);
 
 static void ip_sf_socklist_reclaim(struct rcu_head *rp)
 {
-	struct ip_sf_socklist *psf;
-
-	psf = container_of(rp, struct ip_sf_socklist, rcu);
+	kfree(container_of(rp, struct ip_sf_socklist, rcu));
 	/* sk_omem_alloc should have been decreased by the caller*/
-	kfree(psf);
 }
 
 static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,
 			   struct in_device *in_dev)
 {
-	struct ip_sf_socklist *psf = iml->sflist;
+	struct ip_sf_socklist *psf = rtnl_dereference(iml->sflist);
 	int err;
 
 	if (psf == NULL) {
@@ -1851,11 +1852,8 @@ static int ip_mc_leave_src(struct sock *sk, struct ip_mc_socklist *iml,
 
 static void ip_mc_socklist_reclaim(struct rcu_head *rp)
 {
-	struct ip_mc_socklist *iml;
-
-	iml = container_of(rp, struct ip_mc_socklist, rcu);
+	kfree(container_of(rp, struct ip_mc_socklist, rcu));
 	/* sk_omem_alloc should have been decreased by the caller*/
-	kfree(iml);
 }
 
 
@@ -1866,7 +1864,8 @@ static void ip_mc_socklist_reclaim(struct rcu_head *rp)
 int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 {
 	struct inet_sock *inet = inet_sk(sk);
-	struct ip_mc_socklist *iml, **imlp;
+	struct ip_mc_socklist *iml;
+	struct ip_mc_socklist __rcu **imlp;
 	struct in_device *in_dev;
 	struct net *net = sock_net(sk);
 	__be32 group = imr->imr_multiaddr.s_addr;
@@ -1876,7 +1875,9 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 	rtnl_lock();
 	in_dev = ip_mc_find_dev(net, imr);
 	ifindex = imr->imr_ifindex;
-	for (imlp = &inet->mc_list; (iml = *imlp) != NULL; imlp = &iml->next) {
+	for (imlp = &inet->mc_list;
+	     (iml = rtnl_dereference(*imlp)) != NULL;
+	     imlp = &iml->next_rcu) {
 		if (iml->multi.imr_multiaddr.s_addr != group)
 			continue;
 		if (ifindex) {
@@ -1888,7 +1889,7 @@ int ip_mc_leave_group(struct sock *sk, struct ip_mreqn *imr)
 
 		(void) ip_mc_leave_src(sk, iml, in_dev);
 
-		rcu_assign_pointer(*imlp, iml->next);
+		*imlp = iml->next_rcu;
 
 		if (in_dev)
 			ip_mc_dec_group(in_dev, group);
@@ -1934,7 +1935,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
 	}
 	err = -EADDRNOTAVAIL;
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if ((pmc->multi.imr_multiaddr.s_addr ==
 		     imr.imr_multiaddr.s_addr) &&
 		    (pmc->multi.imr_ifindex == imr.imr_ifindex))
@@ -1958,7 +1959,7 @@ int ip_mc_source(int add, int omode, struct sock *sk, struct
 		pmc->sfmode = omode;
 	}
 
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	if (!add) {
 		if (!psl)
 			goto done;	/* err = -EADDRNOTAVAIL */
@@ -2077,7 +2078,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
 		goto done;
 	}
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
 		    pmc->multi.imr_ifindex == imr.imr_ifindex)
 			break;
@@ -2107,7 +2108,7 @@ int ip_mc_msfilter(struct sock *sk, struct ip_msfilter *msf, int ifindex)
 		(void) ip_mc_add_src(in_dev, &msf->imsf_multiaddr,
 				     msf->imsf_fmode, 0, NULL, 0);
 	}
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	if (psl) {
 		(void) ip_mc_del_src(in_dev, &msf->imsf_multiaddr, pmc->sfmode,
 			psl->sl_count, psl->sl_addr, 0);
@@ -2155,7 +2156,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 	}
 	err = -EADDRNOTAVAIL;
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == msf->imsf_multiaddr &&
 		    pmc->multi.imr_ifindex == imr.imr_ifindex)
 			break;
@@ -2163,7 +2164,7 @@ int ip_mc_msfget(struct sock *sk, struct ip_msfilter *msf,
 	if (!pmc)		/* must have a prior join */
 		goto done;
 	msf->imsf_fmode = pmc->sfmode;
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	rtnl_unlock();
 	if (!psl) {
 		len = 0;
@@ -2208,7 +2209,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
 
 	err = -EADDRNOTAVAIL;
 
-	for (pmc=inet->mc_list; pmc; pmc=pmc->next) {
+	for_each_pmc_rtnl(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == addr &&
 		    pmc->multi.imr_ifindex == gsf->gf_interface)
 			break;
@@ -2216,7 +2217,7 @@ int ip_mc_gsfget(struct sock *sk, struct group_filter *gsf,
 	if (!pmc)		/* must have a prior join */
 		goto done;
 	gsf->gf_fmode = pmc->sfmode;
-	psl = pmc->sflist;
+	psl = rtnl_dereference(pmc->sflist);
 	rtnl_unlock();
 	count = psl ? psl->sl_count : 0;
 	copycount = count < gsf->gf_numsrc ? count : gsf->gf_numsrc;
@@ -2257,7 +2258,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
 		goto out;
 
 	rcu_read_lock();
-	for (pmc=rcu_dereference(inet->mc_list); pmc; pmc=rcu_dereference(pmc->next)) {
+	for_each_pmc_rcu(inet, pmc) {
 		if (pmc->multi.imr_multiaddr.s_addr == loc_addr &&
 		    pmc->multi.imr_ifindex == dif)
 			break;
@@ -2265,7 +2266,7 @@ int ip_mc_sf_allow(struct sock *sk, __be32 loc_addr, __be32 rmt_addr, int dif)
 	ret = inet->mc_all;
 	if (!pmc)
 		goto unlock;
-	psl = pmc->sflist;
+	psl = rcu_dereference(pmc->sflist);
 	ret = (pmc->sfmode == MCAST_EXCLUDE);
 	if (!psl)
 		goto unlock;
@@ -2300,10 +2301,10 @@ void ip_mc_drop_socket(struct sock *sk)
 		return;
 
 	rtnl_lock();
-	while ((iml = inet->mc_list) != NULL) {
+	while ((iml = rtnl_dereference(inet->mc_list)) != NULL) {
 		struct in_device *in_dev;
-		rcu_assign_pointer(inet->mc_list, iml->next);
 
+		inet->mc_list = iml->next_rcu;
 		in_dev = inetdev_by_index(net, iml->multi.imr_ifindex);
 		(void) ip_mc_leave_src(sk, iml, in_dev);
 		if (in_dev != NULL) {
@@ -2323,8 +2324,8 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
 	struct ip_sf_list *psf;
 	int rv = 0;
 
-	read_lock(&in_dev->mc_list_lock);
-	for (im=in_dev->mc_list; im; im=im->next) {
+	rcu_read_lock();
+	for_each_pmc_rcu(in_dev, im) {
 		if (im->multiaddr == mc_addr)
 			break;
 	}
@@ -2345,7 +2346,7 @@ int ip_check_mc(struct in_device *in_dev, __be32 mc_addr, __be32 src_addr, u16 p
 		} else
 			rv = 1; /* unspecified source; tentatively allow */
 	}
-	read_unlock(&in_dev->mc_list_lock);
+	rcu_read_unlock();
 	return rv;
 }
 
@@ -2371,13 +2372,11 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 		in_dev = __in_dev_get_rcu(state->dev);
 		if (!in_dev)
 			continue;
-		read_lock(&in_dev->mc_list_lock);
-		im = in_dev->mc_list;
+		im = rcu_dereference(in_dev->mc_list);
 		if (im) {
 			state->in_dev = in_dev;
 			break;
 		}
-		read_unlock(&in_dev->mc_list_lock);
 	}
 	return im;
 }
@@ -2385,11 +2384,9 @@ static inline struct ip_mc_list *igmp_mc_get_first(struct seq_file *seq)
 static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_list *im)
 {
 	struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
-	im = im->next;
-	while (!im) {
-		if (likely(state->in_dev != NULL))
-			read_unlock(&state->in_dev->mc_list_lock);
 
+	im = rcu_dereference(im->next_rcu);
+	while (!im) {
 		state->dev = next_net_device_rcu(state->dev);
 		if (!state->dev) {
 			state->in_dev = NULL;
@@ -2398,8 +2395,7 @@ static struct ip_mc_list *igmp_mc_get_next(struct seq_file *seq, struct ip_mc_li
 		state->in_dev = __in_dev_get_rcu(state->dev);
 		if (!state->in_dev)
 			continue;
-		read_lock(&state->in_dev->mc_list_lock);
-		im = state->in_dev->mc_list;
+		im = rcu_dereference(state->in_dev->mc_list);
 	}
 	return im;
 }
@@ -2435,10 +2431,8 @@ static void igmp_mc_seq_stop(struct seq_file *seq, void *v)
 	__releases(rcu)
 {
 	struct igmp_mc_iter_state *state = igmp_mc_seq_private(seq);
-	if (likely(state->in_dev != NULL)) {
-		read_unlock(&state->in_dev->mc_list_lock);
-		state->in_dev = NULL;
-	}
+
+	state->in_dev = NULL;
 	state->dev = NULL;
 	rcu_read_unlock();
 }
@@ -2460,7 +2454,7 @@ static int igmp_mc_seq_show(struct seq_file *seq, void *v)
 		querier = "NONE";
 #endif
 
-		if (state->in_dev->mc_list == im) {
+		if (rcu_dereference(state->in_dev->mc_list) == im) {
 			seq_printf(seq, "%d\t%-10s: %5d %7s\n",
 				   state->dev->ifindex, state->dev->name, state->in_dev->mc_count, querier);
 		}
@@ -2519,8 +2513,7 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 		idev = __in_dev_get_rcu(state->dev);
 		if (unlikely(idev == NULL))
 			continue;
-		read_lock(&idev->mc_list_lock);
-		im = idev->mc_list;
+		im = rcu_dereference(idev->mc_list);
 		if (likely(im != NULL)) {
 			spin_lock_bh(&im->lock);
 			psf = im->sources;
@@ -2531,7 +2524,6 @@ static inline struct ip_sf_list *igmp_mcf_get_first(struct seq_file *seq)
 			}
 			spin_unlock_bh(&im->lock);
 		}
-		read_unlock(&idev->mc_list_lock);
 	}
 	return psf;
 }
@@ -2545,9 +2537,6 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
 		spin_unlock_bh(&state->im->lock);
 		state->im = state->im->next;
 		while (!state->im) {
-			if (likely(state->idev != NULL))
-				read_unlock(&state->idev->mc_list_lock);
-
 			state->dev = next_net_device_rcu(state->dev);
 			if (!state->dev) {
 				state->idev = NULL;
@@ -2556,8 +2545,7 @@ static struct ip_sf_list *igmp_mcf_get_next(struct seq_file *seq, struct ip_sf_l
 			state->idev = __in_dev_get_rcu(state->dev);
 			if (!state->idev)
 				continue;
-			read_lock(&state->idev->mc_list_lock);
-			state->im = state->idev->mc_list;
+			state->im = rcu_dereference(state->idev->mc_list);
 		}
 		if (!state->im)
 			break;
@@ -2603,10 +2591,7 @@ static void igmp_mcf_seq_stop(struct seq_file *seq, void *v)
 		spin_unlock_bh(&state->im->lock);
 		state->im = NULL;
 	}
-	if (likely(state->idev != NULL)) {
-		read_unlock(&state->idev->mc_list_lock);
-		state->idev = NULL;
-	}
+	state->idev = NULL;
 	state->dev = NULL;
 	rcu_read_unlock();
 }

^ permalink raw reply related

* Re: Kernel rwlock design, Multicore and IGMP
From: Yong Zhang @ 2010-11-12 13:00 UTC (permalink / raw)
  To: Américo Wang; +Cc: Eric Dumazet, Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112091818.GB5949@cr0.nay.redhat.com>

On Fri, Nov 12, 2010 at 05:18:18PM +0800, Américo Wang wrote:
> On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
> >On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> >> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
> >>>Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
> >>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
> >>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> >>>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
> >>>> >>
> >>>> >> Hi
> >>>> >>
> >>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
> >>>> >>
> >>>> >>
> >>>> >>> I'm using TILEPro, and its kernel rwlock is a little different from
> >>>> >>> other platforms. It gives priority to the write lock: once a write lock
> >>>> >>> is attempted, it blocks subsequent read locks even if the read lock is
> >>>> >>> held by others. Its code can be read in Linux Kernel 2.6.36 in
> >>>> >>> arch/tile/lib/spinlock_32.c.
> >>>> >>
> >>>> >> This seems a bug to me.
> >>>> >>
> >>>> >> read_lock() can be nested. We used such a scheme in the past in iptables
> >>>> >> (it can re-enter itself),
> >>>> >> and we used instead a spinlock(), but with many discussions with lkml
> >>>> >> and Linus himself if I remember well.
> >>>> >>
> >>>> >Whether read_lock() can be nested does not seem to be the problem, since
> >>>> >rwlock doesn't have an 'owner'. The question is whether we should give
> >>>> >write_lock() priority over read_lock(), since a lot of
> >>>> >read_lock()s will otherwise starve write_lock().
> >>>> >We should work out a well-defined behavior so that every
> >>>> >platform-dependent raw_rwlock is designed under that principle.
> >>>>
> >>>
> >>>AFAIK, Lockdep allows read_lock() to be nested.
> >>>
> >>>> It is a known weakness of rwlock, it is designed like that. :)
> >>>>
> >>>
> >>>Agreed.
> >>>
> >>
> >> Just for record, both Tile and X86 implement rwlock with a write-bias,
> >> this somewhat reduces the write-starvation problem.
> >
> >Are you sure(on x86)?
> >
> >It seems that we never implemented a writer-biased rwlock.
> >
> 
> Try
> 
> % grep RW_LOCK_BIAS -nr arch/x86
> 
> *And* read the code to see how it works. :)

If read_lock()/write_lock() fails, the subtracted value(1 for
read_lock() and RW_LOCK_BIAS for write_lock()) is added back.
So reader and writer will contend on the same lock fairly.
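
A minimal userspace sketch of the counting scheme described above, using
C11 atomics instead of the x86 assembly. The sketch_* names are
hypothetical and only try-lock variants are shown; it is an illustration
of the "subtract, test, add back on failure" idea, not the kernel code.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define RW_LOCK_BIAS 0x01000000	/* counter value of an unlocked lock */

static_assert(RW_LOCK_BIAS > 0, "bias must be positive for the sign test");

typedef struct {
	atomic_int count;	/* RW_LOCK_BIAS minus what the holders subtracted */
} sketch_rwlock_t;

static void sketch_rwlock_init(sketch_rwlock_t *l)
{
	atomic_init(&l->count, RW_LOCK_BIAS);
}

/* Reader: subtract 1. A negative result means a writer got there first,
 * so the 1 is added back and the attempt fails. */
static bool sketch_read_trylock(sketch_rwlock_t *l)
{
	int old = atomic_fetch_sub(&l->count, 1);

	if (old - 1 >= 0)
		return true;
	atomic_fetch_add(&l->count, 1);
	return false;
}

/* Writer: subtract the whole bias. Only an exact zero means that no
 * reader and no writer held the lock; otherwise the bias is added back. */
static bool sketch_write_trylock(sketch_rwlock_t *l)
{
	int old = atomic_fetch_sub(&l->count, RW_LOCK_BIAS);

	if (old - RW_LOCK_BIAS == 0)
		return true;
	atomic_fetch_add(&l->count, RW_LOCK_BIAS);
	return false;
}

static void sketch_read_unlock(sketch_rwlock_t *l)
{
	atomic_fetch_add(&l->count, 1);
}

static void sketch_write_unlock(sketch_rwlock_t *l)
{
	atomic_fetch_add(&l->count, RW_LOCK_BIAS);
}
```

Nothing in this scheme queues a waiting writer ahead of new readers: a
failed attempt on either side simply undoes its subtraction and retries.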

And the RW_LOCK_BIAS-based rwlock is a variant of the signed-test
rwlock, so it works the same way as a highest-bit-set mode
rwlock.

It seems you've been misled by its name (RW_LOCK_BIAS). :)
Or am I missing something?

Thanks,
Yong

> 
> Note, on Tile, it uses a little different algorithm.

^ permalink raw reply

* Re: [PATCH 4/10] Fix leaking of kernel heap addresses in net/
From: Dan Rosenberg @ 2010-11-12 12:37 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David Miller, socketcan, kuznet, urs.thuermann, yoshfuji, kaber,
	jmorris, remi.denis-courmont, pekkas, sri, vladislav.yasevich, tj,
	lizf, joe, shemminger, hadi, ebiederm, adobriyan, jpirko,
	johannes.berg, daniel.lezcano, xemul, socketcan-core, netdev,
	linux-sctp, torvalds
In-Reply-To: <1289546610.17691.1770.camel@edumazet-laptop>


> 
> 1) Inode numbers are not guaranteed to be unique. It's a 32-bit sequence
> number, and we don't check whether another socket inode uses the same inode
> number (after 2^32 allocations this can happen)
> 

Ok...this is a bit far-fetched, but I see your point.

> 2) /proc/net/ files can deliver same "line" of information several
> times, because of their implementation.
> 
> 3) Because of SLAB_DESTROY_BY_RCU, same 'kernel socket pointer' can be
> seen several times in /proc/net/tcp & /proc/net/udp, but really on
> different "sockets"
> 
> 4) Some good applications use both the socket pointer and inode number
> (tuple) to filter out the [2] problem. Dont break them, please ?
> Anything that might break an application must be at the very least
> tunable.
> 
> In my opinion, a good thing would be :
> 
> - Use a special printf() format , aka "secure pointer", as Thomas
> suggested.
> 

I am happy to write this, but just to be sure, you're sure we're ok with
a printf() format that cannot be used in interrupt context?  %pS and %ps
are taken, so I'm thinking %pH ("hidden").

> - Make sure you print different opaque values for two different kernel
> pointers. This is mandatory.
> 
> - Make sure the NULL pointer stay as a NULL pointer to not let the
> hostile user know your secret, and to ease debugging stuff.
> 

Alright, this is fine by me.

> - Have security experts advise on choosing a nice crypto function, maybe
> Jenkins hash. Not too slow would be nice.
> 
> 
> static unsigned long securize_kpointers_rnd;
> 
> At boot time, stick a random value in this variable.
> (Maybe make sure the 5 low order bits are 0)
> 

I don't really like the approach of relying on a global secret value
that's used repeatedly.  Sure, it's better than not obfuscating the
pointers at all, but it seems like it will be difficult to prevent its
value from being inferred or discovered.

> unsigned long opacify_kptr(unsigned long ptr)
> {
> 	if (ptr == 0)
> 		return ptr;
> 	if (capable(CAP_NET_ADMIN))
> 		return ptr;
> 
> 	return some_crypto_hash(ptr, &securize_kpointers_rnd);
> }
> 

I question the use of CAP_NET_ADMIN.  Sure, that makes sense in this
context, but wouldn't it be useful to be able to use this format
specifier outside of net?  In fact, I'm not sure that CAP_SYS_ADMIN is
appropriate either - perhaps just going off current_euid()?
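
For illustration, the requirements quoted above (NULL stays NULL, two
different pointers never print the same value, privileged readers see
the raw pointer) can be sketched in userspace C as below. All sketch_*
names, the fixed secret and the "privileged" flag (standing in for the
capability check) are assumptions made for this example. The mixing
function is a simple keyed bijection, NOT a vetted crypto function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a secret chosen randomly at boot time. */
static uint64_t sketch_kptr_secret = 0x9e3779b97f4a7c15ULL;

/* Keyed mixing built only from steps that are invertible on 64-bit
 * values and map 0 to 0: multiplication by an odd number and
 * xor-with-right-shift. Distinct inputs therefore give distinct outputs. */
static uint64_t sketch_mix(uint64_t x, uint64_t key)
{
	x *= key | 1;			/* odd multiplier: invertible mod 2^64 */
	x ^= x >> 29;			/* xorshift: invertible */
	x *= 0xbf58476d1ce4e5b9ULL;	/* odd constant */
	x ^= x >> 32;
	return x;
}

static uint64_t sketch_opacify_kptr(uint64_t ptr, bool privileged)
{
	uint64_t out;

	if (ptr == 0)		/* NULL stays NULL */
		return 0;
	if (privileged)		/* stand-in for the capability check */
		return ptr;

	out = sketch_mix(ptr, sketch_kptr_secret);
	assert(out != 0);	/* a non-NULL pointer never prints as NULL */
	return out;
}
```

Because the mapping is a bijection, tools that pair the printed value
with the inode number keep working. Whether a single global secret is
strong enough is exactly the open question raised above.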

> At least, use a central point, so that we can easily add/change the
> logic if needed.
> 
> Please provide this patch in kernel/printk.c for initial review, then
> once everybody is OK, you can send one patch for net tree.
> 

Do you mean lib/vsprintf.c?

> No need to send 10 patches if we dont agree on the general principle.

Agreed.

-Dan


^ permalink raw reply

* request_threaded_irq()
From: Mark Ryden @ 2010-11-12 12:27 UTC (permalink / raw)
  To: netdev

Hello netdev,

Grepping under net-next-2.6/drivers/net for request_threaded_irq()
shows that it appears in only 3 drivers:

can/mcp251x.c
wireless/b43/main.c
wireless/rt2x00/rt2x00pci.c

I was wondering: when thinking about performance, is it worthwhile to use this
API instead of the ordinary request_irq()? It seems to me that
request_threaded_irq() might
be better in some cases than NAPI polling for network drivers (or at
least it might be so on some systems, maybe multicore?)

Best,
Mark Ryden

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Wolfgang Grandegger @ 2010-11-12 11:45 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w, Samuel Ortiz,
	margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, LKML,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Marc Kleine-Budde, Christian Pellegrin,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <004a01cb825a$3a8bd060$66f8800a-a06+6cuVnkTSQfdrb5gaxUEOCMrvLtNR@public.gmane.org>

On 11/12/2010 12:10 PM, Tomoya MORINAGA wrote:
> On Saturday, October 30, 2010 4:32 AM, Wolfgang Grandegger wrote :
> 
> Sorry for my late reply.
> 
>>>> +
>>>> +#define PCH_RX_OK 0x00000010
>>>> +#define PCH_TX_OK 0x00000008
>>>> +#define PCH_BUS_OFF 0x00000080
>>>> +#define PCH_EWARN 0x00000040
>>>> +#define PCH_EPASSIV 0x00000020
>>>> +#define PCH_LEC0 0x00000001
>>>> +#define PCH_LEC1 0x00000002
>>>> +#define PCH_LEC2 0x00000004
>>>
>>> These are just single set bit, please use BIT()
>>> Consider adding the name of the corresponding register to the define's
>>> name.
> 
> I agree.
> 
>>>
>>>> +#define PCH_LEC_ALL (PCH_LEC0 | PCH_LEC1 | PCH_LEC2)
>>>> +#define PCH_STUF_ERR PCH_LEC0
>>>> +#define PCH_FORM_ERR PCH_LEC1
>>>> +#define PCH_ACK_ERR (PCH_LEC0 | PCH_LEC1)
>>>> +#define PCH_BIT1_ERR PCH_LEC2
>>>> +#define PCH_BIT0_ERR (PCH_LEC0 | PCH_LEC2)
>>>> +#define PCH_CRC_ERR (PCH_LEC1 | PCH_LEC2)
>>
>> This is an enumeration:
>>
>> enum {
>> PCH_STUF_ERR = 1,
>> PCH_FORM_ERR,
>> PCH_ACK_ERR,
>> PCH_BIT1_ERR,
>> PCH_BIT0_ERR,
>> PCH_CRC_ERR,
>> PCH_LEC_ALL,
>> };
> 
> No, 
> LEC is for bit assignment.
> Thus, "enum" can't be used.

Why? For me it's a classical enum because the value matters, and *not*
the individual bit. Do you agree?
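
A small userspace sketch of the point: the three LEC bits are read as
one 3-bit value, and the combined value selects the error type. The bit
definitions are the ones quoted above; PCH_LEC_MASK, PCH_LEC_NONE and
both helper functions are hypothetical names added for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Bit definitions as quoted from the driver patch. */
#define PCH_LEC0	0x00000001
#define PCH_LEC1	0x00000002
#define PCH_LEC2	0x00000004
#define PCH_LEC_MASK	(PCH_LEC0 | PCH_LEC1 | PCH_LEC2)

enum pch_lec {
	PCH_LEC_NONE = 0,	/* hypothetical name for "no error recorded" */
	PCH_STUF_ERR,		/* 1 */
	PCH_FORM_ERR,		/* 2 */
	PCH_ACK_ERR,		/* 3 */
	PCH_BIT1_ERR,		/* 4 */
	PCH_BIT0_ERR,		/* 5 */
	PCH_CRC_ERR,		/* 6 */
	PCH_LEC_ALL,		/* 7 */
};

/* The enum values coincide with the bit combinations of the #defines. */
static_assert(PCH_ACK_ERR == (PCH_LEC0 | PCH_LEC1), "ACK error is code 3");
static_assert(PCH_BIT0_ERR == (PCH_LEC0 | PCH_LEC2), "BIT0 error is code 5");
static_assert(PCH_LEC_ALL == PCH_LEC_MASK, "code 7 has all three bits set");

/* Decode the field as one value; suitable for a switch statement. */
static enum pch_lec pch_lec_decode(uint32_t status)
{
	return (enum pch_lec)(status & PCH_LEC_MASK);
}

/* A per-bit test, shown only for contrast: it also fires for ACK and
 * BIT0 errors, because those codes have bit 0 set as well. */
static int pch_lec_naive_is_stuff(uint32_t status)
{
	return (status & PCH_LEC0) != 0;
}
```

With a status value of 0x93 the decoded code is PCH_ACK_ERR, while the
per-bit test would report a stuff error.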

>>> I suggest to convert to a if-bit-set because there might be more than
>>> one bit set.
>>
>> Marc, what do you mean here? It's an enumeration. Maybe the following
>> code is clearer:
>>
>> lec = status & PCH_LEC_ALL;
>> if (lec > 0) {
>> switch (lec) {
> 
> No.
> LEC is not enum.

See also my subsequent comment here:

http://marc.info/?l=linux-netdev&m=128880088907148&w=2

>>>> + case PCH_STUF_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_STUFF;
>>>> + break;
>>>> + case PCH_FORM_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_FORM;
>>>> + break;
>>>> + case PCH_ACK_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_LOC_ACK |
>>>> +        CAN_ERR_PROT_LOC_ACK_DEL;
>>
>> Could you check what type of bus error that is? Usually it's an ack
>> lost error.
> 
> I will modify.
> BTW, I can see ti_hecc also has the same code as above.

Yes, the AT91 driver also uses a somewhat incorrect error mask. I will
check and provide a patch a.s.a.p.

>>
>>>> + break;
>>>> + case PCH_BIT1_ERR:
>>>> + case PCH_BIT0_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_BIT;
>>>> + break;
>>>> + case PCH_CRC_ERR:
>>>> + cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
>>>> +        CAN_ERR_PROT_LOC_CRC_DEL;
>>>> + break;
>>>> + default:
>>>> + iowrite32(status | PCH_LEC_ALL, &priv->regs->stat);
>>>> + break;
>>>> + }
>>>> +
>>>> + }
>>
>> Also, could you please add the TEC and REC:
>>
>> cf->data[6] = ioread32(&priv->regs->errc) & CAN_TEC;
>> cf->data[7] = (ioread32(&priv->regs->errc) & CAN_REC) >> 8;
> 
> I will add.

BTW: it could be done with one I/O call:

  errc = ioread32(&priv->regs->errc);
  cf->data[6] = errc & CAN_TEC;
  cf->data[7] = (errc & CAN_REC) >> 8;

> But I couldn't find 

Don't understand? It's also implemented for the SJA1000 driver:

http://lxr.linux.no/#linux+v2.6.36/drivers/net/can/sja1000/sja1000.c#L466

Wolfgang.

^ permalink raw reply

* [net-next] stmmac: update the driver documentation
From: Giuseppe CAVALLARO @ 2010-11-12 11:37 UTC (permalink / raw)
  To: netdev; +Cc: Giuseppe Cavallaro

Signed-off-by: Giuseppe Cavallaro <peppe.cavallaro@st.com>
---
 Documentation/networking/stmmac.txt |   48 +++++++++++++++++++++++++++--------
 1 files changed, 37 insertions(+), 11 deletions(-)

diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt
index 7ee770b..80a7a34 100644
--- a/Documentation/networking/stmmac.txt
+++ b/Documentation/networking/stmmac.txt
@@ -7,7 +7,7 @@ This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers
 (Synopsys IP blocks); it has been fully tested on STLinux platforms.
 
 Currently this network device driver is for all STM embedded MAC/GMAC
-(7xxx SoCs).
+(7xxx SoCs). Other platforms start using it i.e. ARM SPEAr.
 
 DWC Ether MAC 10/100/1000 Universal version 3.41a and DWC Ether MAC 10/100
 Universal version 4.0 have been used for developing the first code
@@ -95,9 +95,14 @@ Several information came from the platform; please refer to the
 driver's Header file in include/linux directory.
 
 struct plat_stmmacenet_data {
-        int bus_id;
-        int pbl;
-        int has_gmac;
+	int bus_id;
+	int pbl;
+	int clk_csr;
+	int has_gmac;
+	int enh_desc;
+	int tx_coe;
+	int bugged_jumbo;
+	int pmt;
         void (*fix_mac_speed)(void *priv, unsigned int speed);
         void (*bus_setup)(unsigned long ioaddr);
 #ifdef CONFIG_STM_DRIVERS
@@ -114,6 +119,12 @@ Where:
   registers (on STM platforms);
 - has_gmac: GMAC core is on board (get it at run-time in the next step);
 - bus_id: bus identifier.
+- tx_coe: core is able to perform the tx csum in HW.
+- enh_desc: if sets the MAC will use the enhanced descriptor structure.
+- clk_csr: CSR Clock range selection.
+- bugged_jumbo: some HWs are not able to perform the csum in HW for
+  over-sized frames due to limited buffer sizes. Setting this
+  flag the csum will be done in SW on JUMBO frames.
 
 struct plat_stmmacphy_data {
         int bus_id;
@@ -131,13 +142,28 @@ Where:
 - interface: physical MII interface mode;
 - phy_reset: hook to reset HW function.
 
+SOURCES:
+- Kconfig
+- Makefile
+- stmmac_main.c: main network device driver;
+- stmmac_mdio.c: mdio functions;
+- stmmac_ethtool.c: ethtool support;
+- stmmac_timer.[ch]: timer code used for mitigating the driver dma interrupts
+  Only tested on ST40 platforms based.
+- stmmac.h: private driver structure;
+- common.h: common definitions and VFTs;
+- descs.h: descriptor structure definitions;
+- dwmac1000_core.c: GMAC core functions;
+- dwmac1000_dma.c:  dma functions for the GMAC chip;
+- dwmac1000.h: specific header file for the GMAC;
+- dwmac100_core: MAC 100 core and dma code;
+- dwmac100_dma.c: dma funtions for the MAC chip;
+- dwmac1000.h: specific header file for the MAC;
+- dwmac_lib.c: generic DMA functions shared among chips
+- enh_desc.c: functions for handling enhanced descriptors
+- norm_desc.c: functions for handling normal descriptors
+
 TODO:
-- Continue to make the driver more generic and suitable for other Synopsys
-  Ethernet controllers used on other architectures (i.e. ARM).
-- 10G controllers are not supported.
-- MAC uses Normal descriptors and GMAC uses enhanced ones.
-  This is a limit that should be reviewed. MAC could want to
-  use the enhanced structure.
-- Checksumming: Rx/Tx csum is done in HW in case of GMAC only.
+- XGMAC controller is not supported.
 - Review the timer optimisation code to use an embedded device that seems to be
   available in new chip generations.
-- 
1.5.5.6


^ permalink raw reply related

* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-12 11:25 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Américo Wang, linux-kernel, netdev
In-Reply-To: <AANLkTikHjGkq5FDoXHaUbRkpurmT3mSLiu8toqqRL4Gi@mail.gmail.com>

Le vendredi 12 novembre 2010 à 19:10 +0800, Cypher Wu a écrit :

> I used to do it that way: just separate the call into internal and
> external versions, with the external one taking the lock and then calling
> the internal one. But in this case ip_check_mc() is called indirectly from
> igmpv3_sendpack(), and it is not very clear how to pass the different parameter.

I said that I was preparing an RCU patch, don't bother with this ;)

Should be ready in a couple of minutes.

Thanks

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Cypher Wu @ 2010-11-12 11:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Américo Wang, linux-kernel, netdev
In-Reply-To: <1289546874.17691.1774.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 3:27 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> Le vendredi 12 novembre 2010 à 15:13 +0800, Américo Wang a écrit :
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> Le jeudi 11 novembre 2010 à 21:49 +0800, Cypher Wu a écrit :
>> >>
>> >> Hi
>> >>
>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>
>> >>
>> >>> I'm using TILEPro, and its kernel rwlock is a little different from
>> >>> other platforms. It gives priority to the write lock: once a write lock
>> >>> is attempted, it blocks subsequent read locks even if the read lock is
>> >>> held by others. Its code can be read in Linux Kernel 2.6.36 in
>> >>> arch/tile/lib/spinlock_32.c.
>> >>
>> >> This seems a bug to me.
>> >>
>> >> read_lock() can be nested. We used such a scheme in the past in iptables
>> >> (it can re-enter itself),
>> >> and we used instead a spinlock(), but with many discussions with lkml
>> >> and Linus himself if I remember well.
>> >>
>> >Whether read_lock() can be nested does not seem to be the problem, since
>> >rwlock doesn't have an 'owner'. The question is whether we should give
>> >write_lock() priority over read_lock(), since a lot of
>> >read_lock()s will otherwise starve write_lock().
>> >We should work out a well-defined behavior so that every
>> >platform-dependent raw_rwlock is designed under that principle.
>>
>
> AFAIK, Lockdep allows read_lock() to be nested.
>
>> It is a known weakness of rwlock, it is designed like that. :)
>>
>
> Agreed.
>
>> The solution is to use RCU or seqlock, but I don't think seqlock
>> is proper for this case you described. So, try RCU lock.
>
> In the IGMP case, it should be easy for the task owning a read_lock() to
> pass a parameter to the called function saying 'I already own the
> read_lock(), dont try to re-acquire it'

I used to do it that way: just separate the call into internal and
external versions, with the external one taking the lock and then calling
the internal one. But in this case ip_check_mc() is called indirectly from
igmpv3_sendpack(), and it is not very clear how to pass the different parameter.
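
A minimal userspace sketch of that internal/external split. All
sketch_* names are hypothetical, and the "lock" is a stand-in that only
asserts it is never taken twice; it is not the kernel rwlock API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for the read lock: it asserts that the lock is never
 * re-acquired by a caller that already holds it. */
static int sketch_lock_depth;

static void sketch_read_lock(void)
{
	assert(sketch_lock_depth == 0);
	sketch_lock_depth++;
}

static void sketch_read_unlock(void)
{
	assert(sketch_lock_depth == 1);
	sketch_lock_depth--;
}

/* Hypothetical group list protected by the lock. */
static const unsigned int sketch_groups[] = { 0xe0000001u, 0xe0000016u };

/* Internal variant: the caller must already hold the lock. */
static bool sketch_check_mc_locked(unsigned int addr)
{
	size_t i;

	assert(sketch_lock_depth == 1);
	for (i = 0; i < sizeof(sketch_groups) / sizeof(sketch_groups[0]); i++)
		if (sketch_groups[i] == addr)
			return true;
	return false;
}

/* External variant: takes the lock, then calls the internal one. */
static bool sketch_check_mc(unsigned int addr)
{
	bool found;

	sketch_read_lock();
	found = sketch_check_mc_locked(addr);
	sketch_read_unlock();
	return found;
}

/* A path that already holds the lock must call the internal variant. */
static bool sketch_send_report(unsigned int addr)
{
	bool found;

	sketch_read_lock();
	found = sketch_check_mc_locked(addr);
	sketch_read_unlock();
	return found;
}
```

This works only when the lock holder calls the checker directly. It does
not answer the indirect-call case raised above, where intermediate
functions would also have to carry the "already locked" information.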

>
> A RCU conversion is far more complex.
>
>
>
>

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Fix build warnings
From: Tomoya MORINAGA @ 2010-11-12 11:10 UTC (permalink / raw)
  To: Wolfgang Grandegger
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, LKML,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w, Marc Kleine-Budde,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	Christian Pellegrin, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CCB213A.2020206@grandegger.com>

On Saturday, October 30, 2010 4:32 AM, Wolfgang Grandegger wrote :

Sorry for my late reply.

> >> +
> >> +#define PCH_RX_OK 0x00000010
> >> +#define PCH_TX_OK 0x00000008
> >> +#define PCH_BUS_OFF 0x00000080
> >> +#define PCH_EWARN 0x00000040
> >> +#define PCH_EPASSIV 0x00000020
> >> +#define PCH_LEC0 0x00000001
> >> +#define PCH_LEC1 0x00000002
> >> +#define PCH_LEC2 0x00000004
> > 
> > These are just single set bit, please use BIT()
> > Consider adding the name of the corresponding register to the define's
> > name.

I agree.

> > 
> >> +#define PCH_LEC_ALL (PCH_LEC0 | PCH_LEC1 | PCH_LEC2)
> >> +#define PCH_STUF_ERR PCH_LEC0
> >> +#define PCH_FORM_ERR PCH_LEC1
> >> +#define PCH_ACK_ERR (PCH_LEC0 | PCH_LEC1)
> >> +#define PCH_BIT1_ERR PCH_LEC2
> >> +#define PCH_BIT0_ERR (PCH_LEC0 | PCH_LEC2)
> >> +#define PCH_CRC_ERR (PCH_LEC1 | PCH_LEC2)
> 
> This is an enumeration:
> 
> enum {
> PCH_STUF_ERR = 1,
> PCH_FORM_ERR,
> PCH_ACK_ERR,
> PCH_BIT1_ERR,
> PCH_BIT0_ERR,
> PCH_CRC_ERR,
> PCH_LEC_ALL,
> };

No, 
LEC is for bit assignment.
Thus, "enum" can't be used.


> >> +#define PCH_FIFO_THRESH 16
> >> +
> >> +enum pch_can_mode {
> >> + PCH_CAN_ENABLE,
> >> + PCH_CAN_DISABLE,
> >> + PCH_CAN_ALL,
> >> + PCH_CAN_NONE,
> >> + PCH_CAN_STOP,
> >> + PCH_CAN_RUN,
> >> +};
> >> +
> >> +struct pch_can_regs {
> >> + u32 cont;
> >> + u32 stat;
> >> + u32 errc;
> >> + u32 bitt;
> >> + u32 intr;
> >> + u32 opt;
> >> + u32 brpe;
> >> + u32 reserve1;
> > 
> > VVVV
> >> + u32 if1_creq;
> >> + u32 if1_cmask;
> >> + u32 if1_mask1;
> >> + u32 if1_mask2;
> >> + u32 if1_id1;
> >> + u32 if1_id2;
> >> + u32 if1_mcont;
> >> + u32 if1_dataa1;
> >> + u32 if1_dataa2;
> >> + u32 if1_datab1;
> >> + u32 if1_datab2;
> > ^^^^
> > 
> > these registers and....
> > 
> >> + u32 reserve2;
> >> + u32 reserve3[12];
> > 
> > ...and these
> > 
> > VVVV
> >> + u32 if2_creq;
> >> + u32 if2_cmask;
> >> + u32 if2_mask1;
> >> + u32 if2_mask2;
> >> + u32 if2_id1;
> >> + u32 if2_id2;
> >> + u32 if2_mcont;
> >> + u32 if2_dataa1;
> >> + u32 if2_dataa2;
> >> + u32 if2_datab1;
> >> + u32 if2_datab2;
> > 
> > ^^^^
> > 
> > ...are identical. I suggest to make a struct defining a complete
> > "Message Interface Register Set". If you include the correct number of
> > reserved bytes in the struct, you can have an array of two of these
> > structs in the struct pch_can_regs.
> 
> Yep, that would be nice. Using it consistently would also allow
> duplicated code to be removed efficiently. I will name it "struct pch_can_if"
> for later reference.

I will modify like above.
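
One possible shape for that change, as a userspace sketch with uint32_t
in place of u32. The name "struct pch_can_if" comes from the review
above; the member names, the "pch_can_regs_sketch" name and the merged
reserved arrays are assumptions. The offsets checked below are derived
only from the register struct quoted in this thread.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One "Message Interface Register Set": 11 registers plus the 13
 * reserved words that follow IF1 in the quoted layout, so that two
 * sets can sit back to back in an array. */
struct pch_can_if {
	uint32_t creq;
	uint32_t cmask;
	uint32_t mask1;
	uint32_t mask2;
	uint32_t id1;
	uint32_t id2;
	uint32_t mcont;
	uint32_t dataa1;
	uint32_t dataa2;
	uint32_t datab1;
	uint32_t datab2;
	uint32_t reserved[13];
};

struct pch_can_regs_sketch {
	uint32_t cont;
	uint32_t stat;
	uint32_t errc;
	uint32_t bitt;
	uint32_t intr;
	uint32_t opt;
	uint32_t brpe;
	uint32_t reserve1;
	struct pch_can_if ifregs[2];	/* [0] is IF1, [1] is IF2 */
	uint32_t reserve2[8];		/* 21 words after IF2, 13 already in the struct */
	uint32_t treq1;
	uint32_t treq2;
	uint32_t reserve3[61];		/* was reserve6[2], reserve7[56], reserve8[3] */
	uint32_t srst;
};

/* The sketch must reproduce the offsets of the quoted struct. */
static_assert(sizeof(struct pch_can_if) == 24 * sizeof(uint32_t),
	      "interface set is 24 words");
static_assert(offsetof(struct pch_can_regs_sketch, ifregs) == 0x20,
	      "IF1 starts after the 8 leading registers");
static_assert(offsetof(struct pch_can_regs_sketch, treq1) == 0x100,
	      "treq1 offset unchanged");
static_assert(offsetof(struct pch_can_regs_sketch, srst) == 0x1fc,
	      "srst offset unchanged");
```

Helpers can then take a "struct pch_can_if *" (for example
&regs->ifregs[0] or &regs->ifregs[1]) instead of existing once per
interface.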

> 
> > 
> >> + u32 reserve4;
> >> + u32 reserve5[20];
> >> + u32 treq1;
> >> + u32 treq2;
> >> + u32 reserve6[2];
> >> + u32 reserve7[56];
> >> + u32 reserve8[3];
> 
> Why not just one reserveX ?

I will modify to a reserveX.

> 
> >> + u32 srst;
> >> +};
> >> +
> >> +struct pch_can_priv {
> >> + struct can_priv can;
> >> + struct pci_dev *dev;
> >> + unsigned int tx_enable[MAX_MSG_OBJ];
> >> + unsigned int rx_enable[MAX_MSG_OBJ];
> >> + unsigned int rx_link[MAX_MSG_OBJ];
> >> + unsigned int int_enables;
> >> + unsigned int int_stat;
> >> + struct net_device *ndev;
> >> + spinlock_t msgif_reg_lock; /* Message Interface Registers Access Lock*/
> >                                                                             ^^^
> > please add a whitespace

I will modify.


> > 
> > IMHO the function name is misleading; if I understand the code
> > correctly, this function triggers the transmission of the message.
> > After this it checks for busy, but 
> > 
> >> +static void pch_can_check_if_busy(u32 __iomem *creq_addr, u32 num)
> >                                      ^^^^
> > 
> > that should probably be a void
> 
> With separate structs for if1 and i2, a pointer to the relevant "struct
> pch_can_if" could be passed instead.

I will modify

> >> +
> >> + if (set == 1) {
> >> + /* Setting the MsgVal and RxIE bits */
> >> + pch_can_bit_set(&priv->regs->if1_mcont, CAN_IF_MCONT_RXIE);
> >> + pch_can_bit_set(&priv->regs->if1_id2, CAN_ID_MSGVAL);
> >> +
> >> + } else if (set == 0) {
> >> + /* Resetting the MsgVal and RxIE bits */
> >> + pch_can_bit_clear(&priv->regs->if1_mcont, CAN_IF_MCONT_RXIE);
> >> + pch_can_bit_clear(&priv->regs->if1_id2, CAN_ID_MSGVAL);
> >> + }
> 
> Why not just?
> 
> if (set)
> else

I will modify.


> >> +static void pch_can_set_tx_enable(struct pch_can_priv *priv, u32 buff_num,
> >> + u32 set)
> >> +{
> >> + unsigned long flags;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + /* Reading the Msg buffer from Message RAM to Interface2 registers. */
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >> +
> >> + /* Setting the IF2CMASK register for accessing the
> >> + MsgVal and TxIE bits */
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >> + &priv->regs->if2_cmask);
> >> +
> >> + if (set == 1) {
> >> + /* Setting the MsgVal and TxIE bits */
> >> + pch_can_bit_set(&priv->regs->if2_mcont, CAN_IF_MCONT_TXIE);
> >> + pch_can_bit_set(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >> + } else if (set == 0) {
> >> + /* Resetting the MsgVal and TxIE bits. */
> >> + pch_can_bit_clear(&priv->regs->if2_mcont, CAN_IF_MCONT_TXIE);
> >> + pch_can_bit_clear(&priv->regs->if2_id2, CAN_ID_MSGVAL);
> >> + }
> >> +
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> +}
> 
> That function is almost identical to pch_can_set_rx_enable. Just if2 is
> used instead of if1 and CAN_IF_MCONT_TXIE instead of CAN_IF_MCONT_RXIE.
> With separate "struct  pch_can_if" for if1 and if2, it could be handled
> by a common function.

I will modify.

> 
> >> +static void pch_can_tx_enable_all(struct pch_can_priv *priv)
> >> +{
> >> + int i;
> >> +
> >> + /* Traversing to obtain the object configured as transmit object. */
> >> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >> + pch_can_set_tx_enable(priv, i, 1);
> >> +}
> >> +
> >> +static void pch_can_tx_disable_all(struct pch_can_priv *priv)
> >> +{
> >> + int i;
> >> +
> >> + /* Traversing to obtain the object configured as transmit object. */
> >> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++)
> >> + pch_can_set_tx_enable(priv, i, 0);
> >> +}
> 
> I think there is no need for separate functions for enable and disable.
> Just pass "enable" 0 or 1 like you do with "set" above.

I will modify

> 
> >> +static int pch_can_int_pending(struct pch_can_priv *priv)
> >           ^^^
> > 
> > make it u32 as it returns a register value, or a u16 as you only use
> > the 16 lower bits.

I will modify.

> > 
> >> +{
> >> + return ioread32(&priv->regs->intr) & 0xffff;
> >> +}
> >> +
> >> +static void pch_can_clear_buffers(struct pch_can_priv *priv)
> >> +{
> >> + int i; /* Msg Obj ID (1~32) */
> >> +
> >> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> > 
> > IMHO the readability would be improved if you define something like
> > PCH_RX_OBJ_START and PCH_RX_OBJ_END.

I will modify.


> > 
> >> + iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if1_cmask);
> >> + iowrite32(0xffff, &priv->regs->if1_mask1);
> >> + iowrite32(0xffff, &priv->regs->if1_mask2);
> >> + iowrite32(0x0, &priv->regs->if1_id1);
> >> + iowrite32(0x0, &priv->regs->if1_id2);
> >> + iowrite32(0x0, &priv->regs->if1_mcont);
> >> + iowrite32(0x0, &priv->regs->if1_dataa1);
> >> + iowrite32(0x0, &priv->regs->if1_dataa2);
> >> + iowrite32(0x0, &priv->regs->if1_datab1);
> >> + iowrite32(0x0, &priv->regs->if1_datab2);
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK |
> >> +   CAN_CMASK_ARB | CAN_CMASK_CTRL,
> >> +   &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, i);
> >> + }
> >> +
> >> + for (i = PCH_RX_OBJ_NUM + 1; i <= PCH_OBJ_NUM; i++) {
> >                  ^^^^^^^^^^^^^^^^^^
> > ditto for TX objects
> > 
> >> + iowrite32(CAN_CMASK_RX_TX_SET, &priv->regs->if2_cmask);
> >> + iowrite32(0xffff, &priv->regs->if2_mask1);
> >> + iowrite32(0xffff, &priv->regs->if2_mask2);
> >> + iowrite32(0x0, &priv->regs->if2_id1);
> >> + iowrite32(0x0, &priv->regs->if2_id2);
> >> + iowrite32(0x0, &priv->regs->if2_mcont);
> >> + iowrite32(0x0, &priv->regs->if2_dataa1);
> >> + iowrite32(0x0, &priv->regs->if2_dataa2);
> >> + iowrite32(0x0, &priv->regs->if2_datab1);
> >> + iowrite32(0x0, &priv->regs->if2_datab2);
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_MASK | CAN_CMASK_ARB |
> >> +   CAN_CMASK_CTRL, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, i);
> 
> This is almost the same code as above, just if2 instead of if1. With
> separate "struct  pch_can_if" for if1 and i2, it could be handled by a
> common function.

I will modify.

> 
> >> + }
> >> +}
> >> +
> >> +static void pch_can_config_rx_tx_buffers(struct pch_can_priv *priv)
> >> +{
> >> + int i;
> >> + unsigned long flags;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> +
> >> + for (i = 1; i <= PCH_RX_OBJ_NUM; i++) {
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, i);
> > 
> > If I understand the code correctly, the above function triggers a
> > transfer. Why do you first trigger a transfer, then set the message contents....

For read-modify-write.


> >> + }
> >> +
> >> + if (status & PCH_LEC_ALL) {
> >> + priv->can.can_stats.bus_error++;
> >> + stats->rx_errors++;
> >> + switch (status & PCH_LEC_ALL) {
> > 
> > I suggest to convert to a if-bit-set because there might be more than
> > one bit set.
> 
> Marc, what do you mean here? It's an enumeration. Maybe the following
> code is clearer:
> 
> lec = status & PCH_LEC_ALL;
> if (lec > 0) {
> switch (lec) {

No.
LEC is not enum.


> 
> >> + case PCH_STUF_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_STUFF;
> >> + break;
> >> + case PCH_FORM_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_FORM;
> >> + break;
> >> + case PCH_ACK_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_LOC_ACK |
> >> +        CAN_ERR_PROT_LOC_ACK_DEL;
> 
> Could you check what type of bus error that is? Usually it's an ack
> lost error.

I will modify.
BTW, I can see ti_hecc also has the same code as above.

> 
> >> + break;
> >> + case PCH_BIT1_ERR:
> >> + case PCH_BIT0_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_BIT;
> >> + break;
> >> + case PCH_CRC_ERR:
> >> + cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
> >> +        CAN_ERR_PROT_LOC_CRC_DEL;
> >> + break;
> >> + default:
> >> + iowrite32(status | PCH_LEC_ALL, &priv->regs->stat);
> >> + break;
> >> + }
> >> +
> >> + }
> 
> Also, could you please add the TEC and REC:
> 
> cf->data[6] = ioread32(&priv->regs->errc) & CAN_TEC;
> cf->data[7] = (ioread32(&priv->regs->errc) & CAN_REC) >> 8;

I will add.
But I couldn't find 

> >> +
> >> +static int pch_can_rx_msg_lost(struct net_device *ndev, int obj_id)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + struct net_device_stats *stats = &(priv->ndev->stats);
> >> + struct sk_buff *skb;
> >> + struct can_frame *cf;
> >> +
> >> + dev_err(&priv->ndev->dev, "Msg Obj is overwritten.\n");
> 
> Please use dev_dbg or even remove the line above.

I will modify.


> 
> >> + pch_can_bit_clear(&priv->regs->if1_mcont,
> >> +   CAN_IF_MCONT_MSGLOST);
> >> + iowrite32(CAN_CMASK_RDWR | CAN_CMASK_CTRL,
> >> +   &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, obj_id);
> 
> I think the if busy checks could be improved. Why do you need to wait here?

Sorry, I can't understand.
This is for clearing the MSGLOST bit.

> 
> >> +
> >> + skb = alloc_can_err_skb(ndev, &cf);
> >> + if (!skb)
> >> + return -ENOMEM;
> >> +
> >> + priv->can.can_stats.error_passive++;
> >> + priv->can.state = CAN_STATE_ERROR_PASSIVE;
> 
> Please remove the above two bogus lines.

I will remove.


> >> +
> >> + can_get_echo_skb(ndev, int_stat - PCH_RX_OBJ_NUM - 1);
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + iowrite32(CAN_CMASK_RX_TX_GET | CAN_CMASK_CLRINTPND,
> >> +   &priv->regs->if2_cmask);
> >> + dlc = ioread32(&priv->regs->if2_mcont) & CAN_IF_MCONT_DLC;
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, int_stat);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + if (dlc > 8)
> >> + dlc = 8;
> > 
> > use get_can_dlc

I will use
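
For illustration, a small user-space sketch of the clamp being asked for here: a classic CAN frame carries at most 8 data bytes, so a raw DLC field larger than 8 is limited to 8. model_get_can_dlc() is a made-up stand-in written for this note, not the kernel's get_can_dlc() itself.

```c
#include <stdint.h>

/*
 * Hypothetical stand-in for get_can_dlc(): limit a raw DLC value read
 * from the controller to the 8 data bytes a classic CAN frame can carry.
 */
static inline uint8_t model_get_can_dlc(uint32_t raw)
{
	return raw > 8 ? 8 : (uint8_t)raw;
}
```

Using one helper keeps the "if (dlc > 8) dlc = 8;" pattern out of every RX and TX path.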

> > 
> >> + stats->tx_bytes += dlc;
> >> + stats->tx_packets++;
> >> +}
> >> +
> >> +static int pch_can_rx_poll(struct napi_struct *napi, int quota)
> >> +{
> >> + struct net_device *ndev = napi->dev;
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + u32 int_stat;
> >> + int rcv_pkts = 0;
> >> + u32 reg_stat;
> >> + unsigned long flags;
> >> +
> >> + int_stat = pch_can_int_pending(priv);
> >> + if (!int_stat)
> >> + goto END;
> 
> Labels should be lowercase as well.

I will modify

> 
> >> +
> >> + if ((int_stat == CAN_STATUS_INT) && (quota > 0)) {
> >> + reg_stat = ioread32(&priv->regs->stat);
> >> + if (reg_stat & (PCH_BUS_OFF | PCH_LEC_ALL)) {
> >> + if ((reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL) {
> >> + pch_can_error(ndev, reg_stat);
> >> + quota--;
> >> + }
> >> + }
> >> +
> >> + if (reg_stat & PCH_TX_OK) {
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq,
> >> +        ioread32(&priv->regs->intr));
> >                                                ^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > 
> > Isn't this "int_stat"? Might it be possible that regs->intr changes
> > between the pch_can_int_pending and here?
> > 
> > What should this transfer do?
> > 
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + pch_can_bit_clear(&priv->regs->stat, PCH_TX_OK);
> >> + }
> >> +
> >> + if (reg_stat & PCH_RX_OK)
> >> + pch_can_bit_clear(&priv->regs->stat, PCH_RX_OK);
> >> +
> >> + int_stat = pch_can_int_pending(priv);
> >> + }
> >> +
> >> + if (quota == 0)
> >> + goto END;
> >> +
> >> + if ((int_stat >= 1) && (int_stat <= PCH_RX_OBJ_NUM)) {
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + rcv_pkts += pch_can_rx_normal(ndev, int_stat, quota);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + quota -= rcv_pkts;
> >> + if (rcv_pkts < 0)
> > 
> > how can this happen?

I will modify to quota.


> > 
> >> + goto END;
> >> + } else if ((int_stat > PCH_RX_OBJ_NUM) && (int_stat <= PCH_OBJ_NUM)) {
> >> + /* Handle transmission interrupt */
> >> + pch_can_tx_complete(ndev, int_stat);
> >> + }
> >> +
> >> +END:
> >> + napi_complete(napi);
> >> + pch_can_set_int_enables(priv, PCH_CAN_ALL);
> >> +
> >> + return rcv_pkts;
> >> +}
> >> +
> >> +static int pch_set_bittiming(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + const struct can_bittiming *bt = &priv->can.bittiming;
> >> + u32 canbit;
> >> + u32 bepe;
> >> +
> >> + /* Setting the CCE bit for accessing the Can Timing register. */
> >> + pch_can_bit_set(&priv->regs->cont, CAN_CTRL_CCE);
> >> +
> >> + canbit = (bt->brp - 1) & MSK_BITT_BRP;
> >> + canbit |= (bt->sjw - 1) << BIT_BITT_SJW;
> >> + canbit |= (bt->phase_seg1 + bt->prop_seg - 1) << BIT_BITT_TSEG1;
> >> + canbit |= (bt->phase_seg2 - 1) << BIT_BITT_TSEG2;
> >> + bepe = ((bt->brp - 1) & MSK_BRPE_BRPE) >> BIT_BRPE_BRPE;
> >> + iowrite32(canbit, &priv->regs->bitt);
> >> + iowrite32(bepe, &priv->regs->brpe);
> >> + pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_CCE);
> >> +
> >> + return 0;
> >> +}
> >> +
> >> +static void pch_can_start(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> +
> >> + if (priv->can.state != CAN_STATE_STOPPED)
> >> + pch_can_reset(priv);
> >> +
> >> + pch_set_bittiming(ndev);
> >> + pch_can_set_optmode(priv);
> >> +
> >> + pch_can_tx_enable_all(priv);
> >> + pch_can_rx_enable_all(priv);
> >> +
> >> + /* Setting the CAN to run mode. */
> >> + pch_can_set_run_mode(priv, PCH_CAN_RUN);
> >> +
> >> + priv->can.state = CAN_STATE_ERROR_ACTIVE;
> >> +
> >> + return;
> >> +}
> >> +
> >> +static int pch_can_do_set_mode(struct net_device *ndev, enum can_mode mode)
> >> +{
> >> + int ret = 0;
> >> +
> >> + switch (mode) {
> >> + case CAN_MODE_START:
> >> + pch_can_start(ndev);
> >> + netif_wake_queue(ndev);
> >> + break;
> >> + default:
> >> + ret = -EOPNOTSUPP;
> >> + break;
> >> + }
> >> +
> >> + return ret;
> >> +}
> >> +
> >> +static int pch_can_open(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + int retval;
> >> +
> >> + /* Regsitering the interrupt. */
> 
> Typo!

I will modify.

> 
> >> + retval = request_irq(priv->dev->irq, pch_can_interrupt, IRQF_SHARED,
> >> +      ndev->name, ndev);
> >> + if (retval) {
> >> + dev_err(&ndev->dev, "request_irq failed.\n");
> >> + goto req_irq_err;
> >> + }
> >> +
> >> + /* Open common can device */
> >> + retval = open_candev(ndev);
> >> + if (retval) {
> >> + dev_err(ndev->dev.parent, "open_candev() failed %d\n", retval);
> >> + goto err_open_candev;
> >> + }
> >> +
> >> + pch_can_init(priv);
> >> + pch_can_start(ndev);
> >> + napi_enable(&priv->napi);
> >> + netif_start_queue(ndev);
> >> +
> >> + return 0;
> >> +
> >> +err_open_candev:
> >> + free_irq(priv->dev->irq, ndev);
> >> +req_irq_err:
> >> + pch_can_release(priv);
> >> +
> >> + return retval;
> >> +}
> >> +
> >> +static int pch_close(struct net_device *ndev)
> >> +{
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> +
> >> + netif_stop_queue(ndev);
> >> + napi_disable(&priv->napi);
> >> + pch_can_release(priv);
> >> + free_irq(priv->dev->irq, ndev);
> >> + close_candev(ndev);
> >> + priv->can.state = CAN_STATE_STOPPED;
> >> + return 0;
> >> +}
> >> +
> >> +static netdev_tx_t pch_xmit(struct sk_buff *skb, struct net_device *ndev)
> >> +{
> >> + unsigned long flags;
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> + struct can_frame *cf = (struct can_frame *)skb->data;
> >> + int tx_buffer_avail = 0;
> > 
> > What I'm totally missing is the TX flow control. Your driver has to
> > ensure that the packets leave the controller in the order that they
> > come into the xmit function. Further, you have to stop your xmit queue
> > if you're out of tx objects and re-enable it when you have an object free.
> > 
> > Use netif_stop_queue() and netif_wake_queue() for this.

I will add flow control.
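
As a rough illustration of the flow control requested above, here is a user-space model of the bookkeeping, not driver code. The struct, the model_* functions and MODEL_TX_OBJ_NUM are invented for this sketch; in a real driver the queue_stopped flag would correspond to calling netif_stop_queue() and netif_wake_queue().

```c
#include <stdbool.h>

#define MODEL_TX_OBJ_NUM 4	/* assumed number of TX message objects */

struct model_tx_state {
	unsigned int tx_next;	/* frames handed to the hardware so far */
	unsigned int tx_done;	/* frames whose TX completion was seen */
	bool queue_stopped;	/* stands in for the netif queue state */
};

/*
 * xmit path: take TX objects strictly in order so frames leave in the
 * order they arrived. Returns the slot used, or -1 when no object is
 * free (a driver would return NETDEV_TX_BUSY in that case).
 */
static int model_xmit(struct model_tx_state *s)
{
	int slot;

	if (s->tx_next - s->tx_done >= MODEL_TX_OBJ_NUM)
		return -1;

	slot = s->tx_next % MODEL_TX_OBJ_NUM;
	s->tx_next++;

	/* Last free object just used: stop the queue (netif_stop_queue()). */
	if (s->tx_next - s->tx_done == MODEL_TX_OBJ_NUM)
		s->queue_stopped = true;

	return slot;
}

/* TX-complete path: one object is free again, so wake the queue. */
static void model_tx_complete(struct model_tx_state *s)
{
	s->tx_done++;
	if (s->queue_stopped)
		s->queue_stopped = false;	/* netif_wake_queue() */
}
```

The model leaves out locking and the hardware's own transmit priority between message objects, both of which a real implementation has to consider.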

> >> + }
> >> + if (cf->can_dlc > 2) {
> >> + u32 data1 = *((u16 *)&cf->data[2]);
> >> + iowrite32(data1, &priv->regs->if2_dataa2);
> >> + }
> >> + if (cf->can_dlc > 4) {
> >> + u32 data1 = *((u16 *)&cf->data[4]);
> >> + iowrite32(data1, &priv->regs->if2_datab1);
> >> + }
> >> + if (cf->can_dlc > 6) {
> >> + u32 data1 = *((u16 *)&cf->data[6]);
> >> + iowrite32(data1, &priv->regs->if2_datab2);
> >> + }
> 
> Could be handled by a loop.

Pending.
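
One possible shape for such a loop, as a user-space sketch only: the payload bytes are packed low byte first into four 16-bit values, one per data register (dataa1, dataa2, datab1, datab2). model_pack_data() and the regs[] array are assumptions made for this example; a driver would write each value with iowrite32() instead of storing it in an array.

```c
#include <stdint.h>

/*
 * Pack up to 8 payload bytes into four register values, two bytes per
 * register, low byte first. Unlike the u16 pointer casts in the patch,
 * this never reads bytes beyond dlc and does not depend on host
 * endianness or alignment.
 */
static void model_pack_data(const uint8_t *data, uint8_t dlc, uint32_t regs[4])
{
	unsigned int i;

	for (i = 0; i < 4; i++)
		regs[i] = 0;

	if (dlc > 8)
		dlc = 8;

	for (i = 0; i < dlc; i++)
		regs[i / 2] |= (uint32_t)data[i] << (8 * (i % 2));
}
```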


> 
> >> + can_put_echo_skb(skb, ndev, tx_buffer_avail - PCH_RX_OBJ_NUM - 1);
> >> +
> >> + /* Set the size of the data. */
> >> + iowrite32(cf->can_dlc, &priv->regs->if2_mcont);
> >> +
> >> + /* Update if2_mcont */
> >> + pch_can_bit_set(&priv->regs->if2_mcont,
> >> + CAN_IF_MCONT_NEWDAT | CAN_IF_MCONT_TXRQXT |
> >> + CAN_IF_MCONT_TXIE);
> > 
> > please first prepare your value, then write to hardware.
> > 
> >> +
> >> + if (tx_buffer_avail == PCH_RX_OBJ_NUM) /* If points tail of FIFO  */
> >> + pch_can_bit_set(&priv->regs->if2_mcont, CAN_IF_MCONT_EOB);
> > 
> > ditto
> > 
> > Is EOB relevant for TX objects?
> > 
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, tx_buffer_avail);
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> +
> >> + return NETDEV_TX_OK;
> >> +}
> >> +
> >> +static const struct net_device_ops pch_can_netdev_ops = {
> >> + .ndo_open = pch_can_open,
> >> + .ndo_stop = pch_close,
> >> + .ndo_start_xmit = pch_xmit,
> >> +};
> >> +
> >> +static void __devexit pch_can_remove(struct pci_dev *pdev)
> >> +{
> >> + struct net_device *ndev = pci_get_drvdata(pdev);
> >> + struct pch_can_priv *priv = netdev_priv(ndev);
> >> +
> >> + unregister_candev(priv->ndev);
> >> + pci_iounmap(pdev, priv->regs);
> >> + if (priv->use_msi)
> >> + pci_disable_msi(priv->dev);
> >> + pci_release_regions(pdev);
> >> + pci_disable_device(pdev);
> >> + pci_set_drvdata(pdev, NULL);
> >> + free_candev(priv->ndev);
> >> +}
> >> +
> >> +#ifdef CONFIG_PM
> >> +static void pch_can_set_int_custom(struct pch_can_priv *priv)
> >> +{
> >> + /* Clearing the IE, SIE and EIE bits of Can control register. */
> >> + pch_can_bit_clear(&priv->regs->cont, CAN_CTRL_IE_SIE_EIE);
> >> +
> >> + /* Appropriately setting them. */
> >> + pch_can_bit_set(&priv->regs->cont,
> >> + ((priv->int_enables & MSK_CTRL_IE_SIE_EIE) << 1));
> >> +}
> >> +
> >> +/* This function retrieves interrupt enabled for the CAN device. */
> >> +static u32 pch_can_get_int_enables(struct pch_can_priv *priv)
> >> +{
> >> + /* Obtaining the status of IE, SIE and EIE interrupt bits. */
> >> + return (ioread32(&priv->regs->cont) & CAN_CTRL_IE_SIE_EIE) >> 1;
> >> +}
> >> +
> >> +static u32 pch_can_get_rx_enable(struct pch_can_priv *priv, u32 buff_num)
> >> +{
> >> + unsigned long flags;
> >> + u32 enable;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if1_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if1_creq, buff_num);
> >> +
> >> + if (((ioread32(&priv->regs->if1_id2)) & CAN_ID_MSGVAL) &&
> >> + ((ioread32(&priv->regs->if1_mcont)) &
> >> + CAN_IF_MCONT_RXIE))
> >> + enable = 1;
> >> + else
> >> + enable = 0;
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> + return enable;
> >> +}
> >> +
> >> +static u32 pch_can_get_tx_enable(struct pch_can_priv *priv, u32 buff_num)
> >> +{
> >> + unsigned long flags;
> >> + u32 enable;
> >> +
> >> + spin_lock_irqsave(&priv->msgif_reg_lock, flags);
> >> +
> >> + iowrite32(CAN_CMASK_RX_TX_GET, &priv->regs->if2_cmask);
> >> + pch_can_check_if_busy(&priv->regs->if2_creq, buff_num);
> >> + if (((ioread32(&priv->regs->if2_id2)) & CAN_ID_MSGVAL) &&
> >> + ((ioread32(&priv->regs->if2_mcont)) &
> >> + CAN_IF_MCONT_TXIE)) {
> >> + enable = 1;
> >> + } else {
> >> + enable = 0;
> >> + }
> >> + spin_unlock_irqrestore(&priv->msgif_reg_lock, flags);
> >> +
> >> + return enable;
> >> +}
> 
> The above two functions could be handled by a common one passing "struct
> pch_can_if". See similar comments above.

I will modify these in the same way.
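
A minimal user-space sketch of what the common helper could look like, assuming a per-interface register group is passed in. struct model_can_if, model_obj_enabled() and the MODEL_* bit values are placeholders invented for this example; the real helper would also take the spinlock, write cmask and wait on creq as the two functions above do.

```c
#include <stdbool.h>
#include <stdint.h>

/* Placeholder bit positions, for illustration only. */
#define MODEL_ID_MSGVAL		(1u << 15)
#define MODEL_MCONT_RXIE	(1u << 10)
#define MODEL_MCONT_TXIE	(1u << 11)

/* Hypothetical snapshot of one message interface (if1 or if2). */
struct model_can_if {
	uint32_t id2;
	uint32_t mcont;
};

/*
 * One helper for both directions: the caller chooses the interface and
 * the interrupt-enable bit (RXIE or TXIE) to test.
 */
static bool model_obj_enabled(const struct model_can_if *ifregs, uint32_t ie_bit)
{
	return (ifregs->id2 & MODEL_ID_MSGVAL) && (ifregs->mcont & ie_bit);
}
```

The RX and TX variants then differ only in their arguments.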


> As the driver has already been merged, please provide incremental
> patches against the net-2.6 branch. Also, it would be nice if you could
> check in-order transmission and reception, e.g., with the can-utils
> program canfdtest:
> 
> http://svn.berlios.de/wsvn/socketcan/trunk/can-utils/canfdtest.c
> 

Thank you for your information.


------
Thanks,

Tomoya MORINAGA
OKI SEMICONDUCTOR CO., LTD.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Cypher Wu @ 2010-11-12 11:06 UTC (permalink / raw)
  To: Américo Wang; +Cc: Yong Zhang, Eric Dumazet, linux-kernel, netdev
In-Reply-To: <20101112091818.GB5949@cr0.nay.redhat.com>

2010/11/12 Américo Wang <xiyou.wangcong@gmail.com>:
> On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>>On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
>>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>>>On Friday, 12 November 2010 at 15:13 +0800, Américo Wang wrote:
>>>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>>> >> On Thursday, 11 November 2010 at 21:49 +0800, Cypher Wu wrote:
>>>>> >>
>>>>> >> Hi
>>>>> >>
>>>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>>>> >>
>>>>> >>
>>>>> >>> I'm using TILEPro, and its kernel rwlock is a little different from
>>>>> >>> other platforms. It gives priority to the write lock: once a write lock
>>>>> >>> is attempted, it blocks subsequent read locks even if the read lock is
>>>>> >>> held by others. The code can be read in Linux kernel 2.6.36 in
>>>>> >>> arch/tile/lib/spinlock_32.c.
>>>>> >>
>>>>> >> This seems like a bug to me.
>>>>> >>
>>>>> >> read_lock() can be nested. We used such a scheme in the past in iptables
>>>>> >> (it can re-enter itself),
>>>>> >> and we used a spinlock() instead, but with many discussions with lkml
>>>>> >> and Linus himself, if I remember well.
>>>>> >>
>>>>> >Whether read_lock() can be nested does not seem to be the problem, since
>>>>> >rwlock doesn't have an 'owner'. The question is whether we should give
>>>>> >write_lock() priority over read_lock(), since a lot of
>>>>> >read_lock()s will otherwise starve write_lock().
>>>>> >We should work out a well-defined behavior so that all the
>>>>> >platform-dependent raw_rwlock implementations are designed under that principle.
>>>>>
>>>>
>>>>AFAIK, Lockdep allows read_lock() to be nested.
>>>>
>>>>> It is a known weakness of rwlock, it is designed like that. :)
>>>>>
>>>>
>>>>Agreed.
>>>>
>>>
>>> Just for record, both Tile and X86 implement rwlock with a write-bias,
>>> this somewhat reduces the write-starvation problem.
>>
>>Are you sure (on x86)?
>>
>>It seems that we never implemented a writer-biased rwlock.
>>
>
> Try
>
> % grep RW_LOCK_BIAS -nr arch/x86
>
> *And* read the code to see how it works. :)
>
> Note, on Tile, it uses a little different algorithm.
>

It seems that rwlock on x86 and tile have different behavior. x86 uses
RW_LOCK_BIAS: read_lock() tests whether the lock is 0, and if so the
reader has to spin, otherwise it decrements the lock; write_lock()
first checks whether the lock is RW_LOCK_BIAS, and if so sets the lock
to 0 and continues, otherwise it spins.
I'm not very familiar with the x86 architecture, but the code seems to
work that way.
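
The counter scheme described above can be modeled in user space roughly as follows. This is a sketch built on C11 atomics with try-lock semantics only; the model_* names and MODEL_RW_LOCK_BIAS are made up for the example, and it is not the arch/x86 implementation. It only models the counter arithmetic and says nothing about fairness between readers and writers.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MODEL_RW_LOCK_BIAS 0x01000000	/* value of an unlocked lock in this model */

typedef struct {
	atomic_int lock;
} model_rwlock_t;

static void model_rwlock_init(model_rwlock_t *rw)
{
	atomic_init(&rw->lock, MODEL_RW_LOCK_BIAS);
}

/* A reader takes one unit; a negative result means a writer holds the lock. */
static bool model_read_trylock(model_rwlock_t *rw)
{
	if (atomic_fetch_sub(&rw->lock, 1) - 1 >= 0)
		return true;
	atomic_fetch_add(&rw->lock, 1);	/* undo, caller would spin and retry */
	return false;
}

static void model_read_unlock(model_rwlock_t *rw)
{
	atomic_fetch_add(&rw->lock, 1);
}

/* A writer takes the whole bias; that only works if nobody holds the lock. */
static bool model_write_trylock(model_rwlock_t *rw)
{
	if (atomic_fetch_sub(&rw->lock, MODEL_RW_LOCK_BIAS) == MODEL_RW_LOCK_BIAS)
		return true;
	atomic_fetch_add(&rw->lock, MODEL_RW_LOCK_BIAS);	/* undo */
	return false;
}

static void model_write_unlock(model_rwlock_t *rw)
{
	atomic_fetch_add(&rw->lock, MODEL_RW_LOCK_BIAS);
}
```

In this model a second read lock succeeds while the first is still held, and a writer only gets in once the counter is back at the full bias value.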

^ permalink raw reply

* [PATCH] r8169: fix checksum broken
From: Shan Wei @ 2010-11-12 10:15 UTC (permalink / raw)
  To: romieu; +Cc: netdev@vger.kernel.org, David Miller

If r8169 receives packets with an invalid sctp/igmp (not tcp, udp) checksum, r8169 sets skb->ip_summed
to CHECKSUM_UNNECESSARY. This causes the upper protocol not to check the checksum field.

I am not familiar with the r8169 driver. I tried to guess the meaning of RxProtoIP and IPFail.
RxProtoIP stands for a received IPv4 packet whose upper protocol is not tcp or udp.
!(opts1 & IPFail) being true means that the checksum in the IPv4 header was checked and found correct.

If that's right, I think we should not set ip_summed to CHECKSUM_UNNECESSARY for my sctp packets
with an invalid checksum.

If it's not right, please tell me.

Signed-off-by: Shan Wei <shanwei@cn.fujitsu.com>
---
 drivers/net/r8169.c |    3 +--
 1 files changed, 1 insertions(+), 2 deletions(-)

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index d88ce9f..2cf6c2e 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -4440,8 +4440,7 @@ static inline void rtl8169_rx_csum(struct sk_buff *skb, u32 opts1)
 	u32 status = opts1 & RxProtoMask;

 	if (((status == RxProtoTCP) && !(opts1 & TCPFail)) ||
-	    ((status == RxProtoUDP) && !(opts1 & UDPFail)) ||
-	    ((status == RxProtoIP) && !(opts1 & IPFail)))
+	    ((status == RxProtoUDP) && !(opts1 & UDPFail)))
 		skb->ip_summed = CHECKSUM_UNNECESSARY;
 	else
 		skb_checksum_none_assert(skb);
-- 
1.6.3.3

^ permalink raw reply related

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  9:33 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Américo Wang, Cypher Wu, linux-kernel, netdev
In-Reply-To: <1289553759.3185.1.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 10:22:39AM +0100, Eric Dumazet wrote:
>On Friday, 12 November 2010 at 16:19 +0800, Américo Wang wrote:
>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>
>> >An RCU conversion is far more complex.
>> >
>> 
>> Yup.
>
>
>Well, actually this is easy in this case.
>
>I'll post a patch to do this RCU conversion.
>

Cool! Please keep me in Cc.

Some conversions from rwlock to RCU are indeed complex;
I don't know about this case.

Anyway, patch please. :)

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-12  9:22 UTC (permalink / raw)
  To: Américo Wang; +Cc: Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112081945.GA5949@cr0.nay.redhat.com>

On Friday, 12 November 2010 at 16:19 +0800, Américo Wang wrote:
> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:

> >An RCU conversion is far more complex.
> >
> 
> Yup.


Well, actually this is easy in this case.

I'll post a patch to do this RCU conversion.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  9:18 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Américo Wang, Eric Dumazet, Cypher Wu, linux-kernel, netdev
In-Reply-To: <AANLkTik-KDvc2fgH91vBHGT6vqxbZrv=9DoQknujPWy2@mail.gmail.com>

On Fri, Nov 12, 2010 at 05:09:45PM +0800, Yong Zhang wrote:
>On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
>> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>>On Friday, 12 November 2010 at 15:13 +0800, Américo Wang wrote:
>>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>>> >> On Thursday, 11 November 2010 at 21:49 +0800, Cypher Wu wrote:
>>>> >>
>>>> >> Hi
>>>> >>
>>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>>> >>
>>>> >>
>>>> >>> I'm using TILEPro, and its kernel rwlock is a little different from
>>>> >>> other platforms. It gives priority to the write lock: once a write lock
>>>> >>> is attempted, it blocks subsequent read locks even if the read lock is
>>>> >>> held by others. The code can be read in Linux kernel 2.6.36 in
>>>> >>> arch/tile/lib/spinlock_32.c.
>>>> >>
>>>> >> This seems like a bug to me.
>>>> >>
>>>> >> read_lock() can be nested. We used such a scheme in the past in iptables
>>>> >> (it can re-enter itself),
>>>> >> and we used a spinlock() instead, but with many discussions with lkml
>>>> >> and Linus himself, if I remember well.
>>>> >>
>>>> >Whether read_lock() can be nested does not seem to be the problem, since
>>>> >rwlock doesn't have an 'owner'. The question is whether we should give
>>>> >write_lock() priority over read_lock(), since a lot of
>>>> >read_lock()s will otherwise starve write_lock().
>>>> >We should work out a well-defined behavior so that all the
>>>> >platform-dependent raw_rwlock implementations are designed under that principle.
>>>>
>>>
>>>AFAIK, Lockdep allows read_lock() to be nested.
>>>
>>>> It is a known weakness of rwlock, it is designed like that. :)
>>>>
>>>
>>>Agreed.
>>>
>>
>> Just for record, both Tile and X86 implement rwlock with a write-bias,
>> this somewhat reduces the write-starvation problem.
>
>Are you sure (on x86)?
>
>It seems that we never implemented a writer-biased rwlock.
>

Try

% grep RW_LOCK_BIAS -nr arch/x86

*And* read the code to see how it works. :)

Note, on Tile, it uses a little different algorithm.

^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Yong Zhang @ 2010-11-12  9:09 UTC (permalink / raw)
  To: Américo Wang; +Cc: Eric Dumazet, Cypher Wu, linux-kernel, netdev
In-Reply-To: <20101112081945.GA5949@cr0.nay.redhat.com>

On Fri, Nov 12, 2010 at 4:19 PM, Américo Wang <xiyou.wangcong@gmail.com> wrote:
> On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>>On Friday, 12 November 2010 at 15:13 +0800, Américo Wang wrote:
>>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>>> >> On Thursday, 11 November 2010 at 21:49 +0800, Cypher Wu wrote:
>>> >>
>>> >> Hi
>>> >>
>>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>>> >>
>>> >>
>>> >>> I'm using TILEPro, and its kernel rwlock is a little different from
>>> >>> other platforms. It gives priority to the write lock: once a write lock
>>> >>> is attempted, it blocks subsequent read locks even if the read lock is
>>> >>> held by others. The code can be read in Linux kernel 2.6.36 in
>>> >>> arch/tile/lib/spinlock_32.c.
>>> >>
>>> >> This seems like a bug to me.
>>> >>
>>> >> read_lock() can be nested. We used such a scheme in the past in iptables
>>> >> (it can re-enter itself),
>>> >> and we used a spinlock() instead, but with many discussions with lkml
>>> >> and Linus himself, if I remember well.
>>> >>
>>> >Whether read_lock() can be nested does not seem to be the problem, since
>>> >rwlock doesn't have an 'owner'. The question is whether we should give
>>> >write_lock() priority over read_lock(), since a lot of
>>> >read_lock()s will otherwise starve write_lock().
>>> >We should work out a well-defined behavior so that all the
>>> >platform-dependent raw_rwlock implementations are designed under that principle.
>>>
>>
>>AFAIK, Lockdep allows read_lock() to be nested.
>>
>>> It is a known weakness of rwlock, it is designed like that. :)
>>>
>>
>>Agreed.
>>
>
> Just for record, both Tile and X86 implement rwlock with a write-bias,
> this somewhat reduces the write-starvation problem.

Are you sure (on x86)?

It seems that we never implemented a writer-biased rwlock.

Thanks,
Yong
>
>
>>> The solution is to use RCU or seqlock, but I don't think seqlock
>>> is suitable for the case you described. So, try RCU.
>>
>>In the IGMP case, it should be easy for the task owning a read_lock() to
>>pass a parameter to the called function saying 'I already own the
>>read_lock(), don't try to re-acquire it'
>>
>>An RCU conversion is far more complex.
>>
>
> Yup.

^ permalink raw reply

* Re: [PATCH net-next-2.6 v2] macvlan: lockless tx path
From: Patrick McHardy @ 2010-11-12  8:20 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, Ben Greear, Ben Hutchings
In-Reply-To: <1289459644.17691.1011.camel@edumazet-laptop>

On 11.11.2010 08:14, Eric Dumazet wrote:
> macvlan is a stacked device, like tunnels. We should use the lockless
> mechanism we are using in tunnels and loopback.
> 
> This patch completely removes locking in TX path.
> 
> tx stat counters are added into existing percpu stat structure, renamed
> from rx_stats to pcpu_stats.
> 
> Note : this reverts commit 2c11455321f37 (macvlan: add multiqueue
> capability)
> 
> Note : rx_errors converted to a 32bit counter, like tx_dropped, since
> they dont need 64bit range.
> 
> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
> Cc: Patrick McHardy <kaber@trash.net>
> Cc: Ben Greear <greearb@candelatech.com>
> Cc: Ben Hutchings <bhutchings@solarflare.com>
> ---
> V2: correct kerneldoc
>     u32 for tx_dropped and rx_errors

Looks good to me.

Acked-by: Patrick McHardy <kaber@trash.net>



^ permalink raw reply

* Re: Kernel rwlock design, Multicore and IGMP
From: Américo Wang @ 2010-11-12  8:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Américo Wang, Cypher Wu, linux-kernel, netdev
In-Reply-To: <1289546874.17691.1774.camel@edumazet-laptop>

On Fri, Nov 12, 2010 at 08:27:54AM +0100, Eric Dumazet wrote:
>On Friday, 12 November 2010 at 15:13 +0800, Américo Wang wrote:
>> On Fri, Nov 12, 2010 at 11:32:59AM +0800, Cypher Wu wrote:
>> >On Thu, Nov 11, 2010 at 11:23 PM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
>> >> On Thursday, 11 November 2010 at 21:49 +0800, Cypher Wu wrote:
>> >>
>> >> Hi
>> >>
>> >> CC netdev, since you ask questions about network stuff _and_ rwlock
>> >>
>> >>
>> >>> I'm using TILEPro, and its kernel rwlock is a little different from
>> >>> other platforms. It gives priority to the write lock: once a write lock
>> >>> is attempted, it blocks subsequent read locks even if the read lock is
>> >>> held by others. The code can be read in Linux kernel 2.6.36 in
>> >>> arch/tile/lib/spinlock_32.c.
>> >>
>> >> This seems like a bug to me.
>> >>
>> >> read_lock() can be nested. We used such a scheme in the past in iptables
>> >> (it can re-enter itself),
>> >> and we used a spinlock() instead, but with many discussions with lkml
>> >> and Linus himself, if I remember well.
>> >>
>> >Whether read_lock() can be nested does not seem to be the problem, since
>> >rwlock doesn't have an 'owner'. The question is whether we should give
>> >write_lock() priority over read_lock(), since a lot of
>> >read_lock()s will otherwise starve write_lock().
>> >We should work out a well-defined behavior so that all the
>> >platform-dependent raw_rwlock implementations are designed under that principle.
>> 
>
>AFAIK, Lockdep allows read_lock() to be nested.
>
>> It is a known weakness of rwlock, it is designed like that. :)
>> 
>
>Agreed.
>

Just for record, both Tile and X86 implement rwlock with a write-bias,
this somewhat reduces the write-starvation problem.


>> The solution is to use RCU or seqlock, but I don't think seqlock
>> is suitable for the case you described. So, try RCU.
>
>In the IGMP case, it should be easy for the task owning a read_lock() to
>pass a parameter to the called function saying 'I already own the
>read_lock(), don't try to re-acquire it'
>
>An RCU conversion is far more complex.
>

Yup.

^ permalink raw reply

* Re: [PATCH] netfilter: ipv6: fix overlap check for fragments
From: Patrick McHardy @ 2010-11-12  7:52 UTC (permalink / raw)
  To: Shan Wei
  Cc: nicolas.dichtel, David Miller, netdev@vger.kernel.org,
	netfilter-devel
In-Reply-To: <4CD3F0F8.5030205@cn.fujitsu.com>

On 05.11.2010 12:56, Shan Wei wrote:
> The type of FRAG6_CB(prev)->offset is int, skb->len is *unsigned* int,
> and offset is int.
> 
> Without this patch, type conversion occurred to this expression, when
> (FRAG6_CB(prev)->offset + prev->len) is less than offset.

Applied, thanks Shan.
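
For readers following along, the conversion problem described in the quoted changelog can be reproduced in user space. The sketch below uses made-up function names and plain int/unsigned int parameters in place of FRAG6_CB(prev)->offset, prev->len and offset; the comparison form is shown only as one way to avoid the wrap-around and has not been checked against the applied patch.

```c
#include <stdbool.h>

/*
 * int + unsigned int is evaluated as unsigned int, so when the sum is
 * smaller than offset the subtraction wraps to a huge positive value
 * instead of going negative, and "> 0" is true whenever the two differ.
 */
static bool overlap_by_subtraction(int prev_offset, unsigned int prev_len,
				   int offset)
{
	return (prev_offset + prev_len) - offset > 0;
}

/*
 * Comparing the two quantities directly gives the intended answer for
 * non-negative offsets.
 */
static bool overlap_by_comparison(int prev_offset, unsigned int prev_len,
				  int offset)
{
	return (prev_offset + prev_len) > (unsigned int)offset;
}
```

With a previous fragment covering bytes 0..99 and a new fragment at offset 200, the subtraction form reports an overlap while the comparison form does not.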

^ permalink raw reply
