Netdev List


* Re: Kernel rwlock design, Multicore and IGMP
From: Eric Dumazet @ 2010-11-15 11:31 UTC (permalink / raw)
  To: Cypher Wu; +Cc: Chris Metcalf, Américo Wang, linux-kernel, netdev
In-Reply-To: <AANLkTinzRm_5WW5HwxzXqZZD4255GDpR4yzELqz66Dfn@mail.gmail.com>

On Monday, 15 November 2010 at 19:18 +0800, Cypher Wu wrote:
> In that post I want to confirm another thing: if we join/leave on
> different cores that every call will start the timer for IGMP message
> using the same in_dev->mc_list, could that be optimized?
> 

Which timer exactly? Is it a real scalability problem?

I believe RTNL would be the blocking point actually...


* Re: [PATCH] netfilter: guard the size of the nf_ct_ext
From: Changli Gao @ 2010-11-15 11:35 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: David S. Miller, netfilter-devel, netdev
In-Reply-To: <4CE1150D.1080302@trash.net>

On Mon, Nov 15, 2010 at 7:10 PM, Patrick McHardy <kaber@trash.net> wrote:
> On 15.11.2010 07:15, Changli Gao wrote:
>> We'd better guard the size of the nf_ct_ext, as the nf_ct_ext.len is u8.
>> If the size is bigger than 255, a warning will be printed.
>
> Why are you checking this in basically every possible spot?
> Just checking once during registration (assuming the worst
> case of a conntrack using every possible extension) should
> be enough.
>

Yes. It is enough, if we check every patch carefully. Thanks.

-- 
Regards,
Changli Gao(xiaosuo@gmail.com)


* Re: [Patch] fix packet loss and massive ping spikes with PPP multi-link
From: Richard Hartmann @ 2010-11-15 12:07 UTC (permalink / raw)
  To: Ben McKeegan
  Cc: Paul Mackerras, netdev, linux-ppp, Alan Cox,
	Alexander E. Patrakov, linux-kernel, gabriele.paoloni
In-Reply-To: <AANLkTi=GHmqeb=8rMZAnpqdKGbTEf2mRYhygX1aC2hyD@mail.gmail.com>

On Mon, Nov 8, 2010 at 15:05, Richard Hartmann
<richih.mailinglist@gmail.com> wrote:

> Is there any update on this? It's been quite some time since you last updated
> on this issue.

As it's been a week without any reply and as I know how stuff can
drown in more important work & projects, I am tentatively poking again
:)


Richard


* Re: [PATCH 0/5] bridge: RCU annotation and cleanup
From: Tetsuo Handa @ 2010-11-15 12:23 UTC (permalink / raw)
  To: shemminger, davem, eric.dumazet; +Cc: netdev, bridge
In-Reply-To: <20101114211201.678755903@vyatta.com>

Stephen Hemminger wrote:
> This is a split up of what Eric did with a couple of small changes and additions.
Something seems to be wrong with this patchset.

--- a/net/bridge/br_input.c
+++ b/net/bridge/br_input.c
> @@ -173,8 +177,8 @@ forward:
>  	switch (p->state) {
>  	case BR_STATE_FORWARDING:
>  		rhook = rcu_dereference(br_should_route_hook);
> -		if (rhook != NULL) {
> -			if (rhook(skb))
> +		if (rhook) {
> +			if ((*rhook)(skb))

Is *rhook != NULL guaranteed when rhook != NULL?

>  				return skb;
>  			dest = eth_hdr(skb)->h_dest;
>  		}

--- a/net/bridge/br_forward.c
+++ b/net/bridge/br_forward.c
> @@ -242,7 +242,7 @@ static void br_multicast_flood(struct ne
>  		if ((unsigned long)lport >= (unsigned long)port)
>  			p = rcu_dereference(p->next);
>  		if ((unsigned long)rport >= (unsigned long)port)
> -			rp = rcu_dereference(rp->next);
> +			rp = rcu_dereference(hlist_next_rcu(rp->next));

I think this one is hlist_next_rcu(rp).

>  	}
>  
>  	if (!prev)

--- a/net/bridge/br_if.c
+++ b/net/bridge/br_if.c
> @@ -475,11 +475,8 @@ int br_del_if(struct net_bridge *br, str
>  {
>  	struct net_bridge_port *p;
>  
> -	if (!br_port_exists(dev))
> -		return -EINVAL;
> -
>  	p = br_port_get(dev);

Don't you need to use br_port_get_rtnl()?  (I don't know.)

> -	if (p->br != br)
> +	if (!p || p->br != br)
>  		return -EINVAL;
>  
>  	del_nbp(p);

--- a/net/bridge/br_netlink.c
+++ b/net/bridge/br_netlink.c
> @@ -169,9 +171,9 @@ static int br_rtm_setlink(struct sk_buff
>  	if (!dev)
>  		return -ENODEV;
>  
> -	if (!br_port_exists(dev))
> -		return -EINVAL;
>  	p = br_port_get(dev);

Don't you need to use br_port_get_rtnl()?  (I don't know.)

> +	if (!p)
> +		return -EINVAL;
>  
>  	/* if kernel STP is running, don't allow changes */
>  	if (p->br->stp_enabled == BR_KERNEL_STP)

--- a/net/bridge/br_private.h
+++ b/net/bridge/br_private.h
> @@ -151,11 +151,21 @@ struct net_bridge_port
>  #endif
>  };
>  
> -#define br_port_get_rcu(dev) \
> -	((struct net_bridge_port *) rcu_dereference(dev->rx_handler_data))
> -#define br_port_get(dev) ((struct net_bridge_port *) dev->rx_handler_data)
>  #define br_port_exists(dev) (dev->priv_flags & IFF_BRIDGE_PORT)
>  
> +static inline struct net_bridge_port *br_port_get_rcu(const struct net_device *dev)
> +{
> +	return br_port_exists(dev) ?
> +		rcu_dereference(dev->rx_handler_data) : NULL;
> +}
> +
> +static inline struct net_bridge_port *br_port_get(struct net_device *dev)
> +{
> +	return br_port_exists(dev) ? dev->rx_handler_data : NULL;
> +}
> +
> +#define br_port_get(dev) ((struct net_bridge_port *) dev->rx_handler_data)

Why are you defining br_port_get() twice, once as macro and once as inlined
function?


* Re: [PATCH 0/5] bridge: RCU annotation and cleanup
From: Eric Dumazet @ 2010-11-15 13:33 UTC (permalink / raw)
  To: Tetsuo Handa; +Cc: shemminger, davem, netdev, bridge
In-Reply-To: <201011152123.HHB21896.HFOOVSMFOtLJQF@I-love.SAKURA.ne.jp>

On Monday, 15 November 2010 at 21:23 +0900, Tetsuo Handa wrote:
> Stephen Hemminger wrote:
> > This is a split up of what Eric did with a couple of small changes and additions.
> Something seems to be wrong with this patchset.
> 
> --- a/net/bridge/br_input.c
> +++ b/net/bridge/br_input.c
> > @@ -173,8 +177,8 @@ forward:
> >  	switch (p->state) {
> >  	case BR_STATE_FORWARDING:
> >  		rhook = rcu_dereference(br_should_route_hook);
> > -		if (rhook != NULL) {
> > -			if (rhook(skb))
> > +		if (rhook) {
> > +			if ((*rhook)(skb))
> 
> Is *rhook != NULL guaranteed when rhook != NULL?

It's the standard C convention: we call the function pointed to by rhook,
not *rhook.

$ cat func.c
typedef int (*hook_t)(int a1, int a2);

hook_t *hook;

int foo(int a1, int a2)
{
	hook_t *handler = hook;

	if (handler)
		return handler(a1, a2);
	return 0;
}
$ gcc -O2 -c func.c
func.c: In function ‘foo’:
func.c:10:17: error: called object ‘handler’ is not a function


Now, if we use (*handler), it works:

$ cat func.c
typedef int (*hook_t)(int a1, int a2);

hook_t *hook;

int foo(int a1, int a2)
{
	hook_t *handler = hook;

	if (handler)
		return (*handler)(a1, a2);
	return 0;
}
$ gcc -O2 -c func.c
$
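
For comparison, here is a minimal sketch of the other case, where the variable is an ordinary pointer to function (which is what you get when the typedef names a function type rather than a pointer type). In that case `handler(a1, a2)` and `(*handler)(a1, a2)` are the same call, and checking the pointer itself against NULL is all that is needed. The names `hook_fn`, `add`, `call_direct` and `call_deref` are made up for this illustration and are not taken from the bridge code.

```c
#include <stddef.h>

/* A function *type*, not a pointer type (hypothetical name). */
typedef int hook_fn(int a1, int a2);

/* Pointer to function; NULL means "no hook installed". */
hook_fn *hook;

/* Example hook used only to exercise the two call forms. */
int add(int a1, int a2)
{
	return a1 + a2;
}

/* Call through the pointer without an explicit dereference. */
int call_direct(int a1, int a2)
{
	hook_fn *handler = hook;

	if (handler)
		return handler(a1, a2);
	return 0;
}

/* Same call with an explicit dereference; C treats both forms alike. */
int call_deref(int a1, int a2)
{
	hook_fn *handler = hook;

	if (handler)
		return (*handler)(a1, a2);
	return 0;
}
```

With `hook` left NULL both helpers return 0; after `hook = add;` both return the sum of their arguments.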





* NETPOLL on bond interfaces
From: sergey belov @ 2010-11-15 13:49 UTC (permalink / raw)
  To: Netdev

Before 2.6.36 came out I was using this patch to enable netpoll on bond
interfaces (see below).

When I tried to apply this patch to the 2.6.36 sources, some hunks failed.

[root@kernel-x32 experimental]# patch -p0 < netpoll.patch
patching file a/drivers/net/bonding/bond_main.c
Hunk #1 FAILED at 75.
Hunk #2 FAILED at 416.
Hunk #3 succeeded at 1334 with fuzz 2 (offset 23 lines).
Hunk #4 succeeded at 1729 with fuzz 1 (offset 15 lines).
Hunk #5 succeeded at 1837 (offset 42 lines).
Hunk #6 succeeded at 2026 (offset 19 lines).
Hunk #7 succeeded at 4490 with fuzz 1 (offset 11 lines).
Hunk #8 succeeded at 4674 (offset 68 lines).
2 out of 8 hunks FAILED -- saving rejects to file
a/drivers/net/bonding/bond_main.c.rej
patching file a/drivers/net/bonding/bonding.h
Hunk #2 succeeded at 227 (offset 4 lines).
patching file a/include/linux/netdevice.h
Hunk #1 FAILED at 52.
Hunk #2 FAILED at 625.
2 out of 2 hunks FAILED -- saving rejects to file
a/include/linux/netdevice.h.rej
patching file a/include/linux/netpoll.h
Hunk #1 succeeded at 43 with fuzz 2 (offset 9 lines).
patching file a/net/core/netpoll.c
Reversed (or previously applied) patch detected!  Assume -R? [n]


Inspecting the changelog at
http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.36 I found that
a lot of changes were made inside the bonding and netpoll subsystems.

Could someone please provide me with a new patch to enable netpoll on bond
interfaces, or point me to a workaround to enable it some other way?

We only need it to enable netconsole on our servers with bonding.


diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index d927f71..6304720 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -75,6 +75,7 @@
 #include <linux/jiffies.h>
 #include <net/route.h>
 #include <net/net_namespace.h>
+#include <linux/netpoll.h>
 #include "bonding.h"
 #include "bond_3ad.h"
 #include "bond_alb.h"
@@ -415,7 +416,12 @@ int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb,
        }

        skb->priority = 1;
-       dev_queue_xmit(skb);
+#ifdef CONFIG_NET_POLL_CONTROLLER
+       if (bond->netpoll)
+               netpoll_send_skb(bond->netpoll, skb);
+       else
+#endif
+               dev_queue_xmit(skb);

        return 0;
 }
@@ -1305,6 +1311,44 @@ static void bond_detach_slave(struct bonding *bond, struct slave *slave)
        bond->slave_cnt--;
 }

+#ifdef CONFIG_NET_POLL_CONTROLLER
+static int slaves_support_netpoll(struct net_device *bond_dev)
+{
+       struct bonding *bond = netdev_priv(bond_dev);
+       struct slave *slave;
+       int i;
+
+       bond_for_each_slave(bond, slave, i)
+               if (!slave->dev->netdev_ops->ndo_poll_controller)
+                       return 0;
+       return 1;
+}
+
+static void bond_poll_controller(struct net_device *bond_dev)
+{
+       struct bonding *bond = netdev_priv(bond_dev);
+       struct slave *slave;
+       int i;
+
+       if (slaves_support_netpoll(bond_dev))
+               bond_for_each_slave(bond, slave, i)
+                       netpoll_poll_dev(slave->dev);
+}
+
+static int bond_netpoll_setup(struct net_device *bond_dev,
+                             struct netpoll_info *npinfo)
+{
+       struct bonding *bond = netdev_priv(bond_dev);
+       struct slave *slave;
+       int i;
+
+       bond_for_each_slave(bond, slave, i)
+               if (slave->dev)
+                       slave->dev->npinfo = npinfo;
+       return 0;
+}
+#endif
+
 /*---------------------------------- IOCTL ----------------------------------*/

 static int bond_sethwaddr(struct net_device *bond_dev,
@@ -1670,6 +1714,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
                        bond->primary_slave = new_slave;
        }

+#ifdef CONFIG_NET_POLL_CONTROLLER
+       if (slaves_support_netpoll(bond_dev))
+               slave_dev->npinfo = bond_dev->npinfo;
+       else {
+               pr_err(DRV_NAME "New slave device %s does not support netpoll.\n",
+                      slave_dev->name);
+               pr_err(DRV_NAME "netpoll disabled for %s.\n", bond_dev->name);
+       }
+#endif
+
        write_lock_bh(&bond->curr_slave_lock);

        switch (bond->params.mode) {
@@ -1741,6 +1795,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)

 /* Undo stages on error */
 err_close:
+#ifdef CONFIG_NET_POLL_CONTROLLER
+       if (slave_dev->npinfo)
+               slave_dev->npinfo = NULL;
+#endif
        dev_close(slave_dev);

 err_unset_master:
@@ -1949,6 +2007,10 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
                                   IFF_SLAVE_INACTIVE | IFF_BONDING |
                                   IFF_SLAVE_NEEDARP);

+#ifdef CONFIG_NET_POLL_CONTROLLER
+       if (slave_dev->npinfo)
+               slave_dev->npinfo = NULL;
+#endif
        kfree(slave);

        return 0;  /* deletion OK */
@@ -4417,6 +4479,20 @@ out:
        return 0;
 }

+#ifdef CONFIG_NET_POLL_CONTROLLER
+int bond_netpoll_start_xmit(struct netpoll *np, struct sk_buff *skb)
+{
+       struct bonding *bond = netdev_priv(skb->dev);
+       int ret;
+
+       bond->netpoll = np;
+       ret = bond->dev->netdev_ops->ndo_start_xmit(skb, bond->dev);
+       bond->netpoll = NULL;
+
+       return ret;
+}
+#endif
+
 /*------------------------- Device initialization ---------------------------*/

 static void bond_set_xmit_hash_policy(struct bonding *bond)
@@ -4530,6 +4606,9 @@ static const struct net_device_ops bond_netdev_ops = {
        .ndo_change_mtu         = bond_change_mtu,
        .ndo_set_mac_address    = bond_set_mac_address,
        .ndo_neigh_setup        = bond_neigh_setup,
+       .ndo_netpoll_setup      = bond_netpoll_setup,
+       .ndo_netpoll_start_xmit = bond_netpoll_start_xmit,
+       .ndo_poll_controller    = bond_poll_controller,
        .ndo_vlan_rx_register   = bond_vlan_rx_register,
        .ndo_vlan_rx_add_vid    = bond_vlan_rx_add_vid,
        .ndo_vlan_rx_kill_vid   = bond_vlan_rx_kill_vid,
diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
index 6290a50..563d28c 100644
--- a/drivers/net/bonding/bonding.h
+++ b/drivers/net/bonding/bonding.h
@@ -18,6 +18,7 @@
 #include <linux/timer.h>
 #include <linux/proc_fs.h>
 #include <linux/if_bonding.h>
+#include <linux/netpoll.h>
 #include <linux/kobject.h>
 #include <linux/in6.h>
 #include "bond_3ad.h"
@@ -222,6 +223,10 @@ struct bonding {
 #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
        struct   in6_addr master_ipv6;
 #endif
+#ifdef CONFIG_NET_POLL_CONTROLLER
+       struct   netpoll *netpoll;
+#endif
+
 };

 /**
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index d4a4d98..e1724b2 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -52,6 +52,7 @@

 struct vlan_group;
 struct netpoll_info;
+struct netpoll;
 /* 802.11 specific */
 struct wireless_dev;
                                        /* source back-compat hooks */
@@ -624,6 +625,10 @@ struct net_device_ops {
                                                        unsigned short vid);
 #ifdef CONFIG_NET_POLL_CONTROLLER
 #define HAVE_NETDEV_POLL
+       int                     (*ndo_netpoll_setup)(struct net_device *dev,
+                                                    struct netpoll_info *npinfo);
+       int                     (*ndo_netpoll_start_xmit)(struct netpoll *np,
+                                                         struct sk_buff *skb);
        void                    (*ndo_poll_controller)(struct net_device *dev);
 #endif
 #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
index 2524267..39d42e4 100644
--- a/include/linux/netpoll.h
+++ b/include/linux/netpoll.h
@@ -34,7 +34,9 @@ struct netpoll_info {
 };

 void netpoll_poll(struct netpoll *np);
+void netpoll_poll_dev(struct net_device *dev);
 void netpoll_send_udp(struct netpoll *np, const char *msg, int len);
+void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb);
 void netpoll_print_options(struct netpoll *np);
 int netpoll_parse_options(struct netpoll *np, char *opt);
 int netpoll_setup(struct netpoll *np);
diff --git a/net/core/netpoll.c b/net/core/netpoll.c
index 9675f31..3776b26 100644
--- a/net/core/netpoll.c
+++ b/net/core/netpoll.c
@@ -174,9 +174,8 @@ static void service_arp_queue(struct netpoll_info *npi)
        }
 }

-void netpoll_poll(struct netpoll *np)
+void netpoll_poll_dev(struct net_device *dev)
 {
-       struct net_device *dev = np->dev;
        const struct net_device_ops *ops;

        if (!dev || !netif_running(dev))
@@ -196,6 +195,11 @@ void netpoll_poll(struct netpoll *np)
        zap_completion_queue();
 }

+void netpoll_poll(struct netpoll *np)
+{
+       netpoll_poll_dev(np->dev);
+}
+
 static void refill_skbs(void)
 {
        struct sk_buff *skb;
@@ -277,11 +281,11 @@ static int netpoll_owner_active(struct net_device *dev)
        return 0;
 }

-static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
+void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
 {
        int status = NETDEV_TX_BUSY;
        unsigned long tries;
-       struct net_device *dev = np->dev;
+       struct net_device *dev = skb->dev;
        const struct net_device_ops *ops = dev->netdev_ops;
        struct netpoll_info *npinfo = np->dev->npinfo;

@@ -303,7 +307,10 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
                     tries > 0; --tries) {
                        if (__netif_tx_trylock(txq)) {
                                if (!netif_tx_queue_stopped(txq)) {
-                                       status = ops->ndo_start_xmit(skb, dev);
+                                       if (ops->ndo_netpoll_start_xmit)
+                                               status = ops->ndo_netpoll_start_xmit(np,skb);
+                                       else
+                                               status = ops->ndo_start_xmit(skb, dev);
                                        if (status == NETDEV_TX_OK)
                                                txq_trans_update(txq);
                                }
@@ -789,6 +796,9 @@ int netpoll_setup(struct netpoll *np)
        /* avoid racing with NAPI reading npinfo */
        synchronize_rcu();

+       if (ndev->netdev_ops->ndo_netpoll_setup)
+               ndev->netdev_ops->ndo_netpoll_setup(ndev, npinfo);
+
        return 0;

  release:
@@ -859,4 +869,6 @@ EXPORT_SYMBOL(netpoll_parse_options);
 EXPORT_SYMBOL(netpoll_setup);
 EXPORT_SYMBOL(netpoll_cleanup);
 EXPORT_SYMBOL(netpoll_send_udp);
+EXPORT_SYMBOL(netpoll_send_skb);
 EXPORT_SYMBOL(netpoll_poll);
+EXPORT_SYMBOL(netpoll_poll_dev);


* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Christoph Lameter @ 2010-11-15 13:57 UTC (permalink / raw)
  To: Paul E. McKenney
  Cc: Eric Dumazet, Andrew Morton, linux-kernel, David Miller, netdev,
	Arnaldo Carvalho de Melo, Ingo Molnar, Andi Kleen, Nick Piggin
In-Reply-To: <20101113222612.GD2825@linux.vnet.ibm.com>

On Sat, 13 Nov 2010, Paul E. McKenney wrote:

> On Fri, Nov 12, 2010 at 01:14:12PM -0600, Christoph Lameter wrote:
> >
> > prefetchw() would be too much overhead?
>
> No idea.  Where do you believe that prefetchw() should be added?

It is another way to get an exclusive cache line
for situations like this. No need to give a hint.



* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Andi Kleen @ 2010-11-15 14:07 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Eric Dumazet, Andrew Morton, linux-kernel,
	David Miller, netdev, Arnaldo Carvalho de Melo, Ingo Molnar,
	Andi Kleen, Nick Piggin
In-Reply-To: <alpine.DEB.2.00.1011150756130.19175@router.home>

On Mon, Nov 15, 2010 at 07:57:10AM -0600, Christoph Lameter wrote:
> On Sat, 13 Nov 2010, Paul E. McKenney wrote:
> 
> > On Fri, Nov 12, 2010 at 01:14:12PM -0600, Christoph Lameter wrote:
> > >
> > > prefetchw() would be too much overhead?
> >
> > No idea.  Where do you believe that prefetchw() should be added?
> 
> It is another way to get an exclusive cache line
> for situations like this. No need to give a hint.

prefetchw doesn't work on Intel (or rather is equivalent to prefetch),
for Intel you always need to explicitly write to get an exclusive
line.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.


* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Christoph Lameter @ 2010-11-15 14:16 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Paul E. McKenney, Eric Dumazet, Andrew Morton, linux-kernel,
	David Miller, netdev, Arnaldo Carvalho de Melo, Ingo Molnar,
	Nick Piggin
In-Reply-To: <20101115140739.GJ7269@basil.fritz.box>

On Mon, 15 Nov 2010, Andi Kleen wrote:

> > It is another way to get an exclusive cache line
> > for situations like this. No need to give a hint.
>
> prefetchw doesn't work on Intel (or rather is equivalent to prefetch),
> for Intel you always need to explicitly write to get an exclusive
> line.

Argh. You mean x86. Itanium could do it and is also by Intel. Could you
please change that for x86 as well? Otherwise we will get more of these
weird code twisters.


* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Eric Dumazet @ 2010-11-15 14:17 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Andrew Morton, linux-kernel, David Miller,
	netdev, Arnaldo Carvalho de Melo, Ingo Molnar, Andi Kleen,
	Nick Piggin
In-Reply-To: <alpine.DEB.2.00.1011150756130.19175@router.home>

On Monday, 15 November 2010 at 07:57 -0600, Christoph Lameter wrote:
> On Sat, 13 Nov 2010, Paul E. McKenney wrote:
> 
> > On Fri, Nov 12, 2010 at 01:14:12PM -0600, Christoph Lameter wrote:
> > >
> > > prefetchw() would be too much overhead?
> >
> > No idea.  Where do you believe that prefetchw() should be added?
> 
> It is another way to get an exclusive cache line
> for situations like this. No need to give a hint.
> 

Exclusive access? As soon as another cpu takes it again, you lose.

It's not really the same thing... Maybe you are missing the intention of
the 'hint'. We know the probable value of the counter; we don't want to read it.

In fact, prefetchw() is useful when you can issue it many cycles before
the memory read you are going to perform [before the write]. On
contended cache lines, it's a waste, because by the time your cpu is
going to read memory, then perform the atomic compare_and_exchange(),
another cpu might have dirtied the location again. This is what we noticed
during Netfilter Workshop 2010: a high performance cost at both
atomic_read() and atomic_cmpxchg(). We tried prefetchw() and it was a
performance drop. It was with only 16 cpus contending on neighbour
refcnt, and 5 million frames per second (5 million atomic increments,
5 million atomic decrements).

prefetchw() should be used on very specific spots, when a cpu is going
to write into a private area (not potentially accessed by other cpus).
We use it for example in __alloc_skb(), a bit before memset().

By the way, atomic_inc_not_zero_hint() is less code than
[prefetchw(), atomic_inc_not_zero()]. Using one instruction [cmpxchg]
with the memory pointer is better than three [prefetchw(), read(),
cmpxchg()], particularly if you have high contention on the cache line.
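
To make the idea concrete, here is a userspace sketch of the "hint" approach using C11 atomics rather than the kernel's atomic_t API. The function name `inc_not_zero_hint()` and its exact shape are assumptions made for illustration, not the code from the patch: the point is only that the first compare-and-swap starts from the caller's guess, so no separate read of the counter is needed when the guess is right.

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Increment *v unless it is zero. The compare-and-swap loop starts from
 * the caller's guess (hint) instead of an initial load of the counter.
 * Illustrative userspace analogue; not the kernel implementation.
 */
bool inc_not_zero_hint(atomic_int *v, int hint)
{
	int expected = hint;

	/* A zero hint carries no information: fall back to a plain load. */
	if (expected == 0)
		expected = atomic_load(v);

	while (expected != 0) {
		int desired = expected + 1;

		/* On failure, expected is rewritten with the value found. */
		if (atomic_compare_exchange_strong(v, &expected, desired))
			return true;
	}

	/* The counter was observed at zero: refuse to resurrect it. */
	return false;
}
```

When the hint is correct the function succeeds with a single compare-and-swap; when it is wrong, the failed attempt hands back the current value and the loop retries from there.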


* Re: [PATCH] tcp: restrict net.ipv4.tcp_adv_min_scale (#20312)
From: Ben Hutchings @ 2010-11-15 14:18 UTC (permalink / raw)
  To: Alexey Dobriyan; +Cc: Eric Dumazet, davem, shemminger, netdev
In-Reply-To: <20101114201458.GA28181@core2.telecom.by>

On Sun, 2010-11-14 at 22:14 +0200, Alexey Dobriyan wrote:
> On Sun, Nov 14, 2010 at 08:49:43PM +0100, Eric Dumazet wrote:
> > > +static int _minus_31 = -31;
> > > +static int _31 = 31;
> > 
> > Please use normal symbols, not starting by underscore.
> 
> static int thirty_one = 31? :-)

How about, oh, I don't know, adv_win_scale_{min,max}?

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



* Re: [PATCH net-next-2.6 v2] can: Topcliff: PCH_CAN driver: Add Flow control,
From: Wolfgang Grandegger @ 2010-11-15 14:21 UTC (permalink / raw)
  To: Tomoya MORINAGA
  Cc: andrew.chih.howe.khor-ral2JQCrhuEAvxtiuMwx3w, Masayuki Ohtake,
	Samuel Ortiz, margie.foster-ral2JQCrhuEAvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	socketcan-core-0fE9KPoRgkgATYTw5x5z8w,
	yong.y.wang-ral2JQCrhuEAvxtiuMwx3w,
	kok.howg.ewe-ral2JQCrhuEAvxtiuMwx3w,
	joel.clark-ral2JQCrhuEAvxtiuMwx3w, David S. Miller,
	Christian Pellegrin, qi.wang-ral2JQCrhuEAvxtiuMwx3w
In-Reply-To: <4CE0EFA7.9020007-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>

Hello,

On 11/15/2010 09:30 AM, Tomoya MORINAGA wrote:
>  * Add Flow control
>  * Fix Data copy issue (endianness)
>  * Add Macro prefix "PCH_"
>  * Separate interface register structure
>  * Some functions are unified.
>  * Change MessageObject indication(PCH_RX_OBJ_START, etc..)
>  * Enumerate LEC macro
>  * Move MSI processing from open/close to probe/remove processing
>  * Use BIT(x)
>  * and more...
> 
> Signed-off-by: Tomoya MORINAGA <tomoya-linux-ECg8zkTtlr0C6LszWs/t0g@public.gmane.org>
> ---
>  drivers/net/can/pch_can.c | 1348 ++++++++++++++++++++-------------------------
>  1 files changed, 595 insertions(+), 753 deletions(-)
> 
> diff --git a/drivers/net/can/pch_can.c b/drivers/net/can/pch_can.c
> index 6727182..6a38593 100644
> --- a/drivers/net/can/pch_can.c
> +++ b/drivers/net/can/pch_can.c
...
> -	if (status & PCH_LEC_ALL) {
> +	lec = status & PCH_LEC_ALL;
> +	switch (lec) {
> +	case PCH_STUF_ERR:
> +		cf->data[2] |= CAN_ERR_PROT_STUFF;
>  		priv->can.can_stats.bus_error++;
>  		stats->rx_errors++;
> -		switch (status & PCH_LEC_ALL) {
> -		case PCH_STUF_ERR:
> -			cf->data[2] |= CAN_ERR_PROT_STUFF;
> -			break;
> -		case PCH_FORM_ERR:
> -			cf->data[2] |= CAN_ERR_PROT_FORM;
> -			break;
> -		case PCH_ACK_ERR:
> -			cf->data[2] |= CAN_ERR_PROT_LOC_ACK |
> -				       CAN_ERR_PROT_LOC_ACK_DEL;
> -			break;
> -		case PCH_BIT1_ERR:
> -		case PCH_BIT0_ERR:
> -			cf->data[2] |= CAN_ERR_PROT_BIT;
> -			break;
> -		case PCH_CRC_ERR:
> -			cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
> -				       CAN_ERR_PROT_LOC_CRC_DEL;
> -			break;
> -		default:
> -			iowrite32(status | PCH_LEC_ALL, &priv->regs->stat);
> -			break;
> -		}
> -
> +		break;
> +	case PCH_FORM_ERR:
> +		cf->data[2] |= CAN_ERR_PROT_FORM;
> +		priv->can.can_stats.bus_error++;
> +		stats->rx_errors++;
> +		break;
> +	case PCH_ACK_ERR:
> +		cf->can_id |= CAN_ERR_ACK;
> +		priv->can.can_stats.bus_error++;
> +		stats->rx_errors++;
> +		break;
> +	case PCH_BIT1_ERR:
> +	case PCH_BIT0_ERR:
> +		cf->data[2] |= CAN_ERR_PROT_BIT;
> +		priv->can.can_stats.bus_error++;
> +		stats->rx_errors++;
> +		break;
> +	case PCH_CRC_ERR:
> +		cf->data[2] |= CAN_ERR_PROT_LOC_CRC_SEQ |
> +			       CAN_ERR_PROT_LOC_CRC_DEL;
> +		priv->can.can_stats.bus_error++;
> +		stats->rx_errors++;
> +		break;
> +	case PCH_LEC_ALL: /* Written by CPU. No error status */
> +		break;
>  	}

More comments to the lec handling below.

> +	cf->data[6] = ioread32(&priv->regs->errc) & PCH_TEC;
> +	cf->data[7] = (ioread32(&priv->regs->errc) & PCH_REC) >> 8;

Could be handled with just *one* register access.
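
A sketch of the single-read variant, written as plain C so it compiles outside the kernel. The mask values for `PCH_TEC` and `PCH_REC` are assumptions (transmit error counter in bits 0-7, receive error counter in bits 8-14), and `fill_err_counters()` is a hypothetical helper; in the driver the `errc` argument would come from one `ioread32(&priv->regs->errc)` call.

```c
#include <stdint.h>

#define PCH_TEC 0x000000ffu /* assumed mask: transmit error counter */
#define PCH_REC 0x00007f00u /* assumed mask: receive error counter */

/*
 * Fill the two error-counter bytes of a CAN error frame from a single
 * snapshot of the error counter register.
 */
void fill_err_counters(uint32_t errc, uint8_t data[8])
{
	data[6] = errc & PCH_TEC;
	data[7] = (errc & PCH_REC) >> 8;
}
```

Besides saving a register access, taking both fields from one snapshot means the two counters always belong to the same point in time.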

...
>  static int pch_can_rx_poll(struct napi_struct *napi, int quota)
>  {
>  	struct net_device *ndev = napi->dev;
>  	struct pch_can_priv *priv = netdev_priv(ndev);
> -	struct net_device_stats *stats = &(priv->ndev->stats);
> -	u32 dlc;
>  	u32 int_stat;
>  	int rcv_pkts = 0;
>  	u32 reg_stat;
> -	unsigned long flags;
>  
>  	int_stat = pch_can_int_pending(priv);
>  	if (!int_stat)
> -		return 0;
> +		goto end;
>  
> -INT_STAT:
> -	if (int_stat == CAN_STATUS_INT) {
> +	if ((int_stat == PCH_STATUS_INT) && (quota > 0)) {
>  		reg_stat = ioread32(&priv->regs->stat);
>  		if (reg_stat & (PCH_BUS_OFF | PCH_LEC_ALL)) {
> -			if ((reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL)
> +			if ((reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL) {
>  				pch_can_error(ndev, reg_stat);
> +				quota--;
> +			}

Should be:

  		if (reg_stat & PCH_BUS_OFF ||
		    (reg_stat & PCH_LEC_ALL) != PCH_LEC_ALL) {

Your lec handling is still not correct, I believe. The driver needs to
write PCH_LEC_ALL to the "stat" register once in the initialization code
and then after each error observed (lec != PCH_LEC_ALL). I still do not
find such code. Could you show us the output of

  "# candump any,0:0,#FFFFFFFF"

when you send CAN messages *without* a cable connected?

Thanks,

Wolfgang.


* [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers
From: Chris Metcalf @ 2010-11-15 14:18 UTC (permalink / raw)
  To: linux-kernel; +Cc: Américo Wang, Eric Dumazet, netdev, Cypher Wu
In-Reply-To: <AANLkTind92uwcigzmDn8yn9a22exDy7zcreGQ5-6NLV-@mail.gmail.com>

This avoids a deadlock in the IGMP code where one core gets a read
lock, another core starts trying to get a write lock (thus blocking
new readers), and then the first core tries to recursively re-acquire
the read lock.

We still try to preserve some degree of balance by giving priority
to additional write lockers that come along while the lock is held
for write, so they can all complete quickly and return the lock to
the readers.

Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
---
This should apply relatively cleanly to 2.6.26.7 source code too.

 arch/tile/lib/spinlock_32.c |   29 ++++++++++++++++++-----------
 1 files changed, 18 insertions(+), 11 deletions(-)

diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
index 485e24d..5cd1c40 100644
--- a/arch/tile/lib/spinlock_32.c
+++ b/arch/tile/lib/spinlock_32.c
@@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
 	 * when we compare them.
 	 */
 	u32 my_ticket_;
+	u32 iterations = 0;
 
-	/* Take out the next ticket; this will also stop would-be readers. */
-	if (val & 1)
-		val = get_rwlock(rwlock);
-	rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
+	/*
+	 * Wait until there are no readers, then bump up the next
+	 * field and capture the ticket value.
+	 */
+	for (;;) {
+		if (!(val & 1)) {
+			if ((val >> RD_COUNT_SHIFT) == 0)
+				break;
+			rwlock->lock = val;
+		}
+		delay_backoff(iterations++);
+		val = __insn_tns((int *)&rwlock->lock);
+	}
 
-	/* Extract my ticket value from the original word. */
+	/* Take out the next ticket and extract my ticket value. */
+	rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
 	my_ticket_ = val >> WR_NEXT_SHIFT;
 
-	/*
-	 * Wait until the "current" field matches our ticket, and
-	 * there are no remaining readers.
-	 */
+	/* Wait until the "current" field matches our ticket. */
 	for (;;) {
 		u32 curr_ = val >> WR_CURR_SHIFT;
-		u32 readers = val >> RD_COUNT_SHIFT;
-		u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
+		u32 delta = ((my_ticket_ - curr_) & WR_MASK);
 		if (likely(delta == 0))
 			break;
 
-- 
1.6.5.2


^ permalink raw reply related

* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Christoph Lameter @ 2010-11-15 14:25 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Paul E. McKenney, Andrew Morton, linux-kernel, David Miller,
	netdev, Arnaldo Carvalho de Melo, Ingo Molnar, Andi Kleen,
	Nick Piggin
In-Reply-To: <1289830636.2607.70.camel@edumazet-laptop>

On Mon, 15 Nov 2010, Eric Dumazet wrote:

> Exclusive access ? As soon as another cpu takes it again, you lose.

Sure but you want to avoid the fetch in shared mode here.

> Its not really the same thing... Maybe you miss the 'hint' intention at
> all. We know the probable value of the counter, we dont want to read it.

OK, maybe in this case you can predict the value, but in general it is
difficult to always provide an expected value. It would be easier to be
able to tell the processor that the cacheline should not be fetched as
shared but immediately in exclusive state.

> atomic_read() and atomic_cmpxchg(). We tried prefetchw() and it was a
> performance drop. It was with only 16 cpus contending on neighbour

Does prefetchw work? Andi claims that prefetchw is not working on
x86 and I doubt that you ran tests on Itanium.

^ permalink raw reply

* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Andi Kleen @ 2010-11-15 14:39 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Eric Dumazet, Paul E. McKenney, Andrew Morton, linux-kernel,
	David Miller, netdev, Arnaldo Carvalho de Melo, Ingo Molnar,
	Andi Kleen, Nick Piggin
In-Reply-To: <alpine.DEB.2.00.1011150821000.19175@router.home>

> > atomic_read() and atomic_cmpxchg(). We tried prefetchw() and it was a
> > performance drop. It was with only 16 cpus contending on neighbour
> 
> Does prefetchw work? Andi claims that prefetchw is not working on
> x86 and I doubt that you ran tests on Itanium.

AMD supports it due to their MOESI protocol, but it's not supported
in MESIF as used by Intel QPI.  The kernel maps it on Intel to 
ordinary prefetch.

-Andi

-- 
ak@linux.intel.com -- Speaking for myself only.

^ permalink raw reply

* Re: ethtool maintenance
From: Ben Hutchings @ 2010-11-15 14:46 UTC (permalink / raw)
  To: Jeff Garzik; +Cc: NetDev, David Miller, Peter Martuccelli
In-Reply-To: <1289658614.2816.105.camel@localhost>

On Sat, 2010-11-13 at 14:30 +0000, Ben Hutchings wrote:
> On Sat, 2010-11-13 at 03:45 -0500, Jeff Garzik wrote:
> > So, a recent emergency surgery has really set me back, work-wise. 
> > ethtool [the userspace utility] 2.6.36 is still not out, and personally 
> > it remains a third or fourth priority.
> > 
> > While it's likely that I could get back to ethtool's patch queue next 
> > week, it continues to be low man on the totem pole.  Seems only fair to 
> > see if anyone else is interested in maintaining it.
> > 
> > I emailed Ben Hutchings privately about this, but haven't heard back, so 
> > I thought I'd go ahead and email the list.
> > 
> > Anyone interested?
> 
> I am interested, but will need to clear it with my boss before making
> such a commitment.

OK, this is fine.  Are there any patches waiting other than mine and
these?

http://patchwork.ozlabs.org/patch/67662/
http://patchwork.ozlabs.org/patch/68237/
http://patchwork.ozlabs.org/patch/69155/

I'll need to sort out my kernel.org account before I can publish a git
tree.

Ben.

-- 
Ben Hutchings, Senior Software Engineer, Solarflare Communications
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.


^ permalink raw reply

* Re: [PATCH] atomic: add atomic_inc_not_zero_hint()
From: Eric Dumazet @ 2010-11-15 14:47 UTC (permalink / raw)
  To: Christoph Lameter
  Cc: Paul E. McKenney, Andrew Morton, linux-kernel, David Miller,
	netdev, Arnaldo Carvalho de Melo, Ingo Molnar, Andi Kleen,
	Nick Piggin
In-Reply-To: <alpine.DEB.2.00.1011150821000.19175@router.home>

Le lundi 15 novembre 2010 à 08:25 -0600, Christoph Lameter a écrit :
> On Mon, 15 Nov 2010, Eric Dumazet wrote:
> 
> > Exclusive access ? As soon as another cpu takes it again, you lose.
> 
> Sure but you want to avoid the fetch in shared mode here.
> 

Yes, this is what cmpxchg() does for sure.

> > Its not really the same thing... Maybe you miss the 'hint' intention at
> > all. We know the probable value of the counter, we dont want to read it.
> 
> OK, maybe in this case you can predict the value, but in general it is
> difficult to always provide an expected value. It would be easier to be
> able to tell the processor that the cacheline should not be fetched as
> shared but immediately in exclusive state.
> 

Maybe it's not clear, but atomic_inc_not_zero_hint() is going to be used
only in contexts where we know the expected value, and not as a generic
replacement for atomic_inc_not_zero(). Even if the cache line is already
hot in this cpu's cache, it should be at least as fast.

Then, in high contention contexts, using atomic_inc_not_zero_hint() with
whatever initial hint might also be a win over atomic_inc_not_zero(),
but we try to remove such contexts ;)

And two atomic_cmpxchg() calls are probably slower in non-contended
contexts, in particular if the cache line is already hot in this cpu's cache.
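
To make the intent concrete, here is a minimal userspace sketch using C11
atomics rather than the kernel's atomic_t API. The name inc_not_zero_hint()
and the fallback for a zero hint are assumptions for illustration, not the
code from the patch. The point it shows is that the first access to the
counter is already the compare-and-swap, so no separate read is issued when
the hint is right:

```c
#include <stdatomic.h>
#include <stdbool.h>

/*
 * Illustrative sketch only (hypothetical name, C11 atomics).
 * Increment *v unless it is zero, starting the compare-and-swap from a
 * caller-supplied guess instead of loading *v first.
 */
static bool inc_not_zero_hint(atomic_int *v, int hint)
{
	int expected = hint;

	/* A zero hint carries no information: fall back to a plain load. */
	if (expected == 0)
		expected = atomic_load(v);

	while (expected != 0) {
		/* On failure, 'expected' is refreshed with the value seen. */
		if (atomic_compare_exchange_strong(v, &expected, expected + 1))
			return true;
	}
	return false;	/* the counter was (or became) zero */
}
```

With a correct hint the loop body runs once; with a wrong hint it costs one
extra compare-and-swap, which is the trade-off discussed above.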

> > atomic_read() and atomic_cmpxchg(). We tried prefetchw() and it was a
> > performance drop. It was with only 16 cpus contending on neighbour
> 
> Does prefetchw work? Andi claims that prefetchw is not working on
> x86 and I doubt that you ran tests on Itanium.

In fact, in benchmarks, prefetch() and prefetchw() are a pain on x86, or
at least "perf tools" show artifacts on them (a high number of cycles
attributed to these instructions).

Andi had a patch to disable prefetch() in list iterators, and it's a win.

I don't have an Itanium platform to run tests on. Is cmpxchg() that bad on
ia64 ? I also have old AMD cpus, so I cannot say if recent ones handle
prefetchw() better...

^ permalink raw reply

* Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers
From: Eric Dumazet @ 2010-11-15 14:52 UTC (permalink / raw)
  To: Chris Metcalf; +Cc: linux-kernel, Américo Wang, netdev, Cypher Wu
In-Reply-To: <201011151425.oAFEPU3W005682@farm-0010.internal.tilera.com>

Le lundi 15 novembre 2010 à 09:18 -0500, Chris Metcalf a écrit :
> This avoids a deadlock in the IGMP code where one core gets a read
> lock, another core starts trying to get a write lock (thus blocking
> new readers), and then the first core tries to recursively re-acquire
> the read lock.
> 
> We still try to preserve some degree of balance by giving priority
> to additional write lockers that come along while the lock is held
> for write, so they can all complete quickly and return the lock to
> the readers.
> 
> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
> ---
> This should apply relatively cleanly to 2.6.26.7 source code too.
> 
>  arch/tile/lib/spinlock_32.c |   29 ++++++++++++++++++-----------
>  1 files changed, 18 insertions(+), 11 deletions(-)
> 
> diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
> index 485e24d..5cd1c40 100644
> --- a/arch/tile/lib/spinlock_32.c
> +++ b/arch/tile/lib/spinlock_32.c
> @@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
>  	 * when we compare them.
>  	 */
>  	u32 my_ticket_;
> +	u32 iterations = 0;
>  
> -	/* Take out the next ticket; this will also stop would-be readers. */
> -	if (val & 1)
> -		val = get_rwlock(rwlock);
> -	rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
> +	/*
> +	 * Wait until there are no readers, then bump up the next
> +	 * field and capture the ticket value.
> +	 */
> +	for (;;) {
> +		if (!(val & 1)) {
> +			if ((val >> RD_COUNT_SHIFT) == 0)
> +				break;
> +			rwlock->lock = val;
> +		}
> +		delay_backoff(iterations++);

Are you sure a writer should have a growing delay_backoff() ?

It seems to me this only allows new readers to come in (adding more
unfairness to the rwlock, which already favors readers over writers).

Maybe allow one cpu to spin, and possibly queue the other 'writers' ?

> +		val = __insn_tns((int *)&rwlock->lock);
> +	}
>  
> -	/* Extract my ticket value from the original word. */
> +	/* Take out the next ticket and extract my ticket value. */
> +	rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
>  	my_ticket_ = val >> WR_NEXT_SHIFT;
>  
> -	/*
> -	 * Wait until the "current" field matches our ticket, and
> -	 * there are no remaining readers.
> -	 */
> +	/* Wait until the "current" field matches our ticket. */
>  	for (;;) {
>  		u32 curr_ = val >> WR_CURR_SHIFT;
> -		u32 readers = val >> RD_COUNT_SHIFT;
> -		u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
> +		u32 delta = ((my_ticket_ - curr_) & WR_MASK);
>  		if (likely(delta == 0))
>  			break;
>  



^ permalink raw reply

* Re: [PATCH] arch/tile: fix rwlock so would-be write lockers don't block new readers
From: Chris Metcalf @ 2010-11-15 15:10 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: linux-kernel, Américo Wang, netdev, Cypher Wu
In-Reply-To: <1289832730.2607.87.camel@edumazet-laptop>

On 11/15/2010 9:52 AM, Eric Dumazet wrote:
> Le lundi 15 novembre 2010 à 09:18 -0500, Chris Metcalf a écrit :
>> This avoids a deadlock in the IGMP code where one core gets a read
>> lock, another core starts trying to get a write lock (thus blocking
>> new readers), and then the first core tries to recursively re-acquire
>> the read lock.
>>
>> We still try to preserve some degree of balance by giving priority
>> to additional write lockers that come along while the lock is held
>> for write, so they can all complete quickly and return the lock to
>> the readers.
>>
>> Signed-off-by: Chris Metcalf <cmetcalf@tilera.com>
>> ---
>> This should apply relatively cleanly to 2.6.26.7 source code too.
>>
>>  arch/tile/lib/spinlock_32.c |   29 ++++++++++++++++++-----------
>>  1 files changed, 18 insertions(+), 11 deletions(-)
>>
>> diff --git a/arch/tile/lib/spinlock_32.c b/arch/tile/lib/spinlock_32.c
>> index 485e24d..5cd1c40 100644
>> --- a/arch/tile/lib/spinlock_32.c
>> +++ b/arch/tile/lib/spinlock_32.c
>> @@ -167,23 +167,30 @@ void arch_write_lock_slow(arch_rwlock_t *rwlock, u32 val)
>>  	 * when we compare them.
>>  	 */
>>  	u32 my_ticket_;
>> +	u32 iterations = 0;
>>  
>> -	/* Take out the next ticket; this will also stop would-be readers. */
>> -	if (val & 1)
>> -		val = get_rwlock(rwlock);
>> -	rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
>> +	/*
>> +	 * Wait until there are no readers, then bump up the next
>> +	 * field and capture the ticket value.
>> +	 */
>> +	for (;;) {
>> +		if (!(val & 1)) {
>> +			if ((val >> RD_COUNT_SHIFT) == 0)
>> +				break;
>> +			rwlock->lock = val;
>> +		}
>> +		delay_backoff(iterations++);
> Are you sure a writer should have a growing delay_backoff() ?

We always do this bounded exponential backoff on all locking operations
that require memory-network traffic.  With 64 cores, it's possible
otherwise to get into a situation where the cores are attempting to acquire
the lock sufficiently aggressively that lock acquisition performance is
worse than it would be with backoff.  In any case this path is unlikely to
run many times, since it only triggers if two cores both try to pull a
ticket at the same time; it doesn't correspond to writers actually waiting
once they have their ticket, which is handled later in this function.
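
As a rough illustration of what "bounded exponential backoff" means here,
the following is a sketch and not the arch/tile delay_backoff()
implementation. The function names and the cap of 2^8 spin loops are
assumptions chosen for the example:

```c
#include <stdint.h>

/* Assumed cap for the example: never spin more than 2^8 loops per wait. */
#define BACKOFF_MAX_EXPONENT 8u

/*
 * Busy-wait loop count for a given retry number: no delay on the first
 * attempt, then doubling on each retry until the cap is reached.
 */
static uint32_t backoff_loops(uint32_t iterations)
{
	uint32_t exponent;

	if (iterations == 0)
		return 0;
	if (iterations >= BACKOFF_MAX_EXPONENT)
		exponent = BACKOFF_MAX_EXPONENT;
	else
		exponent = iterations + 1;
	return (uint32_t)1 << exponent;
}

/* Spin locally, without touching the contended lock word. */
static void backoff_wait(uint32_t iterations)
{
	volatile uint32_t spin = backoff_loops(iterations);

	while (spin > 0)
		spin--;
}
```

A caller would retry the lock operation in a loop and call
backoff_wait(iterations++) after each failed attempt, so that contending
cores spread out their accesses to the lock word.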

> It seems to me this only allows new readers to come in (adding more
> unfairness to the rwlock, which already favors readers over writers).

Well, that is apparently the required semantic.  Once there is one reader,
you must allow new readers, to handle the case of recursive
re-acquisition.  In principle you could imagine doing something like having
a bitmask of cores that held the readlock (instead of a count), and only
allowing recursive re-acquisition when a write lock request is pending, but
this would make the lock structure bigger, and I'm not sure it's worth it. 
x86 certainly doesn't bother.
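
The recursive case can be seen in a toy, single-threaded model of the lock
state. This is only a sketch: the struct and function names are invented for
illustration and none of it is the arch/tile code. It contrasts a policy
where a waiting writer shuts out new readers with one where readers ignore
writers that are merely waiting:

```c
#include <stdbool.h>

/* Toy model of the lock state; not the real arch_rwlock_t layout. */
struct rw_model {
	unsigned int readers;	/* read locks currently held */
	bool writer_waiting;	/* a writer has announced itself */
	bool writer_active;	/* a writer currently holds the lock */
};

/* Policy 1: a waiting writer blocks every new reader. */
static bool read_trylock_writer_first(struct rw_model *l)
{
	if (l->writer_active || l->writer_waiting)
		return false;
	l->readers++;
	return true;
}

/* Policy 2: only a writer that actually holds the lock blocks readers. */
static bool read_trylock_reader_first(struct rw_model *l)
{
	if (l->writer_active)
		return false;
	l->readers++;
	return true;
}

/* A writer may take the lock only once all read locks are released. */
static bool write_can_proceed(const struct rw_model *l)
{
	return l->readers == 0 && !l->writer_active;
}
```

In this model, if core A holds a read lock and core B sets writer_waiting,
then under policy 1 A's second read attempt fails while B cannot proceed
either, so neither makes progress. Under policy 2, A's recursive
acquisition succeeds and B simply waits until A drops both read locks.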

> Maybe allow one cpu to spin, and possibly queue the other 'writers' ?

Other than the brief spin to acquire the ticket, writers don't actually
spin on the lock.  They just wait for their ticket to come up.  This does
require spinning on memory reads, but those are satisfied out of the local
core's cache, with the exception that each time a writer completes, the
cache line is invalidated on the readers, and they have to re-fetch it from
the home cache.
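
For readers unfamiliar with the ticket scheme, here is a non-atomic sketch
of the arithmetic only. The 8-bit field width and the shift values are
assumptions for this example, and the helper names are hypothetical; the
real code performs the update while holding the lock word via tns:

```c
#include <stdint.h>

/* Assumed field layout, for illustration only. */
#define WR_WIDTH	8
#define WR_MASK		((1u << WR_WIDTH) - 1)
#define WR_NEXT_SHIFT	8
#define WR_CURR_SHIFT	16

/* Hand out the value of the "next" field and advance it, wrapping. */
static uint32_t take_ticket(uint32_t *lock_word)
{
	uint32_t val = *lock_word;
	uint32_t my_ticket = (val >> WR_NEXT_SHIFT) & WR_MASK;
	uint32_t next = (my_ticket + 1) & WR_MASK;

	*lock_word = (val & ~(WR_MASK << WR_NEXT_SHIFT)) |
		     (next << WR_NEXT_SHIFT);
	return my_ticket;
}

/* How many writers are still ahead of us; 0 means it is our turn. */
static uint32_t tickets_ahead(uint32_t lock_word, uint32_t my_ticket)
{
	uint32_t curr = lock_word >> WR_CURR_SHIFT;

	/* The mask makes the subtraction correct across wraparound. */
	return (my_ticket - curr) & WR_MASK;
}
```

A waiting writer only re-reads the lock word and recomputes
tickets_ahead() until it reaches zero, which is why the wait is mostly
served from the local cache.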

>> +		val = __insn_tns((int *)&rwlock->lock);
>> +	}
>>  
>> -	/* Extract my ticket value from the original word. */
>> +	/* Take out the next ticket and extract my ticket value. */
>> +	rwlock->lock = __insn_addb(val, 1 << WR_NEXT_SHIFT);
>>  	my_ticket_ = val >> WR_NEXT_SHIFT;
>>  
>> -	/*
>> -	 * Wait until the "current" field matches our ticket, and
>> -	 * there are no remaining readers.
>> -	 */
>> +	/* Wait until the "current" field matches our ticket. */
>>  	for (;;) {
>>  		u32 curr_ = val >> WR_CURR_SHIFT;
>> -		u32 readers = val >> RD_COUNT_SHIFT;
>> -		u32 delta = ((my_ticket_ - curr_) & WR_MASK) + !!readers;
>> +		u32 delta = ((my_ticket_ - curr_) & WR_MASK);
>>  		if (likely(delta == 0))
>>  			break;
>>  
>

-- 
Chris Metcalf, Tilera Corp.
http://www.tilera.com

^ permalink raw reply

* ipheth "UI" problem
From: Bastien Nocera @ 2010-11-15 15:13 UTC (permalink / raw)
  To: Diego Giagio; +Cc: netdev

Hey Diego,

ipheth recently got added to the kernel, and seeped into Fedora. I love
the functionality of it (even though I'm more likely to be using the
Bluetooth tethering, to be honest).

I already filed a bug against NetworkManager connecting to the interface
automatically:
https://bugzilla.gnome.org/show_bug.cgi?id=633465

My problem is that, while I have disabled NetworkManager's automatic
connection, the kernel module still connects to the device, and I have a
constant "Internet Tethering" appearing on my iPhone.

Wouldn't it be possible to only check for the service being offered by
the device when plugging it in, and actually do the connection to the
service when we want to use it (when the interface is upped)?

Cheers 

PS: Re-sent with the correct ML address, sorry Diego

^ permalink raw reply

* Re: NETPOLL on bond interfaces
From: sergey belov @ 2010-11-15 15:35 UTC (permalink / raw)
  To: Netdev
In-Reply-To: <AANLkTingysT7=+AiGCh4kiY27RK9jz40ci1y_of4o-FC@mail.gmail.com>

found this thread:
http://amailbox.org/mailarchive/linux-netdev/2010/6/25/6280002

and this:

static int disable_netpoll = 1; in drivers/net/bonding/bond_main.c

I have switched the value to 0, so never mind now :)



On Mon, Nov 15, 2010 at 4:49 PM, sergey belov <gexlie@gmail.com> wrote:
> Before 2.6.36 came out I was using this patch to enable netpoll on bond
> interfaces (see below)
>
> When I tried to apply this patch to the 2.6.36 sources, some hunks failed.
>
> [root@kernel-x32 experimental]# patch -p0 < netpoll.patch
> patching file a/drivers/net/bonding/bond_main.c
> Hunk #1 FAILED at 75.
> Hunk #2 FAILED at 416.
> Hunk #3 succeeded at 1334 with fuzz 2 (offset 23 lines).
> Hunk #4 succeeded at 1729 with fuzz 1 (offset 15 lines).
> Hunk #5 succeeded at 1837 (offset 42 lines).
> Hunk #6 succeeded at 2026 (offset 19 lines).
> Hunk #7 succeeded at 4490 with fuzz 1 (offset 11 lines).
> Hunk #8 succeeded at 4674 (offset 68 lines).
> 2 out of 8 hunks FAILED -- saving rejects to file
> a/drivers/net/bonding/bond_main.c.rej
> patching file a/drivers/net/bonding/bonding.h
> Hunk #2 succeeded at 227 (offset 4 lines).
> patching file a/include/linux/netdevice.h
> Hunk #1 FAILED at 52.
> Hunk #2 FAILED at 625.
> 2 out of 2 hunks FAILED -- saving rejects to file
> a/include/linux/netdevice.h.rej
> patching file a/include/linux/netpoll.h
> Hunk #1 succeeded at 43 with fuzz 2 (offset 9 lines).
> patching file a/net/core/netpoll.c
> Reversed (or previously applied) patch detected!  Assume -R? [n]
>
>
> Inspecting the changelog at
> http://www.kernel.org/pub/linux/kernel/v2.6/ChangeLog-2.6.36 I found that
> a lot of changes were made inside the bonding and netpoll subsystems.
>
> Could someone please provide me with a new patch to enable netpoll on bond
> ifaces, or send me a workaround to enable it some other way?
>
> We need to use it only to enable netconsole on our servers with bond.
>
>
> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
> index d927f71..6304720 100644
> --- a/drivers/net/bonding/bond_main.c
> +++ b/drivers/net/bonding/bond_main.c
> @@ -75,6 +75,7 @@
>  #include <linux/jiffies.h>
>  #include <net/route.h>
>  #include <net/net_namespace.h>
> +#include <linux/netpoll.h>
>  #include "bonding.h"
>  #include "bond_3ad.h"
>  #include "bond_alb.h"
> @@ -415,7 +416,12 @@ int bond_dev_queue_xmit(struct bonding *bond, struct sk_buff *skb,
>         }
>
>         skb->priority = 1;
> -       dev_queue_xmit(skb);
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +       if (bond->netpoll)
> +               netpoll_send_skb(bond->netpoll, skb);
> +       else
> +#endif
> +               dev_queue_xmit(skb);
>
>         return 0;
>  }
> @@ -1305,6 +1311,44 @@ static void bond_detach_slave(struct bonding *bond, struct slave *slave)
>         bond->slave_cnt--;
>  }
>
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +static int slaves_support_netpoll(struct net_device *bond_dev)
> +{
> +       struct bonding *bond = netdev_priv(bond_dev);
> +       struct slave *slave;
> +       int i;
> +
> +       bond_for_each_slave(bond, slave, i)
> +               if (!slave->dev->netdev_ops->ndo_poll_controller)
> +                       return 0;
> +       return 1;
> +}
> +
> +static void bond_poll_controller(struct net_device *bond_dev)
> +{
> +       struct bonding *bond = netdev_priv(bond_dev);
> +       struct slave *slave;
> +       int i;
> +
> +       if (slaves_support_netpoll(bond_dev))
> +               bond_for_each_slave(bond, slave, i)
> +                       netpoll_poll_dev(slave->dev);
> +}
> +
> +static int bond_netpoll_setup(struct net_device *bond_dev,
> +                             struct netpoll_info *npinfo)
> +{
> +       struct bonding *bond = netdev_priv(bond_dev);
> +       struct slave *slave;
> +       int i;
> +
> +       bond_for_each_slave(bond, slave, i)
> +               if (slave->dev)
> +                       slave->dev->npinfo = npinfo;
> +       return 0;
> +}
> +#endif
> +
>  /*---------------------------------- IOCTL ----------------------------------*/
>
>  static int bond_sethwaddr(struct net_device *bond_dev,
> @@ -1670,6 +1714,16 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>                         bond->primary_slave = new_slave;
>         }
>
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +       if (slaves_support_netpoll(bond_dev))
> +               slave_dev->npinfo = bond_dev->npinfo;
> +       else {
> +               pr_err(DRV_NAME "New slave device %s does not support netpoll.\n",
> +                      slave_dev->name);
> +               pr_err(DRV_NAME "netpoll disabled for %s.\n", bond_dev->name);
> +       }
> +#endif
> +
>         write_lock_bh(&bond->curr_slave_lock);
>
>         switch (bond->params.mode) {
> @@ -1741,6 +1795,10 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
>
>  /* Undo stages on error */
>  err_close:
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +       if (slave_dev->npinfo)
> +               slave_dev->npinfo = NULL;
> +#endif
>         dev_close(slave_dev);
>
>  err_unset_master:
> @@ -1949,6 +2007,10 @@ int bond_release(struct net_device *bond_dev, struct net_device *slave_dev)
>                                    IFF_SLAVE_INACTIVE | IFF_BONDING |
>                                    IFF_SLAVE_NEEDARP);
>
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +       if (slave_dev->npinfo)
> +               slave_dev->npinfo = NULL;
> +#endif
>         kfree(slave);
>
>         return 0;  /* deletion OK */
> @@ -4417,6 +4479,20 @@ out:
>         return 0;
>  }
>
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +int bond_netpoll_start_xmit(struct netpoll *np, struct sk_buff *skb)
> +{
> +       struct bonding *bond = netdev_priv(skb->dev);
> +       int ret;
> +
> +       bond->netpoll = np;
> +       ret = bond->dev->netdev_ops->ndo_start_xmit(skb, bond->dev);
> +       bond->netpoll = NULL;
> +
> +       return ret;
> +}
> +#endif
> +
>  /*------------------------- Device initialization ---------------------------*/
>
>  static void bond_set_xmit_hash_policy(struct bonding *bond)
> @@ -4530,6 +4606,9 @@ static const struct net_device_ops bond_netdev_ops = {
>         .ndo_change_mtu         = bond_change_mtu,
>         .ndo_set_mac_address    = bond_set_mac_address,
>         .ndo_neigh_setup        = bond_neigh_setup,
> +       .ndo_netpoll_setup      = bond_netpoll_setup,
> +       .ndo_netpoll_start_xmit = bond_netpoll_start_xmit,
> +       .ndo_poll_controller    = bond_poll_controller,
>         .ndo_vlan_rx_register   = bond_vlan_rx_register,
>         .ndo_vlan_rx_add_vid    = bond_vlan_rx_add_vid,
>         .ndo_vlan_rx_kill_vid   = bond_vlan_rx_kill_vid,
> diff --git a/drivers/net/bonding/bonding.h b/drivers/net/bonding/bonding.h
> index 6290a50..563d28c 100644
> --- a/drivers/net/bonding/bonding.h
> +++ b/drivers/net/bonding/bonding.h
> @@ -18,6 +18,7 @@
>  #include <linux/timer.h>
>  #include <linux/proc_fs.h>
>  #include <linux/if_bonding.h>
> +#include <linux/netpoll.h>
>  #include <linux/kobject.h>
>  #include <linux/in6.h>
>  #include "bond_3ad.h"
> @@ -222,6 +223,10 @@ struct bonding {
>  #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE)
>         struct   in6_addr master_ipv6;
>  #endif
> +#ifdef CONFIG_NET_POLL_CONTROLLER
> +       struct   netpoll *netpoll;
> +#endif
> +
>  };
>
>  /**
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index d4a4d98..e1724b2 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -52,6 +52,7 @@
>
>  struct vlan_group;
>  struct netpoll_info;
> +struct netpoll;
>  /* 802.11 specific */
>  struct wireless_dev;
>                                         /* source back-compat hooks */
> @@ -624,6 +625,10 @@ struct net_device_ops {
>                                                         unsigned short vid);
>  #ifdef CONFIG_NET_POLL_CONTROLLER
>  #define HAVE_NETDEV_POLL
> +       int                     (*ndo_netpoll_setup)(struct net_device *dev,
> +                                                    struct netpoll_info *npinfo);
> +       int                     (*ndo_netpoll_start_xmit)(struct netpoll *np,
> +                                                         struct sk_buff *skb);
>         void                    (*ndo_poll_controller)(struct net_device *dev);
>  #endif
>  #if defined(CONFIG_FCOE) || defined(CONFIG_FCOE_MODULE)
> diff --git a/include/linux/netpoll.h b/include/linux/netpoll.h
> index 2524267..39d42e4 100644
> --- a/include/linux/netpoll.h
> +++ b/include/linux/netpoll.h
> @@ -34,7 +34,9 @@ struct netpoll_info {
>  };
>
>  void netpoll_poll(struct netpoll *np);
> +void netpoll_poll_dev(struct net_device *dev);
>  void netpoll_send_udp(struct netpoll *np, const char *msg, int len);
> +void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb);
>  void netpoll_print_options(struct netpoll *np);
>  int netpoll_parse_options(struct netpoll *np, char *opt);
>  int netpoll_setup(struct netpoll *np);
> diff --git a/net/core/netpoll.c b/net/core/netpoll.c
> index 9675f31..3776b26 100644
> --- a/net/core/netpoll.c
> +++ b/net/core/netpoll.c
> @@ -174,9 +174,8 @@ static void service_arp_queue(struct netpoll_info *npi)
>         }
>  }
>
> -void netpoll_poll(struct netpoll *np)
> +void netpoll_poll_dev(struct net_device *dev)
>  {
> -       struct net_device *dev = np->dev;
>         const struct net_device_ops *ops;
>
>         if (!dev || !netif_running(dev))
> @@ -196,6 +195,11 @@ void netpoll_poll(struct netpoll *np)
>         zap_completion_queue();
>  }
>
> +void netpoll_poll(struct netpoll *np)
> +{
> +       netpoll_poll_dev(np->dev);
> +}
> +
>  static void refill_skbs(void)
>  {
>         struct sk_buff *skb;
> @@ -277,11 +281,11 @@ static int netpoll_owner_active(struct net_device *dev)
>         return 0;
>  }
>
> -static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
> +void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
>  {
>         int status = NETDEV_TX_BUSY;
>         unsigned long tries;
> -       struct net_device *dev = np->dev;
> +       struct net_device *dev = skb->dev;
>         const struct net_device_ops *ops = dev->netdev_ops;
>         struct netpoll_info *npinfo = np->dev->npinfo;
>
> @@ -303,7 +307,10 @@ static void netpoll_send_skb(struct netpoll *np, struct sk_buff *skb)
>                      tries > 0; --tries) {
>                         if (__netif_tx_trylock(txq)) {
>                                 if (!netif_tx_queue_stopped(txq)) {
> -                                       status = ops->ndo_start_xmit(skb, dev);
> +                                       if (ops->ndo_netpoll_start_xmit)
> +                                               status = ops->ndo_netpoll_start_xmit(np,skb);
> +                                       else
> +                                               status = ops->ndo_start_xmit(skb, dev);
>                                         if (status == NETDEV_TX_OK)
>                                                 txq_trans_update(txq);
>                                 }
> @@ -789,6 +796,9 @@ int netpoll_setup(struct netpoll *np)
>         /* avoid racing with NAPI reading npinfo */
>         synchronize_rcu();
>
> +       if (ndev->netdev_ops->ndo_netpoll_setup)
> +               ndev->netdev_ops->ndo_netpoll_setup(ndev, npinfo);
> +
>         return 0;
>
>   release:
> @@ -859,4 +869,6 @@ EXPORT_SYMBOL(netpoll_parse_options);
>  EXPORT_SYMBOL(netpoll_setup);
>  EXPORT_SYMBOL(netpoll_cleanup);
>  EXPORT_SYMBOL(netpoll_send_udp);
> +EXPORT_SYMBOL(netpoll_send_skb);
>  EXPORT_SYMBOL(netpoll_poll);
> +EXPORT_SYMBOL(netpoll_poll_dev);
>

^ permalink raw reply

* [PATCH net-next-2.6] clarify documentation for net.ipv4.igmp_max_memberships
From: Jeremy Eder @ 2010-11-15 15:41 UTC (permalink / raw)
  To: netdev
  Cc: rdunlap, davem, opurdila, apetlund, William.Allen.Simpson,
	ian.campbell, linux-doc, linux-kernel, Jiri Pirko

This patch helps clarify documentation for
net.ipv4.igmp_max_memberships by providing a formula for
calculating the maximum number of multicast groups that can be
subscribed to, plus defining the theoretical limit.
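
As a quick check of the arithmetic, the bound can be computed with a small
helper. This is only a sketch: the function name is hypothetical, and the
figures passed to it (a 65536-byte datagram, 24 bytes of headers, 12-byte
minimum group records) are the ones used in the documentation text added
by the patch:

```c
#include <stdint.h>

/*
 * Upper bound on the number of group records that fit in one datagram.
 * Integer division rounds down, since a partial record cannot be sent.
 */
static uint32_t igmp_max_groups(uint32_t datagram_bytes,
				uint32_t header_bytes,
				uint32_t record_bytes)
{
	if (record_bytes == 0 || header_bytes > datagram_bytes)
		return 0;
	return (datagram_bytes - header_bytes) / record_bytes;
}
```

igmp_max_groups(65536, 24, 12) evaluates to 5459. Larger headers (for
example IP options) or longer group records reduce the result.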





Signed-off-by: Jiri Pirko <jpirko@redhat.com>
Signed-off-by: Jeremy Eder <jeder@redhat.com>


diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index fe95105..ae55227 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -707,10 +707,28 @@ igmp_max_memberships - INTEGER
 	Change the maximum number of multicast groups we can subscribe to.
 	Default: 20
 
-conf/interface/*  changes special settings per interface (where "interface" is
-		  the name of your network interface)
-conf/all/*	  is special, changes the settings for all interfaces
+	Theoretical maximum value is bounded by having to send a membership
+	report in a single datagram (i.e. the report can't span multiple
+	datagrams, or risk confusing the switch and leaving groups you don't
+	intend to).
 
+	The number of supported groups 'M' is bounded by the number of group
+	report entries you can fit into a single datagram of 65535 bytes.
+
+	M = (65536 - sizeof(ip header)) / sizeof(Group record)
+
+	Group records are variable length, with a minimum of 12 bytes.
+	So net.ipv4.igmp_max_memberships should not be set higher than:
+
+	(65536-24) / 12 = 5459
+
+	The value 5459 assumes no IP header options, so in practice
+	this number may be lower.
+
+	conf/interface/*  changes special settings per interface (where
+	"interface" is the name of your network interface)
+
+	conf/all/*	  is special, changes the settings for all interfaces
 
 log_martians - BOOLEAN
 	Log packets with impossible addresses to kernel log.



^ permalink raw reply related

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Eric Paris @ 2010-11-15 15:47 UTC (permalink / raw)
  To: Patrick McHardy
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji
In-Reply-To: <4CE10C2A.1050801@trash.net>

On Mon, 2010-11-15 at 11:32 +0100, Patrick McHardy wrote:
> On 13.11.2010 00:14, Hua Zhong wrote:
> >> On 11.11.2010 22:58, Hua Zhong wrote:
> >>>> Yes, I realize this is little different than if the
> >>>> SYN was dropped in the first network device, but it is different
> >>>> because we know what happened!  We know that connect() call failed
> >>>> and that there isn't anything coming back.
> >>>
> >>> I would argue that -j DROP should behave exactly as the packet is
> >> dropped in the network, while -j REJECT should signal the failure to
> >> the application as soon as possible (which it doesn't seem to do).
> >>
> >> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
> > 
> > Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
> > TCP reset.
> 
> Of course there is.
> 
> ICMP (default):
> 
> iptables -A OUTPUT -p tcp -j REJECT
> 
> TCP reset:
> 
> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
> 
> The second one will cause a hard error for the connection.

Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
anything with a local outgoing connection but --reject-with tcp-reset
does something like what I'm looking for.

I notice the heavy lifting for this is done in
net/ipv4/netfilter/ipt_REJECT.c::send_reset()
(and something very similar for IPv6)

I really don't want to duplicate that code into SELinux (for obvious
reasons) and I'm wondering if anyone has objections to me making it
available outside of netlink and/or suggestions on how to make that code
available outside of netfilter (i.e. which header should expose it, and
does it still make logical sense in ipt_REJECT.c or somewhere else?)

-Eric

^ permalink raw reply

* Re: [RFC PATCH] network: return errors if we know tcp_connect failed
From: Patrick McHardy @ 2010-11-15 15:57 UTC (permalink / raw)
  To: Eric Paris
  Cc: Hua Zhong, netdev, linux-kernel, davem, kuznet, pekkas, jmorris,
	yoshfuji
In-Reply-To: <1289836066.14282.7.camel@localhost.localdomain>

On 15.11.2010 16:47, Eric Paris wrote:
> On Mon, 2010-11-15 at 11:32 +0100, Patrick McHardy wrote:
>> On 13.11.2010 00:14, Hua Zhong wrote:
>>>> On 11.11.2010 22:58, Hua Zhong wrote:
>>>>>> Yes, I realize this is little different than if the
>>>>>> SYN was dropped in the first network device, but it is different
>>>>>> because we know what happened!  We know that connect() call failed
>>>>>> and that there isn't anything coming back.
>>>>>
>>>>> I would argue that -j DROP should behave exactly as the packet is
>>>> dropped in the network, while -j REJECT should signal the failure to
>>>> the application as soon as possible (which it doesn't seem to do).
>>>>
>>>> It sends an ICMP error or TCP reset. Interpretation is up to TCP.
>>>
>>> Huh? It's the OUTPUT chain we are talking about. There is no ICMP error or
>>> TCP reset.
>>
>> Of course there is.
>>
>> ICMP (default):
>>
>> iptables -A OUTPUT -p tcp -j REJECT
>>
>> TCP reset:
>>
>> iptables -A OUTPUT -p tcp -j REJECT --reject-with tcp-reset
>>
>> The second one will cause a hard error for the connection.
> 
> Well I'm (I guess?) surprised that the --reject-with icmp doesn't do
> anything with a local outgoing connection but --reject-with tcp-reset
> does something like what I'm looking for.
> 
> I notice the heavy lifting for this is done in 
> net/ipv4/netfilter/ipt_REJECT.c::send_reset()
> (and something very similar for IPv6)
> 
> I really don't want to duplicate that code into SELinux (for obvious
> reasons) and I'm wondering if anyone has objections to me making it
> available outside of netfilter and/or suggestions on how to make that code
> available outside of netfilter (aka what header to expose it, and does
> it still make logical sense in ipt_REJECT.c or somewhere else?)

I don't think having SELinux sending packets to handle local
connections is a very elegant design, it's not a firewall after
all. What's wrong with reacting only to specific errno codes
in tcp_connect()? You could f.i. return -ECONNREFUSED from
SELinux, that one is pretty much guaranteed not to occur in
the network stack itself and can be returned directly.

That would need minor changes to nf_hook_slow so we can
encode errno values in the upper 16 bits of the verdict,
as we already do with the queue number. The added benefit
is that we don't have to return EPERM anymore when f.i.
rerouting fails.
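
A minimal sketch of the encoding proposed above, written as standalone C
rather than kernel code. The SKETCH_* names and the helper functions are
hypothetical and exist only for illustration; the only things taken from
the kernel are NF_DROP == 0 and the idea of using the upper 16 bits, as
is already done for the queue number. A hook would return a DROP verdict
with the errno in the upper 16 bits, and the caller would fall back to
-EPERM when no errno was encoded:

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical names for this sketch; not the kernel's own macros. */
#define SKETCH_NF_DROP       0u          /* same value as the kernel's NF_DROP */
#define SKETCH_VERDICT_MASK  0x0000ffffu /* low 16 bits: the verdict itself */
#define SKETCH_VERDICT_SHIFT 16          /* high 16 bits: extra data (errno here) */

/* Build a DROP verdict carrying a positive errno value in the upper 16 bits. */
static inline uint32_t sketch_drop_err(int err)
{
	assert(err > 0 && err <= 0xffff);
	return ((uint32_t)err << SKETCH_VERDICT_SHIFT) | SKETCH_NF_DROP;
}

/* Extract the plain verdict, ignoring whatever is stored in the upper bits. */
static inline uint32_t sketch_verdict(uint32_t v)
{
	return v & SKETCH_VERDICT_MASK;
}

/* Map a drop verdict to a negative errno; -EPERM if none was encoded. */
static inline int sketch_drop_errno(uint32_t v)
{
	int err = (int)(v >> SKETCH_VERDICT_SHIFT);

	return err ? -err : -EPERM;
}
```

With these helpers, sketch_drop_errno(sketch_drop_err(ECONNREFUSED))
yields -ECONNREFUSED, while a plain SKETCH_NF_DROP still maps to -EPERM,
so existing hooks would keep their current behaviour.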

^ permalink raw reply

* [PATCH 1/1] net: rtnetlink.h -- only include linux/netdevice.h when used by the kernel
From: Andy Whitcroft @ 2010-11-15 16:01 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet
  Cc: netdev, linux-kernel, Andy Whitcroft, Tim Gardner
In-Reply-To: <1289836919-19153-1-git-send-email-apw@canonical.com>

The commit below added a new helper dev_ingress_queue to cleanly obtain the
ingress queue pointer.  This necessitated including 'linux/netdevice.h':

  commit 24824a09e35402b8d58dcc5be803a5ad3937bdba
  Author: Eric Dumazet <eric.dumazet@gmail.com>
  Date:   Sat Oct 2 06:11:55 2010 +0000

    net: dynamic ingress_queue allocation

However this include triggers issues for applications in userspace
which use the rtnetlink interfaces.  Commonly this requires they include
'net/if.h' and 'linux/rtnetlink.h' leading to a compiler error as below:

  In file included from /usr/include/linux/netdevice.h:28:0,
                   from /usr/include/linux/rtnetlink.h:9,
                   from t.c:2:
  /usr/include/linux/if.h:135:8: error: redefinition of ‘struct ifmap’
  /usr/include/net/if.h:112:8: note: originally defined here
  /usr/include/linux/if.h:169:8: error: redefinition of ‘struct ifreq’
  /usr/include/net/if.h:127:8: note: originally defined here
  /usr/include/linux/if.h:218:8: error: redefinition of ‘struct ifconf’
  /usr/include/net/if.h:177:8: note: originally defined here

The new helper is only defined for the kernel and protected by __KERNEL__
therefore we can simply pull the include down into the same protected
section.

Signed-off-by: Andy Whitcroft <apw@canonical.com>
---
 include/linux/rtnetlink.h |    2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/include/linux/rtnetlink.h b/include/linux/rtnetlink.h
index d42f274..bbad657 100644
--- a/include/linux/rtnetlink.h
+++ b/include/linux/rtnetlink.h
@@ -6,7 +6,6 @@
 #include <linux/if_link.h>
 #include <linux/if_addr.h>
 #include <linux/neighbour.h>
-#include <linux/netdevice.h>

 /* rtnetlink families. Values up to 127 are reserved for real address
  * families, values above 128 may be used arbitrarily.
@@ -606,6 +605,7 @@ struct tcamsg {
 #ifdef __KERNEL__

 #include <linux/mutex.h>
+#include <linux/netdevice.h>

 static __inline__ int rtattr_strcmp(const struct rtattr *rta, const char *str)
 {
-- 
1.7.0.4

^ permalink raw reply related
