Netdev List
 help / color / mirror / Atom feed
* Re: [PATCH] net: Fix some fallout from the etner_addr_copy() changes.
From: François-Xavier Le Bail @ 2014-01-22  4:07 UTC (permalink / raw)
  To: netdev, David Miller
In-Reply-To: <20140121.190348.1653034020095660800.davem@davemloft.net>

On Tue, 1/21/14, David Miller <davem@davemloft.net> wrote:

> Subject: [PATCH] net: Fix some fallout from the etner_addr_copy() changes.
 
s/etner_addr_copy/ether_addr_copy.

Cheers,
Francois-Xavier

^ permalink raw reply

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: dormando @ 2014-01-22  4:10 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Alexei Starovoitov, Eric Dumazet, netdev,
	linux-kernel@vger.kernel.org
In-Reply-To: <CAMEtUuzq+KhzZfxzAo=fgYr9umtQCPx-4Jp_twtrM-eh1Bv14w@mail.gmail.com>

[-- Attachment #1: Type: TEXT/PLAIN, Size: 11946 bytes --]



On Tue, 21 Jan 2014, Alexei Starovoitov wrote:

> On Tue, Jan 21, 2014 at 5:39 PM, dormando <dormando@rydia.net> wrote:
> >
> > > On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> > > >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> > > >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> > > >> > > Hi,
> > > >> > >
> > > >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> > > >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> > > >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> > > >> > >
> > > >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> > > >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode
> ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp
> pps_core mdio
> > > >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> > > >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> > > >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> > > >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> > > >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> > > >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> > > >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> > > >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> > > >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> > > >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> > > >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> > > >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > > >> > > <4>[196727.311713] Stack:
> > > >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> > > >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> > > >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> > > >> > > <4>[196727.311885] Call Trace:
> > > >> > > <4>[196727.311907]  <IRQ>
> > > >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> > > >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> > > >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> > > >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> > > >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> > > >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> > > >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> > > >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> > > >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> > > >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> > > >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> > > >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> > > >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> > > >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> > > >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> > > >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> > > >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> > > >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> > > >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> > > >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> > > >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> > > >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> > > >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> > > >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> > > >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> > > >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> > > >> > > <4>[196727.312722]  <EOI>
> > > >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> > > >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> > > >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> > > >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> > > >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01
> 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> > > >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> > > >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> > > >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> > > >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> > > >> > >
> > > >> > >
> > > >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> > > >> > > am trying a revert or two.
> > > >> > >
> > > >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> > > >> > > crashed, so the change exists between .15 and .25.
> > > >> >
> > > >> > Please try following fix, thanks for the report !
> > > >> >
> > > >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > >> > index 25071b48921c..78a50a22298a 100644
> > > >> > --- a/net/ipv4/route.c
> > > >> > +++ b/net/ipv4/route.c
> > > >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> > > >> > *dst)
> > > >> >
> > > >> >     if (!list_empty(&rt->rt_uncached)) {
> > > >> >             spin_lock_bh(&rt_uncached_lock);
> > > >> > -           list_del(&rt->rt_uncached);
> > > >> > +           list_del_init(&rt->rt_uncached);
> > > >> >             spin_unlock_bh(&rt_uncached_lock);
> > > >> >     }
> > > >> >  }
> > > >> >
> > > >>
> > > >> Problem could come from this commit, in linux 3.10.23,
> > > >> you also could try to revert it
> > > >>
> > > >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> > > >> Author: Alexei Starovoitov <ast@plumgrid.com>
> > > >> Date:   Tue Nov 19 19:12:34 2013 -0800
> > > >>
> > > >>     ipv4: fix race in concurrent ip_route_input_slow()
> > > >>
> > > >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> > > >>
> > > >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> > > >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> > > >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> > > >>     via rt_cache_route()
> > > >>     Most of the time they succeed, but on occasion the following two lines:
> > > >>         orig = *p;
> > > >>         prev = cmpxchg(p, orig, rt);
> > > >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> > > >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> > > >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> > > >>     refcnt doesn't go to zero, which can be seen in the logs as:
> > > >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> > > >>     Adding mdelay() between above two lines makes it easily reproducible.
> > > >>     Fix it similar to nh_pcpu_rth_output case.
> > > >>
> > > >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> > > >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> > > >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> > > >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> > > >>
> > > >
> > > > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> > > > my best guess. I have a kernel running in production with only this
> > > > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> > > >
> > > > I'm building a kernel now with your route patch, but 62713c4 *not*
> > > > reverted, which I can throw on a different machine. Does this sound like a
> > > > good idea?
> > >
> > > the traces of your crash don't look similar to dst leak that was fixed by
> > > commit 62713c4...
> > >
> > > To trigger the 'fix' logic of the 62713c4 add the following diff:
> > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> > > index f6c6ab1..8972676 100644
> > > --- a/net/ipv4/route.c
> > > +++ b/net/ipv4/route.c
> > > @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> > > struct rtable *rt)
> > >                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
> > >         }
> > >         orig = *p;
> > > -
> > > +       mdelay(100);
> > >         prev = cmpxchg(p, orig, rt);
> > >         if (prev == orig) {
> > >                 if (orig)
> > >
> > > I've been running with it for a day without issues.
> > > Note that it will stress both 'input' and 'output' ways of adding dsts to
> > > rt_uncached list...
> > > and 'output' was there for ages.
> > >
> > > If mdelay() helps to reproduce it in minutes instead of days
> > > then we're on the right path.
> > > Could you share details of your workload?
> > > In our case it was a lot of namespaces with a ton of processes
> > > talking to each other, forcefully killed and restarted.
> > > Do you see the same crash in the latest tree?
> > >
> > > PS sorry for double posts. netdev email bounced few times for me...
> > >
> >
> > So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
> > my machines have survived. One four days, another two days. They'd barely
> > make it a day before.
> >
> > So this patch is introducing a new problem. Weakening the dst reference
> > too much in a false positive case?
> >
> > I'm not entirely sure if the mdelay(100) thing will help much. I can try
> > it but it's kind of obnoxious to reboot these machines (far away, ipmi
> > only) too often. Unless you folks have any new ideas before I go off and
> > do that?
>
> mdelay() won't fix things, but it will help debugging. Can you try it on just one machine? It should fail within minutes.
>
> Only afterwards we can try alternative fix like:
> @@ -1722,8 +1722,8 @@ local_input:
>         }
>         if (do_cache) {
>                 if (unlikely(!rt_cache_route(&FIB_RES_NH(res), rth))) {
> -                       rth->dst.flags |= DST_NOCACHE;
> -                       rt_add_uncached_list(rth);
> +                       rt_free(rth);
> +                       goto local_input;
>                 }
>         }
>         skb_dst_set(skb, &rth->dst);
>
>
>

It might take a day or three to find a good time to do it. I don't have
access to reboot the machines myself. Oh how I wish things broke in the
lab more often :/

To be clear, I add your old patch back in + the mdelay. If that fails
within a few minutes, I add this new patch instead of your old one?

^ permalink raw reply

* Re: r8169 won't transmit with 3.12
From: Craig Small @ 2014-01-22  4:00 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Realtek linux nic maintainers, netdev
In-Reply-To: <20140121233630.GA3214@electric-eye.fr.zoreil.com>

If it helps, I know the tx_dropped is happening on line 5837 of r8169.c

			rtl8169_unmap_tx_skb(&tp->pci_dev->dev, tx_skb,
					     tp->TxDescArray + entry);
			if (skb) {
				tp->dev->stats.tx_dropped++;
				dev_kfree_skb(skb);
				tx_skb->skb = NULL;
			}


On Wed, Jan 22, 2014 at 12:36:30AM +0100, Francois Romieu wrote:
> Craig Small <csmall@enc.com.au> :
> [...]
> > It's a new setup so it might of never worked. It's not likely to be a
> > hardware problem as its three different devices.
> 
> You may check that the onboard nic device does not even work when the
> extra PCI-e card aren't plugged.
> 
> > I've sent what I think you might need for starters, but if there
> > is extra stuff you'd like to see, let me know.
> 
> A complete dmesg including the XID lines that the r8169 driver prints
> would be welcome.
> 
> > The problem shows up the same, the TX dropped counter increments.
> > I'm not sure why 42 packets made it out (or even if they really did)
> > Receive works fine, I can even start up wireshark and see packets
> > pass by.
> > 
> > eth0      Link encap:Ethernet  HWaddr 00:e0:4c:80:66:57  
> >           inet addr:192.168.1.222  Bcast:192.168.1.255  Mask:255.255.255.0
> >           inet6 addr: fe80::2e0:4cff:fe80:6657/64 Scope:Link
> >           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
> >           RX packets:8252 errors:0 dropped:0 overruns:0 frame:0
> >           TX packets:42 errors:0 dropped:8029 overruns:0 carrier:0
> 
> There are very few places where the r8169 driver modifies tx_dropped.
> 
> Please increase the driver verbosity with the 'msglvl' option of ethtool.
> 
> [...]
> > ethtool -S shows a similar story, not sure if the rx_missed counter
> > is another problem:
> 
> It seems safe to ignore as long as it does not change after startup.
> 
> -- 
> Ueimor

-- 
Craig Small (@smallsees)   http://enc.com.au/       csmall at : enc.com.au
Debian GNU/Linux           http://www.debian.org/   csmall at : debian.org
GPG fingerprint:        5D2F B320 B825 D939 04D2  0519 3938 F96B DF50 FEA5

^ permalink raw reply

* [PATCH net-next v3] ipv6: enable anycast addresses as source addresses for datagrams
From: Francois-Xavier Le Bail @ 2014-01-22  3:30 UTC (permalink / raw)
  To: netdev; +Cc: Hannes Frederic Sowa, David Stevens, David Miller

This change allows to consider an anycast address valid as source address
when given via an IPV6_PKTINFO or IPV6_2292PKTINFO ancillary data item.
So, when sending a datagram with ancillary data, the unicast and anycast
addresses are handled in the same way.

- Adds ipv6_chk_acast_addr_src() to check if an anycast address is link-local
  on given interface or is global.
- Uses it in ip6_datagram_send_ctl().

Signed-off-by: Francois-Xavier Le Bail <fx.lebail@yahoo.com>
---
v3: Consideration of Hannes's review (thanks!):
    - Uses only ipv6_chk_acast_addr (rcu_read_lock needed).

Typical usage :
A server uses IPV6_RECVPKTINFO socket option to get ancillary data with
recvmsg() and can use sendmsg() to reply with anycast adress as source address
in the same way it does for unicast.

 include/net/addrconf.h |    5 +++--
 net/ipv6/anycast.c     |   11 +++++++++++
 net/ipv6/datagram.c    |    4 +++-
 3 files changed, 17 insertions(+), 3 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index 66c4a44..50e39a8 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -205,8 +205,9 @@ void ipv6_sock_ac_close(struct sock *sk);
 int ipv6_dev_ac_inc(struct net_device *dev, const struct in6_addr *addr);
 int __ipv6_dev_ac_dec(struct inet6_dev *idev, const struct in6_addr *addr);
 bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
-				const struct in6_addr *addr);
-
+			 const struct in6_addr *addr);
+bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
+			     const struct in6_addr *addr);
 
 /* Device notifier */
 int register_inet6addr_notifier(struct notifier_block *nb);
diff --git a/net/ipv6/anycast.c b/net/ipv6/anycast.c
index 5a80f15..3139580 100644
--- a/net/ipv6/anycast.c
+++ b/net/ipv6/anycast.c
@@ -383,6 +383,17 @@ bool ipv6_chk_acast_addr(struct net *net, struct net_device *dev,
 	return found;
 }
 
+/*	check if this anycast address is link-local on given interface or
+ *	is global
+ */
+bool ipv6_chk_acast_addr_src(struct net *net, struct net_device *dev,
+			     const struct in6_addr *addr)
+{
+	if (ipv6_addr_type(addr) & IPV6_ADDR_LINKLOCAL)
+		return ipv6_chk_acast_addr(net, dev, addr);
+	else
+		return ipv6_chk_acast_addr(net, NULL, addr);
+}
 
 #ifdef CONFIG_PROC_FS
 struct ac6_iter_state {
diff --git a/net/ipv6/datagram.c b/net/ipv6/datagram.c
index cd8699b..2615197 100644
--- a/net/ipv6/datagram.c
+++ b/net/ipv6/datagram.c
@@ -689,7 +689,9 @@ int ip6_datagram_send_ctl(struct net *net, struct sock *sk,
 				int strict = __ipv6_addr_src_scope(addr_type) <= IPV6_ADDR_SCOPE_LINKLOCAL;
 				if (!(inet_sk(sk)->freebind || inet_sk(sk)->transparent) &&
 				    !ipv6_chk_addr(net, &src_info->ipi6_addr,
-						   strict ? dev : NULL, 0))
+						   strict ? dev : NULL, 0) &&
+				    !ipv6_chk_acast_addr_src(net, dev,
+							     &src_info->ipi6_addr))
 					err = -EINVAL;
 				else
 					fl6->saddr = src_info->ipi6_addr;

^ permalink raw reply related

* Weirdness with combined SNAT/DNAT, bridging and promiscuous mode
From: Philippe Troin @ 2014-01-22  3:22 UTC (permalink / raw)
  To: netdev@vger.kernel.org

I ran into a weird bridging issue that I wanted to share with this list.
This was found on Fedora 19 running with the latest 3.12.8.200.fc19
x86_64 kernel.

I have the following network (simplified):

  +-----+
  | ex1 |
  +-----+
     | 10.0.0.2/24
     |
     | 10.0.0.1/24
  +--------+
  | router |
  +--------+
     | 192.168.1.1/24
     |
     | 192.168.1.2/24
  +-----+
  | in1 |
  +-----+

Ex1 is on the exterior (internet), in1 on the interior sides.
I want to forward a port from 10.0.0.1:2222 to 192.168.1.2:22.
I use an iptables DNAT target on the router's nat PREROUTING chain, all
is fine, when I connect from ex1 to 10.0.0.1:2222, it works.

If I want to forward connections from router's 10.0.0.1:2222 to in's
192.168.1.2:22, I add the same rule DNAT to the OUTPUT chain.

If now I also want to allow connections from in1 to 10.0.0.1:2222 to be
looped back to in1's port 22, it's also possible by adding an extra SNAT
rule.

At this point, I have the following network scripts:

      * ex1
                ip link set up dev eth0
                ip addr add 10.0.0.2/24 dev eth0
                
      * router
                ip link set up dev eth0
                ip addr add 10.0.0.1/24 dev eth0
                ip link set up dev eth1
                ip addr add 192.168.1.1/24 dev eth1
                iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j SNAT --to-source 10.0.0.1
                iptables -t nat -A OUTPUT -p tcp -d 10.0.0.1 --dport 2222 -j DNAT --to 192.168.1.2:22
                iptables -t nat -A PREROUTING -p tcp -d 10.0.0.1 --dport 2222 -j DNAT --to 192.168.1.2:22
                iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -d 192.168.1.2 -p tcp --dport 22 -j SNAT --to-source 192.168.1.1
                sysctl net.ipv4.ip_forward=1
                
                
      * in1
                ip link set up dev eth0
                ip addr add 192.168.1.2/24 dev eth0
                ip route add 0.0.0.0/0 via 192.168.1.1

>From all three machines, connecting to 10.0.0.1:22 will forward the TCP
connection to 192.168.1.2:22.

Now, let's introduce a bridge. If make eth1 join a bridge br1, the
script for the router machine becomes:

        brctl addbr br1
        brctl addif br1 eth1
        
        ip link set up dev eth0
        ip addr add 10.0.0.1/24 dev eth0
        
        ip link set up dev eth1
        ip link set up dev br1
        ip addr add 192.168.1.1/24 dev br1
        
        iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j SNAT --to-source 10.0.0.1
        iptables -t nat -A OUTPUT -p tcp -d 10.0.0.1 --dport 2222 -j DNAT --to 192.168.1.2:22
        iptables -t nat -A PREROUTING -p tcp -d 10.0.0.1 --dport 2222 -j DNAT --to 192.168.1.2:22
        iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -d 192.168.1.2 -p tcp --dport 22 -j SNAT --to-source 192.168.1.1
        sysctl net.ipv4.ip_forward=1

And now, although I can connect to 10.0.0.1:22 fine from ex1 and router,
but connecting from in1 fails.
The connection hangs.
Tcpdump (without turning promiscuous mode on) shows that the syn packets
from in1 show up on router's eth1 interface, but are not visible to the
br1 bridge.  The MAC addresses on the frames seem to correct (both br1
and eth1 have the same MAC in this set-up, by default).

Oddly enough, turning on promiscuous mode on br1 makes everything work.
Changing promiscuous mode on eth1 has no effect.

Why would the promiscuous mode on the bridge itself change the behavior
of the iptables script above?

I could understand promiscuous mode and physical interface's MAC
filters, but I thought that promiscuous mode on a bridge interface was a
no-op.

A quick grep shows that IFF_PROMISC is used in the bridging code in
br_pass_frame_up() and br_handle_frame_finish(), both in br_input.c.

Am I supposed to turn on promiscuous mode on bridge interfaces, or is it
something else (bug in my iptables script, bug in bridging)?

Thanks for reading this long-winded email.
Phil.

^ permalink raw reply

* [PATCH 2/2] net: dm9000: Only call PHY reset for TYPE-B on shutdown
From: Chris Ruehl @ 2014-01-22  3:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, davem, Chris Ruehl
In-Reply-To: <1390360813-29185-1-git-send-email-chris.ruehl@gtsys.com.hk>

Unconditional call of PHY reset can triggers a fault to detect
the link for DM9000A on reboot, only a hard reset can solve it.
This patch check the version of the chip and call the PHY reset
only for the B version of the chip.

Signed-off-by: Chris Ruehl <chris.ruehl@gtsys.com.hk>
---
 drivers/net/ethernet/davicom/dm9000.c |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
index 0349b91..55a2e9c 100644
--- a/drivers/net/ethernet/davicom/dm9000.c
+++ b/drivers/net/ethernet/davicom/dm9000.c
@@ -1333,7 +1333,8 @@ dm9000_shutdown(struct net_device *dev)
 	board_info_t *db = netdev_priv(dev);
 
 	/* RESET device */
-	dm9000_phy_write(dev, 0, MII_BMCR, BMCR_RESET);	/* PHY RESET */
+	if (db->type == TYPE_DM9000B)
+		dm9000_phy_write(dev, 0, MII_BMCR, BMCR_RESET);	/* PHY RESET */
 	iow(db, DM9000_GPR, 0x01);	/* Power-Down PHY */
 	iow(db, DM9000_IMR, IMR_PAR);	/* Disable all interrupt */
 	iow(db, DM9000_RCR, 0x00);	/* Disable RX */
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH 1/2] net: dm9000: Read GPR, modify and write
From: Chris Ruehl @ 2014-01-22  3:20 UTC (permalink / raw)
  To: netdev; +Cc: linux-kernel, davem, Chris Ruehl

The GPR register should be read, modified and write to
activate the PHY. A simple write 0 to the GPR might override
other register values with needs to keep.
Some codestyle fixes (mostly leading spaces)

Signed-off-by: Chris Ruehl <chris.ruehl@gtsys.com.hk>
---
 drivers/net/ethernet/davicom/dm9000.c |   23 +++++++++++++++--------
 1 file changed, 15 insertions(+), 8 deletions(-)

diff --git a/drivers/net/ethernet/davicom/dm9000.c b/drivers/net/ethernet/davicom/dm9000.c
index 7080ad6..0349b91 100644
--- a/drivers/net/ethernet/davicom/dm9000.c
+++ b/drivers/net/ethernet/davicom/dm9000.c
@@ -745,9 +745,9 @@ static const struct ethtool_ops dm9000_ethtool_ops = {
 	.get_link		= dm9000_get_link,
 	.get_wol		= dm9000_get_wol,
 	.set_wol		= dm9000_set_wol,
- 	.get_eeprom_len		= dm9000_get_eeprom_len,
- 	.get_eeprom		= dm9000_get_eeprom,
- 	.set_eeprom		= dm9000_set_eeprom,
+	.get_eeprom_len		= dm9000_get_eeprom_len,
+	.get_eeprom		= dm9000_get_eeprom,
+	.set_eeprom		= dm9000_set_eeprom,
 };
 
 static void dm9000_show_carrier(board_info_t *db,
@@ -795,7 +795,7 @@ dm9000_poll_work(struct work_struct *w)
 		}
 	} else
 		mii_check_media(&db->mii, netif_msg_link(db), 0);
-	
+
 	if (netif_running(ndev))
 		dm9000_schedule_poll(db);
 }
@@ -1286,6 +1286,7 @@ dm9000_open(struct net_device *dev)
 {
 	board_info_t *db = netdev_priv(dev);
 	unsigned long irqflags = db->irq_res->flags & IRQF_TRIGGER_MASK;
+	int gprval;
 
 	if (netif_msg_ifup(db))
 		dev_dbg(db->dev, "enabling %s\n", dev->name);
@@ -1298,9 +1299,15 @@ dm9000_open(struct net_device *dev)
 
 	irqflags |= IRQF_SHARED;
 
+	gprval = ior(db, DM9000_GPR);
+
 	/* GPIO0 on pre-activate PHY, Reg 1F is not set by reset */
-	iow(db, DM9000_GPR, 0);	/* REG_1F bit0 activate phyxcer */
-	mdelay(1); /* delay needs by DM9000B */
+	if (gprval & (1<<0)) {
+		dev_dbg(db->dev, "Activate PHY GPR: 0x%x\n", gprval);
+		gprval = gprval & ~(1<<0);
+		iow(db, DM9000_GPR, gprval);	/* REG_1F bit0 activate phyxcer */
+		mdelay(1); /* delay needs by DM9000B */
+	}
 
 	/* Initialize DM9000 board */
 	dm9000_reset(db);
@@ -1314,7 +1321,7 @@ dm9000_open(struct net_device *dev)
 
 	mii_check_media(&db->mii, netif_msg_link(db), 1);
 	netif_start_queue(dev);
-	
+
 	dm9000_schedule_poll(db);
 
 	return 0;
@@ -1628,7 +1635,7 @@ dm9000_probe(struct platform_device *pdev)
 
 	if (!is_valid_ether_addr(ndev->dev_addr)) {
 		/* try reading from mac */
-		
+
 		mac_src = "chip";
 		for (i = 0; i < 6; i++)
 			ndev->dev_addr[i] = ior(db, i+DM9000_PAR);
-- 
1.7.10.4

^ permalink raw reply related

* [PATCH] net: Fix some fallout from the etner_addr_copy() changes.
From: David Miller @ 2014-01-22  3:03 UTC (permalink / raw)
  To: netdev


net/appletalk/aarp.c: In function ‘__aarp_send_query’:
net/appletalk/aarp.c:137:2: error: implicit declaration of function ‘ether_addr_copy’ [-Werror=implicit-function-declaration]
 ...
net/atm/lec.c: In function ‘send_to_lecd’:
net/atm/lec.c:524:3: warning: passing argument 1 of ‘ether_addr_copy’ from incompatible pointer type [enabled by default]
In file included from net/atm/lec.c:17:0:
include/linux/etherdevice.h:227:20: note: expected ‘u8 *’ but argument is of type ‘unsigned char (*)[6]’
 ...
net/caif/caif_usb.c: In function ‘cfusbl_create’:
net/caif/caif_usb.c:108:2: error: implicit declaration of function ‘ether_addr_copy’ [-Werror=implicit-function-declaration]

Signed-off-by: David S. Miller <davem@davemloft.net>
---
 net/appletalk/aarp.c | 1 +
 net/atm/lec.c        | 2 +-
 2 files changed, 2 insertions(+), 1 deletion(-)

diff --git a/net/appletalk/aarp.c b/net/appletalk/aarp.c
index d0b7be1..d27b86d 100644
--- a/net/appletalk/aarp.c
+++ b/net/appletalk/aarp.c
@@ -40,6 +40,7 @@
 #include <linux/proc_fs.h>
 #include <linux/seq_file.h>
 #include <linux/export.h>
+#include <linux/etherdevice.h>
 
 int sysctl_aarp_expiry_time = AARP_EXPIRY_TIME;
 int sysctl_aarp_tick_time = AARP_TICK_TIME;
diff --git a/net/atm/lec.c b/net/atm/lec.c
index 0b73ae9..5a2f602 100644
--- a/net/atm/lec.c
+++ b/net/atm/lec.c
@@ -521,7 +521,7 @@ send_to_lecd(struct lec_priv *priv, atmlec_msg_type type,
 	if (data != NULL)
 		mesg->sizeoftlvs = data->len;
 	if (mac_addr)
-		ether_addr_copy(&mesg->content.normal.mac_addr, mac_addr);
+		ether_addr_copy(mesg->content.normal.mac_addr, mac_addr);
 	else
 		mesg->content.normal.targetless_le_arp = 1;
 	if (atm_addr)
-- 
1.7.11.7


^ permalink raw reply related

* [PATCH net-next v3] tuntap: Fix for a race in accessing numqueues
From: Dominic Curran @ 2014-01-22  3:03 UTC (permalink / raw)
  To: netdev; +Cc: Dominic Curran, Jason Wang, Maxim Krasnyansky

A patch for fixing a race between queue selection and changing queues
was introduced in commit 92bb73ea2("tuntap: fix a possible race between
queue selection and changing queues").

The fix was to prevent the driver from re-reading the tun->numqueues
more than once within tun_select_queue() using ACCESS_ONCE().

We have been experiancing 'Divide-by-zero' errors in tun_net_xmit() 
since we moved from 3.6 to 3.10, and believe that they come from a 
simular source where the value of tun->numqueues changes to zero 
between the first and a subsequent read of tun->numqueues.

The fix is a simular use of ACCESS_ONCE(), as well as a multiply
instead of a divide in the if statement.

Signed-off-by: Dominic Curran <dominic.curran@citrix.com>
Cc: Jason Wang <jasowang@redhat.com>
Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
---
V3: Rebase against net-next. Include all numqueues in function.
V2: Use multiply instead of divide. Suggested by Eric Dumazet.
    Fixed email address for maxk. Rebase against net tree.
---
 drivers/net/tun.c |   10 ++++++----
 1 file changed, 6 insertions(+), 4 deletions(-)

Index: net-next/drivers/net/tun.c
===================================================================
--- net-next.orig/drivers/net/tun.c	2014-01-22 02:50:01.000000000 +0000
+++ net-next/drivers/net/tun.c	2014-01-22 02:59:42.000000000 +0000
@@ -738,15 +738,17 @@ static netdev_tx_t tun_net_xmit(struct s
 	struct tun_struct *tun = netdev_priv(dev);
 	int txq = skb->queue_mapping;
 	struct tun_file *tfile;
+	u32 numqueues = 0;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
+	numqueues = ACCESS_ONCE(tun->numqueues);
 
 	/* Drop packet if interface is not attached */
-	if (txq >= tun->numqueues)
+	if (txq >= numqueues)
 		goto drop;
 
-	if (tun->numqueues == 1) {
+	if (numqueues == 1) {
 		/* Select queue was not called for the skbuff, so we extract the
 		 * RPS hash and save it into the flow_table here.
 		 */
@@ -779,8 +781,8 @@ static netdev_tx_t tun_net_xmit(struct s
 	/* Limit the number of packets queued by dividing txq length with the
 	 * number of queues.
 	 */
-	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue)
-			  >= dev->tx_queue_len / tun->numqueues)
+	if (skb_queue_len(&tfile->socket.sk->sk_receive_queue) * numqueues
+			  >= dev->tx_queue_len)
 		goto drop;
 
 	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))

^ permalink raw reply

* Re: [PATCH net-next] net: filter: let bpf_tell_extensions return SKF_AD_MAX
From: David Miller @ 2014-01-22  2:55 UTC (permalink / raw)
  To: dborkman; +Cc: netdev, msekleta, edumazet
In-Reply-To: <1390259977-28770-1-git-send-email-dborkman@redhat.com>

From: Daniel Borkmann <dborkman@redhat.com>
Date: Tue, 21 Jan 2014 00:19:37 +0100

> Michal Sekletar added in commit ea02f9411d9f ("net: introduce
> SO_BPF_EXTENSIONS") a facility where user space can enquire
> the BPF ancillary instruction set, which is imho a step into
> the right direction for letting user space high-level to BPF
> optimizers make an informed decision for possibly using these
> extensions.
> 
> The original rationale was to return through a getsockopt(2)
> a bitfield of which instructions are supported and which
> are not, as of right now, we just return 0 to indicate a
> base support for SKF_AD_PROTOCOL up to SKF_AD_PAY_OFFSET.
> Limitations of this approach are that this API which we need
> to maintain for a long time can only support a maximum of 32
> extensions, and needs to be additionally maintained/updated
> when each new extension that comes in.
> 
> I thought about this a bit more and what we can do here to
> overcome this is to just return SKF_AD_MAX. Since we never
> remove any extension since we cannot break user space and
> always linearly increase SKF_AD_MAX on each newly added
> extension, user space can make a decision on what extensions
> are supported in the whole set of extensions and which aren't,
> by just checking which of them from the whole set have an
> offset < SKF_AD_MAX of the underlying kernel.
> 
> Since SKF_AD_MAX must be updated each time we add new ones,
> we don't need to introduce an additional enum and got
> maintenance for free. At some point in time when
> SO_BPF_EXTENSIONS becomes ubiquitous for most kernels, then
> an application can simply make use of this and easily be run
> on newer or older underlying kernels without needing to be
> recompiled, of course. Since that is for 3.14, it's not too
> late to do this change.
> 
> Signed-off-by: Daniel Borkmann <dborkman@redhat.com>

Completely agreed, applied, thanks Daniel.

^ permalink raw reply

* Re: [PATCH] uapi: netfilter_arp: use __u8 instead of u_int8_t
From: David Miller @ 2014-01-22  2:47 UTC (permalink / raw)
  To: vapier; +Cc: netdev, netfilter-devel
In-Reply-To: <1390358414-8844-1-git-send-email-vapier@gentoo.org>

From: Mike Frysinger <vapier@gentoo.org>
Date: Tue, 21 Jan 2014 21:40:14 -0500

> Subject: [PATCH] uapi: netfilter_arp: use __u8 instead of u_int8_t
> From:	Mike Frysinger <vapier@gentoo.org>
> To:	netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>
> Date:	Tue, 21 Jan 2014 21:40:14 -0500
> Sender:	netdev-owner@vger.kernel.org
> X-Mailer: git-send-email 1.8.4.3
> 
> Similarly, the u_int8_t type is non-standard and not defined.  Change
> it to use __u8 like the rest of the netfilter headers.
> 
> Signed-off-by: Mike Frysinger <vapier@gentoo.org>

Please send netfilter changes to netfilter-devel, CC:'d.

> ---
>  include/uapi/linux/netfilter_arp/arpt_mangle.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/uapi/linux/netfilter_arp/arpt_mangle.h b/include/uapi/linux/netfilter_arp/arpt_mangle.h
> index 250f502..8c2b16a 100644
> --- a/include/uapi/linux/netfilter_arp/arpt_mangle.h
> +++ b/include/uapi/linux/netfilter_arp/arpt_mangle.h
> @@ -13,7 +13,7 @@ struct arpt_mangle
>  	union {
>  		struct in_addr tgt_ip;
>  	} u_t;
> -	u_int8_t flags;
> +	__u8 flags;
>  	int target;
>  };
>  
> -- 
> 1.8.4.3
> 
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

^ permalink raw reply

* Re: [PATCH net-next 0/7] sctp: remove some macro locking wrappers
From: David Miller @ 2014-01-22  2:42 UTC (permalink / raw)
  To: vyasevich; +Cc: wangweidong1, nhorman, dborkman, netdev, linux-sctp
In-Reply-To: <52DE8F39.30902@gmail.com>

From: Vlad Yasevich <vyasevich@gmail.com>
Date: Tue, 21 Jan 2014 10:16:09 -0500

> On 01/21/2014 02:44 AM, Wang Weidong wrote:
>> In sctp.h we can find some macro locking wrappers. As Neil point out that:
>> 
>> "Its because in the origional implementation of the sctp protocol, there was a
>> user space test harness which built the kernel module for userspace execution to
>> cary our some unit testing on the code.  It did so by redefining some of those
>> locking macros to user space friendly code.  IIRC we haven't use those unit
>> tests in years, and so should be removing them, not adding them to other
>> locations."
>> 
>> So I remove them.
>> 
>> Wang Weidong (7):
>>   sctp: remove macro sctp_spin_[un]lock_irqrestore
>>   sctp: remove macro sctp_local_bh_{disable|enable}
>>   sctp: remove macro sctp_spin_[un]lock
>>   sctp: remove macros sctp_write_[un]_lock
>>   sctp: remove macros sctp_read_[un]lock
>>   sctp: remove macros sctp_{lock|release}_sock
>>   sctp: remove macros sctp_bh_[un]lock_sock
>> 
>>  include/net/sctp/sctp.h  | 27 ++-----------
>>  net/sctp/endpointola.c   |  4 +-
>>  net/sctp/input.c         | 54 +++++++++++++-------------
>>  net/sctp/proc.c          | 12 +++---
>>  net/sctp/protocol.c      |  4 +-
>>  net/sctp/sm_sideeffect.c | 16 ++++----
>>  net/sctp/socket.c        | 98 ++++++++++++++++++++++++------------------------
>>  7 files changed, 98 insertions(+), 117 deletions(-)
>> 
> 
> I like this series just the way it is.  Very simple to review.
> 
> Series
> Acked-by: Vlad Yasevich <vyasevich@gmail.com>

I had to append the DLM fix (since it used the sctp_*() interfaces) to
patch #6, otherwise it would have broken bisection.

Series applied, thanks.

^ permalink raw reply

* [PATCH] uapi: netfilter_arp: use __u8 instead of u_int8_t
From: Mike Frysinger @ 2014-01-22  2:40 UTC (permalink / raw)
  To: netdev, David S. Miller

Similarly, the u_int8_t type is non-standard and not defined.  Change
it to use __u8 like the rest of the netfilter headers.

Signed-off-by: Mike Frysinger <vapier@gentoo.org>
---
 include/uapi/linux/netfilter_arp/arpt_mangle.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/uapi/linux/netfilter_arp/arpt_mangle.h b/include/uapi/linux/netfilter_arp/arpt_mangle.h
index 250f502..8c2b16a 100644
--- a/include/uapi/linux/netfilter_arp/arpt_mangle.h
+++ b/include/uapi/linux/netfilter_arp/arpt_mangle.h
@@ -13,7 +13,7 @@ struct arpt_mangle
 	union {
 		struct in_addr tgt_ip;
 	} u_t;
-	u_int8_t flags;
+	__u8 flags;
 	int target;
 };
 
-- 
1.8.4.3

^ permalink raw reply related

* Re: [PATCH net v2] tuntap: Fix for a race in accessing numqueues
From: David Miller @ 2014-01-22  2:37 UTC (permalink / raw)
  To: dominic.curran; +Cc: netdev, jasowang, maxk
In-Reply-To: <1390255158-9058-1-git-send-email-dominic.curran@citrix.com>

From: Dominic Curran <dominic.curran@citrix.com>
Date: Mon, 20 Jan 2014 21:59:18 +0000

> A patch for fixing a race between queue selection and changing queues
> was introduced in commit 92bb73ea2("tuntap: fix a possible race between
> queue selection and changing queues").
> 
> The fix was to prevent the driver from re-reading the tun->numqueues
> more than once within tun_select_queue() using ACCESS_ONCE().
> 
> We have been experiancing 'Divide-by-zero' errors in tun_net_xmit() 
> since we moved from 3.6 to 3.10, and believe that they come from a 
> simular source where the value of tun->numqueues changes to zero 
> between the first and a subsequent read of tun->numqueues.
> 
> The fix is a simular use of ACCESS_ONCE(), as well as a multiply
> instead of a divide in the if statement.
> 
> Signed-off-by: Dominic Curran <dominic.curran@citrix.com>
> Cc: Jason Wang <jasowang@redhat.com>
> Cc: Maxim Krasnyansky <maxk@qti.qualcomm.com>
> ---
> V2: Use multiply instead of divide. Suggested by Eric Dumazet.
>     Fixed email address for maxk. Rebase against net tree.

Can you please respin this against net-next?  Thank you.

^ permalink raw reply

* Re: [PATCH 1/7] 8021q: Use ether_addr_copy
From: David Miller @ 2014-01-22  2:13 UTC (permalink / raw)
  To: joe; +Cc: kaber, netdev, linux-kernel
In-Reply-To: <d925be29591181e70c54abb79d514c87ac8feb06.1390239869.git.joe@perches.com>


All 7 patches applied, thanks Joe.

^ permalink raw reply

* Re: [PATCH] net: sctp: Fix a_rwnd/rwnd management to reflect real state of the receiver's buffer
From: Vlad Yasevich @ 2014-01-22  2:12 UTC (permalink / raw)
  To: Matija Glavinic Pecotic, linux-sctp; +Cc: Alexander Sverdlin, netdev
In-Reply-To: <52D8D544.5050501@nsn.com>

On 01/17/2014 02:01 AM, Matija Glavinic Pecotic wrote:
[ snip long description ...]
> Proposed solution simplifies whole algorithm having on mind definition from rfc:
> 
> o  Receiver Window (rwnd): This gives the sender an indication of the space
>    available in the receiver's inbound buffer.
> 
> Core of the proposed solution is given with these lines:
> 
> sctp_assoc_rwnd_account:
> 	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> 		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> 	else
> 		asoc->rwnd = 0;
> 
> We advertise to sender (half of) actual space we have. Half is in the braces
> depending whether you would like to observe size of socket buffer as SO_RECVBUF
> or twice the amount, i.e. size is the one visible from userspace, that is,
> from kernelspace.
> In this way sender is given with good approximation of our buffer space,
> regardless of the buffer policy - we always advertise what we have. Proposed
> solution fixes described problems and removes necessity for rwnd restoration
> algorithm. Finally, as proposed solution is simplification, some lines of code,
> along with some bytes in struct sctp_association are saved.
> 
> Signed-off-by: Matija Glavinic Pecotic <matija.glavinic-pecotic.ext@nsn.com>
> Reviewed-by: Alexander Sverdlin <alexander.sverdlin@nsn.com>
> 

This is the correct approach.  However there is one issue and
a few cosmetic suggestions.

> ---
> 
> --- net-next.orig/net/sctp/associola.c
> +++ net-next/net/sctp/associola.c
> @@ -1367,44 +1367,35 @@ static inline bool sctp_peer_needs_updat
>  	return false;
>  }
>  
> -/* Increase asoc's rwnd by len and send any window update SACK if needed. */
> -void sctp_assoc_rwnd_increase(struct sctp_association *asoc, unsigned int len)
> +/* Account asoc's rwnd for the approximated state in the buffer,
> + * and check whether SACK needs to be sent.
> + */
> +void sctp_assoc_rwnd_account(struct sctp_association *asoc, int check_sack)

Maybe sctp_assoc_rwnd_update()?

'check_sack' isn't a very descriptive name for what we are doing.  May
be 'update_peer'?  Also, since it is either 0 or 1, just make a bool.

>  {
> +	int rx_count;
>  	struct sctp_chunk *sack;
>  	struct timer_list *timer;
>  
> -	if (asoc->rwnd_over) {
> -		if (asoc->rwnd_over >= len) {
> -			asoc->rwnd_over -= len;
> -		} else {
> -			asoc->rwnd += (len - asoc->rwnd_over);
> -			asoc->rwnd_over = 0;
> -		}
> -	} else {
> -		asoc->rwnd += len;
> -	}
> +	if (asoc->ep->rcvbuf_policy)
> +		rx_count = atomic_read(&asoc->rmem_alloc);
> +	else
> +		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
>  
> -	/* If we had window pressure, start recovering it
> -	 * once our rwnd had reached the accumulated pressure
> -	 * threshold.  The idea is to recover slowly, but up
> -	 * to the initial advertised window.
> -	 */
> -	if (asoc->rwnd_press && asoc->rwnd >= asoc->rwnd_press) {
> -		int change = min(asoc->pathmtu, asoc->rwnd_press);
> -		asoc->rwnd += change;
> -		asoc->rwnd_press -= change;
> -	}
> +	if ((asoc->base.sk->sk_rcvbuf - rx_count) > 0)
> +		asoc->rwnd = (asoc->base.sk->sk_rcvbuf - rx_count) >> 1;
> +	else
> +		asoc->rwnd = 0;
>  
> -	pr_debug("%s: asoc:%p rwnd increased by %d to (%u, %u) - %u\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->a_rwnd);
> +	pr_debug("%s: asoc:%p rwnd=%u, rx_count=%d, sk_rcvbuf=%d\n",
> +		 __func__, asoc, asoc->rwnd, rx_count,
> +		 asoc->base.sk->sk_rcvbuf);
>  
>  	/* Send a window update SACK if the rwnd has increased by at least the
>  	 * minimum of the association's PMTU and half of the receive buffer.
>  	 * The algorithm used is similar to the one described in
>  	 * Section 4.2.3.3 of RFC 1122.
>  	 */
> -	if (sctp_peer_needs_update(asoc)) {
> +	if (check_sack && sctp_peer_needs_update(asoc)) {
>  		asoc->a_rwnd = asoc->rwnd;
>  
>  		pr_debug("%s: sending window update SACK- asoc:%p rwnd:%u "
> @@ -1426,45 +1417,6 @@ void sctp_assoc_rwnd_increase(struct sct
>  	}
>  }
>  
> -/* Decrease asoc's rwnd by len. */
> -void sctp_assoc_rwnd_decrease(struct sctp_association *asoc, unsigned int len)
> -{
> -	int rx_count;
> -	int over = 0;
> -
> -	if (unlikely(!asoc->rwnd || asoc->rwnd_over))
> -		pr_debug("%s: association:%p has asoc->rwnd:%u, "
> -			 "asoc->rwnd_over:%u!\n", __func__, asoc,
> -			 asoc->rwnd, asoc->rwnd_over);
> -
> -	if (asoc->ep->rcvbuf_policy)
> -		rx_count = atomic_read(&asoc->rmem_alloc);
> -	else
> -		rx_count = atomic_read(&asoc->base.sk->sk_rmem_alloc);
> -
> -	/* If we've reached or overflowed our receive buffer, announce
> -	 * a 0 rwnd if rwnd would still be positive.  Store the
> -	 * the potential pressure overflow so that the window can be restored
> -	 * back to original value.
> -	 */
> -	if (rx_count >= asoc->base.sk->sk_rcvbuf)
> -		over = 1;
> -
> -	if (asoc->rwnd >= len) {
> -		asoc->rwnd -= len;
> -		if (over) {
> -			asoc->rwnd_press += asoc->rwnd;
> -			asoc->rwnd = 0;
> -		}
> -	} else {
> -		asoc->rwnd_over = len - asoc->rwnd;
> -		asoc->rwnd = 0;
> -	}
> -
> -	pr_debug("%s: asoc:%p rwnd decreased by %d to (%u, %u, %u)\n",
> -		 __func__, asoc, len, asoc->rwnd, asoc->rwnd_over,
> -		 asoc->rwnd_press);
> -}
>  
>  /* Build the bind address list for the association based on info from the
>   * local endpoint and the remote peer.
> --- net-next.orig/include/net/sctp/structs.h
> +++ net-next/include/net/sctp/structs.h
> @@ -1653,17 +1653,6 @@ struct sctp_association {
>  	/* This is the last advertised value of rwnd over a SACK chunk. */
>  	__u32 a_rwnd;
>  
> -	/* Number of bytes by which the rwnd has slopped.  The rwnd is allowed
> -	 * to slop over a maximum of the association's frag_point.
> -	 */
> -	__u32 rwnd_over;
> -
> -	/* Keeps treack of rwnd pressure.  This happens when we have
> -	 * a window, but not recevie buffer (i.e small packets).  This one
> -	 * is releases slowly (1 PMTU at a time ).
> -	 */
> -	__u32 rwnd_press;
> -
>  	/* This is the sndbuf size in use for the association.
>  	 * This corresponds to the sndbuf size for the association,
>  	 * as specified in the sk->sndbuf.
> @@ -1892,8 +1881,7 @@ void sctp_assoc_update(struct sctp_assoc
>  __u32 sctp_association_get_next_tsn(struct sctp_association *);
>  
>  void sctp_assoc_sync_pmtu(struct sock *, struct sctp_association *);
> -void sctp_assoc_rwnd_increase(struct sctp_association *, unsigned int);
> -void sctp_assoc_rwnd_decrease(struct sctp_association *, unsigned int);
> +void sctp_assoc_rwnd_account(struct sctp_association *, int);
>  void sctp_assoc_set_primary(struct sctp_association *,
>  			    struct sctp_transport *);
>  void sctp_assoc_del_nonprimary_peers(struct sctp_association *,
> --- net-next.orig/net/sctp/sm_statefuns.c
> +++ net-next/net/sctp/sm_statefuns.c
> @@ -6176,7 +6176,7 @@ static int sctp_eat_data(const struct sc
>  	 * PMTU.  In cases, such as loopback, this might be a rather
>  	 * large spill over.
>  	 */
> -	if ((!chunk->data_accepted) && (!asoc->rwnd || asoc->rwnd_over ||
> +	if ((!chunk->data_accepted) && (!asoc->rwnd ||
>  	    (datalen > asoc->rwnd + asoc->frag_point))) {
>  
>  		/* If this is the next TSN, consider reneging to make
> --- net-next.orig/net/sctp/socket.c
> +++ net-next/net/sctp/socket.c
> @@ -2097,7 +2097,7 @@ static int sctp_recvmsg(struct kiocb *io
>  		 * rwnd is updated when the event is freed.
>  		 */
>  		if (!sctp_ulpevent_is_notification(event))
> -			sctp_assoc_rwnd_increase(event->asoc, copied);
> +			sctp_assoc_rwnd_account(event->asoc, 1);

This is not going to do anything.  The event has not been freed, thus
rmem_alloc has not changed.  So, the rwnd will not change.

>  		goto out;
>  	} else if ((event->msg_flags & MSG_NOTIFICATION) ||
>  		   (event->msg_flags & MSG_EOR))
> --- net-next.orig/net/sctp/ulpevent.c
> +++ net-next/net/sctp/ulpevent.c
> @@ -989,7 +989,7 @@ static void sctp_ulpevent_receive_data(s
>  	skb = sctp_event2skb(event);
>  	/* Set the owner and charge rwnd for bytes received.  */
>  	sctp_ulpevent_set_owner(event, asoc);
> -	sctp_assoc_rwnd_decrease(asoc, skb_headlen(skb));
> +	sctp_assoc_rwnd_account(asoc, 0);
>  
>  	if (!skb->data_len)
>  		return;
> @@ -1035,7 +1035,7 @@ static void sctp_ulpevent_release_data(s
>  	}
>  
>  done:
> -	sctp_assoc_rwnd_increase(event->asoc, len);
> +	sctp_assoc_rwnd_account(event->asoc, 1);

Same here.  The below call to sctp_ulpevent_release_owner() will
finally update the rmem_alloc, so the a above call to rwnd_account()
will not trigger a window update.

Thanks
-vlad

>  	sctp_ulpevent_release_owner(event);
>  }
>  
> --
> To unsubscribe from this list: send the line "unsubscribe linux-sctp" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 

^ permalink raw reply

* Re: [net-next PATCH 1/1] drivers: net: cpsw: enable promiscuous mode support
From: David Miller @ 2014-01-22  2:11 UTC (permalink / raw)
  To: mugunthanvnm; +Cc: netdev, linux-omap
In-Reply-To: <1390219718-31749-1-git-send-email-mugunthanvnm@ti.com>

From: Mugunthan V N <mugunthanvnm@ti.com>
Date: Mon, 20 Jan 2014 17:38:38 +0530

> +		if (!enable && ((priv->slaves[0].ndev->flags & IFF_PROMISC) ||
> +				(priv->slaves[1].ndev->flags & IFF_PROMISC))) {

This assumption that there are exactly 2 slaves is not valid.

Use the appropriate for_each_slave() et al. abstractions to access
the slaves.

^ permalink raw reply

* Re: [PATCH net-next V5 0/3] net: Add GRO support for UDP encapsulating protocols
From: David Miller @ 2014-01-22  2:05 UTC (permalink / raw)
  To: ogerlitz; +Cc: netdev, hkchu, edumazet, herbert, yanb, shlomop, therbert
In-Reply-To: <1390219161-9881-1-git-send-email-ogerlitz@mellanox.com>

From: Or Gerlitz <ogerlitz@mellanox.com>
Date: Mon, 20 Jan 2014 13:59:18 +0200

> This series adds GRO handlers for protocols that do UDP encapsulation, with the
> intent of being able to coalesce packets which encapsulate packets belonging to
> the same TCP session.
> 
> For GRO purposes, the destination UDP port takes the role of the ether type
> field in the ethernet header or the next protocol in the IP header.
> 
> The UDP GRO handler will only attempt to coalesce packets whose destination
> port is registered to have gro handler.
> 
> The patches done against net-next 75e4364f67 "net: stmmac: fix NULL pointer 
> dereference in stmmac_get_tx_hwtstamp"

Looks good, series applied, thanks!

^ permalink raw reply

* Re: [PATCH net-next 00/25] bonding: introduce new option API
From: Ding Tianhong @ 2014-01-22  2:01 UTC (permalink / raw)
  To: Nikolay Aleksandrov, netdev
  Cc: Andy Gospodarek, Jay Vosburgh, Veaceslav Falico, Scott Feldman,
	David S. Miller
In-Reply-To: <1390316114-17815-1-git-send-email-nikolay@redhat.com>

On 2014/1/21 22:54, Nikolay Aleksandrov wrote:
> Hi,
> This patchset's goal is to introduce a new option API which should be used
> to properly describe the bonding options with their mode dependcies and
> requirements. With this patchset applied we get centralized option
> manipulation, automatic RTNL acquire per option setting, automatic option
> range checking, mode dependcy checking and other various flags which are
> described in detail in patch 01's commit message and comments.
> Also the parameter passing is changed to use a specialized structure which
> is initialized to a value depending on the needs.
> The main exported functions are:
>  __bond_opt_set() - set an option (RTNL should be acquired prior)
>  bond_opt_init(val|str) - init a bond_opt_value struct for value or string
>                           parameter passing
>  bond_opt_tryset_rtnl() - function which tries to acquire rtnl, mainly used
>                           for sysfs
>  bond_opt_parse - used to parse or check for valid values
>  bond_opt_get - retrieve a pointer to bond_option struct for some option
>  bond_opt_get_val - retrieve a pointer to a bond_opt_value struct for
>                     some value
> 
> The same functions are used to set an option via sysfs and netlink, just
> the parameter that's passed is usually initialized in a different way.
> The converted options have multiple style fixes, there're some longer
> lines but they looked either ugly or were strings/pr_warnings, if you
> think some line would be better broken just let me know :-) there're
> also a few sscanf false-positive warnings.
> I decided to keep the "unsuppmodes" way of mode dep checking since it's
> straight forward, if we make a more general way for checking dependencies
> it'll be easy to change it.
> 
> Future plans for this work include:
>  - Automatic sysfs generation from the bond_opts[].
>  - Use of the API in bond_check_params() and thus cleaning it up (this has
>    actually started, I'll take care of the rest in a separate patch)
>  - Clean up all option-unrelated files of option definitions and functions
> 
> I've tried to leave as much documentation as possible, if there's anything
> unclear please let me know. One more thing, I haven't moved all
> option-related functions from bonding.h to the new bond_options.h, this
> will be done in a separate patch, it's in my todo list.
> 
> This patchset has been tested by setting each converted option via sysfs
> and netlink to a couple of wrong values, a couple of correct values and
> some random values, also for the opts that have flags they have been
> tested as well.
> 
> Best regards,
>  Nikolay Aleksandrov
> 

cool work, I think I can applied and test them, but it is really a hard work.:)

Ding

> CC: Andy Gospodarek <andy@greyhouse.net>
> CC: Jay Vosburgh <fubar@us.ibm.com>
> CC: Veaceslav Falico <vfalico@redhat.com>
> CC: Scott Feldman <sfeldma@cumulusnetworks.com>
> CC: David S. Miller <davem@davemloft.net>
> 
> Nikolay Aleksandrov (25):
>   bonding: add infrastructure for an option API
>   bonding: convert mode setting to use the new option API
>   bonding: convert packets_per_slave to use the new option API
>   bonding: convert xmit_hash_policy to use the new option API
>   bonding: convert arp_validate to use the new option API
>   bonding: convert arp_all_targets to use the new option API
>   bonding: convert fail_over_mac to use the new option API
>   bonding: convert arp_interval to use the new option API
>   bonding: convert arp_ip_target to use the new option API
>   bonding: convert downdelay to use the new option API
>   bonding: convert updelay to use the new option API
>   bonding: convert lacp_rate to use the new option API
>   bonding: convert min_links to use the new option API
>   bonding: convert ad_select to use the new option API
>   bonding: convert num_peer_notif to use the new option API
>   bonding: convert miimon to use the new option API
>   bonding: convert primary to use the new option API
>   bonding: convert primary_reselect to use the new option API
>   bonding: convert use_carrier to use the new option API
>   bonding: convert active_slave to use the new option API
>   bonding: convert queue_id to use the new option API
>   bonding: convert all_slaves_active to use the new option API
>   bonding: convert resend_igmp to use the new option API
>   bonding: convert lp_interval to use the new option API
>   bonding: convert slaves to use the new option API
> 
>  drivers/net/bonding/bond_main.c    |  179 +++---
>  drivers/net/bonding/bond_netlink.c |   87 ++-
>  drivers/net/bonding/bond_options.c | 1086 +++++++++++++++++++++++++++---------
>  drivers/net/bonding/bond_options.h |  170 ++++++
>  drivers/net/bonding/bond_procfs.c  |   25 +-
>  drivers/net/bonding/bond_sysfs.c   |  519 +++--------------
>  drivers/net/bonding/bonding.h      |   29 +-
>  7 files changed, 1214 insertions(+), 881 deletions(-)
>  create mode 100644 drivers/net/bonding/bond_options.h
> 

^ permalink raw reply

* Re: [PATCH net-next 1/5] bonding: The fail_over_mac should be set only in ACTIVE_BACKUP mode
From: Ding Tianhong @ 2014-01-22  1:59 UTC (permalink / raw)
  To: Jay Vosburgh; +Cc: Veaceslav Falico, David S. Miller, Netdev, Andy Gospodarek
In-Reply-To: <20332.1390350326@death.nxdomain>

On 2014/1/22 8:25, Jay Vosburgh wrote:
> Ding Tianhong <dingtianhong@huawei.com> wrote:
> 
>> According the bonding.txt, the option fail_over_mac only affect for
>> AB mode, but in currect code, the parameter could be set to active
>> or follow in every mode, this will cause bonding could not set all
>> slaves of an RR or XOR mode to the same MAC address at enslavement
>> time, so reset fail_over_mac to 0 if the mode is not ACTIVE_BACKUP.
> 
> 	The correct way to fix this in general is to permit setting an
> option at any time, but only have it take effect in active-backup mode.
> This minimizes ordering requirements when setting options.
> 
ok

> 	I would instead modify the bond enslave and removal processing
> to check the mode in addition to fail_over_mac when setting a slave's
> MAC during enslavement.  The change active slave processing already only
> calls the fail_over_mac function when in active-backup mode.  This
> should also be a simpler change set.
> 
agree, the current modify actually more redundant.

>> Fix the wrong variables for pr_err().
>>
>> Signed-off-by: Ding Tianhong <dingtianhong@huawei.com>
>> ---
>> drivers/net/bonding/bond_main.c | 8 +++++---
>> 1 file changed, 5 insertions(+), 3 deletions(-)
>>
>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>> index 3220b48..ecff04e 100644
>> --- a/drivers/net/bonding/bond_main.c
>> +++ b/drivers/net/bonding/bond_main.c
>> @@ -4307,12 +4307,14 @@ static int bond_check_params(struct bond_params *params)
>> 						      fail_over_mac_tbl);
>> 		if (fail_over_mac_value == -1) {
>> 			pr_err("Error: invalid fail_over_mac \"%s\"\n",
>> -			       arp_validate == NULL ? "NULL" : arp_validate);
>> +			       fail_over_mac == NULL ? "NULL" : fail_over_mac);
> 
> 	This part is ok.
> 
>> 			return -EINVAL;
>> 		}
>>
>> -		if (bond_mode != BOND_MODE_ACTIVEBACKUP)
>> -			pr_warning("Warning: fail_over_mac only affects active-backup mode.\n");
>> +		if (bond_mode != BOND_MODE_ACTIVEBACKUP) {
>> +			pr_warning("Warning: fail_over_mac only affects active-backup mode, set it to 0.\n");
>> +			fail_over_mac_value = BOND_FOM_NONE;
>> +		}
> 
> 	This part is not.
> 
> 	I would additionally NAK patches 2, 3, and 4 (noting that 4
> inhibits the change to fail_over_mac, but not the warning message that
> we're changing it).
> 
> 	Patch 5 is ok, although it really has nothing to do with
> fail_over_mac.
> 
> 	-J
> 

The whole patchset need to be rebuild, thanks for reviewing.

Regards
Ding

>> 	} else {
>> 		fail_over_mac_value = BOND_FOM_NONE;
>> 	}
>> -- 
>> 1.8.0
> 
> ---
> 	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
> 
> 
> 

^ permalink raw reply

* IPV6 routing problem
From: Sharat Masetty @ 2014-01-22  1:41 UTC (permalink / raw)
  To: netdev

 I have an IPV6 routing problem that has only surfaced on a 3.10
kernel version. This problem is not seen on 3.4 kernel. I will keep
the problem statement as brief as possible.

I have two interfaces say eth0 and eth1. I have a host route to a
destination over eth1, but this route has a global gateway
address(via) of the router on the link. When this global gateway is
present in the route, the kernel does not seem to pick up the route,
but falls back to the default route which is eth0. When this gateway
is removed from the route or if the gateway is changed to a link local
address of the router, instead of the global address, then routing
seems to work as expected and kernel picks up interface eth1.

== does not work ==
<dest-addr> via <global scope -gateway addr>dev eth1 metric 1024
<global scope -gateway addr> dev eth1  metric 1024

I checked the git commits diff between 3.4 and 3.10 for ipv6/route.c
and ipv6/ip6_fib.c, but looks like they are close to an year apart
with lots of changes. Hence I wanted to reach out to you to see if any
recent changes in IPV6 routing is causing this difference in behavior.

Any help would be greatly appreciated.

Thanks
Sharat

^ permalink raw reply

* Re: ipv4_dst_destroy panic regression after 3.10.15
From: dormando @ 2014-01-22  1:39 UTC (permalink / raw)
  To: Alexei Starovoitov
  Cc: Eric Dumazet, netdev, linux-kernel@vger.kernel.org,
	Alexei Starovoitov
In-Reply-To: <CAADnVQKQFQh2JewwmXWBkyB86rOJs1mXHrgECikaMxGqFW3axA@mail.gmail.com>

> On Fri, Jan 17, 2014 at 11:16 PM, dormando <dormando@rydia.net> wrote:
> >> On Fri, 2014-01-17 at 22:49 -0800, Eric Dumazet wrote:
> >> > On Fri, 2014-01-17 at 17:25 -0800, dormando wrote:
> >> > > Hi,
> >> > >
> >> > > Upgraded a few kernels to the latest 3.10 stable tree while tracking down
> >> > > a rare kernel panic, seems to have introduced a much more frequent kernel
> >> > > panic. Takes anywhere from 4 hours to 2 days to trigger:
> >> > >
> >> > > <4>[196727.311203] general protection fault: 0000 [#1] SMP
> >> > > <4>[196727.311224] Modules linked in: xt_TEE xt_dscp xt_DSCP macvlan bridge coretemp crc32_pclmul ghash_clmulni_intel gpio_ich microcode ipmi_watchdog ipmi_devintf sb_edac edac_core lpc_ich mfd_core tpm_tis tpm tpm_bios ipmi_si ipmi_msghandler isci igb libsas i2c_algo_bit ixgbe ptp pps_core mdio
> >> > > <4>[196727.311333] CPU: 17 PID: 0 Comm: swapper/17 Not tainted 3.10.26 #1
> >> > > <4>[196727.311344] Hardware name: Supermicro X9DRi-LN4+/X9DR3-LN4+/X9DRi-LN4+/X9DR3-LN4+, BIOS 3.0 07/05/2013
> >> > > <4>[196727.311364] task: ffff885e6f069700 ti: ffff885e6f072000 task.ti: ffff885e6f072000
> >> > > <4>[196727.311377] RIP: 0010:[<ffffffff815f8c7f>]  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.311399] RSP: 0018:ffff885effd23a70  EFLAGS: 00010282
> >> > > <4>[196727.311409] RAX: dead000000200200 RBX: ffff8854c398ecc0 RCX: 0000000000000040
> >> > > <4>[196727.311423] RDX: dead000000100100 RSI: dead000000100100 RDI: dead000000200200
> >> > > <4>[196727.311437] RBP: ffff885effd23a80 R08: ffffffff815fd9e0 R09: ffff885d5a590800
> >> > > <4>[196727.311451] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
> >> > > <4>[196727.311464] R13: ffffffff81c8c280 R14: 0000000000000000 R15: ffff880e85ee16ce
> >> > > <4>[196727.311510] FS:  0000000000000000(0000) GS:ffff885effd20000(0000) knlGS:0000000000000000
> >> > > <4>[196727.311554] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> > > <4>[196727.311581] CR2: 00007a46751eb000 CR3: 0000005e65688000 CR4: 00000000000407e0
> >> > > <4>[196727.311625] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> > > <4>[196727.311669] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> >> > > <4>[196727.311713] Stack:
> >> > > <4>[196727.311733]  ffff8854c398ecc0 ffff8854c398ecc0 ffff885effd23ab0 ffffffff815b7f42
> >> > > <4>[196727.311784]  ffff88be6595bc00 ffff8854c398ecc0 0000000000000000 ffff8854c398ecc0
> >> > > <4>[196727.311834]  ffff885effd23ad0 ffffffff815b86c6 ffff885d5a590800 ffff8816827821c0
> >> > > <4>[196727.311885] Call Trace:
> >> > > <4>[196727.311907]  <IRQ>
> >> > > <4>[196727.311912]  [<ffffffff815b7f42>] dst_destroy+0x32/0xe0
> >> > > <4>[196727.311959]  [<ffffffff815b86c6>] dst_release+0x56/0x80
> >> > > <4>[196727.311986]  [<ffffffff81620bd5>] tcp_v4_do_rcv+0x2a5/0x4a0
> >> > > <4>[196727.312013]  [<ffffffff81622b5a>] tcp_v4_rcv+0x7da/0x820
> >> > > <4>[196727.312041]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312070]  [<ffffffff815de02d>] ? nf_hook_slow+0x7d/0x150
> >> > > <4>[196727.312097]  [<ffffffff815fd9e0>] ? ip_rcv_finish+0x360/0x360
> >> > > <4>[196727.312125]  [<ffffffff815fda92>] ip_local_deliver_finish+0xb2/0x230
> >> > > <4>[196727.312154]  [<ffffffff815fdd9a>] ip_local_deliver+0x4a/0x90
> >> > > <4>[196727.312183]  [<ffffffff815fd799>] ip_rcv_finish+0x119/0x360
> >> > > <4>[196727.312212]  [<ffffffff815fe00b>] ip_rcv+0x22b/0x340
> >> > > <4>[196727.312242]  [<ffffffffa0339680>] ? macvlan_broadcast+0x160/0x160 [macvlan]
> >> > > <4>[196727.312275]  [<ffffffff815b0c62>] __netif_receive_skb_core+0x512/0x640
> >> > > <4>[196727.312308]  [<ffffffff811427fb>] ? kmem_cache_alloc+0x13b/0x150
> >> > > <4>[196727.312338]  [<ffffffff815b0db1>] __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312368]  [<ffffffff815b0fa1>] netif_receive_skb+0x31/0xa0
> >> > > <4>[196727.312397]  [<ffffffff815b1ae8>] napi_gro_receive+0xe8/0x140
> >> > > <4>[196727.312433]  [<ffffffffa00274f1>] ixgbe_poll+0x551/0x11f0 [ixgbe]
> >> > > <4>[196727.312463]  [<ffffffff815fe00b>] ? ip_rcv+0x22b/0x340
> >> > > <4>[196727.312491]  [<ffffffff815b1691>] net_rx_action+0x111/0x210
> >> > > <4>[196727.312521]  [<ffffffff815b0db1>] ? __netif_receive_skb+0x21/0x70
> >> > > <4>[196727.312552]  [<ffffffff810519d0>] __do_softirq+0xd0/0x270
> >> > > <4>[196727.312583]  [<ffffffff816cef3c>] call_softirq+0x1c/0x30
> >> > > <4>[196727.312613]  [<ffffffff81004205>] do_softirq+0x55/0x90
> >> > > <4>[196727.312640]  [<ffffffff81051c85>] irq_exit+0x55/0x60
> >> > > <4>[196727.312668]  [<ffffffff816cf5c3>] do_IRQ+0x63/0xe0
> >> > > <4>[196727.312696]  [<ffffffff816c5aaa>] common_interrupt+0x6a/0x6a
> >> > > <4>[196727.312722]  <EOI>
> >> > > <4>[196727.312727]  [<ffffffff8100a150>] ? default_idle+0x20/0xe0
> >> > > <4>[196727.312775]  [<ffffffff8100a8ff>] arch_cpu_idle+0xf/0x20
> >> > > <4>[196727.312803]  [<ffffffff8108d330>] cpu_startup_entry+0xc0/0x270
> >> > > <4>[196727.312833]  [<ffffffff816b276e>] start_secondary+0x1f9/0x200
> >> > > <4>[196727.312860] Code: 4a 9f e9 81 e8 13 cb 0c 00 48 8b 93 b0 00 00 00 48 bf 00 02 20 00 00 00 ad de 48 8b 83 b8 00 00 00 48 be 00 01 10 00 00 00 ad de <48> 89 42 08 48 89 10 48 89 bb b8 00 00 00 48 c7 c7 4a 9f e9 81
> >> > > <1>[196727.313071] RIP  [<ffffffff815f8c7f>] ipv4_dst_destroy+0x4f/0x80
> >> > > <4>[196727.313100]  RSP <ffff885effd23a70>
> >> > > <4>[196727.313377] ---[ end trace 64b3f14fae0f2e29 ]---
> >> > > <0>[196727.380908] Kernel panic - not syncing: Fatal exception in interrupt
> >> > >
> >> > >
> >> > > ... bisecting it's going to be a pain... I tried eyeballing the diffs and
> >> > > am trying a revert or two.
> >> > >
> >> > > We've hit it in .25, .26 so far. I have .27 running but not sure if it
> >> > > crashed, so the change exists between .15 and .25.
> >> >
> >> > Please try following fix, thanks for the report !
> >> >
> >> > diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> >> > index 25071b48921c..78a50a22298a 100644
> >> > --- a/net/ipv4/route.c
> >> > +++ b/net/ipv4/route.c
> >> > @@ -1333,7 +1333,7 @@ static void ipv4_dst_destroy(struct dst_entry
> >> > *dst)
> >> >
> >> >     if (!list_empty(&rt->rt_uncached)) {
> >> >             spin_lock_bh(&rt_uncached_lock);
> >> > -           list_del(&rt->rt_uncached);
> >> > +           list_del_init(&rt->rt_uncached);
> >> >             spin_unlock_bh(&rt_uncached_lock);
> >> >     }
> >> >  }
> >> >
> >>
> >> Problem could come from this commit, in linux 3.10.23,
> >> you also could try to revert it
> >>
> >> commit 62713c4b6bc10c2d082ee1540e11b01a2b2162ab
> >> Author: Alexei Starovoitov <ast@plumgrid.com>
> >> Date:   Tue Nov 19 19:12:34 2013 -0800
> >>
> >>     ipv4: fix race in concurrent ip_route_input_slow()
> >>
> >>     [ Upstream commit dcdfdf56b4a6c9437fc37dbc9cee94a788f9b0c4 ]
> >>
> >>     CPUs can ask for local route via ip_route_input_noref() concurrently.
> >>     if nh_rth_input is not cached yet, CPUs will proceed to allocate
> >>     equivalent DSTs on 'lo' and then will try to cache them in nh_rth_input
> >>     via rt_cache_route()
> >>     Most of the time they succeed, but on occasion the following two lines:
> >>         orig = *p;
> >>         prev = cmpxchg(p, orig, rt);
> >>     in rt_cache_route() do race and one of the cpus fails to complete cmpxchg.
> >>     But ip_route_input_slow() doesn't check the return code of rt_cache_route(),
> >>     so dst is leaking. dst_destroy() is never called and 'lo' device
> >>     refcnt doesn't go to zero, which can be seen in the logs as:
> >>         unregister_netdevice: waiting for lo to become free. Usage count = 1
> >>     Adding mdelay() between above two lines makes it easily reproducible.
> >>     Fix it similar to nh_pcpu_rth_output case.
> >>
> >>     Fixes: d2d68ba9fe8b ("ipv4: Cache input routes in fib_info nexthops.")
> >>     Signed-off-by: Alexei Starovoitov <ast@plumgrid.com>
> >>     Signed-off-by: David S. Miller <davem@davemloft.net>
> >>     Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
> >>
> >
> > Heh. I spent an hour squinting at the difflog from .15 to .25 and this was
> > my best guess. I have a kernel running in production with only this
> > reverted as of ~5 hours ago. Won't know if it helps for a day or two.
> >
> > I'm building a kernel now with your route patch, but 62713c4 *not*
> > reverted, which I can throw on a different machine. Does this sound like a
> > good idea?
>
> the traces of your crash don't look similar to dst leak that was fixed by
> commit 62713c4...
>
> To trigger the 'fix' logic of the 62713c4 add the following diff:
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index f6c6ab1..8972676 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -1259,7 +1259,7 @@ static bool rt_cache_route(struct fib_nh *nh,
> struct rtable *rt)
>                 p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
>         }
>         orig = *p;
> -
> +       mdelay(100);
>         prev = cmpxchg(p, orig, rt);
>         if (prev == orig) {
>                 if (orig)
>
> I've been running with it for a day without issues.
> Note that it will stress both 'input' and 'output' ways of adding dsts to
> rt_uncached list...
> and 'output' was there for ages.
>
> If mdelay() helps to reproduce it in minutes instead of days
> then we're on the right path.
> Could you share details of your workload?
> In our case it was a lot of namespaces with a ton of processes
> talking to each other, forcefully killed and restarted.
> Do you see the same crash in the latest tree?
>
> PS sorry for double posts. netdev email bounced few times for me...
>

So with 62713c4 reverted on top of 3.10.27 (no other new patches), both of
my machines have survived. One four days, another two days. They'd barely
make it a day before.

So this patch is introducing a new problem. Weakening the dst reference
too much in a false positive case?

I'm not entirely sure if the mdelay(100) thing will help much. I can try
it but it's kind of obnoxious to reboot these machines (far away, ipmi
only) too often. Unless you folks have any new ideas before I go off and
do that?

^ permalink raw reply

* Re: [PATCH] DT: net: document Ethernet bindings in one place
From: Rob Herring @ 2014-01-22  1:30 UTC (permalink / raw)
  To: Florian Fainelli
  Cc: Sergei Shtylyov, netdev, Rob Herring, Pawel Moll, Mark Rutland,
	Ian Campbell, Kumar Gala, devicetree@vger.kernel.org, Rob Landley,
	linux-doc@vger.kernel.org
In-Reply-To: <CAGVrzcb3X3soJNJCE5+YSpQrr+EdycCRFkptPvBCgFg4CbGJ4Q@mail.gmail.com>

On Tue, Jan 21, 2014 at 1:56 PM, Florian Fainelli <f.fainelli@gmail.com> wrote:
> 2014/1/20 Rob Herring <robherring2@gmail.com>:
>> On Mon, Jan 20, 2014 at 3:45 PM, Sergei Shtylyov
>> <sergei.shtylyov@cogentembedded.com> wrote:
>>> On 01/20/2014 05:06 PM, Rob Herring wrote:

[snip]

>>>>> +- phy-connection-type: the same as "phy-mode" property (but described in
>>>>> ePAPR);
>>>>> +- phy-handle: phandle, specifies a reference to a node representing a
>>>>> PHY
>>>>> +  device (this property is described in ePAPR);
>>>>> +- phy: the same as "phy-handle" property (but actually ad-hoc one).
>>>
>>>
>>>> Mark this as deprecated in favor of phy-handle.
>>>
>>>
>>>    Here situation is more optimistic. Quite many drivers still use
>>> "phy-handle", though some use even more exotic props I didn't document here.
>>
>> Perhaps flagging as "Not recommended for new bindings" would be nicer wording...
>
> Ok, so what's the deal here, can't we use phy-handle? There is
> currently no infrastructure in drivers/net/ or drivers/of/ to look for
> the "phy-handle" nor "phy" properties as phandles to fetch an Ethernet
> PHY device tree node. If we are to provide one common place where we
> want to do this, we will have to look for both properties, why can't
> we mandate the use of "phy-handle" since this is the ePAPR compliant
> one?

My count is 3 using "phy" and 11 using "phy-handle". Even if we
mandate phy-handle, we still have to support both properties to
maintain compatibility with existing DTs. But nothing new should use
"phy".  I'm not sure that something common really helps here. Drivers
using "phy" can continue to and new ones use "phy-handle". After a
quick scan, there doesn't appear to be any multi-line pattern in the
drivers we could factor out.

Rob

^ permalink raw reply

* [PATCH net-next v5 2/3] net: introduce reciprocal_scale helper and convert users
From: Hannes Frederic Sowa @ 2014-01-22  1:29 UTC (permalink / raw)
  To: netdev; +Cc: Daniel Borkmann, Jakub Zawadzki, Eric Dumazet, linux-kernel
In-Reply-To: <1390354181-5080-1-git-send-email-hannes@stressinduktion.org>

From: Daniel Borkmann <dborkman@redhat.com>

As David Laight suggests, we shouldn't necessarily call this
reciprocal_divide() when users didn't requested a reciprocal_value();
lets keep the basic idea and call it reciprocal_scale(). More
background information on this topic can be found in [1].

Joint work with Hannes Frederic Sowa.

  [1] http://homepage.cs.uiowa.edu/~jones/bcd/divide.html

Suggested-by: David Laight <david.laight@aculab.com>
Cc: Jakub Zawadzki <darkjames-ws@darkjames.pl>
Cc: Eric Dumazet <eric.dumazet@gmail.com>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org>
Signed-off-by: Daniel Borkmann <dborkman@redhat.com>
---
 include/linux/kernel.h | 19 +++++++++++++++++++
 include/net/codel.h    |  4 +---
 net/packet/af_packet.c |  3 +--
 3 files changed, 21 insertions(+), 5 deletions(-)

diff --git a/include/linux/kernel.h b/include/linux/kernel.h
index ecb8754..03d8a6b 100644
--- a/include/linux/kernel.h
+++ b/include/linux/kernel.h
@@ -193,6 +193,25 @@ extern int _cond_resched(void);
 		(__x < 0) ? -__x : __x;		\
 	})
 
+/**
+ * reciprocal_scale - "scale" a value into range [0, ep_ro)
+ * @val: value
+ * @ep_ro: right open interval endpoint
+ *
+ * Perform a "reciprocal multiplication" in order to "scale" a value into
+ * range [0, ep_ro), where the upper interval endpoint is right-open.
+ * This is useful, e.g. for accessing a index of an array containing
+ * ep_ro elements, for example. Think of it as sort of modulus, only that
+ * the result isn't that of modulo. ;) Note that if initial input is a
+ * small value, then result will return 0.
+ *
+ * Return: a result based on val in interval [0, ep_ro).
+ */
+static inline u32 reciprocal_scale(u32 val, u32 ep_ro)
+{
+	return (u32)(((u64) val * ep_ro) >> 32);
+}
+
 #if defined(CONFIG_MMU) && \
 	(defined(CONFIG_PROVE_LOCKING) || defined(CONFIG_DEBUG_ATOMIC_SLEEP))
 void might_fault(void);
diff --git a/include/net/codel.h b/include/net/codel.h
index 3b04ff5..fe0eab3 100644
--- a/include/net/codel.h
+++ b/include/net/codel.h
@@ -46,7 +46,6 @@
 #include <linux/skbuff.h>
 #include <net/pkt_sched.h>
 #include <net/inet_ecn.h>
-#include <linux/reciprocal_div.h>
 
 /* Controlling Queue Delay (CoDel) algorithm
  * =========================================
@@ -211,10 +210,9 @@ static codel_time_t codel_control_law(codel_time_t t,
 				      codel_time_t interval,
 				      u32 rec_inv_sqrt)
 {
-	return t + reciprocal_divide(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
+	return t + reciprocal_scale(interval, rec_inv_sqrt << REC_INV_SQRT_SHIFT);
 }
 
-
 static bool codel_should_drop(const struct sk_buff *skb,
 			      struct Qdisc *sch,
 			      struct codel_vars *vars,
diff --git a/net/packet/af_packet.c b/net/packet/af_packet.c
index df3cbdd..9734616 100644
--- a/net/packet/af_packet.c
+++ b/net/packet/af_packet.c
@@ -88,7 +88,6 @@
 #include <linux/virtio_net.h>
 #include <linux/errqueue.h>
 #include <linux/net_tstamp.h>
-#include <linux/reciprocal_div.h>
 #include <linux/percpu.h>
 #ifdef CONFIG_INET
 #include <net/inet_common.h>
@@ -1262,7 +1261,7 @@ static unsigned int fanout_demux_hash(struct packet_fanout *f,
 				      struct sk_buff *skb,
 				      unsigned int num)
 {
-	return reciprocal_divide(skb->rxhash, num);
+	return reciprocal_scale(skb->rxhash, num);
 }
 
 static unsigned int fanout_demux_lb(struct packet_fanout *f,
-- 
1.8.4.2

^ permalink raw reply related

* [PATCH net-next v5 0/0] reciprocal_divide update
From: Hannes Frederic Sowa @ 2014-01-22  1:29 UTC (permalink / raw)
  To: netdev

This patch is on top of aee636c4809fa5 ("bpf: do not use reciprocal
divide") from Eric that sits in net tree. It will not create a merge
conflict, but it depends on this one, so we suggest, if possible, to
merge net into net-next.

We are proposing this change with only small modifications from the
v2 version, namely updating the name of trim to reciprocal_scale
(as commented on by Ben Hutchings and Eric Dumazet, thanks!).

We thought about introducing the reciprocal_divide algorithm in
parallel to the one already used by the kernel but faced organizational
issues, leading us to the conclusion that it is best to just replace
the old one: We could not come up with names for the different
implementations and also with a way to describe the differences to
guide developers which one to choose in which situation. This is
because we cannot specify the correct semantics for the version
which is currently used by the kernel. Altough it seems to not be
causing problems in the kernel, we cannot surely say so in the
case of flex_array for the future. Current usage seems ok, but
future users could run into problems.

Changelog:

v1->v2:
 - changed name to prandom_u32_max in p1
 - changed name to trim in p2
 - reworked code in p3
v2->v3:
 - p1 and p3 stays unchanged, only small update in commit
   message in p3
 - changed name to reciprocal_scale in p2
 - fixed kernel doc format
v3->v4:
 - pseduo -> pseudo (thanks to Tilman Schmidt)
v4->v5:
 - fix pseduo -> pseudo for real now, sorry for the noise

Thanks !

Daniel Borkmann (2):
  random32: add prandom_u32_max and convert open coded users
  net: introduce reciprocal_scale helper and convert users

Hannes Frederic Sowa (1):
  reciprocal_divide: update/correction of the algorithm

 drivers/net/bonding/bond_main.c     | 24 ++++++++++++++++-------
 drivers/net/bonding/bond_netlink.c  |  4 ----
 drivers/net/bonding/bond_options.c  | 15 +++++++++-----
 drivers/net/bonding/bond_sysfs.c    |  5 -----
 drivers/net/bonding/bonding.h       |  3 +++
 drivers/net/team/team_mode_random.c |  8 +-------
 include/linux/flex_array.h          |  3 ++-
 include/linux/kernel.h              | 19 ++++++++++++++++++
 include/linux/random.h              | 18 ++++++++++++++++-
 include/linux/reciprocal_div.h      | 39 ++++++++++++++++++++-----------------
 include/linux/slab_def.h            |  4 +++-
 include/net/codel.h                 |  4 +---
 include/net/red.h                   |  3 ++-
 lib/flex_array.c                    |  7 ++++++-
 lib/reciprocal_div.c                | 24 +++++++++++++++++++----   
 net/packet/af_packet.c              |  5 ++---
 net/sched/sch_choke.c               |  9 +--------
 net/sched/sch_netem.c               |  6 ++++--
 18 files changed, 129 insertions(+), 71 deletions(-)

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox