* Possible networking regression in 3.6.0
@ 2012-09-17 15:44 Chris Clayton
  2012-09-18 14:21 ` Chris Clayton
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-17 15:44 UTC
To: netdev

Hi,

I'm having a problem with networking. I'm running Windows XP as a KVM
guest on a laptop running kernel 3.6.0-rc6. The identical configuration
works fine with kernels 3.5.4 and 3.4.11 (and has done so, largely
unchanged, since KVM was introduced in 2.6.<whatever>).

The configuration is:

XP guest: 192.168.200.1 (gateway 192.168.200.254)
tap0:     192.168.200.254
host:     192.168.0.40 (gateway 192.168.0.1)
router:   192.168.0.1

The script that starts up the firewall includes the following commands:

# Load the connection-sharing for qemu/kvm guests
echo 1 > /proc/sys/net/ipv4/ip_forward
iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE
...
# allow traffic to and from the qemu/kvm virtual networks
NETS="200 201"
for net in $NETS; do
    iptables -A INPUT -s 192.168.$net.0/24 -j ACCEPT
    iptables -A OUTPUT -d 192.168.$net.0/24 -j ACCEPT
done
...

The network-related modules that are loaded are:

$ lsmod
Module                  Size  Used by
tun                    12412  0
xt_state                 891  1
iptable_filter           852  1
ipt_MASQUERADE          1222  1
iptable_nat             3087  1
nf_nat                 10901  2 ipt_MASQUERADE,iptable_nat
nf_conntrack_ipv4       4942  4 nf_nat,iptable_nat
nf_defrag_ipv4           815  1 nf_conntrack_ipv4
nf_conntrack           37644  5 ipt_MASQUERADE,nf_nat,xt_state,iptable_nat,nf_conntrack_ipv4
...
r8169                  47159  0

From the host I can successfully ping the guest, tap0 and the router as
you would expect, but from the guest, although I can ping the host and
tap0, I cannot ping the router. In practice, this means I have no
internet access from the guest. As I say, this configuration works
perfectly under 3.5.x and 3.4.x kernels.

I'll do a coarse-grained "bisect" of Linus' 3.6 release candidates and
report back, but does anyone have any prime-suspect patches that may be
the cause of this problem?

Let me know if there are any other diagnostics I can provide. Also, as
I'm not subscribed to netdev, please cc me to any reply.

Thanks,

Chris
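For anyone reproducing this setup, the rule loop quoted in the report can be previewed without root. This is only a minimal sketch: the helper name `print_guest_rules` is hypothetical and is not part of the original firewall script. It echoes the commands instead of running them.

```shell
#!/bin/sh
# Hypothetical dry-run helper: prints the iptables commands that the
# loop in the firewall script would run, without touching the ruleset.
print_guest_rules() {
    # $1: space-separated third octets of the guest networks, e.g. "200 201"
    for net in $1; do
        echo "iptables -A INPUT -s 192.168.$net.0/24 -j ACCEPT"
        echo "iptables -A OUTPUT -d 192.168.$net.0/24 -j ACCEPT"
    done
}

print_guest_rules "200 201"
```

Note that these rules cover only the INPUT and OUTPUT chains. Packets routed between tap0 and eth0 traverse the FORWARD chain, so whether they are accepted depends on the parts of the script elided with "..." and on the chain policy, neither of which is shown in the report.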
* Re: Possible networking regression in 3.6.0
  2012-09-17 15:44 Possible networking regression in 3.6.0 Chris Clayton
@ 2012-09-18 14:21 ` Chris Clayton
  2012-09-18 14:31 ` Chris Clayton
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-18 14:21 UTC
To: netdev

On 09/17/12 16:44, Chris Clayton wrote:
[...]
> I'll do a coarse-grained "bisect" of Linus' 3.6 release candidates and
> report back, but does anyone have any prime-suspect patches that may be
> the cause of this problem?
>

-rc1 turned out to have the problem so I've bisected between 3.5 and
3.6-rc1. I arrived at:

$ git bisect bad
d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5 is the first bad commit
commit d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5
Author: David S. Miller <davem@davemloft.net>
Date:   Tue Jul 17 12:58:50 2012 -0700

    ipv4: Cache input routes in fib_info nexthops.

    Caching input routes is slightly simpler than output routes, since we
    don't need to be concerned with nexthop exceptions. (locally
    destined, and routed packets, never trigger PMTU events or redirects
    that will be processed by us).

    However, we have to elide caching for the DIRECTSRC and non-zero itag
    cases.

    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 6bbc75c1cbe62bf84ea412d3b98adf2b614779cd 3ad7256b4a71e63ca4530977c0550121ea803d35 M include
:040000 040000 18c2a950a53c4eec9bfa12185d1e382dfed74af8 a2ab6157d6cd54930da395758c6ded3a225d1f04 M net

The bisect log:

git bisect start
# bad: [0d7614f09c1ebdbaa1599a5aba7593f147bf96ee] Linux 3.6-rc1
git bisect bad 0d7614f09c1ebdbaa1599a5aba7593f147bf96ee
# good: [28a33cbc24e4256c143dce96c7d93bf423229f92] Linux 3.5
git bisect good 28a33cbc24e4256c143dce96c7d93bf423229f92
# bad: [614a6d4341b3760ca98a1c2c09141b71db5d1e90] Merge branch 'for-3.6' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup
git bisect bad 614a6d4341b3760ca98a1c2c09141b71db5d1e90
# bad: [320f5ea0cedc08ef65d67e056bcb9d181386ef2c] genetlink: define lockdep_genl_is_held() when CONFIG_LOCKDEP
git bisect bad 320f5ea0cedc08ef65d67e056bcb9d181386ef2c
# good: [0cd06647b7c24f6633e32a505930a9aa70138c22] Merge branch 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next
git bisect good 0cd06647b7c24f6633e32a505930a9aa70138c22
# good: [dbfa600148a25903976910863c75dae185f8d187] cxgb3: set maximal number of default RSS queues
git bisect good dbfa600148a25903976910863c75dae185f8d187
# good: [efdfad3205403e1d1c5c0bdcbdb647ddd89bfaa3] bnx2: Try to recover from PCI block reset
git bisect good efdfad3205403e1d1c5c0bdcbdb647ddd89bfaa3
# good: [1bf91cdc1bba94ea062a9147d924815c13f029f2] ixgbe: Drop references to deprecated pci_ DMA api and instead use dma_ API
git bisect good 1bf91cdc1bba94ea062a9147d924815c13f029f2
# good: [b6dfd939fdc249fcf8cd7b8006f76239b33eb581] ixgbe: add support for new 82599 device
git bisect good b6dfd939fdc249fcf8cd7b8006f76239b33eb581
# good: [3ba97381343b271296487bf073eb670d5465a8b8] net: ethernet: davinci_emac: add pm_runtime support
git bisect good 3ba97381343b271296487bf073eb670d5465a8b8
# bad: [5e9965c15ba88319500284e590733f4a4629a288] Merge branch 'kill_rtcache'
git bisect bad 5e9965c15ba88319500284e590733f4a4629a288
# good: [f5b0a8743601a4477419171f5046bd07d1c080a0] net: Document dst->obsolete better.
git bisect good f5b0a8743601a4477419171f5046bd07d1c080a0
# bad: [ba3f7f04ef2b19aace38f855aedd17fe43035d50] ipv4: Kill FLOWI_FLAG_RT_NOCACHE and associated code.
git bisect bad ba3f7f04ef2b19aace38f855aedd17fe43035d50
# good: [f2bb4bedf35d5167a073dcdddf16543f351ef3ae] ipv4: Cache output routes in fib_info nexthops.
git bisect good f2bb4bedf35d5167a073dcdddf16543f351ef3ae
# bad: [d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5] ipv4: Cache input routes in fib_info nexthops.
git bisect bad d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5

Checking out the parent commit
(f2bb4bedf35d5167a073dcdddf16543f351ef3ae) and building and installing
the kernel gives a working configuration, so I'm pretty confident in the
outcome of the bisect. Reversing the patch gives errors, so I've not
tested master with the patch reversed.

Let me know if I can help in any way to identify a fix.

Chris
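A bisect log like the one in this message can be re-checked mechanically: `git bisect log` prints it and `git bisect replay <file>` re-runs it in a clone. The following is a minimal sketch, assuming the log has been saved to a file; the file name `bisect.log` and the helper `summarize_bisect` are hypothetical. It counts the verdicts and prints the last commit marked bad, which is the commit the bisect converged on.

```shell
#!/bin/sh
# Hypothetical helper: summarize a saved `git bisect log`.
# Reads the log on stdin; only "git bisect good|bad <sha>" lines are
# counted, so the "# good:" / "# bad:" comment lines are ignored.
summarize_bisect() {
    awk '
        $1 == "git" && $2 == "bisect" && $3 == "good" { good++ }
        $1 == "git" && $2 == "bisect" && $3 == "bad"  { bad++; last_bad = $4 }
        END { printf "good=%d bad=%d last_bad=%s\n", good, bad, last_bad }
    '
}

# Usage (the file name is an assumption):
#   summarize_bisect < bisect.log
```

Because each "bad" verdict moves the bad boundary closer to the culprit, the last commit marked bad in a completed log should match the "is the first bad commit" line.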
* Re: Possible networking regression in 3.6.0
  2012-09-18 14:21 ` Chris Clayton
@ 2012-09-18 14:31 ` Chris Clayton
  2012-09-18 14:40 ` Eric Dumazet
  2012-09-18 14:44 ` Possible networking regression in 3.6.0 Chris Clayton
  0 siblings, 2 replies; 59+ messages in thread
From: Chris Clayton @ 2012-09-18 14:31 UTC
To: netdev

[...]
> Checking out the parent commit
> (f2bb4bedf35d5167a073dcdddf16543f351ef3ae) and building and installing
> the kernel gives a working configuration, so I'm pretty confident in the
> outcome of the bisect. Reversing the patch gives errors, so I've not
> tested master with the patch reversed.
>
> Let me know if I can help in any way to identify a fix.
>

Sorry, I forgot to say that I have also tried running TinyCore Linux as
a KVM guest on a 3.6.0-rc6 kernel, and I can ping the router fine, so
the problem seems to be something specifically related to running
Windows XP as the guest. I don't have any other guests installed so
that's as much as I can say, although I could maybe install a Win7 guest
tomorrow if that would help.

> Chris
* Re: Possible networking regression in 3.6.0
  2012-09-18 14:31 ` Chris Clayton
@ 2012-09-18 14:40 ` Eric Dumazet
  2012-09-18 15:51 ` Chris Clayton
  2012-09-19 15:26 ` Chris Clayton
  2012-09-18 14:44 ` Possible networking regression in 3.6.0 Chris Clayton
  1 sibling, 2 replies; 59+ messages in thread
From: Eric Dumazet @ 2012-09-18 14:40 UTC
To: Chris Clayton; +Cc: netdev

On Tue, 2012-09-18 at 15:31 +0100, Chris Clayton wrote:
[...]
> Sorry, I forgot to say that I also have tried running TinyCore Linux as
> a KVM guest on a 3.6.0-rc6 kernel, and I can ping the router fine, so
> the problem seems to be something specifically related to running
> Windows XP as the guest. I don't have any other guests installed so
> that's as much as I can say, although I could maybe install a Win7 guest
> tomorrow if that would help.

It would help to have some traffic sample, maybe.

Especially if the problem is not easily reproducible for us.

(I don't have Windows XP or Win7)

Also the bisect might point to a commit with an already fixed bug:

commit 4331debc51ee1ce319f4a389484e0e8e05de2aca
Author: Eric Dumazet <edumazet@google.com>
Date:   Wed Jul 25 05:11:23 2012 +0000

    ipv4: rt_cache_valid must check expired routes

    commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
    introduced rt_cache_valid() helper. It unfortunately doesn't check if
    route is expired before caching it.

    I noticed sk_setup_caps() was constantly called on a tcp workload.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>
* Re: Possible networking regression in 3.6.0
  2012-09-18 14:40 ` Eric Dumazet
@ 2012-09-18 15:51 ` Chris Clayton
  2012-09-19 15:26 ` Chris Clayton
  1 sibling, 0 replies; 59+ messages in thread
From: Chris Clayton @ 2012-09-18 15:51 UTC
To: Eric Dumazet; +Cc: netdev

Thanks for the reply, Eric.

[...]
>> Sorry, I forgot to say that I also have tried running TinyCore Linux as
>> a KVM guest on a 3.6.0-rc6 kernel, and I can ping the router fine, so
>> the problem seems to be something specifically related to running
>> Windows XP as the guest. I don't have any other guests installed so
>> that's as much as I can say, although I could maybe install a Win7 guest
>> tomorrow if that would help.
>

I hope you've seen my later email in which I reported the error in my
testing that led me to believe that all was OK with a Linux client. In
fact, the router is inaccessible from both the Windows XP and the Linux
clients.

> It would help to have some traffic sample, maybe.
>

I'll need help here. How would I go about collecting that traffic? I
have wireshark installed, but haven't used it for years. Would a trace
from that be helpful? It might take me a while to figure out how to
capture it.

> Especially if the problem is not easily reproducible for us.
>
> (I don't have Windows XP or Win7)
>
> Also the bisect might point to a commit with an already fixed bug:

This fix is already in 3.6.0-rc6. BTW, I've pulled the latest changes
from kernel.org this afternoon, but that hasn't helped.

>
> commit 4331debc51ee1ce319f4a389484e0e8e05de2aca
> Author: Eric Dumazet <edumazet@google.com>
> Date:   Wed Jul 25 05:11:23 2012 +0000
>
>     ipv4: rt_cache_valid must check expired routes
> [...]
* Re: Possible networking regression in 3.6.0
  2012-09-18 14:40 ` Eric Dumazet
  2012-09-18 15:51 ` Chris Clayton
@ 2012-09-19 15:26 ` Chris Clayton
  2012-09-22 6:26 ` Chris Clayton
  1 sibling, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-19 15:26 UTC
To: Eric Dumazet; +Cc: netdev

> It would help to have some traffic sample, maybe.
>
> Especially if the problem is not easily reproducible for us.
>

OK, I've used netsniff-ng to capture the traffic on all interfaces on
the host (that would be tap0 and eth0, I guess) whilst attempting to
ping the router from the WinXP KVM client. The result is a pcap file
that I processed with tcpdump to produce:

reading from file net-trace.pcap, link-type EN10MB (Ethernet)
14:56:31.406336 ARP, Request who-has 192.168.200.254 tell 192.168.200.1, length 28
    0x0000:  0001 0800 0604 0001 5254 0c3b 1728 c0a8
    0x0010:  c801 0000 0000 0000 c0a8 c8fe
14:56:31.406357 ARP, Reply 192.168.200.254 is-at 46:83:93:8f:f0:7e, length 28
    0x0000:  0001 0800 0604 0002 4683 938f f07e c0a8
    0x0010:  c8fe 5254 0c3b 1728 c0a8 c801
14:56:31.406534 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4352, length 40
    0x0000:  4500 003c 0195 0000 8001 efd8 c0a8 c801
    0x0010:  c0a8 0001 0800 3a5c 0200 1100 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:31.406566 ARP, Request who-has 192.168.0.1 tell 192.168.0.40, length 28
    0x0000:  0001 0800 0604 0001 5c9a d85c 6331 c0a8
    0x0010:  0028 0000 0000 0000 c0a8 0001
14:56:31.410830 ARP, Reply 192.168.0.1 is-at 00:1f:33:80:09:44, length 46
    0x0000:  0001 0800 0604 0002 001f 3380 0944 c0a8
    0x0010:  0001 5c9a d85c 6331 c0a8 0028 c0a8 0001
    0x0020:  e000 0001 1164 ee9b 0000 0000 4500
14:56:31.410851 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4352, length 40
    0x0000:  4500 003c 0195 0000 7f01 b8b2 c0a8 0028
    0x0010:  c0a8 0001 0800 3a5c 0200 1100 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:31.414474 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4352, length 40
    0x0000:  4500 003c cf4f 0000 ff01 6af7 c0a8 0001
    0x0010:  c0a8 0028 0000 425c 0200 1100 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:36.404781 ARP, Request who-has 192.168.0.40 tell 192.168.0.1, length 46
    0x0000:  0001 0800 0604 0001 001f 3380 0944 c0a8
    0x0010:  0001 0000 0000 0000 c0a8 0028 c0a8 0001
    0x0020:  c0a8 0028 0000 425c 0200 1100 6162
14:56:36.404806 ARP, Reply 192.168.0.40 is-at 5c:9a:d8:5c:63:31, length 28
    0x0000:  0001 0800 0604 0002 5c9a d85c 6331 c0a8
    0x0010:  0028 001f 3380 0944 c0a8 0001
14:56:36.689750 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4608, length 40
    0x0000:  4500 003c 0196 0000 8001 efd7 c0a8 c801
    0x0010:  c0a8 0001 0800 395c 0200 1200 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:36.689774 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4608, length 40
    0x0000:  4500 003c 0196 0000 7f01 b8b1 c0a8 0028
    0x0010:  c0a8 0001 0800 395c 0200 1200 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:36.693330 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4608, length 40
    0x0000:  4500 003c cf50 0000 ff01 6af6 c0a8 0001
    0x0010:  c0a8 0028 0000 415c 0200 1200 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:42.189424 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4864, length 40
    0x0000:  4500 003c 0197 0000 8001 efd6 c0a8 c801
    0x0010:  c0a8 0001 0800 385c 0200 1300 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:42.189447 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4864, length 40
    0x0000:  4500 003c 0197 0000 7f01 b8b0 c0a8 0028
    0x0010:  c0a8 0001 0800 385c 0200 1300 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:42.193029 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4864, length 40
    0x0000:  4500 003c cf51 0000 ff01 6af5 c0a8 0001
    0x0010:  c0a8 0028 0000 405c 0200 1300 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:47.689414 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 5120, length 40
    0x0000:  4500 003c 0198 0000 8001 efd5 c0a8 c801
    0x0010:  c0a8 0001 0800 375c 0200 1400 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:47.689439 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 5120, length 40
    0x0000:  4500 003c 0198 0000 7f01 b8af c0a8 0028
    0x0010:  c0a8 0001 0800 375c 0200 1400 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869
14:56:47.693661 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 5120, length 40
    0x0000:  4500 003c cf52 0000 ff01 6af4 c0a8 0001
    0x0010:  c0a8 0028 0000 3f5c 0200 1400 6162 6364
    0x0020:  6566 6768 696a 6b6c 6d6e 6f70 7172 7374
    0x0030:  7576 7761 6263 6465 6667 6869

Is this what you asked for?

Chris
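One way to read the hex rows in this trace: in an IPv4 header, byte 8 is the TTL, byte 9 the protocol number, and bytes 12 to 15 the source address. The following is a minimal sketch, assuming hypothetical helpers `hex_byte` and `decode_ipv4_row` (neither comes from the thread) that are fed the first 16-byte row of a packet as printed in the dump.

```shell
#!/bin/sh
# Hypothetical helpers: decode TTL, protocol and source address from the
# first 16 bytes of an IPv4 header, given as the hex groups shown in the dump.
hex_byte() {
    # $1: hex string without spaces, $2: zero-based byte offset
    start=$(( $2 * 2 + 1 ))
    end=$(( start + 1 ))
    printf '%d' "0x$(printf '%s' "$1" | cut -c"$start"-"$end")"
}

decode_ipv4_row() {
    # Strip the spaces between the 2-byte groups, then pick fields by offset.
    hex=$(printf '%s' "$1" | tr -d ' ')
    printf 'ttl=%d proto=%d src=%d.%d.%d.%d\n' \
        "$(hex_byte "$hex" 8)" "$(hex_byte "$hex" 9)" \
        "$(hex_byte "$hex" 12)" "$(hex_byte "$hex" 13)" \
        "$(hex_byte "$hex" 14)" "$(hex_byte "$hex" 15)"
}

decode_ipv4_row "4500 003c 0195 0000 8001 efd8 c0a8 c801"   # echo request as sent by the guest
decode_ipv4_row "4500 003c 0195 0000 7f01 b8b2 c0a8 0028"   # the same request as it leaves the host
decode_ipv4_row "4500 003c cf4f 0000 ff01 6af7 c0a8 0001"   # echo reply from the router
```

On these rows the request leaves the guest with TTL 128 and source 192.168.200.1, then reappears with TTL 127 and source 192.168.0.40, which is consistent with the host forwarding and masquerading it. The router's replies to 192.168.0.40 are present, but as far as this excerpt shows, no packet addressed back to 192.168.200.1 follows them.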
* Re: Possible networking regression in 3.6.0
  2012-09-19 15:26 ` Chris Clayton
@ 2012-09-22 6:26 ` Chris Clayton
  2012-09-27 11:50 ` Chris Clayton
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-22 6:26 UTC
To: Chris Clayton; +Cc: Eric Dumazet, netdev

I guess you network developer folks are either very busy or this
regression is proving a bit troublesome to identify, so I've opened a
bugzilla report to keep track of it. The report number is 47761.

Chris

On 09/19/12 16:26, Chris Clayton wrote:
>>
>> It would help to have some traffic sample, maybe.
>>
>> Especially if the problem is not easily reproducible for us.
>>
>
> OK, I've used netsniff-ng to capture the traffic on all interfaces on
> the host (that would be tap0 and eth0, I guess) whilst attempting to
> ping the router from the WinXP KVM client. The result is a pcap file
> that I processed with tcpdump to produce:
>
> reading from file net-trace.pcap, link-type EN10MB (Ethernet)
> 14:56:31.406336 ARP, Request who-has 192.168.200.254 tell 192.168.200.1,
> length 28
> 0x0000: 0001 0800 0604 0001 5254 0c3b 1728 c0a8
> 0x0010: c801 0000 0000 0000 c0a8 c8fe
> 14:56:31.406357 ARP, Reply 192.168.200.254 is-at 46:83:93:8f:f0:7e,
> length 28
> 0x0000: 0001 0800 0604 0002 4683 938f f07e c0a8
> 0x0010: c8fe 5254 0c3b 1728 c0a8 c801
> 14:56:31.406534 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id
> 512, seq 4352, length 40
> 0x0000: 4500 003c 0195 0000 8001 efd8 c0a8 c801
> 0x0010: c0a8 0001 0800 3a5c 0200 1100 6162 6364
> 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374
> 0x0030: 7576 7761 6263 6465 6667 6869
> 14:56:31.406566 ARP, Request who-has 192.168.0.1 tell 192.168.0.40,
> length 28
> 0x0000: 0001 0800 0604 0001 5c9a d85c 6331 c0a8
> 0x0010: 0028 0000 0000 0000 c0a8 0001
> 14:56:31.410830 ARP, Reply 192.168.0.1 is-at 00:1f:33:80:09:44, length 46
> 0x0000: 0001 0800 0604 0002 001f 3380 0944 c0a8
> 0x0010: 0001 5c9a d85c 6331 c0a8 0028 c0a8 0001
> 0x0020: e000
0001 1164 ee9b 0000 0000 4500 > 14:56:31.410851 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id > 512, seq 4352, length 40 > 0x0000: 4500 003c 0195 0000 7f01 b8b2 c0a8 0028 > 0x0010: c0a8 0001 0800 3a5c 0200 1100 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:31.414474 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, > seq 4352, length 40 > 0x0000: 4500 003c cf4f 0000 ff01 6af7 c0a8 0001 > 0x0010: c0a8 0028 0000 425c 0200 1100 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:36.404781 ARP, Request who-has 192.168.0.40 tell 192.168.0.1, > length 46 > 0x0000: 0001 0800 0604 0001 001f 3380 0944 c0a8 > 0x0010: 0001 0000 0000 0000 c0a8 0028 c0a8 0001 > 0x0020: c0a8 0028 0000 425c 0200 1100 6162 > 14:56:36.404806 ARP, Reply 192.168.0.40 is-at 5c:9a:d8:5c:63:31, length 28 > 0x0000: 0001 0800 0604 0002 5c9a d85c 6331 c0a8 > 0x0010: 0028 001f 3380 0944 c0a8 0001 > 14:56:36.689750 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id > 512, seq 4608, length 40 > 0x0000: 4500 003c 0196 0000 8001 efd7 c0a8 c801 > 0x0010: c0a8 0001 0800 395c 0200 1200 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:36.689774 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id > 512, seq 4608, length 40 > 0x0000: 4500 003c 0196 0000 7f01 b8b1 c0a8 0028 > 0x0010: c0a8 0001 0800 395c 0200 1200 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:36.693330 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, > seq 4608, length 40 > 0x0000: 4500 003c cf50 0000 ff01 6af6 c0a8 0001 > 0x0010: c0a8 0028 0000 415c 0200 1200 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:42.189424 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id > 512, seq 4864, length 40 > 0x0000: 4500 003c 0197 0000 8001 efd6 
c0a8 c801 > 0x0010: c0a8 0001 0800 385c 0200 1300 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:42.189447 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id > 512, seq 4864, length 40 > 0x0000: 4500 003c 0197 0000 7f01 b8b0 c0a8 0028 > 0x0010: c0a8 0001 0800 385c 0200 1300 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:42.193029 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, > seq 4864, length 40 > 0x0000: 4500 003c cf51 0000 ff01 6af5 c0a8 0001 > 0x0010: c0a8 0028 0000 405c 0200 1300 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:47.689414 IP 192.168.200.1 > 192.168.0.1: ICMP echo request, id > 512, seq 5120, length 40 > 0x0000: 4500 003c 0198 0000 8001 efd5 c0a8 c801 > 0x0010: c0a8 0001 0800 375c 0200 1400 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:47.689439 IP 192.168.0.40 > 192.168.0.1: ICMP echo request, id > 512, seq 5120, length 40 > 0x0000: 4500 003c 0198 0000 7f01 b8af c0a8 0028 > 0x0010: c0a8 0001 0800 375c 0200 1400 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > 14:56:47.693661 IP 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, > seq 5120, length 40 > 0x0000: 4500 003c cf52 0000 ff01 6af4 c0a8 0001 > 0x0010: c0a8 0028 0000 3f5c 0200 1400 6162 6364 > 0x0020: 6566 6768 696a 6b6c 6d6e 6f70 7172 7374 > 0x0030: 7576 7761 6263 6465 6667 6869 > > Is this what you asked for? > > Chris > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-22 6:26 ` Chris Clayton
@ 2012-09-27 11:50 ` Chris Clayton
  2012-09-27 12:14   ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-27 11:50 UTC (permalink / raw)
  To: Chris Clayton; +Cc: Eric Dumazet, netdev, gpiez

Just for information - I've pulled Linus' tree this morning and the problem is still present. Also, Gunther Piaz has reported, via the bugzilla entry, that he too has hit this regression.

On 09/22/12 07:26, Chris Clayton wrote:
> I guess you network developer folks are either very busy or this
> regression is proving a bit troublesome to identify, so I've opened a
> bugzilla report to keep track of it. The report number is 47761.
>
> Chris
>
> On 09/19/12 16:26, Chris Clayton wrote:
>>>
>>> It would help to have some traffic sample, maybe.
>>>
>>> Especially if the problem is not easily reproductible for us.
>>>
>>
>> OK, I've used an netsniff-ng to capture the traffic on all interfaces on
>> the host (that would be tap0 and eth0, I guess) whilst attempting to
>> ping the router from the WinXP KVM client. The result is a pcap file
>> that I processed with tcpdump to produce:
>>
>> [...]
>>
>> Is this what you asked for?
>>
>> Chris
>>
>
>

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-27 11:50 ` Chris Clayton
@ 2012-09-27 12:14 ` Eric Dumazet
  2012-09-27 18:05   ` Chris Clayton
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2012-09-27 12:14 UTC (permalink / raw)
  To: Chris Clayton; +Cc: netdev, gpiez

On Thu, 2012-09-27 at 12:50 +0100, Chris Clayton wrote:
> Just for information - I've pulled Linus' tree this morning and the
> problem is still present. Also, Gunther Piaz has reported, via the
> bugzilla entry, that he too has hit this regression.

I tried to reproduce the bug, and my kvm guests have no problem.

I guess you need to precisely describe how you set up your network, so that I can reproduce the problem and eventually fix it.

Thanks

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-27 12:14 ` Eric Dumazet
@ 2012-09-27 18:05 ` Chris Clayton
  2012-09-27 21:03   ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-27 18:05 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, gpiez

On 09/27/12 13:14, Eric Dumazet wrote:
> On Thu, 2012-09-27 at 12:50 +0100, Chris Clayton wrote:
>> Just for information - I've pulled Linus' tree this morning and the
>> problem is still present. Also, Gunther Piaz has reported, via the
>> bugzilla entry, that he too has hit this regression.
>
> I tried to reproduce the bug, and my kvm guests have no problem.
>
> I guess you need to precisely describe how you setup your network, so
> that I can reproduce the problem and eventually fix it.
>

You've seen the bits from my firewall setup script that relate to this issue. I start the WinXP client with another script:

#!/bin/sh
if [ -e $HOME/kvm/var/run/kvm-winxp.pid ]; then
    echo "winxp is already running ..." > /dev/stderr
    exit 1
fi

# make sure the kvm modules are loaded
if test -z "$(grep '\<kvm\>' /proc/misc)"; then
    sudo modprobe kvm-intel
    while test -z "$(grep '\<kvm\>' /proc/misc)"; do
        true
    done
fi

# make sure tun module is loaded
if test ! -e /dev/net/tun; then
    sudo modprobe tun
fi

# figure out the cpu to use
QVER=$(qemu-kvm --version | cut -d' ' -f 4 | sed 's/,/./')
# assumes major version is 1
MINORVER=$(echo $QVER | cut -d'.' -f 2)
if [ $MINORVER -ge 1 ]; then
    CPU="host"
else
    CPU="qemu64"
fi

# set up the network interface
TAPDEV=$(sudo tunctl -b -u $(whoami))
sudo ifconfig $TAPDEV 192.168.200.254 netmask 255.255.255.0 broadcast 192.168.200.255

# start Windows XP
qemu-kvm -drive file=$HOME/kvm/winxp.qcow2,index=0,cache=none,if=virtio -cpu $CPU -smp cores=1,threads=2 -soundhw es1370 \
    -m 768 -net nic,model=virtio,macaddr=$(getmacaddr) -net tap,ifname=$TAPDEV -startdate $(date +%Y-%m-%dT%H:%M:%S) \
    -name kxplaptop -pidfile $HOME/kvm/var/run/kvm-winxp.pid $*

# stop the network interface
sudo ifconfig $TAPDEV down
sudo tunctl -d $TAPDEV &>/dev/null

# tidy up
rm -f $HOME/kvm/var/run/kvm-winxp.pid

The call to getmacaddr just returns the next in a sequence of MAC addresses. qemu-kvm is a symlink to /usr/bin/qemu-system-i386. I first found the problem whilst running qemu-kvm version 1.1.1, although I've since updated to 1.2.0.

By the way, I doubt it will make a difference, but, although my laptop has a 64-bit CPU, I am running a 32-bit kernel and, obviously, user space.

Let me know if you need anything else.

Thanks

> Thanks
>
>

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-27 18:05 ` Chris Clayton
@ 2012-09-27 21:03 ` Eric Dumazet
  2012-09-27 21:17   ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2012-09-27 21:03 UTC (permalink / raw)
  To: Chris Clayton; +Cc: netdev, gpiez

On Thu, 2012-09-27 at 19:05 +0100, Chris Clayton wrote:
> You've seen the bits from my firewall setup script that relate to this
> issue. I start the WinXP client with another script:
>
> [...]
>
> Let me know if you need anything else.

It works for me.

Hmm, maybe your guest is using DHCP and DHCP fails?

Could you check?

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-27 21:03 ` Eric Dumazet
@ 2012-09-27 21:17 ` Eric Dumazet
  2012-09-28 6:53   ` David Miller
  2012-09-28 9:22   ` Chris Clayton
  0 siblings, 2 replies; 59+ messages in thread
From: Eric Dumazet @ 2012-09-27 21:17 UTC (permalink / raw)
  To: Chris Clayton, David Miller; +Cc: netdev, gpiez

On Thu, 2012-09-27 at 23:03 +0200, Eric Dumazet wrote:
> On Thu, 2012-09-27 at 19:05 +0100, Chris Clayton wrote:
> > [...]
> > Let me know if you need anything else.
>
> It works for me.
>
> Hmm, maybe your guest is using DHCP and DHCP fails ?

Yes, that seems to be the problem. On the host I tried:

# ip ro get 8.8.8.8 from 192.168.200.1 iif tap1
8.8.8.8 from 192.168.200.1 via 172.30.42.1 dev eth0
    cache iif *

So if the guest tries to send a frame to 8.8.8.8, we are going to forward the packet to eth0.

But if the guest tries to send to 255.255.255.255, we try to deliver the packet to the host itself, instead of broadcasting to eth0:

# ip ro get 255.255.255.255 from 192.168.200.1 iif tap1
broadcast 255.255.255.255 from 192.168.200.1 dev lo
    cache <local,brd> iif *

David, maybe you'll have an idea?

Thanks

^ permalink raw reply [flat|nested] 59+ messages in thread
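The two `ip ro get` results in the message above differ in the output device: `dev eth0` means the packet would be forwarded, `dev lo` (with the `<local,brd>` flags) means it would be delivered to the host itself. A small sketch of how that first line can be classified mechanically follows. The function name `classify_route` is hypothetical, and the sample strings are the first output lines copied from the message; producing them live would need the `ip route get ... from ... iif ...` command on the affected host.

```sh
#!/bin/sh
# Illustrative only: classify the first line of `ip route get` output.
# "local"   -> the kernel would deliver the packet to the host (dev lo or <local...> flag)
# "forward" -> the kernel would send it out of some other device
classify_route() {
    case "$1" in
        *" dev lo "*|*" dev lo"|*"<local"*) echo "local" ;;
        *" dev "*)                          echo "forward" ;;
        *)                                  echo "unknown" ;;
    esac
}

# First output lines copied from the message above.
classify_route "8.8.8.8 from 192.168.200.1 via 172.30.42.1 dev eth0"
classify_route "broadcast 255.255.255.255 from 192.168.200.1 dev lo"
```

This only restates what the route lookup reports; whether local delivery of the limited broadcast is related to the regression is left open by the message itself.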
* Re: Possible networking regression in 3.6.0
  2012-09-27 21:17 ` Eric Dumazet
@ 2012-09-28 6:53 ` David Miller
  2012-09-28 9:14   ` Chris Clayton
  2012-09-28 9:22 ` Chris Clayton
  1 sibling, 1 reply; 59+ messages in thread
From: David Miller @ 2012-09-28 6:53 UTC (permalink / raw)
  To: eric.dumazet; +Cc: chris2553, netdev, gpiez

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 27 Sep 2012 23:17:04 +0200

> Yes it seems the problem. On the host I tried :
>
> # ip ro get 8.8.8.8 from 192.168.200.1 iif tap1
> 8.8.8.8 from 192.168.200.1 via 172.30.42.1 dev eth0
>     cache iif *
>
> So if the guest tries to send a frame to 8.8.8.8 we are going to forward
> the packet to eth0
>
> But if the guest tries to send to 255.255.255.255, we try to deliver the
> packet to the host itself, instead of broadcasting to eth0
>
> # ip ro get 255.255.255.255 from 192.168.200.1 iif tap1
> broadcast 255.255.255.255 from 192.168.200.1 dev lo
>     cache <local,brd> iif *
>
> David, maybe you'll have an idea ?

Perhaps this was introduced by:

commit 7bd86cc282a458b66c41e3f6676de6656c99b8db
Author: Yan, Zheng <zheng.z.yan@intel.com>
Date: Sun Aug 12 20:09:59 2012 +0000

    ipv4: Cache local output routes

    Commit caacf05e5ad1abf causes big drop of UDP loop back performance.
    The cause of the regression is that we do not cache the local output
    routes. Each time we send a datagram from unconnected UDP socket,
    the kernel allocates a dst_entry and adds it to the rt_uncached_list.
    It creates lock contention on the rt_uncached_lock.

    Reported-by: Alex Shi <alex.shi@intel.com>
    Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index e4ba974..fd9ecb5 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -2028,7 +2028,6 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
 		}
 		dev_out = net->loopback_dev;
 		fl4->flowi4_oif = dev_out->ifindex;
-		res.fi = NULL;
 		flags |= RTCF_LOCAL;
 		goto make_route;
 	}

^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-28 6:53 ` David Miller
@ 2012-09-28 9:14 ` Chris Clayton
  0 siblings, 0 replies; 59+ messages in thread
From: Chris Clayton @ 2012-09-28 9:14 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, gpiez

On 09/28/12 07:53, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 27 Sep 2012 23:17:04 +0200
>
>> Yes it seems the problem. On the host I tried :
>>
>> # ip ro get 8.8.8.8 from 192.168.200.1 iif tap1
>> 8.8.8.8 from 192.168.200.1 via 172.30.42.1 dev eth0
>>     cache iif *
>>
>> So if the guest tries to send a frame to 8.8.8.8 we are going to forward
>> the packet to eth0
>>
>> But if the guest tries to send to 255.255.255.255, we try to deliver the
>> packet to the host itself, instead of broadcasting to eth0
>>
>> # ip ro get 255.255.255.255 from 192.168.200.1 iif tap1
>> broadcast 255.255.255.255 from 192.168.200.1 dev lo
>>     cache <local,brd> iif *
>>
>> David, maybe you'll have an idea ?
>
> Perhaps this was introduced by:

Thanks, David. Unfortunately, reversing that patch does not fix the problem. The pings from the KVM client to the router still time out.

I have bisected this (see http://marc.info/?l=linux-netdev&m=134797809611847&w=2) and that rendered:

$ git bisect bad
d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5 is the first bad commit
commit d2d68ba9fe8b38eb03124b3176a013bb8aa2b5e5
Author: David S. Miller <davem@davemloft.net>
Date: Tue Jul 17 12:58:50 2012 -0700

    ipv4: Cache input routes in fib_info nexthops.

    Caching input routes is slightly simpler than output routes, since we
    don't need to be concerned with nexthop exceptions. (locally
    destined, and routed packets, never trigger PMTU events or redirects
    that will be processed by us).

    However, we have to elide caching for the DIRECTSRC and non-zero itag
    cases.

    Signed-off-by: David S. Miller <davem@davemloft.net>

:040000 040000 6bbc75c1cbe62bf84ea412d3b98adf2b614779cd 3ad7256b4a71e63ca4530977c0550121ea803d35 M include
:040000 040000 18c2a950a53c4eec9bfa12185d1e382dfed74af8 a2ab6157d6cd54930da395758c6ded3a225d1f04 M net

Unfortunately, the related patches don't reverse cleanly, but a kernel built from a git checkout of the parent commit (f2bb4bedf35d5167a073dcdddf16543f351ef3ae) works fine.

>
> commit 7bd86cc282a458b66c41e3f6676de6656c99b8db
> Author: Yan, Zheng <zheng.z.yan@intel.com>
> Date: Sun Aug 12 20:09:59 2012 +0000
>
>     ipv4: Cache local output routes
>
>     Commit caacf05e5ad1abf causes big drop of UDP loop back performance.
>     The cause of the regression is that we do not cache the local output
>     routes. Each time we send a datagram from unconnected UDP socket,
>     the kernel allocates a dst_entry and adds it to the rt_uncached_list.
>     It creates lock contention on the rt_uncached_lock.
>
>     Reported-by: Alex Shi <alex.shi@intel.com>
>     Signed-off-by: Yan, Zheng <zheng.z.yan@intel.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>
>
> diff --git a/net/ipv4/route.c b/net/ipv4/route.c
> index e4ba974..fd9ecb5 100644
> --- a/net/ipv4/route.c
> +++ b/net/ipv4/route.c
> @@ -2028,7 +2028,6 @@ struct rtable *__ip_route_output_key(struct net *net, struct flowi4 *fl4)
>  		}
>  		dev_out = net->loopback_dev;
>  		fl4->flowi4_oif = dev_out->ifindex;
> -		res.fi = NULL;
>  		flags |= RTCF_LOCAL;
>  		goto make_route;
>  	}
>

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-27 21:17 ` Eric Dumazet
  2012-09-28 6:53   ` David Miller
@ 2012-09-28 9:22 ` Chris Clayton
  2012-09-28 11:26   ` Eric Dumazet
  1 sibling, 1 reply; 59+ messages in thread
From: Chris Clayton @ 2012-09-28 9:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, gpiez

On 09/27/12 22:17, Eric Dumazet wrote:
> On Thu, 2012-09-27 at 23:03 +0200, Eric Dumazet wrote:
>> On Thu, 2012-09-27 at 19:05 +0100, Chris Clayton wrote:
>>> [...]
>>> Let me know if you need anything else.
>>
>> It works for me.
>>
>> Hmm, maybe your guest is using DHCP and DHCP fails ?

No, the WinXP guest is configured with a fixed IP address (192.168.200.1). Subnet mask is 255.255.255.0, and default gateway is 192.168.200.254. DNS is 192.168.0.1.

>
> Yes it seems the problem. On the host I tried :
>
> # ip ro get 8.8.8.8 from 192.168.200.1 iif tap1
> 8.8.8.8 from 192.168.200.1 via 172.30.42.1 dev eth0
>     cache iif *
>
> So if the guest tries to send a frame to 8.8.8.8 we are going to forward
> the packet to eth0
>
> But if the guest tries to send to 255.255.255.255, we try to deliver the
> packet to the host itself, instead of broadcasting to eth0
>
> # ip ro get 255.255.255.255 from 192.168.200.1 iif tap1
> broadcast 255.255.255.255 from 192.168.200.1 dev lo
>     cache <local,brd> iif *
>
>
> David, maybe you'll have an idea ?
>
> Thanks
>
>

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-09-28 9:22 ` Chris Clayton
@ 2012-09-28 11:26 ` Eric Dumazet
  2012-09-28 14:28   ` Chris Clayton
  2012-09-30 15:26   ` Chris Clayton
  0 siblings, 2 replies; 59+ messages in thread
From: Eric Dumazet @ 2012-09-28 11:26 UTC (permalink / raw)
  To: Chris Clayton; +Cc: David Miller, netdev, gpiez

On Fri, 2012-09-28 at 10:22 +0100, Chris Clayton wrote:

> No, the WinXP guest is configured with a fixed IP address
> (192.168.200.1). Subnet mask is 255.255.255.0, and default gateway is
> 192.168.200.254. DNS is 192.168.0.1.
>

I have no problem with such a setup, with a Linux guest.

Could you send a tcpdump again, this time including the link-level header (option -e)?

Ideally, you could send two traces, one taken on tap0, and another taken on eth0.

^ permalink raw reply [flat|nested] 59+ messages in thread
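The request above amounts to one capture per interface with link-level headers enabled. The following is a dry-run sketch only: it prints the commands rather than running them, since a real capture needs root and both captures would have to run while the guest is pinging. The function name `capture_cmd` and the `<interface>.trace` output file names are assumptions for illustration; `-n`, `-e` and `-i` are standard tcpdump options.

```sh
#!/bin/sh
# Dry-run sketch: print one tcpdump command per interface instead of running it.
# -n  do not resolve addresses to names
# -e  print the link-level (Ethernet) header on each line
# -i  capture on the named interface
capture_cmd() {
    printf 'tcpdump -n -e -i %s > %s.trace\n' "$1" "$1"
}

for dev in tap0 eth0; do
    capture_cmd "$dev"
done
```

Taking both traces over the same ping run makes it possible to match sequence numbers across tap0 and eth0.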
* Re: Possible networking regression in 3.6.0
  2012-09-28 11:26 ` Eric Dumazet
@ 2012-09-28 14:28 ` Chris Clayton
  2012-09-30 15:26 ` Chris Clayton
  1 sibling, 0 replies; 59+ messages in thread
From: Chris Clayton @ 2012-09-28 14:28 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, gpiez

On 09/28/12 12:26, Eric Dumazet wrote:
> On Fri, 2012-09-28 at 10:22 +0100, Chris Clayton wrote:
>
>> No, the WinXP guest is configured with a fixed IP address
>> (192.168.200.1). Subnet mask is 255.255.255.0, and default gateway is
>> 192.168.200.254. DNS is 192.168.0.1.
>>
>
> I have no problem with such a setup, with a linux guest.
>
> Could you send again a tcpdump, but including link-level header ?
> (option -e)
>
> Ideally, you could send two traces, one taken on tap0, and another taken
> on eth0.
>

Two traces:

Trace 1 - tap0 (192.168.200.254) whilst pinging router (192.168.0.1) from KVM guest (192.168.200.1):

15:03:14.953599 52:54:0c:3b:17:38 > Broadcast, ethertype ARP (0x0806), length 42: Request who-has 192.168.200.254 tell 192.168.200.1, length 28
15:03:14.953617 9e:c3:0c:c8:65:8d > 52:54:0c:3b:17:38, ethertype ARP (0x0806), length 42: Reply 192.168.200.254 is-at 9e:c3:0c:c8:65:8d, length 28
15:03:14.953725 52:54:0c:3b:17:38 > 9e:c3:0c:c8:65:8d, ethertype IPv4 (0x0800), length 74: 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 5376, length 40
15:03:20.427278 52:54:0c:3b:17:38 > 9e:c3:0c:c8:65:8d, ethertype IPv4 (0x0800), length 74: 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 5632, length 40
15:03:25.942215 52:54:0c:3b:17:38 > 9e:c3:0c:c8:65:8d, ethertype IPv4 (0x0800), length 74: 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 5888, length 40
15:03:31.455578 52:54:0c:3b:17:38 > 9e:c3:0c:c8:65:8d, ethertype IPv4 (0x0800), length 74: 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 6144, length 40

Trace 2 - eth0 (192.168.0.40) whilst pinging router (192.168.0.1) from KVM guest (192.168.200.1):

15:04:06.427863 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 6400, length 40
15:04:06.432100 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 6400, length 40
15:04:11.430877 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype ARP (0x0806), length 60: Request who-has 192.168.0.40 tell 192.168.0.1, length 46
15:04:11.430898 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype ARP (0x0806), length 42: Reply 192.168.0.40 is-at 5c:9a:d8:5c:63:31, length 28
15:04:11.567319 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 6656, length 40
15:04:11.571534 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 6656, length 40
15:04:16.577137 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype ARP (0x0806), length 42: Request who-has 192.168.0.1 tell 192.168.0.40, length 28
15:04:16.580373 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype ARP (0x0806), length 60: Reply 192.168.0.1 is-at 00:1f:33:80:09:44, length 46
15:04:17.083328 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 6912, length 40
15:04:17.086854 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 6912, length 40
15:04:22.585766 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 7168, length 40
15:04:22.589989 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 7168, length 40
15:04:32.240422 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 446: 192.168.0.112.2704 > 239.255.255.250.1900: UDP, length 404
15:04:32.241404 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 455: 192.168.0.112.2704 > 239.255.255.250.1900: UDP, length 413
15:04:32.242915 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 494: 192.168.0.112.2704 > 239.255.255.250.1900: UDP, length 452
15:04:32.243986 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 490: 192.168.0.112.1434 > 239.255.255.250.1900: UDP, length 448
15:04:32.245476 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: 192.168.0.112.2901 > 239.255.255.250.1900: UDP, length 444
15:04:32.246545 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: 192.168.0.112.3828 > 239.255.255.250.1900: UDP, length 444
15:04:32.342459 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 446: 192.168.0.112.4445 > 239.255.255.250.1900: UDP, length 404
15:04:32.343506 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 455: 192.168.0.112.4445 > 239.255.255.250.1900: UDP, length 413
15:04:32.345017 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 494: 192.168.0.112.4445 > 239.255.255.250.1900: UDP, length 452
15:04:32.346087 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 490: 192.168.0.112.2735 > 239.255.255.250.1900: UDP, length 448
15:04:32.348314 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: 192.168.0.112.4940 > 239.255.255.250.1900: UDP, length 444
15:04:32.349362 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: 192.168.0.112.1029 > 239.255.255.250.1900: UDP, length 444

The second trace seems to contain some UPnP-related traffic involving my satellite TV box. If it would help, I can turn that off when my wife isn't watching TV, and run the traces again.

Chris

>
>
>
>

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-09-28 11:26 ` Eric Dumazet 2012-09-28 14:28 ` Chris Clayton @ 2012-09-30 15:26 ` Chris Clayton 2012-09-30 19:45 ` Eric Dumazet 1 sibling, 1 reply; 59+ messages in thread From: Chris Clayton @ 2012-09-30 15:26 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, gpiez Hi Eric, On 09/28/12 12:26, Eric Dumazet wrote: > On Fri, 2012-09-28 at 10:22 +0100, Chris Clayton wrote: > >> No, the WinXP guest is configured with a fixed IP address >> (192.168.200.1). Subnet mask is 255.255.255.0, and default gateway is >> 192.168.200.254. DNS is 192.168.0.1. >> > > I have no problem with such a setup, with a linux guest. > > Could you send again a tcpdump, but including link-level header ? > (option -e) > > Ideally, you could send two traces, one taken on tap0, and another taken > on eth0. > Below are two more traces that I think may well be more useful than those I sent on Friday. They are taken with tcpdump directly (after some reading up on that application) rather than tcpdump translations of pcap files captured with netsniff-ng. Also, they are taken concurrently, so they show the traffic on tap0 and eth0 at the time of an unsuccessful attempt to ping the router from the WinXP KVM client. The command was: sudo tcpdump -nev -i eth0 -Z chris >eth0.trace tap0: 16:05:14.909057 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 128, id 286, offset 0, flags [none], proto UDP (17), length 50) 192.168.200.1.49391 > 192.168.0.1.domain: 33727+ A? wpad. 
(22) 16:05:21.909026 52:54:0c:3b:17:39 > Broadcast, ethertype IPv4 (0x0800), length 92: (tos 0x0, ttl 128, id 287, offset 0, flags [none], proto UDP (17), length 78) 192.168.200.1.netbios-ns > 192.168.200.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST 16:05:21.909123 62:4e:ff:6b:0d:ce > Broadcast, ethertype IPv4 (0x0800), length 264: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 250) 192.168.200.254.netbios-dgm > 192.168.200.255.netbios-dgm: NBT UDP PACKET(138) 16:05:21.909141 62:4e:ff:6b:0d:ce > Broadcast, ethertype IPv4 (0x0800), length 249: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 235) 192.168.200.254.netbios-dgm > 192.168.200.255.netbios-dgm: NBT UDP PACKET(138) 16:05:22.261009 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 128, id 288, offset 0, flags [none], proto ICMP (1), length 60) 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 3840, length 40 16:05:22.704716 52:54:0c:3b:17:39 > Broadcast, ethertype IPv4 (0x0800), length 92: (tos 0x0, ttl 128, id 289, offset 0, flags [none], proto UDP (17), length 78) 192.168.200.1.netbios-ns > 192.168.200.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST 16:05:23.457224 52:54:0c:3b:17:39 > Broadcast, ethertype IPv4 (0x0800), length 92: (tos 0x0, ttl 128, id 290, offset 0, flags [none], proto UDP (17), length 78) 192.168.200.1.netbios-ns > 192.168.200.255.netbios-ns: NBT UDP PACKET(137): QUERY; REQUEST; BROADCAST 16:05:24.208015 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 128, id 291, offset 0, flags [none], proto UDP (17), length 68) 192.168.200.1.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:25.204731 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 128, id 292, offset 0, flags [none], proto UDP (17), length 68) 192.168.200.1.56551 > 192.168.0.1.domain: 63293+ A? 
download.microsoft.com. (40) 16:05:26.204743 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 128, id 293, offset 0, flags [none], proto UDP (17), length 68) 192.168.200.1.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:27.580723 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 128, id 294, offset 0, flags [none], proto ICMP (1), length 60) 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4096, length 40 16:05:28.204764 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 128, id 295, offset 0, flags [none], proto UDP (17), length 68) 192.168.200.1.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:32.204731 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 128, id 296, offset 0, flags [none], proto UDP (17), length 68) 192.168.200.1.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:33.080759 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 128, id 297, offset 0, flags [none], proto ICMP (1), length 60) 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4352, length 40 16:05:38.582182 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 128, id 298, offset 0, flags [none], proto ICMP (1), length 60) 192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4608, length 40 16:05:39.218737 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 128, id 299, offset 0, flags [none], proto UDP (17), length 50) 192.168.200.1.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:40.204735 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 128, id 300, offset 0, flags [none], proto UDP (17), length 50) 192.168.200.1.60955 > 192.168.0.1.domain: 26953+ A? wpad. 
(22) 16:05:41.204721 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 128, id 301, offset 0, flags [none], proto UDP (17), length 50) 192.168.200.1.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:43.238517 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 128, id 302, offset 0, flags [none], proto UDP (17), length 50) 192.168.200.1.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:47.236721 52:54:0c:3b:17:39 > 62:4e:ff:6b:0d:ce, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 128, id 303, offset 0, flags [none], proto UDP (17), length 50) 192.168.200.1.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) eth0: 16:05:22.261037 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 127, id 288, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 3840, length 40 16:05:22.264612 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 255, id 53593, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 3840, length 40 16:05:24.208041 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 127, id 291, offset 0, flags [none], proto UDP (17), length 68) 192.168.0.40.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:24.270825 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 426: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 412) 192.168.0.1.domain > 192.168.0.40.56551: 63293 7/8/0 download.microsoft.com. CNAME download.microsoft.com.nsatc.net., download.microsoft.com.nsatc.net. CNAME main.dl.ms.akadns.net., main.dl.ms.akadns.net. CNAME intl.dl.ms.akadns.net., intl.dl.ms.akadns.net. CNAME dl.ms.georedirector.akadns.net., dl.ms.georedirector.akadns.net. CNAME a767.ms.akamai.net., a767.ms.akamai.net. 
A 90.223.216.161, a767.ms.akamai.net. A 90.223.216.153 (384) 16:05:25.204745 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 127, id 292, offset 0, flags [none], proto UDP (17), length 68) 192.168.0.40.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:25.266414 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 442: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 428) 192.168.0.1.domain > 192.168.0.40.56551: 63293 7/8/1 download.microsoft.com. CNAME download.microsoft.com.nsatc.net., download.microsoft.com.nsatc.net. CNAME main.dl.ms.akadns.net., main.dl.ms.akadns.net. CNAME intl.dl.ms.akadns.net., intl.dl.ms.akadns.net. CNAME dl.ms.georedirector.akadns.net., dl.ms.georedirector.akadns.net. CNAME a767.ms.akamai.net., a767.ms.akamai.net. A 90.223.216.153, a767.ms.akamai.net. A 90.223.216.161 (400) 16:05:26.204761 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 127, id 293, offset 0, flags [none], proto UDP (17), length 68) 192.168.0.40.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:26.237788 00:1f:33:80:09:44 > 01:00:5e:00:00:01, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 28) 192.168.0.1 > 224.0.0.1: igmp query v2 16:05:26.266706 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 458: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 444) 192.168.0.1.domain > 192.168.0.40.56551: 63293 7/8/2 download.microsoft.com. CNAME download.microsoft.com.nsatc.net., download.microsoft.com.nsatc.net. CNAME main.dl.ms.akadns.net., main.dl.ms.akadns.net. CNAME intl.dl.ms.akadns.net., intl.dl.ms.akadns.net. CNAME dl.ms.georedirector.akadns.net., dl.ms.georedirector.akadns.net. CNAME a767.ms.akamai.net., a767.ms.akamai.net. A 90.223.216.161, a767.ms.akamai.net. 
A 90.223.216.153 (416) 16:05:27.580742 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 127, id 294, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4096, length 40 16:05:27.585193 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 255, id 53594, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4096, length 40 16:05:28.204783 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 127, id 295, offset 0, flags [none], proto UDP (17), length 68) 192.168.0.40.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. (40) 16:05:28.267047 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 442: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 428) 192.168.0.1.domain > 192.168.0.40.56551: 63293 7/8/1 download.microsoft.com. CNAME download.microsoft.com.nsatc.net., download.microsoft.com.nsatc.net. CNAME main.dl.ms.akadns.net., main.dl.ms.akadns.net. CNAME intl.dl.ms.akadns.net., intl.dl.ms.akadns.net. CNAME dl.ms.georedirector.akadns.net., dl.ms.georedirector.akadns.net. CNAME a767.ms.akamai.net., a767.ms.akamai.net. A 90.223.216.161, a767.ms.akamai.net. A 90.223.216.153 (400) 16:05:29.267032 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.40 tell 192.168.0.1, length 46 16:05:29.267049 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.40 is-at 5c:9a:d8:5c:63:31, length 28 16:05:32.204753 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 82: (tos 0x0, ttl 127, id 296, offset 0, flags [none], proto UDP (17), length 68) 192.168.0.40.56551 > 192.168.0.1.domain: 63293+ A? download.microsoft.com. 
(40) 16:05:32.267308 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 458: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 444) 192.168.0.1.domain > 192.168.0.40.56551: 63293 7/8/2 download.microsoft.com. CNAME download.microsoft.com.nsatc.net., download.microsoft.com.nsatc.net. CNAME main.dl.ms.akadns.net., main.dl.ms.akadns.net. CNAME intl.dl.ms.akadns.net., intl.dl.ms.akadns.net. CNAME dl.ms.georedirector.akadns.net., dl.ms.georedirector.akadns.net. CNAME a767.ms.akamai.net., a767.ms.akamai.net. A 90.223.216.161, a767.ms.akamai.net. A 90.223.216.153 (416) 16:05:33.080772 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 127, id 297, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4352, length 40 16:05:33.084435 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 255, id 53595, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4352, length 40 16:05:35.277471 00:1f:33:80:09:44 > 01:00:5e:00:00:02, ethertype IPv4 (0x0800), length 60: (tos 0x0, ttl 1, id 0, offset 0, flags [DF], proto IGMP (2), length 32, options (RA)) 192.168.0.1 > 224.0.0.2: igmp v2 report 224.0.0.2 16:05:38.582202 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 127, id 298, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4608, length 40 16:05:38.587143 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 255, id 53596, offset 0, flags [none], proto ICMP (1), length 60) 192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4608, length 40 16:05:39.218763 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 127, id 299, offset 0, flags [none], proto UDP (17), length 50) 192.168.0.40.60955 
> 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:39.280065 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 139: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 125) 192.168.0.1.domain > 192.168.0.40.60955: 26953 NXDomain 0/1/0 (97) 16:05:40.204754 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 127, id 300, offset 0, flags [none], proto UDP (17), length 50) 192.168.0.40.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:40.266317 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 139: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 125) 192.168.0.1.domain > 192.168.0.40.60955: 26953 NXDomain 0/1/0 (97) 16:05:41.204738 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 127, id 301, offset 0, flags [none], proto UDP (17), length 50) 192.168.0.40.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:41.266343 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 139: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 125) 192.168.0.1.domain > 192.168.0.40.60955: 26953 NXDomain 0/1/0 (97) 16:05:43.238538 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 127, id 302, offset 0, flags [none], proto UDP (17), length 50) 192.168.0.40.60955 > 192.168.0.1.domain: 26953+ A? wpad. 
(22) 16:05:43.301692 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 139: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 125) 192.168.0.1.domain > 192.168.0.40.60955: 26953 NXDomain 0/1/0 (97) 16:05:44.230290 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype ARP (0x0806), length 42: Ethernet (len 6), IPv4 (len 4), Request who-has 192.168.0.1 tell 192.168.0.40, length 28 16:05:44.233532 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype ARP (0x0806), length 60: Ethernet (len 6), IPv4 (len 4), Reply 192.168.0.1 is-at 00:1f:33:80:09:44, length 46 16:05:47.236740 5c:9a:d8:5c:63:31 > 00:1f:33:80:09:44, ethertype IPv4 (0x0800), length 64: (tos 0x0, ttl 127, id 303, offset 0, flags [none], proto UDP (17), length 50) 192.168.0.40.60955 > 192.168.0.1.domain: 26953+ A? wpad. (22) 16:05:47.296388 00:1f:33:80:09:44 > 5c:9a:d8:5c:63:31, ethertype IPv4 (0x0800), length 139: (tos 0x0, ttl 64, id 0, offset 0, flags [DF], proto UDP (17), length 125) 192.168.0.1.domain > 192.168.0.40.60955: 26953 NXDomain 0/1/0 (97) 16:05:48.530940 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 446: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 432) 192.168.0.112.3829 > 239.255.255.250.1900: UDP, length 404 16:05:48.531962 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 455: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 441) 192.168.0.112.3829 > 239.255.255.250.1900: UDP, length 413 16:05:48.533472 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 494: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 480) 192.168.0.112.3829 > 239.255.255.250.1900: UDP, length 452 16:05:48.534564 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 490: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 476) 192.168.0.112.2600 > 239.255.255.250.1900: UDP, length 448 16:05:48.536749 00:19:fb:be:cb:55 > 
01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 472) 192.168.0.112.2411 > 239.255.255.250.1900: UDP, length 444 16:05:48.537798 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 472) 192.168.0.112.1205 > 239.255.255.250.1900: UDP, length 444 16:05:48.633492 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 446: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 432) 192.168.0.112.1378 > 239.255.255.250.1900: UDP, length 404 16:05:48.634558 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 455: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 441) 192.168.0.112.1378 > 239.255.255.250.1900: UDP, length 413 16:05:48.636069 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 494: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 480) 192.168.0.112.1378 > 239.255.255.250.1900: UDP, length 452 16:05:48.637119 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 490: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 476) 192.168.0.112.3487 > 239.255.255.250.1900: UDP, length 448 16:05:48.638631 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 472) 192.168.0.112.4415 > 239.255.255.250.1900: UDP, length 444 16:05:48.639702 00:19:fb:be:cb:55 > 01:00:5e:7f:ff:fa, ethertype IPv4 (0x0800), length 486: (tos 0x0, ttl 4, id 0, offset 0, flags [DF], proto UDP (17), length 472) 192.168.0.112.2700 > 239.255.255.250.1900: UDP, length 444 > > > > ^ permalink raw reply [flat|nested] 59+ messages in thread
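One way to read the concurrent tap0 and eth0 traces above is to pair the ICMP echo packets by sequence number and see where each one appears. The following is a minimal Python sketch, not part of the original thread: the sample lines are copied from the traces but trimmed to their IP-level part, and the helper names `icmp_events` and `summarize` are made up for illustration.

```python
import re

# Matches the IP-level part of a tcpdump ICMP echo line, e.g.
# "192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 3840, length 40"
ICMP_RE = re.compile(
    r"(?P<src>\d+(?:\.\d+){3}) > (?P<dst>\d+(?:\.\d+){3}): "
    r"ICMP echo (?P<kind>request|reply), id \d+, seq (?P<seq>\d+)"
)

# Sample lines taken from the traces above (link-level and IP header fields trimmed).
TAP0 = [
    "192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 3840, length 40",
    "192.168.200.1 > 192.168.0.1: ICMP echo request, id 512, seq 4096, length 40",
]
ETH0 = [
    "192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 3840, length 40",
    "192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 3840, length 40",
    "192.168.0.40 > 192.168.0.1: ICMP echo request, id 512, seq 4096, length 40",
    "192.168.0.1 > 192.168.0.40: ICMP echo reply, id 512, seq 4096, length 40",
]


def icmp_events(lines):
    """Return {seq: {"request" or "reply": (src, dst)}} for ICMP echo lines."""
    events = {}
    for line in lines:
        m = ICMP_RE.search(line)
        if m:
            events.setdefault(int(m["seq"]), {})[m["kind"]] = (m["src"], m["dst"])
    return events


def summarize(tap_lines, eth_lines):
    """For each echo sequence seen on tap0, report what happened to it on eth0."""
    tap, eth = icmp_events(tap_lines), icmp_events(eth_lines)
    rows = []
    for seq in sorted(tap):
        on_eth = eth.get(seq, {})
        rows.append({
            "seq": seq,
            "tap0_src": tap[seq].get("request", (None, None))[0],
            "eth0_src": on_eth.get("request", (None, None))[0],
            "reply_on_eth0": "reply" in on_eth,
            "reply_on_tap0": "reply" in tap[seq],
        })
    return rows


for row in summarize(TAP0, ETH0):
    print(row)
```

For these sample lines, each request leaves tap0 from 192.168.200.1 and appears on eth0 with source 192.168.0.40, the router's reply is seen on eth0, and no reply is seen on tap0.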
* Re: Possible networking regression in 3.6.0 2012-09-30 15:26 ` Chris Clayton @ 2012-09-30 19:45 ` Eric Dumazet 2012-10-01 8:36 ` Chris Clayton 0 siblings, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-09-30 19:45 UTC (permalink / raw) To: Chris Clayton; +Cc: David Miller, netdev, gpiez On Sun, 2012-09-30 at 16:26 +0100, Chris Clayton wrote: > Hi Eric, > > On 09/28/12 12:26, Eric Dumazet wrote: > > On Fri, 2012-09-28 at 10:22 +0100, Chris Clayton wrote: > > > >> No, the WinXP guest is configured with a fixed IP address > >> (192.168.200.1). Subnet mask is 255.255.255.0, and default gateway is > >> 192.168.200.254. DNS is 192.168.0.1. > >> > > > > I have no problem with such a setup, with a linux guest. > > > > Could you send again a tcpdump, but including link-level header ? > > (option -e) > > > > Ideally, you could send two traces, one taken on tap0, and another taken > > on eth0. > > > Below are two more traces that I think may well be more useful than > those I sent on Friday. They are taken with tcpdump directly (after some > reading up on that application) rather than tcpdump translations of pcap > files captured with netsniff-ng. Also, they are taken concurrently, so > they show the traffic on tap0 and eth0 at the time of an unsuccessful > attempt to ping the router from the WinXP KVM client. The command was: > sudo tcpdump -nev -i eth0 -Z chris >eth0.trace Could you send "netstat -s" before/after your tests ? ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-09-30 19:45 ` Eric Dumazet @ 2012-10-01 8:36 ` Chris Clayton 2012-10-01 9:15 ` Eric Dumazet 0 siblings, 1 reply; 59+ messages in thread From: Chris Clayton @ 2012-10-01 8:36 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, gpiez On 09/30/12 20:45, Eric Dumazet wrote: > On Sun, 2012-09-30 at 16:26 +0100, Chris Clayton wrote: >> Hi Eric, >> >> On 09/28/12 12:26, Eric Dumazet wrote: >>> On Fri, 2012-09-28 at 10:22 +0100, Chris Clayton wrote: >>> >>>> No, the WinXP guest is configured with a fixed IP address >>>> (192.168.200.1). Subnet mask is 255.255.255.0, and default gateway is >>>> 192.168.200.254. DNS is 192.168.0.1. >>>> >>> >>> I have no problem with such a setup, with a linux guest. >>> >>> Could you send again a tcpdump, but including link-level header ? >>> (option -e) >>> >>> Ideally, you could send two traces, one taken on tap0, and another taken >>> on eth0. >>> >> Below are two more traces that I think may well be more useful than >> those I sent on Friday. They are taken with tcpdump directly (after some >> reading up on that application) rather than tcpdump translations of pcap >> files captured with netsniff-ng. Also, they are taken concurrently, so >> they show the traffic on tap0 and eth0 at the time of an unsuccessful >> attempt to ping the router from the WinXP KVM client. The command was: >> sudo tcpdump -nev -i eth0 -Z chris >eth0.trace > > > Could you send "netstat -s" before/after your tests ? > Before: $ netstat -s Ip: 485 total packets received 10 forwarded 0 incoming packets discarded 473 incoming packets delivered 383 requests sent out Icmp: 0 ICMP messages received 0 input ICMP message failed. 
ICMP input histogram: 0 ICMP messages sent 0 ICMP messages failed ICMP output histogram: Tcp: 12 active connections openings 0 passive connection openings 6 failed connection attempts 0 connection resets received 5 connections established 374 segments received 306 segments send out 0 segments retransmited 0 bad segments received. 6 resets sent Udp: 164 packets received 0 packets to unknown port received. 0 packet receive errors 67 packets sent RcvbufErrors: 0 SndbufErrors: 0 UdpLite: InDatagrams: 0 NoPorts: 0 InErrors: 0 OutDatagrams: 0 RcvbufErrors: 0 SndbufErrors: 0 error parsing /proc/net/snmp: Success After: $ netstat -s Ip: 519 total packets received 21 forwarded 0 incoming packets discarded 496 incoming packets delivered 406 requests sent out Icmp: 4 ICMP messages received 4 input ICMP message failed. ICMP input histogram: echo replies: 4 0 ICMP messages sent 0 ICMP messages failed ICMP output histogram: IcmpMsg: InType0: 4 Tcp: 13 active connections openings 0 passive connection openings 6 failed connection attempts 0 connection resets received 5 connections established 381 segments received 316 segments send out 0 segments retransmited 0 bad segments received. 6 resets sent Udp: 173 packets received 0 packets to unknown port received. 0 packet receive errors 69 packets sent RcvbufErrors: 0 SndbufErrors: 0 UdpLite: InDatagrams: 0 NoPorts: 0 InErrors: 0 OutDatagrams: 0 RcvbufErrors: 0 SndbufErrors: 0 error parsing /proc/net/snmp: Success > > ^ permalink raw reply [flat|nested] 59+ messages in thread
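The before/after `netstat -s` output above is easier to compare as a set of counter deltas. Below is a minimal Python sketch of such a comparison, not something posted in the thread. It assumes the usual indented `netstat -s` layout (the archive has flattened it, so the indentation in the samples is reconstructed), uses only an excerpt of the counters, and the names `parse_netstat` and `diff_counters` are hypothetical.

```python
import re

# "    485 total packets received" style counters (number first).
LEADING = re.compile(r"^\s+(\d+)\s+(.+)$")
# "    echo replies: 4" or "    InType0: 4" style counters (number last).
TRAILING = re.compile(r"^\s+(.+?):\s*(\d+)$")

# Excerpts of the Ip and Icmp sections from the output above;
# the indentation is an assumption about the original layout.
BEFORE = """\
Ip:
    485 total packets received
    10 forwarded
    473 incoming packets delivered
Icmp:
    0 ICMP messages received
    0 input ICMP message failed.
"""

AFTER = """\
Ip:
    519 total packets received
    21 forwarded
    496 incoming packets delivered
Icmp:
    4 ICMP messages received
    4 input ICMP message failed.
    ICMP input histogram:
        echo replies: 4
"""


def parse_netstat(text):
    """Return {(section, counter_name): value} from `netstat -s` style text."""
    counters = {}
    section = None
    for line in text.splitlines():
        if not line.strip():
            continue
        # Unindented lines ending in ":" start a new section (Ip:, Icmp:, ...).
        if not line[0].isspace() and line.rstrip().endswith(":"):
            section = line.rstrip()[:-1]
            continue
        m = LEADING.match(line)
        if m:
            counters[(section, m.group(2).strip())] = int(m.group(1))
            continue
        m = TRAILING.match(line)
        if m:
            counters[(section, m.group(1).strip())] = int(m.group(2))
    return counters


def diff_counters(before, after):
    """Return only the counters whose value changed, as after - before."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] != before.get(key, 0)
    }


delta = diff_counters(parse_netstat(BEFORE), parse_netstat(AFTER))
for (section, name), change in sorted(delta.items()):
    print(f"{section:5} {name:32} +{change}")
```

On this excerpt the forwarded count rises by 11 and the host's own ICMP input counters rise by 4, all of them echo replies.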
* Re: Possible networking regression in 3.6.0 2012-10-01 8:36 ` Chris Clayton @ 2012-10-01 9:15 ` Eric Dumazet 2012-10-01 15:13 ` Chris Clayton 2012-10-01 19:34 ` Dave Jones 0 siblings, 2 replies; 59+ messages in thread From: Eric Dumazet @ 2012-10-01 9:15 UTC (permalink / raw) To: Chris Clayton; +Cc: David Miller, netdev, gpiez On Mon, 2012-10-01 at 09:36 +0100, Chris Clayton wrote: > > 0 ICMP messages received > 0 input ICMP message failed. > ICMP input histogram: > 0 ICMP messages sent > 0 ICMP messages failed > ICMP output histogram: > > After: > > $ netstat -s > Icmp: > 4 ICMP messages received > 4 input ICMP message failed. > ICMP input histogram: > echo replies: 4 So icmp replies come back and are delivered to host instead of being forwarded. I wonder if MASQUERADE broke... Could you send iptables -t -nat -nvL conntrack -L # while ping is running from guest ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 9:15 ` Eric Dumazet @ 2012-10-01 15:13 ` Chris Clayton 2012-10-01 15:31 ` Eric Dumazet 2012-10-01 19:34 ` Dave Jones 1 sibling, 1 reply; 59+ messages in thread From: Chris Clayton @ 2012-10-01 15:13 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, gpiez On 10/01/12 10:15, Eric Dumazet wrote: > On Mon, 2012-10-01 at 09:36 +0100, Chris Clayton wrote: >> > >> 0 ICMP messages received >> 0 input ICMP message failed. >> ICMP input histogram: >> 0 ICMP messages sent >> 0 ICMP messages failed >> ICMP output histogram: > >> >> After: >> >> $ netstat -s >> Icmp: >> 4 ICMP messages received >> 4 input ICMP message failed. >> ICMP input histogram: >> echo replies: 4 > > So icmp replies come back and are delivered to host instead of being > forwarded. > > I wonder if MASQUERADE broke... > > Could you send > > iptables -t -nat -nvL $ iptables -t -nat -nvL iptables v1.4.15: can't initialize iptables table `-nat': Table does not exist (do you need to insmod?) Perhaps iptables or your kernel needs to be upgraded. > conntrack -L # while ping is running from guest $ conntrack -L conntrack v1.2.2 (conntrack-tools): Operation failed: invalid parameters Forgive me for asking, but why is the problem not down to the change that I identified by bisecting? The title of the patch is "ipv4: Cache local output routes" and, although I'm a million miles from being an expert here, to me it does make it look a good candidate. http://marc.info/?l=linux-netdev&m=134797809611847&w=2 > > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 15:13 ` Chris Clayton @ 2012-10-01 15:31 ` Eric Dumazet 2012-10-01 16:19 ` Chris Clayton 2012-10-01 18:34 ` Captain Obvious 0 siblings, 2 replies; 59+ messages in thread From: Eric Dumazet @ 2012-10-01 15:31 UTC (permalink / raw) To: Chris Clayton; +Cc: David Miller, netdev, gpiez On Mon, 2012-10-01 at 16:13 +0100, Chris Clayton wrote: > > On 10/01/12 10:15, Eric Dumazet wrote: > > On Mon, 2012-10-01 at 09:36 +0100, Chris Clayton wrote: > >> > > > >> 0 ICMP messages received > >> 0 input ICMP message failed. > >> ICMP input histogram: > >> 0 ICMP messages sent > >> 0 ICMP messages failed > >> ICMP output histogram: > > > >> > >> After: > >> > >> $ netstat -s > >> Icmp: > >> 4 ICMP messages received > >> 4 input ICMP message failed. > >> ICMP input histogram: > >> echo replies: 4 > > > > So icmp replies come back and are delivered to host instead of being > > forwarded. > > > > I wonder if MASQUERADE broke... > > > > Could you send > > > > iptables -t -nat -nvL > > $ iptables -t -nat -nvL > iptables v1.4.15: can't initialize iptables table `-nat': Table does not > exist (do you need to insmod?) > Perhaps iptables or your kernel needs to be upgraded. > > > conntrack -L # while ping is running from guest > > $ conntrack -L > conntrack v1.2.2 (conntrack-tools): Operation failed: invalid parameters > That's not expected; you described using the MASQUERADE target, so "iptables -t nat -nvL" should display something. > Forgive me for asking, but why is the problem not down to the change > that I identified by bisecting? The title of the patch is "ipv4: Cache > local output routes" and, although I'm a million miles from being an > expert here, to me it does make it look a good candidate. > http://marc.info/?l=linux-netdev&m=134797809611847&w=2 Because I can't reproduce your problem at all, using your setup. ^ permalink raw reply [flat|nested] 59+ messages in thread
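The iptables error quoted above names the table as `-nat', which is consistent with the word after -t being taken literally as the table name. A small Python sketch of that reading follows; it only tokenizes the two command lines and does not run iptables. The helper `table_argument` is hypothetical, and the set of table names is the standard built-in list rather than anything stated in the thread.

```python
import shlex

# Built-in iptables tables; "-nat" is not one of them.
KNOWN_TABLES = {"filter", "nat", "mangle", "raw", "security"}


def table_argument(command):
    """Return the word that follows -t/--table in a command line, or None."""
    words = shlex.split(command)
    for i, word in enumerate(words[:-1]):
        if word in ("-t", "--table"):
            return words[i + 1]
    return None


for cmd in ("iptables -t -nat -nvL", "iptables -t nat -nvL"):
    table = table_argument(cmd)
    print(f"{cmd!r}: table={table!r}, known={table in KNOWN_TABLES}")
```

Under this reading, only the second form asks for the nat table, so the "Table does not exist" message by itself says nothing about whether the nat table is available.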
* Re: Possible networking regression in 3.6.0 2012-10-01 15:31 ` Eric Dumazet @ 2012-10-01 16:19 ` Chris Clayton 2012-10-01 16:37 ` Eric Dumazet 2012-10-01 18:34 ` Captain Obvious 1 sibling, 1 reply; 59+ messages in thread From: Chris Clayton @ 2012-10-01 16:19 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, gpiez On 10/01/12 16:31, Eric Dumazet wrote: > On Mon, 2012-10-01 at 16:13 +0100, Chris Clayton wrote: >> >> On 10/01/12 10:15, Eric Dumazet wrote: >>> On Mon, 2012-10-01 at 09:36 +0100, Chris Clayton wrote: >>>> >>> >>>> 0 ICMP messages received >>>> 0 input ICMP message failed. >>>> ICMP input histogram: >>>> 0 ICMP messages sent >>>> 0 ICMP messages failed >>>> ICMP output histogram: >>> >>>> >>>> After: >>>> >>>> $ netstat -s >>>> Icmp: >>>> 4 ICMP messages received >>>> 4 input ICMP message failed. >>>> ICMP input histogram: >>>> echo replies: 4 >>> >>> So icmp replies come back and are delivered to host instead of being >>> forwarded. >>> >>> I wonder if MASQUERADE broke... >>> >>> Could you send >>> >>> iptables -t -nat -nvL >> >> $ iptables -t -nat -nvL >> iptables v1.4.15: can't initialize iptables table `-nat': Table does not >> exist (do you need to insmod?) >> Perhaps iptables or your kernel needs to be upgraded. >> >>> conntrack -L # while ping is running from guest >> >> $ conntrack -L >> conntrack v1.2.2 (conntrack-tools): Operation failed: invalid parameters >> > > Thats not expected, you described you used MASQUERADE target, so > "iptables -t nat -nvL" should display something. > To check this I've booted a 3.5.4 kernel. I get the same response to the two commands. I also double checked that, with a 3.5.4 kernel, pinging the router and browsing the internet from the client work and they do. 
Except for the packets and bytes columns, the command iptables -nvL gives the following output under both 3.5.4 and 3.6.0 kernels: Chain INPUT (policy ACCEPT 0 packets, 0 bytes) pkts bytes target prot opt in out source destination 3757 3240K ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 state RELATED,ESTABLISHED 14 840 ACCEPT all -- * * 127.0.0.1 127.0.0.1 41 4362 ACCEPT all -- * * 192.168.0.0/24 0.0.0.0/0 90 12780 ACCEPT all -- * * 192.168.200.0/24 0.0.0.0/0 0 0 ACCEPT all -- * * 192.168.201.0/24 0.0.0.0/0 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 Chain FORWARD (policy ACCEPT 4470 packets, 3065K bytes) pkts bytes target prot opt in out source destination Chain OUTPUT (policy ACCEPT 3243 packets, 349K bytes) pkts bytes target prot opt in out source destination 64 8344 ACCEPT all -- * * 0.0.0.0/0 192.168.200.0/24 0 0 ACCEPT all -- * * 0.0.0.0/0 192.168.201.0/24 > >> Forgive me for asking, but why is the problem not down to the change >> that I identified by bisecting? The title of the patch is "ipv4: Cache >> local output routes" and, although I'm a million miles from being an >> expert here, to me it does make it look a good candidate. >> http://marc.info/?l=linux-netdev&m=134797809611847&w=2 > > Because I cant reproduce your problem at all, using your setup. > OK, thanks. > > > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 16:19 ` Chris Clayton @ 2012-10-01 16:37 ` Eric Dumazet 2012-10-01 18:28 ` Chris Clayton 0 siblings, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-10-01 16:37 UTC (permalink / raw) To: Chris Clayton; +Cc: David Miller, netdev, gpiez On Mon, 2012-10-01 at 17:19 +0100, Chris Clayton wrote: > > On 10/01/12 16:31, Eric Dumazet wrote: > > On Mon, 2012-10-01 at 16:13 +0100, Chris Clayton wrote: > >> > >> On 10/01/12 10:15, Eric Dumazet wrote: > >>> On Mon, 2012-10-01 at 09:36 +0100, Chris Clayton wrote: > >>>> > >>> > >>>> 0 ICMP messages received > >>>> 0 input ICMP message failed. > >>>> ICMP input histogram: > >>>> 0 ICMP messages sent > >>>> 0 ICMP messages failed > >>>> ICMP output histogram: > >>> > >>>> > >>>> After: > >>>> > >>>> $ netstat -s > >>>> Icmp: > >>>> 4 ICMP messages received > >>>> 4 input ICMP message failed. > >>>> ICMP input histogram: > >>>> echo replies: 4 > >>> > >>> So icmp replies come back and are delivered to host instead of being > >>> forwarded. > >>> > >>> I wonder if MASQUERADE broke... > >>> > >>> Could you send > >>> > >>> iptables -t -nat -nvL > >> > >> $ iptables -t -nat -nvL > >> iptables v1.4.15: can't initialize iptables table `-nat': Table does not > >> exist (do you need to insmod?) > >> Perhaps iptables or your kernel needs to be upgraded. > >> > >>> conntrack -L # while ping is running from guest > >> > >> $ conntrack -L > >> conntrack v1.2.2 (conntrack-tools): Operation failed: invalid parameters > >> > > > > Thats not expected, you described you used MASQUERADE target, so > > "iptables -t nat -nvL" should display something. > > > > To check this I've booted a 3.5.4 kernel. I get the same response to the > two commands. I also double checked that, with a 3.5.4 kernel, pinging > the router and browsing the internet from the client work and they do. 
> > Except for the packets and bytes columns, the command iptables -nvL > gives the following output under both 3.5.4 and 3.6.0 kernels: > > Chain INPUT (policy ACCEPT 0 packets, 0 bytes) > pkts bytes target prot opt in out source destination > 3757 3240K ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 > state RELATED,ESTABLISHED > 14 840 ACCEPT all -- * * 127.0.0.1 127.0.0.1 > 41 4362 ACCEPT all -- * * 192.168.0.0/24 0.0.0.0/0 > 90 12780 ACCEPT all -- * * 192.168.200.0/24 0.0.0.0/0 > 0 0 ACCEPT all -- * * 192.168.201.0/24 0.0.0.0/0 > 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 > > Chain FORWARD (policy ACCEPT 4470 packets, 3065K bytes) > pkts bytes target prot opt in out source destination > > Chain OUTPUT (policy ACCEPT 3243 packets, 349K bytes) > pkts bytes target prot opt in out source destination > 64 8344 ACCEPT all -- * * 0.0.0.0/0 192.168.200.0/24 > 0 0 ACCEPT all -- * * 0.0.0.0/0 192.168.201.0/24 I am lost, since n your first mail you said : ----------------------------------------------------------------------------- # Load the connection-sharing for qemu/kvm guests echo 1 > /proc/sys/net/ipv4/ip_forward iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE ... # allow traffic to and from the qemu/kvm virtual networks NETS="200 201" for net in $NETS; do iptables -A INPUT -s 192.168.$net.0/24 -j ACCEPT iptables -A OUTPUT -d 192.168.$net.0/24 -j ACCEPT done ... The network-related modules that are loaded are: $ lsmod Module Size Used by tun 12412 0 xt_state 891 1 iptable_filter 852 1 ipt_MASQUERADE 1222 1 iptable_nat 3087 1 nf_nat 10901 2 ipt_MASQUERADE,iptable_nat nf_conntrack_ipv4 4942 4 nf_nat,iptable_nat nf_defrag_ipv4 815 1 nf_conntrack_ipv4 nf_conntrack 37644 5 ipt_MASQUERADE,nf_nat,xt_state,iptable_nat,nf_conntrack_ipv4 ... r8169 47159 0 ----------------------------------------------- Now you say you dont have nat ? Something is wrong. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 16:37 ` Eric Dumazet @ 2012-10-01 18:28 ` Chris Clayton 0 siblings, 0 replies; 59+ messages in thread From: Chris Clayton @ 2012-10-01 18:28 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, gpiez On 10/01/12 17:37, Eric Dumazet wrote: > On Mon, 2012-10-01 at 17:19 +0100, Chris Clayton wrote: >> >> On 10/01/12 16:31, Eric Dumazet wrote: >>> On Mon, 2012-10-01 at 16:13 +0100, Chris Clayton wrote: >>>> >>>> On 10/01/12 10:15, Eric Dumazet wrote: >>>>> On Mon, 2012-10-01 at 09:36 +0100, Chris Clayton wrote: >>>>>> >>>>> >>>>>> 0 ICMP messages received >>>>>> 0 input ICMP message failed. >>>>>> ICMP input histogram: >>>>>> 0 ICMP messages sent >>>>>> 0 ICMP messages failed >>>>>> ICMP output histogram: >>>>> >>>>>> >>>>>> After: >>>>>> >>>>>> $ netstat -s >>>>>> Icmp: >>>>>> 4 ICMP messages received >>>>>> 4 input ICMP message failed. >>>>>> ICMP input histogram: >>>>>> echo replies: 4 >>>>> >>>>> So icmp replies come back and are delivered to host instead of being >>>>> forwarded. >>>>> >>>>> I wonder if MASQUERADE broke... >>>>> >>>>> Could you send >>>>> >>>>> iptables -t -nat -nvL >>>> >>>> $ iptables -t -nat -nvL >>>> iptables v1.4.15: can't initialize iptables table `-nat': Table does not >>>> exist (do you need to insmod?) >>>> Perhaps iptables or your kernel needs to be upgraded. >>>> >>>>> conntrack -L # while ping is running from guest >>>> >>>> $ conntrack -L >>>> conntrack v1.2.2 (conntrack-tools): Operation failed: invalid parameters >>>> >>> >>> Thats not expected, you described you used MASQUERADE target, so >>> "iptables -t nat -nvL" should display something. >>> >> >> To check this I've booted a 3.5.4 kernel. I get the same response to the >> two commands. I also double checked that, with a 3.5.4 kernel, pinging >> the router and browsing the internet from the client work and they do. 
>> >> Except for the packets and bytes columns, the command iptables -nvL >> gives the following output under both 3.5.4 and 3.6.0 kernels: >> >> Chain INPUT (policy ACCEPT 0 packets, 0 bytes) >> pkts bytes target prot opt in out source destination >> 3757 3240K ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 >> state RELATED,ESTABLISHED >> 14 840 ACCEPT all -- * * 127.0.0.1 127.0.0.1 >> 41 4362 ACCEPT all -- * * 192.168.0.0/24 0.0.0.0/0 >> 90 12780 ACCEPT all -- * * 192.168.200.0/24 0.0.0.0/0 >> 0 0 ACCEPT all -- * * 192.168.201.0/24 0.0.0.0/0 >> 0 0 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 >> >> Chain FORWARD (policy ACCEPT 4470 packets, 3065K bytes) >> pkts bytes target prot opt in out source destination >> >> Chain OUTPUT (policy ACCEPT 3243 packets, 349K bytes) >> pkts bytes target prot opt in out source destination >> 64 8344 ACCEPT all -- * * 0.0.0.0/0 192.168.200.0/24 >> 0 0 ACCEPT all -- * * 0.0.0.0/0 192.168.201.0/24 > > I am lost, since n your first mail you said : > ----------------------------------------------------------------------------- > # Load the connection-sharing for qemu/kvm guests > echo 1 > /proc/sys/net/ipv4/ip_forward > iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE > ... > # allow traffic to and from the qemu/kvm virtual networks > NETS="200 201" > for net in $NETS; do > iptables -A INPUT -s 192.168.$net.0/24 -j ACCEPT > iptables -A OUTPUT -d 192.168.$net.0/24 -j ACCEPT > done > ... > > The network-related modules that are loaded are: > > $ lsmod > Module Size Used by > tun 12412 0 > xt_state 891 1 > iptable_filter 852 1 > ipt_MASQUERADE 1222 1 > iptable_nat 3087 1 > nf_nat 10901 2 ipt_MASQUERADE,iptable_nat > nf_conntrack_ipv4 4942 4 nf_nat,iptable_nat > nf_defrag_ipv4 815 1 nf_conntrack_ipv4 > nf_conntrack 37644 5 > ipt_MASQUERADE,nf_nat,xt_state,iptable_nat,nf_conntrack_ipv4 > ... > r8169 47159 0 > > > ----------------------------------------------- > > Now you say you dont have nat ? > > Something is wrong. 
> Here's the complete script that starts up my firewall. I can't recall having changed this at all for two or three years, other than when a replacement router changed the network from 192.168.1.x or I add (or remove) other networks to (from) the $NETS list for other KVM clients $ cat /etc/rc.d/rc.firewall #! /bin/sh case "$1" in stop) echo 0 > /proc/sys/net/ipv4/ip_forward # clear out the current settings iptables -F iptables -X iptables -Z ;; start) # Load the connection-sharing for qemu/kvm guests echo 1 > /proc/sys/net/ipv4/ip_forward iptables -t nat -A POSTROUTING -o eth0 -j MASQUERADE iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT # Allow anything internal to this machine (i.e. localhost) # is this really necessary? iptables -A INPUT -s 127.0.0.1 -d 127.0.0.1 -j ACCEPT # Allow any traffic from nodes on home network iptables -A INPUT -s 192.168.0.0/24 -j ACCEPT # and traffic to and from the qemu/kvm virtual networks NETS="200 201" for net in $NETS; do iptables -A INPUT -s 192.168.$net.0/24 -j ACCEPT iptables -A OUTPUT -d 192.168.$net.0/24 -j ACCEPT done # drop everything else # iptables -A INPUT -j LOG --log-level 4 --log-prefix "FIREWALL: " iptables -A INPUT -j DROP ;; restart|reload) $0 stop $0 start ;; status) iptables -L ;; *) echo "Usage: $0 {start|stop|restart|reload|status}" exit 1 ;; esac > eth0 is set up by calling /sbin/ifup from udev on the add event for eth0 (wlan0 is disabled on the laptop, so that won't be getting in the way). 
Here's the script (the SSID is not really XXXXX: $ cat /sbin/ifup #!/bin/sh PATH="/usr/bin:/usr/sbin:/sbin:/bin" export PATH SSID=XXXXX #logger "$0 called with arguments $@" if [ "$1" = "wlan0" ]; then # Bring the interface up before the iwconfig stuff below # assign ip address later else association with AP fails when using WPA ifconfig wlan0 up # Configure the wireless adapter iw wlan0 connect $SSID # start wpa_supplicant if [ -z `pgrep wpa_supplicant` ]; then wpa_supplicant -c/etc/wpa_supplicant/wpa_supplicant.conf -iwlan0 -Dwext -B -f/var/log/wpa_supplicant.log fi # wait until associated with the AP - can take a while with WPA secs=0 until iw wlan0 link | grep -q "SSID: $SSID"; do let secs++ if [ $secs -ge 20 ]; then logger -p user.err -t IFUP "Failed to associate with AP within 20 seconds" exit -1 fi sleep 1 done # set the regulatory domain (kernel >= 2.6.28) iw reg set GB ifconfig wlan0 192.168.0.140 netmask 255.255.255.0 up route add default gw 192.168.0.1 netmask 0.0.0.0 metric 1 exit 0 fi if [ "$1" = "eth0" ] ; then # load the module if necessary if ! grep -q eth0 /proc/net/dev; then modprobe r8169 fi # wait up to 5 seconds for eth0 to appear secs=0 until grep -q eth0 /proc/net/dev; do let secs++ if [ $secs -ge 5 ]; then logger -p user.err -t IFUP "eth0 failed to appear within 5 seconds" exit -1 fi sleep 1 done ifconfig eth0 192.168.0.40 netmask 255.255.255.0 up route add default gw 192.168.0.1 netmask 0.0.0.0 metric 1 exit 0 fi When the KVM client is running the routing on the host is: $ route Kernel IP routing table Destination Gateway Genmask Flags Metric Ref Use Iface default router.local.la 0.0.0.0 UG 1 0 0 eth0 Unix * 255.0.0.0 U 0 0 0 lo local.lan * 255.255.255.0 U 0 0 0 eth0 192.168.200.0 * 255.255.255.0 U 0 0 0 tap0 Like I say, the set up has been like this for ages and has worked. It's only since I started using 3.6 kernels that I've had a problem. I don't recall anything from the nat table ever having been listed by iptables -L. 
> ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 15:31 ` Eric Dumazet 2012-10-01 16:19 ` Chris Clayton @ 2012-10-01 18:34 ` Captain Obvious 2012-10-01 19:21 ` Eric Dumazet 2012-10-01 19:22 ` Chris Clayton 1 sibling, 2 replies; 59+ messages in thread From: Captain Obvious @ 2012-10-01 18:34 UTC (permalink / raw) To: Chris Clayton; +Cc: Eric Dumazet, David Miller, netdev, gpiez Eric Dumazet <eric.dumazet@gmail.com> : [...] > > > Could you send > > > > > > iptables -t -nat -nvL > > > > $ iptables -t -nat -nvL ^ typo Please try "iptables -t nat -nvL" as was also suggested. -- Ueimor ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 18:34 ` Captain Obvious @ 2012-10-01 19:21 ` Eric Dumazet 2012-10-01 19:55 ` Chris Clayton 2012-10-01 19:22 ` Chris Clayton 1 sibling, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-10-01 19:21 UTC (permalink / raw) To: Captain Obvious; +Cc: Chris Clayton, David Miller, netdev, gpiez On Mon, 2012-10-01 at 20:34 +0200, Captain Obvious wrote: > Eric Dumazet <eric.dumazet@gmail.com> : > [...] > > > > Could you send > > > > > > > > iptables -t -nat -nvL > > > > > > $ iptables -t -nat -nvL > ^ typo > > Please try "iptables -t nat -nvL" as was also suggested. > Oh well, good catch ;) And for conntrack -L, please Chris add CONFIG_NF_CT_NETLINK=m to your kernel .config ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 19:21 ` Eric Dumazet @ 2012-10-01 19:55 ` Chris Clayton 0 siblings, 0 replies; 59+ messages in thread From: Chris Clayton @ 2012-10-01 19:55 UTC (permalink / raw) To: Eric Dumazet; +Cc: Captain Obvious, David Miller, netdev, gpiez On 10/01/12 20:21, Eric Dumazet wrote: > On Mon, 2012-10-01 at 20:34 +0200, Captain Obvious wrote: >> Eric Dumazet <eric.dumazet@gmail.com> : >> [...] >>>>> Could you send >>>>> >>>>> iptables -t -nat -nvL >>>> >>>> $ iptables -t -nat -nvL >> ^ typo >> >> Please try "iptables -t nat -nvL" as was also suggested. >> > > Oh well, good catch ;) > > And for conntrack -L, please Chris add CONFIG_NF_CT_NETLINK=m to your > kernel .config > $ conntrack -L unknown 2 566 src=192.168.0.1 dst=224.0.0.1 [UNREPLIED] src=224.0.0.1 dst=192.168.0.1 use=1 icmp 1 25 src=192.168.200.1 dst=192.168.0.1 type=8 code=0 id=512 src=192.168.0.1 dst=192.168.0.40 type=0 code=0 id=512 use=1 conntrack v1.2.2 (conntrack-tools): 2 flow entries have been shown. > > ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 18:34 ` Captain Obvious 2012-10-01 19:21 ` Eric Dumazet @ 2012-10-01 19:22 ` Chris Clayton 1 sibling, 0 replies; 59+ messages in thread From: Chris Clayton @ 2012-10-01 19:22 UTC (permalink / raw) To: Captain Obvious; +Cc: Eric Dumazet, David Miller, netdev, gpiez On 10/01/12 19:34, Captain Obvious wrote: > Eric Dumazet <eric.dumazet@gmail.com> : > [...] >>>> Could you send >>>> >>>> iptables -t -nat -nvL >>> >>> $ iptables -t -nat -nvL > ^ typo > > Please try "iptables -t nat -nvL" as was also suggested. > Good catch, Captain. Thanks. $ iptables -t nat -nvL Chain PREROUTING (policy ACCEPT 58 packets, 7716 bytes) pkts bytes target prot opt in out source destination Chain INPUT (policy ACCEPT 41 packets, 5895 bytes) pkts bytes target prot opt in out source destination Chain OUTPUT (policy ACCEPT 1158 packets, 75559 bytes) pkts bytes target prot opt in out source destination Chain POSTROUTING (policy ACCEPT 208 packets, 14279 bytes) pkts bytes target prot opt in out source destination 951 61351 MASQUERADE all -- * eth0 0.0.0.0/0 0.0.0.0/0 ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 9:15 ` Eric Dumazet 2012-10-01 15:13 ` Chris Clayton @ 2012-10-01 19:34 ` Dave Jones 2012-10-01 20:01 ` David Miller 1 sibling, 1 reply; 59+ messages in thread From: Dave Jones @ 2012-10-01 19:34 UTC (permalink / raw) To: Eric Dumazet; +Cc: Chris Clayton, David Miller, netdev, gpiez On Mon, Oct 01, 2012 at 11:15:50AM +0200, Eric Dumazet wrote: > > > > $ netstat -s > > Icmp: > > 4 ICMP messages received > > 4 input ICMP message failed. > > ICMP input histogram: > > echo replies: 4 > > So icmp replies come back and are delivered to host instead of being > forwarded. > > I wonder if MASQUERADE broke... I hit something that sounds just like this a few months back.. http://lists.openwall.net/netdev/2012/07/25/53 It "went away" a few builds later, but I've seen it happen again from time to time. Dave ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-01 19:34 ` Dave Jones
@ 2012-10-01 20:01 ` David Miller
  2012-10-01 20:04 ` Eric Dumazet
  0 siblings, 1 reply; 59+ messages in thread
From: David Miller @ 2012-10-01 20:01 UTC (permalink / raw)
  To: davej; +Cc: eric.dumazet, chris2553, netdev, gpiez

From: Dave Jones <davej@redhat.com>
Date: Mon, 1 Oct 2012 15:34:34 -0400

> On Mon, Oct 01, 2012 at 11:15:50AM +0200, Eric Dumazet wrote:
> > >
> > > $ netstat -s
> > > Icmp:
> > > 4 ICMP messages received
> > > 4 input ICMP message failed.
> > > ICMP input histogram:
> > > echo replies: 4
> >
> > So icmp replies come back and are delivered to host instead of being
> > forwarded.
> >
> > I wonder if MASQUERADE broke...
>
> I hit something that sounds just like this a few months back..
> http://lists.openwall.net/netdev/2012/07/25/53
>
> It "went away" a few builds later, but I've seen it happen
> again from time to time.

Yep, I remember that report.

If you can find a way to more reliably trigger the case, that would
help us immensely.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-01 20:01 ` David Miller @ 2012-10-01 20:04 ` Eric Dumazet 2012-10-02 15:27 ` Edivaldo de Araújo Pereira 2012-10-02 15:35 ` Eric Dumazet 0 siblings, 2 replies; 59+ messages in thread From: Eric Dumazet @ 2012-10-01 20:04 UTC (permalink / raw) To: David Miller; +Cc: davej, chris2553, netdev, gpiez On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote: > From: Dave Jones <davej@redhat.com> > Date: Mon, 1 Oct 2012 15:34:34 -0400 > > > On Mon, Oct 01, 2012 at 11:15:50AM +0200, Eric Dumazet wrote: > > > > > > > > $ netstat -s > > > > Icmp: > > > > 4 ICMP messages received > > > > 4 input ICMP message failed. > > > > ICMP input histogram: > > > > echo replies: 4 > > > > > > So icmp replies come back and are delivered to host instead of being > > > forwarded. > > > > > > I wonder if MASQUERADE broke... > > > > I hit something that sounds just like this a few months back.. > > http://lists.openwall.net/netdev/2012/07/25/53 > > > > It "went away" a few builds later, but I've seen it happen > > again from time to time. > > Yep I remembe that report. > > If you can find a way to more reliably trigger the case, that would > help us immensely. I am building a KMEMCHECK kernel, as a last try before my night ;) ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-01 20:04 ` Eric Dumazet
@ 2012-10-02 15:27 ` Edivaldo de Araújo Pereira
  2012-10-02 15:35 ` Eric Dumazet
  1 sibling, 0 replies; 59+ messages in thread
From: Edivaldo de Araújo Pereira @ 2012-10-02 15:27 UTC (permalink / raw)
  To: netdev

Eric Dumazet <eric.dumazet <at> gmail.com> writes:

>
> On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote:
> > From: Dave Jones <davej <at> redhat.com>
> > Date: Mon, 1 Oct 2012 15:34:34 -0400
> >
> > > On Mon, Oct 01, 2012 at 11:15:50AM +0200, Eric Dumazet wrote:
> > > > >
> > > > > $ netstat -s
> > > > > Icmp:
> > > > > 4 ICMP messages received
> > > > > 4 input ICMP message failed.
> > > > > ICMP input histogram:
> > > > > echo replies: 4
> > > >
> > > > So icmp replies come back and are delivered to host instead of being
> > > > forwarded.
> > > >
> > > > I wonder if MASQUERADE broke...
> > >
> > > I hit something that sounds just like this a few months back..
> > > http://lists.openwall.net/netdev/2012/07/25/53
> > >
> > > It "went away" a few builds later, but I've seen it happen
> > > again from time to time.
> >
> > Yep I remembe that report.
> >
> > If you can find a way to more reliably trigger the case, that would
> > help us immensely.
>
> I am building a KMEMCHECK kernel, as a last try before my night ;)
>
>

Hi,

I'm facing this kind of problem too, but it is a little different: from
the kvm guest I can ping the local host and any host outside my local
(physical) network, but I cannot ping other hosts in the local
(physical) net. This happens with guests in a virtual switch (vde) or
in any bridged tun/tap. I have switched back to 3.5.4 for now.

Thanks

Edivaldo de Araújo Pereira

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-01 20:04 ` Eric Dumazet
  2012-10-02 15:27 ` Edivaldo de Araújo Pereira
@ 2012-10-02 15:35 ` Eric Dumazet
  2012-10-02 15:48 ` Eric Dumazet
  1 sibling, 1 reply; 59+ messages in thread
From: Eric Dumazet @ 2012-10-02 15:35 UTC (permalink / raw)
  To: David Miller; +Cc: davej, chris2553, netdev, gpiez

On Mon, 2012-10-01 at 22:04 +0200, Eric Dumazet wrote:
> On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote:

> > If you can find a way to more reliably trigger the case, that would
> > help us immensely.
>
> I am building a KMEMCHECK kernel, as a last try before my night ;)

This was a total disaster. KMEMCHECK dies horribly on my machine.

David, shouldn't we use a nh_rth_forward instead of a nh_rth_input in
__mkroute_input()?

(And change rt_cache_route() as well?)

I am testing a patch right now.

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-02 15:35 ` Eric Dumazet
@ 2012-10-02 15:48 ` Eric Dumazet
  2012-10-02 15:57 ` Dave Jones
  ` (4 more replies)
  0 siblings, 5 replies; 59+ messages in thread
From: Eric Dumazet @ 2012-10-02 15:48 UTC (permalink / raw)
  To: David Miller; +Cc: chris2553, netdev, gpiez, Dave Jones

From: Eric Dumazet <edumazet@google.com>

On Tue, 2012-10-02 at 17:35 +0200, Eric Dumazet wrote:
> On Mon, 2012-10-01 at 22:04 +0200, Eric Dumazet wrote:
> > On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote:
>
> > > If you can find a way to more reliably trigger the case, that would
> > > help us immensely.
> >
> > I am building a KMEMCHECK kernel, as a last try before my night ;)
>
> This was a total disaster. KMEMCHECK dies horribly on my machine
>
> David, shouldnt we use a nh_rth_forward instead of a nh_rth_input in
> __mkroute_input() ?
>
> (And change rt_cache_route() as well ?)
>
> I am testing a patch right now.

Yeah, this patch seems to fix the bug for me.

[PATCH] ipv4: properly cache forward routes

commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
introduced a regression for forwarding.

This was hard to reproduce but the symptom was that packets were
delivered to local host instead of being forwarded.

Add a separate cache (nh_rth_forward) to solve the problem.

Many thanks to Chris Clayton for his patience and help.

Reported-by: Chris Clayton <chris2553@googlemail.com>
Bisected-by: Chris Clayton <chris2553@googlemail.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/ip_fib.h     |    1 +
 net/ipv4/fib_semantics.c |    1 +
 net/ipv4/route.c         |   16 ++++++++--------
 3 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 926142e..ce7ffe9 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -85,6 +85,7 @@ struct fib_nh {
 	int			nh_saddr_genid;
 	struct rtable __rcu * __percpu *nh_pcpu_rth_output;
 	struct rtable __rcu	*nh_rth_input;
+	struct rtable __rcu	*nh_rth_forward;
 	struct fnhe_hash_bucket	*nh_exceptions;
 };
 
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3509065..45b5d1d 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -208,6 +208,7 @@ static void free_fib_info_rcu(struct rcu_head *head)
 		free_nh_exceptions(nexthop_nh);
 		rt_fibinfo_free_cpus(nexthop_nh->nh_pcpu_rth_output);
 		rt_fibinfo_free(&nexthop_nh->nh_rth_input);
+		rt_fibinfo_free(&nexthop_nh->nh_rth_forward);
 	} endfor_nexthops(fi);
 
 	release_net(fi->fib_net);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index ff62206..50898d6 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -1193,14 +1193,12 @@ static bool rt_bind_exception(struct rtable *rt, struct fib_nh_exception *fnhe,
 	return ret;
 }
 
-static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt)
+static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt, struct rtable **p)
 {
-	struct rtable *orig, *prev, **p;
+	struct rtable *orig, *prev;
 	bool ret = true;
 
-	if (rt_is_input_route(rt)) {
-		p = (struct rtable **)&nh->nh_rth_input;
-	} else {
+	if (!p) {
 		if (!nh->nh_pcpu_rth_output)
 			goto nocache;
 		p = (struct rtable **)__this_cpu_ptr(nh->nh_pcpu_rth_output);
@@ -1290,7 +1288,7 @@ static void rt_set_nexthop(struct rtable *rt, __be32 daddr,
 		if (unlikely(fnhe))
 			cached = rt_bind_exception(rt, fnhe, daddr);
 		else if (!(rt->dst.flags & DST_NOCACHE))
-			cached = rt_cache_route(nh, rt);
+			cached = rt_cache_route(nh, rt, NULL);
 	}
 	if (unlikely(!cached))
 		rt_add_uncached_list(rt);
@@ -1462,7 +1460,7 @@ static int __mkroute_input(struct sk_buff *skb,
 	do_cache = false;
 	if (res->fi) {
 		if (!itag) {
-			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input);
+			rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_forward);
 			if (rt_cache_valid(rth)) {
 				skb_dst_set_noref(skb, &rth->dst);
 				goto out;
@@ -1493,6 +1491,8 @@ static int __mkroute_input(struct sk_buff *skb,
 
 	rt_set_nexthop(rth, daddr, res, NULL, res->fi, res->type, itag);
 	skb_dst_set(skb, &rth->dst);
+	if (do_cache)
+		rt_cache_route(&FIB_RES_NH(*res), rth, &FIB_RES_NH(*res).nh_rth_forward);
 out:
 	err = 0;
  cleanup:
@@ -1663,7 +1663,7 @@ local_input:
 		rth->rt_flags &= ~RTCF_LOCAL;
 	}
 	if (do_cache)
-		rt_cache_route(&FIB_RES_NH(res), rth);
+		rt_cache_route(&FIB_RES_NH(res), rth, &FIB_RES_NH(res).nh_rth_input);
 	skb_dst_set(skb, &rth->dst);
 	err = 0;
 	goto out;

^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-02 15:48 ` Eric Dumazet
@ 2012-10-02 15:57 ` Dave Jones
  2012-10-02 16:06 ` Eric Dumazet
  2012-10-02 18:25 ` David Miller
  ` (3 subsequent siblings)
  4 siblings, 1 reply; 59+ messages in thread
From: Dave Jones @ 2012-10-02 15:57 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, chris2553, netdev, gpiez

On Tue, Oct 02, 2012 at 05:48:39PM +0200, Eric Dumazet wrote:
> From: Eric Dumazet <edumazet@google.com>
>
> On Tue, 2012-10-02 at 17:35 +0200, Eric Dumazet wrote:
> > On Mon, 2012-10-01 at 22:04 +0200, Eric Dumazet wrote:
> > > On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote:
> >
> > > > If you can find a way to more reliably trigger the case, that would
> > > > help us immensely.
> > >
> > > I am building a KMEMCHECK kernel, as a last try before my night ;)
> >
> > This was a total disaster. KMEMCHECK dies horribly on my machine
> >
> > David, shouldnt we use a nh_rth_forward instead of a nh_rth_input in
> > __mkroute_input() ?
> >
> > (And change rt_cache_route() as well ?)
> >
> > I am testing a patch right now.
>
> Yeah, this patch seems to fix the bug for me.

Good work! Any idea why it didn't happen on every build for me ?
From your description, this should have failed every time ?

	Dave

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-02 15:57 ` Dave Jones
@ 2012-10-02 16:06 ` Eric Dumazet
  0 siblings, 0 replies; 59+ messages in thread
From: Eric Dumazet @ 2012-10-02 16:06 UTC (permalink / raw)
  To: Dave Jones; +Cc: David Miller, chris2553, netdev, gpiez

On Tue, 2012-10-02 at 11:57 -0400, Dave Jones wrote:
>
> Good work! Any idea why it didn't happen on every build for me ?
> From your description, this should have failed every time ?

Well, it seems that as long as you had forwarded packets and no route
yet cached in nh_rth_input, we were using a brand new (and correct)
route.

But as soon as locally generated traffic cached a route in
nh_rth_input, forwarded packets immediately used this cached route and
were delivered to (and dropped by) the local host.

Maybe my patch is not the right fix, but at least it's a step toward
understanding the problem.

Thanks

^ permalink raw reply [flat|nested] 59+ messages in thread
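[Editorial sketch, not part of the original message.] The intermittent
behaviour Eric describes can be illustrated with a toy model of a single
cache slot shared by two kinds of route. Everything below (the toy_*
names, the two-valued action, the "first lookup fills the slot" rule) is
a hypothetical simplification for illustration only; it is not the
kernel's struct fib_nh or struct rtable, and it does not reproduce the
real lookup path.

```c
#include <stddef.h>

/* Toy stand-ins for illustration; these are NOT the kernel's structures. */
enum toy_action { TOY_DELIVER_LOCAL, TOY_FORWARD };

struct toy_route {
	enum toy_action action;
};

struct toy_nexthop {
	const struct toy_route *cached_input;   /* plays the role of nh_rth_input */
	const struct toy_route *cached_forward; /* plays the role of nh_rth_forward */
};

static const struct toy_route toy_local_route = { TOY_DELIVER_LOCAL };
static const struct toy_route toy_forward_route = { TOY_FORWARD };

/* "Build" the correct route for the kind of packet being handled. */
static const struct toy_route *toy_build(enum toy_action wanted)
{
	return wanted == TOY_FORWARD ? &toy_forward_route : &toy_local_route;
}

/*
 * Shared slot: the first lookup fills the slot, and every later lookup
 * gets whatever is there, regardless of what the caller wanted.
 */
enum toy_action toy_lookup_shared(struct toy_nexthop *nh, enum toy_action wanted)
{
	if (!nh->cached_input)
		nh->cached_input = toy_build(wanted);
	return nh->cached_input->action;
}

/* Split slots: forwarding and local delivery never see each other's entry. */
enum toy_action toy_lookup_split(struct toy_nexthop *nh, enum toy_action wanted)
{
	const struct toy_route **slot =
		(wanted == TOY_FORWARD) ? &nh->cached_forward : &nh->cached_input;

	if (!*slot)
		*slot = toy_build(wanted);
	return (*slot)->action;
}
```

In this model a forwarded packet is handled correctly while the shared
slot is empty, and is misdelivered only after a local-delivery lookup
has filled the slot, which is one way to read why the failure did not
show up on every boot.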
* Re: Possible networking regression in 3.6.0 2012-10-02 15:48 ` Eric Dumazet 2012-10-02 15:57 ` Dave Jones @ 2012-10-02 18:25 ` David Miller 2012-10-02 21:14 ` Alexander Duyck 2012-10-02 23:24 ` Julian Anastasov ` (2 subsequent siblings) 4 siblings, 1 reply; 59+ messages in thread From: David Miller @ 2012-10-02 18:25 UTC (permalink / raw) To: eric.dumazet; +Cc: chris2553, netdev, gpiez, davej From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 02 Oct 2012 17:48:39 +0200 > [PATCH] ipv4: properly cache forward routes > > commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) > introduced a regression for forwarding. > > This was hard to reproduce but the symptom was that packets were > delivered to local host instead of being forwarded. > > Add a separate cache (nh_rth_forward) to solve the problem. > > Many thanks to Chris Clayton for his patience and help. > > Reported-by: Chris Clayton <chris2553@googlemail.com> > Bisected-by: Chris Clayton <chris2553@googlemail.com> > Reported-by: Dave Jones <davej@redhat.com> > Signed-off-by: Eric Dumazet <edumazet@google.com> Thanks for figuring this out, I'll think about this more deeply. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-02 18:25 ` David Miller @ 2012-10-02 21:14 ` Alexander Duyck 2012-10-02 21:35 ` Eric Dumazet 0 siblings, 1 reply; 59+ messages in thread From: Alexander Duyck @ 2012-10-02 21:14 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, chris2553, netdev, gpiez, davej On 10/02/2012 11:25 AM, David Miller wrote: > From: Eric Dumazet <eric.dumazet@gmail.com> > Date: Tue, 02 Oct 2012 17:48:39 +0200 > >> [PATCH] ipv4: properly cache forward routes >> >> commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) >> introduced a regression for forwarding. >> >> This was hard to reproduce but the symptom was that packets were >> delivered to local host instead of being forwarded. >> >> Add a separate cache (nh_rth_forward) to solve the problem. >> >> Many thanks to Chris Clayton for his patience and help. >> >> Reported-by: Chris Clayton <chris2553@googlemail.com> >> Bisected-by: Chris Clayton <chris2553@googlemail.com> >> Reported-by: Dave Jones <davej@redhat.com> >> Signed-off-by: Eric Dumazet <edumazet@google.com> > Thanks for figuring this out, I'll think about this more > deeply. I think something may have been missed in this patch. With it applied to net-next I am unable to remove the ixgbe driver after running a routing traffic test. The specific message I am getting is: unregister_netdevice: waiting for eth2 to become free. Usage count = -7 Thanks, Alex ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-02 21:14 ` Alexander Duyck @ 2012-10-02 21:35 ` Eric Dumazet 0 siblings, 0 replies; 59+ messages in thread From: Eric Dumazet @ 2012-10-02 21:35 UTC (permalink / raw) To: Alexander Duyck; +Cc: David Miller, chris2553, netdev, gpiez, davej On Tue, 2012-10-02 at 14:14 -0700, Alexander Duyck wrote: > I think something may have been missed in this patch. > > With it applied to net-next I am unable to remove the ixgbe driver after > running a routing traffic test. The specific message I am getting is: > unregister_netdevice: waiting for eth2 to become free. Usage count = -7 Yes, I realized later that rt_set_nexthop(), called from __mkroute_input() was responsible to do the caching... So another version is needed, I'll do that tomorrow unless David can fix the problem while I sleep a bit ;) ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0
  2012-10-02 15:48 ` Eric Dumazet
  2012-10-02 15:57 ` Dave Jones
  2012-10-02 18:25 ` David Miller
@ 2012-10-02 23:24 ` Julian Anastasov
  2012-10-03 3:10 ` David Miller
  2012-10-03 7:28 ` [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive Eric Dumazet
  2012-10-03 2:55 ` Possible networking regression in 3.6.0 David Miller
  2012-10-04 11:25 ` [PATCH] ipv4: add a fib_type to fib_info Eric Dumazet
  4 siblings, 2 replies; 59+ messages in thread
From: Julian Anastasov @ 2012-10-02 23:24 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, chris2553, netdev, gpiez, Dave Jones

	Hello,

On Tue, 2 Oct 2012, Eric Dumazet wrote:

> > David, shouldnt we use a nh_rth_forward instead of a nh_rth_input in
> > __mkroute_input() ?
> >
> > (And change rt_cache_route() as well ?)
> >
> > I am testing a patch right now.
>
> Yeah, this patch seems to fix the bug for me.
>
> [PATCH] ipv4: properly cache forward routes
>
> commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
> introduced a regression for forwarding.
>
> This was hard to reproduce but the symptom was that packets were
> delivered to local host instead of being forwarded.
>
> Add a separate cache (nh_rth_forward) to solve the problem.

	Can it be a problem related to fib_info reuse
from different routes? For example, when a local IP address
is created for a subnet we have:

broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1
192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1
local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1

	The "dev DEV proto kernel scope link src 192.168.0.1" is
a reused fib_info structure where we put cached routes.
The result can be the same fib_info for 192.168.0.255 and
192.168.0.0/24. RTN_BROADCAST is cached only for input
routes. An incoming broadcast to 192.168.0.255 can be cached
and can cause problems for traffic forwarded to 192.168.0.0/24.
So, this patch should solve the problem because it
separates the broadcast from unicast traffic.

	And the ip_route_input_slow caching will work for
local and broadcast input routes (above routes 1 and 3) just
because they differ in scope and use different fib_info.

	Another possible failure is for output routes:

multicast 224.0.0.0/4 fib_info with unicast 192.168.0.0/24 fib_info

	The multicast sets RTCF_MULTICAST | RTCF_LOCAL and
can cause problems for generated unicast traffic on fib_info
reuse. It depends on the scope; for multicast it is usually
scope global, so maybe it is unlikely to happen in practice.

	__mkroute_output works for local/unicast routes
because they differ in scope.

> Many thanks to Chris Clayton for his patience and help.
>
> Reported-by: Chris Clayton <chris2553@googlemail.com>
> Bisected-by: Chris Clayton <chris2553@googlemail.com>
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/net/ip_fib.h     |    1 +
>  net/ipv4/fib_semantics.c |    1 +
>  net/ipv4/route.c         |   16 ++++++++--------
>  3 files changed, 10 insertions(+), 8 deletions(-)

Regards

--
Julian Anastasov <ja@ssi.bg>

^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-02 23:24 ` Julian Anastasov @ 2012-10-03 3:10 ` David Miller 2012-10-03 15:01 ` Chris Clayton 2012-10-03 20:57 ` Julian Anastasov 2012-10-03 7:28 ` [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive Eric Dumazet 1 sibling, 2 replies; 59+ messages in thread From: David Miller @ 2012-10-03 3:10 UTC (permalink / raw) To: ja; +Cc: eric.dumazet, chris2553, netdev, gpiez, davej From: Julian Anastasov <ja@ssi.bg> Date: Wed, 3 Oct 2012 02:24:53 +0300 (EEST) > Can it be a problem related to fib_info reuse > from different routes. For example, when local IP address > is created for subnet we have: > > broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1 > 192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1 > local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1 > > The "dev DEV proto kernel scope link src 192.168.0.1" is > a reused fib_info structure where we put cached routes. > The result can be same fib_info for 192.168.0.255 and > 192.168.0.0/24. RTN_BROADCAST is cached only for input > routes. Incoming broadcast to 192.168.0.255 can be cached > and can cause problems for traffic forwarded to 192.168.0.0/24. > So, this patch should solve the problem because it > separates the broadcast from unicast traffic. Now I understand the problem. I think the way to fix this is to add cfg->fc_type as another thing that fib_info objects are key'd by. I think it also would fix your obscure output multicast case too. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-03 3:10 ` David Miller @ 2012-10-03 15:01 ` Chris Clayton 2012-10-03 20:57 ` Julian Anastasov 1 sibling, 0 replies; 59+ messages in thread From: Chris Clayton @ 2012-10-03 15:01 UTC (permalink / raw) To: David Miller; +Cc: ja, eric.dumazet, netdev, gpiez, davej On 10/03/12 04:10, David Miller wrote: > From: Julian Anastasov <ja@ssi.bg> > Date: Wed, 3 Oct 2012 02:24:53 +0300 (EEST) > >> Can it be a problem related to fib_info reuse >> from different routes. For example, when local IP address >> is created for subnet we have: >> >> broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1 >> 192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1 >> local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1 >> >> The "dev DEV proto kernel scope link src 192.168.0.1" is >> a reused fib_info structure where we put cached routes. >> The result can be same fib_info for 192.168.0.255 and >> 192.168.0.0/24. RTN_BROADCAST is cached only for input >> routes. Incoming broadcast to 192.168.0.255 can be cached >> and can cause problems for traffic forwarded to 192.168.0.0/24. >> So, this patch should solve the problem because it >> separates the broadcast from unicast traffic. > > Now I understand the problem. > > I think the way to fix this is to add cfg->fc_type as another > thing that fib_info objects are key'd by. > > I think it also would fix your obscure output multicast case too. > > I've seen the discussion about whether Eric's patch is OK or not, but thought I'd give it a spin anyway. It applies to 3.6.0 with some fuzz, but I can confirm that with the patch applied I can now ping my router and browse the internet from a KVM client, so Eric's diagnosis matches the problem I reported. However, after closing the client, I got an oops. I've taken a photograph of the screen and uploaded it to http://i714.photobucket.com/albums/ww149/chris2553/IMAG0059.jpg. 
As it's not the final patch, this may be a red herring, but I thought I'd better give a heads up anyway. Chris ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-03 3:10 ` David Miller 2012-10-03 15:01 ` Chris Clayton @ 2012-10-03 20:57 ` Julian Anastasov 1 sibling, 0 replies; 59+ messages in thread From: Julian Anastasov @ 2012-10-03 20:57 UTC (permalink / raw) To: David Miller; +Cc: eric.dumazet, chris2553, netdev, gpiez, davej Hello, On Tue, 2 Oct 2012, David Miller wrote: > From: Julian Anastasov <ja@ssi.bg> > Date: Wed, 3 Oct 2012 02:24:53 +0300 (EEST) > > > Can it be a problem related to fib_info reuse > > from different routes. For example, when local IP address > > is created for subnet we have: > > > > broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1 > > 192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1 > > local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1 > > > > The "dev DEV proto kernel scope link src 192.168.0.1" is > > a reused fib_info structure where we put cached routes. > > The result can be same fib_info for 192.168.0.255 and > > 192.168.0.0/24. RTN_BROADCAST is cached only for input > > routes. Incoming broadcast to 192.168.0.255 can be cached > > and can cause problems for traffic forwarded to 192.168.0.0/24. > > So, this patch should solve the problem because it > > separates the broadcast from unicast traffic. > > Now I understand the problem. > > I think the way to fix this is to add cfg->fc_type as another > thing that fib_info objects are key'd by. > > I think it also would fix your obscure output multicast case too. Agreed. I don't see problem with this idea. It will avoid confusions with rt_type. Regards -- Julian Anastasov <ja@ssi.bg> ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-02 23:24 ` Julian Anastasov 2012-10-03 3:10 ` David Miller @ 2012-10-03 7:28 ` Eric Dumazet 2012-10-03 12:45 ` David Stevens 1 sibling, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-10-03 7:28 UTC (permalink / raw) To: Julian Anastasov; +Cc: David Miller, chris2553, netdev, gpiez, Dave Jones On Wed, 2012-10-03 at 02:24 +0300, Julian Anastasov wrote: > Hello, > > On Tue, 2 Oct 2012, Eric Dumazet wrote: > > > > David, shouldnt we use a nh_rth_forward instead of a nh_rth_input in > > > __mkroute_input() ? > > > > > > (And change rt_cache_route() as well ?) > > > > > > I am testing a patch right now. > > > > Yeah, this patch seems to fix the bug for me. > > > > [PATCH] ipv4: properly cache forward routes > > > > commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) > > introduced a regression for forwarding. > > > > This was hard to reproduce but the symptom was that packets were > > delivered to local host instead of being forwarded. > > > > Add a separate cache (nh_rth_forward) to solve the problem. > > Can it be a problem related to fib_info reuse > from different routes. For example, when local IP address > is created for subnet we have: > > broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1 > 192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1 > local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1 > > The "dev DEV proto kernel scope link src 192.168.0.1" is > a reused fib_info structure where we put cached routes. > The result can be same fib_info for 192.168.0.255 and > 192.168.0.0/24. RTN_BROADCAST is cached only for input > routes. Incoming broadcast to 192.168.0.255 can be cached > and can cause problems for traffic forwarded to 192.168.0.0/24. > So, this patch should solve the problem because it > separates the broadcast from unicast traffic. 
> > And the ip_route_input_slow caching will work for > local and broadcast input routes (above routes 1 and 3) just > because they differ in scope and use different fib_info. > > Another possible failure is for output routes: > > multicast 224.0.0.0/4 fib_info > with unicast > 192.168.0.0/24 fib_info > > The multicast sets RTCF_MULTICAST | RTCF_LOCAL > and can cause problems for generated unicast traffic on > fib_info reuse. Depends on the scope, for multicast it is > usually scope global, so may be it is difficult to happen > in practice. > > __mkroute_output works for local/unicast routes > because they differ in scope. Thanks Julian for these informations. BTW, it seems we dont properly increase UDP MIB counters when a multicast message is not delivered to at least one socket. Lets fix this to ease future bug hunting. I hate when "netstat -s" is useless and we have to use dropwatch to figure out where we drop a frame. [PATCH] udp: increment UDP_MIB_NOPORTS in multicast receive We should increment UDP_MIB_NOPORTS in the case we found no socket to deliver a copy of one incoming UDP message. 
(RFC 4113 udpNoPorts)

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/ipv4/udp.c |    1 +
 net/ipv6/udp.c |    1 +
 2 files changed, 2 insertions(+)

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 79c8dbe..dfa73c5 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1591,6 +1591,7 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 			sock_put(stack[i]);
 	} else {
 		kfree_skb(skb);
+		UDP_INC_STATS_BH(net, UDP_MIB_NOPORTS, udptable != &udp_table);
 	}
 	return 0;
 }
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index fc99972..0be9ac2 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -748,6 +748,7 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 			sock_put(stack[i]);
 	} else {
 		kfree_skb(skb);
+		UDP6_INC_STATS_BH(net, UDP_MIB_NOPORTS, udptable != &udp_table);
 	}
 	return 0;
 }

^ permalink raw reply related [flat|nested] 59+ messages in thread
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 7:28 ` [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive Eric Dumazet @ 2012-10-03 12:45 ` David Stevens 2012-10-03 13:15 ` Eric Dumazet 0 siblings, 1 reply; 59+ messages in thread From: David Stevens @ 2012-10-03 12:45 UTC (permalink / raw) To: Eric Dumazet Cc: chris2553, Dave Jones, David Miller, gpiez, Julian Anastasov, netdev, netdev-owner netdev-owner@vger.kernel.org wrote on 10/03/2012 03:28:48 AM: > BTW, it seems we dont properly increase UDP MIB counters when a > multicast message is not delivered to at least one socket. If an interface is in promiscuous mode or there are false positives in a multicast address filter, wouldn't this count as "drops" packets that were never intended for this machine? I think an otherwise valid multicast or broadcast packet that doesn't have a local receiver is not an error and shouldn't be counted. +-DLS ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 12:45 ` David Stevens @ 2012-10-03 13:15 ` Eric Dumazet 2012-10-03 14:09 ` David Stevens 0 siblings, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-10-03 13:15 UTC (permalink / raw) To: David Stevens Cc: chris2553, Dave Jones, David Miller, gpiez, Julian Anastasov, netdev, netdev-owner On Wed, 2012-10-03 at 08:45 -0400, David Stevens wrote: > netdev-owner@vger.kernel.org wrote on 10/03/2012 03:28:48 AM: > > > BTW, it seems we dont properly increase UDP MIB counters when a > > multicast message is not delivered to at least one socket. > > If an interface is in promiscuous mode or there are false > positives in a multicast address filter, wouldn't this count as > "drops" packets that were never intended for this machine? > Yes, probably. So we drop them and its expected. > I think an otherwise valid multicast or broadcast packet that doesn't > have a local receiver is not an error and shouldn't be counted. Hmmm This counter is not an "error counter", just a "counter". RFC definitions are exactly : udpNoPorts OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The total number of received UDP datagrams for which there was no application at the destination port. udpInErrors OBJECT-TYPE SYNTAX Counter32 MAX-ACCESS read-only STATUS current DESCRIPTION "The number of received UDP datagrams that could not be delivered for reasons other than the lack of an application at the destination port. So when a host receives an UDP datagram but there was no application at the destination port we should increment udpNoPorts, and its not an error but just a fact. Now _if_ some reader interprets udpNoPorts increases as an indication of errors, this reader is wrong. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 13:15 ` Eric Dumazet @ 2012-10-03 14:09 ` David Stevens 2012-10-03 15:29 ` Eric Dumazet 2012-10-03 17:39 ` Rick Jones 0 siblings, 2 replies; 59+ messages in thread From: David Stevens @ 2012-10-03 14:09 UTC (permalink / raw) To: Eric Dumazet Cc: chris2553, Dave Jones, David Miller, gpiez, Julian Anastasov, netdev, netdev-owner Eric Dumazet <eric.dumazet@gmail.com> wrote on 10/03/2012 09:15:51 AM: > So when a host receives an UDP datagram but there was no application > at the destination port we should increment udpNoPorts, and its not > an error but just a fact. Of course. I think our difference is on the definition of "receives". I don't think a packet delivered locally due to promiscuous mode, broadcast or an imperfect multicast address filter match is a host UDP datagram receive. These packets really shouldn't be delivered to UDP at all; they are not addressed to this host (at least the non-broadcast, no-membership ones). A unicast UDP packet that doesn't match a local IP address does not increment this counter. A promiscuous mode multicast delivery is no different, except that the destination alone doesn't tell us if it is for us. I think counting these will primarily lead to administrators seeing non-zero drops and wasting their time trying to track them down. +-DLS ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 14:09 ` David Stevens @ 2012-10-03 15:29 ` Eric Dumazet 2012-10-03 17:31 ` David Stevens 2012-10-03 17:39 ` Rick Jones 1 sibling, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-10-03 15:29 UTC (permalink / raw) To: David Stevens Cc: chris2553, Dave Jones, David Miller, gpiez, Julian Anastasov, netdev, netdev-owner On Wed, 2012-10-03 at 10:09 -0400, David Stevens wrote: > Eric Dumazet <eric.dumazet@gmail.com> wrote on 10/03/2012 09:15:51 AM: > > > So when a host receives an UDP datagram but there was no application > > at the destination port we should increment udpNoPorts, and its not > > an error but just a fact. > > Of course. I think our difference is on the definition of > "receives". A receive is a packet delivered to this host. Interface being promiscuous or not doesnt really matter. > I don't think a packet delivered locally due to promiscuous mode, > broadcast > or an imperfect multicast address filter match is a host UDP datagram > receive. > These packets really shouldn't be delivered to UDP at all; they are not > addressed to this host (at least the non-broadcast, no-membership ones). Thats the bug we currently are tracking. If some error is happening and packet is delivered instead of being forwarded or dropped, we need a counter being incremented to catch the bug. > A unicast UDP packet that doesn't match a local IP address does > not > increment this counter. It _does_ increment this counter right now, not sure what you mean. 
We currently correctly increment udpNoPorts if we receive an unicast UDP packet that doesnt find a matching socket (because socket(s) are bound to specific addresses instead of ANY_ADDR) This is an extension of the "there was no application at the destination port" to "there was no application at the destination port and destination address" > A promiscuous mode multicast delivery is no > different, > except that the destination alone doesn't tell us if it is for us. > > I think counting these will primarily lead to administrators > seeing > non-zero drops and wasting their time trying to track them down. Well, as I said, seeing increments of this counter is perfectly fine and matches RFC. It permits better diagnostics. Hiding bugs is not very helpful. Most of the time I am trying to track a bug in linux network stack, the very first thing I ask to reporters is to post "netstat -s" before/after their tests exactly because I want to see _some_ counters be incremented and catch obvious problems. And alas, many drops in our stack are not correctly reported because we forgot to increment a counter at the right place. I am fine adding a new SNMP McastDrops counter if you feel its better. # grep Udp: /proc/net/snmp Udp: InDatagrams NoPorts InErrors OutDatagrams RcvbufErrors SndbufErrors McastDrops Udp: 11449164 15473 514616 290821178 0 184352 134 "netstat -s -u" would display : Udp: 11449164 packets received 15473 packets to unknown port received. 
    514616 packet receive errors
    290821178 packets sent
    SndbufErrors: 184352
    McastDrops: 134

Non official patch since net-next is not open :

 include/linux/snmp.h |    1 +
 net/ipv4/proc.c      |    1 +
 net/ipv4/udp.c       |    2 ++
 net/ipv6/proc.c      |    2 ++
 net/ipv6/udp.c       |    2 ++
 5 files changed, 8 insertions(+)

diff --git a/include/linux/snmp.h b/include/linux/snmp.h
index 00bc189..321d643 100644
--- a/include/linux/snmp.h
+++ b/include/linux/snmp.h
@@ -145,6 +145,7 @@ enum
 	UDP_MIB_OUTDATAGRAMS,			/* OutDatagrams */
 	UDP_MIB_RCVBUFERRORS,			/* RcvbufErrors */
 	UDP_MIB_SNDBUFERRORS,			/* SndbufErrors */
+	UDP_MIB_MCASTDROPS,			/* McastDrops (linux extension) */
 	__UDP_MIB_MAX
 };

diff --git a/net/ipv4/proc.c b/net/ipv4/proc.c
index 957acd1..1e932ee 100644
--- a/net/ipv4/proc.c
+++ b/net/ipv4/proc.c
@@ -172,6 +172,7 @@ static const struct snmp_mib snmp4_udp_list[] = {
 	SNMP_MIB_ITEM("OutDatagrams", UDP_MIB_OUTDATAGRAMS),
 	SNMP_MIB_ITEM("RcvbufErrors", UDP_MIB_RCVBUFERRORS),
 	SNMP_MIB_ITEM("SndbufErrors", UDP_MIB_SNDBUFERRORS),
+	SNMP_MIB_ITEM("McastDrops", UDP_MIB_MCASTDROPS),
 	SNMP_MIB_SENTINEL
 };

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 2814f66..4e2a4f7 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -1591,6 +1591,8 @@ static int __udp4_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 			sock_put(stack[i]);
 	} else {
 		kfree_skb(skb);
+		UDP_INC_STATS_BH(net, UDP_MIB_MCASTDROPS,
+				 udptable != &udp_table);
 	}
 	return 0;
 }
diff --git a/net/ipv6/proc.c b/net/ipv6/proc.c
index 745a320..f2c12ea 100644
--- a/net/ipv6/proc.c
+++ b/net/ipv6/proc.c
@@ -129,6 +129,7 @@ static const struct snmp_mib snmp6_udp6_list[] = {
 	SNMP_MIB_ITEM("Udp6OutDatagrams", UDP_MIB_OUTDATAGRAMS),
 	SNMP_MIB_ITEM("Udp6RcvbufErrors", UDP_MIB_RCVBUFERRORS),
 	SNMP_MIB_ITEM("Udp6SndbufErrors", UDP_MIB_SNDBUFERRORS),
+	SNMP_MIB_ITEM("Udp6McastDrops", UDP_MIB_MCASTDROPS),
 	SNMP_MIB_SENTINEL
 };

@@ -139,6 +140,7 @@ static const struct snmp_mib snmp6_udplite6_list[] = {
 	SNMP_MIB_ITEM("UdpLite6OutDatagrams", UDP_MIB_OUTDATAGRAMS),
 	SNMP_MIB_ITEM("UdpLite6RcvbufErrors", UDP_MIB_RCVBUFERRORS),
 	SNMP_MIB_ITEM("UdpLite6SndbufErrors", UDP_MIB_SNDBUFERRORS),
+	SNMP_MIB_ITEM("UdpLite6McastDrops", UDP_MIB_MCASTDROPS),
 	SNMP_MIB_SENTINEL
 };

diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 07e2bfe..c8caf1b 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -748,6 +748,8 @@ static int __udp6_lib_mcast_deliver(struct net *net, struct sk_buff *skb,
 			sock_put(stack[i]);
 	} else {
 		kfree_skb(skb);
+		UDP6_INC_STATS_BH(net, UDP_MIB_MCASTDROPS,
+				  udptable != &udp_table);
 	}
 	return 0;
 }

^ permalink raw reply related [flat|nested] 59+ messages in thread
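The "netstat -s before/after" comparison Eric asks reporters for comes down to reading named columns from the two `Udp:` lines of /proc/net/snmp. The following is a minimal userspace sketch of that lookup, not part of any patch in this thread. The helper names (`next_token`, `snmp_counter`, `snmp_counter_delta`) are hypothetical, and `McastDrops` exists only in the unofficial patch above.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Skip leading whitespace at *p and return the length of the next token. */
static size_t next_token(const char **p)
{
	*p += strspn(*p, " \t\n");
	return strcspn(*p, " \t\n");
}

/*
 * Look up one counter by column name, given the header line and the value
 * line of a /proc/net/snmp section, e.g.
 *   "Udp: InDatagrams NoPorts ..." and "Udp: 11449164 15473 ...".
 * Returns 0 and stores the value in *out, or -1 if the column is absent.
 */
static int snmp_counter(const char *header, const char *values,
			const char *name, unsigned long long *out)
{
	size_t name_len;

	assert(header && values && name && out);
	name_len = strlen(name);

	for (;;) {
		size_t hlen = next_token(&header);
		size_t vlen = next_token(&values);

		if (hlen == 0 || vlen == 0)
			return -1;	/* ran out of columns */
		if (hlen == name_len && memcmp(header, name, hlen) == 0) {
			*out = strtoull(values, NULL, 10);
			return 0;
		}
		header += hlen;		/* advance both lines in lockstep */
		values += vlen;
	}
}

/* Difference of one counter between two snapshots of the value line. */
static int snmp_counter_delta(const char *header, const char *before,
			      const char *after, const char *name,
			      unsigned long long *delta)
{
	unsigned long long b, a;

	if (snmp_counter(header, before, name, &b) ||
	    snmp_counter(header, after, name, &a))
		return -1;
	*delta = a - b;
	return 0;
}
```

Fed the sample lines from the message above, `snmp_counter()` returns 15473 for `NoPorts` and 134 for `McastDrops`. Matching on the column name rather than a fixed position keeps the lookup working when a kernel appends new columns.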
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 15:29 ` Eric Dumazet @ 2012-10-03 17:31 ` David Stevens 2012-10-03 19:30 ` David Miller 0 siblings, 1 reply; 59+ messages in thread From: David Stevens @ 2012-10-03 17:31 UTC (permalink / raw) To: Eric Dumazet Cc: chris2553, Dave Jones, David Miller, gpiez, Julian Anastasov, netdev, netdev-owner Eric Dumazet <eric.dumazet@gmail.com> wrote on 10/03/2012 11:29:13 AM: > > Of course. I think our difference is on the definition of > > "receives". > > A receive is a packet delivered to this host. > Interface being promiscuous or not doesnt really matter. A receive is a packet *addressed* to this host. My point was that running tcpdump/wireshark to look at other hosts' traffic shouldn't affect any UDP MIB (these are ordinarily filtered by IP), but I forgot that we are checking in software, as well as the HW multicast address filter, for multicast group membership. So promiscuous mode and imperfect NIC MAF hashes shouldn't actually result in local delivery and that problem isn't there at all. I do think, still, that it is common to have broadcasts and multicasts (for joined groups, even) with traffic completely uninteresting to this host and that having a drop counter going up for those will appear to be losses and errors when they are completely harmless and irrelevant. But since it can't be incremented for items that are not actually addressed to the local host, as I originally thought, I don't object anymore. Sorry for the sidetrack -- I should've verified that originally. +-DLS ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 17:31 ` David Stevens @ 2012-10-03 19:30 ` David Miller 0 siblings, 0 replies; 59+ messages in thread From: David Miller @ 2012-10-03 19:30 UTC (permalink / raw) To: dlstevens; +Cc: eric.dumazet, chris2553, davej, gpiez, ja, netdev, netdev-owner From: David Stevens <dlstevens@us.ibm.com> Date: Wed, 3 Oct 2012 13:31:30 -0400 > Eric Dumazet <eric.dumazet@gmail.com> wrote on 10/03/2012 11:29:13 AM: > >> > Of course. I think our difference is on the definition of >> > "receives". >> >> A receive is a packet delivered to this host. >> Interface being promiscuous or not doesnt really matter. > > A receive is a packet *addressed* to this host. Although I'm largely ambivalent, this one sentence tipped me over towards David's side on this issue. But this is easy to resolve Eric, just simply make a new custom counter that counts these new cases you care about and document it properly. Thanks. ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive 2012-10-03 14:09 ` David Stevens 2012-10-03 15:29 ` Eric Dumazet @ 2012-10-03 17:39 ` Rick Jones 1 sibling, 0 replies; 59+ messages in thread From: Rick Jones @ 2012-10-03 17:39 UTC (permalink / raw) To: David Stevens Cc: Eric Dumazet, chris2553, Dave Jones, David Miller, gpiez, Julian Anastasov, netdev, netdev-owner On 10/03/2012 07:09 AM, David Stevens wrote: > Of course. I think our difference is on the definition of > "receives". I don't think a packet delivered locally due to > promiscuous mode, broadcast or an imperfect multicast address filter > match is a host UDP datagram receive. These packets really shouldn't > be delivered to UDP at all; they are not addressed to this host (at > least the non-broadcast, no-membership ones). A unicast UDP packet > that doesn't match a local IP address does not increment this > counter. A promiscuous mode multicast delivery is no different, > except that the destination alone doesn't tell us if it is for us. > > I think counting these will primarily lead to administrators seeing > non-zero drops and wasting their time trying to track them down. I would tend to agree with David on this one. Or they might cease trying to track them down because they've gotten so many "false positives." Isn't "meant for me" vs "not meant for me" at the heart of "drops" versus "discards?" Once the packet is in the host, is it tagged in some way with "this was received as promiscuous/whatnot?" rick ^ permalink raw reply [flat|nested] 59+ messages in thread
* Re: Possible networking regression in 3.6.0 2012-10-02 15:48 ` Eric Dumazet ` (2 preceding siblings ...) 2012-10-02 23:24 ` Julian Anastasov @ 2012-10-03 2:55 ` David Miller 2012-10-04 11:25 ` [PATCH] ipv4: add a fib_type to fib_info Eric Dumazet 4 siblings, 0 replies; 59+ messages in thread From: David Miller @ 2012-10-03 2:55 UTC (permalink / raw) To: eric.dumazet; +Cc: chris2553, netdev, gpiez, davej From: Eric Dumazet <eric.dumazet@gmail.com> Date: Tue, 02 Oct 2012 17:48:39 +0200 > [PATCH] ipv4: properly cache forward routes > > commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) > introduced a regression for forwarding. > > This was hard to reproduce but the symptom was that packets were > delivered to local host instead of being forwarded. > > Add a separate cache (nh_rth_forward) to solve the problem. > > Many thanks to Chris Clayton for his patience and help. > > Reported-by: Chris Clayton <chris2553@googlemail.com> > Bisected-by: Chris Clayton <chris2553@googlemail.com> > Reported-by: Dave Jones <davej@redhat.com> > Signed-off-by: Eric Dumazet <edumazet@google.com> I'm still having trouble understanding how this can happen, which is probably why I introduced this bug in the first place :-) Only INPUT routes created by ip_route_input_slow() cache using nh_rth_input. Routes for local destinations vs. forwarded destinations will resolve to different fib_info objects. If at some point a new route is added which turns a local destination into one for which we forward, normal invalidation of cached routes ought to fix it. There's some sequence of events I don't understand that causes the corrupt route cache, can you show it to me? Thanks. ^ permalink raw reply [flat|nested] 59+ messages in thread
* [PATCH] ipv4: add a fib_type to fib_info 2012-10-02 15:48 ` Eric Dumazet ` (3 preceding siblings ...) 2012-10-03 2:55 ` Possible networking regression in 3.6.0 David Miller @ 2012-10-04 11:25 ` Eric Dumazet 2012-10-04 13:08 ` Chris Clayton 4 siblings, 1 reply; 59+ messages in thread From: Eric Dumazet @ 2012-10-04 11:25 UTC (permalink / raw) To: David Miller; +Cc: chris2553, netdev, gpiez, Dave Jones, Julian Anastasov On Tue, 2012-10-02 at 17:48 +0200, Eric Dumazet wrote: > From: Eric Dumazet <edumazet@google.com> > > On Tue, 2012-10-02 at 17:35 +0200, Eric Dumazet wrote: > > On Mon, 2012-10-01 at 22:04 +0200, Eric Dumazet wrote: > > > On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote: > > > > > > If you can find a way to more reliably trigger the case, that would > > > > help us immensely. > > > > > > I am building a KMEMCHECK kernel, as a last try before my night ;) > > > > This was a total disaster. KMEMCHECK dies horribly on my machine > > > > David, shouldnt we use a nh_rth_forward instead of a nh_rth_input in > > __mkroute_input() ? > > > > (And change rt_cache_route() as well ?) > > > > I am testing a patch right now. > OK so I implemented David idea and it seems to work. Testers are needed, thanks ! ;) [PATCH] ipv4: add a fib_type to fib_info commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.) introduced a regression for forwarding. This was hard to reproduce but the symptom was that packets were delivered to local host instead of being forwarded. David suggested to add fib_type to fib_info so that we dont inadvertently share same fib_info for different purposes. With help from Julian Anastasov who provided very helpful hints, reproduced here : <quote> Can it be a problem related to fib_info reuse from different routes. 
For example, when local IP address is created for subnet we have:

broadcast 192.168.0.255 dev DEV proto kernel scope link src 192.168.0.1
192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1
local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1

The "dev DEV proto kernel scope link src 192.168.0.1" is
a reused fib_info structure where we put cached routes.
The result can be same fib_info for 192.168.0.255 and
192.168.0.0/24. RTN_BROADCAST is cached only for input
routes. Incoming broadcast to 192.168.0.255 can be cached
and can cause problems for traffic forwarded to 192.168.0.0/24.
So, this patch should solve the problem because it
separates the broadcast from unicast traffic.

And the ip_route_input_slow caching will work for
local and broadcast input routes (above routes 1 and 3) just
because they differ in scope and use different fib_info.

</quote>

Many thanks to Chris Clayton for his patience and help.

Reported-by: Chris Clayton <chris2553@googlemail.com>
Bisected-by: Chris Clayton <chris2553@googlemail.com>
Reported-by: Dave Jones <davej@redhat.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Julian Anastasov <ja@ssi.bg>
---
 include/net/ip_fib.h     |    1 +
 net/ipv4/fib_semantics.c |    2 ++
 2 files changed, 3 insertions(+)

diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
index 926142e..9497be1 100644
--- a/include/net/ip_fib.h
+++ b/include/net/ip_fib.h
@@ -102,6 +102,7 @@ struct fib_info {
 	unsigned char		fib_dead;
 	unsigned char		fib_protocol;
 	unsigned char		fib_scope;
+	unsigned char		fib_type;
 	__be32			fib_prefsrc;
 	u32			fib_priority;
 	u32			*fib_metrics;
diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
index 3509065..2677530 100644
--- a/net/ipv4/fib_semantics.c
+++ b/net/ipv4/fib_semantics.c
@@ -314,6 +314,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
 		    nfi->fib_scope == fi->fib_scope &&
 		    nfi->fib_prefsrc == fi->fib_prefsrc &&
 		    nfi->fib_priority == fi->fib_priority &&
+		    nfi->fib_type == fi->fib_type &&
 		    memcmp(nfi->fib_metrics, fi->fib_metrics,
 			   sizeof(u32) * RTAX_MAX) == 0 &&
 		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
@@ -833,6 +834,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
 	fi->fib_flags = cfg->fc_flags;
 	fi->fib_priority = cfg->fc_priority;
 	fi->fib_prefsrc = cfg->fc_prefsrc;
+	fi->fib_type = cfg->fc_type;
 
 	fi->fib_nhs = nhs;
 	change_nexthops(fi) {

^ permalink raw reply related [flat|nested] 59+ messages in thread
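The sharing problem Julian describes, and the effect of adding the route type to the lookup key, can be sketched as a small userspace model. This is not kernel code: every `model_*` name is hypothetical, the per-nexthop input cache is reduced to a single integer slot, and only the key fields relevant to the example are kept. It illustrates Julian's hypothesis under those assumptions and nothing more.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum model_type  { MODEL_UNICAST = 1, MODEL_LOCAL, MODEL_BROADCAST };
enum model_scope { MODEL_SCOPE_LINK, MODEL_SCOPE_HOST };

struct model_fib_info {
	uint8_t  scope;
	uint8_t  type;		/* the field the patch adds to the key */
	uint32_t prefsrc;
	int      cached_type;	/* stand-in for the cached input route; 0 = empty */
};

#define MODEL_MAX 8

struct model_table {
	struct model_fib_info info[MODEL_MAX];
	size_t count;
	bool   key_on_type;	/* false: pre-patch key, true: patched key */
};

/* Find-or-create, in the spirit of fib_find_info()/fib_create_info(). */
static struct model_fib_info *model_create_info(struct model_table *t,
						uint8_t type, uint8_t scope,
						uint32_t prefsrc)
{
	for (size_t i = 0; i < t->count; i++) {
		struct model_fib_info *fi = &t->info[i];

		if (fi->scope == scope && fi->prefsrc == prefsrc &&
		    (!t->key_on_type || fi->type == type))
			return fi;	/* existing entry is reused */
	}
	assert(t->count < MODEL_MAX);
	t->info[t->count] = (struct model_fib_info){
		.scope = scope, .type = type, .prefsrc = prefsrc,
	};
	return &t->info[t->count++];
}

/* Resolve an input route through the entry's one-slot cache. */
static int model_route_input(struct model_fib_info *fi, int route_type)
{
	if (!fi->cached_type)
		fi->cached_type = route_type;	/* first user fills the cache */
	return fi->cached_type;			/* later users get what is there */
}

/* Number of distinct entries after adding Julian's three routes. */
static size_t model_count_infos(bool key_on_type)
{
	struct model_table t = { .key_on_type = key_on_type };
	uint32_t src = 0xC0A80001;	/* 192.168.0.1 */

	model_create_info(&t, MODEL_BROADCAST, MODEL_SCOPE_LINK, src);
	model_create_info(&t, MODEL_UNICAST, MODEL_SCOPE_LINK, src);
	model_create_info(&t, MODEL_LOCAL, MODEL_SCOPE_HOST, src);
	return t.count;
}

/*
 * Route type seen by a packet forwarded to 192.168.0.0/24 after a
 * broadcast to 192.168.0.255 has already populated the cache.
 */
static int model_forward_after_broadcast(bool key_on_type)
{
	struct model_table t = { .key_on_type = key_on_type };
	uint32_t src = 0xC0A80001;	/* 192.168.0.1 */
	struct model_fib_info *bcast, *net;

	bcast = model_create_info(&t, MODEL_BROADCAST, MODEL_SCOPE_LINK, src);
	net = model_create_info(&t, MODEL_UNICAST, MODEL_SCOPE_LINK, src);

	model_route_input(bcast, MODEL_BROADCAST);
	return model_route_input(net, MODEL_UNICAST);
}
```

With the pre-patch key, `model_count_infos(false)` yields 2 entries and `model_forward_after_broadcast(false)` returns `MODEL_BROADCAST`, so the forwarded packet is handed the cached broadcast route. With the type in the key, the same calls yield 3 entries and `MODEL_UNICAST`. The local route gets its own entry in both cases because it differs in scope, which matches Julian's remark about routes 1 and 3.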
* Re: [PATCH] ipv4: add a fib_type to fib_info
  2012-10-04 11:25 ` [PATCH] ipv4: add a fib_type to fib_info Eric Dumazet
@ 2012-10-04 13:08 ` Chris Clayton
  2012-10-04 13:32 ` Eric Dumazet
From: Chris Clayton @ 2012-10-04 13:08 UTC
To: Eric Dumazet; +Cc: David Miller, netdev, gpiez, Dave Jones, Julian Anastasov

On 10/04/12 12:25, Eric Dumazet wrote:
> On Tue, 2012-10-02 at 17:48 +0200, Eric Dumazet wrote:
>> From: Eric Dumazet <edumazet@google.com>
>>
>> On Tue, 2012-10-02 at 17:35 +0200, Eric Dumazet wrote:
>>> On Mon, 2012-10-01 at 22:04 +0200, Eric Dumazet wrote:
>>>> On Mon, 2012-10-01 at 16:01 -0400, David Miller wrote:
>>>
>>>>> If you can find a way to more reliably trigger the case, that would
>>>>> help us immensely.
>>>>
>>>> I am building a KMEMCHECK kernel, as a last try before my night ;)
>>>
>>> This was a total disaster. KMEMCHECK dies horribly on my machine
>>>
>>> David, shouldnt we use a nh_rth_forward instead of a nh_rth_input in
>>> __mkroute_input() ?
>>>
>>> (And change rt_cache_route() as well ?)
>>>
>>> I am testing a patch right now.
>>
>
> OK so I implemented David idea and it seems to work.
>
> Testers are needed, thanks ! ;)
>

I've tested 3.6.0 with this patch applied and networking in a WinXP KVM
client is now working fine. The patch applies cleanly to 3.6.0, so I
assume the patch will be forwarded to stable in due course.

Tested-by: Chris Clayton <chris2553@googlemail.com>

> [PATCH] ipv4: add a fib_type to fib_info
>
> commit d2d68ba9fe8 (ipv4: Cache input routes in fib_info nexthops.)
> introduced a regression for forwarding.
>
> This was hard to reproduce but the symptom was that packets were
> delivered to local host instead of being forwarded.
>
> David suggested to add fib_type to fib_info so that we dont
> inadvertently share same fib_info for different purposes.
>
> With help from Julian Anastasov who provided very helpful
> hints, reproduced here :
>
> <quote>
> Can it be a problem related to fib_info reuse
> from different routes. For example, when local IP address
> is created for subnet we have:
>
> broadcast 192.168.0.255 dev DEV proto kernel scope link src
> 192.168.0.1
> 192.168.0.0/24 dev DEV proto kernel scope link src 192.168.0.1
> local 192.168.0.1 dev DEV proto kernel scope host src 192.168.0.1
>
> The "dev DEV proto kernel scope link src 192.168.0.1" is
> a reused fib_info structure where we put cached routes.
> The result can be same fib_info for 192.168.0.255 and
> 192.168.0.0/24. RTN_BROADCAST is cached only for input
> routes. Incoming broadcast to 192.168.0.255 can be cached
> and can cause problems for traffic forwarded to 192.168.0.0/24.
> So, this patch should solve the problem because it
> separates the broadcast from unicast traffic.
>
> And the ip_route_input_slow caching will work for
> local and broadcast input routes (above routes 1 and 3) just
> because they differ in scope and use different fib_info.
>
> </quote>
>
> Many thanks to Chris Clayton for his patience and help.
>
> Reported-by: Chris Clayton <chris2553@googlemail.com>
> Bisected-by: Chris Clayton <chris2553@googlemail.com>
> Reported-by: Dave Jones <davej@redhat.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Julian Anastasov <ja@ssi.bg>
> ---
>  include/net/ip_fib.h     |    1 +
>  net/ipv4/fib_semantics.c |    2 ++
>  2 files changed, 3 insertions(+)
>
> diff --git a/include/net/ip_fib.h b/include/net/ip_fib.h
> index 926142e..9497be1 100644
> --- a/include/net/ip_fib.h
> +++ b/include/net/ip_fib.h
> @@ -102,6 +102,7 @@ struct fib_info {
>  	unsigned char		fib_dead;
>  	unsigned char		fib_protocol;
>  	unsigned char		fib_scope;
> +	unsigned char		fib_type;
>  	__be32			fib_prefsrc;
>  	u32			fib_priority;
>  	u32			*fib_metrics;
> diff --git a/net/ipv4/fib_semantics.c b/net/ipv4/fib_semantics.c
> index 3509065..2677530 100644
> --- a/net/ipv4/fib_semantics.c
> +++ b/net/ipv4/fib_semantics.c
> @@ -314,6 +314,7 @@ static struct fib_info *fib_find_info(const struct fib_info *nfi)
>  		    nfi->fib_scope == fi->fib_scope &&
>  		    nfi->fib_prefsrc == fi->fib_prefsrc &&
>  		    nfi->fib_priority == fi->fib_priority &&
> +		    nfi->fib_type == fi->fib_type &&
>  		    memcmp(nfi->fib_metrics, fi->fib_metrics,
>  			   sizeof(u32) * RTAX_MAX) == 0 &&
>  		    ((nfi->fib_flags ^ fi->fib_flags) & ~RTNH_F_DEAD) == 0 &&
> @@ -833,6 +834,7 @@ struct fib_info *fib_create_info(struct fib_config *cfg)
>  	fi->fib_flags = cfg->fc_flags;
>  	fi->fib_priority = cfg->fc_priority;
>  	fi->fib_prefsrc = cfg->fc_prefsrc;
> +	fi->fib_type = cfg->fc_type;
>
>  	fi->fib_nhs = nhs;
>  	change_nexthops(fi) {
>
* Re: [PATCH] ipv4: add a fib_type to fib_info
  2012-10-04 13:08 ` Chris Clayton
@ 2012-10-04 13:32 ` Eric Dumazet
  2012-10-04 18:14 ` David Miller
From: Eric Dumazet @ 2012-10-04 13:32 UTC
To: Chris Clayton; +Cc: David Miller, netdev, gpiez, Dave Jones, Julian Anastasov

On Thu, 2012-10-04 at 14:08 +0100, Chris Clayton wrote:

> I've tested 3.6.0 with this patch applied and networking in a WinXP KVM
> client is now working fine. The patch applies cleanly to 3.6.0, so I
> assume the patch will be forwarded to stable in due course.
>
> Tested-by: Chris Clayton <chris2553@googlemail.com>

Thanks for testing.
* Re: [PATCH] ipv4: add a fib_type to fib_info
  2012-10-04 13:32 ` Eric Dumazet
@ 2012-10-04 18:14 ` David Miller
From: David Miller @ 2012-10-04 18:14 UTC
To: eric.dumazet; +Cc: chris2553, netdev, gpiez, davej, ja

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Thu, 04 Oct 2012 15:32:08 +0200

> On Thu, 2012-10-04 at 14:08 +0100, Chris Clayton wrote:
>
>> I've tested 3.6.0 with this patch applied and networking in a WinXP KVM
>> client is now working fine. The patch applies cleanly to 3.6.0, so I
>> assume the patch will be forwarded to stable in due course.
>>
>> Tested-by: Chris Clayton <chris2553@googlemail.com>
>
> Thanks for testing.

Applied and queued up for -stable, thanks everyone.

Note that this change means we can completely remove the type fields
from fib_alias and fib_result when net-next opens up, as the value can
be fetched from the fib_info directly now.
* Re: Possible networking regression in 3.6.0
  2012-09-18 14:31 ` Chris Clayton
  2012-09-18 14:40 ` Eric Dumazet
@ 2012-09-18 14:44 ` Chris Clayton
From: Chris Clayton @ 2012-09-18 14:44 UTC
To: netdev

>>
> Sorry, I forgot to say that I also have tried running TinyCore Linux as
> a KVM guest on a 3.6.0-rc6 kernel, and I can ping the router fine, so
> the problem seems to be something specifically related to running
> Windows XP as the guest. I don't have any other guests installed so
> that's as much as I can say, although I could maybe install a Win7 guest
> tomorrow if that would help.
>

Sorry again, but ignore the message above, please. Wrong kernel used in
test. In fact, I get the same failure to ping the router running on a
3.6.0-rc6 kernel.

Apologies for the noise.

Chris
Thread overview: 59+ messages
2012-09-17 15:44 Possible networking regression in 3.6.0 Chris Clayton
2012-09-18 14:21 ` Chris Clayton
2012-09-18 14:31 ` Chris Clayton
2012-09-18 14:40 ` Eric Dumazet
2012-09-18 15:51 ` Chris Clayton
2012-09-19 15:26 ` Chris Clayton
2012-09-22 6:26 ` Chris Clayton
2012-09-27 11:50 ` Chris Clayton
2012-09-27 12:14 ` Eric Dumazet
2012-09-27 18:05 ` Chris Clayton
2012-09-27 21:03 ` Eric Dumazet
2012-09-27 21:17 ` Eric Dumazet
2012-09-28 6:53 ` David Miller
2012-09-28 9:14 ` Chris Clayton
2012-09-28 9:22 ` Chris Clayton
2012-09-28 11:26 ` Eric Dumazet
2012-09-28 14:28 ` Chris Clayton
2012-09-30 15:26 ` Chris Clayton
2012-09-30 19:45 ` Eric Dumazet
2012-10-01 8:36 ` Chris Clayton
2012-10-01 9:15 ` Eric Dumazet
2012-10-01 15:13 ` Chris Clayton
2012-10-01 15:31 ` Eric Dumazet
2012-10-01 16:19 ` Chris Clayton
2012-10-01 16:37 ` Eric Dumazet
2012-10-01 18:28 ` Chris Clayton
2012-10-01 18:34 ` Captain Obvious
2012-10-01 19:21 ` Eric Dumazet
2012-10-01 19:55 ` Chris Clayton
2012-10-01 19:22 ` Chris Clayton
2012-10-01 19:34 ` Dave Jones
2012-10-01 20:01 ` David Miller
2012-10-01 20:04 ` Eric Dumazet
2012-10-02 15:27 ` Edivaldo de Araújo Pereira
2012-10-02 15:35 ` Eric Dumazet
2012-10-02 15:48 ` Eric Dumazet
2012-10-02 15:57 ` Dave Jones
2012-10-02 16:06 ` Eric Dumazet
2012-10-02 18:25 ` David Miller
2012-10-02 21:14 ` Alexander Duyck
2012-10-02 21:35 ` Eric Dumazet
2012-10-02 23:24 ` Julian Anastasov
2012-10-03 3:10 ` David Miller
2012-10-03 15:01 ` Chris Clayton
2012-10-03 20:57 ` Julian Anastasov
2012-10-03 7:28 ` [PATCH] udp: increment UDP_MIB_NOPORTS in mcast receive Eric Dumazet
2012-10-03 12:45 ` David Stevens
2012-10-03 13:15 ` Eric Dumazet
2012-10-03 14:09 ` David Stevens
2012-10-03 15:29 ` Eric Dumazet
2012-10-03 17:31 ` David Stevens
2012-10-03 19:30 ` David Miller
2012-10-03 17:39 ` Rick Jones
2012-10-03 2:55 ` Possible networking regression in 3.6.0 David Miller
2012-10-04 11:25 ` [PATCH] ipv4: add a fib_type to fib_info Eric Dumazet
2012-10-04 13:08 ` Chris Clayton
2012-10-04 13:32 ` Eric Dumazet
2012-10-04 18:14 ` David Miller
2012-09-18 14:44 ` Possible networking regression in 3.6.0 Chris Clayton