* [bisected] Forwarded packets occasionally has loopback output interface in Netfilter @ 2017-12-26 11:05 Anders K. Pedersen | Cohaesio 2018-01-11 17:25 ` Anders K. Pedersen | Cohaesio 0 siblings, 1 reply; 5+ messages in thread From: Anders K. Pedersen | Cohaesio @ 2017-12-26 11:05 UTC (permalink / raw) To: weiwan@google.com; +Cc: netdev@vger.kernel.org, netfilter-devel@vger.kernel.org Hello, On one of our border routers, Netfilter is occasionally logging packets with "OUT=lo" (output interface lo) even though the packet should be going out via a regular interface. This behavior is present on Linux 4.13.0 to 4.14.9, and a bisection of the problem points to [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly as the first bad commit. This commit adds dst_dev_put() calls before some dst_release() calls, and dst_dev_put() does dst->dev = dev_net(dst->dev)->loopback_dev; (among other things), which fits the problem we're seeing. The essential part of our nftables rule set that shows this behavior is chain forward { type filter hook forward priority 0; meta oif { $internal_interfaces } accept meta oif lo ip daddr != 127.0.0.0/8 \ log group 0 snaplen 80 prefix "oif-lo" counter ip saddr { $our_ip_series } \ flow table acct_out \ { meta oif . rt nexthop . ip saddr timeout 12m counter } \ accept log group 0 snaplen 80 prefix "DROP" counter drop } The router only does stateless packet filtering and no redirection or rewriting of the packets (connection tracking, NAT, ipvs etc. are not even compiled for this kernel). As a result of this problem we see packets that should be going to an internal interface (and thus accepted by the first rule above) being logged and dropped by the last rule. Some examples: Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0 Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0 Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 MARK=0 Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 MARK=0 Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 MARK=0 Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 MARK=0 It also happens for outbound traffic, where the packets are logged and counted in the acct_out flow table with "meta oif" = "lo", but a correct "rt nexthop" - an example: Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 WINDOW=12618 ACK URGP=0 MARK=0 # nft list flow table filter acct_out|tr ',' '\n'|grep lo flow table acct_out { "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter packets 1 bytes 40 I don't know if these packets are actually sent out on the correct outbound interface thanks to the proper nexthop (the MAC= information in the Netfilter log is from the received packet and thus not useful here). I tried running a tcpdump on the lo interface to see if these packets would show up there, but during the three days I had it running, it only logged one such packet, while Netfilter logs 20+ outbound packets every day, and the one packet logged by tcpdump was *not* logged by Netfilter. If it helps, I can try adding each of the six dst_dev_put() calls from the first bad commit individually to see which of them that has an effect on this. Regards, Anders K. Pedersen ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter 2017-12-26 11:05 [bisected] Forwarded packets occasionally has loopback output interface in Netfilter Anders K. Pedersen | Cohaesio @ 2018-01-11 17:25 ` Anders K. Pedersen | Cohaesio 2018-01-11 18:18 ` Wei Wang 0 siblings, 1 reply; 5+ messages in thread From: Anders K. Pedersen | Cohaesio @ 2018-01-11 17:25 UTC (permalink / raw) To: weiwan@google.com; +Cc: netdev@vger.kernel.org, netfilter-devel@vger.kernel.org On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio wrote: > Hello, > > On one of our border routers, Netfilter is occasionally logging packets > with "OUT=lo" (output interface lo) even though the packet should be > going out via a regular interface. This behavior is present on Linux > 4.13.0 to 4.14.9, and a bisection of the problem points to > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly > > as the first bad commit. This commit adds dst_dev_put() calls before > some dst_release() calls, and dst_dev_put() does > > dst->dev = dev_net(dst->dev)->loopback_dev; > > (among other things), which fits the problem we're seeing. > > The essential part of our nftables rule set that shows this behavior is > > chain forward { > type filter hook forward priority 0; > > meta oif { $internal_interfaces } accept > > meta oif lo ip daddr != 127.0.0.0/8 \ > log group 0 snaplen 80 prefix "oif-lo" counter > > ip saddr { $our_ip_series } \ > flow table acct_out \ > { meta oif . rt nexthop . ip saddr timeout 12m counter } \ > accept > > log group 0 snaplen 80 prefix "DROP" counter drop > } > > The router only does stateless packet filtering and no redirection or > rewriting of the packets (connection tracking, NAT, ipvs etc. are not > even compiled for this kernel). > > As a result of this problem we see packets that should be going to an > internal interface (and thus accepted by the first rule above) being > logged and dropped by the last rule. Some examples: > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0 > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0 > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 MARK=0 > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 MARK=0 > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 MARK=0 > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 MARK=0 > > It also happens for outbound traffic, where the packets are logged and > counted in the acct_out flow table with "meta oif" = "lo", but a > correct "rt nexthop" - an example: > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 WINDOW=12618 ACK URGP=0 MARK=0 > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo > flow table acct_out { > "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter packets 1 bytes 40 > > I don't know if these packets are actually sent out on the correct > outbound interface thanks to the proper nexthop (the MAC= information > in the Netfilter log is from the received packet and thus not useful > here). > > I tried running a tcpdump on the lo interface to see if these packets > would show up there, but during the three days I had it running, it > only logged one such packet, while Netfilter logs 20+ outbound packets > every day, and the one packet logged by tcpdump was *not* logged by > Netfilter. Further testing of the individual parts of the first bad commit shows that the five first additions of the dst_dev_put() call doesn't trigger the problem, while the last one does (also without the first five), so the problematic part is: diff --git a/net/ipv4/route.c b/net/ipv4/route.c index 3dee004..d986d80 100644 --- a/net/ipv4/route.c +++ b/net/ipv4/route.c @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt) prev = cmpxchg(p, orig, rt); if (prev == orig) { if (orig) { + dst_dev_put(&orig->dst); dst_release(&orig->dst); rt_free(orig); } Any help with fixing this would be much appreciated. If there's anything I should try to debug this further, please let me know. Regards, Anders K. Pedersen ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter 2018-01-11 17:25 ` Anders K. Pedersen | Cohaesio @ 2018-01-11 18:18 ` Wei Wang 2018-01-22 14:54 ` Anders K. Pedersen | Cohaesio 0 siblings, 1 reply; 5+ messages in thread From: Wei Wang @ 2018-01-11 18:18 UTC (permalink / raw) To: Anders K. Pedersen | Cohaesio Cc: netdev@vger.kernel.org, netfilter-devel@vger.kernel.org On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio <akp@cohaesio.com> wrote: > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio wrote: >> Hello, >> >> On one of our border routers, Netfilter is occasionally logging packets >> with "OUT=lo" (output interface lo) even though the packet should be >> going out via a regular interface. This behavior is present on Linux >> 4.13.0 to 4.14.9, and a bisection of the problem points to >> >> [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call dst_dev_put() properly >> >> as the first bad commit. This commit adds dst_dev_put() calls before >> some dst_release() calls, and dst_dev_put() does >> >> dst->dev = dev_net(dst->dev)->loopback_dev; >> >> (among other things), which fits the problem we're seeing. >> >> The essential part of our nftables rule set that shows this behavior is >> >> chain forward { >> type filter hook forward priority 0; >> >> meta oif { $internal_interfaces } accept >> >> meta oif lo ip daddr != 127.0.0.0/8 \ >> log group 0 snaplen 80 prefix "oif-lo" counter >> >> ip saddr { $our_ip_series } \ >> flow table acct_out \ >> { meta oif . rt nexthop . ip saddr timeout 12m counter } \ >> accept >> >> log group 0 snaplen 80 prefix "DROP" counter drop >> } >> >> The router only does stateless packet filtering and no redirection or >> rewriting of the packets (connection tracking, NAT, ipvs etc. are not >> even compiled for this kernel). >> >> As a result of this problem we see packets that should be going to an >> internal interface (and thus accepted by the first rule above) being >> logged and dropped by the last rule. Some examples: >> >> Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0 >> Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 WINDOW=3295 ACK URGP=0 MARK=0 >> >> Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 MARK=0 >> Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 WINDOW=65535 ACK PSH URGP=0 MARK=0 >> >> Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 MARK=0 >> Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 WINDOW=65535 ACK PSH URGP=0 MARK=0 >> >> It also happens for outbound traffic, where the packets are logged and >> counted in the acct_out flow table with "meta oif" = "lo", but a >> correct "rt nexthop" - an example: >> >> Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 WINDOW=12618 ACK URGP=0 MARK=0 >> >> # nft list flow table filter acct_out|tr ',' '\n'|grep lo >> flow table acct_out { >> "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter packets 1 bytes 40 >> >> I don't know if these packets are actually sent out on the correct >> outbound interface thanks to the proper nexthop (the MAC= information >> in the Netfilter log is from the received packet and thus not useful >> here). >> >> I tried running a tcpdump on the lo interface to see if these packets >> would show up there, but during the three days I had it running, it >> only logged one such packet, while Netfilter logs 20+ outbound packets >> every day, and the one packet logged by tcpdump was *not* logged by >> Netfilter. > > Further testing of the individual parts of the first bad commit shows > that the five first additions of the dst_dev_put() call doesn't trigger > the problem, while the last one does (also without the first five), so > the problematic part is: > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > index 3dee004..d986d80 100644 > --- a/net/ipv4/route.c > +++ b/net/ipv4/route.c > @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh, struct rtable *rt) > prev = cmpxchg(p, orig, rt); > if (prev == orig) { > if (orig) { > + dst_dev_put(&orig->dst); > dst_release(&orig->dst); > rt_free(orig); > } > Hi Anders, First of all, the bad commit you pointed to is part of the effort to remove dst garbage collector. And it is expected that certain routing behavior will change after this whole patch set. In your specific case: Specifically for the one line change you pointed above, rt_cache_route() is called to replace an existing cached route with a new one because the existing cached route is already invalid. And the criteria to determine if it is valid is in rt_cache_valid(). If a route becomes invalid, it means the route is either being deleted, or a new route is being inserted to the same table, or there is certain netdev event happening that could make this route unusable. In all cases, a route re-lookup is required. dst_dev_put() here does cleanup work to free up the device refcount and all packets using this route will be dropped. And as you said, you don't really see the packets going out of lo. > Any help with fixing this would be much appreciated. If there's > anything I should try to debug this further, please let me know. > > Regards, > Anders K. Pedersen ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter 2018-01-11 18:18 ` Wei Wang @ 2018-01-22 14:54 ` Anders K. Pedersen | Cohaesio 2018-02-14 13:34 ` Anders K. Pedersen | Cohaesio 0 siblings, 1 reply; 5+ messages in thread From: Anders K. Pedersen | Cohaesio @ 2018-01-22 14:54 UTC (permalink / raw) To: weiwan@google.com; +Cc: netdev@vger.kernel.org, netfilter-devel@vger.kernel.org On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote: > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio > <akp@cohaesio.com> wrote: > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio > > wrote: > > > Hello, > > > > > > On one of our border routers, Netfilter is occasionally logging > > > packets > > > with "OUT=lo" (output interface lo) even though the packet should > > > be > > > going out via a regular interface. This behavior is present on > > > Linux > > > 4.13.0 to 4.14.9, and a bisection of the problem points to > > > > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call > > > dst_dev_put() properly > > > > > > as the first bad commit. This commit adds dst_dev_put() calls > > > before > > > some dst_release() calls, and dst_dev_put() does > > > > > > dst->dev = dev_net(dst->dev)->loopback_dev; > > > > > > (among other things), which fits the problem we're seeing. > > > > > > The essential part of our nftables rule set that shows this > > > behavior is > > > > > > chain forward { > > > type filter hook forward priority 0; > > > > > > meta oif { $internal_interfaces } accept > > > > > > meta oif lo ip daddr != 127.0.0.0/8 \ > > > log group 0 snaplen 80 prefix "oif-lo" > > > counter > > > > > > ip saddr { $our_ip_series } \ > > > flow table acct_out \ > > > { meta oif . rt nexthop . ip saddr > > > timeout 12m counter } \ > > > accept > > > > > > log group 0 snaplen 80 prefix "DROP" counter drop > > > } > > > > > > The router only does stateless packet filtering and no > > > redirection or > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are > > > not > > > even compiled for this kernel). > > > > > > As a result of this problem we see packets that should be going > > > to an > > > internal interface (and thus accepted by the first rule above) > > > being > > > logged and dropped by the last rule. Some examples: > > > > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 > > > WINDOW=3295 ACK URGP=0 MARK=0 > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 > > > WINDOW=3295 ACK URGP=0 MARK=0 > > > > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > > > It also happens for outbound traffic, where the packets are > > > logged and > > > counted in the acct_out flow table with "meta oif" = "lo", but a > > > correct "rt nexthop" - an example: > > > > > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo > > > MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 > > > DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF > > > PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 > > > WINDOW=12618 ACK URGP=0 MARK=0 > > > > > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo > > > flow table acct_out { > > > "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter > > > packets 1 bytes 40 > > > > > > I don't know if these packets are actually sent out on the > > > correct > > > outbound interface thanks to the proper nexthop (the MAC= > > > information > > > in the Netfilter log is from the received packet and thus not > > > useful > > > here). > > > > > > I tried running a tcpdump on the lo interface to see if these > > > packets > > > would show up there, but during the three days I had it running, > > > it > > > only logged one such packet, while Netfilter logs 20+ outbound > > > packets > > > every day, and the one packet logged by tcpdump was *not* logged > > > by > > > Netfilter. > > > > Further testing of the individual parts of the first bad commit > > shows > > that the five first additions of the dst_dev_put() call doesn't > > trigger > > the problem, while the last one does (also without the first five), > > so > > the problematic part is: > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > index 3dee004..d986d80 100644 > > --- a/net/ipv4/route.c > > +++ b/net/ipv4/route.c > > @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh, > > struct rtable *rt) > > prev = cmpxchg(p, orig, rt); > > if (prev == orig) { > > if (orig) { > > + dst_dev_put(&orig->dst); > > dst_release(&orig->dst); > > rt_free(orig); > > } > > > > Hi Anders, > > First of all, the bad commit you pointed to is part of the effort to > remove dst garbage collector. And it is expected that certain routing > behavior will change after this whole patch set. > In your specific case: > Specifically for the one line change you pointed above, > rt_cache_route() is called to replace an existing cached route with a > new one because the existing cached route is already invalid. And the > criteria to determine if it is valid is in rt_cache_valid(). If a > route becomes invalid, it means the route is either being deleted, or > a new route is being inserted to the same table, or there is certain > netdev event happening that could make this route unusable. Hi Wei, I looked into this and activated debug logging for all route changes from the Quagga/zebra routing daemons, and none of the logged packets with OUT=lo happened within a minute of a route change for a prefix that covered the destination IP in the packet. Also, some of the logged packets with OUT=lo are for destination IPs that are on the subnet of a directly connected interface on the router, so there are no dynamic route changes and there haven't been any link changes either. Which other "certain netdev event" could cause this? > In all > cases, a route re-lookup is required. dst_dev_put() here does cleanup > work to free up the device refcount and all packets using this route > will be dropped. And as you said, you don't really see the packets > going out of lo. I wrote that we don't see the same packets being logged *and* show up in a "tcpdump -i lo", but we do see packets in the tcpdump that should have been routed out a regular interface - it's just not the same packets that netfilter is logging (and it happens less frequently). It looks like it depends on how far the packet is in the path through the kernel, when dst_dev_put() is called. > > Any help with fixing this would be much appreciated. If there's > > anything I should try to debug this further, please let me know. Regards, Anders K. Pedersen ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [bisected] Forwarded packets occasionally has loopback output interface in Netfilter 2018-01-22 14:54 ` Anders K. Pedersen | Cohaesio @ 2018-02-14 13:34 ` Anders K. Pedersen | Cohaesio 0 siblings, 0 replies; 5+ messages in thread From: Anders K. Pedersen | Cohaesio @ 2018-02-14 13:34 UTC (permalink / raw) To: weiwan@google.com; +Cc: netdev@vger.kernel.org, netfilter-devel@vger.kernel.org On man, 2018-01-22 at 15:54 +0100, Anders K. Pedersen | Cohaesio wrote: > On tor, 2018-01-11 at 10:18 -0800, Wei Wang wrote: > > On Thu, Jan 11, 2018 at 9:25 AM, Anders K. Pedersen | Cohaesio > > <akp@cohaesio.com> wrote: > > > On tir, 2017-12-26 at 12:05 +0100, Anders K. Pedersen | Cohaesio > > > wrote: > > > > Hello, > > > > > > > > On one of our border routers, Netfilter is occasionally logging > > > > packets > > > > with "OUT=lo" (output interface lo) even though the packet should > > > > be > > > > going out via a regular interface. This behavior is present on > > > > Linux > > > > 4.13.0 to 4.14.9, and a bisection of the problem points to > > > > > > > > [95c47f9cf5e028d1ae77dc6c767c1edc8a18025b] ipv4: call > > > > dst_dev_put() properly > > > > > > > > as the first bad commit. This commit adds dst_dev_put() calls > > > > before > > > > some dst_release() calls, and dst_dev_put() does > > > > > > > > dst->dev = dev_net(dst->dev)->loopback_dev; > > > > > > > > (among other things), which fits the problem we're seeing. > > > > > > > > The essential part of our nftables rule set that shows this > > > > behavior is > > > > > > > > chain forward { > > > > type filter hook forward priority 0; > > > > > > > > meta oif { $internal_interfaces } accept > > > > > > > > meta oif lo ip daddr != 127.0.0.0/8 \ > > > > log group 0 snaplen 80 prefix "oif-lo" > > > > counter > > > > > > > > ip saddr { $our_ip_series } \ > > > > flow table acct_out \ > > > > { meta oif . rt nexthop . ip saddr > > > > timeout 12m counter } \ > > > > accept > > > > > > > > log group 0 snaplen 80 prefix "DROP" counter drop > > > > } > > > > > > > > The router only does stateless packet filtering and no > > > > redirection or > > > > rewriting of the packets (connection tracking, NAT, ipvs etc. are > > > > not > > > > even compiled for this kernel). > > > > > > > > As a result of this problem we see packets that should be going > > > > to an > > > > internal interface (and thus accepted by the first rule above) > > > > being > > > > logged and dropped by the last rule. Some examples: > > > > > > > > Dec 22 11:57:02 cix4 oif-lo IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 > > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF > > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 > > > > WINDOW=3295 ACK URGP=0 MARK=0 > > > > Dec 22 11:57:02 cix4 DROP IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:10:f3:11:38:06:77:08:00 SRC=81.170.163.118 > > > > DST=212.97.158.33 LEN=1500 TOS=00 PREC=0x00 TTL=116 ID=25932 DF > > > > PROTO=TCP SPT=35118 DPT=8443 SEQ=604358330 ACK=1182278705 > > > > WINDOW=3295 ACK URGP=0 MARK=0 > > > > > > > > Dec 22 12:47:07 cix4 oif-lo IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF > > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > Dec 22 12:47:07 cix4 DROP IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.30.18 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=118 ID=10370 DF > > > > PROTO=TCP SPT=443 DPT=44988 SEQ=1141545913 ACK=3844573103 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > > > > > Dec 22 12:53:56 cix4 oif-lo IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF > > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > Dec 22 12:53:56 cix4 DROP IN=eth10 OUT=lo > > > > MAC=90:e2:ba:5c:b6:95:0e:86:10:27:99:f3:08:00 SRC=40.101.12.34 > > > > DST=212.97.130.32 LEN=245 TOS=00 PREC=0x00 TTL=115 ID=27728 DF > > > > PROTO=TCP SPT=443 DPT=39724 SEQ=3797156404 ACK=3944234612 > > > > WINDOW=65535 ACK PSH URGP=0 MARK=0 > > > > > > > > It also happens for outbound traffic, where the packets are > > > > logged and > > > > counted in the acct_out flow table with "meta oif" = "lo", but a > > > > correct "rt nexthop" - an example: > > > > > > > > Dec 22 12:29:13 cix4 oif-lo IN=team0.20 OUT=lo > > > > MAC=3c:fd:fe:15:db:a8:00:24:a8:ff:f0:00:08:00 SRC=212.97.129.25 > > > > DST=95.166.119.129 LEN=40 TOS=00 PREC=0x00 TTL=62 ID=19481 DF > > > > PROTO=TCP SPT=443 DPT=52560 SEQ=3034827396 ACK=2862814901 > > > > WINDOW=12618 ACK URGP=0 MARK=0 > > > > > > > > # nft list flow table filter acct_out|tr ',' '\n'|grep lo > > > > flow table acct_out { > > > > "lo" . 94.101.208.217 . 212.97.129.25 expires 3m17s : counter > > > > packets 1 bytes 40 > > > > > > > > I don't know if these packets are actually sent out on the > > > > correct > > > > outbound interface thanks to the proper nexthop (the MAC= > > > > information > > > > in the Netfilter log is from the received packet and thus not > > > > useful > > > > here). > > > > > > > > I tried running a tcpdump on the lo interface to see if these > > > > packets > > > > would show up there, but during the three days I had it running, > > > > it > > > > only logged one such packet, while Netfilter logs 20+ outbound > > > > packets > > > > every day, and the one packet logged by tcpdump was *not* logged > > > > by > > > > Netfilter. > > > > > > Further testing of the individual parts of the first bad commit > > > shows > > > that the five first additions of the dst_dev_put() call doesn't > > > trigger > > > the problem, while the last one does (also without the first five), > > > so > > > the problematic part is: > > > > > > diff --git a/net/ipv4/route.c b/net/ipv4/route.c > > > index 3dee004..d986d80 100644 > > > --- a/net/ipv4/route.c > > > +++ b/net/ipv4/route.c > > > @@ -1369,6 +1372,7 @@ static bool rt_cache_route(struct fib_nh *nh, > > > struct rtable *rt) > > > prev = cmpxchg(p, orig, rt); > > > if (prev == orig) { > > > if (orig) { > > > + dst_dev_put(&orig->dst); > > > dst_release(&orig->dst); > > > rt_free(orig); > > > } > > > > > > > Hi Anders, > > > > First of all, the bad commit you pointed to is part of the effort to > > remove dst garbage collector. And it is expected that certain routing > > behavior will change after this whole patch set. > > In your specific case: > > Specifically for the one line change you pointed above, > > rt_cache_route() is called to replace an existing cached route with a > > new one because the existing cached route is already invalid. And the > > criteria to determine if it is valid is in rt_cache_valid(). If a > > route becomes invalid, it means the route is either being deleted, or > > a new route is being inserted to the same table, or there is certain > > netdev event happening that could make this route unusable. > > Hi Wei, > > I looked into this and activated debug logging for all route changes > from the Quagga/zebra routing daemons, and none of the logged packets > with OUT=lo happened within a minute of a route change for a prefix > that covered the destination IP in the packet. > > Also, some of the logged packets with OUT=lo are for destination IPs > that are on the subnet of a directly connected interface on the router, > so there are no dynamic route changes and there haven't been any link > changes either. > > Which other "certain netdev event" could cause this? > > > In all > > cases, a route re-lookup is required. dst_dev_put() here does cleanup > > work to free up the device refcount and all packets using this route > > will be dropped. And as you said, you don't really see the packets > > going out of lo. > > I wrote that we don't see the same packets being logged *and* show up > in a "tcpdump -i lo", but we do see packets in the tcpdump that should > have been routed out a regular interface - it's just not the same > packets that netfilter is logging (and it happens less frequently). It > looks like it depends on how far the packet is in the path through the > kernel, when dst_dev_put() is called. I spent some more time testing this in order to understand the scenario that causes this. It turns out that the packets being dropped due to OUT=lo only happen for the __mkroute_input() -> rt_set_nexthop() -> rt_cache_route() -> dst_dev_put() call chain, when rt_cache_valid() returns false due to rt_is_expired(), which compares rth->rt_genid with rt_genid_ipv4(dev_net(rth->dst.dev)). The latter is bumped every time *any* fib route is changed (and in other cases), which thus invalidates everything in the route cache. On a router with full BGP feeds that happens quite frequently, and I believe the scenario we're seeing is: - Packet A received for destination X is assigned rth R - Some route in the FIB is added, deleted, or changed, so rt_genid is bumped - Packet B received for destination X before packet A has been forwarded - __mkroute_input() -> rt_cache_valid() -> rt_is_expired() determines that rth R is no longer valid - A new rth S is created and rth R is invalidated via __mkroute_input() -> rt_set_nexthop() -> rt_cache_route() -> dst_dev_put() - Packet A now has rth R with dst->dev = dst->output = dst_discard_out and dev_net(dst->dev)->loopback_dev as it moves through the kernel forwarding (and Netfilter) logic, so it gets dropped To avoid this I developed the patch below, which updates rth->rt_genid to the current value of rt_genid_ipv4(dev_net(rth->dst.dev)), if the information in it is the same as what would be put into a new rth. This avoids invalidating the old rth and allocating a new one with the same information, and packet A from the scenario above is no longer dropped. The patch has a lot of net_warn_ratelimited() for debugging, but none of them have been triggered in my testing. I also revived and abused the in_hit counter to validate that the code path is being hit. After a few days with the patch there haven't been any more packet drops, and we have the following statistics: # rtstat -k entries,in_hit,in_slow_tot,in_slow_mc,in_no_route rt_cache|rt_cache|rt_cache|rt_cache|rt_cache| entries| in_hit|in_slow_|in_slow_|in_no_ro| | | tot| mc| ute| 1015| 2305460| 1177662| 758764| 1080447| Would a cleaned up (i.e. without the warnings and counter) version of this patch be appropriate for submission, or are there cases where this would break things? --- linux-4.14.18/include/net/route.h.orig 2017-11-12 19:46:13.000000000 +0100 +++ linux-4.14.18/include/net/route.h 2018-02-11 11:20:34.133788420 +0100 @@ -96,6 +96,7 @@ }; struct rt_cache_stat { + unsigned int in_hit; unsigned int in_slow_tot; unsigned int in_slow_mc; unsigned int in_no_route; --- linux-4.14.18/net/ipv4/route.c.orig 2018-02-11 11:12:41.801093116 +0100 +++ linux-4.14.18/net/ipv4/route.c 2018-02-14 07:27:54.290124737 +0100 @@ -295,7 +295,7 @@ seq_printf(seq,"%08x %08x %08x %08x %08x %08x %08x %08x " " %08x %08x %08x %08x %08x %08x %08x %08x %08x \n", dst_entries_get_slow(&ipv4_dst_ops), - 0, /* st->in_hit */ + st->in_hit, /* 0, / * st->in_hit */ st->in_slow_tot, st->in_slow_mc, st->in_no_route, @@ -1668,6 +1668,7 @@ { struct fib_nh_exception *fnhe; struct rtable *rth; + struct fib_nh *nh; int err; struct in_device *out_dev; bool do_cache; @@ -1713,23 +1714,110 @@ fnhe = find_exception(&FIB_RES_NH(*res), daddr); if (do_cache) { + nh = &FIB_RES_NH(*res); if (fnhe) { rth = rcu_dereference(fnhe->fnhe_rth_input); if (rth && rth->dst.expires && time_after(jiffies, rth->dst.expires)) { - ip_del_fnhe(&FIB_RES_NH(*res), daddr); + ip_del_fnhe(nh, daddr); fnhe = NULL; } else { goto rt_cache; } } - rth = rcu_dereference(FIB_RES_NH(*res).nh_rth_input); + rth = rcu_dereference(nh->nh_rth_input); rt_cache: if (rt_cache_valid(rth)) { skb_dst_set_noref(skb, &rth->dst); goto out; + } else if (rth && + rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK && + rth->dst.error == 0 && + rth->dst.dev == out_dev->dev && + rth->dst.ops == &ipv4_dst_ops && +#ifdef CONFIG_XFRM + rth->dst.xfrm == NULL && +#endif + rth->dst.flags == + ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ? + DST_NOPOLICY : 0) | + (IN_DEV_CONF_GET(out_dev, NOXFRM) ? + DST_NOXFRM : 0)) && + rth->dst.lwtstate == nh->nh_lwtstate && + rth->rt_type == res->type && + rth->rt_table_id == + (res->table ? res->table->tb_id : 0) && + rth->rt_gateway == + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + nh->nh_gw : 0) && + rth->rt_uses_gateway == + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + 1 : 0)) { + rth->rt_genid = rt_genid_ipv4(dev_net(rth->dst.dev)); + skb_dst_set_noref(skb, &rth->dst); + RT_CACHE_STAT_INC(in_hit); + if (rth->dst.header_len != 0) + net_warn_ratelimited("rth->dst.header_len (%x) != 0\n", rth->dst.header_len); + if (rth->dst.trailer_len != 0) + net_warn_ratelimited("rth->dst.trailer_len (%x) != 0\n", rth->dst.trailer_len); + if (rth->rt_is_input != 1) + net_warn_ratelimited("rth->rt_is_input != 1\n"); + if (rth->rt_flags != 0) + net_warn_ratelimited("rth->rt_flags (%x) != 0\n", rth->rt_flags); + if (rth->rt_iif != 0) + net_warn_ratelimited("rth->rt_iif (%x) != 0\n", rth->rt_iif); + if (rth->rt_pmtu != 0) + net_warn_ratelimited("rth->rt_pmtu (%x) != 0\n", rth->rt_pmtu); + if (rth->dst.output != ip_output) + net_warn_ratelimited("rth->dst.output != ip_output\n"); + if (rth->dst.input != ip_forward) + net_warn_ratelimited("rth->dst.input != ip_forward\n"); + goto out; + } else if (rth && + rth->dst.obsolete == DST_OBSOLETE_FORCE_CHK) { + if (rth->dst.error != 0) + net_warn_ratelimited("rth->dst.error (%x) != 0\n", rth->dst.error); + if (rth->dst.dev != out_dev->dev) + net_warn_ratelimited("mkroute_input: rth->dst.dev != out_dev->dev\n"); + if (rth->dst.ops != &ipv4_dst_ops) + net_warn_ratelimited("mkroute_input: rth->dst.ops != &ipv4_dst_ops\n"); +#ifdef CONFIG_XFRM + if (rth->dst.xfrm != NULL) + net_warn_ratelimited("mkroute_input: rth->dst.xfrm != NULL\n"); +#endif + if (rth->dst.flags != + ((IN_DEV_CONF_GET(in_dev, NOPOLICY) ? + DST_NOPOLICY : 0) | + (IN_DEV_CONF_GET(out_dev, NOXFRM) ? + DST_NOXFRM : 0))) + net_warn_ratelimited("rth->dst.flags (%x) != %x\n", + rth->dst.flags, + (IN_DEV_CONF_GET(in_dev, NOPOLICY) ? + DST_NOPOLICY : 0) | + (IN_DEV_CONF_GET(out_dev, NOXFRM) ? + DST_NOXFRM : 0)); + if (rth->rt_type != res->type) + net_warn_ratelimited("rth->rt_type (%x) != res->type (%x)\n", + rth->rt_type, res->type); + if (rth->rt_table_id != + (res->table ? res->table->tb_id : 0)) + net_warn_ratelimited("rth->rt_table_id (%x) != (%x)\n", + rth->rt_table_id, res->table ? res->table->tb_id : 0); + if (rth->rt_gateway != + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + nh->nh_gw : 0)) + net_warn_ratelimited("rth->rt_gateway (%x) != (%x)\n", + rth->rt_gateway, (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? nh->nh_gw : 0); + if (rth->rt_uses_gateway != + ((nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? + 1 : 0)) + net_warn_ratelimited("rth->rt_uses_gateway (%x) != (%x)\n", + rth->rt_uses_gateway, (nh->nh_gw && nh->nh_scope == RT_SCOPE_LINK) ? 1 : 0); + if (rth->dst.lwtstate != nh->nh_lwtstate) + net_warn_ratelimited("rth->dst.lwtstate (%x) != nh->nh_lwtstate (%x)\n", + rth->dst.lwtstate, nh->nh_lwtstate); } } Regards, Anders K. Pedersen ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2018-02-14 13:34 UTC | newest] Thread overview: 5+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2017-12-26 11:05 [bisected] Forwarded packets occasionally has loopback output interface in Netfilter Anders K. Pedersen | Cohaesio 2018-01-11 17:25 ` Anders K. Pedersen | Cohaesio 2018-01-11 18:18 ` Wei Wang 2018-01-22 14:54 ` Anders K. Pedersen | Cohaesio 2018-02-14 13:34 ` Anders K. Pedersen | Cohaesio
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).