Date: Wed, 30 Jan 2019 22:14:15 -0800 (PST)
Message-Id: <20190130.221415.110396883789030937.davem@davemloft.net>
To: daniel@iogearbox.net
Cc: netdev@vger.kernel.org, maheshb@google.com, dsa@cumulusnetworks.com, fw@strlen.de, m@lambda.lt
Subject: Re: [PATCH net] ipvlan, l3mdev: fix broken l3s mode wrt local routes
From: David Miller
In-Reply-To: <20190130114948.24227-1-daniel@iogearbox.net>
References: <20190130114948.24227-1-daniel@iogearbox.net>
X-Mailing-List: netdev@vger.kernel.org

From: Daniel Borkmann
Date: Wed, 30 Jan 2019 12:49:48 +0100

> While implementing ipvlan l3 and l3s mode for a Kubernetes CNI plugin,
> I ran into the issue that while l3 mode is working fine, l3s mode
> does not have any connectivity to kube-apiserver and hence all pods
> end up in Error state as well. The ipvlan master device sits on
> top of a bond device and hostns traffic to kube-apiserver (also running
> in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
> where the latter is the address of the bond0. While in l3 mode, a
> curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
> works fine from hostns, neither of them does in case of l3s. In the
> latter only a curl to https://127.0.0.1:37573 appeared to work where
> for local addresses of bond0 I saw the kernel suddenly starting to emit
> ARP requests to query the HW address of bond0 which remained unanswered
> and neighbor entries in INCOMPLETE state. These ARP requests only
> happen while in l3s.
>
> Debugging this further, I found the issue is that l3s mode is
> piggy-backing on l3 master device, and in this case local routes are
> using l3mdev_master_dev_rcu(dev) instead of net->loopback_dev as per
> commit f5a0aab84b74 ("net: ipv4: dst for local input routes should use
> l3mdev if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev
> to be a loopback"). I found that reverting them back to using the
> net->loopback_dev fixed ipvlan l3s connectivity and got everything
> working for the CNI.
>
> Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
> l3mdev paper in [0] the sole reason why ipvlan l3s is relying
> on l3 master device is to get the l3mdev_ip_rcv() receive hook for
> setting the dst entry of the input route without adding its own
> ipvlan specific hacks into the receive path; however, any l3 domain
> semantics beyond just that are breaking l3s operation. Note that
> ipvlan also has the ability to dynamically switch its internal
> operation from l3 to l3s for all ports via ipvlan_set_port_mode()
> at runtime. In any case, l3 vs l3s solely distinguishes itself by
> 'de-confusing' netfilter through switching skb->dev to ipvlan slave
> device late in NF_INET_LOCAL_IN before handing the skb to L4.
>
> Minimal fix taken here is to add an IFF_L3MDEV_RX_HANDLER flag which,
> if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
> without any additional l3mdev semantics on top. This should also have
> minimal impact since dev->priv_flags is already hot in cache. With
> this set, l3s mode is working fine and I also get things like
> masquerading pod traffic on the ipvlan master properly working.
>
> [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf
>
> Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
> Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
> Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
> Signed-off-by: Daniel Borkmann

Applied and queued up for -stable, thanks.
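
For readers following the quoted description, the flag-based split can be modelled in a few lines of user-space C. This is only a sketch under stated assumptions, not the kernel code from the patch: struct model_dev, the helper names, and the bit values are hypothetical, and only the name IFF_L3MDEV_RX_HANDLER and the idea of testing a priv_flags word come from the message above. It assumes that a device carrying only the RX-handler flag gets the receive hook while its local routes stay on the loopback device.

```c
#include <stdbool.h>

/* Illustrative bit values only; they are not copied from kernel headers. */
#define IFF_L3MDEV_MASTER     (1u << 0)  /* full l3 master semantics */
#define IFF_L3MDEV_RX_HANDLER (1u << 1)  /* receive hook only */

/* Hypothetical stand-in for a network device; not the kernel structure. */
struct model_dev {
    const char *name;
    unsigned int priv_flags;
};

static bool is_l3_master(const struct model_dev *dev)
{
    return (dev->priv_flags & IFF_L3MDEV_MASTER) != 0;
}

static bool has_l3_rx_handler(const struct model_dev *dev)
{
    return (dev->priv_flags & IFF_L3MDEV_RX_HANDLER) != 0;
}

/* The receive hook runs for full masters and for hook-only devices. */
static bool wants_l3_rcv_hook(const struct model_dev *dev)
{
    return is_l3_master(dev) || has_l3_rx_handler(dev);
}

/*
 * Only a full l3 master redirects local routes to itself; a hook-only
 * device keeps them on the loopback device (modelled here as "lo").
 */
static const char *local_route_dev(const struct model_dev *dev)
{
    return is_l3_master(dev) ? dev->name : "lo";
}
```

In this model a device with only IFF_L3MDEV_RX_HANDLER set returns true from wants_l3_rcv_hook() and "lo" from local_route_dev(), which mirrors the behaviour the patch description asks for in l3s mode.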