From mboxrd@z Thu Jan 1 00:00:00 1970 From: "Michael S. Tsirkin" Subject: Re: [RFC net-next PATCH] macvtap: Add packet capture support Date: Wed, 20 Nov 2013 23:12:28 +0200 Message-ID: <20131120211228.GB11452@redhat.com> References: <1384970649-29839-1-git-send-email-vyasevic@redhat.com> <20131120181949.GB25569@redhat.com> <528D0F48.60304@redhat.com> <20131120200658.GA11070@redhat.com> <528D1935.9020403@redhat.com> <20131120203027.GA11386@redhat.com> <528D1D0A.3010505@redhat.com> Mime-Version: 1.0 Content-Type: text/plain; charset=us-ascii Cc: netdev@vger.kernel.org To: Vlad Yasevich Return-path: Received: from mx1.redhat.com ([209.132.183.28]:54963 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755275Ab3KTVJP (ORCPT ); Wed, 20 Nov 2013 16:09:15 -0500 Received: from int-mx11.intmail.prod.int.phx2.redhat.com (int-mx11.intmail.prod.int.phx2.redhat.com [10.5.11.24]) by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id rAKL9EWR027544 (version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK) for ; Wed, 20 Nov 2013 16:09:15 -0500 Content-Disposition: inline In-Reply-To: <528D1D0A.3010505@redhat.com> Sender: netdev-owner@vger.kernel.org List-ID: On Wed, Nov 20, 2013 at 03:35:22PM -0500, Vlad Yasevich wrote: > On 11/20/2013 03:30 PM, Michael S. Tsirkin wrote: > > On Wed, Nov 20, 2013 at 03:19:01PM -0500, Vlad Yasevich wrote: > >> On 11/20/2013 03:06 PM, Michael S. Tsirkin wrote: > >>> On Wed, Nov 20, 2013 at 02:36:40PM -0500, Vlad Yasevich wrote: > >>>> On 11/20/2013 01:19 PM, Michael S. Tsirkin wrote: > >>>>> On Wed, Nov 20, 2013 at 01:04:09PM -0500, Vlad Yasevich wrote: > >>>>>> Currently it is impossible to capture traffic when using a macvtap > >>>>>> device. The reason is that all capture handling is done either in > >>>>>> dev_hard_start_xmit() or in __netif_receive_skb_core(). Macvtap > >>>>>> currenlty doesn't use dev_hard_start_xmit(), and at the time the > >>>>>> packet traverses __netif_receive_skb_core, the skb->dev is set to > >>>>>> the lower-level device that doesn't end up matching macvtap. > >>>>>> > >>>>>> To solve the issue, use dev_hard_start_xmit() on the output path. > >>>>>> On the input path, it is toughter to solve since macvtap ends up > >>>>>> consuming the skb so there is nothing more left for > >>>>>> __netif_receive_skb_core() to do. > >>> > >>> Actually I thought I understand what you are saying here, but now I > >>> don't. bridge installs rx handler exactly in the same way. > >>> packet handlers seem to be called before the rx handlers so > >>> everything should just work. > >>> > >>> Is this about the bridge mode again? > >> > >> No. It has to do with bridge submitting the frame back to the > >> network stack to forward it to the ports, but macvtap ends up > >> stealing it. > > > > Confused. > > > > rx_handler = rcu_dereference(skb->dev->rx_handler); > > if (rx_handler) { > > if (pt_prev) { > > ret = deliver_skb(skb, pt_prev, orig_dev); > > pt_prev = NULL; > > } > > switch (rx_handler(&skb)) { > > > > .... > > > > so packet handlers (including packet socket) > > seem to be invoked before rx handlers (including bridge and macvtap). > > > > What's the issue then? > > > > I guess I'm missing something obvious. > > In case of macvtap0@eth0, before rx_handler is invoked skb->dev == eth0. > During the macvlan rx_handler, skb->dev = macvtap0, and the packet is > stolen and delivered to the socket, thus no capture. Aha. I get what the issue is now. So tcpdump -i eth0 works, you want to make tcpdump -i macvtap0@eth0 work too. > In the case of the bridge (eth0, eth1), during the rx_handler > skb->dev = bridge. Bridge calls netif_receive_skb() again, when in > promisc mode, and now packet is captured. > > -vlad Well but not in non promisc mode correct? I guess a closer model is tun, you can always do tcpdump on it and see whatever application sees. I guess one other option is a new RX_HANDLER_DEV_CHANGED - same as consumed but invokes taps after we changed the skb dev case RX_HANDLER_DEV_CHANGED: if (!list_empty(&ptype_all)) goto another_round; ret = NET_RX_SUCCESS; kfree_skb(skb); goto unlock; > > > > > > > >> Also, bridge, if running in promisc mode will hand the frame up > >> (as if it received it). See the IFF_PROMISC code in > >> br_handle_frame_finish(). > >> So, if someone is capturing on the bridge itself, this lets it > >> see the packets even though they are not destined for the bridge > >> device. > > > > Yes, that's quite expected: same thing will happen with > > real hardware: set it to promisc, see lots of packets > > destined at others. > > > >>> > >>>> A simple solution is to > >>>>>> pull the code that delivers things to the taps/captures into > >>>>>> its own function and allow macvtap to call it from its receive > >>>>>> routine. > >>>>>> > >>>>>> Signed-off-by: Vlad Yasevich > >>>>>> --- > >>>>>> This is only an RFC. I'd like to solicit comments on this simple > >>>>>> approach. > >>>>> > >>>>> I'm kind of worried about this. What worries me is that normally > >>>>> if you have a packet socket bound to all interfaces, what it shows is > >>>>> traffic to/from the box. > >>>>> > >>>>> This might be a bug for someone, but I suspect at this point this > >>>>> is part of the ABI. > >>>>> > >>>>> But macvtap bypasses most of the host networking stack, > >>>>> So I worry that suddenly showing these packets would be confusing. > >>>> > >>>> Is it really different from using bridge and tap? If someone > >>>> does 'tcpdump -i any', they will see packets sent by the guest with > >>>> bridge+tap. > >>>> It makes sense to do that same thing for macvtap, no? > >>> > >>> I was going by your comments not the code. > >>> Assuming we never showed macvtap traffic this might > >>> be part of ABI. > >> > >> BTW, if we end up doing it with a new rx_handler, it will end up > >> showing exactly the same traffic is this patch does because the > >> common path in __netif_receive_skb_core() will run and deliver > >> to all registered eligible entries in ptype_all. > >> > >> -vlad > > > > Dave here thinks it's not a problem, I trust him ... > > > >>> > >>>>> > >>>>> Assuming we want to show this traffic, I think it's preferable to > >>>>> - if (!ptype->dev || ptype->dev == skb->dev) { > >>>>> + if (ptype->dev == skb->dev) { > >>>>> so you have to bind to macvtap explicitly to see the traffic. > >>>>> > >>>>> > >>>>> Of course when you start binding to specific macvtaps > >>>>> it doesn't scale well because we'll have a long list > >>>>> on ptype_all suddenly. > >>>> > >>>> How likely is that really? ptype_all doesn't scale as it. > >>>> Any time you don't set the protocol field, the entry goes > >>>> into ptype_all. Macvtap doesn't change that. You can > >>>> create a long list in ptype all if you have lots of guests and > >>>> you want to snoop on each guest separately thus binding to tap > >>>> device. > >>>> > >>>> -vlad > >>>> > >>>>> This might mean we need to keep this list per-device ... > >>>>> > >>>>>> A more complicated solution would have been to give > >>>>>> macvtap its own rx_handler and return RX_HANDLER_ANOTHER during > >>>>>> receive operation to make the packet go through another round of > >>>>>> capturing and hit the macvtap rx_handler. I thought this would > >>>>>> hurt performance too much for no real gain. > >>>>>> > >>>>>> Thanks > >>>>>> -vlad > >>>>>> > >>>>>> drivers/net/macvtap.c | 4 +++- > >>>>>> include/linux/netdevice.h | 2 ++ > >>>>>> net/core/dev.c | 33 ++++++++++++++++++++++++++------- > >>>>>> 3 files changed, 31 insertions(+), 8 deletions(-) > >>>>>> > >>>>>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c > >>>>>> index dc76670..0ed8fae 100644 > >>>>>> --- a/drivers/net/macvtap.c > >>>>>> +++ b/drivers/net/macvtap.c > >>>>>> @@ -334,6 +334,7 @@ drop: > >>>>>> */ > >>>>>> static int macvtap_receive(struct sk_buff *skb) > >>>>>> { > >>>>>> + netif_receive_skb_taps(skb, true); > >>>>>> skb_push(skb, ETH_HLEN); > >>>>>> return macvtap_forward(skb->dev, skb); > >>>>>> } > >>>>>> @@ -727,8 +728,9 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m, > >>>>>> skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG; > >>>>>> } > >>>>>> if (vlan) { > >>>>>> + skb->dev = vlan->dev; > >>>>>> local_bh_disable(); > >>>>>> - macvlan_start_xmit(skb, vlan->dev); > >>>>>> + dev_hard_start_xmit(skb, vlan->dev, NULL, NULL); > >>>>>> local_bh_enable(); > >>>>>> } else { > >>>>>> kfree_skb(skb); > >>>>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h > >>>>>> index 8b3de7c..84880eb 100644 > >>>>>> --- a/include/linux/netdevice.h > >>>>>> +++ b/include/linux/netdevice.h > >>>>>> @@ -2383,6 +2383,8 @@ void dev_kfree_skb_any(struct sk_buff *skb); > >>>>>> int netif_rx(struct sk_buff *skb); > >>>>>> int netif_rx_ni(struct sk_buff *skb); > >>>>>> int netif_receive_skb(struct sk_buff *skb); > >>>>>> +struct packet_type *netif_receive_skb_taps(struct sk_buff *skb, > >>>>>> + bool last_deliver); > >>>>>> gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb); > >>>>>> void napi_gro_flush(struct napi_struct *napi, bool flush_old); > >>>>>> struct sk_buff *napi_get_frags(struct napi_struct *napi); > >>>>>> diff --git a/net/core/dev.c b/net/core/dev.c > >>>>>> index da9c5e1..50f0ac4 100644 > >>>>>> --- a/net/core/dev.c > >>>>>> +++ b/net/core/dev.c > >>>>>> @@ -3484,6 +3484,31 @@ static bool skb_pfmemalloc_protocol(struct sk_buff *skb) > >>>>>> } > >>>>>> } > >>>>>> > >>>>>> +struct packet_type *netif_receive_skb_taps(struct sk_buff *skb, > >>>>>> + bool last_deliver) > >>>>>> +{ > >>>>>> + struct packet_type *ptype, *pt_prev = NULL; > >>>>>> + struct net_device *orig_dev; > >>>>>> + > >>>>>> + if (list_empty(&ptype_all)) > >>>>>> + return NULL; > >>>>>> + > >>>>>> + orig_dev = skb->dev; > >>>>>> + list_for_each_entry_rcu(ptype, &ptype_all, list) { > >>>>>> + if (!ptype->dev || ptype->dev == skb->dev) { > >>>>>> + if (pt_prev) > >>>>>> + deliver_skb(skb, pt_prev, orig_dev); > >>>>>> + pt_prev = ptype; > >>>>>> + } > >>>>>> + } > >>>>>> + > >>>>>> + if (last_deliver && pt_prev) > >>>>>> + deliver_skb(skb, pt_prev, orig_dev); > >>>>>> + > >>>>>> + return pt_prev; > >>>>>> +} > >>>>>> +EXPORT_SYMBOL_GPL(netif_receive_skb_taps); > >>>>>> + > >>>>>> static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc) > >>>>>> { > >>>>>> struct packet_type *ptype, *pt_prev; > >>>>>> @@ -3535,13 +3560,7 @@ another_round: > >>>>>> if (pfmemalloc) > >>>>>> goto skip_taps; > >>>>>> > >>>>>> - list_for_each_entry_rcu(ptype, &ptype_all, list) { > >>>>>> - if (!ptype->dev || ptype->dev == skb->dev) { > >>>>>> - if (pt_prev) > >>>>>> - ret = deliver_skb(skb, pt_prev, orig_dev); > >>>>>> - pt_prev = ptype; > >>>>>> - } > >>>>>> - } > >>>>>> + pt_prev = netif_receive_skb_taps(skb, false); > >>>>>> > >>>>>> skip_taps: > >>>>>> #ifdef CONFIG_NET_CLS_ACT > >>>>>> -- > >>>>>> 1.8.4.2