From mboxrd@z Thu Jan  1 00:00:00 1970
From: "Michael S. Tsirkin" <mst@redhat.com>
Subject: Re: [RFC net-next PATCH] macvtap: Add packet capture support
Date: Wed, 20 Nov 2013 22:30:27 +0200
Message-ID: <20131120203027.GA11386@redhat.com>
References: <1384970649-29839-1-git-send-email-vyasevic@redhat.com>
 <20131120181949.GB25569@redhat.com>
 <528D0F48.60304@redhat.com>
 <20131120200658.GA11070@redhat.com>
 <528D1935.9020403@redhat.com>
Mime-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Cc: netdev@vger.kernel.org
To: Vlad Yasevich <vyasevic@redhat.com>
Return-path: <netdev-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:65042 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1754809Ab3KTU1N (ORCPT <rfc822;netdev@vger.kernel.org>);
	Wed, 20 Nov 2013 15:27:13 -0500
Received: from int-mx09.intmail.prod.int.phx2.redhat.com (int-mx09.intmail.prod.int.phx2.redhat.com [10.5.11.22])
	by mx1.redhat.com (8.14.4/8.14.4) with ESMTP id rAKKRD3N031711
	(version=TLSv1/SSLv3 cipher=DHE-RSA-AES256-SHA bits=256 verify=OK)
	for <netdev@vger.kernel.org>; Wed, 20 Nov 2013 15:27:13 -0500
Content-Disposition: inline
In-Reply-To: <528D1935.9020403@redhat.com>
Sender: netdev-owner@vger.kernel.org
List-ID: <netdev.vger.kernel.org>

On Wed, Nov 20, 2013 at 03:19:01PM -0500, Vlad Yasevich wrote:
> On 11/20/2013 03:06 PM, Michael S. Tsirkin wrote:
> > On Wed, Nov 20, 2013 at 02:36:40PM -0500, Vlad Yasevich wrote:
> >> On 11/20/2013 01:19 PM, Michael S. Tsirkin wrote:
> >>> On Wed, Nov 20, 2013 at 01:04:09PM -0500, Vlad Yasevich wrote:
> >>>> Currently it is impossible to capture traffic when using a macvtap
> >>>> device.  The reason is that all capture handling is done either in
> >>>> dev_hard_start_xmit() or in __netif_receive_skb_core().  Macvtap
> >>>> currenlty doesn't use dev_hard_start_xmit(), and at the time the
> >>>> packet traverses __netif_receive_skb_core, the skb->dev is set to
> >>>> the lower-level device that doesn't end up matching macvtap.
> >>>>
> >>>> To solve the issue, use dev_hard_start_xmit() on the output path.
> >>>> On the input path, it is toughter to solve since macvtap ends up
> >>>> consuming the skb so there is nothing more left for
> >>>> __netif_receive_skb_core() to do.
> > 
> > Actually I thought I understand what you are saying here, but now I
> > don't. bridge installs rx handler exactly in the same way.
> > packet handlers seem to be called before the rx handlers so
> > everything should just work.
> > 
> > Is this about the bridge mode again?
> 
> No.  It has to do with bridge submitting the frame back to the
> network stack to forward it to the ports, but macvtap ends up
> stealing it.

Confused.

        rx_handler = rcu_dereference(skb->dev->rx_handler);
        if (rx_handler) {
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                        pt_prev = NULL;
                }
                switch (rx_handler(&skb)) {

		....

so packet handlers (including packet socket)
seem to be invoked before rx handlers (including bridge and macvtap).

What's the issue then?

I guess I'm missing something obvious.


> Also, bridge, if running in promisc mode will hand the frame up
> (as if it received it).  See the IFF_PROMISC code in
> br_handle_frame_finish().
> So, if someone is capturing on the bridge itself, this lets it
> see the packets even though they are not destined for the bridge
> device.

Yes, that's quite expected: same thing will happen with
real hardware: set it to promisc, see lots of packets
destined at others.

> > 
> >>  A simple solution is to
> >>>> pull the code that delivers things to the taps/captures into
> >>>> its own function and allow macvtap to call it from its receive
> >>>> routine.
> >>>>
> >>>> Signed-off-by: Vlad Yasevich <vyasevic@redhat.com>
> >>>> ---
> >>>> This is only an RFC.  I'd like to solicit comments on this simple
> >>>> approach.
> >>>
> >>> I'm kind of worried about this. What worries me is that normally
> >>> if you have a packet socket bound to all interfaces, what it shows is
> >>> traffic to/from the box.
> >>>
> >>> This might be a bug for someone, but I suspect at this point this
> >>> is part of the ABI.
> >>>
> >>> But macvtap bypasses most of the host networking stack,
> >>> So I worry that suddenly showing these packets would be confusing.
> >>
> >> Is it really different from using bridge and tap?  If someone
> >> does 'tcpdump -i any', they will see packets sent by the guest with
> >> bridge+tap.
> >>  It makes sense to do that same thing for macvtap, no?
> > 
> > I was going by your comments not the code.
> > Assuming we never showed macvtap traffic this might
> > be part of ABI.
> 
> BTW, if we end up doing it with a new rx_handler, it will end up
> showing exactly the same traffic is this patch does because the
> common path in __netif_receive_skb_core() will run and deliver
> to all registered eligible entries in ptype_all.
> 
> -vlad

Dave here thinks it's not a problem, I trust him ...

> > 
> >>>
> >>> Assuming we want to show this traffic, I think it's preferable to
> >>> -		if (!ptype->dev || ptype->dev == skb->dev) {
> >>> +		if (ptype->dev == skb->dev) {
> >>> so you have to bind to macvtap explicitly to see the traffic.
> >>>
> >>>
> >>> Of course when you start binding to specific macvtaps
> >>> it doesn't scale well because we'll have a long list
> >>> on ptype_all suddenly.
> >>
> >> How likely is that really?  ptype_all doesn't scale as it.
> >> Any time you don't set the protocol field, the entry goes
> >> into ptype_all.  Macvtap doesn't change that.  You can
> >> create a long list in ptype all if you have lots of guests and
> >> you want to snoop on each guest separately thus binding to tap
> >> device.
> >>
> >> -vlad
> >>
> >>> This might mean we need to keep this list per-device ...
> >>>
> >>>> A more complicated solution would have been to give
> >>>> macvtap its own rx_handler and return RX_HANDLER_ANOTHER during
> >>>> receive operation to make the packet go through another round of
> >>>> capturing and hit the macvtap rx_handler.  I thought this would
> >>>> hurt performance too much for no real gain.
> >>>>
> >>>> Thanks
> >>>> -vlad
> >>>>
> >>>>  drivers/net/macvtap.c     |  4 +++-
> >>>>  include/linux/netdevice.h |  2 ++
> >>>>  net/core/dev.c            | 33 ++++++++++++++++++++++++++-------
> >>>>  3 files changed, 31 insertions(+), 8 deletions(-)
> >>>>
> >>>> diff --git a/drivers/net/macvtap.c b/drivers/net/macvtap.c
> >>>> index dc76670..0ed8fae 100644
> >>>> --- a/drivers/net/macvtap.c
> >>>> +++ b/drivers/net/macvtap.c
> >>>> @@ -334,6 +334,7 @@ drop:
> >>>>   */
> >>>>  static int macvtap_receive(struct sk_buff *skb)
> >>>>  {
> >>>> +	netif_receive_skb_taps(skb, true);
> >>>>  	skb_push(skb, ETH_HLEN);
> >>>>  	return macvtap_forward(skb->dev, skb);
> >>>>  }
> >>>> @@ -727,8 +728,9 @@ static ssize_t macvtap_get_user(struct macvtap_queue *q, struct msghdr *m,
> >>>>  		skb_shinfo(skb)->tx_flags |= SKBTX_SHARED_FRAG;
> >>>>  	}
> >>>>  	if (vlan) {
> >>>> +		skb->dev = vlan->dev;
> >>>>  		local_bh_disable();
> >>>> -		macvlan_start_xmit(skb, vlan->dev);
> >>>> +		dev_hard_start_xmit(skb, vlan->dev, NULL, NULL);
> >>>>  		local_bh_enable();
> >>>>  	} else {
> >>>>  		kfree_skb(skb);
> >>>> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> >>>> index 8b3de7c..84880eb 100644
> >>>> --- a/include/linux/netdevice.h
> >>>> +++ b/include/linux/netdevice.h
> >>>> @@ -2383,6 +2383,8 @@ void dev_kfree_skb_any(struct sk_buff *skb);
> >>>>  int netif_rx(struct sk_buff *skb);
> >>>>  int netif_rx_ni(struct sk_buff *skb);
> >>>>  int netif_receive_skb(struct sk_buff *skb);
> >>>> +struct packet_type *netif_receive_skb_taps(struct sk_buff *skb,
> >>>> +					   bool last_deliver);
> >>>>  gro_result_t napi_gro_receive(struct napi_struct *napi, struct sk_buff *skb);
> >>>>  void napi_gro_flush(struct napi_struct *napi, bool flush_old);
> >>>>  struct sk_buff *napi_get_frags(struct napi_struct *napi);
> >>>> diff --git a/net/core/dev.c b/net/core/dev.c
> >>>> index da9c5e1..50f0ac4 100644
> >>>> --- a/net/core/dev.c
> >>>> +++ b/net/core/dev.c
> >>>> @@ -3484,6 +3484,31 @@ static bool skb_pfmemalloc_protocol(struct sk_buff *skb)
> >>>>  	}
> >>>>  }
> >>>>  
> >>>> +struct packet_type *netif_receive_skb_taps(struct sk_buff *skb,
> >>>> +					   bool last_deliver)
> >>>> +{
> >>>> +	struct packet_type *ptype, *pt_prev = NULL;
> >>>> +	struct net_device *orig_dev;
> >>>> +
> >>>> +	if (list_empty(&ptype_all))
> >>>> +		return NULL;
> >>>> +
> >>>> +	orig_dev = skb->dev;
> >>>> +	list_for_each_entry_rcu(ptype, &ptype_all, list) {
> >>>> +		if (!ptype->dev || ptype->dev == skb->dev) {
> >>>> +			if (pt_prev)
> >>>> +				deliver_skb(skb, pt_prev, orig_dev);
> >>>> +			pt_prev = ptype;
> >>>> +		}
> >>>> +	}
> >>>> +
> >>>> +	if (last_deliver && pt_prev)
> >>>> +		deliver_skb(skb, pt_prev, orig_dev);
> >>>> +
> >>>> +	return pt_prev;
> >>>> +}
> >>>> +EXPORT_SYMBOL_GPL(netif_receive_skb_taps);
> >>>> +
> >>>>  static int __netif_receive_skb_core(struct sk_buff *skb, bool pfmemalloc)
> >>>>  {
> >>>>  	struct packet_type *ptype, *pt_prev;
> >>>> @@ -3535,13 +3560,7 @@ another_round:
> >>>>  	if (pfmemalloc)
> >>>>  		goto skip_taps;
> >>>>  
> >>>> -	list_for_each_entry_rcu(ptype, &ptype_all, list) {
> >>>> -		if (!ptype->dev || ptype->dev == skb->dev) {
> >>>> -			if (pt_prev)
> >>>> -				ret = deliver_skb(skb, pt_prev, orig_dev);
> >>>> -			pt_prev = ptype;
> >>>> -		}
> >>>> -	}
> >>>> +	pt_prev = netif_receive_skb_taps(skb, false);
> >>>>  
> >>>>  skip_taps:
> >>>>  #ifdef CONFIG_NET_CLS_ACT
> >>>> -- 
> >>>> 1.8.4.2