Netdev List


* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: Joe Perches @ 2017-10-04 18:07 UTC (permalink / raw)
  To: Matthias Kaehlcke, Jakub Kicinski, David S . Miller, Simon Horman,
	Dirk van der Merwe
  Cc: oss-drivers, netdev, linux-kernel, Renato Golin, Manoj Gupta,
	Guenter Roeck, Doug Anderson
In-Reply-To: <20171003200546.165731-1-mka@chromium.org>

On Tue, 2017-10-03 at 13:05 -0700, Matthias Kaehlcke wrote:
> nfp_eth_set_bit_config() is marked as __always_inline to allow gcc to
> identify the 'mask' parameter as known to be constant at compile time,
> which is required to use the FIELD_GET() macro.
> 
> The forced inlining does the trick for gcc, but for kernel builds with
> clang it results in undefined symbols:

Can't you use local different FIELD_PREP/FIELD_GET macros
with a different name without the BUILD_BUG tests?

i.e.:

#define NFP_FIELD_PREP(_mask, _val)				\
({								\
	((typeof(_mask))(_val) << __bf_shf(_mask)) & (_mask);	\
})

#define NFP_FIELD_GET(_mask, _reg)				\
({								\
	(typeof(_mask))(((_reg) & (_mask)) >> __bf_shf(_mask));	\
})

Then the __always_inline can be removed from
nfp_eth_set_bit_config too.
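
A minimal userspace sketch of that idea, for illustration only. The
shift helper is modelled on the kernel's __bf_shf() and uses the
gcc/clang builtin __builtin_ffsll(); DEMO_BF_SHF, demo_set_field(),
demo_get_field() and demo_field_check() are hypothetical names that do
not exist in the nfp driver. Since the macros carry no BUILD_BUG tests,
the mask can be an ordinary function argument:

```c
#include <assert.h>
#include <stdint.h>

/* Shift of the lowest set bit of a mask (modelled on __bf_shf()). */
#define DEMO_BF_SHF(x) (__builtin_ffsll(x) - 1)

/* Same shape as the suggested macros, without compile-time checks. */
#define NFP_FIELD_PREP(_mask, _val) \
	(((__typeof__(_mask))(_val) << DEMO_BF_SHF(_mask)) & (_mask))

#define NFP_FIELD_GET(_mask, _reg) \
	((__typeof__(_mask))(((_reg) & (_mask)) >> DEMO_BF_SHF(_mask)))

/* Hypothetical helper: 'mask' is a runtime value, not a constant. */
static uint64_t demo_set_field(uint64_t reg, uint64_t mask, uint64_t val)
{
	reg &= ~mask;				/* clear the field */
	reg |= NFP_FIELD_PREP(mask, val);	/* insert the new value */
	return reg;
}

static uint64_t demo_get_field(uint64_t reg, uint64_t mask)
{
	return NFP_FIELD_GET(mask, reg);
}

/* Round-trip a value through bits 8..11 of an all-ones register. */
static void demo_field_check(void)
{
	uint64_t mask = 0xf00;
	uint64_t reg = demo_set_field(~0ULL, mask, 0x5);

	assert(reg == 0xfffffffffffff5ffULL);
	assert(demo_get_field(reg, mask) == 0x5);
}
```

The trade-off is that a mask that is zero or non-contiguous is no
longer rejected at build time.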

^ permalink raw reply

* Re: [PATCH net-next 2/2] flow_dissector: dissect tunnel info
From: Jiri Pirko @ 2017-10-04 18:07 UTC (permalink / raw)
  To: Tom Herbert
  Cc: Simon Horman, David Miller, Jiri Pirko, Jamal Hadi Salim,
	Cong Wang, Linux Kernel Network Developers, oss-drivers
In-Reply-To: <CALx6S36c=BUXb7-53Ym_SZYuWqmPfEqdYUwwxyM78SfBnEPL4Q@mail.gmail.com>

Wed, Oct 04, 2017 at 05:52:54PM CEST, tom@herbertland.com wrote:
>On Wed, Oct 4, 2017 at 1:15 AM, Jiri Pirko <jiri@resnulli.us> wrote:
>> Wed, Oct 04, 2017 at 10:08:57AM CEST, simon.horman@netronome.com wrote:
>>>On Tue, Oct 03, 2017 at 11:17:46AM -0700, Tom Herbert wrote:
>>>> On Tue, Oct 3, 2017 at 2:40 AM, Simon Horman <simon.horman@netronome.com> wrote:
>>>> > On Mon, Oct 02, 2017 at 01:37:55PM -0700, Tom Herbert wrote:
>>>> >> On Mon, Oct 2, 2017 at 1:41 AM, Simon Horman <simon.horman@netronome.com> wrote:
>>>> >> > Move dissection of tunnel info from the flower classifier to the flow
>>>> >> > dissector where all other dissection occurs.  This should not have any
>>>> >> > behavioural effect on other users of the flow dissector.
>>>> >
>>>> > ...
>>>>
>>>> > I feel that we are circling back the perennial issue of flower using the
>>>> > flow dissector in a somewhat broader/different way than many/all other
>>>> > users of the flow dissector.
>>>> >
>>>> Simon,
>>>>
>>>> It's more like __skb_flow_dissect is already an incredibly complex
>>>> function and because of that it's difficult to maintain. We need to
>>>> measure changes against that fact. For this patch, there is precisely
>>>> one user (cls_flower.c) and it's not at all clear to me if there will
>>>> be ever any more (e.g. for hashing we don't need tunnel info). IMO, it
>>>> should be just as easy and less convolution for everyone to have
>>>> flower call __skb_flow_dissect_tunnel_info directly and not call it
>>>> from __skb_flow_dissect.
>>>
>>>Hi Tom,
>>>
>>>my original suggestion was just that, but Jiri indicated a strong preference
>>>for the approach taken by this patch. I think we need to widen the
>>>participants in this discussion.
>>
>> I like the __skb_flow_dissect to be the function to call and it will do
>> the job according to the configuration. I don't like to split in
>> multiple calls.
>
>Those are not technical arguments. As I already mentioned, I don't
>like it when we add stuff for the benefit of a 1% use case that
>negatively impacts the rest of the 99% cases which is what I believe
>is happening here.

Yeah. I just wanted the flow dissector to stay compact. But if needed,
could be split. I just fear that it will become a mess that's all.


>
>> Does not make sense in the most of the cases as the
>> dissection state would have to be carried in between calls.
>
>Please elaborate. This code is being moved into __skb_flow_dissect, so
>the functionality was already there. I don't see any description in
>this discussion that things were broken and that this patch is a
>necessary fix.

Yeah, you are right.


>
>Thanks,
>Tom

^ permalink raw reply

* Re: [PATCH net-next 5/7] net: bonding: Add extack messages for some enslave failures
From: David Ahern @ 2017-10-04 18:06 UTC (permalink / raw)
  To: Jiri Pirko; +Cc: netdev, j.vosburgh, vfalico, andy, jiri, idosch, davem, bridge
In-Reply-To: <20171004180428.GG1895@nanopsycho>

On 10/4/17 11:04 AM, Jiri Pirko wrote:
> Wed, Oct 04, 2017 at 05:35:46PM CEST, dsahern@gmail.com wrote:
>> On 10/3/17 11:38 PM, Jiri Pirko wrote:
>>> Wed, Oct 04, 2017 at 06:58:52AM CEST, dsahern@gmail.com wrote:
>>>> A number of bond_enslave errors are logged using the netdev_err API.
>>>> Return those messages to userspace via the extack facility.
>>>>
>>>> Signed-off-by: David Ahern <dsahern@gmail.com>
>>>> ---
>>>> drivers/net/bonding/bond_main.c | 12 ++++++++++++
>>>> 1 file changed, 12 insertions(+)
>>>>
>>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>>> index bc92307c2082..6688dc9154e0 100644
>>>> --- a/drivers/net/bonding/bond_main.c
>>>> +++ b/drivers/net/bonding/bond_main.c
>>>> @@ -1348,12 +1348,15 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
>>>>
>>>> 	/* already in-use? */
>>>> 	if (netdev_is_rx_handler_busy(slave_dev)) {
>>>> +		NL_SET_ERR_MSG(extack,
>>>> +			       "Device is in use and cannot be enslaved");
>>>
>>> Please don't do this kind of wrapping. Just let the string be on the
>>> same line.
>>>
>>
>> Ok, I will do that for bonding only since it is the existing style.
> 
> I don't believe you need to do this wrap for any code. Just don't wrap.
> General code style says no wrap for strings :)
> 

I do not break / wrap strings; they need to be searchable. I assumed you
meant this is preferred for bonding:

NL_SET_ERR_MSG(extack, "Device is in use and cannot be enslaved");


over what I have done:

NL_SET_ERR_MSG(extack,
	       "Device is in use and cannot be enslaved");

^ permalink raw reply

* Re: [PATCH net-next 5/7] net: bonding: Add extack messages for some enslave failures
From: Jiri Pirko @ 2017-10-04 18:04 UTC (permalink / raw)
  To: David Ahern
  Cc: netdev, j.vosburgh, vfalico, andy, jiri, idosch, davem, bridge
In-Reply-To: <354fae78-3b04-3807-7392-87c6a3f1b3db@gmail.com>

Wed, Oct 04, 2017 at 05:35:46PM CEST, dsahern@gmail.com wrote:
>On 10/3/17 11:38 PM, Jiri Pirko wrote:
>> Wed, Oct 04, 2017 at 06:58:52AM CEST, dsahern@gmail.com wrote:
>>> A number of bond_enslave errors are logged using the netdev_err API.
>>> Return those messages to userspace via the extack facility.
>>>
>>> Signed-off-by: David Ahern <dsahern@gmail.com>
>>> ---
>>> drivers/net/bonding/bond_main.c | 12 ++++++++++++
>>> 1 file changed, 12 insertions(+)
>>>
>>> diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
>>> index bc92307c2082..6688dc9154e0 100644
>>> --- a/drivers/net/bonding/bond_main.c
>>> +++ b/drivers/net/bonding/bond_main.c
>>> @@ -1348,12 +1348,15 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev,
>>>
>>> 	/* already in-use? */
>>> 	if (netdev_is_rx_handler_busy(slave_dev)) {
>>> +		NL_SET_ERR_MSG(extack,
>>> +			       "Device is in use and cannot be enslaved");
>> 
>> Please don't do this kind of wrapping. Just let the string be on the
>> same line.
>> 
>
>Ok, I will do that for bonding only since it is the existing style.

I don't believe you need to do this wrap for any code. Just don't wrap.
General code style says no wrap for strings :)

^ permalink raw reply

* RE: [PATCH] netfilter: fix stringop-overflow warning with UBSAN
From: Jozsef Kadlecsik @ 2017-10-04 18:01 UTC (permalink / raw)
  To: David Laight
  Cc: 'Arnd Bergmann', Pablo Neira Ayuso, Florian Westphal,
	David S. Miller, Johannes Berg, Alexey Dobriyan, Aaron Conole,
	netfilter-devel@vger.kernel.org, coreteam@netfilter.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DD0047F28@AcuExch.aculab.com>

Hi,

[Sorry, while on holiday I only followed the mailing lists cursorily.]

On Tue, 1 Aug 2017, David Laight wrote:

> From: Arnd Bergmann
> > Sent: 31 July 2017 11:09
> > Using gcc-7 with UBSAN enabled, we get this false-positive warning:
> > 
> > net/netfilter/ipset/ip_set_core.c: In function 'ip_set_sockfn_get':
> > net/netfilter/ipset/ip_set_core.c:1998:3: error: 'strncpy' writing 32 bytes into a region of size 2
> > overflows the destination [-Werror=stringop-overflow=]
> >    strncpy(req_get->set.name, set ? set->name : "",
> >    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >     sizeof(req_get->set.name));
> >     ~~~~~~~~~~~~~~~~~~~~~~~~~~
> > 
> > This seems completely bogus, and I could not find a nice workaround.
> > To work around it in a less elegant way, I change the ?: operator
> > into an if()/else() construct.
> > 
> > Signed-off-by: Arnd Bergmann <arnd@arndb.de>
> > ---
> >  net/netfilter/ipset/ip_set_core.c | 8 ++++++--
> >  1 file changed, 6 insertions(+), 2 deletions(-)
> > 
> > diff --git a/net/netfilter/ipset/ip_set_core.c b/net/netfilter/ipset/ip_set_core.c
> > index e495b5e484b1..d7ebb021003b 100644
> > --- a/net/netfilter/ipset/ip_set_core.c
> > +++ b/net/netfilter/ipset/ip_set_core.c
> > @@ -1995,8 +1995,12 @@ ip_set_sockfn_get(struct sock *sk, int optval, void __user *user, int *len)
> >  		}
> >  		nfnl_lock(NFNL_SUBSYS_IPSET);
> >  		set = ip_set(inst, req_get->set.index);
> > -		strncpy(req_get->set.name, set ? set->name : "",
> > -			IPSET_MAXNAMELEN);
> > +		if (set)
> > +			strncpy(req_get->set.name, set->name,
> > +				sizeof(req_get->set.name));
> > +		else
> > +			memset(req_get->set.name, '\0',
> > +			       sizeof(req_get->set.name));
> 
> If you use strncpy() here, the compiler might optimise the code
> back to 'how it was before'.
> 
> Or, maybe an explicit temporary: 'const char *name = set ? set->name : "";'

I think it is best to go with the explicit temporary variable. The if-else 
construct is too much for such a case.
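
A userspace sketch of the explicit-temporary variant, under stated
assumptions: struct demo_set, DEMO_MAXNAMELEN and demo_copy_name() are
hypothetical stand-ins for struct ip_set, IPSET_MAXNAMELEN and the code
in ip_set_sockfn_get(). Whether this form silences the gcc-7 warning
has not been verified here.

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAXNAMELEN 32

struct demo_set {
	char name[DEMO_MAXNAMELEN];
};

/* Copy the set name, or an empty name when no set was found. */
static void demo_copy_name(char dst[DEMO_MAXNAMELEN],
			   const struct demo_set *set)
{
	const char *name = set ? set->name : "";

	/* strncpy() zero-pads the remainder of dst, so the empty-name
	 * case ends up identical to the memset() branch.
	 */
	strncpy(dst, name, DEMO_MAXNAMELEN);
}

/* Exercise both the "set found" and "no set" paths. */
static void demo_copy_name_check(void)
{
	struct demo_set s = { .name = "myset" };
	char out[DEMO_MAXNAMELEN];

	memset(out, 'x', sizeof(out));
	demo_copy_name(out, &s);
	assert(strcmp(out, "myset") == 0);

	memset(out, 'x', sizeof(out));
	demo_copy_name(out, NULL);
	assert(out[0] == '\0' && out[DEMO_MAXNAMELEN - 1] == '\0');
}
```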

Best regards,
Jozsef
-
E-mail  : kadlec@blackhole.kfki.hu, kadlecsik.jozsef@wigner.mta.hu
PGP key : http://www.kfki.hu/~kadlec/pgp_public_key.txt
Address : Wigner Research Centre for Physics, Hungarian Academy of Sciences
          H-1525 Budapest 114, POB. 49, Hungary

^ permalink raw reply

* Re: [PATCH 1/3] wireless: iwlwifi: use bool instead of int
From: Joe Perches @ 2017-10-04 17:55 UTC (permalink / raw)
  To: Luciano Coelho, Christoph Böhmwalder, johannes.berg,
	emmanuel.grumbach, kvalo
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <1507135159.908.96.camel@intel.com>

On Wed, 2017-10-04 at 19:39 +0300, Luciano Coelho wrote:
> On Wed, 2017-10-04 at 09:26 -0700, Joe Perches wrote:
[]
> > This might be more intelligible as separate tests
> > 
> > static bool is_valid_channel(u16 ch_id)
> > {
> > 	if (ch_id <= 14)
> > 		return true;
> > 
> > 	if ((ch_id % 4 == 0) &&
> > 	    ((ch_id >= 36 && ch_id <= 64) ||
> > 	     (ch_id >= 100 && ch_id <= 140)))
> > 		return true;
> > 
> > 	if ((ch_id % 4 == 1) &&
> > 	    (ch_id >= 145 && ch_id <= 165))
> > 		return true;
> > 
> > 	return false;
> > }
> > 
> > The compiler should produce the same object code.
> 
> Yeah, it may be a bit easier to read, but I don't want to start getting
> "fixes" to working and reasonable code.  There's nothing wrong with the
> existing function (except maybe for the int vs. boolean) so let's not
> change it.
> 
> A good time to change this would be the next time someone adds yet
> another range of valid channels here. ;)

<shrug>  Your choice.

I like code I can read and understand at a glance.

In case somebody needs to add channels, likely nobody
would make the suggested change but would just add
another test to the already odd-looking block.

And constants should be on the right side of the tests.
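
For reference, a standalone and compilable version of the function
suggested earlier in this thread. The u16 typedef and
demo_channel_check() are additions for building outside the kernel,
and the sample channel numbers are only illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint16_t u16;	/* kernel type, re-created for userspace */

static bool is_valid_channel(u16 ch_id)
{
	/* channel numbers up to 14 (2.4 GHz band) */
	if (ch_id <= 14)
		return true;

	/* multiples of 4 in 36..64 and 100..140 */
	if ((ch_id % 4 == 0) &&
	    ((ch_id >= 36 && ch_id <= 64) ||
	     (ch_id >= 100 && ch_id <= 140)))
		return true;

	/* 149, 153, ... 165: remainder 1 when divided by 4 */
	if ((ch_id % 4 == 1) &&
	    (ch_id >= 145 && ch_id <= 165))
		return true;

	return false;
}

/* Spot-check one accepted and one rejected value per range. */
static void demo_channel_check(void)
{
	assert(is_valid_channel(1) && !is_valid_channel(15));
	assert(is_valid_channel(36) && !is_valid_channel(38));
	assert(is_valid_channel(140) && !is_valid_channel(144));
	assert(is_valid_channel(149) && !is_valid_channel(150));
}
```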

^ permalink raw reply

* Re: [next-queue PATCH v3 2/4] net/sched: Fix accessing invalid dev_queue
From: Jesus Sanchez-Palencia @ 2017-10-04 17:42 UTC (permalink / raw)
  To: Eric Dumazet, Vinicius Costa Gomes
  Cc: netdev, intel-wired-lan, jhs, xiyou.wangcong, jiri, andre.guedes,
	ivan.briano, boon.leong.ong, richardcochran, henrik, levipearson,
	rodney.cummings
In-Reply-To: <1507088831.8061.41.camel@edumazet-glaptop3.roam.corp.google.com>

Hi,


On 10/03/2017 08:47 PM, Eric Dumazet wrote:
> On Tue, 2017-10-03 at 16:44 -0700, Vinicius Costa Gomes wrote:
>> From: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>>
>> In qdisc_alloc() the dev_queue pointer was used without any checks being
>> performed. If qdisc_create() gets a null dev_queue pointer, it just
>> passes it along to qdisc_alloc(), leading to a crash. That happens if a
>> root qdisc implements select_queue() and returns a null dev_queue
>> pointer for an "invalid handle", for example.
>>
>> One way to reproduce that is:
>>
>> 1) Setup mqprio
>> $ tc qdisc replace dev enp3s0 parent root mqprio num_tc 3 \
>>      	   map 2 2 1 0 2 2 2 2 2 2 2 2 2 2 2 2 queues 1@0 1@1 2@2 hw 0
>>
>> 2) Replace the first inner qdisc
>> $ tc qdisc replace dev enp3s0 parent 8001:1 pfifo_fast
>>
>> This will lead to the following crash:
> 
> When was this bug added ?
> 
> If this is a consequence of your prior patch (1/4), then this must come
> before it.
> 
> No need to add a stack trace for a not existing bug.
> Instead, explain in the changelog that it is a prep work.
> 
> We try to not break the tree on purpose, so that future bisection will
> not hit a point where the kernel crashes.

Sure, that makes absolute sense. It'll be fixed in our v5 as you've suggested.


Thanks,
Jesus

^ permalink raw reply

* [PATCH net-next] net: cache skb_shinfo() in skb_try_coalesce()
From: Eric Dumazet @ 2017-10-04 17:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

From: Eric Dumazet <edumazet@google.com>

The compiler does not really know that skb_shinfo(to|from) are constants
in skb_try_coalesce(), so let's cache their values to shrink code.

We might even take care of skb_zcopy() calls later.

$ size net/core/skbuff.o.before net/core/skbuff.o
   text	   data	    bss	    dec	    hex	filename
  40727	   1298	      0	  42025	   a429	net/core/skbuff.o.before
  40631	   1298	      0	  41929	   a3c9	net/core/skbuff.o
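
A userspace sketch of the caching idea, not the kernel code: struct
demo_buf, struct demo_shinfo, demo_shinfo() and demo_merge_frags() are
hypothetical stand-ins for sk_buff, skb_shared_info, skb_shinfo() and
the frag-merging part of skb_try_coalesce(). The accessor recomputes a
pointer from two loads, and after any store through a pointer the
compiler generally has to assume those fields may have changed, so
looking it up once and reusing the result can save reloads:

```c
#include <assert.h>
#include <string.h>

#define DEMO_MAX_FRAGS 4

struct demo_shinfo {
	unsigned int nr_frags;
	int frags[DEMO_MAX_FRAGS];
};

struct demo_buf {
	unsigned char *head;
	unsigned int end;	/* offset of the shared info from head */
};

/* Like skb_shinfo(): derived from head + end on every call. */
static struct demo_shinfo *demo_shinfo(const struct demo_buf *b)
{
	return (struct demo_shinfo *)(b->head + b->end);
}

/* Move the frags of 'from' to the tail of 'to'; returns 1 on success. */
static int demo_merge_frags(struct demo_buf *to, struct demo_buf *from)
{
	/* Look each shared info up once and reuse the cached pointers. */
	struct demo_shinfo *to_sh = demo_shinfo(to);
	struct demo_shinfo *from_sh = demo_shinfo(from);

	if (to_sh->nr_frags + from_sh->nr_frags > DEMO_MAX_FRAGS)
		return 0;

	memcpy(to_sh->frags + to_sh->nr_frags, from_sh->frags,
	       from_sh->nr_frags * sizeof(from_sh->frags[0]));
	to_sh->nr_frags += from_sh->nr_frags;
	from_sh->nr_frags = 0;
	return 1;
}

/* Merge a 2-frag buffer into a 1-frag buffer and verify the result. */
static void demo_merge_check(void)
{
	struct demo_shinfo a = { 1, { 10 } }, b = { 2, { 20, 30 } };
	struct demo_buf to = { (unsigned char *)&a, 0 };
	struct demo_buf from = { (unsigned char *)&b, 0 };

	assert(demo_merge_frags(&to, &from) == 1);
	assert(a.nr_frags == 3 && a.frags[1] == 20 && a.frags[2] == 30);
	assert(b.nr_frags == 0);
}
```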


Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c |   29 ++++++++++++++++-------------
 1 file changed, 16 insertions(+), 13 deletions(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d98c2e3ce2bfd9549647d6914b69cecd840de480..822a90e56aea2078a11d5361a2b2595812291274 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -4767,6 +4767,7 @@ EXPORT_SYMBOL(kfree_skb_partial);
 bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		      bool *fragstolen, int *delta_truesize)
 {
+	struct skb_shared_info *to_shinfo, *from_shinfo;
 	int i, delta, len = from->len;
 
 	*fragstolen = false;
@@ -4781,7 +4782,9 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		return true;
 	}
 
-	if (skb_has_frag_list(to) || skb_has_frag_list(from))
+	to_shinfo = skb_shinfo(to);
+	from_shinfo = skb_shinfo(from);
+	if (to_shinfo->frag_list || from_shinfo->frag_list)
 		return false;
 	if (skb_zcopy(to) || skb_zcopy(from))
 		return false;
@@ -4790,8 +4793,8 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		struct page *page;
 		unsigned int offset;
 
-		if (skb_shinfo(to)->nr_frags +
-		    skb_shinfo(from)->nr_frags >= MAX_SKB_FRAGS)
+		if (to_shinfo->nr_frags +
+		    from_shinfo->nr_frags >= MAX_SKB_FRAGS)
 			return false;
 
 		if (skb_head_is_locked(from))
@@ -4802,12 +4805,12 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 		page = virt_to_head_page(from->head);
 		offset = from->data - (unsigned char *)page_address(page);
 
-		skb_fill_page_desc(to, skb_shinfo(to)->nr_frags,
+		skb_fill_page_desc(to, to_shinfo->nr_frags,
 				   page, offset, skb_headlen(from));
 		*fragstolen = true;
 	} else {
-		if (skb_shinfo(to)->nr_frags +
-		    skb_shinfo(from)->nr_frags > MAX_SKB_FRAGS)
+		if (to_shinfo->nr_frags +
+		    from_shinfo->nr_frags > MAX_SKB_FRAGS)
 			return false;
 
 		delta = from->truesize - SKB_TRUESIZE(skb_end_offset(from));
@@ -4815,19 +4818,19 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
 
 	WARN_ON_ONCE(delta < len);
 
-	memcpy(skb_shinfo(to)->frags + skb_shinfo(to)->nr_frags,
-	       skb_shinfo(from)->frags,
-	       skb_shinfo(from)->nr_frags * sizeof(skb_frag_t));
-	skb_shinfo(to)->nr_frags += skb_shinfo(from)->nr_frags;
+	memcpy(to_shinfo->frags + to_shinfo->nr_frags,
+	       from_shinfo->frags,
+	       from_shinfo->nr_frags * sizeof(skb_frag_t));
+	to_shinfo->nr_frags += from_shinfo->nr_frags;
 
 	if (!skb_cloned(from))
-		skb_shinfo(from)->nr_frags = 0;
+		from_shinfo->nr_frags = 0;
 
 	/* if the skb is not cloned this does nothing
 	 * since we set nr_frags to 0.
 	 */
-	for (i = 0; i < skb_shinfo(from)->nr_frags; i++)
-		skb_frag_ref(from, i);
+	for (i = 0; i < from_shinfo->nr_frags; i++)
+		__skb_frag_ref(&from_shinfo->frags[i]);
 
 	to->truesize += delta;
 	to->len += len;

^ permalink raw reply related

* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: David Miller @ 2017-10-04 17:44 UTC (permalink / raw)
  To: mka
  Cc: jakub.kicinski, simon.horman, dirk.vandermerwe, oss-drivers,
	netdev, linux-kernel, renato.golin, manojgupta, groeck, dianders
In-Reply-To: <20171004174215.GM173745@google.com>

From: Matthias Kaehlcke <mka@chromium.org>
Date: Wed, 4 Oct 2017 10:42:15 -0700

> Given that this doesn't seem to be a widespread issue in the kernel
> personally I would consider the conversion to a macro in this case an
> acceptable solution, though it is definitely ugly. However I'm not the
> owner of the driver or the subsystem, so my opinion doesn't really
> carry much weight here ;-)

Losing type checking is a serious regression as far as I am concerned.

^ permalink raw reply

* Re: [PATCH] nfp: convert nfp_eth_set_bit_config() into a macro
From: Matthias Kaehlcke @ 2017-10-04 17:42 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: David S . Miller, Simon Horman, Dirk van der Merwe, oss-drivers,
	netdev, linux-kernel, Renato Golin, Manoj Gupta, Guenter Roeck,
	Doug Anderson
In-Reply-To: <20171003145000.53683e21@cakuba>

El Tue, Oct 03, 2017 at 02:50:00PM -0700 Jakub Kicinski ha dit:

> On Tue,  3 Oct 2017 13:05:46 -0700, Matthias Kaehlcke wrote:
> > nfp_eth_set_bit_config() is marked as __always_inline to allow gcc to
> > identify the 'mask' parameter as known to be constant at compile time,
> > which is required to use the FIELD_GET() macro.
> > 
> > The forced inlining does the trick for gcc, but for kernel builds with
> > clang it results in undefined symbols:
> > 
> > drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp_eth.o: In function
> >   `__nfp_eth_set_aneg':
> > drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp_eth.c:(.text+0x787):
> >   undefined reference to `__compiletime_assert_492'
> > drivers/net/ethernet/netronome/nfp/nfpcore/nfp_nsp_eth.c:(.text+0x7b1):
> >   undefined reference to `__compiletime_assert_496'
> > 
> > These __compiletime_assert_xyx() calls would have been optimized away if
> > the compiler had seen 'mask' as a constant.
> > 
> > Convert nfp_eth_set_bit_config() into a macro, which allows both gcc and
> > clang to identify 'mask' as a compile time constant.
> > 
> > Signed-off-by: Matthias Kaehlcke <mka@chromium.org>
> 
> :(
> 
> Is there no chance of fixing the constant propagation in the compiler?

LLVM developers are reluctant and would like us kernel folks to
evaluate possible alternatives for the affected code first:
https://bugs.llvm.org/show_bug.cgi?id=4898

Given that this doesn't seem to be a widespread issue in the kernel
personally I would consider the conversion to a macro in this case an
acceptable solution, though it is definitely ugly. However I'm not the
owner of the driver or the subsystem, so my opinion doesn't really
carry much weight here ;-)

Any ideas about other, less ugly alternatives?

Matthias

^ permalink raw reply

* [PATCH v6 1/1] ip_tunnel: add mpls over gre support
From: Amine Kherbouche @ 2017-10-04 17:35 UTC (permalink / raw)
  To: tom, roopa; +Cc: netdev, amine.kherbouche, equinox
In-Reply-To: <cover.1507129836.git.amine.kherbouche@6wind.com>

This commit introduces the MPLSoGRE support (RFC 4023), using ip tunnel
API by simply adding ipgre_tunnel_encap_(add|del)_mpls_ops() and the new
tunnel type TUNNEL_ENCAP_MPLS.

Signed-off-by: Amine Kherbouche <amine.kherbouche@6wind.com>
---
 include/uapi/linux/if_tunnel.h |  1 +
 net/mpls/af_mpls.c             | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

diff --git a/include/uapi/linux/if_tunnel.h b/include/uapi/linux/if_tunnel.h
index 2e52088..a2f48c0 100644
--- a/include/uapi/linux/if_tunnel.h
+++ b/include/uapi/linux/if_tunnel.h
@@ -84,6 +84,7 @@ enum tunnel_encap_types {
 	TUNNEL_ENCAP_NONE,
 	TUNNEL_ENCAP_FOU,
 	TUNNEL_ENCAP_GUE,
+	TUNNEL_ENCAP_MPLS,
 };
 
 #define TUNNEL_ENCAP_FLAG_CSUM		(1<<0)
diff --git a/net/mpls/af_mpls.c b/net/mpls/af_mpls.c
index c5b9ce4..9745e8f 100644
--- a/net/mpls/af_mpls.c
+++ b/net/mpls/af_mpls.c
@@ -16,6 +16,7 @@
 #include <net/arp.h>
 #include <net/ip_fib.h>
 #include <net/netevent.h>
+#include <net/ip_tunnels.h>
 #include <net/netns/generic.h>
 #if IS_ENABLED(CONFIG_IPV6)
 #include <net/ipv6.h>
@@ -39,6 +40,36 @@ static int one = 1;
 static int label_limit = (1 << 20) - 1;
 static int ttl_max = 255;
 
+#if IS_ENABLED(CONFIG_NET_IP_TUNNEL)
+size_t ipgre_mpls_encap_hlen(struct ip_tunnel_encap *e)
+{
+	return sizeof(struct mpls_shim_hdr);
+}
+
+static const struct ip_tunnel_encap_ops mpls_iptun_ops = {
+	.encap_hlen	= ipgre_mpls_encap_hlen,
+};
+
+static int ipgre_tunnel_encap_add_mpls_ops(void)
+{
+	return ip_tunnel_encap_add_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
+}
+
+static void ipgre_tunnel_encap_del_mpls_ops(void)
+{
+	ip_tunnel_encap_del_ops(&mpls_iptun_ops, TUNNEL_ENCAP_MPLS);
+}
+#else
+static int ipgre_tunnel_encap_add_mpls_ops(void)
+{
+	return 0;
+}
+
+static void ipgre_tunnel_encap_del_mpls_ops(void)
+{
+}
+#endif
+
 static void rtmsg_lfib(int event, u32 label, struct mpls_route *rt,
 		       struct nlmsghdr *nlh, struct net *net, u32 portid,
 		       unsigned int nlm_flags);
@@ -2485,6 +2516,10 @@ static int __init mpls_init(void)
 		      0);
 	rtnl_register(PF_MPLS, RTM_GETNETCONF, mpls_netconf_get_devconf,
 		      mpls_netconf_dump_devconf, 0);
+	err = ipgre_tunnel_encap_add_mpls_ops();
+	if (err)
+		pr_err("Can't add mpls over gre tunnel ops\n");
+
 	err = 0;
 out:
 	return err;
@@ -2502,6 +2537,7 @@ static void __exit mpls_exit(void)
 	dev_remove_pack(&mpls_packet_type);
 	unregister_netdevice_notifier(&mpls_dev_notifier);
 	unregister_pernet_subsys(&mpls_net_ops);
+	ipgre_tunnel_encap_del_mpls_ops();
 }
 module_exit(mpls_exit);
 
-- 
2.1.4

^ permalink raw reply related

* [PATCH v6 0/1] Introduce MPLS over GRE
From: Amine Kherbouche @ 2017-10-04 17:35 UTC (permalink / raw)
  To: tom, roopa; +Cc: netdev, amine.kherbouche, equinox

This patch introduces the MPLS over GRE encapsulation (RFC 4023).

Various applications of MPLS make use of label stacks with multiple
entries.  In some cases, it is possible to replace the top label of
the stack with an IP-based encapsulation, thereby, it is possible for
two LSRs that are adjacent on an LSP to be separated by an IP network,
even if that IP network does not provide MPLS.

On 09/29/2017 06:11 AM, Tom Herbert wrote:
> I don't see why MPLS/GRE needs to be a special case in gre_rcv. Can't
> we just follow the normal processing patch which calls the proto ops
> handler for the protocol in the GRE header? Also, if protocol specific
> code is added to rcv function that most likely means that we need to
> update the related offloads also (grant it that MPLS doesn't support
> GRO but it looks like it supports GSO). Additionally, we'd need to
> consider if flow dissector needs a similar special case (I will point
> out that my recently posted patches there eliminated TEB as the one
> special case in GRE dissection).

Regarding Tom's comment, the RX path of an MPLSoGRE packet should follow
the normal processing path. That avoids a special case that would have to
be maintained separately. The TX path is also shared, since the GRE type is
loaded from skb->protocol, which is already set by the MPLS stack.

Changes in v6:
  - remove mpls_forward() function exportation patch.
  - remove mpls_gre_rcv() and let the skb follow ipgre rx path and the mpls
    proto handler will be called.

Changes in v5:
  - Reword first commit title.

Changes in v4:
  - Bring back mpls_forward() function exportation patch.
  - Move back mpls_gre_rcv() to gre module file and wrap it under
    ifdef.

Changes in v3:
  - remove mpls_forward() function exportation patch.
  - wrap efficiently mpls iptunnel add/del functions and dependent
    function/structure.
  - move mpls_gre_rcv to af_mpls.c file and export it.
  - remove unnecessary functions.
 
Changes in v2:
  - wrap ip tunnel functions under ifdef in mpls file.
  - fix indentation.
  - check return code.

An example of configuration:


         node1                LER1                       LER2                node2
        +-----+             +------+                   +------+             +-----+
        |     |             |      |                   |      |             |     |
        |     |             |      |p3  GRE tunnel   p4|      |             |     |
        |     |p1         p2|      +-------------------+      |p5         p6|     |
        |     +-------------+      +-------------------+      +-------------+     |
        |     |10.100.0.0/24|      |                   |      |10.200.0.0/24|     |
        |     |fd00:100::/64|      |  10.125.0.0/24    |      |fd00:200::/64|     |
        |     |             |      |  fd00:125::/64    |      |             |     |
        |     |             |      |                   |      |             |     |
        |     |             |      |                   |      |             |     |
        |     |             |      |                   |      |             |     |
        |     |             |      |                   |      |             |     |
        +-----+             +------+                   +------+             +-----+


		###	node1	###

ip link set p1 up
ip addr add 10.100.0.1/24 dev p1

		###	LER1	###

ip link set p2 up
ip addr add 10.100.0.2/24 dev p2

ip link set p3 up
ip addr add 10.125.0.1/24 dev p3

ip link add gre1 type gre ttl 64 local 10.125.0.1 remote 10.125.0.2 dev p3
ip link set dev gre1 up

modprobe mpls_router
sysctl -w net.mpls.conf.p2.input=1
sysctl -w net.mpls.conf.p3.input=1
sysctl -w net.mpls.conf.gre1.input=1
sysctl -w net.mpls.platform_labels=1000

ip -M route add 111 as 222 dev gre1
ip -M route add 555 as 666 via inet 10.100.0.1 dev p2

		###	LER2	###

ip link set p5 up
ip addr add 10.200.0.2/24 dev p5

ip link set p4 up
ip addr add 10.125.0.2/24 dev p4

ip link add gre1 type gre ttl 64 local 10.125.0.2 remote 10.125.0.1 dev p4
ip link set dev gre1 up

modprobe mpls_router
sysctl -w net.mpls.conf.p4.input=1
sysctl -w net.mpls.conf.p5.input=1
sysctl -w net.mpls.conf.gre1.input=1
sysctl -w net.mpls.platform_labels=1000

ip -M route add 444 as 555 dev gre1
ip -M route add 222 as 333 via inet 10.200.0.1 dev p5

		###	node2	###

ip link set p6 up
ip addr add 10.200.0.1/24 dev p6


Now use this scapy snippet to forge and send packets from port p1 of node1:

p = Ether(src='de:ed:01:0c:41:09', dst='de:ed:01:2f:3b:ba')
p /= MPLS(s=1, ttl=64, label=111)/Raw(load='\xde')
sendp(p, iface="p1", count=20, inter=0.1)
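
For reference, the MPLS header forged above is a single 4-byte label
stack entry, laid out as in RFC 3032: 20-bit label, 3-bit traffic
class, 1-bit bottom-of-stack flag, 8-bit TTL. A small sketch of that
packing follows; demo_mpls_encode(), demo_mpls_check() and the
DEMO_MPLS_* names are hypothetical, not kernel symbols, and the value
is shown in host byte order (on the wire it is sent big-endian):

```c
#include <assert.h>
#include <stdint.h>

/* Bit positions inside one label stack entry (RFC 3032). */
#define DEMO_MPLS_LABEL_SHIFT	12	/* 20-bit label */
#define DEMO_MPLS_TC_SHIFT	9	/* 3-bit traffic class */
#define DEMO_MPLS_S_SHIFT	8	/* 1-bit bottom-of-stack flag */

/* Pack the four fields into one 32-bit entry, host byte order. */
static uint32_t demo_mpls_encode(uint32_t label, uint8_t tc, int bos,
				 uint8_t ttl)
{
	return ((label & 0xfffff) << DEMO_MPLS_LABEL_SHIFT) |
	       ((uint32_t)(tc & 0x7) << DEMO_MPLS_TC_SHIFT) |
	       ((uint32_t)(bos ? 1 : 0) << DEMO_MPLS_S_SHIFT) |
	       ttl;
}

/* The scapy packet above: label=111, s=1, ttl=64. */
static void demo_mpls_check(void)
{
	assert(demo_mpls_encode(111, 0, 1, 64) == 0x0006f140);
}
```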

Amine Kherbouche (1):
  ip_tunnel: add mpls over gre support

 include/uapi/linux/if_tunnel.h |  1 +
 net/mpls/af_mpls.c             | 36 ++++++++++++++++++++++++++++++++++++
 2 files changed, 37 insertions(+)

-- 
2.1.4

^ permalink raw reply

* Re: [PATCH v2 net-next] selftests: rtnetlink: try concurrent change of ifalias
From: David Miller @ 2017-10-04 17:35 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20171004142259.13235-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed,  4 Oct 2017 16:22:59 +0200

> to make sure this is serialized correctly.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  change since v1:
>  Eric points out 'wait' blocks for all current children, so no need
>  for another loop.

Applied.

^ permalink raw reply

* Re: [PATCH net-next] rtnetlink: remove __rtnl_af_unregister
From: David Miller @ 2017-10-04 17:34 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20171004135849.4368-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed,  4 Oct 2017 15:58:49 +0200

> switch the only caller to rtnl_af_unregister.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied.

^ permalink raw reply

* Re: [PATCH net-next] rtnetlink: remove slave_validate callback
From: David Miller @ 2017-10-04 17:34 UTC (permalink / raw)
  To: fw; +Cc: netdev, jiri
In-Reply-To: <20171004135529.3967-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed,  4 Oct 2017 15:55:29 +0200

> no users in the tree.
> 
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied.

^ permalink raw reply

* Re: [PATCH] cxgb4vf: make a couple of functions static
From: David Miller @ 2017-10-04 17:32 UTC (permalink / raw)
  To: colin.king; +Cc: leedom, netdev, kernel-janitors, linux-kernel
In-Reply-To: <20171004132037.6409-1-colin.king@canonical.com>

From: Colin King <colin.king@canonical.com>
Date: Wed,  4 Oct 2017 14:20:37 +0100

> From: Colin Ian King <colin.king@canonical.com>
> 
> The functions t4vf_link_down_rc_str and t4vf_handle_get_port_info are
> local to the source and do not need to be in global scope, so make
> them static.
> 
> Cleans up sparse warnings:
> symbol 't4vf_link_down_rc_str' was not declared. Should it be static?
> symbol 't4vf_handle_get_port_info' was not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King <colin.king@canonical.com>

Applied to net-next, thanks.

^ permalink raw reply

* Re: [PATCH net-next] net: core: fix kerneldoc comment
From: David Miller @ 2017-10-04 17:28 UTC (permalink / raw)
  To: fw; +Cc: netdev
In-Reply-To: <20171004115650.6406-1-fw@strlen.de>

From: Florian Westphal <fw@strlen.de>
Date: Wed,  4 Oct 2017 13:56:50 +0200

> net/core/dev.c:1306: warning: No description found for parameter 'name'
> net/core/dev.c:1306: warning: Excess function parameter 'alias' description in 'dev_get_alias'
> 
> Fixes: 6c5570016b97 ("net: core: decouple ifalias get/set from rtnl lock")
> Reported-by: kbuild test robot <fengguang.wu@intel.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>

Applied.

^ permalink raw reply

* Re: [PATCH v2 net-next] ravb: RX checksum offload
From: David Miller @ 2017-10-04 17:26 UTC (permalink / raw)
  To: horms+renesas; +Cc: sergei.shtylyov, magnus.damm, netdev, linux-renesas-soc
In-Reply-To: <1507103667-6084-1-git-send-email-horms+renesas@verge.net.au>

From: Simon Horman <horms+renesas@verge.net.au>
Date: Wed,  4 Oct 2017 09:54:27 +0200

> Add support for RX checksum offload. This is enabled by default and
> may be disabled and re-enabled using ethtool:
> 
>  # ethtool -K eth0 rx off
>  # ethtool -K eth0 rx on
> 
> The RAVB provides a simple checksumming scheme which appears to be
> completely compatible with CHECKSUM_COMPLETE: sum of all packet data after
> the L2 header is appended to packet data; this may be trivially read by the
> driver and used to update the skb accordingly.
> 
> In terms of performance throughput is close to gigabit line-rate both with
> and without RX checksum offload enabled. Perf output, however, appears to
> indicate that significantly less time is spent in do_csum(). This is as
> expected.
 ...
> Signed-off-by: Simon Horman <horms+renesas@verge.net.au>

Applied, thanks Simon.

^ permalink raw reply
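
Note on the checksum scheme described in the message above: the sketch below
shows a 16-bit ones' complement sum in userspace C, which is the kind of
running sum CHECKSUM_COMPLETE works with. It assumes the generic Internet
checksum arithmetic from RFC 1071. The name csum_sketch() is made up for
illustration, and nothing here is taken from the ravb driver or claims to
match how the hardware lays out the appended sum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: ones' complement sum of len bytes, read as
 * big-endian 16-bit words. An odd trailing byte is padded with zero.
 * The result is the folded sum, not its bitwise complement.
 */
uint16_t csum_sketch(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;
	size_t i;

	assert(data != NULL || len == 0);

	/* Add up whole 16-bit words. A 32-bit accumulator is enough for
	 * packet-sized buffers (it cannot overflow below 128 KiB).
	 */
	for (i = 0; i + 1 < len; i += 2)
		sum += ((uint32_t)data[i] << 8) | data[i + 1];

	/* Odd length: treat the last byte as the high half of a word. */
	if (len & 1)
		sum += (uint32_t)data[len - 1] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)sum;
}
```

For the example words in RFC 1071 (0001 f203 f4f5 f6f7) the folded sum is
ddf2. A driver that receives such a sum from hardware can hand it to the
stack instead of having do_csum() walk the payload again, which is consistent
with the perf observation quoted above.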

* [PATCH v2 net-next 2/2] tcp: clean up TFO server's initial tcp_rearm_rto() call
From: Wei Wang @ 2017-10-04 17:04 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Wei Wang

From: Wei Wang <weiwan@google.com>

This commit does a cleanup and moves the tcp_rearm_rto() call in the TFO
server case to an earlier spot in tcp_rcv_state_process() to make
it more compact.
This is only a cosmetic change.

Suggested-by: Yuchung Cheng <ycheng@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
no change in v2

 net/ipv4/tcp_input.c | 21 +++++++++------------
 1 file changed, 9 insertions(+), 12 deletions(-)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bd3a35f5dbf2..c5b8d61846c2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5911,6 +5911,15 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		if (req) {
 			inet_csk(sk)->icsk_retransmits = 0;
 			reqsk_fastopen_remove(sk, req, false);
+			/* Re-arm the timer because data may have been sent out.
+			 * This is similar to the regular data transmission case
+			 * when new data has just been ack'ed.
+			 *
+			 * (TFO) - we could try to be more aggressive and
+			 * retransmitting any data sooner based on when they
+			 * are sent out.
+			 */
+			tcp_rearm_rto(sk);
 		} else {
 			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 			tp->copied_seq = tp->rcv_nxt;
@@ -5933,18 +5942,6 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		if (tp->rx_opt.tstamp_ok)
 			tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
 
-		if (req) {
-			/* Re-arm the timer because data may have been sent out.
-			 * This is similar to the regular data transmission case
-			 * when new data has just been ack'ed.
-			 *
-			 * (TFO) - we could try to be more aggressive and
-			 * retransmitting any data sooner based on when they
-			 * are sent out.
-			 */
-			tcp_rearm_rto(sk);
-		}
-
 		if (!inet_csk(sk)->icsk_ca_ops->cong_control)
 			tcp_update_pacing_rate(sk);
 
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related

* [PATCH v2 net-next 1/2] tcp: uniform the set up of sockets after successful connection
From: Wei Wang @ 2017-10-04 17:03 UTC (permalink / raw)
  To: David Miller, netdev; +Cc: Yuchung Cheng, Neal Cardwell, Eric Dumazet, Wei Wang

From: Wei Wang <weiwan@google.com>

Currently in the TCP code, the initialization sequence for cached
metrics, congestion control, BPF, etc, after successful connection
is very inconsistent. This introduces inconsistent behavior and is
prone to bugs. The current call sequence is as follows:

(1) for active case (tcp_finish_connect() case):
        tcp_mtup_init(sk);
        icsk->icsk_af_ops->rebuild_header(sk);
        tcp_init_metrics(sk);
        tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
        tcp_init_congestion_control(sk);
        tcp_init_buffer_space(sk);

(2) for passive case (tcp_rcv_state_process() TCP_SYN_RECV case):
        icsk->icsk_af_ops->rebuild_header(sk);
        tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
        tcp_init_congestion_control(sk);
        tcp_mtup_init(sk);
        tcp_init_buffer_space(sk);
        tcp_init_metrics(sk);

(3) for TFO passive case (tcp_fastopen_create_child()):
        inet_csk(child)->icsk_af_ops->rebuild_header(child);
        tcp_init_congestion_control(child);
        tcp_mtup_init(child);
        tcp_init_metrics(child);
        tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
        tcp_init_buffer_space(child);

This commit unifies the above functions to have the following sequence:
        tcp_mtup_init(sk);
        icsk->icsk_af_ops->rebuild_header(sk);
        tcp_init_metrics(sk);
        tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE/PASSIVE_ESTABLISHED_CB);
        tcp_init_congestion_control(sk);
        tcp_init_buffer_space(sk);
This sequence is the same as the (1) active case. We pick this sequence
because this order correctly allows BPF to override the settings
including congestion control module and initial cwnd, etc from
the route, and then allows the CC module to see those settings.

Suggested-by: Neal Cardwell <ncardwell@google.com>
Tested-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Wei Wang <weiwan@google.com>
Acked-by: Neal Cardwell <ncardwell@google.com>
Acked-by: Yuchung Cheng <ycheng@google.com>
Acked-by: Eric Dumazet <edumazet@google.com>
---
change in v2:
 removed EXPORT_SYMBOL(tcp_init_transfer);

 include/net/tcp.h       |  1 +
 net/ipv4/tcp.c          | 12 ++++++++++++
 net/ipv4/tcp_fastopen.c |  7 +------
 net/ipv4/tcp_input.c    | 21 +++------------------
 4 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 770b608c8439..f45fdc57d29d 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -417,6 +417,7 @@ bool tcp_peer_is_proven(struct request_sock *req, struct dst_entry *dst);
 void tcp_disable_fack(struct tcp_sock *tp);
 void tcp_close(struct sock *sk, long timeout);
 void tcp_init_sock(struct sock *sk);
+void tcp_init_transfer(struct sock *sk, int bpf_op);
 unsigned int tcp_poll(struct file *file, struct socket *sock,
 		      struct poll_table_struct *wait);
 int tcp_getsockopt(struct sock *sk, int level, int optname,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index 5091402720ab..3ed21e281c39 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -456,6 +456,18 @@ void tcp_init_sock(struct sock *sk)
 }
 EXPORT_SYMBOL(tcp_init_sock);
 
+void tcp_init_transfer(struct sock *sk, int bpf_op)
+{
+	struct inet_connection_sock *icsk = inet_csk(sk);
+
+	tcp_mtup_init(sk);
+	icsk->icsk_af_ops->rebuild_header(sk);
+	tcp_init_metrics(sk);
+	tcp_call_bpf(sk, bpf_op);
+	tcp_init_congestion_control(sk);
+	tcp_init_buffer_space(sk);
+}
+
 static void tcp_tx_timestamp(struct sock *sk, u16 tsflags, struct sk_buff *skb)
 {
 	if (tsflags && skb) {
diff --git a/net/ipv4/tcp_fastopen.c b/net/ipv4/tcp_fastopen.c
index e3c33220c418..515a757f02a8 100644
--- a/net/ipv4/tcp_fastopen.c
+++ b/net/ipv4/tcp_fastopen.c
@@ -216,12 +216,7 @@ static struct sock *tcp_fastopen_create_child(struct sock *sk,
 	refcount_set(&req->rsk_refcnt, 2);
 
 	/* Now finish processing the fastopen child socket. */
-	inet_csk(child)->icsk_af_ops->rebuild_header(child);
-	tcp_init_congestion_control(child);
-	tcp_mtup_init(child);
-	tcp_init_metrics(child);
-	tcp_call_bpf(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
-	tcp_init_buffer_space(child);
+	tcp_init_transfer(child, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 
 	tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index db9bb46b5776..bd3a35f5dbf2 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5513,20 +5513,13 @@ void tcp_finish_connect(struct sock *sk, struct sk_buff *skb)
 		security_inet_conn_established(sk, skb);
 	}
 
-	/* Make sure socket is routed, for correct metrics.  */
-	icsk->icsk_af_ops->rebuild_header(sk);
-
-	tcp_init_metrics(sk);
-	tcp_call_bpf(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
-	tcp_init_congestion_control(sk);
+	tcp_init_transfer(sk, BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB);
 
 	/* Prevent spurious tcp_cwnd_restart() on first data
 	 * packet.
 	 */
 	tp->lsndtime = tcp_jiffies32;
 
-	tcp_init_buffer_space(sk);
-
 	if (sock_flag(sk, SOCK_KEEPOPEN))
 		inet_csk_reset_keepalive_timer(sk, keepalive_time_when(tp));
 
@@ -5693,7 +5686,6 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
 		if (tcp_is_sack(tp) && sysctl_tcp_fack)
 			tcp_enable_fack(tp);
 
-		tcp_mtup_init(sk);
 		tcp_sync_mss(sk, icsk->icsk_pmtu_cookie);
 		tcp_initialize_rcv_mss(sk);
 
@@ -5920,14 +5912,8 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 			inet_csk(sk)->icsk_retransmits = 0;
 			reqsk_fastopen_remove(sk, req, false);
 		} else {
-			/* Make sure socket is routed, for correct metrics. */
-			icsk->icsk_af_ops->rebuild_header(sk);
-			tcp_call_bpf(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
-			tcp_init_congestion_control(sk);
-
-			tcp_mtup_init(sk);
+			tcp_init_transfer(sk, BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB);
 			tp->copied_seq = tp->rcv_nxt;
-			tcp_init_buffer_space(sk);
 		}
 		smp_mb();
 		tcp_set_state(sk, TCP_ESTABLISHED);
@@ -5957,8 +5943,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 			 * are sent out.
 			 */
 			tcp_rearm_rto(sk);
-		} else
-			tcp_init_metrics(sk);
+		}
 
 		if (!inet_csk(sk)->icsk_ca_ops->cong_control)
 			tcp_update_pacing_rate(sk);
-- 
2.14.2.920.gcf0c67979c-goog

^ permalink raw reply related
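
Note on the ordering argument in the commit message above: the following is a
toy userspace model, not kernel code. It only illustrates why running the
metrics step first, then the BPF hook, then congestion control init lets the
CC module see what BPF chose. All names (toy_sock, toy_init_metrics(),
toy_call_bpf(), toy_init_cc()) and the numeric values are assumptions made up
for this sketch.

```c
#include <assert.h>

/* Toy stand-in for a socket; not struct sock. */
struct toy_sock {
	unsigned int cwnd;	/* initial congestion window currently set */
	unsigned int cc_seen;	/* cwnd the CC module observed at its init */
};

enum {
	TOY_ROUTE_CWND = 10,	/* hypothetical value cached on the route */
	TOY_BPF_CWND = 40,	/* hypothetical value chosen by a BPF program */
};

/* Stand-in for tcp_init_metrics(): apply cached route settings. */
static void toy_init_metrics(struct toy_sock *sk)
{
	sk->cwnd = TOY_ROUTE_CWND;
}

/* Stand-in for tcp_call_bpf(): the BPF program overrides the setting. */
static void toy_call_bpf(struct toy_sock *sk)
{
	sk->cwnd = TOY_BPF_CWND;
}

/* Stand-in for tcp_init_congestion_control(): CC samples the setting. */
static void toy_init_cc(struct toy_sock *sk)
{
	sk->cc_seen = sk->cwnd;
}

/* Unified order from the patch: metrics, then BPF, then CC. */
struct toy_sock toy_unified_order(void)
{
	struct toy_sock sk = { 0, 0 };

	toy_init_metrics(&sk);
	toy_call_bpf(&sk);
	toy_init_cc(&sk);
	return sk;
}

/* Old passive order (2): BPF, then CC, with metrics applied last. */
struct toy_sock toy_old_passive_order(void)
{
	struct toy_sock sk = { 0, 0 };

	toy_call_bpf(&sk);
	toy_init_cc(&sk);
	toy_init_metrics(&sk);
	return sk;
}

/* In the unified order the final value and the CC's view agree. */
void toy_check(void)
{
	struct toy_sock sk = toy_unified_order();

	assert(sk.cwnd == sk.cc_seen);
}
```

In this model the unified order ends with cwnd and cc_seen both equal to the
BPF value, while the old passive order leaves the route value in cwnd after
the CC module has already sampled the BPF one.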

* Re: [PATCH v4 2/2] ip_tunnel: add mpls over gre encapsulation
From: Amine Kherbouche @ 2017-10-04 17:03 UTC (permalink / raw)
  To: Tom Herbert; +Cc: Linux Kernel Network Developers, xeb, roopa, equinox
In-Reply-To: <CALx6S35b6gk39KOXTTzCFiz4S+JOOxcdJGB-+n3nx0x=xxVi8Q@mail.gmail.com>



On 09/29/2017 06:11 AM, Tom Herbert wrote:
>> > @@ -122,6 +125,30 @@ int gre_parse_header(struct sk_buff *skb, struct tnl_ptk_info *tpi,
>> >  }
>> >  EXPORT_SYMBOL(gre_parse_header);
>> >
>> > +#if IS_ENABLED(CONFIG_MPLS)
>> > +int mpls_gre_rcv(struct sk_buff *skb, int gre_hdr_len)
>> > +{
>> > +       if (unlikely(!pskb_may_pull(skb, gre_hdr_len)))
>> > +               goto drop;
>> > +
>> > +       /* Pop GRE hdr and reset the skb */
>> > +       skb_pull(skb, gre_hdr_len);
>> > +       skb_reset_network_header(skb);
>> > +
> I don't see why MPLS/GRE needs to be a special case in gre_rcv. Can't
> we just follow the normal processing path which calls the proto ops
> handler for the protocol in the GRE header? Also, if protocol specific
> code is added to rcv function that most likely means that we need to
> update the related offloads also (granted that MPLS doesn't support
> GRO but it looks like it supports GSO). Additionally, we'd need to
> consider if flow dissector needs a similar special case (I will point
> out that my recently posted patches there eliminated TEB as the one
> special case in GRE dissection).
>
> Thanks,
> Tom

Hi Tom,

Thanks for the feedback, I think this is the best way to do it, I'll do 
a v6 asap.

Regards,
Amine

^ permalink raw reply

* Re: etsec2 attached to sgmii phy
From: Jörg Willmann @ 2017-10-04 16:40 UTC (permalink / raw)
  To: Andrew Lunn, netdev; +Cc: timur
In-Reply-To: <20171004153422.GC3435@lunn.ch>

Am 04.10.2017 um 17:34 schrieb Andrew Lunn:
> On Wed, Oct 04, 2017 at 04:19:23PM +0200, Jörg Willmann wrote:
>> On Wed, 4 Oct 2017, Andrew Lunn wrote:
>>
>>> On Wed, Oct 04, 2017 at 07:56:53AM +0200, Jörg Willmann wrote:
>>>> Hi,
>>>>
>>>> we use a QorIQ P1011 connected via SGMII to a switch (Marvell 88E6352).
>>>> Currently we still use a really old linux kernel (2.6.33) successfully.
>>>>
>>>> For configuration of the MDIO Bus attached to the corresponding eTSEC/TBI
>>>> Phy we use the following settings in the device tree:
>>>>
>>>>                         mdio@25000 {
>>>>                                    #address-cells = <0x1>;
>>>>                                    #size-cells = <0x0>;
>>>>                                    compatible = "fsl,etsec2-tbi";
>>>>                                    reg = <0x25000 0x1000 0xb1030 0x4>;
>>> Hi Joerg
>>>
>>> Is 0xb1030 0x4 fixed by the silicon? Can it be expressed as an offset from
>>> 0x25000?
>>>
>>> It seems like the idea behind the patch is to hard code some
>>> things. If you can hard code the offset into get_etsec_tbipa(), i
>>> think that would be an O.K. solution to your problem.
>>>
>>>      Andrew
>>>
>> Yes, the address 0xb1030 is fixed but it's something totally different than
>> the address range of 0x25000. 0xb0000, 0xb1000 and 0xb2000 are base
>> addresses of the eTSEC MAC (TBIPA is a register within the MAC) and 0x24000,
>> 0x25000 and 0x26000 are the base registers of the corresponding MDIO
>> controllers. So I wouldn't add a dependency between these two things.
>> From my point of view, the implementation in the old kernel where
>> get_gfar_tbipa() got the device tree node pointer as argument was not soo
>> bad ;-)
> I took a quick look at the current device tree files. They all seem to
> have the 0xb1030 0x4. So reading it inside of get_etsec_tbipa() is
> O.K.
>
> Looks like you need to modify all the get_tbipa() functions to take a
> device_node *, and this code looks like it needs to change:
>
>                          /*
>                           * Add consistency check to make sure TBI is contained
>                           * within the mapped range (not because we would get a
>                           * segfault, rather to catch bugs in computing TBI
>                           * address). Print error message but continue anyway.
>                           */
>                          if ((void *)tbipa > priv->map + resource_size(&res) - 4)
>                                  dev_err(&pdev->dev, "invalid register map (should be at least 0x%04zx to contain TBI address)\n",
>                                          ((void *)tbipa - priv->map) + 4);
>
>                          iowrite32be(be32_to_cpup(prop), tbipa);
>
> 	Andrew
>
Yes, exactly - I already stumbled over these lines, too. Are there any 
suggestions on how best to implement this?

^ permalink raw reply

* Re: [PATCH 1/3] wireless: iwlwifi: use bool instead of int
From: Luciano Coelho @ 2017-10-04 16:39 UTC (permalink / raw)
  To: Joe Perches, Christoph Böhmwalder, johannes.berg,
	emmanuel.grumbach, kvalo
  Cc: linux-wireless, netdev, linux-kernel
In-Reply-To: <1507134405.4434.10.camel@perches.com>

On Wed, 2017-10-04 at 09:26 -0700, Joe Perches wrote:
> On Wed, 2017-10-04 at 17:56 +0200, Christoph Böhmwalder wrote:
> > Change a usage of int in a boolean context to use the bool type
> > instead, as it
> > makes the intent of the function clearer and helps clarify its
> > semantics.
> > 
> > Also eliminate the if/else and just return the boolean result
> > directly,
> > making the code more readable.
> > 
> > Signed-off-by: Christoph Böhmwalder <christoph@boehmwalder.at>
> > ---
> >  drivers/net/wireless/intel/iwlwifi/iwl-phy-db.c | 12 +++++-------
> >  1 file changed, 5 insertions(+), 7 deletions(-)
> > 
> > diff --git a/drivers/net/wireless/intel/iwlwifi/iwl-phy-db.c
> > b/drivers/net/wireless/intel/iwlwifi/iwl-phy-db.c
> > index b7cd813ba70f..0eb815ae97e8 100644
> > --- a/drivers/net/wireless/intel/iwlwifi/iwl-phy-db.c
> > +++ b/drivers/net/wireless/intel/iwlwifi/iwl-phy-db.c
> > @@ -267,14 +267,12 @@ int iwl_phy_db_set_section(struct iwl_phy_db
> > *phy_db,
> >  }
> >  IWL_EXPORT_SYMBOL(iwl_phy_db_set_section);
> >  
> > -static int is_valid_channel(u16 ch_id)
> > +static bool is_valid_channel(u16 ch_id)
> >  {
> > -	if (ch_id <= 14 ||
> > -	    (36 <= ch_id && ch_id <= 64 && ch_id % 4 == 0) ||
> > -	    (100 <= ch_id && ch_id <= 140 && ch_id % 4 == 0) ||
> > -	    (145 <= ch_id && ch_id <= 165 && ch_id % 4 == 1))
> > -		return 1;
> > -	return 0;
> > +	return (ch_id <= 14 ||
> > +	       (36 <= ch_id && ch_id <= 64 && ch_id % 4 == 0) ||
> > +	       (100 <= ch_id && ch_id <= 140 && ch_id % 4 == 0) ||
> > +	       (145 <= ch_id && ch_id <= 165 && ch_id % 4 == 1));
> >  }
> 
> This might be more intelligible as separate tests
> 
> static bool is_valid_channel(u16 ch_id)
> {
> 	if (ch_id <= 14)
> 		return true;
> 
> 	if ((ch_id % 4 == 0) &&
> 	    ((ch_id >= 36 && ch_id <= 64) ||
> 	     (ch_id >= 100 && ch_id <= 140)))
> 		return true;
> 
> 	if ((ch_id % 4 == 1) &&
> 	    (ch_id >= 145 && ch_id <= 165))
> 		return true;
> 
> 	return false;
> }
> 
> The compiler should produce the same object code.

Yeah, it may be a bit easier to read, but I don't want to start getting
"fixes" to working and reasonable code.  There's nothing wrong with the
existing function (except maybe for the int vs. boolean) so let's not
change it.

A good time to change this would be the next time someone adds yet
another range of valid channels here. ;)

--
Luca.

^ permalink raw reply
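
Note on the two forms of is_valid_channel() discussed in the thread above: a
small userspace sketch that compares the single-expression form from the patch
with the separate-tests form from the reply over every possible u16 value. It
checks behavioural equivalence only and says nothing about the object code
the compiler generates. The _expr and _split suffixes and the helper names
are made up for this sketch, and uint16_t stands in for the kernel's u16.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Single-expression form, as in the patch under discussion. */
bool is_valid_channel_expr(uint16_t ch_id)
{
	return (ch_id <= 14 ||
		(36 <= ch_id && ch_id <= 64 && ch_id % 4 == 0) ||
		(100 <= ch_id && ch_id <= 140 && ch_id % 4 == 0) ||
		(145 <= ch_id && ch_id <= 165 && ch_id % 4 == 1));
}

/* Separate-tests form, as suggested in the reply. */
bool is_valid_channel_split(uint16_t ch_id)
{
	if (ch_id <= 14)
		return true;

	if ((ch_id % 4 == 0) &&
	    ((ch_id >= 36 && ch_id <= 64) ||
	     (ch_id >= 100 && ch_id <= 140)))
		return true;

	if ((ch_id % 4 == 1) &&
	    (ch_id >= 145 && ch_id <= 165))
		return true;

	return false;
}

/* Count the inputs on which the two forms disagree. */
unsigned int count_mismatches(void)
{
	unsigned int mismatches = 0;
	uint32_t ch;	/* wider than uint16_t so the loop terminates */

	for (ch = 0; ch <= UINT16_MAX; ch++) {
		if (is_valid_channel_expr((uint16_t)ch) !=
		    is_valid_channel_split((uint16_t)ch))
			mismatches++;
	}

	return mismatches;
}

/* Abort if the two forms ever disagree. */
void check_equivalence(void)
{
	assert(count_mismatches() == 0);
}
```

An exhaustive comparison is cheap here because the input space has only
65536 values, so a refactoring like this can be checked without reasoning
about each range by hand.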

* [PATCH v2 5/5] VSOCK: add tools/testing/vsock/vsock_diag_test
From: Stefan Hajnoczi @ 2017-10-04 16:37 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171004163716.3964-1-stefanha@redhat.com>

This patch adds tests for the vsock_diag.ko module.

These tests are not self-tests because they require manual set up of a
KVM or VMware guest.  Please see tools/testing/vsock/README for
instructions.

The control.h and timeout.h infrastructure can be used for additional
AF_VSOCK tests in the future.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                           |   1 +
 tools/testing/vsock/Makefile          |   9 +
 tools/testing/vsock/control.h         |  13 +
 tools/testing/vsock/timeout.h         |  14 +
 tools/testing/vsock/control.c         | 219 +++++++++++
 tools/testing/vsock/timeout.c         |  64 ++++
 tools/testing/vsock/vsock_diag_test.c | 681 ++++++++++++++++++++++++++++++++++
 tools/testing/vsock/.gitignore        |   2 +
 tools/testing/vsock/README            |  36 ++
 9 files changed, 1039 insertions(+)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README
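
Note on the timeout pattern used by timeout.c in this patch: the sketch below
is a standalone userspace illustration of the same idea, where alarm(2)
interrupts a blocking syscall and a flag set by the SIGALRM handler is checked
before retrying on EINTR. It assumes the handler is installed with
sigaction(2) without SA_RESTART, which is not visible in this part of the
patch. All demo_* names are made up for the sketch, and unlike
timeout_check() it returns to the caller instead of exiting.

```c
#define _POSIX_C_SOURCE 200809L

#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdbool.h>
#include <string.h>
#include <unistd.h>

/* Set by the SIGALRM handler, polled by the retry loop. */
static volatile sig_atomic_t demo_timed_out;

static void demo_sigalrm(int signo)
{
	(void)signo;
	demo_timed_out = 1;
}

/* Install the handler without SA_RESTART so that a blocking syscall
 * returns -1 with errno == EINTR when the alarm fires.
 */
static int demo_install_handler(void)
{
	struct sigaction sa;

	memset(&sa, 0, sizeof(sa));
	sa.sa_handler = demo_sigalrm;
	sigemptyset(&sa.sa_mask);
	return sigaction(SIGALRM, &sa, NULL);
}

/* Block in read(2) on a pipe that never receives data. Returns true if
 * the read was cut short by the timeout.
 */
bool demo_read_times_out(unsigned int seconds)
{
	int fds[2];
	char c;
	ssize_t ret;
	bool expired;

	assert(seconds > 0);

	if (demo_install_handler() < 0 || pipe(fds) < 0)
		return false;

	demo_timed_out = 0;
	alarm(seconds);			/* begin the timeout */
	do {
		ret = read(fds[0], &c, 1);
		if (demo_timed_out)	/* check the timeout */
			break;
	} while (ret < 0 && errno == EINTR);
	alarm(0);			/* end the timeout */

	expired = demo_timed_out != 0;
	demo_timed_out = 0;

	close(fds[0]);
	close(fds[1]);
	return expired;
}
```

Checking the flag before the EINTR test matters: an unrelated signal also
produces EINTR, and only the flag tells the loop whether to retry the call or
give up.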

diff --git a/MAINTAINERS b/MAINTAINERS
index 200dac93f34b..76b9fa587bfa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13983,6 +13983,7 @@ F:	net/vmw_vsock/virtio_transport.c
 F:	drivers/net/vsockmon.c
 F:	drivers/vhost/vsock.c
 F:	drivers/vhost/vsock.h
+F:	tools/testing/vsock/
 
 VIRTIO CONSOLE DRIVER
 M:	Amit Shah <amit@kernel.org>
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
new file mode 100644
index 000000000000..66ba0924194d
--- /dev/null
+++ b/tools/testing/vsock/Makefile
@@ -0,0 +1,9 @@
+all: test
+test: vsock_diag_test
+vsock_diag_test: vsock_diag_test.o timeout.o control.o
+
+CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
+.PHONY: all test clean
+clean:
+	${RM} *.o *.d vsock_diag_test
+-include *.d
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
new file mode 100644
index 000000000000..54a07efd267c
--- /dev/null
+++ b/tools/testing/vsock/control.h
@@ -0,0 +1,13 @@
+#ifndef CONTROL_H
+#define CONTROL_H
+
+#include <stdbool.h>
+
+void control_init(const char *control_host, const char *control_port,
+		  bool server);
+void control_cleanup(void);
+void control_writeln(const char *str);
+char *control_readln(void);
+void control_expectln(const char *str);
+
+#endif /* CONTROL_H */
diff --git a/tools/testing/vsock/timeout.h b/tools/testing/vsock/timeout.h
new file mode 100644
index 000000000000..77db9ce9860a
--- /dev/null
+++ b/tools/testing/vsock/timeout.h
@@ -0,0 +1,14 @@
+#ifndef TIMEOUT_H
+#define TIMEOUT_H
+
+enum {
+	/* Default timeout */
+	TIMEOUT = 10 /* seconds */
+};
+
+void sigalrm(int signo);
+void timeout_begin(unsigned int seconds);
+void timeout_check(const char *operation);
+void timeout_end(void);
+
+#endif /* TIMEOUT_H */
diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
new file mode 100644
index 000000000000..90fd47f0e422
--- /dev/null
+++ b/tools/testing/vsock/control.c
@@ -0,0 +1,219 @@
+/* Control socket for client/server test execution
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* The client and server may need to coordinate to avoid race conditions like
+ * the client attempting to connect to a socket that the server is not
+ * listening on yet.  The control socket offers a communications channel for
+ * such coordination tasks.
+ *
+ * If the client calls control_expectln("LISTENING"), then it will block until
+ * the server calls control_writeln("LISTENING").  This provides a simple
+ * mechanism for coordinating between the client and the server.
+ */
+
+#include <errno.h>
+#include <netdb.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include "timeout.h"
+#include "control.h"
+
+static int control_fd = -1;
+
+/* Open the control socket, either in server or client mode */
+void control_init(const char *control_host,
+		  const char *control_port,
+		  bool server)
+{
+	struct addrinfo hints = {
+		.ai_socktype = SOCK_STREAM,
+	};
+	struct addrinfo *result = NULL;
+	struct addrinfo *ai;
+	int ret;
+
+	ret = getaddrinfo(control_host, control_port, &hints, &result);
+	if (ret != 0) {
+		fprintf(stderr, "%s\n", gai_strerror(ret));
+		exit(EXIT_FAILURE);
+	}
+
+	for (ai = result; ai; ai = ai->ai_next) {
+		int fd;
+		int val = 1;
+
+		fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
+		if (fd < 0)
+			continue;
+
+		if (!server) {
+			if (connect(fd, ai->ai_addr, ai->ai_addrlen) < 0)
+				goto next;
+			control_fd = fd;
+			printf("Control socket connected to %s:%s.\n",
+			       control_host, control_port);
+			break;
+		}
+
+		if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
+			       &val, sizeof(val)) < 0) {
+			perror("setsockopt");
+			exit(EXIT_FAILURE);
+		}
+
+		if (bind(fd, ai->ai_addr, ai->ai_addrlen) < 0)
+			goto next;
+		if (listen(fd, 1) < 0)
+			goto next;
+
+		printf("Control socket listening on %s:%s\n",
+		       control_host, control_port);
+		fflush(stdout);
+
+		control_fd = accept(fd, NULL, 0);
+		close(fd);
+
+		if (control_fd < 0) {
+			perror("accept");
+			exit(EXIT_FAILURE);
+		}
+		printf("Control socket connection accepted...\n");
+		break;
+
+next:
+		close(fd);
+	}
+
+	if (control_fd < 0) {
+		fprintf(stderr, "Control socket initialization failed.  Invalid address %s:%s?\n",
+			control_host, control_port);
+		exit(EXIT_FAILURE);
+	}
+
+	freeaddrinfo(result);
+}
+
+/* Free resources */
+void control_cleanup(void)
+{
+	close(control_fd);
+	control_fd = -1;
+}
+
+/* Write a line to the control socket */
+void control_writeln(const char *str)
+{
+	ssize_t len = strlen(str);
+	ssize_t ret;
+
+	timeout_begin(TIMEOUT);
+
+	do {
+		ret = send(control_fd, str, len, MSG_MORE);
+		timeout_check("send");
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret != len) {
+		perror("send");
+		exit(EXIT_FAILURE);
+	}
+
+	do {
+		ret = send(control_fd, "\n", 1, 0);
+		timeout_check("send");
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret != 1) {
+		perror("send");
+		exit(EXIT_FAILURE);
+	}
+
+	timeout_end();
+}
+
+/* Return the next line from the control socket (without the trailing newline).
+ *
+ * The program terminates if a timeout occurs.
+ *
+ * The caller must free() the returned string.
+ */
+char *control_readln(void)
+{
+	char *buf = NULL;
+	size_t idx = 0;
+	size_t buflen = 0;
+
+	timeout_begin(TIMEOUT);
+
+	for (;;) {
+		ssize_t ret;
+
+		if (idx >= buflen) {
+			char *new_buf;
+
+			new_buf = realloc(buf, buflen + 80);
+			if (!new_buf) {
+				perror("realloc");
+				exit(EXIT_FAILURE);
+			}
+
+			buf = new_buf;
+			buflen += 80;
+		}
+
+		do {
+			ret = recv(control_fd, &buf[idx], 1, 0);
+			timeout_check("recv");
+		} while (ret < 0 && errno == EINTR);
+
+		if (ret == 0) {
+			fprintf(stderr, "unexpected EOF on control socket\n");
+			exit(EXIT_FAILURE);
+		}
+
+		if (ret != 1) {
+			perror("recv");
+			exit(EXIT_FAILURE);
+		}
+
+		if (buf[idx] == '\n') {
+			buf[idx] = '\0';
+			break;
+		}
+
+		idx++;
+	}
+
+	timeout_end();
+
+	return buf;
+}
+
+/* Wait until a given line is received or a timeout occurs */
+void control_expectln(const char *str)
+{
+	char *line;
+
+	line = control_readln();
+	if (strcmp(str, line) != 0) {
+		fprintf(stderr, "expected \"%s\" on control socket, got \"%s\"\n",
+			str, line);
+		exit(EXIT_FAILURE);
+	}
+
+	free(line);
+}
diff --git a/tools/testing/vsock/timeout.c b/tools/testing/vsock/timeout.c
new file mode 100644
index 000000000000..c49b3003b2db
--- /dev/null
+++ b/tools/testing/vsock/timeout.c
@@ -0,0 +1,64 @@
+/* Timeout API for single-threaded programs that use blocking
+ * syscalls (read/write/send/recv/connect/accept).
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* Use the following pattern:
+ *
+ *   timeout_begin(TIMEOUT);
+ *   do {
+ *       ret = accept(...);
+ *       timeout_check("accept");
+ *   } while (ret < 0 && errno == EINTR);
+ *   timeout_end();
+ */
+
+#include <stdlib.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <stdio.h>
+#include "timeout.h"
+
+static volatile bool timeout;
+
+/* SIGALRM handler function.  Do not use sleep(2), alarm(2), or
+ * setitimer(2) while using this API - they may interfere with each
+ * other.
+ */
+void sigalrm(int signo)
+{
+	timeout = true;
+}
+
+/* Start a timeout.  Call timeout_check() to verify that the timeout hasn't
+ * expired.  timeout_end() must be called to stop the timeout.  Timeouts cannot
+ * be nested.
+ */
+void timeout_begin(unsigned int seconds)
+{
+	alarm(seconds);
+}
+
+/* Exit with an error message if the timeout has expired */
+void timeout_check(const char *operation)
+{
+	if (timeout) {
+		fprintf(stderr, "%s timed out\n", operation);
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Stop a timeout */
+void timeout_end(void)
+{
+	alarm(0);
+	timeout = false;
+}
diff --git a/tools/testing/vsock/vsock_diag_test.c b/tools/testing/vsock/vsock_diag_test.c
new file mode 100644
index 000000000000..e896a4af52f4
--- /dev/null
+++ b/tools/testing/vsock/vsock_diag_test.c
@@ -0,0 +1,681 @@
+/*
+ * vsock_diag_test - vsock_diag.ko test suite
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <getopt.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <signal.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <linux/list.h>
+#include <linux/net.h>
+#include <linux/netlink.h>
+#include <linux/sock_diag.h>
+#include <netinet/tcp.h>
+
+#include "../../../include/uapi/linux/vm_sockets.h"
+#include "../../../include/uapi/linux/vm_sockets_diag.h"
+
+#include "timeout.h"
+#include "control.h"
+
+enum test_mode {
+	TEST_MODE_UNSET,
+	TEST_MODE_CLIENT,
+	TEST_MODE_SERVER
+};
+
+/* Per-socket status */
+struct vsock_stat {
+	struct list_head list;
+	struct vsock_diag_msg msg;
+};
+
+static const char *sock_type_str(int type)
+{
+	switch (type) {
+	case SOCK_DGRAM:
+		return "DGRAM";
+	case SOCK_STREAM:
+		return "STREAM";
+	default:
+		return "INVALID TYPE";
+	}
+}
+
+static const char *sock_state_str(int state)
+{
+	switch (state) {
+	case TCP_CLOSE:
+		return "UNCONNECTED";
+	case TCP_SYN_SENT:
+		return "CONNECTING";
+	case TCP_ESTABLISHED:
+		return "CONNECTED";
+	case TCP_CLOSING:
+		return "DISCONNECTING";
+	case TCP_LISTEN:
+		return "LISTEN";
+	default:
+		return "INVALID STATE";
+	}
+}
+
+static const char *sock_shutdown_str(int shutdown)
+{
+	switch (shutdown) {
+	case 1:
+		return "RCV_SHUTDOWN";
+	case 2:
+		return "SEND_SHUTDOWN";
+	case 3:
+		return "RCV_SHUTDOWN | SEND_SHUTDOWN";
+	default:
+		return "0";
+	}
+}
+
+static void print_vsock_addr(FILE *fp, unsigned int cid, unsigned int port)
+{
+	if (cid == VMADDR_CID_ANY)
+		fprintf(fp, "*:");
+	else
+		fprintf(fp, "%u:", cid);
+
+	if (port == VMADDR_PORT_ANY)
+		fprintf(fp, "*");
+	else
+		fprintf(fp, "%u", port);
+}
+
+static void print_vsock_stat(FILE *fp, struct vsock_stat *st)
+{
+	print_vsock_addr(fp, st->msg.vdiag_src_cid, st->msg.vdiag_src_port);
+	fprintf(fp, " ");
+	print_vsock_addr(fp, st->msg.vdiag_dst_cid, st->msg.vdiag_dst_port);
+	fprintf(fp, " %s %s %s %u\n",
+		sock_type_str(st->msg.vdiag_type),
+		sock_state_str(st->msg.vdiag_state),
+		sock_shutdown_str(st->msg.vdiag_shutdown),
+		st->msg.vdiag_ino);
+}
+
+static void print_vsock_stats(FILE *fp, struct list_head *head)
+{
+	struct vsock_stat *st;
+
+	list_for_each_entry(st, head, list)
+		print_vsock_stat(fp, st);
+}
+
+static struct vsock_stat *find_vsock_stat(struct list_head *head, int fd)
+{
+	struct vsock_stat *st;
+	struct stat stat;
+
+	if (fstat(fd, &stat) < 0) {
+		perror("fstat");
+		exit(EXIT_FAILURE);
+	}
+
+	list_for_each_entry(st, head, list)
+		if (st->msg.vdiag_ino == stat.st_ino)
+			return st;
+
+	fprintf(stderr, "cannot find fd %d\n", fd);
+	exit(EXIT_FAILURE);
+}
+
+static void check_no_sockets(struct list_head *head)
+{
+	if (!list_empty(head)) {
+		fprintf(stderr, "expected no sockets\n");
+		print_vsock_stats(stderr, head);
+		exit(1);
+	}
+}
+
+static void check_num_sockets(struct list_head *head, int expected)
+{
+	struct list_head *node;
+	int n = 0;
+
+	list_for_each(node, head)
+		n++;
+
+	if (n != expected) {
+		fprintf(stderr, "expected %d sockets, found %d\n",
+			expected, n);
+		print_vsock_stats(stderr, head);
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void check_socket_state(struct vsock_stat *st, __u8 state)
+{
+	if (st->msg.vdiag_state != state) {
+		fprintf(stderr, "expected socket state %#x, got %#x\n",
+			state, st->msg.vdiag_state);
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void send_req(int fd)
+{
+	struct sockaddr_nl nladdr = {
+		.nl_family = AF_NETLINK,
+	};
+	struct {
+		struct nlmsghdr nlh;
+		struct vsock_diag_req vreq;
+	} req = {
+		.nlh = {
+			.nlmsg_len = sizeof(req),
+			.nlmsg_type = SOCK_DIAG_BY_FAMILY,
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
+		},
+		.vreq = {
+			.sdiag_family = AF_VSOCK,
+			.vdiag_states = ~(__u32)0,
+		},
+	};
+	struct iovec iov = {
+		.iov_base = &req,
+		.iov_len = sizeof(req),
+	};
+	struct msghdr msg = {
+		.msg_name = &nladdr,
+		.msg_namelen = sizeof(nladdr),
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+	};
+
+	for (;;) {
+		if (sendmsg(fd, &msg, 0) < 0) {
+			if (errno == EINTR)
+				continue;
+
+			perror("sendmsg");
+			exit(EXIT_FAILURE);
+		}
+
+		return;
+	}
+}
+
+static ssize_t recv_resp(int fd, void *buf, size_t len)
+{
+	struct sockaddr_nl nladdr = {
+		.nl_family = AF_NETLINK,
+	};
+	struct iovec iov = {
+		.iov_base = buf,
+		.iov_len = len,
+	};
+	struct msghdr msg = {
+		.msg_name = &nladdr,
+		.msg_namelen = sizeof(nladdr),
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+	};
+	ssize_t ret;
+
+	do {
+		ret = recvmsg(fd, &msg, 0);
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret < 0) {
+		perror("recvmsg");
+		exit(EXIT_FAILURE);
+	}
+
+	return ret;
+}
+
+static void add_vsock_stat(struct list_head *sockets,
+			   const struct vsock_diag_msg *resp)
+{
+	struct vsock_stat *st;
+
+	st = malloc(sizeof(*st));
+	if (!st) {
+		perror("malloc");
+		exit(EXIT_FAILURE);
+	}
+
+	st->msg = *resp;
+	list_add_tail(&st->list, sockets);
+}
+
+/*
+ * Read vsock stats into a list.
+ */
+static void read_vsock_stat(struct list_head *sockets)
+{
+	long buf[8192 / sizeof(long)];
+	int fd;
+
+	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	send_req(fd);
+
+	for (;;) {
+		const struct nlmsghdr *h;
+		ssize_t ret;
+
+		ret = recv_resp(fd, buf, sizeof(buf));
+		if (ret == 0)
+			goto done;
+		if (ret < sizeof(*h)) {
+			fprintf(stderr, "short read of %zd bytes\n", ret);
+			exit(EXIT_FAILURE);
+		}
+
+		h = (struct nlmsghdr *)buf;
+
+		while (NLMSG_OK(h, ret)) {
+			if (h->nlmsg_type == NLMSG_DONE)
+				goto done;
+
+			if (h->nlmsg_type == NLMSG_ERROR) {
+				const struct nlmsgerr *err = NLMSG_DATA(h);
+
+				if (h->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
+					fprintf(stderr, "NLMSG_ERROR\n");
+				else {
+					errno = -err->error;
+					perror("NLMSG_ERROR");
+				}
+
+				exit(EXIT_FAILURE);
+			}
+
+			if (h->nlmsg_type != SOCK_DIAG_BY_FAMILY) {
+				fprintf(stderr, "unexpected nlmsg_type %#x\n",
+					h->nlmsg_type);
+				exit(EXIT_FAILURE);
+			}
+			if (h->nlmsg_len <
+			    NLMSG_LENGTH(sizeof(struct vsock_diag_msg))) {
+				fprintf(stderr, "short vsock_diag_msg\n");
+				exit(EXIT_FAILURE);
+			}
+
+			add_vsock_stat(sockets, NLMSG_DATA(h));
+
+			h = NLMSG_NEXT(h, ret);
+		}
+	}
+
+done:
+	close(fd);
+}
+
+static void free_sock_stat(struct list_head *sockets)
+{
+	struct vsock_stat *st;
+	struct vsock_stat *next;
+
+	list_for_each_entry_safe(st, next, sockets, list)
+		free(st);
+}
+
+static void test_no_sockets(unsigned int peer_cid)
+{
+	LIST_HEAD(sockets);
+
+	read_vsock_stat(&sockets);
+
+	check_no_sockets(&sockets);
+
+	free_sock_stat(&sockets);
+}
+
+static void test_listen_socket_server(unsigned int peer_cid)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	LIST_HEAD(sockets);
+	struct vsock_stat *st;
+	int fd;
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	read_vsock_stat(&sockets);
+
+	check_num_sockets(&sockets, 1);
+	st = find_vsock_stat(&sockets, fd);
+	check_socket_state(st, TCP_LISTEN);
+
+	close(fd);
+	free_sock_stat(&sockets);
+}
+
+static void test_connect_client(unsigned int peer_cid)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = peer_cid,
+		},
+	};
+	int fd;
+	int ret;
+	LIST_HEAD(sockets);
+	struct vsock_stat *st;
+
+	control_expectln("LISTENING");
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	timeout_begin(TIMEOUT);
+	do {
+		ret = connect(fd, &addr.sa, sizeof(addr.svm));
+		timeout_check("connect");
+	} while (ret < 0 && errno == EINTR);
+	timeout_end();
+
+	if (ret < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	read_vsock_stat(&sockets);
+
+	check_num_sockets(&sockets, 1);
+	st = find_vsock_stat(&sockets, fd);
+	check_socket_state(st, TCP_ESTABLISHED);
+
+	control_expectln("DONE");
+	control_writeln("DONE");
+
+	close(fd);
+	free_sock_stat(&sockets);
+}
+
+static void test_connect_server(unsigned int peer_cid)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} clientaddr;
+	socklen_t clientaddr_len = sizeof(clientaddr.svm);
+	LIST_HEAD(sockets);
+	struct vsock_stat *st;
+	int fd;
+	int client_fd;
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln("LISTENING");
+
+	timeout_begin(TIMEOUT);
+	do {
+		client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
+		timeout_check("accept");
+	} while (client_fd < 0 && errno == EINTR);
+	timeout_end();
+
+	if (client_fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+	if (clientaddr.sa.sa_family != AF_VSOCK) {
+		fprintf(stderr, "expected AF_VSOCK from accept(2), got %d\n",
+			clientaddr.sa.sa_family);
+		exit(EXIT_FAILURE);
+	}
+	if (clientaddr.svm.svm_cid != peer_cid) {
+		fprintf(stderr, "expected peer CID %u from accept(2), got %u\n",
+			peer_cid, clientaddr.svm.svm_cid);
+		exit(EXIT_FAILURE);
+	}
+
+	read_vsock_stat(&sockets);
+
+	check_num_sockets(&sockets, 2);
+	find_vsock_stat(&sockets, fd);
+	st = find_vsock_stat(&sockets, client_fd);
+	check_socket_state(st, TCP_ESTABLISHED);
+
+	control_writeln("DONE");
+	control_expectln("DONE");
+
+	close(client_fd);
+	close(fd);
+	free_sock_stat(&sockets);
+}
+
+static struct {
+	const char *name;
+	void (*run_client)(unsigned int peer_cid);
+	void (*run_server)(unsigned int peer_cid);
+} test_cases[] = {
+	{
+		.name = "No sockets",
+		.run_server = test_no_sockets,
+	},
+	{
+		.name = "Listen socket",
+		.run_server = test_listen_socket_server,
+	},
+	{
+		.name = "Connect",
+		.run_client = test_connect_client,
+		.run_server = test_connect_server,
+	},
+	{},
+};
+
+static void init_signals(void)
+{
+	struct sigaction act = {
+		.sa_handler = sigalrm,
+	};
+
+	sigaction(SIGALRM, &act, NULL);
+	signal(SIGPIPE, SIG_IGN);
+}
+
+static unsigned int parse_cid(const char *str)
+{
+	char *endptr = NULL;
+	unsigned long int n;
+
+	errno = 0;
+	n = strtoul(str, &endptr, 10);
+	if (errno || *endptr != '\0') {
+		fprintf(stderr, "malformed CID \"%s\"\n", str);
+		exit(EXIT_FAILURE);
+	}
+	return n;
+}
+
+static const char optstring[] = "";
+static const struct option longopts[] = {
+	{
+		.name = "control-host",
+		.has_arg = required_argument,
+		.val = 'H',
+	},
+	{
+		.name = "control-port",
+		.has_arg = required_argument,
+		.val = 'P',
+	},
+	{
+		.name = "mode",
+		.has_arg = required_argument,
+		.val = 'm',
+	},
+	{
+		.name = "peer-cid",
+		.has_arg = required_argument,
+		.val = 'p',
+	},
+	{
+		.name = "help",
+		.has_arg = no_argument,
+		.val = '?',
+	},
+	{},
+};
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: vsock_diag_test [--help] [--control-host=<host>] --control-port=<port> --mode=client|server --peer-cid=<cid>\n"
+		"\n"
+		"  Server: vsock_diag_test --control-port=1234 --mode=server --peer-cid=3\n"
+		"  Client: vsock_diag_test --control-host=192.168.0.1 --control-port=1234 --mode=client --peer-cid=2\n"
+		"\n"
+		"Run vsock_diag.ko tests.  Must be launched in both\n"
+		"guest and host.  One side must use --mode=client and\n"
+		"the other side must use --mode=server.\n"
+		"\n"
+		"A TCP control socket connection is used to coordinate tests\n"
+		"between the client and the server.  The server listens on\n"
+		"--control-host (default 0.0.0.0); the client requires a\n"
+		"--control-host address to connect to.\n"
+		"\n"
+		"The CID of the other side must be given with --peer-cid=<cid>.\n");
+	exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+	const char *control_host = NULL;
+	const char *control_port = NULL;
+	int mode = TEST_MODE_UNSET;
+	unsigned int peer_cid = VMADDR_CID_ANY;
+	int i;
+
+	init_signals();
+
+	for (;;) {
+		int opt = getopt_long(argc, argv, optstring, longopts, NULL);
+
+		if (opt == -1)
+			break;
+
+		switch (opt) {
+		case 'H':
+			control_host = optarg;
+			break;
+		case 'm':
+			if (strcmp(optarg, "client") == 0)
+				mode = TEST_MODE_CLIENT;
+			else if (strcmp(optarg, "server") == 0)
+				mode = TEST_MODE_SERVER;
+			else {
+				fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
+				return EXIT_FAILURE;
+			}
+			break;
+		case 'p':
+			peer_cid = parse_cid(optarg);
+			break;
+		case 'P':
+			control_port = optarg;
+			break;
+		case '?':
+		default:
+			usage();
+		}
+	}
+
+	if (!control_port)
+		usage();
+	if (mode == TEST_MODE_UNSET)
+		usage();
+	if (peer_cid == VMADDR_CID_ANY)
+		usage();
+
+	if (!control_host) {
+		if (mode != TEST_MODE_SERVER)
+			usage();
+		control_host = "0.0.0.0";
+	}
+
+	control_init(control_host, control_port, mode == TEST_MODE_SERVER);
+
+	for (i = 0; test_cases[i].name; i++) {
+		void (*run)(unsigned int peer_cid);
+
+		printf("%s...", test_cases[i].name);
+		fflush(stdout);
+
+		if (mode == TEST_MODE_CLIENT)
+			run = test_cases[i].run_client;
+		else
+			run = test_cases[i].run_server;
+
+		if (run)
+			run(peer_cid);
+
+		printf("ok\n");
+	}
+
+	control_cleanup();
+	return EXIT_SUCCESS;
+}
diff --git a/tools/testing/vsock/.gitignore b/tools/testing/vsock/.gitignore
new file mode 100644
index 000000000000..dc5f11faf530
--- /dev/null
+++ b/tools/testing/vsock/.gitignore
@@ -0,0 +1,2 @@
+*.d
+vsock_diag_test
diff --git a/tools/testing/vsock/README b/tools/testing/vsock/README
new file mode 100644
index 000000000000..2cc6d7302db6
--- /dev/null
+++ b/tools/testing/vsock/README
@@ -0,0 +1,36 @@
+AF_VSOCK test suite
+-------------------
+These tests exercise net/vmw_vsock/ host<->guest sockets for VMware, KVM, and
+Hyper-V.
+
+The following tests are available:
+
+  * vsock_diag_test - vsock_diag.ko module for listing open sockets
+
+The following prerequisite steps are not automated and must be performed prior
+to running tests:
+
+1. Build the kernel and these tests.
+2. Install the kernel and tests on the host.
+3. Install the kernel and tests inside the guest.
+4. Boot the guest and ensure that the AF_VSOCK transport is enabled.
+
+Invoke test binaries in both directions as follows:
+
+  # host=server, guest=client
+  (host)# $TEST_BINARY --mode=server \
+                       --control-port=1234 \
+                       --peer-cid=3
+  (guest)# $TEST_BINARY --mode=client \
+                        --control-host=$HOST_IP \
+                        --control-port=1234 \
+                        --peer-cid=2
+
+  # host=client, guest=server
+  (guest)# $TEST_BINARY --mode=server \
+                        --control-port=1234 \
+                        --peer-cid=2
+  (host)# $TEST_BINARY --mode=client \
+                       --control-host=$GUEST_IP \
+                       --control-port=1234 \
+                       --peer-cid=3
-- 
2.13.6


* [PATCH v2 4/5] VSOCK: add sock_diag interface
From: Stefan Hajnoczi @ 2017-10-04 16:37 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171004163716.3964-1-stefanha@redhat.com>

This patch adds the sock_diag interface for querying sockets from
userspace.  Tools like ss(8) and netstat(8) can use this interface to
list open sockets.

The userspace ABI is defined in <linux/vm_sockets_diag.h> and includes
netlink request and response structs.  The request can query sockets
based on their sk_state (e.g. listening sockets only) and the response
contains socket information fields including the local/remote addresses,
inode number, etc.

This patch does not dump VMCI pending sockets because I have only tested
the virtio transport, which does not use pending sockets.  Support can
be added later by extending vsock_diag_dump() if needed by VMCI users.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
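For reviewers who want to see the request side in isolation, here is a
minimal userspace sketch of the netlink request described in the commit
message above.  It is not part of the patch.  The struct is a local copy of
vsock_diag_req so that the example builds without the new uapi header, and
every example_* / EXAMPLE_* name is hypothetical.  The state values are
assumed to match TCP_ESTABLISHED (1) and TCP_LISTEN (10) from the kernel's
TCP state enum.

```c
/* Sketch only: build a sock_diag dump request that filters by sk_state. */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <linux/netlink.h>
#include <linux/sock_diag.h>

#ifndef AF_VSOCK
#define AF_VSOCK 40	/* same value the patch uses in MODULE_ALIAS */
#endif

/* Assumed to match the kernel's TCP state enum. */
#define EXAMPLE_TCP_ESTABLISHED	1
#define EXAMPLE_TCP_LISTEN	10

/* Local mirror of struct vsock_diag_req from this patch. */
struct example_vsock_diag_req {
	uint8_t		sdiag_family;	/* must be AF_VSOCK */
	uint8_t		sdiag_protocol;	/* must be 0 */
	uint16_t	pad;		/* must be 0 */
	uint32_t	vdiag_states;	/* query bitmap */
	uint32_t	vdiag_ino;	/* must be 0 (reserved) */
	uint32_t	vdiag_show;	/* must be 0 (reserved) */
	uint32_t	vdiag_cookie[2];
};

/* Netlink header followed by the request payload, as sent with sendmsg(2). */
struct example_diag_request {
	struct nlmsghdr nlh;
	struct example_vsock_diag_req vreq;
};

/* Fill in a dump request that selects the sockets whose state bit is set. */
static void example_build_request(struct example_diag_request *req,
				  uint32_t states)
{
	memset(req, 0, sizeof(*req));
	req->nlh.nlmsg_len = sizeof(*req);
	req->nlh.nlmsg_type = SOCK_DIAG_BY_FAMILY;
	req->nlh.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP;
	req->vreq.sdiag_family = AF_VSOCK;
	req->vreq.vdiag_states = states;
}

/* Same test the dump loop applies: is bit (1 << sk_state) set in the bitmap? */
static int example_state_selected(uint32_t states, unsigned int sk_state)
{
	return (states & (UINT32_C(1) << sk_state)) != 0;
}
```

Passing UINT32_C(1) << EXAMPLE_TCP_LISTEN as the bitmap asks for listening
sockets only, while ~(uint32_t)0 requests every state, which is what the
test program in this series does.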
 MAINTAINERS                          |   2 +
 net/vmw_vsock/Makefile               |   3 +
 include/uapi/linux/vm_sockets_diag.h |  33 +++++++
 net/vmw_vsock/diag.c                 | 186 +++++++++++++++++++++++++++++++++++
 net/vmw_vsock/Kconfig                |  10 ++
 5 files changed, 234 insertions(+)
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 net/vmw_vsock/diag.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 1c3feffb1c1c..200dac93f34b 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -13975,6 +13975,8 @@ S:	Maintained
 F:	include/linux/virtio_vsock.h
 F:	include/uapi/linux/virtio_vsock.h
 F:	include/uapi/linux/vsockmon.h
+F:	include/uapi/linux/vm_sockets_diag.h
+F:	net/vmw_vsock/diag.c
 F:	net/vmw_vsock/af_vsock_tap.c
 F:	net/vmw_vsock/virtio_transport_common.c
 F:	net/vmw_vsock/virtio_transport.c
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index 09fc2eb29dc8..e5dbf153aff0 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,10 +1,13 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
+obj-$(CONFIG_VSOCKETS_DIAG) += vsock_diag.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
 
 vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
 
+vsock_diag-y += diag.o
+
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
 	vmci_transport_notify_qstate.o
 
diff --git a/include/uapi/linux/vm_sockets_diag.h b/include/uapi/linux/vm_sockets_diag.h
new file mode 100644
index 000000000000..14cd7dc5a187
--- /dev/null
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include <linux/types.h>
+
+/* Request */
+struct vsock_diag_req {
+	__u8	sdiag_family;	/* must be AF_VSOCK */
+	__u8	sdiag_protocol;	/* must be 0 */
+	__u16	pad;		/* must be 0 */
+	__u32	vdiag_states;	/* query bitmap (e.g. 1 << TCP_LISTEN) */
+	__u32	vdiag_ino;	/* must be 0 (reserved) */
+	__u32	vdiag_show;	/* must be 0 (reserved) */
+	__u32	vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+	__u8	vdiag_family;	/* AF_VSOCK */
+	__u8	vdiag_type;	/* SOCK_STREAM or SOCK_DGRAM */
+	__u8	vdiag_state;	/* sk_state (e.g. TCP_LISTEN) */
+	__u8	vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+	__u32	vdiag_src_cid;
+	__u32	vdiag_src_port;
+	__u32	vdiag_dst_cid;
+	__u32	vdiag_dst_port;
+	__u32	vdiag_ino;
+	__u32	vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
new file mode 100644
index 000000000000..31b567652250
--- /dev/null
+++ b/net/vmw_vsock/diag.c
@@ -0,0 +1,186 @@
+/*
+ * vsock sock_diag(7) module
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/module.h>
+#include <linux/sock_diag.h>
+#include <linux/vm_sockets_diag.h>
+#include <net/af_vsock.h>
+
+static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
+			u32 portid, u32 seq, u32 flags)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vsock_diag_msg *rep;
+	struct nlmsghdr *nlh;
+
+	nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
+			flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	rep = nlmsg_data(nlh);
+	rep->vdiag_family = AF_VSOCK;
+
+	/* Lock order dictates that sk_lock is acquired before
+	 * vsock_table_lock, so we cannot lock here.  Simply don't take
+	 * sk_lock; sk is guaranteed to stay alive since vsock_table_lock is
+	 * held.
+	 */
+	rep->vdiag_type = sk->sk_type;
+	rep->vdiag_state = sk->sk_state;
+	rep->vdiag_shutdown = sk->sk_shutdown;
+	rep->vdiag_src_cid = vsk->local_addr.svm_cid;
+	rep->vdiag_src_port = vsk->local_addr.svm_port;
+	rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
+	rep->vdiag_dst_port = vsk->remote_addr.svm_port;
+	rep->vdiag_ino = sock_i_ino(sk);
+
+	sock_diag_save_cookie(sk, rep->vdiag_cookie);
+
+	return 0;
+}
+
+static int vsock_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct vsock_diag_req *req;
+	struct vsock_sock *vsk;
+	unsigned int bucket;
+	unsigned int last_i;
+	unsigned int table;
+	struct net *net;
+	unsigned int i;
+
+	req = nlmsg_data(cb->nlh);
+	net = sock_net(skb->sk);
+
+	/* State saved between calls: */
+	table = cb->args[0];
+	bucket = cb->args[1];
+	i = last_i = cb->args[2];
+
+	/* TODO VMCI pending sockets? */
+
+	spin_lock_bh(&vsock_table_lock);
+
+	/* Bind table (locally created sockets) */
+	if (table == 0) {
+		while (bucket < ARRAY_SIZE(vsock_bind_table)) {
+			struct list_head *head = &vsock_bind_table[bucket];
+
+			i = 0;
+			list_for_each_entry(vsk, head, bound_table) {
+				struct sock *sk = sk_vsock(vsk);
+
+				if (!net_eq(sock_net(sk), net))
+					continue;
+				if (i < last_i)
+					goto next_bind;
+				if (!(req->vdiag_states & (1 << sk->sk_state)))
+					goto next_bind;
+				if (sk_diag_fill(sk, skb,
+						 NETLINK_CB(cb->skb).portid,
+						 cb->nlh->nlmsg_seq,
+						 NLM_F_MULTI) < 0)
+					goto done;
+next_bind:
+				i++;
+			}
+			last_i = 0;
+			bucket++;
+		}
+
+		table++;
+		bucket = 0;
+	}
+
+	/* Connected table (accepted connections) */
+	while (bucket < ARRAY_SIZE(vsock_connected_table)) {
+		struct list_head *head = &vsock_connected_table[bucket];
+
+		i = 0;
+		list_for_each_entry(vsk, head, connected_table) {
+			struct sock *sk = sk_vsock(vsk);
+
+			/* Skip sockets we've already seen above */
+			if (__vsock_in_bound_table(vsk))
+				continue;
+
+			if (!net_eq(sock_net(sk), net))
+				continue;
+			if (i < last_i)
+				goto next_connected;
+			if (!(req->vdiag_states & (1 << sk->sk_state)))
+				goto next_connected;
+			if (sk_diag_fill(sk, skb,
+					 NETLINK_CB(cb->skb).portid,
+					 cb->nlh->nlmsg_seq,
+					 NLM_F_MULTI) < 0)
+				goto done;
+next_connected:
+			i++;
+		}
+		last_i = 0;
+		bucket++;
+	}
+
+done:
+	spin_unlock_bh(&vsock_table_lock);
+
+	cb->args[0] = table;
+	cb->args[1] = bucket;
+	cb->args[2] = i;
+
+	return skb->len;
+}
+
+static int vsock_diag_handler_dump(struct sk_buff *skb, struct nlmsghdr *h)
+{
+	int hdrlen = sizeof(struct vsock_diag_req);
+	struct net *net = sock_net(skb->sk);
+
+	if (nlmsg_len(h) < hdrlen)
+		return -EINVAL;
+
+	if (h->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = vsock_diag_dump,
+		};
+		return netlink_dump_start(net->diag_nlsk, skb, h, &c);
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static const struct sock_diag_handler vsock_diag_handler = {
+	.family = AF_VSOCK,
+	.dump = vsock_diag_handler_dump,
+};
+
+static int __init vsock_diag_init(void)
+{
+	return sock_diag_register(&vsock_diag_handler);
+}
+
+static void __exit vsock_diag_exit(void)
+{
+	sock_diag_unregister(&vsock_diag_handler);
+}
+
+module_init(vsock_diag_init);
+module_exit(vsock_diag_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG,
+			       40 /* AF_VSOCK */);
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index 8831e7c42167..829cb7c8f14c 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -15,6 +15,16 @@ config VSOCKETS
 	  To compile this driver as a module, choose M here: the module
 	  will be called vsock. If unsure, say N.
 
+config VSOCKETS_DIAG
+	tristate "Virtual Sockets monitoring interface"
+	depends on VSOCKETS
+	default y
+	help
+	  Support for the PF_VSOCK sockets monitoring interface used by the
+	  ss tool, so that userspace applications can query open sockets.
+
+	  If unsure, say Y.
+
 config VMWARE_VMCI_VSOCKETS
 	tristate "VMware VMCI transport for Virtual Sockets"
 	depends on VSOCKETS && VMWARE_VMCI
-- 
2.13.6

