Netdev List


* Re: [PATCH] isdn/gigaset: Convert timers to use timer_setup()
From: Kees Cook @ 2017-10-05 19:17 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Karsten Keil, David S. Miller, Johan Hovold, gigaset307x-common,
	Network Development, Thomas Gleixner, LKML
In-Reply-To: <1507190336.2167.5.camel@tiscali.nl>

On Thu, Oct 5, 2017 at 12:58 AM, Paul Bolle <pebolle@tiscali.nl> wrote:
> Hi Kees,
>
> On Wed, 2017-10-04 at 17:52 -0700, Kees Cook wrote:
> >> Also uses kzalloc to replace open-coded field assignments to NULL and zero.
>
> If I'm allowed to whine (chances that I'm allowed to do that are not so great
> as Dave tends to apply gigaset patches before I even have a chance to look at
> them properly!): I'd prefer it if that was done separately in a preceding
> patch. Would that bother you?

Sure, that's fine, I'll split it and re-send.

Thanks!

-Kees

-- 
Kees Cook
Pixel Security

^ permalink raw reply

* Re: [PATCH net-next v3 1/2] libbpf: parse maps sections of varying size
From: Daniel Borkmann @ 2017-10-05 19:25 UTC (permalink / raw)
  To: Craig Gallek, Alexei Starovoitov, Jesper Dangaard Brouer,
	David S . Miller
  Cc: Chonggang Li, netdev
In-Reply-To: <20171005144158.14860-2-kraigatgoog@gmail.com>

On 10/05/2017 04:41 PM, Craig Gallek wrote:
> From: Craig Gallek <kraig@google.com>
>
> This library previously assumed a fixed-size map options structure.
> Any new options were ignored.  In order to allow the options structure
> to grow and to support parsing older programs, this patch updates
> the maps section parsing to handle varying sizes.
>
> Object files with maps sections smaller than expected will have the new
> fields initialized to zero.  Object files which have larger than expected
> maps sections will be rejected unless all of the unrecognized data is zero.
>
> This change still assumes that each map definition in the maps section
> is the same size.
>
> Signed-off-by: Craig Gallek <kraig@google.com>

Thanks,

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply
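
Editor's sketch of the size rule described in the patch above: a definition
smaller than expected is zero-extended, and a larger one is accepted only if
every byte past the known fields is zero. This is a minimal userspace
illustration, not libbpf's code; struct map_def, its field layout, and
copy_map_def() are hypothetical names chosen for the example.

```c
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical map definition as this loader understands it. */
struct map_def {
	unsigned int type;
	unsigned int key_size;
	unsigned int value_size;
	unsigned int max_entries;
	unsigned int map_flags;
};

/*
 * Copy one definition of on-disk size src_sz into *dst.
 * Returns 0 on success, -EINVAL if the object file carries
 * non-zero data in fields this loader does not know about.
 */
static int copy_map_def(struct map_def *dst, const void *src, size_t src_sz)
{
	const unsigned char *bytes = src;
	size_t i;

	if (src_sz <= sizeof(*dst)) {
		/* Older object file: missing trailing fields read as zero. */
		memset(dst, 0, sizeof(*dst));
		memcpy(dst, src, src_sz);
		return 0;
	}

	/* Newer object file: unknown trailing bytes must all be zero. */
	for (i = sizeof(*dst); i < src_sz; i++) {
		if (bytes[i])
			return -EINVAL;
	}
	memcpy(dst, src, sizeof(*dst));
	return 0;
}
```

As in the patch description, the sketch assumes every definition in a section
has the same size, so the caller would pass one src_sz for all of them.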

* Re: [PATCH net-next v3 2/2] libbpf: use map_flags when creating maps
From: Daniel Borkmann @ 2017-10-05 19:26 UTC (permalink / raw)
  To: Craig Gallek, Alexei Starovoitov, Jesper Dangaard Brouer,
	David S . Miller
  Cc: Chonggang Li, netdev
In-Reply-To: <20171005144158.14860-3-kraigatgoog@gmail.com>

On 10/05/2017 04:41 PM, Craig Gallek wrote:
> From: Craig Gallek <kraig@google.com>
>
> This is required to use BPF_MAP_TYPE_LPM_TRIE or any other map type
> which requires flags.
>
> Signed-off-by: Craig Gallek <kraig@google.com>

Acked-by: Daniel Borkmann <daniel@iogearbox.net>

^ permalink raw reply

* [PATCH] isdn/gigaset: Use kzalloc instead of open-coded field zeroing
From: Kees Cook @ 2017-10-05 19:30 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Karsten Keil, David S. Miller, Johan Hovold, linux-kernel,
	gigaset307x-common, netdev

This replaces a kmalloc followed by a bunch of per-field zeroing with a
single kzalloc call, reducing the lines of code.

Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johan Hovold <johan@kernel.org>
Cc: gigaset307x-common@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
 drivers/isdn/gigaset/bas-gigaset.c | 10 +---------
 1 file changed, 1 insertion(+), 9 deletions(-)

diff --git a/drivers/isdn/gigaset/bas-gigaset.c b/drivers/isdn/gigaset/bas-gigaset.c
index 2da3ff650e1d..33151f05e744 100644
--- a/drivers/isdn/gigaset/bas-gigaset.c
+++ b/drivers/isdn/gigaset/bas-gigaset.c
@@ -2200,7 +2200,7 @@ static int gigaset_initcshw(struct cardstate *cs)
 {
 	struct bas_cardstate *ucs;
 
-	cs->hw.bas = ucs = kmalloc(sizeof *ucs, GFP_KERNEL);
+	cs->hw.bas = ucs = kzalloc(sizeof(*ucs), GFP_KERNEL);
 	if (!ucs) {
 		pr_err("out of memory\n");
 		return -ENOMEM;
@@ -2212,15 +2212,7 @@ static int gigaset_initcshw(struct cardstate *cs)
 		return -ENOMEM;
 	}
 
-	ucs->urb_cmd_in = NULL;
-	ucs->urb_cmd_out = NULL;
-	ucs->rcvbuf = NULL;
-	ucs->rcvbuf_size = 0;
-
 	spin_lock_init(&ucs->lock);
-	ucs->pending = 0;
-
-	ucs->basstate = 0;
 	setup_timer(&ucs->timer_ctrl, req_timeout, (unsigned long) cs);
 	setup_timer(&ucs->timer_atrdy, atrdy_timeout, (unsigned long) cs);
 	setup_timer(&ucs->timer_cmd_in, cmd_in_timeout, (unsigned long) cs);
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related
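
Editor's sketch of the cleanup in the patch above, using the userspace
analogues malloc()/calloc() in place of kmalloc()/kzalloc(). struct hw_state
and both helper names are hypothetical; the field names only echo the ones
the patch touches.

```c
#include <stdlib.h>

/* Hypothetical stand-in for a driver state structure. */
struct hw_state {
	void *urb_cmd_in;
	void *urb_cmd_out;
	unsigned char *rcvbuf;
	int rcvbuf_size;
	int pending;
	int basstate;
};

/* Before: allocate, then clear every field by hand. */
static struct hw_state *alloc_open_coded(void)
{
	struct hw_state *s = malloc(sizeof(*s));

	if (!s)
		return NULL;
	s->urb_cmd_in = NULL;
	s->urb_cmd_out = NULL;
	s->rcvbuf = NULL;
	s->rcvbuf_size = 0;
	s->pending = 0;
	s->basstate = 0;
	return s;
}

/* After: one zeroing allocation; fields added later start cleared too. */
static struct hw_state *alloc_zeroed(void)
{
	return calloc(1, sizeof(struct hw_state));
}
```

The zeroing variant also covers any field someone forgets to list, which is
the usual argument for it beyond the shorter diff.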

* [PATCH v2] isdn/gigaset: Convert timers to use timer_setup()
From: Kees Cook @ 2017-10-05 19:31 UTC (permalink / raw)
  To: Paul Bolle
  Cc: Karsten Keil, David S. Miller, Johan Hovold, linux-kernel,
	gigaset307x-common, netdev

In preparation for unconditionally passing the struct timer_list pointer to
all timer callbacks, switch to using the new timer_setup() and from_timer()
to pass the timer pointer explicitly.

Cc: Paul Bolle <pebolle@tiscali.nl>
Cc: Karsten Keil <isdn@linux-pingi.de>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: Johan Hovold <johan@kernel.org>
Cc: gigaset307x-common@lists.sourceforge.net
Cc: netdev@vger.kernel.org
Signed-off-by: Kees Cook <keescook@chromium.org>
---
This requires commit 686fef928bba ("timer: Prepare to change timer
callback argument type") in v4.14-rc3, but should be otherwise
stand-alone.

v2:
- split kzalloc() into a separate patch; pebolle.
---
 drivers/isdn/gigaset/bas-gigaset.c | 36 ++++++++++++++++++++----------------
 1 file changed, 20 insertions(+), 16 deletions(-)

diff --git a/drivers/isdn/gigaset/bas-gigaset.c b/drivers/isdn/gigaset/bas-gigaset.c
index 33151f05e744..c990c6bbffc2 100644
--- a/drivers/isdn/gigaset/bas-gigaset.c
+++ b/drivers/isdn/gigaset/bas-gigaset.c
@@ -433,10 +433,11 @@ static void check_pending(struct bas_cardstate *ucs)
  * argument:
  *	controller state structure
  */
-static void cmd_in_timeout(unsigned long data)
+static void cmd_in_timeout(struct timer_list *t)
 {
-	struct cardstate *cs = (struct cardstate *) data;
-	struct bas_cardstate *ucs = cs->hw.bas;
+	struct bas_cardstate *ucs = from_timer(ucs, t, timer_cmd_in);
+	struct urb *urb = ucs->urb_int_in;
+	struct cardstate *cs = urb->context;
 	int rc;
 
 	if (!ucs->rcvbuf_size) {
@@ -639,10 +640,11 @@ static void int_in_work(struct work_struct *work)
  * argument:
  *	controller state structure
  */
-static void int_in_resubmit(unsigned long data)
+static void int_in_resubmit(struct timer_list *t)
 {
-	struct cardstate *cs = (struct cardstate *) data;
-	struct bas_cardstate *ucs = cs->hw.bas;
+	struct bas_cardstate *ucs = from_timer(ucs, t, timer_int_in);
+	struct urb *urb = ucs->urb_int_in;
+	struct cardstate *cs = urb->context;
 	int rc;
 
 	if (ucs->retry_int_in++ >= BAS_RETRY) {
@@ -1441,10 +1443,11 @@ static void read_iso_tasklet(unsigned long data)
  * argument:
  *	controller state structure
  */
-static void req_timeout(unsigned long data)
+static void req_timeout(struct timer_list *t)
 {
-	struct cardstate *cs = (struct cardstate *) data;
-	struct bas_cardstate *ucs = cs->hw.bas;
+	struct bas_cardstate *ucs = from_timer(ucs, t, timer_ctrl);
+	struct urb *urb = ucs->urb_int_in;
+	struct cardstate *cs = urb->context;
 	int pending;
 	unsigned long flags;
 
@@ -1837,10 +1840,11 @@ static void write_command_callback(struct urb *urb)
  * argument:
  *	controller state structure
  */
-static void atrdy_timeout(unsigned long data)
+static void atrdy_timeout(struct timer_list *t)
 {
-	struct cardstate *cs = (struct cardstate *) data;
-	struct bas_cardstate *ucs = cs->hw.bas;
+	struct bas_cardstate *ucs = from_timer(ucs, t, timer_atrdy);
+	struct urb *urb = ucs->urb_int_in;
+	struct cardstate *cs = urb->context;
 
 	dev_warn(cs->dev, "timeout waiting for HD_READY_SEND_ATDATA\n");
 
@@ -2213,10 +2217,10 @@ static int gigaset_initcshw(struct cardstate *cs)
 	}
 
 	spin_lock_init(&ucs->lock);
-	setup_timer(&ucs->timer_ctrl, req_timeout, (unsigned long) cs);
-	setup_timer(&ucs->timer_atrdy, atrdy_timeout, (unsigned long) cs);
-	setup_timer(&ucs->timer_cmd_in, cmd_in_timeout, (unsigned long) cs);
-	setup_timer(&ucs->timer_int_in, int_in_resubmit, (unsigned long) cs);
+	timer_setup(&ucs->timer_ctrl, req_timeout, 0);
+	timer_setup(&ucs->timer_atrdy, atrdy_timeout, 0);
+	timer_setup(&ucs->timer_cmd_in, cmd_in_timeout, 0);
+	timer_setup(&ucs->timer_int_in, int_in_resubmit, 0);
 	init_waitqueue_head(&ucs->waitqueue);
 	INIT_WORK(&ucs->int_in_wq, int_in_work);
 
-- 
2.7.4


-- 
Kees Cook
Pixel Security

^ permalink raw reply related
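
Editor's sketch of the callback pattern the patch above converts to: the
callback receives the timer pointer and recovers the enclosing structure from
it, which is what from_timer() does via container_of(). Everything here is a
userspace model under stated assumptions: struct timer_list is cut down to one
field, and struct bas_state_model, model_timer_setup() and model_timer_fire()
are hypothetical names, not kernel API.

```c
#include <stddef.h>

/* Userspace stand-in; the real definition lives in the kernel headers. */
struct timer_list {
	void (*function)(struct timer_list *t);
};

#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical driver state with two embedded timers. */
struct bas_state_model {
	int timeouts;
	struct timer_list timer_ctrl;
	struct timer_list timer_atrdy;
};

/* New-style callback: receives the timer, recovers its container. */
static void req_timeout(struct timer_list *t)
{
	struct bas_state_model *s =
		container_of(t, struct bas_state_model, timer_ctrl);

	s->timeouts++;
}

/* Model of timer_setup(): record the callback, no data cookie. */
static void model_timer_setup(struct timer_list *t,
			      void (*fn)(struct timer_list *))
{
	t->function = fn;
}

/* Model of timer expiry: the core hands the callback the timer itself. */
static void model_timer_fire(struct timer_list *t)
{
	t->function(t);
}
```

Because the callback no longer gets an arbitrary cookie, anything outside the
containing structure has to be reached through it; the patch does this by
going from ucs to urb_int_in and then to the urb's context to find cs.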

* Re: [PATCH] net: qcom/emac: make function emac_isr static
From: Timur Tabi @ 2017-10-05 19:31 UTC (permalink / raw)
  To: Colin King, netdev; +Cc: kernel-janitors, linux-kernel
In-Reply-To: <20171005091023.27781-1-colin.king@canonical.com>

On 10/05/2017 04:10 AM, Colin King wrote:
> From: Colin Ian King<colin.king@canonical.com>
> 
> The function emac_isr is local to the source and does not need to
> be in global scope, so make it static.
> 
> Cleans up sparse warnings:
> symbol 'emac_isr' was not declared. Should it be static?
> 
> Signed-off-by: Colin Ian King<colin.king@canonical.com>

ACK

-- 
Qualcomm Datacenter Technologies, Inc. as an affiliate of Qualcomm
Technologies, Inc.  Qualcomm Technologies, Inc. is a member of the
Code Aurora Forum, a Linux Foundation Collaborative Project.

^ permalink raw reply

* [PATCH] net/ipv6: remove unused err variable on icmpv6_push_pending_frames
From: Tim Hansen @ 2017-10-05 19:45 UTC (permalink / raw)
  To: davem; +Cc: kuznet, yoshfuji, netdev, linux-kernel, alexander.levin,
	devtimhansen

int err is unused by icmpv6_push_pending_frames(); this patch removes the variable and makes the function return 0.

git bisect shows this variable has been around since Linux has been in git, in commit 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2.

This was found by running make coccicheck M=net/ipv6/ on Linus' tree at commit 77ede3a014a32746002f7889211f0cecf4803163 (current HEAD as of this patch).

Signed-off-by: Tim Hansen <devtimhansen@gmail.com>
---
 net/ipv6/icmp.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/net/ipv6/icmp.c b/net/ipv6/icmp.c
index 5acb544..aeb49b4 100644
--- a/net/ipv6/icmp.c
+++ b/net/ipv6/icmp.c
@@ -255,7 +255,6 @@ int icmpv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 {
 	struct sk_buff *skb;
 	struct icmp6hdr *icmp6h;
-	int err = 0;

 	skb = skb_peek(&sk->sk_write_queue);
 	if (!skb)
@@ -288,7 +287,7 @@ int icmpv6_push_pending_frames(struct sock *sk, struct flowi6 *fl6,
 	}
 	ip6_push_pending_frames(sk);
 out:
-	return err;
+	return 0;
 }

 struct icmpv6_msg {
-- 
2.1.4

^ permalink raw reply related

* Re: [next-queue PATCH v4 3/4] net/sched: Introduce Credit Based Shaper (CBS) qdisc
From: Vinicius Costa Gomes @ 2017-10-05 19:57 UTC (permalink / raw)
  To: Jiri Pirko
  Cc: netdev, intel-wired-lan, jhs, xiyou.wangcong, andre.guedes,
	ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran, henrik, levipearson, rodney.cummings
In-Reply-To: <20171004063650.GA1895@nanopsycho>

Hi Jiri,

Jiri Pirko <jiri@resnulli.us> writes:

> Wed, Oct 04, 2017 at 02:28:30AM CEST, vinicius.gomes@intel.com wrote:
>>This queueing discipline implements the shaper algorithm defined by
>>the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
>>
>>Its primary usage is to apply some bandwidth reservation to user
>>defined traffic classes, which are mapped to different queues via the
>>mqprio qdisc.
>>
>>Initially, it only supports offloading the traffic shaping work to
>>supporting controllers.
>>
>>Later, when a software implementation is added, the current dependency
>>on being installed "under" mqprio can be lifted.
>>
>>Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>>---
>> include/linux/netdevice.h      |   1 +
>> include/net/pkt_sched.h        |   9 ++
>> include/uapi/linux/pkt_sched.h |  17 ++++
>> net/sched/Kconfig              |  11 ++
>> net/sched/Makefile             |   1 +
>> net/sched/sch_cbs.c            | 225 +++++++++++++++++++++++++++++++++++++++++
>> 6 files changed, 264 insertions(+)
>> create mode 100644 net/sched/sch_cbs.c
>>
>>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>index e1d6ef130611..b8798adc214f 100644
>>--- a/include/linux/netdevice.h
>>+++ b/include/linux/netdevice.h
>>@@ -775,6 +775,7 @@ enum tc_setup_type {
>> 	TC_SETUP_CLSFLOWER,
>> 	TC_SETUP_CLSMATCHALL,
>> 	TC_SETUP_CLSBPF,
>>+	TC_SETUP_CBS,
>
> Please split this into 2 patches. One will introduce the new qdisc,
> second will add offload capabilities.
>

Of course.

> [...]
>
>
>>+static struct Qdisc_ops cbs_qdisc_ops __read_mostly = {
>>+	.next		=	NULL,
>>+	.id		=	"cbs",
>>+	.priv_size	=	sizeof(struct cbs_sched_data),
>>+	.enqueue	=	cbs_enqueue,
>>+	.dequeue	=	qdisc_dequeue_head,
>>+	.peek		=	qdisc_peek_dequeued,
>>+	.init		=	cbs_init,
>>+	.reset		=	qdisc_reset_queue,
>>+	.destroy	=	cbs_destroy,
>>+	.change		=	cbs_change,
>>+	.dump		=	cbs_dump,
>>+	.owner		=	THIS_MODULE,
>>+};
>
> I don't see a software implementation for this. Looks like you are
> trying to abuse tc subsystem to bypass kernel. Could you please explain
> this? The golden rule is: implement in kernel, then offload.

The reason was that we didn't have a use case for the software
implementation right now; it would be added in a later series.

But as that was requested (and it makes sense), I will add it for the
next version of this series (it is already written, just need to test it
better).


Cheers,

^ permalink raw reply

* [PATCH] doc: Fix typo "8023.ad" in bonding documentation
From: Axel Beckert @ 2017-10-05 20:00 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev, Jonathan Corbet, Jiri Kosina

Should be "802.3ad" like everywhere else in the document.

Signed-off-by: Axel Beckert <abe@deuxchevaux.org>
---
 Documentation/networking/bonding.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt
index 57f52cdce32e..9ba04c0bab8d 100644
--- a/Documentation/networking/bonding.txt
+++ b/Documentation/networking/bonding.txt
@@ -2387,7 +2387,7 @@ broadcast: Like active-backup, there is not much advantage to this
 	and packet type ID), so in a "gatewayed" configuration, all
 	outgoing traffic will generally use the same device.  Incoming
 	traffic may also end up on a single device, but that is
-	dependent upon the balancing policy of the peer's 8023.ad
+	dependent upon the balancing policy of the peer's 802.3ad
 	implementation.  In a "local" configuration, traffic will be
 	distributed across the devices in the bond.
 
-- 
2.14.2

^ permalink raw reply related

* Re: [PATCH 2/3 v2] net: phy: DP83822 initial driver submission
From: Dan Murphy @ 2017-10-05 20:06 UTC (permalink / raw)
  To: Andrew Lunn, Woojung.Huh; +Cc: f.fainelli, netdev, afd
In-Reply-To: <20171004235307.GD16612@lunn.ch>

Andrew

On 10/04/2017 06:53 PM, Andrew Lunn wrote:
> On Wed, Oct 04, 2017 at 10:44:36PM +0000, Woojung.Huh@microchip.com wrote:
>>> +static int dp83822_suspend(struct phy_device *phydev)
>>> +{
>>> +	int value;
>>> +
>>> +	mutex_lock(&phydev->lock);
>>> +	value = phy_read_mmd(phydev, DP83822_DEVADDR,
>>> MII_DP83822_WOL_CFG);
>>> +	mutex_unlock(&phydev->lock);
> 
>> Would we need mutex to access phy_read_mmd()?
>> phy_read_mmd() has mdio_lock for indirect access.
> 
> Hi Woojung
> 
> The mdio lock is not sufficient. It protects against two mdio
> accesses. But here we need to protect against two phy operations.
> There is a danger something else tries to access the phy during
> suspend.
> 
>>> +	if (!(value & DP83822_WOL_EN))
>>> +		genphy_suspend(phydev);
> 
> Releasing the lock before calling genphy_suspend() is not so nice.
> Maybe add a version which assumes the lock has already been taken?
> 

The marvell driver does not take a lock and calls genphy_suspend/resume
so I am wondering if this driver needs to take a lock.

The at803x needs to take the lock because it does not call into the genphy
functions.

Dan

>       Andrew
> 


-- 
------------------
Dan Murphy

^ permalink raw reply

* Re: Regression in throughput between kvm guests over virtual bridge
From: Matthew Rosato @ 2017-10-05 20:07 UTC (permalink / raw)
  To: Jason Wang, netdev; +Cc: davem, mst
In-Reply-To: <78678f33-c9ba-bf85-7778-b2d0676b78dd@linux.vnet.ibm.com>

On 09/25/2017 04:18 PM, Matthew Rosato wrote:
> On 09/22/2017 12:03 AM, Jason Wang wrote:
>>
>>
>> On 2017-09-21 03:38, Matthew Rosato wrote:
>>>> Seems to make some progress on wakeup mitigation. Previous patch tries
>>>> to reduce the unnecessary traversal of waitqueue during rx. Attached
>>>> patch goes even further which disables rx polling during processing tx.
>>>> Please try it to see if it has any difference.
>>> Unfortunately, this patch doesn't seem to have made a difference.  I
>>> tried runs with both this patch and the previous patch applied, as well
>>> as only this patch applied for comparison (numbers from vhost thread of
>>> sending VM):
>>>
>>> 4.12    4.13     patch1   patch2   patch1+2
>>> 2.00%   +3.69%   +2.55%   +2.81%   +2.69%   [...] __wake_up_sync_key
>>>
>>> In each case, the regression in throughput was still present.
>>
>> This probably means some other cases of the wakeups were missed. Could
>> you please record the callers of __wake_up_sync_key()?
>>
> 
> Hi Jason,
> 
> With your 2 previous patches applied, every call to __wake_up_sync_key
> (for both sender and server vhost threads) shows the following stack trace:
> 
>      vhost-11478-11520 [002] ....   312.927229: __wake_up_sync_key
> <-sock_def_readable
>      vhost-11478-11520 [002] ....   312.927230: <stack trace>
>  => dev_hard_start_xmit
>  => sch_direct_xmit
>  => __dev_queue_xmit
>  => br_dev_queue_push_xmit
>  => br_forward_finish
>  => __br_forward
>  => br_handle_frame_finish
>  => br_handle_frame
>  => __netif_receive_skb_core
>  => netif_receive_skb_internal
>  => tun_get_user
>  => tun_sendmsg
>  => handle_tx
>  => vhost_worker
>  => kthread
>  => kernel_thread_starter
>  => kernel_thread_starter
> 

Ping...  Jason, any other ideas or suggestions?

^ permalink raw reply

* [PATCH net-next 1/1] [net] bonding: Add NUMA notice
From: Patrick Talbert @ 2017-10-05 20:23 UTC (permalink / raw)
  To: netdev; +Cc: Patrick Talbert

Network performance can suffer when a load balancing bond uses slave
interfaces which are in different NUMA domains.

This compares the NUMA domain of a newly enslaved interface against any
existing enslaved interfaces and prints a warning if they do not match.

Signed-off-by: Patrick Talbert <ptalbert@redhat.com>
---
:100644 100644 b19dc03... 250a969... M	drivers/net/bonding/bond_main.c
 drivers/net/bonding/bond_main.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c
index b19dc03..250a969 100644
--- a/drivers/net/bonding/bond_main.c
+++ b/drivers/net/bonding/bond_main.c
@@ -55,6 +55,7 @@
 #include <asm/dma.h>
 #include <linux/uaccess.h>
 #include <linux/errno.h>
+#include <linux/device.h>
 #include <linux/netdevice.h>
 #include <linux/inetdevice.h>
 #include <linux/igmp.h>
@@ -1450,6 +1451,21 @@ int bond_enslave(struct net_device *bond_dev, struct net_device *slave_dev)
 		}
 	}
 
+	if (bond_has_slaves(bond)) {
+		struct list_head *iter;
+		struct slave *slave;
+
+		bond_for_each_slave(bond, slave, iter) {
+			if (slave_dev->dev.numa_node !=
+			    slave->dev->dev.numa_node) {
+				netdev_warn(bond_dev,
+					    "%s does not match NUMA domain of existing slaves. This could have a performance impact.",
+					    slave_dev->name);
+				break;
+			}
+		}
+	}
+
 	call_netdevice_notifiers(NETDEV_JOIN, slave_dev);
 
 	/* If this is the first slave, then we need to set the master's hardware
-- 
1.8.3.1

^ permalink raw reply related
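
Editor's sketch of the check the patch above adds: compare the NUMA node of
the interface being enslaved against each existing slave and stop at the
first difference. This is a plain userspace model; numa_node_mismatch() and
its array argument are hypothetical stand-ins for walking the bond's slave
list.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if new_node differs from the NUMA node of any interface
 * already in the bond. An empty bond never mismatches.
 */
static bool numa_node_mismatch(const int *member_nodes, size_t count,
			       int new_node)
{
	size_t i;

	for (i = 0; i < count; i++) {
		if (member_nodes[i] != new_node)
			return true;	/* first mismatch is enough to warn */
	}
	return false;
}
```

Like the patch, the sketch only reports that a mismatch exists; it does not
say which existing slave differs.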

* Re: [PATCH v2 net-next 06/12] qed: Add LL2 slowpath handling
From: Kalderon, Michal @ 2017-10-05 20:27 UTC (permalink / raw)
  To: David Miller
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	linux-rdma-u79uwXL29TY76Z2rM5mHXA@public.gmane.org,
	dledford-H+wXaHxf7aLQT0dZR+AlfA@public.gmane.org, Elior, Ariel
In-Reply-To: <20171005.120629.2161199733119811102.davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>

From: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
Sent: Thursday, October 5, 2017 10:06 PM
>> From: Kalderon, Michal
>> Sent: Tuesday, October 3, 2017 9:05 PM
>> To: David Miller
>>>From: David Miller <davem-fT/PcQaiUtIeIZ0/mPfg9Q@public.gmane.org>
>>>Sent: Tuesday, October 3, 2017 8:17 PM
>>>>> @@ -423,6 +423,41 @@ static void qed_ll2_rxq_parse_reg(struct qed_hwfn *p_hwfn,
>>>>>  }
>>>>>
>>>>>  static int
>>>>> +qed_ll2_handle_slowpath(struct qed_hwfn *p_hwfn,
>>>>> +                     struct qed_ll2_info *p_ll2_conn,
>>>>> +                     union core_rx_cqe_union *p_cqe,
>>>>> +                     unsigned long *p_lock_flags)
>>>>> +{
>>>>...
>>>>> +     spin_unlock_irqrestore(&p_rx->lock, *p_lock_flags);
>>>>> +
>>>>
>>>>You can't drop this lock.
>>>>
>>>>Another thread can enter the loop of our caller and process RX queue
>>>>entries, then we would return from here and try to process the same
>>>>entries again.
>>>
>>>The lock is there to synchronize access to chains between qed_ll2_rxq_completion
>>>and qed_ll2_post_rx_buffer. qed_ll2_rxq_completion can't be called from
>>>different threads, the light l2 uses the single sp status block we have.
>>>The reason we release the lock is to avoid a deadlock: the upper-layer driver
>>>we call into may post additional rx-buffers, which takes the same lock.
>>
>> Dave, is there anything else needed from me on this?
>> Noticed the series is still in "Changes Requested".
>
>I'm still not convinced that the lock dropping is legitimate.  What if a
>spurious interrupt arrives?
We're in the context of a dedicated tasklet here. So even if there is a spurious
interrupt, we're covered.

>
>If the execution path in the caller is serialized for some reason, why
>are you using a spinlock and don't use that serialization for the mutual
>exclusion necessary for these queue indexes?
Posting of rx-buffers back to the light-l2 is not always serialized and can be
called from different threads depending on the light-l2 client.
Unlocking before calling the callback lets the cb function post rx buffers;
in that case, serialization protects us. The spinlock is required for the case
where rx buffers are posted from a different thread, which could run
concurrently with rxq_completion.


^ permalink raw reply
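
Editor's sketch of the locking shape under discussion above: the completion
path drops the queue lock around the client callback because the callback may
post a buffer, and posting takes the same lock. It is a single-threaded toy
model that only shows why holding the lock across the callback would
self-deadlock; it does not settle whether dropping it is safe against
concurrent completion processing, which is the open question in the thread.
struct toy_lock, struct rx_queue and all function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy non-recursive lock: acquiring it twice is the deadlock we model. */
struct toy_lock {
	bool held;
};

static void toy_acquire(struct toy_lock *l)
{
	assert(!l->held);	/* a real spinlock would spin forever here */
	l->held = true;
}

static void toy_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

struct rx_queue {
	struct toy_lock lock;
	int posted;		/* buffers currently handed to hardware */
	int completed;		/* completions processed so far */
};

/* May also be called by clients from other contexts, hence the lock. */
static void post_rx_buffer(struct rx_queue *q)
{
	toy_acquire(&q->lock);
	q->posted++;
	toy_release(&q->lock);
}

/* Upper-layer callback that reposts a buffer, as in the discussion. */
static void client_rx_cb(struct rx_queue *q)
{
	post_rx_buffer(q);
}

/* Completion path: drop the lock around the callback, then retake it. */
static void rxq_completion(struct rx_queue *q)
{
	toy_acquire(&q->lock);
	q->posted--;
	q->completed++;

	toy_release(&q->lock);
	client_rx_cb(q);
	toy_acquire(&q->lock);

	toy_release(&q->lock);
}
```

Removing the release/acquire pair around client_rx_cb() makes the assertion
in toy_acquire() fire, which is the deadlock Michal describes.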

* [PATCH v3 0/5] VSOCK: add sock_diag interface
From: Stefan Hajnoczi @ 2017-10-05 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi

v3:
 * Rebased onto net-next/master and resolved Hyper-V transport conflict

v2:
 * Moved tests to tools/testing/vsock/.  I was unable to put them in selftests/
   because they require manual setup of a VMware/KVM guest.
 * Moved __vsock_in_bound/connected_table() to af_vsock.h
 * Fixed local variable ordering in Patch 4

There is currently no way for userspace to query open AF_VSOCK sockets.  This
means ss(8), netstat(8), and other utilities cannot display AF_VSOCK sockets.

This patch series adds the netlink sock_diag interface for AF_VSOCK.  Userspace
programs send a DUMP request including an sk_state bitmap to filter sockets
based on their state (connected, listening, etc).  The vsock_diag.ko module
replies with information about matching sockets.  This userspace ABI is defined
in <linux/vm_sockets_diag.h>.

The final patch adds a test suite that exercises the basic cases.

Jorgen and Dexuan: I have only tested the virtio transport but this should also
work for VMCI and Hyper-V.  Please give it a shot if you have time.

Stefan Hajnoczi (5):
  VSOCK: export socket tables for sock_diag interface
  VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h
  VSOCK: use TCP state constants for sk_state
  VSOCK: add sock_diag interface
  VSOCK: add tools/testing/vsock/vsock_diag_test

 MAINTAINERS                                  |   3 +
 net/vmw_vsock/Makefile                       |   3 +
 tools/testing/vsock/Makefile                 |   9 +
 include/net/af_vsock.h                       |  20 +-
 include/uapi/linux/vm_sockets_diag.h         |  33 ++
 tools/testing/vsock/control.h                |  13 +
 tools/testing/vsock/timeout.h                |  14 +
 net/vmw_vsock/af_vsock.c                     |  66 +--
 net/vmw_vsock/diag.c                         | 186 ++++++++
 net/vmw_vsock/hyperv_transport.c             |  12 +-
 net/vmw_vsock/virtio_transport.c             |   2 +-
 net/vmw_vsock/virtio_transport_common.c      |  22 +-
 net/vmw_vsock/vmci_transport.c               |  34 +-
 net/vmw_vsock/vmci_transport_notify.c        |   2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |   2 +-
 tools/testing/vsock/control.c                | 219 +++++++++
 tools/testing/vsock/timeout.c                |  64 +++
 tools/testing/vsock/vsock_diag_test.c        | 681 +++++++++++++++++++++++++++
 net/vmw_vsock/Kconfig                        |  10 +
 tools/testing/vsock/.gitignore               |   2 +
 tools/testing/vsock/README                   |  36 ++
 21 files changed, 1360 insertions(+), 73 deletions(-)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 net/vmw_vsock/diag.c
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README

-- 
2.13.6

^ permalink raw reply

* [PATCH v3 1/5] VSOCK: export socket tables for sock_diag interface
From: Stefan Hajnoczi @ 2017-10-05 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171005204654.2737-1-stefanha@redhat.com>

The socket table symbols need to be exported from vsock.ko so that the
vsock_diag.ko module will be able to traverse sockets.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/net/af_vsock.h   |  5 +++++
 net/vmw_vsock/af_vsock.c | 10 ++++++----
 2 files changed, 11 insertions(+), 4 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index f9fb566e75cf..30cba806e344 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -27,6 +27,11 @@
 
 #define LAST_RESERVED_PORT 1023
 
+#define VSOCK_HASH_SIZE         251
+extern struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+extern struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+extern spinlock_t vsock_table_lock;
+
 #define vsock_sk(__sk)    ((struct vsock_sock *)__sk)
 #define sk_vsock(__vsk)   (&(__vsk)->sk)
 
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index dfc8c51e4d74..9afe4da8c67d 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -153,7 +153,6 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
  * vsock_bind_table[VSOCK_HASH_SIZE] is for unbound sockets.  The hash function
  * mods with VSOCK_HASH_SIZE to ensure this.
  */
-#define VSOCK_HASH_SIZE         251
 #define MAX_PORT_RETRIES        24
 
 #define VSOCK_HASH(addr)        ((addr)->svm_port % VSOCK_HASH_SIZE)
@@ -168,9 +167,12 @@ EXPORT_SYMBOL_GPL(vm_sockets_get_local_cid);
 #define vsock_connected_sockets_vsk(vsk)				\
 	vsock_connected_sockets(&(vsk)->remote_addr, &(vsk)->local_addr)
 
-static struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
-static struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
-static DEFINE_SPINLOCK(vsock_table_lock);
+struct list_head vsock_bind_table[VSOCK_HASH_SIZE + 1];
+EXPORT_SYMBOL_GPL(vsock_bind_table);
+struct list_head vsock_connected_table[VSOCK_HASH_SIZE];
+EXPORT_SYMBOL_GPL(vsock_connected_table);
+DEFINE_SPINLOCK(vsock_table_lock);
+EXPORT_SYMBOL_GPL(vsock_table_lock);
 
 /* Autobind this socket to the local address if necessary. */
 static int vsock_auto_bind(struct vsock_sock *vsk)
-- 
2.13.6

^ permalink raw reply related

* [PATCH v3 2/5] VSOCK: move __vsock_in_bound/connected_table() to af_vsock.h
From: Stefan Hajnoczi @ 2017-10-05 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171005204654.2737-1-stefanha@redhat.com>

The vsock_diag.ko module will need to check socket table membership.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/net/af_vsock.h   | 12 ++++++++++++
 net/vmw_vsock/af_vsock.c | 10 ----------
 2 files changed, 12 insertions(+), 10 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 30cba806e344..3dd217718a2f 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -180,6 +180,18 @@ const struct vsock_transport *vsock_core_get_transport(void);
 
 /**** UTILS ****/
 
+/* vsock_table_lock must be held */
+static inline bool __vsock_in_bound_table(struct vsock_sock *vsk)
+{
+	return !list_empty(&vsk->bound_table);
+}
+
+/* vsock_table_lock must be held */
+static inline bool __vsock_in_connected_table(struct vsock_sock *vsk)
+{
+	return !list_empty(&vsk->connected_table);
+}
+
 void vsock_release_pending(struct sock *pending);
 void vsock_add_pending(struct sock *listener, struct sock *pending);
 void vsock_remove_pending(struct sock *listener, struct sock *pending);
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9afe4da8c67d..9b179a0081b3 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -250,16 +250,6 @@ static struct sock *__vsock_find_connected_socket(struct sockaddr_vm *src,
 	return NULL;
 }
 
-static bool __vsock_in_bound_table(struct vsock_sock *vsk)
-{
-	return !list_empty(&vsk->bound_table);
-}
-
-static bool __vsock_in_connected_table(struct vsock_sock *vsk)
-{
-	return !list_empty(&vsk->connected_table);
-}
-
 static void vsock_insert_unbound(struct vsock_sock *vsk)
 {
 	spin_lock_bh(&vsock_table_lock);
-- 
2.13.6

^ permalink raw reply related
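
Editor's sketch of why !list_empty() on a socket's own list node works as a
membership test, as in the helpers moved by the patch above: a node that is
initialized to point at itself, and re-initialized on removal, is "empty"
exactly when it is not linked into a table. This is a minimal userspace model
of the kernel's circular list; struct vsock_model and in_bound_table() are
hypothetical names, and the sketch assumes removal always re-initializes the
node.

```c
#include <stdbool.h>

/* Minimal userspace model of the kernel's circular list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void init_list_head(struct list_head *h)
{
	h->next = h;
	h->prev = h;
}

static bool list_empty(const struct list_head *h)
{
	return h->next == h;
}

/* Insert n right after head. */
static void list_add(struct list_head *n, struct list_head *head)
{
	n->next = head->next;
	n->prev = head;
	head->next->prev = n;
	head->next = n;
}

/* Unlink e and leave it pointing at itself again. */
static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	init_list_head(e);
}

/* Hypothetical socket with one table linkage. */
struct vsock_model {
	struct list_head bound_table;
};

/* Caller is assumed to hold whatever lock protects the table. */
static bool in_bound_table(const struct vsock_model *v)
{
	return !list_empty(&v->bound_table);
}
```

The test is only meaningful while the table lock is held, which is what the
comments added in the patch state.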

* [PATCH v3 3/5] VSOCK: use TCP state constants for sk_state
From: Stefan Hajnoczi @ 2017-10-05 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171005204654.2737-1-stefanha@redhat.com>

There are two state fields: socket->state and sock->sk_state.  The
socket->state field uses SS_UNCONNECTED, SS_CONNECTED, etc while the
sock->sk_state typically uses values that match TCP state constants
(TCP_CLOSE, TCP_ESTABLISHED).  AF_VSOCK does not follow this convention
and instead uses SS_* constants for both fields.

The sk_state field will be exposed to userspace through the vsock_diag
interface for ss(8), netstat(8), and other programs.

This patch switches sk_state to TCP state constants so that the meaning
of this field is consistent with other address families.  Not just
AF_INET and AF_INET6 use the TCP constants, AF_UNIX and others do too.

The following mapping was used to convert the code:

  SS_FREE -> TCP_CLOSE
  SS_UNCONNECTED -> TCP_CLOSE
  SS_CONNECTING -> TCP_SYN_SENT
  SS_CONNECTED -> TCP_ESTABLISHED
  SS_DISCONNECTING -> TCP_CLOSING
  VSOCK_SS_LISTEN -> TCP_LISTEN

In __vsock_create() the sk_state initialization was dropped because
sock_init_data() already initializes sk_state to TCP_CLOSE.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 include/net/af_vsock.h                       |  3 --
 net/vmw_vsock/af_vsock.c                     | 46 ++++++++++++++++------------
 net/vmw_vsock/hyperv_transport.c             | 12 ++++----
 net/vmw_vsock/virtio_transport.c             |  2 +-
 net/vmw_vsock/virtio_transport_common.c      | 22 ++++++-------
 net/vmw_vsock/vmci_transport.c               | 34 ++++++++++----------
 net/vmw_vsock/vmci_transport_notify.c        |  2 +-
 net/vmw_vsock/vmci_transport_notify_qstate.c |  2 +-
 8 files changed, 64 insertions(+), 59 deletions(-)

diff --git a/include/net/af_vsock.h b/include/net/af_vsock.h
index 3dd217718a2f..9324ac2d9ff2 100644
--- a/include/net/af_vsock.h
+++ b/include/net/af_vsock.h
@@ -22,9 +22,6 @@
 
 #include "vsock_addr.h"
 
-/* vsock-specific sock->sk_state constants */
-#define VSOCK_SS_LISTEN 255
-
 #define LAST_RESERVED_PORT 1023
 
 #define VSOCK_HASH_SIZE         251
diff --git a/net/vmw_vsock/af_vsock.c b/net/vmw_vsock/af_vsock.c
index 9b179a0081b3..98359c19522f 100644
--- a/net/vmw_vsock/af_vsock.c
+++ b/net/vmw_vsock/af_vsock.c
@@ -36,7 +36,7 @@
  * not support simultaneous connects (two "client" sockets connecting).
  *
  * - "Server" sockets are referred to as listener sockets throughout this
- * implementation because they are in the VSOCK_SS_LISTEN state.  When a
+ * implementation because they are in the TCP_LISTEN state.  When a
  * connection request is received (the second kind of socket mentioned above),
  * we create a new socket and refer to it as a pending socket.  These pending
  * sockets are placed on the pending connection list of the listener socket.
@@ -82,6 +82,15 @@
  * argument, we must ensure the reference count is increased to ensure the
  * socket isn't freed before the function is run; the deferred function will
  * then drop the reference.
+ *
+ * - sk->sk_state uses the TCP state constants because they are widely used by
+ * other address families and exposed to userspace tools like ss(8):
+ *
+ *   TCP_CLOSE - unconnected
+ *   TCP_SYN_SENT - connecting
+ *   TCP_ESTABLISHED - connected
+ *   TCP_CLOSING - disconnecting
+ *   TCP_LISTEN - listening
  */
 
 #include <linux/types.h>
@@ -477,7 +486,7 @@ void vsock_pending_work(struct work_struct *work)
 	if (vsock_in_connected_table(vsk))
 		vsock_remove_connected(vsk);
 
-	sk->sk_state = SS_FREE;
+	sk->sk_state = TCP_CLOSE;
 
 out:
 	release_sock(sk);
@@ -617,7 +626,6 @@ struct sock *__vsock_create(struct net *net,
 
 	sk->sk_destruct = vsock_sk_destruct;
 	sk->sk_backlog_rcv = vsock_queue_rcv_skb;
-	sk->sk_state = 0;
 	sock_reset_flag(sk, SOCK_DONE);
 
 	INIT_LIST_HEAD(&vsk->bound_table);
@@ -891,7 +899,7 @@ static unsigned int vsock_poll(struct file *file, struct socket *sock,
 		/* Listening sockets that have connections in their accept
 		 * queue can be read.
 		 */
-		if (sk->sk_state == VSOCK_SS_LISTEN
+		if (sk->sk_state == TCP_LISTEN
 		    && !vsock_is_accept_queue_empty(sk))
 			mask |= POLLIN | POLLRDNORM;
 
@@ -920,7 +928,7 @@ static unsigned int vsock_poll(struct file *file, struct socket *sock,
 		}
 
 		/* Connected sockets that can produce data can be written. */
-		if (sk->sk_state == SS_CONNECTED) {
+		if (sk->sk_state == TCP_ESTABLISHED) {
 			if (!(sk->sk_shutdown & SEND_SHUTDOWN)) {
 				bool space_avail_now = false;
 				int ret = transport->notify_poll_out(
@@ -942,7 +950,7 @@ static unsigned int vsock_poll(struct file *file, struct socket *sock,
 		 * POLLOUT|POLLWRNORM when peer is closed and nothing to read,
 		 * but local send is not shutdown.
 		 */
-		if (sk->sk_state == SS_UNCONNECTED) {
+		if (sk->sk_state == TCP_CLOSE) {
 			if (!(sk->sk_shutdown & SEND_SHUTDOWN))
 				mask |= POLLOUT | POLLWRNORM;
 
@@ -1112,9 +1120,9 @@ static void vsock_connect_timeout(struct work_struct *work)
 	sk = sk_vsock(vsk);
 
 	lock_sock(sk);
-	if (sk->sk_state == SS_CONNECTING &&
+	if (sk->sk_state == TCP_SYN_SENT &&
 	    (sk->sk_shutdown != SHUTDOWN_MASK)) {
-		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_state = TCP_CLOSE;
 		sk->sk_err = ETIMEDOUT;
 		sk->sk_error_report(sk);
 		cancel = 1;
@@ -1160,7 +1168,7 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
 		err = -EALREADY;
 		break;
 	default:
-		if ((sk->sk_state == VSOCK_SS_LISTEN) ||
+		if ((sk->sk_state == TCP_LISTEN) ||
 		    vsock_addr_cast(addr, addr_len, &remote_addr) != 0) {
 			err = -EINVAL;
 			goto out;
@@ -1183,7 +1191,7 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
 		if (err)
 			goto out;
 
-		sk->sk_state = SS_CONNECTING;
+		sk->sk_state = TCP_SYN_SENT;
 
 		err = transport->connect(vsk);
 		if (err < 0)
@@ -1203,7 +1211,7 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
 	timeout = vsk->connect_timeout;
 	prepare_to_wait(sk_sleep(sk), &wait, TASK_INTERRUPTIBLE);
 
-	while (sk->sk_state != SS_CONNECTED && sk->sk_err == 0) {
+	while (sk->sk_state != TCP_ESTABLISHED && sk->sk_err == 0) {
 		if (flags & O_NONBLOCK) {
 			/* If we're not going to block, we schedule a timeout
 			 * function to generate a timeout on the connection
@@ -1226,13 +1234,13 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
 
 		if (signal_pending(current)) {
 			err = sock_intr_errno(timeout);
-			sk->sk_state = SS_UNCONNECTED;
+			sk->sk_state = TCP_CLOSE;
 			sock->state = SS_UNCONNECTED;
 			vsock_transport_cancel_pkt(vsk);
 			goto out_wait;
 		} else if (timeout == 0) {
 			err = -ETIMEDOUT;
-			sk->sk_state = SS_UNCONNECTED;
+			sk->sk_state = TCP_CLOSE;
 			sock->state = SS_UNCONNECTED;
 			vsock_transport_cancel_pkt(vsk);
 			goto out_wait;
@@ -1243,7 +1251,7 @@ static int vsock_stream_connect(struct socket *sock, struct sockaddr *addr,
 
 	if (sk->sk_err) {
 		err = -sk->sk_err;
-		sk->sk_state = SS_UNCONNECTED;
+		sk->sk_state = TCP_CLOSE;
 		sock->state = SS_UNCONNECTED;
 	} else {
 		err = 0;
@@ -1276,7 +1284,7 @@ static int vsock_accept(struct socket *sock, struct socket *newsock, int flags,
 		goto out;
 	}
 
-	if (listener->sk_state != VSOCK_SS_LISTEN) {
+	if (listener->sk_state != TCP_LISTEN) {
 		err = -EINVAL;
 		goto out;
 	}
@@ -1366,7 +1374,7 @@ static int vsock_listen(struct socket *sock, int backlog)
 	}
 
 	sk->sk_max_ack_backlog = backlog;
-	sk->sk_state = VSOCK_SS_LISTEN;
+	sk->sk_state = TCP_LISTEN;
 
 	err = 0;
 
@@ -1546,7 +1554,7 @@ static int vsock_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 
 	/* Callers should not provide a destination with stream sockets. */
 	if (msg->msg_namelen) {
-		err = sk->sk_state == SS_CONNECTED ? -EISCONN : -EOPNOTSUPP;
+		err = sk->sk_state == TCP_ESTABLISHED ? -EISCONN : -EOPNOTSUPP;
 		goto out;
 	}
 
@@ -1557,7 +1565,7 @@ static int vsock_stream_sendmsg(struct socket *sock, struct msghdr *msg,
 		goto out;
 	}
 
-	if (sk->sk_state != SS_CONNECTED ||
+	if (sk->sk_state != TCP_ESTABLISHED ||
 	    !vsock_addr_bound(&vsk->local_addr)) {
 		err = -ENOTCONN;
 		goto out;
@@ -1681,7 +1689,7 @@ vsock_stream_recvmsg(struct socket *sock, struct msghdr *msg, size_t len,
 
 	lock_sock(sk);
 
-	if (sk->sk_state != SS_CONNECTED) {
+	if (sk->sk_state != TCP_ESTABLISHED) {
 		/* Recvmsg is supposed to return 0 if a peer performs an
 		 * orderly shutdown. Differentiate between that case and when a
 		 * peer has not connected or a local shutdown occured with the
diff --git a/net/vmw_vsock/hyperv_transport.c b/net/vmw_vsock/hyperv_transport.c
index 14ed5a344cdf..bbac023e70d1 100644
--- a/net/vmw_vsock/hyperv_transport.c
+++ b/net/vmw_vsock/hyperv_transport.c
@@ -310,7 +310,7 @@ static void hvs_close_connection(struct vmbus_channel *chan)
 	struct sock *sk = get_per_channel_state(chan);
 	struct vsock_sock *vsk = vsock_sk(sk);
 
-	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_state = TCP_CLOSE;
 	sock_set_flag(sk, SOCK_DONE);
 	vsk->peer_shutdown |= SEND_SHUTDOWN | RCV_SHUTDOWN;
 
@@ -344,8 +344,8 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 	if (!sk)
 		return;
 
-	if ((conn_from_host && sk->sk_state != VSOCK_SS_LISTEN) ||
-	    (!conn_from_host && sk->sk_state != SS_CONNECTING))
+	if ((conn_from_host && sk->sk_state != TCP_LISTEN) ||
+	    (!conn_from_host && sk->sk_state != TCP_SYN_SENT))
 		goto out;
 
 	if (conn_from_host) {
@@ -357,7 +357,7 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 		if (!new)
 			goto out;
 
-		new->sk_state = SS_CONNECTING;
+		new->sk_state = TCP_SYN_SENT;
 		vnew = vsock_sk(new);
 		hvs_new = vnew->trans;
 		hvs_new->chan = chan;
@@ -384,7 +384,7 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 	vmbus_set_chn_rescind_callback(chan, hvs_close_connection);
 
 	if (conn_from_host) {
-		new->sk_state = SS_CONNECTED;
+		new->sk_state = TCP_ESTABLISHED;
 		sk->sk_ack_backlog++;
 
 		hvs_addr_init(&vnew->local_addr, if_type);
@@ -399,7 +399,7 @@ static void hvs_open_connection(struct vmbus_channel *chan)
 		vsock_enqueue_accept(sk, new);
 		release_sock(sk);
 	} else {
-		sk->sk_state = SS_CONNECTED;
+		sk->sk_state = TCP_ESTABLISHED;
 		sk->sk_socket->state = SS_CONNECTED;
 
 		vsock_insert_connected(vsock_sk(sk));
diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
index 403d86e80162..8e03bd3f3668 100644
--- a/net/vmw_vsock/virtio_transport.c
+++ b/net/vmw_vsock/virtio_transport.c
@@ -414,7 +414,7 @@ static void virtio_vsock_event_fill(struct virtio_vsock *vsock)
 static void virtio_vsock_reset_sock(struct sock *sk)
 {
 	lock_sock(sk);
-	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_state = TCP_CLOSE;
 	sk->sk_err = ECONNRESET;
 	sk->sk_error_report(sk);
 	release_sock(sk);
diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index edba7ab97563..3ae3a33da70b 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -708,7 +708,7 @@ static void virtio_transport_do_close(struct vsock_sock *vsk,
 	sock_set_flag(sk, SOCK_DONE);
 	vsk->peer_shutdown = SHUTDOWN_MASK;
 	if (vsock_stream_has_data(vsk) <= 0)
-		sk->sk_state = SS_DISCONNECTING;
+		sk->sk_state = TCP_CLOSING;
 	sk->sk_state_change(sk);
 
 	if (vsk->close_work_scheduled &&
@@ -748,8 +748,8 @@ static bool virtio_transport_close(struct vsock_sock *vsk)
 {
 	struct sock *sk = &vsk->sk;
 
-	if (!(sk->sk_state == SS_CONNECTED ||
-	      sk->sk_state == SS_DISCONNECTING))
+	if (!(sk->sk_state == TCP_ESTABLISHED ||
+	      sk->sk_state == TCP_CLOSING))
 		return true;
 
 	/* Already received SHUTDOWN from peer, reply with RST */
@@ -801,7 +801,7 @@ virtio_transport_recv_connecting(struct sock *sk,
 
 	switch (le16_to_cpu(pkt->hdr.op)) {
 	case VIRTIO_VSOCK_OP_RESPONSE:
-		sk->sk_state = SS_CONNECTED;
+		sk->sk_state = TCP_ESTABLISHED;
 		sk->sk_socket->state = SS_CONNECTED;
 		vsock_insert_connected(vsk);
 		sk->sk_state_change(sk);
@@ -821,7 +821,7 @@ virtio_transport_recv_connecting(struct sock *sk,
 
 destroy:
 	virtio_transport_reset(vsk, pkt);
-	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_state = TCP_CLOSE;
 	sk->sk_err = skerr;
 	sk->sk_error_report(sk);
 	return err;
@@ -857,7 +857,7 @@ virtio_transport_recv_connected(struct sock *sk,
 			vsk->peer_shutdown |= SEND_SHUTDOWN;
 		if (vsk->peer_shutdown == SHUTDOWN_MASK &&
 		    vsock_stream_has_data(vsk) <= 0)
-			sk->sk_state = SS_DISCONNECTING;
+			sk->sk_state = TCP_CLOSING;
 		if (le32_to_cpu(pkt->hdr.flags))
 			sk->sk_state_change(sk);
 		break;
@@ -928,7 +928,7 @@ virtio_transport_recv_listen(struct sock *sk, struct virtio_vsock_pkt *pkt)
 
 	lock_sock_nested(child, SINGLE_DEPTH_NESTING);
 
-	child->sk_state = SS_CONNECTED;
+	child->sk_state = TCP_ESTABLISHED;
 
 	vchild = vsock_sk(child);
 	vsock_addr_init(&vchild->local_addr, le64_to_cpu(pkt->hdr.dst_cid),
@@ -1016,18 +1016,18 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
 		sk->sk_write_space(sk);
 
 	switch (sk->sk_state) {
-	case VSOCK_SS_LISTEN:
+	case TCP_LISTEN:
 		virtio_transport_recv_listen(sk, pkt);
 		virtio_transport_free_pkt(pkt);
 		break;
-	case SS_CONNECTING:
+	case TCP_SYN_SENT:
 		virtio_transport_recv_connecting(sk, pkt);
 		virtio_transport_free_pkt(pkt);
 		break;
-	case SS_CONNECTED:
+	case TCP_ESTABLISHED:
 		virtio_transport_recv_connected(sk, pkt);
 		break;
-	case SS_DISCONNECTING:
+	case TCP_CLOSING:
 		virtio_transport_recv_disconnecting(sk, pkt);
 		virtio_transport_free_pkt(pkt);
 		break;
diff --git a/net/vmw_vsock/vmci_transport.c b/net/vmw_vsock/vmci_transport.c
index 0206155bff53..391775e3575c 100644
--- a/net/vmw_vsock/vmci_transport.c
+++ b/net/vmw_vsock/vmci_transport.c
@@ -742,7 +742,7 @@ static int vmci_transport_recv_stream_cb(void *data, struct vmci_datagram *dg)
 		/* The local context ID may be out of date, update it. */
 		vsk->local_addr.svm_cid = dst.svm_cid;
 
-		if (sk->sk_state == SS_CONNECTED)
+		if (sk->sk_state == TCP_ESTABLISHED)
 			vmci_trans(vsk)->notify_ops->handle_notify_pkt(
 					sk, pkt, true, &dst, &src,
 					&bh_process_pkt);
@@ -800,7 +800,7 @@ static void vmci_transport_handle_detach(struct sock *sk)
 		 * left in our consume queue.
 		 */
 		if (vsock_stream_has_data(vsk) <= 0) {
-			if (sk->sk_state == SS_CONNECTING) {
+			if (sk->sk_state == TCP_SYN_SENT) {
 				/* The peer may detach from a queue pair while
 				 * we are still in the connecting state, i.e.,
 				 * if the peer VM is killed after attaching to
@@ -809,12 +809,12 @@ static void vmci_transport_handle_detach(struct sock *sk)
 				 * event like a reset.
 				 */
 
-				sk->sk_state = SS_UNCONNECTED;
+				sk->sk_state = TCP_CLOSE;
 				sk->sk_err = ECONNRESET;
 				sk->sk_error_report(sk);
 				return;
 			}
-			sk->sk_state = SS_UNCONNECTED;
+			sk->sk_state = TCP_CLOSE;
 		}
 		sk->sk_state_change(sk);
 	}
@@ -882,17 +882,17 @@ static void vmci_transport_recv_pkt_work(struct work_struct *work)
 	vsock_sk(sk)->local_addr.svm_cid = pkt->dg.dst.context;
 
 	switch (sk->sk_state) {
-	case VSOCK_SS_LISTEN:
+	case TCP_LISTEN:
 		vmci_transport_recv_listen(sk, pkt);
 		break;
-	case SS_CONNECTING:
+	case TCP_SYN_SENT:
 		/* Processing of pending connections for servers goes through
 		 * the listening socket, so see vmci_transport_recv_listen()
 		 * for that path.
 		 */
 		vmci_transport_recv_connecting_client(sk, pkt);
 		break;
-	case SS_CONNECTED:
+	case TCP_ESTABLISHED:
 		vmci_transport_recv_connected(sk, pkt);
 		break;
 	default:
@@ -941,7 +941,7 @@ static int vmci_transport_recv_listen(struct sock *sk,
 		vsock_sk(pending)->local_addr.svm_cid = pkt->dg.dst.context;
 
 		switch (pending->sk_state) {
-		case SS_CONNECTING:
+		case TCP_SYN_SENT:
 			err = vmci_transport_recv_connecting_server(sk,
 								    pending,
 								    pkt);
@@ -1071,7 +1071,7 @@ static int vmci_transport_recv_listen(struct sock *sk,
 	vsock_add_pending(sk, pending);
 	sk->sk_ack_backlog++;
 
-	pending->sk_state = SS_CONNECTING;
+	pending->sk_state = TCP_SYN_SENT;
 	vmci_trans(vpending)->produce_size =
 		vmci_trans(vpending)->consume_size = qp_size;
 	vmci_trans(vpending)->queue_pair_size = qp_size;
@@ -1196,11 +1196,11 @@ vmci_transport_recv_connecting_server(struct sock *listener,
 	 * the socket will be valid until it is removed from the queue.
 	 *
 	 * If we fail sending the attach below, we remove the socket from the
-	 * connected list and move the socket to SS_UNCONNECTED before
+	 * connected list and move the socket to TCP_CLOSE before
 	 * releasing the lock, so a pending slow path processing of an incoming
 	 * packet will not see the socket in the connected state in that case.
 	 */
-	pending->sk_state = SS_CONNECTED;
+	pending->sk_state = TCP_ESTABLISHED;
 
 	vsock_insert_connected(vpending);
 
@@ -1231,7 +1231,7 @@ vmci_transport_recv_connecting_server(struct sock *listener,
 
 destroy:
 	pending->sk_err = skerr;
-	pending->sk_state = SS_UNCONNECTED;
+	pending->sk_state = TCP_CLOSE;
 	/* As long as we drop our reference, all necessary cleanup will handle
 	 * when the cleanup function drops its reference and our destruct
 	 * implementation is called.  Note that since the listen handler will
@@ -1269,7 +1269,7 @@ vmci_transport_recv_connecting_client(struct sock *sk,
 		 * accounting (it can already be found since it's in the bound
 		 * table).
 		 */
-		sk->sk_state = SS_CONNECTED;
+		sk->sk_state = TCP_ESTABLISHED;
 		sk->sk_socket->state = SS_CONNECTED;
 		vsock_insert_connected(vsk);
 		sk->sk_state_change(sk);
@@ -1337,7 +1337,7 @@ vmci_transport_recv_connecting_client(struct sock *sk,
 destroy:
 	vmci_transport_send_reset(sk, pkt);
 
-	sk->sk_state = SS_UNCONNECTED;
+	sk->sk_state = TCP_CLOSE;
 	sk->sk_err = skerr;
 	sk->sk_error_report(sk);
 	return err;
@@ -1525,7 +1525,7 @@ static int vmci_transport_recv_connected(struct sock *sk,
 		sock_set_flag(sk, SOCK_DONE);
 		vsk->peer_shutdown = SHUTDOWN_MASK;
 		if (vsock_stream_has_data(vsk) <= 0)
-			sk->sk_state = SS_DISCONNECTING;
+			sk->sk_state = TCP_CLOSING;
 
 		sk->sk_state_change(sk);
 		break;
@@ -1789,7 +1789,7 @@ static int vmci_transport_connect(struct vsock_sock *vsk)
 		err = vmci_transport_send_conn_request(
 			sk, vmci_trans(vsk)->queue_pair_size);
 		if (err < 0) {
-			sk->sk_state = SS_UNCONNECTED;
+			sk->sk_state = TCP_CLOSE;
 			return err;
 		}
 	} else {
@@ -1799,7 +1799,7 @@ static int vmci_transport_connect(struct vsock_sock *vsk)
 				sk, vmci_trans(vsk)->queue_pair_size,
 				supported_proto_versions);
 		if (err < 0) {
-			sk->sk_state = SS_UNCONNECTED;
+			sk->sk_state = TCP_CLOSE;
 			return err;
 		}
 
diff --git a/net/vmw_vsock/vmci_transport_notify.c b/net/vmw_vsock/vmci_transport_notify.c
index 1406db4d97d1..41fb427f150a 100644
--- a/net/vmw_vsock/vmci_transport_notify.c
+++ b/net/vmw_vsock/vmci_transport_notify.c
@@ -355,7 +355,7 @@ vmci_transport_notify_pkt_poll_in(struct sock *sk,
 		 * queue. Ask for notifications when there is something to
 		 * read.
 		 */
-		if (sk->sk_state == SS_CONNECTED) {
+		if (sk->sk_state == TCP_ESTABLISHED) {
 			if (!send_waiting_read(sk, 1))
 				return -1;
 
diff --git a/net/vmw_vsock/vmci_transport_notify_qstate.c b/net/vmw_vsock/vmci_transport_notify_qstate.c
index f3a0afc46208..0cc84f2bb05e 100644
--- a/net/vmw_vsock/vmci_transport_notify_qstate.c
+++ b/net/vmw_vsock/vmci_transport_notify_qstate.c
@@ -176,7 +176,7 @@ vmci_transport_notify_pkt_poll_in(struct sock *sk,
 		 * queue. Ask for notifications when there is something to
 		 * read.
 		 */
-		if (sk->sk_state == SS_CONNECTED)
+		if (sk->sk_state == TCP_ESTABLISHED)
 			vsock_block_update_write_window(sk);
 		*data_ready_now = false;
 	}
-- 
2.13.6

* [PATCH v3 4/5] VSOCK: add sock_diag interface
From: Stefan Hajnoczi @ 2017-10-05 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171005204654.2737-1-stefanha@redhat.com>

This patch adds the sock_diag interface for querying sockets from
userspace.  Tools like ss(8) and netstat(8) can use this interface to
list open sockets.

The userspace ABI is defined in <linux/vm_sockets_diag.h> and includes
netlink request and response structs.  The request can query sockets
based on their sk_state (e.g. listening sockets only) and the response
contains socket information fields including the local/remote addresses,
inode number, etc.
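
To illustrate the state filter described above, here is a small
userspace sketch.  It is an illustration only: struct demo_vsock_diag_req
mirrors the layout of struct vsock_diag_req from this patch, and the
demo_* helpers and DEMO_* constants are hypothetical names that exist
only in this example.  The matching rule is the same bit test that
vsock_diag_dump() applies to each socket.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Values copied from include/net/tcp_states.h and from the AF_VSOCK
 * number used in this patch's module alias; stand-ins for the real
 * headers.
 */
enum {
	DEMO_TCP_ESTABLISHED = 1,
	DEMO_TCP_CLOSE = 7,
	DEMO_TCP_LISTEN = 10,
	DEMO_AF_VSOCK = 40,
};

/* Userspace mirror of struct vsock_diag_req (same field order/sizes). */
struct demo_vsock_diag_req {
	uint8_t  sdiag_family;    /* must be AF_VSOCK */
	uint8_t  sdiag_protocol;  /* must be 0 */
	uint16_t pad;             /* must be 0 */
	uint32_t vdiag_states;    /* query bitmap, one bit per sk_state */
	uint32_t vdiag_ino;       /* must be 0 (reserved) */
	uint32_t vdiag_show;      /* must be 0 (reserved) */
	uint32_t vdiag_cookie[2];
};

/* Zero every field, then set the family and the requested state bitmap. */
void demo_req_init(struct demo_vsock_diag_req *req, uint32_t states)
{
	memset(req, 0, sizeof(*req));
	req->sdiag_family = DEMO_AF_VSOCK;
	req->vdiag_states = states;
}

/* Same test the dump loop performs: is the bit for sk_state set? */
bool demo_state_matches(const struct demo_vsock_diag_req *req,
			unsigned int sk_state)
{
	assert(sk_state < 32); /* the bitmap is 32 bits wide */
	return (req->vdiag_states & (UINT32_C(1) << sk_state)) != 0;
}
```

A request for listening and connected sockets would pass
(1 << TCP_LISTEN) | (1 << TCP_ESTABLISHED) as the bitmap; a bitmap of
all ones matches every state.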

This patch does not dump VMCI pending sockets because I have only tested
the virtio transport, which does not use pending sockets.  Support can
be added later by extending vsock_diag_dump() if needed by VMCI users.

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                          |   2 +
 net/vmw_vsock/Makefile               |   3 +
 include/uapi/linux/vm_sockets_diag.h |  33 +++++++
 net/vmw_vsock/diag.c                 | 186 +++++++++++++++++++++++++++++++++++
 net/vmw_vsock/Kconfig                |  10 ++
 5 files changed, 234 insertions(+)
 create mode 100644 include/uapi/linux/vm_sockets_diag.h
 create mode 100644 net/vmw_vsock/diag.c

diff --git a/MAINTAINERS b/MAINTAINERS
index 004816a585b8..039f0ad13482 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14274,6 +14274,8 @@ S:	Maintained
 F:	include/linux/virtio_vsock.h
 F:	include/uapi/linux/virtio_vsock.h
 F:	include/uapi/linux/vsockmon.h
+F:	include/uapi/linux/vm_sockets_diag.h
+F:	net/vmw_vsock/diag.c
 F:	net/vmw_vsock/af_vsock_tap.c
 F:	net/vmw_vsock/virtio_transport_common.c
 F:	net/vmw_vsock/virtio_transport.c
diff --git a/net/vmw_vsock/Makefile b/net/vmw_vsock/Makefile
index e63d574234a9..64afc06805da 100644
--- a/net/vmw_vsock/Makefile
+++ b/net/vmw_vsock/Makefile
@@ -1,4 +1,5 @@
 obj-$(CONFIG_VSOCKETS) += vsock.o
+obj-$(CONFIG_VSOCKETS_DIAG) += vsock_diag.o
 obj-$(CONFIG_VMWARE_VMCI_VSOCKETS) += vmw_vsock_vmci_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS) += vmw_vsock_virtio_transport.o
 obj-$(CONFIG_VIRTIO_VSOCKETS_COMMON) += vmw_vsock_virtio_transport_common.o
@@ -6,6 +7,8 @@ obj-$(CONFIG_HYPERV_VSOCKETS) += hv_sock.o
 
 vsock-y += af_vsock.o af_vsock_tap.o vsock_addr.o
 
+vsock_diag-y += diag.o
+
 vmw_vsock_vmci_transport-y += vmci_transport.o vmci_transport_notify.o \
 	vmci_transport_notify_qstate.o
 
diff --git a/include/uapi/linux/vm_sockets_diag.h b/include/uapi/linux/vm_sockets_diag.h
new file mode 100644
index 000000000000..14cd7dc5a187
--- /dev/null
+++ b/include/uapi/linux/vm_sockets_diag.h
@@ -0,0 +1,33 @@
+/* AF_VSOCK sock_diag(7) interface for querying open sockets */
+
+#ifndef _UAPI__VM_SOCKETS_DIAG_H__
+#define _UAPI__VM_SOCKETS_DIAG_H__
+
+#include <linux/types.h>
+
+/* Request */
+struct vsock_diag_req {
+	__u8	sdiag_family;	/* must be AF_VSOCK */
+	__u8	sdiag_protocol;	/* must be 0 */
+	__u16	pad;		/* must be 0 */
+	__u32	vdiag_states;	/* query bitmap (e.g. 1 << TCP_LISTEN) */
+	__u32	vdiag_ino;	/* must be 0 (reserved) */
+	__u32	vdiag_show;	/* must be 0 (reserved) */
+	__u32	vdiag_cookie[2];
+};
+
+/* Response */
+struct vsock_diag_msg {
+	__u8	vdiag_family;	/* AF_VSOCK */
+	__u8	vdiag_type;	/* SOCK_STREAM or SOCK_DGRAM */
+	__u8	vdiag_state;	/* sk_state (e.g. TCP_LISTEN) */
+	__u8	vdiag_shutdown; /* local RCV_SHUTDOWN | SEND_SHUTDOWN */
+	__u32   vdiag_src_cid;
+	__u32   vdiag_src_port;
+	__u32   vdiag_dst_cid;
+	__u32   vdiag_dst_port;
+	__u32	vdiag_ino;
+	__u32	vdiag_cookie[2];
+};
+
+#endif /* _UAPI__VM_SOCKETS_DIAG_H__ */
diff --git a/net/vmw_vsock/diag.c b/net/vmw_vsock/diag.c
new file mode 100644
index 000000000000..31b567652250
--- /dev/null
+++ b/net/vmw_vsock/diag.c
@@ -0,0 +1,186 @@
+/*
+ * vsock sock_diag(7) module
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or modify it
+ * under the terms of the GNU General Public License as published by the Free
+ * Software Foundation version 2 and no later version.
+ *
+ * This program is distributed in the hope that it will be useful, but WITHOUT
+ * ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
+ * FITNESS FOR A PARTICULAR PURPOSE.  See the GNU General Public License for
+ * more details.
+ */
+
+#include <linux/module.h>
+#include <linux/sock_diag.h>
+#include <linux/vm_sockets_diag.h>
+#include <net/af_vsock.h>
+
+static int sk_diag_fill(struct sock *sk, struct sk_buff *skb,
+			u32 portid, u32 seq, u32 flags)
+{
+	struct vsock_sock *vsk = vsock_sk(sk);
+	struct vsock_diag_msg *rep;
+	struct nlmsghdr *nlh;
+
+	nlh = nlmsg_put(skb, portid, seq, SOCK_DIAG_BY_FAMILY, sizeof(*rep),
+			flags);
+	if (!nlh)
+		return -EMSGSIZE;
+
+	rep = nlmsg_data(nlh);
+	rep->vdiag_family = AF_VSOCK;
+
+	/* Lock order dictates that sk_lock is acquired before
+	 * vsock_table_lock, so we cannot lock here.  Simply don't take
+	 * sk_lock; sk is guaranteed to stay alive since vsock_table_lock is
+	 * held.
+	 */
+	rep->vdiag_type = sk->sk_type;
+	rep->vdiag_state = sk->sk_state;
+	rep->vdiag_shutdown = sk->sk_shutdown;
+	rep->vdiag_src_cid = vsk->local_addr.svm_cid;
+	rep->vdiag_src_port = vsk->local_addr.svm_port;
+	rep->vdiag_dst_cid = vsk->remote_addr.svm_cid;
+	rep->vdiag_dst_port = vsk->remote_addr.svm_port;
+	rep->vdiag_ino = sock_i_ino(sk);
+
+	sock_diag_save_cookie(sk, rep->vdiag_cookie);
+
+	return 0;
+}
+
+static int vsock_diag_dump(struct sk_buff *skb, struct netlink_callback *cb)
+{
+	struct vsock_diag_req *req;
+	struct vsock_sock *vsk;
+	unsigned int bucket;
+	unsigned int last_i;
+	unsigned int table;
+	struct net *net;
+	unsigned int i;
+
+	req = nlmsg_data(cb->nlh);
+	net = sock_net(skb->sk);
+
+	/* State saved between calls: */
+	table = cb->args[0];
+	bucket = cb->args[1];
+	i = last_i = cb->args[2];
+
+	/* TODO VMCI pending sockets? */
+
+	spin_lock_bh(&vsock_table_lock);
+
+	/* Bind table (locally created sockets) */
+	if (table == 0) {
+		while (bucket < ARRAY_SIZE(vsock_bind_table)) {
+			struct list_head *head = &vsock_bind_table[bucket];
+
+			i = 0;
+			list_for_each_entry(vsk, head, bound_table) {
+				struct sock *sk = sk_vsock(vsk);
+
+				if (!net_eq(sock_net(sk), net))
+					continue;
+				if (i < last_i)
+					goto next_bind;
+				if (!(req->vdiag_states & (1 << sk->sk_state)))
+					goto next_bind;
+				if (sk_diag_fill(sk, skb,
+						 NETLINK_CB(cb->skb).portid,
+						 cb->nlh->nlmsg_seq,
+						 NLM_F_MULTI) < 0)
+					goto done;
+next_bind:
+				i++;
+			}
+			last_i = 0;
+			bucket++;
+		}
+
+		table++;
+		bucket = 0;
+	}
+
+	/* Connected table (accepted connections) */
+	while (bucket < ARRAY_SIZE(vsock_connected_table)) {
+		struct list_head *head = &vsock_connected_table[bucket];
+
+		i = 0;
+		list_for_each_entry(vsk, head, connected_table) {
+			struct sock *sk = sk_vsock(vsk);
+
+			/* Skip sockets we've already seen above */
+			if (__vsock_in_bound_table(vsk))
+				continue;
+
+			if (!net_eq(sock_net(sk), net))
+				continue;
+			if (i < last_i)
+				goto next_connected;
+			if (!(req->vdiag_states & (1 << sk->sk_state)))
+				goto next_connected;
+			if (sk_diag_fill(sk, skb,
+					 NETLINK_CB(cb->skb).portid,
+					 cb->nlh->nlmsg_seq,
+					 NLM_F_MULTI) < 0)
+				goto done;
+next_connected:
+			i++;
+		}
+		last_i = 0;
+		bucket++;
+	}
+
+done:
+	spin_unlock_bh(&vsock_table_lock);
+
+	cb->args[0] = table;
+	cb->args[1] = bucket;
+	cb->args[2] = i;
+
+	return skb->len;
+}
+
+static int vsock_diag_handler_dump(struct sk_buff *skb, struct nlmsghdr *h)
+{
+	int hdrlen = sizeof(struct vsock_diag_req);
+	struct net *net = sock_net(skb->sk);
+
+	if (nlmsg_len(h) < hdrlen)
+		return -EINVAL;
+
+	if (h->nlmsg_flags & NLM_F_DUMP) {
+		struct netlink_dump_control c = {
+			.dump = vsock_diag_dump,
+		};
+		return netlink_dump_start(net->diag_nlsk, skb, h, &c);
+	}
+
+	return -EOPNOTSUPP;
+}
+
+static const struct sock_diag_handler vsock_diag_handler = {
+	.family = AF_VSOCK,
+	.dump = vsock_diag_handler_dump,
+};
+
+static int __init vsock_diag_init(void)
+{
+	return sock_diag_register(&vsock_diag_handler);
+}
+
+static void __exit vsock_diag_exit(void)
+{
+	sock_diag_unregister(&vsock_diag_handler);
+}
+
+module_init(vsock_diag_init);
+module_exit(vsock_diag_exit);
+MODULE_LICENSE("GPL");
+MODULE_ALIAS_NET_PF_PROTO_TYPE(PF_NETLINK, NETLINK_SOCK_DIAG,
+			       40 /* AF_VSOCK */);
diff --git a/net/vmw_vsock/Kconfig b/net/vmw_vsock/Kconfig
index a24369d175fd..970f96489fe7 100644
--- a/net/vmw_vsock/Kconfig
+++ b/net/vmw_vsock/Kconfig
@@ -15,6 +15,16 @@ config VSOCKETS
 	  To compile this driver as a module, choose M here: the module
 	  will be called vsock. If unsure, say N.
 
+config VSOCKETS_DIAG
+	tristate "Virtual Sockets monitoring interface"
+	depends on VSOCKETS
+	default y
+	help
+	  Support for PF_VSOCK sockets monitoring interface used by the ss tool.
+	  Enable this module so userspace applications can query open sockets.
+
+	  If unsure, say Y.
+
 config VMWARE_VMCI_VSOCKETS
 	tristate "VMware VMCI transport for Virtual Sockets"
 	depends on VSOCKETS && VMWARE_VMCI
-- 
2.13.6

* [PATCH v3 5/5] VSOCK: add tools/testing/vsock/vsock_diag_test
From: Stefan Hajnoczi @ 2017-10-05 20:46 UTC (permalink / raw)
  To: netdev; +Cc: David S . Miller, Jorgen Hansen, Dexuan Cui, Stefan Hajnoczi
In-Reply-To: <20171005204654.2737-1-stefanha@redhat.com>

This patch adds tests for the vsock_diag.ko module.

These tests are not self-tests because they require manual setup of a
KVM or VMware guest.  Please see tools/testing/vsock/README for
instructions.

The control.h and timeout.h infrastructure can be used for additional
AF_VSOCK tests in the future.
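
As a rough illustration of the line-based handshake that the control
socket provides, here is a self-contained sketch.  It is not the code in
control.c: demo_writeln() and demo_expectln() are hypothetical helpers
that take an explicit file descriptor and report success as a bool, and
their error handling is an assumption made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <unistd.h>

/* Send str followed by a newline.  Returns false on any write error
 * (EINTR handling is omitted to keep the sketch short).
 */
bool demo_writeln(int fd, const char *str)
{
	size_t len = strlen(str);

	while (len > 0) {
		ssize_t ret = write(fd, str, len);

		if (ret < 0)
			return false;
		str += ret;
		len -= (size_t)ret;
	}
	return write(fd, "\n", 1) == 1;
}

/* Read one line (up to 127 bytes, newline stripped) and compare it with
 * the expected string.  Returns false on EOF, read error, or mismatch.
 */
bool demo_expectln(int fd, const char *expected)
{
	char buf[128];
	size_t n = 0;

	assert(expected != NULL);

	while (n < sizeof(buf) - 1) {
		char c;
		ssize_t ret = read(fd, &c, 1);

		if (ret <= 0)
			return false;
		if (c == '\n')
			break;
		buf[n++] = c;
	}
	buf[n] = '\0';

	return strcmp(buf, expected) == 0;
}
```

With these helpers, a server would call demo_writeln(fd, "LISTENING")
once its AF_VSOCK socket is listening, and the client would block in
demo_expectln(fd, "LISTENING") before attempting to connect.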

Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com>
---
 MAINTAINERS                           |   1 +
 tools/testing/vsock/Makefile          |   9 +
 tools/testing/vsock/control.h         |  13 +
 tools/testing/vsock/timeout.h         |  14 +
 tools/testing/vsock/control.c         | 219 +++++++++++
 tools/testing/vsock/timeout.c         |  64 ++++
 tools/testing/vsock/vsock_diag_test.c | 681 ++++++++++++++++++++++++++++++++++
 tools/testing/vsock/.gitignore        |   2 +
 tools/testing/vsock/README            |  36 ++
 9 files changed, 1039 insertions(+)
 create mode 100644 tools/testing/vsock/Makefile
 create mode 100644 tools/testing/vsock/control.h
 create mode 100644 tools/testing/vsock/timeout.h
 create mode 100644 tools/testing/vsock/control.c
 create mode 100644 tools/testing/vsock/timeout.c
 create mode 100644 tools/testing/vsock/vsock_diag_test.c
 create mode 100644 tools/testing/vsock/.gitignore
 create mode 100644 tools/testing/vsock/README

diff --git a/MAINTAINERS b/MAINTAINERS
index 039f0ad13482..5ad7381d92aa 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -14282,6 +14282,7 @@ F:	net/vmw_vsock/virtio_transport.c
 F:	drivers/net/vsockmon.c
 F:	drivers/vhost/vsock.c
 F:	drivers/vhost/vsock.h
+F:	tools/testing/vsock/
 
 VIRTIO CONSOLE DRIVER
 M:	Amit Shah <amit@kernel.org>
diff --git a/tools/testing/vsock/Makefile b/tools/testing/vsock/Makefile
new file mode 100644
index 000000000000..66ba0924194d
--- /dev/null
+++ b/tools/testing/vsock/Makefile
@@ -0,0 +1,9 @@
+all: test
+test: vsock_diag_test
+vsock_diag_test: vsock_diag_test.o timeout.o control.o
+
+CFLAGS += -g -O2 -Werror -Wall -I. -I../../include/uapi -I../../include -Wno-pointer-sign -fno-strict-overflow -fno-strict-aliasing -fno-common -MMD -U_FORTIFY_SOURCE -D_GNU_SOURCE
+.PHONY: all test clean
+clean:
+	${RM} *.o *.d vsock_diag_test
+-include *.d
diff --git a/tools/testing/vsock/control.h b/tools/testing/vsock/control.h
new file mode 100644
index 000000000000..54a07efd267c
--- /dev/null
+++ b/tools/testing/vsock/control.h
@@ -0,0 +1,13 @@
+#ifndef CONTROL_H
+#define CONTROL_H
+
+#include <stdbool.h>
+
+void control_init(const char *control_host, const char *control_port,
+		  bool server);
+void control_cleanup(void);
+void control_writeln(const char *str);
+char *control_readln(void);
+void control_expectln(const char *str);
+
+#endif /* CONTROL_H */
diff --git a/tools/testing/vsock/timeout.h b/tools/testing/vsock/timeout.h
new file mode 100644
index 000000000000..77db9ce9860a
--- /dev/null
+++ b/tools/testing/vsock/timeout.h
@@ -0,0 +1,14 @@
+#ifndef TIMEOUT_H
+#define TIMEOUT_H
+
+enum {
+	/* Default timeout */
+	TIMEOUT = 10 /* seconds */
+};
+
+void sigalrm(int signo);
+void timeout_begin(unsigned int seconds);
+void timeout_check(const char *operation);
+void timeout_end(void);
+
+#endif /* TIMEOUT_H */
diff --git a/tools/testing/vsock/control.c b/tools/testing/vsock/control.c
new file mode 100644
index 000000000000..90fd47f0e422
--- /dev/null
+++ b/tools/testing/vsock/control.c
@@ -0,0 +1,219 @@
+/* Control socket for client/server test execution
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* The client and server may need to coordinate to avoid race conditions like
+ * the client attempting to connect to a socket that the server is not
+ * listening on yet.  The control socket offers a communications channel for
+ * such coordination tasks.
+ *
+ * If the client calls control_expectln("LISTENING"), then it will block until
+ * the server calls control_writeln("LISTENING").  This provides a simple
+ * mechanism for coordinating between the client and the server.
+ */
+
+#include <errno.h>
+#include <netdb.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+
+#include "timeout.h"
+#include "control.h"
+
+static int control_fd = -1;
+
+/* Open the control socket, either in server or client mode */
+void control_init(const char *control_host,
+		  const char *control_port,
+		  bool server)
+{
+	struct addrinfo hints = {
+		.ai_socktype = SOCK_STREAM,
+	};
+	struct addrinfo *result = NULL;
+	struct addrinfo *ai;
+	int ret;
+
+	ret = getaddrinfo(control_host, control_port, &hints, &result);
+	if (ret != 0) {
+		fprintf(stderr, "%s\n", gai_strerror(ret));
+		exit(EXIT_FAILURE);
+	}
+
+	for (ai = result; ai; ai = ai->ai_next) {
+		int fd;
+		int val = 1;
+
+		fd = socket(ai->ai_family, ai->ai_socktype, ai->ai_protocol);
+		if (fd < 0)
+			continue;
+
+		if (!server) {
+			if (connect(fd, ai->ai_addr, ai->ai_addrlen) < 0)
+				goto next;
+			control_fd = fd;
+			printf("Control socket connected to %s:%s.\n",
+			       control_host, control_port);
+			break;
+		}
+
+		if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR,
+			       &val, sizeof(val)) < 0) {
+			perror("setsockopt");
+			exit(EXIT_FAILURE);
+		}
+
+		if (bind(fd, ai->ai_addr, ai->ai_addrlen) < 0)
+			goto next;
+		if (listen(fd, 1) < 0)
+			goto next;
+
+		printf("Control socket listening on %s:%s\n",
+		       control_host, control_port);
+		fflush(stdout);
+
+		control_fd = accept(fd, NULL, NULL);
+		close(fd);
+
+		if (control_fd < 0) {
+			perror("accept");
+			exit(EXIT_FAILURE);
+		}
+		printf("Control socket connection accepted...\n");
+		break;
+
+next:
+		close(fd);
+	}
+
+	if (control_fd < 0) {
+		fprintf(stderr, "Control socket initialization failed.  Invalid address %s:%s?\n",
+			control_host, control_port);
+		exit(EXIT_FAILURE);
+	}
+
+	freeaddrinfo(result);
+}
+
+/* Free resources */
+void control_cleanup(void)
+{
+	close(control_fd);
+	control_fd = -1;
+}
+
+/* Write a line to the control socket */
+void control_writeln(const char *str)
+{
+	ssize_t len = strlen(str);
+	ssize_t ret;
+
+	timeout_begin(TIMEOUT);
+
+	do {
+		ret = send(control_fd, str, len, MSG_MORE);
+		timeout_check("send");
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret != len) {
+		perror("send");
+		exit(EXIT_FAILURE);
+	}
+
+	do {
+		ret = send(control_fd, "\n", 1, 0);
+		timeout_check("send");
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret != 1) {
+		perror("send");
+		exit(EXIT_FAILURE);
+	}
+
+	timeout_end();
+}
+
+/* Return the next line from the control socket (without the trailing newline).
+ *
+ * The program terminates if a timeout occurs.
+ *
+ * The caller must free() the returned string.
+ */
+char *control_readln(void)
+{
+	char *buf = NULL;
+	size_t idx = 0;
+	size_t buflen = 0;
+
+	timeout_begin(TIMEOUT);
+
+	for (;;) {
+		ssize_t ret;
+
+		if (idx >= buflen) {
+			char *new_buf;
+
+			new_buf = realloc(buf, buflen + 80);
+			if (!new_buf) {
+				perror("realloc");
+				exit(EXIT_FAILURE);
+			}
+
+			buf = new_buf;
+			buflen += 80;
+		}
+
+		do {
+			ret = recv(control_fd, &buf[idx], 1, 0);
+			timeout_check("recv");
+		} while (ret < 0 && errno == EINTR);
+
+		if (ret == 0) {
+			fprintf(stderr, "unexpected EOF on control socket\n");
+			exit(EXIT_FAILURE);
+		}
+
+		if (ret != 1) {
+			perror("recv");
+			exit(EXIT_FAILURE);
+		}
+
+		if (buf[idx] == '\n') {
+			buf[idx] = '\0';
+			break;
+		}
+
+		idx++;
+	}
+
+	timeout_end();
+
+	return buf;
+}
+
+/* Wait until a given line is received or a timeout occurs */
+void control_expectln(const char *str)
+{
+	char *line;
+
+	line = control_readln();
+	if (strcmp(str, line) != 0) {
+		fprintf(stderr, "expected \"%s\" on control socket, got \"%s\"\n",
+			str, line);
+		exit(EXIT_FAILURE);
+	}
+
+	free(line);
+}
diff --git a/tools/testing/vsock/timeout.c b/tools/testing/vsock/timeout.c
new file mode 100644
index 000000000000..c49b3003b2db
--- /dev/null
+++ b/tools/testing/vsock/timeout.c
@@ -0,0 +1,64 @@
+/* Timeout API for single-threaded programs that use blocking
+ * syscalls (read/write/send/recv/connect/accept).
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+/* Use the following pattern:
+ *
+ *   timeout_begin(TIMEOUT);
+ *   do {
+ *       ret = accept(...);
+ *       timeout_check("accept");
+ *   } while (ret < 0 && errno == EINTR);
+ *   timeout_end();
+ */
+
+#include <stdlib.h>
+#include <stdbool.h>
+#include <unistd.h>
+#include <stdio.h>
+#include "timeout.h"
+
+static volatile bool timeout;
+
+/* SIGALRM handler function.  Do not use sleep(2), alarm(2), or
+ * setitimer(2) while using this API - they may interfere with each
+ * other.
+ */
+void sigalrm(int signo)
+{
+	timeout = true;
+}
+
+/* Start a timeout.  Call timeout_check() to verify that the timeout hasn't
+ * expired.  timeout_end() must be called to stop the timeout.  Timeouts cannot
+ * be nested.
+ */
+void timeout_begin(unsigned int seconds)
+{
+	alarm(seconds);
+}
+
+/* Exit with an error message if the timeout has expired */
+void timeout_check(const char *operation)
+{
+	if (timeout) {
+		fprintf(stderr, "%s timed out\n", operation);
+		exit(EXIT_FAILURE);
+	}
+}
+
+/* Stop a timeout */
+void timeout_end(void)
+{
+	alarm(0);
+	timeout = false;
+}
diff --git a/tools/testing/vsock/vsock_diag_test.c b/tools/testing/vsock/vsock_diag_test.c
new file mode 100644
index 000000000000..e896a4af52f4
--- /dev/null
+++ b/tools/testing/vsock/vsock_diag_test.c
@@ -0,0 +1,681 @@
+/*
+ * vsock_diag_test - vsock_diag.ko test suite
+ *
+ * Copyright (C) 2017 Red Hat, Inc.
+ *
+ * Author: Stefan Hajnoczi <stefanha@redhat.com>
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License
+ * as published by the Free Software Foundation; version 2
+ * of the License.
+ */
+
+#include <getopt.h>
+#include <stdio.h>
+#include <stdbool.h>
+#include <stdlib.h>
+#include <string.h>
+#include <errno.h>
+#include <unistd.h>
+#include <signal.h>
+#include <sys/socket.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <linux/list.h>
+#include <linux/net.h>
+#include <linux/netlink.h>
+#include <linux/sock_diag.h>
+#include <netinet/tcp.h>
+
+#include "../../../include/uapi/linux/vm_sockets.h"
+#include "../../../include/uapi/linux/vm_sockets_diag.h"
+
+#include "timeout.h"
+#include "control.h"
+
+enum test_mode {
+	TEST_MODE_UNSET,
+	TEST_MODE_CLIENT,
+	TEST_MODE_SERVER
+};
+
+/* Per-socket status */
+struct vsock_stat {
+	struct list_head list;
+	struct vsock_diag_msg msg;
+};
+
+static const char *sock_type_str(int type)
+{
+	switch (type) {
+	case SOCK_DGRAM:
+		return "DGRAM";
+	case SOCK_STREAM:
+		return "STREAM";
+	default:
+		return "INVALID TYPE";
+	}
+}
+
+static const char *sock_state_str(int state)
+{
+	switch (state) {
+	case TCP_CLOSE:
+		return "UNCONNECTED";
+	case TCP_SYN_SENT:
+		return "CONNECTING";
+	case TCP_ESTABLISHED:
+		return "CONNECTED";
+	case TCP_CLOSING:
+		return "DISCONNECTING";
+	case TCP_LISTEN:
+		return "LISTEN";
+	default:
+		return "INVALID STATE";
+	}
+}
+
+static const char *sock_shutdown_str(int shutdown)
+{
+	switch (shutdown) {
+	case 1:
+		return "RCV_SHUTDOWN";
+	case 2:
+		return "SEND_SHUTDOWN";
+	case 3:
+		return "RCV_SHUTDOWN | SEND_SHUTDOWN";
+	default:
+		return "0";
+	}
+}
+
+static void print_vsock_addr(FILE *fp, unsigned int cid, unsigned int port)
+{
+	if (cid == VMADDR_CID_ANY)
+		fprintf(fp, "*:");
+	else
+		fprintf(fp, "%u:", cid);
+
+	if (port == VMADDR_PORT_ANY)
+		fprintf(fp, "*");
+	else
+		fprintf(fp, "%u", port);
+}
+
+static void print_vsock_stat(FILE *fp, struct vsock_stat *st)
+{
+	print_vsock_addr(fp, st->msg.vdiag_src_cid, st->msg.vdiag_src_port);
+	fprintf(fp, " ");
+	print_vsock_addr(fp, st->msg.vdiag_dst_cid, st->msg.vdiag_dst_port);
+	fprintf(fp, " %s %s %s %u\n",
+		sock_type_str(st->msg.vdiag_type),
+		sock_state_str(st->msg.vdiag_state),
+		sock_shutdown_str(st->msg.vdiag_shutdown),
+		st->msg.vdiag_ino);
+}
+
+static void print_vsock_stats(FILE *fp, struct list_head *head)
+{
+	struct vsock_stat *st;
+
+	list_for_each_entry(st, head, list)
+		print_vsock_stat(fp, st);
+}
+
+static struct vsock_stat *find_vsock_stat(struct list_head *head, int fd)
+{
+	struct vsock_stat *st;
+	struct stat stat;
+
+	if (fstat(fd, &stat) < 0) {
+		perror("fstat");
+		exit(EXIT_FAILURE);
+	}
+
+	list_for_each_entry(st, head, list)
+		if (st->msg.vdiag_ino == stat.st_ino)
+			return st;
+
+	fprintf(stderr, "cannot find fd %d\n", fd);
+	exit(EXIT_FAILURE);
+}
+
+static void check_no_sockets(struct list_head *head)
+{
+	if (!list_empty(head)) {
+		fprintf(stderr, "expected no sockets\n");
+		print_vsock_stats(stderr, head);
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void check_num_sockets(struct list_head *head, int expected)
+{
+	struct list_head *node;
+	int n = 0;
+
+	list_for_each(node, head)
+		n++;
+
+	if (n != expected) {
+		fprintf(stderr, "expected %d sockets, found %d\n",
+			expected, n);
+		print_vsock_stats(stderr, head);
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void check_socket_state(struct vsock_stat *st, __u8 state)
+{
+	if (st->msg.vdiag_state != state) {
+		fprintf(stderr, "expected socket state %#x, got %#x\n",
+			state, st->msg.vdiag_state);
+		exit(EXIT_FAILURE);
+	}
+}
+
+static void send_req(int fd)
+{
+	struct sockaddr_nl nladdr = {
+		.nl_family = AF_NETLINK,
+	};
+	struct {
+		struct nlmsghdr nlh;
+		struct vsock_diag_req vreq;
+	} req = {
+		.nlh = {
+			.nlmsg_len = sizeof(req),
+			.nlmsg_type = SOCK_DIAG_BY_FAMILY,
+			.nlmsg_flags = NLM_F_REQUEST | NLM_F_DUMP,
+		},
+		.vreq = {
+			.sdiag_family = AF_VSOCK,
+			.vdiag_states = ~(__u32)0,
+		},
+	};
+	struct iovec iov = {
+		.iov_base = &req,
+		.iov_len = sizeof(req),
+	};
+	struct msghdr msg = {
+		.msg_name = &nladdr,
+		.msg_namelen = sizeof(nladdr),
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+	};
+
+	for (;;) {
+		if (sendmsg(fd, &msg, 0) < 0) {
+			if (errno == EINTR)
+				continue;
+
+			perror("sendmsg");
+			exit(EXIT_FAILURE);
+		}
+
+		return;
+	}
+}
+
+static ssize_t recv_resp(int fd, void *buf, size_t len)
+{
+	struct sockaddr_nl nladdr = {
+		.nl_family = AF_NETLINK,
+	};
+	struct iovec iov = {
+		.iov_base = buf,
+		.iov_len = len,
+	};
+	struct msghdr msg = {
+		.msg_name = &nladdr,
+		.msg_namelen = sizeof(nladdr),
+		.msg_iov = &iov,
+		.msg_iovlen = 1,
+	};
+	ssize_t ret;
+
+	do {
+		ret = recvmsg(fd, &msg, 0);
+	} while (ret < 0 && errno == EINTR);
+
+	if (ret < 0) {
+		perror("recvmsg");
+		exit(EXIT_FAILURE);
+	}
+
+	return ret;
+}
+
+static void add_vsock_stat(struct list_head *sockets,
+			   const struct vsock_diag_msg *resp)
+{
+	struct vsock_stat *st;
+
+	st = malloc(sizeof(*st));
+	if (!st) {
+		perror("malloc");
+		exit(EXIT_FAILURE);
+	}
+
+	st->msg = *resp;
+	list_add_tail(&st->list, sockets);
+}
+
+/*
+ * Read vsock stats into a list.
+ */
+static void read_vsock_stat(struct list_head *sockets)
+{
+	long buf[8192 / sizeof(long)];
+	int fd;
+
+	fd = socket(AF_NETLINK, SOCK_RAW, NETLINK_SOCK_DIAG);
+	if (fd < 0) {
+		perror("socket");
+		exit(EXIT_FAILURE);
+	}
+
+	send_req(fd);
+
+	for (;;) {
+		const struct nlmsghdr *h;
+		ssize_t ret;
+
+		ret = recv_resp(fd, buf, sizeof(buf));
+		if (ret == 0)
+			goto done;
+		if (ret < sizeof(*h)) {
+			fprintf(stderr, "short read of %zd bytes\n", ret);
+			exit(EXIT_FAILURE);
+		}
+
+		h = (struct nlmsghdr *)buf;
+
+		while (NLMSG_OK(h, ret)) {
+			if (h->nlmsg_type == NLMSG_DONE)
+				goto done;
+
+			if (h->nlmsg_type == NLMSG_ERROR) {
+				const struct nlmsgerr *err = NLMSG_DATA(h);
+
+				if (h->nlmsg_len < NLMSG_LENGTH(sizeof(*err)))
+					fprintf(stderr, "NLMSG_ERROR\n");
+				else {
+					errno = -err->error;
+					perror("NLMSG_ERROR");
+				}
+
+				exit(EXIT_FAILURE);
+			}
+
+			if (h->nlmsg_type != SOCK_DIAG_BY_FAMILY) {
+				fprintf(stderr, "unexpected nlmsg_type %#x\n",
+					h->nlmsg_type);
+				exit(EXIT_FAILURE);
+			}
+			if (h->nlmsg_len <
+			    NLMSG_LENGTH(sizeof(struct vsock_diag_msg))) {
+				fprintf(stderr, "short vsock_diag_msg\n");
+				exit(EXIT_FAILURE);
+			}
+
+			add_vsock_stat(sockets, NLMSG_DATA(h));
+
+			h = NLMSG_NEXT(h, ret);
+		}
+	}
+
+done:
+	close(fd);
+}
+
+static void free_sock_stat(struct list_head *sockets)
+{
+	struct vsock_stat *st;
+	struct vsock_stat *next;
+
+	list_for_each_entry_safe(st, next, sockets, list)
+		free(st);
+}
+
+static void test_no_sockets(unsigned int peer_cid)
+{
+	LIST_HEAD(sockets);
+
+	read_vsock_stat(&sockets);
+
+	check_no_sockets(&sockets);
+
+	free_sock_stat(&sockets);
+}
+
+static void test_listen_socket_server(unsigned int peer_cid)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	LIST_HEAD(sockets);
+	struct vsock_stat *st;
+	int fd;
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	read_vsock_stat(&sockets);
+
+	check_num_sockets(&sockets, 1);
+	st = find_vsock_stat(&sockets, fd);
+	check_socket_state(st, TCP_LISTEN);
+
+	close(fd);
+	free_sock_stat(&sockets);
+}
+
+static void test_connect_client(unsigned int peer_cid)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = peer_cid,
+		},
+	};
+	int fd;
+	int ret;
+	LIST_HEAD(sockets);
+	struct vsock_stat *st;
+
+	control_expectln("LISTENING");
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	timeout_begin(TIMEOUT);
+	do {
+		ret = connect(fd, &addr.sa, sizeof(addr.svm));
+		timeout_check("connect");
+	} while (ret < 0 && errno == EINTR);
+	timeout_end();
+
+	if (ret < 0) {
+		perror("connect");
+		exit(EXIT_FAILURE);
+	}
+
+	read_vsock_stat(&sockets);
+
+	check_num_sockets(&sockets, 1);
+	st = find_vsock_stat(&sockets, fd);
+	check_socket_state(st, TCP_ESTABLISHED);
+
+	control_expectln("DONE");
+	control_writeln("DONE");
+
+	close(fd);
+	free_sock_stat(&sockets);
+}
+
+static void test_connect_server(unsigned int peer_cid)
+{
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} addr = {
+		.svm = {
+			.svm_family = AF_VSOCK,
+			.svm_port = 1234,
+			.svm_cid = VMADDR_CID_ANY,
+		},
+	};
+	union {
+		struct sockaddr sa;
+		struct sockaddr_vm svm;
+	} clientaddr;
+	socklen_t clientaddr_len = sizeof(clientaddr.svm);
+	LIST_HEAD(sockets);
+	struct vsock_stat *st;
+	int fd;
+	int client_fd;
+
+	fd = socket(AF_VSOCK, SOCK_STREAM, 0);
+
+	if (bind(fd, &addr.sa, sizeof(addr.svm)) < 0) {
+		perror("bind");
+		exit(EXIT_FAILURE);
+	}
+
+	if (listen(fd, 1) < 0) {
+		perror("listen");
+		exit(EXIT_FAILURE);
+	}
+
+	control_writeln("LISTENING");
+
+	timeout_begin(TIMEOUT);
+	do {
+		client_fd = accept(fd, &clientaddr.sa, &clientaddr_len);
+		timeout_check("accept");
+	} while (client_fd < 0 && errno == EINTR);
+	timeout_end();
+
+	if (client_fd < 0) {
+		perror("accept");
+		exit(EXIT_FAILURE);
+	}
+	if (clientaddr.sa.sa_family != AF_VSOCK) {
+		fprintf(stderr, "expected AF_VSOCK from accept(2), got %d\n",
+			clientaddr.sa.sa_family);
+		exit(EXIT_FAILURE);
+	}
+	if (clientaddr.svm.svm_cid != peer_cid) {
+		fprintf(stderr, "expected peer CID %u from accept(2), got %u\n",
+			peer_cid, clientaddr.svm.svm_cid);
+		exit(EXIT_FAILURE);
+	}
+
+	read_vsock_stat(&sockets);
+
+	check_num_sockets(&sockets, 2);
+	find_vsock_stat(&sockets, fd);
+	st = find_vsock_stat(&sockets, client_fd);
+	check_socket_state(st, TCP_ESTABLISHED);
+
+	control_writeln("DONE");
+	control_expectln("DONE");
+
+	close(client_fd);
+	close(fd);
+	free_sock_stat(&sockets);
+}
+
+static struct {
+	const char *name;
+	void (*run_client)(unsigned int peer_cid);
+	void (*run_server)(unsigned int peer_cid);
+} test_cases[] = {
+	{
+		.name = "No sockets",
+		.run_server = test_no_sockets,
+	},
+	{
+		.name = "Listen socket",
+		.run_server = test_listen_socket_server,
+	},
+	{
+		.name = "Connect",
+		.run_client = test_connect_client,
+		.run_server = test_connect_server,
+	},
+	{},
+};
+
+static void init_signals(void)
+{
+	struct sigaction act = {
+		.sa_handler = sigalrm,
+	};
+
+	sigaction(SIGALRM, &act, NULL);
+	signal(SIGPIPE, SIG_IGN);
+}
+
+static unsigned int parse_cid(const char *str)
+{
+	char *endptr = NULL;
+	unsigned long int n;
+
+	errno = 0;
+	n = strtoul(str, &endptr, 10);
+	if (errno || *endptr != '\0') {
+		fprintf(stderr, "malformed CID \"%s\"\n", str);
+		exit(EXIT_FAILURE);
+	}
+	return n;
+}
+
+static const char optstring[] = "";
+static const struct option longopts[] = {
+	{
+		.name = "control-host",
+		.has_arg = required_argument,
+		.val = 'H',
+	},
+	{
+		.name = "control-port",
+		.has_arg = required_argument,
+		.val = 'P',
+	},
+	{
+		.name = "mode",
+		.has_arg = required_argument,
+		.val = 'm',
+	},
+	{
+		.name = "peer-cid",
+		.has_arg = required_argument,
+		.val = 'p',
+	},
+	{
+		.name = "help",
+		.has_arg = no_argument,
+		.val = '?',
+	},
+	{},
+};
+
+static void usage(void)
+{
+	fprintf(stderr, "Usage: vsock_diag_test [--help] [--control-host=<host>] --control-port=<port> --mode=client|server --peer-cid=<cid>\n"
+		"\n"
+		"  Server: vsock_diag_test --control-port=1234 --mode=server --peer-cid=3\n"
+		"  Client: vsock_diag_test --control-host=192.168.0.1 --control-port=1234 --mode=client --peer-cid=2\n"
+		"\n"
+		"Run vsock_diag.ko tests.  Must be launched in both\n"
+		"guest and host.  One side must use --mode=client and\n"
+		"the other side must use --mode=server.\n"
+		"\n"
+		"A TCP control socket connection is used to coordinate tests\n"
+		"between the client and the server.  The server requires a\n"
+		"listen address and the client requires an address to\n"
+		"connect to.\n"
+		"\n"
+		"The CID of the other side must be given with --peer-cid=<cid>.\n");
+	exit(EXIT_FAILURE);
+}
+
+int main(int argc, char **argv)
+{
+	const char *control_host = NULL;
+	const char *control_port = NULL;
+	int mode = TEST_MODE_UNSET;
+	unsigned int peer_cid = VMADDR_CID_ANY;
+	int i;
+
+	init_signals();
+
+	for (;;) {
+		int opt = getopt_long(argc, argv, optstring, longopts, NULL);
+
+		if (opt == -1)
+			break;
+
+		switch (opt) {
+		case 'H':
+			control_host = optarg;
+			break;
+		case 'm':
+			if (strcmp(optarg, "client") == 0)
+				mode = TEST_MODE_CLIENT;
+			else if (strcmp(optarg, "server") == 0)
+				mode = TEST_MODE_SERVER;
+			else {
+				fprintf(stderr, "--mode must be \"client\" or \"server\"\n");
+				return EXIT_FAILURE;
+			}
+			break;
+		case 'p':
+			peer_cid = parse_cid(optarg);
+			break;
+		case 'P':
+			control_port = optarg;
+			break;
+		case '?':
+		default:
+			usage();
+		}
+	}
+
+	if (!control_port)
+		usage();
+	if (mode == TEST_MODE_UNSET)
+		usage();
+	if (peer_cid == VMADDR_CID_ANY)
+		usage();
+
+	if (!control_host) {
+		if (mode != TEST_MODE_SERVER)
+			usage();
+		control_host = "0.0.0.0";
+	}
+
+	control_init(control_host, control_port, mode == TEST_MODE_SERVER);
+
+	for (i = 0; test_cases[i].name; i++) {
+		void (*run)(unsigned int peer_cid);
+
+		printf("%s...", test_cases[i].name);
+		fflush(stdout);
+
+		if (mode == TEST_MODE_CLIENT)
+			run = test_cases[i].run_client;
+		else
+			run = test_cases[i].run_server;
+
+		if (run)
+			run(peer_cid);
+
+		printf("ok\n");
+	}
+
+	control_cleanup();
+	return EXIT_SUCCESS;
+}
diff --git a/tools/testing/vsock/.gitignore b/tools/testing/vsock/.gitignore
new file mode 100644
index 000000000000..dc5f11faf530
--- /dev/null
+++ b/tools/testing/vsock/.gitignore
@@ -0,0 +1,2 @@
+*.d
+vsock_diag_test
diff --git a/tools/testing/vsock/README b/tools/testing/vsock/README
new file mode 100644
index 000000000000..2cc6d7302db6
--- /dev/null
+++ b/tools/testing/vsock/README
@@ -0,0 +1,36 @@
+AF_VSOCK test suite
+-------------------
+These tests exercise net/vmw_vsock/ host<->guest sockets for VMware, KVM, and
+Hyper-V.
+
+The following tests are available:
+
+  * vsock_diag_test - vsock_diag.ko module for listing open sockets
+
+The following prerequisite steps are not automated and must be performed prior
+to running tests:
+
+1. Build the kernel and these tests.
+2. Install the kernel and tests on the host.
+3. Install the kernel and tests inside the guest.
+4. Boot the guest and ensure that the AF_VSOCK transport is enabled.
+
+Invoke test binaries in both directions as follows:
+
+  # host=server, guest=client
+  (host)# $TEST_BINARY --mode=server \
+                       --control-port=1234 \
+                       --peer-cid=3
+  (guest)# $TEST_BINARY --mode=client \
+                        --control-host=$HOST_IP \
+                        --control-port=1234 \
+                        --peer-cid=2
+
+  # host=client, guest=server
+  (guest)# $TEST_BINARY --mode=server \
+                        --control-port=1234 \
+                        --peer-cid=2
+  (host)# $TEST_BINARY --mode=client \
+                       --control-host=$GUEST_IP \
+                       --control-port=1234 \
+                       --peer-cid=3
-- 
2.13.6

^ permalink raw reply related

* Re: [PATCH v6 01/11] dt-bindings: net: Restore sun8i dwmac binding
From: Rob Herring @ 2017-10-05 20:59 UTC (permalink / raw)
  To: Corentin Labbe
  Cc: mark.rutland-5wv7dgnIgG8,
	maxime.ripard-wi1+55ScJUtKEb57/3fJTNBPR1lH4CV8, wens-jdAy2FN1RRM,
	linux-I+IVW8TIWO2tmTQ+vhA3Yw, catalin.marinas-5wv7dgnIgG8,
	will.deacon-5wv7dgnIgG8, peppe.cavallaro-qxv4g6HH51o,
	alexandre.torgue-qxv4g6HH51o, andrew-g2DYL2Zd6BY,
	f.fainelli-Re5JQEeQqe8AvxtiuMwx3w,
	frowand.list-Re5JQEeQqe8AvxtiuMwx3w,
	netdev-u79uwXL29TY76Z2rM5mHXA, devicetree-u79uwXL29TY76Z2rM5mHXA,
	linux-arm-kernel-IAPFreCvJWM7uuMidbF8XUB+6BGkLq7r,
	linux-kernel-u79uwXL29TY76Z2rM5mHXA,
	linux-sunxi-/JYPxA39Uh5TLH3MbocFFw
In-Reply-To: <20170927073414.17361-2-clabbe.montjoie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>

On Wed, Sep 27, 2017 at 09:34:04AM +0200, Corentin Labbe wrote:
> This patch restore dt-bindings documentation about dwmac-sun8i
> This reverts commit 8aa33ec2f481 ("dt-bindings: net: Revert sun8i dwmac binding")

What's missing here is why you are reverting.

> 
> Signed-off-by: Corentin Labbe <clabbe.montjoie-Re5JQEeQqe8AvxtiuMwx3w@public.gmane.org>
> ---
>  .../devicetree/bindings/net/dwmac-sun8i.txt        | 84 ++++++++++++++++++++++
>  1 file changed, 84 insertions(+)
>  create mode 100644 Documentation/devicetree/bindings/net/dwmac-sun8i.txt

Otherwise,

Acked-by: Rob Herring <robh-DgEjT+Ai2ygdnm+yROfE0A@public.gmane.org>

^ permalink raw reply

* Re: [Intel-wired-lan] [PATCH] PCI: Check/Set ARI capability before setting numVFs
From: Bjorn Helgaas @ 2017-10-05 21:07 UTC (permalink / raw)
  To: Alexander Duyck
  Cc: Tony Nguyen, linux-pci@vger.kernel.org,
	linux-kernel@vger.kernel.org, intel-wired-lan, Netdev,
	Bjorn Helgaas
In-Reply-To: <CAKgT0UfL6sASMFfUbN7qkJxAMEU9BwJgMJeFmHRZsk1fhVYzZQ@mail.gmail.com>

On Wed, Oct 04, 2017 at 04:29:14PM -0700, Alexander Duyck wrote:
> On Wed, Oct 4, 2017 at 4:01 PM, Bjorn Helgaas <helgaas@kernel.org> wrote:
> > On Wed, Oct 04, 2017 at 08:52:58AM -0700, Tony Nguyen wrote:
> >> This fixes a bug that can occur if an AER error is encountered while SRIOV
> >> devices are present.
> >>
> >> This issue was seen by doing the following. Inject an AER error to a device
> >> that has SRIOV devices.  After the device has recovered, remove the driver.
> >> Reload the driver and enable SRIOV which causes the following crash to
> >> occur:
> >>
> >> kernel BUG at drivers/pci/iov.c:157!
> >> invalid opcode: 0000 [#1] SMP
> >> CPU: 36 PID: 2295 Comm: bash Not tainted 4.14.0-rc1+ #74
> >> Hardware name: Supermicro X9DAi/X9DAi, BIOS 3.0a 04/29/2014
> >> task: ffff9fa41cd45a00 task.stack: ffffb4b2036e8000
> >> RIP: 0010:pci_iov_add_virtfn+0x2eb/0x350
> >> RSP: 0018:ffffb4b2036ebcb8 EFLAGS: 00010286
> >> RAX: 00000000fffffff0 RBX: ffff9fa42c1c8800 RCX: ffff9fa421ce2388
> >> RDX: 00000000df900000 RSI: ffff9fa8214fb388 RDI: 00000000df903fff
> >> RBP: ffffb4b2036ebd18 R08: ffff9fa421ce23b8 R09: ffffb4b2036ebc2c
> >> R10: ffff9fa42c1a5548 R11: 000000000000058e R12: ffff9fa8214fb000
> >> R13: ffff9fa42c1a5000 R14: ffff9fa8214fb388 R15: 0000000000000000
> >> FS:  00007f60724b6700(0000) GS:ffff9fa82f300000(0000)
> >> knlGS:0000000000000000
> >> CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 0000559eca8b0f40 CR3: 0000000864146000 CR4: 00000000001606e0
> >> Call Trace:
> >>  pci_enable_sriov+0x353/0x440
> >>  ixgbe_pci_sriov_configure+0xd5/0x1f0 [ixgbe]
> >>  sriov_numvfs_store+0xf7/0x170
> >>  dev_attr_store+0x18/0x30
> >>  sysfs_kf_write+0x37/0x40
> >>  kernfs_fop_write+0x120/0x1b0
> >>  __vfs_write+0x37/0x170
> >>  ? __alloc_fd+0x3f/0x170
> >>  ? set_close_on_exec+0x30/0x70
> >>  vfs_write+0xb5/0x1a0
> >>  SyS_write+0x55/0xc0
> >>  entry_SYSCALL_64_fastpath+0x1a/0xa5
> >> RIP: 0033:0x7f6071bafc20
> >> RSP: 002b:00007ffe7d42ba48 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> >> RAX: ffffffffffffffda RBX: 0000559eca8b0f30 RCX: 00007f6071bafc20
> >> RDX: 0000000000000002 RSI: 0000559eca961f60 RDI: 0000000000000001
> >> RBP: 00007f6071e78ae0 R08: 00007f6071e7a740 R09: 00007f60724b6700
> >> R10: 0000000000000073 R11: 0000000000000246 R12: 0000000000000000
> >> R13: 0000000000000000 R14: 0000000000000000 R15: 0000559eca892170
> >> RIP: pci_iov_add_virtfn+0x2eb/0x350 RSP: ffffb4b2036ebcb8
> >>
> >> The occurs since during AER recovery the ARI Capable Hierarchy bit,
> >> which can affect the values for First VF Offset and VF Stride, is not set
> >> until after pci_iov_set_numvfs() is called.
> >
> > Can you elaborate on where exactly this happens?  The only place we
> > explicitly set PCI_SRIOV_CTRL_ARI is in sriov_init(), which is only
> > called at enumeration-time.  So I'm guessing you're talking about this
> > path:
> >
> >   ixgbe_io_slot_reset
> >     pci_restore_state
> >       pci_restore_iov_state
> >         sriov_restore_state
> >           pci_iov_set_numvfs
> >
> > where we don't set PCI_SRIOV_CTRL_ARI at all.  The fact that you say
> > PCI_SRIOV_CTRL_ARI isn't set until *after* pci_iov_set_numvfs() is
> > called suggests that it is being set *somewhere*, but I don't know
> > where.
> 
> The ARI bit is initialized in sriov_init, stored in iov->ctrl, and
> restored in sriov_restore_state, but it occurs in the line after the
> call to pci_iov_set_numvfs.
> 
> The problem is you don't want to write the full iov->ctrl value until
> after you have reset the the number of VFs since it will set VFE so
> pulling out and configuring the ARI value separately is needed.

Doh, that should have been obvious to me ;)

> >> This can cause the iov
> >> structure to be populated with values that are incorrect if the bit is
> >> later set.   Check and set this bit, if needed, before calling
> >> pci_iov_set_numvfs() so that the values being populated properly take
> >> the ARI bit into account.
> >>
> >> CC: Alexander Duyck <alexander.h.duyck@intel.com>
> >> CC: Emil Tantilov <emil.s.tantilov@intel.com>
> >> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
> >> ---
> >>  drivers/pci/iov.c | 4 ++++
> >>  1 file changed, 4 insertions(+)
> >>
> >> diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
> >> index 7492a65..a8896c7 100644
> >> --- a/drivers/pci/iov.c
> >> +++ b/drivers/pci/iov.c
> >> @@ -497,6 +497,10 @@ static void sriov_restore_state(struct pci_dev *dev)
> >>       if (ctrl & PCI_SRIOV_CTRL_VFE)
> >>               return;
> >>
> >> +     if ((iov->ctrl & PCI_SRIOV_CTRL_ARI) && !(ctrl & PCI_SRIOV_CTRL_ARI))
> >> +             pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL,
> >> +                                   ctrl | PCI_SRIOV_CTRL_ARI);

This looks a little fiddly and also assumes that we only ever need to
*set* PCI_SRIOV_CTRL_ARI.  That's likely the case because it's
probably cleared after reset and during resume.  But I'm not *sure*
that's always the case, so what do you think about the proposal below?
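
To make the difference concrete, here is a small userspace sketch (not
kernel code) comparing the set-only approach with copying the saved bit.
The function names are made up for illustration, and the
PCI_SRIOV_CTRL_ARI value is restated here on the assumption that it
matches include/uapi/linux/pci_regs.h:

```c
#include <assert.h>
#include <stdint.h>

/* Assumed to match include/uapi/linux/pci_regs.h */
#define PCI_SRIOV_CTRL_ARI 0x10 /* ARI Capable Hierarchy */

/* Set-only variant (hypothetical helper): never clears a stale bit. */
uint16_t ari_set_only(uint16_t hw_ctrl, uint16_t saved_ctrl)
{
	if ((saved_ctrl & PCI_SRIOV_CTRL_ARI) &&
	    !(hw_ctrl & PCI_SRIOV_CTRL_ARI))
		hw_ctrl |= PCI_SRIOV_CTRL_ARI;
	return hw_ctrl;
}

/* Copy variant (hypothetical helper): the ARI bit follows the saved
 * value in both directions; all other bits are left untouched.
 */
uint16_t ari_copy_saved(uint16_t hw_ctrl, uint16_t saved_ctrl)
{
	hw_ctrl &= ~PCI_SRIOV_CTRL_ARI;
	hw_ctrl |= saved_ctrl & PCI_SRIOV_CTRL_ARI;
	return hw_ctrl;
}

/* Self-check that can be called from a test main(). */
void ari_selfcheck(void)
{
	/* Both variants set the bit when it was saved as set. */
	assert(ari_set_only(0x00, 0x10) == 0x10);
	assert(ari_copy_saved(0x00, 0x10) == 0x10);

	/* Only the copy variant clears a bit that was saved as clear. */
	assert(ari_set_only(0x10, 0x00) == 0x10);
	assert(ari_copy_saved(0x10, 0x00) == 0x00);
}
```

The two only differ when the hardware has the bit set and the saved
value has it clear, which is the case I'm not sure can never happen.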

> >>       for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
> >>               pci_update_resource(dev, i);
> >>


commit 95594dedd443e42ab0c16b9fba0109e955e7be13
Author: Tony Nguyen <anthony.l.nguyen@intel.com>
Date:   Wed Oct 4 08:52:58 2017 -0700

    PCI: Restore "ARI Capable Hierarchy" before setting numVFs
    
    In the restore path, we previously read PCI_SRIOV_VF_OFFSET and
    PCI_SRIOV_VF_STRIDE before restoring PCI_SRIOV_CTRL_ARI, which
    affects the offset and stride:
    
      pci_restore_state
        pci_restore_iov_state
          sriov_restore_state
            pci_iov_set_numvfs
              pci_read_config_word(... PCI_SRIOV_VF_OFFSET, &iov->offset)
            pci_write_config_word(... PCI_SRIOV_CTRL, iov->ctrl)
    
    The effect is that suspend/resume and AER recovery, which use
    pci_restore_state(), may corrupt iov->offset and iov->stride.  The iov
    state is associated with the device, not the driver, so if we reload the
    driver, it will use the corrupted data, which may cause crashes like
    this:
    
      kernel BUG at drivers/pci/iov.c:157!
      RIP: 0010:pci_iov_add_virtfn+0x2eb/0x350
      Call Trace:
       pci_enable_sriov+0x353/0x440
       ixgbe_pci_sriov_configure+0xd5/0x1f0 [ixgbe]
       sriov_numvfs_store+0xf7/0x170
       dev_attr_store+0x18/0x30
       sysfs_kf_write+0x37/0x40
       kernfs_fop_write+0x120/0x1b0
       vfs_write+0xb5/0x1a0
       SyS_write+0x55/0xc0
    
    This occurs since during AER recovery the ARI Capable Hierarchy bit, which
    can affect the values for First VF Offset and VF Stride, is not set until
    after pci_iov_set_numvfs() is called.  This can cause the iov structure to
    be populated with values that are incorrect if the bit is later set.
    Check and set this bit, if needed, before calling pci_iov_set_numvfs() so
    that the values being populated properly take the ARI bit into account.
    
    Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
    [bhelgaas: changelog, add comment, also clear ARI if necessary]
    Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
    CC: Alexander Duyck <alexander.h.duyck@intel.com>
    CC: Emil Tantilov <emil.s.tantilov@intel.com>

diff --git a/drivers/pci/iov.c b/drivers/pci/iov.c
index ce24cf235f01..6bacb8995e96 100644
--- a/drivers/pci/iov.c
+++ b/drivers/pci/iov.c
@@ -498,6 +498,14 @@ static void sriov_restore_state(struct pci_dev *dev)
 	if (ctrl & PCI_SRIOV_CTRL_VFE)
 		return;
 
+	/*
+	 * Restore PCI_SRIOV_CTRL_ARI before pci_iov_set_numvfs() because
+	 * it reads offset & stride, which depend on PCI_SRIOV_CTRL_ARI.
+	 */
+	ctrl &= ~PCI_SRIOV_CTRL_ARI;
+	ctrl |= iov->ctrl & PCI_SRIOV_CTRL_ARI;
+	pci_write_config_word(dev, iov->pos + PCI_SRIOV_CTRL, ctrl);
+
 	for (i = PCI_IOV_RESOURCES; i <= PCI_IOV_RESOURCE_END; i++)
 		pci_update_resource(dev, i);
 

^ permalink raw reply related

* Re: [PATCH net-next 5/5] bpf: write back the verifier log buffer as it gets filled
From: Daniel Borkmann @ 2017-10-05 21:10 UTC (permalink / raw)
  To: Jakub Kicinski, netdev; +Cc: alexei.starovoitov, oss-drivers
In-Reply-To: <20171005153422.8947-6-jakub.kicinski@netronome.com>

On 10/05/2017 05:34 PM, Jakub Kicinski wrote:
> Verifier log buffer can be quite large (up to 16MB currently).
> As Eric Dumazet points out if we allow multiple verification
> requests to proceed simultaneously, malicious user may use the
> verifier as a way of allocating large amounts of unswappable
> memory to OOM the host.
>
> Switch to a strategy of allocating a smaller buffer (a page)
> and writing it out into the user buffer whenever it fills up.
> To simplify the code assume that prints will never be longer
> than 1024 bytes.
>
> This is in preparation for the global verifier lock removal.
>
> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
> Reviewed-by: Simon Horman <simon.horman@netronome.com>

Set looks good in general, thanks for working on this! Just two
comments further below.

> ---
>   include/linux/bpf_verifier.h |  7 +++--
>   kernel/bpf/verifier.c        | 64 +++++++++++++++++++++++++++++++-------------
>   2 files changed, 50 insertions(+), 21 deletions(-)
>
> diff --git a/include/linux/bpf_verifier.h b/include/linux/bpf_verifier.h
> index 598802dd1897..c0f0e210c3f8 100644
> --- a/include/linux/bpf_verifier.h
> +++ b/include/linux/bpf_verifier.h
> @@ -140,10 +140,13 @@ struct bpf_verifier_env {
>   	bool seen_direct_write;
>   	struct bpf_insn_aux_data *insn_aux_data; /* array of per-insn state */
>
> -	u32 log_level;
> +	char __user *log_ubuf;
> +	u32 log_usize;
> +	u32 log_ulen;
> +	char *log_buf;
>   	u32 log_size;
>   	u32 log_len;
> -	char *log_buf;
> +	u32 log_level;

Small request: given we'd now have log_{level,ubuf,usize,ulen,buf,size,len}
in struct bpf_verifier_env, could we abstract that a bit e.g. into something
like struct bpf_verifier_log, which has level and kbuf and ubuf as members
of which {k,u}buf would be something like struct bpf_verifier_buf with three
members (mem or buf, len_total, len_used) or such. I think most of patch 1
is on passing env into verbose, so likely wouldn't be too much change required
for this, but would be nice to make that a bit more structured if we need to
touch it anyway.
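
The grouping suggested above could look roughly like the sketch below. The
names struct bpf_verifier_log and struct bpf_verifier_buf and the members
mem, len_total and len_used follow the suggestion; the u32 typedef and the
helper bpf_verifier_buf_space() are assumptions added only so the fragment
builds outside the kernel tree. It is not taken from any posted patch.

```c
#include <stdint.h>

/* Stand-in for the kernel's u32 so the sketch builds in userspace. */
typedef uint32_t u32;

/* One buffer together with its capacity and fill level. */
struct bpf_verifier_buf {
	char *mem;	/* start of the buffer */
	u32 len_total;	/* capacity in bytes */
	u32 len_used;	/* bytes already written */
};

/* All log-related state in one place instead of seven loose fields. */
struct bpf_verifier_log {
	u32 level;
	struct bpf_verifier_buf kbuf;	/* kernel-side staging buffer */
	struct bpf_verifier_buf ubuf;	/* user-supplied destination; real
					 * kernel code would need a __user
					 * annotated pointer here */
};

/* Hypothetical helper: bytes still free in a buffer. */
static inline u32 bpf_verifier_buf_space(const struct bpf_verifier_buf *buf)
{
	return buf->len_total - buf->len_used;
}
```

With this layout, verbose() would only need a pointer to the log structure
rather than to the whole env.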

>   };
>
[...]
>
>   		ret = -ENOMEM;
> -		env->log_buf = vmalloc(env->log_size);
> +		env->log_buf = page_address(alloc_page(GFP_USER));

alloc_page() can return NULL. If I read this correctly, page_address()
cannot handle NULL and would try to deref it, no? Am I missing something?
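
A minimal sketch of the ordering being asked for, written as a userspace
model rather than kernel code. fake_alloc_page() and fake_page_address() are
hypothetical stand-ins for alloc_page() and page_address(); the only point is
that the allocation result is tested before it is turned into an address.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct page. */
struct fake_page {
	char data[4096];
};

/* Stand-in for alloc_page(): returns NULL when 'fail' is non-zero. */
static struct fake_page *fake_alloc_page(int fail)
{
	return fail ? NULL : malloc(sizeof(struct fake_page));
}

/* Stand-in for page_address(): must not be called with NULL. */
static char *fake_page_address(struct fake_page *page)
{
	return page->data;
}

/* Returns the log buffer, or NULL if the allocation failed. */
static char *alloc_log_buf(int fail)
{
	struct fake_page *page = fake_alloc_page(fail);

	if (!page)
		return NULL;	/* checked before any address conversion */
	return fake_page_address(page);
}
```

In the kernel, __get_free_page(GFP_USER) might be another way to get the same
effect, since it returns 0 on failure; that is an aside here, not something
proposed in this thread.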

>   		if (!env->log_buf)
>   			goto err_unlock;
> +		env->log_size = PAGE_SIZE;
>   	}
[...]

Thanks,
Daniel

^ permalink raw reply

* Re: [next-queue PATCH v4 3/4] net/sched: Introduce Credit Based Shaper (CBS) qdisc
From: Jiri Pirko @ 2017-10-05 21:15 UTC (permalink / raw)
  To: Vinicius Costa Gomes
  Cc: netdev, intel-wired-lan, jhs, xiyou.wangcong, andre.guedes,
	ivan.briano, jesus.sanchez-palencia, boon.leong.ong,
	richardcochran, henrik, levipearson, rodney.cummings
In-Reply-To: <877ew9jrkh.fsf@intel.com>

Thu, Oct 05, 2017 at 09:57:34PM CEST, vinicius.gomes@intel.com wrote:
>Hi Jiri,
>
>Jiri Pirko <jiri@resnulli.us> writes:
>
>> Wed, Oct 04, 2017 at 02:28:30AM CEST, vinicius.gomes@intel.com wrote:
>>>This queueing discipline implements the shaper algorithm defined by
>>>the 802.1Q-2014 Section 8.6.8.2 and detailed in Annex L.
>>>
>>>Its primary usage is to apply some bandwidth reservation to user
>>>defined traffic classes, which are mapped to different queues via the
>>>mqprio qdisc.
>>>
>>>Initially, it only supports offloading the traffic shaping work to
>>>supporting controllers.
>>>
>>>Later, when a software implementation is added, the current dependency
>>>on being installed "under" mqprio can be lifted.
>>>
>>>Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com>
>>>Signed-off-by: Jesus Sanchez-Palencia <jesus.sanchez-palencia@intel.com>
>>>---
>>> include/linux/netdevice.h      |   1 +
>>> include/net/pkt_sched.h        |   9 ++
>>> include/uapi/linux/pkt_sched.h |  17 ++++
>>> net/sched/Kconfig              |  11 ++
>>> net/sched/Makefile             |   1 +
>>> net/sched/sch_cbs.c            | 225 +++++++++++++++++++++++++++++++++++++++++
>>> 6 files changed, 264 insertions(+)
>>> create mode 100644 net/sched/sch_cbs.c
>>>
>>>diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
>>>index e1d6ef130611..b8798adc214f 100644
>>>--- a/include/linux/netdevice.h
>>>+++ b/include/linux/netdevice.h
>>>@@ -775,6 +775,7 @@ enum tc_setup_type {
>>> 	TC_SETUP_CLSFLOWER,
>>> 	TC_SETUP_CLSMATCHALL,
>>> 	TC_SETUP_CLSBPF,
>>>+	TC_SETUP_CBS,
>>
>> Please split this into 2 patches. One will introduce the new qdisc,
>> second will add offload capabilities.
>>
>
>Of course.
>
>> [...]
>>
>>
>>>+static struct Qdisc_ops cbs_qdisc_ops __read_mostly = {
>>>+	.next		=	NULL,
>>>+	.id		=	"cbs",
>>>+	.priv_size	=	sizeof(struct cbs_sched_data),
>>>+	.enqueue	=	cbs_enqueue,
>>>+	.dequeue	=	qdisc_dequeue_head,
>>>+	.peek		=	qdisc_peek_dequeued,
>>>+	.init		=	cbs_init,
>>>+	.reset		=	qdisc_reset_queue,
>>>+	.destroy	=	cbs_destroy,
>>>+	.change		=	cbs_change,
>>>+	.dump		=	cbs_dump,
>>>+	.owner		=	THIS_MODULE,
>>>+};
>>
>> I don't see a software implementation for this. Looks like you are
>> trying to abuse the tc subsystem to bypass the kernel. Could you please
>> explain this? The golden rule is: implement in kernel, then offload.
>
>The reason was that we didn't have a use case for the software
>implementation right now, it would be added in a later series.

The policy is very strict, SW implementation first, HW implementation later.



>
>But as that was requested (and it makes sense), I will add it for the
>next version of this series (it is already written, just need to test it
>better).

Good.
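
For readers wondering what a software fallback has to track, here is a rough
model of credit accounting for a credit-based shaper. It follows the general
idleSlope/sendSlope description of the shaper in 802.1Q, not the code in this
series; struct cbs_model and every function name below are hypothetical, and
credits are kept in bits.

```c
#include <stdbool.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL

/* Hypothetical per-queue state for the model. */
struct cbs_model {
	int64_t credits;	/* current credit, in bits */
	int64_t idle_slope;	/* bits/s gained while a frame waits */
	int64_t send_slope;	/* bits/s, negative: idle_slope - port rate */
	int64_t hi_credit;	/* upper bound on credits */
	int64_t lo_credit;	/* lower bound on credits */
};

static int64_t clamp64(int64_t v, int64_t lo, int64_t hi)
{
	return v < lo ? lo : (v > hi ? hi : v);
}

/* A frame waited for wait_ns: credit grows at idle_slope. */
static void cbs_model_wait(struct cbs_model *q, int64_t wait_ns)
{
	q->credits = clamp64(q->credits + q->idle_slope * wait_ns / NSEC_PER_SEC,
			     q->lo_credit, q->hi_credit);
}

/* The queue transmitted for tx_ns: credit drains at send_slope. */
static void cbs_model_transmit(struct cbs_model *q, int64_t tx_ns)
{
	q->credits = clamp64(q->credits + q->send_slope * tx_ns / NSEC_PER_SEC,
			     q->lo_credit, q->hi_credit);
}

/* Transmission is only allowed while credit is non-negative. */
static bool cbs_model_may_send(const struct cbs_model *q)
{
	return q->credits >= 0;
}
```

The arithmetic is trivial; the hard part for a software version is making
these decisions at the right time, which is the concern raised elsewhere in
this thread.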



>
>
>Cheers,
>--
>Vinicius

^ permalink raw reply

* Re: [PATCH net-next] ip_gre: check packet length and mtu correctly in erspan_fb_xmit
From: William Tu @ 2017-10-05 21:20 UTC (permalink / raw)
  To: David Laight; +Cc: netdev@vger.kernel.org, Xin Long
In-Reply-To: <063D6719AE5E284EB5DD2968C1650D6DD008AC42@AcuExch.aculab.com>

On Thu, Oct 5, 2017 at 6:59 AM, David Laight <David.Laight@aculab.com> wrote:
> From: William Tu
>> Sent: 05 October 2017 01:14
>> Similar to the earlier patch for erspan_xmit(), on an ARPHRD_ETHER device
>> skb->len is the length of the whole ether packet.  So the comparison
>> should subtract dev->hard_header_len from skb->len.
>>
>> Fixes: 1a66a836da63 ("gre: add collect_md mode to ERSPAN tunnel")
>> Signed-off-by: William Tu <u9012063@gmail.com>
>> Cc: Xin Long <lucien.xin@gmail.com>
>> ---
>>  net/ipv4/ip_gre.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/ip_gre.c b/net/ipv4/ip_gre.c
>> index b279c325c7f6..10b21fe5b3a6 100644
>> --- a/net/ipv4/ip_gre.c
>> +++ b/net/ipv4/ip_gre.c
>> @@ -579,7 +579,7 @@ static void erspan_fb_xmit(struct sk_buff *skb, struct net_device *dev,
>>       if (gre_handle_offloads(skb, false))
>>               goto err_free_rt;
>>
>> -     if (skb->len > dev->mtu) {
>> +     if (skb->len - dev->hard_header_len > dev->mtu) {
>
> Can you guarantee that skb->len > dev->hard_header_len?
> It is probably safer to check skb->len > dev->hard_header_len + dev->mtu
> since that addition isn't going to overflow.
Sure, I will fix it.

>
>>               pskb_trim(skb, dev->mtu);
>>               truncate = true;
>
> Is that pskb_trim() now truncating to the correct size?

You're right, I should now truncate to (dev->mtu + dev->hard_header_len).
Thanks
William
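
The two points above can be shown with a small standalone sketch. The
function names are made up for illustration, and the parameter types are an
assumption chosen to mirror unsigned kernel fields (unsigned int for the
packet length and MTU, unsigned short for the header length).

```c
#include <stdbool.h>

/* Subtracting form: wraps to a huge value when len < hard_header_len. */
static bool exceeds_mtu_sub(unsigned int len, unsigned short hard_header_len,
			    unsigned int mtu)
{
	return len - hard_header_len > mtu;
}

/* Adding form: no wraparound for realistic header and MTU sizes. */
static bool exceeds_mtu_add(unsigned int len, unsigned short hard_header_len,
			    unsigned int mtu)
{
	return len > hard_header_len + mtu;
}

/* Length to trim to once the link-layer header is accounted for. */
static unsigned int trim_len(unsigned short hard_header_len, unsigned int mtu)
{
	return mtu + hard_header_len;
}
```

With a 10-byte packet, a 14-byte header and an MTU of 1500, the subtracting
form reports that the packet is too long, while the adding form does not.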

^ permalink raw reply

* Re: [next-queue PATCH v4 3/4] net/sched: Introduce Credit Based Shaper (CBS) qdisc
From: Levi Pearson @ 2017-10-05 21:23 UTC (permalink / raw)
  To: David Miller
  Cc: Linux Kernel Network Developers, Rodney Cummings, Jiri Pirko,
	Vinicius Costa Gomes, intel-wired-lan, Jamal Hadi Salim,
	Cong Wang, andre.guedes, Ivan Briano, jesus.sanchez-palencia,
	boon.leong.ong, richardcochran, Henrik Austad
In-Reply-To: <20171005.120508.2267452751875787466.davem@davemloft.net>

(apologies to davem for the repeat; I accidentally did a reply vs.
reply-all the first time)

On Thu, Oct 5, 2017 at 1:05 PM, David Miller <davem@davemloft.net> wrote:
> From: Rodney Cummings <rodney.cummings@ni.com>
> Date: Thu, 5 Oct 2017 18:41:48 +0000
>
>> The IEEE Std 802.1Q specs for credit-based shaper require precise transmit decisions
>> within a 125 microsecond window of time.
>>
>> Even with the Preempt RT patch or similar enhancements, that isn't very practical
>> as software-only. I doubt that software would conform to the standard's
>> requirements.
>>
>> This is analogous to memory, or CPU.
>
> I feel like this is looking for an excuse to not have to at least try to implement
> the software version of CBS.

I don't understand why you attribute this to excuse-making. Is the
objection due to the fact that the user interface is provided through
a qdisc module? In that case, is there a better configuration
interface for setting up traffic shaping registers that could be used
across all the NICs that provide the capability? There are quite a
number of them now, and the lack of kernel interfaces to the hardware
makes coordinating the userspace effort to support the protocols far
more difficult than it needs to be.

As a contrasting example, look at the DCB shaping functionality,
provided by the ETS shaper. It's specified in 802.1Q right next to the
CBS shaper. It has no software implementation in a qdisc module as far
as I can tell (although it should be less resource-intensive to
implement), yet there's a whole netlink protocol for configuring it. I
don't think it makes sense to tack on the dcb netlink interface to
every driver that implements Qav; most don't have the DCB shapers, and
the user-level control protocol for FQTSS is SRP instead of DCB's LLDP
extensions, so completely different userspace tools would be required
as well.

I just want a simple, standard interface for configuring some fairly
common and IEEE-standard hardware features related to AVB/TSN traffic
shaping. Do we need our own netlink protocol for TSN configuration? It
seems to be massive overkill for an interface to write a single
register, but I suppose it might also be used for configuring TSN
parameters in local switch devices, such as Qbv windows, which need
quite a bit more information. I would be happy to do some of the work,
but I'd like an idea of what kind of interface would be acceptable
before writing up an RFC implementation.

Levi

^ permalink raw reply
